ESM Preprocessing Gallery – check

Check computation

Packages: dplyr, ggplot2, naniar

After computing a new score, it is important to check that the computation worked as intended. To do so, we can go back to the topics of step 1 of the framework. In addition, if you allowed missing values to be removed (usually with ‘na.rm=TRUE’) in order to compute a time-varying score, you may want to know how many items/observations were used on average in the computation.

Missing values when computing scores

With ‘na.rm=TRUE’ (see below), rows that contain missing values still receive a score, but that score is computed from fewer items.

# Row-wise mean of the PA items, ignoring missing values (NaN if all three items are missing)
data$PA_mean = rowMeans(data[,c("PA1","PA2", "PA3")], na.rm=TRUE)
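To make the effect concrete, here is a minimal illustration on made-up values. The data frame ‘toy’ and its contents are hypothetical and only serve to show how the number of items behind each mean changes from row to row.

```r
# Hypothetical toy data: three PA items, four rows with increasing missingness
toy <- data.frame(
  PA1 = c(4, 4, NA, NA),
  PA2 = c(2, NA, NA, NA),
  PA3 = c(3, 3, 5, NA)
)

# Row-wise mean ignoring missing values, as in the computation above
toy$PA_mean <- rowMeans(toy[, c("PA1", "PA2", "PA3")], na.rm = TRUE)

# Row 1 is based on three items, row 2 on two items, row 3 on one item;
# row 4 has no item left, so its mean is NaN
print(toy)
```

Rows 1 to 3 all receive a valid-looking score even though they rest on different numbers of items.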

One solution is to require a minimum number of non-missing items per row before computing the score. First, we check how many items were actually used when computing the mean of the positive affect items. To do so, we count the non-missing items per row among the items of interest and plot these counts in a histogram.

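library(dplyr)
library(ggplot2)

# TRUE where an item was answered; despite its name, PA_missing holds the number of non-missing items per row
df_na = !is.na(data[,c("PA1","PA2", "PA3")])
data$PA_missing = rowSums(df_na)

data %>% 
    ggplot(aes(x=PA_missing)) +
        geom_histogram(binwidth=1)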
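As a numeric complement to the histogram, the per-row item counts can also be tabulated and averaged. The sketch below is self-contained and uses a hypothetical data frame ‘toy’; with the real data, the same calls could be applied to the count column created above.

```r
# Hypothetical toy data standing in for the PA items
toy <- data.frame(
  PA1 = c(4, 4, NA, NA, 1),
  PA2 = c(2, NA, NA, NA, 2),
  PA3 = c(3, 3, 5, NA, 3)
)

# Number of non-missing (i.e. used) items per row
n_items <- rowSums(!is.na(toy[, c("PA1", "PA2", "PA3")]))

# How many rows have 0, 1, 2 or 3 items available
item_table <- table(n_items)
print(item_table)

# Average number of items behind a score, over rows with at least one item
mean_items_used <- mean(n_items[n_items > 0])
print(mean_items_used)
```

Rows with zero items are excluded from the average here because no score can be computed for them; whether to exclude them is a choice, not a rule.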

From the plot above, we see that the majority of rows have either no missing values (all 3 items used) or all values missing (0 items used) for the items PA1 to PA3. Nonetheless, many rows have 1 or 2 missing values among those items. We can further investigate which combinations of items are missing within a row, since it can be informative to see whether a specific combination of missing values occurs more often in the data. We use the function gg_miss_upset() from the naniar package (see the first missingness analysis section for how to interpret this plot).

library(naniar)
gg_miss_upset(data[,c("PA1","PA2","PA3")])

Based on the plot above, missingness appears to be rather independent across the variables.
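The minimum-item rule mentioned earlier could be applied along the following lines. This is only a sketch under stated assumptions: the toy data, the threshold ‘min_items = 2’ and the column name ‘PA_mean_min2’ are hypothetical choices for illustration, not values prescribed by the framework.

```r
# Hypothetical toy data standing in for the PA items
toy <- data.frame(
  PA1 = c(4, 4, NA, NA),
  PA2 = c(2, NA, NA, NA),
  PA3 = c(3, 3, 5, NA)
)
items <- c("PA1", "PA2", "PA3")

# Assumed threshold: at least 2 of the 3 items must be answered
min_items <- 2

# Count the answered items per row
n_items <- rowSums(!is.na(toy[, items]))

# Keep the mean only where enough items are available, otherwise set NA
toy$PA_mean_min2 <- ifelse(
  n_items >= min_items,
  rowMeans(toy[, items], na.rm = TRUE),
  NA_real_
)

print(toy)
```

Which threshold is appropriate depends on the scale and the study design; the value 2 used here is purely illustrative.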