Missingness correlates

Packages: dplyr, tidyr, ggplot2, ggpubr


Missingness is fundamentally associated with ESM designs because of the burden of such studies. Then, it may be explained by predictors, such as age or experimental condition. Plots can help in this investigation. Particularly, we can investigate different levels of the data:

  • Beep-level: does the likelihood of completing a beep response at time t+1 correlate with a time-variant factor at time t?
  • Day-level: does the number of beeps fulfilled in a day correlate with predictors?
  • Person-level: does the number of valid observations over the study correlate with participants variable (e.g., age)?

For each, we can look at the influence of categorical (e.g., ‘location’) or continuous (e.g., self-reported items such as ‘PA1’) variables.

Importantly, here, ‘missingness’ is interpreted as instances where observations are designated as invalid (see flag valid observation section). Consequently, the definition of an invalid observation might be more or less strickly related to missingness stricto sensu, depending on the definition you use. Note that we only investigate descriptive statistics, which does not allow us to draw strong conclusions on predictors.

Beep-level

The main question is: does the likelihood of completing a beep response at time t+1 correlate with a time-variant factor at time t, such as experiencing positive affects (for instance, feeling happy)? To have insights on this question, our first step is to lead the valid variables. In other words, we delay from one row the valid variable to have the information of the next beep at time t+1. Not that missing values in the ‘next_valid’ variable are due to the last beep of the participant, which does not have a following beep. Hence, we can filter out these cases.

data_beep = data %>%
    group_by(id) %>%
    mutate(next_valid = lead(valid)) %>%
    filter(!is.na(next_valid)) %>%
    ungroup()

Then, we can plot the proportion of valid answers at time t+1 in function of:

  • a categorical variable: we will use the ‘location’ variable wich different labels from ‘A’ to ‘D’.
  • a continuous variable: we will use the ‘PA1’ and ‘PA3’ variables, which are self-reported items. Here, we will create bins using the custom function ‘create_bins’ to have a better visualization of the data.

create_bins <- function(x, n_bins) {
  range_max <- max(x, na.rm = TRUE)
  range_min <- min(x, na.rm = TRUE)
  bin_width <- ceiling((range_max - range_min) / n_bins)
  
  breaks <- seq(range_min, range_max, by = bin_width)
  if (tail(breaks, n=1) != range_max) {
      breaks <- c(breaks, range_max + 1)
  }
  
  labels <- paste("[", head(breaks, -1), ":", tail(breaks, -1) - 1, "]", sep="")
  categories <- cut(x, breaks = breaks, labels = labels, include.lowest = TRUE, right = FALSE)
  return(categories)
}

n_bins = 15 # Number of bins
data_beep %>% 
    filter(valid==1) %>% 
    gather(var, val, PA1, PA2) %>%
    mutate(val_cat = create_bins(val, n = n_bins)) %>% # Customize bins using the argument n 
    ggplot(aes(x=val_cat, fill=factor(next_valid))) +
        geom_bar(position = "fill") +
        facet_wrap(var~.) +
        theme(axis.text.x = element_text(angle = 90))

In the above plot, it looks like that higher values of PA2 might be linked to higher number of invalid answer the next beep.


Day-level

At a day level, our focus shifts to knowing if some predictors have an effect on the number of valid observations per day and per participant. In this context, not only do we can consider time-variant predictors, but also time-invariant predictors (e.g., age). We must first aggregate our data on a daily basis and compute the number of valid observations for each participant per day. In addition, we must compute the aggregate scores of each time-varying variables of interest to later explore their potentiel influence.

df_day = data %>%
    filter(valid==1) %>%
    group_by(id, daycum, age, cond_perso) %>%
    summarise(nb_valid = sum(valid),
              PA1_mean = mean(PA1, na.rm=TRUE),
              PA2_mean = mean(PA2, na.rm=TRUE))

Then, we can plot this information in function of:

  • a categorical variables: for instance the ‘cond_perso’ variable, which is a time-invariant variable.
  • a continuous variable: for instance the ‘PA1_mean’ and ‘PA3_mean’ variables, which are self-reported items aggregated at a daily level.

df_day %>% gather(var, val, PA1_mean, PA2_mean) %>%
  ggplot(aes(x=val, y=nb_valid))+
    geom_smooth(method=lm) +
    geom_point(alpha=.2) +
    facet_wrap(var~.) +
    stat_cor(p.accuracy = 0.001, r.accuracy = 0.01)

Above, we can see that higher values of ‘PA1_mean’ seems linked to lower number of valid observations per day.


Participant-level

Finally, we can also be interested in knowing if time-invariant variables (e.g., ‘age’) can explain the total number of valid observations per participant. We must first aggregate the data at a participant level and compute the number of valid observations. In addition, we must compute the aggregate score of time-varying or time-invariant variables.

df_person = data %>%
    filter(valid==1) %>%
    group_by(id, age, cond_perso) %>%
    summarise(nb_valid = sum(valid),
              PA1_mean = mean(PA1, na.rm=TRUE),
              PA2_mean = mean(PA2, na.rm=TRUE))

Then, we can plot this information, again, for continuous vs. categorical predictors:

df_person %>% gather(var, val, PA1_mean, PA2_mean) %>%
  ggplot(aes(x=val, y=nb_valid))+
    geom_smooth(method=lm) +
    geom_point(alpha=.3) +
    facet_wrap(var~.) +
    stat_cor(p.accuracy = 0.001, r.accuracy = 0.01)

Above, we can see that higher values of ‘PA1_mean’ and ‘PA2_mean’ seem linked to lower total number of valid observations per participant.

Instead of focusing solely on counting the number of valid beeps, an alternative approach involves examining the number of days a participant actively engaged in the study. To do so, we first need to compute ‘nb_day’ for each participant, which is the number of unique days the participant took part in the study. Then, we can plot this information in function of continuous and categorical predictors.

df_person = data %>%
    filter(valid==1) %>%
    group_by(id, age, cond_perso) %>%
    summarise(nb_day = length(unique(daycum)),
              PA1_mean = mean(PA1, na.rm=TRUE),
              PA2_mean = mean(PA2, na.rm=TRUE))

df_person %>% gather(var, val, PA1_mean, PA2_mean) %>%
  ggplot(aes(x=val, y=nb_day))+
    geom_smooth(method=lm) +
    geom_point(alpha=.4) +
    facet_wrap(var~.) +
    stat_cor(p.accuracy = 0.001, r.accuracy = 0.01)