ESM Preprocessing Gallery – check_sampling

Check sampling scheme

The following set of R codes helps to check that the sampling scheme have been well followed by the application/server in charge of sending the beeps to the participant. The general idea is to check that the beeps have been sent in the rigth time interval. To do so, the function need to be adapted to the sampling scheme in hand.
The dificulty to adapt such a function will be determined by the type of sampling scheme:

Time-contingent sampling (a)
Interval-contingent sampling (a)
Event-contingent sampling (b)
Burst-contingent sampling (c)
Signal-contingent sampling
etc.

In addition, the sampling can remain the same (i.e., consistent) or change (i.e., unconsistent) over the study period. Nonetheless, it can be a difficult task especially when the sampling in unconsistent and/or when the sampling is a mix of event-based and interval based sampling.

(a) Check time and interval-contingent sampling

Packages: dplyr, ???

Because both time and interval-contingent sampling sent beeps in an interval of time. They can be checked with the same method. The steps are the following:

Recreate the sampling scheme based on study information
Extract the sampling scheme from the true data
Compare both and extract the results
Investigate the results

Recreate sampling scheme

Here, you will need to change the following variables:

duration_study: the number of day the study last
week_day: the week day numbers the beeps could be sent. We are using a string and separated the day number by a comma. Remind that in R: 1=Sunday and 7=Saturday.
obsnoday: number of beeps in a day
sched_sent: the starting time in a day of the periods when the beeps could be sent. We recommend to set up the time one before the actual time. Indeed, because server timing, the beeps could be few seconds before the actual time.
sched_int_sent (in minutes): the period of time
sched_int_start (in minutes): the period of time in which the participant can start the survey.
int_end (in minutes): the period of time in which the participant can finish the survey the survey.

If your sampling scheme is inconsistent over the study duration, you can specify 2+ sampling scheme with the following line of codes and then merge them in a unique dataframe.

duration_study = 14
week_day = "1,2,3,4,5,6,7"
obsnoday = 5
obsno_max = duration_study * obsnoday

sched_sent = hms::as_hms(c("09:59:00", "10:59:00", "11:59:00", "12:59:00", "13:59:00"))
int_sent = 11
sched_int_sent = hms::as_hms(sched_sent + int_sent * 60)
int_start = 21 
sched_int_start = hms::as_hms(sched_sent + int_start * 20)
int_end = 60
sched_int_end = hms::as_hms(sched_sent + int_end * 60)

df_test = expand.grid(daycum=c(1:duration_study), obsnoday=c(1:obsnoday), week_day=week_day)
df_scheme_day = data.frame(obsnoday=1:obsnoday, sched_sent=sched_sent, sched_int_sent=sched_int_sent, sched_int_start=sched_int_start, sched_int_end=sched_int_end)
df_test = left_join(df_test, df_scheme_day, by=c("obsnoday")) %>%
    arrange(daycum, obsnoday) %>% 
    mutate(obsno = 1:n())

Extract sampling from the dataframe

To extract the sampling scheme, we need the following variables: id number, daycum, obsnoday, sent time, start time and end time.

# Extract sampling scheme
df_sampling = data.frame(id = data$id, 
                         daycum = as.numeric(data$daycum), obsnoday=data$obsnoday, 
                         sent_date = data$sent, sent_time=hms::as_hms(data$sent), start_time=hms::as_hms(data$start), end_time=hms::as_hms(data$end))

# Compute obsno
df_sampling = df_sampling %>% 
    arrange(id,daycum,obsnoday) %>%
    group_by(id) %>%
    mutate(obsno=1:n())

Compare sampling schemes

To check the consistency between the two sampling scheme, we will test if a observation in the true sampling scheme matched with a beep in the recreated sampling scheme. We can considere different outputs:

beep_match_ok: if it matched a beep from the recreated sampling scheme
beep_matched: the obsno of the matched beep
sent_time_ok: if the sent time was in the interval defined by the matched beep.
start_time_ok: if the start time was in the interval defined by the matched beep.
end_time_ok: if the end time was in the interval defined by the matched beep.

We need to base the comparison on variables. We have two choices:

day number and sent time variables
day number and obsnoday variables

Choose one of the two methods in function of your dataset properties, knowing that the variables use as basis must be without mistakes!

df_sampling_1 = as.data.frame(df_sampling)
df_sampling_1$beep_match_ok = 0
df_sampling_1$beep_matched = 0
df_sampling_1$sent_time_ok = 0
df_sampling_1$start_time_ok = 0
df_sampling_1$end_time_ok = 0

for(row in 1:nrow(df_sampling_1)){

    day_cum = df_sampling_1[row, "daycum"]
    # obsnoday = df_sampling_1[row, "obsnoday"]

    daycum_test = df_test$daycum == day_cum
    pos = which(daycum_test & df_test$sched_sent < as_hms(df_sampling_1[row, "sent_date"]) & df_test$sched_int_sent > as_hms(df_sampling_1[row, "sent_date"]))
    
    if (length(pos) == 1){
        df_sampling_1[row, "beep_match_ok"] = 1
        df_sampling_1[row, "beep_matched"] = df_test[pos, "obsno"]

        df_sampling_1[row,"sent_time_ok"] = ifelse(df_sampling_1[row,"sent_time"] >= df_test[pos,"sched_sent"] & df_sampling_1[row,"sent_time"] <= df_test[pos,"sched_int_sent"], 
                                                    1, 0)

        df_sampling_1[row,"start_time_ok"] = ifelse(df_sampling_1[row,"start_time"] >= df_test[pos,"sched_sent"] & df_sampling_1[row,"start_time"] <= df_test[pos,"sched_int_end"], 
                                                    1, 0)

        df_sampling_1[row,"end_time_ok"] = ifelse(df_sampling_1[row,"end_time"] >= df_test[pos, "sched_sent"] & df_sampling_1[row,"end_time"] <= df_test[pos, "sched_int_end"], 
                                                    1, 0)
    } 
}

Checking sampling

Now that the comparison have ran, we can analysis the different outputs.

First we will look for beep in the recreated sampling scheme that have not matched any observation in the true sampling scheme.

# Extract which obsno have not been sent(?)
obsno_split = split(df_sampling_2$beep_matched, df_sampling_2$id)
obsno_missed = lapply(obsno_split, function(x) c(1:obsno_max)[!c(1:obsno_max) %in% x])
df_missed = data.frame()
for (i in 1:length(obsno_missed)){
    id_ = names(obsno_split)[i]
    pos = which(df_test$obsno %in% obsno_missed[[i]])
    if (length(pos) > 0){
        new_df = cbind(data.frame(id=id_), df_test[pos,])
        df_missed = rbind(df_missed, new_df)
    }
}

df_missed

data frame with 0 columns and 0 rows

Beeps not matched in the sampling scheme

beep_not_matched = which(is.na(df_sampling_2$beep_matched))
data[beep_not_matched,]

# A tibble: 0 × 20
# Groups:   id, daycum [0]
# ℹ 20 variables: dyad <dbl>, role <int>, obsno <int>, id <dbl>, age <int>,
#   cond_dyad <chr>, scheduled <dttm>, sent <dttm>, start <dttm>, end <dttm>,
#   contact <int>, PA1 <int>, PA2 <int>, PA3 <int>, NA1 <int>, NA2 <int>,
#   NA3 <int>, location <chr>, daycum <drtn>, obsnoday <int>

Beeps that have matched but sent outside of the time interval

sent_out = which(is.na(df_sampling_2$sent_time_ok))
data[sent_out,]

# A tibble: 0 × 20
# Groups:   id, daycum [0]
# ℹ 20 variables: dyad <dbl>, role <int>, obsno <int>, id <dbl>, age <int>,
#   cond_dyad <chr>, scheduled <dttm>, sent <dttm>, start <dttm>, end <dttm>,
#   contact <int>, PA1 <int>, PA2 <int>, PA3 <int>, NA1 <int>, NA2 <int>,
#   NA3 <int>, location <chr>, daycum <drtn>, obsnoday <int>

Beeps that have matched but start outside of the time interval

start_out = which(is.na(df_sampling_2$start_time_ok))
data[start_out,]

# A tibble: 521 × 20
# Groups:   id, daycum [253]
    dyad  role obsno    id   age cond_dyad scheduled          
   <dbl> <int> <int> <dbl> <int> <chr>     <dttm>             
 1     1     1     1     1    40 condB     2018-10-17 08:00:08
 2     1     1     2     1    40 condB     2018-10-17 09:00:01
 3     1     1     3     1    40 condB     2018-10-17 09:59:56
 4     1     1     5     1    40 condB     2018-10-17 12:00:12
 5     1     1     6     1    40 condB     2018-10-18 07:59:47
 6     1     1    10     1    40 condB     2018-10-18 12:00:06
 7     1     1    11     1    40 condB     2018-10-19 07:59:40
 8     1     1    16     1    40 condB     2018-10-20 08:00:04
 9     1     1    17     1    40 condB     2018-10-20 09:00:13
10     1     1    18     1    40 condB     2018-10-20 10:00:13
# ℹ 511 more rows
# ℹ 13 more variables: sent <dttm>, start <dttm>, end <dttm>, contact <int>,
#   PA1 <int>, PA2 <int>, PA3 <int>, NA1 <int>, NA2 <int>, NA3 <int>,
#   location <chr>, daycum <drtn>, obsnoday <int>

Beeps that have matched but end outside of the time interval

end_out = which(is.na(df_sampling_2$end_time_ok))
data[end_out,]

# A tibble: 521 × 20
# Groups:   id, daycum [253]
    dyad  role obsno    id   age cond_dyad scheduled          
   <dbl> <int> <int> <dbl> <int> <chr>     <dttm>             
 1     1     1     1     1    40 condB     2018-10-17 08:00:08
 2     1     1     2     1    40 condB     2018-10-17 09:00:01
 3     1     1     3     1    40 condB     2018-10-17 09:59:56
 4     1     1     5     1    40 condB     2018-10-17 12:00:12
 5     1     1     6     1    40 condB     2018-10-18 07:59:47
 6     1     1    10     1    40 condB     2018-10-18 12:00:06
 7     1     1    11     1    40 condB     2018-10-19 07:59:40
 8     1     1    16     1    40 condB     2018-10-20 08:00:04
 9     1     1    17     1    40 condB     2018-10-20 09:00:13
10     1     1    18     1    40 condB     2018-10-20 10:00:13
# ℹ 511 more rows
# ℹ 13 more variables: sent <dttm>, start <dttm>, end <dttm>, contact <int>,
#   PA1 <int>, PA2 <int>, PA3 <int>, NA1 <int>, NA2 <int>, NA3 <int>,
#   location <chr>, daycum <drtn>, obsnoday <int>

We can have a broad view with the a summary plot:

df_sampling_2_ = df_sampling_2 %>%
    gather(type, ok, beep_match_ok, sent_time_ok, start_time_ok, end_time_ok)

df_sampling_2_ %>%
    ggplot(aes(y=obsno, x=type, fill=as.logical(ok))) +
        geom_tile() +
        facet_grid(id~., space="free_x", scales="free_x") +
        theme(strip.placement = "outside",
              strip.background = element_rect(fill=NA,colour="grey50"),
              panel.spacing=unit(0,"cm"),
              axis.ticks.y = element_blank(),
              axis.text.y = element_blank())

        # scale_x_discrete(guide = guide_axis_nested(delim = "&")) +
        # theme(ggh4x.axis.nesttext.y = element_blank())

(b) Check event-contingent sampling

IN DEVELOPMENT

(c) Check burst-contingent sampling

IN DEVELOPMENT

If the burst are trigered on event, then you can refer to the previous part to check it. Here, we will focus especially on checking if the follow-ups beeps have followed the defined sampling scheme.