This template follows the preprocessing framework described in Revol et al. (2023) and can be found in the R package esmtools.

Study and data collection procedure

We consider data from a pilot ESM study in which 10 families (triads of father-mother-child) participated. Each family member received a notification (beep) prompting them to complete a questionnaire four times a day on weekdays (between 4.30pm and 8.30pm), and nine times a day on weekend days (between 10.15am and 8.30pm) for 10 consecutive days, yielding 60 beeps in total. Beeps were sent semi-randomly within predetermined 15-minute time intervals, with the restriction that there was always at least one hour between two consecutive beeps. The beeps were sent at the same time to family members and expired after 30 minutes.
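As a quick check on the arithmetic, the planned schedule can be written out day by day. The sketch below assumes that participation starts on a Friday and covers two weekends, as stated in the design check further on (Issue 4); the object names are illustrative only.

```r
# Planned 10-day schedule, assuming a Friday start:
# Fri, Sat, Sun, Mon-Fri, Sat, Sun
day_type <- c("weekday", "weekend", "weekend",
              rep("weekday", 5),
              "weekend", "weekend")

# 4 beeps on weekdays, 9 beeps on weekend days
beeps_per_day <- ifelse(day_type == "weekend", 9, 4)

table(day_type)      # 6 weekdays and 4 weekend days
sum(beeps_per_day)   # 6 * 4 + 4 * 9 = 60
```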

In our article, we use only part of the dataset: we focus on the father-mother dyads, excluding the children’s data, and examine a subset of the assessed variables. We examine parents’ momentary positive affect and negative affect (rated on scales ranging from 0 to 100). We also retained identification variables (i.e., participant number, dyad number), demographic variables (i.e., ‘role’ (father/mother), ‘age’), and timestamp variables: ‘scheduled’ (i.e., when the ESM questionnaire was scheduled), ‘sent’ (i.e., when the questionnaire was sent), ‘start’ (i.e., when the questionnaire was opened by the participant), and ‘end’ (i.e., when the questionnaire was completed). The dataset also includes branching items. Parents reported whether their child had experienced a stressful event since the last beep. If not, they were asked whether their child had experienced anything fun since the last beep (yes/no). If yes, parents were asked to rate the extent to which their child showed that they had experienced this fun event, on a scale ranging from 0 to 100.

Load packages

Import the packages:

library(esmtools) # For button(), txt() functions

# For data management
library(dplyr)
library(tidyr)
library(data.table)

# Descriptive statistics
library(skimr)

# Missing values inspection
library(naniar)
library(visdat)

# Plotting
library(ggplot2)

# To modify characters
library(stringr)

# Dates
library(lubridate)
library(hms)

Step 1: Import data and preliminary preprocessing

This section is dedicated to a first look at the data, the merging of data sources, the first basic preprocessing methods (e.g., duplicates, branching items check), and a check of variable consistency right after the data have been imported.

Import the data:

# Find the path toward the dataset stored within the esmtools package
file_path = system.file("extdata", "esmdata_raw.csv", package="esmtools")
# Import data
data = read.csv(file_path)

Raw dataset meta-info:

dataInfo(file_path=file_path, read_fun = read.csv,
         idvar="id", timevar="sent")

## Path : C:/Users/u0148925/AppData/Local/R/win-library/4.2/esmtools/extdata/esmdata_raw.csv 
## Extension : csv 
## Size : 129469 bytes 
## Creation time : 2024-03-11 10:00:19 
## Update time : 2024-03-11 10:00:19 
## ncol : 13 
## nrow : 1242 
## Number participants : 20 
## Average number obs : 62.1 
## Period : from 2022-04-24 19:00:11 to 2022-12-04 20:21:25 
## Variables : dyad, id, role, age, scheduled, sent, start, end, pos_aff, neg_aff, perc_stress_child, perc_fun_child, perc_fun_signaled

First glimpse

The following chunks give a first impression of what the dataset looks like.

dim(data)

## [1] 1242   13

head(data)

##   dyad id role age           scheduled                sent               start
## 1    1 13    1  44 2022-05-05 19:00:00 2022-05-05 19:00:11 2022-05-05 19:00:33
## 2    1 13    1  44 2022-05-05 19:00:00 2022-05-05 19:00:11 2022-05-05 19:00:59
## 3    1 13    1  44 2022-05-05 19:00:00 2022-05-05 19:00:11 2022-05-05 19:03:33
## 4    2  9    0  39 2022-05-19 19:00:00 2022-05-19 19:00:11 2022-05-19 19:46:57
## 5    2  9    0  39 2022-05-19 19:00:00 2022-05-19 19:00:11 2022-05-19 19:47:09
## 6    2  9    0  39 2022-05-19 19:00:00 2022-05-19 19:00:11 2022-05-19 19:47:31
##                   end pos_aff neg_aff perc_stress_child perc_fun_child
## 1 2022-05-05 19:00:36      NA      NA                               NA
## 2 2022-05-05 19:01:02      NA      NA                               NA
## 3 2022-05-05 19:03:35      NA      NA                               NA
## 4 2022-05-19 19:47:00      NA      NA                               NA
## 5 2022-05-19 19:47:13      NA      NA                               NA
## 6 2022-05-19 19:47:34      NA      NA                               NA
##   perc_fun_signaled
## 1                NA
## 2                NA
## 3                NA
## 4                NA
## 5                NA
## 6                NA

tail(data)

##      dyad id role age           scheduled                sent start  end
## 1237   10 15    0  51 2022-11-27 14:09:13 2022-11-27 14:09:15  <NA> <NA>
## 1238   10 92    1  47 2022-11-27 14:09:13 2022-11-27 14:09:15  <NA> <NA>
## 1239   10 92    1  47 2022-11-27 17:55:26 2022-11-27 17:55:28  <NA> <NA>
## 1240   10 15    0  51 2022-11-27 19:04:26 2022-11-27 19:04:28  <NA> <NA>
## 1241   10 15    0  51 2022-11-27 20:27:31 2022-11-27 20:27:32  <NA> <NA>
## 1242   10 92    1  47 2022-11-27 20:27:31 2022-11-27 20:27:32  <NA> <NA>
##      pos_aff neg_aff perc_stress_child perc_fun_child perc_fun_signaled
## 1237      NA      NA                               NA                NA
## 1238      NA      NA                               NA                NA
## 1239      NA      NA                               NA                NA
## 1240      NA      NA                               NA                NA
## 1241      NA      NA                               NA                NA
## 1242      NA      NA                               NA                NA

str(data)

## 'data.frame':    1242 obs. of  13 variables:
##  $ dyad             : int  1 1 1 2 2 2 2 2 1 1 ...
##  $ id               : int  13 13 13 9 9 9 32 32 13 46 ...
##  $ role             : int  1 1 1 0 0 0 1 1 1 0 ...
##  $ age              : int  44 44 44 39 39 39 47 47 44 49 ...
##  $ scheduled        : chr  "2022-05-05 19:00:00" "2022-05-05 19:00:00" "2022-05-05 19:00:00" "2022-05-19 19:00:00" ...
##  $ sent             : chr  "2022-05-05 19:00:11" "2022-05-05 19:00:11" "2022-05-05 19:00:11" "2022-05-19 19:00:11" ...
##  $ start            : chr  "2022-05-05 19:00:33" "2022-05-05 19:00:59" "2022-05-05 19:03:33" "2022-05-19 19:46:57" ...
##  $ end              : chr  "2022-05-05 19:00:36" "2022-05-05 19:01:02" "2022-05-05 19:03:35" "2022-05-19 19:47:00" ...
##  $ pos_aff          : int  NA NA NA NA NA NA NA NA 96 73 ...
##  $ neg_aff          : int  NA NA NA NA NA NA NA NA 4 33 ...
##  $ perc_stress_child: chr  "" "" "" "" ...
##  $ perc_fun_child   : int  NA NA NA NA NA NA NA NA 1 1 ...
##  $ perc_fun_signaled: int  NA NA NA NA NA NA NA NA 81 95 ...

skim(data)

Data summary
Name	data
Number of rows	1242
Number of columns	13
_______________________
Column type frequency:
character	5
numeric	8
________________________
Group variables	None

Variable type: character

skim_variable	n_missing	complete_rate	min	max	empty	n_unique
scheduled	0	1.00	19	19	0	617
sent	0	1.00	19	19	0	654
start	320	0.74	19	19	0	921
end	320	0.74	19	19	0	921
perc_stress_child	0	1.00	0	7	350	37

Variable type: numeric

skim_variable	n_missing	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
dyad	0	1.00	5.49	2.87	1	3.00	6	8.00	10	▇▇▇▇▇
id	0	1.00	49.29	29.47	1	16.25	49	73.00	93	▇▃▆▆▇
role	0	1.00	0.50	0.50	0	0.00	1	1.00	1	▇▁▁▁▇
age	0	1.00	44.13	4.79	22	39.00	42	47.00	54	▁▁▇▇▅
pos_aff	347	0.72	80.34	21.25	1	72.00	87	97.00	100	▁▁▁▃▇
neg_aff	347	0.72	18.46	22.57	0	2.00	11	24.00	100	▇▂▁▁▁
perc_fun_child	508	0.59	0.58	0.49	0	0.00	1	1.00	1	▆▁▁▁▇
perc_fun_signaled	814	0.34	57.85	31.90	0	32.75	66	84.25	100	▅▃▅▆▇

Check unique values of the 3 categorical variables.

table(data$role, useNA="ifany")

## 
##   0   1 
## 619 623

table(data$perc_stress_child, useNA="ifany")

## 
##               1     1,5     1,6       2     2,3 2,3,4,5     2,4       3     3,1 
##     350      31       1       1       8       2       1       1      27       1 
##   3,1,4     3,4   3,4,1   3,4,6     3,5     3,6       4     4,1   4,2,5     4,3 
##       1       1       1       1       1       1      26       1       1       2 
##   4,3,1   4,3,5     4,6       5     5,1   5,3,1     5,6     5,8       6   6,1,3 
##       1       1       1       4       1       1       1       1      36       1 
##     6,2       7   7,4,3     7,8       8     8,5     8,7 
##       2     569       1       1     157       1       4

table(data$perc_fun_child, useNA="ifany")

## 
##    0    1 <NA> 
##  306  428  508

Renaming, relabelling, reformatting

Modification 1: reformatting the timestamp variables to be in POSIXct format.

data = data %>%     
  mutate(across(c(scheduled, sent, start, end), ~as.POSIXct(.,origin="1970-01-01", tz="GMT")))
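Timestamps that do not match the expected layout are worth catching at this point. The following is a minimal sketch, assuming the timestamps are stored as "YYYY-MM-DD HH:MM:SS" strings as in the output above; the toy vector and object names are hypothetical. Passing an explicit `format` makes non-matching strings come back as NA, so they can be counted.

```r
# Toy timestamps (hypothetical): one valid, one missing, one in another layout
toy <- c("2022-05-05 19:00:11", NA, "05/05/2022 19:03")

# With an explicit format, strings that do not match are returned as NA
parsed <- as.POSIXct(toy, format = "%Y-%m-%d %H:%M:%S", tz = "GMT")

# Values that were present before parsing but are NA afterwards
n_failed <- sum(is.na(parsed) & !is.na(toy))
n_failed
```

A non-zero count indicates timestamps that need to be inspected before going further.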

Modification 2: create a variable that keeps the maximum value per observation of the multi-response item ‘perc_stress_child’.

# Split the 'perc_stress_child' column by commas to get separate values.
split_values = strsplit(data$perc_stress_child, ",")
# Find the maximum value in each set of values and store in 'max_values' vector.
max_values = sapply(split_values, function(x) {
    NA_ = length(x) < 1
    if (NA_) {
      NA
    } else {
      max(as.numeric(x), na.rm = TRUE)
    }
})
# Assign max value to perc_stress_child_max variable in data
data$perc_stress_child_max = max_values
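To illustrate what this step does, here is a small sketch that applies the same split-and-maximum logic to a toy vector. The values are hypothetical and only mimic the comma-separated coding of ‘perc_stress_child’.

```r
# Toy multi-response answers (hypothetical values, not taken from the dataset)
toy <- c("", "1,5", "7,4,3", "8")

# Split on commas, then keep the largest selected option.
# An empty string splits into a zero-length vector and is mapped to NA.
toy_max <- sapply(strsplit(toy, ","), function(x) {
  if (length(x) < 1) NA else max(as.numeric(x))
})

toy_max
```

The empty answer yields NA, and each of the other answers is reduced to its largest selected option.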

Duplication

No issues detected.

Look for duplicated rows: the dataframe does not have duplicated rows.

sum(duplicated(data))

## [1] 0

Look for duplicated answers (duplication in the self-reported items): no issue found regarding the number of self-reported items involved, the values of those items, or the number of times a combination is duplicated.

data[!is.na(data$start),] %>% # Select answered observations
    select(pos_aff:perc_fun_signaled) %>%  # Select self-report items
    group_by_all() %>%
    summarise(n = n()) %>% # Compute number of similar self-reported item values
    filter(n > 1) %>% arrange(desc(n)) %>% as.data.frame()

## `summarise()` has grouped output by 'pos_aff', 'neg_aff', 'perc_stress_child',
## 'perc_fun_child'. You can override using the `.groups` argument.

##    pos_aff neg_aff perc_stress_child perc_fun_child perc_fun_signaled  n
## 1       NA      NA                               NA                NA 27
## 2      100       0                 7              0                NA 10
## 3      100       0                 8              0                NA 10
## 4       99       0                 7              0                NA  6
## 5      100       1                 7              0                NA  5
## 6       94       0                 7              0                NA  4
## 7       91       0                 7              0                NA  3
## 8       93       0                 8              0                NA  3
## 9       95       0                 7              0                NA  3
## 10      98       0                 7              0                NA  3
## 11     100       8                 7              0                NA  3
## 12      69      16                 7              0                NA  2
## 13      75      11                 7              0                NA  2
## 14      77       6                 7              0                NA  2
## 15      77      21                 7              0                NA  2
## 16      79       0                 7              0                NA  2
## 17      80      19                 7              0                NA  2
## 18      82       0                 7              0                NA  2
## 19      85      12                 7              0                NA  2
## 20      86      20                 7              0                NA  2
## 21      87      11                 7              1                76  2
## 22      88       0                 8              0                NA  2
## 23      89      15                 7              0                NA  2
## 24      90       1                 7              0                NA  2
## 25      90       7                 7              0                NA  2
## 26      92       0                 7              0                NA  2
## 27      94      10                 7              0                NA  2
## 28      96       0                 7              0                NA  2
## 29      96       0                 7              1               100  2
## 30      97       0                 7              0                NA  2
## 31     100       0                 3             NA                NA  2
## 32     100       0                 7              1                67  2
## 33     100       0                 7              1                72  2
## 34     100       0                 7              1               100  2
## 35     100       0                 8              1                 0  2
## 36     100       0                 8              1                48  2
## 37     100       1                 7              1                 0  2
## 38     100       3                 7              1               100  2
## 39     100       6                 7              0                NA  2
## 40     100       7                 7              0                NA  2
## 41     100       8                 7              1               100  2
## 42     100      12                 7              1               100  2

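Look for duplicated timestamps: the ‘start’ and ‘end’ variables show no issues (each has only one pair of identical values, and each pair occurs within the same dyad). Duplicates in ‘scheduled’ and ‘sent’ are expected, because beeps were sent to family members at the same time.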

# Define a function that finds indices of duplicated non-NA values in a vector.
duplicated_timestamps = function(x) which(!is.na(x) & (duplicated(x) | duplicated(x, fromLast=TRUE)))

# Apply the 'duplicated_timestamps' function to the columns "scheduled", "sent", "start", and "end".
# lapply() always returns a named list, so the results can be accessed with `$`.
vars_dupli = lapply(data[,c("scheduled","sent","start","end")], duplicated_timestamps)

# duplication of start values
data[vars_dupli$start,]

##     dyad id role age           scheduled                sent
## 110    1 13    1  44 2022-05-11 17:59:07 2022-05-11 17:59:09
## 111    1 46    0  49 2022-05-11 17:59:07 2022-05-11 17:59:09
##                   start                 end pos_aff neg_aff perc_stress_child
## 110 2022-05-11 18:04:59 2022-05-11 18:06:48      96       1                 7
## 111 2022-05-11 18:04:59 2022-05-11 18:05:58      98       0                 8
##     perc_fun_child perc_fun_signaled perc_stress_child_max
## 110              1                18                     7
## 111              1                 8                     8

# duplication of end values
data[vars_dupli$end,]

##     dyad id role age           scheduled                sent
## 788   10 92    1  47 2022-11-20 16:38:20 2022-11-20 16:38:22
## 789   10 15    0  51 2022-11-20 16:38:20 2022-11-20 16:38:22
##                   start                 end pos_aff neg_aff perc_stress_child
## 788 2022-11-20 16:47:04 2022-11-20 16:48:14     100       0                 7
## 789 2022-11-20 16:47:10 2022-11-20 16:48:14      93       4                 8
##     perc_fun_child perc_fun_signaled perc_stress_child_max
## 788              1                 7                     7
## 789              1                34                     8
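The behaviour of the helper used above can be checked on a toy vector. This is only a sketch: the timestamps are hypothetical and `find_dupes` is an illustrative copy of the helper under another name.

```r
# Illustrative copy of the helper: positions of non-missing values
# that occur more than once in the vector
find_dupes <- function(x) {
  which(!is.na(x) & (duplicated(x) | duplicated(x, fromLast = TRUE)))
}

# Toy timestamps (hypothetical): one repeated value and two missing values
toy_ts <- as.POSIXct(c("2022-05-11 18:04:59", NA, "2022-05-11 18:04:59",
                       "2022-05-11 18:06:48", NA), tz = "GMT")

# Both occurrences of the repeated timestamp are returned;
# the missing values are not counted as duplicates of each other
dupes <- find_dupes(toy_ts)
dupes
```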

Branching items

No issues detected.

Branching items are consistent:

when ‘perc_stress_child_max’ >= 7, ‘perc_fun_child’ is displayed to the participant.
when ‘perc_stress_child_max’ >= 7 and ‘perc_fun_child’ == 1, ‘perc_fun_signaled’ is displayed to the participant.
in any other condition ‘perc_fun_child’ and ‘perc_fun_signaled’ are not displayed.

data %>% 
    mutate(perc_stress_child_thres= perc_stress_child_max >= 7,
           perc_fun_signaled_NA= !is.na(perc_fun_signaled)) %>%
    group_by(perc_stress_child_thres, perc_fun_child, perc_fun_signaled_NA) %>% 
    summarise(n())

## `summarise()` has grouped output by 'perc_stress_child_thres',
## 'perc_fun_child'. You can override using the `.groups` argument.

## # A tibble: 4 × 4
## # Groups:   perc_stress_child_thres, perc_fun_child [4]
##   perc_stress_child_thres perc_fun_child perc_fun_signaled_NA `n()`
##   <lgl>                            <int> <lgl>                <int>
## 1 FALSE                               NA FALSE                  158
## 2 TRUE                                 0 FALSE                  306
## 3 TRUE                                 1 TRUE                   428
## 4 NA                                  NA FALSE                  350
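The same rules can also be written as an explicit check that returns a single TRUE or FALSE. The sketch below uses hypothetical toy rows and assumes that a displayed item is always answered, which may not hold in every study.

```r
# Toy rows (hypothetical values) covering each branch of the questionnaire
toy <- data.frame(
  perc_stress_child_max = c(3, 7, 8, NA),
  perc_fun_child        = c(NA, 0, 1, NA),
  perc_fun_signaled     = c(NA, NA, 60, NA)
)

# Was each item displayed, according to the branching rules?
fun_child_shown    <- !is.na(toy$perc_stress_child_max) & toy$perc_stress_child_max >= 7
fun_signaled_shown <- fun_child_shown & toy$perc_fun_child %in% 1

# An item should hold a value if and only if it was displayed
consistent <- (!is.na(toy$perc_fun_child) == fun_child_shown) &
  (!is.na(toy$perc_fun_signaled) == fun_signaled_shown)

all(consistent)
```

Rows for which `consistent` is FALSE would be the ones to inspect.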

Check variable coherence

Consistency of time-invariant variables.

Issue 1: participants 32 and 46 each have two age values, (47, 29) and (49, 22) respectively. The second values (i.e., 29 and 22) each occur only once. The first values (i.e., 47 and 49) are the correct ones.

vars_consist(data, "id", c("age"))

##    id      age
## 1  13       44
## 2   9       39
## 3  32 (47, 29)
## 4  46 (49, 22)
## 5  66       54
## 6  20       52
## 7  93       43
## 8   7       44
## 9  45       39
## 10 77       42
## 11 52       39
## 12 72       39
## 13  1       50
## 14 49       41
## 15 73       41
## 16 79       39
## 17 67       42
## 18 80       42
## 19 92       47
## 20 15       51

No other issues detected. The variables ‘dyad’, ‘id’, ‘role’ are all consistent with each other.

vars_consist(data, "id", c("dyad", "role", "age"))

##    id dyad role      age
## 1  13    1    1       44
## 2   9    2    0       39
## 3  32    2    1 (47, 29)
## 4  46    1    0 (49, 22)
## 5  66    3    1       54
## 6  20    3    0       52
## 7  93    4    1       43
## 8   7    4    0       44
## 9  45    5    1       39
## 10 77    5    0       42
## 11 52    6    1       39
## 12 72    7    1       39
## 13  1    7    0       50
## 14 49    6    0       41
## 15 73    8    1       41
## 16 79    8    0       39
## 17 67    9    0       42
## 18 80    9    1       42
## 19 92   10    1       47
## 20 15   10    0       51

vars_consist(data, "dyad", c("id", "role", "age"))

##    dyad       id   role          age
## 1     1 (13, 46) (1, 0) (44, 49, 22)
## 2     2  (9, 32) (0, 1) (39, 47, 29)
## 3     3 (66, 20) (1, 0)     (54, 52)
## 4     4  (93, 7) (1, 0)     (43, 44)
## 5     5 (45, 77) (1, 0)     (39, 42)
## 6     6 (52, 49) (1, 0)     (39, 41)
## 7     7  (72, 1) (1, 0)     (39, 50)
## 8     8 (73, 79) (1, 0)     (41, 39)
## 9     9 (67, 80) (0, 1)           42
## 10   10 (92, 15) (1, 0)     (47, 51)

Inspection: check the number of occurrences of the different age values for participants 32 and 46:

data %>% filter(id %in% c(32,46)) %>% group_by(id, age) %>% summarise(n = n())

## `summarise()` has grouped output by 'id'. You can override using the `.groups`
## argument.

## # A tibble: 4 × 3
## # Groups:   id [2]
##      id   age     n
##   <int> <int> <int>
## 1    32    29     1
## 2    32    47    62
## 3    46    22     1
## 4    46    49    60

Modification 3: set the age of participant 32 to 47 and the age of participant 46 to 49.

data$age[data$id==32 & data$age==29] = 47
data$age[data$id==46 & data$age==22] = 49
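When many participants are affected, the correction can be generalised. The sketch below replaces every value of a time-invariant variable by the participant's most frequent value. It assumes that the most frequent value is the correct one, which was verified by inspection here but should not be taken for granted; the toy data and the helper `modal_value` are hypothetical.

```r
# Toy data (hypothetical): participant 2 has one deviating age value
toy <- data.frame(id  = c(1, 1, 1, 2, 2, 2, 2),
                  age = c(44, 44, 44, 47, 47, 29, 47))

# Most frequent value of a numeric vector (the first one in case of ties)
modal_value <- function(x) {
  counts <- table(x)
  as.numeric(names(counts)[which.max(counts)])
}

# Replace each participant's age by that participant's modal age
toy$age_fixed <- ave(toy$age, toy$id, FUN = modal_value)
toy
```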

First missing values analysis

Here, our goal is to gain an overview of the patterns of missing values and address any preliminary issues related to them.

Recoding missing values

Issue 2: from the descriptive analysis in the ‘First glimpse’ section, we can see that the missing values in ‘perc_stress_child’ are coded as empty strings ("") and not as NA.

Modification 4: We have recoded the missing values of the ‘perc_stress_child’ variable as NA.

data$perc_stress_child[data$perc_stress_child==""] = NA
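If several character variables could be affected, the same recoding can be applied to all of them at once. This is a sketch on hypothetical toy data, not part of the original template: it counts the empty strings per character column and then recodes them as NA.

```r
# Toy data (hypothetical) with empty strings in two character columns
toy <- data.frame(a = c("7", "", "1,5"),
                  b = c("x", "y", ""),
                  n = c(1, NA, 3),
                  stringsAsFactors = FALSE)

# Identify the character columns and count their empty strings
is_chr <- sapply(toy, is.character)
n_empty <- colSums(toy[is_chr] == "", na.rm = TRUE)
n_empty

# Recode empty strings as NA in every character column
toy[is_chr] <- lapply(toy[is_chr], function(x) replace(x, x %in% "", NA))
n_missing <- colSums(is.na(toy))
n_missing
```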

The following plot gives an overview of the missing values in the dataframe.

Overview of missing values

# Reorder rows by dyad, participant, and scheduled time
data = data %>% arrange(dyad, id, scheduled)

# Overview of missing values
vis_miss(data)

The plot shows several missingness patterns. The first 6 variables (‘dyad’ to ‘sent’) have no missing values.

Coherence of missing values in the variables of interest:

Issue 3: There are inconsistent missing values:

the ‘end’ and ‘start’ variables have non-missing values in rows where all self-reported items have missing values (27 observations). Investigation showed that these observations stem from testing.
‘perc_stress_child’ has missing values in rows where the ‘start’ and ‘end’ variables have non-missing values (3 observations).

gg_miss_upset(data %>% select(start:perc_fun_signaled), nsets = 12)

Inspection: here are the observations in question:

pos_27 = !is.na(data$end) & !is.na(data$start) & is.na(data$pos_aff)
data[pos_27, ]

##      dyad id role age           scheduled                sent
## 1       1 13    1  44 2022-05-05 19:00:00 2022-05-05 19:00:11
## 2       1 13    1  44 2022-05-05 19:00:00 2022-05-05 19:00:11
## 3       1 13    1  44 2022-05-05 19:00:00 2022-05-05 19:00:11
## 126     2  9    0  39 2022-05-19 19:00:00 2022-05-19 19:00:11
## 127     2  9    0  39 2022-05-19 19:00:00 2022-05-19 19:00:11
## 128     2  9    0  39 2022-05-19 19:00:00 2022-05-19 19:00:11
## 190     2 32    1  47 2022-05-19 19:00:00 2022-05-19 19:00:11
## 191     2 32    1  47 2022-05-19 19:00:00 2022-05-19 19:00:11
## 252     3 20    0  52 2022-04-28 19:00:00 2022-04-28 19:00:11
## 313     3 66    1  54 2022-04-24 19:00:00 2022-04-24 19:00:11
## 375     4  7    0  44 2022-06-16 19:00:00 2022-06-16 19:00:09
## 436     4 93    1  43 2022-06-16 19:00:00 2022-06-16 19:00:09
## 497     5 45    1  39 2022-06-16 19:00:00 2022-06-16 19:00:07
## 558     5 77    0  42 2022-06-16 19:00:00 2022-06-16 19:00:07
## 623     6 49    0  41 2022-08-25 19:00:00 2022-08-25 19:00:09
## 688     6 52    1  39 2022-08-25 19:00:00 2022-08-25 19:00:09
## 689     6 52    1  39 2022-08-25 19:00:00 2022-08-25 19:00:09
## 750     7  1    0  50 2022-09-06 19:00:00 2022-09-06 19:00:09
## 751     7  1    0  50 2022-09-22 19:00:00 2022-09-22 19:00:03
## 812     7 72    1  39 2022-09-06 19:00:00 2022-09-06 19:00:09
## 813     7 72    1  39 2022-09-22 19:00:00 2022-09-22 19:00:03
## 874     8 73    1  41 2022-11-07 19:00:00 2022-11-07 19:00:04
## 937     8 79    0  39 2022-11-10 19:00:00 2022-11-10 19:00:03
## 998     9 67    0  42 2022-11-24 19:00:00 2022-11-24 19:00:05
## 1059    9 80    1  42 2022-11-24 19:00:00 2022-11-24 19:00:05
## 1181   10 92    1  47 2022-11-17 19:00:00 2022-11-17 19:00:06
## 1182   10 92    1  47 2022-11-17 19:00:00 2022-11-17 19:00:06
##                    start                 end pos_aff neg_aff perc_stress_child
## 1    2022-05-05 19:00:33 2022-05-05 19:00:36      NA      NA              <NA>
## 2    2022-05-05 19:00:59 2022-05-05 19:01:02      NA      NA              <NA>
## 3    2022-05-05 19:03:33 2022-05-05 19:03:35      NA      NA              <NA>
## 126  2022-05-19 19:46:57 2022-05-19 19:47:00      NA      NA              <NA>
## 127  2022-05-19 19:47:09 2022-05-19 19:47:13      NA      NA              <NA>
## 128  2022-05-19 19:47:31 2022-05-19 19:47:34      NA      NA              <NA>
## 190  2022-05-19 19:57:11 2022-05-19 19:57:20      NA      NA              <NA>
## 191  2022-05-19 19:57:30 2022-05-19 19:57:32      NA      NA              <NA>
## 252  2022-04-28 19:08:00 2022-04-28 19:08:03      NA      NA              <NA>
## 313  2022-04-24 19:01:28 2022-04-24 19:01:33      NA      NA              <NA>
## 375  2022-06-16 19:00:47 2022-06-16 19:00:49      NA      NA              <NA>
## 436  2022-06-16 19:00:46 2022-06-16 19:00:52      NA      NA              <NA>
## 497  2022-06-16 19:01:01 2022-06-16 19:01:03      NA      NA              <NA>
## 558  2022-06-16 19:08:52 2022-06-16 19:08:56      NA      NA              <NA>
## 623  2022-08-25 19:00:40 2022-08-25 19:00:44      NA      NA              <NA>
## 688  2022-08-25 19:16:12 2022-08-25 19:16:14      NA      NA              <NA>
## 689  2022-08-25 19:16:19 2022-08-25 19:16:20      NA      NA              <NA>
## 750  2022-09-06 19:00:23 2022-09-06 19:00:26      NA      NA              <NA>
## 751  2022-09-22 19:00:12 2022-09-22 19:00:15      NA      NA              <NA>
## 812  2022-09-06 19:00:19 2022-09-06 19:00:23      NA      NA              <NA>
## 813  2022-09-22 19:02:53 2022-09-22 19:02:57      NA      NA              <NA>
## 874  2022-11-07 19:01:20 2022-11-07 19:01:23      NA      NA              <NA>
## 937  2022-11-10 19:00:32 2022-11-10 19:00:34      NA      NA              <NA>
## 998  2022-11-24 18:00:20 2022-11-24 18:00:23      NA      NA              <NA>
## 1059 2022-11-24 18:00:35 2022-11-24 18:00:39      NA      NA              <NA>
## 1181 2022-11-17 18:51:14 2022-11-17 18:51:16      NA      NA              <NA>
## 1182 2022-11-17 18:51:20 2022-11-17 18:51:22      NA      NA              <NA>
##      perc_fun_child perc_fun_signaled perc_stress_child_max
## 1                NA                NA                    NA
## 2                NA                NA                    NA
## 3                NA                NA                    NA
## 126              NA                NA                    NA
## 127              NA                NA                    NA
## 128              NA                NA                    NA
## 190              NA                NA                    NA
## 191              NA                NA                    NA
## 252              NA                NA                    NA
## 313              NA                NA                    NA
## 375              NA                NA                    NA
## 436              NA                NA                    NA
## 497              NA                NA                    NA
## 558              NA                NA                    NA
## 623              NA                NA                    NA
## 688              NA                NA                    NA
## 689              NA                NA                    NA
## 750              NA                NA                    NA
## 751              NA                NA                    NA
## 812              NA                NA                    NA
## 813              NA                NA                    NA
## 874              NA                NA                    NA
## 937              NA                NA                    NA
## 998              NA                NA                    NA
## 1059             NA                NA                    NA
## 1181             NA                NA                    NA
## 1182             NA                NA                    NA

pos_3 = is.na(data$perc_stress_child) & !is.na(data$pos_aff)
data[pos_3, ]

##     dyad id role age           scheduled                sent
## 328    3 66    1  54 2022-05-01 10:22:59 2022-05-01 10:23:01
## 334    3 66    1  54 2022-05-01 17:58:16 2022-05-01 17:58:18
## 396    4  7    0  44 2022-06-19 19:14:44 2022-06-19 19:14:46
##                   start                 end pos_aff neg_aff perc_stress_child
## 328 2022-05-01 10:23:06 2022-05-01 10:24:31      69      34              <NA>
## 334 2022-05-01 17:58:32 2022-05-01 18:00:13      91       6              <NA>
## 396 2022-06-19 19:18:57 2022-06-19 19:19:46      84      16              <NA>
##     perc_fun_child perc_fun_signaled perc_stress_child_max
## 328             NA                NA                    NA
## 334             NA                NA                    NA
## 396             NA                NA                    NA

Modification 5: set the ‘start’ and ‘end’ variables to missing for the first 27 inconsistent cases. For the 3 remaining cases, no modifications are made, as these inconsistencies have no implications for later analyses.

data[pos_27, c("start", "end")] = NA

Create time variables

We create time-related variables (e.g., observation number) that will be necessary for later data preprocessing.

Modification 6: extract time elements (day, year, etc.), and create the observation number (‘obsno’), day number (‘daycum’), beep number within a day (‘beepno’), and duration in days (‘duration’) variables.

# Datetime elements: year, day, etc.
data["year"] = year(data$start)
data["month"] = month(data$start)
data["day"] = day(data$start)
data["hour"] = hour(data$start)
data["minute"] = minute(data$start)

# Observation number: beep number of the observation that indicates their serial order (within participant)
data = data %>% arrange(id, scheduled) %>%
    group_by(id) %>% 
    mutate(obsno = 1:n()) %>% ungroup()

# Day cumulate: Day number since the first beep sent to the participant 
data = data %>% 
    group_by(id) %>%
    mutate(daycum = difftime(as.Date(sent), as.Date(min(sent, na.rm=TRUE)), units="days") + 1)

# Beep number: Beep number within a day 
data = data %>% arrange(id, sent) %>%
    group_by(id,daycum) %>%
    mutate(beepno = 1:n())

# Duration in days
data = data %>%
    group_by(id) %>%
    mutate(duration = difftime(as.Date(max(sent, na.rm=TRUE)), as.Date(min(sent, na.rm=TRUE)), units="days") + 1) %>% 
    ungroup()
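To make the definitions of ‘daycum’ and ‘beepno’ concrete, the sketch below computes both on a toy participant using base R only. The timestamps are hypothetical, and the code is an illustration of the definitions rather than a replacement for the chunk above.

```r
# Toy participant (hypothetical): beeps on days 1, 2, 2 and 4 of participation
toy <- data.frame(
  id   = c(1, 1, 1, 1),
  sent = as.POSIXct(c("2022-05-05 19:00:11", "2022-05-06 16:43:02",
                      "2022-05-06 18:01:40", "2022-05-08 10:20:15"), tz = "GMT")
)
toy <- toy[order(toy$id, toy$sent), ]

# Cumulative day: days elapsed since the participant's first beep, plus one
sent_day  <- as.numeric(as.Date(toy$sent))
first_day <- ave(sent_day, toy$id, FUN = min)
toy$daycum <- sent_day - first_day + 1

# Beep number: rank of the beep within each participant-day
toy$beepno <- ave(seq_along(toy$sent), toy$id, toy$daycum, FUN = seq_along)
toy
```

Note that day 3 is skipped in ‘daycum’ because no beep was sent on that day.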

Flag (in)valid observations

Modification 7: using logical tests, we flag which observations are valid. ‘Valid’ observations are those with no missing values in the variables of interest, i.e., the 3 main variables (‘pos_aff’, ‘neg_aff’, ‘perc_stress_child’) and, depending on the branching items (see above), two others (‘perc_fun_child’, ‘perc_fun_signaled’). We wrap these tests in a function so that they can be reused later.

check_validity = function(data) {
  # TRUE when the three main items have been answered
  pos_aff_answered = !is.na(data$pos_aff)
  neg_aff_answered = !is.na(data$neg_aff)
  stress_answered = !is.na(data$perc_stress_child_max)
  
  # Branching items only need an answer when they were displayed:
  # 'perc_fun_child' when perc_stress_child_max >= 7,
  # 'perc_fun_signaled' when, in addition, perc_fun_child == 1.
  # The numeric 'perc_stress_child_max' is used because 'perc_stress_child' is a character variable.
  fun_child_shown = stress_answered & data$perc_stress_child_max >= 7
  fun_child_ok = !fun_child_shown | !is.na(data$perc_fun_child)
  fun_signaled_shown = fun_child_shown & data$perc_fun_child %in% 1
  fun_signaled_ok = !fun_signaled_shown | !is.na(data$perc_fun_signaled)
  
  # Combine all conditions to determine validity
  is_valid = pos_aff_answered & neg_aff_answered & stress_answered & fun_child_ok & fun_signaled_ok
  is_valid = as.integer(is_valid)
  
  return(is_valid)
}

data[,"valid"] = check_validity(data)
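The validity rule described in Modification 7 can be illustrated on a few toy rows. The sketch below is self-contained: `toy_validity` is a hypothetical helper that encodes the rule as stated in the text (main items answered, branching items answered only when displayed), and the values are made up.

```r
# Toy rows (hypothetical values), one per situation
toy <- data.frame(
  pos_aff               = c(80, 90, 70, NA, 85),
  neg_aff               = c(10,  5, 20, NA, 12),
  perc_stress_child_max = c( 3,  7,  8, NA,  7),
  perc_fun_child        = c(NA,  0,  1, NA,  1),
  perc_fun_signaled     = c(NA, NA, 60, NA, NA)
)

toy_validity <- function(d) {
  # Main items must all be answered
  main_ok <- !is.na(d$pos_aff) & !is.na(d$neg_aff) & !is.na(d$perc_stress_child_max)
  # Branching items must be answered only when they were displayed
  fun_child_shown    <- !is.na(d$perc_stress_child_max) & d$perc_stress_child_max >= 7
  fun_signaled_shown <- fun_child_shown & d$perc_fun_child %in% 1
  fun_child_ok    <- !fun_child_shown | !is.na(d$perc_fun_child)
  fun_signaled_ok <- !fun_signaled_shown | !is.na(d$perc_fun_signaled)
  as.integer(main_ok & fun_child_ok & fun_signaled_ok)
}

toy$valid <- toy_validity(toy)
toy
```

In this toy example, the fourth row is invalid because nothing was answered, and the fifth because ‘perc_fun_signaled’ was displayed but left empty.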

Step 2: Design and sample scheme checking

This section is dedicated to checking and solving issues due to inconsistencies between the planned and the actual design of the study.

Calendar

Overview of when the beeps were sent in a calendar format.

calendar_plot(data, 'sent') +
    labs(title="Calendar of when the beeps were sent to the participants")

Sampling scheme plot and quantity of beeps

We check the actual sampling scheme and compare it to the planned one.

Issue 4: the sampling scheme plot shows that:

There are large gaps between the first day and the remaining days for many participants (e.g., participants 1, 49, 52, 72).
The first and sometimes the second day of participation must be removed: those days often have fewer than 4 beeps sent and are testing days. In the end, participants should have only 10 days of participation, starting on a Friday and including 2 weekends, with 4 beeps on weekdays and 9 beeps on weekend days.

data %>% 
    mutate(weekday = ifelse(wday(sent,  week_start=1) %in% c(6,7), "weekend", "weekday")) %>%
    group_by(id, daycum, weekday) %>%
    summarize(count = n()) %>%
    ggplot(aes(x = factor(daycum), y = factor(id))) +
        geom_point(aes(color=factor(count),shape=weekday),size=3) +
        theme(axis.text.x = element_text(angle = 90)) +
        labs(title="Sampling scheme plot of the sent beeps", 
             x="Cumulative day", y="Participant id", color="Number of beeps", shape="Day type")

data %>%
  group_by(id) %>% 
  summarize(n = n()) %>%
  ggplot(aes(x=factor(id),y=n)) +
      geom_col(position = "dodge") +
      scale_y_continuous(breaks = seq(0, 70, 5)) +
        labs(title="Quantity of beeps sent to each participant", x="Participant id", y="Number of beeps")
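As a tabular complement to the plots, the number of beeps sent per participant-day can be compared with the planned number (4 on weekdays, 9 on weekend days). The sketch below uses base R and hypothetical toy timestamps; the column names `n_sent`, `planned` and `as_planned` are illustrative.

```r
# Toy data (hypothetical): a Tuesday with 1 beep, a Friday with 4 beeps,
# and a Saturday with 1 beep
toy <- data.frame(
  id   = c(1, 1, 1, 1, 1, 1),
  sent = as.POSIXct(c("2022-09-06 19:00:09",
                      "2022-09-09 16:40:00", "2022-09-09 17:50:00",
                      "2022-09-09 19:05:00", "2022-09-09 20:20:00",
                      "2022-09-10 10:20:00"), tz = "GMT")
)

# Number of beeps sent per participant and calendar day
per_day <- as.data.frame(table(id = toy$id, date = format(toy$sent, "%Y-%m-%d")),
                         responseName = "n_sent", stringsAsFactors = FALSE)
per_day <- per_day[per_day$n_sent > 0, ]

# Planned number of beeps: "%u" gives the ISO weekday (1 = Monday, 7 = Sunday)
is_weekend <- format(as.Date(per_day$date), "%u") %in% c("6", "7")
per_day$planned <- ifelse(is_weekend, 9, 4)
per_day$as_planned <- per_day$n_sent == per_day$planned
per_day
```

Days for which `as_planned` is FALSE are candidates for closer inspection, for instance testing days at the start of participation.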

Inspection: check the first day observations for each participant.

data %>% filter(daycum==1) %>% as.data.frame()

##    dyad id role age           scheduled                sent               start
## 1     7  1    0  50 2022-09-06 19:00:00 2022-09-06 19:00:09                <NA>
## 2     4  7    0  44 2022-06-16 19:00:00 2022-06-16 19:00:09                <NA>
## 3     2  9    0  39 2022-05-17 16:43:32 2022-05-17 16:43:34                <NA>
## 4     1 13    1  44 2022-05-05 19:00:00 2022-05-05 19:00:11                <NA>
## 5     1 13    1  44 2022-05-05 19:00:00 2022-05-05 19:00:11                <NA>
## 6     1 13    1  44 2022-05-05 19:00:00 2022-05-05 19:00:11                <NA>
## 7    10 15    0  51 2022-11-17 19:00:00 2022-11-17 19:00:12                <NA>
## 8     3 20    0  52 2022-04-28 19:00:00 2022-04-28 19:00:11                <NA>
## 9     2 32    1  47 2022-05-17 16:43:32 2022-05-17 16:43:34                <NA>
## 10    5 45    1  39 2022-06-16 19:00:00 2022-06-16 19:00:07                <NA>
## 11    1 46    0  49 2022-05-05 19:00:00 2022-05-05 19:00:10                <NA>
## 12    6 49    0  41 2022-08-16 16:35:31 2022-08-16 16:35:32                <NA>
## 13    6 49    0  41 2022-08-16 17:54:15 2022-08-16 17:54:16                <NA>
## 14    6 49    0  41 2022-08-16 19:14:27 2022-08-16 19:14:28 2022-08-16 19:24:43
## 15    6 49    0  41 2022-08-16 20:24:59 2022-08-16 20:25:00 2022-08-16 20:41:18
## 16    6 52    1  39 2022-08-16 16:35:31 2022-08-16 16:35:32                <NA>
## 17    6 52    1  39 2022-08-16 17:54:15 2022-08-16 17:54:16 2022-08-16 17:59:32
## 18    6 52    1  39 2022-08-16 19:14:27 2022-08-16 19:14:28 2022-08-16 19:24:55
## 19    6 52    1  39 2022-08-16 20:24:59 2022-08-16 20:25:00 2022-08-16 20:26:26
## 20    3 66    1  54 2022-04-24 19:00:00 2022-04-24 19:00:11                <NA>
## 21    9 67    0  42 2022-11-24 19:00:00 2022-11-24 19:00:05                <NA>
## 22    7 72    1  39 2022-09-06 19:00:00 2022-09-06 19:00:09                <NA>
## 23    8 73    1  41 2022-11-07 19:00:00 2022-11-07 19:00:04                <NA>
## 24    5 77    0  42 2022-06-16 19:00:00 2022-06-16 19:00:07                <NA>
## 25    8 79    0  39 2022-11-07 19:00:00 2022-11-07 19:00:04                <NA>
## 26    9 80    1  42 2022-11-24 19:00:00 2022-11-24 19:00:05                <NA>
## 27   10 92    1  47 2022-11-17 19:00:00 2022-11-17 19:00:06                <NA>
## 28   10 92    1  47 2022-11-17 19:00:00 2022-11-17 19:00:06                <NA>
## 29    4 93    1  43 2022-06-16 19:00:00 2022-06-16 19:00:09                <NA>
##                    end pos_aff neg_aff perc_stress_child perc_fun_child
## 1                 <NA>      NA      NA              <NA>             NA
## 2                 <NA>      NA      NA              <NA>             NA
## 3                 <NA>      NA      NA              <NA>             NA
## 4                 <NA>      NA      NA              <NA>             NA
## 5                 <NA>      NA      NA              <NA>             NA
## 6                 <NA>      NA      NA              <NA>             NA
## 7                 <NA>      NA      NA              <NA>             NA
## 8                 <NA>      NA      NA              <NA>             NA
## 9                 <NA>      NA      NA              <NA>             NA
## 10                <NA>      NA      NA              <NA>             NA
## 11                <NA>      NA      NA              <NA>             NA
## 12                <NA>      NA      NA              <NA>             NA
## 13                <NA>      NA      NA              <NA>             NA
## 14 2022-08-16 19:27:09      68      12                 7              1
## 15 2022-08-16 20:42:33      89       3                 7              1
## 16                <NA>      NA      NA              <NA>             NA
## 17 2022-08-16 18:01:52      88       0               5,6             NA
## 18 2022-08-16 19:26:16      74      23                 7              1
## 19 2022-08-16 20:27:51      32      83                 7              1
## 20                <NA>      NA      NA              <NA>             NA
## 21                <NA>      NA      NA              <NA>             NA
## 22                <NA>      NA      NA              <NA>             NA
## 23                <NA>      NA      NA              <NA>             NA
## 24                <NA>      NA      NA              <NA>             NA
## 25                <NA>      NA      NA              <NA>             NA
## 26                <NA>      NA      NA              <NA>             NA
## 27                <NA>      NA      NA              <NA>             NA
## 28                <NA>      NA      NA              <NA>             NA
## 29                <NA>      NA      NA              <NA>             NA
##    perc_fun_signaled perc_stress_child_max year month day hour minute obsno
## 1                 NA                    NA   NA    NA  NA   NA     NA     1
## 2                 NA                    NA   NA    NA  NA   NA     NA     1
## 3                 NA                    NA   NA    NA  NA   NA     NA     1
## 4                 NA                    NA   NA    NA  NA   NA     NA     1
## 5                 NA                    NA   NA    NA  NA   NA     NA     2
## 6                 NA                    NA   NA    NA  NA   NA     NA     3
## 7                 NA                    NA   NA    NA  NA   NA     NA     1
## 8                 NA                    NA   NA    NA  NA   NA     NA     1
## 9                 NA                    NA   NA    NA  NA   NA     NA     1
## 10                NA                    NA   NA    NA  NA   NA     NA     1
## 11                NA                    NA   NA    NA  NA   NA     NA     1
## 12                NA                    NA   NA    NA  NA   NA     NA     1
## 13                NA                    NA   NA    NA  NA   NA     NA     2
## 14                80                     7 2022     8  16   19     24     3
## 15                63                     7 2022     8  16   20     41     4
## 16                NA                    NA   NA    NA  NA   NA     NA     1
## 17                NA                     6 2022     8  16   17     59     2
## 18                29                     7 2022     8  16   19     24     3
## 19                40                     7 2022     8  16   20     26     4
## 20                NA                    NA   NA    NA  NA   NA     NA     1
## 21                NA                    NA   NA    NA  NA   NA     NA     1
## 22                NA                    NA   NA    NA  NA   NA     NA     1
## 23                NA                    NA   NA    NA  NA   NA     NA     1
## 24                NA                    NA   NA    NA  NA   NA     NA     1
## 25                NA                    NA   NA    NA  NA   NA     NA     1
## 26                NA                    NA   NA    NA  NA   NA     NA     1
## 27                NA                    NA   NA    NA  NA   NA     NA     1
## 28                NA                    NA   NA    NA  NA   NA     NA     2
## 29                NA                    NA   NA    NA  NA   NA     NA     1
##    daycum beepno duration valid
## 1  1 days      1  27 days     0
## 2  1 days      1  11 days     0
## 3  1 days      1  13 days     0
## 4  1 days      1  11 days     0
## 5  1 days      2  11 days     0
## 6  1 days      3  11 days     0
## 7  1 days      1  11 days     0
## 8  1 days      1  11 days     0
## 9  1 days      1  13 days     0
## 10 1 days      1  11 days     0
## 11 1 days      1  11 days     0
## 12 1 days      1  20 days     0
## 13 1 days      2  20 days     0
## 14 1 days      3  20 days     1
## 15 1 days      4  20 days     1
## 16 1 days      1  20 days     0
## 17 1 days      2  20 days     0
## 18 1 days      3  20 days     1
## 19 1 days      4  20 days     1
## 20 1 days      1  15 days     0
## 21 1 days      1  11 days     0
## 22 1 days      1  27 days     0
## 23 1 days      1  14 days     0
## 24 1 days      1  11 days     0
## 25 1 days      1  14 days     0
## 26 1 days      1  11 days     0
## 27 1 days      1  11 days     0
## 28 1 days      2  11 days     0
## 29 1 days      1  11 days     0

Modification 8: remove test observations and recompute time variables. Test observations are all first-day observations plus:

day 4 observations for participants 79 and 73.
day 17 observations for participants 1 and 72.
day 10 observations for participants 49, 52.
day 5 observations for participant 66.
day 3 observations for participants 9 and 32.

# Flag day 1 observations and the extra testing days of specific participants
pos = data$daycum == 1 | 
    (data$id %in% c(79,73) & data$daycum==4) |
    (data$id %in% c(1,72) & data$daycum==17) |
    (data$id %in% c(49,52) & data$daycum==10) |
    (data$id==66 & data$daycum==5) |
    (data$id %in% c(9,32) & data$daycum==3) 

data = data[!pos,]

# Recompute time variables
# Observation number (obsno)
data = data %>% arrange(id, scheduled) %>%
    group_by(id) %>% 
    mutate(obsno = 1:n()) %>% ungroup()
# daycum
data = data %>% 
    group_by(id) %>%
    mutate(daycum = difftime(as.Date(sent), as.Date(min(sent, na.rm=TRUE)), units="days") + 1)
# beepno
data = data %>% arrange(id, sent) %>%
    group_by(id,daycum) %>%
    mutate(beepno = 1:n())
# Duration in days
data = data %>%
    group_by(id) %>%
    mutate(duration = difftime(as.Date(max(sent, na.rm=TRUE)), as.Date(min(sent, na.rm=TRUE)), units="days") + 1)

Double-checking the sampling scheme: after removing the test observations, every participant has the same number of observations (60), covering two weekends and starting on a Friday.

data %>% 
    mutate(weekday = ifelse(wday(sent,  week_start=1) %in% c(6,7), "weekend", "weekday")) %>%
    group_by(id, daycum, weekday) %>%
    summarize(count = n()) %>%
    ggplot(aes(x = factor(daycum), y = factor(id))) +
        geom_point(aes(color=factor(count),shape=weekday),size=3) +
        theme(axis.text.x = element_text(angle = 90)) +
        labs(title="Sampling scheme: number of beeps sent each day", 
             x="Cumulative day", y="Participant id", color="Number of beeps", shape="Day type")

Coherence timestamps

We check whether there is timestamp incoherence within observations (e.g., an observation with ‘start’ time after the ‘end’ time) or between observations (e.g., an observation that was scheduled after another one but with a ‘start’ time that is before).

Issue 5: one observation (participant 73, observation 52) has a ‘start’ time that precedes its ‘sent’ time by roughly one hour.

# Select and order for further tests
df = data[,c("id", "obsno", "scheduled", "sent", "start", "end")]
df = df[order(df$id, df$obsno),]

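# Timestamps coherence within observations
# Expected order: scheduled <= sent <= start <= end.
# Each flag is TRUE when that order is violated (despite the "_after_" names).
df %>%
    group_by(id) %>%
    mutate(sent_after_sched = scheduled > sent,   # sent before scheduled
           start_after_sent = sent > start,       # started before sent
           end_after_end = start > end) %>%       # ended before started
    filter(sent_after_sched | start_after_sent | end_after_end) %>% 
    as.data.frame()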

##   id obsno           scheduled                sent               start
## 1 73    52 2022-11-20 10:24:43 2022-11-20 10:24:44 2022-11-20 09:25:02
##                   end sent_after_sched start_after_sent end_after_end
## 1 2022-11-20 09:25:59            FALSE             TRUE         FALSE
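
The within-observation check relies on the expected order scheduled <= sent <= start <= end. The following base R sketch reproduces that logic on made-up timestamps; `flag_order()` and `ts()` are hypothetical helpers introduced only for illustration.

```r
# Hypothetical helper: TRUE when the expected order
# scheduled <= sent <= start <= end is violated within a row.
# Comparisons that involve a missing timestamp cannot be evaluated
# and do not raise a flag on their own.
flag_order <- function(df) {
  bad <- (df$sent < df$scheduled) | (df$start < df$sent) | (df$end < df$start)
  !is.na(bad) & bad
}

ts <- function(x) as.POSIXct(x, tz = "UTC")

# Toy data (made up): row 2 was opened before it was sent, row 3 was missed
toy <- data.frame(
  scheduled = ts(c("2022-11-20 10:00:00", "2022-11-20 11:30:00", "2022-11-20 13:00:00")),
  sent      = ts(c("2022-11-20 10:00:05", "2022-11-20 11:30:04", "2022-11-20 13:00:03")),
  start     = ts(c("2022-11-20 10:02:00", "2022-11-20 10:31:00", NA)),
  end       = ts(c("2022-11-20 10:04:00", "2022-11-20 10:33:00", NA))
)

# Rows with an order violation
toy[flag_order(toy), ]
```

Treating missing timestamps as "not flagged" keeps missed beeps out of the result, so only genuine order violations need to be inspected.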

# Timestamps coherence between observations
df %>%
    group_by(id) %>%
    mutate(sent_lag = lag(sent)) %>%
    mutate(sent_lag_issue = lag(sent) > scheduled,
            start_lag_issue = lag(start) > scheduled,
            end_lag_issue = lag(end) > scheduled) %>%
    filter(sent_lag_issue | start_lag_issue | end_lag_issue) %>% 
    as.data.frame()

##  [1] id              obsno           scheduled       sent           
##  [5] start           end             sent_lag        sent_lag_issue 
##  [9] start_lag_issue end_lag_issue  
## <0 rows> (or 0-length row.names)

Time and delay to send

No issues detected.

Time of day the beeps were sent to participants: no observations outside of the planned sampling scheme.

data %>%
    mutate(weekday = ifelse(wday(sent,  week_start=1) %in% c(6,7), "weekend", "weekday")) %>%
    ggplot(aes(x=hms::as_hms(sent))) +
        geom_histogram(bins=100) +
        scale_x_time(breaks = scales::date_breaks("1 hours")) +
        facet_grid(weekday~.) +
        labs(title="Time of day the beeps were sent",
             y="Number of beeps sent",x="Time of day") +
        theme(axis.text.x = element_text(angle = 45, hjust=1))

Delay to send: there are no extreme values and the distribution looks plausible (min = 1 s, max = 19 s). The delay is computed as the difference in seconds between the ‘sent’ and the ‘scheduled’ timestamps.

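# Fix the unit explicitly: despite its "_min" suffix, this variable is in seconds
data = data %>% 
    mutate(delay_sent_min = as.numeric(difftime(sent, scheduled, units="secs"))) 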

data %>% 
    ggplot(aes(x = delay_sent_min)) +
        geom_histogram(bins=100) +
        labs(title="Delays to send",
             y="Number of beeps",x="Delay in seconds")
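
The unit of a timestamp difference deserves attention before it is converted to a number. Subtracting two POSIXct vectors returns a difftime whose unit R picks automatically from the size of the differences, so the same expression can yield seconds for one dataset and minutes for another. A minimal sketch on made-up timestamps (all variable names are illustrative):

```r
scheduled <- as.POSIXct(c("2022-05-06 16:40:00", "2022-05-06 18:05:00"), tz = "UTC")

# Small delays: the automatically chosen unit is seconds
sent_fast <- scheduled + c(2, 19)
units(sent_fast - scheduled)

# Larger (made-up) delays: the automatically chosen unit becomes minutes
sent_slow <- scheduled + c(120, 300)
units(sent_slow - scheduled)

# Fixing the unit makes the numeric result unambiguous
delay_secs <- as.numeric(difftime(sent_slow, scheduled, units = "secs"))
delay_secs
```

Stating `units` in `difftime()` keeps the axis label of the histogram ("Delay in seconds") valid whatever the size of the delays.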

Delay between two beeps sent in a day: no extreme values and no negative values (no issues).

df_int_sent = data %>%
    arrange(id,obsno) %>%
    group_by(id, daycum) %>%
    mutate(time_int = difftime(as.POSIXct(lead(sent)), sent, units="mins"))

df_int_sent %>% 
    ggplot(aes(x = time_int)) +
    geom_histogram(bins=100) +
        labs(title="Delays between two beeps sent in a day",
             y="Number of beeps",x="Delay in minutes")

## Don't know how to automatically pick scale for object of type <difftime>.
## Defaulting to continuous.

## Warning: Removed 200 rows containing non-finite values (`stat_bin()`).
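
The histogram can be complemented with a direct check against the design rule of at least one hour between two consecutive beeps. The base R sketch below uses made-up data and a hypothetical helper, `next_gap_min()`, and assumes the rows are already sorted by time within participant and day. It also shows why the last beep of each day has a missing interval, which is consistent with the rows removed in the warning above (one per participant and day).

```r
# Toy data (made up): one participant, two days
toy <- data.frame(
  id     = 1,
  daycum = c(1, 1, 1, 2, 2),
  sent   = as.POSIXct(c("2022-05-06 16:40:00", "2022-05-06 17:55:00",
                        "2022-05-06 19:10:00", "2022-05-07 10:20:00",
                        "2022-05-07 11:05:00"), tz = "UTC")
)

# Minutes until the next beep; the last beep of a group has no successor (NA).
# Input: timestamps as seconds since the epoch, sorted in time.
next_gap_min <- function(secs) c(diff(secs), NA) / 60

# Apply the helper within each participant and day
toy$gap_min <- ave(as.numeric(toy$sent), toy$id, toy$daycum, FUN = next_gap_min)

# Gaps shorter than the planned minimum of 60 minutes
toy[!is.na(toy$gap_min) & toy$gap_min < 60, ]
```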

Step 3: Participants response behaviors

This section investigates how well participants engaged with the ESM study, looking in particular for problematic behavior patterns (e.g., invalid observations, response time, careless responding).

Sampling scheme, quantity of beeps and time started

We check how well participants followed the sampling scheme.

No issues detected.

Sampling scheme plot: this plot shows the valid observations of each participant. The x-axis represents the observation number, the y-axis the participant, and the color distinguishes weekdays from weekend days.

data %>% filter(valid==1) %>%
  mutate(weekday = ifelse(wday(sent,  week_start=1) %in% c(6,7), "weekend", "weekday")) %>%
  ggplot(aes(x=obsno, y=factor(id))) +
      geom_point(aes(color=weekday), size=1) +
        labs(title="Sampling scheme of the valid observations",
             y="Participant id",x="Observation number")

Sampling scheme plot: this plot places participants’ valid observations on a continuous time axis spanning the days of participation (based on the ‘scheduled’ timestamps).

df = data %>% 
    filter(valid==1) %>%
    group_by(id) %>%
    mutate(start_datetime = as.Date(min(scheduled, na.rm=TRUE))) %>%
    ungroup() %>%
    mutate(continuoustime = as.numeric(difftime(scheduled, start_datetime, units="mins")))

# Create the plot
# Day boundaries every 1440 minutes; day labels are centered at midday (breaks_ - 720)
breaks_ = seq(0, 2000000, 1440)[-1]
breaks_limit = breaks_[breaks_ < max(df$continuoustime)]
labels_ = paste0(1:(length(breaks_)), " day")
df %>%
    ggplot(aes(x=continuoustime, y=factor(id))) +
        geom_point() +
        scale_x_continuous(breaks = breaks_ - 720, labels=labels_) +
        geom_vline(xintercept=breaks_limit) +
        labs(title="Continuous sampling scheme of the valid observations",
             y="Participant id",x="Day number (continuous scale)")

Number of valid responses per observation number: it shows a slight decrease over time.

data %>% 
  filter(valid==1) %>% 
  group_by(obsno) %>% summarize(n = n()) %>%
  ggplot(aes(x=obsno,y=n)) +
      geom_col(position = "dodge") +
        labs(title="Number of valid responses over obsno",
             y="Number of valid beeps",x="Observation number")

Time of day the beeps were started (valid observations only): the issue already reported above is visible here. One beep was started outside the defined sampling scheme, between 9 am and 10 am.

data %>% filter(valid==1) %>%
    mutate(weekday = ifelse(wday(sent, week_start=1) %in% c(6,7), "weekend", "weekday")) %>%
    ggplot(aes(x=hms::as_hms(start))) +
        geom_histogram(bins=100) +
        scale_x_time(breaks = scales::date_breaks("1 hours")) +
        facet_grid(weekday~.) +
        labs(title="Time of day the beeps were started (and valid)",
             y="Number of valid beeps started",x="Time of day") +
        theme(axis.text.x = element_text(angle = 45, hjust=1))

Number of interactions

No issues detected.

Hour at which the beeps were started (in blue) and missed (in red): there were slightly more interactions around 15h on weekend days and at 20h on weekdays but, overall, the pattern is very stable.

df_interact = data %>%
    mutate(hour = hour(sent), 
           sent_ = !is.na(sent),
           weekday = ifelse(wday(sent, week_start=1) %in% c(6,7), "weekend", "weekday")) %>%
    group_by(hour,weekday) %>%
    summarise(interact=sum(valid),
              open = sum(sent_)) %>%
    mutate(not_interact = open - interact) %>%
    gather(type, value, interact, not_interact)

## `summarise()` has grouped output by 'hour'. You can override using the
## `.groups` argument.

df_interact %>%
  mutate(type = factor(type, levels=c("not_interact", "interact"))) %>%
  ggplot(aes(x=factor(hour), y=value, fill=type)) +
    geom_col(position="stack")  +
    facet_grid(weekday~.) +
    labs(title="Number of (non-)interactions with the questionnaire by hour of the day",
         y="Number of beeps",x="Hour of the day", fill="Interaction")

Delays

Compute the delay-to-start and delay-to-end (or fill) variables. They are only used to create the plots and are not exported in the preprocessed data frame.

data = data %>% 
    mutate(delay_start_min = difftime(start, sent, units="mins"), 
           daily_end_min = difftime(end, start, units="mins"))

Delay to start

No values exceed the accepted response delay (30 min). One observation has a negative delay (issue already mentioned above).

data %>% filter(valid==1)  %>% 
    ggplot(aes(x = delay_start_min)) +
        geom_histogram(bins=100) +
        labs(title="Histogram of the delays to start the questionnaires",
             y="Number of beeps",x="Delay in minutes")

## Don't know how to automatically pick scale for object of type <difftime>.
## Defaulting to continuous.

Delay to fill

Issue 6: there are outliers, specifically belonging to participants 7, 15, 32, 66, 73, and 77, that require further investigation.

data %>%
    ggplot(aes(y=daily_end_min,x=factor(id))) +
        geom_boxplot() +
        coord_flip() +
        labs(title="Box plots of the time interval to fill the beeps",
             x="Participant id",y="Time interval (minute)")
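
To list the observations behind the outlying points, a rule for what counts as an outlier is needed. The report does not state one, so the sketch below assumes the common 1.5 x IQR convention as implemented by `boxplot.stats()` from base R, applied within each participant. The data are made up and `is_outlier()` is a hypothetical helper.

```r
# Toy data (made up): fill time in minutes for two participants
toy <- data.frame(
  id       = rep(c(7, 15), each = 6),
  fill_min = c(1.2, 1.5, 1.1, 1.8, 1.4, 12.0,
               2.0, 2.2, 1.9, 2.1, 2.3, 2.0)
)

# Hypothetical helper: TRUE for values outside the 1.5 x IQR whiskers
is_outlier <- function(x) x %in% boxplot.stats(x)$out

# Apply the rule within each participant (ave() returns 0/1, hence as.logical)
toy$outlier <- as.logical(ave(toy$fill_min, toy$id, FUN = is_outlier))

# Observations that deserve a closer look
toy[toy$outlier, ]
```

Applying the rule per participant matches the per-participant box plots: a fill time is judged against that person's own typical speed rather than against the whole sample.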

Interval 2 beeps

No issues detected.

Histogram of the delays between two subsequent beeps within a day among valid observations: no values below 0. There is a long right tail (maximum value is 515 minutes).

df_int = data %>% filter(valid==1) %>% 
    arrange(id,obsno) %>%
    group_by(id, daycum) %>%
    mutate(time_int = difftime(as.POSIXct(lead(start)), start, units="mins"))

df_int %>% 
    ggplot(aes(x = time_int)) +
    geom_histogram(bins=100) +
    labs(title="Histogram of the delays between two subsequent beeps within a day",
         y="Number of beeps",x="Delay in minutes")

## Don't know how to automatically pick scale for object of type <difftime>.
## Defaulting to continuous.

## Warning: Removed 148 rows containing non-finite values (`stat_bin()`).

Dyadic time interval

Issue 7: some partners started the same beep more than 10, and even more than 20, minutes apart (max = 28 min).

# Compute beepno and daycum_dyad to later reformat in dyad dataframe
data = data %>% 
    arrange(id, sent) %>%
    group_by(id,daycum) %>%
    mutate(beepno = 1:n()) %>%
    group_by(dyad) %>%
    mutate(daycum_dyad = difftime(as.Date(sent), as.Date(min(sent, na.rm=TRUE)), units="days") + 1)

# Reformat dataframe to get dyad dataframe format
df1 = data %>% filter(role == 1)
df2 = data %>% filter(role == 0)
data_dyad = df1 %>% 
    left_join(df2, by=c("dyad", "daycum_dyad", "beepno"), suffix = c("_1", "_2"))

# Compute the time intervals between start times
df_dyad_int = data_dyad %>% 
    filter(valid_1==1 & valid_2==1) %>% 
    mutate(time_int = abs(difftime(start_1, start_2, units="mins")))

# Create plot
df_dyad_int %>% 
    ggplot(aes(x = time_int)) +
    geom_histogram(bins=100) +
    scale_x_continuous(breaks=seq(0,2000,10)) +
    labs(title="Histogram of the delays between partners' start times",
         y="Number of beeps answered by both partners",x="Delay (minutes)")
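
The reshaping step can be illustrated on a tiny example. The sketch below uses base R `merge()` (an inner join, unlike the `left_join()` used in this report) on made-up data, so it should be read as an illustration of the idea rather than as the report's exact procedure. `ts()` is a hypothetical helper.

```r
ts <- function(x) as.POSIXct(x, tz = "UTC")

# Toy data (made up): one dyad, two beeps, one row per partner and beep
toy <- data.frame(
  dyad   = 1,
  role   = c(1, 1, 0, 0),
  beepno = c(1, 2, 1, 2),
  valid  = c(1, 1, 1, 0),
  start  = ts(c("2022-05-06 16:42:00", "2022-05-06 18:01:00",
                "2022-05-06 16:55:30", NA))
)

# One row per dyad and beep, with partner-specific columns side by side
wide <- merge(toy[toy$role == 1, ], toy[toy$role == 0, ],
              by = c("dyad", "beepno"), suffixes = c("_1", "_2"))

# Keep beeps answered by both partners and compute the absolute gap in minutes
both <- wide[wide$valid_1 == 1 & wide$valid_2 == 1, ]
both$gap_min <- abs(as.numeric(difftime(both$start_1, both$start_2, units = "mins")))
both[, c("dyad", "beepno", "gap_min")]
```

Only the first beep was answered by both partners in this toy example, so a single gap is computed.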

Compliance Rate

We computed the compliance scores of:

participants: compliance follows the definition of a valid observation (see above). An observation is “valid” when all 8 variables of interest are completed by the participant, taking the branching conditions into account. Additionally, all participants received 60 beeps.
dyads: both dyad members must have a ‘valid’ response to the same beep (with the same beep number).

# Participant compliance based on maximal number of beep (60)
obsno_max = 60
data$compliance = ave(data$valid, data$id, FUN=function(x) sum(x, na.rm=TRUE)) / obsno_max

# Dyad compliance
# Create a unique value for each obsno within dyads (later used as a grouping vector) 
unique_dyad_beeps = paste0(data$dyad, "_", data$obsno)
# Check if beep was answered by two partners
valid_dyad_obs = ave(data$valid, unique_dyad_beeps, FUN=function(x) sum(x, na.rm=TRUE)) == 2
# Compute dyad compliance
data$comp_dyad = ave(valid_dyad_obs, data$dyad, FUN=function(x) sum(x, na.rm=TRUE)) / 2 / obsno_max
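
A small worked example may help to see what the two scores measure. The sketch below applies the same `ave()` logic to made-up data with only 4 planned beeps per partner. Like the code above, it assumes that the partners' observation numbers refer to the same beeps.

```r
# Toy data (made up): one dyad, two partners, 4 planned beeps each
toy <- data.frame(
  dyad  = 1,
  id    = rep(c(10, 11), each = 4),
  obsno = rep(1:4, times = 2),
  valid = c(1, 1, 0, 1,   # partner 10: 3 valid beeps
            1, 0, 0, 1)   # partner 11: 2 valid beeps
)
obsno_max <- 4

# Participant compliance: share of planned beeps with a valid response
toy$compliance <- ave(toy$valid, toy$id, FUN = sum) / obsno_max

# Dyad compliance: share of planned beeps validly answered by both partners
beep_key   <- paste0(toy$dyad, "_", toy$obsno)
both_valid <- ave(toy$valid, beep_key, FUN = sum) == 2
# Each jointly answered beep appears in two rows, hence the division by 2
toy$comp_dyad <- ave(as.numeric(both_valid), toy$dyad, FUN = sum) / 2 / obsno_max

unique(toy[, c("id", "compliance", "comp_dyad")])
```

Here the partners answer 3 and 2 of the 4 beeps, but only beeps 1 and 4 are answered by both. The dyad compliance (2 / 4) can therefore never exceed the lower of the two individual scores.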

Compliance rate per participant

Issue 8: the overall compliance is rather low. In particular, participant 66 has a compliance close to 0, and participants 1, 32, 72 and 79 have a compliance below .2.

data %>%
    group_by(id) %>% slice(1) %>%    # Keep one row per participant
    ggplot(aes(y=compliance, x=factor(id))) +
        geom_col(position = "dodge") +
        labs(title="Compliance score of each participant",
            y="Compliance",x="Participant id")

data %>%
    group_by(id) %>% slice(1) %>%    # Keep one row per participant
    ggplot(aes(x=compliance)) +
        geom_histogram() +
        labs(title="Histogram of the compliance score",
            y="Number of participants",x="Compliance")

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Dyad compliance

Issue 9: when taking dyads’ partner observations together, the dyads’ compliances (defined as the proportion of beeps answered by both partners) are very low overall.

data %>%
    group_by(dyad) %>% slice(1) %>%    # Keep one row per dyad
    ggplot(aes(x=factor(dyad), y=comp_dyad)) +
        geom_col(position="dodge") +
        labs(title="Dyad compliance score of each dyad",
            y="Dyad compliance score",x="Dyad number")

Step 4: Compute and transform variables

This section is dedicated to computing and modifying variables of interest that will later be used in visualization and statistical analysis.

Modification 9: the ‘pos_aff’ and ‘neg_aff’ variables are person-mean centered.

data$pos_aff_pc = data$pos_aff - ave(data$pos_aff, data$id, FUN=function(x) mean(x, na.rm=TRUE))
data$neg_aff_pc = data$neg_aff - ave(data$neg_aff, data$id, FUN=function(x) mean(x, na.rm=TRUE))
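
Person-mean centering subtracts each participant's own mean from their scores, so that the centered variable only carries within-person deviations. A minimal sketch on made-up values, using the same `ave()` pattern:

```r
# Toy data (made up): two participants, one missing score
toy <- data.frame(
  id      = c(1, 1, 1, 2, 2, 2),
  pos_aff = c(60, 70, NA, 20, 30, 40)
)

# Each row receives the mean of its own participant (missing values ignored)
person_mean <- ave(toy$pos_aff, toy$id, FUN = function(x) mean(x, na.rm = TRUE))
toy$pos_aff_pc <- toy$pos_aff - person_mean
toy

# After centering, each participant's mean is zero (up to rounding)
tapply(toy$pos_aff_pc, toy$id, mean, na.rm = TRUE)
```

Missing scores stay missing after centering; they are only excluded from the computation of the person mean.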

Step 5: Descriptive statistics and visualization

Distribution plots

Statistical models used later on are sensitive to the distribution of the variables. We therefore checked the distributions of the variables at the participant level.

data %>%
    ggplot(aes(x=pos_aff, color=factor(id))) +  
        geom_density(alpha = 1) +
        theme(legend.position = "none") +
        labs(title="Density plot of the pos_aff variable")

data %>%
    ggplot(aes(x=neg_aff, color=factor(id))) +  
        geom_density(alpha = 1) +
        theme(legend.position = "none") +
        labs(title="Density plot of the neg_aff variable")

Export

The preprocessed data is finally exported.

Modification 10: we removed the variables that are irrelevant for later analysis.

# Exclude non relevant variables and reorder variables
data_export = data %>% 
    select(-c(minute, hour, day, year, month, delay_sent_min, delay_start_min, daily_end_min, daycum_dyad, comp_dyad, duration)) %>%
    select(dyad:age, compliance, obsno, daycum, beepno, valid, scheduled:perc_fun_signaled, pos_aff_pc, neg_aff_pc)

# Export
file_path_preproc = "C:/DATA_STORAGE/Martine_Data/Triadic_pilot_study/data_example_preprocessed.csv"
write.csv(data_export, file_path_preproc, row.names = FALSE)
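
When the exported file is read back, it is worth remembering that CSV stores timestamps as text. The sketch below is a small round trip on made-up data written to a temporary file; the UTC time zone is an assumption made for the example, not a property of the study data.

```r
# Toy data (made up)
toy <- data.frame(
  id   = c(1, 1),
  sent = as.POSIXct(c("2022-05-06 16:40:02", "2022-05-06 17:55:10"), tz = "UTC")
)

# Write to a temporary file rather than to a real project path
tmp <- tempfile(fileext = ".csv")
write.csv(toy, tmp, row.names = FALSE)

# Timestamps are stored as text and come back as a character column
back <- read.csv(tmp, stringsAsFactors = FALSE)
class(back$sent)

# Convert them back explicitly, stating the time zone (assumed UTC here)
back$sent <- as.POSIXct(back$sent, tz = "UTC")
all(back$sent == toy$sent)
```

A later analysis script that imports the preprocessed file would need a similar conversion for the ‘scheduled’, ‘sent’, ‘start’ and ‘end’ columns.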

Run the data quality report:

# Path to the data quality report (.Rmd format) 
rmark_file = "path/data_quality_report.Rmd"

# Name of the output data quality report. Date is included to keep track of changes
filename_out = paste0(as.Date(Sys.time()), "_Data_Quality_Report.html")

# Knit the data quality report
rmarkdown::render(rmark_file, output_file=filename_out, params=list(file_path=file_path_preproc))

Session and dataset info

For reproducibility purposes, this section informs about the R session and packages used as well as their versions.

sessionInfo()

## R version 4.2.2 (2022-10-31 ucrt)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 19045)
## 
## Matrix products: default
## 
## locale:
## [1] LC_COLLATE=Dutch_Belgium.utf8  LC_CTYPE=Dutch_Belgium.utf8   
## [3] LC_MONETARY=Dutch_Belgium.utf8 LC_NUMERIC=C                  
## [5] LC_TIME=Dutch_Belgium.utf8    
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] hms_1.1.2         lubridate_1.9.2   stringr_1.5.0     ggplot2_3.4.2    
##  [5] visdat_0.6.0      naniar_1.0.0      skimr_2.1.5       data.table_1.14.6
##  [9] tidyr_1.3.0       dplyr_1.1.2       esmtools_1.0.0   
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_1.0.10       svglite_2.1.1     digest_0.6.31     utf8_1.2.3       
##  [5] plyr_1.8.8        R6_2.5.1          repr_1.1.6        backports_1.4.1  
##  [9] evaluate_0.22     httr_1.4.4        highr_0.10        pillar_1.9.0     
## [13] rlang_1.1.1       rstudioapi_0.14   car_3.1-2         jquerylib_0.1.4  
## [17] DT_0.30           rmarkdown_2.23    labeling_0.4.2    webshot_0.5.4    
## [21] htmlwidgets_1.6.2 munsell_0.5.0     broom_1.0.5       compiler_4.2.2   
## [25] xfun_0.39         pkgconfig_2.0.3   systemfonts_1.0.4 base64enc_0.1-3  
## [29] htmltools_0.5.5   tidyselect_1.2.0  gridExtra_2.3     tibble_3.2.1     
## [33] fansi_1.0.4       viridisLite_0.4.2 withr_2.5.0       ggpubr_0.6.0     
## [37] grid_4.2.2        jsonlite_1.8.5    gtable_0.3.3      lifecycle_1.0.3  
## [41] magrittr_2.0.3    scales_1.2.1      cli_3.6.1         stringi_1.7.12   
## [45] cachem_1.0.8      carData_3.0-5     farver_2.1.1      ggsignif_0.6.4   
## [49] fs_1.6.2          xml2_1.3.4        bslib_0.5.1       ellipsis_0.3.2   
## [53] generics_0.1.3    vctrs_0.6.2       cowplot_1.1.1     kableExtra_1.3.4 
## [57] tools_4.2.2       glue_1.6.2        purrr_1.0.1       abind_1.4-5      
## [61] fastmap_1.1.1     yaml_2.3.7        timechange_0.2.0  colorspace_2.1-0 
## [65] UpSetR_1.4.0      rstatix_0.7.2     rvest_1.0.3       knitr_1.43       
## [69] sass_0.4.6

Additionally, we display the meta-information of the preprocessed dataset.

esmtools::dataInfo(file_path=file_path_preproc, 
                   read_fun = read.csv,
                   idvar="id", timevar="sent")

## Path : C:/DATA_STORAGE/Martine_Data/Triadic_pilot_study/data_example_preprocessed.csv 
## Extension : csv 
## Size : 174531 bytes 
## Creation time : 2023-05-27 13:35:46 
## Update time : 2024-03-12 11:01:57 
## ncol : 20 
## nrow : 1200 
## Number participants : 20 
## Average number obs : 60 
## Period : from 2022-04-29 16:33:12 to 2022-12-04 20:21:25 
## Variables : dyad, id, role, age, compliance, obsno, daycum, beepno, valid, scheduled, sent, start, end, pos_aff, neg_aff, perc_stress_child, perc_fun_child, perc_fun_signaled, pos_aff_pc, neg_aff_pc

Example of preprocessing report

Jordan Revol, jordan.revol@kuleuven.be, KU Leuven

2024-03-12 11:01:20
