This template follows the preprocessing framework described in Revol et al. (2023) and can be found in the R package esmtools.
We consider data from a pilot ESM study in which 10 families (triads of father-mother-child) participated. Each family member received a notification (beep) prompting them to complete a questionnaire four times a day on weekdays (between 4.30pm and 8.30pm), and nine times a day on weekend days (between 10.15am and 8.30pm) for 10 consecutive days, yielding 60 beeps in total. Beeps were sent semi-randomly within predetermined 15-minute time intervals, with the restriction that there was always at least one hour between two consecutive beeps. The beeps were sent at the same time to family members and expired after 30 minutes.
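The total of 60 beeps can be checked against the daily counts. The sketch below is illustrative only (the variable names and the example week are assumptions, not part of the original template). It suggests that 4 beeps per weekday and 9 per weekend day add up to 60 over 10 consecutive days only when the window contains 4 weekend days, i.e., when it starts on a Friday or a Saturday.

```r
# Planned number of beeps per day type (from the study description)
beeps_weekday <- 4
beeps_weekend <- 9
n_days <- 10

# Try every possible starting weekday; 2022-05-02 is a Monday (arbitrary example week)
monday <- as.Date("2022-05-02")
start_labels <- c("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun")

# Number of weekend days in a 10-day window for each starting weekday
# ("%u" returns the ISO weekday number: 6 = Saturday, 7 = Sunday)
weekend_days <- sapply(0:6, function(k) {
  window <- monday + k + 0:(n_days - 1)
  sum(format(window, "%u") %in% c("6", "7"))
})

# Total planned beeps for each starting weekday
total_beeps <- beeps_weekday * (n_days - weekend_days) + beeps_weekend * weekend_days
names(total_beeps) <- start_labels
total_beeps
```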
In our article, we only use part of the dataset: we focus on the father-mother dyads, excluding the children’s data, and examine a subset of the assessed variables. We examine parents’ momentary positive affect and negative affect (rated on scales ranging from 0 to 100). We also retained identification variables (i.e., participant number, dyad number), demographic variables (i.e., ‘role’ (father/mother), ‘age’), and timestamp variables: ‘scheduled’ (i.e., when the ESM questionnaire was scheduled), ‘sent’ (i.e., when the questionnaire was sent), ‘start’ (i.e., when the questionnaire was opened by the participant), and ‘end’ (i.e., when the questionnaire was completed). The dataset also includes branching items. Parents reported whether their child had experienced a stressful event since the last beep. If not, they were asked whether their child had experienced anything fun since the last beep (yes/no). If yes, parents were asked to rate the extent to which their child showed that they had experienced this fun event, on a scale ranging from 0 to 100.
Import the packages:
library(esmtools) # For button(), txt() functions
# For data management
library(dplyr)
library(tidyr)
library(data.table)
# Descriptive statistics
library(skimr)
# Missing values inspection
library(naniar)
library(visdat)
# Plotting
library(ggplot2)
# To modify characters
library(stringr)
# Dates
library(lubridate)
library(hms)
This section is dedicated to a first look at the data, the merging of data sources, the first basic preprocessing steps (e.g., duplicates, branching items check), and a check of variable consistency right after the data have been imported.
Import the data:
# Find the path toward the dataset stored within the esmtools package
file_path = system.file("extdata", "esmdata_raw.csv", package="esmtools")
# Import data
data = read.csv(file_path)
Raw dataset meta-info:
dataInfo(file_path=file_path, read_fun = read.csv,
idvar="id", timevar="sent")
## Path : C:/Users/u0148925/AppData/Local/R/win-library/4.2/esmtools/extdata/esmdata_raw.csv
## Extension : csv
## Size : 129469 bytes
## Creation time : 2024-03-11 10:00:19
## Update time : 2024-03-11 10:00:19
## ncol : 13
## nrow : 1242
## Number participants : 20
## Average number obs : 62.1
## Period : from 2022-04-24 19:00:11 to 2022-12-04 20:21:25
## Variables : dyad, id, role, age, scheduled, sent, start, end, pos_aff, neg_aff, perc_stress_child, perc_fun_child, perc_fun_signaled
The following chunk helps to have a first insight on what the dataset looks like.
Modification 1: reformatting the timestamp variables to be in POSIXct format.
data = data %>%
mutate(across(c(scheduled, sent, start, end), ~as.POSIXct(.,origin="1970-01-01", tz="GMT")))
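As a small illustration of this conversion, the sketch below converts two timestamps (values taken from the output shown later in this report) and checks that date-time arithmetic then works. The object names are hypothetical. The `origin` argument only matters when the input is numeric (seconds since the origin), which the last lines illustrate.

```r
# Character timestamps as they would appear in a CSV file
raw <- c("2022-05-05 19:00:00", "2022-05-05 19:00:11")

# Convert to POSIXct in the GMT time zone
ts <- as.POSIXct(raw, tz = "GMT")

# Once converted, differences between timestamps can be computed directly
delay_secs <- as.numeric(difftime(ts[2], ts[1], units = "secs"))

# Numeric input (seconds since 1970-01-01) needs an origin to be interpreted
epoch <- as.numeric(ts[1])
back <- as.POSIXct(epoch, origin = "1970-01-01", tz = "GMT")
format(back, "%Y-%m-%d %H:%M:%S")
```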
Modification 2: create a variable that keeps the maximum values per observation in the multi-response item ‘perc_stress_child’.
# Split the 'perc_stress_child' column by commas to get separate values.
split_values = strsplit(data$perc_stress_child, ",")
# Find the maximum value in each set of values and store in 'max_values' vector.
max_values = sapply(split_values, function(x) {
NA_ = length(x) < 1
if (NA_) {
NA
} else {
max(as.numeric(x), na.rm = TRUE)
}
})
# Assign max value to perc_stress_child_max variable in data
data$perc_stress_child_max = max_values
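To see what this step does on a few values, here is a minimal sketch on a made-up vector. The helper `max_response()` is hypothetical (it is not part of esmtools) and is a slightly more defensive variant that returns NA for both empty strings and NA entries.

```r
# Hypothetical helper: maximum of a comma-separated multi-response item
max_response <- function(x) {
  parts <- strsplit(x, ",", fixed = TRUE)
  vapply(parts, function(v) {
    v <- suppressWarnings(as.numeric(v))  # non-numeric pieces become NA
    v <- v[!is.na(v)]                     # drop missing pieces
    if (length(v) == 0) NA_real_ else max(v)
  }, numeric(1))
}

# Toy input: two responses, one response, empty string, missing value
toy <- c("5,6", "7", "", NA)
toy_max <- max_response(toy)
toy_max
```

Returning NA explicitly for entries without any numeric value avoids the `-Inf` that `max(..., na.rm = TRUE)` produces on an all-missing input.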
No issues detected.
No issues detected.
Consistency of time-invariant variables.
Issue 1: participants 32 and 46 each have two age values, (47, 29) and (49, 22) respectively. The second values (i.e., 29 and 22) each occur only once; the first values (i.e., 47 and 49) are the correct ones.
vars_consist(data, "id", c("age"))
## id age
## 1 13 44
## 2 9 39
## 3 32 (47, 29)
## 4 46 (49, 22)
## 5 66 54
## 6 20 52
## 7 93 43
## 8 7 44
## 9 45 39
## 10 77 42
## 11 52 39
## 12 72 39
## 13 1 50
## 14 49 41
## 15 73 41
## 16 79 39
## 17 67 42
## 18 80 42
## 19 92 47
## 20 15 51
No other issues detected. The variables ‘dyad’, ‘id’, ‘role’ are all consistent with each other.
Inspection: check the number of occurrences of the different age values for participants 32 and 46:
data %>% filter(id %in% c(32,46)) %>% group_by(id, age) %>% summarise(n = n())
## `summarise()` has grouped output by 'id'. You can override using the `.groups`
## argument.
## # A tibble: 4 × 3
## # Groups: id [2]
## id age n
## <int> <int> <int>
## 1 32 29 1
## 2 32 47 62
## 3 46 22 1
## 4 46 49 60
Modification 3: set the age of participant 32 to 47 and the age of participant 46 to 49.
data$age[data$id==32 & data$age==29] = 47
data$age[data$id==46 & data$age==22] = 49
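When more participants are affected, the same correction could be generalized by replacing each participant's age with their most frequent value. The sketch below works on made-up rows and is only valid under the assumption that the most frequent value is the correct one, as was the case here. `mode_value()` is a hypothetical helper.

```r
# Hypothetical helper: most frequent value of a numeric vector
mode_value <- function(x) {
  tab <- table(x)
  as.numeric(names(tab)[which.max(tab)])
}

# Toy data: one deviating age value per participant
toy <- data.frame(
  id  = c(32, 32, 32, 46, 46, 46),
  age = c(47, 47, 29, 49, 22, 49)
)

# Replace every age by the participant's most frequent age
toy$age_fixed <- ave(toy$age, toy$id, FUN = mode_value)
toy
```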
Here, our goal is to gain an overview of the patterns of missing values and address any preliminary issues related to them.
Issue 2: from the descriptive analysis in the ‘First glimpse’ section, we can see that the missing values in ‘perc_stress_child’ are coded as empty strings (‘’) and not as NA.
Modification 4: recode the missing values of the ‘perc_stress_child’ variable as NA.
data$perc_stress_child[data$perc_stress_child==""] = NA
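A quick before-and-after count makes this recoding easy to verify. The sketch below uses a made-up vector and additionally treats whitespace-only strings as missing, which is an assumption that may or may not apply to a given dataset.

```r
# Toy responses: valid values, an empty string, and a whitespace-only string
x <- c("5,6", "", " ", "7")

# Count blank entries before recoding
n_blank <- sum(trimws(x) == "")

# Recode blank (empty or whitespace-only) entries as NA
x[!is.na(x) & trimws(x) == ""] <- NA

# The number of NA values should now match the number of blanks found
n_na <- sum(is.na(x))
c(blank_before = n_blank, na_after = n_na)
```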
The following plot gives an overview of the missing values in the dataframe.
# Reorder variables
data = data %>% arrange(dyad, id, scheduled)
# Overview of missing values
vis_miss(data)
We can see multiple patterns and no missing values in the first 6 variables.
gg_miss_upset(data %>% select(start:perc_fun_signaled), nsets = 12)
Inspection: here are the observations in question:
pos_27 = !is.na(data$end) & !is.na(data$start) & is.na(data$pos_aff)
data[pos_27, ]
## dyad id role age scheduled sent
## 1 1 13 1 44 2022-05-05 19:00:00 2022-05-05 19:00:11
## 2 1 13 1 44 2022-05-05 19:00:00 2022-05-05 19:00:11
## 3 1 13 1 44 2022-05-05 19:00:00 2022-05-05 19:00:11
## 126 2 9 0 39 2022-05-19 19:00:00 2022-05-19 19:00:11
## 127 2 9 0 39 2022-05-19 19:00:00 2022-05-19 19:00:11
## 128 2 9 0 39 2022-05-19 19:00:00 2022-05-19 19:00:11
## 190 2 32 1 47 2022-05-19 19:00:00 2022-05-19 19:00:11
## 191 2 32 1 47 2022-05-19 19:00:00 2022-05-19 19:00:11
## 252 3 20 0 52 2022-04-28 19:00:00 2022-04-28 19:00:11
## 313 3 66 1 54 2022-04-24 19:00:00 2022-04-24 19:00:11
## 375 4 7 0 44 2022-06-16 19:00:00 2022-06-16 19:00:09
## 436 4 93 1 43 2022-06-16 19:00:00 2022-06-16 19:00:09
## 497 5 45 1 39 2022-06-16 19:00:00 2022-06-16 19:00:07
## 558 5 77 0 42 2022-06-16 19:00:00 2022-06-16 19:00:07
## 623 6 49 0 41 2022-08-25 19:00:00 2022-08-25 19:00:09
## 688 6 52 1 39 2022-08-25 19:00:00 2022-08-25 19:00:09
## 689 6 52 1 39 2022-08-25 19:00:00 2022-08-25 19:00:09
## 750 7 1 0 50 2022-09-06 19:00:00 2022-09-06 19:00:09
## 751 7 1 0 50 2022-09-22 19:00:00 2022-09-22 19:00:03
## 812 7 72 1 39 2022-09-06 19:00:00 2022-09-06 19:00:09
## 813 7 72 1 39 2022-09-22 19:00:00 2022-09-22 19:00:03
## 874 8 73 1 41 2022-11-07 19:00:00 2022-11-07 19:00:04
## 937 8 79 0 39 2022-11-10 19:00:00 2022-11-10 19:00:03
## 998 9 67 0 42 2022-11-24 19:00:00 2022-11-24 19:00:05
## 1059 9 80 1 42 2022-11-24 19:00:00 2022-11-24 19:00:05
## 1181 10 92 1 47 2022-11-17 19:00:00 2022-11-17 19:00:06
## 1182 10 92 1 47 2022-11-17 19:00:00 2022-11-17 19:00:06
## start end pos_aff neg_aff perc_stress_child
## 1 2022-05-05 19:00:33 2022-05-05 19:00:36 NA NA <NA>
## 2 2022-05-05 19:00:59 2022-05-05 19:01:02 NA NA <NA>
## 3 2022-05-05 19:03:33 2022-05-05 19:03:35 NA NA <NA>
## 126 2022-05-19 19:46:57 2022-05-19 19:47:00 NA NA <NA>
## 127 2022-05-19 19:47:09 2022-05-19 19:47:13 NA NA <NA>
## 128 2022-05-19 19:47:31 2022-05-19 19:47:34 NA NA <NA>
## 190 2022-05-19 19:57:11 2022-05-19 19:57:20 NA NA <NA>
## 191 2022-05-19 19:57:30 2022-05-19 19:57:32 NA NA <NA>
## 252 2022-04-28 19:08:00 2022-04-28 19:08:03 NA NA <NA>
## 313 2022-04-24 19:01:28 2022-04-24 19:01:33 NA NA <NA>
## 375 2022-06-16 19:00:47 2022-06-16 19:00:49 NA NA <NA>
## 436 2022-06-16 19:00:46 2022-06-16 19:00:52 NA NA <NA>
## 497 2022-06-16 19:01:01 2022-06-16 19:01:03 NA NA <NA>
## 558 2022-06-16 19:08:52 2022-06-16 19:08:56 NA NA <NA>
## 623 2022-08-25 19:00:40 2022-08-25 19:00:44 NA NA <NA>
## 688 2022-08-25 19:16:12 2022-08-25 19:16:14 NA NA <NA>
## 689 2022-08-25 19:16:19 2022-08-25 19:16:20 NA NA <NA>
## 750 2022-09-06 19:00:23 2022-09-06 19:00:26 NA NA <NA>
## 751 2022-09-22 19:00:12 2022-09-22 19:00:15 NA NA <NA>
## 812 2022-09-06 19:00:19 2022-09-06 19:00:23 NA NA <NA>
## 813 2022-09-22 19:02:53 2022-09-22 19:02:57 NA NA <NA>
## 874 2022-11-07 19:01:20 2022-11-07 19:01:23 NA NA <NA>
## 937 2022-11-10 19:00:32 2022-11-10 19:00:34 NA NA <NA>
## 998 2022-11-24 18:00:20 2022-11-24 18:00:23 NA NA <NA>
## 1059 2022-11-24 18:00:35 2022-11-24 18:00:39 NA NA <NA>
## 1181 2022-11-17 18:51:14 2022-11-17 18:51:16 NA NA <NA>
## 1182 2022-11-17 18:51:20 2022-11-17 18:51:22 NA NA <NA>
## perc_fun_child perc_fun_signaled perc_stress_child_max
## 1 NA NA NA
## 2 NA NA NA
## 3 NA NA NA
## 126 NA NA NA
## 127 NA NA NA
## 128 NA NA NA
## 190 NA NA NA
## 191 NA NA NA
## 252 NA NA NA
## 313 NA NA NA
## 375 NA NA NA
## 436 NA NA NA
## 497 NA NA NA
## 558 NA NA NA
## 623 NA NA NA
## 688 NA NA NA
## 689 NA NA NA
## 750 NA NA NA
## 751 NA NA NA
## 812 NA NA NA
## 813 NA NA NA
## 874 NA NA NA
## 937 NA NA NA
## 998 NA NA NA
## 1059 NA NA NA
## 1181 NA NA NA
## 1182 NA NA NA
pos_3 = is.na(data$perc_stress_child) & !is.na(data$pos_aff)
data[pos_3, ]
## dyad id role age scheduled sent
## 328 3 66 1 54 2022-05-01 10:22:59 2022-05-01 10:23:01
## 334 3 66 1 54 2022-05-01 17:58:16 2022-05-01 17:58:18
## 396 4 7 0 44 2022-06-19 19:14:44 2022-06-19 19:14:46
## start end pos_aff neg_aff perc_stress_child
## 328 2022-05-01 10:23:06 2022-05-01 10:24:31 69 34 <NA>
## 334 2022-05-01 17:58:32 2022-05-01 18:00:13 91 6 <NA>
## 396 2022-06-19 19:18:57 2022-06-19 19:19:46 84 16 <NA>
## perc_fun_child perc_fun_signaled perc_stress_child_max
## 328 NA NA NA
## 334 NA NA NA
## 396 NA NA NA
Modification 5: set the ‘start’ and ‘end’ variables to missing for the first 27 inconsistent cases (timestamps present but no responses). For the 3 remaining cases, no modifications are made because these inconsistencies have no implications for later analyses.
data[pos_27, c("start", "end")] = NA
We create time-related variables (e.g., observation number) that will be necessary for later data preprocessing.
Modification 6: extract time elements (day, year, etc.), and create observation number (obsno), day number (daycum), beep number in a day (beepno) and duration in days variables (duration).
# Datetime elements: year, day, etc.
data["year"] = year(data$start)
data["month"] = month(data$start)
data["day"] = day(data$start)
data["hour"] = hour(data$start)
data["minute"] = minute(data$start)
# Observation number: beep number of the observation that indicates their serial order (within participant)
data = data %>% arrange(id, scheduled) %>%
group_by(id) %>%
mutate(obsno = 1:n()) %>% ungroup()
# Day cumulate: Day number since the first beep sent to the participant
data = data %>%
group_by(id) %>%
mutate(daycum = difftime(as.Date(sent), as.Date(min(sent, na.rm=TRUE)), units="days") + 1)
# Beep number: Beep number within a day
data = data %>% arrange(id, sent) %>%
group_by(id,daycum) %>%
mutate(beepno = 1:n())
# Duration in days
data = data %>%
group_by(id) %>%
mutate(duration = difftime(as.Date(max(sent, na.rm=TRUE)), as.Date(min(sent, na.rm=TRUE)), units="days") + 1) %>%
ungroup()
Modification 7: using logical tests, we define which observations are valid. The ‘valid’ observations are the ones without missing values in the variables of interest, i.e., 3 main variables (‘pos_aff’, ‘neg_aff’, ‘perc_stress_child’) and two others (‘perc_fun_child’, ‘perc_fun_signaled’) depending on the branching items (see above). We created a function that can be reused later.
check_validity = function(data) {
  # Flag which variables are present (TRUE = not missing)
  has_pos_aff = !is.na(data$pos_aff)
  has_neg_aff = !is.na(data$neg_aff)
  has_fun_signaled = !is.na(data$perc_fun_signaled)
  # Branching conditions: a follow-up item must be present when its branch was triggered
  fun_child_ok = ifelse(data$perc_stress_child > 6, !is.na(data$perc_fun_child), TRUE)
  fun_signaled_ok = ifelse(data$perc_fun_child == 1, !is.na(data$perc_fun_signaled), TRUE)
  # Combine all conditions and return them as integers (1 = valid)
  is_valid = has_pos_aff & has_neg_aff & has_fun_signaled & fun_child_ok & fun_signaled_ok
  is_valid = as.integer(is_valid)
  return(is_valid)
}
data[,"valid"] = check_validity(data)
This section is dedicated to checking and solving issues due to inconsistencies between the planned and the actual design of the study.
Overview of when the beeps were sent in a calendar format.
We check the actual sampling scheme and compare it to the planned one.
Issue 4: the sampling scheme plot helps to visualize how the beeps that were actually sent deviate from the planned scheme:
data %>%
mutate(weekday = ifelse(wday(sent, week_start=1) %in% c(6,7), "weekend", "weekday")) %>%
group_by(id, daycum, weekday) %>%
summarize(count = n()) %>%
ggplot(aes(x = factor(daycum), y = factor(id))) +
geom_point(aes(color=factor(count),shape=weekday),size=3) +
theme(axis.text.x = element_text(angle = 90)) +
labs(title="Sampling scheme plot of the sent beeps",
x="Cumulative day", y="Participant id", color="Number of beeps", shape="Day type")
data %>%
group_by(id) %>%
summarize(n = n()) %>%
ggplot(aes(x=factor(id),y=n)) +
geom_col(position = "dodge") +
scale_y_continuous(breaks = seq(0, 70, 5)) +
labs(title="Quantity of beeps sent to each participant", x="Participant id", y="Number of beeps")
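As a numerical complement to these plots, the number of beeps per participant and day can be compared with the planned number (4 on weekdays, 9 on weekend days). The sketch below uses made-up timestamps and hypothetical column names (`day`, `day_type`, `expected`, `deviates`); it is one possible way to flag deviating days, not the procedure used in this report.

```r
# Toy data: a complete Friday, a complete Saturday, and an incomplete Monday
toy <- data.frame(
  id = c(rep(1, 4), rep(1, 9), rep(2, 3)),
  sent = as.POSIXct(c(
    paste("2022-05-06", c("16:40:00", "17:50:00", "19:00:00", "20:10:00")),
    paste("2022-05-07", sprintf("%02d:20:00", 10:18)),
    paste("2022-05-09", c("16:40:00", "17:50:00", "19:00:00"))
  ), tz = "GMT")
)

# Calendar day and day type ("%u": 6 = Saturday, 7 = Sunday)
toy$day <- format(toy$sent, "%Y-%m-%d", tz = "GMT")
toy$day_type <- ifelse(format(toy$sent, "%u", tz = "GMT") %in% c("6", "7"),
                       "weekend", "weekday")
toy$n <- 1

# Count the beeps per participant and day
counts <- aggregate(n ~ id + day + day_type, data = toy, FUN = sum)

# Compare with the planned number of beeps
counts$expected <- ifelse(counts$day_type == "weekend", 9, 4)
counts$deviates <- counts$n != counts$expected
counts
```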
data %>% filter(daycum==1) %>% as.data.frame()
## dyad id role age scheduled sent start
## 1 7 1 0 50 2022-09-06 19:00:00 2022-09-06 19:00:09 <NA>
## 2 4 7 0 44 2022-06-16 19:00:00 2022-06-16 19:00:09 <NA>
## 3 2 9 0 39 2022-05-17 16:43:32 2022-05-17 16:43:34 <NA>
## 4 1 13 1 44 2022-05-05 19:00:00 2022-05-05 19:00:11 <NA>
## 5 1 13 1 44 2022-05-05 19:00:00 2022-05-05 19:00:11 <NA>
## 6 1 13 1 44 2022-05-05 19:00:00 2022-05-05 19:00:11 <NA>
## 7 10 15 0 51 2022-11-17 19:00:00 2022-11-17 19:00:12 <NA>
## 8 3 20 0 52 2022-04-28 19:00:00 2022-04-28 19:00:11 <NA>
## 9 2 32 1 47 2022-05-17 16:43:32 2022-05-17 16:43:34 <NA>
## 10 5 45 1 39 2022-06-16 19:00:00 2022-06-16 19:00:07 <NA>
## 11 1 46 0 49 2022-05-05 19:00:00 2022-05-05 19:00:10 <NA>
## 12 6 49 0 41 2022-08-16 16:35:31 2022-08-16 16:35:32 <NA>
## 13 6 49 0 41 2022-08-16 17:54:15 2022-08-16 17:54:16 <NA>
## 14 6 49 0 41 2022-08-16 19:14:27 2022-08-16 19:14:28 2022-08-16 19:24:43
## 15 6 49 0 41 2022-08-16 20:24:59 2022-08-16 20:25:00 2022-08-16 20:41:18
## 16 6 52 1 39 2022-08-16 16:35:31 2022-08-16 16:35:32 <NA>
## 17 6 52 1 39 2022-08-16 17:54:15 2022-08-16 17:54:16 2022-08-16 17:59:32
## 18 6 52 1 39 2022-08-16 19:14:27 2022-08-16 19:14:28 2022-08-16 19:24:55
## 19 6 52 1 39 2022-08-16 20:24:59 2022-08-16 20:25:00 2022-08-16 20:26:26
## 20 3 66 1 54 2022-04-24 19:00:00 2022-04-24 19:00:11 <NA>
## 21 9 67 0 42 2022-11-24 19:00:00 2022-11-24 19:00:05 <NA>
## 22 7 72 1 39 2022-09-06 19:00:00 2022-09-06 19:00:09 <NA>
## 23 8 73 1 41 2022-11-07 19:00:00 2022-11-07 19:00:04 <NA>
## 24 5 77 0 42 2022-06-16 19:00:00 2022-06-16 19:00:07 <NA>
## 25 8 79 0 39 2022-11-07 19:00:00 2022-11-07 19:00:04 <NA>
## 26 9 80 1 42 2022-11-24 19:00:00 2022-11-24 19:00:05 <NA>
## 27 10 92 1 47 2022-11-17 19:00:00 2022-11-17 19:00:06 <NA>
## 28 10 92 1 47 2022-11-17 19:00:00 2022-11-17 19:00:06 <NA>
## 29 4 93 1 43 2022-06-16 19:00:00 2022-06-16 19:00:09 <NA>
## end pos_aff neg_aff perc_stress_child perc_fun_child
## 1 <NA> NA NA <NA> NA
## 2 <NA> NA NA <NA> NA
## 3 <NA> NA NA <NA> NA
## 4 <NA> NA NA <NA> NA
## 5 <NA> NA NA <NA> NA
## 6 <NA> NA NA <NA> NA
## 7 <NA> NA NA <NA> NA
## 8 <NA> NA NA <NA> NA
## 9 <NA> NA NA <NA> NA
## 10 <NA> NA NA <NA> NA
## 11 <NA> NA NA <NA> NA
## 12 <NA> NA NA <NA> NA
## 13 <NA> NA NA <NA> NA
## 14 2022-08-16 19:27:09 68 12 7 1
## 15 2022-08-16 20:42:33 89 3 7 1
## 16 <NA> NA NA <NA> NA
## 17 2022-08-16 18:01:52 88 0 5,6 NA
## 18 2022-08-16 19:26:16 74 23 7 1
## 19 2022-08-16 20:27:51 32 83 7 1
## 20 <NA> NA NA <NA> NA
## 21 <NA> NA NA <NA> NA
## 22 <NA> NA NA <NA> NA
## 23 <NA> NA NA <NA> NA
## 24 <NA> NA NA <NA> NA
## 25 <NA> NA NA <NA> NA
## 26 <NA> NA NA <NA> NA
## 27 <NA> NA NA <NA> NA
## 28 <NA> NA NA <NA> NA
## 29 <NA> NA NA <NA> NA
## perc_fun_signaled perc_stress_child_max year month day hour minute obsno
## 1 NA NA NA NA NA NA NA 1
## 2 NA NA NA NA NA NA NA 1
## 3 NA NA NA NA NA NA NA 1
## 4 NA NA NA NA NA NA NA 1
## 5 NA NA NA NA NA NA NA 2
## 6 NA NA NA NA NA NA NA 3
## 7 NA NA NA NA NA NA NA 1
## 8 NA NA NA NA NA NA NA 1
## 9 NA NA NA NA NA NA NA 1
## 10 NA NA NA NA NA NA NA 1
## 11 NA NA NA NA NA NA NA 1
## 12 NA NA NA NA NA NA NA 1
## 13 NA NA NA NA NA NA NA 2
## 14 80 7 2022 8 16 19 24 3
## 15 63 7 2022 8 16 20 41 4
## 16 NA NA NA NA NA NA NA 1
## 17 NA 6 2022 8 16 17 59 2
## 18 29 7 2022 8 16 19 24 3
## 19 40 7 2022 8 16 20 26 4
## 20 NA NA NA NA NA NA NA 1
## 21 NA NA NA NA NA NA NA 1
## 22 NA NA NA NA NA NA NA 1
## 23 NA NA NA NA NA NA NA 1
## 24 NA NA NA NA NA NA NA 1
## 25 NA NA NA NA NA NA NA 1
## 26 NA NA NA NA NA NA NA 1
## 27 NA NA NA NA NA NA NA 1
## 28 NA NA NA NA NA NA NA 2
## 29 NA NA NA NA NA NA NA 1
## daycum beepno duration valid
## 1 1 days 1 27 days 0
## 2 1 days 1 11 days 0
## 3 1 days 1 13 days 0
## 4 1 days 1 11 days 0
## 5 1 days 2 11 days 0
## 6 1 days 3 11 days 0
## 7 1 days 1 11 days 0
## 8 1 days 1 11 days 0
## 9 1 days 1 13 days 0
## 10 1 days 1 11 days 0
## 11 1 days 1 11 days 0
## 12 1 days 1 20 days 0
## 13 1 days 2 20 days 0
## 14 1 days 3 20 days 1
## 15 1 days 4 20 days 1
## 16 1 days 1 20 days 0
## 17 1 days 2 20 days 0
## 18 1 days 3 20 days 1
## 19 1 days 4 20 days 1
## 20 1 days 1 15 days 0
## 21 1 days 1 11 days 0
## 22 1 days 1 27 days 0
## 23 1 days 1 14 days 0
## 24 1 days 1 11 days 0
## 25 1 days 1 14 days 0
## 26 1 days 1 11 days 0
## 27 1 days 1 11 days 0
## 28 1 days 2 11 days 0
## 29 1 days 1 11 days 0
Modification 8: remove test observations and recompute the time variables. Test observations are all first-day observations plus the extra test beeps that some participants received on later days (listed in the code below).
# Remove day 1 observations and the extra testing beeps of participants
pos = data$daycum == 1 |
(data$id %in% c(79,73) & data$daycum==4) |
(data$id %in% c(1,72) & data$daycum==17) |
(data$id %in% c(49,52) & data$daycum==10) |
(data$id==66 & data$daycum==5) |
(data$id %in% c(9,32) & data$daycum==3)
data = data[!pos,]
# Recompute time variables
# Beep number
data = data %>% arrange(id, scheduled) %>%
group_by(id) %>%
mutate(obsno = 1:n()) %>% ungroup()
# daycum
data = data %>%
group_by(id) %>%
mutate(daycum = difftime(as.Date(sent), as.Date(min(sent, na.rm=TRUE)), units="days") + 1)
# beepno
data = data %>% arrange(id, sent) %>%
group_by(id,daycum) %>%
mutate(beepno = 1:n())
# Duration in days
data = data %>%
group_by(id) %>%
mutate(duration = difftime(as.Date(max(sent, na.rm=TRUE)), as.Date(min(sent, na.rm=TRUE)), units="days") + 1)
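Because the time variables are computed twice with the same steps, they could be wrapped in a reusable function. The sketch below is a base R variant under stated assumptions: `add_time_vars()` is a hypothetical helper (not an esmtools function), it orders observations by the `sent` timestamp only, and it returns `daycum` and `duration` as plain numbers rather than difftime objects.

```r
# Hypothetical helper: add obsno, daycum, beepno and duration to a data frame
add_time_vars <- function(df, id = "id", time = "sent") {
  # Order observations within participants
  df <- df[order(df[[id]], df[[time]]), ]
  day <- as.numeric(as.Date(df[[time]], tz = "GMT"))
  first_day <- ave(day, df[[id]], FUN = min)
  last_day <- ave(day, df[[id]], FUN = max)
  # Serial order of the observation within participant
  df$obsno <- ave(seq_along(day), df[[id]], FUN = seq_along)
  # Day number since the participant's first beep
  df$daycum <- day - first_day + 1
  # Beep number within participant and day
  df$beepno <- ave(seq_along(day), df[[id]], df$daycum, FUN = seq_along)
  # Number of days between the first and last beep (inclusive)
  df$duration <- last_day - first_day + 1
  df
}

# Toy data: two participants with beeps on different days
toy <- data.frame(
  id = c(1, 1, 1, 2, 2),
  sent = as.POSIXct(c("2022-05-06 16:40:00", "2022-05-06 18:05:00",
                      "2022-05-07 10:20:00", "2022-05-06 16:40:00",
                      "2022-05-08 11:00:00"), tz = "GMT")
)
out <- add_time_vars(toy)
out
```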
We check whether there is timestamp incoherence within observations (e.g., an observation with ‘start’ time after the ‘end’ time) or between observations (e.g., an observation that was scheduled after another one but with a ‘start’ time that is before).
Issue 5: there is one observation with its ‘start’ time before its ‘sent’ time. The ‘start’ time is about one hour earlier than the ‘sent’ time.
# Select and order for further tests
df = data[,c("id", "obsno", "scheduled", "sent", "start", "end")]
df = df[order(df$id, df$obsno),]
# Timestamps coherence within observations
df %>%
group_by(id) %>%
mutate(sent_after_sched = scheduled > sent,
start_after_sent = sent > start,
end_after_end = start > end) %>%
filter(sent_after_sched | start_after_sent | end_after_end) %>%
as.data.frame()
## id obsno scheduled sent start
## 1 73 52 2022-11-20 10:24:43 2022-11-20 10:24:44 2022-11-20 09:25:02
## end sent_after_sched start_after_sent end_after_end
## 1 2022-11-20 09:25:59 FALSE TRUE FALSE
# Timestamps coherence between observations
df %>%
group_by(id) %>%
mutate(sent_lag = lag(sent)) %>%
mutate(sent_lag_issue = lag(sent) > scheduled,
start_lag_issue = lag(start) > scheduled,
end_lag_issue = lag(end) > scheduled) %>%
filter(sent_lag_issue | start_lag_issue | end_lag_issue) %>%
as.data.frame()
## [1] id obsno scheduled sent
## [5] start end sent_lag sent_lag_issue
## [9] start_lag_issue end_lag_issue
## <0 rows> (or 0-length row.names)
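The within-observation check can be illustrated on two rows: the first reproduces the timestamps of the flagged observation shown above, and the second is a made-up coherent observation. The flag names are hypothetical and chosen so that TRUE always signals a problem; the lag in minutes makes the size of the incoherence explicit.

```r
# Row 1: timestamps of the flagged observation; row 2: made-up coherent observation
toy <- data.frame(
  id    = c(73, 73),
  sent  = as.POSIXct(c("2022-11-20 10:24:44", "2022-11-20 12:00:05"), tz = "GMT"),
  start = as.POSIXct(c("2022-11-20 09:25:02", "2022-11-20 12:03:10"), tz = "GMT"),
  end   = as.POSIXct(c("2022-11-20 09:25:59", "2022-11-20 12:05:00"), tz = "GMT")
)

# Flags: TRUE signals an incoherent pair of timestamps
toy$start_before_sent <- toy$start < toy$sent
toy$end_before_start <- toy$end < toy$start

# Size of the lag between 'sent' and 'start', in minutes (negative = start before sent)
toy$lag_min <- as.numeric(difftime(toy$start, toy$sent, units = "mins"))
toy[, c("id", "start_before_sent", "end_before_start", "lag_min")]
```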
No issues detected.
This section is dedicated to investigating how well participants engaged with the ESM study, looking particularly for problematic patterns of behavior (e.g., invalid observations, response time, careless responding).
We check how well participants followed the sampling scheme.
No issues detected.
No issues detected.
data = data %>%
mutate(delay_start_min = difftime(start, sent, units="mins"),
daily_end_min = difftime(end, start, units="mins"))
No delay exceeds the accepted response window (30 mins). One observation has a negative delay (see Issue 5 above).
data %>% filter(valid==1) %>%
ggplot(aes(x = delay_start_min)) +
geom_histogram(bins=100) +
labs(title="Histogram of the delays to start the questionnaires",
y="Number of beeps",x="Delay in minutes")
## Don't know how to automatically pick scale for object of type <difftime>.
## Defaulting to continuous.
Issue 6: there are outliers, specifically belonging to participants 7, 15, 32, 66, 73, and 77, that require further investigation.
data %>%
ggplot(aes(y=daily_end_min,x=factor(id))) +
geom_boxplot() +
coord_flip() +
labs(title="Box plots of the time interval to fill the beeps",
x="Participant id",y="Time interval (minute)")
No issues detected.
Issue 7: some delays between partners’ start times are larger than 10 and even 20 minutes (max=28 mins).
# Compute beepno and daycum_dyad to later reformat in dyad dataframe
data = data %>%
arrange(id, sent) %>%
group_by(id,daycum) %>%
mutate(beepno = 1:n()) %>%
group_by(dyad) %>%
mutate(daycum_dyad = difftime(as.Date(sent), as.Date(min(sent, na.rm=TRUE)), units="days") + 1)
# Reformat dataframe to get dyad dataframe format
df1 = data %>% filter(role == 1)
df2 = data %>% filter(role == 0)
data_dyad = df1 %>%
left_join(df2, by=c("dyad", "daycum_dyad", "beepno"), suffix = c("_1", "_2"))
# Compute the time intervals between start times
df_dyad_int = data_dyad %>%
filter(valid_1==1 & valid_2==1) %>%
mutate(time_int = abs(difftime(start_1, start_2, units="mins")))
# Create plot
df_dyad_int %>%
ggplot(aes(x = time_int)) +
geom_histogram(bins=100) +
scale_x_continuous(breaks=seq(0,2000,10)) +
labs(title="Histogram of the delays between partners' start times",
y="Number of beeps",x="Delay (minutes)")
We computed compliance scores for each participant and for each dyad:
# Participant compliance based on maximal number of beep (60)
obsno_max = 60
data$compliance = ave(data$valid, data$id, FUN=function(x) sum(x, na.rm=TRUE)) / obsno_max
# Dyad compliance
# Create a unique key for each obsno within dyads (used later as a grouping vector)
unique_dyad_beeps = paste0(data$dyad, "_", data$obsno)
# Check whether the beep was answered by both partners
valid_dyad_obs = ave(data$valid, unique_dyad_beeps, FUN=function(x) sum(x, na.rm=TRUE)) == 2
# Compute dyad compliance
data$comp_dyad = ave(valid_dyad_obs, data$dyad, FUN=function(x) sum(x, na.rm=TRUE)) / 2 / obsno_max
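To make the two definitions concrete, the sketch below applies the same computations to a made-up dyad with only 4 planned beeps (instead of 60). Participant compliance is the share of planned beeps with a valid answer, and dyad compliance is the share of planned beeps validly answered by both partners. All values are toy values.

```r
# Toy data: one dyad, two partners, 4 planned beeps each
toy <- data.frame(
  dyad  = 1,
  id    = rep(c(10, 11), each = 4),
  obsno = rep(1:4, times = 2),
  valid = c(1, 1, 0, 1,   # partner 10 answered beeps 1, 2 and 4
            1, 0, 0, 1)   # partner 11 answered beeps 1 and 4
)
obsno_max <- 4

# Participant compliance: valid observations / planned beeps
toy$compliance <- ave(toy$valid, toy$id, FUN = sum) / obsno_max

# Dyad compliance: beeps answered by both partners / planned beeps
beep_key <- paste0(toy$dyad, "_", toy$obsno)
both_valid <- ave(toy$valid, beep_key, FUN = sum) == 2
# Each jointly answered beep appears in two rows, hence the division by 2
toy$comp_dyad <- ave(as.numeric(both_valid), toy$dyad, FUN = sum) / 2 / obsno_max

unique(toy[, c("id", "compliance", "comp_dyad")])
```

In this toy example the partners have compliance scores of 0.75 and 0.5, while the dyad compliance is 0.5 because only beeps 1 and 4 were answered by both.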
data %>%
group_by(id) %>% slice(1) %>% # Keep one row per participant
ggplot(aes(y=compliance, x=factor(id))) +
geom_col(position = "dodge") +
labs(title="Compliance score of each participant",
y="Compliance",x="Participant id")
data %>%
group_by(id) %>% slice(1) %>% # Keep one row per participant
ggplot(aes(x=compliance)) +
geom_histogram() +
labs(title="Histogram of the compliance score",
y="Number of participants",x="Compliance")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Issue 9: when taking dyads’ partner observations together, the dyads’ compliance scores (defined as the proportion of beeps answered by both partners) are very low overall.
data %>%
group_by(dyad) %>% slice(1) %>% # Keep one row per dyad
ggplot(aes(x=factor(dyad), y=comp_dyad)) +
geom_col(position="dodge") +
labs(title="Dyad compliance score of each dyad",
y="Dyad compliance score",x="Dyad number")
This section is dedicated to computing and modifying variables of interest that will later be used in visualization and statistical analysis.
Modification 9: the ‘pos_aff’ and ‘neg_aff’ variables are person-mean centered.
data$pos_aff_pc = data$pos_aff - ave(data$pos_aff, data$id, FUN=function(x) mean(x, na.rm=TRUE))
data$neg_aff_pc = data$neg_aff - ave(data$neg_aff, data$id, FUN=function(x) mean(x, na.rm=TRUE))
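Person-mean centering subtracts each participant's own mean from their scores, so the centered scores of every participant average to zero. A minimal check on made-up values (including one missing score) could look like this:

```r
# Toy data: two participants, one missing score
toy <- data.frame(
  id      = c(1, 1, 1, 2, 2),
  pos_aff = c(60, 70, NA, 20, 40)
)

# Person mean (ignoring missing values) and person-mean centered score
person_mean <- ave(toy$pos_aff, toy$id, FUN = function(x) mean(x, na.rm = TRUE))
toy$pos_aff_pc <- toy$pos_aff - person_mean

# Each participant's centered scores should average to zero
centered_means <- tapply(toy$pos_aff_pc, toy$id, mean, na.rm = TRUE)
toy
```

Missing scores stay missing after centering, and the person means (65 and 30 here) are computed from the available observations only.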
Statistical models used later on are sensitive to the distribution of the variables. We therefore checked the distributions of the variables at the participant level.
data %>%
ggplot(aes(x=pos_aff, color=factor(id))) +
geom_density(alpha = 1) +
theme(legend.position = "none") +
labs(title="Density plot of the pos_aff variable")
data %>%
ggplot(aes(x=neg_aff, color=factor(id))) +
geom_density(alpha = 1) +
theme(legend.position = "none") +
labs(title="Density plot of the neg_aff variable")
The preprocessed data is finally exported.
Modification 10: we removed variables that are irrelevant for later analysis.
# Exclude irrelevant variables and reorder the remaining ones
# (any_of() skips listed columns that are not present in the data)
data_export = data %>%
select(-any_of(c("minute", "hour", "day", "year", "month", "delay_sent_min", "delay_start_min", "daily_end_min", "daycum_dyad", "comp_dyad", "duration"))) %>%
select(dyad:age, compliance, obsno, daycum, beepno, valid, scheduled:perc_fun_signaled, pos_aff_pc, neg_aff_pc)
# Export
file_path_preproc = "C:/DATA_STORAGE/Martine_Data/Triadic_pilot_study/data_example_preprocessed.csv"
write.csv(data_export, file_path_preproc, row.names = FALSE)
Run the data characteristics report:
# Path to the data quality report (.Rmd format)
rmark_file = "path/data_quality_report.Rmd"
# Name of the output data quality report. Date is included to keep track of changes
filename_out = paste0(as.Date(Sys.time()), "_Data_Quality_Report.html")
# Knit the data quality report
rmarkdown::render(rmark_file, output_file=filename_out, params=list(file_path=file_path_preproc))
For reproducibility purposes, this section informs about the R session and packages used as well as their versions.
sessionInfo()
## R version 4.2.2 (2022-10-31 ucrt)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 19045)
##
## Matrix products: default
##
## locale:
## [1] LC_COLLATE=Dutch_Belgium.utf8 LC_CTYPE=Dutch_Belgium.utf8
## [3] LC_MONETARY=Dutch_Belgium.utf8 LC_NUMERIC=C
## [5] LC_TIME=Dutch_Belgium.utf8
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] hms_1.1.2 lubridate_1.9.2 stringr_1.5.0 ggplot2_3.4.2
## [5] visdat_0.6.0 naniar_1.0.0 skimr_2.1.5 data.table_1.14.6
## [9] tidyr_1.3.0 dplyr_1.1.2 esmtools_1.0.0
##
## loaded via a namespace (and not attached):
## [1] Rcpp_1.0.10 svglite_2.1.1 digest_0.6.31 utf8_1.2.3
## [5] plyr_1.8.8 R6_2.5.1 repr_1.1.6 backports_1.4.1
## [9] evaluate_0.22 httr_1.4.4 highr_0.10 pillar_1.9.0
## [13] rlang_1.1.1 rstudioapi_0.14 car_3.1-2 jquerylib_0.1.4
## [17] DT_0.30 rmarkdown_2.23 labeling_0.4.2 webshot_0.5.4
## [21] htmlwidgets_1.6.2 munsell_0.5.0 broom_1.0.5 compiler_4.2.2
## [25] xfun_0.39 pkgconfig_2.0.3 systemfonts_1.0.4 base64enc_0.1-3
## [29] htmltools_0.5.5 tidyselect_1.2.0 gridExtra_2.3 tibble_3.2.1
## [33] fansi_1.0.4 viridisLite_0.4.2 withr_2.5.0 ggpubr_0.6.0
## [37] grid_4.2.2 jsonlite_1.8.5 gtable_0.3.3 lifecycle_1.0.3
## [41] magrittr_2.0.3 scales_1.2.1 cli_3.6.1 stringi_1.7.12
## [45] cachem_1.0.8 carData_3.0-5 farver_2.1.1 ggsignif_0.6.4
## [49] fs_1.6.2 xml2_1.3.4 bslib_0.5.1 ellipsis_0.3.2
## [53] generics_0.1.3 vctrs_0.6.2 cowplot_1.1.1 kableExtra_1.3.4
## [57] tools_4.2.2 glue_1.6.2 purrr_1.0.1 abind_1.4-5
## [61] fastmap_1.1.1 yaml_2.3.7 timechange_0.2.0 colorspace_2.1-0
## [65] UpSetR_1.4.0 rstatix_0.7.2 rvest_1.0.3 knitr_1.43
## [69] sass_0.4.6
Additionally, we display the meta-information of the preprocessed dataset.
esmtools::dataInfo(file_path=file_path_preproc,
read_fun = read.csv,
idvar="id", timevar="sent")
## Path : C:/DATA_STORAGE/Martine_Data/Triadic_pilot_study/data_example_preprocessed.csv
## Extension : csv
## Size : 174531 bytes
## Creation time : 2023-05-27 13:35:46
## Update time : 2024-03-12 11:01:57
## ncol : 20
## nrow : 1200
## Number participants : 20
## Average number obs : 60
## Period : from 2022-04-29 16:33:12 to 2022-12-04 20:21:25
## Variables : dyad, id, role, age, compliance, obsno, daycum, beepno, valid, scheduled, sent, start, end, pos_aff, neg_aff, perc_stress_child, perc_fun_child, perc_fun_signaled, pos_aff_pc, neg_aff_pc