Check variable coherence
Packages: dplyr, skimr, stringr, esmtools
ESM datasets come with different types of variables (see Viechtbauer, 2021), which have implications for the data preprocessing:
- design variables (e.g., participant number, day/beep number, experimental condition, scheduled time, sent time): these variables are defined by the study design and the sampling scheme of the study. Importantly, no missing data are expected. In particular, there are:
- subject identifier variables (e.g., participant id, couple id, partner-differentiating variables): allow identifying a unique participant. In the case of dyadic data, the dyads can be indistinguishable or distinguishable. In the latter case, your dataframe should contain a variable differentiating the partners within each dyad (e.g., gender, role).
- timestamp variables: contain the timestamps related to each beep. Ideally, have at least the scheduled time, the sent time, the start time, and the end time - see terminology - to later inspect participants’ response behaviors. They can also include the timestamp at which each question was answered (within each beep).
- variables filled in by the participants:
- time-varying variables (e.g., positive/negative affect): variables whose values change over time. Missing values may correspond to beeps or items left unanswered by the participants.
- time-invariant variables (e.g., depression score, aggregated score): variables that do not change over time and keep the same value across all time points, such as scores from baseline or follow-up questionnaires. Ideally, no missing data are expected.
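The "no missing data expected" rule for design variables can be checked directly by counting NAs per column. The toy dataframe and variable names below are illustrative, not part of the dataset used in this tutorial:

```r
# Toy ESM frame (hypothetical names) to illustrate the check
toy <- data.frame(
  id        = rep(1:2, each = 3),   # design variable: no NA expected
  obsno     = rep(1:3, times = 2),  # design variable: no NA expected
  scheduled = as.POSIXct("2018-02-02 09:00:00", tz = "UTC") + (1:6) * 3600,
  PA1       = c(10, NA, 30, 5, 15, NA)  # time-varying: NA = unanswered beep
)

# Design variables should contain no missing values
design_vars <- c("id", "obsno", "scheduled")
colSums(is.na(toy[design_vars]))
```

Any non-zero count in the result signals a problem with the study-design variables, whereas NAs in a time-varying variable such as PA1 may simply reflect unanswered beeps.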
A good start: descriptive statistics
A good start is to compute descriptive statistics (e.g., mean, range of values). This allows checking the minimum and maximum values, the number of missing values, the mean, etc. Here, we use the summary() function and the skim() function from the skimr package to get a quick overview of the data (see the First glimpse topic).
summary(data)
dyad id cond_dyad role
Min. : 1.00 Min. : 1.00 Length:4200 Min. :1.000
1st Qu.: 8.00 1st Qu.:15.75 Class :character 1st Qu.:1.000
Median :16.00 Median :30.50 Mode :character Median :2.000
Mean :15.55 Mean :30.50 Mean :1.503
3rd Qu.:23.00 3rd Qu.:45.25 3rd Qu.:2.000
Max. :30.00 Max. :60.00 Max. :2.000
NA's :25 NA's :28
obsno scheduled
Min. : 1.0 Min. :2018-02-02 08:59:47.00
1st Qu.:18.0 1st Qu.:2018-07-20 09:00:01.00
Median :35.5 Median :2018-09-13 11:00:13.50
Mean :35.5 Mean :2018-09-08 11:33:24.83
3rd Qu.:53.0 3rd Qu.:2018-10-24 02:59:52.50
Max. :70.0 Max. :2019-06-10 11:59:46.00
NA's :2
sent start
Min. :2018-02-02 08:59:51.00 Min. :2018-02-02 09:00:31.00
1st Qu.:2018-07-20 09:00:18.75 1st Qu.:2018-07-22 12:00:37.25
Median :2018-09-13 11:30:18.00 Median :2018-09-14 22:00:28.00
Mean :2018-09-08 11:54:09.94 Mean :2018-09-14 18:49:31.13
3rd Qu.:2018-10-23 17:00:04.00 3rd Qu.:2018-10-31 13:00:30.50
Max. :2019-06-10 11:59:54.00 Max. :2019-06-10 12:00:15.00
NA's :1254
end PA1 PA2
Min. :2018-02-02 09:03:07.00 Min. : 1.00 Min. : 1.00
1st Qu.:2018-07-22 12:01:49.25 1st Qu.: 4.00 1st Qu.: 3.00
Median :2018-09-14 22:02:02.00 Median : 18.00 Median : 19.00
Mean :2018-09-14 18:51:14.74 Mean : 23.09 Mean : 21.77
3rd Qu.:2018-10-31 13:02:31.00 3rd Qu.: 32.00 3rd Qu.: 33.00
Max. :2019-06-10 12:02:30.00 Max. :100.00 Max. :100.00
NA's :1254 NA's :1254 NA's :1254
PA3 NA1
Min. : 1.00 Min. : 1.00
1st Qu.: 3.00 1st Qu.: 1.00
Median : 16.00 Median : 11.00
Mean : 23.32 Mean : 21.36
3rd Qu.: 31.00 3rd Qu.: 31.00
Max. :100.00 Max. :100.00
NA's :1254 NA's :1254
Many issues can be detected by looking at the descriptive statistics. In our case, we can see that there are missing values in the ‘dyad’ and ‘role’ variables, which is unexpected as they are identification variables.
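As a complement to summary(), skim() gathers the missingness information in a single table, which makes unexpected NAs easier to spot. The small frame below is an illustrative stand-in for the full dataset:

```r
library(skimr)

# Illustrative stand-in for the full dataset
df <- data.frame(
  dyad = c(1, 1, NA, 2),
  role = c(1, 2, 2, NA),
  PA1  = c(10, 20, NA, 30)
)

# skim() reports, per variable, its type, n_missing, complete_rate,
# and distribution summaries in one table
skim(df)
```

A complete_rate below 1 for an identification variable such as dyad or role flags the same kind of issue we detected with summary().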
Check if Variables are Time-Invariant for Each Subject
As previously mentioned, time-invariant variables (e.g., subject id, dyad id) are not expected to vary within participants. Here, we check whether these variables are indeed consistent within participants. In other words, we check that a unique value is attributed to person-level or dyad-level variables (e.g., age).
We show four ways to check this at the participant level:
- The vars_consist() function from the esmtools package merges the unique values of one or more variables while grouping by another one. In our example, we investigate whether the dyad and role variables have a unique value per participant.
- We can check whether a person-level variable displays a unique value (i.e., length(unique(var)) == 1) within each group. Here, we check that the dyad and role variables have a unique value per id. When a test outcome is FALSE, there is more than one value of dyad/role, or an NA value, within the group.
- We can keep all the unique rows after selecting the variables of interest. Here, we keep the unique values of the combination of the id, dyad, cond_dyad, and role variables.
- Finally, we can count the number of unique combinations of the variables’ values. Here, we first group by the id, dyad, cond_dyad, and role variables and then count the number of rows using summarize(n = n()). In the results, we look for inconsistent rows along with their number of occurrences.
data %>%
  group_by(id) %>%
  summarize(consistant_dyad = length(unique(dyad)) == 1,
            consistant_role = length(unique(role)) == 1)
# A tibble: 60 × 3
id consistant_dyad consistant_role
<dbl> <lgl> <lgl>
1 1 TRUE TRUE
2 2 TRUE TRUE
3 3 TRUE TRUE
4 4 TRUE TRUE
5 5 FALSE FALSE
6 6 TRUE TRUE
7 7 TRUE TRUE
8 8 TRUE TRUE
9 9 TRUE TRUE
10 10 TRUE TRUE
# ℹ 50 more rows
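The third and fourth methods from the list above can be sketched with dplyr. The small frame df is illustrative (in the tutorial you would run this on data, including the cond_dyad variable):

```r
library(dplyr)

# Illustrative frame: participant 2 has a missing dyad value on one beep
df <- data.frame(
  id   = c(1, 1, 2, 2),
  dyad = c(1, 1, 1, NA),
  role = c(1, 1, 2, 2)
)

# Method 3: unique rows after selecting the identification variables;
# a clean dataset yields exactly one row per participant
distinct(select(df, id, dyad, role))

# Method 4: count each combination of values; an inconsistent participant
# appears in more than one row, with n giving the number of occurrences
df %>%
  group_by(id, dyad, role) %>%
  summarize(n = n(), .groups = "drop")
```

With clean data, both outputs contain one row per participant; participant 2 here appears twice because of the NA in dyad.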
Each method can be adapted to different nesting levels. Below, we verify the unique values of ‘id’ and ‘cond_dyad’ nested within the ‘dyad’ variable:
library(esmtools)
vars_consist(data, "dyad", c("id", "cond_dyad"))
dyad id cond_dyad
1 1 (1, 2) condB
2 2 (3, 4) condB
3 3 (5, 6) condB
4 NA (5, 19, 25, 43) (condB, condA)
5 4 (7, 8) condB
6 5 (9, 10) condA
7 6 (11, 12) condA
8 7 (13, 14) condB
9 28 (13, 55, 56) (condB, condA)
10 8 (15, 16) condB
11 9 (17, 18, 24) (condB, condA)
12 10 (19, 20) condA
13 11 (21, 22) condB
14 12 (23, 24) condA
15 13 (25, 26) condB
16 14 (27, 28) condB
17 15 (29, 30) condB
18 16 (31, 32) condA
19 17 (33, 34) condB
20 18 (35, 36) condA
21 19 (35, 37, 38) (condA, condB)
22 20 (39, 40) condA
23 21 (41, 42) condA
24 22 (43, 44) condA
25 23 (45, 46) condA
26 24 (47, 48) condA
27 25 (49, 50) condA
28 26 (51, 52) condB
29 27 (53, 54) condB
30 29 (57, 58) condA
31 30 (59, 60) condA
Investigating the issues
Every issue should be checked by hand first (by isolating and inspecting the rows) before solving it.
From above, we can see that there are missing values in the dyad variable. Let’s inspect this issue by selecting the rows using ‘is.na(data$dyad)’.
data[is.na(data$dyad), ]
dyad id cond_dyad role obsno scheduled sent
282 NA 5 condB 1 2 2018-03-04 09:59:49 2018-03-04 09:59:56
283 NA 5 condB 1 3 2018-03-04 11:00:07 2018-03-04 11:00:18
296 NA 5 condB 1 16 2018-03-07 08:59:59 2018-03-07 09:00:09
301 NA 5 condB 1 21 2018-03-08 08:59:54 2018-03-08 09:00:05
303 NA 5 condB 1 23 2018-03-08 11:00:10 2018-03-08 11:00:24
318 NA 5 condB 1 38 2018-03-11 11:00:09 2018-03-11 11:00:28
322 NA 5 condB 1 42 2018-03-12 09:59:59 2018-03-12 10:00:03
334 NA 5 condB 1 54 2018-03-14 12:00:13 2018-03-14 12:00:29
347 NA 5 condB 1 67 2018-03-17 09:59:59 2018-03-17 10:00:17
349 NA 5 condB 1 69 2018-03-17 11:59:58 2018-03-17 12:00:02
1316 NA 19 condA 1 56 2019-03-05 08:59:44 2019-03-05 09:00:02
1694 NA 25 condB 1 14 2018-07-20 11:00:13 2018-07-20 11:00:31
1709 NA 25 condB 1 29 2018-07-23 10:59:56 2018-07-23 10:59:58
1715 NA 25 condB 1 35 2018-07-24 12:00:07 2018-07-24 12:00:17
1726 NA 25 condB 1 46 2018-07-27 07:59:55 2018-07-27 08:00:03
1728 NA 25 condB 1 48 2018-07-27 10:00:05 2018-07-27 10:00:09
1734 NA 25 condB NA 54 2018-07-28 10:59:38 2018-07-28 10:59:45
1743 NA 25 condB 1 63 2018-07-30 09:59:44 2018-07-30 09:59:53
1750 NA 25 condB 1 70 2018-07-31 11:59:50 2018-07-31 11:59:56
2950 NA 43 condA 1 10 2018-09-06 11:59:42 2018-09-06 12:00:07
2970 NA 43 condA 1 30 2018-09-10 12:00:07 2018-09-10 12:00:12
2988 NA 43 condA 1 48 2018-09-14 10:00:00 2018-09-14 10:00:02
2990 NA 43 condA 1 50 2018-09-14 11:59:51 2018-09-14 11:59:53
2994 NA 43 condA 1 54 2018-09-15 10:59:57 2018-09-15 11:00:01
3002 NA 43 condA NA 62 2018-09-17 09:00:25 2018-09-17 09:00:26
start end PA1 PA2 PA3 NA1
282 <NA> <NA> NA NA NA NA
283 <NA> <NA> NA NA NA NA
296 <NA> <NA> NA NA NA NA
301 <NA> <NA> NA NA NA NA
303 2018-03-08 11:00:38 2018-03-08 11:02:30 4 64 27 17
318 <NA> <NA> NA NA NA NA
322 2018-03-12 10:00:19 2018-03-12 10:03:24 1 41 1 28
334 <NA> <NA> NA NA NA NA
347 <NA> <NA> NA NA NA NA
349 2018-03-17 12:00:26 2018-03-17 12:02:11 1 40 1 38
1316 2019-03-05 09:00:29 2019-03-05 09:01:26 36 17 16 13
1694 2018-07-20 11:00:59 2018-07-20 11:03:42 86 36 100 91
1709 2018-07-23 11:00:11 2018-07-23 11:02:13 100 17 100 100
1715 2018-07-24 12:00:50 2018-07-24 12:01:53 99 1 100 100
1726 <NA> <NA> NA NA NA NA
1728 2018-07-27 10:00:28 2018-07-27 10:01:08 7 1 1 1
1734 2018-07-28 10:59:55 2018-07-28 11:01:18 24 1 1 1
1743 2018-07-30 10:00:04 2018-07-30 10:02:54 66 25 81 57
1750 2018-07-31 12:00:07 2018-07-31 12:01:47 100 33 100 100
2950 2018-09-06 12:00:20 2018-09-06 12:01:10 2 16 37 1
2970 2018-09-10 12:00:38 2018-09-10 12:02:40 1 21 37 1
2988 2018-09-14 10:00:22 2018-09-14 10:00:53 1 20 34 1
2990 <NA> <NA> NA NA NA NA
2994 2018-09-15 11:00:09 2018-09-15 11:01:39 4 31 40 3
3002 2018-09-17 09:00:29 2018-09-17 09:02:35 1 23 35 1
Solving the issues
Now it’s time to solve the issues. For the issue displayed above, we propose two methods:
- By hand: we know the true value. For instance, participant 5 should always be in the 3rd dyad (as the vars_consist() output shows). After a quick check of the rows preceding and following the issue (especially to validate that the issue concerns the dyad number), we can easily solve it. When fixing the issue, avoid using row numbers; use conditional tests instead.
- General: in case we have identification variables that allow us to identify a unique participant and that do not have any issues, we can recreate the structure and apply it to the dataset based on the correct values. First, we recreate the original structure of the dataset. Then, we apply modifications using the reliable variable as a key. Be aware that this method involves large data modifications, so you have to be certain before applying it and, afterwards, check that no new issues have been introduced into the dataframe.
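The general method can be sketched as follows: build a lookup table from the rows without issues, then join it back using the reliable variable as a key. The frame and values below are illustrative:

```r
library(dplyr)

# Illustrative frame: id is reliable, dyad has a missing value
df <- data.frame(
  id   = c(5, 5, 6, 6),
  dyad = c(3, NA, 3, 3)
)

# Step 1: recreate the correct structure from the rows without issues,
# keeping one dyad value per id
lookup <- df %>%
  filter(!is.na(dyad)) %>%
  distinct(id, dyad)

# Step 2: apply it back to the dataset, using 'id' as the key
fixed <- df %>%
  select(-dyad) %>%
  left_join(lookup, by = "id")

# Check that the repair left no missing dyad values
stopifnot(!any(is.na(fixed$dyad)))
```

This only works when every participant has at least one row with a correct value; otherwise the join reintroduces NAs, which is why the final check matters.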
We first select and visualize the problematic rows.
pos = data$id==5 & is.na(data$dyad)
data[pos, c("dyad", "id", "role", "obsno")]
dyad id role obsno
282 NA 5 1 2
283 NA 5 1 3
296 NA 5 1 16
301 NA 5 1 21
303 NA 5 1 23
318 NA 5 1 38
322 NA 5 1 42
334 NA 5 1 54
347 NA 5 1 67
349 NA 5 1 69
After confirming that we have correctly identified the targeted rows, we assign the value of 3 to the ‘dyad’ variable for those specific rows (participant 5 belongs to dyad 3, as shown in the vars_consist() output).
data[pos, "dyad"] = 3
Importantly, you will need to check that the issues have been solved using the previously discussed functions (e.g., summary(), displaying a sample of rows). Additionally, make sure you have not introduced new issues in the process.
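A simple way to verify the fix is to re-run the consistency check and the missingness count on the repaired data. The frame below is an illustrative stand-in for the repaired dataset:

```r
library(dplyr)

# Illustrative stand-in for the repaired dataset
repaired <- data.frame(
  id   = rep(c(5, 6), each = 2),
  dyad = rep(3, 4)
)

# Re-run the consistency check: zero remaining rows means the issue is solved
problems <- repaired %>%
  group_by(id) %>%
  summarize(ok = length(unique(dyad)) == 1, .groups = "drop") %>%
  filter(!ok)

nrow(problems)             # should be 0
sum(is.na(repaired$dyad))  # should be 0
```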
References
Viechtbauer, W. (2021). Structuring, checking, and preparing the data. In The Open Handbook of Experience Sampling Methodology: A Step-by-Step Guide to Designing, Conducting, and Analyzing ESM Studies, pages 137-152. Center for Research on Experience Sampling and Ambulatory Methods, Leuven.