Check variable coherence
Packages: dplyr, skimr, stringr, esmtools
ESM datasets come with different types of variables (see Viechtbauer, 2021), which have implications for the data preprocessing:
- design variables (e.g., participant number, day/beep number, experimental condition, scheduled time, sent time): these variables are defined by the study design and the sampling scheme of the study. Importantly, no missing data are expected. In particular, there are:
- subject identifier variables (e.g., participant id, couple id, partner-differentiating variables): allow identifying a unique participant. In the case of dyadic data, the dyads can be indistinguishable or distinguishable. In the latter case, your dataframe should contain a variable differentiating the partners within each dyad (e.g., gender, role).
- timestamp variables: contain the timestamps related to each beep. Ideally, have at least the scheduled time, the sent time, the start time, and the end time - see terminology - to later inspect participants’ response behaviors. They can also include the timestamp at which each question was answered (within each beep).
- variables filled in by the participants:
- time-varying variables (e.g., positive/negative affect): variables whose values change over time. Missing values may correspond to beeps or items left unanswered by the participants.
- time-invariant variables (e.g., depression score, aggregated score): variables that do not change over time and keep the same value across all time points, such as scores from baseline or follow-up questionnaires. Ideally, no missing data are expected.
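The "no missing data expected" rule for design variables can be checked directly by counting NAs per column. The toy dataframe and variable names below are illustrative, not part of the dataset used in this tutorial:

```r
# Toy ESM frame (hypothetical names) to illustrate the check
toy <- data.frame(
  id        = rep(1:2, each = 3),   # design variable: no NA expected
  obsno     = rep(1:3, times = 2),  # design variable: no NA expected
  scheduled = as.POSIXct("2018-02-02 09:00:00", tz = "UTC") + (1:6) * 3600,
  PA1       = c(10, NA, 30, 5, 15, NA)  # time-varying: NA = unanswered beep
)

# Design variables should contain no missing values
design_vars <- c("id", "obsno", "scheduled")
colSums(is.na(toy[design_vars]))
```

Any non-zero count in the result signals a problem with the study-design variables, whereas NAs in a time-varying variable such as PA1 may simply reflect unanswered beeps.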
A good start: descriptive statistics
A good start is to compute descriptive statistics (e.g., mean, range of values). This allows checking the minimum and maximum values, the number of missing values, the mean, etc. Here, we use the summary() function and the skim() function from the skimr package to get a quick overview of the data (see the First glimpse topic).
summary(data)
dyad id cond_dyad role
Min. : 1.00 Min. : 1.00 Length:4200 Min. :1.000
1st Qu.: 8.00 1st Qu.:15.75 Class :character 1st Qu.:1.000
Median :16.00 Median :30.50 Mode :character Median :2.000
Mean :15.55 Mean :30.50 Mean :1.503
3rd Qu.:23.00 3rd Qu.:45.25 3rd Qu.:2.000
Max. :30.00 Max. :60.00 Max. :2.000
NA's :25 NA's :28
obsno scheduled
Min. : 1.0 Min. :2018-02-02 08:59:47.00
1st Qu.:18.0 1st Qu.:2018-07-20 09:00:01.00
Median :35.5 Median :2018-09-13 11:00:13.50
Mean :35.5 Mean :2018-09-08 11:33:24.83
3rd Qu.:53.0 3rd Qu.:2018-10-24 02:59:52.50
Max. :70.0 Max. :2019-06-10 11:59:46.00
NA's :2
sent start
Min. :2018-02-02 08:59:51.00 Min. :2018-02-02 09:00:31.00
1st Qu.:2018-07-20 09:00:18.75 1st Qu.:2018-07-22 12:00:37.25
Median :2018-09-13 11:30:18.00 Median :2018-09-14 22:00:28.00
Mean :2018-09-08 11:54:09.94 Mean :2018-09-14 18:49:31.13
3rd Qu.:2018-10-23 17:00:04.00 3rd Qu.:2018-10-31 13:00:30.50
Max. :2019-06-10 11:59:54.00 Max. :2019-06-10 12:00:15.00
NA's :1254
end PA1 PA2
Min. :2018-02-02 09:03:07.00 Min. : 1.00 Min. : 1.00
1st Qu.:2018-07-22 12:01:49.25 1st Qu.: 4.00 1st Qu.: 3.00
Median :2018-09-14 22:02:02.00 Median : 18.00 Median : 19.00
Mean :2018-09-14 18:51:14.74 Mean : 23.09 Mean : 21.77
3rd Qu.:2018-10-31 13:02:31.00 3rd Qu.: 32.00 3rd Qu.: 33.00
Max. :2019-06-10 12:02:30.00 Max. :100.00 Max. :100.00
NA's :1254 NA's :1254 NA's :1254
PA3 NA1
Min. : 1.00 Min. : 1.00
1st Qu.: 3.00 1st Qu.: 1.00
Median : 16.00 Median : 11.00
Mean : 23.32 Mean : 21.36
3rd Qu.: 31.00 3rd Qu.: 31.00
Max. :100.00 Max. :100.00
NA's :1254 NA's :1254
Many issues can be detected by looking at the descriptive statistics. In our case, we can see that there are missing values in the ‘dyad’ and ‘role’ variables, which is unexpected as they are identification variables.
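As a complement to summary(), skim() gathers the missingness information in a single table, which makes unexpected NAs easier to spot. The small frame below is an illustrative stand-in for the full dataset:

```r
library(skimr)

# Illustrative stand-in for the full dataset
df <- data.frame(
  dyad = c(1, 1, NA, 2),
  role = c(1, 2, 2, NA),
  PA1  = c(10, 20, NA, 30)
)

# skim() reports, per variable, its type, n_missing, complete_rate,
# and distribution summaries in one table
skim(df)
```

A complete_rate below 1 for an identification variable such as dyad or role flags the same kind of issue we detected with summary().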
Check if Variables are Time-Invariant for Each Subject
As previously mentioned, time-invariant variables (e.g., subject id, dyad id) are not expected to vary within participants. Here, we check whether these variables are indeed consistent within participants. In other words, we check that a unique value is attributed to person-level or dyad-level variables (e.g., age).
We show four ways to check this at the participant level:
- The vars_consist() function from the esmtools package merges the unique values of one or more variables while grouping by another one. In our example, we investigate whether the dyad and role variables have a unique value per participant.
- We can check whether a person-level variable displays a unique value (i.e., length(unique(var)) == 1) within each group. Here, we check that the dyad and role variables have a unique value per id. When a test outcome is FALSE, there is more than one value of dyad/role, or an NA value, within the group.
- We can keep all the unique rows after selecting the variables of interest. Here, we keep the unique values of the combination of the id, dyad, cond_dyad, and role variables.
- Finally, we can count the number of unique combinations of the variables’ values. Here, we first group by the id, dyad, cond_dyad, and role variables and then count the number of rows using summarize(n = n()). In the results, we look for inconsistent rows along with their number of occurrences.
data %>%
  group_by(id) %>%
  summarize(consistant_dyad = length(unique(dyad)) == 1,
            consistant_role = length(unique(role)) == 1)
# A tibble: 60 × 3
id consistant_dyad consistant_role
<dbl> <lgl> <lgl>
1 1 TRUE TRUE
2 2 TRUE TRUE
3 3 TRUE TRUE
4 4 TRUE TRUE
5 5 FALSE FALSE
6 6 TRUE TRUE
7 7 TRUE TRUE
8 8 TRUE TRUE
9 9 TRUE TRUE
10 10 TRUE TRUE
# ℹ 50 more rows
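The third and fourth methods from the list above can be sketched with dplyr. The small frame df is illustrative (in the tutorial you would run this on data, including the cond_dyad variable):

```r
library(dplyr)

# Illustrative frame: participant 2 has a missing dyad value on one beep
df <- data.frame(
  id   = c(1, 1, 2, 2),
  dyad = c(1, 1, 1, NA),
  role = c(1, 1, 2, 2)
)

# Method 3: unique rows after selecting the identification variables;
# a clean dataset yields exactly one row per participant
distinct(select(df, id, dyad, role))

# Method 4: count each combination of values; an inconsistent participant
# appears in more than one row, with n giving the number of occurrences
df %>%
  group_by(id, dyad, role) %>%
  summarize(n = n(), .groups = "drop")
```

With clean data, both outputs contain one row per participant; participant 2 here appears twice because of the NA in dyad.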
Each method can be adapted to different nesting levels. Below, we verify the unique values of ‘id’ and ‘cond_dyad’ nested within the ‘dyad’ variable:
library(esmtools)
vars_consist(data, "dyad", c("id", "cond_dyad"))
dyad id cond_dyad
1 1 (1, 2) condB
2 2 (3, 4) condB
3 3 (5, 6) condB
4 NA (5, 19, 25, 43) (condB, condA)
5 4 (7, 8) condB
6 5 (9, 10) condA
7 6 (11, 12) condA
8 7 (13, 14) condB
9 28 (13, 55, 56) (condB, condA)
10 8 (15, 16) condB
11 9 (17, 18, 24) (condB, condA)
12 10 (19, 20) condA
13 11 (21, 22) condB
14 12 (23, 24) condA
15 13 (25, 26) condB
16 14 (27, 28) condB
17 15 (29, 30) condB
18 16 (31, 32) condA
19 17 (33, 34) condB
20 18 (35, 36) condA
21 19 (35, 37, 38) (condA, condB)
22 20 (39, 40) condA
23 21 (41, 42) condA
24 22 (43, 44) condA
25 23 (45, 46) condA
26 24 (47, 48) condA
27 25 (49, 50) condA
28 26 (51, 52) condB
29 27 (53, 54) condB
30 29 (57, 58) condA
31 30 (59, 60) condA
Investigating the issues
Every issue should be checked by hand first (by isolating and inspecting the rows) before solving it.
From above, we can see that there are missing values in the dyad variable. Let’s inspect this issue by selecting the rows using ‘is.na(data$dyad)’.
data[is.na(data$dyad), ]
dyad id cond_dyad role obsno scheduled sent
282 NA 5 condB 1 2 2018-03-04 09:59:49 2018-03-04 09:59:56
283 NA 5 condB 1 3 2018-03-04 11:00:07 2018-03-04 11:00:18
296 NA 5 condB 1 16 2018-03-07 08:59:59 2018-03-07 09:00:09
301 NA 5 condB 1 21 2018-03-08 08:59:54 2018-03-08 09:00:05
303 NA 5 condB 1 23 2018-03-08 11:00:10 2018-03-08 11:00:24
318 NA 5 condB 1 38 2018-03-11 11:00:09 2018-03-11 11:00:28
322 NA 5 condB 1 42 2018-03-12 09:59:59 2018-03-12 10:00:03
334 NA 5 condB 1 54 2018-03-14 12:00:13 2018-03-14 12:00:29
347 NA 5 condB 1 67 2018-03-17 09:59:59 2018-03-17 10:00:17
349 NA 5 condB 1 69 2018-03-17 11:59:58 2018-03-17 12:00:02
1316 NA 19 condA 1 56 2019-03-05 08:59:44 2019-03-05 09:00:02
1694 NA 25 condB 1 14 2018-07-20 11:00:13 2018-07-20 11:00:31
1709 NA 25 condB 1 29 2018-07-23 10:59:56 2018-07-23 10:59:58
1715 NA 25 condB 1 35 2018-07-24 12:00:07 2018-07-24 12:00:17
1726 NA 25 condB 1 46 2018-07-27 07:59:55 2018-07-27 08:00:03
1728 NA 25 condB 1 48 2018-07-27 10:00:05 2018-07-27 10:00:09
1734 NA 25 condB NA 54 2018-07-28 10:59:38 2018-07-28 10:59:45
1743 NA 25 condB 1 63 2018-07-30 09:59:44 2018-07-30 09:59:53
1750 NA 25 condB 1 70 2018-07-31 11:59:50 2018-07-31 11:59:56
2950 NA 43 condA 1 10 2018-09-06 11:59:42 2018-09-06 12:00:07
2970 NA 43 condA 1 30 2018-09-10 12:00:07 2018-09-10 12:00:12
2988 NA 43 condA 1 48 2018-09-14 10:00:00 2018-09-14 10:00:02
2990 NA 43 condA 1 50 2018-09-14 11:59:51 2018-09-14 11:59:53
2994 NA 43 condA 1 54 2018-09-15 10:59:57 2018-09-15 11:00:01
3002 NA 43 condA NA 62 2018-09-17 09:00:25 2018-09-17 09:00:26
start end PA1 PA2 PA3 NA1
282 <NA> <NA> NA NA NA NA
283 <NA> <NA> NA NA NA NA
296 <NA> <NA> NA NA NA NA
301 <NA> <NA> NA NA NA NA
303 2018-03-08 11:00:38 2018-03-08 11:02:30 4 64 27 17
318 <NA> <NA> NA NA NA NA
322 2018-03-12 10:00:19 2018-03-12 10:03:24 1 41 1 28
334 <NA> <NA> NA NA NA NA
347 <NA> <NA> NA NA NA NA
349 2018-03-17 12:00:26 2018-03-17 12:02:11 1 40 1 38
1316 2019-03-05 09:00:29 2019-03-05 09:01:26 36 17 16 13
1694 2018-07-20 11:00:59 2018-07-20 11:03:42 86 36 100 91
1709 2018-07-23 11:00:11 2018-07-23 11:02:13 100 17 100 100
1715 2018-07-24 12:00:50 2018-07-24 12:01:53 99 1 100 100
1726 <NA> <NA> NA NA NA NA
1728 2018-07-27 10:00:28 2018-07-27 10:01:08 7 1 1 1
1734 2018-07-28 10:59:55 2018-07-28 11:01:18 24 1 1 1
1743 2018-07-30 10:00:04 2018-07-30 10:02:54 66 25 81 57
1750 2018-07-31 12:00:07 2018-07-31 12:01:47 100 33 100 100
2950 2018-09-06 12:00:20 2018-09-06 12:01:10 2 16 37 1
2970 2018-09-10 12:00:38 2018-09-10 12:02:40 1 21 37 1
2988 2018-09-14 10:00:22 2018-09-14 10:00:53 1 20 34 1
2990 <NA> <NA> NA NA NA NA
2994 2018-09-15 11:00:09 2018-09-15 11:01:39 4 31 40 3
3002 2018-09-17 09:00:29 2018-09-17 09:02:35 1 23 35 1
Solving the issues
Now it’s time to solve the issues. For the issue displayed above, we propose two methods:
- By hand: we know the true value. For instance, participant 5 should always be in the 3rd dyad (as the vars_consist() output shows). After a quick check of the rows preceding and following the issue (especially to validate that the issue concerns the dyad number), we can easily solve it. When fixing the issue, avoid using row numbers; use conditional tests instead.
- General: in case we have identification variables that allow us to identify a unique participant and that do not have any issues, we can recreate the structure and apply it to the dataset based on the correct values. First, we recreate the original structure of the dataset. Then, we apply modifications using the reliable variable as a key. Be aware that this method involves large data modifications, so you have to be certain before applying it and, afterwards, check that no new issues have been introduced into the dataframe.
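The general method can be sketched as follows: build a lookup table from the rows without issues, then join it back using the reliable variable as a key. The frame and values below are illustrative:

```r
library(dplyr)

# Illustrative frame: id is reliable, dyad has a missing value
df <- data.frame(
  id   = c(5, 5, 6, 6),
  dyad = c(3, NA, 3, 3)
)

# Step 1: recreate the correct structure from the rows without issues,
# keeping one dyad value per id
lookup <- df %>%
  filter(!is.na(dyad)) %>%
  distinct(id, dyad)

# Step 2: apply it back to the dataset, using 'id' as the key
fixed <- df %>%
  select(-dyad) %>%
  left_join(lookup, by = "id")

# Check that the repair left no missing dyad values
stopifnot(!any(is.na(fixed$dyad)))
```

This only works when every participant has at least one row with a correct value; otherwise the join reintroduces NAs, which is why the final check matters.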
We first select and visualize the problematic rows.
pos = data$id==5 & is.na(data$dyad)
data[pos, c("dyad", "id", "role", "obsno")]
dyad id role obsno
282 NA 5 1 2
283 NA 5 1 3
296 NA 5 1 16
301 NA 5 1 21
303 NA 5 1 23
318 NA 5 1 38
322 NA 5 1 42
334 NA 5 1 54
347 NA 5 1 67
349 NA 5 1 69
After confirming that we have correctly identified the targeted rows, we assign the value of 3 to the ‘dyad’ variable for those specific rows (participant 5 belongs to dyad 3, as shown in the vars_consist() output).
data[pos, "dyad"] = 3
Importantly, you will need to check that the issues have been solved using the previously discussed functions (e.g., summary(), displaying a sample of rows). Additionally, make sure you have not introduced new issues in the process.
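A simple way to verify the fix is to re-run the consistency check and the missingness count on the repaired data. The frame below is an illustrative stand-in for the repaired dataset:

```r
library(dplyr)

# Illustrative stand-in for the repaired dataset
repaired <- data.frame(
  id   = rep(c(5, 6), each = 2),
  dyad = rep(3, 4)
)

# Re-run the consistency check: zero remaining rows means the issue is solved
problems <- repaired %>%
  group_by(id) %>%
  summarize(ok = length(unique(dyad)) == 1, .groups = "drop") %>%
  filter(!ok)

nrow(problems)             # should be 0
sum(is.na(repaired$dyad))  # should be 0
```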
References
Viechtbauer, W. (2021). Structuring, checking, and preparing the data. In The Open Handbook of Experience Sampling Methodology: A Step-by-Step Guide to Designing, Conducting, and Analyzing ESM Studies, pages 137-152. Center for Research on Experience Sampling and Ambulatory Methods, Leuven.