Check variable coherence

Packages: dplyr, skimr, stringr, esmtools


ESM datasets come with different types of variables (see Viechtbauer, 2021), which have implications for data preprocessing:

  • design variables (e.g., participant number, day/beep number, experimental condition, scheduled time, sent time): these variables are defined by the study design and its sampling scheme. Importantly, no missing data are expected. In particular, there are:
    • subject identifier variables (e.g., participant id, couple id, partners’ differentiating variables): these allow identifying a unique participant. In the case of dyadic data, the dyads can be non-distinguishable or distinguishable. In the latter case, your dataframe should contain a variable that differentiates the partners within each dyad (e.g., gender, role).
    • timestamp variables: these contain the timestamps related to each beep. Ideally, you have at least the scheduled time, the sent time, the start time, and the end time - see terminology - to later inspect participants’ response behaviors. They can also include the timestamp at which each question was answered (within each beep).
  • variables filled in by the participants:
    • time-varying variables (e.g., positive/negative affect): variables whose values change over time. Missing values may correspond to variables/beeps left unfilled by the participants.
    • time-invariant variables (e.g., depression score, aggregate score): variables that do not change over time and keep the same value across all time points, typically coming from baseline or follow-up questionnaires. Ideally, no missing data are expected.
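Because no missing data are expected in design variables, a quick per-variable NA count is a cheap first screen. Below is a minimal base R sketch on a toy data frame standing in for an ESM dataset (the column names and values are illustrative):

```r
# Toy stand-in for an ESM dataset (illustrative columns)
toy <- data.frame(
  id    = c(1, 1, 2, 2),     # design variable: no NAs expected
  obsno = c(1, 2, 1, NA),    # design variable: this NA signals a problem
  PA1   = c(10, NA, 25, 40)  # time-varying variable: NAs are unanswered beeps
)

# Count missing values per column; any NA in a design variable
# (here, id or obsno) deserves inspection
colSums(is.na(toy))
```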

A good start: descriptive statistics

A good start is to compute descriptive statistics (e.g., mean, range of values). This allows checking the minimum and maximum values, the number of missing values, the mean, etc. Here, we use the summary() function and the skim() function from the skimr package to get a quick overview of the data (see the First glimpse topic).

summary(data)
      dyad             id         cond_dyad              role      
 Min.   : 1.00   Min.   : 1.00   Length:4200        Min.   :1.000  
 1st Qu.: 8.00   1st Qu.:15.75   Class :character   1st Qu.:1.000  
 Median :16.00   Median :30.50   Mode  :character   Median :2.000  
 Mean   :15.55   Mean   :30.50                      Mean   :1.503  
 3rd Qu.:23.00   3rd Qu.:45.25                      3rd Qu.:2.000  
 Max.   :30.00   Max.   :60.00                      Max.   :2.000  
 NA's   :25                                         NA's   :28     
     obsno        scheduled                     
 Min.   : 1.0   Min.   :2018-02-02 08:59:47.00  
 1st Qu.:18.0   1st Qu.:2018-07-20 09:00:01.00  
 Median :35.5   Median :2018-09-13 11:00:13.50  
 Mean   :35.5   Mean   :2018-09-08 11:33:24.83  
 3rd Qu.:53.0   3rd Qu.:2018-10-24 02:59:52.50  
 Max.   :70.0   Max.   :2019-06-10 11:59:46.00  
                NA's   :2                       
      sent                            start                       
 Min.   :2018-02-02 08:59:51.00   Min.   :2018-02-02 09:00:31.00  
 1st Qu.:2018-07-20 09:00:18.75   1st Qu.:2018-07-22 12:00:37.25  
 Median :2018-09-13 11:30:18.00   Median :2018-09-14 22:00:28.00  
 Mean   :2018-09-08 11:54:09.94   Mean   :2018-09-14 18:49:31.13  
 3rd Qu.:2018-10-23 17:00:04.00   3rd Qu.:2018-10-31 13:00:30.50  
 Max.   :2019-06-10 11:59:54.00   Max.   :2019-06-10 12:00:15.00  
                                  NA's   :1254                    
      end                              PA1              PA2        
 Min.   :2018-02-02 09:03:07.00   Min.   :  1.00   Min.   :  1.00  
 1st Qu.:2018-07-22 12:01:49.25   1st Qu.:  4.00   1st Qu.:  3.00  
 Median :2018-09-14 22:02:02.00   Median : 18.00   Median : 19.00  
 Mean   :2018-09-14 18:51:14.74   Mean   : 23.09   Mean   : 21.77  
 3rd Qu.:2018-10-31 13:02:31.00   3rd Qu.: 32.00   3rd Qu.: 33.00  
 Max.   :2019-06-10 12:02:30.00   Max.   :100.00   Max.   :100.00  
 NA's   :1254                     NA's   :1254     NA's   :1254    
      PA3              NA1        
 Min.   :  1.00   Min.   :  1.00  
 1st Qu.:  3.00   1st Qu.:  1.00  
 Median : 16.00   Median : 11.00  
 Mean   : 23.32   Mean   : 21.36  
 3rd Qu.: 31.00   3rd Qu.: 31.00  
 Max.   :100.00   Max.   :100.00  
 NA's   :1254     NA's   :1254    

Many issues can be detected by looking at descriptive statistics. In our case, we can see that there are missing values in the ‘dyad’ and ‘role’ variables, which is unexpected since they are identification variables.

Check if variables are time-invariant for each subject

As previously mentioned, time-invariant variables (e.g., subject id, dyad id) are not expected to vary within participants. Here, we check whether these variables are consistent within each participant. In other words, we verify that a unique value is attributed to person-level or dyad-level variables (e.g., age).

We show 4 ways to check this at the participant level:

  1. The ‘vars_consist()’ function from the esmtools package: it merges the unique values of variables while grouping by another one. In our example, we investigate whether the dyad and role variables have a unique value per participant.
  2. We can check if a person-level variable displays a unique value (i.e., length(unique(var)) == 1) for each group. Here, we check that the dyad and role variables have a unique value per id. When the test outcome is FALSE, there is more than one value of dyad/role or an NA value within the group.
  3. We can keep all the unique rows after selecting the variables of interest. Here, we keep the unique combinations of the id, dyad, cond_dyad, and role variables.
  4. Finally, we can count the number of unique combinations of variables’ values. Here, we first group by the id, dyad, cond_dyad, and role variables and then count the number of rows using summarize(n = n()). In the results, we look for inconsistent rows along with their number of occurrences.

data %>%
    group_by(id) %>%
    summarize(consistent_dyad = length(unique(dyad)) == 1,
              consistent_role = length(unique(role)) == 1) 
# A tibble: 60 × 3
      id consistent_dyad consistent_role
   <dbl> <lgl>           <lgl>          
 1     1 TRUE            TRUE           
 2     2 TRUE            TRUE           
 3     3 TRUE            TRUE           
 4     4 TRUE            TRUE           
 5     5 FALSE           FALSE          
 6     6 TRUE            TRUE           
 7     7 TRUE            TRUE           
 8     8 TRUE            TRUE           
 9     9 TRUE            TRUE           
10    10 TRUE            TRUE           
# ℹ 50 more rows
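Methods 3 and 4 can be written with dplyr’s distinct() and a group-wise count. Here is a sketch on a small toy data frame in which id 2 is deliberately made inconsistent (it carries two different dyad values):

```r
library(dplyr)

# Toy data: id 2 appears with two different dyad values (an inconsistency)
toy <- data.frame(
  id        = c(1, 1, 2, 2),
  dyad      = c(1, 1, 1, 2),
  cond_dyad = c("condA", "condA", "condB", "condB"),
  role      = c(1, 1, 2, 2)
)

# Method 3: keep unique combinations; a consistent id yields exactly one row,
# so id 2 shows up twice in this output
distinct(toy, id, dyad, cond_dyad, role)

# Method 4: group and count the rows per combination; ids that appear in
# more than one row of this output are inconsistent
toy %>%
  group_by(id, dyad, cond_dyad, role) %>%
  summarize(n = n(), .groups = "drop")
```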

Each method can be adapted to different nesting levels. Below, we verify the unique values of ‘id’ and ‘cond_dyad’ nested within the ‘dyad’ variable:

library(esmtools)
vars_consist(data, "dyad", c("id", "cond_dyad"))
   dyad              id      cond_dyad
1     1          (1, 2)          condB
2     2          (3, 4)          condB
3     3          (5, 6)          condB
4    NA (5, 19, 25, 43) (condB, condA)
5     4          (7, 8)          condB
6     5         (9, 10)          condA
7     6        (11, 12)          condA
8     7        (13, 14)          condB
9    28    (13, 55, 56) (condB, condA)
10    8        (15, 16)          condB
11    9    (17, 18, 24) (condB, condA)
12   10        (19, 20)          condA
13   11        (21, 22)          condB
14   12        (23, 24)          condA
15   13        (25, 26)          condB
16   14        (27, 28)          condB
17   15        (29, 30)          condB
18   16        (31, 32)          condA
19   17        (33, 34)          condB
20   18        (35, 36)          condA
21   19    (35, 37, 38) (condA, condB)
22   20        (39, 40)          condA
23   21        (41, 42)          condA
24   22        (43, 44)          condA
25   23        (45, 46)          condA
26   24        (47, 48)          condA
27   25        (49, 50)          condA
28   26        (51, 52)          condB
29   27        (53, 54)          condB
30   29        (57, 58)          condA
31   30        (59, 60)          condA

Investigating the issues

Every issue should first be checked by hand (by isolating and inspecting the rows) before solving it.
From the output above, we can see that there are missing values in the dyad variable. Let’s inspect this issue by selecting the rows with ‘is.na(data$dyad)’.

data[is.na(data$dyad), ]
     dyad id cond_dyad role obsno           scheduled                sent
282    NA  5     condB    1     2 2018-03-04 09:59:49 2018-03-04 09:59:56
283    NA  5     condB    1     3 2018-03-04 11:00:07 2018-03-04 11:00:18
296    NA  5     condB    1    16 2018-03-07 08:59:59 2018-03-07 09:00:09
301    NA  5     condB    1    21 2018-03-08 08:59:54 2018-03-08 09:00:05
303    NA  5     condB    1    23 2018-03-08 11:00:10 2018-03-08 11:00:24
318    NA  5     condB    1    38 2018-03-11 11:00:09 2018-03-11 11:00:28
322    NA  5     condB    1    42 2018-03-12 09:59:59 2018-03-12 10:00:03
334    NA  5     condB    1    54 2018-03-14 12:00:13 2018-03-14 12:00:29
347    NA  5     condB    1    67 2018-03-17 09:59:59 2018-03-17 10:00:17
349    NA  5     condB    1    69 2018-03-17 11:59:58 2018-03-17 12:00:02
1316   NA 19     condA    1    56 2019-03-05 08:59:44 2019-03-05 09:00:02
1694   NA 25     condB    1    14 2018-07-20 11:00:13 2018-07-20 11:00:31
1709   NA 25     condB    1    29 2018-07-23 10:59:56 2018-07-23 10:59:58
1715   NA 25     condB    1    35 2018-07-24 12:00:07 2018-07-24 12:00:17
1726   NA 25     condB    1    46 2018-07-27 07:59:55 2018-07-27 08:00:03
1728   NA 25     condB    1    48 2018-07-27 10:00:05 2018-07-27 10:00:09
1734   NA 25     condB   NA    54 2018-07-28 10:59:38 2018-07-28 10:59:45
1743   NA 25     condB    1    63 2018-07-30 09:59:44 2018-07-30 09:59:53
1750   NA 25     condB    1    70 2018-07-31 11:59:50 2018-07-31 11:59:56
2950   NA 43     condA    1    10 2018-09-06 11:59:42 2018-09-06 12:00:07
2970   NA 43     condA    1    30 2018-09-10 12:00:07 2018-09-10 12:00:12
2988   NA 43     condA    1    48 2018-09-14 10:00:00 2018-09-14 10:00:02
2990   NA 43     condA    1    50 2018-09-14 11:59:51 2018-09-14 11:59:53
2994   NA 43     condA    1    54 2018-09-15 10:59:57 2018-09-15 11:00:01
3002   NA 43     condA   NA    62 2018-09-17 09:00:25 2018-09-17 09:00:26
                   start                 end PA1 PA2 PA3 NA1
282                 <NA>                <NA>  NA  NA  NA  NA
283                 <NA>                <NA>  NA  NA  NA  NA
296                 <NA>                <NA>  NA  NA  NA  NA
301                 <NA>                <NA>  NA  NA  NA  NA
303  2018-03-08 11:00:38 2018-03-08 11:02:30   4  64  27  17
318                 <NA>                <NA>  NA  NA  NA  NA
322  2018-03-12 10:00:19 2018-03-12 10:03:24   1  41   1  28
334                 <NA>                <NA>  NA  NA  NA  NA
347                 <NA>                <NA>  NA  NA  NA  NA
349  2018-03-17 12:00:26 2018-03-17 12:02:11   1  40   1  38
1316 2019-03-05 09:00:29 2019-03-05 09:01:26  36  17  16  13
1694 2018-07-20 11:00:59 2018-07-20 11:03:42  86  36 100  91
1709 2018-07-23 11:00:11 2018-07-23 11:02:13 100  17 100 100
1715 2018-07-24 12:00:50 2018-07-24 12:01:53  99   1 100 100
1726                <NA>                <NA>  NA  NA  NA  NA
1728 2018-07-27 10:00:28 2018-07-27 10:01:08   7   1   1   1
1734 2018-07-28 10:59:55 2018-07-28 11:01:18  24   1   1   1
1743 2018-07-30 10:00:04 2018-07-30 10:02:54  66  25  81  57
1750 2018-07-31 12:00:07 2018-07-31 12:01:47 100  33 100 100
2950 2018-09-06 12:00:20 2018-09-06 12:01:10   2  16  37   1
2970 2018-09-10 12:00:38 2018-09-10 12:02:40   1  21  37   1
2988 2018-09-14 10:00:22 2018-09-14 10:00:53   1  20  34   1
2990                <NA>                <NA>  NA  NA  NA  NA
2994 2018-09-15 11:00:09 2018-09-15 11:01:39   4  31  40   3
3002 2018-09-17 09:00:29 2018-09-17 09:02:35   1  23  35   1

Solving the issues

Now it’s time to solve the issues. For the issue displayed above, we propose 2 methods:

  1. By hand: we know the true value. For instance, participant 5 should always be in the 3rd dyad. After a quick check of the rows preceding and following the issue (especially to validate that the issue concerns the dyad number), we can easily solve it. When fixing the issue, avoid using row numbers; use conditional tests instead.
  2. General: in case we have identification variables that allow us to identify a unique participant and that do not have any issues, we can recreate the structure of the dataset and apply it based on the correct values. First, we recreate the original structure of the dataset. Then, we apply modifications using the reliable variable as a key. Be aware that this method involves large data modifications, so you have to be certain before applying it and, afterwards, check that no new issues have been introduced in the dataframe.
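A sketch of the general method with dplyr, assuming ‘id’ is the reliable key and that each id maps to exactly one non-missing dyad value (toy data; in practice, check the lookup table for duplicated ids before trusting it):

```r
library(dplyr)

# Toy data: id 5 has NA dyad values whose true value (3) appears in other rows
toy <- data.frame(
  id   = c(5, 5, 5, 6, 6),
  dyad = c(3, NA, NA, 3, 3)
)

# 1. Recreate the structure: one dyad value per id, taken from non-NA rows.
#    If any id appears twice in this lookup, resolve that inconsistency first.
lookup <- toy %>%
  filter(!is.na(dyad)) %>%
  distinct(id, dyad)

# 2. Apply it back using id as the key, only filling in the missing values
toy_fixed <- toy %>%
  left_join(lookup, by = "id", suffix = c("", "_correct")) %>%
  mutate(dyad = coalesce(dyad, dyad_correct)) %>%
  select(-dyad_correct)

toy_fixed
```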


We first select and visualize the problematic rows.

pos = data$id==5 & is.na(data$dyad)
data[pos,c("dyad", "id", "role", "obsno")] 
    dyad id role obsno
282   NA  5    1     2
283   NA  5    1     3
296   NA  5    1    16
301   NA  5    1    21
303   NA  5    1    23
318   NA  5    1    38
322   NA  5    1    42
334   NA  5    1    54
347   NA  5    1    67
349   NA  5    1    69

After confirming that we have correctly identified the targeted rows, we assign the value of 3 to the ‘dyad’ variable for those specific rows.

data[pos, "dyad"] = 3
Warning

Importantly, you will need to check that the issues have been solved using the previously discussed functions (e.g., summary(), displaying a sample of rows). Additionally, make sure you have not introduced new issues in the process.
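Such a re-check could look like the following base R sketch, run on a toy data frame standing in for the repaired data (illustrative values):

```r
# Toy "repaired" data: each id now carries a single, non-missing dyad value
toy <- data.frame(id = c(1, 1, 2, 2), dyad = c(1, 1, 2, 2))

# 1. The NA count for the repaired variable should now be zero
sum(is.na(toy$dyad))

# 2. Each id should again map to a single dyad value
tapply(toy$dyad, toy$id, function(x) length(unique(x)) == 1)
```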

References

Viechtbauer, W. (2021). Structuring, checking, and preparing the data. In The Open Handbook of Experience Sampling Methodology: A Step-by-Step Guide to Designing, Conducting, and Analyzing ESM Studies, pages 137-152. Center for Research on Experience Sampling and Ambulatory Methods, Leuven.