Coherence of timestamps and observation order

Packages: dplyr


Ensuring the coherence of the timestamps and the order of the observations is crucial for the analysis of EMA data. We will focus on this critical task, with two specific objectives: examining

  • the key timestamp variables: We delve into the ‘scheduled’, ‘sent’, ‘start’, and ‘end’ timestamps (see the Terminology page for further details on the variables). It’s essential that these variables exhibit coherence both within individual observations and across the dataset. For instance, a ‘start’ timestamp should never precede the ‘sent’ timestamp within an observation.
  • the ‘obsno’ variable: this variable is expected to follow the serial order of the observations, as recorded by the ‘sent’ timestamp variable. The ‘obsno’ variable might already be included in the imported dataset, or it could have been generated in Step 1 (see Create time variables).

To showcase how to check those aspects, we will use a subset of data in which only the id numbers, the timestamp variables, and the observation numbers are included.

df = data[,c("id", "scheduled", "sent", "start", "end", "obsno")]
head(df)
  id           scheduled                sent               start
1  1 2018-10-17 08:00:08 2018-10-17 08:00:11                <NA>
2  1 2018-10-17 09:00:01 2018-10-17 09:00:22                <NA>
3  1 2018-10-17 09:59:56 2018-10-17 10:00:08                <NA>
4  1 2018-10-17 10:59:48 2018-10-17 10:59:52 2018-10-17 11:00:12
5  1 2018-10-17 12:00:12 2018-10-17 12:00:15                <NA>
6  1 2018-10-18 07:59:47 2018-10-18 08:00:08                <NA>
                  end obsno
1                <NA>     1
2                <NA>     2
3                <NA>     3
4 2018-10-17 11:03:01     4
5                <NA>     5
6                <NA>     6

Coherence of timestamp variables

We often encounter scenarios where timestamp variables display inconsistencies in various forms. To address these issues, we will conduct a series of logical tests on our dataframe to identify potential discrepancies. Following these tests, we’ll filter the data to isolate and emphasize the problematic rows. In this process, we aim to scrutinize two primary types of inconsistency.

Coherence within observations

We check the chronological sequence of the timestamps within a single observation. Specifically, we ensure that the order of timestamp values follows a logical timeline: ‘scheduled’ should precede ‘sent’, which in turn should be followed by ‘start’, and finally ‘end’ (i.e., ‘scheduled’ < ‘sent’ < ‘start’ < ‘end’). Hence, we specifically test whether each timestamp value is later than the one that should follow it (e.g., df$scheduled > df$sent). When TRUE is returned, it indicates that the timestamps are not in the expected order. For instance, below, we can see that four observations have a ‘sent’ timestamp that is earlier than the ‘scheduled’ timestamp.

# TRUE indicates a timestamp that is later than the one that should follow it
sent_after_sched = df$scheduled > df$sent & !is.na(df$sent)
start_after_sent = df$sent > df$start & !is.na(df$start)
end_after_start = df$start > df$end & !is.na(df$end)

df[sent_after_sched | start_after_sent | end_after_start, ]
     id           scheduled                sent               start
1537 22 2019-02-24 09:59:49 2019-02-24 08:59:51                <NA>
2062 30 2018-08-15 08:59:54 2018-08-15 08:00:17 2018-08-15 09:00:17
2338 34 2018-11-15 10:59:55 2018-11-15 10:30:00 2018-11-15 11:00:15
NA   NA                <NA>                <NA>                <NA>
3800 55 2018-10-30 12:59:48 2018-10-30 12:00:02 2018-10-30 13:00:02
NA.1 NA                <NA>                <NA>                <NA>
                     end obsno
1537                <NA>    67
2062 2018-08-15 09:01:18    32
2338 2018-11-15 11:01:27    28
NA                  <NA>    NA
3800 2018-10-30 13:01:19    20
NA.1                <NA>    NA
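Note that the all-NA rows in the output above arise because a comparison involving a missing ‘scheduled’ value returns NA, and indexing a data frame with a logical vector that contains NA produces a row of NAs. The minimal sketch below, using a toy data frame (the name ‘df_demo’ is illustrative), shows that wrapping the condition in which() drops these spurious rows, since which() does not count NA as TRUE:

```r
# Toy data frame: row 2 has 'sent' before 'scheduled'; row 3 has a missing 'scheduled'
df_demo = data.frame(
  scheduled = as.POSIXct(c("2018-10-17 08:00:08", "2018-10-17 09:59:56", NA)),
  sent      = as.POSIXct(c("2018-10-17 08:00:11", "2018-10-17 09:00:08",
                           "2018-10-17 10:00:08"))
)
sent_after_sched = df_demo$scheduled > df_demo$sent & !is.na(df_demo$sent)

nrow(df_demo[sent_after_sched, ])         # logical indexing keeps an all-NA row
nrow(df_demo[which(sent_after_sched), ])  # which() drops the NA comparison
```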

Coherence between observations

Our evaluation extends to ensuring that the timestamps are chronologically coherent across successive observations. We specifically verify that the timestamps at a given beep (time t) precede those at the subsequent beep (time t + 1). Note that this analysis requires the dataset to be ordered according to one of the timestamp variables. Additionally, we must use the ‘group_by()’ function so that the lagged variables are computed within each participant. For instance, below, we can see that three observations have a ‘scheduled’ timestamp that occurs before the ‘scheduled’ timestamp of the previous observation, which is not expected.

df %>%
    arrange(id, sent) %>%
    group_by(id) %>%
    mutate(sched_lag_issue = lag(scheduled) > scheduled,
           sent_lag_issue = lag(sent) > sent,
           start_lag_issue = lag(start) > start,
           end_lag_issue = lag(end) > end) %>%
    filter(sched_lag_issue | sent_lag_issue | start_lag_issue | end_lag_issue)
  id           scheduled                sent               start
1 22 2019-02-24 08:59:55 2019-02-24 09:00:02 2019-02-24 09:00:19
2 30 2018-08-15 07:59:48 2018-08-15 08:00:25 2018-08-15 08:00:36
3 55 2018-10-30 11:59:53 2018-10-30 12:00:14 2018-10-30 12:00:32
                  end obsno sched_lag_issue sent_lag_issue start_lag_issue
1 2019-02-24 09:02:34    66            TRUE          FALSE              NA
2 2018-08-15 08:02:02    31            TRUE          FALSE            TRUE
3 2018-10-30 12:01:18    19            TRUE          FALSE            TRUE
  end_lag_issue
1            NA
2          TRUE
3          TRUE
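The four lagged comparisons can also be written more compactly with dplyr’s if_any() (available from dplyr 1.0.4 onwards), which flags any row in which at least one timestamp precedes its lagged counterpart within a participant. A sketch using a toy data frame (the names ‘df_demo’ and ‘flagged’ are illustrative):

```r
library(dplyr)

# Toy data: participant 1's second beep (by 'sent' order) was scheduled
# before the first one, which should be flagged
df_demo = data.frame(
  id        = c(1, 1, 2),
  scheduled = as.POSIXct(c("2019-02-24 09:00:00", "2019-02-24 08:30:00",
                           "2019-02-24 09:00:00")),
  sent      = as.POSIXct(c("2019-02-24 09:00:05", "2019-02-24 09:05:00",
                           "2019-02-24 09:00:10"))
)

flagged = df_demo %>%
    arrange(id, sent) %>%
    group_by(id) %>%
    filter(if_any(c(scheduled, sent), ~ lag(.x) > .x)) %>%
    ungroup()
flagged
```

Note that filter() silently drops rows where the condition evaluates to NA (the first beep of each participant), so no explicit !is.na() check is needed here.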

Check observation order

The ‘obsno’ variable may already exist in the imported dataset, or you may have computed it yourself (see Create time variables). Regardless of its origin, it’s crucial to verify that ‘obsno’ aligns correctly with the timestamp variables. We suggest two methods.

Recreate and compare ‘obsno’ variables

In this approach, we create a new observation-number variable (called ‘obsno_test’) reflecting the order of the ‘sent’ timestamps. Then, we compare it with the existing ‘obsno’ variable to identify discrepancies. Specifically, we test whether the ‘obsno’ values differ from the ‘obsno_test’ values. The R code below shows several rows with inconsistent ‘obsno’ values; in most cases, the value appears to be swapped with that of an adjacent observation.

df = df[order(df$id, df$sent),]
obsno_test = ave(seq_along(df$id), df$id, FUN = seq_along)
df[which(obsno_test != df$obsno),]
     id           scheduled                sent               start
67    1 2018-10-30 10:00:18 2018-10-30 10:00:25 2018-10-30 10:00:35
1537 22 2019-02-24 09:59:49 2019-02-24 08:59:51                <NA>
1536 22 2019-02-24 08:59:55 2019-02-24 09:00:02 2019-02-24 09:00:19
2062 30 2018-08-15 08:59:54 2018-08-15 08:00:17 2018-08-15 09:00:17
2061 30 2018-08-15 07:59:48 2018-08-15 08:00:25 2018-08-15 08:00:36
3800 55 2018-10-30 12:59:48 2018-10-30 12:00:02 2018-10-30 13:00:02
3799 55 2018-10-30 11:59:53 2018-10-30 12:00:14 2018-10-30 12:00:32
                     end obsno
67   2018-10-30 10:03:19    68
1537                <NA>    67
1536 2019-02-24 09:02:34    66
2062 2018-08-15 09:01:18    32
2061 2018-08-15 08:02:02    31
3800 2018-10-30 13:01:19    20
3799 2018-10-30 12:01:18    19
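The same ‘obsno_test’ variable can also be built with dplyr’s row_number(), which some readers may find clearer than ave(). A sketch using a toy data frame (the names ‘df_demo’ and ‘mismatches’ are illustrative):

```r
library(dplyr)

# Toy data: the second and third beeps of participant 1 have swapped 'obsno' values
df_demo = data.frame(
  id    = c(1, 1, 1),
  sent  = as.POSIXct(c("2018-10-17 08:00:11", "2018-10-17 09:00:22",
                       "2018-10-17 10:00:08")),
  obsno = c(1, 3, 2)
)

mismatches = df_demo %>%
    arrange(id, sent) %>%
    group_by(id) %>%
    mutate(obsno_test = row_number()) %>%  # serial position within participant
    filter(obsno != obsno_test) %>%
    ungroup()
mismatches
```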

Identifying gaps in the ‘obsno’ sequence

We can compute the difference between each ‘obsno’ value and its lagged counterpart within each participant. We expect this difference to be exactly 1; any deviation (either greater or less than 1) indicates a gap or an overlap in the numbering sequence. It’s important to note that, with this approach, a single discrepancy can also affect adjacent rows, so this method may flag more rows as problematic than the previous one. In the analysis below, the inconsistencies identified with the previous method are detected again, but the rows adjacent to each of them are now flagged as well.

# Sorting the dataframe by 'id' and 'sent'
df = df[order(df$id, df$sent), ]

# Splitting the dataframe by 'id'
df_split = split(df, df$id)

# Function to apply lag calculations
apply_lag_calculations <- function(data) {
    data$obsno_lag <- c(NA, data$obsno[-length(data$obsno)])
    data$obsno_dif <- data$obsno - data$obsno_lag
    return(data)
}

# Applying the function to each group and recombining
df_lagcal = do.call("rbind", lapply(df_split, apply_lag_calculations))

# Filtering the rows where difference is not 1
df_lagcal[df_lagcal$obsno_dif != 1 & !is.na(df_lagcal$obsno_dif), ]
        id           scheduled                sent               start
1.67     1 2018-10-30 10:00:18 2018-10-30 10:00:25 2018-10-30 10:00:35
1.68     1 2018-10-30 10:59:50 2018-10-30 11:00:21 2018-10-30 11:01:06
22.1537 22 2019-02-24 09:59:49 2019-02-24 08:59:51                <NA>
22.1536 22 2019-02-24 08:59:55 2019-02-24 09:00:02 2019-02-24 09:00:19
22.1538 22 2019-02-24 11:00:02 2019-02-24 11:00:24                <NA>
30.2062 30 2018-08-15 08:59:54 2018-08-15 08:00:17 2018-08-15 09:00:17
30.2061 30 2018-08-15 07:59:48 2018-08-15 08:00:25 2018-08-15 08:00:36
30.2063 30 2018-08-15 10:00:03 2018-08-15 10:00:08 2018-08-15 10:00:16
55.3800 55 2018-10-30 12:59:48 2018-10-30 12:00:02 2018-10-30 13:00:02
55.3799 55 2018-10-30 11:59:53 2018-10-30 12:00:14 2018-10-30 12:00:32
55.3801 55 2018-10-31 09:00:04 2018-10-31 09:00:23 2018-10-31 09:00:39
                        end obsno obsno_lag obsno_dif
1.67    2018-10-30 10:03:19    68        66         2
1.68    2018-10-30 11:01:31    68        68         0
22.1537                <NA>    67        65         2
22.1536 2019-02-24 09:02:34    66        67        -1
22.1538                <NA>    68        66         2
30.2062 2018-08-15 09:01:18    32        30         2
30.2061 2018-08-15 08:02:02    31        32        -1
30.2063 2018-08-15 10:01:41    33        31         2
55.3800 2018-10-30 13:01:19    20        18         2
55.3799 2018-10-30 12:01:18    19        20        -1
55.3801 2018-10-31 09:01:35    21        19         2
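The split/lapply/rbind pattern above can also be condensed with dplyr, since lag() operates within participants after group_by(). A sketch using a toy data frame (the names ‘df_demo’ and ‘gaps’ are illustrative):

```r
library(dplyr)

# Toy data: participant 1 is missing obsno 3, creating a gap of 2
df_demo = data.frame(
  id    = c(1, 1, 1, 1),
  obsno = c(1, 2, 4, 5)
)

gaps = df_demo %>%
    group_by(id) %>%
    mutate(obsno_dif = obsno - lag(obsno)) %>%  # expected to be exactly 1
    filter(obsno_dif != 1) %>%
    ungroup()
gaps
```

Here filter() silently drops the NA difference produced by each participant’s first row, so the explicit !is.na() check used in the base R version is not needed.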