Coherence of timestamps and observation order
Packages: dplyr
Coherent timestamps and correctly ordered observations are crucial for the analysis of EMA data. This section focuses on checking two things:
- key timestamp variables: We examine the ‘scheduled’, ‘sent’, ‘start’, and ‘end’ timestamps (see the Terminology page for further details on the variables). These variables must be coherent both within individual observations and across the dataset. For instance, a ‘start’ timestamp should never be earlier than the ‘sent’ timestamp of the same observation.
- the ‘obsno’ variable: this variable is expected to follow the serial order of the observations, as given by the ‘sent’ timestamp variable. The ‘obsno’ variable might already be included in the imported dataset, or it could have been generated in Step 1 (see Create time variables).
To showcase how to check those aspects, we will use a subset of data in which only the id numbers, the timestamp variables, and the observation numbers are included.
df = data[, c("id", "scheduled", "sent", "start", "end", "obsno")]
id scheduled sent start
1 1 2018-10-17 08:00:08 2018-10-17 08:00:11 <NA>
2 1 2018-10-17 09:00:01 2018-10-17 09:00:22 <NA>
3 1 2018-10-17 09:59:56 2018-10-17 10:00:08 <NA>
4 1 2018-10-17 10:59:48 2018-10-17 10:59:52 2018-10-17 11:00:12
5 1 2018-10-17 12:00:12 2018-10-17 12:00:15 <NA>
6 1 2018-10-18 07:59:47 2018-10-18 08:00:08 <NA>
end obsno
1 <NA> 1
2 <NA> 2
3 <NA> 3
4 2018-10-17 11:03:01 4
5 <NA> 5
6 <NA> 6
Coherence of timestamps variables
We often encounter scenarios where timestamp variables display inconsistencies in various forms. To address these issues, we will conduct a series of logical tests on our dataframe to identify potential discrepancies. Following these tests, we’ll filter the data to isolate and emphasize the problematic rows. In this process, we aim to scrutinize two primary types of inconsistency.
Coherence within observations
We check the chronological sequence of the timestamps within a single observation. Specifically, we ensure that the timestamp values follow a logical timeline: ‘scheduled’ should precede ‘sent’, which should precede ‘start’, which should precede ‘end’ (i.e., ‘scheduled’ < ‘sent’ < ‘start’ < ‘end’). Hence, we test whether each timestamp is later than the one that should follow it (e.g., df$scheduled > df$sent). When TRUE is returned, the two timestamps are not in the expected order. For instance, below, we can see that four observations (rows 1537, 2062, 2338, and 3800) have a ‘sent’ timestamp that is earlier than the ‘scheduled’ timestamp. The rows displayed as NA are not real observations: they appear when a test evaluates to NA (e.g., a missing ‘scheduled’ value paired with a non-missing ‘sent’ value), because R returns an all-NA row for an NA index.
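# TRUE when 'scheduled' is later than 'sent'
sent_after_sched = df$scheduled > df$sent & !is.na(df$sent)
# TRUE when 'sent' is later than 'start'
start_after_sent = df$sent > df$start & !is.na(df$start)
# TRUE when 'start' is later than 'end'
end_after_end = df$start > df$end & !is.na(df$end)
# Rows with at least one ordering issue
df[sent_after_sched | start_after_sent | end_after_end, ]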
id scheduled sent start
1537 22 2019-02-24 09:59:49 2019-02-24 08:59:51 <NA>
2062 30 2018-08-15 08:59:54 2018-08-15 08:00:17 2018-08-15 09:00:17
2338 34 2018-11-15 10:59:55 2018-11-15 10:30:00 2018-11-15 11:00:15
NA NA <NA> <NA> <NA>
3800 55 2018-10-30 12:59:48 2018-10-30 12:00:02 2018-10-30 13:00:02
NA.1 NA <NA> <NA> <NA>
end obsno
1537 <NA> 67
2062 2018-08-15 09:01:18 32
2338 2018-11-15 11:01:27 28
NA <NA> NA
3800 2018-10-30 13:01:19 20
NA.1 <NA> NA
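The effect of missing values on this kind of filter can be seen on a toy example. The sketch below uses a small made-up data frame (`toy`, not part of the tutorial dataset) and wraps the combined condition in `which()`, which keeps only the rows where the test is TRUE and drops those where it is NA. This is one possible way to avoid the all-NA rows shown in the output above; it assumes the timestamps are stored as POSIXct.

```r
# Hypothetical toy data: three observations from one participant
toy <- data.frame(
  id = c(1, 1, 1),
  scheduled = as.POSIXct(c("2018-10-17 08:00:00", NA, "2018-10-17 10:00:00"), tz = "UTC"),
  sent = as.POSIXct(c("2018-10-17 08:00:05", "2018-10-17 09:00:05", "2018-10-17 09:30:00"), tz = "UTC"),
  start = as.POSIXct(c(NA, "2018-10-17 09:00:30", "2018-10-17 10:00:20"), tz = "UTC"),
  end = as.POSIXct(c(NA, "2018-10-17 09:02:00", "2018-10-17 10:02:00"), tz = "UTC")
)

# Same three tests as in the tutorial code
sent_after_sched <- toy$scheduled > toy$sent & !is.na(toy$sent)
start_after_sent <- toy$sent > toy$start & !is.na(toy$start)
end_after_end <- toy$start > toy$end & !is.na(toy$end)
issue <- sent_after_sched | start_after_sent | end_after_end

# Indexing with the raw logical vector keeps an all-NA row for the NA test
flagged_raw <- toy[issue, ]

# which() returns only the positions where the test is TRUE
flagged <- toy[which(issue), ]
```

Here the second row has a missing ‘scheduled’ value, so its test is NA: `flagged_raw` contains an extra all-NA row, whereas `flagged` contains only the third row, whose ‘sent’ timestamp is earlier than its ‘scheduled’ timestamp.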
Coherence between observations
We also check that the timestamps are chronologically coherent across successive observations. Specifically, we verify that the timestamps at a given beep (time t) precede those at the subsequent beep (time t + 1). Note that this analysis requires the dataset to be ordered according to one of the timestamp variables (here, ‘sent’). Additionally, we must use the ‘group_by()’ function so that the lagged values created with ‘lag()’ are computed within each participant. For instance, below, we can see that three observations have a ‘scheduled’ timestamp that is earlier than the ‘scheduled’ timestamp of the previous observation, which is not expected.
library(dplyr)
df %>%
    arrange(id, sent) %>%
    group_by(id) %>%
    mutate(sched_lag_issue = lag(scheduled) > scheduled,
           sent_lag_issue = lag(sent) > sent,
           start_lag_issue = lag(start) > start,
           end_lag_issue = lag(end) > end) %>%
    filter(sched_lag_issue | sent_lag_issue | start_lag_issue | end_lag_issue)
id scheduled sent start
1 22 2019-02-24 08:59:55 2019-02-24 09:00:02 2019-02-24 09:00:19
2 30 2018-08-15 07:59:48 2018-08-15 08:00:25 2018-08-15 08:00:36
3 55 2018-10-30 11:59:53 2018-10-30 12:00:14 2018-10-30 12:00:32
end obsno sched_lag_issue sent_lag_issue start_lag_issue
1 2019-02-24 09:02:34 66 TRUE FALSE NA
2 2018-08-15 08:02:02 31 TRUE FALSE TRUE
3 2018-10-30 12:01:18 19 TRUE FALSE TRUE
end_lag_issue
1 NA
2 TRUE
3 TRUE
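For readers who prefer not to depend on dplyr, the same between-observation check can be sketched in base R. The example below uses a made-up data frame (`toy`) and a hypothetical helper, `lag_issue()`, which compares each value with the previous value of the same participant via `ave()`. Only ‘scheduled’ and ‘sent’ are checked, to keep the sketch short.

```r
# Hypothetical toy data: two participants, already in 'sent' order
toy <- data.frame(
  id = c(1, 1, 1, 2, 2),
  scheduled = as.POSIXct(c("2018-10-17 09:00:00", "2018-10-17 08:00:00",
                           "2018-10-17 10:00:00", "2018-10-17 08:00:00",
                           "2018-10-17 09:00:00"), tz = "UTC"),
  sent = as.POSIXct(c("2018-10-17 08:00:10", "2018-10-17 08:00:20",
                      "2018-10-17 10:00:10", "2018-10-17 08:00:10",
                      "2018-10-17 09:00:10"), tz = "UTC")
)

# Order by participant and 'sent' before lagging
toy <- toy[order(toy$id, toy$sent), ]

# Hypothetical helper: TRUE when the previous value (same id) is later
# than the current one; NA for the first observation of each participant
lag_issue <- function(x, id) {
  x_num <- as.numeric(x)
  prev <- ave(x_num, id, FUN = function(v) c(NA, v[-length(v)]))
  prev > x_num
}

toy$sched_lag_issue <- lag_issue(toy$scheduled, toy$id)
toy$sent_lag_issue <- lag_issue(toy$sent, toy$id)

# which() drops the NA tests produced by the first row of each participant
flagged <- toy[which(toy$sched_lag_issue | toy$sent_lag_issue), ]
```

In this toy data, only the second observation of participant 1 is flagged, because its ‘scheduled’ timestamp is earlier than that of the preceding observation.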
Check observation order
The ‘obsno’ variable may already exist in the dataset, or you may have computed it yourself (see Create time variables). Regardless of its origin, it’s crucial to verify that ‘obsno’ is consistent with the timestamp variables. We suggest two methods.
Recreate and compare ‘obsno’ variables
In this approach, we create a new ‘obsno’ variable (called ‘obsno_test’) reflecting the order of the ‘sent’ timestamps. Then, we compare it with the existing ‘obsno’ variable to identify discrepancies. Specifically, we test whether the ‘obsno’ values differ from the ‘obsno_test’ values. The output below lists the rows where the two values differ. For participants 22, 30, and 55, the flagged rows come in pairs of consecutive observations whose ‘obsno’ values appear inverted relative to the order of the ‘sent’ timestamps.
df = df[order(df$id, df$sent),]
obsno_test = ave(seq_along(df$id), df$id, FUN = seq_along)
df[which(obsno_test != df$obsno),]
id scheduled sent start
67 1 2018-10-30 10:00:18 2018-10-30 10:00:25 2018-10-30 10:00:35
1537 22 2019-02-24 09:59:49 2019-02-24 08:59:51 <NA>
1536 22 2019-02-24 08:59:55 2019-02-24 09:00:02 2019-02-24 09:00:19
2062 30 2018-08-15 08:59:54 2018-08-15 08:00:17 2018-08-15 09:00:17
2061 30 2018-08-15 07:59:48 2018-08-15 08:00:25 2018-08-15 08:00:36
3800 55 2018-10-30 12:59:48 2018-10-30 12:00:02 2018-10-30 13:00:02
3799 55 2018-10-30 11:59:53 2018-10-30 12:00:14 2018-10-30 12:00:32
end obsno
67 2018-10-30 10:03:19 68
1537 <NA> 67
1536 2019-02-24 09:02:34 66
2062 2018-08-15 09:01:18 32
2061 2018-08-15 08:02:02 31
3800 2018-10-30 13:01:19 20
3799 2018-10-30 12:01:18 19
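To see which value was expected on each flagged row, it can help to keep the recomputed numbering next to the original one. The following sketch does this on a made-up data frame (`toy`); storing ‘obsno_test’ as a column is an illustrative choice, not a step of the tutorial.

```r
# Hypothetical toy data: for participant 1, the second and third
# observations carry 'obsno' values that do not follow the 'sent' order
toy <- data.frame(
  id = c(1, 1, 1, 2, 2),
  sent = as.POSIXct(c("2018-10-17 08:00:10", "2018-10-17 10:00:10",
                      "2018-10-17 09:00:10", "2018-10-17 08:00:10",
                      "2018-10-17 09:00:10"), tz = "UTC"),
  obsno = c(1, 2, 3, 1, 2)
)

# Sort by participant and 'sent', then number the rows within participant
toy <- toy[order(toy$id, toy$sent), ]
toy$obsno_test <- ave(seq_along(toy$id), toy$id, FUN = seq_along)

# Rows where the stored and recomputed numbers disagree
mismatch <- toy[which(toy$obsno_test != toy$obsno), ]
```

Printing `mismatch` then shows ‘obsno’ and ‘obsno_test’ side by side, which makes an inversion (here 3 and 2 against the expected 2 and 3) easy to spot.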
Identifying gaps in the ‘obsno’ sequence
We can compute the difference between each ‘obsno’ value and the ‘obsno’ value of the previous row (its lagged counterpart) within the same participant. We expect this difference to be exactly 1; any deviation (either greater or less than 1) indicates a gap, a duplicate, or an inversion in the numbering sequence. Note that, with this approach, a single discrepancy can also affect adjacent rows. Therefore, this method may flag more rows as problematic than the previous method. In the output below, each inconsistency identified with the previous method is also detected, but neighboring rows are reported as well.
# Sorting the dataframe by 'id' and 'sent'
df = df[order(df$id, df$sent), ]

# Splitting the dataframe by 'id'
df_split = split(df, df$id)

# Function to apply lag calculations
apply_lag_calculations <- function(data) {
    data$obsno_lag <- c(NA, data$obsno[-length(data$obsno)])
    data$obsno_dif <- data$obsno - data$obsno_lag
    return(data)
}

# Applying the function to each group and recombining
df_lagcal = do.call("rbind", lapply(df_split, apply_lag_calculations))

# Filtering the rows where difference is not 1
df_lagcal[df_lagcal$obsno_dif != 1 & !is.na(df_lagcal$obsno_dif), ]
id scheduled sent start
1.67 1 2018-10-30 10:00:18 2018-10-30 10:00:25 2018-10-30 10:00:35
1.68 1 2018-10-30 10:59:50 2018-10-30 11:00:21 2018-10-30 11:01:06
22.1537 22 2019-02-24 09:59:49 2019-02-24 08:59:51 <NA>
22.1536 22 2019-02-24 08:59:55 2019-02-24 09:00:02 2019-02-24 09:00:19
22.1538 22 2019-02-24 11:00:02 2019-02-24 11:00:24 <NA>
30.2062 30 2018-08-15 08:59:54 2018-08-15 08:00:17 2018-08-15 09:00:17
30.2061 30 2018-08-15 07:59:48 2018-08-15 08:00:25 2018-08-15 08:00:36
30.2063 30 2018-08-15 10:00:03 2018-08-15 10:00:08 2018-08-15 10:00:16
55.3800 55 2018-10-30 12:59:48 2018-10-30 12:00:02 2018-10-30 13:00:02
55.3799 55 2018-10-30 11:59:53 2018-10-30 12:00:14 2018-10-30 12:00:32
55.3801 55 2018-10-31 09:00:04 2018-10-31 09:00:23 2018-10-31 09:00:39
end obsno obsno_lag obsno_dif
1.67 2018-10-30 10:03:19 68 66 2
1.68 2018-10-30 11:01:31 68 68 0
22.1537 <NA> 67 65 2
22.1536 2019-02-24 09:02:34 66 67 -1
22.1538 <NA> 68 66 2
30.2062 2018-08-15 09:01:18 32 30 2
30.2061 2018-08-15 08:02:02 31 32 -1
30.2063 2018-08-15 10:01:41 33 31 2
55.3800 2018-10-30 13:01:19 20 18 2
55.3799 2018-10-30 12:01:18 19 20 -1
55.3801 2018-10-31 09:01:35 21 19 2
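The sign and size of ‘obsno_dif’ give a hint about the kind of problem. As a sketch on made-up data (`toy`), the lagged difference can also be computed in one step with `ave()` and `diff()`, and then labelled. The labels and the `obsno_problem` column are hypothetical names chosen for this illustration.

```r
# Hypothetical toy data, assumed to be already sorted by 'id' and 'sent'
toy <- data.frame(
  id = c(1, 1, 1, 1, 2, 2, 2),
  obsno = c(1, 3, 3, 4, 1, 3, 2)
)

# Difference with the previous 'obsno' of the same participant
# (NA for the first observation of each participant)
toy$obsno_dif <- ave(toy$obsno, toy$id, FUN = function(v) c(NA, diff(v)))

# Label each row according to its difference
toy$obsno_problem <- ifelse(
  is.na(toy$obsno_dif) | toy$obsno_dif == 1, "ok",
  ifelse(toy$obsno_dif == 0, "duplicate",
         ifelse(toy$obsno_dif < 0, "inversion", "gap"))
)

# Rows that deviate from the expected difference of 1
flagged <- toy[toy$obsno_problem != "ok", ]
```

A difference of 0 points to a repeated number, a negative difference to two observations numbered in the wrong order, and a difference above 1 to a skipped number. As in the tutorial output, a row following an inversion can itself be labelled as a gap.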