Delay to send
Packages: dplyr, ggplot2
Beeps are usually sent to participants by a server or an application. The time interval between the ‘scheduled’ time and the ‘sent’ time may represent the time the server took to send the beep to the participant, or the time needed to display the survey on the phone. In any case, we expect this time interval to be:
- short: a few seconds.
- positive: in other words, the ‘scheduled’ timestamps must be before the ‘sent’ timestamps.
To check these assumptions, we can compute the time interval (in seconds) and then plot its distribution.
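library(dplyr)

# Compute the time interval; difftime() with units = "secs" guarantees seconds,
# whereas a plain subtraction lets R pick the unit automatically
data = data %>%
  mutate(delay_sent_sec = as.numeric(difftime(sent, scheduled, units = "secs")))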
# Plot the distribution of the time interval
data %>%
  ggplot(aes(x = delay_sent_sec)) +
  geom_histogram(bins = 100)
Above, we can see some unexpected values: negative delays and delays longer than a few seconds. Such values may indicate a problem in the data collection process.
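A numerical summary can complement the histogram. The following is a minimal sketch on a small hypothetical data frame (`toy`, with illustrative values only), assuming a `delay_sent_sec` column in seconds; the 30-second cut-off is an arbitrary choice and should be adapted to your study.

```r
library(dplyr)

# Hypothetical delays in seconds (illustrative values, not real data)
toy <- tibble(delay_sent_sec = c(2, 3, 1, 60, -9.37, 4, 33, 2))

# Count negative and long delays; the 30-second threshold is an assumption
toy_summary <- toy %>%
  summarise(
    n_total    = n(),
    n_negative = sum(delay_sent_sec < 0, na.rm = TRUE),
    n_long     = sum(delay_sent_sec >= 30, na.rm = TRUE),
    median_sec = median(delay_sent_sec, na.rm = TRUE)
  )

toy_summary
```

Counts like these make it easier to judge whether unexpected delays are isolated cases or a widespread pattern.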
Nevertheless, it is important to investigate the problematic observations further to understand the reasons behind those values. You can then decide how to handle the issue if it affects your analysis or your data quality. A first step is to display the observations using logical tests. Here, we look for values below 0 seconds or of 30 seconds or more:
vars = c("id", "obsno", "scheduled", "sent", "delay_sent_sec")
data[data$delay_sent_sec < 0 | data$delay_sent_sec >= 30, vars]
# A tibble: 33 × 5
# Groups:   id [19]
      id obsno scheduled           sent                delay_sent_sec
   <dbl> <int> <dttm>              <dttm>                       <dbl>
 1     2    11 2003-03-01 07:59:52 2003-03-01 08:00:52          60
 2     2    16 2003-03-02 07:59:48 2003-03-02 08:00:21          33
 3     2    25 2003-03-03 19:59:59 2003-03-03 20:00:54          55
 4     2    56 2003-03-10 07:59:58 2003-03-10 08:00:51          53
 5     5    23 2003-03-03 13:59:59 2003-03-03 14:01:05          66
 6     5    35 2003-03-05 19:59:46 2003-03-05 20:00:16          30
 7     5    49 2003-03-08 17:00:03 2003-03-08 17:00:50          47
 8    13    63 2003-03-11 14:00:05 2003-03-11 13:59:55          -9.37
 9    15    59 2003-03-10 17:00:06 2003-03-10 17:00:03          -2.19
10    18    19 2003-03-02 16:59:55 2003-03-02 17:00:35          40
# ℹ 23 more rows
Consider investigating each of the problematic observations. For instance, you could check whether they all occurred on the same day, or whether they all concern the same participant.
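One way to run such checks is to count the problematic observations per participant and per day. The sketch below is only an illustration: `toy` is a hypothetical data frame that reuses a few rows from the output above, and the names `problems`, `by_id`, `by_day`, and `n_problems` are introduced here for the example.

```r
library(dplyr)

# Hypothetical toy data reusing a few rows shown above (stand-in for `data`)
toy <- tibble(
  id = c(2, 2, 5, 13, 15),
  obsno = c(11L, 16L, 23L, 63L, 59L),
  scheduled = as.POSIXct(
    c("2003-03-01 07:59:52", "2003-03-02 07:59:48", "2003-03-03 13:59:59",
      "2003-03-11 14:00:05", "2003-03-10 17:00:06"),
    tz = "UTC"
  ),
  delay_sent_sec = c(60, 33, 66, -9.37, -2.19)
)

# Keep the problematic observations (same thresholds as in the logical test);
# ungroup() is a precaution in case the data frame is grouped by id
problems <- toy %>%
  ungroup() %>%
  filter(delay_sent_sec < 0 | delay_sent_sec >= 30)

# Number of problematic beeps per participant
by_id <- problems %>%
  count(id, name = "n_problems", sort = TRUE)

# Number of problematic beeps per calendar day (based on the scheduled time)
by_day <- problems %>%
  mutate(day = as.Date(scheduled)) %>%
  count(day, name = "n_problems", sort = TRUE)

by_id
by_day
```

If most problematic beeps cluster on one day or within one participant, that may point to a server outage or a device-specific issue rather than a general problem.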