Duplication
Packages: dplyr
We define a duplication as an unusual repetition of the same pattern of values across rows or observations. Especially in ESM data, duplications can take various forms. We showcase them in the following dataset:
id obsno start end PA1 PA2 PA3
1 1 1 2018-12-17 10:00:34 2018-12-17 10:01:38 98 39 38
2 1 2 2018-12-17 11:00:41 2018-12-17 11:02:19 87 1 23
3 1 2 2018-12-17 11:00:41 2018-12-17 11:02:19 87 1 23
4 1 3 2018-12-17 12:00:08 2018-12-17 12:02:31 38 3 57
5 1 4 2018-12-17 13:00:44 2018-12-17 13:03:08 22 27 19
6 1 5 2018-12-17 13:00:44 2018-12-17 13:03:08 98 39 38
We can see three types of duplicates above:
- Duplicated rows: for instance, row 3 is a duplication of row 2.
- Duplicated answers: for instance, row 6 has exactly the same answer (i.e., PA1, PA2, and PA3) as row 1.
- Duplicated timestamps: for instance, row 5 has the exact same start and end time as row 6.
Duplications can sometimes be straightforwardly seen as problematic, but this is not always the case. It largely depends on the type of duplication and on the source of the issue. Duplicated rows are often produced by application errors or erroneous data manipulation. They are (most of the time) not expected in the dataframe. In contrast, duplicated answers or duplicated timestamps can sometimes be expected, depending on the number of items, the scales used, the number of participants, etc. They require in-depth data inspection to determine whether they are problematic and must be removed or not.
Duplicated rows
Duplication of rows can be investigated with the built-in function ‘duplicated()’. It tests whether each row is a duplicate (i.e., has exactly the same pattern of values as an earlier row) and returns TRUE (duplicated) or FALSE (not duplicated). Hence, this function returns a Boolean vector containing TRUE and FALSE values.
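For illustration, here is a minimal sketch of this Boolean vector on the example dataset (assuming it is stored in ‘data’): only the third element is TRUE, since row 3 repeats row 2 exactly.
# flag each row that repeats an earlier row
duplicated(data)
[1] FALSE FALSE  TRUE FALSE FALSE FALSE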
To visualize the duplicated rows we can use:
data[duplicated(data),]
id obsno start end PA1 PA2 PA3
3 1 2 2018-12-17 11:00:41 2018-12-17 11:02:19 87 1 23
The number of duplicated rows would be given by:
sum(duplicated(data))
[1] 1
Solving this issue: one common way is to remove the duplicated rows. To this end, we can use ‘!duplicated()’, which gives the following result:
data[!duplicated(data),]
id obsno start end PA1 PA2 PA3
1 1 1 2018-12-17 10:00:34 2018-12-17 10:01:38 98 39 38
2 1 2 2018-12-17 11:00:41 2018-12-17 11:02:19 87 1 23
4 1 3 2018-12-17 12:00:08 2018-12-17 12:02:31 38 3 57
5 1 4 2018-12-17 13:00:44 2018-12-17 13:03:08 22 27 19
6 1 5 2018-12-17 13:00:44 2018-12-17 13:03:08 98 39 38
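Alternatively, since dplyr is loaded for this section, the same result can be sketched with ‘distinct()’, which keeps the first occurrence of every unique row (again assuming the dataframe is called ‘data’):
library(dplyr)
# keep only one copy of each fully identical row (same result as above)
distinct(data)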
Duplicated answers
When looking for duplicated answers, we first have to subset the items or variables of interest (e.g., PA1, PA2, PA3). Then, two methods can be used to visualize duplicated answers:
- ‘duplicated()’: see above for details. It only displays the duplicated rows (not the original rows they repeat).
- ‘group_by()’ and ‘filter(n() > 1)’: the first function groups the dataframe based on the indicated columns, and the second selects the groups that contain more than one row (meaning that two or more rows share the same values). The advantage is that this method displays the original row as well as its duplications (see the sketch after the output below).
pos = duplicated(data[,c("PA1", "PA2", "PA3")])
data[pos,]
id obsno start end PA1 PA2 PA3
3 1 2 2018-12-17 11:00:41 2018-12-17 11:02:19 87 1 23
6 1 5 2018-12-17 13:00:44 2018-12-17 13:03:08 98 39 38
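The second method listed above can be sketched as follows (again assuming the dataframe is called ‘data’); unlike ‘duplicated()’, it returns the original rows together with their duplications:
library(dplyr)
# keep every row whose combination of PA1, PA2, and PA3 appears more than once
data %>%
  group_by(PA1, PA2, PA3) %>%
  filter(n() > 1) %>%
  ungroup()
On the example data, this keeps rows 1, 2, 3, and 6, i.e., both pairs of identical answer patterns.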
Solving this issue: we should first determine whether the duplicated answers are caused by an issue (e.g., a previous data manipulation). Indeed, answers can, by chance, show identical patterns (e.g., in case of a floor/ceiling effect, or when there are missing values). Consequently, each duplicated answer should be examined on an individual basis before making a decision.
Duplicated timestamps
The last type of duplication concerns the timestamp variables, meaning here the ‘scheduled’, ‘sent’, ‘start’, and ‘end’ variables. Depending on the sampling scheme, similar values may be expected, for instance if every beep is sent to the participants at around the same time. However, if there is some randomness in the sending of the beeps (e.g., a random-contingent sampling scheme), it is unlikely that two rows share the exact same ‘start’ and/or ‘end’ timestamp, and even less likely that they share identical values for both ‘start’ and ‘end’ at once. Hence, we recommend looking for duplications in the ‘start’ and the ‘end’ variables. Here, we reuse the methods already introduced above, for instance ‘duplicated()’ on the ‘start’ variable:
pos = duplicated(data$start)
data[pos,]
id obsno start end PA1 PA2 PA3
3 1 2 2018-12-17 11:00:41 2018-12-17 11:02:19 87 1 23
6 1 5 2018-12-17 13:00:44 2018-12-17 13:03:08 98 39 38
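The same logic can be extended to check the ‘start’ and ‘end’ timestamps jointly, which flags observations whose complete timing pattern repeats an earlier row (a sketch along the same lines as above):
# rows whose combination of start and end times repeats an earlier row
pos = duplicated(data[,c("start", "end")])
data[pos,]
On the example data, this again flags rows 3 and 6.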
Solving this issue: similarly to duplicated answers, duplicated timestamps can be due to chance or to an issue in the data collection or recording. Consequently, each duplicated timestamp should be examined on an individual basis before making a decision.