Flag (in)valid observations

Packages: dplyr

Warning: Because the ‘valid’ variable is computed from the presence or absence of missing values, first make sure that all missing values are coded as NA, whatever the type of the variable (e.g., factor, numeric, character).
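As an illustration of this recoding step, here is a minimal sketch in base R. The toy data frame and the list of codes treated as missing (`missing_codes`) are assumptions made for the example; adapt them to the codes actually used in your dataset.

```r
# Hypothetical toy data: missing values are hidden behind several codes
data = data.frame(
  PA1 = c(3, -99, 5),
  PA2 = c("2", "", "n/a"),
  stringsAsFactors = FALSE
)

# Codes treated as missing in this sketch (an assumption, adapt to your data)
missing_codes = c("", "n/a", "-99")

# Replace every listed code by NA, column by column
for (v in names(data)) {
  data[[v]][as.character(data[[v]]) %in% missing_codes] = NA
}

# Number of missing values per variable after recoding
colSums(is.na(data))
```

Comparing on the character representation lets the same loop handle numeric, character and factor columns.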

The ‘valid’ variable specifies whether an observation (a row) is valid or invalid. A valid observation was correctly completed by the participant (according to pre-defined validation rules) and can later be used in visualizations and statistical analyses. In contrast, an invalid observation is either a missed observation or a completed observation that does not meet the pre-defined validation rules (e.g., careless responses, incorrect values).

The ‘valid’ variable offers practical advantages. Across the multiple checking tasks, you can use it to record the observations that do not meet the criteria of a valid row. It also makes it easy to filter out invalid observations when needed. Finally, at the end of your preprocessing, you can rely on the ‘valid’ variable to set the invalid observations as missing, so that they are not included in your statistical analyses.

On this page, and more broadly on this website, the ‘valid’ variable has the following code:

  • 1 = valid: all values of the variables of interest are present.
  • 0 = invalid: at least one value among the variables of interest is missing.

Note that you can choose another coding scheme, or even define three categories rather than two.
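To make the two uses mentioned earlier concrete (filtering out invalid observations and setting them as missing), here is a minimal sketch with the 1/0 code. The toy data frame, the `id` column and the `items` vector are hypothetical.

```r
# Hypothetical toy data with a 'valid' flag already computed (1 = valid, 0 = invalid)
data = data.frame(
  id    = c(1, 1, 2, 2),
  PA1   = c(4, 2, NA, 5),
  NA1   = c(1, 3, 2, NA),
  valid = c(1, 0, 0, 1)
)

# Option 1: keep only the valid observations
data_valid = data[data$valid == 1, ]

# Option 2: keep all rows but set the items of invalid observations to NA
items = c("PA1", "NA1")
data_clean = data
data_clean[data_clean$valid == 0, items] = NA
```

Option 2 preserves the row structure of the dataset (one row per beep), which can matter for analyses that rely on the ordering or spacing of observations.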

Valid definitions

Beforehand, you need to define what makes an observation valid with regard to your study, your research question, your purpose and/or the planned statistical analysis. For instance, the definition of validity can follow one or more of the following rules:

  • Presence of a value in the start or end timestamp variable. In other words, a start or end time was recorded for this ESM survey. The assumption is that if the participant has started or ended the survey, then the questions have been answered. Nonetheless, keep in mind that this can be misleading: missing values in the items are sometimes inconsistent with the start or end time because of server issues, the way the application you used works, or because you allowed the participants to skip questions.
  • A minimal number of items have been answered, especially if participants were allowed to answer only a part of the survey.
  • Specific variables should have non-missing values (e.g., all the variables of interest, at least the first two variables).
  • A variable should have a specific value (e.g., the participant reported having had an argument with their partner).

In addition, datasets may contain groups of participants who did not follow the exact same ESM study procedure. Consequently, the definition of validity may need to be adapted to the different groups. Note that you also have to decide whether the compliance scores should be computed consistently with this definition.
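To illustrate how the definition of validity can feed into compliance, here is a minimal sketch that computes compliance as the proportion of valid observations per participant. This formula is an assumption for the example, as are the `id` column and the premise that the data contain one row per scheduled beep.

```r
# Hypothetical toy data: one row per scheduled beep, 'valid' already computed
data = data.frame(
  id    = c(1, 1, 1, 2, 2, 2),
  valid = c(1, 1, 0, 1, 0, 0)
)

# Compliance sketch: proportion of valid beeps for each participant
compliance = tapply(data$valid, data$id, mean)
compliance
```

With this approach, a stricter definition of validity automatically produces lower compliance scores, which is why the two should be decided together.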

Presence of value in timestamps

An observation is set as valid if the participant has ended (or started) the survey, which is usually reflected in the end (or start) timestamp variable. Hence, for each observation, we test whether the timestamp variable has a non-missing value. This returns TRUE or FALSE. Then, we convert these logical values to numeric ones as follows: TRUE -> 1 and FALSE -> 0.

data$valid = as.numeric(!is.na(data$end))
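As a small worked example on hypothetical data, the sketch below applies the rule above and also shows a stricter variant that requires both timestamps. The `start` column and the `valid_strict` variable are assumptions made for illustration.

```r
# Hypothetical toy data with start and end timestamps stored as character strings
data = data.frame(
  start = c("2024-01-01 09:00:00", "2024-01-01 12:00:00", NA, NA),
  end   = c("2024-01-01 09:03:10", NA, NA, "2024-01-01 18:02:45")
)

# Rule used in the text: an end timestamp was recorded
data$valid = as.numeric(!is.na(data$end))

# Stricter variant (an assumption): both start and end timestamps were recorded
data$valid_strict = as.numeric(!is.na(data$start) & !is.na(data$end))

data[, c("valid", "valid_strict")]
```

The last row is valid under the first rule but not under the stricter one, which shows how the choice of timestamp rule can change which observations are kept.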

Minimal number of items answered

If we base our definition of validity on the number of answered items, we first need to count, in each row, how many of the variables of interest were answered (no missing value). The following example does this in base R:

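# Select the variables of interest (named 'items' to avoid masking base::subset)
items = data[, c("PA1","PA2","PA3","NA1","NA2","NA3")]
# Count the non-missing items in each row
data$valid_var = rowSums(!is.na(items))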

Second, we choose how many of those items must have been answered for the observation to be set as valid. Here, we use a cut-off of 4, meaning that at least 4 out of 6 items have to be answered. Then, we convert the resulting logical values to numeric ones.

thres = 4
data$valid = as.numeric(data$valid_var >= thres)
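The two steps above can be wrapped in a small helper so that the cut-off and the list of items are easy to change. This is only a sketch: the function name `flag_min_items` and the toy data are hypothetical.

```r
# Hypothetical helper: 1 if at least 'thres' of the 'items' are non-missing, else 0
flag_min_items = function(data, items, thres) {
  n_answered = rowSums(!is.na(data[, items, drop = FALSE]))
  as.numeric(n_answered >= thres)
}

# Hypothetical toy data with 6 items
data = data.frame(
  PA1 = c(1, NA, 3), PA2 = c(2, NA, NA), PA3 = c(3, 4, NA),
  NA1 = c(1, NA, 2), NA2 = c(NA, 1, NA), NA3 = c(NA, NA, 1)
)
items = c("PA1", "PA2", "PA3", "NA1", "NA2", "NA3")

# Rows have 4, 2 and 3 answered items, so only the first row reaches the cut-off
data$valid = flag_min_items(data, items, thres = 4)
```

The `drop = FALSE` argument keeps the selection a data frame even when `items` contains a single variable, so `rowSums()` still works.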

Required answered items

To be set as valid, an observation may need to have specific items answered (non-missing values), such as the items of interest of your study. In the following example, all 6 variables (PA1, …, NA3) must be present for an observation to be set as valid. Note that this can also be done with the Minimal number of items answered method, using a cut-off of 6.

cond = !is.na(data$PA1) & !is.na(data$PA2) & !is.na(data$PA3) & !is.na(data$NA1) & !is.na(data$NA2) & !is.na(data$NA3)
data$valid = as.numeric(cond)
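When many items are required, chaining `!is.na()` calls becomes long. As a possible alternative, here is a sketch using base R's `complete.cases()`, which returns TRUE for rows without any missing value in the selected columns. The toy data and the `items` vector are hypothetical.

```r
# Hypothetical toy data with 6 items
data = data.frame(
  PA1 = c(1, NA, 3), PA2 = c(2, 2, 1), PA3 = c(3, 4, 2),
  NA1 = c(1, 1, 2),  NA2 = c(2, 1, NA), NA3 = c(1, 3, 1)
)
items = c("PA1", "PA2", "PA3", "NA1", "NA2", "NA3")

# TRUE only when all required items are non-missing in the row
data$valid = as.numeric(complete.cases(data[, items]))
```

Only the first row has all six items, so it is the only one flagged as valid.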

Specific value present

To be set as valid, an observation should have a specific value in a specific variable. In the following example, we want our ‘contact’ variable to have the value of 1.

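# A missing 'contact' value counts as not meeting the condition (otherwise 'valid' would be NA)
cond = !is.na(data$contact) & data$contact == 1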
data$valid = as.numeric(cond)
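One point worth checking with value-based conditions is how missing values behave: in R, a comparison with `==` returns NA when the value is missing, whereas `%in%` returns FALSE. The following sketch on hypothetical data shows the difference.

```r
# Hypothetical toy data: the third observation has a missing 'contact' value
data = data.frame(contact = c(1, 0, NA))

# '==' propagates the missing value
as.numeric(data$contact == 1)    # 1 0 NA

# '%in%' treats the missing value as not matching
cond = data$contact %in% 1
data$valid = as.numeric(cond)    # 1 0 0
```

Which behaviour is appropriate depends on whether a missing value in that variable should make the observation invalid or leave its status undetermined.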

Complex valid definitions

In other scenarios, the definition of a valid observation combines several distinct criteria. We showcase two such cases.

Multiple valid conditions

Going beyond a single requirement, we can combine two or more of the previous conditions. Here is an example with 3 conditions:

# Minimal number of items answered
cond1 = data[,"valid_var"] >= 4  

# Specific value condition
cond2 = data[,"contact"] == 0 & !is.na(data[,"contact"])

# No missing values in the variables PA1 and PA2
cond3 = !is.na(data$PA1) & !is.na(data$PA2)

# Merging conditions and converting to 0 - 1 code
data$valid = as.numeric(cond1 & cond2 & cond3)
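To see how the three conditions interact, here is a self-contained run on hypothetical toy data. Keeping the individual conditions side by side also shows which rule each invalid observation fails.

```r
# Hypothetical toy data with 6 items and a 'contact' variable
data = data.frame(
  PA1 = c(1, 2, NA, 4), PA2 = c(2, 3, 1, 4), PA3 = c(3, NA, 2, 4),
  NA1 = c(1, 1, 2, 4),  NA2 = c(2, NA, 1, 4), NA3 = c(1, 2, 3, NA),
  contact = c(0, 0, 0, 1)
)
items = c("PA1", "PA2", "PA3", "NA1", "NA2", "NA3")
data$valid_var = rowSums(!is.na(data[, items]))   # 6, 4, 5, 5

cond1 = data$valid_var >= 4                           # minimal number of items
cond2 = data$contact == 0 & !is.na(data$contact)      # specific value
cond3 = !is.na(data$PA1) & !is.na(data$PA2)           # required items

data$valid = as.numeric(cond1 & cond2 & cond3)

# Side-by-side view of the conditions for diagnosis
cbind(cond1, cond2, cond3, valid = data$valid)
```

In this example, the third row fails only the required-items condition and the fourth row fails only the specific-value condition.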

Group with different valid conditions

Finally, ESM datasets can contain groups of participants that did not follow the same flow of items. For instance, participants with role = 1 had 2 items and participants with role = 2 had 3 different items. Hence, we can specify a different valid condition for each group and then merge the conditions using the operator ‘|’ (or):

# Group 1
cond_group1 = data$role==1 & !is.na(data$PA1) & !is.na(data$PA2)

# Group 2
cond_group2 = data$role==2 & !is.na(data$PA3) & !is.na(data$NA1) & !is.na(data$NA2)

# Merge condition with |
data$valid = as.numeric(cond_group1 | cond_group2)
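As a worked example on hypothetical data, the sketch below applies group-specific conditions and includes one observation whose `role` is missing. It uses `%in%` instead of `==` for the group test, which is a design choice made here so that such rows are flagged as invalid (0) rather than NA.

```r
# Hypothetical toy data: two groups with different items, plus one row without a role
data = data.frame(
  role = c(1, 1, 2, 2, NA),
  PA1  = c(3, NA, NA, NA, 2),
  PA2  = c(4, 2, NA, NA, 1),
  PA3  = c(NA, NA, 1, 2, NA),
  NA1  = c(NA, NA, 2, NA, NA),
  NA2  = c(NA, NA, 3, 1, NA)
)

# Group 1: both PA1 and PA2 answered
cond_group1 = data$role %in% 1 & !is.na(data$PA1) & !is.na(data$PA2)

# Group 2: PA3, NA1 and NA2 answered
cond_group2 = data$role %in% 2 & !is.na(data$PA3) & !is.na(data$NA1) & !is.na(data$NA2)

# With '==' instead of '%in%', the last row would yield NA rather than 0
data$valid = as.numeric(cond_group1 | cond_group2)
```

If observations without a group should instead stay undetermined, the `==` comparison can be kept and the resulting NA values handled separately.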