ESM Preprocessing Gallery – merging_data

Merging data sources

Packages: dplyr, data.table

In ESM studies, it is common to collect data from multiple sources, such as baseline surveys, smartphone surveys, and web-based surveys. In addition, each participant data may be stored in different files. These sources may produce data in different formats, structures, etc. making it (sometimes) challenging to merge them. Merging the data sources can be essential to analyze the data and get a comprehensive view of the participants’ experiences and behaviors.

Below, we assume that all data are stores in csv files. However, the same principles can be applied to other file formats (e.g., Excel, SPSS, etc.). See the import data section for more information on how to import data from different file formats.

Warning

In practice, you may need to go through several preprocess methods on each dataset before merging them. For instance, you may need to rename, recode, or create new variables to ensure the data are consistent across sources. Particular checking might be needed such as making sure that the key variables (i.e., variables that are common between the two datasets and allow to identify participants) are consistent across the datasets. In the example below, we assume that the data are ready to be merged.

Merging person-level questionnaires

Questionnaires including person-level variables are often collected before (e.g., baseline uqestionnaires) or after (e.g., follow-up questionnaires) the ESM procedure and can be stored in a separate CSV file. As an example, the file ‘data_baseline.csv’ includes answers to a baseline questionnaire, in which information about participants’ age age and dyadic condition are stored.

head(data_baseline)

    dyad id age cond_dyad
1      1  1  40     condB
71     1  2  42     condB
141    2  3  25     condB
211    2  4  25     condB
281    3  5  25     condB
351    3  6  25     condB

Our purpose is then to merge these baseline scores with the ESM dataset. The following R code makes it possible. We propose 2 methods. For both, we have to specify key variables (i.e., variables that are common between the two datasets and allow to idenfity participants) that will be used while merging in the ‘by’ argument.

‘merge()’, a build-in R function
‘left_join()’ from the dplyr package: this function takes the first specified dataframe as a reference and joins every associated column (following the key variables) of the second dataframe to it. Other type of joins are possible (e.g., inner join, right join, full join), but, in our case, the left join is the appropriate solution. Nonetheless, it may not be the case in every situation in which other join functions can be more adapted.

data = merge(data, data_baseline, by= c("dyad","id"))

It results:

head(data)

  dyad id role obsno  scheduled       sent      start        end contact PA1
1    1  1    1     1 1539763208 1539763211         NA         NA      NA  NA
2    1  1    1     2 1539766801 1539766822         NA         NA      NA  NA
3    1  1    1     3 1539770396 1539770408         NA         NA      NA  NA
4    1  1    1     4 1539773988 1539773992 1539774012 1539774181       0   1
5    1  1    1     5 1539777612 1539777615         NA         NA      NA  NA
6    1  1    1     6 1539849587 1539849608         NA         NA      NA  NA
  PA2 PA3 NA1 NA2 NA3 location age cond_dyad
1  NA  NA  NA  NA  NA     <NA>  40     condB
2  NA  NA  NA  NA  NA     <NA>  40     condB
3  NA  NA  NA  NA  NA     <NA>  40     condB
4  11  25  10  16  28        A  40     condB
5  NA  NA  NA  NA  NA     <NA>  40     condB
6  NA  NA  NA  NA  NA     <NA>  40     condB

Loop to merge datasets

The data of each participant can be stored in independent files. Hence, you have to merge them efficiently. Here, all the datasets are csv ones and stored in the ‘data/participants/’ folder (importantly, no other type of file is in this folder). In addition, they all have the same variable structure.

data_files = list.files("data/participants")  # Gather csv file names
data = data.frame()  # Create main dataset
for (i in data_files){
    data_temp = read.csv2(paste0("data/participants/", i)) # Open the file i
    data = rbind(data, data_temp)  # Merge data
}

If the variables are inconsistent between the datasets (e.g., not the same variables between children and parents), you can include conditions and rename the data in the loop.

A simpler solution is to use the function ‘rbindlist()’ from the data.table package. This function allows binding rows even if the columns are not fully consistent. Instead of returning an error message, it creates new columns. Hence, you should be careful to check the consistency of the data after merging. To use it, you have to first import the dataframe in a list as follows:

library(data.table)

data_files = list.files("data/participants")  # Gather csv file names
data_list = list()  # Create list of datasets
for (i in data_files){
    data_list[[i]] = read.csv2(paste0("data/participants/", i)) # Open the file i
}

data = rbindlist(data_list)  # Merge data