Dataframe format
Packages: dplyr, tidyr
To explore or analyze the data, it is sometimes convenient to change its format, meaning how the information is structured in the dataframe. The usual formats are the “long” and “wide” formats, but other formats can be useful. Sometimes, you’ll need to aggregate the data, for example at the day level or the participant level. In the context of dyads, there are also specific formats that can be useful, in particular one in which partners’ variables are displayed side by side.
Wide format
In a wide format, each participant has a unique row of values. Each time point and variable combination gives a column (e.g., PA1_time1, PA1_time2, PA2_time1, PA2_time2).
The original dataset has a long format, as follows:
dyad role obsno id age cond_dyad scheduled
1 1 1 1 1 40 condB 2018-10-17 08:00:08
2 1 1 2 1 40 condB 2018-10-17 09:00:01
3 1 1 3 1 40 condB 2018-10-17 09:59:56
4 1 1 4 1 40 condB 2018-10-17 10:59:48
5 1 1 5 1 40 condB 2018-10-17 12:00:12
6 1 1 6 1 40 condB 2018-10-18 07:59:47
sent start
1 2018-10-17 08:00:11 <NA>
2 2018-10-17 09:00:22 <NA>
3 2018-10-17 10:00:08 <NA>
4 2018-10-17 10:59:52 2018-10-17 11:00:12
5 2018-10-17 12:00:15 <NA>
6 2018-10-18 08:00:08 <NA>
end contact PA1 PA2 PA3 NA1 NA2 NA3
1 <NA> NA NA NA NA NA NA NA
2 <NA> NA NA NA NA NA NA NA
3 <NA> NA NA NA NA NA NA NA
4 2018-10-17 11:03:01 0 1 11 25 10 16 28
5 <NA> NA NA NA NA NA NA NA
6 <NA> NA NA NA NA NA NA NA
location daycum valid_var valid daycum_dyad beepno
1 <NA> 1 days 1 0 1 days 1
2 <NA> 1 days 1 0 1 days 2
3 <NA> 1 days 1 0 1 days 3
4 A 1 days 7 1 1 days 4
5 <NA> 1 days 1 0 1 days 5
6 <NA> 2 days 1 0 2 days 1
To reformat your data from a long format, you first have to select a subset including the nesting variable (e.g., ‘id’), the ‘obsno’ variable (which indicates the order of the beeps), and the variables of interest (e.g., ‘PA1’, ‘PA2’). Then, use the built-in function ‘reshape()’ as follows:
# Select
df = data[,c("id", "obsno", "PA1", "PA2")]
# Reshape
df_wide = reshape(as.data.frame(df), direction = "wide", idvar = "id", timevar = "obsno", sep="_")
The outcome dataframe looks like:
id PA1_1 PA2_1 PA1_2 PA2_2 PA1_3 PA2_3 PA1_4 PA2_4
1 1 NA NA NA NA NA NA 1 11
71 2 32 3 23 4 17 7 24 12
141 3 8 21 19 35 NA NA NA NA
211 4 NA NA 15 27 18 34 19 31
281 5 NA NA NA NA NA NA 11 59
351 6 NA NA 1 1 NA NA 1 1
PA1_5 PA2_5 PA1_6 PA2_6 PA1_7 PA2_7 PA1_8 PA2_8 PA1_9
1 NA NA NA NA 1 1 1 1 1
71 32 6 26 1 15 1 19 1 21
141 9 13 22 25 NA NA 24 27 12
211 18 31 NA NA 17 29 13 20 20
281 32 56 46 68 NA NA 36 68 NA
351 NA NA 11 1 NA NA 11 14 7
PA2_9 PA1_10 PA2_10 PA1_11 PA2_11 PA1_12 PA2_12 PA1_13
1 1 NA NA NA NA 1 1 1
71 2 11 2 11 1 13 1 13
141 21 NA NA 13 30 36 57 15
211 29 20 24 NA NA NA NA 33
281 NA 1 61 NA NA NA NA 1
351 25 NA NA 1 25 NA NA 1
PA2_13 PA1_14 PA2_14 PA1_15 PA2_15 PA1_16 PA2_16 PA1_17
1 1 1 1 1 1 NA NA NA
71 1 8 2 NA NA 14 4 NA
141 30 NA NA NA NA 38 48 NA
211 36 32 22 37 32 NA NA 32
281 38 NA NA NA NA NA NA 1
351 3 1 1 1 1 NA NA 1
PA2_17 PA1_18 PA2_18 PA1_19 PA2_19 PA1_20 PA2_20 PA1_21
1 NA NA NA NA NA NA NA NA
71 NA 9 4 13 10 NA NA NA
141 NA NA NA NA NA NA NA 27
211 36 27 30 31 33 NA NA 31
281 47 8 56 NA NA NA NA NA
351 1 2 1 NA NA 9 1 11
PA2_21 PA1_22 PA2_22 PA1_23 PA2_23 PA1_24 PA2_24 PA1_25
1 NA NA NA NA NA NA NA 23
71 NA 19 1 NA NA 25 16 32
141 39 24 36 NA NA 20 27 18
211 30 29 32 NA NA NA NA 21
281 NA NA NA 4 64 1 70 1
351 1 NA NA 2 20 1 22 1
PA2_25 PA1_26 PA2_26 PA1_27 PA2_27 PA1_28 PA2_28 PA1_29
1 22 NA NA 14 6 7 5 1
71 15 NA NA NA NA 29 5 31
141 28 NA NA NA NA 21 44 NA
211 30 21 31 22 37 NA NA 27
281 50 1 57 NA NA NA NA 1
351 14 NA NA NA NA 1 1 1
PA2_29 PA1_30 PA2_30 PA1_31 PA2_31 PA1_32 PA2_32 PA1_33
1 8 NA NA 1 1 1 1 1
71 6 31 7 NA NA NA NA 26
141 NA NA NA NA NA NA NA 14
211 33 25 26 23 33 NA NA 25
281 41 NA NA 1 50 NA NA 33
351 1 1 1 NA NA NA NA 7
PA2_33 PA1_34 PA2_34 PA1_35 PA2_35 PA1_36 PA2_36 PA1_37
1 1 1 1 1 1 NA NA 1
71 1 26 1 17 1 3 4 4
141 31 27 54 NA NA NA NA NA
211 33 NA NA 28 37 27 30 28
281 65 NA NA NA NA NA NA 3
351 1 9 1 10 4 8 17 NA
PA2_37 PA1_38 PA2_38 PA1_39 PA2_39 PA1_40 PA2_40 PA1_41
1 1 1 3 NA NA 1 1 NA
71 5 20 1 15 1 1 1 NA
141 NA 28 28 NA NA 28 33 9
211 32 26 22 24 28 18 22 19
281 60 NA NA NA NA 1 45 NA
351 NA NA NA NA NA 1 7 NA
PA2_41 PA1_42 PA2_42 PA1_43 PA2_43 PA1_44 PA2_44 PA1_45
1 NA 1 19 NA NA 1 1 1
71 NA 13 11 17 6 5 3 6
141 33 NA NA NA NA 25 43 13
211 33 19 28 24 38 26 29 30
281 NA 1 41 NA NA NA NA NA
351 NA 1 1 NA NA 1 1 1
PA2_45 PA1_46 PA2_46 PA1_47 PA2_47 PA1_48 PA2_48 PA1_49
1 9 6 29 NA NA NA NA 26
71 4 22 5 24 6 14 8 24
141 12 NA NA NA NA 28 43 NA
211 34 NA NA NA NA 21 24 26
281 NA NA NA 29 66 NA NA 36
351 1 NA NA NA NA NA NA NA
PA2_49 PA1_50 PA2_50 PA1_51 PA2_51 PA1_52 PA2_52 PA1_53
1 22 24 21 24 15 24 7 24
71 11 NA NA 33 1 24 5 23
141 NA 26 52 NA NA 27 34 NA
211 33 27 31 NA NA 29 30 31
281 70 20 62 NA NA 1 60 1
351 NA 6 12 1 21 NA NA 1
PA2_53 PA1_54 PA2_54 PA1_55 PA2_55 PA1_56 PA2_56 PA1_57
1 12 21 23 NA NA NA NA 1
71 12 30 14 38 11 NA NA 24
141 NA NA NA NA NA 23 38 14
211 29 25 26 NA NA NA NA 22
281 50 NA NA NA NA 1 41 1
351 15 1 5 1 1 NA NA 1
PA2_57 PA1_58 PA2_58 PA1_59 PA2_59 PA1_60 PA2_60 PA1_61
1 1 1 3 NA NA 1 1 1
71 3 NA NA NA NA 19 1 NA
141 40 18 37 17 32 NA NA NA
211 35 22 32 NA NA 25 29 24
281 42 1 44 NA NA 14 54 NA
351 1 NA NA 1 1 NA NA NA
PA2_61 PA1_62 PA2_62 PA1_63 PA2_63 PA1_64 PA2_64 PA1_65
1 1 1 1 NA NA 1 1 NA
71 NA 19 4 16 1 10 1 NA
141 NA 16 14 NA NA 13 25 26
211 26 21 29 20 22 21 31 20
281 NA 26 70 20 64 NA NA NA
351 NA 6 1 6 2 1 12 1
PA2_65 PA1_66 PA2_66 PA1_67 PA2_67 PA1_68 PA2_68 PA1_69
1 NA 1 1 1 1 1 1 1
71 NA 15 1 8 1 7 3 16
141 49 14 43 NA NA 17 35 25
211 25 NA NA 27 30 30 35 28
281 NA NA NA NA NA NA NA 1
351 15 NA NA NA NA NA NA NA
PA2_69 PA1_70 PA2_70
1 5 1 1
71 1 14 1
141 28 16 11
211 28 26 29
281 40 1 43
351 NA NA NA
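Since tidyr is among the packages listed for this section, the same long-to-wide step can also be sketched with ‘pivot_wider()’. This is only an illustration under assumptions: ‘toy_long’ and ‘toy_wide’ are hypothetical names and the values are made up, not taken from the real dataset.

```r
library(tidyr)

# Hypothetical toy data in long format: 2 participants, 2 beeps each
toy_long <- data.frame(
  id    = c(1, 1, 2, 2),
  obsno = c(1, 2, 1, 2),
  PA1   = c(NA, 1, 32, 23),
  PA2   = c(NA, 11, 3, 4)
)

# One row per id; one column per variable and beep (PA1_1, PA1_2, PA2_1, PA2_2)
toy_wide <- pivot_wider(
  toy_long,
  id_cols     = id,
  names_from  = obsno,
  values_from = c(PA1, PA2),
  names_sep   = "_"
)
toy_wide
```

Unlike the ‘reshape()’ output above, which interleaves the variables beep by beep, this call groups the columns by variable.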
Long format
Based on the previous wide format, we want to reformat this dataframe into a long format. We propose two solutions:
- ‘reshape()’ function, a base R function
- ‘pivot_longer()’ function from the tidyr package
# Recreate the list of time values and variable_time (names of the columns in the wide format)
time = as.character(c(1:70)) # 70 beeps per participant
list_var = lapply(c("PA1","PA2"), function(x) paste0(x,"_", time)) # PA1_1, PA1_2, ...
# Use the reshape
df_long = reshape(
  df_wide,
  idvar = "id",
  varying = list_var,
  v.names = c("PA1", "PA2"),
  timevar = "obsno",
  times = time,
  direction = "long"
)
The outcome dataframe looks like:
# A tibble: 6 × 4
id obsno PA1 PA2
<dbl> <chr> <int> <int>
1 1 1 NA NA
2 1 2 NA NA
3 1 3 NA NA
4 1 4 1 11
5 1 5 NA NA
6 1 6 NA NA
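The second solution listed above, ‘pivot_longer()’, could look like the following minimal sketch. It assumes that the wide column names follow the variable_beep pattern (e.g., PA1_1); ‘toy_wide’ and ‘toy_long’ are hypothetical names and the values are made up for illustration.

```r
library(tidyr)

# Hypothetical toy data in wide format: 2 participants, 2 beeps, 2 variables
toy_wide <- data.frame(
  id    = c(1, 2),
  PA1_1 = c(NA, 32), PA2_1 = c(NA, 3),
  PA1_2 = c(1, 23),  PA2_2 = c(11, 4)
)

# ".value" keeps the part before "_" as the column name (PA1, PA2);
# the part after "_" is stored in the obsno column
toy_long <- pivot_longer(
  toy_wide,
  cols      = -id,
  names_to  = c(".value", "obsno"),
  names_sep = "_"
)
toy_long
```

Here ‘obsno’ comes out as a character column, as in the ‘reshape()’ result above; it can be converted with ‘as.integer()’ if a numeric beep index is needed.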
Aggregate formats
For preprocessing checks or later data analysis, we may need to aggregate the data at the day level or the participant level, starting from an original dataset that is at the beep level. To do so, we mainly use the ‘group_by()’ and ‘summarise()’ functions from the dplyr package. The general idea is to:
- Change the dataframe level with the ‘group_by()’ function: the identification and time variables define the level of the resulting dataframe. In addition, participant-level or day-level invariant variables can be added to the grouping to keep them in the output dataframe.
- Compute aggregate scores from the time-varying variables with the ‘summarise()’ function: mean, max, etc. of the variables of interest, to aggregate them at the chosen level.
Day level
To create a dataset at the day level, we need the daycum variable. Then, ‘group_by()’ at least the id and daycum variables (participant-level invariant variables such as role, age, and cond_dyad can be added to keep them). Finally, aggregate the variables of interest at the day level.
df_day = data %>%
  group_by(id, role, age, cond_dyad, daycum) %>% # Day-level through daycum variable
  summarise(PA1_mean = mean(PA1, na.rm=TRUE))
The outcome dataframe looks like:
# A tibble: 6 × 6
# Groups: id, role, age, cond_dyad [1]
id role age cond_dyad daycum PA1_mean
<dbl> <int> <int> <chr> <drtn> <dbl>
1 1 1 40 condB 1 days 1
2 1 1 40 condB 2 days 1
3 1 1 40 condB 3 days 1
4 1 1 40 condB 4 days NaN
5 1 1 40 condB 5 days 23
6 1 1 40 condB 6 days 7.33
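Two details of the output above are worth noting: the result is still grouped (see the “Groups” line), and a day without any valid answer gives NaN because the mean of an empty set is undefined. The following is a minimal sketch of one way to handle both, on a hypothetical toy dataframe (‘toy’, ‘toy_day’, and ‘n_answered’ are illustrative names, and daycum is simplified to a number here).

```r
library(dplyr)

# Hypothetical toy data: 1 participant, 2 days, 2 beeps per day
toy <- data.frame(
  id     = c(1, 1, 1, 1),
  daycum = c(1, 1, 2, 2),
  PA1    = c(10, 20, NA, NA),
  PA2    = c(1, 3, 5, NA)
)

toy_day <- toy %>%
  group_by(id, daycum) %>%
  summarise(
    # Day-level mean of each variable, named PA1_mean, PA2_mean
    across(c(PA1, PA2), ~ mean(.x, na.rm = TRUE), .names = "{.col}_mean"),
    # Number of beeps with a PA1 answer on that day
    n_answered = sum(!is.na(PA1)),
    # Return an ungrouped dataframe
    .groups = "drop"
  ) %>%
  # Replace NaN (no valid answer that day) with NA
  mutate(across(ends_with("_mean"), ~ ifelse(is.nan(.x), NA, .x)))
toy_day
```

Keeping a count such as ‘n_answered’ next to each mean makes it easy to see how many observations a day-level score is based on.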
Participant level
To create a dataframe at the participant level (only one row per participant), we only need to ‘group_by()’ the id variable; participant-level invariant variables (here role, age, and cond_dyad) can be added to the grouping to keep them in the output. Time-varying variables can be aggregated using the ‘summarise()’ function.
df_participant = data %>%
  group_by(id, role, age, cond_dyad) %>%
  summarise(PA1_mean = mean(PA1, na.rm=TRUE))
The outcome dataframe looks like:
# A tibble: 6 × 5
# Groups: id, role, age [6]
id role age cond_dyad PA1_mean
<dbl> <int> <int> <chr> <dbl>
1 1 1 40 condB 5.46
2 2 2 42 condB 18.5
3 3 1 25 condB 20.3
4 4 2 25 condB 24.4
5 5 1 25 condB 10.8
6 6 2 25 condB 3.39
Formats for dyadic datasets
When manipulating dyadic data, it is often necessary to reformat the dataset from an individual format to a dyad format. A dyad format displays partners’ observations side by side, meaning that the first partner’s variables are on the left part of the dataframe and the second partner’s variables on the right. There are two options: the dyad format and the dyad pairwise format.
Individual format (partner A's rows stacked above partner B's rows):
A A A A A A A
A A A A A A A
A A A A A A A
A A A A A A A
B B B B B B B
B B B B B B B
B B B B B B B
B B B B B B B
Dyad format (partners side by side, each observation once):
A A A A A A B B B B B B
A A A A A A B B B B B B
A A A A A A B B B B B B
A A A A A A B B B B B B
Dyad pairwise format (partners side by side, each observation twice):
A A A A A A B B B B B B
A A A A A A B B B B B B
A A A A A A B B B B B B
A A A A A A B B B B B B
B B B B B B A A A A A A
B B B B B B A A A A A A
B B B B B B A A A A A A
B B B B B B A A A A A A
For variables that share the same name across the two dyad members (e.g., PA1), we add a suffix. For instance, partner 1 will have the suffix “_p1” (e.g., ‘PA1_p1’), and partner 2 will have “_p2” (e.g., ‘PA1_p2’). In both formats, we propose three methods, depending on whether:
- dyads’ members are distinguishable (e.g., teacher and student) and a variable distinguishes one member from the other. In our example, ‘role’ makes dyads distinguishable.
- dyads are indistinguishable (e.g., friends) and the dyads’ members are arranged in the dataframe in a consistent way. The first partner will be on the left and the second on the right of the dataframe.
- dyads are indistinguishable (e.g., friends) and the dyads’ members are randomly arranged in the dataframe, meaning that positions (i.e., right or left) are randomly defined.
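For the last case, one possible way to define positions at random is sketched below. This is an illustration under assumptions, not necessarily the procedure used for this dataset: ‘toy’, ‘positions’, and the ‘position’ variable are hypothetical names, and the seed is only there to make the example reproducible.

```r
library(dplyr)

set.seed(123) # for a reproducible random assignment

# Hypothetical toy data: 2 dyads, 2 members each, 2 beeps per member
toy <- data.frame(
  dyad  = c(1, 1, 1, 1, 2, 2, 2, 2),
  id    = c(1, 1, 2, 2, 3, 3, 4, 4),
  obsno = c(1, 2, 1, 2, 1, 2, 1, 2)
)

# One row per member, with a random position (1 or 2) within each dyad
positions <- toy %>%
  distinct(dyad, id) %>%
  group_by(dyad) %>%
  mutate(position = sample(seq_len(n()))) %>%
  ungroup()

# Attach the position to every observation of the member
toy <- toy %>% left_join(positions, by = c("dyad", "id"))
```

The ‘position’ variable can then play the part that ‘role’ plays for distinguishable dyads, for example ‘filter(position == 1)’ to select the partner placed on the left.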
When merging datasets, it is crucial to ensure that the key variables specified in the ‘by’ argument accurately establish the correspondence between partner observations within dyads. We can use the ‘dyad’ and ‘obsno’ variables to do so. However, if partners within a dyad did not start at the same beep, matching on ‘obsno’ may misalign partner observations. Here, instead of the ‘obsno’ variable, we prefer using the ‘daycum_dyad’ and ‘beepno’ variables (see the Create time variables section). Note that the latter is created based on the former.
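One way to check that the key variables identify at most one observation per partner is to count the rows per key combination before merging. The following is a minimal sketch on a hypothetical toy dataframe: ‘toy’ and ‘dup_keys’ are illustrative names, the values are made up, and daycum_dyad is simplified to a number.

```r
library(dplyr)

# Hypothetical toy data: partner 2 has the same beep recorded twice
toy <- data.frame(
  dyad        = c(1, 1, 1, 1, 1),
  role        = c(1, 1, 2, 2, 2),
  daycum_dyad = c(1, 1, 1, 1, 1),
  beepno      = c(1, 2, 1, 2, 2)
)

# Key combinations that occur more than once for the same partner
dup_keys <- toy %>%
  count(dyad, role, daycum_dyad, beepno) %>%
  filter(n > 1)
dup_keys
```

An empty result means that each partner has at most one row per key combination, so a join on these keys will not duplicate rows.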
Dyad format
The dyad format is a format where partners’ observations are presented only once and displayed side by side.
# Split partners
df1 = data %>% filter(role == 1)
df2 = data %>% filter(role == 2)
# Join side by side
data_dyad = df1 %>%
  left_join(df2, by=c("dyad", "daycum_dyad", "beepno"), suffix = c("_p1", "_p2"))
The outcome dataframe looks like:
dyad role_p1 obsno_p1 id_p1 age_p1 cond_dyad_p1
1 1 1 1 1 40 condB
2 1 1 2 1 40 condB
3 1 1 3 1 40 condB
4 1 1 4 1 40 condB
5 1 1 5 1 40 condB
6 1 1 6 1 40 condB
scheduled_p1 sent_p1
1 2018-10-17 08:00:08 2018-10-17 08:00:11
2 2018-10-17 09:00:01 2018-10-17 09:00:22
3 2018-10-17 09:59:56 2018-10-17 10:00:08
4 2018-10-17 10:59:48 2018-10-17 10:59:52
5 2018-10-17 12:00:12 2018-10-17 12:00:15
6 2018-10-18 07:59:47 2018-10-18 08:00:08
start_p1 end_p1 contact_p1 PA1_p1
1 <NA> <NA> NA NA
2 <NA> <NA> NA NA
3 <NA> <NA> NA NA
4 2018-10-17 11:00:12 2018-10-17 11:03:01 0 1
5 <NA> <NA> NA NA
6 <NA> <NA> NA NA
PA2_p1 PA3_p1 NA1_p1 NA2_p1 NA3_p1 location_p1 daycum_p1
1 NA NA NA NA NA <NA> 1 days
2 NA NA NA NA NA <NA> 1 days
3 NA NA NA NA NA <NA> 1 days
4 11 25 10 16 28 A 1 days
5 NA NA NA NA NA <NA> 1 days
6 NA NA NA NA NA <NA> 2 days
valid_var_p1 valid_p1 daycum_dyad beepno role_p2 obsno_p2
1 1 0 1 days 1 2 1
2 1 0 1 days 2 2 2
3 1 0 1 days 3 2 3
4 7 1 1 days 4 2 4
5 1 0 1 days 5 2 5
6 1 0 2 days 1 2 6
id_p2 age_p2 cond_dyad_p2 scheduled_p2
1 2 42 condB 2018-10-17 07:59:58
2 2 42 condB 2018-10-17 08:59:50
3 2 42 condB 2018-10-17 09:59:55
4 2 42 condB 2018-10-17 11:00:01
5 2 42 condB 2018-10-17 11:59:58
6 2 42 condB 2018-10-18 08:00:07
sent_p2 start_p2
1 2018-10-17 08:00:15 2018-10-17 08:00:21
2 2018-10-17 08:59:53 2018-10-17 09:00:22
3 2018-10-17 10:00:16 2018-10-17 10:00:37
4 2018-10-17 11:00:05 2018-10-17 11:00:15
5 2018-10-17 12:00:01 2018-10-17 12:00:23
6 2018-10-18 08:00:09 2018-10-18 08:00:44
end_p2 contact_p2 PA1_p2 PA2_p2 PA3_p2
1 2018-10-17 08:04:32 0 32 3 10
2 2018-10-17 09:03:08 0 23 4 14
3 2018-10-17 10:02:42 0 17 7 19
4 2018-10-17 11:03:07 0 24 12 16
5 2018-10-17 12:03:12 0 32 6 14
6 2018-10-18 08:03:49 0 26 1 18
NA1_p2 NA2_p2 NA3_p2 location_p2 daycum_p2 valid_var_p2
1 5 20 1 C 1 days 7
2 4 4 11 C 1 days 7
3 14 34 20 B 1 days 7
4 24 3 28 A 1 days 7
5 20 1 35 E 1 days 7
6 17 1 42 D 2 days 7
valid_p2
1 1
2 1
3 1
4 1
5 1
6 1
Dyad pairwise format
The dyad pairwise format is a format where partners’ observations are presented side by side, but each partner’s observations are displayed twice (once as the first partner and once as the second partner).
# Split partners
df1 = data %>% filter(role == 1)
df2 = data %>% filter(role == 2)
# Join side by side
df1_merged = df1 %>%
  left_join(df2, by=c("dyad", "daycum_dyad", "beepno"), suffix = c("_p1", "_p2"))
df2_merged = df2 %>%
  left_join(df1, by=c("dyad", "daycum_dyad", "beepno"), suffix = c("_p1", "_p2"))
# Merge
data_dyad = rbind(df1_merged, df2_merged)
The outcome dataframe looks like:
dyad role_p1 obsno_p1 id_p1 age_p1 cond_dyad_p1
1 1 1 1 1 40 condB
2 1 1 2 1 40 condB
3 1 1 3 1 40 condB
4 1 1 4 1 40 condB
5 1 1 5 1 40 condB
6 1 1 6 1 40 condB
scheduled_p1 sent_p1
1 2018-10-17 08:00:08 2018-10-17 08:00:11
2 2018-10-17 09:00:01 2018-10-17 09:00:22
3 2018-10-17 09:59:56 2018-10-17 10:00:08
4 2018-10-17 10:59:48 2018-10-17 10:59:52
5 2018-10-17 12:00:12 2018-10-17 12:00:15
6 2018-10-18 07:59:47 2018-10-18 08:00:08
start_p1 end_p1 contact_p1 PA1_p1
1 <NA> <NA> NA NA
2 <NA> <NA> NA NA
3 <NA> <NA> NA NA
4 2018-10-17 11:00:12 2018-10-17 11:03:01 0 1
5 <NA> <NA> NA NA
6 <NA> <NA> NA NA
PA2_p1 PA3_p1 NA1_p1 NA2_p1 NA3_p1 location_p1 daycum_p1
1 NA NA NA NA NA <NA> 1 days
2 NA NA NA NA NA <NA> 1 days
3 NA NA NA NA NA <NA> 1 days
4 11 25 10 16 28 A 1 days
5 NA NA NA NA NA <NA> 1 days
6 NA NA NA NA NA <NA> 2 days
valid_var_p1 valid_p1 daycum_dyad beepno role_p2 obsno_p2
1 1 0 1 days 1 2 1
2 1 0 1 days 2 2 2
3 1 0 1 days 3 2 3
4 7 1 1 days 4 2 4
5 1 0 1 days 5 2 5
6 1 0 2 days 1 2 6
id_p2 age_p2 cond_dyad_p2 scheduled_p2
1 2 42 condB 2018-10-17 07:59:58
2 2 42 condB 2018-10-17 08:59:50
3 2 42 condB 2018-10-17 09:59:55
4 2 42 condB 2018-10-17 11:00:01
5 2 42 condB 2018-10-17 11:59:58
6 2 42 condB 2018-10-18 08:00:07
sent_p2 start_p2
1 2018-10-17 08:00:15 2018-10-17 08:00:21
2 2018-10-17 08:59:53 2018-10-17 09:00:22
3 2018-10-17 10:00:16 2018-10-17 10:00:37
4 2018-10-17 11:00:05 2018-10-17 11:00:15
5 2018-10-17 12:00:01 2018-10-17 12:00:23
6 2018-10-18 08:00:09 2018-10-18 08:00:44
end_p2 contact_p2 PA1_p2 PA2_p2 PA3_p2
1 2018-10-17 08:04:32 0 32 3 10
2 2018-10-17 09:03:08 0 23 4 14
3 2018-10-17 10:02:42 0 17 7 19
4 2018-10-17 11:03:07 0 24 12 16
5 2018-10-17 12:03:12 0 32 6 14
6 2018-10-18 08:03:49 0 26 1 18
NA1_p2 NA2_p2 NA3_p2 location_p2 daycum_p2 valid_var_p2
1 5 20 1 C 1 days 7
2 4 4 11 C 1 days 7
3 14 34 20 B 1 days 7
4 24 3 28 A 1 days 7
5 20 1 35 E 1 days 7
6 17 1 42 D 2 days 7
valid_p2
1 1
2 1
3 1
4 1
5 1
6 1
The variable ‘obsno’ is not always the most suitable choice for merging dyads’ observations. There are instances where the combination of ‘daycum_dyad’ and ‘beepno’ is more appropriate.
Furthermore, it is crucial to verify that the time interval between partners’ responses follows the expected distribution, as described in the dyad time interval section. In particular, outliers should be identified and examined. An outlier is, for example, a gap of more than 30 minutes between partner observations that have been merged together, when partners are expected to answer within 30 minutes of each other. Such outliers may indicate issues with data collection or errors in data manipulation.
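Such a check can be sketched as follows on a hypothetical toy dataframe in dyad format. Comparing the ‘start’ timestamps and using a 30-minute threshold are assumptions made for the illustration; ‘toy_dyad’, ‘gap_min’, and ‘gap_outlier’ are hypothetical names.

```r
library(dplyr)

# Hypothetical toy data in dyad format (one row per merged pair of observations)
toy_dyad <- data.frame(
  dyad     = c(1, 1, 1),
  beepno   = 1:3,
  start_p1 = as.POSIXct(c("2018-10-17 08:00:21", "2018-10-17 09:00:22",
                          "2018-10-17 10:00:37"), tz = "UTC"),
  start_p2 = as.POSIXct(c("2018-10-17 08:02:10", "2018-10-17 09:45:00",
                          NA), tz = "UTC")
)

toy_dyad <- toy_dyad %>%
  mutate(
    # Absolute gap between partners' start times, in minutes
    gap_min = abs(as.numeric(difftime(start_p1, start_p2, units = "mins"))),
    # Flag gaps above 30 minutes; pairs with a missing answer are not flagged
    gap_outlier = !is.na(gap_min) & gap_min > 30
  )

# Merged observations to inspect
toy_dyad %>% filter(gap_outlier)
```

The flagged rows are candidates for inspection rather than automatic exclusion, since a large gap can come from either data collection or a faulty merge.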
# # Gather partners' id number and dyad number
# nested_list = lapply(split(data$id, data$dyad), unique)
# # Split partners
# id_list1 = sapply(nested_list, function(x) x[1])
# id_list2 = sapply(nested_list, function(x) x[2])
# data1 = data[data$id %in% id_list1,]
# data2 = data[data$id %in% id_list2,]
# # Join side by side partners' data
# # (separate names are used so that data1 is not overwritten before the second join)
# data1_merged = data1 %>%
#   left_join(data2, by=c("dyad", "daycum_dyad", "beepno"), suffix = c("_1", "_2"))
# data2_merged = data2 %>%
#   left_join(data1, by=c("dyad", "daycum_dyad", "beepno"), suffix = c("_1", "_2"))
# data_pairdyad = rbind(data1_merged, data2_merged)