Common scores
Packages: dplyr
Descriptive scores are a fundamental part of analyzing data from experience sampling method (ESM) studies. They summarize the distribution of responses and help identify patterns or trends in the data.
‘summarize()’ and ‘mutate()’ functions
The ‘summarize()’ and ‘mutate()’ functions from the dplyr package provide a convenient way to compute descriptive statistics. With these functions, it is possible to compute various summary statistics, such as means, standard deviations, and counts, for different variables and grouping variables in a data frame. In particular:
- summarize(): calculates summary statistics for each group in the data frame and returns one row per group. Below, we group by ‘id’ and compute the mean of the ‘PA1’ variable for each participant, so the output has one row per participant.
data %>%
  group_by(id) %>%
  summarize(mean_PA1 = mean(PA1, na.rm = TRUE)) %>%
  head()
# A tibble: 6 × 2
id mean_PA1
<dbl> <dbl>
1 1 5.46
2 2 18.5
3 3 20.3
4 4 24.4
5 5 10.8
6 6 3.39
- mutate(): creates new variables based on existing ones. It keeps every row of the data frame and only adds a new column. Below, we group by ‘id’ and compute the mean of the ‘PA1’ variable for each participant. Unlike ‘summarize()’, which reduces the data to one summary value per group, all rows and all other variables are retained, and each participant's mean is repeated on every one of their rows.
data %>%
  group_by(id) %>%
  mutate(mean_PA1 = mean(PA1, na.rm = TRUE)) %>%
  head()
# A tibble: 6 × 10
# Groups: id [1]
id daycum obsno PA1 PA2 PA3 NA1 NA2 NA3 mean_PA1
<dbl> <drtn> <int> <int> <int> <int> <int> <int> <int> <dbl>
1 1 1 days 1 NA NA NA NA NA NA 5.46
2 1 1 days 2 NA NA NA NA NA NA 5.46
3 1 1 days 3 NA NA NA NA NA NA 5.46
4 1 1 days 4 1 11 25 10 16 28 5.46
5 1 1 days 5 NA NA NA NA NA NA 5.46
6 1 2 days 6 NA NA NA NA NA NA 5.46
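To make the difference between the two functions concrete, here is a minimal sketch on a small made-up data frame (`toy`, not the tutorial's data) that reuses the column names `id` and `PA1`. It computes several statistics per participant with summarize(), then uses mutate() to repeat the person mean on every row and derive a person-mean-centered score. The names `per_person`, `with_mean`, `n_valid`, and `PA1_centered` are hypothetical and chosen only for illustration.

```r
library(dplyr)

# Toy data (hypothetical values): 2 participants, 3 beeps each, one missing response
toy <- data.frame(
  id  = c(1, 1, 1, 2, 2, 2),
  PA1 = c(10, NA, 20, 30, 40, 50)
)

# summarize(): one row per participant, several statistics at once
per_person <- toy %>%
  group_by(id) %>%
  summarize(
    mean_PA1 = mean(PA1, na.rm = TRUE),
    sd_PA1   = sd(PA1, na.rm = TRUE),
    n_valid  = sum(!is.na(PA1))   # number of non-missing responses
  )

# mutate(): all rows are kept; the person mean is repeated on each row
with_mean <- toy %>%
  group_by(id) %>%
  mutate(
    mean_PA1     = mean(PA1, na.rm = TRUE),
    PA1_centered = PA1 - mean_PA1   # deviation from the participant's own mean
  ) %>%
  ungroup()
```

Here `per_person` has 2 rows (one per participant), while `with_mean` keeps all 6 rows of `toy`.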
Descriptive statistics
In this section, we will go over some commonly used descriptive measures in ESM research and show how to compute them using R.
An important decision is how to handle missing values. If na.rm=TRUE, missing values are removed before the computation (see check computation). If na.rm=FALSE (the default for functions such as mean(), sum(), sd(), max(), and min()), missing values are kept and the result is NA whenever at least one value in the computation is NA.
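The effect of na.rm can be checked on a short vector. The sketch below uses made-up values; the variable names are hypothetical.

```r
x <- c(4, NA, 6)   # hypothetical responses with one missing value

mean_keep <- mean(x)                # NA: the missing value propagates
mean_drop <- mean(x, na.rm = TRUE)  # 5: computed on the 2 observed values

# na.rm = TRUE can hide how little data a score is based on,
# so it helps to count the valid responses alongside the score
n_valid <- sum(!is.na(x))           # 2

# If every value is missing, mean(..., na.rm = TRUE) returns NaN, not NA
all_missing <- mean(c(NA_real_, NA_real_), na.rm = TRUE)
```

Reporting `n_valid` next to each score makes it easier to decide later whether a score based on very few responses should be kept.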
Here are examples of descriptive scores that are often computed. We propose two methods for each: one with base R functions and one with dplyr functions.
- Compute mean
data$PA = apply(data[,c("PA1","PA2","PA3")], 1, mean)
- Compute weighted mean
weight = c(1, .3, .7)
data$PA = apply(data[,c("PA1","PA2","PA3")], 1, function(x) mean(x * weight))
- Compute sum
data$PA = apply(data[,c("PA1","PA2","PA3")], 1, sum)
- Compute standard deviation
data$PA = apply(data[,c("PA1","PA2","PA3")], 1, sd)
- Compute max/min
data$PA = apply(data[,c("PA1","PA2","PA3")], 1, max)
- Compute cumulative score within a period (e.g., a day). You might need to replace missing values (NA) by 0 to do so.
data$PA1_na = replace(data$PA1, is.na(data$PA1), 0)
data$PA_cumsum = ave(data$PA1_na, data$id, data$daycum, FUN=cumsum)
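The base R snippets above can also be written with dplyr. The sketch below is one possible way to do so, not necessarily the version the authors had in mind. It uses a made-up data frame (`toy`) that borrows the column names `id`, `daycum`, `PA1`, `PA2`, and `PA3`; all output column names are hypothetical. It also contrasts `mean(x * weight)`, which divides the weighted sum by the number of items, with `weighted.mean()`, which divides by the sum of the weights.

```r
library(dplyr)

# Toy data (hypothetical values): 1 participant, 2 days, 2 beeps per day
toy <- data.frame(
  id     = c(1, 1, 1, 1),
  daycum = c(1, 1, 2, 2),
  PA1    = c(10, NA, 30, 20),
  PA2    = c(20, 10, 30, 40),
  PA3    = c(30, 20, 30, 60)
)
weight <- c(1, .3, .7)

scores <- toy %>%
  rowwise() %>%   # compute across columns, one row at a time
  mutate(
    PA_mean  = mean(c_across(PA1:PA3)),
    PA_sum   = sum(c_across(PA1:PA3)),
    PA_sd    = sd(c_across(PA1:PA3)),
    PA_max   = max(c_across(PA1:PA3)),
    PA_wavg  = mean(c_across(PA1:PA3) * weight),          # same formula as the base R code
    PA_wmean = weighted.mean(c_across(PA1:PA3), weight)   # divides by sum(weight)
  ) %>%
  ungroup() %>%
  # cumulative PA1 within each participant and day, treating NA as 0
  group_by(id, daycum) %>%
  mutate(PA_cumsum = cumsum(coalesce(PA1, 0))) %>%
  ungroup()
```

For the first row of `toy` (10, 20, 30), `PA_wavg` is 37 / 3 (about 12.33) whereas `PA_wmean` is 37 / 2 = 18.5, so the two formulas should not be used interchangeably. As in the base R versions, rows with a missing item get NA because na.rm is left at its default.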