First glimpse
Packages: dplyr, psych, skimr, Hmisc, esmtools
After importing your dataset or while preprocessing it, it is important to get a quick and efficient overview of it. To this end, we demonstrate how to display meta-data (e.g., dimensions), display a subset of rows, compute common descriptive statistics (e.g., mean, standard deviation), and compute occurrences of values. Beyond providing a good understanding of your dataset, this often reveals issues in the data (e.g., wrong minimal or maximal values, the loss of many rows after a data manipulation, an unexpectedly high number of occurrences of one value of a categorical variable).
Meta-data
Three important meta-data aspects to check are the number of rows and columns, the format of the columns, and the number of observations per participant. This can be done using:
- ‘dim()’: returns the number of rows (first number) and the number of columns (second number). It helps to quickly check whether those numbers are the expected ones (e.g., after a data modification). You can also retrieve the number of rows alone with ‘nrow()’ and the number of columns with ‘ncol()’.
dim(data)
[1] 4200 18
- ‘str()’: returns the columns’ formats and their first values. It is particularly useful to check whether variables are in the correct format (e.g., integer, character, POSIXct). The same can be achieved with ‘glimpse()’ from the dplyr package.
str(data)
'data.frame': 4200 obs. of 18 variables:
$ dyad : num 1 1 1 1 1 1 1 1 1 1 ...
$ role : int 1 1 1 1 1 1 1 1 1 1 ...
$ obsno : int 1 2 3 4 5 6 7 8 9 10 ...
$ id : num 1 1 1 1 1 1 1 1 1 1 ...
$ age : int 40 40 40 40 40 40 40 40 40 40 ...
$ cond_dyad: chr "condB" "condB" "condB" "condB" ...
$ scheduled: POSIXct, format: "2018-10-17 08:00:08" "2018-10-17 09:00:01" ...
$ sent : POSIXct, format: "2018-10-17 08:00:11" "2018-10-17 09:00:22" ...
$ start : POSIXct, format: NA NA ...
$ end : POSIXct, format: NA NA ...
$ contact : int NA NA NA 0 NA NA 0 0 0 NA ...
$ PA1 : int NA NA NA 1 NA NA 1 1 1 NA ...
$ PA2 : int NA NA NA 11 NA NA 1 1 1 NA ...
$ PA3 : int NA NA NA 25 NA NA 5 7 16 NA ...
$ NA1 : int NA NA NA 10 NA NA 30 30 43 NA ...
$ NA2 : int NA NA NA 16 NA NA 1 13 23 NA ...
$ NA3 : int NA NA NA 28 NA NA 35 41 46 NA ...
$ location : chr NA NA NA "A" ...
- Number of rows per participant: with base R functions or with dplyr. Be aware that in the base R version, the output displays each id number above the number of rows for that participant. In the dplyr version, ‘n()’ computes the number of rows for each group, so here for each participant.
sapply(split(data$id, data$id), length)
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26
70 70 70 70 70 70 70 70 70 70 70 70 70 70 70 70 70 70 70 70 70 70 70 70 70 70
27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52
70 70 70 70 70 70 70 70 70 70 70 70 70 70 70 70 70 70 70 70 70 70 70 70 70 70
53 54 55 56 57 58 59 60
70 70 70 70 70 70 70 70
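The dplyr version mentioned above can be sketched as follows. The toy data frame here is a hypothetical stand-in for the chapter's ‘data’, with two participants and three rows each:

```r
library(dplyr)

# Toy data frame standing in for 'data' (hypothetical: two participants,
# three rows each)
toy <- data.frame(id = c(1, 1, 1, 2, 2, 2))

# n() returns the number of rows per group, here per participant
toy %>%
  group_by(id) %>%
  summarize(n_rows = n())
```

Unlike the base R output, the result is a data frame with one row per participant, which is convenient for further filtering (e.g., flagging participants with fewer rows than expected).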
Display rows
A direct inspection of the observations/rows can also be useful for a quick investigation in many situations, such as after importing data or creating new variables. The most common practice is to use ‘head()’ and ‘tail()’ to display the first or last rows of the dataset, respectively. However, only displaying the first or last rows of a dataset may not provide a good representation of it, and, worse, it may hide problematic patterns, outliers, or other data quality issues. Hence, displaying a set of random rows can be a good alternative.
- ‘head()’ and ‘tail()’: return, respectively, the first and last n rows (by default n=6). We can change n as follows: ‘head(data, n=10)’ or ‘tail(data, n=10)’.
head(data, n=5)
dyad role obsno id age cond_dyad scheduled sent
1 1 1 1 1 40 condB 2018-10-17 08:00:08 2018-10-17 08:00:11
2 1 1 2 1 40 condB 2018-10-17 09:00:01 2018-10-17 09:00:22
3 1 1 3 1 40 condB 2018-10-17 09:59:56 2018-10-17 10:00:08
4 1 1 4 1 40 condB 2018-10-17 10:59:48 2018-10-17 10:59:52
5 1 1 5 1 40 condB 2018-10-17 12:00:12 2018-10-17 12:00:15
start end contact PA1 PA2 PA3 NA1 NA2 NA3
1 <NA> <NA> NA NA NA NA NA NA NA
2 <NA> <NA> NA NA NA NA NA NA NA
3 <NA> <NA> NA NA NA NA NA NA NA
4 2018-10-17 11:00:12 2018-10-17 11:03:01 0 1 11 25 10 16 28
5 <NA> <NA> NA NA NA NA NA NA NA
location
1 <NA>
2 <NA>
3 <NA>
4 A
5 <NA>
- Random rows: the esmtools package provides functions to display:
- (1) random rows: with the function ‘randrows’, you can display n randomly selected rows from the dataset.
- (2) one random set of consecutive rows: with the function ‘folrows’, you can display one randomly selected set of n consecutive rows.
- (3) multiple random sets of consecutive rows: with the ‘nb_sample’ argument of ‘folrows’, you can display nb_sample randomly selected sets of n consecutive rows.
library(esmtools)
randrows(data, n=5)
dyad role obsno id age cond_dyad scheduled sent
3711 27 2 1 54 36 condB 2018-09-18 08:00:04 2018-09-18 08:00:12
273 2 2 63 4 25 condB 2018-08-21 10:00:08 2018-08-21 10:00:19
3863 28 2 13 56 48 condA 2018-10-29 11:00:01 2018-10-29 11:00:23
321 3 1 41 5 25 condB 2018-03-12 09:00:17 2018-03-12 09:00:28
607 5 1 47 9 48 condA 2018-02-11 10:00:04 2018-02-11 10:00:21
start end contact PA1 PA2 PA3 NA1 NA2 NA3
3711 <NA> <NA> NA NA NA NA NA NA NA
273 2018-08-21 10:00:30 2018-08-21 10:01:19 0 20 22 1 1 1 93
3863 2018-10-29 11:00:30 2018-10-29 11:03:20 0 27 18 1 1 10 79
321 <NA> <NA> NA NA NA NA NA NA NA
607 <NA> <NA> NA NA NA NA NA NA NA
location
3711 <NA>
273 D
3863 D
321 <NA>
607 <NA>
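The idea behind ‘folrows’ can be illustrated with a short base-R sketch: pick a random starting row, then return the n rows that follow it. This is an assumption-laden re-implementation for illustration, not the esmtools source:

```r
# Base-R sketch of the 'folrows' idea: display n consecutive rows starting
# from a randomly chosen position (the real esmtools function also supports
# multiple samples via its nb_sample argument, per the text above).
folrows_sketch <- function(df, n = 5) {
  start <- sample(seq_len(nrow(df) - n + 1), 1)  # random starting row
  df[start:(start + n - 1), ]                    # n consecutive rows
}

# Toy data frame standing in for 'data'
toy <- data.frame(obsno = 1:20, value = runif(20))
folrows_sketch(toy, n = 5)
```

Consecutive rows are especially informative in ESM data, since they preserve the temporal ordering of beeps within a participant.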
Common descriptive functions
There are multiple functions from many packages that can be used to compute descriptive statistics (e.g., mean, median, quantiles, proportion of missing values) over all variables of a dataframe. Computing descriptive statistics is a useful way to check assumptions about the data (e.g., the range of values), to detect potential issues (e.g., a large number of missing values), or to get a quick overview of the data. Here, we show four functions coming from different packages, starting with base R’s ‘summary()’:
summary(data)
dyad role obsno id age
Min. : 1.0 Min. :1.0 Min. : 1.0 Min. : 1.00 Min. :25.00
1st Qu.: 8.0 1st Qu.:1.0 1st Qu.:18.0 1st Qu.:15.75 1st Qu.:25.00
Median :15.5 Median :1.5 Median :35.5 Median :30.50 Median :35.50
Mean :15.5 Mean :1.5 Mean :35.5 Mean :30.50 Mean :35.10
3rd Qu.:23.0 3rd Qu.:2.0 3rd Qu.:53.0 3rd Qu.:45.25 3rd Qu.:42.25
Max. :30.0 Max. :2.0 Max. :70.0 Max. :60.00 Max. :65.00
cond_dyad scheduled
Length:4200 Min. :2018-02-02 08:59:47.00
Class :character 1st Qu.:2018-07-20 09:00:01.00
Mode :character Median :2018-09-13 11:00:13.50
Mean :2018-09-08 11:33:24.83
3rd Qu.:2018-10-24 02:59:52.50
Max. :2019-06-10 11:59:46.00
NA's :2
sent start
Min. :2018-02-02 08:59:51.00 Min. :2018-02-02 09:00:31.00
1st Qu.:2018-07-20 09:00:18.75 1st Qu.:2018-07-22 12:00:37.25
Median :2018-09-13 11:30:18.00 Median :2018-09-14 22:00:28.00
Mean :2018-09-08 11:54:09.94 Mean :2018-09-14 18:49:31.13
3rd Qu.:2018-10-23 17:00:04.00 3rd Qu.:2018-10-31 13:00:30.50
Max. :2019-06-10 11:59:54.00 Max. :2019-06-10 12:00:15.00
NA's :1254
end contact PA1
Min. :2018-02-02 09:03:07.00 Min. :0.0000 Min. : 1.00
1st Qu.:2018-07-22 12:01:49.25 1st Qu.:0.0000 1st Qu.: 4.00
Median :2018-09-14 22:02:02.00 Median :0.0000 Median : 18.00
Mean :2018-09-14 18:51:14.74 Mean :0.1229 Mean : 23.09
3rd Qu.:2018-10-31 13:02:31.00 3rd Qu.:0.0000 3rd Qu.: 32.00
Max. :2019-06-10 12:02:30.00 Max. :1.0000 Max. :100.00
NA's :1254 NA's :1254 NA's :1254
PA2 PA3 NA1 NA2
Min. : 1.00 Min. : 1.00 Min. : 1.00 Min. : 1.0
1st Qu.: 3.00 1st Qu.: 3.00 1st Qu.: 1.00 1st Qu.: 1.0
Median : 19.00 Median : 16.00 Median : 11.00 Median : 7.0
Mean : 21.77 Mean : 23.32 Mean : 21.36 Mean :10.5
3rd Qu.: 33.00 3rd Qu.: 31.00 3rd Qu.: 31.00 3rd Qu.:15.0
Max. :100.00 Max. :100.00 Max. :100.00 Max. :83.0
NA's :1254 NA's :1254 NA's :1254 NA's :1254
NA3 location
Min. : 1.00 Length:4200
1st Qu.: 40.00 Class :character
Median : 72.00 Mode :character
Mean : 63.72
3rd Qu.: 89.00
Max. :100.00
NA's :1254
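If the richer summaries from psych, skimr, or Hmisc are not needed, the same kind of per-variable overview can be computed by hand in base R. A minimal sketch, using a toy data frame as a hypothetical stand-in for ‘data’:

```r
# Minimal base-R sketch of per-variable descriptives: mean, SD, range,
# and proportion of missing values for each numeric column.
toy <- data.frame(PA1 = c(1, 5, NA, 9), NA1 = c(2, NA, NA, 10))

desc <- sapply(toy, function(x) c(
  mean    = mean(x, na.rm = TRUE),
  sd      = sd(x, na.rm = TRUE),
  min     = min(x, na.rm = TRUE),
  max     = max(x, na.rm = TRUE),
  na_prop = mean(is.na(x))       # mean of a logical = proportion of NAs
))
round(desc, 2)
```

The proportion of missing values (‘na_prop’) is particularly worth monitoring in ESM data, where non-response to beeps is common.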
Occurrences of values
Finally, we may also be interested in the occurrences of values for specific items, particularly with Likert scales or multiple-choice questions. In such a case, it is useful to investigate the number of unique values and to check whether some values have unexpectedly high or low numbers of occurrences. We can display either:
- the overall occurrences: the number of occurrences of each value over the whole dataset.
- the within-participant occurrences: the number of occurrences of each value for each participant in the dataset. In the output, the rows are the participants and the columns are the different values of the item.
table(data$location, useNA="ifany")
A B C D E <NA>
621 543 535 608 639 1254
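The within-participant occurrences described above can be obtained with a two-way ‘table()’, participants as rows and item values as columns. A sketch on a toy data frame standing in for ‘data’:

```r
# Within-participant occurrences: a two-way table with participants as
# rows and item values as columns (toy data standing in for 'data').
toy <- data.frame(
  id   = c(1, 1, 1, 2, 2, 2),
  item = c("A", "B", "B", "C", "C", NA)
)

# useNA = "ifany" adds an <NA> column so missing responses are counted too
table(toy$id, toy$item, useNA = "ifany")
```

On the real dataset, the equivalent call would be ‘table(data$id, data$location, useNA="ifany")’, which makes it easy to spot participants who never (or almost always) report a given value.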