First glimpse

Packages: dplyr, psych, skimr, Hmisc, esmtools


After importing your dataset or while preprocessing it, it is important to get a quick and efficient overview of your data. To this end, we demonstrate how to display meta-data (e.g., dimensions), display a subset of rows, compute common descriptive statistics (e.g., mean, standard deviation), and compute occurrences of values. Beyond providing a good understanding of your dataset, this often reveals issues in the data (e.g., incorrect minimum or maximum values, the loss of many rows after a data manipulation, or an unexpectedly high number of occurrences of one value of a categorical variable).

Meta-data

Three important meta-data aspects to check are the number of rows and columns, the format of the columns, and the number of observations per participant. This can be done using:

  • ‘dim()’: returns the number of rows (first number) and the number of columns (second number). It helps to quickly check whether those numbers are as expected (e.g., after a data modification). You can also retrieve the number of rows alone with ‘nrow()’ and the number of columns with ‘ncol()’ (see the short sketch after the output below).

dim(data)
[1] 4200   18
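
If you only need one of the two dimensions, the dedicated base R functions can be called directly on the same data object (the counts shown in the comments come from the ‘dim()’ output above):

nrow(data)   # number of rows (4200 here)
ncol(data)   # number of columns (18 here)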
  • ‘str()’: returns the columns’ formats and their first values. It is particularly useful to check whether variables are in the correct format (e.g., integer, character, POSIXct). The same overview can be obtained with ‘glimpse()’ from the dplyr package (see the sketch after the output below).

str(data)
'data.frame':   4200 obs. of  18 variables:
 $ dyad     : num  1 1 1 1 1 1 1 1 1 1 ...
 $ role     : int  1 1 1 1 1 1 1 1 1 1 ...
 $ obsno    : int  1 2 3 4 5 6 7 8 9 10 ...
 $ id       : num  1 1 1 1 1 1 1 1 1 1 ...
 $ age      : int  40 40 40 40 40 40 40 40 40 40 ...
 $ cond_dyad: chr  "condB" "condB" "condB" "condB" ...
 $ scheduled: POSIXct, format: "2018-10-17 08:00:08" "2018-10-17 09:00:01" ...
 $ sent     : POSIXct, format: "2018-10-17 08:00:11" "2018-10-17 09:00:22" ...
 $ start    : POSIXct, format: NA NA ...
 $ end      : POSIXct, format: NA NA ...
 $ contact  : int  NA NA NA 0 NA NA 0 0 0 NA ...
 $ PA1      : int  NA NA NA 1 NA NA 1 1 1 NA ...
 $ PA2      : int  NA NA NA 11 NA NA 1 1 1 NA ...
 $ PA3      : int  NA NA NA 25 NA NA 5 7 16 NA ...
 $ NA1      : int  NA NA NA 10 NA NA 30 30 43 NA ...
 $ NA2      : int  NA NA NA 16 NA NA 1 13 23 NA ...
 $ NA3      : int  NA NA NA 28 NA NA 35 41 46 NA ...
 $ location : chr  NA NA NA "A" ...
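
A sketch of the dplyr alternative mentioned above; ‘glimpse()’ prints one line per column with its format and first values (output not shown here):

library(dplyr)
glimpse(data)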
  • Number of rows per participant: with base R functions or a dplyr pipeline (sketched after the output below). Be aware that, in the base R version, the output displays each id number above the number of rows for that participant. In the dplyr version, ‘n()’ computes the number of rows for each group, so here for each participant.

sapply(split(data$id, data$id), length)
 1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 
70 70 70 70 70 70 70 70 70 70 70 70 70 70 70 70 70 70 70 70 70 70 70 70 70 70 
27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 
70 70 70 70 70 70 70 70 70 70 70 70 70 70 70 70 70 70 70 70 70 70 70 70 70 70 
53 54 55 56 57 58 59 60 
70 70 70 70 70 70 70 70 
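
A minimal sketch of the dplyr version mentioned above, grouping by participant and counting rows with ‘n()’ (the column name n_obs is an arbitrary choice):

library(dplyr)
data %>%
  group_by(id) %>%
  summarise(n_obs = n())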

Display rows

A direct inspection of the observations/rows can also be useful for a quick check in many situations, such as after importing data or creating new variables. The most common practice is to use ‘head()’ and ‘tail()’ to display the first or last rows of the dataset, respectively. However, only displaying the first or last rows of a dataset may not give a representative picture of it and, worse, may mask problematic patterns, outliers, or other data quality issues. Hence, displaying a set of random rows can be a good alternative.

  • ‘head()’ and ‘tail()’: return, respectively, the first and last n rows (by default n=6). We can change n as follows: ‘head(data, n=10)’ or ‘tail(data, n=10)’.

head(data, n=5) 
  dyad role obsno id age cond_dyad           scheduled                sent
1    1    1     1  1  40     condB 2018-10-17 08:00:08 2018-10-17 08:00:11
2    1    1     2  1  40     condB 2018-10-17 09:00:01 2018-10-17 09:00:22
3    1    1     3  1  40     condB 2018-10-17 09:59:56 2018-10-17 10:00:08
4    1    1     4  1  40     condB 2018-10-17 10:59:48 2018-10-17 10:59:52
5    1    1     5  1  40     condB 2018-10-17 12:00:12 2018-10-17 12:00:15
                start                 end contact PA1 PA2 PA3 NA1 NA2 NA3
1                <NA>                <NA>      NA  NA  NA  NA  NA  NA  NA
2                <NA>                <NA>      NA  NA  NA  NA  NA  NA  NA
3                <NA>                <NA>      NA  NA  NA  NA  NA  NA  NA
4 2018-10-17 11:00:12 2018-10-17 11:03:01       0   1  11  25  10  16  28
5                <NA>                <NA>      NA  NA  NA  NA  NA  NA  NA
  location
1     <NA>
2     <NA>
3     <NA>
4        A
5     <NA>
  • Random rows: the esmtools package provides functions to display (see the sketch after the output below):
    • (1) random rows: with the function ‘randrows()’, you can display n randomly selected rows from the dataset.
    • (2) one random set of consecutive rows: with the function ‘folrows()’, you can display one randomly selected set of n consecutive rows.
    • (3) multiple random sets of consecutive rows: with the function ‘folrows()’ and its nb_sample argument, you can display nb_sample randomly selected sets of n consecutive rows.

library(esmtools)
randrows(data, n=5)
     dyad role obsno id age cond_dyad           scheduled                sent
3711   27    2     1 54  36     condB 2018-09-18 08:00:04 2018-09-18 08:00:12
273     2    2    63  4  25     condB 2018-08-21 10:00:08 2018-08-21 10:00:19
3863   28    2    13 56  48     condA 2018-10-29 11:00:01 2018-10-29 11:00:23
321     3    1    41  5  25     condB 2018-03-12 09:00:17 2018-03-12 09:00:28
607     5    1    47  9  48     condA 2018-02-11 10:00:04 2018-02-11 10:00:21
                   start                 end contact PA1 PA2 PA3 NA1 NA2 NA3
3711                <NA>                <NA>      NA  NA  NA  NA  NA  NA  NA
273  2018-08-21 10:00:30 2018-08-21 10:01:19       0  20  22   1   1   1  93
3863 2018-10-29 11:00:30 2018-10-29 11:03:20       0  27  18   1   1  10  79
321                 <NA>                <NA>      NA  NA  NA  NA  NA  NA  NA
607                 <NA>                <NA>      NA  NA  NA  NA  NA  NA  NA
     location
3711     <NA>
273         D
3863        D
321      <NA>
607      <NA>
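
Calls to ‘folrows()’ follow the same pattern; a minimal sketch, assuming the n and nb_sample arguments described above (output not shown here):

folrows(data, n=5)                # one random set of 5 consecutive rows
folrows(data, n=5, nb_sample=3)   # 3 random sets of 5 consecutive rows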

Common descriptive functions

There are multiple functions from many packages that can be used to compute descriptive statistics (e.g., mean, median, quantiles, proportion of missing values) over all variables of a data frame. Computing descriptive statistics is a useful way to check assumptions about the data (e.g., ranges of values), to detect potential issues (e.g., a large number of missing values), or to get a quick overview of the data. Here, we show four functions from different packages, starting with the base R ‘summary()’ (the remaining three are sketched after the output below):

summary(data)
      dyad           role         obsno            id             age       
 Min.   : 1.0   Min.   :1.0   Min.   : 1.0   Min.   : 1.00   Min.   :25.00  
 1st Qu.: 8.0   1st Qu.:1.0   1st Qu.:18.0   1st Qu.:15.75   1st Qu.:25.00  
 Median :15.5   Median :1.5   Median :35.5   Median :30.50   Median :35.50  
 Mean   :15.5   Mean   :1.5   Mean   :35.5   Mean   :30.50   Mean   :35.10  
 3rd Qu.:23.0   3rd Qu.:2.0   3rd Qu.:53.0   3rd Qu.:45.25   3rd Qu.:42.25  
 Max.   :30.0   Max.   :2.0   Max.   :70.0   Max.   :60.00   Max.   :65.00  
                                                                            
  cond_dyad           scheduled                     
 Length:4200        Min.   :2018-02-02 08:59:47.00  
 Class :character   1st Qu.:2018-07-20 09:00:01.00  
 Mode  :character   Median :2018-09-13 11:00:13.50  
                    Mean   :2018-09-08 11:33:24.83  
                    3rd Qu.:2018-10-24 02:59:52.50  
                    Max.   :2019-06-10 11:59:46.00  
                    NA's   :2                       
      sent                            start                       
 Min.   :2018-02-02 08:59:51.00   Min.   :2018-02-02 09:00:31.00  
 1st Qu.:2018-07-20 09:00:18.75   1st Qu.:2018-07-22 12:00:37.25  
 Median :2018-09-13 11:30:18.00   Median :2018-09-14 22:00:28.00  
 Mean   :2018-09-08 11:54:09.94   Mean   :2018-09-14 18:49:31.13  
 3rd Qu.:2018-10-23 17:00:04.00   3rd Qu.:2018-10-31 13:00:30.50  
 Max.   :2019-06-10 11:59:54.00   Max.   :2019-06-10 12:00:15.00  
                                  NA's   :1254                    
      end                            contact            PA1        
 Min.   :2018-02-02 09:03:07.00   Min.   :0.0000   Min.   :  1.00  
 1st Qu.:2018-07-22 12:01:49.25   1st Qu.:0.0000   1st Qu.:  4.00  
 Median :2018-09-14 22:02:02.00   Median :0.0000   Median : 18.00  
 Mean   :2018-09-14 18:51:14.74   Mean   :0.1229   Mean   : 23.09  
 3rd Qu.:2018-10-31 13:02:31.00   3rd Qu.:0.0000   3rd Qu.: 32.00  
 Max.   :2019-06-10 12:02:30.00   Max.   :1.0000   Max.   :100.00  
 NA's   :1254                     NA's   :1254     NA's   :1254    
      PA2              PA3              NA1              NA2      
 Min.   :  1.00   Min.   :  1.00   Min.   :  1.00   Min.   : 1.0  
 1st Qu.:  3.00   1st Qu.:  3.00   1st Qu.:  1.00   1st Qu.: 1.0  
 Median : 19.00   Median : 16.00   Median : 11.00   Median : 7.0  
 Mean   : 21.77   Mean   : 23.32   Mean   : 21.36   Mean   :10.5  
 3rd Qu.: 33.00   3rd Qu.: 31.00   3rd Qu.: 31.00   3rd Qu.:15.0  
 Max.   :100.00   Max.   :100.00   Max.   :100.00   Max.   :83.0  
 NA's   :1254     NA's   :1254     NA's   :1254     NA's   :1254  
      NA3           location        
 Min.   :  1.00   Length:4200       
 1st Qu.: 40.00   Class :character  
 Median : 72.00   Mode  :character  
 Mean   : 63.72                     
 3rd Qu.: 89.00                     
 Max.   :100.00                     
 NA's   :1254                       
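
The outputs of the other three functions are not reproduced here; given the packages listed at the top of this section, the usual candidates are ‘describe()’ from psych, ‘skim()’ from skimr, and ‘describe()’ from Hmisc, which mainly differ in which statistics they report and how the output is formatted:

psych::describe(data)   # mean, sd, skew, kurtosis, etc. per numeric variable
skimr::skim(data)       # compact summary including missingness per variable
Hmisc::describe(data)   # counts, missing values, and frequencies per variable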

Occurrences of values

Finally, we may also be interested in the occurrences of values for specific items, particularly with Likert scales or multiple-choice questions. In such cases, it is useful to check the number of unique values and to inspect whether some values have an unexpectedly high or low number of occurrences. We can display either:

  • the overall occurrences: the number of occurrences of each value over the whole dataset.
  • the within-participant occurrences: the number of occurrences of each value for each participant in the dataset (sketched after the example below). In the output of the function, the rows are the participants and the columns are the different values of the item.

table(data$location, useNA="ifany")

   A    B    C    D    E <NA> 
 621  543  535  608  639 1254
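
For the within-participant occurrences, a cross-tabulation of the participant identifier against the item gives one row per participant and one column per value of the item; a minimal sketch using base R (output not shown here):

table(data$id, data$location, useNA="ifany")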