ESM Preprocessing Gallery – session_and_dataset

Session and Dataset info

Packages: esmtools, dplyr

Reporting information about the R session and about the dataframe helps to track modifications and to reproduce the analysis. It is good practice to call the two functions described below at the end of the R Markdown documents you use to preprocess or analyze the data.

R session information

When sharing R code, it is important to report the session information because it records the versions of R and of the packages used in the analysis. This information allows others to reproduce the analysis and to troubleshoot issues caused by version differences. Additionally, reporting session information demonstrates transparency and helps to build trust in the research.

The simplest function for this purpose is ‘sessionInfo()’. It ships with R (in the utils package, which is attached by default) and displays information about the current R session, including the version of R, the platform and operating system, the BLAS/LAPACK libraries, the locale and time zone, and the versions of the attached and loaded packages.

sessionInfo()

R version 4.3.0 (2023-04-21)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 22.04.4 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 
LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.20.so;  LAPACK version 3.10.0

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

time zone: Etc/UTC
tzcode source: system (glibc)

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] dplyr_1.1.2      kableExtra_1.3.4

loaded via a namespace (and not attached):
 [1] vctrs_0.6.2       httr_1.4.6        svglite_2.1.1     cli_3.6.1        
 [5] knitr_1.43        rlang_1.1.1       xfun_0.39         stringi_1.7.12   
 [9] generics_0.1.3    jsonlite_1.8.5    glue_1.6.2        colorspace_2.1-0 
[13] htmltools_0.5.5   fansi_1.0.4       scales_1.2.1      rmarkdown_2.22   
[17] tibble_3.2.1      evaluate_0.21     munsell_0.5.0     fastmap_1.1.1    
[21] lifecycle_1.0.3   stringr_1.5.0     compiler_4.3.0    rvest_1.0.3      
[25] pkgconfig_2.0.3   htmlwidgets_1.6.2 rstudioapi_0.14   systemfonts_1.0.4
[29] digest_0.6.31     viridisLite_0.4.2 R6_2.5.1          tidyselect_1.2.0 
[33] utf8_1.2.3        pillar_1.9.0      magrittr_2.0.3    webshot_0.5.4    
[37] tools_4.3.0       xml2_1.3.4
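If the session information should also be kept outside the rendered report, it can be written to a text file. The snippet below is a minimal sketch using only functions that ship with R; the file name session_info.txt and the use of a temporary directory are assumptions made for illustration, not part of the original workflow.

```r
# Capture the printed session information as a character vector
session_lines <- capture.output(sessionInfo())

# Write it to a text file (hypothetical file name, stored in a temp directory)
out_file <- file.path(tempdir(), "session_info.txt")
writeLines(session_lines, out_file)

# Individual pieces can also be queried directly
r_version <- R.version.string                          # version of R
stats_version <- as.character(packageVersion("stats")) # version of one package
```

In a real project, out_file would more likely point to a folder that is shared together with the analysis scripts.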

Dataset information

Sharing information about the dataset is crucial for improving transparency, reproducibility, and traceability in research. Traceability is particularly important when sharing the dataset or when checking whether it has been modified.

The ‘dataInfo()’ function from the esmtools package provides a comprehensive summary of the dataset, including:

the path of the file.
the file size.
the creation and update times of the data file.
the number of columns and rows.
the number of participants.
the average number of observations per participant.
the mean compliance.
the data collection period.
the variable names.
and any associated URL, DOI, or citation links for the dataset.

For more information, see the documentation of the ‘dataInfo()’ function.

Note

The function passed to the ‘read_fun’ argument (named ‘read’ in the ‘dataInfo()’ example below) must import the dataset correctly so that ‘dataInfo()’ can process it. In that example, the timestamp variables are converted inside ‘read’, which guarantees that they are interpreted as dates and times rather than as plain numbers or text.
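To see in isolation what such a conversion does, here is a minimal sketch with illustrative Unix timestamps (seconds since 1970-01-01 UTC) stored as text. The values are made up and not taken from the dataset, and tz = "UTC" is an assumption added so that the result does not depend on the local time zone.

```r
# Illustrative Unix timestamps stored as character strings (not from the dataset)
raw <- c("1700000000", "1700003600")

# Convert to numeric first, then to POSIXct date-times
ts <- as.POSIXct(as.numeric(raw), origin = "1970-01-01", tz = "UTC")

class(ts)                          # "POSIXct" "POSIXt"
format(ts, "%Y-%m-%d %H:%M:%S")    # "2023-11-14 22:13:20" "2023-11-14 23:13:20"
```

Once the column is POSIXct, functions such as range() or difftime() operate on it as time data.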

# To install the package:
# remotes::install_gitlab("ppw-okpiv/researchers/u0148925/esmtools", host="https://gitlab.kuleuven.be", force=TRUE)

# Import the packages
library(esmtools)
library(dplyr)  # provides %>%, mutate() and across(), used below

# Define the function (and its arguments) used to read the dataset
read <- function(x) read.csv(x, sep=";", dec=".") %>%
    # Reformat the timestamp variables
    mutate(across(c(scheduled,sent,start,end), ~as.POSIXct(as.numeric(.), origin="1970-01-01")))

# Display the information on the dataset
dataInfo(file_path="data/data_sim.csv", read_fun=read,
         idvar="id", timevar="scheduled")

Path : data/data_sim.csv 
Extension : csv 
Size : 349.68 Kb 
Creation time : 2024-11-13 14:27:59.694621 
Update time : 2024-11-13 14:27:59.694621 
ncol : 18 
nrow : 4200 
Number participants : 60 
Average number obs : 70 
Period : from 2018-02-02 08:59:47 to 2019-06-10 11:59:46 
Variables : dyad, role, obsno, id, age, cond_dyad, scheduled, sent, start, end, contact, PA1, PA2, PA3, NA1, NA2, NA3, location
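Several of the fields reported above can be cross-checked by hand with base R. The sketch below is not how ‘dataInfo()’ is implemented; it only illustrates where such numbers can come from. The toy dataset, its column names, and the temporary file are hypothetical.

```r
# Build a tiny hypothetical dataset and write it to a temporary CSV file
toy <- data.frame(
  id = rep(1:3, each = 4),
  scheduled = rep(as.integer(1700000000 + 3600 * 0:3), times = 3),
  PA1 = c(10, 20, 30, 40, 50, 60, 70, 80, 90, 15, 25, 35)
)
path <- tempfile(fileext = ".csv")
write.table(toy, path, sep = ";", dec = ".", row.names = FALSE)

# File-level metadata
info <- file.info(path)
size_bytes <- info$size     # file size in bytes
updated_at <- info$mtime    # last modification time

# Data-level summary
df <- read.csv(path, sep = ";", dec = ".")
n_participants <- length(unique(df$id))
avg_obs <- nrow(df) / n_participants
period <- range(as.POSIXct(df$scheduled, origin = "1970-01-01", tz = "UTC"))

cat("ncol :", ncol(df), "\n")
cat("nrow :", nrow(df), "\n")
cat("Number participants :", n_participants, "\n")
cat("Average number obs :", avg_obs, "\n")
```

Comparing these values with the ‘dataInfo()’ output is a quick way to confirm that the file was read as intended.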