ESM Preprocessing Gallery

Import data

Packages: dplyr, readr

Probably one of the first tasks you will do is to import your dataset into your R session. Many file formats exist, along with many R functions to read them. We briefly introduce them here.

Import a csv file

A common and practical file format to store ESM data is the csv (comma-separated values) file. To import a csv file, we can use the base R functions ‘read.csv()’ or ‘read.csv2()’. Those functions are built on the ‘read.table()’ function and differ in the default values of two arguments: ‘sep’ (i.e., the character that separates the values) and ‘dec’ (i.e., the character used as decimal mark):

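‘read.csv()’ has as default arguments sep=“,” and dec=“.”. It is equivalent to ‘read.table(file, header=TRUE, sep=“,”, dec=“.”)’, apart from a few other defaults (quoting, comment character, filling of incomplete rows).
‘read.csv2()’ has as default arguments sep=“;” and dec=“,”. It is equivalent to ‘read.table(file, header=TRUE, sep=“;”, dec=“,”)’, with the same caveat.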
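
As a quick check of these defaults, the sketch below writes the same small table to two temporary files, one per convention, and reads each back. The data frame ‘toy’ and the temporary files are made up for illustration and are not part of the gallery data.

```r
# Hypothetical toy data, used only to illustrate the two conventions
toy <- data.frame(id = 1:3, PA1 = c(1.5, 2.25, 3))

f_comma <- tempfile(fileext = ".csv")
f_semi  <- tempfile(fileext = ".csv")

# write.csv() uses "," and "."; write.csv2() uses ";" and ","
write.csv(toy, f_comma, row.names = FALSE)
write.csv2(toy, f_semi, row.names = FALSE)

# Each reader matches the convention of its file
d1 <- read.csv(f_comma)
d2 <- read.csv2(f_semi)

# Same result with read.table() once header, sep and dec are set explicitly
d3 <- read.table(f_semi, header = TRUE, sep = ";", dec = ",")

all.equal(d1, d2)
all.equal(d2, d3)
```

Both comparisons should report that the data frames match, which is what makes the choice between ‘read.csv()’ and ‘read.csv2()’ purely a matter of how the file was written.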

As an illustration, we want to import the file ‘data_sim.csv’ (can be downloaded above). This file contains values separated by “;” and decimals indicated by “,”. The function that fits these settings is ‘read.csv2()’.

data = read.csv2(file="data/data_sim.csv")

Warning

Note that directly after importing data, it is recommended to inspect it to see if it has been imported correctly (see first glimpse). Also, you should check the warning messages that might appear during the import process.
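
A first inspection can be as short as the sketch below. To keep the snippet runnable on its own, it imports a tiny stand-in file written to a temporary location; the file contents and the object name ‘data_check’ are hypothetical, and in practice the same calls would be applied to the imported dataset.

```r
# Stand-in for the imported file: a tiny semicolon-separated file (hypothetical values)
tmp <- tempfile(fileext = ".csv")
writeLines(c("id;PA1;location",
             "1;2,5;home",
             "2;3,0;work",
             "3;;home"), tmp)

data_check <- read.csv2(tmp)

dim(data_check)             # number of rows and columns
str(data_check)             # column types and first values
head(data_check)            # first rows
colSums(is.na(data_check))  # number of missing values per column
```

If a column that should be numeric shows up as character in ‘str()’, the separator or decimal mark probably does not match the file.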

Additional arguments

Some functions from other packages offer additional arguments (e.g., the code used for missing values, the column types) when importing csv files. A popular choice is the readr package, which provides the ‘read_delim()’ function and two of its variants: ‘read_csv()’ and ‘read_csv2()’. They play the same role as ‘read.csv()’ and ‘read.csv2()’ but come with many useful arguments, such as:

Missing value code using the ‘na’ argument. We can specify which codes represent missing values in the file (e.g., ‘-999’ or ‘na’).
Column type with the ‘col_types’ argument and the ‘cols()’ function. We can specify ‘n’ for number, ‘d’ for double, ‘i’ for integer, ‘c’ for character, ‘f’ for factor, ‘D’ for date, ‘T’ for datetime, etc. For instance, if the variable PA1 is expected to be numeric: ‘col_types = cols(PA1=“n”)’. In the ‘cols()’ function, the ‘.default’ argument can be used to specify the default column type.

library(readr)
df = read_csv2(file="data/data_sim.csv",
    na=c("-999", "__na__"),              # Specify missing value format
    col_types = cols(.default = "i",     # Specify default column types, here integer
                     cond_dyad = "c", location="c"))

This function has further arguments that can be found in the function’s documentation: https://readr.tidyverse.org/reference/read_delim.html.
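
To see the effect of these two arguments without the gallery file, here is a minimal self-contained sketch. The file contents and the object name ‘toy’ are hypothetical; ‘spec()’ and ‘problems()’ are readr helpers for checking the column specification that was used and any parsing failures.

```r
library(readr)

# Hypothetical file contents; "-999" and "__na__" stand for missing values
tmp <- tempfile(fileext = ".csv")
writeLines(c("id;PA1;location",
             "1;4;home",
             "2;-999;work",
             "3;2;__na__"), tmp)

toy <- read_csv2(tmp,
                 na = c("-999", "__na__"),           # both codes become NA
                 col_types = cols(.default = "i",    # integer unless stated otherwise
                                  location = "c"))   # location read as character

spec(toy)      # column specification that was actually used
problems(toy)  # parsing failures, if any (no rows when all went well)
```

Checking ‘problems()’ right after the import is a cheap way to catch values that do not fit the declared column type.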

Import other types of files

In case we have to import other types of files, here are some useful functions:

Function	Type of file	Package	Description
read.delim	.txt ; .csv	utils (base R)	For files whose separator is not a comma or a semicolon (tab by default; set ‘sep’ for other characters)
read.table	.txt ; .csv ; .dat	utils (base R)	General function that allows specifying more arguments
read_sas	.sas7bdat	haven	SAS data
read_stata	.dta	haven	Stata data
read_spss	.sav	haven	SPSS data
read_excel	.xls ; .xlsx	readxl	Excel files
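
As an illustration of the first two rows of the table, the sketch below reads a text file that uses a less common separator. The pipe-separated file is made up for this example; only base R functions are used.

```r
# Hypothetical pipe-separated text file
tmp <- tempfile(fileext = ".txt")
writeLines(c("id|PA1|location",
             "1|2.5|home",
             "2|3.0|work"), tmp)

# read.delim() assumes a header and a tab separator by default; sep overrides the latter
d_delim <- read.delim(tmp, sep = "|")

# read.table() needs header = TRUE stated explicitly
d_table <- read.table(tmp, header = TRUE, sep = "|")

all.equal(d_delim, d_table)
```

The haven and readxl functions listed above follow the same pattern: the file path is the first argument and the result is a data frame.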