Codebook packages

Packages: dplyr, psych, naniar, dataMaid, codebookr, codebook, labelled


A codebook is a document that describes the variables in a dataset and allows a quick investigation of its content. It ensures a good understanding of the data, gives researchers the background information needed to conduct analyses, and supports reuse of the data. It typically includes information such as the variable name, label, data type, and possible values. It is important, however, to adapt it to the particularities of the dataset and the study.

In particular, if the data come from SPSS (.sav format), they may already contain information on the variables. In other words, the variables have attributes that can be extracted into a codebook using R (see: https://www.alexcernat.com/easy-way-to-make-a-codebook-in-r/).
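
As a minimal sketch of that idea in base R, the snippet below collects the "label" attribute of each column into a small codebook table. The toy data frame, its labels, and the helper get_label() are made up for illustration; with a real SPSS file imported through a package such as haven, the variable labels would normally already be stored in that attribute.

```r
# Toy data standing in for an imported SPSS file (values and labels are hypothetical)
toy = data.frame(id = c(1, 2, 3), PA1 = c(4, NA, 6))
attr(toy$id, "label") = "Participant identifier"
attr(toy$PA1, "label") = "First positive affect item"

# Hypothetical helper: return the "label" attribute of a column, or NA when none is set
get_label = function(x) {
  lab = attr(x, "label", exact = TRUE)
  if (is.null(lab)) NA_character_ else as.character(lab)
}

# One row per variable: name, label, and data type
label_table = data.frame(
  variable = names(toy),
  label = vapply(toy, get_label, character(1)),
  class = vapply(toy, function(x) class(x)[1], character(1)),
  row.names = NULL,
  stringsAsFactors = FALSE
)
print(label_table)
```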

For the demonstration, we only took a subset of our dataset. With real data, you should report as many variables as possible, within the limits of ethical concerns.

Dataframe

The easiest way to create a codebook is to store information and descriptive statistics in a dataframe and export it. We can start building the codebook by running a descriptive statistics function on the data. The result is converted to a dataframe (if needed) and stored.

# Create codebook based on descriptive function
codebook = as.data.frame(psych::describe(data))

# Extract the row.names (if needed)
codebook$variable = row.names(codebook)
row.names(codebook) = c(1:nrow(codebook))

For other descriptive statistics functions, see the first glimpse section.

Building on this, we can add further information to the output, such as a description or a label column. To do so, we only need to select a variable (row) and the new column (e.g., label, description, issues, etc.).

codebook[codebook$variable=="id", "Description"] = "Number pre-defined before the study"
codebook[codebook$variable=="obsno", "Description"] = "Computed based on the sent variables"
codebook[codebook$variable=="sent", "Description"] = ""
codebook[codebook$variable=="start", "Description"] = "Records the exact time the participant opens the application to start answering."
codebook[codebook$variable=="PA1", "Description"] = ""
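
When many variables need a description, repeating one assignment per variable becomes tedious. A possible alternative, sketched here on a toy codebook whose content is hypothetical, is to keep the descriptions in a named vector and look them up by variable name.

```r
# Toy codebook with one row per variable (hypothetical content)
toy_codebook = data.frame(
  variable = c("id", "obsno", "sent"),
  n = c(100, 100, 98),
  stringsAsFactors = FALSE
)

# Named vector: names are variable names, values are descriptions
descriptions = c(
  id = "Number pre-defined before the study",
  obsno = "Computed based on the sent variables"
)

# Look up each variable by name; variables without an entry get NA
toy_codebook$Description = unname(descriptions[toy_codebook$variable])
print(toy_codebook)
```

Variables that are not listed in the vector (here sent) simply receive NA, which makes undocumented variables easy to spot.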

Going further, you can merge the output of other functions into the codebook dataframe, for instance the percentage of missing values in each variable (which is not reported by the describe() function from the psych R package).

# Compute the percentage of missing values per variables
library(naniar)
perc_miss = apply(data, 2, pct_miss)

# Integrate the results in the codebook dataframe
codebook[, "perc_miss"] = perc_miss
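
If installing naniar is not an option, the same quantity can be computed in base R, since the mean of a logical missingness indicator is the proportion of missing values. The sketch below uses a small made-up data frame, and the name perc_miss_base is only illustrative.

```r
# Toy data with some missing values (hypothetical)
toy = data.frame(id = 1:4, PA1 = c(3, NA, 5, NA), PA2 = c(NA, 2, 2, 2))

# is.na() gives TRUE/FALSE per cell; colMeans() averages them per column
perc_miss_base = colMeans(is.na(toy)) * 100
print(perc_miss_base)
# id: 0, PA1: 50, PA2: 25
```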

Then, we can reorder the dataframe columns (to follow an intuitive order) and, finally, export the dataframe as a CSV, PDF, Word, or HTML file.

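# Reorder the columns (%>% is made available by dplyr)
library(dplyr)
codebook = codebook %>% dplyr::select(variable, n:se, perc_miss, Description)

# Export (the "results" folder must already exist)
write.csv(codebook, "results/codebook.csv", row.names = FALSE)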
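
Note that write.csv() fails if the target folder does not exist. The sketch below, which uses a made-up codebook and writes to a temporary folder, creates the folder first and leaves out the row names so that the file only contains the codebook columns.

```r
# Toy codebook (hypothetical content)
toy_codebook = data.frame(variable = c("id", "PA1"), n = c(100, 98))

# A temporary folder is used here; in a project this could be "results"
out_dir = file.path(tempdir(), "results")
if (!dir.exists(out_dir)) dir.create(out_dir, recursive = TRUE)

# Write the codebook without the automatic row-name column
out_file = file.path(out_dir, "codebook.csv")
write.csv(toy_codebook, out_file, row.names = FALSE)

# Read the file back to check the export
print(read.csv(out_file))
```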

Here is a look at the created codebook:

dataMaid package

With the makeCodebook() function from the dataMaid package, you can generate a codebook automatically. It summarises each variable in the dataframe, including the variable name, label, data type, and number of missing values. In addition, you can choose the output format. The PDF and Word files will contain a unique page; in contrast, the HTML output splits the [...]. The report contains four parts:

  • Data report overview: number of variables / observations
  • Codebook summary table: label, variable name, class, number of unique values, missingness and other description
  • Variable list: for each variable, displays a table with some descriptive statistics along with a distribution plot
  • Report generation info: user, platform and session info of the person that created the codebook.

library(dataMaid)
makeCodebook(data)

You can add further information to the codebook by attaching attributes to the variables. To add an attribute to a variable, we use the attr() function and specify the targeted variable and the name of the attribute to change (e.g., “label”, “shortDescription”).

attr(data$id, "shortDescription") = "Participant number defined using ..."

You then have to rerun the makeCodebook() function; the new codebook will include this element in the summary of the variable.
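
To document several variables at once, the attr() calls can be driven by a named list. This is only a sketch on a made-up data frame: the descriptions are placeholders, and the code only sets the attributes without calling dataMaid.

```r
# Toy data (hypothetical)
toy = data.frame(id = 1:3, sent = c("a", "b", "c"))

# Names are variable names, values are placeholder short descriptions
short_descriptions = list(
  id = "Placeholder description of id",
  sent = "Placeholder description of sent"
)

# Attach each description as a "shortDescription" attribute of its column
for (v in names(short_descriptions)) {
  attr(toy[[v]], "shortDescription") = short_descriptions[[v]]
}

# Check that the attribute is stored on the column
print(attr(toy$id, "shortDescription"))
```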

Here is a look at the created codebook:

Codebookr package

The codebookr package is an R package that creates a codebook from a data frame. It is more flexible as it provides several options for customizing the codebook, such as selecting which variables to include, grouping variables, adding annotations and descriptions, and choosing the output format (e.g., HTML, PDF, or Word).

library(codebookr)

To add information to the default codebook, you need to add attributes to the variables. To do so, pass the variable to cb_add_col_attributes() and set one or more of the following attributes:

  • description
  • source
  • col_type
  • value_labels
  • skip_pattern

library(dplyr) # makes %>% available
data = data %>%
  cb_add_col_attributes(
    id,
    description = "...",
    source = "...",
    other_attribute = "..."
  ) %>%
  cb_add_col_attributes(
    sent,
    description = "...",
    source = "...",
    other_attribute = "..."
  ) 

The default codebook includes:

  • A metadata table about the dataset, such as its name, size, number of columns, and number of rows.
  • The column attributes tables: each variable has a section that displays its name, data type, number of unique non-missing values, and total number of missing values.

If you have added attributes with the cb_add_col_attributes() or the attr() function, they will be displayed in the section of the respective variable.

To further customize the codebook() function, you can use the following arguments:

  • title: title of the document
  • subtitle: subtitle of the document
  • description: optional text description of the dataset that will appear on the first page of the Word codebook document
  • keep_blank_attributes: if TRUE, add Column description, Source information, Column type, value labels, and skip pattern rows to the column attributes tables, even when they are blank

codebook = codebookr::codebook(data, title="my title", subtitle="my_subtitle", description="my_description")

To create a Word document, use print():

print(codebook, "codebook.docx")

Here is a look at the created codebook:

For further information on this package see: https://brad-cannell.github.io/codebookr/

Codebook package

To add a variable label:

attributes(data$id)$label = "..."

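Or use the labelled package to set value labels for a variable, or variable labels for several variables at once:

library(labelled)
# Value labels for a single variable
val_labels(data$id) <- c("Very Inaccurate" = 1, "Very Accurate" = 6)

# Variable labels for several variables at once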

var_label(data) <- list(
        id = "Waste my time.", 
        sent = "Am exacting in my work."
)

Value labels:

val_labels(data$id) <- c("in high school" = 1,
                         "finished high school" = 2,
                         "some college" = 3,
                         "college graduate" = 4,
                         "graduate degree" = 5)

If the same value labels apply to multiple variables:

add_likert_labels <- function(x) {
  val_labels(x) <- c("Very Inaccurate" = 1, 
                  "Moderately Inaccurate" = 2, 
                  "Slightly Inaccurate" = 3,
                  "Slightly Accurate" = 4,
                  "Moderately Accurate" = 5,
                  "Very Accurate" = 6)
  x
}
likert_items = c("PA1", "PA2")
data = data %>% mutate_at(likert_items,  add_likert_labels)

Adding metadata to the dataframe:

library(codebook) # provides metadata()
metadata(data)$name <- "25 Personality items representing 5 factors"
metadata(data)$description <- "25 personality self report items taken from the International Personality Item Pool (ipip.ori.org)[...]"

Other identifiers:

metadata(data)$identifier <- "https://dx.doi.org/10.17605/OSF.IO/K39BG"
metadata(data)$creator <- "William Revelle"
metadata(data)$citation <- "Revelle, W., Wilt, J., & Rosenthal, A. (2010). Individual differences in cognition: New methods for examining the personality-cognition link. In A. Gruszka, G. Matthews, & B. Szymura (Eds.), Handbook of individual differences in cognition: Attention, memory, and executive control (pp. 27-49). New York, NY: Springer."
metadata(data)$url <- "https://CRAN.R-project.org/package=psych"
metadata(data)$datePublished <- "2010-01-01"
metadata(data)$temporalCoverage <- "Spring 2010" 
metadata(data)$spatialCoverage <- "Online" 

For further information on this package see: https://rubenarslan.github.io/codebook/articles/codebook_tutorial.html#loading-data-1