Renaming and relabeling
Packages: dplyr, tidyselect, forcats
Renaming variables is often done to make the variable names more meaningful. Similarly, changing the labels of categorical variables makes the levels more understandable to data users. Both operations also help to ensure the consistency of names across different datasets and to comply with specific formatting requirements for a particular analysis or publication.
Renaming variables
It is often recommended to convert variable names to lower case. It makes the data manipulation easier as we don’t end up with names mixing upper and lower cases. To do so, we can simply use the combination of ‘tolower()’ and ‘names()’ functions, as follows:
names(data) = tolower(names(data))
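The same conversion can also be written in a pipeline. The following is a minimal sketch using ‘rename_with()’ from dplyr; the toy data frame and its column names are made up for illustration.

```r
library(dplyr)

# Hypothetical toy data with mixed-case column names
toy <- data.frame(ID = 1:3, Age = c(31, 47, 62), PA1 = c(5, 3, 7))

# Apply tolower() to every column name
toy_lower <- toy %>% rename_with(tolower)

names(toy_lower)
```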
It is often necessary to rename variables for efficiency and clarity (e.g., irrelevant or unnecessarily long names), especially after having imported the data. There are several methods to rename variables in R, including:
- ‘names()’ function: with this function, you can assign new names to the columns of a dataframe. However, this method has a drawback. If the order of the columns changes in the input data, then you will need to update your code.
- ‘rename()’ function: this method has the advantage of directly linking each original variable name to its new name, so it does not depend on column order.
- ‘select()’ function: you can also rename variables while selecting them. Only the selected columns are kept.
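The ‘rename()’ and ‘select()’ methods listed above can be sketched as follows. This is a minimal illustration on a toy data frame; the original column names (‘participant_role’, ‘observation_number’, ‘unused’) are hypothetical.

```r
library(dplyr)

# Hypothetical toy data; column names are invented for illustration
toy <- data.frame(participant_role = c("A", "B"),
                  observation_number = c(1, 2),
                  unused = c(NA, NA))

# rename(): new_name = old_name; all other columns are kept
toy_renamed <- toy %>%
  rename(role = participant_role, obsno = observation_number)

# select(): keeps only the listed columns and renames them on the fly
toy_selected <- toy %>%
  select(role = participant_role, obsno = observation_number)
```

By contrast, the ‘names()’ approach shown next assigns names by position.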
# Rename specific variable name
names(data)[2] = "role"
# Rename multiple variables using a vector
names(data) = c("dyad", "role", "obsno", "id", "age", "cond_dyad",
                "scheduled", "sent", "start", "end", "contact",
                "PA1", "PA2", "PA3", "NA1", "NA2", "NA3",
                "location", "nationality")
Relabeling
Misspelled or inconsistent labels can cause issues in data analysis, especially when performing ‘group_by()’ operations, merging dataframes, or running statistical analyses. For instance, ‘Gender’ and ‘gender’ may refer to the same label, but R is case-sensitive and treats them as two different values. Note that the following methods can also be used to dichotomize or create categorical variables.
Below, we introduce several relabeling techniques suited to different purposes and scenarios:
- Relabel values in a dichotomic variable based on a simple boolean test
- Relabel values based on boolean tests using ‘ifelse()’
- Relabel values based on boolean tests using ‘case_when()’
- Relabel and collapse values using ‘fct_collapse()’
- Relabel all values across selected variables with ‘replace()’ and ‘across()’
- Reverse items
A simple boolean test to relabel
A simple logical test can be used to create a dichotomic variable based on a cut-off value. For instance, we dichotomize the ‘age’ variable using a cut-off value of 45. If the corresponding age value is greater than or equal to 45, the code returns 1, and 0 otherwise.
data$age_cat = as.numeric(data$age >= 45)
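To make the mechanics explicit, here is a small sketch on a made-up vector of ages (not the tutorial data). It also shows that missing ages stay missing after the conversion.

```r
# Hypothetical ages, including a missing value
age <- c(30, 45, 62, NA)

# The comparison yields TRUE/FALSE/NA; as.numeric() maps these to 1/0/NA
age_cat <- as.numeric(age >= 45)
```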
‘ifelse()’
The ‘ifelse()’ function can be used to assign new values to a categorical variable based on a condition, as follows: ifelse(condition, if_condition_TRUE, if_condition_FALSE).
We present three ways to use it:
- The basic usage.
- In ‘mutate()’ function (dplyr package).
- Nested ‘ifelse()’ functions: the FALSE branch returns another ‘ifelse()’ function, which overcomes the simple TRUE/FALSE dichotomy of this function.
data$age_cat = ifelse(data$age >= 50, ">=50", "<50")
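The two other usages listed above, ‘ifelse()’ inside ‘mutate()’ and nested ‘ifelse()’ functions, could look like the following minimal sketch. The toy data frame, the new column names, and the three-level labels are assumptions made for illustration.

```r
library(dplyr)

# Hypothetical toy data
toy <- data.frame(age = c(28, 41, 67))

toy <- toy %>%
  mutate(
    # ifelse() inside mutate(): columns are referenced without data$
    age_cat2 = ifelse(age >= 50, ">=50", "<50"),
    # Nested ifelse(): the FALSE branch holds a second test
    age_cat3 = ifelse(age < 35, "<35",
                      ifelse(age <= 50, "[35;50]", ">50"))
  )
```

Nesting works for a handful of categories but becomes hard to read beyond two or three levels.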
‘case_when()’
The ‘case_when()’ function is used to recode values in a categorical variable based on multiple conditions tested in sequence. It offers more flexibility than the ‘ifelse()’ function since it can handle any number of conditions without nesting.
For each condition (e.g., data$age < 35), a corresponding value is returned if the condition evaluates to TRUE (e.g., ~ "<35"). If the condition is not met, then the next condition is tested in the same way. Values that meet none of the conditions are set to NA.
data$age_cat = case_when(data$age < 35 ~ "<35",
                         data$age <= 50 ~ "[35;50]",
                         data$age <= 75 ~ "[51;75]")
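In the code above, ages above 75 match no condition and therefore become NA. A possible variant, sketched here on made-up data, uses ‘case_when()’ inside ‘mutate()’ and adds a catch-all condition at the end. The ">75" label is an assumption, not part of the original recoding.

```r
library(dplyr)

# Hypothetical toy data
toy <- data.frame(age = c(28, 41, 67, 80))

toy <- toy %>%
  mutate(age_cat = case_when(
    age < 35  ~ "<35",
    age <= 50 ~ "[35;50]",
    age <= 75 ~ "[51;75]",
    TRUE      ~ ">75"   # fallback for rows matching no earlier condition
  ))
```

Because conditions are tested in order, each row receives the label of the first condition it satisfies.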
‘fct_collapse()’
The function ‘fct_collapse()’ from the forcats package can be useful when relabeling categorical variables as it allows collapsing levels into new categories. This can be particularly useful for aggregating spelling errors or alternative forms of the same category into a single label, thus reducing the number of levels and making the data easier to work with.
For instance, our data contain different spelling variations of identical nationalities, with or without spelling errors (e.g., ‘BE’, ‘belgique’, ‘belgue’, ‘be’, ‘belge’ for Belgium). Hence, we want to associate a unique label (e.g., ‘be’) with this category of labels using ‘fct_collapse()’.
be_list = c("BE", "belgique", "belgue", "be", "belge")
fr_list = c("fr", "france")
swiss_list = c("swiss", "switzerland")
library(forcats)
data = data %>%
  mutate(nationality_new = fct_collapse(nationality, be = be_list, fr = fr_list, swiss = swiss_list))
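To see the effect in isolation, the following sketch applies ‘fct_collapse()’ to a small made-up factor rather than to the tutorial data. Levels that are not mentioned (here ‘swiss’) are left unchanged.

```r
library(forcats)

# Hypothetical nationality entries with spelling variants
nationality <- factor(c("BE", "belge", "france", "fr", "swiss", "belgique"))

# Each argument is new_level = c(old levels to merge)
nationality_new <- fct_collapse(nationality,
                                be = c("BE", "belge", "belgique"),
                                fr = c("fr", "france"))

table(nationality_new)
```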
Relabel all values with ‘replace()’ and ‘across()’
Using the ‘replace()’ function within the ‘across()’ function lets you select a set of variables and replace all values that exactly match a given value (e.g., ‘Tout a fait 7’) with a specific value (e.g., ‘7’).
library(tidyselect)
data = data %>%
  mutate(across(PA1:NA3, ~replace(., . == "Tout a fait 7" , "7")),
         across(PA1:NA3, ~replace(., . == "Pas du tout 1" , "1")))
The outcomes are in character format. You may need to convert them (e.g., with ‘as.numeric()’), as in our example, where those variables are expected to be numeric.
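One way to handle the conversion is to add a third ‘across()’ step that applies ‘as.numeric()’ once the text labels have been replaced. This is a minimal sketch on made-up data with only two columns; it assumes every remaining value is a string that represents a number.

```r
library(dplyr)

# Hypothetical toy data: items stored as character
toy <- data.frame(PA1 = c("Tout a fait 7", "4", "Pas du tout 1"),
                  PA2 = c("2", "Tout a fait 7", "5"),
                  stringsAsFactors = FALSE)

toy <- toy %>%
  mutate(across(PA1:PA2, ~ replace(.x, .x == "Tout a fait 7", "7")),
         across(PA1:PA2, ~ replace(.x, .x == "Pas du tout 1", "1")),
         # Convert the cleaned character columns to numeric
         across(PA1:PA2, as.numeric))
```

Any value that cannot be read as a number would become NA with a warning, so it is worth checking the columns for unexpected labels first.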
Reverse items
Similar to relabeling, reversing means inverting the values of a variable (e.g., 100 -> 1 for a scale going from 1 to 100). An easy way to reverse items is to use subtractions. For instance, if an item has a 1-7 Likert scale, you can reverse its score by subtracting each response from 8, as: 8 - item_value.
data$PA1_inv = 101 - data$PA1 # reverse slider 1-100
data$NA3_inv = 8 - data$NA3 # reverse Likert 1-7
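Both constants follow the same rule: the reversed score equals (scale minimum + scale maximum) minus the value, which gives 8 for a 1-7 scale and 101 for a 1-100 scale. The sketch below applies that rule to several items at once with ‘across()’. The toy data and the ‘scale_min’ and ‘scale_max’ helper variables are assumptions made for illustration.

```r
library(dplyr)

# Hypothetical toy data: two items on a 1-7 Likert scale
toy <- data.frame(NA1 = c(1, 4, 7), NA2 = c(2, 6, 7))

# Generic rule: reversed = (scale minimum + scale maximum) - value
scale_min <- 1
scale_max <- 7

toy <- toy %>%
  mutate(across(NA1:NA2,
                ~ (scale_min + scale_max) - .x,
                .names = "{.col}_inv"))  # keep originals, add *_inv columns
```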