In today’s lab, we are going to focus mostly on how we recode and change our variables.
Variable recoding is a critical part of “wrangling” our data - or in other words, getting our data into a format that is usable and informative. We want the variables in our dataset to accurately reflect the concepts that we are investigating and we want the variables to be in the proper format so that we can summarize and visualize our data in ways that are meaningful.
Variable recoding involves tasks such as:
We will start by importing a dataset using the import() function from the rio package, like we did in the last lab. Then we will move on to discuss:
In this lesson, I will introduce a number of functions. In order to help you understand how each function works, I will first review “How it works” by showing a code block with an explanation of the arguments of the function using intuitive naming. The code block under the “How it works” section for each function will NOT run because it doesn’t draw on any real datasets or variables. It is simply for illustrative purposes. After the “How it works” section, I will show an example of how to use the function (that WILL run) in the section titled “Example using data”.
Before we get going, remember the “best practices” for starting a new R Script/R Markdown document:
Let’s start by importing the data that we will use for this week’s lab.
You should navigate to this website to download the dataset: https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/KCIK9D The data will come in a “zipped” format. You simply need to click the folder to “unzip it” and then you’ll have a folder that includes the dataset and the codebook. Remember, you need to move it to the folder that you’re working in for today’s lab (your working directory).
library(rio)
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.0
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
dc_df <- import("DC 2021 v1.dta")
# I use dc_df to indicate "democracy check dataframe" but you can call
# the object where you save the data whatever you want.
class(dc_df)
## [1] "data.frame"
head(dc_df, 5)
# I didn't execute this code because it prints a LOT of info on the screen, but feel
# free to try it yourself.
There are a LOT of variables (302)!
Let’s take a few that we might need (after scrolling the codebook) using the select() function we learned last class.
dc_df2 <- dc_df %>%
select(ResponseId, age_in_years, dc21_province, dc21_education, dc21_democratic_sat, dc21_prov_gov_satis,
dc21_feedback_screen)
What do these variables mean? The codebook tells us… open up the codebook and have a look for yourself. Here are some examples:
- dc21_income_category: What was your total household income before taxes in 2020?
  - no income (1)
  - $1 - $30,000 (2)
  - $30,001 to $60,000 (3)
  - $60,001 to $90,000 (4)
  - $90,001 to $110,000 (5)
  - $110,001 to $150,000 (6)
  - $150,001 to $200,000 (7)
  - More than $200,000 (8)
Unique to the tidyverse approach to writing code is the pipe operator, %>%. In order to make code more readable, the tidyverse uses the pipe (which comes from the magrittr package) to help arrange code vertically and avoid having to create multiple dataframe objects (e.g. df1, df2, df3, … df78). The way to read the pipe operator is “then”. For example, “Select variables Gender, X, THEN filter to include only Female”. Let’s give it a try. We’re going to take our original dataframe, “dc_df”, THEN reduce it to seven variables (like we did above), THEN filter age_in_years to include only individuals whose age is equal to or greater than 30, THEN arrange the rows in our dataset by province, THEN rename the dc21_feedback_screen variable.
dc_df2 <- dc_df %>% # Start with original dataframe, dc_df, THEN (%>%)
select(ResponseId, age_in_years, dc21_province, dc21_education, dc21_democratic_sat, dc21_prov_gov_satis,
dc21_feedback_screen) %>% # select these variables, THEN (%>%)
filter(age_in_years >= 30) %>% # Keep only individuals who are 30 years or older
arrange(dc21_province) %>% # Arrange rows by province, THEN (%>%)
rename(interact_with_govt = dc21_feedback_screen) # Rename
head(dc_df2, 5)
## ResponseId age_in_years dc21_province dc21_education
## 1 R_02gwXGzYjbfGX6h 79 1 7
## 2 R_07j3ZCJh5sZJkD7 73 1 7
## 3 R_08KdUpB7yVM2DFn 33 1 9
## 4 R_08wMA25qY06uQ0N 39 1 9
## 5 R_0IJ83oYEbMT4PeN 51 1 6
## dc21_democratic_sat dc21_prov_gov_satis interact_with_govt
## 1 3 2 1
## 2 2 2 1
## 3 NA 2 2
## 4 NA 4 NA
## 5 NA 4 2
Using pipes in your code not only makes things easier to read and keeps your Global Environment cleaner, it also slightly changes how each function is written. More specifically, it eliminates the need for the data argument, since each function uses the output from the previous step. Notice how, for each function in the above code, we don’t specify the name of the dataframe as the first argument. That’s because the pipe starts with “dc_df”, then passes the output of each step on to the next one.
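To make the "data argument" point concrete, here is a minimal sketch that runs the same steps with and without pipes. The toy_df dataframe and its values are made up for illustration (they are not from the Democracy Checkup data), and the sketch assumes the dplyr package is installed.

```r
library(dplyr)

# Hypothetical toy data, made up for this illustration
toy_df <- data.frame(
  id = 1:6,
  age = c(25, 41, 33, 19, 58, 30),
  province = c("ON", "AB", "BC", "ON", "AB", "BC")
)

# With pipes: read top to bottom, saying "THEN" at each %>%
piped <- toy_df %>%
  filter(age >= 30) %>%
  arrange(province) %>%
  select(id, province)

# Without pipes: the same steps, but nested and read inside-out.
# Here the dataframe has to be written as the first argument of filter().
nested <- select(arrange(filter(toy_df, age >= 30), province), id, province)

# Both approaches give the same result
identical(piped, nested)
```

The nested version does the same work, but you have to read it from the innermost function outward, which is why the piped version is usually easier to follow.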
The first thing to note about variable recoding (the process by which we make changes to the variables in our dataset) is that there are LOTS of ways to do the same things. Below, I provide the basic code you will need to recode variables in a way that aligns with the goals in this course. Most of these recoding methods rely on tidyverse, particularly the dplyr package in tidyverse. There are ways to recode variables that rely on other packages or Base R. Remember, Google is your friend - you can turn to Google when you have questions about what functions in tidyverse can help you recode a variable, what the arguments are for the function, etc.
mutate()
We use mutate() to create new variables (in tidyverse language; of course, there are lots of other ways to create new variables). The code below is just an example of how it can be used (to take an old variable and multiply each value by 2 to create a new variable). I am not executing this code.
df %>%
mutate(newvariablename = oldvariablename * 2) # multiply each value of the old variable by 2
df %>%
mutate(newvariablename = c(1:200)) # here we specify the exact values our new variable will take (1,2,3,4 etc... 200)
dc_df2 <- dc_df2 %>%
mutate(birth_year = 2021 - age_in_years)
Above, we’ve created a new variable called “birth_year” which records the birth year for each individual in our dataset by subtracting their reported age from the year the survey was completed.
To quickly change the variable class (type), we can make use of a set of functions that all use the same form:
as.character()
as.factor() / factor()
as.numeric()
As an example, the interact_with_govt variable is currently of the class numeric, but is actually a categorical variable. It only takes on two possible values: 1 if the individual interacted with a government office in the past year, 2 if they didn’t. Let’s change it to be a factor variable instead.
class(dc_df2$ResponseId) # reminder: how we can check a variable's class/type
## [1] "character"
Here we will change the variable type of one of the variables in our dataframe.
dc_df2 <- dc_df2 %>%
mutate(interact_with_govt_fct = factor(interact_with_govt))
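As a small sketch of how these conversion functions behave, the code below converts a made-up numeric vector to a factor and back again. The values are hypothetical stand-ins for a coded survey variable. It also shows one thing worth being careful about: calling as.numeric() directly on a factor returns the positions of the levels, not the original labels.

```r
# Hypothetical values, standing in for a coded survey variable
x_num <- c(1, 2, 2, NA, 1)

x_fct <- factor(x_num)         # numeric -> factor
class(x_fct)
levels(x_fct)

x_chr <- as.character(x_fct)   # factor -> character
x_back <- as.numeric(x_chr)    # character -> numeric (same values as x_num)

# Caution: as.numeric() on a factor gives the level positions (1, 2, ...),
# so go through as.character() first if you want the original numbers back
codes <- factor(c(10, 20, 20))
as.numeric(codes)                 # level positions: 1 2 2
as.numeric(as.character(codes))   # original values: 10 20 20
```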
Best Practice/Pro Tip! Whenever we change the original data, we should ALWAYS make new dataframes or new variables (depending on what we’re doing). Avoid (where possible) changing the original data! That way it’s a lot easier to fix mistakes, since we don’t need to rerun ALL of our code to get back to where we were. We can make new variables using the mutate() function. THIS IS SUPER IMPORTANT!
We can use case_match() to recode the values of our variable and case_when() in more complicated recoding. We will rely mostly on case_match() but it is useful to know when case_when() might come in handy.
case_match()
case_when()
case_match()
dataframe %>%
mutate(newvariablename = case_match(oldvariablename,
oldvalue ~ newvalue,
oldvalue2 ~ newvalue2))
Pro tip!: You can type ?case_match and run that in the console to pull up the help file (in the Help tab, next to the plots window), which will provide information about the arguments that make up the function.
The ~ operator in this case is like the <- (assignment operator) turned the opposite way (->). Below we are saying assign values 1,2,3 and 4 of the education variable to a category called <HS meaning less than high school education.
dc_df2 <- dc_df2 %>%
mutate(education = case_match(dc21_education,
c(1,2,3,4) ~ '<HS', # notice how we can use case_match() with different types of data (e.g. numeric, character strings)
5 ~ 'HS',
c(6,8) ~ 'somePS',
7 ~ 'techDiploma',
9 ~ 'BA',
10:11 ~ 'MA+'
))
Let’s check our work:
Best Practice/Pro Tip! - Whenever we recode variables, we should ALWAYS compare the new variable to the old variable to see if we did it properly! When dealing with categorical variables, we can easily do this with table() by making a crosstab. In this application, we want to look at the old values of the education variable (which appear as the rows in the table below) and make sure that we properly assigned them to our new categories (which appear as the columns). This is SUPER important!
table(dc_df2$dc21_education, dc_df2$education)
##
## <HS BA HS MA+ somePS techDiploma
## 1 4 0 0 0 0 0
## 2 8 0 0 0 0 0
## 3 27 0 0 0 0 0
## 4 174 0 0 0 0 0
## 5 0 0 906 0 0 0
## 6 0 0 0 0 673 0
## 7 0 0 0 0 0 1526
## 8 0 0 0 0 593 0
## 9 0 1971 0 0 0 0
## 10 0 0 0 692 0 0
## 11 0 0 0 257 0 0
class(dc_df2$education)
## [1] "character"
Right now, the education variable that we made using the dc_df2$dc21_education variable is of the type “character”. Let’s say that we wanted to make it into an ordinal variable (in R, a factor with levels).
Let’s do that and specify the exact order of the levels that we would like:
dc_df2 <- dc_df2 %>%
mutate(educ_fct = factor(education, levels=c("<HS", "HS", "somePS", "techDiploma", "BA", "MA+")))
levels(dc_df2$educ_fct) # check the order of the levels
## [1] "<HS" "HS" "somePS" "techDiploma" "BA"
## [6] "MA+"
Pro Tip: If we didn’t specify the order of the levels, then the order of the levels defaults to alphabetical order.
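Here is a minimal sketch of that default, using a made-up character vector (the satisfaction values are hypothetical and not from our dataset). The first factor lets R pick the level order; the second one sets the order explicitly with the levels argument.

```r
# Hypothetical values for illustration
satisfaction <- c("medium", "low", "high", "low")

# No levels specified: the levels are sorted alphabetically
default_fct <- factor(satisfaction)
levels(default_fct)   # "high" "low" "medium"

# Levels specified: the order is exactly what we ask for
ordered_fct <- factor(satisfaction, levels = c("low", "medium", "high"))
levels(ordered_fct)   # "low" "medium" "high"
```

The alphabetical default would put "high" before "low", which is rarely the order we want for an ordinal variable, so it is usually worth spelling the levels out.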
Pro Tip: If you’re trying to create a new variable based on meeting multiple conditions across one or more old variables, case_when() can be a useful function. Try executing ?case_when in the console to learn more.
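As a rough sketch of what "multiple conditions across more than one variable" can look like, the example below builds a new variable from two made-up variables, age and voted. The dataframe, variable names and categories are all hypothetical. The .default argument assumes dplyr 1.1.0 or newer.

```r
library(dplyr)

# Hypothetical toy data: age in years, and voted (1 = yes, 0 = no)
toy_df <- data.frame(
  age = c(22, 45, 67, 34, NA),
  voted = c(1, 0, 1, 1, 1)
)

toy_df <- toy_df %>%
  mutate(voter_group = case_when(
    age < 35 & voted == 1 ~ "young voter",   # both conditions must be TRUE
    age >= 35 & voted == 1 ~ "older voter",
    voted == 0 ~ "non-voter",
    .default = NA_character_                 # anything left over (e.g. missing age) is NA
  ))

# Check our work by comparing the old variables to the new one
table(toy_df$voted, toy_df$voter_group, useNA = "ifany")
```

Conditions are checked from top to bottom, and each row gets the value from the first condition it meets.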
We can relevel and recode factors with the following functions:
fct_relevel()
fct_recode()
fct_recode()
df %>%
mutate(newvariablename =
fct_recode(oldfactorvariable,
newvalue = oldvalue,
newvalue2 = oldvalue2
))
Remember earlier we created the interact_with_govt_fct variable where we took the original variable and specified that we wanted it to be a factor (not an integer).
Here let’s take that factor variable and replace the 1s and 2s with “yes” and “no”, which is what they represent.
dc_df2 <- dc_df2 %>%
mutate(interact_with_govt_fct = as.factor(interact_with_govt)) %>% # we already did this earlier, but showing again
mutate(interact_with_govt_fct2 = fct_recode(interact_with_govt_fct,
"No" = "2",
"Yes" = "1"
))
class(dc_df2$interact_with_govt_fct2)
## [1] "factor"
We can check the order of the levels or values of our factor (ordinal) variable using levels()
levels(dc_df2$interact_with_govt_fct2)
## [1] "Yes" "No"
levels(dc_df2$interact_with_govt_fct)
## [1] "1" "2"
fct_relevel()
Let’s say we wanted to reverse the levels so that “No” was first. We could do this by using fct_relevel()
df %>%
mutate(newvariable =
fct_relevel(oldvariable, "value_newlevel1", "value_newlevel2"))
dc_df2 <- dc_df2 %>%
mutate(interact_with_govt_fct2 = fct_relevel(interact_with_govt_fct2,
"No", "Yes"))
levels(dc_df2$interact_with_govt_fct2)
## [1] "No" "Yes"
When we’re working in R, we will often have to smoosh two dataframes together so that we have more complete information. Maybe two separate research projects collected information about provincial party leaders over time, but only one of the projects collected information about how much money each leader raised when they campaigned during the leadership race. We might want to smoosh (think add, join) the dataframes created by each project together. The two dataframes likely have a lot of the same observations or cases (party leaders), but they have different variables or information about these party leaders.
First, think about the way that you want to join the dataframes. Typically, we are interested in joining dataframes because one has variables that the other does not. Do they have the same observations? Which column or columns from the dataframes are we joining on?
Left Join: Keeps all rows from the first dataframe (that we list when we call the function), adding matching data from the right dataframe.
Anti-Join: Keep observations from the left dataframe that do not exist in the right dataframe. (Returns all rows from dataframe1 that do not have a match in dataframe2).
Full Join: Keeps all rows from both dataframes, filling with NA where there is no match.
To complete a join, we rely on “keys” (or in the case of a well-organized dataset, the unique ID attached to each observation in the dataset).
Left join, anti-join, and full join operate pretty similarly (almost identical arguments in each function), so I am just going to demonstrate a left join. The key to deciding which one to use is to clarify why you’re joining the dataframes. This will tell you WHICH join to use (left, anti or full).
left_join()
left_join() joins two dataframes together given a common variable or set of variables. When writing the function, it joins the right dataframe to the left dataframe. The output keeps every row that appears in the first dataset we list in the function or list at the top of our pipe:
# first, note that these two different lines of code do the SAME THING
# the difference is whether or not they are organized using a pipe
left_join(dataset1, dataset2)
dataset1 %>%
left_join(dataset2)
By default, joins use the common variables that appear in both datasets in order to execute the join. This is important because it means that the variables we are “joining on” should (1) have the same variable name and (2) mean the same thing.
You can specify the variables to join by with the “by” argument.
left_join(dataset1, dataset2, by="id")
left_join(dataset1, dataset2, by=c("year", "country"))
Let’s make some fake data for our example. Imagine we have a (fake) dataset of party leadership candidates.
leader_cand_fake1 <- data.frame(
id = c(1,2,3,4,5),
first_name = c("Jenna", "Alex", "Katharine", "James", "Jacob"),
sex = c("female", "male", "female", "male", "male"),
province = c("Alberta", "Ontario", "Ontario", "British Columbia", "New Brunswick")
)
We don’t know their ages, however, and let’s say that this was of interest for our research question.
Now let’s imagine that we found another dataset that included a variable for candidate ages.
leader_cand_fake2 <- data.frame(
id = c(1,2,3,4,5,6),
first_name = c("Jenna", "Alex", "Katharine", "James", "Jacob", "John"),
age = c(55, 63, 49, 27, 20, 22))
left_join(leader_cand_fake1, leader_cand_fake2)
## Joining with `by = join_by(id, first_name)`
## id first_name sex province age
## 1 1 Jenna female Alberta 55
## 2 2 Alex male Ontario 63
## 3 3 Katharine female Ontario 49
## 4 4 James male British Columbia 27
## 5 5 Jacob male New Brunswick 20
Above, we used a left_join() to smoosh together the two fake datasets. R tells us that it is joining by the variables ‘id’ and ‘first_name’ which are two variables that our dataframes have in common. We can specify which variables to join by (and we often should).
You’ll see that it returns a new dataframe that includes an age column (plus all of the variables from the first dataframe), but we lose information about the candidate named “John” after joining the dataframes. Remember, a left_join() keeps all of the observations (rows) from the first dataframe only. (We could use a full_join() to keep “John” in the new dataframe, but the variables sex and province would have ‘NA’ values for John, showing that we do not have information about John’s province or sex).
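For comparison, here is a short sketch of the other two joins using the same two fake datasets (recreated here so the code runs on its own). It assumes the dplyr package is loaded and spells out the "by" argument rather than relying on the default.

```r
library(dplyr)

# The same fake datasets as above, recreated so this block is self-contained
leader_cand_fake1 <- data.frame(
  id = c(1, 2, 3, 4, 5),
  first_name = c("Jenna", "Alex", "Katharine", "James", "Jacob"),
  sex = c("female", "male", "female", "male", "male"),
  province = c("Alberta", "Ontario", "Ontario", "British Columbia", "New Brunswick")
)
leader_cand_fake2 <- data.frame(
  id = c(1, 2, 3, 4, 5, 6),
  first_name = c("Jenna", "Alex", "Katharine", "James", "Jacob", "John"),
  age = c(55, 63, 49, 27, 20, 22)
)

# Full join: keeps all rows from BOTH dataframes.
# John is kept, with NA for sex and province.
full <- full_join(leader_cand_fake1, leader_cand_fake2,
                  by = c("id", "first_name"))

# Anti-join: keeps the rows of the first dataframe listed that have
# NO match in the second. Listing fake2 first shows who is missing from fake1.
only_in_2 <- anti_join(leader_cand_fake2, leader_cand_fake1,
                       by = c("id", "first_name"))

full
only_in_2
```

An anti-join like this can be a handy check before a left join, because it tells you which observations you are about to lose.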
pivot_longer()
Typically, we use pivot_longer() to tidy a dataset where the column headers are values as opposed to variable names.
df %>%
pivot_longer(
cols = c(column1, column2), # the columns we want to combine into one variable
names_to = "newcolumnname", # the name we give to our new variable
values_to = "frequency" # the values that appeared in the old columns will appear in a new column called "frequency" (can give it whatever name you'd like)
)
In the example below, I create a “fake” dataset to show you how this might be done.
fake_dat <- data.frame(
id = c(1,2,3,4,5),
age = c(18, 39, 20, 65, 42),
male = c(1, 1, 1, 0, 0),
female = c(0,0,0,1,1),
province = c("Alberta", "British Columbia", "Ontario", "Alberta", "Ontario")
)
fake_dat_1 <- fake_dat %>%
pivot_longer(
cols = c(male, female),
names_to = "gender", #the name of our new variable
values_to = "frequency" #the values from the old "male" and "female" columns will go to a variable called "frequency"
)
fake_dat_2 <- fake_dat_1 %>%
  mutate(sex = case_when(gender == "male" & frequency == 1 ~ "male", # use mutate to make a new variable
                         # use case_when() to say when gender is male and frequency is 1, assign "male"
                         gender == "female" & frequency == 1 ~ "female")) %>%
                         # when gender is female and frequency is 1, assign "female"; all other rows get NA
  drop_na(sex) %>% # drop the rows where sex is NA
  select(-gender) # here we use - to say select all variables except gender
  # we're getting RID of the gender column because we no longer need it.
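pivot_wider() goes in the opposite direction of pivot_longer(): it takes the values in one column and spreads them out into new columns. As a minimal sketch, the code below lengthens a smaller version of the fake dataset and then widens it again. The object names long_dat and wide_dat are just for this illustration, and the sketch assumes the tidyr and dplyr packages are installed.

```r
library(dplyr)
library(tidyr)

# A smaller version of the fake dataset, recreated so this block runs on its own
fake_dat <- data.frame(
  id = c(1, 2, 3),
  age = c(18, 39, 65),
  male = c(1, 1, 0),
  female = c(0, 0, 1),
  province = c("Alberta", "British Columbia", "Alberta")
)

# Longer: the male and female columns become two rows per person
long_dat <- fake_dat %>%
  pivot_longer(
    cols = c(male, female),
    names_to = "gender",
    values_to = "frequency"
  )

# Wider: the values of "gender" become column names again,
# and the cells are filled using the values in "frequency"
wide_dat <- long_dat %>%
  pivot_wider(
    names_from = gender,
    values_from = frequency
  )

nrow(long_dat)   # two rows per person
nrow(wide_dat)   # back to one row per person
```

The widened data has the same information as fake_dat, although the male and female columns now appear at the end.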
(These are optional!! You do not NEED to use these methods, but they might be useful if you’re getting stuck with functions discussed above)
The recode() function from the car package is a helpful way to deal with categorical variables that have lots of categories and/or need to be assigned an order (i.e., to be ordinal). You will need to install the car package and load it. Go back to the top of your script/markdown document and do this.
Now, since there is also a function called recode() in tidyverse, we need to specify that we want to use the recode() function from the car package by putting car:: ahead of the function. This just tells R, look inside the car package to find the function.
Let’s recode the provinces variable. Right now, it is numeric and numbers are used to identify the unique provinces (look at the dataset codebook to verify WHICH numbers identify which provinces).
dc_df2 <- dc_df2 %>%
mutate(province = car::recode(dc21_province,
"1 = 'AB'; 2 = 'BC'; 3 = 'MB'; 4 = 'NB'; 5 = 'NL'; 6 = 'NWT';
7 = 'NS'; 8 ='NU'; 9 = 'ON'; 10= 'PEI'; 11= 'QE'; 12 = 'SK'; 13 = 'YK';
else=NA", as.factor=FALSE))
## Warning: There was 1 warning in `mutate()`.
## ℹ In argument: `province = car::recode(...)`.
## Caused by warning in `car::recode()`:
## ! NAs introduced by coercion
# in the code above we first specified our dataset and a pipe to say "take our dataset, then..."
# use mutate to create a new variable
# we want the new variable to be called "province"
# province will be created using recode() from the car package
# the first argument in the recode function is the variable we want to recode (dc21_province)
# the second argument in the recode function specifies how we want to recode the variable (eg. old value = new value, old value = new value, all others assign NA)
# the third argument specifies if we want the variable to be a FACTOR or not
# usually, we specify factor if we want to tell R that there is a specific order to the values of the variable. here we specify that we don't want it to be a factor variable.
class(dc_df2$province)
## [1] "character"
# the variable is NOT a factor, but instead, a character variable
table(dc_df2$dc21_province, dc_df2$province)
##
## AB BC MB NB NL NS ON PEI QE SK
## 1 755 0 0 0 0 0 0 0 0 0
## 2 0 743 0 0 0 0 0 0 0 0
## 3 0 0 260 0 0 0 0 0 0 0
## 4 0 0 0 120 0 0 0 0 0 0
## 5 0 0 0 0 104 0 0 0 0 0
## 7 0 0 0 0 0 168 0 0 0 0
## 9 0 0 0 0 0 0 2767 0 0 0
## 10 0 0 0 0 0 0 0 30 0 0
## 11 0 0 0 0 0 0 0 0 1682 0
## 12 0 0 0 0 0 0 0 0 0 202
We also received a message saying that NAs were introduced. Let’s check that out by taking our data and filtering the rows to keep only those with NA values for the province variable.
dc_df2 %>%
filter(is.na(province))
## [1] ResponseId age_in_years dc21_province
## [4] dc21_education dc21_democratic_sat dc21_prov_gov_satis
## [7] interact_with_govt birth_year interact_with_govt_fct
## [10] education educ_fct interact_with_govt_fct2
## [13] province
## <0 rows> (or 0-length row.names)
Looks like we should be all good. What do we think happened? Well it is likely the case that there weren’t any observations in our dataset assigned “13” on the original province variable given that it does not appear in the data table we made above.
Let’s turn our education variable into an ordinal variable. Why? Right now, the values of the education variable are stand-in labels for different categories of educational completion (e.g., 1 means no schooling). We know that there is an order or ranking to these categories, even if we can’t specify the exact difference between the categories. For these reasons, we will recode education as a factor.
We will use the recode() function from the car package again, in combination with the mutate() function which we use when we want to create a new variable in our dataset.
Now, let’s say that we want to put our education variable into larger “buckets”. Rather than have 11 different education levels, we could re-group the values of the variable so that there are … categories.
dc_df2 <- dc_df2 %>%
mutate(education = car::recode(dc21_education,
"1:4= '<HS'; 5 = 'HS'; c(6,8) = 'somePS';
7 = 'techDiploma'; 9 = 'BA'; 10:11 = 'MA+'",
as.factor = TRUE, # tell R to make the variable into a factor (the levels argument below sets the order)
levels=c("<HS", "HS", "somePS", "techDiploma", "BA", "MA+")))
# levels specifies the ORDER of the categories from lowest to highest
NOTE: in the above code, we use : to specify that we want to capture the values on either side of : and all of those that lie in between (e.g., 1:4 means capture values 1 and 4 and all values in between, i.e. 2 and 3).
Let’s check how our new variable compares to our old variable in order to verify that the recoding did what we wanted it to do.
table(dc_df2$dc21_education, dc_df2$education)
##
## <HS HS somePS techDiploma BA MA+
## 1 4 0 0 0 0 0
## 2 8 0 0 0 0 0
## 3 27 0 0 0 0 0
## 4 174 0 0 0 0 0
## 5 0 906 0 0 0 0
## 6 0 0 673 0 0 0
## 7 0 0 0 1526 0 0
## 8 0 0 593 0 0 0
## 9 0 0 0 0 1971 0
## 10 0 0 0 0 0 692
## 11 0 0 0 0 0 257
df %>%
mutate(
newvariable = ifelse(oldvariable == 1, 2, NA)
)
# we can read this as, if the old variable equals 1, assign it 2, otherwise assign NA (meaning for all other values than 1 taken on by the old variable, assign it NA)
# We can also have nested ifelse statements, for e.g.:
df %>%
  mutate(
    newvariable = ifelse(oldvariable == "cat", "animal", # if the old variable takes the value "cat", assign it the value "animal" in our new variable
                  ifelse(oldvariable == "pencil", "inanimate", # if the old variable takes the value "pencil", assign it the value "inanimate" in our new variable
                  ifelse(oldvariable == "squirrel", "animal",
                  ifelse(oldvariable == "desk", "inanimate", NA)))) # each nested ifelse() needs its own closing bracket
    # if the old variable takes the value "desk", assign it the value "inanimate" in our new variable, FOR ALL OTHER VALUES OF THE OLD VARIABLE assign them as NAs in the new variable
  )
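Here is a version of the same idea that WILL run, as a minimal sketch on a made-up dataframe (toy_df and its values are hypothetical). It uses %in% to group values that get the same new category, which keeps the nesting to two levels instead of four.

```r
library(dplyr)

# Hypothetical toy data for illustration
toy_df <- data.frame(
  thing = c("cat", "pencil", "squirrel", "desk", "cloud")
)

toy_df <- toy_df %>%
  mutate(
    type = ifelse(thing %in% c("cat", "squirrel"), "animal",      # cat or squirrel -> "animal"
           ifelse(thing %in% c("pencil", "desk"), "inanimate",    # pencil or desk -> "inanimate"
                  NA))                                            # everything else -> NA
  )

# Check our work: old values in the rows, new values in the columns
table(toy_df$thing, toy_df$type, useNA = "ifany")
```

Once you have more than two or three conditions, case_when() or case_match() is usually easier to read than deeply nested ifelse() statements.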
Import the federal candidates dataset that we used for lesson 2 (available on OWL). (Remember, you should have a copy of the data in your working directory in order to be able to import it into your current R session using import in the way we’ve discussed)
Create a new variable called “region”. The “region” variable should include the categories: West (all Western Canada provinces), Central (Ontario and Quebec), East (Newfoundland, New Brunswick, Nova Scotia, PEI). The territories can be coded as NA. The variable should be a factor and the order of the categories should be West, Central, East. What functions did you use? How did you check your work? (HINT: table())
Create a variable called “sex” with the categories “male” and “female” that is based on the existing gender variable in the dataset. You may need to think creatively to try to figure out what the 0s and 1s in the gender column mean (which is female? which is male?).
mutate()
as.factor()
as.character()
as.numeric()
case_match()
case_when()
fct_recode()
car::recode()
ifelse()
table()
pivot_longer()
pivot_wider()
left_join()
full_join()
anti_join()