Introduction

In today’s lab, we are going to focus mostly on how we recode and change our variables.

Variable recoding is a critical part of “wrangling” our data - or in other words, getting our data into a format that is usable and informative. We want the variables in our dataset to accurately reflect the concepts that we are investigating and we want the variables to be in the proper format so that we can summarize and visualize our data in ways that are meaningful.

Variable recoding involves tasks such as:

reducing the number of categories of a categorical variable, or grouping values of a continuous variable into a categorical variable
specifying an order for a categorical variable (so that it is a factor with ordered levels)
changing the labels of a categorical variable (e.g., for a sex variable where the value 1 stands for “man”, replacing the 1 with the actual label “man”)

Key Sections

We will start by importing a dataset using the import() function from the rio package, like we did in the last lab. And then we will move on to discuss:

The pipe
Changing Variables
Variable Recoding
Dataset wrangling continued (pivoting a dataframe, joining dataframes)

How To Read This Lesson

In this lesson, I will introduce a number of functions. In order to help you understand how each function works, I will first review “How it works” by showing a code block with an explanation of the arguments of the function using intuitive naming. The code block under the “How it works” section for each function will NOT run because it doesn’t draw on any real datasets or variables. It is simply for illustrative purposes. After the “How it works” section, I will show an example of how to use the function (that WILL run) in the section titled “Example using data”.

Getting Started

Before we get going, remember the “best practices” for starting a new R Script/R Markdown document:

Create a new file folder for this lesson.
Open a new R script (or R Markdown).
Set your working directory so R is linked to that new file folder. You can do this with Session -> Set Working Directory -> Choose Directory. Then copy and paste the code from your Console (bottom left panel) into your R Script.
Save the R script (or R Markdown).
Load any packages that you think we’ll need for today’s lab (e.g., rio and tidyverse)

Import data

Let’s start by importing the data that we will use for this week’s lab.

You should navigate to this website to download the dataset: https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/KCIK9D The data will come in a “zipped” format. You simply need to click the folder to “unzip it” and then you’ll have a folder that includes the dataset and the codebook. Remember, you need to move it to the folder that you’re working in for today’s lab (your working directory).

library(rio)
library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.0
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

dc_df <- import("DC 2021 v1.dta")
# I use dc_df to indicate "democracy check dataframe" but you can call
# the object where you save the data whatever you want.

class(dc_df)

## [1] "data.frame"

head(dc_df, 5)
# I didn't execute this code because it prints a LOT of info on the screen, but feel
# free to try it yourself.

There are a LOT of variables (302)!

Let’s take a few that we might need (after scrolling the codebook) using the select() function we learned last class.

dc_df2 <- dc_df %>% 
  select(ResponseId, age_in_years, dc21_province, dc21_education, dc21_democratic_sat, dc21_prov_gov_satis, 
         dc21_feedback_screen)

What do these variables mean? The codebook tells us… open up the codebook and have a look for yourself. Here are some examples:

dc21_income_category: What was your total household income before taxes in 2020?
- no income (1)
- $1 - $30,000 (2)
- $30,001 to $60,000 (3)
- $60,001 to $90,000 (4)
- $90,001 to $110,000 (5)
- $110,001 to $150,000 (6)
- $150,001 to $200,000 (7)
- More than $200,000 (8)

dc21_feedback_screen: Have you contacted a government office in the past 12 months? (1 yes/ 2 no)

The Pipe

Unique to the tidyverse approach to writing code is the pipe operator, %>%. In order to make code more readable, Hadley Wickham and friends created the pipe to help arrange code vertically and avoid having to create multiple dataframe objects (e.g. df1, df2, df3, … df78). The way to read the pipe operator is “then”. For example, “Select variables Gender, X, THEN filter to include only Female”. Let’s give it a try. We’re going to take our original dataframe, “dc_df”, THEN reduce it to seven variables (like we did above), THEN filter age_in_years to include only individuals whose age is equal to or greater than 30, THEN arrange the rows in our dataset by province, THEN rename the dc21_feedback_screen variable.

dc_df2 <- dc_df %>% # Start with original dataframe, dc_df, THEN (%>%)
  select(ResponseId, age_in_years, dc21_province, dc21_education, dc21_democratic_sat, dc21_prov_gov_satis, 
         dc21_feedback_screen) %>% # select these variables, THEN (%>%)
  filter(age_in_years >= 30) %>% # Keep only individuals who are 30 years or older, THEN (%>%)
  arrange(dc21_province) %>% # Arrange rows by province, THEN (%>%)
  rename(interact_with_govt = dc21_feedback_screen) # Rename dc21_feedback_screen to interact_with_govt

head(dc_df2, 5)

##          ResponseId age_in_years dc21_province dc21_education
## 1 R_02gwXGzYjbfGX6h           79             1              7
## 2 R_07j3ZCJh5sZJkD7           73             1              7
## 3 R_08KdUpB7yVM2DFn           33             1              9
## 4 R_08wMA25qY06uQ0N           39             1              9
## 5 R_0IJ83oYEbMT4PeN           51             1              6
##   dc21_democratic_sat dc21_prov_gov_satis interact_with_govt
## 1                   3                   2                  1
## 2                   2                   2                  1
## 3                  NA                   2                  2
## 4                  NA                   4                 NA
## 5                  NA                   4                  2

Using pipes in your code not only makes things easier to read and keeps your Global Environment cleaner, it also slightly changes how each function is written. More specifically, it eliminates the need for the data argument, since each function uses the output from the previous step. Notice how, for each function in the above code, we don’t specify the name of the dataframe as the first argument. That’s because the pipe starts with “dc_df”, then passes the output along to each subsequent step.
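To make the “no data argument” point concrete, here is a minimal sketch on a made-up dataframe (toy_df and its columns are hypothetical and not part of the lab data). It writes the same two steps once with nested function calls and once with the pipe, then checks that both give the same result.

```r
library(dplyr)

# toy_df is a made-up dataframe used only for illustration
toy_df <- data.frame(
  id = 1:4,
  age = c(25, 41, 33, 58)
)

# Without the pipe: the dataframe is the first argument of each function
nested <- arrange(filter(toy_df, age >= 30), age)

# With the pipe: the output of each step becomes the first argument of the next
piped <- toy_df %>%
  filter(age >= 30) %>%
  arrange(age)

identical(nested, piped) # both approaches return the same dataframe
```

The nested version has to be read from the inside out, while the piped version reads top to bottom in the order the steps happen.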

Changing Variables

The first thing to note about variable recoding (the process by which we make changes to the variables in our dataset) is that there are LOTS of ways to do the same things. Below, I provide the basic code you will need to recode variables in a way that aligns with the goals in this course. Most of these recoding methods rely on tidyverse, particularly the dplyr package in tidyverse. There are also ways to recode variables that rely on other packages or Base R. Remember, Google is your friend - you can turn to Google when you have questions about what functions in tidyverse can help you recode a variable, what the arguments are for the function, etc.

Create new variable

`mutate()`

We use mutate() to create new variables (in tidyverse language; of course, there are lots of other ways to create new variables). The code below is only an illustration of how it can be used (for example, taking an old variable and multiplying each value by 2 to create a new variable). I am not executing this code.

How it works:

df %>% 
  mutate(newvariablename = oldvariablename * 2) # multiply each value of the old variable by 2

df %>%
  mutate(newvariablename = c(1:200)) # here we specify the exact values our new variable will take (1,2,3,4 etc... 200)

Example using data:

dc_df2 <- dc_df2 %>%
  mutate(birth_year = 2021 - age_in_years)

Above, we’ve created a new variable called “birth_year” which records the birth year for each individual in our dataset by subtracting their reported age from the year the survey was completed.

Change variable class (type)

To quickly change the variable class (type), we can make use of a set of functions that all use the same form:

as.character()
as.factor() / factor()
as.numeric()

As an example, the interact_with_govt variable is currently of the class numeric, but it is actually a categorical variable. It only takes on two possible values: 1 if the individual contacted a government office in the past 12 months, 2 if they didn’t. Let’s change it to be a factor variable instead.

class(dc_df2$ResponseId) # reminder: how we can check a variable's class/type

## [1] "character"

Example using data:

Here we will change the variable type of one of the variables in our dataframe.

dc_df2 <- dc_df2 %>%
  mutate(interact_with_govt_fct = factor(interact_with_govt))

Best Practice/Pro Tip! Whenever we change the original data, we should ALWAYS make new dataframes or new variables (depending on what we’re doing). Avoid (where possible) changing the original data! That way it’s a lot easier to fix mistakes, since we don’t need to rerun ALL of our code to get back to where we were. We can make new variables using the mutate() function. THIS IS SUPER IMPORTANT!
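Here is a small sketch of that best practice, using a made-up dataframe (toy_df and the column name contacted are hypothetical). The factor version is saved as a new column, so the original numeric column is still there if we need to start over.

```r
library(dplyr)

# toy_df is made up for illustration; 1 = yes, 2 = no, as in the codebook
toy_df <- data.frame(contacted = c(1, 2, 2, NA, 1))

toy_df <- toy_df %>%
  mutate(contacted_fct = factor(contacted)) # new column; the original stays untouched

class(toy_df$contacted)      # the original column is still numeric
class(toy_df$contacted_fct)  # the new column is a factor
levels(toy_df$contacted_fct) # the levels are the old values, stored as text
```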

Variable Recoding

We can use case_match() to recode the values of our variable and case_when() in more complicated recoding. We will rely mostly on case_match() but it is useful to know when case_when() might come in handy.

case_match()
case_when()

`case_match()`

How it works:

dataframe %>% 
  mutate(newvariablename = case_match(oldvariablename, 
  oldvalue ~ newvalue, 
  oldvalue2 ~ newvalue2))

Pro tip!: You can type ?case_match and run that in the console to pull up the help file (in the Help tab, next to Plots), which provides information about the arguments that make up the function.

Example using data:

The ~ operator in this case is like the <- (assignment operator) turned the opposite way (->). Below we are saying assign values 1,2,3 and 4 of the education variable to a category called <HS meaning less than high school education.

dc_df2 <- dc_df2 %>% 
  mutate(education = case_match(dc21_education,
   c(1,2,3,4) ~ '<HS', # notice how we can use case_match() with different types of data (e.g. numeric, character strings)
   5 ~ 'HS',
   c(6,8) ~ 'somePS', 
   7 ~ 'techDiploma', 
   9 ~ 'BA', 
   10:11 ~ 'MA+'
  ))

Let’s check our work:

Best Practice/Pro Tip! - Whenever we recode variables, we should ALWAYS compare the new variable to the old variable to see if we did it properly! When dealing with categorical variables, we can easily do this with table() by making a crosstab. In this application, we want to look at the old values of the education variable (which appear as the rows in the table below) and make sure that we properly assigned them to our new categories (which appear as the columns). This is SUPER important!

table(dc_df2$dc21_education, dc_df2$education)

##     
##       <HS   BA   HS  MA+ somePS techDiploma
##   1     4    0    0    0      0           0
##   2     8    0    0    0      0           0
##   3    27    0    0    0      0           0
##   4   174    0    0    0      0           0
##   5     0    0  906    0      0           0
##   6     0    0    0    0    673           0
##   7     0    0    0    0      0        1526
##   8     0    0    0    0    593           0
##   9     0 1971    0    0      0           0
##   10    0    0    0  692      0           0
##   11    0    0    0  257      0           0
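One thing to keep in mind when checking with table(): by default it leaves out missing values, and case_match() assigns NA to any old value that we did not list. The sketch below uses a made-up dataframe (toy_df, educ_code and educ_label are hypothetical names) where one code is deliberately left out of the recode, to show how the useNA argument makes that kind of mistake visible.

```r
library(dplyr)

# Made-up education codes; the value 5 is deliberately left out of the recode
toy_df <- data.frame(educ_code = c(1, 2, 5, 9, 9))

toy_df <- toy_df %>%
  mutate(educ_label = case_match(educ_code,
    c(1, 2) ~ "<HS",
    9 ~ "BA"
  ))

# Default crosstab: the row for code 5 is all zeros, which is easy to miss
table(toy_df$educ_code, toy_df$educ_label)

# useNA = "ifany" adds an NA column so unmatched codes stand out
check_tab <- table(toy_df$educ_code, toy_df$educ_label, useNA = "ifany")
check_tab
```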

class(dc_df2$education)

## [1] "character"

Right now, the education variable that we made using the dc_df2$dc21_education variable is of the type “character”. Let’s say that we wanted to make it into an ordinal variable (in R, a factor with levels).

Let’s do that and specify the exact order of the levels that we would like:

dc_df2 <- dc_df2 %>% 
  mutate(educ_fct = factor(education, levels=c("<HS", "HS", "somePS", "techDiploma", "BA", "MA+")))

levels(dc_df2$educ_fct) # check the order of the levels

## [1] "<HS"         "HS"          "somePS"      "techDiploma" "BA"         
## [6] "MA+"

Pro Tip: If we didn’t specify the order of the levels, then the order of the levels defaults to alphabetical order.
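To see the difference, here is a quick sketch with a made-up character vector (educ is hypothetical and only for illustration). The first factor lets R pick the order of the levels, and the second uses the order we give it.

```r
# Made-up values, for illustration only
educ <- c("HS", "BA", "<HS", "MA+", "HS")

# No levels argument: R sorts the unique values for us
default_fct <- factor(educ)
levels(default_fct)

# With levels: the order is exactly what we specify
ordered_fct <- factor(educ, levels = c("<HS", "HS", "BA", "MA+"))
levels(ordered_fct)
```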

Pro Tip: If you’re trying to create a new variable based on meeting multiple conditions across one or more old variables, case_when() can be a useful function. Try executing ?case_when in the console to learn more.
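As a rough sketch of what that might look like, the code below builds a made-up dataframe (toy_df; the column names mirror the lab data but the values are invented) and creates a new variable, contact_group, from conditions on two old variables. The category labels are hypothetical.

```r
library(dplyr)

# toy_df is made up; the values are invented for illustration
toy_df <- data.frame(
  age_in_years = c(22, 45, 67, 31),
  interact_with_govt = c(1, 2, 1, NA)
)

toy_df <- toy_df %>%
  mutate(contact_group = case_when(
    age_in_years < 35 & interact_with_govt == 1 ~ "young, contacted",
    age_in_years >= 35 & interact_with_govt == 1 ~ "older, contacted",
    interact_with_govt == 2 ~ "no contact"
    # rows that match none of the conditions (e.g., a missing answer) are left as NA
  ))

toy_df
```

Each condition goes on the left of the ~ and the new value goes on the right, just like in case_match(). The conditions are checked in order, and the first one that is TRUE for a row is the one that gets used.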

Recode factor variables

We can relevel and recode factors with the following functions:

fct_relevel()
fct_recode()

`fct_recode()`

How it works:

df %>% 
  mutate(newvariablename = 
           fct_recode(oldfactorvariable,
                      newvalue = oldvalue,
                      newvalue2 = oldvalue2
  ))

Remember earlier we created the interact_with_govt_fct variable where we took the original variable and specified that we wanted it to be a factor (not an integer).

Here let’s take that factor variable and replace the 1s and 2s with “yes” and “no”, which is what they represent.

Example using data:

dc_df2 <- dc_df2 %>% 
  mutate(interact_with_govt_fct = as.factor(interact_with_govt)) %>% # we already did this earlier, but showing again
  mutate(interact_with_govt_fct2 = fct_recode(interact_with_govt_fct,
    "No" = "2",
    "Yes" = "1"
  ))

class(dc_df2$interact_with_govt_fct2)

## [1] "factor"

We can check the order of the levels or values of our factor (ordinal) variable using levels()

levels(dc_df2$interact_with_govt_fct2)

## [1] "Yes" "No"

levels(dc_df2$interact_with_govt_fct)

## [1] "1" "2"

`fct_relevel()`

Let’s say we wanted to reverse the levels so that “No” was first. We could do this by using fct_relevel()

How it works:

df %>% 
  mutate(newvariable = 
           fct_relevel(oldvariable, "value_newlevel1", "value_newlevel2"))

Example using data:

dc_df2 <- dc_df2 %>% 
  mutate(interact_with_govt_fct2 = fct_relevel(interact_with_govt_fct2, 
                                               "No", "Yes"))

levels(dc_df2$interact_with_govt_fct2)

## [1] "No"  "Yes"

More advanced topics

Joining dataframes

When we’re working in R, we will often have to smoosh two dataframes together so that we have more complete information. Maybe two separate research projects collected information about provincial party leaders over time, but only one of the projects collected information about how much money each leader raised when they campaigned during the leadership race. We might want to smoosh (think add, join) the dataframes created by each project together. The two dataframes likely have a lot of the same observations or cases (party leaders), but they have different variables or information about these party leaders.

First, think about the way that you want to join the dataframes. Typically, we are interested in joining dataframes because one has variables that the other does not. Do they have the same observations? Which column or columns from the dataframes are we joining on?

Left Join: Keeps all rows from the first dataframe (that we list when we call the function), adding matching data from the right dataframe.

Anti-Join: Keep observations from the left dataframe that do not exist in the right dataframe. (Returns all rows from dataframe1 that do not have a match in dataframe2).

Full Join: Keeps all rows from both dataframes, filling with NA where there is no match.

To complete a join, we rely on “keys” (or in the case of a well-organized dataset, the unique ID attached to each observation in the dataset).

Left join, anti-join, and full join operate pretty similarly (almost identical arguments in each function), so I am just going to demonstrate a left join. The key to deciding which one to use is to clarify why you’re joining the dataframes. This will tell you WHICH join to use (left, anti or full).

`left_join()`

left_join() joins two dataframes together given a common variable or set of variables. When writing the function, it joins the right dataframe to the left dataframe. The output keeps every row that appears in the first dataset we list in the function (or list at the top of our pipe):

How it works:

# first, note that these two different lines of code do the SAME THING
# the difference is whether or not they are organized using a pipe
left_join(dataset1, dataset2)

dataset1 %>% 
  left_join(dataset2)

By default, joins use the common variables that appear in both datasets in order to execute the join. This is important because it means that the variables we are “joining on” should (1) have the same variable name and (2) mean the same thing.

You can specify the variables to join by with the “by” argument.

left_join(dataset1, dataset2, by="id")

left_join(dataset1, dataset2, by=c("year", "country"))

Example using data:

Let’s make some fake data for our example. Imagine we have a (fake) dataset of party leadership candidates.

leader_cand_fake1 <- data.frame(
  id = c(1,2,3,4,5),
  first_name = c("Jenna", "Alex", "Katharine", "James", "Jacob"),
  sex = c("female", "male", "female", "male", "male"),
  province = c("Alberta", "Ontario", "Ontario", "British Columbia", "New Brunswick")
)

We don’t know their ages, however, and let’s say that this was of interest for our research question.

Now let’s imagine that we found another dataset that included a variable for candidate ages.

leader_cand_fake2 <- data.frame(
  id = c(1,2,3,4,5,6),
  first_name = c("Jenna", "Alex", "Katharine", "James", "Jacob", "John"),
  age = c(55, 63, 49, 27, 20, 22))

left_join(leader_cand_fake1, leader_cand_fake2)

## Joining with `by = join_by(id, first_name)`

##   id first_name    sex         province age
## 1  1      Jenna female          Alberta  55
## 2  2       Alex   male          Ontario  63
## 3  3  Katharine female          Ontario  49
## 4  4      James   male British Columbia  27
## 5  5      Jacob   male    New Brunswick  20

Above, we used a left_join() to smoosh together the two fake datasets. R tells us that it is joining by the variables ‘id’ and ‘first_name’ which are two variables that our dataframes have in common. We can specify which variables to join by (and we often should).
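Here is a short sketch of why specifying the join variables matters. It uses trimmed-down, made-up versions of the two dataframes (cand_a and cand_b are hypothetical names) and compares joining on id alone with joining on both shared variables.

```r
library(dplyr)

# Trimmed-down, made-up versions of the two fake datasets
cand_a <- data.frame(
  id = c(1, 2, 3),
  first_name = c("Jenna", "Alex", "Katharine"),
  province = c("Alberta", "Ontario", "Ontario")
)

cand_b <- data.frame(
  id = c(1, 2, 3),
  first_name = c("Jenna", "Alex", "Katharine"),
  age = c(55, 63, 49)
)

# Joining on id only: first_name exists in both dataframes but is not a key,
# so dplyr keeps both copies and adds suffixes (.x = left, .y = right)
joined_by_id <- left_join(cand_a, cand_b, by = "id")
names(joined_by_id)

# Joining on both shared variables keeps a single first_name column
joined_by_both <- left_join(cand_a, cand_b, by = c("id", "first_name"))
names(joined_by_both)
```

Writing out the by argument also silences the “Joining with” message and makes it obvious to anyone reading the code which variables were used as keys.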

You’ll see that it returns a new dataframe that includes an age column (plus all of the variables from the first dataframe), but we lose information about the candidate named “John” after joining the dataframes. Remember, a left_join() keeps all of the observations (rows) from the first dataframe only. (We could use a full_join() to keep “John” in the new dataframe, but the variables sex and province would have ‘NA’ values for John, showing that we do not have information about John’s province or sex).
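Since full_join() and anti_join() take the same arguments, here is a brief sketch of both using the same fake data (recreated here so the block runs on its own). The object names full_result and missing_from_1 are just for illustration.

```r
library(dplyr)

# Same fake data as in the example above
leader_cand_fake1 <- data.frame(
  id = c(1, 2, 3, 4, 5),
  first_name = c("Jenna", "Alex", "Katharine", "James", "Jacob"),
  sex = c("female", "male", "female", "male", "male"),
  province = c("Alberta", "Ontario", "Ontario", "British Columbia", "New Brunswick")
)

leader_cand_fake2 <- data.frame(
  id = c(1, 2, 3, 4, 5, 6),
  first_name = c("Jenna", "Alex", "Katharine", "James", "Jacob", "John"),
  age = c(55, 63, 49, 27, 20, 22)
)

# full_join(): keeps all six candidates; John gets NA for sex and province
full_result <- full_join(leader_cand_fake1, leader_cand_fake2,
                         by = c("id", "first_name"))
full_result

# anti_join(): rows of the first dataframe listed that have no match in the second.
# Listing leader_cand_fake2 first shows who is missing from leader_cand_fake1.
missing_from_1 <- anti_join(leader_cand_fake2, leader_cand_fake1,
                            by = c("id", "first_name"))
missing_from_1
```

An anti-join like this can be a handy check before a left join, because it tells us which observations are about to be dropped.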

Pivoting a dataframe

`pivot_longer()`

Typically, we use pivot_longer() to tidy a dataset where the column headers are values as opposed to variable names.

How it works:

df %>% 
  pivot_longer(
    cols = c(column1, column2), # the columns we want to combine into one variable 
    names_to = "newcolumnname", # the name we give to our new variable 
    values_to = "frequency" # the values that appeared in the old columns will appear in a new column called "frequency" (can give it whatever name you'd like)
  )

Example using data:

In the example below, I create a “fake” dataset to show you how this might be done.

fake_dat <- data.frame(
  id = c(1,2,3,4,5),
  age = c(18, 39, 20, 65, 42),
  male = c(1, 1, 1, 0, 0),
  female = c(0,0,0,1,1), 
  province = c("Alberta", "British Columbia", "Ontario", "Alberta", "Ontario")
)

fake_dat_1 <- fake_dat %>%
  pivot_longer(
    cols = c(male, female),
    names_to = "gender", #the name of our new variable
    values_to = "frequency" #the values from the old "male" and "female" columns will go to a variable called "frequency"
  )

fake_dat_2 <- fake_dat_1 %>% 
  mutate(sex = case_when(gender == "male" & frequency == 1 ~ "male", # use mutate to make a new variable
                         # use case_when() to say when gender is male and value is 1, assign "male" 
                         gender == "female" & frequency == 1 ~ "female")) %>%
                          # when gender is female and value is 1, assign "female" 
  drop_na("sex") %>%
  select(-c(gender)) # here we use -c() to say select all variables except gender
# we're getting RID of the gender column because we no longer need it.
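To see what the pivot actually does to the shape of the data, here is a small sketch on a reduced, made-up version of the fake dataset (fake_small is a hypothetical name). It also shows pivot_wider() from the tidyr package, which goes in the opposite direction and spreads the values back out into separate columns.

```r
library(dplyr)
library(tidyr)

# A smaller, made-up version of the fake data
fake_small <- data.frame(
  id = c(1, 2, 3),
  male = c(1, 0, 1),
  female = c(0, 1, 0)
)

# Longer: one row per id/gender combination (3 ids x 2 columns = 6 rows)
fake_long <- fake_small %>%
  pivot_longer(cols = c(male, female), names_to = "gender", values_to = "frequency")
nrow(fake_long)

# Wider: spread the gender values back out into their own columns
fake_wide <- fake_long %>%
  pivot_wider(names_from = gender, values_from = frequency)
names(fake_wide)
```

After pivot_longer() each person appears twice (once per original column), which is why the example above filters down to the rows where frequency is 1.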

Optional: Other methods for recoding variables

(These are optional!! You do not NEED to use these methods, but they might be useful if you’re getting stuck with functions discussed above)

Change variable from continuous (interval/ratio) -> categorical (nominal or ordinal)

The recode() function from the car package is a helpful way to deal with categorical variables that have lots of categories and/or need to be assigned an order (i.e., to be ordinal). You will need to install the car package and load it. Go back to the top of your script/markdown document and do this.

Now, since there is also a function called recode() in tidyverse, we need to specify that we want to use the recode() function from the car package by putting car:: ahead of the function. This just tells R, look inside the car package to find the function.

Continuous –> nominal example

Let’s recode the provinces variable. Right now, it is numeric and numbers are used to identify the unique provinces (look at the dataset codebook to verify WHICH numbers identify which provinces).

dc_df2 <- dc_df2 %>%
  mutate(province = car::recode(dc21_province, 
  "1 = 'AB'; 2 = 'BC'; 3 = 'MB'; 4 = 'NB'; 5 = 'NL'; 6 = 'NWT'; 
  7 = 'NS'; 8 ='NU'; 9 = 'ON'; 10= 'PEI'; 11= 'QE'; 12 = 'SK'; 13 = 'YK';
  else=NA", as.factor=FALSE))

## Warning: There was 1 warning in `mutate()`.
## ℹ In argument: `province = car::recode(...)`.
## Caused by warning in `car::recode()`:
## ! NAs introduced by coercion

# in the code above we first specified our dataset and a pipe to say "take our dataset, then..."
# use mutate to create a new variable
# we want the new variable to be called "province" 
# province will be created using recode() from the car package 
# the first argument in the recode function is the variable we want to recode (dc21_province)
# the second argument in the recode function specifies how we want to recode the variable (eg. old value = new value, old value = new value, all others assign NA)
# the third argument specifies if we want the variable to be a FACTOR or not 
# usually, we specify factor if we want to tell R that there is a specific order to the values of the variable. here we specify that we don't want it to be a factor variable. 

class(dc_df2$province)

## [1] "character"

# the variable is NOT a factor, but instead, a character variable

table(dc_df2$dc21_province, dc_df2$province)

##     
##        AB   BC   MB   NB   NL   NS   ON  PEI   QE   SK
##   1   755    0    0    0    0    0    0    0    0    0
##   2     0  743    0    0    0    0    0    0    0    0
##   3     0    0  260    0    0    0    0    0    0    0
##   4     0    0    0  120    0    0    0    0    0    0
##   5     0    0    0    0  104    0    0    0    0    0
##   7     0    0    0    0    0  168    0    0    0    0
##   9     0    0    0    0    0    0 2767    0    0    0
##   10    0    0    0    0    0    0    0   30    0    0
##   11    0    0    0    0    0    0    0    0 1682    0
##   12    0    0    0    0    0    0    0    0    0  202

We also received a warning saying that NAs were introduced. Let’s check that out by taking our data and filtering the rows to keep only those with NA values for the province variable.

dc_df2 %>% 
  filter(is.na(province))

##  [1] ResponseId              age_in_years            dc21_province          
##  [4] dc21_education          dc21_democratic_sat     dc21_prov_gov_satis    
##  [7] interact_with_govt      birth_year              interact_with_govt_fct 
## [10] education               educ_fct                interact_with_govt_fct2
## [13] province               
## <0 rows> (or 0-length row.names)

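Looks like we should be all good: no rows have a missing value for province. So what do we think happened? The warning most likely comes from inside car::recode() itself. When as.factor=FALSE, the function checks whether the recoded values can be converted to numbers, and labels like 'AB' cannot, which can trigger the “NAs introduced by coercion” warning even though no NAs end up in our new variable. Note also that a few codes (6, 8 and 13) do not appear in the data table we made above, meaning there weren’t any observations in our dataset assigned those values on the original province variable.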

Continuous –> Ordinal example

Let’s turn our education variable into an ordinal variable. Why? Right now, the values of the education variable are stand-in labels for different categories of educational completion (e.g., 1 means no schooling). We know that there is an order or ranking to these categories, even if we can’t specify the exact difference between the categories. For these reasons, we will recode education as a factor.

We will use the recode() function from the car package again, in combination with the mutate() function which we use when we want to create a new variable in our dataset.

Now, let’s say that we want to put our education variable into larger “buckets”. Rather than have 11 different education levels, we could re-group the values of the variable so that there are … categories.

dc_df2 <- dc_df2 %>%
  mutate(education = car::recode(dc21_education, 
                                 "1:4= '<HS'; 5 = 'HS'; c(6,8) = 'somePS'; 
                                 7 = 'techDiploma'; 9 = 'BA'; 10:11 = 'MA+'",
                                 as.factor = TRUE, # tell R to make the variable into a factor
                                 levels=c("<HS", "HS", "somePS", "techDiploma", "BA", "MA+")))
# levels specifies the ORDER of the categories from lowest to highest

NOTE: in the above code, we use : to specify that we want to capture the values on either side of : and all of those that lie in between (e.g., 1:4 means capture values 1 and 4 and all values in between, i.e. 2 and 3).

Let’s check how our new variable compares to our old variable in order to verify that the recoding did what we wanted it to do.

table(dc_df2$dc21_education, dc_df2$education)

##     
##       <HS   HS somePS techDiploma   BA  MA+
##   1     4    0      0           0    0    0
##   2     8    0      0           0    0    0
##   3    27    0      0           0    0    0
##   4   174    0      0           0    0    0
##   5     0  906      0           0    0    0
##   6     0    0    673           0    0    0
##   7     0    0      0        1526    0    0
##   8     0    0    593           0    0    0
##   9     0    0      0           0 1971    0
##   10    0    0      0           0    0  692
##   11    0    0      0           0    0  257

ifelse statements

How it works:

df %>% 
  mutate(
    newvariable = ifelse(oldvariable == 1, 2, NA)
  )
# we can read this as, if the old variable equals 1, assign it 2, otherwise assign NA (meaning for all other values than 1 taken on by the old variable, assign it NA)

# We can also have nested ifelse statements, for e.g.: 
df %>% 
  mutate(
    newvariable = ifelse(oldvariable == "cat", "animal", # if the old variable takes the value "cat", assign it the value "animal" in our new variable 
                         ifelse(oldvariable == "pencil", "inanimate",  # if the old variable takes the value "pencil", assign it the value "inanimate" in our new variable 
                                ifelse(oldvariable == "squirrel", "animal", 
                                       ifelse(oldvariable == "desk", "inanimate", NA)))) # if the old variable takes the value "desk", assign it the value "inanimate" in our new variable, FOR ALL OTHER VALUES OF THE OLD VARIABLE assign them as NAs in the new variable 
  )
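Here is a version that WILL run, as a rough sketch on a made-up dataframe (toy_df, thing and kind are hypothetical names). It uses %in% to group values together so that only two levels of nesting are needed, and then checks the result with table().

```r
library(dplyr)

# toy_df is made up for illustration
toy_df <- data.frame(thing = c("cat", "pencil", "squirrel", "desk", "cloud"))

toy_df <- toy_df %>%
  mutate(
    kind = ifelse(thing %in% c("cat", "squirrel"), "animal",       # animals first
                  ifelse(thing %in% c("pencil", "desk"), "inanimate", # then objects
                         NA))                                       # everything else is NA
  )

# Check our work; useNA = "ifany" shows the values that were assigned NA
table(toy_df$thing, toy_df$kind, useNA = "ifany")
```

Once there are more than two or three conditions, nested ifelse() calls get hard to read, which is one reason case_when() and case_match() are usually easier to work with.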

Exercises:

Import the federal candidates dataset that we used for lesson 2 (available on OWL). (Remember, you should have a copy of the data in your working directory in order to be able to import it into your current R session using import in the way we’ve discussed)
Create a new variable called “region”. The “region” variable should include the categories: West (all Western Canada provinces), Central (Ontario and Quebec), East (Newfoundland, New Brunswick, Nova Scotia, PEI). The territories can be coded as NA. The variable should be a factor and the order of the categories should be West, Central, East. What functions did you use? How did you check your work? (HINT: table())
Create a variable called “sex” with the categories “male” and “female” that is based on the existing gender variable in the dataset. You may need to think creatively to try to figure out what the 0s and 1s in the gender column mean (which is female? which is male?).

Wrap up

Important functions discussed

Variable recoding:
- mutate()
- as.factor()
- as.character()
- as.numeric()
- case_match()
- case_when()
- fct_recode()
- fct_relevel()
Optional additional functions for variable recoding:
- car::recode()
- ifelse()
For checking variable recoding:
- table()
Different types of pivots:
- pivot_longer()
- pivot_wider()
Different types of dataframe joins:
- left_join()
- full_join()
- anti_join()

Important operators discussed

%>% or |> (the pipe)
~ (like the assignment operator turned the opposite way around; used in case_match() and case_when())

Class 4: Data Wrangling Part II

POL3325G Data Science for Politics (January 28, 2025)

Shanaya Vanhooren
