Helpful tip: Starting a line with # (outside a code chunk) in an R Markdown document creates a header. Click the “Outline” tab in the top right corner of the source pane to jump between your headers.
Remember, the path specified below will look different in your script since you’re not working on my computer!
setwd("~/Users/shanaya/Documents/POL3325G Data Science Winter 2025/Lectures/Lecture 3")
library(rio)
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.0
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
You can call your object whatever you’d like! Here, I save my data as an object called “dat”. Just save it as something so that it appears in the global environment.
dat <- import("federal-candidates-2023-subset.dta")
Using the federal candidates dataset that we have already imported into R during this lesson, I want you to subset the dataframe to include only the following variables: the ID variable, election date, candidate name, and occupation.
dat2 <- dat %>%
  select(id, edate, candidate_name, occupation)
Above, I use select() to keep only the specified columns (id, edate, candidate_name, occupation). You may have had to look at the variable names using the names() function to figure out the exact names of the ID and election date variables.
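Here is a small sketch of that workflow: list the column names first, then select the ones you need. The `toy` data frame and its values are made up for illustration and only mimic the column names used in this lesson; swap in your own object (here, `dat`) when you try it.

```r
library(dplyr)

# Hypothetical stand-in for the candidates data (values are invented)
toy <- data.frame(
  id = c(1, 2, 3),
  edate = c("2008-10-14", "2011-05-02", "2011-05-02"),
  candidate_name = c("A, Alex", "B, Blair", "C, Casey"),
  occupation = c("teacher", "lawyer", "farmer"),
  province = c("Ontario", "Quebec", "Alberta"),
  stringsAsFactors = FALSE
)

# Print every column name so you can copy the exact spelling
names(toy)

# Keep only the columns of interest
toy2 <- toy %>%
  select(id, edate, candidate_name, occupation)

# Confirm which columns are left
names(toy2)
```

Running names() before and after select() is a quick way to confirm that you kept exactly the columns you intended.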
Subset your dataframe to keep only those candidates that participated in the 2011 election. (HINT: edate == 2011-05-02)
dat3 <- dat2 %>% filter(edate == "2011-05-02")
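Above, I filter the data to keep only the candidates in the 2011 federal election by specifying the date of the election. Notice how I had to put the date in quotation marks. This is because the edate variable is of the class “character”. Without the quotation marks, R would read 2011-05-02 as subtraction (2011 minus 5 minus 2) and compare edate to the number 2004.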
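If you are unsure how a date variable is stored, class() will tell you. The sketch below uses a made-up `toy` data frame with one date column stored as character and a second copy converted with as.Date(), so both cases are shown. The column names `edate_chr` and `edate_date` are hypothetical and exist only for this example.

```r
library(dplyr)

# Hypothetical data: the same dates stored two different ways
toy <- data.frame(
  id = c(1, 2, 3),
  edate_chr = c("2008-10-14", "2011-05-02", "2011-05-02"),
  stringsAsFactors = FALSE
)
toy$edate_date <- as.Date(toy$edate_chr)

# Check how each column is stored
class(toy$edate_chr)
class(toy$edate_date)

# Character column: compare against a quoted string
toy_2011_chr <- toy %>% filter(edate_chr == "2011-05-02")

# Date column: compare against a Date built with as.Date()
toy_2011_date <- toy %>% filter(edate_date == as.Date("2011-05-02"))

# Both approaches keep the same two rows
nrow(toy_2011_chr)
nrow(toy_2011_date)
```

Either way, the date you compare against is written as a quoted string, either on its own or wrapped in as.Date().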
Sort your dataframe by province.
dat3 %>% arrange(province) # this won't work! See explanation below
Uh oh! We can’t sort by province because we did not keep the province variable when we first subset the columns of our dataset using select(). If we wanted to sort by province, we could add the province variable above and then sort the data.
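As a rough sketch of that fix, the chain below keeps province in the select() step, filters to the 2011 election, and then sorts. It runs on a made-up `toy` data frame (invented values, same column names as the lesson), so treat it as an illustration and not as output from the real dataset.

```r
library(dplyr)

# Hypothetical stand-in for the candidates data (values are invented)
toy <- data.frame(
  id = c(1, 2, 3, 4),
  edate = c("2011-05-02", "2011-05-02", "2008-10-14", "2011-05-02"),
  candidate_name = c("A, Alex", "B, Blair", "C, Casey", "D, Drew"),
  occupation = c("teacher", "lawyer", "farmer", "nurse"),
  province = c("Quebec", "Alberta", "Ontario", "Manitoba"),
  stringsAsFactors = FALSE
)

toy_sorted <- toy %>%
  select(id, edate, candidate_name, occupation, province) %>%  # keep province this time
  filter(edate == "2011-05-02") %>%                            # 2011 election only
  arrange(province)                                            # sort A to Z by province

toy_sorted
```

With the real data, the same idea applies: add province to the select() call that creates dat2, rerun the filter step, and arrange(province) will then work.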
Rename the occupation variable to ‘job’.
dat3 <- rename(dat3, job = occupation)
You can check this worked by running:
head(dat3)
## id edate candidate_name job
## 1 31383 2011-05-02 WRZESNEWSKYJ, Borys parliamentarian
## 2 32629 2011-05-02 LAFORESTERIE, Francis sales director
## 3 3988 2011-05-02 CLEARY, Ryan journalist/writer
## 4 32346 2011-05-02 GRANT, Lisa household manager
## 5 6074 2011-05-02 DUNCAN, John parliamentarian
## 6 9490 2011-05-02 HOBACK, Randy parliamentarian
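You could also rename the variable by using the rename() function within a pipe. Run this instead of the call above, not in addition to it: once occupation has been renamed to job, there is no occupation column left to rename.
dat3 <- dat3 %>%
  rename(job = occupation)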