In this class, we will start looking at datasets in R. We’ll focus on:
Before we get started, remember the steps for opening a new script or markdown document to take notes for this class:
In this class, we’re only going to need two packages: rio and tidyverse.
The rio package is used to import data into R. We’ll only use one of its functions: import().
Review point: The golden rule of packages is “Install once. Load every time.” In the same way that you only install the Instagram app once on your phone, you only need to install a package once on your computer. BUT every time you open R, you’ll need to “load” any packages you want to use in that session.
The function to install a package is install.packages(). Within the parentheses, you’ll need to provide the specific name of the package. Let’s do this now.
install.packages(rio)
Error! What went wrong? You can see that the error message says: “object ‘rio’ not found”. Last class, we talked about how R is an “object-oriented” system, meaning that we save the output of our code as “objects”, which we can then see in the Global Environment (the top right panel in RStudio). The problem is that the function thinks rio is an object we’ve saved, not the name of a package. So we need to remember to include quotation marks around the name!
install.packages("rio")
You’ll see a lot of messages when installing the package, which are completely normal. Now, let’s load tidyverse and rio. You should have installed tidyverse last week, but if you didn’t, you can un-comment the line of code below to install it before loading the package.
Note: you may have received a message that says “the following rio suggested packages are not installed…”. If you wish, you can use install_formats() to install those. For today, we don’t need to worry about this. If you want to learn more about why you received this message, you can read about it here: https://cran.r-project.org/web/packages/rio/vignettes/rio.html#Import,_Export,_and_Convert_Data_Files
#install.packages("tidyverse")
# if you attended class last week,
# you should have already installed tidyverse.
# so there's no need to run the
# install.packages("tidyverse") code above.
Review point: Remember that we use # to “comment our code”. Whenever we put a # in front of a line of code, it tells R NOT to try to execute whatever comes after the # because these are our “notes to self”.
Great! We’re ready to go. Remember, you only ever need to install a package once on your computer! That being said, packages get updated (just like R), so you may want to reinstall a package in the future to have the most up-to-date version.
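If you want R to check whether a package is already installed before you install it, here is a minimal sketch. The helper name needs_install() is made up for this example (it is not part of rio or tidyverse); it relies on base R's requireNamespace(), which reports whether a package can be found on your computer.

```r
# A minimal sketch: check whether a package is installed before installing it.
# needs_install() is a hypothetical helper name used only for illustration.
needs_install <- function(pkg) {
  # requireNamespace() returns TRUE if the package can be found, FALSE otherwise
  !requireNamespace(pkg, quietly = TRUE)
}

needs_install("stats")   # "stats" ships with R, so this returns FALSE

# Install only what is missing (un-comment to run; needs an internet connection)
# for (pkg in c("rio", "tidyverse")) {
#   if (needs_install(pkg)) install.packages(pkg)
# }
```

This is optional. Running install.packages("rio") a second time does no harm; it simply reinstalls the package.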
Now that these packages are installed on our computer, we still need to load the packages in order to use them during this session. We do that using the library() function.
library(rio)
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.0
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
Great! We’re ready to import real data and start doing some cool things!
Most of the time, the data we’re interested in using/exploring will need to be imported into R. Fortunately, R can flexibly handle data that exists in different formats - Excel (.xlsx, .csv), SPSS (.sav), Stata (.dta), text (.txt), etc. This can be achieved using functions from a number of different packages, but we’ll use the import() function from the rio package. Whenever we import data, we want to save it as an object. Like we’ve discussed before, there are few rules about what you name an object, other than not using spaces and keeping things short. A common practice is to use “df” (i.e. “dataframe”) for dataframe objects.
To import the data into R, we must have a copy of the file in the same file folder as our working directory. Remember, setting the working directory gives R a map to a specific file folder. As long as the data file is there (and we’ve spelled everything correctly!), R will find the file and bring it into the R environment.
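Before importing, it can help to confirm that R is looking in the folder you expect. The sketch below uses only base R functions (getwd(), file.exists(), list.files()); the object names data_file and found are just illustrative, and the file name is the one used in the import code in this lesson.

```r
# Where is R currently looking for files?
getwd()

# Is the data file in that folder? (TRUE if yes, FALSE if not)
data_file <- "federal-candidates-2023-subset.dta"
found <- file.exists(data_file)
found

# List every Stata (.dta) file in the working directory to check the spelling
list.files(pattern = "\\.dta$")
```

If found is FALSE, either the file is in a different folder or the name is spelled differently from what you typed.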
The data we’ll be using in this class is a dataset on candidate characteristics in Canadian federal elections (1867-2017). The data is freely available online at: https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/ABFNSQ
FOR TODAY’S CLASS, I have created a smaller version of this data that includes only about 10,000 observations (it excludes many of the earlier elections). It is available on OWL under Content -> Data and the data file is called “federal-candidates-2023-subset.dta”
import()
We use the import() function from the rio package to import the data into RStudio.
import("federal-candidates-2023-subset.dta")
# you need to ensure...
# 1) this data exists in your working directory (the folder we're using for today's lesson)
# 2) the file name is in quotation marks
# 3) the file name you specify above matches the file name as it exists on your computer
You’ll notice that the dataset contains an “id” variable. This is an identification variable that provides a unique ID number to each observation in our dataset.
Look at your global environment. Was the data saved into the global environment? Why not?
We need to assign our dataset to an object so that it will “live” in the global environment in our RStudio session.
df <- import("federal-candidates-2023-subset.dta")
# df is commonly used to mean "dataframe"
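To see the difference between running a function and assigning its result, here is a small self-contained sketch. It writes a tiny CSV file to a temporary location and reads it back with base R's read.csv(), which stands in for import() here only so the example runs without the class data file. The object name toy_df and the values are made up.

```r
# Write a tiny CSV to a temporary file so the example is self-contained
tmp <- tempfile(fileext = ".csv")
write.csv(data.frame(id = 1:3, province = c("Quebec", "Ontario", "Yukon")),
          tmp, row.names = FALSE)

# Without assignment: the data prints in the console, but nothing is saved
read.csv(tmp)

# With assignment: the data is stored as an object we can reuse
toy_df <- read.csv(tmp)
exists("toy_df")   # TRUE: toy_df now appears in the Global Environment
nrow(toy_df)       # 3
```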
If our dataset isn’t too large, we can click on the name of the data in our global environment to open it in a viewer window in the source window.
Below are some other ways to peek at our data (especially if it is a LARGE file):
head(df) # look at the first 6 rows
head(df, 10) # look at the first ten rows
#hint: change 10 to any (reasonable) number you'd like
tail(df) # look at the bottom 6 rows in our data
tail(df, 14) # look at the bottom 14 rows in our data
#hint: change 14 to any (reasonable) number you'd like
In the code below, I use base R to look at specific rows/columns, and to look at the values of a specific row(s) and/or column(s). Remember, datasets are organized as rows by columns.
# let's look at the value of the first row, second column
df[1, 2] # first number specifies the row(s), second number specifies the column(s)
## [1] 38
Above, when we specify row 1 and column 2 of the df object, we find that the first observation in our dataset belongs to the 38th parliament.
# let's look at the first ten rows in our dataset
df[1:10, ] # we include nothing after the comma which communicates to R that we want it to return ALL columns
## id parliament year candidate_name edate birth_year country_birth
## 1 4443 38 2004 CÔTÉ, Jean-Guy 2004-06-28 NA
## 2 24524 38 2004 NELSON, Erin 2004-06-28 NA
## 3 20062 38 2004 KUNZ, Revel 2004-06-28 NA
## 4 20861 38 2004 LE BEL, Benjamin 2004-06-28 NA
## 5 19061 38 2004 KOSSICK, Don 2004-06-28 NA
## 6 2813 38 2004 BUORS, Chris 2004-06-28 NA
## 7 4318 38 2004 CORBIERE, Mark 2004-06-28 NA
## 8 6320 38 2004 ELGERSMA, Steven 2004-06-28 NA
## 9 3258 38 2004 CARIGNAN, Jean Guy 2004-06-28 NA
## 10 9532 38 2004 HOFFMAN, Rachel 2004-06-28 NA
## occupation riding_id riding
## 1 composer NA MATAPÉDIA--MATANE
## 2 stationary engineer NA NORTH OKANAGAN--SHUSWAP
## 3 b & b owner NA BURNABY--NEW WESTMINSTER
## 4 community development officer NA ALFRED-PELLAN
## 5 community developer NA BLACKSTRAP
## 6 locomotive engineer NA SAINT BONIFACE
## 7 youth organizer NA KITCHENER CENTRE
## 8 farmer/horticulturalist (retired) NA HALDIMAND--NORFOLK
## 9 parliamentarian NA LOUIS-SAINT-LAURENT
## 10 graphic designer NA NOTRE-DAME-DE-GRÂCE--LACHINE
## province votes percent_votes party_raw
## 1 Quebec 1581 4.9922638 New Democratic Party
## 2 British Columbia 2333 4.5069060 Green Party of Canada
## 3 British Columbia 1606 3.8515036 Green Party of Canada
## 4 Quebec 1849 3.4669616 New Democratic Party
## 5 Saskatchewan 8862 23.5503578 New Democratic Party
## 6 Manitoba 317 0.8213286 Marijuana Party
## 7 Ontario 277 0.6139184 Independent
## 8 Ontario 617 1.2394536 Christian Heritage Party of Canada
## 9 Quebec 563 1.2548478 Independent
## 10 Quebec 88 0.1987667 Marxist-Leninist Party
## party_minor_group party_major_group gov_party_raw gov_minor_group
## 1 NDP CCF_NDP Liberal Party of Canada Liberal
## 2 Green Third_Party Liberal Party of Canada Liberal
## 3 Green Third_Party Liberal Party of Canada Liberal
## 4 NDP CCF_NDP Liberal Party of Canada Liberal
## 5 NDP CCF_NDP Liberal Party of Canada Liberal
## 6 Marijuana Third_Party Liberal Party of Canada Liberal
## 7 Independent Independent Liberal Party of Canada Liberal
## 8 Christian_Heritage Third_Party Liberal Party of Canada Liberal
## 9 Independent Independent Liberal Party of Canada Liberal
## 10 Marxist_Lennist Third_Party Liberal Party of Canada Liberal
## gov_major_group num_candidates type_elxn elected incumbent gender lgbtq2_out
## 1 Liberal 5 1 0 0 0 NA
## 2 Liberal 8 1 0 0 0 NA
## 3 Liberal 6 1 0 0 1 NA
## 4 Liberal 7 1 0 0 0 NA
## 5 Liberal 5 1 0 0 0 NA
## 6 Liberal 7 1 0 0 0 NA
## 7 Liberal 5 1 0 0 0 NA
## 8 Liberal 5 1 0 0 0 NA
## 9 Liberal 8 1 0 1 0 NA
## 10 Liberal 8 1 0 0 1 NA
## indigenousorigins lawyer censuscategory acclaimed switcher
## 1 0 0 5 0 0
## 2 0 0 2 0 0
## 3 0 0 NA 0 0
## 4 0 0 4 0 0
## 5 0 0 4 0 0
## 6 0 0 2 0 0
## 7 0 0 4 0 0
## 8 0 0 NA 0 0
## 9 0 0 10 0 0
## 10 0 0 5 0 0
## multiple_candidacy
## 1 0
## 2 0
## 3 0
## 4 0
## 5 0
## 6 0
## 7 0
## 8 0
## 9 0
## 10 0
df[ , 1:4] # here we only want to look at the first four columns
# I did not print the results below (because it's a LOT), but you can try this on your laptop
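If you would like to practise the [row, column] pattern on something small, here is a sketch with a made-up three-row dataframe called toy (the values are invented for illustration and are not from the candidates data).

```r
# A small made-up dataframe for practising [row, column] indexing
toy <- data.frame(
  id         = c(101, 102, 103),
  parliament = c(38, 38, 39),
  province   = c("Quebec", "Ontario", "Yukon"),
  stringsAsFactors = FALSE
)

toy[1, 2]            # row 1, column 2 -> 38
toy[2:3, ]           # rows 2 to 3, ALL columns (nothing after the comma)
toy[, 1:2]           # ALL rows (nothing before the comma), columns 1 to 2
toy[3, "province"]   # columns can also be picked by name -> "Yukon"
dim(toy)             # number of rows, then number of columns -> 3 3
```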
Great work so far.
We can see all the variable names with the names() function:
names(df)
## [1] "id" "parliament" "year"
## [4] "candidate_name" "edate" "birth_year"
## [7] "country_birth" "occupation" "riding_id"
## [10] "riding" "province" "votes"
## [13] "percent_votes" "party_raw" "party_minor_group"
## [16] "party_major_group" "gov_party_raw" "gov_minor_group"
## [19] "gov_major_group" "num_candidates" "type_elxn"
## [22] "elected" "incumbent" "gender"
## [25] "lgbtq2_out" "indigenousorigins" "lawyer"
## [28] "censuscategory" "acclaimed" "switcher"
## [31] "multiple_candidacy"
We can see the structure of the dataframe with the str() function:
str(df)
## 'data.frame': 10001 obs. of 31 variables:
## $ id : num 4443 24524 20062 20861 19061 ...
## ..- attr(*, "format.stata")= chr "%10.0g"
## $ parliament : num 38 38 38 38 38 38 38 38 38 38 ...
## ..- attr(*, "format.stata")= chr "%10.0g"
## $ year : num 2004 2004 2004 2004 2004 ...
## ..- attr(*, "format.stata")= chr "%10.0g"
## $ candidate_name : chr "CÔTÉ, Jean-Guy" "NELSON, Erin" "KUNZ, Revel" "LE BEL, Benjamin" ...
## ..- attr(*, "format.stata")= chr "%-9s"
## $ edate : chr "2004-06-28" "2004-06-28" "2004-06-28" "2004-06-28" ...
## ..- attr(*, "format.stata")= chr "%-9s"
## $ birth_year : num NA NA NA NA NA NA NA NA NA NA ...
## ..- attr(*, "format.stata")= chr "%10.0g"
## $ country_birth : chr "" "" "" "" ...
## ..- attr(*, "format.stata")= chr "%-9s"
## $ occupation : chr "composer" "stationary engineer" "b & b owner" "community development officer" ...
## ..- attr(*, "format.stata")= chr "%-9s"
## $ riding_id : num NA NA NA NA NA NA NA NA NA NA ...
## ..- attr(*, "format.stata")= chr "%10.0g"
## $ riding : chr "MATAPÉDIA--MATANE" "NORTH OKANAGAN--SHUSWAP" "BURNABY--NEW WESTMINSTER" "ALFRED-PELLAN" ...
## ..- attr(*, "format.stata")= chr "%-9s"
## $ province : chr "Quebec" "British Columbia" "British Columbia" "Quebec" ...
## ..- attr(*, "format.stata")= chr "%-9s"
## $ votes : num 1581 2333 1606 1849 8862 ...
## ..- attr(*, "format.stata")= chr "%10.0g"
## $ percent_votes : num 4.99 4.51 3.85 3.47 23.55 ...
## ..- attr(*, "format.stata")= chr "%10.0g"
## $ party_raw : chr "New Democratic Party" "Green Party of Canada" "Green Party of Canada" "New Democratic Party" ...
## ..- attr(*, "format.stata")= chr "%-9s"
## $ party_minor_group : chr "NDP" "Green" "Green" "NDP" ...
## ..- attr(*, "format.stata")= chr "%-9s"
## $ party_major_group : chr "CCF_NDP" "Third_Party" "Third_Party" "CCF_NDP" ...
## ..- attr(*, "format.stata")= chr "%-9s"
## $ gov_party_raw : chr "Liberal Party of Canada" "Liberal Party of Canada" "Liberal Party of Canada" "Liberal Party of Canada" ...
## ..- attr(*, "format.stata")= chr "%-9s"
## $ gov_minor_group : chr "Liberal" "Liberal" "Liberal" "Liberal" ...
## ..- attr(*, "format.stata")= chr "%-9s"
## $ gov_major_group : chr "Liberal" "Liberal" "Liberal" "Liberal" ...
## ..- attr(*, "format.stata")= chr "%-9s"
## $ num_candidates : num 5 8 6 7 5 7 5 5 8 8 ...
## ..- attr(*, "format.stata")= chr "%10.0g"
## $ type_elxn : num 1 1 1 1 1 1 1 1 1 1 ...
## ..- attr(*, "format.stata")= chr "%10.0g"
## $ elected : num 0 0 0 0 0 0 0 0 0 0 ...
## ..- attr(*, "format.stata")= chr "%10.0g"
## $ incumbent : num 0 0 0 0 0 0 0 0 1 0 ...
## ..- attr(*, "format.stata")= chr "%10.0g"
## $ gender : num 0 0 1 0 0 0 0 0 0 1 ...
## ..- attr(*, "format.stata")= chr "%10.0g"
## $ lgbtq2_out : num NA NA NA NA NA NA NA NA NA NA ...
## ..- attr(*, "format.stata")= chr "%10.0g"
## $ indigenousorigins : num 0 0 0 0 0 0 0 0 0 0 ...
## ..- attr(*, "format.stata")= chr "%10.0g"
## $ lawyer : num 0 0 0 0 0 0 0 0 0 0 ...
## ..- attr(*, "format.stata")= chr "%10.0g"
## $ censuscategory : num 5 2 NA 4 4 2 4 NA 10 5 ...
## ..- attr(*, "format.stata")= chr "%10.0g"
## $ acclaimed : num 0 0 0 0 0 0 0 0 0 0 ...
## ..- attr(*, "format.stata")= chr "%10.0g"
## $ switcher : num 0 0 0 0 0 0 0 0 0 0 ...
## ..- attr(*, "format.stata")= chr "%10.0g"
## $ multiple_candidacy: num 0 0 0 0 0 0 0 0 0 0 ...
## ..- attr(*, "format.stata")= chr "%10.0g"
In order for R to perform the appropriate commands on variables, it needs to understand how variables are measured. The Haan & Godley (2017) reading for today discussed different levels of measurement. The level of measurement can differ across variables. In R language, the level of measurement is the variable class. There are four types you’ll commonly see: “character”, “factor”, “numeric”, “integer”.
Figure 1 shows the relationship between these ideas. Broadly, we can think about two different levels of measurement of our variables: categorical and continuous. Categorical variables are nominal or ordinal. In political science, we’re interested in all sorts of categorical variables - political parties, type of conflict, type of institution or regime, etc. An example of an ordinal variable would be a government’s degree of respect for human rights (e.g. “none”, “some”, “full”).
Continuous variables are formally considered to be interval or ratio. For the purposes of this course, we’re not going to belabour the difference. The key idea is that continuous variables are numbers with a meaningful/exact space between the levels. For instance, if someone asked you your age, the difference between 20 and 21 years is the same as the difference between 45 and 46 years. The distance or space between the levels in our ordinal example (respect for human rights) is not necessarily the same. Conceptually, going from “none” to “some” could be a lot different than going from “some” to “full”.
In R, the translation of the level of measurement occurs using the “class” (or type) of variable. Categorical variables can be coded as character or factor variables. Continuous variables can be coded as numeric or integer variables.
There are two reasons why this matters. First, the way variables are coded when we first acquire the data can be changed. Maybe we want to transform an income variable, measured as continuous, into a categorical variable (low, middle, high income), because that’s more appropriate for our research question. Or maybe we want to change a categorical variable from 4 categories into 2.
Second, when visualizing the data, there are times when we want to change the class of a variable strictly for visualization purposes. As we’ll see in future classes, the way R produces colours for continuous variables may not be what we want. This is especially true for things like “time in years”. Even though the “years” variable is technically/conceptually continuous, for plotting purposes, we may need to convert it to be a categorical variable.
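The two kinds of conversion described above can be sketched with base R. The income values, the cut-off points, and the object names below are all made up for illustration. cut() turns a continuous variable into categories, and factor() turns a numeric year into a categorical variable.

```r
# Made-up incomes (continuous)
income <- c(18000, 52000, 140000, 75000)

# Continuous -> categorical: split into three groups at assumed cut-offs
income_group <- cut(income,
                    breaks = c(0, 40000, 100000, Inf),
                    labels = c("low", "middle", "high"))
income_group          # low middle high middle
class(income_group)   # "factor"

# Numeric year -> categorical year (useful for some plots)
year <- c(2004, 2011, 2015)
year_cat <- factor(year)
class(year)       # "numeric"
class(year_cat)   # "factor"
```

We will come back to changing variable classes in later classes; this is only a preview of what the conversion looks like.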
The character and factor classes are used for categorical variables. For example, sex, country, political party, profession, university major, etc. Sometimes, whether a categorical variable is of class character or factor will make no difference for our purposes. Sometimes it will. This will typically depend on the task we are trying to accomplish and the way a function has been designed. We’ll talk about this more next class.
For our purposes, we can treat the numeric and integer classes as the same - these are variables that are measured as continuous (ratio or interval level). Examples include age, GDP, vote percentage, etc. Technically, integers are meant for whole numbers (no decimals), and numeric are meant for numbers that include decimals. Don’t worry about this distinction too much.
Figure 1: Level of Measurement and Variable Class
When we used the str() function, it showed us the class of every variable in our dataframe. Another way to determine the class of a single variable is to use the class() function. Let’s check the class of a few of the variables in the df object.
# Check class of province variable
class(df$province)
# Check class of gender variable
class(df$gender)
# Check class of votes variable.
class(df$votes)
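To see what each of the four classes looks like on its own, here is a small sketch using made-up values rather than the candidates data.

```r
class("Quebec")                    # "character"
class(factor(c("NDP", "Green")))   # "factor"
class(4.99)                        # "numeric"
class(5L)                          # "integer" (the L suffix asks R for an integer)

# For our purposes integer and numeric behave the same way:
is.numeric(5L)                     # TRUE
is.numeric(4.99)                   # TRUE
```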
As you can see, there are 31 different variables in the dataset. We may not actually care about all of them, so rather than work with the full dataset, let’s make a new dataframe object with only the variables we care about. We can do this using the select() function. The basic idea is that we’ll tell the function which variables to either include or to exclude. Let’s make a new dataframe, called “df1”, with the following variables: id, province, party_raw, gender, incumbent, elected, and edate.
select()
# Option 1: Include specific variables
df1 <- select(df, c(id, province, party_raw, gender, incumbent, elected, edate))
# Option 2: Exclude specific variables
# I'm making a new dataframe object called df2 and excluding the "occupation" variable only
# this is much more efficient than typing out the names of the other 30 variables I want to KEEP
df2 <- select(df, -c(occupation))
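Here is the same include/exclude idea on a tiny made-up dataframe, so you can check the result by eye. It assumes the dplyr package is installed (it is loaded automatically as part of tidyverse); the object names toy, kept, and dropped are just for illustration.

```r
library(dplyr)  # select() comes from dplyr, which tidyverse loads for us

toy <- data.frame(
  id         = 1:3,
  province   = c("Quebec", "Ontario", "Yukon"),
  occupation = c("composer", "farmer", "lawyer"),
  votes      = c(1581, 617, 88),
  stringsAsFactors = FALSE
)

# Include: name the variables to keep
kept <- select(toy, id, province)
names(kept)      # "id" "province"

# Exclude: put a minus sign in front of the variable to drop
dropped <- select(toy, -occupation)
names(dropped)   # "id" "province" "votes"
```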
So we’ve created a new dataframe (df1) with only 7 variables. Let’s narrow things down even more. I only want to see data for the 2015 election (“2015-10-19”). We’ll make another new dataframe object called trudeau_2015 and use the filter() function to include only the observations (rows) which have that value for the edate variable.
filter()
# Option 1: Include specific variable values
trudeau_2015 <- filter(df1, edate == "2015-10-19")
# Option 2: Exclude specific variable values
# I'm making a new dataframe object called no_2015 and excluding the "2015-10-19" variable value
no_2015 <- filter(df1, edate != "2015-10-19")
Try opening these two new datasets by clicking on them in the Global Environment pane.
Maybe we want to see two elections together, 2015 and 2019 (“2015-10-19” and “2019-10-21”). We can use the & (AND) and | (OR) operators to combine multiple conditions, even conditions across multiple variables (e.g. Male AND Liberal; Quebec OR Elected).
# Election 1 OR Election 2 (satisfy either condition)
trudeau3 <- filter(df1, edate == "2015-10-19" | edate == "2019-10-21")
# British Columbia AND Female (must satisfy both conditions)
BC_women <- filter(df1, province == "British Columbia" & gender == "F") #oops this didn't work!
# Why didn't this work?
BC_women <- filter(df1, province == "British Columbia" & gender == 1)
# a good codebook would tell us how gender/sex is coded
# the candidate names offer a clue
# female = 1
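When there is no codebook at hand, one way to check how a variable is coded is to look at its class and its distinct values before filtering. The sketch below uses a made-up gender vector and base R's table() function (a function not covered in this lesson) to count each value.

```r
# A made-up gender variable, coded the same way as in our dataset (0/1)
gender <- c(0, 0, 1, 0, 1)

class(gender)        # "numeric", so the values are numbers, not letters
table(gender)        # counts how many times each value appears

# Comparing a 0/1 variable to "F" never matches, so no rows are kept
sum(gender == "F")   # 0
sum(gender == 1)     # 2
```

This is why the first BC_women filter returned an empty dataframe rather than an error: the condition was valid code, it just never matched any row.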
We could filter to keep observations from several provinces:
df_new <- filter(df1, province %in% c("British Columbia", "Alberta", "Manitoba"))
We use %in% to keep rows whose value matches any item in a set of values (we list the provinces we would like to keep inside c()).
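To see what %in% is doing behind the scenes, the base R sketch below compares it with the longer version written with | (OR). The province vector is made up for illustration.

```r
province <- c("Quebec", "Alberta", "Ontario", "Manitoba", "British Columbia")

# %in% asks, for each value: is it one of these?
keep_in <- province %in% c("British Columbia", "Alberta", "Manitoba")

# The same question written out with | (OR)
keep_or <- province == "British Columbia" |
           province == "Alberta" |
           province == "Manitoba"

identical(keep_in, keep_or)   # TRUE: both give the same TRUE/FALSE answers
province[keep_in]             # "Alberta" "Manitoba" "British Columbia"
```

filter() keeps the rows where this kind of condition is TRUE, so %in% is simply a shorter way of writing several OR conditions on the same variable.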
Focusing on our new “trudeau_2015” dataframe, we can look at the dataframe by clicking on the object in our Global Environment. To make it easier to view, perhaps we want to organize the rows by province. We can do this using the arrange() function.
arrange()
trudeau_2015 <- arrange(trudeau_2015, province)
head(trudeau_2015, 2) # look at the first 2 rows of our dataset
## id province party_raw gender incumbent elected edate
## 1 32601 Alberta Liberal 0 0 0 2015-10-19
## 2 33399 Alberta Independent 0 0 0 2015-10-19
You’ll see that the rows are now ordered by province (where province is in alphabetical order, beginning with Alberta).
# Change to descending order
trudeau_2015 <- arrange(trudeau_2015, desc(province))
head(trudeau_2015, 2) # look at first two rows of our dataset
## id province party_raw gender incumbent elected edate
## 1 32059 Yukon Green Party of Canada 0 0 0 2015-10-19
## 2 696 Yukon Liberal 0 0 1 2015-10-19
Now the rows are ordered by province but beginning at the bottom of the alphabetical order with Yukon.
We could instead arrange by Party.
trudeau_2015 <- arrange(trudeau_2015, party_raw)
# Change to descending order
trudeau_2015 <- arrange(trudeau_2015, desc(party_raw))
head(trudeau_2015, 2)
## id province party_raw gender incumbent elected edate
## 1 32571 Ontario United Party of Canada 0 0 0 2015-10-19
## 2 1434 Ontario The Bridge Party of Canada 0 0 0 2015-10-19
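arrange() can also sort by more than one variable at a time: rows are ordered by the first variable, and ties are broken by the second. The sketch below uses a made-up dataframe and assumes dplyr is installed (it comes with tidyverse); the names toy and sorted are just for illustration.

```r
library(dplyr)  # arrange() and desc() come from dplyr

toy <- data.frame(
  province = c("Yukon", "Alberta", "Alberta", "Ontario"),
  votes    = c(500, 1200, 3400, 800),
  stringsAsFactors = FALSE
)

# Sort by province (A to Z), then by votes from highest to lowest within a province
sorted <- arrange(toy, province, desc(votes))
sorted
```

In the result, the two Alberta rows come first, with the 3400-vote row above the 1200-vote row, followed by Ontario and then Yukon.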
What if we don’t like the names of our variables? Are we stuck with them forever? Of course not! We can always change the names of our variables using the rename() function. Let’s change the name of the “party_raw” variable to “party”.
Review point: don’t forget that object names and variable names can’t have spaces or dashes! Instead, use . or _ to make things easier to read.
rename()
trudeau_2015 <- rename(trudeau_2015, party = party_raw)
head(trudeau_2015, 2)
## id province party gender incumbent elected edate
## 1 32571 Ontario United Party of Canada 0 0 0 2015-10-19
## 2 1434 Ontario The Bridge Party of Canada 0 0 0 2015-10-19
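The pattern inside rename() is always new_name = old_name, and several variables can be renamed in one call. Here is a sketch on a made-up dataframe, assuming dplyr is installed (it comes with tidyverse); the new name election_date is only an example.

```r
library(dplyr)  # rename() comes from dplyr

toy <- data.frame(
  id        = 1:2,
  party_raw = c("Liberal", "Independent"),
  edate     = c("2015-10-19", "2015-10-19"),
  stringsAsFactors = FALSE
)

# new_name = old_name; separate several renames with commas
renamed <- rename(toy, party = party_raw, election_date = edate)
names(renamed)   # "id" "party" "election_date"
```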
Using the federal candidates dataset that we have already imported into R during this lesson, I want you to subset the dataframe to include only the following variables: ID variable, election date, candidate names, province, and occupation.
Subset your dataframe to keep only those candidates that participated in the 2011 election. (HINT: edate == "2011-05-02")
Sort your dataframe by province.
Rename the occupation variable to ‘job’.
import()
head()
tail()
names()
str()
class()
select()
filter()
arrange()
rename()
$ (used to select variables from our datasets)
| (used to mean “OR”)
& (used to mean “AND”)