Class 3: Data Importing and Wrangling Part I

Introduction

In this class, we will start looking at datasets in R. We’ll focus on:

How to import data
Variable classes (character, factor, integer, numeric) and how these relate to levels of measurement
How to start “wrangling” data (e.g., include/exclude variables from dataset, filtering dataset to focus on specific observations, arrange rows in your dataset)

Getting Started

Before we get started, remember the steps for opening a new script or markdown document to take notes for this class:

Open RStudio.
Open a new R script or R markdown (it might be more helpful to create a markdown document so that you can easily differentiate notes from code). (e.g. File -> New File -> R Markdown)
Create a new folder where you’ll store today’s script and any associated files.
Return to RStudio and set your working directory to the folder you just created. (e.g. Session -> Set Working Directory -> Choose Directory…)
Save your script or markdown document

Packages For Today’s Class

In this class, we’re only going to need two packages: rio and tidyverse.

The rio package is used to import data into R. It only has one function that we’ll use: import().

Review point: The golden rule of packages is “Install once. Load every time.” In the same way that you only install the Instagram app once on your phone, you only need to install a package once on your computer. BUT every time you open R, you’ll need to “load” any packages you want to use in that session.

The function to install a package is install.packages(). Within the parentheses, you’ll need to provide the specific name of the package. Let’s do this now.

install.packages(rio)

Error! What went wrong?? You can see that the error message says: “object ‘rio’ not found”. Last class, we talked about how R is an “object oriented” system, meaning that we save the output of our code as “objects”, which we can then see in the Global Environment (the top right panel in RStudio). The problem is that the function thinks rio is an object we’ve saved, not the name of a package. So we need to remember to include quotation marks around the name!

install.packages("rio")

You’ll see a lot of messages when installing the package, which are completely normal. Now, let’s load tidyverse and rio. You should have installed tidyverse last week, but if you didn’t, you can un-comment the line of code below to install it before loading the package.

Note: you may have received a message that says “the following rio suggested packages are not installed…”. If you wish, you can use install_formats() to install those. For today, we don’t need to worry about this. If you want to learn more about why you received this message, you can read about it here: https://cran.r-project.org/web/packages/rio/vignettes/rio.html#Import,_Export,_and_Convert_Data_Files

#install.packages("tidyverse")

# if you attended class last week, 
# you should have already installed tidyverse. 
# so there's no need to run the 
# install.packages("tidyverse") code above.

Review point: Remember that we use # to “comment our code”. Whenever we put a # in front of a line of code, it tells R NOT to try to execute whatever comes after the # because these are our “notes to self”.

Great! We’re ready to go. Remember, you only ever need to install a package once on your computer! That being said, packages get updated (just like R), so you may want to reinstall a package in the future to have the most up-to-date version.
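If you can’t remember whether a package is already installed, you can ask R before installing it again. The sketch below is one possible way to do this using base R’s requireNamespace() function. The object names (pkgs, installed, missing_pkgs) are only illustrative, and the install line is left commented out so that nothing gets downloaded unless you choose to run it.

```r
# Packages this class relies on
pkgs <- c("rio", "tidyverse")

# requireNamespace() returns TRUE if a package is installed, FALSE if not
installed <- sapply(pkgs, requireNamespace, quietly = TRUE)

# Keep only the names of the packages that are NOT installed yet
missing_pkgs <- pkgs[!installed]
print(missing_pkgs)

# Un-comment the line below to install only what is missing
# install.packages(missing_pkgs)
```

If missing_pkgs prints as character(0), everything is already installed and you can go straight to loading the packages.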

Now that these packages are installed on our computer, we still need to load the packages in order to use them during this session. We do that using the library() function.

library(rio)
library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.0
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

Great! We’re ready to import real data and start doing some cool things!

Importing Data

Most of the time, the data we’re interested in using/exploring will need to be imported into R. Fortunately, R can flexibly handle data that exists in different formats - Excel (.xlsx, .csv), SPSS (.sav), Stata (.dta), text (.txt), etc. This can be achieved using functions from a number of different packages, but we’ll use the import() function from the rio package. Whenever we import data, we want to save it as an object. Like we’ve discussed before, there are few rules about what you name an object, other than not using spaces and keeping things short. A common practice is to use “df” (i.e. “dataframe”) for dataframe objects.

To import the data into R, we must have a copy of the file in the same file folder as our working directory. Remember, setting the working directory gives R a map to a specific file folder. As long as the data file is there (and we’ve spelled everything correctly!), R will find the file and bring it into the R environment.
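Before importing, it can help to confirm that R is looking in the folder you expect and that the file is really there. This is a minimal sketch using base R functions. The object data_file holds the file name used in the import code later in this lesson; swap in whatever your downloaded file is actually called.

```r
# Where is R currently looking for files?
getwd()

# File name used in this lesson's import code (change it if yours differs)
data_file <- "federal-candidates-2023-subset.dta"

# TRUE if the file is in the working directory, FALSE otherwise
file_found <- file.exists(data_file)
print(file_found)

# List every Stata (.dta) file in the working directory
dta_files <- list.files(pattern = "\\.dta$")
print(dta_files)
```

If file_found is FALSE, check the spelling of the file name and check that your working directory is the folder where you saved the data.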

The data we’ll be using in this class is a dataset on candidate characteristics in Canadian federal elections (1867-2017). The data is freely available online at: https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/ABFNSQ

First, download the data from OWL and save it to your working directory for today’s class.

FOR TODAY’S CLASS I have created a smaller version of this data that includes only about 10,000 observations (it excludes many of the earlier elections). It is available on OWL under Content -> Data and the data file is called “federal-candidates-2023-short.dta”

Second, use the `import()` function from the rio package to import the data into RStudio.

import("federal-candidates-2023-subset.dta") 
# you need to ensure... 
# 1) this data exists in your working directory (the folder we're using for today's lesson)
# 2) the file name is in quotation marks 
# 3) the file name you specify above matches the file name as it exists on your computer

You’ll notice that the dataset contains an “id” variable. This is an identification variable that provides a unique ID number to each observation in our dataset.

Look at your global environment. Do you think the data was saved into the global environment? Why not?

Save Data Into Global Environment

We need to assign our dataset to an object so that it will “live” in the global environment in our RStudio session.

df <- import("federal-candidates-2023-subset.dta") 
# df is commonly used to mean "dataframe"

Looking at a Dataframe

If our dataset isn’t too large, we can click on the name of the data in our global environment to open it in a viewer window in the source window.

Below are some other ways to peek at our data (especially if it is a LARGE file):

head(df) # look at the first 6 rows

head(df, 10) # look at the first ten rows 
#hint: change 10 to any (reasonable) number you'd like

tail(df) # look at the bottom 6 rows in our data

tail(df, 14) # look at the bottom 14 rows in our data
#hint: change 14 to any (reasonable) number you'd like

In the code below, I use base R to look at specific rows/columns, and to look at the values of a specific row(s) and/or column(s). Remember, datasets are organized as rows by columns.

# let's look at the value of the first row, second column
df[1, 2] # first number specifies the row(s), second number specifies the column(s)

## [1] 38

Above, when we specify row 1 and column 2 of the df object, we find that the first observation in our dataset belongs to the 38th parliament.

# let's look at the first ten rows in our dataset
df[1:10, ] # we include nothing after the comma which communicates to R that we want it to return ALL columns

##       id parliament year     candidate_name      edate birth_year country_birth
## 1   4443         38 2004     CÔTÉ, Jean-Guy 2004-06-28         NA              
## 2  24524         38 2004       NELSON, Erin 2004-06-28         NA              
## 3  20062         38 2004        KUNZ, Revel 2004-06-28         NA              
## 4  20861         38 2004   LE BEL, Benjamin 2004-06-28         NA              
## 5  19061         38 2004       KOSSICK, Don 2004-06-28         NA              
## 6   2813         38 2004       BUORS, Chris 2004-06-28         NA              
## 7   4318         38 2004     CORBIERE, Mark 2004-06-28         NA              
## 8   6320         38 2004   ELGERSMA, Steven 2004-06-28         NA              
## 9   3258         38 2004 CARIGNAN, Jean Guy 2004-06-28         NA              
## 10  9532         38 2004    HOFFMAN, Rachel 2004-06-28         NA              
##                           occupation riding_id                        riding
## 1                           composer        NA             MATAPÉDIA--MATANE
## 2                stationary engineer        NA       NORTH OKANAGAN--SHUSWAP
## 3                        b & b owner        NA      BURNABY--NEW WESTMINSTER
## 4      community development officer        NA                 ALFRED-PELLAN
## 5                community developer        NA                    BLACKSTRAP
## 6                locomotive engineer        NA                SAINT BONIFACE
## 7                    youth organizer        NA              KITCHENER CENTRE
## 8  farmer/horticulturalist (retired)        NA            HALDIMAND--NORFOLK
## 9                    parliamentarian        NA           LOUIS-SAINT-LAURENT
## 10                  graphic designer        NA  NOTRE-DAME-DE-GRÂCE--LACHINE
##            province votes percent_votes                          party_raw
## 1            Quebec  1581     4.9922638               New Democratic Party
## 2  British Columbia  2333     4.5069060              Green Party of Canada
## 3  British Columbia  1606     3.8515036              Green Party of Canada
## 4            Quebec  1849     3.4669616               New Democratic Party
## 5      Saskatchewan  8862    23.5503578               New Democratic Party
## 6          Manitoba   317     0.8213286                    Marijuana Party
## 7           Ontario   277     0.6139184                        Independent
## 8           Ontario   617     1.2394536 Christian Heritage Party of Canada
## 9            Quebec   563     1.2548478                        Independent
## 10           Quebec    88     0.1987667             Marxist-Leninist Party
##     party_minor_group party_major_group           gov_party_raw gov_minor_group
## 1                 NDP           CCF_NDP Liberal Party of Canada         Liberal
## 2               Green       Third_Party Liberal Party of Canada         Liberal
## 3               Green       Third_Party Liberal Party of Canada         Liberal
## 4                 NDP           CCF_NDP Liberal Party of Canada         Liberal
## 5                 NDP           CCF_NDP Liberal Party of Canada         Liberal
## 6           Marijuana       Third_Party Liberal Party of Canada         Liberal
## 7         Independent       Independent Liberal Party of Canada         Liberal
## 8  Christian_Heritage       Third_Party Liberal Party of Canada         Liberal
## 9         Independent       Independent Liberal Party of Canada         Liberal
## 10    Marxist_Lennist       Third_Party Liberal Party of Canada         Liberal
##    gov_major_group num_candidates type_elxn elected incumbent gender lgbtq2_out
## 1          Liberal              5         1       0         0      0         NA
## 2          Liberal              8         1       0         0      0         NA
## 3          Liberal              6         1       0         0      1         NA
## 4          Liberal              7         1       0         0      0         NA
## 5          Liberal              5         1       0         0      0         NA
## 6          Liberal              7         1       0         0      0         NA
## 7          Liberal              5         1       0         0      0         NA
## 8          Liberal              5         1       0         0      0         NA
## 9          Liberal              8         1       0         1      0         NA
## 10         Liberal              8         1       0         0      1         NA
##    indigenousorigins lawyer censuscategory acclaimed switcher
## 1                  0      0              5         0        0
## 2                  0      0              2         0        0
## 3                  0      0             NA         0        0
## 4                  0      0              4         0        0
## 5                  0      0              4         0        0
## 6                  0      0              2         0        0
## 7                  0      0              4         0        0
## 8                  0      0             NA         0        0
## 9                  0      0             10         0        0
## 10                 0      0              5         0        0
##    multiple_candidacy
## 1                   0
## 2                   0
## 3                   0
## 4                   0
## 5                   0
## 6                   0
## 7                   0
## 8                   0
## 9                   0
## 10                  0

df[ , 1:4] # here we only want to look at the first four columns 
# I did not print the results below (because it's a LOT, but you can try this on your laptop)
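If the row-by-column idea still feels abstract, it can help to practise on something tiny. The sketch below builds a small made-up dataframe (toy is a hypothetical object, not part of the class data) and uses the same square-bracket indexing as above.

```r
# A tiny made-up dataframe: 3 rows (observations) and 3 columns (variables)
toy <- data.frame(
  id = c(101, 102, 103),
  province = c("Quebec", "Ontario", "Alberta"),
  votes = c(1581, 2333, 1606),
  stringsAsFactors = FALSE
)

first_cell <- toy[1, 2]    # row 1, column 2
first_rows <- toy[1:2, ]   # rows 1 and 2, ALL columns
first_cols <- toy[, 1:2]   # ALL rows, columns 1 and 2

print(first_cell)
print(first_rows)
print(first_cols)
```

The rule is always the same: whatever comes before the comma picks rows, and whatever comes after the comma picks columns.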

Great work so far.

We can see all the variable names with the names() function:

names(df)

##  [1] "id"                 "parliament"         "year"              
##  [4] "candidate_name"     "edate"              "birth_year"        
##  [7] "country_birth"      "occupation"         "riding_id"         
## [10] "riding"             "province"           "votes"             
## [13] "percent_votes"      "party_raw"          "party_minor_group" 
## [16] "party_major_group"  "gov_party_raw"      "gov_minor_group"   
## [19] "gov_major_group"    "num_candidates"     "type_elxn"         
## [22] "elected"            "incumbent"          "gender"            
## [25] "lgbtq2_out"         "indigenousorigins"  "lawyer"            
## [28] "censuscategory"     "acclaimed"          "switcher"          
## [31] "multiple_candidacy"

We can see the structure of the dataframe with the str() function:

str(df)

## 'data.frame':    10001 obs. of  31 variables:
##  $ id                : num  4443 24524 20062 20861 19061 ...
##   ..- attr(*, "format.stata")= chr "%10.0g"
##  $ parliament        : num  38 38 38 38 38 38 38 38 38 38 ...
##   ..- attr(*, "format.stata")= chr "%10.0g"
##  $ year              : num  2004 2004 2004 2004 2004 ...
##   ..- attr(*, "format.stata")= chr "%10.0g"
##  $ candidate_name    : chr  "CÔTÉ, Jean-Guy" "NELSON, Erin" "KUNZ, Revel" "LE BEL, Benjamin" ...
##   ..- attr(*, "format.stata")= chr "%-9s"
##  $ edate             : chr  "2004-06-28" "2004-06-28" "2004-06-28" "2004-06-28" ...
##   ..- attr(*, "format.stata")= chr "%-9s"
##  $ birth_year        : num  NA NA NA NA NA NA NA NA NA NA ...
##   ..- attr(*, "format.stata")= chr "%10.0g"
##  $ country_birth     : chr  "" "" "" "" ...
##   ..- attr(*, "format.stata")= chr "%-9s"
##  $ occupation        : chr  "composer" "stationary engineer" "b & b owner" "community development officer" ...
##   ..- attr(*, "format.stata")= chr "%-9s"
##  $ riding_id         : num  NA NA NA NA NA NA NA NA NA NA ...
##   ..- attr(*, "format.stata")= chr "%10.0g"
##  $ riding            : chr  "MATAPÉDIA--MATANE" "NORTH OKANAGAN--SHUSWAP" "BURNABY--NEW WESTMINSTER" "ALFRED-PELLAN" ...
##   ..- attr(*, "format.stata")= chr "%-9s"
##  $ province          : chr  "Quebec" "British Columbia" "British Columbia" "Quebec" ...
##   ..- attr(*, "format.stata")= chr "%-9s"
##  $ votes             : num  1581 2333 1606 1849 8862 ...
##   ..- attr(*, "format.stata")= chr "%10.0g"
##  $ percent_votes     : num  4.99 4.51 3.85 3.47 23.55 ...
##   ..- attr(*, "format.stata")= chr "%10.0g"
##  $ party_raw         : chr  "New Democratic Party" "Green Party of Canada" "Green Party of Canada" "New Democratic Party" ...
##   ..- attr(*, "format.stata")= chr "%-9s"
##  $ party_minor_group : chr  "NDP" "Green" "Green" "NDP" ...
##   ..- attr(*, "format.stata")= chr "%-9s"
##  $ party_major_group : chr  "CCF_NDP" "Third_Party" "Third_Party" "CCF_NDP" ...
##   ..- attr(*, "format.stata")= chr "%-9s"
##  $ gov_party_raw     : chr  "Liberal Party of Canada" "Liberal Party of Canada" "Liberal Party of Canada" "Liberal Party of Canada" ...
##   ..- attr(*, "format.stata")= chr "%-9s"
##  $ gov_minor_group   : chr  "Liberal" "Liberal" "Liberal" "Liberal" ...
##   ..- attr(*, "format.stata")= chr "%-9s"
##  $ gov_major_group   : chr  "Liberal" "Liberal" "Liberal" "Liberal" ...
##   ..- attr(*, "format.stata")= chr "%-9s"
##  $ num_candidates    : num  5 8 6 7 5 7 5 5 8 8 ...
##   ..- attr(*, "format.stata")= chr "%10.0g"
##  $ type_elxn         : num  1 1 1 1 1 1 1 1 1 1 ...
##   ..- attr(*, "format.stata")= chr "%10.0g"
##  $ elected           : num  0 0 0 0 0 0 0 0 0 0 ...
##   ..- attr(*, "format.stata")= chr "%10.0g"
##  $ incumbent         : num  0 0 0 0 0 0 0 0 1 0 ...
##   ..- attr(*, "format.stata")= chr "%10.0g"
##  $ gender            : num  0 0 1 0 0 0 0 0 0 1 ...
##   ..- attr(*, "format.stata")= chr "%10.0g"
##  $ lgbtq2_out        : num  NA NA NA NA NA NA NA NA NA NA ...
##   ..- attr(*, "format.stata")= chr "%10.0g"
##  $ indigenousorigins : num  0 0 0 0 0 0 0 0 0 0 ...
##   ..- attr(*, "format.stata")= chr "%10.0g"
##  $ lawyer            : num  0 0 0 0 0 0 0 0 0 0 ...
##   ..- attr(*, "format.stata")= chr "%10.0g"
##  $ censuscategory    : num  5 2 NA 4 4 2 4 NA 10 5 ...
##   ..- attr(*, "format.stata")= chr "%10.0g"
##  $ acclaimed         : num  0 0 0 0 0 0 0 0 0 0 ...
##   ..- attr(*, "format.stata")= chr "%10.0g"
##  $ switcher          : num  0 0 0 0 0 0 0 0 0 0 ...
##   ..- attr(*, "format.stata")= chr "%10.0g"
##  $ multiple_candidacy: num  0 0 0 0 0 0 0 0 0 0 ...
##   ..- attr(*, "format.stata")= chr "%10.0g"

Variable Classes

In order for R to perform the appropriate commands on variables, it needs to understand how variables are measured. The Haan & Godley (2017) reading for today discussed different levels of measurement. The level of measurement can differ across variables. In R language, the level of measurement is the variable class. There are four types you’ll commonly see: “character”, “factor”, “numeric”, “integer”.

Figure 1 shows the relationship between these ideas. Broadly, we can think about two different levels of measurement of our variables: categorical and continuous. Categorical variables are nominal or ordinal. In political science, we’re interested in all sorts of categorical variables - political parties, type of conflict, type of institution or regime, etc. An example of an ordinal variable would be a government’s degree of respect for human rights (e.g. “none”, “some”, “full”).

Continuous variables are formally considered to be interval or ratio. For the purposes of this course, we’re not going to belabour the difference. The key idea is that continuous variables are numbers with a meaningful/exact space between the levels. For instance, if someone asked you your age, the difference between 20 and 21 years and 45 and 46 years is the same. The distance or space between the levels in our ordinal example (respect for human rights) is not necessarily the same. Conceptually, going from “none” to “some” could be a lot different than going from “some” to “full”.

In R, the translation of the level of measurement occurs using the “class” (or type) of variable. Categorical variables can be coded as character or factor variables. Continuous variables can be coded as numeric or integer variables.

There are two reasons why this matters. First, the way variables are coded when we first acquire the data can be changed. Maybe we want to transform an income variable, measured as continuous, into a categorical variable (low, middle, high income), because that’s more appropriate for our research question. Or maybe we want to change a categorical variable from 4 categories into 2.

Second, when visualizing the data, there are times when we want to change the class of a variable strictly for visualization purposes. As we’ll see in future classes, the way R produces colours for continuous variables may not be what we want. This is especially true for things like “time in years”. Even though the “years” variable is technically/conceptually continuous, for plotting purposes, we may need to convert it to be a categorical variable.
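To make these two reasons concrete, here is a minimal sketch using made-up numbers. The income cut-offs (30,000 and 80,000) and the object names are assumptions chosen purely for illustration. cut() is a base R function that sorts a continuous variable into categories, and factor() converts a variable to the factor class.

```r
# Reason 1: turn a continuous variable into categories
income <- c(18000, 42000, 67000, 120000, 35000)
income_group <- cut(
  income,
  breaks = c(0, 30000, 80000, Inf),      # made-up cut-offs
  labels = c("low", "middle", "high")
)
print(income_group)

# Reason 2: treat years as categories (e.g. for plotting)
year <- c(2004, 2011, 2015)
year_f <- factor(year)
class(year)     # numeric
class(year_f)   # factor
```

In the first part, every income above 0 and up to 30,000 is labelled “low”, anything above 30,000 and up to 80,000 is “middle”, and everything higher is “high”.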

The character and factor classes are used for categorical variables. For example, sex, country, political party, profession, university major, etc. Sometimes, whether a categorical variable is of class character or factor will make no difference for our purposes. Sometimes it will. This will typically depend on the task we are trying to accomplish and the way a function has been designed. We’ll talk about this more next class.

For our purposes, we can treat the numeric and integer classes as the same - these are variables that are measured as continuous (ratio or interval level). Examples include age, GDP, vote percentage, etc. Technically, integers are meant for whole numbers (no decimals), and numeric are meant for numbers that include decimals. Don’t worry about this distinction too much.

Figure 1: Level of Measurement and Variable Class

When we used the str() function, it showed us the class of every variable in our dataframe. Another way to determine the class of a single variable is to use the class() function. Let’s check the class of a few of the variables in the df object.

# Check class of province variable
class(df$province)

# Check class of gender variable
class(df$gender)

# Check class of votes variable.
class(df$votes)
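You can also see all four classes side by side without any dataset at all. The sketch below uses small made-up vectors (party and votes here are stand-alone objects, not the columns in df) and two base R conversion functions, factor() and as.integer().

```r
# A character vector (categorical)
party <- c("Liberal", "NDP", "Green", "Liberal")
class(party)

# The same information stored as a factor (categorical, with levels)
party_f <- factor(party)
class(party_f)
levels(party_f)   # the unique categories R has identified

# A numeric vector (continuous)
votes <- c(1581, 2333, 1606)
class(votes)

# The same whole numbers stored as integers
votes_i <- as.integer(votes)
class(votes_i)
```

Notice that the factor version keeps track of the three distinct categories as “levels”, which the character version does not.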

Include/Exclude Variables

As you can see, there are 31 different variables in the dataset. We may not actually care about all of them, so rather than work with the full dataset, let’s make a new dataframe object with only the variables we care about. We can do this using the select() function. The basic idea is that we’ll tell the function which variables to either include or to exclude. Let’s make a new dataframe, called “df1”, with the following variables: id, province, party_raw, gender, incumbent, elected, and edate.

`select()`

# Option 1: Include specific variables
df1 <- select(df, c(id, province, party_raw, gender, incumbent, elected, edate))


# Option 2: Exclude specific variables
# I'm making a new dataframe object called df2 and excluding the "occupation" variable only 
# this is much more efficient than typing out the names of the other 30 variables I want to KEEP
df2 <- select(df, -c(occupation))
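Here is the same include/exclude logic on a tiny made-up dataframe, so you can check the result by eye. The toy object and its values are hypothetical; select() comes from the dplyr package, which is loaded as part of tidyverse.

```r
library(dplyr)

# A tiny made-up dataframe with four variables
toy <- data.frame(
  id = 1:3,
  province = c("Quebec", "Ontario", "Alberta"),
  occupation = c("composer", "farmer", "lawyer"),
  votes = c(1581, 2333, 1606),
  stringsAsFactors = FALSE
)

# Include: keep only id and province
keep_two <- select(toy, c(id, province))
names(keep_two)

# Exclude: drop occupation, keep everything else
drop_one <- select(toy, -c(occupation))
names(drop_one)

# Exclude several variables at once
drop_two <- select(toy, -c(occupation, votes))
names(drop_two)
```

Checking names() after each step is a quick way to confirm that you kept (or dropped) what you intended.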

Include/Exclude Rows Based on Values of Variable(s)

So we’ve created a new dataframe (df1) with only 7 variables. Let’s narrow things down even more. I only want to see data for the 2015 election (“2015-10-19”). We’ll make another new dataframe object called trudeau_2015 and use the filter() function to include only the observations (rows) which have that value for the edate variable.

`filter()`

# Option 1: Include specific variable values
trudeau_2015 <- filter(df1, edate == "2015-10-19")

# Option 2: Exclude specific variable values
# I'm making a new dataframe object called no_2015 and excluding the "2015-10-19" variable value
no_2015 <- filter(df1, edate != "2015-10-19")

Try opening these two new datasets by clicking on them in the Global Environment pane.

Maybe we want to see two elections at once (“2015-10-19” and “2019-10-21”). We can use the & (AND) and | (OR) operators to combine multiple conditions, even conditions across multiple variables (e.g. Male AND Liberal; Quebec OR Elected).

# Election 1 OR Election 2 (satisfy either condition)
trudeau3 <- filter(df1, edate == "2015-10-19" | edate == "2019-10-21")

# British Columbia AND Female (must satisfy both conditions) 
BC_women <- filter(df1, province == "British Columbia" & gender == "F") #oops this didn't work! 
# Why didn't this work? 


BC_women <- filter(df1, province == "British Columbia" & gender == 1)
# a good codebook would tell us how gender/sex is coded 
# the candidate names offers a clue 
# female = 1
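When a filter unexpectedly gives you an empty dataframe, a good first step is to check which values the variable actually contains. The sketch below reproduces the situation on a small made-up dataframe (toy and its values are hypothetical) using the base R functions unique() and table().

```r
library(dplyr)

# A tiny made-up dataframe where gender is coded with numbers
toy <- data.frame(
  province = c("British Columbia", "British Columbia", "Quebec"),
  gender = c(0, 1, 0),
  stringsAsFactors = FALSE
)

# Which values does gender actually take?
unique(toy$gender)
table(toy$gender)   # how many observations have each value

# "F" never appears in the data, so no rows satisfy the condition
no_match <- filter(toy, gender == "F")
nrow(no_match)

# Using the value that is really in the data keeps the matching row
coded_match <- filter(toy, gender == 1)
nrow(coded_match)
```

Notice that R does not give an error in the first case. It simply returns a dataframe with zero rows, which is why it is worth checking the number of rows after filtering.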

We could filter to keep observations from several provinces:

df_new <- filter(df1, province %in% c("British Columbia", "Alberta", "Manitoba"))

We use %in% to check each observation’s value against a set of values: here, a row is kept if its province matches any of the provinces we list inside c().
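To see how each of the filtering operators in this section behaves, you can try them on a small made-up dataframe where it is easy to count the rows yourself. The toy object and its values are hypothetical and are not taken from the class data.

```r
library(dplyr)

# A tiny made-up dataframe with five observations
toy <- data.frame(
  province = c("Quebec", "British Columbia", "Alberta",
               "British Columbia", "Manitoba"),
  gender = c(0, 1, 1, 0, 1),
  edate = c("2015-10-19", "2015-10-19", "2019-10-21",
            "2011-05-02", "2015-10-19"),
  stringsAsFactors = FALSE
)

# == keeps rows that match; != keeps rows that do NOT match
only_2015 <- filter(toy, edate == "2015-10-19")
not_2015 <- filter(toy, edate != "2015-10-19")

# | (OR): a row is kept if it satisfies either condition
either <- filter(toy, edate == "2015-10-19" | edate == "2019-10-21")

# & (AND): a row is kept only if it satisfies both conditions
bc_women <- filter(toy, province == "British Columbia" & gender == 1)

# %in%: a row is kept if province matches any value in the set
west <- filter(toy, province %in% c("British Columbia", "Alberta", "Manitoba"))

# Count the rows kept by each filter
nrow(only_2015)
nrow(not_2015)
nrow(either)
nrow(bc_women)
nrow(west)
```

Together, only_2015 and not_2015 contain all five rows of toy, because every observation either matches the date or does not.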

Sort the Dataframe Rows

Focusing on our new “trudeau_2015” dataframe, we can look at the dataframe by clicking on the object in our Global Environment. To make it easier to view, perhaps we want to organize the rows by province. We can do this using the arrange() function.

`arrange()`

trudeau_2015 <- arrange(trudeau_2015, province)
head(trudeau_2015, 2) # look at the first 2 rows of our dataset

##      id province   party_raw gender incumbent elected      edate
## 1 32601  Alberta     Liberal      0         0       0 2015-10-19
## 2 33399  Alberta Independent      0         0       0 2015-10-19

You’ll see that the rows are now ordered by province (where province is in alphabetical order, beginning with Alberta).

# Change to descending order
trudeau_2015 <- arrange(trudeau_2015, desc(province))
head(trudeau_2015, 2) # look at first two rows of our dataset

##      id province             party_raw gender incumbent elected      edate
## 1 32059    Yukon Green Party of Canada      0         0       0 2015-10-19
## 2   696    Yukon               Liberal      0         0       1 2015-10-19

Now the rows are ordered by province but beginning at the bottom of the alphabetical order with Yukon.

We could instead arrange by Party.

trudeau_2015 <- arrange(trudeau_2015, party_raw)

# Change to descending order
trudeau_2015 <- arrange(trudeau_2015, desc(party_raw))


head(trudeau_2015, 2)

##      id province                  party_raw gender incumbent elected      edate
## 1 32571  Ontario     United Party of Canada      0         0       0 2015-10-19
## 2  1434  Ontario The Bridge Party of Canada      0         0       0 2015-10-19
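arrange() can also sort by more than one variable at a time, which is handy when the first variable has ties. The sketch below shows this on a small made-up dataframe (toy and its values are hypothetical); later variables are only used to break ties in the earlier ones.

```r
library(dplyr)

# A tiny made-up dataframe with two candidates from Alberta
toy <- data.frame(
  province = c("Quebec", "Alberta", "Yukon", "Alberta"),
  votes = c(1581, 900, 317, 2333),
  stringsAsFactors = FALSE
)

# Ascending (alphabetical) order by province
by_province <- arrange(toy, province)
print(by_province)

# Descending order by province
by_province_desc <- arrange(toy, desc(province))
print(by_province_desc)

# Sort by province first, then by votes (highest first) within each province
by_both <- arrange(toy, province, desc(votes))
print(by_both)
```

In by_both, the two Alberta rows come first, and within Alberta the candidate with more votes is listed first.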

Rename Variables

What if we don’t like the names of our variables? Are we stuck with them forever? Of course not! We can always change the names of our variables using the rename() function. Let’s change the name of the “party_raw” variable to “party”. Inside rename(), the new name goes on the left of the = and the old name goes on the right.

Review point: don’t forget that object names and variable names can’t have spaces or dashes! Instead, use . or _ to make things easier to read.

`rename()`

trudeau_2015 <- rename(trudeau_2015, party = party_raw)

head(trudeau_2015, 2)

##      id province                      party gender incumbent elected      edate
## 1 32571  Ontario     United Party of Canada      0         0       0 2015-10-19
## 2  1434  Ontario The Bridge Party of Canada      0         0       0 2015-10-19
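You can rename several variables in one call by separating each new = old pair with a comma. Here is a minimal sketch on a made-up dataframe; the toy object and the new name election_date are just illustrations.

```r
library(dplyr)

# A tiny made-up dataframe
toy <- data.frame(
  party_raw = c("Liberal", "Independent"),
  edate = c("2015-10-19", "2015-10-19"),
  stringsAsFactors = FALSE
)

# New name on the left, old name on the right
toy_renamed <- rename(toy, party = party_raw, election_date = edate)

names(toy)           # the original names are unchanged in toy
names(toy_renamed)   # the new names appear in toy_renamed
```

Because we saved the result under a new object name, the original toy dataframe keeps its old variable names.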

Exercise:

Using the federal candidates dataset that we have already imported into R during this lesson, I want you to subset the dataframe to include only the following variables: ID variable, election date, candidate name, province, and occupation.
Subset your dataframe to keep only those candidates that participated in the 2011 election. (HINT: edate == "2011-05-02")
Sort your dataframe by province.
Rename the occupation variable to ‘job’.

Wrap up

Important functions that we learned today:

import()
head()
tail()
names()
str()
class()
select()
filter()
arrange()
rename()

Important operators in R that we used today:

$ (used to select variables from our datasets)
| (used to mean “OR”)
& (used to mean “AND”)
== (used to mean “is equal to”)
!= (used to mean “is not equal to”)
%in% (used to check whether a value matches any value in a set)

POL3325G Data Science for Politics (January 21, 2025)

Shanaya Vanhooren