Introduction to R and Rstudio

Yunkyu Sohn (Slides by Ethan Fosse)
Feburuary 15, 2017

Research Associate, Department of Politics

This work is licensed under a Creative Commons Attribution 3.0 Unported License.

  • Free, open-source statistical programming and data analysis workshops using R and RStudio
  • Open to everyone with a Princeton ID
  • No programming experience is necessary or expected
  • Attendees should bring a laptop computer to fully participate in the workshops
  • Workshop contents are motivated by research questions in social sciences

Spring 2017 Schedule

Date Topic
February 15 Introduction to R and RStudio
February 22 Data Wrangling in R
TBD Base R Graphics
TBD Data Visualization in R with ggplot2
TBD Programming Loops in R
TBD Probability and Simulations in R
TBD Monte Carlo Simulations in R
TBD Text Analysis in R
TBD Hypothesis Testing in R
TBD Regression Analysis in R
TBD Social Network Analysis in R

Connect with Us:

  • Visit our website
  • Join our mailing list

Our Website

Our Mailing List

Send an email to listserv@lists.princeton.edu with “Subscribe COMPASSWORKSHOPS” in the body and all other lines blank, *including the subject*.

People

  • Teaching Staff

    • Ethan Fosse (Research Associate, Department of Sociology)
    • Yunkyu Sohn (Research Associate, Department of Politics)
  • Faculty Sponsors

Workshop Preliminaries

  1. Workshop Requirements
  2. Installing R and RStudio
  3. Our Research Question
  4. Roadmap for the Workshop

1. Workshop Requirements

Before You Begin

  1. You have access to a laptop computer
  2. You have access to reliable Internet service
  3. You want to learn more about R and RStudio!
  4. You have downloaded the R Workshop files:
    • MyFirstScript.R
    • States.RData
    • StatesHealth.dta
    • MyFirstMarkdown.Rmd

Go to: https://github.com/compass-workshops/IntroRWorkshop/blob/master/Data.zip

2. Installing R and RStudio

R and RStudio

  • R is free, open-source, and very popular!
  • How are R and RStudio related?
    • R is the underlying programming language we use, while Rstudio provide a graphical front-end that has a number of useful features, including syntax highlighting and windowing
    • For this course we use RStudio with R, but there are other graphical front-ends that some people use with R

How to Install R and RStudio

Opening RStudio

  • When you first open RStudio, you will be greeted by three panels:
  1. The interactive R console (entire left)
  2. Environment/History (tabbed in upper right)
  3. Files/Plots/Packages/Help/Viewer (tabbed in lower right)
  • Once you open files, such as R scripts, an editor panel will also open in the top left

3. Our Research Question

The 2008 Presidential Election

Obama and McCain

The Electoral College Map

Obama and McCain Map

Examining the Election of 2008

  • Throughout this workshop we will consider the following question:
    • Why did some states go for Obama and others for McCain?
  • We'll explore this question using real data on 50 states
  • In the process of answering this question we'll learn how to use R and RStudio for data analysis!

4. Roadmap for the Workshop

What We'll Learn Today

  • Part 1: Exploring a Data Set
  • Part 2: Working with Variables
  • Part 3: Summarizing and Visualizing Data
  • Part 4: Finding and Analyzing Subgroups
  • Part 5: Importing, Saving, and Exporting Data
  • Addendum: Creating Dynamic Reports

Part 1: Exploring a Data Set

Mission #1: Swing States in 2008

Ohio Map

  • Did Obama or McCain win Ohio in 2008?

  • We'll use data to answer this question!

Loading Data into R

  • Let's load States.RData into our workspace
  • Using RStudio's user-friendly interface:

    1. File —> Open File
    2. Navigate to the location where you downloaded the files for this workshop (for example: C:/Folder/)
  • You can also try this R Code:

    load(C:/Folder/States.RData)
    

Functions in R

  • The command load() is a function
  • We will use functions a lot!
  • A function takes an input (raw meat), does something to it (grinds it up), and gives an output (ground meat)
  • You can often guess what a function does by its name

Meat Grinder

Data Sets and Variables in R

  • R-speak:
    • Data set is called a data frame
    • Text variables are called character vectors
    • Numerical variables are numeric vectors
    • Categorical variables are factors and their categories are called levels
  • These are all called objects and appear in the Environment tab
  • We can use the R function class() to see what we have!
  • Try this R Code:
class(States)

Viewing the Data as a Spreadsheet

  • Let's look at our data in more depth!
  • Let's use the function View()
  • Try this R Code:
View(States)
  • Ask yourself:
    • What are the observational units of the data set?
    • What variables look interesting or unusual to you?

Another Way to Look at a Data Set

  • We can also view a data set in the R Console directly
  • R functions: print(), head(), tail()

    • Try this R Code:
print(States)
head(States)
tail(States)
  • Ask yourself:
    • What is the problem with the print() function?
    • What is the difference between head() and tail()?

Examining the Structure of a Data Set

  • We can also examine the structure of a data set in a more compact way
  • Let's use the function str()
  • Try this R Code:
str(States)
  • Ask yourself:
    • How many observational units are there?
    • How many variables are there?
    • Which variables are categorical? Numerical?

Exploring the Dimensions of a Data Set

  • The dimensions of a data set are the number of rows and number of columns
  • R functions: dim(), nrow(), ncol()
  • Try this R Code:
dim(States)
nrow(States)
ncol(States)
  • Ask yourself:
    • What are these different functions doing?

Exploring Names in a Data Set

  • Besides the dimensions, it's often helpful to know the names of the observational units and variables
  • R functions: rownames() and colnames()
  • Try this R Code:
rownames(States)
colnames(States)
  • Ask yourself:
    • Can you find Ohio in the set of row names?

Challenge #1: Swing States in 2008

  • Did Obama or McCain win Ohio in 2008?
  • R Code Hint:
View(States)
  • Follow-up: Who won your home state?
    • Note: If you're not from the United States, then feel free to pick New Jersey!

Check-In #1: Exploring a Data Set

  • At this point you should have:
    • Loaded a data set into R
    • Viewed a data set as a spreadsheet
    • Explored the types of variables in a data set
    • Examined the dimensions of a data set

Part 2: Working with Variables

Mission #2: The Margin of Victory

  • The 1936 Election was a landslide (46 versus 2 states won):

  • Was 2008 a landslide? How many states did Obama win versus McCain?

Summarizing the Data Set

  • Summarizing a data set using summary()

  • Try this R Code:

summary(States)
  • Ask yourself:
    • How many states are in the Northeast (NE)?
    • What is the average household income across states?
    • What is a drawback of using this function on a data set?

Grabbing and Summarizing a Single Variable

  • We can grab a variable using the dollar sign ($) notation:
  • General format: Dataset$Variable
  • Try this R Code:
States$HouseholdIncome
States$Region
  • We can use the summary() function on single variables!
  • Try this R Code:
summary(States$HouseholdIncome)
summary(States$Region)

Using Functions on Single Variables

  • We can use other functions we learned before as well, including class() and str()
  • Try this R Code:
class(States$HouseholdIncome)
class(States$Region)

str(States$HouseholdIncome)
str(States$Region)
  • Ask yourself:
    • Are these variables categorical or numerical?
    • What are the number of observational units for each variable?

Copying a Data Set

  • We can use the assignment operator (<-) to copy a data set
  • General format: NewDataset <- Dataset
  • Try this R Code:
StatesCopy <- States
str(StatesCopy)
class(StatesCopy)
  • Ask yourself:
    • Is StatesCopy the same as States?

Creating a Standalone Variable

  • We can also use “<-” to create a standalone variable
  • General format: NewVariable <- Dataset$Variable
  • Try this R Code:
HouseholdIncome <- States$HouseholdIncome
str(HouseholdIncome)
class(HouseholdIncome)
  • Ask yourself:
    • Is HouseholdIncome the same as States$HouseholdIncome?

Cleaning Up the Workspace

  • Look at the Environment tab to see all the variables and data sets in the R workspace
  • We can also show them in the R console: ls()
  • How to remove a variable or data set: rm()
  • Try this R Code:
ls()
Population <- States$Population
summary(Population)
rm(Population)

Challenge #2: The Margin of Victory

  • Was 2008 a landslide? How many states did Obama win versus McCain?

  • R Code Hint:

Winner <- States$ObamaMcCain
summary(Winner)

Or you can just do:

summary(States$ObamaMcCain)
  • Follow-up: Can you delete the variable Winner?

Check-In #2: Working with Variables

  • At this point you should have:
    • Summarized a data set
    • Copied a data set
    • Created standalone variables
    • Summarized and deleted standalone variables

Part 3: Summarizing and Visualizing Data

Mission #3: Beyond Winning and Losing

  • An election is more than just winning or losing
  • Let's look at some other questions:
  1. Across all states, what is the average percentage who voted for Obama?
  2. How many states in the Northeast voted for Obama rather than McCain?
  • Answering these questions requires using more R functions
  • This can get laborious in the R console!

Solution: Using an R Script

  • In RStudio, go to File -> Open File and navigate to the location you downloaded MyFirstScript.R
  • Right now your R script is empty!
  • Type the following into the document:
nrow(States)
ncol(States)
  • Highlight the lines of R code and click on Run or press Ctrl + Enter
  • Cool, right?

Commenting Your R Script

  • When conducting data analysis, R scripts can be really, really long
  • So you and others can understand what you've written, add comments using the # symbol
  • Add comments to your R script as follows:
nrow(States) # number of rows
ncol(States) # number of columns

Saving Your R Script

  • In Rstudio you can save your R script by clicking the Save icon or by pressing Ctrl + S
  • Now you can open it up and run your code anytime!
  • In RStudio you can also create new R scripts by going to: File -> New File -> R Script

Good Style Conventions

  • Just like writing prose or poetry, writing R code has certain conventions
  • A few tips:
    • Always add a space after a comma
    • Add detailed comments whenever possible
    • Name objects (or variables) and files as clearly as possible
  • For more see Google's R style guide
  • Now let's analyze some data!

Measures of Central Tendency for a Numerical Variable

  • Measures of central tendency
    • Mean: mean()
    • Median (or 50th percentile): median()
  • Try this R Code:
mean(States$HouseholdIncome)
median(States$HouseholdIncome)
  • Ask yourself:
    • Why is the mean different from the median?

Measures of Spread for a Numerical Variable

  • Measures of spread (or variablity)
  • Range: range()
    • Standard deviation: sd()
    • Inter-quartile range: IQR()
  • Try this R Code:
range(States$HouseholdIncome)
sd(States$HouseholdIncome)
IQR(States$HouseholdIncome)

Visualizing a Single Numerical Variable

  • We can create a histogram with hist()
  • Optional inputs (or arguments) in an R function are separated by commas
  • We can control the number of bins with the option breaks
  • Try this R Code:
hist(States$HouseholdIncome)
hist(States$HouseholdIncome, breaks=3)
hist(States$HouseholdIncome, breaks=15)

Visualizing Two Numerical Variables

  • We can create a scatter plot with plot()
  • We have two inputs (or arguments), first for the x-variable and second for the y-variable
  • Try this R Code:
plot(States$College, States$ObamaVote)
  • Ask yourself:
    • What is the relationship between percent college graduates and voting for Obama across states?
    • What's the problem with this graph?

Visualizing Two Numerical Variables Redux

  • If we include the command text() immediately after we use plot() we can add labels to the scatter plot
  • The option labels is used to specify the variable with the labels
  • Try this R Code:
plot(States$College, States$ObamaVote)
text(States$College, States$ObamaVote, labels=States$State)
  • Ask yourself:
    • Which states are highly educated and voted for Obama?
    • Can you find your home state (or New Jersey)?

Table of Counts for a Categorical Variable

  • In R-speak, categorical variables are called factors and the different categories they have are levels
  • R function for examining levels: levels()
  • To create a table of counts: table()

  • Try this R Code:

levels(States$Region)
table(States$Region)
  • Ask yourself:
    • How many states are in the West (W)?

Table of Counts for Two Categorical Variables

  • To create a table of counts for two variables we still use table()
  • Now we just specify two variables as inputs
  • Try this R Code:
table(States$Region,  States$ObamaMcCain)
  • Ask yourself:
    • How many states are in the West (W) and voted for McCain?

Visualizing a Single Categorical Variable

  • We can also create a bar plot of counts with plot()
  • The height of the bars equals the number of observational units in each category (or level)

  • Try this R Code:

    plot(States$ObamaMcCain)
    

Challenge #3: Beyond Winning and Losing

  1. Across all states, what is the average percentage who voted for Obama?
  2. R Code Hint:
mean(States$ObamaVote)
  1. How many states in the Northeast voted for Obama rather than McCain?
  2. R Code Hint:
table(States$Region, States$ObamaMcCain)

Check-In #3: Summarizing and Visualizing Data

  • At this point you should have:
    • Summarized a variable based on its central tendency and spread
    • Created a histogram and scatter plot for numerical variables
    • Generated tables for categorical variables
    • Created a bar graph for a categorical variable

Part 4: Finding and Analyzing Subgroups

Mission #4: What About the South?

  • The U.S. South is a distinctive cultural and economic region:

  • What percentage voted for Obama in Mississippi compared to Massachusetts?

Subsetting Data

  • The rows and columns are indexed by position and by name
  • We can subset the data set using single bracket notation: [ , ]
  • General format:

    • Subsetting to a row: data[row, ]
    • Subsetting to a column: data[ , column]
    • Subsetting to a row and a column: data[row, column]
  • Ask yourself:

    • How many rows and columns are there in States.RData?

Subsetting a Row of a Data Set

  • General format for subsetting to a row: data[row, ]
  • Try this R Code:
View(States) # look for the 1st row
States[1, ]
States["Alabama", ]
  • Ask yourself:
    • What percent of people in Alabama voted for McCain?
    • What are other social and political characteristics of Alabama?

Subsetting a Column of a Data Set

  • General format for subsetting to a row: data[ , column]
  • Try this R Code:
View(States) # look for the 2nd column
States[ , 2]
States[ , "HouseholdIncome"]
  • Ask yourself:
    • Doesn't this look familiar?

Subsetting a Row and a Column a Data Set

  • General format for subsetting to a row: data[row, column]
  • Try this R Code:
View(States) # look for the 1st row and 2nd column
States[1, 2]
States["Alabama", "HouseholdIncome"]
  • Ask yourself:
    • What is this result? Why is it a single number?

Subsetting Multiple Rows and Columns

  • Often we want to subset to multiple rows and columns
  • To do this, we can use the c() function, which combines (or concatenates) a set of elements
  • Again, we can subset using numbers or names
  • Try this R Code:
States[c(1, 5), ]
States[c("Alabama", "California"), ]
States[c(1, 5), c(3, 9)]
States[c("Alabama", "California"), c("McCainVote","College")] 
  • Ask yourself:
    • What percentage voted for McCain in California versus Alabama?

Challenge #4: What About the South?

  • What percentage voted for Obama in Mississippi compared to Massachusetts?

  • R Code Hint:

States[c("Mississippi", "Massachusetts"), c("ObamaVote")] 

Check-In #4: Summarizing and Visualizing Data

  • At this point you should have:
    • Subsetted rows of a data set
    • Subsetted columns of a data set
    • Subsetted rows and columns of a data set
    • Combined elements with the concatenate function

5. Importing, Saving, and Exporting Data

Mission #5: Healthy States, Red States?

  • During his presidency, Obama signed the Affordable Care Act (Obamacare), expanding healthcare to more (but not all) citizens
  • Did “healthier” states vote for McCain in 2008?

Loading in New Data

  • Our current data set does not have any variables on health
  • But we do have StatesHealth.dta
    • This is a Stata data file (.dta extension)
  • To load this data into R we will need to use the foreign() R package

The Power of R and R Packages

  • Since R is free and open-source, lots of people are writing R packages
  • An R package is just a collection of R functions, data, and code in a well-defined format
  • R packages can do everything from text mining (tm package) to data visualzation (ggplot2 package) to data wrangling (dplyr package)
  • Over 9,000 R packages as of 2017 on CRAN (Comprehensive R Archive Network)

Browsing R Packages on CRAN

Installing Packages in RStudio

  • To install R packages in RStudio: GUI versus R Console
  • 1. Using the GUI: Go to the Packages tab and click Install
  • 2. Using the R Console: Type install.packages("package_name")
    • package_name is just the name of the R package in quotes
    • Example: install.packages("foreign")

Loading an R Package For Use

  • Once you've installed an R package, it's then bundled with R and RStudio
  • You now have access to all of the functions, data, code, and other files associated with the installed R package
  • However, to access these files you must load your R package
    • Try this R Code: library(foreign)
  • You only have to install an R package once, but you must load it every time you start R

Importing Data into R

  1. Check your working directory with getwd()
  2. Set your working directory with setwd()
  3. Check that the data set is ther with dir()
  4. Read in the Stata data set wtih read.dta()
  • Try this R Code:
getwd()
setwd("C:/Folder/") 
dir()
StatesHealth <- read.dta(StatesHealth.dta)
  • Note: "C:/Folder/" should be changed to the location of the Stata data set on your computer

Examining the Data Set

  • After importing a data set, it's a good idea to check it for possible errors
  • Try this R Code:
View(StatesHealth)
head(StatesHealth)
tail(StatesHealth)
  • Ask yourself:
    • How is this data set different from States.RData?

Saving the Data Set

  • To save the data set as an .RData file, we can use save()
  • This will save the data set in the working directory
  • We can verify that it's in the directory with dir()
  • Try this R Code:
save(StatesHealth, file="StatesHealth.RData")
dir()
  • You can also save all data, variables, and other objects in R with save.image()
    • Example: save.image("Everything.RData")

Exporting a Data Set

  • Saving a data set will give it an .RData extension
  • But we can the States data set as a Stata data set

  • Try this R Code:

write.dta(States, file="States.dta")
dir()
  • We can now open States.dta with Stata!

More on Importing Data into R

  • R can import (and export) many kinds of data:
  • Importing Excel Spreadsheets (.xlsx)
    • Package: xlsx
    • Function: read.xlsx()
  • Importing SPSS Files (.sav)
    • Package: foreign
    • Function: read.spss()
  • Importing SAS Files (.xpt)
    • Package: foreign
    • Function: read.xport()

Challenge #5: Healthy States, Red States?

  • Did “healthier” states vote for McCain or Obama in 2008?

  • R Code Hint:

plot(StatesHealth$Obese, StatesHealth$ObamaVote)
  • Bonus:
text(StatesHealth$Obese, StatesHealth$ObamaVote, 
  labels=StatesHealth$State)

Check-In #5: Importing, Saving, and Exporting Data

  • At this point you should have:
    • Imported a data set into R
    • Saved an R data set for later use
    • Exported a data set into other formats

Recap of the Workshop

  • At this point you should have:
    • Loaded and explored the structure of a data set
    • Worked with an R script and installed an R package
    • Summarized and visualized numerical and categorical variables
    • Found and analyzed subgroups in a data set
    • Saved, exported, and imported a data set

For More Information:

URL: https://compass-workshops.github.io/info/

Email List: Send an email to listserv@lists.princeton.edu with “Subscribe COMPASSWORKSHOPS” in the body and all other lines blank, including the subject

Feedback Survey:

Addendum: Creating Dynamic Reports

  • The R script is useful for replicating what you've you done in R
  • But often you will want to create dynamic reports that combine text, code, and output
  • In RStudio, go to File -> Open File and navigate to the location you downloaded MyFirstMarkdown.Rmd
  • Three main components of R Markdown files:
    • R Instructions
    • R Code Chunks
    • Markdown Text

R Instructions

The initial chunk of text contains instructions for R

  • Telling R the title, author, and date:
title: "My First Markdown"
author: "Ethan Fosse (COMPASS Workshops)"
date: "September 20, 2016"
  • Instructing R to produce html output (in other words, a web page):
output: html_document

R Code Chunks

  • The R Markdown file also has embedded R code chunks
  • This is the R code chunk in MyFirstMarkdown.Rmd:
# loading the R data set
# load("C:/Folder/States.RData") 

# examining the data set
head(States)
nrow(States)
ncol(States)

  • The chunk begins with 3 single back quotes followed an {r} and ends with 3 single back quotes

Markdown Text

  • The rest of the document consists of Markdown text
  • Markdown is a system for writing web pages by marking up the text much as you would in an email rather than writing html code
  • The marked-up text gets converted to html, replacing the marks with the proper html code
  • You make text bold using double asterisks, like this: **bold**, and you make things italics by using single asterisks, like this: *italics*.

Editing R Code Chunks

  • Highlight the R code within the R code chunk and click Run or press Ctrl + Enter
  • It works just like an R script!
  • Now add the following to the R code chunk:
mean(States$ObamaVote)
  • Re-run the R code chunk and see what happens!

Creating New R Code Chunks

  • At the end of the R Markdown file type 3 single backquotes followed by {r}
  • Type the following into the new R code chunk
ObamaMcCain <- States$ObamaMcCain
plot(ObamaMcCain)

Editing the Load Function

  • We also need to check that all of the R code chunks have the correct R code
  • In the R Markdown file go to the following line of code:
# load("C:/Folder/States.RData")
  • Remove the comment symbol # so that R will run this line of code
  • Change load("C:/Folder/States.RData") to reflect the correct location of States.RData on your computer
  • If you don't specify the correct location of the data set, then you will have an error when trying to create a report

Producing Reports by Knitting

  • To create a report from your R Markdown file, click on Knit Html
  • Click on the gear icon for viewing options:
    • View in Pane: Viewing the html file within RStudio
    • View in Window: Viewing the html file in a separate window
  • Make sure your load() specifies the appropriate location
  • Other output options are pdf and Word
  • To create a new report, go to File -> New File -> R Markdown

Check-In: Creating Dynamic Reports

  • At this point you should have:
    • Learned about the format of an R Markdown file
    • Created and edited code chunks in R Markdown
    • Knitted an R Markdown file