Introduction

Here you can find lab notes and resources for Econ 415. These will be updated after our in-class lab sessions. These notes are not a substitute for attending lab but serve as an additional resource.

Much of the lab content will be drawn from R for Data Science.

Useful R Resources

Getting Started with R – A collection of resources for those getting started with R

tidyr Cheatsheet – A useful cheatsheet for data cleaning and tidy data using tidyverse functions

ggplot2 Cheatsheet – A useful cheatsheet for various ggplot geoms

Tidy Tuesday Repo – A weekly data science project in R to test your tidyverse skills!

R Setup and Workflow Basics

Before R: File Management

Effective file folder management is crucial for maintaining an organized and efficient digital workspace. Setting up organized folders will make your life significantly easier in the future!

  • Use your computer’s file management tools or RStudio’s “Files” tab.
  • Create a main folder for your project or course.
  • Go to the desired folder and create a new folder, e.g., “Lab1”
  • Save all downloaded data files and R Scripts in this folder
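
You can also create these folders from inside R. Here is a minimal sketch (the folder names are just examples):

# create a main course folder with a Lab1 subfolder (names are hypothetical)
dir.create("Econ415/Lab1", recursive = TRUE)
# confirm the folder now exists
file.exists("Econ415/Lab1")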

A Look Around RStudio

In the bottom left of RStudio, you will see the console. The console executes code. You can type code and execute it in the console, but that code is not saved when you close RStudio. It is recommended that you do not use the console in your regular workflow.

To save your work, you should code in an R script. Open a script using the button that looks like a piece of paper with a green plus sign in the top left corner of RStudio.

R scripts will open here. You can code, comment, and run the code from your script. To run code, either click the “Run” button or press Cmd+Enter (Mac) or Ctrl+Enter (Windows). R scripts are saved to the folder you are currently working in.

In the top right corner, we have the workspace/environment pane.

The Environment tab tells you what objects are stored in R (i.e. what is loaded or stored in memory). The History tab shows previous commands you have run.

Last, on the bottom right, we have several tabs including:

  • Files - shows the files on your computer in the directory you are working in
  • Viewer - can view data or R objects
  • Help - shows help documentation for R functions and datasets
  • Plots - shows current and previous plots generated in your R session; you can save and export them to png/pdf formats.
  • Packages - list of R packages you have installed. You can also install packages directly from this tab.

Coding Basics

R is an object-oriented language: essentially, we use functions, which we apply to objects. If you have never used this type of programming before, it can be a bit confusing at first. More on this shortly, but if you aren’t sure what a function does, or how it works, you can put ? before the function name to get the documentation. For example, ?mean will bring up the help page for the mean() function. Try typing ?mean in the console and looking at the help page.

Objects

An object is a binding between a name and a value: you assign values to names using <- or =. The first assignment symbol consists of a < next to a dash - to make it look like an arrow.

x <- 5 #assign the value of 5 to a variable called x
# notice that this x is now in your global environment
x # print x
## [1] 5
y = 10
y
## [1] 10

You can combine objects as well, which lets us do some basic math operations.

# create a new object called z that is equal to x*y
z <- x * y
#print z
z
## [1] 50

If you do not create an object, R will not save it to the global environment. If an object is not in the global environment and you try to reference it later, R will not know what you are referring to.
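
For example, continuing from the code above (w is deliberately a name we never created):

x + 1 # works: x is in the global environment
# w + 1 would fail with "Error: object 'w' not found" because w was never assigned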

Math Operations

a <- 2 + 3
a
## [1] 5
b <- 4 - 5
b
## [1] -1
c <- 4 * 2
c
## [1] 8
d <- 6 / 3
d
## [1] 2
e <- 7^2
e
## [1] 49

Vectors

You can create a vector (an ordered collection of items) in R.

# create a vector of 1 through 10
vector1 <- 1:10
vector1
##  [1]  1  2  3  4  5  6  7  8  9 10

If we want specific items, we use the c() function and separate the items with a comma.

vector2 <- c(1,3,5,7,9)
vector2
## [1] 1 3 5 7 9

Mathematical operations work on vectors too!

vector2^2
## [1]  1  9 25 49 81

Classes

Objects in R have different classes. Check the class of a few objects we have already created:

class(x)
## [1] "numeric"
class(vector1)
## [1] "integer"

There are other classes too!

# create a string
my_string <- "Econ is cool!"
class(my_string)
## [1] "character"
# logical class
class(2 > 3)
## [1] "logical"

What happens if we have a vector of characters and numbers?

char_vector <- c(1:5, "banana", "apple")
char_vector
## [1] "1"      "2"      "3"      "4"      "5"      "banana" "apple"
# can't use mathematical operations on characters
# why? because the entire vector is a character class!
class(char_vector)
## [1] "character"

Functions

Functions are operations that transform your objects in all sorts of ways. We have actually already used two functions, c() and class(). Here are a few other useful ones:

#print the first few objects in vector1
head(vector1)
## [1] 1 2 3 4 5 6
#print the first 2 objects in vector1
head(vector1, 2)
## [1] 1 2
#print the last few objects in vector1
tail(vector1)
## [1]  5  6  7  8  9 10
#print last two objects in vector1
tail(vector1, 2)
## [1]  9 10
#find the mean of vector1
mean(vector1)
## [1] 5.5
#median 
median(vector1)
## [1] 5.5
#standard deviation
sd(vector1)
## [1] 3.02765
# summary() prints summary stats
summary(vector1)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00    3.25    5.50    5.50    7.75   10.00
min(vector1)
## [1] 1
max(vector1)
## [1] 10

Code Style

Coding style is the punctuation and grammar of the coding world. Making sure that your code is formatted in a readable, standard way is helpful for you and others to understand the code. We will follow guidelines from the tidyverse style guide.

Spaces: Put spaces on either side of mathematical operators (i.e. +, -, ==, <, …), and around the assignment operator (<-). The exception to this is the ^ symbol.

# Strive for
z <- (a + b)^2 / d

# Avoid
z<-( a + b ) ^ 2/d

Don’t put spaces inside or outside parentheses for regular function calls. Always put a space after a comma.

# Strive for
mean(x, na.rm = TRUE)
## [1] 5
# Avoid
mean (x ,na.rm=TRUE)
## [1] 5

Adding extra spaces is fine if it helps with alignment. For example:

example_data_frame <- 
  data.frame(
    variable1      = c(1:10),
    variable_name2 = c(2:11),
    var_name       = c(3:12)
  )

Naming Conventions: Object names must start with a letter and can only contain letters, numbers, _, and .. Names should be descriptive; snake case is the recommended naming convention (separating lowercase words with _).

really_long_variable_name <- 1

Commenting: You can comment your code with #. It is strongly recommended to leave comments in your code so that others, and future you, can keep track of your thought process.

# Good code is well-commented code!!

You can also create section comments that are collapsible. This is incredibly helpful when you have a really long R script! Any comment line that ends with at least four trailing dashes (----) will create a section.

# This is section 1 ----

# ---- This is section 2 ----

Packages

R is really useful because of its ability to use packages. pacman is a package for “package management”: it helps us load multiple packages at once.

# if you have not previously installed the package, include the line:
#install.packages("pacman")

# you only have to do this once. you can also install packages from the "Packages" side panel tab

After installing a package, we need to load it with library() before we can use it.

library(pacman)

Now we use the p_load() function to load the other packages we want to use. We will use the tidyverse package throughout the course, so let’s load that one.

p_load(tidyverse)

Tidyverse Functions

The tidyverse is used for data wrangling. It allows you to manipulate data frames in a rather intuitive way. The tidyverse is a huge collection of packages, so today we will focus on functions from the dplyr package (which comes with the tidyverse).

  • select(): subset columns
  • filter(): subset rows on conditions
  • arrange(): sort results
  • mutate(): create new columns by using information from other columns
  • group_by() and summarize(): create summary statistics on grouped data
  • count(): count discrete values

Let’s use the starwars dataset that is built into the tidyverse package.

names(starwars)
##  [1] "name"       "height"     "mass"       "hair_color" "skin_color"
##  [6] "eye_color"  "birth_year" "sex"        "gender"     "homeworld" 
## [11] "species"    "films"      "vehicles"   "starships"
head(starwars)
## # A tibble: 6 × 14
##   name      height  mass hair_color skin_color eye_color birth_year sex   gender
##   <chr>      <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> <chr> 
## 1 Luke Sky…    172    77 blond      fair       blue            19   male  mascu…
## 2 C-3PO        167    75 <NA>       gold       yellow         112   none  mascu…
## 3 R2-D2         96    32 <NA>       white, bl… red             33   none  mascu…
## 4 Darth Va…    202   136 none       white      yellow          41.9 male  mascu…
## 5 Leia Org…    150    49 brown      light      brown           19   fema… femin…
## 6 Owen Lars    178   120 brown, gr… light      blue            52   male  mascu…
## # ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>,
## #   vehicles <list>, starships <list>

Select and Filter

Let’s select only the name, gender, and homeworld variables

select(starwars, c(name, gender, homeworld))
## # A tibble: 87 × 3
##    name               gender    homeworld
##    <chr>              <chr>     <chr>    
##  1 Luke Skywalker     masculine Tatooine 
##  2 C-3PO              masculine Tatooine 
##  3 R2-D2              masculine Naboo    
##  4 Darth Vader        masculine Tatooine 
##  5 Leia Organa        feminine  Alderaan 
##  6 Owen Lars          masculine Tatooine 
##  7 Beru Whitesun Lars feminine  Tatooine 
##  8 R5-D4              masculine Tatooine 
##  9 Biggs Darklighter  masculine Tatooine 
## 10 Obi-Wan Kenobi     masculine Stewjon  
## # ℹ 77 more rows

Notice that this didn’t save anything in our global environment! If you want to save this new dataframe, you have to give it a name!
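
For example (the name sw_small is arbitrary):

# assigning the result saves it to the global environment
sw_small <- select(starwars, c(name, gender, homeworld))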

To select all columns except a certain one, use a minus sign

select(starwars, c(-homeworld, -gender))
## # A tibble: 87 × 12
##    name    height  mass hair_color skin_color eye_color birth_year sex   species
##    <chr>    <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> <chr>  
##  1 Luke S…    172    77 blond      fair       blue            19   male  Human  
##  2 C-3PO      167    75 <NA>       gold       yellow         112   none  Droid  
##  3 R2-D2       96    32 <NA>       white, bl… red             33   none  Droid  
##  4 Darth …    202   136 none       white      yellow          41.9 male  Human  
##  5 Leia O…    150    49 brown      light      brown           19   fema… Human  
##  6 Owen L…    178   120 brown, gr… light      blue            52   male  Human  
##  7 Beru W…    165    75 brown      light      blue            47   fema… Human  
##  8 R5-D4       97    32 <NA>       white, red red             NA   none  Droid  
##  9 Biggs …    183    84 black      light      brown           24   male  Human  
## 10 Obi-Wa…    182    77 auburn, w… fair       blue-gray       57   male  Human  
## # ℹ 77 more rows
## # ℹ 3 more variables: films <list>, vehicles <list>, starships <list>

Filter the data frame to include only droids

filter(starwars, species == "Droid")
## # A tibble: 6 × 14
##   name   height  mass hair_color skin_color  eye_color birth_year sex   gender  
##   <chr>   <int> <dbl> <chr>      <chr>       <chr>          <dbl> <chr> <chr>   
## 1 C-3PO     167    75 <NA>       gold        yellow           112 none  masculi…
## 2 R2-D2      96    32 <NA>       white, blue red               33 none  masculi…
## 3 R5-D4      97    32 <NA>       white, red  red               NA none  masculi…
## 4 IG-88     200   140 none       metal       red               15 none  masculi…
## 5 R4-P17     96    NA none       silver, red red, blue         NA none  feminine
## 6 BB8        NA    NA none       none        black             NA none  masculi…
## # ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>,
## #   vehicles <list>, starships <list>

Filter the data frame to include droids OR humans

filter(starwars, species == "Droid" | species == "Human")
## # A tibble: 41 × 14
##    name     height  mass hair_color skin_color eye_color birth_year sex   gender
##    <chr>     <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> <chr> 
##  1 Luke Sk…    172    77 blond      fair       blue            19   male  mascu…
##  2 C-3PO       167    75 <NA>       gold       yellow         112   none  mascu…
##  3 R2-D2        96    32 <NA>       white, bl… red             33   none  mascu…
##  4 Darth V…    202   136 none       white      yellow          41.9 male  mascu…
##  5 Leia Or…    150    49 brown      light      brown           19   fema… femin…
##  6 Owen La…    178   120 brown, gr… light      blue            52   male  mascu…
##  7 Beru Wh…    165    75 brown      light      blue            47   fema… femin…
##  8 R5-D4        97    32 <NA>       white, red red             NA   none  mascu…
##  9 Biggs D…    183    84 black      light      brown           24   male  mascu…
## 10 Obi-Wan…    182    77 auburn, w… fair       blue-gray       57   male  mascu…
## # ℹ 31 more rows
## # ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>,
## #   vehicles <list>, starships <list>

Filter the data frame to include characters taller than 100 cm with a mass over 100 kg

filter(starwars, height > 100 & mass > 100)
## # A tibble: 10 × 14
##    name     height  mass hair_color skin_color eye_color birth_year sex   gender
##    <chr>     <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> <chr> 
##  1 Darth V…    202   136 none       white      yellow          41.9 male  mascu…
##  2 Owen La…    178   120 brown, gr… light      blue            52   male  mascu…
##  3 Chewbac…    228   112 brown      unknown    blue           200   male  mascu…
##  4 Jabba D…    175  1358 <NA>       green-tan… orange         600   herm… mascu…
##  5 Jek Ton…    180   110 brown      fair       blue            NA   <NA>  <NA>  
##  6 IG-88       200   140 none       metal      red             15   none  mascu…
##  7 Bossk       190   113 none       green      red             53   male  mascu…
##  8 Dexter …    198   102 none       brown      yellow          NA   male  mascu…
##  9 Grievous    216   159 none       brown, wh… green, y…       NA   male  mascu…
## 10 Tarfful     234   136 brown      brown      blue            NA   male  mascu…
## # ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>,
## #   vehicles <list>, starships <list>

Piping

What if we want to do all of those things in one step? The tidyverse allows us to chain functions together with %>%. This is called a pipe, and the pipe connects the LHS to the RHS (like reading a book). Let’s make a new dataframe where we select the name, height, and mass, and filter out those who are shorter than 100 cm.

new_df <- starwars %>% select(name, height, mass) %>% filter(height >= 100)
new_df
## # A tibble: 74 × 3
##    name               height  mass
##    <chr>               <int> <dbl>
##  1 Luke Skywalker        172    77
##  2 C-3PO                 167    75
##  3 Darth Vader           202   136
##  4 Leia Organa           150    49
##  5 Owen Lars             178   120
##  6 Beru Whitesun Lars    165    75
##  7 Biggs Darklighter     183    84
##  8 Obi-Wan Kenobi        182    77
##  9 Anakin Skywalker      188    84
## 10 Wilhuff Tarkin        180    NA
## # ℹ 64 more rows

Self check: make a new data frame where you select all columns except gender and keep only characters that have green skin color

example_df <- starwars %>% select(-gender) %>% filter(skin_color == "green")
example_df
## # A tibble: 6 × 13
##   name   height  mass hair_color skin_color eye_color birth_year sex   homeworld
##   <chr>   <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> <chr>    
## 1 Greedo    173    74 <NA>       green      black             44 male  Rodia    
## 2 Yoda       66    17 white      green      brown            896 male  <NA>     
## 3 Bossk     190   113 none       green      red               53 male  Trandosha
## 4 Rugor…    206    NA none       green      orange            NA male  Naboo    
## 5 Kit F…    196    87 none       green      black             NA male  Glee Ans…
## 6 Poggl…    183    80 none       green      yellow            NA male  Geonosis 
## # ℹ 4 more variables: species <chr>, films <list>, vehicles <list>,
## #   starships <list>

Arrange

Let’s arrange all of the characters by their height

starwars %>% arrange(height)
## # A tibble: 87 × 14
##    name     height  mass hair_color skin_color eye_color birth_year sex   gender
##    <chr>     <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> <chr> 
##  1 Yoda         66    17 white      green      brown            896 male  mascu…
##  2 Ratts T…     79    15 none       grey, blue unknown           NA male  mascu…
##  3 Wicket …     88    20 brown      brown      brown              8 male  mascu…
##  4 Dud Bolt     94    45 none       blue, grey yellow            NA male  mascu…
##  5 R2-D2        96    32 <NA>       white, bl… red               33 none  mascu…
##  6 R4-P17       96    NA none       silver, r… red, blue         NA none  femin…
##  7 R5-D4        97    32 <NA>       white, red red               NA none  mascu…
##  8 Sebulba     112    40 none       grey, red  orange            NA male  mascu…
##  9 Gasgano     122    NA none       white, bl… black             NA male  mascu…
## 10 Watto       137    NA black      blue, grey yellow            NA male  mascu…
## # ℹ 77 more rows
## # ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>,
## #   vehicles <list>, starships <list>

Notice this sorts from lowest to highest; we can go the other way too

starwars %>% arrange(desc(height))
## # A tibble: 87 × 14
##    name     height  mass hair_color skin_color eye_color birth_year sex   gender
##    <chr>     <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> <chr> 
##  1 Yarael …    264    NA none       white      yellow          NA   male  mascu…
##  2 Tarfful     234   136 brown      brown      blue            NA   male  mascu…
##  3 Lama Su     229    88 none       grey       black           NA   male  mascu…
##  4 Chewbac…    228   112 brown      unknown    blue           200   male  mascu…
##  5 Roos Ta…    224    82 none       grey       orange          NA   male  mascu…
##  6 Grievous    216   159 none       brown, wh… green, y…       NA   male  mascu…
##  7 Taun We     213    NA none       grey       black           NA   fema… femin…
##  8 Rugor N…    206    NA none       green      orange          NA   male  mascu…
##  9 Tion Me…    206    80 none       grey       black           NA   male  mascu…
## 10 Darth V…    202   136 none       white      yellow          41.9 male  mascu…
## # ℹ 77 more rows
## # ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>,
## #   vehicles <list>, starships <list>

Self check: Arrange the characters names in alphabetical order

starwars %>% arrange(name)
## # A tibble: 87 × 14
##    name     height  mass hair_color skin_color eye_color birth_year sex   gender
##    <chr>     <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> <chr> 
##  1 Ackbar      180    83 none       brown mot… orange          41   male  mascu…
##  2 Adi Gal…    184    50 none       dark       blue            NA   fema… femin…
##  3 Anakin …    188    84 blond      fair       blue            41.9 male  mascu…
##  4 Arvel C…     NA    NA brown      fair       brown           NA   male  mascu…
##  5 Ayla Se…    178    55 none       blue       hazel           48   fema… femin…
##  6 BB8          NA    NA none       none       black           NA   none  mascu…
##  7 Bail Pr…    191    NA black      tan        brown           67   male  mascu…
##  8 Barriss…    166    50 black      yellow     blue            40   fema… femin…
##  9 Ben Qua…    163    65 none       grey, gre… orange          NA   male  mascu…
## 10 Beru Wh…    165    75 brown      light      blue            47   fema… femin…
## # ℹ 77 more rows
## # ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>,
## #   vehicles <list>, starships <list>

Mutate

We can create new variables with mutate(). Let’s create a new variable that measures height in inches instead of centimeters (2.54 cm per inch)

starwars %>% mutate(height_inches = height/2.54)
## # A tibble: 87 × 15
##    name     height  mass hair_color skin_color eye_color birth_year sex   gender
##    <chr>     <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> <chr> 
##  1 Luke Sk…    172    77 blond      fair       blue            19   male  mascu…
##  2 C-3PO       167    75 <NA>       gold       yellow         112   none  mascu…
##  3 R2-D2        96    32 <NA>       white, bl… red             33   none  mascu…
##  4 Darth V…    202   136 none       white      yellow          41.9 male  mascu…
##  5 Leia Or…    150    49 brown      light      brown           19   fema… femin…
##  6 Owen La…    178   120 brown, gr… light      blue            52   male  mascu…
##  7 Beru Wh…    165    75 brown      light      blue            47   fema… femin…
##  8 R5-D4        97    32 <NA>       white, red red             NA   none  mascu…
##  9 Biggs D…    183    84 black      light      brown           24   male  mascu…
## 10 Obi-Wan…    182    77 auburn, w… fair       blue-gray       57   male  mascu…
## # ℹ 77 more rows
## # ℹ 6 more variables: homeworld <chr>, species <chr>, films <list>,
## #   vehicles <list>, starships <list>, height_inches <dbl>

Notice that this new variable is not in the original data frame. Why? Because we didn’t assign it to our object.

starwars <- starwars %>% mutate(height_inches = height/2.54)

Self check: Create a new variable that is the sum of a person’s mass and height

starwars %>% mutate(total = height + mass)
## # A tibble: 87 × 16
##    name     height  mass hair_color skin_color eye_color birth_year sex   gender
##    <chr>     <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> <chr> 
##  1 Luke Sk…    172    77 blond      fair       blue            19   male  mascu…
##  2 C-3PO       167    75 <NA>       gold       yellow         112   none  mascu…
##  3 R2-D2        96    32 <NA>       white, bl… red             33   none  mascu…
##  4 Darth V…    202   136 none       white      yellow          41.9 male  mascu…
##  5 Leia Or…    150    49 brown      light      brown           19   fema… femin…
##  6 Owen La…    178   120 brown, gr… light      blue            52   male  mascu…
##  7 Beru Wh…    165    75 brown      light      blue            47   fema… femin…
##  8 R5-D4        97    32 <NA>       white, red red             NA   none  mascu…
##  9 Biggs D…    183    84 black      light      brown           24   male  mascu…
## 10 Obi-Wan…    182    77 auburn, w… fair       blue-gray       57   male  mascu…
## # ℹ 77 more rows
## # ℹ 7 more variables: homeworld <chr>, species <chr>, films <list>,
## #   vehicles <list>, starships <list>, height_inches <dbl>, total <dbl>

Group_by and Summarize

group_by() groups rows of the data together, and summarize() computes summary statistics on the grouped data. Let’s find the average height for each species.

starwars %>% group_by(species) %>% summarize(avg_height = mean(height))
## # A tibble: 38 × 2
##    species   avg_height
##    <chr>          <dbl>
##  1 Aleena           79 
##  2 Besalisk        198 
##  3 Cerean          198 
##  4 Chagrian        196 
##  5 Clawdite        168 
##  6 Droid            NA 
##  7 Dug             112 
##  8 Ewok             88 
##  9 Geonosian       183 
## 10 Gungan          209.
## # ℹ 28 more rows

Notice we have NA’s! We can get rid of those.

# Using na.omit()
starwars %>% na.omit() %>% group_by(species) %>% summarize(avg_height = mean(height))
## # A tibble: 11 × 2
##    species      avg_height
##    <chr>             <dbl>
##  1 Cerean             198 
##  2 Ewok                88 
##  3 Gungan             196 
##  4 Human              179.
##  5 Kel Dor            188 
##  6 Mirialan           168 
##  7 Mon Calamari       180 
##  8 Trandoshan         190 
##  9 Twi'lek            178 
## 10 Wookiee            228 
## 11 Zabrak             175
# Using na.rm = T inside the mean function
starwars %>%  group_by(species) %>% summarize(avg_height = mean(height, na.rm = T))
## # A tibble: 38 × 2
##    species   avg_height
##    <chr>          <dbl>
##  1 Aleena           79 
##  2 Besalisk        198 
##  3 Cerean          198 
##  4 Chagrian        196 
##  5 Clawdite        168 
##  6 Droid           131.
##  7 Dug             112 
##  8 Ewok             88 
##  9 Geonosian       183 
## 10 Gungan          209.
## # ℹ 28 more rows

Count

Count the number of each species.

starwars %>% count(species)
## # A tibble: 38 × 2
##    species       n
##    <chr>     <int>
##  1 Aleena        1
##  2 Besalisk      1
##  3 Cerean        1
##  4 Chagrian      1
##  5 Clawdite      1
##  6 Droid         6
##  7 Dug           1
##  8 Ewok          1
##  9 Geonosian     1
## 10 Gungan        3
## # ℹ 28 more rows

Importing Data

We need to get some data before we can start working with it! Download the “lab1.csv” file from Canvas. It should then be in your Downloads folder. Now, we need to make sure that we are working in the same place as our data, so that R knows where to find the csv file. Move the csv file to your current directory (the folder you are currently working out of).

If you are unsure which directory you are in, the function getwd() will give you the file path of your current location.

getwd()
## [1] "/Users/jenniputz/Dropbox/Econ415/Winter 2025/R"

Now, if this is not the same location as your desired folder, you need to tell R to set a new working directory. You can use setwd() to tell R where to go.

For example, if I want to go to my Downloads folder on my Mac, I can write setwd("/Users/jenniputz/Downloads"). If you are on Windows, your file path will start with C:/ instead. Note that in R you should write paths with forward slashes (/) or doubled backslashes (\\), because a single backslash is an escape character.
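
Here is a minimal sketch of the workflow (the paths are hypothetical, so swap in your own):

getwd()                                    # where am I now?
# setwd("/Users/yourname/Econ415/Lab1")    # Mac-style path
# setwd("C:/Users/yourname/Econ415/Lab1")  # Windows-style path
list.files()                               # confirm lab1.csv is in this folder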

Once you are in your correct directory, we can read in our data using read_csv(). Remember that everything in R is an object, so you must give your data frame a name.

cps <- read_csv("lab1.csv")
## Rows: 8891 Columns: 5
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): educ
## dbl (4): employed, black, female, exper
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Non-Experimental Data

Today you will investigate racial disparities in the labor market using data from the Current Population Survey (CPS), a large survey administered by the US Census Bureau and the Bureau of Labor Statistics. The federal government uses the CPS to estimate the unemployment rate. Economists use the CPS to study a variety of topics in labor economics, including the effect of binding minimum wages, the gender pay gap, and returns to schooling. You will use a CPS sample of workers from Boston and Chicago to study employment patterns by race.

Import data

If you have not already loaded the tidyverse package, please do so.

The data file is lab1.csv and is available on Canvas. Import the data using read_csv if you have not already done so.

By looking at the dataset, you can see that most of the variables are binary: they take values of either 1 or 0.

head(cps)
## # A tibble: 6 × 5
##   employed black female educ              exper
##      <dbl> <dbl>  <dbl> <chr>             <dbl>
## 1        1     0      1 HS Graduate          20
## 2        0     0      0 HS Graduate          20
## 3        1     1      1 HS Dropout           12
## 4        1     1      0 HS Dropout           17
## 5        0     1      1 HS Graduate          21
## 6        1     0      1 College or Higher    13

For example, individuals with employed == 1 have a job while those with employed == 0 do not.

Employment Rates

What percentage of individuals in the sample are employed?

The mean of a binary variable gives the fraction of observations with values equal to 1.

mean(cps$employed)
## [1] NA

Something went wrong. If you use the summary function, you’ll see that there are missing values (NAs) of employed.

summary(cps$employed)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##  0.0000  1.0000  1.0000  0.7811  1.0000  1.0000      18

When there are missing values, some functions, like mean, will return a missing value as output. To circumvent this, you can specify na.rm = TRUE in the mean function:

mean(cps$employed, na.rm = TRUE)
## [1] 0.7811338

The employment rate is 78 percent.

What are the employment rates by race and gender?

cps %>% 
  group_by(black, female) %>% 
  summarize(employed = mean(employed, na.rm = TRUE))
## `summarise()` has grouped output by 'black'. You can override using the
## `.groups` argument.
## # A tibble: 4 × 3
## # Groups:   black [2]
##   black female employed
##   <dbl>  <dbl>    <dbl>
## 1     0      0    0.868
## 2     0      1    0.723
## 3     1      0    0.718
## 4     1      1    0.693

You can see that the employment rate is

  • 87 percent for white males (black == 0 and female == 0)
  • 72 percent for white females (black == 0 and female == 1)
  • 72 percent for black males (black == 1 and female == 0)
  • 69 percent for black females (black == 1 and female == 1).

Racial Disparities

What is the average difference in employment status between black individuals and white individuals?

To find the difference,

  1. Use the filter function to restrict the sample to one group (black or white)
  2. Use mean to calculate the group mean
  3. Repeat for the other group
  4. Take the difference in means.
black_emp_df <- cps %>% filter(black == 1)
black_emp <- mean(black_emp_df$employed, na.rm = T)
black_emp 
## [1] 0.7035928
white_emp_df <- cps %>% filter(black == 0)
white_emp <- mean(white_emp_df$employed, na.rm = T)
white_emp
## [1] 0.7948786
black_emp - white_emp
## [1] -0.09128578

The employment rate is 9 percentage points lower for black individuals than for white individuals.

Does this mean that there is racial disparity?

Not yet. We still don’t know if the difference is statistically significant. You can find out by conducting a \(t\)-test of the null hypothesis that the true difference-in-means is zero against the alternative hypothesis that the difference is nonzero.

To conduct the test, you need to calculate the \(t\)-statistic for the difference-in-means, which is given by

\[t = \dfrac{\overline{\text{Employed}}_\text{Black} - \overline{\text{Employed}}_\text{White}}{\sqrt{\frac{S^2_\text{Black}}{N_\text{Black}} + \frac{S^2_\text{White}}{N_\text{White}}}}\] We can conduct the \(t\)-test using the t.test() function. Here, you input the variable names for x and y, and tell R if you want a two-sided or one-sided test.

t.test(black_emp_df$employed, white_emp_df$employed, alternative = "two.sided", var.equal = FALSE)
## 
##  Welch Two Sample t-test
## 
## data:  black_emp_df$employed and white_emp_df$employed
## t = -6.845, df = 1724.5, p-value = 1.059e-11
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.11744252 -0.06512905
## sample estimates:
## mean of x mean of y 
## 0.7035928 0.7948786
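
As a sanity check, here is a sketch that rebuilds the \(t\)-statistic from the formula above. It should match the t.test() output up to rounding:

# sample sizes (non-missing) and variances for each group
n_b  <- sum(!is.na(black_emp_df$employed))
n_w  <- sum(!is.na(white_emp_df$employed))
s2_b <- var(black_emp_df$employed, na.rm = TRUE)
s2_w <- var(white_emp_df$employed, na.rm = TRUE)
# difference in means divided by its standard error
(black_emp - white_emp) / sqrt(s2_b / n_b + s2_w / n_w)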

To conclude your test, compare your \(t\)-stat to 2 (an approximation for the critical value of \(t\) in a 5 percent test; we will do more rigorous hypothesis testing later in the course). If \(|t| > 2\), then you can reject the null hypothesis. Your \(t\)-stat of -6.85 is certainly more extreme than 2, so you can reject the null hypothesis. This means that the difference in employment rates is statistically significant. There is a racial disparity in employment.

Does the disparity in employment rates provide causal evidence of racial discrimination in hiring? Does the comparison of employment rates by race hold all else constant? What else could explain the gap?

Graphing

Let’s use the penguins dataset that is in the palmerpenguins package. Also load the tidyverse package.

library(pacman)
p_load(tidyverse, palmerpenguins)

The dataset contains 344 observations of penguins at the Palmer Station in Antarctica.

head(penguins)
## # A tibble: 6 × 8
##   species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
##   <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
## 1 Adelie  Torgersen           39.1          18.7               181        3750
## 2 Adelie  Torgersen           39.5          17.4               186        3800
## 3 Adelie  Torgersen           40.3          18                 195        3250
## 4 Adelie  Torgersen           NA            NA                  NA          NA
## 5 Adelie  Torgersen           36.7          19.3               193        3450
## 6 Adelie  Torgersen           39.3          20.6               190        3650
## # ℹ 2 more variables: sex <fct>, year <int>

Let’s use this dataset to create plots, following the code in “R for Data Science”.

Creating a ggplot

The basic setup of making a ggplot requires three things: the data, the aesthetic mapping, and a geom. The aesthetic mappings describe how variables in the data are mapped to visual properties (aesthetics) of geoms, like which variables are on the axes, the variable to color or fill by, etc. The geoms tell R how to draw the data like points, lines, columns, etc.

In general, we can make a ggplot by typing the following: ggplot(data = <DATA>, mapping = aes(<MAPPING>)) + <geom_function>()

The way ggplot works is by adding layers. We can add a new layer with the + sign. Let’s build a ggplot step by step. First, start with ggplot() and tell R what data we are using.

ggplot(data = penguins)

Why did this make a blank graph? Well, we haven’t given R the aesthetic mapping yet so it doesn’t know what to put on top of the base layer. Let’s add the x and y variables, flipper length and body mass.

ggplot(data = penguins, mapping = aes(x = flipper_length_mm, y = body_mass_g)) # note you can put the mapping here or in the geom

Now we have a graph with axes and gridlines but no information on the graph. To get data on the graph, we need to tell R how we want to draw the data with a geom. To make a scatterplot, we use geom_point().

ggplot(data = penguins, mapping = aes(x = flipper_length_mm, y = body_mass_g)) + geom_point()

Adding Aesthetics and Layers

Suppose we want to see how the relationship between flipper length and body mass differs by species of penguin. We can represent each species on our scatterplot with different colored points.

ggplot(data = penguins, 
       mapping = aes(x = flipper_length_mm, y = body_mass_g, color = species)) +
  geom_point()

Let’s add another layer: a curve displaying the relationship using geom_smooth(). Using method = "lm" gives the line of best fit.

ggplot(data = penguins, 
       mapping = aes(x = flipper_length_mm, y = body_mass_g, color = species)) +
  geom_point() +
  geom_smooth(method = "lm")

Note that this creates three separate lines, one for each species. When the aesthetic mapping is defined in the ggplot() function, the mappings are passed down to all of the subsequent layers. You can also define aesthetic mappings at the local level. For example, what if we wanted the points to be colored by species but only one line of best fit?

ggplot(data = penguins, 
       mapping = aes(x = flipper_length_mm, y = body_mass_g)) +
  geom_point(mapping = aes(color = species)) +
  geom_smooth(method = "lm")

We can also change the shapes of the points.

ggplot(data = penguins, 
       mapping = aes(x = flipper_length_mm, y = body_mass_g)) +
  geom_point(mapping = aes(color = species, shape = species)) +
  geom_smooth(method = "lm")

Now let’s make it pretty! We can add titles, axis labels, themes and more! You can also get fancy and change the color of your points and lines.

ggplot(data = penguins, 
       mapping = aes(x = flipper_length_mm, y = body_mass_g)) +
  geom_point(mapping = aes(color = species, shape = species)) +
  geom_smooth(method = "lm") +
  labs(title = "Body Mass and Flipper Length of Penguins",
       x = "Flipper Length (mm)", 
       y = "Body Mass (g)") +
  theme_minimal()

Visualizing Distributions

For categorical variables, we can use a bar chart. The height of each bar shows how many observations take each x value.

ggplot(penguins, aes(x = species)) + geom_bar()

For numerical variables, we can create histograms and density plots. For histograms, changing the binwidth sets the intervals for the x variable.

ggplot(penguins, aes(x = body_mass_g)) + geom_histogram(binwidth = 20)

ggplot(penguins, aes(x = body_mass_g)) + geom_histogram(binwidth = 200)

ggplot(penguins, aes(x = body_mass_g)) + geom_histogram(binwidth = 2000)

Density plots are smooth versions of histograms and are useful for continuous data.

ggplot(penguins, aes(x = body_mass_g)) + geom_density()

Test Your Knowledge

Try to reproduce this graph:

Simple OLS Regression

library(pacman)
p_load(tidyverse, broom, stargazer, AER)
# Note: broom package contains the tidy() function

For this lab, we will use data from the AER package on California schools. To get this data and use it we can:

# Note: if you install and load the AER package, the dataset is included in the package. Then run the line below:
data("CASchools")


# look at a  snapshot
head(CASchools, 10)
##    district                          school      county grades students
## 1     75119              Sunol Glen Unified     Alameda  KK-08      195
## 2     61499            Manzanita Elementary       Butte  KK-08      240
## 3     61549     Thermalito Union Elementary       Butte  KK-08     1550
## 4     61457 Golden Feather Union Elementary       Butte  KK-08      243
## 5     61523        Palermo Union Elementary       Butte  KK-08     1335
## 6     62042         Burrel Union Elementary      Fresno  KK-08      137
## 7     68536           Holt Union Elementary San Joaquin  KK-08      195
## 8     63834             Vineland Elementary        Kern  KK-08      888
## 9     62331        Orange Center Elementary      Fresno  KK-08      379
## 10    67306     Del Paso Heights Elementary  Sacramento  KK-06     2247
##    teachers calworks    lunch computer expenditure    income   english  read
## 1     10.90   0.5102   2.0408       67    6384.911 22.690001  0.000000 691.6
## 2     11.15  15.4167  47.9167      101    5099.381  9.824000  4.583333 660.5
## 3     82.90  55.0323  76.3226      169    5501.955  8.978000 30.000002 636.3
## 4     14.00  36.4754  77.0492       85    7101.831  8.978000  0.000000 651.9
## 5     71.50  33.1086  78.4270      171    5235.988  9.080333 13.857677 641.8
## 6      6.40  12.3188  86.9565       25    5580.147 10.415000 12.408759 605.7
## 7     10.00  12.9032  94.6237       28    5253.331  6.577000 68.717949 604.5
## 8     42.50  18.8063 100.0000       66    4565.746  8.174000 46.959461 605.5
## 9     19.00  32.1900  93.1398       35    5355.548  7.385000 30.079157 608.9
## 10   108.00  78.9942  87.3164        0    5036.211 11.613333 40.275921 611.9
##     math
## 1  690.0
## 2  661.9
## 3  650.9
## 4  643.5
## 5  639.9
## 6  605.4
## 7  609.0
## 8  612.5
## 9  616.1
## 10 613.4
names(CASchools)
##  [1] "district"    "school"      "county"      "grades"      "students"   
##  [6] "teachers"    "calworks"    "lunch"       "computer"    "expenditure"
## [11] "income"      "english"     "read"        "math"

Running a Regression

To do a regression in R, we use lm(). The basic setup: name <- lm(y ~ x, data = name_of_df).

Let’s regress reading scores on student expenditure.

lm(read ~ expenditure, data = CASchools)
## 
## Call:
## lm(formula = read ~ expenditure, data = CASchools)
## 
## Coefficients:
## (Intercept)  expenditure  
##   6.182e+02    6.912e-03

The output from lm() gives us an intercept coefficient, \(\hat{\beta}_0\), and a slope coefficient, \(\hat{\beta}_1\). Note that R prints them in scientific notation here: 6.182e+02 means 618.2 and 6.912e-03 means 0.006912.

Let’s run another regression. Regress math scores on student expenditure.

lm(math ~ expenditure, data = CASchools)
## 
## Call:
## lm(formula = math ~ expenditure, data = CASchools)
## 
## Coefficients:
## (Intercept)  expenditure  
##   6.290e+02    4.585e-03

On your own, try to code a regression for \(Math_i = \beta_0 + \beta_1 Income_i + u_i\).

lm(math ~ income,  data = CASchools)
## 
## Call:
## lm(formula = math ~ income, data = CASchools)
## 
## Coefficients:
## (Intercept)       income  
##     625.539        1.815

Making tables

Now that we know how to run a regression, let’s talk about how to look at the output. The output above wasn’t super informative…

Using summary()

The first option is to use the summary() function in R. Save your regression as an object so that you can reuse that object later, then pass it to summary().

reg1 <- lm(math ~ income, data = CASchools)
summary(reg1)
## 
## Call:
## lm(formula = math ~ income, data = CASchools)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -39.045  -8.997   0.308   8.416  34.246 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 625.53948    1.53627  407.18   <2e-16 ***
## income        1.81523    0.09073   20.01   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 13.42 on 418 degrees of freedom
## Multiple R-squared:  0.4892, Adjusted R-squared:  0.4879 
## F-statistic: 400.3 on 1 and 418 DF,  p-value: < 2.2e-16

We have the intercept coefficient, \(\hat{\beta}_0 = 625\), and the slope coefficient \(\hat{\beta}_1 = 1.8\). summary() also gives us other information that we didn’t have before like the standard error, the t-score, the p-value, and the \(R^2\). The stars on the p-value tell us if the coefficient is statistically significant (more on this after the midterm).

Using tidy()

Another way to make nice regression output is to use the tidy() function from the broom package. To use this, you must have loaded the broom package in p_load(). The process is similar:

# since we have already created reg1 as an object, we can just call it without having to redo the regression
tidy(reg1)
## # A tibble: 2 × 5
##   term        estimate std.error statistic  p.value
##   <chr>          <dbl>     <dbl>     <dbl>    <dbl>
## 1 (Intercept)   626.      1.54       407.  0       
## 2 income          1.82    0.0907      20.0 5.99e-63

tidy() puts the information from summary() into a much nicer looking table.

Stargazer

By far, the most powerful tool for making amazing tables in R is the stargazer package. I would encourage you to try this out if you are feeling up for it! There is a very helpful cheatsheet linked here.

First, let’s try to make a simple table with stargazer() using our reg1 object.

stargazer(reg1)
## 
## % Table created by stargazer v.5.2.3 by Marek Hlavac, Social Policy Institute. E-mail: marek.hlavac at gmail.com
## % Date and time: Tue, Apr 08, 2025 - 14:50:12
## \begin{table}[!htbp] \centering 
##   \caption{} 
##   \label{} 
## \begin{tabular}{@{\extracolsep{5pt}}lc} 
## \\[-1.8ex]\hline 
## \hline \\[-1.8ex] 
##  & \multicolumn{1}{c}{\textit{Dependent variable:}} \\ 
## \cline{2-2} 
## \\[-1.8ex] & math \\ 
## \hline \\[-1.8ex] 
##  income & 1.815$^{***}$ \\ 
##   & (0.091) \\ 
##   & \\ 
##  Constant & 625.539$^{***}$ \\ 
##   & (1.536) \\ 
##   & \\ 
## \hline \\[-1.8ex] 
## Observations & 420 \\ 
## R$^{2}$ & 0.489 \\ 
## Adjusted R$^{2}$ & 0.488 \\ 
## Residual Std. Error & 13.420 (df = 418) \\ 
## F Statistic & 400.257$^{***}$ (df = 1; 418) \\ 
## \hline 
## \hline \\[-1.8ex] 
## \textit{Note:}  & \multicolumn{1}{r}{$^{*}$p$<$0.1; $^{**}$p$<$0.05; $^{***}$p$<$0.01} \\ 
## \end{tabular} 
## \end{table}

Huh, weird output, right? stargazer defaults to LaTeX output. We can change the type of output we want.

1: As text (use this for your problem set)

stargazer(reg1, type = "text")
## 
## ===============================================
##                         Dependent variable:    
##                     ---------------------------
##                                math            
## -----------------------------------------------
## income                       1.815***          
##                               (0.091)          
##                                                
## Constant                    625.539***         
##                               (1.536)          
##                                                
## -----------------------------------------------
## Observations                    420            
## R2                             0.489           
## Adjusted R2                    0.488           
## Residual Std. Error      13.420 (df = 418)     
## F Statistic          400.257*** (df = 1; 418)  
## ===============================================
## Note:               *p<0.1; **p<0.05; ***p<0.01

Stargazer can also do html output and MS word output (although word output is not recommended).
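
For example, here is a sketch that writes the table to an html file (the file name is just an example):

stargazer(reg1, type = "html", out = "reg1.html") # out = saves the table to a file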

Stargazer gave us some stats that we don’t really care about… let’s get rid of them…

stargazer(reg1, keep.stat = c("rsq", "n"), type = "text")
## 
## ========================================
##                  Dependent variable:    
##              ---------------------------
##                         math            
## ----------------------------------------
## income                1.815***          
##                        (0.091)          
##                                         
## Constant             625.539***         
##                        (1.536)          
##                                         
## ----------------------------------------
## Observations             420            
## R2                      0.489           
## ========================================
## Note:        *p<0.1; **p<0.05; ***p<0.01

Looks great!

Unlike the tables we made using summary() and tidy(), stargazer can combine multiple regressions into one table!

# do another regression and save it as an object
reg2 <- lm(read ~ income, data = CASchools)
stargazer(reg1, reg2, keep.stat = c("rsq", "n"), type = "text")
## 
## =========================================
##                  Dependent variable:     
##              ----------------------------
##                   math          read     
##                   (1)            (2)     
## -----------------------------------------
## income          1.815***      1.942***   
##                 (0.091)        (0.097)   
##                                          
## Constant       625.539***    625.228***  
##                 (1.536)        (1.651)   
##                                          
## -----------------------------------------
## Observations      420            420     
## R2               0.489          0.487    
## =========================================
## Note:         *p<0.1; **p<0.05; ***p<0.01

The variables in the regressions don’t even have to be the same!

# do another regression and save it as an object
reg3 <- lm(read ~ expenditure, data = CASchools)
stargazer(reg1, reg2, reg3, keep.stat = c("rsq", "n"), type = "text")
## 
## =============================================
##                    Dependent variable:       
##              --------------------------------
##                 math            read         
##                 (1)        (2)        (3)    
## ---------------------------------------------
## income        1.815***   1.942***            
##               (0.091)    (0.097)             
##                                              
## expenditure                         0.007*** 
##                                     (0.002)  
##                                              
## Constant     625.539*** 625.228*** 618.249***
##               (1.536)    (1.651)    (8.101)  
##                                              
## ---------------------------------------------
## Observations    420        420        420    
## R2             0.489      0.487      0.047   
## =============================================
## Note:             *p<0.1; **p<0.05; ***p<0.01

We can get really fancy with stargazer(). For more info, see the cheatsheet on Canvas.
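
As a taste, here is a sketch that relabels the variables using stargazer’s covariate.labels and dep.var.labels arguments:

stargazer(reg1, reg2,
          covariate.labels = c("District income"),
          dep.var.labels = c("Math score", "Reading score"),
          keep.stat = c("rsq", "n"), type = "text")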

Visualizing your regression

Last, we can use ggplot and fit a regression line through our data. Let’s start by making a scatterplot with math scores on the y axis and income on the x axis.

ggplot(data = CASchools, aes(x = income, y = math)) + geom_point()

To add a regression line, we use the stat_smooth() function:

# method = "lm" tells R that we are using OLS. se = FALSE removes the standard error bars.
ggplot(data = CASchools, aes(x = income, y = math)) + geom_point() + stat_smooth(method = "lm", se = FALSE)
## `geom_smooth()` using formula = 'y ~ x'

Multiple OLS Regression

Setup

library(pacman)
p_load(tidyverse, stargazer, broom)

wages <- read_csv("wages.csv")

Let’s look at the data frame:

head(wages,10)
## # A tibble: 10 × 34
##       id nearc2 nearc4  educ   age fatheduc motheduc weight momdad14 sinmom14
##    <dbl>  <dbl>  <dbl> <dbl> <dbl>    <dbl>    <dbl>  <dbl>    <dbl>    <dbl>
##  1     3      0      0    12    27        8        8 380166        1        0
##  2     4      0      0    12    34       14       12 367470        1        0
##  3     5      1      1    11    27       11       12 380166        1        0
##  4     6      1      1    12    34        8        7 367470        1        0
##  5     7      1      1    12    26        9       12 380166        1        0
##  6     8      1      1    18    33       14       14 367470        1        0
##  7     9      1      1    14    29       14       14 496635        1        0
##  8    10      1      1    12    28       12       12 367772        1        0
##  9    11      1      1    12    29       12       12 480445        1        0
## 10    12      1      1     9    28       11       12 380166        1        0
## # ℹ 24 more variables: step14 <dbl>, reg661 <dbl>, reg662 <dbl>, reg663 <dbl>,
## #   reg664 <dbl>, reg665 <dbl>, reg666 <dbl>, reg667 <dbl>, reg668 <dbl>,
## #   reg669 <dbl>, south66 <dbl>, black <dbl>, smsa <dbl>, south <dbl>,
## #   smsa66 <dbl>, wage <dbl>, enroll <dbl>, KWW <dbl>, IQ <dbl>, married <dbl>,
## #   libcrd14 <dbl>, exper <dbl>, lwage <dbl>, expersq <dbl>

Multiple Regression in R

We know how to do simple OLS; doing multiple OLS is just one easy step further! Let’s run a regression of log wages on education and experience:

reg1 <- lm(lwage ~ educ + exper, data = wages)
summary(reg1)
## 
## Call:
## lm(formula = lwage ~ educ + exper, data = wages)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.57440 -0.22238  0.02188  0.25802  1.17969 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 4.834500   0.092866   52.06   <2e-16 ***
## educ        0.080330   0.005259   15.27   <2e-16 ***
## exper       0.046395   0.003306   14.03   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3877 on 1601 degrees of freedom
## Multiple R-squared:  0.1458, Adjusted R-squared:  0.1447 
## F-statistic: 136.6 on 2 and 1601 DF,  p-value: < 2.2e-16

R gives us the t-stat and p-value so we can easily conduct hypothesis tests. Suppose we want to test \[H_0\text{: } \beta_1 = 0\] vs \[H_1\text{: } \beta_1 \neq 0\] at the 5% level. We can look at the p-value and see that \(p < \alpha\) so we can reject the null hypothesis at 5%.
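
If you want to pull that p-value out directly, here is a quick sketch using tidy() from the broom package we loaded earlier:

# extract the row for educ and grab its p-value
tidy(reg1) %>% filter(term == "educ") %>% pull(p.value)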

Like last lab, we can use stargazer to summarize our regression results. This time, let’s keep the adjusted \(R^2\) in our table.

stargazer(reg1, keep.stat = c("rsq", "adj.rsq", "n"), type = "text")
## 
## ========================================
##                  Dependent variable:    
##              ---------------------------
##                         lwage           
## ----------------------------------------
## educ                  0.080***          
##                        (0.005)          
##                                         
## exper                 0.046***          
##                        (0.003)          
##                                         
## Constant              4.835***          
##                        (0.093)          
##                                         
## ----------------------------------------
## Observations            1,604           
## R2                      0.146           
## Adjusted R2             0.145           
## ========================================
## Note:        *p<0.1; **p<0.05; ***p<0.01

Confidence Intervals

How do we get confidence intervals in R? Let’s do a 95% confidence interval.

Option 1: Construct by hand. We have the standard error and we have \(\hat{\beta}_1\). We need to get the critical value of t.

# We can use the qt() function to get the critical value of t. First put in your significance level. 
# for a two-tail test, make sure to divide by 2
# then the degrees of freedom: we have 1604 - 3 = 1601 degrees of freedom
qt(.05/2, 1601)
## [1] -1.961447

The critical value of t is 1.96 (in absolute value). Now we can plug that into our confidence interval formula: \[ .080330 - 1.96(.005259) \leq \beta_1 \leq .080330 + 1.96(.005259) \] Which simplifies to: \[ .0700 \leq \beta_1 \leq .0906\]
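
Here is the same calculation as a sketch in R (the coefficient and standard error are copied from the summary output above):

beta_hat <- 0.080330                  # coefficient on educ
se_hat   <- 0.005259                  # its standard error
t_crit   <- qt(1 - 0.05 / 2, df = 1601)
beta_hat + c(-1, 1) * t_crit * se_hat # lower and upper bounds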

Option 2: Let R tell us. The tidy() function will give us the confidence interval if we ask for it!

tidy(reg1, conf.int = T)
## # A tibble: 3 × 7
##   term        estimate std.error statistic  p.value conf.low conf.high
##   <chr>          <dbl>     <dbl>     <dbl>    <dbl>    <dbl>     <dbl>
## 1 (Intercept)   4.83     0.0929       52.1 0          4.65      5.02  
## 2 educ          0.0803   0.00526      15.3 2.86e-49   0.0700    0.0906
## 3 exper         0.0464   0.00331      14.0 2.80e-42   0.0399    0.0529

The confidence interval matches our by-hand solution!

But this is for a 95% confidence interval. How do we get tidy() to give us a different confidence level? What if we want a 90% confidence interval?

tidy(reg1, conf.int = T, conf.level = .9)
## # A tibble: 3 × 7
##   term        estimate std.error statistic  p.value conf.low conf.high
##   <chr>          <dbl>     <dbl>     <dbl>    <dbl>    <dbl>     <dbl>
## 1 (Intercept)   4.83     0.0929       52.1 0          4.68      4.99  
## 2 educ          0.0803   0.00526      15.3 2.86e-49   0.0717    0.0890
## 3 exper         0.0464   0.00331      14.0 2.80e-42   0.0410    0.0518

Omitted Variable Bias

reg1 <- lm(lwage ~ educ, data = wages)
reg2 <- lm(lwage ~ educ + exper, data = wages)
stargazer(reg1, reg2, keep.stat = c("rsq", "adj.rsq", "n"), type = "text")
## 
## =========================================
##                  Dependent variable:     
##              ----------------------------
##                         lwage            
##                   (1)            (2)     
## -----------------------------------------
## educ            0.037***      0.080***   
##                 (0.005)        (0.005)   
##                                          
## exper                         0.046***   
##                                (0.003)   
##                                          
## Constant        5.816***      4.835***   
##                 (0.065)        (0.093)   
##                                          
## -----------------------------------------
## Observations     1,604          1,604    
## R2               0.041          0.146    
## Adjusted R2      0.040          0.145    
## =========================================
## Note:         *p<0.1; **p<0.05; ***p<0.01

What is the sign of the OVB? Use the formula: \[ Bias = \beta_2 \frac{cov(X_1, X_2)}{var(X_1)} \] \(\beta_2 = .046\), which is positive. We can get the covariance by using the cov() function.

cov(wages$educ, wages$exper)
## [1] -4.759561

There is a negative bias from omitting experience.
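
To see this in R, here is a quick sketch that plugs the pieces into the bias formula (using \(\beta_2 = .046\) from column 2):

# beta_2 * cov(X1, X2) / var(X1): the sign is negative
0.046 * cov(wages$educ, wages$exper) / var(wages$educ)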

Let’s do another regression.

reg3 <- lm(lwage ~ educ + exper + IQ, data = wages) # positive bias from omitting IQ (imperfect proxy for intelligence)
stargazer(reg1, reg2, reg3, keep.stat = c("rsq", "adj.rsq", "n"), type = "text")
## 
## ==========================================
##                   Dependent variable:     
##              -----------------------------
##                          lwage            
##                 (1)       (2)       (3)   
## ------------------------------------------
## educ         0.037***  0.080***  0.069*** 
##               (0.005)   (0.005)   (0.006) 
##                                           
## exper                  0.046***  0.048*** 
##                         (0.003)   (0.003) 
##                                           
## IQ                               0.004*** 
##                                   (0.001) 
##                                           
## Constant     5.816***  4.835***  4.587*** 
##               (0.065)   (0.093)   (0.104) 
##                                           
## ------------------------------------------
## Observations   1,604     1,604     1,604  
## R2             0.041     0.146     0.159  
## Adjusted R2    0.040     0.145     0.158  
## ==========================================
## Note:          *p<0.1; **p<0.05; ***p<0.01

How are IQ and education correlated? Check the OVB formula again. Here, \(\beta_3 = .004\) which is positive. We can check the covariance to get the sign of the bias:

cov(wages$educ, wages$IQ)
## [1] 16.96962

IQ and education are positively correlated and there is a positive bias from omitting IQ.

We can keep adding variables to our regression:

reg4 <- lm(lwage ~ educ + exper + IQ + KWW, data = wages) 
reg5 <- lm(lwage ~ educ + exper + IQ + KWW + fatheduc, data = wages) 
reg6 <- lm(lwage ~ educ + exper + IQ + KWW + fatheduc + motheduc, data = wages) 
stargazer(reg1, reg2, reg3, reg4, reg5, reg6, keep.stat = c("rsq", "adj.rsq", "n"), type = "text")
## 
## ==================================================================
##                               Dependent variable:                 
##              -----------------------------------------------------
##                                      lwage                        
##                (1)      (2)      (3)      (4)      (5)      (6)   
## ------------------------------------------------------------------
## educ         0.037*** 0.080*** 0.069*** 0.058*** 0.058*** 0.057***
##              (0.005)  (0.005)  (0.006)  (0.006)  (0.006)  (0.006) 
##                                                                   
## exper                 0.046*** 0.048*** 0.041*** 0.042*** 0.042***
##                       (0.003)  (0.003)  (0.004)  (0.004)  (0.004) 
##                                                                   
## IQ                             0.004*** 0.003*** 0.003*** 0.003***
##                                (0.001)  (0.001)  (0.001)  (0.001) 
##                                                                   
## KWW                                     0.006*** 0.006*** 0.006***
##                                         (0.002)  (0.002)  (0.002) 
##                                                                   
## fatheduc                                          0.003   -0.0003 
##                                                  (0.003)  (0.004) 
##                                                                   
## motheduc                                                   0.008* 
##                                                           (0.004) 
##                                                                   
## Constant     5.816*** 4.835*** 4.587*** 4.672*** 4.664*** 4.634***
##              (0.065)  (0.093)  (0.104)  (0.106)  (0.106)  (0.108) 
##                                                                   
## ------------------------------------------------------------------
## Observations  1,604    1,604    1,604    1,604    1,604    1,604  
## R2            0.041    0.146    0.159    0.167    0.168    0.170  
## Adjusted R2   0.040    0.145    0.158    0.165    0.165    0.167  
## ==================================================================
## Note:                                  *p<0.1; **p<0.05; ***p<0.01

Let’s look a little closer at the models with father’s education included. The coefficient is not statistically significant. Should we still include this variable in our regression?

YES! Remember:

  • If we exclude variables that should be in the model and they are correlated with the regressors that we include, then we will have omitted variable bias.
  • Even if they aren't correlated with the included variables, they help us tighten up our standard errors.
  • If theory or intuition says you should have a variable in the model, then it should be in there!

F-tests

Let’s suppose we want to test to see if the coefficients on father’s education and mother’s education are equal to zero at the 5% level. The null hypothesis is: \[H_0\text{: } \beta_5 = \beta_6 = 0\] The alternative hypothesis is: \[H_1\text{: either } \beta_5 \neq 0 \text{ or } \beta_6 \neq 0\]

To do the test, we need to identify our restricted model and our unrestricted model. reg4 is the restricted model and reg6 is the unrestricted.

Recall, the F-stat is equal to: \[F = \frac{(R^2_u - R^2_r)/q}{(1-R^2_u)/(n-k-1)}\]
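If we wanted to compute this by hand, a sketch (using reg4 as the restricted model and reg6 as the unrestricted model, as above) plugs the R-squared values into the formula:

# q = 2 restrictions; n = 1604; k = 6 regressors in the unrestricted model
r2_u <- summary(reg6)$r.squared  # unrestricted R^2
r2_r <- summary(reg4)$r.squared  # restricted R^2
F_stat <- ((r2_u - r2_r) / 2) / ((1 - r2_u) / (1604 - 6 - 1))
F_stat
pf(F_stat, df1 = 2, df2 = 1604 - 6 - 1, lower.tail = F)  # p-value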

The easy way is to have R do it for us. Load the car package to use the linearHypothesis() function:

p_load(car)
linearHypothesis(reg6, c("fatheduc=0", "motheduc=0"))
## Linear hypothesis test
## 
## Hypothesis:
## fatheduc = 0
## motheduc = 0
## 
## Model 1: restricted model
## Model 2: lwage ~ educ + exper + IQ + KWW + fatheduc + motheduc
## 
##   Res.Df    RSS Df Sum of Sq      F Pr(>F)
## 1   1599 234.57                           
## 2   1597 233.93  2   0.64137 2.1892 0.1123

Since the p-value (0.1123) is greater than 0.05, we fail to reject the null hypothesis at the 5% level: father's and mother's education are not jointly significant in this model.

Joining and Merging Datasets

Setup

library(pacman)
p_load(tidyverse, nycflights13)

Look at your data by viewing it or using the head() function. The data frame is called flights. This data frame contains all 336,776 flights that departed from New York City in 2013. The data comes from the US Bureau of Transportation Statistics.

head(flights)
## # A tibble: 6 × 19
##    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
## 1  2013     1     1      517            515         2      830            819
## 2  2013     1     1      533            529         4      850            830
## 3  2013     1     1      542            540         2      923            850
## 4  2013     1     1      544            545        -1     1004           1022
## 5  2013     1     1      554            600        -6      812            837
## 6  2013     1     1      554            558        -4      740            728
## # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## #   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## #   hour <dbl>, minute <dbl>, time_hour <dttm>

Review of Dplyr Functions

You have already learned how to use the tidyverse functions, but it may be useful to have some review. Recall, using functions from dplyr, we can manipulate our data frame. We can:

  • Pick observations by their values using filter().
  • Reorder the rows with arrange().
  • Pick variables by their names using select().
  • Create new variables with functions of existing variables with mutate().
  • Collapse many values down to a single summary using group_by() and summarise().

Recall, the steps to use these functions are the same:

  1. Start with the name of the data frame you want to save in your global environment.
  2. After the = or <-, put the name of the data frame you want to manipulate. In this case, flights.
  3. Then, we can use the %>% pipe operator to describe what we want to do with the data frame, using the variable names.
  4. The result is a new dataframe.

Today, we will review how to use filter, mutate, and group_by/summarize.

Filter

filter() allows you to subset observations based on their values. For example, we can select all flights in January with:

jan_flights <- flights %>% filter(month == 1)

jan_flights
## # A tibble: 27,004 × 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013     1     1      517            515         2      830            819
##  2  2013     1     1      533            529         4      850            830
##  3  2013     1     1      542            540         2      923            850
##  4  2013     1     1      544            545        -1     1004           1022
##  5  2013     1     1      554            600        -6      812            837
##  6  2013     1     1      554            558        -4      740            728
##  7  2013     1     1      555            600        -5      913            854
##  8  2013     1     1      557            600        -3      709            723
##  9  2013     1     1      557            600        -3      838            846
## 10  2013     1     1      558            600        -2      753            745
## # ℹ 26,994 more rows
## # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## #   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## #   hour <dbl>, minute <dbl>, time_hour <dttm>

Logical Operators

Boolean operators that work with dplyr: & is “and”, | is “or”, and ! is “not”.

The following code finds all flights that departed in November or December:

nov_dec <- flights %>% filter(month == 11 | month == 12)
nov_dec
## # A tibble: 55,403 × 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013    11     1        5           2359         6      352            345
##  2  2013    11     1       35           2250       105      123           2356
##  3  2013    11     1      455            500        -5      641            651
##  4  2013    11     1      539            545        -6      856            827
##  5  2013    11     1      542            545        -3      831            855
##  6  2013    11     1      549            600       -11      912            923
##  7  2013    11     1      550            600       -10      705            659
##  8  2013    11     1      554            600        -6      659            701
##  9  2013    11     1      554            600        -6      826            827
## 10  2013    11     1      554            600        -6      749            751
## # ℹ 55,393 more rows
## # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## #   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## #   hour <dbl>, minute <dbl>, time_hour <dttm>

You can also use the %in% operator to achieve the same thing:

nov_dec <- flights %>% filter(month %in% c(11,12))
nov_dec
## # A tibble: 55,403 × 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013    11     1        5           2359         6      352            345
##  2  2013    11     1       35           2250       105      123           2356
##  3  2013    11     1      455            500        -5      641            651
##  4  2013    11     1      539            545        -6      856            827
##  5  2013    11     1      542            545        -3      831            855
##  6  2013    11     1      549            600       -11      912            923
##  7  2013    11     1      550            600       -10      705            659
##  8  2013    11     1      554            600        -6      659            701
##  9  2013    11     1      554            600        -6      826            827
## 10  2013    11     1      554            600        -6      749            751
## # ℹ 55,393 more rows
## # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## #   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## #   hour <dbl>, minute <dbl>, time_hour <dttm>

Find flights that weren’t delayed (on arrival or departure) by more than two hours:

nondelays <- flights %>% filter(arr_delay <= 120 & dep_delay <= 120)
nondelays
## # A tibble: 316,050 × 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013     1     1      517            515         2      830            819
##  2  2013     1     1      533            529         4      850            830
##  3  2013     1     1      542            540         2      923            850
##  4  2013     1     1      544            545        -1     1004           1022
##  5  2013     1     1      554            600        -6      812            837
##  6  2013     1     1      554            558        -4      740            728
##  7  2013     1     1      555            600        -5      913            854
##  8  2013     1     1      557            600        -3      709            723
##  9  2013     1     1      557            600        -3      838            846
## 10  2013     1     1      558            600        -2      753            745
## # ℹ 316,040 more rows
## # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## #   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## #   hour <dbl>, minute <dbl>, time_hour <dttm>

Mutate

mutate() allows you to create new variables. Make sure to save the resulting data frame so that it is in your global environment.

Here, we create two variables, gain and speed.

new_df <- flights %>% mutate(
  gain = dep_delay - arr_delay,
  speed = distance / air_time * 60
)

new_df
## # A tibble: 336,776 × 21
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013     1     1      517            515         2      830            819
##  2  2013     1     1      533            529         4      850            830
##  3  2013     1     1      542            540         2      923            850
##  4  2013     1     1      544            545        -1     1004           1022
##  5  2013     1     1      554            600        -6      812            837
##  6  2013     1     1      554            558        -4      740            728
##  7  2013     1     1      555            600        -5      913            854
##  8  2013     1     1      557            600        -3      709            723
##  9  2013     1     1      557            600        -3      838            846
## 10  2013     1     1      558            600        -2      753            745
## # ℹ 336,766 more rows
## # ℹ 13 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## #   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## #   hour <dbl>, minute <dbl>, time_hour <dttm>, gain <dbl>, speed <dbl>

Group_by and Summarize

summarize() collapses a data frame into a single row. It is most often used in conjunction with group_by().

Imagine that we want to explore the relationship between the distance and average delay for each location. Using what you know about dplyr, you might write code like this:

delays <- flights %>% 
  group_by(dest) %>% 
  summarise(
    count = n(),
    dist = mean(distance, na.rm = TRUE),
    delay = mean(arr_delay, na.rm = TRUE)
  ) 

delays
## # A tibble: 105 × 4
##    dest  count  dist delay
##    <chr> <int> <dbl> <dbl>
##  1 ABQ     254 1826   4.38
##  2 ACK     265  199   4.85
##  3 ALB     439  143  14.4 
##  4 ANC       8 3370  -2.5 
##  5 ATL   17215  757. 11.3 
##  6 AUS    2439 1514.  6.02
##  7 AVL     275  584.  8.00
##  8 BDL     443  116   7.05
##  9 BGR     375  378   8.03
## 10 BHM     297  866. 16.9 
## # ℹ 95 more rows

Merging Datasets

Sometimes, the data we want to use is contained in two separate data frames. We want a way to merge those data frames together.

The different merges we can do (a toy sketch follows below):

  • inner_join(x, y) only includes observations that have matching x and y key values. Rows of x can be dropped/filtered.
  • left_join(x, y) includes all observations in x, regardless of whether they match or not. This is the most commonly used join because it ensures that you don’t lose observations from your primary table.
  • right_join(x, y) includes all observations in y. It’s equivalent to left_join(y, x), but the columns will be ordered differently.
  • full_join(x, y) includes all observations from x and y.
  • merge(x, y) or inner_join(x, y), with no key specified, uses all variables with common names across the two tables.
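Here is a toy sketch (made-up data frames with hypothetical names) showing how the joins differ:

df_x <- tibble(key = c(1, 2, 3), val_x = c("a", "b", "c"))
df_y <- tibble(key = c(1, 2, 4), val_y = c("d", "e", "f"))

inner_join(df_x, df_y, by = "key")  # keeps keys 1 and 2 only
left_join(df_x, df_y, by = "key")   # keeps keys 1, 2, 3; val_y is NA for key 3
full_join(df_x, df_y, by = "key")   # keeps keys 1, 2, 3, and 4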

The nycflights13 package also contains a data frame called weather. Let’s look at it:

head(weather)
## # A tibble: 6 × 15
##   origin  year month   day  hour  temp  dewp humid wind_dir wind_speed wind_gust
##   <chr>  <int> <int> <int> <int> <dbl> <dbl> <dbl>    <dbl>      <dbl>     <dbl>
## 1 EWR     2013     1     1     1  39.0  26.1  59.4      270      10.4         NA
## 2 EWR     2013     1     1     2  39.0  27.0  61.6      250       8.06        NA
## 3 EWR     2013     1     1     3  39.0  28.0  64.4      240      11.5         NA
## 4 EWR     2013     1     1     4  39.9  28.0  62.2      250      12.7         NA
## 5 EWR     2013     1     1     5  39.0  28.0  64.4      260      12.7         NA
## 6 EWR     2013     1     1     6  37.9  28.0  67.2      240      11.5         NA
## # ℹ 4 more variables: precip <dbl>, pressure <dbl>, visib <dbl>,
## #   time_hour <dttm>

Notice that there are common variable names between the flights and weather data frame. We can join them together.

flights2 <- left_join(x = flights, y = weather) 
## Joining with `by = join_by(year, month, day, origin, hour, time_hour)`
flights2
## # A tibble: 336,776 × 28
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013     1     1      517            515         2      830            819
##  2  2013     1     1      533            529         4      850            830
##  3  2013     1     1      542            540         2      923            850
##  4  2013     1     1      544            545        -1     1004           1022
##  5  2013     1     1      554            600        -6      812            837
##  6  2013     1     1      554            558        -4      740            728
##  7  2013     1     1      555            600        -5      913            854
##  8  2013     1     1      557            600        -3      709            723
##  9  2013     1     1      557            600        -3      838            846
## 10  2013     1     1      558            600        -2      753            745
## # ℹ 336,766 more rows
## # ℹ 20 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## #   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## #   hour <dbl>, minute <dbl>, time_hour <dttm>, temp <dbl>, dewp <dbl>,
## #   humid <dbl>, wind_dir <dbl>, wind_speed <dbl>, wind_gust <dbl>,
## #   precip <dbl>, pressure <dbl>, visib <dbl>

Notice the message Joining with `by = join_by(year, month, day, origin, hour, time_hour)`, which indicates the variables used for joining. If you want to set specific variables to join with, you can use by = c() in the argument of the join, like so:

flights2 <- left_join(x = flights, y = weather, by = c("year", "month", "day", "hour", "origin")) 
flights2
## # A tibble: 336,776 × 29
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013     1     1      517            515         2      830            819
##  2  2013     1     1      533            529         4      850            830
##  3  2013     1     1      542            540         2      923            850
##  4  2013     1     1      544            545        -1     1004           1022
##  5  2013     1     1      554            600        -6      812            837
##  6  2013     1     1      554            558        -4      740            728
##  7  2013     1     1      555            600        -5      913            854
##  8  2013     1     1      557            600        -3      709            723
##  9  2013     1     1      557            600        -3      838            846
## 10  2013     1     1      558            600        -2      753            745
## # ℹ 336,766 more rows
## # ℹ 21 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## #   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## #   hour <dbl>, minute <dbl>, time_hour.x <dttm>, temp <dbl>, dewp <dbl>,
## #   humid <dbl>, wind_dir <dbl>, wind_speed <dbl>, wind_gust <dbl>,
## #   precip <dbl>, pressure <dbl>, visib <dbl>, time_hour.y <dttm>

Note that these two joins are nearly equivalent. Because we left time_hour out of by =, the second join keeps both copies of that column, renamed time_hour.x and time_hour.y, which is why it has 29 columns instead of 28.

Let’s try the other merges:

First, using right_join():

flights3 <- right_join(x = flights, y = weather)
## Joining with `by = join_by(year, month, day, origin, hour, time_hour)`
flights3
## # A tibble: 341,957 × 28
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013     1     1      517            515         2      830            819
##  2  2013     1     1      533            529         4      850            830
##  3  2013     1     1      542            540         2      923            850
##  4  2013     1     1      544            545        -1     1004           1022
##  5  2013     1     1      554            600        -6      812            837
##  6  2013     1     1      554            558        -4      740            728
##  7  2013     1     1      555            600        -5      913            854
##  8  2013     1     1      557            600        -3      709            723
##  9  2013     1     1      557            600        -3      838            846
## 10  2013     1     1      558            600        -2      753            745
## # ℹ 341,947 more rows
## # ℹ 20 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## #   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## #   hour <dbl>, minute <dbl>, time_hour <dttm>, temp <dbl>, dewp <dbl>,
## #   humid <dbl>, wind_dir <dbl>, wind_speed <dbl>, wind_gust <dbl>,
## #   precip <dbl>, pressure <dbl>, visib <dbl>

Using inner_join() or merge():

flights4 <- inner_join(x = flights, y = weather)
## Joining with `by = join_by(year, month, day, origin, hour, time_hour)`
flights4
## # A tibble: 335,220 × 28
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013     1     1      517            515         2      830            819
##  2  2013     1     1      533            529         4      850            830
##  3  2013     1     1      542            540         2      923            850
##  4  2013     1     1      544            545        -1     1004           1022
##  5  2013     1     1      554            600        -6      812            837
##  6  2013     1     1      554            558        -4      740            728
##  7  2013     1     1      555            600        -5      913            854
##  8  2013     1     1      557            600        -3      709            723
##  9  2013     1     1      557            600        -3      838            846
## 10  2013     1     1      558            600        -2      753            745
## # ℹ 335,210 more rows
## # ℹ 20 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## #   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## #   hour <dbl>, minute <dbl>, time_hour <dttm>, temp <dbl>, dewp <dbl>,
## #   humid <dbl>, wind_dir <dbl>, wind_speed <dbl>, wind_gust <dbl>,
## #   precip <dbl>, pressure <dbl>, visib <dbl>

Using full_join():

flights5 <- full_join(x = flights, y = weather)
## Joining with `by = join_by(year, month, day, origin, hour, time_hour)`
flights5
## # A tibble: 343,513 × 28
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013     1     1      517            515         2      830            819
##  2  2013     1     1      533            529         4      850            830
##  3  2013     1     1      542            540         2      923            850
##  4  2013     1     1      544            545        -1     1004           1022
##  5  2013     1     1      554            600        -6      812            837
##  6  2013     1     1      554            558        -4      740            728
##  7  2013     1     1      555            600        -5      913            854
##  8  2013     1     1      557            600        -3      709            723
##  9  2013     1     1      557            600        -3      838            846
## 10  2013     1     1      558            600        -2      753            745
## # ℹ 343,503 more rows
## # ℹ 20 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## #   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## #   hour <dbl>, minute <dbl>, time_hour <dttm>, temp <dbl>, dewp <dbl>,
## #   humid <dbl>, wind_dir <dbl>, wind_speed <dbl>, wind_gust <dbl>,
## #   precip <dbl>, pressure <dbl>, visib <dbl>

Resources

See chapter 19 of R for Data Science 2e (linked on canvas) for a more in-depth tutorial on joining datasets.

Interaction terms, dummy variables, and non-linear regressions

#load the tidyverse and carData packages if they are not already loaded
library(pacman)
p_load(tidyverse, carData)

Our dataset today is Salaries, containing the salary for professors of different ranks (assistant professor, associate professor, and professor), from different disciplines (two levels: A and B), and binary gender data (male, female), among other variables.

head(Salaries)
##        rank discipline yrs.since.phd yrs.service  sex salary
## 1      Prof          B            19          18 Male 139750
## 2      Prof          B            20          16 Male 173200
## 3  AsstProf          B             4           3 Male  79750
## 4      Prof          B            45          39 Male 115000
## 5      Prof          B            40          41 Male 141500
## 6 AssocProf          B             6           6 Male  97000

Categorical Variables

Notice that we have categorical variables in our dataset. If we want to use these in our regressions, we will need to create dummy variables. There are several ways to do this:

Option 1: Create a new variable using mutate

Salaries1 <- Salaries %>% mutate(sex = ifelse(sex == "Male", 0, 1))
head(Salaries1)
##        rank discipline yrs.since.phd yrs.service sex salary
## 1      Prof          B            19          18   0 139750
## 2      Prof          B            20          16   0 173200
## 3  AsstProf          B             4           3   0  79750
## 4      Prof          B            45          39   0 115000
## 5      Prof          B            40          41   0 141500
## 6 AssocProf          B             6           6   0  97000

Option 2: Use dummy_cols()

p_load(fastDummies) #requires this package to be loaded
Salaries2 <- Salaries %>% dummy_cols()
head(Salaries2)
##        rank discipline yrs.since.phd yrs.service  sex salary rank_AsstProf
## 1      Prof          B            19          18 Male 139750             0
## 2      Prof          B            20          16 Male 173200             0
## 3  AsstProf          B             4           3 Male  79750             1
## 4      Prof          B            45          39 Male 115000             0
## 5      Prof          B            40          41 Male 141500             0
## 6 AssocProf          B             6           6 Male  97000             0
##   rank_AssocProf rank_Prof discipline_A discipline_B sex_Female sex_Male
## 1              0         1            0            1          0        1
## 2              0         1            0            1          0        1
## 3              0         0            0            1          0        1
## 4              0         1            0            1          0        1
## 5              0         1            0            1          0        1
## 6              1         0            0            1          0        1

Option 3: Just declare the variable as a factor in your regression equation

reg1 <- lm(salary ~ as.factor(sex), data = Salaries)
summary(reg1)
## 
## Call:
## lm(formula = salary ~ as.factor(sex), data = Salaries)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -57290 -23502  -6828  19710 116455 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          101002       4809  21.001  < 2e-16 ***
## as.factor(sex)Male    14088       5065   2.782  0.00567 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 30030 on 395 degrees of freedom
## Multiple R-squared:  0.01921,    Adjusted R-squared:  0.01673 
## F-statistic: 7.738 on 1 and 395 DF,  p-value: 0.005667
reg2 <- lm(salary ~ sex, data = Salaries1)
summary(reg2)
## 
## Call:
## lm(formula = salary ~ sex, data = Salaries1)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -57290 -23502  -6828  19710 116455 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   115090       1587  72.503  < 2e-16 ***
## sex           -14088       5065  -2.782  0.00567 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 30030 on 395 degrees of freedom
## Multiple R-squared:  0.01921,    Adjusted R-squared:  0.01673 
## F-statistic: 7.738 on 1 and 395 DF,  p-value: 0.005667
# What is the difference between these two regressions?
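(The answer to the question above: in reg1, Female is the omitted baseline because factor levels default to alphabetical order, so the Male coefficient is +14,088; in reg2, the dummy we built equals 1 for Female, so the same gap shows up with the opposite sign.) If you want a different baseline for a factor, relevel() changes the reference category. A quick sketch:

# make Male the omitted baseline instead of Female
Salaries_relevel <- Salaries %>% mutate(sex = relevel(as.factor(sex), ref = "Male"))
reg1b <- lm(salary ~ sex, data = Salaries_relevel)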

What if we have more than two categories? The procedure is the same.

reg3 <- lm(salary ~ as.factor(rank), data = Salaries)
summary(reg3)
## 
## Call:
## lm(formula = salary ~ as.factor(rank), data = Salaries)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -68972 -16376  -1580  11755 104773 
## 
## Coefficients:
##                          Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                 80776       2887  27.976  < 2e-16 ***
## as.factor(rank)AssocProf    13100       4131   3.171  0.00164 ** 
## as.factor(rank)Prof         45996       3230  14.238  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 23630 on 394 degrees of freedom
## Multiple R-squared:  0.3943, Adjusted R-squared:  0.3912 
## F-statistic: 128.2 on 2 and 394 DF,  p-value: < 2.2e-16

Adding Interaction Terms

Let’s add an interaction term of two categorical variables: sex and discipline.

int_reg1 <- lm(salary ~ as.factor(sex) + as.factor(discipline) + as.factor(sex):as.factor(discipline), data = Salaries)

summary(int_reg1)
## 
## Call:
## lm(formula = salary ~ as.factor(sex) + as.factor(discipline) + 
##     as.factor(sex):as.factor(discipline), data = Salaries)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -52900 -23149  -5419  20505 112785 
## 
## Coefficients:
##                                           Estimate Std. Error t value Pr(>|t|)
## (Intercept)                                  89065       6992  12.739  < 2e-16
## as.factor(sex)Male                           21635       7368   2.937  0.00351
## as.factor(discipline)B                       22170       9528   2.327  0.02048
## as.factor(sex)Male:as.factor(discipline)B   -14109      10034  -1.406  0.16049
##                                              
## (Intercept)                               ***
## as.factor(sex)Male                        ** 
## as.factor(discipline)B                    *  
## as.factor(sex)Male:as.factor(discipline)B    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 29660 on 393 degrees of freedom
## Multiple R-squared:  0.0482, Adjusted R-squared:  0.04094 
## F-statistic: 6.634 on 3 and 393 DF,  p-value: 0.0002217
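A note on syntax: R's formula interface has a * shorthand that expands to both main effects plus their interaction, so this is an equivalent way to write the same model:

# equivalent to as.factor(sex) + as.factor(discipline) + their interaction
int_reg1b <- lm(salary ~ as.factor(sex) * as.factor(discipline), data = Salaries)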

Now, let’s use an interaction for a categorical variable and a continuous variable.

int_reg2 <- lm(salary ~ as.factor(sex) + yrs.since.phd + as.factor(sex):yrs.since.phd, data = Salaries)

summary(int_reg2)
## 
## Call:
## lm(formula = salary ~ as.factor(sex) + yrs.since.phd + as.factor(sex):yrs.since.phd, 
##     data = Salaries)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -83012 -19442  -2988  15059 102652 
## 
## Coefficients:
##                                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                       73840.8     8696.7   8.491 4.27e-16 ***
## as.factor(sex)Male                20209.6     9179.2   2.202 0.028269 *  
## yrs.since.phd                      1644.9      454.6   3.618 0.000335 ***
## as.factor(sex)Male:yrs.since.phd   -728.0      468.0  -1.555 0.120665    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 27420 on 393 degrees of freedom
## Multiple R-squared:  0.1867, Adjusted R-squared:  0.1805 
## F-statistic: 30.07 on 3 and 393 DF,  p-value: < 2.2e-16

Nonlinear Regression

Load the gapminder package to get the dataset:

p_load(gapminder)
head(gapminder)
## # A tibble: 6 × 6
##   country     continent  year lifeExp      pop gdpPercap
##   <fct>       <fct>     <int>   <dbl>    <int>     <dbl>
## 1 Afghanistan Asia       1952    28.8  8425333      779.
## 2 Afghanistan Asia       1957    30.3  9240934      821.
## 3 Afghanistan Asia       1962    32.0 10267083      853.
## 4 Afghanistan Asia       1967    34.0 11537966      836.
## 5 Afghanistan Asia       1972    36.1 13079460      740.
## 6 Afghanistan Asia       1977    38.4 14880372      786.

Let’s make a quick ggplot and look at the shape of our data:

ggplot(data=gapminder, aes(x=gdpPercap, y=lifeExp)) + geom_point() + stat_smooth(method =  "lm", se = F)
## `geom_smooth()` using formula = 'y ~ x'

To transform our data, we can create a new variable for the log(gdpPercap) using mutate or we can take the log in our lm() function.

gapminder_log <- gapminder %>% mutate(log_gdp = log(gdpPercap)) 
head(gapminder_log)
## # A tibble: 6 × 7
##   country     continent  year lifeExp      pop gdpPercap log_gdp
##   <fct>       <fct>     <int>   <dbl>    <int>     <dbl>   <dbl>
## 1 Afghanistan Asia       1952    28.8  8425333      779.    6.66
## 2 Afghanistan Asia       1957    30.3  9240934      821.    6.71
## 3 Afghanistan Asia       1962    32.0 10267083      853.    6.75
## 4 Afghanistan Asia       1967    34.0 11537966      836.    6.73
## 5 Afghanistan Asia       1972    36.1 13079460      740.    6.61
## 6 Afghanistan Asia       1977    38.4 14880372      786.    6.67
ggplot(data=gapminder_log, aes(x=log_gdp, y=lifeExp)) + geom_point() + stat_smooth(method =  "lm", se = F)
## `geom_smooth()` using formula = 'y ~ x'

reg_linear_log <- lm(lifeExp ~ log(gdpPercap), data = gapminder)
summary(reg_linear_log)
## 
## Call:
## lm(formula = lifeExp ~ log(gdpPercap), data = gapminder)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -32.778  -4.204   1.212   4.658  19.285 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     -9.1009     1.2277  -7.413 1.93e-13 ***
## log(gdpPercap)   8.4051     0.1488  56.500  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.62 on 1702 degrees of freedom
## Multiple R-squared:  0.6522, Adjusted R-squared:  0.652 
## F-statistic:  3192 on 1 and 1702 DF,  p-value: < 2.2e-16

Quadratic Terms

To add a quadratic term, we need to use the I() function in our regression:

quad_reg <- lm(lifeExp ~ gdpPercap + I(gdpPercap^2), data = gapminder)
summary(quad_reg)
## 
## Call:
## lm(formula = lifeExp ~ gdpPercap + I(gdpPercap^2), data = gapminder)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -28.0600  -6.4253   0.2611   7.0889  27.1752 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     5.052e+01  2.978e-01  169.65   <2e-16 ***
## gdpPercap       1.551e-03  3.737e-05   41.50   <2e-16 ***
## I(gdpPercap^2) -1.502e-08  5.794e-10  -25.92   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.885 on 1701 degrees of freedom
## Multiple R-squared:  0.5274, Adjusted R-squared:  0.5268 
## F-statistic: 949.1 on 2 and 1701 DF,  p-value: < 2.2e-16
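One useful by-product of a quadratic fit is its turning point, where the estimated marginal effect switches sign. A quick sketch using the coefficients stored in quad_reg:

# turning point: gdpPercap* = -b1 / (2 * b2)
b <- coef(quad_reg)
-b["gdpPercap"] / (2 * b["I(gdpPercap^2)"])

With the estimates above, this works out to roughly $52,000 in GDP per capita.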

Heteroskedasticity

Today we will be working with the CASchools dataset again, containing information on public schools in California. This dataset is part of the AER package.

library(pacman)
p_load(tidyverse, AER, lfe)
data("CASchools")

Looking for Heteroskedasticity Graphically

We can make a plot to examine the relationship between our x variable and the residuals. To do this, we need to run a regression, recover the residuals, then make a scatterplot. The resid() function will get the residuals from the regression object.

reg <- lm(math ~ calworks, data = CASchools)
residuals <- resid(reg)

ggplot(aes(x = calworks, y = residuals), data = CASchools) + 
  geom_point(color = "purple") + theme_minimal() +
  geom_hline(yintercept = 0) 

Goldfeld-Quandt Test

We want to be sure that our variance isn’t changing with our predictors. One way is to order the sample by a predictor and compare how the errors behave at low and high values. The Goldfeld-Quandt test does exactly that. We pick a fraction of our sample and compare the first fraction of the sample to the last fraction. We then look at the ratio of the sums of squared errors and see how far it is from 1. If the ratio is 1, then across small and large values of x, the errors are roughly the same. If the ratio is far from 1, the errors are not the same across values of x.

If we wanted to do this by hand, we’d need to follow six steps. Note that Goldfeld-Quandt only allows you to look at one variable at a time, so let’s look at calworks and compare the first 1/4 and the last 1/4 of the sample.

  • Step 1: Order your observations by your variable of interest.
CASchools <- CASchools %>% arrange(calworks)
  • Step 2: Split the data into two groups. We have chosen 1/4 of our dataset so we need to know what 1/4 is. The nrow() function tells us the total number of observations in the dataframe.
n_GQ <- as.integer(nrow(CASchools)*1/4)
n_GQ
## [1] 105
  • Step 3: Run the regression you want to estimate on the first 1/4 and the last 1/4 of the data. Let’s do a regression with calworks, expenditure, and students.
# first 1/4
lm_g1 <- lm(math ~ calworks + expenditure + students, data = head(CASchools,  n_GQ))

# last 1/4
lm_g2 <- lm(math ~ calworks + expenditure + students, data = tail(CASchools,  n_GQ))
  • Step 4: Record the sum of squared errors so we can form the test statistic.
# get residuals
e_g1 <- resid(lm_g1)
e_g2 <- resid(lm_g2)

# square and sum
sse_g1 <- sum(e_g1^2)
sse_g1
## [1] 11944.58
sse_g2 <- sum(e_g2^2)
sse_g2
## [1] 12002.33
  • Step 5: Calculate the GQ test statistic and calculate the p-value. The GQ test is an F test.
stat_GQ <- (sse_g2 / sse_g1)
stat_GQ
## [1] 1.004835
# n-k-1 degrees of freedom, pf() gives the p-value from the F distribution
pf(q = stat_GQ, df1 = n_GQ - 4, df2 = n_GQ - 4, lower.tail = F)
## [1] 0.4903561
  • Step 6: State the null hypothesis and draw a conclusion. The null hypothesis of Goldfeld-Quandt is H0: The variances of the residuals from regressions using the first 1/4 and the last 1/4 of the dataset ordered by calworks are the same. The alternative hypothesis is HA: The variances of the residuals from those two regressions are different. More formally: \(H_0:\sigma_1^2 = \sigma_2^2\) and \(H_1:\sigma_1^2 \neq \sigma_2^2\).

Since \(p > 0.10\) we fail to reject the null hypothesis and conclude that we do not have evidence of heteroskedasticity.
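If you’d rather not do the six steps by hand, the lmtest package has a gqtest() function that automates this (a sketch; the argument choices below are meant to mirror our by-hand setup of ordering by calworks and comparing the outer quarters):

p_load(lmtest)
# order by calworks and omit the middle half of the sample,
# so the two groups are the first and last quarters
gqtest(math ~ calworks + expenditure + students, data = CASchools,
       order.by = ~ calworks, fraction = 0.5, alternative = "two.sided")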

White Test

  • Step 1: Regress y on the x variables and record the residuals.
lm_white <- lm(math ~ calworks + expenditure + students, data = CASchools)
resid_white <- resid(lm_white)
  • Step 2: Regress the squared residuals on all x variables, their squares, and first-degree interactions.
resid_reg <- lm(I(resid_white^2) ~ calworks + expenditure + students +
                  I(calworks^2) + I(expenditure^2) + I(students^2) +
                  calworks:expenditure + calworks:students + expenditure:students,  data = CASchools)
  • Step 3: Record the \(R^2\)
r2_White <- summary(resid_reg)$r.squared
r2_White
## [1] 0.06588887
  • Step 4: Compute the LM test statistic, \(LM = n*R^2_e\), and find the p-value. The LM test uses a chi-squared distribution.
LM_test <- nrow(CASchools)*r2_White
LM_test
## [1] 27.67333
pchisq(q = LM_test, df = 10, lower.tail = F)
## [1] 0.002035818
  • Step 5: State null and alternative hypotheses and conclude. The null hypothesis is: \(H_0:\beta_1=...=\beta_9=0\) and the alternative hypothesis is \(H_1\): at least one \(\beta_i \neq 0\). (With 9 regressors in the auxiliary regression, the reference distribution is strictly \(\chi^2_9\); the code above used df = 10, but the conclusion is the same either way.) We reject the null at the 1% level and conclude that there is statistically significant evidence of heteroskedasticity.

Dealing with Heteroskedasticity

To get heteroskedasticity-robust standard errors we use the felm() function from the lfe package.

reg_standard <- lm(math ~ calworks + expenditure + students, data = CASchools)
summary(reg_standard)
## 
## Call:
## lm(formula = math ~ calworks + expenditure + students, data = CASchools)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -50.820  -9.205   0.025   9.962  41.352 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  6.369e+02  6.018e+00 105.835  < 2e-16 ***
## calworks    -1.028e+00  6.149e-02 -16.718  < 2e-16 ***
## expenditure  5.738e-03  1.114e-03   5.153 3.97e-07 ***
## students    -1.557e-04  1.807e-04  -0.862    0.389    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 14.32 on 416 degrees of freedom
## Multiple R-squared:  0.4215, Adjusted R-squared:  0.4174 
## F-statistic:   101 on 3 and 416 DF,  p-value: < 2.2e-16
reg_robust <- felm(math ~ calworks + expenditure + students, data = CASchools)
summary(reg_robust, robust = T)
## 
## Call:
##    felm(formula = math ~ calworks + expenditure + students, data = CASchools) 
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -50.820  -9.205   0.025   9.962  41.352 
## 
## Coefficients:
##               Estimate Robust s.e t value Pr(>|t|)    
## (Intercept)  6.369e+02  6.266e+00 101.638  < 2e-16 ***
## calworks    -1.028e+00  7.860e-02 -13.079  < 2e-16 ***
## expenditure  5.738e-03  1.219e-03   4.709  3.4e-06 ***
## students    -1.557e-04  1.656e-04  -0.940    0.348    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 14.32 on 416 degrees of freedom
## Multiple R-squared(full model): 0.4215   Adjusted R-squared: 0.4174 
## Multiple R-squared(proj model): 0.4215   Adjusted R-squared: 0.4174 
## F-statistic(full model, *iid*):  101 on 3 and 416 DF, p-value: < 2.2e-16 
## F-statistic(proj model): 62.81 on 3 and 416 DF, p-value: < 2.2e-16
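An alternative sketch (assuming the lmtest and sandwich packages) keeps the lm object and swaps in a heteroskedasticity-robust covariance matrix:

p_load(lmtest, sandwich)
# HC1 uses the same small-sample correction as Stata's robust option
coeftest(reg_standard, vcov = vcovHC(reg_standard, type = "HC1"))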

Functions and Simulations in R

library(tidyverse)

Function Basics

Before we talk about functions in R, we should first make sure we understand what a function is in general. You all are probably used to thinking of something like:

     `f(x)= x^2`

But what does this really mean? Why is this a useful tool? It takes some input (a number), and gives us an output (in this case, the squared number, which is indeed also a number). We can represent the function \(f\) above as: Number -> ADifferentNumber, where that arrow is something we call a mapping. A mapping links one object to another. This doesn’t have to be two numbers; it could be a word and a dataframe, or two words. For instance, I could come up with a function called color_labeler and then pass it some object, which it will map to a color.

color_labeler(banana) = yellow

Where color_labeler is the name of the function, that maps fruits(inputs) to colors(outputs)

Knowing all of this, we need to go back to lecture 1 to remind ourselves about R-facts before we make a function.

  • Everything in R has a name, and everything is an object
  • Functions have inputs/arguments and outputs

With these in mind, let’s write a function that will take some x, and spit out x^2 + 10. That is, let’s write the code for f(x)=x^2+10.

Everything in R needs a name. That includes our function.
We do this by starting with a name and setting it equal to the ‘function()’ function.

In general, this function looks like function([some_set_of_arguments]){your function code}. function() is a special operator that takes any arguments you want in the parentheses, and then lets you manipulate them in any way you see fit. Think of the parentheses here as the toys you’re giving your computer to play with in the sandbox.

squarePlusten = function(x){
  #tell squarePlusten what to do. x is an input here, we can tell our function to transform our variable into
  #something else.
  x_squaredten = x^2+10
  #Now, in order to make use of this value, we need our function to spit something out. 
  #We do this with another special function, 'return()'. 
  return(x_squaredten)
}

Notice, however, that x_squaredten isn’t equal to anything now that our function has been run. That’s because x_squaredten is only defined inside the function’s own environment:

x_squaredten # Error: object 'x_squaredten' not found

Now, we can use the function by calling the function name and giving it an input:

#check for 10. should be 110
squarePlusten(10)
## [1] 110

A couple things to note:

  • Brackets around functions let R know where the function starts and ends.
  • Technically we do not need them for one-line functions (see the sketch below).

  • Return: this tells the function what it should return for us. Temporary objects created while your function runs are not kept around, and a function cannot modify objects outside of itself unless you use the <<- operator.
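For instance, a minimal sketch of the one-line form (no brackets needed; R automatically returns the value of the last expression):

squarePlustenShort = function(x) x^2 + 10
squarePlustenShort(10) # 110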

Remember, functions don’t just change numbers to numbers. A more accurate way to think of functions is as a map from one object to another. The starting point is the argument, and the end point is the output. The functions take an object as an input.

For example, we can write a function that takes a vector, squares all the elements, and then returns a plot. In this case, our function (let’s call it g) does the following:

Vector -> g -> Plot

We’re going to need a new tool for this…

Sapply()

So, what should our code do?

  • Take a vector, say c(1,2,3).
  • Square each element of the vector. We will also need a way to store these results.

Our first step is to figure out all of our moving parts. Here, we need some set of numbers. We can use vectors with the c() command!

Let’s store them in a vector called x: x = c(1,4,9)

Now, I want a function to return a plot of all the (x, y = x^2) pairs. We can use a special function called sapply() which lets us perform a function a whole bunch of times.

sapply() is cool. This is how it works: sapply(some_vector,function) will take every element in the vector and apply your function to it. We can use this to run a function a whole ton of times all at once.

#build a vector, as we discussed above
x = c(1,4,9)
#we'll use sapply(), which takes two main arguments: a vector of inputs and a function to apply to each element
plot_sq = function(x){
  #now we can use the sapply function as discussed above.
  y = sapply(x, function(x) x^2)
  #return the plot
  return(plot(x=x,y=y))
}
#check that it works
plot_sq(1:3)

We just built our own plotting function! Pretty exciting.

For Loops

Note that we could have also written the same thing with a for loop. What is a for loop? The definition is actually hinted at in the name: the logical flow here is ‘for each thing in a list, do an operation I provide.’ A better source:

From wikipedia: A for-loop has two parts: a header specifying the iteration, and a body which is executed once per iteration

Here is an example of how to write a basic for loop:

#we will build an empty vector x, which we will use a for loop to fill in
x=c()
#starting our loop, we're saying that i is an object in 1-10.
for (i in 1:10){
  #dividing 10 by the numbers 1 through 10.
  x[i]= 10/i
}
#let's look at what this makes
x
##  [1] 10.000000  5.000000  3.333333  2.500000  2.000000  1.666667  1.428571
##  [8]  1.250000  1.111111  1.000000

i is, for each loop, storing the value in your sequence (here, it’s a number in 1:10) then performing the operation you defined in the loop. Like a function, a for loop defines its “body” by setting the start point with a { and an end point with a }.

But, what is this doing? Let’s do this manually, but for i in 1:3

# start with i equals 1, but first we need an empty vector to fill in.
x_slow = c()
i = 1
x_slow[i] = 10/i
x_slow
## [1] 10

Now, we update i (i = i + 1) and repeat the above:

#look now at i = 2. 
i=i+1
x_slow[i] = 10/i
x_slow
## [1] 10  5

Again, we update i and repeat

#look now at i = 3
i=i+1
x_slow[i] = 10/i
x_slow #and so forth
## [1] 10.000000  5.000000  3.333333

Really though, we can loop over any object we want. Maybe we want to loop over some weird sequence, say c(2,300,-4,6) and the loop will perform some set of operations based on it. For instance,

z = 0
for (i in c(2,300,-4,6)) {
      print(paste0(i, ' plus ', z, ' equals'))
      z = z + i
      print(z)
      }
## [1] "2 plus 0 equals"
## [1] 2
## [1] "300 plus 2 equals"
## [1] 302
## [1] "-4 plus 302 equals"
## [1] 298
## [1] "6 plus 298 equals"
## [1] 304

Let’s rewrite our function plot_sq with a for loop inside it:

plot_sqf = function(x){
    #initialize y:
    y=c()
    #start our for loop. We're looping over every object in the vector x, so we want to iterate a number of times
    #equal to the length of x so we don't miss any.
    for (i in 1:length(x)) {
       #for each i in the vector i:length, update the ith value in y with:
       y[i] = x[i]^2
       }
      #return the plot
       return(plot(x=x,y=y))
}

#plot squared values between 1-10
plot_sqf(1:10)

Understanding the loop

Let’s look at this closer

for (i in 1:length(x)): i does what is called “indexing.” My goal for this function is to “loop” through all elements of x, and fill a vector called y with the squared elements of x. Hence the loop needs to terminate at the final element of x.

  • length(x) gives the number of elements x has.
  • y[i] = x[i]^2 remember how we index vectors in R. Thus for each i = 1, 2, ..., the loop fills element i of y with the square of the ith element of x.

Let’s take a look at the plot and make sure it is the same as our last version with sapply(). (A quirk: plot() draws the figure and returns NULL, so wrapping the call in parentheses prints NULL.)

(plot_sqf(1:3))
## NULL
plot_sq(1:3)

Let’s make the plot a bit nicer.

plot_sq2 = function(x){
 #can do this with a for loop as well. sapply applies the function x^2 to all elements of x
y = sapply(x, function(x) x^2)
#return the plot
return(ggplot()+geom_line(aes(x=x,y=y)))
 }
                                                         
#check that it works
plot_sq2(-20:20) 

Monte Carlo Simulations

First, we should learn about setting a seed. Setting a seed in R (e.g., set.seed(123)) is like picking a starting point for randomness, so you get the same random results every time you run the code. Without set.seed(), every time you run the code, you might get slightly different estimates! This is important for reproducibility, debugging, and comparing results with others. Note that the seed fixes the starting state of the random number generator, which pins down the entire sequence of draws that follows; if you want an individual chunk of code to be reproducible on its own, set the seed again at the top of that chunk.

set.seed(123)
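A quick check of what the seed does (re-setting the seed replays the exact same draws):

set.seed(123)
rnorm(3) # three random draws
set.seed(123)
rnorm(3) # identical to the three draws above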

Side Note: There are functions to pull random numbers from a probability distribution:

  • rnorm(): Normal distribution
  • runif(): Uniform distribution
  • rbinom(): Binomial distribution
  • rt(): t-distribution
  • rexp(): Exponential distribution
  • Etc.

Example: OLS Unbiasedness

Suppose we have the simple linear regression model: \[Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i\] where \(X_i\) is an independent variable drawn from a normal distribution and \(\varepsilon_i\) is an error term drawn from a normal distribution. We will estimate \(\beta_1\) using OLS across many simulated samples and examine its sampling distribution.

First, specify the simulation parameters.

n <- 100  # Sample size
beta_0 <- 2
beta_1 <- 3
sigma <- 2  # Standard deviation of the error term
num_simulations <- 1000  # Number of replications

For this simulation, we will write a function to create the variables needed for the model, run the regression, and store \(\widehat{\beta_1}\). Then, we will replicate it.

set.seed(123) 
ols_function <- function(n, beta_0, beta_1, sigma){
  X <- rnorm(n, mean = 5, sd = 2) 
  epsilon <- rnorm(n, mean = 0, sd = sigma)  
  Y <- beta_0 + beta_1 * X + epsilon

  model <- lm(Y ~ X)  # Run OLS regression
  beta_1_estimates <- coef(model)[2]  # Store estimated beta_1
  return(beta_1_estimates)
}

ols_results <- replicate(num_simulations, ols_function(n, beta_0, beta_1, sigma))

Next, let’s visualize our \(\widehat{\beta}_1\) values with a histogram and show that it is unbiased.

# Store as dataframe
estimates <- as.data.frame(ols_results)
ggplot(data = estimates, aes(x = ols_results)) + geom_histogram(fill = "darkgreen") + 
  theme_minimal() + geom_vline(xintercept = mean(estimates$ols_results)) +
  labs(title = "Sampling Distribution of Beta1",  x = "Estimated Beta1")

We can also compute the bias numerically:

bias <- mean(ols_results) - beta_1
bias
## [1] 0.003565251

The bias is very close to zero, consistent with OLS being unbiased.

Example: OLS Variance

Next, let’s simulate an OLS regression and examine the variance. We will take note of what happens to the variance of \(\widehat{\beta}_1\) when the sample size increases (\(n\) increases).

Begin by specifying the parameters for the simulation:

beta_0 <- 2
beta_1 <- 3
sigma <- 1
num_simulations <- 5000  # Number of replications
sample_sizes <- c(25, 50, 100, 250, 500)  # Different sample sizes

Next, create the function for the simulation. Let’s use a for loop inside the function.

set.seed(123) 
estimate_variance <- function(n, sigma, beta_0, beta_1) {
  beta_1_estimates <- numeric(num_simulations)

  for (i in 1:num_simulations) {
    X <- rnorm(n, mean = 5, sd = 2)
    epsilon <- rnorm(n, mean = 0, sd = sigma)
    Y <- beta_0 + beta_1 * X + epsilon
    
    model <- lm(Y ~ X)
    beta_1_estimates[i] <- coef(model)[2]
  }
  
  return(var(beta_1_estimates))
}

Next, let’s run this function using sapply() for different sample sizes, \(n\), and plot the results.

variance_results_n <- sapply(sample_sizes, function(n) estimate_variance(n, sigma, beta_0, beta_1))

n_df <-  as.data.frame(variance_results_n)
n_df <- n_df %>% mutate(sample_sizes = sample_sizes)
ggplot(data = n_df, aes(x = sample_sizes, y = variance_results_n)) + geom_point() + 
  geom_line() + theme_minimal() + labs(x = "Sample Size", y = "Estimated Variance of Beta1")

Example: Product Demand

This example applies Monte Carlo simulation to predict the uncertainty in product demand. A company sells a product and wants to estimate the potential demand for the next quarter. Based on historical data, demand follows a normal distribution with an average monthly demand of 500 units and a standard deviation of 100 units. The company wants to simulate 1,000 possible demand scenarios for the next three months and estimate: (1) the probability that total demand exceeds 1,800 units (indicating a potential stockout) and (2) the expected total demand over the quarter.

Begin by setting the parameters for the simulation:

avg_demand <- 500  # Average monthly demand
sd_demand <- 100   # Standard deviation of demand
months <- 3        # Forecasting period (3 months)
num_simulations <- 1000  # Number of Monte Carlo simulations

Then, simulate the total demand over the forecasting period.

total_demand <- replicate(num_simulations, sum(rnorm(months, mean = avg_demand, sd = sd_demand)))

Calculate the probability that total demand exceeds 1,800 units. If the probability of a stockout is high, the company may need to adjust inventory levels. Also, let’s calculate the expected amount of demand.

prob_stockout <- mean(total_demand > 1800)*100
prob_stockout
## [1] 5.1
exp_demand <- mean(total_demand)
exp_demand
## [1] 1504.62
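A related quantity the company might want (an extra sketch): the stock level that would cover simulated demand in 95% of scenarios.

# 95th percentile of simulated quarterly demand
quantile(total_demand, 0.95)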

Make a histogram of the total demand.

qtr_demand <- as.data.frame(total_demand)

ggplot(qtr_demand, aes(x = total_demand)) + geom_histogram(fill = "lightblue") +
  geom_vline(xintercept = mean(total_demand)) + theme_minimal() +
  labs(title = "Simulated Demand for Next Quarter", x = "Total Demand")

Example: Investment Returns

Let’s use a Monte Carlo simulation to estimate the expected return on an investment, considering the uncertainty in the market. We’ll simulate the annual return of investing in a hypothetical stock portfolio. We’ll assume the annual return of the portfolio follows a normal distribution, reflecting the ups and downs of the market. The average annual return (mean) is set at 8% with a standard deviation of 10%.

There are a variety of ways to do Monte Carlo simulation in R, including several packages to assist you. Here, we will build our own function.

  • Step 1: Specify parameters for the simulation

average_return <- 0.08  # mean annual return
std_deviation <- 0.10   # standard deviation of annual returns
num_years <- 10         # investment horizon in years
num_simulations <- 1000 # number of Monte Carlo replications

  • Step 2: Write a function to simulate the returns

simulated_return <- function(average_return, std_deviation, num_years) {

  annual_returns <- rnorm(num_years, mean = average_return, sd = std_deviation)
  
  # Calculate the compound return over the investment period
  final_return <- prod(1 + annual_returns) - 1
  
  return(final_return)
}

What is happening in this function?

  • annual_returns is a vector containing the returns for each year.

  • 1 + annual_returns adjusts these returns to reflect the total value at the end of each period relative to the initial value. For example, if you have a return of 0.05 (or 5%), adding 1 to it gives 1.05, which represents your total value at the end of the period as a multiple of your initial investment. If you invested $100, a return of 5% means you’d have $105 at the end of the period, or 1.05 times your initial investment.

  • prod(1 + annual_returns) calculates the product of all these adjusted returns. For example, if you had two consecutive years with returns of 5% and 10%, the calculation would be prod(c(1.05, 1.10)), which equals 1.155. This means that $1 invested at the start would grow to $1.155 after these two periods, assuming reinvestment of all earnings.

  • Subtracting 1 from this product gives you the total compounded return as a decimal. Continuing the example, 1.155 - 1 equals 0.155, or 15.5%. This represents the overall growth of the investment, minus the original investment itself, expressed as a percentage of the original investment. The short sketch below verifies this arithmetic.
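Here is that two-year example as a quick sketch in R:

example_returns <- c(0.05, 0.10)  # 5% in year one, 10% in year two
prod(1 + example_returns) - 1     # 0.155, i.e. a 15.5% compound return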

  • Step 3: Perform the simulation

simulation_results <- replicate(num_simulations, simulated_return(average_return, std_deviation, num_years))
head(simulation_results)
## [1]  1.16462864  0.32302118  0.88157942  1.50322975  0.35425510 -0.05417243
simulation_results <- as.data.frame(simulation_results)
  • Step 4: Analyze the results
ggplot(data = simulation_results, aes(x = simulation_results)) + geom_histogram(fill = "darkorchid3") + 
  theme_minimal() +
  geom_vline(xintercept = mean(simulation_results$simulation_results)) + 
  labs(x = "Simulated Returns")
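A histogram is useful, but it can also help to pull out a few summary numbers, e.g., the average compound return and the share of simulations that lost money:

mean(simulation_results$simulation_results)      # average 10-year compound return
mean(simulation_results$simulation_results < 0)  # share of simulations with a negative return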

Intro to Machine Learning

Today, we will be using the glmnet package to perform Ridge, Lasso, and Elastic Net regressions using the Credit dataset from the ISLR package.

library(pacman)
p_load(ISLR, tidyverse, glmnet)

Before we begin, we need to clean our data a bit to get rid of the ID variable and make the categorical variables numeric.

credit_new <- Credit %>% 
  mutate(i_female = ifelse(Gender == "Female", 1, 0),
         i_student = ifelse(Student == "Yes", 1, 0),
         i_married = ifelse(Married == "Yes", 1, 0),
         i_asian = ifelse(Ethnicity == "Asian", 1, 0),
         i_africanamerican = ifelse(Ethnicity == "African-American", 1, 0)) %>%
  select(-ID, -Gender, -Student, -Married, -Ethnicity) %>% relocate(Balance)
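Before moving on, it’s worth a quick sanity check that every column is now numeric (glimpse() comes with the tidyverse):

# All variables should be numeric after the cleaning step
glimpse(credit_new)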

Ridge Regression

Ridge regression requires the \(\lambda\) parameter. We can use cross-validation to find the optimal \(\lambda\) value. Luckily, glmnet has a function for this! First, we need to define a set of \(\lambda\)s; note that glmnet expects a decreasing sequence.

lambdas = 10^seq(from = 5,  to = -2, length = 100)

Then, we can use cross-validation:

ridge_cv <- cv.glmnet(
  x = credit_new %>% select(-Balance) %>% as.matrix(),
  y = credit_new$Balance,
  standardize = T,
  alpha = 0,
  lambda = lambdas,
  type.measure = "mse",
  nfolds = 5
)

#store the results in a data frame
ridge_cv_results <- data.frame(
  lambda = ridge_cv$lambda,
  rmse = sqrt(ridge_cv$cvm)
)

Let’s look at a plot of our \(\lambda\) values and the RMSE.

# make a plot of the lambdas and RMSE
ggplot(data = ridge_cv_results, aes(x = lambda, y = rmse)) + geom_line() + 
  scale_x_continuous(trans = "log10")
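The optimal \(\lambda\) (the one minimizing cross-validated error) is stored in the cv.glmnet object:

# The lambda with the lowest cross-validated MSE
ridge_cv$lambda.min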

Now that we know the optimal \(\lambda\), we can use it to estimate the ridge regression.

est_ridge <- glmnet(
  x = credit_new %>% select(-Balance) %>% as.matrix(),
  y = credit_new$Balance,
  standardize = T,
  alpha = 0,
  lambda = ridge_cv$lambda.min
)

coef(est_ridge)
## 12 x 1 sparse Matrix of class "dgCMatrix"
##                             s0
## (Intercept)       -477.4400874
## Income              -7.7159166
## Limit                0.1663979
## Rating               1.4821043
## Cards               16.0699923
## Age                 -0.6386871
## Education           -1.0305012
## i_female           -10.4128121
## i_student          423.0763384
## i_married           -8.7887526
## i_asian             10.4299792
## i_africanamerican    .

We can also predict on the entire dataset:

#going to use the head function here, so that it doesn't print out 400 lines!
head(predict(est_ridge,
        type = "response",
        s = ridge_cv$lambda.min,
        newx = credit_new %>% select(-Balance) %>% as.matrix()))
##             s1
## [1,]  417.4289
## [2,]  920.7352
## [3,]  672.6235
## [4,]  978.1324
## [5,]  398.7734
## [6,] 1091.0151

Lasso Regression

The process for Lasso is very similar to ridge regression. First, we find the optimal \(\lambda\) using cross-validation. Note that here we set \(\alpha = 1\) in the glmnet function.

lasso_cv <- cv.glmnet(
  x = credit_new %>% select(-Balance) %>% as.matrix(),
  y = credit_new$Balance,
  standardize = T,
  alpha = 1,
  lambda = lambdas,
  type.measure = "mse",
  nfolds = 5
)
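If you want to inspect the cross-validation results, cv.glmnet objects come with a built-in plot method:

# CV error as a function of log(lambda), with error bars
plot(lasso_cv)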

Estimate the Lasso regression:

est_lasso <- glmnet(
  x = credit_new %>% select(-Balance) %>% as.matrix(),
  y = credit_new$Balance,
  standardize = T,
  alpha = 1,
  lambda = lasso_cv$lambda.min
)

coef(est_lasso)
## 12 x 1 sparse Matrix of class "dgCMatrix"
##                             s0
## (Intercept)       -466.5233307
## Income              -7.7997283
## Limit                0.2053575
## Rating               0.9191353
## Cards               18.7251742
## Age                 -0.6168619
## Education           -1.1679625
## i_female           -10.4391064
## i_student          426.0119541
## i_married           -7.2412534
## i_asian              9.4476383
## i_africanamerican    .

Predict on the full dataset:

head(predict(est_lasso,
        type = "response",
        s = lasso_cv$lambda.min,
        newx = credit_new %>% select(-Balance) %>% as.matrix()))
##             s1
## [1,]  414.3537
## [2,]  920.9063
## [3,]  670.7234
## [4,]  969.6886
## [5,]  400.9545
## [6,] 1099.3173

Elastic Net

We should also specify a range of \(\alpha\)s to test with cross-validation to find the optimal one, but that is for another day (a quick sketch appears at the end of this section)… let’s just use 0.5 for now.

net_cv <- cv.glmnet(
  x = credit_new %>% select(-Balance) %>% as.matrix(),
  y = credit_new$Balance,
  standardize = T,
  alpha = 0.5,
  lambda = lambdas,
  type.measure = "mse",
  nfolds = 5
)

Estimate coefficients:

est_net <- glmnet(
  x = credit_new %>% select(-Balance) %>% as.matrix(),
  y = credit_new$Balance,
  standardize = T,
  alpha = 0.5,
  lambda = net_cv$lambda.min
)

coef(est_net)
## 12 x 1 sparse Matrix of class "dgCMatrix"
##                             s0
## (Intercept)       -474.2433322
## Income              -7.7224457
## Limit                0.1828503
## Rating               1.2374830
## Cards               16.9888629
## Age                 -0.6118414
## Education           -0.9687906
## i_female            -9.6012216
## i_student          423.2265486
## i_married           -7.3486486
## i_asian              9.0698307
## i_africanamerican    .

Predict:

head(predict(est_net,
        type = "response",
        s = net_cv$lambda.min,
        newx = credit_new %>% select(-Balance) %>% as.matrix()))
##             s1
## [1,]  415.4975
## [2,]  921.3394
## [3,]  670.7032
## [4,]  973.9869
## [5,]  400.9337
## [6,] 1093.2512
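As promised, here is a minimal sketch of how you might cross-validate over \(\alpha\) as well. The grid of \(\alpha\)s and the use of a fixed foldid (so every \(\alpha\) is evaluated on the same folds) are illustrative choices, not the only way to do this:

set.seed(123)
folds <- sample(rep(1:5, length.out = nrow(credit_new)))  # fixed fold assignments
alphas <- seq(0, 1, by = 0.25)

cv_mse <- sapply(alphas, function(a) {
  min(cv.glmnet(
    x = credit_new %>% select(-Balance) %>% as.matrix(),
    y = credit_new$Balance,
    standardize = T,
    alpha = a,
    lambda = lambdas,
    foldid = folds,        # same folds for every alpha, so errors are comparable
    type.measure = "mse"
  )$cvm)
})

alphas[which.min(cv_mse)]  # the alpha with the lowest cross-validated MSE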

Decision Trees

Let’s build a decision tree using data on credit card defaults from the ISLR package. To build the tree, we will use the caret package and plot using the rpart.plot package.

library(pacman)
p_load(tidyverse, ISLR, caret, rpart.plot)

data("Default")

We grow the tree using the train function in caret:

default_tree <- train(
  default ~ .,
  data = Default,
  method = "rpart",
  trControl = trainControl("cv",number = 5),
  tuneGrid = data.frame(cp = seq(0, 0.2, by = 0.005))
)
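To see which complexity parameter cp the cross-validation selected:

# The cp value chosen by 5-fold cross-validation
default_tree$bestTune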

Get the final model:

default_tree$finalModel
## n= 10000 
## 
## node), split, n, loss, yval, (yprob)
##       * denotes terminal node
## 
##  1) root 10000 333 No (0.96670000 0.03330000)  
##    2) balance< 1800.002 9712 171 No (0.98239292 0.01760708) *
##    3) balance>=1800.002 288 126 Yes (0.43750000 0.56250000)  
##      6) balance< 1971.915 170  72 No (0.57647059 0.42352941)  
##       12) income< 27401.2 102  32 No (0.68627451 0.31372549) *
##       13) income>=27401.2 68  28 Yes (0.41176471 0.58823529) *
##      7) balance>=1971.915 118  28 Yes (0.23728814 0.76271186) *

Plot the final tree:

rpart.plot(default_tree$finalModel)

Bagging and Random Forests

Let’s use the Boston Housing data from the mlbench package. We have worked with this dataset before on a previous problem set. There are many packages you can use to run a random forest, including caret, which you already have loaded, but randomForest is straightforward and easy to use. The randomForestExplainer package will produce an html printout with detailed information about your forest.

p_load(randomForest, randomForestExplainer, mlbench, tidyverse)

We also need to move our outcome variable, median home value (medv), to the first column of the data frame.

data("BostonHousing")
housing <- BostonHousing %>% relocate(medv)

Here is a list of the variable descriptions (from the mlbench documentation):

  • crim - per capita crime rate by town
  • zn - proportion of residential land zoned for lots over 25,000 sq. ft.
  • indus - proportion of non-retail business acres per town
  • chas - Charles River dummy variable (1 if tract bounds river; 0 otherwise)
  • nox - nitric oxides concentration (parts per 10 million)
  • rm - average number of rooms per dwelling
  • age - proportion of owner-occupied units built prior to 1940
  • dis - weighted distances to five Boston employment centres
  • rad - index of accessibility to radial highways
  • tax - full-value property-tax rate per $10,000
  • ptratio - pupil-teacher ratio by town
  • b - 1000(B - 0.63)^2, where B is the proportion of Black residents by town
  • lstat - percentage of lower-status population
  • medv - median value of owner-occupied homes in $1000s

For bagging, we can use the caret package and use varImp() to get the variable importance values. Note that varImp() scales these values to numbers between 0 and 100 by default.

bagged <- train(medv ~.,
                data = housing,
                method = "treebag",
                nbagg = 100,
                keepX = T,
                trControl = trainControl(
                  method = "cv",
                  number = 5
                ))

varImp(bagged)
## treebag variable importance
## 
##         Overall
## lstat   100.000
## rm       77.882
## nox      57.363
## crim     54.897
## dis      49.928
## ptratio  48.869
## tax      34.942
## indus    27.062
## age      24.279
## rad      11.748
## b         5.568
## zn        1.086
## chas1     0.000
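The varImp() object can also be plotted directly, which is often easier to read than the table:

# Dotplot of the scaled variable importances
plot(varImp(bagged))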

For the random forest, let’s use the randomForest function. The varImpPlot() function plots each variable’s importance (its contribution to MSE and to node purity).

forest <- randomForest(formula = medv ~ ., 
             data = housing,
             ntree = 100,
             mtry = 2,
             importance = T)

varImp(forest)
##           Overall
## crim     8.142262
## zn       3.354628
## indus    5.713033
## chas     2.100833
## nox      8.231205
## rm      14.941500
## age      4.942931
## dis      7.932614
## rad      4.121301
## tax      6.635994
## ptratio  6.754249
## b        4.808805
## lstat   10.610207
varImpPlot(forest)
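As with any fitted model, you can also predict with the forest. A quick sketch on the first few rows of the data:

# Predicted median home values for the first six observations
head(predict(forest, newdata = housing))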

You can use the randomForestExplainer package to get an html output of your forest. Run the line of code below (without the comment):

#explain_forest(forest)