Here you can find lab notes and resources for Econ 415. These will be updated after our in-class lab sessions. These notes are not a substitute for attending lab but serve as an additional resource.
Much of the lab content will be drawn from R for Data Science
Getting Started with R – A collection of resources for those getting started with R
TidyR Cheatsheet – A useful cheatsheet for data cleaning and tidy data using the tidyverse functions
ggplot2 Cheatsheet – A useful cheatsheet for various ggplot geoms
Tidy Tuesday Repo – A weekly data science project in R to test your tidyverse skills!
Effective file folder management is crucial for maintaining an organized and efficient digital workspace. Setting up organized folders will make your life significantly easier in the future!
In the bottom left of R Studio, you will see the console. The console executes code. You can type code and execute it using the console but the code is not saved when you close R Studio. It is recommended that you do not use the console in your regular workflow.
To save your work, you should code in an R Script. Open a script using the button that looks like a piece of paper with a green plus sign in the top left corner of R Studio.
R scripts will open here. You can code, comment, and run the code from your script. To run the code, either click the “Run” button or by pressing CMD+Enter (Mac) or Ctrl+Enter (Windows). R scripts will be saved to the folder you are currently working in.
In the top left corner, we have the workspace/environment panes.
The workspace/enviroment tab tells you what objects are stored in R (i.e. what is loaded or stored in memory). The History tab which shows previous commands you have run.
Last, on the bottom right, we have several tabs including:
R uses object-oriented programming. If you have never used this type
of programming before, it can be a bit confusing at first. Essentially,
R uses functions, which we apply to objects. More on
this shortly, but if you aren’t sure what a function does, or how it
works, you can use ? before the function to get the documentation. Ex:
?mean will bring up the help page for the
mean() function. Try typing ?mean in the console and
looking at the help page.
An object is an assignment between a name and a value. You assign
values to names using <- or =. The first
assignment symbol consists of a < next to a dash
- to make it look like an arrow.
x <- 5 #assign the value of 5 to a variable called x
# notice that this x is now in your global environment
x # print x
## [1] 5
y = 10
y
## [1] 10
You can combine objects together as well which lets us do some basic math operations.
# create a new object called z that is equal to x*y
z <- x * y
#print z
z
## [1] 50
If you do not create an object, R will not save it to the global environment. If an object is not in the global environment and you try to reference it later, R will not know what you are referring to.
a <- 2+3
a
## [1] 5
b<-4-5
b
## [1] -1
c<-4*2
c
## [1] 8
d<-6/3
d
## [1] 2
e<-7^2
e
## [1] 49
You can create a vector (a list) of items in R.
# create a vector of 1 through 10
vector1 <- 1:10
vector1
## [1] 1 2 3 4 5 6 7 8 9 10
If we want specific items, we use the c() function and
separate the items with a comma.
vector2 <- c(1,3,5,7,9)
vector2
## [1] 1 3 5 7 9
Mathematical operations work on vectors too!
vector2^2
## [1] 1 9 25 49 81
Objects in R have different classes. Check the class of a few objects we have already created:
class(x)
## [1] "numeric"
class(vector1)
## [1] "integer"
There are other classes too!
# create a string
my_string <- "Econ is cool!"
class(my_string)
## [1] "character"
# logical class
class(2>3)
## [1] "logical"
What happens if we have a vector of characters and numbers?
char_vector <- c(1:5, "banana", "apple")
char_vector
## [1] "1" "2" "3" "4" "5" "banana" "apple"
#cant use mathematical operations on characters
# why?? because the entire vector is a character class!
class(char_vector)
## [1] "character"
Functions are operations that can transform your created
object in a ton of different ways. We have actually
already used two functions, c() and class().
Here are a few other useful ones:
#print the first few objects in vector1
head(vector1)
## [1] 1 2 3 4 5 6
#print the first 2 objects in vector1
head(vector1, 2)
## [1] 1 2
#print the last few objects in vector1
tail(vector1)
## [1] 5 6 7 8 9 10
#print last two objects in vector1
tail(vector1, 2)
## [1] 9 10
#find the mean of vector1
mean(vector1)
## [1] 5.5
#median
median(vector1)
## [1] 5.5
#standard deviation
sd(vector1)
## [1] 3.02765
#Summary() prints summary stats
summary(vector1)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 3.25 5.50 5.50 7.75 10.00
min(vector1)
## [1] 1
max(vector1)
## [1] 10
Coding style is the punctuation and grammer of the coding world. Making sure that your code is formatting in a readable, standard format is helpful for yourself and others to understand the code. We will follow guidelines from the tidyverse style guide
Spaces: Put spaces on either side of mathematical operators (i.e. +, -, ==, <, …), and around the assignment operator (<-). The exception to this is the ^ symbol.
# Strive for
z <- (a + b)^2 / d
# Avoid
z<-( a + b ) ^ 2/d
Don’t put spaces inside or outside parentheses for regular function calls. Always put a space after a comma.
# Strive for
mean(x, na.rm = TRUE)
## [1] 5
# Avoid
mean (x ,na.rm=TRUE)
## [1] 5
Adding extra spaces is fine if it helps with alignment. For example:
example_data_frame <-
data.frame(
variable1 = c(1:10),
variable_name2 = c(2:11),
var_name = c(3:12)
)
Naming Conventions: Object names must start with a
letter and can only contain letters, numbers, _, and
.. The names should be descriptive–snake case is the
recommended naming convention (separating lowercase wrods with
_).
really_long_variable_name <- 1
Commenting: You can comment your code with
#. It is strongly recommended to leave comments in
your code so that others, and future you, can keep track of your thought
process.
# Good code is well-commented code!!
You can also create section comments that will be collapsable. This is incredibly helpful when you have a really long R script! Any comment line which includes at least four trailing dashes (-) will create a section.
# This is section 1 ----
# ---- This is section 2 ----
R is really useful because of its ability to use packages. Pacman is a package for “package management” - it helps us load multiple packages at onc.
# if you have not previously installed the package, include the line:
#install.packages("pacman")
# you only have to do this once. you can also install packages from the "Packages" side panel tab
We need to load the a package after installing it to use it by using
library().
library(pacman)
Now we use the p_load function to load other packages we want to use.
We will use the tidyverse() package throughout the course,
so let’s load that one.
p_load(tidyverse)
Tidyverse is used for data wrangling. It allows you to manipulate data frames in a rather intuitive way. Tidyverse is a huge package so today we will be focusing on functions from the dplyr package (comes with tidyverse).
Let’s use the starwars dataset that is built into the
tidyverse package.
names(starwars)
## [1] "name" "height" "mass" "hair_color" "skin_color"
## [6] "eye_color" "birth_year" "sex" "gender" "homeworld"
## [11] "species" "films" "vehicles" "starships"
head(starwars)
## # A tibble: 6 × 14
## name height mass hair_color skin_color eye_color birth_year sex gender
## <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr>
## 1 Luke Sky… 172 77 blond fair blue 19 male mascu…
## 2 C-3PO 167 75 <NA> gold yellow 112 none mascu…
## 3 R2-D2 96 32 <NA> white, bl… red 33 none mascu…
## 4 Darth Va… 202 136 none white yellow 41.9 male mascu…
## 5 Leia Org… 150 49 brown light brown 19 fema… femin…
## 6 Owen Lars 178 120 brown, gr… light blue 52 male mascu…
## # ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>,
## # vehicles <list>, starships <list>
Let’s select only the name, gender, and homeworld variables
select(starwars, c(name, gender, homeworld))
## # A tibble: 87 × 3
## name gender homeworld
## <chr> <chr> <chr>
## 1 Luke Skywalker masculine Tatooine
## 2 C-3PO masculine Tatooine
## 3 R2-D2 masculine Naboo
## 4 Darth Vader masculine Tatooine
## 5 Leia Organa feminine Alderaan
## 6 Owen Lars masculine Tatooine
## 7 Beru Whitesun Lars feminine Tatooine
## 8 R5-D4 masculine Tatooine
## 9 Biggs Darklighter masculine Tatooine
## 10 Obi-Wan Kenobi masculine Stewjon
## # ℹ 77 more rows
Notice that this didn’t save anything in our global environment! If you want to save this new dataframe, you have to give it a name!
To select all columns except a certain one, use a minus sign
select(starwars, c(-homeworld, -gender))
## # A tibble: 87 × 12
## name height mass hair_color skin_color eye_color birth_year sex species
## <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr>
## 1 Luke S… 172 77 blond fair blue 19 male Human
## 2 C-3PO 167 75 <NA> gold yellow 112 none Droid
## 3 R2-D2 96 32 <NA> white, bl… red 33 none Droid
## 4 Darth … 202 136 none white yellow 41.9 male Human
## 5 Leia O… 150 49 brown light brown 19 fema… Human
## 6 Owen L… 178 120 brown, gr… light blue 52 male Human
## 7 Beru W… 165 75 brown light blue 47 fema… Human
## 8 R5-D4 97 32 <NA> white, red red NA none Droid
## 9 Biggs … 183 84 black light brown 24 male Human
## 10 Obi-Wa… 182 77 auburn, w… fair blue-gray 57 male Human
## # ℹ 77 more rows
## # ℹ 3 more variables: films <list>, vehicles <list>, starships <list>
Filter the data frame to include only droids
filter(starwars, species == "Droid")
## # A tibble: 6 × 14
## name height mass hair_color skin_color eye_color birth_year sex gender
## <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr>
## 1 C-3PO 167 75 <NA> gold yellow 112 none masculi…
## 2 R2-D2 96 32 <NA> white, blue red 33 none masculi…
## 3 R5-D4 97 32 <NA> white, red red NA none masculi…
## 4 IG-88 200 140 none metal red 15 none masculi…
## 5 R4-P17 96 NA none silver, red red, blue NA none feminine
## 6 BB8 NA NA none none black NA none masculi…
## # ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>,
## # vehicles <list>, starships <list>
Filter the data frame to include droids OR humans
filter(starwars, species == "Droid" | species == "Human")
## # A tibble: 41 × 14
## name height mass hair_color skin_color eye_color birth_year sex gender
## <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr>
## 1 Luke Sk… 172 77 blond fair blue 19 male mascu…
## 2 C-3PO 167 75 <NA> gold yellow 112 none mascu…
## 3 R2-D2 96 32 <NA> white, bl… red 33 none mascu…
## 4 Darth V… 202 136 none white yellow 41.9 male mascu…
## 5 Leia Or… 150 49 brown light brown 19 fema… femin…
## 6 Owen La… 178 120 brown, gr… light blue 52 male mascu…
## 7 Beru Wh… 165 75 brown light blue 47 fema… femin…
## 8 R5-D4 97 32 <NA> white, red red NA none mascu…
## 9 Biggs D… 183 84 black light brown 24 male mascu…
## 10 Obi-Wan… 182 77 auburn, w… fair blue-gray 57 male mascu…
## # ℹ 31 more rows
## # ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>,
## # vehicles <list>, starships <list>
Filter the data frame to include characters taler than 100 cm and a mass over 100
filter(starwars, height > 100 & mass > 100)
## # A tibble: 10 × 14
## name height mass hair_color skin_color eye_color birth_year sex gender
## <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr>
## 1 Darth V… 202 136 none white yellow 41.9 male mascu…
## 2 Owen La… 178 120 brown, gr… light blue 52 male mascu…
## 3 Chewbac… 228 112 brown unknown blue 200 male mascu…
## 4 Jabba D… 175 1358 <NA> green-tan… orange 600 herm… mascu…
## 5 Jek Ton… 180 110 brown fair blue NA <NA> <NA>
## 6 IG-88 200 140 none metal red 15 none mascu…
## 7 Bossk 190 113 none green red 53 male mascu…
## 8 Dexter … 198 102 none brown yellow NA male mascu…
## 9 Grievous 216 159 none brown, wh… green, y… NA male mascu…
## 10 Tarfful 234 136 brown brown blue NA male mascu…
## # ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>,
## # vehicles <list>, starships <list>
What if we want to do those things all in one step??? The tidyverse allows us to chain functions together with %>%. This is called a pipe and the pipe connects the LHS to the RHS (like reading a book). Let’s make a new dataframe where we select the name, height, and mass. Filter out those who are shorter than 100 cm.
new_df <- starwars %>% select(name, height, mass) %>% filter(height >= 100)
new_df
## # A tibble: 74 × 3
## name height mass
## <chr> <int> <dbl>
## 1 Luke Skywalker 172 77
## 2 C-3PO 167 75
## 3 Darth Vader 202 136
## 4 Leia Organa 150 49
## 5 Owen Lars 178 120
## 6 Beru Whitesun Lars 165 75
## 7 Biggs Darklighter 183 84
## 8 Obi-Wan Kenobi 182 77
## 9 Anakin Skywalker 188 84
## 10 Wilhuff Tarkin 180 NA
## # ℹ 64 more rows
Self check: make a new data frame where you select all columns except gender and has characters that have green skin color
example_df <- starwars %>% select(-gender) %>% filter(skin_color == "green")
example_df
## # A tibble: 6 × 13
## name height mass hair_color skin_color eye_color birth_year sex homeworld
## <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr>
## 1 Greedo 173 74 <NA> green black 44 male Rodia
## 2 Yoda 66 17 white green brown 896 male <NA>
## 3 Bossk 190 113 none green red 53 male Trandosha
## 4 Rugor… 206 NA none green orange NA male Naboo
## 5 Kit F… 196 87 none green black NA male Glee Ans…
## 6 Poggl… 183 80 none green yellow NA male Geonosis
## # ℹ 4 more variables: species <chr>, films <list>, vehicles <list>,
## # starships <list>
Let’s arrange all of the characters by their height
starwars %>% arrange(height)
## # A tibble: 87 × 14
## name height mass hair_color skin_color eye_color birth_year sex gender
## <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr>
## 1 Yoda 66 17 white green brown 896 male mascu…
## 2 Ratts T… 79 15 none grey, blue unknown NA male mascu…
## 3 Wicket … 88 20 brown brown brown 8 male mascu…
## 4 Dud Bolt 94 45 none blue, grey yellow NA male mascu…
## 5 R2-D2 96 32 <NA> white, bl… red 33 none mascu…
## 6 R4-P17 96 NA none silver, r… red, blue NA none femin…
## 7 R5-D4 97 32 <NA> white, red red NA none mascu…
## 8 Sebulba 112 40 none grey, red orange NA male mascu…
## 9 Gasgano 122 NA none white, bl… black NA male mascu…
## 10 Watto 137 NA black blue, grey yellow NA male mascu…
## # ℹ 77 more rows
## # ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>,
## # vehicles <list>, starships <list>
Notice this does lowest to highest, we can do the other way too
starwars %>% arrange(desc(height))
## # A tibble: 87 × 14
## name height mass hair_color skin_color eye_color birth_year sex gender
## <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr>
## 1 Yarael … 264 NA none white yellow NA male mascu…
## 2 Tarfful 234 136 brown brown blue NA male mascu…
## 3 Lama Su 229 88 none grey black NA male mascu…
## 4 Chewbac… 228 112 brown unknown blue 200 male mascu…
## 5 Roos Ta… 224 82 none grey orange NA male mascu…
## 6 Grievous 216 159 none brown, wh… green, y… NA male mascu…
## 7 Taun We 213 NA none grey black NA fema… femin…
## 8 Rugor N… 206 NA none green orange NA male mascu…
## 9 Tion Me… 206 80 none grey black NA male mascu…
## 10 Darth V… 202 136 none white yellow 41.9 male mascu…
## # ℹ 77 more rows
## # ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>,
## # vehicles <list>, starships <list>
Self check: Arrange the characters names in alphabetical order
starwars %>% arrange(name)
## # A tibble: 87 × 14
## name height mass hair_color skin_color eye_color birth_year sex gender
## <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr>
## 1 Ackbar 180 83 none brown mot… orange 41 male mascu…
## 2 Adi Gal… 184 50 none dark blue NA fema… femin…
## 3 Anakin … 188 84 blond fair blue 41.9 male mascu…
## 4 Arvel C… NA NA brown fair brown NA male mascu…
## 5 Ayla Se… 178 55 none blue hazel 48 fema… femin…
## 6 BB8 NA NA none none black NA none mascu…
## 7 Bail Pr… 191 NA black tan brown 67 male mascu…
## 8 Barriss… 166 50 black yellow blue 40 fema… femin…
## 9 Ben Qua… 163 65 none grey, gre… orange NA male mascu…
## 10 Beru Wh… 165 75 brown light blue 47 fema… femin…
## # ℹ 77 more rows
## # ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>,
## # vehicles <list>, starships <list>
We can create new variables with mutate(). Let’s create
a new variable that measures height in inches instead of centimeters
(2.54cm per inch)
starwars %>% mutate(height_inches = height/2.54)
## # A tibble: 87 × 15
## name height mass hair_color skin_color eye_color birth_year sex gender
## <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr>
## 1 Luke Sk… 172 77 blond fair blue 19 male mascu…
## 2 C-3PO 167 75 <NA> gold yellow 112 none mascu…
## 3 R2-D2 96 32 <NA> white, bl… red 33 none mascu…
## 4 Darth V… 202 136 none white yellow 41.9 male mascu…
## 5 Leia Or… 150 49 brown light brown 19 fema… femin…
## 6 Owen La… 178 120 brown, gr… light blue 52 male mascu…
## 7 Beru Wh… 165 75 brown light blue 47 fema… femin…
## 8 R5-D4 97 32 <NA> white, red red NA none mascu…
## 9 Biggs D… 183 84 black light brown 24 male mascu…
## 10 Obi-Wan… 182 77 auburn, w… fair blue-gray 57 male mascu…
## # ℹ 77 more rows
## # ℹ 6 more variables: homeworld <chr>, species <chr>, films <list>,
## # vehicles <list>, starships <list>, height_inches <dbl>
Notice that this new variable is not in the original data frame. Why? Because we didn’t assign it to our object.
starwars <- starwars %>% mutate(height_inches = height/2.54)
Self check: Create a new variable that is the sum of person’s mass and height
starwars %>% mutate(total = height + mass)
## # A tibble: 87 × 16
## name height mass hair_color skin_color eye_color birth_year sex gender
## <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr>
## 1 Luke Sk… 172 77 blond fair blue 19 male mascu…
## 2 C-3PO 167 75 <NA> gold yellow 112 none mascu…
## 3 R2-D2 96 32 <NA> white, bl… red 33 none mascu…
## 4 Darth V… 202 136 none white yellow 41.9 male mascu…
## 5 Leia Or… 150 49 brown light brown 19 fema… femin…
## 6 Owen La… 178 120 brown, gr… light blue 52 male mascu…
## 7 Beru Wh… 165 75 brown light blue 47 fema… femin…
## 8 R5-D4 97 32 <NA> white, red red NA none mascu…
## 9 Biggs D… 183 84 black light brown 24 male mascu…
## 10 Obi-Wan… 182 77 auburn, w… fair blue-gray 57 male mascu…
## # ℹ 77 more rows
## # ℹ 7 more variables: homeworld <chr>, species <chr>, films <list>,
## # vehicles <list>, starships <list>, height_inches <dbl>, total <dbl>
This will group data together and can make summary statistics. Let’s find the average height for each species.
starwars %>% group_by(species) %>% summarize(avg_height = mean(height))
## # A tibble: 38 × 2
## species avg_height
## <chr> <dbl>
## 1 Aleena 79
## 2 Besalisk 198
## 3 Cerean 198
## 4 Chagrian 196
## 5 Clawdite 168
## 6 Droid NA
## 7 Dug 112
## 8 Ewok 88
## 9 Geonosian 183
## 10 Gungan 209.
## # ℹ 28 more rows
Notice we have NA’s! We can get rid of those.
# Using na.omit()
starwars %>% na.omit() %>% group_by(species) %>% summarize(avg_height = mean(height))
## # A tibble: 11 × 2
## species avg_height
## <chr> <dbl>
## 1 Cerean 198
## 2 Ewok 88
## 3 Gungan 196
## 4 Human 179.
## 5 Kel Dor 188
## 6 Mirialan 168
## 7 Mon Calamari 180
## 8 Trandoshan 190
## 9 Twi'lek 178
## 10 Wookiee 228
## 11 Zabrak 175
# Using na.rm = T inside the mean function
starwars %>% group_by(species) %>% summarize(avg_height = mean(height, na.rm = T))
## # A tibble: 38 × 2
## species avg_height
## <chr> <dbl>
## 1 Aleena 79
## 2 Besalisk 198
## 3 Cerean 198
## 4 Chagrian 196
## 5 Clawdite 168
## 6 Droid 131.
## 7 Dug 112
## 8 Ewok 88
## 9 Geonosian 183
## 10 Gungan 209.
## # ℹ 28 more rows
Count the number of each species.
starwars %>% count(species)
## # A tibble: 38 × 2
## species n
## <chr> <int>
## 1 Aleena 1
## 2 Besalisk 1
## 3 Cerean 1
## 4 Chagrian 1
## 5 Clawdite 1
## 6 Droid 6
## 7 Dug 1
## 8 Ewok 1
## 9 Geonosian 1
## 10 Gungan 3
## # ℹ 28 more rows
Let’s use the penguins dataset that is in the
palmerpenguins package. Also load the
tidyverse package.
library(pacman)
p_load(tidyverse, palmerpenguins)
The dataset contains 344 observations of penguins at the Palmer Station in Antartica.
head(penguins)
## # A tibble: 6 × 8
## species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
## <fct> <fct> <dbl> <dbl> <int> <int>
## 1 Adelie Torgersen 39.1 18.7 181 3750
## 2 Adelie Torgersen 39.5 17.4 186 3800
## 3 Adelie Torgersen 40.3 18 195 3250
## 4 Adelie Torgersen NA NA NA NA
## 5 Adelie Torgersen 36.7 19.3 193 3450
## 6 Adelie Torgersen 39.3 20.6 190 3650
## # ℹ 2 more variables: sex <fct>, year <int>
Let’s use this dataset to create plots, following the code in “R for Data Science”.
The basic setup of making a ggplot requires three things: the data, the aesthetic mapping, and a geom. The aesthetic mappings describe how variables in the data are mapped to visual properties (aesthetics) of geoms, like which variables are on the axes, the variable to color or fill by, etc. The geoms tell R how to draw the data like points, lines, columns, etc.
In general, we can make a ggplot by typing the following: \[\text{ggplot(data = <DATA>, mapping = aes(<MAPPING>)) + <geom_function>()}\]
The way ggplot works is by adding layers. We can add a new layer with the + sign. Let’s build a ggplot step by step. First, start with ggplot() and tell R what data we are using.
ggplot(data = penguins)
Why did this make a blank graph? Well, we haven’t given R the aesthetic mapping yet so it doesn’t know what to put on top of the base layer. Let’s add the x and y variables, flipper length and body mass.
ggplot(data = penguins, mapping = aes(x = flipper_length_mm, y = body_mass_g)) # note you can put the mapping here or in the geom
Now we have a graph with axes and gridlines but no information on the graph. To get data on the graph, we need to tell R how we want to draw the data with a geom. To make a scatterplot, we use geom_point().
ggplot(data = penguins, mapping = aes(x = flipper_length_mm, y = body_mass_g)) + geom_point()
Suppose we want to see how the relationship between flipper length and body mass differs by species of penguin. We can represent each species on our scatterplot with different colored points.
ggplot(data = penguins,
mapping = aes(x = flipper_length_mm, y = body_mass_g, color = species)) +
geom_point()
Let’s add another layer: a curve displaying the relationship using
geom_smooth(). Using method = "lm" gives the
line of best fit.
ggplot(data = penguins,
mapping = aes(x = flipper_length_mm, y = body_mass_g, color = species)) +
geom_point() +
geom_smooth(method = "lm")
Note that this creates three separate lines, one for each species.
When the aesthetic mapping is defined in the `ggplot
function, the mappings are passed down to all of the subsequent layers.
You can also define aesthetic mappings at the local level. For example,
what if we wanted the points to be colored by species but only one line
of best fit?
ggplot(data = penguins,
mapping = aes(x = flipper_length_mm, y = body_mass_g)) +
geom_point(mapping = aes(color = species)) +
geom_smooth(method = "lm")
We can also change the shapes of the points.
ggplot(data = penguins,
mapping = aes(x = flipper_length_mm, y = body_mass_g)) +
geom_point(mapping = aes(color = species, shape = species)) +
geom_smooth(method = "lm")
Now let’s make it pretty! We can add titles, axis labels, themes and more! You can also get fancy and change the color of your points and lines.
ggplot(data = penguins,
mapping = aes(x = flipper_length_mm, y = body_mass_g)) +
geom_point(mapping = aes(color = species, shape = species)) +
geom_smooth(method = "lm") +
labs(title = "Body Mass and Flipper Length of Penguins",
x = "Flipper Length (mm)",
y = "Body Mass (g)") +
theme_minimal()
For categorical variables, we can use a bar chart. The height of the bars is how many observations occurred for each x value.
ggplot(penguins, aes(x = species)) + geom_bar()
For numerical variables, we can create histograms and density plots.
For histograms, changing the binwidth sets the intervals
for the x variable.
ggplot(penguins, aes(x = body_mass_g)) + geom_histogram(binwidth = 20)
ggplot(penguins, aes(x = body_mass_g)) + geom_histogram(binwidth = 200)
ggplot(penguins, aes(x = body_mass_g)) + geom_histogram(binwidth = 2000)
Density plots are smooth versions of histograms and are useful for continuous data.
ggplot(penguins, aes(x = body_mass_g)) + geom_density()
Try to reproduce this graph:
We need to get some data before we can start working with it! Download the “lab1.csv” file from Canvas. It should then be in your Downloads folder. Now, we need to make sure that we are working in the same place as our data is, so that R knows where to get the csv file from. Move the csv file to your current directory (the folder your are currently working out of.)
If you are unsure which directory you are in, the function
getwd() will give you the file path of your current
location.
getwd()
## [1] "/Users/jenniputz/Dropbox/Econ415/Winter 2025/R"
Now, if this is not the same location as your desired folder, you
need to tell R to set a new working directory. You can use
setwd() to tell R where to go.
For example, if I want to go to my Downloads folder on my Mac, I can
write setwd("/Users/jenniputz/Downloads"). If you are on
windows, your file path would start with a C:\ and use
\.
Once you are in your correct directory, we can read in our data using
read_csv(). Remember that everything in R is an object, so
you must give your data frame a name.
cps <- read_csv("lab1.csv")
## Rows: 8891 Columns: 5
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): educ
## dbl (4): employed, black, female, exper
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Today you will investigate racial disparities in the labor market using data from the Current Population Survey (CPS), a large survey administered by the US Census Bureau and the Bureau of Labor Statistics. The federal government uses the CPS to estimate the unemployment rate. Economists use the CPS to study a variety of topics in labor economics, including the effect of binding minimum wages, the gender pay gap, and returns to schooling. You will use a CPS sample of workers from Boston and Chicago to study employment patterns by race.
If you have not already loaded the tidyverse package,
please do so.
The data file is lab1.csv and is available on Canvas.
Import the data using read_csv if you have not already done
so.
By looking at the dataset, you can see that most of the variables are binary: they take values of either 1 or 0.
head(cps)
## # A tibble: 6 × 5
## employed black female educ exper
## <dbl> <dbl> <dbl> <chr> <dbl>
## 1 1 0 1 HS Graduate 20
## 2 0 0 0 HS Graduate 20
## 3 1 1 1 HS Dropout 12
## 4 1 1 0 HS Dropout 17
## 5 0 1 1 HS Graduate 21
## 6 1 0 1 College or Higher 13
For example, individuals with employed == 1 have a job
while those with employed == 0 do not.
What percentage of individuals in the sample are employed?
The mean of a binary variable gives the fraction of observations with values equal to 1.
mean(cps$employed)
## [1] NA
Something went wrong. If you use the summary function,
you’ll see that there are missing values (NAs) of
employed.
summary(cps$employed)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.0000 1.0000 1.0000 0.7811 1.0000 1.0000 18
When there are missing values, some functions, like
mean, will return a missing value as output. To circumvent
this, you can specify na.rm = TRUE in the mean
function:
mean(cps$employed, na.rm = TRUE)
## [1] 0.7811338
The employment rate is 78 percent.
What are the employment rates by race and gender?
cps %>%
group_by(black, female) %>%
summarize(employed = mean(employed, na.rm = TRUE))
## `summarise()` has grouped output by 'black'. You can override using the
## `.groups` argument.
## # A tibble: 4 × 3
## # Groups: black [2]
## black female employed
## <dbl> <dbl> <dbl>
## 1 0 0 0.868
## 2 0 1 0.723
## 3 1 0 0.718
## 4 1 1 0.693
You can see that the employment rate is
black == 0 and
female == 0)black == 0 and
female == 1)black == 1 and
female == 0)black == 1 and
female == 1).What is the average difference in employment status between black individuals and white individuals?
To find the difference,
filter function to restrict the sample to one
group (black or white)mean to calculate the group meanblack_emp_df <- cps %>% filter(black == 1)
black_emp <- mean(black_emp_df$employed, na.rm = T)
black_emp
## [1] 0.7035928
white_emp_df <- cps %>% filter(black == 0)
white_emp <- mean(white_emp_df$employed, na.rm = T)
white_emp
## [1] 0.7948786
black_emp - white_emp
## [1] -0.09128578
The employment rate is 9 percentage points lower for black individuals than for white individuals.
Does this mean that there is racial disparity?
Not yet. We still don’t know if the difference is statistically significant. You can find out by conducting a \(t\)-test of the null hypothesis that the true difference-in-means is zero against the alternative hypothesis that the difference is nonzero.
To conduct the test, you need to calculate the \(t\)-statistic for the difference-in-means, which is given by
\[t =
\dfrac{\overline{\text{Employed}}_\text{Black} -
\overline{\text{Employed}}_\text{White}}{\sqrt{\frac{S^2_\text{Black}}{N_\text{Black}}
+ \frac{S^2_\text{White}}{N_\text{White}}}}\] We can conduct the
\(t\)-test using the
t.test() function. Here, you input the variable names for
x and y, and tell R if you want a two-sided or
one-sided test.
t.test(black_emp_df$employed, white_emp_df$employed, alternative = "two.sided", var.equal = FALSE)
##
## Welch Two Sample t-test
##
## data: black_emp_df$employed and white_emp_df$employed
## t = -6.845, df = 1724.5, p-value = 1.059e-11
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.11744252 -0.06512905
## sample estimates:
## mean of x mean of y
## 0.7035928 0.7948786
To conclude your test, compare your \(t\)-stat to 2 (an approximation for the critical value of \(t\) in 5 percent test–we will do more rigorous hypothesis testing later in the course). If \(|t| > 2\), then you can reject the null hypothesis. Your \(t\)-stat of -6.86 is certainly more extreme than 2, so you can reject the null hypothesis. This means that the difference in employment rates is statistically significant. There is a racial disparity in employment.
Does the disparity in employment rates provide causal evidence of racial discrimination in hiring? Does the comparison of employment rates by race hold all else constant? What else could explain the gap?
library(pacman)
p_load(tidyverse, broom, stargazer, AER)
# Note: broom package contains the tidy() function
For this lab, we will use data from the AER package on
California schools. To get this data and use it we can:
# Note: if you install and load the AER package, the dataset is included in the package. Then run the line below:
data("CASchools")
# look at a snapshot
head(CASchools, 10)
## district school county grades students
## 1 75119 Sunol Glen Unified Alameda KK-08 195
## 2 61499 Manzanita Elementary Butte KK-08 240
## 3 61549 Thermalito Union Elementary Butte KK-08 1550
## 4 61457 Golden Feather Union Elementary Butte KK-08 243
## 5 61523 Palermo Union Elementary Butte KK-08 1335
## 6 62042 Burrel Union Elementary Fresno KK-08 137
## 7 68536 Holt Union Elementary San Joaquin KK-08 195
## 8 63834 Vineland Elementary Kern KK-08 888
## 9 62331 Orange Center Elementary Fresno KK-08 379
## 10 67306 Del Paso Heights Elementary Sacramento KK-06 2247
## teachers calworks lunch computer expenditure income english read
## 1 10.90 0.5102 2.0408 67 6384.911 22.690001 0.000000 691.6
## 2 11.15 15.4167 47.9167 101 5099.381 9.824000 4.583333 660.5
## 3 82.90 55.0323 76.3226 169 5501.955 8.978000 30.000002 636.3
## 4 14.00 36.4754 77.0492 85 7101.831 8.978000 0.000000 651.9
## 5 71.50 33.1086 78.4270 171 5235.988 9.080333 13.857677 641.8
## 6 6.40 12.3188 86.9565 25 5580.147 10.415000 12.408759 605.7
## 7 10.00 12.9032 94.6237 28 5253.331 6.577000 68.717949 604.5
## 8 42.50 18.8063 100.0000 66 4565.746 8.174000 46.959461 605.5
## 9 19.00 32.1900 93.1398 35 5355.548 7.385000 30.079157 608.9
## 10 108.00 78.9942 87.3164 0 5036.211 11.613333 40.275921 611.9
## math
## 1 690.0
## 2 661.9
## 3 650.9
## 4 643.5
## 5 639.9
## 6 605.4
## 7 609.0
## 8 612.5
## 9 616.1
## 10 613.4
names(CASchools)
## [1] "district" "school" "county" "grades" "students"
## [6] "teachers" "calworks" "lunch" "computer" "expenditure"
## [11] "income" "english" "read" "math"
To do a regression in R, we use lm(). The basic steup:
name <- lm(y ~ x, data = name_of_df).
Let’s regress reading scores on student expenditure.
lm(read ~ expenditure, data = CASchools)
##
## Call:
## lm(formula = read ~ expenditure, data = CASchools)
##
## Coefficients:
## (Intercept) expenditure
## 6.182e+02 6.912e-03
The output from lm() gives us an intercept coefficient,
\(\hat{\beta}_0\), and a slope
coefficient, \(\hat{\beta}_1\).
Let’s run another regression. Regress math scores on student expenditure.
lm(math ~ expenditure, data = CASchools)
##
## Call:
## lm(formula = math ~ expenditure, data = CASchools)
##
## Coefficients:
## (Intercept) expenditure
## 6.290e+02 4.585e-03
On your own, try to code a regression for \(Math_i = \beta_0 + \beta_1 Income_i + u_i\).
lm(math ~ income, data = CASchools)
##
## Call:
## lm(formula = math ~ income, data = CASchools)
##
## Coefficients:
## (Intercept) income
## 625.539 1.815
Now that we know how to run a regression, let’s talk about how to look at the output. The output above wasn’t super informative…
summary()The first option is to use the summary function in R. There are two ways to do this:
Save your regression as an object in R so that we can then use that object later.
reg1 <- lm(math ~ income, data = CASchools)
summary(reg1)
##
## Call:
## lm(formula = math ~ income, data = CASchools)
##
## Residuals:
## Min 1Q Median 3Q Max
## -39.045 -8.997 0.308 8.416 34.246
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 625.53948 1.53627 407.18 <2e-16 ***
## income 1.81523 0.09073 20.01 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 13.42 on 418 degrees of freedom
## Multiple R-squared: 0.4892, Adjusted R-squared: 0.4879
## F-statistic: 400.3 on 1 and 418 DF, p-value: < 2.2e-16
We have the intercept coefficient, \(\hat{\beta}_0 = 625\), and the slope
coefficient \(\hat{\beta}_1 = 1.8\).
summary() also gives us other information that we didn’t
have before like the standard error, the t-score, the p-value, and the
\(R^2\). The stars on the p-value tell
us if the coefficient is statistically significant (more on this after
the midterm).
tidy()Another way to make nice regression output is to use the
tidy() function from the broom package. To use
this, you must have loaded the broom package in
p_load(). The process is similar:
# since we have already created reg1 as an object, we can just call it without having to redo the regression
tidy(reg1)
## # A tibble: 2 × 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 626. 1.54 407. 0
## 2 income 1.82 0.0907 20.0 5.99e-63
tidy() puts the information from summary()
into a much nicer looking table.
By far, the most powerful tool for making amazing tables in R is the
stargazer package. I would encourage you to try this out if
you are feeling up for it! There is a very helpful cheatsheet linked
here.
First, let’s try to make a simple table with stargazer()
using our reg1 object.
stargazer(reg1)
##
## % Table created by stargazer v.5.2.3 by Marek Hlavac, Social Policy Institute. E-mail: marek.hlavac at gmail.com
## % Date and time: Tue, Mar 31, 2026 - 13:15:59
## \begin{table}[!htbp] \centering
## \caption{}
## \label{}
## \begin{tabular}{@{\extracolsep{5pt}}lc}
## \\[-1.8ex]\hline
## \hline \\[-1.8ex]
## & \multicolumn{1}{c}{\textit{Dependent variable:}} \\
## \cline{2-2}
## \\[-1.8ex] & math \\
## \hline \\[-1.8ex]
## income & 1.815$^{***}$ \\
## & (0.091) \\
## & \\
## Constant & 625.539$^{***}$ \\
## & (1.536) \\
## & \\
## \hline \\[-1.8ex]
## Observations & 420 \\
## R$^{2}$ & 0.489 \\
## Adjusted R$^{2}$ & 0.488 \\
## Residual Std. Error & 13.420 (df = 418) \\
## F Statistic & 400.257$^{***}$ (df = 1; 418) \\
## \hline
## \hline \\[-1.8ex]
## \textit{Note:} & \multicolumn{1}{r}{$^{*}$p$<$0.1; $^{**}$p$<$0.05; $^{***}$p$<$0.01} \\
## \end{tabular}
## \end{table}
Huh, weird output right? Stargazer is defaulting to TeX output. We can change the type of output we want.
1: As text (use this for your problem set)
stargazer(reg1, type = "text")
##
## ===============================================
## Dependent variable:
## ---------------------------
## math
## -----------------------------------------------
## income 1.815***
## (0.091)
##
## Constant 625.539***
## (1.536)
##
## -----------------------------------------------
## Observations 420
## R2 0.489
## Adjusted R2 0.488
## Residual Std. Error 13.420 (df = 418)
## F Statistic 400.257*** (df = 1; 418)
## ===============================================
## Note: *p<0.1; **p<0.05; ***p<0.01
Stargazer can also do html output and MS word output (although word output is not recommended).
Stargazer gave us some stats that we don’t really care about… let’s get rid of them…
stargazer(reg1, keep.stat =c("rsq", "n"), type = "text")
##
## ========================================
## Dependent variable:
## ---------------------------
## math
## ----------------------------------------
## income 1.815***
## (0.091)
##
## Constant 625.539***
## (1.536)
##
## ----------------------------------------
## Observations 420
## R2 0.489
## ========================================
## Note: *p<0.1; **p<0.05; ***p<0.01
Looks great!
Unlike the tables we made using summary() and
broom(), stargazer can combine multiple regressions into
one table!
# do another regression and save it as an object
reg2 <- lm(read ~ income, data = CASchools)
stargazer(reg1, reg2, keep.stat =c("rsq", "n"), type = "text")
##
## =========================================
## Dependent variable:
## ----------------------------
## math read
## (1) (2)
## -----------------------------------------
## income 1.815*** 1.942***
## (0.091) (0.097)
##
## Constant 625.539*** 625.228***
## (1.536) (1.651)
##
## -----------------------------------------
## Observations 420 420
## R2 0.489 0.487
## =========================================
## Note: *p<0.1; **p<0.05; ***p<0.01
The variables in the regressions don’t even have to be the same!
# do another regression and save it as an object
reg3 <- lm(read ~ expenditure, data = CASchools)
stargazer(reg1, reg2, reg3, keep.stat =c("rsq", "n"), type = "text")
##
## =============================================
## Dependent variable:
## --------------------------------
## math read
## (1) (2) (3)
## ---------------------------------------------
## income 1.815*** 1.942***
## (0.091) (0.097)
##
## expenditure 0.007***
## (0.002)
##
## Constant 625.539*** 625.228*** 618.249***
## (1.536) (1.651) (8.101)
##
## ---------------------------------------------
## Observations 420 420 420
## R2 0.489 0.487 0.047
## =============================================
## Note: *p<0.1; **p<0.05; ***p<0.01
We can get really fancy with stargazer(). For more info,
see the cheatsheet on Canvas.
Last, we can use ggplot and fit a regression line through our data. Let’s start by making a scatterplot with math scores on the y axis and income on the x axis.
ggplot(data = CASchools, aes(x = income, y = math)) + geom_point()
To add a regression line, we use the stat_smooth()
function:
# method = "lm" tells R that we are using OLS. se = FALSE removes the standard error bars.
ggplot(data = CASchools, aes(x = income, y = math)) + geom_point() + stat_smooth(method = "lm", se = FALSE)
## `geom_smooth()` using formula = 'y ~ x'
Instructions:
Use tidyverse functions to create a new variable for
the Student-Teacher Ratio (students/teachers). Call this variable
STR.
Estimate the regression models: \[ math_i = \beta_0 + \beta_1 STR_i + u_i\] and \[ read_i = \beta_0 + \beta_1 STR_i + u_i\]
Summarize your regression output from the above regressions in a table.
Please upload your .R file to today’s Canvas assignment.
library(pacman)
p_load(tidyverse, stargazer, broom)
wages <- read_csv("wages.csv")
Let’s look at the data frame:
head(wages,10)
## # A tibble: 10 × 34
## id nearc2 nearc4 educ age fatheduc motheduc weight momdad14 sinmom14
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 3 0 0 12 27 8 8 380166 1 0
## 2 4 0 0 12 34 14 12 367470 1 0
## 3 5 1 1 11 27 11 12 380166 1 0
## 4 6 1 1 12 34 8 7 367470 1 0
## 5 7 1 1 12 26 9 12 380166 1 0
## 6 8 1 1 18 33 14 14 367470 1 0
## 7 9 1 1 14 29 14 14 496635 1 0
## 8 10 1 1 12 28 12 12 367772 1 0
## 9 11 1 1 12 29 12 12 480445 1 0
## 10 12 1 1 9 28 11 12 380166 1 0
## # ℹ 24 more variables: step14 <dbl>, reg661 <dbl>, reg662 <dbl>, reg663 <dbl>,
## # reg664 <dbl>, reg665 <dbl>, reg666 <dbl>, reg667 <dbl>, reg668 <dbl>,
## # reg669 <dbl>, south66 <dbl>, black <dbl>, smsa <dbl>, south <dbl>,
## # smsa66 <dbl>, wage <dbl>, enroll <dbl>, KWW <dbl>, IQ <dbl>, married <dbl>,
## # libcrd14 <dbl>, exper <dbl>, lwage <dbl>, expersq <dbl>
We know how to do simple OLS–doing multiple OLS is just one easy step further! Let’s run a regression of log wages on education and experience:
reg <- lm(lwage ~ educ + exper, data = wages)
summary(reg)
##
## Call:
## lm(formula = lwage ~ educ + exper, data = wages)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.57440 -0.22238 0.02188 0.25802 1.17969
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.834500 0.092866 52.06 <2e-16 ***
## educ 0.080330 0.005259 15.27 <2e-16 ***
## exper 0.046395 0.003306 14.03 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3877 on 1601 degrees of freedom
## Multiple R-squared: 0.1458, Adjusted R-squared: 0.1447
## F-statistic: 136.6 on 2 and 1601 DF, p-value: < 2.2e-16
R gives us the t-stat and p-value so we can easily conduct hypothesis tests. Suppose we want to test \[H_0\text{: } \beta_1 = 0\] vs \[H_1\text{: } \beta_1 \neq 0\] at the 5% level. We can look at the p-value and see that \(p < \alpha\) so we can reject the null hypothesis at 5%.
Like last lab, we can use stargazer to summarize our regression results. This time, let’s keep the adjusted \(R^2\) in our table.
stargazer(reg, keep.stat = c("rsq", "adj.rsq", "n"), type = "text")
##
## ========================================
## Dependent variable:
## ---------------------------
## lwage
## ----------------------------------------
## educ 0.080***
## (0.005)
##
## exper 0.046***
## (0.003)
##
## Constant 4.835***
## (0.093)
##
## ----------------------------------------
## Observations 1,604
## R2 0.146
## Adjusted R2 0.145
## ========================================
## Note: *p<0.1; **p<0.05; ***p<0.01
How do we get confidence intervals in R? Let’s do a 95% confidence interval.
Option 1: Construct by hand. We have the standard error and we have \(\hat{\beta}_1\). We need to get the critical value of t.
# We can use the qt() function to get the critical value of t. First put in your significance level.
# for a two-tail test, make sure to divide by 2
# then the degrees of freedom. we have 1604-3 degrees of freedom
qt(.05/2, 1601)
## [1] -1.961447
The critical value of t is 1.96. Now we can plug that in to our confidence interval formula: \[ .080330 - 1.96(.005259) \leq \beta_1 \leq .080330 + 1.96(.005259) \] Which simplifies to: \[ .0700 \leq \beta_1 \leq .0906\]
Option 2: Let R tell us. The tidy() function will give
us the confidence interval if we tell it!
tidy(reg, conf.int = T)
## # A tibble: 3 × 7
## term estimate std.error statistic p.value conf.low conf.high
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 4.83 0.0929 52.1 0 4.65 5.02
## 2 educ 0.0803 0.00526 15.3 2.86e-49 0.0700 0.0906
## 3 exper 0.0464 0.00331 14.0 2.80e-42 0.0399 0.0529
The confidence interval matches our by-hand solution!
But this is for a 95%. How do we get tidy() to give us a
different confidence level? What if we want a 90% confidence
interval?
tidy(reg, conf.int = T, conf.level = .9)
## # A tibble: 3 × 7
## term estimate std.error statistic p.value conf.low conf.high
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 4.83 0.0929 52.1 0 4.68 4.99
## 2 educ 0.0803 0.00526 15.3 2.86e-49 0.0717 0.0890
## 3 exper 0.0464 0.00331 14.0 2.80e-42 0.0410 0.0518
reg1 <- lm(lwage ~ educ, data = wages)
reg2 <- lm(lwage ~ educ + exper, data = wages)
stargazer(reg1, reg2, keep.stat = c("rsq", "adj.rsq", "n"), type = "text")
##
## =========================================
## Dependent variable:
## ----------------------------
## lwage
## (1) (2)
## -----------------------------------------
## educ 0.037*** 0.080***
## (0.005) (0.005)
##
## exper 0.046***
## (0.003)
##
## Constant 5.816*** 4.835***
## (0.065) (0.093)
##
## -----------------------------------------
## Observations 1,604 1,604
## R2 0.041 0.146
## Adjusted R2 0.040 0.145
## =========================================
## Note: *p<0.1; **p<0.05; ***p<0.01
What is the sign of the OVB? Use the formula: \[ Bias = \beta_2 \frac{cov(X_1, X_2)}{var(X_1)} \] \(\beta_2 = .046\), which is positive. We can get the covariance by using the covariance function.
cov(wages$educ, wages$exper)
## [1] -4.759561
There is a negative bias from omitting experience.
Let’s do another regression.
reg3 <- lm(lwage ~ educ + exper + IQ, data = wages)
stargazer(reg1, reg2, reg3, keep.stat = c("rsq", "adj.rsq", "n"), type = "text")
##
## ==========================================
## Dependent variable:
## -----------------------------
## lwage
## (1) (2) (3)
## ------------------------------------------
## educ 0.037*** 0.080*** 0.069***
## (0.005) (0.005) (0.006)
##
## exper 0.046*** 0.048***
## (0.003) (0.003)
##
## IQ 0.004***
## (0.001)
##
## Constant 5.816*** 4.835*** 4.587***
## (0.065) (0.093) (0.104)
##
## ------------------------------------------
## Observations 1,604 1,604 1,604
## R2 0.041 0.146 0.159
## Adjusted R2 0.040 0.145 0.158
## ==========================================
## Note: *p<0.1; **p<0.05; ***p<0.01
How are IQ and education correlated? Check the OVB formula again. Here, \(\beta_3 = .004\) which is positive. We can check the covariance to get the sign of the bias:
cov(wages$educ, wages$IQ)
## [1] 16.96962
IQ and education are positively correlated and there is a positive bias from omitting IQ.
We can keep adding variables to our regression:
reg4 <- lm(lwage ~ educ + exper + IQ + KWW, data = wages)
reg5 <- lm(lwage ~ educ + exper + IQ + KWW + fatheduc, data = wages)
reg6 <- lm(lwage ~ educ + exper + IQ + KWW + fatheduc + motheduc, data = wages)
stargazer(reg1, reg2, reg3, reg4, reg5, reg6, keep.stat = c("rsq", "adj.rsq", "n"), type = "text")
##
## ==================================================================
## Dependent variable:
## -----------------------------------------------------
## lwage
## (1) (2) (3) (4) (5) (6)
## ------------------------------------------------------------------
## educ 0.037*** 0.080*** 0.069*** 0.058*** 0.058*** 0.057***
## (0.005) (0.005) (0.006) (0.006) (0.006) (0.006)
##
## exper 0.046*** 0.048*** 0.041*** 0.042*** 0.042***
## (0.003) (0.003) (0.004) (0.004) (0.004)
##
## IQ 0.004*** 0.003*** 0.003*** 0.003***
## (0.001) (0.001) (0.001) (0.001)
##
## KWW 0.006*** 0.006*** 0.006***
## (0.002) (0.002) (0.002)
##
## fatheduc 0.003 -0.0003
## (0.003) (0.004)
##
## motheduc 0.008*
## (0.004)
##
## Constant 5.816*** 4.835*** 4.587*** 4.672*** 4.664*** 4.634***
## (0.065) (0.093) (0.104) (0.106) (0.106) (0.108)
##
## ------------------------------------------------------------------
## Observations 1,604 1,604 1,604 1,604 1,604 1,604
## R2 0.041 0.146 0.159 0.167 0.168 0.170
## Adjusted R2 0.040 0.145 0.158 0.165 0.165 0.167
## ==================================================================
## Note: *p<0.1; **p<0.05; ***p<0.01
Let’s suppose we want to test to see if the coefficients on father’s education and mother’s education are equal to zero at the 5% level. The null hypothesis is: \[H_0\text{: } \beta_5 = \beta_6 = 0\] The alternative hypothesis is: \[H_1\text{: either } \beta_5 \neq 0 \text{ or } \beta_6 \neq 0\]
To do the test, we need to identify our restricted model and
our unrestricted model. reg4 is the restricted
model and reg6 is the unrestricted.
Recall, the F-stat is equal to: \[F = \frac{(R^2_u - R^2_r)/q}{(1-R^2_u)/(n-k-1)}\]
The easy way and have R do it for us: Load the car
package to use this function:
p_load(car)
linearHypothesis(reg6, c("fatheduc=0", "motheduc=0"))
##
## Linear hypothesis test:
## fatheduc = 0
## motheduc = 0
##
## Model 1: restricted model
## Model 2: lwage ~ educ + exper + IQ + KWW + fatheduc + motheduc
##
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 1599 234.57
## 2 1597 233.93 2 0.64137 2.1892 0.1123
library(pacman)
p_load(tidyverse, nycflights13)
Look at your data by viewing it or using the head()
function. The dataframe is called flights. his data frame
contains all 336,776 flights that departed from New York City in 2013.
The data comes from the US Bureau of Transportation Statistics.
head(flights)
## # A tibble: 6 × 19
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2013 1 1 517 515 2 830 819
## 2 2013 1 1 533 529 4 850 830
## 3 2013 1 1 542 540 2 923 850
## 4 2013 1 1 544 545 -1 1004 1022
## 5 2013 1 1 554 600 -6 812 837
## 6 2013 1 1 554 558 -4 740 728
## # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## # tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## # hour <dbl>, minute <dbl>, time_hour <dttm>
You have already learned how to use the tidyverse functions, but it may be useful to have some review. Recall, using functions from dplyr, we can manipulate our data frame. We can:
Today, we will review how to use filter, mutate, and group_by/summarize.
filter() allows you to subset observations based on
their values. For example, we can select all flights in January
with:
jan_flights <- flights %>% filter(month == 1)
jan_flights
## # A tibble: 27,004 × 19
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2013 1 1 517 515 2 830 819
## 2 2013 1 1 533 529 4 850 830
## 3 2013 1 1 542 540 2 923 850
## 4 2013 1 1 544 545 -1 1004 1022
## 5 2013 1 1 554 600 -6 812 837
## 6 2013 1 1 554 558 -4 740 728
## 7 2013 1 1 555 600 -5 913 854
## 8 2013 1 1 557 600 -3 709 723
## 9 2013 1 1 557 600 -3 838 846
## 10 2013 1 1 558 600 -2 753 745
## # ℹ 26,994 more rows
## # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## # tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## # hour <dbl>, minute <dbl>, time_hour <dttm>
Boolean operators that work with dply: & is “and”, | is “or”, and ! is “not”.
The following code finds all flights that departed in November or December:
nov_dec <- flights %>% filter(month == 11 | month == 12)
nov_dec
## # A tibble: 55,403 × 19
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2013 11 1 5 2359 6 352 345
## 2 2013 11 1 35 2250 105 123 2356
## 3 2013 11 1 455 500 -5 641 651
## 4 2013 11 1 539 545 -6 856 827
## 5 2013 11 1 542 545 -3 831 855
## 6 2013 11 1 549 600 -11 912 923
## 7 2013 11 1 550 600 -10 705 659
## 8 2013 11 1 554 600 -6 659 701
## 9 2013 11 1 554 600 -6 826 827
## 10 2013 11 1 554 600 -6 749 751
## # ℹ 55,393 more rows
## # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## # tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## # hour <dbl>, minute <dbl>, time_hour <dttm>
You can also use the %in% operator to achieve the same
thing:
nov_dec <- flights %>% filter(month %in% c(11,12))
nov_dec
## # A tibble: 55,403 × 19
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2013 11 1 5 2359 6 352 345
## 2 2013 11 1 35 2250 105 123 2356
## 3 2013 11 1 455 500 -5 641 651
## 4 2013 11 1 539 545 -6 856 827
## 5 2013 11 1 542 545 -3 831 855
## 6 2013 11 1 549 600 -11 912 923
## 7 2013 11 1 550 600 -10 705 659
## 8 2013 11 1 554 600 -6 659 701
## 9 2013 11 1 554 600 -6 826 827
## 10 2013 11 1 554 600 -6 749 751
## # ℹ 55,393 more rows
## # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## # tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## # hour <dbl>, minute <dbl>, time_hour <dttm>
Find flights that weren’t delayed (on arrival or departure) by more than two hours:
nondelays <- flights %>% filter(arr_delay <= 120 & dep_delay <= 120)
nondelays
## # A tibble: 316,050 × 19
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2013 1 1 517 515 2 830 819
## 2 2013 1 1 533 529 4 850 830
## 3 2013 1 1 542 540 2 923 850
## 4 2013 1 1 544 545 -1 1004 1022
## 5 2013 1 1 554 600 -6 812 837
## 6 2013 1 1 554 558 -4 740 728
## 7 2013 1 1 555 600 -5 913 854
## 8 2013 1 1 557 600 -3 709 723
## 9 2013 1 1 557 600 -3 838 846
## 10 2013 1 1 558 600 -2 753 745
## # ℹ 316,040 more rows
## # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## # tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## # hour <dbl>, minute <dbl>, time_hour <dttm>
mutate() allows you to create new variables. Make sure
to save the resulting data frame so that it is in your global
environment.
Here, we create two variables, gain and
speed.
new_df <- flights %>% mutate(
gain = dep_delay - arr_delay,
speed = distance / air_time * 60
)
new_df
## # A tibble: 336,776 × 21
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2013 1 1 517 515 2 830 819
## 2 2013 1 1 533 529 4 850 830
## 3 2013 1 1 542 540 2 923 850
## 4 2013 1 1 544 545 -1 1004 1022
## 5 2013 1 1 554 600 -6 812 837
## 6 2013 1 1 554 558 -4 740 728
## 7 2013 1 1 555 600 -5 913 854
## 8 2013 1 1 557 600 -3 709 723
## 9 2013 1 1 557 600 -3 838 846
## 10 2013 1 1 558 600 -2 753 745
## # ℹ 336,766 more rows
## # ℹ 13 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## # tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## # hour <dbl>, minute <dbl>, time_hour <dttm>, gain <dbl>, speed <dbl>
summarize() collapses a data frame into a single row. It
is most used in conjunction with group_by().
Imagine that we want to explore the relationship between the distance and average delay for each location. Using what you know about dplyr, you might write code like this:
delays <- flights %>%
group_by(dest) %>%
summarise(
count = n(),
dist = mean(distance, na.rm = TRUE),
delay = mean(arr_delay, na.rm = TRUE)
)
delays
## # A tibble: 105 × 4
## dest count dist delay
## <chr> <int> <dbl> <dbl>
## 1 ABQ 254 1826 4.38
## 2 ACK 265 199 4.85
## 3 ALB 439 143 14.4
## 4 ANC 8 3370 -2.5
## 5 ATL 17215 757. 11.3
## 6 AUS 2439 1514. 6.02
## 7 AVL 275 584. 8.00
## 8 BDL 443 116 7.05
## 9 BGR 375 378 8.03
## 10 BHM 297 866. 16.9
## # ℹ 95 more rows
Sometimes, the data we want to use is contained in two separate data frames. We want a way to merge those data frames together.
The different merges we can do:
inner_join(x, y) only includes observations that having
matching x and y key values. Rows of x can be dropped/filtered.left_join(x, y) includes all observations in x,
regardless of whether they match or not. This is the most commonly used
join because it ensures that you don’t lose observations from your
primary table.right_join(x, y) includes all observations in y. It’s
equivalent to left_join(y, x), but the columns will be ordered
differently.full_join(x, y) includes all observations from x and
ymerge(x, y) or inner_join(x, y) uses all
variables with common names across the two tablesThe nycflights13 package also contains a data frame
called weather. Let’s look at it:
head(weather)
## # A tibble: 6 × 15
## origin year month day hour temp dewp humid wind_dir wind_speed wind_gust
## <chr> <int> <int> <int> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 EWR 2013 1 1 1 39.0 26.1 59.4 270 10.4 NA
## 2 EWR 2013 1 1 2 39.0 27.0 61.6 250 8.06 NA
## 3 EWR 2013 1 1 3 39.0 28.0 64.4 240 11.5 NA
## 4 EWR 2013 1 1 4 39.9 28.0 62.2 250 12.7 NA
## 5 EWR 2013 1 1 5 39.0 28.0 64.4 260 12.7 NA
## 6 EWR 2013 1 1 6 37.9 28.0 67.2 240 11.5 NA
## # ℹ 4 more variables: precip <dbl>, pressure <dbl>, visib <dbl>,
## # time_hour <dttm>
Notice that there are common variable names between the
flights and weather data frame. We can join
them together.
flights2 <- left_join(x = flights, y = weather)
## Joining with `by = join_by(year, month, day, origin, hour, time_hour)`
flights2
## # A tibble: 336,776 × 28
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2013 1 1 517 515 2 830 819
## 2 2013 1 1 533 529 4 850 830
## 3 2013 1 1 542 540 2 923 850
## 4 2013 1 1 544 545 -1 1004 1022
## 5 2013 1 1 554 600 -6 812 837
## 6 2013 1 1 554 558 -4 740 728
## 7 2013 1 1 555 600 -5 913 854
## 8 2013 1 1 557 600 -3 709 723
## 9 2013 1 1 557 600 -3 838 846
## 10 2013 1 1 558 600 -2 753 745
## # ℹ 336,766 more rows
## # ℹ 20 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## # tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## # hour <dbl>, minute <dbl>, time_hour <dttm>, temp <dbl>, dewp <dbl>,
## # humid <dbl>, wind_dir <dbl>, wind_speed <dbl>, wind_gust <dbl>,
## # precip <dbl>, pressure <dbl>, visib <dbl>
Notice the message Joining by: c(“year”, “month”, “day”, “hour”,
“origin”), which indicates the variables used for joining. If you want
to set specific variables to join with, you can use
by = c() in the argument of the join, like so:
flights2 <- left_join(x = flights, y = weather, by = c("year", "month", "day", "hour", "origin"))
flights2
## # A tibble: 336,776 × 29
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2013 1 1 517 515 2 830 819
## 2 2013 1 1 533 529 4 850 830
## 3 2013 1 1 542 540 2 923 850
## 4 2013 1 1 544 545 -1 1004 1022
## 5 2013 1 1 554 600 -6 812 837
## 6 2013 1 1 554 558 -4 740 728
## 7 2013 1 1 555 600 -5 913 854
## 8 2013 1 1 557 600 -3 709 723
## 9 2013 1 1 557 600 -3 838 846
## 10 2013 1 1 558 600 -2 753 745
## # ℹ 336,766 more rows
## # ℹ 21 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## # tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## # hour <dbl>, minute <dbl>, time_hour.x <dttm>, temp <dbl>, dewp <dbl>,
## # humid <dbl>, wind_dir <dbl>, wind_speed <dbl>, wind_gust <dbl>,
## # precip <dbl>, pressure <dbl>, visib <dbl>, time_hour.y <dttm>
Note that these two joins are equivalent.
Let’s try the other merges:
First, using right_join():
flights3 <- right_join(x = flights, y = weather)
## Joining with `by = join_by(year, month, day, origin, hour, time_hour)`
flights3
## # A tibble: 341,957 × 28
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2013 1 1 517 515 2 830 819
## 2 2013 1 1 533 529 4 850 830
## 3 2013 1 1 542 540 2 923 850
## 4 2013 1 1 544 545 -1 1004 1022
## 5 2013 1 1 554 600 -6 812 837
## 6 2013 1 1 554 558 -4 740 728
## 7 2013 1 1 555 600 -5 913 854
## 8 2013 1 1 557 600 -3 709 723
## 9 2013 1 1 557 600 -3 838 846
## 10 2013 1 1 558 600 -2 753 745
## # ℹ 341,947 more rows
## # ℹ 20 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## # tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## # hour <dbl>, minute <dbl>, time_hour <dttm>, temp <dbl>, dewp <dbl>,
## # humid <dbl>, wind_dir <dbl>, wind_speed <dbl>, wind_gust <dbl>,
## # precip <dbl>, pressure <dbl>, visib <dbl>
Using inner_join() or merge():
flights4 <- inner_join(x = flights, y = weather)
## Joining with `by = join_by(year, month, day, origin, hour, time_hour)`
flights4
## # A tibble: 335,220 × 28
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2013 1 1 517 515 2 830 819
## 2 2013 1 1 533 529 4 850 830
## 3 2013 1 1 542 540 2 923 850
## 4 2013 1 1 544 545 -1 1004 1022
## 5 2013 1 1 554 600 -6 812 837
## 6 2013 1 1 554 558 -4 740 728
## 7 2013 1 1 555 600 -5 913 854
## 8 2013 1 1 557 600 -3 709 723
## 9 2013 1 1 557 600 -3 838 846
## 10 2013 1 1 558 600 -2 753 745
## # ℹ 335,210 more rows
## # ℹ 20 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## # tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## # hour <dbl>, minute <dbl>, time_hour <dttm>, temp <dbl>, dewp <dbl>,
## # humid <dbl>, wind_dir <dbl>, wind_speed <dbl>, wind_gust <dbl>,
## # precip <dbl>, pressure <dbl>, visib <dbl>
Using full_join():
flights5 <- full_join(x = flights, y = weather)
## Joining with `by = join_by(year, month, day, origin, hour, time_hour)`
flights5
## # A tibble: 343,513 × 28
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2013 1 1 517 515 2 830 819
## 2 2013 1 1 533 529 4 850 830
## 3 2013 1 1 542 540 2 923 850
## 4 2013 1 1 544 545 -1 1004 1022
## 5 2013 1 1 554 600 -6 812 837
## 6 2013 1 1 554 558 -4 740 728
## 7 2013 1 1 555 600 -5 913 854
## 8 2013 1 1 557 600 -3 709 723
## 9 2013 1 1 557 600 -3 838 846
## 10 2013 1 1 558 600 -2 753 745
## # ℹ 343,503 more rows
## # ℹ 20 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## # tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## # hour <dbl>, minute <dbl>, time_hour <dttm>, temp <dbl>, dewp <dbl>,
## # humid <dbl>, wind_dir <dbl>, wind_speed <dbl>, wind_gust <dbl>,
## # precip <dbl>, pressure <dbl>, visib <dbl>
See chapter 19 of R for Data Science 2e (linked on canvas) for a more in-depth tutorial on joining datasets.
#load the tidyverse package if it is not already loaded
library(pacman)
p_load(tidyverse, carData)
Our dataset today is Salaries, containing the salary for
professors of different rank (assistant professor, associate professor
and professor), from different disciplines (two levels: A and B) and
binary gender data (male, female)- among other variables.
head(Salaries)
## rank discipline yrs.since.phd yrs.service sex salary
## 1 Prof B 19 18 Male 139750
## 2 Prof B 20 16 Male 173200
## 3 AsstProf B 4 3 Male 79750
## 4 Prof B 45 39 Male 115000
## 5 Prof B 40 41 Male 141500
## 6 AssocProf B 6 6 Male 97000
Notice that we have categorical variables in our dataset. If we want to use these in our regressions, we will need to create dummy variables. There are several ways to do this:
Option 1: Create a new variable using mutate
Salaries1 <- Salaries %>% mutate(sex = ifelse(sex == "Male", 0, 1))
head(Salaries1)
## rank discipline yrs.since.phd yrs.service sex salary
## 1 Prof B 19 18 0 139750
## 2 Prof B 20 16 0 173200
## 3 AsstProf B 4 3 0 79750
## 4 Prof B 45 39 0 115000
## 5 Prof B 40 41 0 141500
## 6 AssocProf B 6 6 0 97000
Option 2: Use dummy_cols()
p_load(fastDummies) #requires this package to be loaded
##
## The downloaded binary packages are in
## /var/folders/30/xcfr30vj55q2bbs6dlm27fqw0000gn/T//RtmpBNbsrS/downloaded_packages
##
## fastDummies installed
Salaries2 <- Salaries %>% dummy_cols()
head(Salaries2)
## rank discipline yrs.since.phd yrs.service sex salary rank_AsstProf
## 1 Prof B 19 18 Male 139750 0
## 2 Prof B 20 16 Male 173200 0
## 3 AsstProf B 4 3 Male 79750 1
## 4 Prof B 45 39 Male 115000 0
## 5 Prof B 40 41 Male 141500 0
## 6 AssocProf B 6 6 Male 97000 0
## rank_AssocProf rank_Prof discipline_A discipline_B sex_Female sex_Male
## 1 0 1 0 1 0 1
## 2 0 1 0 1 0 1
## 3 0 0 0 1 0 1
## 4 0 1 0 1 0 1
## 5 0 1 0 1 0 1
## 6 1 0 0 1 0 1
Option 3: Just declare the variable as a factor in your regression equation
reg1 <- lm(salary ~ as.factor(sex), data = Salaries)
summary(reg1)
##
## Call:
## lm(formula = salary ~ as.factor(sex), data = Salaries)
##
## Residuals:
## Min 1Q Median 3Q Max
## -57290 -23502 -6828 19710 116455
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 101002 4809 21.001 < 2e-16 ***
## as.factor(sex)Male 14088 5065 2.782 0.00567 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 30030 on 395 degrees of freedom
## Multiple R-squared: 0.01921, Adjusted R-squared: 0.01673
## F-statistic: 7.738 on 1 and 395 DF, p-value: 0.005667
reg2 <- lm(salary ~ sex, data = Salaries1)
summary(reg2)
##
## Call:
## lm(formula = salary ~ sex, data = Salaries1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -57290 -23502 -6828 19710 116455
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 115090 1587 72.503 < 2e-16 ***
## sex -14088 5065 -2.782 0.00567 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 30030 on 395 degrees of freedom
## Multiple R-squared: 0.01921, Adjusted R-squared: 0.01673
## F-statistic: 7.738 on 1 and 395 DF, p-value: 0.005667
# What is the difference between these two regressions?
What if we have more than two categories? The procedure is the same.
reg3 <- lm(salary ~ as.factor(rank), data = Salaries)
summary(reg3)
##
## Call:
## lm(formula = salary ~ as.factor(rank), data = Salaries)
##
## Residuals:
## Min 1Q Median 3Q Max
## -68972 -16376 -1580 11755 104773
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 80776 2887 27.976 < 2e-16 ***
## as.factor(rank)AssocProf 13100 4131 3.171 0.00164 **
## as.factor(rank)Prof 45996 3230 14.238 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 23630 on 394 degrees of freedom
## Multiple R-squared: 0.3943, Adjusted R-squared: 0.3912
## F-statistic: 128.2 on 2 and 394 DF, p-value: < 2.2e-16
Let’s add an interaction term of two categorical variables: sex and discipline.
int_reg1 <- lm(salary ~ as.factor(sex) + as.factor(discipline) + as.factor(sex):as.factor(discipline), data = Salaries)
summary(int_reg1)
##
## Call:
## lm(formula = salary ~ as.factor(sex) + as.factor(discipline) +
## as.factor(sex):as.factor(discipline), data = Salaries)
##
## Residuals:
## Min 1Q Median 3Q Max
## -52900 -23149 -5419 20505 112785
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 89065 6992 12.739 < 2e-16
## as.factor(sex)Male 21635 7368 2.937 0.00351
## as.factor(discipline)B 22170 9528 2.327 0.02048
## as.factor(sex)Male:as.factor(discipline)B -14109 10034 -1.406 0.16049
##
## (Intercept) ***
## as.factor(sex)Male **
## as.factor(discipline)B *
## as.factor(sex)Male:as.factor(discipline)B
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 29660 on 393 degrees of freedom
## Multiple R-squared: 0.0482, Adjusted R-squared: 0.04094
## F-statistic: 6.634 on 3 and 393 DF, p-value: 0.0002217
Now, let’s use an interaction for a categorical variable and a continuous variable.
int_reg2 <- lm(salary ~ as.factor(sex) + yrs.since.phd + as.factor(sex):yrs.since.phd, data = Salaries)
summary(int_reg2)
##
## Call:
## lm(formula = salary ~ as.factor(sex) + yrs.since.phd + as.factor(sex):yrs.since.phd,
## data = Salaries)
##
## Residuals:
## Min 1Q Median 3Q Max
## -83012 -19442 -2988 15059 102652
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 73840.8 8696.7 8.491 4.27e-16 ***
## as.factor(sex)Male 20209.6 9179.2 2.202 0.028269 *
## yrs.since.phd 1644.9 454.6 3.618 0.000335 ***
## as.factor(sex)Male:yrs.since.phd -728.0 468.0 -1.555 0.120665
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 27420 on 393 degrees of freedom
## Multiple R-squared: 0.1867, Adjusted R-squared: 0.1805
## F-statistic: 30.07 on 3 and 393 DF, p-value: < 2.2e-16
Load the gapminder package to get the dataset:
p_load(gapminder)
##
## The downloaded binary packages are in
## /var/folders/30/xcfr30vj55q2bbs6dlm27fqw0000gn/T//RtmpBNbsrS/downloaded_packages
##
## gapminder installed
head(gapminder)
## # A tibble: 6 × 6
## country continent year lifeExp pop gdpPercap
## <fct> <fct> <int> <dbl> <int> <dbl>
## 1 Afghanistan Asia 1952 28.8 8425333 779.
## 2 Afghanistan Asia 1957 30.3 9240934 821.
## 3 Afghanistan Asia 1962 32.0 10267083 853.
## 4 Afghanistan Asia 1967 34.0 11537966 836.
## 5 Afghanistan Asia 1972 36.1 13079460 740.
## 6 Afghanistan Asia 1977 38.4 14880372 786.
Let’s make a quick ggplot and look at the shape of our data:
ggplot(data=gapminder, aes(x=gdpPercap, y=lifeExp)) + geom_point() + stat_smooth(method = "lm", se = F)
## `geom_smooth()` using formula = 'y ~ x'
To transform our data, we can create a new variable for the
log(gdpPercap) using mutate or we can take the log in our
lm() function.
gapminder_log <- gapminder %>% mutate(log_gdp = log(gdpPercap))
head(gapminder_log)
## # A tibble: 6 × 7
## country continent year lifeExp pop gdpPercap log_gdp
## <fct> <fct> <int> <dbl> <int> <dbl> <dbl>
## 1 Afghanistan Asia 1952 28.8 8425333 779. 6.66
## 2 Afghanistan Asia 1957 30.3 9240934 821. 6.71
## 3 Afghanistan Asia 1962 32.0 10267083 853. 6.75
## 4 Afghanistan Asia 1967 34.0 11537966 836. 6.73
## 5 Afghanistan Asia 1972 36.1 13079460 740. 6.61
## 6 Afghanistan Asia 1977 38.4 14880372 786. 6.67
ggplot(data=gapminder_log, aes(x=log_gdp, y=lifeExp)) + geom_point() + stat_smooth(method = "lm", se = F)
## `geom_smooth()` using formula = 'y ~ x'
reg_linear_log <- lm(lifeExp ~ log(gdpPercap), data = gapminder)
summary(reg_linear_log)
##
## Call:
## lm(formula = lifeExp ~ log(gdpPercap), data = gapminder)
##
## Residuals:
## Min 1Q Median 3Q Max
## -32.778 -4.204 1.212 4.658 19.285
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -9.1009 1.2277 -7.413 1.93e-13 ***
## log(gdpPercap) 8.4051 0.1488 56.500 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.62 on 1702 degrees of freedom
## Multiple R-squared: 0.6522, Adjusted R-squared: 0.652
## F-statistic: 3192 on 1 and 1702 DF, p-value: < 2.2e-16
To add a quadratic term, we need to use the I() function
in our regression
quad_reg <- lm(lifeExp ~ gdpPercap + I(gdpPercap^2), data = gapminder)
summary(quad_reg)
##
## Call:
## lm(formula = lifeExp ~ gdpPercap + I(gdpPercap^2), data = gapminder)
##
## Residuals:
## Min 1Q Median 3Q Max
## -28.0600 -6.4253 0.2611 7.0889 27.1752
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.052e+01 2.978e-01 169.65 <2e-16 ***
## gdpPercap 1.551e-03 3.737e-05 41.50 <2e-16 ***
## I(gdpPercap^2) -1.502e-08 5.794e-10 -25.92 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.885 on 1701 degrees of freedom
## Multiple R-squared: 0.5274, Adjusted R-squared: 0.5268
## F-statistic: 949.1 on 2 and 1701 DF, p-value: < 2.2e-16