Note: to complete this lab, you must submit your R-script (or notebook) with all of the code and answers to the questions.
In this lab we’ll be walking through…
How to load in a data set
How to look at that data set
How to run an ordinary least squares regression with two variables (bivariate)
How to look at the results of the regression
How to graph the results of the regression
This lab will make use of the tidyverse, vtable, and fixest packages.
So make sure those are installed first!
If they’re not installed, you can use
install.packages(c('tidyverse','vtable','fixest')).
Let’s start by loading up the packages we’ll need. Use the
library() function to load the tidyverse,
vtable, and fixest packages.
library(tidyverse)
## Warning: package 'tidyverse' was built under R version 4.3.3
## Warning: package 'ggplot2' was built under R version 4.3.3
## Warning: package 'tibble' was built under R version 4.3.3
## Warning: package 'tidyr' was built under R version 4.3.3
## Warning: package 'readr' was built under R version 4.3.3
## Warning: package 'purrr' was built under R version 4.3.3
## Warning: package 'dplyr' was built under R version 4.3.3
## Warning: package 'stringr' was built under R version 4.3.3
## Warning: package 'forcats' was built under R version 4.3.3
## Warning: package 'lubridate' was built under R version 4.3.3
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(vtable)
## Warning: package 'vtable' was built under R version 4.3.3
## Loading required package: kableExtra
## Warning: package 'kableExtra' was built under R version 4.3.3
##
## Attaching package: 'kableExtra'
##
## The following object is masked from 'package:dplyr':
##
## group_rows
library(fixest)
## Warning: package 'fixest' was built under R version 4.3.3
Now we are going to load in some data. There are two main ways to load in data: from file or from a package.
Let’s work from a package first.
Many packages come with data pre-loaded. You can see what data sets
are there by typing data() and seeing what pops up.
For now, load up the pre-loaded data set mtcars using
the data() function.
data(mtcars)
When data comes from a package, it shows up as an object with the name you called it as.
Also, data you get this way often comes with a help file.
Use help(mtcars) to see descriptions of all the variables.
Let’s look at the data in other ways too.
Try clicking on the data object in the Environment window to look at
it like a spreadsheet. Note that this invokes View(mtcars),
which I’d like you to also try.
Then, send it to the vtable() function to get information on all the variables.
Question: What is the highest value that the cyl
variable takes? (Look at the “Values” column of the vtable. Or if you
prefer, do max(mtcars$cyl))
Answer: 8
The other way to get data into R is by loading it from a file.
The file could be on your computer or could be online.
Generally, this requires you take the file path or URL and make it into a string by putting quotes around it (’’ or ““).
We’re going to be getting our data file from https://vincentarelbundock.github.io/Rdatasets/csv/Ecdat/USFinanceIndustry.csv
Question: What does that URL look like as a string? Answer: “https://vincentarelbundock.github.io/Rdatasets/csv/Ecdat/USFinanceIndustry.csv”
Now put that string into the read_csv() function (from
the tidyverse).
read_csv("https://vincentarelbundock.github.io/Rdatasets/csv/Ecdat/USFinanceIndustry.csv")
## Rows: 84 Columns: 8
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): rownames
## dbl (7): year, CorporateProfitsAdj, Domestic, Financial, Nonfinancial, restO...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
## # A tibble: 84 × 8
## rownames year CorporateProfitsAdj Domestic Financial Nonfinancial
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 <NA> 1929 10.7 10.4 1.6 8.9
## 2 <NA> 1930 7.4 7.2 0.7 6.6
## 3 <NA> 1931 2.8 2.8 0.5 2.3
## 4 <NA> 1932 -0.3 -0.2 0.6 -0.9
## 5 <NA> 1933 -0.3 -0.3 0.8 -1
## 6 <NA> 1934 2.4 2.3 0.5 1.8
## 7 <NA> 1935 3.9 3.7 0.5 3.2
## 8 <NA> 1936 6 5.9 0.9 5.1
## 9 <NA> 1937 6.9 6.6 0.8 5.8
## 10 <NA> 1938 4.8 4.5 0.9 3.6
## # ℹ 74 more rows
## # ℹ 2 more variables: restOfWorld <dbl>, FinanceProportion <dbl>
NOTE: read_csv() is different from, and usually better
than, the base R function read.csv. Use
read_csv, not read.csv!
This will read the data from the internet. However, we also need to store the data as an object in R.
So use <- or = to save the data as
df.
Hint: We know we can store objects using the format
a <- 2, which stores the number 2 in the
object a. Now we want to store the results of our
read_csv() function in the object df.
df <- read_csv("https://vincentarelbundock.github.io/Rdatasets/csv/Ecdat/USFinanceIndustry.csv")
## Rows: 84 Columns: 8
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): rownames
## dbl (7): year, CorporateProfitsAdj, Domestic, Financial, Nonfinancial, restO...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Now let’s finally build a regression.
At its core, a regression in R requires two things: a formula and a data set.
We have the data set, so let’s make a formula.
A formula in R (when you have only two variables) is of the format
Y ~ X
where Y is the dependent variable and X is
the independent variable.
Question: Create a formula where mpg is
the dependent variable and hp is the independent variable.
Answer: mpg ~ hp
Now we can put that formula into the feols() function
from fixest library to run a regression.
Use the feols() function. Put the mpg ~ hp
formula in there first (without quotes). Then, specify
data = mtcars, since you’re using the mtcars
data set.
Hint: feols() takes two arguments here. The
first is mpg~hp. Then put a comma to separate the
arguments. Then data = mtcars.
feols(mpg~hp,data=mtcars)
## OLS estimation, Dep. Var.: mpg
## Observations: 32
## Standard-errors: IID
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 30.098861 1.633921 18.42125 < 2.2e-16 ***
## hp -0.068228 0.010119 -6.74239 1.7878e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## RMSE: 3.7403 Adj. R2: 0.589185
Question: In this OLS line, what is the slope on
hp? Round to the thousandths place (.001)
Answer: -0.068
In the mtcars data, mpg is the miles per
gallon of a particular car, and hp is the horsepower of
that car.
Question: How can we interpret the slope we found above? Answer: For every one additional increase in horsepower, the miles per gallon decreases by .068.
Question: How can we interpret the intercept? Answer: When the horsepower is 0, the miles per gallon is 30.099.
Most of the time, we want to see a fuller description of the regression, not just the coefficients.
For this, we will need to store our regression object so we can pass it to other functions.
Rerun the same regression as in the last step, but save it to the
object my_model.
Let’s look at a table summarizing the regression.
First, write the name of the model itself to summarize your
regression (just type my_model and run it).
my_model <- feols(mpg~hp,data=mtcars)
(Note: if you were using the base-R lm() (as in class)
instead of feols(), you’d need summary()
instead just just writing the model name)
Next, let’s get a nicer look using etable() from the
fixest package. Run the function with my_model
as the argument.
etable(my_model)
## my_model
## Dependent Var.: mpg
##
## Constant 30.10*** (1.634)
## hp -0.0682*** (0.0101)
## _______________ ___________________
## S.E. type IID
## Observations 32
## R2 0.60244
## Adj. R2 0.58919
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Now let’s get a nice graph of our regression results using
ggplot() from the tidyverse package.
In the following command, fill in the appropriate x and y-axis variables for your data and then run the whole thing (you’re replacing “xaxisvariable” and “yaxisvariable”):
ggplot(mtcars, aes(x = xaxisvariable, y = yaxisvariable)) + geom_point() + geom_smooth(method = 'lm')
ggplot(mtcars, aes(x = hp, y = mpg)) + geom_point() + geom_smooth(method = 'lm')
## `geom_smooth()` using formula = 'y ~ x'
Don’t think we forgot about that data we loaded in with
df earlier!
We are going to look at how corporate profits predict finanical-sector profits.
Use feols() to run a regression using the
df data with Financial as the dependent
variable and CorporateProfitsAdj as the independent
variable, and save it as reg2.
reg2 <- feols(Financial~CorporateProfitsAdj,data=df)
## NOTE: 1 observation removed because of NA values (LHS: 1, RHS: 1).
Just as one last thing, we’re going to do something strange.
Use the following code to add a new variable called
ProfitsHalf that divides CorporateProfitsAdj
by 2 and save it as the new data set df2:
df2 <- df %>% mutate(ProfitsHalf = CorporateProfitsAdj/2)
df2 <- df %>% mutate(ProfitsHalf = CorporateProfitsAdj/2)
Now regress Financial on ProfitsHalf with
feols() using the df2 data set and store the
result as reg3.
reg3 <- feols(Financial~ProfitsHalf,data=df2)
## NOTE: 1 observation removed because of NA values (LHS: 1, RHS: 1).
Put both reg2 and reg3 into
etable() so you can look at the results side-by-side
(separated by a comma).
etable(reg2, reg3)
## reg2 reg3
## Dependent Var.: Financial Financial
##
## Constant -7.543. (4.532) -7.543. (4.532)
## CorporateProfitsAdj 0.2388*** (0.0079)
## ProfitsHalf 0.4776*** (0.0158)
## ___________________ __________________ __________________
## S.E. type IID IID
## Observations 83 83
## R2 0.91808 0.91808
## Adj. R2 0.91707 0.91707
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The coefficient on ProfitsHalf in Model 2 (from
df2) is exactly twice as large as the coefficient on
CorporateProfitsAdj in Model 1 (from df).
Question: Why does this doubling make sense? Answer: The doubling makes sense because we divided the independent variable in half.