Note: to complete this lab, you must submit your R-script (or notebook) with all of the code and answers to the questions.
In this lab we’ll be walking through…
How to load in a data set
How to look at that data set
How to run an ordinary least squares regression with two variables (bivariate)
How to look at the results of the regression
How to graph the results of the regression
This lab will make use of the tidyverse, vtable, and fixest packages.
So make sure those are installed first!
If they’re not installed, you can use
install.packages(c('tidyverse','vtable','fixest')).
Let’s start by loading up the packages we’ll need. Use the
library() function to load the tidyverse,
vtable, and fixest packages.
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(vtable)
## Loading required package: kableExtra
##
## Attaching package: 'kableExtra'
##
## The following object is masked from 'package:dplyr':
##
## group_rows
library(fixest)
Now we are going to load in some data. There are two main ways to load in data: from file or from a package.
Let’s work from a package first.
Many packages come with data pre-loaded. You can see what data sets
are there by typing data() and seeing what pops up.
For now, load up the pre-loaded data set mtcars using
the data() function.
data(mtcars)
When data comes from a package, it shows up as an object with the name you called it as.
Also, data you get this way often comes with a help file.
Use help(mtcars) to see descriptions of all the variables.
Let’s look at the data in other ways too.
Try clicking on the data object in the Environment window to look at
it like a spreadsheet. Note that this invokes View(mtcars),
which I’d like you to also try.
View(mtcars)
vtable(mtcars)
| Name | Class | Values |
|---|---|---|
| mpg | numeric | Num: 10.4 to 33.9 |
| cyl | numeric | Num: 4 to 8 |
| disp | numeric | Num: 71.1 to 472 |
| hp | numeric | Num: 52 to 335 |
| drat | numeric | Num: 2.76 to 4.93 |
| wt | numeric | Num: 1.513 to 5.424 |
| qsec | numeric | Num: 14.5 to 22.9 |
| vs | numeric | Num: 0 to 1 |
| am | numeric | Num: 0 to 1 |
| gear | numeric | Num: 3 to 5 |
| carb | numeric | Num: 1 to 8 |
Then, send it to the vtable() function to get information on all the variables.
Question: What is the highest value that the cyl
variable takes? (Look at the “Values” column of the vtable. Or if you
prefer, do max(mtcars$cyl))
Answer: 8
The other way to get data into R is by loading it from a file.
The file could be on your computer or could be online.
Generally, this requires you take the file path or URL and make it into a string by putting quotes around it (’’ or ““).
We’re going to be getting our data file from https://vincentarelbundock.github.io/Rdatasets/csv/Ecdat/USFinanceIndustry.csv
Question: What does that URL look like as a string? “https://vincentarelbundock.github.io/Rdatasets/csv/Ecdat/USFinanceIndustry.csv”
Now put that string into the read_csv() function (from
the tidyverse).
read_csv("https://vincentarelbundock.github.io/Rdatasets/csv/Ecdat/USFinanceIndustry.csv")
## Rows: 84 Columns: 8
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): rownames
## dbl (7): year, CorporateProfitsAdj, Domestic, Financial, Nonfinancial, restO...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
## # A tibble: 84 × 8
## rownames year CorporateProfitsAdj Domestic Financial Nonfinancial
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 <NA> 1929 10.7 10.4 1.6 8.9
## 2 <NA> 1930 7.4 7.2 0.7 6.6
## 3 <NA> 1931 2.8 2.8 0.5 2.3
## 4 <NA> 1932 -0.3 -0.2 0.6 -0.9
## 5 <NA> 1933 -0.3 -0.3 0.8 -1
## 6 <NA> 1934 2.4 2.3 0.5 1.8
## 7 <NA> 1935 3.9 3.7 0.5 3.2
## 8 <NA> 1936 6 5.9 0.9 5.1
## 9 <NA> 1937 6.9 6.6 0.8 5.8
## 10 <NA> 1938 4.8 4.5 0.9 3.6
## # ℹ 74 more rows
## # ℹ 2 more variables: restOfWorld <dbl>, FinanceProportion <dbl>
NOTE: read_csv() is different from, and usually better
than, the base R function read.csv. Use
read_csv, not read.csv!
This will read the data from the internet. However, we also need to store the data as an object in R.
So use <- or = to save the data as
df.
Hint: We know we can store objects using the format
a <- 2, which stores the number 2 in the
object a. Now we want to store the results of our
read_csv() function in the object df.
df <- read_csv("https://vincentarelbundock.github.io/Rdatasets/csv/Ecdat/USFinanceIndustry.csv")
## Rows: 84 Columns: 8
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): rownames
## dbl (7): year, CorporateProfitsAdj, Domestic, Financial, Nonfinancial, restO...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Now let’s finally build a regression.
At its core, a regression in R requires two things: a formula and a data set.
We have the data set, so let’s make a formula.
A formula in R (when you have only two variables) is of the format
Y ~ X
where Y is the dependent variable and X is
the independent variable.
Question: Create a formula where mpg is
the dependent variable and hp is the independent
variable.
mpg ~ hp
Now we can put that formula into the feols() function
from fixest library to run a regression.
Use the feols() function. Put the mpg ~ hp
formula in there first (without quotes). Then, specify
data = mtcars, since you’re using the mtcars
data set.
feols(mpg ~ hp, data = mtcars)
Hint: feols() takes two arguments here. The
first is mpg~hp. Then put a comma to separate the
arguments. Then data = mtcars.
Question: In this OLS line, what is the slope on
hp? Round to the thousandths place (.001)
-0.068
In the mtcars data, mpg is the miles per
gallon of a particular car, and hp is the horsepower of
that car.
Question: How can we interpret the slope we found above? Question: How can we interpret the intercept?
For every increase of 1 mpg, the horsepower decreases by 0.68. When horsepower is 0, the mpg is 30.09.
Most of the time, we want to see a fuller description of the regression, not just the coefficients.
For this, we will need to store our regression object so we can pass it to other functions.
Rerun the same regression as in the last step, but save it to the
object my_model.
Let’s look at a table summarizing the regression.
First, write the name of the model itself to summarize your
regression (just type my_model and run it).
my_model <- feols(mpg ~ hp, data = mtcars)
(Note: if you were using the base-R lm() (as in class)
instead of feols(), you’d need summary()
instead just just writing the model name)
Next, let’s get a nicer look using etable() from the
fixest package. Run the function with my_model
as the argument.
etable(my_model)
Now let’s get a nice graph of our regression results using
ggplot() from the tidyverse package.
In the following command, fill in the appropriate x and y-axis variables for your data and then run the whole thing (you’re replacing “xaxisvariable” and “yaxisvariable”):
ggplot(mtcars, aes(x = xaxisvariable, y = yaxisvariable)) + geom_point() + geom_smooth(method = 'lm')
ggplot(mtcars, aes(x = hp, y = mpg)) + geom_point() + geom_smooth(method = ‘lm’)
Don’t think we forgot about that data we loaded in with
df earlier!
We are going to look at how corporate profits predict finanical-sector profits.
Use feols() to run a regression using the
df data with Financial as the dependent
variable and CorporateProfitsAdj as the independent
variable, and save it as reg2.
reg2 <- feols(Financial ~ CorporateProfitsAdj, data = df)
Just as one last thing, we’re going to do something strange.
Use the following code to add a new variable called
ProfitsHalf that divides CorporateProfitsAdj
by 2 and save it as the new data set df2:
df2 <- df %>% mutate(ProfitsHalf = CorporateProfitsAdj/2)
df2 <- df %>% mutate(ProfitsHalf = CorporateProfitsAdj/2)
Now regress Financial on ProfitsHalf with
feols() using the df2 data set and store the
result as reg3.
reg3 <- feols(Financial ~ ProfitsHalf, data = df2)
Put both reg2 and reg3 into
etable() so you can look at the results side-by-side
(separated by a comma).
etable(reg2, reg3)
The coefficient on ProfitsHalf in Model 2 (from
df2) is exactly twice as large as the coefficient on
CorporateProfitsAdj in Model 1 (from df).
Question: Why does this doubling make sense? The doubling makes sense because we divided corportateprofitsadj by 2, making it twice as small as the profits from model 1.