Ordinary Least Squares

Ordinary Least Squares Part 1.

Note: to complete this lab, you must submit your R-script (or notebook) with all of the code and answers to the questions.

In this lab we’ll be walking through…

How to load in a data set
How to look at that data set
How to run an ordinary least squares regression with two variables (bivariate)
How to look at the results of the regression
How to graph the results of the regression

This lab will make use of the tidyverse, vtable, and fixest packages.

So make sure those are installed first!

If they’re not installed, you can use install.packages(c('tidyverse','vtable','fixest')).

Loading the packages

Let’s start by loading up the packages we’ll need. Use the library() function to load the tidyverse, vtable, and fixest packages.

library(tidyverse)

## Warning: package 'tidyverse' was built under R version 4.3.3

## Warning: package 'ggplot2' was built under R version 4.3.3

## Warning: package 'tibble' was built under R version 4.3.3

## Warning: package 'tidyr' was built under R version 4.3.3

## Warning: package 'readr' was built under R version 4.3.3

## Warning: package 'purrr' was built under R version 4.3.3

## Warning: package 'dplyr' was built under R version 4.3.3

## Warning: package 'stringr' was built under R version 4.3.3

## Warning: package 'forcats' was built under R version 4.3.3

## Warning: package 'lubridate' was built under R version 4.3.3

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(vtable)

## Warning: package 'vtable' was built under R version 4.3.3

## Loading required package: kableExtra

## Warning: package 'kableExtra' was built under R version 4.3.3

## 
## Attaching package: 'kableExtra'
## 
## The following object is masked from 'package:dplyr':
## 
##     group_rows

library(fixest)

## Warning: package 'fixest' was built under R version 4.3.3

Load the data

Now we are going to load in some data. There are two main ways to load in data: from file or from a package.

Let’s work from a package first.

Many packages come with data pre-loaded. You can see what data sets are there by typing data() and seeing what pops up.

For now, load up the pre-loaded data set mtcars using the data() function.

data(mtcars)

Help

When data comes from a package, it shows up as an object with the name you called it as.

Also, data you get this way often comes with a help file.

Use help(mtcars) to see descriptions of all the variables.

Different ways to view the data

Let’s look at the data in other ways too.

Try clicking on the data object in the Environment window to look at it like a spreadsheet. Note that this invokes View(mtcars), which I’d like you to also try.

Then, send it to the vtable() function to get information on all the variables.

Question: What is the highest value that the cyl variable takes? (Look at the “Values” column of the vtable. Or if you prefer, do max(mtcars$cyl))

Answer: 8

Loading data from a file

The other way to get data into R is by loading it from a file.

The file could be on your computer or could be online.

Generally, this requires you take the file path or URL and make it into a string by putting quotes around it (’’ or ““).

We’re going to be getting our data file from https://vincentarelbundock.github.io/Rdatasets/csv/Ecdat/USFinanceIndustry.csv

Question: What does that URL look like as a string? Answer: “https://vincentarelbundock.github.io/Rdatasets/csv/Ecdat/USFinanceIndustry.csv”

Now put that string into the read_csv() function (from the tidyverse).

read_csv("https://vincentarelbundock.github.io/Rdatasets/csv/Ecdat/USFinanceIndustry.csv")

## Rows: 84 Columns: 8
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): rownames
## dbl (7): year, CorporateProfitsAdj, Domestic, Financial, Nonfinancial, restO...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

## # A tibble: 84 × 8
##    rownames  year CorporateProfitsAdj Domestic Financial Nonfinancial
##    <chr>    <dbl>               <dbl>    <dbl>     <dbl>        <dbl>
##  1 <NA>      1929                10.7     10.4       1.6          8.9
##  2 <NA>      1930                 7.4      7.2       0.7          6.6
##  3 <NA>      1931                 2.8      2.8       0.5          2.3
##  4 <NA>      1932                -0.3     -0.2       0.6         -0.9
##  5 <NA>      1933                -0.3     -0.3       0.8         -1  
##  6 <NA>      1934                 2.4      2.3       0.5          1.8
##  7 <NA>      1935                 3.9      3.7       0.5          3.2
##  8 <NA>      1936                 6        5.9       0.9          5.1
##  9 <NA>      1937                 6.9      6.6       0.8          5.8
## 10 <NA>      1938                 4.8      4.5       0.9          3.6
## # ℹ 74 more rows
## # ℹ 2 more variables: restOfWorld <dbl>, FinanceProportion <dbl>

NOTE: read_csv() is different from, and usually better than, the base R function read.csv. Use read_csv, not read.csv!

This will read the data from the internet. However, we also need to store the data as an object in R.

So use <- or = to save the data as df.

Hint: We know we can store objects using the format a <- 2, which stores the number 2 in the object a. Now we want to store the results of our read_csv() function in the object df.

df <- read_csv("https://vincentarelbundock.github.io/Rdatasets/csv/Ecdat/USFinanceIndustry.csv")

## Rows: 84 Columns: 8
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): rownames
## dbl (7): year, CorporateProfitsAdj, Domestic, Financial, Nonfinancial, restO...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Regression

Now let’s finally build a regression.

At its core, a regression in R requires two things: a formula and a data set.

We have the data set, so let’s make a formula.

A formula in R (when you have only two variables) is of the format

Y ~ X

where Y is the dependent variable and X is the independent variable.

Question: Create a formula where mpg is the dependent variable and hp is the independent variable. Answer: mpg ~ hp

Now we can put that formula into the feols() function from fixest library to run a regression.

Use the feols() function. Put the mpg ~ hp formula in there first (without quotes). Then, specify data = mtcars, since you’re using the mtcars data set.

Hint: feols() takes two arguments here. The first is mpg~hp. Then put a comma to separate the arguments. Then data = mtcars.

feols(mpg~hp,data=mtcars)

## OLS estimation, Dep. Var.: mpg
## Observations: 32
## Standard-errors: IID 
##              Estimate Std. Error  t value   Pr(>|t|)    
## (Intercept) 30.098861   1.633921 18.42125  < 2.2e-16 ***
## hp          -0.068228   0.010119 -6.74239 1.7878e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## RMSE: 3.7403   Adj. R2: 0.589185

Question: In this OLS line, what is the slope on hp? Round to the thousandths place (.001) Answer: -0.068

In the mtcars data, mpg is the miles per gallon of a particular car, and hp is the horsepower of that car.

Question: How can we interpret the slope we found above? Answer: For every one additional increase in horsepower, the miles per gallon decreases by .068.

Question: How can we interpret the intercept? Answer: When the horsepower is 0, the miles per gallon is 30.099.

Regression output

Most of the time, we want to see a fuller description of the regression, not just the coefficients.

For this, we will need to store our regression object so we can pass it to other functions.

Rerun the same regression as in the last step, but save it to the object my_model.

Let’s look at a table summarizing the regression.

First, write the name of the model itself to summarize your regression (just type my_model and run it).

my_model <- feols(mpg~hp,data=mtcars)

(Note: if you were using the base-R lm() (as in class) instead of feols(), you’d need summary() instead just just writing the model name)

Next, let’s get a nicer look using etable() from the fixest package. Run the function with my_model as the argument.

etable(my_model)

##                            my_model
## Dependent Var.:                 mpg
##                                    
## Constant           30.10*** (1.634)
## hp              -0.0682*** (0.0101)
## _______________ ___________________
## S.E. type                       IID
## Observations                     32
## R2                          0.60244
## Adj. R2                     0.58919
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Visuals

Now let’s get a nice graph of our regression results using ggplot() from the tidyverse package.

In the following command, fill in the appropriate x and y-axis variables for your data and then run the whole thing (you’re replacing “xaxisvariable” and “yaxisvariable”):

ggplot(mtcars, aes(x = xaxisvariable, y = yaxisvariable)) + geom_point() + geom_smooth(method = 'lm')

ggplot(mtcars, aes(x = hp, y = mpg)) + geom_point() + geom_smooth(method = 'lm')

## `geom_smooth()` using formula = 'y ~ x'

More practice

Don’t think we forgot about that data we loaded in with df earlier!

We are going to look at how corporate profits predict finanical-sector profits.

Use feols() to run a regression using the df data with Financial as the dependent variable and CorporateProfitsAdj as the independent variable, and save it as reg2.

reg2 <- feols(Financial~CorporateProfitsAdj,data=df)

## NOTE: 1 observation removed because of NA values (LHS: 1, RHS: 1).

Just as one last thing, we’re going to do something strange.

Use the following code to add a new variable called ProfitsHalf that divides CorporateProfitsAdj by 2 and save it as the new data set df2:

df2 <- df %>% mutate(ProfitsHalf = CorporateProfitsAdj/2)

df2 <- df %>% mutate(ProfitsHalf = CorporateProfitsAdj/2)

Now regress Financial on ProfitsHalf with feols() using the df2 data set and store the result as reg3.

reg3 <- feols(Financial~ProfitsHalf,data=df2)

## NOTE: 1 observation removed because of NA values (LHS: 1, RHS: 1).

Put both reg2 and reg3 into etable() so you can look at the results side-by-side (separated by a comma).

etable(reg2, reg3)

##                                   reg2               reg3
## Dependent Var.:              Financial          Financial
##                                                          
## Constant               -7.543. (4.532)    -7.543. (4.532)
## CorporateProfitsAdj 0.2388*** (0.0079)                   
## ProfitsHalf                            0.4776*** (0.0158)
## ___________________ __________________ __________________
## S.E. type                          IID                IID
## Observations                        83                 83
## R2                             0.91808            0.91808
## Adj. R2                        0.91707            0.91707
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The coefficient on ProfitsHalf in Model 2 (from df2) is exactly twice as large as the coefficient on CorporateProfitsAdj in Model 1 (from df).

Question: Why does this doubling make sense? Answer: The doubling makes sense because we divided the independent variable in half.

Ordinary Least Squares

2023-09-21