library(pacman)
p_load(tidyverse, broom, stargazer, AER)
For this lab, we will use data on California schools from the AER package. To load the data and start using it, we can run:
data("CASchools")
# look at a snapshot
head(CASchools, 10)
## district school county grades students
## 1 75119 Sunol Glen Unified Alameda KK-08 195
## 2 61499 Manzanita Elementary Butte KK-08 240
## 3 61549 Thermalito Union Elementary Butte KK-08 1550
## 4 61457 Golden Feather Union Elementary Butte KK-08 243
## 5 61523 Palermo Union Elementary Butte KK-08 1335
## 6 62042 Burrel Union Elementary Fresno KK-08 137
## 7 68536 Holt Union Elementary San Joaquin KK-08 195
## 8 63834 Vineland Elementary Kern KK-08 888
## 9 62331 Orange Center Elementary Fresno KK-08 379
## 10 67306 Del Paso Heights Elementary Sacramento KK-06 2247
## teachers calworks lunch computer expenditure income english read
## 1 10.90 0.5102 2.0408 67 6384.911 22.690001 0.000000 691.6
## 2 11.15 15.4167 47.9167 101 5099.381 9.824000 4.583333 660.5
## 3 82.90 55.0323 76.3226 169 5501.955 8.978000 30.000002 636.3
## 4 14.00 36.4754 77.0492 85 7101.831 8.978000 0.000000 651.9
## 5 71.50 33.1086 78.4270 171 5235.988 9.080333 13.857677 641.8
## 6 6.40 12.3188 86.9565 25 5580.147 10.415000 12.408759 605.7
## 7 10.00 12.9032 94.6237 28 5253.331 6.577000 68.717949 604.5
## 8 42.50 18.8063 100.0000 66 4565.746 8.174000 46.959461 605.5
## 9 19.00 32.1900 93.1398 35 5355.548 7.385000 30.079157 608.9
## 10 108.00 78.9942 87.3164 0 5036.211 11.613333 40.275921 611.9
## math
## 1 690.0
## 2 661.9
## 3 650.9
## 4 643.5
## 5 639.9
## 6 605.4
## 7 609.0
## 8 612.5
## 9 616.1
## 10 613.4
names(CASchools)
## [1] "district" "school" "county" "grades" "students"
## [6] "teachers" "calworks" "lunch" "computer" "expenditure"
## [11] "income" "english" "read" "math"
To run a regression in R, we use lm(). The basic setup: name <- lm(y ~ x, data = name_of_df).
Let’s regress reading scores on student expenditure.
lm(read ~ expenditure, data = CASchools)
##
## Call:
## lm(formula = read ~ expenditure, data = CASchools)
##
## Coefficients:
## (Intercept) expenditure
## 6.182e+02 6.912e-03
The output from lm() gives us an intercept coefficient, \(\hat{\beta}_0\), and a slope coefficient, \(\hat{\beta}_1\).
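As an optional aside, if you want to pull the estimated coefficients out of a fitted model, base R’s coef() returns them as a named vector (the object name read_reg below is just made up for illustration):
# save the regression, then extract the estimated coefficients
read_reg <- lm(read ~ expenditure, data = CASchools)
coef(read_reg)                  # both coefficients as a named vector
coef(read_reg)["expenditure"]   # just the slope coefficient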
Let’s run another regression. This time, regress math scores on student expenditure.
lm(math ~ expenditure, data = CASchools)
On your own, try to code a regression for \(Math_i = \beta_0 + \beta_1 Income_i + u_i\).
lm(math ~ income, data = CASchools)
##
## Call:
## lm(formula = math ~ income, data = CASchools)
##
## Coefficients:
## (Intercept) income
## 625.539 1.815
Now that we know how to run a regression, let’s talk about how to look at the output. The output above wasn’t super informative…
summary()
The first option is to use the summary function in R. There are two ways to do this:
1: Nest the lm() inside summary()
summary(lm(math ~ income, data = CASchools))
##
## Call:
## lm(formula = math ~ income, data = CASchools)
##
## Residuals:
## Min 1Q Median 3Q Max
## -39.045 -8.997 0.308 8.416 34.246
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 625.53948 1.53627 407.18 <2e-16 ***
## income 1.81523 0.09073 20.01 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 13.42 on 418 degrees of freedom
## Multiple R-squared: 0.4892, Adjusted R-squared: 0.4879
## F-statistic: 400.3 on 1 and 418 DF, p-value: < 2.2e-16
2: Save your regression as an object in R so that we can then use that object later! (This is preferred).
reg1 <- lm(math ~ income, data = CASchools)
summary(reg1)
##
## Call:
## lm(formula = math ~ income, data = CASchools)
##
## Residuals:
## Min 1Q Median 3Q Max
## -39.045 -8.997 0.308 8.416 34.246
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 625.53948 1.53627 407.18 <2e-16 ***
## income 1.81523 0.09073 20.01 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 13.42 on 418 degrees of freedom
## Multiple R-squared: 0.4892, Adjusted R-squared: 0.4879
## F-statistic: 400.3 on 1 and 418 DF, p-value: < 2.2e-16
Notice that both ways produce the same output. We have the intercept coefficient, \(\hat{\beta}_0 \approx 625.5\), and the slope coefficient, \(\hat{\beta}_1 \approx 1.8\). summary() also gives us other information that we didn’t have before, like the standard error, the t-statistic, the p-value, and the \(R^2\). The stars next to the p-value tell us whether the coefficient is statistically significant (more on this after the midterm).
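If you ever need those numbers programmatically rather than printed, you can (optionally) grab them from the saved summary object; the component names $coefficients and $r.squared are standard for lm summaries, and s1 is just an example name:
s1 <- summary(reg1)
s1$coefficients   # matrix of estimates, standard errors, t values, and p-values
s1$r.squared      # the R^2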
tidy()
Another way to make nice regression output is to use the tidy() function from the broom package. To use this, you must have loaded the broom package in p_load(). The process is similar:
# since we have already created reg1 as an object, we can just call it without having to redo the regression
tidy(reg1)
## # A tibble: 2 x 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 626. 1.54 407. 0
## 2 income 1.82 0.0907 20.0 5.99e-63
tidy() puts the information from summary() into a much nicer-looking table.
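Broom has a couple of sibling functions that you may find handy, though they are not needed for this lab: glance() for model-level statistics and augment() for observation-level output.
glance(reg1)    # one-row tibble of model-level statistics (R^2, F-statistic, etc.)
augment(reg1)   # the original variables plus fitted values and residuals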
By far, the most powerful tool for making amazing tables in R is the stargazer package. I would encourage you to try this out if you are feeling up for it!
First, let’s try to make a simple table with stargazer() using our reg1 object.
stargazer(reg1)
##
## % Table created by stargazer v.5.2.2 by Marek Hlavac, Harvard University. E-mail: hlavac at fas.harvard.edu
## % Date and time: Tue, Jun 29, 2021 - 11:06:48
## \begin{table}[!htbp] \centering
## \caption{}
## \label{}
## \begin{tabular}{@{\extracolsep{5pt}}lc}
## \\[-1.8ex]\hline
## \hline \\[-1.8ex]
## & \multicolumn{1}{c}{\textit{Dependent variable:}} \\
## \cline{2-2}
## \\[-1.8ex] & math \\
## \hline \\[-1.8ex]
## income & 1.815$^{***}$ \\
## & (0.091) \\
## & \\
## Constant & 625.539$^{***}$ \\
## & (1.536) \\
## & \\
## \hline \\[-1.8ex]
## Observations & 420 \\
## R$^{2}$ & 0.489 \\
## Adjusted R$^{2}$ & 0.488 \\
## Residual Std. Error & 13.420 (df = 418) \\
## F Statistic & 400.257$^{***}$ (df = 1; 418) \\
## \hline
## \hline \\[-1.8ex]
## \textit{Note:} & \multicolumn{1}{r}{$^{*}$p$<$0.1; $^{**}$p$<$0.05; $^{***}$p$<$0.01} \\
## \end{tabular}
## \end{table}
Huh, weird output, right? Stargazer defaults to LaTeX output, but we can change the type of output we want.
1: As text (use this for your problem set)
stargazer(reg1, type = "text")
##
## ===============================================
## Dependent variable:
## ---------------------------
## math
## -----------------------------------------------
## income 1.815***
## (0.091)
##
## Constant 625.539***
## (1.536)
##
## -----------------------------------------------
## Observations 420
## R2 0.489
## Adjusted R2 0.488
## Residual Std. Error 13.420 (df = 418)
## F Statistic 400.257*** (df = 1; 418)
## ===============================================
## Note: *p<0.1; **p<0.05; ***p<0.01
Stargazer can also produce HTML output and MS Word output (although Word output is not recommended).
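For example, a sketch of the HTML route (the file name reg1_table.html is just an example, and the out argument is optional):
# type = "html" prints html; out = writes the table to a file you can open in a browser
stargazer(reg1, type = "html", out = "reg1_table.html")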
Stargazer gave us some statistics that we don’t really care about… let’s get rid of them with the keep.stat argument:
stargazer(reg1, keep.stat = c("rsq", "n"), type = "text")
##
## ========================================
## Dependent variable:
## ---------------------------
## math
## ----------------------------------------
## income 1.815***
## (0.091)
##
## Constant 625.539***
## (1.536)
##
## ----------------------------------------
## Observations 420
## R2 0.489
## ========================================
## Note: *p<0.1; **p<0.05; ***p<0.01
Looks great!
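A related option, in case you prefer to think in terms of what to drop: the omit.stat argument removes the named statistics instead. A sketch of the same trimming:
# same idea as keep.stat, phrased as "omit these statistics"
stargazer(reg1, omit.stat = c("adj.rsq", "ser", "f"), type = "text")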
Unlike the tables we made using summary() and tidy(), stargazer can combine multiple regressions into one table!
# do another regression and save it as an object
reg2 <- lm(read ~ income, data = CASchools)
stargazer(reg1, reg2, keep.stat = c("rsq", "n"), type = "text")
##
## =========================================
## Dependent variable:
## ----------------------------
## math read
## (1) (2)
## -----------------------------------------
## income 1.815*** 1.942***
## (0.091) (0.097)
##
## Constant 625.539*** 625.228***
## (1.536) (1.651)
##
## -----------------------------------------
## Observations 420 420
## R2 0.489 0.487
## =========================================
## Note: *p<0.1; **p<0.05; ***p<0.01
The variables in the regressions don’t even have to be the same!
# do another regression and save it as an object
reg3 <- lm(read ~ expenditure, data = CASchools)
stargazer(reg1, reg2, reg3, keep.stat = c("rsq", "n"), type = "text")
##
## =============================================
## Dependent variable:
## --------------------------------
## math read
## (1) (2) (3)
## ---------------------------------------------
## income 1.815*** 1.942***
## (0.091) (0.097)
##
## expenditure 0.007***
## (0.002)
##
## Constant 625.539*** 625.228*** 618.249***
## (1.536) (1.651) (8.101)
##
## ---------------------------------------------
## Observations 420 420 420
## R2 0.489 0.487 0.047
## =============================================
## Note: *p<0.1; **p<0.05; ***p<0.01
We can get really fancy with stargazer(). For more info, see the cheatsheet on Canvas.
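As one taste of the fancier options (title, dep.var.labels, and covariate.labels are real stargazer arguments, but the label text below is just an example):
stargazer(reg1, reg2, reg3,
          keep.stat = c("rsq", "n"), type = "text",
          title = "Test scores and district characteristics",  # example title
          dep.var.labels = c("Math score", "Reading score"),    # relabels math and read
          covariate.labels = c("District income", "Expenditure per student"))  # relabels income and expenditure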
Lastly, we can use ggplot and fit a regression line through our data. Let’s start by making a scatterplot with math scores on the y axis and income on the x axis.
ggplot(data = CASchools, aes(x = income, y = math)) + geom_point()
To add a regression line, we use the stat_smooth() function:
# method = "lm" tells R that we are using OLS. se = FALSE removes the standard error bars.
ggplot(data = CASchools, aes(x = income, y = math)) + geom_point() + stat_smooth(method ="lm", se =FALSE)
## `geom_smooth()` using formula 'y ~ x'
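If you want to polish the plot a bit, ggplot’s labs() adds axis labels and a title; the label text below is just an example:
ggplot(data = CASchools, aes(x = income, y = math)) +
  geom_point() +
  stat_smooth(method = "lm", se = FALSE) +
  labs(x = "District average income ($1,000s)",   # example labels
       y = "Average math score",
       title = "Math scores and district income")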