Guide to R for SCU Economics Students, v. 3.0
Multiple regression is regression with more than one regressor (X variable). In R, multiple regression is straightforward. In this tutorial you will run all the regressions for Table 7.1 of Stock and Watson (p. 238 of original 3rd edition) and put them into a single stargazer table, which will look a lot like Table 7.1. Running and presenting the regressions is the easy part: you will need to make sure you are able to interpret the regression coefficients and understand why the estimated coefficients might be biased.
Start RStudio and open the script t_multreg.R in your script editor. Make sure to change the setwd(…) command, as usual.
This script loads the now-familiar CA school district data set, and runs several regressions with the district test score (testscr) as the dependent (Y) variable, and various regressors (X variables).
Scroll down to the regression commands. Note that the regressions (2)-(5) are multiple regressions, which have more than one regressor (X variable). In R we can add regressors simply by including them with a “+” sign, as in testscr ~ str + el_pct.
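For example, the second regression on its own would look something like this (assuming, as in the earlier tutorials, that the data frame is named caschool):
# regression (2): testscr on str and el_pct
reg2 = lm(testscr ~ str + el_pct, data = caschool)
summary(reg2)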
Examine the stargazer command, which creates the table. Note that we include the names of all the regressions (reg1, reg2, …), followed by the SE correction for all of the regressions, in the same order: se=list(cse(reg1), cse(reg2), …).
It is very important that you pay attention to the parentheses. The parentheses in every R command must balance, in the sense that for every open paren “(“ there is a later close paren “)”.
Highlight and run all the lines from the beginning of the script through the entire stargazer command. Look at the resulting table in your console.
stargazer(reg1, reg2, reg3, reg4, reg5,
se=list(cse(reg1),cse(reg2),cse(reg3),cse(reg4),cse(reg5)),
title="Regression Results", type="text",
df=FALSE, digits=3)
Regression Results
==========================================================================
Dependent variable:
------------------------------------------------------
testscr
(1) (2) (3) (4) (5)
--------------------------------------------------------------------------
str -2.280*** -1.101** -0.998*** -1.308*** -1.014***
(0.519) (0.433) (0.270) (0.339) (0.269)
el_pct -0.650*** -0.122*** -0.488*** -0.130***
(0.031) (0.033) (0.030) (0.036)
meal_pct -0.547*** -0.529***
(0.024) (0.038)
calw_pct -0.790*** -0.048
(0.068) (0.059)
Constant 698.933*** 686.032*** 700.150*** 697.999*** 700.392***
(10.364) (8.728) (5.568) (6.920) (5.537)
--------------------------------------------------------------------------
Observations 420 420 420 420 420
R2 0.051 0.426 0.775 0.629 0.775
Adjusted R2 0.049 0.424 0.773 0.626 0.773
Residual Std. Error 18.581 14.464 9.080 11.654 9.084
F Statistic 22.575*** 155.014*** 476.306*** 234.638*** 357.054***
==========================================================================
Note: *p<0.1; **p<0.05; ***p<0.01
Compare the results for each column with what is in Table 7.1 of your textbook (p. 238). The two tables should contain identical numbers, at least within rounding error.
Make sure you understand how to interpret each coefficient in the table, and how to do a hypothesis test or construct a confidence interval for each variable.
You can write the table to a text file if you want to save it, rather than cutting and pasting from the console. In the stargazer command, you would add another option, saving the table to a file named filename.txt: out="filename.txt"
You can also put column labels in stargazer, which helps the reader. The code is included in the next command. Be careful about commas and apostrophes, and make sure your labels are in the same order as the corresponding regressions! Run it and look at the table.
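A sketch of the stargazer command combining both options is below; the file name and the column labels here are only illustrative, so use whatever the script (or your own judgment) specifies:
stargazer(reg1, reg2, reg3, reg4, reg5,
          se=list(cse(reg1),cse(reg2),cse(reg3),cse(reg4),cse(reg5)),
          title="Regression Results", type="text",
          column.labels=c("Base", "Add el_pct", "Add meal_pct", "Add calw_pct", "All controls"),
          df=FALSE, digits=3, out="filename.txt")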
We can use a regression to predict the Y value for specific hypothetical or actual values of the regressors using the predict function. Let’s do this using the results of regression reg3 and predict the average test score for a district with str = 20, el_pct = 20, and meal_pct = 50.
To do this, we first create a new data frame with these values of the X variables:
newdata = data.frame(str = 20, el_pct = 20, meal_pct = 50)
Next, we use the predict command, telling R which regression coefficients to use (reg3) and which data frame (newdata):
predict(reg3, newdata)
Run these commands and see what the result is.
The predict command will also calculate the upper and lower limits of a 95% confidence interval around the prediction. There are two kinds of confidence intervals for predictions.
The first is obtained with the option interval="confidence" in the predict command. This gives us the 95% confidence interval for E(Y | X): that is, what is the expected test score over a large number of districts with this set of characteristics X? Because our coefficient estimates are uncertain, there is some uncertainty about this expected test score, and the CI gives us a plausible range of values.
The second kind of interval is often called a 95% prediction interval, and it is obtained with the option interval="prediction". The prediction interval can be interpreted as a range of values for predicting the test score in a single school district with these characteristics. Because of the residual variation around predicted values, the 95% prediction interval is always wider than the confidence interval of the prediction.
Run these two intervals and see what you get. Is the prediction interval in fact wider?
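If you want to type them directly, the two calls look something like this (a sketch using reg3 and the newdata frame created above):
predict(reg3, newdata, interval = "confidence")   # 95% CI for the expected test score, E(Y | X)
predict(reg3, newdata, interval = "prediction")   # 95% prediction interval for a single district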
How important is the effect of a regressor (X) on Y? Note that this is not the same as asking whether the coefficient is statistically significant. We want to know how big the point estimate is, not just whether it is different from zero. We will discuss various ways of judging effect size in class, but one common approach is to “standardize” the regression coefficient. Confusingly, standardized coefficients are sometimes called “beta coefficients”—not to be confused with the coefficient itself, which we call beta.
The standardized slope coefficient tells us the expected marginal effect of a one-standard-deviation change in X on Y, measured in standard deviations of Y. We rescale the coefficient by measuring both variables in their sample standard deviations rather than their original units. In more everyday language, this tells us how much a "typical" change in the X variable will move Y, other things equal. We allow the data to tell us what a "typical" amount of variation is.
The standardized coefficient is just one way of judging magnitude of effect, and is more appropriate for continuous variables than for binary variables.
There are packages that calculate standardized coefficients, but here we will just do it in the script by obtaining the coefficient estimate and multiplying it by sd(X)/sd(Y). Run these lines and see what happens:
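# standardized coefficient: slope estimate multiplied by sd(X)/sd(Y)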
coef(summary(reg5))["str","Estimate"]*sd(caschool$str)/sd(caschool$testscr)
coef(summary(reg5))["el_pct","Estimate"]*sd(caschool$el_pct)/sd(caschool$testscr)
Joint hypotheses are hypotheses about more than one "restriction" on the coefficients. For example, we might want to test the null hypothesis that in regression (5), the coefficients on meal_pct and calw_pct are both equal to zero. These variables are intended to control for low-income school districts; if both coefficients were zero, it would suggest that the relative poverty of a district does not have a significant impact on the average test score, controlling for the other factors in the regression.
To implement a joint hypothesis test, we can use the function lht(), which is part of the car package. To see how it works, run the next command:
lht(reg5, c("meal_pct = 0", "calw_pct = 0"), white.adjust = "hc1")
Examining this command: reg5 tells lht which regression results to use; c("meal_pct = 0", "calw_pct = 0") expresses the null hypothesis, with each restriction in quotation marks; and white.adjust = "hc1" ensures that the heteroskedasticity-corrected standard errors are used for the test (always include this). Here are the results:
Linear hypothesis test
Hypothesis:
meal_pct = 0
calw_pct = 0
Model 1: restricted model
Model 2: testscr ~ str + el_pct + meal_pct + calw_pct
Note: Coefficient covariance matrix supplied.
Res.Df Df F Pr(>F)
1 417
2 415 2 290.27 < 2.2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Interpreting the results requires careful understanding of the nature of the hypothesis test. For now, note that the null hypothesis is stated for us, and that the F-statistic for the test is 290.27, with the p-value next to it (< 2.2e-16). Df = 2 tells you the number of restrictions being tested.
Obviously with such a low p-value we can reject the null hypothesis. We can test a wide variety of hypotheses using lht. Run the remaining lht lines in the script and see if you can understand what each test is doing.
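As one illustration (this particular test is hypothetical and may or may not appear in your script), you could test whether the coefficients on the two poverty-related variables are equal to each other:
lht(reg5, "meal_pct = calw_pct", white.adjust = "hc1")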
In the previous chapter we ran a regression of the percent of females in a country's population on income (measured as per capita real GDP). The script contains the commands to estimate a multiple regression version of that equation, in which we add population, latitude, infant mortality, and fertility as regressors.
You have already run the commands to import the WDI dataset into R. The script includes the usual commands to clean up the dataset, renaming the variables, etc. Also in the Data section are some commands to rescale two of the variables. The third command transforms the variable latitude from a factor variable to a numeric variable.
wdim$GDPpcUSDreal = wdim$GDPpcUSDreal/1000   # GDP per capita in thousands of dollars
wdim$population = wdim$population/1000000    # population in millions
wdim$latitude = as.numeric(wdim$latitude)    # convert latitude from factor to numeric
In the Analysis section, run two regressions and display the results using stargazer.
regwdi1 = lm(femaleperc ~ GDPpcUSDreal + population + latitude, data = wdim)
regwdi2 = lm(femaleperc ~ GDPpcUSDreal + population + infmort + fertility + latitude, data = wdim)
stargazer(regwdi1, regwdi2,
          se=list(cse(regwdi1), cse(regwdi2)),
          title="Regression of percent female on various correlates", type="text",
          df=FALSE, digits=3)
Take some time to examine the output of the regressions. The coefficient on GDP per capita is roughly the same magnitude as before (an increase in GDP per capita of $1,000 is associated with a decrease in the percentage female of about 0.1). But now it is more statistically significant (note the two and three asterisks). The variable latitude is also statistically significant. Can you interpret its meaning?
Once again, we repeat the admonishment: R has no trouble calculating regression coefficients, but we need to use judgment and common sense to determine whether the coefficients are likely to be unbiased.
Key R commands (functions):
lm: can be used for multiple regression too
stargazer: can combine results from several regressions
predict: use regression results to predict Y for different data
lht: flexible function for linear hypothesis tests (requires package car)
Key points/concepts: