Guide to R for SCU Economics Students, v. 3.0
It is easy to run regressions in R. However, the standard R regression procedure has some limitations. Therefore, we use some packages that fix a couple of issues: (1) We want to correct the coefficient standard errors for heteroskedasticity. (2) We want our results presented in a table that makes it easy to compare alternative regression models.
Start RStudio and open the script t_regbasics.R in your script editor.
Once again, note that to run this script on your computer you will need to edit the setwd(...)
command for your own working directory (folder). That is, change the part in quotations to your working directory folder in the following line of code:
setwd("/Users/wsundstrom/Dropbox/econ_42")
Then load packages, run other settings, and import the data. Highlight and run all of the code through the Data section (to just before the Analysis section). Examine the table of descriptive statistics to remind yourself what is in the data set. Note: The functioncse( )
is used to correct the standard errors (see below).
Before we run the regression, let’s take another look at the scatter plot: Run the qplot
command. Note again that the relationship between str and testscr appears to be weakly negative.
Now highlight and run the following line:
reg1 = lm(testscr ~ str, data=caschool)
Note the following:
lm(...)
is the linear model (regression) function. The first variable is the Y variable, then any X variable(s) are included after the ~.
The X variable, here str, is the explanatory variable or the regressor.
We gave the regression output a name, reg1
. In your workspace (upper right) under Values, you should now see reg1.
Notice data=caschool
in the lm command. This allows us to just use the variable names in this function without including the data set name and $ sign.
To see the actual results of the regression in your console, highlight and run the summary(...)
command. In the future, we will not do this, but let’s take a look. Here is the relevant part of the regression output in your console:
Call:
lm(formula = testscr ~ str, data = caschool)
Residuals:
Min 1Q Median 3Q Max
-47.727 -14.251 0.483 12.822 48.540
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 698.9330 9.4675 73.825 < 2e-16 ***
str -2.2798 0.4798 -4.751 0.00000278 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 18.58 on 418 degrees of freedom
Multiple R-squared: 0.05124, Adjusted R-squared: 0.04897
F-statistic: 22.58 on 1 and 418 DF, p-value: 0.000002783
Consult your textbook for interpretation of most of the numbers presented in the output. For now, note that the implied regression equation is:
testscr^ = 698.9330 – 2.2798 str
Economists usually present regression results in a different format. The regression tables in your Stock and Watson textbook (e.g., Table 7.1) are good examples. We can use the stargazer command to make perfect regression tables. In addition, we want to use the corrected standard errors for the coefficients. Let’s make the table and see what we get.
Examine the next command, but don’t run it yet…
stargazer(reg1,
se=list(cse(reg1)),
title="CA school district regression", type="text",
df=FALSE, digits=3)
How do I know that these four lines are all part of the same command to R? A: Everything within the set of parentheses following the command stargazer
is part of the same command.
Within the parentheses, note the following:
reg1
is the name of the regression(s)se=...
is how we tell R to use the corrected standard errors. If you omitted this option, stargazer would just use the uncorrected SEs. See below for more on this.title=...
gives the table a title (put title in quotes)type="text"
means the table will be printed as text (always use this)df=FALSE
means the table will not be cluttered with degree of freedom numbersdigits=3
rounds the numbers in the table to three digits after the decimal (you can change this, of course)Now highlight and run the stargazer command (all four lines of it… make sure you include all parentheses!). Recall that if you just run part of a command, and it includes an open parenthesis, then R will be waiting for the close parenthesis, and will not give any output. See the earlier chapter for how to resolve this problem.
CA school district regression
===============================================
Dependent variable:
---------------------------
testscr
-----------------------------------------------
str -2.280***
(0.519)
Constant 698.933***
(10.364)
-----------------------------------------------
Observations 420
R2 0.051
Adjusted R2 0.049
Residual Std. Error 18.581
F Statistic 22.575***
===============================================
Note: *p<0.1; **p<0.05; ***p<0.01
Examine the table in the console in the console.
A note on the correction of standard errors for heteroskedasticity:
In many statistical packages (including R), the default regression function calculates coefficient standard errors that are only valid if the error term is homoskedastic—that is, the variance of the error (unexplained part) is the same for all observations. As you learn in chapter 5 of your textbook, there is no reason to think this would be true, and if it is not, the standard errors are incorrect, and so are the resulting hypothesis tests.
To fix this, I have added a correction that recalculates the standard errors: this is the function cse(…)
just before the data section. Stargazer uses this to correct the SEs. The correction seldom makes a large difference, but it is technically correct, and will make all your results completely comparable with the Stock and Watson textbook when we replicate some of their results. Hence: We will always use a correction of the standard errors.
The regression residuals are the difference between the actual value of the dependent variable (Y) and the predicted value from the regression, for each observation. We will discuss in class how the residuals can provide a useful diagnostic.
Highlight and run the following commands:
caschool$resid1 = resid(reg1)
qplot(str, resid1, data=caschool, main = "Residual against STR",
ylab = "Regression residual",xlab = "Student-teacher ratio")
Do you see any pattern in the plot?
Run the last lines of code and see how the second regression and third regression can be displayed as new columns in the stargazer table. Make sure you can write out the formula for the estimated regression equations for reg2 and reg3 and explain what they mean.
Let us run a regression using the World Bank’s Development Indicators database to see if there is a relationship between the income (measured as per capita real GDP) of a country and the percent of females in the country. We might imagine that poorer countries have more “missing women.”
Run the regression and display the results using stargazer. GDPpcUSDreal is the only regressor in this equation.
regwdi1 = lm(femaleperc ~ GDPpcUSDreal, data= wdim)
stargazer(regwdim1,
se=list(cse(reg1)),
title="Regression of GDP and percent female",
type="text",
df=FALSE, digits=5)
Take some time to examine the output of the regression. The variable measuring GDP per capita has an estimated coefficient of -.00006 and has one asterisk, indicating statistical significance at the ten percent level. To interpret the magnitude of the relationship, look at the underlying data.
Under Data in the Workspace frame (upper right), click on the data name, wdim. This should open the data set in what looks like a spreadsheet in the script editor frame. The GDP variable is in dollars, while the percent female variable is in percentages (and not in decimals). So the regression coefficient means that a one dollar increase in GDP per capita is associated with a decrease in the percent female of .00006. Alternatively, an increase in GDP of ten thousand dollars per capita reduces the percent female by more than half a percentage point. That seems pretty substantial. Of course, we need to reflect on whether this coefficient might be biased. (Obviously it is!)
An important observation to make here is that just because the slope coefficient here seems to be a “small” number does not necessarily mean the effect is small—in this case it means that the scale of X (per capita GDP) is large. We could make the regression table easier to interpret by rescaling the X variable to be in thousands of dollars. Without re-running the regression, can you tell what the slope coefficient would be in that case?
Key R commands (functions):
lm
: runs regressionscse
: function you load at beginning of script to generate corrected standard errors (requires package sandwich)stargazer
: creates nice tables of regression results and allows you to include corrected standard errors (requires package stargazer)resid
: obtains the regression residualsKey points/ concepts: