This tutorial follows from the summaries and bivariate statistics discussion. We will cover the basics of how to run ordinary least squares (OLS) regression in R. If you need a refresher on what linear regression looks like and how it works, see here: http://rpubs.com/rslbliss/icyf_ols_basics.
For this tutorial, we will be using the same apidata_2012_so.csv dataset. If you haven’t already done so, set your working directory and read in the data.
# Sets working directory
setwd("C:/Users/Richard/Desktop/R Introduction")
# Reads in data file
api <- read.csv("apidata_2012_so.csv")
To run a regression, you can construct a model using the lm() command, which stands for “linear model.” For example, I run the following model in the api data.
\[cst28\_engl = \beta_0 + \beta_1(cst28\_math) + \beta_2(pctfrpl) + \varepsilon\]
# Running linear model
lm(cst28_engl ~ pctell + pctmin + pctfrpl, data=api)
Call:
lm(formula = cst28_engl ~ pctell + pctmin + pctfrpl, data = api)
Coefficients:
(Intercept) pctell pctmin pctfrpl
-778.78 -55.93 98.70 -17.81
Notice that the setup is similar to the way we ran ANOVA before. Here, we have a clear dependent variable, cst28_engl, and clear independent variables, cst28_math and pctfrpl.
(I should have mentioned this several tutorials ago, but cst28_engl and cst28_math are the total numbers of students taking the California standardized assessment in grades 2 though 8 in English and math, respectively. The pctell variable is the percent of students classified as English language learners. The pctmin variable is the percent of students that are racial and ethnic minorities. The pctfrpl variable is the percent of students eligible for the free and reduced price lunch program. Now you know.)
To construct the command, we put the dependent variable on the left side of the ~ symbol, and then we put the independent variables on the right side, separated by + symbols. Finally, we specify the object with the data. This way, we don’t have to write api$ before every variable in the model. Convenient, right?
Like everything else in R, you can store regression results in an object and call them back by typing that object.
# Saving linear model in object
model <- lm(cst28_engl ~ pctell + pctmin + pctfrpl, data=api)
# Printing object
model
Call:
lm(formula = cst28_engl ~ pctell + pctmin + pctfrpl, data = api)
Coefficients:
(Intercept) pctell pctmin pctfrpl
-778.78 -55.93 98.70 -17.81
The model object we just created is the first complex object we’ve seen (except for the aov() command). In fact, many different things are stored in that model object from the linear model. To see the different things, use the attributes() command.
# Showing parts of object
attributes(model)
$names
[1] "coefficients" "residuals" "effects" "rank"
[5] "fitted.values" "assign" "qr" "df.residual"
[9] "na.action" "xlevels" "call" "terms"
[13] "model"
$class
[1] "lm"
The summary() function has a handy functionality when it comes to linear models, in that it prints a comprehensive report about the model. This is similar to what we did for the aov() command.
# Showing full summary of object
summary(model)
Call:
lm(formula = cst28_engl ~ pctell + pctmin + pctfrpl, data = api)
Residuals:
Min 1Q Median 3Q Max
-8367 -2623 -979 454 297882
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -778.78 943.38 -0.826 0.4093
pctell -55.93 30.93 -1.808 0.0709 .
pctmin 98.70 19.01 5.192 2.53e-07 ***
pctfrpl -17.81 16.27 -1.095 0.2739
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 10610 on 985 degrees of freedom
(27 observations deleted due to missingness)
Multiple R-squared: 0.03147, Adjusted R-squared: 0.02852
F-statistic: 10.67 on 3 and 985 DF, p-value: 6.624e-07
As a recap of OLS, let’s interpret a couple of the results.
cst28_engl doesn’t make sense. This possibly reflects the improbability of having a school with those types of demographics.pctell = -55.93: For a one percent increase in the percent of students classified as ELL, we expect to see a 55.93 count decrease in the number of students taking the test. This coefficient has a t-score of -1.80, which has a p-value of 0.07. It is statistically significant at the 10% level.pctmin = 98.70: pctfrpl = -17.81: cst28_engl is explained by the variables in our model.At the end, were we to write out an equation based on our model for the estimated cst28_engl values, it would look something like this.
\[\widehat{cst28\_engl} = -778.78 - 55.93(pctell) + 98.70(pctmin) - 17.81(pctfrpl)\]
Wait, but how do we include categorical variables on the right side? What if we wanted to also include a variable for district type? You have two options here.
First, you can just include it as you would include anything else. Look at the following regression and how it is run.
\[cst28_engl = \beta_0 + \beta_1(districttype) + \varepsilon\]
# Running model with categorical variable, and printing summary
summary(lm(cst28_engl ~ districttype, data=api))
Call:
lm(formula = cst28_engl ~ districttype, data = api)
Residuals:
Min 1Q Median 3Q Max
-5328 -1850 -1265 323 297823
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1627.8 466.3 3.491 0.000503 ***
districttypeHigh -1107.5 1288.0 -0.860 0.390067
districttypeUnified 3702.3 709.1 5.221 2.17e-07 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 10600 on 986 degrees of freedom
(27 observations deleted due to missingness)
Multiple R-squared: 0.03139, Adjusted R-squared: 0.02942
F-statistic: 15.97 on 2 and 986 DF, p-value: 1.487e-07
(Notice, by the way, that this is the same model we ran for the ANOVA in the previous tutorial, and we got the same results.)
In the results, we see districttypeHigh and districttypeUnified. As you know, you enter categorical variables in regression by using a series of binary (0/1) indicators for each category and you leave one category out to serve as a reference group. So in total, it would look like this.
\[cst28_engl = \beta_0 + \beta_1(High) + \beta_2(Unified) + \varepsilon\]
We left the “Elementary” group out. When you enter a categorical variable in R, it will automatically include the categories separately and leave one out. You can make it so R leaves a different group out instead of the one in automatically picks, but that is a little complicated, so I suggest the other method.
The safer way to do this, albeit a little more tedious, is to create new indicator variables yourself using the ifelse() function. See below for an example of this.
# Create an indicator for Elementary schools
api$Elementary <- ifelse(api$districttype=="Elementary", 1, 0)
This command creates a new variable called Elementary. This Elementary variable is equal to 1 (the second argument) if the condition api$districttype=="Elementary" is true (the first argument). If it is false, this variable is equal to 0. You can check to make sure this is correct by using the table() command to create a crosstabulation.
# Check crosstabs of districttype versus Elementary
table(api$districttype, api$Elementary)
0 1
Elementary 0 538
High 78 0
Unified 400 0
Can you code the two lines to make indicator variables High and Unified?
Run:
Run:
Now, once you’ve done that, run the model as shown below.
# Running model with indicator variables, and printing summary
summary(lm(cst28_engl ~ High + Unified, data=api))
Call:
lm(formula = cst28_engl ~ High + Unified, data = api)
Residuals:
Min 1Q Median 3Q Max
-5328 -1850 -1265 323 297823
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1627.8 466.3 3.491 0.000503 ***
High -1107.5 1288.0 -0.860 0.390067
Unified 3702.3 709.1 5.221 2.17e-07 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 10600 on 986 degrees of freedom
(27 observations deleted due to missingness)
Multiple R-squared: 0.03139, Adjusted R-squared: 0.02942
F-statistic: 15.97 on 2 and 986 DF, p-value: 1.487e-07
Notice that the results are the same as when we entered the categorical variable, but now we have more control over which group is left out.
Can you run the model, but with the “Unified” group left out instead of the “Elementary” group?
Run:
Before we get into more complex forms of regression analysis, let’s take a detour quickly to talk about external packages, which will be important to understand before we move on: http://rpubs.com/rslbliss/r_packages_ws