Linear Regression in R

This tutorial follows from the summaries and bivariate statistics discussion. We will cover the basics of how to run ordinary least squares (OLS) regression in R. If you need a refresher on what linear regression looks like and how it works, see here: http://rpubs.com/rslbliss/icyf_ols_basics.

For this tutorial, we will be using the same apidata_2012_so.csv dataset. If you haven’t already done so, set your working directory and read in the data.

Ordinary least squares regression

To run a regression, you can construct a model using the lm() command, which stands for “linear model.” For example, I run the following model in the api data.

\[cst28\_engl = \beta_0 + \beta_1(cst28\_math) + \beta_2(pctfrpl) + \varepsilon\]

# Running linear model
lm(cst28_engl ~ pctell + pctmin + pctfrpl, data=api)


Call:
lm(formula = cst28_engl ~ pctell + pctmin + pctfrpl, data = api)

Coefficients:
(Intercept)       pctell       pctmin      pctfrpl  
    -778.78       -55.93        98.70       -17.81

Notice that the setup is similar to the way we ran ANOVA before. Here, we have a clear dependent variable, cst28_engl, and clear independent variables, cst28_math and pctfrpl.

(I should have mentioned this several tutorials ago, but cst28_engl and cst28_math are the total numbers of students taking the California standardized assessment in grades 2 though 8 in English and math, respectively. The pctell variable is the percent of students classified as English language learners. The pctmin variable is the percent of students that are racial and ethnic minorities. The pctfrpl variable is the percent of students eligible for the free and reduced price lunch program. Now you know.)

To construct the command, we put the dependent variable on the left side of the ~ symbol, and then we put the independent variables on the right side, separated by + symbols. Finally, we specify the object with the data. This way, we don’t have to write api$ before every variable in the model. Convenient, right?

Like everything else in R, you can store regression results in an object and call them back by typing that object.

# Saving linear model in object
model <- lm(cst28_engl ~ pctell + pctmin + pctfrpl, data=api)
# Printing object
model


Call:
lm(formula = cst28_engl ~ pctell + pctmin + pctfrpl, data = api)

Coefficients:
(Intercept)       pctell       pctmin      pctfrpl  
    -778.78       -55.93        98.70       -17.81

The model object we just created is the first complex object we’ve seen (except for the aov() command). In fact, many different things are stored in that model object from the linear model. To see the different things, use the attributes() command.

# Showing parts of object
attributes(model)

$names
 [1] "coefficients"  "residuals"     "effects"       "rank"         
 [5] "fitted.values" "assign"        "qr"            "df.residual"  
 [9] "na.action"     "xlevels"       "call"          "terms"        
[13] "model"        

$class
[1] "lm"

The summary() function has a handy functionality when it comes to linear models, in that it prints a comprehensive report about the model. This is similar to what we did for the aov() command.

# Showing full summary of object
summary(model)


Call:
lm(formula = cst28_engl ~ pctell + pctmin + pctfrpl, data = api)

Residuals:
   Min     1Q Median     3Q    Max 
 -8367  -2623   -979    454 297882 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  -778.78     943.38  -0.826   0.4093    
pctell        -55.93      30.93  -1.808   0.0709 .  
pctmin         98.70      19.01   5.192 2.53e-07 ***
pctfrpl       -17.81      16.27  -1.095   0.2739    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 10610 on 985 degrees of freedom
  (27 observations deleted due to missingness)
Multiple R-squared:  0.03147,   Adjusted R-squared:  0.02852 
F-statistic: 10.67 on 3 and 985 DF,  p-value: 6.624e-07

As a recap of OLS, let’s interpret a couple of the results.

(Intercept) = -778.78: This is the estimated value of the dependent variable when all independent variables are equal to 0. So a school with 0 ELL students, 0 minority students, and 0 free and reduced price lunch students is estimated to have -778.78 students taking the English test in grades 2 through 8. Doesn’t make sense, does it? This is a common occurrence, to have estimated values outside of the possible ranges of the outcome variable in real life. The linear model doesn’t know that a negative value for cst28_engl doesn’t make sense. This possibly reflects the improbability of having a school with those types of demographics.
pctell = -55.93: For a one percent increase in the percent of students classified as ELL, we expect to see a 55.93 count decrease in the number of students taking the test. This coefficient has a t-score of -1.80, which has a p-value of 0.07. It is statistically significant at the 10% level.
pctmin = 98.70:
Statistical significance?
pctfrpl = -17.81:
Statistical significance?
Multiple $R^2$ = 0.03: This is the normal $R^2$. This means that 3% of the variation in cst28_engl is explained by the variables in our model.
Adjusted $R^2$ = 0.03: This is the $R^2$, but penalized for the number of independent variables in our model.

At the end, were we to write out an equation based on our model for the estimated cst28_engl values, it would look something like this.

\[\widehat{cst28\_engl} = -778.78 - 55.93(pctell) + 98.70(pctmin) - 17.81(pctfrpl)\]

Wait, but how do we include categorical variables on the right side? What if we wanted to also include a variable for district type? You have two options here.

First, you can just include it as you would include anything else. Look at the following regression and how it is run.

\[cst28_engl = \beta_0 + \beta_1(districttype) + \varepsilon\]

# Running model with categorical variable, and printing summary
summary(lm(cst28_engl ~ districttype, data=api))


Call:
lm(formula = cst28_engl ~ districttype, data = api)

Residuals:
   Min     1Q Median     3Q    Max 
 -5328  -1850  -1265    323 297823 

Coefficients:
                    Estimate Std. Error t value Pr(>|t|)    
(Intercept)           1627.8      466.3   3.491 0.000503 ***
districttypeHigh     -1107.5     1288.0  -0.860 0.390067    
districttypeUnified   3702.3      709.1   5.221 2.17e-07 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 10600 on 986 degrees of freedom
  (27 observations deleted due to missingness)
Multiple R-squared:  0.03139,   Adjusted R-squared:  0.02942 
F-statistic: 15.97 on 2 and 986 DF,  p-value: 1.487e-07

(Notice, by the way, that this is the same model we ran for the ANOVA in the previous tutorial, and we got the same results.)

In the results, we see districttypeHigh and districttypeUnified. As you know, you enter categorical variables in regression by using a series of binary (0/1) indicators for each category and you leave one category out to serve as a reference group. So in total, it would look like this.

\[cst28_engl = \beta_0 + \beta_1(High) + \beta_2(Unified) + \varepsilon\]

We left the “Elementary” group out. When you enter a categorical variable in R, it will automatically include the categories separately and leave one out. You can make it so R leaves a different group out instead of the one in automatically picks, but that is a little complicated, so I suggest the other method.

The safer way to do this, albeit a little more tedious, is to create new indicator variables yourself using the ifelse() function. See below for an example of this.

# Create an indicator for Elementary schools
api$Elementary <- ifelse(api$districttype=="Elementary", 1, 0)

This command creates a new variable called Elementary. This Elementary variable is equal to 1 (the second argument) if the condition api$districttype=="Elementary" is true (the first argument). If it is false, this variable is equal to 0. You can check to make sure this is correct by using the table() command to create a crosstabulation.

# Check crosstabs of districttype versus Elementary
table(api$districttype, api$Elementary)

            
               0   1
  Elementary   0 538
  High        78   0
  Unified    400   0

Can you code the two lines to make indicator variables High and Unified?

Run:

Now, once you’ve done that, run the model as shown below.

# Running model with indicator variables, and printing summary
summary(lm(cst28_engl ~ High + Unified, data=api))


Call:
lm(formula = cst28_engl ~ High + Unified, data = api)

Residuals:
   Min     1Q Median     3Q    Max 
 -5328  -1850  -1265    323 297823 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   1627.8      466.3   3.491 0.000503 ***
High         -1107.5     1288.0  -0.860 0.390067    
Unified       3702.3      709.1   5.221 2.17e-07 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 10600 on 986 degrees of freedom
  (27 observations deleted due to missingness)
Multiple R-squared:  0.03139,   Adjusted R-squared:  0.02942 
F-statistic: 15.97 on 2 and 986 DF,  p-value: 1.487e-07

Notice that the results are the same as when we entered the categorical variable, but now we have more control over which group is left out.

Can you run the model, but with the “Unified” group left out instead of the “Elementary” group?

Run:

Linear Regression in R

Richard Blissett

2017-09-18

Ordinary least squares regression

External packages and functions