Linear Regression in R

Welcome, AFIT Data Science learners! This is the first part of a two-part joint presentation on Linear Regression based on the Introduction to Statistical Learning in R (ISLR) course and book by Hastie and Tibshirani1. Their course is offered for free on Stanford Lagunita (online edX) and is an excellent resource. This material comes from Chapter 3: Linear Regression, Section 3.R, the lab on median house values in Boston. The exact link is here, but you may need to register at https://lagunita.stanford.edu/ to get access to the free edX course.

I’ve also leveraged an RPub by Ronnie Kolodziej2, who identified some useful code in the car package to visualize influential observations. Dr. Boehmke also put together a superior lesson on Linear Regression which supplements the ISLR course3 and has about double the content of this pub.

This lesson requires the MASS, ISLR, tidyverse, knitr, car, plyr, and stargazer packages if you want to run the included code.

# library(ISLR)      # Intro to Stat Learning package
# library(MASS)      # contains the Boston dataset
# library(tidyverse) # piping (%>%)
# library(knitr)     # kable()
# library(car)       # has good residuals and VIF functions
# library(plyr)      # rename()
# library(stargazer) # displays linear regression results tables very nicely

The Boston data set

To help illustrate R’s basic ability to conduct regression, we will look at a popular data set that has been used in many regression lessons: the Boston house values set.

There are 14 variables (columns) and 506 rows in the Boston house values data frame in the MASS package. To look at all of those variables you may run names(Boston); however, we are only going to look at three of them:

  • age proportion of owner-occupied units built prior to 1940.

  • lstat lower status of the population (percent).

  • medv median value of owner-occupied homes in $1000s.

Simple linear regression

Let’s plot median house values medv versus lstat, but first I am going to rename those columns to lowerStatus and medianValue so they are a little clearer standing on their own in future plots. You may want to do this on your own dataset, so check out the code below.

#rename vars
# Note: this is plyr::rename() syntax ("old" = "new");
# the dplyr equivalent is rename(Boston, lowerStatus = lstat, medianValue = medv)
Boston <- rename(Boston, c("lstat" = "lowerStatus", "medv" = "medianValue"))
# plot linear fit
plot(medianValue~lowerStatus, Boston)

This plot of lower status lowerStatus versus median value medianValue shows a negative correlation. We also observe that medianValue appears to top out at 50 ($50,000), which may indicate the values were capped when the data were collected.

fit1 <- lm(medianValue~lowerStatus, Boston)
summary(fit1)
## 
## Call:
## lm(formula = medianValue ~ lowerStatus, data = Boston)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -15.168  -3.990  -1.318   2.034  24.500 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 34.55384    0.56263   61.41   <2e-16 ***
## lowerStatus -0.95005    0.03873  -24.53   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.216 on 504 degrees of freedom
## Multiple R-squared:  0.5441, Adjusted R-squared:  0.5432 
## F-statistic: 601.6 on 1 and 504 DF,  p-value: < 2.2e-16

If we run a linear model using lm() with median value (y) and lower status (x), the equation R fits is \(y = -0.95x + 34.55\). Calling summary(fit1) shows the standard error, the t-value, and the p-value (Pr(>|t|)) for each coefficient. In the model summary for fit1 we see the t-value is very large and the p-value is very low, which indicates lowerStatus is likely a significant variable. Furthermore, the \(R^2\) value is 0.5441, meaning about 54% of the variance in medianValue is explained by this one variable.
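If you want these numbers programmatically rather than reading them off the printout, they can be pulled straight out of the model and summary objects; a minimal sketch:

coef(fit1)                # intercept 34.55 and slope -0.95
summary(fit1)$r.squared   # 0.5441
coef(summary(fit1))       # estimates, standard errors, t-values and p-values as a matrix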

This is simple linear regression because there is only one predictor. To visualize this line we call the abline() function using fit1. Using the names() function on fit1 reveals the 12 components stored in the fitted-model object. I will not be using these, so for a more in-depth treatment refer to some of the resources listed below.

plot(medianValue~lowerStatus, Boston)
abline(fit1,col = "red")

names(fit1)
##  [1] "coefficients"  "residuals"     "effects"       "rank"         
##  [5] "fitted.values" "assign"        "qr"            "df.residual"  
##  [9] "xlevels"       "call"          "terms"         "model"

We should also observe that the linear trend goes below zero once lowerStatus exceeds roughly 36% (34.55/0.95). A negative median house value does not make sense and is one indication that the linear fit is not adequate. But it is a first attempt, and a very simple one. Let’s see what else R can help us figure out about this simple linear regression.

Coefficient Confidence Intervals & simple hypothesis test

We may be interested in reporting the 95% confidence intervals of the coefficients and intercept. Using the confint() function reveals the lowerStatus coefficient is between -1.026 and -0.874. So if we had a null hypothesis that there is no relationship between median house values and lower status, we could reject it using the 95% confidence interval on the coefficient, since zero is not within those bounds.

 confint(fit1, level = 0.95) #level = 0.95 is default if not called
##                 2.5 %     97.5 %
## (Intercept) 33.448457 35.6592247
## lowerStatus -1.026148 -0.8739505
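The same interval can be reproduced by hand from the coefficient table (estimate ± t-quantile × standard error); a minimal sketch using the fit1 pieces above:

est <- coef(summary(fit1))["lowerStatus", "Estimate"]    # -0.95005
se  <- coef(summary(fit1))["lowerStatus", "Std. Error"]  #  0.03873
est + c(-1, 1) * qt(0.975, df = fit1$df.residual) * se   # matches the confint() bounds above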

Confidence Intervals and Prediction Intervals

The R predict() function is helpful for computing fitted values at specific points of interest according to the current linear fit. To predict the median value of homes in areas with lowerStatus values of 5, 10, 15 and 30 (all within the observed range of the data and therefore reasonable for prediction), running the code below shows the 95% confidence interval for each of the points. I’ve used kable() to help display the results with only three decimal places and to rename the default column names to a more readable format.

 predict(fit1, data.frame(lowerStatus = c(5,10,15,30)), interval = "confidence")  %>% 
kable(digits = 3, col.names = c("Confidence Fit","Lower 95%", "Upper 95%"))
Confidence Fit   Lower 95%   Upper 95%
        29.804      29.007      30.600
        25.053      24.474      25.633
        20.303      19.732      20.875
         6.052       4.625       7.480

To get a prediction interval on the same points we change interval = "confidence" to interval = "prediction". One can see the prediction intervals are much wider than the confidence intervals, which is the more conservative choice when predicting the median value of a new area not in the current data but drawn from the same population (in 1978, from the same geographic sampling area, etc.). Note that the lower prediction bound is already negative at lowerStatus = 30, another sign this simple linear fit is not adequate.

predict(fit1, data.frame(lowerStatus = c(5,10,15, 30)), interval = "prediction") %>%
kable(digits = 3, col.names = c("Prediction Fit","Lower 95%", "Upper 95%"))
Prediction Fit   Lower 95%   Upper 95%
        29.804      17.566      42.042
        25.053      12.828      37.279
        20.303       8.078      32.528
         6.052      -6.243      18.347

As an aside, if you are interested in adding these predictions to a data.frame you can use modelr::add_predictions().
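A minimal sketch of that idea, assuming the modelr package is installed (the "pred" column name is modelr's default):

library(modelr)
Boston_with_pred <- add_predictions(Boston, fit1)  # appends a "pred" column of fitted values
head(Boston_with_pred[, c("lowerStatus", "medianValue", "pred")])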

Graph Predictions

To show the intervals on our original data, use the predict() function on the values of interest and plot them over our linear model of medianValue ~ lowerStatus. The predicted values from our linear model are the filled black points. The confidence intervals extend as high and as low as the blue circles. The prediction intervals extend as high and as low as the red circles. This gives us a good idea how large each of those intervals is.

 x <-  c(5,10,15, 30)
new_x <-data.frame(lowerStatus = c(5,10,15, 30))  

CI <- predict(fit1,  new_x, interval = "confidence") 
PI <- predict(fit1, new_x, interval = "prediction") 


plot(medianValue~lowerStatus, Boston, pch = "+" , col = "gray")

points(x, CI[1:4], pch = 19, col = "black" )  
 # 29.8, 25.05, 20.30 and 6.05 are predicted values y_hat

# --The Confidence Interval Points 
points(x, CI[5:8],pch = 20, col = "blue") # lower bound 
points(x,CI[9:12],pch = 20, col = "blue") # upper bound

# --The Prediction Interval Points 
 points(x,PI[5:8],pch = 20, col = "red") #lower bound of PI
points(x,PI[9:12], pch = 20,col = "red") #Upper bound of PI
abline(fit1, lty = 2, col = "darkgray")  # lty (not pch) controls the line type for abline

# another resource for better graphs:  https://rpubs.com/Bio-Geek/71339   

Examine Influential Observations & Residuals Using Functions from the car Package

An R-Pub by Ronnie Kolodziej recommends using the methods in the car package to look at residuals and other model-fitness measures:

Leverage statistics can be computed for any number of predictors using the hatvalues function. The function influenceIndexPlot from the car package creates four diagnostic plots including a plot of the hat-values.

# what's the max leverage point?
# which.max(hatvalues(fit1))  #reveals 375

plot(hatvalues(fit1))
#plotting that max influence point verifies it is at the very top of the plot.
points(375, hatvalues(fit1)[375], col = "red" , pch = 19)
  abline(h = 0.0039683 )

Using which.max(hatvalues(fit1)) shows us the highest-leverage observation is number 375. According to the R-Pub by Ronnie Kolodziej:

“The which.max() function identifies the index of the largest element of a vector. In this case, it tells us which observation has the largest leverage statistic. Recall that hat-values exceeding (p+1)/n = (1+1)/504 = 0.0039683 should be evaluated for high leverage.”

If I add a line with that value to the hat-value chart it becomes clear that a lot of the observations are high leverage according to that definition.
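A quick way to see how many observations exceed that cutoff (a minimal sketch using the 0.0039683 threshold quoted above):

threshold <- 0.0039683             # the (p+1)/n cutoff quoted above
sum(hatvalues(fit1) > threshold)   # how many of the 506 observations get flagged
mean(hatvalues(fit1) > threshold)  # the proportion flagged as potentially high leverage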

If we map observation 375 against our original plot with linear fit fit1, we see that observation 375 is the area with the largest lowerStatus value. It makes sense that it would be a leverage point.

 plot(medianValue~lowerStatus, Boston, pch = "+" , col = "gray") 
points(medianValue~lowerStatus, Boston[375,], pch = 19 , col = "red") 
abline(fit1, lty = 2, col = "darkgray") 

par(cex.axis=1.5, cex.lab=1.1)
car::influenceIndexPlot(fit1, id.n = 5)

The car package influenceIndexPlot() function provides index plots of Cook’s distances, leverages, Studentized residuals, and outlier significance levels for a regression object. You may need to right-click and view the image in its own window to see it properly. This dataset has 506 observations indexed on the x-axis. A detailed description of the Cook’s distance, leverage, studentized residual, and outlier significance plots is outside the scope of this introduction to Linear Regression, but this shows there is code in the car package to produce them quickly, and more details can be found in most of the references section.

outliers <- c(142,215,413,374,375)
 plot(medianValue~lowerStatus, Boston, pch = "+" , col = "gray") 
 points(medianValue~lowerStatus, Boston[outliers,], pch = 19 , col = "red") 
 abline(fit1, lty = 2, col = "darkgray") 

According to the Cook’s distance and hat-values panels, rows 142, 215, 374, 375, and 413 are the five most influential observations. Above they are in red, and you can see they occupy a specific part of the graph.

outliers <- c(258, 263, 268,371 , 372)
 plot(medianValue~lowerStatus, Boston, pch = "+" , col = "gray") 
 points(medianValue~lowerStatus, Boston[outliers,], pch = 19 , col = "red") 
 abline(fit1, lty = 2, col = "darkgray") 

According to the Bonferroni p-values and studentized residuals, rows 258, 263, 268, 371 and 372 are the top five outliers/influential points. Above they are in red, and you can see they too occupy a specific part of the graph.

Multiple linear regression

Now let’s move beyond simple linear regression with just one variable and do multiple linear regression, which is “just” multiple variables. We literally add more variables to the lm() formula with the plus symbol, such as +age. In this case age is significant, but the \(R^2\) and \(Adj R^2\) have not improved much over the simple linear model. The amount of variance explained is still roughly 55%.

fit2 <- lm(medianValue~lowerStatus+age,data=Boston)
summary(fit2)
## 
## Call:
## lm(formula = medianValue ~ lowerStatus + age, data = Boston)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -15.981  -3.978  -1.283   1.968  23.158 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 33.22276    0.73085  45.458  < 2e-16 ***
## lowerStatus -1.03207    0.04819 -21.416  < 2e-16 ***
## age          0.03454    0.01223   2.826  0.00491 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.173 on 503 degrees of freedom
## Multiple R-squared:  0.5513, Adjusted R-squared:  0.5495 
## F-statistic:   309 on 2 and 503 DF,  p-value: < 2.2e-16

So now let’s see the relation between medianValue and all the variables in the Boston dataset at the same time. If we simply type medianValue ~ . , R recognizes this as bringing every column besides medianValue into the regression formula.

Now a lot of variables are significant; however, age is no longer significant like it was in the previous fit. Age losing significance basically means there are other predictors correlated with it (a quick collinearity check with car::vif() is sketched after the coefficient table below). \(R^2\) is now 0.74. So do we stop and publish an equation with all variables except age? No, we need to look at some other measures of fit. This next part is Residuals.

fit3=lm(medianValue~.,Boston)
#summary(fit3)
#stargazer(fit3, type = "html")
Dependent variable: medianValue

crim           -0.108***   (0.033)
zn              0.046***   (0.014)
indus           0.021      (0.061)
chas            2.687***   (0.862)
nox           -17.767***   (3.820)
rm              3.810***   (0.418)
age             0.001      (0.013)
dis            -1.476***   (0.199)
rad             0.306***   (0.066)
tax            -0.012***   (0.004)
ptratio        -0.953***   (0.131)
black           0.009***   (0.003)
lowerStatus    -0.525***   (0.051)
Constant       36.459***   (5.103)

Observations          506
R2                    0.741
Adjusted R2           0.734
Residual Std. Error   4.745 (df = 492)
F Statistic           108.077*** (df = 13; 492)
Note: *p<0.1; **p<0.05; ***p<0.01
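One way (not part of the original lab) to check whether age is being crowded out by correlated predictors is the variance inflation factor from the car package; a minimal sketch, assuming fit3 from above:

car::vif(fit3)  # VIFs well above roughly 5-10 suggest a predictor is strongly correlated with the others
round(cor(Boston[, c("age", "nox", "dis", "indus", "lowerStatus")]), 2)  # pairwise correlations involving age

If age shows a high VIF or strong pairwise correlations, its information is largely carried by other predictors, which is why it can lose significance in the full model.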

Residuals

If one calls plot(fit3), instead of getting one plot of your defined y against your defined x’s, you get four diagnostic plots of the residuals. Residuals measure how far each observation falls from the fitted model, and plotting the residuals against other quantities helps bring out patterns of non-linearity not captured by the current equation. An introduction to each plot is available here.

par(mfrow=c(2,2)) 
plot(fit3)

The first residual plot is Residuals versus Fitted Values. The curve in the residuals indicates non-linearity. That could be as simple as adding an interaction term, or it could mean something needs to be transformed with a log(x) or sqrt(x) or a polynomial needs to be added, etc.

The second residual plot is Normal Q-Q Plot of Standardized Residuals versus Theoretical Quantiles. Most of the residuals seem to fit this line well except the far extremes (labelled as numbers 369, 372, and 373).

The third residual plot is the Scale-Location plot: the square root of the standardized residuals versus the fitted values. It indicates whether the variance of the residuals changes with the fitted values (heteroscedasticity). But some of this pattern could also be due to non-linearity missed by the model.

The fourth residual plot is Residuals versus Leverage: the standardized residuals plotted against the leverage values.

Once again this is a quick introduction, and a proper explanation of these plots can be found in the resources/references below.
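If you want a formal check on the changing-variance pattern hinted at in the Scale-Location plot, the car package also provides a score test for non-constant error variance; a minimal sketch, assuming fit3 from above:

car::ncvTest(fit3)  # score test for non-constant error variance; a small p-value suggests heteroscedasticity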

Quickly Adjusting Our Linear Models

The update() function allows one to quickly change a linear model. In this case we take out age and indus using fit4=update(fit3,~.-age-indus), and everything left in the model appears significant. This is helpful when you are making a lot of iterations while investigating models.

fit4 <- update(fit3,~.-age-indus) #useful when you are changing one thing at a time... 

# or

fit4 <- lm(medianValue ~ .- age - indus, data = Boston)  #the typical way you do it each time you save a lm()

summary(fit4)
## 
## Call:
## lm(formula = medianValue ~ . - age - indus, data = Boston)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -15.5984  -2.7386  -0.5046   1.7273  26.2373 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  36.341145   5.067492   7.171 2.73e-12 ***
## crim         -0.108413   0.032779  -3.307 0.001010 ** 
## zn            0.045845   0.013523   3.390 0.000754 ***
## chas          2.718716   0.854240   3.183 0.001551 ** 
## nox         -17.376023   3.535243  -4.915 1.21e-06 ***
## rm            3.801579   0.406316   9.356  < 2e-16 ***
## dis          -1.492711   0.185731  -8.037 6.84e-15 ***
## rad           0.299608   0.063402   4.726 3.00e-06 ***
## tax          -0.011778   0.003372  -3.493 0.000521 ***
## ptratio      -0.946525   0.129066  -7.334 9.24e-13 ***
## black         0.009291   0.002674   3.475 0.000557 ***
## lowerStatus  -0.522553   0.047424 -11.019  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.736 on 494 degrees of freedom
## Multiple R-squared:  0.7406, Adjusted R-squared:  0.7348 
## F-statistic: 128.2 on 11 and 494 DF,  p-value: < 2.2e-16

Nonlinear terms, Interactions, Transformations

Interactions

In formula syntax the * operator crosses terms: lowerStatus*age expands to lowerStatus + age + lowerStatus:age, so the main effects are kept along with the interaction. (Use : if you want the interaction term alone; see the sketch after the model summary below.)

fit5 <- lm(medianValue~lowerStatus*age,Boston)
summary(fit5)
## 
## Call:
## lm(formula = medianValue ~ lowerStatus * age, data = Boston)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -15.806  -4.045  -1.333   2.085  27.552 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     36.0885359  1.4698355  24.553  < 2e-16 ***
## lowerStatus     -1.3921168  0.1674555  -8.313 8.78e-16 ***
## age             -0.0007209  0.0198792  -0.036   0.9711    
## lowerStatus:age  0.0041560  0.0018518   2.244   0.0252 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.149 on 502 degrees of freedom
## Multiple R-squared:  0.5557, Adjusted R-squared:  0.5531 
## F-statistic: 209.3 on 3 and 502 DF,  p-value: < 2.2e-16
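For comparison (a sketch, not from the original lab), the : operator gives only the interaction term, while writing the main effects and interaction out explicitly reproduces fit5 exactly:

coef(lm(medianValue ~ lowerStatus + age + lowerStatus:age, Boston))  # identical to coef(fit5)
coef(lm(medianValue ~ lowerStatus:age, Boston))                      # interaction only, no main effects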

Quadratic Terms

To add a power term like a square you need to wrap it in the identity function I() to preserve the meaning of \(lowerStatus^2\), i.e. I(lowerStatus^2). Without I(), the formula operator ^ is interpreted as term crossing, so the quadratic term silently drops out and no error is reported; try taking out the I() to become familiar with this behaviour.
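A quick demonstration of that silent behaviour (a sketch, not from the original lab): without I(), the quadratic term disappears and you get the same coefficients as fit1.

coef(lm(medianValue ~ lowerStatus + lowerStatus^2, Boston))  # quadratic silently dropped; same as coef(fit1)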

To plot this quadratic equation we can no longer use abline(); we must use points(lowerStatus, fitted(fit6), ...), mapping the x values to the current lowerStatus values and the fitted values of the quadratic model (fit6) to the y-axis using fitted(fit6). (Note that points() takes its arguments in x, y order, the reverse of the usual y ~ x formula notation.)

fit6 <- lm(medianValue~lowerStatus +I(lowerStatus^2),Boston); summary(fit6) #execute two lines at a time
## 
## Call:
## lm(formula = medianValue ~ lowerStatus + I(lowerStatus^2), data = Boston)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -15.2834  -3.8313  -0.5295   2.3095  25.4148 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      42.862007   0.872084   49.15   <2e-16 ***
## lowerStatus      -2.332821   0.123803  -18.84   <2e-16 ***
## I(lowerStatus^2)  0.043547   0.003745   11.63   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.524 on 503 degrees of freedom
## Multiple R-squared:  0.6407, Adjusted R-squared:  0.6393 
## F-statistic: 448.5 on 2 and 503 DF,  p-value: < 2.2e-16
attach(Boston)
par(mfrow=c(1,2))
plot(medianValue~lowerStatus)
title(main = "A Polynomial 2 Fit")
points(lowerStatus,fitted(fit6),col="red",pch=20) #can't use abline anymore cuz its not a line


fit7 <- lm(medianValue~poly(lowerStatus,4))
plot(medianValue~lowerStatus)
title(main = "A Polynomial 4 Fit")
points(lowerStatus,fitted(fit7),col="blue",pch=20)

Another way to quickly define and graph a polynomial fit is the code called above, lm(medianValue~poly(lowerStatus,4)), which fits a polynomial in lowerStatus with terms up to the fourth power.
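As an aside (a sketch, not part of the original lab): poly() uses orthogonal polynomial terms by default, and raw = TRUE gives coefficients on the raw powers instead, but the fitted curve is identical either way. An anova() comparison also shows whether the degree-4 fit improves on the quadratic fit6.

fit7_raw <- lm(medianValue ~ poly(lowerStatus, 4, raw = TRUE), Boston)
all.equal(fitted(fit7), fitted(fit7_raw))  # TRUE: same fitted values, different parameterization
anova(fit6, fit7)                          # F-test: do the cubic and quartic terms help beyond the quadratic?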

Transformations

Larose, Daniel T. and Chantal D. Larose (a dad-and-daughter duo), Data Mining and Predictive Analytics, Second Edition, page 218. Check out “The Ladder of Re-expressions Heuristic” in this Google Books preview.

According to the ladder of re-expressions heuristic, we should transform medianValue and lowerStatus with a square-root or a log function.

fit8 <- lm(log(medianValue)~log(lowerStatus))

plot(log(medianValue)~log(lowerStatus))
points(log(lowerStatus),fitted(fit8),col="blue",pch=20)
points(log(lowerStatus)[375], log(medianValue)[375], col = "red" , pch = 19)

par(mfrow=c(2,2))
plot(fit8) 

#Can anyone find a different transformation that the Q-Q plot matches better?

# plot(hatvalues(fit8))
# points(hatvalues(fit8)[375])
#   abline(h = 0.0039683 )

#stargazer(fit8, type = "html", title = "Log(MedianValue) vs Log(LowerStatus) Plot" )

stargazer(fit8) will use the stargazer package to create beautiful HTML output. Note, it is commented out because the raw results after running the snippet are not pretty, but when you copy the output and place it in your R Markdown text where it can be processed as HTML, it creates the following result:

Log(MedianValue) vs Log(LowerStatus) Plot

Dependent variable: log(medianValue)

log(lowerStatus)   -0.560***   (0.017)
Constant            4.362***   (0.042)

Observations          506
R2                    0.677
Adjusted R2           0.677
Residual Std. Error   0.232 (df = 504)
F Statistic           1,057.644*** (df = 1; 504)
Note: *p<0.1; **p<0.05; ***p<0.01

Model Comparison: ANOVA

To see if two models are significantly different, use an ANOVA. Let’s see if our first model with just lowerStatus (fit1) is as good as the model with all of the variables (fit3). You can also do this comparison in stargazer by listing the models you would like to compare; for example stargazer(fit1, fit3, type = "html") produces a side-by-side comparison of the two fits.

anova(fit1, fit3)
## Analysis of Variance Table
## 
## Model 1: medianValue ~ lowerStatus
## Model 2: medianValue ~ crim + zn + indus + chas + nox + rm + age + dis + 
##     rad + tax + ptratio + black + lowerStatus
##   Res.Df   RSS Df Sum of Sq      F    Pr(>F)    
## 1    504 19472                                  
## 2    492 11079 12    8393.6 31.063 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#stargazer(fit1, fit3, type = "html" )  #see comment below

We get a significant difference when comparing the models in the Analysis of Variance table, which means we reject the null hypothesis that the additional predictors add no explanatory power beyond lowerStatus alone.

Dependent variable: medianValue

                     (1) fit1                     (2) fit3
crim                                              -0.108***   (0.033)
zn                                                 0.046***   (0.014)
indus                                              0.021      (0.061)
chas                                               2.687***   (0.862)
nox                                              -17.767***   (3.820)
rm                                                 3.810***   (0.418)
age                                                0.001      (0.013)
dis                                               -1.476***   (0.199)
rad                                                0.306***   (0.066)
tax                                               -0.012***   (0.004)
ptratio                                           -0.953***   (0.131)
black                                              0.009***   (0.003)
lowerStatus         -0.950***  (0.039)            -0.525***   (0.051)
Constant            34.554***  (0.563)            36.459***   (5.103)

Observations         506                           506
R2                   0.544                         0.741
Adjusted R2          0.543                         0.734
Residual Std. Error  6.216 (df = 504)              4.745 (df = 492)
F Statistic          601.618*** (df = 1; 504)      108.077*** (df = 13; 492)
Note: *p<0.1; **p<0.05; ***p<0.01

Summary / Parting Shots / Transition to Part Two

This lesson was designed to be a 15-minute-or-less quick skim of important R packages and functions to get your interest piqued in this very useful predictive tool. It’s very important that you walk away from this lesson with a lot more resources to peruse, since a full lesson on the topic of Regression isn’t possible in 15 minutes. Jeff Leek, in his Coursera class segment on “Regression in the Real World”, states one should pay attention to “confounders, complicated interactions, skewness, outliers, non-linear patterns, variance change, units/scales issues, overloading/overfitting, and the differences between correlation and causation”; his Dropbox for these specific slides is here, but it may not work, and you may need to go through the Coursera link.

There are a wealth of great explanations of the math behind Linear Regression as well as other R-tutorials on Linear Regression in the references and resources section below.

In part two we’ll look at different ways of narrowing this list of variables down to a ‘best subset’, using measures like Adjusted \(R^2\) to help define “best” on the Hitters dataset, and at how to split your data into training and testing sets to help validate a model.

Press on to Part 2: http://rpubs.com/JakeJohnson541/298655 or test your knowledge in the ‘Post Lesson’ exercises below.

Post-Lesson Exercises

These are exercises with answers taken from Brian Caffo’s4 Johns Hopkins class on Regression. The Johns Hopkins course on Regression as of 8/11/2017 appears to be a paid class on Coursera, although the content appears to be on GitHub, and a very short class is offered on Udemy for free that seems to cover this material too.5

  1. Fit a linear regression model to the father.son dataset with the father as the predictor and the son as the outcome. Plot the son’s height (horizontal axis) versus the residuals (vertical axis). Watch a video solution.

  2. Refer to question 1. Directly estimate the residual variance and compare this estimate to the output of lm. Watch a video solution.

  3. Refer to question 1. Give the R squared for this model. Watch a video solution.

  4. Load the mtcars dataset. Fit a linear regression with miles per gallon as the outcome and horsepower as the predictor. Plot horsepower versus the residuals. Watch a video solution.

  5. Refer to question 4. Directly estimate the residual variance and compare this estimate to the output of lm. Watch a video solution.

  6. Refer to question 4. Give the R squared for this model. Watch a video solution.

Resources / References

Awesome Books

Some are free and online; hover/click the links. The Daniel T. and Chantal D. Larose book (a dad-and-daughter duo!), Data Mining and Predictive Analytics, Second Edition, is a great and very well organized book. On page 81 they introduce the “Ladder of Re-expressions Heuristic”, which I’ve provided a link to in Google Books preview in the book’s link below.

Awesome Online Courses:

By Charles Redmond, free: a very simple approach, from learning R through plotting lines, in bite-size video tutorials.

Johns Hopkins 7-course Data Science Track on Coursera, some content offered for free, but very comprehensive, it’s like another degree
Stanford Lagunita open edX Course, Introduction to Statistical Learning in R (ISLR) – the instructors vaguely remind me of the Click and Clack brothers on NPR; they throw in a little humor every once in a while, which can be refreshing! Free book, course, videos, and R scripts; the best place to start.
Udemy course on R, ggplot, and Simple Linear Regression

More Awesome Resources

• Swirl6: a self-paced free lesson on R that runs in your R console! More information can be found at https://github.com/swirldev/swirl_courses#swirl-courses; try typing the following to get started in R:

install.packages("swirl")
library(swirl)
install_course("Regression Models")  
swirl()

R Package References

  • Boston 7

  • Stargazer 8

  • ISLR 9

  • MASS 10

  • Swirl11

  • UsingR 12 for father.son

  • knitr 13

  • tidyverse 14

  • and of course, Base R15

Extra

I’m not going to cover this in class, but if you have a qualitative predictor you should examine the following code, which the ISLR authors use on the Carseats data set, especially the helpful contrasts() function.

Qualitative predictors

#fix(Carseats) #unhide if you want to quickly fix data set with Excel like function
names(Carseats)
##  [1] "Sales"       "CompPrice"   "Income"      "Advertising" "Population" 
##  [6] "Price"       "ShelveLoc"   "Age"         "Education"   "Urban"      
## [11] "US"
summary(Carseats)
##      Sales          CompPrice       Income        Advertising    
##  Min.   : 0.000   Min.   : 77   Min.   : 21.00   Min.   : 0.000  
##  1st Qu.: 5.390   1st Qu.:115   1st Qu.: 42.75   1st Qu.: 0.000  
##  Median : 7.490   Median :125   Median : 69.00   Median : 5.000  
##  Mean   : 7.496   Mean   :125   Mean   : 68.66   Mean   : 6.635  
##  3rd Qu.: 9.320   3rd Qu.:135   3rd Qu.: 91.00   3rd Qu.:12.000  
##  Max.   :16.270   Max.   :175   Max.   :120.00   Max.   :29.000  
##    Population        Price        ShelveLoc        Age       
##  Min.   : 10.0   Min.   : 24.0   Bad   : 96   Min.   :25.00  
##  1st Qu.:139.0   1st Qu.:100.0   Good  : 85   1st Qu.:39.75  
##  Median :272.0   Median :117.0   Medium:219   Median :54.50  
##  Mean   :264.8   Mean   :115.8                Mean   :53.32  
##  3rd Qu.:398.5   3rd Qu.:131.0                3rd Qu.:66.00  
##  Max.   :509.0   Max.   :191.0                Max.   :80.00  
##    Education    Urban       US     
##  Min.   :10.0   No :118   No :142  
##  1st Qu.:12.0   Yes:282   Yes:258  
##  Median :14.0                      
##  Mean   :13.9                      
##  3rd Qu.:16.0                      
##  Max.   :18.0
fit1 <- lm(Sales~.+Income:Advertising+Age:Price,Carseats)
summary(fit1)
## 
## Call:
## lm(formula = Sales ~ . + Income:Advertising + Age:Price, data = Carseats)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.9208 -0.7503  0.0177  0.6754  3.3413 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         6.5755654  1.0087470   6.519 2.22e-10 ***
## CompPrice           0.0929371  0.0041183  22.567  < 2e-16 ***
## Income              0.0108940  0.0026044   4.183 3.57e-05 ***
## Advertising         0.0702462  0.0226091   3.107 0.002030 ** 
## Population          0.0001592  0.0003679   0.433 0.665330    
## Price              -0.1008064  0.0074399 -13.549  < 2e-16 ***
## ShelveLocGood       4.8486762  0.1528378  31.724  < 2e-16 ***
## ShelveLocMedium     1.9532620  0.1257682  15.531  < 2e-16 ***
## Age                -0.0579466  0.0159506  -3.633 0.000318 ***
## Education          -0.0208525  0.0196131  -1.063 0.288361    
## UrbanYes            0.1401597  0.1124019   1.247 0.213171    
## USYes              -0.1575571  0.1489234  -1.058 0.290729    
## Income:Advertising  0.0007510  0.0002784   2.698 0.007290 ** 
## Price:Age           0.0001068  0.0001333   0.801 0.423812    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.011 on 386 degrees of freedom
## Multiple R-squared:  0.8761, Adjusted R-squared:  0.8719 
## F-statistic:   210 on 13 and 386 DF,  p-value: < 2.2e-16
contrasts(Carseats$ShelveLoc)
##        Good Medium
## Bad       0      0
## Good      1      0
## Medium    0      1

Finally the authors show us how to write functions to help plot faster. First they start with a simple little function that calculates the fit, makes the plot, and adds a red line in regplot. Then they show a super powerful addition to the function. Can you spot the difference??

Writing R functions to automate lots of regression plots

regplot <- function(x,y){
fit <- lm(y~x)
plot(x,y)
abline(fit,col="red")
}

attach(Carseats)
regplot(Price,Sales)

#Now the change:

regplot <- function(x,y,...){
fit <- lm(y~x)
plot(x,y,...)
abline(fit,col="red")
}
regplot(Price,Sales,xlab="Price",ylab="Sales",col="blue",pch=20)

… Did you see that?

Adding the “dot dot dot” (...) to your new function’s argument list and passing it on to the plot call inside the function just allowed you to supercharge this simple plot function with the power to pass any other arguments through to plot()!

Finally, to see what other symbols you can use to plot besides filled circles, here is some helpful code:

plot(1:20,1:20,pch=1:20,cex=2)


  1. ISLR Chap 3 Linear Regression [Stanford edX] (p 59 – 82), by Gareth James, Daniela Witten, Trevor Hastie and Robert Tibshirani. http://www.StatLearning.com Chap 3 Regression Slides: https://lagunita.stanford.edu/c4x/HumanitiesSciences/StatLearning/asset/linear_regression-handout.pdf ISLR Package Doc: https://cran.r-project.org/web/packages/ISLR/ISLR.pdf Potential Answers to ISLR Chap 3 Book Exercises (not verified): https://github.com/yahwes/ISLR/blob/master/ch03soln.Rmd

  2. RPubs Lab 2: Simple Linear Regression using MASS Median Value house data (assumed author Ronnie Kolodziej), https://rpubs.com/ronniekolodziej/156135

  3. Dr. Bradley Boehmke’s Predictive Analytics: https://afit-r.github.io/predictive use link to Linear Regression and Modeling Selection

  4. Johns Hopkins 7-Part Specialty Data Science Track, [Offered on Coursera.com at https://www.coursera.org/specializations/jhu-data-science ], Dr. Brian Caffo, one of the three instructors; Regression tracks based on his book with tons of R code here: https://leanpub.com/regmods/read#leanpub-auto-fitted-values-residuals-and-residual-variation

  5. R, ggplot, and Simple Linear Regression [Udemy.com] by Charles Redmond, https://www.udemy.com/machlearn1/learn/v4/t/lecture/2897930?start=0 Uses the Parent Child Height Galton data Set, and I used his Post-Lesson Exercises here

  6. Sean Kross, Nick Carchedi, Bill Bauer and Gina Grdina (2017). swirl: Learn R, in R. R package version 2.4.3. https://CRAN.R-project.org/package=swirl

  7. Harrison, D. and Rubinfeld, D.L. (1978) Hedonic prices and the demand for clean air. J. Environ. Economics and Management 5, 81-102. Which is here: http://www.colorado.edu/ibs/crs/workshops/R_1-11-2012/root/Harrison_1978.pdf

  8. Hlavac, Marek (2015). stargazer: Well-Formatted Regression and Summary Statistics Tables. R package version 5.2. http://CRAN.R-project.org/package=stargazer

  9. Gareth James, Daniela Witten, Trevor Hastie and Rob Tibshirani (2013). ISLR: Data for An Introduction to Statistical Learning with Applications in R. R package version 1.0. https://CRAN.R-project.org/package=ISLR

  10. Venables, W. N. & Ripley, B. D. (2002) Modern Applied Statistics with S. Fourth Edition. Springer, New York. ISBN 0-387-95457-0 [MASS]

  11. Harrison, D. and Rubinfeld, D.L. (1978) Hedonic prices and the demand for clean air. J. Environ. Economics and Management 5, 81–102. and Belsley D.A., Kuh, E. and Welsch, R.E. (1980) Regression Diagnostics. Identifying Influential Data and Sources of Collinearity. New York: Wiley.

  12. Sean Kross, Nick Carchedi, Bill Bauer and Gina Grdina (2017). swirl: Learn R, in R. R package version 2.4.3. https://CRAN.R-project.org/package=swirl

  13. John Verzani (2015). UsingR: Data Sets, Etc. for the Text “Using R for Introductory Statistics”, Second Edition. R package version 2.0-5. https://CRAN.R-project.org/package=UsingR

  14. Hadley Wickham (2017). tidyverse: Easily Install and Load ‘Tidyverse’ Packages. R package version 1.1.1. https://CRAN.R-project.org/package=tidyverse

  15. R Core Team (2017). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. URL https://www.R-project.org/.