Welcome, AFIT Data Science learners! This is the first part of a two-part joint presentation on Linear Regression based on the Introduction to Statistical Learning in R (ISLR) course and book by Hastie and Tibshirani1. Their course is offered for free on the Stanford Lagunita online platform (edX) and is an excellent resource. This material comes from Chapter 3: Linear Regression, Section 3.R, the lab on median house values in Boston. The exact link is here, but you may need to register at https://lagunita.stanford.edu/ to get access to the free edX course.
I’ve also leveraged an RPub by Ronnie Kolodziej2, who identified some useful code in the car package to visualize influential observations. Dr. Boehmke also put together a superior lesson on Linear Regression which supplements the ISLR course3 and has about double the content of this pub.
This lesson requires the MASS, ISLR, tidyverse, and knitr packages if you want to run the included code.
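If any of these are missing on your machine, a one-time install (a quick sketch, assuming a standard CRAN setup) looks like this:

install.packages(c("MASS", "ISLR", "tidyverse", "knitr", "car", "plyr", "stargazer")) # one-time install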
# library(ISLR) # Intro to Stat Learning package
# library(MASS) # contains the Boston dataset
# library(tidyverse) # piping
# library(knitr) # kable
# library(car) # has good residuals and VIF functions
# library(plyr) # rename
# library(stargazer) # displays linear regression results tables very nicely
Boston data set

To help illustrate how R can be used to conduct regression, we will look at a popular data set that has been used in many regression lessons: the Boston house value set.
There are 14 variables (columns) and 506 rows in the Boston house values data frame in the MASS package. To look at those variables you may run names(Boston); however, we are only going to look at three of those variables:
• age: proportion of owner-occupied units built prior to 1940
• lstat: lower status of the population (percent)
• medv: median value of owner-occupied homes in $1000s
Let’s plot median house values (medv) versus lstat, but first I am going to rename those columns to lowerStatus and medianValue so they are a little clearer standing on their own in future plots. You may want to do this on your own dataset, so check out the code below.
#rename vars
Boston <- rename(Boston, c("lstat" = "lowerStatus", "medv" = "medianValue"))
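# Note: the call above uses plyr::rename() syntax, c("old" = "new").
# If you only have dplyr (tidyverse) loaded, the equivalent call would be:
# Boston <- dplyr::rename(Boston, lowerStatus = lstat, medianValue = medv)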
# plot linear fit
plot(medianValue~lowerStatus, Boston)
fit1 <- lm(medianValue~lowerStatus, Boston)
This plot of lower status (lowerStatus) versus median value (medianValue) shows a negative correlation. We also observe that medianValue appears to top out at 50, which may indicate the values were capped during data collection.
fit1 <- lm(medianValue~lowerStatus, Boston)
summary(fit1)
##
## Call:
## lm(formula = medianValue ~ lowerStatus, data = Boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.168 -3.990 -1.318 2.034 24.500
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 34.55384 0.56263 61.41 <2e-16 ***
## lowerStatus -0.95005 0.03873 -24.53 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.216 on 504 degrees of freedom
## Multiple R-squared: 0.5441, Adjusted R-squared: 0.5432
## F-statistic: 601.6 on 1 and 504 DF, p-value: < 2.2e-16
If we run a linear model using lm() with median value (y) and lower status (x), the equation R fits is \(y = -0.95x + 34.55\). Calling summary(fit1) shows the standard error, the t-value, and the p-value (Pr(>|t|)) for each coefficient. In the model summary for fit1 we see the t-value is very large and the p-value is very low, which indicates lowerStatus is likely a significant variable. Furthermore, the \(R^2\) value is 0.5441, meaning about 54% of the variance in medianValue is explained by this one variable.
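As a quick sanity check (my own sketch, not part of the original lab): for a one-predictor model, \(R^2\) is simply the squared correlation between the response and the predictor.

# R^2 for simple linear regression equals the squared Pearson correlation
with(Boston, cor(lowerStatus, medianValue)^2) # ~0.544, matching summary(fit1)
coef(fit1) # extract just the intercept and slope used above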
This is simple linear regression because there is only one predictor. To visualize this line we call the abline() function on fit1. Calling the names() function on fit1 reveals the 12 components stored in the fitted-model object. I will not be using these here, so for a more in-depth treatment refer to some of the resources listed below.
plot(medianValue~lowerStatus, Boston)
abline(fit1,col = "red")
names(fit1)
## [1] "coefficients" "residuals" "effects" "rank"
## [5] "fitted.values" "assign" "qr" "df.residual"
## [9] "xlevels" "call" "terms" "model"
Also, we should observe that the linear trend goes below zero for the largest lowerStatus values. A negative median house value does not make sense and is one indication that the linear fit is not adequate. But it is a first attempt, and a very simple one. Let’s see what else R can help us figure out about this simple linear regression.
We may be interested in reporting the 95% confidence intervals of the coefficients and intercept. Using the confint() function reveals the lowerStatus coefficient is between -1.026 and -0.874. So if we had a null hypothesis that there is no relationship between median house value and lower status, we could reject it at the 95% level, since zero is not within those bounds.
confint(fit1, level = 0.95) #level = 0.95 is default if not called
## 2.5 % 97.5 %
## (Intercept) 33.448457 35.6592247
## lowerStatus -1.026148 -0.8739505
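If you want to see where those numbers come from, a by-hand check (a small sketch of my own, using the t-distribution) reproduces the lowerStatus interval:

# coefficient +/- t-critical * standard error reproduces confint()
est <- coef(summary(fit1))["lowerStatus", "Estimate"]
se  <- coef(summary(fit1))["lowerStatus", "Std. Error"]
est + c(-1, 1) * qt(0.975, df = fit1$df.residual) * se # about -1.026 to -0.874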
The R predict() function is helpful for computing fitted values at specific points of interest according to the current linear fit. To predict the median value of homes in areas with lowerStatus values of 5, 10, 15, and 30 (all within the range of the data), running the code below shows the fit and the 95% confidence interval for each point. I’ve used kable() to display the results with only three decimal places and to rename the default column names to a more readable format.
predict(fit1, data.frame(lowerStatus = c(5,10,15,30)), interval = "confidence") %>%
kable(digits = 3, col.names = c("Confidence Fit","Lower 95%", "Upper 95%"))
Confidence Fit | Lower 95% | Upper 95% |
---|---|---|
29.804 | 29.007 | 30.600 |
25.053 | 24.474 | 25.633 |
20.303 | 19.732 | 20.875 |
6.052 | 4.625 | 7.480 |
To get a prediction interval on the same points, we change interval = "confidence" to interval = "prediction". The prediction intervals are much wider than the confidence intervals, which is more conservative when predicting the median value of a new home area not in the current data but drawn from the same population (in 1978, same geographic sampling area, etc.). Obviously this linear fit is not fully adequate: the prediction interval at lowerStatus = 30 already extends below zero, and the fitted line itself goes negative for the largest lowerStatus values.
predict(fit1, data.frame(lowerStatus = c(5,10,15, 30)), interval = "prediction") %>%
kable(digits = 3, col.names = c("Prediction Fit","Lower 95%", "Upper 95%"))
Prediction Fit | Lower 95% | Upper 95% |
---|---|---|
29.804 | 17.566 | 42.042 |
25.053 | 12.828 | 37.279 |
20.303 | 8.078 | 32.528 |
6.052 | -6.243 | 18.347 |
As an aside, if you are interested in adding these predictions to a data.frame you can use `modelr::add_predictions()`.
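For example (a minimal sketch of my own, assuming modelr is installed; it ships with the tidyverse but is not attached by library(tidyverse)):

# append the fitted values from fit1 to the data as a new "pred" column
library(modelr)
Boston_pred <- add_predictions(Boston, fit1)
head(Boston_pred$pred)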
To show the intervals on our original data, use the predict() function on the values of interest and plot the results over our linear model of medianValue ~ lowerStatus.
The predicted values from our linear model are the black points. The confidence intervals extend as high and as low as the blue circles; the prediction intervals extend as high and as low as the red circles. This gives us a good idea of how large each of those intervals is.
x <- c(5,10,15, 30)
new_x <-data.frame(lowerStatus = c(5,10,15, 30))
CI <- predict(fit1, new_x, interval = "confidence")
PI <- predict(fit1, new_x, interval = "prediction")
plot(medianValue~lowerStatus, Boston, pch = "+" , col = "gray")
points(x, CI[1:4], pch = 19, col = "black" )
# 29.8, 25.05, 20.30 and 6.05 are predicted values y_hat
# --The Confidence Interval Points
points(x, CI[5:8],pch = 20, col = "blue") # lower bound
points(x,CI[9:12],pch = 20, col = "blue") # upper bound
# --The Prediction Interval Points
points(x,PI[5:8],pch = 20, col = "red") #lower bound of PI
points(x,PI[9:12], pch = 20,col = "red") #Upper bound of PI
abline(fit1, lty = 2, col = "darkgray") # dashed gray fit line
# another resource for better graphs: https://rpubs.com/Bio-Geek/71339
car package

An R-Pub by Ronnie Kolodziej recommends using the methods in the car package to look at residuals and other model-fitness measures:
“Leverage statistics can be computed for any number of predictors using the hatvalues function. The function influenceIndexPlot from the car package creates four diagnostic plots including a plot of the hat-values.”
# whats the max leverage point?
# which.max(hatvalues(fit1)) #reveals 375
plot(hatvalues(fit1))
#plotting that max influence point verifies it is at the very top of the plot.
points(375, hatvalues(fit1)[375], col = "red" , pch = 19)
abline(h = 0.0039683) # the (p+1)/n leverage threshold discussed below
Using which.max(hatvalues(fit1)) shows us that the highest-leverage observation is number 375. According to the R-Pub by Ronnie Kolodziej:

“The which.max() function identifies the index of the largest element of a vector. In this case, it tells us which observation has the largest leverage statistic. Recall that hat-values exceeding (p+1)/n = (1+1)/504 = 0.0039683 should be evaluated for high leverage.”

If I add a horizontal line at that value to the hat-value chart, it becomes clear that a lot of the observations are high leverage according to that definition.
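To see how many observations that actually is, here is a quick count (my own sketch, using n = nrow(Boston) rather than the 504 in the quote):

# count observations exceeding the (p+1)/n rule-of-thumb leverage threshold
threshold <- (1 + 1) / nrow(Boston) # p = 1 predictor, n = 506
sum(hatvalues(fit1) > threshold)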
If we highlight observation 375 on our original plot with the linear fit fit1, we see that observation 375 is the area with the largest lowerStatus value. It makes sense that it would be a leverage point.
plot(medianValue~lowerStatus, Boston, pch = "+" , col = "gray")
points(medianValue~lowerStatus, Boston[375,], pch = 19 , col = "red")
abline(fit1, lty = 2, col = "darkgray") # dashed gray fit line
par(cex.axis=1.5, cex.lab=1.1)
car::influenceIndexPlot(fit1, id.n = 5)
The car package influenceIndexPlot function provides index plots of Cook’s distances, leverages, Studentized residuals, and outlier significance levels for a regression object. You may need to right-click and view the image in its own window to see it properly. This dataset has 506 observations being fitted on the x-axis. A detailed description of Cook’s distance, the leverage plot, the studentized residuals, and the outlier significance levels is outside the scope of this introduction to Linear Regression, but this shows there is code to produce them quickly in the car package, and more details can be found in the references section.
outliers <- c(142,215,413,374,375)
plot(medianValue~lowerStatus, Boston, pch = "+" , col = "gray")
points(medianValue~lowerStatus, Boston[outliers,], pch = 19 , col = "red")
abline(fit1, lty = 2, col = "darkgray") # dashed gray fit line
According to the Cook’s distance and hat-value panels, rows 142, 215, 374, 375, and 413 are the top five influential observations. Above they are shown in red, and you can see they occupy a specific part of the graph.
outliers <- c(258, 263, 268,371 , 372)
plot(medianValue~lowerStatus, Boston, pch = "+" , col = "gray")
points(medianValue~lowerStatus, Boston[outliers,], pch = 19 , col = "red")
abline(fit1, lty = 2, col = "darkgray") # dashed gray fit line
According to the Bonferroni p-values and studentized residuals, rows 258, 263, 268, 371, and 372 are the top five outliers/influential points. Above they are in red, and you can see they too occupy a specific part of the graph.
Now let’s move beyond simple linear regression with just one variable and do multiple linear regression, which is “just” more variables. We literally add more variables using the plus symbol, such as + age, in the lm() formula. In this case age is significant, but the \(R^2\) and adjusted \(R^2\) have not improved much from the simple linear model; the amount of variance explained is still roughly 55%.
fit2 <- lm(medianValue~lowerStatus+age,data=Boston)
summary(fit2)
##
## Call:
## lm(formula = medianValue ~ lowerStatus + age, data = Boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.981 -3.978 -1.283 1.968 23.158
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 33.22276 0.73085 45.458 < 2e-16 ***
## lowerStatus -1.03207 0.04819 -21.416 < 2e-16 ***
## age 0.03454 0.01223 2.826 0.00491 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.173 on 503 degrees of freedom
## Multiple R-squared: 0.5513, Adjusted R-squared: 0.5495
## F-statistic: 309 on 2 and 503 DF, p-value: < 2.2e-16
So now let’s see what the relationship is between medianValue and all the other variables in the Boston dataset at the same time. If we simply type medianValue ~ ., R recognizes this as bringing every column besides medianValue into the regression formula.

Now a lot of variables are significant; however, age is no longer significant like it was in the previous fit. Age losing significance basically means some of the other predictors are correlated with it. \(R^2\) is now 0.74. So do we stop and publish an equation with all variables except age? No, we need to look at some other measures of fit. The next part covers residuals.
fit3 <- lm(medianValue ~ ., Boston)
#summary(fit3)
#stargazer(fit3, type = "html")
Dependent variable: medianValue

Term | Estimate (Std. Error) |
---|---|
crim | -0.108*** (0.033) |
zn | 0.046*** (0.014) |
indus | 0.021 (0.061) |
chas | 2.687*** (0.862) |
nox | -17.767*** (3.820) |
rm | 3.810*** (0.418) |
age | 0.001 (0.013) |
dis | -1.476*** (0.199) |
rad | 0.306*** (0.066) |
tax | -0.012*** (0.004) |
ptratio | -0.953*** (0.131) |
black | 0.009*** (0.003) |
lowerStatus | -0.525*** (0.051) |
Constant | 36.459*** (5.103) |
Observations | 506 |
R2 | 0.741 |
Adjusted R2 | 0.734 |
Residual Std. Error | 4.745 (df = 492) |
F Statistic | 108.077*** (df = 13; 492) |

Note: *p<0.1; **p<0.05; ***p<0.01
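Since the drop in age’s significance hints at collinearity among the predictors, one quick check (a sketch using the car package mentioned earlier, not part of the original lab) is to look at variance inflation factors:

# variance inflation factors for the full model; values much above ~5-10 suggest collinearity
car::vif(fit3)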
If one calls plot(fit3), instead of getting one plot of your defined y and x’s you get four diagnostic plots of the residuals. Residuals measure how far each data point falls from the fitted linear model, and plotting the residuals against other quantities helps bring out patterns of non-linearity not captured by the current equation. An introduction to each plot is available here.
par(mfrow=c(2,2))
plot(fit3)
The first residual plot is Residuals versus Fitted Values. The curve in the residuals indicates non-linearity. The fix could be as simple as adding an interaction term, or it could mean a predictor needs to be transformed with log(x) or sqrt(x), or a polynomial term needs to be added, etc.
The second residual plot is the Normal Q-Q plot of standardized residuals versus theoretical quantiles. Most of the residuals seem to fit this line well except at the far extremes (labelled as observations 369, 372, and 373).
The third residual plot is the Scale-Location plot of the square root of the standardized residuals versus the fitted values. It indicates whether the variance is changing with the fitted values (heteroscedasticity), though here that could also be part of the non-linearity missed by the model.
The fourth residual plot is Residuals versus Leverage, the standardized residuals plotted against the leverage values.
Once again this is a quick introduction, and a proper explanation of these plots can be found in the resources/references below.
The update() function allows one to quickly change a linear model; in this case we take out age and indus using fit4 <- update(fit3, ~ . - age - indus), and everything left in the model is significant. This is helpful when you are iterating quickly between candidate models.
fit4 <- update(fit3,~.-age-indus) #useful when you are changing one thing at a time...
# or
fit4 <- lm(medianValue ~ .- age - indus, data = Boston) #the typical way you do it each time you save a lm()
summary(fit4)
##
## Call:
## lm(formula = medianValue ~ . - age - indus, data = Boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.5984 -2.7386 -0.5046 1.7273 26.2373
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 36.341145 5.067492 7.171 2.73e-12 ***
## crim -0.108413 0.032779 -3.307 0.001010 **
## zn 0.045845 0.013523 3.390 0.000754 ***
## chas 2.718716 0.854240 3.183 0.001551 **
## nox -17.376023 3.535243 -4.915 1.21e-06 ***
## rm 3.801579 0.406316 9.356 < 2e-16 ***
## dis -1.492711 0.185731 -8.037 6.84e-15 ***
## rad 0.299608 0.063402 4.726 3.00e-06 ***
## tax -0.011778 0.003372 -3.493 0.000521 ***
## ptratio -0.946525 0.129066 -7.334 9.24e-13 ***
## black 0.009291 0.002674 3.475 0.000557 ***
## lowerStatus -0.522553 0.047424 -11.019 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.736 on 494 degrees of freedom
## Multiple R-squared: 0.7406, Adjusted R-squared: 0.7348
## F-statistic: 128.2 on 11 and 494 DF, p-value: < 2.2e-16
In a model formula, the * operator denotes an interaction and also keeps the main effects, so lowerStatus*age expands to lowerStatus + age + lowerStatus:age.
fit5 <- lm(medianValue~lowerStatus*age,Boston)
summary(fit5)
##
## Call:
## lm(formula = medianValue ~ lowerStatus * age, data = Boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.806 -4.045 -1.333 2.085 27.552
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 36.0885359 1.4698355 24.553 < 2e-16 ***
## lowerStatus -1.3921168 0.1674555 -8.313 8.78e-16 ***
## age -0.0007209 0.0198792 -0.036 0.9711
## lowerStatus:age 0.0041560 0.0018518 2.244 0.0252 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.149 on 502 degrees of freedom
## Multiple R-squared: 0.5557, Adjusted R-squared: 0.5531
## F-statistic: 209.3 on 3 and 502 DF, p-value: < 2.2e-16
To add a power term like a square, you need to wrap it in the identity function, I(lowerStatus^2), to preserve the arithmetic meaning of \(lowerStatus^2\). You can try taking out the I() to see what happens; inside a formula, ^ is a formula operator, so the squared term silently drops out of the model rather than producing an error.
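Here is what that looks like (a small sketch of my own comparing the two formulas):

# with I(): intercept, lowerStatus, and I(lowerStatus^2) -- three coefficients
length(coef(lm(medianValue ~ lowerStatus + I(lowerStatus^2), Boston)))
# without I(): the ^2 is absorbed by the formula, leaving only two coefficients
length(coef(lm(medianValue ~ lowerStatus + lowerStatus^2, Boston)))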
To draw this quadratic fit on the plot we can no longer use abline(); we must use points(lowerStatus, fitted(fit6), ...) to put the current lowerStatus values on the x-axis and the fitted values of the quadratic model, fitted(fit6), on the y-axis. (Note that points() takes x first and then y, the reverse of the usual y ~ x formula order.)
fit6 <- lm(medianValue~lowerStatus +I(lowerStatus^2),Boston); summary(fit6) #execute two lines at a time
##
## Call:
## lm(formula = medianValue ~ lowerStatus + I(lowerStatus^2), data = Boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.2834 -3.8313 -0.5295 2.3095 25.4148
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 42.862007 0.872084 49.15 <2e-16 ***
## lowerStatus -2.332821 0.123803 -18.84 <2e-16 ***
## I(lowerStatus^2) 0.043547 0.003745 11.63 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.524 on 503 degrees of freedom
## Multiple R-squared: 0.6407, Adjusted R-squared: 0.6393
## F-statistic: 448.5 on 2 and 503 DF, p-value: < 2.2e-16
attach(Boston)
par(mfrow=c(1,2))
plot(medianValue~lowerStatus)
title(main = "A Polynomial 2 Fit")
points(lowerStatus,fitted(fit6),col="red",pch=20) #can't use abline anymore cuz its not a line
fit7 <- lm(medianValue~poly(lowerStatus,4))
plot(medianValue~lowerStatus)
title(main = "A Polynomial 4 Fit")
points(lowerStatus,fitted(fit7),col="blue",pch=20)
Another way to quickly define a polynomial fit and graph it is the code used above, lm(medianValue ~ poly(lowerStatus, 4)), which creates polynomial terms of lowerStatus up to the fourth degree.
Larose, Daniel T. and Chantal D. Larose (a dad and daughter duo), Data Mining and Predictive Analytics, Second Edition, page 218. Check out “The Ladder of Re-expressions Heuristic” in the Google Books preview.
According to the ladder of re-expressions heuristic, we should transform medianValue and lowerStatus with a square-root or a log function.
fit8 <- lm(log(medianValue)~log(lowerStatus))
plot(log(medianValue)~log(lowerStatus))
points(log(lowerStatus),fitted(fit8),col="blue",pch=20)
points(log(lowerStatus)[375], log(medianValue)[375], col = "red" , pch = 19)
par(mfrow=c(2,2))
plot(fit8)
#Can anyone find a different transformation that the Q-Q plot matches better?
# plot(hatvalues(fit8))
# points(hatvalues(fit8)[375])
# abline(h = 0.0039683 )
#stargazer(fit8, type = "html", title = "Log(MedianValue) vs Log(LowerStatus) Plot" )
The stargazer(fit8, type = "html") call uses the stargazer package to create nicely formatted HTML output. Note that it is commented out above because the raw results of running the snippet are not pretty, but when you copy the output and place it in your R Markdown text where it can be processed as HTML, it creates the following result:
Dependent variable: log(medianValue)

Term | Estimate (Std. Error) |
---|---|
log(lowerStatus) | -0.560*** (0.017) |
Constant | 4.362*** (0.042) |
Observations | 506 |
R2 | 0.677 |
Adjusted R2 | 0.677 |
Residual Std. Error | 0.232 (df = 504) |
F Statistic | 1,057.644*** (df = 1; 504) |

Note: *p<0.1; **p<0.05; ***p<0.01
To see if two models are significantly different, use an ANOVA. Let’s see whether our first model with just lowerStatus is as good as the full model with all the variables (fit3). You can also compare models side by side in stargazer by listing the fits you would like to compare; for example, stargazer(fit1, fit3, type = "html") produces a side-by-side table of the two fits.
anova(fit1, fit3)
## Analysis of Variance Table
##
## Model 1: medianValue ~ lowerStatus
## Model 2: medianValue ~ crim + zn + indus + chas + nox + rm + age + dis +
## rad + tax + ptratio + black + lowerStatus
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 504 19472
## 2 492 11079 12 8393.6 31.063 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#stargazer(fit1, fit3, type = "html" ) #see comment below
We get a significant difference when comparing the models in the analysis of variance table, which means we reject the null hypothesis that the additional predictors add no explanatory power beyond lowerStatus alone.
Dependent variable: medianValue

Term | (1) | (2) |
---|---|---|
crim |  | -0.108*** (0.033) |
zn |  | 0.046*** (0.014) |
indus |  | 0.021 (0.061) |
chas |  | 2.687*** (0.862) |
nox |  | -17.767*** (3.820) |
rm |  | 3.810*** (0.418) |
age |  | 0.001 (0.013) |
dis |  | -1.476*** (0.199) |
rad |  | 0.306*** (0.066) |
tax |  | -0.012*** (0.004) |
ptratio |  | -0.953*** (0.131) |
black |  | 0.009*** (0.003) |
lowerStatus | -0.950*** (0.039) | -0.525*** (0.051) |
Constant | 34.554*** (0.563) | 36.459*** (5.103) |
Observations | 506 | 506 |
R2 | 0.544 | 0.741 |
Adjusted R2 | 0.543 | 0.734 |
Residual Std. Error | 6.216 (df = 504) | 4.745 (df = 492) |
F Statistic | 601.618*** (df = 1; 504) | 108.077*** (df = 13; 492) |

Note: *p<0.1; **p<0.05; ***p<0.01
This lesson was designed to be a 15-minute-or-less quick skim of important R packages and functions to get your interest piqued in this very useful predictive tool. It’s important that you walk away from this lesson with a lot more resources to peruse, since a full lesson on the topic of regression isn’t possible in 15 minutes. Jeff Leek, in his Coursera lecture on “Regression in the Real World,” states one should pay attention to “confounders, complicated interactions, skewness, outliers, non-linear patterns, variance change, units/scales issues, overloading/overfitting, and the differences between correlation and causation”; his Dropbox link for these specific slides is here, but it may not work, and you may need to go through the Coursera link.
There are a wealth of great explanations of the math behind Linear Regression as well as other R-tutorials on Linear Regression in the references and resources section below.
In part two we’ll look at different ways of narrowing a list of variables down to a ‘best subset’, using measures like adjusted \(R^2\) to help define “best” on the Hitters dataset, and at how to split your data into training and testing sets to help validate a model.
Press on to Part 2: http://rpubs.com/JakeJohnson541/298655 or test your knowledge in the ‘Post Lesson’ exercises below.
These are exercises with answers taken from Brian Caffo’s4 Johns Hopkins class on regression. As of 8/11/2017 the Johns Hopkins course on regression seems to be a paid class on Coursera, although the content appears to be on GitHub; a very short class that seems to cover this material is also offered on Udemy for free.5
Fit a linear regression model to the father.son
dataset with the father as the predictor and the son as the outcome. Plot the son’s height (horizontal axis) versus the residuals (vertical axis). Watch a video solution.
Refer to question 1. Directly estimate the residual variance and compare this estimate to the output of lm
. Watch a video solution.
Refer to question 1. Give the R squared for this model. Watch a video solution.
Load the mtcars
dataset. Fit a linear regression with miles per gallon as the outcome and horsepower as the predictor. Plot horsepower versus the residuals. Watch a video solution.
Refer to question 4. Directly estimate the residual variance and compare this estimate to the output of lm
. Watch a video solution.
Refer to question 4. Give the R squared for this model. Watch a video solution.
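If you want to check your work without the videos, here is a minimal sketch of my own for exercises 4 through 6 (mtcars ships with base R):

fit_mt <- lm(mpg ~ hp, data = mtcars)       # exercise 4: mpg as outcome, horsepower as predictor
plot(mtcars$hp, resid(fit_mt))              # horsepower versus the residuals
sum(resid(fit_mt)^2) / df.residual(fit_mt)  # exercise 5: residual variance by hand...
summary(fit_mt)$sigma^2                     # ...matches lm's residual standard error squared
summary(fit_mt)$r.squared                   # exercise 6: R squared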
Some are free and online; hover/click the links. The Daniel T. and Chantal D. Larose book (a dad and daughter duo!), Data Mining and Predictive Analytics, Second Edition, is a great and very well-organized book. On page 81 they introduce the “Ladder of Re-expressions Heuristic,” which I’ve linked to in Google Books preview in the book’s link below.
• Johns Hopkins 7-course Data Science Track on Coursera: some content offered for free, but very comprehensive; it’s like another degree.
• Stanford Lagunita OpenX Course Introduction to Statistical Learning in R (ISLR): they vaguely remind me of the Click and Clack brothers on NPR and throw in a little humor every once in a while, which can be refreshing! Free book, course, videos, and R scripts; the best place to start.
• Udemy course on R, ggplot, and Simple Linear Regression.
• Swirl6: a self-paced free lesson on R that runs in your R console! More information can be found at https://github.com/swirldev/swirl_courses#swirl-courses; try typing the following to get started in R:
install.packages("swirl")
library(swirl)
install_course("Regression Models")
swirl()
I’m not going to cover this in class, but if you have a qualitative predictor you should examine the following code the ISLR authors use on Carseats, especially the helpful contrasts()
function.
#fix(Carseats) # uncomment if you want to quickly edit the data set in an Excel-like editor
names(Carseats)
## [1] "Sales" "CompPrice" "Income" "Advertising" "Population"
## [6] "Price" "ShelveLoc" "Age" "Education" "Urban"
## [11] "US"
summary(Carseats)
## Sales CompPrice Income Advertising
## Min. : 0.000 Min. : 77 Min. : 21.00 Min. : 0.000
## 1st Qu.: 5.390 1st Qu.:115 1st Qu.: 42.75 1st Qu.: 0.000
## Median : 7.490 Median :125 Median : 69.00 Median : 5.000
## Mean : 7.496 Mean :125 Mean : 68.66 Mean : 6.635
## 3rd Qu.: 9.320 3rd Qu.:135 3rd Qu.: 91.00 3rd Qu.:12.000
## Max. :16.270 Max. :175 Max. :120.00 Max. :29.000
## Population Price ShelveLoc Age
## Min. : 10.0 Min. : 24.0 Bad : 96 Min. :25.00
## 1st Qu.:139.0 1st Qu.:100.0 Good : 85 1st Qu.:39.75
## Median :272.0 Median :117.0 Medium:219 Median :54.50
## Mean :264.8 Mean :115.8 Mean :53.32
## 3rd Qu.:398.5 3rd Qu.:131.0 3rd Qu.:66.00
## Max. :509.0 Max. :191.0 Max. :80.00
## Education Urban US
## Min. :10.0 No :118 No :142
## 1st Qu.:12.0 Yes:282 Yes:258
## Median :14.0
## Mean :13.9
## 3rd Qu.:16.0
## Max. :18.0
fit1 <- lm(Sales~.+Income:Advertising+Age:Price,Carseats)
summary(fit1)
##
## Call:
## lm(formula = Sales ~ . + Income:Advertising + Age:Price, data = Carseats)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.9208 -0.7503 0.0177 0.6754 3.3413
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.5755654 1.0087470 6.519 2.22e-10 ***
## CompPrice 0.0929371 0.0041183 22.567 < 2e-16 ***
## Income 0.0108940 0.0026044 4.183 3.57e-05 ***
## Advertising 0.0702462 0.0226091 3.107 0.002030 **
## Population 0.0001592 0.0003679 0.433 0.665330
## Price -0.1008064 0.0074399 -13.549 < 2e-16 ***
## ShelveLocGood 4.8486762 0.1528378 31.724 < 2e-16 ***
## ShelveLocMedium 1.9532620 0.1257682 15.531 < 2e-16 ***
## Age -0.0579466 0.0159506 -3.633 0.000318 ***
## Education -0.0208525 0.0196131 -1.063 0.288361
## UrbanYes 0.1401597 0.1124019 1.247 0.213171
## USYes -0.1575571 0.1489234 -1.058 0.290729
## Income:Advertising 0.0007510 0.0002784 2.698 0.007290 **
## Price:Age 0.0001068 0.0001333 0.801 0.423812
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.011 on 386 degrees of freedom
## Multiple R-squared: 0.8761, Adjusted R-squared: 0.8719
## F-statistic: 210 on 13 and 386 DF, p-value: < 2.2e-16
contrasts(Carseats$ShelveLoc)
## Good Medium
## Bad 0 0
## Good 1 0
## Medium 0 1
Finally, the authors show us how to write functions to help plot faster. They start with a simple little function, regplot, that calculates the fit, makes the plot, and adds a red line. Then they show a super powerful addition to the function. Can you spot the difference?
regplot <- function(x,y){
fit <- lm(y~x)
plot(x,y)
abline(fit,col="red")
}
attach(Carseats)
regplot(Price,Sales)
#Now the change:
regplot <- function(x,y,...){
fit <- lm(y~x)
plot(x,y,...)
abline(fit,col="red")
}
regplot(Price,Sales,xlab="Price",ylab="Sales",col="blue",pch=20)
Adding the “dot dot dot” (...) to your new function definition and passing it on to the plot() call inside the function supercharges this simple plot function with the ability to pass along any other arguments to plot()!
Finally, to see what other plotting symbols you can use besides filled circles, here is some helpful code:
plot(1:20,1:20,pch=1:20,cex=2)
ISLR Chap 3 Linear Regression [Stanford edX] (p 59 – 82), by Gareth James, Daniela Witten, Trevor Hastie and Robert Tibshirani. http://www.StatLearning.com Chap 3 Regression Slides: https://lagunita.stanford.edu/c4x/HumanitiesSciences/StatLearning/asset/linear_regression-handout.pdf ISLR
Package Doc: https://cran.r-project.org/web/packages/ISLR/ISLR.pdf Potential Answers to ISLR Chap 3 Book Exercises (not verified): https://github.com/yahwes/ISLR/blob/master/ch03soln.Rmd↩
RPubs Lab 2: Simple Linear Regression using MASS Median Value house data (assumed author Ronnie Kolodziej), https://rpubs.com/ronniekolodziej/156135↩
Dr. Bradley Boehmke’s Predictive Analytics: https://afit-r.github.io/predictive use link to Linear Regression and Modeling Selection↩
Johns Hopkins 7-Part Specialty Data Science Track, [Offered on Coursera.com at https://www.coursera.org/specializations/jhu-data-science ], Dr. Brian Caffo, one of the three instructors, Regression tracks based on his book with tons of R-code here: https://leanpub.com/regmods/read#leanpub-auto-fitted-values-residuals-and-residual-variation↩
R, ggplot, and Simple Linear Regression [Udemy.com] by Charles Redmond, https://www.udemy.com/machlearn1/learn/v4/t/lecture/2897930?start=0 Uses the Parent Child Height Galton data Set, and I used his Post-Lesson Exercises here↩
Sean Kross, Nick Carchedi, Bill Bauer and Gina Grdina (2017). swirl
: Learn R, in R. R package version 2.4.3. https://CRAN.R-project.org/package=swirl↩
Harrison, D. and Rubinfeld, D.L. (1978) Hedonic prices and the demand for clean air. J. Environ. Economics and Management 5, 81-102. Which is here: http://www.colorado.edu/ibs/crs/workshops/R_1-11-2012/root/Harrison_1978.pdf↩
Hlavac, Marek (2015). stargazer: Well-Formatted Regression and Summary Statistics Tables. R package version 5.2. http://CRAN.R-project.org/package=stargazer↩
Gareth James, Daniela Witten, Trevor Hastie and Rob Tibshirani (2013). ISLR: Data for An Introduction to Statistical Learning with Applications in R. R package version 1.0. https://CRAN.R-project.org/package=ISLR↩
Venables, W. N. & Ripley, B. D. (2002) Modern Applied Statistics with S. Fourth Edition. Springer, New York. ISBN 0-387-95457-0 [MASS]↩
Harrison, D. and Rubinfeld, D.L. (1978) Hedonic prices and the demand for clean air. J. Environ. Economics and Management 5, 81–102. and Belsley D.A., Kuh, E. and Welsch, R.E. (1980) Regression Diagnostics. Identifying Influential Data and Sources of Collinearity. New York: Wiley.↩
Sean Kross, Nick Carchedi, Bill Bauer and Gina Grdina (2017). swirl
: Learn R, in R. R package version 2.4.3. https://CRAN.R-project.org/package=swirl↩
John Verzani (2015). UsingR: Data Sets, Etc. for the Text “Using R for Introductory Statistics”, Second Edition. R package version 2.0-5. https://CRAN.R-project.org/package=UsingR↩
Hadley Wickham (2017). tidyverse: Easily Install and Load ‘Tidyverse’ Packages. R package version 1.1.1. https://CRAN.R-project.org/package=tidyverse↩
R Core Team (2017). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. URL https://www.R-project.org/.↩