After today you will know how to assess the adequacy of a linear regression model. We will look at three examples.
Problem Set #4 is posted.
RStudio/R for presentations. Lecture14P.Rpres
Although the regression model is the best-fitting line through the data, you still need to check whether the model is adequate.
You are making four assumptions when fitting a linear regression model to your data: linearity, equal variance, normality, and independence. The assumptions concern the behavior of the residuals, or equivalently the distribution of Y (response variable) conditional on X (explanatory variable).
To the extent these assumptions are valid, the model is said to be adequate for representing your data. A model can be significant, but not adequate.
The first three assumptions are best examined using graphs.
Note: You don't prove an assumption. You check for evidence to reject it.
Consider the cars data frame.
head(cars)
## speed dist
## 1 4 2
## 2 4 10
## 3 7 4
## 4 7 22
## 5 8 16
## 6 9 10
The data give the speed of cars (mph) and the distances (ft) taken to stop. The data were recorded in the 1920s.
cor(cars$speed, cars$dist)
## [1] 0.8069
require(ggplot2)
## Loading required package: ggplot2
## Warning: package 'ggplot2' was built under R version 3.0.2
ggplot(cars, aes(x = speed, y = dist)) + geom_point() + geom_smooth(method = lm,
se = FALSE) + xlab("Speed (mph)") + ylab("Braking Distance (ft)")
The scatter plot shows that there appears to be a linear relationship between braking distance and the forward speed of the car for this sample of data.
The relationship is positive, indicating that as speed increases so does the average braking distance. The relationship also seems to be quite strong (r = 0.81).
model = lm(dist ~ speed, data = cars)
summary(model)
##
## Call:
## lm(formula = dist ~ speed, data = cars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -29.07 -9.53 -2.27 9.21 43.20
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -17.579 6.758 -2.60 0.012 *
## speed 3.932 0.416 9.46 1.5e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 15.4 on 48 degrees of freedom
## Multiple R-squared: 0.651, Adjusted R-squared: 0.644
## F-statistic: 89.6 on 1 and 48 DF, p-value: 1.49e-12
The model indicates a statistically significant relationship (p-value less than 0.001) between braking distance and speed. In fact, 65% of the variation in average braking distance is associated with differences in speed.
The statistically significant effect suggests it is unlikely that, in the population of cars at this time, there is no relationship between speed and braking distance.
The model looks pretty good, so you write up your results and you are done. Hold on. Let's take a closer look.
There tend to be more values of braking distance below the line over the range of speeds between 10 and 20 mph. Also, the spread of the residuals appears to get larger as speed increases: there is more variation in the response at larger values of speed.
A way to see this is to cut the explanatory variable into non-overlapping groups and display the set of corresponding response values as a box plot.
ggplot(cars, aes(x = cut(speed, breaks = 5), y = dist)) + geom_boxplot() + xlab("Speed (mph)") +
ylab("Break Distance (ft)")
With the breaks argument set to a value of 5, the cut() function divides the range of speed into five equal-width intervals.
cut(cars$speed, breaks = 5)
## [1] (3.98,8.19] (3.98,8.19] (3.98,8.19] (3.98,8.19] (3.98,8.19]
## [6] (8.19,12.4] (8.19,12.4] (8.19,12.4] (8.19,12.4] (8.19,12.4]
## [11] (8.19,12.4] (8.19,12.4] (8.19,12.4] (8.19,12.4] (8.19,12.4]
## [16] (12.4,16.6] (12.4,16.6] (12.4,16.6] (12.4,16.6] (12.4,16.6]
## [21] (12.4,16.6] (12.4,16.6] (12.4,16.6] (12.4,16.6] (12.4,16.6]
## [26] (12.4,16.6] (12.4,16.6] (12.4,16.6] (16.6,20.8] (16.6,20.8]
## [31] (16.6,20.8] (16.6,20.8] (16.6,20.8] (16.6,20.8] (16.6,20.8]
## [36] (16.6,20.8] (16.6,20.8] (16.6,20.8] (16.6,20.8] (16.6,20.8]
## [41] (16.6,20.8] (16.6,20.8] (16.6,20.8] (20.8,25] (20.8,25]
## [46] (20.8,25] (20.8,25] (20.8,25] (20.8,25] (20.8,25]
## Levels: (3.98,8.19] (8.19,12.4] (12.4,16.6] (16.6,20.8] (20.8,25]
Here you see evidence against the assumption of linearity.
Further, you see that the box sizes increase: the interquartile range (IQR) of braking distance is larger for faster-moving cars. This is evidence against the assumption of constant variance.
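A quick numeric complement to the box plot is to compute the spread of braking distance within each speed group. This is a sketch using base R's tapply(); the five groups match the box plot above.
tapply(cars$dist, cut(cars$speed, breaks = 5), sd)  # SD of stopping distance per speed bin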
What about the assumption of normality? The normality assumption states that the conditional distribution of the residuals at each x_i is adequately described by a normal distribution, but with sample data there usually are not enough observations at each value of X to check this directly.
Instead, in practice, it is common to examine the distribution of all the residuals taken together.
The residuals are obtained by using the residuals() extractor function.
res = residuals(model)
res
## 1 2 3 4 5 6 7 8
## 3.8495 11.8495 -5.9478 12.0522 2.1198 -7.8126 -3.7450 4.2550
## 9 10 11 12 13 14 15 16
## 12.2550 -8.6774 2.3226 -15.6098 -9.6098 -5.6098 -1.6098 -7.5422
## 17 18 19 20 21 22 23 24
## 0.4578 0.4578 12.4578 -11.4746 -1.4746 22.5254 42.5254 -21.4070
## 25 26 27 28 29 30 31 32
## -15.4070 12.5930 -13.3394 -5.3394 -17.2719 -9.2719 0.7281 -11.2043
## 33 34 35 36 37 38 39 40
## 2.7957 22.7957 30.7957 -21.1367 -11.1367 10.8633 -29.0691 -13.0691
## 41 42 43 44 45 46 47 48
## -9.0691 -5.0691 2.9309 -2.9339 -18.8663 -6.7987 15.2013 16.2013
## 49 50
## 43.2013 4.2689
There are several extractor functions that output information from the model as vectors.
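For example (a sketch of a few of them; each returns a vector computed from the fitted model):
coef(model)    # the intercept and slope estimates
fitted(model)  # the fitted values, one per observation
resid(model)   # the residuals, a shorter alias for residuals(model)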
The fortify() function from the ggplot2 package makes a data frame from model object outputs.
model.df = fortify(model)
head(model.df)
## dist speed .hat .sigma .cooksd .fitted .resid .stdresid
## 1 2 4 0.11486 15.53 0.0045923 -1.849 3.849 0.2660
## 2 10 4 0.11486 15.43 0.0435140 -1.849 11.849 0.8189
## 3 4 7 0.07150 15.52 0.0062024 9.948 -5.948 -0.4013
## 4 22 7 0.07150 15.43 0.0254673 9.948 12.052 0.8133
## 5 16 8 0.05997 15.54 0.0006447 13.880 2.120 0.1422
## 6 10 9 0.04990 15.50 0.0071320 17.813 -7.813 -0.5212
A look at the residuals shows that some are positive and some are negative. A histogram of the model's residuals is obtained by typing
ggplot(model.df, aes(.resid)) + geom_histogram(binwidth = 10)
You see that the right tail of the histogram is longer than the left tail. With a normal distribution the tails are symmetric. Thus the assumption of normality is in question.
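A quick numeric check of that skewness compares the mean and median of the residuals; with a longer right tail the mean exceeds the median. A sketch:
mean(res)    # essentially zero by construction (least squares with an intercept)
median(res)  # negative (the summary above shows -2.27), so mean > median: right skew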
Since departures from normality can occur simply because of sampling variation, the question arises as to whether that apparent skewness in the residuals is significant.
An approach to visualizing the expected variation from the reference distribution is to superimpose a confidence band (envelope) on the density plot.
The sm.density() function from the sm package is a quick way to plot the confidence envelope. The argument model = "Normal" is included.
require(sm)
## Loading required package: sm
## Warning: package 'sm' was built under R version 3.0.2
## Package `sm', version 2.2-5: type help(sm) for summary information
sm.density(res, model = "Normal")
The black curve representing the data is shifted left relative to a normal distribution and goes outside the blue band in the right tail indicating that the residuals may not be adequately described by a normal distribution.
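As an optional formal complement to the graphical check, the Shapiro-Wilk test in base R tests the null hypothesis that a sample comes from a normal distribution; a small p-value is evidence against normality.
shapiro.test(res)  # Shapiro-Wilk normality test on the model residuals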
In summary, your linear regression model is not adequate. The assumptions of linearity, equal variance, and normally distributed residuals are not reasonable for these data.
What to do? The relationship appears to be nonlinear. You can use the square root of the braking distance as the response variable.
ggplot(cars, aes(x = speed, y = sqrt(dist))) + geom_point() + geom_smooth(method = lm,
se = FALSE) + xlab("Speed (mph)") + ylab(expression(sqrt("Braking Distance (ft)")))
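A sketch of how you might refit the model on the transformed response and recheck the residuals (model2 is an illustrative name):
model2 = lm(sqrt(dist) ~ speed, data = cars)     # refit with square-root response
sm.density(residuals(model2), model = "Normal")  # recheck the normality assumption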
Let's look at another example.
The file Inc.txt contains average income versus percentage of college graduates by state. Fit a linear regression model to these data and check the model assumptions. Does the linear model appear to be adequate?
Inc = read.table("http://myweb.fsu.edu/jelsner/data/Inc.txt", header = TRUE)
head(Inc)
## State College Income
## 1 AL 20.4 20487
## 2 AK 28.1 26171
## 3 AZ 24.6 21947
## 4 AR 18.4 19479
## 5 CA 27.5 26808
## 6 CO 34.6 27573
Start with a plot of the data.
ggplot(Inc, aes(x = College, y = Income)) + geom_point() + geom_smooth(method = lm) +
xlab("Percentage of College Graduates") + ylab("Average Income ($)")
Regress average income on the percentage of college graduates. Save and interpret the model.
model = lm(Income ~ College, data = Inc)
summary(model)
##
## Call:
## lm(formula = Income ~ College, data = Inc)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4136 -1345 75 1276 5121
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 10460 1615 6.48 4.3e-08 ***
## College 545 63 8.65 2.0e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2080 on 49 degrees of freedom
## Multiple R-squared: 0.604, Adjusted R-squared: 0.596
## F-statistic: 74.8 on 1 and 49 DF, p-value: 1.98e-11
The model is statistically significant, with the percentage of college graduates explaining 60% of the variation in average income by state.
For every one percentage point increase in college graduates, average income increases by $545.
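As a sketch of using the model for prediction, the predict() function returns the fitted average income for a hypothetical state (the 30% figure here is illustrative):
# 10460 + 545 * 30 = about $26,810 (the coefficients above are rounded)
predict(model, newdata = data.frame(College = 30))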
Is the model adequate?
Check linearity and equal spread.
ggplot(Inc, aes(x = cut(College, 4), y = Income)) + geom_boxplot() + xlab("Percentage of College Graduates") +
ylab("Average Income ($)")
It looks pretty favorable for these two assumptions.
Next look at the distribution of the residuals. Are they bell-shaped, indicative of a normal distribution?
res = residuals(model)
sm.density(res, model = "Normal")
The black curve falls within the blue envelope of a normal distribution, so we have no reason to suspect the assumption of normally distributed residuals.
Another example.
Here you use the MplsZagat.csv data set to examine “good” and “poor” restaurant values in Minneapolis, MN.
You will examine two questions:
1. What is the relationship between a restaurant's cost and its Zagat rating?
2. Given the quality of food, service, and decor (controlling for Zagat rating), where do you get the best and worst prices?
Articulating these questions in terms of explanatory and response variables, the Zagat rating is the explanatory variable and the cost of dinner is the response variable.
From Wikipedia: Zagat ratings (after Tim and Nina Zagat) are on a 30-point scale. The scale is made from ratings for food, decor, service, and cost.
Since cost is part of what you are trying to explain, there is a bit of circularity in this analysis that we will ignore for now.
Before you examine these questions, you should look at the response variable Cost.
zagat = read.csv("http://www.tc.umn.edu/~zief0002/Data/MplsZagat.csv", header = TRUE)
head(zagat)
## Name Cost Food Decor Service Rating
## 1 Cosi 12 17 12 13 14.00
## 2 Chipotle 10 19 12 15 15.33
## 3 Hard Rock Cafe 25 12 21 14 15.67
## 4 Pei Wei Asian Diner 15 18 14 16 16.00
## 5 Buca di Beppo 24 15 17 17 16.33
## 6 California Pizza Kitchen 20 19 15 16 16.67
mean(zagat$Cost)
## [1] 37
sd(zagat$Cost)
## [1] 14.24
Based on the 65-restaurant sample, the average cost of a meal is $37.
There is large variation in the cost of dinner, as indicated by the standard deviation of $14.24. Thus, going to a Zagat-rated restaurant can cost very little or it can be very expensive.
What is the relationship between a restaurant's cost and its Zagat rating? This is answered by the correlation coefficient and a scatter plot.
cor(zagat$Cost, zagat$Rating)
## [1] 0.8677
ggplot(data = zagat, aes(x = Rating, y = Cost)) + geom_point() + geom_smooth(method = "lm",
se = FALSE) + xlab("Zagat Rating") + ylab("Cost per dinner ($)")
The plot shows that there is a linear relationship between the average cost of dinner and the overall Zagat rating for restaurants in this sample.
This relationship is positive, indicating that restaurants that are more expensive tend to be the same restaurants that have high Zagat ratings.
The relationship also seems to be quite strong (r = 0.87), with few, if any, of the restaurants in the sample deviating from this linear pattern.
A linear model for the relationship is obtained by typing
model = lm(Cost ~ Rating, data = zagat)
summary(model)
##
## Call:
## lm(formula = Cost ~ Rating, data = zagat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.180 -3.792 0.089 3.326 21.971
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -54.946 6.695 -8.21 1.5e-11 ***
## Rating 4.322 0.312 13.86 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.13 on 63 degrees of freedom
## Multiple R-squared: 0.753, Adjusted R-squared: 0.749
## F-statistic: 192 on 1 and 63 DF, p-value: <2e-16
Results of the regression analysis suggest that there is also a statistically significant relationship (p < 0.001) in the population.
In fact, 75.3% of the variation in average cost is associated with differences in Zagat ratings.
The statistically significant effect suggests that it is unlikely in the population of Minneapolis restaurants that there is no relationship between price and rating.
Not only is the effect significant, but it is also large. In fact, each one point increase in Zagat Rating is associated with an increase of $4.32 in the average cost of a meal.
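As a sketch, you can turn the fitted equation into a prediction; a hypothetical rating of 20 is used here for illustration:
# -54.946 + 4.322 * 20 = about $31.49 for dinner
predict(model, newdata = data.frame(Rating = 20))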
From the small standard error for the slope (0.312) you can deduce that the effect is quite precise. This leads to a narrow confidence interval for the slope estimate.
confint(model)
## 2.5 % 97.5 %
## (Intercept) -68.324 -41.567
## Rating 3.699 4.946
What about model adequacy? Model assumptions are evaluated from evidence in the sample. The assumptions are linearity, equal variance, and residuals that can be described by a normal distribution.
The linearity and constant variance assumptions are checked using a box plot of cost across binned ratings.
ggplot(zagat, aes(x = cut(Rating, breaks = quantile(Rating)), y = Cost)) + geom_boxplot() +
xlab("Zagat Rating") + ylab("Cost per dinner ($)")
With the breaks argument set to a single value, the cut() function divides the range of Rating into equal-width intervals; in that case each bin usually contains a different number of points.
Here, instead, you specify the breaks using the quantile() function. By default the breaks are made at the quartiles, so you end up with four intervals, each containing (roughly) the same number of points.
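One subtlety: cut() makes intervals that are open on the left, so the lowest-rated restaurant falls exactly on the first quantile break and is returned as NA unless you add include.lowest = TRUE. A sketch of a check that counts the points per bin:
table(cut(zagat$Rating, breaks = quantile(zagat$Rating), include.lowest = TRUE))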
Another plot is the local regression smoother.
scatter.smooth(zagat$Cost ~ zagat$Rating)
abline(model, col = "red")
No evidence to reject the assumptions of linearity and constant variance.
Normality assumption.
res = residuals(model)
sm.density(res, xlab = "Model Residuals", model = "Normal")
Although the residuals deviate from a perfectly normal distribution, the deviation is not more than you would expect from sampling variation alone.
Another diagnostic is the quantile-quantile (Q-Q) plot, a graph of the quantiles of one distribution against the quantiles of another. If the distributions have similar shapes, the points fall roughly along a straight line.
To check the normality assumption of a regression model you want to compare the quantiles of the residuals against the quantiles of a normal distribution. You do that with the qqnorm() function.
qqnorm(res)
qqline(res)
Departures from normality are seen as a systematic departure of the points from a straight line on the Q-Q plot.
Interpreting Q-Q plots is more a visceral than an intellectual exercise. The uninitiated are often mystified by the process. Experience is the key here.
Some evidence to suspect the assumption of normality.
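One way to build that experience is to look at Q-Q plots of samples that really are normal, to see how much wiggle sampling variation alone produces. A sketch (z is an illustrative name):
par(mfrow = c(2, 2))  # arrange four plots in a grid
for (i in 1:4) {
    z = rnorm(length(res))  # a sample that truly is normal
    qqnorm(z)
    qqline(z)
}
par(mfrow = c(1, 1))  # reset the plotting layout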
Given the quality of food, service, and decor (controlling for Zagat rating) where do you get the best/worst price?
The residuals are the error in the model: what is left over in the price of dinner after the Zagat rating is accounted for, or in statistical language, "after controlling for Zagat rating." Thus, examining the model residuals will help us answer our second question.
The above assumptions concern the conditional distribution of the residuals. The residuals for each of the observations are automatically computed when you fit a model in R.
In the code below, the fortify() function from the ggplot2 package is called on the model object to build a data frame with elements computed from the fitted model.
model.df = fortify(model)
head(model.df)
## Cost Rating .hat .sigma .cooksd .fitted .resid .stdresid
## 1 12 14.00 0.11647 7.139 0.0606258 5.568 6.4317 0.9590
## 2 10 15.33 0.08280 7.190 0.0017140 11.332 -1.3315 -0.1949
## 3 25 15.67 0.07545 7.008 0.1296197 12.772 12.2277 1.7824
## 4 15 16.00 0.06851 7.191 0.0004802 14.213 0.7869 0.1143
## 5 24 16.33 0.06201 7.108 0.0482201 15.654 8.3461 1.2078
## 6 20 16.67 0.05593 7.182 0.0052023 17.095 2.9053 0.4191
The variables used in the model are in the first two columns, and information about the regression specific to each case (observation) is given in the next six columns.
The first three of those columns (.hat, .sigma, .cooksd) contain information that allows you to assess how influential each observation is on the model. Loosely, if that particular observation were removed and the regression model refit without it, how much difference would it make? If it makes a lot of difference, then the observation is influential.
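As a sketch of quantifying influence directly (cd is an illustrative name), Cook's distance from base R reproduces the .cooksd column above:
cd = cooks.distance(model)  # one value per observation; large values = influential
zagat$Name[which.max(cd)]   # the restaurant with the largest influence on the fit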
ggplot(model.df, aes(x = Rating, y = .resid)) + geom_point() + geom_hline(yintercept = 0) +
xlab("Zagat Rating") + ylab("Model Residuals")
ggplot(model.df, aes(x = Rating, y = .stdresid)) + geom_point() + geom_hline(yintercept = 0) +
geom_hline(yintercept = -2, lty = "dotted") + geom_hline(yintercept = 2,
lty = "dotted") + xlab("Zagat Rating") + ylab("Studentized Residuals")
Looking at the residual plot, you can identify several observations that have large positive residuals (over-priced restaurants) and others that have large negative residuals (under-priced restaurants) after you “control for” Zagat rating.
You can look in the data to find out which restaurants these are, but using indexing you can locate these restaurants on the plot more easily.
You first add the restaurant names from the zagat data frame to the model.df data frame.
model.df$Name = zagat$Name
head(model.df)
## Cost Rating .hat .sigma .cooksd .fitted .resid .stdresid
## 1 12 14.00 0.11647 7.139 0.0606258 5.568 6.4317 0.9590
## 2 10 15.33 0.08280 7.190 0.0017140 11.332 -1.3315 -0.1949
## 3 25 15.67 0.07545 7.008 0.1296197 12.772 12.2277 1.7824
## 4 15 16.00 0.06851 7.191 0.0004802 14.213 0.7869 0.1143
## 5 24 16.33 0.06201 7.108 0.0482201 15.654 8.3461 1.2078
## 6 20 16.67 0.05593 7.182 0.0052023 17.095 2.9053 0.4191
## Name
## 1 Cosi
## 2 Chipotle
## 3 Hard Rock Cafe
## 4 Pei Wei Asian Diner
## 5 Buca di Beppo
## 6 California Pizza Kitchen
You can then use the geom_text() function to add text labels to those restaurants having low and high studentized residuals.
This is done using a logical expression to choose rows. The xlim() and ylim() functions change the limits of the plot so that the labels fit.
ggplot(model.df[abs(model.df$.stdresid) >= 2, ], aes(x = Rating, y = .stdresid,
label = Name)) + geom_point(col = "red") + geom_hline(yintercept = 0) +
geom_hline(yintercept = -2, lty = "dotted") + geom_hline(yintercept = 2,
lty = "dotted") + geom_text(angle = 45) + xlab("Zagat Rating") + ylab("Studentized Residuals") +
xlim(15, 30) + ylim(-3.5, 4)
Regression models make a series of assumptions.
Regression modeling is statistical control.
Outliers can distort regression results or can be interesting on their own.