After today you will know how to assess the adequacy of a linear regression model. We will look at three examples.
Problem Set #4 is posted.
RStudio/R for presentations. Lecture14P.Rpres
Although the regression model is the best-fitting line through the data, you still need to check whether the model is adequate.
You are making four assumptions when fitting a linear regression model to your data: linearity, equal variance, normality, and independence. The assumptions concern the behavior of the residuals, or equivalently the distribution of Y (response variable) conditional on X (explanatory variable).
To the extent these assumptions are valid, the model is said to be adequate for representing your data. A model can be significant, but not adequate.
The first three assumptions are best examined using graphs.
Note: You don't prove an assumption. You check for evidence to reject it.
Consider the cars data frame.
head(cars)
## speed dist
## 1 4 2
## 2 4 10
## 3 7 4
## 4 7 22
## 5 8 16
## 6 9 10
The data give the speed of cars (mph) and the distances (ft) taken to stop. The data were recorded in the 1920s.
cor(cars$speed, cars$dist)
## [1] 0.8069
require(ggplot2)
## Loading required package: ggplot2
## Warning: package 'ggplot2' was built under R version 3.0.2
ggplot(cars, aes(x = speed, y = dist)) + geom_point() + geom_smooth(method = lm,
se = FALSE) + xlab("Speed (mph)") + ylab("Braking Distance (ft)")
The scatter plot shows that there appears to be a linear relationship between braking distance and the forward speed of the car for this sample of data.
The relationship is positive, indicating that as speed increases so does the average braking distance. The relationship also seems to be quite strong (r = 0.81).
model = lm(dist ~ speed, data = cars)
summary(model)
##
## Call:
## lm(formula = dist ~ speed, data = cars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -29.07 -9.53 -2.27 9.21 43.20
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -17.579 6.758 -2.60 0.012 *
## speed 3.932 0.416 9.46 1.5e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 15.4 on 48 degrees of freedom
## Multiple R-squared: 0.651, Adjusted R-squared: 0.644
## F-statistic: 89.6 on 1 and 48 DF, p-value: 1.49e-12
The model indicates a statistically significant relationship (p-value less than 0.001) between braking distance and speed. In fact, 65% of the variation in average braking distance is associated with differences in speed.
The statistically significant effect suggests it is unlikely that, in the population of cars at this time, there is no relationship between speed and braking distance.
The model looks pretty good, so you write up your results and you are done. Hold on. Let's take a closer look.
There tend to be more values of braking distance below the line over the range of speeds between 10 and 20 mph. Also, the spread of the residuals appears to get larger as speed increases: there is more variation in the response at larger values of speed.
A way to see this is to cut the explanatory variable into non-overlapping groups and display the set of corresponding response values as a box plot.
ggplot(cars, aes(x = cut(speed, breaks = 5), y = dist)) + geom_boxplot() + xlab("Speed (mph)") +
ylab("Break Distance (ft)")
With the breaks argument set to a value of 5, the cut() function divides the range of speed into five equal-width intervals.
cut(cars$speed, breaks = 5)
## [1] (3.98,8.19] (3.98,8.19] (3.98,8.19] (3.98,8.19] (3.98,8.19]
## [6] (8.19,12.4] (8.19,12.4] (8.19,12.4] (8.19,12.4] (8.19,12.4]
## [11] (8.19,12.4] (8.19,12.4] (8.19,12.4] (8.19,12.4] (8.19,12.4]
## [16] (12.4,16.6] (12.4,16.6] (12.4,16.6] (12.4,16.6] (12.4,16.6]
## [21] (12.4,16.6] (12.4,16.6] (12.4,16.6] (12.4,16.6] (12.4,16.6]
## [26] (12.4,16.6] (12.4,16.6] (12.4,16.6] (16.6,20.8] (16.6,20.8]
## [31] (16.6,20.8] (16.6,20.8] (16.6,20.8] (16.6,20.8] (16.6,20.8]
## [36] (16.6,20.8] (16.6,20.8] (16.6,20.8] (16.6,20.8] (16.6,20.8]
## [41] (16.6,20.8] (16.6,20.8] (16.6,20.8] (20.8,25] (20.8,25]
## [46] (20.8,25] (20.8,25] (20.8,25] (20.8,25] (20.8,25]
## Levels: (3.98,8.19] (8.19,12.4] (12.4,16.6] (16.6,20.8] (20.8,25]
Here you see evidence against the assumption of linearity.
Further, you see that the box sizes increase: the interquartile range (IQR) of braking distance is larger for faster-moving cars. This is evidence against the assumption of constant variance.
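A quick numeric complement to the box plot is to compute the spread of braking distance within each speed group. This is a sketch using base R's tapply(); the five groups match the box plot above.
tapply(cars$dist, cut(cars$speed, breaks = 5), sd)  # SD of stopping distance per speed bin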
What about the assumption of normality? The normality assumption states that the conditional distribution of the residuals at each x_i is adequately described by a normal distribution, but with sample data there usually are not enough observations at each value of X to check this directly.
Instead, in practice, it is common to examine the distribution of all the residuals taken together.
The residuals are obtained by using the residuals() extractor function.
res = residuals(model)
res
## 1 2 3 4 5 6 7 8
## 3.8495 11.8495 -5.9478 12.0522 2.1198 -7.8126 -3.7450 4.2550
## 9 10 11 12 13 14 15 16
## 12.2550 -8.6774 2.3226 -15.6098 -9.6098 -5.6098 -1.6098 -7.5422
## 17 18 19 20 21 22 23 24
## 0.4578 0.4578 12.4578 -11.4746 -1.4746 22.5254 42.5254 -21.4070
## 25 26 27 28 29 30 31 32
## -15.4070 12.5930 -13.3394 -5.3394 -17.2719 -9.2719 0.7281 -11.2043
## 33 34 35 36 37 38 39 40
## 2.7957 22.7957 30.7957 -21.1367 -11.1367 10.8633 -29.0691 -13.0691
## 41 42 43 44 45 46 47 48
## -9.0691 -5.0691 2.9309 -2.9339 -18.8663 -6.7987 15.2013 16.2013
## 49 50
## 43.2013 4.2689
There are several extractor functions that output information from the model as vectors.
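For example (a sketch of a few of them; each returns a vector computed from the fitted model):
coef(model)    # the intercept and slope estimates
fitted(model)  # the fitted values, one per observation
resid(model)   # the residuals, a shorter alias for residuals(model)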
The fortify() function from the ggplot2 package makes a data frame from model object outputs.
model.df = fortify(model)
head(model.df)
## dist speed .hat .sigma .cooksd .fitted .resid .stdresid
## 1 2 4 0.11486 15.53 0.0045923 -1.849 3.849 0.2660
## 2 10 4 0.11486 15.43 0.0435140 -1.849 11.849 0.8189
## 3 4 7 0.07150 15.52 0.0062024 9.948 -5.948 -0.4013
## 4 22 7 0.07150 15.43 0.0254673 9.948 12.052 0.8133
## 5 16 8 0.05997 15.54 0.0006447 13.880 2.120 0.1422
## 6 10 9 0.04990 15.50 0.0071320 17.813 -7.813 -0.5212
A look at the residuals shows that some are positive and some are negative. A histogram of the model's residuals is obtained by typing
ggplot(model.df, aes(.resid)) + geom_histogram(binwidth = 10)
You see that the right tail of the histogram is longer than the left tail. With a normal distribution the tails are symmetric. Thus the assumption of normality is in question.
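A quick numeric check of that skewness compares the mean and median of the residuals; with a longer right tail the mean exceeds the median. A sketch:
mean(res)    # essentially zero by construction (least squares with an intercept)
median(res)  # negative (the summary above shows -2.27), so mean > median: right skew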
Since departures from normality can occur simply because of sampling variation, the question arises as to whether that apparent skewness in the residuals is significant.
An approach to visualizing the expected variation from the reference distribution is to superimpose a confidence band (envelope) on the density plot.
The sm.density() function from the sm package is a quick way to plot the confidence envelope. The argument model = "Normal" is included.
require(sm)
## Loading required package: sm
## Warning: package 'sm' was built under R version 3.0.2
## Package `sm', version 2.2-5: type help(sm) for summary information
sm.density(res, model = "Normal")
The black curve representing the data is shifted left relative to a normal distribution and goes outside the blue band in the right tail indicating that the residuals may not be adequately described by a normal distribution.
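As an optional formal complement to the graphical check, the Shapiro-Wilk test in base R tests the null hypothesis that a sample comes from a normal distribution; a small p-value is evidence against normality.
shapiro.test(res)  # Shapiro-Wilk normality test on the model residuals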
In summary, your linear regression model is not adequate. The assumptions of linearity, equal variance, and normally distributed residuals are not reasonable for these data.
What to do? The relationship appears to be nonlinear. You can use the square root of the braking distance as the response variable.
ggplot(cars, aes(x = speed, y = sqrt(dist))) + geom_point() + geom_smooth(method = lm,
se = FALSE) + xlab("Speed (mph)") + ylab(expression(sqrt("Braking Distance (ft)")))
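A sketch of how you might refit the model on the transformed response and recheck the residuals (model2 is an illustrative name):
model2 = lm(sqrt(dist) ~ speed, data = cars)     # refit with square-root response
sm.density(residuals(model2), model = "Normal")  # recheck the normality assumption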
Let's look at another example.
The file Inc.txt contains average income versus percentage of college graduates by state. Fit a linear regression model to these data and check the model assumptions. Does the linear model appear to be adequate?
Inc = read.table("http://myweb.fsu.edu/jelsner/data/Inc.txt", header = TRUE)
head(Inc)
## State College Income
## 1 AL 20.4 20487
## 2 AK 28.1 26171
## 3 AZ 24.6 21947
## 4 AR 18.4 19479
## 5 CA 27.5 26808
## 6 CO 34.6 27573
Start with a plot of the data.
ggplot(Inc, aes(x = College, y = Income)) + geom_point() + geom_smooth(method = lm) +
xlab("Percentage of College Graduates") + ylab("Average Income ($)")
Regress average income on the percentage of college graduates. Save and interpret the model.
model = lm(Income ~ College, data = Inc)
summary(model)
##
## Call:
## lm(formula = Income ~ College, data = Inc)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4136 -1345 75 1276 5121
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 10460 1615 6.48 4.3e-08 ***
## College 545 63 8.65 2.0e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2080 on 49 degrees of freedom
## Multiple R-squared: 0.604, Adjusted R-squared: 0.596
## F-statistic: 74.8 on 1 and 49 DF, p-value: 1.98e-11
The model is statistically significant, with the percentage of college graduates explaining 60% of the variation in average income by state.
For every one percentage point increase in college graduates, average income increases by $545.
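As a sketch of using the model for prediction, the predict() function returns the fitted average income for a hypothetical state (the 30% figure here is illustrative):
# 10460 + 545 * 30 = about $26,810 (the coefficients above are rounded)
predict(model, newdata = data.frame(College = 30))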
Is the model adequate?
Check linearity and equal spread.
ggplot(Inc, aes(x = cut(College, 4), y = Income)) + geom_boxplot() + xlab("Percentage of College Graduates") +
ylab("Average Income ($)")
It looks pretty favorable for these two assumptions.
Next look at the distribution of the residuals. Are they bell-shaped, indicative of a normal distribution?
res = residuals(model)
sm.density(res, model = "Normal")
The black curve falls within the blue envelope of a normal distribution, so we have no reason to suspect the assumption of normally distributed residuals.
Another example.
Here you use the MplsZagat.csv data set to examine “good” and “poor” restaurant values in Minneapolis, MN.
You will examine two questions:
1. What is the relationship between a restaurant's cost and its Zagat rating?
2. Given the quality of food, service, and decor (controlling for Zagat rating), where do you get the best and worst prices?
Articulating these questions in terms of explanatory and response variables, the Zagat rating is the explanatory variable and the cost of dinner is the response variable.
From Wikipedia: Zagat ratings (after Tim and Nina Zagat) are on a 30-point scale. The scale is made from ratings for food, decor, service, and cost.
Since cost is part of what you are trying to explain, there is a bit of circularity in this analysis that we will ignore for now.
Before you examine these questions, you should look at the response variable Cost.
zagat = read.csv("http://www.tc.umn.edu/~zief0002/Data/MplsZagat.csv", header = TRUE)
head(zagat)
## Name Cost Food Decor Service Rating
## 1 Cosi 12 17 12 13 14.00
## 2 Chipotle 10 19 12 15 15.33
## 3 Hard Rock Cafe 25 12 21 14 15.67
## 4 Pei Wei Asian Diner 15 18 14 16 16.00
## 5 Buca di Beppo 24 15 17 17 16.33
## 6 California Pizza Kitchen 20 19 15 16 16.67
mean(zagat$Cost)
## [1] 37
sd(zagat$Cost)
## [1] 14.24
Based on the 65-restaurant sample, the average cost of a meal is $37.
There is large variation in the cost of dinner, as indicated by the standard deviation of $14.24. Thus, going to a Zagat-rated restaurant can cost very little or it can be very expensive.
What is the relationship between a restaurant's cost and its Zagat rating? This is answered by the correlation coefficient and a scatter plot.
cor(zagat$Cost, zagat$Rating)
## [1] 0.8677
ggplot(data = zagat, aes(x = Rating, y = Cost)) + geom_point() + geom_smooth(method = "lm",
se = FALSE) + xlab("Zagat Rating") + ylab("Cost per dinner ($)")
The plot shows that there is a linear relationship between the average cost of dinner and the overall Zagat rating for restaurants in this sample.
This relationship is positive, indicating that restaurants that are more expensive tend to be the same restaurants that have high Zagat ratings.
The relationship also seems to be quite strong (r = 0.87), with few, if any, of the restaurants in the sample deviating from this linear pattern.
A linear model for the relationship is obtained by typing
model = lm(Cost ~ Rating, data = zagat)
summary(model)
##
## Call:
## lm(formula = Cost ~ Rating, data = zagat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.180 -3.792 0.089 3.326 21.971
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -54.946 6.695 -8.21 1.5e-11 ***
## Rating 4.322 0.312 13.86 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.13 on 63 degrees of freedom
## Multiple R-squared: 0.753, Adjusted R-squared: 0.749
## F-statistic: 192 on 1 and 63 DF, p-value: <2e-16
Results of the regression analysis suggest that there is also a statistically significant relationship (p < 0.001) in the population.
In fact, 75.3% of the variation in average cost is associated with differences in Zagat ratings.
The statistically significant effect suggests that it is unlikely in the population of Minneapolis restaurants that there is no relationship between price and rating.
Not only is the effect significant, but it is also large. In fact, each one point increase in Zagat Rating is associated with an increase of $4.32 in the average cost of a meal.
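As a sketch, you can turn the fitted equation into a prediction; a hypothetical rating of 20 is used here for illustration:
# -54.946 + 4.322 * 20 = about $31.49 for dinner
predict(model, newdata = data.frame(Rating = 20))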
From the small standard error for the slope (0.312) you can deduce that the effect is quite precise. This leads to a narrow confidence interval for the slope estimate.
confint(model)
## 2.5 % 97.5 %
## (Intercept) -68.324 -41.567
## Rating 3.699 4.946
What about model adequacy? Model assumptions are evaluated from evidence in the sample. The assumptions are linearity, equal variance, and residuals that can be described by a normal distribution.
The linearity and constant variance assumptions are checked using a box plot of cost across binned ratings.
ggplot(zagat, aes(x = cut(Rating, breaks = quantile(Rating)), y = Cost)) + geom_boxplot() +
xlab("Zagat Rating") + ylab("Cost per dinner ($)")
With the breaks argument set to a single value, the cut() function divides the range of Rating into equal-width intervals; in that case each bin usually contains a different number of points.
Here, instead, you specify the breaks using the quantile() function. By default the breaks are made at the quartiles, so you end up with four intervals, each containing (roughly) the same number of points.
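One subtlety: cut() makes intervals that are open on the left, so the lowest-rated restaurant falls exactly on the first quantile break and is returned as NA unless you add include.lowest = TRUE. A sketch of a check that counts the points per bin:
table(cut(zagat$Rating, breaks = quantile(zagat$Rating), include.lowest = TRUE))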
Another plot is the local regression smoother.
scatter.smooth(zagat$Cost ~ zagat$Rating)
abline(model, col = "red")
No evidence to reject the assumptions of linearity and constant variance.
Normality assumption.
res = residuals(model)
sm.density(res, xlab = "Model Residuals", model = "Normal")
Although the residuals deviate from a perfectly normal distribution, the deviation is not more than you would expect from sampling variation alone.
Another diagnostic is the quantile-quantile (Q-Q) plot, a graph of the quantiles of one distribution against the quantiles of another. If the distributions have similar shapes, the points fall roughly along a straight line.
To check the normality assumption of a regression model you want to compare the quantiles of the residuals against the quantiles of a normal distribution. You do that with the qqnorm() function.
qqnorm(res)
qqline(res)
Departures from normality are seen as a systematic departure of the points from a straight line on the Q-Q plot.
Interpreting Q-Q plots is more a visceral than an intellectual exercise. The uninitiated are often mystified by the process. Experience is the key here.
Some evidence to suspect the assumption of normality.
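One way to build that experience is to look at Q-Q plots of samples that really are normal, to see how much wiggle sampling variation alone produces. A sketch (z is an illustrative name):
par(mfrow = c(2, 2))  # arrange four plots in a grid
for (i in 1:4) {
    z = rnorm(length(res))  # a sample that truly is normal
    qqnorm(z)
    qqline(z)
}
par(mfrow = c(1, 1))  # reset the plotting layout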
Given the quality of food, service, and decor (controlling for Zagat rating) where do you get the best/worst price?
The residuals are the error in the model: what is left over in the price of dinner after the Zagat rating is accounted for, or in statistical language, "after controlling for Zagat rating." Thus, examining the model residuals will help us answer our second question.
The above assumptions concern the conditional distribution of the residuals. The residuals for each of the observations are automatically computed when you fit a model in R.
In the code below, the fortify() function from the ggplot2 package is called on the model object to build a data frame with elements computed from the fitted model.
model.df = fortify(model)
head(model.df)
## Cost Rating .hat .sigma .cooksd .fitted .resid .stdresid
## 1 12 14.00 0.11647 7.139 0.0606258 5.568 6.4317 0.9590
## 2 10 15.33 0.08280 7.190 0.0017140 11.332 -1.3315 -0.1949
## 3 25 15.67 0.07545 7.008 0.1296197 12.772 12.2277 1.7824
## 4 15 16.00 0.06851 7.191 0.0004802 14.213 0.7869 0.1143
## 5 24 16.33 0.06201 7.108 0.0482201 15.654 8.3461 1.2078
## 6 20 16.67 0.05593 7.182 0.0052023 17.095 2.9053 0.4191
The variables used in the model are in the first two columns, and information about the regression specific to each case (observation) is given in the next six columns.
The first three of those columns (.hat, .sigma, .cooksd) contain information that allows you to assess how influential each observation is on the model. Loosely, if that particular observation were removed and the regression model refit without it, how much difference would it make? If it makes a lot of difference, then the observation is influential.
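As a sketch of quantifying influence directly (cd is an illustrative name), Cook's distance from base R reproduces the .cooksd column above:
cd = cooks.distance(model)  # one value per observation; large values = influential
zagat$Name[which.max(cd)]   # the restaurant with the largest influence on the fit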
ggplot(model.df, aes(x = Rating, y = .resid)) + geom_point() + geom_hline(yintercept = 0) +
xlab("Zagat Rating") + ylab("Model Residuals")
ggplot(model.df, aes(x = Rating, y = .stdresid)) + geom_point() + geom_hline(yintercept = 0) +
geom_hline(yintercept = -2, lty = "dotted") + geom_hline(yintercept = 2,
lty = "dotted") + xlab("Zagat Rating") + ylab("Studentized Residuals")
Looking at the residual plot, you can identify several observations that have large positive residuals (over-priced restaurants) and others that have large negative residuals (under-priced restaurants) after you “control for” Zagat rating.
You can look in the data to find out which restaurants these are, but using indexing you can locate these restaurants on the plot more easily.
You first add the restaurant names from the zagat data frame to the model.df data frame.
model.df$Name = zagat$Name
head(model.df)
## Cost Rating .hat .sigma .cooksd .fitted .resid .stdresid
## 1 12 14.00 0.11647 7.139 0.0606258 5.568 6.4317 0.9590
## 2 10 15.33 0.08280 7.190 0.0017140 11.332 -1.3315 -0.1949
## 3 25 15.67 0.07545 7.008 0.1296197 12.772 12.2277 1.7824
## 4 15 16.00 0.06851 7.191 0.0004802 14.213 0.7869 0.1143
## 5 24 16.33 0.06201 7.108 0.0482201 15.654 8.3461 1.2078
## 6 20 16.67 0.05593 7.182 0.0052023 17.095 2.9053 0.4191
## Name
## 1 Cosi
## 2 Chipotle
## 3 Hard Rock Cafe
## 4 Pei Wei Asian Diner
## 5 Buca di Beppo
## 6 California Pizza Kitchen
You can then use the geom_text() function to add text labels to those restaurants having low and high studentized residuals.
This is done using a logical expression to choose rows. The xlim() and ylim() functions change the limits of the plot so that the labels fit.
ggplot(model.df[abs(model.df$.stdresid) >= 2, ], aes(x = Rating, y = .stdresid,
label = Name)) + geom_point(col = "red") + geom_hline(yintercept = 0) +
geom_hline(yintercept = -2, lty = "dotted") + geom_hline(yintercept = 2,
lty = "dotted") + geom_text(angle = 45) + xlab("Zagat Rating") + ylab("Studentized Residuals") +
xlim(15, 30) + ylim(-3.5, 4)
Regression models make a series of assumptions.
Regression modeling is statistical control.
Outliers can distort regression results or can be interesting on their own.