KITADA

Lesson #22

Assessing Conditions in a Multiple Linear Regression Analysis

Motivation:

In Lesson 21, we learned how to do a multiple linear regression analysis. Any conclusions from that analysis to a population of interest (all countries) will only be valid if certain conditions exist. In this lesson, we will discuss what those conditions are and how to assess them.

What you need to know from this lesson:

After completing this lesson, you should be able to

explain why it is not a good idea to have a multiple linear regression model that contains two (or more) highly correlated explanatory variables
- graphically and numerically assess whether two explanatory variables are highly correlated with each other
- make a decision about what to do with highly correlated explanatory variables
assess whether outliers exist and decide what to do if outliers do exist
explain what the “linear” part in Multiple Linear Regression implies
- assess whether the relationship between each explanatory variable and the response variable is linear
graphically assess the constant variation and normality - answer questions about a multiple linear regression analysis performed on transformed data

To accomplish the above “What You Need to Know”, do the following:

1. Attend lecture and answer the questions on the following pages of this lesson.
2. Read Section 10.2 in the text
3. Do the Lesson 22 questions at the end of the lesson notes

The Lesson

The California Precipitation Example

The following are data collected at 30 weather stations in California. Can average annual precipitation (in inches) be predicted from altitude (in feet above sea level), latitude (in degrees north of equator), and/or distance from the coast (in miles) at all locations in the state of California?

CALIFORNIA

##             City Precipitation Altitude Latitude Distance
## 1   Eureka               39.57       43     40.8        1
## 2  Red Bluff             23.27      341     40.2       97
## 3   Thermal               3.00     -120     33.6       70
## 4  Fort Bragg            37.48       74     39.4        1
## 5  Soda Springs          49.26     6752     39.3      150
## 6  San Francisco         21.82       52     37.8        5
## 7   Sacramento           18.07       25     38.5       80
## 8  San Jose              14.17       95     37.4       28
## 9  Giant Forest          42.63     6360     36.6      145
## 10  Salinas              13.85       74     36.7       12
## 11  Fresno                9.44      331     36.7      114
## 12 Pt Piedras            19.33       57     35.7        1
## 13 Pasa Robles           15.67      740     35.7       31
## 14  Bakersfield           6.00      489     35.4       75
## 15  Bishop                5.73     4108     37.3      198
## 16  Mineral              47.82     4850     40.4      142
## 17 Santa Barbara         17.95      120     34.4        1
## 18  Susanville           18.20     4152     40.3      198
## 19 Tule Lake             10.03     4036     41.9      140
## 20  Needles               4.63      913     34.8      192
## 21  Burbank              14.74      699     34.2       47
## 22 Los Angeles           15.02      312     34.1       16
## 23 Long Beach            12.36       50     33.8       12
## 24 Los Banos              8.26      125     37.8       74
## 25  Blythe                4.05      268     33.6      155
## 26 San Diego              9.94       19     32.7        5
## 27  Daggett               4.25     2105     34.1       85
## 28 Death Valley           1.66     -178     36.5      194
## 29 Crescent City         74.87       35     41.7        1
## 30  Colusa               15.95       60     39.2       91

Step 1: Is the sample representative of the population?

If these locations weren’t randomly selected, do you feel they are still representative of all locations around the state of California in terms of location and precipitation?

No, if the locations aren't randomly selected we aren't as comfortable that the sample is representative of all lovations around the state. It is possible to systematically select a sample (ie not randomly selected) such that the locations are representative of the diverse regions of California. Similarly, having a random sample of locations does not guarantee that the sample is representative.

Step 2: Check for high correlations between explanatory variables:

*Starting point: check *

1) scatter plots between each pair of explanatory variables,
2) table of correlation coefficients

Scatter plot matrix:

### PAIR PLOTS ###
## NOTE: [,-1] is used to remove the first column of names
pairs(CALIFORNIA[,-1])

plot of chunk unnamed-chunk-3

Table of correlation coefficients:

### PAIRWISE CORRELATIONS ###
cor(CALIFORNIA[,-1])

##               Precipitation  Altitude  Latitude   Distance
## Precipitation     1.0000000 0.3317042 0.6052561 -0.2027259
## Altitude          0.3317042 1.0000000 0.3245005  0.5972514
## Latitude          0.6052561 0.3245005 1.0000000  0.1611690
## Distance         -0.2027259 0.5972514 0.1611690  1.0000000

There is no strong correlation between any pair of explanatory variables.

Other notes:

1.The above only checks correlations between pairs of explanatory variables. To check to see if an explanatory variable is highly correlated with all other explanatory variables, run a multiple linear regression analysis with that explanatory variable as the “response” variable and all other explanatory variables as the “explanatory” variables. If R2 is high, it says that all of the other explanatory variables are explaining a high amount of the variation in the “response” variable (i.e. the explanatory variable in question), which implies that the explanatory variable in question is highly correlated with the other explanatory variables.

Run a multiple regression with each explanatory variable serving as the “response” variable to see if each is highly correlated with the others. Here are the R2 values for each explanatory variable versus the other explanatory variables:
- Altitude versus others: R² = 41.0%
- Distance from coast versus others: R² = 35.8%
- Latitude versus others: R² = 10.7%
What is considered a “high” R2 value is subjective but we might get concerned that an explanatory variable is highly correlated with the other explanatory variables if the R2 from its regression on the other explanatory variables is above 75% or so.

2.If one explanatory variable is highly correlated with all other explanatory variables, it should be removed from the analysis.

3.If several explanatory variables seem highly correlated with each other, it is up to you to decide which one should be removed.

Step 3: Check for outliers that could be influencing the results:

Residual Plot versus each explanatory variable
Residual Plot versus Predicted Values
Scatterplot matrix

### RESIDUAL PLOTS ###
fullmod<-with(CALIFORNIA, lm(Precipitation~ Altitude+Latitude+Distance))
altmod<-with(CALIFORNIA, lm(Precipitation~ Altitude))
latmod<-with(CALIFORNIA, lm(Precipitation~ Latitude))
distmod<-with(CALIFORNIA, lm(Precipitation~ Distance))

par(mfrow=c(2,2))
plot(fitted(fullmod), resid(fullmod),
     xlab="Fitted Values",
     ylab="Residuals")
abline(h=0, lwd=2, lty=2, col="blue")
with(CALIFORNIA, plot(Latitude, resid(latmod),
                      ylab="Residuals"))
abline(h=0, lwd=2, lty=2, col="blue")
with(CALIFORNIA, plot(Distance, resid(distmod),
                      ylab="Residuals"))
abline(h=0, lwd=2, lty=2, col="blue")
with(CALIFORNIA, plot(Altitude, resid(altmod),
                      ylab="Residuals"))
abline(h=0, lwd=2, lty=2, col="blue")

plot of chunk unnamed-chunk-5

There appears to be a couple of unusual points: Crescent City with nearly 75 inches of rain a year and Tule Lake, which only gets 10 inches of rain a year despite its altitude (see residual plots versus each explanatory variable). These two outliers can also be seen in the residual plot versus predicted values. Crescent City is on the coast in northern California. Tule Lake lies in northeast California near the Oregon border. An argument could be made that both Crescent City and Tule Lake belong to a different population than the rest of the cities in this sample in the sense that their weather is more like Oregon and Washington than the rest of California. Perhaps the weather patterns of these two cities follow the weather pattern of the Pacific Northwest more so than the rest of the state of California and would better fit a Pacific Northwest model than a California model. Although it’s arguable, I will eliminate them from the analysis because of this. I will take them out now and assess the rest of the conditions without Crescent City and Tule Lake. (Note: by removing these cities, we’re redefining the population as locations in California below a certain latitude, such as below 41 degrees north of the equator).

Step 4: Check the conditions that must be met for conclusions from a multiple linear regression analysis to be valid to the population of interest:

1) linearity condition:

The response variable must be linearly related to EACH explanatory variable

Assess with:

Scatterplot of response variable versus each explanatory variable
Residual Plot: residuals versus predicted values

### PRECIPITATION PAIRWISE ###
par(mfrow=c(2,2))
with(CALIFORNIA, plot(Latitude, Precipitation,
                      main="Scatterplot: PPT vs LAT",
                      xlab="Latitude",
                      ylab="Precipitation"))
abline(coefficients(latmod), lwd=2, lty=2, col="red")

with(CALIFORNIA, plot(Distance, Precipitation,
                      main="Scatterplot: PPT vs DIST",
                      xlab="Distance",
                      ylab="Precipitation"))
abline(coefficients(distmod), lwd=2, lty=2, col="red")

with(CALIFORNIA, plot(Altitude, Precipitation,
                      main="Scatterplot: PPT vs ALT",
                      xlab="Altitude",
                      ylab="Precipitation"))
abline(coefficients(altmod), lwd=2, lty=2, col="red")

plot(fitted(fullmod), resid(fullmod),
     xlab="Fitted Values",
     ylab="Residuals")
abline(h=0, lwd=2, lty=2, col="blue")

plot of chunk unnamed-chunk-6

Do any of the scatterplots indicate a non-linear relationship exists between precipitation and one or more of the explanatory variables? If so, which explanatory variable(s) is(are) non-linearly related to precipitation?

There might be some curvature between distance vs precipitation.

Comments:

In a multiple linear regression model, the explanatory variables “interact” with each other. Therefore, it’s possible that the scatterplots of each explanatory variable versus the response variable appear linear, but when all are included in the model together, the residual plot versus the predicted values could show curvature. When this happens, try a transformation of the response variable.
If a scatterplot of the response variable versus one of the explanatory variables showed curvature but the scatterplots of the response variable versus the other explanatory variables appeared to be linear, transforming the explanatory variable that was not linearly related to the response variable may help.
If all scatterplots showed curvature, try transforming the response variable first.
In the scatterplot of precipitation versus altitude, most cities have low altitudes, so there is a “clump” of data between 0 and 500 feet above sea level. A transformation might remove some of the clumping. Another reason to consider a transformation is the ratio of the largest to smallest observation is greater than 20. If it is decided to do a log transformation of altitude, we have to remember that we cannot take a log of a negative number! (Note that Death Valley and Thermal have elevations below sea level!) Therefore, to make all numbers positive, a certain amount has to be added to each city’s altitude, such as 200 feet. I attempted a log transformation on altitude to see if it helped remove the clumping so we could better determine the nature of the relationship between precipitation and altitude. It helped a little but (jumping ahead a little), the other conditions weren’t met as well as when altitude wasn’t transformed. Therefore, I decided not to transform altitude.
In the scatterplot of precipitation versus distance from coast, there seems to be a negative trend between precipitation and distance from the coast. There are a few unusual locations about 150 miles inland that make the scatter plot appear non-linear. (These are most likely locations in the mountains.) A transformation of distance from the coast did not help improve any of the conditions (again, jumping ahead) and, therefore, was not done.

2) Constant Variation condition: The variation of the response variable is similar for all values of each explanatory variable. If this is true, the residuals will have similar spread for all values of each explanatory variable.

Assess with:

Scatterplot of response variable versus each explanatory variable
- the spread in the values of the response variable should be similar for all values of each explanatory variable
Residual Plots: residuals versus each explanatory variable
- the spread of the residuals should be similar for all values of each explanatory variable
Residual Plot: residuals versus predicted values
- the spread of the residuals should be similar for all predicated values

### REMOVE TULE LAKE AND CRESCENT CITY ###
library(dplyr)

## Warning: package 'dplyr' was built under R version 3.2.5

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

no_out<-CALIFORNIA[-c(19,29),]
no_out

##             City Precipitation Altitude Latitude Distance
## 1   Eureka               39.57       43     40.8        1
## 2  Red Bluff             23.27      341     40.2       97
## 3   Thermal               3.00     -120     33.6       70
## 4  Fort Bragg            37.48       74     39.4        1
## 5  Soda Springs          49.26     6752     39.3      150
## 6  San Francisco         21.82       52     37.8        5
## 7   Sacramento           18.07       25     38.5       80
## 8  San Jose              14.17       95     37.4       28
## 9  Giant Forest          42.63     6360     36.6      145
## 10  Salinas              13.85       74     36.7       12
## 11  Fresno                9.44      331     36.7      114
## 12 Pt Piedras            19.33       57     35.7        1
## 13 Pasa Robles           15.67      740     35.7       31
## 14  Bakersfield           6.00      489     35.4       75
## 15  Bishop                5.73     4108     37.3      198
## 16  Mineral              47.82     4850     40.4      142
## 17 Santa Barbara         17.95      120     34.4        1
## 18  Susanville           18.20     4152     40.3      198
## 20  Needles               4.63      913     34.8      192
## 21  Burbank              14.74      699     34.2       47
## 22 Los Angeles           15.02      312     34.1       16
## 23 Long Beach            12.36       50     33.8       12
## 24 Los Banos              8.26      125     37.8       74
## 25  Blythe                4.05      268     33.6      155
## 26 San Diego              9.94       19     32.7        5
## 27  Daggett               4.25     2105     34.1       85
## 28 Death Valley           1.66     -178     36.5      194
## 30  Colusa               15.95       60     39.2       91

fullmod_no<-with(no_out, 
                 lm(Precipitation~ Altitude+Latitude+Distance))
altmod_no<-with(no_out, 
                lm(Precipitation~ Altitude))
latmod_no<-with(no_out, 
                lm(Precipitation~ Latitude))
distmod_no<-with(no_out, 
                 lm(Precipitation~ Distance))


### PRECIPITATION PAIRWISE (NO OUTLIER) ###
pairs(no_out[,-1])

plot of chunk unnamed-chunk-7

#par(mfrow=c(2,2))
#with(no_out, plot(Latitude, Precipitation,
#                      main="Scatterplot: PPT vs LAT",
#                      xlab="Latitude",
#                      ylab="Precipitation"))
#abline(coefficients(latmod_no), lwd=2, lty=2, col="red")

#with(no_out, plot(Distance, Precipitation,
#                      main="Scatterplot: PPT vs DIST",
#                      xlab="Distance",
#                      ylab="Precipitation"))
#abline(coefficients(distmod_no), lwd=2, lty=2, col="red")

#with(no_out, plot(Altitude, Precipitation,
#                      main="Scatterplot: PPT vs ALT",
#                      xlab="Altitude",
#                      ylab="Precipitation"))
#abline(coefficients(altmod_no), lwd=2, lty=2, col="red")

#plot(fitted(fullmod_no), resid(fullmod_no),
#     main="Residual Plot",
#     xlab="Fitted Values",
#     ylab="Residuals")
#abline(h=0, lwd=2, lty=2, col="blue")

### RESIDUAL PLOTS ###
par(mfrow=c(2,2))
plot(fitted(fullmod_no), resid(fullmod_no),
     xlab="Fitted Values",
     ylab="Residuals (FULL- NO OUTLIER)")
abline(h=0, lwd=2, lty=2, col="blue")
with(no_out, plot(Latitude, resid(fullmod_no),
                  ylab="Residuals (FULL- NO OUTLIER)"))
abline(h=0, lwd=2, lty=2, col="blue")
with(no_out, plot(Distance, resid(fullmod_no),
                  ylab="Residuals (FULL- NO OUTLIER)"))
abline(h=0, lwd=2, lty=2, col="blue")
with(no_out, plot(Altitude, resid(fullmod_no),
                  ylab="Residuals (FULL- NO OUTLIER)"))
abline(h=0, lwd=2, lty=2, col="blue")

plot of chunk unnamed-chunk-7

Do you feel that the constant variation condition is not met? If so, what should be done?

It appears that there might be some non-constant variation issues. We might want to consider transformations.

3) Normality condition:

The residuals must be normally distributed.

Assess with:

normal probability plot of the residuals

### QQ PLOT ###
par(mfrow=c(1,1))
qqnorm(resid(fullmod_no))
qqline(resid(fullmod_no))

plot of chunk unnamed-chunk-8

Do you feel that the residuals are normally distributed? If not, what should be done?

Some deviations from normality in the tails.

Summary of step 4: Even though the normality condition seems met, a natural log transformation of precipitation will be performed because the residual plot versus predicted values indicates some curvature and non-constant variation of the residuals.

After performing a transformation on precipitation, we must repeat Step 4:

1) Linearity condition:

Scatterplots and the residual plot (residuals versus predicted values) are given below:

### LN TRANSFORM PRECIP ###
no_out_ln<-no_out%>%
  mutate(LN_PRECIP = log(Precipitation))

fullmod_ln<-with(no_out_ln, 
                 lm(LN_PRECIP~ Altitude+Latitude+Distance))
altmod_ln<-with(no_out_ln, 
                lm(LN_PRECIP~ Altitude))
latmod_ln<-with(no_out_ln, 
                lm(LN_PRECIP~ Latitude))
distmod_ln<-with(no_out_ln, 
                 lm(LN_PRECIP~ Distance))


### PRECIPITATION PAIRWISE (NO OUTLIER) ###
par(mfrow=c(2,2))
with(no_out_ln, plot(Latitude, LN_PRECIP,
                  main="Scatterplot: LN(PPT) vs LAT",
                  xlab="Latitude",
                  ylab="LN(Precipitation)"))
abline(coefficients(latmod_ln), lwd=2, lty=2, col="red")

with(no_out_ln, plot(Distance, LN_PRECIP,
                  main="Scatterplot: LN(PPT) vs DIST",
                  xlab="Distance",
                  ylab="LN(Precipitation)"))
abline(coefficients(distmod_ln), lwd=2, lty=2, col="red")

with(no_out_ln, plot(Altitude, LN_PRECIP,
                  main="Scatterplot: LN(PPT) vs ALT",
                  xlab="Altitude",
                  ylab="LN(Precipitation)"))
abline(coefficients(altmod_ln), lwd=2, lty=2, col="red")

plot(fitted(fullmod_ln), resid(fullmod_ln),
     main="Residual Plot",
     xlab="Fitted Values",
     ylab="Residuals")
abline(h=0, lwd=2, lty=2, col="blue")

plot of chunk unnamed-chunk-9

Is there any curvature in the residual plot of the residuals versus the predicted values? Therefore, does the linearity condition seem better met? If not, which variable(s) is(are) not linearly related to ln(precipitation)?

It looks like there is some curvature between the ln(precipitation) and altitude.

Comment:

There is still “clumping” in the scatterplot of ln(precipitation) versus altitude. A log transformation of altitude may help this. More on this later.

2 and 3) constant variation and normality conditions:

Plots are given on the next page

### TRANSFORMED MOD ###
par(mfrow=c(1,2))
plot(fitted(fullmod_ln), resid(fullmod_ln),
     main="Residual Plot",
     xlab="Fitted Values",
     ylab="Residuals")
abline(h=0, lwd=2, lty=2, col="blue")

qqnorm(resid(fullmod_ln))
qqline(resid(fullmod_ln))

plot of chunk unnamed-chunk-10

Is the constant variation condition better met after transforming precipitation?

It appears that as the fitted values increase the magnitude of the residuals increase.

The normality condition was met on the original scale. Is it still met after transforming precipitation?

There are some deviations in the upper tail but the rest if very close to the theoretic line of normality.

Comment:

There could be an argument that the normality condition is not as well met after transforming precipitation, but the gain in improvement in the residual plot offsets the loss in the normal probability plot.

If the linearity, constant variation, and/or normality conditions were not satisfied after performing a transformation on precipitation, what should be done? Do we have to do that in this example?

We could try to transform the explantory variables.

Should “altitude” be transformed?

In the scatterplot of precipitation versus altitude, there was a lot of “clumping” at lower altitudes. This makes sense because many of the locations in the sample were at lower altitudes with only a half dozen or so at higher altitudes. In addition, there was a lot of variation in precipitation for these locations which made it hard to assess both the linearity condition and constant variation condition. An option would be to do a log transformation on altitude to perhaps remove the clumping. But, remember that some altitudes were below sea level and logarithms can only be taken on positive values. So, we need to add the same amount to all altitudes to make them all positive. I added 200 feet to all altitudes and took the natural log of altitude to illustrate how this works. The scatter plot of ln(precipitation) versus ln(altitude + 200 feet) is shown to the right. Note that it did remove some of the clumping. However, I decided NOT to transform altitude because the constant variation was violated more so than without transforming altitude (graph not shown).

### LN TRANSFORM ALTITUDE ###
no_out_ln<-no_out%>%
  mutate(LN_PRECIP = log(Precipitation),
         LN_ALT200 = log(Altitude+200))

with(no_out_ln,plot(LN_ALT200, LN_PRECIP))

plot of chunk unnamed-chunk-11

Steps 5 through 7:

The following output is from the data with 28 observations after performing natural log transformation of precipitation:

summary(fullmod_ln)

## 
## Call:
## lm(formula = LN_PRECIP ~ Altitude + Latitude + Distance)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.82582 -0.26341  0.07587  0.27828  0.55447 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -4.002e+00  1.156e+00  -3.462  0.00202 ** 
## Altitude     2.883e-04  4.528e-05   6.366 1.39e-06 ***
## Latitude     1.915e-01  3.181e-02   6.019 3.25e-06 ***
## Distance    -1.014e-02  1.289e-03  -7.868 4.23e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3791 on 24 degrees of freedom
## Multiple R-squared:  0.829,  Adjusted R-squared:  0.8076 
## F-statistic: 38.77 on 3 and 24 DF,  p-value: 2.317e-09

anova(fullmod_ln)

## Analysis of Variance Table
## 
## Response: LN_PRECIP
##           Df Sum Sq Mean Sq F value    Pr(>F)    
## Altitude   1 2.9817  2.9817  20.745 0.0001289 ***
## Latitude   1 4.8402  4.8402  33.676 5.540e-06 ***
## Distance   1 8.8971  8.8971  61.902 4.233e-08 ***
## Residuals 24 3.4495  0.1437                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1