KITADA
Lesson #22
Motivation:
In Lesson 21, we learned how to do a multiple linear regression analysis. Any conclusions from that analysis to a population of interest (all countries) will only be valid if certain conditions exist. In this lesson, we will discuss what those conditions are and how to assess them.
What you need to know from this lesson:
After completing this lesson, you should be able to
To accomplish the above “What You Need to Know”, do the following:
The Lesson
The California Precipitation Example
The following are data collected at 30 weather stations in California. Can average annual precipitation (in inches) be predicted from altitude (in feet above sea level), latitude (in degrees north of equator), and/or distance from the coast (in miles) at all locations in the state of California?
CALIFORNIA
## City Precipitation Altitude Latitude Distance
## 1 Eureka 39.57 43 40.8 1
## 2 Red Bluff 23.27 341 40.2 97
## 3 Thermal 3.00 -120 33.6 70
## 4 Fort Bragg 37.48 74 39.4 1
## 5 Soda Springs 49.26 6752 39.3 150
## 6 San Francisco 21.82 52 37.8 5
## 7 Sacramento 18.07 25 38.5 80
## 8 San Jose 14.17 95 37.4 28
## 9 Giant Forest 42.63 6360 36.6 145
## 10 Salinas 13.85 74 36.7 12
## 11 Fresno 9.44 331 36.7 114
## 12 Pt Piedras 19.33 57 35.7 1
## 13 Pasa Robles 15.67 740 35.7 31
## 14 Bakersfield 6.00 489 35.4 75
## 15 Bishop 5.73 4108 37.3 198
## 16 Mineral 47.82 4850 40.4 142
## 17 Santa Barbara 17.95 120 34.4 1
## 18 Susanville 18.20 4152 40.3 198
## 19 Tule Lake 10.03 4036 41.9 140
## 20 Needles 4.63 913 34.8 192
## 21 Burbank 14.74 699 34.2 47
## 22 Los Angeles 15.02 312 34.1 16
## 23 Long Beach 12.36 50 33.8 12
## 24 Los Banos 8.26 125 37.8 74
## 25 Blythe 4.05 268 33.6 155
## 26 San Diego 9.94 19 32.7 5
## 27 Daggett 4.25 2105 34.1 85
## 28 Death Valley 1.66 -178 36.5 194
## 29 Crescent City 74.87 35 41.7 1
## 30 Colusa 15.95 60 39.2 91
Step 1: Is the sample representative of the population?
If these locations weren’t randomly selected, do you feel they are still representative of all locations around the state of California in terms of location and precipitation?
No, if the locations aren't randomly selected we aren't as comfortable that the sample is representative of all lovations around the state. It is possible to systematically select a sample (ie not randomly selected) such that the locations are representative of the diverse regions of California. Similarly, having a random sample of locations does not guarantee that the sample is representative.
Step 2: Check for high correlations between explanatory variables:
*Starting point: check *
Scatter plot matrix:
### PAIR PLOTS ###
## NOTE: [,-1] is used to remove the first column of names
pairs(CALIFORNIA[,-1])
Table of correlation coefficients:
### PAIRWISE CORRELATIONS ###
cor(CALIFORNIA[,-1])
## Precipitation Altitude Latitude Distance
## Precipitation 1.0000000 0.3317042 0.6052561 -0.2027259
## Altitude 0.3317042 1.0000000 0.3245005 0.5972514
## Latitude 0.6052561 0.3245005 1.0000000 0.1611690
## Distance -0.2027259 0.5972514 0.1611690 1.0000000
There is no strong correlation between any pair of explanatory variables.
Other notes:
1.The above only checks correlations between pairs of explanatory variables. To check to see if an explanatory variable is highly correlated with all other explanatory variables, run a multiple linear regression analysis with that explanatory variable as the “response” variable and all other explanatory variables as the “explanatory” variables. If R2 is high, it says that all of the other explanatory variables are explaining a high amount of the variation in the “response” variable (i.e. the explanatory variable in question), which implies that the explanatory variable in question is highly correlated with the other explanatory variables.
Run a multiple regression with each explanatory variable serving as the “response” variable to see if each is highly correlated with the others. Here are the R2 values for each explanatory variable versus the other explanatory variables:
What is considered a “high” R2 value is subjective but we might get concerned that an explanatory variable is highly correlated with the other explanatory variables if the R2 from its regression on the other explanatory variables is above 75% or so.
2.If one explanatory variable is highly correlated with all other explanatory variables, it should be removed from the analysis.
3.If several explanatory variables seem highly correlated with each other, it is up to you to decide which one should be removed.
Step 3: Check for outliers that could be influencing the results:
### RESIDUAL PLOTS ###
fullmod<-with(CALIFORNIA, lm(Precipitation~ Altitude+Latitude+Distance))
altmod<-with(CALIFORNIA, lm(Precipitation~ Altitude))
latmod<-with(CALIFORNIA, lm(Precipitation~ Latitude))
distmod<-with(CALIFORNIA, lm(Precipitation~ Distance))
par(mfrow=c(2,2))
plot(fitted(fullmod), resid(fullmod),
xlab="Fitted Values",
ylab="Residuals")
abline(h=0, lwd=2, lty=2, col="blue")
with(CALIFORNIA, plot(Latitude, resid(latmod),
ylab="Residuals"))
abline(h=0, lwd=2, lty=2, col="blue")
with(CALIFORNIA, plot(Distance, resid(distmod),
ylab="Residuals"))
abline(h=0, lwd=2, lty=2, col="blue")
with(CALIFORNIA, plot(Altitude, resid(altmod),
ylab="Residuals"))
abline(h=0, lwd=2, lty=2, col="blue")
There appears to be a couple of unusual points: Crescent City with nearly 75 inches of rain a year and Tule Lake, which only gets 10 inches of rain a year despite its altitude (see residual plots versus each explanatory variable). These two outliers can also be seen in the residual plot versus predicted values. Crescent City is on the coast in northern California. Tule Lake lies in northeast California near the Oregon border. An argument could be made that both Crescent City and Tule Lake belong to a different population than the rest of the cities in this sample in the sense that their weather is more like Oregon and Washington than the rest of California. Perhaps the weather patterns of these two cities follow the weather pattern of the Pacific Northwest more so than the rest of the state of California and would better fit a Pacific Northwest model than a California model. Although it’s arguable, I will eliminate them from the analysis because of this. I will take them out now and assess the rest of the conditions without Crescent City and Tule Lake. (Note: by removing these cities, we’re redefining the population as locations in California below a certain latitude, such as below 41 degrees north of the equator).
Step 4: Check the conditions that must be met for conclusions from a multiple linear regression analysis to be valid to the population of interest:
The response variable must be linearly related to EACH explanatory variable
Assess with:
### PRECIPITATION PAIRWISE ###
par(mfrow=c(2,2))
with(CALIFORNIA, plot(Latitude, Precipitation,
main="Scatterplot: PPT vs LAT",
xlab="Latitude",
ylab="Precipitation"))
abline(coefficients(latmod), lwd=2, lty=2, col="red")
with(CALIFORNIA, plot(Distance, Precipitation,
main="Scatterplot: PPT vs DIST",
xlab="Distance",
ylab="Precipitation"))
abline(coefficients(distmod), lwd=2, lty=2, col="red")
with(CALIFORNIA, plot(Altitude, Precipitation,
main="Scatterplot: PPT vs ALT",
xlab="Altitude",
ylab="Precipitation"))
abline(coefficients(altmod), lwd=2, lty=2, col="red")
plot(fitted(fullmod), resid(fullmod),
xlab="Fitted Values",
ylab="Residuals")
abline(h=0, lwd=2, lty=2, col="blue")
Do any of the scatterplots indicate a non-linear relationship exists between precipitation and one or more of the explanatory variables? If so, which explanatory variable(s) is(are) non-linearly related to precipitation?
There might be some curvature between distance vs precipitation.
Comments:
In a multiple linear regression model, the explanatory variables “interact” with each other. Therefore, it’s possible that the scatterplots of each explanatory variable versus the response variable appear linear, but when all are included in the model together, the residual plot versus the predicted values could show curvature. When this happens, try a transformation of the response variable.
If a scatterplot of the response variable versus one of the explanatory variables showed curvature but the scatterplots of the response variable versus the other explanatory variables appeared to be linear, transforming the explanatory variable that was not linearly related to the response variable may help.
If all scatterplots showed curvature, try transforming the response variable first.
In the scatterplot of precipitation versus altitude, most cities have low altitudes, so there is a “clump” of data between 0 and 500 feet above sea level. A transformation might remove some of the clumping. Another reason to consider a transformation is the ratio of the largest to smallest observation is greater than 20. If it is decided to do a log transformation of altitude, we have to remember that we cannot take a log of a negative number! (Note that Death Valley and Thermal have elevations below sea level!) Therefore, to make all numbers positive, a certain amount has to be added to each city’s altitude, such as 200 feet. I attempted a log transformation on altitude to see if it helped remove the clumping so we could better determine the nature of the relationship between precipitation and altitude. It helped a little but (jumping ahead a little), the other conditions weren’t met as well as when altitude wasn’t transformed. Therefore, I decided not to transform altitude.
In the scatterplot of precipitation versus distance from coast, there seems to be a negative trend between precipitation and distance from the coast. There are a few unusual locations about 150 miles inland that make the scatter plot appear non-linear. (These are most likely locations in the mountains.) A transformation of distance from the coast did not help improve any of the conditions (again, jumping ahead) and, therefore, was not done.
2) Constant Variation condition: The variation of the response variable is similar for all values of each explanatory variable. If this is true, the residuals will have similar spread for all values of each explanatory variable.
Assess with:
### REMOVE TULE LAKE AND CRESCENT CITY ###
library(dplyr)
## Warning: package 'dplyr' was built under R version 3.2.5
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
no_out<-CALIFORNIA[-c(19,29),]
no_out
## City Precipitation Altitude Latitude Distance
## 1 Eureka 39.57 43 40.8 1
## 2 Red Bluff 23.27 341 40.2 97
## 3 Thermal 3.00 -120 33.6 70
## 4 Fort Bragg 37.48 74 39.4 1
## 5 Soda Springs 49.26 6752 39.3 150
## 6 San Francisco 21.82 52 37.8 5
## 7 Sacramento 18.07 25 38.5 80
## 8 San Jose 14.17 95 37.4 28
## 9 Giant Forest 42.63 6360 36.6 145
## 10 Salinas 13.85 74 36.7 12
## 11 Fresno 9.44 331 36.7 114
## 12 Pt Piedras 19.33 57 35.7 1
## 13 Pasa Robles 15.67 740 35.7 31
## 14 Bakersfield 6.00 489 35.4 75
## 15 Bishop 5.73 4108 37.3 198
## 16 Mineral 47.82 4850 40.4 142
## 17 Santa Barbara 17.95 120 34.4 1
## 18 Susanville 18.20 4152 40.3 198
## 20 Needles 4.63 913 34.8 192
## 21 Burbank 14.74 699 34.2 47
## 22 Los Angeles 15.02 312 34.1 16
## 23 Long Beach 12.36 50 33.8 12
## 24 Los Banos 8.26 125 37.8 74
## 25 Blythe 4.05 268 33.6 155
## 26 San Diego 9.94 19 32.7 5
## 27 Daggett 4.25 2105 34.1 85
## 28 Death Valley 1.66 -178 36.5 194
## 30 Colusa 15.95 60 39.2 91
fullmod_no<-with(no_out,
lm(Precipitation~ Altitude+Latitude+Distance))
altmod_no<-with(no_out,
lm(Precipitation~ Altitude))
latmod_no<-with(no_out,
lm(Precipitation~ Latitude))
distmod_no<-with(no_out,
lm(Precipitation~ Distance))
### PRECIPITATION PAIRWISE (NO OUTLIER) ###
pairs(no_out[,-1])
#par(mfrow=c(2,2))
#with(no_out, plot(Latitude, Precipitation,
# main="Scatterplot: PPT vs LAT",
# xlab="Latitude",
# ylab="Precipitation"))
#abline(coefficients(latmod_no), lwd=2, lty=2, col="red")
#with(no_out, plot(Distance, Precipitation,
# main="Scatterplot: PPT vs DIST",
# xlab="Distance",
# ylab="Precipitation"))
#abline(coefficients(distmod_no), lwd=2, lty=2, col="red")
#with(no_out, plot(Altitude, Precipitation,
# main="Scatterplot: PPT vs ALT",
# xlab="Altitude",
# ylab="Precipitation"))
#abline(coefficients(altmod_no), lwd=2, lty=2, col="red")
#plot(fitted(fullmod_no), resid(fullmod_no),
# main="Residual Plot",
# xlab="Fitted Values",
# ylab="Residuals")
#abline(h=0, lwd=2, lty=2, col="blue")
### RESIDUAL PLOTS ###
par(mfrow=c(2,2))
plot(fitted(fullmod_no), resid(fullmod_no),
xlab="Fitted Values",
ylab="Residuals (FULL- NO OUTLIER)")
abline(h=0, lwd=2, lty=2, col="blue")
with(no_out, plot(Latitude, resid(fullmod_no),
ylab="Residuals (FULL- NO OUTLIER)"))
abline(h=0, lwd=2, lty=2, col="blue")
with(no_out, plot(Distance, resid(fullmod_no),
ylab="Residuals (FULL- NO OUTLIER)"))
abline(h=0, lwd=2, lty=2, col="blue")
with(no_out, plot(Altitude, resid(fullmod_no),
ylab="Residuals (FULL- NO OUTLIER)"))
abline(h=0, lwd=2, lty=2, col="blue")
Do you feel that the constant variation condition is not met? If so, what should be done?
It appears that there might be some non-constant variation issues. We might want to consider transformations.
3) Normality condition:
The residuals must be normally distributed.
Assess with:
### QQ PLOT ###
par(mfrow=c(1,1))
qqnorm(resid(fullmod_no))
qqline(resid(fullmod_no))
Do you feel that the residuals are normally distributed? If not, what should be done?
Some deviations from normality in the tails.
Summary of step 4: Even though the normality condition seems met, a natural log transformation of precipitation will be performed because the residual plot versus predicted values indicates some curvature and non-constant variation of the residuals.
After performing a transformation on precipitation, we must repeat Step 4:
1) Linearity condition:
Scatterplots and the residual plot (residuals versus predicted values) are given below:
### LN TRANSFORM PRECIP ###
no_out_ln<-no_out%>%
mutate(LN_PRECIP = log(Precipitation))
fullmod_ln<-with(no_out_ln,
lm(LN_PRECIP~ Altitude+Latitude+Distance))
altmod_ln<-with(no_out_ln,
lm(LN_PRECIP~ Altitude))
latmod_ln<-with(no_out_ln,
lm(LN_PRECIP~ Latitude))
distmod_ln<-with(no_out_ln,
lm(LN_PRECIP~ Distance))
### PRECIPITATION PAIRWISE (NO OUTLIER) ###
par(mfrow=c(2,2))
with(no_out_ln, plot(Latitude, LN_PRECIP,
main="Scatterplot: LN(PPT) vs LAT",
xlab="Latitude",
ylab="LN(Precipitation)"))
abline(coefficients(latmod_ln), lwd=2, lty=2, col="red")
with(no_out_ln, plot(Distance, LN_PRECIP,
main="Scatterplot: LN(PPT) vs DIST",
xlab="Distance",
ylab="LN(Precipitation)"))
abline(coefficients(distmod_ln), lwd=2, lty=2, col="red")
with(no_out_ln, plot(Altitude, LN_PRECIP,
main="Scatterplot: LN(PPT) vs ALT",
xlab="Altitude",
ylab="LN(Precipitation)"))
abline(coefficients(altmod_ln), lwd=2, lty=2, col="red")
plot(fitted(fullmod_ln), resid(fullmod_ln),
main="Residual Plot",
xlab="Fitted Values",
ylab="Residuals")
abline(h=0, lwd=2, lty=2, col="blue")
Is there any curvature in the residual plot of the residuals versus the predicted values? Therefore, does the linearity condition seem better met? If not, which variable(s) is(are) not linearly related to ln(precipitation)?
It looks like there is some curvature between the ln(precipitation) and altitude.
Comment:
There is still “clumping” in the scatterplot of ln(precipitation) versus altitude. A log transformation of altitude may help this. More on this later.
2 and 3) constant variation and normality conditions:
Plots are given on the next page
### TRANSFORMED MOD ###
par(mfrow=c(1,2))
plot(fitted(fullmod_ln), resid(fullmod_ln),
main="Residual Plot",
xlab="Fitted Values",
ylab="Residuals")
abline(h=0, lwd=2, lty=2, col="blue")
qqnorm(resid(fullmod_ln))
qqline(resid(fullmod_ln))
Is the constant variation condition better met after transforming precipitation?
It appears that as the fitted values increase the magnitude of the residuals increase.
The normality condition was met on the original scale. Is it still met after transforming precipitation?
There are some deviations in the upper tail but the rest if very close to the theoretic line of normality.
Comment:
There could be an argument that the normality condition is not as well met after transforming precipitation, but the gain in improvement in the residual plot offsets the loss in the normal probability plot.
If the linearity, constant variation, and/or normality conditions were not satisfied after performing a transformation on precipitation, what should be done? Do we have to do that in this example?
We could try to transform the explantory variables.
Should “altitude” be transformed?
In the scatterplot of precipitation versus altitude, there was a lot of “clumping” at lower altitudes. This makes sense because many of the locations in the sample were at lower altitudes with only a half dozen or so at higher altitudes. In addition, there was a lot of variation in precipitation for these locations which made it hard to assess both the linearity condition and constant variation condition. An option would be to do a log transformation on altitude to perhaps remove the clumping. But, remember that some altitudes were below sea level and logarithms can only be taken on positive values. So, we need to add the same amount to all altitudes to make them all positive. I added 200 feet to all altitudes and took the natural log of altitude to illustrate how this works. The scatter plot of ln(precipitation) versus ln(altitude + 200 feet) is shown to the right. Note that it did remove some of the clumping. However, I decided NOT to transform altitude because the constant variation was violated more so than without transforming altitude (graph not shown).
### LN TRANSFORM ALTITUDE ###
no_out_ln<-no_out%>%
mutate(LN_PRECIP = log(Precipitation),
LN_ALT200 = log(Altitude+200))
with(no_out_ln,plot(LN_ALT200, LN_PRECIP))
Steps 5 through 7:
The following output is from the data with 28 observations after performing natural log transformation of precipitation:
summary(fullmod_ln)
##
## Call:
## lm(formula = LN_PRECIP ~ Altitude + Latitude + Distance)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.82582 -0.26341 0.07587 0.27828 0.55447
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -4.002e+00 1.156e+00 -3.462 0.00202 **
## Altitude 2.883e-04 4.528e-05 6.366 1.39e-06 ***
## Latitude 1.915e-01 3.181e-02 6.019 3.25e-06 ***
## Distance -1.014e-02 1.289e-03 -7.868 4.23e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3791 on 24 degrees of freedom
## Multiple R-squared: 0.829, Adjusted R-squared: 0.8076
## F-statistic: 38.77 on 3 and 24 DF, p-value: 2.317e-09
anova(fullmod_ln)
## Analysis of Variance Table
##
## Response: LN_PRECIP
## Df Sum Sq Mean Sq F value Pr(>F)
## Altitude 1 2.9817 2.9817 20.745 0.0001289 ***
## Latitude 1 4.8402 4.8402 33.676 5.540e-06 ***
## Distance 1 8.8971 8.8971 61.902 4.233e-08 ***
## Residuals 24 3.4495 0.1437
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1