Exam 2 next week Thursday (November 7th). I'll be out of town so you can take the exam anywhere you like.
The exam will be available starting at 12:30p and you are required to turn it in by 1:45pm. Please work independently.
Today: AIC, likelihood of your data, and choosing among models. Practice problems.
Review from last week. Linear regression models.
Davies and Goldsmith (1972) investigated the relationship between abrasion loss of rubber samples (abrasion, in grams per hour) and two explanatory variables: hardness (higher values indicate harder rubber) and tensile strength (kg/cm2). Abrasion loss is the response variable.
The data are in AbrasionLoss.txt. Input the data using
AL = read.table("http://myweb.fsu.edu/jelsner/data/AbrasionLoss.txt", header = TRUE)
str(AL)
## 'data.frame': 30 obs. of 3 variables:
## $ abrasion: int 372 206 175 154 136 112 55 45 221 166 ...
## $ hardness: int 45 55 61 66 71 71 81 86 53 60 ...
## $ strength: int 162 233 232 231 231 237 224 219 203 189 ...
a. Create a scatter plot matrix of the three variables. Based on the correlation between abrasion loss and tensile strength, do you think tensile strength would be helpful in explaining abrasion loss?
require(GGally)
## Loading required package: GGally
## Loading required package: ggplot2
## Loading required package: reshape
## Loading required package: plyr
##
## Attaching package: 'reshape'
##
## The following objects are masked from 'package:plyr':
##
## rename, round_any
ggpairs(AL)
b. Regress abrasion loss on hardness and strength. What is the adjusted R squared value? Is tensile strength an important explanatory variable after accounting for hardness?
model = lm(abrasion ~ hardness + strength, data = AL)
summary(model)
##
## Call:
## lm(formula = abrasion ~ hardness + strength, data = AL)
##
## Residuals:
## Min 1Q Median 3Q Max
## -79.38 -14.61 3.82 19.75 65.98
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 885.161 61.752 14.33 3.8e-14 ***
## hardness -6.571 0.583 -11.27 1.0e-11 ***
## strength -1.374 0.194 -7.07 1.3e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 36.5 on 27 degrees of freedom
## Multiple R-squared: 0.84, Adjusted R-squared: 0.828
## F-statistic: 71 on 2 and 27 DF, p-value: 1.77e-11
c. On average how much additional abrasion is lost for every 1 kg/cm2 increase in tensile strength?
coef(model)[3]
## strength
## -1.374
On average, abrasion loss decreases by about 1.37 g/hr for every 1 kg/cm2 increase in tensile strength (the coefficient is negative).
d. Check the correlations between the explanatory variables. Could collinearity be a problem for interpreting the model?
cor(AL)
## abrasion hardness strength
## abrasion 1.0000 -0.7377 -0.2984
## hardness -0.7377 1.0000 -0.2992
## strength -0.2984 -0.2992 1.0000
No. The correlation between the two explanatory variables (hardness and strength) is only about -0.30, so collinearity is not a concern here.
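A more formal check is the variance inflation factor (one value per explanatory variable); values near 1 indicate collinearity is not an issue. A minimal sketch, assuming the car package is installed:
require(car)
vif(model)  # variance inflation factors for hardness and strength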
e. Find the 95% prediction interval for the abrasion corresponding to a new rubber sample having a hardness of 60 units and a tensile strength of 200 kg/cm2.
predict(model, data.frame(hardness = 60, strength = 200), interval = "prediction")
## fit lwr upr
## 1 216 138.9 293.2
The predicted abrasion loss for a new rubber sample with a hardness of 60 and a tensile strength of 200 kg per square cm is 216 g/hr, with a 95% prediction interval between 139 and 293 g/hr.
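For comparison, the interval for the mean abrasion loss at these values (rather than for a single new sample) is narrower and is obtained with interval = "confidence":
predict(model, data.frame(hardness = 60, strength = 200), interval = "confidence")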
Typically you begin with a model that has many explanatory variables. The first step is to see if you can remove some of the variables.
From last week you know that by dropping a variable the R squared value will decrease: the proportion of the variance of the response explained by the remaining explanatory variables will be lower.
But the cost of keeping the variable in the model is an increase in model bias; the model is biased toward this particular set of data. By dropping a variable you gain one degree of freedom, making the model less biased.
This trade-off between variance and bias can be managed by considering the adjusted R squared value. You favor the model that has the largest adjusted R squared: if the adjusted R squared gets smaller after you remove a particular variable, then you keep that variable.
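For example, with the abrasion model from above you can compare the adjusted R squared values with and without tensile strength (a quick sketch):
summary(model)$adj.r.squared                                 # hardness + strength
summary(lm(abrasion ~ hardness, data = AL))$adj.r.squared    # hardness only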
Another useful statistic in this regard is called AIC (Akaike Information Criterion). AIC is more general and can be used with models that do not use least-squares criteria for determining the model parameters.
AIC is grounded in information theory. Values of AIC provide a means for model selection in a relative sense. If all candidate models fit poorly the AIC is not much help.
AIC = 2 * k - 2 * log(L), where k is the number of model parameters (one plus the number of explanatory variables) and L is the maximized value of the likelihood function. The likelihood is the probability of your data given the model.
AIC rewards goodness of fit, but includes a penalty that is an increasing function of the number of estimated parameters.
Given a set of candidate models for your data, you should choose the one with the lowest AIC.
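You can check this formula by hand using the abrasion model fit above. A minimal sketch; note that R also counts the residual standard error as one of the k estimated parameters, so k here is one larger than the count given above.
ll = logLik(model)            # maximized log-likelihood of the abrasion model
k = attr(ll, "df")            # number of estimated parameters
2 * k - 2 * as.numeric(ll)    # AIC computed by hand
AIC(model)                    # same value from the built-in function
Note that drop1() and step() (used below) report AIC only up to an additive constant, so only differences in AIC within a table are meaningful, not comparisons with the value returned by AIC().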
Let's return to our model of petrol consumption at the state level.
PC = read.table("http://myweb.fsu.edu/jelsner/data/PetrolConsumption.txt", header = TRUE)
str(PC)
## 'data.frame': 48 obs. of 5 variables:
## $ Petrol.Tax : num 9 9 9 7.5 8 10 8 8 8 7 ...
## $ Avg.Inc : int 3571 4092 3865 4870 4399 5342 5319 5126 4447 4512 ...
## $ Pavement : int 1976 1250 1586 2351 431 1333 11868 2138 8577 8507 ...
## $ Prop.DL : num 0.525 0.572 0.58 0.529 0.544 0.571 0.451 0.553 0.529 0.552 ...
## $ Petrol.Consumption: int 541 524 561 414 410 457 344 467 464 498 ...
First create a full model. A full model includes all the explanatory variables you might think are important.
modelF = lm(Petrol.Consumption ~ Prop.DL + Pavement + Avg.Inc + Petrol.Tax,
data = PC)
Then use the drop1() function. The drop1() function takes a regression model object and returns an analysis of variance table showing what happens when each variable is removed from the model.
drop1(modelF)
## Single term deletions
##
## Model:
## Petrol.Consumption ~ Prop.DL + Pavement + Avg.Inc + Petrol.Tax
## Df Sum of Sq RSS AIC
## <none> 189050 407
## Prop.DL 1 212355 401405 442
## Pavement 1 2252 191302 406
## Avg.Inc 1 65729 254779 420
## Petrol.Tax 1 31632 220682 413
The first row of the table shows the sum of squared residuals (here labeled RSS, the residual sum of squares) for the full model. The value is 189050 (in the squared units of the response variable). You can verify this directly:
sum(resid(modelF)^2)
## [1] 189050
deviance(modelF)
## [1] 189050
The first row also gives the AIC value for the full model. Here 407.37.
The model deviance depends on the units of the response variable (it is a sum of squared residuals). The AIC is unitless.
The next row tells you what happens to the RSS if the explanatory variable Prop.DL is removed from the full model. If you drop the Prop.DL variable from the full model, the RSS increases to 401405 and you gain 1 degree of freedom (less bias).
Said another way, the value of 212355 in the Sum of Sq column indicates how much the variable Prop.DL is worth to the model. By removing it the RSS increases from 189050 to 401405 a difference of 212355.
Apparently that is too much of an increase in RSS for the gain of only 1 dof as the AIC increases from 407.37 to 441.51.
The AIC for the full model is 407. In comparison the AIC is 442 for a model that is otherwise the same but missing Prop.DL.
The next row gives the same information about the Pavement variable. That is, what happens to the RSS from the full model when the variable Pavement is removed.
Here the RSS increases by 2252 to 191302 (the residuals get slightly larger on average), but the increase is small enough that it is worth the gain of one degree of freedom (Df). This is indicated by a reduction in the AIC to 405.94.
The AIC column keeps score. The AIC for the model having all the explanatory variables is 407.37. A model without Prop.DL has an AIC of 441.51, which is LARGER than the AIC for the full model so we keep Prop.DL in the model.
On the other hand, a model without Pavement has an AIC of 405.94, which is SMALLER than the AIC for the full model so we remove Pavement from the model.
So you can simply compare the AIC values for each variable against the AIC value for the full model. If the AIC value for a variable is less than the AIC for the full model then the trade-off is in favor of removing it.
Repeat the procedure after removing Pavement.
modelR = lm(Petrol.Consumption ~ Prop.DL + Avg.Inc + Petrol.Tax, data = PC)
drop1(modelR)
## Single term deletions
##
## Model:
## Petrol.Consumption ~ Prop.DL + Avg.Inc + Petrol.Tax
## Df Sum of Sq RSS AIC
## <none> 191302 406
## Prop.DL 1 243586 434889 443
## Avg.Inc 1 69532 260834 419
## Petrol.Tax 1 33742 225044 412
Now all the AIC values exceed the value given in the row labeled <none>, so no further variables should be removed.
Stepwise regression automates this procedure of selecting a final model. It is a method for finding the best model from a set of candidate models when the models are nested, and it is useful for sifting through a large number of explanatory variables.
It is a procedure, not a model.
Stepwise regression can be done by backward deletion of variables or by forward selection of the variables. Backward deletion amounts to automating the drop1() function and forward selection amounts to automating the add1() function.
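To see what a single forward step looks like, you can apply add1() to an intercept-only model of the petrol data (a sketch; model0 is introduced here just for illustration):
model0 = lm(Petrol.Consumption ~ 1, data = PC)  # intercept-only model
add1(model0, scope = ~Prop.DL + Pavement + Avg.Inc + Petrol.Tax)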
Both methods use the AIC as a criterion for choosing or deleting a variable. To see how this works, return to your full model explaining gasoline consumption.
step(modelF)
## Start: AIC=407.4
## Petrol.Consumption ~ Prop.DL + Pavement + Avg.Inc + Petrol.Tax
##
## Df Sum of Sq RSS AIC
## - Pavement 1 2252 191302 406
## <none> 189050 407
## - Petrol.Tax 1 31632 220682 413
## - Avg.Inc 1 65729 254779 420
## - Prop.DL 1 212355 401405 442
##
## Step: AIC=405.9
## Petrol.Consumption ~ Prop.DL + Avg.Inc + Petrol.Tax
##
## Df Sum of Sq RSS AIC
## <none> 191302 406
## - Petrol.Tax 1 33742 225044 412
## - Avg.Inc 1 69532 260834 419
## - Prop.DL 1 243586 434889 443
##
## Call:
## lm(formula = Petrol.Consumption ~ Prop.DL + Avg.Inc + Petrol.Tax,
## data = PC)
##
## Coefficients:
## (Intercept) Prop.DL Avg.Inc Petrol.Tax
## 307.328 1374.768 -0.068 -29.484
As another example, let's consider the stackloss data frame.
You are interested in regressing stack loss onto air flow, water temperature, and acid concentration.
modelF = lm(stack.loss ~ ., data = stackloss)
The period after the tilde says to include all the remaining columns in the data frame as explanatory variables.
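That is, the formula above is shorthand for writing out each remaining column by name:
modelF = lm(stack.loss ~ Air.Flow + Water.Temp + Acid.Conc., data = stackloss)  # equivalent to the "." shorthand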
step(modelF)
## Start: AIC=52.98
## stack.loss ~ Air.Flow + Water.Temp + Acid.Conc.
##
## Df Sum of Sq RSS AIC
## - Acid.Conc. 1 10 189 52.1
## <none> 179 53.0
## - Water.Temp 1 130 309 62.5
## - Air.Flow 1 296 475 71.5
##
## Step: AIC=52.12
## stack.loss ~ Air.Flow + Water.Temp
##
## Df Sum of Sq RSS AIC
## <none> 189 52.1
## - Water.Temp 1 130 319 61.1
## - Air.Flow 1 294 483 69.9
##
## Call:
## lm(formula = stack.loss ~ Air.Flow + Water.Temp, data = stackloss)
##
## Coefficients:
## (Intercept) Air.Flow Water.Temp
## -50.359 0.671 1.295
Backward deletion of variables is the default in the step() function; it is a successive application of the drop1() function. When there is a large number of explanatory variables it's a good idea to also try forward selection to see whether the results are the same.
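A forward selection on the petrol data might look like the following sketch. It starts from an intercept-only model (model0, introduced for illustration) and lets step() add terms up to the full set given by the scope formula:
model0 = lm(Petrol.Consumption ~ 1, data = PC)  # intercept-only starting model
step(model0, scope = ~Prop.DL + Pavement + Avg.Inc + Petrol.Tax, direction = "forward")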
Consider the pollute.txt data file. Read the file from the web and look at the first six lines.
pollute = read.table("http://myweb.fsu.edu/jelsner/data/pollute.txt", header = TRUE)
head(pollute)
## Pollution Temp Industry Population Wind Rain Wet.days
## 1 24 61.5 368 497 9.1 48.34 115
## 2 30 55.6 291 593 8.3 43.11 123
## 3 56 55.9 775 622 9.5 35.89 105
## 4 28 51.0 137 176 8.7 15.17 89
## 5 14 68.4 136 529 8.8 54.47 116
## 6 46 47.6 44 116 8.8 33.36 135
a. Create a regression model with Pollution as the response variable and Temp, Industry, Population, Wind, Rain, and Wet.days as explanatory variables. Which variable is the most significant in explaining the variation in pollution? Which variable is the least significant?
b. Is there a problem with collinearity? Which variables are involved? Remove one and create a new regression model.
c. Use a stepwise regression to find the final model. Which variables are not included in the final model?
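A possible starting point (a sketch, not the full answer):
modelP = lm(Pollution ~ ., data = pollute)  # all remaining columns as explanatory variables
summary(modelP)
cor(pollute)    # check correlations among the variables
step(modelP)    # backward stepwise selection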
In summary, stepwise regression is a procedure for determining which set of explanatory variables should be included in a multiple regression model.
Checks of model fit and adequacy need to be performed once a set of variables is selected.
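For example, the default residual diagnostic plots for the reduced petrol model give a quick look at linearity, constant variance, and normality:
par(mfrow = c(2, 2))
plot(modelR)  # residuals vs fitted, normal Q-Q, scale-location, residuals vs leverage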
Download the ESRI shapefiles from my website.
require(maptools)
tmp = download.file("http://myweb.fsu.edu/jelsner/data/south.zip", "south.zip",
mode = "wb")
unzip("south.zip")
homicides = readShapeSpatial("south")
slotNames(homicides)
## [1] "data" "polygons" "plotOrder" "bbox" "proj4string"
str(homicides, max.level = 2)
## Formal class 'SpatialPolygonsDataFrame' [package "sp"] with 5 slots
## ..@ data :'data.frame': 1412 obs. of 69 variables:
## .. ..- attr(*, "data_types")= chr [1:69] "C" "C" "C" "C" ...
## ..@ polygons :List of 1412
## .. .. [list output truncated]
## ..@ plotOrder : int [1:1412] 1247 1172 1091 1088 1253 1368 1286 1201 1089 292 ...
## ..@ bbox : num [1:2, 1:2] -106.6 25 -75 40.6
## .. ..- attr(*, "dimnames")=List of 2
## ..@ proj4string:Formal class 'CRS' [package "sp"] with 1 slots
head(homicides@data)
## NAME STATE_NAME STATE_FIPS CNTY_FIPS FIPS STFIPS COFIPS FIPSNO
## 0 Hancock West Virginia 54 029 54029 54 29 54029
## 1 Brooke West Virginia 54 009 54009 54 9 54009
## 2 Ohio West Virginia 54 069 54069 54 69 54069
## 3 Marshall West Virginia 54 051 54051 54 51 54051
## 4 New Castle Delaware 10 003 10003 10 3 10003
## 5 Washington Maryland 24 043 24043 24 43 24043
## SOUTH HR60 HR70 HR80 HR90 HC60 HC70 HC80 HC90 PO60
## 0 1 1.6829 4.193 6.598 0.9461 0.6667 1.667 2.667 0.3333 39615
## 1 1 4.6072 3.285 3.214 1.2349 1.3333 1.000 1.000 0.3333 28940
## 2 1 0.9741 5.254 7.602 2.6210 0.6667 3.333 4.667 1.3333 68437
## 3 1 0.8762 4.433 4.006 4.4616 0.3333 1.667 1.667 1.6667 38041
## 4 1 4.2284 6.738 7.117 6.7127 13.0000 26.000 28.333 29.6667 307446
## 5 1 1.8271 2.568 2.063 1.6475 1.6667 2.667 2.333 2.0000 91219
## PO70 PO80 PO90 RD60 RD70 RD80 RD90 PS60 PS70
## 0 39749 40418 35233 -1.3947 -1.3074 -1.1593 -0.39903 1.2187 1.1368
## 1 30443 31117 26992 -1.2071 -0.9980 -1.2495 -0.66360 0.9556 0.9218
## 2 63439 61389 50871 -0.4634 -0.3106 -0.1694 0.13631 1.5572 1.3991
## 3 37598 41608 37356 -0.9012 -0.8014 -1.0752 -0.05353 0.7396 0.6670
## 4 385856 398115 441946 -0.9854 -0.8245 -0.3015 -0.88762 2.2293 2.2551
## 5 103829 113086 121393 -0.6971 -0.7060 -0.6852 -0.73195 1.2741 1.2763
## PS80 PS90 UE60 UE70 UE80 UE90 DV60 DV70 DV80 DV90 MA60 MA70
## 0 1.0386 0.8965 3.1 2.7 7.076 6.858 2.278 2.559 5.062 7.264 28.9 30.0
## 1 0.8275 0.6878 4.8 3.4 6.035 6.966 1.504 2.279 4.383 6.015 28.5 30.0
## 2 1.2679 1.0747 7.3 3.7 5.888 6.932 1.994 2.589 5.101 7.511 34.1 33.5
## 3 0.6361 0.5182 11.0 4.8 11.643 8.773 3.480 3.012 5.371 7.222 31.2 30.9
## 4 2.1905 2.1436 4.7 3.9 5.951 3.820 1.752 2.466 5.185 7.006 29.5 27.0
## 5 1.2306 1.2153 8.3 4.3 6.205 4.088 2.021 2.489 4.682 7.354 30.5 29.5
## MA80 MA90 POL60 POL70 POL80 POL90 DNL60 DNL70 DNL80 DNL90 MFIL59 MFIL69
## 0 31.4 37.7 10.59 10.59 10.61 10.47 6.168 6.171 6.171 6.051 8.841 9.247
## 1 30.9 37.3 10.27 10.32 10.35 10.20 5.796 5.846 5.846 5.716 8.697 9.137
## 2 33.0 37.7 11.13 11.06 11.02 10.84 6.470 6.394 6.364 6.172 8.599 9.079
## 3 30.9 36.6 10.55 10.53 10.64 10.53 4.829 4.818 4.915 4.801 8.548 9.047
## 4 29.7 32.5 12.64 12.86 12.89 13.00 6.552 6.781 6.914 6.944 8.828 9.304
## 5 31.6 34.4 11.42 11.55 11.64 11.71 5.292 5.421 5.516 5.579 8.546 9.080
## MFIL79 MFIL89 FP59 FP69 FP79 FP89 BLK60 BLK70 BLK80 BLK90
## 0 10.073 10.33 9.6 5.9 6.533 10.173 3.8395 3.2554 2.5607 2.5573
## 1 10.007 10.35 13.5 8.7 6.141 8.973 1.4167 0.8245 0.7970 0.7484
## 2 9.862 10.31 20.6 10.3 8.916 12.277 3.0524 3.1432 3.4632 3.3103
## 3 9.912 10.20 19.8 10.5 7.769 12.699 0.9463 0.6011 0.5936 0.5461
## 4 10.030 10.72 11.3 6.6 8.032 5.015 11.7221 12.6651 15.1002 16.4803
## 5 9.870 10.45 21.3 9.4 8.025 7.151 2.8338 3.2332 4.2021 5.9682
## GI59 GI69 GI79 GI89 FH60 FH70 FH80 FH90
## 0 0.2236 0.2954 0.3323 0.3639 9.981 7.8 9.786 12.60
## 1 0.2204 0.3185 0.3142 0.3506 10.929 8.0 10.215 11.24
## 2 0.2724 0.3585 0.3770 0.3905 15.622 12.9 14.717 17.57
## 3 0.2276 0.3196 0.3210 0.3773 11.963 8.8 8.803 13.56
## 4 0.2561 0.3297 0.3658 0.3327 12.036 10.7 15.169 16.38
## 5 0.2573 0.3348 0.3439 0.3443 12.273 9.3 10.502 13.09
Since this is a spatial polygons data frame you can use the plot() method to create a map of the county borders.
plot(homicides)
You can generate a default choropleth map by using the spplot() method.
spplot(homicides, "RD90")
The shapefile has no projection information. This can be added using the proj4string argument and the CRS() function within the readShapeSpatial() function. Here we use geographic coordinates on the Clarke 1866 ellipsoid.
homicides = readShapeSpatial("south", proj4string = CRS("+proj=longlat +ellps=clrk66"))
plot(homicides)
(1) Consider the airquality data frame containing daily air quality measurements in New York.
a. Create a scatter plot matrix of the four variables (solar radiation, wind speed, temperature, and ozone). Which variable appears to have the strongest relationship to ozone?
b. Regress ozone on air temperature, wind speed, and solar radiation. What is the R squared value for this model? Is the model statistically significant?
c. Can the model be simplified by removing a variable? If so, which one?
d. Check the model assumptions of linearity, constant variance, and normality and comment on your findings.
e. Is collinearity a problem?
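A possible starting point for exercise (1) (a sketch; subsetting the four variables and dropping rows with missing values first is one reasonable way to proceed):
aq = na.omit(airquality[, c("Ozone", "Solar.R", "Wind", "Temp")])  # keep the four variables, drop missing values
pairs(aq)                                                          # scatter plot matrix
modelA = lm(Ozone ~ Temp + Wind + Solar.R, data = aq)
summary(modelA)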
(2) The file GeorgiaEduc.dbf is an ESRI database file containing the percentage of Georgia county residents having a bachelor's degree along with other countywide information as explanatory variables.
Download the data file from Blackboard and import it to R using the read.dbf() function from the maptools package.
require(maptools)
setwd("~/Desktop")
GE = read.dbf("GeorgiaEduc.dbf")
names(GE)
Let the response variable be PctBach (percentage of the population with a bachelor's degree) and the explanatory variables be TotPop90, PctRural, PctEld, PctFB (percent foreign born), PctPov, and PctBlack. Perform a stepwise regression to determine the final model.
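A possible starting point for the stepwise regression (a sketch using the variable names listed above):
modelG = lm(PctBach ~ TotPop90 + PctRural + PctEld + PctFB + PctPov + PctBlack, data = GE)
step(modelG)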