QG Lecture 16

Exam 2 is next Thursday (November 7th). I'll be out of town, so you can take the exam anywhere you like.

The exam will be available starting at 12:30 pm and you are required to turn it in by 1:45 pm. Please work independently.

Today: AIC, likelihood of your data, and choosing among models. Practice problems.

Review from last week. Linear regression models.

Davies and Goldsmith (1972) investigated the relationship between abrasion loss (abrasion) of samples of rubber (grams per hour) and hardness (higher values indicate harder rubber) and tensile strength (kg/cm2 ). Abrasion loss is the response variable.

The data are in AbrasionLoss.txt. Input the data using

AL = read.table("http://myweb.fsu.edu/jelsner/data/AbrasionLoss.txt", header = TRUE)
str(AL)
## 'data.frame':    30 obs. of  3 variables:
##  $ abrasion: int  372 206 175 154 136 112 55 45 221 166 ...
##  $ hardness: int  45 55 61 66 71 71 81 86 53 60 ...
##  $ strength: int  162 233 232 231 231 237 224 219 203 189 ...

a. Create a scatter plot matrix of the three variables. Based on the correlation between abrasion loss and tensile strength, do you think tensile strength would be helpful in explaining abrasion loss?

require(GGally)
ggpairs(AL)

[Figure: scatter plot matrix of abrasion, hardness, and strength]

b. Regress abrasion loss on hardness and strength. What is the adjusted R squared value? Is tensile strength an important explanatory variable after accounting for hardness?

model = lm(abrasion ~ hardness + strength, data = AL)
summary(model)
## 
## Call:
## lm(formula = abrasion ~ hardness + strength, data = AL)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -79.38 -14.61   3.82  19.75  65.98 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  885.161     61.752   14.33  3.8e-14 ***
## hardness      -6.571      0.583  -11.27  1.0e-11 ***
## strength      -1.374      0.194   -7.07  1.3e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 36.5 on 27 degrees of freedom
## Multiple R-squared:  0.84,   Adjusted R-squared:  0.828 
## F-statistic:   71 on 2 and 27 DF,  p-value: 1.77e-11

c. On average how much additional abrasion is lost for every 1 kg/cm2 increase in tensile strength?

coef(model)[3]
## strength 
##   -1.374

Abrasion loss decreases by about 1.37 g/hr for every 1 kg/cm2 increase in tensile strength (note the negative coefficient).

d. Check the correlations between the explanatory variables. Could collinearity be a problem for interpreting the model?

cor(AL)
##          abrasion hardness strength
## abrasion   1.0000  -0.7377  -0.2984
## hardness  -0.7377   1.0000  -0.2992
## strength  -0.2984  -0.2992   1.0000

No. The correlation between the two explanatory variables (hardness and strength) is only about -0.3, so collinearity is not a concern here.

e. Find the 95% prediction interval for the abrasion corresponding to a new rubber sample having a hardness of 60 units and a tensile strength of 200 kg/cm2.

predict(model, data.frame(hardness = 60, strength = 200), interval = "prediction")
##   fit   lwr   upr
## 1 216 138.9 293.2

The best estimate of the abrasion loss for a new rubber sample with a hardness of 60 and a tensile strength of 200 kg per square cm is 216 g/hr, with a 95% prediction interval between 139 and 293 g/hr.
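For comparison, a prediction interval applies to a single new sample. If instead you want an interval for the mean abrasion loss at these predictor values, use interval = "confidence", which gives a narrower interval:

predict(model, data.frame(hardness = 60, strength = 200), interval = "confidence")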

Dropping variables from a model

Typically you begin with a model that has many explanatory variables. The first step is to see if you can remove some of the variables.

From last week you know that dropping a variable will decrease the R squared value: the remaining explanatory variables explain less of the variance in the response.

But the cost of keeping a variable in the model is an increase in model bias: the model becomes tuned to the particular set of data at hand. By dropping a variable you gain one degree of freedom, making the model less biased toward the sample.

This trade-off between variance and bias is addressed by considering the adjusted R squared value. You favor the model with the largest adjusted R squared: if the adjusted R squared gets smaller after you remove a particular variable, then you keep that variable.
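As a quick illustration with the abrasion-loss data from above, you can compare the adjusted R squared values of the full and reduced models directly:

summary(lm(abrasion ~ hardness + strength, data = AL))$adj.r.squared
summary(lm(abrasion ~ hardness, data = AL))$adj.r.squared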

Another useful statistic in this regard is called AIC (Akaike Information Criterion). AIC is more general and can be used with models that do not use least-squares criteria for determining the model parameters.

AIC is grounded in information theory. Values of AIC provide a means for model selection in a relative sense. If all candidate models fit poorly the AIC is not much help.

AIC = 2 * k - 2 * log(L), where k is the number of model parameters (one for the intercept plus one for each explanatory variable) and L is the maximized value of the likelihood function. The likelihood is the probability of your data given the model.
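As a check, here is a minimal sketch using the abrasion-loss model from above. Note that for linear models R's AIC() counts the estimated residual variance as one additional parameter beyond the regression coefficients:

logL = logLik(model)            # maximized log-likelihood
k = attr(logL, "df")            # coefficients plus residual variance
2 * k - 2 * as.numeric(logL)    # the AIC formula
AIC(model)                      # same value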

AIC rewards goodness of fit, but includes a penalty that is an increasing function of the number of estimated parameters.

Given a set of candidate models for your data, you should choose the one with the lowest AIC.
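For example, AIC() accepts several fitted models at once and returns a table you can scan for the smallest value:

AIC(lm(abrasion ~ hardness + strength, data = AL),
    lm(abrasion ~ hardness, data = AL))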

Let's return to our model of petrol consumption at the state level.

PC = read.table("http://myweb.fsu.edu/jelsner/data/PetrolConsumption.txt", header = TRUE)
str(PC)
## 'data.frame':    48 obs. of  5 variables:
##  $ Petrol.Tax        : num  9 9 9 7.5 8 10 8 8 8 7 ...
##  $ Avg.Inc           : int  3571 4092 3865 4870 4399 5342 5319 5126 4447 4512 ...
##  $ Pavement          : int  1976 1250 1586 2351 431 1333 11868 2138 8577 8507 ...
##  $ Prop.DL           : num  0.525 0.572 0.58 0.529 0.544 0.571 0.451 0.553 0.529 0.552 ...
##  $ Petrol.Consumption: int  541 524 561 414 410 457 344 467 464 498 ...

First create a full model. A full model includes all the explanatory variables you might think are important.

modelF = lm(Petrol.Consumption ~ Prop.DL + Pavement + Avg.Inc + Petrol.Tax, 
    data = PC)

Then use the drop1() function. The drop1() function takes a regression model object and returns an analysis of variance table showing what happens when each variable is removed from the model.

drop1(modelF)
## Single term deletions
## 
## Model:
## Petrol.Consumption ~ Prop.DL + Pavement + Avg.Inc + Petrol.Tax
##            Df Sum of Sq    RSS AIC
## <none>                  189050 407
## Prop.DL     1    212355 401405 442
## Pavement    1      2252 191302 406
## Avg.Inc     1     65729 254779 420
## Petrol.Tax  1     31632 220682 413

The first row of the table shows the sum of squared residuals (labeled RSS, for residual sum of squares) for the full model. The value is 189050 (in the <none> row; no variable is dropped). This value is the deviance of the full model.

sum(resid(modelF)^2)
## [1] 189050
deviance(modelF)
## [1] 189050

The first row also gives the AIC value for the full model. Here 407.37.

The model deviance has the units of the response variable squared. The AIC is unitless.

The next row tells you what happens to the RSS if the explanatory variable Prop.DL is removed from the full model. If you drop Prop.DL, the RSS increases to 401405 and you gain 1 degree of freedom (less bias).

Said another way, the value of 212355 in the Sum of Sq column indicates how much the variable Prop.DL is worth to the model. By removing it the RSS increases from 189050 to 401405, a difference of 212355.

Apparently that is too much of an increase in RSS for the gain of only 1 degree of freedom, as the AIC increases from 407.37 to 441.51.

The AIC for the full model is 407. In comparison the AIC is 442 for a model that is otherwise the same but missing Prop.DL.
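You can verify these values directly. drop1() reports the extractAIC() version of the criterion, which differs from AIC() by an additive constant (only differences between models matter):

extractAIC(modelF)                           # edf and AIC for the full model
extractAIC(update(modelF, . ~ . - Prop.DL))  # full model without Prop.DL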

The next row gives the same information about the Pavement variable. That is, what happens to the RSS from the full model when the variable Pavement is removed.

Here the RSS increases by 2252 to 191302 (the residuals on average get a bit larger), but this increase is worth it as it costs only one degree of freedom (Df). This is indicated by a reduction in the AIC to 405.94.

The AIC column keeps score. The AIC for the model having all the explanatory variables is 407.37. A model without Prop.DL has an AIC of 441.51, which is LARGER than the AIC for the full model so we keep Prop.DL in the model.

On the other hand, a model without Pavement has an AIC of 405.94, which is SMALLER than the AIC for the full model so we remove Pavement from the model.

So you can simply compare the AIC values for each variable against the AIC value for the full model. If the AIC value for a variable is less than the AIC for the full model then the trade-off is in favor of removing it.

Repeat the procedure after removing Pavement.

modelR = lm(Petrol.Consumption ~ Prop.DL + Avg.Inc + Petrol.Tax, data = PC)
drop1(modelR)
## Single term deletions
## 
## Model:
## Petrol.Consumption ~ Prop.DL + Avg.Inc + Petrol.Tax
##            Df Sum of Sq    RSS AIC
## <none>                  191302 406
## Prop.DL     1    243586 434889 443
## Avg.Inc     1     69532 260834 419
## Petrol.Tax  1     33742 225044 412

Now all the AIC values exceed the value given in the <none> row, so you conclude there is no reason to remove any of the remaining variables. This is your final model.

Stepwise Regression

Stepwise regression automates this procedure of selecting a final model. It is a method for finding the best model from a set of candidate models when the models are nested, and it is useful for sifting through a large number of explanatory variables.

It is a procedure, not a model.

Stepwise regression can be done by backward deletion of variables or by forward selection of the variables. Backward deletion amounts to automating the drop1() function and forward selection amounts to automating the add1() function.
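As a minimal sketch of add1(), take the reduced petrol model modelR fit above and ask what happens to the RSS and AIC if Pavement is added back:

add1(modelR, scope = ~ . + Pavement)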

Both methods use the AIC as a criterion for choosing or deleting a variable. To see how this works, return to your full model explaining gasoline consumption.

step(modelF)
## Start:  AIC=407.4
## Petrol.Consumption ~ Prop.DL + Pavement + Avg.Inc + Petrol.Tax
## 
##              Df Sum of Sq    RSS AIC
## - Pavement    1      2252 191302 406
## <none>                    189050 407
## - Petrol.Tax  1     31632 220682 413
## - Avg.Inc     1     65729 254779 420
## - Prop.DL     1    212355 401405 442
## 
## Step:  AIC=405.9
## Petrol.Consumption ~ Prop.DL + Avg.Inc + Petrol.Tax
## 
##              Df Sum of Sq    RSS AIC
## <none>                    191302 406
## - Petrol.Tax  1     33742 225044 412
## - Avg.Inc     1     69532 260834 419
## - Prop.DL     1    243586 434889 443
## 
## Call:
## lm(formula = Petrol.Consumption ~ Prop.DL + Avg.Inc + Petrol.Tax, 
##     data = PC)
## 
## Coefficients:
## (Intercept)      Prop.DL      Avg.Inc   Petrol.Tax  
##     307.328     1374.768       -0.068      -29.484

As another example, let's consider the stackloss data frame.

You are interested in regressing stack loss onto air flow, water temperature, and acid concentration.

modelF = lm(stack.loss ~ ., data = stackloss)

The period after the tilde says to include all the remaining columns in the data frame as explanatory variables.

step(modelF)
## Start:  AIC=52.98
## stack.loss ~ Air.Flow + Water.Temp + Acid.Conc.
## 
##              Df Sum of Sq RSS  AIC
## - Acid.Conc.  1        10 189 52.1
## <none>                    179 53.0
## - Water.Temp  1       130 309 62.5
## - Air.Flow    1       296 475 71.5
## 
## Step:  AIC=52.12
## stack.loss ~ Air.Flow + Water.Temp
## 
##              Df Sum of Sq RSS  AIC
## <none>                    189 52.1
## - Water.Temp  1       130 319 61.1
## - Air.Flow    1       294 483 69.9
## 
## Call:
## lm(formula = stack.loss ~ Air.Flow + Water.Temp, data = stackloss)
## 
## Coefficients:
## (Intercept)     Air.Flow   Water.Temp  
##     -50.359        0.671        1.295

Backward deletion of variables is the default in the step() function; it successively applies the drop1() function. When you have a large number of explanatory variables it's a good idea to also try forward selection to see if the results are the same.
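Here is a minimal sketch of forward selection on the petrol data, starting from an intercept-only model (PC0 is just an illustrative name):

PC0 = lm(Petrol.Consumption ~ 1, data = PC)
step(PC0, scope = ~ Prop.DL + Pavement + Avg.Inc + Petrol.Tax, direction = "forward")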

Another example

Consider the pollute.txt data file. Read the file from the web and look at the first six lines.

pollute = read.table("http://myweb.fsu.edu/jelsner/data/pollute.txt", header = TRUE)
head(pollute)
##   Pollution Temp Industry Population Wind  Rain Wet.days
## 1        24 61.5      368        497  9.1 48.34      115
## 2        30 55.6      291        593  8.3 43.11      123
## 3        56 55.9      775        622  9.5 35.89      105
## 4        28 51.0      137        176  8.7 15.17       89
## 5        14 68.4      136        529  8.8 54.47      116
## 6        46 47.6       44        116  8.8 33.36      135

Create a regression model with Pollution as the response variable and Temp, Industry, Population, Wind, Rain and Wet.days as explanatory variables. Which variable is the most significant in explaining the variation in pollution? Which variable is the least significant? Is there a problem with collinearity? Which variables are involved? Remove one and create a new regression model. Use a stepwise regression to find the final model. Which variables are not included in the final model?
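A starting sketch for this exercise (not a complete solution; the object names are illustrative):

modelP = lm(Pollution ~ Temp + Industry + Population + Wind + Rain + Wet.days,
    data = pollute)
summary(modelP)        # most and least significant variables
cor(pollute[, -1])     # correlations among the explanatory variables
step(modelP)           # backward stepwise selection of the final model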

In summary, stepwise regression is a procedure for determining which set of explanatory variables should be included in a multiple regression model.

Checks of model fit and adequacy need to be performed once a set of variables is selected.
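For example, the standard residual diagnostics for the final petrol model can be drawn with the plot() method:

par(mfrow = c(2, 2))   # arrange the four diagnostic plots in a 2-by-2 grid
plot(modelR)           # residuals vs fitted, normal Q-Q, scale-location, leverage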

Getting ESRI shapefiles into R

Download the ESRI shapefiles from my website.

require(maptools)
tmp = download.file("http://myweb.fsu.edu/jelsner/data/south.zip", "south.zip", 
    mode = "wb")
unzip("south.zip")
homicides = readShapeSpatial("south")
slotNames(homicides)
## [1] "data"        "polygons"    "plotOrder"   "bbox"        "proj4string"
str(homicides, max.level = 2)
## Formal class 'SpatialPolygonsDataFrame' [package "sp"] with 5 slots
##   ..@ data       :'data.frame':  1412 obs. of  69 variables:
##   .. ..- attr(*, "data_types")= chr [1:69] "C" "C" "C" "C" ...
##   ..@ polygons   :List of 1412
##   .. .. [list output truncated]
##   ..@ plotOrder  : int [1:1412] 1247 1172 1091 1088 1253 1368 1286 1201 1089 292 ...
##   ..@ bbox       : num [1:2, 1:2] -106.6 25 -75 40.6
##   .. ..- attr(*, "dimnames")=List of 2
##   ..@ proj4string:Formal class 'CRS' [package "sp"] with 1 slots
head(homicides@data)
##         NAME    STATE_NAME STATE_FIPS CNTY_FIPS  FIPS STFIPS COFIPS FIPSNO
## 0    Hancock West Virginia         54       029 54029     54     29  54029
## 1     Brooke West Virginia         54       009 54009     54      9  54009
## 2       Ohio West Virginia         54       069 54069     54     69  54069
## 3   Marshall West Virginia         54       051 54051     54     51  54051
## 4 New Castle      Delaware         10       003 10003     10      3  10003
## 5 Washington      Maryland         24       043 24043     24     43  24043
##   SOUTH   HR60  HR70  HR80   HR90    HC60   HC70   HC80    HC90   PO60
## 0     1 1.6829 4.193 6.598 0.9461  0.6667  1.667  2.667  0.3333  39615
## 1     1 4.6072 3.285 3.214 1.2349  1.3333  1.000  1.000  0.3333  28940
## 2     1 0.9741 5.254 7.602 2.6210  0.6667  3.333  4.667  1.3333  68437
## 3     1 0.8762 4.433 4.006 4.4616  0.3333  1.667  1.667  1.6667  38041
## 4     1 4.2284 6.738 7.117 6.7127 13.0000 26.000 28.333 29.6667 307446
## 5     1 1.8271 2.568 2.063 1.6475  1.6667  2.667  2.333  2.0000  91219
##     PO70   PO80   PO90    RD60    RD70    RD80     RD90   PS60   PS70
## 0  39749  40418  35233 -1.3947 -1.3074 -1.1593 -0.39903 1.2187 1.1368
## 1  30443  31117  26992 -1.2071 -0.9980 -1.2495 -0.66360 0.9556 0.9218
## 2  63439  61389  50871 -0.4634 -0.3106 -0.1694  0.13631 1.5572 1.3991
## 3  37598  41608  37356 -0.9012 -0.8014 -1.0752 -0.05353 0.7396 0.6670
## 4 385856 398115 441946 -0.9854 -0.8245 -0.3015 -0.88762 2.2293 2.2551
## 5 103829 113086 121393 -0.6971 -0.7060 -0.6852 -0.73195 1.2741 1.2763
##     PS80   PS90 UE60 UE70   UE80  UE90  DV60  DV70  DV80  DV90 MA60 MA70
## 0 1.0386 0.8965  3.1  2.7  7.076 6.858 2.278 2.559 5.062 7.264 28.9 30.0
## 1 0.8275 0.6878  4.8  3.4  6.035 6.966 1.504 2.279 4.383 6.015 28.5 30.0
## 2 1.2679 1.0747  7.3  3.7  5.888 6.932 1.994 2.589 5.101 7.511 34.1 33.5
## 3 0.6361 0.5182 11.0  4.8 11.643 8.773 3.480 3.012 5.371 7.222 31.2 30.9
## 4 2.1905 2.1436  4.7  3.9  5.951 3.820 1.752 2.466 5.185 7.006 29.5 27.0
## 5 1.2306 1.2153  8.3  4.3  6.205 4.088 2.021 2.489 4.682 7.354 30.5 29.5
##   MA80 MA90 POL60 POL70 POL80 POL90 DNL60 DNL70 DNL80 DNL90 MFIL59 MFIL69
## 0 31.4 37.7 10.59 10.59 10.61 10.47 6.168 6.171 6.171 6.051  8.841  9.247
## 1 30.9 37.3 10.27 10.32 10.35 10.20 5.796 5.846 5.846 5.716  8.697  9.137
## 2 33.0 37.7 11.13 11.06 11.02 10.84 6.470 6.394 6.364 6.172  8.599  9.079
## 3 30.9 36.6 10.55 10.53 10.64 10.53 4.829 4.818 4.915 4.801  8.548  9.047
## 4 29.7 32.5 12.64 12.86 12.89 13.00 6.552 6.781 6.914 6.944  8.828  9.304
## 5 31.6 34.4 11.42 11.55 11.64 11.71 5.292 5.421 5.516 5.579  8.546  9.080
##   MFIL79 MFIL89 FP59 FP69  FP79   FP89   BLK60   BLK70   BLK80   BLK90
## 0 10.073  10.33  9.6  5.9 6.533 10.173  3.8395  3.2554  2.5607  2.5573
## 1 10.007  10.35 13.5  8.7 6.141  8.973  1.4167  0.8245  0.7970  0.7484
## 2  9.862  10.31 20.6 10.3 8.916 12.277  3.0524  3.1432  3.4632  3.3103
## 3  9.912  10.20 19.8 10.5 7.769 12.699  0.9463  0.6011  0.5936  0.5461
## 4 10.030  10.72 11.3  6.6 8.032  5.015 11.7221 12.6651 15.1002 16.4803
## 5  9.870  10.45 21.3  9.4 8.025  7.151  2.8338  3.2332  4.2021  5.9682
##     GI59   GI69   GI79   GI89   FH60 FH70   FH80  FH90
## 0 0.2236 0.2954 0.3323 0.3639  9.981  7.8  9.786 12.60
## 1 0.2204 0.3185 0.3142 0.3506 10.929  8.0 10.215 11.24
## 2 0.2724 0.3585 0.3770 0.3905 15.622 12.9 14.717 17.57
## 3 0.2276 0.3196 0.3210 0.3773 11.963  8.8  8.803 13.56
## 4 0.2561 0.3297 0.3658 0.3327 12.036 10.7 15.169 16.38
## 5 0.2573 0.3348 0.3439 0.3443 12.273  9.3 10.502 13.09

Since this is a spatial polygons data frame you can use the plot() method to create a map of the county borders.

plot(homicides)

[Figure: map of county borders]

You can generate a default choropleth map by using the spplot() method.

spplot(homicides, "RD90")

[Figure: choropleth map of resource deprivation (RD90)]

There is no projection information. Projection information can be added using the proj4string argument and the CRS() function within the readShapeSpatial() function. Here we use the Clarke 1866 ellipsoid.

homicides = readShapeSpatial("south", proj4string = CRS("+proj=longlat +ellps=clrk66"))
plot(homicides)

Practice

(1) Consider the airquality data frame containing daily air quality measurements in New York. A starting code sketch follows part e below.

a. Create a scatter plot matrix of the four variables (solar radiation, wind speed, temperature, and ozone). Which variable appears to have the strongest relationship to ozone?

b. Regress ozone on air temperature, wind speed, and solar radiation. What is the R squared value for this model? Is the model statistically significant?

c. Can the model be simplified by removing a variable? If so, which one?

d. Check the model assumptions of linearity, constant variance, and normality and comment on your findings.

e. Is collinearity a problem?
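Here is a starting sketch for this problem; the object names aq and modelA are illustrative, and lm() drops rows with missing ozone values by default:

aq = airquality[, c("Ozone", "Solar.R", "Wind", "Temp")]
pairs(aq)                                # scatter plot matrix (base graphics)
modelA = lm(Ozone ~ Temp + Wind + Solar.R, data = aq)
summary(modelA)
drop1(modelA)                            # can a variable be removed?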

(2) The file GeorgiaEduc.dbf is an ESRI data base file containing the percentage of Georgia county residents having a bachelor’s degree along with other countywide information as explanatory variables.

Download the data file from Blackboard and import it to R using the read.dbf() function from the maptools package.

require(maptools)
setwd("~/Desktop")
GE = read.dbf("GeorgiaEduc.dbf")
names(GE)

Let the response variable be PctBach (percentage of the population with bachelor's degrees) and the explanatory variables be TotPop90, PctRural, PctEld, PctFB (foreign born), PctPov, and PctBlack. Perform a stepwise regression to determine the final model.
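A starting sketch, assuming GeorgiaEduc.dbf has been read into GE as above (modelG is an illustrative name):

modelG = lm(PctBach ~ TotPop90 + PctRural + PctEld + PctFB + PctPov + PctBlack,
    data = GE)
step(modelG)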