Lecture: Multiple Regression

Key Concepts covered in the lecture include:
1. Diagnostic Plots
2. Dummy Variables
3. Cofounding variables
4. Interaction Terms
5. Using ANOVA to compare a full and reduced model
6. Multicolinearity (Variance Inflation Factor)
7. Model Selection
8. Akaike Information Criterion (AIC)
9. Working with Heteroskedasticity


The in-class exercise is about using creating a multi-variate regression model.

Part 1: Use a model selection machine to find the best possible model to predict the 2004 Election Results

library(sp)
library(maptools)  #for reading shp files
## Loading required package: foreign
## Loading required package: grid
## Loading required package: lattice
## Checking rgeos availability: TRUE
library(MASS)  #to use AIC
library(leaps)  #to use regsubsets; install all required libraries

# Read the shapefile
shp <- readShapePoly("E:/Quant/inClassExercises/InClassExerciseData/2004_Election_Counties.shp")
names(shp)  #There are a lot of attributes.
##  [1] "NAME"       "STATE_NAME" "STATE_FIPS" "CNTY_FIPS"  "FIPS"      
##  [6] "AREA"       "FIPS_num"   "Bush"       "Kerry"      "County_F"  
## [11] "Nader"      "Total"      "Bush_pct"   "Kerry_pct"  "Nader_pct" 
## [16] "MDratio"    "hosp"       "pcthisp"    "pcturban"   "urbrural"  
## [21] "pctfemhh"   "pcincome"   "pctpoor"    "pctlt9ed"   "pcthsed"   
## [26] "pctcoled"   "unemploy"   "pctwhtcl"   "homevalu"   "rent"      
## [31] "popdens"    "crowded"    "ginirev"    "SmokecurM"  "SmokevrM"  
## [36] "SmokecurF"  "SmokevrF"   "Obese"      "Noins"      "XYLENES__M"
## [41] "TOLUENE"    "TETRACHLOR" "STYRENE"    "NICKEL_COM" "METHYLENE_"
## [46] "MERCURY_CO" "LEAD_COMPO" "BENZENE__I" "ARSENIC_CO" "POP2000"   
## [51] "POP00SQMIL" "MALE2000"   "FEMALE2000" "MAL2FEM"    "UNDER18"   
## [56] "AIAN"       "ASIA"       "BLACK"      "NHPI"       "WHITE"     
## [61] "AIAN_MORE"  "ASIA_MORE"  "BLK_MORE"   "NHPI_MORE"  "WHT_MORE"  
## [66] "HISP_LAT"   "CH19902000" "MEDAGE2000" "PEROVER65"

# Find the best fitting models that include many of the attributes
d <- regsubsets(Bush_pct ~ ., nbest = 1, nvmax = 20, data = shp[, c(13, 16:38)])
d.fit <- (summary(d))  #This creates a varaible that stores the results of each best-fitting model

# Plot the Adjusted R2 and BIC to find the 'best' best-fitting model.
# Where the curve levels off, we have reached an ideal model.
plot(1:20, y = d.fit$adjr2, type = "o", xlab = "Number of Parameters", ylab = "Adjusted R^2")

plot of chunk unnamed-chunk-1

plot(1:20, y = d.fit$bic, type = "o", xlab = "Number of Parameters", ylab = "BIC")  #BIC is very similar to the AIC.  lower values = better fitting models compared to other like-models.

plot of chunk unnamed-chunk-1


# Because of the plots, we can identify that the 10-variable appears to be
# the best.  But what are the variables? We can plot the regsubsets object
# another way that shows the variables.  The plot can be somewhat
# difficult to read though.
plot(d)  #shows the variables included in all of the calculated models

plot of chunk unnamed-chunk-1


# Because the previous plot is graphically interesting, but difficult to
# interpret, we can dig directly into the summary information to quickly
# find the variables included in the models.
d.fit$outmat[10, ]  #This shows all of the 20 variables used with an * below each variable included in the model specified.  Here, the 10-variable model.
##   MDratio      hosp   pcthisp  pcturban  urbrural  pctfemhh  pcincome 
##       " "       "*"       "*"       " "       " "       "*"       " " 
##   pctpoor  pctlt9ed   pcthsed  pctcoled  unemploy  pctwhtcl  homevalu 
##       "*"       " "       " "       "*"       "*"       " "       "*" 
##      rent   popdens   crowded   ginirev SmokecurM  SmokevrM SmokecurF 
##       " "       " "       "*"       "*"       " "       " "       " " 
##  SmokevrF     Obese 
##       "*"       " "

Part 2: Examine the “best” model.

# Create the best 10-variable model...
BFM <- lm(Bush_pct ~ hosp + pcthisp + pctfemhh + pctpoor + pctcoled + unemploy + 
    homevalu + crowded + ginirev + SmokevrF, data = shp)


# Do some diagnostics on the model.... Compute the Variance Inflcation
# Factor
library(car)
## Loading required package: nnet
vif(BFM)  #This examines multi-colinearity within the dataset.  Values over 10 indicate a red flag that multicolinearity are affecting the model results
##     hosp  pcthisp pctfemhh  pctpoor pctcoled unemploy homevalu  crowded 
##    1.145    1.797    2.191    4.357    2.192    1.733    2.537    3.024 
##  ginirev SmokevrF 
##    3.180    1.739
mean(vif(BFM))  #Average VIF values greater than 1 indicate some degree of multicolinearity.
## [1] 2.39

# Create 5 diagnostic plots... (I'm not sure the importance of some of
# these, what they might say about the model, or how to go about creating
# them.) Plot residuals agains the model variables

# Plot Residuals against fitted values
plot(predict(BFM), BFM$residuals)  # I don't know how to interpret this.

plot of chunk unnamed-chunk-2


# Plot residuals against omitted variables

# Check the residuals for normality
qqnorm(BFM$residuals)  #qq plot
qqline(BFM$residuals)

plot of chunk unnamed-chunk-2


shapiro.test(BFM$residuals)  #Shaprio-Wilks Test for Normality
## 
##  Shapiro-Wilk normality test
## 
## data:  BFM$residuals 
## W = 0.9883, p-value = 2.371e-15
# Based on the Shapiro Test, we can reject the null hypothesis that the
# residuals are normally distributed.  This creates a problem because it
# violates one of our assumptions of linear regression...