Key Concepts covered in the lecture include:
1. Diagnostic Plots
2. Dummy Variables
3. Confounding Variables
4. Interaction Terms
5. Using ANOVA to compare a full and reduced model
6. Multicollinearity (Variance Inflation Factor)
7. Model Selection
8. Akaike Information Criterion (AIC)
9. Working with Heteroskedasticity
The in-class exercise is about creating a multivariate regression model.
Part 1: Use a model selection method to find the best possible model to predict the 2004 Election Results
library(sp)
library(maptools) #for reading shp files
## Loading required package: foreign
## Loading required package: grid
## Loading required package: lattice
## Checking rgeos availability: TRUE
library(MASS) #for stepAIC (AIC-based stepwise selection)
library(leaps) #to use regsubsets; install all required libraries
# Read the shapefile
shp <- readShapePoly("E:/Quant/inClassExercises/InClassExerciseData/2004_Election_Counties.shp")
names(shp) #There are a lot of attributes.
## [1] "NAME" "STATE_NAME" "STATE_FIPS" "CNTY_FIPS" "FIPS"
## [6] "AREA" "FIPS_num" "Bush" "Kerry" "County_F"
## [11] "Nader" "Total" "Bush_pct" "Kerry_pct" "Nader_pct"
## [16] "MDratio" "hosp" "pcthisp" "pcturban" "urbrural"
## [21] "pctfemhh" "pcincome" "pctpoor" "pctlt9ed" "pcthsed"
## [26] "pctcoled" "unemploy" "pctwhtcl" "homevalu" "rent"
## [31] "popdens" "crowded" "ginirev" "SmokecurM" "SmokevrM"
## [36] "SmokecurF" "SmokevrF" "Obese" "Noins" "XYLENES__M"
## [41] "TOLUENE" "TETRACHLOR" "STYRENE" "NICKEL_COM" "METHYLENE_"
## [46] "MERCURY_CO" "LEAD_COMPO" "BENZENE__I" "ARSENIC_CO" "POP2000"
## [51] "POP00SQMIL" "MALE2000" "FEMALE2000" "MAL2FEM" "UNDER18"
## [56] "AIAN" "ASIA" "BLACK" "NHPI" "WHITE"
## [61] "AIAN_MORE" "ASIA_MORE" "BLK_MORE" "NHPI_MORE" "WHT_MORE"
## [66] "HISP_LAT" "CH19902000" "MEDAGE2000" "PEROVER65"
# Find the best-fitting model of each size (1 to 20 variables) from the candidate attributes
d <- regsubsets(Bush_pct ~ ., nbest = 1, nvmax = 20, data = shp[, c(13, 16:38)])
d.fit <- summary(d) #This creates a variable that stores the results of each best-fitting model
# Plot the Adjusted R^2 and BIC to find the 'best' best-fitting model.
# Where the curve levels off, we have reached an ideal model.
plot(1:20, y = d.fit$adjr2, type = "o", xlab = "Number of Parameters", ylab = "Adjusted R^2")
plot(1:20, y = d.fit$bic, type = "o", xlab = "Number of Parameters", ylab = "BIC") #BIC is similar to AIC; lower values indicate a better-fitting model when comparing candidate models.
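# To confirm the read from the plots numerically, we can query the summary
# object directly (which.max/which.min are base R):
which.max(d.fit$adjr2) #model size with the highest Adjusted R^2
which.min(d.fit$bic) #model size with the lowest BIC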
# From the plots, the 10-variable model appears to be the best. But which
# variables does it include? We can plot the regsubsets object another
# way that shows the variables, though the plot can be somewhat difficult
# to read.
plot(d) #shows the variables included in all of the calculated models
# Because the previous plot is graphically interesting but difficult to
# interpret, we can dig directly into the summary information to quickly
# find the variables included in the models.
d.fit$outmat[10, ] #This shows all of the candidate variables, with an "*" under each variable included in the specified model (here, the 10-variable model).
## MDratio hosp pcthisp pcturban urbrural pctfemhh pcincome
## " " "*" "*" " " " " "*" " "
## pctpoor pctlt9ed pcthsed pctcoled unemploy pctwhtcl homevalu
## "*" " " " " "*" "*" " " "*"
## rent popdens crowded ginirev SmokecurM SmokevrM SmokecurF
## " " " " "*" "*" " " " " " "
## SmokevrF Obese
## "*" " "
Part 2: Examine the “best” model.
# Create the best 10-variable model...
BFM <- lm(Bush_pct ~ hosp + pcthisp + pctfemhh + pctpoor + pctcoled + unemploy +
homevalu + crowded + ginirev + SmokevrF, data = shp)
# Do some diagnostics on the model... Compute the Variance Inflation
# Factor (VIF)
library(car)
## Loading required package: nnet
vif(BFM) #This examines multicollinearity among the predictors. Values over 10 are a red flag that multicollinearity is affecting the model results.
## hosp pcthisp pctfemhh pctpoor pctcoled unemploy homevalu crowded
## 1.145 1.797 2.191 4.357 2.192 1.733 2.537 3.024
## ginirev SmokevrF
## 3.180 1.739
mean(vif(BFM)) #An average VIF substantially greater than 1 indicates some degree of multicollinearity.
## [1] 2.39
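# As a sanity check, a VIF can be computed by hand: regress one predictor
# on the other nine and take 1 / (1 - R^2). Shown here for pctpoor; the
# result should match vif(BFM)["pctpoor"]:
aux <- lm(pctpoor ~ hosp + pcthisp + pctfemhh + pctcoled + unemploy +
    homevalu + crowded + ginirev + SmokevrF, data = shp)
1/(1 - summary(aux)$r.squared)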
# Create the diagnostic plots: residuals against the model variables,
# residuals against the fitted values, residuals against omitted
# variables, and a check of the residuals for normality.
# Plot residuals against fitted values
plot(predict(BFM), BFM$residuals) #A random scatter around zero is good; a funnel shape suggests heteroskedasticity, and a curved pattern suggests a misspecified model.
# Plot residuals against omitted variables
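# For example, pcturban was omitted from the 10-variable model; a visible
# trend in this plot would suggest it still carries information about
# Bush_pct and should not have been dropped. (A sketch: any omitted
# attribute could be substituted, and it assumes no rows were dropped
# for missing values, so the two vectors are the same length.)
plot(shp$pcturban, BFM$residuals, xlab = "pcturban (omitted)", ylab = "Residuals")
abline(h = 0, lty = 2)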
# Check the residuals for normality
qqnorm(BFM$residuals) #qq plot
qqline(BFM$residuals)
shapiro.test(BFM$residuals) #Shapiro-Wilk test for normality
##
## Shapiro-Wilk normality test
##
## data: BFM$residuals
## W = 0.9883, p-value = 2.371e-15
# Based on the Shapiro-Wilk test (p-value far below 0.05), we reject the
# null hypothesis that the residuals are normally distributed. This
# creates a problem because it violates one of our assumptions of linear
# regression...
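# When non-normal residuals come with non-constant variance, one common
# way of working with heteroskedasticity (a sketch; the sandwich and
# lmtest packages are not loaded above and would need to be installed)
# is to keep the OLS point estimates but report heteroskedasticity-
# consistent (robust) standard errors:
library(sandwich)
library(lmtest)
coeftest(BFM, vcov = vcovHC(BFM, type = "HC1")) #robust t-tests for the coefficients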