Question 1
Let's try to create a regression model that can predict the percentage of college-educated residents (percollege) based on a number of factors about a neighborhood in the Midwest.
Create Dummy Variables
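The code that produced the dummy columns is not shown in this report. As a minimal sketch of how columns named like state.IL or inmetro.0 could be built in base R, the example below uses a small made-up data frame (`toy`) and a hypothetical helper (`make_dummies`). Both are assumptions for illustration and not the original code.

```r
# Toy stand-in for the real data; all values are made up for illustration.
toy <- data.frame(
  state = factor(c("IL", "IN", "MI", "OH", "WI", "IL")),
  inmetro = factor(c(0, 1, 1, 0, 0, 1)),
  percollege = c(19.6, 11.2, 17.0, 14.1, 20.3, 25.9)
)

# Hypothetical helper: one 0/1 column per factor level,
# named "<variable>.<level>".
make_dummies <- function(df, var) {
  lv <- levels(df[[var]])
  out <- sapply(lv, function(l) as.integer(df[[var]] == l))
  colnames(out) <- paste(var, lv, sep = ".")
  as.data.frame(out)
}

# Replace the two categorical columns by their dummy columns.
toy2 <- cbind(
  make_dummies(toy, "state"),
  make_dummies(toy, "inmetro"),
  toy["percollege"]
)

print(names(toy2))
```

Every row has exactly one 1 among the dummies of a category, so the full set of dummies for a category always sums to 1. This matters later, because it makes one dummy per category redundant once the model has an intercept.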
Generate Model
## Reordering variables and trying again:
## Subset selection object
## Call: regsubsets.formula(percollege ~ ., data = midwest2, method = "exhaustive",
## nvmax = NULL, nbest = 1, really.big = T)
## 20 Variables (and intercept)
## Forced in Forced out
## state.IL FALSE FALSE
## state.IN FALSE FALSE
## state.MI FALSE FALSE
## state.OH FALSE FALSE
## area FALSE FALSE
## percwhite FALSE FALSE
## percblack FALSE FALSE
## percamerindan FALSE FALSE
## percasian FALSE FALSE
## percother FALSE FALSE
## perchsd FALSE FALSE
## percprof FALSE FALSE
## percpovertyknown FALSE FALSE
## percbelowpoverty FALSE FALSE
## percchildbelowpovert FALSE FALSE
## percadultpoverty FALSE FALSE
## percelderlypoverty FALSE FALSE
## inmetro.0 FALSE FALSE
## state.WI FALSE FALSE
## inmetro.1 FALSE FALSE
## 1 subsets of each size up to 18
## Selection Algorithm: exhaustive
## state.IL state.IN state.MI state.OH state.WI area percwhite
## 1 ( 1 ) "*" " " " " " " " " " " " "
## 2 ( 1 ) "*" "*" " " " " " " " " " "
## 3 ( 1 ) "*" "*" "*" " " " " " " " "
## 4 ( 1 ) "*" "*" "*" "*" " " " " " "
## 5 ( 1 ) "*" "*" "*" "*" " " "*" " "
## 6 ( 1 ) "*" "*" "*" "*" " " "*" "*"
## 7 ( 1 ) "*" "*" "*" "*" " " "*" "*"
## 8 ( 1 ) "*" "*" "*" "*" " " "*" "*"
## 9 ( 1 ) "*" "*" "*" "*" " " "*" "*"
## 10 ( 1 ) "*" "*" "*" "*" " " "*" "*"
## 11 ( 1 ) "*" "*" "*" "*" " " "*" "*"
## 12 ( 1 ) "*" "*" "*" "*" " " "*" "*"
## 13 ( 1 ) "*" "*" "*" "*" " " "*" "*"
## 14 ( 1 ) "*" "*" "*" "*" " " "*" "*"
## 15 ( 1 ) "*" "*" "*" "*" " " "*" "*"
## 16 ( 1 ) "*" "*" "*" "*" " " "*" "*"
## 17 ( 1 ) "*" "*" "*" "*" " " "*" "*"
## 18 ( 1 ) "*" "*" "*" "*" " " "*" "*"
## percblack percamerindan percasian percother perchsd percprof
## 1 ( 1 ) " " " " " " " " " " " "
## 2 ( 1 ) " " " " " " " " " " " "
## 3 ( 1 ) " " " " " " " " " " " "
## 4 ( 1 ) " " " " " " " " " " " "
## 5 ( 1 ) " " " " " " " " " " " "
## 6 ( 1 ) " " " " " " " " " " " "
## 7 ( 1 ) "*" " " " " " " " " " "
## 8 ( 1 ) "*" "*" " " " " " " " "
## 9 ( 1 ) "*" "*" "*" " " " " " "
## 10 ( 1 ) "*" "*" "*" "*" " " " "
## 11 ( 1 ) "*" "*" "*" "*" "*" " "
## 12 ( 1 ) "*" "*" "*" "*" "*" "*"
## 13 ( 1 ) "*" "*" "*" "*" "*" "*"
## 14 ( 1 ) "*" "*" "*" "*" "*" "*"
## 15 ( 1 ) "*" "*" "*" "*" "*" "*"
## 16 ( 1 ) "*" "*" "*" "*" "*" "*"
## 17 ( 1 ) "*" "*" "*" "*" "*" "*"
## 18 ( 1 ) "*" "*" "*" "*" "*" "*"
## percpovertyknown percbelowpoverty percchildbelowpovert
## 1 ( 1 ) " " " " " "
## 2 ( 1 ) " " " " " "
## 3 ( 1 ) " " " " " "
## 4 ( 1 ) " " " " " "
## 5 ( 1 ) " " " " " "
## 6 ( 1 ) " " " " " "
## 7 ( 1 ) " " " " " "
## 8 ( 1 ) " " " " " "
## 9 ( 1 ) " " " " " "
## 10 ( 1 ) " " " " " "
## 11 ( 1 ) " " " " " "
## 12 ( 1 ) " " " " " "
## 13 ( 1 ) "*" " " " "
## 14 ( 1 ) "*" "*" " "
## 15 ( 1 ) "*" "*" "*"
## 16 ( 1 ) "*" "*" "*"
## 17 ( 1 ) "*" "*" "*"
## 18 ( 1 ) "*" "*" "*"
## percadultpoverty percelderlypoverty inmetro.0 inmetro.1
## 1 ( 1 ) " " " " " " " "
## 2 ( 1 ) " " " " " " " "
## 3 ( 1 ) " " " " " " " "
## 4 ( 1 ) " " " " " " " "
## 5 ( 1 ) " " " " " " " "
## 6 ( 1 ) " " " " " " " "
## 7 ( 1 ) " " " " " " " "
## 8 ( 1 ) " " " " " " " "
## 9 ( 1 ) " " " " " " " "
## 10 ( 1 ) " " " " " " " "
## 11 ( 1 ) " " " " " " " "
## 12 ( 1 ) " " " " " " " "
## 13 ( 1 ) " " " " " " " "
## 14 ( 1 ) " " " " " " " "
## 15 ( 1 ) " " " " " " " "
## 16 ( 1 ) "*" " " " " " "
## 17 ( 1 ) "*" "*" " " " "
## 18 ( 1 ) "*" "*" "*" " "
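To make the table above easier to read, here is a base-R sketch of what an exhaustive search with nbest = 1 reports: for each model size, the single subset with the lowest residual sum of squares. It uses the built-in mtcars data and four arbitrarily chosen predictors as stand-ins, and it fits every subset by brute force. It is an illustration of the idea only, not how regsubsets is implemented.

```r
# Brute-force best-subset search on built-in data (illustration only).
response <- "mpg"
predictors <- c("wt", "hp", "disp", "qsec")  # arbitrary stand-in predictors

best_by_size <- lapply(seq_along(predictors), function(k) {
  # All subsets of size k.
  subsets <- combn(predictors, k, simplify = FALSE)
  # Fit one linear model per subset.
  fits <- lapply(subsets, function(vars) {
    lm(reformulate(vars, response), data = mtcars)
  })
  # Within a size, the best subset is the one with the smallest RSS.
  rss <- sapply(fits, function(f) sum(resid(f)^2))
  best <- which.min(rss)
  data.frame(
    size = k,
    vars = paste(subsets[[best]], collapse = " + "),
    adjr2 = summary(fits[[best]])$adj.r.squared
  )
})
best_by_size <- do.call(rbind, best_by_size)

print(best_by_size)
```

Comparing adjusted R^2 across sizes is one way to decide where adding variables stops paying off. RSS alone cannot be used for that, since it never increases when a variable is added.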
##
## Call:
## lm(formula = percollege ~ . - state.WI - inmetro.0 - inmetro.1,
## data = midwest2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.5848 -0.9852 -0.0471 0.8787 6.1165
##
## Coefficients: (1 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -24.90566 11.20392 -2.223 0.026751 *
## state.IL -1.33901 0.27098 -4.941 1.12e-06 ***
## state.IN -5.46053 0.31923 -17.105 < 2e-16 ***
## state.MI -1.75952 0.27151 -6.481 2.56e-10 ***
## state.OH -2.83819 0.28494 -9.961 < 2e-16 ***
## area -12.02948 6.29205 -1.912 0.056575 .
## percwhite 0.06079 0.10428 0.583 0.560248
## percblack 0.11849 0.11029 1.074 0.283290
## percamerindan 0.12937 0.10679 1.211 0.226389
## percasian 0.26477 0.26645 0.994 0.320946
## percother NA NA NA NA
## perchsd 0.24683 0.02729 9.044 < 2e-16 ***
## percprof 2.11693 0.07401 28.603 < 2e-16 ***
## percpovertyknown 0.12903 0.03617 3.567 0.000402 ***
## percbelowpoverty -1.17554 0.34538 -3.404 0.000729 ***
## percchildbelowpovert 0.34523 0.08983 3.843 0.000140 ***
## percadultpoverty 0.46888 0.20294 2.311 0.021344 *
## percelderlypoverty 0.28925 0.08149 3.550 0.000429 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.576 on 420 degrees of freedom
## Multiple R-squared: 0.939, Adjusted R-squared: 0.9367
## F-statistic: 404 on 16 and 420 DF, p-value: < 2.2e-16
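The "not defined because of singularities" note in the summary above is what lm reports when one predictor is an exact linear combination of the others. The sketch below reproduces that behaviour on simulated data with a full set of dummy variables. The variable names (g.a, g.b, g.c) and the data are made up for illustration.

```r
set.seed(1)

# Three groups of ten observations and a full set of 0/1 dummies.
grp <- factor(rep(c("a", "b", "c"), each = 10))
d <- data.frame(
  y = rnorm(30),
  g.a = as.integer(grp == "a"),
  g.b = as.integer(grp == "b"),
  g.c = as.integer(grp == "c")
)

# g.a + g.b + g.c equals the intercept column, so one coefficient
# cannot be estimated and is reported as NA.
fit_all <- lm(y ~ ., data = d)
print(coef(fit_all))

# Dropping one dummy removes the redundancy without changing the fit.
fit_drop <- lm(y ~ . - g.c, data = d)
print(coef(fit_drop))
print(all.equal(fitted(fit_all), fitted(fit_drop)))
```

The same mechanism applies to any set of columns that always sums to a constant, which is why leaving out one dummy per category costs nothing in terms of fit.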
Evaluation
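The subset-selection output indicates that 3 variables (inmetro.0, inmetro.1, and state.WI) add little or nothing to the model: state.WI and inmetro.1 are not selected in any of the 18 subsets, and inmetro.0 enters only in the largest one. As such, those three variables were left out of the final model. With an R^2 of about 0.94 (adjusted 0.937), the model explains most of the variation in the percentage of college-educated residents; it describes areas, not whether an individual person is college educated. The coefficient for percother is reported as NA because of a singularity, most likely because it is a linear combination of the other race percentages. Analysis of the residuals does give the impression of clustering, with what appear to be 3 distinct clusters, which may mean that some group structure is not captured by the model.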
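The residual plots themselves are not included here. As a rough sketch of how residual clustering might be checked, the example below plots residuals against fitted values, colours the points by a grouping variable, and compares the mean residual per group. It uses the built-in mtcars data with cyl as a stand-in grouping variable. Which variable (if any) explains the clusters in the model above is not known from this report.

```r
# Stand-in model on built-in data (illustration only).
fit <- lm(mpg ~ wt + hp, data = mtcars)

res <- data.frame(
  fitted = fitted(fit),
  resid = resid(fit),
  group = factor(mtcars$cyl)  # stand-in grouping variable
)

# Residuals vs fitted values, one colour per group.
plot(res$fitted, res$resid,
     col = as.integer(res$group), pch = 19,
     xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)
legend("topright", legend = levels(res$group),
       col = seq_along(levels(res$group)), pch = 19, title = "Group")

# Group means far from zero hint at structure the model misses.
group_means <- tapply(res$resid, res$group, mean)
print(group_means)
```

If the clusters line up with a variable already in the data, adding it (or an interaction with it) to the model is a natural next step; if they do not, the clustering may point to a missing predictor.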