Use various methods (forward, backward, seqrep) in regsubsets to identify risk factors associated with giving birth to an underweight baby (weighing less than 2500 grams). Compare the results for the different methods. Use “lowbwt.xls” file under assignment 5 section on the blackboard.
str(lowbwt)
## Classes 'tbl_df', 'tbl' and 'data.frame': 189 obs. of 10 variables:
## $ LOW : int 0 0 0 0 0 0 0 0 0 0 ...
## $ AGE : int 19 33 20 21 18 21 22 17 29 26 ...
## $ LWT : int 182 155 105 108 107 124 118 103 123 113 ...
## $ RACE : int 2 3 1 1 1 3 1 3 1 1 ...
## $ SMOKE: int 0 0 1 1 1 0 0 0 1 1 ...
## $ PTL : int 0 0 0 0 0 0 0 0 0 0 ...
## $ HT : int 0 0 0 0 0 0 0 0 0 0 ...
## $ UI : int 1 0 0 1 1 0 0 0 0 0 ...
## $ FTV : int 0 3 1 2 0 0 1 1 1 0 ...
## $ BWT : int 2523 2551 2557 2594 2600 2622 2637 2637 2663 2665 ...
## - attr(*, "spec")=List of 2
## ..$ cols :List of 11
## .. ..$ ID : list()
## .. .. ..- attr(*, "class")= chr "collector_integer" "collector"
## .. ..$ LOW : list()
## .. .. ..- attr(*, "class")= chr "collector_integer" "collector"
## .. ..$ AGE : list()
## .. .. ..- attr(*, "class")= chr "collector_integer" "collector"
## .. ..$ LWT : list()
## .. .. ..- attr(*, "class")= chr "collector_integer" "collector"
## .. ..$ RACE : list()
## .. .. ..- attr(*, "class")= chr "collector_integer" "collector"
## .. ..$ SMOKE: list()
## .. .. ..- attr(*, "class")= chr "collector_integer" "collector"
## .. ..$ PTL : list()
## .. .. ..- attr(*, "class")= chr "collector_integer" "collector"
## .. ..$ HT : list()
## .. .. ..- attr(*, "class")= chr "collector_integer" "collector"
## .. ..$ UI : list()
## .. .. ..- attr(*, "class")= chr "collector_integer" "collector"
## .. ..$ FTV : list()
## .. .. ..- attr(*, "class")= chr "collector_integer" "collector"
## .. ..$ BWT : list()
## .. .. ..- attr(*, "class")= chr "collector_integer" "collector"
## ..$ default: list()
## .. ..- attr(*, "class")= chr "collector_guess" "collector"
## ..- attr(*, "class")= chr "col_spec"
#initial desc stats
stat.desc(lowbwt)
## LOW AGE LWT RACE
## nbr.val 189.00000000 189.0000000 1.890000e+02 189.00000000
## nbr.null 130.00000000 0.0000000 0.000000e+00 0.00000000
## nbr.na 0.00000000 0.0000000 0.000000e+00 0.00000000
## min 0.00000000 14.0000000 8.000000e+01 1.00000000
## max 1.00000000 45.0000000 2.500000e+02 3.00000000
## range 1.00000000 31.0000000 1.700000e+02 2.00000000
## sum 59.00000000 4392.0000000 2.453500e+04 349.00000000
## median 0.00000000 23.0000000 1.210000e+02 1.00000000
## mean 0.31216931 23.2380952 1.298148e+02 1.84656085
## SE.mean 0.03379535 0.3854221 2.224323e+00 0.06679957
## CI.mean.0.95 0.06666683 0.7603078 4.387838e+00 0.13177302
## var 0.21586176 28.0759878 9.350985e+02 0.84335247
## std.dev 0.46460925 5.2986779 3.057938e+01 0.91834224
## coef.var 1.48832456 0.2280169 2.355616e-01 0.49732574
## SMOKE PTL HT UI
## nbr.val 189.00000000 189.00000000 189.00000000 189.00000000
## nbr.null 115.00000000 159.00000000 177.00000000 161.00000000
## nbr.na 0.00000000 0.00000000 0.00000000 0.00000000
## min 0.00000000 0.00000000 0.00000000 0.00000000
## max 1.00000000 3.00000000 1.00000000 1.00000000
## range 1.00000000 3.00000000 1.00000000 1.00000000
## sum 74.00000000 37.00000000 12.00000000 28.00000000
## median 0.00000000 0.00000000 0.00000000 0.00000000
## mean 0.39153439 0.19576720 0.06349206 0.14814815
## SE.mean 0.03559787 0.03588534 0.01778429 0.02590903
## CI.mean.0.95 0.07022260 0.07078968 0.03508241 0.05110979
## var 0.23950242 0.24338624 0.05977710 0.12687155
## std.dev 0.48938984 0.49334191 0.24449356 0.35619033
## coef.var 1.24992812 2.52004383 3.85077362 2.40428474
## FTV BWT
## nbr.val 189.00000000 1.890000e+02
## nbr.null 100.00000000 0.000000e+00
## nbr.na 0.00000000 0.000000e+00
## min 0.00000000 7.090000e+02
## max 6.00000000 4.990000e+03
## range 6.00000000 4.281000e+03
## sum 150.00000000 5.565400e+05
## median 0.00000000 2.977000e+03
## mean 0.79365079 2.944656e+03
## SE.mean 0.07705173 5.302858e+01
## CI.mean.0.95 0.15199707 1.046075e+02
## var 1.12208713 5.314737e+05
## std.dev 1.05928614 7.290224e+02
## coef.var 1.33470054 2.475747e-01
#full model of low birth weights less than 2500
x1 <- glm(BWT~.,data = subset(lowbwt,BWT<2500))
summary(x1)
##
## Call:
## glm(formula = BWT ~ ., data = subset(lowbwt, BWT < 2500))
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -927.90 -210.25 21.96 223.82 677.84
##
## Coefficients: (1 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3146.5070 369.7457 8.510 2.73e-11 ***
## LOW NA NA NA NA
## AGE -39.0230 10.1215 -3.855 0.000331 ***
## LWT -0.5016 1.8177 -0.276 0.783716
## RACE -31.1115 60.7274 -0.512 0.610686
## SMOKE 95.7624 108.0855 0.886 0.379866
## PTL -6.7080 83.9221 -0.080 0.936611
## HT -199.7388 145.3218 -1.374 0.175429
## UI -352.4896 105.3134 -3.347 0.001557 **
## FTV 9.2827 43.8505 0.212 0.833209
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 109333.3)
##
## Null deviance: 8861515 on 58 degrees of freedom
## Residual deviance: 5466666 on 50 degrees of freedom
## AIC: 862.2
##
## Number of Fisher Scoring iterations: 2
#This is essentially best fit
reg.best <- regsubsets(BWT~., data = subset(lowbwt,BWT<2500), nvmax = 8)
## Reordering variables and trying again:
#The plot will show the Adjust R^2
plot(reg.best, scale = "adjr2", main = "Adjusted R^2")
summary(reg.best)
## Subset selection object
## Call: regsubsets.formula(BWT ~ ., data = subset(lowbwt, BWT < 2500),
## nvmax = 8)
## 9 Variables (and intercept)
## Forced in Forced out
## AGE FALSE FALSE
## LWT FALSE FALSE
## RACE FALSE FALSE
## SMOKE FALSE FALSE
## PTL FALSE FALSE
## HT FALSE FALSE
## UI FALSE FALSE
## FTV FALSE FALSE
## LOW FALSE FALSE
## 1 subsets of each size up to 8
## Selection Algorithm: exhaustive
## LOW AGE LWT RACE SMOKE PTL HT UI FTV
## 1 ( 1 ) " " "*" " " " " " " " " " " " " " "
## 2 ( 1 ) " " "*" " " " " " " " " " " "*" " "
## 3 ( 1 ) " " "*" " " " " " " " " "*" "*" " "
## 4 ( 1 ) " " "*" " " " " "*" " " "*" "*" " "
## 5 ( 1 ) " " "*" " " "*" "*" " " "*" "*" " "
## 6 ( 1 ) " " "*" "*" "*" "*" " " "*" "*" " "
## 7 ( 1 ) " " "*" "*" "*" "*" " " "*" "*" "*"
## 8 ( 1 ) " " "*" "*" "*" "*" "*" "*" "*" "*"
The best fit model shows that LWT, PTL, UI, and FTV are included, since we take the best Adjusted R^2; 0.33.
#Backward Selection to see if our results are any different
reg.back <- regsubsets(BWT~., data = subset(lowbwt,BWT<2500), method = "backward", nvmax = 8)
## Reordering variables and trying again:
#The plot will show the Adjust R^2
plot(reg.back, scale = "adjr2", main = "Adjusted R^2")
summary(reg.back)
## Subset selection object
## Call: regsubsets.formula(BWT ~ ., data = subset(lowbwt, BWT < 2500),
## method = "backward", nvmax = 8)
## 9 Variables (and intercept)
## Forced in Forced out
## AGE FALSE FALSE
## LWT FALSE FALSE
## RACE FALSE FALSE
## SMOKE FALSE FALSE
## PTL FALSE FALSE
## HT FALSE FALSE
## UI FALSE FALSE
## FTV FALSE FALSE
## LOW FALSE FALSE
## 1 subsets of each size up to 8
## Selection Algorithm: backward
## LOW AGE LWT RACE SMOKE PTL HT UI FTV
## 1 ( 1 ) " " "*" " " " " " " " " " " " " " "
## 2 ( 1 ) " " "*" " " " " " " " " " " "*" " "
## 3 ( 1 ) " " "*" " " " " " " " " "*" "*" " "
## 4 ( 1 ) " " "*" " " " " "*" " " "*" "*" " "
## 5 ( 1 ) " " "*" " " "*" "*" " " "*" "*" " "
## 6 ( 1 ) " " "*" "*" "*" "*" " " "*" "*" " "
## 7 ( 1 ) " " "*" "*" "*" "*" " " "*" "*" "*"
## 8 ( 1 ) " " "*" "*" "*" "*" "*" "*" "*" "*"
#Forward selection
reg.forward <- regsubsets(BWT~., data = subset(lowbwt,BWT<2500), nvmax = 6, nbest = 1, method = "forward")
## Reordering variables and trying again:
summary(reg.forward)
## Subset selection object
## Call: regsubsets.formula(BWT ~ ., data = subset(lowbwt, BWT < 2500),
## nvmax = 6, nbest = 1, method = "forward")
## 9 Variables (and intercept)
## Forced in Forced out
## AGE FALSE FALSE
## LWT FALSE FALSE
## RACE FALSE FALSE
## SMOKE FALSE FALSE
## PTL FALSE FALSE
## HT FALSE FALSE
## UI FALSE FALSE
## FTV FALSE FALSE
## LOW FALSE FALSE
## 1 subsets of each size up to 7
## Selection Algorithm: forward
## LOW AGE LWT RACE SMOKE PTL HT UI FTV
## 1 ( 1 ) " " "*" " " " " " " " " " " " " " "
## 2 ( 1 ) " " "*" " " " " " " " " " " "*" " "
## 3 ( 1 ) " " "*" " " " " " " " " "*" "*" " "
## 4 ( 1 ) " " "*" " " " " "*" " " "*" "*" " "
## 5 ( 1 ) " " "*" " " "*" "*" " " "*" "*" " "
## 6 ( 1 ) " " "*" "*" "*" "*" " " "*" "*" " "
## 7 ( 1 ) " " "*" "*" "*" "*" " " "*" "*" "*"
#Trying with even less variables though
plot(reg.forward, scale = "adjr2", main = "Adjusted R^2")
reg.forward
## Subset selection object
## Call: regsubsets.formula(BWT ~ ., data = subset(lowbwt, BWT < 2500),
## nvmax = 6, nbest = 1, method = "forward")
## 9 Variables (and intercept)
## Forced in Forced out
## AGE FALSE FALSE
## LWT FALSE FALSE
## RACE FALSE FALSE
## SMOKE FALSE FALSE
## PTL FALSE FALSE
## HT FALSE FALSE
## UI FALSE FALSE
## FTV FALSE FALSE
## LOW FALSE FALSE
## 1 subsets of each size up to 7
## Selection Algorithm: forward
After using backward and forward step-wise, LWT, PTL, UI, and FTV are still listed as the best Coefficients to use.
#Lastly we can do the hybrid approach
reg.seqrep <- regsubsets(BWT~., data = subset(lowbwt,BWT<2500),nvmax = 6, nbest = 1, method = "seqrep")
## Reordering variables and trying again:
plot(reg.seqrep, scale = "adjr2", main = "Adjusted R^2")
#Here we want the hieghest Adjusted R^2 which looks to be .83, then looking across the .83 row wherever we see a black box is a recommended variable to use.
#Looks like pretty much the same variables, we can also do Cp
plot(reg.seqrep, scale = "Cp", main = "Cp")
#Here it looks like we are again selecting pretty much the same variables, by using the lowest Cp except "PTL" is now excluded
summary(reg.seqrep)
## Subset selection object
## Call: regsubsets.formula(BWT ~ ., data = subset(lowbwt, BWT < 2500),
## nvmax = 6, nbest = 1, method = "seqrep")
## 9 Variables (and intercept)
## Forced in Forced out
## AGE FALSE FALSE
## LWT FALSE FALSE
## RACE FALSE FALSE
## SMOKE FALSE FALSE
## PTL FALSE FALSE
## HT FALSE FALSE
## UI FALSE FALSE
## FTV FALSE FALSE
## LOW FALSE FALSE
## 1 subsets of each size up to 7
## Selection Algorithm: 'sequential replacement'
## LOW AGE LWT RACE SMOKE PTL HT UI FTV
## 1 ( 1 ) " " "*" " " " " " " " " " " " " " "
## 2 ( 1 ) " " "*" " " " " " " " " " " "*" " "
## 3 ( 1 ) " " "*" " " " " " " " " "*" "*" " "
## 4 ( 1 ) " " "*" " " " " "*" " " "*" "*" " "
## 5 ( 1 ) " " "*" " " "*" "*" " " "*" "*" " "
## 6 ( 1 ) " " "*" "*" "*" "*" "*" "*" " " " "
## 7 ( 1 ) " " "*" "*" "*" "*" " " "*" "*" "*"
Based off of our Adjusted R^2 plots while looking at Forward and Backward Step-Wise, the variables best to use that impact low birth weights are: LWT, PTL, UI, and FTV. However, after doing the hybyrid approach with Cp, it was recommended to take “PTL” out.
Define test sets and training sets using “lowbwt.xls” to identify risk factors associated with giving birth to an underweight baby (weighing less than 2500 grams). Do PCR on the training set and then use cross-validation. Go through ALL the relevant steps : model selection, validation, interpretation, etc.
Use “lowbwt.xls” file under assignment 5 section on the blackboard.
set.seed (1000)
# We want to scale our data and "CV" stands for cross validation, which is defaulted to RMSE
pcr_model <- pcr(LOW~., data = lowbwt, scale = TRUE, validation = "CV")
summary(pcr_model)
## Data: X dimension: 189 9
## Y dimension: 189 1
## Fit method: svdpc
## Number of components considered: 9
##
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
## (Intercept) 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps
## CV 0.4658 0.4007 0.3971 0.3879 0.3645 0.3546 0.3473
## adjCV 0.4658 0.4001 0.3970 0.3863 0.3640 0.3549 0.3462
## 7 comps 8 comps 9 comps
## CV 0.3354 0.3323 0.2981
## adjCV 0.3339 0.3317 0.2972
##
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps
## X 19.99 36.09 49.62 61.97 71.73 80.62 88.52
## LOW 26.33 27.82 36.81 41.81 47.47 50.39 52.63
## 8 comps 9 comps
## X 95.04 100.00
## LOW 53.79 63.35
# Check and see where are mean square error prediction is the lowest
validationplot(pcr_model ,val.type="MSEP")
validationplot(pcr_model, val.type = "R2")
Ideally, we would like to see a low cross validation error with a lower number of components than the number of variables in the dataset. The mean square error prediction is the lowest when there are 8 or more components included.
The R^2 is also the highest when there are 8 or more components included, but can only explain about 90% of the variability in the data. Which is surprising to me, because I would have hoped that having less than the full set of coefficients would make a nice prediction model. Knowing this we can create a prediction model and work from there. Since R^2 is the highest when there are 8 or more components, starting with 9 might be the best bet.
# Now let's train and test our model
train <- lowbwt[1:126, ]
y_test <- lowbwt[127:189, 1]
test <- lowbwt[127:189,-1]
pcr_pred <- predict(pcr_model, test, ncomp = 9)
df_predict <- data.frame(pcr_pred,y_test)
summary(df_predict)
## LOW.9.comps LOW
## Min. :-0.8156 Min. :0.0000
## 1st Qu.: 0.5905 1st Qu.:1.0000
## Median : 0.6760 Median :1.0000
## Mean : 0.6702 Mean :0.9365
## 3rd Qu.: 0.8270 3rd Qu.:1.0000
## Max. : 1.3639 Max. :1.0000
mse <- mean((df_predict$LOW.9.comps-df_predict$LOW)^2)
rmse <- sqrt(mse)
rmse
## [1] 0.3310249
predplot(pcr_model)
By looking at the PCR prediction numbers and using them to get the RMSE of the test data the result is 0.33 which was similar to our earlier models. I think it’s safe to say that using less than 8 wouldn’t be very accurate, so based off of PCR, a good model would require 9 components.