DATS 6101: Assignment 5

Problem A - Subsetting

Use various methods (forward, backward, seqrep) in regsubsets to identify risk factors associated with giving birth to an underweight baby (weighing less than 2500 grams). Compare the results for the different methods. Use “lowbwt.xls” file under assignment 5 section on the blackboard.

str(lowbwt)

## Classes 'tbl_df', 'tbl' and 'data.frame':    189 obs. of  10 variables:
##  $ LOW  : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ AGE  : int  19 33 20 21 18 21 22 17 29 26 ...
##  $ LWT  : int  182 155 105 108 107 124 118 103 123 113 ...
##  $ RACE : int  2 3 1 1 1 3 1 3 1 1 ...
##  $ SMOKE: int  0 0 1 1 1 0 0 0 1 1 ...
##  $ PTL  : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ HT   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ UI   : int  1 0 0 1 1 0 0 0 0 0 ...
##  $ FTV  : int  0 3 1 2 0 0 1 1 1 0 ...
##  $ BWT  : int  2523 2551 2557 2594 2600 2622 2637 2637 2663 2665 ...
##  - attr(*, "spec")=List of 2
##   ..$ cols   :List of 11
##   .. ..$ ID   : list()
##   .. .. ..- attr(*, "class")= chr  "collector_integer" "collector"
##   .. ..$ LOW  : list()
##   .. .. ..- attr(*, "class")= chr  "collector_integer" "collector"
##   .. ..$ AGE  : list()
##   .. .. ..- attr(*, "class")= chr  "collector_integer" "collector"
##   .. ..$ LWT  : list()
##   .. .. ..- attr(*, "class")= chr  "collector_integer" "collector"
##   .. ..$ RACE : list()
##   .. .. ..- attr(*, "class")= chr  "collector_integer" "collector"
##   .. ..$ SMOKE: list()
##   .. .. ..- attr(*, "class")= chr  "collector_integer" "collector"
##   .. ..$ PTL  : list()
##   .. .. ..- attr(*, "class")= chr  "collector_integer" "collector"
##   .. ..$ HT   : list()
##   .. .. ..- attr(*, "class")= chr  "collector_integer" "collector"
##   .. ..$ UI   : list()
##   .. .. ..- attr(*, "class")= chr  "collector_integer" "collector"
##   .. ..$ FTV  : list()
##   .. .. ..- attr(*, "class")= chr  "collector_integer" "collector"
##   .. ..$ BWT  : list()
##   .. .. ..- attr(*, "class")= chr  "collector_integer" "collector"
##   ..$ default: list()
##   .. ..- attr(*, "class")= chr  "collector_guess" "collector"
##   ..- attr(*, "class")= chr "col_spec"

#initial desc stats
stat.desc(lowbwt)

##                       LOW          AGE          LWT         RACE
## nbr.val      189.00000000  189.0000000 1.890000e+02 189.00000000
## nbr.null     130.00000000    0.0000000 0.000000e+00   0.00000000
## nbr.na         0.00000000    0.0000000 0.000000e+00   0.00000000
## min            0.00000000   14.0000000 8.000000e+01   1.00000000
## max            1.00000000   45.0000000 2.500000e+02   3.00000000
## range          1.00000000   31.0000000 1.700000e+02   2.00000000
## sum           59.00000000 4392.0000000 2.453500e+04 349.00000000
## median         0.00000000   23.0000000 1.210000e+02   1.00000000
## mean           0.31216931   23.2380952 1.298148e+02   1.84656085
## SE.mean        0.03379535    0.3854221 2.224323e+00   0.06679957
## CI.mean.0.95   0.06666683    0.7603078 4.387838e+00   0.13177302
## var            0.21586176   28.0759878 9.350985e+02   0.84335247
## std.dev        0.46460925    5.2986779 3.057938e+01   0.91834224
## coef.var       1.48832456    0.2280169 2.355616e-01   0.49732574
##                     SMOKE          PTL           HT           UI
## nbr.val      189.00000000 189.00000000 189.00000000 189.00000000
## nbr.null     115.00000000 159.00000000 177.00000000 161.00000000
## nbr.na         0.00000000   0.00000000   0.00000000   0.00000000
## min            0.00000000   0.00000000   0.00000000   0.00000000
## max            1.00000000   3.00000000   1.00000000   1.00000000
## range          1.00000000   3.00000000   1.00000000   1.00000000
## sum           74.00000000  37.00000000  12.00000000  28.00000000
## median         0.00000000   0.00000000   0.00000000   0.00000000
## mean           0.39153439   0.19576720   0.06349206   0.14814815
## SE.mean        0.03559787   0.03588534   0.01778429   0.02590903
## CI.mean.0.95   0.07022260   0.07078968   0.03508241   0.05110979
## var            0.23950242   0.24338624   0.05977710   0.12687155
## std.dev        0.48938984   0.49334191   0.24449356   0.35619033
## coef.var       1.24992812   2.52004383   3.85077362   2.40428474
##                       FTV          BWT
## nbr.val      189.00000000 1.890000e+02
## nbr.null     100.00000000 0.000000e+00
## nbr.na         0.00000000 0.000000e+00
## min            0.00000000 7.090000e+02
## max            6.00000000 4.990000e+03
## range          6.00000000 4.281000e+03
## sum          150.00000000 5.565400e+05
## median         0.00000000 2.977000e+03
## mean           0.79365079 2.944656e+03
## SE.mean        0.07705173 5.302858e+01
## CI.mean.0.95   0.15199707 1.046075e+02
## var            1.12208713 5.314737e+05
## std.dev        1.05928614 7.290224e+02
## coef.var       1.33470054 2.475747e-01

#full model of low birth weights less than 2500
x1 <- glm(BWT~.,data = subset(lowbwt,BWT<2500))
summary(x1)

## 
## Call:
## glm(formula = BWT ~ ., data = subset(lowbwt, BWT < 2500))
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -927.90  -210.25    21.96   223.82   677.84  
## 
## Coefficients: (1 not defined because of singularities)
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 3146.5070   369.7457   8.510 2.73e-11 ***
## LOW                NA         NA      NA       NA    
## AGE          -39.0230    10.1215  -3.855 0.000331 ***
## LWT           -0.5016     1.8177  -0.276 0.783716    
## RACE         -31.1115    60.7274  -0.512 0.610686    
## SMOKE         95.7624   108.0855   0.886 0.379866    
## PTL           -6.7080    83.9221  -0.080 0.936611    
## HT          -199.7388   145.3218  -1.374 0.175429    
## UI          -352.4896   105.3134  -3.347 0.001557 ** 
## FTV            9.2827    43.8505   0.212 0.833209    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 109333.3)
## 
##     Null deviance: 8861515  on 58  degrees of freedom
## Residual deviance: 5466666  on 50  degrees of freedom
## AIC: 862.2
## 
## Number of Fisher Scoring iterations: 2

#This is essentially best fit 
reg.best <- regsubsets(BWT~., data = subset(lowbwt,BWT<2500), nvmax = 8)

## Reordering variables and trying again:

#The plot will show the Adjust R^2
plot(reg.best, scale = "adjr2", main = "Adjusted R^2")

summary(reg.best)

## Subset selection object
## Call: regsubsets.formula(BWT ~ ., data = subset(lowbwt, BWT < 2500), 
##     nvmax = 8)
## 9 Variables  (and intercept)
##       Forced in Forced out
## AGE       FALSE      FALSE
## LWT       FALSE      FALSE
## RACE      FALSE      FALSE
## SMOKE     FALSE      FALSE
## PTL       FALSE      FALSE
## HT        FALSE      FALSE
## UI        FALSE      FALSE
## FTV       FALSE      FALSE
## LOW       FALSE      FALSE
## 1 subsets of each size up to 8
## Selection Algorithm: exhaustive
##          LOW AGE LWT RACE SMOKE PTL HT  UI  FTV
## 1  ( 1 ) " " "*" " " " "  " "   " " " " " " " "
## 2  ( 1 ) " " "*" " " " "  " "   " " " " "*" " "
## 3  ( 1 ) " " "*" " " " "  " "   " " "*" "*" " "
## 4  ( 1 ) " " "*" " " " "  "*"   " " "*" "*" " "
## 5  ( 1 ) " " "*" " " "*"  "*"   " " "*" "*" " "
## 6  ( 1 ) " " "*" "*" "*"  "*"   " " "*" "*" " "
## 7  ( 1 ) " " "*" "*" "*"  "*"   " " "*" "*" "*"
## 8  ( 1 ) " " "*" "*" "*"  "*"   "*" "*" "*" "*"

The best fit model shows that LWT, PTL, UI, and FTV are included, since we take the best Adjusted R^2; 0.33.

#Backward Selection to see if our results are any different
reg.back <- regsubsets(BWT~., data = subset(lowbwt,BWT<2500), method = "backward", nvmax = 8)

## Reordering variables and trying again:

#The plot will show the Adjust R^2
plot(reg.back, scale = "adjr2", main = "Adjusted R^2")

summary(reg.back)

## Subset selection object
## Call: regsubsets.formula(BWT ~ ., data = subset(lowbwt, BWT < 2500), 
##     method = "backward", nvmax = 8)
## 9 Variables  (and intercept)
##       Forced in Forced out
## AGE       FALSE      FALSE
## LWT       FALSE      FALSE
## RACE      FALSE      FALSE
## SMOKE     FALSE      FALSE
## PTL       FALSE      FALSE
## HT        FALSE      FALSE
## UI        FALSE      FALSE
## FTV       FALSE      FALSE
## LOW       FALSE      FALSE
## 1 subsets of each size up to 8
## Selection Algorithm: backward
##          LOW AGE LWT RACE SMOKE PTL HT  UI  FTV
## 1  ( 1 ) " " "*" " " " "  " "   " " " " " " " "
## 2  ( 1 ) " " "*" " " " "  " "   " " " " "*" " "
## 3  ( 1 ) " " "*" " " " "  " "   " " "*" "*" " "
## 4  ( 1 ) " " "*" " " " "  "*"   " " "*" "*" " "
## 5  ( 1 ) " " "*" " " "*"  "*"   " " "*" "*" " "
## 6  ( 1 ) " " "*" "*" "*"  "*"   " " "*" "*" " "
## 7  ( 1 ) " " "*" "*" "*"  "*"   " " "*" "*" "*"
## 8  ( 1 ) " " "*" "*" "*"  "*"   "*" "*" "*" "*"

#Forward selection
reg.forward <- regsubsets(BWT~., data = subset(lowbwt,BWT<2500), nvmax = 6, nbest = 1, method = "forward")

## Reordering variables and trying again:

summary(reg.forward)

## Subset selection object
## Call: regsubsets.formula(BWT ~ ., data = subset(lowbwt, BWT < 2500), 
##     nvmax = 6, nbest = 1, method = "forward")
## 9 Variables  (and intercept)
##       Forced in Forced out
## AGE       FALSE      FALSE
## LWT       FALSE      FALSE
## RACE      FALSE      FALSE
## SMOKE     FALSE      FALSE
## PTL       FALSE      FALSE
## HT        FALSE      FALSE
## UI        FALSE      FALSE
## FTV       FALSE      FALSE
## LOW       FALSE      FALSE
## 1 subsets of each size up to 7
## Selection Algorithm: forward
##          LOW AGE LWT RACE SMOKE PTL HT  UI  FTV
## 1  ( 1 ) " " "*" " " " "  " "   " " " " " " " "
## 2  ( 1 ) " " "*" " " " "  " "   " " " " "*" " "
## 3  ( 1 ) " " "*" " " " "  " "   " " "*" "*" " "
## 4  ( 1 ) " " "*" " " " "  "*"   " " "*" "*" " "
## 5  ( 1 ) " " "*" " " "*"  "*"   " " "*" "*" " "
## 6  ( 1 ) " " "*" "*" "*"  "*"   " " "*" "*" " "
## 7  ( 1 ) " " "*" "*" "*"  "*"   " " "*" "*" "*"

#Trying with even less variables though
plot(reg.forward, scale = "adjr2", main = "Adjusted R^2")

reg.forward

## Subset selection object
## Call: regsubsets.formula(BWT ~ ., data = subset(lowbwt, BWT < 2500), 
##     nvmax = 6, nbest = 1, method = "forward")
## 9 Variables  (and intercept)
##       Forced in Forced out
## AGE       FALSE      FALSE
## LWT       FALSE      FALSE
## RACE      FALSE      FALSE
## SMOKE     FALSE      FALSE
## PTL       FALSE      FALSE
## HT        FALSE      FALSE
## UI        FALSE      FALSE
## FTV       FALSE      FALSE
## LOW       FALSE      FALSE
## 1 subsets of each size up to 7
## Selection Algorithm: forward

After using backward and forward step-wise, LWT, PTL, UI, and FTV are still listed as the best Coefficients to use.

#Lastly we can do the hybrid approach
reg.seqrep <- regsubsets(BWT~., data = subset(lowbwt,BWT<2500),nvmax = 6, nbest = 1, method = "seqrep")

## Reordering variables and trying again:

plot(reg.seqrep, scale = "adjr2", main = "Adjusted R^2")

#Here we want the hieghest Adjusted R^2 which looks to be .83, then looking across the .83 row wherever we see a black box is a recommended variable to use. 
#Looks like pretty much the same variables, we can also do Cp
plot(reg.seqrep, scale = "Cp", main = "Cp")

#Here it looks like we are again selecting pretty much the same variables, by using the lowest Cp except "PTL" is now excluded 
summary(reg.seqrep)

## Subset selection object
## Call: regsubsets.formula(BWT ~ ., data = subset(lowbwt, BWT < 2500), 
##     nvmax = 6, nbest = 1, method = "seqrep")
## 9 Variables  (and intercept)
##       Forced in Forced out
## AGE       FALSE      FALSE
## LWT       FALSE      FALSE
## RACE      FALSE      FALSE
## SMOKE     FALSE      FALSE
## PTL       FALSE      FALSE
## HT        FALSE      FALSE
## UI        FALSE      FALSE
## FTV       FALSE      FALSE
## LOW       FALSE      FALSE
## 1 subsets of each size up to 7
## Selection Algorithm: 'sequential replacement'
##          LOW AGE LWT RACE SMOKE PTL HT  UI  FTV
## 1  ( 1 ) " " "*" " " " "  " "   " " " " " " " "
## 2  ( 1 ) " " "*" " " " "  " "   " " " " "*" " "
## 3  ( 1 ) " " "*" " " " "  " "   " " "*" "*" " "
## 4  ( 1 ) " " "*" " " " "  "*"   " " "*" "*" " "
## 5  ( 1 ) " " "*" " " "*"  "*"   " " "*" "*" " "
## 6  ( 1 ) " " "*" "*" "*"  "*"   "*" "*" " " " "
## 7  ( 1 ) " " "*" "*" "*"  "*"   " " "*" "*" "*"

Based off of our Adjusted R^2 plots while looking at Forward and Backward Step-Wise, the variables best to use that impact low birth weights are: LWT, PTL, UI, and FTV. However, after doing the hybyrid approach with Cp, it was recommended to take “PTL” out.

Problem B - PCR

Define test sets and training sets using “lowbwt.xls” to identify risk factors associated with giving birth to an underweight baby (weighing less than 2500 grams). Do PCR on the training set and then use cross-validation. Go through ALL the relevant steps : model selection, validation, interpretation, etc.

Use “lowbwt.xls” file under assignment 5 section on the blackboard.

set.seed (1000)
# We want to scale our data and "CV" stands for cross validation, which is defaulted to RMSE
pcr_model <- pcr(LOW~., data = lowbwt, scale = TRUE, validation = "CV")
summary(pcr_model)

## Data:    X dimension: 189 9 
##  Y dimension: 189 1
## Fit method: svdpc
## Number of components considered: 9
## 
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
##        (Intercept)  1 comps  2 comps  3 comps  4 comps  5 comps  6 comps
## CV          0.4658   0.4007   0.3971   0.3879   0.3645   0.3546   0.3473
## adjCV       0.4658   0.4001   0.3970   0.3863   0.3640   0.3549   0.3462
##        7 comps  8 comps  9 comps
## CV      0.3354   0.3323   0.2981
## adjCV   0.3339   0.3317   0.2972
## 
## TRAINING: % variance explained
##      1 comps  2 comps  3 comps  4 comps  5 comps  6 comps  7 comps
## X      19.99    36.09    49.62    61.97    71.73    80.62    88.52
## LOW    26.33    27.82    36.81    41.81    47.47    50.39    52.63
##      8 comps  9 comps
## X      95.04   100.00
## LOW    53.79    63.35

# Check and see where are mean square error prediction is the lowest
validationplot(pcr_model ,val.type="MSEP")

validationplot(pcr_model, val.type = "R2")

Ideally, we would like to see a low cross validation error with a lower number of components than the number of variables in the dataset. The mean square error prediction is the lowest when there are 8 or more components included.

The R^2 is also the highest when there are 8 or more components included, but can only explain about 90% of the variability in the data. Which is surprising to me, because I would have hoped that having less than the full set of coefficients would make a nice prediction model. Knowing this we can create a prediction model and work from there. Since R^2 is the highest when there are 8 or more components, starting with 9 might be the best bet.

# Now let's train and test our model
train <- lowbwt[1:126, ]
y_test <- lowbwt[127:189, 1]
test <- lowbwt[127:189,-1]

pcr_pred <- predict(pcr_model, test, ncomp = 9)
df_predict <- data.frame(pcr_pred,y_test)
summary(df_predict)

##   LOW.9.comps           LOW        
##  Min.   :-0.8156   Min.   :0.0000  
##  1st Qu.: 0.5905   1st Qu.:1.0000  
##  Median : 0.6760   Median :1.0000  
##  Mean   : 0.6702   Mean   :0.9365  
##  3rd Qu.: 0.8270   3rd Qu.:1.0000  
##  Max.   : 1.3639   Max.   :1.0000

mse <- mean((df_predict$LOW.9.comps-df_predict$LOW)^2)
rmse <- sqrt(mse)
rmse

## [1] 0.3310249

predplot(pcr_model)

By looking at the PCR prediction numbers and using them to get the RMSE of the test data the result is 0.33 which was similar to our earlier models. I think it’s safe to say that using less than 8 wouldn’t be very accurate, so based off of PCR, a good model would require 9 components.

DATS 6101: Assignment 5

Practicing Sub-setting Data and Principle Component Regression

Matt Wilchek

April 24, 2017

Problem A - Subsetting

Problem B - PCR