Problem 8.4.7

Using the typical values of p, p/2, and sqrt(p) for the number of variables tried at each split (mtry), I will try a range of ntree from 1 to 500.

## randomForest 4.6-10
## Type rfNews() to see new features/changes/bug fixes.

The plot shows that the test MSE for a single tree is around 18, which is quite high. As more trees are added to the model, the test MSE drops steadily until it stabilizes after a few hundred trees. Using all p variables at each split gives a slightly higher test MSE (around 11) than using either p/2 or sqrt(p) variables; both of those end up slightly below 10.
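
For reference, a comparison like the one plotted above can be set up as follows. This is only a minimal sketch: it assumes the Boston housing data from the MASS package with a 50/50 train/test split, and the object names (X.train, rf.p, and so on) are my own, not the code that produced the plot.

library(MASS)
library(randomForest)

set.seed(1)
train   <- sample(1:nrow(Boston), nrow(Boston) / 2)
X.train <- Boston[train, -14];  y.train <- Boston$medv[train]
X.test  <- Boston[-train, -14]; y.test  <- Boston$medv[-train]
p <- ncol(Boston) - 1                       # number of predictors

# Supplying xtest/ytest makes randomForest() record the test MSE after each
# tree, so one 500-tree fit per mtry value traces out the whole curve
rf.p     <- randomForest(X.train, y.train, xtest = X.test, ytest = y.test,
                         mtry = p, ntree = 500)
rf.p2    <- randomForest(X.train, y.train, xtest = X.test, ytest = y.test,
                         mtry = floor(p / 2), ntree = 500)
rf.sqrtp <- randomForest(X.train, y.train, xtest = X.test, ytest = y.test,
                         mtry = floor(sqrt(p)), ntree = 500)

plot(1:500, rf.p$test$mse, type = "l", col = "red", xlab = "Number of trees",
     ylab = "Test MSE", ylim = c(8, 20))    # range covers the values described above
lines(1:500, rf.p2$test$mse, col = "blue")
lines(1:500, rf.sqrtp$test$mse, col = "darkgreen")
legend("topright", c("m = p", "m = p/2", "m = sqrt(p)"),
       col = c("red", "blue", "darkgreen"), lty = 1)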

Additional Problem
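
The unpruned tree summarized below could have been fit along these lines. This is a sketch, not the original code: the name of the full data frame (premie) and the even split are assumptions inferred from the 999-observation counts in the output.

library(tree)

set.seed(1)
train <- sample(1:nrow(premie), 999)      # 999 training rows, matching the 46/999 rate below
premie.train <- premie[train, ]
premie.test  <- premie[-train, ]

tree.premie <- tree(Premie ~ ., data = premie.train)   # Premie assumed to be a factor
summary(tree.premie)                                   # produces the summary shown below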

## 
## Classification tree:
## tree(formula = Premie ~ ., data = premie.train)
## Variables actually used in tree construction:
## [1] "weight"  "Mage"    "Fage"    "Gender"  "Racemom" "Visits"  "Racedad"
## Number of terminal nodes:  15 
## Residual mean deviance:  0.2016 = 198.4 / 984 
## Misclassification error rate: 0.04605 = 46 / 999

# Test-set predictions and misclassification rate for the unpruned tree
pred.unpruned <- predict(tree.premie, premie.test, type = "class")
misclass.unpruned <- sum(premie.test$Premie != pred.unpruned)
misclass.unpruned / length(pred.unpruned)
## [1] 0.07807808

My testing misclassification rate was 0.0781, or 7.81%.
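
To see how these errors split across the two classes, a confusion table can be printed (pred.unpruned and premie.test come from the code above; the table itself is not reproduced here):

table(predicted = pred.unpruned, actual = premie.test$Premie)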

## 
## Classification tree:
## snip.tree(tree = tree.premie, nodes = 30L)
## Variables actually used in tree construction:
## [1] "weight"  "Mage"    "Fage"    "Gender"  "Racemom" "Visits"  "Racedad"
## Number of terminal nodes:  14 
## Residual mean deviance:  0.2073 = 204.2 / 985 
## Misclassification error rate: 0.04605 = 46 / 999

  1. I obtained a best size of 11 for my pruned tree from cross-validation. It should be noted that sizes 13, 11, and 8 had the same cross-validation error, so I chose the middle value (a sketch of this pruning step follows this item). The pruned tree contains no indication that smoking is a potential cause of premature births. Weight was an enormous indicator of a premature baby: the most important split was whether the baby's weight was less than 98.5, and the next most important splits were whether weight was less than 80.5 or 117.5. The other variables associated with premature births were weight, Apgar1, Fage, Visits, BirthComp, MomPriorCond, and DelivComp.
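
The cross-validation and pruning described in this item could be carried out along the following lines (a sketch; cv.tree() and prune.misclass() are from the tree package, ties are broken by which.min(), and one could instead pass best = 11 directly to reproduce the choice described above):

set.seed(1)
cv.premie <- cv.tree(tree.premie, FUN = prune.misclass)   # CV misclassification counts by tree size
plot(cv.premie$size, cv.premie$dev, type = "b",
     xlab = "Tree size", ylab = "CV misclassifications")

best.size <- cv.premie$size[which.min(cv.premie$dev)]     # a size with the smallest CV error
pruned.premie <- prune.misclass(tree.premie, best = best.size)
summary(pruned.premie)                                    # summary shown above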

# Test-set predictions and misclassification rate for the pruned tree
pred.pruned <- predict(pruned.premie, premie.test, type = "class")
misclass.pruned <- sum(premie.test$Premie != pred.pruned)
misclass.pruned / length(pred.pruned)
## [1] 0.07807808

The pruned tree's misclassification rate is 0.0781, or 7.81%, the same as the unpruned tree's, but it comes from a slightly simpler tree. It is still lower than the roughly 9% error rate we would get by predicting that every birth is full term (about 9% of all births are premature).
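
The random forest used below could be fit along these lines (a sketch: the object name for.premie matches the prediction code that follows, importance = TRUE is needed for the importance table later on, and ntree = 500 is an assumption):

set.seed(1)
for.premie <- randomForest(Premie ~ ., data = premie.train,
                           ntree = 500, importance = TRUE)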

# Test-set predictions and misclassification rate for the random forest
pred.forest <- predict(for.premie, premie.test, type = "class")
misclass.forest <- sum(premie.test$Premie != pred.forest)
misclass.forest / length(pred.forest)
## [1] 0.06706707

The random forest's testing misclassification rate was 0.0671, or 6.71%. This is again lower than 9%, meaning I did better than if we assumed every baby was born full term. Moreover, this misclassification rate is also lower than the pruned tree's!

  1. The most important variables are:
##                      No         Yes MeanDecreaseAccuracy MeanDecreaseGini
## Gender        1.8991753  2.74976142            2.8055214        2.0849679
## weight       41.0088020 55.74257653           54.7340012       65.3764921
## Apgar1        2.3018620  2.43030733            3.1409840        7.4539774
## Fage          5.1483218  1.71850661            5.5328232        5.8909043
## Mage         10.8769715 -2.48541424            9.4633123        9.9761240
## Feduc         4.2041930  1.83621742            4.6434971        3.6714534
## Meduc         1.0089714  0.51809151            1.0329236        6.0670646
## TotPreg       2.7423037  1.47905998            3.0286157        5.7012022
## Visits        0.7212833  5.36153693            3.3010735       12.6998191
## Marital       2.9970055  1.83215347            3.6271473        1.3059289
## Racemom      -1.3230242  3.56315998            0.1101052        2.6848227
## Racedad      -1.0842291  4.53242893            1.0591139        3.1386638
## Hispmom       0.1003883  1.88056523            0.7391573        1.7594425
## Hispdad      -3.9729741  3.83579125           -2.3619964        2.2457948
## Gained       -0.2248060  1.61779445            0.6215030       11.4654779
## Habit         0.2939516 -1.13683121           -0.2236429        1.1640663
## MomPriorCond -0.4781317 -0.01816443           -0.4442456        1.9697548
## BirthDef     -0.5717359 -1.58065003           -0.9841065        0.6631672
## DelivComp     2.2579777  3.39508005            3.6077341        2.2584137
## BirthComp    -0.6456355 -0.32910717           -0.6010196        1.9679080

The 5 most important variables in my random forest (by mean decrease in the Gini index) are: weight, Visits, Gained, Mage, and Apgar1.
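
This ranking can be pulled straight from the importance matrix or drawn as a dot chart (both functions are part of the randomForest package):

imp <- importance(for.premie)                                  # the matrix printed above
imp[order(imp[, "MeanDecreaseGini"], decreasing = TRUE), ][1:5, ]
varImpPlot(for.premie, n.var = 5, main = "Top 5 variables")    # same ranking, graphically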

  1. I would use the random forest as my final model because it gave a lower misclassification rate than either the unpruned or the pruned CART.

  2. There is advice that I could give that would not abolish premature births but could help alleviate them. By using random forests, I was able to find the most important variables associated with premature birth. As a society we can work to reduce some of these factors, for example by addressing low birth weight or by increasing the number of doctor visits during pregnancy. Some variables are almost impossible to change, but changing even one may make a drastic difference in premature births.