Homework 5 - STAT 601

##Kevin Kuipers (Completed by myself) ##September 25, 2018

##Problem 1. The BostonHousing dataset reported by Harrison and Rubinfeld (1978) is available as data.frame package mlbench (Leisch and Dimitriadou, 2009). The goal here is to predict the median value of owner-occupied homes (medv variable, in 1000s USD) based on other predictors in the dataset. Use this dataset to do the following

##Problem 1 part a)

a.) Construct a regression tree using rpart(). The following need to be included in your discussion. 
How many nodes did your tree have? 
Did you prune the tree? 
Did it decrease the number of nodes? 
What is the prediction error (calculate MSE)?  
Provide a plot of the predicted vs. observed values. Plot the final tree.

##Problem 1 Part a) Constructing and ploting a decision tree

I will fit a regressiong tree using rpart(). with doing a minsplit of 10. After the fit I will plot the model to see the results of the trees and variables used. The seed I will be using for repoducibility is 929270. First tree plot will be using plot(as.party()) command.

##Plot Tree Diagram

##GGplot Tree Diagram

##CP Plot & Table

##           CP nsplit rel error    xerror       xstd
## 1 0.45274420      0 1.0000000 1.0040525 0.08306682
## 2 0.17117244      1 0.5472558 0.6520042 0.06011629
## 3 0.07165784      2 0.3760834 0.4140955 0.04701425
## 4 0.05900152      3 0.3044255 0.3640133 0.04377652
## 5 0.03375589      4 0.2454240 0.2975932 0.03581334
## 6 0.02661300      5 0.2116681 0.2599863 0.02885904
## 7 0.02357238      6 0.1850551 0.2193263 0.02641688
## 8 0.01085935      7 0.1614827 0.2104044 0.02608657
## 9 0.01000000      8 0.1506234 0.2090758 0.02835479

##GGPlot of CP plot

In order to create a ggplot of the CP plot I will create a dataframe from among the cp table of the model

It appears that nsplit = 8 has the lowest xerror value = 0.2553457. Therefore, the current model does not need any pruning which is currently set at 9 nodes.

##Pruning with the lowest xerror value

For the sake of trial with pruning I will take the node number with the lowest xerror value and cp value and run it through fitting the model with these numbers in mind. Just to see if the decision tree model looks better.

##           CP nsplit rel error    xerror       xstd
## 1 0.45274420      0 1.0000000 1.0040525 0.08306682
## 2 0.17117244      1 0.5472558 0.6520042 0.06011629
## 3 0.07165784      2 0.3760834 0.4140955 0.04701425
## 4 0.05900152      3 0.3044255 0.3640133 0.04377652
## 5 0.03375589      4 0.2454240 0.2975932 0.03581334
## 6 0.02661300      5 0.2116681 0.2599863 0.02885904
## 7 0.02357238      6 0.1850551 0.2193263 0.02641688
## 8 0.01085935      7 0.1614827 0.2104044 0.02608657
## 9 0.01000000      8 0.1506234 0.2090758 0.02835479

##GGplot of Pruned Model

The plot and the table both reveal the same results before the pruning. Therefore, as you can see no pruning was required from the original model. The 9 node model contained the lowest xerror value.

##Problem 1 part a) Calculating the MSE

First I have to use the predict function comparing the model to the BostonHousing data set and then using it to compute the MSE

## The MSE is:  12.71556

##Plotting the Observed vs Predicted

The first plot will be using the plot() command looking at the observed median income vs the predicted median income for owner home occupied homes. The second plot will be using ggplot() command. In order to do this for ggplot I will create a data frame containing the predicted value and the observed value.

##Problem 1 part a Final)

The final model contain 9 nodes with a xerror value of 0.2553457 and a MSE of 12.71556. There was no need of pruning and even when going through the exercise of pruning with lowest xerror value the models were the same. Displayed above are the two plots of observed vs predicted to look at variability between the out comes. For the most part the variability between predicted and observes follows similar patterns as Median income increases.

##Problem 1 part b)

b) Perform bagging with 50 trees. Report the prediction error (MSE). 
Provide the predicted vs observed plot.

Calculating the MSE - Bagging 50 trees

I will perform bagging with 50 trees and output the prediction error (MSE)

## The MSE for bagging with 50 trees is : 16.24467

##Plotting Observed vs Predicted - Bagging 50 trees

First I will plot using the base-r plot(). Then I will plot using gpplot(). In order to do this for ggplot I will create a data frame containing the predicted value and the observed value.

##Problem 1 part b final) It appears that 50 trees made the variability worse for predicting median income. When looking at the first model’s observed vs predicted plots and comparing them to the bagging with 50 trees it has gotten worse–more variability between observed vs predicted. Therefore, I would go with the first model.

##Problem 1 part c)

c) Use randomForest() function in R to perform bagging. 
Report the prediction error (MSE). 
Was it the same as (b)? 
If they are different what do you think caused it?  
Provide a plot of the predicted vs. observed values.

##Random Forest mtry=13 MSE

In order to compute the MSE I will create a data frame containing the Median income from the BostonHousing data set and another column for the predicted value from the random forest model with using all 13 variables found in the original data set.

## The MSE for RandomForest Model:  10.53586

It appears the MSE is lower using the random forest command with 50 trees

##Random Forest mtry=13 - Observed vs Predicted Plot

First I will plot in base R and then ggplot using the previously made data frame of observed vs predicted for the random forest model.

The observed vs predicted plot using the random forest model looks alot better in terms of variability. In general, the points cluster closer together and closer to the regression line. Part of the reason random forest may perform better is because it generates a random sample of the data and uses it to develop trees. In constrast, bagging utilizes all the variables to build its trees. During this process overfitting can occur or even variables that are not beneficial in being predictors can occur in the model. Therefore, bagging can select variables that are not good predictors for the outcome and thus increase the error rates.

##Probelm 1 part d)

d) Use randomForest() function in R to perform random forest. 
Report the prediction error (MSE).  
Provide a plot of the predicted vs. observed values.

I will use randomForest() command to perform a random forest model without any criteria in the model. I will also perform all the same tests above when looking at the randomForest model with mtry=13. This time no mtry criteria will specified for the model.

##Random Forest MSE

In order to compute the MSE I will create a data frame containing the Median income from the BostonHousing data set and the predicted value.

## The MSE for RandomForest Model:  9.751704

It appears the MSE is lower using the random forest command with 50 trees

##Random Forest - Observed vs Predicted Plot

First I will plot in base R and then ggplot using the previously made data frame of observed vs predicted for the random forest model.

##Problem 1 d Final)

It appears the MSE is lower for the second random forest model than the first model. The first random forest model used all 13 variables. The second model did not specify the number of variables. Therefore, it continues to seem that that some variables are not “valulable” for predicting the desired outcome. This is why variable selection is important for developing a good model.

e) Provide a table containing each method and associated MSE. 
Which method is more accurate?

MSE Table

I will collect all the MSE values and put them into a data frame.

## [[1]]
## [[1]]$rpart_Trees
## [1] 12.71556
## 
## [[1]]$bagging_50
## [1] 16.24467
## 
## [[1]]$Random_Forest_mtry13
## [1] 10.53586
## 
## [[1]]$Random_Forest
## [1] 9.751704

##Problem 1 Final) It appears using the default parameters for randomForest() command tends to produce the lowest MSE. This seems to suggest that it is not limited to using all the variables within the data set and thus randomly chooses which variables are the best predictors. Whereas using mtry=13 starts of selecting all 13 variables and thus can use variables that are not good predictors for the model.

Problem 2.

Consider the glacoma data (data = “GlaucomaM”, package = “TH.data”).

a) Build a logistic regression model. 
Note that most of the predictor variables are highly correlated. 
Hence, a logistic regression model using the whole set of 
variables will not work here as it is sensitive to correlation.
    begin(verbatim)
    glac_glm <- glm(Class ~., data = GlaucomaM, family = "binomial")
    #warning messages  -- variable selection needed 
    end(verbatim)

   The solution is to select variables that seem to be 
    important for predicting the response and using 
    those in the modeling process using GLM. 
    One way to do this is by looking at the relationship between  cor(mtcars[,unlist(lapply(mtcars, is.numeric))])
    the response variable and predictor variables using graphical 
    or numerical summaries - this tends to be a tedious process. 
    Secondly, we can use a formal variable selection approach. 
    The step() function will do this in R. Using the step function, 
    choose any direction for variable selection and fit logistic regression model. 
    Discuss the model and error rate.
    
    begin(verbatim)
    #use of step() function in R
    ?step
    glm.step <- step(glac_glm)
    end(verbatim)
   
    
   Do not print out the summaries of every single model built using variable selection. 
    That will end up being dozens of pages long and not worth reading through. 
    Your discussion needs to include the direction you chose. 
    You may only report on the final model, the summary of that model, 
    and the error rate associated with that model.

##Building a Model using Step() command

First I will convert the class variable to a binary variable. 1 will represent normal and 0 otherwise. A Matrix of the will also be developed with the correlation of all the variables. Any with an absolute correlation value of 0.85 will be removed. After this is done then the step command will be used to find the formula for the model.

## class ~ phct + phcs + phci + vbrn + vars + mdn + tms

##Problem 2 B)

b) Build a logistic regression model with K-fold cross 
validation (k = 10). Report the error rate.

##Building the Logistic Regression model 10-fold Cross Validation

## The Error rate for 10-fold Cross Validation is:  0.2

##Problem 2 C)

c) Find a function (package in R) that can conduct the 
"adaboost" ensemble modeling. Use it to predict glaucoma 
and report error rate. Be sure to mention the package you used

## The Adaboost Error Rate : 0

d) Report the error rates based on single tree, bagging and random forest. 
(A table would be great for this).

##rpart() tree plot

##CP Table of rpart

##            CP nsplit rel error    xerror        xstd
## 1  0.41853844      0 1.0000000 1.0048912 0.002282293
## 2  0.06657497      1 0.5814616 0.6912667 0.076975865
## 3  0.06359004      2 0.5148866 0.7746646 0.088471881
## 4  0.05919825      3 0.4512966 0.7747453 0.090838552
## 5  0.05075188      4 0.3920983 0.8104926 0.095279750
## 6  0.03664023      5 0.3413464 0.7652280 0.095701335
## 7  0.03616490      6 0.3047062 0.8389154 0.102248263
## 8  0.02527222      8 0.2323764 0.8248150 0.101581422
## 9  0.01417234      9 0.2071042 0.7895940 0.100030241
## 10 0.01398347     10 0.1929318 0.8112680 0.101384347
## 11 0.01000000     11 0.1789484 0.8162996 0.101814647

## The number of splits with the lowest xerror is :  1

##CP Plot of rpart

##Pruning tree with - Only 1 nodes

Displaying the pruned model containing the loest xerror value

##           CP nsplit rel error    xerror        xstd
## 1 0.41853844      0 1.0000000 1.0048912 0.002282293
## 2 0.06657497      1 0.5814616 0.6912667 0.076975865

## The Pruned Model MSE is:  0.1453654

##Bagging with 50 trees - 10 Fold Cross Validation

## The error rate for bagging with 50 trees using 10-fold Cross Validation is:  0.1127816

##Random Forest Model - 10 Fold Cross Validation

## The Random Forest error rate with 10 fold cross validation is : 0.09166707

##10 Fold Cross Validation on Tree Model

## The rpart error rate with 10 fold cross validation is : 0.3333333

##Final Table of MSEs per Model

Putting all the error rates from the previous models and testings into a dataframe for outputting the model with lowest error rate

## [[1]]
## [[1]]$Random.Forest.Error.Rate
## [1] 0.09166707
## 
## [[1]]$Bagging.50.Error.Rate
## [1] 0.1127816
## 
## [[1]]$Adaboost.Error.Rate
## [1] 0
## 
## [[1]]$Cross.Validation.K10.Error.Rate
## [1] 0.2
## 
## [[1]]$Rpart.Error.Rate
## [1] 0.3333333

e) Write a conclusion comparing the above results 
(use a table to report models and corresponding error rates). 
Which one is the best model?

Model Selection

It appears that adaboost function has the lowest error rate which as at 0. This means that adaboost method predicted the class variable correctly at a 100%. The second best model is the Random Forest Model. It has an error rate at roughly 8.5%.

##Problem F

f) From the above analysis, which variables seem to be important in predicting Glaucoma?

Printing the summary of the glm.step model will reveal the most significant variables.

## 
## Call:
## glm(formula = class ~ phct + phcs + phci + vbrn + vars + mdn + 
##     tms, family = "binomial", data = GlaucomaM.dat)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.2697  -0.4091  -0.0204   0.3916   2.1019  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -1.9877     0.9346  -2.127  0.03344 *  
## phct         -5.6843     3.0371  -1.872  0.06126 .  
## phcs          9.7276     4.5485   2.139  0.03246 *  
## phci        -17.7577     3.4751  -5.110 3.22e-07 ***
## vbrn          7.0661     3.4795   2.031  0.04227 *  
## vars         66.4108    13.0101   5.105 3.32e-07 ***
## mdn          -4.5538     1.6347  -2.786  0.00534 ** 
## tms          -4.3313     2.4746  -1.750  0.08007 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 271.71  on 195  degrees of freedom
## Residual deviance: 120.04  on 188  degrees of freedom
## AIC: 136.04
## 
## Number of Fisher Scoring iterations: 7

##Final Conclusion

It appears that all the variables are significant however, phci & vars seem to be the most significant in predicting glaucoma. Whereas phct and tms are slightly significant at 0.1 level for predicting glaucoma.