knitr::opts_chunk$set(echo = TRUE)
library(rpart); library(rpart.plot); library(randomForest); library(tidyverse); library(caret); library(MASS) 
## Warning: package 'rpart.plot' was built under R version 3.4.4
## Warning: package 'randomForest' was built under R version 3.4.4
## randomForest 4.6-14
## Type rfNews() to see new features/changes/bug fixes.
## -- Attaching packages ------------------------------------------------------------------------------------------------- tidyverse 1.2.1 --
## v ggplot2 2.2.1     v purrr   0.2.4
## v tibble  1.4.1     v dplyr   0.7.4
## v tidyr   0.7.2     v stringr 1.2.0
## v readr   1.1.1     v forcats 0.2.0
## -- Conflicts ---------------------------------------------------------------------------------------------------- tidyverse_conflicts() --
## x dplyr::combine()  masks randomForest::combine()
## x dplyr::filter()   masks stats::filter()
## x dplyr::lag()      masks stats::lag()
## x ggplot2::margin() masks randomForest::margin()
## Warning: package 'caret' was built under R version 3.4.4
## Loading required package: lattice
## 
## Attaching package: 'caret'
## The following object is masked from 'package:purrr':
## 
##     lift
## 
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
## 
##     select
abalone_clean <- read_csv("C:/LocalFiles/Documents/Freshman TSU/STAT-220/HW 9/abalone_clean.csv")
## Warning: Missing column names filled in: 'X1' [1]
## Parsed with column specification:
## cols(
##   X1 = col_integer(),
##   sex = col_character(),
##   length = col_double(),
##   diameter = col_double(),
##   height = col_double(),
##   whole.weight = col_double(),
##   shucked.weight = col_double(),
##   viscera.weight = col_double(),
##   shell.weight = col_double(),
##   rings = col_integer()
## )
# Drop the unnamed row-index column (X1) and store sex as a factor
abalone_tree <- abalone_clean[2:10] %>%
  mutate(sex = as.factor(sex))
set.seed(410)           #this allows everyone to get the same sample
training <- sample_frac(abalone_tree, .7)
testing <- setdiff(abalone_tree, training)

1) Run the code above to import the data set and get it ready.

b) Be sure to run the bottom 3 lines simultaneously (so that we all get the same subsets)
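
The three lines have to run together because set.seed resets R's random number generator and sample_frac then draws from it. A minimal base-R sketch of the idea, using sample on toy indices 1 to 10 as a stand-in for the abalone rows (the object names here are made up for illustration):

```r
# Toy illustration: indices 1..10 stand in for the abalone rows.
set.seed(410)
first_draw <- sample(10, 7)

set.seed(410)                # reset the generator to the same state
second_draw <- sample(10, 7)

third_draw <- sample(10, 7)  # no reset: the generator state has moved on

identical(first_draw, second_draw)  # TRUE: same seed, same sample
```

If anything that uses random numbers runs between set.seed and the sampling call, the draw can change, which is why the lines should be run as one block.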

2) Make CART trees to predict sex (including Indeterminate) on the two sets.

a) Use rpart (with na.action=na.rpart) to make a tree to predict sex on the training set

i) Use the rpart.plot package to make a pretty tree (use extra=104)

ii) How does it look different than what was on your last homework?

abalone.rparttrain <- rpart(sex ~ ., data=training, na.action = na.rpart)
rpart.plot(abalone.rparttrain, extra=104)

#The last tree didn't have Indeterminate on it.  It also used different variables in the prediction: height, shucked.weight, and diameter.

b) Now, make a tree on the testing set (normally, you wouldn’t do this).

i) Use the rpart.plot package to make another pretty tree (use extra=104)

ii) How are the two trees (or 3, counting HW#7) different?

abalone.rparttest <- rpart(sex ~ ., data=testing, na.action = na.rpart)
rpart.plot(abalone.rparttest, extra=104)

#This tree has more ending predictions (terminal nodes) than the training-set tree.  It also uses whole.weight, which the training-set tree doesn't.  There are more branches on this one as well (more boxes).

c) Use predict to validate the model (from the training set) on the test set data.

i) Use table and/or confusionMatrix.

ii) How did it do?

abalonepredict <-predict(abalone.rparttrain, newdata=testing, na.action=na.pass, type="class")
summary(abalonepredict)
##   F   I   M 
## 133 525 595
table(testing$sex, abalonepredict, useNA="always")
##       abalonepredict
##          F   I   M <NA>
##   F     68  84 242    0
##   I      2 333  80    0
##   M     63 108 273    0
##   <NA>   0   0   0    0
confusionMatrix(abalonepredict, testing$sex)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   F   I   M
##          F  68   2  63
##          I  84 333 108
##          M 242  80 273
## 
## Overall Statistics
##                                           
##                Accuracy : 0.5379          
##                  95% CI : (0.5098, 0.5658)
##     No Information Rate : 0.3543          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.2994          
##  Mcnemar's Test P-Value : < 2.2e-16       
## 
## Statistics by Class:
## 
##                      Class: F Class: I Class: M
## Sensitivity           0.17259   0.8024   0.6149
## Specificity           0.92433   0.7709   0.6020
## Pos Pred Value        0.51128   0.6343   0.4588
## Neg Pred Value        0.70893   0.8874   0.7401
## Prevalence            0.31445   0.3312   0.3543
## Detection Rate        0.05427   0.2658   0.2179
## Detection Prevalence  0.10615   0.4190   0.4749
## Balanced Accuracy     0.54846   0.7866   0.6084
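#It only had 53.79 percent accuracy in the confusion matrix. That beats the 35.43 percent no-information rate, but it is not great, and it did worst on females (sensitivity of about 17 percent).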
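
As a check on where the accuracy figure comes from, the confusion matrix can be rebuilt by hand from the counts printed above. This is only a sketch: the name cart_cm is made up here, and the counts are typed in from the output rather than taken from the fitted model.

```r
# Counts copied from the confusionMatrix output above
# (rows = predicted class, columns = true class).
cart_cm <- matrix(c( 68,   2,  63,
                     84, 333, 108,
                    242,  80, 273),
                  nrow = 3, byrow = TRUE,
                  dimnames = list(Prediction = c("F", "I", "M"),
                                  Reference  = c("F", "I", "M")))

# Accuracy: correct predictions (the diagonal) over all predictions.
cart_accuracy <- sum(diag(cart_cm)) / sum(cart_cm)
round(cart_accuracy, 4)   # 0.5379

# Sensitivity per class: diagonal over the column (true-class) totals.
cart_sensitivity <- diag(cart_cm) / colSums(cart_cm)
round(cart_sensitivity, 4)
```

The sensitivities should line up with the "Sensitivity" row in the Statistics by Class table.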

3) Create a randomForest to make a model of Abalone sex (no, not like that)

a) Use the randomForest command (with na.action=na.rpart).

abalone.rf <- randomForest(sex ~ ., data=training, na.action=na.omit)
abalone.rf
## 
## Call:
##  randomForest(formula = sex ~ ., data = training, na.action = na.omit) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 2
## 
##         OOB estimate of  error rate: 44.53%
## Confusion matrix:
##     F   I   M class.error
## F 349 115 449   0.6177437
## I  76 717 134   0.2265372
## M 342 186 556   0.4870849
varImpPlot(abalone.rf, n.var=10)
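
The OOB error rate and the class.error column can be reproduced from the printed OOB confusion matrix, where rows are the true class and columns are the predicted class. A rough sketch with the counts typed in by hand (oob_cm is a made-up name, not an object from the fit):

```r
# Counts copied from the randomForest printout above
# (rows = true class, columns = out-of-bag prediction).
oob_cm <- matrix(c(349, 115, 449,
                    76, 717, 134,
                   342, 186, 556),
                 nrow = 3, byrow = TRUE,
                 dimnames = list(Actual    = c("F", "I", "M"),
                                 Predicted = c("F", "I", "M")))

# Overall OOB error: share of training rows that were misclassified.
oob_error <- 1 - sum(diag(oob_cm)) / sum(oob_cm)
round(100 * oob_error, 2)   # 44.53

# Per-class error: misclassified share within each true class (row).
class_error <- 1 - diag(oob_cm) / rowSums(oob_cm)
round(class_error, 4)
```

An OOB error of 44.53 percent is the same as about 55.5 percent OOB accuracy on the training rows.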

b) Use predict to validate the training model on the test set

i) Use table and/or confusionMatrix.

ii) How did it do? Better than CART?

abalone.rfP <- predict(abalone.rf, newdata=testing, na.action=na.pass, type="class")
summary(abalone.rfP)
##   F   I   M 
## 321 438 494
table(testing$sex, abalone.rfP, useNA="always")
##       abalone.rfP
##          F   I   M <NA>
##   F    154  48 192    0
##   I     30 313  72    0
##   M    137  77 230    0
##   <NA>   0   0   0    0
confusionMatrix(abalone.rfP, testing$sex)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   F   I   M
##          F 154  30 137
##          I  48 313  77
##          M 192  72 230
## 
## Overall Statistics
##                                          
##                Accuracy : 0.5563         
##                  95% CI : (0.5283, 0.584)
##     No Information Rate : 0.3543         
##     P-Value [Acc > NIR] : < 2.2e-16      
##                                          
##                   Kappa : 0.3317         
##  Mcnemar's Test P-Value : 0.003643       
## 
## Statistics by Class:
## 
##                      Class: F Class: I Class: M
## Sensitivity            0.3909   0.7542   0.5180
## Specificity            0.8056   0.8508   0.6737
## Pos Pred Value         0.4798   0.7146   0.4656
## Neg Pred Value         0.7425   0.8748   0.7181
## Prevalence             0.3144   0.3312   0.3543
## Detection Rate         0.1229   0.2498   0.1836
## Detection Prevalence   0.2562   0.3496   0.3943
## Balanced Accuracy      0.5982   0.8025   0.5958
#It had 55.63 percent accuracy, so slightly better than the CART model (53.79 percent) at predicting abalone sex.
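
To put the two models side by side, the test-set counts from both confusionMatrix outputs can be compared directly. This is an illustrative sketch only: the helper from_counts and the other object names are made up, and the numbers are typed in from the printed output.

```r
# Hypothetical helper: build a 3x3 matrix (rows = predicted, cols = true).
from_counts <- function(x) {
  matrix(x, nrow = 3, byrow = TRUE,
         dimnames = list(Prediction = c("F", "I", "M"),
                         Reference  = c("F", "I", "M")))
}

cart_test <- from_counts(c( 68,   2,  63,
                            84, 333, 108,
                           242,  80, 273))
rf_test   <- from_counts(c(154,  30, 137,
                            48, 313,  77,
                           192,  72, 230))

# Overall accuracy for each model on the same 1253 test rows.
accuracy <- c(CART = sum(diag(cart_test)) / sum(cart_test),
              RF   = sum(diag(rf_test))   / sum(rf_test))
round(accuracy, 4)

# Per-class sensitivity, one row per model.
sensitivity <- rbind(CART = diag(cart_test) / colSums(cart_test),
                     RF   = diag(rf_test)   / colSums(rf_test))
round(sensitivity, 4)
```

On these counts the random forest is ahead overall by about 1.8 percentage points. Most of that comes from the female class; CART has the higher sensitivity for Indeterminate and Male.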

4) Turn this into a pretty .RMarkdown .pdf or .html file. Be sure to explain whether Random Forests or CART was better at predicting abalone sex.