The goal of this exercise is to predict wine “type” (red or white) using classification trees. Use boosting to identify the most influential predictors among the following: fixed acidity, volatile acidity, citric acid, residual sugar, chlorides, free sulfur dioxide, total sulfur dioxide, density, pH, sulphates, and alcohol.

Data Set Description

##   type fixed.acidity volatile.acidity citric.acid residual.sugar chlorides
## 1  red           7.4             0.70        0.00            1.9     0.076
## 2  red           7.8             0.88        0.00            2.6     0.098
## 3  red           7.8             0.76        0.04            2.3     0.092
## 4  red          11.2             0.28        0.56            1.9     0.075
## 5  red           7.4             0.70        0.00            1.9     0.076
## 6  red           7.4             0.66        0.00            1.8     0.075
##   free.sulfur.dioxide total.sulfur.dioxide density   pH sulphates alcohol
## 1                  11                   34  0.9978 3.51      0.56     9.4
## 2                  25                   67  0.9968 3.20      0.68     9.8
## 3                  15                   54  0.9970 3.26      0.65     9.8
## 4                  17                   60  0.9980 3.16      0.58     9.8
## 5                  11                   34  0.9978 3.51      0.56     9.4
## 6                  13                   40  0.9978 3.51      0.56     9.4
##   quality
## 1       5
## 2       5
## 3       5
## 4       6
## 5       5
## 6       5
## 'data.frame':    6497 obs. of  13 variables:
##  $ type                : chr  "red" "red" "red" "red" ...
##  $ fixed.acidity       : num  7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
##  $ volatile.acidity    : num  0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
##  $ citric.acid         : num  0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
##  $ residual.sugar      : num  1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
##  $ chlorides           : num  0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
##  $ free.sulfur.dioxide : num  11 25 15 17 11 13 15 15 9 17 ...
##  $ total.sulfur.dioxide: num  34 67 54 60 34 40 59 21 18 102 ...
##  $ density             : num  0.998 0.997 0.997 0.998 0.998 ...
##  $ pH                  : num  3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
##  $ sulphates           : num  0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
##  $ alcohol             : num  9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
##  $ quality             : int  5 5 5 6 5 5 5 7 7 5 ...

The wine types are imbalanced: roughly 25% of the wines are red and 75% are white.

## 
##       red     white 
## 0.2461136 0.7538864
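
Proportions like the ones above can be computed with `table()` and `prop.table()`. The sketch below uses a tiny made-up data frame, `wine_toy`, as a stand-in for the real data; the name and its values are assumptions for illustration only.

```r
# Toy stand-in for the wine data (the real data set has 6497 rows).
wine_toy <- data.frame(type = c("red", "white", "white", "white"),
                       stringsAsFactors = FALSE)

# Count each class, then convert the counts to proportions.
class_props <- prop.table(table(wine_toy$type))
print(class_props)
```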

Boosted Classification Trees

I will remove any “NA” values from the data set and recode the response variable from “red” and “white” to the binary values “0” and “1”, respectively.
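
One way this cleaning step might look is sketched below. The data frame `wine_toy` is made up for illustration. The column name `type1` follows the random forest call shown later in this report, and dropping the original `type` column is an assumption inferred from that call's list of 12 predictors.

```r
# Toy stand-in for the wine data, with one missing value (illustrative only).
wine_toy <- data.frame(
  type    = c("red", "white", "white", "red"),
  alcohol = c(9.4, NA, 10.5, 9.8),
  stringsAsFactors = FALSE
)

# Drop every row that contains at least one NA.
wine_clean <- na.omit(wine_toy)

# Recode the response: red -> 0, white -> 1.
wine_clean$type1 <- ifelse(wine_clean$type == "white", 1, 0)

# Remove the original character column so it is not used as a predictor.
wine_clean$type <- NULL
print(wine_clean)
```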

I will split the data set roughly in half, with 3,200 of the 6,497 observations in the training set.
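
A random split of this kind can be sketched with `sample()`. The seed is arbitrary, and the training size of 3200 is taken from the random forest output later in this report; the index names are assumptions.

```r
set.seed(1)  # arbitrary seed, for reproducibility only

n <- 6497                              # number of rows in the wine data
train_idx <- sample(seq_len(n), 3200)  # row indices for the training set
test_idx  <- setdiff(seq_len(n), train_idx)

length(train_idx)  # 3200
length(test_idx)   # 3297
```

With a data frame `wine`, the two sets would then be `wine[train_idx, ]` and `wine[test_idx, ]`.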

I fit the boosted tree model here…

##                                       var     rel.inf
## total.sulfur.dioxide total.sulfur.dioxide 47.07421614
## chlorides                       chlorides 42.51286070
## volatile.acidity         volatile.acidity  4.54334964
## density                           density  2.50920638
## sulphates                       sulphates  1.51823044
## pH                                     pH  0.54490582
## residual.sugar             residual.sugar  0.41937043
## fixed.acidity               fixed.acidity  0.41682228
## free.sulfur.dioxide   free.sulfur.dioxide  0.19677026
## citric.acid                   citric.acid  0.11905191
## alcohol                           alcohol  0.09913596
## quality                           quality  0.04608004

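The relative influence summary above ranks the predictors: total.sulfur.dioxide and chlorides together account for nearly 90% of the relative influence, and every other predictor contributes less than 5%.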
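
A relative influence table of this form is what `summary()` returns for a `gbm` fit. The sketch below assumes the `gbm` package and runs on simulated data rather than the wine data; the hyperparameter values (`n.trees`, `interaction.depth`, `shrinkage`) are placeholders, not the settings used in this report.

```r
library(gbm)

set.seed(1)  # arbitrary seed
n <- 400
# Simulated stand-in for the training data (not the real wine data).
sim_train <- data.frame(
  total.sulfur.dioxide = runif(n, 10, 250),
  chlorides            = runif(n, 0.01, 0.2),
  alcohol              = runif(n, 8, 14)
)
# Response coded 0/1, driven mostly by total.sulfur.dioxide.
sim_train$type1 <- as.numeric(
  sim_train$total.sulfur.dioxide + rnorm(n, sd = 20) > 100
)

# Boosted classification trees; "bernoulli" requires a 0/1 response.
boost_fit <- gbm(type1 ~ ., data = sim_train,
                 distribution = "bernoulli",
                 n.trees = 500, interaction.depth = 2, shrinkage = 0.01)

# Relative influence of each predictor, normalized to sum to 100.
rel_inf <- summary(boost_fit, plotit = FALSE)
print(rel_inf)
```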

Next I will compute the prediction accuracy on the test set.

## [1] 1.024971e-20 5.737522e-14 2.998611e-14 2.465554e-09 1.422552e-20
## [6] 1.942724e-10

Round the predicted probabilities to the nearest class (0 or 1) and compute the share of correct predictions.

## [1] 0 0 0 0 0 0
## [1] 0.9933273
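
The rounding and accuracy calculation can be illustrated on a few made-up numbers. Both vectors below are hypothetical and are not taken from the model output.

```r
# Hypothetical predicted probabilities of class 1 (white) and true labels.
pred_prob <- c(1.0e-20, 5.7e-14, 0.97, 0.62, 0.31)
truth     <- c(0, 0, 1, 1, 1)

# Round to the nearest class label: values above 0.5 become 1.
pred_class <- round(pred_prob)

# Accuracy is the share of predictions that match the truth.
accuracy <- mean(pred_class == truth)
accuracy  # 0.8
```

Here the last observation is misclassified (0.31 rounds to 0 while the true class is 1), so four of five predictions are correct.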

Check the goodness of fit on the training set.

## [1] 0 0 0 0 0 0
## [1] 0.9996875

Here I will find the optimal shrinkage parameter.

## [1] 10
## [1] 0.9996967
## [1] 0.1
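
The three values above read like the position of the best value in a grid, its accuracy, and the shrinkage value itself. Assuming that interpretation, a grid search could be sketched as follows. The data are simulated, and the grid, tree count, and depth are hypothetical choices.

```r
library(gbm)

set.seed(1)  # arbitrary seed
n <- 600
# Simulated stand-in data with a 0/1 response (not the real wine data).
sim <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
sim$y <- as.numeric(sim$x1 + 0.5 * sim$x2 + rnorm(n, sd = 0.5) > 0)
train <- sim[1:300, ]
test  <- sim[301:600, ]

# Hypothetical grid of candidate shrinkage values.
shrink_grid <- c(0.001, 0.01, 0.05, 0.1)

# Fit one model per shrinkage value and record its test accuracy.
test_acc <- sapply(shrink_grid, function(s) {
  fit  <- gbm(y ~ ., data = train, distribution = "bernoulli",
              n.trees = 200, interaction.depth = 2, shrinkage = s)
  prob <- predict(fit, newdata = test, n.trees = 200, type = "response")
  mean(round(prob) == test$y)
})

best <- which.max(test_acc)
best               # index of the best grid value
test_acc[best]     # its test accuracy
shrink_grid[best]  # the shrinkage value itself
```

Choosing the parameter on the test set makes the reported accuracy optimistic; a separate validation set or cross-validation would give a cleaner estimate.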

Random Forest Classification Trees

Fit the random forest model on the Wine data.

##                 Length Class  Mode     
## call               4   -none- call     
## type               1   -none- character
## predicted       3200   factor numeric  
## err.rate        1500   -none- numeric  
## confusion          6   -none- numeric  
## votes           6400   matrix numeric  
## oob.times       3200   -none- numeric  
## classes            2   -none- character
## importance        12   -none- numeric  
## importanceSD       0   -none- NULL     
## localImportance    0   -none- NULL     
## proximity          0   -none- NULL     
## ntree              1   -none- numeric  
## mtry               1   -none- numeric  
## forest            14   -none- list     
## y               3200   factor numeric  
## test               0   -none- NULL     
## inbag              0   -none- NULL     
## terms              3   terms  call
## List of 19
##  $ call           : language randomForest(formula = factor(type1) ~ ., data = wine_train, mtry = 4)
##  $ type           : chr "classification"
##  $ predicted      : Factor w/ 2 levels "0","1": 1 2 2 2 2 2 2 2 2 2 ...
##   ..- attr(*, "names")= chr [1:3200] "708" "4343" "4238" "4338" ...
##  $ err.rate       : num [1:500, 1:3] 0.0232 0.0306 0.0238 0.0207 0.0195 ...
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : NULL
##   .. ..$ : chr [1:3] "OOB" "0" "1"
##  $ confusion      : num [1:2, 1:3] 7.47e+02 5.00 1.10e+01 2.44e+03 1.45e-02 ...
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : chr [1:2] "0" "1"
##   .. ..$ : chr [1:3] "0" "1" "class.error"
##  $ votes          : matrix [1:3200, 1:2] 1 0.0167 0 0.157 0.1099 ...
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : chr [1:3200] "708" "4343" "4238" "4338" ...
##   .. ..$ : chr [1:2] "0" "1"
##  $ oob.times      : num [1:3200] 198 180 169 172 182 181 173 197 174 183 ...
##  $ classes        : chr [1:2] "0" "1"
##  $ importance     : num [1:12, 1] 49.8 138 18.4 30.6 328 ...
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : chr [1:12] "fixed.acidity" "volatile.acidity" "citric.acid" "residual.sugar" ...
##   .. ..$ : chr "MeanDecreaseGini"
##  $ importanceSD   : NULL
##  $ localImportance: NULL
##  $ proximity      : NULL
##  $ ntree          : num 500
##  $ mtry           : num 4
##  $ forest         :List of 14
##   ..$ ndbigtree : int [1:500] 75 107 97 79 73 87 113 79 79 79 ...
##   ..$ nodestatus: int [1:145, 1:500] 1 1 1 1 1 1 1 1 1 1 ...
##   ..$ bestvar   : int [1:145, 1:500] 5 7 7 9 7 7 5 10 5 5 ...
##   ..$ treemap   : int [1:145, 1:2, 1:500] 2 4 6 8 10 12 14 16 18 20 ...
##   ..$ nodepred  : int [1:145, 1:500] 0 0 0 0 0 0 0 0 0 0 ...
##   ..$ xbestsplit: num [1:145, 1:500] 0.0635 50.5 92.5 3.145 54.5 ...
##   ..$ pid       : num [1:2] 1 1
##   ..$ cutoff    : num [1:2] 0.5 0.5
##   ..$ ncat      : Named int [1:12] 1 1 1 1 1 1 1 1 1 1 ...
##   .. ..- attr(*, "names")= chr [1:12] "fixed.acidity" "volatile.acidity" "citric.acid" "residual.sugar" ...
##   ..$ maxcat    : int 1
##   ..$ nrnodes   : int 145
##   ..$ ntree     : num 500
##   ..$ nclass    : int 2
##   ..$ xlevels   :List of 12
##   .. ..$ fixed.acidity       : num 0
##   .. ..$ volatile.acidity    : num 0
##   .. ..$ citric.acid         : num 0
##   .. ..$ residual.sugar      : num 0
##   .. ..$ chlorides           : num 0
##   .. ..$ free.sulfur.dioxide : num 0
##   .. ..$ total.sulfur.dioxide: num 0
##   .. ..$ density             : num 0
##   .. ..$ pH                  : num 0
##   .. ..$ sulphates           : num 0
##   .. ..$ alcohol             : num 0
##   .. ..$ quality             : num 0
##  $ y              : Factor w/ 2 levels "0","1": 1 2 2 2 2 2 2 2 2 2 ...
##   ..- attr(*, "names")= chr [1:3200] "708" "4343" "4238" "4338" ...
##  $ test           : NULL
##  $ inbag          : NULL
##  $ terms          :Classes 'terms', 'formula'  language factor(type1) ~ fixed.acidity + volatile.acidity + citric.acid + residual.sugar +      chlorides + free.sulfur.di| __truncated__ ...
##   .. ..- attr(*, "variables")= language list(factor(type1), fixed.acidity, volatile.acidity, citric.acid,      residual.sugar, chlorides, free.sulfur.dio| __truncated__ ...
##   .. ..- attr(*, "factors")= int [1:13, 1:12] 0 1 0 0 0 0 0 0 0 0 ...
##   .. .. ..- attr(*, "dimnames")=List of 2
##   .. .. .. ..$ : chr [1:13] "factor(type1)" "fixed.acidity" "volatile.acidity" "citric.acid" ...
##   .. .. .. ..$ : chr [1:12] "fixed.acidity" "volatile.acidity" "citric.acid" "residual.sugar" ...
##   .. ..- attr(*, "term.labels")= chr [1:12] "fixed.acidity" "volatile.acidity" "citric.acid" "residual.sugar" ...
##   .. ..- attr(*, "order")= int [1:12] 1 1 1 1 1 1 1 1 1 1 ...
##   .. ..- attr(*, "intercept")= num 0
##   .. ..- attr(*, "response")= int 1
##   .. ..- attr(*, ".Environment")=<environment: R_GlobalEnv> 
##   .. ..- attr(*, "predvars")= language list(factor(type1), fixed.acidity, volatile.acidity, citric.acid,      residual.sugar, chlorides, free.sulfur.dio| __truncated__ ...
##   .. ..- attr(*, "dataClasses")= Named chr [1:13] "factor" "numeric" "numeric" "numeric" ...
##   .. .. ..- attr(*, "names")= chr [1:13] "factor(type1)" "fixed.acidity" "volatile.acidity" "citric.acid" ...
##  - attr(*, "class")= chr [1:2] "randomForest.formula" "randomForest"

Compute prediction accuracy in the training set

##  708 4343 4238 4338 5236 3984 
##    0    1    1    1    1    1 
## Levels: 0 1
## [1] 0.9996875

Compute prediction accuracy in the test set

## 1 2 3 4 6 7 
## 0 0 0 0 0 0 
## Levels: 0 1
## [1] 0.9896876
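
The fit and the two accuracy figures can be reproduced in outline as follows. The sketch assumes the `randomForest` package and uses simulated data; the call mirrors the one shown in the `str()` output above, except that `mtry = 2` is used because the toy data has only three predictors.

```r
library(randomForest)

set.seed(1)  # arbitrary seed
n <- 400
# Simulated stand-in data (not the real wine data).
sim <- data.frame(
  chlorides            = runif(n, 0.01, 0.2),
  total.sulfur.dioxide = runif(n, 10, 250),
  alcohol              = runif(n, 8, 14)
)
sim$type1 <- as.numeric(sim$total.sulfur.dioxide > 100)
sim_train <- sim[1:200, ]
sim_test  <- sim[201:400, ]

# factor() makes randomForest grow classification rather than regression trees.
rf_fit <- randomForest(factor(type1) ~ ., data = sim_train, mtry = 2)

# Predicted classes come back as a factor with levels "0" and "1".
train_pred <- predict(rf_fit, newdata = sim_train)
test_pred  <- predict(rf_fit, newdata = sim_test)

# Compare as character so factor labels and numeric labels line up.
train_acc <- mean(as.character(train_pred) == as.character(sim_train$type1))
test_acc  <- mean(as.character(test_pred) == as.character(sim_test$type1))
c(train = train_acc, test = test_acc)
```

Calling `predict(rf_fit)` without `newdata` returns out-of-bag predictions instead, which usually give a less optimistic view of training performance.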

Grow the classification tree with rpart() using 10-fold cross-validation.
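
A minimal sketch of this step, assuming the `rpart` package and simulated data in place of the wine data. Pruning to the lowest cross-validated error is one common choice and is shown here as an assumption, not as the procedure used in this report.

```r
library(rpart)

set.seed(1)  # cross-validation folds are random
n <- 400
# Simulated stand-in data (not the real wine data).
sim <- data.frame(
  chlorides            = runif(n, 0.01, 0.2),
  total.sulfur.dioxide = runif(n, 10, 250)
)
sim$type1 <- factor(as.numeric(
  sim$total.sulfur.dioxide + rnorm(n, sd = 20) > 100
))

# method = "class" grows a classification tree; xval = 10 requests
# 10-fold cross-validation (this is also rpart's default).
tree_fit <- rpart(type1 ~ ., data = sim, method = "class",
                  control = rpart.control(xval = 10))

# The complexity table reports the cross-validated error (xerror) per tree size.
cp_table <- tree_fit$cptable
print(cp_table)

# Prune back to the subtree with the lowest cross-validated error.
best_cp     <- cp_table[which.min(cp_table[, "xerror"]), "CP"]
tree_pruned <- prune(tree_fit, cp = best_cp)
```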

The Random Forest trees