Wine Quality Data Set

The goal of this exercise is to predict wine type (red or white) using classification trees. Boosting is used to identify the most influential predictors among the following: fixed acidity, volatile acidity, citric acid, residual sugar, chlorides, free sulfur dioxide, total sulfur dioxide, density, pH, sulphates, and alcohol.

Data Set Description

Type: wine type (red or white)
Fixed acidity
Volatile acidity
Citric acid
Residual sugar
Chlorides
Free sulfur dioxide
Total sulfur dioxide
Density
pH
Sulphates
Alcohol: content of alcohol (%)
Quality: wine quality (on a scale from 1 to 7)

##   type fixed.acidity volatile.acidity citric.acid residual.sugar chlorides
## 1  red           7.4             0.70        0.00            1.9     0.076
## 2  red           7.8             0.88        0.00            2.6     0.098
## 3  red           7.8             0.76        0.04            2.3     0.092
## 4  red          11.2             0.28        0.56            1.9     0.075
## 5  red           7.4             0.70        0.00            1.9     0.076
## 6  red           7.4             0.66        0.00            1.8     0.075
##   free.sulfur.dioxide total.sulfur.dioxide density   pH sulphates alcohol
## 1                  11                   34  0.9978 3.51      0.56     9.4
## 2                  25                   67  0.9968 3.20      0.68     9.8
## 3                  15                   54  0.9970 3.26      0.65     9.8
## 4                  17                   60  0.9980 3.16      0.58     9.8
## 5                  11                   34  0.9978 3.51      0.56     9.4
## 6                  13                   40  0.9978 3.51      0.56     9.4
##   quality
## 1       5
## 2       5
## 3       5
## 4       6
## 5       5
## 6       5

## 'data.frame':    6497 obs. of  13 variables:
##  $ type                : chr  "red" "red" "red" "red" ...
##  $ fixed.acidity       : num  7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
##  $ volatile.acidity    : num  0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
##  $ citric.acid         : num  0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
##  $ residual.sugar      : num  1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
##  $ chlorides           : num  0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
##  $ free.sulfur.dioxide : num  11 25 15 17 11 13 15 15 9 17 ...
##  $ total.sulfur.dioxide: num  34 67 54 60 34 40 59 21 18 102 ...
##  $ density             : num  0.998 0.997 0.997 0.998 0.998 ...
##  $ pH                  : num  3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
##  $ sulphates           : num  0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
##  $ alcohol             : num  9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
##  $ quality             : int  5 5 5 6 5 5 5 7 7 5 ...

The two wine types are imbalanced: roughly 25% of the observations are red and 75% are white.

## 
##       red     white 
## 0.2461136 0.7538864
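The report does not show the code behind these proportions. A minimal sketch of how such a table could be produced is given here, assuming the data sit in a data frame called `wine` (a hypothetical name) with a `type` column; the toy data only mimic the 25/75 split.

```r
# Toy stand-in for the wine data: 1 red for every 3 whites (illustration only).
wine <- data.frame(type = rep(c("red", "white"), times = c(25, 75)))

# table() counts each class; prop.table() turns the counts into proportions.
class_share <- prop.table(table(wine$type))
print(class_share)
```

With this imbalance, a classifier that always answers "white" would already be right about 75% of the time, so the accuracies reported later are best read against that baseline.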

Boosted Classification trees

I will remove any rows with “NA” values from the data set and recode the response variable from “red” and “white” to the binary values “0” and “1”.
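A sketch of this cleaning step, under assumptions: the data frame name `wine` and the toy values are hypothetical, while the response name `type1` and the coding (red = 0, white = 1) are taken from the model output later in the report.

```r
# Toy data with one missing value (hypothetical; the real data has 13 columns).
wine <- data.frame(
  type    = c("red", "white", "white", "red", "white"),
  alcohol = c(9.4, 10.1, NA, 9.8, 11.2)
)

# Drop every row that contains at least one NA.
wine <- na.omit(wine)

# Recode the response: red -> 0, white -> 1. Then drop the original column
# so it cannot leak into a `type1 ~ .` formula as a predictor.
wine$type1 <- ifelse(wine$type == "white", 1, 0)
wine$type  <- NULL

print(wine)
```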

I will split the data set roughly in half: 3,200 observations go to the training set and the remainder to the test set.
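One common way to make such a split is to sample row indices without replacement. The sketch below assumes that approach; `wine_train` is the name used in the model calls later in the report, while `wine_test`, `train_idx`, and the seed are hypothetical.

```r
set.seed(1)  # arbitrary seed, only so the sketch is reproducible

# Stand-in data frame with the same number of rows as the wine data.
wine <- data.frame(id = seq_len(6497))

# Draw 3200 row indices without replacement; the remaining rows form the test set.
train_idx  <- sample(nrow(wine), 3200)
wine_train <- wine[train_idx, , drop = FALSE]
wine_test  <- wine[-train_idx, , drop = FALSE]

print(c(train = nrow(wine_train), test = nrow(wine_test)))
```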

I fit the boosted tree model here…

##                                       var     rel.inf
## total.sulfur.dioxide total.sulfur.dioxide 47.07421614
## chlorides                       chlorides 42.51286070
## volatile.acidity         volatile.acidity  4.54334964
## density                           density  2.50920638
## sulphates                       sulphates  1.51823044
## pH                                     pH  0.54490582
## residual.sugar             residual.sugar  0.41937043
## fixed.acidity               fixed.acidity  0.41682228
## free.sulfur.dioxide   free.sulfur.dioxide  0.19677026
## citric.acid                   citric.acid  0.11905191
## alcohol                           alcohol  0.09913596
## quality                           quality  0.04608004

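The relative influence table above ranks the predictors. Total sulfur dioxide (47.1%) and chlorides (42.5%) together account for roughly 90% of the total influence; every other predictor contributes less than 5%.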
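The fitting code is not shown in the report. The following is a hedged sketch of how a boosted classification model and its relative influence table can be obtained with the `gbm` package. The simulated data, the tuning values (`n.trees`, `interaction.depth`, `shrinkage`), and the object names are assumptions, not the settings used above.

```r
library(gbm)

set.seed(1)
n <- 400
# Simulated stand-in for the training data (not the real wine measurements).
wine_train <- data.frame(
  total.sulfur.dioxide = runif(n, 10, 250),
  chlorides            = runif(n, 0.01, 0.2),
  alcohol              = runif(n, 8, 14)
)
# The label depends on the first two columns only; alcohol is pure noise.
wine_train$type1 <- as.numeric(
  wine_train$total.sulfur.dioxide > 100 & wine_train$chlorides < 0.12
)

boost_fit <- gbm(
  type1 ~ ., data = wine_train,
  distribution = "bernoulli",   # binary 0/1 response
  n.trees = 200, interaction.depth = 2, shrinkage = 0.1
)

# Relative influence table with columns var and rel.inf; plotit = FALSE skips the bar chart.
influence <- summary(boost_fit, plotit = FALSE)
print(influence)
```

Relative influence is normalised to sum to 100, so each value reads as a percentage share of the total reduction in loss attributed to that predictor.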

Next, I will compute the prediction accuracy on the test set. The first six predicted probabilities are:

## [1] 1.024971e-20 5.737522e-14 2.998611e-14 2.465554e-09 1.422552e-20
## [6] 1.942724e-10

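Round the predicted probabilities to class labels (0 or 1) and compare them with the true labels to get the test-set accuracy.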

## [1] 0 0 0 0 0 0

## [1] 0.9933273
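The rounding step amounts to a 0.5 cutoff on the predicted probability of class 1. A small worked example with made-up probabilities and labels (both hypothetical) shows the calculation:

```r
# Hypothetical predicted probabilities of class 1 ("white") and the true labels.
prob_white <- c(1.0e-20, 5.7e-14, 0.03, 0.97, 0.62, 0.41)
truth      <- c(0, 0, 0, 1, 1, 1)

# round() maps p < 0.5 to 0 and p > 0.5 to 1, i.e. a 0.5 cutoff.
pred_class <- round(prob_white)

# Accuracy is the share of predictions that match the truth.
accuracy <- mean(pred_class == truth)
print(pred_class)
print(accuracy)   # 5 of 6 correct
```

The last observation has a probability of 0.41, so it is labelled 0 although its true class is 1; that single miss gives an accuracy of 5/6.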

Repeat the same accuracy check on the training set to assess goodness of fit.

## [1] 0 0 0 0 0 0

## [1] 0.9996875

Here I will find the optimal shrinkage parameter.

## [1] 10

## [1] 0.9996967

## [1] 0.1
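The report does not show how the shrinkage parameter was searched, and the three values printed above are not labelled. One plausible approach, sketched here purely as an assumption, is a grid search that refits the model for each candidate value and keeps the one with the best held-out accuracy. The grid, the simulated data, and all object names are hypothetical.

```r
library(gbm)

set.seed(2)
n <- 600
# Simulated binary problem standing in for the wine data.
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
dat$y <- as.numeric(dat$x1 + dat$x2 + rnorm(n, sd = 0.5) > 0)
train <- dat[1:300, ]
valid <- dat[301:600, ]

shrink_grid <- c(0.001, 0.01, 0.1, 0.5)   # hypothetical candidate values

# Fit one boosted model per shrinkage value and record held-out accuracy.
valid_acc <- sapply(shrink_grid, function(s) {
  fit <- gbm(y ~ ., data = train, distribution = "bernoulli",
             n.trees = 200, interaction.depth = 2, shrinkage = s)
  p <- predict(fit, newdata = valid, n.trees = 200, type = "response")
  mean(round(p) == valid$y)
})

best <- which.max(valid_acc)   # position of the best value in the grid
print(data.frame(shrinkage = shrink_grid, accuracy = valid_acc))
print(shrink_grid[best])
```

If the same data are used both to choose the shrinkage value and to report the final accuracy, that accuracy will be somewhat optimistic; a separate validation set or cross-validation avoids this.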

Random Forest Classification Trees

Fit the random forest model on the Wine data.

##                 Length Class  Mode     
## call               4   -none- call     
## type               1   -none- character
## predicted       3200   factor numeric  
## err.rate        1500   -none- numeric  
## confusion          6   -none- numeric  
## votes           6400   matrix numeric  
## oob.times       3200   -none- numeric  
## classes            2   -none- character
## importance        12   -none- numeric  
## importanceSD       0   -none- NULL     
## localImportance    0   -none- NULL     
## proximity          0   -none- NULL     
## ntree              1   -none- numeric  
## mtry               1   -none- numeric  
## forest            14   -none- list     
## y               3200   factor numeric  
## test               0   -none- NULL     
## inbag              0   -none- NULL     
## terms              3   terms  call

## List of 19
##  $ call           : language randomForest(formula = factor(type1) ~ ., data = wine_train, mtry = 4)
##  $ type           : chr "classification"
##  $ predicted      : Factor w/ 2 levels "0","1": 1 2 2 2 2 2 2 2 2 2 ...
##   ..- attr(*, "names")= chr [1:3200] "708" "4343" "4238" "4338" ...
##  $ err.rate       : num [1:500, 1:3] 0.0232 0.0306 0.0238 0.0207 0.0195 ...
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : NULL
##   .. ..$ : chr [1:3] "OOB" "0" "1"
##  $ confusion      : num [1:2, 1:3] 7.47e+02 5.00 1.10e+01 2.44e+03 1.45e-02 ...
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : chr [1:2] "0" "1"
##   .. ..$ : chr [1:3] "0" "1" "class.error"
##  $ votes          : matrix [1:3200, 1:2] 1 0.0167 0 0.157 0.1099 ...
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : chr [1:3200] "708" "4343" "4238" "4338" ...
##   .. ..$ : chr [1:2] "0" "1"
##  $ oob.times      : num [1:3200] 198 180 169 172 182 181 173 197 174 183 ...
##  $ classes        : chr [1:2] "0" "1"
##  $ importance     : num [1:12, 1] 49.8 138 18.4 30.6 328 ...
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : chr [1:12] "fixed.acidity" "volatile.acidity" "citric.acid" "residual.sugar" ...
##   .. ..$ : chr "MeanDecreaseGini"
##  $ importanceSD   : NULL
##  $ localImportance: NULL
##  $ proximity      : NULL
##  $ ntree          : num 500
##  $ mtry           : num 4
##  $ forest         :List of 14
##   ..$ ndbigtree : int [1:500] 75 107 97 79 73 87 113 79 79 79 ...
##   ..$ nodestatus: int [1:145, 1:500] 1 1 1 1 1 1 1 1 1 1 ...
##   ..$ bestvar   : int [1:145, 1:500] 5 7 7 9 7 7 5 10 5 5 ...
##   ..$ treemap   : int [1:145, 1:2, 1:500] 2 4 6 8 10 12 14 16 18 20 ...
##   ..$ nodepred  : int [1:145, 1:500] 0 0 0 0 0 0 0 0 0 0 ...
##   ..$ xbestsplit: num [1:145, 1:500] 0.0635 50.5 92.5 3.145 54.5 ...
##   ..$ pid       : num [1:2] 1 1
##   ..$ cutoff    : num [1:2] 0.5 0.5
##   ..$ ncat      : Named int [1:12] 1 1 1 1 1 1 1 1 1 1 ...
##   .. ..- attr(*, "names")= chr [1:12] "fixed.acidity" "volatile.acidity" "citric.acid" "residual.sugar" ...
##   ..$ maxcat    : int 1
##   ..$ nrnodes   : int 145
##   ..$ ntree     : num 500
##   ..$ nclass    : int 2
##   ..$ xlevels   :List of 12
##   .. ..$ fixed.acidity       : num 0
##   .. ..$ volatile.acidity    : num 0
##   .. ..$ citric.acid         : num 0
##   .. ..$ residual.sugar      : num 0
##   .. ..$ chlorides           : num 0
##   .. ..$ free.sulfur.dioxide : num 0
##   .. ..$ total.sulfur.dioxide: num 0
##   .. ..$ density             : num 0
##   .. ..$ pH                  : num 0
##   .. ..$ sulphates           : num 0
##   .. ..$ alcohol             : num 0
##   .. ..$ quality             : num 0
##  $ y              : Factor w/ 2 levels "0","1": 1 2 2 2 2 2 2 2 2 2 ...
##   ..- attr(*, "names")= chr [1:3200] "708" "4343" "4238" "4338" ...
##  $ test           : NULL
##  $ inbag          : NULL
##  $ terms          :Classes 'terms', 'formula'  language factor(type1) ~ fixed.acidity + volatile.acidity + citric.acid + residual.sugar +      chlorides + free.sulfur.di| __truncated__ ...
##   .. ..- attr(*, "variables")= language list(factor(type1), fixed.acidity, volatile.acidity, citric.acid,      residual.sugar, chlorides, free.sulfur.dio| __truncated__ ...
##   .. ..- attr(*, "factors")= int [1:13, 1:12] 0 1 0 0 0 0 0 0 0 0 ...
##   .. .. ..- attr(*, "dimnames")=List of 2
##   .. .. .. ..$ : chr [1:13] "factor(type1)" "fixed.acidity" "volatile.acidity" "citric.acid" ...
##   .. .. .. ..$ : chr [1:12] "fixed.acidity" "volatile.acidity" "citric.acid" "residual.sugar" ...
##   .. ..- attr(*, "term.labels")= chr [1:12] "fixed.acidity" "volatile.acidity" "citric.acid" "residual.sugar" ...
##   .. ..- attr(*, "order")= int [1:12] 1 1 1 1 1 1 1 1 1 1 ...
##   .. ..- attr(*, "intercept")= num 0
##   .. ..- attr(*, "response")= int 1
##   .. ..- attr(*, ".Environment")=<environment: R_GlobalEnv> 
##   .. ..- attr(*, "predvars")= language list(factor(type1), fixed.acidity, volatile.acidity, citric.acid,      residual.sugar, chlorides, free.sulfur.dio| __truncated__ ...
##   .. ..- attr(*, "dataClasses")= Named chr [1:13] "factor" "numeric" "numeric" "numeric" ...
##   .. .. ..- attr(*, "names")= chr [1:13] "factor(type1)" "fixed.acidity" "volatile.acidity" "citric.acid" ...
##  - attr(*, "class")= chr [1:2] "randomForest.formula" "randomForest"
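The `call` element above records the formula `factor(type1) ~ .` with `mtry = 4`. A minimal sketch of the same kind of fit on simulated data follows. The data are made up, and `mtry = 2` is used here only because the toy data have four predictors rather than twelve.

```r
library(randomForest)

set.seed(3)
n <- 300
# Simulated stand-in for wine_train (not the real measurements).
wine_train <- data.frame(
  chlorides            = runif(n, 0.01, 0.2),
  total.sulfur.dioxide = runif(n, 10, 250),
  alcohol              = runif(n, 8, 14),
  pH                   = runif(n, 2.8, 3.8)
)
wine_train$type1 <- as.numeric(wine_train$total.sulfur.dioxide > 100)

# Wrapping the response in factor() makes randomForest classify instead of regress.
rf_fit <- randomForest(factor(type1) ~ ., data = wine_train, mtry = 2)

print(rf_fit$confusion)     # out-of-bag confusion matrix with class.error
print(importance(rf_fit))   # MeanDecreaseGini per predictor
```

`mtry` is the number of predictors sampled as split candidates at each node; setting it equal to the total number of predictors would reduce the forest to plain bagging.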

Compute prediction accuracy in the training set

##  708 4343 4238 4338 5236 3984 
##    0    1    1    1    1    1 
## Levels: 0 1

## [1] 0.9996875
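A training accuracy this high is worth comparing with the out-of-bag (OOB) estimate stored in the `confusion` element of the fitted object, since the forest has already seen every training row. The arithmetic below uses the counts printed in the `str()` output; the value shown there as `2.44e+03` is read as 2437, which is what the total of 3200 rows implies.

```r
n_train <- 3200

# In-sample: an accuracy of 0.9996875 corresponds to one error in 3200.
train_acc <- (n_train - 1) / n_train

# OOB confusion counts: rows = true class, columns = predicted class.
# matrix() fills column by column, matching the order in the str() output.
oob_confusion <- matrix(c(747, 5, 11, 2437), nrow = 2,
                        dimnames = list(truth = c("0", "1"), pred = c("0", "1")))
oob_acc <- sum(diag(oob_confusion)) / sum(oob_confusion)

print(c(train = train_acc, oob = oob_acc))
```

The OOB figure (16 errors in 3200, or 0.995) is the more honest estimate of how the forest behaves on unseen data.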

Compute prediction accuracy in the test set

## 1 2 3 4 6 7 
## 0 0 0 0 0 0 
## Levels: 0 1

## [1] 0.9896876

Grow a single classification tree with rpart(), using 10-fold cross-validation.

The Random Forest trees

Print the complexity parameter table

## 
## Classification tree:
## rpart(formula = type1 ~ ., data = wine_train, method = "class")
## 
## Variables actually used in tree construction:
## [1] chlorides            total.sulfur.dioxide volatile.acidity    
## 
## Root node error: 758/3200 = 0.23688
## 
## n= 3200 
## 
##         CP nsplit rel error   xerror     xstd
## 1 0.711082      0  1.000000 1.000000 0.031730
## 2 0.067942      1  0.288918 0.350923 0.020603
## 3 0.060686      3  0.153034 0.189974 0.015471
## 4 0.018470      4  0.092348 0.108179 0.011792
## 5 0.014512      5  0.073879 0.093668 0.010992
## 6 0.010000      6  0.059367 0.088391 0.010685
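In this table `rel error` and `xerror` are expressed relative to the root node error, so the cross-validated misclassification rate of the largest tree is roughly 0.088391 × 0.23688 ≈ 0.021. The sketch below shows how such a table can be produced and used for pruning. It runs on the `kyphosis` data that ships with `rpart` as a stand-in for `wine_train`, and the object names are hypothetical.

```r
library(rpart)

set.seed(4)  # xerror comes from random 10-fold CV, so fix the seed

# kyphosis ships with rpart; it stands in for wine_train here.
tree_fit <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis,
                  method = "class",
                  control = rpart.control(xval = 10))  # 10 folds (the default)

printcp(tree_fit)

# Pick the CP value with the smallest cross-validated error and prune to it.
cp_table <- tree_fit$cptable
best_cp  <- cp_table[which.min(cp_table[, "xerror"]), "CP"]
pruned   <- prune(tree_fit, cp = best_cp)
print(best_cp)
```

In the wine table above, the smallest `xerror` is on the last row (CP = 0.01), so pruning by this rule would leave that tree unchanged.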

Kenneth B. Hunt, MBA

June 18, 2018
