The objective is to predict wine quality ratings from chemical properties. This provides vineyards with guidance on wine quality and expected price without relying heavily on tasters.

### clean the environment, and load all needed libraries (specify why you need them)
rm(list=ls())
library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
library(rpart) # recursive partitioning of trees
library(rpart.plot)
library(plotly) #cool graphing
## 
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## The following object is masked from 'package:stats':
## 
##     filter
## The following object is masked from 'package:graphics':
## 
##     layout
library(RColorBrewer)
library(DataExplorer) #better EDA
library(knitr)
library(kableExtra) #nice table views
library(rattle)  # to use fancyRpartPlot
## Rattle: A free graphical interface for data science with R.
## Version 5.2.0 Copyright (c) 2006-2018 Togaware Pty Ltd.
## Type 'rattle()' to shake, rattle, and roll your data.
library(randomForest) # fit random forests
## randomForest 4.6-14
## Type rfNews() to see new features/changes/bug fixes.
## 
## Attaching package: 'randomForest'
## The following object is masked from 'package:rattle':
## 
##     importance
## The following object is masked from 'package:ggplot2':
## 
##     margin
library(gmodels) #crosstable

1) Train, predict, and evaluate the wine quality using decision trees. What is the accuracy rate of the tree model? Plot and interpret the trees.

wine_red <- read.csv("/Users/jay/Desktop/CODE/MSDS/MSDS680_ML/Week4/data/winequality-red.csv",header=TRUE, sep = ";")
names(wine_red) # list variable names
##  [1] "fixed.acidity"        "volatile.acidity"     "citric.acid"         
##  [4] "residual.sugar"       "chlorides"            "free.sulfur.dioxide" 
##  [7] "total.sulfur.dioxide" "density"              "pH"                  
## [10] "sulphates"            "alcohol"              "quality"
str(wine_red) # Summary of the data set
## 'data.frame':    1599 obs. of  12 variables:
##  $ fixed.acidity       : num  7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
##  $ volatile.acidity    : num  0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
##  $ citric.acid         : num  0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
##  $ residual.sugar      : num  1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
##  $ chlorides           : num  0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
##  $ free.sulfur.dioxide : num  11 25 15 17 11 13 15 15 9 17 ...
##  $ total.sulfur.dioxide: num  34 67 54 60 34 40 59 21 18 102 ...
##  $ density             : num  0.998 0.997 0.997 0.998 0.998 ...
##  $ pH                  : num  3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
##  $ sulphates           : num  0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
##  $ alcohol             : num  9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
##  $ quality             : int  5 5 5 6 5 5 5 7 7 5 ...
#table preview using kable for better form factor.
#kable(wine_red[1:7, 1:12]) %>%
  #kable_styling(bootstrap_options = c("striped", full_width = F))
#high level stats on the dataset using DataExplorer
#introduce(wine_red)
#plot_intro(wine_red)

## View missing value distribution for wine_red data
#plot_missing(wine_red)

#Looks straightforward: no missing values, no discrete columns - all continuous. Looks good at a high level. Now on to feature relevance / correlation between predictors.
#Correlation between features: see if any predictors are highly correlated with one another. If so, I will remove those predictors to avoid multicollinearity.
plot_correlation(wine_red)

#Density and fixed.acidity are moderately correlated with each other
#total.sulfur.dioxide and free.sulfur.dioxide are moderately correlated.
#Next is quality vs. alcohol level. Quality is our class label, so I won't remove the alcohol predictor.

#Removing two predictors - total.sulfur.dioxide and density (columns 7 and 8) - and overwriting the wine_red data set
wine_red <- subset(wine_red, select = -c(7,8))
#plot_correlation(wine_red)
str(wine_red) #quality is now 10th variable not 12th.
## 'data.frame':    1599 obs. of  10 variables:
##  $ fixed.acidity      : num  7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
##  $ volatile.acidity   : num  0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
##  $ citric.acid        : num  0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
##  $ residual.sugar     : num  1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
##  $ chlorides          : num  0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
##  $ free.sulfur.dioxide: num  11 25 15 17 11 13 15 15 9 17 ...
##  $ pH                 : num  3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
##  $ sulphates          : num  0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
##  $ alcohol            : num  9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
##  $ quality            : int  5 5 5 6 5 5 5 7 7 5 ...
#Note after running the models: removing the highly correlated variables helped the model predict "cooking wine" better. Without removing density and total.sulfur.dioxide, the 3-class model did not classify any observations as "cooking wine".
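#A sketch (not run here): the correlated pairs can also be flagged numerically with caret::findCorrelation on the original 12-column data. The 0.6 cutoff and the wine_red_full name are my own assumptions, for illustration only.
#wine_red_full <- read.csv("winequality-red.csv", header = TRUE, sep = ";")  # hypothetical reload of the full data
#cor_mat <- cor(wine_red_full[, -12])                # correlation matrix of the 11 predictors
#high_cor <- findCorrelation(cor_mat, cutoff = 0.6)  # column indices suggested for removal
#names(wine_red_full[, -12])[high_cor]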
## View distribution of all continuous variables
plot_histogram(wine_red)

plot_density(wine_red)

#Some things to note: quality is discrete. Alcohol, residual sugar, sulphates, and free.sulfur.dioxide are right-skewed, meaning their means are larger than their medians. Also, quality is not evenly distributed. Let's dig a bit deeper into the class label - quality.
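#A quick sketch (not run here) to back up the skew claim: compare mean and median for each predictor; right-skewed variables should show mean > median.
#sapply(wine_red[, -10], function(x) c(mean = mean(x), median = median(x)))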
#distinct values in the class label
sort(unique(wine_red$quality))
## [1] 3 4 5 6 7 8
#Check the proportions of the quality
round(prop.table(table(wine_red$quality)), 2)
## 
##    3    4    5    6    7    8 
## 0.01 0.03 0.43 0.40 0.12 0.01
#Approximately 83% of the wines in the data set have a quality rating of 5 or 6 - in other words, average. Out of 6 ratings, only 2 are dominant, and both represent medium quality on that scale.
#Here is a histogram to check the overall frequency distribution of quality (using plot_ly).
plotly_plot<-plot_ly(data = wine_red, x =~quality, type = "histogram", color = ~quality>9, colors = "Set1",  hovertext="Wine Quality Histogram") #hovertext option
plotly_plot
#Notes: Initially I was getting the warning "In RColorBrewer: minimal value for n is 3, returning requested palette with 3 different levels" because RColorBrewer's minimum number of data classes is three. I first tried to suppress it by running suppressWarnings(plotly_plot), but that did not work. Instead, adding the greater-than comparison to the colors option solved the issue.
#Additional EDA on the data and some thoughts:
hist(wine_red$alcohol, col="#EE3B3B", main="Histogram of Alcohol Percent in Wine", xlab="Alcohol Percent", ylab="Number of samples", las=1) #Alcohol level varies. My guess is that wines around 8-9% will turn out to be cooking wine; good quality wine is usually around 13-14%.

hist(wine_red$residual.sugar, col="#BCEE6B", main="Histogram of Residual Sugar", xlab="Residual Sugar", ylab="Number of samples", las=1) # There are very few overly sweet wines.

hist(wine_red$chlorides, col="#CDB79E", main="Histogram of Chlorides in Wine", xlab="Chlorides", ylab="Number of samples", las=1) # Chloride amount usually depends on the quality of the water and sulphates. There is a moderate correlation between sulphates and chlorides.

hist(wine_red$pH, col="#458B74", main="Wine pH Histogram", xlab="pH", ylab="Number of samples") #Normally distributed. Given the chemistry of winemaking and storage conditions, a normal distribution makes sense to me.

#Boxplot to see quality vs alcohol using plot_ly
plot_ly(data = wine_red, x = ~quality, y = ~alcohol, type = "box")
# Alcohol content increases with quality, with a few outliers among the medium-quality wines.
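#A quick sketch (not run here) to quantify the boxplot reading: median alcohol per quality score.
#tapply(wine_red$alcohol, wine_red$quality, median)  # medians should rise with the quality score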

Three Categories

#Three categories:
#In order to have 3 distinct categories, I convert the 6-point quality scale to a 3-point system: the lowest quality wines become cooking_wine (ratings 3, 4), medium quality becomes table_wine (5, 6), and the highest quality becomes spectacular_wine (7, 8). The new set is called wine_red_3 so that I can redo the run with only 2 quality categories (below).
wine_red_3 <- wine_red
wine_red_3$quality <- ifelse(wine_red_3$quality == 3, "cooking_wine", ifelse(wine_red_3$quality == 4, "cooking_wine", ifelse(wine_red_3$quality == 5, "table_wine", ifelse(wine_red_3$quality == 6, "table_wine", ifelse(wine_red_3$quality == 7, "spectacular_wine",  "spectacular_wine")) ))) #using ifelse to replace

#confirm
unique(wine_red_3$quality) #check distinct values
## [1] "table_wine"       "spectacular_wine" "cooking_wine"
#head(wine_red_3) 
table(wine_red_3$quality) #count each category
## 
##     cooking_wine spectacular_wine       table_wine 
##               63              217             1319
class(wine_red_3$quality) # returns character
## [1] "character"
# Now, convert categorical to factor which is necessary for the model
wine_red_3$quality <- factor(wine_red_3$quality)
levels(wine_red_3$quality) # check the levels
## [1] "cooking_wine"     "spectacular_wine" "table_wine"
class(wine_red_3$quality) # class is now factor
## [1] "factor"

Two Categories

# Two categories!!! new data set is now called wine_red_2
#As a long-time wine drinker I don't want to spend any time tasting anything rated less than 6. So this time I will convert the quality variable into 2 classes. Since I only want to drink the best, I will use "High" (>= 6) and "Low" (< 6) as my quality classification.
#Applying as.factor during the recoding.
wine_red_2 <- wine_red
wine_red_2$quality <- as.factor(with(wine_red_2, ifelse(quality >=6, "High", "Low"))) 
#str(wine_red_2)
#head(wine_red_2)
unique(wine_red_2$quality)  # check distinct
## [1] Low  High
## Levels: High Low
table(wine_red_2$quality)  # check distribution. 
## 
## High  Low 
##  855  744
class(wine_red_2$quality)
## [1] "factor"
#My reason for choosing 6 as the dividing point is to get a more balanced class distribution, as opposed to the wine_red_3 data set, where the quality distribution is more imbalanced and, in my opinion, simulates real life more accurately. My goal is to compare the effects of a balanced class distribution against the more realistic, imbalanced one.

Create the training and test data sets for the models

set.seed(123)  # for reproducibility
# Determine the number of rows for training
nrow(wine_red_3) * 0.70 #Apply the nrow() function to determine how many observations are in the wine dataset, and the number needed for a 70% sample.
## [1] 1119.3
nrow(wine_red_2) * 0.70
## [1] 1119.3
# Create a random sample of row IDs. Use the sample() function to create an integer vector of row IDs for the 70% sample. The first argument of sample() should be the number of rows in the data set, and the second is the number of rows you need in your training set.
sample_rows3 <- sample(nrow(wine_red_3), nrow(wine_red_3) * 0.70)
sample_rows2 <- sample(nrow(wine_red_2), nrow(wine_red_2) * 0.70)

# Create the training datasets. Subset the data using the row IDs to create the training datasets. Save these as wine_red_3_train and wine_red_2_train.
wine_red_3_train <- wine_red_3[sample_rows3, ]
wine_red_2_train <- wine_red_2[sample_rows2, ]

# Create the test datasets. Subset again, but this time select all the rows that are not in sample_rows. Save these as wine_red_3_test and wine_red_2_test.
wine_red_3_test <- wine_red_3[-sample_rows3, ]
wine_red_2_test <- wine_red_2[-sample_rows2, ]

# Check the class proportions in the training and test sets (quality is the class variable)
round(prop.table(table(wine_red_3_train$quality)), 3)
## 
##     cooking_wine spectacular_wine       table_wine 
##            0.041            0.148            0.811
round(prop.table(table(wine_red_3_test$quality)), 3)
## 
##     cooking_wine spectacular_wine       table_wine 
##            0.035            0.106            0.858
round(prop.table(table(wine_red_2_train$quality)), 3)
## 
## High  Low 
## 0.53 0.47
round(prop.table(table(wine_red_2_test$quality)), 3)
## 
##  High   Low 
## 0.546 0.454
##Proportions between the training and test data sets do not differ greatly, so there is no need to resample or attempt a more sophisticated sampling approach (a stratified alternative is sketched below). Although the wine dataset appears to be randomly ordered (as seen with head()), I still opted for random sampling instead of simply splitting by row index.
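#A sketch (not run here) of a stratified 70/30 split using caret::createDataPartition, which samples within each quality class; idx3 and the *_strat names are illustrative only.
#idx3 <- createDataPartition(wine_red_3$quality, p = 0.70, list = FALSE)
#wine_red_3_train_strat <- wine_red_3[idx3, ]
#wine_red_3_test_strat  <- wine_red_3[-idx3, ]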

Model training

#3 quality classes: 
# Grow a tree using all of the available data - Building a tree
m3.rpart <- rpart(quality~ ., data = wine_red_3_train, method = "class") # use (.) to select all predictors and method = class for classification
m3.rpart # to display basic information
## n= 1119 
## 
## node), split, n, loss, yval, (yprob)
##       * denotes terminal node
## 
##  1) root 1119 212 table_wine (0.04110813 0.14834674 0.81054513)  
##    2) alcohol>=10.75 410 145 table_wine (0.03658537 0.31707317 0.64634146)  
##      4) sulphates>=0.685 185  87 table_wine (0.00000000 0.47027027 0.52972973)  
##        8) alcohol>=11.65 80  28 spectacular_wine (0.00000000 0.65000000 0.35000000)  
##         16) free.sulfur.dioxide< 18.5 56  14 spectacular_wine (0.00000000 0.75000000 0.25000000)  
##           32) fixed.acidity< 9.65 43   6 spectacular_wine (0.00000000 0.86046512 0.13953488) *
##           33) fixed.acidity>=9.65 13   5 table_wine (0.00000000 0.38461538 0.61538462) *
##         17) free.sulfur.dioxide>=18.5 24  10 table_wine (0.00000000 0.41666667 0.58333333)  
##           34) residual.sugar>=2.35 8   2 spectacular_wine (0.00000000 0.75000000 0.25000000) *
##           35) residual.sugar< 2.35 16   4 table_wine (0.00000000 0.25000000 0.75000000) *
##        9) alcohol< 11.65 105  35 table_wine (0.00000000 0.33333333 0.66666667)  
##         18) volatile.acidity< 0.4 54  26 table_wine (0.00000000 0.48148148 0.51851852)  
##           36) pH< 3.265 22   7 spectacular_wine (0.00000000 0.68181818 0.31818182) *
##           37) pH>=3.265 32  11 table_wine (0.00000000 0.34375000 0.65625000)  
##             74) pH>=3.395 8   2 spectacular_wine (0.00000000 0.75000000 0.25000000) *
##             75) pH< 3.395 24   5 table_wine (0.00000000 0.20833333 0.79166667) *
##         19) volatile.acidity>=0.4 51   9 table_wine (0.00000000 0.17647059 0.82352941) *
##      5) sulphates< 0.685 225  58 table_wine (0.06666667 0.19111111 0.74222222)  
##       10) free.sulfur.dioxide< 11.5 119  44 table_wine (0.09243697 0.27731092 0.63025210)  
##         20) sulphates>=0.545 89  40 table_wine (0.11235955 0.33707865 0.55056180)  
##           40) alcohol>=11.95 26  11 spectacular_wine (0.03846154 0.57692308 0.38461538)  
##             80) fixed.acidity< 9 15   3 spectacular_wine (0.06666667 0.80000000 0.13333333) *
##             81) fixed.acidity>=9 11   3 table_wine (0.00000000 0.27272727 0.72727273) *
##           41) alcohol< 11.95 63  24 table_wine (0.14285714 0.23809524 0.61904762) *
##         21) sulphates< 0.545 30   4 table_wine (0.03333333 0.10000000 0.86666667) *
##       11) free.sulfur.dioxide>=11.5 106  14 table_wine (0.03773585 0.09433962 0.86792453) *
##    3) alcohol< 10.75 709  67 table_wine (0.04372355 0.05077574 0.90550071) *
#All 1119 training examples begin at the root node; 410 of them have an alcohol level >= 10.75. Since alcohol was used for the first split, it is the most important predictor of wine quality. Nodes marked with (*) are terminal/leaf nodes; there are 14 terminal nodes (the sketch below confirms the count).
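#A quick sketch (not run here) to confirm the leaf count from the fitted object instead of counting asterisks by hand:
#sum(m3.rpart$frame$var == "<leaf>")  # leaves are the rows of the frame whose split variable is "<leaf>"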
#Detailed summary of the tree's fit, including the complexity parameter (CP) table, variable importance, and split details for each node.
summary(m3.rpart)
## Call:
## rpart(formula = quality ~ ., data = wine_red_3_train, method = "class")
##   n= 1119 
## 
##           CP nsplit rel error    xerror       xstd
## 1 0.03773585      0 1.0000000 1.0000000 0.06183305
## 2 0.01886792      3 0.8867925 0.9811321 0.06138185
## 3 0.01415094      8 0.7924528 1.0000000 0.06183305
## 4 0.01179245      9 0.7783019 1.0283019 0.06249420
## 5 0.01000000     13 0.7311321 1.0235849 0.06238529
## 
## Variable importance
##             alcohol           sulphates       fixed.acidity 
##                  28                  12                  12 
##           chlorides                  pH         citric.acid 
##                   9                   9                   8 
## free.sulfur.dioxide    volatile.acidity      residual.sugar 
##                   8                   7                   6 
## 
## Node number 1: 1119 observations,    complexity param=0.03773585
##   predicted class=table_wine        expected loss=0.1894549  P(node) =1
##     class counts:    46   166   907
##    probabilities: 0.041 0.148 0.811 
##   left son=2 (410 obs) right son=3 (709 obs)
##   Primary splits:
##       alcohol          < 10.75  to the right, improve=35.88262, (0 missing)
##       sulphates        < 0.685  to the right, improve=19.24283, (0 missing)
##       volatile.acidity < 0.335  to the left,  improve=18.03043, (0 missing)
##       chlorides        < 0.0675 to the left,  improve=16.76491, (0 missing)
##       citric.acid      < 0.315  to the right, improve=13.12394, (0 missing)
##   Surrogate splits:
##       chlorides        < 0.0685 to the left,  agree=0.700, adj=0.180, (0 split)
##       fixed.acidity    < 6.35   to the left,  agree=0.665, adj=0.085, (0 split)
##       volatile.acidity < 0.375  to the left,  agree=0.662, adj=0.078, (0 split)
##       pH               < 3.535  to the right, agree=0.650, adj=0.044, (0 split)
##       citric.acid      < 0.625  to the right, agree=0.640, adj=0.017, (0 split)
## 
## Node number 2: 410 observations,    complexity param=0.03773585
##   predicted class=table_wine        expected loss=0.3536585  P(node) =0.3663986
##     class counts:    15   130   265
##    probabilities: 0.037 0.317 0.646 
##   left son=4 (185 obs) right son=5 (225 obs)
##   Primary splits:
##       sulphates           < 0.685  to the right, improve=12.947140, (0 missing)
##       volatile.acidity    < 0.335  to the left,  improve=12.550450, (0 missing)
##       alcohol             < 11.55  to the right, improve= 8.438234, (0 missing)
##       citric.acid         < 0.295  to the right, improve= 7.599476, (0 missing)
##       free.sulfur.dioxide < 13.5   to the left,  improve= 4.976991, (0 missing)
##   Surrogate splits:
##       citric.acid         < 0.305  to the right, agree=0.666, adj=0.259, (0 split)
##       fixed.acidity       < 8.15   to the right, agree=0.654, adj=0.232, (0 split)
##       pH                  < 3.345  to the left,  agree=0.615, adj=0.146, (0 split)
##       volatile.acidity    < 0.345  to the left,  agree=0.612, adj=0.141, (0 split)
##       free.sulfur.dioxide < 15.5   to the right, agree=0.607, adj=0.130, (0 split)
## 
## Node number 3: 709 observations
##   predicted class=table_wine        expected loss=0.09449929  P(node) =0.6336014
##     class counts:    31    36   642
##    probabilities: 0.044 0.051 0.906 
## 
## Node number 4: 185 observations,    complexity param=0.03773585
##   predicted class=table_wine        expected loss=0.4702703  P(node) =0.1653262
##     class counts:     0    87    98
##    probabilities: 0.000 0.470 0.530 
##   left son=8 (80 obs) right son=9 (105 obs)
##   Primary splits:
##       alcohol             < 11.65  to the right, improve=9.106306, (0 missing)
##       chlorides           < 0.0785 to the left,  improve=6.198017, (0 missing)
##       volatile.acidity    < 0.335  to the left,  improve=5.270561, (0 missing)
##       free.sulfur.dioxide < 18.5   to the left,  improve=5.046786, (0 missing)
##       fixed.acidity       < 5.75   to the left,  improve=2.831280, (0 missing)
##   Surrogate splits:
##       chlorides           < 0.0545 to the left,  agree=0.649, adj=0.188, (0 split)
##       fixed.acidity       < 5.75   to the left,  agree=0.627, adj=0.138, (0 split)
##       citric.acid         < 0.095  to the left,  agree=0.622, adj=0.125, (0 split)
##       pH                  < 3.49   to the right, agree=0.611, adj=0.100, (0 split)
##       free.sulfur.dioxide < 13.5   to the left,  agree=0.600, adj=0.075, (0 split)
## 
## Node number 5: 225 observations,    complexity param=0.01179245
##   predicted class=table_wine        expected loss=0.2577778  P(node) =0.2010724
##     class counts:    15    43   167
##    probabilities: 0.067 0.191 0.742 
##   left son=10 (119 obs) right son=11 (106 obs)
##   Primary splits:
##       free.sulfur.dioxide < 11.5   to the left,  improve=5.211482, (0 missing)
##       volatile.acidity    < 0.355  to the left,  improve=4.652731, (0 missing)
##       residual.sugar      < 3.75   to the right, improve=4.350865, (0 missing)
##       citric.acid         < 0.265  to the left,  improve=3.323959, (0 missing)
##       pH                  < 3.285  to the right, improve=2.804444, (0 missing)
##   Surrogate splits:
##       fixed.acidity    < 6.75   to the right, agree=0.644, adj=0.245, (0 split)
##       pH               < 3.375  to the left,  agree=0.640, adj=0.236, (0 split)
##       citric.acid      < 0.295  to the right, agree=0.636, adj=0.226, (0 split)
##       volatile.acidity < 0.595  to the left,  agree=0.582, adj=0.113, (0 split)
##       residual.sugar   < 2.525  to the right, agree=0.573, adj=0.094, (0 split)
## 
## Node number 8: 80 observations,    complexity param=0.01886792
##   predicted class=spectacular_wine  expected loss=0.35  P(node) =0.0714924
##     class counts:     0    52    28
##    probabilities: 0.000 0.650 0.350 
##   left son=16 (56 obs) right son=17 (24 obs)
##   Primary splits:
##       free.sulfur.dioxide < 18.5   to the left,  improve=3.7333330, (0 missing)
##       chlorides           < 0.082  to the left,  improve=2.4504200, (0 missing)
##       alcohol             < 12.45  to the right, improve=2.1356540, (0 missing)
##       fixed.acidity       < 10.75  to the left,  improve=2.0360080, (0 missing)
##       citric.acid         < 0.435  to the left,  improve=0.8787018, (0 missing)
##   Surrogate splits:
##       residual.sugar < 6.475  to the left,  agree=0.738, adj=0.125, (0 split)
##       pH             < 3.59   to the left,  agree=0.738, adj=0.125, (0 split)
##       alcohol        < 13.8   to the left,  agree=0.738, adj=0.125, (0 split)
##       citric.acid    < 0.72   to the left,  agree=0.725, adj=0.083, (0 split)
## 
## Node number 9: 105 observations,    complexity param=0.01886792
##   predicted class=table_wine        expected loss=0.3333333  P(node) =0.09383378
##     class counts:     0    35    70
##    probabilities: 0.000 0.333 0.667 
##   left son=18 (54 obs) right son=19 (51 obs)
##   Primary splits:
##       volatile.acidity    < 0.4    to the left,  improve=4.880174, (0 missing)
##       chlorides           < 0.0745 to the left,  improve=3.089998, (0 missing)
##       free.sulfur.dioxide < 25.5   to the left,  improve=1.660784, (0 missing)
##       sulphates           < 0.885  to the right, improve=1.571930, (0 missing)
##       citric.acid         < 0.295  to the right, improve=1.496995, (0 missing)
##   Surrogate splits:
##       citric.acid    < 0.315  to the right, agree=0.724, adj=0.431, (0 split)
##       sulphates      < 0.765  to the right, agree=0.714, adj=0.412, (0 split)
##       chlorides      < 0.0755 to the left,  agree=0.657, adj=0.294, (0 split)
##       residual.sugar < 2.9    to the left,  agree=0.610, adj=0.196, (0 split)
##       fixed.acidity  < 7.1    to the right, agree=0.600, adj=0.176, (0 split)
## 
## Node number 10: 119 observations,    complexity param=0.01179245
##   predicted class=table_wine        expected loss=0.3697479  P(node) =0.106345
##     class counts:    11    33    75
##    probabilities: 0.092 0.277 0.630 
##   left son=20 (89 obs) right son=21 (30 obs)
##   Primary splits:
##       sulphates        < 0.545  to the right, improve=3.643175, (0 missing)
##       residual.sugar   < 3.525  to the right, improve=3.002174, (0 missing)
##       volatile.acidity < 0.665  to the right, improve=3.001409, (0 missing)
##       pH               < 3.43   to the right, improve=2.504788, (0 missing)
##       citric.acid      < 0.28   to the left,  improve=2.497845, (0 missing)
##   Surrogate splits:
##       volatile.acidity < 0.935  to the left,  agree=0.773, adj=0.100, (0 split)
##       chlorides        < 0.1225 to the left,  agree=0.773, adj=0.100, (0 split)
##       residual.sugar   < 1.675  to the right, agree=0.756, adj=0.033, (0 split)
## 
## Node number 11: 106 observations
##   predicted class=table_wine        expected loss=0.1320755  P(node) =0.09472744
##     class counts:     4    10    92
##    probabilities: 0.038 0.094 0.868 
## 
## Node number 16: 56 observations,    complexity param=0.01415094
##   predicted class=spectacular_wine  expected loss=0.25  P(node) =0.05004468
##     class counts:     0    42    14
##    probabilities: 0.000 0.750 0.250 
##   left son=32 (43 obs) right son=33 (13 obs)
##   Primary splits:
##       fixed.acidity       < 9.65   to the left,  improve=4.520572, (0 missing)
##       chlorides           < 0.087  to the left,  improve=2.389899, (0 missing)
##       citric.acid         < 0.435  to the left,  improve=2.333333, (0 missing)
##       free.sulfur.dioxide < 10.5   to the right, improve=1.834225, (0 missing)
##       residual.sugar      < 2.55   to the left,  improve=1.682788, (0 missing)
##   Surrogate splits:
##       citric.acid    < 0.57   to the left,  agree=0.893, adj=0.538, (0 split)
##       residual.sugar < 2.875  to the left,  agree=0.893, adj=0.538, (0 split)
##       chlorides      < 0.083  to the left,  agree=0.875, adj=0.462, (0 split)
##       pH             < 3.125  to the right, agree=0.821, adj=0.231, (0 split)
##       sulphates      < 0.925  to the left,  agree=0.786, adj=0.077, (0 split)
## 
## Node number 17: 24 observations,    complexity param=0.01886792
##   predicted class=table_wine        expected loss=0.4166667  P(node) =0.02144772
##     class counts:     0    10    14
##    probabilities: 0.000 0.417 0.583 
##   left son=34 (8 obs) right son=35 (16 obs)
##   Primary splits:
##       residual.sugar      < 2.35   to the right, improve=2.666667, (0 missing)
##       volatile.acidity    < 0.385  to the right, improve=2.240093, (0 missing)
##       pH                  < 3.305  to the left,  improve=1.800000, (0 missing)
##       alcohol             < 11.95  to the right, improve=1.481793, (0 missing)
##       free.sulfur.dioxide < 27.5   to the right, improve=1.333333, (0 missing)
##   Surrogate splits:
##       fixed.acidity < 8.55   to the right, agree=0.833, adj=0.500, (0 split)
##       pH            < 3.2    to the left,  agree=0.833, adj=0.500, (0 split)
##       citric.acid   < 0.395  to the right, agree=0.792, adj=0.375, (0 split)
##       chlorides     < 0.073  to the right, agree=0.750, adj=0.250, (0 split)
## 
## Node number 18: 54 observations,    complexity param=0.01886792
##   predicted class=table_wine        expected loss=0.4814815  P(node) =0.04825737
##     class counts:     0    26    28
##    probabilities: 0.000 0.481 0.519 
##   left son=36 (22 obs) right son=37 (32 obs)
##   Primary splits:
##       pH                  < 3.265  to the left,  improve=2.980008, (0 missing)
##       chlorides           < 0.0745 to the left,  improve=2.514155, (0 missing)
##       sulphates           < 0.735  to the right, improve=2.480933, (0 missing)
##       alcohol             < 11.35  to the left,  improve=2.386876, (0 missing)
##       free.sulfur.dioxide < 28     to the left,  improve=1.944781, (0 missing)
##   Surrogate splits:
##       free.sulfur.dioxide < 15.5   to the left,  agree=0.815, adj=0.545, (0 split)
##       fixed.acidity       < 9.7    to the right, agree=0.778, adj=0.455, (0 split)
##       citric.acid         < 0.525  to the right, agree=0.685, adj=0.227, (0 split)
##       chlorides           < 0.0635 to the left,  agree=0.667, adj=0.182, (0 split)
##       sulphates           < 0.88   to the right, agree=0.648, adj=0.136, (0 split)
## 
## Node number 19: 51 observations
##   predicted class=table_wine        expected loss=0.1764706  P(node) =0.04557641
##     class counts:     0     9    42
##    probabilities: 0.000 0.176 0.824 
## 
## Node number 20: 89 observations,    complexity param=0.01179245
##   predicted class=table_wine        expected loss=0.4494382  P(node) =0.0795353
##     class counts:    10    30    49
##    probabilities: 0.112 0.337 0.551 
##   left son=40 (26 obs) right son=41 (63 obs)
##   Primary splits:
##       alcohol          < 11.95  to the right, improve=3.324978, (0 missing)
##       chlorides        < 0.1085 to the left,  improve=3.286517, (0 missing)
##       fixed.acidity    < 7.35   to the left,  improve=3.183172, (0 missing)
##       volatile.acidity < 0.3125 to the left,  improve=3.038265, (0 missing)
##       citric.acid      < 0.54   to the left,  improve=2.779310, (0 missing)
##   Surrogate splits:
##       residual.sugar < 2.85   to the right, agree=0.764, adj=0.192, (0 split)
##       fixed.acidity  < 5.85   to the left,  agree=0.742, adj=0.115, (0 split)
##       chlorides      < 0.121  to the right, agree=0.742, adj=0.115, (0 split)
##       pH             < 3.665  to the right, agree=0.719, adj=0.038, (0 split)
## 
## Node number 21: 30 observations
##   predicted class=table_wine        expected loss=0.1333333  P(node) =0.02680965
##     class counts:     1     3    26
##    probabilities: 0.033 0.100 0.867 
## 
## Node number 32: 43 observations
##   predicted class=spectacular_wine  expected loss=0.1395349  P(node) =0.03842717
##     class counts:     0    37     6
##    probabilities: 0.000 0.860 0.140 
## 
## Node number 33: 13 observations
##   predicted class=table_wine        expected loss=0.3846154  P(node) =0.01161752
##     class counts:     0     5     8
##    probabilities: 0.000 0.385 0.615 
## 
## Node number 34: 8 observations
##   predicted class=spectacular_wine  expected loss=0.25  P(node) =0.00714924
##     class counts:     0     6     2
##    probabilities: 0.000 0.750 0.250 
## 
## Node number 35: 16 observations
##   predicted class=table_wine        expected loss=0.25  P(node) =0.01429848
##     class counts:     0     4    12
##    probabilities: 0.000 0.250 0.750 
## 
## Node number 36: 22 observations
##   predicted class=spectacular_wine  expected loss=0.3181818  P(node) =0.01966041
##     class counts:     0    15     7
##    probabilities: 0.000 0.682 0.318 
## 
## Node number 37: 32 observations,    complexity param=0.01886792
##   predicted class=table_wine        expected loss=0.34375  P(node) =0.02859696
##     class counts:     0    11    21
##    probabilities: 0.000 0.344 0.656 
##   left son=74 (8 obs) right son=75 (24 obs)
##   Primary splits:
##       pH             < 3.395  to the right, improve=3.520833, (0 missing)
##       sulphates      < 0.775  to the right, improve=2.959239, (0 missing)
##       residual.sugar < 1.85   to the right, improve=2.117500, (0 missing)
##       chlorides      < 0.075  to the left,  improve=2.008929, (0 missing)
##       fixed.acidity  < 7.85   to the left,  improve=1.660172, (0 missing)
##   Surrogate splits:
##       free.sulfur.dioxide < 40     to the right, agree=0.812, adj=0.250, (0 split)
##       residual.sugar      < 1.55   to the left,  agree=0.781, adj=0.125, (0 split)
## 
## Node number 40: 26 observations,    complexity param=0.01179245
##   predicted class=spectacular_wine  expected loss=0.4230769  P(node) =0.02323503
##     class counts:     1    15    10
##    probabilities: 0.038 0.577 0.385 
##   left son=80 (15 obs) right son=81 (11 obs)
##   Primary splits:
##       fixed.acidity    < 9      to the left,  improve=4.031235, (0 missing)
##       residual.sugar   < 2.45   to the left,  improve=2.186538, (0 missing)
##       volatile.acidity < 0.375  to the left,  improve=1.922145, (0 missing)
##       alcohol          < 12.45  to the left,  improve=1.922145, (0 missing)
##       sulphates        < 0.65   to the left,  improve=1.867553, (0 missing)
##   Surrogate splits:
##       residual.sugar   < 2.7    to the left,  agree=0.846, adj=0.636, (0 split)
##       citric.acid      < 0.425  to the left,  agree=0.769, adj=0.455, (0 split)
##       sulphates        < 0.635  to the left,  agree=0.769, adj=0.455, (0 split)
##       volatile.acidity < 0.475  to the right, agree=0.731, adj=0.364, (0 split)
##       chlorides        < 0.1075 to the left,  agree=0.731, adj=0.364, (0 split)
## 
## Node number 41: 63 observations
##   predicted class=table_wine        expected loss=0.3809524  P(node) =0.05630027
##     class counts:     9    15    39
##    probabilities: 0.143 0.238 0.619 
## 
## Node number 74: 8 observations
##   predicted class=spectacular_wine  expected loss=0.25  P(node) =0.00714924
##     class counts:     0     6     2
##    probabilities: 0.000 0.750 0.250 
## 
## Node number 75: 24 observations
##   predicted class=table_wine        expected loss=0.2083333  P(node) =0.02144772
##     class counts:     0     5    19
##    probabilities: 0.000 0.208 0.792 
## 
## Node number 80: 15 observations
##   predicted class=spectacular_wine  expected loss=0.2  P(node) =0.01340483
##     class counts:     1    12     2
##    probabilities: 0.067 0.800 0.133 
## 
## Node number 81: 11 observations
##   predicted class=table_wine        expected loss=0.2727273  P(node) =0.009830206
##     class counts:     0     3     8
##    probabilities: 0.000 0.273 0.727
#Variable importance: alcohol first, followed by sulphates and fixed.acidity (see the sketch below for extracting these directly).
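#A quick sketch (not run here) to pull the importance scores straight from the fitted object instead of reading them off summary():
#m3.rpart$variable.importance                                                   # named vector, highest first
#round(100 * m3.rpart$variable.importance / sum(m3.rpart$variable.importance))  # roughly the scaling summary() reports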
#2 quality classes: 
m2.rpart <- rpart(quality~ ., data = wine_red_2_train, method = "class")
m2.rpart #basic information
## n= 1119 
## 
## node), split, n, loss, yval, (yprob)
##       * denotes terminal node
## 
##   1) root 1119 526 High (0.52993744 0.47006256)  
##     2) alcohol>=10.15 556 148 High (0.73381295 0.26618705)  
##       4) alcohol>=11.45 186  18 High (0.90322581 0.09677419) *
##       5) alcohol< 11.45 370 130 High (0.64864865 0.35135135)  
##        10) sulphates>=0.575 292  79 High (0.72945205 0.27054795)  
##          20) residual.sugar< 4.1 268  63 High (0.76492537 0.23507463) *
##          21) residual.sugar>=4.1 24   8 Low (0.33333333 0.66666667) *
##        11) sulphates< 0.575 78  27 Low (0.34615385 0.65384615) *
##     3) alcohol< 10.15 563 185 Low (0.32859680 0.67140320)  
##       6) sulphates>=0.575 308 140 Low (0.45454545 0.54545455)  
##        12) fixed.acidity>=10.75 30   2 High (0.93333333 0.06666667) *
##        13) fixed.acidity< 10.75 278 112 Low (0.40287770 0.59712230)  
##          26) volatile.acidity< 0.6525 220 102 Low (0.46363636 0.53636364)  
##            52) free.sulfur.dioxide< 26.5 186  91 High (0.51075269 0.48924731)  
##             104) alcohol>=9.85 31   5 High (0.83870968 0.16129032) *
##             105) alcohol< 9.85 155  69 Low (0.44516129 0.55483871) *
##            53) free.sulfur.dioxide>=26.5 34   7 Low (0.20588235 0.79411765) *
##          27) volatile.acidity>=0.6525 58  10 Low (0.17241379 0.82758621) *
##       7) sulphates< 0.575 255  45 Low (0.17647059 0.82352941) *
#Same as before: alcohol is the strongest predictor, followed by sulphates, but the splits now carry larger member counts since there are only two, bigger classes.
#Detailed summary of the tree's fit, including the CP table, variable importance, and split details for each node.
summary(m2.rpart)
## Call:
## rpart(formula = quality ~ ., data = wine_red_2_train, method = "class")
##   n= 1119 
## 
##           CP nsplit rel error    xerror       xstd
## 1 0.36692015      0 1.0000000 1.0000000 0.03174091
## 2 0.02471483      1 0.6330798 0.6539924 0.02934467
## 3 0.02281369      3 0.5836502 0.6292776 0.02902527
## 4 0.01520913      5 0.5380228 0.6178707 0.02887029
## 5 0.01330798      6 0.5228137 0.5893536 0.02846127
## 6 0.01000000      9 0.4828897 0.5779468 0.02828882
## 
## Variable importance
##             alcohol           sulphates    volatile.acidity 
##                  37                  20                  11 
##           chlorides         citric.acid       fixed.acidity 
##                   8                   7                   6 
##                  pH      residual.sugar free.sulfur.dioxide 
##                   5                   3                   2 
## 
## Node number 1: 1119 observations,    complexity param=0.3669202
##   predicted class=High  expected loss=0.4700626  P(node) =1
##     class counts:   593   526
##    probabilities: 0.530 0.470 
##   left son=2 (556 obs) right son=3 (563 obs)
##   Primary splits:
##       alcohol          < 10.15  to the right, improve=91.86638, (0 missing)
##       sulphates        < 0.645  to the right, improve=66.89763, (0 missing)
##       volatile.acidity < 0.5875 to the left,  improve=54.79112, (0 missing)
##       citric.acid      < 0.295  to the right, improve=23.48732, (0 missing)
##       chlorides        < 0.0695 to the left,  improve=19.34351, (0 missing)
##   Surrogate splits:
##       chlorides        < 0.0725 to the left,  agree=0.631, adj=0.257, (0 split)
##       sulphates        < 0.625  to the right, agree=0.631, adj=0.257, (0 split)
##       volatile.acidity < 0.4725 to the left,  agree=0.618, adj=0.232, (0 split)
##       citric.acid      < 0.315  to the right, agree=0.605, adj=0.205, (0 split)
##       pH               < 3.315  to the right, agree=0.576, adj=0.147, (0 split)
## 
## Node number 2: 556 observations,    complexity param=0.02281369
##   predicted class=High  expected loss=0.2661871  P(node) =0.4968722
##     class counts:   408   148
##    probabilities: 0.734 0.266 
##   left son=4 (186 obs) right son=5 (370 obs)
##   Primary splits:
##       alcohol          < 11.45  to the right, improve=16.043860, (0 missing)
##       sulphates        < 0.645  to the right, improve=15.998100, (0 missing)
##       volatile.acidity < 0.6225 to the left,  improve=14.185950, (0 missing)
##       pH               < 3.485  to the left,  improve= 8.016034, (0 missing)
##       citric.acid      < 0.275  to the right, improve= 7.036869, (0 missing)
##   Surrogate splits:
##       chlorides        < 0.0545 to the left,  agree=0.707, adj=0.124, (0 split)
##       fixed.acidity    < 5.85   to the left,  agree=0.698, adj=0.097, (0 split)
##       pH               < 3.605  to the right, agree=0.687, adj=0.065, (0 split)
##       volatile.acidity < 0.19   to the left,  agree=0.678, adj=0.038, (0 split)
##       citric.acid      < 0.635  to the right, agree=0.678, adj=0.038, (0 split)
## 
## Node number 3: 563 observations,    complexity param=0.02471483
##   predicted class=Low   expected loss=0.3285968  P(node) =0.5031278
##     class counts:   185   378
##    probabilities: 0.329 0.671 
##   left son=6 (308 obs) right son=7 (255 obs)
##   Primary splits:
##       sulphates           < 0.575  to the right, improve=21.574260, (0 missing)
##       volatile.acidity    < 0.3175 to the left,  improve=17.073440, (0 missing)
##       fixed.acidity       < 10.75  to the right, improve=13.532280, (0 missing)
##       free.sulfur.dioxide < 23.5   to the left,  improve= 7.127815, (0 missing)
##       alcohol             < 9.85   to the right, improve= 6.362737, (0 missing)
##   Surrogate splits:
##       volatile.acidity    < 0.5675 to the left,  agree=0.616, adj=0.153, (0 split)
##       citric.acid         < 0.225  to the right, agree=0.606, adj=0.129, (0 split)
##       chlorides           < 0.0895 to the right, agree=0.570, adj=0.051, (0 split)
##       free.sulfur.dioxide < 27.5   to the left,  agree=0.563, adj=0.035, (0 split)
##       fixed.acidity       < 6.05   to the right, agree=0.560, adj=0.027, (0 split)
## 
## Node number 4: 186 observations
##   predicted class=High  expected loss=0.09677419  P(node) =0.1662198
##     class counts:   168    18
##    probabilities: 0.903 0.097 
## 
## Node number 5: 370 observations,    complexity param=0.02281369
##   predicted class=High  expected loss=0.3513514  P(node) =0.3306524
##     class counts:   240   130
##    probabilities: 0.649 0.351 
##   left son=10 (292 obs) right son=11 (78 obs)
##   Primary splits:
##       sulphates        < 0.575  to the right, improve=18.087530, (0 missing)
##       volatile.acidity < 0.605  to the left,  improve=10.380030, (0 missing)
##       pH               < 3.445  to the left,  improve= 8.225002, (0 missing)
##       residual.sugar   < 4.1    to the left,  improve= 5.886307, (0 missing)
##       chlorides        < 0.0945 to the left,  improve= 5.702395, (0 missing)
##   Surrogate splits:
##       volatile.acidity < 0.8325 to the left,  agree=0.805, adj=0.077, (0 split)
##       residual.sugar   < 11.2   to the left,  agree=0.795, adj=0.026, (0 split)
## 
## Node number 6: 308 observations,    complexity param=0.02471483
##   predicted class=Low   expected loss=0.4545455  P(node) =0.2752458
##     class counts:   140   168
##    probabilities: 0.455 0.545 
##   left son=12 (30 obs) right son=13 (278 obs)
##   Primary splits:
##       fixed.acidity       < 10.75  to the right, improve=15.238540, (0 missing)
##       volatile.acidity    < 0.3175 to the left,  improve=12.467590, (0 missing)
##       free.sulfur.dioxide < 23.5   to the left,  improve= 7.481423, (0 missing)
##       alcohol             < 9.85   to the right, improve= 6.573584, (0 missing)
##       chlorides           < 0.0975 to the left,  improve= 5.852814, (0 missing)
##   Surrogate splits:
##       volatile.acidity < 0.215  to the left,  agree=0.909, adj=0.067, (0 split)
##       pH               < 2.905  to the left,  agree=0.906, adj=0.033, (0 split)
## 
## Node number 7: 255 observations
##   predicted class=Low   expected loss=0.1764706  P(node) =0.227882
##     class counts:    45   210
##    probabilities: 0.176 0.824 
## 
## Node number 10: 292 observations,    complexity param=0.01520913
##   predicted class=High  expected loss=0.2705479  P(node) =0.2609473
##     class counts:   213    79
##    probabilities: 0.729 0.271 
##   left son=20 (268 obs) right son=21 (24 obs)
##   Primary splits:
##       residual.sugar < 4.1    to the left,  improve=8.206161, (0 missing)
##       sulphates      < 0.745  to the right, improve=6.355323, (0 missing)
##       chlorides      < 0.0945 to the left,  improve=5.570401, (0 missing)
##       alcohol        < 10.55  to the right, improve=5.338315, (0 missing)
##       pH             < 3.485  to the left,  improve=4.599881, (0 missing)
##   Surrogate splits:
##       fixed.acidity < 14.75  to the left,  agree=0.925, adj=0.083, (0 split)
##       pH            < 2.93   to the right, agree=0.921, adj=0.042, (0 split)
## 
## Node number 11: 78 observations
##   predicted class=Low   expected loss=0.3461538  P(node) =0.06970509
##     class counts:    27    51
##    probabilities: 0.346 0.654 
## 
## Node number 12: 30 observations
##   predicted class=High  expected loss=0.06666667  P(node) =0.02680965
##     class counts:    28     2
##    probabilities: 0.933 0.067 
## 
## Node number 13: 278 observations,    complexity param=0.01330798
##   predicted class=Low   expected loss=0.4028777  P(node) =0.2484361
##     class counts:   112   166
##    probabilities: 0.403 0.597 
##   left son=26 (220 obs) right son=27 (58 obs)
##   Primary splits:
##       volatile.acidity    < 0.6525 to the left,  improve=7.785490, (0 missing)
##       free.sulfur.dioxide < 26.5   to the left,  improve=4.852455, (0 missing)
##       alcohol             < 9.85   to the right, improve=3.875205, (0 missing)
##       pH                  < 3.205  to the right, improve=3.437188, (0 missing)
##       chlorides           < 0.359  to the left,  improve=3.239162, (0 missing)
##   Surrogate splits:
##       citric.acid    < 0.005  to the right, agree=0.795, adj=0.017, (0 split)
##       residual.sugar < 1.35   to the right, agree=0.795, adj=0.017, (0 split)
## 
## Node number 20: 268 observations
##   predicted class=High  expected loss=0.2350746  P(node) =0.2394996
##     class counts:   205    63
##    probabilities: 0.765 0.235 
## 
## Node number 21: 24 observations
##   predicted class=Low   expected loss=0.3333333  P(node) =0.02144772
##     class counts:     8    16
##    probabilities: 0.333 0.667 
## 
## Node number 26: 220 observations,    complexity param=0.01330798
##   predicted class=Low   expected loss=0.4636364  P(node) =0.1966041
##     class counts:   102   118
##    probabilities: 0.464 0.536 
##   left son=52 (186 obs) right son=53 (34 obs)
##   Primary splits:
##       free.sulfur.dioxide < 26.5   to the left,  improve=5.343546, (0 missing)
##       volatile.acidity    < 0.315  to the left,  improve=5.229564, (0 missing)
##       alcohol             < 9.85   to the right, improve=4.846442, (0 missing)
##       citric.acid         < 0.005  to the left,  improve=4.595215, (0 missing)
##       fixed.acidity       < 8.75   to the left,  improve=3.363290, (0 missing)
##   Surrogate splits:
##       residual.sugar < 12.4   to the left,  agree=0.855, adj=0.059, (0 split)
##       alcohol        < 8.9    to the right, agree=0.855, adj=0.059, (0 split)
## 
## Node number 27: 58 observations
##   predicted class=Low   expected loss=0.1724138  P(node) =0.05183199
##     class counts:    10    48
##    probabilities: 0.172 0.828 
## 
## Node number 52: 186 observations,    complexity param=0.01330798
##   predicted class=High  expected loss=0.4892473  P(node) =0.1662198
##     class counts:    95    91
##    probabilities: 0.511 0.489 
##   left son=104 (31 obs) right son=105 (155 obs)
##   Primary splits:
##       alcohol          < 9.85   to the right, improve=8.002151, (0 missing)
##       volatile.acidity < 0.375  to the left,  improve=4.504159, (0 missing)
##       citric.acid      < 0.005  to the left,  improve=3.710236, (0 missing)
##       sulphates        < 0.665  to the right, improve=3.477504, (0 missing)
##       chlorides        < 0.363  to the left,  improve=3.021019, (0 missing)
##   Surrogate splits:
##       pH               < 2.97   to the left,  agree=0.844, adj=0.065, (0 split)
##       sulphates        < 1.755  to the right, agree=0.844, adj=0.065, (0 split)
##       volatile.acidity < 0.255  to the left,  agree=0.839, adj=0.032, (0 split)
##       residual.sugar   < 4.9    to the right, agree=0.839, adj=0.032, (0 split)
## 
## Node number 53: 34 observations
##   predicted class=Low   expected loss=0.2058824  P(node) =0.03038427
##     class counts:     7    27
##    probabilities: 0.206 0.794 
## 
## Node number 104: 31 observations
##   predicted class=High  expected loss=0.1612903  P(node) =0.02770331
##     class counts:    26     5
##    probabilities: 0.839 0.161 
## 
## Node number 105: 155 observations
##   predicted class=Low   expected loss=0.4451613  P(node) =0.1385165
##     class counts:    69    86
##    probabilities: 0.445 0.555
#Variable importance again puts alcohol first, this time followed by sulphates and then volatile.acidity.

Visualizing decision trees

# Here are three different graphical representations of the 3-class model, plus a customized plot of the 2-class model.
# Plot the wine_red_3_model with default settings
rpart.plot(m3.rpart, digits = 3)

#plot the wine_red_3_model with fancyRpartPlot that comes with rattle package
fancyRpartPlot(m3.rpart)

# Plot the m3.rpart and m2.rpart with customized settings
rpart.plot(m3.rpart, digits = 1, type = 2, box.palette = list("Greens", "Oranges", "Blues"), fallen.leaves = TRUE, extra = 101) # 3 colors in palette - three types of wine quality

rpart.plot(m2.rpart, digits = 1, type = 2, box.palette = list("Reds", "Blues"), fallen.leaves = TRUE, extra = 101)  # 2 colors in palette - two types of wine quality

#I prefer the customized settings; in my opinion the type = 2 option in particular provides much better readability. In the 3-class visualization, cooking_wine is represented in green.

Model Evaluation

# Make predictions on the test dataset
p3.rpart <- predict(m3.rpart, wine_red_3_test[,-10], type = "class") #drop the class column [,-10]; quality is column 10 now that density and total.sulfur.dioxide were removed.
p2.rpart <- predict(m2.rpart, wine_red_2_test[,-10], type = "class")

# Examine the confusion matrix
table(p3.rpart, wine_red_3_test$quality)
##                   
## p3.rpart           cooking_wine spectacular_wine table_wine
##   cooking_wine                0                0          0
##   spectacular_wine            1               22         11
##   table_wine                 16               29        401
table(p2.rpart, wine_red_2_test$quality) 
##         
## p2.rpart High Low
##     High  182  48
##     Low    80 170
# 3-class results: correctly predicted is 22 + 401 = 423; incorrect is 57 - better than the 2-class model. The class distribution of the sample set seems to have a direct effect on the prediction results.
# 2-class results: false negatives: 80, false positives: 48. Correctly predicted is 182 + 170 = 352. Prediction accuracy is noticeably lower, with many false negatives (the sketch below derives these counts from the tables).
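#A quick sketch (not run here) deriving these counts from the confusion tables; tab3 and tab2 are illustrative names.
#tab3 <- table(p3.rpart, wine_red_3_test$quality)
#sum(diag(tab3)) / sum(tab3)   # 3-class accuracy (423/480)
#tab2 <- table(p2.rpart, wine_red_2_test$quality)
#sum(diag(tab2)) / sum(tab2)   # 2-class accuracy (352/480)
#tab2["Low", "High"]           # false negatives: actual High predicted Low (80)
#tab2["High", "Low"]           # false positives: actual Low predicted High (48)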

Accuracy of decision trees by mean, confusionMatrix and Cross tabulation

# Compute the accuracy on the test dataset using mean
mean(p3.rpart == wine_red_3_test$quality)
## [1] 0.88125
mean(p2.rpart == wine_red_2_test$quality)
## [1] 0.7333333
#Accuracy is about 88% and 73% for the 3-class and 2-class models, respectively. These are not terrible, but ideally the goal should be to reach around 95% accuracy.
# Here is more information using ConfusionMatrix
confusionMatrix(wine_red_3_test$quality,p3.rpart)
## Confusion Matrix and Statistics
## 
##                   Reference
## Prediction         cooking_wine spectacular_wine table_wine
##   cooking_wine                0                1         16
##   spectacular_wine            0               22         29
##   table_wine                  0               11        401
## 
## Overall Statistics
##                                           
##                Accuracy : 0.8812          
##                  95% CI : (0.8489, 0.9088)
##     No Information Rate : 0.9292          
##     P-Value [Acc > NIR] : 0.9999          
##                                           
##                   Kappa : 0.3908          
##                                           
##  Mcnemar's Test P-Value : 1.471e-05       
## 
## Statistics by Class:
## 
##                      Class: cooking_wine Class: spectacular_wine
## Sensitivity                           NA                 0.64706
## Specificity                      0.96458                 0.93498
## Pos Pred Value                        NA                 0.43137
## Neg Pred Value                        NA                 0.97203
## Prevalence                       0.00000                 0.07083
## Detection Rate                   0.00000                 0.04583
## Detection Prevalence             0.03542                 0.10625
## Balanced Accuracy                     NA                 0.79102
##                      Class: table_wine
## Sensitivity                     0.8991
## Specificity                     0.6765
## Pos Pred Value                  0.9733
## Neg Pred Value                  0.3382
## Prevalence                      0.9292
## Detection Rate                  0.8354
## Detection Prevalence            0.8583
## Balanced Accuracy               0.7878
confusionMatrix(wine_red_2_test$quality,p2.rpart)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction High Low
##       High  182  80
##       Low    48 170
##                                           
##                Accuracy : 0.7333          
##                  95% CI : (0.6914, 0.7724)
##     No Information Rate : 0.5208          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.4687          
##                                           
##  Mcnemar's Test P-Value : 0.006143        
##                                           
##             Sensitivity : 0.7913          
##             Specificity : 0.6800          
##          Pos Pred Value : 0.6947          
##          Neg Pred Value : 0.7798          
##              Prevalence : 0.4792          
##          Detection Rate : 0.3792          
##    Detection Prevalence : 0.5458          
##       Balanced Accuracy : 0.7357          
##                                           
##        'Positive' Class : High            
## 
# Cross tabulation of predicted versus actual classes for a more structured view
#library(gmodels)
CrossTable(wine_red_3_test$quality, p3.rpart,
           prop.chisq = FALSE, prop.c = FALSE, prop.r = FALSE, #prop.c & prop.r= FALSE removes column & row percentages
           dnn = c('actual quality', 'predicted quality'))
## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  480 
## 
##  
##                  | predicted quality 
##   actual quality | spectacular_wine |       table_wine |        Row Total | 
## -----------------|------------------|------------------|------------------|
##     cooking_wine |                1 |               16 |               17 | 
##                  |            0.002 |            0.033 |                  | 
## -----------------|------------------|------------------|------------------|
## spectacular_wine |               22 |               29 |               51 | 
##                  |            0.046 |            0.060 |                  | 
## -----------------|------------------|------------------|------------------|
##       table_wine |               11 |              401 |              412 | 
##                  |            0.023 |            0.835 |                  | 
## -----------------|------------------|------------------|------------------|
##     Column Total |               34 |              446 |              480 | 
## -----------------|------------------|------------------|------------------|
## 
## 
CrossTable(wine_red_2_test$quality, p2.rpart,
           prop.chisq = FALSE, prop.c = FALSE, prop.r = FALSE, 
           dnn = c('actual quality', 'predicted quality'))
## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  480 
## 
##  
##                | predicted quality 
## actual quality |      High |       Low | Row Total | 
## ---------------|-----------|-----------|-----------|
##           High |       182 |        80 |       262 | 
##                |     0.379 |     0.167 |           | 
## ---------------|-----------|-----------|-----------|
##            Low |        48 |       170 |       218 | 
##                |     0.100 |     0.354 |           | 
## ---------------|-----------|-----------|-----------|
##   Column Total |       230 |       250 |       480 | 
## ---------------|-----------|-----------|-----------|
## 
## 
#Summary: 
# 3-class: 423 out of 480 records are correctly predicted (about 88%). Accuracy is higher than the 2-class model.
# 2-class (High/Low): 128 out of 480 records are incorrectly predicted, giving about 73% accuracy. Also, false negatives are fairly high at 80. Next is to prune the trees and try to improve the predictions.

2) Prune the trees. What is the accuracy rate of the pruned trees? Plot pruned trees and compare with the plot from 1)

#Pruning the classification trees to avoid over-fitting and produce more robust classification models.

#First
printcp(m3.rpart) # CP table - look for the row with the minimum xerror
## 
## Classification tree:
## rpart(formula = quality ~ ., data = wine_red_3_train, method = "class")
## 
## Variables actually used in tree construction:
## [1] alcohol             fixed.acidity       free.sulfur.dioxide
## [4] pH                  residual.sugar      sulphates          
## [7] volatile.acidity   
## 
## Root node error: 212/1119 = 0.18945
## 
## n= 1119 
## 
##         CP nsplit rel error  xerror     xstd
## 1 0.037736      0   1.00000 1.00000 0.061833
## 2 0.018868      3   0.88679 0.98113 0.061382
## 3 0.014151      8   0.79245 1.00000 0.061833
## 4 0.011792      9   0.77830 1.02830 0.062494
## 5 0.010000     13   0.73113 1.02358 0.062385
min(m3.rpart$cptable[,"xerror"]) #find the minimum cross-validation error of the classification tree
## [1] 0.9811321
printcp(m2.rpart)
## 
## Classification tree:
## rpart(formula = quality ~ ., data = wine_red_2_train, method = "class")
## 
## Variables actually used in tree construction:
## [1] alcohol             fixed.acidity       free.sulfur.dioxide
## [4] residual.sugar      sulphates           volatile.acidity   
## 
## Root node error: 526/1119 = 0.47006
## 
## n= 1119 
## 
##         CP nsplit rel error  xerror     xstd
## 1 0.366920      0   1.00000 1.00000 0.031741
## 2 0.024715      1   0.63308 0.65399 0.029345
## 3 0.022814      3   0.58365 0.62928 0.029025
## 4 0.015209      5   0.53802 0.61787 0.028870
## 5 0.013308      6   0.52281 0.58935 0.028461
## 6 0.010000      9   0.48289 0.57795 0.028289
min(m2.rpart$cptable[,"xerror"])
## [1] 0.5779468
opt <- which.min(m3.rpart$cptable[,"xerror"]) #locate the record with the minimum cross validation errors.
cp3 <- m3.rpart$cptable[opt,"CP"]  # get the cost complexity parameter of the record with the minimum cross validation errors.
cp3  #nsplit - 3
## [1] 0.01886792
opt <- which.min(m2.rpart$cptable[,"xerror"]) 
cp2 <- m2.rpart$cptable[opt,"CP"]  
cp2  #nsplit - 9; this equals rpart's default cp of 0.01, so pruning at cp2 leaves the 2 class tree unchanged
## [1] 0.01
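An alternative to taking the minimum xerror is the commonly used one-standard-error rule: choose the simplest tree whose cross-validation error is within one xstd of the minimum. A minimal sketch (the helper name cp_1se is my own, not part of rpart):

# Sketch: 1-SE rule for picking the pruning cp from the cptable
cp_1se <- function(fit) {
  tab  <- fit$cptable
  best <- which.min(tab[, "xerror"])
  cut  <- tab[best, "xerror"] + tab[best, "xstd"]
  tab[which(tab[, "xerror"] <= cut)[1], "CP"]  # first (simplest) row under the cutoff
}
cp_1se(m3.rpart)  # compare against cp3 above
cp_1se(m2.rpart)  # compare against cp2 above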

Pruning decision trees

#3 class:
# Opting to use post-pruning with rpart - the model is grown fully and then pruned, so useful complexity is not lost by pruning too early. Branches that have only a minor impact on the tree's overall accuracy, i.e. that do not substantially improve classification accuracy, are removed after the fact.
m3pru.rpart <- prune(m3.rpart, cp3) #cp is the prune function's complexity parameter, used to create a simpler pruned tree; this value was calculated above
m3pru.rpart # only 4 terminal nodes - pruned!
## n= 1119 
## 
## node), split, n, loss, yval, (yprob)
##       * denotes terminal node
## 
## 1) root 1119 212 table_wine (0.04110813 0.14834674 0.81054513)  
##   2) alcohol>=10.75 410 145 table_wine (0.03658537 0.31707317 0.64634146)  
##     4) sulphates>=0.685 185  87 table_wine (0.00000000 0.47027027 0.52972973)  
##       8) alcohol>=11.65 80  28 spectacular_wine (0.00000000 0.65000000 0.35000000) *
##       9) alcohol< 11.65 105  35 table_wine (0.00000000 0.33333333 0.66666667) *
##     5) sulphates< 0.685 225  58 table_wine (0.06666667 0.19111111 0.74222222) *
##   3) alcohol< 10.75 709  67 table_wine (0.04372355 0.05077574 0.90550071) *
#2 class:
m2pru.rpart <- prune(m2.rpart, cp2) 
m2pru.rpart # 10 terminal nodes; because cp2 equals the default cp of 0.01, this tree is identical to the unpruned 2 class tree.
## n= 1119 
## 
## node), split, n, loss, yval, (yprob)
##       * denotes terminal node
## 
##   1) root 1119 526 High (0.52993744 0.47006256)  
##     2) alcohol>=10.15 556 148 High (0.73381295 0.26618705)  
##       4) alcohol>=11.45 186  18 High (0.90322581 0.09677419) *
##       5) alcohol< 11.45 370 130 High (0.64864865 0.35135135)  
##        10) sulphates>=0.575 292  79 High (0.72945205 0.27054795)  
##          20) residual.sugar< 4.1 268  63 High (0.76492537 0.23507463) *
##          21) residual.sugar>=4.1 24   8 Low (0.33333333 0.66666667) *
##        11) sulphates< 0.575 78  27 Low (0.34615385 0.65384615) *
##     3) alcohol< 10.15 563 185 Low (0.32859680 0.67140320)  
##       6) sulphates>=0.575 308 140 Low (0.45454545 0.54545455)  
##        12) fixed.acidity>=10.75 30   2 High (0.93333333 0.06666667) *
##        13) fixed.acidity< 10.75 278 112 Low (0.40287770 0.59712230)  
##          26) volatile.acidity< 0.6525 220 102 Low (0.46363636 0.53636364)  
##            52) free.sulfur.dioxide< 26.5 186  91 High (0.51075269 0.48924731)  
##             104) alcohol>=9.85 31   5 High (0.83870968 0.16129032) *
##             105) alcohol< 9.85 155  69 Low (0.44516129 0.55483871) *
##            53) free.sulfur.dioxide>=26.5 34   7 Low (0.20588235 0.79411765) *
##          27) volatile.acidity>=0.6525 58  10 Low (0.17241379 0.82758621) *
##       7) sulphates< 0.575 255  45 Low (0.17647059 0.82352941) *
# Plot m3pru.rpart and m2pru.rpart (the pruned models) with customized settings (preferred visualization method)
rpart.plot(m3pru.rpart, digits = 1, type = 2, box.palette = list("Greens", "Oranges", "Blues"), fallen.leaves = TRUE, extra = 101) #cooking_wine is not used!

rpart.plot(m2pru.rpart, digits = 1, type = 2, box.palette = list("Greens", "Oranges"), fallen.leaves = TRUE, extra = 101)  

# Make predictions on the test dataset
p3pru.rpart <- predict(m3pru.rpart, wine_red_3_test[,-10], type = "class") #removing the class column [,-10] 
p2pru.rpart <- predict(m2pru.rpart, wine_red_2_test[,-10], type = "class")

# Examine the confusion matrix
table(p3pru.rpart, wine_red_3_test$quality)
##                   
## p3pru.rpart        cooking_wine spectacular_wine table_wine
##   cooking_wine                0                0          0
##   spectacular_wine            0               20          7
##   table_wine                 17               31        405
table(p2pru.rpart, wine_red_2_test$quality) 
##            
## p2pru.rpart High Low
##        High  182  48
##        Low    80 170
#Summary: 
# 3 class: 425 out of 480 are correctly predicted (20 + 405), slightly better than the 423 correct before pruning.
# 2 class (High/Low): 128 out of 480 records are incorrectly predicted, identical to the unpruned tree; because the optimal cp equals rpart's default of 0.01, pruning removed nothing from the 2 class tree.

Accuracy by mean(), confusionMatrix(), and cross tabulation

# Compute the accuracy on the test dataset using mean
mean(p3pru.rpart == wine_red_3_test$quality)
## [1] 0.8854167
mean(p2pru.rpart == wine_red_2_test$quality)
## [1] 0.7333333
# Here is more information using ConfusionMatrix
confusionMatrix(wine_red_3_test$quality,p3pru.rpart)
## Confusion Matrix and Statistics
## 
##                   Reference
## Prediction         cooking_wine spectacular_wine table_wine
##   cooking_wine                0                0         17
##   spectacular_wine            0               20         31
##   table_wine                  0                7        405
## 
## Overall Statistics
##                                           
##                Accuracy : 0.8854          
##                  95% CI : (0.8535, 0.9125)
##     No Information Rate : 0.9438          
##     P-Value [Acc > NIR] : 1               
##                                           
##                   Kappa : 0.3772          
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: cooking_wine Class: spectacular_wine
## Sensitivity                           NA                 0.74074
## Specificity                      0.96458                 0.93157
## Pos Pred Value                        NA                 0.39216
## Neg Pred Value                        NA                 0.98368
## Prevalence                       0.00000                 0.05625
## Detection Rate                   0.00000                 0.04167
## Detection Prevalence             0.03542                 0.10625
## Balanced Accuracy                     NA                 0.83615
##                      Class: table_wine
## Sensitivity                     0.8940
## Specificity                     0.7407
## Pos Pred Value                  0.9830
## Neg Pred Value                  0.2941
## Prevalence                      0.9437
## Detection Rate                  0.8438
## Detection Prevalence            0.8583
## Balanced Accuracy               0.8174
confusionMatrix(wine_red_2_test$quality,p2pru.rpart)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction High Low
##       High  182  80
##       Low    48 170
##                                           
##                Accuracy : 0.7333          
##                  95% CI : (0.6914, 0.7724)
##     No Information Rate : 0.5208          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.4687          
##                                           
##  Mcnemar's Test P-Value : 0.006143        
##                                           
##             Sensitivity : 0.7913          
##             Specificity : 0.6800          
##          Pos Pred Value : 0.6947          
##          Neg Pred Value : 0.7798          
##              Prevalence : 0.4792          
##          Detection Rate : 0.3792          
##    Detection Prevalence : 0.5458          
##       Balanced Accuracy : 0.7357          
##                                           
##        'Positive' Class : High            
## 
#Pruned results analysis
#Pruning reduces the size of the decision tree, which lowers training accuracy but should improve accuracy on test (unseen) data by reducing over-fitting. In my runs, however, overall test accuracy did not improve for the 2 class model. This suggests I may need more training data, and that the class distribution of the training sample plays a role: the 3 class error count improved with pruning, but the 2 class count did not. To check whether the class distribution affects pruning, I changed the 2 class split by moving the High/Low cutoff from 6 to 7 (see the sketch after this block).
#At >=6 : High (855)  Low (744)
#At >=7 : High (217)  Low (1382)
#This adjustment did have a positive effect: pruning gave better accuracy once the class distribution was resized. It confirms how important it is to explore and understand the data before setting parameters. The >=7 split is also the more defensible one when the data is examined, which was a goal when I categorized the scores at the beginning of the analysis.
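A sketch of the cutoff change described above; the data frame name wine_red_2_alt is illustrative and only the class balance is shown:

# Sketch: re-bin quality with the alternative High/Low cutoff of >= 7
wine_red_2_alt <- wine_red
wine_red_2_alt$quality <- factor(ifelse(wine_red_2_alt$quality >= 7, "High", "Low"))
table(wine_red_2_alt$quality)  # roughly High 217 / Low 1382, as noted above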

3) Train, predict, and evaluate the performance using random forests.

# Random Forest prediction of wine_red data
#Build model for 3 class
fit3 <- randomForest(quality ~., data=wine_red_3_train)
print(fit3) # view results
## 
## Call:
##  randomForest(formula = quality ~ ., data = wine_red_3_train) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 3
## 
##         OOB estimate of  error rate: 14.75%
## Confusion matrix:
##                  cooking_wine spectacular_wine table_wine class.error
## cooking_wine                1                1         44  0.97826087
## spectacular_wine            0               80         86  0.51807229
## table_wine                  1               33        873  0.03748622
importance(fit3) # importance of each predictor
##                     MeanDecreaseGini
## fixed.acidity               32.48145
## volatile.acidity            46.50456
## citric.acid                 34.32098
## residual.sugar              32.86344
## chlorides                   39.77369
## free.sulfur.dioxide         31.79099
## pH                          30.75127
## sulphates                   45.23182
## alcohol                     62.79199
varImpPlot(fit3)

varImp(fit3)
#The forest builds 500 trees and the results immediately look better. Alcohol is again the biggest predictor, this time followed by volatile acidity and sulphates.
#Predict on the test set with the random forest
fit3_predict <- predict(fit3,wine_red_3_test[,-10],type="class")
table(fit3_predict, wine_red_3_test$quality)
##                   
## fit3_predict       cooking_wine spectacular_wine table_wine
##   cooking_wine                0                0          2
##   spectacular_wine            1               24          6
##   table_wine                 16               27        404
#Accuracy (sum of the diagonal of the confusion matrix above):
(0 + 24 + 404) / nrow(wine_red_3_test)
## [1] 0.8916667
#The accuracy improves further with the ensemble method for the 3 class problem.
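The forests here keep the default mtry of 3 (as shown in the model summary); mtry could also be tuned. A sketch using randomForest::tuneRF, assuming quality is column 10 of the training frame, consistent with the [,-10] indexing used for the test sets:

# Sketch: optional mtry tuning (not used for the results reported here)
set.seed(680)
tuneRF(wine_red_3_train[, -10], wine_red_3_train$quality,
       ntreeTry = 500, stepFactor = 1.5, improve = 0.01, trace = FALSE)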
#Build model for 2 class
fit2 <- randomForest(quality ~., data=wine_red_2_train)
print(fit2) # view results
## 
## Call:
##  randomForest(formula = quality ~ ., data = wine_red_2_train) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 3
## 
##         OOB estimate of  error rate: 17.87%
## Confusion matrix:
##      High Low class.error
## High  493 100   0.1686341
## Low   100 426   0.1901141
importance(fit2) # importance of each predictor
##                     MeanDecreaseGini
## fixed.acidity               43.17415
## volatile.acidity            80.29368
## citric.acid                 44.79185
## residual.sugar              40.77023
## chlorides                   52.24612
## free.sulfur.dioxide         43.90392
## pH                          45.88537
## sulphates                   89.58212
## alcohol                    116.25546
varImpPlot(fit2)

varImp(fit2)
#Alcohol is the biggest predictor followed by sulphates.
#Predict on the test set with the random forest
fit2_predict <- predict(fit2,wine_red_2_test[,-10],type="class")
table(fit2_predict, wine_red_2_test$quality)
##             
## fit2_predict High Low
##         High  206  55
##         Low    56 163
#Accuracy (sum of the diagonal of the confusion matrix above):
(206 + 163) / nrow(wine_red_2_test)
## [1] 0.76875
#Accuracy with the random forest is better than the decision tree or the pruned tree - roughly 3.5 percentage points higher, with a noticeably lower false negative count (56 vs 80).
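As a quick diagnostic (a sketch, not part of the comparison below), the OOB error of a randomForest fit can be plotted against the number of trees to confirm that 500 trees is enough:

# Sketch: OOB and per-class error rates vs. number of trees
plot(fit2, main = "fit2: error rate vs. number of trees")
head(fit2$err.rate)  # underlying error-rate matrix (OOB, High, Low columns)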

4) Compare all performances (trees, pruned trees, and randomForest) and discuss

#Results comparison - using the dnn option to name the tables.
#3 classes
table(wine_red_3_test$quality, p3.rpart, dnn = "decisiontree") #decisiontree
##                   NA
## decisiontree       cooking_wine spectacular_wine table_wine
##   cooking_wine                0                1         16
##   spectacular_wine            0               22         29
##   table_wine                  0               11        401
mean(p3.rpart == wine_red_3_test$quality) #Accuracy
## [1] 0.88125
table(wine_red_3_test$quality, p3pru.rpart, dnn = "pruned") #pruned
##                   NA
## pruned             cooking_wine spectacular_wine table_wine
##   cooking_wine                0                0         17
##   spectacular_wine            0               20         31
##   table_wine                  0                7        405
mean(p3pru.rpart == wine_red_3_test$quality) #Accuracy
## [1] 0.8854167
table(wine_red_3_test$quality, fit3_predict, dnn = "randomforest" ) #randomforest
##                   NA
## randomforest       cooking_wine spectacular_wine table_wine
##   cooking_wine                0                1         16
##   spectacular_wine            0               24         27
##   table_wine                  2                6        404
mean(fit3_predict == wine_red_3_test$quality) #Accuracy
## [1] 0.8916667
#Analysis: 
# Results for the 3 class problem improve both with pruning and with the random forest. As expected, the random forest (ensemble) produces the best accuracy.


#2 classes
table(wine_red_2_test$quality, p2.rpart,  dnn = "decisiontree") #decisiontree
##             NA
## decisiontree High Low
##         High  182  80
##         Low    48 170
mean(p2.rpart == wine_red_2_test$quality) #Accuracy
## [1] 0.7333333
table(wine_red_2_test$quality, p2pru.rpart, dnn = "pruned") #pruned
##       NA
## pruned High Low
##   High  182  80
##   Low    48 170
mean(p2pru.rpart == wine_red_2_test$quality) #Accuracy
## [1] 0.7333333
table(wine_red_2_test$quality, fit2_predict, dnn = "randomforest") #randomforest
##             NA
## randomforest High Low
##         High  206  56
##         Low    55 163
mean(fit2_predict == wine_red_2_test$quality) #Accuracy
## [1] 0.76875
#Analysis:
# Accuracy stays the same with pruning (the optimal cp is the default 0.01, so the 2 class tree is not actually reduced) and is highest with the random forest. The generally lower accuracy of all 2 class models may come from the distribution of the sample set: I set the High/Low split deliberately to test whether a more evenly divided set would give better prediction accuracy, and it does not. Repeating the 2 class exercise with the High/Low cutoff moved from 6 to 7 produced the highest accuracy for all models. I also ran an iteration without removing the density and total sulfur dioxide features; those accuracy results were slightly lower, which is another lesson learned.
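To close, the accuracies computed above can be collected into one small table; a sketch using the prediction objects already in memory (kable from knitr for display only):

# Sketch: side-by-side accuracy summary of all six model/class combinations
acc_summary <- data.frame(
  model   = c("decision tree", "pruned tree", "random forest"),
  class_3 = c(mean(p3.rpart     == wine_red_3_test$quality),
              mean(p3pru.rpart  == wine_red_3_test$quality),
              mean(fit3_predict == wine_red_3_test$quality)),
  class_2 = c(mean(p2.rpart     == wine_red_2_test$quality),
              mean(p2pru.rpart  == wine_red_2_test$quality),
              mean(fit2_predict == wine_red_2_test$quality)))
kable(acc_summary, digits = 3)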