Red vs White

Analysis: For this project I analyzed wine data from UC Irvine’s Machine Learning Repository. The data came already split into two sets, one for red wines and one for white. I thought it would be an interesting prediction problem to recombine them and see whether I could tell which wines were which color.

The wine data proved to be a good introductory dataset for machine learning. Much like the Iris dataset, one can apply numerous techniques to it and come up with a pretty good prediction of color. However, the characteristics of the two colors overlap, which leaves room for fine-tuning the models. In this example I apply a recursive partitioning decision tree.

Now we read in the two data sources, add the Color variable to each set and merge them.
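As a minimal sketch of this step, the snippet below reads two small stand-in tables, tags each with its color, and stacks them. The inline text and its values are made up for illustration, and I am assuming the downloaded files are semicolon-delimited with the same columns in each set.

```r
# Stand-in for the two semicolon-delimited files; the values are made up.
red_txt   <- "fixed.acidity;pH;alcohol\n7.4;3.51;9.4\n7.8;3.20;9.8"
white_txt <- "fixed.acidity;pH;alcohol\n7.0;3.00;8.8\n6.3;3.30;9.5\n8.1;3.26;10.1"

# With the real data, pass the path of each downloaded file instead of text = ...
redDF   <- read.csv(text = red_txt,   sep = ";")
whiteDF <- read.csv(text = white_txt, sep = ";")

# Add the classification column to each set before merging.
redDF$Color   <- "Red"
whiteDF$Color <- "White"

# Both sets share the same columns, so rbind() stacks them row-wise.
wineDF <- rbind(redDF, whiteDF)
wineDF$Color <- factor(wineDF$Color, levels = c("Red", "White"))

table(wineDF$Color)
```

Storing Color as a factor matters later on, since both rpart and confusionMatrix treat it as the class label.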

First we use the tableplot function from tabplot to see how each variable is distributed across our sort column, which in this case is our classification column, Color.

As the plot shows, the distributions differ clearly between the two colors for every variable except alcohol and quality.

Library	Description
scatterplot3d	Create 3D scatterplots
ggplot2	Grammar of Graphics plotting library
sqldf	Query data frames like SQL tables
dplyr	The data wrangling package
rpart	Recursively partitioned trees
RSQLite	Connect to SQLite databases
rattle	Beautiful rpart tree plot
RColorBrewer	Color palettes used by the rpart tree plot
caret	Classification and regression package, used here for the createDataPartition and confusionMatrix functions
tabplot	Package for visualizing many columns at once

For plotting the wine data with appropriate colors we generate a palette with the grape colors.

Green Grapes -> White Wine
Purple Grapes -> Red Wine

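# Name the palette by wine color so the mapping does not depend on factor level order
# (with levels Red, White, indexing c("Green", "Purple") by position would paint red wines green)
grapeColors <- c(Red = "Purple", White = "Green")
colors <- unname(grapeColors[as.character(wineDF$Color)])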
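Because a palette indexed by as.numeric() of a factor follows the factor's level order, it is worth checking which color each wine actually receives. Here is a small sketch with a hypothetical four-element Color factor standing in for wineDF$Color.

```r
# Hypothetical stand-in for wineDF$Color; levels sort alphabetically: Red, White.
Color <- factor(c("White", "Red", "White", "Red"))

# Positional lookup: as.numeric() gives 1 for Red and 2 for White,
# so the palette has to list the red-wine color first.
byPosition <- c("Purple", "Green")[as.numeric(Color)]

# Named lookup: the mapping is spelled out and ignores level order.
grapeColors <- c(Red = "Purple", White = "Green")
byName <- unname(grapeColors[as.character(Color)])

# Both approaches agree only when the positional palette matches the level order.
identical(byPosition, byName)
```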

Plot the data in 3D to see whether we can distinguish red from white by sight.

scatterplot3d(x = wineDF$residual.sugar, y = wineDF$sulphates, z = wineDF$total.sulfur.dioxide, color = colors, pch = 16)

scatterplot3d(x = wineDF$sulphates, y = wineDF$residual.sugar, z = wineDF$total.sulfur.dioxide, color = colors, pch = 16)

scatterplot3d(x = wineDF$density, y = wineDF$pH, z = wineDF$alcohol, color = colors, pch = 16)

ggplot(aes(x=density,y=pH,color=Color),data=wineDF)+geom_jitter() + ggtitle("Density vs Acidity")+xlab("Density") + ylab("pH") + scale_color_manual(values = c("Purple", "Green"))

ggplot(aes(x=residual.sugar,y=total.sulfur.dioxide ,color=Color),data=wineDF)+geom_jitter() + ggtitle("Sugar vs Sulfur")+xlab("Residual Sugar") + ylab("Sulfur Dioxide") + scale_color_manual(values = c("Purple", "Green"))

ggplot(aes(x=residual.sugar,y=sulphates ,color=Color),data=wineDF)+geom_jitter() + ggtitle("Sugar vs Sulphates")+xlab("Residual Sugar") + ylab("Sulphate") + scale_color_manual(values = c("Purple", "Green"))

It looks like we should easily be able to come up with a model to distinguish wine color. Let’s pass all of our variables as possible features and run a recursively partitioned decision tree. We will also split the data into training and testing sets using the createDataPartition function from caret. Because we pass the variable of interest to the y argument, the random sampling occurs within each class (Red/White) and should preserve the overall class distribution.

wineformula <- formula(Color ~ fixed.acidity + volatile.acidity + citric.acid +
                         residual.sugar + chlorides + free.sulfur.dioxide + total.sulfur.dioxide + density + pH + sulphates + alcohol + quality)



set.seed(1001)

rprimindex <- createDataPartition(y = wineDF$Color, times = 1, p = .5, list = FALSE)
wineDFtrain <- wineDF[rprimindex, ]   ## Create Training Data
wineDFtest  <- wineDF[-rprimindex, ]  ## Create Test Data

winemodel <- rpart(wineformula, data = wineDFtrain)
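To make the within-class sampling concrete, here is a base R sketch of a stratified 50/50 split. It is not caret's implementation, only an illustration of the idea, and the Color vector is a hypothetical stand-in with roughly the same red-to-white imbalance as the wine data.

```r
set.seed(1001)

# Hypothetical class labels, about one red for every three whites.
Color <- factor(rep(c("Red", "White"), times = c(250, 750)))

# Group the row indices by class, then sample half of each group.
byClass <- split(seq_along(Color), Color)
sampled <- lapply(byClass, function(idx) sample(idx, size = ceiling(0.5 * length(idx))))
trainIdx <- sort(unlist(sampled, use.names = FALSE))

prop.table(table(Color))            # class shares in the full data
prop.table(table(Color[trainIdx]))  # class shares in the training half
```

Sampling within each class keeps the shares identical in both halves, whereas a plain sample() over all rows would only match them approximately.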

Our decision tree model identifies our key variables for prediction.

The first split is on chlorides, followed by splits on sulfur.dioxide, density, and fixed.acidity.

Our model performs quite well, correctly predicting the color of over 97% of the wines in the test set.
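The splits can also be read straight off a fitted rpart object without plotting it. The sketch below uses two species from the built-in iris data as a stand-in for the wine data, so the variables it reports are not the ones listed above.

```r
library(rpart)

# Stand-in data: a two-class subset of iris instead of the wine data.
irisTwo <- droplevels(subset(iris, Species != "setosa"))

fit <- rpart(Species ~ ., data = irisTwo)

print(fit)                # text listing of the splits, starting at the root
fit$variable.importance   # named vector; larger values mean more influence
```

Calling print(winemodel) on the wine tree would list its splits in the same way.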

winePredtest <-as.data.frame(predict(winemodel,wineDFtest))






ColorRecode<- sqldf(  " select case when White > .5 then 'White'
  when Red > .5 then 'Red'
  end as Color from winePredtest")

## Loading required package: tcltk

winePredtest$Color <- as.factor(ColorRecode$Color)
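The sqldf recode works, but the same labels can be obtained without SQL. As a sketch, again with a two-class iris subset standing in for the wine data, the snippet below takes the column with the highest probability and compares it with what predict() returns when asked for type = "class".

```r
library(rpart)

# Stand-in data and model; the wine data and winemodel would be used the same way.
irisTwo <- droplevels(subset(iris, Species != "setosa"))
fit <- rpart(Species ~ ., data = irisTwo)

# Default prediction for a classification tree: one probability column per class.
probs <- predict(fit, irisTwo)

# Pick the label of the most probable class in each row.
fromProbs <- colnames(probs)[max.col(probs, ties.method = "first")]

# Ask rpart for the predicted class directly.
direct <- predict(fit, irisTwo, type = "class")

all(fromProbs == as.character(direct))
```

Unlike a `> .5` rule, taking the most probable class always returns a label, even if a leaf happens to be split exactly evenly.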




confusionMatrix(winePredtest$Color, wineDFtest$Color)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  Red White
##      Red    754    28
##      White   45  2421
##                                           
##                Accuracy : 0.9775          
##                  95% CI : (0.9718, 0.9823)
##     No Information Rate : 0.754           
##     P-Value [Acc > NIR] : < 2e-16         
##                                           
##                   Kappa : 0.939           
##  Mcnemar's Test P-Value : 0.06112         
##                                           
##             Sensitivity : 0.9437          
##             Specificity : 0.9886          
##          Pos Pred Value : 0.9642          
##          Neg Pred Value : 0.9818          
##              Prevalence : 0.2460          
##          Detection Rate : 0.2321          
##    Detection Prevalence : 0.2408          
##       Balanced Accuracy : 0.9661          
##                                           
##        'Positive' Class : Red             
##
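The headline statistics can be recomputed by hand from the four cell counts, which helps in reading the output. The sketch below rebuilds the table shown above and treats Red as the positive class, as confusionMatrix did. The variable names are my own.

```r
# Rows are predictions, columns are the reference (true) colors.
cm <- matrix(c(754, 45, 28, 2421), nrow = 2,
             dimnames = list(Prediction = c("Red", "White"),
                             Reference  = c("Red", "White")))

n <- sum(cm)

accuracy    <- sum(diag(cm)) / n                          # correct / total
sensitivity <- cm["Red", "Red"] / sum(cm[, "Red"])        # reds found among true reds
specificity <- cm["White", "White"] / sum(cm[, "White"])  # whites found among true whites
noInfoRate  <- max(colSums(cm)) / n                       # always guessing the larger class

# Cohen's kappa: agreement beyond what the row and column totals alone would give.
expected   <- sum(rowSums(cm) * colSums(cm)) / n^2
cohenKappa <- (accuracy - expected) / (1 - expected)

round(c(accuracy = accuracy, sensitivity = sensitivity,
        specificity = specificity, noInfoRate = noInfoRate,
        kappa = cohenKappa), 4)
```

The results should match the Accuracy, Sensitivity, Specificity, No Information Rate, and Kappa lines in the output above.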

fancyRpartPlot(winemodel)


Salil Gupta

April 14, 2016
