Analysis: For this project I analyzed wine data from UC Irvine’s Machine Learning Repository. The datasets were already broken up by red, and white. I thought it would be an interesting prediction problem to recombine them, and see if I can distinguish which wines were which color.

The wine data proved to be a good dataset as introduction to machine learning. Much like the Iris dataset one can apply numerous techniques to come up with a pretty good prediction of color. However, there is overlap of characteristics between the two colors that make room for fine-tuning models more inviting. In this example I apply a recursive partitioning decision tree.

Library Description
ScatterPlot3d Create 3d Scatterplots
GGPlot2 Grammar of Graphics Plotting library
SQLDf Query dataframes like SQL tables
dplyr the data wrangling package
rpart Recursively Partioned Trees
RSQLite Connect to SQLite database
Rattle Beautiful Rpart Tree Plot
RColorBrewer Beaitufl Rpart Tree Plot
Caret Classification and Regression package for CreateDatapartition and ConfusionMatrix functions
tabplot Package for visualizing many columns at once

Now we read in the two data sources, add the Color variable to each set and merge them.

redURL <- "http://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv"
whiteURL <- "http://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-white.csv"


redDF <- read.csv(redURL, sep = ";")


whiteDF <- read.csv(whiteURL, sep = ";")

whiteDF$Color <- "White"
redDF$Color <- "Red"

wineDF <- rbind(whiteDF, redDF)
wineDF$Color <- as.factor(wineDF$Color)

First we use tableplot function in tabplot to see the distribution of variables across our sort column, in this case it’s our classification column, Color

tableplot(wineDF, sortCol = "Color")

As you can see there is a clear difference in distributions across all the variables except for alcohol and quality

For plotting the wine data with appropriate colors we generate a palette with the grape colors.

Green Grapes -> White Wine Purple Grapes -> Red Wine

colors <- c("Green", "Purple")
colors <- colors[as.numeric(wineDF$Color)]

3d Plot the data to see if we can distinguish Red vs White by sight

scatterplot3d(x = wineDF$residual.sugar, y = wineDF$sulphates, z = wineDF$total.sulfur.dioxide, color = colors, pch = 16)

scatterplot3d(x = wineDF$sulphates, y = wineDF$residual.sugar, z = wineDF$total.sulfur.dioxide, color = colors, pch = 16)

scatterplot3d(x = wineDF$density, y = wineDF$pH, z = wineDF$alcohol, color = colors, pch = 16)

ggplot(aes(x=density,y=pH,color=Color),data=wineDF)+geom_jitter() + ggtitle("Density vs Acidity")+xlab("Density") + ylab("pH") + scale_color_manual(values = c("Purple", "Green"))

ggplot(aes(x=residual.sugar,y=total.sulfur.dioxide ,color=Color),data=wineDF)+geom_jitter() + ggtitle("Sugar vs Sulfur")+xlab("Residual Sugar") + ylab("Sulfur Dioxide") + scale_color_manual(values = c("Purple", "Green"))

ggplot(aes(x=residual.sugar,y=sulphates ,color=Color),data=wineDF)+geom_jitter() + ggtitle("Sugar vs Sulphates")+xlab("Residual Sugar") + ylab("Sulphate") + scale_color_manual(values = c("Purple", "Green"))

It looks like we should easily be able to come up with a model to distingush wine color. Let’s pass all of our variables as possible features and run a recursively partitioned decision tree. We will also split the data into testing and training data using createDataPartition function from Caret. And by passing the variable of interest to the y argument, the random samppling occurs within each class (Red/White) and should preserve the overall class distribution.

wineformula <- formula(Color ~ fixed.acidity + volatile.acidity + citric.acid +
                         residual.sugar + chlorides + free.sulfur.dioxide + total.sulfur.dioxide + density + pH + sulphates + alcohol + quality)



set.seed(1001)

rprimindex <- createDataPartition(y=wineDF$Color, times = 1, p =.5, list = F)
wineDFtrain <-wineDF[rprimindex, ] ## Create Test Data
wineDFtest  <-wineDF[-rprimindex, ]

winemodel <- rpart(wineformula, data = wineDFtrain)

Our decision tree model identifies our key variables for prediciton.

The first split is on chloride, then on sulfur.dioxide, then density, and fixed.acidity.

Our model performs quite well predicting over 97% of wines.

winePredtest <-as.data.frame(predict(winemodel,wineDFtest))






ColorRecode<- sqldf(  " select case when White > .5 then 'White'
  when Red > .5 then 'Red'
  end as Color from winePredtest")
## Loading required package: tcltk
winePredtest$Color <- as.factor(ColorRecode$Color)




confusionMatrix(winePredtest$Color, wineDFtest$Color)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  Red White
##      Red    754    28
##      White   45  2421
##                                           
##                Accuracy : 0.9775          
##                  95% CI : (0.9718, 0.9823)
##     No Information Rate : 0.754           
##     P-Value [Acc > NIR] : < 2e-16         
##                                           
##                   Kappa : 0.939           
##  Mcnemar's Test P-Value : 0.06112         
##                                           
##             Sensitivity : 0.9437          
##             Specificity : 0.9886          
##          Pos Pred Value : 0.9642          
##          Neg Pred Value : 0.9818          
##              Prevalence : 0.2460          
##          Detection Rate : 0.2321          
##    Detection Prevalence : 0.2408          
##       Balanced Accuracy : 0.9661          
##                                           
##        'Positive' Class : Red             
## 
fancyRpartPlot(winemodel)