Analysis: For this project I analyzed wine data from UC Irvine’s Machine Learning Repository. The data come as two separate files, one for red wines and one for white. I thought it would be an interesting prediction problem to recombine them and see whether I could tell which wines were which color.
The wine data proved to be a good introductory dataset for machine learning. Much like the Iris dataset, one can apply numerous techniques to arrive at a pretty good prediction of color. However, the characteristics of the two colors overlap enough to leave room for fine-tuning models. In this example I apply a recursive partitioning decision tree.
| Library | Description |
|---|---|
| scatterplot3d | Create 3D scatterplots |
| ggplot2 | Grammar of Graphics plotting library |
| sqldf | Query data frames like SQL tables |
| dplyr | The data wrangling package |
| rpart | Recursive partitioning trees |
| RSQLite | Connect to SQLite databases |
| rattle | Beautiful rpart tree plot |
| RColorBrewer | Beautiful rpart tree plot |
| caret | Classification and regression package, used here for the createDataPartition and confusionMatrix functions |
| tabplot | Package for visualizing many columns at once |
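The packages in the table need to be installed and attached before the rest of the code will run. A minimal sketch of one way to do that follows; the variable names `pkgs`, `installed`, and `missingPkgs` are illustrative and not part of the original analysis. R package names are case-sensitive.

```r
# Packages listed in the table above.
pkgs <- c("scatterplot3d", "ggplot2", "sqldf", "dplyr", "rpart",
          "RSQLite", "rattle", "RColorBrewer", "caret", "tabplot")

# requireNamespace() returns FALSE instead of stopping when a package is absent.
installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
missingPkgs <- pkgs[!installed]
if (length(missingPkgs) > 0) {
  message("Not installed: ", paste(missingPkgs, collapse = ", "))
}

# Attach whatever is available; library() needs character.only = TRUE
# when the package name is held in a string.
for (p in pkgs[installed]) {
  suppressPackageStartupMessages(library(p, character.only = TRUE))
}
```

Anything reported as not installed can be added with `install.packages()` before continuing.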
Now we read in the two data sources, add the Color variable to each set and merge them.
redURL <- "http://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv"
whiteURL <- "http://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-white.csv"
redDF <- read.csv(redURL, sep = ";")
whiteDF <- read.csv(whiteURL, sep = ";")
whiteDF$Color <- "White"
redDF$Color <- "Red"
wineDF <- rbind(whiteDF, redDF)
wineDF$Color <- as.factor(wineDF$Color)
First we use the tableplot function from tabplot to see how each variable is distributed when the rows are sorted by our classification column, Color.
tableplot(wineDF, sortCol = "Color")
As you can see, the distributions differ clearly between the two colors for every variable except alcohol and quality.
Green Grapes -> White Wine, Purple Grapes -> Red Wine
## Factor levels are sorted alphabetically (Red = 1, White = 2), so Purple comes first
colors <- c("Purple", "Green")
colors <- colors[as.numeric(wineDF$Color)]
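Indexing a color vector by `as.numeric()` of a factor depends on the order of the factor levels, which `as.factor()` sorts alphabetically. A small sketch of a name-based lookup that avoids relying on that order is shown here; `wineColor`, `paletteByWine`, and `pointColors` are hypothetical names, and the five-element vector merely stands in for `wineDF$Color`.

```r
# Hypothetical stand-in for wineDF$Color; as.factor() sorts levels alphabetically.
wineColor <- as.factor(c("White", "Red", "White", "Red", "Red"))
levels(wineColor)   # "Red" "White"

# Named lookup: each plotting color is tied to a level name, not to its position.
paletteByWine <- c(Red = "Purple", White = "Green")
pointColors <- unname(paletteByWine[as.character(wineColor)])
pointColors         # "Green" "Purple" "Green" "Purple" "Purple"
```

The resulting vector can be passed to the `color` argument of `scatterplot3d` in the same way as `colors`.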
Plot the data in 3D to see whether we can distinguish red from white by sight.
scatterplot3d(x = wineDF$residual.sugar, y = wineDF$sulphates, z = wineDF$total.sulfur.dioxide, color = colors, pch = 16)
scatterplot3d(x = wineDF$sulphates, y = wineDF$residual.sugar, z = wineDF$total.sulfur.dioxide, color = colors, pch = 16)
scatterplot3d(x = wineDF$density, y = wineDF$pH, z = wineDF$alcohol, color = colors, pch = 16)
ggplot(aes(x=density,y=pH,color=Color),data=wineDF)+geom_jitter() + ggtitle("Density vs Acidity")+xlab("Density") + ylab("pH") + scale_color_manual(values = c("Purple", "Green"))
ggplot(aes(x=residual.sugar,y=total.sulfur.dioxide ,color=Color),data=wineDF)+geom_jitter() + ggtitle("Sugar vs Sulfur")+xlab("Residual Sugar") + ylab("Sulfur Dioxide") + scale_color_manual(values = c("Purple", "Green"))
ggplot(aes(x=residual.sugar,y=sulphates ,color=Color),data=wineDF)+geom_jitter() + ggtitle("Sugar vs Sulphates")+xlab("Residual Sugar") + ylab("Sulphate") + scale_color_manual(values = c("Purple", "Green"))
It looks like we should easily be able to come up with a model to distinguish wine color. Let’s pass all of our variables as possible features and fit a recursively partitioned decision tree. We will also split the data into training and testing sets using the createDataPartition function from caret. Because we pass the variable of interest to the y argument, the random sampling occurs within each class (Red/White) and should preserve the overall class distribution.
wineformula <- formula(Color ~ fixed.acidity + volatile.acidity + citric.acid +
residual.sugar + chlorides + free.sulfur.dioxide + total.sulfur.dioxide + density + pH + sulphates + alcohol + quality)
set.seed(1001)
rprimindex <- createDataPartition(y = wineDF$Color, times = 1, p = .5, list = FALSE)
wineDFtrain <- wineDF[rprimindex, ] ## Create Training Data
wineDFtest <- wineDF[-rprimindex, ] ## Create Test Data
winemodel <- rpart(wineformula, data = wineDFtrain)
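To make the idea of sampling within each class concrete, here is a rough base-R analogue of a stratified 50/50 split. It is a sketch of the concept only, not caret's implementation, and its sampling details may differ from `createDataPartition`. The label vector `y` is synthetic and uses the class sizes of the two UCI files (1,599 red and 4,898 white); `idxByClass`, `trainIdx`, `trainShare`, and `testShare` are illustrative names.

```r
set.seed(1001)

# Synthetic labels with the class sizes of the UCI red and white files.
y <- factor(rep(c("Red", "White"), times = c(1599, 4898)))

# Split row indices by class, then draw half of each class for training.
idxByClass <- split(seq_along(y), y)
trainIdx <- sort(unlist(
  lapply(idxByClass, function(i) sample(i, ceiling(0.5 * length(i)))),
  use.names = FALSE
))

# Class shares in each half; both should sit close to the overall 24.6% red.
trainShare <- prop.table(table(y[trainIdx]))
testShare  <- prop.table(table(y[-trainIdx]))
round(rbind(train = trainShare, test = testShare), 3)
```

Because each class is halved separately, neither half can end up with a noticeably different red-to-white ratio, which a plain random split does not guarantee.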
Our decision tree model identifies our key variables for prediction.
The first split is on chlorides, then on sulfur.dioxide, then density and fixed.acidity.
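The split order can also be read from the fitted rpart object as text. The sketch below shows one way to do that; it uses the `kyphosis` data that ships with rpart as a stand-in (an assumption made so the example runs on its own), and the same calls should apply to `winemodel`. The names `fit` and `splitVars` are illustrative.

```r
library(rpart)

# kyphosis ships with rpart and stands in for wineDFtrain here.
fit <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis)

# fit$frame has one row per node; "<leaf>" marks terminal nodes.
splitVars <- as.character(fit$frame$var)
splitVars <- splitVars[splitVars != "<leaf>"]
splitVars[1]              # variable used at the root split

# Overall importance of each predictor, largest first.
fit$variable.importance

# Full text listing of the tree, one line per node.
print(fit)
```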
Our model performs quite well, correctly classifying over 97% of the wines in the test set.
winePredtest <-as.data.frame(predict(winemodel,wineDFtest))
ColorRecode<- sqldf( " select case when White > .5 then 'White'
when Red > .5 then 'Red'
end as Color from winePredtest")
## Loading required package: tcltk
winePredtest$Color <- as.factor(ColorRecode$Color)
confusionMatrix(winePredtest$Color, wineDFtest$Color)
## Confusion Matrix and Statistics
##
##           Reference
## Prediction  Red White
##      Red    754    28
##      White   45  2421
##
## Accuracy : 0.9775
## 95% CI : (0.9718, 0.9823)
## No Information Rate : 0.754
## P-Value [Acc > NIR] : < 2e-16
##
## Kappa : 0.939
## Mcnemar's Test P-Value : 0.06112
##
## Sensitivity : 0.9437
## Specificity : 0.9886
## Pos Pred Value : 0.9642
## Neg Pred Value : 0.9818
## Prevalence : 0.2460
## Detection Rate : 0.2321
## Detection Prevalence : 0.2408
## Balanced Accuracy : 0.9661
##
## 'Positive' Class : Red
##
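The headline statistics can be checked by hand from the four counts in the confusion matrix. The following sketch recomputes a few of them in base R; the object names (`cm`, `accuracy`, and so on) are illustrative, and Red is treated as the positive class, as in the output above.

```r
# Counts copied from the confusion matrix (rows = prediction, columns = reference).
cm <- matrix(c(754, 45, 28, 2421), nrow = 2,
             dimnames = list(Prediction = c("Red", "White"),
                             Reference  = c("Red", "White")))

accuracy    <- sum(diag(cm)) / sum(cm)                     # correct / all
sensitivity <- cm["Red", "Red"] / sum(cm[, "Red"])         # reds found / all reds
specificity <- cm["White", "White"] / sum(cm[, "White"])   # whites found / all whites
balanced    <- (sensitivity + specificity) / 2
nir         <- max(colSums(cm)) / sum(cm)                  # always guess the larger class

round(c(accuracy = accuracy, sensitivity = sensitivity,
        specificity = specificity, balanced = balanced, nir = nir), 4)
```

The no-information rate of about 0.754 is the accuracy of always guessing White, so it is the baseline the 97.75% accuracy should be compared against.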
fancyRpartPlot(winemodel)
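As an alternative to recoding the predicted probabilities with a SQL CASE expression, `predict()` for rpart models accepts `type = "class"` and returns the predicted label directly as a factor. The sketch below illustrates this on the `kyphosis` data bundled with rpart, used here only as a runnable stand-in for the wine data; `fit`, `probs`, `cls`, and `manual` are illustrative names.

```r
library(rpart)

# kyphosis ships with rpart and stands in for the wine data here.
fit <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis)

# Default prediction for a classification tree: a matrix of class probabilities.
probs <- predict(fit, kyphosis)

# type = "class" returns the predicted label as a factor.
cls <- predict(fit, kyphosis, type = "class")

# Manual equivalent: pick the column with the highest probability in each row.
manual <- colnames(probs)[max.col(probs, ties.method = "first")]

# The factor can go straight into a cross-tabulation against the true labels.
table(Prediction = cls, Reference = kyphosis$Kyphosis)
```

With the wine model, the analogous call would be `predict(winemodel, wineDFtest, type = "class")`, whose result could be passed to `confusionMatrix` without an intermediate recoding step.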