Wine Quality

As an amateur wine connoisseur, I found the wine quality dataset to be the most intriguing, so I decided to pull from it.

library(RCurl)
## Loading required package: bitops
redwineURL <-getURL("https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv")

redwine <-read.csv(text = redwineURL, sep =";")

whitewineURL <-getURL("https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-white.csv")

whitewine <-read.csv(text = whitewineURL, sep =";")
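The `sep = ";"` argument matters because these files separate fields with semicolons rather than commas. Here is a minimal sketch of the effect, using a made-up three-row sample (`sample_csv` is a hypothetical stand-in, not the real file):

```r
# Hypothetical sample in the same semicolon-separated layout as the wine files
sample_csv <- "fixed acidity;pH;quality
7.4;3.51;5
7.8;3.20;5
11.2;3.16;6"

# With sep = ";" each field becomes its own column;
# read.csv also turns "fixed acidity" into the syntactic name fixed.acidity
sample_wine <- read.csv(text = sample_csv, sep = ";")
str(sample_wine)

# With the default comma separator, every line is read as one single field
wrong <- read.csv(text = sample_csv)
ncol(wrong)
```

This is also why the columns above show up as `fixed.acidity`, `residual.sugar`, and so on.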

summary(redwine)
##  fixed.acidity   volatile.acidity  citric.acid    residual.sugar  
##  Min.   : 4.60   Min.   :0.1200   Min.   :0.000   Min.   : 0.900  
##  1st Qu.: 7.10   1st Qu.:0.3900   1st Qu.:0.090   1st Qu.: 1.900  
##  Median : 7.90   Median :0.5200   Median :0.260   Median : 2.200  
##  Mean   : 8.32   Mean   :0.5278   Mean   :0.271   Mean   : 2.539  
##  3rd Qu.: 9.20   3rd Qu.:0.6400   3rd Qu.:0.420   3rd Qu.: 2.600  
##  Max.   :15.90   Max.   :1.5800   Max.   :1.000   Max.   :15.500  
##    chlorides       free.sulfur.dioxide total.sulfur.dioxide
##  Min.   :0.01200   Min.   : 1.00       Min.   :  6.00      
##  1st Qu.:0.07000   1st Qu.: 7.00       1st Qu.: 22.00      
##  Median :0.07900   Median :14.00       Median : 38.00      
##  Mean   :0.08747   Mean   :15.87       Mean   : 46.47      
##  3rd Qu.:0.09000   3rd Qu.:21.00       3rd Qu.: 62.00      
##  Max.   :0.61100   Max.   :72.00       Max.   :289.00      
##     density             pH          sulphates         alcohol     
##  Min.   :0.9901   Min.   :2.740   Min.   :0.3300   Min.   : 8.40  
##  1st Qu.:0.9956   1st Qu.:3.210   1st Qu.:0.5500   1st Qu.: 9.50  
##  Median :0.9968   Median :3.310   Median :0.6200   Median :10.20  
##  Mean   :0.9967   Mean   :3.311   Mean   :0.6581   Mean   :10.42  
##  3rd Qu.:0.9978   3rd Qu.:3.400   3rd Qu.:0.7300   3rd Qu.:11.10  
##  Max.   :1.0037   Max.   :4.010   Max.   :2.0000   Max.   :14.90  
##     quality     
##  Min.   :3.000  
##  1st Qu.:5.000  
##  Median :6.000  
##  Mean   :5.636  
##  3rd Qu.:6.000  
##  Max.   :8.000
summary(whitewine)
##  fixed.acidity    volatile.acidity  citric.acid     residual.sugar  
##  Min.   : 3.800   Min.   :0.0800   Min.   :0.0000   Min.   : 0.600  
##  1st Qu.: 6.300   1st Qu.:0.2100   1st Qu.:0.2700   1st Qu.: 1.700  
##  Median : 6.800   Median :0.2600   Median :0.3200   Median : 5.200  
##  Mean   : 6.855   Mean   :0.2782   Mean   :0.3342   Mean   : 6.391  
##  3rd Qu.: 7.300   3rd Qu.:0.3200   3rd Qu.:0.3900   3rd Qu.: 9.900  
##  Max.   :14.200   Max.   :1.1000   Max.   :1.6600   Max.   :65.800  
##    chlorides       free.sulfur.dioxide total.sulfur.dioxide
##  Min.   :0.00900   Min.   :  2.00      Min.   :  9.0       
##  1st Qu.:0.03600   1st Qu.: 23.00      1st Qu.:108.0       
##  Median :0.04300   Median : 34.00      Median :134.0       
##  Mean   :0.04577   Mean   : 35.31      Mean   :138.4       
##  3rd Qu.:0.05000   3rd Qu.: 46.00      3rd Qu.:167.0       
##  Max.   :0.34600   Max.   :289.00      Max.   :440.0       
##     density             pH          sulphates         alcohol     
##  Min.   :0.9871   Min.   :2.720   Min.   :0.2200   Min.   : 8.00  
##  1st Qu.:0.9917   1st Qu.:3.090   1st Qu.:0.4100   1st Qu.: 9.50  
##  Median :0.9937   Median :3.180   Median :0.4700   Median :10.40  
##  Mean   :0.9940   Mean   :3.188   Mean   :0.4898   Mean   :10.51  
##  3rd Qu.:0.9961   3rd Qu.:3.280   3rd Qu.:0.5500   3rd Qu.:11.40  
##  Max.   :1.0390   Max.   :3.820   Max.   :1.0800   Max.   :14.20  
##     quality     
##  Min.   :3.000  
##  1st Qu.:5.000  
##  Median :6.000  
##  Mean   :5.878  
##  3rd Qu.:6.000  
##  Max.   :9.000

For this dataset, I thought it would be a good idea to find the best of the best. For my purposes here, I was most interested in the “rating”. I did a bit of research and found that most sites use criteria similar to these (from Wine Spectator).

| Rating | Descriptor | Description |
|--------|------------|-------------|
| 95-100 | Classic | a great wine |
| 90-94 | Outstanding | superior character and style |
| 80-89 | Good/Very Good | wine with special qualities |
| 70-79 | Average | drinkable with minor flaws |
| 60-69 | Below average | drinkable but not recommended |
| 50-59 | Poor | undrinkable, not recommended |

Furthermore, according to the website Wine Folly, ratings tend to follow a normal distribution. To check whether this data follows the same trend, I plotted a histogram of the quality scores for each dataset.

hist(redwine$quality, xlab = "Quality", main = "Red Wine Quality")

hist(whitewine$quality, xlab = "Quality", main = "White Wine Quality")
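Because quality is a whole-number score, it can help to count the scores directly and to give each score its own bar. The sketch below does this on made-up scores (the `quality` vector is hypothetical, not taken from the dataset); the half-integer `breaks` are my own choice, not something the original plots used.

```r
# Hypothetical scores, for illustration only
quality <- c(3, 4, 5, 5, 5, 5, 6, 6, 6, 6, 6, 7, 7, 7, 8)

# Count how many wines received each score
counts <- table(quality)
print(counts)

# Half-integer breaks put every whole-number score in exactly one bar
h <- hist(quality, breaks = seq(2.5, 8.5, by = 1),
          xlab = "Quality", main = "Quality (toy data)")

# The bar heights match the counts from table()
h$counts
```

On the real data, the same idea would be `hist(redwine$quality, breaks = seq(2.5, 8.5, by = 1))`, with the upper break raised to 9.5 for the white wines.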

As the data appears to be roughly normal, I decided that the great wines would be those rated 7 or above on this scale, so this became the first criterion for my subset.

Furthermore, I wanted to focus on the areas that I believe most influence our perceived taste of a wine. When we think of wine tasting, we most often think of acidity, sweetness, and alcohol content; in this dataset those correspond to the pH, residual.sugar, and alcohol columns. Along with the rating, this made 4 columns that I wished to focus on.

BestReds <- subset(redwine, quality >= 7, select = c(residual.sugar,pH,alcohol,quality))
summary(BestReds)
##  residual.sugar        pH           alcohol         quality     
##  Min.   :1.200   Min.   :2.880   Min.   : 9.20   Min.   :7.000  
##  1st Qu.:2.000   1st Qu.:3.200   1st Qu.:10.80   1st Qu.:7.000  
##  Median :2.300   Median :3.270   Median :11.60   Median :7.000  
##  Mean   :2.709   Mean   :3.289   Mean   :11.52   Mean   :7.083  
##  3rd Qu.:2.700   3rd Qu.:3.380   3rd Qu.:12.20   3rd Qu.:7.000  
##  Max.   :8.900   Max.   :3.780   Max.   :14.00   Max.   :8.000
BestWhites <- subset(whitewine, quality >= 7, select = c(residual.sugar,pH,alcohol,quality))
summary(BestWhites)
##  residual.sugar         pH           alcohol         quality     
##  Min.   : 0.800   Min.   :2.840   Min.   : 8.50   Min.   :7.000  
##  1st Qu.: 1.800   1st Qu.:3.100   1st Qu.:10.70   1st Qu.:7.000  
##  Median : 3.875   Median :3.200   Median :11.50   Median :7.000  
##  Mean   : 5.262   Mean   :3.215   Mean   :11.42   Mean   :7.175  
##  3rd Qu.: 7.400   3rd Qu.:3.320   3rd Qu.:12.40   3rd Qu.:7.000  
##  Max.   :19.250   Max.   :3.820   Max.   :14.20   Max.   :9.000
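For comparison, the same kind of subset can be written with bracket indexing, and the share of wines that clear the cut-off is a one-liner. This is a minimal sketch on a made-up data frame (`toy` and its values are hypothetical; only the column names match the dataset):

```r
# Hypothetical toy data using the same column names as the wine data
toy <- data.frame(
  residual.sugar = c(1.9, 2.6, 2.3, 8.9, 1.2),
  pH             = c(3.51, 3.20, 3.26, 3.16, 3.78),
  alcohol        = c(9.4, 9.8, 11.6, 12.2, 14.0),
  quality        = c(5, 5, 7, 8, 6)
)
keep_cols <- c("residual.sugar", "pH", "alcohol", "quality")

# subset(): rows by condition, columns by select
best_subset <- subset(toy, quality >= 7, select = keep_cols)

# Bracket indexing: the same rows and columns, spelled out explicitly
best_bracket <- toy[toy$quality >= 7, keep_cols]
all.equal(best_subset, best_bracket)

# Fraction of wines rated 7 or above
share_best <- mean(toy$quality >= 7)
share_best
```

`subset()` reads more naturally at the console, while the bracket form is usually the safer choice inside functions.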

So this is one way to parse out the data that we have on hand. I felt that this was a good way to provide a summary of the

I decided to take the subset one step further. In white wines, there is a number classification for the “sweetness” of the wine. It ranges from 00 to 10. The following shows how much sugar is considered “sweet” or “dry”.

| Sugar (g/l)* | Characteristic |
|--------------|----------------|
| < 1 | Bone Dry |
| 1 - 5 | Dry |
| 5 - 25 | Off-Dry |
| 25 - 45 | Medium Dry |
| 45 - 65 | Sweet |
| > 65 | Very Sweet |

*Please note there is some variation in what counts as “sweet” when it comes to sugar levels. Some compare pH and sugar levels, which I may do at a later date, but for now I just wanted to show some uses of subsets. I picked the most common sugar levels I could find.

We see that none of the medium dry or sweet wines made it into the top white wines (the maximum residual sugar in BestWhites is 19.25 g/l, below the 25 g/l threshold), so I decided to go back to the original data set to visualize some of the data using a box plot. I understand there are simpler ways to group data like this (i.e., making a new column in the data frame and then creating a box plot), but I decided to go this route as a proof of concept, and to highlight column name changes, as this was necessary to produce a readable box plot.

Bonedry <- subset(whitewine, residual.sugar >= 0 & residual.sugar < 1, select= residual.sugar)
Dry <- subset(whitewine, residual.sugar >= 1  & residual.sugar < 5,  select= residual.sugar)
Off_Dry <- subset(whitewine, residual.sugar >= 5  & residual.sugar < 25,  select= residual.sugar)
Medium_Dry <- subset(whitewine, residual.sugar >= 25  & residual.sugar < 45,  select= residual.sugar)
Sweet <-subset(whitewine, residual.sugar >= 45  & residual.sugar < 65,  select= residual.sugar)
Very_Sweet <- subset(whitewine, residual.sugar >= 65,  select= residual.sugar)

names(Bonedry)[names(Bonedry) == "residual.sugar"] <- "Bonedry"
names(Dry)[names(Dry) == "residual.sugar"] <- "Dry"
names(Off_Dry)[names(Off_Dry) == "residual.sugar"] <- "Off_Dry"
names(Medium_Dry)[names(Medium_Dry) == "residual.sugar"] <- "Medium"
names(Sweet)[names(Sweet) == "residual.sugar"] <- "Sweet"
names(Very_Sweet)[names(Very_Sweet) == "residual.sugar"] <- "Very_Sweet"



boxplot(c(Bonedry,Dry,Off_Dry,Medium_Dry,Sweet,Very_Sweet), main = "Sweetness Distribution")
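The simpler route mentioned earlier (a new category column, then a single box plot) could look roughly like the sketch below. It assumes the same cut-offs as the subsets above; the `toy` data frame, its values, and the `sweetness` column name are made up for illustration.

```r
# Hypothetical residual sugar values (g/l), for illustration only
toy <- data.frame(residual.sugar = c(0.8, 1.0, 4.9, 5.0, 19.25, 31.6, 65.8))

# cut() assigns each value to a band; right = FALSE makes each band
# include its lower bound, matching the >= / < conditions used above
toy$sweetness <- cut(
  toy$residual.sugar,
  breaks = c(0, 1, 5, 25, 45, 65, Inf),
  labels = c("Bone Dry", "Dry", "Off-Dry", "Medium Dry", "Sweet", "Very Sweet"),
  right  = FALSE
)

# Number of wines per band (empty bands are kept as zero counts)
table(toy$sweetness)

# One box per band, drawn from a single data frame
boxplot(residual.sugar ~ sweetness, data = toy,
        main = "Sweetness Distribution (toy data)",
        xlab = "Sweetness", ylab = "Residual sugar (g/l)")
```

On the real data, replacing `toy` with `whitewine` should give the same grouping as the six subsets, without the renaming step.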

Citations:

https://vinewineandwander.wordpress.com/2012/03/10/0-10-decoded/

P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis. Modeling wine preferences by data mining from physicochemical properties. In Decision Support Systems, Elsevier, 47(4):547-553, 2009.