As an amateur wine connoisseur, I found the wine quality dataset the most intriguing, so I decided to pull from it.
library(RCurl)
## Loading required package: bitops
redwineURL <-getURL("https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv")
redwine <-read.csv(text = redwineURL, sep =";")
whitewineURL <-getURL("https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-white.csv")
whitewine <-read.csv(text = whitewineURL, sep =";")
summary(redwine)
## fixed.acidity volatile.acidity citric.acid residual.sugar
## Min. : 4.60 Min. :0.1200 Min. :0.000 Min. : 0.900
## 1st Qu.: 7.10 1st Qu.:0.3900 1st Qu.:0.090 1st Qu.: 1.900
## Median : 7.90 Median :0.5200 Median :0.260 Median : 2.200
## Mean : 8.32 Mean :0.5278 Mean :0.271 Mean : 2.539
## 3rd Qu.: 9.20 3rd Qu.:0.6400 3rd Qu.:0.420 3rd Qu.: 2.600
## Max. :15.90 Max. :1.5800 Max. :1.000 Max. :15.500
## chlorides free.sulfur.dioxide total.sulfur.dioxide
## Min. :0.01200 Min. : 1.00 Min. : 6.00
## 1st Qu.:0.07000 1st Qu.: 7.00 1st Qu.: 22.00
## Median :0.07900 Median :14.00 Median : 38.00
## Mean :0.08747 Mean :15.87 Mean : 46.47
## 3rd Qu.:0.09000 3rd Qu.:21.00 3rd Qu.: 62.00
## Max. :0.61100 Max. :72.00 Max. :289.00
## density pH sulphates alcohol
## Min. :0.9901 Min. :2.740 Min. :0.3300 Min. : 8.40
## 1st Qu.:0.9956 1st Qu.:3.210 1st Qu.:0.5500 1st Qu.: 9.50
## Median :0.9968 Median :3.310 Median :0.6200 Median :10.20
## Mean :0.9967 Mean :3.311 Mean :0.6581 Mean :10.42
## 3rd Qu.:0.9978 3rd Qu.:3.400 3rd Qu.:0.7300 3rd Qu.:11.10
## Max. :1.0037 Max. :4.010 Max. :2.0000 Max. :14.90
## quality
## Min. :3.000
## 1st Qu.:5.000
## Median :6.000
## Mean :5.636
## 3rd Qu.:6.000
## Max. :8.000
summary(whitewine)
## fixed.acidity volatile.acidity citric.acid residual.sugar
## Min. : 3.800 Min. :0.0800 Min. :0.0000 Min. : 0.600
## 1st Qu.: 6.300 1st Qu.:0.2100 1st Qu.:0.2700 1st Qu.: 1.700
## Median : 6.800 Median :0.2600 Median :0.3200 Median : 5.200
## Mean : 6.855 Mean :0.2782 Mean :0.3342 Mean : 6.391
## 3rd Qu.: 7.300 3rd Qu.:0.3200 3rd Qu.:0.3900 3rd Qu.: 9.900
## Max. :14.200 Max. :1.1000 Max. :1.6600 Max. :65.800
## chlorides free.sulfur.dioxide total.sulfur.dioxide
## Min. :0.00900 Min. : 2.00 Min. : 9.0
## 1st Qu.:0.03600 1st Qu.: 23.00 1st Qu.:108.0
## Median :0.04300 Median : 34.00 Median :134.0
## Mean :0.04577 Mean : 35.31 Mean :138.4
## 3rd Qu.:0.05000 3rd Qu.: 46.00 3rd Qu.:167.0
## Max. :0.34600 Max. :289.00 Max. :440.0
## density pH sulphates alcohol
## Min. :0.9871 Min. :2.720 Min. :0.2200 Min. : 8.00
## 1st Qu.:0.9917 1st Qu.:3.090 1st Qu.:0.4100 1st Qu.: 9.50
## Median :0.9937 Median :3.180 Median :0.4700 Median :10.40
## Mean :0.9940 Mean :3.188 Mean :0.4898 Mean :10.51
## 3rd Qu.:0.9961 3rd Qu.:3.280 3rd Qu.:0.5500 3rd Qu.:11.40
## Max. :1.0390 Max. :3.820 Max. :1.0800 Max. :14.20
## quality
## Min. :3.000
## 1st Qu.:5.000
## Median :6.000
## Mean :5.878
## 3rd Qu.:6.000
## Max. :9.000
For this dataset, I thought it would be a good idea to find the best of the best. For my purpose here, I was most interested in the “rating” (the `quality` column). I did a bit of research and found that most sites use criteria similar to these (from Wine Spectator).
| Rating | Descriptor | Description |
|---|---|---|
| 95-100 | Classic | a great wine |
| 90-94 | Outstanding | superior character and style |
| 80-89 | Good/Very Good | wine with special qualities |
| 70-79 | Average | drinkable with minor flaws |
| 60-69 | Below average | drinkable but not recommended |
| 50-59 | Poor | undrinkable, not recommended |
Furthermore, according to the website Wine Folly, ratings tend to follow a normal distribution. To check whether this data follows that trend, I plotted a histogram of the quality scores for each dataset.
hist(redwine$quality, xlab = "Quality", main = "Red Wine Quality")
hist(whitewine$quality, xlab = "Quality", main = "White Wine Quality")
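A histogram gives a visual impression only. As a rough numerical companion, the sketch below compares the observed share of each score with the share a normal curve of the same mean and standard deviation would assign to it. The helper `compare_to_normal` and the `demo_scores` vector are assumptions made up for illustration; with the data loaded, `redwine$quality` or `whitewine$quality` would be passed in instead.

```r
# Compare observed score frequencies with those of a normal distribution
# that has the same mean and standard deviation.
# `compare_to_normal` is a hypothetical helper, not part of any package.
compare_to_normal <- function(scores) {
  observed <- table(scores) / length(scores)
  values   <- as.numeric(names(observed))
  m <- mean(scores)
  s <- sd(scores)
  # Probability a normal curve assigns to each integer score (value +/- 0.5)
  expected <- pnorm(values + 0.5, m, s) - pnorm(values - 0.5, m, s)
  data.frame(score    = values,
             observed = round(as.numeric(observed), 3),
             expected = round(expected, 3))
}

# Made-up stand-in scores so the sketch runs on its own;
# with the data loaded, use compare_to_normal(redwine$quality) instead.
set.seed(1)
demo_scores <- pmin(pmax(round(rnorm(500, mean = 5.6, sd = 0.8)), 3), 8)
print(compare_to_normal(demo_scores))
```

If the two columns sit close together for every score, the "roughly normal" reading of the histogram is reasonable; large gaps would suggest otherwise.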
As the data appear roughly normal, I decided that the great wines would be those rated 7 or higher on this scale, so this became the first criterion for my subset.
Furthermore, I wanted to focus on the areas that I believe most influence our perceived taste of a wine. When we think of wine tasting, we most often think of acidity (`pH`), sweetness (`residual.sugar`), and alcohol content (`alcohol`). Along with the rating, this made 4 columns that I wished to focus on.
BestReds <- subset(redwine, quality >= 7, select = c(residual.sugar,pH,alcohol,quality))
summary(BestReds)
## residual.sugar pH alcohol quality
## Min. :1.200 Min. :2.880 Min. : 9.20 Min. :7.000
## 1st Qu.:2.000 1st Qu.:3.200 1st Qu.:10.80 1st Qu.:7.000
## Median :2.300 Median :3.270 Median :11.60 Median :7.000
## Mean :2.709 Mean :3.289 Mean :11.52 Mean :7.083
## 3rd Qu.:2.700 3rd Qu.:3.380 3rd Qu.:12.20 3rd Qu.:7.000
## Max. :8.900 Max. :3.780 Max. :14.00 Max. :8.000
BestWhites <- subset(whitewine, quality >= 7, select = c(residual.sugar,pH,alcohol,quality))
summary(BestWhites)
## residual.sugar pH alcohol quality
## Min. : 0.800 Min. :2.840 Min. : 8.50 Min. :7.000
## 1st Qu.: 1.800 1st Qu.:3.100 1st Qu.:10.70 1st Qu.:7.000
## Median : 3.875 Median :3.200 Median :11.50 Median :7.000
## Mean : 5.262 Mean :3.215 Mean :11.42 Mean :7.175
## 3rd Qu.: 7.400 3rd Qu.:3.320 3rd Qu.:12.40 3rd Qu.:7.000
## Max. :19.250 Max. :3.820 Max. :14.20 Max. :9.000
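For readers newer to R, `subset()` does two jobs in one call: the condition filters rows and `select` picks columns. The sketch below shows the same operation written with bracket indexing. The `wine` data frame and its values are made up as a stand-in so the example runs on its own.

```r
# Tiny made-up data frame standing in for redwine, so the sketch runs on its own.
wine <- data.frame(
  residual.sugar = c(1.9, 2.6, 2.3, 1.8),
  pH             = c(3.51, 3.20, 3.26, 3.30),
  alcohol        = c(9.4, 9.8, 11.2, 12.5),
  sulphates      = c(0.56, 0.68, 0.65, 0.75),
  quality        = c(5, 5, 7, 8)
)
keep <- c("residual.sugar", "pH", "alcohol", "quality")

# subset(): filter rows with a condition, pick columns with select
best_a <- subset(wine, quality >= 7, select = keep)

# Bracket indexing: row condition before the comma, columns after it
best_b <- wine[wine$quality >= 7, keep]

# Check that both approaches return the same rows and columns
all.equal(best_a, best_b)
```

`subset()` is convenient for interactive work like this; bracket indexing is the more explicit form and is often preferred inside functions.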
So this is one way to parse out the data that we have on hand. I felt that this was a good way to provide a summary of the
I decided to take the subset one step further. In white wines, there is a number classification for the “sweetness” of the wine. It ranges from 00 to 10. The following table shows how much sugar is considered “sweet” or “dry”.
| Sugar (g/l)* | Characteristic |
|---|---|
| < 1 | Bone Dry |
| 1-5 | Dry |
| 5-25 | Off-Dry |
| 25-45 | Medium Dry |
| 45-65 | Sweet |
| > 65 | Very Sweet |
*Please note there is some variation in what counts as “sweet” when it comes to sugar levels. Some compare pH and sugar levels, which I may do at a later date, but for now I just wanted to show some uses of subsets. I picked the most common sugar levels I could find.
We see that none of the medium dry or sweet wines made it into the top white wines, so I decided to go back to the original dataset and visualize the sugar levels using a box plot. I understand there are simpler ways to group data like this (i.e., making a new column in the data frame and then creating a box plot), but I decided to go this route as a proof of concept, and to highlight column name changes, as these were necessary to produce a readable box plot.
Bonedry <- subset(whitewine, residual.sugar >= 0 & residual.sugar < 1, select= residual.sugar)
Dry <- subset(whitewine, residual.sugar >= 1 & residual.sugar < 5, select= residual.sugar)
Off_Dry <- subset(whitewine, residual.sugar >= 5 & residual.sugar < 25, select= residual.sugar)
Medium_Dry <- subset(whitewine, residual.sugar >= 25 & residual.sugar < 45, select= residual.sugar)
Sweet <-subset(whitewine, residual.sugar >= 45 & residual.sugar < 65, select= residual.sugar)
Very_Sweet <- subset(whitewine, residual.sugar >= 65, select= residual.sugar)
names(Bonedry)[names(Bonedry) == "residual.sugar"] <- "Bonedry"
names(Dry)[names(Dry) == "residual.sugar"] <- "Dry"
names(Off_Dry)[names(Off_Dry) == "residual.sugar"] <- "Off_Dry"
names(Medium_Dry)[names(Medium_Dry) == "residual.sugar"] <- "Medium"
names(Sweet)[names(Sweet) == "residual.sugar"] <- "Sweet"
names(Very_Sweet)[names(Very_Sweet) == "residual.sugar"] <- "Very_Sweet"
boxplot(c(Bonedry,Dry,Off_Dry,Medium_Dry,Sweet,Very_Sweet), main = "Sweetness Distribution")
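For comparison, here is a minimal sketch of the simpler route mentioned earlier: add a sweetness column with `cut()` and draw the box plot from a formula. The residual sugar values are made up so the block runs on its own; with the data loaded, the same `cut()` call could be applied to `whitewine$residual.sugar`. The bin edges and labels are taken from the sweetness table.

```r
# Made-up residual sugar values (g/L) standing in for whitewine$residual.sugar.
wine <- data.frame(
  residual.sugar = c(0.8, 1.2, 4.9, 5.0, 12.3, 19.2, 26.1, 31.6, 65.8)
)

# right = FALSE makes each bin [lower, upper), matching the >= / < comparisons
# used in the subsets above.
wine$sweetness <- cut(
  wine$residual.sugar,
  breaks = c(0, 1, 5, 25, 45, 65, Inf),
  labels = c("Bone Dry", "Dry", "Off-Dry", "Medium Dry", "Sweet", "Very Sweet"),
  right  = FALSE
)

# Number of wines in each sweetness class
print(table(wine$sweetness))

# One box per sweetness class, drawn from a single formula
boxplot(residual.sugar ~ sweetness, data = wine,
        main = "Sweetness Distribution", ylab = "Residual sugar (g/L)")
```

Because the class lives in a factor column, no renaming step is needed: the axis labels come straight from the `labels` argument.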
Citations:
https://vinewineandwander.wordpress.com/2012/03/10/0-10-decoded/
P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis. Modeling wine preferences by data mining from physicochemical properties. In Decision Support Systems, Elsevier, 47(4):547-553, 2009.