This dataset is publicly available for research. The details are described in [Cortez et al., 2009].
P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis. Modeling wine preferences by data mining from physicochemical properties. In Decision Support Systems, Elsevier, 47(4):547-553. ISSN: 0167-9236.
Available at: [@Elsevier] http://dx.doi.org/10.1016/j.dss.2009.05.016 [Pre-press (pdf)] http://www3.dsi.uminho.pt/pcortez/winequality09.pdf [bib] http://www3.dsi.uminho.pt/pcortez/dss09.bib
In the above reference, two datasets were created, using red and white wine samples. The inputs include objective tests (e.g. pH values) and the output is based on sensory data (median of at least 3 evaluations made by wine experts). Each expert graded the wine quality between 0 (very bad) and 10 (excellent).
The two datasets are related to red and white variants of the Portuguese “Vinho Verde” wine. For more details, consult: http://www.vinhoverde.pt/en/ or the reference [Cortez et al., 2009]. Due to privacy and logistic issues, only physicochemical (inputs) and sensory (the output) variables are available (e.g. there is no data about grape types, wine brand, wine selling price, etc.).
Number of instances: red wine - 1599; white wine - 4898.
Number of attributes: 11 + output attribute.
For more information, read [Cortez et al., 2009].
Input variables (based on physicochemical tests):
1 - fixed acidity (tartaric acid - g / dm^3)
2 - volatile acidity (acetic acid - g / dm^3)
3 - citric acid (g / dm^3)
4 - residual sugar (g / dm^3)
5 - chlorides (sodium chloride - g / dm^3)
6 - free sulfur dioxide (mg / dm^3)
7 - total sulfur dioxide (mg / dm^3)
8 - density (g / cm^3)
9 - pH
10 - sulphates (potassium sulphate - g / dm^3)
11 - alcohol (% by volume)
Output variable (based on sensory data):
12 - quality (score between 0 and 10)
1 - fixed acidity: most acids involved with wine are fixed or nonvolatile (do not evaporate readily)
2 - volatile acidity: the amount of acetic acid in wine, which at too high levels can lead to an unpleasant, vinegar taste
3 - citric acid: found in small quantities, citric acid can add ‘freshness’ and flavor to wines
4 - residual sugar: the amount of sugar remaining after fermentation stops, it’s rare to find wines with less than 1 gram/liter and wines with greater than 45 grams/liter are considered sweet
5 - chlorides: the amount of salt in the wine
6 - free sulfur dioxide: the free form of SO2 exists in equilibrium between molecular SO2 (as a dissolved gas) and bisulfite ion; it prevents microbial growth and the oxidation of wine
7 - total sulfur dioxide: amount of free and bound forms of SO2; in low concentrations, SO2 is mostly undetectable in wine, but at free SO2 concentrations over 50 ppm, SO2 becomes evident in the nose and taste of wine
8 - density: the density of wine is close to that of water depending on the percent alcohol and sugar content
9 - pH: describes how acidic or basic a wine is on a scale from 0 (very acidic) to 14 (very basic); most wines are between 3-4 on the pH scale
10 - sulphates: a wine additive which can contribute to sulfur dioxide gas (SO2) levels, which acts as an antimicrobial and antioxidant
11 - alcohol: the percent alcohol content of the wine
Output variable (based on sensory data): 12 - quality (score between 0 and 10)
# set up working directory
setwd("/home/daria/Courses/R/Udacity/EDA_Course_Materials/FinalProject")
# load all packages used in this exploratory analysis
library(ggplot2)
library(GGally)
red <- read.csv('wineQualityReds.csv')
white <- read.csv('wineQualityWhites.csv')
# add a categorical 'color' variable to both sets
red['color'] <- 'red'
white['color'] <- 'white'
# merge red wine and white wine datasets
data <- rbind(red, white)
head(data)
## X fixed.acidity volatile.acidity citric.acid residual.sugar chlorides
## 1 1 7.4 0.70 0.00 1.9 0.076
## 2 2 7.8 0.88 0.00 2.6 0.098
## 3 3 7.8 0.76 0.04 2.3 0.092
## 4 4 11.2 0.28 0.56 1.9 0.075
## 5 5 7.4 0.70 0.00 1.9 0.076
## 6 6 7.4 0.66 0.00 1.8 0.075
## free.sulfur.dioxide total.sulfur.dioxide density pH sulphates alcohol
## 1 11 34 0.9978 3.51 0.56 9.4
## 2 25 67 0.9968 3.20 0.68 9.8
## 3 15 54 0.9970 3.26 0.65 9.8
## 4 17 60 0.9980 3.16 0.58 9.8
## 5 11 34 0.9978 3.51 0.56 9.4
## 6 13 40 0.9978 3.51 0.56 9.4
## quality color
## 1 5 red
## 2 5 red
## 3 5 red
## 4 6 red
## 5 5 red
## 6 5 red
tail(data)
## X fixed.acidity volatile.acidity citric.acid residual.sugar
## 6492 4893 6.5 0.23 0.38 1.3
## 6493 4894 6.2 0.21 0.29 1.6
## 6494 4895 6.6 0.32 0.36 8.0
## 6495 4896 6.5 0.24 0.19 1.2
## 6496 4897 5.5 0.29 0.30 1.1
## 6497 4898 6.0 0.21 0.38 0.8
## chlorides free.sulfur.dioxide total.sulfur.dioxide density pH
## 6492 0.032 29 112 0.99298 3.29
## 6493 0.039 24 92 0.99114 3.27
## 6494 0.047 57 168 0.99490 3.15
## 6495 0.041 30 111 0.99254 2.99
## 6496 0.022 20 110 0.98869 3.34
## 6497 0.020 22 98 0.98941 3.26
## sulphates alcohol quality color
## 6492 0.54 9.7 5 white
## 6493 0.50 11.2 6 white
## 6494 0.46 9.6 5 white
## 6495 0.46 9.4 6 white
## 6496 0.38 12.8 7 white
## 6497 0.32 11.8 6 white
dim(data)
## [1] 6497 14
names(data)
## [1] "X" "fixed.acidity" "volatile.acidity"
## [4] "citric.acid" "residual.sugar" "chlorides"
## [7] "free.sulfur.dioxide" "total.sulfur.dioxide" "density"
## [10] "pH" "sulphates" "alcohol"
## [13] "quality" "color"
summary(data)
## X fixed.acidity volatile.acidity citric.acid
## Min. : 1 Min. : 3.800 Min. :0.0800 Min. :0.0000
## 1st Qu.: 813 1st Qu.: 6.400 1st Qu.:0.2300 1st Qu.:0.2500
## Median :1650 Median : 7.000 Median :0.2900 Median :0.3100
## Mean :2044 Mean : 7.215 Mean :0.3397 Mean :0.3186
## 3rd Qu.:3274 3rd Qu.: 7.700 3rd Qu.:0.4000 3rd Qu.:0.3900
## Max. :4898 Max. :15.900 Max. :1.5800 Max. :1.6600
## residual.sugar chlorides free.sulfur.dioxide
## Min. : 0.600 Min. :0.00900 Min. : 1.00
## 1st Qu.: 1.800 1st Qu.:0.03800 1st Qu.: 17.00
## Median : 3.000 Median :0.04700 Median : 29.00
## Mean : 5.443 Mean :0.05603 Mean : 30.53
## 3rd Qu.: 8.100 3rd Qu.:0.06500 3rd Qu.: 41.00
## Max. :65.800 Max. :0.61100 Max. :289.00
## total.sulfur.dioxide density pH sulphates
## Min. : 6.0 Min. :0.9871 Min. :2.720 Min. :0.2200
## 1st Qu.: 77.0 1st Qu.:0.9923 1st Qu.:3.110 1st Qu.:0.4300
## Median :118.0 Median :0.9949 Median :3.210 Median :0.5100
## Mean :115.7 Mean :0.9947 Mean :3.219 Mean :0.5313
## 3rd Qu.:156.0 3rd Qu.:0.9970 3rd Qu.:3.320 3rd Qu.:0.6000
## Max. :440.0 Max. :1.0390 Max. :4.010 Max. :2.0000
## alcohol quality color
## Min. : 8.00 Min. :3.000 Length:6497
## 1st Qu.: 9.50 1st Qu.:5.000 Class :character
## Median :10.30 Median :6.000 Mode :character
## Mean :10.49 Mean :5.818
## 3rd Qu.:11.30 3rd Qu.:6.000
## Max. :14.90 Max. :9.000
str(data)
## 'data.frame': 6497 obs. of 14 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ fixed.acidity : num 7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
## $ volatile.acidity : num 0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
## $ citric.acid : num 0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
## $ residual.sugar : num 1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
## $ chlorides : num 0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
## $ free.sulfur.dioxide : num 11 25 15 17 11 13 15 15 9 17 ...
## $ total.sulfur.dioxide: num 34 67 54 60 34 40 59 21 18 102 ...
## $ density : num 0.998 0.997 0.997 0.998 0.998 ...
## $ pH : num 3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
## $ sulphates : num 0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
## $ alcohol : num 9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
## $ quality : int 5 5 5 6 5 5 5 7 7 5 ...
## $ color : chr "red" "red" "red" "red" ...
The mean residual sugar level is 5.4 g/l, but one very sweet sample reaches 65.8 g/l (an outlier). Mean free sulfur dioxide is 30.5 ppm; the maximum of 289 is very high given that the third quartile is only 41 ppm. The pH ranges from 2.7 to 4.0 with a mean of 3.2, so there are no basic wines in this dataset (no high pH levels). Alcohol ranges from 8% for the lightest wine to 14.9% for the strongest. The minimum quality score is 3, the mean is 5.8 and the maximum is 9.
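The outlier remark above can be made concrete with a simple rule. The sketch below applies Tukey's 1.5 x IQR fence, a common convention chosen here for illustration and not something this analysis itself used; the `sugar` vector is made-up toy data, not the wine file.

```r
# Toy residual-sugar values (g/l); illustrative only, NOT the real dataset.
sugar <- c(1.2, 1.8, 1.9, 2.6, 3.0, 5.4, 8.0, 8.1, 65.8)

# First and third quartiles and the interquartile range.
q <- unname(quantile(sugar, c(0.25, 0.75)))
iqr <- q[2] - q[1]

# Assumption: values above Q3 + 1.5 * IQR are flagged as high outliers.
upper_fence <- q[2] + 1.5 * iqr
outliers <- sugar[sugar > upper_fence]
print(outliers)
```

On the real data the same lines could be run on `data$residual.sugar`; the fence multiplier is a judgment call rather than a fixed rule.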
First I’d like to look at single variables to get an idea of their distributions and to decide which ones to explore in more depth later.
summary(data$quality)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.000 5.000 6.000 5.818 6.000 9.000
table(data$quality)
##
## 3 4 5 6 7 8 9
## 30 216 2138 2836 1079 193 5
qplot(quality, data = data, fill = color, binwidth = 1) +
scale_x_continuous(breaks = seq(3,10,1), lim = c(3,10)) +
scale_y_sqrt()
We know that the numbers of observations for red and white wine differ in our dataset, but for both colors the distribution of quality still looks roughly normal, with peaks at almost the same quality scores of 5 and 6.
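Because the two samples differ in size, raw counts are hard to compare across colors. A minimal sketch of turning counts into within-color shares with `prop.table` follows; the `toy` data frame is hypothetical and only stands in for the merged wine data.

```r
# Hypothetical toy sample with unequal group sizes (4 red, 8 white).
toy <- data.frame(
  color   = c(rep("red", 4), rep("white", 8)),
  quality = c(5, 5, 6, 7, 5, 5, 6, 6, 6, 6, 7, 7)
)

# Raw counts per color and quality score.
counts <- table(toy$color, toy$quality)

# Row-wise proportions: each color's row sums to 1, so the
# two colors can be compared despite different sample sizes.
shares <- prop.table(counts, margin = 1)
print(round(shares, 2))
```

With the real data, `table(data$color, data$quality)` would take the place of `counts`.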
summary(data$alcohol)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.00 9.50 10.30 10.49 11.30 14.90
qplot(alcohol, data = data, fill = color, binwidth = 0.5) +
scale_x_continuous(breaks = seq(8,15,0.5), lim = c(8,15))
The alcohol level distribution looks skewed to the right. Again, the red wine sample is smaller, but it shows the same pattern of alcohol level distribution as the white wines. The most frequent alcohol level is about 9.5%, and the mean is 10.49%.
summary(data$density)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9871 0.9923 0.9949 0.9947 0.9970 1.0390
qplot(density, data = data, fill = color, binwidth = 0.0002) +
scale_x_log10(lim = c(min(data$density), 1.00370),
breaks = seq(min(data$density), 1.00370, 0.002))
Looking at the ‘table’ summary, we see two outliers: 1.0103 and 1.03898. To see the distribution of density more clearly, I used a log10 scale and limited the range of the data. Now we can see that the density distribution of white wine is bimodal, while that of red wine is approximately normal.
summary(data$volatile.acidity)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0800 0.2300 0.2900 0.3397 0.4000 1.5800
qplot(volatile.acidity, data = data, fill = color, binwidth = 0.001) +
scale_x_log10(breaks = seq(min(data$volatile.acidity),
max(data$volatile.acidity), 0.1))
## Warning: position_stack requires constant width: output may be incorrect
On the log10 scale, volatile acidity looks approximately normally distributed. I also suppose that wines with more acetic acid get worse marks, because high volatile acidity can lead to an unpleasant taste.
summary(data$chlorides)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00900 0.03800 0.04700 0.05603 0.06500 0.61100
qplot(chlorides, data = data, fill = color, binwidth = 0.01) +
scale_x_log10(breaks = seq(min(data$chlorides), max(data$chlorides), 0.1))
## Warning: position_stack requires constant width: output may be incorrect
The chlorides distribution is initially skewed, so I used a log10 scale to see it more clearly.
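The effect of the log10 transform on skew can also be checked numerically. The sketch below uses a hypothetical moment-based `skewness` helper (base R does not ship one) and made-up chloride-like values, so it only illustrates the idea rather than reproducing the plot above.

```r
# Hypothetical helper: moment-based sample skewness (not part of base R).
skewness <- function(x) {
  m <- mean(x)
  mean((x - m)^3) / mean((x - m)^2)^1.5
}

# Made-up, right-skewed values resembling chloride levels (g/dm^3).
chl <- c(0.02, 0.03, 0.04, 0.045, 0.05, 0.06, 0.08, 0.2, 0.6)

skew_raw <- skewness(chl)         # skew on the original scale
skew_log <- skewness(log10(chl))  # skew after the log10 transform

print(c(raw = skew_raw, log10 = skew_log))
```

A value closer to zero after the transform suggests a more symmetric distribution, which is what the log-scaled histogram is meant to show.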
qplot(quality, data = data, binwidth = 1, color = color, geom = "density") +
scale_x_continuous(breaks = seq(3, 9, 1))
In our sample the share of wines with quality ‘3’, ‘4’ and ‘9’ is almost the same for red and white; a larger share of red wines has quality ‘5’, and a larger share of white wines has quality ‘6’, ‘7’ and ‘8’.
ggpairs(data)
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
I checked the correlation between all the variables in the wine dataset.
Some pairs show a visible correlation; they are plotted one by one here:
# helper for the scatter plots: points colored by `z` plus a linear fit per group
# x, y and z are column names passed as strings
f <- function(dataset, x, y, z) {
  ggplot(dataset, aes_string(x = x, y = y, color = z)) +
    geom_point(alpha = 1/5, position = position_jitter(h = 0), size = 2) +
    geom_smooth(method = 'lm')
}
# density vs. alcohol plot
p <- f(data, "density", "alcohol", "color")
p + coord_cartesian(xlim=c(min(data$density),1.005), ylim=c(8,15))
# density vs. fixed.acidity plot
p <- f(data, "density", "fixed.acidity", "color")
p + coord_cartesian(xlim=c(min(data$density),1.005))
# residual.sugar vs. total.sulfur.dioxide
p <- f(data, "residual.sugar", "total.sulfur.dioxide", "color")
p + scale_x_log10() +
coord_cartesian(xlim=c(min(data$residual.sugar),30),
ylim=c(min(data$total.sulfur.dioxide), 350))
# residual.sugar vs. density
p <- f(data, "residual.sugar", "density", "color")
p + coord_cartesian(xlim=c(min(data$residual.sugar),25),
ylim=c(min(data$density), 1.005))
# residual.sugar vs. alcohol
p <- f(data, "residual.sugar", "alcohol", "color")
p + coord_cartesian(xlim=c(min(data$residual.sugar),25),
ylim=c(min(data$alcohol), 15))
# chlorides vs. density
p <- f(data, "chlorides", "density", "color")
p + scale_x_log10() +
coord_cartesian(ylim=c(min(data$density), 1.005))
# chlorides vs. sulphates
p <- f(data, "chlorides", "sulphates", "color")
p + scale_x_log10() +
coord_cartesian(ylim=c(min(data$sulphates), 1))
After checking the correlated pairs, I noticed that red and white wine behave differently in some graphs. The strength of a correlation can differ significantly between red and white wine.
Pair                                      | RED       | WHITE
alcohol vs. density                       | strong c. | strong c.
fixed.acidity vs. density                 | strong c. | no c.
residual.sugar vs. total.sulfur.dioxide   | weak c.   | weak c.
residual.sugar vs. density                | strong c. | strong c.
residual.sugar vs. alcohol                | no c.     | strong c.
chlorides vs. density                     | strong c. | strong c.
chlorides vs. sulphates                   | strong c. | no c.
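The red versus white comparison above was read off the plots. As a rough sketch, a per-color Pearson correlation can be computed by splitting the data by color first; the `toy` data frame and its values are invented for illustration, only the column names mirror the wine data.

```r
# Hypothetical toy data: the two colors behave differently on purpose.
toy <- data.frame(
  color          = rep(c("red", "white"), each = 5),
  residual.sugar = c(1, 2, 3, 4, 5, 1, 2, 3, 4, 5),
  alcohol        = c(10, 9, 11, 10, 10, 13, 12, 11, 10, 9)
)

# Split by color, then compute one correlation coefficient per group.
cor_by_color <- sapply(
  split(toy, toy$color),
  function(d) cor(d$residual.sugar, d$alcohol)
)
print(round(cor_by_color, 3))
```

Replacing `toy` with the merged `data` frame would give one coefficient per color for any pair of columns in the table.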
ggplot(data = data, aes(y = alcohol, x = quality)) +
geom_point(alpha = 1/4, position = position_jitter(h = 0), size = 4) +
geom_smooth(method = 'lm') +
facet_wrap(~ color)
My idea that volatile acidity affects the quality of wine turned out to be incorrect. The only objective wine parameter that correlates with quality is alcohol.
qplot(x = color, y = fixed.acidity, data = data, geom = "boxplot")
qplot(x = color, y = volatile.acidity, data = data, geom = "boxplot")
qplot(x = color, y = residual.sugar, data = data, geom = "boxplot")
qplot(x = color, y = total.sulfur.dioxide, data = data, geom = "boxplot")
These parameters depend strongly on the color of the wine.
qplot(x = color, y = quality, data = data, geom = "boxplot")
The median and the 25% and 75% quartiles are similar for red and white wines.
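The values a boxplot draws (lower quartile, median, upper quartile) can also be listed per color. This is a small sketch on invented numbers; `toy` is a hypothetical stand-in for the merged data.

```r
# Hypothetical toy quality scores for each color.
toy <- data.frame(
  color   = rep(c("red", "white"), each = 5),
  quality = c(5, 5, 6, 6, 7, 5, 6, 6, 6, 7)
)

# Quartiles per color: the box edges and the middle line of a boxplot.
quartiles <- lapply(
  split(toy$quality, toy$color),
  quantile,
  probs = c(0.25, 0.5, 0.75)
)
print(quartiles)
```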
# converting 'quality' vector into factor variable
data$quality <- as.factor(data$quality)
ggplot(aes(x = chlorides, y = sulphates, color = quality), data = data) +
facet_wrap(~color) +
geom_point(size = 3, alpha = 1/4) +
scale_color_identity(guide = 'legend') +
ylim(min(data$sulphates), quantile(data$sulphates, 0.95)) +
xlim(min(data$chlorides), quantile(data$chlorides, 0.95))
## Warning: Removed 391 rows containing missing values (geom_point).
## Warning: Removed 182 rows containing missing values (geom_point).
Sulphates and chlorides of white wine are more spread out than those of red wine. The most frequent quality levels for both colors are 4, 5, 6 and 7.
ggplot(aes(x = fixed.acidity,
y = volatile.acidity,
color = quality),
data = data) +
facet_wrap(~color) +
geom_point(size = 3, alpha = 1/4) +
scale_color_identity(guide = 'legend') +
ylim(min(data$volatile.acidity),
quantile(data$volatile.acidity, 0.99)) +
xlim(min(data$fixed.acidity),
quantile(data$fixed.acidity, 0.99))
## Warning: Removed 115 rows containing missing values (geom_point).
## Warning: Removed 7 rows containing missing values (geom_point).
Red wine of quality ‘5’ has fixed acidity between 6 and 12, while white wine of the same quality lies between 5 and 10. White wine samples of quality ‘6’ are highly concentrated around a volatile acidity of 0.2 and a fixed acidity of 6-7.
ggplot(aes(x = free.sulfur.dioxide,
y = total.sulfur.dioxide,
color = quality),
data = data) +
facet_wrap(~color) +
geom_point(size = 3, alpha = 1/4) +
scale_color_identity(guide = 'legend') +
ylim(min(data$total.sulfur.dioxide),
quantile(data$total.sulfur.dioxide, 0.95)) +
xlim(min(data$free.sulfur.dioxide),
quantile(data$free.sulfur.dioxide, 0.95))
## Warning: Removed 6 rows containing missing values (geom_point).
## Warning: Removed 523 rows containing missing values (geom_point).
We can see a clear positive correlation between total sulfur dioxide and free sulfur dioxide for both red and white wine. Total sulfur dioxide of white wine reaches higher values, mostly because of wines with quality ‘5’.
ggplot(aes(x = pH, y = alcohol, color = quality), data = data) +
facet_wrap(~color) +
geom_point(size = 3, alpha = 1/4) +
scale_color_identity(guide = 'legend') +
ylim(min(data$alcohol), quantile(data$alcohol, 0.95)) +
xlim(min(data$pH), quantile(data$pH, 0.95))
## Warning: Removed 202 rows containing missing values (geom_point).
## Warning: Removed 372 rows containing missing values (geom_point).
Alcohol levels are much the same for both wines, but the minimum pH is 2.9 for white wine and 3.1 for red wine. Only red wine with quality ‘5’ has a pH around 3.
ggplot(aes(x = citric.acid, y = alcohol, color = quality),
data = data) +
facet_wrap(~color) +
geom_point(size = 3, alpha = 1/4) +
scale_color_identity(guide = 'legend') +
ylim(min(data$alcohol), quantile(data$alcohol, 0.95)) +
xlim(min(data$citric.acid), quantile(data$citric.acid, 0.95))
## Warning: Removed 161 rows containing missing values (geom_point).
## Warning: Removed 430 rows containing missing values (geom_point).
In these plots we can see that most red wines cluster where citric acid is 0 - 0.2, while most white wines cluster where citric acid is 0.2 - 0.4.
summary(data$quality)
## 3 4 5 6 7 8 9
## 30 216 2138 2836 1079 193 5
table(data$quality)
##
## 3 4 5 6 7 8 9
## 30 216 2138 2836 1079 193 5
qplot(as.numeric(as.character(quality)),
data = data,
fill = color,
binwidth = 1,
origin = - 0.5,
main = "Quality of Red and White Wine") +
scale_x_continuous(breaks = seq(2,10,1), lim = c(2,10)) +
scale_y_sqrt(breaks = seq(0,5600,500)) +
xlab('Quality') +
ylab('Quantity')
Let’s look again at the distribution of wine quality by color. The most frequent quality levels are 5 and 6 for both wine colors.
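The histogram call above converts the factor back with `as.numeric(as.character(quality))` rather than `as.numeric(quality)`. A short sketch on a made-up vector shows why the extra step matters: applied to a factor directly, `as.numeric` returns the internal level codes and not the original scores.

```r
# Made-up quality scores stored as a factor (levels: "3", "5", "6", "9").
quality <- factor(c(5, 5, 6, 3, 9))

# Direct conversion yields level codes, not the scores themselves.
codes <- as.numeric(quality)                  # 2 2 3 1 4

# Going through character first recovers the original values.
values <- as.numeric(as.character(quality))   # 5 5 6 3 9

print(codes)
print(values)
```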
ggplot(data = data,
aes(x = density, y = alcohol, color = color)) +
geom_point(alpha = 1/6, position = position_jitter(h = 0), size = 3) +
geom_smooth(method = 'lm') +
coord_cartesian(xlim=c(min(data$density),1.005), ylim=c(8,15)) +
xlab('Density') +
ylab('Alcohol') +
ggtitle('Density vs. Alcohol correlation by Color')
Density and alcohol show the strongest correlation among all wine parameters: -0.687, computed over the red and white samples together. Red wines are on average stronger than white wines: in this dataset, wines with a lower alcohol percentage are mostly white, and red wines mostly have a higher alcohol percentage.
ggplot(data = data,
aes(x = density, y = alcohol, color = factor(quality))) +
geom_point(alpha = 1/2, position = position_jitter(h = 0), size = 2) +
coord_cartesian(xlim=c(min(data$density),1.005), ylim=c(8,15)) +
scale_color_brewer(type='qual') +
xlab('Density') +
ylab('Alcohol') +
ggtitle('Density vs. Alcohol correlation by Quality')
ggplot(data = data,
aes(x = density, y = alcohol) )+
facet_wrap( ~ quality) +
geom_boxplot() +
xlab('Density') +
ylab('Alcohol') +
ggtitle('Density vs. Alcohol correlation by Quality')
Wines with a high alcohol percentage tend to have quality level 7, and wines with a lower alcohol percentage tend to have quality level 5. Wines with quality levels 6 and 8 show various combinations of alcohol and density.
ggplot(data = data, aes(y = alcohol, x = quality)) +
geom_boxplot() +
geom_smooth(method = 'lm') +
facet_wrap(~ color) +
xlab('Quality') +
ylab('Alcohol') +
ggtitle('How Alcohol Level Affects Wine Quality')
## geom_smooth: Only one unique x value each group.Maybe you want aes(group = 1)?
## geom_smooth: Only one unique x value each group.Maybe you want aes(group = 1)?
Alcohol level and quality have a correlation of 0.4. This is the strongest correlation we have found between an objective wine parameter and wine quality. Still, 0.4 is not a high correlation, so alcohol alone cannot be used to predict quality.
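One way to see why 0.4 is weak for prediction: in a simple linear regression the share of variance explained equals the squared correlation, so r = 0.4 corresponds to about 0.16, or 16% of the variance in quality. The sketch below checks that identity on simulated data; the coefficients and noise level are arbitrary assumptions, not estimates from the wine files.

```r
set.seed(1)

# Simulated stand-ins for alcohol (%) and a noisy quality score.
alcohol <- runif(200, min = 8, max = 15)
quality <- 3 + 0.25 * alcohol + rnorm(200, sd = 1.2)

# Pearson correlation and the R-squared of a one-predictor linear model.
r <- cor(alcohol, quality)
fit <- lm(quality ~ alcohol)
r_squared <- summary(fit)$r.squared

# For simple linear regression these two numbers coincide.
print(c(r_squared_from_cor = r^2, r_squared_from_lm = r_squared))
```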
Analyzing the data, we can come to the following conclusion:
Based on my investigation, I conclude that experts’ decisions on wine quality are based on their personal tastes or could depend on other variables such as year of production, grape type, wine brand etc., as only one variable (alcohol level) correlates with wine quality.
For future exploration of this data I would pick one category of wine (for example, quality level 3-4, 5-7 or 8-9) and look at the patterns that appear in each of these three buckets. I would also normalize the data, because we have more white wine than red wine.
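The three quality buckets mentioned above could be built with `cut`. This is only a sketch of that future step: the bucket labels are invented here, and the input vector is a toy example with one wine per quality score.

```r
# Toy input: one wine for each quality score from 3 to 9.
quality <- c(3, 4, 5, 6, 7, 8, 9)

# Intervals are right-closed: (2,4], (4,7], (7,9].
# The labels are hypothetical names for the three buckets.
bucket <- cut(
  quality,
  breaks = c(2, 4, 7, 9),
  labels = c("low (3-4)", "medium (5-7)", "high (8-9)")
)

print(table(bucket))
```

On the real data, `quality` would first need converting back to numeric, since it is turned into a factor earlier in the analysis.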