Wine_research

Author: Moisieiev Vasyl

Affiliation: Kyiv School of Economics

Introduction and data

Sources:

Created by: Paulo Cortez (Univ. Minho), Antonio Cerdeira, Fernando Almeida, Telmo Matos and Jose Reis (CVRVV) @ 2009

Past Usage:

P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis. Modeling wine preferences by data mining from physicochemical properties. Decision Support Systems, Elsevier, 47(4):547-553, 2009. ISSN: 0167-9236.

In the above reference, two datasets were created, using red and white wine samples. The inputs include objective tests (e.g. pH values) and the output is based on sensory data (median of at least 3 evaluations made by wine experts). Each expert graded the wine quality between 0 (very bad) and 10 (very excellent). Several data mining methods were applied to model these datasets under a regression approach. The support vector machine model achieved the best results. Several metrics were computed: MAD, confusion matrix for a fixed error tolerance (T), etc. The authors also plotted the relative importance of the input variables (as measured by a sensitivity analysis procedure).

For more information, read [Cortez et al., 2009].

Attribute information:

Input variables (based on physicochemical tests):

  • fixed acidity - most acids involved with wine are fixed or nonvolatile (do not evaporate readily)
  • volatile acidity - the amount of acetic acid in wine, which at too high levels can lead to an unpleasant, vinegary taste
  • citric acid - found in small quantities, citric acid can add ‘freshness’ and flavor to wines
  • residual sugar - the amount of sugar remaining after fermentation stops; it is rare to find wines with less than 1 gram/liter, and wines with more than 45 grams/liter are considered sweet
  • chlorides - the amount of salt in the wine
  • free sulfur dioxide - the free form of SO2 exists in equilibrium between molecular SO2 (as a dissolved gas) and bisulfite ion; it prevents microbial growth and the oxidation of wine
  • total sulfur dioxide - amount of free and bound forms of SO2; in low concentrations, SO2 is mostly undetectable in wine, but at free SO2 concentrations over 50 ppm, SO2 becomes evident in the nose and taste of wine
  • density - the density of wine is close to that of water, depending on the percent alcohol and sugar content
  • pH - describes how acidic or basic a wine is on a scale from 0 (very acidic) to 14 (very basic); most wines are between 3-4 on the pH scale
  • sulphates - a wine additive which can contribute to sulfur dioxide gas (SO2) levels, which acts as an antimicrobial and antioxidant
  • alcohol - the percent alcohol content of the wine
  • type - the type of wine (red or white)

Output variable (based on sensory data):

  • quality (score between 0 and 10)

Research question

Does the alcohol content of wine influence its quality, and do the chemical properties like acidity, sugar, and sulfur dioxide contribute to the quality of red and white wines differently?

This research can provide insights into how certain chemical properties of wine affect its perceived quality, which can help wineries improve their production processes and marketing strategies. Additionally, understanding if the chemical composition of red and white wines has different impacts on quality can guide consumers in choosing wines based on their preferences and scientific backing.

Import and cleaning of the data

library(corrplot)
library(tidyverse)
library(ggplot2)
library(plotly)

Import all data and combine them into one dataframe.

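# Read both files; the wine-quality CSVs are semicolon-separated
red_wine_df <- read.csv("winequality-red.csv", sep = ";")
white_wine_df <- read.csv("winequality-white.csv", sep = ";")

# Tag every sample with its wine type before stacking
red_wine_df <- red_wine_df %>% mutate(type = "red")
white_wine_df <- white_wine_df %>% mutate(type = "white")

# Stack the two data frames row-wise and preview the result
wine_df <- rbind(red_wine_df, white_wine_df)
head(wine_df)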

Then we need to check for missing values, duplicates, and the structure of the data.

wine_df %>% 
  group_by(type) %>% 
  summarise(count = n())
colSums(is.na(wine_df))
       fixed.acidity     volatile.acidity          citric.acid 
                   0                    0                    0 
      residual.sugar            chlorides  free.sulfur.dioxide 
                   0                    0                    0 
total.sulfur.dioxide              density                   pH 
                   0                    0                    0 
           sulphates              alcohol              quality 
                   0                    0                    0 
                type 
                   0 
anyDuplicated(wine_df)
[1] 5
wine_df <- unique(wine_df)
anyDuplicated(wine_df)
[1] 0
glimpse(wine_df)
Rows: 5,320
Columns: 13
$ fixed.acidity        <dbl> 7.4, 7.8, 7.8, 11.2, 7.4, 7.9, 7.3, 7.8, 7.5, 6.7…
$ volatile.acidity     <dbl> 0.700, 0.880, 0.760, 0.280, 0.660, 0.600, 0.650, …
$ citric.acid          <dbl> 0.00, 0.00, 0.04, 0.56, 0.00, 0.06, 0.00, 0.02, 0…
$ residual.sugar       <dbl> 1.9, 2.6, 2.3, 1.9, 1.8, 1.6, 1.2, 2.0, 6.1, 1.8,…
$ chlorides            <dbl> 0.076, 0.098, 0.092, 0.075, 0.075, 0.069, 0.065, …
$ free.sulfur.dioxide  <dbl> 11, 25, 15, 17, 13, 15, 15, 9, 17, 15, 16, 9, 52,…
$ total.sulfur.dioxide <dbl> 34, 67, 54, 60, 40, 59, 21, 18, 102, 65, 59, 29, …
$ density              <dbl> 0.9978, 0.9968, 0.9970, 0.9980, 0.9978, 0.9964, 0…
$ pH                   <dbl> 3.51, 3.20, 3.26, 3.16, 3.51, 3.30, 3.39, 3.36, 3…
$ sulphates            <dbl> 0.56, 0.68, 0.65, 0.58, 0.56, 0.46, 0.47, 0.57, 0…
$ alcohol              <dbl> 9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 10.0, 9.5, 10.5, 9.…
$ quality              <int> 5, 5, 5, 6, 5, 5, 7, 7, 5, 5, 5, 5, 5, 5, 7, 5, 4…
$ type                 <chr> "red", "red", "red", "red", "red", "red", "red", …
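Note that the value returned by anyDuplicated() above is the index of the first duplicated row, not the number of duplicates. The following minimal sketch shows the difference on a small, made-up data frame (the `toy` data and variable names are illustrative assumptions, not part of the original analysis):

```r
# Toy data frame, made up for illustration; rows 3 and 5 repeat rows 1 and 2
toy <- data.frame(
  alcohol = c(9.4, 9.8, 9.4, 10.5, 9.8),
  quality = c(5, 5, 5, 6, 5),
  type    = c("red", "red", "red", "white", "red")
)

first_dup <- anyDuplicated(toy)    # index of the first duplicated row (0 if none)
n_dup     <- sum(duplicated(toy))  # number of rows that repeat an earlier row
deduped   <- unique(toy)           # keeps the first occurrence of each row

print(c(first_dup = first_dup, n_dup = n_dup, rows_left = nrow(deduped)))
```

Applied to the real data, sum(duplicated(wine_df)) before the call to unique() would report how many rows are dropped.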

Hypotheses

1 Higher alcohol content in wine leads to a better quality rating.

ggplot(wine_df, aes(x = factor(quality), y = alcohol, fill = type)) +
  geom_boxplot() +
  labs(title = "Alcohol Content by Wine Quality and Type", 
       x = "Wine Quality", 
       y = "Alcohol Content") +
  theme_minimal()

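Here we can see that alcohol content is related to wine quality: wines with higher quality ratings tend to have a higher alcohol content. The boxplots show an association and do not by themselves establish that alcohol causes the higher rating.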
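The boxplot gives only a visual impression. One possible way to put a number on the association is a Spearman rank correlation between alcohol and quality, computed separately for each wine type. The sketch below is an illustration under assumptions: the helper name `alcohol_quality_rho` and the toy data are hypothetical and not part of the original report.

```r
# Hypothetical helper: Spearman rank correlation between alcohol and
# quality, computed separately for each wine type
alcohol_quality_rho <- function(df) {
  sapply(split(df, df$type), function(d) {
    cor(d$alcohol, d$quality, method = "spearman")
  })
}

# Toy data, made up purely to show the call; column names follow wine_df
toy <- data.frame(
  type    = rep(c("red", "white"), each = 5),
  alcohol = c(9.4, 9.8, 10.5, 11.2, 12.0, 9.0, 10.1, 10.9, 11.5, 12.8),
  quality = c(5, 5, 6, 6, 7, 5, 6, 6, 7, 7)
)

rho <- alcohol_quality_rho(toy)
print(rho)
```

With the real data, alcohol_quality_rho(wine_df) would return one coefficient for red and one for white wine. A rank correlation is used here because quality is an ordinal score.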

2 Wines with balanced levels of acidity, sugar, and sulfur dioxide tend to have higher quality ratings.

p <- ggplot(wine_df, aes(x = fixed.acidity, fill = factor(quality))) +
  geom_density(alpha = 0.4) +
  labs(title = "Density of Fixed Acidity by Wine Quality", 
       x = "Fixed Acidity", 
       y = "Density") +
  facet_wrap(~type) +
  theme_minimal()
ggplotly(p)

The density curves suggest that the best-rated red wines have a fixed acidity of around 6.5, while for the best-rated white wines it is around 7.5.

p <- ggplot(wine_df, aes(x = total.sulfur.dioxide, fill = factor(quality))) +
  geom_density(alpha = 0.4) +
  labs(title = "Density of Total Sulfur Dioxide by Wine Quality", 
       x = "Total Sulfur Dioxide", 
       y = "Density") +
  facet_wrap(~type) +
  theme_minimal()
ggplotly(p)

Similarly, the best-rated red wines have a total sulfur dioxide level of around 50, while for the best-rated white wines it is around 100.

ggplot(wine_df, aes(x = factor(quality), y = residual.sugar, fill = type)) +
  geom_boxplot() +
  labs(title = "Residual Sugar by Wine Quality", 
       x = "Wine Quality", 
       y = "Residual Sugar") +
  theme_minimal()

The best-rated red wines have a residual sugar level of around 2, and the best-rated white wines of around 6.
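The "around X" values quoted for acidity, sulfur dioxide, and sugar are read off the plots. One way to cross-check such readings is to tabulate the median of each variable per wine type and quality score. The following is a minimal sketch on made-up data (the `toy` data frame is an assumption for illustration; only the column names match wine_df):

```r
# Toy data, made up for illustration only; column names follow wine_df
toy <- data.frame(
  type                 = rep(c("red", "white"), each = 4),
  quality              = c(5, 5, 7, 7, 5, 5, 7, 7),
  fixed.acidity        = c(8.0, 8.4, 7.0, 7.2, 6.9, 7.1, 6.6, 6.8),
  total.sulfur.dioxide = c(60, 70, 30, 40, 150, 160, 110, 120),
  residual.sugar       = c(2.0, 2.4, 2.1, 2.3, 7.0, 8.0, 4.0, 5.0)
)

# Median of each variable for every (type, quality) combination
medians <- aggregate(
  cbind(fixed.acidity, total.sulfur.dioxide, residual.sugar) ~ type + quality,
  data = toy,
  FUN = median
)
print(medians)
```

Replacing `toy` with wine_df gives one row per wine type and quality score, which can be compared directly with the values read from the density plots and boxplots.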

3 The relationship between alcohol content, acidity, and sulfur dioxide on wine quality differs between red and white wines.

# Variables for the correlation analysis; total.sulfur.dioxide is included
# because hypothesis 3 refers to sulfur dioxide
cor_vars <- c("alcohol", "quality", "fixed.acidity", "volatile.acidity",
              "citric.acid", "residual.sugar", "total.sulfur.dioxide")

# One numeric data frame per wine type
cor_data_red <- wine_df[wine_df$type == "red", cor_vars]
cor_data_white <- wine_df[wine_df$type == "white", cor_vars]

cor_matrix_red <- cor(cor_data_red, use = "complete.obs")
cor_matrix_white <- cor(cor_data_white, use = "complete.obs")


corrplot(cor_matrix_red, method = "circle", title = "Red Wine Correlation")

corrplot(cor_matrix_white, method = "circle", title = "White Wine Correlation")

As the correlation matrices show, the association of alcohol content, acidity, and sulfur dioxide with wine quality differs between red and white wines. For red wine, alcohol content has a stronger positive correlation with quality, while for white wine, acidity and sulfur dioxide have a stronger positive correlation with quality.
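Differences between two correlation plots are hard to judge by eye. A possible complement is to put the correlations with quality side by side in one table. The sketch below assumes the type column contains exactly "red" and "white"; the helper name `quality_cor_by_type` and the toy data are hypothetical and only illustrate the idea.

```r
# Hypothetical helper: Pearson correlation of each variable with quality,
# per wine type, plus the red-minus-white difference
quality_cor_by_type <- function(df, vars) {
  per_type <- lapply(split(df, df$type), function(d) {
    sapply(vars, function(v) cor(d[[v]], d$quality, use = "complete.obs"))
  })
  out <- as.data.frame(do.call(cbind, per_type))  # rows: variables, columns: types
  out$difference <- out$red - out$white
  out
}

# Toy data, made up purely to show the call; column names follow wine_df
toy <- data.frame(
  type                 = rep(c("red", "white"), each = 5),
  quality              = c(5, 5, 6, 6, 7, 5, 6, 6, 7, 7),
  alcohol              = c(9.4, 9.8, 10.5, 11.2, 12.0, 9.0, 10.1, 10.9, 11.5, 12.8),
  total.sulfur.dioxide = c(70, 60, 55, 40, 35, 180, 150, 140, 120, 110)
)

cor_table <- quality_cor_by_type(toy, c("alcohol", "total.sulfur.dioxide"))
print(cor_table)
```

With the real data, the same call on wine_df with the variables of interest shows at a glance where the sign or size of a correlation differs between the two wine types.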

Conclusion

The analysis of the wine dataset revealed several insights into the factors associated with wine quality. The best-rated white wines tend to be sweeter and to contain more sulfur dioxide, whereas the best-rated red wines tend to have more alcohol and higher acidity. The association of alcohol content, acidity, and sulfur dioxide with wine quality therefore differs between red and white wines. These findings can help wineries and consumers make informed decisions about wine production and selection based on scientific evidence. Further research could explore additional factors that influence wine quality and how they interact with the chemical properties analyzed in this study.