Data 606 Data Project

Abstract

This project investigates how different chemical properties affect the quality of the Portuguese “Vinho Verde” red wine. We use a dataset of wine samples to examine the relationship between properties like fixed acidity, volatile acidity, and alcohol content, and the quality ratings given by wine experts. Our goal is to see if these properties can predict wine quality. We use basic statistics, correlation analysis, and multiple linear regression to explore these relationships. The results show significant links between some chemical properties and wine quality, offering useful insights for wine producers to improve their products.

Part 1 - Introduction

The quality of wine is a complex attribute influenced by various chemical properties. Understanding these influences can help wine producers enhance their products and meet consumer expectations. This project aims to investigate how different chemical properties affect the quality of Portuguese “Vinho Verde” red wine.

The research question guiding this study is: “How do various chemical properties of wine influence its quality, and can we predict wine quality based on these properties?” This question is addressed using a dataset of wine samples, which includes measurements of properties such as fixed acidity, volatile acidity, and alcohol content, along with quality ratings provided by wine experts.

By analyzing the relationships between these chemical properties and the quality ratings, this study seeks to identify significant predictors of wine quality. The findings will provide valuable insights for wine producers, enabling them to make data-driven decisions to improve the quality of their wines. The use of descriptive statistics, correlation analysis, and multiple linear regression ensures a comprehensive examination of the data, allowing for robust conclusions to be drawn.

Part 2 - Data

# load packages
library(dplyr)
library(tidyverse)

# load data
wines <- data.frame(read_csv(file = "https://raw.githubusercontent.com/Yedzinovich/FALL2024TIDYVERSE/refs/heads/main/WineQT.csv"))

## Rows: 1143 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (13): fixed acidity, volatile acidity, citric acid, residual sugar, chlo...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

# display column names
colnames(wines)

##  [1] "fixed.acidity"        "volatile.acidity"     "citric.acid"         
##  [4] "residual.sugar"       "chlorides"            "free.sulfur.dioxide" 
##  [7] "total.sulfur.dioxide" "density"              "pH"                  
## [10] "sulphates"            "alcohol"              "quality"             
## [13] "Id"

# remove unnecessary columns (id column)
wines <- wines[, -13]

# rename columns to be more readable (only the columns whose names change are listed)
wines <- wines %>% rename(
  fixed_acidity = fixed.acidity,
  volatile_acidity = volatile.acidity,
  citric_acid = citric.acid,
  residual_sugar = residual.sugar,
  free_sulfur_dioxide = free.sulfur.dioxide,
  total_sulfur_dioxide = total.sulfur.dioxide,
  ph = pH
)

head(wines)

##   fixed_acidity volatile_acidity citric_acid residual_sugar chlorides
## 1           7.4             0.70        0.00            1.9     0.076
## 2           7.8             0.88        0.00            2.6     0.098
## 3           7.8             0.76        0.04            2.3     0.092
## 4          11.2             0.28        0.56            1.9     0.075
## 5           7.4             0.70        0.00            1.9     0.076
## 6           7.4             0.66        0.00            1.8     0.075
##   free_sulfur_dioxide total_sulfur_dioxide density   ph sulphates alcohol
## 1                  11                   34  0.9978 3.51      0.56     9.4
## 2                  25                   67  0.9968 3.20      0.68     9.8
## 3                  15                   54  0.9970 3.26      0.65     9.8
## 4                  17                   60  0.9980 3.16      0.58     9.8
## 5                  11                   34  0.9978 3.51      0.56     9.4
## 6                  13                   40  0.9978 3.51      0.56     9.4
##   quality
## 1       5
## 2       5
## 3       5
## 4       6
## 5       5
## 6       5
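The first rows printed above already contain a repeated measurement (rows 1 and 5 are identical), so a quick integrity check may be worthwhile before the analysis. The sketch below is one possible way to do it in base R. `check_integrity` is a hypothetical helper, not part of the original analysis, and the toy data frame only stands in for `wines`.

```r
# Hypothetical helper: counts missing values per column and fully duplicated rows.
check_integrity <- function(df) {
  list(
    missing_per_column = colSums(is.na(df)),
    duplicated_rows    = sum(duplicated(df))
  )
}

# Toy stand-in for `wines` (values copied from the first rows shown above).
toy <- data.frame(
  fixed_acidity    = c(7.4, 7.8, 7.8, 11.2, 7.4),
  volatile_acidity = c(0.70, 0.88, 0.76, 0.28, 0.70),
  alcohol          = c(9.4, 9.8, 9.8, 9.8, 9.4),
  quality          = c(5, 5, 5, 6, 5)
)

integrity <- check_integrity(toy)
print(integrity)
```

On the real data the call would be `check_integrity(wines)`. Whether repeated rows are genuine repeat measurements or accidental copies cannot be told from the file alone.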

Part 3 - Exploratory data analysis

Let’s perform an exploratory data analysis on the wine dataset.

Section 1: Descriptive Statistics and Distribution Analysis

1 - Summary statistics: calculate summary statistics (mean, median, standard deviation, etc.) for each variable to understand their central tendency and dispersion.

2 - Histograms: create histograms for each variable to visualize their distributions. This helps in identifying the shape of the data, presence of outliers, and skewness.

3 - Box plots: create box plots for each variable to visualize their spread and identify potential outliers.

Note: we exclude the quality variable in this step because we are focusing on visualizing the distributions of the chemical properties of the wine. The quality variable is the dependent variable (the outcome we are trying to predict), and it doesn’t need to be included in the histograms/plots of the independent variables.

summary(wines)

##  fixed_acidity    volatile_acidity  citric_acid     residual_sugar  
##  Min.   : 4.600   Min.   :0.1200   Min.   :0.0000   Min.   : 0.900  
##  1st Qu.: 7.100   1st Qu.:0.3925   1st Qu.:0.0900   1st Qu.: 1.900  
##  Median : 7.900   Median :0.5200   Median :0.2500   Median : 2.200  
##  Mean   : 8.311   Mean   :0.5313   Mean   :0.2684   Mean   : 2.532  
##  3rd Qu.: 9.100   3rd Qu.:0.6400   3rd Qu.:0.4200   3rd Qu.: 2.600  
##  Max.   :15.900   Max.   :1.5800   Max.   :1.0000   Max.   :15.500  
##    chlorides       free_sulfur_dioxide total_sulfur_dioxide    density      
##  Min.   :0.01200   Min.   : 1.00       Min.   :  6.00       Min.   :0.9901  
##  1st Qu.:0.07000   1st Qu.: 7.00       1st Qu.: 21.00       1st Qu.:0.9956  
##  Median :0.07900   Median :13.00       Median : 37.00       Median :0.9967  
##  Mean   :0.08693   Mean   :15.62       Mean   : 45.91       Mean   :0.9967  
##  3rd Qu.:0.09000   3rd Qu.:21.00       3rd Qu.: 61.00       3rd Qu.:0.9978  
##  Max.   :0.61100   Max.   :68.00       Max.   :289.00       Max.   :1.0037  
##        ph          sulphates         alcohol         quality     
##  Min.   :2.740   Min.   :0.3300   Min.   : 8.40   Min.   :3.000  
##  1st Qu.:3.205   1st Qu.:0.5500   1st Qu.: 9.50   1st Qu.:5.000  
##  Median :3.310   Median :0.6200   Median :10.20   Median :6.000  
##  Mean   :3.311   Mean   :0.6577   Mean   :10.44   Mean   :5.657  
##  3rd Qu.:3.400   3rd Qu.:0.7300   3rd Qu.:11.10   3rd Qu.:6.000  
##  Max.   :4.010   Max.   :2.0000   Max.   :14.90   Max.   :8.000
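`summary()` reports quartiles and means but not the standard deviation or any measure of skewness. A minimal sketch of how these could be added in base R follows. `describe_numeric` is a hypothetical helper, the skewness is the simple moment-based version, and the toy data frame stands in for `wines`.

```r
# Hypothetical helper: mean, median, SD and moment-based skewness per numeric column.
describe_numeric <- function(df) {
  skewness <- function(x) {
    x <- x[!is.na(x)]
    m <- mean(x)
    mean((x - m)^3) / mean((x - m)^2)^1.5
  }
  stats <- sapply(df, function(x) {
    c(mean     = mean(x, na.rm = TRUE),
      median   = median(x, na.rm = TRUE),
      sd       = sd(x, na.rm = TRUE),
      skewness = skewness(x))
  })
  as.data.frame(t(stats))
}

# Toy stand-in for `wines`: one symmetric and one right-skewed column.
toy <- data.frame(
  symmetric    = c(1, 2, 3, 4, 5),
  right_skewed = c(1, 1, 1, 2, 10)
)

described <- describe_numeric(toy)
print(round(described, 3))
```

A skewness near zero suggests a roughly symmetric distribution, while a clearly positive value points to a right tail. On the real data the call would be `describe_numeric(wines)`.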

# Histograms for each chemical property
wines %>% gather(key = "variable", value = "value", -quality) %>% # gather() reshapes the data from wide to long format
  ggplot(aes(x = value)) +
  geom_histogram(bins = 30, fill = "#800020", color = "black") +
  facet_wrap(~variable, scales = "free_x") +
  theme_minimal() +
  labs(title = "Histograms of Wine Chemical Properties", x = "Value", y = "Frequency")

# Box plots for each variable of wine chemical properties
wines %>% gather(key = "variable", value = "value", -quality) %>%
  ggplot(aes(x = variable, y = value)) +
  geom_boxplot(fill = "#dbf47c", color = "black") +
  theme_minimal() +
  labs(title = "Box Plots of Wine Chemical Properties", x = "Variable", y = "Value")

This initial exploratory data analysis gives an overview of the dataset and of the basic characteristics and distributions of the wine’s chemical properties.

The histograms display the frequency distribution of each chemical property. Let’s go through them one by one.

Alcohol: Positively skewed, with most wines having an alcohol content in the range of 8-12%.
Chlorides: Strongly skewed to the right, suggesting most wines have very low chloride levels.
Citric Acid: Many wines have low citric acid levels, with a tail extending towards higher values.
Density: The distribution is tight, peaking around ~0.995-1.000.
Fixed Acidity: Peaks in the range of 6-10 g/L, with a slight right skew.
Free Sulfur Dioxide: Right-skewed.
pH: Nearly normal distribution centered around ~3.3.
Residual Sugar: Strongly skewed, as most wines have residual sugar below 5 g/L, with a few sweet wines extending to higher levels.
Sulphates: Concentrated around ~0.5-1.0 g/L, with a right skew.
Total Sulfur Dioxide: Skewed to the right, with most wines having levels below 100 ppm.
Volatile Acidity: Peaks around ~0.3-0.5 g/L, with a median of 0.52 and a right tail reaching 1.58.

Now, let’s have a look at the box plots of wine chemicals. The box plots highlight the spread and presence of outliers in the same variables.

Alcohol: A small range without noticeable outliers.
Chlorides: Some outliers.
Citric Acid: Some outliers.
Density: Very few outliers.
Fixed Acidity: Moderate spread, with some high outliers.
Free Sulfur Dioxide: Moderate spread, with some high outliers.
pH: Tight distribution with few outliers.
Residual Sugar: Some high outliers.
Sulphates: A narrow spread with high outliers.
Total Sulfur Dioxide: A wide range with high outliers.
Volatile Acidity: A narrow range with high outliers.

Some observations that can be made based on the plots above:

Skewed Distributions: Many chemical properties are positively skewed (such as chlorides, residual sugar, and total sulfur dioxide), indicating that typical wines cluster at low levels of these properties, with a few exceptions.
Outliers: The box plots reveal pronounced outliers in some variables, particularly chlorides, residual sugar, and total sulfur dioxide, which might correspond to specialty wines or experimental samples.
Tight Ranges: Variables like density and pH show tight clustering, which is consistent with standard winemaking practices.
Right-Tailed Distributions: These reflect a few instances of atypical wines with unusual chemical properties.
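The outlier observations above are read off the plots by eye. As a rough cross-check, the points a box plot draws individually can be counted with the 1.5 × IQR rule, which is the default convention of `geom_boxplot()`. The sketch below assumes that rule. `count_iqr_outliers` is a hypothetical helper and the toy data stands in for `wines`.

```r
# Hypothetical helper: number of points beyond the k * IQR fences, per column.
count_iqr_outliers <- function(df, k = 1.5) {
  sapply(df, function(x) {
    q <- quantile(x, probs = c(0.25, 0.75), na.rm = TRUE)
    fence <- k * (q[2] - q[1])
    sum(x < q[1] - fence | x > q[2] + fence, na.rm = TRUE)
  })
}

# Toy stand-in for `wines`.
toy <- data.frame(
  no_outliers = 1:11,
  one_outlier = c(1:10, 100)
)

outlier_counts <- count_iqr_outliers(toy)
print(outlier_counts)
```

On the real data, `count_iqr_outliers(wines[, names(wines) != "quality"])` would give one count per chemical property, which makes statements such as “some outliers” and “very few outliers” comparable across variables.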

Section 2: Correlation Analysis

In this section, we are going to compute correlation coefficients between quality and other variables to identify any significant relationships.

correlations <- cor(wines)
correlations["quality", ]

##        fixed_acidity     volatile_acidity          citric_acid 
##           0.12197010          -0.40739351           0.24082084 
##       residual_sugar            chlorides  free_sulfur_dioxide 
##           0.02200193          -0.12408453          -0.06325964 
## total_sulfur_dioxide              density                   ph 
##          -0.18333915          -0.17520792          -0.05245303 
##            sulphates              alcohol              quality 
##           0.25771026           0.48486621           1.00000000

Based on the correlation values between wine quality and various chemical properties, the following can be observed:

Alcohol: This has the highest positive correlation with wine quality (0.485). This suggests that wines with higher alcohol content tend to receive higher quality ratings.
Sulphates: There is a positive correlation (0.258) as well, indicating that higher sulphate levels are somewhat associated with better wine quality.
Citric Acid: There is a positive correlation (0.241), suggesting a slight tendency for wines with more citric acid to be of higher quality.
Fixed Acidity: There is a weak positive correlation (0.122) with wine quality.
Residual Sugar: There is a very small positive correlation (0.022), indicating little to no linear relationship between sugar content and wine quality.
Volatile Acidity: There is a negative correlation (-0.407), suggesting that higher volatile acidity is associated with lower wine quality.
Chlorides: There is a weak negative correlation (-0.124), suggesting that higher chloride levels are associated with slightly lower wine quality.
Total Sulfur Dioxide: There is a weak negative correlation (-0.183), indicating that higher levels are associated with slightly lower wine quality.
Density: There is a weak negative correlation (-0.175), suggesting that denser wines tend to be rated slightly lower.
pH: There is a small negative correlation (-0.052), indicating little to no linear relationship with wine quality.
Free Sulfur Dioxide: There is a small negative correlation (-0.063), indicating little to no linear relationship with wine quality.

As we can see, alcohol content and sulphates are the most positively correlated with wine quality, while volatile acidity has the strongest negative correlation. The other chemical properties show weaker relationships with wine quality.
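The ranking described above can also be produced directly instead of being read from the printed vector. The sketch below sorts the correlations with the target by absolute value. It also accepts `method = "spearman"`, which may be worth comparing because quality is an integer score rather than a continuous measurement. `rank_correlations` is a hypothetical helper and the toy data is invented for illustration.

```r
# Hypothetical helper: correlations of every other column with `target`,
# sorted by absolute strength.
rank_correlations <- function(df, target, method = "pearson") {
  r <- cor(df, method = method)[target, ]
  r <- r[names(r) != target]
  r[order(abs(r), decreasing = TRUE)]
}

# Toy stand-in for `wines` (invented values).
toy <- data.frame(
  quality = c(5, 5, 6, 6, 7, 7, 8, 4),
  alcohol = c(9.4, 9.8, 10.5, 10.9, 11.8, 12.1, 13.0, 9.0),
  acidity = c(0.70, 0.66, 0.50, 0.52, 0.35, 0.40, 0.30, 0.90),
  sugar   = c(1.9, 2.6, 2.3, 1.9, 2.0, 2.4, 2.1, 2.2)
)

pearson_ranked  <- rank_correlations(toy, "quality")
spearman_ranked <- rank_correlations(toy, "quality", method = "spearman")
print(round(pearson_ranked, 3))
print(round(spearman_ranked, 3))
```

On the real data the call would be `rank_correlations(wines, "quality")`.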

Part 4 - Inference

From the above correlations, alcohol content has the strongest positive correlation with wine quality. Let’s test the hypothesis that this positive correlation is statistically significant.

Section 1: Hypotheses:

Null Hypothesis (H0): The correlation between alcohol content and wine quality is zero.
Alternative Hypothesis (H1): The correlation between alcohol content and wine quality is positive.

correlation <- cor(wines$alcohol, wines$quality)
cor_test <- cor.test(wines$alcohol, wines$quality)
cor_test

## 
##  Pearson's product-moment correlation
## 
## data:  wines$alcohol and wines$quality
## t = 18.727, df = 1141, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.4392310 0.5280056
## sample estimates:
##       cor 
## 0.4848662

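Explanation of Results
Pearson Correlation Coefficient: 0.4848662
P-value: 1.924653e-68

The p-value is far below the 0.05 significance level, so we reject the null hypothesis. Note that cor.test() ran its default two-sided test (the output reads “true correlation is not equal to 0”). Because the estimate is positive, the one-sided p-value for a positive correlation is half the two-sided value, so the conclusion is unchanged: alcohol content and wine quality are significantly positively correlated.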
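The test statistic in the output can be reproduced from the correlation and the sample size alone, using t = r·sqrt(n − 2) / sqrt(1 − r²). The short sketch below does this with the values reported above and also computes the one-sided p-value that matches the directional alternative hypothesis.

```r
# Values reported in the cor.test() output above.
r <- 0.4848662   # sample correlation
n <- 1143        # number of wines (df = n - 2 = 1141)

# Test statistic for H0: rho = 0.
t_stat <- r * sqrt(n - 2) / sqrt(1 - r^2)

# One-sided p-value for H1: rho > 0 (upper tail of the t distribution).
p_one_sided <- pt(t_stat, df = n - 2, lower.tail = FALSE)

print(round(t_stat, 3))
print(p_one_sided)
```

The same one-sided test can be requested directly with `cor.test(wines$alcohol, wines$quality, alternative = "greater")`.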

Section 2: Hypothesis Testing

Let’s conduct a t-test to compare the mean alcohol content of high-quality and low-quality wines.

# Define high-quality and low-quality wines
high_quality <- subset(wines, quality >= 7) # why 7? the highest rating in the data is 8, so ratings 7 and 8 form the high-quality group
low_quality <- subset(wines, quality < 7)

head(high_quality)

##     fixed_acidity volatile_acidity citric_acid residual_sugar chlorides
## 8             7.3             0.65        0.00            1.2     0.065
## 9             7.8             0.58        0.02            2.0     0.073
## 13            8.5             0.28        0.56            1.8     0.092
## 28            8.1             0.38        0.28            2.1     0.066
## 90            8.0             0.59        0.16            1.8     0.065
## 144           9.6             0.32        0.47            1.4     0.056
##     free_sulfur_dioxide total_sulfur_dioxide density   ph sulphates alcohol
## 8                    15                   21 0.99460 3.39      0.47    10.0
## 9                     9                   18 0.99680 3.36      0.57     9.5
## 13                   35                  103 0.99690 3.30      0.75    10.5
## 28                   13                   30 0.99680 3.23      0.73     9.7
## 90                    3                   16 0.99620 3.42      0.92    10.5
## 144                   9                   24 0.99695 3.22      0.82    10.3
##     quality
## 8         7
## 9         7
## 13        7
## 28        7
## 90        7
## 144       7

head(low_quality)

##   fixed_acidity volatile_acidity citric_acid residual_sugar chlorides
## 1           7.4             0.70        0.00            1.9     0.076
## 2           7.8             0.88        0.00            2.6     0.098
## 3           7.8             0.76        0.04            2.3     0.092
## 4          11.2             0.28        0.56            1.9     0.075
## 5           7.4             0.70        0.00            1.9     0.076
## 6           7.4             0.66        0.00            1.8     0.075
##   free_sulfur_dioxide total_sulfur_dioxide density   ph sulphates alcohol
## 1                  11                   34  0.9978 3.51      0.56     9.4
## 2                  25                   67  0.9968 3.20      0.68     9.8
## 3                  15                   54  0.9970 3.26      0.65     9.8
## 4                  17                   60  0.9980 3.16      0.58     9.8
## 5                  11                   34  0.9978 3.51      0.56     9.4
## 6                  13                   40  0.9978 3.51      0.56     9.4
##   quality
## 1       5
## 2       5
## 3       5
## 4       6
## 5       5
## 6       5

# T-tests for alcohol
t_test_alcohol <- t.test(high_quality$alcohol, low_quality$alcohol)
t_test_alcohol

## 
##  Welch Two Sample t-test
## 
## data:  high_quality$alcohol and low_quality$alcohol
## t = 14.687, df = 210.02, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  1.09246 1.43119
## sample estimates:
## mean of x mean of y 
##  11.52841  10.26658

Explanation of the Welch two-sample t-test for alcohol:
t-value: 14.687
Degrees of Freedom (df): 210.02
p-value: < 2.2e-16
95% Confidence Interval: [1.09246, 1.43119]

Sample Estimates:
Mean of high-quality wines: 11.52841
Mean of low-quality wines: 10.26658

t-value: The t-value of 14.687 indicates the difference in means between the two groups is 14.687 times the standard error of the difference. A t-value of 14.687 suggests a very strong difference between the groups, far beyond what we would expect by random chance, leading us to conclude that the difference in alcohol content is statistically significant.
p-value: The p-value is less than 2.2e-16, which is extremely small and much less than 0.05. This indicates very strong evidence against the null hypothesis. We reject the null hypothesis and conclude that there is a significant difference in alcohol content between high-quality and low-quality wines.
Confidence Interval: The 95% confidence interval [1.09246, 1.43119] does not include 0, confirming that the difference in means is significant.
Means: High-quality wines have a higher mean alcohol content (11.52841) compared to low-quality wines (10.26658).

Summary

The t-test shows a significant difference in alcohol content between high-quality and low-quality wines: high-quality wines contain about 1.26 percentage points more alcohol on average. The very low p-value and the confidence interval that excludes 0 provide strong evidence that this difference is statistically significant, in line with the correlation test above.
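To make the statement that the t-value is the difference in means divided by its standard error concrete, the Welch statistic and its non-integer degrees of freedom can be computed by hand and compared with `t.test()`. The sketch below uses simulated samples, not the wine data, and `welch_t` is a hypothetical helper.

```r
# Hypothetical helper: Welch t statistic and Welch-Satterthwaite degrees of freedom.
welch_t <- function(x, y) {
  vx <- var(x) / length(x)
  vy <- var(y) / length(y)
  se <- sqrt(vx + vy)
  df <- (vx + vy)^2 / (vx^2 / (length(x) - 1) + vy^2 / (length(y) - 1))
  c(t = (mean(x) - mean(y)) / se, df = df)
}

# Simulated stand-ins for the two alcohol samples (not the real data).
set.seed(606)
high <- rnorm(40, mean = 11.5, sd = 1.0)
low  <- rnorm(200, mean = 10.3, sd = 0.9)

manual   <- welch_t(high, low)
built_in <- t.test(high, low)
print(manual)
print(c(t = unname(built_in$statistic), df = unname(built_in$parameter)))
```

The degrees of freedom are not a whole number because the Welch test does not assume equal variances in the two groups, which is also why the output above reports df = 210.02.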

Section 3: Linear Regression

Simple Linear Regression

The linear regression model is used to predict wine quality based on alcohol content.

# Simple linear regression for alcohol
model_simple <- lm(quality ~ alcohol, data = wines)
summary(model_simple)

## 
## Call:
## lm(formula = quality ~ alcohol, data = wines)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.8224 -0.4000 -0.1725  0.5152  2.5748 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.88701    0.20240   9.323   <2e-16 ***
## alcohol      0.36104    0.01928  18.727   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7051 on 1141 degrees of freedom
## Multiple R-squared:  0.2351, Adjusted R-squared:  0.2344 
## F-statistic: 350.7 on 1 and 1141 DF,  p-value: < 2.2e-16

A simple linear regression model is fitted to predict wine quality based on alcohol content. The regression results show that for each unit increase in alcohol content, the wine quality is expected to increase by 0.36104 units. The coefficient for alcohol is highly significant, with a t-value of 18.727 and a p-value of less than 2e-16. This indicates that the effect of alcohol content on wine quality is statistically significant. The model’s R-squared value is 0.2351, meaning that approximately 23.51% of the variability in wine quality can be explained by alcohol content alone.
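The fitted line can be written as quality ≈ 1.88701 + 0.36104 × alcohol. The sketch below plugs a few alcohol values into that equation using the coefficients reported above, and checks that R-squared equals the squared correlation, as it does for any regression with a single predictor. `predict_quality` is a hypothetical helper.

```r
# Coefficients reported in the summary(model_simple) output above.
intercept <- 1.88701
slope     <- 0.36104

# Hypothetical helper: fitted quality score for a given alcohol content (% vol).
predict_quality <- function(alcohol) intercept + slope * alcohol

predicted <- predict_quality(c(9, 10, 12, 14))
print(round(predicted, 2))

# With a single predictor, R-squared equals the squared Pearson correlation.
r_squared_from_r <- 0.4848662^2
print(round(r_squared_from_r, 4))
```

With the fitted object, the equivalent call is `predict(model_simple, newdata = data.frame(alcohol = c(9, 10, 12, 14)))`. Across this range the fitted score only moves from about 5.1 to 6.9, so the model rarely predicts ratings at the extremes of the scale.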

While the analysis shows a significant positive correlation between alcohol content and wine quality, it doesn’t mean that higher alcohol content always results in better wine. The correlation coefficient of 0.4848662 indicates a moderate relationship, suggesting that alcohol content is one of several factors influencing wine quality.

Wine quality is determined by a complex interplay of various chemical properties, including acidity, sugar content, tannins, and more. Higher alcohol content can enhance certain flavors and contribute to the overall balance of the wine, but it can also overpower other characteristics if not well-balanced.

Multiple Linear Regression

Next, we build a multiple linear regression model that predicts quality from ten of the eleven chemical properties (all except citric acid, which is not in the model formula).

# Multiple linear regression
model_multiple <- lm(quality ~ fixed_acidity + volatile_acidity + residual_sugar + chlorides + free_sulfur_dioxide + total_sulfur_dioxide + density + ph + sulphates + alcohol, data = wines)
summary(model_multiple)

## 
## Call:
## lm(formula = quality ~ fixed_acidity + volatile_acidity + residual_sugar + 
##     chlorides + free_sulfur_dioxide + total_sulfur_dioxide + 
##     density + ph + sulphates + alcohol, data = wines)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.45636 -0.36756 -0.04732  0.44086  2.00042 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           2.212e+01  2.476e+01   0.894 0.371688    
## fixed_acidity         1.566e-02  2.869e-02   0.546 0.585157    
## volatile_acidity     -1.072e+00  1.193e-01  -8.986  < 2e-16 ***
## residual_sugar        1.316e-02  1.845e-02   0.713 0.475870    
## chlorides            -1.824e+00  4.734e-01  -3.854 0.000123 ***
## free_sulfur_dioxide   2.632e-03  2.529e-03   1.041 0.298244    
## total_sulfur_dioxide -2.946e-03  8.114e-04  -3.631 0.000295 ***
## density              -1.800e+01  2.527e+01  -0.712 0.476320    
## ph                   -3.979e-01  2.224e-01  -1.789 0.073910 .  
## sulphates             8.720e-01  1.334e-01   6.536 9.56e-11 ***
## alcohol               2.759e-01  3.075e-02   8.972  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6404 on 1132 degrees of freedom
## Multiple R-squared:  0.3739, Adjusted R-squared:  0.3684 
## F-statistic: 67.61 on 10 and 1132 DF,  p-value: < 2.2e-16

The multiple regression analysis reveals several key factors that significantly contribute to wine quality. Among the chemical properties analyzed, volatile acidity, chlorides, total sulfur dioxide, sulphates, and alcohol content emerged as significant predictors. Specifically, volatile acidity has a negative impact on wine quality, with an estimate of -1.072 and a highly significant p-value of less than 2e-16. Similarly, chlorides negatively affect wine quality, with an estimate of -1.824 and a p-value of 0.000123. Total sulfur dioxide also shows a negative impact, with an estimate of -0.002946 and a p-value of 0.000295.

On the positive side, sulphates and alcohol content significantly enhance wine quality. Sulphates have an estimate of 0.8720 and a p-value of 9.56e-11, indicating a strong positive effect. Alcohol content, with an estimate of 0.2759 and a p-value of less than 2e-16, also significantly improves wine quality. These results demonstrate that higher levels of sulphates and alcohol are associated with better quality wines.

Conversely, some factors did not show a significant effect on wine quality in this model. Fixed acidity (estimate: 0.01566, p-value: 0.585157), residual sugar (estimate: 0.01316, p-value: 0.475870), free sulfur dioxide (estimate: 0.002632, p-value: 0.298244), density (estimate: -18.00, p-value: 0.476320), and pH (estimate: -0.3979, p-value: 0.073910) were not statistically significant predictors. This suggests that, once the other predictors are accounted for, these properties add little further information about quality. It does not mean they are unrelated to quality on their own: density, for instance, has a correlation of -0.175 with quality but a very large standard error (25.27) in this model.

The model explains approximately 37.39% of the variability in wine quality (Multiple R-squared: 0.3739), highlighting that while these chemical properties are important, other factors not included in the model also play a crucial role in determining wine quality. This analysis underscores the complexity of wine quality assessment, where multiple interacting factors contribute to the final evaluation.
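A coefficient with a very large standard error, such as density in the output above, is the kind of pattern that collinearity among predictors can produce, although the output alone does not establish that. One possible check is the variance inflation factor, VIF_j = 1 / (1 − R²_j), where R²_j comes from regressing predictor j on all the other predictors. The sketch below computes it in base R on simulated predictors, not on the wine data. `vif_table` is a hypothetical helper.

```r
# Hypothetical helper: variance inflation factor for each predictor,
# VIF_j = 1 / (1 - R^2_j), where R^2_j comes from regressing
# predictor j on all the other predictors.
vif_table <- function(predictors) {
  sapply(names(predictors), function(name) {
    fit <- lm(reformulate(setdiff(names(predictors), name), response = name),
              data = predictors)
    1 / (1 - summary(fit)$r.squared)
  })
}

# Simulated predictors (not the wine data): x3 is nearly x1 + x2.
set.seed(606)
x1 <- rnorm(200)
x2 <- rnorm(200)
x3 <- x1 + x2 + rnorm(200, sd = 0.1)
x4 <- rnorm(200)

vifs <- vif_table(data.frame(x1, x2, x3, x4))
print(round(vifs, 1))
```

Values close to 1 indicate a predictor that is nearly independent of the others, while large values flag predictors whose coefficients are hard to estimate separately. On the real data the helper would be applied to the ten predictor columns of `wines`.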

Section 4: ANOVA:

Let’s complement the multiple regression with an analysis-of-variance table for the same model. Calling aov() with the regression formula partitions the variation in quality into a sum of squares for each predictor plus a residual.

This is not a one-way ANOVA, which would compare mean quality across the levels of a single categorical factor. All ten predictors here (fixed_acidity, volatile_acidity, residual_sugar, chlorides, free_sulfur_dioxide, total_sulfur_dioxide, density, ph, sulphates, and alcohol) are continuous, so each takes one degree of freedom, and the sums of squares are sequential (Type I): every row tests a predictor after adjusting only for the predictors listed above it. The model includes no interactions between the independent variables.

# Perform ANOVA
anova_model <- aov(quality ~ fixed_acidity + volatile_acidity + residual_sugar + 
                   chlorides + free_sulfur_dioxide + total_sulfur_dioxide + 
                   density + ph + sulphates + alcohol, data = wines)

summary(anova_model)

##                        Df Sum Sq Mean Sq F value   Pr(>F)    
## fixed_acidity           1   11.0   11.03  26.898 2.54e-07 ***
## volatile_acidity        1  112.4  112.36 273.946  < 2e-16 ***
## residual_sugar          1    0.2    0.20   0.481  0.48814    
## chlorides               1    8.3    8.28  20.190 7.73e-06 ***
## free_sulfur_dioxide     1    3.0    2.96   7.208  0.00736 ** 
## total_sulfur_dioxide    1   16.0   16.04  39.097 5.71e-10 ***
## density                 1   49.0   49.02 119.521  < 2e-16 ***
## ph                      1    6.4    6.44  15.705 7.87e-05 ***
## sulphates               1   38.0   37.96  92.543  < 2e-16 ***
## alcohol                 1   33.0   33.01  80.490  < 2e-16 ***
## Residuals            1132  464.3    0.41                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The ANOVA table shows how much of the variation in quality each predictor accounts for as it enters the model. Because the tests are sequential, the F-values depend on the order of the terms in the formula, and they indicate the strength of an effect but not its direction; the signs come from the regression coefficients. Volatile acidity has the highest F-value (273.946, p-value: < 2e-16). Density (F-value: 119.521), sulphates (F-value: 92.543), and alcohol (F-value: 80.490) follow, each with a p-value below 2e-16. Other significant terms are total sulfur dioxide (F-value: 39.097, p-value: 5.71e-10), fixed acidity (F-value: 26.898, p-value: 2.54e-07), chlorides (F-value: 20.190, p-value: 7.73e-06), pH (F-value: 15.705, p-value: 7.87e-05), and free sulfur dioxide (F-value: 7.208, p-value: 0.00736). Residual sugar is not significant (F-value: 0.481, p-value: 0.48814).

Only the last row, alcohol, is adjusted for every other predictor, which is why its F-value matches the square of its regression t-value (8.972² ≈ 80.5). Fixed acidity, free sulfur dioxide, density, and pH are significant here but not in the regression summary because their rows are adjusted for fewer predictors.

These results highlight the complex interplay of various chemical properties in determining wine quality, with volatile acidity, alcohol, and sulphates being particularly influential.
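Because `aov()` on a regression formula uses sequential sums of squares, reordering the terms changes the table. The sketch below illustrates this on simulated data with two correlated predictors (not the wine data), and shows `drop1()` as one way to obtain tests in which every term is adjusted for all the others.

```r
# Simulated data (not the wine data) with two correlated predictors.
set.seed(606)
x1 <- rnorm(300)
x2 <- 0.8 * x1 + rnorm(300, sd = 0.6)
y  <- 1 + 0.5 * x2 + rnorm(300)
sim <- data.frame(y, x1, x2)

fit_a <- lm(y ~ x1 + x2, data = sim)
fit_b <- lm(y ~ x2 + x1, data = sim)

# Sequential (Type I) tables: the F value for x1 depends on where it enters.
seq_a <- anova(fit_a)
seq_b <- anova(fit_b)
print(seq_a)
print(seq_b)

# Marginal tests: each term is tested after all the others, so order no longer matters.
marginal <- drop1(fit_a, test = "F")
print(marginal)
```

In the marginal table each F-value is the square of the corresponding t-value in `summary(fit_a)`, so it agrees with the regression output. For the wine model the analogous call would be `drop1(model_multiple, test = "F")`.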

Observations: Which One to Trust? Correlation Analysis? Multiple Regression? ANOVA?

Correlation Analysis: Measures the strength and direction of the linear relationship between two variables, without considering the influence of other variables.

Multiple Regression: Evaluates the relationship between one dependent variable (wine quality) and multiple independent variables (sulphates, alcohol, acidity, etc.). It provides a coefficient for each independent variable, indicating its association with the dependent variable while controlling for the other variables. It is well suited to predicting the dependent variable and comparing the contribution of each predictor.

ANOVA (Analysis of Variance): In its classic form, ANOVA compares the means of different groups to determine whether there are statistically significant differences between them. In the context of multiple linear regression, the ANOVA table shows how much of the variability in the dependent variable (wine quality) each predictor (sulphates, alcohol, acidity, etc.) explains. With sequential sums of squares, the result for a predictor depends on which predictors precede it.

For a quick overview of the relationship between two variables: trust correlation analysis.
For prediction and detailed insights: trust multiple regression.
For comparing group means: trust ANOVA.

In our case, where the goal is to understand and predict wine quality based on various chemical properties, multiple regression is the more appropriate and reliable method. It allows us to see the individual contributions of each variable and make informed predictions about wine quality.
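Since prediction is part of the research question, one way to judge the models is to measure their error on wines that were not used for fitting. The sketch below fits on a random 80% of the rows and reports the root mean squared error (RMSE) on the remaining 20%. This hold-out step is not part of the original analysis. `holdout_rmse` is a hypothetical helper and the data is simulated.

```r
# Hypothetical helper: fit on a random training split, report RMSE on the rest.
holdout_rmse <- function(formula, data, train_frac = 0.8, seed = 606) {
  set.seed(seed)
  train_idx <- sample(nrow(data), size = floor(train_frac * nrow(data)))
  fit  <- lm(formula, data = data[train_idx, ])
  test <- data[-train_idx, ]
  response <- all.vars(formula)[1]
  sqrt(mean((test[[response]] - predict(fit, newdata = test))^2))
}

# Simulated stand-in for `wines` (not the real data).
set.seed(1)
sim <- data.frame(alcohol          = runif(1000, 8, 15),
                  volatile_acidity = runif(1000, 0.1, 1.2))
sim$quality <- 3 + 0.3 * sim$alcohol - 1.5 * sim$volatile_acidity +
  rnorm(1000, sd = 0.6)

rmse_simple   <- holdout_rmse(quality ~ alcohol, sim)
rmse_multiple <- holdout_rmse(quality ~ alcohol + volatile_acidity, sim)
print(c(simple = rmse_simple, multiple = rmse_multiple))
```

On the real data the same helper could compare `quality ~ alcohol` with the full formula used for `model_multiple`. An RMSE is expressed in quality points, which makes it easier to interpret than R-squared when the goal is prediction.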

Part 5 - Conclusion

The analysis of the wine dataset reveals that various chemical properties significantly influence wine quality. Through correlation analysis, multiple regression, and ANOVA, we identified key factors that contribute to wine quality.

Key Factors:

Sulphates: Sulphates positively influence wine quality. The multiple regression analysis indicated a significant positive effect (Estimate: 0.8720, p-value: 9.56e-11), and the ANOVA results supported this with an F-value of 92.543 and a p-value of < 2e-16.

Alcohol Content: There is a significant positive correlation between alcohol content and wine quality. The Pearson correlation coefficient is 0.4848662, and the p-value is 1.924653e-68, indicating that higher alcohol content is associated with better wine quality. This was further supported by the multiple regression analysis, where alcohol content had a significant positive effect (Estimate: 0.2759, p-value: < 2e-16).

Volatile Acidity: This factor has a strong negative impact on wine quality. The multiple regression analysis showed a significant negative effect (Estimate: -1.072, p-value: < 2e-16), and the ANOVA results confirmed its importance with an F-value of 273.946 and a p-value of < 2e-16.

Chlorides and Total Sulfur Dioxide: Both have significant negative effects on wine quality. Chlorides (Estimate: -1.824, p-value: 0.000123) and total sulfur dioxide (Estimate: -0.002946, p-value: 0.000295) were significant in the multiple regression analysis, and the ANOVA results confirmed their impact with F-values of 20.190 and 39.097, respectively.

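Other Factors: Fixed acidity, free sulfur dioxide, density, and pH were significant in the sequential ANOVA table but not in the multiple regression, where each predictor is adjusted for all the others. Their contribution once the other properties are accounted for is therefore weaker than that of the factors mentioned above.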

Predictive Power: The multiple regression model explained approximately 37.39% of the variability in wine quality (R-squared: 0.3739), indicating that while these chemical properties are important predictors, other factors not included in the model also play a crucial role. The ANOVA results further validated the significance of these properties.

Positive Influences on Wine Quality:
Higher alcohol content
Higher sulphates

Negative Influences on Wine Quality:
Higher volatile acidity
Higher chlorides
Higher total sulfur dioxide

Conclusion: These key chemical properties significantly influence wine quality. Higher alcohol content and sulphates are linked to better quality, while higher volatile acidity, chlorides, and total sulfur dioxide have a negative impact. This analysis suggests that wine quality can be predicted based on these chemical properties, though other factors also play a role. These insights are valuable for winemakers aiming to optimize wine quality through careful management of its chemical composition.

References

The data for this study was not self-collected. It is sourced from Kaggle, specifically from the dataset titled “Wine Quality Dataset” provided by Yasser H. You can access the dataset using the following link: https://www.kaggle.com/datasets/yasserh/wine-quality-dataset/data