Predicting the Quality of a Wine based on its Alcohol Percentage

2024-11-01

Wine Testing Setup

Two different judges were asked to rate a variety of different wines.

The wines were unlabeled to remove any bias that the judges may have towards a particular wine.

Scores range from 0 to 100, and a higher score represents a better wine.

Collected Data

   wine judge.A judge.B alc.per
1     A      15      21       5
2     B      76      83      13
3     C      77      92      15
4     D      79      81      21
5     E      80      84      20
6     F      82      72      19
7     G      85      73      26
8     H      86      99      27
9     I      93      94      25
10    J      99      91      25
11    K      96      89      24
12    L      98      95      28
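The table above can be rebuilt as an R data frame so that the models later in this report are reproducible. This is a minimal sketch: the values are copied directly from the table, and the object name `wine` is assumed from the `data = wine` argument used in the model calls.

```r
# Rebuild the collected data by hand (values copied from the table above).
# The name `wine` is an assumption taken from the later lm() calls.
wine <- data.frame(
  wine    = LETTERS[1:12],
  judge.A = c(15, 76, 77, 79, 80, 82, 85, 86, 93, 99, 96, 98),
  judge.B = c(21, 83, 92, 81, 84, 72, 73, 99, 94, 91, 89, 95),
  alc.per = c(5, 13, 15, 21, 20, 19, 26, 27, 25, 25, 24, 28)
)

# Inspect the structure to confirm 12 rows and 4 columns
str(wine)
```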

The ratings by Judge A and Judge B of the different wines were compared.

Wine A received the lowest score from both judges (15 from Judge A and 21 from Judge B).

Applying Linear Regression

We model each judge’s rating as a linear function of alcohol content (%), fitting one model for Judge A and one for Judge B.

The linear regression equation used is as follows:

\[ \hat{y} = \beta_0 + \beta_1 \times \text{a\%} \]

where:

\(\hat{y}\) is the predicted score from the judge being modeled
\(\beta_0\) is the intercept
\(\beta_1\) is the coefficient for alcohol percentage
a% is the alcohol percentage of the wine

Finding Goodness of Fit

The goodness of fit for the model is represented by the \(R^2\) value.

The \(R^2\) equation used is as follows:

\[ R^2 = 1 - \frac{\sum (y_i - \hat{y_i})^2}{\sum (y_i - \bar{y})^2} \]

where:

\(y_i\) are the observed values
\(\hat{y_i}\) are the predicted values
\(\bar{y}\) is the mean of observed values
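To make the formula concrete, the sketch below computes \(R^2\) by hand for Judge A and compares it with the value reported by `summary()`. The data are copied from the Collected Data table; the names `fit`, `ss_res`, `ss_tot`, and `r2_manual` are illustrative and not part of the original analysis.

```r
# Judge A scores and alcohol percentages, copied from the Collected Data table
wine <- data.frame(
  judge.A = c(15, 76, 77, 79, 80, 82, 85, 86, 93, 99, 96, 98),
  alc.per = c(5, 13, 15, 21, 20, 19, 26, 27, 25, 25, 24, 28)
)

# Fit the simple linear regression (illustrative object name)
fit <- lm(judge.A ~ alc.per, data = wine)

y     <- wine$judge.A   # observed values y_i
y_hat <- fitted(fit)    # predicted values from the model

ss_res <- sum((y - y_hat)^2)    # residual sum of squares (numerator)
ss_tot <- sum((y - mean(y))^2)  # total sum of squares (denominator)

r2_manual <- 1 - ss_res / ss_tot

# Compare the hand-computed value with the one lm() reports
all.equal(r2_manual, summary(fit)$r.squared)
```

The hand-computed value should agree with the `Multiple R-squared` line of the model summary, since both follow the same definition.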

Each model was fit in R with lm(), and the \(R^2\) value was extracted from the model summary.

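# Regress Judge A's score on alcohol percentage and extract R-squared
modelA <- lm(judge.A ~ alc.per, data = wine)
r2_A <- summary(modelA)$r.squared

# Same model and R-squared extraction for Judge B
modelB <- lm(judge.B ~ alc.per, data = wine)
r2_B <- summary(modelB)$r.squared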

Model Summary: Score vs. Alcohol Percentage for Judge A

Call:
lm(formula = judge.A ~ alc.per, data = wine)

Residuals:
     Min       1Q   Median       3Q      Max 
-20.9850  -5.1661   0.7908   6.1994  17.2839 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  21.7781    10.9351   1.992 0.074425 .  
alc.per       2.8414     0.5046   5.631 0.000218 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 11.4 on 10 degrees of freedom
Multiple R-squared:  0.7603,    Adjusted R-squared:  0.7363 
F-statistic: 31.71 on 1 and 10 DF,  p-value: 0.000218

Model Summary: Score vs. Alcohol Percentage for Judge B

Call:
lm(formula = judge.B ~ alc.per, data = wine)

Residuals:
     Min       1Q   Median       3Q      Max 
-24.0986  -3.6196   0.0082   3.5315  23.8792 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)   
(Intercept)  33.5875    13.6471   2.461  0.03361 * 
alc.per       2.3022     0.6297   3.656  0.00442 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 14.23 on 10 degrees of freedom
Multiple R-squared:  0.572, Adjusted R-squared:  0.5292 
F-statistic: 13.37 on 1 and 10 DF,  p-value: 0.004419
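As a worked illustration of how the fitted coefficients in the two summaries above turn into predictions, consider a hypothetical wine with 20% alcohol. The 20% value and the subscripts \(A\) and \(B\) are chosen here only for illustration. Substituting the estimated intercepts and slopes into the regression equation gives roughly:

\[ \hat{y}_A = 21.7781 + 2.8414 \times 20 \approx 78.61 \]

\[ \hat{y}_B = 33.5875 + 2.3022 \times 20 \approx 79.63 \]

These are point predictions only. The residual standard errors (11.4 and 14.23) suggest that individual scores can differ from them considerably.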

Predicting the score from Judge A based on the alcohol percentage of the wine.


Predicting the score from Judge B based on the alcohol percentage of the wine.


Results

For Judge A, \(R^2\)=0.76 when comparing alcohol percentage and rating.

For Judge B, \(R^2\)=0.57 when comparing alcohol percentage and rating.

Based on the summaries of the models relating each judge’s rating to alcohol percentage, the p-values for the alcohol coefficient are:

For Judge A, p=0.000218

For Judge B, p=0.004419

Discussion

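The p-values for both models (p=0.000218 for Judge A and p=0.004419 for Judge B) are below 0.05, indicating a statistically significant linear association between alcohol percentage and each judge’s scores. With only 12 wines and no controlled experiment, this shows association rather than proof that alcohol content causes higher scores.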

However, the higher \(R^2\) for Judge A (\(R^2\)=0.76, versus \(R^2\)=0.57 for Judge B) means that alcohol percentage explains more of the variation in Judge A’s ratings (about 76%) than in Judge B’s (about 57%).

Additionally, the variance left unexplained by the linear models (about 24% for Judge A and 43% for Judge B) indicates that factors other than alcohol percentage likely affect a wine’s rating.