Assignment 4 - The Study of Wine Quality

Question 1
Question 2
Question 3

Question 1

Produce summary statistics of “residual.sugar” and use its median to divide the data into two groups A and B. We want to test if the “distribution” in Group A and Group B has the same population mean

State the null hypothesis
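- The null hypothesis states: Group A and Group B, formed by splitting the data at the median of residual sugar, have the same population mean of the response variable (the t-test further down uses density as that response).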

library(tidyverse)  # needed for %>%, if_else(), mutate(), and ggplot() used below
medianresidualsugar <- median(wine$`residual sugar`)
summary(wine$`residual sugar`) #Produce summary statistics of “residual.sugar”

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.900   1.900   2.200   2.539   2.600  15.500

wine$group <- if_else(wine$`residual sugar` <= medianresidualsugar, 'A', 'B')
# Frequency table
table(wine$group)

## 
##   A   B 
## 883 716

tapply(wine$`residual sugar`, wine$group, summary)

## $A
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.900   1.800   1.900   1.894   2.100   2.200 
## 
## $B
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.250   2.400   2.600   3.334   3.400  15.500
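
The unequal group sizes (883 vs 716) follow from the `<=` rule: every wine whose residual sugar equals the median lands in Group A. A minimal sketch on a made-up vector (the values are illustrative and not taken from the wine data):

```r
# Toy values (hypothetical, not from the wine data) with ties at the median
x <- c(1.8, 1.9, 2.2, 2.2, 2.2, 2.6, 3.4)
med <- median(x)

# Same rule as above: values equal to the median go to group A
grp <- ifelse(x <= med, "A", "B")
table(grp)
```

With three values tied at the median, Group A ends up larger than Group B even though the split point is the median.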

Use visualization tools to inspect the hypothesis

wine %>% ggplot(mapping = aes(x = `residual sugar`, fill = group)) + 
  geom_density(alpha =0.7) +
  labs(title = "Density Distribution of Residual Sugar by Group") +
  theme_grey() +  # Default
  theme(legend.position = "right")

wine %>% ggplot(mapping = aes(x = group, y = `residual sugar`, fill = group)) +
  geom_boxplot() +
  labs(title = "Boxplot of Residual Sugar by Group",
       x = "Group",
       y = "Residual Sugar")

The mean residual sugar of the two groups looks quite different. Is the same true in the population?
What test are you going to use?
- The box plot and density plot both indicate a difference in means, so we use a t-test for this hypothesis.
- Based on the table and plots, the group variances are quite different, and Group B has a very long right tail. We therefore use a Welch two-sample t-test, which does not assume equal variances and uses the Satterthwaite approximation for the degrees of freedom.

q1ttest <- t.test(density ~ group, data = wine, alternative = "two.sided", var.equal=FALSE)
q1ttest

## 
##  Welch Two Sample t-test
## 
## data:  density by group
## t = -14.697, df = 1365.2, p-value < 0.00000000000000022
## alternative hypothesis: true difference in means between group A and group B is not equal to 0
## 95 percent confidence interval:
##  -0.001513022 -0.001156687
## sample estimates:
## mean in group A mean in group B 
##       0.9961490       0.9974838
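
The Welch statistic and the Satterthwaite degrees of freedom in the output above can be reproduced by hand. A minimal sketch on simulated data (the vectors `a` and `b` and their parameters are made up for illustration and are not the wine data):

```r
# Simulated data (hypothetical): two groups with unequal variances and sizes
set.seed(1)
a <- rnorm(40, mean = 10, sd = 1)
b <- rnorm(25, mean = 11, sd = 3)

# Variance of each group's sample mean
va <- var(a) / length(a)
vb <- var(b) / length(b)

# Welch t statistic and Satterthwaite degrees of freedom
t_stat <- (mean(a) - mean(b)) / sqrt(va + vb)
df_w   <- (va + vb)^2 / (va^2 / (length(a) - 1) + vb^2 / (length(b) - 1))
p_val  <- 2 * pt(-abs(t_stat), df = df_w)

# Compare with the built-in Welch test
fit <- t.test(a, b, var.equal = FALSE)
c(by_hand_t = t_stat, builtin_t = unname(fit$statistic))
c(by_hand_df = df_w, builtin_df = unname(fit$parameter))
c(by_hand_p = p_val, builtin_p = fit$p.value)
```

For the wine data, the numeric p-value behind the printed bound is stored in q1ttest$p.value.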

What is the p-value?
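- The p-value is reported as < 0.00000000000000022 (2.2e-16). This is R's display floor for very small p-values, so it is an upper bound and not the exact value.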
What is your conclusion using a type 1 error = 0.05?
- Since the p-value is below 0.00000000000000022, far smaller than the type I error threshold of 0.05, we reject the null hypothesis.
Does your conclusion imply that there is an association between “distribution” and “residual.sugar”?
- Yes. Rejecting the null hypothesis gives statistical evidence that the population mean of the tested variable (density, per the t-test output) differs between Group A and Group B. Because the two groups are defined by residual sugar, this implies an association between density and residual sugar.

Question 2

Produce summary statistics of “residual.sugar” and use its 1st, 2nd, and 3rd quantiles to divide the data into four groups A, B, C, and D. We want to test if the “distribution” in the four groups has the same population mean.

State the null hypothesis
- The null hypothesis states: the population means of residual sugar in the four groups (A, B, C, and D) are equal.

q1 <- quantile(wine$`residual sugar`, 0.25)
q2 <- quantile(wine$`residual sugar`, 0.50)
q3 <- quantile(wine$`residual sugar`, 0.75)
wine$group <- cut(wine$`residual sugar`, 
                breaks = c(-Inf, q1, q2, q3, Inf), 
                labels = c("A", "B", "C", "D"))

#Frequency Table 
table(wine$group)

## 
##   A   B   C   D 
## 464 419 361 355

tapply(wine$`residual sugar`, wine$group, summary)

## $A
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.900   1.600   1.800   1.714   1.900   1.900 
## 
## $B
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.000   2.000   2.100   2.094   2.200   2.200 
## 
## $C
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.250   2.300   2.400   2.437   2.500   2.600 
## 
## $D
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.650   2.825   3.400   4.246   4.675  15.500
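
The group boundaries in the summaries above reflect how cut() treats its breaks: intervals are right-closed by default, so a value equal to a quartile falls in the lower group. A small sketch, using the quartiles 1.9, 2.2, and 2.6 from the summary statistics and a few hand-picked values (the vector `x` is illustrative, not the full data):

```r
# Hand-picked values (illustrative) around the quartile boundaries
x <- c(0.9, 1.9, 2.0, 2.2, 2.25, 2.6, 2.65, 15.5)

# Break points: the 1st quartile, median, and 3rd quartile reported above
brks <- c(-Inf, 1.9, 2.2, 2.6, Inf)

# cut() uses right-closed intervals by default: (-Inf, 1.9], (1.9, 2.2], ...
grp <- cut(x, breaks = brks, labels = c("A", "B", "C", "D"))
data.frame(x, grp)
```

This is why the maximum of each of Groups A, B, and C coincides with a quartile value.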

Use visualization tools to inspect the hypothesis

wine %>% ggplot(mapping = aes(x = `residual sugar`, fill = group)) + 
  geom_density(alpha =0.7) +
  labs(title = "Density Distribution of Residual Sugar by Group") +
  theme_grey() +  # Default
  theme(legend.position = "right")

wine %>% ggplot(mapping = aes(x = group, y = `residual sugar`, fill = group)) +
  geom_boxplot() +
  labs(title = "Boxplot of Residual Sugar by Group",
       x = "Group",
       y = "Residual Sugar")

The mean of Group D is quite different from those of Groups A, B, and C, as the box plot and summary table show.
What test are you going to use?
- To test the null hypothesis, we can use ANOVA (Analysis of Variance), as it is suitable for comparing means across multiple groups.

anova_result <- aov(`residual sugar` ~ group, data = wine)

summary(anova_result)

##               Df Sum Sq Mean Sq F value              Pr(>F)    
## group          3   1437   479.0   439.2 <0.0000000000000002 ***
## Residuals   1595   1740     1.1                                
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

What is the p-value?
- The p-value appears in the Pr(>F) column and is reported as < 0.0000000000000002 (2e-16). It is very small, indicating a systematic difference between the four residual sugar groups.
What is your conclusion using a type 1 error = 0.05?
- Given that the p-value is far below the significance level (type I error rate) of 0.05, we reject the null hypothesis. There is sufficient evidence to conclude that the population means of residual sugar differ across the four groups (A, B, C, and D). Because the groups were formed by binning residual sugar itself, a difference in their means is expected by construction.
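
The F value in the ANOVA table is the ratio of the two Mean Sq entries (between-group over within-group). As a sketch of that calculation, the following reproduces the F statistic by hand on simulated data (the vectors `y` and `g` are made up for illustration and are not the wine data):

```r
# Simulated data (hypothetical): three groups of 20 observations each
set.seed(2)
y <- c(rnorm(20, mean = 5), rnorm(20, mean = 5.5), rnorm(20, mean = 7))
g <- factor(rep(c("A", "B", "C"), each = 20))

# Between-group and within-group sums of squares
grand_mean  <- mean(y)
group_means <- tapply(y, g, mean)
group_n     <- tapply(y, g, length)
ss_between  <- sum(group_n * (group_means - grand_mean)^2)
ss_within   <- sum((y - group_means[as.character(g)])^2)

# F = mean square between / mean square within
df_between <- nlevels(g) - 1
df_within  <- length(y) - nlevels(g)
f_stat <- (ss_between / df_between) / (ss_within / df_within)
p_val  <- pf(f_stat, df_between, df_within, lower.tail = FALSE)

# Compare with aov()
tab <- summary(aov(y ~ g))[[1]]
c(by_hand_F = f_stat, aov_F = tab[["F value"]][1])
c(by_hand_p = p_val, aov_p = tab[["Pr(>F)"]][1])
```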

Question 3

Create a 2 by 4 contingency table using the categories A, B, C, D of “residual.sugar” and the binary variable “excellent” you created in Part B. Note that you have two factors: the categorical levels of “residual.sugar” (A, B, C and D) and an indicator of excellent wines (yes or no).

wine <- wine %>%
  mutate(excellent = ifelse(quality >= 7, 'yes', 'no'))

contingency_table <- table(wine$group, wine$excellent)
print(contingency_table)

##    
##      no yes
##   A 411  53
##   B 367  52
##   C 308  53
##   D 296  59

Use the Chi-square test to test if these two factors are correlated or not;
- Since the p-value of 0.1386 is greater than the significance level of 0.05, we fail to reject the null hypothesis of independence. There is no statistically significant association between the residual sugar categories (A, B, C, D) and excellent wine status (yes/no).
  - For the population from which this sample was drawn, the data therefore do not provide sufficient evidence of an association between residual sugar level and wine excellence.

chi_square_result <- chisq.test(contingency_table)
# summary(chi_square_result)
print(chi_square_result)

## 
##  Pearson's Chi-squared test
## 
## data:  contingency_table
## X-squared = 5.5, df = 3, p-value = 0.1386
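
The statistic and p-value above can be checked directly from the contingency table. The sketch below copies the observed counts from the table, computes the expected counts under independence, and rebuilds the Pearson statistic (the object names are chosen here for illustration):

```r
# Observed counts copied from the contingency table above
obs <- matrix(c(411, 53,
                367, 52,
                308, 53,
                296, 59),
              nrow = 4, byrow = TRUE,
              dimnames = list(group = c("A", "B", "C", "D"),
                              excellent = c("no", "yes")))

# Expected counts under independence: row total * column total / grand total
expected <- outer(rowSums(obs), colSums(obs)) / sum(obs)

# Pearson statistic and p-value on (rows - 1) * (cols - 1) degrees of freedom
x2    <- sum((obs - expected)^2 / expected)
dof   <- (nrow(obs) - 1) * (ncol(obs) - 1)
p_val <- pchisq(x2, df = dof, lower.tail = FALSE)

round(expected, 1)
c(X2 = x2, df = dof, p = p_val)
```

All expected counts are well above 5, so the chi-square approximation is reasonable for this table.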

Use the permutation test to do the same and compare the result to that in (a);
- The permutation test gives p = 0.1409, which is close to the Pearson’s Chi-square p-value of 0.1386. The simulated p-value changes slightly from run to run unless a random seed is set. Both p-values are greater than the significance level of 0.05.

permutation_result <- chisq.test(contingency_table, simulate.p.value = TRUE, B=2000)
# summary(permutation_result)
print(permutation_result)

## 
##  Pearson's Chi-squared test with simulated p-value (based on 2000
##  replicates)
## 
## data:  contingency_table
## X-squared = 5.5, df = NA, p-value = 0.1409
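
chisq.test() with simulate.p.value = TRUE samples tables that keep the observed row and column totals. An explicit label-shuffling version of the same idea is sketched below: it rebuilds one row per wine from the counts in the contingency table, shuffles the excellent labels, and recomputes the statistic each time. The helper chi_stat(), the seed, and the number of shuffles are choices made for this illustration.

```r
# Rebuild one record per wine from the counts in the contingency table above
counts <- matrix(c(411, 53, 367, 52, 308, 53, 296, 59), nrow = 4, byrow = TRUE)
group <- rep(rep(c("A", "B", "C", "D"), times = 2), times = as.vector(counts))
excellent <- rep(c("no", "yes"), times = colSums(counts))

# Hypothetical helper: Pearson chi-square statistic for two label vectors
chi_stat <- function(g, e) {
  unname(suppressWarnings(chisq.test(table(g, e), correct = FALSE)$statistic))
}
observed <- chi_stat(group, excellent)

# Shuffling the excellent labels breaks any link to group
set.seed(123)  # arbitrary seed so the sketch is reproducible
n_perm <- 2000
perm_stats <- replicate(n_perm, chi_stat(group, sample(excellent)))

# Share of shuffled statistics at least as large as the observed one
p_perm <- (sum(perm_stats >= observed) + 1) / (n_perm + 1)
p_perm
```

The resulting p-value should land near the two values reported above, with some run-to-run variation depending on the seed.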

Can you conclude that “residual.sugar” is a significant factor contributing to the excellence of wine? Why?
- Based on the results of both the Chi-square test and the permutation test:
  - Both tests lead to the same conclusion: there is no statistically significant association between the residual sugar categories (A, B, C, D) and the classification of wines as excellent (yes/no).
  - Therefore, we cannot conclude that residual sugar is a significant factor contributing to the excellence of wine. Neither test shows evidence that the level of residual sugar changes the likelihood of a wine being classified as excellent.

Assignment 4 - The Study of Wine Quality - Part C

Thuong Ho

2024-10-11
