Wine Project Part C

Question 1

1. Produce summary statistics of “residual.sugar” and use its median to divide the data into two groups A and B. We want to test if “density” in Group A and Group B has the same population mean. Please answer the following questions.

library(tidyverse)

library(dplyr)
wine <- read.csv("winequality-red.csv")
attach(wine)

summary(wine$residual.sugar)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.900   1.900   2.200   2.539   2.600  15.500

median_sugar <- median(wine$residual.sugar)
residual.sugar.group <- ifelse(wine$residual.sugar > median_sugar, 'B', 'A')

a. State the null hypothesis.

Ans. Null hypothesis: the population means of ‘density’ in Groups A and B are equal.

df <- data.frame(residual.sugar, residual.sugar.group, density)
head(df)

##   residual.sugar residual.sugar.group density
## 1            1.9                    A  0.9978
## 2            2.6                    B  0.9968
## 3            2.3                    B  0.9970
## 4            1.9                    A  0.9980
## 5            1.9                    A  0.9978
## 6            1.8                    A  0.9978

b. Use visualization tools to inspect the hypothesis. Do you think the hypothesis is right or not?

Ans. We used a boxplot to inspect the hypothesis. In the plot, the density distributions of Groups A and B appear to differ, which suggests that the null hypothesis does not hold. A plot alone cannot settle this, so we confirm it with a formal test.

boxplot(wine$density ~ residual.sugar.group, main="Density by Group (A and B)")

t_test <- t.test(density ~ residual.sugar.group, data = df, alternative = "two.sided")

t_test

## 
##  Welch Two Sample t-test
## 
## data:  density by residual.sugar.group
## t = -14.697, df = 1365.2, p-value < 2.2e-16
## alternative hypothesis: true difference in means between group A and group B is not equal to 0
## 95 percent confidence interval:
##  -0.001513022 -0.001156687
## sample estimates:
## mean in group A mean in group B 
##       0.9961490       0.9974838

c. What test are you going to use?

Ans. We use a two-sample t-test (Welch's version, which is the default of t.test), because we are comparing the means of two independent groups.
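The choice between the Welch test and the pooled-variance test can be sketched as follows. This is a minimal illustration on simulated stand-in data, not the wine data; the data frame `sim` and its columns are made up for the example.

```r
# Simulated stand-in data (not the wine data): two groups with unequal spread.
set.seed(42)
sim <- data.frame(
  value = c(rnorm(80, mean = 10, sd = 1), rnorm(60, mean = 10.5, sd = 2)),
  group = rep(c("A", "B"), times = c(80, 60))
)

# Default: Welch test, which does not assume equal variances.
welch <- t.test(value ~ group, data = sim)

# Classical pooled-variance (Student) test, for comparison.
pooled <- t.test(value ~ group, data = sim, var.equal = TRUE)

welch$parameter   # Welch-Satterthwaite degrees of freedom (usually non-integer)
pooled$parameter  # n1 + n2 - 2 degrees of freedom
```

The Welch test is usually the safer default when the two groups have different sizes or spreads, as Groups A and B do here.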

d. What is the p-value?

Ans. The p-value is less than 2.2e-16.
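The printed "< 2.2e-16" is a display threshold rather than the exact value. A small sketch on simulated data (the vectors `x` and `y` are made up for the example) shows how the stored p-value could be read from the test object:

```r
# Simulated data with a large mean difference, so the p-value is tiny.
set.seed(1)
x <- rnorm(500, mean = 0)
y <- rnorm(500, mean = 1)
res <- t.test(x, y)

res$p.value               # numeric p-value stored in the result
.Machine$double.eps       # about 2.2e-16, the threshold used when printing
format.pval(res$p.value)  # abbreviated form for very small p-values
```

In the report itself, the same value would be available as `t_test$p.value`.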

e. What is your conclusion?

Ans. The p-value is less than 0.05, so we reject the null hypothesis. There is a significant difference between the mean density of Group A and that of Group B.

f. Does your conclusion imply that there is an association between “density” and “residual.sugar”?

Ans. Yes. Group B, which has the higher residual sugar, also has the higher mean density (0.9975 versus 0.9961), and the difference is statistically significant. Mean density therefore changes with the level of residual.sugar, so we can say that there is an association between ‘density’ and ‘residual.sugar’.
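The association can also be checked without splitting the data into groups. The sketch below uses simulated values, with hypothetical vectors `sugar` and `dens` standing in for the two wine columns, and is only meant to illustrate the calls:

```r
# Simulated stand-ins: a right-skewed predictor and a response that rises with it.
set.seed(7)
sugar <- rexp(300, rate = 0.4) + 0.9
dens  <- 0.996 + 0.0004 * sugar + rnorm(300, sd = 0.0015)

# Pearson correlation tests for a linear association.
pearson_res <- cor.test(sugar, dens)
pearson_res

# Spearman's rank correlation is less sensitive to skew and outliers.
spearman_res <- cor.test(sugar, dens, method = "spearman", exact = FALSE)
spearman_res
```

On the wine data the equivalent call would be `cor.test(wine$residual.sugar, wine$density)`.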

density_A <- df %>%
  filter(residual.sugar.group == 'A')

density_B <- df %>%
  filter(residual.sugar.group == 'B')

summary(density_A)

##  residual.sugar  residual.sugar.group    density      
##  Min.   :0.900   Length:883           Min.   :0.9901  
##  1st Qu.:1.800   Class :character     1st Qu.:0.9952  
##  Median :1.900   Mode  :character     Median :0.9962  
##  Mean   :1.894                        Mean   :0.9961  
##  3rd Qu.:2.100                        3rd Qu.:0.9971  
##  Max.   :2.200                        Max.   :1.0008

summary(density_B)

##  residual.sugar   residual.sugar.group    density      
##  Min.   : 2.250   Length:716           Min.   :0.9902  
##  1st Qu.: 2.400   Class :character     1st Qu.:0.9963  
##  Median : 2.600   Mode  :character     Median :0.9975  
##  Mean   : 3.334                        Mean   :0.9975  
##  3rd Qu.: 3.400                        3rd Qu.:0.9987  
##  Max.   :15.500                        Max.   :1.0037

Question 2

2. Produce summary statistics of “residual.sugar” and use its 1st, 2nd, and 3rd quantiles to divide the data into four groups A, B, C, and D. We want to test if “density” in the four groups has the same population mean. Please answer the following questions.

# Quartile cut points (1.9, 2.2, 2.6) are taken from summary(wine$residual.sugar) above.
residual.sugar.group_1 <- character(length(residual.sugar))

for (i in seq_along(residual.sugar)) {
  if (residual.sugar[i] <= 1.9) residual.sugar.group_1[i] <- "A"
  else if (residual.sugar[i] <= 2.2) residual.sugar.group_1[i] <- "B"
  else if (residual.sugar[i] <= 2.6) residual.sugar.group_1[i] <- "C"
  else residual.sugar.group_1[i] <- "D"
}
table(residual.sugar.group_1)

## residual.sugar.group_1
##   A   B   C   D 
## 464 419 361 355
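The same grouping could be produced without typing the cut points by hand. The following is a sketch on simulated data (the vector `x` is made up), assuming the quartiles are distinct:

```r
# Simulated skewed values, rounded to one decimal to create ties.
set.seed(3)
x <- round(rexp(1000, rate = 0.5) + 0.9, 1)

# Quartile cut points computed from the data.
breaks <- quantile(x, probs = c(0, 0.25, 0.5, 0.75, 1))

# Intervals are right-closed, matching "<=" comparisons;
# include.lowest = TRUE keeps the minimum value in group A.
grp <- cut(x, breaks = breaks, labels = c("A", "B", "C", "D"),
           include.lowest = TRUE)
table(grp)
```

Because of ties at the cut points, the four groups are generally not of equal size, which is also why the counts above differ.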

a. State the null hypothesis

Ans. Null hypothesis: the population means of ‘density’ in Groups A, B, C and D are all equal.

b. Use visualization tools to inspect the hypothesis. Do you think the hypothesis is right or not?

Ans. We used a boxplot of density by group to inspect the hypothesis visually.

boxplot(density ~ residual.sugar.group_1, main="Density by Quartile Group")

c. What test are you going to use?

Ans. We use a one-way Analysis of Variance (ANOVA), because we are comparing the means of more than two groups.

summary(aov(density ~ residual.sugar.group_1))

##                          Df   Sum Sq   Mean Sq F value Pr(>F)    
## residual.sugar.group_1    3 0.000996 0.0003321   112.8 <2e-16 ***
## Residuals              1595 0.004696 0.0000029                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

d. What is the p-value?

Ans. The p-value is less than 2e-16.

e. What is your conclusion?

Ans. The ANOVA p-value is far below 0.05, and the boxplot points the same way, so we reject the null hypothesis. We conclude that the population mean of density is not the same in all of Groups A, B, C and D; at least one group mean differs from the others.
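ANOVA only says that the means are not all equal. One common follow-up for seeing which pairs of groups differ is Tukey's HSD. The sketch below runs it on simulated data, with a made-up data frame `sim`; it is not part of the original analysis:

```r
# Simulated four-group data with increasing means.
set.seed(11)
sim <- data.frame(
  group = factor(rep(c("A", "B", "C", "D"), each = 50)),
  y     = rnorm(200, mean = rep(c(0, 0.2, 0.4, 1.0), each = 50))
)

fit <- aov(y ~ group, data = sim)
summary(fit)

# Pairwise differences in means with family-wise 95% confidence intervals.
tk <- TukeyHSD(fit)
tk
```

With four groups there are six pairwise comparisons, each reported with an adjusted p-value.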

f. Does your conclusion imply that there is an association between “density” and “residual.sugar”? Compare your result here with that in Question 1. Do you think increasing the number of groups help identify the association? Would you consider dividing the data into 10 groups so as to help the discovery of the association? Why?

Ans. Yes, the conclusion implies that there is an association between “density” and “residual.sugar”.

In Question 1 we observed a significant difference in mean density between two groups, and here the four groups also have significantly different means. Both results are consistent with density being associated with residual sugar, and the group means in Question 1 suggest that the relationship is positive.

Increasing the number of groups can help detect non-linear trends and allows a more detailed analysis, but each group then contains fewer observations, which makes the group means less precise and can reduce statistical power.

Dividing the data into 10 groups is an option, but since this would result in smaller group sizes, we need to ensure that each group has enough data to yield meaningful statistics.
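The trade-off can be explored with a small experiment. The sketch below uses simulated data (hypothetical vectors `sugar` and `dens`, and a helper `split_test` written for this example) and splits it into 2, 4 and 10 quantile groups:

```r
# Simulated stand-ins for the two wine columns.
set.seed(5)
n <- 1600
sugar <- rexp(n, rate = 0.5) + 0.9
dens  <- 0.996 + 0.0004 * sugar + rnorm(n, sd = 0.0015)

# Split into k quantile groups and run a one-way ANOVA on the grouping.
split_test <- function(k) {
  breaks <- quantile(sugar, probs = seq(0, 1, length.out = k + 1))
  grp <- cut(sugar, breaks = breaks, include.lowest = TRUE)
  tab <- summary(aov(dens ~ grp))[[1]]
  c(groups   = k,
    min_size = min(table(grp)),
    F        = tab[["F value"]][1],
    p        = tab[["Pr(>F)"]][1])
}

res <- t(sapply(c(2, 4, 10), split_test))
res
```

The `min_size` column shows how quickly the groups shrink as their number grows. With real data that has many tied values, the quantile breaks may not be unique, in which case `cut` would stop with an error and the groups would have to be defined differently.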

Question 3

Create a 2 by 4 contingency table using the categories A, B, C, D of “residual.sugar” and the binary variable “excellent” you created in Part B. Note that you have two factors: the categorical levels of “residual.sugar” (A, B, C and D) and an indicator of excellent wines (yes or no).

wine$excellent <- ifelse(wine$quality >= 7, 'Yes', 'No')

ctable <- table(wine$excellent, residual.sugar.group_1)

ctable

##      residual.sugar.group_1
##         A   B   C   D
##   No  411 367 308 296
##   Yes  53  52  53  59

a. Use the Chi-square test to test if these two factors are correlated or not

Ans. Null hypothesis: the residual sugar category (A, B, C or D) and wine excellence (‘Yes’ or ‘No’) are independent. The p-value (0.1386) is greater than the significance level of 0.05, so we cannot reject the null hypothesis. The Chi-square test therefore gives no evidence of an association between residual sugar and wine excellence.

Xsq <- chisq.test(ctable)
Xsq

## 
##  Pearson's Chi-squared test
## 
## data:  ctable
## X-squared = 5.5, df = 3, p-value = 0.1386

Xsq$observed

##      residual.sugar.group_1
##         A   B   C   D
##   No  411 367 308 296
##   Yes  53  52  53  59

Xsq$expected

##      residual.sugar.group_1
##               A         B         C         D
##   No  401.03064 362.13759 312.00876 306.82301
##   Yes  62.96936  56.86241  48.99124  48.17699

Xsq$residuals

##      residual.sugar.group_1
##                A          B          C          D
##   No   0.4978269  0.2555143 -0.2269479 -0.6178802
##   Yes -1.2563264 -0.6448212  0.5727305  1.5592955
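The expected counts and residuals above can be reproduced by hand, which makes the test statistic easier to read. This sketch retypes the observed counts from the table above into a matrix named `counts` (a name chosen for the example):

```r
# Observed 2 x 4 table, copied from the output above.
counts <- matrix(c(411, 53, 367, 52, 308, 53, 296, 59), nrow = 2,
                 dimnames = list(excellent = c("No", "Yes"),
                                 group = c("A", "B", "C", "D")))

# Expected count under independence: row total * column total / grand total.
expected <- outer(rowSums(counts), colSums(counts)) / sum(counts)

# Pearson residual: (observed - expected) / sqrt(expected).
pearson <- (counts - expected) / sqrt(expected)

# The chi-squared statistic is the sum of squared Pearson residuals.
stat <- sum(pearson^2)
p    <- pchisq(stat, df = (2 - 1) * (4 - 1), lower.tail = FALSE)
c(statistic = stat, p.value = p)
```

The largest residuals are in the ‘Yes’ row for groups A and D, but none is large enough to make the overall test significant.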

b. Use the permutation test to do the same and compare the result to that in (a)

chisq.test(ctable, simulate.p.value = TRUE)

## 
##  Pearson's Chi-squared test with simulated p-value (based on 2000
##  replicates)
## 
## data:  ctable
## X-squared = 5.5, df = NA, p-value = 0.1514

The permutation test gives a p-value of 0.1514, close to the 0.1386 from the Chi-square test. The simulated p-value changes slightly from run to run because it is based on 2000 random replicates. Both p-values are greater than 0.05, so we cannot reject the null hypothesis. In this run the permutation p-value is slightly higher than the Chi-square one.
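For illustration, a permutation test can also be written out by shuffling labels directly. The sketch below is one possible version, not the method used by chisq.test internally; it rebuilds the table from the counts shown earlier and fixes an arbitrary seed so the result is repeatable. The helper `chisq_stat` is written for this example.

```r
# Observed 2 x 4 table, copied from the output above.
counts <- matrix(c(411, 53, 367, 52, 308, 53, 296, 59), nrow = 2,
                 dimnames = list(excellent = c("No", "Yes"),
                                 group = c("A", "B", "C", "D")))

# Expand the table to one entry per wine so the labels can be shuffled.
long      <- as.data.frame(as.table(counts))
excellent <- rep(long$excellent, times = long$Freq)
group     <- rep(long$group, times = long$Freq)

chisq_stat <- function(x, g) unname(chisq.test(table(x, g))$statistic)
observed   <- chisq_stat(excellent, group)

# Shuffling the excellent labels breaks any link with the sugar group.
set.seed(2024)   # arbitrary seed, only for reproducibility
n_perm <- 2000
perm   <- replicate(n_perm, chisq_stat(sample(excellent), group))

# Share of shuffled tables at least as extreme as the observed one.
p_perm <- (sum(perm >= observed) + 1) / (n_perm + 1)
p_perm
```

Setting a seed, or raising the number of replicates through the `B` argument of chisq.test, would make the simulated p-value in the report more stable between runs.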

c. Can you conclude that “residual.sugar” is a significant factor contributing to the excellence of wine? Why?

We cannot conclude that residual.sugar is a significant factor contributing to the excellence of wine. Neither the Chi-square test nor the permutation test rejects the null hypothesis of independence, so the data give no evidence that excellence depends on the residual sugar group.

Moyuri Sarkar

2024-10-14