1. Produce summary statistics of “residual.sugar” and use its median to divide the data into two groups A and B. We want to test if “density” in Group A and Group B has the same population mean. Please answer the following questions.
library(tidyverse)
library(dplyr)
wine <- read.csv("winequality-red.csv")
attach(wine)
summary(wine$residual.sugar)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.900 1.900 2.200 2.539 2.600 15.500
median_sugar <- median(wine$residual.sugar)
residual.sugar.group <- ifelse(wine$residual.sugar > median_sugar, 'B', 'A')
a. State the null hypothesis.
Ans. Null hypothesis: the population mean of ‘density’ is the same in groups A and B. The alternative is that the two means differ.
df <- data.frame(residual.sugar, residual.sugar.group, density)
head(df)
## residual.sugar residual.sugar.group density
## 1 1.9 A 0.9978
## 2 2.6 B 0.9968
## 3 2.3 B 0.9970
## 4 1.9 A 0.9980
## 5 1.9 A 0.9978
## 6 1.8 A 0.9978
b. Use visualization tools to inspect the hypothesis. Do you think the hypothesis is right or not?
Ans. We used a boxplot to inspect the hypothesis visually. The plot suggests that the distribution of density differs between groups A and B, so the null hypothesis looks doubtful. A plot alone cannot settle this, so a formal test is still needed.
boxplot(wine$density ~ residual.sugar.group, main="Density by Group (A and B)")
t_test <- t.test(wine$density ~ df$residual.sugar.group, data = wine, alternative = "two.sided")
t_test
##
## Welch Two Sample t-test
##
## data: wine$density by df$residual.sugar.group
## t = -14.697, df = 1365.2, p-value < 2.2e-16
## alternative hypothesis: true difference in means between group A and group B is not equal to 0
## 95 percent confidence interval:
## -0.001513022 -0.001156687
## sample estimates:
## mean in group A mean in group B
## 0.9961490 0.9974838
c. What test are you going to use?
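Ans. We use a two-sample t-test, because we are comparing the mean of a numeric variable between two independent groups. R's t.test() runs the Welch version by default, which does not assume equal variances in the two groups.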
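As a minimal sketch of what the Welch default changes, the following compares it with the pooled-variance (Student) version on simulated data. The vectors `g1` and `g2` and their standard deviations are made up for illustration; only the group sizes and approximate means are borrowed from the output above, and this is not the wine data itself.

```r
# Simulated example (not the wine data): two groups with unequal
# sizes and unequal variances. The sd values are assumptions.
set.seed(1)
g1 <- rnorm(883, mean = 0.9961, sd = 0.0017)
g2 <- rnorm(716, mean = 0.9975, sd = 0.0020)

# Welch test: the default in t.test(), does not assume equal variances.
welch <- t.test(g1, g2)

# Student test: pools the two variances (var.equal = TRUE).
pooled <- t.test(g1, g2, var.equal = TRUE)

# Welch uses fractional degrees of freedom; the pooled test uses n1 + n2 - 2.
c(welch = unname(welch$parameter), pooled = unname(pooled$parameter))
c(welch = welch$p.value, pooled = pooled$p.value)
```

When the group variances are similar, the two versions give nearly the same answer, so using Welch costs little.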
d. What is the p-value?
Ans. The p-value is less than 2.2e-16.
e. What is your conclusion?
Ans. The p-value is far below 0.05, so we reject the null hypothesis. There is a statistically significant difference in mean density between groups A and B (0.9961 versus 0.9975).
f. Does your conclusion imply that there is an association between “density” and “residual.sugar”?
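Ans. Yes. Group B, which has the higher residual sugar, also has the higher mean density (0.9975 versus 0.9961 in group A), so density changes with residual.sugar. This indicates an association between ‘density’ and ‘residual.sugar’, although a two-group comparison does not describe the form of the relationship.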
density_A <- df %>%
  filter(residual.sugar.group == 'A')
density_B <- df %>%
  filter(residual.sugar.group == 'B')
summary(density_A)
## residual.sugar residual.sugar.group density
## Min. :0.900 Length:883 Min. :0.9901
## 1st Qu.:1.800 Class :character 1st Qu.:0.9952
## Median :1.900 Mode :character Median :0.9962
## Mean :1.894 Mean :0.9961
## 3rd Qu.:2.100 3rd Qu.:0.9971
## Max. :2.200 Max. :1.0008
summary(density_B)
## residual.sugar residual.sugar.group density
## Min. : 2.250 Length:716 Min. :0.9902
## 1st Qu.: 2.400 Class :character 1st Qu.:0.9963
## Median : 2.600 Mode :character Median :0.9975
## Mean : 3.334 Mean :0.9975
## 3rd Qu.: 3.400 3rd Qu.:0.9987
## Max. :15.500 Max. :1.0037
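The association in 1(f) can also be checked on the raw values, without grouping. The sketch below uses Spearman's rank correlation on simulated data; `sugar`, `dens`, and the numbers that generate them are hypothetical stand-ins, not the wine data.

```r
# Simulated, right-skewed "sugar" values and a "density" that rises
# slightly with sugar. All parameters here are assumptions.
set.seed(42)
n <- 500
sugar <- rexp(n, rate = 1 / 1.6) + 0.9
dens <- 0.995 + 0.0005 * sugar + rnorm(n, sd = 0.0015)

# Spearman's rank correlation measures a monotone association and is
# not thrown off by the long right tail of sugar.
res <- cor.test(sugar, dens, method = "spearman", exact = FALSE)
res$estimate  # sign gives the direction of the association
res$p.value
```

For the wine data the analogous call would be `cor.test(wine$residual.sugar, wine$density, method = "spearman")`, which gives the direction of the association directly.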
2. Produce summary statistics of “residual.sugar” and use its 1st, 2nd, and 3rd quartiles to divide the data into four groups A, B, C, and D. We want to test if “density” in the four groups has the same population mean. Please answer the following questions.
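# Cut points are the 1st quartile, median, and 3rd quartile of
# residual.sugar, taken from the summary() output in Question 1.
residual.sugar.group_1 <- character(length(residual.sugar))
for (i in seq_along(residual.sugar)) {
  if (residual.sugar[i] <= 1.900) residual.sugar.group_1[i] <- "A"
  else if (residual.sugar[i] <= 2.200) residual.sugar.group_1[i] <- "B"
  else if (residual.sugar[i] <= 2.600) residual.sugar.group_1[i] <- "C"
  else residual.sugar.group_1[i] <- "D"
}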
table(residual.sugar.group_1)
## residual.sugar.group_1
## A B C D
## 464 419 361 355
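The same grouping can be written without a loop. The sketch below uses `quantile()` and `cut()` on a simulated vector `x` (hypothetical values, not the wine data), so the cut points come from the data instead of being typed by hand.

```r
# Hypothetical skewed values rounded to one decimal, so ties occur.
set.seed(7)
x <- round(rexp(400, rate = 1 / 1.5) + 0.9, 1)

# Quartile cut points computed from the data.
q <- quantile(x, probs = c(0.25, 0.50, 0.75))

# right = TRUE (the default) makes each interval closed on the right,
# matching the "<=" comparisons in the loop above.
grp <- cut(x, breaks = c(-Inf, q, Inf), labels = c("A", "B", "C", "D"))

table(grp)
```

Because of ties at the cut points, the four groups are generally not of equal size, as the counts above also show for the wine data. `cut()` also stops with an error if two cut points coincide, so this approach assumes the three quartiles are distinct.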
a. State the null hypothesis
Ans. Null hypothesis: the population mean of ‘density’ is the same in all four groups A, B, C and D.
b. Use visualization tools to inspect the hypothesis. Do you think the hypothesis is right or not?
Ans. We used a boxplot of density by group to inspect the hypothesis visually.
boxplot(density ~ residual.sugar.group_1, main="Density by Quartile Group")
c. What test are you going to use?
Ans. We use a one-way analysis of variance (ANOVA), because we are comparing means across more than two groups.
summary(aov(density ~ residual.sugar.group_1))
## Df Sum Sq Mean Sq F value Pr(>F)
## residual.sugar.group_1 3 0.000996 0.0003321 112.8 <2e-16 ***
## Residuals 1595 0.004696 0.0000029
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
d. What is the p-value?
Ans. The p-value is less than 2e-16.
e. What is your conclusion?
Ans. The boxplot and the ANOVA point the same way. The p-value is far below 0.05, so we reject the null hypothesis. We conclude that the population mean of density is not the same in all four groups, that is, at least one group mean differs from the others.
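ANOVA rejects the hypothesis that all means are equal but does not say which groups differ. One common follow-up is Tukey's HSD, sketched here on simulated data; the group means in `mu` are invented for the example and are not estimates from the wine data.

```r
# Simulated example: four groups of 50 with hypothetical means.
set.seed(3)
grp <- factor(rep(c("A", "B", "C", "D"), each = 50))
mu <- c(A = 0, B = 0, C = 0.75, D = 1.5)  # A and B equal; C and D shifted
y <- unname(mu[as.character(grp)]) + rnorm(length(grp))

fit <- aov(y ~ grp)
summary(fit)

# Tukey's HSD compares every pair of groups while controlling the
# family-wise error rate.
tuk <- TukeyHSD(fit)
tuk$grp[, c("diff", "p adj")]
```

For the wine data the analogous call would be `TukeyHSD(aov(density ~ factor(residual.sugar.group_1)))`, where `factor()` makes the grouping variable an explicit factor.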
f. Does your conclusion imply that there is an association between “density” and “residual.sugar”? Compare your result here with that in Question 1. Do you think increasing the number of groups help identify the association? Would you consider dividing the data into 10 groups so as to help the discovery of the association? Why?
Ans. Yes, the conclusion implies that there is an association between “density” and “residual.sugar”.
In Question 1, the mean density differed significantly between two groups. Here the four group means also differ significantly, which again points to an association between density and residual sugar. ANOVA by itself does not give the direction of the association; Question 1 suggests it is positive, since the higher-sugar group had the higher mean density.
Increasing the number of groups can help detect non-linear trends and allows a more detailed analysis, but each group then contains fewer observations, which makes the group means less precise and can reduce statistical power.
Dividing the data into 10 groups is an option, but with 1599 observations it would leave roughly 160 wines per group, so we need to check that each group still has enough data to give meaningful statistics.
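The trade-off between the number of groups and the group size can be explored with a small simulation. The sketch below is only illustrative: `x`, `y`, the slope 0.3, and the helper `anova_p()` are all hypothetical, and the data are simulated, not the wine data.

```r
# Simulated data with a weak linear association between x and y.
set.seed(11)
n <- 1600
x <- rexp(n)
y <- 0.3 * x + rnorm(n)

# Split x into k quantile-based groups and run a one-way ANOVA on y.
anova_p <- function(k) {
  breaks <- quantile(x, probs = seq(0, 1, length.out = k + 1))
  g <- cut(x, breaks = breaks, include.lowest = TRUE)
  fit <- aov(y ~ g)
  c(groups = k,
    min_group_size = min(table(g)),
    p_value = summary(fit)[[1]][["Pr(>F)"]][1])
}

res <- t(sapply(c(2, 4, 10), anova_p))
res
```

Each row shows how the smallest group shrinks as the number of groups grows. With tied values, as in residual.sugar, the groups cannot be made exactly equal in size.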
Create a 2 by 4 contingency table using the categories A, B, C, D of “residual.sugar” and the binary variable “excellent” you created in Part B. Note that you have two factors: the categorical levels of “residual.sugar” (A, B, C and D) and an indicator of excellent wines (yes or no).
wine$excellent <- ifelse(wine$quality >= 7, 'Yes', 'No')
ctable <- table(wine$excellent, residual.sugar.group_1)
ctable
## residual.sugar.group_1
## A B C D
## No 411 367 308 296
## Yes 53 52 53 59
a. Use the Chi-square test to test if these two factors are correlated or not
Ans. Null hypothesis: the residual sugar category (A, B, C or D) and wine excellence (‘Yes’ or ‘No’) are independent. The p-value (0.1386) is greater than the significance level of 0.05, so we cannot reject the null hypothesis. Based on the Chi-square test, there is no significant evidence of an association between residual sugar and wine excellence.
Xsq <- chisq.test(ctable)
Xsq
##
## Pearson's Chi-squared test
##
## data: ctable
## X-squared = 5.5, df = 3, p-value = 0.1386
Xsq$observed
## residual.sugar.group_1
## A B C D
## No 411 367 308 296
## Yes 53 52 53 59
Xsq$expected
## residual.sugar.group_1
## A B C D
## No 401.03064 362.13759 312.00876 306.82301
## Yes 62.96936 56.86241 48.99124 48.17699
Xsq$residuals
## residual.sugar.group_1
## A B C D
## No 0.4978269 0.2555143 -0.2269479 -0.6178802
## Yes -1.2563264 -0.6448212 0.5727305 1.5592955
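To see where these numbers come from, the expected counts, Pearson residuals, and test statistic can be recomputed by hand from the observed table. This is a sketch that copies the counts from the contingency table above; the object names are chosen for the example.

```r
# Observed counts copied from the contingency table above.
obs <- matrix(c(411, 367, 308, 296,
                 53,  52,  53,  59),
              nrow = 2, byrow = TRUE,
              dimnames = list(excellent = c("No", "Yes"),
                              group = c("A", "B", "C", "D")))

# Expected count under independence: row total * column total / grand total.
expected <- outer(rowSums(obs), colSums(obs)) / sum(obs)

# Pearson residuals and the chi-square statistic.
pearson <- (obs - expected) / sqrt(expected)
stat <- sum(pearson^2)
dof <- (nrow(obs) - 1) * (ncol(obs) - 1)
p_value <- pchisq(stat, df = dof, lower.tail = FALSE)

round(expected, 2)
c(statistic = stat, df = dof, p_value = p_value)
```

The largest residuals are in the ‘Yes’ row for groups A and D, but none is large enough to make the overall test significant.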
b. Use the permutation test to do the same and compare the result to that in (a)
chisq.test(ctable, simulate.p.value = TRUE)
##
## Pearson's Chi-squared test with simulated p-value (based on 2000
## replicates)
##
## data: ctable
## X-squared = 5.5, df = NA, p-value = 0.1514
The permutation test gives a p-value of 0.1514 and the Chi-square test gives 0.1386, so the two are close. Because the permutation p-value is simulated, it changes slightly from run to run. Both p-values are greater than 0.05, so we cannot reject the null hypothesis. The p-value from the permutation test is slightly higher than the one from the Chi-square test.
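The idea behind the simulated p-value can be made explicit by shuffling labels by hand. The sketch below rebuilds one row per wine from the counts in the contingency table above, then permutes the ‘excellent’ labels; `chisq_stat()`, the seed, and the number of shuffles are choices made for this example.

```r
# Rebuild one row per wine from the counts in the contingency table.
counts <- matrix(c(411, 367, 308, 296,
                    53,  52,  53,  59), nrow = 2, byrow = TRUE)
excellent <- rep(rep(c("No", "Yes"), times = 4), times = as.vector(counts))
group <- rep(rep(c("A", "B", "C", "D"), each = 2), times = as.vector(counts))

# Chi-square statistic for a pair of label vectors.
chisq_stat <- function(a, b) {
  obs <- table(a, b)
  expected <- outer(rowSums(obs), colSums(obs)) / sum(obs)
  sum((obs - expected)^2 / expected)
}

observed_stat <- chisq_stat(excellent, group)

# Shuffle the "excellent" labels to break any link with the groups,
# and recompute the statistic each time.
set.seed(2024)
n_perm <- 2000
perm_stats <- replicate(n_perm, chisq_stat(sample(excellent), group))

# Share of shuffles at least as extreme as the observed statistic
# (the +1 counts the observed arrangement itself).
p_perm <- (sum(perm_stats >= observed_stat) + 1) / (n_perm + 1)
p_perm
```

Shuffling keeps the row and column totals fixed, which is also what `chisq.test()` does when it simulates tables, so the two p-values should agree up to simulation noise.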
c. Can you conclude that “residual.sugar” is a significant factor contributing to the excellence of wine? Why?
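We cannot conclude that residual.sugar is a significant factor contributing to the excellence of wine. Neither the Chi-square test nor the permutation test rejects the null hypothesis of independence, so the data show no significant evidence of an association. This is not proof that the two are unrelated, and even a significant association would not by itself show that residual sugar contributes to excellence.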