We first look at the Red Wine data set before answering the questions
Summary Statistics for residual sugar
summary(winequality$residual.sugar)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.900 1.900 2.200 2.539 2.600 15.500
Now we split the wines at the median residual sugar (2.2) into two groups: A (below the median) and B (at or above the median)
## winequality.residual.sugar residual.sugar.group winequality.density
## 1 1.90 A 0.9978
## 2 2.60 B 0.9968
## 3 2.30 B 0.9970
## 4 1.90 A 0.9980
## 5 1.90 A 0.9978
## 6 1.80 A 0.9978
## 7 1.60 A 0.9964
## 8 1.20 A 0.9946
## 9 2.00 A 0.9968
## 10 6.10 B 0.9978
## 11 1.80 A 0.9959
## 12 6.10 B 0.9978
## 13 1.60 A 0.9943
## 14 1.60 A 0.9974
## 15 3.80 B 0.9986
## 16 3.90 B 0.9986
## 17 1.80 A 0.9969
## 18 1.70 A 0.9968
## 19 4.40 B 0.9974
## 20 1.80 A 0.9969
## 21 1.80 A 0.9968
## 22 2.30 B 0.9982
## 23 1.60 A 0.9966
## 24 2.30 B 0.9968
## 25 2.40 B 0.9968
## 26 1.40 A 0.9955
## 27 1.80 A 0.9962
## 28 1.60 A 0.9966
## 29 1.90 A 0.9972
## 30 2.00 A 0.9964
## 31 2.40 B 0.9958
## 32 2.50 B 0.9966
## 33 2.30 B 0.9966
## 34 10.70 B 0.9993
## 35 1.80 A 0.9957
## 36 5.50 B 0.9986
## 37 2.40 B 0.9975
## 38 2.10 A 0.9968
## 39 1.50 A 0.9940
## 40 5.90 B 0.9978
## 41 5.90 B 0.9978
## 42 2.80 B 0.9976
## 43 2.60 B 0.9968
## 44 2.20 B 0.9968
## 45 1.80 A 0.9962
## 46 2.10 A 0.9934
## 47 2.20 B 0.9970
## 48 1.60 A 0.9969
## 49 1.60 A 0.9958
## 50 1.40 A 0.9954
## 51 1.70 A 0.9971
## 52 2.20 B 0.9956
## 53 2.10 A 0.9955
## 54 3.00 B 0.9970
## 55 2.80 B 0.9955
## 56 3.80 B 0.9978
## 57 3.40 B 0.9971
## 58 5.10 B 0.9983
## 59 2.30 B 0.9975
## 60 2.40 B 0.9962
## 61 2.20 B 0.9980
## 62 1.80 A 0.9968
## 63 1.90 A 0.9968
## 64 2.00 A 0.9966
## 65 4.65 B 0.9962
## 66 4.65 B 0.9962
## 67 1.50 A 0.9968
## 68 1.60 A 0.9962
## 69 2.00 A 0.9969
## 70 1.90 A 0.9962
## 71 1.90 A 0.9967
## 72 2.10 A 0.9962
## 73 1.90 A 0.9961
## 74 2.10 A 0.9976
## 75 2.50 B 0.9984
## 76 2.20 B 0.9986
## 77 2.20 B 0.9986
## 78 2.40 B 0.9966
## 79 2.00 A 0.9958
## 80 1.50 A 0.9972
## 81 1.60 A 0.9958
## 82 1.90 A 0.9974
## 83 2.00 A 0.9970
## 84 1.80 A 0.9969
## 85 1.80 A 0.9959
## 86 2.20 B 0.9961
## 87 1.90 A 0.9972
## 88 1.90 A 0.9966
## 89 2.10 A 0.9978
## 90 1.80 A 0.9978
## 91 1.90 A 0.9964
## 92 1.90 A 0.9972
## 93 2.00 A 0.9972
## 94 1.90 A 0.9966
## 95 1.40 A 0.9938
## 96 2.30 B 0.9932
## 97 3.00 B 0.9965
## 98 2.00 A 0.9963
## 99 2.50 B 0.9967
## 100 1.90 A 0.9972
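The code that built the two groups is not shown above. A minimal sketch of one way to do it, assuming group A means residual sugar below 2.2 and group B means 2.2 or higher (this matches the printed rows, where 2.10 is A and 2.20 is B). The names `sugar`, `cutoff` and `group` are illustrative, and the values are a few rows copied from the table:

```r
# Residual sugar values copied from rows 1, 2, 3, 44, 38 and 10 of the table above
sugar <- c(1.90, 2.60, 2.30, 2.20, 2.10, 6.10)

# Assumed rule: below the median (2.2) is group A, otherwise group B
cutoff <- 2.2
group <- ifelse(sugar < cutoff, "A", "B")

print(data.frame(sugar, group))
```

On the full data the same rule would presumably read `residual.sugar.group <- ifelse(winequality$residual.sugar < 2.2, "A", "B")`.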
A. State the Null Hypothesis –The mean density of group A equals the mean density of group B (no significant difference).
boxplot(winequality$density ~ residual.sugar.group)
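The boxplot suggests that density differs between the two groups, so we expect to reject the null hypothesis, but this has to be confirmed with a formal test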
What test are you going to use?
–We use the two-sample t-test (t.test, which runs Welch's version by default) to test whether the true difference in means is different from 0 at the 5% significance level (95% confidence interval)
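To make the test less of a black box, here is a small sketch that computes the Welch t statistic and its degrees of freedom by hand and checks them against `t.test`. It uses only twelve density values copied from the table above (six from each group), so the numbers will not match the full-data result; the variable names are illustrative.

```r
# Densities copied from the table above: rows 1, 4, 5, 6, 7, 8 (group A)
# and rows 2, 3, 10, 12, 15, 16 (group B)
a <- c(0.9978, 0.9980, 0.9978, 0.9978, 0.9964, 0.9946)
b <- c(0.9968, 0.9970, 0.9978, 0.9978, 0.9986, 0.9986)

# Squared standard error of each group mean
se2_a <- var(a) / length(a)
se2_b <- var(b) / length(b)

# Welch t statistic: difference in means over the combined standard error
t_manual <- (mean(a) - mean(b)) / sqrt(se2_a + se2_b)

# Welch-Satterthwaite approximation for the degrees of freedom
df_manual <- (se2_a + se2_b)^2 /
  (se2_a^2 / (length(a) - 1) + se2_b^2 / (length(b) - 1))

# t.test() uses var.equal = FALSE by default, i.e. the Welch test
fit <- t.test(a, b)
print(c(manual = t_manual, t.test = unname(fit$statistic)))
print(c(manual = df_manual, t.test = unname(fit$parameter)))
```

Because the variances are not pooled, the degrees of freedom are usually not a whole number, which is why the full-data output reports df = 1571.9.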
What is the p-value?
t.test(winequality$density ~ residual.sugar.group)
##
## Welch Two Sample t-test
##
## data: winequality$density by residual.sugar.group
## t = -14.955, df = 1571.9, p-value < 2.2e-16
## alternative hypothesis: true difference in means between group A and group B is not equal to 0
## 95 percent confidence interval:
## -0.001479826 -0.001136653
## sample estimates:
## mean in group A mean in group B
## 0.9960537 0.9973619
The p-value is less than 2.2e-16
What is your conclusion? –Because the p-value is far below 0.05 and the 95% confidence interval (-0.00148, -0.00114) does not contain 0, we reject the null hypothesis: mean density differs between group A (0.99605) and group B (0.99736).
Does your conclusion imply that there is an association between “density” and “residual.sugar”?
–Yes. Wines in the higher residual sugar group (B) have a higher mean density than wines in group A, so density and residual sugar are associated
We will need to view the summary stats for residual sugar again
summary(winequality$residual.sugar)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.900 1.900 2.200 2.539 2.600 15.500
Now we separate the wines into four groups A, B, C, and D based on the quartiles of residual sugar (1.9, 2.2, 2.6)
# Preallocate one label per wine instead of growing a NULL vector
residual.sugar.group2 <- character(nrow(winequality))
for (i in seq_along(winequality$residual.sugar)) {
  # Cut points are the 1st quartile, median and 3rd quartile of residual sugar
  if (winequality$residual.sugar[i] <= 1.9) residual.sugar.group2[i] <- "A"
  else if (winequality$residual.sugar[i] <= 2.2) residual.sugar.group2[i] <- "B"
  else if (winequality$residual.sugar[i] <= 2.6) residual.sugar.group2[i] <- "C"
  else residual.sugar.group2[i] <- "D"
}
table(residual.sugar.group2)
## residual.sugar.group2
## A B C D
## 464 419 361 355
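The same grouping can be written without a loop. A minimal sketch using `cut`, assuming the same cut points (1.9, 2.2, 2.6) and right-closed intervals, so that a value equal to a cut point falls in the lower group as in the loop; the toy vector `sugar` is illustrative and not taken from the data set:

```r
# Toy values chosen to sit on and around each cut point
sugar <- c(0.9, 1.9, 2.0, 2.2, 2.3, 2.6, 2.7, 15.5)

# cut() uses right-closed intervals by default: (-Inf, 1.9], (1.9, 2.2], ...
grp <- cut(sugar,
           breaks = c(-Inf, 1.9, 2.2, 2.6, Inf),
           labels = c("A", "B", "C", "D"))

print(data.frame(sugar, grp))
print(table(grp))
```

Note that `cut` returns a factor rather than a character vector; `table`, `boxplot` and `aov` accept either.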
State the null hypothesis –The mean density is the same in groups A, B, C, and D
boxplot(winequality$density ~ residual.sugar.group2)
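The boxplot suggests that density differs across the four groups, so we expect to reject the null hypothesis, but this has to be confirmed with a formal test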
What test are you going to use? -We use one-way ANOVA (aov) this time because we want to test for significant differences among the means of more than two independent groups (A, B, C, and D), whereas the t-test compares only two groups
What is the p-value?
summary(aov(winequality$density ~ residual.sugar.group2))
## Df Sum Sq Mean Sq F value Pr(>F)
## residual.sugar.group2 3 0.000996 0.0003321 112.8 <2e-16 ***
## Residuals 1595 0.004696 0.0000029
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The p-value is less than 2e-16
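As a check on how the ANOVA table fits together, the F value and p-value can be recomputed from the sums of squares and degrees of freedom printed above. This is only arithmetic on the rounded table entries, so the result agrees with the table up to rounding; the variable names are illustrative.

```r
# Values copied from the ANOVA table above (already rounded by R)
ss_group <- 0.000996   # Sum Sq for residual.sugar.group2
df_group <- 3          # number of groups minus 1
ss_resid <- 0.004696   # Sum Sq for Residuals
df_resid <- 1595       # number of wines minus number of groups

# F is the ratio of the two mean squares
ms_group <- ss_group / df_group
ms_resid <- ss_resid / df_resid
f_value <- ms_group / ms_resid

# Upper-tail probability of the F distribution gives the p-value
p_value <- pf(f_value, df_group, df_resid, lower.tail = FALSE)

print(f_value)
print(p_value)
```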
What is your conclusion?
Because the p-value is far below 0.05, we reject the null hypothesis: the mean density differs for at least one of the groups A, B, C, and D
Does your conclusion imply that there is an association between “density” and “residual.sugar”? Compare your result here with that in Question 1. Do you think increasing the number of groups help identify the association? Would you consider dividing the data into 10 groups so as to help the discovery of the association? Why?
Yes, the conclusion again implies an association between density and residual sugar. Compared with Question 1, both tests report a p-value below the smallest value R prints (< 2.2e-16 for the t-test, < 2e-16 for ANOVA), so the printed output alone does not tell us which one is smaller; both agree that mean density differs across the residual sugar groups. Increasing the number of groups from two to four therefore did not change the conclusion. We would not divide the data into 10 groups: each group would contain fewer wines, the cut points would become more arbitrary, and a significant ANOVA would still only say that at least one group mean differs, not how density changes with residual sugar. Two and four groups already show the relationship between density and residual sugar clearly.
Create a 2 by 4 contingency table using the categories A, B, C, D of “residual.sugar” and the binary variable “excellent” you created in Part B. Note that you have two factors: the categorical levels of “residual.sugar” (A, B, C and D) and an indicator of excellent wines (yes or no).
Here we define the Yes and No values of excellent (a quality of 7 or higher counts as excellent) and build the contingency table
winequality$excellent <- ifelse(winequality$quality >= 7, "Yes", "No")
contingency_table <- table(data.frame(winequality$excellent, residual.sugar.group2))
print(contingency_table)
## residual.sugar.group2
## winequality.excellent A B C D
## No 411 367 308 296
## Yes 53 52 53 59
–Use the Chi-square test to test if these two factors are correlated or not
chisq.test(contingency_table)
##
## Pearson's Chi-squared test
##
## data: contingency_table
## X-squared = 5.5, df = 3, p-value = 0.1386
Based on the chi-square test, there is no significant evidence of an association between wine excellence and residual sugar group, because the p-value is greater than 0.05
The p-value is 0.1386
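The statistic can be reproduced by hand from the contingency table: compute the expected count in each cell under independence, then sum (observed - expected)^2 / expected. A short sketch with the counts copied from the table above; `tab`, `expected` and `x2` are illustrative names.

```r
# Observed counts copied from the contingency table above
tab <- matrix(c(411, 53, 367, 52, 308, 53, 296, 59),
              nrow = 2,
              dimnames = list(excellent = c("No", "Yes"),
                              group = c("A", "B", "C", "D")))

# Expected counts under independence: row total * column total / grand total
expected <- outer(rowSums(tab), colSums(tab)) / sum(tab)

# Pearson chi-square statistic and its degrees of freedom
x2 <- sum((tab - expected)^2 / expected)
df <- (nrow(tab) - 1) * (ncol(tab) - 1)

# Upper-tail probability of the chi-square distribution
p_value <- pchisq(x2, df = df, lower.tail = FALSE)

print(round(expected, 2))
print(c(X.squared = x2, df = df, p.value = p_value))
```

All expected counts are well above 5, so the chi-square approximation is reasonable for this table.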
Use the permutation test to do the same and compare the result to that in (a)
chisq.test(contingency_table, simulate.p.value = TRUE)
##
## Pearson's Chi-squared test with simulated p-value (based on 2000
## replicates)
##
## data: contingency_table
## X-squared = 5.5, df = NA, p-value = 0.1359
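Based on the permutation test, there is still no significant association: the p-value is again greater than 0.05 and close to the result in (a)
The p-value is 0.1359 (a simulated p-value changes slightly from run to run)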
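The idea behind the permutation test can be sketched by hand: shuffle the excellent labels so that any link with the sugar group is broken, recompute the chi-square statistic, and count how often the shuffled statistic is at least as large as the observed one. This is an illustration in the same spirit as `simulate.p.value`, which samples tables with the observed margins, and not a copy of its internals. The labels are rebuilt from the counts in the contingency table, and the seed, the 2000 replicates and the variable names are assumptions made for the example.

```r
set.seed(1)  # arbitrary seed so the sketch is reproducible

# Rebuild one label per wine from the contingency table counts
group <- rep(c("A", "B", "C", "D"), times = c(464, 419, 361, 355))
excellent <- rep(rep(c("No", "Yes"), 4),
                 times = c(411, 53, 367, 52, 308, 53, 296, 59))

# Observed chi-square statistic
obs <- unname(chisq.test(table(excellent, group))$statistic)

# Statistic after shuffling the excellent labels, repeated 2000 times
n_perm <- 2000
perm <- replicate(n_perm, {
  unname(chisq.test(table(sample(excellent), group))$statistic)
})

# Share of shuffled statistics at least as extreme as the observed one
p_perm <- (sum(perm >= obs) + 1) / (n_perm + 1)

print(obs)
print(p_perm)
```

The exact value depends on the random shuffles, so it will differ a little between runs and seeds, but it should stay close to the chi-square p-value from (a).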
Can you conclude that “residual.sugar” is a significant factor contributing to the excellence of wine? Why?
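-No. Neither test shows a significant association between wine excellence and residual sugar group, because both p-values (0.1386 and 0.1359) are greater than 0.05. We therefore fail to reject the null hypothesis of independence, and we cannot conclude that residual sugar is a significant factor contributing to the excellence of wine.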