library(stats4)
library(dplyr)
library(ggplot2)
red <- read.csv("C:/Users/Anil Palazzo/Desktop/school stuff/Masters/BANA7051-Stat Methods/Data/wine+quality/winequality-red.csv",
header = TRUE, sep = ";", na.strings = "NA")
# Summary Statistics
summary(red$residual.sugar)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.900 1.900 2.200 2.539 2.600 15.500
ab_group <- ifelse(red$residual.sugar < 2.2, "A", "B")
group_df <- data.frame(red$residual.sugar, ab_group, red$density)
print(head(group_df, 10))
## red.residual.sugar ab_group red.density
## 1 1.9 A 0.9978
## 2 2.6 B 0.9968
## 3 2.3 B 0.9970
## 4 1.9 A 0.9980
## 5 1.9 A 0.9978
## 6 1.8 A 0.9978
## 7 1.6 A 0.9964
## 8 1.2 A 0.9946
## 9 2.0 A 0.9968
## 10 6.1 B 0.9978
Null hypothesis: there is no difference in mean density between the two populations, groups A and B.
boxplot(red$density ~ ab_group)
The box plots above suggest that density differs between the two groups.
I think a two-sample t-test is the best fit for comparing the two group means.
t.test(red$density ~ ab_group)
##
## Welch Two Sample t-test
##
## data: red$density by ab_group
## t = -14.955, df = 1571.9, p-value < 2.2e-16
## alternative hypothesis: true difference in means between group A and group B is not equal to 0
## 95 percent confidence interval:
## -0.001479826 -0.001136653
## sample estimates:
## mean in group A mean in group B
## 0.9960537 0.9973619
The p-value is less than 2.2e-16.
Because the p-value (< 2.2e-16) is less than .05, I reject the null hypothesis that there is no difference in the population mean density between groups A and B.
The t-test only shows that the mean densities of the two groups differ by a statistically significant amount; it does not describe the strength or form of the relationship between residual sugar and density. Regression analysis is needed to characterize the relationship between the two variables.
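To make the Welch test reported above less of a black box, here is a minimal sketch that reproduces its statistic, degrees of freedom, and p-value by hand. It uses synthetic stand-in data (the vectors `a` and `b` are hypothetical, not the wine data), so the numbers will not match the output above.

```r
# Synthetic stand-in data (NOT the wine data): two groups with unequal variances.
set.seed(1)
a <- rnorm(50, mean = 0.9960, sd = 0.0015)
b <- rnorm(60, mean = 0.9975, sd = 0.0020)

# Squared standard error of each group mean.
va <- var(a) / length(a)
vb <- var(b) / length(b)

# Welch t statistic: difference in means divided by its standard error.
t_manual <- (mean(a) - mean(b)) / sqrt(va + vb)

# Welch-Satterthwaite approximation for the degrees of freedom.
df_manual <- (va + vb)^2 / (va^2 / (length(a) - 1) + vb^2 / (length(b) - 1))

# Two-sided p-value from the t distribution.
p_manual <- 2 * pt(-abs(t_manual), df = df_manual)

# Built-in test for comparison; Welch is the default (var.equal = FALSE).
fit <- t.test(a, b)
print(c(manual = t_manual, builtin = unname(fit$statistic)))
```

The non-integer degrees of freedom in the output above (df = 1571.9) come from this same approximation.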
summary(red$residual.sugar)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.900 1.900 2.200 2.539 2.600 15.500
quantiles <- quantile(red$residual.sugar, probs = c(0, 0.25, 0.5, 0.75, 1))
quantiles
## 0% 25% 50% 75% 100%
## 0.9 1.9 2.2 2.6 15.5
red <- red %>%
  mutate(group_1 = cut(residual.sugar, breaks = quantiles, labels = c("1", "2", "3", "4"), include.lowest = TRUE))
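The role of `include.lowest = TRUE` in the quartile split can be seen in a small sketch. The vector `x` below is a hypothetical toy example, not the wine data. Because `cut()` builds right-closed intervals by default, the minimum value sits exactly on the lowest break and would otherwise be left out.

```r
# Hypothetical toy values (not the wine data).
x <- 1:9
breaks <- quantile(x, probs = c(0, 0.25, 0.5, 0.75, 1))

# Without include.lowest, the first interval is (1, 3], so the minimum becomes NA.
without_lowest <- cut(x, breaks = breaks, labels = c("1", "2", "3", "4"))

# With include.lowest = TRUE, the first interval is [1, 3] and every value gets a group.
with_lowest <- cut(x, breaks = breaks, labels = c("1", "2", "3", "4"),
                   include.lowest = TRUE)

print(table(without_lowest, useNA = "ifany"))
print(table(with_lowest))
```

Values tied at a quartile boundary all fall into the lower group, so the four groups are not guaranteed to be the same size.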
Null hypothesis: the population mean of “density” is the same across all four residual sugar groups (1, 2, 3 and 4).
ggplot(red, aes(x = group_1, y = density)) +
geom_boxplot()
The box plots above suggest that density differs across the groups.
Because there are more than two groups, I am going to use a one-way ANOVA test.
summary(aov(red$density ~ red$group_1))
## Df Sum Sq Mean Sq F value Pr(>F)
## red$group_1 3 0.000996 0.0003321 112.8 <2e-16 ***
## Residuals 1595 0.004696 0.0000029
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The p-value is less than 2e-16.
Since the p-value is less than .05, we can reject the null hypothesis. That means at least one group's population mean density differs from the others.
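The ANOVA F-test does not say which groups differ. One possible follow-up, which was not part of the analysis above, is a Tukey HSD post-hoc comparison. The sketch below runs on synthetic data (the data frame `toy` and its columns are hypothetical), with only group 4 given a shifted mean.

```r
# Synthetic data (NOT the wine data): groups 1-3 share a mean, group 4 is shifted.
set.seed(42)
toy <- data.frame(
  group = factor(rep(c("1", "2", "3", "4"), each = 30)),
  y = rnorm(120, mean = rep(c(10, 10, 10, 12), each = 30), sd = 1)
)

# One-way ANOVA: tests whether all group means are equal.
fit <- aov(y ~ group, data = toy)
print(summary(fit))

# Tukey HSD: all pairwise differences with family-wise adjusted p-values.
tukey <- TukeyHSD(fit)
print(tukey$group)  # columns: diff, lwr, upr, p adj
```

Applied to the wine data, the same call on `aov(density ~ group_1, data = red)` would give the pairwise comparisons between the four residual sugar groups.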
red$excellent <- ifelse(red$quality >= 7, "Yes", "No")
c_table <- table(data.frame(red$excellent, red$group_1))
chisq.test(c_table, simulate.p.value = TRUE)
##
## Pearson's Chi-squared test with simulated p-value (based on 2000
## replicates)
##
## data: c_table
## X-squared = 5.5, df = NA, p-value = 0.1489
The simulated p-value of 0.1489 is higher than that of the chi-square test (0.1386). Based on the results of the permutation test, there is no evidence of an association between wine excellence and residual sugar.
Both p-values, chi-squared and permutation, are greater than .05, so we cannot conclude that residual sugar is a significant factor contributing to the excellence of wine.
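Because the simulated p-value comes from random resampling, it changes slightly from run to run unless the random seed is fixed. The sketch below shows this on a hypothetical contingency table (the counts in `tab` are made up, not taken from the wine data) and compares the simulated result with the asymptotic chi-squared test.

```r
# Hypothetical 2 x 4 contingency table (NOT the wine data).
tab <- matrix(c(30, 25, 28, 20,
                 5,  9,  6, 10),
              nrow = 2, byrow = TRUE,
              dimnames = list(excellent = c("No", "Yes"),
                              group = c("1", "2", "3", "4")))

# Asymptotic Pearson chi-squared test: df = (2 - 1) * (4 - 1) = 3.
asym <- chisq.test(tab)

# Monte Carlo version: fixing the seed makes the simulated p-value reproducible.
set.seed(123)
sim1 <- chisq.test(tab, simulate.p.value = TRUE, B = 2000)
set.seed(123)
sim2 <- chisq.test(tab, simulate.p.value = TRUE, B = 2000)

print(c(asymptotic = asym$p.value, simulated = sim1$p.value))
```

Both versions report the same X-squared statistic. Only the p-value is computed differently, which is why the simulated test shows `df = NA`.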