library(stats4)
library(dplyr)
library(ggplot2)

red <- read.csv("C:/Users/Anil Palazzo/Desktop/school stuff/Masters/BANA7051-Stat Methods/Data/wine+quality/winequality-red.csv",
                header = TRUE, sep = ";", na.strings = "NA")

1. Produce summary statistics of “residual.sugar” and use its median to divide the data into two groups A and B. We want to test if “density” in Group A and Group B has the same population mean. Please answer the following questions.

# Summary Statistics
summary(red$residual.sugar)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.900   1.900   2.200   2.539   2.600  15.500

ab_group <- ifelse(red$residual.sugar < 2.2, "A", "B")
group_df <- data.frame(red$residual.sugar, ab_group, red$density)

print(head(group_df, 10))

##    red.residual.sugar ab_group red.density
## 1                 1.9        A      0.9978
## 2                 2.6        B      0.9968
## 3                 2.3        B      0.9970
## 4                 1.9        A      0.9980
## 5                 1.9        A      0.9978
## 6                 1.8        A      0.9978
## 7                 1.6        A      0.9964
## 8                 1.2        A      0.9946
## 9                 2.0        A      0.9968
## 10                6.1        B      0.9978
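
The split above types the median in by hand (2.2). A minimal sketch of the same split with the cut point computed from the data is shown here; the vector `sugar` is made up for illustration and is not the wine data.

```r
# Hypothetical residual-sugar values, for illustration only
sugar <- c(1.9, 2.6, 2.3, 1.9, 1.8, 1.6, 2.0, 6.1)

# Compute the cut point from the data rather than typing it in
cut_point <- median(sugar)

# Values below the median go to A; values at or above it go to B
grp <- ifelse(sugar < cut_point, "A", "B")

# Group sizes: ties at the median all land in B, so the split need not be even
table(grp)
```

With real data, many observations can sit exactly on the median, so it is worth checking the group sizes before testing.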

a. State the null Hypothesis

There is no difference between the population mean density of Group A and that of Group B.

b. Use visualization tools to inspect the hypothesis.

boxplot(red$density ~ ab_group)

The box plots suggest that the typical density differs between the two groups, which casts doubt on the null hypothesis. A formal test is needed to confirm this.

c. What test are you going to use?

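A two-sample t-test is the most suitable choice here. By default, R's t.test() runs the Welch version, which does not assume equal variances in the two groups.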

t.test(red$density ~ ab_group)

## 
##  Welch Two Sample t-test
## 
## data:  red$density by ab_group
## t = -14.955, df = 1571.9, p-value < 2.2e-16
## alternative hypothesis: true difference in means between group A and group B is not equal to 0
## 95 percent confidence interval:
##  -0.001479826 -0.001136653
## sample estimates:
## mean in group A mean in group B 
##       0.9960537       0.9973619
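
To make the reported statistic less of a black box, here is a minimal sketch of the Welch calculation done by hand and compared with `t.test()`. The vectors `a` and `b` are simulated stand-ins chosen for illustration; they are not the wine data.

```r
set.seed(1)
# Simulated stand-ins for the two density groups (illustrative values only)
a <- rnorm(50, mean = 0.9960, sd = 0.002)
b <- rnorm(60, mean = 0.9974, sd = 0.002)

# Squared standard error of each group mean, allowing unequal variances
se2_a <- var(a) / length(a)
se2_b <- var(b) / length(b)
t_stat <- (mean(a) - mean(b)) / sqrt(se2_a + se2_b)

# Welch-Satterthwaite approximation to the degrees of freedom
df <- (se2_a + se2_b)^2 /
  (se2_a^2 / (length(a) - 1) + se2_b^2 / (length(b) - 1))

# Two-sided p-value from the t distribution
p_val <- 2 * pt(-abs(t_stat), df)

# Should agree with R's built-in Welch test
fit <- t.test(a, b)
c(by_hand = t_stat, built_in = unname(fit$statistic))
```

The non-integer degrees of freedom in the output above (1571.9) come from this same approximation.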

d. What is the p-value?

The p-value is less than 2.2e-16.

e. What is your conclusion using a type 1 error = 0.05?

The p-value is less than 2.2e-16, which is far below 0.05. I therefore reject the null hypothesis that the population mean density is the same in Groups A and B.

f. Does your conclusion imply that there is an association between “density” and “residual.sugar”?

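Yes, in a limited sense. Because the groups are defined by residual sugar, a significant difference in mean density between them is evidence that the two variables are associated. The t-test does not describe the form or strength of that relationship, and it does not establish causation. Further regression analysis is needed to characterize the relationship between the two variables.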
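
A minimal sketch of the follow-up regression mentioned above, run on simulated data. The names `sugar` and `dens`, and the slope used to generate them, are assumptions for illustration and not estimates from the wine data.

```r
set.seed(2)
# Simulated data: density rises slightly with residual sugar (made-up slope)
sugar <- runif(200, min = 1, max = 8)
dens  <- 0.995 + 0.0004 * sugar + rnorm(200, sd = 0.001)

# Simple linear regression treats sugar as continuous rather than split in two
fit <- lm(dens ~ sugar)

# Slope estimate, its standard error, t value and p-value
slope_row <- coef(summary(fit))["sugar", ]
slope_row

# The Pearson correlation test uses the same t statistic as the slope test
ct <- cor.test(sugar, dens)
unname(ct$statistic)
```

Keeping the predictor continuous uses all of the information in residual sugar, whereas a median split discards how far each wine is from the cut point.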

2. Produce summary statistics of “residual.sugar” and use its 1st, 2nd, and 3rd quantiles to divide the data into four groups A, B, C, and D. We want to test if “density” in the four groups has the same population mean. Please answer the following questions.

summary(red$residual.sugar)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.900   1.900   2.200   2.539   2.600  15.500

quantiles <- quantile(red$residual.sugar, probs = c(0, 0.25, 0.5, 0.75, 1))
quantiles

##   0%  25%  50%  75% 100% 
##  0.9  1.9  2.2  2.6 15.5

# Label the quartile groups A-D to match the question wording
red <- red %>%
  mutate(group_1 = cut(residual.sugar, breaks = quantiles, labels = c("A", "B", "C", "D"), include.lowest = TRUE))
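
A small sketch of how `cut()` treats values that fall exactly on a break, since that decides which group the boundary wines end up in. The vector `x` is made up for illustration; the breaks mirror the quantiles printed above.

```r
# Made-up values, including the minimum and values exactly on a break
x <- c(0.9, 1.5, 1.9, 2.0, 2.2, 2.4, 2.6, 15.5)
brks <- c(0.9, 1.9, 2.2, 2.6, 15.5)

# Intervals are right-closed by default: (0.9,1.9], (1.9,2.2], ...
# include.lowest = TRUE also closes the first interval on the left, so the
# minimum (0.9) is assigned to A instead of becoming NA
g <- cut(x, breaks = brks, labels = c("A", "B", "C", "D"),
         include.lowest = TRUE)
data.frame(x, g)

# Without include.lowest the minimum falls outside every interval
g_open <- cut(x, breaks = brks, labels = c("A", "B", "C", "D"))
is.na(g_open[1])
```

Because the intervals are right-closed, a value equal to a quantile is placed in the lower of the two neighbouring groups.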

a. State the null Hypothesis

The population mean of “density” is the same in all four groups A, B, C and D.

b. Use visualization tools to inspect the hypothesis. Do you think the hypothesis is right or not?

ggplot(red, aes(x = group_1, y = density)) +
  geom_boxplot()

The box plots suggest that density differs across the groups, so the null hypothesis of equal means does not appear to hold.

c. What test are you going to use?

I am going to use a one-way ANOVA, since we are comparing the mean of one variable across more than two groups.

summary(aov(red$density ~ red$group_1))

##               Df   Sum Sq   Mean Sq F value Pr(>F)    
## red$group_1    3 0.000996 0.0003321   112.8 <2e-16 ***
## Residuals   1595 0.004696 0.0000029                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
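
As a rough illustration of where the F value in the table comes from, the sketch below computes the between-group and within-group sums of squares by hand and compares the result with `aov()`. The data are simulated and the group means are made up; none of it is the wine data.

```r
set.seed(3)
# Simulated response in four groups of 25 (group means are made up)
g  <- factor(rep(c("A", "B", "C", "D"), each = 25))
mu <- c(0.9960, 0.9965, 0.9968, 0.9975)
y  <- rnorm(100, mean = mu[as.integer(g)], sd = 0.0015)

k <- nlevels(g)   # number of groups
n <- length(y)    # total sample size

grp_mean <- tapply(y, g, mean)
grp_n    <- tapply(y, g, length)

# Between-group variation: group means around the grand mean
ss_between <- sum(grp_n * (grp_mean - mean(y))^2)
# Within-group variation: observations around their own group mean
ss_within  <- sum((y - grp_mean[as.integer(g)])^2)

# F is the ratio of the two mean squares
f_stat <- (ss_between / (k - 1)) / (ss_within / (n - k))
p_val  <- pf(f_stat, k - 1, n - k, lower.tail = FALSE)

# Should agree with the aov() table
tab <- summary(aov(y ~ g))[[1]]
c(by_hand = f_stat, built_in = tab[["F value"]][1])
```

The degrees of freedom in the table above follow the same pattern: 3 for the four groups and 1595 for the residuals.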

d. What is the p-value?

The p-value is less than 2e-16.

e. What is your conclusion using a type 1 error = 0.05?

Since the p-value is less than 0.05, we reject the null hypothesis. This means that at least one group's population mean density differs from the others; the ANOVA alone does not tell us which groups differ.

3. Create a 2 by 4 contingency table using the categories A, B, C, D of “residual.sugar” and the binary variable “excellent” you created in Part B. Note that you have two factors: the categorical levels of “residual.sugar” (A, B, C and D) and an indicator of excellent wines (yes or no).

red$excellent <- ifelse(red$quality >= 7, "Yes", "No")

# Rows: excellent (No/Yes); columns: residual sugar quartile group
c_table <- table(excellent = red$excellent, group = red$group_1)

a. Use the Chi-square test to test if these two factors are correlated or not

chisq.test(c_table)

## 
##  Pearson's Chi-squared test
## 
## data:  c_table
## X-squared = 5.5, df = 3, p-value = 0.1386

According to the chi-squared test, the p-value is 0.1386, which is greater than 0.05, so we fail to reject the null hypothesis that the two factors are independent. There is no evidence of an association between residual sugar and wine excellence.
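
For reference, a minimal sketch of what the chi-squared test computes: expected counts under independence, the Pearson statistic, and its p-value. The counts in `tab` are made up for illustration and are not the wine data.

```r
# Made-up 2 x 4 table of counts (not the wine data)
tab <- matrix(c(30, 25, 28, 20,
                 5,  8,  6, 10),
              nrow = 2, byrow = TRUE,
              dimnames = list(excellent = c("No", "Yes"),
                              group = c("A", "B", "C", "D")))

# Expected count under independence: row total * column total / grand total
expected <- outer(rowSums(tab), colSums(tab)) / sum(tab)

# Pearson statistic and its degrees of freedom, (rows - 1) * (cols - 1)
x2 <- sum((tab - expected)^2 / expected)
df <- (nrow(tab) - 1) * (ncol(tab) - 1)
p_val <- pchisq(x2, df, lower.tail = FALSE)

# Should match chisq.test(); the continuity correction only applies to 2 x 2
fit <- chisq.test(tab)
c(by_hand = x2, built_in = unname(fit$statistic))
```

The degrees of freedom, (2 - 1) * (4 - 1) = 3, match the `df = 3` in the output above.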

b. Use the permutation test to do the same and compare the result to that in (a)

chisq.test(c_table, simulate.p.value = TRUE)

## 
##  Pearson's Chi-squared test with simulated p-value (based on 2000
##  replicates)
## 
## data:  c_table
## X-squared = 5.5, df = NA, p-value = 0.1489

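The simulated p-value of 0.1489 is slightly higher than that of the chi-square test (0.1386). Because it is based on random replicates, it will vary slightly from run to run. Both tests lead to the same conclusion: there is no evidence of an association between wine excellence and residual sugar.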
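
`chisq.test(simulate.p.value = TRUE)` draws random tables with the same row and column totals as the observed one. An explicit label-shuffling version of the same idea is sketched below. The labels are randomly generated for illustration and are not the wine data, and `chisq_stat` is a hypothetical helper name.

```r
set.seed(4)
# Made-up labels for 120 wines (illustration only, not the wine data)
group     <- factor(sample(c("A", "B", "C", "D"), 120, replace = TRUE))
excellent <- factor(sample(c("No", "Yes"), 120, replace = TRUE,
                           prob = c(0.8, 0.2)))

# Hypothetical helper: Pearson statistic for a pair of label vectors
chisq_stat <- function(x, y) {
  unname(suppressWarnings(chisq.test(table(x, y), correct = FALSE)$statistic))
}

observed <- chisq_stat(excellent, group)

# Shuffle one set of labels to break any link, then recompute the statistic
n_perm <- 2000
perm <- replicate(n_perm, chisq_stat(sample(excellent), group))

# Share of shuffled statistics at least as large as the observed one;
# the +1 counts the observed table itself, as chisq.test() also does
p_perm <- (sum(perm >= observed) + 1) / (n_perm + 1)
p_perm
```

Setting a seed, as done here, makes the simulated p-value reproducible between runs.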

c. Can you conclude that “residual.sugar” is a significant factor contributing to the excellence of wine? Why?

We cannot conclude that residual sugar is a significant factor contributing to the excellence of wine, because the p-values of both the chi-squared test and the permutation test are greater than 0.05. The data therefore provide no evidence of an association between residual sugar and wine excellence.

Assignment 4

Anil Palazzo

2024-10-08
