Introduction

Over the years, we have developed a taste for wines, and this has led to the creation of different styles of Red and White Wine.

This variation in the taste of wine is achieved by adding or removing ingredients during winemaking, by changing the order in which the ingredients are added, and by manipulating the fermentation process with added chemicals.

We will observe how the pH value varies across different Red and White Wine samples of the Portuguese “Vinho Verde” wine.

Problem Statement

The purpose is to determine whether there is a statistically significant difference between the mean pH values of the Red and White Wines.

To determine this, we will compute descriptive statistics and visualise the data. We will then perform a Two Sample t-test on our data set to conclude our investigation.

Data

Our data set is an open data set available from the UCI Machine Learning Repository at http://archive.ics.uci.edu/ml/datasets/Wine+Quality.

This Wine Quality Data Set contains two data sets of the Portuguese “Vinho Verde” wine. One data set is of Red Wine samples and the other is of White Wine samples.

The data set is intended for Classification or Regression tasks.

There are 12 attributes in total, such as fixed acidity, citric acid, total sulfur dioxide and density. However, we will use only the pH value of both the Red and White Wines.

The data has been available since 2009 and contains no missing values.

Data Cont.

Since we will perform our test only on the pH observations, we will first load the CSV files of both the White and Red Wine data sets, then extract the pH values, and then start our investigation. The data set that will be investigated further has only two columns, as follows:
1. pH: Contains the pH value observed for each Wine sample.
2. Type: Contains the type of Wine the pH observation belongs to.

# Load packages: readr (read_csv), dplyr (%>%, summarise), ggplot2, car (qqPlot)
library(readr); library(dplyr); library(ggplot2); library(car)

# Set working directory
setwd("C:\\Users\\abhis\\OneDrive\\Desktop\\Master of Data Science\\Sem 1\\Introduction to Statistics MATH1324\\Assignments\\Assignment 3")

# Load the data set (the file has no header row, so column names are supplied)
adult <- read_csv("adult.csv", col_names = c('Age', 'Work.Class', 'Final.Weight', 'Education', 'Education.Number', 'Marital.Status', 'Occupation', 'Relationship', 'Race', 'Sex', 'Capital.Gain', 'Capital.Loss', 'Hours.Per.Week', 'Native.Country', 'Income'))

# View loaded data set
View(adult)
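
The preparation step described above, for the two wine files, could look roughly like the sketch below. It is only an illustration under stated assumptions: the helper name `build_ph_data` is hypothetical, the file names `winequality-red.csv` and `winequality-white.csv` are assumed, and the small demo data frames contain made-up values that only show the shape of the result.

```r
# Sketch: combine the red and white samples into one pH/Type data frame.
# Assumption: each input has a numeric column named "pH".
build_ph_data <- function(red, white) {
  data.frame(
    pH   = c(red$pH, white$pH),
    Type = factor(rep(c("Red", "White"), times = c(nrow(red), nrow(white))))
  )
}

# With the real files (assumed names; the UCI CSVs are semicolon-delimited):
# red   <- read.csv("winequality-red.csv", sep = ";")
# white <- read.csv("winequality-white.csv", sep = ";")

# Tiny made-up inputs, used only to demonstrate the helper
red_demo   <- data.frame(pH = c(3.51, 3.20, 3.26), alcohol = c(9.4, 9.8, 9.8))
white_demo <- data.frame(pH = c(3.00, 3.30), alcohol = c(8.8, 9.5))

wine_ph <- build_ph_data(red_demo, white_demo)
str(wine_ph)
```

Keeping the wine type as a factor column means later steps can use the formula form `pH ~ Type` directly.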

Descriptive Statistics and Visualisation

The following is a summary of the descriptive statistics of our chosen data set, categorised by the type of Wine, i.e. Red and White.

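# Dot plot of Capital.Gain by Sex, with the group mean and its 95% CI in red
p1 <- ggplot(data = adult, aes(x = Sex, y = Capital.Gain))
p1 + geom_dotplot(binaxis = "y", stackdir = "center", dotsize = 1/2, alpha = .25) +
    # `fun` replaces the deprecated `fun.y` argument (ggplot2 >= 3.3.0)
    stat_summary(fun = "mean", geom = "point", colour = "red") +
    # mean_cl_normal needs the Hmisc package to be installed
    stat_summary(fun.data = "mean_cl_normal", colour = "red",
                 geom = "errorbar", width = .2)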

Descriptive Statistics Cont.

adult %>% group_by(Sex) %>% summarise(Min = min(Capital.Gain,na.rm = TRUE),
                                         Q1 = quantile(Capital.Gain,probs = .25,na.rm = TRUE),
                                         Median = median(Capital.Gain, na.rm = TRUE),
                                         Q3 = quantile(Capital.Gain,probs = .75,na.rm = TRUE),
                                         Max = max(Capital.Gain,na.rm = TRUE),
                                         Mean = mean(Capital.Gain, na.rm = TRUE),
                                         SD = sd(Capital.Gain, na.rm = TRUE),
                                         n = n(),
                                         Missing = sum(is.na(Capital.Gain))) -> capital_gain_summary_table
knitr::kable(capital_gain_summary_table)

Sex	Min	Q1	Median	Q3	Max	Mean	SD	n	Missing
Female	0	0	0	0	99999	568.4105	4924.263	10771	0
Male	0	0	0	0	99999	1329.3701	8326.312	21790	0

Hypothesis Testing

We will perform the Two Sample t-test on our data set to check for a statistically significant difference between the mean pH values of White Wine and Red Wine, using the following steps:
1. We will first check the normality of the data by visualising a QQ Plot.
2. Then we will perform the test of Homogeneity of Variance.
3. Finally, we will perform the Two Sample t-test on our data set.
After performing the above steps, we will observe the results and draw the appropriate conclusion.

# QQ plot of Capital.Gain for the Male group against a normal distribution
adult$Capital.Gain[adult$Sex == 'Male'] %>% qqPlot(dist="norm",ylab="Capital gain", main = "QQPlot - Capital Gain (Male)")

## [1] 840 923
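
For the wine data, the normality check in step 1 could be sketched as below using base R only. This is an illustration, not the analysis itself: the pH values are simulated placeholders rather than the Vinho Verde samples, and base `qqnorm()` with `qqline()` draws a single reference line instead of the dashed confidence envelope that `car::qqPlot()` shows.

```r
# Sketch: normal QQ plot per wine type using base R only.
# The pH values below are simulated placeholders, not the real data.
set.seed(1)
wine_ph <- data.frame(
  pH   = c(rnorm(200, mean = 3.3, sd = 0.15), rnorm(300, mean = 3.2, sd = 0.15)),
  Type = rep(c("Red", "White"), times = c(200, 300))
)

op <- par(mfrow = c(1, 2))  # two panels side by side
for (type in c("Red", "White")) {
  values <- wine_ph$pH[wine_ph$Type == type]
  qqnorm(values, main = paste0("QQ Plot - pH (", type, ")"), ylab = "pH")
  qqline(values, lty = 2)   # dashed line through the first and third quartiles
}
par(op)                     # restore the previous plotting layout
```

Points that follow the dashed line closely suggest approximate normality; systematic curvature at the ends suggests heavy tails or skew.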

Hypothesis Testing Cont.

We visualised a QQ Plot to compare the observations in our data set with the values expected under a normal distribution. We check whether our values fall outside the dashed lines before proceeding to the test of Homogeneity of Variance.
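
The test of Homogeneity of Variance mentioned above can be illustrated with a small base R sketch of Levene's test in its median-centred form, which is a one-way ANOVA on the absolute deviations from each group's median. The helper name `levene_p_value` is hypothetical and the pH values are simulated placeholders; `car::leveneTest()` is the packaged alternative.

```r
# Sketch: Levene's test (median-centred) written with base R only.
# It runs a one-way ANOVA on the absolute deviations from each group's median.
levene_p_value <- function(y, group) {
  group   <- factor(group)
  abs_dev <- abs(y - ave(y, group, FUN = median))
  anova(lm(abs_dev ~ group))[["Pr(>F)"]][1]
}

# Simulated placeholder data, not the real wine samples
set.seed(2)
ph   <- c(rnorm(200, mean = 3.3, sd = 0.15), rnorm(300, mean = 3.2, sd = 0.15))
type <- rep(c("Red", "White"), times = c(200, 300))

p <- levene_p_value(ph, type)
equal_var <- p > 0.05  # TRUE means equal variances are not rejected at the 5% level
p
equal_var
```

A p-value above 0.05 means the assumption of equal variances is not rejected, which supports the pooled-variance form of the t-test.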

\[H_0: \mu_1 = \mu_2 \]

\[H_A: \mu_1 \ne \mu_2\]

\[S = \sum^n_{i = 1}d^2_i\]
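
A test of the hypotheses \(H_0\) and \(H_A\) stated above could be sketched with base R's `t.test()` as follows. The data are simulated placeholders rather than the wine samples, and `var.equal = TRUE` is an assumption standing in for the outcome of the variance test; setting it to `FALSE` gives the Welch version instead.

```r
# Sketch: two-sided Two Sample t-test of H0: mu_1 = mu_2 on simulated data.
# The pH values are placeholders, not the real Vinho Verde samples.
set.seed(3)
wine_ph <- data.frame(
  pH   = c(rnorm(200, mean = 3.3, sd = 0.15), rnorm(300, mean = 3.2, sd = 0.15)),
  Type = rep(c("Red", "White"), times = c(200, 300))
)

use_equal_var <- TRUE  # assumption: set from the result of the variance test

result <- t.test(pH ~ Type, data = wine_ph,
                 var.equal = use_equal_var,
                 alternative = "two.sided", conf.level = 0.95)

result$statistic  # t-value
result$p.value    # p-value for H0: mu_1 = mu_2
result$conf.int   # 95% CI for the difference in means (Red minus White)

reject_h0 <- result$p.value < 0.05
```

For a two-sided test at the 5% level, rejecting \(H_0\) is equivalent to the 95% confidence interval for the mean difference excluding 0, so the p-value and the interval always lead to the same decision.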

Discussion

We performed the Two Sample t-test on our data set to check for a statistically significant difference between the mean pH values of White Wine and Red Wine.

The strength of our investigation is that our data set consisted of a large number of observations of both Red and White Wine samples. The weakness of our investigation is that our data set contained many outliers in the observations of both Red and White Wine samples. Our results could be used in the future by a new liquor company to determine the typical pH value of Red Wines and White Wines, and to decide what difference in pH between Red and White Wines should be maintained to get the best-selling product.

We plotted a QQ Plot for both the Red and the White Wine samples. In both plots, some values fell inside the dashed lines and some fell outside them. Since some values fell outside the dashed lines, the data are not perfectly normal; because both samples are large, we proceeded to the test of Homogeneity of Variance. We used Levene’s Test, which gave a p-value of 0.284. This is greater than 0.05, so we do not reject the assumption of equal variances, and we performed the Two Sample t-test assuming equal variance.

The results of the Two Sample t-test are as follows:
1. The p-value was found to be 0.0003, which is less than 0.05.
2. The 95 percent confidence interval is [4.541, 15.309].
3. The t-value is 3.615, and the 95% confidence interval does not contain 0, which agrees with the p-value.

To conclude, the results of our investigation suggest that the pH value of Red Wine is significantly higher than the pH value of White Wine.

MATH1324 Assignment 3

pH value of Red and White Wines