The purpose of this project is to conduct an analysis using the tools and knowledge acquired during this semester long course. I will be using a data set containing 1599 different red wines with 12 associated variables for each observation. I will be using the statistical software R to conduct my analysis and I will be applying my understanding of empirical CDF, bootstrapping, MLE, hypothesis testing, and Bayesian analysis.
library(dplyr)
library(stats4)
library(DT)
library(tinytex)
library(bootstrap)
library(ggplot2)
red_wine <- read.csv("C:/Users/Chris Steege/Documents/Classes/Fall 2020/Statistical Methods/winequality-red.csv", header=FALSE, sep=';', skip=1)
colnames(red_wine) <- c("fixed acidity","volatile acidity","citric acid","residual sugar","chlorides","free sulfur dioxide","total sulfur dioxide","density","pH","sulphates","alcohol","quality")
In this dataset, the rows represent a unique wine, and the columns give us information about the wines.
library(DT)
datatable(head(red_wine,25))
Here, I will provide a table with the variable names, types, and descriptions.
Var_type <- lapply(red_wine, class)
var_description <- c("Fixed Acidity Content", "Volatile Acidity Content", "Citric Acid Content", "Residual Sugar Content", "Chloride Content", "Free SUlfur DIoxide Content", "Total Sulfur Dioxide Content", "Wine Density", "Wine pH Level", "Sulphates Content", "Alcohol Content (ABV)", "Rating Given by Customers")
Variable_names <- colnames(red_wine)
data_description <- as_data_frame(cbind(Variable_names,Var_type, var_description))
colnames(data_description) <- c("Variable Name", "Data Type", "Variable Description")
library(knitr)
kable(data_description)
| Variable Name | Data Type | Variable Description |
|---|---|---|
| fixed acidity | numeric | Fixed Acidity Content |
| volatile acidity | numeric | Volatile Acidity Content |
| citric acid | numeric | Citric Acid Content |
| residual sugar | numeric | Residual Sugar Content |
| chlorides | numeric | Chloride Content |
| free sulfur dioxide | numeric | Free SUlfur DIoxide Content |
| total sulfur dioxide | numeric | Total Sulfur Dioxide Content |
| density | numeric | Wine Density |
| pH | numeric | Wine pH Level |
| sulphates | numeric | Sulphates Content |
| alcohol | numeric | Alcohol Content (ABV) |
| quality | integer | Rating Given by Customers |
To help understand the data and have a reference, I’ll create a table of summary statistics for each varaible.
tmp <- do.call(data.frame,
list(mean = apply(red_wine, 2, mean),
sd = apply(red_wine, 2, sd),
median = apply(red_wine, 2, median),
min = apply(red_wine, 2, min),
max = apply(red_wine, 2, max),
n = apply(red_wine, 2, length)))
tmp
## mean sd median min max n
## fixed acidity 8.31963727 1.741096318 7.90000 4.60000 15.90000 1599
## volatile acidity 0.52782051 0.179059704 0.52000 0.12000 1.58000 1599
## citric acid 0.27097561 0.194801137 0.26000 0.00000 1.00000 1599
## residual sugar 2.53880550 1.409928060 2.20000 0.90000 15.50000 1599
## chlorides 0.08746654 0.047065302 0.07900 0.01200 0.61100 1599
## free sulfur dioxide 15.87492183 10.460156970 14.00000 1.00000 72.00000 1599
## total sulfur dioxide 46.46779237 32.895324478 38.00000 6.00000 289.00000 1599
## density 0.99674668 0.001887334 0.99675 0.99007 1.00369 1599
## pH 3.31111320 0.154386465 3.31000 2.74000 4.01000 1599
## sulphates 0.65814884 0.169506980 0.62000 0.33000 2.00000 1599
## alcohol 10.42298311 1.065667582 10.20000 8.40000 14.90000 1599
## quality 5.63602251 0.807569440 6.00000 3.00000 8.00000 1599
By plotting the Density of the wine on a histogram, you can see that density nearly follows a normal distribution.
hist(red_wine[,8], main="Histogram for density")
You can plot the Empiricle CDF by sorting the data and plotting according to how many points are less than each value. Additionaly, I have found the limiting distribution by finding the acceptable error based on the confidence we choose and adding or subtracting from the empiracle CDF. In this case, I want a 95% confidence band, so I chose alpha = .05.
Thie limiting distribution tells us that we are 95% confident that every point in the true distribution lies within this band.
library(ACSWR)
density.ecdf <- ecdf(red_wine[,8])
plot(density.ecdf)
plot.ecdf(red_wine[,8])
density.ecdf(1)-density.ecdf(0.5)
## [1] 0.9555972
#Estimate P(0.5<X<1)
Alpha=.05
n=length(red_wine[,8])
Eps=sqrt(log(2/Alpha)/(2*n))
grid<-seq(0,1.5, length.out = 1000)
lines(grid, pmin(density.ecdf(grid)+Eps,1))
lines(grid, pmax(density.ecdf(grid)-Eps,0))
We can perform the same procedure for Sulfur which follows a more abnormal (Due to it having a floor value) and skewed distribution.
hist(red_wine[,7], main="Histogram for total sulfur dioxide")
Again, the limiting distribution can be formed by find the error which tells us that we have a confidence of 95% that the true distribution lies between the bands. The bands hug tightly to the empiricle cdf, because we have many data points that increase our confidence.
Sulfur.ecdf <- ecdf(red_wine[,7])
plot(Sulfur.ecdf)
plot.ecdf(red_wine[,7])
Sulfur.ecdf(1)-Sulfur.ecdf(0.5)
## [1] 0
#Estimate P(0.5<X<1)
Alpha=.05
n=length(red_wine[,7])
Eps=sqrt(log(2/Alpha)/(2*n))
grid<-seq(0,320, length.out = 1000)
lines(grid, pmin(Sulfur.ecdf(grid)+Eps,1))
lines(grid, pmax(Sulfur.ecdf(grid)-Eps,0))
Here we are going to compare mean estimation methods using CLT, MLE, and Bootstrap
Using the central limit theorem, we can quantify the variability of our mean estimation.
Mean = NULL
CLT_se = NULL
for(i in 1:12){
Mean[i] = mean(red_wine[,i])
}
for(i in 1:12){
CLT_se[i] <-sd(red_wine[,i])/sqrt(length(red_wine[,i]))
}
Our confidence intervals are derived from the standard error.
CLT_LB = NULL
CLT_UB = NULL
for(i in 1:12){
CLT_LB[i] <- Mean[i] - 2*CLT_se[i]
CLT_UB[i] <- Mean[i] + 2*CLT_se[i]
}
Using the bootstrap method, we can also derive a mean estimate and the variability of our estimate with a confidence interval.
Bootstrap_mean = NULL
Bootstrap_se = NULL
Boostrap_LB = NULL
Boostrap_UB = NULL
for(i in 1:12){
mu.hat.set <- NULL
for (k in 1:2000) {
Bootstrap <- sample(red_wine[,i], size=1599, replace = T)
mu.hat <- mean(Bootstrap)
mu.hat.set[k] <- mu.hat
}
Bootstrap_mean[i] <- mean(mu.hat.set)
Bootstrap_se[i] <- sd(mu.hat.set)
Boostrap_LB[i] = Bootstrap_mean[i] - 2*Bootstrap_se[i]
Boostrap_UB[i] = Bootstrap_mean[i] + 2*Bootstrap_se[i]
}
Finally, MLE can also be used to derive the mean and standard error.
Scale = c(10,10,10,10,100,10,10,10,10,10,10,10)
MLE_mean = NULL
MLE_se = NULL
MLE_LB = NULL
MLE_UB = NULL
for(j in 1:12){
minuslog.lik <- function(mu, sigma){
log.lik <- 0
for(i in 1:1599) {
log.lik <- log.lik+log(dnorm(Scale[j]*red_wine[i,j], mean = mu, sd=sigma))
}
return (-log.lik)
}
density.est <- mle(minuslog=minuslog.lik, start=list(mu=mean(Scale[j]*red_wine[,j]), sigma = sd(Scale[j]*red_wine[,j])))
MLE_mean[j] =((summary(density.est)@coef/Scale[j])[1,1])
MLE_se[j] =((summary(density.est)@coef/Scale[j])[1,2])
MLE_LB = MLE_mean[j] - 2*MLE_se[j]
MLE_UB = MLE_mean[j] + 2*MLE_se[j]
}
MLE_mean
## [1] 8.31963727 0.52782044 0.27097561 2.53880550 0.08746654 15.87492183
## [7] 46.46779237 0.99674668 3.31111320 0.65814883 10.42298312 5.63602251
Here, I want to compare the mean and standard errors of our mean estimates across CLT, Bootstrapping, and MLE.
THe mean estimates for our using the the regular mean procedure and MLE were the same. The bootstrapping means for each varaible were close to the others but are always slightly higher or lower. The standard errors of these estimates were all different for each variable, but they all have very similar values.
Means <- data.frame(Mean, Bootstrap_mean, MLE_mean)
Standard_errors <- data.frame(CLT_se, Bootstrap_se, MLE_se)
Means
## Mean Bootstrap_mean MLE_mean
## 1 8.31963727 8.3194250 8.31963727
## 2 0.52782051 0.5278322 0.52782044
## 3 0.27097561 0.2708043 0.27097561
## 4 2.53880550 2.5381304 2.53880550
## 5 0.08746654 0.0874046 0.08746654
## 6 15.87492183 15.8724886 15.87492183
## 7 46.46779237 46.4573097 46.46779237
## 8 0.99674668 0.9967456 0.99674668
## 9 3.31111320 3.3111262 3.31111320
## 10 0.65814884 0.6581992 0.65814883
## 11 10.42298311 10.4222258 10.42298312
## 12 5.63602251 5.6361979 5.63602251
Standard_errors
## CLT_se Bootstrap_se MLE_se
## 1 0.0435410166 4.440835e-02 0.0435273403
## 2 0.0044778922 4.421293e-03 0.0044764919
## 3 0.0048715510 4.955018e-03 0.0048700290
## 4 0.0352592217 3.589516e-02 0.0352481981
## 5 0.0011770004 1.181359e-03 0.0011766323
## 6 0.2615856825 2.581509e-01 0.2615325653
## 7 0.8226402272 8.336501e-01 0.8231282669
## 8 0.0000471981 4.657363e-05 0.0000471981
## 9 0.0038608683 3.930420e-03 0.0038596617
## 10 0.0042389994 4.272490e-03 0.0042376735
## 11 0.0266500190 2.671026e-02 0.0266416885
## 12 0.0201955481 2.014512e-02 0.0201831763
Using bootstrapping, We can estimate the correlation coefficient between the predictor variables and the quality of the wine. In this case, I have used the citric acid content as an example and found that the bootstrap correlation between it and wine quality was .2263 with a standard error of .0392. This gives us a correlation 95% CI interval from .148 to .305.
#x: index/row number
#xdata: the data frame
theta <- function(x,xdata){ cor(xdata[x,1],xdata[x,2]) }
# sample from the population
n <- 500
results <- bootstrap(1:n,3200,theta, red_wine[,c(3,12)])
par(mfrow=c(2,1))
hist(results$thetastar)
var(results$thetastar)
## [1] 0.002242354
##Bootstrap confidence intervals
results <- bootstrap(1:n,3200,theta, red_wine[,c(3,12)])
se.boot=sqrt(var(results$thetastar))
cor.hat<-cor(red_wine[,3], red_wine[,12])
normal.ci<-c(cor.hat-2*se.boot, cor.hat+2*se.boot)
pivatol.ci<-c(2*cor.hat-quantile(results$thetastar,0.975), 2*cor.hat-quantile(results$thetastar,0.025))
quantile.ci<-quantile(results$thetastar, c(0.025, 0.975))
print(quantile.ci)
## 2.5% 97.5%
## 0.1417577 0.3343433
Here I am going to convert wine quality into a binary variable where if the quality rating is greater than 5, it is considered a quality wine. I will then perform a t-test between the citric acid content and quality to see what kind of role citric acid plays in determining if a wine is quality. Our null hypothesis is that the true difference in means is equal to 0.
We computed the means between the two groups and found that not quality wines had a mean of about .2378 while quality wines had a mean of about .2999. The difference between the means of this group was considered to be statistically significant, because we have a very low p-value. Therefore we will reject the null hypothesis.
red_wine$Quality_bin <- ifelse(red_wine$quality>5,1,0)
red_wine %>%
group_by(Quality_bin) %>%
summarize(Cit_mean = mean(`citric acid`))
## # A tibble: 2 x 2
## Quality_bin Cit_mean
## <dbl> <dbl>
## 1 0 0.238
## 2 1 0.300
qplot(x = Quality_bin, y=red_wine$`citric acid`, geom = 'boxplot', data = red_wine)
red_wine.t.test <- t.test(red_wine$`citric acid` ~ red_wine$Quality_bin)
red_wine.t.test
##
## Welch Two Sample t-test
##
## data: red_wine$`citric acid` by red_wine$Quality_bin
## t = -6.4799, df = 1592.5, p-value = 1.22e-10
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.08093360 -0.04332173
## sample estimates:
## mean in group 0 mean in group 1
## 0.2377554 0.2998830