Chris Steege
In this section of the project, I will be, again, working with the red wine data set which contains 1599 wines with 11 qualities and an associated quality rating. I will be performing 3 different statistical procedures on the Wine Density, Residual Sugar, and Wine Quality columns of data set. These statistical procedures involve using the central limit theorem, bootstrapping, and maximum likelihood estimate to estimate the population mean, standard deviation, and proportion (Wine Quality only) of these variables from the sample, standard error, confidence intervals.
library(dplyr)
library(stats4)
library(DT)
red_wine <- read.csv("C:/Users/Chris Steege/Documents/Classes/Fall 2020/Statistical Methods/winequality-red.csv", header=FALSE, sep=';', skip=1)
colnames(red_wine) <- c("fixed acidity","volatile acidity","citric acid","residual sugar","chlorides","free sulfur dioxide","total sulfur dioxide","density","pH","sulphates","alcohol","quality")
datatable(head(red_wine))
Here is an estimation of the population mean of wine density from our sample.
Density <- red_wine[,8]
mean(Density)
## [1] 0.9967467
Using the central limit theorem, we can quantify the variability of our mean estimation.
Standard_error <- sd(Density)/sqrt(length(Density))
Standard_error
## [1] 4.71981e-05
Our confidence intervals are derived from the standard error.
CLT_CI_Density <- c(mean(Density) - 2*Standard_error, mean(Density) + 2*Standard_error)
CLT_CI_Density
## [1] 0.9966523 0.9968411
Using the bootstrap method, we can also derive the variability of our estimate and a confidence interval.
mu.hat.set <- NULL
for (k in 1:2000) {
Bootstrap <- sample(Density, size=1599, replace = T)
mu.hat <- mean(Bootstrap)
mu.hat.set[k] <- mu.hat
}
Bootstrap_mean <- mean(mu.hat.set)
Bootstrap_variability <- sd(mu.hat.set)
print("Variability")
## [1] "Variability"
Bootstrap_CI_Density <-quantile(mu.hat.set, probs=c(0.025,0.975))
print("Confidence Interval")
## [1] "Confidence Interval"
Bootstrap_CI_Density
## 2.5% 97.5%
## 0.9966554 0.9968333
The difference between bootstrap variability and the standard error from the central limit theorem is very small.
Bootstrap_variability - Standard_error
## [1] -6.346767e-07
Our confidence intervals appear to be very close as well.
print("Confidence Intervals")
## [1] "Confidence Intervals"
Bootstrap_CI_Density
## 2.5% 97.5%
## 0.9966554 0.9968333
CLT_CI_Density
## [1] 0.9966523 0.9968411
We can use a normal distribution to model the density as shown from our histogram.
hist(Density, main="Histogram for density")
We can find the MLE of the mean and standard deviation using the normal distribution to obtain probabilities for optimization. The standard error is also printed beside each.
minuslog.lik <- function(mu, sigma){
log.lik <- 0
for(i in 1:1599) {
log.lik <- log.lik+log(dnorm(10*Density[i], mean = mu, sd=sigma))
}
return (-log.lik)
}
density.est <- mle(minuslog=minuslog.lik, start=list(mu=mean(10*Density), sigma = sd(10*Density)))
summary(density.est)@coef/10
## Estimate Std. Error
## mu 0.996746679 4.719810e-05
## sigma 0.001887334 3.296881e-05
Here is an estimation of the Residual Sugar mean based on the sample.
Residual.sugar <- red_wine[,4]
ResidualSugar_mean <- mean(Residual.sugar)
mean(Residual.sugar)
## [1] 2.538806
Even though the residual sugar is highly skewed, we can still use the CLT theorem to quantity the variability of our estimate because the variability is relative to a normal distribution by the central limit theorem. Our 95% confidence interval is given below.
Standard_error_residual <- sd(red_wine[,4])/sqrt(length(Residual.sugar))
CLT_CI_Redidual <- c(mean(Residual.sugar) - 2*Standard_error_residual, mean(Residual.sugar) + 2*Standard_error_residual)
CLT_CI_Redidual
## [1] 2.468287 2.609324
A quality of the bootstrap confidence interval is that it is symmetric. We can use this method as well to quantify the variability of our estimate and obtain a confidence interval.
mu.hat.set_res <- NULL
for (k in 1:2000) {
Bootstrap <- sample(Residual.sugar, size=1599, replace = T)
mu.hat <- mean(Bootstrap)
mu.hat.set_res[k] <- mu.hat
}
Bootstrap_mean_residual <- mean(mu.hat.set_res)
Bootstap_variability_residual <- sd(mu.hat.set_res)
Bootstrap_mean_residual
## [1] 2.539158
Bootstrap_CI_Residual <- c(mean(Residual.sugar) - 2*Standard_error_residual, mean(Residual.sugar) + 2*Standard_error_residual)
Bootstrap_CI_Residual
## [1] 2.468287 2.609324
As mentioned earlier, our bootstrap distribution is normal so the confidence interval is symmetric.
hist(mu.hat.set_res,freq = FALSE)
lines(density(mu.hat.set_res), lwd=5, col='blue')
Modeling the distribution using the normal for Residual sugar would be a poor idea because it follows more closely a log normal distribution.
hist(Residual.sugar, main="Histogram for Residual Sugar")
Here, we will find the MLE of the mean and standard deviation using the log normal distribution to obtain probabilities for optimization.
minuslog.lik<-function(mu, sigma){
log.lik<-0
for(i in 1:1599){
log.lik= log.lik+ log(dlnorm(red_wine[i,4], meanlog= mu, sdlog= sigma))
}
return(-log.lik)
}
est.lognorm<-mle(minuslog=minuslog.lik, start=list(mu=log(mean(red_wine[,4])), sigma=log(sd(red_wine[,4]))))
summary(est.lognorm)
## Maximum likelihood estimation
##
## Call:
## mle(minuslogl = minuslog.lik, start = list(mu = log(mean(red_wine[,
## 4])), sigma = log(sd(red_wine[, 4]))))
##
## Coefficients:
## Estimate Std. Error
## mu 0.8502341 0.008935942
## sigma 0.3573260 0.006318293
##
## -2 log L: 3965.773
Wines will be classified as excellent if they have a rating of at least 7. I am going to derive a confidence interval for our excellent/not excellent proportion using the central limit theorem.
quality <- red_wine %>% filter(quality >= 7) %>% select(quality)
proportion <- nrow(subset(red_wine, quality >=7))/nrow(red_wine)
Standard_Error_Quality <- proportion/sqrt(nrow(red_wine))
CLT_CI_PropQuality <- c(proportion - 2*Standard_Error_Quality, proportion + 2*Standard_Error_Quality)
Standard_Error_Quality
## [1] 0.003393806
CLT_CI_PropQuality
## [1] 0.1289222 0.1424974
When comparing the confidence intervals for our proportion of excellent wines for CLT and bootstrapping, it is notable that our confidence interval is much wider for the bootstrapping method. We have a much more significant estimated variability for the mean.
proportion.hat.set_quality <- NULL
for (k in 1:2000) {
bootstrap <- sample(red_wine[,12], size=1599, replace = T)
proportion.hat <- length(subset(bootstrap, bootstrap>=7))/1599
proportion.hat.set_quality[k] <- proportion.hat
}
Bootstrap_mean_quality <- mean(proportion.hat.set_quality)
Bootstrap_variability_quality <- sd(proportion.hat.set_quality)
Bootstrap_variability_quality
## [1] 0.008570678
Bootstrap_CI_PropQuality <- c(proportion - 2*Bootstrap_variability_quality, proportion + 2*Bootstrap_variability_quality)
Bootstrap_CI_PropQuality
## [1] 0.1185685 0.1528512
Finally, we can find the maximum likelihood estimate of proportion and its standard error using the binomial distribution.
prop <- as.integer(red_wine[,12] >= 7)
Quality_minuslog.lik <- function(mu){
log.lik <- 0
for(i in 1:1599) {
log.lik <- log.lik+log(dbinom(prop[i], 1, prob=mu))
}
return (-log.lik)
}
Quality_density.est <- stats4::mle(minuslog=Quality_minuslog.lik, start=list(mu=mean(prop)))
summary(Quality_density.est)
## Maximum likelihood estimation
##
## Call:
## stats4::mle(minuslogl = Quality_minuslog.lik, start = list(mu = mean(prop)))
##
## Coefficients:
## Estimate Std. Error
## mu 0.1357098 0.008564278
##
## -2 log L: 1269.921