Statistical Analysis of Wine Data Set

In this section of the project, I will be, again, working with the red wine data set which contains 1599 wines with 11 qualities and an associated quality rating. I will be performing 3 different statistical procedures on the Wine Density, Residual Sugar, and Wine Quality columns of data set. These statistical procedures involve using the central limit theorem, bootstrapping, and maximum likelihood estimate to estimate the population mean, standard deviation, and proportion (Wine Quality only) of these variables from the sample, standard error, confidence intervals.

CLT, Bootstrapping, and MLE

Wine Density

Here is an estimation of the population mean of wine density from our sample.

Density <- red_wine[,8]
mean(Density)

## [1] 0.9967467

Using the central limit theorem, we can quantify the variability of our mean estimation.

Standard_error <- sd(Density)/sqrt(length(Density))
Standard_error

## [1] 4.71981e-05

Our confidence intervals are derived from the standard error.

CLT_CI_Density <- c(mean(Density) - 2*Standard_error, mean(Density) + 2*Standard_error)

CLT_CI_Density

## [1] 0.9966523 0.9968411

Using the bootstrap method, we can also derive the variability of our estimate and a confidence interval.

mu.hat.set <- NULL
for (k in 1:2000) {
  Bootstrap <- sample(Density, size=1599, replace = T)
  mu.hat <- mean(Bootstrap)
  mu.hat.set[k] <- mu.hat
}

Bootstrap_mean <- mean(mu.hat.set)
Bootstrap_variability <- sd(mu.hat.set)
print("Variability")

## [1] "Variability"

Bootstrap_CI_Density <-quantile(mu.hat.set, probs=c(0.025,0.975))

print("Confidence Interval")

## [1] "Confidence Interval"

Bootstrap_CI_Density

##      2.5%     97.5% 
## 0.9966595 0.9968418

The difference between bootstrap variability and the standard error from the central limit theorem is very small.

Bootstrap_variability - Standard_error

## [1] -9.612504e-07

Our confidence intervals appear to be very close as well.

print("Confidence Intervals")

## [1] "Confidence Intervals"

Bootstrap_CI_Density

##      2.5%     97.5% 
## 0.9966595 0.9968418

CLT_CI_Density

## [1] 0.9966523 0.9968411

We can use a normal distribution to model the density as shown from our histogram.

hist(Density, main="Histogram for density")

We can find the MLE of the mean and standard deviation using the normal distribution to obtain probabilities for optimization. The standard error is also printed beside each.

minuslog.lik <- function(mu, sigma){
  log.lik <- 0
  for(i in 1:1599) {
    log.lik <- log.lik+log(dnorm(10*Density[i], mean = mu, sd=sigma))
  }
  return (-log.lik)
}

density.est <- mle(minuslog=minuslog.lik, start=list(mu=mean(10*Density), sigma = sd(10*Density)))

summary(density.est)@coef/10

##          Estimate   Std. Error
## mu    0.996746679 4.719810e-05
## sigma 0.001887334 3.296881e-05

Wine Residual Sugar

Here is an estimation of the Residual Sugar mean based on the sample.

Residual.sugar <- red_wine[,4]
ResidualSugar_mean <- mean(Residual.sugar)
mean(Residual.sugar)

## [1] 2.538806

Even though the residual sugar is highly skewed, we can still use the CLT theorem to quantity the variability of our estimate because the variability is relative to a normal distribution by the central limit theorem. Our 95% confidence interval is given below.

Standard_error_residual <- sd(red_wine[,4])/sqrt(length(Residual.sugar))

CLT_CI_Redidual <- c(mean(Residual.sugar) - 2*Standard_error_residual, mean(Residual.sugar) + 2*Standard_error_residual)

CLT_CI_Redidual

## [1] 2.468287 2.609324

A quality of the bootstrap confidence interval is that it is symmetric. We can use this method as well to quantify the variability of our estimate and obtain a confidence interval.

mu.hat.set_res <- NULL
for (k in 1:2000) {
  Bootstrap <- sample(Residual.sugar, size=1599, replace = T)
  mu.hat <- mean(Bootstrap)
  mu.hat.set_res[k] <- mu.hat
}

Bootstrap_mean_residual <- mean(mu.hat.set_res)
Bootstap_variability_residual <- sd(mu.hat.set_res)

Bootstrap_mean_residual

## [1] 2.539584

Bootstrap_CI_Residual <- c(mean(Residual.sugar) - 2*Standard_error_residual, mean(Residual.sugar) + 2*Standard_error_residual)

Bootstrap_CI_Residual

## [1] 2.468287 2.609324

As mentioned earlier, our bootstrap distribution is normal so the confidence interval is symmetric.

hist(mu.hat.set_res,freq = FALSE)

lines(density(mu.hat.set_res), lwd=5, col='blue')

Modeling the distribution using the normal for Residual sugar would be a poor idea because it follows more closely a log normal distribution.

hist(Residual.sugar, main="Histogram for Residual Sugar")

Here, we will find the MLE of the mean and standard deviation using the log normal distribution to obtain probabilities for optimization.

minuslog.lik<-function(mu, sigma){
  log.lik<-0
  for(i in 1:1599){
    log.lik= log.lik+ log(dlnorm(red_wine[i,4], meanlog= mu, sdlog= sigma))
  }
  return(-log.lik)
}
est.lognorm<-mle(minuslog=minuslog.lik, start=list(mu=log(mean(red_wine[,4])), sigma=log(sd(red_wine[,4]))))

summary(est.lognorm)

## Maximum likelihood estimation
## 
## Call:
## mle(minuslogl = minuslog.lik, start = list(mu = log(mean(red_wine[, 
##     4])), sigma = log(sd(red_wine[, 4]))))
## 
## Coefficients:
##        Estimate  Std. Error
## mu    0.8502341 0.008935942
## sigma 0.3573260 0.006318293
## 
## -2 log L: 3965.773

Wine Quality

Wines will be classified as excellent if they have a rating of at least 7. I am going to derive a confidence interval for our excellent/not excellent proportion using the central limit theorem.

quality <- red_wine %>% filter(quality >= 7) %>% select(quality)

proportion <- nrow(subset(red_wine, quality >=7))/nrow(red_wine)
Standard_Error_Quality <- proportion/sqrt(nrow(red_wine))

CLT_CI_PropQuality <- c(proportion - 2*Standard_Error_Quality, proportion + 2*Standard_Error_Quality)

Standard_Error_Quality

## [1] 0.003393806

CLT_CI_PropQuality

## [1] 0.1289222 0.1424974

When comparing the confidence intervals for our proportion of excellent wines for CLT and bootstrapping, it is notable that our confidence interval is much wider for the bootstrapping method. We have a much more significant estimated variability for the mean.

proportion.hat.set_quality <- NULL
for (k in 1:2000) {
  bootstrap <- sample(red_wine[,12], size=1599, replace = T)
  proportion.hat <- length(subset(bootstrap, bootstrap>=7))/1599
  proportion.hat.set_quality[k] <- proportion.hat
}

Bootstrap_mean_quality <- mean(proportion.hat.set_quality)

Bootstrap_variability_quality <- sd(proportion.hat.set_quality)

Bootstrap_variability_quality

## [1] 0.008663896

Bootstrap_CI_PropQuality <- c(proportion - 2*Bootstrap_variability_quality, proportion + 2*Bootstrap_variability_quality)

Bootstrap_CI_PropQuality

## [1] 0.1183820 0.1530376

Finally, we can find the maximum likelihood estimate of proportion and its standard error using the binomial distribution.

prop <- as.integer(red_wine[,12] >= 7)

Quality_minuslog.lik <- function(mu){
  log.lik <- 0
  for(i in 1:1599) {
    log.lik <- log.lik+log(dbinom(prop[i], 1, prob=mu))
  }
  return (-log.lik)
}

Quality_density.est <- stats4::mle(minuslog=Quality_minuslog.lik, start=list(mu=mean(prop)))

summary(Quality_density.est)

## Maximum likelihood estimation
## 
## Call:
## stats4::mle(minuslogl = Quality_minuslog.lik, start = list(mu = mean(prop)))
## 
## Coefficients:
##     Estimate  Std. Error
## mu 0.1357098 0.008564278
## 
## -2 log L: 1269.921

Statistical Analysis of Wine Data Set

Data Preparation

Packages Required

Data Import

Data Preview

CLT, Bootstrapping, and MLE

Wine Density

Wine Residual Sugar

Wine Quality