Introduction

This report analyzes wine quality data from 1599 red wine samples produced in northern Portugal. The goal is to make inferences about key population parameters using the sample data.

The analysis is structured in three parts:

  1. Inference for Density – Using both the Central Limit Theorem (CLT) and bootstrap methods to estimate the population mean, quantify its variability, construct confidence intervals, and assess whether a normal distribution is appropriate for modeling this variable.

  2. Inference for Residual Sugar – Addressing the challenges posed by skewness in the distribution, and applying both CLT (when appropriate) and bootstrap methods to evaluate the mean and its uncertainty, while also considering alternative distributional models.

  3. Inference for Excellent Wines – Treating wine quality classification as a binary outcome, we estimate the population proportion of excellent wines, construct confidence intervals using both CLT and bootstrap methods, and compare the two approaches.

Dataset and Packages

I will load in my data set first from the downloaded wine+quality zip file, and import any packages for the report.

# Load the dataset and packages
library(ggplot2)
library(dplyr)
library(tidyr)
library(patchwork)
library(stats4)

red_df <- read.csv("wine_quality_data/winequality-red.csv", sep = ';')
white_df <- read.csv("wine_quality_data/winequality-white.csv", sep = ';')

1. Inference for Density

This section examines the variable ‘density’ from 1599 red wine samples.

Instructions:

a. Provide an estimate of μ based on the sample;

# Estimating μ based on red_df sample
mu.hat.density <- mean(red_df$density)
mu.hat.density
## [1] 0.9967467


b. Use the Central Limit Theorem (CLT) to quantify the variability of your estimate;

# Creating variable 'n' to store count of sample data
n <- nrow(red_df)

# Creating variable 'sd.density' to store the standard deviation of the 
# variable density
sd.density <- sd(red_df$density)

# Creating variable 'sd.mu.hat.density' to store the standard deviation of the 
# sample means 
sd.mu.hat.density <- sd.density/sqrt(n)
sd.mu.hat.density
## [1] 4.71981e-05

The CLT tells us that if we repeatedly sampled 1599 bottles of red wine and computed the mean density each time, those means would vary around the true population mean μ with a standard deviation of about 0.0000472.

This means our estimate of μ (0.9967) is very precise, with only tiny sampling variability.


c. Use the CLT to give a 95% confidence interval for μ.

# Upper and lower bounds of the 95% interval
mu.hat.density.CI <- c(mu.hat.density - 2 * sd.mu.hat.density, mu.hat.density + 2 * sd.mu.hat.density)
mu.hat.density.CI
## [1] 0.9966523 0.9968411

The 95% confidence interval for μ using the CLT is (0.9966523, 0.9968411).


d. Use the bootstrap method to do parts b and c, and compare the results with those obtained from the CLT. State your findings.

  • Using the Bootstrap method to quantify the variability of my estimate;
mu.hat.set <- NULL ### Used to store mu.hat’s

for(k in 1:2000){ ### Repeat sampling 2000 times
  sample.bootstrap <- sample(red_df$density, size = 1599, replace = T)
  mu.hat <- mean(sample.bootstrap)
  mu.hat.set[k] <- mu.hat
}

sd(mu.hat.set)
## [1] 4.726196e-05

The Bootstrap method also yields a standard deviation of about 0.0000473.


  • Using the Bootstrap Method to give a 95% confidence interval for μ
# Upper and lower bounds of the 95% interval
quantile(mu.hat.set, probs = c(0.025,0.975))
##      2.5%     97.5% 
## 0.9966554 0.9968402

The 95% confidence interval for μ using the Bootstrap Method is (0.9966, 0.9969).

After using the CLT and Bootstrap Method of finding the variability and 95% confidence interval, I’ve found that there is very little difference in the output. The differences are negligible, showing up only in the fourth decimal place and having no practical impact on the results.


e. Can we use a normal distribution to model “density”?. What are the maximum likelihood estimates of the mean and standard deviation?

# Determining the distribution to use 
hist(red_df$density)

The distribution of the density variable is normally shaped, therefore, the MLE will use the normal distribution.


minuslog.lik <- function(mu, sigma){
  log.lik <- 0
  for(i in 1:1599){
    log.lik <- log.lik + log(dnorm(red_df$density[i], mean = mu, sd = sigma))
    }
  return(-log.lik)
  }

est <- mle(minuslog = minuslog.lik, start = list(mu = 1, sigma = 1))
summary(est)
## Maximum likelihood estimation
## 
## Call:
## mle(minuslogl = minuslog.lik, start = list(mu = 1, sigma = 1))
## 
## Coefficients:
##          Estimate   Std. Error
## mu    0.996745476 5.605292e-05
## sigma 0.002241416 9.395284e-06
## 
## -2 log L: -15438.11

The Maximum Likelihood Estimations of the mean and standard deviation are shown from the output above, along with the standard errors.

2. Inference for Residual Sugar

This section examines the variable residual sugar from 1599 red wine samples.

Instructions:

a. Provide an estimate of μ based on the sample;

# Estimating μ based on red_df sample
mu.hat.res <- mean(red_df$residual.sugar)
mu.hat.res
## [1] 2.538806


b. Using the CLT to quantify the variability and generate a 95% confidence interval

Although the data for ‘residual sugar’ is highly skewed as seen in the histogram below, we are still able to quantify the variability using the CLT due to the quantity of data in this column. There are 1599 observations, and using the CLT, we can get a general understanding of what the variability is like.

# Histogram of the 'residual sugar' column
hist(red_df$residual.sugar)

# Creating variable 'sd.res' to store the standard deviation of the 
# variable residual sugar
sd.res <- sd(red_df$residual.sugar)

# Creating variable 'sd.mu.hat.res' to store the standard deviation of the 
# sample means 
sd.mu.hat.res <- sd.res/sqrt(n)
sd.mu.hat.res
## [1] 0.03525922

The CLT tells us that if we repeatedly sampled 1599 bottles of red wine and computed the mean residual sugar each time, those means would vary around the true population mean μ with a standard deviation of about 0.0353.

Even with the highly skewed data, this output tells us the μ(2.5388) is very precise


# Upper and lower bounds of the 95% interval
mu.hat.res.CI <- c(mu.hat.res - 2 * sd.mu.hat.res, mu.hat.res + 2 * sd.mu.hat.res)
mu.hat.res.CI
## [1] 2.468287 2.609324

The 95% confidence interval for μ using the CLT is (0.9966523, 0.9968411).


c. Using the Bootstrap Method to quantify the variability and generate a 95% confidence interval

mu.hat.res.set <- NULL ### Used to store mu.hat’s

for(k in 1:2000){ ### Repeat sampling 2000 times
  sample.bootstrap <- sample(red_df$residual.sugar, size = 1599, replace = T)
  mu.hat.res <- mean(sample.bootstrap)
  mu.hat.res.set[k] <- mu.hat.res
}

sd(mu.hat.res.set)
## [1] 0.03468666
# Upper and lower bounds of the 95% interval
mu.hat.res.CI2 <- quantile(mu.hat.res.set, probs = c(0.025,0.975))
mu.hat.res.CI2
##     2.5%    97.5% 
## 2.470662 2.607850

The 95% confidence interval for μ using the Bootstrap Method is about (2.4706621, 2.6078502).


Is the bootstrap confidence interval symmetric?

hist(mu.hat.res.set,freq = FALSE)
lines(density(mu.hat.res.set), lwd = 5, col = 'blue')

As we can see from the graph above, the confidence interval calculated from the Bootstrap Method is symmetrically-shape , although not perfectly symmetrical.


d. Can we use a normal distribution to model “residual sugar”? If no, what distribution do you think can approximate its empirical distribution?

  • Because of the variable’s skewness, a normal distribution to model will not be suitable. A lognormal distribution would better fit the shape of the data.


What are their maximum likelihood estimates?

minuslog.lik<-function(mu, sigma){
  log.lik <- 0
  for(i in 1:1599){
    log.lik = log.lik + log(dlnorm(x = red_df$residual.sugar[i], meanlog = mu, sdlog = sigma))
    }
  return(-log.lik)
}
est.lognorm <- mle(minuslog = minuslog.lik, start = list(mu = log(mean(red_df$residual.sugar)),
sigma = log(sd(red_df$residual.sugar))))
summary(est.lognorm)
## Maximum likelihood estimation
## 
## Call:
## mle(minuslogl = minuslog.lik, start = list(mu = log(mean(red_df$residual.sugar)), 
##     sigma = log(sd(red_df$residual.sugar))))
## 
## Coefficients:
##        Estimate  Std. Error
## mu    0.8502341 0.008935942
## sigma 0.3573260 0.006318293
## 
## -2 log L: 3965.773
res.simulate <- rlnorm(1599, meanlog = 0.8502341, sdlog = 0.3573260)
par(mfrow = c(1,2))
hist(res.simulate)
hist(red_df$residual.sugar)

After comparing the maximum likelihood estimate (MLE) distribution to the actual data, the simulated graph closely mirrors the overall shape of the real distribution, though some differences remain. The real data exhibits much stronger skewness, whereas the simulation distributes values more evenly and can’t capture extreme outliers, such as the observation at 15.

3. Inference for Excellent Wines

This section examines the variable quality from 1599 red wine samples. We classify those wines as “excellent” if their rating is at least 7. Suppose the population proportion of excellent wines is p.

Instructions:

a. Use the CLT to derive a 95% confidence interval for p;

# Creating a new column, Excellent_Quality, in the red_df dataset
red_df <- red_df %>%
  mutate(Excellent_Quality = ifelse(quality >= 7, 1,0))

# Checking if the Excellent_Quality column is properly implemented into the data
head(red_df[c('quality', 'Excellent_Quality')], 10)
##    quality Excellent_Quality
## 1        5                 0
## 2        5                 0
## 3        5                 0
## 4        6                 0
## 5        5                 0
## 6        5                 0
## 7        5                 0
## 8        7                 1
## 9        7                 1
## 10       5                 0
# Estimating μ based on red_df sample
mu.hat.qual <- mean(red_df$Excellent_Quality)
mu.hat.qual
## [1] 0.1357098
# Creating variable 'sd.qual' to store the standard deviation of the 
# variable residual sugar
sd.qual <- sd(red_df$Excellent_Quality)

# Creating variable 'sd.mu.hat.qual' to store the standard deviation of the 
# sample means 
sd.mu.hat.qual <- sd.qual/sqrt(n)
sd.mu.hat.qual
## [1] 0.00856736

The CLT tells us that if we repeatedly sampled 1599 bottles of red wine and computed the mean Excellent_Quality each time, those means would vary around the true population mean μ with a standard deviation of about 0.0086.


# Upper and lower bounds of the 95% interval
mu.hat.qual.CI <- c(mu.hat.qual - 2 * sd.mu.hat.qual, mu.hat.qual + 2 * sd.mu.hat.qual)
mu.hat.qual.CI
## [1] 0.1185751 0.1528445

The 95% confidence interval for μ using the CLT is about (0.1185751, 0.1528445).


b. Use the bootstrap method to derive a 95% confidence interval for p;

mu.hat.qual.set <- NULL ### Used to store mu.hat’s

for(k in 1:2000){ ### Repeat sampling 2000 times
  sample.bootstrap <- sample(red_df$Excellent_Quality, size = 1599, replace = T)
  mu.hat.qual <- mean(sample.bootstrap)
  mu.hat.qual.set[k] <- mu.hat.qual
}

sd(mu.hat.qual.set)
## [1] 0.008540804
# Upper and lower bounds of the 95% interval
mu.hat.qual.CI2 <- quantile(mu.hat.qual.set, probs = c(0.025,0.975))
mu.hat.qual.CI2
##      2.5%     97.5% 
## 0.1188243 0.1525954

The 95% confidence interval for μ using the CLT is about (0.1188243, 0.1525954).


c. Compare the two intervals. Is there any difference worth our attention?

  • The differences are negligible, showing up only in the third decimal place and having no practical impact on the results.

d. What is the maximum likelihood estimate of p and its standard error?

Excellent_Sum <- sum(red_df$Excellent_Quality)

minuslog.lik <- function(p){
  -log(dbinom(x = Excellent_Sum, size = 1599, prob = p))
}
est <- mle(minuslog=minuslog.lik, start = list(p = 0.5))
summary(est)
## Maximum likelihood estimation
## 
## Call:
## mle(minuslogl = minuslog.lik, start = list(p = 0.5))
## 
## Coefficients:
##    Estimate  Std. Error
## p 0.1357098 0.008564279
## 
## -2 log L: 7.072712

The Maximum Likelihood Estimations of the mean and standard deviation are shown from the output above, along with the standard errors.