This report analyzes wine quality data from 1599 red wine samples produced in northern Portugal. The goal is to make inferences about key population parameters using the sample data.
The analysis is structured in three parts:
Inference for Density – Using both the Central Limit Theorem (CLT) and bootstrap methods to estimate the population mean, quantify its variability, construct confidence intervals, and assess whether a normal distribution is appropriate for modeling this variable.
Inference for Residual Sugar – Addressing the challenges posed by skewness in the distribution, and applying both CLT (when appropriate) and bootstrap methods to evaluate the mean and its uncertainty, while also considering alternative distributional models.
Inference for Excellent Wines – Treating wine quality classification as a binary outcome, we estimate the population proportion of excellent wines, construct confidence intervals using both CLT and bootstrap methods, and compare the two approaches.
I will load in my data set first from the downloaded wine+quality zip file, and import any packages for the report.
# Load the dataset and packages
library(ggplot2)
library(dplyr)
library(tidyr)
library(patchwork)
library(stats4)
red_df <- read.csv("wine_quality_data/winequality-red.csv", sep = ';')
white_df <- read.csv("wine_quality_data/winequality-white.csv", sep = ';')
This section examines the variable ‘density’ from 1599 red wine samples.
Instructions:
a. Provide an estimate of μ based on the sample;
# Estimating μ based on red_df sample
mu.hat.density <- mean(red_df$density)
mu.hat.density
## [1] 0.9967467
b. Use the Central Limit Theorem (CLT) to quantify the variability of your estimate;
# Creating variable 'n' to store count of sample data
n <- nrow(red_df)
# Creating variable 'sd.density' to store the standard deviation of the
# variable density
sd.density <- sd(red_df$density)
# Creating variable 'sd.mu.hat.density' to store the standard deviation of the
# sample means
sd.mu.hat.density <- sd.density/sqrt(n)
sd.mu.hat.density
## [1] 4.71981e-05
The CLT tells us that if we repeatedly sampled 1599 bottles of red wine and computed the mean density each time, those means would vary around the true population mean μ with a standard deviation of about 0.0000472.
This means our estimate of μ (0.9967) is very precise, with only tiny sampling variability.
c. Use the CLT to give a 95% confidence interval for μ.
# Upper and lower bounds of the 95% interval
mu.hat.density.CI <- c(mu.hat.density - 2 * sd.mu.hat.density, mu.hat.density + 2 * sd.mu.hat.density)
mu.hat.density.CI
## [1] 0.9966523 0.9968411
The 95% confidence interval for μ using the CLT is (0.9966523, 0.9968411).
d. Use the bootstrap method to do parts b and c, and compare the results with those obtained from the CLT. State your findings.
mu.hat.set <- NULL ### Used to store mu.hat’s
for(k in 1:2000){ ### Repeat sampling 2000 times
sample.bootstrap <- sample(red_df$density, size = 1599, replace = T)
mu.hat <- mean(sample.bootstrap)
mu.hat.set[k] <- mu.hat
}
sd(mu.hat.set)
## [1] 4.726196e-05
The Bootstrap method also yields a standard deviation of about 0.0000473.
# Upper and lower bounds of the 95% interval
quantile(mu.hat.set, probs = c(0.025,0.975))
## 2.5% 97.5%
## 0.9966554 0.9968402
The 95% confidence interval for μ using the Bootstrap Method is (0.9966, 0.9969).
After using the CLT and Bootstrap Method of finding the variability and 95% confidence interval, I’ve found that there is very little difference in the output. The differences are negligible, showing up only in the fourth decimal place and having no practical impact on the results.
e. Can we use a normal distribution to model “density”?. What are the maximum likelihood estimates of the mean and standard deviation?
# Determining the distribution to use
hist(red_df$density)
The distribution of the density variable is normally shaped, therefore, the MLE will use the normal distribution.
minuslog.lik <- function(mu, sigma){
log.lik <- 0
for(i in 1:1599){
log.lik <- log.lik + log(dnorm(red_df$density[i], mean = mu, sd = sigma))
}
return(-log.lik)
}
est <- mle(minuslog = minuslog.lik, start = list(mu = 1, sigma = 1))
summary(est)
## Maximum likelihood estimation
##
## Call:
## mle(minuslogl = minuslog.lik, start = list(mu = 1, sigma = 1))
##
## Coefficients:
## Estimate Std. Error
## mu 0.996745476 5.605292e-05
## sigma 0.002241416 9.395284e-06
##
## -2 log L: -15438.11
The Maximum Likelihood Estimations of the mean and standard deviation are shown from the output above, along with the standard errors.
This section examines the variable residual sugar from 1599 red wine samples.
Instructions:
a. Provide an estimate of μ based on the sample;
# Estimating μ based on red_df sample
mu.hat.res <- mean(red_df$residual.sugar)
mu.hat.res
## [1] 2.538806
b. Using the CLT to quantify the variability and generate a 95% confidence interval
Although the data for ‘residual sugar’ is highly skewed as seen in the histogram below, we are still able to quantify the variability using the CLT due to the quantity of data in this column. There are 1599 observations, and using the CLT, we can get a general understanding of what the variability is like.
# Histogram of the 'residual sugar' column
hist(red_df$residual.sugar)
# Creating variable 'sd.res' to store the standard deviation of the
# variable residual sugar
sd.res <- sd(red_df$residual.sugar)
# Creating variable 'sd.mu.hat.res' to store the standard deviation of the
# sample means
sd.mu.hat.res <- sd.res/sqrt(n)
sd.mu.hat.res
## [1] 0.03525922
The CLT tells us that if we repeatedly sampled 1599 bottles of red wine and computed the mean residual sugar each time, those means would vary around the true population mean μ with a standard deviation of about 0.0353.
Even with the highly skewed data, this output tells us the μ(2.5388) is very precise
# Upper and lower bounds of the 95% interval
mu.hat.res.CI <- c(mu.hat.res - 2 * sd.mu.hat.res, mu.hat.res + 2 * sd.mu.hat.res)
mu.hat.res.CI
## [1] 2.468287 2.609324
The 95% confidence interval for μ using the CLT is (0.9966523, 0.9968411).
c. Using the Bootstrap Method to quantify the variability and generate a 95% confidence interval
mu.hat.res.set <- NULL ### Used to store mu.hat’s
for(k in 1:2000){ ### Repeat sampling 2000 times
sample.bootstrap <- sample(red_df$residual.sugar, size = 1599, replace = T)
mu.hat.res <- mean(sample.bootstrap)
mu.hat.res.set[k] <- mu.hat.res
}
sd(mu.hat.res.set)
## [1] 0.03468666
# Upper and lower bounds of the 95% interval
mu.hat.res.CI2 <- quantile(mu.hat.res.set, probs = c(0.025,0.975))
mu.hat.res.CI2
## 2.5% 97.5%
## 2.470662 2.607850
The 95% confidence interval for μ using the Bootstrap Method is about (2.4706621, 2.6078502).
Is the bootstrap confidence interval symmetric?
hist(mu.hat.res.set,freq = FALSE)
lines(density(mu.hat.res.set), lwd = 5, col = 'blue')
As we can see from the graph above, the confidence interval calculated from the Bootstrap Method is symmetrically-shape , although not perfectly symmetrical.
d. Can we use a normal distribution to model “residual sugar”? If no, what distribution do you think can approximate its empirical distribution?
What are their maximum likelihood estimates?
minuslog.lik<-function(mu, sigma){
log.lik <- 0
for(i in 1:1599){
log.lik = log.lik + log(dlnorm(x = red_df$residual.sugar[i], meanlog = mu, sdlog = sigma))
}
return(-log.lik)
}
est.lognorm <- mle(minuslog = minuslog.lik, start = list(mu = log(mean(red_df$residual.sugar)),
sigma = log(sd(red_df$residual.sugar))))
summary(est.lognorm)
## Maximum likelihood estimation
##
## Call:
## mle(minuslogl = minuslog.lik, start = list(mu = log(mean(red_df$residual.sugar)),
## sigma = log(sd(red_df$residual.sugar))))
##
## Coefficients:
## Estimate Std. Error
## mu 0.8502341 0.008935942
## sigma 0.3573260 0.006318293
##
## -2 log L: 3965.773
res.simulate <- rlnorm(1599, meanlog = 0.8502341, sdlog = 0.3573260)
par(mfrow = c(1,2))
hist(res.simulate)
hist(red_df$residual.sugar)
After comparing the maximum likelihood estimate (MLE) distribution to
the actual data, the simulated graph closely mirrors the overall shape
of the real distribution, though some differences remain. The real data
exhibits much stronger skewness, whereas the simulation distributes
values more evenly and can’t capture extreme outliers, such as the
observation at 15.
This section examines the variable quality from 1599 red wine samples. We classify those wines as “excellent” if their rating is at least 7. Suppose the population proportion of excellent wines is p.
Instructions:
a. Use the CLT to derive a 95% confidence interval for p;
# Creating a new column, Excellent_Quality, in the red_df dataset
red_df <- red_df %>%
mutate(Excellent_Quality = ifelse(quality >= 7, 1,0))
# Checking if the Excellent_Quality column is properly implemented into the data
head(red_df[c('quality', 'Excellent_Quality')], 10)
## quality Excellent_Quality
## 1 5 0
## 2 5 0
## 3 5 0
## 4 6 0
## 5 5 0
## 6 5 0
## 7 5 0
## 8 7 1
## 9 7 1
## 10 5 0
# Estimating μ based on red_df sample
mu.hat.qual <- mean(red_df$Excellent_Quality)
mu.hat.qual
## [1] 0.1357098
# Creating variable 'sd.qual' to store the standard deviation of the
# variable residual sugar
sd.qual <- sd(red_df$Excellent_Quality)
# Creating variable 'sd.mu.hat.qual' to store the standard deviation of the
# sample means
sd.mu.hat.qual <- sd.qual/sqrt(n)
sd.mu.hat.qual
## [1] 0.00856736
The CLT tells us that if we repeatedly sampled 1599 bottles of red wine and computed the mean Excellent_Quality each time, those means would vary around the true population mean μ with a standard deviation of about 0.0086.
# Upper and lower bounds of the 95% interval
mu.hat.qual.CI <- c(mu.hat.qual - 2 * sd.mu.hat.qual, mu.hat.qual + 2 * sd.mu.hat.qual)
mu.hat.qual.CI
## [1] 0.1185751 0.1528445
The 95% confidence interval for μ using the CLT is about (0.1185751, 0.1528445).
b. Use the bootstrap method to derive a 95% confidence interval for p;
mu.hat.qual.set <- NULL ### Used to store mu.hat’s
for(k in 1:2000){ ### Repeat sampling 2000 times
sample.bootstrap <- sample(red_df$Excellent_Quality, size = 1599, replace = T)
mu.hat.qual <- mean(sample.bootstrap)
mu.hat.qual.set[k] <- mu.hat.qual
}
sd(mu.hat.qual.set)
## [1] 0.008540804
# Upper and lower bounds of the 95% interval
mu.hat.qual.CI2 <- quantile(mu.hat.qual.set, probs = c(0.025,0.975))
mu.hat.qual.CI2
## 2.5% 97.5%
## 0.1188243 0.1525954
The 95% confidence interval for μ using the CLT is about (0.1188243, 0.1525954).
c. Compare the two intervals. Is there any difference worth our attention?
d. What is the maximum likelihood estimate of p and its standard error?
Excellent_Sum <- sum(red_df$Excellent_Quality)
minuslog.lik <- function(p){
-log(dbinom(x = Excellent_Sum, size = 1599, prob = p))
}
est <- mle(minuslog=minuslog.lik, start = list(p = 0.5))
summary(est)
## Maximum likelihood estimation
##
## Call:
## mle(minuslogl = minuslog.lik, start = list(p = 0.5))
##
## Coefficients:
## Estimate Std. Error
## p 0.1357098 0.008564279
##
## -2 log L: 7.072712
The Maximum Likelihood Estimations of the mean and standard deviation are shown from the output above, along with the standard errors.