1. Suppose the population mean of the variable “density” is μ. Make the following inferences:


a. Provide an estimate of μ based on the sample;

mu.density <- mean(red_wine$density)
mu.density
## [1] 0.9967467
The estimate of μ, given by the sample mean of density, is 0.997, i.e., the estimate is very close to 1.


b. Use the Central Limit Theorem (CLT) to quantify the variability of your estimate;

sd.density <- sd(red_wine$density)
sd.density
## [1] 0.001887334
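# Standard error of the mean, s / sqrt(n); despite its name, var.density holds a standard error, not a variance
var.density <- sd(red_wine$density)/sqrt(length(red_wine$density))
var.density
## [1] 4.71981e-05
By the CLT, the sample mean is approximately Normal with standard error s/√n = 4.71981e-05, so the estimate varies very little from sample to sample.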
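The standard-error calculation above can be wrapped in a small helper. The sketch below is illustrative only: `se_mean` is a hypothetical name, the short vector is made up, and the last step simply plugs in the standard deviation and sample size (1599) reported in this document rather than reading `red_wine`.

```r
# Hypothetical helper: CLT-based standard error of a sample mean, s / sqrt(n)
se_mean <- function(x) {
  sd(x) / sqrt(length(x))
}

# Small made-up vector, for illustration only
x <- c(0.9951, 0.9968, 0.9970, 0.9978, 0.9962)
se_x <- se_mean(x)
print(se_x)

# Same formula with the values reported above (s = 0.001887334, n = 1599)
se_reported <- 0.001887334 / sqrt(1599)
print(se_reported)
```

Because the standard error shrinks like 1/√n, quadrupling the sample size would halve it.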


c. Use the CLT to give a 95% confidence interval for μ.

hist(red_wine$density)

lc.density <- mu.density - 2*(var.density)
lc.density
## [1] 0.9966523
uc.density <- mu.density + 2*(var.density)
uc.density
## [1] 0.9968411
Using 2 as an approximation to the Normal quantile 1.96, the 95% confidence interval for μ is (0.99665, 0.99684).
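The multiplier 2 is a rounded version of the Normal quantile. As a hedged sketch, the same interval can be computed with `qnorm(0.975)`; the inputs are the estimate and standard error printed earlier, copied in by hand rather than recomputed from `red_wine`.

```r
# Values copied from the output above (not recomputed from the data)
mu_hat <- 0.9967467    # sample mean of density
se_hat <- 4.71981e-05  # CLT standard error of the mean

# 97.5% quantile of the standard Normal (about 1.96)
z <- qnorm(0.975)

ci_exact <- c(lower = mu_hat - z * se_hat, upper = mu_hat + z * se_hat)
ci_rough <- c(lower = mu_hat - 2 * se_hat, upper = mu_hat + 2 * se_hat)

print(ci_exact)
print(ci_rough)  # slightly wider, since 2 > qnorm(0.975)
```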


d. Use the bootstrap method to do parts b and c, and compare the results

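mu.density.set <- numeric(2000)
n.density <- length(red_wine$density)
for (k in 1:2000) {
  # Resample with replacement and store each bootstrap mean directly,
  # so that the sample estimate mu.density is not overwritten
  density.bootstrap <- sample(red_wine$density, size = n.density, replace = TRUE)
  mu.density.set[k] <- mean(density.bootstrap)
}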

sd(mu.density.set)
## [1] 4.827167e-05
var.density.set <- var(mu.density.set)
var.density.set
## [1] 2.330155e-09
conf.q.density <- quantile(mu.density.set, probs = c(0.025, 0.975))
conf.q.density
##      2.5%     97.5% 
## 0.9966526 0.9968431
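The bootstrap standard error (4.827e-05) is almost equal to the standard error obtained from the CLT (4.720e-05). The same holds for the confidence interval: the bootstrap percentile interval (0.996653, 0.996843) nearly coincides with the CLT interval (0.996652, 0.996841).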
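The bootstrap loop can also be written with `replicate()`. This is a sketch under stated assumptions: `boot_means` is a hypothetical helper, the data are simulated (the wine data are not loaded here), and `set.seed()` is added only to make the illustration reproducible.

```r
# Hypothetical helper: bootstrap distribution of the sample mean
boot_means <- function(x, B = 2000) {
  replicate(B, mean(sample(x, size = length(x), replace = TRUE)))
}

set.seed(1)  # for reproducibility of this illustration only
# Simulated stand-in for a density-like variable (assumption, not the wine data)
x <- rnorm(1599, mean = 0.9967, sd = 0.0019)

means <- boot_means(x)

se_boot <- sd(means)                 # bootstrap standard error
se_clt  <- sd(x) / sqrt(length(x))   # CLT standard error
ci_boot <- quantile(means, probs = c(0.025, 0.975))    # percentile interval
ci_clt  <- mean(x) + c(-1, 1) * qnorm(0.975) * se_clt  # Normal-theory interval

print(c(bootstrap = se_boot, clt = se_clt))
print(ci_boot)
print(ci_clt)
```

For roughly symmetric data and a large sample, the two standard errors and the two intervals are expected to agree closely, which is the pattern seen in the density results.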


2. Suppose the population mean of the variable “residual sugar” is μ. Answer the following questions:


a. Provide an estimate of μ based on the sample;

mu.sugar <- mean(red_wine$`residual sugar`)
mu.sugar
## [1] 2.538806
The estimate of μ based on the sample mean of residual sugar is 2.5388


b. Noting that the sample distribution of “residual sugar” is highly skewed, can we use the CLT to quantify the variability of your estimate? Can we use the CLT to give a 95% confidence interval for μ? If yes, please give your solution. If no, explain why.

hist(red_wine$`residual sugar`)

mu.sugar <- mean(red_wine$`residual sugar`)
mu.sugar
## [1] 2.538806
sd.sugar <- sd(red_wine$`residual sugar`)
sd.sugar
## [1] 1.409928
var.sugar <- sd(red_wine$`residual sugar`)/ sqrt(length(red_wine$`residual sugar`))
var.sugar
## [1] 0.03525922
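Yes. The CLT concerns the distribution of the sample mean, not of the individual observations. With a sample as large as n = 1599, the sample mean is approximately Normal even though “residual sugar” itself is highly skewed, so the CLT can be used both to quantify the variability of the estimate (standard error 0.0353, stored in var.sugar) and to build a confidence interval. Skewness slows the convergence to Normality, but it does not prevent it when the sample is this large.
# 2b: 95% CI using the CLT (2 approximates the Normal quantile 1.96)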

lc.sugar <- mu.sugar - 2*(var.sugar)
lc.sugar
## [1] 2.468287
uc.sugar <- mu.sugar + 2*(var.sugar)
uc.sugar
## [1] 2.609324
ci.sugar.clt <- c(lc.sugar, uc.sugar)
ci.sugar.clt
## [1] 2.468287 2.609324
The lower and upper limits of the 95% confidence interval are (2.4683, 2.6093).
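The claim that the CLT remains usable for a skewed variable when n is large can be checked by simulation. The sketch below is an illustration under assumptions: it draws from an Exponential population (chosen only because it is right-skewed and has a known mean), not from the wine data, and estimates how often the CLT-based 95% interval covers the true mean.

```r
set.seed(42)  # for reproducibility of this illustration only

n    <- 1599    # same sample size as the wine data
reps <- 1000    # number of simulated samples
true_mean <- 1  # mean of an Exponential(rate = 1) population, which is right-skewed

# Each row is one simulated sample of size n
samples <- matrix(rexp(n * reps, rate = 1), nrow = reps, ncol = n)

xbar <- rowMeans(samples)
se   <- apply(samples, 1, sd) / sqrt(n)

# CLT-based 95% interval for each simulated sample
z <- qnorm(0.975)
covered <- (xbar - z * se <= true_mean) & (true_mean <= xbar + z * se)

coverage <- mean(covered)
print(coverage)  # should be close to 0.95 if the Normal approximation is adequate
```

Rerunning with a much smaller `n` (for example 10) is a simple way to see the coverage degrade when the sample is too small for the skew.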


c. Use the bootstrap method to do part b. Is the bootstrap confidence interval symmetric? (hint: check the bootstrap distribution; see p. 25-26 in Lecture 4).

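mu.sugar.set <- numeric(2000)
n.sugar <- length(red_wine$`residual sugar`)
for (k in 1:2000) {
  # Store each bootstrap mean directly so the sample estimate mu.sugar is not overwritten
  sugar.bootstrap <- sample(red_wine$`residual sugar`, size = n.sugar, replace = TRUE)
  mu.sugar.set[k] <- mean(sugar.bootstrap)
}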

sd(mu.sugar.set)
## [1] 0.03464122
# 95% CI using the bootstrap percentile method

conf.q.sugar <- quantile(mu.sugar.set, probs=c(0.025, 0.975))
conf.q.sugar
##     2.5%    97.5% 
## 2.472620 2.609636
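The bootstrap standard error (0.0346) is almost equal to the standard error calculated using the CLT (0.0353). The bootstrap percentile interval (2.4726, 2.6096) is not exactly symmetric about the sample mean 2.5388: it extends about 0.066 below and about 0.071 above, whereas the CLT interval is symmetric by construction. The asymmetry is small and is in the direction one would expect if the bootstrap distribution retains a slight right skew.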
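The symmetry question can be checked directly from the reported endpoints. The following is a minimal sketch: `gaps` is a hypothetical helper, and all numbers are copied by hand from the output above rather than recomputed from `red_wine`.

```r
# Values copied from the output above (not recomputed from the data)
mu_hat  <- 2.538806
ci_clt  <- c(2.468287, 2.609324)
ci_boot <- c(2.472620, 2.609636)

# Hypothetical helper: distance from the point estimate to each endpoint
gaps <- function(center, ci) {
  c(below = center - ci[1], above = ci[2] - center)
}

gaps_clt  <- gaps(mu_hat, ci_clt)
gaps_boot <- gaps(mu_hat, ci_boot)

print(gaps_clt)   # equal by construction, up to rounding
print(gaps_boot)  # the upper gap is larger than the lower gap
```

A histogram of `mu.sugar.set` would be the natural visual companion to this check, since it shows whether the bootstrap distribution itself is skewed.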


3. We classify those wines as “excellent” if their rating is at least 7. Suppose the population proportion of excellent wines is p. Do the following:

a. Use the CLT to derive a 95% confidence interval for p;

red_wine$excellent <- as.numeric(red_wine$quality > 6)

p <- mean(red_wine$excellent)
p
## [1] 0.1357098
var.excellent <- sqrt(p*(1 - p) / length(red_wine$excellent))
var.excellent
## [1] 0.008564681
lc.p <- p - 2*(var.excellent)
lc.p
## [1] 0.1185805
uc.p <- p + 2*(var.excellent)
uc.p
## [1] 0.1528392
Using the CLT, with 2 as an approximation to the Normal quantile 1.96, the lower and upper limits of the 95% confidence interval for p are (0.1185805, 0.1528392).
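For comparison, the same calculation can be written with the exact Normal quantile and checked against base R's `prop.test()`. This is a sketch under an assumption: the count of 217 excellent wines is inferred from p = 0.1357098 and n = 1599, not read from the data.

```r
# Assumption: 217 excellent wines out of 1599, inferred from p = 0.1357098
x <- 217
n <- 1599

p_hat <- x / n
se_p  <- sqrt(p_hat * (1 - p_hat) / n)  # CLT standard error of a proportion

# Wald interval with the exact Normal quantile instead of 2
ci_wald <- p_hat + c(-1, 1) * qnorm(0.975) * se_p

# Wilson score interval from base R, without continuity correction
ci_wilson <- as.numeric(prop.test(x, n, correct = FALSE)$conf.int)

print(p_hat)
print(ci_wald)
print(ci_wilson)
```

The Wilson interval is not centred exactly on the sample proportion, so small differences from the Wald interval are expected; with a sample this large they should be minor.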


b. Use the bootstrap method to derive a 95% confidence interval for p;

bootstrap_p.set <- numeric(2500)
for (k in 1:2500) {
  # Resample the 0/1 indicator with replacement and record the proportion of excellent wines
  p.bootstrap <- sample(red_wine$excellent, size = length(red_wine$excellent), replace = TRUE)
  bootstrap_p.set[k] <- mean(p.bootstrap)
}

sd(bootstrap_p.set)
## [1] 0.008628812
conf.q.p <- quantile(bootstrap_p.set, probs = c(0.025, 0.975))
conf.q.p
##      2.5%     97.5% 
## 0.1194497 0.1532208
hist(bootstrap_p.set, freq = FALSE)
lines(density(bootstrap_p.set), lwd = 5, col = 'blue')

The lower and upper limits of the 95% confidence interval for p, using the bootstrap, are (0.1194497, 0.1532208).

c. Compare the two intervals. Is there any difference worth our attention?

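There is no meaningful difference between the two intervals. The CLT interval (0.1186, 0.1528) and the bootstrap percentile interval (0.1194, 0.1532) agree to within about 0.001 at each endpoint, and the two standard errors (0.00856 and 0.00863) are nearly identical. With n = 1599 and p not close to 0 or 1, the Normal approximation is adequate, so both methods lead to the same conclusion.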