library(stats4)
RedWine <- read.csv(file = "C:/Users/Aditi Anand/Documents/UC Fall Sem/Applied Statistical Methods/Red Wine/winequality-red.csv", sep=";", header = T)
attach(RedWine)

Question 1 - Density

a). Provide an estimate of μ based on the sample;

mean(density)
## [1] 0.9967467

b). Use the Central Limit Theorem (CLT) to quantify the variability of your estimate;

sd_density <- sd(density)/sqrt(length(density))
cat('Variability of Density : ', sd_density)
## Variability of Density :  4.71981e-05

c). Use the CLT to give a 95% confidence interval for μ.

lower_bound <- mean(density) - 2*sd_density
upper_bound <- mean(density) + 2*sd_density
cat('95% confidence interval of density : ',lower_bound, ' - ',upper_bound)
## 95% confidence interval of density :  0.9966523  -  0.9968411

d). Use the bootstrap method to do parts b and c, and compare the results with those obtained from the CLT. State your findings.

mu.hat.set<-NULL    
for(k in 1:2000){   ### Repeat sampling 2000 times
  sample.bootstrap<-sample(density, size=1599, replace=T)
  mu.hat<-mean(sample.bootstrap)
  mu.hat.set[k]<-mu.hat
}
sd(mu.hat.set)
## [1] 4.70227e-05
quantile(mu.hat.set, probs=c(0.025,0.975))
##      2.5%     97.5% 
## 0.9966560 0.9968378

The variability and confidence interval are similar using CLT and Bootstrap.

e). Can we use a normal distribution to model “density”? If yes, what are the maximum likelihood estimates of the mean and standard deviation? Please provide their standard errors as well.

hist(density)

Yes, Density follows the normal distribution.

density_10 <- density * 10
minuslog.lik<-function(mu, sigma){
  log.lik<-0
  for(i in 1:1599){
    log.lik <- log.lik + log(dnorm(density_10[i], mean = mu, sd = sigma))
  }
  return(-log.lik)
}

est <- mle(minuslog = minuslog.lik, start=list(mu = mean(density_10), sigma = sd(density_10)))
summary(est)@coef/10
##          Estimate   Std. Error
## mu    0.996746679 4.719810e-05
## sigma 0.001887334 3.296881e-05

Question 2 - Residual Sugar

a). Provide an estimate of μ based on the sample;

mean(residual.sugar)
## [1] 2.538806

b).Noting that the sample distribution of “residual sugar” is highly skewed, can we use the CLT to quantify the variability of your estimate? Can we use the CLT to give a 95% confidence interval for μ? If yes, please give your solution. If no, explain why.

hist(residual.sugar)

Regardless of the type or skewness of the data, Centarl Limit Theorem can be applied to quantify the variability of the estimate and get 95% confidence interval for the population mean.

sd_rs <- sd(residual.sugar)/sqrt(length(residual.sugar))
lower_bound_rs <- mean(residual.sugar) - 2*sd_rs
upper_bound_rs <- mean(residual.sugar) + 2*sd_rs
cat('Variability of Residual Sugar : ', sd_rs)
## Variability of Residual Sugar :  0.03525922
cat('\n95% confidence interval of Residual Sugar : ',lower_bound_rs, ' - ',upper_bound_rs)
## 
## 95% confidence interval of Residual Sugar :  2.468287  -  2.609324

c). Use the bootstrap method to do part b. Is the bootstrap confidence interval symmetric?

mu.hat.set<-NULL    
for(k in 1:2000){   ### Repeat sampling 2000 times
  sample.bootstrap<-sample(residual.sugar, size=1599, replace=T)
  mu.hat<-mean(sample.bootstrap)
  mu.hat.set[k]<-mu.hat
}
sd(mu.hat.set)
## [1] 0.03501374
quantile(mu.hat.set, probs=c(0.025,0.975))
##     2.5%    97.5% 
## 2.473387 2.607320
hist(mu.hat.set)

The bootstrap confidence interval is symmetric.

d). Can we use a normal distribution to model “residual sugar”? If no, what distribution do you think can approximate its empirical distribution? What parameters are needed to characterize such a distribution? what are their maximum likelihood estimates? Please provide their standard errors as well.

The histogram of residual sugar is right skewed. Therefore, normal distribution cannot be used.

hist(log(residual.sugar))

Log Normal distribution can be used to model the residual sugar data. The parameters are meanlog and sdlog.

minuslog.lik<-function(mu, sigma){
  log.lik<-0
  for(i in 1:1599){
    log.lik <- log.lik + log(dlnorm(residual.sugar[i], mean = mu, sd = sigma))
  }
  return(-log.lik)
}

est.lognorm <- mle(minuslog = minuslog.lik, start=list(mu = log(mean(residual.sugar)), sigma = log(sd(residual.sugar))))
summary(est.lognorm)@coef
##        Estimate  Std. Error
## mu    0.8502341 0.008935942
## sigma 0.3573260 0.006318293

Question 3 - Quality

We classify those wines as “excellent” if their rating is at least 7. Suppose the population proportion of excellent wines is p.

q_excellent <- ifelse(quality > 6, 1, 0)

a). Use the CLT to derive a 95% confidence interval for p;

sd_q <- sd(q_excellent)/sqrt(length(q_excellent))
lower_bound_rs <- mean(q_excellent) - 2*sd_q
upper_bound_rs <- mean(q_excellent) + 2*sd_q
cat('95% confidence interval of p : ',lower_bound_rs, ' - ',upper_bound_rs)
## 95% confidence interval of p :  0.1185751  -  0.1528445

b). Use the bootstrap method to derive a 95% confidence interval for p;

mu.hat.set<-NULL    
for(k in 1:2000){   ### Repeat sampling 2000 times
  sample.bootstrap<-sample(q_excellent, size=1599, replace=T)
  mu.hat<-mean(sample.bootstrap)
  mu.hat.set[k]<-mu.hat
}
quantile(mu.hat.set, probs=c(0.025,0.975))
##      2.5%     97.5% 
## 0.1188243 0.1525954

c). Compare the two intervals. Is there any difference worth our attention? Both the intervals are very close to each other.

d). What is the maximum likelihood estimate of p and its standard error?

minuslog.lik<-function(p){
  log.lik <- 0
  for (i in 1:1599) {
    log.lik <- log.lik + log(dbinom(q_excellent[i], size = 1, prob = p))
  }
  return (-log.lik)
}
est <- mle(minuslog=minuslog.lik, start=list(p=mean(q_excellent)))
summary(est)@coef
##    Estimate  Std. Error
## p 0.1357098 0.008564278

The maximum likelihood estimate of p is 0.1357098 and its standard error is 0.008564278.