Wine Project BANA 7051

Part A

Q1. What is the sample size?

glimpse(data_rw)

## Rows: 1,599
## Columns: 12
## $ fixed.acidity        <dbl> 7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 7.9, 7.3, 7.8, 7.5…
## $ volatile.acidity     <dbl> 0.700, 0.880, 0.760, 0.280, 0.700, 0.660, 0.600, …
## $ citric.acid          <dbl> 0.00, 0.00, 0.04, 0.56, 0.00, 0.00, 0.06, 0.00, 0…
## $ residual.sugar       <dbl> 1.9, 2.6, 2.3, 1.9, 1.9, 1.8, 1.6, 1.2, 2.0, 6.1,…
## $ chlorides            <dbl> 0.076, 0.098, 0.092, 0.075, 0.076, 0.075, 0.069, …
## $ free.sulfur.dioxide  <dbl> 11, 25, 15, 17, 11, 13, 15, 15, 9, 17, 15, 17, 16…
## $ total.sulfur.dioxide <dbl> 34, 67, 54, 60, 34, 40, 59, 21, 18, 102, 65, 102,…
## $ density              <dbl> 0.9978, 0.9968, 0.9970, 0.9980, 0.9978, 0.9978, 0…
## $ pH                   <dbl> 3.51, 3.20, 3.26, 3.16, 3.51, 3.51, 3.30, 3.39, 3…
## $ sulphates            <dbl> 0.56, 0.68, 0.65, 0.58, 0.56, 0.56, 0.46, 0.47, 0…
## $ alcohol              <dbl> 9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10.0, 9.5, 10.…
## $ quality              <int> 5, 5, 5, 6, 5, 5, 5, 7, 7, 5, 5, 5, 5, 5, 5, 5, 7…

# any missing data?
sum(is.na(data_rw))

## [1] 0

There is no missing data so the sample size is 1599.

Q2. Any outliers? Do you have any concerns about the data quality?

In order to identify the outliers, boxplots are used.

par(mfrow=c(3,4))
for (i in 1:(length(data_rw))){
  boxplot(x = data_rw[i], 
          horizontal = TRUE, 
          main = sprintf('Boxplot of %s', 
                         colnames(data_rw[i])),
          xlab = colnames(data_rw[i]))
}

From the boxplots above we see that outliers exist for almost all variables, just to different extent. To determine whether the extent of existing outliers would cause any concerns, we use the outlier function.

is_outlier <- function(x) {
  return(x < quantile(x, 0.25) - 1.5 * IQR(x) | 
           x > quantile(x, 0.75) + 1.5 * IQR(x))
}
outlier <- data.frame(variable = character(), 
                      sum_outliers = integer(),
                      stringsAsFactors=FALSE)

for (j in 1:(length(data_rw)-1)){
  variable <- colnames(data_rw[j])
  for (i in data_rw[j]){
    sum.outliers <- sum(is_outlier(i))
  }
  row <- data.frame(variable, sum.outliers)
  outlier <- rbind(outlier, row)
}
top_n(outlier,12)

## Selecting by sum.outliers

##                variable sum.outliers
## 1         fixed.acidity           49
## 2      volatile.acidity           19
## 3           citric.acid            1
## 4        residual.sugar          155
## 5             chlorides          112
## 6   free.sulfur.dioxide           30
## 7  total.sulfur.dioxide           55
## 8               density           45
## 9                    pH           35
## 10            sulphates           59
## 11              alcohol           13

#kable(outlier) this function could not be knitted for some reason.

Since the number of outliers are generally less than 10% of the total sample size (less than 159), I wouldn’t be too concerned about them compromising data quality.

Q3. How can you summarize the data of each variable in a concise way? What statistics are you going to present?

For all variables, mean, sd, min, max, and median were computed.

Nn=dim(data_rw)[1]
sum.tab= cbind(apply(cbind(data_rw), 2, mean,na.rm =T),
              apply(cbind(data_rw), 2, sd, na.rm =T),
              apply(cbind(data_rw), 2, min,na.rm =T),  
              apply(cbind(data_rw), 2, max,na.rm =T),
              apply(cbind(data_rw), 2, median, na.rm =T))
colnames(sum.tab)= c("Mean", "Std.","Min", "Max",  "Median")
#kable(sum.tab) this function cannot be knitted

Q4. How do you visualize the distribution of each variable?

Histograms were used to demonstrate data distribution for each variable.

hist.data.frame(data_rw)

### Q5. Do you see any skewed distribution?

The histograms above suggest that most variables are slightly skewed to the right expect for density and pH which seem to be perfectly normal distributed. Among the skewed variables, citric.acid seems to be very interesting as most of the cases were actually 0, and the other values were observed fairly evenly across the cases. Caution will be necessary when computing or interpreting this variable.

Part B

Q1. Suppose the population mean of the variable “density” is μ, do the following inferences:

a. Provide an estimate of μ based on the sample

attach(data_rw)
mu.hat.density<-round(mean(density),3)
mu.hat.density

## [1] 0.997

the estimate mu hat is unbiased as the variable density is a random variable that seems to be normally distributed. Therefore, mu hat, which is the mean of the sample could be used to inference the real mu which is the mean of the population.In other words, mu = 0.997

b. Use the Central Limit Theorem (CLT) to quantify the variability of your estimate

sd.mu.hat.density<-sd(density)/sqrt(length(density))
sd.mu.hat.density

## [1] 4.71981e-05

The variability of my estimate is quantified through standard deviation which is 4.71981e-05.

c. Use the CLT to give a 95% confidence interval for μ.

c(mu.hat.density-2*sd.mu.hat.density,  mu.hat.density+2* sd.mu.hat.density)

## [1] 0.9969056 0.9970944

The 95% CI is estimated to be 0.9969056 to 0.9970944 according to CLT.

d.Use the bootstrap method to do parts b and c, and compare the results with those obtained from the CLT. State your findings.

mu.hat.set <- NULL
for (k in 1:2000){
  data_rw_bs <-sample(density,size=800, replace=T)
  mu.hat.2<-mean(data_rw_bs)
  mu.hat.set[k] <-mu.hat.2
}

mean(mu.hat.set)

## [1] 0.9967449

sd(mu.hat.set)

## [1] 6.850856e-05

quantile(mean(mu.hat.set), probs=c(0.025, 0.975))

##      2.5%     97.5% 
## 0.9967449 0.9967449

Using the Bootstrap technique, the estimated μ is 0.997 with a standard deviation of 6.757452e-05. The 95% confidence interval is estimated to be 0.9967453 to 0.9967453. Comparing to the results obtained based on CLT, results based on Bootstrap are similar.

e. Can we use a normal distribution to model “density”? If yes, what are the maximum likelihood estimates of the mean and standard deviation? Please provide their standard errors as well.

#multiply density data points by 10 and create a new variable `density_new`
rw_new<-data_rw %>%
  mutate(density_new=density*10)

minuslog.lik<-function(mu, sigma){ log.lik<-0
for(i in 1:800){
     log.lik<-log.lik+log(dnorm(density_new[i], mean=mu,
 sd=sigma))
}
   return(-log.lik)
 }
attach(rw_new)
est <- mle(minuslog=minuslog.lik,
start=list(mu=mean(density_new),
sigma=sd(density_new)))
summary(est)

## Maximum likelihood estimation
## 
## Call:
## mle(minuslogl = minuslog.lik, start = list(mu = mean(density_new), 
##     sigma = sd(density_new)))
## 
## Coefficients:
##         Estimate   Std. Error
## mu    9.97646408 0.0005953600
## sigma 0.01683932 0.0004150282
## 
## -2 log L: -4265.989

The maximum likelihood of mean and standard deviation are 0.9976 and 0.00168; the standard errors for them are 0.000059 and 0.000042 respectively.

Q2. Suppose the population mean of the variable “residual sugar” is μ , answer the following questions.

a. Provide an estimate of μ based on the sample

mu.hat.rs<-round(mean(residual.sugar),3)
mu.hat.rs

## [1] 2.539

According to CLT, population mean \(\mu\) is estimated to equal to mu hat, which is 2.539.

b. Noting that the sample distribution of “residual sugar” is highly skewed, can we use the CLT to quantify the variability of your estimate? Can we use the CLT to give a 95% confidence interval for μ? If yes, please give your solution. If no, explain why.

hist(residual.sugar)

For highly skewed distribution like residual.sugar, CLT can still be used to calculate the 95% confidence interval, we will just identify the value of the 2.5% and 97.5%, the values in between will be the 95% CI.The output below shows us that the 95% CI for this variable is 2.539 to 2.539.

quantile(mu.hat.rs, probs=c(0.025, 0.975))

##  2.5% 97.5% 
## 2.539 2.539

c. Use the bootstrap method to do part b. Is the bootstrap confidence interval symmetric?

mu.hat.set.rs <- NULL
for (k in 1:2000){
  data_rw_bs_rs <-sample(residual.sugar,size=800, replace=T)
  mu.hat.3<-mean(data_rw_bs_rs)
  mu.hat.set.rs[k] <-mu.hat.3
}

mean(mu.hat.set.rs)

## [1] 2.540807

sd(mu.hat.set.rs)

## [1] 0.05050381

#95% Confidence interval
quantile(mu.hat.set.rs, probs=c(0.025, 0.975))

##     2.5%    97.5% 
## 2.445494 2.643694

#distribution of the set of means from bootstrap
hist(mu.hat.set.rs)

Although the distribution for residual.sugar is highly skewed, when use the bootstarp technique to generate set of means, the distribution of the drawn sample means are representing a normal distribution. In other words, the 95% CI is considered symmetric.

d. Can we use a normal distribution to model “residual sugar”? If no, what distribution do you think can approximate its empirical distribution? What parameters are needed to characterize such a distribution? what are their maximum likelihood estimates? Please provide their standard errors as well.

minuslog.lik<-function(mu, sigma){ log.lik<-0
for(i in 1:800){
log.lik = log.lik + log(dlnorm(x=residual.sugar[i], meanlog = mu, sdlog = sigma)) }
   return(-log.lik)
 }
est.lognorm<-mle(minuslog=minuslog.lik, start=list(mu=log(mean(residual.sugar)), sigma=log(sd(residual.sugar))))
 summary(est.lognorm)

## Maximum likelihood estimation
## 
## Call:
## mle(minuslogl = minuslog.lik, start = list(mu = log(mean(residual.sugar)), 
##     sigma = log(sd(residual.sugar))))
## 
## Coefficients:
##        Estimate  Std. Error
## mu    0.8919587 0.012070414
## sigma 0.3414029 0.008534798
## 
## -2 log L: 1977.921

Due to the skewness of the distribution for residual.sugar, it would be inappropriate to model this variable with a normal distribution. A log-normal distribution, however, could be used to model this variable. To characterize such a distribution, μ and σ are the two parameters needed. the R code above calculates the maximum likelihood estimate for these parameters.

3. We classify those wines as “excellent” if their rating is at least 7. Suppose the population proportion of excellent wines is p. Do the following:

a. Use the CLT to derive a 95% confidence interval for p

data_rw_rating <- data_rw%>%
  mutate(rating = case_when(quality >=7 ~'Excellent',
                            TRUE ~ 'average'))
attach(data_rw_rating)

#getting the proportion of 'Excellent'
table(rating)

## rating
##   average Excellent 
##      1382       217

p.ex=217/1599 #point estimate of the population proportion
q=1-p.ex
CI.l=p.ex-1.96*sqrt((p.ex*q)/1599)
CI.u=p.ex+1.96*sqrt((p.ex*q)/1599)
CI.ex=c(CI.l, CI.u)
CI.ex

## [1] 0.1189230 0.1524966

Theoretically, the 95% confidence interval for p is 12% to 15%

b. Use the bootstrap method to derive a 95% confidence interval for p

bs.ex <-do(1000)*rflip(1000, prob=p.ex)
se.ex <-sd(~prop, data=bs.ex) #standard error

#95% Confidence interval
CI.ex.bs <-c(0.134-2*se.ex, 0.134+2*se.ex)
CI.ex.bs

## [1] 0.1126986 0.1553014

Using the bootstrap method, the 95% confidence interval for p is 11% to 16%

c. Compare the two intervals. Is there any difference worth our attention?

The 95% confidence interval derived from the bootstrap method is slightly wider than the one calculated based on CLT. However, the results are very close thus should be reliable.

d. What is the maximum likelihood estimate of p and its standard error?

minuslog.lik<-function(p){
  -log(dbinom(x=217, size=1599, prob=p))
}
est <- mle(minuslog=minuslog.lik, start=list(p=0.5))
summary(est)

## Maximum likelihood estimation
## 
## Call:
## mle(minuslogl = minuslog.lik, start = list(p = 0.5))
## 
## Coefficients:
##    Estimate  Std. Error
## p 0.1357098 0.008564279
## 
## -2 log L: 7.072712

The calculation above suggests that the maximum likelihood estimate of p is 0.136, and the standard error is 0.0086.

4. Is there anything confusing or hard to understand about this course?

(Please take a moment and think through your comments to be as specific as possible so that I can determine what steps to take to make my teaching more effective for you)

This course is a very good refresher for me in terms of the fundamentals about the CLT and R. The videos and weekly meetings are all very helpful for me to understand the concepts and material. As someone from a non-business and non-statistical background, some of the equations and mathematics concepts (e.g., log, probability) post some challenges for me. If there are dedicated sessions to explain some of the basics, that would be great. Another thing is that it would be great if we could have a discussion about how the concepts and skills that we have learned could be applied to real business problems. For example, how could the things that we did be helpful in business world? In what situations are bootstrap or MLE be useful? Just some my thoughts.

Wine Project BANA 7051

Nan Yang

9/8/2022

Part A

Q1. What is the sample size?

Q2. Any outliers? Do you have any concerns about the data quality?

Q3. How can you summarize the data of each variable in a concise way? What statistics are you going to present?

Q4. How do you visualize the distribution of each variable?

Part B

Q1. Suppose the population mean of the variable “density” is μ, do the following inferences:

a. Provide an estimate of μ based on the sample

b. Use the Central Limit Theorem (CLT) to quantify the variability of your estimate

c. Use the CLT to give a 95% confidence interval for μ.

d.Use the bootstrap method to do parts b and c, and compare the results with those obtained from the CLT. State your findings.

e. Can we use a normal distribution to model “density”? If yes, what are the maximum likelihood estimates of the mean and standard deviation? Please provide their standard errors as well.

Q2. Suppose the population mean of the variable “residual sugar” is μ , answer the following questions.

a. Provide an estimate of μ based on the sample

b. Noting that the sample distribution of “residual sugar” is highly skewed, can we use the CLT to quantify the variability of your estimate? Can we use the CLT to give a 95% confidence interval for μ? If yes, please give your solution. If no, explain why.

c. Use the bootstrap method to do part b. Is the bootstrap confidence interval symmetric?

3. We classify those wines as “excellent” if their rating is at least 7. Suppose the population proportion of excellent wines is p. Do the following:

a. Use the CLT to derive a 95% confidence interval for p

b. Use the bootstrap method to derive a 95% confidence interval for p

c. Compare the two intervals. Is there any difference worth our attention?

d. What is the maximum likelihood estimate of p and its standard error?

4. Is there anything confusing or hard to understand about this course?