Final Project - Application of Analytical Tools

Introduction

The purpose of this project is to conduct an analysis using the tools and knowledge acquired during this semester long course. I will be using a data set containing 1599 different red wines with 12 associated variables for each observation. I will be using the statistical software R to conduct my analysis and I will be applying my understanding of empirical CDF, bootstrapping, MLE, hypothesis testing, and Bayesian analysis.

library(dplyr)
library(stats4)
library(DT)
library(tinytex)
library(bootstrap)
library(ggplot2)

red_wine <- read.csv("C:/Users/Chris Steege/Documents/Classes/Fall 2020/Statistical Methods/winequality-red.csv", header=FALSE, sep=';', skip=1)
colnames(red_wine) <- c("fixed acidity","volatile acidity","citric acid","residual sugar","chlorides","free sulfur dioxide","total sulfur dioxide","density","pH","sulphates","alcohol","quality")

Data Preview

In this dataset, the rows represent a unique wine, and the columns give us information about the wines.

library(DT)
datatable(head(red_wine,25))

Data Description

Here, I will provide a table with the variable names, types, and descriptions.

Var_type <- lapply(red_wine, class)
var_description <- c("Fixed Acidity Content", "Volatile Acidity Content", "Citric Acid Content", "Residual Sugar Content", "Chloride Content", "Free SUlfur DIoxide Content", "Total Sulfur Dioxide Content", "Wine Density", "Wine pH Level", "Sulphates Content", "Alcohol Content (ABV)", "Rating Given by Customers")
Variable_names <- colnames(red_wine)
data_description <- as_data_frame(cbind(Variable_names,Var_type, var_description))
colnames(data_description) <- c("Variable Name", "Data Type", "Variable Description")
library(knitr)
kable(data_description)

Variable Name	Data Type	Variable Description
fixed acidity	numeric	Fixed Acidity Content
volatile acidity	numeric	Volatile Acidity Content
citric acid	numeric	Citric Acid Content
residual sugar	numeric	Residual Sugar Content
chlorides	numeric	Chloride Content
free sulfur dioxide	numeric	Free SUlfur DIoxide Content
total sulfur dioxide	numeric	Total Sulfur Dioxide Content
density	numeric	Wine Density
pH	numeric	Wine pH Level
sulphates	numeric	Sulphates Content
alcohol	numeric	Alcohol Content (ABV)
quality	integer	Rating Given by Customers

To help understand the data and have a reference, I’ll create a table of summary statistics for each varaible.

tmp <- do.call(data.frame, 
               list(mean = apply(red_wine, 2, mean), 
                    sd = apply(red_wine, 2, sd),
                    median = apply(red_wine, 2, median),
                    min = apply(red_wine, 2, min),
                    max = apply(red_wine, 2, max),
                    n = apply(red_wine, 2, length)))
tmp

##                             mean           sd   median     min       max    n
## fixed acidity         8.31963727  1.741096318  7.90000 4.60000  15.90000 1599
## volatile acidity      0.52782051  0.179059704  0.52000 0.12000   1.58000 1599
## citric acid           0.27097561  0.194801137  0.26000 0.00000   1.00000 1599
## residual sugar        2.53880550  1.409928060  2.20000 0.90000  15.50000 1599
## chlorides             0.08746654  0.047065302  0.07900 0.01200   0.61100 1599
## free sulfur dioxide  15.87492183 10.460156970 14.00000 1.00000  72.00000 1599
## total sulfur dioxide 46.46779237 32.895324478 38.00000 6.00000 289.00000 1599
## density               0.99674668  0.001887334  0.99675 0.99007   1.00369 1599
## pH                    3.31111320  0.154386465  3.31000 2.74000   4.01000 1599
## sulphates             0.65814884  0.169506980  0.62000 0.33000   2.00000 1599
## alcohol              10.42298311  1.065667582 10.20000 8.40000  14.90000 1599
## quality               5.63602251  0.807569440  6.00000 3.00000   8.00000 1599

Empiricle CDF

By plotting the Density of the wine on a histogram, you can see that density nearly follows a normal distribution.

hist(red_wine[,8], main="Histogram for density")

You can plot the Empiricle CDF by sorting the data and plotting according to how many points are less than each value. Additionaly, I have found the limiting distribution by finding the acceptable error based on the confidence we choose and adding or subtracting from the empiracle CDF. In this case, I want a 95% confidence band, so I chose alpha = .05.

Thie limiting distribution tells us that we are 95% confident that every point in the true distribution lies within this band.

library(ACSWR)
density.ecdf <- ecdf(red_wine[,8])
plot(density.ecdf)

plot.ecdf(red_wine[,8])

density.ecdf(1)-density.ecdf(0.5)

## [1] 0.9555972

#Estimate P(0.5<X<1)
Alpha=.05
n=length(red_wine[,8])
Eps=sqrt(log(2/Alpha)/(2*n))
grid<-seq(0,1.5, length.out = 1000)
lines(grid, pmin(density.ecdf(grid)+Eps,1))

lines(grid, pmax(density.ecdf(grid)-Eps,0))

We can perform the same procedure for Sulfur which follows a more abnormal (Due to it having a floor value) and skewed distribution.

hist(red_wine[,7], main="Histogram for total sulfur dioxide")

Again, the limiting distribution can be formed by find the error which tells us that we have a confidence of 95% that the true distribution lies between the bands. The bands hug tightly to the empiricle cdf, because we have many data points that increase our confidence.

Sulfur.ecdf <- ecdf(red_wine[,7])
plot(Sulfur.ecdf)

plot.ecdf(red_wine[,7])

Sulfur.ecdf(1)-Sulfur.ecdf(0.5)

## [1] 0

#Estimate P(0.5<X<1)
Alpha=.05
n=length(red_wine[,7])
Eps=sqrt(log(2/Alpha)/(2*n))
grid<-seq(0,320, length.out = 1000)
lines(grid, pmin(Sulfur.ecdf(grid)+Eps,1))

lines(grid, pmax(Sulfur.ecdf(grid)-Eps,0))

Mean Estimation

Here we are going to compare mean estimation methods using CLT, MLE, and Bootstrap

Using the central limit theorem, we can quantify the variability of our mean estimation.

Mean = NULL
CLT_se = NULL
for(i in 1:12){
  Mean[i] = mean(red_wine[,i])
}
for(i in 1:12){
  CLT_se[i] <-sd(red_wine[,i])/sqrt(length(red_wine[,i]))
}

Our confidence intervals are derived from the standard error.

CLT_LB = NULL
CLT_UB = NULL
for(i in 1:12){
  CLT_LB[i] <- Mean[i] - 2*CLT_se[i]
  CLT_UB[i] <- Mean[i] + 2*CLT_se[i]
}

Using the bootstrap method, we can also derive a mean estimate and the variability of our estimate with a confidence interval.

Bootstrap_mean = NULL
Bootstrap_se = NULL
Boostrap_LB = NULL
Boostrap_UB = NULL

for(i in 1:12){
  mu.hat.set <- NULL
  for (k in 1:2000) {
    Bootstrap <- sample(red_wine[,i], size=1599, replace = T)
    mu.hat <- mean(Bootstrap)
    mu.hat.set[k] <- mu.hat
  }

  Bootstrap_mean[i] <- mean(mu.hat.set)
  Bootstrap_se[i] <- sd(mu.hat.set)


  Boostrap_LB[i] = Bootstrap_mean[i] - 2*Bootstrap_se[i]
  Boostrap_UB[i] = Bootstrap_mean[i] + 2*Bootstrap_se[i]

}

Finally, MLE can also be used to derive the mean and standard error.

Scale = c(10,10,10,10,100,10,10,10,10,10,10,10)
MLE_mean = NULL
MLE_se = NULL
MLE_LB = NULL
MLE_UB = NULL
for(j in 1:12){
  minuslog.lik <- function(mu, sigma){
    log.lik <- 0
    for(i in 1:1599) {
      log.lik <- log.lik+log(dnorm(Scale[j]*red_wine[i,j], mean = mu, sd=sigma))
    }
    return (-log.lik)
  }

  density.est <- mle(minuslog=minuslog.lik, start=list(mu=mean(Scale[j]*red_wine[,j]), sigma = sd(Scale[j]*red_wine[,j])))

  MLE_mean[j] =((summary(density.est)@coef/Scale[j])[1,1])
  MLE_se[j] =((summary(density.est)@coef/Scale[j])[1,2])
  MLE_LB = MLE_mean[j] - 2*MLE_se[j]
  MLE_UB = MLE_mean[j] + 2*MLE_se[j]

}

MLE_mean

##  [1]  8.31963727  0.52782044  0.27097561  2.53880550  0.08746654 15.87492183
##  [7] 46.46779237  0.99674668  3.31111320  0.65814883 10.42298312  5.63602251

Here, I want to compare the mean and standard errors of our mean estimates across CLT, Bootstrapping, and MLE.

THe mean estimates for our using the the regular mean procedure and MLE were the same. The bootstrapping means for each varaible were close to the others but are always slightly higher or lower. The standard errors of these estimates were all different for each variable, but they all have very similar values.

Means <- data.frame(Mean, Bootstrap_mean, MLE_mean)
Standard_errors <- data.frame(CLT_se, Bootstrap_se, MLE_se)
Means

##           Mean Bootstrap_mean    MLE_mean
## 1   8.31963727      8.3194250  8.31963727
## 2   0.52782051      0.5278322  0.52782044
## 3   0.27097561      0.2708043  0.27097561
## 4   2.53880550      2.5381304  2.53880550
## 5   0.08746654      0.0874046  0.08746654
## 6  15.87492183     15.8724886 15.87492183
## 7  46.46779237     46.4573097 46.46779237
## 8   0.99674668      0.9967456  0.99674668
## 9   3.31111320      3.3111262  3.31111320
## 10  0.65814884      0.6581992  0.65814883
## 11 10.42298311     10.4222258 10.42298312
## 12  5.63602251      5.6361979  5.63602251

Standard_errors

##          CLT_se Bootstrap_se       MLE_se
## 1  0.0435410166 4.440835e-02 0.0435273403
## 2  0.0044778922 4.421293e-03 0.0044764919
## 3  0.0048715510 4.955018e-03 0.0048700290
## 4  0.0352592217 3.589516e-02 0.0352481981
## 5  0.0011770004 1.181359e-03 0.0011766323
## 6  0.2615856825 2.581509e-01 0.2615325653
## 7  0.8226402272 8.336501e-01 0.8231282669
## 8  0.0000471981 4.657363e-05 0.0000471981
## 9  0.0038608683 3.930420e-03 0.0038596617
## 10 0.0042389994 4.272490e-03 0.0042376735
## 11 0.0266500190 2.671026e-02 0.0266416885
## 12 0.0201955481 2.014512e-02 0.0201831763

Bootstrap Correlation

Using bootstrapping, We can estimate the correlation coefficient between the predictor variables and the quality of the wine. In this case, I have used the citric acid content as an example and found that the bootstrap correlation between it and wine quality was .2263 with a standard error of .0392. This gives us a correlation 95% CI interval from .148 to .305.

#x: index/row number
#xdata: the data frame
theta <- function(x,xdata){ cor(xdata[x,1],xdata[x,2]) }



# sample from the population
n <- 500
results <- bootstrap(1:n,3200,theta, red_wine[,c(3,12)])
par(mfrow=c(2,1))
hist(results$thetastar)
var(results$thetastar)

## [1] 0.002242354

##Bootstrap confidence intervals
results <- bootstrap(1:n,3200,theta, red_wine[,c(3,12)])
se.boot=sqrt(var(results$thetastar))
cor.hat<-cor(red_wine[,3], red_wine[,12])
normal.ci<-c(cor.hat-2*se.boot, cor.hat+2*se.boot)
pivatol.ci<-c(2*cor.hat-quantile(results$thetastar,0.975), 2*cor.hat-quantile(results$thetastar,0.025))
quantile.ci<-quantile(results$thetastar, c(0.025, 0.975))
print(quantile.ci)

##      2.5%     97.5% 
## 0.1417577 0.3343433

Hypothesis Testing

Here I am going to convert wine quality into a binary variable where if the quality rating is greater than 5, it is considered a quality wine. I will then perform a t-test between the citric acid content and quality to see what kind of role citric acid plays in determining if a wine is quality. Our null hypothesis is that the true difference in means is equal to 0.

We computed the means between the two groups and found that not quality wines had a mean of about .2378 while quality wines had a mean of about .2999. The difference between the means of this group was considered to be statistically significant, because we have a very low p-value. Therefore we will reject the null hypothesis.

red_wine$Quality_bin <- ifelse(red_wine$quality>5,1,0)

red_wine %>%
  group_by(Quality_bin) %>%
  summarize(Cit_mean = mean(`citric acid`))

## # A tibble: 2 x 2
##   Quality_bin Cit_mean
##         <dbl>    <dbl>
## 1           0    0.238
## 2           1    0.300

qplot(x = Quality_bin, y=red_wine$`citric acid`, geom = 'boxplot', data = red_wine)

red_wine.t.test <- t.test(red_wine$`citric acid` ~ red_wine$Quality_bin)

red_wine.t.test

## 
##  Welch Two Sample t-test
## 
## data:  red_wine$`citric acid` by red_wine$Quality_bin
## t = -6.4799, df = 1592.5, p-value = 1.22e-10
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.08093360 -0.04332173
## sample estimates:
## mean in group 0 mean in group 1 
##       0.2377554       0.2998830

Probability Models