glimpse(data_rw)
## Rows: 1,599
## Columns: 12
## $ fixed.acidity <dbl> 7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 7.9, 7.3, 7.8, 7.5…
## $ volatile.acidity <dbl> 0.700, 0.880, 0.760, 0.280, 0.700, 0.660, 0.600, …
## $ citric.acid <dbl> 0.00, 0.00, 0.04, 0.56, 0.00, 0.00, 0.06, 0.00, 0…
## $ residual.sugar <dbl> 1.9, 2.6, 2.3, 1.9, 1.9, 1.8, 1.6, 1.2, 2.0, 6.1,…
## $ chlorides <dbl> 0.076, 0.098, 0.092, 0.075, 0.076, 0.075, 0.069, …
## $ free.sulfur.dioxide <dbl> 11, 25, 15, 17, 11, 13, 15, 15, 9, 17, 15, 17, 16…
## $ total.sulfur.dioxide <dbl> 34, 67, 54, 60, 34, 40, 59, 21, 18, 102, 65, 102,…
## $ density <dbl> 0.9978, 0.9968, 0.9970, 0.9980, 0.9978, 0.9978, 0…
## $ pH <dbl> 3.51, 3.20, 3.26, 3.16, 3.51, 3.51, 3.30, 3.39, 3…
## $ sulphates <dbl> 0.56, 0.68, 0.65, 0.58, 0.56, 0.56, 0.46, 0.47, 0…
## $ alcohol <dbl> 9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10.0, 9.5, 10.…
## $ quality <int> 5, 5, 5, 6, 5, 5, 5, 7, 7, 5, 5, 5, 5, 5, 5, 5, 7…
# any missing data?
sum(is.na(data_rw))
## [1] 0
There is no missing data so the sample size is 1599.
In order to identify the outliers, boxplots are used.
par(mfrow=c(3,4))
for (i in 1:(length(data_rw))){
boxplot(x = data_rw[i],
horizontal = TRUE,
main = sprintf('Boxplot of %s',
colnames(data_rw[i])),
xlab = colnames(data_rw[i]))
}
From the boxplots above we see that outliers exist for almost all variables, just to different extent. To determine whether the extent of existing outliers would cause any concerns, we use the outlier function.
is_outlier <- function(x) {
return(x < quantile(x, 0.25) - 1.5 * IQR(x) |
x > quantile(x, 0.75) + 1.5 * IQR(x))
}
outlier <- data.frame(variable = character(),
sum_outliers = integer(),
stringsAsFactors=FALSE)
for (j in 1:(length(data_rw)-1)){
variable <- colnames(data_rw[j])
for (i in data_rw[j]){
sum.outliers <- sum(is_outlier(i))
}
row <- data.frame(variable, sum.outliers)
outlier <- rbind(outlier, row)
}
top_n(outlier,12)
## Selecting by sum.outliers
## variable sum.outliers
## 1 fixed.acidity 49
## 2 volatile.acidity 19
## 3 citric.acid 1
## 4 residual.sugar 155
## 5 chlorides 112
## 6 free.sulfur.dioxide 30
## 7 total.sulfur.dioxide 55
## 8 density 45
## 9 pH 35
## 10 sulphates 59
## 11 alcohol 13
#kable(outlier) this function could not be knitted for some reason.
Since the number of outliers are generally less than 10% of the total sample size (less than 159), I wouldn’t be too concerned about them compromising data quality.
For all variables, mean, sd, min, max, and median were computed.
Nn=dim(data_rw)[1]
sum.tab= cbind(apply(cbind(data_rw), 2, mean,na.rm =T),
apply(cbind(data_rw), 2, sd, na.rm =T),
apply(cbind(data_rw), 2, min,na.rm =T),
apply(cbind(data_rw), 2, max,na.rm =T),
apply(cbind(data_rw), 2, median, na.rm =T))
colnames(sum.tab)= c("Mean", "Std.","Min", "Max", "Median")
#kable(sum.tab) this function cannot be knitted
Histograms were used to demonstrate data distribution for each variable.
hist.data.frame(data_rw)
### Q5. Do you see any skewed distribution?
The histograms above suggest that most variables are slightly skewed to the right expect for density and pH which seem to be perfectly normal distributed. Among the skewed variables, citric.acid seems to be very interesting as most of the cases were actually 0, and the other values were observed fairly evenly across the cases. Caution will be necessary when computing or interpreting this variable.
attach(data_rw)
mu.hat.density<-round(mean(density),3)
mu.hat.density
## [1] 0.997
the estimate mu hat is unbiased as the variable density is a random variable that seems to be normally distributed. Therefore, mu hat, which is the mean of the sample could be used to inference the real mu which is the mean of the population.In other words, mu = 0.997
sd.mu.hat.density<-sd(density)/sqrt(length(density))
sd.mu.hat.density
## [1] 4.71981e-05
The variability of my estimate is quantified through standard deviation which is 4.71981e-05.
c(mu.hat.density-2*sd.mu.hat.density, mu.hat.density+2* sd.mu.hat.density)
## [1] 0.9969056 0.9970944
The 95% CI is estimated to be 0.9969056 to 0.9970944 according to CLT.
mu.hat.set <- NULL
for (k in 1:2000){
data_rw_bs <-sample(density,size=800, replace=T)
mu.hat.2<-mean(data_rw_bs)
mu.hat.set[k] <-mu.hat.2
}
mean(mu.hat.set)
## [1] 0.9967449
sd(mu.hat.set)
## [1] 6.850856e-05
quantile(mean(mu.hat.set), probs=c(0.025, 0.975))
## 2.5% 97.5%
## 0.9967449 0.9967449
Using the Bootstrap technique, the estimated μ is 0.997 with a standard deviation of 6.757452e-05. The 95% confidence interval is estimated to be 0.9967453 to 0.9967453. Comparing to the results obtained based on CLT, results based on Bootstrap are similar.
#multiply density data points by 10 and create a new variable `density_new`
rw_new<-data_rw %>%
mutate(density_new=density*10)
minuslog.lik<-function(mu, sigma){ log.lik<-0
for(i in 1:800){
log.lik<-log.lik+log(dnorm(density_new[i], mean=mu,
sd=sigma))
}
return(-log.lik)
}
attach(rw_new)
est <- mle(minuslog=minuslog.lik,
start=list(mu=mean(density_new),
sigma=sd(density_new)))
summary(est)
## Maximum likelihood estimation
##
## Call:
## mle(minuslogl = minuslog.lik, start = list(mu = mean(density_new),
## sigma = sd(density_new)))
##
## Coefficients:
## Estimate Std. Error
## mu 9.97646408 0.0005953600
## sigma 0.01683932 0.0004150282
##
## -2 log L: -4265.989
The maximum likelihood of mean and standard deviation are 0.9976 and 0.00168; the standard errors for them are 0.000059 and 0.000042 respectively.
mu.hat.rs<-round(mean(residual.sugar),3)
mu.hat.rs
## [1] 2.539
According to CLT, population mean \(\mu\) is estimated to equal to mu hat, which is 2.539.
hist(residual.sugar)
For highly skewed distribution like residual.sugar, CLT can still be used to calculate the 95% confidence interval, we will just identify the value of the 2.5% and 97.5%, the values in between will be the 95% CI.The output below shows us that the 95% CI for this variable is 2.539 to 2.539.
quantile(mu.hat.rs, probs=c(0.025, 0.975))
## 2.5% 97.5%
## 2.539 2.539
mu.hat.set.rs <- NULL
for (k in 1:2000){
data_rw_bs_rs <-sample(residual.sugar,size=800, replace=T)
mu.hat.3<-mean(data_rw_bs_rs)
mu.hat.set.rs[k] <-mu.hat.3
}
mean(mu.hat.set.rs)
## [1] 2.540807
sd(mu.hat.set.rs)
## [1] 0.05050381
#95% Confidence interval
quantile(mu.hat.set.rs, probs=c(0.025, 0.975))
## 2.5% 97.5%
## 2.445494 2.643694
#distribution of the set of means from bootstrap
hist(mu.hat.set.rs)
Although the distribution for residual.sugar is highly skewed, when use the bootstarp technique to generate set of means, the distribution of the drawn sample means are representing a normal distribution. In other words, the 95% CI is considered symmetric.
minuslog.lik<-function(mu, sigma){ log.lik<-0
for(i in 1:800){
log.lik = log.lik + log(dlnorm(x=residual.sugar[i], meanlog = mu, sdlog = sigma)) }
return(-log.lik)
}
est.lognorm<-mle(minuslog=minuslog.lik, start=list(mu=log(mean(residual.sugar)), sigma=log(sd(residual.sugar))))
summary(est.lognorm)
## Maximum likelihood estimation
##
## Call:
## mle(minuslogl = minuslog.lik, start = list(mu = log(mean(residual.sugar)),
## sigma = log(sd(residual.sugar))))
##
## Coefficients:
## Estimate Std. Error
## mu 0.8919587 0.012070414
## sigma 0.3414029 0.008534798
##
## -2 log L: 1977.921
Due to the skewness of the distribution for residual.sugar, it would be inappropriate to model this variable with a normal distribution. A log-normal distribution, however, could be used to model this variable. To characterize such a distribution, μ and σ are the two parameters needed. the R code above calculates the maximum likelihood estimate for these parameters.
data_rw_rating <- data_rw%>%
mutate(rating = case_when(quality >=7 ~'Excellent',
TRUE ~ 'average'))
attach(data_rw_rating)
#getting the proportion of 'Excellent'
table(rating)
## rating
## average Excellent
## 1382 217
p.ex=217/1599 #point estimate of the population proportion
q=1-p.ex
CI.l=p.ex-1.96*sqrt((p.ex*q)/1599)
CI.u=p.ex+1.96*sqrt((p.ex*q)/1599)
CI.ex=c(CI.l, CI.u)
CI.ex
## [1] 0.1189230 0.1524966
Theoretically, the 95% confidence interval for p is 12% to 15%
bs.ex <-do(1000)*rflip(1000, prob=p.ex)
se.ex <-sd(~prop, data=bs.ex) #standard error
#95% Confidence interval
CI.ex.bs <-c(0.134-2*se.ex, 0.134+2*se.ex)
CI.ex.bs
## [1] 0.1126986 0.1553014
Using the bootstrap method, the 95% confidence interval for p is 11% to 16%
The 95% confidence interval derived from the bootstrap method is slightly wider than the one calculated based on CLT. However, the results are very close thus should be reliable.
minuslog.lik<-function(p){
-log(dbinom(x=217, size=1599, prob=p))
}
est <- mle(minuslog=minuslog.lik, start=list(p=0.5))
summary(est)
## Maximum likelihood estimation
##
## Call:
## mle(minuslogl = minuslog.lik, start = list(p = 0.5))
##
## Coefficients:
## Estimate Std. Error
## p 0.1357098 0.008564279
##
## -2 log L: 7.072712
The calculation above suggests that the maximum likelihood estimate of p is 0.136, and the standard error is 0.0086.
(Please take a moment and think through your comments to be as specific as possible so that I can determine what steps to take to make my teaching more effective for you)
This course is a very good refresher for me in terms of the fundamentals about the CLT and R. The videos and weekly meetings are all very helpful for me to understand the concepts and material. As someone from a non-business and non-statistical background, some of the equations and mathematics concepts (e.g., log, probability) post some challenges for me. If there are dedicated sessions to explain some of the basics, that would be great. Another thing is that it would be great if we could have a discussion about how the concepts and skills that we have learned could be applied to real business problems. For example, how could the things that we did be helpful in business world? In what situations are bootstrap or MLE be useful? Just some my thoughts.