(4) In northern climates, roads must be cleared of snow quickly after a storm. One measure of storm severity is \(X_{1}\) = its duration in hours, and the effectiveness of snow removal can be quantified by \(X_{2}\) = the number of hours of crew time (labor and machines) spent clearing the snow. The dataset records 25 such incidents in Wisconsin.
a. Do the variables seem to be normally distributed? Investigate.
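The code in this section relies on a setup chunk that is not shown here. The following is a minimal sketch of the packages and data loading it assumes; the file name "snow.csv" and the column assignments are assumptions based on how df, storm, and work are used below.

library(ggplot2)   # histograms, Q-Q plots, contour plots
library(MVN)       # mvn(): Royston's test and multivariate outlier checks
library(ellipse)   # ellipse(): probability and confidence ellipses
library(car)       # powerTransform(): Box-Cox transformations to multinormality
library(knitr)     # kable(): formatted tables

# Hypothetical file name; 25 rows with storm duration and crew hours
df <- read.csv("snow.csv")
names(df) <- c("storm", "work")
attach(df)  # later chunks refer to storm and work directly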
# Histograms
ggplot(data = df, aes(x = storm)) +
  geom_histogram(bins = 10) +
  labs(x = "Number of Storm Hours",
       y = "Frequency",
       title = "Distribution of Storm Hours")

ggplot(data = df, aes(x = work)) +
  geom_histogram(bins = 10) +
  labs(x = "Number of Work Hours",
       y = "Frequency",
       title = "Distribution of Work Hours")

# Q-Q plots
ggplot(df) +
  geom_qq(aes(sample = storm)) +
  geom_qq_line(aes(sample = storm)) +
  labs(x = "Theoretical Quantiles",
       y = "Sample Quantiles",
       title = "Normal QQ Plot for Storm Hours")

ggplot(df) +
  geom_qq(aes(sample = work)) +
  geom_qq_line(aes(sample = work)) +
  labs(x = "Theoretical Quantiles",
       y = "Sample Quantiles",
       title = "Normal QQ Plot for Work Hours")

# Contour Plot
ggplot(df, aes(x = storm, y = work)) +
  geom_density2d() +
  labs(x = "Storm Hours",
       y = "Work Hours",
       title = "Contour Plot of Storm vs Work Hours")

# Royston's Test & QQ Plot
mvn(df,
    mvnTest = "royston",
    multivariatePlot = "qq",
    multivariateOutlierMethod = "adj",
    showOutliers = TRUE)

## $multivariateNormality
## Test H p value MVN
## 1 Royston 11.84998 0.00273702 NO
##
## $univariateNormality
## Test Variable Statistic p value Normality
## 1 Shapiro-Wilk storm 0.9170 0.0438 NO
## 2 Shapiro-Wilk work 0.8825 0.0078 NO
##
## $Descriptives
## n Mean Std.Dev Median Min Max 25th 75th Skew Kurtosis
## storm 25 9.420 3.760208 8.5 3.5 19.5 7.0 10.5 0.9829208 0.5554744
## work 25 19.272 7.889155 17.4 10.4 42.3 13.2 23.6 1.1529966 0.7998056
##
## $multivariateOutliers
## Observation Mahalanobis Distance Outlier
## 1 1 20.275 TRUE
## 2 2 18.267 TRUE
## 3 3 13.373 TRUE
## 4 4 9.790 TRUE
## 5 5 8.868 TRUE
## 6 6 8.279 TRUE
The histograms and Q-Q plots for both storm and work hours show noticeable departures from normality. At \(\alpha = 0.05\), the Shapiro-Wilk tests lead to the conclusion that neither variable follows a univariate normal distribution, and the Royston test indicates the data do not come from a bivariate normal distribution. The descriptive statistics above reinforce the case for non-normality, and several observations were also flagged as multivariate outliers based on adjusted, squared Mahalanobis distances. A bivariate probability ellipse and a chi-square quantile check of the Mahalanobis distances are shown below as additional diagnostics.
# Bivariate scatterplot with 50% probability ellipse
par(mfrow = c(1, 1))
plot(storm, work, pch = 20)
quan_chisq <- ellipse(cor(df[, c(1, 2)]),
                      scale = c(sd(storm), sd(work)),
                      center = c(0, 0),
                      level = 0.5,
                      which = c(1, 2),
                      npoints = 100)
lines(quan_chisq[, 1] + mean(storm),
      quan_chisq[, 2] + mean(work),
      col = "blue",
      lwd = 2)

# Mahalanobis distances between storm & work hours
m_dist <- mahalanobis(x = df[, c(1, 2)],
                      center = colMeans(df[, c(1, 2)]),
                      cov = cov(df[, c(1, 2)]))
# Proportion of squared distances at or below the chi-square(2) 75th percentile
# (divide by n = nrow(df) = 25)
sum(m_dist <= qchisq(.75, 2)) / nrow(df)

## [1] 0.76
b. If the data are not bivariate normal, determine power transformations that achieve approximate bivariate normality. Denote the bivariate mean by \(\mu = (\mu_{1}, \mu_{2})'\).
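The output below has no accompanying code; it matches what summary() of a Box-Cox power transformation fit prints, so the following is a sketch of the presumed call (assuming the car package):

# Sketch (assumption): Box-Cox power transformation toward bivariate normality
pt <- powerTransform(df, family = "bcPower")
summary(pt)  # estimated powers, Wald bounds, and LR tests shown below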
## bcPower Transformations to Multinormality
## Est Power Rounded Pwr Wald Lwr Bnd Wald Upr Bnd
## storm 0.2396 1 -0.5558 1.0350
## work -0.6417 0 -1.7565 0.4732
##
## Likelihood ratio test that transformation parameters are equal to 0
## (all log transformations)
## LRT df pval
## LR test, lambda = (0 0) 1.711958 2 0.42487
##
## Likelihood ratio test that no transformations are needed
## LRT df pval
## LR test, lambda = (1 1) 12.45482 2 0.0019746
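The chunk that applies the transformation is not echoed. Given the log transformation adopted below (\(\lambda = 0\) for both variables), the following is a sketch of the presumed step; the detach/attach bookkeeping is an assumption.

# Sketch (assumption): apply the log transformation and re-attach
detach(df)
df <- log(df)   # storm and work now hold log hours
attach(df)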
# Histograms of transformed data
ggplot(data = df, aes(x = storm)) +
  geom_histogram(bins = 10) +
  labs(x = "Log Storm Hours",
       y = "Frequency",
       title = "Distribution of Log Storm Hours")

ggplot(data = df, aes(x = work)) +
  geom_histogram(bins = 10) +
  labs(x = "Log Work Hours",
       y = "Frequency",
       title = "Distribution of Log Work Hours")

# Scatterplot with Ellipse (transformed data)
plot(storm, work, pch = 20)
quan_chisq1 <- ellipse(cor(df[, c(1, 2)]),
                       scale = c(sd(storm), sd(work)),
                       center = c(0, 0),
                       t = sqrt(qchisq(.5, 2)),
                       which = c(1, 2),
                       npoints = 100)
lines(quan_chisq1[, 1] + mean(storm),
      quan_chisq1[, 2] + mean(work),
      col = "blue",
      lwd = 2)

# Contour Plot of Transformed Values
ggplot(df, aes(x = storm, y = work)) +
  geom_density2d() +
  labs(x = "Log Storm Hours",
       y = "Log Work Hours",
       title = "Contour Plot of Log Storm vs Log Work Hours")

## $multivariateNormality
## Test H p value MVN
## 1 Royston 0.8760029 0.647228 YES
##
## $univariateNormality
## Test Variable Statistic p value Normality
## 1 Shapiro-Wilk storm 0.9779 0.8416 YES
## 2 Shapiro-Wilk work 0.9590 0.3945 YES
##
## $Descriptives
## n Mean Std.Dev Median Min Max 25th 75th
## storm 25 2.170685 0.3889775 2.140066 1.252763 2.970414 1.945910 2.351375
## work 25 2.888439 0.3726098 2.856470 2.341806 3.744787 2.580217 3.161247
## Skew Kurtosis
## storm -0.06361083 -0.0189353
## work 0.47449566 -0.7519325
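As in part a, the normality summary above is the kind of output produced by mvn(); the following is a sketch of the presumed call, re-run on the transformed data:

# Sketch (assumption): repeat Royston's test and the univariate checks on log data
mvn(df, mvnTest = "royston", multivariatePlot = "qq")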
After comparing a few candidate transformations, I settled on a log transformation of both variables (\(\lambda = 0\)), which is consistent with the likelihood ratio results above: the test that both transformation parameters equal 0 is not rejected (p = 0.425), while the test that no transformation is needed is rejected (p = 0.002). With the log transformation, storm and work hours appear to follow univariate normal distributions (based on the updated histograms, Q-Q plots, and Shapiro-Wilk results), and the Royston test no longer rejects bivariate normality.
c. Construct a 95% confidence ellipse for the population means \(\mu_{1}\) and \(\mu_{2}\).
s <- cov(df)
n <- nrow(df)
p <- 2
alpha <- 0.05
xbar <- as.matrix(colMeans(df), ncol = 1)
# Critical constant for the 95% ellipse: (p(n-1)/(n-p)) * F_{p, n-p}(0.95)
cri <- (((n - 1) * p) / (n - p)) * qf((1 - alpha), p, n - p)
plot(storm, work, pch = 20,
     xlab = expression(mu[1]),
     ylab = expression(mu[2]),
     main = "95% confidence ellipse for the mean vector")
CR <- ellipse(x = cor(df),
              scale = sqrt(diag(s) / n),
              centre = as.vector(xbar),
              t = sqrt(cri))
lines(CR, col = "blue", lwd = 2)
points(xbar[1], xbar[2], pch = 20)  # mark the sample mean (ellipse center)

d. Construct 95% simultaneous \(T^{2}\) intervals for \(\mu_{1}\) and \(\mu_{2}\) and the 95% Bonferroni intervals for \(\mu_{1}\) and \(\mu_{2}\). Compare the two sets of intervals.
# One-at-a-time intervals for mu1 and mu2
One <- matrix(NA, 2, 2)
for (i in 1:2){
  One[i, 1] <- xbar[i] - qt(.975, n - 1) * sqrt(s[i, i] / n)
  One[i, 2] <- xbar[i] + qt(.975, n - 1) * sqrt(s[i, i] / n)
}

# Simultaneous T^2 intervals for mu1 and mu2
Tsq <- matrix(NA, 2, 2)
c <- sqrt(((p * (n - 1)) / (n - p)) * qf(.95, p, n - p))
for (i in 1:2){
  Tsq[i, 1] <- xbar[i] - c * sqrt(s[i, i] / n)
  Tsq[i, 2] <- xbar[i] + c * sqrt(s[i, i] / n)
}

# Bonferroni intervals for mu1 and mu2 with m = 2 (m is the number of CIs)
m <- 2
Bonf <- matrix(NA, 2, 2)
for (i in 1:m){
  Bonf[i, 1] <- xbar[i] - qt((1 - alpha / (2 * m)), n - 1) * sqrt(s[i, i] / n)
  Bonf[i, 2] <- xbar[i] + qt((1 - alpha / (2 * m)), n - 1) * sqrt(s[i, i] / n)
}

cols <- c("lower", "upper"); rows <- c("Storm", "Work")
colnames(One) = colnames(Tsq) = colnames(Bonf) = cols
row.names(One) = row.names(Tsq) = row.names(Bonf) = rows
out <- list(One, Tsq, Bonf)
names(out) <- c("One-at-a-time Intervals",
                "T^2 Intervals",
                "Bonferroni Simultaneous Intervals")

# Confidence intervals for the log-transformed means
out

## $`One-at-a-time Intervals`
## lower upper
## Storm 2.010123 2.331247
## Work 2.734633 3.042244
##
## $`T^2 Intervals`
## lower upper
## Storm 1.962783 2.378587
## Work 2.689284 3.087593
##
## $`Bonferroni Simultaneous Intervals`
## lower upper
## Storm 1.98468 2.356690
## Work 2.71026 3.066617
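The three interval sets printed next are the same limits back-transformed to the original hours scale, in the same order: one-at-a-time, \(T^{2}\), then Bonferroni. The back-transformation is presumably just exponentiation of each limit; a sketch:

# Sketch (assumption): back-transform the log-scale limits to hours
lapply(out, exp)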
## lower upper
## Storm 7.464235 10.29077
## Work 15.404086 20.95221
## lower upper
## Storm 7.119109 10.78965
## Work 14.721137 21.92424
## lower upper
## Storm 7.276718 10.55595
## Work 15.033188 21.46914
One-at-a-time, simultaneous \(T^{2}\), and Bonferroni simultaneous intervals are provided for the log-transformed means and, below them, back-transformed to the original hours scale. The one-at-a-time intervals are the narrowest of the three because they only guarantee 95% coverage for each mean separately; they say nothing about how often both intervals cover their means at the same time. That is exactly why simultaneous intervals are of interest: reporting ordinary 95% intervals one variable at a time can overstate our joint confidence in the pair of means.

The simultaneous \(T^{2}\) and Bonferroni intervals give better insight into the two mean responses in tandem because they control the joint coverage. The Bonferroni intervals are slightly narrower than the simultaneous \(T^{2}\) intervals, though the two sets are quite similar relative to the one-at-a-time intervals. From the Bonferroni intervals we can be 95% confident that the true mean log storm hours and mean log work hours are simultaneously contained in their respective intervals.
(5) In the first phase of a study of the cost of transporting milk from farms to dairy plants, a survey was taken of firms engaged in milk transportation. Cost data on \(X_{1}\) = fuel, \(X_{2}\) = repair, and \(X_{3}\) = capital, all measured on a per-mile basis, are provided for \(n_{1} = 36\) gasoline trucks and \(n_{2} = 23\) diesel trucks.
X <- read.csv("T6-10.DAT", header = FALSE, sep = "")
names(X) <- c("fuel", "repair", "capital", "truck")

a. Test for differences in the mean cost vectors at \(\alpha = 0.01\).
alpha <- .01
X1 <- X[X$truck == "gasoline",1:3]; X2 <- X[X$truck == "diesel",1:3]
n1 <- dim(X1)[1]; n2 <- dim(X2)[1]
p <- dim(X1)[2]
# Testing H0: mu1 - mu2 = delta0
delta0 <- rep(0,p)
xbar1 <- colMeans(X1); xbar2 <- colMeans(X2)
S1 <- var(X1); S2 <- var(X2)
Sp <- ((n1-1)*S1+(n2-1)*S2)/(n1+n2-2)
T2 <- (n1*n2/(n1+n2))*t(xbar1-xbar2-delta0)%*%solve(Sp)%*%(xbar1-xbar2-delta0)
crit <- (n1 + n2 - 2)*p/(n1 + n2 - p - 1)*qf(1-alpha, p, n1 + n2 - p - 1)
A <- diag(p)
T2 > crit

## [,1]
## [1,] TRUE

xbar1

## fuel repair capital
## 12.218611 8.112500 9.590278

xbar2

## fuel repair capital
## 10.10565 10.76217 18.16783

S1

## fuel repair capital
## fuel 23.013361 12.366395 2.906609
## repair 12.366395 17.544111 4.773082
## capital 2.906609 4.773082 13.963334

S2

## fuel repair capital
## fuel 4.3623166 0.7598872 2.362099
## repair 0.7598872 25.8512360 7.685732
## capital 2.3620992 7.6857322 46.654400

Sp

## fuel repair capital
## fuel 15.814712 7.886690 2.696447
## repair 7.886690 20.750370 5.897263
## capital 2.696447 5.897263 26.580938

T2

## [,1]
## [1,] 50.91279
\(H_{0}: \mu_{1} - \mu_{2} = \delta_{0}\)

\(H_{a}: \mu_{1} - \mu_{2} \neq \delta_{0}\)

Test Statistic: \(T^{2} = 50.9127868\)

Critical Value: 12.93096

Conclusion: Since \(T^{2}\) exceeds the critical value (50.9127868 > 12.93096 at \(\alpha = 0.01\)), we reject the null hypothesis in favor of the alternative. That is, the data indicate that the difference in mean cost vectors between the two groups is not equal to \(\delta_{0} = \mathbf{0}\).
b. If the hypothesis of equal cost vectors is rejected in part a, find the mean components most responsible for the rejection.
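The coefficient vector printed below corresponds to the linear combination most responsible for the rejection, \(\hat{a} \propto S_{p}^{-1}(\bar{x}_{1} - \bar{x}_{2})\). A sketch of the presumed computation, reusing the objects from part a:

# Sketch (assumption): coefficients of the linear combination driving the rejection
a_hat <- solve(Sp) %*% (xbar1 - xbar2)
a_hat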
## [,1]
## fuel 0.2547452
## repair -0.1339036
## capital -0.3188296
The coefficient vector of the linear combination most responsible for the rejection of \(H_{0}\) in part a is shown above. In terms of magnitude, the capital component has the largest coefficient and appears most responsible; the simultaneous intervals in part c provide a more direct confirmation.
c. Construct 99% simultaneous \(T^{2}\) confidence intervals for the mean differences. Which costs, if any, appear to be quite different?
# T^2-intervals
lower <- t(A)%*%(xbar1 - xbar2) - sqrt(crit) * sqrt(diag((t(A)%*%Sp%*%A)*(1/n1 + 1/n2)))
upper <- t(A)%*%(xbar1 - xbar2) + sqrt(crit) * sqrt(diag((t(A)%*%Sp%*%A)*(1/n1 + 1/n2)))
out <- as.data.frame(cbind(lower, upper),
row.names = c("Fuel", "Repair", "Capital"))
names(out) <- c("Lower", "Upper")
kable(out)

|         | Lower | Upper |
|---------|-------|-------|
| Fuel | -1.704346 | 5.930264 |
| Repair | -7.022268 | 1.722920 |
| Capital | -13.526479 | -3.628618 |
The cost of capital appears to differ the most between the two truck types: its interval is the only one that excludes zero, and both limits are negative, indicating that the mean capital cost per mile is higher for diesel trucks than for gasoline trucks. This agrees with the result in part b.
d. Note that the observations 9 and 21 for gasoline trucks have been identified as multivariate outliers. Repeat part a with these observations deleted. Comment on the results.
alpha <- .01
df <- X[-c(9,21),]
X1 <- df[df$truck=="gasoline",1:3]
X2 <- df[df$truck=="diesel",1:3]
n1 <- dim(X1)[1]
n2 <- dim(X2)[1]
p <- dim(X1)[2]
# Testing H0: mu1 - mu2 = delta0
delta0 <- rep(0,p)
xbar1 <- colMeans(X1); xbar2 <- colMeans(X2)
S1 <- var(X1); S2 <- var(X2)
Sp <- ((n1-1)*S1+(n2-1)*S2)/(n1+n2-2)
T2 <- (n1*n2/(n1+n2))*t(xbar1-xbar2-delta0)%*%solve(Sp)%*%(xbar1-xbar2-delta0)
crit <- (n1 + n2 - 2)*p/(n1 + n2 - p - 1)*qf(1 - alpha, p, n1 + n2 - p - 1)
A <- diag(p)

\(H_{0}: \mu_{1} - \mu_{2} = \delta_{0}\)

\(H_{a}: \mu_{1} - \mu_{2} \neq \delta_{0}\)
Test Statistic: \(T^{2} = 52.3835991\)

Critical Value: \(\frac{(n_{1}+n_{2}-2)\,p}{n_{1}+n_{2}-p-1}\,F_{p,\,n_{1}+n_{2}-p-1}(0.99) \approx 13.0\)

Conclusion: After removing the two outliers, the conclusion is unchanged and the evidence is, if anything, slightly stronger: the test statistic increases to 52.38 (from 50.91) and remains far above the critical value at \(\alpha = 0.01\). We again reject the null hypothesis; the data indicate that the mean cost vectors of gasoline and diesel trucks differ.