(4) In northern climates, roads must be cleared of snow quickly after a storm. One measure of storm severity is \(X_{1}\) = its duration in hours, and the effectiveness of snow removal can be quantified by \(X_{2}\) = the number of hours of crew time (labor and machines) spent clearing the snow. The dataset records 25 such incidents in Wisconsin.
a. Do the variables seem to be normally distributed? Investigate.
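The code in this section relies on a setup chunk that is not shown here. The following is a minimal sketch of the packages and data loading it assumes; the file name "snow.csv" and the column assignments are assumptions based on how df, storm, and work are used below.

library(ggplot2)   # histograms, Q-Q plots, contour plots
library(MVN)       # mvn(): Royston's test and multivariate outlier checks
library(ellipse)   # ellipse(): probability and confidence ellipses
library(car)       # powerTransform(): Box-Cox transformations to multinormality
library(knitr)     # kable(): formatted tables

# Hypothetical file name; 25 rows with storm duration and crew hours
df <- read.csv("snow.csv")
names(df) <- c("storm", "work")
attach(df)  # later chunks refer to storm and work directly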
# Histograms
ggplot(data = df, aes(x = storm)) +
  geom_histogram(bins = 10) +
  labs(x = "Number of Storm Hours",
       y = "Frequency",
       title = "Distribution of Storm Hours")

ggplot(data = df, aes(x = work)) +
  geom_histogram(bins = 10) +
  labs(x = "Number of Work Hours",
       y = "Frequency",
       title = "Distribution of Work Hours")

# Q-Q plots
ggplot(df) +
  geom_qq(aes(sample = storm)) +
  geom_qq_line(aes(sample = storm)) +
  labs(x = "Theoretical Quantiles",
       y = "Sample Quantiles",
       title = "Normal QQ Plot for Storm Hours")

ggplot(df) +
  geom_qq(aes(sample = work)) +
  geom_qq_line(aes(sample = work)) +
  labs(x = "Theoretical Quantiles",
       y = "Sample Quantiles",
       title = "Normal QQ Plot for Work Hours")

# Contour Plot
ggplot(df, aes(x = storm, y = work)) +
  geom_density2d() +
  labs(x = "Storm Hours",
       y = "Work Hours",
       title = "Contour Plot of Storm vs Work Hours")

# Royston's Test & QQ Plot
mvn(df,
    mvnTest = "royston",
    multivariatePlot = "qq",
    multivariateOutlierMethod = "adj",
    showOutliers = TRUE)

## $multivariateNormality
## Test H p value MVN
## 1 Royston 11.84998 0.00273702 NO
##
## $univariateNormality
## Test Variable Statistic p value Normality
## 1 Shapiro-Wilk storm 0.9170 0.0438 NO
## 2 Shapiro-Wilk work 0.8825 0.0078 NO
##
## $Descriptives
## n Mean Std.Dev Median Min Max 25th 75th Skew Kurtosis
## storm 25 9.420 3.760208 8.5 3.5 19.5 7.0 10.5 0.9829208 0.5554744
## work 25 19.272 7.889155 17.4 10.4 42.3 13.2 23.6 1.1529966 0.7998056
##
## $multivariateOutliers
## Observation Mahalanobis Distance Outlier
## 1 1 20.275 TRUE
## 2 2 18.267 TRUE
## 3 3 13.373 TRUE
## 4 4 9.790 TRUE
## 5 5 8.868 TRUE
## 6 6 8.279 TRUE
The histograms and Q-Q plots for both storm and work hours show noticeable departures from normality. At \(\alpha = 0.05\), the Shapiro-Wilk tests lead to the conclusion that neither variable follows a univariate normal distribution, and the Royston test indicates the data do not come from a bivariate normal distribution. The descriptive statistics above reinforce the case for non-normality, and several observations were also flagged as multivariate outliers based on adjusted, squared Mahalanobis distances. A bivariate probability ellipse and a chi-square quantile check of the Mahalanobis distances are shown below as additional diagnostics.
# Bivariate scatterplot with 50% probability ellipse
par(mfrow = c(1, 1))
plot(storm, work, pch = 20)
quan_chisq <- ellipse(cor(df[, c(1, 2)]),
                      scale = c(sd(storm), sd(work)),
                      center = c(0, 0),
                      level = 0.5,
                      which = c(1, 2),
                      npoints = 100)
lines(quan_chisq[, 1] + mean(storm),
      quan_chisq[, 2] + mean(work),
      col = "blue",
      lwd = 2)

# Mahalanobis distances between storm & work hours
m_dist <- mahalanobis(x = df[, c(1, 2)],
                      center = colMeans(df[, c(1, 2)]),
                      cov = cov(df[, c(1, 2)]))
# Proportion of squared distances at or below the chi-square(2) 75th percentile
# (divide by n = nrow(df) = 25)
sum(m_dist <= qchisq(.75, 2)) / nrow(df)

## [1] 0.76
b. If the data are not bivariate normal, determine power transformations that achieve approximate bivariate normality. Denote the bivariate mean by \(\mu = (\mu_{1}, \mu_{2})'\).
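The output below has no accompanying code; it matches what summary() of a Box-Cox power transformation fit prints, so the following is a sketch of the presumed call (assuming the car package):

# Sketch (assumption): Box-Cox power transformation toward bivariate normality
pt <- powerTransform(df, family = "bcPower")
summary(pt)  # estimated powers, Wald bounds, and LR tests shown below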
## bcPower Transformations to Multinormality
## Est Power Rounded Pwr Wald Lwr Bnd Wald Upr Bnd
## storm 0.2396 1 -0.5558 1.0350
## work -0.6417 0 -1.7565 0.4732
##
## Likelihood ratio test that transformation parameters are equal to 0
## (all log transformations)
## LRT df pval
## LR test, lambda = (0 0) 1.711958 2 0.42487
##
## Likelihood ratio test that no transformations are needed
## LRT df pval
## LR test, lambda = (1 1) 12.45482 2 0.0019746
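The chunk that applies the transformation is not echoed. Given the log transformation adopted below (\(\lambda = 0\) for both variables), the following is a sketch of the presumed step; the detach/attach bookkeeping is an assumption.

# Sketch (assumption): apply the log transformation and re-attach
detach(df)
df <- log(df)   # storm and work now hold log hours
attach(df)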
# Histograms of transformed data
ggplot(data = df, aes(x = storm)) +
  geom_histogram(bins = 10) +
  labs(x = "Log Storm Hours",
       y = "Frequency",
       title = "Distribution of Log Storm Hours")

ggplot(data = df, aes(x = work)) +
  geom_histogram(bins = 10) +
  labs(x = "Log Work Hours",
       y = "Frequency",
       title = "Distribution of Log Work Hours")

# Scatterplot with Ellipse (transformed data)
plot(storm, work, pch = 20)
quan_chisq1 <- ellipse(cor(df[, c(1, 2)]),
                       scale = c(sd(storm), sd(work)),
                       center = c(0, 0),
                       t = sqrt(qchisq(.5, 2)),
                       which = c(1, 2),
                       npoints = 100)
lines(quan_chisq1[, 1] + mean(storm),
      quan_chisq1[, 2] + mean(work),
      col = "blue",
      lwd = 2)

# Contour Plot of Transformed Values
ggplot(df, aes(x = storm, y = work)) +
  geom_density2d() +
  labs(x = "Log Storm Hours",
       y = "Log Work Hours",
       title = "Contour Plot of Log Storm vs Log Work Hours")

## $multivariateNormality
## Test H p value MVN
## 1 Royston 0.8760029 0.647228 YES
##
## $univariateNormality
## Test Variable Statistic p value Normality
## 1 Shapiro-Wilk storm 0.9779 0.8416 YES
## 2 Shapiro-Wilk work 0.9590 0.3945 YES
##
## $Descriptives
## n Mean Std.Dev Median Min Max 25th 75th
## storm 25 2.170685 0.3889775 2.140066 1.252763 2.970414 1.945910 2.351375
## work 25 2.888439 0.3726098 2.856470 2.341806 3.744787 2.580217 3.161247
## Skew Kurtosis
## storm -0.06361083 -0.0189353
## work 0.47449566 -0.7519325
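As in part a, the normality summary above is the kind of output produced by mvn(); the following is a sketch of the presumed call, re-run on the transformed data:

# Sketch (assumption): repeat Royston's test and the univariate checks on log data
mvn(df, mvnTest = "royston", multivariatePlot = "qq")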
After comparing a few candidate transformations, I settled on a log transformation of both variables (\(\lambda = 0\)), which is consistent with the likelihood ratio results above: the test that both transformation parameters equal 0 is not rejected (p = 0.425), while the test that no transformation is needed is rejected (p = 0.002). With the log transformation, storm and work hours appear to follow univariate normal distributions (based on the updated histograms, Q-Q plots, and Shapiro-Wilk results), and the Royston test no longer rejects bivariate normality.
c. Construct a 95% confidence ellipse for the population means \(\mu_{1}\) and \(\mu_{2}\).
s <- cov(df)
n <- nrow(df)
p <- 2
alpha <- 0.05
xbar <- as.matrix(colMeans(df), ncol = 1)
# Critical constant for the 95% ellipse: (p(n-1)/(n-p)) * F_{p, n-p}(0.95)
cri <- (((n - 1) * p) / (n - p)) * qf((1 - alpha), p, n - p)
plot(storm, work, pch = 20,
     xlab = expression(mu[1]),
     ylab = expression(mu[2]),
     main = "95% confidence ellipse for the mean vector")
CR <- ellipse(x = cor(df),
              scale = sqrt(diag(s) / n),
              centre = as.vector(xbar),
              t = sqrt(cri))
lines(CR, col = "blue", lwd = 2)
points(xbar[1], xbar[2], pch = 20)  # mark the sample mean (ellipse center)

d. Construct 95% simultaneous \(T^{2}\) intervals for \(\mu_{1}\) and \(\mu_{2}\) and the 95% Bonferroni intervals for \(\mu_{1}\) and \(\mu_{2}\). Compare the two sets of intervals.
# One-at-a-time intervals for mu1 and mu2
One <- matrix(NA, 2, 2)
for (i in 1:2){
  One[i, 1] <- xbar[i] - qt(.975, n - 1) * sqrt(s[i, i] / n)
  One[i, 2] <- xbar[i] + qt(.975, n - 1) * sqrt(s[i, i] / n)
}

# Simultaneous T^2 intervals for mu1 and mu2
Tsq <- matrix(NA, 2, 2)
c <- sqrt(((p * (n - 1)) / (n - p)) * qf(.95, p, n - p))
for (i in 1:2){
  Tsq[i, 1] <- xbar[i] - c * sqrt(s[i, i] / n)
  Tsq[i, 2] <- xbar[i] + c * sqrt(s[i, i] / n)
}

# Bonferroni intervals for mu1 and mu2 with m = 2 (m is the number of CIs)
m <- 2
Bonf <- matrix(NA, 2, 2)
for (i in 1:m){
  Bonf[i, 1] <- xbar[i] - qt((1 - alpha / (2 * m)), n - 1) * sqrt(s[i, i] / n)
  Bonf[i, 2] <- xbar[i] + qt((1 - alpha / (2 * m)), n - 1) * sqrt(s[i, i] / n)
}

cols <- c("lower", "upper"); rows <- c("Storm", "Work")
colnames(One) = colnames(Tsq) = colnames(Bonf) = cols
row.names(One) = row.names(Tsq) = row.names(Bonf) = rows
out <- list(One, Tsq, Bonf)
names(out) <- c("One-at-a-time Intervals",
                "T^2 Intervals",
                "Bonferroni Simultaneous Intervals")

# Confidence intervals for the log-transformed means
out

## $`One-at-a-time Intervals`
## lower upper
## Storm 2.010123 2.331247
## Work 2.734633 3.042244
##
## $`T^2 Intervals`
## lower upper
## Storm 1.962783 2.378587
## Work 2.689284 3.087593
##
## $`Bonferroni Simultaneous Intervals`
## lower upper
## Storm 1.98468 2.356690
## Work 2.71026 3.066617
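The three interval sets printed next are the same limits back-transformed to the original hours scale, in the same order: one-at-a-time, \(T^{2}\), then Bonferroni. The back-transformation is presumably just exponentiation of each limit; a sketch:

# Sketch (assumption): back-transform the log-scale limits to hours
lapply(out, exp)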
## lower upper
## Storm 7.464235 10.29077
## Work 15.404086 20.95221
## lower upper
## Storm 7.119109 10.78965
## Work 14.721137 21.92424
## lower upper
## Storm 7.276718 10.55595
## Work 15.033188 21.46914
One-at-a-time, simultaneous \(T^{2}\), and Bonferroni simultaneous intervals are provided for the log-transformed means and, below them, back-transformed to the original hours scale. The one-at-a-time intervals are the narrowest of the three because they only guarantee 95% coverage for each mean separately; they say nothing about how often both intervals cover their means at the same time. That is exactly why simultaneous intervals are of interest: reporting ordinary 95% intervals one variable at a time can overstate our joint confidence in the pair of means.

The simultaneous \(T^{2}\) and Bonferroni intervals give better insight into the two mean responses in tandem because they control the joint coverage. The Bonferroni intervals are slightly narrower than the simultaneous \(T^{2}\) intervals, though the two sets are quite similar relative to the one-at-a-time intervals. From the Bonferroni intervals we can be 95% confident that the true mean log storm hours and mean log work hours are simultaneously contained in their respective intervals.
(5) In the first phase of a study of the cost of transporting milk from farms to dairy plants, a survey was taken of firms engaged in milk transportation. Cost data on \(X_{1}\) = fuel, \(X_{2}\) = repair, and \(X_{3}\) = capital, all measured on a per-mile basis, are provided for \(n_{1} = 36\) gasoline trucks and \(n_{2} = 23\) diesel trucks.
X <- read.csv("T6-10.DAT", header = FALSE, sep = "")
names(X) <- c("fuel", "repair", "capital", "truck")

a. Test for differences in the mean cost vectors at \(\alpha = 0.01\).
alpha <- .01
X1 <- X[X$truck == "gasoline",1:3]; X2 <- X[X$truck == "diesel",1:3]
n1 <- dim(X1)[1]; n2 <- dim(X2)[1]
p <- dim(X1)[2]
# Testing H0: mu1 - mu2 = delta0
delta0 <- rep(0,p)
xbar1 <- colMeans(X1); xbar2 <- colMeans(X2)
S1 <- var(X1); S2 <- var(X2)
Sp <- ((n1-1)*S1+(n2-1)*S2)/(n1+n2-2)
T2 <- (n1*n2/(n1+n2))*t(xbar1-xbar2-delta0)%*%solve(Sp)%*%(xbar1-xbar2-delta0)
crit <- (n1 + n2 - 2)*p/(n1 + n2 - p - 1)*qf(1-alpha, p, n1 + n2 - p - 1)
A <- diag(p)
T2 > crit

## [,1]
## [1,] TRUE

xbar1

## fuel repair capital
## 12.218611 8.112500 9.590278

xbar2

## fuel repair capital
## 10.10565 10.76217 18.16783

S1

## fuel repair capital
## fuel 23.013361 12.366395 2.906609
## repair 12.366395 17.544111 4.773082
## capital 2.906609 4.773082 13.963334

S2

## fuel repair capital
## fuel 4.3623166 0.7598872 2.362099
## repair 0.7598872 25.8512360 7.685732
## capital 2.3620992 7.6857322 46.654400

Sp

## fuel repair capital
## fuel 15.814712 7.886690 2.696447
## repair 7.886690 20.750370 5.897263
## capital 2.696447 5.897263 26.580938

T2

## [,1]
## [1,] 50.91279
\(H_{0}: \mu_{1} - \mu_{2} = \delta_{0}\)

\(H_{a}: \mu_{1} - \mu_{2} \neq \delta_{0}\)

Test Statistic: \(T^{2} = 50.9127868\)

Critical Value: 12.93096

Conclusion: Since \(T^{2}\) exceeds the critical value (50.9127868 > 12.93096 at \(\alpha = 0.01\)), we reject the null hypothesis in favor of the alternative. That is, the data indicate that the difference in mean cost vectors between the two groups is not equal to \(\delta_{0} = \mathbf{0}\).
b. If the hypothesis of equal cost vectors is rejected in part a, find the mean components most responsible for the rejection.
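The coefficient vector printed below corresponds to the linear combination most responsible for the rejection, \(\hat{a} \propto S_{p}^{-1}(\bar{x}_{1} - \bar{x}_{2})\). A sketch of the presumed computation, reusing the objects from part a:

# Sketch (assumption): coefficients of the linear combination driving the rejection
a_hat <- solve(Sp) %*% (xbar1 - xbar2)
a_hat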
## [,1]
## fuel 0.2547452
## repair -0.1339036
## capital -0.3188296
The coefficient vector of the linear combination most responsible for the rejection of \(H_{0}\) in part a is shown above. In terms of magnitude, the capital component has the largest coefficient and appears most responsible; the simultaneous intervals in part c provide a more direct confirmation.
c. Construct 99% simultaneous \(T^{2}\) confidence intervals for the mean differences. Which costs, if any, appear to be quite different?
# T^2-intervals
lower <- t(A)%*%(xbar1 - xbar2) - sqrt(crit) * sqrt(diag((t(A)%*%Sp%*%A)*(1/n1 + 1/n2)))
upper <- t(A)%*%(xbar1 - xbar2) + sqrt(crit) * sqrt(diag((t(A)%*%Sp%*%A)*(1/n1 + 1/n2)))
out <- as.data.frame(cbind(lower, upper),
row.names = c("Fuel", "Repair", "Capital"))
names(out) <- c("Lower", "Upper")
kable(out)

|         | Lower | Upper |
|---------|-------|-------|
| Fuel | -1.704346 | 5.930264 |
| Repair | -7.022268 | 1.722920 |
| Capital | -13.526479 | -3.628618 |
The cost of capital appears to differ the most between the two truck types: its interval is the only one that excludes zero, and both limits are negative, indicating that the mean capital cost per mile is higher for diesel trucks than for gasoline trucks. This agrees with the result in part b.
d. Note that the observations 9 and 21 for gasoline trucks have been identified as multivariate outliers. Repeat part a with these observations deleted. Comment on the results.
alpha <- .01
df <- X[-c(9,21),]
X1 <- df[df$truck=="gasoline",1:3]
X2 <- df[df$truck=="diesel",1:3]
n1 <- dim(X1)[1]
n2 <- dim(X2)[1]
p <- dim(X1)[2]
# Testing H0: mu1 - mu2 = delta0
delta0 <- rep(0,p)
xbar1 <- colMeans(X1); xbar2 <- colMeans(X2)
S1 <- var(X1); S2 <- var(X2)
Sp <- ((n1-1)*S1+(n2-1)*S2)/(n1+n2-2)
T2 <- (n1*n2/(n1+n2))*t(xbar1-xbar2-delta0)%*%solve(Sp)%*%(xbar1-xbar2-delta0)
crit <- (n1 + n2 - 2)*p/(n1 + n2 - p - 1)*qf(1 - alpha, p, n1 + n2 - p - 1)
A <- diag(p)

\(H_{0}: \mu_{1} - \mu_{2} = \delta_{0}\)

\(H_{a}: \mu_{1} - \mu_{2} \neq \delta_{0}\)
Test Statistic: \(T^{2} = 52.3835991\)

Critical Value: \(\frac{(n_{1}+n_{2}-2)\,p}{n_{1}+n_{2}-p-1}\,F_{p,\,n_{1}+n_{2}-p-1}(0.99) \approx 13.0\)

Conclusion: After removing the two outliers, the conclusion is unchanged and the evidence is, if anything, slightly stronger: the test statistic increases to 52.38 (from 50.91) and remains far above the critical value at \(\alpha = 0.01\). We again reject the null hypothesis; the data indicate that the mean cost vectors of gasoline and diesel trucks differ.