knitr::opts_chunk$set(echo = TRUE,
fig.align = "center",
message = F,
warning = F)
## Packages
pacman::p_load(tidyverse, Rfast, MVN, heplots,
skimr, GGally, car, rstatix)
################################################################################
theme_set(theme_bw())
theme_update(
axis.title = element_text(size = 12),
axis.text = element_text(size = 10),
plot.title = element_text(size = 16, hjust = 0.5),
plot.subtitle = element_text(size = 10, hjust = 0.5),
plot.caption = element_text(size = 8)
)
The blood.csv data set compares reagents from four different companies. Reagents are used to measure white blood count (WBC), red blood count (RBC), and hemoglobin count (HGC).
The first reagent is the one currently in use, and the other three are less expensive reagents that we wish to compare with the first. If a cheaper reagent is found to be not statistically different from reagent 1, it will be used instead of the more expensive version.
A random sample of 20 patients have a vial of blood drawn, which is then divided into 4 smaller samples. The four reagents are used on a different sample from each of the 20 vials.
# Reading the data in
blood <- read.csv("blood.csv")
# Need to make sure to change reagent and subjects into factors
blood <-
blood |>
mutate(reg = factor(reagent),
sub = factor(subject)) |>
select(-reagent, -subject) |>
filter(sub != "14")
# Sample size and number of levels of the factor
(N <- nrow(blood)); (k <- n_distinct(blood$reg))
## [1] 76
## [1] 4
# Number of response variables (p)
(p <- ncol(blood |> select(where(is.numeric))))
## [1] 3
# Calculating the number of blocks: b
(b <- n_distinct(blood$sub))
## [1] 19
Create a plot of using the discriminant or discriminants to compare the differences between RBC, WBC, and HGC across the 4 reagents. Does there appear to be a difference?
blood_lda <-
MASS::lda(
reg ~ RBC + WBC + HGC,
data = blood
)
# Getting the LD values
predict(blood_lda) |>
pluck("x") |>
data.frame() |>
# Adding the reagent column
mutate(reg = blood$reg) |>
# Creating the box plots
ggplot(
mapping = aes(x = LD1,
y = reg)
) +
geom_boxplot(fill = "darkred") +
labs(y = "Reagent")
The first discriminant shows that the appears to be a small difference between reagent 1 with the other 3.
Perform a hypothesis test comparing the reagents using a completely randomized design. Make sure to write the model for the data, state the hypotheses, Wilks’ Lambda, the approximate F statistic, and the p-value.
Model: \(\textbf{y}_{ij} = \pmb{\mu}_i + \pmb{\epsilon}_{ij}\) or \(\textbf{y}_{ij} = \pmb{\mu} + \pmb{\alpha}_i + \pmb{\epsilon}_{ij}\)
Each model is equivalent to the other
\[H_0: \pmb{\mu}_1 = \pmb{\mu}_2 = \pmb{\mu}_3 = \pmb{\mu}_4 \\ H_a: \textrm{At least one } \pmb{\mu}_i \textrm{ is different} \]
# Fitting the MANOVA model
blood_man <-
manova(cbind(WBC, RBC, HGC) ~ reg,
data = blood)
# Getting the summary results
blood_man_sum <-
summary(blood_man,
test = "Wilks")
# Displaying the results in a nice table
blood_man_sum |>
pluck("stats") |>
round(digits = 3) |>
data.frame() |>
rename(`p-value` = `Pr..F.`) |>
rownames_to_column(var = "source") |>
slice(1) |> # Just getting the row for reagent
gt::gt()
source | Df | Wilks | approx.F | num.Df | den.Df | p-value |
---|---|---|---|---|---|---|
reg | 3 | 0.845 | 1.353 | 9 | 170.512 | 0.213 |
The Wilks test statistic is \(\Lambda = 0.845\), the approximate F is \(F = 1.35\) with a p-value of 0.21.
With a large p-value, we fail to reject the null hypothesis. There is not strong evidence of a difference between the 4 reagents when measuring the 3 different blood measurements: red blood cell count, white blood cell count, and hemoglobin count.
Why is the F statistic approximate and not an exact F statistic?
It is not an exact F statistic because we can only make an exact F transformation using \(\Lambda\) when \(p \le 2\) or \(k \le 3\)
Since the blood question has 3 variables (RBC, WBC, HGC) and 4 groups (4 different reagents), we can’t find an exact transformation.
Of the 3 assumptions, which have been violated? Which of the violations has the largest impact on the validity of the MANOVA test in part 1a)? Make sure to justify your answer!
# Checking MVN:
mvn(
data = blood_man$residuals, # Checking MVN using the residuals
desc = F,
univariateTest = "SW",
mvnTest = "mardia",
multivariatePlot = "qq"
)
## $multivariateNormality
## Test Statistic p value Result
## 1 Mardia Skewness 20.1940199524142 0.0274700808980401 NO
## 2 Mardia Kurtosis -2.04706866210914 0.0406513433870539 NO
## 3 MVN <NA> <NA> NO
##
## $univariateNormality
## Test Variable Statistic p value Normality
## 1 Shapiro-Wilk WBC 0.9397 0.0013 NO
## 2 Shapiro-Wilk RBC 0.9145 1e-04 NO
## 3 Shapiro-Wilk HGC 0.9003 <0.001 NO
The data do not appear to be Multivariate normal. All 3 variables have strong evidence of non-normality and the multivariate QQ plot shows a deviation from what is expected once the distances are above 5.5.
box_m(
data = blood |> dplyr::select(RBC, WBC, HGC),
group = blood$reg
) |>
dplyr::select(statistic, parameter, p.value) |>
rename(`chisq df` = parameter) |>
gt::gt()
statistic | chisq df | p.value |
---|---|---|
0.5519948 | 18 | 1 |
No evidence that the covariance matrices for the 4 reagents are different!
The most important violation of our assumptions is the independence of cases assumption. For a CRD design, we need each vial of blood to be independent of each other vial of blood being tested. However, each subject’s blood is measured 4 times, once for each reagent. Which means that \(\mathbf{y}_{11}, \mathbf{y}_{21}, \mathbf{y}_{31}, \mathbf{y}_{41}\) are all associated since they all came from subject 1!
All of our random vectors, \(\mathbf{y}_{ij}\), (blood samples) that have the same \(j\) (subject) will be associated.
Using the same blood data, perform a hypothesis test using a randomized complete block design. State the hypotheses, Wilks’ Lambda, approximate F-stat, p-value, and the conclusion in context.
Model: \(\textbf{y}_{ij} = \pmb{\mu} + \pmb{\alpha}_i + \pmb{\beta}_j + \pmb{\epsilon}_{ij}\)
\[H_0: \pmb{\alpha}_1 = \pmb{\alpha}_2 = \pmb{\alpha}_3 = \pmb{\alpha}_4 = \pmb{0} \\ H_a: \textrm{At least one } \pmb{\alpha}_i \textrm{ is different} \]
### MANOVA test using manova()
blood_RCBD <-
manova(cbind(WBC, RBC, HGC) ~ reg + sub,
data = blood)
# Displaying the results
summary(blood_RCBD,
test="Wilks") |>
pluck("stats") |>
round(digits = 4) |>
data.frame() |>
rename(`p-value` = `Pr..F.`) |>
rownames_to_column(var = "source") |>
slice(1:2) |>
gt::gt()
source | Df | Wilks | approx.F | num.Df | den.Df | p-value |
---|---|---|---|---|---|---|
reg | 3 | 0.0617 | 30.1446 | 9 | 126.7049 | 0 |
sub | 18 | 0.0000 | 732.2267 | 54 | 155.7559 | 0 |
Wilks’ \(\Lambda = 0.099\), F-stat \(= 23.6\), p-value \(\approx 0\)
We reject our null hypothesis, we have very strong evidence that at least 1 of the reagents measures red, white, and/or hemoglobin counts differently than the others.
Check the conditions to the RCBD design for a MANOVA Test. Which ones appear to be violated?
blood_res_RCBD <-
data.frame(blood |> dplyr::select(reg, sub),
res = blood_RCBD$res)
blood_res_RCBD |>
dplyr::select(where(is.numeric)) |>
mvn(
mvnTest = "mardia",
multivariatePlot = "qq",
univariateTest = "SW",
desc = F
)
## $multivariateNormality
## Test Statistic p value Result
## 1 Mardia Skewness 4.65852775085122 0.91279137513761 YES
## 2 Mardia Kurtosis -0.194048234993727 0.846138113580113 YES
## 3 MVN <NA> <NA> YES
##
## $univariateNormality
## Test Variable Statistic p value Normality
## 1 Shapiro-Wilk res.WBC 0.9805 0.2930 YES
## 2 Shapiro-Wilk res.RBC 0.9767 0.1762 YES
## 3 Shapiro-Wilk res.HGC 0.9748 0.1340 YES
The result of the tests and the QQ plot indicate that there is no evidence that the residuals are not MVN
box_m(
data = blood_res_RCBD |> dplyr::select(where(is.numeric)),
group = blood_res_RCBD$reg
) |>
dplyr::select(statistic, parameter, p.value) |>
rename(`Chisq df` = parameter) |>
slice(1:2) |>
gt::gt()
statistic | Chisq df | p.value |
---|---|---|
13.15836 | 18 | 0.7820805 |
With a p-value of 0.78, we don’t have strong evidence that the covariance matrices for the 4 reagents are different:
\[\pmb{\Sigma}_1 = \pmb{\Sigma}_2 = \pmb{\Sigma}_3 = \pmb{\Sigma}_4 = \pmb{\Sigma}\]
Perform any necessary follow-up tests to determine how the reagents are different.
blood |>
pivot_longer(
cols = WBC:HGC,
names_to = "variable",
values_to = "value"
) |>
group_by(variable) |>
anova_test(value ~ reg + sub) |>
data.frame() |>
dplyr::filter(Effect == "reg") |>
arrange(p)
## variable Effect DFn DFd F p p..05 ges
## 1 HGC reg 3 54 109.936 5.57e-23 * 0.859
## 2 WBC reg 3 54 29.139 2.43e-11 * 0.618
## 3 RBC reg 3 54 24.373 4.17e-10 * 0.575
All 3 counts are statistically significant, with the strongest evidence of a difference for the hemoglobin counts.
blood_tukey <-
blood |>
pivot_longer(
cols = WBC:HGC,
names_to = "variable",
values_to = "value"
) |>
group_by(variable) |>
tukey_hsd(
formula = value ~ reg + sub,
which = "reg"
) |>
filter(p.adj < 0.05/3) |>
arrange(group1, variable, p.adj) |>
dplyr::select(-term) |>
mutate(
across(
.cols = where(is.numeric),
.fns = ~ round(., digits = 4)
)
)
blood_tukey
## # A tibble: 10 × 9
## variable group1 group2 null.value estimate conf.low conf.high p.adj
## <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 HGC 1 4 0 0.453 0.376 0.529 0
## 2 HGC 1 2 0 0.410 0.334 0.487 0
## 3 HGC 1 3 0 0.421 0.344 0.498 0
## 4 RBC 1 3 0 -0.0684 -0.0919 -0.0449 0
## 5 RBC 1 4 0 -0.0605 -0.084 -0.037 0
## 6 RBC 1 2 0 -0.0526 -0.0761 -0.0291 0
## 7 WBC 1 4 0 -0.258 -0.334 -0.182 0
## 8 WBC 1 3 0 -0.158 -0.234 -0.0818 0
## 9 WBC 2 4 0 -0.174 -0.250 -0.0976 0
## 10 WBC 3 4 0 -0.1 -0.176 -0.0239 0.0053
## # ℹ 1 more variable: p.adj.signif <chr>
Looks like all 3 cheaper reagents are different than the first reagent with only reagent 2 having no evidence of being different than reagent 1 regarding WBC only, but significantly different for RBC and HGC. So it does not appear that the company can use a cheaper alternative to reagent 1 :(
Fit a MANOVA model for just the subjects (no reagents), then save the residuals in a data frame.
Run discriminant analysis on the residuals based on the regeants, then create a plot using the discriminant(s). Does there appear to be a difference? How does the plot compare to the first graph created in 1A)? Briefly explain why there is a difference
## Fitting a manova model just using subjects and calculating the residuals
blood_sub_res <-
manova(cbind(WBC, RBC, HGC) ~ sub,
data = blood) |>
pluck("residuals") |>
data.frame() |>
mutate(reg = blood$reg)
# Finding the linear discriminants of the subject residuals:
#res_lda <-
MASS::lda(reg ~ .,
data = blood_sub_res) |>
predict() |>
pluck("x") |>
data.frame() |>
mutate(reg = blood$reg) ->
blood_sub_ld
# Box plots of LD1
ggplot(data = blood_sub_ld,
mapping = aes(x = LD1,
fill = reg,
y = reg)) +
geom_boxplot(show.legend = F) +
labs(y = "Reagent",
title = "Box plots of LD1")
# Scatterplot of LD1 & LD2
ggplot(data = blood_sub_ld,
mapping = aes(x = LD1,
y = LD2,
color = reg)) +
geom_point() +
labs(fill = "Reagent",
title = "Scatterplots of LD1")
Using LD1, there is a clear difference between reagent 1 vs reagents 2, 3, and 4. There doesn’t seem to be any difference between the reagents 2, 3, or 4 when measuring RBC, WBC, and HGC.
The graph from question 1A) is similar to the box plots created in this question, except the box plots are narrower and the reagent 1 box plot is further to the left than it was without accounting for the subjects.
The reason the boxes are smaller and the difference between R1 vs R2 - 4 is that we are removing the variability in blood counts explained by the individuals who the blood samples came from.
Once we remove the additional explained variability, it magnifies any true differences between the groups!