Welcome! You are here to learn how to perform a multivariate test for two mean vectors using the R package stests. So let’s get started…
In this case we have a random sample of size \(n_1\), \(\underline{x}_1, \underline{x}_2, \ldots, \underline{x}_{n_1}\), from a multivariate normal population with mean vector \(\underline{\mu}_1\) and covariance matrix \(\Sigma_1\), and a random sample of size \(n_2\), \(\underline{y}_1, \underline{y}_2, \ldots, \underline{y}_{n_2}\), from a multivariate normal population with mean vector \(\underline{\mu}_2\) and covariance matrix \(\Sigma_2\). We assume that these samples are independent.
Given these random samples, we want to know whether the two populations behave similarly on average, that is, we want to test the equality of the mean vectors of the two multivariate populations:
\[ H_0: \underline{\mu}_1 - \underline{\mu}_2 = 0 \hspace{1cm} vs \hspace{1cm} H_1: \underline{\mu}_1 - \underline{\mu}_2 \neq 0 \]
To do this, we will use the first-order James test, proposed by James (1954), which does not assume equality of the variance-covariance matrices. The statistic for testing the hypotheses above is:
\[ T^2=(\underline{\overline{X}}_1-\underline{\overline{X}}_2)^{\top}\tilde{S}^{-1}(\underline{\overline{X}}_1-\underline{\overline{X}}_2), \]
where \(\underline{\overline{X}}_1\) and \(\underline{\overline{X}}_2\) are the mean vectors of the samples from populations 1 and 2, and the combined variance-covariance matrix \(\tilde{S}\) is given by:
\[ \tilde{S}= \tilde{S}_1 + \tilde{S}_2 = \frac{S_1}{n_1} + \frac{S_2}{n_2}. \]
Finally, we will reject the null hypothesis \(H_0\) if \(T^2>cv\), where the critical value \(cv\) is given by:
\[ cv=\chi^2_{\alpha, p}\left(A + B\chi^2_{\alpha, p}\right), \] with
\[\begin{align*} A &= 1+\frac{1}{2p}\sum_{i=1}^2\frac{\left[tr\left(\tilde{S}^{-1}\tilde{S}_i\right)\right]^{2}}{n_i-1},\\ B &= \frac{1}{p(p+2)}\left\{\sum_{i=1}^2\frac{tr\left[\left(\tilde{S}^{-1}\tilde{S}_i\right)^{2}\right]}{n_i-1} + \frac{1}{2} \sum_{i=1}^2\frac{\left[tr\left(\tilde{S}^{-1}\tilde{S}_i\right)\right]^{2}}{n_i-1}\right\}, \end{align*}\]
where \(p\) corresponds to the number of variables and \(\alpha\) is the level of significance chosen for the test.
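To make these formulas concrete, here is a minimal by-hand sketch in R of \(T^2\), \(A\), \(B\) and \(cv\); the simulated data, the parameter values and the helper tr() are ours, purely for illustration (the stests package computes all of this for you, as we will see below):
set.seed(2024)
library(MASS)
p <- 2; n1 <- 30; n2 <- 40; alpha <- 0.05
x <- mvrnorm(n1, mu=c(0, 0), Sigma=diag(2))      # sample from population 1
y <- mvrnorm(n2, mu=c(0, 0), Sigma=2 * diag(2))  # sample from population 2
S1t <- var(x) / n1                               # S1 tilde
S2t <- var(y) / n2                               # S2 tilde
St  <- S1t + S2t                                 # S tilde
d   <- colMeans(x) - colMeans(y)
T2  <- as.numeric(t(d) %*% solve(St) %*% d)      # test statistic
tr  <- function(M) sum(diag(M))                  # matrix trace (our helper)
M1  <- solve(St) %*% S1t
M2  <- solve(St) %*% S2t
A   <- 1 + (tr(M1)^2/(n1-1) + tr(M2)^2/(n2-1)) / (2*p)
B   <- ((tr(M1 %*% M1)/(n1-1) + tr(M2 %*% M2)/(n2-1)) +
        (tr(M1)^2/(n1-1) + tr(M2)^2/(n2-1)) / 2) / (p*(p+2))
chi <- qchisq(1 - alpha, df=p)                   # upper-alpha quantile
cv  <- chi * (A + B * chi)                       # critical value
T2 > cv                                          # TRUE means reject H0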
If you want to learn more about the first-order James test, see the article Tests of Linear Hypotheses in Univariate and Multivariate Analysis when the Ratios of the Population Variances are Unknown (James, 1954).
stests is an R package with useful functions to perform several statistical tests, including tests for multivariate analysis. The current version of the package is hosted on GitHub and can be installed with the following code:
if (!require('devtools')) install.packages('devtools')
devtools::install_github('fhernanb/stests', force=TRUE)
If you want more information about the package, visit its website.
Now we are going to perform a simulation study to see how well the test performs when testing the hypothesis of equality of two mean vectors. First, we need to load the packages stests, ggplot2 and MASS, and define a function that generates random samples from the two multivariate normal populations.
# Load the packages
library(stests)
library(MASS)
library(ggplot2)
# Function to generate data from multivariate normal populations
gen_data <- function(mu1, sigma1, n1, mu2, sigma2, n2) {
  dt1 <- MASS::mvrnorm(n1, mu=mu1, Sigma=sigma1)  # Population 1
  dt1 <- cbind(dt1, 1)                            # Label sample 1
  dt2 <- MASS::mvrnorm(n2, mu=mu2, Sigma=sigma2)  # Population 2
  dt2 <- cbind(dt2, 2)                            # Label sample 2
  datos <- as.data.frame(rbind(dt1, dt2))
  colnames(datos) <- c('x', 'y', 'pop')
  datos$pop <- as.factor(datos$pop)
  return(datos)
}
To show you exactly what this function does, we are going to generate and plot samples from two bivariate normal populations with fixed values for \(n_1\), \(\mu_1\), \(\Sigma_1\), \(n_2\), \(\mu_2\) and \(\Sigma_2\). The stars that you will see in the graph correspond to \(\mu_1\) and \(\mu_2\).
# Sample size, mean vector and covariance matrix for sample 1
n1 <- 45
mu1 <- c(200, 500)
sigma1 <- matrix(c(13825.3, 23823.4,
                   23823.4, 73107.4), ncol=2)
# Sample size, mean vector and covariance matrix for sample 2
n2 <- 55
mu2 <- c(300, 700)
sigma2 <- matrix(c(8632.0, 19616.7,
                   19616.7, 55964.5), ncol=2)
# Generate the samples
datos <- gen_data(mu1, sigma1, n1, mu2, sigma2, n2)
# Plot the samples
ggplot(data=datos, aes(x=x, y=y, color=pop)) +
  geom_point() +
  annotate("point", x=mu1[1], y=mu1[2], shape=8, size=5) +
  annotate("point", x=mu2[1], y=mu2[2], shape=8, size=5)
Now that we are clear about how the gen_data function works, it is time to generate our samples from two bivariate normal populations. To do this, we set the values of \(n_1\), \(\underline{\mu}_1\), \(n_2\), \(\underline{\mu}_2\), \(\Sigma_1\) and \(\Sigma_2\) as follows:
\[\begin{align*} n_1 &= n_2 = n, \\ \underline{\mu}_1 &= \begin{pmatrix} 200\\ 500 \end{pmatrix},\\ \underline{\mu}_2 &= \underline{\mu}_1 + \delta, \\ \Sigma_1 &= \begin{pmatrix} 13825.3 & 23823.4\\ 23823.4 & 73107.4 \end{pmatrix},\\ \Sigma_2 &= \begin{pmatrix} 8632.0 & 19616.7\\ 19616.7 & 55964.5 \end{pmatrix} \end{align*}\]
where the parameter \(\delta\) controls the distance between the centroids \(\mu_1\) and \(\mu_2\) of the two populations (remember the stars). With this parameter, we want to figure out how the distance, or difference, between the mean vectors affects the performance of the test. It is important to mention that the values selected for the mean vectors and the covariance matrices may vary depending on your needs.
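Note that \(\delta\) is a single scalar added to each coordinate of \(\underline{\mu}_1\), so it shifts the second centroid along the diagonal. A quick check with the \(\mu_1\) defined above (the value \(\delta = 50\) is just an example):
mu1 <- c(200, 500)
delta <- 50
mu1 + delta  # mu2 = (250, 550) when delta = 50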
# Function to generate samples from two bivariate normal populations
auxiliar <- function(delta, n) {
  mu1 <- c(200, 500)  # Mean vector of population 1
  # Covariance matrix of population 1
  sigma1 <- matrix(c(13825.3, 23823.4,
                     23823.4, 73107.4), ncol=2)
  mu2 <- mu1 + delta  # Mean vector of population 2
  # Covariance matrix of population 2
  sigma2 <- matrix(c(8632.0, 19616.7,
                     19616.7, 55964.5), ncol=2)
  dat <- gen_data(mu1, sigma1, n1=n, mu2, sigma2, n2=n)
  return(dat)
}
The next function helps us perform the multivariate test for the two mean vectors. In this case we will use the two_mean_vector_test() function from the stests package. For this function to work, we need to provide it with the sample mean vectors \(\underline{\bar{X}}_1\) and \(\underline{\bar{X}}_2\), the sample covariance matrices \(S_1\) and \(S_2\), and the sample sizes \(n_1\) and \(n_2\). We also need to specify the method used to test the null hypothesis, which in this case is james, and the significance level \(\alpha\).
# Function to perform a multivariate test for two mean vectors
one_simul <- function(delta, n) {
  # Generate the random samples from the two populations
  datos <- auxiliar(delta, n)
  # Calculate the mean vector of each sample
  medias <- lapply(split(x=datos[, 1:2], f=datos$pop, drop=TRUE), colMeans)
  # Calculate the covariance matrix of each sample
  varia <- lapply(split(x=datos[, 1:2], f=datos$pop, drop=TRUE), var)
  # Determine the size of each sample
  N <- lapply(split(x=datos[, 1:2], f=datos$pop, drop=TRUE), nrow)
  # Use two_mean_vector_test() from the stests package to perform the test
  res <- two_mean_vector_test(xbar1=medias$`1`, s1=varia$`1`, n1=N$`1`,
                              xbar2=medias$`2`, s2=varia$`2`, n2=N$`2`,
                              method='james', alpha=0.05)
  # Return the statistic and the critical value of the test
  return(cbind(res$statistic[1], res$statistic[2]))
}
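As a quick sanity check (assuming stests is installed as shown earlier and the functions above are defined), a single run under \(H_0\), that is with \(\delta = 0\), returns the statistic and the critical value side by side; most of the time the statistic should be the smaller of the two:
set.seed(1)
one_simul(delta=0, n=50)  # 1 x 2 matrix: T2 statistic, critical value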
Finally, we use the next function to perform the multivariate test for different values of \(\delta\) and \(n\). For each combination of these two parameters, we replicate the one_simul() function 5000 times, that is, we perform the test 5000 times and calculate the rejection rate: the proportion of tests in which the null hypothesis is rejected, i.e. in which \(T^2 > cv\).
# Function to replicate the one_simul function
mult_simul <- function(x) {
  delta <- x[1]
  n     <- x[2]
  nrep  <- x[3]
  res <- replicate(n=nrep, expr=one_simul(delta=delta, n=n))
  # Return the rejection rate: proportion of replicates with T2 > cv
  return(mean(res[1, 1, ] > res[1, 2, ]))
}
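A single call to mult_simul() estimates the rejection rate for one combination of parameters. For example, under \(H_0\) (\(\delta = 0\)) the estimated rate should be close to the nominal \(\alpha = 0.05\); we use a small nrep here only to keep the example fast:
set.seed(1)
mult_simul(c(delta=0, n=50, nrep=200))  # estimated rejection rate under H0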
# Specify the parameters for the simulation
delta <- seq(from=0, to=150, by=10)
n <- seq(from=10, to=150, by=10)
nrep <- 5000
casos <- expand.grid(delta=delta, n=n, nrep=nrep)
# Calculate the rejection rate for each case
tasa <- apply(X=casos, MARGIN=1, FUN=mult_simul)
res <- cbind(casos, tasa)
The next plot shows us the rejection rate for each combination of parameters \(\delta\) and \(n\).
ggplot(data=res, aes(x=delta, y=tasa)) +
  geom_point() +
  geom_line(aes(color=n, group=n)) +  # one line per sample size n
  labs(x=expression(delta), y="Rejection rate of H0")
From the results of the simulation study, we can conclude that, regardless of the sample size \(n\), the rejection rate tends to 1 as the distance between the centroids \(\delta\) increases, which is exactly what we would expect. On the other hand, as \(n\) increases, the rejection rate reaches 1 faster, so we prefer to work with large sample sizes. Finally, when \(n=10\) the rejection rate behaves differently from the other values of \(n\) and does not reach 1 even when \(\delta\) is large, so we recommend performing this hypothesis test with a sample size larger than 10.