Our data set includes 62 observations. Each observation records the average amount that a company in the manufacturing sector pays its non-management female employees and its non-management male employees, respectively.
We have the following tasks:

- Develop a statistical hypothesis test using a mean vector to determine whether or not non-management females and males have the same average salary.
- Develop a 95% confidence region for the true average salary vector for females and males.
- Develop marginal and simultaneous confidence intervals for the female and male data.
- Decide between two competing hypotheses for the parameters \(\mu\) and \(\Sigma\) of the underlying multivariate normal distribution.
Figure 1: Scatter plot of the bivariate salary data with the 95% confidence region for the mean salary vector.
Above in Figure 1 we see a scatter plot of our bivariate data and the 95% confidence region for where we believe the true average salary vector lies.
Let us add to the plot our null hypothesis that female and male average salaries are equal, drawn as a green line with slope 1 that passes through the points \((c, c)\) for \(12 < c < 39\).
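Below is a minimal sketch of how such a plot can be produced, assuming the `ellipse` package is available and reusing the quantities `x.bar.vec` (the sample mean vector) and `S.x.bar` (the estimated covariance matrix of the sample mean) that appear in the code later in this report:

library(ellipse)  # assumed available; traces the boundary of a covariance ellipse
region <- ellipse(S.x.bar, centre = as.numeric(x.bar.vec), level = 0.95)
plot(region, type = "l", xlab = "Female average salary", ylab = "Male average salary")
points(x.bar.vec[1], x.bar.vec[2], pch = 19)  # sample mean vector at the centre
abline(a = 0, b = 1, col = "green")           # null hypothesis set: points (c, c)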
Figure 2: The 95% confidence region with the null hypothesis set \(\mu_1 = \mu_2\) added as a green line.
As we can see above in Figure 2, our green null hypothesis set does not intersect the 95% confidence region. Thus, we can reject the null hypothesis and instead favor the alternative: non-management female and male average salaries are not equal.
From Figure 2 we can also see a way to calculate a p-value. The p-value for the composite null hypothesis is attained at the point of the green null hypothesis set closest (in the metric defined by the ellipse) to the ellipse's center, i.e. the sample mean vector. While this point can be found theoretically using linear algebra, we will instead loop through a grid of feasible null hypothesis values and calculate the p-value for each one; the point of minimum distance gives the maximum, and hence the reported, p-value. We include the code below for reference.
# Grid of candidate common means: values below 18 or above 24 are clearly
# farther from the sample mean, so they cannot maximize the p-value.
equals.mesh <- seq(18, 24, by = 0.1)
mu0.matrix  <- cbind(equals.mesh, equals.mesh)  # each row is a candidate mu0 = (c, c)
pvals.vec   <- rep(NA, nrow(mu0.matrix))
for (i in 1:length(pvals.vec)) {
  current.mu0 <- mu0.matrix[i, ]
  # T^2 statistic; x.bar.vec and S.x.bar (the sample mean vector and the
  # estimated covariance of the mean) were computed earlier
  T2 <- t(x.bar.vec - current.mu0) %*% solve(S.x.bar) %*% (x.bar.vec - current.mu0)
  pvals.vec[i] <- 1 - pchisq(T2, 2)
}
max.pval <- max(pvals.vec)                      # largest p-value over the null set
c.val    <- mu0.matrix[which.max(pvals.vec), 1] # the value of c that attains it
result.df <- data.frame("Max P-Value" = max.pval, "c Null Hypothesis" = c.val)
result.df
## Max.P.Value c.Null.Hypothesis
## equals.mesh 7.993606e-15 20.5
Therefore, we can see that the null hypothesis \(H_0: (\mu_1, \mu_2)^T = (20.5, 20.5)^T\) gives us the maximum p-value of 7.993606e-15. We conclude that it is very unlikely that non-management female and male average salaries are equal.
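As an aside, the linear-algebra route mentioned above gives the minimizer in closed form and can serve as a sanity check on the grid search. Writing \(S_{\bar{x}}\) for the estimated covariance matrix of the sample mean (the `S.x.bar` in the code) and \(\mathbf{1} = (1, 1)^T\), minimizing the quadratic form over the null set \(\mu_0 = c\mathbf{1}\) gives

\[ c^* = \arg\min_{c} \, (\bar{x} - c\mathbf{1})^T S_{\bar{x}}^{-1} (\bar{x} - c\mathbf{1}) = \frac{\mathbf{1}^T S_{\bar{x}}^{-1} \bar{x}}{\mathbf{1}^T S_{\bar{x}}^{-1} \mathbf{1}}. \]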
Figure 3: Marginal 95% confidence intervals for females (green) and males (red).
## Binary Gender Lower Upper
## 1 female 20.3349966069143 21.8691969414728
## 2 male 23.2212757157512 25.7090468648939
Above in Figure 3 is a visual representation of the marginal confidence interval for males (in red) and the marginal confidence interval for females (in green); the numerical values appear in the data frame console output above.
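As a minimal sketch, the marginal intervals can be computed as one-at-a-time \(t\) intervals; here we assume `data` holds the female salaries in column 1 and the male salaries in column 2 (the column ordering is an assumption):

marginal.ci <- function(x, level = 0.95) {
  # ordinary t interval for a single mean
  n <- length(x)
  half.width <- qt(1 - (1 - level) / 2, df = n - 1) * sd(x) / sqrt(n)
  c(Lower = mean(x) - half.width, Upper = mean(x) + half.width)
}
rbind(female = marginal.ci(data[, 1]), male = marginal.ci(data[, 2]))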
Figure 4: Simultaneous 95% confidence intervals for females and males.
## Binary Gender Lower Upper
## 1 female 20.131158220816 22.0730353275711
## 2 male 22.8907430958576 26.0395794847875
Above in Figure 4 we can see the visual representation of the simultaneous confidence intervals. As discussed in lecture, these intervals are wider than the marginal confidence intervals. For the exact numerical values, see the R console output table above.
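A matching sketch for one standard construction of the simultaneous intervals, which replaces the \(t\) critical value with the larger Hotelling \(T^2\) critical value \(\sqrt{\frac{p(n-1)}{n-p} F_{p,\,n-p}(0.95)}\) so that all \(p\) intervals hold jointly with 95% confidence (same column assumptions as above):

simultaneous.ci <- function(x, n, p, level = 0.95) {
  crit <- sqrt(p * (n - 1) / (n - p) * qf(level, p, n - p))  # T^2-based critical value
  half.width <- crit * sd(x) / sqrt(n)
  c(Lower = mean(x) - half.width, Upper = mean(x) + half.width)
}
rbind(female = simultaneous.ci(data[, 1], n = 62, p = 2),
      male   = simultaneous.ci(data[, 2], n = 62, p = 2))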
Two analysts, Jacob and Sabine, have competing parameter estimates for the underlying multivariate normal distribution of our female and male average salary data.
Jacob believes the following:
\[ \mu = (22.5, 24.5)^T \\ \Sigma = \begin{bmatrix} 12 & 8 \\ 8 & 12\\ \end{bmatrix} \]
Sabine believes the following:
\[ \mu = (21.5, 26)^T \\ \Sigma = \begin{bmatrix} 9 & 8 \\ 8 & 16\\ \end{bmatrix} \]
In order to decide between our two competing hypotheses we set up a likelihood ratio test.
\[ \Lambda = \frac{L(\theta_J)}{L(\theta_S)} \]
Furthermore, for numerical stability we will take a log-likelihood ratio.
\[ \log(\Lambda) = \ell(\theta_J) - \ell(\theta_S) \]
So we will compute \(\Lambda\), then by a theorem proved in class, we have this relationship with the Hotelling \(T^2\) test statistic:
\[ T^2 = (n-1)\left(\Lambda^{-2/n} - 1\right) \]

Furthermore, this \(T^2\) test statistic asymptotically follows a \(\chi^2_p\) distribution, where in our case \(p = 2\).
Thus, once we have our \(\Lambda\) we will calculate \(T^2\), which in turn gives us a p-value, since our sample size \(n = 62\) is large enough for \(T^2\) to be well approximated by the \(\chi^2_p\) distribution.
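For reference, the quantity that the function below evaluates is the standard multivariate normal log-likelihood,

\[ \ell(\mu, \Sigma) = -\frac{np}{2}\log(2\pi) - \frac{n}{2}\log|\Sigma| - \frac{1}{2}\sum_{i=1}^{n} (x_i - \mu)^T \Sigma^{-1} (x_i - \mu). \]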
log.like <- function(mu, Sigma, n, p, data) {
  # constant and log-determinant terms
  val1 <- -(n * p / 2) * log(2 * pi) - (n / 2) * log(det(Sigma))
  # sum of quadratic forms over the n observations
  val2 <- 0
  for (i in 1:n) {
    x.i  <- matrix(as.numeric(data[i, ]), nrow = p, ncol = 1)
    val2 <- val2 + t(x.i - mu) %*% solve(Sigma) %*% (x.i - mu)
  }
  val1 - 0.5 * val2
}
n <- 62  # sample size
p <- 2   # dimension: (female, male)
mu.J    <- matrix(c(22.5, 24.5), nrow = p, ncol = 1)
Sigma.J <- matrix(c(12, 8, 8, 12), nrow = 2, byrow = TRUE)
jacob   <- log.like(mu.J, Sigma.J, n, p, data)
mu.S    <- matrix(c(21.5, 26), nrow = p, ncol = 1)
Sigma.S <- matrix(c(9, 8, 8, 16), nrow = 2, byrow = TRUE)
sabine  <- log.like(mu.S, Sigma.S, n, p, data)
Lambda <- exp(jacob - sabine)             # likelihood ratio L(theta_J) / L(theta_S)
T2 <- (n - 1) * (1 / Lambda^(2 / n) - 1)  # convert to the Hotelling T^2 statistic
pval.Jacob <- 1 - pchisq(T2, p)           # chi-square approximation with p = 2 df
print(paste("Likelihood ratio in favor Jacob: ", Lambda))
## [1] "Likelihood ratio in favor Jacob: 4.55128446877171e-08"
print(paste("P-Value in favor of Jacoib", pval.Jacob))
## [1] "P-Value in favor of Jacoib 2.47902254280064e-10"
As we can see from the output of our code, the likelihood ratio is deeply in favor of Sabine, and this is further confirmed by our p-value: under Jacob's hypothesis, data as extreme as our sample would be seen with probability only about 2.5e-10.