TODO: 11.1, 11.2, 11.3, 11.8, 11.15, 11.16, 11.19, 11.32, 11.34, 11.39
A computer was used to generate four random numbers from a normal distribution with a set mean and variance: 1.1650, .6268, .0751, .3516. Five more random normal numbers with the same variance but perhaps a different mean were then generated (the mean may or may not actually be different): .3035, 2.6961, 1.0591, 2.7971, 1.2641.
What do you think the means of the random normal number generators were? What do you think the difference of the means was?
What do you think the variance of the random number generator was?
What is the estimated standard error of your estimate of the difference of the means?
Form a 90% confidence interval for the difference of the means of the random number generators.
In this situation, is it more appropriate to use a one-sided test or a two-sided test of the equality of the means?
What is the p-value of a two-sided test of the null hypothesis of equal means?
Would the hypothesis that the means were the same versus a two-sided alternative be rejected at the significance level \(\alpha = .1\) ?
Suppose you know that the variance of the normal distribution was \(\sigma^2 = 1\). How would your answers to the preceding questions change?
Solution. I will use R to solve most of this problem. The setup:
library(ggplot2)
xs <- c(1.1650, .6268, .0751, .3516)
ys <- c(.3035, 2.6961, 1.0591, 2.7971, 1.2641)
n <- length(xs)
m <- length(ys)
a. The means of the random number generators are naturally estimated by the sample means, which are the MLEs of the mean of a normal distribution.
x_bar <- mean(xs)
y_bar <- mean(ys)
These values are given by 0.554625 and 1.62398 respectively.
Now the difference of the means is given by \(d = \bar{X} - \bar{Y}\), in R
d <- x_bar - y_bar; d
## [1] -1.069355
b. To estimate the variance we use the pooled sample variance \(s_p^2\) (stored in `sp` below),
pooled_var <- function(x,y) {
numerator <- ((length(x) - 1) * var(x)) + ((length(y) - 1) * var(y))
denom <- length(x) + length(y) - 2
numerator/denom
}
sp <- pooled_var(xs, ys); sp
## [1] 0.7666932
c. To estimate the standard error we use
\[SE_{\bar{X} - \bar{Y}} = \sqrt{s_p^2} \sqrt{\frac{1}{4} + \frac{1}{5}}\]
d_err <- sqrt(sp) * (sqrt(1/5 + 1/4)); d_err
## [1] 0.5873772
d. To form a 90% CI (so \(\alpha = .1\)) we need to calculate the following:
\[(\bar{X} - \bar{Y}) \pm t_{m+n -2}(\alpha/2)SE\]
We already have the point estimate and the standard error, so we compute the \(t\) quantile and find the interval,
t_val <- qt(p = .95, df = 7); t_val
## [1] 1.894579
ci <- c(d, d) + c(- t_val*d_err, t_val*d_err); ci
## [1] -2.18218723 0.04347723
e. In this case a two-sided test is more appropriate, since we are given almost no information on how the two generators relate to each other.
f. To find the p-value we first find the test statistic, in this case given by,
\[t = \frac{\bar{X} - \bar{Y}}{\sqrt{s_p^2 \left( \frac{1}{n} + \frac{1}{m} \right)}}\]
t_stat <- (d)/(d_err)
The lower-tail probability of observing a value smaller than this under the null hypothesis is,
lb <- pt(q = t_stat, df = 7); lb
## [1] 0.05573858
Since this is a two-sided test, the p-value also includes the probability of a value larger than 1.8205593. By the symmetry of the \(t\) distribution we can multiply the above value by two to get the final p-value of 0.1114772.
g.
We note that the 90% CI includes zero (equivalently, the p-value of 0.1114772 exceeds \(\alpha = .1\)), so we fail to reject the null hypothesis.
h.
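Knowing the variance allows us to use a two-sample \(z\) test. The point estimates of the means and of their difference are unchanged, the variance no longer needs to be estimated, the standard error uses \(\sigma = 1\) in place of \(\sqrt{s_p^2}\), and standard normal quantiles replace the \(t\) quantiles in the confidence interval and the p-value.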
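The changes described in part h can be sketched in R as follows. This is a minimal sketch, assuming the common variance is exactly \(\sigma^2 = 1\); the names `se_z`, `z_stat`, `p_two_sided` and `ci_90` are introduced here for illustration and are not part of the original solution.

```r
# Two-sample z procedures, assuming the common variance is known to be 1.
xs <- c(1.1650, .6268, .0751, .3516)
ys <- c(.3035, 2.6961, 1.0591, 2.7971, 1.2641)
sigma <- 1  # known standard deviation (given in part h)

d_hat <- mean(xs) - mean(ys)                        # point estimate, unchanged
se_z <- sigma * sqrt(1/length(xs) + 1/length(ys))   # exact standard error
z_stat <- d_hat / se_z                              # z statistic under H0
p_two_sided <- 2 * pnorm(-abs(z_stat))              # two-sided p-value
ci_90 <- d_hat + c(-1, 1) * qnorm(.95) * se_z       # 90% confidence interval

z_stat; p_two_sided; ci_90
```

Under these assumptions the 90% interval should still contain zero and the two-sided p-value should still exceed .1, so the conclusions of parts g and e would not change.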
The difference of the means of two normal distributions with equal variance is to be estimated by sampling an equal number of observations from each distribution. If it were possible, would it be better to halve the standard deviations of the populations or double the sample sizes?
Solution
The two distributions have equal variance, and the difference of the means is estimated from an equal number of observations from each.
Here, better means a smaller margin of error, which is proportional to the standard error \[S_p\sqrt{\frac{1}{n} + \frac{1}{m}}\] Since the sample sizes are equal, this becomes \[S_p\sqrt{\frac{2}{n}}\] Comparing halving the standard deviation with doubling the sample size, \[\left(\frac{1}{2}S_p\right)\sqrt{\frac{2}{n}} \;\; vs\;\; S_p\sqrt{\frac{2}{2n}}\]
\[\left(\frac{1}{2}S_p\right)\frac{\sqrt{2}}{\sqrt{n}} \;\; vs\;\; S_p\frac{1}{\sqrt{n}}\]
In this last comparison the left side equals \(S_p/\sqrt{2n} \approx 0.71\, S_p/\sqrt{n}\), so halving the standard deviation results in the smaller margin of error. Relative to the original \(S_p\sqrt{2/n}\), halving the standard deviation scales the margin by \(1/2\), while doubling the sample size scales it only by \(1/\sqrt{2}\).
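The comparison can be checked numerically. The sketch below uses a hypothetical helper `margin()` and illustrative values \(\sigma = 1\), \(n = 10\); neither comes from the problem, and any positive values should give the same ordering.

```r
# Standard error of the difference of means with equal sample sizes n
margin <- function(sigma, n) sigma * sqrt(2 / n)

sigma <- 1   # illustrative standard deviation (assumption)
n <- 10      # illustrative sample size per group (assumption)

c(original = margin(sigma, n),
  half_sd  = margin(sigma / 2, n),   # halve the standard deviation
  double_n = margin(sigma, 2 * n))   # double the sample size
```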
Under the assumption that the two population variances were equal, we estimated this quantity by \[s_p^2\left( \frac{1}{n} + \frac{1}{m} \right)\] and without this assumption by, \[\frac{s_X^2}{n} + \frac{s_Y^2}{m}\] Show that these two estimates are identical when \(m = n\).
Solution
First note that when \(m = n\) then
\[s_p^2 = \frac{(n-1)(s_X^2 + s_Y^2)}{2(n-1)} = \frac{s_X^2 + s_Y^2}{2}\] and so we have that \[\frac{s_X^2 + s_Y^2}{2}\left( \frac{1}{n} + \frac{1}{m} \right) = \frac{s_X^2 + s_Y^2}{2}\left( \frac{2}{n} \right)\] \[ = \frac{s_X^2}{n} + \frac{s_Y^2}{m}\] when \(m = n\)
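As a numerical sanity check of this identity, the following sketch compares the two estimates on simulated data with equal sample sizes. The data, the seed and the variable names are assumptions made for illustration only.

```r
set.seed(1)
x <- rnorm(8)            # illustrative sample, n = 8
y <- rnorm(8, sd = 3)    # second sample of the same size, different spread
n <- length(x)

# pooled variance with m = n
sp2 <- ((n - 1) * var(x) + (n - 1) * var(y)) / (2 * n - 2)

pooled   <- sp2 * (1 / n + 1 / n)     # estimate assuming equal variances
unpooled <- var(x) / n + var(y) / n   # estimate without that assumption

all.equal(pooled, unpooled)
```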
An experiment to determine the efficacy of a drug for reducing high blood pressure is performed using four subjects in the following way: two of the subjects are chosen at random for the control group and two for the treatment group. During the course of treatment with the drug, the blood pressure of each of the subjects in the treatment group is measured for ten consecutive days as is the blood pressure of each of the subjects in the control group.
In order to test whether the treatment has an effect, do you think it is appropriate to use the two-sample t test with n = m = 20?
Do you think it is appropriate to use the Mann-Whitney test with n = m = 20?
Solution
a.
Although it might seem reasonable at first, some of the assumptions required for this test are not met, namely independence. Treating measurements of the same patient on consecutive days as independent observations is highly suspect, so there are effectively only two independent subjects per group rather than twenty observations.
b.
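No. The Mann-Whitney test also assumes that all of the observations are independent, so for the same reason the experimental conditions for this test are not met.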
Suppose that n measurements are to be taken under a treatment condition and another n measurements are to be taken independently under a control condition. It is thought that the standard deviation of a single observation is about 10 under both conditions. How large should n be so that a 95% confidence interval for \(\mu_x - \mu_y\) has a width of 2? Use the normal distribution rather than the t distribution, since n will turn out to be rather large.
Solution
With \(\sigma = 10\) treated as known, the 95% confidence interval is:
\[(\bar{X} - \bar{Y}) \pm 1.96 \times 10 \times \sqrt{\frac{2}{n}}\]
Now we note that the only part of the above that affects the width is
\[1.96 \times 10 \times \sqrt{\frac{2}{n}}\]
Moreover, this is half of the width (since we add and subtract this term from the point estimate). So we just have to solve
\[1.96 \times 10 \times \sqrt{\frac{2}{n}} \leq 1\]
Solving for \(n\) gives \(n \geq 2 \times 19.6^2\):
19.6^2 * 2
## [1] 768.32
and so, rounding up to a whole number, we need at least \(n = 769\) observations in each group.
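The same calculation can be sketched with the unrounded normal quantile and an explicit round-up. The variable names `half_width` and `n_min` are introduced here for illustration.

```r
sigma <- 10        # assumed standard deviation of a single observation
half_width <- 1    # half of the desired interval width of 2
z <- qnorm(.975)   # two-sided 95% normal quantile, about 1.96

# smallest integer n with z * sigma * sqrt(2 / n) <= half_width
n_min <- ceiling(2 * (z * sigma / half_width)^2)
n_min
```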
How large should n be so that the test of \(H_0: \mu_x = \mu_y\) against a one-sided alternative \(H_a: \mu_x > \mu_y\) has power of .5 if \(\mu_x - \mu_y = 2\) and \(\alpha = .10\)?
Solution
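Using the method described in the book, with \(\Delta = \mu_x - \mu_y = 2\) and taking \(\sigma = 10\) from the previous problem, the power of the one-sided test is \(1 - \Phi\left( z(\alpha) - \frac{\Delta}{\sigma}\sqrt{\frac{n}{2}} \right)\). It equals .5 exactly when
\[\Phi\left( z(\alpha) - \frac{\Delta}{\sigma}\sqrt{\frac{n}{2}} \right) = .5\]
which we solve for \(n\).
Note that \(\Phi(z) = .5\) exactly when \(z = 0\), so we have
\[z(\alpha) - \frac{\Delta}{\sigma}\sqrt{\frac{n}{2}} = 0\]
Plugging in \(z(.10) \approx 1.28\), \(\Delta = 2\) and \(\sigma = 10\) and solving for \(n\) gives \(n = 2(1.28 \times 5)^2\), so \(n\) should be about 82: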
(1.28*(5))**2 * 2
## [1] 81.92
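The power formula can also be evaluated directly as a function of \(n\). This is a sketch under the same normal approximation; the function name `power_one_sided` and its default arguments (\(\Delta = 2\), \(\sigma = 10\), \(\alpha = .10\)) are assumptions introduced here to mirror the values used above.

```r
# Power of the one-sided z test with n observations per group
power_one_sided <- function(n, delta = 2, sigma = 10, alpha = .10) {
  1 - pnorm(qnorm(1 - alpha) - (delta / sigma) * sqrt(n / 2))
}

# power for sample sizes near the solution
power_one_sided(c(81, 82, 83))
```

Because the power increases with \(n\), evaluating nearby values shows where it crosses .5. Using the unrounded quantile `qnorm(.90)` instead of 1.28 may shift the required \(n\) by one.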
An experiment is planned to compare the mean of a control group to the mean of an independent sample of a group given a treatment. Suppose that there are to be 25 samples in each group. Suppose that the observations are approximately normally distributed and that the standard deviation of a single measurement in either group is \(\sigma = 5\).
What will the standard error of \(\bar{Y} - \bar{X}\) be?
With a significance level of \(\alpha = .05\) what is the rejection region of the test of the null hypothesis \(H_0: \mu_y = \mu_x\) versus the alternative \(H_A: \mu_y > \mu_x ?\)
What is the power of the test if \(\mu_y = \mu_x + 1 ?\)
Suppose that the p-value of the test turns out to be 0.07. Would the test reject at significance level \(\alpha = .10 ?\)
Solution
We are given the following:
\[ n = m = 25 \;\; \sigma = 5\]
a.
To compute the standard error we use:
\[\sigma \sqrt{\frac{1}{n} + \frac{1}{m}}\]
n <- 25; m <- 25
sigma <- 5
std_error <- sigma * sqrt(1/n + 1/m); std_error
## [1] 1.414214
b.
The rejection region is formed by the value \(k\) such that
\[P_0(\bar{Y} - \bar{X} > k) = .05\]
We know that
\[\bar{Y} - \bar{X} \sim N(\mu_y - \mu_x, 2)\]
Since this probability is computed under the null hypothesis, we assume that \(\mu_y = \mu_x \Rightarrow \mu_y - \mu_x = 0\) and so
\[\bar{Y} - \bar{X} \sim N(0, 2)\]
and so we can calculate this in R
reject_reg <- qnorm(p = .95, mean = 0, sd = sqrt(2)); reject_reg
## [1] 2.326174
c.
The power is defined as
\[P_A(\bar{Y} - \bar{X} > 2.3261743 )\]
As before,
\[\bar{Y} - \bar{X} \sim N(\mu_y - \mu_x, 2)\]
But here we are given that \(\mu_y = \mu_x + 1 \Rightarrow \mu_y - \mu_x = 1\) and so
\[\bar{Y} - \bar{X} \sim N(1, 2)\]
using R
1 - pnorm(q = reject_reg, mean = 1, sd = sqrt(2))
## [1] 0.1741873
d.
Yes, since the p-value of 0.07 is smaller than \(\alpha = .10\).
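The power calculation in part c extends to any true difference \(\mu_y - \mu_x\). The sketch below wraps it in a hypothetical helper `power_at()`; the grid of differences is chosen only for illustration.

```r
se <- 5 * sqrt(1/25 + 1/25)          # standard error from part a
k <- qnorm(.95, mean = 0, sd = se)   # rejection threshold from part b

# power of the test when the true difference mu_y - mu_x equals delta
power_at <- function(delta) 1 - pnorm(k, mean = delta, sd = se)

power_at(c(0, 1, 2, 3))
```

At a difference of zero the function returns the significance level, and at a difference of one it reproduces the value found in part c.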
If \(X\sim N(\mu_x, \sigma_x^2)\) and \(Y\) is independent \(N(\mu_y, \sigma_y^2)\) what is \(\pi = P(X < Y) ?\)
Solution
Note that finding \(P(X < Y)\) is equivalent to finding \(P(X - Y < 0)\), so the problem reduces to finding the distribution of \(X - Y\). Since \(X\) and \(Y\) are independent normal random variables, their difference is itself normal with \(\mu_{X - Y} = \mu_x - \mu_y\) and \(\sigma_{X - Y}^2 = \sigma_x^2 + \sigma_y^2\), and so we have that
\[X - Y \sim N(\mu_x - \mu_y, \sigma_x^2 + \sigma_y^2)\]
We can standardize this with
\[\frac{X - Y - (\mu_x - \mu_y)}{\sqrt{\sigma_x^2 + \sigma_y^2}} \sim N(0,1)\]
and so we can rewrite \(P(X - Y < 0)\) as
\[P\left(\frac{X - Y - (\mu_x - \mu_y)}{\sqrt{\sigma_x^2 + \sigma_y^2}} < \frac{- (\mu_x - \mu_y)}{\sqrt{\sigma_x^2 + \sigma_y^2}}\right)\]
\[ = P\left(Z < - \frac{\mu_x - \mu_y}{\sqrt{\sigma_x^2 + \sigma_y^2}}\right)\]
\[ = 1 - \Phi\left(\frac{\mu_x - \mu_y}{\sqrt{\sigma_x^2 + \sigma_y^2}} \right)\]
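The closed form can be checked against a simulation. This is a rough sketch: the parameter values, the seed and the number of draws are arbitrary assumptions, and the simulated proportion should only agree with the formula up to Monte Carlo error.

```r
set.seed(42)
mu_x <- 1;   sigma_x <- 2   # illustrative parameters (assumptions)
mu_y <- 1.5; sigma_y <- 1

# value given by the derived formula
formula_value <- 1 - pnorm((mu_x - mu_y) / sqrt(sigma_x^2 + sigma_y^2))

# Monte Carlo estimate of P(X < Y) from independent draws
x <- rnorm(1e5, mean = mu_x, sd = sigma_x)
y <- rnorm(1e5, mean = mu_y, sd = sigma_y)
simulated <- mean(x < y)

c(formula = formula_value, simulated = simulated)
```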
This problem contrasts the power functions of paired and unpaired designs. Graph and compare the power curves for testing \(H_0: \mu_x = \mu_y\) for the following two designs.
Solution
a.
b.
The setup:
test <- c(676,206,230,256,280,433,337,466,497,512,794,428,452,512)
control <- c(88,570,605,617,653,2913,924,286,1098,982,2346,321,615,519)
difference <- control - test
df <- data.frame(test = test, control = control, difference = difference)
a.
Plot the difference vs the control
qplot(control, difference, data = df) + theme_bw()
There appears to be a strong positive correlation between the difference and the control.
b.
Calculate the mean difference, its standard deviation (that is, the standard error of the mean difference) and a CI.
# the mean difference
mean_d <- mean(df$control) - mean(df$test); cat("The mean difference:", mean_d)
## The mean difference: 461.2857
# standard error of the mean difference: sqrt(Var(control - test) / 14)
sd_d <- sqrt(var(df$control)/14 + var(df$test)/14 - cov(df$control, df$test)*2/14); cat("The Std. Error of the mean difference:", sd_d)
## The Std. Error of the mean difference: 202.533
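The confidence interval requested in part b is not computed above. One way to form it is sketched below, assuming a 95% level (the level is not stated in this section) and a paired-sample \(t\) interval with 13 degrees of freedom. The names `se_mean_d` and `ci_95` are introduced here for illustration.

```r
test <- c(676,206,230,256,280,433,337,466,497,512,794,428,452,512)
control <- c(88,570,605,617,653,2913,924,286,1098,982,2346,321,615,519)
difference <- control - test
n_pairs <- length(difference)

# standard error of the mean of the paired differences
se_mean_d <- sd(difference) / sqrt(n_pairs)

# t interval for the mean difference; the 95% level is an assumption
ci_95 <- mean(difference) + c(-1, 1) * qt(.975, df = n_pairs - 1) * se_mean_d
ci_95
```

Computing `sd(difference) / sqrt(n_pairs)` directly is equivalent to the variance and covariance expression used above, since the variance of a difference already accounts for the covariance between the paired measurements.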