DATA 606 - Homework

Exercise 5.6

First check that the conditions for inference using the t-distribution are satisfied:

- Observations are independent: this should be true, since it’s a simple random sample of 25 observations, which is assumed to be \(\ll\) the population size.
- Observations come from a nearly normal distribution: true, per the question.

The 90% confidence interval is (65, 77).

Then the sample mean, margin of error, and the sample standard deviation are:

Sample mean: \(\bar{x} = 71\)

```
ci <- c(65, 77)
(m <- (ci[1] + ci[2])/2)
```
```
## [1] 71
```

Margin of error: \(ME_{\bar{x}} = 6\)
```
(me <- (ci[2] - ci[1])/2)
```
```
## [1] 6
```
Sample standard deviation: \(s = 17.5\)

This is calculated as \(s = SE_{\bar{x}} \cdot \sqrt{n}\), where the standard error is \(SE_{\bar{x}} = ME_{\bar{x}} / t^*\). The critical t-value \(t^* = 1.71\) is the 95th percentile of the t-distribution with \(df = n - 1 = 24\) degrees of freedom, which leaves 5% in each tail for a 90% confidence interval.
```
n <- 25
df <- n - 1
# critical t-value for 90% confidence interval
(t <- qt(0.95, df))
```
```
## [1] 1.710882
```
```
# standard error
(se <- me / t)
```
```
## [1] 3.506963
```
```
# sample std dev
(s <- se * sqrt(n))
```
```
## [1] 17.53481
```
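The three steps above can be collected into one function. This is only a sketch: the name `summarize_ci` and its arguments are hypothetical (not part of the original solution), and it assumes the interval is a symmetric two-sided t-interval for a mean.

```r
# Hypothetical helper (not from the original solution): recover summary
# statistics from a two-sided t-based confidence interval for a mean.
summarize_ci <- function(lower, upper, n, level = 0.90) {
  stopifnot(upper > lower, n > 1, level > 0, level < 1)
  t_star <- qt(1 - (1 - level) / 2, df = n - 1)  # two-sided critical value
  me <- (upper - lower) / 2                      # margin of error
  se <- me / t_star                              # standard error of the mean
  list(mean = (lower + upper) / 2, me = me, se = se, sd = se * sqrt(n))
}

res <- summarize_ci(65, 77, n = 25, level = 0.90)
res$sd
```

Calling it with the interval (65, 77) and \(n = 25\) should reproduce the sample mean of 71, the margin of error of 6, and a sample standard deviation of about 17.5.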

Exercise 5.14

The population standard deviation is known in this case, \(\sigma = 250\).

The margin of error should be \(ME_{\bar{x}} \le 25\).

(a) 90% confidence interval

\(ME_{\bar{x}} = t^* SE_{\bar{x}} = t^* \sigma / \sqrt{n} \le 25\)
Let’s assume the sample size is large enough that the t-distribution approximates the normal distribution, so that we can use z-scores instead of t-scores. Then we use \(t_{0.90}^* \approx z^*_{0.90} = 1.645\) and solve for \(n\): \[n \ge \left(\frac{z^*_{0.90} \sigma}{25}\right)^2 = 270.6\]
So the sample size should be at least 271.
```
s <- 250
me <- 25
(z <- qnorm(0.95))
```
```
## [1] 1.644854
```
```
(n <- (z * s / me)^2)
```
```
## [1] 270.5543
```
Let’s check whether \(ME_{\bar{x}} \le 25\) still holds when we use the t-score with \(n=271\). Now the critical t-value is \(t^*_{0.90} = 1.651\) for \(df = n-1 = 270\):

\[ME_{\bar{x}} = t^*_{0.90} SE_{\bar{x}} = \frac{t^*_{0.90} \sigma}{\sqrt{n}} = 25.06\]
```
n <- 271
df <- n - 1
(t <- qt(0.95, df))
```
```
## [1] 1.650517
```
```
(se <- s / sqrt(n))
```
```
## [1] 15.18642
```
```
(me <- t * se)
```
```
## [1] 25.06544
```
Note that for a sample size of \(n=271\), the margin of error \(ME_{\bar{x}}\) is slightly greater than 25. If we need to be precisely below 25, then we should increase the sample size to \(n=273\), in which case the margin of error becomes \(ME_{\bar{x}} = 24.97\).
```
n <- 273
df <- n - 1
(t <- qt(0.95, df))
```
```
## [1] 1.650475
```
```
(se <- s / sqrt(n))
```
```
## [1] 15.13069
```
```
(me <- t * se)
```
```
## [1] 24.97282
```
(b) 99% confidence interval: larger or smaller sample?

Holding the margin of error constant at 25, the sample size for a 99% confidence interval should be larger than that for a 90% confidence interval. This is because the margin of error is equal to the critical t-value times the standard error:

\[ME_{\bar{x}} = t^* SE_{\bar{x}} = \frac{t^* \sigma}{\sqrt{n}} \le 25\]

When we increase the critical t-value (corresponding to the confidence level increasing from 90% to 99%), the standard error must decrease in order to hold the margin of error constant. Since the population standard deviation is fixed, we must increase the sample size in order to decrease the standard error.
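As a rough numerical illustration of this argument, the sketch below compares the z-based sample sizes at the two confidence levels. The variable names are illustrative only, and it assumes the normal approximation to the critical value.

```r
# Illustrative check: with sigma and the target margin of error fixed,
# the z-based sample size scales with the square of the critical value.
sigma <- 250      # population standard deviation (from the exercise)
me_target <- 25   # target margin of error (from the exercise)

z90 <- qnorm(0.95)    # critical value for a 90% interval
z99 <- qnorm(0.995)   # critical value for a 99% interval

n90 <- (z90 * sigma / me_target)^2
n99 <- (z99 * sigma / me_target)^2

ratio <- n99 / n90    # equals (z99 / z90)^2, roughly 2.45
ratio
```

Under these assumptions, moving from 90% to 99% confidence multiplies the required sample size by roughly 2.45, whatever the values of \(\sigma\) and the target margin of error.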
(c) 99% confidence interval: required sample size

Following the same calculation as in part (a), where now \(z^*_{0.99} = 2.576\):

\[n \ge \left(\frac{z^*_{0.99} \sigma}{25}\right)^2 = 663.5\]

So now the sample size should be at least 664.
```
(z <- qnorm(0.995))
```
```
## [1] 2.575829
```
```
me <- 25  # reset: me was overwritten by the t-based checks above
(n <- (z * s / me)^2)
```
```
## [1] 663.4897
```
As before, let’s check whether a sample size of 664 still gives \(ME_{\bar{x}} \le 25\) when we use the t-distribution.

\[ME_{\bar{x}} = t^*_{0.99} SE_{\bar{x}} = \frac{t^*_{0.99} \sigma}{\sqrt{n}} = 25.06\]
```
n <- 664
df <- n - 1
(t <- qt(0.995, df))
```
```
## [1] 2.583265
```
```
(se <- s / sqrt(n))
```
```
## [1] 9.701882
```
```
(me <- t * se)
```
```
## [1] 25.06253
```
Again, to be precise, we need to increase the sample size to \(n=668\) in order to bring the margin of error below 25; in this case the margin of error becomes \(ME_{\bar{x}} = 24.99\).
```
n <- 668
df <- n - 1
(t <- qt(0.995, df))
```
```
## [1] 2.58322
```
```
(se <- s / sqrt(n))
```
```
## [1] 9.67279
```
```
(me <- t * se)
```
```
## [1] 24.98695
```
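The trial-and-error step above can also be automated. The sketch below is one possible way to do it, not the method used in the original solution: the function name `min_n_for_me` is hypothetical, and it follows this write-up's approach of pairing a t critical value (with \(df = n - 1\)) with the known \(\sigma\).

```r
# Hypothetical helper: smallest n whose t-based margin of error is at most
# me_target, using a t critical value with df = n - 1 and a known sigma.
min_n_for_me <- function(sigma, me_target, level) {
  p <- 1 - (1 - level) / 2
  # Start from the z-based bound; t* >= z*, so the answer cannot be smaller.
  n <- max(2, ceiling((qnorm(p) * sigma / me_target)^2))
  # The margin of error decreases as n grows, so step up until it fits.
  while (qt(p, df = n - 1) * sigma / sqrt(n) > me_target) {
    n <- n + 1
  }
  n
}

n90 <- min_n_for_me(sigma = 250, me_target = 25, level = 0.90)
n99 <- min_n_for_me(sigma = 250, me_target = 25, level = 0.99)
c(n90, n99)
```

With the exercise's values this search should land on 273 for the 90% interval and 668 for the 99% interval, matching the manual checks.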

Exercise 5.20

This is a paired data set of reading and writing scores with a sample size of \(n=200\). The observations are the differences of reading and writing scores (reading - writing) for each student.

(a) There is not a clear difference between the average reading and writing scores. The median writing score appears to be higher than the median reading score, but there appears to be wider dispersion in the reading scores. In the histogram of differences in scores (reading - writing) for the paired data set, the distribution appears roughly symmetric and centered around 0 (i.e., no difference in reading and writing scores for each student, on average).

(b) The 200 students are randomly selected from the entire survey population, and the sample size of 200 presumably is \(\ll\) the population size. This implies that the observed differences of scores should be independent.

(c) \(H_0: \mu_{read-write} = 0\)
\(H_A: \mu_{read-write} \neq 0\)
This is a two-tailed test.

(d) The conditions for inference using the t-distribution are satisfied:
- Observations are independent: true from part (b) above
- Observations come from a nearly normal distribution: although the sample histogram doesn’t closely resemble a normal distribution, the sample size is large at 200, and the histogram doesn’t exhibit strong skew, so this condition should be fine.

(e) Sample mean: \(\bar{x}_{read-write} = -0.545\)
Sample standard deviation: \(s = 8.887\)
Standard error: \(SE_{\bar{x}_{read-write}} = s / \sqrt{n} = 0.628\)
T-score: \(T = (\bar{x}_{read-write} - \mu_0) / SE_{\bar{x}_{read-write}} = -0.867\).

The p-value (two-tailed) based on the T-score above with \(df = 199\) is 39%, in which case we fail to reject the null hypothesis. In other words, the data do not provide convincing evidence, at a significance level of \(\alpha = 0.05\) (or even at \(\alpha = 0.20\)), that the mean difference of scores is different from 0.
```
m = -0.545
s = 8.887
n = 200
df = n - 1
(se = s / sqrt(n))
```
```
## [1] 0.6284058
```
```
(t = m / se)
```
```
## [1] -0.867274
```
```
# two-tailed p-value (doubled inside the assignment so that p stores the full value)
(p = 2 * pt(-abs(t), df))
```
```
## [1] 0.3868365
```
(f) We may have made a Type 2 error, i.e., we failed to reject the null hypothesis \(H_0\) when the alternative hypothesis \(H_A\) is in fact true. In this case, the mean difference in the reading and writing scores for the population is different from 0, but we fail to reach that conclusion from the hypothesis test on the sample mean.

(g) Yes. We failed to reject \(H_0\) in a two-tailed test at \(\alpha = 0.05\), so the corresponding 95% confidence interval around the sample estimate \(\bar{x}_{read-write}\) should include the null value of 0. We can confirm this:
Critical t-score: \(t^*_{0.95} = 1.972\)
95% confidence interval: \(\left(\bar{x}_{read-write} \pm t^*_{0.95} SE_{\bar{x}_{read-write}}\right) = (-1.784, 0.694)\)
The confidence interval, as expected, does include 0.
```
(t_crit <- qt(0.975, df))
```
```
## [1] 1.971957
```
```
m + c(-t_crit, t_crit) * se
```
```
## [1] -1.7841889  0.6941889
```
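The link between the two-tailed test and the confidence interval can be checked in one place. The sketch below works from the summary statistics only, since the raw scores are not reproduced here; the function name `t_summary` and its arguments are hypothetical.

```r
# Hypothetical helper: one-sample (or paired-difference) t inference from
# summary statistics only.
t_summary <- function(xbar, s, n, mu0 = 0, conf_level = 0.95) {
  se <- s / sqrt(n)
  df <- n - 1
  t_stat <- (xbar - mu0) / se
  p_value <- 2 * pt(-abs(t_stat), df)           # two-tailed p-value
  t_crit <- qt(1 - (1 - conf_level) / 2, df)    # two-sided critical value
  ci <- xbar + c(-1, 1) * t_crit * se
  list(t = t_stat, p = p_value, ci = ci)
}

res <- t_summary(xbar = -0.545, s = 8.887, n = 200)
rejects <- res$p < 0.05
covers_null <- res$ci[1] <= 0 && 0 <= res$ci[2]
c(rejects = rejects, covers_null = covers_null)
```

For a two-tailed test at level \(\alpha\) and a \(1 - \alpha\) interval built from the same standard error, the test rejects exactly when the interval excludes the null value, so here `rejects` should be `FALSE` and `covers_null` should be `TRUE`.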

Exercise 5.32

We can assume that the conditions for inference are satisfied, per the question; so we will use inference with the t-distribution to test the difference of two population means.

Two-tailed hypothesis test with \(\alpha = 0.05\):
\(H_0: \mu_{man} - \mu_{auto} = 0\)
\(H_A: \mu_{man} - \mu_{auto} \neq 0\)
Compute the difference of the sample means, the standard error, the T-score and the p-value:
\(\bar{x}_1 - \bar{x}_2 = 3.73\)
\(SE_{\bar{x}_1 - \bar{x}_2} = \sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}} = 1.13\)
\(T = ((\bar{x}_1 - \bar{x}_2) - (\mu_{man,0} - \mu_{auto,0})) / SE_{\bar{x}_1 - \bar{x}_2} = 3.30\)
\(p = 0.003\)

The p-value of 0.3% is well under \(\alpha = 0.05\), so we can reject \(H_0\) in favor of the alternative hypothesis \(H_A\). We conclude that the mean fuel efficiencies of automatic and manual cars are different.

```
x1 <- 19.85
x2 <- 16.12
s1 <- 4.51
s2 <- 3.58
n1 <- 26
n2 <- 26
df <- min(n1, n2) - 1

(m <- x1 - x2)
```
```
## [1] 3.73
```
```
(se <- sqrt(s1^2 / n1 + s2^2 / n2))
```
```
## [1] 1.12927
```
```
(t <- m / se)
```
```
## [1] 3.30302
```
```
# two-tailed p-value
(p <- pt(t, df, lower.tail=FALSE) * 2)
```
```
## [1] 0.002883615
```
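The solution above uses the conservative degrees of freedom \(df = \min(n_1, n_2) - 1\). As a sensitivity check, the sketch below also computes the Welch-Satterthwaite approximation to the degrees of freedom. That approximation is a technique swapped in here for comparison, not part of the original solution, and the variable names are illustrative.

```r
# Summary statistics as used in the solution above.
x1 <- 19.85; s1 <- 4.51; n1 <- 26
x2 <- 16.12; s2 <- 3.58; n2 <- 26

v1 <- s1^2 / n1                       # squared standard error, group 1
v2 <- s2^2 / n2                       # squared standard error, group 2
t_stat <- (x1 - x2) / sqrt(v1 + v2)   # same T-score as above

# Conservative df (as in the solution) vs. Welch-Satterthwaite df (assumption).
df_conservative <- min(n1, n2) - 1
df_welch <- (v1 + v2)^2 / (v1^2 / (n1 - 1) + v2^2 / (n2 - 1))

p_conservative <- 2 * pt(-abs(t_stat), df_conservative)
p_welch <- 2 * pt(-abs(t_stat), df_welch)
c(df_conservative = df_conservative, df_welch = df_welch)
c(p_conservative = p_conservative, p_welch = p_welch)
```

The Welch degrees of freedom are larger than 25, so the corresponding p-value is smaller than the conservative one. The decision to reject \(H_0\) at \(\alpha = 0.05\) is the same under either choice.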

Exercise 5.48

ANOVA: comparison of means across many groups.

Here there are \(k=5\) groups, with \(n=1172\) observations.

\(H_0: \mu_{<HS} = \mu_{HS} = \mu_{JC} = \mu_{B} = \mu_{G}\)
Population means are the same for all groups
\(H_A: \mu_i \neq \mu_j\) for at least one pair of groups \(i\) and \(j\)
Population means are NOT the same for all groups
Conditions for inference using ANOVA:
- Independent observations within and across groups: We have to assume that the survey respondents are independent; i.e., we assume the survey is sent to a representative random sample of the population, and that people who respond are independent. The sample size is much less than the population size, so this helps.
- Observations within each group are approximately normal: Judging from the box plots, it appears that the distributions are approximately symmetric for each group, although some groups like HS and Bachelor’s seem to exhibit more skew than the other groups. This is mitigated by the large group sizes (e.g., 546 for HS and 253 for Bachelor’s), so this condition should be fine.
- Variance across groups is approximately equal: This is broadly true, as the group standard deviations vary from 14 to 18, with the overall sample standard deviation equal to 15.

See table below.

| | Df | Sum Sq | Mean Sq | F value | Pr(>F) |
|---|---|---|---|---|---|
| degree | 4 | 2,006 | 501.54 | 2.19 | 0.0682 |
| Residuals | 1,167 | 267,382 | 229.12 | | |
| Total | 1,171 | 269,388 | | | |

```
n <- 1172
k <- 5
MSG <- 501.54
SSE <- 267382
(df_G <- k - 1)
```
```
## [1] 4
```
```
(df_E <- n - k)
```
```
## [1] 1167
```
```
(df_T <- n - 1)
```
```
## [1] 1171
```
```
(SSG <- MSG * df_G)
```
```
## [1] 2006.16
```
```
(SST <- SSG + SSE)
```
```
## [1] 269388.2
```
```
(MSE <- SSE / df_E)
```
```
## [1] 229.1191
```
```
(F <- MSG / MSE)
```
```
## [1] 2.188992
```

The p-value of 0.068 is greater than the significance level of \(\alpha = 0.05\), so we fail to reject \(H_0\). The survey data is insufficient to demonstrate that there is a statistically significant difference of population means across the groups.
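The p-value in the table can be cross-checked from the F statistic and its two degrees of freedom. This is a minimal sketch using the values from the table above; the variable names `F_stat` and `p_value` are illustrative.

```r
# Values taken from the completed ANOVA table above.
MSG <- 501.54; SSE <- 267382
n <- 1172; k <- 5

df_G <- k - 1
df_E <- n - k
F_stat <- MSG / (SSE / df_E)   # named F_stat to avoid masking R's built-in F (FALSE)

# The upper-tail probability of the F distribution gives the ANOVA p-value.
(p_value <- pf(F_stat, df_G, df_E, lower.tail = FALSE))
p_value < 0.05
```

The result should agree with the tabulated Pr(>F) of about 0.068, and the comparison against 0.05 should come out `FALSE`, consistent with failing to reject \(H_0\).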

DATA 606 - Homework - Chapter 5

Kevin Benson

October 27, 2018
