DATA 606 - Homework

Exercise 6.6

False; we know from the point estimate that exactly 46% of Americans in the sample support the decision.
True; the 95% CI is \(0.46 \pm 0.03 =\) (43%, 49%). This assumes that the conditions for inference are met, including that this is a random sample.
False; about 95% of the sample proportions will be within 1.96 standard errors of the true population proportion, not within this particular interval, which is centered on one sample's estimate. Alternatively, about 95% of the 95% confidence intervals constructed from the many random samples will include the true population proportion.
False; decreasing the confidence level decreases the critical Z-value, which decreases the margin of error.
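The last two answers can be checked numerically. The sketch below is illustrative only: it assumes the 3% margin of error was computed at the 95% level, and the simulation uses a hypothetical sample size (`n_sim`) and a hypothetical true proportion (`p_true`) that are not taken from the exercise.

```r
# Point estimate and margin of error quoted in the answers above
p_hat <- 0.46
me_95 <- 0.03
(ci_95 <- p_hat + c(-me_95, me_95))

# Critical values from the standard normal distribution
z_95 <- qnorm(0.975)
z_90 <- qnorm(0.95)

# Back out the standard error (assumes the 3% ME is at 95% confidence),
# then recompute the margin of error at 90% confidence
se <- me_95 / z_95
(me_90 <- z_90 * se)

# Coverage simulation with hypothetical values (not from the exercise)
set.seed(606)
n_sim  <- 1000   # hypothetical sample size
p_true <- 0.46   # hypothetical true population proportion
reps   <- 10000

p_samp  <- rbinom(reps, size = n_sim, prob = p_true) / n_sim
se_samp <- sqrt(p_samp * (1 - p_samp) / n_sim)
covered <- abs(p_samp - p_true) <= z_95 * se_samp
(coverage <- mean(covered))
```

The 90% margin of error comes out smaller than 3%, and the simulated share of intervals that contain `p_true` should land close to 95%.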

Exercise 6.12

48% is a sample statistic; it represents the proportion of the 1,259 respondents in the survey who have that view.
\(n = 1259\)
\(p = 0.48\)
\(SE = \sqrt{p (1-p) / n} = 0.0141\)
\(ME = 1.96 SE = 0.0276\) at 95% confidence level

So the 95% CI is 48% \(\pm\) 2.76% = (45.2%, 50.8%). The interpretation is that we are 95% confident that between 45% and 51% of the population of US residents think that the use of marijuana should be made legal.
```
n <- 1259
p <- 0.48
(SE <- sqrt(p * (1-p) / n))
```
```
## [1] 0.01408022
```
```
(ME <- 1.96 * SE)
```
```
## [1] 0.02759723
```
```
(p + c(-ME, ME))
```
```
## [1] 0.4524028 0.5075972
```
Yes, this should be true, as the conditions for inference using the normal distribution should be met:
- Independence of observations: it is reasonable to assume that this is a random sample, and the sample size is clearly <10% of the population.
- Success-failure condition: there are at least 10 successes and at least 10 failures in the sample (about 604 and 655, respectively).
No, it is not justified. The 95% CI is (45.2%, 50.8%), which lies mostly below 50%. The true population proportion may be >50%, but at a 95% confidence level it could plausibly be anywhere in the interval, including values below 50%.
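Both the condition check and the majority question can be verified in a few lines. This is a minimal sketch that reuses the values from the exercise; it uses the exact normal quantile from `qnorm` instead of the rounded 1.96, so the endpoints may differ in the last digits.

```r
n <- 1259
p <- 0.48

# Success-failure condition: counts of successes and failures
successes <- n * p
failures  <- n * (1 - p)
c(successes = successes, failures = failures)
(conditions_ok <- successes >= 10 && failures >= 10)

# 95% CI using the exact normal quantile instead of the rounded 1.96
z  <- qnorm(0.975)
se <- sqrt(p * (1 - p) / n)
(ci <- p + c(-1, 1) * z * se)

# Is 50% inside the interval? If so, a majority is plausible but not established.
(contains_half <- ci[1] < 0.5 && 0.5 < ci[2])
```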

Exercise 6.20

At a 95% confidence level, the margin of error is:

\[ ME_{\hat{p}} = 1.96 SE_{\hat{p}} = 1.96 \sqrt{\frac{p (1-p)}{n}} \lt 0.02\]

Since we don’t know the true population proportion \(p\), let’s assume the worst case of \(p=0.5\), which maximizes \(p (1-p)\). Then substituting and solving for \(n\):

\[ n \gt \left(\frac{1.96}{0.02}\right)^2 p (1-p) = \left(\frac{1.96}{0.02}\right)^2 0.5 (1-0.5) = 2401\]

(1.96 * 0.5 / 0.02)^2

## [1] 2401

So we would need to survey at least 2,401 Americans in order to limit the margin of error to 2%, at a 95% confidence level.
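The same calculation can be wrapped in a small function to try other margins of error or confidence levels. `min_sample_size` is a hypothetical helper written for this illustration, not part of any package, and it uses the exact quantile from `qnorm` instead of the rounded 1.96.

```r
# Hypothetical helper: smallest n whose margin of error is at most `me`
min_sample_size <- function(me, conf = 0.95, p = 0.5) {
  z <- qnorm(1 - (1 - conf) / 2)   # two-sided critical value
  ceiling((z / me)^2 * p * (1 - p))
}

min_sample_size(0.02)            # worst case, p = 0.5
min_sample_size(0.02, p = 0.3)   # a prior estimate of p lowers the requirement
```

With the exact quantile, the worst case still rounds up to 2,401 respondents.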

Exercise 6.28

CA: \(n_C = 11545\)
\(\hat{p}_C = 0.08\)
OR: \(n_O = 4691\)
\(\hat{p}_O = 0.088\)
Point estimate: \(\hat{p}_C - \hat{p}_O = 0.08 - 0.088 = -0.008\)
Standard error:
\[SE_{\hat{p}_C - \hat{p}_O} = \sqrt{\frac{\hat{p}_C (1-\hat{p}_C)}{n_C} + \frac{\hat{p}_O (1-\hat{p}_O)}{n_O}} = 0.0048\]
Margin of error at 95% confidence level:
\[ME_{\hat{p}_C - \hat{p}_O} = 1.96 \cdot SE_{\hat{p}_C - \hat{p}_O} = 0.0095\]
95% confidence interval:
\[\hat{p}_C - \hat{p}_O \pm ME_{\hat{p}_C - \hat{p}_O} = (-0.0175, 0.0015)\]
So the 95% CI is (-1.75%, 0.15%).
Interpretation: We are 95% confident that the proportion of CA residents who report insufficient sleep is between 1.75 percentage points lower and 0.15 percentage points higher than the corresponding proportion of OR residents (an inference from the sample statistics to the population parameters).
```
nc <- 11545
no <- 4691
pc <- 0.08
po <- 0.088
(pc - po)
```
```
## [1] -0.008
```
```
(se <- sqrt( pc * (1-pc) / nc + po * (1-po) / no ))
```
```
## [1] 0.004845984
```
```
(me <- 1.96 * se)
```
```
## [1] 0.009498128
```
```
(pc - po + c(-me, me))
```
```
## [1] -0.017498128  0.001498128
```
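For reuse, the steps above can be collected into one function. `diff_prop_ci` is a hypothetical helper written for this illustration; it uses the exact quantile from `qnorm`, so the endpoints may differ from the hand calculation in the last digits.

```r
# Hypothetical helper: normal-approximation CI for p1 - p2
diff_prop_ci <- function(p1, n1, p2, n2, conf = 0.95) {
  z  <- qnorm(1 - (1 - conf) / 2)
  se <- sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
  (p1 - p2) + c(-1, 1) * z * se
}

(ci <- diff_prop_ci(p1 = 0.08, n1 = 11545, p2 = 0.088, n2 = 4691))

# Does the interval contain 0 (no difference between the two states)?
(contains_zero <- ci[1] < 0 && 0 < ci[2])
```

Because the interval contains 0, these data are consistent with no difference between the two states at the 95% confidence level.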

Exercise 6.44

\(H_0\): Barking deer have no preference for foraging in certain habitats over others.
\(H_A\): Barking deer have a preference for foraging in certain habitats over others.
We can use the chi-squared goodness of fit test to assess whether the observed foraging sites are consistent with the deer having no habitat preference, that is, with foraging in proportion to how common each habitat is.
Conditions for chi-squared test
- Independent observations: we have to make an assumption that each observation is independent of the others. For instance, we don’t know if factors such as human development, agriculture / pesticide use, or wildfires might cause systemic bias in the observed foraging sites.
- Sample size: we need to check whether each category has at least 5 expected cases. If the null hypothesis is true, then the number of expected cases in each category will be the natural habitat frequency times the total number of observations. The smallest habitat frequency is 4.8% for the “Woods” category, which in a sample of 426 cases gives about 20 expected cases. So this condition is satisfied.
\(k = 4\), with the categories Woods, Cultivated grassplot, Deciduous forests, and Other
\(n = 426\) observations
Expected values if \(H_0\) is true:
\(E_w = 0.048 \cdot 426 = 20.4\)
\(E_c = 0.147 \cdot 426 = 62.6\)
\(E_d = 0.396 \cdot 426 = 168.7\)
\(E_o = 0.409 \cdot 426 = 174.2\)
\(df = k-1 = 3\)
\(Z_w = \frac{4 - 20.4}{\sqrt{20.4}} = -3.6\)
\(Z_c = \frac{16 - 62.6}{\sqrt{62.6}} = -5.9\)
\(Z_d = \frac{61 - 168.7}{\sqrt{168.7}} = -8.3\)
\(Z_o = \frac{345 - 174.2}{\sqrt{174.2}} = 12.9\)

(Note there’s a typo in the book - in the table, Deciduous forests should be 61, not 67.)

Now the chi-squared statistic is:
\[\chi^2 = Z_w^2 + Z_c^2 + Z_d^2 + Z_o^2 = 284.1\]

This is an exceptionally large \(\chi^2\) value, far in the tail of the \(\chi^2\) distribution expected under \(H_0\) with \(df = 3\), so the data provide strong evidence favoring the alternative hypothesis \(H_A\) (the actual p-value is \(\ll\) 0.1%). We conclude that the deer exhibit a preference for foraging in certain habitats over others.
```
k <- 4
n <- 426
p_w <- 0.048
p_c <- 0.147
p_d <- 0.396
(p_o <- 1 - p_w - p_c - p_d)
```
```
## [1] 0.409
```
```
(p_w + p_c + p_d + p_o)
```
```
## [1] 1
```
```
(E_w <- p_w * n)
```
```
## [1] 20.448
```
```
(E_c <- p_c * n)
```
```
## [1] 62.622
```
```
(E_d <- p_d * n)
```
```
## [1] 168.696
```
```
(E_o <- p_o * n)
```
```
## [1] 174.234
```
```
(E_w + E_c + E_d + E_o)
```
```
## [1] 426
```
```
O_w <- 4
O_c <- 16
O_d <- 61
O_o <- 345
(O_w + O_c + O_d + O_o)
```
```
## [1] 426
```
```
(Z_w <- (O_w - E_w) / sqrt(E_w))
```
```
## [1] -3.637372
```
```
(Z_c <- (O_c - E_c) / sqrt(E_c))
```
```
## [1] -5.891521
```
```
(Z_d <- (O_d - E_d) / sqrt(E_d))
```
```
## [1] -8.291769
```
```
(Z_o <- (O_o - E_o) / sqrt(E_o))
```
```
## [1] 12.93704
```
```
(chisq <- Z_w^2 + Z_c^2 + Z_d^2 + Z_o^2)
```
```
## [1] 284.0609
```
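As a cross-check on the step-by-step calculation, base R's `chisq.test` can run the same goodness-of-fit test in one call. This is a minimal sketch; the vector names `observed` and `habitat_p` are chosen here for illustration.

```r
observed  <- c(woods = 4, grassplot = 16, forests = 61, other = 345)
habitat_p <- c(0.048, 0.147, 0.396, 0.409)

# Goodness-of-fit test against the habitat proportions
(fit <- chisq.test(x = observed, p = habitat_p))

fit$expected    # expected counts under H0
fit$residuals   # Pearson residuals, i.e. the Z values computed above

# Upper-tail p-value computed directly from the statistic
pchisq(unname(fit$statistic), df = 3, lower.tail = FALSE)
```

The statistic and degrees of freedom reported by `chisq.test` should agree with the manual result, and the p-value confirms that it lies far below 0.1%.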

Exercise 6.48

Chi-squared test for independence in a two-way table.
\(H_0\): There is no association between coffee intake and depression.
\(H_A\): There is an association between coffee intake and depression.
Overall proportion of women who do or do not suffer from depression:
Yes: \(2607 / 50739 = 0.0514\)
No: \(48132 / 50739 = 0.9486\)
```
(pd <- 2607 / 50739)
```
```
## [1] 0.05138059
```
```
(1-pd)
```
```
## [1] 0.9486194
```
Expected count for (Yes = clinical depression) and (2-6 cups/week = coffee consumption):
\(6617 \cdot 2607 / 50739 = 340\)

Contribution of this cell to the \(\chi^2\) test statistic:
\((Observed - Expected)^2 / Expected = (373 - 340)^2 / 340 = 3.21\)
```
(e <- pd * 6617)
```
```
## [1] 339.9854
```
```
o <- 373
(o - e)^2 / e
```
```
## [1] 3.205914
```
The degrees of freedom are \(df = (5-1) \cdot (2-1) = 4\). From the \(\chi^2\) distribution table in the back of the book, the p-value corresponding to \(\chi^2 = 20.93\) and \(df = 4\) is <0.1%.
Since the p-value of \(\lt 0.001\) is less than our significance level of \(\alpha = 0.05\), we reject \(H_0\) in favor of \(H_A\) and conclude that there is an association between coffee intake and depression.
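Instead of reading the table, the p-value can be computed with `pchisq`. This is a small sketch using the test statistic quoted above; the variable names are chosen here for illustration.

```r
chisq_stat <- 20.93                 # test statistic quoted above
deg_free   <- (5 - 1) * (2 - 1)     # (rows - 1) * (columns - 1)

# Upper-tail probability of the chi-squared distribution
(p_value <- pchisq(chisq_stat, df = deg_free, lower.tail = FALSE))

alpha <- 0.05
(reject_h0 <- p_value < alpha)
```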
Yes, it would be premature to recommend higher coffee consumption on the basis of this study alone. There are other factors to consider, for instance the side effects of higher caffeine intake and other dimensions of psychological health that this study does not address. In addition, the results should be independently replicated by other researchers.

DATA 606 - Homework - Chapter 6

Kevin Benson

November 4, 2018