FALSE. The sample’s level of support is known with certainty.
TRUE. The population’s level of support is unknown and requires estimation by 95 percent confidence interval as stated.
TRUE. This is the definition of a 95 percent confidence interval.
TRUE. The margin of error is proportional to the inverse of the square root of n, \(1/\sqrt{n}\).
TRUE. The 95% confidence interval is well above 50%. Its lower bound is 62%.
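The \(1/\sqrt{n}\) dependence of the margin of error can be illustrated with a short sketch. The proportion `p_hat = 0.5` and the sample sizes below are hypothetical values chosen only for illustration; they are not taken from the problem.

```r
# Hypothetical values for illustration only (not from the problem)
p_hat <- 0.5
n <- c(100, 400, 1600)

z <- qnorm(0.975)                              # critical value for 95% confidence
margin <- z * sqrt(p_hat * (1 - p_hat) / n)    # margin of error at each n

# Each fourfold increase in n halves the margin of error
data.frame(n = n, margin = round(margin, 4))
```

Under these assumptions, each row's margin is half of the one above it, which is what proportionality to \(1/\sqrt{n}\) implies.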
The dataset for this problem is contained is
48% is a sample statistic because it is calculated from the 1259 resident sample.
We are 95% confident the true population proportion lies within the interval whose standard error is: \[SE_{p}=\sqrt{\frac{p(1-p)}{n}}=\sqrt{\frac{(.48)(1-.48)}{1259}}\]
( SE = sqrt( .48 * (1-.48) / 1259) )
## [1] 0.01408022
The 95% confidence interval, calculated under a normal distribution assumption, is (0.452, 0.508).
(z = qnorm(.975))
## [1] 1.959964
proportion = 0.48
(CI_lower = proportion - z * SE)
## [1] 0.4524033
(CI_upper = proportion + z * SE)
## [1] 0.5075967
The critic’s argument is true, but this data set appears to satisfy those requirements. The observations appear independent and the success-failure condition is met.
The news piece statement is not justified. The confidence interval (0.452, 0.508) includes values below 0.50, so the true population proportion may be less than a majority.
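The success-failure condition can be checked directly. The sketch below uses the sample size and proportion from this problem; the threshold of 10 is the common rule of thumb and is an assumption here, since the text does not state which threshold was applied.

```r
n <- 1259          # sample size from the problem
p_hat <- 0.48      # sample proportion from the problem

successes <- n * p_hat        # number of "successes" implied by the sample
failures  <- n * (1 - p_hat)  # number of "failures" implied by the sample

# Assumed rule of thumb: both counts should be at least 10
c(successes = successes, failures = failures)
successes >= 10 && failures >= 10
```

Both counts are in the hundreds, so the condition holds with a wide margin.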
To achieve a 2% margin of error, we require the following equality:
\(zSE=0.02\) which implies \[z\sqrt{\frac{p(1-p)}{n}} = 0.02\]
This requires \[n = \frac{p(1-p)}{ \left( \frac{0.02}{z} \right)^2} \approx 2398\]
( n = proportion * ( 1 - proportion ) /( 0.02/ z)^2 )
## [1] 2397.07
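Because the sample size must be a whole number, the computed value is rounded up. A minimal sketch of that step, with a check of the margin of error on either side of the rounded value (the helper `margin` and the variable names are illustrative, not part of the original solution):

```r
p_hat <- 0.48          # sample proportion from the problem
z <- qnorm(0.975)      # critical value for 95% confidence
target <- 0.02         # desired margin of error

n_exact <- p_hat * (1 - p_hat) / (target / z)^2
n_required <- ceiling(n_exact)   # round up to a whole number of respondents

# Illustrative helper: margin of error for a given sample size
margin <- function(n) z * sqrt(p_hat * (1 - p_hat) / n)

margin(n_required)       # at or below the target
margin(n_required - 1)   # slightly above the target
```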
The large samples appear to be independently drawn, and the success-failure condition is met individually by the California and Oregon samples, so each sample proportion is approximately normally distributed. Thus, the difference in proportions is well approximated by a normal distribution with the standard error below, where \(p_1\) is the California proportion and \(p_2\) is the Oregon proportion.
\[SE_{p_1-p_2} = \sqrt{\frac{p_1(1-p_1)}{n_1}+\frac{p_2(1-p_2)}{n_2}}\]
p1 = .08
n1 = 11545
p2 = 0.088
n2 = 4691
(SE = sqrt( p1*(1-p1)/n1 + p2*(1-p2)/n2 ) )
## [1] 0.004845984
(z = qnorm( 0.975)) # for a 95 percent two-sided confidence interval
## [1] 1.959964
The confidence interval is:
(CI_lower = (p2-p1) - z * SE)
## [1] -0.001497954
(CI_upper = (p2-p1) + z * SE)
## [1] 0.01749795
The 95% confidence interval for the difference in proportions (Oregon minus California) runs from -0.15% to 1.75%; that is, we are 95% confident that the true population difference lies in this range. Because the interval contains 0, a two-sided hypothesis test that the two states have the same proportion does not reject the null hypothesis at the 5% significance level.
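The hypothesis test mentioned above can also be carried out explicitly. The sketch below uses a pooled standard error, which is the usual choice when the null hypothesis is that the two proportions are equal; this differs from the unpooled standard error used for the confidence interval and is an assumption of this example rather than something shown in the original solution.

```r
p1 <- 0.08;  n1 <- 11545   # California
p2 <- 0.088; n2 <- 4691    # Oregon

# Pooled proportion under H0: p1 == p2
p_pool <- (p1 * n1 + p2 * n2) / (n1 + n2)
se_pool <- sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))

z_stat <- (p2 - p1) / se_pool
p_value <- 2 * pnorm(abs(z_stat), lower.tail = FALSE)  # two-sided p-value

c(z = z_stat, p_value = p_value)
p_value > 0.05   # TRUE means H0 is not rejected at the 5% level
```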
Null Hypothesis \(H_0\) the barking deer has no preference where it forages in its habitat. Alternate hypothesis \(H_A\) the barking deer prefers to forage in certain areas of its habitat.
We can use a \(\chi^2\) test for one-way table to test these hypotheses.
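The assumptions are satisfied. The condition applies to expected counts, not observed counts, and every expected count is at least 5: the smallest, for Woods, is about 20.4, even though only 4 deer were observed there. The observations are likely to be independent.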
observations = c( 4, 16, 67, 345) # Woods, Cultivated grass, Deciduous forest, other
( expectedfreq = c( .048, .147, .396, 1 - .048 - .147 -.396) )
## [1] 0.048 0.147 0.396 0.409
( expected_counts = 426 * expectedfreq )
## [1] 20.448 62.622 168.696 174.234
( vec = ( observations - expected_counts ) ^2 / expected_counts )
## [1] 13.23047 34.71002 61.30600 167.36703
( chi_squared = sum( vec) )
## [1] 276.6135
(p_value = 1- pchisq( chi_squared, df = 3 ) )
## [1] 0
The p-value of the \(\chi^2\) test is nearly 0. A test statistic this large would be extremely unlikely if the null hypothesis were true, so we reject \(H_0\) in favor of the alternative hypothesis.
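The printed value of exactly 0 is a floating-point artifact: `pchisq()` returns a number indistinguishable from 1, so subtracting it from 1 gives 0. A minimal sketch of how the upper tail can be requested directly, using the test statistic reported above (the variable names `p_upper` and `log_p` are illustrative):

```r
chi_squared <- 276.6135   # test statistic reported above

# Upper-tail probability computed directly, avoiding 1 - (value near 1)
p_upper <- pchisq(chi_squared, df = 3, lower.tail = FALSE)

# Natural log of the same probability, useful when the value is extremely small
log_p <- pchisq(chi_squared, df = 3, lower.tail = FALSE, log.p = TRUE)

p_upper
log_p
```

The result is a tiny but strictly positive number, which is consistent with the conclusion drawn from the rounded value.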
A \(\chi^2\) test for a two-way table is appropriate for this case. The trickiest condition to establish is independence, given how the bins are defined.
\(H_0\): the proportion of women with clinical depression is the same at every caffeine consumption level. \(H_A\): the proportion of women with clinical depression differs for at least one caffeine consumption level.
We calculate the proportions at each caffeine consumption level and in the aggregate.
(counts = data.frame( yes = c( 670, 373, 905, 564, 95) ,
no = c( 11545, 6244, 16329, 11726, 2288)
) )
## yes no
## 1 670 11545
## 2 373 6244
## 3 905 16329
## 4 564 11726
## 5 95 2288
(totals = counts$yes + counts$no ) # total women surveyed at each caffeine consumption level
## [1] 12215 6617 17234 12290 2383
(counts / totals) # fraction of women with depression grouped by caffeine consumptions
## yes no
## 1 0.05485059 0.9451494
## 2 0.05636996 0.9436300
## 3 0.05251248 0.9474875
## 4 0.04589097 0.9541090
## 5 0.03986572 0.9601343
(aggregate_level = sum(counts$yes) / sum(counts$no + counts$yes) ) # overall fraction of women with depression across all caffeine levels
## [1] 0.05138059
Based on this survey, 5.14% of the women suffer from depression and 94.86% do not.
(expected_count = ( 6617 * 2607 ) / 50739 )
## [1] 339.9854
The contribution to the test statistic is: \(\chi_1^2 = 3.206\)
(chi1sq = (373 - expected_count) ^2 / expected_count )
## [1] 3.205914
(p_value = 1 - pchisq( 20.93, df = 4))
## [1] 0.0003269507
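The statistic 20.93 is used in the p-value calculation without its derivation being shown. As a sketch, it can be recomputed from the table of counts by summing the contribution of every cell; the object names below (`expected`, `chi_sq`, `dof`, `builtin`) are illustrative and not part of the original solution.

```r
# Observed counts: rows are caffeine consumption levels, columns are depression yes/no
counts <- matrix(
  c(670, 373, 905, 564, 95,
    11545, 6244, 16329, 11726, 2288),
  ncol = 2,
  dimnames = list(NULL, c("yes", "no"))
)

# Expected count per cell: (row total * column total) / grand total
expected <- outer(rowSums(counts), colSums(counts)) / sum(counts)

chi_sq <- sum((counts - expected)^2 / expected)
dof <- (nrow(counts) - 1) * (ncol(counts) - 1)   # (5 - 1) * (2 - 1) = 4
p_value <- pchisq(chi_sq, df = dof, lower.tail = FALSE)

c(chi_sq = chi_sq, dof = dof, p_value = p_value)

# Cross-check against R's built-in test on the same table
builtin <- chisq.test(counts)
all.equal(unname(builtin$statistic), chi_sq)
```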
Since the p-value is so small, we reject the null hypothesis and conclude that there is an association between caffeine consumption and depression.
I agree with the author that the \(\chi^2\) test did not establish causality between depression and caffeine consumption. There could be a third factor underlying the observed variables. The direction of the effect is also unknown: depression could influence caffeine consumption, which is a possible interpretation of the same data.