DATA 606 - Assignment 6

6.6 2010 Healthcare Law

FALSE. The sample’s level of support is known with certainty.
TRUE. The population’s level of support is unknown and must be estimated, which is what the stated 95 percent confidence interval does.
TRUE. This is the definition of a 95 percent confidence interval.
TRUE. This depends on the inverse of the square root of n.
TRUE. The 95% confidence interval is well above 50%. Its lower bound is 62.
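The dependence on the inverse of the square root of n noted above can be illustrated with a small sketch. The proportion of 0.5, the sample sizes, and the helper name `margin_of_error` are illustrative assumptions, not values taken from the exercise.

```r
# Margin of error for a sample proportion under the normal approximation.
# p = 0.5 and the sample sizes below are illustrative assumptions.
margin_of_error <- function(p, n, conf = 0.95) {
  z <- qnorm(1 - (1 - conf) / 2)  # two-sided critical value
  z * sqrt(p * (1 - p) / n)
}

me_small <- margin_of_error(0.5, 1000)
me_large <- margin_of_error(0.5, 4000)  # four times the sample size

# Ratio of the two margins: quadrupling n halves the margin of error.
me_large / me_small
```

Because the margin of error scales with 1/sqrt(n), cutting it in half requires four times as many observations.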

6.12 Legalization of marijuana, Part I.

The dataset for this problem is contained is

48% is a sample statistic because it is calculated from the sample of 1259 residents.
We are 95% confident that the true population proportion lies within the interval \(p \pm z \cdot SE_{p}\), where the standard error is: \[SE_{p}=\sqrt{\frac{p(1-p)}{n}}=\sqrt{\frac{(.48)(1-.48)}{1259}}\]

( SE = sqrt( .48 * (1-.48) / 1259) )

## [1] 0.01408022

Under the normal approximation, the 95% confidence interval is (.452, .508).

(z = qnorm(.975))

## [1] 1.959964

proportion = 0.48

(CI_lower = proportion - z * SE)

## [1] 0.4524033

(CI_upper = proportion + z * SE)

## [1] 0.5075967

The critic’s point is correct, but this data set appears to satisfy those requirements. The observations appear independent, and the success-failure condition is met (about 604 successes and 655 failures, both well above 10).
The news piece’s statement is not justified. The confidence interval extends below .50, so the true population proportion could plausibly be less than a majority.
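The two checks described above can be sketched in a few lines. This is only an illustration using the 48% and 1259 from the exercise; the variable names and the threshold of 10 for the success-failure condition are assumptions of this sketch.

```r
# Values from the exercise: 48% of 1259 respondents.
p_hat <- 0.48
n <- 1259

# Success-failure condition: at least 10 expected successes and failures
# (the threshold of 10 is the usual rule of thumb, assumed here).
successes <- n * p_hat
failures <- n * (1 - p_hat)
condition_met <- successes >= 10 && failures >= 10

# 95% confidence interval under the normal approximation.
se <- sqrt(p_hat * (1 - p_hat) / n)
ci <- p_hat + c(-1, 1) * qnorm(0.975) * se

# Does the interval leave room for a true proportion of exactly 0.50?
includes_half <- ci[1] <= 0.5 && 0.5 <= ci[2]

c(successes = successes, failures = failures)
condition_met
round(ci, 3)
includes_half
```

If `includes_half` is TRUE, the data are consistent with support at or below 50%, which is why a claim of majority support is not warranted.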

6.20 Legalize Marijuana, Part II

To achieve a 2% margin of error, we require the following equality:

\(zSE=0.02\) which implies \[z\sqrt{\frac{p(1-p)}{n}} = 0.02\]

Solving for \(n\) gives \[n = \frac{p(1-p)}{ \left( \frac{0.02}{z} \right)^2} \approx 2397.07,\] which we round up to a sample of at least 2398.

( n = proportion * ( 1 - proportion ) /( 0.02/ z)^2 )

## [1] 2397.07
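Since a sample size must be a whole number, the result is rounded up. A minimal sketch of that step follows; the helper name `required_n` is hypothetical, and it assumes the proportion stays near the Part I estimate of 0.48.

```r
# Smallest sample size giving a margin of error of at most `me`.
# Assumes the true proportion is close to the planning value `p`.
required_n <- function(p, me, conf = 0.95) {
  z <- qnorm(1 - (1 - conf) / 2)  # two-sided critical value
  ceiling(p * (1 - p) / (me / z)^2)
}

n_needed <- required_n(0.48, 0.02)
n_needed

# Check: the margin of error at that n does not exceed 2%.
qnorm(0.975) * sqrt(0.48 * 0.52 / n_needed) <= 0.02

# A more conservative plan uses p = 0.5, which maximizes p * (1 - p).
n_conservative <- required_n(0.5, 0.02)
n_conservative
```

Using p = 0.5 asks for a slightly larger sample and guards against the proportion differing from the earlier estimate.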

6.28 Sleep deprivation, CA or OR, Part I

Each sample proportion is approximately normally distributed: the observations in each large sample appear to be independently drawn, and the success-failure condition is met separately by the California and Oregon samples. Thus, the difference in proportions is well approximated by a normal distribution with the standard error given next, where \(p_1\) is the California proportion and \(p_2\) is the Oregon proportion.

\[SE_{p_1-p_2} = \sqrt{\frac{p_1(1-p_1)}{n_1}+\frac{p_2(1-p_2)}{n_2}}\]

p1 = .08
n1 = 11545
p2 = 0.088
n2 = 4691

(SE = sqrt( p1*(1-p1)/n1 + p2*(1-p2)/n2 ) )

## [1] 0.004845984

(z = qnorm( 0.975))  # for a 95 percent two-sided confidence interval

## [1] 1.959964

The confidence interval is:

(CI_lower = (p2-p1) - z * SE)

## [1] -0.001497954

(CI_upper = (p2-p1) + z * SE)

## [1] 0.01749795

The 95% confidence interval for the difference in proportions runs from -0.15% to 1.75%. That is, we are 95% confident that the true population difference, Oregon minus California, lies between -0.15% and 1.75%. Because this interval contains 0, a hypothesis test of whether the two states have the same proportion would fail to reject the null hypothesis at the 5% significance level.
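The link between the interval and the hypothesis test can be checked with a rough z-test. This sketch reuses the unpooled standard error from the confidence interval so that the two agree; a formal test of equal proportions would more commonly use a pooled proportion, so treat the p-value as approximate.

```r
p1 <- 0.08;  n1 <- 11545  # California
p2 <- 0.088; n2 <- 4691   # Oregon

# Unpooled standard error, matching the confidence interval calculation.
se <- sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)

# z statistic for H0: p2 - p1 = 0 and its two-sided p-value.
z_stat <- (p2 - p1) / se
p_value <- 2 * pnorm(abs(z_stat), lower.tail = FALSE)

c(z = z_stat, p_value = p_value)
p_value > 0.05  # TRUE means we fail to reject H0 at the 5% level
```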

6.44 Barking Deer

Null hypothesis \(H_0\): the barking deer has no preference where it forages in its habitat. Alternative hypothesis \(H_A\): the barking deer prefers to forage in certain areas of its habitat.
We can use a \(\chi^2\) test for a one-way table to test these hypotheses.
The conditions are satisfied. Every expected count is at least 5 (the smallest, for Woods, is about 20.4); the observed count of 4 for Woods does not affect this condition, which concerns expected counts. Observations are likely to be independent.

observations = c( 4, 16, 67, 345) # Woods, Cultivated grass,  Deciduous forest, other

( expectedfreq = c( .048, .147,  .396,  1 - .048 - .147 -.396) )

## [1] 0.048 0.147 0.396 0.409

( expected_counts = 426 * expectedfreq )  # 426 is the total used here; note that sum(observations) is 432

## [1]  20.448  62.622 168.696 174.234

( vec =  ( observations - expected_counts ) ^2 / expected_counts )

## [1]  13.23047  34.71002  61.30600 167.36703

( chi_squared = sum( vec) )

## [1] 276.6135

(p_value = 1- pchisq( chi_squared, df = 3 ) )

## [1] 0

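The p-value of the \(\chi^2\) test is nearly 0. If the null hypothesis were true, a test statistic this large would be extremely unlikely, so we reject \(H_0\) in favor of the alternative: the data give strong evidence that the barking deer prefers to forage in certain areas of its habitat.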
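The printed p-value of exactly 0 comes from floating-point rounding in `1 - pchisq(...)`. A small sketch of an alternative is to ask `pchisq` for the upper tail directly, using the test statistic computed above:

```r
chi_squared <- 276.6135  # test statistic from the calculation above

# lower.tail = FALSE returns P(X > chi_squared) without the subtraction
# from 1, so very small tail probabilities are not rounded to 0.
p_value <- pchisq(chi_squared, df = 3, lower.tail = FALSE)
p_value
```

The result is a tiny but positive number, which is a more accurate statement than a p-value of 0.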

6.48 Coffee and Depression

A \(\chi^2\) test of independence for a two-way table is appropriate for this case. The trickiest condition to establish is independence, given how the bins are defined.
\(H_0\): the proportion of women with clinical depression is the same at every caffeine consumption level. \(H_A\): the proportion of women with clinical depression differs for at least one caffeine consumption level.
We calculate the proportion of women with depression at each caffeine consumption level and in the aggregate.

(counts = data.frame( yes = c( 670, 373, 905, 564, 95) ,
                     no = c( 11545, 6244, 16329, 11726, 2288)
                     ) )

##   yes    no
## 1 670 11545
## 2 373  6244
## 3 905 16329
## 4 564 11726
## 5  95  2288

(totals = counts$yes + counts$no )  # total women in the survey at each caffeine consumption level

## [1] 12215  6617 17234 12290  2383

(counts / totals)  # fraction of women with and without depression at each caffeine consumption level

##          yes        no
## 1 0.05485059 0.9451494
## 2 0.05636996 0.9436300
## 3 0.05251248 0.9474875
## 4 0.04589097 0.9541090
## 5 0.03986572 0.9601343

(aggregate_level = sum(counts$yes) / sum(counts$no + counts$yes) )  # overall proportion of women with depression across all caffeine levels

## [1] 0.05138059

Overall, 5.14% of the women in this survey suffer from depression and 94.86% do not.

The expected count in the cell with value 373 is: 339.99

(expected_count = ( 6617 * 2607 ) / 50739 )

## [1] 339.9854

The contribution to the test statistic is: \(\chi_1^2 = 3.206\)

(chi1sq = (373 - expected_count) ^2 / expected_count )

## [1] 3.205914

If the test statistic is \(\chi^2 = 20.93\) then since \(R=2\) and \(C=5\), \((R-1)(C-1)=4\) is the degrees of freedom. Thus, the p-value is:

(p_value = 1 - pchisq( 20.93, df = 4))

## [1] 0.0003269507

Since the p-value is so small, we reject the null hypothesis and conclude that there is an association between caffeine consumption and depression.
I agree with the author that the \(\chi^2\) test does not establish causality between depression and caffeine consumption. A third factor could underlie both observed variables. The causation could also run the other way, with depression affecting caffeine consumption; that is an equally possible interpretation of the same data.
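As a cross-check on the hand calculations for this exercise, the same table can be passed to R's built-in `chisq.test`. This is a sketch rather than the method used above; the matrix layout and the object name `result` are choices made for this example.

```r
# Rows are the five caffeine consumption levels, columns are depression status.
counts <- matrix(
  c(670, 373, 905, 564, 95,
    11545, 6244, 16329, 11726, 2288),
  ncol = 2,
  dimnames = list(NULL, c("yes", "no"))
)

# Chi-squared test of independence on the two-way table.
result <- chisq.test(counts)

result$statistic           # overall chi-squared statistic
result$parameter           # degrees of freedom, (R - 1) * (C - 1)
result$p.value
result$expected[2, "yes"]  # expected count for the cell with value 373
```

The statistic, degrees of freedom, and expected count should agree with the values worked out by hand, up to rounding.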