Inference for Categorical Data

Statistics and Probability for Data Analytics

CUNY MSDS DATA 606

Rose Koh

2018/04/06

Assignments

Chapter 6 - Inference for Categorical Data

Practice: 6.5, 6.11, 6.27, 6.43, 6.47
Graded: 6.6, 6.12, 6.20, 6.28, 6.44, 6.48

a. False: The sample proportion is known exactly (46%), so no confidence interval is needed for the sample. The 95% CI of 43% to 49% is a statement about the approval rate in the US population.
b. True: The sample is less than 10% of the population, so the observations can be treated as independent, and the interval lets us make an inference about the population.
c. False: A CI is about the population proportion, not about sample statistics from other samples.
d. False: The margin of error at a 90% confidence level would be smaller, since lowering the confidence level narrows the interval.
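As a rough check on part d, the margin of error scales with the critical value, so the 3-point margin implied by the 43% to 49% interval can be rescaled to a 90% level. This is a minimal sketch; the names me_95, me_90, z_95 and z_90 are illustrative and do not appear in the original answer.

```r
# Margin of error implied by the 95% interval of 43% to 49%
me_95 <- 0.03

# Critical values from the standard normal distribution
z_95 <- qnorm(0.975)  # about 1.96
z_90 <- qnorm(0.95)   # about 1.645

# For the same sample, the margin of error is proportional to the critical value
me_90 <- me_95 * z_90 / z_95
me_90
```

The rescaled margin comes out at roughly 2.5 percentage points, which is below 3%.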

It is a sample statistic: it was calculated from a sample of 1,259 US residents.

n <- 1259
p <- 0.48
z <- 1.96 # 95% CI, alpha of 0.05, on z table, z = 1.96

me <- z * sqrt(p*(1-p)/n)

ci.lower <- (p - me) * 100
ci.upper <- (p + me) * 100

The 95% confidence interval for the proportion of US residents who think marijuana should be made legal runs from 45.240277% to 50.759723%.

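Yes, as long as:
- observations are independent: the sample is less than 10% of the population.
- success-failure condition: the sample size is sufficient, with about 604 successes and 655 failures, both well above 10.
No. The 95% confidence interval runs from 45.24% to 50.76%. Because it includes values below 50%, the claim that a majority of US residents think marijuana should be legalized is not justified.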
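The two checks above (the success-failure condition and whether the interval supports a majority) can be sketched in a few lines. This is only an illustration; the variable names are made up for the example, and qnorm(0.975) is used in place of the rounded 1.96.

```r
n <- 1259
p <- 0.48

# Success-failure condition: both expected counts should be at least 10
successes <- n * p
failures  <- n * (1 - p)
conditions_met <- successes >= 10 && failures >= 10

# 95% confidence interval for the population proportion
me <- qnorm(0.975) * sqrt(p * (1 - p) / n)
ci <- c(lower = p - me, upper = p + me)

# A "majority" claim is supported only if the whole interval lies above 0.5
majority_supported <- ci[["lower"]] > 0.5

conditions_met
majority_supported
```

Here the condition holds, but the lower bound is below 0.5, so the majority claim is not supported.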

# margin of error = 2%
# me = z * se
# 95% CI , z = 1.96
p <- 0.48
me <- 0.02
z <- 1.96
se <- me / z

# standard error: se = sqrt(p * (1 - p) / n), solved for n
n <- p * (1 - p) / se^2
n

## [1] 2397.158
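Since a sample size must be a whole number, the result is rounded up. A minimal sketch of that last step, using the same p, me and z as above (n_required and me_achieved are names introduced here for illustration):

```r
p  <- 0.48
me <- 0.02
z  <- 1.96

# Smallest whole sample size whose margin of error does not exceed 2%
n_required <- ceiling(p * (1 - p) * (z / me)^2)
n_required

# Margin of error actually achieved with that sample size
me_achieved <- z * sqrt(p * (1 - p) / n_required)
me_achieved
```

Rounding up gives 2398 respondents; rounding down to 2397 would leave the margin of error slightly above 2%.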

n.ca <- 11545
p.ca <- 0.08

n.or <- 4691
p.or <- 0.088

z <- 1.96 # 95% CI

se.ca <- sqrt((p.ca)*(1-p.ca)/n.ca)
me.ca <- z * se.ca # margin of error CA at 95% CI

se.or <- sqrt((p.or)*(1-p.or)/n.or)
me.or <- z * se.or # margin of error OR at 95% CI

ca.lower <- p.ca - me.ca
ca.upper <- p.ca + me.ca

or.lower <- p.or - me.or
or.upper <- p.or + me.or

H0: prop of CA residents with insufficient sleep == prop of OR residents with insufficient sleep
Ha: prop of CA residents with insufficient sleep != prop of OR residents with insufficient sleep
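Inference conditions are met:
- independence: each sample is less than 10% of its state's population.
- success-failure: np and n(1-p) are greater than 10 in both samples.
CA
- margin of error: 0.0049488
- 95% CI: 0.0750512 to 0.0849488
OR
- margin of error: 0.008107
- 95% CI: 0.079893 to 0.096107
The CA and OR intervals overlap, which is consistent with not rejecting H0, but overlap alone is only an informal check; the formal comparison uses the CI for the difference in proportions.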

se <- sqrt((p.ca)*(1-p.ca)/n.ca + (p.or)*(1-p.or)/n.or) # Calculating a new SE for the differences
me <- z * se

# on 95%
diff <- p.or - p.ca
diff.lower <- diff - me
diff.upper <- diff + me

The 95% CI for the difference in the proportions of OR and CA residents reporting insufficient sleep (OR minus CA) ranges from -0.0014981 to 0.0174981.
Since this interval includes 0, we fail to reject H0.
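As a cross-check on the confidence-interval conclusion, the same hypotheses can be tested with a two-proportion z-test using a pooled standard error. This is a different technique from the interval computed above, shown only as a sketch; p.pool, se.pool, z.stat and p.value are names introduced here.

```r
n.ca <- 11545
p.ca <- 0.08
n.or <- 4691
p.or <- 0.088

# Pooled proportion under H0 (both states share one proportion)
p.pool <- (p.ca * n.ca + p.or * n.or) / (n.ca + n.or)

# Standard error of the difference under H0
se.pool <- sqrt(p.pool * (1 - p.pool) * (1 / n.ca + 1 / n.or))

# Test statistic and two-sided p-value
z.stat  <- (p.or - p.ca) / se.pool
p.value <- 2 * pnorm(abs(z.stat), lower.tail = FALSE)

z.stat
p.value
```

The p-value is above 0.05, which agrees with the interval-based decision not to reject H0.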

Hypothesis

H0: The barking deer have no preference in foraging location.
Ha: The barking deer have preferences in foraging location.

Chi-square test
Check conditions for inference

Independence: not stated in the problem; we assume the foraging sites are independent of one another.
Sample size and distribution: every habitat has at least 5 expected cases, so this condition is satisfied.
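The sample-size condition can be verified by computing the expected count for each habitat from the total number of observed sites. A short sketch, using the observed counts and habitat proportions from this exercise (the variable names are illustrative):

```r
observed <- c(4, 16, 67, 345)
probs    <- c(0.048, 0.147, 0.396, 0.409)

# Expected number of foraging sites per habitat if deer had no preference
expected <- sum(observed) * probs
expected

# Condition for the chi-squared approximation: every expected count >= 5
all(expected >= 5)
```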

chisq.test(c(4,16,67,345), p = c(.048,.147,.396,.409))

## 
##  Chi-squared test for given probabilities
## 
## data:  c(4, 16, 67, 345)
## X-squared = 272.69, df = 3, p-value < 2.2e-16

Yes. With a p-value < 2.2e-16, we reject H0: the data provide convincing evidence that barking deer prefer to forage in certain habitats over others.
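To make the reported statistic less of a black box, the goodness-of-fit statistic can be rebuilt by hand from the observed and expected counts. This is a sketch of the standard formula, not a replacement for chisq.test; stat, df and p.value are names chosen for the example.

```r
observed <- c(4, 16, 67, 345)
probs    <- c(0.048, 0.147, 0.396, 0.409)
expected <- sum(observed) * probs

# Chi-squared statistic: sum of (observed - expected)^2 / expected
stat <- sum((observed - expected)^2 / expected)

# Degrees of freedom: number of categories minus 1
df <- length(observed) - 1

# Upper-tail probability of the chi-squared distribution
p.value <- pchisq(stat, df = df, lower.tail = FALSE)

stat
df
p.value
```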

Chi-squared test
Hypothesis

H0: Coffee consumption and depression are independent (not related).
Ha: Coffee consumption and depression are related.

dep <- 2607
not.dep <- 48132
total <- dep + not.dep

Proportion who suffer from depression: 0.0513806
Proportion who do not suffer from depression: 0.9486194

to.six.cups <- 6617
Ek <- (dep * to.six.cups) / total
chipart <- ((373-Ek) ^2) / Ek

The expected count for the highlighted cell is 339.9853958.
This cell contributes 3.2059144 to the test statistic.
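The same calculation can be applied to every cell at once: each expected count is (row total × column total) / grand total. The sketch below uses the two-way table of counts from this exercise; the matrix name counts and the helper names are illustrative.

```r
# Observed counts: rows are coffee-consumption groups, columns are depression status
counts <- cbind(Yes = c(670, 373, 905, 564, 95),
                No  = c(11545, 6244, 16329, 11726, 2288))

# Expected counts under independence
expected <- outer(rowSums(counts), colSums(counts)) / sum(counts)

# Contribution of each cell to the chi-squared statistic
contribution <- (counts - expected)^2 / expected

expected[2, "Yes"]      # highlighted cell
contribution[2, "Yes"]  # its contribution
sum(contribution)       # full test statistic
```

The second row of the Yes column reproduces the expected count and contribution computed above, and all expected counts are far above 5.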

found <- data.frame(Yes = c(670,373,905,564,95),
                    No =c(11545,6244,16329,11726,2288)
                    )
chisq.test(found)

## 
##  Pearson's Chi-squared test
## 
## data:  found
## X-squared = 20.932, df = 4, p-value = 0.0003267

p.value = 0.0003267
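The p-value follows from the test statistic and the degrees of freedom, (rows - 1) × (columns - 1). A minimal sketch of that step, with the table dimensions taken from the data above (n.rows, n.cols, df and p.value are names used only here):

```r
stat   <- 20.932
n.rows <- 5  # coffee-consumption groups
n.cols <- 2  # depression: yes / no

# Degrees of freedom for a test of independence
df <- (n.rows - 1) * (n.cols - 1)

# Upper-tail area of the chi-squared distribution
p.value <- pchisq(stat, df = df, lower.tail = FALSE)

df
p.value
```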

We reject H0 and conclude that coffee consumption is associated with depression.
Agree. Despite the statistically significant association, this is an observational study, not an experiment, and correlation does not necessarily mean there’s causation. It is too early to recommend coffee consumption as a way to reduce depression.