Data 606 - Homework 6

Heather Geiger - April 8, 2018

Question 6.6

False. Exactly 46% of Americans in this sample support the decision.
True. We are 95% confident that between 43% and 49% of Americans (the population this sample comes from) support the decision, based on the definition of a confidence interval.
False. We cannot say that 95% of sample percentages would fall between 43% and 49%. A confidence interval describes the population parameter, not the distribution of future sample statistics. From part (b), the true population percentage could plausibly be anywhere from 43% to 49%. If it were 43%, we would expect 95% of samples of this size to give a percentage between 40% and 46%. If it were 49%, we would expect 95% of samples to give a percentage between 46% and 52%.
False. The margin of error at a 90% confidence level would be lower, not higher, because a smaller critical z-value (about 1.645 instead of 1.96) would multiply the standard error of the sample proportion.
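The last answer can be checked numerically. This is only a sketch: it assumes the 3-point margin of error (46% plus or minus 3%) was reported at the 95% level, and backs out the implied standard error from that. The variable names are my own.

```r
# Margin of error at 95% confidence: 3 percentage points (interval 43% to 49%)
me_95 <- 0.03

# Critical z-values for two-sided 95% and 90% confidence intervals
z_95 <- qnorm(0.975)  # about 1.96
z_90 <- qnorm(0.95)   # about 1.645

# Standard error of the sample proportion implied by the 95% margin of error
se <- me_95 / z_95

# Margin of error for the same sample at the 90% level
me_90 <- z_90 * se

round(c(me_95 = me_95, me_90 = me_90) * 100, 2)
```

The 90% margin of error comes out at roughly 2.5 points, smaller than the 3 points at 95%.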

Question 6.12

margin_of_error <- 1.96 * sqrt((.48*.52)/1259) * 100
round(c(48 - margin_of_error,48 + margin_of_error),digits=2)

## [1] 45.24 50.76

48% is a sample statistic.
We are 95% confident that between 45.24% and 50.76% of Americans support legalization. So while it is possible that a majority of Americans support legalization, the evidence tilts in favor of slightly less than a majority of Americans supporting legalization.
The normal model is a good approximation here. Assuming the respondents were randomly sampled, we can treat the observations as independent, since 1,259 people is far less than 10% of the U.S. population. The sample contains roughly 600 successes and 650 failures, well above the minimum of 10 each, so the success-failure condition is also met.
See part (b). Because the confidence interval includes values below 50%, the claim that a majority of Americans support legalization is not justified by this sample.
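The condition check and the majority question can be sketched in a few lines. The variable names are my own, and the 10-count threshold is the usual rule of thumb for the success-failure condition.

```r
n <- 1259
p_hat <- 0.48

# Success-failure condition: both counts should be at least 10
successes <- n * p_hat
failures <- n * (1 - p_hat)
both_at_least_10 <- successes >= 10 && failures >= 10

# 95% confidence interval for the population proportion
se <- sqrt(p_hat * (1 - p_hat) / n)
ci <- p_hat + c(-1, 1) * qnorm(0.975) * se

# A majority claim would need the whole interval to sit above 50%
majority_supported <- ci[1] > 0.5

c(successes = successes, failures = failures)
both_at_least_10
majority_supported
```

Here `majority_supported` is FALSE because the lower end of the interval is below 50%.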

Question 6.20

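# Solve 1.96 * sqrt((.48*.52)/n) = .02 for n
n <- (.48*.52*1.96^2)/(.02^2)
# Check the solution; all.equal() avoids an exact floating-point comparison
isTRUE(all.equal(1.96 * sqrt((.48*.52)/n), .02))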

## [1] TRUE

ceiling(n)

## [1] 2398

If we wanted to limit the margin of error of a 95% confidence interval to 2%, we would need to survey at least 2,398 Americans, rounding the required sample size up to the next whole person.
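The same calculation can be wrapped in a small helper. `required_n` is a hypothetical function of my own, not something from the textbook, and it uses the exact critical value from `qnorm()` rather than the rounded 1.96.

```r
# Smallest n such that z * sqrt(p * (1 - p) / n) <= margin
required_n <- function(p, margin, conf_level = 0.95) {
  z <- qnorm(1 - (1 - conf_level) / 2)
  ceiling(z^2 * p * (1 - p) / margin^2)
}

# Using the earlier estimate of 48% support
required_n(0.48, 0.02)

# Conservative version with p = 0.5, for when no prior estimate is available
required_n(0.5, 0.02)
```

Using `ceiling()` always rounds up, which is what we want since a fractional respondent is not possible and rounding down would exceed the target margin of error.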

Question 6.28

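# Pooled proportion: the two sample proportions weighted by sample size
pooled_proportion <- function(prop1,n1,prop2,n2){
  weight_prop1 <- n1/(n1 + n2)
  weight_prop2 <- 1 - weight_prop1
  return(weight_prop1 * prop1 + weight_prop2 * prop2)
}

# SE of the difference between two proportions, using the pooled proportion
SE_mean_using_pooled_proportion <- function(prop1,n1,prop2,n2){
  my_p <- pooled_proportion(prop1,n1,prop2,n2)
  return(sqrt((my_p*(1 - my_p))/n1 + (my_p*(1 - my_p))/n2))
}

# SE of the difference between two proportions, using each sample's own proportion
SE_mean_using_separated_proportions <- function(prop1,n1,prop2,n2){
  return(sqrt((prop1*(1 - prop1))/n1 + (prop2*(1 - prop2))/n2))
}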

SE_mean_using_pooled_proportion(.08,11545,.088,4691)

## [1] 0.004758691

SE_mean_using_separated_proportions(.08,11545,.088,4691)

## [1] 0.004845984

# Whether we use the separate or the pooled proportions, the SE of the difference in proportions is around .0048.
# (For a confidence interval, the separate-proportions version is the standard choice.)

c(.008 - (1.96*.0048),.008 + (1.96*.0048))*100

## [1] -0.1408  1.7408

The 95% confidence interval for the difference in sleep deprivation rates (Oregon minus California) runs from -0.14 to 1.74 percentage points. In other words, the plausible values range from Californians being sleep deprived at a rate 0.14 points higher to Oregonians being sleep deprived at a rate 1.74 points higher. While the data lean toward Oregonians being sleep deprived more often, the result is not statistically significant at the 5% level. The interval includes 0, the value under the null hypothesis that the two population percentages are equal, so we fail to reject the null hypothesis.
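As a cross-check, the interval can be recomputed without rounding the standard error to .0048. This sketch uses the separate sample proportions and the exact critical value from `qnorm()`. The variable names are my own.

```r
# Sample proportions and sizes (California: 8.0%, Oregon: 8.8%)
p_ca <- 0.08
n_ca <- 11545
p_or <- 0.088
n_or <- 4691

# Point estimate: Oregon minus California
diff <- p_or - p_ca

# SE of the difference, using each sample's own proportion
se_diff <- sqrt(p_ca * (1 - p_ca) / n_ca + p_or * (1 - p_or) / n_or)

# 95% confidence interval, in percentage points
ci <- diff + c(-1, 1) * qnorm(0.975) * se_diff
round(ci * 100, 2)

# If 0 is inside the interval, we fail to reject equal rates at the 5% level
contains_zero <- ci[1] < 0 && ci[2] > 0
contains_zero
```

The endpoints shift only in the second decimal place compared with the rounded calculation, and the interval still includes 0.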

Question 6.44

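found <- c(4,16,67,345)
# Note: these counts sum to 432, not 426, so the expected counts below (based on 426)
# differ slightly from those chisq.test() derives from sum(found); the two X^2 values differ as a result.
expected <- c(.048,.147,.396,.409)*426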

expected

## [1]  20.448  62.622 168.696 174.234

differences_squared <- (found - expected)^2

X_squared <- sum(differences_squared/expected)

X_squared

## [1] 276.6135

1 - pchisq(X_squared,df=3)

## [1] 0

chisq.test(found, p = c(.048,.147,.396,.409))

## 
##  Chi-squared test for given probabilities
## 
## data:  found
## X-squared = 272.69, df = 3, p-value < 2.2e-16

The null hypothesis is that the deer have no habitat preference, so the proportion of foraging sites in each habitat type equals the proportion of that habitat type in the region. The alternative hypothesis is that the deer do have a preference, so at least one of these proportions differs.
A chi-squared goodness-of-fit test can answer this question.
The expected number of foraging sites is at least 5 for every habitat type (the smallest is about 20). Independence is less certain: some observations might come from related deer, such as mother-child pairs. Overall, though, it seems a reasonable assumption to make.
Yes, these data provide extremely convincing evidence that barking deer prefer to forage in certain habitats over others (X^2 = 272.69, df = 3, p-value < 2.2e-16).
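The manual statistic (276.61) and the `chisq.test()` statistic (272.69) differ because the observed counts as entered sum to 432, while the manual expected counts were scaled to 426. The sketch below scales the expected counts by `sum(found)` instead, which should reproduce what `chisq.test()` does internally. The variable names are my own.

```r
found <- c(4, 16, 67, 345)
habitat_props <- c(0.048, 0.147, 0.396, 0.409)

# The observed counts as entered add up to 432
sum(found)

# Expected counts scaled to the observed total, as chisq.test() does
expected <- habitat_props * sum(found)
x_squared_manual <- sum((found - expected)^2 / expected)

# Compare with the built-in goodness-of-fit test
test <- chisq.test(found, p = habitat_props)
all.equal(x_squared_manual, unname(test$statistic))

# lower.tail = FALSE keeps very small p-values from underflowing to 0
p_value <- pchisq(x_squared_manual, df = length(found) - 1, lower.tail = FALSE)
p_value
```

Either way the statistic is far out in the tail of the chi-squared distribution with 3 degrees of freedom, so the conclusion does not change.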

Question 6.48

found <- data.frame(Yes = c(670,373,905,564,95),No = c(11545,6244,16329,11726,2288),row.names=c("Lt.1.a.wk","Btwn.2.and.6.a.wk","1.a.day","Btwn.2.and.3.a.day","4.or.more.a.day"))

found <- t(found)

expected <- data.frame(matrix(NA,nrow=nrow(found),ncol=ncol(found)))

for(i in 1:nrow(found)) {
  for(j in 1:ncol(found)) {
    expected[i,j] <- (rowSums(found)[i]*colSums(found)[j])/sum(colSums(found))
  }
}

expected

##          X1        X2         X3         X4      X5
## 1   627.614  339.9854   885.4932   631.4675  122.44
## 2 11587.386 6277.0146 16348.5068 11658.5325 2260.56

X_squared <- (found - expected)^2/expected

X_squared

##          X1        X2        X3        X4        X5
## 1 2.8625493 3.2059144 0.4297225 7.2083913 6.1495551
## 2 0.1550458 0.1736437 0.0232753 0.3904321 0.3330817

X_squared <- sum(colSums(X_squared))

X_squared

## [1] 20.93161

1 - pchisq(X_squared,df=4)

## [1] 0.0003267104

chisq.test(found)

## 
##  Pearson's Chi-squared test
## 
## data:  found
## X-squared = 20.932, df = 4, p-value = 0.0003267

depression_rate <- sum(found["Yes",])/sum(colSums(found))

depression_rate

## [1] 0.05138059

A chi-squared test of independence for a two-way table is the appropriate test here.
The null hypothesis is that the rate of depression is the same in every coffee consumption group, meaning depression and coffee consumption are independent. The alternative hypothesis is that the rate of depression differs for at least one group.
The overall rate of depression in this sample is about 5.1%.
The expected count for the highlighted cell is 339.985. The contribution to X^2 by this cell is 3.206.
The p-value is 0.0003267.
With a p-value well below 0.05, we reject the null hypothesis. The proportion of women with depression differs significantly depending on their rate of coffee consumption.
Yes, I would agree that we cannot conclude from this that we should recommend coffee to avoid depression. Although the result is statistically significant, it is based on an observational study rather than an experiment. Therefore we cannot establish a causal link between higher coffee consumption and reduced depression.
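The expected counts computed with the nested loop can also be obtained in one step with `outer()`. This is an alternative sketch, not the method used above. It also derives the degrees of freedom from the table dimensions rather than hard-coding 4.

```r
# Observed counts: rows are depression status, columns are coffee consumption groups
found <- rbind(
  Yes = c(670, 373, 905, 564, 95),
  No  = c(11545, 6244, 16329, 11726, 2288)
)

# Expected count for each cell: (row total * column total) / grand total
expected <- outer(rowSums(found), colSums(found)) / sum(found)

# Chi-squared statistic and degrees of freedom, (rows - 1) * (columns - 1)
x_squared <- sum((found - expected)^2 / expected)
df <- (nrow(found) - 1) * (ncol(found) - 1)
p_value <- pchisq(x_squared, df = df, lower.tail = FALSE)

# Compare against the built-in test
test <- chisq.test(found)
all.equal(expected, test$expected, check.attributes = FALSE)

round(c(x_squared = x_squared, df = df, p_value = p_value), 5)
```

The cell in row 1, column 2 of `expected` is the highlighted cell discussed above, with an expected count of about 339.99.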