On June 28, 2012 the U.S. Supreme Court upheld the much debated 2010 healthcare law, declaring it constitutional. A Gallup poll released the day after this decision indicates that 46% of 1,012 Americans agree with this decision. At a 95% confidence level, this sample has a 3% margin of error. Based on this information, determine if the following statements are true or false, and explain your reasoning.
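FALSE: a confidence interval estimates the population proportion. We already know exactly what proportion of the sample agrees (46%).
TRUE: this is what a 95% confidence interval expresses. With a point estimate of 46% and a 3% margin of error, the interval runs from 43% to 49%.
FALSE: the confidence level describes the method, not the spread of sample proportions. About 95% of the confidence intervals built from repeated samples would contain the true population proportion.
FALSE: our margin of error would decrease, because we are decreasing the range of values we include in our confidence interval.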
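As a rough check of the 3% margin of error, the sketch below recomputes it from the poll's figures (46% of 1,012 respondents) using the normal approximation. The 90% comparison level is an assumption chosen for illustration, and the variable names are illustrative as well.

```r
# Poll figures taken from the exercise text
n <- 1012
p_hat <- 0.46

# Standard error of the sample proportion (normal approximation)
se <- sqrt(p_hat * (1 - p_hat) / n)

# Margin of error at 95% and at an assumed 90% confidence level
me_95 <- qnorm(0.975) * se
me_90 <- qnorm(0.95) * se

round(c(me_95 = me_95, me_90 = me_90), 4)
```

The 95% value rounds to about 3%, and the 90% value is smaller because its critical value is smaller.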
The 2010 General Social Survey asked 1,259 US residents: “Do you think the use of marijuana should be made legal, or not?” 48% of the respondents said it should be made legal.
It is a sample statistic, because it expresses the feelings of a sample of U.S. residents – not all U.S. residents.
n <- 1259
p <- .48
z <- qnorm(.975)
SE <- sqrt((p*(1-p))/n)
upper <- p + z*SE
lower <- p - z*SE
upper
## [1] 0.5075967
lower
## [1] 0.4524033
legal <- p*n
notlegal <- (1-p)*n
legal
## [1] 604.32
notlegal
## [1] 654.68
Both the expected number of respondents who think marijuana should be legal (about 604) and the number who do not (about 655) are well above 10, so the success-failure condition is met and the sampling distribution of the sample proportion should be nearly normal. In addition, if respondents were sampled randomly from across America, without bias, we can treat the observations as independent.
This news piece is not justified, since we cannot say that either side has a majority on this issue. The 95% confidence interval (about 45.2% to 50.8%) contains 50% as well as values on both sides of it, so the data are consistent with either outcome.
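The majority check can be sketched as a small reusable function. `wald_ci` is a hypothetical helper written for this illustration, not part of any package, and it assumes the normal approximation is appropriate.

```r
# Hypothetical helper: normal-approximation (Wald) interval for one proportion
wald_ci <- function(p_hat, n, conf = 0.95) {
  z <- qnorm(1 - (1 - conf) / 2)
  se <- sqrt(p_hat * (1 - p_hat) / n)
  c(lower = p_hat - z * se, upper = p_hat + z * se)
}

ci <- wald_ci(0.48, 1259)
ci

# TRUE when 50% lies inside the interval, so a majority is not established
contains_half <- ci[["lower"]] < 0.5 && 0.5 < ci[["upper"]]
contains_half
```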
As discussed in Exercise 6.12, the 2010 General Social Survey reported a sample where about 48% of US residents thought marijuana should be made legal. If we wanted to limit the margin of error of a 95% confidence interval to 2%, about how many Americans would we need to survey?
Margin_of_Error = z * sqrt((p*(1-p))/n)
N = (z^2 * p * (1-p)) / (Margin_of_Error)^2
p <- .48
z <- qnorm(.975)
Margin_of_Error <- .02
N <- ((z^2)*p*(1-p))/((Margin_of_Error)^2)
N
## [1] 2397.07
We would need to survey at least 2,398 Americans, rounding 2397.07 up to the next whole person.
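Because a sample size must be a whole number, the result is rounded up rather than to the nearest integer. The sketch below shows this with `ceiling()` and confirms that the rounded sample size meets the 2% target; `target_me`, `n_required`, and `me_achieved` are illustrative names.

```r
p <- 0.48
z <- qnorm(0.975)
target_me <- 0.02

# Smallest whole sample size meeting the target margin of error
n_required <- ceiling(z^2 * p * (1 - p) / target_me^2)

# Margin of error actually achieved at that sample size
me_achieved <- z * sqrt(p * (1 - p) / n_required)

n_required
me_achieved <= target_me
```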
According to a report on sleep deprivation by the Centers for Disease Control and Prevention, the proportion of California residents who reported insufficient rest or sleep during each of the preceding 30 days is 8.0%, while this proportion is 8.8% for Oregon residents. These data are based on simple random samples of 11,545 California and 4,691 Oregon residents. Calculate a 95% confidence interval for the difference between the proportions of Californians and Oregonians who are sleep deprived and interpret it in context of the data.
cali_n <- 11545
ore_n <- 4691
cali_p <- .08
ore_p <- .088
z <- qnorm(.975)
cali_SE <- sqrt((cali_p*(1-cali_p))/cali_n)
ore_SE <- sqrt((ore_p*(1-ore_p))/ore_n)
SE_diff <- sqrt((cali_SE)^2 + (ore_SE)^2)
point_estimate <- (ore_p - cali_p)
upper <- point_estimate+(z*SE_diff)
lower <- point_estimate-(z*SE_diff)
upper
## [1] 0.01749795
lower
## [1] -0.001497954
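We are 95% confident that the proportion of Oregon residents who report insufficient rest is between 0.15 percentage points lower and 1.75 percentage points higher than the proportion of California residents. Because this interval contains 0, the data do not show a statistically significant difference between the two states.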
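The same interval can be produced in one step and then tested for whether it contains 0. `diff_prop_ci` is a hypothetical helper written for this illustration, assuming two independent samples and the normal approximation.

```r
# Hypothetical helper: normal-approximation interval for p1 - p2
diff_prop_ci <- function(p1, n1, p2, n2, conf = 0.95) {
  z <- qnorm(1 - (1 - conf) / 2)
  se <- sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
  est <- p1 - p2
  c(lower = est - z * se, upper = est + z * se)
}

# Oregon minus California, matching the direction used above
ci <- diff_prop_ci(0.088, 4691, 0.08, 11545)

# Express the bounds in percentage points
round(100 * ci, 2)

# TRUE when 0 lies inside the interval (no significant difference at this level)
contains_zero <- ci[["lower"]] < 0 && 0 < ci[["upper"]]
contains_zero
```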
Microhabitat factors associated with forage and bed sites of barking deer in Hainan Island, China were examined from 2001 to 2002. In this region woods make up 4.8% of the land, cultivated grassplot makes up 14.7%, and deciduous forests make up 39.6%. Of the 426 sites where the deer forage, 4 were categorized as woods, 16 as cultivated grassplot, and 61 as deciduous forests.
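Ho: Barking deer do not prefer foraging in any particular habitat; foraging sites are distributed in proportion to the land each habitat covers. Ha: Barking deer prefer foraging in particular habitats; foraging sites are not distributed in proportion to habitat coverage.
A chi-square goodness of fit test.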
k <- 4
df <- k-1
# Percentages of land distribution
woods_p <- .048
cultivated_grass_p <- .147
deciduous_forests_p <- .396
other_p <- 1-(woods_p+cultivated_grass_p+deciduous_forests_p)
# Where deer forage
sites <- 426
woods_f <- 4
cultivated_grass_f <- 16
deciduous_forests_f <- 61
other_f <- sites-(woods_f+cultivated_grass_f+deciduous_forests_f)
# Expected values
woods_e <- sites * woods_p
cultivated_grass_e <- sites * cultivated_grass_p
deciduous_forests_e <- sites * deciduous_forests_p
other_e <- sites * other_p
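All habitats have at least 5 expected cases, so the sample size condition is satisfied. We will assume the foraging sites are independent of one another, since they are spread across an entire region without overlap.
# Standardized differences: (observed - expected) / sqrt(expected)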
woods_z <- (woods_f - woods_e)/sqrt(woods_e)
cultivated_grass_z <- (cultivated_grass_f - cultivated_grass_e)/sqrt(cultivated_grass_e)
deciduous_forests_z <- (deciduous_forests_f - deciduous_forests_e)/sqrt(deciduous_forests_e)
other_z <- (other_f - other_e)/sqrt(other_e)
# Test statistic
chi_squared <- (woods_z^2)+(cultivated_grass_z^2)+(deciduous_forests_z^2)+(other_z^2)
chi_squared
## [1] 284.0609
# obtain p-value
p <- pchisq(chi_squared, df=df, lower.tail=FALSE)
p
## [1] 2.799724e-61
The p-value is very small, so we reject Ho. There is strong evidence that barking deer favor certain habitats for foraging.
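As a cross-check on the manual calculation, base R's `chisq.test()` can run the same goodness of fit test when given the observed counts and the hypothesized proportions. The vector names `observed` and `land_share` are illustrative.

```r
# Observed foraging sites: woods, cultivated grassplot, deciduous forests, other
observed <- c(4, 16, 61, 426 - (4 + 16 + 61))

# Share of land in each habitat, with "other" as the remainder
land_share <- c(0.048, 0.147, 0.396)
land_share <- c(land_share, 1 - sum(land_share))

fit <- chisq.test(x = observed, p = land_share)

fit$expected   # expected counts, useful for the "at least 5" check
fit$statistic  # chi-square statistic
fit$parameter  # degrees of freedom
fit$p.value
```

The statistic and degrees of freedom should agree with the values computed step by step above.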
Researchers conducted a study investigating the relationship between caffeinated coffee consumption and risk of depression in women. They collected data on 50,739 women free of depression symptoms at the start of the study in the year 1996, and these women were followed through 2006. The researchers used questionnaires to collect data on caffeinated coffee consumption, asked each individual about physician-diagnosed depression, and also asked about the use of antidepressants. The table below shows the distribution of incidences of depression by amount of caffeinated coffee consumption.
library(knitr)
include_graphics('/Users/Michele/Desktop/648.png')
A chi-square test of independence for a two-way table.
Ho: Depression and caffeinated coffee consumption are independent in women; depression rates do not differ by level of consumption. Ha: Depression and caffeinated coffee consumption are associated in women; depression rates differ by level of consumption.
yes_total <- 2607
no_total <- 48132
total <- yes_total + no_total
suffer_depression <- yes_total/total
suffer_depression
## [1] 0.05138059
no_depression <- no_total/total
no_depression
## [1] 0.9486194
# Build the table of observed counts
yes <- c(670, 373, 905, 564, 95)
no <- c(11545, 6244, 16329, 11726, 2288)
# Column totals (this reuses the name total, replacing the grand total above)
total <- yes+no
col_headings <- c('<1 cup/week','2-6 cups/week', '1 cup/day','2-3 cups/day','>4 cups/day')
results <- t(data.frame(yes, no))
colnames(results) <- col_headings
results
## <1 cup/week 2-6 cups/week 1 cup/day 2-3 cups/day >4 cups/day
## yes 670 373 905 564 95
## no 11545 6244 16329 11726 2288
# Build the table of expected counts: column total times overall proportion
expected_yes <- total * suffer_depression
expected_no <- total * no_depression
expected_results <- t(data.frame(expected_yes, expected_no))
colnames(expected_results) <- col_headings
expected_results
## <1 cup/week 2-6 cups/week 1 cup/day 2-3 cups/day >4 cups/day
## expected_yes 627.614 339.9854 885.4932 631.4675 122.44
## expected_no 11587.386 6277.0146 16348.5068 11658.5325 2260.56
# Expected count for highlighted cell (the observed count is 373)
expected_results[1,2]
## [1] 339.9854
# Contribution for highlighted cell
((results[1,2]-expected_results[1,2])^2)/expected_results[1,2]
## [1] 3.205914
# df = (rows - 1) * (columns - 1) = (2 - 1) * (5 - 1)
k <- 5
df <- k - 1
# p-value for a test statistic of 20.93
p <- pchisq(20.93, df=df, lower.tail=FALSE)
p
## [1] 0.0003269507
Based on the p-value, we reject the null hypothesis and conclude that depression and caffeinated coffee consumption are associated in women.
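The test statistic used above can be reproduced from the table of counts with base R's `chisq.test()`, which performs a test of independence when given a matrix. This is a sketch for checking the result; `observed` and `fit` are illustrative names.

```r
# Observed counts by depression status (rows) and coffee consumption (columns)
observed <- rbind(
  yes = c(670, 373, 905, 564, 95),
  no  = c(11545, 6244, 16329, 11726, 2288)
)
colnames(observed) <- c('<1 cup/week', '2-6 cups/week', '1 cup/day',
                        '2-3 cups/day', '>4 cups/day')

# Chi-square test of independence on the 2 x 5 table
fit <- chisq.test(observed)

fit$statistic  # chi-square statistic
fit$parameter  # degrees of freedom: (2 - 1) * (5 - 1)
fit$p.value
```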
It was too early to make this recommendation. The study shows only that depression rates differ statistically across the coffee consumption groups. Because it is an observational study, it cannot establish that coffee causes or prevents depression.