Instructions

Due on Thursday, November 6, at 8:00 a.m.

Your Full Name: Maddie Ledet

Group Members (classmates with whom you worked on this problem set):

HKS Academic Code: I certify that I abide by the Harvard Kennedy School Academic code for all aspects of the course. In terms of problem sets, unless explicitly written otherwise, the norms are the following: You are free (and encouraged) to discuss problem sets with your classmates. However, you must hand in your own unique written work and code in all cases. Any copy/paste of another’s work is plagiarism. In other words, you may work with your classmate(s), sitting side-by-side (physically or remotely!) and going through the problem set question by question, but you must each type your own answers and your own code. For more details, please see syllabus.

LOAD PACKAGES

library(tidyverse)

PART I: E-CIGARETTES

E-cigarettes have been in the news lately. Notably, the U.S. Centers for Disease Control and Prevention (CDC) have been tracking an outbreak of lung injuries associated with the use of e-cigarettes. According to the CDC, “[a]s of October 15, 2019, 1,479 lung injury cases associated with the use of e-cigarette, or vaping, products have been reported to CDC from 49 states (all except Alaska), the District of Columbia, and 1 U.S. territory.”

A recent survey conducted by Politico and the Harvard T.H. Chan School of Public Health indicated that 47 percent of respondents believe that e-cigarettes are “very harmful” to people who use them. Results were based on a random sample of 1,009 adults interviewed between July 16 and July 21, 2019.

a. Define the following terms as they relate to the polling question described above: Population of interest, Sample, Estimate.

Population of interest: The population of interest is all adults living in the United States.
Sample: The sample is the 1,009 randomly selected adults interviewed between July 16 and July 21, 2019.
Estimate: The estimate is the sample proportion: 47% of respondents believe that e-cigarettes are “very harmful.”

b. Based on the fact that the survey found 47 percent of respondents think e-cigarettes are “very harmful”, an e-cigarette proponent claims that less than a majority of Americans believe e-cigarettes are very harmful (since 47 percent is less than 50 percent). Explain to a health policy official why that claim is not necessarily true. You can assume that the health policy official is motivated to understand but not well versed in statistics, and that the poll was well designed and conducted.

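We cannot conclude from this survey that fewer than half of Americans believe that e-cigarettes are very harmful. The 47 percent figure comes from one random sample of 1,009 adults, not from every American, so it is an estimate of the true population proportion rather than the true proportion itself. Even in a well-designed poll, a different random group of roughly 1,000 Americans would give a somewhat different answer. With a sample of this size, the estimate can easily be off by about 3 percentage points in either direction, so the true share could plausibly be 50 percent or higher. Therefore, it would be remiss to accept the proponent’s claim without checking whether 50 percent falls within the range of plausible values around our estimate.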

c. Construct a 95% confidence interval for the percentage of people that believe that e-cigarettes are “very harmful.”

p = 0.47
n = 1009

# Approximate 95% CI: estimate plus or minus 2 standard errors (2 approximates 1.96)
CI_plus <- p + 2*(sqrt((p*(1-p))/n))
CI_minus <- p - 2*(sqrt((p*(1-p))/n))

CI_plus*100

## [1] 50.14247

CI_minus*100

## [1] 43.85753
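
As a rough cross-check on the interval above, the same calculation can be wrapped in a small function that uses the exact normal quantile instead of the rounded multiplier of 2. This is only a sketch: `prop_ci` is a hypothetical helper name, not something provided by the course, and it assumes the usual normal approximation for a sample proportion.

```r
# Hypothetical helper (not part of the assignment): normal-approximation
# confidence interval for a single proportion.
prop_ci <- function(p_hat, n, conf = 0.95) {
  z <- qnorm(1 - (1 - conf) / 2)        # about 1.96 for a 95% interval
  se <- sqrt(p_hat * (1 - p_hat) / n)   # standard error of the sample proportion
  c(lower = p_hat - z * se, upper = p_hat + z * se)
}

ci <- prop_ci(0.47, 1009)
round(ci * 100, 2)
```

With the exact quantile the interval is slightly narrower than the one computed with 2, but its upper end still sits just above 50%.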

d. Why does sample size matter? Explain intuitively why the size of the sample affects the width of the confidence interval, even if the results in a larger sample are exactly the same as the results in a smaller sample. Use language that a policymaker who is curious but not well trained in statistics can understand. [One short paragraph]

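The size of the sample affects the width of the confidence interval because a larger sample gives us more information about the population, so the estimate bounces around less from one sample to the next. In the formula, the sample size sits in the denominator of the standard error, sqrt(p(1-p)/n): dividing by a larger n produces a smaller standard error, and therefore a smaller margin of error, even when the sample proportion is exactly the same. A narrower interval means a more precise estimate. The confidence level itself does not change: if we were to repeat this survey many times and create a confidence interval for each survey, about 95% of those confidence intervals would contain the true population proportion.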
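
To make the point concrete, the sketch below holds the sample proportion fixed at 47% and recomputes the margin of error for several sample sizes. Only n = 1,009 comes from the actual poll; the other sample sizes are hypothetical values chosen for illustration, and the multiplier of 2 follows the approximation used in part (c).

```r
p <- 0.47                               # same sample proportion in every case
n_values <- c(100, 1009, 4036, 10000)   # hypothetical sizes; 1009 is the actual poll

# Margin of error: 2 standard errors of the sample proportion
margin <- 2 * sqrt(p * (1 - p) / n_values)

data.frame(
  n = n_values,
  margin_of_error_pct = round(margin * 100, 2),
  lower_pct = round((p - margin) * 100, 2),
  upper_pct = round((p + margin) * 100, 2)
)
```

Because n enters through a square root, quadrupling the sample (1,009 to 4,036) only halves the margin of error.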

PART 2: CONFIDENCE INTERVALS AND HYPOTHESIS TESTS

We are interested in comparing the share of students who come from single-parent households in the Cambridge and Brookline public schools. Suppose we take two random independent samples of 270 students who are enrolled in the Cambridge and Brookline public schools, respectively. We calculate that the share of students who come from single-parent households in the sample of students in Cambridge, \(\hat{p}_1\), is 40%, while the share of students who come from single-parent households in the sample of students in Brookline, \(\hat{p}_2\), is 33%. Note that our estimates differ by 7 percentage points.

a. Calculating a confidence interval. Construct a 95% confidence interval for the difference between the proportion of students who come from single-parent households in Cambridge and the proportion of students who come from single-parent households in Brookline.

p1 = 0.4
p2 = 0.33
n = 270 

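# Squared standard error (variance) of each sample proportion
p1_se = (p1*(1-p1))/n
p2_se = (p2*(1-p2))/n

# Approximate 95% CI: difference plus or minus 2 standard errors
CI_plus2 = (p1 - p2) + 2*sqrt(p1_se + p2_se)
CI_minus2 = (p1 - p2) - 2*sqrt(p1_se + p2_se)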

CI_plus2*100

## [1] 15.26505

CI_minus2*100

## [1] -1.265054

b. Interpreting the confidence interval. Explain what the 95% confidence interval you calculated in part (a) means in words. Be specific to this problem (that is, do not give merely a general description of a confidence interval.) As part of your explanation, describe what it means that the 95% confidence interval includes both negative and positive numbers.

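Our confidence interval is a range of plausible values for the true difference between the proportion of students from single-parent households in Cambridge and the proportion in Brookline (Cambridge minus Brookline). In particular, if we repeated the sampling many times and built an interval each time, about 95% of those intervals would contain the true difference. The lower bound, -1.265 percentage points, means that Brookline public schools might have a higher proportion of single-parent households than Cambridge by as much as 1.265 percentage points. The upper bound, 15.265 percentage points, means that Cambridge public schools might have a higher proportion than Brookline by as much as 15.265 percentage points. Negative values correspond to Brookline having the higher proportion and positive values to Cambridge having the higher proportion. Because the interval includes both, it also includes zero, so we cannot rule out that the two districts have the same proportion.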
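
The "repeated sampling" reading of the interval can be illustrated with a small simulation. This is a sketch under stated assumptions: the "true" proportions below are hypothetical (they are simply set equal to the sample values of 0.40 and 0.33), and the seed and number of repetitions are arbitrary choices.

```r
set.seed(201)                 # arbitrary seed so the sketch is reproducible

true_p1 <- 0.40               # assumed "true" Cambridge proportion (hypothetical)
true_p2 <- 0.33               # assumed "true" Brookline proportion (hypothetical)
n <- 270
reps <- 5000                  # number of simulated repetitions of the study

# Draw sample proportions for each simulated repetition
p1_hat <- rbinom(reps, n, true_p1) / n
p2_hat <- rbinom(reps, n, true_p2) / n

# Build the same "difference plus or minus 2 standard errors" interval each time
diff_hat <- p1_hat - p2_hat
se_hat <- sqrt(p1_hat * (1 - p1_hat) / n + p2_hat * (1 - p2_hat) / n)
lower <- diff_hat - 2 * se_hat
upper <- diff_hat + 2 * se_hat

# Share of intervals that contain the assumed true difference
true_diff <- true_p1 - true_p2
coverage <- mean(lower <= true_diff & true_diff <= upper)
coverage
```

The printed share should land close to 0.95, which is what the "95%" in the confidence level refers to.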

c. Sample sizes and confidence intervals. Suppose that instead of taking two random independent samples of 270 students, we had instead taken two random samples of 100 students. How would you expect the change in sample size to affect the size of the confidence interval computed in part (a)? Explain your reasoning.

We would expect the confidence interval to be wider. The sample size sits in the denominator of each standard error term, so dividing by 100 instead of 270 makes the standard error of the difference larger. Therefore, the margin of error that we add to and subtract from the estimated difference (p1-hat minus p2-hat) is larger, and the interval is wider.
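
The comparison can be sketched numerically. The code assumes, purely for illustration, that the smaller samples would have produced the same sample proportions (40% and 33%); `diff_ci` is a hypothetical helper name, and it uses the same multiplier of 2 as part (a).

```r
p1 <- 0.40
p2 <- 0.33

# Hypothetical helper: approximate 95% CI for p1 - p2 with equal sample sizes
diff_ci <- function(p1, p2, n) {
  se <- sqrt(p1 * (1 - p1) / n + p2 * (1 - p2) / n)
  c(lower = (p1 - p2) - 2 * se, upper = (p1 - p2) + 2 * se)
}

ci_270 <- diff_ci(p1, p2, 270)   # sample size used in part (a)
ci_100 <- diff_ci(p1, p2, 100)   # smaller sample size from part (c)

round(rbind(n_270 = ci_270, n_100 = ci_100) * 100, 2)
```

The row for n = 270 reproduces the interval from part (a), and the row for n = 100 is noticeably wider around the same 7-point difference.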

d. Conducting a hypothesis test. Using the survey results, test the hypothesis that the proportion of students who come from single-parent households in Cambridge and the proportion of students who come from single-parent households in Brookline are the same. Follow the steps:

1. State the null hypothesis (H0). The null hypothesis is that Brookline and Cambridge have the same proportion of students from single-parent households in their schools. As a result, there would be no difference between p1 and p2. H0: P1 = P2

2. State the alternative hypothesis (H1). Our alternative hypothesis is that Brookline and Cambridge have different proportions of students from single-parent households in their schools. Therefore, there would be a difference between p1 and p2. H1: P1 does not equal P2

3. Calculate the estimate from the sample.

p1 = 0.4
p2 = 0.33
n = 270

# Pooled proportion under H0 (weighted average of the two sample proportions)
p_avg_3 <- ((n*p1) + (n*p2))/(2*n)
p_avg_3

## [1] 0.365

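# Variance of the difference in sample proportions under H0 (its square root is the standard error)
p_denom <- p_avg_3*(1-p_avg_3)*((1/n)+(1/n))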
se <- sqrt(p_denom)
se

## [1] 0.04143491

num <- (p1-p2) - 0
z_stat <- num/se
z_stat

## [1] 1.689397

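4. Define the sampling distribution and use it to calculate the p-value. Under H0, with samples this large, the z-statistic approximately follows a standard normal distribution, so the two-sided p-value is the area in both tails beyond the absolute value of the z-statistic.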

p_value <- 2*(1-pnorm(abs(z_stat)))
p_value

## [1] 0.09114344
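
The steps above can also be collected into one reusable function, which makes it easier to rerun the test with other inputs. This is a minimal sketch: `two_prop_z_test` is a hypothetical helper name, and it assumes the pooled-proportion z-test used in this problem.

```r
# Hypothetical helper: two-sided z-test for equality of two proportions
two_prop_z_test <- function(p1_hat, p2_hat, n1, n2) {
  pooled <- (n1 * p1_hat + n2 * p2_hat) / (n1 + n2)      # common proportion under H0
  se <- sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))   # SE of the difference under H0
  z <- (p1_hat - p2_hat) / se
  p_value <- 2 * pnorm(-abs(z))                           # area in both tails
  list(pooled = pooled, se = se, z = z, p_value = p_value)
}

result <- two_prop_z_test(0.40, 0.33, 270, 270)
result$z
result$p_value
```

Called with the values from this problem, it should reproduce the z-statistic and p-value computed step by step above.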

5. Reject or fail to reject H0 based on whether p-value < 0.05.

We fail to reject the null hypothesis because our p-value of 0.0911 is greater than 0.05.

e. Interpreting the p-value. What is the interpretation of the p-value in this context? [A few sentences.]

The p-value is the probability that we would observe a difference in the sample proportions of students from single-parent households in Brookline and Cambridge at least as large as 7 percentage points, in either direction, given that the null hypothesis is true, meaning that there is no difference between the two school systems. In this case, we got about 9%, which means that there is a 9% chance that we would observe a difference of 7 percentage points or more if the null hypothesis is true. It is not the probability that the null hypothesis is true. Since our p-value is greater than 0.05, we fail to reject the null hypothesis. In other words, our evidence is not strong enough to conclude that Brookline and Cambridge have different proportions of single-parent households, and our observed difference could be due to sampling variation.
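
This interpretation can be checked by simulation: pretend the null hypothesis is true, draw many pairs of samples, and count how often the gap is at least 7 percentage points. The sketch below assumes both districts share the pooled proportion of 0.365 under H0; the seed and number of repetitions are arbitrary choices.

```r
set.seed(8)                    # arbitrary seed for reproducibility

n <- 270
pooled <- 0.365                # common proportion assumed under H0
reps <- 20000                  # number of simulated pairs of samples

# Simulate both districts drawing from the same population proportion
p1_sim <- rbinom(reps, n, pooled) / n
p2_sim <- rbinom(reps, n, pooled) / n

# Share of simulated differences at least as extreme as the observed 7 points
sim_p_value <- mean(abs(p1_sim - p2_sim) >= 0.07)
sim_p_value
```

The simulated share should fall in the neighborhood of the 9% p-value from the z-test. It will not match exactly, because the simulation counts whole students while the z-test uses a smooth normal approximation.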

SUBMITTING ASSIGNMENT

Click “Knit” >> “Knit to HTML” in the menu above. Next, click the “Open in Browser” button to open the html file in a web browser. Finally, from the browser, save the page as a pdf file, and submit that pdf file on Canvas.

API-201 PROBLEM SET #8
