Week 6 - Difference in Proportions

Question 1: What is our parameter of interest? Answer: The parameter of interest is the difference in population proportions p1 - p2, where p1 is the proportion of babies born with low birth weight to mothers who are non-smokers and p2 is the proportion of babies born with low birth weight to mothers who are smokers.

What would be a point estimate of this parameter of interest? Answer: \(\hat{p}_{1} - \hat{p}_{2}\) = 92/873 - 18/126 = 0.105 - 0.143 = -0.037 (computed from the unrounded proportions). The point estimate is -0.037.

Notation: p.hat denotes a sample proportion, so p.hat1 = \(\hat{p}_{1}\) and p.hat2 = \(\hat{p}_{2}\).

load("C:/Users/Maisey/Downloads/ncbirths.rda")

summary(ncbirths$lowbirthweight)

##     low not low 
##     111     889

summary(ncbirths$habit)

## nonsmoker    smoker      NA's 
##       873       126         1

table(ncbirths$lowbirthweight, ncbirths$habit)

##          
##           nonsmoker smoker
##   low            92     18
##   not low       781    108

table(ncbirths$habit, ncbirths$lowbirthweight)

##            
##             low not low
##   nonsmoker  92     781
##   smoker     18     108

Question 2: Using the data, compute the following:

A) The sample proportion of babies born with low birth weight among non-smoking women (\(\hat{p}_{1}\)). Answer: 92/873 = 0.105
B) The sample proportion of babies born with low birth weight among smoking women (\(\hat{p}_{2}\)). Answer: 18/126 = 0.143
C) The point estimate for \(p_{1} - p_{2}\), the difference in population proportions of babies born with low birth weight between non-smoking and smoking women. Answer: 0.105 - 0.143 = -0.037 (computed from the unrounded proportions)
D) The z* needed for a 90% confidence interval. Answer: z* = 1.645, the 95th percentile of the standard normal distribution, which leaves 5% in each tail.

qnorm(0.95, mean = 0, sd = 1)

## [1] 1.644854
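
The Question 2 quantities can be reproduced directly from the counts in the two-way table. The sketch below hard-codes those counts instead of reading `ncbirths`, and the variable names (`x1`, `n1`, `p_hat1`, and so on) are chosen here purely for illustration.

```r
# Counts taken from the two-way table (the one birth with missing habit is excluded)
x1 <- 92; n1 <- 873   # low birth weight babies / all babies, non-smokers
x2 <- 18; n2 <- 126   # low birth weight babies / all babies, smokers

p_hat1 <- x1 / n1              # sample proportion, non-smokers
p_hat2 <- x2 / n2              # sample proportion, smokers
point_est <- p_hat1 - p_hat2   # point estimate of p1 - p2

# Critical value for a 90% interval: 5% in each tail, so the 95th percentile
z_star <- qnorm(0.95)

round(c(p_hat1 = p_hat1, p_hat2 = p_hat2,
        point_est = point_est, z_star = z_star), 3)
```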

Question 3: Check the assumptions for the sampling distribution of \(\hat{p}_{1} - \hat{p}_{2}\) to be normal. In other words, check the conditions necessary to construct a confidence interval for \(p_{1} - p_{2}\). Recall, these conditions are (1) independence within groups, (2) independence between groups, and (3) success-failure condition in BOTH groups.

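Answer: (1) Independence within groups: the births are treated as a random sample, so the observations within each group are independent of one another. (2) Independence between groups: non-smoking and smoking mothers are different women, so the two groups are independent of each other. (3) Success-failure condition, checked with the observed counts in each group: Non-smokers: successes = 873(92/873) = 92 ≥ 10 [TRUE]; failures = 873(1-(92/873)) = 781 ≥ 10 [TRUE]. Smokers: successes = 126(18/126) = 18 ≥ 10 [TRUE]; failures = 126(1-(18/126)) = 108 ≥ 10 [TRUE]. All three conditions are met.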
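
The success-failure check for a confidence interval only needs the observed counts, so it can be sketched in a few lines. The matrix below re-enters the counts from the two-way table by hand; the object names `counts` and `sf_ok` are illustrative.

```r
# Observed successes (low) and failures (not low) in each group
counts <- matrix(c(92, 781,
                   18, 108),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(habit = c("nonsmoker", "smoker"),
                                 weight = c("low", "not low")))

# Success-failure condition for a confidence interval:
# every observed count must be at least 10
sf_ok <- all(counts >= 10)
sf_ok
```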

Question 4: Calculate the standard error for the sampling distribution of \(\hat{p}_{1} - \hat{p}_{2}\). Then, compute the 90% confidence interval for \(p_{1} - p_{2}\).

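Answer: SE = sqrt(p.hat1(1-p.hat1)/n1 + p.hat2(1-p.hat2)/n2) = sqrt(0.105(0.895)/873 + 0.143(0.857)/126) = 0.0329. The 90% confidence interval is p.hat1 - p.hat2 ± z* × SE = -0.0375 ± 1.645 × 0.0329 = (-0.092, 0.017).

Question 5: Interpret the confidence interval you computed in Question 4 given the context of the data.

Answer: We are 90% confident that the difference in the proportion of low birth weight babies between mothers who are non-smokers and mothers who are smokers (non-smokers minus smokers) is between -0.092 and 0.017. Because this interval contains 0, the data are consistent with no difference between the two groups.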
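
As a check on the arithmetic, the standard error and the 90% interval can be computed from the table counts. This is a minimal sketch using the unpooled standard error formula for a confidence interval; the counts are typed in by hand and the variable names are illustrative.

```r
x1 <- 92; n1 <- 873   # non-smokers: low birth weight count, group size
x2 <- 18; n2 <- 126   # smokers: low birth weight count, group size
p_hat1 <- x1 / n1
p_hat2 <- x2 / n2

# Unpooled standard error of p_hat1 - p_hat2 (used for confidence intervals)
se <- sqrt(p_hat1 * (1 - p_hat1) / n1 + p_hat2 * (1 - p_hat2) / n2)

# 90% interval: point estimate plus or minus z* times SE
z_star <- qnorm(0.95)
ci <- (p_hat1 - p_hat2) + c(-1, 1) * z_star * se

round(se, 4)
round(ci, 3)
```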

Question 6: State the null and alternative hypotheses, if we are interested in comparing the proportion of babies born with low birth weight between non-smoking and smoking mothers.

State the hypotheses in words and with statistical notation. Answer: H0: p1 - p2 = 0; HA: p1 - p2 ≠ 0. Null hypothesis: the proportion of babies born with low birth weight is the same for non-smoking and smoking mothers. Alternative hypothesis: the proportion of babies born with low birth weight differs between non-smoking and smoking mothers.
Why is the null rather than the alternative hypothesis a statement of equality? Answer: The null hypothesis is the claim we assume to be true until the data show otherwise, and a statement of equality gives a single specific value (p1 - p2 = 0) around which the sampling distribution of the test statistic can be built. The alternative hypothesis covers a range of values, so it cannot serve that purpose. Because the test rests on probability, we can only find strong evidence against the null; we can never speak in terms of absolute certainty.

Question 7: Compute the pooled proportion of babies born with low birth weight between non-smoking and smoking mothers. Explain why we use a pooled proportion.

Answer: p.hat pooled = (92+18)/(873+126) = 110/999 = 0.110. Under the null hypothesis p1 = p2, so both samples estimate the same common proportion, and combining them gives the best single estimate of it. In the hypothesis test we use the pooled proportion to check the success-failure condition and to estimate the standard error, which is then used to calculate the z-test statistic.

Question 8: Using the pooled proportion computed in Question 7, check the conditions necessary to use the normal distribution to perform a hypothesis test. Show all your work.

Answer: p.hat pooled = (92+18)/(873+126) = 0.110. Non-smokers: expected successes = 0.110 × 873 = 96 ≥ 10 [TRUE]; expected failures = (1-0.110) × 873 = 777 ≥ 10 [TRUE]. Smokers: expected successes = 0.110 × 126 = 14 ≥ 10 [TRUE]; expected failures = (1-0.110) × 126 = 112 ≥ 10 [TRUE]. Point estimate = -0.037, SE = 0.030, Z-score = -1.26, p-value = 0.209.
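
For the hypothesis test, the success-failure condition uses expected counts under the null, that is, each group size multiplied by the pooled proportion. The sketch below assumes the counts from the two-way table; `p_pool` and `expected` are illustrative names.

```r
x <- c(nonsmoker = 92, smoker = 18)     # low birth weight counts
n <- c(nonsmoker = 873, smoker = 126)   # group sizes

# Pooled proportion: all successes over all observations
p_pool <- sum(x) / sum(n)

# Expected successes and failures in each group if H0 (p1 = p2) is true
expected <- rbind(success = n * p_pool,
                  failure = n * (1 - p_pool))

round(p_pool, 3)
round(expected, 1)
all(expected >= 10)   # success-failure condition for the test
```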

Question 9: a. Compute the standard error using the pooled proportion computed in Question 7.

Answer: p.hat pooled = 0.110; SE = sqrt(p.pooled(1-p.pooled)/873 + p.pooled(1-p.pooled)/126) = sqrt(0.110(0.890)/873 + 0.110(0.890)/126) = 0.0298

b. Calculate your Z-statistic/test statistic.

Answer: Z-score = (point estimate - null value)/SE = (-0.0375 - 0)/0.0298 = -1.256, or about -1.26

c. Compute the associated p-value.

Answer: p-value = 2 × P(Z < -1.256) = 0.209 #See code chunk below

d. Report your conclusion from the hypothesis test based on the given significance level above and include the confidence interval and p-value. State your conclusion in the context of the data.

Answer: At a significance level of 0.1, the p-value of 0.209 is greater than 0.1, so we fail to reject the null hypothesis. Our data do not provide convincing evidence that the proportion of babies with a low birth weight born to mothers who are non-smokers differs from the proportion born to mothers who are smokers. This agrees with the 90% confidence interval for p1 - p2, (-0.092, 0.017), which contains 0.

e. Define what the p-value means in context.

Answer: The p-value is 0.209. If there truly were no difference in the proportion of low birth weight babies between non-smoking and smoking mothers, there would be about a 21% chance of observing a difference in sample proportions at least as extreme as -0.037 in either direction.

round(2*pnorm(-1.256, mean = 0, sd = 1), 3)

## [1] 0.209
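
The whole pooled z-test can be sketched from the table counts, with base R's `prop.test()` as a cross-check. With `correct = FALSE`, `prop.test()` reports a chi-squared statistic that equals the square of the pooled z statistic, so the two p-values should agree. The counts are typed in by hand and the variable names are illustrative.

```r
x <- c(92, 18)     # low birth weight counts: non-smokers, smokers
n <- c(873, 126)   # group sizes: non-smokers, smokers

p_hat  <- x / n
p_pool <- sum(x) / sum(n)

# Pooled standard error, valid under H0: p1 = p2
se_pool <- sqrt(p_pool * (1 - p_pool) * sum(1 / n))

# Test statistic and two-sided p-value
z <- (p_hat[1] - p_hat[2] - 0) / se_pool
p_value <- 2 * pnorm(-abs(z))

# Cross-check: two-sample test of proportions without continuity correction
pt <- prop.test(x, n, correct = FALSE)

round(c(se_pool = se_pool, z = z, p_value = p_value,
        prop_test_p = pt$p.value), 4)
```

Computing `z` from unrounded values avoids the small drift that comes from rounding the point estimate and standard error before dividing.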

Question 10: Provide an appropriate visualization for your data. (Look at the Week 2 slides).

table(ncbirths$habit, ncbirths$lowbirthweight)

##            
##             low not low
##   nonsmoker  92     781
##   smoker     18     108

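# Side-by-side bars of counts by birth weight category; the legend identifies smoking habit
barplot(table(ncbirths$habit, ncbirths$lowbirthweight), beside = TRUE, legend.text = TRUE)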
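
Because the two groups differ greatly in size (873 versus 126), a plot of proportions within each habit group may be easier to compare than raw counts. The sketch below rebuilds the two-way table by hand and plots row proportions; the axis labels and title are illustrative choices, not taken from the course slides.

```r
# Counts from the two-way table, rows = habit, columns = birth weight category
counts <- matrix(c(92, 781,
                   18, 108),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(habit = c("nonsmoker", "smoker"),
                                 weight = c("low", "not low")))

# Row proportions: share of low / not low births within each habit group
props <- prop.table(counts, margin = 1)

# Transpose so each bar is a habit group, stacked by birth weight category
barplot(t(props), legend.text = TRUE,
        xlab = "Mother's smoking habit",
        ylab = "Proportion of births",
        main = "Birth weight category by smoking habit")
```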

Question 11: Exercise 6.19 in the OpenIntro 4th edition textbook (page 225).

6.19 Gender and color preference. A study asked 1,924 male and 3,666 female undergraduate college students their favorite color. A 95% confidence interval for the difference between the proportions of males and females whose favorite color is black (p_male - p_female) was calculated to be (0.02, 0.06). Based on this information, determine if the following statements are true or false, and explain your reasoning for each statement you identify as false.

a) We are 95% confident that the true proportion of males whose favorite color is black is 2% lower to 6% higher than the true proportion of females whose favorite color is black.
b) We are 95% confident that the true proportion of males whose favorite color is black is 2% to 6% higher than the true proportion of females whose favorite color is black.
c) 95% of random samples will produce 95% confidence intervals that include the true difference between the population proportions of males and females whose favorite color is black.
d) We can conclude that there is a significant difference between the proportions of males and females whose favorite color is black and that the difference between the two sample proportions is too large to plausibly be due to chance.
e) The 95% confidence interval for (p_female - p_male) cannot be calculated with only the information given in this exercise.

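Answer: a) False - the entire confidence interval is above 0, so it contains no negative values and does not suggest the proportion for males could be lower. b) True c) True d) True e) False - reversing the order of subtraction only negates and reorders the endpoints, giving (-0.06, -0.02) for (p_female - p_male).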
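
The reasoning for part e) can be sketched in one step: reversing the order of subtraction flips the sign of both endpoints, and the endpoints are then reordered so the lower bound comes first. The variable names are illustrative.

```r
# 95% confidence interval for p_male - p_female, as given in the exercise
ci_male_minus_female <- c(0.02, 0.06)

# Negate both endpoints, then reverse so the lower bound comes first
ci_female_minus_male <- rev(-ci_male_minus_female)
ci_female_minus_male
```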