table(ncbirths$lowbirthweight,ncbirths$habit)
##          
##           nonsmoker smoker
##   low            92     18
##   not low       781    108

Difference in Proportions


Question 1:

Based on the above, what is our parameter of interest? What would be a point estimate of this parameter of interest?

Compute a 90% confidence interval for the difference in the proportion of babies born with low birth weight between non-smoking mothers and smoking mothers.

prop.test(table(ncbirths$habit, ncbirths$lowbirthweight),conf.level = .90, correct=FALSE)
## 
##  2-sample test for equality of proportions without continuity
##  correction
## 
## data:  table(ncbirths$habit, ncbirths$lowbirthweight)
## X-squared = 1.578, df = 1, p-value = 0.2091
## alternative hypothesis: two.sided
## 90 percent confidence interval:
##  -0.09152407  0.01657725
## sample estimates:
##    prop 1    prop 2 
## 0.1053837 0.1428571
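
The same interval can be reproduced without the raw data frame, because `prop.test` also accepts success counts and group sizes. The following is a minimal sketch, assuming the counts from the two-way table above (92 of 873 nonsmokers and 18 of 126 smokers had a low birth weight baby); the names `low_counts`, `group_sizes`, and `res` are illustrative only.

```r
# Counts taken from the two-way table above (low birth weight = "success")
low_counts  <- c(nonsmoker = 92, smoker = 18)
group_sizes <- c(nonsmoker = 873, smoker = 126)

# Same settings as the call above: 90% level, no continuity correction
res <- prop.test(x = low_counts, n = group_sizes,
                 conf.level = 0.90, correct = FALSE)

res$estimate   # sample proportions for each group
res$conf.int   # 90% interval for p1 - p2 (nonsmoker minus smoker)
```

Passing counts directly makes the group order explicit, which matters because the sign of the interval depends on which group is listed first.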

Question 2:

Using the data, compute the following:

  1. The sample proportion of babies born with low birth weight among non-smoking women \(\hat{p_{1}}\)
p1 = 92/873
p1
## [1] 0.1053837
  1. The sample proportion of babies born with low birth weight among smoking women \(\hat{p_{2}}\)
p2 = 18/126
p2
## [1] 0.1428571
  1. The point estimate for \(p_{1} - p_{2}\), the difference in population proportions of babies born with low birth weight between non-smoking and smoking women.
p_hat = p1 - p2
p_hat
## [1] -0.03747341
  1. The z* needed for a 90% confidence interval.
z = qnorm(.95, 0, 1)
z
## [1] 1.644854

Question 3:

Check the assumptions for the sampling distribution of \(\hat{p_{1}}\) - \(\hat{p_{2}}\) to be normal. In other words, check the conditions necessary to construct a confidence interval for \(p_{1}\) - \(p_{2}\). Recall, these conditions are (1) independence within groups, (2) independence between groups, and (3) success-failure condition in BOTH groups.

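  1. Yes. The data are a random sample, so the births within each group are independent of one another.

  2. Yes. The data are a random sample, and the non-smoking and smoking mothers are different women, so the two groups are independent of each other.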

  3. Yes, \(n_{1}\hat{p_{1}} \ge 10\) and \(n_{1}(1-\hat{p_{1}}) \ge 10\), and \(n_{2}\hat{p_{2}} \ge 10\) and \(n_{2}(1-\hat{p_{2}}) \ge 10\).

Sample 1: \(92 \ge 10\) and \(781 \ge 10\). Sample 2: \(18 \ge 10\) and \(108 \ge 10\).
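
The success-failure check can also be done in code. This is a small sketch, assuming the four counts from the two-way table; `counts` and `conditions_met` are names introduced here for illustration.

```r
# Observed counts from the two-way table (rows: outcome, columns: habit)
counts <- matrix(c(92, 781, 18, 108), nrow = 2,
                 dimnames = list(c("low", "not low"),
                                 c("nonsmoker", "smoker")))

# n * p_hat and n * (1 - p_hat) equal the observed successes and failures,
# so the condition holds when every cell of the table is at least 10
conditions_met <- all(counts >= 10)
conditions_met
```

Because the unpooled check multiplies each sample size by its own sample proportion, it reduces to reading the cell counts straight off the table.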


Question 4:

Calculate the standard error for the sampling distribution of \(\hat{p_{1}}\) - \(\hat{p_{2}}\) . Then, compute the 90% confidence interval for \(p_{1}\) - \(p_{2}\).

se = sqrt((p1*(1-p1)/873) + (p2*(1-p2)/126))
se
## [1] 0.03286047
p_hat + z * se
## [1] 0.01657725
p_hat - z * se
## [1] -0.09152407
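
The steps in Questions 2 and 4 can be wrapped into one function. The sketch below assumes the usual unpooled formula, point estimate plus or minus z* times \(\sqrt{\hat{p_{1}}(1-\hat{p_{1}})/n_{1} + \hat{p_{2}}(1-\hat{p_{2}})/n_{2}}\); `diff_prop_ci` is a hypothetical helper written for this illustration, not a function from any package.

```r
# Hypothetical helper: confidence interval for p1 - p2 from counts
diff_prop_ci <- function(x1, n1, x2, n2, conf_level = 0.90) {
  p1 <- x1 / n1
  p2 <- x2 / n2
  # Unpooled standard error of p1_hat - p2_hat
  se <- sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
  # Critical value, e.g. qnorm(0.95) for a 90% interval
  z_star <- qnorm(1 - (1 - conf_level) / 2)
  (p1 - p2) + c(lower = -1, upper = 1) * z_star * se
}

# Nonsmokers: 92 of 873; smokers: 18 of 126
ci <- diff_prop_ci(92, 873, 18, 126)
ci
```

The result should agree with the `prop.test` interval reported for Question 1, since that call used `correct = FALSE`.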

Question 5:

Interpret the confidence interval you computed in Question 4 given the context of the data.


Now suppose we’d like to formally test if there is a difference between the proportion of babies born with low birth weight to non-smoking and smoking mothers. We will conduct our hypothesis test using a significance level of \(\alpha\) = 0.1.


Question 6:

State the null and alternative hypotheses, if we are interested in comparing the proportion of babies born with low birth weight between non-smoking and smoking mothers.

  1. State the hypotheses in words and with statistical notation.
  1. Why is the null rather than the alternative hypothesis a statement of equality?

Question 7:

Compute the pooled proportion of babies born with low birth weight between non-smoking and smoking mothers. Explain why we use a pooled proportion.

pooled = ((18+92)/(126+873))
pooled
## [1] 0.1101101
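
One way to see why pooling makes sense: under the null hypothesis \(p_{1}\) = \(p_{2}\), both groups estimate the same proportion, so the two samples are combined into a single estimate. The sketch below, using the counts from the table, shows that the pooled proportion is the sample-size-weighted average of the two sample proportions; the variable names are illustrative.

```r
n1 <- 873   # nonsmokers
n2 <- 126   # smokers
p1 <- 92 / n1
p2 <- 18 / n2

# Pooled proportion computed directly from the combined counts
pooled_direct <- (92 + 18) / (n1 + n2)

# The same value as a weighted average of the two sample proportions
pooled_weighted <- weighted.mean(c(p1, p2), w = c(n1, n2))

all.equal(pooled_direct, pooled_weighted)
```

Because the nonsmoking group is much larger, the pooled value sits closer to that group's sample proportion than to the smokers'.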

Question 8:

Using the pooled proportion computed in Question 7, check the conditions necessary to use the normal distribution to perform a hypothesis test. Show all your work.

pooled * 873
## [1] 96.12613
(1- pooled) * 873
## [1] 776.8739
pooled * 126
## [1] 13.87387
(1- pooled) * 126
## [1] 112.1261

Question 9:

  1. Compute the standard error using the pooled proportion computed in Question 7.
se1 = sqrt((pooled*(1-pooled)/873) + (pooled*(1-pooled)/126))
se1
## [1] 0.02983129
  1. Calculate your Z-statistic/test statistic.
z1 = (p_hat - 0)/se1
z1
## [1] -1.256178
  1. Compute the associated p-value.
# two-sided alternative, so double the lower-tail area
pvalue = 2 * pnorm(-abs(z1), mean = 0, sd = 1)
pvalue
## [1] 0.2090515
  1. Report your conclusion from the hypothesis test based on the given significance level above and include the confidence interval and p-value. State your conclusion in the context of the data.
  1. Define what the p-value means in context
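
As a cross-check on the hand calculation, the pooled z-test can be compared with `prop.test`. This is a sketch under the assumption that the test is two-sided, as stated before Question 6; the squared z-statistic should equal the `X-squared` value, and doubling the tail area should give the p-value that `prop.test` reported for Question 1 (0.2091). All variable names here are illustrative.

```r
x <- c(92, 18)     # low birth weight counts: nonsmoker, smoker
n <- c(873, 126)   # group sizes

p_hat_diff <- x[1] / n[1] - x[2] / n[2]
pooled     <- sum(x) / sum(n)

# Standard error under H0: p1 = p2, using the pooled proportion
se_pooled <- sqrt(pooled * (1 - pooled) * sum(1 / n))
z_stat    <- p_hat_diff / se_pooled

# Two-sided p-value: area in both tails beyond |z|
p_two_sided <- 2 * pnorm(-abs(z_stat))

# Chi-squared version of the same test, without continuity correction
chisq_check <- prop.test(x, n, correct = FALSE)

c(z_squared = z_stat^2, x_squared = unname(chisq_check$statistic))
c(p_two_sided = p_two_sided, prop_test_p = chisq_check$p.value)
```

Agreement between the two rows confirms that the z-test with a pooled standard error and the chi-squared test on the 2x2 table are the same procedure.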

Question 10:

Provide an appropriate visualization for your data. (Look at the Week 2 slides).


EXTRA CREDIT (2 points): Use the ggplot2 or plotly R packages to create visualizations. You will need to look up how to do this (you may refer to the R demo posted in the Week 3 module).

ANSWER

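library(ggplot2)
# Bars of equal height, filled by the share of low / not low births within each habit
ggplot(data = subset(ncbirths, !is.na(habit)), aes(x = habit, fill = lowbirthweight)) + 
    geom_bar(position = "fill") + 
    labs(y = "proportion")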
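
The proportions that a bar chart of this kind displays can be tabulated directly. This is a minimal sketch, assuming the counts from the two-way table at the top; `counts` and `col_props` are illustrative names, and `prop.table` with `margin = 2` divides each column by its total.

```r
# Two-way table of counts (rows: birth weight, columns: smoking habit)
counts <- matrix(c(92, 781, 18, 108), nrow = 2,
                 dimnames = list(lowbirthweight = c("low", "not low"),
                                 habit = c("nonsmoker", "smoker")))

# Proportion of each birth weight category within each habit group
col_props <- prop.table(counts, margin = 2)
round(col_props, 3)
```

The "low" row of `col_props` holds the two sample proportions from Question 2, so it is a quick way to confirm that the plot and the calculations describe the same quantities.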

Question 11:

Exercise 6.19 in the OpenIntro 4th edition textbook (page 225).

  1. We are 95% confident that the true proportion of males whose favorite color is black is 2% lower to 6% higher than the true proportion of females whose favorite color is black.
  1. We are 95% confident that the true proportion of males whose favorite color is black is 2% to 6% higher than the true proportion of females whose favorite color is black.
  1. 95% of random samples will produce 95% confidence intervals that include the true difference between the population proportions of males and females whose favorite color is black.
  1. We can conclude that there is a significant difference between the proportions.
  1. The 95% confidence interval for \(p_{female} - p_{male}\) cannot be calculated with only the information given in this exercise.