SEIS631: Foundations of Data Analysis R Coding assignment 6

Difference in Proportions

Question 1:

Based on the above, what is our parameter of interest? What would be a point estimate of this parameter of interest?

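Parameter of Interest = \(p_{1}\) - \(p_{2}\), the difference in the population proportions of babies born with low birth weight between non-smoking mothers (\(p_{1}\)) and smoking mothers (\(p_{2}\)).
Point Estimate = \(\hat{p_{1}}\) - \(\hat{p_{2}}\) = (low-birth-weight babies of non-smoking mothers / total non-smoking mothers) - (low-birth-weight babies of smoking mothers / total smoking mothers)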

Compute a 90% confidence interval for the difference in proportion of babies born with low birth weight between non-smoking mothers and smoking mothers.

prop.test(table(ncbirths$habit, ncbirths$lowbirthweight),conf.level = .90, correct=FALSE)

## 
##  2-sample test for equality of proportions without continuity
##  correction
## 
## data:  table(ncbirths$habit, ncbirths$lowbirthweight)
## X-squared = 1.578, df = 1, p-value = 0.2091
## alternative hypothesis: two.sided
## 90 percent confidence interval:
##  -0.09152407  0.01657725
## sample estimates:
##    prop 1    prop 2 
## 0.1053837 0.1428571
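
The same interval can be reproduced without the raw data frame by passing counts directly to `prop.test()`. The sketch below assumes the counts used in Question 2 of this assignment (92 of 873 non-smoking mothers and 18 of 126 smoking mothers had low-birth-weight babies); the variable names are only illustrative.

```r
# Counts assumed from Question 2: low-birth-weight babies and group sizes,
# in the order non-smoking, smoking
low_counts <- c(92, 18)
group_sizes <- c(873, 126)

# Two-sample test of proportions, 90% CI, no continuity correction
res <- prop.test(x = low_counts, n = group_sizes, conf.level = 0.90, correct = FALSE)

res$estimate   # sample proportions for each group
res$conf.int   # 90% confidence interval for p1 - p2
```

The bounds in `res$conf.int` should agree with the interval printed above.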

Question 2:

Using the data, compute the following:

The sample proportion of babies born with low birth weight among non-smoking women \(\hat{p_{1}}\)

0.1054

p1 = 92/873
p1

## [1] 0.1053837

The sample proportion of babies born with low birth weight among smoking women \(\hat{p_{2}}\)

0.1429

p2 = 18/126
p2

## [1] 0.1428571

The point estimate for \(p_{1}\) - \(p_{2}\) , the difference in population proportions of babies born with low birth weight between smoking and non-smoking women.

-0.037

p_hat = p1 - p2
p_hat

## [1] -0.03747341

The z* needed for a 90% confidence interval.

z = qnorm(.95, 0, 1)
z

## [1] 1.644854
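
The 0.95 passed to `qnorm()` comes from splitting the 10% left outside a 90% interval evenly between the two tails. A minimal sketch of that step for an arbitrary confidence level, with `conf_level`, `alpha`, and `z_star` as illustrative names:

```r
# Critical value z* for a two-sided confidence interval at a given level
conf_level <- 0.90
alpha <- 1 - conf_level          # total probability left in the two tails
z_star <- qnorm(1 - alpha / 2)   # upper alpha/2 quantile of the standard normal
z_star
```

Changing `conf_level` to 0.95 would give the familiar critical value of about 1.96.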

Question 3:

Check the assumptions for the sampling distribution of \(\hat{p_{1}}\) - \(\hat{p_{2}}\) to be normal. In other words, check the conditions necessary to construct a confidence interval for \(p_{1}\) - \(p_{2}\). Recall, these conditions are (1) independence within groups, (2) independence between groups, and (3) success-failure condition in BOTH groups.

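Yes, independence within groups holds: the data are a random sample, so the births within each group are independent of one another.
Yes, independence between groups holds: the data are a random sample, so the non-smoking and smoking groups are independent of each other.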
Yes, \(n_{1}\)\(\hat{p_{1}}\)\(\ge\) 10 and \(n_{1}\)(1-\(\hat{p_{1}}\))\(\ge\) 10 AND \(n_{2}\)\(\hat{p_{2}}\)\(\ge\) 10 and \(n_{2}\)(1-\(\hat{p_{2}}\))\(\ge\) 10

Sample 1 (non-smoking): 92 \(\ge\) 10 and 781 \(\ge\) 10
Sample 2 (smoking): 18 \(\ge\) 10 and 108 \(\ge\) 10
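
The success-failure check can also be done in code. This is a small sketch that assumes the counts from Question 2 (92 of 873 and 18 of 126); the object names are illustrative.

```r
# Observed successes (low birth weight) and group sizes, taken from Question 2
successes <- c(nonsmoker = 92, smoker = 18)
n <- c(nonsmoker = 873, smoker = 126)
failures <- n - successes

# Success-failure condition: every observed count should be at least 10
counts <- rbind(successes, failures)
counts
sf_ok <- all(counts >= 10)
sf_ok
```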

Question 4:

Calculate the standard error for the sampling distribution of \(\hat{p_{1}}\) - \(\hat{p_{2}}\) . Then, compute the 90% confidence interval for \(p_{1}\) - \(p_{2}\).

se = sqrt((p1*(1-p1)/873) + (p2*(1-p2)/126))
se

## [1] 0.03286047

p_hat + z * se

## [1] 0.01657725

p_hat - z * se

## [1] -0.09152407

(-0.0915, 0.0166)
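
The steps in this question can be wrapped into one function. The sketch below is a hypothetical helper, `diff_prop_ci()`, that builds the interval from counts using the unpooled standard error; it is not part of the assignment, and the counts are those assumed from Question 2.

```r
# Confidence interval for p1 - p2 from counts (unpooled standard error)
diff_prop_ci <- function(x1, n1, x2, n2, conf_level = 0.90) {
  p1 <- x1 / n1
  p2 <- x2 / n2
  se <- sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
  z_star <- qnorm(1 - (1 - conf_level) / 2)
  estimate <- p1 - p2
  c(lower = estimate - z_star * se, upper = estimate + z_star * se)
}

# Non-smoking mothers first, smoking mothers second
ci <- diff_prop_ci(92, 873, 18, 126)
round(ci, 4)
```

The result should reproduce the bounds reported by `prop.test()` in Question 1.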

Question 5:

Interpret the confidence interval you computed in Question 4 given the context of the data.

We are 90% confident that the proportion of babies born with low birth weight to non-smoking mothers is between 9.15 percentage points lower and 1.66 percentage points higher than the proportion for smoking mothers. Because 0 is contained in the interval, we do not have enough evidence to say that the proportion of low-birth-weight babies differs between the two groups.

Now suppose we’d like to formally test if there is a difference between the proportion of babies born with low birth weight to non-smoking and smoking mothers. We will conduct our hypothesis test using a significance level of \(\alpha\) = 0.1.

Question 6:

State the null and alternative hypotheses, if we are interested in comparing the proportion of babies born with low birth weight between non-smoking and smoking mothers.

State the hypotheses in words and with statistical notation.

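H\(_{0}\): \(p_{1}\) - \(p_{2}\) = 0 (the proportion of babies born with low birth weight is the same for non-smoking and smoking mothers)
H\(_{A}\): \(p_{1}\) - \(p_{2}\) \(\ne\) 0 (the proportion of babies born with low birth weight differs between non-smoking and smoking mothers)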

Why is the null rather than the alternative hypothesis a statement of equality?

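Because the null hypothesis has to specify a single value for \(p_{1}\) - \(p_{2}\). That one value tells us what the sampling distribution of the test statistic looks like when the null is true, which is what we need to compute a p-value. The alternative covers a whole range of values, so it gives no single distribution to test against.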

Question 7:

Compute the pooled proportion of babies born with low birth weight between non-smoking and smoking mothers. Explain why we use a pooled proportion.

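0.1101. We use a pooled proportion because the null hypothesis assumes \(p_{1}\) = \(p_{2}\), so the best estimate of that common proportion combines the data from both groups.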

pooled = ((18+92)/(126+873))
pooled

## [1] 0.1101101

Question 8:

Using the pooled proportion computed in Question 7, check the conditions necessary to use the normal distribution to perform a hypothesis test. Show all your work.

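Yes, we can assume observations within and between groups are independent
Yes, all four expected counts (pooled proportion times each group's sample size, \(n_{1}\) = 873 and \(n_{2}\) = 126) are at least 10

pooled * 873

## [1] 96.12613

(1- pooled) * 873

## [1] 776.8739

pooled * 126

## [1] 13.87387

(1- pooled) * 126

## [1] 112.1261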

Question 9:

Compute the standard error using the pooled proportion computed in Question 7.

0.0298

se1 = sqrt((pooled*(1-pooled)/873) + (pooled*(1-pooled)/126))
se1

## [1] 0.02983129

Calculate your Z-statistic/test statistic.

-1.2562

z1 = (p_hat - 0)/se1
z1

## [1] -1.256178

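Compute the associated p-value.

0.2091

# The alternative is two-sided, so count both tails
pvalue = 2 * pnorm(-abs(z1), mean = 0, sd = 1)
round(pvalue, 4)

## [1] 0.2091

Report your conclusion from the hypothesis test based on the given significance level above and include the confidence interval and p-value. State your conclusion in the context of the data.

Because the p-value (0.2091) is larger than \(\alpha\) = 0.1, we fail to reject the null hypothesis. The data do not provide convincing evidence of a difference in the proportion of babies born with low birth weight between non-smoking and smoking mothers. This agrees with the 90% confidence interval (-0.0915, 0.0166), which contains 0.

Define what the p-value means in context

If there were truly no difference in the proportion of low-birth-weight babies between non-smoking and smoking mothers, there would be about a 20.9% chance of observing a difference in sample proportions at least as large as 0.037 in either direction.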
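
The test in Questions 7 through 9 can be sketched as a single function. `two_prop_z_test()` is a hypothetical helper, not part of the assignment; it assumes a two-sided alternative and the counts from Question 2.

```r
# Two-sided z-test for H0: p1 - p2 = 0 using the pooled proportion
two_prop_z_test <- function(x1, n1, x2, n2) {
  p1 <- x1 / n1
  p2 <- x2 / n2
  pooled <- (x1 + x2) / (n1 + n2)
  se <- sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
  z <- (p1 - p2) / se
  p_value <- 2 * pnorm(-abs(z))   # both tails, matching the two-sided alternative
  list(pooled = pooled, se = se, z = z, p_value = p_value)
}

# Non-smoking mothers first, smoking mothers second
result <- two_prop_z_test(92, 873, 18, 126)
result$z
result$p_value
```

As a cross-check, `result$z^2` should match the X-squared statistic that `prop.test()` reported in Question 1, and `result$p_value` should match its p-value.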

Question 10:

Provide an appropriate visualization for your data. (Look at the Week 2 slides).

EXTRA CREDIT (2 points): Use the ggplot2() or plot_ly R packages to create visualizations. You will need to look up how to do this (you may refer to the R demo posted in the Week 3 module).

ANSWER

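library(ggplot2)

# Proportion of low vs. not-low birth weight within each smoking habit
ggplot(data = subset(ncbirths, !is.na(habit)), aes(x = habit, fill = lowbirthweight)) + 
    geom_bar(position = "fill") + 
    labs(x = "Mother's smoking habit", y = "Proportion of births", fill = "Birth weight")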

Question 11:

Exercise 6.19 in the OpenIntro 4th edition textbook (page 225).

We are 95% confident that the true proportion of males whose favorite color is black is 2% lower to 6% higher than the true proportion of females whose favorite color is black.

False. The proportion for males is 2% to 6% higher, not 2% lower to 6% higher.

We are 95% confident that the true proportion of males whose favorite color is black is 2% to 6% higher than the true proportion of females whose favorite color is black.

True

95% of random samples will produce 95% confidence intervals that include the true difference between the population proportions of males and females whose favorite color is black

True

We can conclude that there is a significant difference between the proportions

True

The 95% confidence interval for (Pfemale - Pmale) cannot be calculated with only the information given in this exercise.

False. Reversing the order of subtraction only flips the signs of the endpoints, so the interval would be (-0.06, -0.02).
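
The sign flip in the last answer can be shown with a short sketch. It assumes the interval for (Pmale - Pfemale) is (0.02, 0.06), as implied by the answers above; the object names are illustrative.

```r
# Interval for (Pmale - Pfemale), as implied by the answers above
ci_male_minus_female <- c(lower = 0.02, upper = 0.06)

# Reversing the subtraction negates both endpoints and swaps their order
ci_female_minus_male <- c(lower = -ci_male_minus_female[["upper"]],
                          upper = -ci_male_minus_female[["lower"]])
ci_female_minus_male
```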

Tully O’Leary

3/16/2021