load("/Users/abdurhmanayde/Downloads/ncbirths.rda")

Question 1: Based on the above, what is our parameter of interest? What would be a point estimate of this parameter of interest?

The parameter of interest is p1 - p2, the difference between the proportions of all babies born with low birth weight to non-smoking mothers (p1) and to smoking mothers (p2). The point estimate of p1 - p2 is the difference in sample proportions, p1_hat - p2_hat.

Question 2: Using the data, compute the following:

  1. The sample proportion of babies born with low birth weight among non-smoking women (p1_hat).

    p1_hat = 92/873 = 0.105

  2. The sample proportion of babies born with low birth weight among smoking women (p2_hat).

    p2_hat = 18/126 = 0.143

  3. The point estimate for p1 - p2, the difference in population proportions of babies born with low birth weight between non-smoking and smoking women.

    point estimate for p1 - p2 = 0.105 - 0.143 = -0.038

  4. The z* needed for a 90% confidence interval is 1.64 (1.645 to three decimals).

![](images/Screen%20Shot%202021-03-17%20at%209.31.30%20PM.png){width="184"}
## 
##     low not low 
##     111     889
## 
## nonsmoker    smoker 
##       873       126
##          
##           nonsmoker smoker
##   low            92     18
##   not low       781    108
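The counts in the table above are enough to reproduce the Question 2 values in R. This is a minimal sketch that types the counts in by hand instead of reading them from `ncbirths`; the object names (`low`, `n`, `p_hat`, `point_est`, `z_star`) are illustrative assumptions, not names from the original analysis.

```r
# Counts copied from the two-way table (assumed to be entered by hand)
low <- c(nonsmoker = 92, smoker = 18)    # low-birth-weight babies per group
n   <- c(nonsmoker = 873, smoker = 126)  # group sizes

# Sample proportions of low birth weight in each group
p_hat <- low / n

# Point estimate of p1 - p2 (non-smokers minus smokers)
point_est <- unname(p_hat["nonsmoker"] - p_hat["smoker"])

# Critical value for a 90% confidence interval (5% in each tail)
z_star <- qnorm(0.95)

round(p_hat, 3)
round(point_est, 3)
round(z_star, 3)
```

Working from unrounded proportions gives a difference of about -0.037; the -0.038 in the hand calculation comes from subtracting the already-rounded values 0.105 and 0.143.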

Question 3: Check the assumptions for the sampling distribution of p1_hat - p2_hat to be normal. In other words, check the conditions necessary to construct a confidence interval for p1 - p2. Recall, these conditions are (1) independence within groups, (2) independence between groups, and (3) the success-failure condition in BOTH groups.

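p1_hat - p2_hat = 0.105 - 0.143 = -0.038

The data are a random sample, so observations are independent both within and between groups.

Success-failure condition (for a confidence interval, checked with the sample proportions): n * p_hat >= 10 and n * (1 - p_hat) >= 10 in each group.

Non-smoking women: 873 * 0.105 = 92 >= 10 and 873 * (1 - 0.105) = 781 >= 10

Smoking women: 126 * 0.143 = 18 >= 10 and 126 * (1 - 0.143) = 108 >= 10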
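The success-failure check can also be done for both groups at once. The sketch below assumes the same hand-entered counts as the table; `counts` is a hypothetical helper object, not part of the original analysis.

```r
# Counts copied from the two-way table (assumed to be entered by hand)
low <- c(nonsmoker = 92, smoker = 18)
n   <- c(nonsmoker = 873, smoker = 126)
p_hat <- low / n

# Observed successes (low) and failures (not low) in each group
counts <- rbind(success = n * p_hat, failure = n * (1 - p_hat))
counts

# TRUE only if every one of the four cells is at least 10
all(counts >= 10)
```

For a confidence interval the check uses each group's own sample proportion, so the four cells are simply the observed counts in the table.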

Question 4: Calculate the standard error for the sampling distribution of p1_hat - p2_hat. Then, compute the 90% confidence interval for p1 - p2.

SE = sqrt(0.105 * (1 - 0.105) / 873 + 0.143 * (1 - 0.143) / 126) = 0.0329

CI = -0.038 ± 1.64 * 0.0329 = (-0.092, 0.016)

We are 90% confident that the true difference in the proportions of babies born with low birth weight between non-smoking and smoking women (p1 - p2) is between -0.092 and 0.016.

Compare to the confidence interval constructed from the normal approximation: (-0.092, 0.016)

-0.038 + 1.64 * sqrt(0.001079)
## [1] 0.01587094
-0.038 - 1.64 * sqrt(0.001079)
## [1] -0.09187094
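The same interval can be computed from the raw counts without rounding intermediate values. This is a sketch under the assumption that the counts are typed in from the table; `se`, `z_star`, and `ci` are illustrative names.

```r
# Counts copied from the two-way table (assumed to be entered by hand)
low <- c(92, 18)    # low-birth-weight babies: non-smokers, smokers
n   <- c(873, 126)  # group sizes: non-smokers, smokers

p_hat     <- low / n
point_est <- p_hat[1] - p_hat[2]

# Unpooled standard error of p1_hat - p2_hat
se <- sqrt(sum(p_hat * (1 - p_hat) / n))

# 90% confidence interval: point estimate +/- z* x SE
z_star <- qnorm(0.95)
ci <- point_est + c(-1, 1) * z_star * se

round(se, 4)
round(ci, 3)
```

Because nothing is rounded along the way, the endpoints come out at roughly -0.092 and 0.017, a small shift from the hand calculation that does not change the interpretation: the interval still contains 0.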

Question 5: Interpret the confidence interval you computed in Question 4 given the context of the data.

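The interval (-0.092, 0.016) is for p1 - p2, the proportion of low-birth-weight babies among non-smoking mothers minus the proportion among smoking mothers. We are 90% confident that the low-birth-weight rate among non-smoking mothers is somewhere between 9.2 percentage points lower and 1.6 percentage points higher than the rate among smoking mothers.

Because the interval contains 0, the data are consistent with no difference between the two groups: at the 90% confidence level we do not have evidence that the proportion of low-birth-weight babies differs between non-smoking and smoking mothers. Equivalently, the observed difference is (-0.038 - 0) / 0.0329 = -1.16 standard errors from 0, which is inside ±1.64.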

Question 6: State the null and alternative hypotheses, if we are interested in comparing the proportion of babies born with low birth weight between non-smoking and smoking mothers. a. State the hypotheses in words and with statistical notation. b. Why is the null rather than the alternative hypothesis a statement of equality?

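In words:

H0: The proportion of babies born with low birth weight is the same for non-smoking and smoking mothers.

H1: The proportion of babies born with low birth weight differs between non-smoking and smoking mothers.

In notation:

H0: p1 - p2 = 0, which is the same as p1 = p2

H1: p1 - p2 != 0, which is the same as p1 != p2

The hypotheses are statements about the population proportions, not the sample values. The observed difference p1_hat - p2_hat = 0.105 - 0.143 = -0.038 is the evidence we weigh against H0.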

Then we check whether the expected numbers of successes and failures, if the null hypothesis is true, are at least 10 in each group. Under H0 both groups share one proportion, so this check uses the pooled proportion and is carried out in Question 8.

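The null hypothesis is a statement of equality because we need a single, specific null value (p1 - p2 = 0) to build the sampling distribution of p1_hat - p2_hat and to compute the test statistic and p-value. The alternative covers a whole range of values, so it cannot serve that purpose.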

Question 7: Compute the pooled proportion of babies born with low birth weight between non-smoking and smoking mothers. Explain why we use a pooled proportion.

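P_hat pooled = (# of successes in Group 1 + # of successes in Group 2) / (n1 + n2)

P_hat pooled = (92 + 18) / (873 + 126) = 110 / 999 = 0.1101

We pool because the null hypothesis states that p1 = p2. Under H0 both groups share a single proportion, and the best estimate of that common proportion uses all of the data from both groups.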
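In R the pooled proportion is one line, but the grouping of the denominator matters: `/` binds tighter than `+`, so the sum of the group sizes must be parenthesized or computed with `sum()`. A minimal sketch, assuming hand-entered counts and the illustrative name `p_pool`:

```r
# Counts copied from the two-way table (assumed to be entered by hand)
low <- c(92, 18)    # low-birth-weight babies: non-smokers, smokers
n   <- c(873, 126)  # group sizes: non-smokers, smokers

# Pooled proportion: all successes over all observations
p_pool <- sum(low) / sum(n)
round(p_pool, 4)

# Without parentheses around the denominator the result is not a proportion
not_a_proportion <- (92 + 18) / 873 + 126
not_a_proportion > 1
```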

Question 8: Using the pooled proportion computed in Question 7, check the conditions necessary to use the normal distribution to perform a hypothesis test. Show all your work.

Independence: the data are a random sample, so observations are independent within and between groups.

P_hat pooled * n non-smoking women = 0.11 * 873 = 96 >= 10

(1 - P_hat pooled) * n non-smoking women = (1 - 0.11) * 873 = 777 >= 10

P_hat pooled * n smoking women = 0.11 * 126 = 14 >= 10

(1 - P_hat pooled) * n smoking women = (1 - 0.11) * 126 = 112 >= 10
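The four expected counts under the null hypothesis can be laid out as a small table. This sketch assumes the hand-entered group sizes and uses the hypothetical name `expected`.

```r
# Group sizes and pooled proportion (counts assumed to be entered by hand)
n      <- c(nonsmoker = 873, smoker = 126)
p_pool <- (92 + 18) / sum(n)

# Expected successes and failures in each group if H0 (p1 = p2) is true
expected <- rbind(success = n * p_pool, failure = n * (1 - p_pool))
round(expected, 1)

# TRUE only if all four expected counts are at least 10
all(expected >= 10)
```

The smallest cell is the expected number of low-birth-weight babies among smokers, which is just under 14, so the condition holds with some margin.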

Question 9: a. Compute the standard error using the pooled proportion computed in Question 7. b. Calculate your Z-statistic/test statistic. c. Compute the associated p-value. d. Report your conclusion from the hypothesis test based on the given significance level above and include the confidence interval and p-value. State your conclusion in the context of the data. e. Define what the p-value means in context.

P_hat pooled = (# of successes in Group 1 + # of successes in Group 2) / (n1 + n2)

P_hat pooled = (92 + 18) / (873 + 126) = 110 / 999 = 0.1101

Significance level: alpha = 0.05

Then check independence: the data are a random sample, so we can assume observations within and between groups are independent. Success-failure condition with the pooled proportion:

P_hat pooled * n non-smoking women = 0.11 * 873 = 96 >= 10

(1 - P_hat pooled) * n non-smoking women = (1 - 0.11) * 873 = 777 >= 10

P_hat pooled * n smoking women = 0.11 * 126 = 14 >= 10

(1 - P_hat pooled) * n smoking women = (1 - 0.11) * 126 = 112 >= 10

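SE = sqrt(0.11 * (1 - 0.11) / 873 + 0.11 * (1 - 0.11) / 126) = 0.0298

Z = (point estimate - null value) / SE

Z = (-0.038 - 0) / 0.0298 = -1.275

p_value = 0.202

Our p-value of 0.202 is greater than alpha = 0.05, so we fail to reject the null hypothesis: the data do not provide convincing evidence that the proportion of babies born with low birth weight differs between non-smoking and smoking mothers. This agrees with the 90% confidence interval from Question 4, (-0.092, 0.016), which contains 0. In context, the p-value means that if there truly were no difference between the two groups, there would be about a 20% chance of observing a difference in sample proportions at least as extreme as -0.038.

p_value <- 2*pnorm(-1.275, mean = 0, sd = 1)
round(p_value, 3)
## [1] 0.202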
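The whole test can be run from the raw counts and cross-checked against base R's `prop.test()`. This is a sketch, not the original analysis: the counts are typed in by hand, the names `se_pool`, `z`, and `check` are illustrative, and `correct = FALSE` switches off the continuity correction so that the chi-squared statistic equals the square of the pooled z-statistic.

```r
# Counts copied from the two-way table (assumed to be entered by hand)
low <- c(92, 18)    # low-birth-weight babies: non-smokers, smokers
n   <- c(873, 126)  # group sizes: non-smokers, smokers

p_hat  <- low / n
p_pool <- sum(low) / sum(n)

# Pooled standard error, z-statistic for H0: p1 - p2 = 0, two-sided p-value
se_pool <- sqrt(p_pool * (1 - p_pool) * sum(1 / n))
z       <- (p_hat[1] - p_hat[2] - 0) / se_pool
p_value <- 2 * pnorm(-abs(z))
round(c(se = se_pool, z = z, p_value = p_value), 3)

# Cross-check: without continuity correction, X-squared = z^2
check <- prop.test(x = low, n = n, correct = FALSE)
all.equal(unname(check$statistic), z^2)
all.equal(unname(check$p.value), p_value)
```

With unrounded inputs the statistic is about -1.26 and the p-value about 0.21, slightly different from the hand calculation, which works from rounded values. Either way the p-value is well above 0.05 and the conclusion is the same.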

Question 10: Provide an appropriate visualization for your data. (Look at the Week 2 slides). EXTRA CREDIT (2 points): Use the ggplot2() or plot_ly R packages to create visualizations. You will need to look up how to do this (you may refer to the R demo posted in the Week 3 module).

barplot(lowbirthweight1, legend.text = levels(ncbirths$lowbirthweight))
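A bar chart of proportions within each smoking group makes the comparison easier to read than raw counts, because the two groups differ so much in size. The sketch below rebuilds the two-way table by hand from the printed counts; the object names `tab` and `props`, the dimension label `habit`, and the axis labels are assumptions made for illustration.

```r
# Two-way table rebuilt from the printed counts (assumed to be entered by hand)
tab <- matrix(c(92, 781, 18, 108), nrow = 2,
              dimnames = list(lowbirthweight = c("low", "not low"),
                              habit = c("nonsmoker", "smoker")))

# Column proportions: share of low / not low births within each group
props <- prop.table(tab, margin = 2)

# Stacked bars, one per group, each summing to 1
barplot(props,
        legend.text = rownames(props),
        xlab = "Mother's smoking habit",
        ylab = "Proportion of births",
        main = "Low birth weight by smoking habit")
```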

Question 11: Exercise 6.19 in the OpenIntro 4th edition textbook (page 225).

A study asked 1,924 male and 3,666 female undergraduate college students their favorite color. A 95% confidence interval for the difference between the proportions of males and females whose favorite color is black (pmale − pfemale) was calculated to be (0.02, 0.06). Based on this information, determine if the following statements are true or false, and explain your reasoning for each statement you identify as false.

  1. We are 95% confident that the true proportion of males whose favorite color is black is 2% lower to 6% higher than the true proportion of females whose favorite color is black.

    False. The whole interval (0.02, 0.06) is above 0, so we are 95% confident that the true proportion of males whose favorite color is black is 2% to 6% higher than that of females, not 2% lower to 6% higher.

  2. We are 95% confident that the true proportion of males whose favorite color is black is 2% to 6% higher than the true proportion of females whose favorite color is black.

    True. The 95% confidence interval for pmale − pfemale is (0.02, 0.06), and both endpoints are positive, so the male proportion is estimated to be 2% to 6% higher.

  3. 95% of random samples will produce 95% confidence intervals that include the true difference between the population proportions of males and females whose favorite color is black.

    True. Across many random samples, about 95% of the 95% confidence intervals computed this way will capture the true difference. The interval is the random quantity, because it depends on the sample statistic; the true difference is fixed.

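  4. We can conclude that there is a significant difference between the proportions of males and females whose favorite color is black and that the difference between the two sample proportions is too large to plausibly be due to chance.

    True. The interval (0.02, 0.06) does not contain 0, so a true difference of 0 is not plausible at the 5% significance level, and the observed difference is too large to attribute to chance alone.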

  5. The 95% confidence interval for (pfemale − pmale) cannot be calculated with only the information given in this exercise.

    False. The interval for pfemale − pmale is obtained by negating and swapping the endpoints of the interval for pmale − pfemale, which gives (-0.06, -0.02).
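Reversing the order of subtraction in a difference flips the sign of both endpoints and swaps which one is the lower bound. A small sketch of that arithmetic, using the interval given in the exercise and hypothetical object names:

```r
# 95% CI for pmale - pfemale, as given in the exercise
ci_male_minus_female <- c(lower = 0.02, upper = 0.06)

# CI for pfemale - pmale: negate both endpoints, then reverse their order
ci_female_minus_male <- setNames(rev(-ci_male_minus_female),
                                 c("lower", "upper"))
ci_female_minus_male
```

No resampling or extra data is needed, since the second interval describes the same difference viewed from the other direction.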