ACS <- read.csv("~/UST Data Science/Spring 2021/SEIS 631/5/ACS.csv")
summary(ACS)
## Sex Age Married Income
## Min. :0.000 Min. : 0.00 Min. :0.000 Min. : 0.0
## 1st Qu.:0.000 1st Qu.:20.00 1st Qu.:0.000 1st Qu.: 0.0
## Median :0.000 Median :41.00 Median :0.000 Median : 3.8
## Mean :0.463 Mean :40.07 Mean :0.437 Mean : 22.4
## 3rd Qu.:1.000 3rd Qu.:58.00 3rd Qu.:1.000 3rd Qu.: 31.2
## Max. :1.000 Max. :94.00 Max. :1.000 Max. :563.0
## NA's :175
## HoursWk Race USCitizen HealthInsurance
## Min. : 1.00 Length:1000 Min. :0.000 Min. :0.000
## 1st Qu.:30.00 Class :character 1st Qu.:1.000 1st Qu.:1.000
## Median :40.00 Mode :character Median :1.000 Median :1.000
## Mean :37.11 Mean :0.939 Mean :0.861
## 3rd Qu.:40.00 3rd Qu.:1.000 3rd Qu.:1.000
## Max. :99.00 Max. :1.000 Max. :1.000
## NA's :495
## Language
## Min. :0.000
## 1st Qu.:0.000
## Median :0.000
## Mean :0.194
## 3rd Qu.:0.000
## Max. :1.000
##
table(ACS$HealthInsurance)
##
## 0 1
## 139 861
Question 1:
a. What are the cases in the dataset? b. What is the sample proportion of US residents that have health insurance?
Answer: a. Each case is one US resident included in the ACS sample (n = 1000). b. The sample proportion of US residents who have health insurance is 861/1000 = 0.861, or about 86% of the residents in the study.
h <- length(which(ACS$HealthInsurance==1))
h
## [1] 861
n <- length(na.omit(ACS$HealthInsurance))
n
## [1] 1000
p.hat <- h/n
p.hat
## [1] 0.861
Question 2: a. What type of estimate is the one you found in question 1: a point estimate or an interval estimate? b. Which do you think is a better estimate to report, a point estimate or an interval estimate? Explain your reasoning!
Answer: The estimate in question 1 is a point estimate. For what we are trying to find (the sample proportion of residents who have health insurance), I think a point estimate is better to report because it gives a single approximate value for the population parameter, whereas an interval estimate gives a range. A range is nice to have, but a single approximation is more useful for this particular question.
Question 3: Suppose we want to construct a confidence interval. Are the conditions met to assume the sampling distribution of sample proportions is approximately normal (i.e., the CLT is valid)? Explain.
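Answer: Yes. For a sample proportion, the sampling distribution is approximately normal when the observations are independent and the sample contains enough successes and failures (a common rule of thumb is at least 10 of each). Here the sample has 861 residents with health insurance and 139 without, both well above that threshold, and the 1000 cases are assumed to be a random sample of US residents. The CLT is therefore valid, and the sampling distribution of the sample proportion is centered at the population proportion.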
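The condition check above can be sketched in R. This is a minimal sketch: the counts are copied from the `table(ACS$HealthInsurance)` output, and the cutoff of 10 is an assumed rule of thumb (the Week 5 slides may use a different value).

```r
# Counts taken from table(ACS$HealthInsurance) above
successes <- 861   # residents with health insurance
failures  <- 139   # residents without health insurance
threshold <- 10    # assumed rule-of-thumb cutoff, not taken from the slides

# The normal approximation is reasonable when both counts clear the cutoff
clt_ok <- successes >= threshold && failures >= threshold
clt_ok
```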
Using the normal distribution:
Question 4: What is the value of the estimated standard error? Use the formula from the Week 5 slides and estimate the standard error using the normal distribution.
Answer: SE = sqrt(p.hat (1-p.hat) / n) = sqrt(0.861 (1-0.861) / 1000) = sqrt(0.000119679) = 0.0109. SE = 0.011
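The standard error of a sample proportion can be computed directly in R. This is a small sketch using the values from Question 1; the variable names `p_hat` and `se_normal` are illustrative and not part of the original script.

```r
p_hat <- 0.861   # sample proportion from Question 1
n <- 1000        # sample size

# Standard error of a sample proportion: sqrt(p_hat * (1 - p_hat) / n)
se_normal <- sqrt(p_hat * (1 - p_hat) / n)
round(se_normal, 4)
```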
Question 5: a. Find a confidence interval for the true proportion of US residents who have health insurance based on a confidence level that you choose. b. Explain why you chose the confidence interval that you did. Use qnorm() to find the z needed.
c. Interpret this confidence interval.
Answer: I chose method 2 of finding the confidence interval because I wanted to know the lower and upper bounds of the interval. At a 95% confidence level, z = qnorm(0.975) = 1.96, so the interval is 0.861 ± 1.96 × sqrt(0.861 × 0.139 / 1000). We are 95% confident that the true proportion of US residents who have health insurance is between 84% and 88%.
boot.phats <- c()
for (i in 1:10000){
boot.samp <- sample(ACS$HealthInsurance, n, replace = TRUE)
boot.h <- length(which(boot.samp == 1))
boot.phat <- boot.h/n
boot.phats <- c(boot.phats, boot.phat)
}
CI.lb <- (sort(boot.phats) [250])
CI.ub <- (sort(boot.phats) [9750])
#to find z for a 95% confidence level (2.5% in each tail)
qnorm(0.975, mean=0, sd=1)
## [1] 1.959964
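The normal-based interval for Question 5 can be sketched as follows. The 95% level and the names `conf_level`, `z_star`, and `ci_normal` are assumptions made for illustration; any other confidence level can be substituted.

```r
p_hat <- 0.861
n <- 1000
conf_level <- 0.95   # assumed confidence level

# Critical value: leaves (1 - conf_level) / 2 in each tail of the standard normal
z_star <- qnorm(1 - (1 - conf_level) / 2)

# Interval: p_hat plus or minus z_star times the standard error
se <- sqrt(p_hat * (1 - p_hat) / n)
ci_normal <- p_hat + c(-1, 1) * z_star * se
round(ci_normal, 3)
```

The bounds come out at roughly 0.84 and 0.88.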
Using bootstrap simulations:
Question 6: What is the value of the estimated standard error? Use bootstrap simulations like in HW 4 to find the standard error.
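Answer: The bootstrap SE is the standard deviation of the 10,000 bootstrap sample proportions, sd(boot.phats), computed below. It should come out near 0.011, in line with the formula-based estimate sqrt(0.861 × 0.139 / 1000).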
#calculation of sample distribution (bootstrap)
boot.samp <- sample(ACS$HealthInsurance, size = n, replace = TRUE)
table(boot.samp)
## boot.samp
## 0 1
## 169 831
boot.phats <- c()
for (i in 1:10000){
boot.samp <- sample(ACS$HealthInsurance, n, replace = TRUE)
boot.h <- length(which(boot.samp == 1))
boot.phat <- boot.h/n
boot.phats <- c(boot.phats, boot.phat)
}
hist(boot.phats)
mean(boot.phats)
## [1] 0.8612062
SE <- sd(boot.phats)
Question 7: Find a confidence interval for the true proportion of US residents who have health insurance based on a confidence level that you choose and the standard error you calculated in question 6.
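Answer: n = 1000, p.hat = 0.861, CLT = valid. At a 95% confidence level, z = qnorm(0.975) = 1.96, and the bootstrap SE from question 6 is sd(boot.phats), approximately 0.011. CI = p.hat ± z × SE = 0.861 ± 1.96 × 0.011 = 0.861 ± 0.022. Final answer: approximately (0.84, 0.88)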
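The bootstrap standard error and the resulting interval can be sketched without the CSV file. This is an illustration under stated assumptions: the vector `insured` is a stand-in for `ACS$HealthInsurance` rebuilt from the table counts, the seed is arbitrary, and `replicate()` is used in place of the loop above.

```r
set.seed(631)  # arbitrary seed, only so the sketch is reproducible

# Stand-in for ACS$HealthInsurance: 861 insured (1) and 139 uninsured (0)
insured <- rep(c(1, 0), times = c(861, 139))
n <- length(insured)

# Each replicate resamples n cases with replacement and records the proportion insured
boot_phats <- replicate(10000, mean(sample(insured, n, replace = TRUE)))

# Bootstrap SE: standard deviation of the bootstrap proportions
se_boot <- sd(boot_phats)

# 95% interval built from the bootstrap SE
ci_boot_se <- mean(insured) + c(-1, 1) * qnorm(0.975) * se_boot

# Percentile interval, shown for comparison with the sorted-index approach
ci_percentile <- quantile(boot_phats, c(0.025, 0.975))
```

The exact digits depend on the seed, but both intervals should land close to (0.84, 0.88).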
Question 8: Suppose we’d like to test if the true proportion of US residents who have health insurance is 80% vs. the true proportion of US residents who have health insurance is NOT 80%. What would be the hypotheses for this test? Please write your hypotheses in non-technical language AND using notation. Specify which hypothesis is which (null or alternative).
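Answer: Null hypothesis (H0): the true proportion of US residents who have health insurance is 80%; H0: p = 0.80. Alternative hypothesis (HA): the true proportion of US residents who have health insurance is not 80%; HA: p ≠ 0.80.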
Question 9: Conduct a hypothesis test for the hypotheses specified in Question 8 using the confidence interval calculated in Question 5. State your conclusions in layman’s terms and in the context of this question. Hint: look at the Week 5 part 2 slides.
Answer: The 95% confidence interval from Question 5 (84% to 88%) does not contain 0.80, so we reject the null hypothesis. Our data provide evidence that the true proportion of US residents who have health insurance is not equal to 0.80; the data suggest that between 84% and 88% of US residents have health insurance.
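The confidence-interval test can be sketched in R as a check of whether the null value falls outside the interval. The interval bounds are the rounded values reported in Question 5. The z-test at the end is an optional cross-check added here for illustration and is not part of the original answer.

```r
p0 <- 0.80            # null value from Question 8
ci <- c(0.84, 0.88)   # 95% interval reported in Question 5 (rounded)

# Reject H0 at the 5% level when p0 lies outside the 95% interval
reject_null <- p0 < ci[1] || p0 > ci[2]
reject_null

# Optional cross-check: two-sided one-proportion z-test using the null SE
p_hat <- 0.861
n <- 1000
z_stat <- (p_hat - p0) / sqrt(p0 * (1 - p0) / n)
p_value <- 2 * pnorm(-abs(z_stat))
```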