Homework5

ACS <- read.csv("~/Downloads/ACS.csv")
View(ACS)

Question 1:

a. What are the cases in the dataset?

Each case is an individual US resident in the American Community Survey (ACS) sample. The dataset has 1000 cases.

b. What is the sample proportion of US residents that have health insurance?

The sample proportion is 0.861: 861 of the 1000 US residents in the sample have health insurance.

table(ACS$HealthInsurance)

## 
##   0   1 
## 139 861

prop.table(table(ACS$HealthInsurance))

## 
##     0     1 
## 0.139 0.861

k<-length(which(ACS$HealthInsurance== 1))
k

## [1] 861

n<-length(na.omit(ACS$HealthInsurance))
n

## [1] 1000

p.hat<-k/n
p.hat

## [1] 0.861

Question 2: a. What type of estimate is the one you found in question 1: a point estimate or an interval estimate? The estimate in Question 1 is a point estimate.

b. Which do you think is a better estimate to report, a point estimate or an interval estimate? Explain your reasoning! I think it is better to report an interval estimate, because it gives a range of plausible values for the true population proportion, and the true population proportion is unlikely to be exactly equal to the point estimate from our sample. A point estimate is a single-value estimate of the population parameter of interest, and we use point estimates to construct confidence intervals for unknown parameters.

Question 3: Suppose we want to construct a confidence interval. Are the conditions met to assume the sampling distribution of sample proportions is approximately normal (i.e., the CLT is valid)? Explain. (CLT = Central Limit Theorem)

Independence: the data are a random sample of about 1% of all US residents, which is less than 10% of the population, so we assume the observations are independent.

Success-failure condition: n × p.hat = 1000 × 0.861 = 861 ≥ 10 and n × (1 − p.hat) = 1000 × 0.139 = 139 ≥ 10. Both the success-failure condition and independence are met, so we can assume the sampling distribution of sample proportions is approximately normal.

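n  # sample size from Question 1 (k stays the number of successes, 861)

## [1] 1000

n * p.hat  # expected number of successes

## [1] 861

n * (1 - p.hat)  # expected number of failures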

## [1] 139
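The success-failure check above can be wrapped in a small function so it is easy to repeat for other samples. This is only a sketch: the helper name `success_failure_ok` and the `threshold` argument are our own and do not come from the course slides. The counts 861 and 1000 are from the `table()` output in Question 1, and the second call uses made-up counts to show a case that fails.

```r
# Hypothetical helper (not from the slides): TRUE when both the number of
# successes and the number of failures reach the threshold (10 by default).
success_failure_ok <- function(successes, n, threshold = 10) {
  failures <- n - successes
  successes >= threshold && failures >= threshold
}

success_failure_ok(861, 1000)  # our sample: 861 successes, 139 failures
success_failure_ok(995, 1000)  # made-up counts: only 5 failures
```

The first call passes and the second does not, because 5 failures is below the threshold of 10.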

Question 4: What is the value of the estimated standard error? Use the formula from the Week 5 slides and estimate the standard error using the normal distribution.

The estimated standard error is 0.01093979, or about 0.01094.

n  # sample size from Question 1

## [1] 1000

SE <- sqrt(p.hat * (1 - p.hat) / n)  # normal-based SE of a sample proportion
SE

## [1] 0.01093979

Question 5: a. Find a confidence interval for the true proportion of US residents who have health insurance based on a confidence level that you choose. b. Explain why you chose the confidence interval that you did. Use qnorm() to find the z needed. c. Interpret this confidence interval.

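The 95% confidence interval for the true proportion of US residents who have health insurance is (0.8395584, 0.8824416). We chose the 95% level because it is the conventional choice and balances confidence against the width of the interval. We are 95% confident that the true proportion of US residents who have health insurance is between 0.8395584 and 0.8824416.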

z <- qnorm(1 - 0.05 /2)
z

## [1] 1.959964

p.hat - z *SE

## [1] 0.8395584

p.hat + z *SE

## [1] 0.8824416
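To see how the choice of confidence level changes the interval, the same calculation can be put in a function. This is a sketch under our own assumptions: the name `wald_ci` is hypothetical, and the counts are typed in from the Question 1 output so the block runs without the CSV file.

```r
# Counts taken from the table() output in Question 1.
k <- 861
n <- 1000
p.hat <- k / n
SE <- sqrt(p.hat * (1 - p.hat) / n)

# Hypothetical helper: normal-based interval for a chosen confidence level.
wald_ci <- function(p.hat, SE, conf.level = 0.95) {
  alpha <- 1 - conf.level
  z <- qnorm(1 - alpha / 2)  # critical value, as in the qnorm() call above
  c(lower = p.hat - z * SE, upper = p.hat + z * SE)
}

ci90 <- wald_ci(p.hat, SE, 0.90)
ci95 <- wald_ci(p.hat, SE, 0.95)
ci99 <- wald_ci(p.hat, SE, 0.99)
rbind(ci90, ci95, ci99)
```

A higher confidence level gives a wider interval, which is the trade-off behind choosing 95%.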

Question 6: What is the value of the estimated standard error? Use bootstrap simulations like in HW 4 to find the standard error.

The bootstrap estimate of the standard error is 0.01111157, or about 0.0111. This is close to the normal-based value of 0.01094 from Question 4.

boot.phats <- numeric(10000)  # preallocate the vector of bootstrap statistics
for (i in 1:10000) {  # each iteration draws one bootstrap sample
  # Resample n observations from the original sample, with replacement
  boot.samp <- sample(ACS$HealthInsurance, n, replace = TRUE)
  # Bootstrap statistic: the analogue of the sample proportion p.hat
  boot.k <- sum(boot.samp == 1, na.rm = TRUE)  # number of "successes" in the resample
  boot.phats[i] <- boot.k / n  # store the bootstrap proportion
}
hist(boot.phats)

mean(boot.phats)  # should be close to the point estimate p.hat; recall we use this quantity as an estimate of the population proportion

## [1] 0.8607369

SE <-sd(boot.phats)
SE

## [1] 0.01111157

#We estimate the SE by computing the standard deviation of our bootstrap distribution.
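As a cross-check, the bootstrap can be rerun on data rebuilt from the counts in Question 1 and compared with the formula from Question 4. This is a sketch, not the HW 4 method: it uses `replicate()` instead of the loop, the seed is arbitrary, and the vector `x` is reconstructed from the `table()` output rather than read from the file.

```r
set.seed(1)  # arbitrary seed, only so the run is reproducible

# Rebuild the 0/1 sample from the counts: 861 ones and 139 zeros.
x <- rep(c(1, 0), times = c(861, 139))
n <- length(x)

# Each replicate resamples n values with replacement and records the proportion of 1s.
boot.phats <- replicate(10000, mean(sample(x, n, replace = TRUE)))

boot.SE <- sd(boot.phats)                          # bootstrap standard error
formula.SE <- sqrt(mean(x) * (1 - mean(x)) / n)    # normal-based standard error
c(bootstrap = boot.SE, formula = formula.SE)
```

The two values should agree to about three decimal places, and the exact bootstrap value changes with the seed.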

Question 7: Find a confidence interval for the true proportion of US residents who have health insurance based on a confidence level that you choose and the standard error you calculated in question 6. Your confidence interval should be very similar to question 5.

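Using the bootstrap standard error from Question 6, the 95% confidence interval for the true proportion of US residents who have health insurance is (0.8387769, 0.8832231). We are 95% confident that the true proportion of US residents who have health insurance is between 0.8387769 and 0.8832231. This is very similar to the interval from Question 5, (0.8395584, 0.8824416). For comparison, prop.test() gives (0.8376433, 0.8815296).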

prop.test(861, 1000, conf.level=0.95)  # 861 successes out of 1000 cases

## 
##  1-sample proportions test with continuity correction
## 
## data:  861 out of 1000, null probability 0.5
## X-squared = 519.84, df = 1, p-value < 2.2e-16
## alternative hypothesis: true p is not equal to 0.5
## 95 percent confidence interval:
##  0.8376433 0.8815296
## sample estimates:
##     p 
## 0.861

CI <- p.hat + c(-1, 1) * 2 * SE  # SE is the bootstrap SE; 2 approximates the 95% critical value z = 1.96
CI

## [1] 0.8387769 0.8832231

Question 8: Suppose we’d like to test if the true proportion of US residents who have health insurance is 80% vs. the true proportion of US residents who have health insurance is NOT 80%. What would be the hypotheses for this test? Please write your hypotheses in non-technical language AND using notation. Specify which hypothesis is which (null or alternative).

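H0 (null hypothesis): p = 0.80. The true proportion of US residents who have health insurance is 80%. HA (alternative hypothesis): p != 0.80. The true proportion of US residents who have health insurance is not 80%. From our sample, p.hat = 0.861 and n = 1000. The test is carried out assuming the null hypothesis is true, that is, p = 0.80. We have a random sample of about 1% of all US residents, so independence is satisfied.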

The p-value computed below is very small (about 1.4e-06). If 80% of US residents had health insurance, there would be only a very small chance of getting a sample proportion as far from 0.80 as our 0.861. It is therefore unlikely that the true proportion of US residents who have health insurance is 0.80.

p <- 0.80

n*p

## [1] 800

n*(1-p)  # success - failure condition

## [1] 200

SE <-  sqrt(p * (1 - p)/ n)   # standard error
SE

## [1] 0.01264911

z <- (p.hat - p) / SE  # test statistic
z

## [1] 4.822473

2 * pnorm(z, lower.tail = FALSE)  # two-sided p-value (valid as written because z > 0)

## [1] 1.417889e-06

prop.test(861, 1000, p = .80 )

## 
##  1-sample proportions test with continuity correction
## 
## data:  861 out of 1000, null probability 0.8
## X-squared = 22.877, df = 1, p-value = 1.727e-06
## alternative hypothesis: true p is not equal to 0.8
## 95 percent confidence interval:
##  0.8376433 0.8815296
## sample estimates:
##     p 
## 0.861
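The steps of the z-test above can be collected into one function. This is a sketch with a hypothetical name, `prop_z_test`. It uses `abs(z)` so the two-sided p-value is also right when the sample proportion falls below the null value. The last call compares it with `prop.test()` with the continuity correction turned off (`correct = FALSE`), which is the version whose X-squared statistic equals z squared.

```r
# Hypothetical helper: two-sided one-proportion z-test.
prop_z_test <- function(k, n, p0) {
  p.hat <- k / n
  SE0 <- sqrt(p0 * (1 - p0) / n)  # standard error computed under the null
  z <- (p.hat - p0) / SE0         # test statistic
  p.value <- 2 * pnorm(abs(z), lower.tail = FALSE)
  list(z = z, p.value = p.value)
}

res <- prop_z_test(861, 1000, 0.80)
res$z
res$p.value

# Same test through prop.test(), without the continuity correction.
pt <- prop.test(861, 1000, p = 0.80, correct = FALSE)
c(z.squared = res$z^2, X.squared = unname(pt$statistic))
```

The p-value in the `prop.test()` output above (1.727e-06) is slightly larger than ours because that call applies the continuity correction by default.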

Question 9: Conduct a hypothesis test for the hypotheses specified in Question 8 using the confidence interval calculated in Question 5. State your conclusions in layman’s terms and in the context of this question. Hint: look at the Week 5 part 2 slides.

Our 95% confidence interval for p is (0.8395584,0.8824416).

0.80 does not fall in (0.8395584, 0.8824416), so our data provide evidence that the true proportion of US residents who have health insurance is not equal to 0.80. Our data suggest that between 84% and 88% of US residents have health insurance.
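The decision rule used here can be written as a short check. This is only a sketch: the interval endpoints are typed in from Question 5 and the variable names are our own. The agreement with the z-test in Question 8 is approximate rather than exact, because the interval uses a standard error based on p.hat while the test uses one based on the null value 0.80.

```r
ci <- c(0.8395584, 0.8824416)  # 95% confidence interval from Question 5
p0 <- 0.80                     # null value from Question 8

# Reject H0 at the 5% level when the null value lies outside the 95% interval.
reject <- p0 < ci[1] || p0 > ci[2]
reject
```

Here 0.80 is below the lower endpoint, so the check says to reject H0, matching the conclusion above.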