R Week 5

Question 1:
a. What are the cases in the dataset?

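Individual US residents (each case is one person in the sample, which is why the later questions ask about the proportion of US residents)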

b. What is the sample proportion of US residents that have health insurance?

0.861

##Calculate the point estimate

##The length() function counts the number of values in the object, and the which() function finds the
##positions of the values that satisfy the condition.

k<-length(which(ACS$HealthInsurance == 1))

##The na.omit() function removes the missing values from the variable, so you are left with
##only the non-missing values.
n<-length(na.omit(ACS$HealthInsurance))

k

## [1] 861

n

## [1] 1000

##calculate the sample proportion
p.hat<-k/n
p.hat

## [1] 0.861
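
The same point estimate can be obtained more compactly by averaging a logical vector. The sketch below is only an illustration: `insured` is a made-up stand-in for ACS$HealthInsurance, built to match the counts above (861 ones, 139 zeros), because the ACS data is not reproduced here.

```r
# Hypothetical stand-in for ACS$HealthInsurance: 861 insured (1), 139 uninsured (0)
insured <- c(rep(1, 861), rep(0, 139))

# The mean of a logical vector is the proportion of TRUE values;
# na.rm = TRUE drops missing values from both the count and the total
p.hat.alt <- mean(insured == 1, na.rm = TRUE)
p.hat.alt
```

This gives the same value as k/n because dividing the number of TRUE values by the number of non-missing values is exactly what the two length() calls do.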

Question 2:

a. What type of estimate is the one you found in question 1: a point estimate or an interval estimate?

A point estimate

b.Which do you think is a better estimate to report, a point estimate or an interval estimate? Explain your reasoning!

A point estimate is better for reporting a single, simple value (in this case a proportion) from one sample. An interval estimate is better for reporting a range of plausible values (a confidence interval), because it also reflects how much sample statistics vary from sample to sample (the sampling distribution).

Question 3:

Suppose we want to construct a confidence interval. Are the conditions met to assume the sampling distribution of sample proportions is approximately normal (i.e., the CLT - Central Limit Theorem is valid)? Explain.

Yes.
Independence: it is a random sample, and the 1000 people sampled are far less than 10% of the US residents (population).
Sample size/success-failure condition: we have more than 10 successes (861) and more than 10 failures (139) in the observed sample.
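
The success-failure check can also be done in R. This is a minimal sketch that assumes the counts from Question 1 (n = 1000, k = 861); the names `successes`, `failures`, and `condition.met` are made up for illustration.

```r
# Counts taken from Question 1
n <- 1000
k <- 861

successes <- k       # observed residents with health insurance
failures  <- n - k   # observed residents without health insurance

# Success-failure condition: both counts should be at least 10
condition.met <- successes >= 10 && failures >= 10
condition.met
```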

Using the normal distribution:

Question 4:

What is the value of the estimated standard error? Use the formula from the Week 5 slides and estimate the standard error using the normal distribution.

The SE is 0.01093979

## estimate the standard error.
SE=sqrt(p.hat*(1-p.hat)/n)

Question 5:

a. Find a confidence interval for the true proportion of US residents who have health insurance based on a confidence level that you choose.

I am using a 99% confidence level

b. Explain why you chose the confidence interval that you did. Use qnorm() to find the z needed.

I chose a 99% confidence level because I want higher confidence that the interval captures the true proportion of US residents who have health insurance. Besides that, the margin of error grows by less than 0.01 when moving from 95% to 99%, and 99% will increase the audience’s trust. The z needed is qnorm(.995), which is about 2.576.

c. Interpret this confidence interval.

We are 99% confident that between 83.28% and 88.92% of US residents have health insurance (0.861 plus or minus 2.576 x 0.01093979).

##find SE - standard error
SE=sqrt(p.hat*(1-p.hat)/n)
##find z for 99% (the middle 99% leaves 0.5% in each tail, so we need the 0.995 quantile)
Z<-qnorm(.995)
##find the confidence interval
CI1 <- p.hat-Z*SE
CI2 <- p.hat+Z*SE
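
To compare confidence levels directly, the calculation above can be wrapped in a small function. This is a sketch under stated assumptions: `prop.ci` is a hypothetical helper (not from the course slides), and the inputs 0.861 and 1000 are the values found in Question 1.

```r
# Hypothetical helper: normal-based confidence interval for a proportion
prop.ci <- function(p.hat, n, conf.level = 0.95) {
  se <- sqrt(p.hat * (1 - p.hat) / n)      # estimated standard error
  z  <- qnorm(1 - (1 - conf.level) / 2)    # critical value, e.g. qnorm(.995) for 99%
  c(lower = p.hat - z * se, upper = p.hat + z * se)
}

ci95 <- prop.ci(0.861, 1000, conf.level = 0.95)
ci99 <- prop.ci(0.861, 1000, conf.level = 0.99)
round(ci95, 4)
round(ci99, 4)
```

The 99% interval is wider than the 95% interval because a larger z is needed to capture the true proportion more often.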

Using bootstrap simulations:

Question 6:

What is the value of the estimated standard error? Use bootstrap simulations like in HW 4 to find the standard error.

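10000 bootstrap samples
SE=0.01085488 (the value changes slightly from run to run because no random seed is set; the run shown below gave 0.01088075)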

boot.phats <- numeric(10000) #Initializing the vector that will hold the bootstrap statistics
for(i in 1:10000){ #i indexes the bootstrap sample and we are taking 10000 samples
  boot.samp <- sample(ACS$HealthInsurance, n, replace = TRUE) #Resample n values with replacement from the original sample
  #Now we need to calculate our bootstrap statistic
  ##(this is analogous to the sample statistic we compute from a sample)
  boot.k <- length(which(boot.samp == 1)) #how many events or "successes" do we have in our bootstrap sample
  boot.phat <- boot.k/n #a bootstrap statistic
  boot.phats[i] <- boot.phat #I am storing the newly computed bootstrap statistic in the vector of bootstrap statistics
}

##We estimate the SE by computing the standard deviation of our bootstrap distribution.
SEBoot <- sd(boot.phats) # estimate of the SE for the sampling distribution of the proportion.
SEBoot

## [1] 0.01088075
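
Because the bootstrap is random, the SE differs a little on every run. The sketch below shows one way to make the result reproducible and to write the loop more compactly with replicate(). It is an illustration under assumptions: the seed value is arbitrary, and `insured` is a made-up stand-in for ACS$HealthInsurance built from the counts in Question 1.

```r
set.seed(1)  # arbitrary seed, used only so the result can be reproduced

# Hypothetical stand-in for ACS$HealthInsurance (861 ones, 139 zeros)
insured <- c(rep(1, 861), rep(0, 139))
n <- length(insured)

# replicate() evaluates the expression 10000 times and collects the results;
# each evaluation draws one bootstrap sample and computes its proportion
boot.phats <- replicate(10000, mean(sample(insured, n, replace = TRUE) == 1))

SEBoot <- sd(boot.phats)  # bootstrap estimate of the standard error
SEBoot
```

The result should be close to the normal-based SE from Question 4, since both estimate the spread of the same sampling distribution.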

Question 7:

Find a confidence interval for the true proportion of US residents who have health insurance based on a confidence level that you choose and the standard error you calculated in question 6.

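Using a confidence level of about 95% (the sample proportion plus or minus 2 standard errors), the CI is between 0.8392385 and 0.8827615

##center the interval on the sample proportion p.hat (not on the last bootstrap statistic)
##and use the bootstrap standard error SEBoot from question 6
CIboot1 <- p.hat-2*SEBoot
CIboot1

## [1] 0.8392385

CIboot2 <- p.hat+2*SEBoot
CIboot2

## [1] 0.8827615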
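
A different bootstrap technique, the percentile interval, reads the interval straight from the bootstrap distribution instead of using the standard error. This is not what the question asks for; it is a sketch for comparison, assuming an arbitrary seed and a made-up vector `insured` standing in for ACS$HealthInsurance.

```r
set.seed(1)  # arbitrary seed for reproducibility

# Hypothetical stand-in for ACS$HealthInsurance (861 ones, 139 zeros)
insured <- c(rep(1, 861), rep(0, 139))

# Bootstrap distribution of the sample proportion
boot.phats <- replicate(10000, mean(sample(insured, length(insured), replace = TRUE)))

# Percentile interval: the middle 95% of the bootstrap statistics
ci.pct <- quantile(boot.phats, probs = c(0.025, 0.975))
ci.pct
```

When the bootstrap distribution is roughly symmetric and bell-shaped, as it is here, the percentile interval and the "plus or minus 2 SE" interval should be very similar.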

Question 8:

Suppose we’d like to test if the true proportion of US residents who have health insurance is 80% vs. the true proportion of US residents who have health insurance is NOT 80%. What would be the hypotheses for this test? Please write your hypotheses in non-technical language AND using notation. Specify which hypothesis is which (null or alternative).

Null hypothesis: p=0.8 - the true proportion of US residents who have health insurance is 80%
Alternative hypothesis: p!=0.8 - the true proportion of US residents who have health insurance is different from 80%

Question 9:

Conduct a hypothesis test for the hypotheses specified in Question 8 using the confidence interval calculated in Question 5. State your conclusions in layman’s terms and in the context of this question. Hint: look at the Week 5 part 2 slides.

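Null hypothesis: p=0.8
Alternative hypothesis: p!=0.8
Confidence level: 99%
Our data provide evidence that the true proportion of US residents who have health insurance is not equal to 80%. The 99% confidence interval from Question 5 lies entirely above 80%, so 80% is not a plausible value for the true proportion. Equivalently, if the true proportion were 80%, we would expect 99% of samples of 1000 residents to give a sample proportion between about 77% and 83%, and our observed 86.1% falls outside that range.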

## n=1000, null value p=0.8

##Calculate SE assuming the null hypothesis is true
SE9=sqrt(0.8*(1-0.8)/n)
##z was calculated in question 5 for 99%
CI1Hy <- .8-Z*SE9
CI2Hy <- .8+Z*SE9
CI1Hy

## [1] 0.7674181

CI2Hy

## [1] 0.8325819
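
As a complementary check (not the confidence-interval method the question asks for), the same conclusion can be reached with a z test statistic and a two-sided p-value. This is a sketch that assumes the values from Question 1; the names `p0`, `SE0`, `z.stat`, and `p.value` are made up for illustration.

```r
n     <- 1000
p.hat <- 0.861  # sample proportion from Question 1
p0    <- 0.8    # proportion claimed by the null hypothesis

SE0     <- sqrt(p0 * (1 - p0) / n)    # standard error assuming the null is true
z.stat  <- (p.hat - p0) / SE0         # how many SEs the sample is from the null value
p.value <- 2 * pnorm(-abs(z.stat))    # two-sided p-value

z.stat
p.value
p.value < 0.01  # TRUE means we reject the null at the 1% significance level
```

A two-sided test at the 1% significance level corresponds to the 99% confidence level used above, so both approaches lead to the same decision.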

Andreya Kuerten

3/11/2021