Statistical inference for a proportion

  1. What are the cases in this data set? What is the variable type of the variable we use to answer this question? What is the meaning of the variable and what do the individual values mean (1, 2 etc.)? You will have to look at the data documentation for this.

The cases in this data set are the people who smoke a pipe or don’t in this Lung Health Study after 1 year. 1 means yes they smoke a pipe and 2 means no they don’t smoke a pipe.

table(lhs2$f060pipe)
## 
##    1    2 
##   59 5538
prop.table(table(lhs2$f060pipe))
## 
##          1          2 
## 0.01054136 0.98945864
  1. Does R classify the variable type correctly? If not, change the variable type. Look at past coding assignments to review how to do this if needed.

R does not classify the variable correctly.

str(lhs2$f060pipe)
##  int [1:5887] 2 2 2 NA 2 2 2 2 2 2 ...
#changing variable type
lhs2$f060pipe1<-factor(lhs2$f060pipe,
                      levels=c(1,2),
                      labels=c("Yes","No"))
#check if the factor works
table(lhs2$f060pipe1)                     
## 
##  Yes   No 
##   59 5538
str(lhs2$f060pipe1)
##  Factor w/ 2 levels "Yes","No": 2 2 2 NA 2 2 2 2 2 2 ...

Visualize and summarize the categorical variable of interest (f060pipe)

  1. Create a plot (or plots) to visualize the variable. You may choose the plot(s) you think is best to visualize the variable. Create a table of counts and a table of proportions to summarize the variable. Describe what you see in the plot(s) and tables.

From the table and summary count 59 people are using a pipe one year into the study and 5538 are not and there are 290 NA’s. I plotted the counts and proportions. From the counts that the yes responses are almost insignificant and the no response is most of the sample.

plot(lhs2$f060pipe1)

summary(lhs2$f060pipe1)
##  Yes   No NA's 
##   59 5538  290
table(lhs2$f060pipe1)
## 
##  Yes   No 
##   59 5538
prop.table(table(lhs2$f060pipe1))
## 
##        Yes         No 
## 0.01054136 0.98945864
counts<-table(lhs2$f060pipe1)
barplot(counts, main="Barplot of LHS Smoke a pipe at annual visit, year 1", 
        names.arg=c("Yes","no"), col = "blue")

prop <- prop.table(table(lhs2$f060pipe1))
barplot(prop, main="Proportion of LHS Smoke a pipe at annual visit, year 1", 
        names.arg=c("Yes","no"), col = "green")

##Calculate the point estimate

  1. What proportion of the LHS participants smoked a pipe at year 1 (after starting treatment)? In other words, what is your point estimate?

The proportion of the LHS participants that smokes a pipe at year 1 is 0.01054.

# Calculate the point estimate
# To count the number of participants who 
# had the event of interest
k <- length(which(lhs2$f060pipe == "1"))

#To count the total number of 
#non-missing values in the variable,
n<-length(na.omit(lhs2$f060pipe))

#To calculate the sample proportion
p.hat<-k/n
# Estimate the standard error using bootstrap simulations
boot.samp <- sample(lhs2$f060pipe, size = n , replace = TRUE)

# To calculate the standard error,then estimate it using 
# a bootstrap distribution
boot.samp <- sample(lhs2$f060pipe, size = n, replace = TRUE)

##Estimate the standard error using bootstrap simulations 5. Check the size of your bootstrap sample (it is contained in the object we labeled boot.samp). Does it have the same sample size as your original sample? Why do we obtain random samples that have the same sample size as our original sample? Skim Chapter 3 in Lock5 if you are unsure.

Yes the bootstrap sample has the same sample size as the original sample. We obtain random samples that have the same sample size as our original because we are trying to uncover the variability of the sample statistic and it directly relies on size of the sample.

# 5 Checking sample size 
summary(boot.samp)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    1.00    2.00    2.00    1.99    2.00    2.00     287
table(boot.samp)
## boot.samp
##    1    2 
##   53 5257
##Generating our bootstrap distribution
boot.phats <- c() #Initializing the vector

for(i in 1:10000){ 
  boot.samp <- sample(lhs2$f060pipe, n, replace = TRUE) 
  boot.k <- length(which(boot.samp == 1)) 
  boot.phat <- boot.k/n
  boot.phats <- c(boot.phats, boot.phat)
}
 
#Check out boot.phats
summary(boot.phats)
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## 0.005717 0.009112 0.010005 0.010033 0.010899 0.015365
str(boot.phats)
##  num [1:10000] 0.01072 0.00965 0.00983 0.00965 0.00911 ...
#Visualize the bootstrap distribution
hist(boot.phats)

mean(boot.phats)
## [1] 0.01003338
SE <- sd(boot.phats)

#Calculate the 95% CI
#Method 1
CI <- p.hat + c(-1,1)*2*SE

Calculate the 95% confidence interval for the population proportion using bootstrap simulations

  1. Why do you think we used the sort() function? Note: sort(boot.phats) is a vector and we want the 250th element.

We used the sort function to sort the estimates and we wanted the estimate at exactly the (10,000 * 2.5%) place within the vector that was created from the for loop.

  1. Interpret the first 95% confidence interval we computed in the context of the research question.

In method one the confidence interval is (0.00784, 0.01324) and with the second method the confidence interval is (0.00750, 0.01268). From this we can conclude that we are 95% confident the proportion of the population that smoke a pipe at one year is between .007 and .013.

#Q 6 & 7
#Method 2
CI.lb <- (sort(boot.phats)[250]) 
CI.ub <- (sort(boot.phats)[9750]) 
  1. Compute a 99% confidence interval using percentiles from a bootstrap distribution (method 2). Skim section 3.4 in Lock5 or watch the mini-lecture video posted in Module 4 on 2/27/21 if you are unsure how to do this. Interpret this interval.

From method 2 the confidence interval is (0.006789, 0.01358). From this we can conclude that we are 99% confident the proportion of the population that smoke a pipe at one year is between 0.006789 and 0.01358.

#Calculate the 99% CI
#Method 2
CI.lb8 <- (sort(boot.phats)[(10000*.005)])
CI.ub8<- (sort(boot.phats)[(10000*.995)]) 

Another way to compute a 95% confidence interval: Built-in R function

  1. How do the 3 confidence intervals compare? Write out what they are. Which method do you personally like the best? Method 1 (0.00784, 0.01324) This method used the equation. Method 2 (0.00750, 0.01268) This method used the bootstrap method. Method 3 (0.00810, 0.01367) This method used the built in function from R.

I personally prefer the built in function with R because its easier. But I do like understanding there are a couple of ways to calculate confidence intervals.

prop.test(k, n, conf.level = 0.95)
## 
##  1-sample proportions test with continuity correction
## 
## data:  k out of n, null probability 0.5
## X-squared = 5361.5, df = 1, p-value < 2.2e-16
## alternative hypothesis: true p is not equal to 0.5
## 95 percent confidence interval:
##  0.008103291 0.013673079
## sample estimates:
##          p 
## 0.01054136
prop.test(59, 5597, conf.level=0.95)
## 
##  1-sample proportions test with continuity correction
## 
## data:  59 out of 5597, null probability 0.5
## X-squared = 5361.5, df = 1, p-value < 2.2e-16
## alternative hypothesis: true p is not equal to 0.5
## 95 percent confidence interval:
##  0.008103291 0.013673079
## sample estimates:
##          p 
## 0.01054136
  1. Compute a 95% confidence interval for the population proportion for those who the use of nicotine gum at first annual visit after starting treatment using your method of choice. See data documentation to identify which variable is needed.

CI = (0.00810, 0.01367)

summary(lhs2$AV1GUM)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##  0.0000  0.0000  0.0000  0.1679  0.0000  1.0000     299
table(lhs2$AV1GUM)
## 
##    0    1 
## 4650  938
prop.table(table(lhs2$AV1GUM))
## 
##         0         1 
## 0.8321403 0.1678597
prop.test(59, (5538+59), conf.level=0.95)
## 
##  1-sample proportions test with continuity correction
## 
## data:  59 out of (5538 + 59), null probability 0.5
## X-squared = 5361.5, df = 1, p-value < 2.2e-16
## alternative hypothesis: true p is not equal to 0.5
## 95 percent confidence interval:
##  0.008103291 0.013673079
## sample estimates:
##          p 
## 0.01054136