lhs <- read.csv("~/UST Data Science/Spring 2021/SEIS 631/1/RStudio/lhs.csv")
n=length(lhs$f060pipe)
  1. What are the cases in this data set? What is the variable type of the variable we use to answer this question? What is the meaning of the variable and what do the individual values mean (1, 2 etc.)? You will have to look at the data documentation for this.
table(lhs$f060pipe)
## 
##    1    2 
##   59 5538

Including Plots

  1. Does R classify the variable type correctly? If not, change the variable type. Look at past coding assignments to review how to do this if needed.

written answer: R does not classifying the variable type correctly. Without the data dictionary, it would be difficult to distinguish what integer 1 and 2 means. A better way to label the results would be by character (1 = Yes, 2 = No)

lhs$f060pipe1 <- factor(lhs$f060pipe, levels=c(1,2), labels = c("Yes", "No"))
table(lhs$f060pipe1)
## 
##  Yes   No 
##   59 5538
  1. Create a plot (or plots) to visualize the variable. You may choose the plot(s) you think is best to visualize the variable. Create a table of counts and a table of proportions to summarize the variable. Describe what you see in the plot(s) and tables.
barplot(table(lhs$f060pipe1))

table(lhs$f060pipe1)
## 
##  Yes   No 
##   59 5538
prop.table(table(lhs$f060pipe1))
## 
##        Yes         No 
## 0.01054136 0.98945864
  1. What proportion of the LHS participants smoked a pipe at year 1 (after starting treatment)? In other words, what is your point estimate?

written answer: 1% of participants smoked a pipe at year 1

k <- length(which(lhs$f060pipe== 1))
k
## [1] 59
n <- length(na.omit(lhs$f060pipe))
n
## [1] 5597
p.hat <- k/n
p.hat
## [1] 0.01054136
boot.phats <- c()

for (i in 1:10000){
  boot.samp <- sample(lhs$AV1GUM, n, replace = TRUE)
  boot.k <- length(which(boot.samp == 1))
  boot.phat <- boot.k/n
  boot.phats <- c(boot.phats, boot.phat)
}
  1. Check the size of your bootstrap sample (it is contained in the object we labeled boot.samp). Does it have the same sample size as your original sample? Why do we obtain random samples that have the same sample size as our original sample? Skim Chapter 3 in Lock5 if you are unsure.

written answer: The boot.samp size is not the same as the original sample size. The reason is because we want to see the variability.

table(boot.samp)
## boot.samp
##    0    1 
## 4407  901
  1. Why do you think we used the sort() function? Note: sort(boot.phats) is a vector and we want the 250th element.

written answer: I believe we used the sort() function because we want to separate the lower and upper boot.phats data. The sort () function is used to list the data in ascending or descending order, so for us to calculate the data correctly (by the first half (250) or the second half (9750). Sorting from ascending to descending order would have helped with separating the data.

  1. Interpret the first 95% confidence interval we computed in the context of the research question.

Written answer: 0.007 – 0.012 = 95 % Based on the results, the 95% confidence interval for mean population proportion for those who smoke a pipe at one year after starting treatment goes from 0.7% to 1.2%. Which means we are 95% sure that the mean population proportion of those who smoke a pipe at the one year mark is between 0.7% and 1.2%.

  1. Compute a 99% confidence interval using percentiles from a bootstrap distribution (method 2). Skim section 3.4 in Lock5 or watch the mini-lecture video posted in Module 4 on 2/27/21 if you are unsure how to do this. Interpret this interval.

Written answer: 0.007-0.014 = 99%; I am 99% sure that the true proportion of all the participants who smoke a pipe at their first annual visit will be between 0.7 and 1.4%.

CI.lb <- (sort(boot.phats) [50])
CI.ub <- (sort(boot.phats) [9950])
  1. How do the 3 confidence intervals compare? Write out what they are. Which method do you personally like the best?

Written answer: The confidence intervals were all pretty close together, so I am not entirely sure if the calculations were done correctly. In my opinion, I prefer method 2 (based on percentiles) because it seems easier to read and understand.

prop.test(59, 5597, conf.level = 0.90)
## 
##  1-sample proportions test with continuity correction
## 
## data:  59 out of 5597, null probability 0.5
## X-squared = 5361.5, df = 1, p-value < 2.2e-16
## alternative hypothesis: true p is not equal to 0.5
## 90 percent confidence interval:
##  0.008440637 0.013133767
## sample estimates:
##          p 
## 0.01054136
CI.lb <- (sort(boot.phats) [500])
CI.ub <- (sort(boot.phats) [9500])

prop.test(59, 5597, conf.level = 0.95)
## 
##  1-sample proportions test with continuity correction
## 
## data:  59 out of 5597, null probability 0.5
## X-squared = 5361.5, df = 1, p-value < 2.2e-16
## alternative hypothesis: true p is not equal to 0.5
## 95 percent confidence interval:
##  0.008103291 0.013673079
## sample estimates:
##          p 
## 0.01054136
CI.lb <- (sort(boot.phats) [250])
CI.ub <- (sort(boot.phats) [9750])

prop.test(59, 5597, conf.level = 0.99)
## 
##  1-sample proportions test with continuity correction
## 
## data:  59 out of 5597, null probability 0.5
## X-squared = 5361.5, df = 1, p-value < 2.2e-16
## alternative hypothesis: true p is not equal to 0.5
## 99 percent confidence interval:
##  0.00748453 0.01478630
## sample estimates:
##          p 
## 0.01054136
CI.lb <- (sort(boot.phats) [50])
CI.ub <- (sort(boot.phats) [9950])
  1. Compute a 95% confidence interval for the population proportion for those who the use of nicotine gum at first annual visit after starting treatment using your method of choice. See data documentation to identify which variable is needed.

written response: In this data set for AV1GUM, the distribution shows a symmetric, unimodal distribution with few outliers and peaks at roughly 0.16. The center of the distribution is easy to locate and both tails of the distribution are approximately the same length. The spread is squeezed and covering a narrower range (0.14 to 0.17). The population proportion is centered at approximately 0.16.

summary(lhs$Av1GUM)
## Length  Class   Mode 
##      0   NULL   NULL
table(lhs$AV1GUM)
## 
##    0    1 
## 4650  938
lhs$AV1GUM1 <- factor(lhs$AV1GUM, levels=c(0,1), labels = c("No", "Yes"))
table(lhs$AV1GUM1)
## 
##   No  Yes 
## 4650  938
plot(lhs$AV1GUM1)

m <- length(which(lhs$AV1GUM==1))
m
## [1] 938
b <- length(na.omit(lhs$AV1GUM))
b
## [1] 5588
p.hat <- m/b
p.hat
## [1] 0.1678597
boot.phats <- c()

for (i in 1:10000){
  boot.samp <- sample(lhs$AV1GUM, b, replace = TRUE)
  boot.m <- length(which(boot.samp == 1))
  boot.phat <- boot.m/b
  boot.phats <- c(boot.phats, boot.phat)
}

hist(boot.phats)

mean(boot.phats)
## [1] 0.1593269
SE <- sd(boot.phats)

CI.lb <- (sort(boot.phats) [250])
CI.ub <- (sort(boot.phats) [9750])