
  1. What are the cases in this data set? What is the variable type of the variable we use to answer this question? What is the meaning of the variable and what do the individual values mean (1, 2 etc.)? You will have to look at the data documentation for this.

Cases: People who smoke

Variable Type: Integer

Meaning of the Variable: f060pipe, Smoke a pipe at annual visit, year 1 (1 = Yes, 2 = No)

lhs<-read.csv(file =file.choose(), header = TRUE)
  2. Does R classify the variable type correctly? If not, change the variable type. Look at past coding assignments to review how to do this if needed.

lhs$f060pipe <- factor(lhs$f060pipe, levels = c(1, 2), labels = c("Yes", "No"))

  3. Create a plot (or plots) to visualize the variable. You may choose the plot(s) you think is best to visualize the variable. Create a table of counts and a table of proportions to summarize the variable. Describe what you see in the plot(s) and tables.

table(lhs$f060pipe)
## 
##  Yes   No 
##   59 5538

There are far more No’s than Yes’s: only 59 participants (code 1, Yes) reported smoking a pipe, versus 5538 (code 2, No). In the histogram the bar at 2 dominates, and because the sample is so large and the number of Yes’s so small, the bar at 1 is barely visible. In the scatter plot you can see the 59 Yes’s at 1 and the rest at 2 as a thick black line.
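Because f060pipe is categorical, a bar plot together with tables of counts and proportions is one way to summarize it. The sketch below is only an illustration: it rebuilds the variable from the two counts in the table above (59 and 5538) instead of reading the data file, so `pipe` is a stand-in for lhs$f060pipe, and it assumes the coding 1 = Yes, 2 = No from the data documentation.

```r
# Stand-in for lhs$f060pipe, rebuilt from the reported counts
# (assumption: code 1 = Yes, code 2 = No).
pipe <- factor(rep(c(1, 2), times = c(59, 5538)),
               levels = c(1, 2), labels = c("Yes", "No"))

counts <- table(pipe)         # table of counts
props  <- prop.table(counts)  # table of proportions (counts / total)

counts
round(props, 4)

# A bar plot suits a two-category variable better than a histogram.
barplot(counts, main = "Smoked a pipe at annual visit, year 1",
        ylab = "Number of participants")
```

With the real data, table(lhs$f060pipe) could be passed to prop.table() and barplot() in the same way.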

QUESTION What proportion of the LHS participants smoked a pipe at year 1 (after starting treatment)? In other words, what is your point estimate?

k <- sum(as.integer(lhs$f060pipe) == 1, na.rm = TRUE)  # count code 1 (Yes); works before or after converting to a factor
k
## [1] 59
n<-length(na.omit(lhs$f060pipe))
n
## [1] 5597
p.hat<-k/n

The point estimate is:

p.hat
## [1] 0.01054136
  5. Check the size of your bootstrap sample (it is contained in the object we labeled boot.samp). Does it have the same sample size as your original sample? Why do we obtain random samples that have the same sample size as our original sample? Skim Chapter 3 in Lock5 if you are unsure.

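Yes, the bootstrap sample has the same size as the original sample. We sample with replacement so that any observation can be selected more than once; without replacement, every bootstrap sample would just be the original sample again. We keep the sample size the same because the variability of a sample proportion depends on the sample size, so only samples of the same size give a bootstrap distribution that approximates the sampling distribution of our statistic.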
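The size check can be sketched as follows. This is an illustration under stated assumptions: `pipe` is a stand-in vector rebuilt from the counts reported earlier (59 coded 1, 5538 coded 2), the seed is arbitrary, and `idx` is a helper introduced here only to show which rows were drawn.

```r
set.seed(1)  # arbitrary seed, only so the sketch is reproducible

# Stand-in for the non-missing values of lhs$f060pipe (1 = Yes, 2 = No).
pipe <- rep(c(1, 2), times = c(59, 5538))
n <- length(pipe)

# One bootstrap sample: draw n row positions with replacement, then look them up.
idx <- sample(n, n, replace = TRUE)
boot.samp <- pipe[idx]

length(boot.samp) == n    # the bootstrap sample is as large as the original
length(unique(idx)) < n   # TRUE when at least one row was drawn more than once
```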

  6. Why do you think we used the sort() function? Note: sort(boot.phats) is a vector and we want the 250th element.

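The sort() function arranges a vector in ascending order by default. Once the 10,000 bootstrap proportions are sorted, the 250th element cuts off the lowest 2.5% of them and the 9750th element cuts off the highest 2.5%, so these two values are the limits of a 95% percentile interval.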
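A small sketch of why sorting matters is shown here. It does not use the assignment's data or loop: as a shortcut, it simulates 10,000 bootstrap proportions with rbinom(), assuming 59 Yes responses out of 5597, which has the same distribution as counting Yes's in a resample drawn with replacement. The seed and the name `sorted` are arbitrary choices for the illustration.

```r
set.seed(2)  # arbitrary seed for reproducibility

# Hypothetical stand-in for boot.phats: simulated bootstrap proportions
# for 59 "Yes" responses out of 5597 (rbinom replaces the resampling loop).
boot.phats <- rbinom(10000, size = 5597, prob = 59 / 5597) / 5597

sorted <- sort(boot.phats)  # ascending order by default

# Position 250 of 10,000 is the 2.5th percentile; position 9750 is the 97.5th.
c(lower = sorted[250], upper = sorted[9750])

# quantile() gives a very similar interval without sorting by hand.
quantile(boot.phats, c(0.025, 0.975))
```

Indexing the unsorted vector, as in boot.phats[250], would simply return whichever proportion happened to be simulated 250th, which is why the sort comes first.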

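boot.phats <- numeric(10000)  # one bootstrap proportion per resample
for (i in 1:10000) {
  boot.samp <- sample(lhs$f060pipe, n, replace = TRUE)     # resample with replacement
  boot.k <- sum(as.integer(boot.samp) == 1, na.rm = TRUE)  # count code 1 (Yes)
  boot.phats[i] <- boot.k / n                              # bootstrap proportion
}
hist(boot.phats)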

mean(boot.phats)
## [1] 0
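SE <- sd(boot.phats)               # bootstrap standard error
CI <- p.hat + c(-1,1)*2*SE         # method 1: estimate plus or minus 2 SE
CI.lb <- (sort(boot.phats)[250])   # method 2: 2.5th percentile (95% interval)
CI.ub <- (sort(boot.phats)[9750])  # method 2: 97.5th percentile (95% interval)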
prop.test(k, n, conf.level=0.95)
## 
##  1-sample proportions test with continuity correction
## 
## data:  k out of n, null probability 0.5
## X-squared = 5595, df = 1, p-value < 2.2e-16
## alternative hypothesis: true p is not equal to 0.5
## 95 percent confidence interval:
##  0.0000000000 0.0008550898
## sample estimates:
## p 
## 0
  7. Interpret the first 95% confidence interval we computed in the context of the research question.

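We are 95% confident that the true proportion of LHS participants who smoked a pipe at the year 1 annual visit is between 0.0079 (lower limit) and 0.0132 (upper limit), that is, roughly 0.8% to 1.3%.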
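As a rough check on this interval, the bootstrap standard error can be compared with the textbook formula sqrt(p.hat*(1 - p.hat)/n). The sketch below uses that formula in place of sd(boot.phats), so it is a substitute for the bootstrap step rather than the method used in this assignment. The counts 59 and 5597 are taken from the table of counts and the value of n reported earlier, and the names `SE.formula` and `CI.formula` are introduced here for illustration.

```r
k <- 59     # participants coded 1 (Yes) in the table of counts
n <- 5597   # non-missing responses

p.hat <- k / n
SE.formula <- sqrt(p.hat * (1 - p.hat) / n)  # formula-based standard error

# Same "estimate plus or minus 2 SE" rule as method 1.
CI.formula <- p.hat + c(-1, 1) * 2 * SE.formula
round(CI.formula, 4)
```

The result should land close to the 0.0079 to 0.0132 interval above; small differences are expected because the bootstrap SE varies from run to run.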

  8. Compute a 99% confidence interval using percentiles from a bootstrap distribution (method 2).
CI.lb <- (sort(boot.phats)[50])     
CI.ub <- (sort(boot.phats)[9950])

Lower limit: 0.0067. Upper limit: 0.0135. We are 99% confident that the true proportion of LHS participants who smoked a pipe at the year 1 annual visit is between 0.0067 and 0.0135.

  9. How do the 3 confidence intervals compare? Write out what they are. Which method do you personally like the best?

The first confidence interval relies on the shape of the bootstrap histogram. If the histogram is bell shaped, we use CI <- p.hat + c(-1,1)*2*SE.

The second is percentile based: once you choose a confidence level (95%, 99%, etc.), you take the matching lower and upper percentiles of the sorted bootstrap proportions. For 95% with 10,000 bootstrap samples:

CI.lb <- (sort(boot.phats)[250])
CI.ub <- (sort(boot.phats)[9750])

The third uses the built-in R function prop.test(): you supply the number of cases with the event (k), the total number of non-missing values (n), and the confidence level (conf.level = 0.95).

I prefer the third method. It makes more sense to me to assign the values from the data to letters and then pass those letters to the CI function.
10. Compute a 95% confidence interval for the population proportion of those who used nicotine gum at the first annual visit after starting treatment, using your method of choice. See the data documentation to identify which variable is needed.

G<-length(which(lhs$AV1GUM== 1))
G
## [1] 938
n<-length(na.omit(lhs$AV1GUM))
n
## [1] 5588
prop.test(G, n, conf.level=0.95)
## 
##  1-sample proportions test with continuity correction
## 
## data:  G out of n, null probability 0.5
## X-squared = 2464.5, df = 1, p-value < 2.2e-16
## alternative hypothesis: true p is not equal to 0.5
## 95 percent confidence interval:
##  0.1582020 0.1779779
## sample estimates:
##         p 
## 0.1678597
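The limits can also be pulled out of the prop.test() result instead of being read off the printout. This sketch hard-codes the two counts shown in the output above (938 and 5588) so it runs without the data file; the object name `res` is chosen here for illustration.

```r
G <- 938    # participants who reported using nicotine gum (from the output above)
n <- 5588   # non-missing responses for AV1GUM

res <- prop.test(G, n, conf.level = 0.95)

res$estimate  # point estimate, G / n
res$conf.int  # lower and upper limits of the 95% interval
```

Read in context, the interval says we can be 95% confident that between about 15.8% and 17.8% of the population used nicotine gum at the first annual visit after starting treatment.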