Cases: People who smoke
Variable type: Integer
Meaning of the variable: f060pipe, Smoke a pipe at annual visit, year 1 (1 = Yes, 2 = No)
lhs <- read.csv(file = file.choose(), header = TRUE)  # choose the LHS data file interactively
# Codebook: 1 = Yes, 2 = No, so the labels must follow that order
lhs$f060pipe <- factor(lhs$f060pipe, levels = c(1, 2), labels = c("Yes", "No"))
3. Create a plot (or plots) to visualize the variable. You may choose the plot(s) you think is best to visualize the variable. Create a table of counts and a table of proportions to summarize the variable. Describe what you see in the plot(s) and tables.
table(lhs$f060pipe)
##
## Yes No
## 59 5538
There are far more No’s than Yes’s: only 59 of the 5597 participants reported smoking a pipe. In the histogram nearly all of the mass sits at 2 (No), and because the sample is so large the bar at 1 (Yes) is too short to be visible. In the scatter plot the 59 Yes’s appear at 1, and the remaining 5538 observations at 2 form a thick black line.
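The question also asks for a table of proportions, which `prop.table()` can produce from the table of counts. The following is a minimal sketch, assuming only the counts reported above (59 observations with code 1 and 5538 with code 2). The vector `pipe` is a hypothetical stand-in for `lhs$f060pipe`, rebuilt from those counts rather than read from the LHS file, and the labels follow the codebook (1 = Yes, 2 = No).

```r
# Hypothetical stand-in for lhs$f060pipe, rebuilt from the reported counts
pipe <- factor(rep(c(1, 2), times = c(59, 5538)),
               levels = c(1, 2), labels = c("Yes", "No"))

counts <- table(pipe)         # table of counts
props  <- prop.table(counts)  # table of proportions (counts divided by their total)
round(props, 4)

# A bar plot is usually a better fit than a histogram for a two-level variable
barplot(props, ylab = "Proportion",
        main = "Pipe smoking at annual visit, year 1")
```

Plotting proportions rather than raw counts makes the small Yes group easier to compare against the whole sample.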
QUESTION What proportion of the LHS participants smoked a pipe at year 1 (after starting treatment)? In other words, what is your point estimate?
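# f060pipe is a factor at this point, so comparing it to the number 1 matches nothing;
# compare the underlying integer codes instead (code 1 = smokes a pipe)
k <- sum(as.integer(lhs$f060pipe) == 1, na.rm = TRUE)
k
## [1] 59
n<-length(na.omit(lhs$f060pipe))
n
## [1] 5597
p.hat<-k/n
The point estimate is:
p.hat
## [1] 0.01054136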
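Yes, each bootstrap sample has the same size as the original sample. We sample with replacement so that an observation can be selected more than once; without replacement, a resample of size n would just be the original sample in a different order. Each resample stands in for a new sample from a population with the same characteristics, so bootstrapping gives a decent approximation of the sampling distribution.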
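To make the role of replacement concrete, here is a small sketch on a toy vector. The values and the seed are hypothetical and have nothing to do with the LHS data; the point is only to contrast `replace = TRUE` with `replace = FALSE`.

```r
set.seed(1)                 # arbitrary seed, only so the sketch is reproducible
x <- 1:10                   # toy sample of size 10

# With replacement: same size as x, but values may repeat or be left out
boot <- sample(x, length(x), replace = TRUE)
boot
length(boot) == length(x)

# Without replacement: every value appears exactly once, so this is only a reshuffle
without <- sample(x, length(x), replace = FALSE)
identical(sort(without), x)
```

Because a reshuffled sample always gives the same proportion, only the draw with replacement produces the sample-to-sample variation that the bootstrap relies on.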
The sort() function sorts a vector or factor into ascending or descending order.
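boot.phats <- numeric(10000)  # preallocate instead of growing the vector in the loop
for(i in 1:10000){
  boot.samp <- sample(lhs$f060pipe, n, replace = TRUE)     # resample n values with replacement
  boot.k <- sum(as.integer(boot.samp) == 1, na.rm = TRUE)  # pipe smokers (code 1) in the resample
  boot.phats[i] <- boot.k/n                                # bootstrap proportion
}
hist(boot.phats)
mean(boot.phats)  # should land close to the point estimate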
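SE <- sd(boot.phats)               # bootstrap standard error
CI <- p.hat + c(-1,1)*2*SE         # point estimate plus or minus 2 standard errors
CI.lb <- (sort(boot.phats)[50])    # 0.5th percentile of the 10000 bootstrap estimates
CI.ub <- (sort(boot.phats)[9950])  # 99.5th percentile, so this pair spans the middle 99%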
# Count pipe smokers from the underlying codes (1 = smokes a pipe): 59 of 5597
k <- sum(as.integer(lhs$f060pipe) == 1, na.rm = TRUE)
prop.test(k, n, conf.level=0.95)
##
## 1-sample proportions test with continuity correction
##
## data: k out of n, null probability 0.5
## X-squared = 5361.5, df = 1, p-value < 2.2e-16
## alternative hypothesis: true p is not equal to 0.5
## 95 percent confidence interval:
## 0.008103291 0.013673079
## sample estimates:
## p
## 0.01054136
The population proportion is estimated to lie between a lower limit of 0.0079 and an upper limit of 0.0132.
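As a rough cross-check on the limits of 0.0079 and 0.0132, the standard error of a sample proportion can also be approximated with the textbook formula sqrt(p.hat * (1 - p.hat) / n). The sketch below assumes k = 59 and n = 5597, taken from the table of counts. It swaps the bootstrap standard error for this analytic approximation, so small differences in the last digit are expected.

```r
k <- 59      # participants with code 1 (smokes a pipe), from the table of counts
n <- 5597    # non-missing values

p.hat <- k / n
SE.formula <- sqrt(p.hat * (1 - p.hat) / n)       # analytic SE, not the bootstrap SE
CI.formula <- p.hat + c(-1, 1) * 2 * SE.formula   # estimate plus or minus 2 SE
round(CI.formula, 4)
```

The two approaches should agree closely here because n is large, even though the proportion itself is small.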
CI.lb <- (sort(boot.phats)[50])
CI.ub <- (sort(boot.phats)[9950])
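Lower limit: 0.0067. Upper limit: 0.0135. Based on these percentiles, the population proportion is estimated to lie between 0.0067 and 0.0135. Because the 50th and 9950th of the 10000 sorted estimates cut off 0.5% in each tail, this is a 99% interval.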
The first confidence interval relies on the shape of the bootstrap histogram. If the histogram is bell shaped, we use CI <- p.hat + c(-1,1)*2*SE
The second is percentile based: once you choose a confidence level (95%, 99%, etc.), you take the matching lower and upper percentiles of the sorted bootstrap estimates. For a 95% interval with 10000 resamples:
CI.lb <- (sort(boot.phats)[250])
CI.ub <- (sort(boot.phats)[9750])
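The percentile limits can also be read off with `quantile()` instead of indexing the sorted vector by hand. This is a sketch only: `pipe` is a hypothetical stand-in for the pipe-smoking codes, rebuilt from the reported counts (59 ones and 5538 twos), and the seed and the smaller number of resamples `B` are arbitrary choices to keep it quick.

```r
set.seed(2024)                               # arbitrary seed for reproducibility
pipe <- rep(c(1, 2), times = c(59, 5538))    # hypothetical stand-in for the raw codes
n <- length(pipe)
B <- 2000                                    # fewer resamples than the 10000 used above

# One bootstrap proportion per resample
boot.phats <- replicate(B, mean(sample(pipe, n, replace = TRUE) == 1))

# 95% percentile interval via quantile()
CI.quant <- quantile(boot.phats, probs = c(0.025, 0.975))
CI.quant

# The same idea by hand: the 2.5% and 97.5% positions in the sorted vector
CI.sorted <- sort(boot.phats)[round(c(0.025, 0.975) * B)]
CI.sorted
```

`quantile()` interpolates between neighbouring sorted values, so its limits can differ slightly from the hand-indexed ones, but it removes the need to recompute the index positions whenever the number of resamples or the confidence level changes.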
The third is the built-in R function prop.test(): you supply the number of observations with the event (k), the number of non-missing values of the variable (n), and the confidence level (conf.level = 0.95).
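Since `prop.test()` returns a list, the estimate and the interval can be pulled out by name rather than read from the printed output. A minimal sketch, assuming the counts from the table of counts (59 pipe smokers among 5597 non-missing values); the object name `res` is just an illustrative choice.

```r
# Counts taken from the table of counts: 59 with code 1 out of 5597 non-missing values
res <- prop.test(x = 59, n = 5597, conf.level = 0.95)

res$estimate   # point estimate, x / n
res$conf.int   # lower and upper limits of the 95% interval
```

Note that `conf.level` is given as a proportion (0.95), not as a percentage.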
I prefer the third method. It makes more sense to me to assign the values from the data to named variables and then pass those variables to the CI function.
10. Compute a 95% confidence interval for the population proportion of those who used nicotine gum at the first annual visit after starting treatment, using your method of choice. See data documentation to identify which variable is needed.
G<-length(which(lhs$AV1GUM== 1))
G
## [1] 938
n<-length(na.omit(lhs$AV1GUM))
n
## [1] 5588
prop.test(G, n, conf.level=0.95)
##
## 1-sample proportions test with continuity correction
##
## data: G out of n, null probability 0.5
## X-squared = 2464.5, df = 1, p-value < 2.2e-16
## alternative hypothesis: true p is not equal to 0.5
## 95 percent confidence interval:
## 0.1582020 0.1779779
## sample estimates:
## p
## 0.1678597
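As a sanity check on the printed interval, the limits can be extracted from the object returned by `prop.test()` and compared with a simple interval of the estimate plus or minus 2 standard errors. This sketch hard-codes the counts shown above (G = 938, n = 5588) so that it runs without the LHS file. The standard error formula is the textbook normal approximation and is included only for comparison; the names `gum`, `p.gum`, and `SE.gum` are illustrative.

```r
G <- 938     # nicotine gum users at the first annual visit, from the output above
n <- 5588    # non-missing values of AV1GUM

gum <- prop.test(G, n, conf.level = 0.95)
gum$conf.int                          # limits reported by prop.test()

p.gum  <- G / n
SE.gum <- sqrt(p.gum * (1 - p.gum) / n)
p.gum + c(-1, 1) * 2 * SE.gum         # normal-approximation interval for comparison
```

Both intervals put the proportion of gum users at roughly 16% to 18%.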