Homework4

1. What are the cases in this data set? What is the variable type of the variable we use to answer this question? What is the meaning of the variable and what do the individual values mean (1, 2 etc.)? You will have to look at the data documentation for this.

The cases in this data set, weather smoking intervention and use of an inhaled bronchodilator would decrease the rate of decline in lung function on random people tested in USA and Canada between 1986 to 1988. Type of variable: categorical and numerical.

individuals values 1,2 refer to Yes or No(Categorical variable)

f060pipe: Smoke a pipe at annual visit, year 1, 1= Yes,2 = No AV1GUM :Use of nicotine gum at annual visit, year 1 , 0 = No, 1 =Yes

lhs <- read.csv("~/Desktop/631/lhs.csv")
n = length((lhs$f060pipe))
str(lhs$f060pipe)

##  int [1:5887] 2 2 2 NA 2 2 2 2 2 2 ...

table(lhs$f060pipe)

## 
##    1    2 
##   59 5538

table(lhs$AV1GUM)

## 
##    0    1 
## 4650  938

R Markdown

This is an R Markdown document. Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. For more details on using R Markdown see http://rmarkdown.rstudio.com.

When you click the Knit button a document will be generated that includes both content as well as the output of any embedded R code chunks within the document. You can embed an R code chunk like this:

summary(cars)

##      speed           dist       
##  Min.   : 4.0   Min.   :  2.00  
##  1st Qu.:12.0   1st Qu.: 26.00  
##  Median :15.0   Median : 36.00  
##  Mean   :15.4   Mean   : 42.98  
##  3rd Qu.:19.0   3rd Qu.: 56.00  
##  Max.   :25.0   Max.   :120.00

2. Does R classify the variable type correctly? If not, change the variable type. Look at past coding assignments to review how to do this if needed.

R classify the variable type not correctly.

lhs$f060pipe2<-factor(lhs$f060pipe, levels=c(1,2),labels = c("Yes","No"))
table(lhs$f060pipe2)

## 
##  Yes   No 
##   59 5538

lhs$AV1GUM2<-factor(lhs$AV1GUM, levels=c(0,1),labels = c("No","Yes"))
table(lhs$AV1GUM2)

## 
##   No  Yes 
## 4650  938

Including Plots

You can also embed plots, for example:

Note that the echo = FALSE parameter was added to the code chunk to prevent printing of the R code that generated the plot.

Visualize and summarize the categorical variable of interest (f060pipe)

3. Create a plot (or plots) to visualize the variable. You may choose the plot(s) you think is best to visualize the variable. Create a table of counts and a table of proportions to summarize the variable. Describe what you see in the plot(s) and tables.

Since the f060pipe is numerical variable, we visualize it by historical graph left-skewed, It is clear from the graph the proportion of participant not using pipe during 1year higher than participants using pipe in both countries and there is no association in the graph.

(f060pipe) : Smoke a pipe at annual visit, year 1

The number of participants smoke pipe during 1 year study are 59 and the proportion is 0.0105.

The number of participants NOT smoke pipe during 1 year study are 5538 and the proportion is 0.989.

str(lhs$f060pipe)

##  int [1:5887] 2 2 2 NA 2 2 2 2 2 2 ...

table(lhs$f060pipe)

## 
##    1    2 
##   59 5538

prop.table(table(lhs$f060pipe))

## 
##          1          2 
## 0.01054136 0.98945864

barplot(table(lhs$f060pipe2))

lhs$f060pipe1<-factor(lhs$f060pipe, levels=c(1,2),labels = c("Yes","No"))

4.What proportion of the LHS participants smoked a pipe at year 1 (after starting treatment)? In other words,what is your point estimate?

The proportion of participants smoked a pipe at year 1 are: 1.05.% of participants smoked pipe in same period(at 1 year).

table(lhs$f060pipe)

## 
##    1    2 
##   59 5538

prop.table(table(lhs$f060pipe))

## 
##          1          2 
## 0.01054136 0.98945864

k<-length(which(lhs$f060pipe== 1))
k

## [1] 59

n<-length(na.omit(lhs$f060pipe))
n

## [1] 5597

p.hat<-k/n
p.hat

## [1] 0.01054136

5. Check the size of your bootstrap sample (it is contained in the object we labeled boot.samp). Does it have the same sample size as your original sample? Why do we obtain random samples that have the same sample size as our original sample? Skim Chapter 3 in Lock5 if you are unsure. Now let’s take many bootstrap samples and generate the bootstrap distribution (an approximation of the true sampling distribution). A rule of thumb is to generate at least 1000 samples.Let’s take 10000.

boot sample(boot.samp) has Yes variable and it is not same as original sample size,“Yes” = 59 in original sample and the boot.samp take bootstrap (65) from original sample. It does not have the same sample size and we use random samples to have different probability and proportion.

#table(boot.samp)

```{’’’{r, message = FALSE,warning=FALSE}

’’’



6\. Why do you think we used the sort() function? Note: sort(boot.phats) is a vector and we want the 250th
element.
We use sort function for ordering along more than one variable,increasing or decreasing values in 250th(lower bound) and upper bound in boot.phats


7\. Interpret the first 95% confidence interval we computed in the context of the research question.

Any given 95% confidence interval will contain the true parameter for 95% of all samples.

since only \~5 of 95% confidence intervals won't capture the population parameter, we say that we are 95% confident or 95%sure that the interval computed from our samples would capture the population parameter.


```r
k<-length(which(lhs$f060pipe== 1))
k

## [1] 59

n<-length(na.omit(lhs$f060pipe))
n

## [1] 5597

p.hat<-k/n
p.hat

## [1] 0.01054136

boot.samp <- sample(lhs$f060pipe, size = n, replace = TRUE)
boot.phats <- c() #Initializing the vector
for(i in 1:10000){ 
   boot.samp <- sample(lhs$f060pipe, n, replace = TRUE) 
   boot.k <- length(which(boot.samp == 1)) 
   boot.phat <- boot.k/n
   boot.phats <- c(boot.phats, boot.phat)}

hist(boot.phats)

mean(boot.phats)

## [1] 0.009985171

SE <- sd(boot.phats) 


CI <- p.hat + c(-1,1)*2*SE
#Method 2 based on percentiles
# 95% Confidence Interval
# 2.5% of 10000 is 250
# 97.5% of 10000 is 9750
CI.lb <- (sort(boot.phats)[250]) #lower bound CI.ub <- (sort(boot.phats)[9750]) #upper bound

prop.test(k, n, conf.level=0.95)

## 
##  1-sample proportions test with continuity correction
## 
## data:  k out of n, null probability 0.5
## X-squared = 5361.5, df = 1, p-value < 2.2e-16
## alternative hypothesis: true p is not equal to 0.5
## 95 percent confidence interval:
##  0.008103291 0.013673079
## sample estimates:
##          p 
## 0.01054136

#prop.test(59, 5597, conf.level=0.95)

I am 95% confident that the true proportion of all f060pipe participants who smoking pipe at 1 year after starting treatments located between 0.0081 and 0.0136 in our study.

8. Compute a 99% confidence interval using percentiles from a bootstrap distribution (method 2). Skim section 3.4 in Lock5 or watch the mini-lecture video posted in Module 4 on 2/27/21 if you are unsure how to do this. Interpret this interval.

I am 99% confident that the true proportion from bootstrap distribution for f060pipe participants at one year after starting treatment between 0.0074 and 0.0147

table(boot.samp)

## boot.samp
##    1    2 
##   53 5257

prop.test(59, 5597, conf.level=0.99)

## 
##  1-sample proportions test with continuity correction
## 
## data:  59 out of 5597, null probability 0.5
## X-squared = 5361.5, df = 1, p-value < 2.2e-16
## alternative hypothesis: true p is not equal to 0.5
## 99 percent confidence interval:
##  0.00748453 0.01478630
## sample estimates:
##          p 
## 0.01054136

CI.lb <- (sort(boot.phats)[250])
CI.lb <- (sort(boot.phats)[9950])

How do the 3 confidence intervals compare? Write out what they are. Which method do you personally like the best?

There are three types:

-68% falls in 1 SD of the mean, the confidence interval will contain the true parameter for 68% of all samples. 68% = located between 0.0091 and 0.0120

-95% falls in 2 SD of the mean, the confidence interval will contain the true parameter for 95% of all samples. 95% = mark between 0.0081 and 0.0136

-99% falls in 3 SD of the mean, the confidence interval will contain the true parameter for 99% of all samples, and i prefer this one because the 1% of the confidence interval will capture the truth, and that would be in the tail of normal distribution 99% = mark between 0.0074 and 0.0147

In this case I use prop.test function, and all of these return simillar value, but not smae exact value.

prop.test(59, 5597, conf.level=0.68)

## 
##  1-sample proportions test with continuity correction
## 
## data:  59 out of 5597, null probability 0.5
## X-squared = 5361.5, df = 1, p-value < 2.2e-16
## alternative hypothesis: true p is not equal to 0.5
## 68 percent confidence interval:
##  0.009184021 0.012082995
## sample estimates:
##          p 
## 0.01054136

CI.lb <- (sort(boot.phats)[550])
CI.lb <- (sort(boot.phats)[9950])
prop.test(59, 5597, conf.level=0.95)

## 
##  1-sample proportions test with continuity correction
## 
## data:  59 out of 5597, null probability 0.5
## X-squared = 5361.5, df = 1, p-value < 2.2e-16
## alternative hypothesis: true p is not equal to 0.5
## 95 percent confidence interval:
##  0.008103291 0.013673079
## sample estimates:
##          p 
## 0.01054136

CI.lb <- (sort(boot.phats)[225])
CI.lb <- (sort(boot.phats)[9950])
prop.test(59, 5597, conf.level=0.99)

## 
##  1-sample proportions test with continuity correction
## 
## data:  59 out of 5597, null probability 0.5
## X-squared = 5361.5, df = 1, p-value < 2.2e-16
## alternative hypothesis: true p is not equal to 0.5
## 99 percent confidence interval:
##  0.00748453 0.01478630
## sample estimates:
##          p 
## 0.01054136

CI.lb <- (sort(boot.phats)[125])
CI.lb <- (sort(boot.phats)[9950])

10.Compute a 95% confidence interval for the population proportion for those who the use of nicotine gum at first annual visit after starting treatment using your method of choice. See data documentation to identify which variable is needed. AV1GUM

table(lhs$AV1GUM)

## 
##    0    1 
## 4650  938

table(lhs$AV1GUM)

## 
##    0    1 
## 4650  938

prop.table(table(lhs$AV1GUM))

## 
##         0         1 
## 0.8321403 0.1678597

s<-length(which(lhs$AV1GUM== 1))
s

## [1] 938

r<-length(na.omit(lhs$AV1GUM))
r

## [1] 5588

s.hat<-s/r
s.hat

## [1] 0.1678597

boot.samp <- sample(lhs$AV1GUM, size = r, replace = TRUE)
boot.phats <- c() #Initializing the vector
for(i in 1:10000){ 
   boot.samp <- sample(lhs$AV1GUM, r, replace = TRUE) 
   boot.s <- length(which(boot.samp == 1)) 
   boot.phat <- boot.s/r
   boot.phats <- c(boot.phats, boot.phat)}

hist(boot.phats)

mean(boot.phats)

## [1] 0.1592778

SE <- sd(boot.phats) 


CI <- p.hat + c(-1,1)*2*SE
#Method 2 based on percentiles
# 95% Confidence Interval
# 2.5% of 10000 is 250
# 97.5% of 10000 is 9750
CI.lb <- (sort(boot.phats)[250]) #lower bound CI.ub <- (sort(boot.phats)[9750]) #upper bound

prop.test(s, r, conf.level=0.95)

## 
##  1-sample proportions test with continuity correction
## 
## data:  s out of r, null probability 0.5
## X-squared = 2464.5, df = 1, p-value < 2.2e-16
## alternative hypothesis: true p is not equal to 0.5
## 95 percent confidence interval:
##  0.1582020 0.1779779
## sample estimates:
##         p 
## 0.1678597

#s<-length(which(lhs$AV1GUM== 1))

#r<-length(na.omit(lhs$AV1GUM))

#prop.test(s,r,conf.level = 0.95)

Who use the nicotine gum: 938

I am 95% confident that the true proportion of those who use (AV1GUM) nicotine gum at annual visit at year 1 between 0.1582 and 0.1779