Week 4 R Assignement

Question 1. What are the cases in this data set? What is the variable type of the variable we use to answer this question? What is the meaning of the variable and what do the individual values mean (1, 2 etc.)? You will have to look at the data documentation for this.

The cases are participants. R says the variable integer. The meaning of 1 = Yes and 2 = No. I looked at the data documentation.

##Importing the data base
lhs<-read.csv(file = file.choose(), header = TRUE)

##Checking the type of a variable
typeof(lhs$f060pipe)

## [1] "integer"

Question 2. Does R classify the variable type correctly? If not, change the variable type. Look at past coding assignments to review how to do this if needed.

R classified as an integer.

##Reclassifying as a character
class(lhs$f060pipe)="Character"

Question 3. Create a plot (or plots) to visualize the variable. You may choose the plot(s) you think is best to visualize the variable. Create a table of counts and a table of proportions to summarize the variable. Describe what you see in the plot(s) and tables.

More than 98% of the participants did not smoke a pipe at 1 year after start treatment.

#Save the count information from the table() function
counts<-table(lhs$f060pipe)
#Create a table of proportions using the counts object
prop.table(counts)

## 
##          1          2 
## 0.01054136 0.98945864

#Create a barplot using the counts object
barplot(counts, main="Barplot of smoke a pipe at 1 year after
starting treatment",
        names.arg=c("Yes","No"),col="blue")

Question 4. What proportion of the LHS participants smoked a pipe at year 1 (after starting treatment)? In other words, what is your point estimate?

0.01054136 – 1.05%

##Calculate the point estimate

##The length() function counts the number of values in the object, and the which() function finds the
##observations in the variable that match the criteria (e.g., which observations in the f060pipe variable have a
##                              value equal to 1, lhs$f060pipe==1).

k<-length(which(lhs$f060pipe== 1))
#viewing the object you just created
k

## [1] 59

##The na.omit() function removes the rows in the variable that are missing, so you would only be left with a
##variable with non-missing rows.

n<-length(na.omit(lhs$f060pipe))
n

## [1] 5597

##calculate the sample proportion
p.hat<-k/n
p.hat

## [1] 0.01054136

Question 5. Check the size of your bootstrap sample (it is contained in the object we labeled boot.samp). Does it have the same sample size as your original sample? Why do we obtain random samples that have the same sample size as our original sample? Skim Chapter 3 in Lock5 if you are unsure.

“boot.samp” has 5597 observations. The same size of “n”. The origiral variable “f060pipe” has 5887 observations, the difference is the number of omitted answers. We obtain random samples that have the same sample size as our original sample because we are trying to uncover the variability of the sample statistic.

##size of your bootstrap sample and original variable
boot.samp <- sample(lhs$f060pipe, size = n, replace = TRUE)
length (lhs$f060pipe)

## [1] 5887

length((boot.samp))

## [1] 5597

Question 6. Why do you think we used the sort() function? Note: sort(boot.phats) is a vector and we want the 250th element.

Because we want to identify the lowest and the highest 250 elements from boot.phats.

Question 7. Interpret the first 95% confidence interval we computed in the context of the research question.

We are 95% confident that a participant who smoked a pipe at the first year of treatment is between 0.007871986 0.013210737 0r 0.79% and 1.32%

Question 8. Compute a 99% confidence interval using percentiles from a bootstrap distribution (method 2). Skim section 3.4 in Lock5 or watch the mini-lecture video posted in Module 4 on 2/27/21 if you are unsure how to do this. Interpret this interval.

boot.phats <- c() #Initializing the vector
for(i in 1:10000){ #i is a sample and we are taking 10000 samples
  boot.samp <- sample(lhs$f060pipe, n, replace = TRUE) #Take a random sample
  #Now we need to calculate our bootstrap statistic
  ##(this is analogous to the sample statistic we compute from a sample)
  boot.k <- length(which(boot.samp == 1)) #how many events or "successes" do we have in our sample
  boot.phat <- boot.k/n #a bootstrap statistic
  boot.phats <- c(boot.phats, boot.phat) #I am added the newly computed bootstrap statistic
  #to the vector of bootstrap statistics
}

##visualize the bootstrap distribution and calculate the standard error.
hist(boot.phats,col="orange")

mean(boot.phats) # this should be the same quantity as your point estimate

## [1] 0.01000572

##Recall we use this quantity as an estimate of population proportion
SE <- sd(boot.phats) # estimate of the SE for the sampling distribution of the proportion.
##We estimate the SE by computing the standard deviation of our bootstrap distribution.
#Method 2 based on percentiles
# 99% Confidence Interval
# 0.5% of 10000 is 50
# 99.5% of 10000 is 9950
CI.lb99 <- (sort(boot.phats)[50]) #lower bound
CI.lb99

## [1] 0.006789351

CI.ub99 <- (sort(boot.phats)[9950]) #upper bound
CI.ub99

## [1] 0.0135787

Question 9. How do the 3 confidence intervals compare? Write out what they are. Which method do you personally like the best?

Method 1 – confidence interval based on a bootstrap standard error – is a good method when the histogram Is symmetric and bell-shaped. Method 2 – confidence interval based on a bootstrap percentiles – is a good method when the histogram is not so symmetric and/or we want mor or less than 95% of confidence interval. Method 3 – bult-in R Function. Is good if you did not have data and only had summary information for calculating the CI for a proportion. The method 3 looks easy but is not considering the bootstrap process. It looks like less trustable. For this reason, I prefer the method 2, since I do not need to concern about the shape of the histogram.

Question 10. Compute a 95% confidence interval for the population proportion for those who the use of nicotine gum at first annual visit after starting treatment using your method of choice. See data documentation to identify which variable is needed.

We are 95% confident that a participant who use nicotine gum at the first annual visit of treatment is between 0.1497231 and 0.1686618 0r 14.97% and 16.87%.

##The length() function counts the number of values in the object, and the which() function finds the
##observations in the variable that match the criteria
kG<-length(which(lhs$AV1GUM== 1))
kG

## [1] 938

##The na.omit() function removes the rows in the variable that are missing, so you would only be left with a
##variable with non-missing rows.
nG<-length(na.omit(lhs$AV1GUM))
nG

## [1] 5588

##calculate the sample proportion
p.hatG<-kG/nG
p.hatG

## [1] 0.1678597

##Sampling with replacement from our one sample
#We want to sample with replacement to obtain samples of the SAME size as our original sample (`n`)
boot.sampG <- sample(lhs$AV1GUM, size = nG, replace = TRUE)
##Generating our bootstrap distribution
boot.phatsG <- c() #Initializing the vector
for(i in 1:10000){ #i is a sample and we are taking 10000 samples
  boot.sampG <- sample(lhs$AV1GUM, nG, replace = TRUE) #Take a random sample
  #Now we need to calculate our bootstrap statistic
  ##(this is analogous to the sample statistic we compute from a sample)
  boot.kG <- length(which(boot.sampG == 1)) #how many events or "successes" do we have in our sample
  boot.phatG <- boot.kG/n #a bootstrap statistic
  boot.phatsG <- c(boot.phatsG, boot.phatG) #I am added the newly computed bootstrap statistic
  #to the vector of bootstrap statistics
}

##visualize the bootstrap distribution and calculate the standard error.
hist(boot.phatsG, col="yellow")

mean(boot.phatsG) # this should be the same quantity as your point estimate

## [1] 0.1591113

##Recall we use this quantity as an estimate of population proportion
SEG <- sd(boot.phatsG) # estimate of the SE for the sampling distribution of the proportion.
##We estimate the SE by computing the standard deviation of our bootstrap distribution.
# 95% Confidence Interval
# 2.5% of 10000 is 250
# 97.5% of 10000 is 9750
CI.lbG <- (sort(boot.phatsG)[250]) #lower bound
CI.lbG

## [1] 0.1495444

CI.ubG <- (sort(boot.phatsG)[9750]) #upper bound
CI.ubG

## [1] 0.1688405

Week 4 R Assignement

Andreya Kuerten

3/4/2021