The cases in this data set are the people who smoke a pipe or don’t in this Lung Health Study after 1 year. 1 means yes they smoke a pipe and 2 means no they don’t smoke a pipe.
table(lhs2$f060pipe)
##
## 1 2
## 59 5538
prop.table(table(lhs2$f060pipe))
##
## 1 2
## 0.01054136 0.98945864
R does not classify the variable correctly.
str(lhs2$f060pipe)
## int [1:5887] 2 2 2 NA 2 2 2 2 2 2 ...
#changing variable type
lhs2$f060pipe1<-factor(lhs2$f060pipe,
levels=c(1,2),
labels=c("Yes","No"))
#check if the factor works
table(lhs2$f060pipe1)
##
## Yes No
## 59 5538
str(lhs2$f060pipe1)
## Factor w/ 2 levels "Yes","No": 2 2 2 NA 2 2 2 2 2 2 ...
From the table and summary count 59 people are using a pipe one year into the study and 5538 are not and there are 290 NA’s. I plotted the counts and proportions. From the counts that the yes responses are almost insignificant and the no response is most of the sample.
plot(lhs2$f060pipe1)
summary(lhs2$f060pipe1)
## Yes No NA's
## 59 5538 290
table(lhs2$f060pipe1)
##
## Yes No
## 59 5538
prop.table(table(lhs2$f060pipe1))
##
## Yes No
## 0.01054136 0.98945864
counts<-table(lhs2$f060pipe1)
barplot(counts, main="Barplot of LHS Smoke a pipe at annual visit, year 1",
names.arg=c("Yes","no"), col = "blue")
prop <- prop.table(table(lhs2$f060pipe1))
barplot(prop, main="Proportion of LHS Smoke a pipe at annual visit, year 1",
names.arg=c("Yes","no"), col = "green")
##Calculate the point estimate
The proportion of the LHS participants that smokes a pipe at year 1 is 0.01054.
# Calculate the point estimate
# To count the number of participants who
# had the event of interest
k <- length(which(lhs2$f060pipe == "1"))
#To count the total number of
#non-missing values in the variable,
n<-length(na.omit(lhs2$f060pipe))
#To calculate the sample proportion
p.hat<-k/n
# Estimate the standard error using bootstrap simulations
boot.samp <- sample(lhs2$f060pipe, size = n , replace = TRUE)
# To calculate the standard error,then estimate it using
# a bootstrap distribution
boot.samp <- sample(lhs2$f060pipe, size = n, replace = TRUE)
##Estimate the standard error using bootstrap simulations 5. Check the size of your bootstrap sample (it is contained in the object we labeled boot.samp). Does it have the same sample size as your original sample? Why do we obtain random samples that have the same sample size as our original sample? Skim Chapter 3 in Lock5 if you are unsure.
Yes the bootstrap sample has the same sample size as the original sample. We obtain random samples that have the same sample size as our original because we are trying to uncover the variability of the sample statistic and it directly relies on size of the sample.
# 5 Checking sample size
summary(boot.samp)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 1.00 2.00 2.00 1.99 2.00 2.00 287
table(boot.samp)
## boot.samp
## 1 2
## 53 5257
##Generating our bootstrap distribution
boot.phats <- c() #Initializing the vector
for(i in 1:10000){
boot.samp <- sample(lhs2$f060pipe, n, replace = TRUE)
boot.k <- length(which(boot.samp == 1))
boot.phat <- boot.k/n
boot.phats <- c(boot.phats, boot.phat)
}
#Check out boot.phats
summary(boot.phats)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.005717 0.009112 0.010005 0.010033 0.010899 0.015365
str(boot.phats)
## num [1:10000] 0.01072 0.00965 0.00983 0.00965 0.00911 ...
#Visualize the bootstrap distribution
hist(boot.phats)
mean(boot.phats)
## [1] 0.01003338
SE <- sd(boot.phats)
#Calculate the 95% CI
#Method 1
CI <- p.hat + c(-1,1)*2*SE
We used the sort function to sort the estimates and we wanted the estimate at exactly the (10,000 * 2.5%) place within the vector that was created from the for loop.
In method one the confidence interval is (0.00784, 0.01324) and with the second method the confidence interval is (0.00750, 0.01268). From this we can conclude that we are 95% confident the proportion of the population that smoke a pipe at one year is between .007 and .013.
#Q 6 & 7
#Method 2
CI.lb <- (sort(boot.phats)[250])
CI.ub <- (sort(boot.phats)[9750])
From method 2 the confidence interval is (0.006789, 0.01358). From this we can conclude that we are 99% confident the proportion of the population that smoke a pipe at one year is between 0.006789 and 0.01358.
#Calculate the 99% CI
#Method 2
CI.lb8 <- (sort(boot.phats)[(10000*.005)])
CI.ub8<- (sort(boot.phats)[(10000*.995)])
I personally prefer the built in function with R because its easier. But I do like understanding there are a couple of ways to calculate confidence intervals.
prop.test(k, n, conf.level = 0.95)
##
## 1-sample proportions test with continuity correction
##
## data: k out of n, null probability 0.5
## X-squared = 5361.5, df = 1, p-value < 2.2e-16
## alternative hypothesis: true p is not equal to 0.5
## 95 percent confidence interval:
## 0.008103291 0.013673079
## sample estimates:
## p
## 0.01054136
prop.test(59, 5597, conf.level=0.95)
##
## 1-sample proportions test with continuity correction
##
## data: 59 out of 5597, null probability 0.5
## X-squared = 5361.5, df = 1, p-value < 2.2e-16
## alternative hypothesis: true p is not equal to 0.5
## 95 percent confidence interval:
## 0.008103291 0.013673079
## sample estimates:
## p
## 0.01054136
CI = (0.00810, 0.01367)
summary(lhs2$AV1GUM)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.0000 0.0000 0.0000 0.1679 0.0000 1.0000 299
table(lhs2$AV1GUM)
##
## 0 1
## 4650 938
prop.table(table(lhs2$AV1GUM))
##
## 0 1
## 0.8321403 0.1678597
prop.test(59, (5538+59), conf.level=0.95)
##
## 1-sample proportions test with continuity correction
##
## data: 59 out of (5538 + 59), null probability 0.5
## X-squared = 5361.5, df = 1, p-value < 2.2e-16
## alternative hypothesis: true p is not equal to 0.5
## 95 percent confidence interval:
## 0.008103291 0.013673079
## sample estimates:
## p
## 0.01054136