lhs <- read.csv("~/UST Data Science/Spring 2021/SEIS 631/1/RStudio/lhs.csv")
n=length(lhs$f060pipe)
table(lhs$f060pipe)
##
## 1 2
## 59 5538
written answer: R does not classifying the variable type correctly. Without the data dictionary, it would be difficult to distinguish what integer 1 and 2 means. A better way to label the results would be by character (1 = Yes, 2 = No)
lhs$f060pipe1 <- factor(lhs$f060pipe, levels=c(1,2), labels = c("Yes", "No"))
table(lhs$f060pipe1)
##
## Yes No
## 59 5538
barplot(table(lhs$f060pipe1))
table(lhs$f060pipe1)
##
## Yes No
## 59 5538
prop.table(table(lhs$f060pipe1))
##
## Yes No
## 0.01054136 0.98945864
written answer: 1% of participants smoked a pipe at year 1
k <- length(which(lhs$f060pipe== 1))
k
## [1] 59
n <- length(na.omit(lhs$f060pipe))
n
## [1] 5597
p.hat <- k/n
p.hat
## [1] 0.01054136
boot.phats <- c()
for (i in 1:10000){
boot.samp <- sample(lhs$AV1GUM, n, replace = TRUE)
boot.k <- length(which(boot.samp == 1))
boot.phat <- boot.k/n
boot.phats <- c(boot.phats, boot.phat)
}
written answer: The boot.samp size is not the same as the original sample size. The reason is because we want to see the variability.
table(boot.samp)
## boot.samp
## 0 1
## 4407 901
written answer: I believe we used the sort() function because we want to separate the lower and upper boot.phats data. The sort () function is used to list the data in ascending or descending order, so for us to calculate the data correctly (by the first half (250) or the second half (9750). Sorting from ascending to descending order would have helped with separating the data.
Written answer: 0.007 – 0.012 = 95 % Based on the results, the 95% confidence interval for mean population proportion for those who smoke a pipe at one year after starting treatment goes from 0.7% to 1.2%. Which means we are 95% sure that the mean population proportion of those who smoke a pipe at the one year mark is between 0.7% and 1.2%.
Written answer: 0.007-0.014 = 99%; I am 99% sure that the true proportion of all the participants who smoke a pipe at their first annual visit will be between 0.7 and 1.4%.
CI.lb <- (sort(boot.phats) [50])
CI.ub <- (sort(boot.phats) [9950])
Written answer: The confidence intervals were all pretty close together, so I am not entirely sure if the calculations were done correctly. In my opinion, I prefer method 2 (based on percentiles) because it seems easier to read and understand.
prop.test(59, 5597, conf.level = 0.90)
##
## 1-sample proportions test with continuity correction
##
## data: 59 out of 5597, null probability 0.5
## X-squared = 5361.5, df = 1, p-value < 2.2e-16
## alternative hypothesis: true p is not equal to 0.5
## 90 percent confidence interval:
## 0.008440637 0.013133767
## sample estimates:
## p
## 0.01054136
CI.lb <- (sort(boot.phats) [500])
CI.ub <- (sort(boot.phats) [9500])
prop.test(59, 5597, conf.level = 0.95)
##
## 1-sample proportions test with continuity correction
##
## data: 59 out of 5597, null probability 0.5
## X-squared = 5361.5, df = 1, p-value < 2.2e-16
## alternative hypothesis: true p is not equal to 0.5
## 95 percent confidence interval:
## 0.008103291 0.013673079
## sample estimates:
## p
## 0.01054136
CI.lb <- (sort(boot.phats) [250])
CI.ub <- (sort(boot.phats) [9750])
prop.test(59, 5597, conf.level = 0.99)
##
## 1-sample proportions test with continuity correction
##
## data: 59 out of 5597, null probability 0.5
## X-squared = 5361.5, df = 1, p-value < 2.2e-16
## alternative hypothesis: true p is not equal to 0.5
## 99 percent confidence interval:
## 0.00748453 0.01478630
## sample estimates:
## p
## 0.01054136
CI.lb <- (sort(boot.phats) [50])
CI.ub <- (sort(boot.phats) [9950])
written response: In this data set for AV1GUM, the distribution shows a symmetric, unimodal distribution with few outliers and peaks at roughly 0.16. The center of the distribution is easy to locate and both tails of the distribution are approximately the same length. The spread is squeezed and covering a narrower range (0.14 to 0.17). The population proportion is centered at approximately 0.16.
summary(lhs$Av1GUM)
## Length Class Mode
## 0 NULL NULL
table(lhs$AV1GUM)
##
## 0 1
## 4650 938
lhs$AV1GUM1 <- factor(lhs$AV1GUM, levels=c(0,1), labels = c("No", "Yes"))
table(lhs$AV1GUM1)
##
## No Yes
## 4650 938
plot(lhs$AV1GUM1)
m <- length(which(lhs$AV1GUM==1))
m
## [1] 938
b <- length(na.omit(lhs$AV1GUM))
b
## [1] 5588
p.hat <- m/b
p.hat
## [1] 0.1678597
boot.phats <- c()
for (i in 1:10000){
boot.samp <- sample(lhs$AV1GUM, b, replace = TRUE)
boot.m <- length(which(boot.samp == 1))
boot.phat <- boot.m/b
boot.phats <- c(boot.phats, boot.phat)
}
hist(boot.phats)
mean(boot.phats)
## [1] 0.1593269
SE <- sd(boot.phats)
CI.lb <- (sort(boot.phats) [250])
CI.ub <- (sort(boot.phats) [9750])