library(tidyverse)## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.0 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ ggplot2 3.4.2 ✔ tibble 3.2.1
## ✔ lubridate 1.9.2 ✔ tidyr 1.3.0
## ✔ purrr 1.0.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(openintro)## Loading required package: airports
## Loading required package: cherryblossom
## Loading required package: usdata
library(infer)What are the counts within each category for the moment of days these students have texted while driving within the past 30 days?
data(yrbss)
dplyr::glimpse(yrbss)## Rows: 13,583
## Columns: 13
## $ age <int> 14, 14, 15, 15, 15, 15, 15, 14, 15, 15, 15, 1…
## $ gender <chr> "female", "female", "female", "female", "fema…
## $ grade <chr> "9", "9", "9", "9", "9", "9", "9", "9", "9", …
## $ hispanic <chr> "not", "not", "hispanic", "not", "not", "not"…
## $ race <chr> "Black or African American", "Black or Africa…
## $ height <dbl> NA, NA, 1.73, 1.60, 1.50, 1.57, 1.65, 1.88, 1…
## $ weight <dbl> NA, NA, 84.37, 55.79, 46.72, 67.13, 131.54, 7…
## $ helmet_12m <chr> "never", "never", "never", "never", "did not …
## $ text_while_driving_30d <chr> "0", NA, "30", "0", "did not drive", "did not…
## $ physically_active_7d <int> 4, 2, 7, 0, 2, 1, 4, 4, 5, 0, 0, 0, 4, 7, 7, …
## $ hours_tv_per_school_day <chr> "5+", "5+", "5+", "2", "3", "5+", "5+", "5+",…
## $ strength_training_7d <int> 0, 0, 0, 0, 1, 0, 2, 0, 3, 0, 3, 0, 0, 7, 7, …
## $ school_night_hours_sleep <chr> "8", "6", "<5", "6", "9", "8", "9", "6", "<5"…
table (yrbss$text_while_driving_30d)##
## 0 1-2 10-19 20-29 3-5
## 4792 925 373 298 493
## 30 6-9 did not drive
## 827 311 4646
0: 4792, 1-2:925, 10-19: 373, 20-29: 298. 3-5: 493, 30: 827, 6-9: 311, did not drive: 4646.
What is the proportion of people who have texted while during every day in the past 30 days and never wear helmets?
The proportion of people who have texted while during every day in the past 30 days and never wear helmets is 0.03408673.
yrbss %>%filter(text_while_driving_30d =="30" & helmet_12m =="never") %>%nrow()/nrow(yrbss)## [1] 0.03408673
What is the margin of error for the estimate of the proportion of non-helmet wearers that have texted while driving each day for the past 30 days based on this survey? The margin of error is 0.003051546.
p <-yrbss %>%filter(text_while_driving_30d =="30" & yrbss$helmet_12m =="never") %>%nrow()/nrow(yrbss)
ME<-1.96*sqrt(p*(1-p)/nrow(yrbss))
ME## [1] 0.003051546
Using the infer package, calculate confidence intervals for two other categorical variables (you’ll need to decide which level to call “success”, and report the associated margins of error. Interpret the interval in context of the data. It may be helpful to create new data sets for each of the two countries first, and then use these data sets to construct the confidence intervals).
table(yrbss$physically_active_7d)##
## 0 1 2 3 4 5 6 7
## 2172 962 1270 1451 1265 1728 840 3622
p <- yrbss %>%filter( physically_active_7d > 6) %>%nrow()/nrow(yrbss)
p## [1] 0.2666569
ME<-1.96*sqrt(p*(1-p)/nrow(yrbss))
ME## [1] 0.007436836
# calculate confidence interval for population proportion who exercises everyday
c(p - ME, p + ME) ## [1] 0.2592200 0.2740937
table(yrbss$strength_training_7d)##
## 0 1 2 3 4 5 6 7
## 3632 1012 1305 1468 1059 1333 513 2085
p <- yrbss %>%filter( strength_training_7d < 5) %>%nrow()/nrow(yrbss)
p## [1] 0.6240153
ME<-1.96*sqrt(p*(1-p)/nrow(yrbss))
ME## [1] 0.008145935
# calculate confidence interval for population proportion who exercises less than 5 days per week.
c(p - ME, p + ME) ## [1] 0.6158694 0.6321612
Describe the relationship between p and me. Include the margin of error vs. population proportion plot you constructed in your answer. For a given sample size, for which value of p is margin of error maximized?
The plot is symmetric and same shape. we can see that the margin of error has a maximum amount at 0.5 population proportion. The margin of error goes to zero when population proportion is at 0 and 1. As the sample size is going down, the margin of error increases exponentially.
n <- 1000
p <- seq(from = 0, to = 1, by = 0.01)
me <- 2 * sqrt(p * (1 - p)/n)
dd <- data.frame(p = p, me = me)
ggplot(data = dd, aes(x = p, y = me)) +
geom_line() +
labs(x = "Population Proportion", y = "Margin of Error")n <- 100
p <- seq(from = 0, to = 1, by = 0.01)
me <- 2 * sqrt(p * (1 - p)/n)
dd <- data.frame(p = p, me = me)
ggplot(data = dd, aes(x = p, y = me)) +
geom_line() +
labs(x = "Population Proportion", y = "Margin of Error")Describe the sampling distribution of sample proportions at n = 300 and p = 0.1. Be sure to note the center, spread and shape.
Answer: It shows normal distribution with reps = 5000 at the center at 0.1 or p value.
n <- 300
p <- 0.1
reps = 5000
sample.proportions <- replicate(reps, rbinom(1,n,p)) / n
hist(sample.proportions,
col = "pink",
freq = F,
breaks = 25)
curve(dnorm(x, mean=p, sd=sqrt(p*(1-p)/n)),
col = "blue",
lwd = "2",
add = T)Keep n constant and change p. How does the shape, center and spread of the sampling distribution vary as p changes. You might want to adjust min and max for the x-axis for a better view of the distribution.
Answer: The new histogram is more normally distributed with reps = 5000 at the center at 0.5 or new p value.
n <- 300
p <- 0.5
reps = 5000
sample.proportions <- replicate(reps, rbinom(1,n,p)) / n
hist(sample.proportions,
col = "pink",
freq = F,
breaks = 25)
curve(dnorm(x, mean=p, sd=sqrt(p*(1-p)/n)),
col = "blue",
lwd = "2",
add = T)Now also change n. How does n appear to affect the distribution of p_hat?
Answer: Increasing sample size n affects the distribution of p-hat smoother.
n <- 1000
p <- 0.5
reps = 5000
sample.proportions <- replicate(reps, rbinom(1,n,p)) / n
hist(sample.proportions,
col = "pink",
freq = F,
breaks = 25)
curve(dnorm(x, mean=p, sd=sqrt(p*(1-p)/n)),
col = "blue",
lwd = "2",
add = T)
### Exercise 9 Is there convincing evidence that those who sleep 10+
hours per day are more likely to strength train every day of the week?
As always, write out the hypotheses for any tests you conduct and
outline the status of the conditions for inference. If you find a
significant difference, also quantify this difference with a confidence
interval.
yrbss %>%filter(school_night_hours_sleep >= 10 & strength_training_7d > 6) %>%nrow()/nrow(yrbss)## [1] 0.1389973
Let’s say there has been no difference in likeliness to strength train every day of the week for those who sleep 10+ hours. What is the probability that you could detect a change (at a significance level of 0.05) simply by chance? Hint: Review the definition of the Type 1 error.
There would be a 5% chance of detecting a change. A type 1 error is a false positive that happens when a researcher rejects a null hypothesis that is actually true in the population.
Suppose you are hired by the local government to estimate the proportion of residents that attend a religious service on a weekly basis. According to the guidelines, the estimate must have a margin of error no greater than 1% with 95% confidence. You have no idea what to expect for p. How many people would you have to sample to ensure that you are within the guidelines? Hint: Refer to your plot of the relationship between p and margin of error. This question does not require using a dataset.
me <- 0.01
p <- 0.1
n <- 1.96 ^ 2 *p*(1-p)/ me ^ 2
n## [1] 3457.44
me <- 0.01
p <- 0.5
n <- 1.96 ^ 2 *p*(1-p)/ me ^ 2
n## [1] 9604