Lab 6 Data 606

library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.3     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.2     ✔ tibble    3.2.1
## ✔ lubridate 1.9.2     ✔ tidyr     1.3.0
## ✔ purrr     1.0.1     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(openintro)

## Loading required package: airports
## Loading required package: cherryblossom
## Loading required package: usdata

library(infer)

data('yrbss', package='openintro')
seed <- 1234

yrbss %>%
  count(text_while_driving_30d, sort=TRUE)

## # A tibble: 9 × 2
##   text_while_driving_30d     n
##   <chr>                  <int>
## 1 0                       4792
## 2 did not drive           4646
## 3 1-2                      925
## 4 <NA>                     918
## 5 30                       827
## 6 3-5                      493
## 7 10-19                    373
## 8 6-9                      311
## 9 20-29                    298

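#1. What are the counts within each category for the amount of days these students have texted while driving within the past 30 days? Answer: 4,792 students did not text while driving, 4,646 did not drive, 925 texted on 1-2 days, 493 on 3-5 days, 311 on 6-9 days, 373 on 10-19 days, 298 on 20-29 days, and 827 on all 30 days; 918 responses are missing (NA).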

#2. What is the proportion of people who have texted while driving every day in the past 30 days and never wear helmets? Answer: Among the 6,503 students who never wear a helmet and answered the texting question, 463 texted while driving every day, a proportion of 463/6503, or about 7.12%.

danger <- yrbss %>%
  filter(helmet_12m=="never") %>%
  filter(!is.na(text_while_driving_30d)) %>%
  mutate(text_ind_everyday = ifelse(text_while_driving_30d == "30", "yes", "no"))

danger %>%
  count(text_ind_everyday)

## # A tibble: 2 × 2
##   text_ind_everyday     n
##   <chr>             <int>
## 1 no                 6040
## 2 yes                 463
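As a quick arithmetic check, the proportion can be recomputed from the two counts in the table above. This is a minimal sketch in base R; the variable names are illustrative and not part of the lab.

```r
# Counts copied from the table above (students who never wear a helmet,
# with a non-missing answer to the texting question)
n_yes <- 463   # texted while driving on all 30 days
n_no  <- 6040  # every other non-missing answer, including "did not drive"

# Proportion of never-helmet students who texted every day
prop_everyday <- n_yes / (n_yes + n_no)
round(prop_everyday, 4)
```

Note that the denominator is the full group of 6,503 students, not only the 6,040 in the "no" row.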

data('yrbss', package='openintro')
no_helmet <- yrbss %>%
  filter(helmet_12m == "never")

Inference on proportions:

#3. What is the margin of error for the estimate of the proportion of non-helmet wearers that have texted while driving each day for the past 30 days based on this survey?

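Answer: The 95% bootstrap confidence interval is 6.5% to 7.7%, so the margin of error is half its width, about 0.6 percentage points.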

danger %>%
 specify(response = text_ind_everyday, success = "yes") %>%
 generate(reps = 1000, type = "bootstrap") %>%
 calculate(stat = "prop") %>%
 get_ci(level = 0.95)

## # A tibble: 1 × 2
##   lower_ci upper_ci
##      <dbl>    <dbl>
## 1   0.0650   0.0770
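The margin of error is half the width of the interval above. The sketch below computes it from the printed bounds and, as a cross-check, from the normal-approximation formula using the counts 463 and 6,503 from question 2. Variable names are illustrative, and the formula-based value is an approximation rather than the bootstrap result itself.

```r
# Bounds copied from the bootstrap output above
lower_ci <- 0.0650
upper_ci <- 0.0770

# Margin of error as half the interval width
me_boot <- (upper_ci - lower_ci) / 2

# Cross-check with the normal approximation: z * sqrt(p(1 - p) / n)
p_hat <- 463 / 6503
me_formula <- qnorm(0.975) * sqrt(p_hat * (1 - p_hat) / 6503)

c(bootstrap = me_boot, formula = me_formula)
```

Both approaches give a margin of error of roughly 0.006, or 0.6 percentage points.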

#4. Using the infer package, calculate confidence intervals for two other categorical variables (you’ll need to decide which level to call “success”), and report the associated margins of error. Interpret the interval in context of the data. It may be helpful to create new data sets for each of the two countries first, and then use these data sets to construct the confidence intervals.

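The two variables I picked were physically_active_7d and strength_training_7d. Defining success as being physically active and strength training on more than 3 of the past 7 days, the 95% confidence interval is 31.3% to 32.9%, a margin of error of about 0.8 percentage points. We can be 95% confident that the true proportion of students in the population who do both activities on more than 3 days a week falls in this range. The denominator is all 13,583 rows of yrbss, so students with missing values count as failures.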

glimpse(yrbss)

## Rows: 13,583
## Columns: 13
## $ age                      <int> 14, 14, 15, 15, 15, 15, 15, 14, 15, 15, 15, 1…
## $ gender                   <chr> "female", "female", "female", "female", "fema…
## $ grade                    <chr> "9", "9", "9", "9", "9", "9", "9", "9", "9", …
## $ hispanic                 <chr> "not", "not", "hispanic", "not", "not", "not"…
## $ race                     <chr> "Black or African American", "Black or Africa…
## $ height                   <dbl> NA, NA, 1.73, 1.60, 1.50, 1.57, 1.65, 1.88, 1…
## $ weight                   <dbl> NA, NA, 84.37, 55.79, 46.72, 67.13, 131.54, 7…
## $ helmet_12m               <chr> "never", "never", "never", "never", "did not …
## $ text_while_driving_30d   <chr> "0", NA, "30", "0", "did not drive", "did not…
## $ physically_active_7d     <int> 4, 2, 7, 0, 2, 1, 4, 4, 5, 0, 0, 0, 4, 7, 7, …
## $ hours_tv_per_school_day  <chr> "5+", "5+", "5+", "2", "3", "5+", "5+", "5+",…
## $ strength_training_7d     <int> 0, 0, 0, 0, 1, 0, 2, 0, 3, 0, 3, 0, 0, 7, 7, …
## $ school_night_hours_sleep <chr> "8", "6", "<5", "6", "9", "8", "9", "6", "<5"…

# Count students who were active and strength trained on more than 3 of the past 7 days
n1 <- yrbss %>%
  filter(physically_active_7d > 3 & strength_training_7d > 3) %>%
  nrow()
n <- nrow(yrbss)
p <- n1 / n
se <- sqrt(p * (1 - p) / n)
z_score <- qnorm((1 + 0.95) / 2)
me <- z_score * se

lower <- p - me
upper <- p + me
# Print the confidence interval
cat("Confidence Interval: (", lower, ", ", upper, ")\n")

## Confidence Interval: ( 0.3129189 ,  0.3286183 )

#5. Describe the relationship between p and me. Include the margin of error vs. population proportion plot you constructed in your answer. For a given sample size, for which value of p is margin of error maximized?

The margin of error (ME) depends on p through p(1 - p): it is 0 at p = 0 and p = 1, rises as p moves toward 0.5, and is symmetric around 0.5. For a given sample size, the margin of error is maximized at p = 0.5. With n = 1000 in the plot below, the maximum is 2 × sqrt(0.25/1000), about 0.032, so estimates are least precise when the population proportion is near 50% and most precise when it is near 0 or 1.

n <- 1000
p <- seq(from = 0, to = 1, by = 0.01)
me <- 2 * sqrt(p * (1 - p)/n)
dd <- data.frame(p = p, me = me)
ggplot(data = dd, aes(x = p, y = me)) + 
  geom_line() +
  labs(x = "Population Proportion", y = "Margin of Error")
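The location of the peak can also be read off numerically instead of from the plot. This sketch rebuilds the same grid under separate, illustrative variable names and finds the value of p with the largest margin of error.

```r
# Same grid and formula as the plot above, under separate names
n_plot  <- 1000
p_grid  <- seq(from = 0, to = 1, by = 0.01)
me_grid <- 2 * sqrt(p_grid * (1 - p_grid) / n_plot)

# Value of p where the margin of error peaks, and the peak itself
p_at_max <- p_grid[which.max(me_grid)]
max_me   <- max(me_grid)

c(p_at_max = p_at_max, max_me = max_me)
```

The peak sits at p = 0.5, where the margin of error is about 0.032 for n = 1000.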

#6. Describe the sampling distribution of sample proportions at n=300 and p=0.1. Be sure to note the center, spread, and shape. Answer: At n=300 and p=0.1, the sampling distribution of sample proportions is centered at 0.1 and has a standard error of about 0.017, so one standard error on either side of the center spans roughly 0.083 to 0.117. Because np = 30 and n(1-p) = 270 are both at least 10, the shape is approximately bell-shaped and symmetric.

p <- 0.1
n <- 300


(p*(1-p)/n)^.5

## [1] 0.01732051

.1-(p*(1-p)/n)^.5

## [1] 0.08267949

.1+(p*(1-p)/n)^.5

## [1] 0.1173205
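The center and spread computed above can be checked by simulation. The sketch below draws many samples of size 300 from a population with p = 0.1 using `rbinom()` and summarizes the resulting sample proportions. The number of replicates (5,000) and the variable names are arbitrary choices for illustration.

```r
set.seed(1234)

p_true <- 0.1
n_samp <- 300

# Each draw is the number of successes in one sample of size n_samp;
# dividing by n_samp turns it into a sample proportion
p_hats <- rbinom(5000, size = n_samp, prob = p_true) / n_samp

sim_center <- mean(p_hats)  # should be close to p_true
sim_spread <- sd(p_hats)    # should be close to sqrt(p(1 - p) / n)

c(center = sim_center, spread = sim_spread)
```

A histogram of `p_hats` (for example `hist(p_hats)`) gives a visual check of the shape.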

#7. Keep n constant and change p. How does the shape, center, and spread of the sampling distribution vary as p changes. You might want to adjust min and max for the x-axis for a better view of the distribution. Answer: With n held constant, the center of the distribution moves with p. The spread, sqrt(p(1-p)/n), is largest at p = 0.5 and shrinks as p approaches 0 or 1. The shape is approximately symmetric and bell-shaped as long as np and n(1-p) are both at least 10; for p very close to 0 or 1 the distribution becomes skewed.
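To make the effect of p concrete, this sketch tabulates the standard error and the success-failure condition for a few values of p at n = 300. The chosen values of p and the column names are illustrative.

```r
n_fixed  <- 300
p_values <- c(0.01, 0.1, 0.5, 0.9, 0.99)

shape_check <- data.frame(
  p  = p_values,
  # Spread of the sampling distribution at each p
  se = sqrt(p_values * (1 - p_values) / n_fixed),
  # TRUE when both expected successes and expected failures are at least 10
  success_failure = n_fixed * p_values >= 10 & n_fixed * (1 - p_values) >= 10
)

shape_check
```

The standard error peaks at p = 0.5, and the condition fails at the two extreme values, which is where skew appears.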

#8. Now also change n. How does n appear to affect the distribution of p? Answer: Increasing the sample size (n) while keeping the population proportion (p) constant at 0.1 results in a narrower distribution of sample proportions (p̂). For example, the standard error falls from about 0.0173 at n = 300 to 0.015 at n = 400. Larger samples therefore give estimates with less variability that sit closer to the true population proportion.

p <- 0.1
n <- 400


(p*(1-p)/n)^.5

## [1] 0.015

.1-(p*(1-p)/n)^.5

## [1] 0.085

.1+(p*(1-p)/n)^.5

## [1] 0.115
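The same calculation can be wrapped in a small helper to compare several sample sizes at once. `se_prop` is a hypothetical helper name, and the sample sizes other than 300 and 400 are added only for illustration.

```r
# Standard error of a sample proportion
se_prop <- function(p, n) sqrt(p * (1 - p) / n)

n_values <- c(100, 300, 400, 1000)
se_by_n  <- se_prop(0.1, n_values)

data.frame(n = n_values, se = se_by_n)
```

Because n sits under a square root, quadrupling the sample size halves the standard error.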

#9. Is there convincing evidence that those who sleep 10+ hours per day are more likely to strength train every day of the week? As always, write out the hypotheses for any tests you conduct and outline the status of the conditions for inference. If you find a significant difference, also quantify this difference with a confidence interval.

Answer: To investigate whether students who sleep 10 or more hours per night are more likely to strength train every day of the week, the hypotheses and conditions are set out below.

The null hypothesis is that the proportion who strength train every day of the week is the same for students who sleep 10+ hours as for students who sleep less.

The alternative hypothesis is that these proportions differ. The data are assumed to come from a random, independent sample, and the success-failure condition is assumed to be met. The bootstrap interval computed below, about [0.162, 0.175] at 95% confidence, is for the proportion of all students who strength train daily, because the code uses the full yrbss data rather than the sleep subset. By itself this interval does not test the hypotheses: that requires the proportion of daily strength trainers among students who sleep 10+ hours. If a confidence interval for the difference in proportions between the two groups excludes 0, that would indicate a significant difference in the likelihood of daily strength training.

library(tidyverse)
library(openintro)
library(infer)

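# Students who sleep 10+ hours on school nights
# (this subset is not used in the interval computed below)
sleep <- yrbss  %>%
  filter(school_night_hours_sleep == "10+")

# Flag students who strength train on all 7 days, across the full data set
strengthTraining <- yrbss %>%
  mutate(text_ind = ifelse(strength_training_7d == 7, "yes", "no"))
strengthTraining %>%
  filter(!is.na(text_ind)) %>%  # drop rows with missing strength training values
  specify(response = text_ind, success = "yes") %>%
  generate(reps = 1000, type = "bootstrap") %>%
  calculate(stat = "prop") %>%
  get_ci(level = 0.95)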

## # A tibble: 1 × 2
##   lower_ci upper_ci
##      <dbl>    <dbl>
## 1    0.162    0.175
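A direct comparison of the two sleep groups could be sketched with base R's `prop.test()`. This is one possible approach and not the method used above. `compare_daily_training` is a hypothetical helper that expects a data frame with the yrbss columns `school_night_hours_sleep` and `strength_training_7d`. The small data frame `toy` contains made-up values and only demonstrates that the function runs; it says nothing about the real survey.

```r
# One-sided two-proportion test:
# H0: p_10plus = p_other   vs   HA: p_10plus > p_other
compare_daily_training <- function(df) {
  # Keep rows where both variables are observed
  d <- df[!is.na(df$school_night_hours_sleep) &
            !is.na(df$strength_training_7d), ]
  long_sleep <- d$school_night_hours_sleep == "10+"
  daily      <- d$strength_training_7d == 7

  # Successes and group sizes, 10+ sleepers first
  x <- c(sum(daily & long_sleep), sum(daily & !long_sleep))
  n <- c(sum(long_sleep), sum(!long_sleep))

  prop.test(x = x, n = n, alternative = "greater", correct = FALSE)
}

# Made-up toy data, NOT survey results: 100 long sleepers and 400 others
toy <- data.frame(
  school_night_hours_sleep = rep(c("10+", "8"), times = c(100, 400)),
  strength_training_7d     = c(rep(c(7, 0), times = c(30, 70)),
                               rep(c(7, 0), times = c(60, 340)))
)

toy_result <- compare_daily_training(toy)
toy_result$estimate
```

On the lab data the call would be `compare_daily_training(yrbss)`, and the returned object carries the p-value along with a one-sided confidence bound for the difference in proportions.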

#10. Let’s say there has been no difference in likeliness to strength train every day of the week for those who sleep 10+ hours. What is the probability that you could detect a change (at a significance level of 0.05) simply by chance? Hint: Review the definition of the Type 1 error.

Answer: If there is no real difference in the likelihood of daily strength training between those who sleep 10+ hours and those who do not, the probability of detecting a change at a significance level of 0.05 purely by chance is 0.05. This is the Type 1 error rate: the probability of rejecting a true null hypothesis, in other words a false positive.
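The Type 1 error rate can be illustrated by simulation: generate two groups with the same true proportion many times, run a two-sided z-test for a difference each time, and count how often the p-value falls below 0.05. The group size (500), the common proportion (0.17), and the number of replicates are arbitrary assumptions chosen for illustration.

```r
set.seed(1234)

reps   <- 5000
n_grp  <- 500
p_null <- 0.17  # same true proportion in both groups, so H0 is true

# Simulated success counts for the two groups
x1 <- rbinom(reps, size = n_grp, prob = p_null)
x2 <- rbinom(reps, size = n_grp, prob = p_null)

# Pooled two-proportion z-test for each replicate
pooled   <- (x1 + x2) / (2 * n_grp)
z        <- (x1 / n_grp - x2 / n_grp) / sqrt(pooled * (1 - pooled) * (2 / n_grp))
p_values <- 2 * pnorm(-abs(z))

# Share of replicates that wrongly reject H0 at the 0.05 level
type1_rate <- mean(p_values < 0.05)
type1_rate
```

The rejection rate should land close to 0.05, matching the chosen significance level.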

#11. Suppose you’re hired by the local government to estimate the proportion of residents that attend a religious service on a weekly basis. According to the guidelines, the estimate must have a margin of error no greater than 1% with 95% confidence. You have no idea what to expect for p. How many people would you have to sample to ensure that you are within the guidelines? Hint: Refer to your plot of the relationship between p and margin of error. This question does not require using a dataset.

Answer: The plot of margin of error against p shows that the margin of error is largest at p = 0.5, so with no prior idea of p the sample size should be planned for that worst case. Setting 1.96 × sqrt(0.5 × 0.5 / n) ≤ 0.01 and solving for n gives n ≥ 1.96^2 × 0.25 / 0.01^2 ≈ 9,604. Sampling at least 9,604 residents keeps the margin of error within 1% at 95% confidence whatever the true value of p.
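The required sample size follows from solving the margin of error formula for n at the worst case p = 0.5. A minimal sketch, with illustrative variable names:

```r
target_me <- 0.01          # margin of error allowed by the guidelines
p_worst   <- 0.5           # value of p that maximizes the margin of error
z_95      <- qnorm(0.975)  # critical value for 95% confidence

# Solve z * sqrt(p(1 - p) / n) <= target_me for n, rounding up
n_required <- ceiling(z_95^2 * p_worst * (1 - p_worst) / target_me^2)
n_required
```

Using the rounded multiplier 2 from the plot code instead of 1.96 would give 10,000, a slightly more conservative figure.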

2023-10-15