title: “Lab 6: Inference for Categorical Data”
author: “Evan McLaughlin”
date: “10.11.2020”
knitr::opts_chunk$set(eval = TRUE, results = FALSE, fig.show = "show", message = FALSE)

library(tidyverse)
library(openintro)
library(ggplot2)
library(dplyr)
library(infer)
library(trelliscopejs)
count_yrbss <- count(yrbss)
count_yrbss
yrbss[1:250,]

Exercise 1

What are the counts within each category for the number of days these students have texted while driving within the past 30 days?

4145
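
The figure above is a single total, while the question asks for a count per response category. A minimal sketch of a per-category tally follows; the `responses` vector is made-up illustration data standing in for `yrbss$text_while_driving_30d`, not the survey itself.

```r
# Hypothetical responses standing in for yrbss$text_while_driving_30d
responses <- c("0", "30", "did not drive", "0", "1-2", NA, "30", "0")

# Tally each category; useNA = "ifany" keeps missing values as their own entry
category_counts <- table(responses, useNA = "ifany")
category_counts
```

With the tidyverse loaded, `yrbss %>% count(text_while_driving_30d)` should give the equivalent table for the real data.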

Exercise 2

What is the proportion of people who have texted while driving every day in the past 30 days and never wear helmets?

The proportion is about 3.4%.

text_drive <- yrbss %>%
  # Caution: "NA" here is a string, so rows with a missing value (NA) are not dropped
  filter(!text_while_driving_30d %in% c("did not drive", "NA", "0"))

count(text_drive)

no_helmet <- yrbss %>%
  filter(helmet_12m == "never")

no_helmet <- no_helmet %>%
  mutate(text_ind = ifelse(text_while_driving_30d == "30", "yes", "no"))

bad_citizen <- no_helmet %>%
  filter(text_ind == "yes")

# Share of all respondents who never wear a helmet and texted on all 30 days
bc_prop <- nrow(bad_citizen) / nrow(yrbss)

bc_prop

no_helmet %>%
  specify(response = text_ind, success = "yes") %>%
  generate(reps = 1000, type = "bootstrap") %>%
  calculate(stat = "prop") %>%
  get_ci(level = 0.95)
## Warning: Removed 474 rows containing missing values.
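
The infer pipeline above resamples the data with replacement, and `get_ci()` reports a percentile interval by default. A rough base R sketch of the same idea follows. It runs on a made-up indicator vector rather than the survey data, and the 70/930 split is an assumption chosen only for illustration.

```r
set.seed(2020)  # fixed seed so the resampling is reproducible

# Hypothetical yes/no indicator standing in for no_helmet$text_ind
text_ind <- c(rep("yes", 70), rep("no", 930))

# Resample with replacement and record the proportion of "yes" each time
boot_props <- replicate(1000, mean(sample(text_ind, replace = TRUE) == "yes"))

# Percentile interval: the middle 95% of the bootstrap proportions
ci <- quantile(boot_props, probs = c(0.025, 0.975))
ci
```

The interval should bracket the observed proportion of 0.07 in the made-up vector, with its width shrinking as the vector gets longer.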

Exercise 3

What is the margin of error for the estimate of the proportion of non-helmet wearers that have texted while driving each day for the past 30 days based on this survey?

The margin of error is about 0.0112.

n <- 1000
# 95% margin of error: 1.96 standard errors of the proportion
ME <- 1.96 * sqrt(bc_prop * (1 - bc_prop) / n)

ME
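
The calculation above plugs in n = 1000. In the margin of error formula, n is the number of observations behind the estimate, so a helper that takes both the proportion and n as arguments makes it easy to try other sample sizes. The function name and the example inputs below are assumptions for illustration, not values from the survey.

```r
# Hypothetical helper: margin of error for a sample proportion
margin_of_error <- function(p_hat, n, conf_level = 0.95) {
  z_star <- qnorm(1 - (1 - conf_level) / 2)  # about 1.96 for 95%
  z_star * sqrt(p_hat * (1 - p_hat) / n)
}

# Illustrative inputs only: p_hat = 0.5 with 100 observations
margin_of_error(p_hat = 0.5, n = 100)  # about 0.098
```

For the survey estimate, `n` could be taken from the non-missing rows, for example `sum(!is.na(no_helmet$text_ind))`.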

Exercise 4

Using the infer package, calculate confidence intervals for two other categorical variables (you’ll need to decide which level to call “success”), and report the associated margins of error. Interpret the interval in context of the data. It may be helpful to create new data sets for each of the two variables first, and then use these data sets to construct the confidence intervals.

I examined the proportion of Asian Americans that are physically active more than three days per week, per guidelines from the USDA.

asian_count <- yrbss %>%
  filter(race == "Asian") 

asian_count <- asian_count %>%
  mutate(exercise_ind = ifelse(physically_active_7d > 3, "yes", "no"))

asian_count %>%
  specify(response = exercise_ind, success = "yes") %>%
  generate(reps = 1000, type = "bootstrap") %>%
  calculate(stat = "prop") %>%
  get_ci(level = 0.95)
## Warning: Removed 9 rows containing missing values.
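
The exercise also asks for the margin of error. For a roughly symmetric interval, this can be read off as half the interval width. A small sketch follows, where `ci_margin` is a hypothetical helper and the bounds are placeholders rather than results from the data.

```r
# Hypothetical helper: half-width of a confidence interval
ci_margin <- function(lower_ci, upper_ci) {
  (upper_ci - lower_ci) / 2
}

# Placeholder bounds for illustration, not output from the survey
ci_margin(lower_ci = 0.40, upper_ci = 0.50)  # about 0.05
```

In recent versions of infer, `get_ci()` returns columns named `lower_ci` and `upper_ci`, so the same arithmetic can be applied to its output.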

Exercise 5

Describe the relationship between p and me. Include the margin of error vs. population proportion plot you constructed in your answer. For a given sample size, for which value of p is margin of error maximized?

The margin of error depends on the proportion through the term p(1 − p). For a fixed sample size it is 0 at p = 0, rises to its maximum at p = 0.5, and falls symmetrically back to 0 at p = 1.

dd <- data.frame(p = bc_prop, me = ME)
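
The data frame above holds only the single estimate, so it cannot show how the margin of error changes with p. The sketch below evaluates the formula over a grid of proportions, keeping n = 1000 from the earlier step. The grid spacing of 0.01 is an arbitrary choice.

```r
n <- 1000                               # sample size used in the earlier step
p <- seq(from = 0, to = 1, by = 0.01)   # grid of candidate proportions
me <- 1.96 * sqrt(p * (1 - p) / n)      # margin of error at each p

dd <- data.frame(p = p, me = me)

# Proportion at which the margin of error peaks
dd$p[which.max(dd$me)]
```

Passing `dd` to `ggplot(dd, aes(x = p, y = me)) + geom_line()` draws the curve, which peaks in the middle of the range and drops to zero at both ends.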

Exercise 6

Describe the sampling distribution of sample proportions at n=300 and p=0.1. Be sure to note the center, spread, and shape.

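Both np = 30 and n(1 − p) = 270 are well above 10, so the success-failure condition is met and the sampling distribution of the sample proportion is approximately normal. It is centered at p = 0.1, with a standard error of sqrt(0.1 × 0.9 / 300) ≈ 0.017.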

p <- 0.1
n <- 300

n*p

n*(1-p)
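
The center and spread can also be checked by simulation. This minimal sketch draws simulated samples with `rbinom()`; the 5000 repetitions and the seed are arbitrary choices.

```r
set.seed(300)  # fixed seed for reproducibility

p <- 0.1
n <- 300

# Draw 5000 samples of size n and convert success counts to proportions
p_hats <- rbinom(5000, size = n, prob = p) / n

mean(p_hats)             # center: should sit close to p
sd(p_hats)               # spread: should sit close to the value below
sqrt(p * (1 - p) / n)    # theoretical standard error
```

Calling `hist(p_hats)` afterwards gives a quick look at the shape of the distribution.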

Exercise 7

Keep n constant and change p. How does the shape, center, and spread of the sampling distribution vary as p changes? You might want to adjust min and max for the x-axis for a better view of the distribution.

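With p = 0.5, np and n(1 − p) both equal 150, still well above 10, so the distribution remains approximately normal. The center moves to 0.5 and the spread widens: the standard error sqrt(0.5 × 0.5 / 300) ≈ 0.029 is larger than the roughly 0.017 obtained at p = 0.1.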

p <- 0.5
n <- 300

n*p

n*(1-p)

Exercise 8

Now also change n. How does n appear to affect the distribution of p̂?

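With n = 200 and p = 0.5, np and n(1 − p) both equal 100, which is lower than before but still well above 10, so the normal approximation holds. The smaller sample increases the spread: the standard error rises to sqrt(0.5 × 0.5 / 200) ≈ 0.035, so the estimate is less precise.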

p <- 0.5
n <- 200

n*p

n*(1-p)
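
The three settings tried in Exercises 6 to 8 can be lined up side by side to compare the success-failure counts and the standard error. This is a sketch in base R; the data frame and column names are chosen here for illustration.

```r
# The three (n, p) settings tried in Exercises 6 to 8
settings <- data.frame(n = c(300, 300, 200), p = c(0.1, 0.5, 0.5))

settings$successes <- settings$n * settings$p         # expected successes, np
settings$failures  <- settings$n * (1 - settings$p)   # expected failures, n(1 - p)
settings$se <- sqrt(settings$p * (1 - settings$p) / settings$n)

settings
```

The standard error grows both when p moves toward 0.5 and when n shrinks, while every expected count stays above 10.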