In this lab, we will explore and visualize the data using the tidyverse suite of packages, and perform statistical inference using infer. The data can be found in the companion package for OpenIntro resources, openintro.
Let’s load the packages.
library(tidyverse)
library(openintro)
library(infer)
rm(list = ls())
data("yrbss")You will be analyzing the same dataset as in the previous lab, where you delved into a sample from the Youth Risk Behavior Surveillance System (YRBSS) survey, which uses data from high schoolers to help discover health patterns. The dataset is called yrbss.
ANSWER See the table below for the counts. Note the 918 NA values; I will remove those rows for this analysis.
yrbss %>%
  group_by(text_while_driving_30d) %>%
  summarise(count = n())
## # A tibble: 9 x 2
## text_while_driving_30d count
## <chr> <int>
## 1 0 4792
## 2 1-2 925
## 3 10-19 373
## 4 20-29 298
## 5 3-5 493
## 6 30 827
## 7 6-9 311
## 8 did not drive 4646
## 9 <NA> 918
Remember that you can use filter to limit the dataset to just non-helmet wearers. Here, we will name the dataset no_helmet.
I will also filter out the NA values.
no_helmet <- yrbss %>%
  filter(helmet_12m == "never" & !is.na(text_while_driving_30d))
Also, it may be easier to calculate the proportion if you create a new variable that specifies whether the individual has texted every day while driving over the past 30 days or not. We will call this variable text_ind.
no_helmet <- no_helmet %>%
  mutate(text_ind = ifelse(text_while_driving_30d == "30", "yes", "no"))
no_helmet_tb <- no_helmet %>%
  group_by(text_ind) %>%
  summarise(count = n()) %>%
  mutate(freq = count / sum(count))
no_helmet_tb
## # A tibble: 2 x 3
## text_ind count freq
## <chr> <int> <dbl>
## 1 no 6040 0.929
## 2 yes 463 0.0712
ANSWER The proportion of non-helmet wearers who texted while driving every day in the last 30 days is 7.12%.
When summarizing the YRBSS, the Centers for Disease Control and Prevention seeks insight into the population parameters. To do this, you can answer the question, “What proportion of people in your sample reported that they have texted while driving each day for the past 30 days?” with a statistic; while the question “What proportion of people on earth have texted while driving each day for the past 30 days?” is answered with an estimate of the parameter.
The inferential tools for estimating population proportion are analogous to those used for means in the last chapter: the confidence interval and the hypothesis test.
no_helmet_ci <- no_helmet %>%
  specify(response = text_ind, success = "yes") %>%
  generate(reps = 1000, type = "bootstrap") %>%
  calculate(stat = "prop") %>%
  get_ci(level = 0.95)
no_helmet_ci
## # A tibble: 1 x 2
## lower_ci upper_ci
## <dbl> <dbl>
## 1 0.0646 0.0773
Note that since the goal is to construct an interval estimate for a proportion, it's necessary to include the success argument within specify, which identifies the proportion of non-helmet wearers that have consistently texted while driving the past 30 days in this example, and to set stat to "prop" within calculate, signaling that you are doing inference on a proportion.
ANSWER Let's calculate the margin of error using the standard error formula \(SE = \sqrt{p(1-p)/n}\). From the proportion calculation we know that \(n = 6040 + 463 = 6503\), \(p = 0.0712\), and \(1 - p = 0.929\). For a 95% confidence level, the appropriate z is 1.96, so the margin of error is \(ME = 1.96 \times SE = 1.96 \times \sqrt{p(1-p)/n}\).
# Using the SE formula
1.96 * sqrt(no_helmet_tb$freq[2] * no_helmet_tb$freq[1] / sum(no_helmet_tb$count))
## [1] 0.006250207
The margin of error is 0.00625. If we instead use the bootstrap simulation above, we get something very close: half the width of the bootstrap interval gives a margin of error of ~0.0064 (see below).
(no_helmet_ci$upper_ci - no_helmet_ci$lower_ci) / 2
## [1] 0.00638167
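As a sanity check, we can also build the normal-approximation interval directly as \(\hat{p} \pm ME\), reusing the quantities computed above (a sketch; p_hat and me_formula are new names introduced here):
# Normal-approximation 95% CI: p_hat +/- 1.96 * SE
p_hat <- no_helmet_tb$freq[2]
me_formula <- 1.96 * sqrt(p_hat * (1 - p_hat) / sum(no_helmet_tb$count))
c(lower = p_hat - me_formula, upper = p_hat + me_formula)
The endpoints should land very close to the bootstrap interval above.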
Using the infer package, calculate confidence intervals for two other categorical variables (you'll need to decide which level to call "success"), and report the associated margins of error. Interpret the interval in context of the data. It may be helpful to create new data sets for each of the two variables first, and then use these data sets to construct the confidence intervals.
ANSWER
We will estimate the following two proportions:
a) Proportion of MALES who are physically active 4 or more days a week
b) Proportion of FEMALES who do strength training 0 or 1 days per week
For a), let's do some exploration first to see what the data looks like.
yrbss %>%
  filter(gender == "male") %>%
  group_by(physically_active_7d) %>%
  summarise(counts = n())
## # A tibble: 9 x 2
## physically_active_7d counts
## <int> <int>
## 1 0 788
## 2 1 338
## 3 2 486
## 4 3 661
## 5 4 646
## 6 5 895
## 7 6 475
## 8 7 2484
## 9 NA 177
The counts are fairly even across days, except at 7 days, where the frequency is much higher than I expected. We also have 177 NA values that we will need to handle in our analysis.
males_pa7d <- yrbss %>%
  filter(gender == "male" & !is.na(physically_active_7d)) %>%  # drop the NAs noted above
  mutate(active_4d = ifelse(physically_active_7d >= 4, "yes", "no"))
Now let's run bootstrapping to get our CI.
males_pa7d_ci <- males_pa7d %>%
  specify(response = active_4d, success = "yes") %>%
  generate(reps = 1000, type = "bootstrap") %>%
  calculate(stat = "prop") %>%
  get_ci(level = 0.95)
males_pa7d_ci
## # A tibble: 1 x 2
## lower_ci upper_ci
## <dbl> <dbl>
## 1 0.653 0.676
We get a pretty tight 95% CI that goes from ~0.65 to ~0.68. In context: we are 95% confident that between roughly 65% and 68% of male high schoolers in the surveyed population are physically active 4 or more days a week.
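The associated margin of error can be taken as half the width of the bootstrap interval, mirroring the earlier calculation:
# Margin of error as half the CI width; roughly (0.676 - 0.653) / 2 = 0.012
(males_pa7d_ci$upper_ci - males_pa7d_ci$lower_ci) / 2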
For b), let's do some exploration first to see what the data looks like.
yrbss %>%
  filter(gender == "female") %>%
  group_by(strength_training_7d) %>%
  summarise(counts = n())
## # A tibble: 9 x 2
## strength_training_7d counts
## <int> <int>
## 1 0 2284
## 2 1 583
## 3 2 741
## 4 3 691
## 5 4 450
## 6 5 546
## 7 6 196
## 8 7 587
## 9 NA 543
Apart from the concentration at 0 days, which is much higher than I expected, the counts are fairly even. We also have 543 NA values that we will need to handle in our analysis.
females_st7d <- yrbss %>%
  filter(gender == "female" & !is.na(strength_training_7d)) %>%  # drop the NAs noted above
  mutate(strength_1d = ifelse(strength_training_7d <= 1, "yes", "no"))
Now let's run bootstrapping to get our CI.
females_st7d_ci <- females_st7d %>%
  specify(response = strength_1d, success = "yes") %>%
  generate(reps = 1000, type = "bootstrap") %>%
  calculate(stat = "prop") %>%
  get_ci(level = 0.95)
females_st7d_ci
## # A tibble: 1 x 2
## lower_ci upper_ci
## <dbl> <dbl>
## 1 0.459 0.484
We get a pretty tight 95% CI that goes from ~0.46 to ~0.48. In context: we are 95% confident that between roughly 46% and 48% of female high schoolers in the surveyed population do strength training on at most 1 day a week.
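Again, the margin of error is half the interval width:
# Margin of error as half the CI width; roughly (0.484 - 0.459) / 2 = 0.0125
(females_st7d_ci$upper_ci - females_st7d_ci$lower_ci) / 2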
Imagine you’ve set out to survey 1000 people on two questions: are you at least 6-feet tall? and are you left-handed? Since both of these sample proportions were calculated from the same sample size, they should have the same margin of error, right? Wrong! While the margin of error does change with sample size, it is also affected by the proportion.
Think back to the formula for the standard error: \(SE = \sqrt{p(1-p)/n}\). This is then used in the formula for the margin of error for a 95% confidence interval:
\[ ME = 1.96\times SE = 1.96\times\sqrt{p(1-p)/n} \,. \] Since the population proportion \(p\) is in this \(ME\) formula, it should make sense that the margin of error is in some way dependent on the population proportion. We can visualize this relationship by creating a plot of \(ME\) vs. \(p\).
Since sample size is irrelevant to this discussion, let’s just set it to some value (\(n = 1000\)) and use this value in the following calculations:
n <- 1000
The first step is to make a variable p that is a sequence from 0 to 1 with each number incremented by 0.01. You can then create a variable of the margin of error (me) associated with each of these values of p using the familiar approximate formula (\(ME = 2 \times SE\)).
p <- seq(from = 0, to = 1, by = 0.01)
me <- 2 * sqrt(p * (1 - p) / n)
Lastly, you can plot the two variables against each other to reveal their relationship. To do so, we need to first put these variables in a data frame that you can call in the ggplot function.
dd <- data.frame(p = p, me = me)
ggplot(data = dd, aes(x = p, y = me)) +
  geom_line() +
  labs(x = "Population Proportion", y = "Margin of Error")
Describe the relationship between p and me. Include the margin of error vs. population proportion plot you constructed in your answer. For a given sample size, for which value of p is margin of error maximized?
ANSWER
The margin of error follows \(ME = 2\sqrt{p(1-p)/n}\), the square root of a quadratic in \(p\): it increases as \(p\) moves from 0 toward 0.5 and decreases symmetrically after that. For a given sample size, the margin of error is maximized at \(p = 0.5\).
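We can confirm the maximizer numerically from the p and me vectors computed above (a quick check, not part of the original lab code):
p[which.max(me)]  # value of p where me peaks: 0.5
max(me)           # maximum me at n = 1000: 2 * sqrt(0.25 / 1000), about 0.0316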
We have emphasized that you must always check conditions before making inference. For inference on proportions, the sample proportion can be assumed to be nearly normal if it is based upon a random sample of independent observations and if both \(np \geq 10\) and \(n(1 - p) \geq 10\). This rule of thumb is easy enough to follow, but it makes you wonder: what’s so special about the number 10?
The short answer is: nothing. You could argue that you would be fine with 9 or that you really should be using 11. The “best” value for such a rule of thumb is, at least to some degree, arbitrary. However, when \(np\) and \(n(1-p)\) both reach 10, the sampling distribution is sufficiently normal to use confidence intervals and hypothesis tests that are based on that approximation.
You can investigate the interplay between \(n\) and \(p\) and the shape of the sampling distribution by using simulations. Play around with the following app to investigate how the shape, center, and spread of the distribution of \(\hat{p}\) change as \(n\) and \(p\) change.
ANSWER
The distribution is fairly normal, with a smallish standard deviation that creates a thin and tall distribution curve. The center, as expected, is at 0.1.
ANSWER The center of the distribution changes with p. If you push p far to the left, as with p = 0.01, you will see a distribution centered at that p with a smaller (thinner) standard error. As you increase p, the center moves and the standard error increases, peaking at p = 0.5. Beyond that, increasing p keeps moving the center, but the standard error starts decreasing.
ANSWER Low values of n create a "less normal" distribution, also with a larger standard error. As we increase n, the distribution becomes more normal looking and the standard error becomes smaller and smaller, producing a thinner normal curve.
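Since the app itself can't be reproduced here, below is a minimal simulation sketch of the sampling distribution of \(\hat{p}\); the values of n_sim, p_sim, and the number of replications are arbitrary choices that can be varied to mimic the app's sliders:
# Draw many samples of size n_sim from a population with true proportion p_sim
set.seed(42)
n_sim <- 300
p_sim <- 0.1
p_hats <- rbinom(5000, size = n_sim, prob = p_sim) / n_sim
ggplot(data.frame(p_hats), aes(x = p_hats)) +
  geom_histogram(binwidth = 0.01) +
  labs(x = "Sample proportion", y = "Count")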
For some of the exercises below, you will conduct inference comparing two proportions. In such cases, you have a response variable that is categorical, and an explanatory variable that is also categorical, and you are comparing the proportions of success of the response variable across the levels of the explanatory variable. This means that when using infer, you need to include both variables within specify.
ANSWER
We are going to use TWO methods to solve this. First we will use a CHI-SQUARED TEST FOR INDEPENDENCE, and then we will run a test of the difference in proportions.
Since the question asks whether sleeping 10+ hours has an impact on training 7 days a week, I decided to test whether sleeping 10+ hours is independent of training 7 days. If, at the 0.05 significance level, we cannot reject H0 that they are independent, we retain H0; otherwise we reject H0 in favor of Ha that they are not independent.
Let's create two features stating whether the person trained 7 days and whether the person slept 10+ hours, coding each as a simple "yes"/"no".
sleep10train7 <- yrbss %>%
  filter(!is.na(school_night_hours_sleep) & !is.na(strength_training_7d)) %>%
  mutate(sleep10 = ifelse(school_night_hours_sleep == "10+", "yes", "no")) %>%
  mutate(train7 = ifelse(strength_training_7d == 7, "yes", "no"))
Now we can generate a frequency table (2x2) of our results.
freq_table <- table(sleep10train7$sleep10, sleep10train7$train7)
names(dimnames(freq_table)) <- c("Sleep 10+", "Train 7 Days")
freq_table
## Train 7 Days
## Sleep 10+ no yes
## no 9949 1958
## yes 228 84
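For a quick look at the within-row shares before testing, prop.table with margin = 1 gives the proportion training 7 days within each sleep group:
round(prop.table(freq_table, margin = 1), 3)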
Finally, we can perform a CHI-SQUARED TEST FOR INDEPENDENCE.
chisq_test(sleep10train7, train7 ~ sleep10)
## # A tibble: 1 x 3
## statistic chisq_df p_value
## <dbl> <int> <dbl>
## 1 23.2 1 0.00000143
# Alternatively, base R: chisq.test(freq_table)
The p-value is 0.00000143, much smaller than the 0.05 significance level we established. Therefore we can reject H0 that the variables are independent, and conclude that the evidence shows sleeping 10+ hours is associated with training 7 days per week.
The thought process for the second method is as follows: we treat sleeps 10+ hours as the explanatory variable and trains 7 days a week as the response variable. We then calculate two proportions: a) the proportion of people who DID NOT SLEEP 10+ hours and still trained 7 days a week, and b) the proportion of people who SLEPT 10+ hours and trained 7 days.
If sleep has no impact on training, the difference in proportions would be zero (within our 95% confidence interval).
We reuse the sleep10train7 data frame created above, which already contains the "yes"/"no" indicators sleep10 and train7. Let's calculate the two proportions we need to evaluate.
prop_s10_t7 <- freq_table[2,2] / (freq_table[2,1] + freq_table[2,2])
prop_n10_t7 <- freq_table[1,2] / (freq_table[1,1] + freq_table[1,2])
prop_s10_t7
## [1] 0.2692308
prop_n10_t7
## [1] 0.1644411
diff_props <- prop_n10_t7 - prop_s10_t7
diff_props
## [1] -0.1047897
The difference in proportions in our sample is -0.105. That is not zero, but we need to establish a 95% confidence interval to check whether 0 is included within the interval.
sleep10train7 %>%
  specify(response = train7, explanatory = sleep10, success = "yes") %>%
  generate(reps = 1000, type = "bootstrap") %>%
  calculate(stat = "diff in props") %>%
  get_ci(level = 0.95)
## # A tibble: 1 x 2
## lower_ci upper_ci
## <dbl> <dbl>
## 1 -0.153 -0.0516
The confidence interval is -0.153 to -0.0516. We see that ZERO is NOT PART of the confidence interval, so we can REJECT H0 that the difference is zero, again finding that SLEEPING 10+ hours is associated with TRAINING 7 days a week.
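An equivalent check with infer is a permutation test of the same difference in proportions. A sketch, with the order argument set to match the direction used above (the "no" sleep group minus the "yes" sleep group); the simulated p-value will vary slightly from run to run:
# Observed difference in proportions
obs_diff <- sleep10train7 %>%
  specify(response = train7, explanatory = sleep10, success = "yes") %>%
  calculate(stat = "diff in props", order = c("no", "yes"))
# Null distribution under independence, then a two-sided p-value
sleep10train7 %>%
  specify(response = train7, explanatory = sleep10, success = "yes") %>%
  hypothesize(null = "independence") %>%
  generate(reps = 1000, type = "permute") %>%
  calculate(stat = "diff in props", order = c("no", "yes")) %>%
  get_p_value(obs_stat = obs_diff, direction = "two-sided")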
ANSWER
At an alpha of 0.05, the probability of a Type 1 error, that is, rejecting the null hypothesis that sleep has no impact on training when that null is actually true, is equal to 5%.
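A small simulation illustrating this idea (a sketch: both groups are drawn from the same true proportion, so the null is true by construction, and we count how often a two-sample proportion test rejects at alpha = 0.05; the group size of 300 and proportion of 0.2 are arbitrary):
set.seed(1)
rejections <- replicate(2000, {
  x1 <- rbinom(1, size = 300, prob = 0.2)  # group 1, true p = 0.2
  x2 <- rbinom(1, size = 300, prob = 0.2)  # group 2, same true p, so H0 holds
  prop.test(c(x1, x2), c(300, 300))$p.value < 0.05
})
mean(rejections)  # should land close to 0.05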
ANSWER
We know that \(ME = z \times \sqrt{p(1-p)/n}\). The worst case scenario is \(p = 0.5\), and our target margin of error is 0.01, so we need to solve for \(n\) in the equation \(0.01 = 1.96 \times \sqrt{0.25/n}\).
Solving for \(n\) gives us a required sample size of 9,604.
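The same calculation in R, rounding up to the next whole respondent:
# Required n for ME = 0.01 at 95% confidence, worst case p = 0.5
ceiling((1.96 / 0.01)^2 * 0.25)  # 9604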