Inference for categorical data

Getting Started

Load packages

In this lab, we will explore and visualize the data using the tidyverse suite of packages, and perform statistical inference using infer. The data can be found in the companion package for OpenIntro resources, openintro.

Let’s load the packages.

library(tidyverse)
library(openintro)
library(infer)
rm(list = ls())
data("yrbss")

The data

You will be analyzing the same dataset as in the previous lab, where you delved into a sample from the Youth Risk Behavior Surveillance System (YRBSS) survey, which uses data from high schoolers to help discover health patterns. The dataset is called yrbss.

What are the counts within each category for the amount of days these students have texted while driving within the past 30 days?

ANSWER See the below table for the counts. Please note the 918 NA’s. I will remove those rows for this analysis

yrbss %>%
    group_by(text_while_driving_30d) %>%
    summarise(count = n())

## # A tibble: 9 x 2
##   text_while_driving_30d count
##   <chr>                  <int>
## 1 0                       4792
## 2 1-2                      925
## 3 10-19                    373
## 4 20-29                    298
## 5 3-5                      493
## 6 30                       827
## 7 6-9                      311
## 8 did not drive           4646
## 9 <NA>                     918

What is the proportion of people who have texted while driving every day in the past 30 days and never wear helmets?

Remember that you can use filter to limit the dataset to just non-helmet wearers. Here, we will name the dataset no_helmet.

I will filter out the NA

#data('yrbss', package='openintro')
no_helmet <- yrbss %>%
  filter(helmet_12m == "never" & !is.na(text_while_driving_30d))

Also, it may be easier to calculate the proportion if you create a new variable that specifies whether the individual has texted every day while driving over the past 30 days or not. We will call this variable text_ind.

no_helmet <- no_helmet %>%
  mutate(text_ind = ifelse(text_while_driving_30d == "30", "yes", "no"))

no_helmet_tb <- no_helmet %>%
    group_by(text_ind) %>%
    summarise(count = n()) %>%
    mutate(freq = count / sum(count))

no_helmet_tb

## # A tibble: 2 x 3
##   text_ind count   freq
##   <chr>    <int>  <dbl>
## 1 no        6040 0.929 
## 2 yes        463 0.0712

ANSWER The proportion of people texted in the last 30 days and never wear a helmet is 7.12%

Inference on proportions

When summarizing the YRBSS, the Centers for Disease Control and Prevention seeks insight into the population parameters. To do this, you can answer the question, “What proportion of people in your sample reported that they have texted while driving each day for the past 30 days?” with a statistic; while the question “What proportion of people on earth have texted while driving each day for the past 30 days?” is answered with an estimate of the parameter.

The inferential tools for estimating population proportion are analogous to those used for means in the last chapter: the confidence interval and the hypothesis test.

no_helmet_ci <- no_helmet %>%
  specify(response = text_ind, success = "yes") %>%
  generate(reps = 1000, type = "bootstrap") %>%
  calculate(stat = "prop") %>%
  get_ci(level = 0.95)

no_helmet_ci

## # A tibble: 1 x 2
##   lower_ci upper_ci
##      <dbl>    <dbl>
## 1   0.0646   0.0773

Note that since the goal is to construct an interval estimate for a proportion, it’s necessary to both include the success argument within specify, which accounts for the proportion of non-helmet wearers than have consistently texted while driving the past 30 days, in this example, and that stat within calculate is here “prop”, signaling that you are trying to do some sort of inference on a proportion.

What is the margin of error for the estimate of the proportion of non-helmet wearers that have texted while driving each day for the past 30 days based on this survey?

ANSWER Lets calculate using it the SE formala: SD / SQRT(n) from the proportion calculation we know that n = 6040 + 463 = 6,503. For Standard Error thus we will use the formula **z * SQRT(p*(1-p)/n)**

From above we also know that p = 0.0712 and 1-p = 0.929. Also for a 95% confidence leel, the proper z is 1.96, thus Margin of Error = 1.96 * SE

# Using the SE formula
1.96* sqrt(no_helmet_tb$freq[2]*no_helmet_tb$freq[1]/sum(no_helmet_tb$count))

## [1] 0.006250207

Margin of error = 0.00625

If I used the one from the bootsrapping simulation we will get something very close to the above.

If the case below we got a margin of error of ~0.006 (see below)

(no_helmet_ci$upper_ci-no_helmet_ci$lower_ci)/2

## [1] 0.00638167

Using the infer package, calculate confidence intervals for two other categorical variables (you’ll need to decide which level to call “success”, and report the associated margins of error. Interpet the interval in context of the data. It may be helpful to create new data sets for each of the two countries first, and then use these data sets to construct the confidence intervals.

ANSWER

We will do the following two proportions.

Proportion of MALES who are phisically active 4 or more days a week
Proportion of FEMALES who do strength training 0 or 1 days per week

For a) lets’ do some exploration first to see how data looks like

yrbss %>%
    filter(gender == "male") %>%
    group_by(physically_active_7d) %>%
    summarise(counts = n())

## # A tibble: 9 x 2
##   physically_active_7d counts
##                  <int>  <int>
## 1                    0    788
## 2                    1    338
## 3                    2    486
## 4                    3    661
## 5                    4    646
## 6                    5    895
## 7                    6    475
## 8                    7   2484
## 9                   NA    177

We can see that things are pretty even, only when we get to 7 days we see a much higher frequency that what I thought. Also I see we have 177 NA’s we will need to take care in our analysis.

males_pa7d <- yrbss %>%
  filter(gender =="male") %>%
  mutate(active_4d = ifelse(physically_active_7d >= 4, "yes", "no"))

Now let’s run bootstrapping to get our CI’s.

males_pa7d_ci <- males_pa7d %>%
  specify(response = active_4d, success = "yes") %>%
  generate(reps = 1000, type = "bootstrap") %>%
  calculate(stat = "prop") %>%
  get_ci(level = 0.95)

males_pa7d_ci

## # A tibble: 1 x 2
##   lower_ci upper_ci
##      <dbl>    <dbl>
## 1    0.653    0.676

We get a pretty tight CI that goes from ~0.65 to ~0.68

For b) lets’ do some exploration first to see how data looks like

yrbss %>%
    filter(gender == "female") %>%
    group_by(strength_training_7d) %>%
    summarise(counts = n())

## # A tibble: 9 x 2
##   strength_training_7d counts
##                  <int>  <int>
## 1                    0   2284
## 2                    1    583
## 3                    2    741
## 4                    3    691
## 5                    4    450
## 6                    5    546
## 7                    6    196
## 8                    7    587
## 9                   NA    543

We can see that things are pretty even, except for the concentration at 0 days we also see a much higher frequency that what I thought. Also I see we have 543 NA’s we will need to take care in our analysis.

females_st7d <- yrbss %>%
  filter(gender =="female") %>%
  mutate(strength_1d = ifelse(strength_training_7d <= 1, "yes", "no"))

Now let’s run bootstrapping to get our CI’s.

females_st7d_ci <- females_st7d %>%
  specify(response = strength_1d, success = "yes") %>%
  generate(reps = 1000, type = "bootstrap") %>%
  calculate(stat = "prop") %>%
  get_ci(level = 0.95)

females_st7d_ci

## # A tibble: 1 x 2
##   lower_ci upper_ci
##      <dbl>    <dbl>
## 1    0.459    0.484

We get a pretty tight CI that goes from ~0.46 to ~0.48

How does the proportion affect the margin of error?

Imagine you’ve set out to survey 1000 people on two questions: are you at least 6-feet tall? and are you left-handed? Since both of these sample proportions were calculated from the same sample size, they should have the same margin of error, right? Wrong! While the margin of error does change with sample size, it is also affected by the proportion.

Think back to the formula for the standard error: \(SE = \sqrt{p(1-p)/n}\). This is then used in the formula for the margin of error for a 95% confidence interval:

\[ ME = 1.96\times SE = 1.96\times\sqrt{p(1-p)/n} \,. \] Since the population proportion \(p\) is in this \(ME\) formula, it should make sense that the margin of error is in some way dependent on the population proportion. We can visualize this relationship by creating a plot of \(ME\) vs. \(p\).

Since sample size is irrelevant to this discussion, let’s just set it to some value (\(n = 1000\)) and use this value in the following calculations:

n <- 1000

The first step is to make a variable p that is a sequence from 0 to 1 with each number incremented by 0.01. You can then create a variable of the margin of error (me) associated with each of these values of p using the familiar approximate formula (\(ME = 2 \times SE\)).

p <- seq(from = 0, to = 1, by = 0.01)
me <- 2 * sqrt(p * (1 - p)/n)

Lastly, you can plot the two variables against each other to reveal their relationship. To do so, we need to first put these variables in a data frame that you can call in the ggplot function.

dd <- data.frame(p = p, me = me)
ggplot(data = dd, aes(x = p, y = me)) + 
  geom_line() +
  labs(x = "Population Proportion", y = "Margin of Error")

Describe the relationship between p and me. Include the margin of error vs. population proportion plot you constructed in your answer. For a given sample size, for which value of p is margin of error maximized?

ANSWER

There is a quadratic relationship between Margin of Error and proportion (P). This relatonship is maximized at P = 0.5 where we achieve the maximum margin of error.

Success-failure condition

We have emphasized that you must always check conditions before making inference. For inference on proportions, the sample proportion can be assumed to be nearly normal if it is based upon a random sample of independent observations and if both \(np \geq 10\) and \(n(1 - p) \geq 10\). This rule of thumb is easy enough to follow, but it makes you wonder: what’s so special about the number 10?

The short answer is: nothing. You could argue that you would be fine with 9 or that you really should be using 11. What is the “best” value for such a rule of thumb is, at least to some degree, arbitrary. However, when \(np\) and \(n(1-p)\) reaches 10 the sampling distribution is sufficiently normal to use confidence intervals and hypothesis tests that are based on that approximation.

You can investigate the interplay between \(n\) and \(p\) and the shape of the sampling distribution by using simulations. Play around with the following app to investigate how the shape, center, and spread of the distribution of \(\hat{p}\) changes as \(n\) and \(p\) changes.

Describe the sampling distribution of sample proportions at \(n = 300\) and \(p = 0.1\). Be sure to note the center, spread, and shape.

ANSWER

The distribution is fairly normal with a smallish standard deviation which creates a thin adn tall distribution curve The center as expected is at 0.1

Keep \(n\) constant and change \(p\). How does the shape, center, and spread of the sampling distribution vary as \(p\) changes. You might want to adjust min and max for the \(x\)-axis for a better view of the distribution.

ANSWER The center of the distribution changes with P. If you reduce P way to the left as in P=0.01 you will see a distribution centered at the p, and with a smaller (thinner) standard error. As you increase P, the center moves, and the standard error increase, maxing at P=0.5 After that increasing P moves the center, but the standard error starts decreasing.

Now also change \(n\). How does \(n\) appear to affect the distribution of \(\hat{p}\)?

ANSWER low values of N create a “less normal” distribution also with a larger standard error. As we increase n, the distribution becomes more normal looking and also the standard becomes smaller and smaller, creating a thinner normal distribution.

More Practice

For some of the exercises below, you will conduct inference comparing two proportions. In such cases, you have a response variable that is categorical, and an explanatory variable that is also categorical, and you are comparing the proportions of success of the response variable across the levels of the explanatory variable. This means that when using infer, you need to include both variables within specify.

Is there convincing evidence that those who sleep 10+ hours per day are more likely to strength train every day of the week? As always, write out the hypotheses for any tests you conduct and outline the status of the conditions for inference. If you find a significant difference, also quantify this difference with a confidence interval.

ANSWER

We are going to use TWO Methods to solve this. First we will use CHI SQUARED TEST FOR INDEPENDENCE and then we will a Test of difference in Proportions.

Method 1 - CHI SQUARED Test of Independence

Since the question is trying to asses whether sleeping 10+ has in impact in training 7 days of the week. I decided to test whether Sleeping 10+ is independent of Training 7d. If they are independent under 95% condidence level then we can not discard HO that they are independent otherwise we will discard the HO and take the Ha that they not independent.

Let’s create two features just stating whther the person trained 7d, also whether ther person slept 10+ hours. We will use simple “yes” “no” to each.

sleep10train7 <- yrbss %>%
  filter(!is.na(school_night_hours_sleep) & !is.na(strength_training_7d)) %>%
  mutate(sleep10 = ifelse(school_night_hours_sleep == "10+", "yes", "no")) %>%
  mutate(train7 = ifelse(strength_training_7d == 7, "yes", "no"))

Now we can generate a frenquency table (2x2) of our results.

freq_table <- table(sleep10train7$sleep10, sleep10train7$train7)
names(dimnames(freq_table)) <- c("Sleep 10+", "Train 7 Days")
freq_table

##          Train 7 Days
## Sleep 10+   no  yes
##       no  9949 1958
##       yes  228   84

Finally we can perfom a CHI SQUARED TEST FOR INDEPENDENCE

chisq_test(sleep10train7, train7 ~ sleep10)

## # A tibble: 1 x 3
##   statistic chisq_df    p_value
##       <dbl>    <int>      <dbl>
## 1      23.2        1 0.00000143

#or
#chisq.test(freq_table)

The p-value is 0.00000143, much smaller than the 0.05 confidence level we established. Therefore we can reject HO that they are independent. We can then there conclude that evidence does show that sleeping 10+ hours does influence training 7 days per week.

Method 2 Differencen in Proportions

The trhought process is as follows: We will say that explanatory variable is sleeps 10+ hours and the response variable is trains 7 days of the week. Then we will calculate two proportions. a) Proportion of people who DID NOT SLEEP 10+ hours and still trained 7 days of the week. b) Proporion of people who SLEPT 10+ Hours and trained 7 days.

If there is not impact in sleeping on training, then the difference in proportion would be equal to zero (within our 95 confidence interval)

Let’s first do some basic feature engineering to see yes and no where appropriate

sleep10train7 <- yrbss %>%
  filter(!is.na(school_night_hours_sleep) & !is.na(strength_training_7d)) %>%
  mutate(sleep10 = ifelse(school_night_hours_sleep == "10+", "yes", "no")) %>%
  mutate(train7 = ifelse(strength_training_7d == 7, "yes", "no"))

Let’s calculate the two proportions we need to evaluate.

prop_s10_t7 <- freq_table[2,2] / (freq_table[2,1] + freq_table[2,2])

prop_n10_t7 <- freq_table[1,2] / (freq_table[1,1] + freq_table[1,2])

prop_s10_t7

## [1] 0.2692308

prop_n10_t7

## [1] 0.1644411

diff_props <- prop_n10_t7 - prop_s10_t7
diff_props

## [1] -0.1047897

The difference in proportions in our sample is -0.104. Not zero but we need to establish a 95 Confidence interval to check whther 0 is include within the interval.

sleep10train7 %>%
  specify(response = train7, explanatory = sleep10, success = "yes") %>%
  generate(reps = 1000, type = "bootstrap") %>%
  calculate(stat = "diff in props") %>%
  get_ci(level = 0.95)

## # A tibble: 1 x 2
##   lower_ci upper_ci
##      <dbl>    <dbl>
## 1   -0.153  -0.0516

The Condifence interval is -0.155 to -0.0580. We see that ZERO is NOT PART of the confidence interval. We can REJECT HO that difference is ZERO, thus again finding that SLEEPING 10+ does influence TRAINING 7 Days

Let’s say there has been no difference in likeliness to strength train every day of the week for those who sleep 10+ hours. What is the probablity that you could detect a change (at a significance level of 0.05) simply by chance? Hint: Review the definition of the Type 1 error.

ANSWER

At an alpha of 0.05 the probability of a Type 1 error, that is rejecting the null that there is no impact on training by sleeping is equal to 5%

Suppose you’re hired by the local government to estimate the proportion of residents that attend a religious service on a weekly basis. According to the guidelines, the estimate must have a margin of error no greater than 1% with 95% confidence. You have no idea what to expect for \(p\). How many people would you have to sample to ensure that you are within the guidelines?
Hint: Refer to your plot of the relationship between \(p\) and margin of error. This question does not require using a dataset.

ANSWER

We know that MARGIN OF ERROR = z * SQRT((p*(1-p)/n)) our worst case sceario is p=0.5, Our margin of error is 0.01 So we need to solve for n in equation: 0.01 = 1.96 * SQRT(0.25/n)

Solving for n above, give us a sample size of 9,604

Inference for categorical data

Juan Falck

Getting Started

Load packages

The data

Inference on proportions

How does the proportion affect the margin of error?

Success-failure condition

More Practice

Method 1 - CHI SQUARED Test of Independence

Method 2 Differencen in Proportions

Thank you!