Inference of Population Proportions & Categorical Variables

Learning Objectives

State hypotheses for the χ²-test of independence
State and check assumptions of the χ²-test of independence
Obtain and interpret results of the χ²-test of independence
Obtain and interpret confidence intervals for population proportions & differences in proportions

# Load necessary packages
library(tidyverse)
library(ggthemes)
library(flextable)
library(janitor)
library(broom)

# Set ggplot theme for visualizations
theme_set(ggthemes::theme_few())

# Set options for flextables
set_flextable_defaults(na_str = "NA")

# Load function for printing tables nicely
source("https://raw.githubusercontent.com/dilernia/STA323/main/Functions/make_flex.R")

Titanic data

Importing data from GitHub

# Importing data from course GitHub page
titanic <- read_csv("https://raw.githubusercontent.com/dilernia/STA323/main/Data/titanic.csv")

set.seed(1994)

# Printing sample of 7 rows from data set
titanic %>%  
  dplyr::sample_n(size = 7) %>% 
  make_flex(caption = "Randomly selected people who were aboard the Titanic.")

Table 1: Randomly selected people who were aboard the Titanic.
survived	sex	passenger_class	name	age	fare	n_siblings_spouses_aboard	n_parents_children_aboard
Yes	Female	Second	Miss. Amelia Brown	24.00	13.00	0.00	0.00
Yes	Female	Third	Miss. Anna Kristine Salkjelsvik	21.00	7.65	0.00	0.00
No	Male	Third	Mr. Nils Johan Goransson Olsson	28.00	7.85	0.00	0.00
Yes	Female	First	Mrs. William Thompson (Edith Junkins) Graham	58.00	153.46	0.00	1.00
No	Male	Third	Mr. Ole Martin Olsen	27.00	7.31	0.00	0.00
No	Male	Second	Mr. Thomas Charles Mudd	16.00	10.50	0.00	0.00
Yes	Female	Second	Miss. Alice Herman	24.00	65.00	1.00	2.00

What are the response and explanatory variables in this scenario?

Response variable: Survival status (Yes/No)

Explanatory variable: Passenger class (third, second, first)

Stating the hypotheses for the χ²-test of independence

Formally state the hypotheses for our question of interest.

H₀: There is no relationship between a passenger’s class and whether or not they survived the sinking of the Titanic.

H_a: There is a relationship between a passenger’s class and whether or not they survived the sinking of the Titanic.

Next, we obtain a contingency table for our question of interest.

# Creating contingency table
contingencyTable <- titanic %>% 
  janitor::tabyl(var1 = passenger_class, var2 = survived)

# Printing table
contingencyTable %>% 
  make_flex(caption = "Observed counts for survival status by passenger class", ndigits = 0)

Table 2: Observed counts for survival status by passenger class
passenger_class	No	Yes
First	80	136
Second	97	87
Third	368	119

Visualizations can often be more effective when communicating our results

Clustered bar chart

# Creating a clustered bar chart
titanic %>% 
  dplyr::count(passenger_class, survived, .drop = FALSE) %>% 
  dplyr::filter(!is.na(passenger_class), !is.na(survived)) %>% 
  mutate(passenger_class = fct_reorder(passenger_class, n)) %>% 
  ggplot(aes(x = passenger_class, y = n,
             fill = survived)) + 
  geom_col(position = "dodge", color = "black") +
  scale_fill_few() +
  scale_y_continuous(expand = expansion(mult = c(0, 0.10))) +
  # The x-axis shows passenger class; survival status is encoded by fill color
  labs(title = "Distribution of survival by passenger class",
       y = "Frequency",
       x = "Passenger class",
       caption = "Data source: Stanford University",
       fill = "Survived")

Dumbbell chart

# Creating dumbbell chart of survival status by passenger class
titanic %>% 
  dplyr::count(survived, passenger_class, .drop = FALSE) %>% 
  dplyr::filter(!is.na(survived), !is.na(passenger_class)) %>% 
  ggplot(aes(x = n, y = passenger_class,
             color = survived, fill = survived)) + 
  geom_line(aes(group = passenger_class), color = "black") +
  geom_point(pch = 21, color = "black", size = 5) +
  scale_fill_few() +
  labs(title = "Survival status by passenger class",
       x = "Frequency",
       y = "Passenger class",
       fill = "Survived",
       caption = "Data source: Stanford University") +
  theme(legend.position = "bottom",
        strip.background.y = element_rect(linetype = "solid", color = "black"))

Based on the clustered bar chart and dumbbell plot, does there appear to be a difference in the likelihood of survival by the class of the passengers?

Yes, there are clear differences in the likelihood of survival based on the class of the passenger. First-class passengers had the highest chance of survival and third-class passengers had the lowest.
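The pattern seen in the charts can also be expressed as row proportions. The sketch below re-enters the observed counts from Table 2 by hand in a base R matrix and computes the share of each class that did and did not survive. The object names `observed` and `survival_props` are illustrative assumptions and are not part of the original analysis.

```r
# Observed counts copied by hand from Table 2
# (rows: passenger class, columns: survival status)
observed <- matrix(c( 80, 136,
                      97,  87,
                     368, 119),
                   nrow = 3, byrow = TRUE,
                   dimnames = list(passenger_class = c("First", "Second", "Third"),
                                   survived = c("No", "Yes")))

# Proportion of each class in each survival category (each row sums to 1)
survival_props <- prop.table(observed, margin = 1)
round(survival_props, 3)
```

The "Yes" column gives the sample survival proportion for each class, which puts the three classes on a common scale even though the class sizes differ.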

Next, we implement the χ²-test of independence

# Implementing a chi-square test of independence
chi2Res <- chisq.test(contingencyTable, tabyl_results = TRUE)
# Printing table of expected counts
chi2Res$expected %>% 
  make_flex(caption = "Expected counts for survival status by passenger class", ndigits = 1)

Table 3: Expected counts for survival status by passenger class
passenger_class	No	Yes
First	132.7	83.3
Second	113.1	70.9
Third	299.2	187.8
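As a check on Table 3, the expected counts can be reproduced by hand: under independence, each expected count is (row total × column total) / grand total. The sketch below uses base R only, with the observed counts re-entered from Table 2; the object names are illustrative assumptions.

```r
# Observed counts copied by hand from Table 2
observed <- matrix(c( 80, 136,
                      97,  87,
                     368, 119),
                   nrow = 3, byrow = TRUE,
                   dimnames = list(passenger_class = c("First", "Second", "Third"),
                                   survived = c("No", "Yes")))

# Expected count for each cell under independence:
# (row total * column total) / grand total
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)
round(expected, 1)
```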

Printing the results of the χ²-test of independence

# Printing model output
chi2Res %>% 
  broom::tidy() %>% 
  make_flex(caption = "Results of the chi-square test of independence",
            ndigits = 2)

Table 4: Results of the chi-square test of independence
statistic	p.value	parameter	method
101.22	<2e-16	2	Pearson's Chi-squared test

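If the null hypothesis were true, the χ² test statistic would be expected to be close to its degrees of freedom (the parameter column), because the mean of a χ² distribution equals its degrees of freedom. Here the statistic of 101.22 is far larger than 2.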
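To make the comparison between the statistic and its degrees of freedom concrete, the following sketch recomputes the Pearson statistic from the observed and expected counts and then obtains the p-value from the χ² distribution. It uses base R only, with counts re-entered from Table 2; the object names are illustrative assumptions.

```r
# Observed counts copied by hand from Table 2
observed <- matrix(c( 80, 136,
                      97,  87,
                     368, 119),
                   nrow = 3, byrow = TRUE,
                   dimnames = list(passenger_class = c("First", "Second", "Third"),
                                   survived = c("No", "Yes")))

# Expected counts under independence
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)

# Pearson chi-square statistic: sum over cells of (observed - expected)^2 / expected
chi_sq_stat <- sum((observed - expected)^2 / expected)

# Degrees of freedom: (number of rows - 1) * (number of columns - 1)
df <- (nrow(observed) - 1) * (ncol(observed) - 1)

# Upper-tail probability of a chi-square distribution with df degrees of freedom
p_value <- pchisq(chi_sq_stat, df = df, lower.tail = FALSE)

c(statistic = chi_sq_stat, df = df, p_value = p_value)
```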

State and check assumptions of the χ²-test of independence

State each assumption for the χ²-test (chi-square) of independence, and indicate whether or not each assumption is met, citing specific evidence from the output obtained.

Independent observations (commonly violated when observations consist of repeated measurements across time)

Since these were all separate passengers, we will consider this assumption met.

Both variables being studied must be categorical

Both survival status and passenger class are categorical, so this assumption is met.

All expected counts under the null hypothesis should be at least 5

The smallest expected count is 70.9 (Table 3), which is larger than 5, so this assumption is met.
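The expected-count condition can also be checked programmatically rather than by reading the table. This is a minimal base R sketch, assuming the counts from Table 2 are re-entered by hand; `min_expected` is an illustrative name.

```r
# Observed counts copied by hand from Table 2
observed <- matrix(c( 80, 136,
                      97,  87,
                     368, 119),
                   nrow = 3, byrow = TRUE,
                   dimnames = list(passenger_class = c("First", "Second", "Third"),
                                   survived = c("No", "Yes")))

# chisq.test() stores the expected counts under the null hypothesis
expected <- chisq.test(observed)$expected

# Smallest expected count, and whether every cell meets the "at least 5" rule
min_expected <- min(expected)
all_at_least_5 <- all(expected >= 5)

c(min_expected = min_expected, all_at_least_5 = all_at_least_5)
```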

What are the degrees of freedom associated with the χ²-test here?
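For a test of independence, the degrees of freedom are (number of rows − 1) × (number of columns − 1) of the contingency table. A short sketch for the 3 × 2 table used here (variable names are illustrative):

```r
# Dimensions of the contingency table in Table 2:
# 3 passenger classes (rows) by 2 survival outcomes (columns)
n_rows <- 3
n_cols <- 2

# Degrees of freedom for the chi-square test of independence
df <- (n_rows - 1) * (n_cols - 1)
df
```

This agrees with the parameter column reported in Table 4.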

Interpreting results of the χ²-test of independence

State our test statistic, p-value, decision, and conclusion in the context of the problem testing at the 5% significance level, citing specific evidence from the obtained output

Test statistic: 101.22

p-value: <2e-16

Decision: Reject H₀ since the p-value is less than 0.05.

Interpretation of results: At the 5% significance level, we have sufficient evidence that there is a relationship between a passenger’s class and whether or not they survived the sinking of the Titanic (H_a).

Just as post-hoc comparisons follow a significant ANOVA, inference on population proportions serves as the follow-up to a significant χ²-test.

Confidence Intervals

What does 1 − \(\hat{p}\) represent in general?

1 − \(\hat{p}\) represents the sample proportion of failures.

Provide and interpret a 95% confidence interval for the survival of third-class passengers based on the Titanic data using the code below.

# Calculating confidence interval for p
third_class_counts <- contingencyTable %>% 
  dplyr::filter(passenger_class == "Third") %>% 
  dplyr::select(c("Yes", "No"))

third_surv_CI <- prop.test(x = third_class_counts$Yes,
                           n = sum(third_class_counts),
                           conf.level = 0.95,
                           correct = FALSE)

# Printing table of results
third_surv_CI %>% 
  broom::tidy() %>% 
  make_flex(caption = "Confidence interval for proportion of 3rd-class passengers who survived on the Titanic",
            ndigits = 3)

Table 5: Confidence interval for proportion of 3rd-class passengers who survived on the Titanic
estimate	statistic	p.value	parameter	conf.low	conf.high	method	alternative
0.244	127.312	<2e-16	1	0.208	0.284	1-sample proportions test without continuity correction	two.sided

The sample proportion, \(\hat{p}\), is 0.244, which is the proportion of third-class passengers who survived the sinking of the Titanic.

Lower bound: 0.208

Upper bound: 0.284

Interpretation: We are 95% confident that the proportion of 3rd-class passengers who survive the sinking of luxury cruise ships in the Atlantic is between 0.208 and 0.284.
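With `correct = FALSE`, `prop.test()` reports a score-based interval for a single proportion, commonly called the Wilson interval, rather than the textbook \(\hat{p} \pm z \cdot SE\) interval. The sketch below reproduces the bounds by hand from the third-class counts in Table 2; the variable names are illustrative assumptions.

```r
# Third-class counts from Table 2: 119 survived out of 119 + 368 passengers
x <- 119
n <- 119 + 368

p_hat <- x / n        # sample proportion of survivors
z <- qnorm(0.975)     # critical value for 95% confidence

# Wilson (score) interval: the center is pulled slightly toward 0.5
center <- (p_hat + z^2 / (2 * n)) / (1 + z^2 / n)
half_width <- z * sqrt(p_hat * (1 - p_hat) / n + z^2 / (4 * n^2)) / (1 + z^2 / n)

wilson_ci <- c(lower = center - half_width, upper = center + half_width)
round(wilson_ci, 3)
```

For a sample of this size the score interval and the simpler Wald interval are close, but the score interval generally behaves better when the proportion is near 0 or 1.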

Is this a valid confidence interval? Provide specific evidence to support your conclusion.

Independent observations (commonly violated when observations consist of repeated measurements across time)

Since these were all separate passengers, we will consider this assumption met.

Variable being studied must be categorical

Survival status (Yes/No) is a binary categorical variable, so this is met.

Number of successes should be at least 5 and number of failures should be at least 5

Among third-class passengers there were 119 successes (survived) and 368 failures (died). Both counts are at least 5, so this assumption is met.

Provide and interpret a 95% confidence interval for the survival of first-class passengers based on the Titanic data.

# Calculating confidence interval for p
first_class_counts <- contingencyTable %>% 
  dplyr::filter(passenger_class == "First") %>% 
  dplyr::select(c("Yes", "No"))

first_surv_CI <- prop.test(x = first_class_counts$Yes,
                           n = sum(first_class_counts),
                           conf.level = 0.95,
                           correct = FALSE)

# Printing table of results
first_surv_CI %>% 
  broom::tidy() %>% 
  make_flex(caption = "Confidence interval for proportion of 1st-class passengers who survived on the Titanic",
            ndigits = 3)

Table 6: Confidence interval for proportion of 1st-class passengers who survived on the Titanic
estimate	statistic	p.value	parameter	conf.low	conf.high	method	alternative
0.630	14.519	0.000139	1	0.563	0.691	1-sample proportions test without continuity correction	two.sided

The sample proportion, \(\hat{p}\), is 0.630, which is the proportion of first-class passengers who survived the sinking of the Titanic.

Lower bound: 0.563

Upper bound: 0.691

Interpretation: We are 95% confident that the proportion of 1st-class passengers who survive the sinking of luxury cruise ships in the Atlantic is between 0.563 and 0.691.

Is this a valid confidence interval? Provide specific evidence to support your conclusion.

Independent observations (commonly violated when observations consist of repeated measurements across time)

Since these were all separate passengers, we will consider this assumption met.

Variable being studied must be categorical

Survival status (Yes/No) is a binary categorical variable, so this is met.

Number of successes should be at least 5 and number of failures should be at least 5

Among first-class passengers there were 136 successes (survived) and 80 failures (died). Both counts are at least 5, so this assumption is met.

What is the margin of error for this interval?

The margin of error is half the width of the interval: (upper bound − lower bound) / 2 = (0.691 − 0.563) / 2 = 0.064.
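The same half-width can be computed directly from the `prop.test()` output. The sketch below assumes the first-class counts from Table 2 (136 survivors out of 216) and uses illustrative variable names. It also reports the midpoint of the interval, since a score-based interval is not centered exactly on \(\hat{p}\).

```r
# First-class counts from Table 2: 136 survived out of 136 + 80 passengers
first_ci <- prop.test(x = 136, n = 136 + 80,
                      conf.level = 0.95, correct = FALSE)$conf.int

# Margin of error taken as half the width of the interval
margin_of_error <- (first_ci[2] - first_ci[1]) / 2

# Midpoint of the interval, for comparison with the sample proportion
midpoint <- mean(first_ci)
p_hat <- 136 / (136 + 80)

round(c(margin_of_error = margin_of_error, midpoint = midpoint, p_hat = p_hat), 3)
```

Because the midpoint and \(\hat{p}\) differ slightly, reading the interval as "\(\hat{p}\) ± margin of error" is only approximate here.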

Confidence Interval for p₁−p₂

Provide and interpret a 95% confidence interval for the difference in the survival rates of first-class and third-class passengers based on the Titanic data using the code below.

# Calculating confidence interval for p1 - p2
class_counts <- contingencyTable %>% 
  dplyr::filter(passenger_class %in% c("First", "Third")) %>% 
  dplyr::select(c("Yes", "No"))

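# x is a two-column matrix of successes (Yes) and failures (No), one row per class.
# prop.test() takes the group sizes from the row sums, so no n argument is needed
# (n is ignored when x is a matrix).
diff_surv_CI <- prop.test(x = as.matrix(class_counts),
                          conf.level = 0.95,
                          correct = FALSE)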

# Printing table of results
diff_surv_CI %>% 
  broom::tidy() %>% 
  make_flex(caption = "Confidence interval for difference in the survival rates of 1st-class and 3rd-class passengers",
            ndigits = 3)

Table 7: Confidence interval for difference in the survival rates of 1st-class and 3rd-class passengers
estimate1	estimate2	statistic	p.value	parameter	conf.low	conf.high	method	alternative
0.630	0.244	96.087	<2e-16	1.000	0.310	0.460	2-sample test for equality of proportions without continuity correction	two.sided

Confidence interval limits: (0.310, 0.460)

Interpretation: We are 95% confident that 1st-class passengers on sinking cruise ships in the Atlantic have a survival rate that is between 31 and 46 percentage points higher than that of 3rd-class passengers.
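For two groups with `correct = FALSE`, the interval reported by `prop.test()` is the difference in sample proportions plus or minus a critical value times the unpooled standard error. The sketch below reproduces the limits by hand from the Table 2 counts; the variable names are illustrative assumptions.

```r
# Survivors and group sizes from Table 2
x <- c(first = 136, third = 119)
n <- c(first = 136 + 80, third = 119 + 368)

p_hat <- x / n                                        # sample survival proportions
diff_hat <- unname(p_hat["first"] - p_hat["third"])   # estimated p1 - p2

# Unpooled standard error of the difference in proportions
se_diff <- sqrt(sum(p_hat * (1 - p_hat) / n))

z <- qnorm(0.975)     # critical value for 95% confidence
diff_ci <- diff_hat + c(-1, 1) * z * se_diff
round(diff_ci, 3)
```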

Is this a valid confidence interval? Provide specific evidence to support your conclusion.

Number of successes should be at least 5 and number of failures should be at least 5

1st-class passengers: 80 died and 136 survived.

3rd-class passengers: 368 died and 119 survived.

Since all four of these counts are at least 5, we have a sufficient number of successes and failures for the confidence interval to be valid.

Independent observations (commonly violated when observations consist of repeated measurements across time)

Since these were all separate passengers, we will consider this assumption met.

Variable being studied must be categorical

Survival status (Yes/No) is a binary categorical variable, so this is met.

Do the results of the χ²-test and 95% confidence interval of the difference align? Why or why not?

Yes. The interval (0.310, 0.460) does not contain 0, so the survival rates of first-class and third-class passengers differ, which aligns with the statistically significant result of the χ²-test.
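One way to see the link between the two procedures: the two-sample `prop.test()` statistic in Table 7 is itself a Pearson χ² statistic, computed on the 2 × 2 sub-table of first-class and third-class passengers. The sketch below illustrates this with counts re-entered from Table 2. It is a pairwise follow-up comparison, not a replacement for the overall test on all three classes, and the object names are illustrative assumptions.

```r
# 2 x 2 sub-table from Table 2: survivors (Yes) and non-survivors (No) by class
first_third <- matrix(c(136,  80,
                        119, 368),
                      nrow = 2, byrow = TRUE,
                      dimnames = list(passenger_class = c("First", "Third"),
                                      survived = c("Yes", "No")))

# Pearson chi-square test on the sub-table, without continuity correction
chisq_2x2 <- chisq.test(first_third, correct = FALSE)

# Two-sample test for equality of proportions on the same counts
prop_2samp <- prop.test(first_third, correct = FALSE)

c(chi_square = unname(chisq_2x2$statistic),
  prop_test = unname(prop_2samp$statistic))
```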
