Inference of Population Proportions & Categorical Variables

Learning Objectives

  • State hypotheses for the χ2-test of independence

  • State and check assumptions of the χ2-test of independence

  • Obtain and interpret results of the χ2-test of independence

  • Obtain and interpret confidence intervals for population proportions & differences in proportions

# Load necessary packages
library(tidyverse)
library(ggthemes)
library(flextable)
library(janitor)
library(broom)
# Set ggplot theme for visualizations
theme_set(ggthemes::theme_few())

# Set options for flextables
set_flextable_defaults(na_str = "NA")

# Load function for printing tables nicely
source("https://raw.githubusercontent.com/dilernia/STA323/main/Functions/make_flex.R")

Titanic data

Importing the data from GitHub

# Importing data from course GitHub page
titanic <- read_csv("https://raw.githubusercontent.com/dilernia/STA323/main/Data/titanic.csv")

set.seed(1994)

# Printing sample of 7 rows from data set
titanic %>%  
  dplyr::sample_n(size = 7) %>% 
  make_flex(caption = "Randomly selected people who were aboard the Titanic.")
Table 1: Randomly selected people who were aboard the Titanic.

survived | sex | passenger_class | name | age | fare | n_siblings_spouses_aboard | n_parents_children_aboard
--- | --- | --- | --- | --- | --- | --- | ---
Yes | Female | Second | Miss. Amelia Brown | 24.00 | 13.00 | 0.00 | 0.00
Yes | Female | Third | Miss. Anna Kristine Salkjelsvik | 21.00 | 7.65 | 0.00 | 0.00
No | Male | Third | Mr. Nils Johan Goransson Olsson | 28.00 | 7.85 | 0.00 | 0.00
Yes | Female | First | Mrs. William Thompson (Edith Junkins) Graham | 58.00 | 153.46 | 0.00 | 1.00
No | Male | Third | Mr. Ole Martin Olsen | 27.00 | 7.31 | 0.00 | 0.00
No | Male | Second | Mr. Thomas Charles Mudd | 16.00 | 10.50 | 0.00 | 0.00
Yes | Female | Second | Miss. Alice Herman | 24.00 | 65.00 | 1.00 | 2.00

What are the response and explanatory variables in this scenario?

Response variable: Survival status (Yes/No)

Explanatory variable: Passenger class (third, second, first)

Stating the hypotheses for the χ2-test of independence

Formally state the hypotheses for our question of interest.

H0: There is no relationship between a passenger’s class and whether or not they survived the sinking of the Titanic.

Ha: There is a relationship between a passenger’s class and whether or not they survived the sinking of the Titanic.

Next, we obtain a contingency table for our question of interest.

# Creating contingency table
contingencyTable <- titanic %>% 
  janitor::tabyl(var1 = passenger_class, var2 = survived)

# Printing table
contingencyTable %>% 
  make_flex(caption = "Observed counts for survival status by passenger class", ndigits = 0)
Table 2: Observed counts for survival status by passenger class

passenger_class | No | Yes
--- | --- | ---
First | 80 | 136
Second | 97 | 87
Third | 368 | 119

Visualizations are often more effective than tables for communicating our results.

Clustered bar chart

# Creating a clustered bar chart
titanic %>% 
  dplyr::count(passenger_class, survived, .drop = FALSE) %>% 
    dplyr::filter(!is.na(passenger_class), !is.na(survived)) %>% 
  mutate(passenger_class = fct_reorder(passenger_class, n)) %>% 
  ggplot(aes(x = passenger_class, y = n,
             fill = survived)) + 
  geom_col(position="dodge", color = "black") +
  scale_fill_few() +
    scale_y_continuous(expand = expansion(mult = c(0, 0.10))) +
    labs(title = "Distribution of survival by passenger class",
       y = "Frequency",
       x = "Passenger class",
       caption = "Data source: Stanford University",
       fill = "Survived")

Dumbbell chart

# Creating dumbbell chart of survival status by passenger class
titanic %>% 
  dplyr::count(survived, passenger_class, .drop = FALSE) %>% 
  dplyr::filter(!is.na(survived), !is.na(passenger_class)) %>% 
  ggplot(aes(x = n, y = passenger_class,
             color = survived, fill = survived)) + 
  geom_line(aes(group = passenger_class), color = "black") +
    geom_point(pch = 21, color = "black", size = 5) +
  scale_fill_few() +
      labs(title = "Survival status by passenger class",
           x = "Frequency",
           y = "Passenger class",
           fill = "Survived",
           caption = "Data source: Stanford University") +
  theme(legend.position = "bottom",
        strip.background.y = element_rect(linetype = "solid", color = "black"))

Based on the clustered bar chart and dumbbell plot, does there appear to be a difference in the likelihood of survival by the class of the passengers?

Yes, there are clear differences in the likelihood of survival based on the class of the passenger. First-class passengers had the highest chance of survival and third-class passengers had the lowest.
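The visual impression can be backed up with the survival proportion within each class. The following is a minimal sketch in base R that assumes the observed counts shown in Table 2 and types them in by hand; the object names `observed` and `survival_props` are illustrative and are not part of the course code.

```r
# Observed counts copied by hand from Table 2
# (rows: passenger class, columns: survival status)
observed <- matrix(c(80, 136,
                     97,  87,
                     368, 119),
                   nrow = 3, byrow = TRUE,
                   dimnames = list(passenger_class = c("First", "Second", "Third"),
                                   survived = c("No", "Yes")))

# Conditional (row-wise) proportions: each row sums to 1
survival_props <- prop.table(observed, margin = 1)

# Proportion who survived within each passenger class
round(survival_props[, "Yes"], 3)
```

Comparing proportions rather than raw counts matters here because the three classes had very different numbers of passengers.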

Next, we implement the χ2-test of independence.

# Implementing a chi-square test of independence
chi2Res <- chisq.test(contingencyTable, tabyl_results = TRUE)
# Printing table of expected counts
chi2Res$expected %>% 
  make_flex(caption = "Expected counts for survival status by passenger class", ndigits = 1)
Table 3: Expected counts for survival status by passenger class

passenger_class | No | Yes
--- | --- | ---
First | 132.7 | 83.3
Second | 113.1 | 70.9
Third | 299.2 | 187.8
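Each expected count is the count we would anticipate in a cell if the null hypothesis were true: (row total × column total) / grand total. As a rough check, the sketch below recomputes the expected counts in base R from the observed counts in Table 2, which are typed in by hand; the names `observed` and `expected` are illustrative.

```r
# Observed counts copied by hand from Table 2
observed <- matrix(c(80, 136,
                     97,  87,
                     368, 119),
                   nrow = 3, byrow = TRUE,
                   dimnames = list(passenger_class = c("First", "Second", "Third"),
                                   survived = c("No", "Yes")))

# Expected count for each cell under independence:
# (row total * column total) / grand total
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)

round(expected, 1)
```

For example, the First/No cell is 216 × 545 / 887, which rounds to the 132.7 reported in Table 3.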

Printing the results of the χ2-test of independence

# Printing model output
chi2Res %>% 
  broom::tidy() %>% 
  make_flex(caption = "Results of the chi-square test of independence",
            ndigits = 2)
Table 4: Results of the chi-square test of independence

statistic | p.value | parameter | method
--- | --- | --- | ---
101.22 | <2e-16 | 2 | Pearson's Chi-squared test

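If the null hypothesis were true, the χ2 test statistic should be close to its degrees of freedom (the parameter column), since the mean of a χ2 distribution equals its degrees of freedom. Here the statistic of 101.22 is far larger than 2.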
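To see where the numbers in Table 4 come from, the χ2 statistic can be rebuilt by summing (observed − expected)² / expected over all six cells. This is a minimal base R sketch under the assumption that the counts in Table 2 are correct; the object names are illustrative and the counts are typed in by hand.

```r
# Observed counts copied by hand from Table 2
observed <- matrix(c(80, 136,
                     97,  87,
                     368, 119),
                   nrow = 3, byrow = TRUE,
                   dimnames = list(passenger_class = c("First", "Second", "Third"),
                                   survived = c("No", "Yes")))

# Expected counts under independence
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)

# Pearson chi-square statistic: sum over cells of (O - E)^2 / E
chi2_stat <- sum((observed - expected)^2 / expected)

# Degrees of freedom: (rows - 1) * (columns - 1)
df <- (nrow(observed) - 1) * (ncol(observed) - 1)

# Upper-tail probability from the chi-square distribution
p_value <- pchisq(chi2_stat, df = df, lower.tail = FALSE)

c(statistic = chi2_stat, df = df, p.value = p_value)
```

The statistic and degrees of freedom should agree with the `statistic` and `parameter` columns of Table 4.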

State and check assumptions of the χ2-test of independence

State each assumption for the χ2-test (chi-square) of independence, and indicate whether or not each assumption is met, citing specific evidence from the output obtained.

  • Independent observations (commonly violated when observations consist of repeated measurements across time)

    Since these were all separate passengers, we will consider this assumption met

  • Both variables being studied must be categorical

    Both survival status and passenger class are categorical

  • All expected counts under the null hypothesis should be at least 5

    All the expected counts are greater than 5. The smallest expected count is 70.9 which is larger than 5, so this assumption is met.

What are the degrees of freedom associated with the χ2-test here?

2, since df = (number of rows − 1) × (number of columns − 1) = (3 − 1) × (2 − 1) = 2.

Interpreting results of the χ2-test of independence

State our test statistic, p-value, decision, and conclusion in the context of the problem, testing at the 5% significance level and citing specific evidence from the obtained output.

Test statistic: 101.22

p-value: <2e-16

Decision: Reject the H0 since the p-value is less than 0.05.

Interpretation of results: At the 5% significance level, we have sufficient evidence that there is a relationship between a passenger’s class and whether or not they survived the sinking of the Titanic (Ha).

Just as a significant ANOVA is followed by post-hoc comparisons, a significant χ2-test of independence can be followed by inference on population proportions to see which groups differ.

Confidence Intervals

What does 1 − \(\hat{p}\) represent in general?

1 - \(\hat{p}\) represents the proportion of failures.

Provide and interpret a 95% confidence interval for the survival of third-class passengers based on the Titanic data using the code below.

# Calculating confidence interval for p
third_class_counts <- contingencyTable %>% 
  dplyr::filter(passenger_class == "Third") %>% 
  dplyr::select(c("Yes", "No"))

third_surv_CI <- prop.test(x = third_class_counts$Yes,
                           n = sum(third_class_counts),
                           conf.level = 0.95,
                           correct = FALSE)

# Printing table of results
third_surv_CI %>% 
  broom::tidy() %>% 
  make_flex(caption = "Confidence interval for proportion of 3rd-class passengers who survived on the Titanic",
            ndigits = 3)
Table 5: Confidence interval for proportion of 3rd-class passengers who survived on the Titanic

estimate | statistic | p.value | parameter | conf.low | conf.high | method | alternative
--- | --- | --- | --- | --- | --- | --- | ---
0.244 | 127.312 | <2e-16 | 1 | 0.208 | 0.284 | 1-sample proportions test without continuity correction | two.sided

The sample proportion, \(\hat{p}\), is 0.244, which is the proportion of third-class passengers who survived the sinking of the Titanic.

Lower bound: 0.208

Upper bound: 0.284

Interpretation: We are 95% confident that the proportion of 3rd-class passengers who survive the sinking of a luxury cruise ship in the Atlantic is between 0.208 and 0.284.
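For readers curious how these bounds arise: with `correct = FALSE`, `prop.test()` returns the Wilson score interval rather than the simpler \(\hat{p} \pm z\sqrt{\hat{p}(1-\hat{p})/n}\) interval. The sketch below rebuilds the interval by hand in base R, assuming the third-class counts from Table 2 (119 survivors out of 487); the object names are illustrative.

```r
# Third-class counts taken from Table 2
x <- 119   # number who survived (successes)
n <- 487   # total third-class passengers

p_hat <- x / n
z <- qnorm(0.975)   # critical value for 95% confidence

# Wilson score interval: center and half-width
center <- (p_hat + z^2 / (2 * n)) / (1 + z^2 / n)
half_width <- z * sqrt(p_hat * (1 - p_hat) / n + z^2 / (4 * n^2)) / (1 + z^2 / n)
wilson_ci <- c(lower = center - half_width, upper = center + half_width)

# Interval reported by prop.test() for comparison
builtin_ci <- prop.test(x = x, n = n, conf.level = 0.95, correct = FALSE)$conf.int

round(wilson_ci, 3)
round(as.numeric(builtin_ci), 3)
```

Both intervals should match the `conf.low` and `conf.high` values in Table 5. Note that the Wilson interval is centered slightly away from \(\hat{p}\), pulled toward 0.5.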

Is this a valid confidence interval? Provide specific evidence to support your conclusion.

  • Independent observations (commonly violated when observations consist of repeated measurements across time)

    Since these were all separate passengers, we will consider this assumption met

  • Variable being studied must be categorical

    Survival status (Yes/No) is a binary categorical variable, so this is met.

  • Number of successes should be at least 5 and number of failures should be at least 5

    Among third-class passengers there were 119 successes (survived) and 368 failures (died). Both are at least 5, so this assumption is met.

Provide and interpret a 95% confidence interval for the survival of first-class passengers based on the Titanic data.

# Calculating confidence interval for p
first_class_counts <- contingencyTable %>% 
  dplyr::filter(passenger_class == "First") %>% 
  dplyr::select(c("Yes", "No"))

first_surv_CI <- prop.test(x = first_class_counts$Yes,
                           n = sum(first_class_counts),
                           conf.level = 0.95,
                           correct = FALSE)

# Printing table of results
first_surv_CI %>% 
  broom::tidy() %>% 
  make_flex(caption = "Confidence interval for proportion of 1st-class passengers who survived on the Titanic",
            ndigits = 3)
Table 6: Confidence interval for proportion of 1st-class passengers who survived on the Titanic

estimate | statistic | p.value | parameter | conf.low | conf.high | method | alternative
--- | --- | --- | --- | --- | --- | --- | ---
0.630 | 14.519 | 0.000139 | 1 | 0.563 | 0.691 | 1-sample proportions test without continuity correction | two.sided

The sample proportion, \(\hat{p}\), is 0.630, which is the proportion of first-class passengers who survived the sinking of the Titanic.

Lower bound: 0.563

Upper bound: 0.691

Interpretation: We are 95% confident that the proportion of 1st-class passengers who survive the sinking of a luxury cruise ship in the Atlantic is between 0.563 and 0.691.

Is this a valid confidence interval? Provide specific evidence to support your conclusion.

  • Independent observations (commonly violated when observations consist of repeated measurements across time)

    Since these were all separate passengers, we will consider this assumption met

  • Variable being studied must be categorical

    Survival status (Yes/No) is a binary categorical variable, so this is met.

  • Number of successes should be at least 5 and number of failures should be at least 5

    Among first-class passengers there were 136 successes (survived) and 80 failures (died). Both are at least 5, so this assumption is met.

What is the margin of error for this interval?

The margin of error is half the width of the interval: (upper bound − lower bound) / 2 = (0.691 − 0.563) / 2 = 0.064.
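The same half-width calculation can be done directly on the `prop.test()` output. This is a small base R sketch assuming the first-class counts from Table 2 (136 survivors out of 216); the names `first_ci` and `margin_of_error` are illustrative.

```r
# First-class counts taken from Table 2: 136 survived out of 216
first_ci <- prop.test(x = 136, n = 216,
                      conf.level = 0.95,
                      correct = FALSE)$conf.int

# Margin of error as half the width of the interval
margin_of_error <- (first_ci[2] - first_ci[1]) / 2

round(margin_of_error, 3)
```

Because this interval is not exactly symmetric around \(\hat{p}\), half the width is the natural way to summarize its margin of error.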

Confidence Interval for p1−p2

Provide and interpret a 95% confidence interval for the difference in the survival rates of first-class and third-class passengers based on the Titanic data using the code below.

# Calculating confidence interval for p1 - p2
class_counts <- contingencyTable %>% 
  dplyr::filter(passenger_class %in% c("First", "Third")) %>% 
  dplyr::select(c("Yes", "No"))

# prop.test() ignores n when x is a two-column matrix of successes and failures
diff_surv_CI <- prop.test(x = as.matrix(class_counts),
                           conf.level = 0.95,
                           correct = FALSE)

# Printing table of results
diff_surv_CI %>% 
  broom::tidy() %>% 
  make_flex(caption = "Confidence interval for difference in the survival rates of 1st-class and 3rd-class passengers",
            ndigits = 3)
Table 7: Confidence interval for difference in the survival rates of 1st-class and 3rd-class passengers

estimate1 | estimate2 | statistic | p.value | parameter | conf.low | conf.high | method | alternative
--- | --- | --- | --- | --- | --- | --- | --- | ---
0.630 | 0.244 | 96.087 | <2e-16 | 1.000 | 0.310 | 0.460 | 2-sample test for equality of proportions without continuity correction | two.sided

Confidence interval limits: (0.310, 0.460)

Interpretation: We are 95% confident that 1st-class passengers on sinking cruise ships in the Atlantic have a survival rate that is between 31 and 46 percentage points higher than that of 3rd-class passengers.
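The limits above can be reproduced by hand. For two groups without continuity correction, `prop.test()` uses the interval \((\hat{p}_1 - \hat{p}_2) \pm z\sqrt{\hat{p}_1(1-\hat{p}_1)/n_1 + \hat{p}_2(1-\hat{p}_2)/n_2}\). The sketch below is a minimal base R version assuming the first- and third-class counts from Table 2; the object names are illustrative.

```r
# Survivors and group sizes taken from Table 2
x <- c(First = 136, Third = 119)   # number who survived
n <- c(First = 216, Third = 487)   # total passengers in each class

p_hat <- x / n
diff_hat <- p_hat[["First"]] - p_hat[["Third"]]

# Standard error of the difference in sample proportions
se_diff <- sqrt(sum(p_hat * (1 - p_hat) / n))

# 95% confidence interval for p1 - p2
z <- qnorm(0.975)
diff_ci <- diff_hat + c(-1, 1) * z * se_diff

# Interval reported by prop.test() for comparison
builtin_ci <- prop.test(x = x, n = n, conf.level = 0.95, correct = FALSE)$conf.int

round(diff_ci, 3)
round(as.numeric(builtin_ci), 3)
```

Both results should agree with `conf.low` and `conf.high` in Table 7.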

Is this a valid confidence interval? Provide specific evidence to support your conclusion.

  • Number of successes should be at least 5 and number of failures should be at least 5

    1st-class passengers: 80 died and 136 survived.

    3rd-class passengers: 368 died and 119 survived.

    Since all four of these counts are at least 5, we have a sufficient number of successes and failures for the confidence interval to be valid.

  • Independent observations (commonly violated when observations consist of repeated measurements across time)

    Since these were all separate passengers, we will consider this assumption met

  • Variable being studied must be categorical

    Survival status (Yes/No) is a binary categorical variable, so this is met.

Do the results of the χ2-test and 95% confidence interval of the difference align? Why or why not?

Yes. The confidence interval for the difference in survival rates between first-class and third-class passengers (0.310, 0.460) does not contain 0, so the survival rates differ, which aligns with the statistically significant result of the chi-square test.
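The connection between the two approaches can also be seen numerically: for two groups, the statistic reported by `prop.test()` is the Pearson χ2 statistic for the corresponding 2 × 2 table. The following base R sketch assumes the first- and third-class counts from Table 2; the object names are illustrative.

```r
# 2 x 2 table of first- and third-class counts taken from Table 2
# (columns ordered as successes, then failures)
first_third <- matrix(c(136,  80,
                        119, 368),
                      nrow = 2, byrow = TRUE,
                      dimnames = list(passenger_class = c("First", "Third"),
                                      survived = c("Yes", "No")))

# Two-sample test of equal proportions and chi-square test on the same table
two_sample <- prop.test(first_third, correct = FALSE)
chi2_2x2 <- chisq.test(first_third, correct = FALSE)

# The two test statistics coincide
c(prop_test = unname(two_sample$statistic),
  chisq_test = unname(chi2_2x2$statistic))

# The interval for p1 - p2 excludes 0, matching the small p-value
two_sample$conf.int
```

Note that this 2 × 2 comparison uses only two of the three classes, so its statistic differs from the 101.22 obtained for the full 3 × 2 table.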