Inference of Population Proportions & Categorical Variables
Learning Objectives
State hypotheses for the χ2-test of independence
State and check assumptions of the χ2-test of independence
Obtain and interpret results of the χ2-test of independence
Obtain and interpret confidence intervals for population proportions & differences in proportions
# Load necessary packages
library(tidyverse)
library(ggthemes)
library(flextable)
library(janitor)
library(broom)
# Set ggplot theme for visualizations
theme_set(ggthemes::theme_few())
# Set options for flextables
set_flextable_defaults(na_str = "NA")
# Load function for printing tables nicely
source("https://raw.githubusercontent.com/dilernia/STA323/main/Functions/make_flex.R")
Titanic data
Importing data from GitHub
# Importing data from course GitHub page
titanic <- read_csv("https://raw.githubusercontent.com/dilernia/STA323/main/Data/titanic.csv")
set.seed(1994)
# Printing sample of 7 rows from data set
titanic %>%
dplyr::sample_n(size = 7) %>%
make_flex(caption = "Randomly selected people who were aboard the Titanic.")
survived | sex | passenger_class | name | age | fare | n_siblings_spouses_aboard | n_parents_children_aboard |
---|---|---|---|---|---|---|---|
Yes | Female | Second | Miss. Amelia Brown | 24.00 | 13.00 | 0.00 | 0.00 |
Yes | Female | Third | Miss. Anna Kristine Salkjelsvik | 21.00 | 7.65 | 0.00 | 0.00 |
No | Male | Third | Mr. Nils Johan Goransson Olsson | 28.00 | 7.85 | 0.00 | 0.00 |
Yes | Female | First | Mrs. William Thompson (Edith Junkins) Graham | 58.00 | 153.46 | 0.00 | 1.00 |
No | Male | Third | Mr. Ole Martin Olsen | 27.00 | 7.31 | 0.00 | 0.00 |
No | Male | Second | Mr. Thomas Charles Mudd | 16.00 | 10.50 | 0.00 | 0.00 |
Yes | Female | Second | Miss. Alice Herman | 24.00 | 65.00 | 1.00 | 2.00 |
What are the response and explanatory variables in this scenario?
Response variable: Survival status (Yes/No)
Explanatory variable: Passenger class (third, second, first)
Stating the hypotheses for the χ2-test of independence
Formally state the hypotheses for our question of interest.
H0: There is no relationship between a passenger’s class and whether or not they survived the sinking of the Titanic.
Ha: There is a relationship between a passenger’s class and whether or not they survived the sinking of the Titanic.
Next, we obtain a contingency table for our question of interest.
# Creating contingency table
contingencyTable <- titanic %>%
janitor::tabyl(var1 = passenger_class, var2 = survived)
# Printing table
contingencyTable %>%
make_flex(caption = "Observed counts for survival status by passenger class", ndigits = 0)
passenger_class | No | Yes |
---|---|---|
First | 80 | 136 |
Second | 97 | 87 |
Third | 368 | 119 |
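Raw counts are hard to compare across classes of different sizes, so it can help to convert each row to proportions. The sketch below uses base R only; the matrix `observed` is an illustrative object (not part of the original code) with the counts copied by hand from the table above.

```r
# Observed counts copied from the contingency table above
# (`observed` is a hypothetical name used only for this sketch)
observed <- matrix(c(80, 136,
                     97, 87,
                     368, 119),
                   nrow = 3, byrow = TRUE,
                   dimnames = list(passenger_class = c("First", "Second", "Third"),
                                   survived = c("No", "Yes")))

# Row proportions: within each class, the share who died and who survived
row_props <- prop.table(observed, margin = 1)
round(row_props, 3)
```

The `Yes` column of `row_props` gives the sample survival rate within each class, which is the quantity the confidence intervals later in this section estimate.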
Visualizations are often more effective than tables for communicating our results.
Clustered bar chart
# Creating a clustered bar chart
titanic %>%
dplyr::count(passenger_class, survived, .drop = FALSE) %>%
dplyr::filter(!is.na(passenger_class), !is.na(survived)) %>%
mutate(passenger_class = fct_reorder(passenger_class, n)) %>%
ggplot(aes(x = passenger_class, y = n,
fill = survived)) +
geom_col(position="dodge", color = "black") +
scale_fill_few() +
scale_y_continuous(expand = expansion(mult = c(0, 0.10))) +
labs(title = "Distribution of survival by passenger class",
y = "Frequency",
x = "Passenger class",
caption = "Data source: Stanford University",
fill = "Survived")
Dumbbell chart
# Creating dumbbell chart of survival status by passenger class
titanic %>%
dplyr::count(survived, passenger_class, .drop = FALSE) %>%
dplyr::filter(!is.na(survived), !is.na(passenger_class)) %>%
ggplot(aes(x = n, y = passenger_class,
color = survived, fill = survived)) +
geom_line(aes(group = passenger_class), color = "black") +
geom_point(pch = 21, color = "black", size = 5) +
scale_fill_few() +
labs(title = "Survival status by passenger class",
x = "Frequency",
y = "Passenger class",
fill = "Survived",
caption = "Data source: Stanford University") +
theme(legend.position = "bottom",
strip.background.y = element_rect(linetype = "solid", color = "black"))
Based on the clustered bar chart and dumbbell plot, does there appear to be a difference in the likelihood of survival by the class of the passengers?
Yes, there are clear differences in the likelihood of survival based on the class of the passenger. First-class passengers had the highest chance of survival and third-class passengers had the lowest.
Next, we implement the χ2-test of independence.
# Implementing a chi-square test of independence
chi2Res <- chisq.test(contingencyTable, tabyl_results = TRUE)
# Printing table of expected counts
chi2Res$expected %>%
make_flex(caption = "Expected counts for survival status by passenger class", ndigits = 1)
passenger_class | No | Yes |
---|---|---|
First | 132.7 | 83.3 |
Second | 113.1 | 70.9 |
Third | 299.2 | 187.8 |
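To see where these expected counts come from, each cell can be reproduced as (row total × column total) / grand total. This is a minimal base R sketch; `observed` and `expected` are illustrative names, and the counts are copied by hand from the observed table.

```r
# Observed counts copied from the contingency table (hypothetical object name)
observed <- matrix(c(80, 136,
                     97, 87,
                     368, 119),
                   nrow = 3, byrow = TRUE,
                   dimnames = list(passenger_class = c("First", "Second", "Third"),
                                   survived = c("No", "Yes")))

# Expected count under independence: row total * column total / grand total
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)
round(expected, 1)

# Smallest expected count, used when checking the test's assumptions
min(expected)
```

Rounded to one decimal place, these values should agree with the table of expected counts above.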
Printing the results of the χ2-test of independence
# Printing model output
chi2Res %>%
broom::tidy() %>%
make_flex(caption = "Results of the chi-square test of independence",
ndigits = 2)
statistic | p.value | parameter | method |
---|---|---|---|
101.22 | <2e-16 | 2 | Pearson's Chi-squared test |
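If the null hypothesis were true, the χ2 test statistic should be close to its degrees of freedom (the parameter column), since the mean of a χ2 distribution equals its degrees of freedom. Here the statistic of 101.22 is far larger than 2.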
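As a rough check on the reported output, the test statistic can be rebuilt by hand as the sum of (observed − expected)² / expected over all six cells, and the p-value as the upper-tail area of a χ2 distribution. The object names below are illustrative and the counts are copied from the observed table; this is a sketch of the calculation, not the code `chisq.test()` runs internally.

```r
# Observed counts copied from the contingency table (hypothetical object name)
observed <- matrix(c(80, 136,
                     97, 87,
                     368, 119),
                   nrow = 3, byrow = TRUE)

# Expected counts under independence
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)

# Pearson chi-square statistic: sum of squared deviations scaled by expected counts
chi_sq <- sum((observed - expected)^2 / expected)

# Degrees of freedom: (rows - 1) * (columns - 1)
df <- (nrow(observed) - 1) * (ncol(observed) - 1)

# p-value: probability of a statistic at least this large if H0 were true
p_value <- pchisq(chi_sq, df = df, lower.tail = FALSE)

c(statistic = chi_sq, df = df, p.value = p_value)
```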
State and check assumptions of the χ2-test of independence
State each assumption for the χ2-test (chi-square) of independence, and indicate whether or not each assumption is met, citing specific evidence from the output obtained.
Independent observations (commonly violated when observations consist of repeated measurements across time)
Since these were all separate passengers, we will consider this assumption met
Both variables being studied must be categorical
Both survival status and passenger class are categorical
All expected counts under the null hypothesis should be at least 5
The smallest expected count is 70.9, which is well above 5, so this assumption is met.
What are the degrees of freedom associated with the χ2-test here?
2, since df = (number of rows − 1) × (number of columns − 1) = (3 − 1) × (2 − 1) = 2.
Interpreting results of the χ2-test of independence
State our test statistic, p-value, decision, and conclusion in the context of the problem, testing at the 5% significance level and citing specific evidence from the obtained output.
test statistic: 101.22
p-value: <2e-16
Decision: Reject H0 since the p-value (<2e-16) is less than 0.05.
Interpretation of results: At the 5% significance level, we have sufficient evidence that there is a relationship between a passenger’s class and whether or not they survived the sinking of the Titanic (Ha).
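The same decision can also be reached by comparing the test statistic with the critical value of the χ2 distribution, which is equivalent to comparing the p-value with 0.05. A small sketch, with the statistic and degrees of freedom typed in from the output above and all variable names chosen for illustration:

```r
chi_sq <- 101.22  # test statistic reported in the output above
df <- 2           # degrees of freedom (the parameter column)
alpha <- 0.05     # significance level

# Critical value: the 95th percentile of the chi-square distribution with 2 df
critical_value <- qchisq(1 - alpha, df = df)

# Reject H0 when the statistic exceeds the critical value
reject_h0 <- chi_sq > critical_value

critical_value
reject_h0
```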
Inference for population proportions plays a role for the χ2-test analogous to that of post-hoc comparisons in ANOVA: it shows which groups differ and by how much.
Confidence Intervals
What does 1 − \(\hat{p}\) represent in general?
1 − \(\hat{p}\) represents the sample proportion of failures.
Provide and interpret a 95% confidence interval for the survival of third-class passengers based on the Titanic data using the code below.
# Calculating confidence interval for p
third_class_counts <- contingencyTable %>%
dplyr::filter(passenger_class == "Third") %>%
dplyr::select(c("Yes", "No"))
third_surv_CI <- prop.test(x = third_class_counts$Yes,
n = sum(third_class_counts),
conf.level = 0.95,
correct = FALSE)
# Printing table of results
third_surv_CI %>%
broom::tidy() %>%
make_flex(caption = "Confidence interval for proportion of 3rd-class passengers who survived on the Titanic",
ndigits = 3)
estimate | statistic | p.value | parameter | conf.low | conf.high | method | alternative |
---|---|---|---|---|---|---|---|
0.244 | 127.312 | <2e-16 | 1 | 0.208 | 0.284 | 1-sample proportions test without continuity correction | two.sided |
The sample proportion, \(\hat{p}\), is 0.244, which is the proportion of third-class passengers who survived the sinking of the Titanic.
Lower bound: 0.208
Upper bound: 0.284
Interpretation: We are 95% confident that the proportion of 3rd-class passengers who survive the sinking of luxury cruise ships in the Atlantic is between 0.208 and 0.284.
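With `correct = FALSE`, the one-sample interval from `prop.test()` is a Wilson score interval rather than the textbook \(\hat{p} \pm z \cdot SE\) interval. The sketch below rebuilds it by hand for the third-class counts (119 survivors out of 487); the variable names are illustrative, and the result should agree with the bounds above up to rounding.

```r
x <- 119  # third-class passengers who survived (from the contingency table)
n <- 487  # all third-class passengers (119 + 368)

p_hat <- x / n
z <- qnorm(0.975)  # critical value for 95% confidence

# Wilson score interval: the center is pulled slightly toward 0.5
center <- (p_hat + z^2 / (2 * n)) / (1 + z^2 / n)
half_width <- z * sqrt(p_hat * (1 - p_hat) / n + z^2 / (4 * n^2)) / (1 + z^2 / n)

wilson <- c(lower = center - half_width, upper = center + half_width)
round(wilson, 3)
```

Because the center is shifted toward 0.5, the interval is not exactly symmetric around \(\hat{p}\), although with a sample this large the asymmetry is small.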
Is this a valid confidence interval? Provide specific evidence to support your conclusion.
Independent observations (commonly violated when observations consist of repeated measurements across time)
Since these were all separate passengers, we will consider this assumption met
Variable being studied must be categorical
Survival status (Yes/No) is a binary categorical variable, so this is met.
Number of successes should be at least 5 and number of failures should be at least 5
With 119 successes (survivors) and 368 failures (deaths) among third-class passengers, both counts are at least 5, so this assumption is met.
Provide and interpret a 95% confidence interval for the survival of first-class passengers based on the Titanic data.
# Calculating confidence interval for p
first_class_counts <- contingencyTable %>%
dplyr::filter(passenger_class == "First") %>%
dplyr::select(c("Yes", "No"))
first_surv_CI <- prop.test(x = first_class_counts$Yes,
n = sum(first_class_counts),
conf.level = 0.95,
correct = FALSE)
# Printing table of results
first_surv_CI %>%
broom::tidy() %>%
make_flex(caption = "Confidence interval for proportion of 1st-class passengers who survived on the Titanic",
ndigits = 3)
estimate | statistic | p.value | parameter | conf.low | conf.high | method | alternative |
---|---|---|---|---|---|---|---|
0.630 | 14.519 | 0.000139 | 1 | 0.563 | 0.691 | 1-sample proportions test without continuity correction | two.sided |
The sample proportion, \(\hat{p}\), is 0.630, which is the proportion of first-class passengers who survived the sinking of the Titanic.
Lower bound: 0.563
Upper bound: 0.691
Interpretation: We are 95% confident that the proportion of 1st-class passengers who survive the sinking of luxury cruise ships in the Atlantic is between 0.563 and 0.691.
Is this a valid confidence interval? Provide specific evidence to support your conclusion.
Independent observations (commonly violated when observations consist of repeated measurements across time)
Since these were all separate passengers, we will consider this assumption met
Variable being studied must be categorical
Survival status (Yes/No) is a binary categorical variable, so this is met.
Number of successes should be at least 5 and number of failures should be at least 5
With 80 failures and 136 successes among first-class passengers, both counts are at least 5, so this assumption is met.
What is the margin of error for this interval?
The margin of error is half the width of the interval: (upper bound − lower bound) / 2 = (0.691 − 0.563) / 2 = 0.064.
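The arithmetic can be checked in a couple of lines. The bounds and estimate are typed in from the table above, and the variable names are illustrative. Since this interval is a score interval, the distances from \(\hat{p}\) to each bound are slightly unequal, so half the width is best read as an approximate margin of error.

```r
lower <- 0.563   # conf.low from the output above
upper <- 0.691   # conf.high from the output above
p_hat <- 0.630   # estimate from the output above

# Margin of error as half the interval width
margin_of_error <- (upper - lower) / 2
margin_of_error

# Distance from the estimate to each bound (not exactly equal)
c(below = p_hat - lower, above = upper - p_hat)
```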
Confidence Interval for \(p_1 - p_2\)
Provide and interpret a 95% confidence interval for the difference in the survival rates of first-class and third-class passengers based on the Titanic data using the code below.
# Calculating confidence interval for p1 - p2
class_counts <- contingencyTable %>%
dplyr::filter(passenger_class %in% c("First", "Third")) %>%
dplyr::select(c("Yes", "No"))
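# x is a two-column matrix of successes (Yes) and failures (No), one row per class,
# so prop.test() derives the group sizes itself and no n argument is needed
diff_surv_CI <- prop.test(x = as.matrix(class_counts),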
conf.level = 0.95,
correct = FALSE)
# Printing table of results
diff_surv_CI %>%
broom::tidy() %>%
make_flex(caption = "Confidence interval for difference in the survival rates of 1st-class and 3rd-class passengers",
ndigits = 3)
estimate1 | estimate2 | statistic | p.value | parameter | conf.low | conf.high | method | alternative |
---|---|---|---|---|---|---|---|---|
0.630 | 0.244 | 96.087 | <2e-16 | 1.000 | 0.310 | 0.460 | 2-sample test for equality of proportions without continuity correction | two.sided |
Confidence interval limits: (0.310, 0.460)
Interpretation: We are 95% confident that 1st-class passengers on sinking cruise ships in the Atlantic have a survival rate that is between 31 and 46 percentage points higher than that of 3rd-class passengers.
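For two groups with `correct = FALSE`, the interval reported by `prop.test()` is the difference in sample proportions plus or minus a critical value times the unpooled standard error. The sketch below reproduces it from the counts in the contingency table; the variable names are illustrative.

```r
# Counts copied from the contingency table
x1 <- 136; n1 <- 216  # first class: survivors, total (136 + 80)
x2 <- 119; n2 <- 487  # third class: survivors, total (119 + 368)

p1 <- x1 / n1
p2 <- x2 / n2
z <- qnorm(0.975)  # critical value for 95% confidence

# Unpooled standard error of the difference in proportions
se_diff <- sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)

diff_ci <- c(estimate = p1 - p2,
             lower = (p1 - p2) - z * se_diff,
             upper = (p1 - p2) + z * se_diff)
round(diff_ci, 3)
```

The rounded bounds should match `conf.low` and `conf.high` in the table above.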
Is this a valid confidence interval? Provide specific evidence to support your conclusion.
Number of successes should be at least 5 and number of failures should be at least 5
1st-class passengers: 80 died and 136 survived.
3rd-class passengers: 368 died and 119 survived.
Since all four counts are at least 5, we have enough successes and failures for the confidence interval to be valid.
Independent observations (commonly violated when observations consist of repeated measurements across time)
Since these were all separate passengers, we will consider this assumption met
Variable being studied must be categorical
Survival status (Yes/No) is a binary categorical variable, so this is met.
Do the results of the χ2-test and the 95% confidence interval for the difference align? Why or why not?
Yes. The interval (0.310, 0.460) does not contain 0, so the survival rates of first- and third-class passengers differ, which aligns with the statistically significant result of the χ2-test.