William Gosset (13 June 1876 – 16 October 1937) was an English statistician, chemist and brewer who served as Head Experimental Brewer and Head Brewer of Guinness and was a pioneer of modern statistics.
You have probably heard of statisticians like Pearson, Gosset, Fisher, and Box. Keep in mind that their work also laid foundations for modern management. Several ideas taught in management courses today (Six Sigma, Total Quality Management, the normal distribution, etc.) draw on concepts these statisticians defined or popularized about 100 years ago. These concepts and their underlying techniques helped build the foundation for today’s business schools. Gosset in particular put the statistical techniques he developed in the early 20th century to practical use at Guinness. See the following quotes and the article for more details.
“One of the greatest minds in 20th Century statistics was not a scholar. He brewed beer.”
“Guinness brewer William S. Gosset’s work is responsible for inspiring the concept of statistical significance, industrial quality control, efficient design of experiments and, not least of all, consistently great tasting beer.”
Reference - https://priceonomics.com/the-guinness-brewer-who-revolutionized-statistics/
Having an understanding of the chi-square test of independence
Implementing the chi-square test using R
Possibly, using the chi-square test for feature selection (probably too ambitious)
The dataset used in this tutorial is available from the Stanford University Data Science Website (see the reference below). The dataset contains data regarding the passengers of the Titanic.
We hypothesized that there is a relationship between survival probability and sex, age group, passenger class, and other possible factors. Our uneducated guess is that passenger class would have the most significant effect.
Relabel the original variables to facilitate analysis and interpretation. This can be done in R as well. The new dataset is saved as a CSV file in the same folder I normally use for Rstudio (for statistical analysis) or Rmarkdown (for publication purposes).
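The relabeling step can be sketched in R as follows. This is a minimal illustration, not the exact procedure used for this tutorial: the long column names below are an assumption about how the raw file is labeled, and the toy data frame `raw` stands in for the real file.

```r
# Toy stand-in for the raw file. The long column names are an ASSUMPTION
# about the original labels; adjust them to match your copy of the data.
raw <- data.frame(
  Survived = c(0, 1),
  Pclass = c(3, 1),
  `Siblings/Spouses Aboard` = c(1, 1),
  `Parents/Children Aboard` = c(0, 0),
  check.names = FALSE  # keep the long names exactly as written
)

# Lookup table: old name -> short label used in this tutorial.
new_names <- c(
  "Siblings/Spouses Aboard" = "SibSp",
  "Parents/Children Aboard" = "ParCh"
)

# Rename only the columns that appear in the lookup table.
hits <- names(raw) %in% names(new_names)
names(raw)[hits] <- unname(new_names[names(raw)[hits]])

names(raw)
```

After renaming, `write.csv(raw, "titanic.csv", row.names = FALSE)` would save the relabeled data under the file name read in the next step.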
Read the Titanic data set into R and create a data frame called “titanic”. We also use a few functions (View, str, ls, head, etc.) to understand what our dataset looks like.
Reference: https://web.stanford.edu/class/archive/cs/cs109/cs109.1166/problem12.html
setwd("C:/Users/zxu3/Documents/R/chisquare")
#Please install the following package if the package "readr" is not installed.
#install.packages("readr")
library(readr)
## Warning: package 'readr' was built under R version 3.6.3
titanic <- read.csv("titanic.csv")
View(titanic)
str(titanic)
## 'data.frame': 887 obs. of 8 variables:
## $ Survived: int 0 1 1 1 0 0 0 0 1 1 ...
## $ Pclass : int 3 1 3 1 3 3 1 3 3 2 ...
## $ Name : Factor w/ 887 levels "Capt. Edward Gifford Crosby",..: 602 823 172 814 733 464 700 33 842 839 ...
## $ Sex : Factor w/ 2 levels "female","male": 2 1 1 1 2 2 2 2 1 1 ...
## $ Age : num 22 38 26 35 35 27 54 2 27 14 ...
## $ SibSp : int 1 1 0 1 0 0 0 3 0 1 ...
## $ ParCh : int 0 0 0 0 0 0 0 1 2 0 ...
## $ Fare : num 7.25 71.28 7.92 53.1 8.05 ...
ls(titanic) # list the variables in the dataset
## [1] "Age" "Fare" "Name" "ParCh" "Pclass" "Sex" "SibSp"
## [8] "Survived"
head(titanic) #list the first 6 rows of the dataset
## Survived Pclass Name Sex Age
## 1 0 3 Mr. Owen Harris Braund male 22
## 2 1 1 Mrs. John Bradley (Florence Briggs Thayer) Cumings female 38
## 3 1 3 Miss. Laina Heikkinen female 26
## 4 1 1 Mrs. Jacques Heath (Lily May Peel) Futrelle female 35
## 5 0 3 Mr. William Henry Allen male 35
## 6 0 3 Mr. James Moran male 27
## SibSp ParCh Fare
## 1 1 0 7.2500
## 2 1 0 71.2833
## 3 0 0 7.9250
## 4 1 0 53.1000
## 5 0 0 8.0500
## 6 0 0 8.4583
This tells us that the dataset contains records for 887 passengers on the Titanic. We have 8 variables in the dataset (see the ls(titanic) function and its results).
We use the table function to count the number of passengers who survived the sinking of the Titanic. Here table(titanic$Survived) tabulates how often each value of the variable “Survived” occurs in the titanic dataset (0 = did not survive, 1 = survived).
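The str() output above shows Name and Sex as factors, which matches the R 3.6 session used here. From R 4.0.0 onward, read.csv() no longer converts strings to factors by default, so the same call would show character columns instead. A small sketch of how to request factors explicitly and check the dimensions; the two-row sample file is made up for illustration:

```r
# Write a two-row sample to a temporary file so the sketch is self-contained.
tmp <- tempfile(fileext = ".csv")
writeLines(c(
  "Survived,Pclass,Sex,Age",
  "0,3,male,22",
  "1,1,female,38"
), tmp)

# Ask for factors explicitly instead of relying on the version-dependent default.
sample_df <- read.csv(tmp, stringsAsFactors = TRUE)

nrow(sample_df)        # number of passengers (rows)
ncol(sample_df)        # number of variables (columns)
levels(sample_df$Sex)  # factor levels are sorted alphabetically
```

With the full file, nrow() and ncol() should agree with the "887 obs. of 8 variables" line reported by str().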
titanic.survived <- table(titanic$Survived)
titanic.survived
##
## 0 1
## 545 342
Using the function prop.table, we show the percentage of passengers who survived the sinking of the Titanic.
prop.table(titanic.survived) *100
##
## 0 1
## 61.44307 38.55693
We found that approximately 38.56% of the 887 listed passengers survived the Titanic.
Using the function xtabs, we identify the number of first-class passengers who survived the sinking of the Titanic.
titanic.survived2 <- xtabs(~Survived+Pclass, data=titanic)
addmargins(titanic.survived2)
## Pclass
## Survived 1 2 3 Sum
## 0 80 97 368 545
## 1 136 87 119 342
## Sum 216 184 487 887
We found that 136 of the 216 First Class passengers survived the Titanic.
In this step, we try to measure the percentage of first-class passengers who survived the sinking of the Titanic.
prop.table(titanic.survived2, 2) *100
## Pclass
## Survived 1 2 3
## 0 37.03704 52.71739 75.56468
## 1 62.96296 47.28261 24.43532
We found that 62.96% of the 1st Class passengers survived the Titanic.
In this step, we find the percentage of survivors who were female, using the row percentages of the Survived-by-Sex table.
titanic.survived4 <- xtabs(~Survived+Sex, data = titanic)
prop.table(titanic.survived4,1) *100
## Sex
## Survived female male
## 0 14.86239 85.13761
## 1 68.12865 31.87135
Our result suggests that about 68.13% of survivors were female.
In this step, we would like to find the percentage of females on board the Titanic who survived
prop.table(titanic.survived4,2) *100
## Sex
## Survived female male
## 0 25.79618 80.97731
## 1 74.20382 19.02269
Our result suggests that 74.20% of females on board the Titanic survived.
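The last two steps differ only in the margin argument of prop.table: 1 normalizes each row (each survival outcome) and 2 normalizes each column (each sex). The sketch below makes the difference explicit. The cell counts are not printed in this tutorial; they are reconstructed from the totals and percentages shown above, so treat them as an assumption.

```r
# Survived-by-Sex counts, reconstructed from the totals and percentages
# reported in this tutorial (ASSUMPTION: not read from the CSV here).
sex_tab <- matrix(
  c(81, 233, 464, 109),  # filled column by column: female, then male
  nrow = 2,
  dimnames = list(Survived = c("0", "1"), Sex = c("female", "male"))
)

# margin = 1: each row sums to 100 (share of each sex among survivors / non-survivors)
row_pct <- prop.table(sex_tab, 1) * 100

# margin = 2: each column sums to 100 (survival rate within each sex)
col_pct <- prop.table(sex_tab, 2) * 100

rowSums(row_pct)
colSums(col_pct)
round(row_pct["1", "female"], 2)  # percentage of survivors who were female
round(col_pct["1", "female"], 2)  # percentage of females who survived
```

Reading the wrong margin answers a different question, so it helps to check which sums equal 100 before quoting a percentage.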
In this step, we run a Pearson’s Chi-squared test of independence. The test is non-directional, so the hypotheses are stated in terms of association:
Null Hypothesis - H0: Sex is not associated with survival; that is, the proportion who survived is the same for females and males.
Alternative Hypothesis - H1: Sex is associated with survival; that is, the proportion of females onboard who survived the sinking of the Titanic differs from the proportion of males onboard who survived.
chisq.test(titanic.survived4)
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: titanic.survived4
## X-squared = 258.39, df = 1, p-value < 2.2e-16
Since the p-value is less than .05, we reject the null hypothesis of no association between sex and survival. The chi-square test itself does not indicate direction; the column percentages above do: 74.20% of females onboard survived the sinking of the Titanic, compared with 19.02% of males.
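To see where the X-squared value comes from, the statistic can be recomputed by hand: build the expected counts under independence, apply Yates' continuity correction (subtract 0.5 from each absolute difference), and sum over the four cells. This is a sketch under the assumption that the cell counts reconstructed from the reported totals and percentages are correct; it is meant to illustrate the formula, not to replace chisq.test.

```r
# Survived-by-Sex counts, reconstructed from the totals and percentages
# reported in this tutorial (ASSUMPTION: not read from the CSV here).
sex_tab <- matrix(
  c(81, 233, 464, 109),
  nrow = 2,
  dimnames = list(Survived = c("0", "1"), Sex = c("female", "male"))
)

# Expected count for each cell under independence:
# (row total * column total) / grand total
expected <- outer(rowSums(sex_tab), colSums(sex_tab)) / sum(sex_tab)

# Pearson statistic with Yates' continuity correction (2 x 2 tables only)
x2_yates <- sum((abs(sex_tab - expected) - 0.5)^2 / expected)

# Degrees of freedom: (rows - 1) * (columns - 1)
df <- (nrow(sex_tab) - 1) * (ncol(sex_tab) - 1)

# Upper-tail probability of the chi-square distribution
p_value <- pchisq(x2_yates, df = df, lower.tail = FALSE)

# Compare with the built-in test, which applies the same correction by default
builtin <- chisq.test(sex_tab)

round(x2_yates, 2)
unname(round(builtin$statistic, 2))
```

The observed count of non-surviving females (81) is far below its expected value of about 193, which is what drives the large statistic.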