The statistical significance of beer, and the statisticians behind Six Sigma business management

William Sealy Gosset (13 June 1876 – 16 October 1937) was an English statistician, chemist and brewer who served as Head Experimental Brewer and Head Brewer of Guinness and was a pioneer of modern statistics.

You have probably heard of statisticians like Pearson, Gosset, Fisher, and Box. Please keep in mind that they also helped lay the foundations of modern management. Some of the terms we learn today (Six Sigma, Total Quality Management, the normal distribution, etc.) rest on concepts that these statisticians defined or developed about 100 years ago. These concepts and their underlying techniques built much of the foundation for today’s business schools. In particular, Gosset applied the statistical techniques he developed in the early 20th century to manage the quality of the Guinness brand for his employer. See the following quotes and the article for more details.

“One of the greatest minds in 20th Century statistics was not a scholar. He brewed beer.”

“Guinness brewer William S. Gosset’s work is responsible for inspiring the concept of statistical significance, industrial quality control, efficient design of experiments and, not least of all, consistently great tasting beer.”

Reference - https://priceonomics.com/the-guinness-brewer-who-revolutionized-statistics/

Objectives of this tutorial

Having an understanding of the chi-square test of independence

Implementing the chi-square test using R

Possibly, using the chi-square test for feature selection (probably too ambitious)

Introduction

The dataset used in this tutorial is available from the Stanford University Data Science Website (see the reference below). The dataset contains data regarding the passengers of the Titanic.

We hypothesized that there is a relationship between survival probability and sex, age group, passenger class, and other possible factors. Our uneducated guess is that passenger class would have the most significant effect.

Step 0 - Data preparation (tidy the dataset)

Relabel the original variables to facilitate analysis and interpretation. This can be done in R as well. The new dataset is saved as a CSV file in the same folder I normally use for RStudio (for statistical analysis) or R Markdown (for publication purposes).
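The relabeling itself is not shown in this tutorial, so the following is only a minimal sketch of how it could be done in R. It assumes that the raw file uses long headers such as "Siblings/Spouses Aboard" and "Parents/Children Aboard" (an assumption about the raw CSV), and it uses a hypothetical two-row data frame in place of the real file.

```r
# Hypothetical two-row stand-in for the raw file. The long column names are an
# assumption about the original headers; they are not taken from this tutorial.
raw <- data.frame(
  Survived = c(0L, 1L),
  Pclass = c(3L, 1L),
  `Siblings/Spouses Aboard` = c(1L, 1L),
  `Parents/Children Aboard` = c(0L, 0L),
  check.names = FALSE
)

# Relabel the long headers with the short names used in the rest of the tutorial.
names(raw)[names(raw) == "Siblings/Spouses Aboard"] <- "SibSp"
names(raw)[names(raw) == "Parents/Children Aboard"] <- "ParCh"

names(raw)

# Save the tidied data as a CSV file (written to a temporary folder in this sketch).
out_file <- file.path(tempdir(), "titanic_tidy.csv")
write.csv(raw, out_file, row.names = FALSE)
```

In practice, the file would be written to the working directory used in Step 1 rather than to a temporary folder.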

Step 1 - reading the data

Read the Titanic dataset into R and create a data frame called “titanic”. We also use a few functions (View, str, ls, head, etc.) to understand what our dataset looks like.

Reference: https://web.stanford.edu/class/archive/cs/cs109/cs109.1166/problem12.html

setwd("C:/Users/zxu3/Documents/R/chisquare") 
# read.csv() below is part of base R, so loading "readr" is optional here.
# install.packages("readr") # run this once if "readr" is not installed
library(readr)
## Warning: package 'readr' was built under R version 3.6.3
titanic <- read.csv("titanic.csv")
View(titanic)
str(titanic)
## 'data.frame':    887 obs. of  8 variables:
##  $ Survived: int  0 1 1 1 0 0 0 0 1 1 ...
##  $ Pclass  : int  3 1 3 1 3 3 1 3 3 2 ...
##  $ Name    : Factor w/ 887 levels "Capt. Edward Gifford Crosby",..: 602 823 172 814 733 464 700 33 842 839 ...
##  $ Sex     : Factor w/ 2 levels "female","male": 2 1 1 1 2 2 2 2 1 1 ...
##  $ Age     : num  22 38 26 35 35 27 54 2 27 14 ...
##  $ SibSp   : int  1 1 0 1 0 0 0 3 0 1 ...
##  $ ParCh   : int  0 0 0 0 0 0 0 1 2 0 ...
##  $ Fare    : num  7.25 71.28 7.92 53.1 8.05 ...
ls(titanic) # list the variables in the dataset
## [1] "Age"      "Fare"     "Name"     "ParCh"    "Pclass"   "Sex"      "SibSp"   
## [8] "Survived"
head(titanic) #list the first 6 rows of the dataset
##   Survived Pclass                                               Name    Sex Age
## 1        0      3                             Mr. Owen Harris Braund   male  22
## 2        1      1 Mrs. John Bradley (Florence Briggs Thayer) Cumings female  38
## 3        1      3                              Miss. Laina Heikkinen female  26
## 4        1      1        Mrs. Jacques Heath (Lily May Peel) Futrelle female  35
## 5        0      3                            Mr. William Henry Allen   male  35
## 6        0      3                                    Mr. James Moran   male  27
##   SibSp ParCh    Fare
## 1     1     0  7.2500
## 2     1     0 71.2833
## 3     0     0  7.9250
## 4     1     0 53.1000
## 5     0     0  8.0500
## 6     0     0  8.4583

This tells us that the dataset contains records for 887 passengers on the Titanic. We have 8 variables in the dataset (see the ls(titanic) call and its results).
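The str() output above was produced with R 3.6, where read.csv turned text columns into factors. From R 4.0.0 onward the default is character columns, so the same call prints chr instead of Factor. The sketch below shows one way to reproduce the factor columns. It uses three rows copied from the head(titanic) output as inline text; the names csv_text and sample_rows are ours.

```r
# Three rows copied from the head(titanic) output above, supplied as inline text
# so that the sketch runs without the CSV file.
csv_text <- "Survived,Pclass,Name,Sex,Age
0,3,Mr. Owen Harris Braund,male,22
1,1,Mrs. John Bradley (Florence Briggs Thayer) Cumings,female,38
1,3,Miss. Laina Heikkinen,female,26"

# R 4.0.0 and later read text columns as character by default;
# stringsAsFactors = TRUE restores the factor columns shown in the str() output.
sample_rows <- read.csv(text = csv_text, stringsAsFactors = TRUE)

str(sample_rows)
dim(sample_rows)  # number of rows, then number of columns
```

The tables in the following steps come out the same with either column type, because table and xtabs count character values and factor levels alike.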

Step 2.1 - Descriptive analysis 1

We use the table function to count the number of passengers who survived the sinking of the Titanic. The call table(titanic$Survived) returns the frequency of each value of the variable “Survived” in the titanic dataset (0 = did not survive, 1 = survived).

titanic.survived <- table(titanic$Survived)
titanic.survived
## 
##   0   1 
## 545 342

Step 2.2 - Descriptive analysis 2

Using the function prop.table, we show the percentage of passengers who survived the sinking of the Titanic.

prop.table(titanic.survived) *100
## 
##        0        1 
## 61.44307 38.55693

We found that approximately 38.56% of the 887 listed passengers survived the sinking of the Titanic.
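As a quick check, the same percentage can be recomputed by hand from the two counts printed in Step 2.1. This is a minimal sketch; the variable names are ours.

```r
# Counts taken from the table(titanic$Survived) output above.
survived_counts <- c("0" = 545, "1" = 342)

# Each percentage is the count divided by the total number of passengers.
total <- sum(survived_counts)
pct <- survived_counts / total * 100

total
round(pct, 2)
```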

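Step 2.3 - Descriptive analysis 3

Using the function xtabs, we cross-tabulate survival by passenger class to identify the number of first-class passengers who survived the sinking of the Titanic. The function addmargins adds the row and column totals.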

titanic.survived2 <- xtabs(~Survived+Pclass, data=titanic)
addmargins(titanic.survived2)
##         Pclass
## Survived   1   2   3 Sum
##      0    80  97 368 545
##      1   136  87 119 342
##      Sum 216 184 487 887

We found that 136 of the 216 first-class passengers survived the sinking of the Titanic.
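Individual cells can also be pulled out of the table by name instead of being read off the printout. The sketch below rebuilds the cross-tabulation from the counts shown above, so that it runs without the CSV file; class_tab and with_totals are our own names.

```r
# Survived x Pclass counts copied from the addmargins() output above.
# matrix() fills column by column: class 1, then class 2, then class 3.
class_tab <- as.table(matrix(
  c(80, 136, 97, 87, 368, 119),
  nrow = 2,
  dimnames = list(Survived = c("0", "1"), Pclass = c("1", "2", "3"))
))

with_totals <- addmargins(class_tab)

with_totals["1", "1"]    # first-class passengers who survived
with_totals["Sum", "1"]  # all first-class passengers
```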

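Step 2.4 - Descriptive analysis 4

In this step, we measure the percentage of first-class passengers who survived the sinking of the Titanic. The second argument of prop.table (2) asks for percentages within each column, that is, within each passenger class.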

prop.table(titanic.survived2, 2) *100
##         Pclass
## Survived        1        2        3
##        0 37.03704 52.71739 75.56468
##        1 62.96296 47.28261 24.43532

We found that 62.96% of the first-class passengers survived the sinking of the Titanic.

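Step 2.5 - Descriptive analysis 5

In this step, we cross-tabulate survival by sex and find the percentage of survivors who were female. The second argument of prop.table (1) asks for percentages within each row, that is, within each survival outcome.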

titanic.survived4 <- xtabs(~Survived+Sex, data = titanic)
prop.table(titanic.survived4,1) *100
##         Sex
## Survived   female     male
##        0 14.86239 85.13761
##        1 68.12865 31.87135

Our result suggests that about 68.13% of survivors were female.

Step 2.6 - Descriptive analysis 6

In this step, we find the percentage of females on board the Titanic who survived. This time the percentages are computed within each column (sex).

prop.table(titanic.survived4,2) *100
##         Sex
## Survived   female     male
##        0 25.79618 80.97731
##        1 74.20382 19.02269

Our result suggests that 74.20% of females on board the Titanic survived.
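Steps 2.5 and 2.6 use the same table and differ only in the margin passed to prop.table, which is easy to mix up. The sketch below puts the two side by side. The cell counts (81, 233, 464, 109) are not printed in this tutorial; we derived them from the percentages and totals shown above, so treat them as a reconstruction.

```r
# Survived x Sex counts, reconstructed from the percentages and totals above
# (for example, 68.12865% of 342 survivors is 233 females).
sex_tab <- as.table(matrix(
  c(81, 233, 464, 109),
  nrow = 2,
  dimnames = list(Survived = c("0", "1"), Sex = c("female", "male"))
))

# margin = 1: each row sums to 100, answering "what share of survivors were female?"
row_pct <- prop.table(sex_tab, 1) * 100

# margin = 2: each column sums to 100, answering "what share of females survived?"
col_pct <- prop.table(sex_tab, 2) * 100

round(row_pct, 2)
round(col_pct, 2)
```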

Step 3 - Chi-square analysis

In this step, we run Pearson’s chi-squared test of independence to test the following hypotheses:

Null Hypothesis - H0: Gender is not associated with survival. In other words, the proportion of passengers who survived is the same for females and males.

Alternative Hypothesis - H1: Gender is associated with survival. In other words, the proportion of females onboard who survived the sinking of the Titanic differs from the proportion of males onboard who survived. The chi-squared test itself does not indicate the direction of the difference; we read the direction from the percentages in Step 2.6.

chisq.test(titanic.survived4)
## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  titanic.survived4
## X-squared = 258.39, df = 1, p-value < 2.2e-16
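To see where the statistic comes from, the sketch below recomputes it by hand: expected counts under independence, then the Yates-corrected sum over the four cells. The cell counts are our reconstruction from the percentages in Steps 2.5 and 2.6, and the variable names are ours.

```r
# Survived x Sex counts, reconstructed from the percentages and totals above.
observed <- as.table(matrix(
  c(81, 233, 464, 109),
  nrow = 2,
  dimnames = list(Survived = c("0", "1"), Sex = c("female", "male"))
))

n <- sum(observed)

# Expected count in each cell if survival and sex were independent:
# (row total * column total) / grand total.
expected <- outer(rowSums(observed), colSums(observed)) / n

# Yates' continuity correction subtracts 0.5 from each absolute difference
# before squaring (applied by chisq.test to 2 x 2 tables by default).
x_squared <- sum((abs(observed - expected) - 0.5)^2 / expected)

# Degrees of freedom: (rows - 1) * (columns - 1).
df <- (nrow(observed) - 1) * (ncol(observed) - 1)
p_value <- pchisq(x_squared, df = df, lower.tail = FALSE)

round(x_squared, 2)
df
p_value

# Built-in test on the same table, for comparison.
builtin <- chisq.test(observed)
builtin$statistic
```

Passing correct = FALSE to chisq.test would switch the continuity correction off and give the uncorrected Pearson statistic instead.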

Conclusions

Since the p-value is less than .05, we reject the null hypothesis and conclude that gender is associated with survival. The percentages in Step 2.6 show the direction of the association: the proportion of females onboard who survived the sinking of the Titanic (74.20%) was higher than the proportion of males onboard who survived (19.02%).
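As a first step toward the feature-selection objective, the same test can be run on each candidate variable and the strength of association compared. The sketch below does this for sex and passenger class only, using the counts from Steps 2.3 to 2.6 (the sex counts are our reconstruction from the percentages). Cramér's V is our choice of effect-size measure and the helper cramers_v is our own function; neither is part of the original analysis. On these counts, sex shows a stronger association with survival than passenger class.

```r
# Survived x Sex counts, reconstructed from the percentages in Steps 2.5 and 2.6.
sex_tab <- as.table(matrix(
  c(81, 233, 464, 109),
  nrow = 2,
  dimnames = list(Survived = c("0", "1"), Sex = c("female", "male"))
))

# Survived x Pclass counts, copied from the output in Step 2.3.
class_tab <- as.table(matrix(
  c(80, 136, 97, 87, 368, 119),
  nrow = 2,
  dimnames = list(Survived = c("0", "1"), Pclass = c("1", "2", "3"))
))

# Cramér's V: sqrt(X^2 / (n * (min(rows, columns) - 1))), computed from the
# uncorrected Pearson statistic. It ranges from 0 (no association) to 1.
cramers_v <- function(tab) {
  test <- chisq.test(tab, correct = FALSE)
  unname(sqrt(test$statistic / (sum(tab) * (min(dim(tab)) - 1))))
}

features <- list(Sex = sex_tab, Pclass = class_tab)

ranking <- data.frame(
  feature = names(features),
  p_value = sapply(features, function(tab) chisq.test(tab)$p.value),
  cramers_v = sapply(features, cramers_v)
)

# Strongest association first.
ranking[order(-ranking$cramers_v), ]
```

A fuller feature-selection exercise would also need to bin the numeric variables (Age, Fare) before they can enter a chi-square test, which is beyond this sketch.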