Load the titanic_train.csv data set and see if you can create an ANOVA and Chi-Squared analysis from the data. Review slides 14 through 17 of "Hypothesis testing for data science" for the R functions to use and their assumptions.

Use the categorical variables Pclass and Embarked in your analysis.

Upload your zipped Rmd + HTML file or just your Rmd along with a link from RPubs.

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.2     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.2     ✔ tibble    3.2.1
## ✔ lubridate 1.9.2     ✔ tidyr     1.3.0
## ✔ purrr     1.0.1     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
setwd("C:/Users/StarKid/Desktop/Data_Science/Data_101/week_5/IC10")
titanic <- read.csv("titanic_train.csv")

str(titanic)
## 'data.frame':    891 obs. of  12 variables:
##  $ PassengerId: int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Survived   : int  0 1 1 1 0 0 0 0 1 1 ...
##  $ Pclass     : int  3 1 3 1 3 3 1 3 3 2 ...
##  $ Name       : chr  "Braund, Mr. Owen Harris" "Cumings, Mrs. John Bradley (Florence Briggs Thayer)" "Heikkinen, Miss. Laina" "Futrelle, Mrs. Jacques Heath (Lily May Peel)" ...
##  $ Sex        : chr  "male" "female" "female" "female" ...
##  $ Age        : num  22 38 26 35 35 NA 54 2 27 14 ...
##  $ SibSp      : int  1 1 0 1 0 0 0 3 0 1 ...
##  $ Parch      : int  0 0 0 0 0 0 0 1 2 0 ...
##  $ Ticket     : chr  "A/5 21171" "PC 17599" "STON/O2. 3101282" "113803" ...
##  $ Fare       : num  7.25 71.28 7.92 53.1 8.05 ...
##  $ Cabin      : chr  "" "C85" "" "C123" ...
##  $ Embarked   : chr  "S" "C" "S" "S" ...

Convert Pclass and Embarked to factors.

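# Pclass is stored as integers 1-3; relabel the levels with class names
titanic$Pclass <- factor(titanic$Pclass,
                  levels = c("1", "2", "3"),
                  labels = c("1st", "2nd", "3rd"))

# Any Embarked value not listed in `levels` becomes NA
titanic$Embarked <- factor(titanic$Embarked,
                  levels = c("C", "Q", "S"),
                  labels = c("Cherbourg", "Queenstown", "Southampton"))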
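One side effect of passing explicit levels to factor() is worth checking: any value that is not listed becomes NA. A minimal sketch on a made-up vector, assuming some Embarked entries are blank strings (the name `embarked_raw` and its values are illustrative, not taken from the data set):

```r
# Hypothetical port codes; the empty string stands in for a missing entry
embarked_raw <- c("S", "C", "", "Q", "S")

embarked <- factor(embarked_raw,
                   levels = c("C", "Q", "S"),
                   labels = c("Cherbourg", "Queenstown", "Southampton"))

# Values outside `levels` (here "") are converted to NA
print(table(embarked, useNA = "ifany"))
n_missing <- sum(is.na(embarked))
```

If the real column contains blanks, this is why an <NA> group can show up in grouped summaries of Embarked.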

Calculate the mean Fare for each Pclass.

titanic %>% group_by(Pclass) %>% summarise(Avg_Price = mean(Fare, na.rm = TRUE))
## # A tibble: 3 × 2
##   Pclass Avg_Price
##   <fct>      <dbl>
## 1 1st         84.2
## 2 2nd         20.7
## 3 3rd         13.7

Plot this with a boxplot.

# outline = FALSE hides the outlier points so the boxes are readable
boxplot(titanic$Fare ~ titanic$Pclass, outline = FALSE)

ANOVA - Analysis of Variance

Similar to a two-sample t-test, but extended to compare the means of more than two groups.

Null Hypothesis: The group means are all equal. Alternative Hypothesis: At least one group mean differs from the others.

results <- aov(titanic$Fare ~ titanic$Pclass)
summary(results)
##                 Df  Sum Sq Mean Sq F value Pr(>F)    
## titanic$Pclass   2  776030  388015   242.3 <2e-16 ***
## Residuals      888 1421769    1601                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
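The F value in this table can be reproduced from its other columns, which helps show what ANOVA is comparing. A sketch using the sums of squares and degrees of freedom printed above (the variable names are only for illustration):

```r
# Values copied from the summary(results) output above
ss_between <- 776030    # Sum Sq for titanic$Pclass
df_between <- 2
ss_within  <- 1421769   # Sum Sq for Residuals
df_within  <- 888

# Mean square = sum of squares / degrees of freedom
ms_between <- ss_between / df_between
ms_within  <- ss_within / df_within

# F is the ratio of between-group to within-group variance
f_value <- ms_between / ms_within

# The upper tail of the F distribution gives the p-value
p_value <- pf(f_value, df_between, df_within, lower.tail = FALSE)

print(round(f_value, 1))
print(p_value < 2e-16)
```

A large F with a tiny p-value means we reject the null hypothesis of equal mean fares across classes. It does not say which classes differ.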

To see which pairs of class means differ, use Tukey's Honest Significant Differences (HSD) test.

TukeyHSD(results)
##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = titanic$Fare ~ titanic$Pclass)
## 
## $`titanic$Pclass`
##               diff       lwr        upr     p adj
## 2nd-1st -63.492504 -72.91649 -54.068521 0.0000000
## 3rd-1st -70.479137 -78.14891 -62.809367 0.0000000
## 3rd-2nd  -6.986633 -15.10638   1.133114 0.1079834
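One way to read the Tukey table: a pair differs significantly at the 95% family-wise level when its interval (lwr, upr) excludes zero. A sketch that rebuilds the table above as a data frame (values rounded from the output; `tukey` and `excludes_zero` are illustrative names) and flags those pairs:

```r
# Differences and interval bounds rounded from the TukeyHSD output above
tukey <- data.frame(
  pair = c("2nd-1st", "3rd-1st", "3rd-2nd"),
  diff = c(-63.49, -70.48, -6.99),
  lwr  = c(-72.92, -78.15, -15.11),
  upr  = c(-54.07, -62.81, 1.13)
)

# The interval excludes zero when both ends have the same sign
tukey$excludes_zero <- tukey$lwr * tukey$upr > 0

print(tukey)
```

Here 1st class differs from both 2nd and 3rd, while the 2nd vs 3rd difference (p adj = 0.108) is not significant at the 5% level.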

Let’s look at how fares compare for the values of Embarked.

titanic %>% group_by(Embarked) %>% summarise(Avg_Price = mean(Fare, na.rm = T))
## # A tibble: 4 × 2
##   Embarked    Avg_Price
##   <fct>           <dbl>
## 1 Cherbourg        60.0
## 2 Queenstown       13.3
## 3 Southampton      27.1
## 4 <NA>             80

Visualize with a box plot including the means.

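# Mean fare per port; rows with a missing Embarked are left out of the groups
means <- tapply(titanic$Fare, titanic$Embarked, mean)
# outline = FALSE hides the outlier points so the boxes are readable
boxplot(titanic$Fare ~ titanic$Embarked, outline = FALSE)
# Overlay each group mean as a solid black dot
points(means, col = "black", pch = 19)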

Chi-Squared Test for Independence

Testing whether two categorical variables are related.

The test takes a contingency table as its input. As a rule of thumb, each frequency count in the contingency table should be at least 5 (n >= 5).

addmargins(table(titanic$Pclass, titanic$Embarked))
##      
##       Cherbourg Queenstown Southampton Sum
##   1st        85          2         127 214
##   2nd        17          3         164 184
##   3rd        66         72         353 491
##   Sum       168         77         644 889

The observed counts in the 1st/Queenstown (2) and 2nd/Queenstown (3) cells are below 5, so this table does not meet the requirement above and we will not use Chi-Squared here.
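The n >= 5 guideline is often stated for expected counts (row total × column total / grand total) rather than observed counts. As a hedged cross-check, this sketch rebuilds the table above by hand (the matrix `observed` is typed in from the printed counts, not read from the file) and inspects the expected counts that chisq.test() returns:

```r
# Observed counts copied from the Pclass x Embarked table above
observed <- matrix(
  c(85,  2, 127,
    17,  3, 164,
    66, 72, 353),
  nrow = 3, byrow = TRUE,
  dimnames = list(Pclass   = c("1st", "2nd", "3rd"),
                  Embarked = c("Cherbourg", "Queenstown", "Southampton"))
)

# chisq.test() stores the expected counts under independence in $expected
expected <- chisq.test(observed)$expected
print(round(expected, 1))

# Smallest expected count across all cells
min_expected <- min(expected)
print(min_expected >= 5)
```

Under that version of the guideline, the small observed counts in the Queenstown column would not by themselves rule out the test. Which criterion to apply is a judgment call; this write-up keeps the stricter observed-count requirement stated above.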

Let’s try with Pclass and Sex instead.

addmargins(table(titanic$Pclass, titanic$Sex))
##      
##       female male Sum
##   1st     94  122 216
##   2nd     76  108 184
##   3rd    144  347 491
##   Sum    314  577 891

This meets our requirement that the frequencies be greater than or equal to 5.

chisq.test(table(titanic$Pclass, titanic$Sex))
## 
##  Pearson's Chi-squared test
## 
## data:  table(titanic$Pclass, titanic$Sex)
## X-squared = 16.971, df = 2, p-value = 0.0002064
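The statistic above can be reproduced by hand, which shows what the test measures: the gap between the observed counts and the counts expected if Pclass and Sex were independent. A sketch with the table typed in from the output above (`observed`, `expected`, `dof`, and the other names are illustrative):

```r
# Observed counts copied from the Pclass x Sex table above
observed <- matrix(
  c( 94, 122,
     76, 108,
    144, 347),
  nrow = 3, byrow = TRUE,
  dimnames = list(Pclass = c("1st", "2nd", "3rd"),
                  Sex    = c("female", "male"))
)

# Expected count per cell = row total * column total / grand total
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)

# Pearson statistic: sum over cells of (observed - expected)^2 / expected
x_squared <- sum((observed - expected)^2 / expected)

# Degrees of freedom = (rows - 1) * (columns - 1)
dof <- (nrow(observed) - 1) * (ncol(observed) - 1)

# Upper tail of the chi-squared distribution gives the p-value
p_value <- pchisq(x_squared, dof, lower.tail = FALSE)

print(round(x_squared, 3))
print(signif(p_value, 4))
```

Since the p-value is well below 0.05, we reject the null hypothesis that Pclass and Sex are independent.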

Independent Two-Sample t-Test

# Not run: t.test() expects numeric data, but Pclass and Embarked are factors
#t.test(titanic$Pclass, titanic$Embarked, conf.level = 0.99, alternative="greater")

Dependent (Paired) Two-Sample t-Test

# Not run: a paired test needs two numeric measurements per passenger, and these are factors
#t.test(titanic$Pclass, titanic$Embarked, paired = TRUE)
#qqnorm(titanic$Pclass - titanic$Embarked)