Load the titanic_train.csv data set and see if you can create an ANOVA and Chi-Squared analysis from the data. Review slides 14 through 17 of Hypothesis testing for data science - regarding the R functions to use and assumptions.

##Use the categorical variables Pclass and Embarked in your analysis.

##Upload your zipped Rmd + HTML file or just your Rmd along with a link from RPubs.

library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.2     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.2     ✔ tibble    3.2.1
## ✔ lubridate 1.9.2     ✔ tidyr     1.3.0
## ✔ purrr     1.0.1     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

setwd("C:/Users/StarKid/Desktop/Data_Science/Data_101/week_5/IC10")
titanic <- read.csv("titanic_train.csv")

str(titanic)

## 'data.frame':    891 obs. of  12 variables:
##  $ PassengerId: int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Survived   : int  0 1 1 1 0 0 0 0 1 1 ...
##  $ Pclass     : int  3 1 3 1 3 3 1 3 3 2 ...
##  $ Name       : chr  "Braund, Mr. Owen Harris" "Cumings, Mrs. John Bradley (Florence Briggs Thayer)" "Heikkinen, Miss. Laina" "Futrelle, Mrs. Jacques Heath (Lily May Peel)" ...
##  $ Sex        : chr  "male" "female" "female" "female" ...
##  $ Age        : num  22 38 26 35 35 NA 54 2 27 14 ...
##  $ SibSp      : int  1 1 0 1 0 0 0 3 0 1 ...
##  $ Parch      : int  0 0 0 0 0 0 0 1 2 0 ...
##  $ Ticket     : chr  "A/5 21171" "PC 17599" "STON/O2. 3101282" "113803" ...
##  $ Fare       : num  7.25 71.28 7.92 53.1 8.05 ...
##  $ Cabin      : chr  "" "C85" "" "C123" ...
##  $ Embarked   : chr  "S" "C" "S" "S" ...

Convert Pclass and Embarked to factors.

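# Map the integer class codes to ordered, readable labels
titanic$Pclass <- factor(titanic$Pclass,
                  levels = c("1", "2", "3"),
                  labels = c("1st", "2nd", "3rd"))

# Map the port codes to port names; values outside the levels become NA
titanic$Embarked <- factor(titanic$Embarked,
                  levels = c("C", "Q", "S"),
                  labels = c("Cherbourg", "Queenstown", "Southampton"))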
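As a small illustration of how `factor()` treats values that are not listed in `levels`, here is a sketch on a toy vector. The vector `embarked_raw` is made up for the example; it assumes, as the 889 versus 891 counts further down suggest, that the missing ports in the CSV are stored as empty strings.

```r
# Toy vector standing in for the Embarked column (values are illustrative).
embarked_raw <- c("S", "C", "", "Q", "S")

embarked <- factor(embarked_raw,
                   levels = c("C", "Q", "S"),
                   labels = c("Cherbourg", "Queenstown", "Southampton"))

# The empty string matches none of the levels, so it is stored as NA.
print(embarked)
print(sum(is.na(embarked)))

# Count values per level, keeping the NA entry visible.
print(table(embarked, useNA = "ifany"))
```

This would explain why an `<NA>` group can show up when summarising by Embarked after the conversion.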

Calculate the mean Fare for each Pclass.

titanic %>% group_by(Pclass) %>% summarise(Avg_Price = mean(Fare, na.rm = TRUE))

## # A tibble: 3 × 2
##   Pclass Avg_Price
##   <fct>      <dbl>
## 1 1st         84.2
## 2 2nd         20.7
## 3 3rd         13.7

Plot this with a boxplot.

# outline = FALSE hides the outlier points so the boxes stay readable
boxplot(titanic$Fare ~ titanic$Pclass, outline = FALSE)

ANOVA - Analysis of Variance

Similar to a two-sample t-test, but extended to compare the means of three or more groups.

Null Hypothesis: All group means are equal. Alternative Hypothesis: At least one group mean is not equal to the others.

results <- aov(titanic$Fare ~ titanic$Pclass)
summary(results)

##                 Df  Sum Sq Mean Sq F value Pr(>F)    
## titanic$Pclass   2  776030  388015   242.3 <2e-16 ***
## Residuals      888 1421769    1601                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
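To make the table easier to read, the F value can be reproduced from the sums of squares and degrees of freedom. This is only a sketch that copies the rounded numbers from the `summary(results)` output; the variable names are chosen for the example.

```r
# Values copied from the summary(results) table (rounded as printed).
ss_between <- 776030    # Sum Sq for Pclass
ss_within  <- 1421769   # Sum Sq for Residuals
df_between <- 2         # number of groups minus 1
df_within  <- 888       # observations minus number of groups

# Mean squares are sums of squares divided by their degrees of freedom.
ms_between <- ss_between / df_between
ms_within  <- ss_within / df_within

# The F statistic is the ratio of between-group to within-group variance.
f_value <- ms_between / ms_within

# The upper tail of the F distribution gives the p-value.
p_value <- pf(f_value, df_between, df_within, lower.tail = FALSE)

print(round(f_value, 1))
print(p_value < 2e-16)
```

A large F means the spread between class means is large relative to the spread of fares within each class.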

The p-value is below 2e-16, so we reject the null hypothesis: the mean fare is not the same across all three classes. To see which pairs of class means differ, use Tukey Honest Significant Differences.

TukeyHSD(results)

##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = titanic$Fare ~ titanic$Pclass)
## 
## $`titanic$Pclass`
##               diff       lwr        upr     p adj
## 2nd-1st -63.492504 -72.91649 -54.068521 0.0000000
## 3rd-1st -70.479137 -78.14891 -62.809367 0.0000000
## 3rd-2nd  -6.986633 -15.10638   1.133114 0.1079834
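One way to read the Tukey output is to check whether each confidence interval excludes zero. The sketch below rebuilds the printed intervals by hand (numbers copied from the output; the data frame `tukey` is just for illustration) and flags the pairs whose interval does not contain zero.

```r
# Interval bounds copied from the TukeyHSD output.
tukey <- data.frame(
  pair = c("2nd-1st", "3rd-1st", "3rd-2nd"),
  lwr  = c(-72.91649, -78.14891, -15.10638),
  upr  = c(-54.068521, -62.809367, 1.133114)
)

# A pair differs at the 95% family-wise level when its interval excludes zero.
tukey$significant <- tukey$lwr > 0 | tukey$upr < 0

print(tukey)
```

Read this way, 1st class fares differ from both 2nd and 3rd class fares, while the 3rd-2nd interval contains zero, in line with its adjusted p-value of 0.108.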

Let’s look at how fares compare for the values of Embarked.

titanic %>% group_by(Embarked) %>% summarise(Avg_Price = mean(Fare, na.rm = TRUE))

## # A tibble: 4 × 2
##   Embarked    Avg_Price
##   <fct>           <dbl>
## 1 Cherbourg        60.0
## 2 Queenstown       13.3
## 3 Southampton      27.1
## 4 <NA>             80

Visualize with a box plot including the means.

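# Mean fare per port (rows with a missing port are dropped by tapply)
means <- tapply(titanic$Fare, titanic$Embarked, mean)
boxplot(titanic$Fare ~ titanic$Embarked, outline = FALSE)
# Overlay the group means as solid points, one per box
points(means, col = "black", pch = 19)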

Chi-Squared Test for Independence

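The test takes a contingency table as its input. For this exercise, each frequency count in the contingency table must be greater than or equal to 5 (n >= 5). Note that this rule of thumb is more commonly stated in terms of the expected counts rather than the observed counts.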

addmargins(table(titanic$Pclass, titanic$Embarked))

##      
##       Cherbourg Queenstown Southampton Sum
##   1st        85          2         127 214
##   2nd        17          3         164 184
##   3rd        66         72         353 491
##   Sum       168         77         644 889

The 1st/Queenstown and 2nd/Queenstown cells have frequencies of 2 and 3, both less than 5, so we do not use Chi-Squared on this table.
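For reference, the counts the test expects under independence can be computed from the table margins. The following is a minimal sketch that retypes the observed counts from the table above into a matrix named `observed` (a name chosen for the example); `chisq.test(observed)$expected` should return the same values.

```r
# Observed counts copied from the Pclass x Embarked table (filled column by column).
observed <- matrix(c(85, 17, 66,
                     2, 3, 72,
                     127, 164, 353),
                   nrow = 3,
                   dimnames = list(
                     Pclass = c("1st", "2nd", "3rd"),
                     Embarked = c("Cherbourg", "Queenstown", "Southampton")))

# Expected count for each cell = row total * column total / grand total.
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)

print(round(expected, 1))
print(min(expected))
```

Whether to apply the threshold to observed or expected counts depends on which statement of the rule the course slides use, so this is offered as a cross-check rather than a replacement for the check above.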

Let’s try with Pclass and Sex instead.

addmargins(table(titanic$Pclass, titanic$Sex))

##      
##       female male Sum
##   1st     94  122 216
##   2nd     76  108 184
##   3rd    144  347 491
##   Sum    314  577 891

This meets our requirement that the frequencies be greater than or equal to 5.

chisq.test(table(titanic$Pclass, titanic$Sex))

## 
##  Pearson's Chi-squared test
## 
## data:  table(titanic$Pclass, titanic$Sex)
## X-squared = 16.971, df = 2, p-value = 0.0002064
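To show where the X-squared value comes from, the statistic can be recomputed by hand from the Pclass by Sex counts. This is a sketch using the counts printed above; the names `observed`, `expected`, and `x_squared` are chosen for the example.

```r
# Observed counts copied from the Pclass x Sex table (filled column by column).
observed <- matrix(c(94, 76, 144,
                     122, 108, 347),
                   nrow = 3,
                   dimnames = list(Pclass = c("1st", "2nd", "3rd"),
                                   Sex = c("female", "male")))

# Expected counts under independence, from the row and column totals.
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)

# Pearson statistic: squared differences scaled by the expected counts.
x_squared <- sum((observed - expected)^2 / expected)

# Degrees of freedom = (rows - 1) * (columns - 1).
df <- (nrow(observed) - 1) * (ncol(observed) - 1)

# Upper tail of the chi-squared distribution gives the p-value.
p_value <- pchisq(x_squared, df, lower.tail = FALSE)

print(round(x_squared, 3))
print(df)
print(signif(p_value, 4))
```

Since the p-value is well below 0.05, the null hypothesis of independence is rejected: passenger class and sex are related in this data set.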

INDEPENDENT 2 SAMPLE TEST

#t.test(titanic$Pclass, titanic$Embarked, conf.level = 0.99, alternative="greater")

DEPENDENT PAIR 2 SAMPLE T-TEST

#t.test(titanic$Pclass, titanic$Embarked, paired = TRUE)

#qqnorm(titanic$Pclass - titanic$Embarked)

IC11

Amit Singh

2023-08-11


Chi-Squared Test for Independence

Testing whether two categorical variables are related.

Null Hypothesis: The variables are independent (not related). Alternative Hypothesis: The variables are dependent (related in some way).
