Titanic Midterm Project

3/11/2021

Purpose

The purpose of this project is to determine whether certain variables, such as class or gender, are associated with the chances of survival on the Titanic.

Background

We used the Titanic data set, which included passenger data that recorded:
- Name of Passenger
- Sex
- Age
- Survival
- Passenger Class
- Ticket Fare
- Number of siblings and/or spouses on board
- Number of parents and/or children on board
- Ticket Number
- Cabin Number
- Port of Embarkation

What is a Chi-Squared Goodness of Fit Test?

A Chi-Squared Goodness of Fit Test is a statistical hypothesis test used to determine whether a categorical variable is likely to come from a specified distribution
These tests are often used to evaluate whether sample data is representative of population data
Chi-Squared Tests are well suited to categorical variables (in our case, Class and Survival)
When two categorical variables are compared, as with Class and Survival, the test applied to their table of counts is the closely related Chi-Squared Test of Independence

How we measure Chi-Squared Tests

The p-value is the probability of obtaining results at least as extreme as the observed results of a statistical hypothesis test, assuming that the null hypothesis is correct
The null hypothesis states that there is no relationship between the two categorical variables
We use a significance level of alpha = .05
If the p-value is at or below .05, we reject the null hypothesis: the relationship between the two categorical variables is statistically significant
If the p-value is above .05, we fail to reject the null hypothesis: the data do not show a statistically significant relationship between the two categorical variables
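
For reference, the statistic behind a chi-squared test on a table of counts can be written as follows. This is the standard textbook form rather than something taken from our output, and the symbols \(O_{ij}\), \(E_{ij}\), \(r\), \(c\), and \(n\) are notation introduced here for illustration.

```latex
\[
\chi^2 = \sum_{i=1}^{r} \sum_{j=1}^{c} \frac{(O_{ij} - E_{ij})^2}{E_{ij}},
\qquad
E_{ij} = \frac{(\text{row } i \text{ total}) \times (\text{column } j \text{ total})}{n},
\qquad
df = (r - 1)(c - 1)
\]
```

Here \(O_{ij}\) is the observed count in row \(i\) and column \(j\), \(E_{ij}\) is the count expected if the null hypothesis were true, and \(n\) is the total number of observations. The p-value is the area to the right of the observed statistic under a chi-squared distribution with \(df\) degrees of freedom. For a table with 3 classes and 2 survival outcomes, \(df = (3 - 1)(2 - 1) = 2\).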

Class vs Survival Chi-Squared Test

We wanted to see if there is a relationship between passenger class and chance of survival

Our null hypothesis is that there is no relationship between passenger class and survival
- Ho = the chance of survival does not differ by passenger class
- Ha = the chance of survival differs by passenger class

Chi-Squared Results

## 
##  Pearson's Chi-squared test
## 
## data:  table
## X-squared = 102.89, df = 2, p-value < 2.2e-16

Since p < alpha = .05, we reject the null hypothesis: there is sufficient evidence that passenger class is associated with the chance of survival
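
The result above can be checked by hand from a table of counts. The sketch below is not our original code: it assumes the class totals of the standard 891-row Titanic training data (216, 184, and 491 passengers), which are consistent with the survival rates reported later in this project, and the names `counts`, `result`, `expected`, and `by_hand` are introduced here for illustration.

```r
# Class-by-survival counts; ASSUMED from the standard 891-row Titanic training data
counts <- matrix(
  c( 80, 136,
     97,  87,
    372, 119),
  nrow = 3, byrow = TRUE,
  dimnames = list(Pclass = c("1", "2", "3"),
                  Survived = c("Died", "Survived"))
)

# Pearson's chi-squared test on the 3 x 2 table of counts
result <- chisq.test(counts)

# Counts expected in each cell if survival were unrelated to class
expected <- result$expected

# The same statistic computed directly as sum((observed - expected)^2 / expected)
by_hand <- sum((counts - expected)^2 / expected)

print(result)
print(round(by_hand, 2))
```

Computing the statistic both ways makes it easier to see that the test only compares observed counts with the counts expected under the null hypothesis.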

What is Simple Linear Regression?

Simple Linear Regression models the relationship between two variables by fitting a linear equation to observed data
One variable is considered independent (in our case, Class)
One variable is considered dependent (in our case, Survival)
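
In symbols, with \(x\) as the independent variable and \(y\) as the dependent variable, the model and its least-squares estimates take the standard textbook form below. The symbols are notation introduced here, not taken from our output.

```latex
\[
y_i = \beta_0 + \beta_1 x_i + \varepsilon_i,
\qquad
\hat{\beta}_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2},
\qquad
\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}
\]
```

Here \(\beta_0\) is the intercept, \(\beta_1\) is the slope, \(\varepsilon_i\) is the error term for observation \(i\), and \(\bar{x}\) and \(\bar{y}\) are the sample means.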

Class vs Survival Linear Regression Test

We want to find the relationship of passenger class to chance of survival
Passenger class is our independent variable and chance of survival is our dependent variable

Linear Regression Results

## 
## Call:
## lm(formula = titanic$Survived ~ titanic$Pclass)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.6416 -0.2476 -0.2476  0.3584  0.7524 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     0.83863    0.04510   18.60   <2e-16 ***
## titanic$Pclass -0.19700    0.01837  -10.72   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4581 on 889 degrees of freedom
##   (418 observations deleted due to missingness)
## Multiple R-squared:  0.1146, Adjusted R-squared:  0.1136 
## F-statistic:   115 on 1 and 889 DF,  p-value: < 2.2e-16
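
A fit like the one above can be sketched in a self-contained way by rebuilding one row per passenger from class-by-survival counts. This is not our original code: the counts are assumed from the standard 891-row Titanic training data, and the names `counts`, `passengers`, `fit`, and `fitted_rates` are introduced for illustration. Passing `data =` to `lm()` instead of `titanic$` columns gives cleaner coefficient names and lets `predict()` work on new data.

```r
# Class-by-survival counts; ASSUMED from the standard 891-row Titanic training data
counts <- data.frame(
  Pclass   = c(1, 1, 2, 2, 3, 3),
  Survived = c(0, 1, 0, 1, 0, 1),
  n        = c(80, 136, 97, 87, 372, 119)
)

# Expand to one row per passenger (891 rows in total)
passengers <- counts[rep(seq_len(nrow(counts)), counts$n), c("Pclass", "Survived")]

# Simple linear regression of survival (0/1) on passenger class (1, 2, 3)
fit <- lm(Survived ~ Pclass, data = passengers)
print(coef(fit))

# Fitted survival rate for each class
fitted_rates <- predict(fit, newdata = data.frame(Pclass = 1:3))
print(round(fitted_rates, 5))
```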

Linear Regression Results (cont.)

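To test whether passenger class is a statistically significant predictor of survival, we can run an F test on the regression at alpha = .05
- Ho = the slope for passenger class is zero (class does not help predict survival)
- Ha = the slope for passenger class is not zero
Since p is approx. 0 and p < alpha = .05, we reject the null hypothesis: there is sufficient evidence that passenger class helps predict the chance of survival
The F test shows significance, not accuracy: with an R-squared of 0.1146, class alone explains about 11% of the variation in survival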
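
The numbers in the regression output are tied together in a way that can be checked with a few lines of arithmetic. The sketch below uses only values printed in the output above, and the variable names are introduced for illustration. With a single predictor, the F statistic is the square of the slope's t value, and R-squared can be recovered from F and the residual degrees of freedom, up to rounding in the printed values.

```r
# Values copied from the regression output
t_slope  <- -10.72   # t value for the Pclass slope
f_stat   <- 115      # F-statistic on 1 and 889 DF
df_resid <- 889      # residual degrees of freedom

# With one predictor, F equals the squared t value of the slope
f_from_t <- t_slope^2

# R-squared implied by the F statistic (one predictor): F / (F + residual df)
r_squared <- f_stat / (f_stat + df_resid)

# Upper-tail probability of the F distribution gives the p-value
p_value <- pf(f_stat, df1 = 1, df2 = df_resid, lower.tail = FALSE)

print(round(f_from_t, 1))
print(round(r_squared, 4))
print(p_value < 2.2e-16)
```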

Linear Regression Results (cont.)

We now want to view the number of survivors and those who died by class in a stacked bar chart. We’re going to filter out all the NA values in the “Survived” column and count the people by Class and Survival. Then we create a bar plot, with x as class, y as the number of passengers, and fill as survival.

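library(dplyr)    # filter(), count(), mutate(), and the %>% pipe
library(ggplot2)  # ggplot() and the geoms used below

# Count passengers in each Class x Survival group, skipping rows with missing Survived
byclass <- titanic %>%
  filter(!is.na(Survived)) %>%
  count(Pclass, Survived) %>%
  mutate(Survived = ifelse(Survived == 0, "Died", "Survived"))

# Stacked bar chart: one bar per class, each segment labelled with its count
ggplot(byclass, aes(x = Pclass, y = n, fill = Survived, label = n)) +
  geom_bar(color = "black", stat = "identity") +
  scale_fill_hue(name = "") +
  geom_text(position = position_stack(vjust = .5)) +
  xlab("Class") + ylab("Number of Passengers") +
  ggtitle("Passengers by Class")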

Linear Regression Results (cont.)

The equation of our linear regression model:
\(\text{expected survival rate} = 0.83863 - 0.197 \times \text{passenger class}\)

The expected survival rate (as a proportion) for each class:
- First Class: .64163
- Second Class: .44463
- Third Class: .24763

The actual survival rate (as a proportion) for each class:
- First Class: .62963
- Second Class: .47283
- Third Class: .24236

Because the expected and actual survival rates differ by less than .03 in every class, we conclude that the model reproduces the class-level survival rates well
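
The expected values above come from plugging each class number into the regression equation. The short sketch below repeats that arithmetic using only the coefficients and actual rates already reported. The variable names are introduced for illustration.

```r
# Coefficients from the regression output
intercept <- 0.83863
slope     <- -0.19700

# Expected survival rate for classes 1, 2, 3 from the fitted line
pclass   <- 1:3
expected <- intercept + slope * pclass

# Actual survival rates reported above
actual <- c(0.62963, 0.47283, 0.24236)

# Side-by-side comparison, with the gap between expected and actual
comparison <- data.frame(pclass, expected, actual, difference = expected - actual)
print(comparison)
```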

Adding Gender into the mix

Having found strong evidence of a relationship between passenger class and chance of survival, we decided to add gender to see whether gender, within each passenger class, was also related to the chance of survival

Adding Gender into the mix (cont.)

This graph shows that, within each class, the proportion of survivors was much higher among females than among males.
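
The proportions behind a graph like this can be sketched as follows. This is not our original code: the survivor counts and group totals are assumed from the standard 891-row Titanic training data, and the names `by_sex_class`, `prop_survived`, and `prop_table` are introduced for illustration.

```r
# Survivors and totals by sex and class; ASSUMED from the standard 891-row training data
by_sex_class <- data.frame(
  Sex      = rep(c("female", "male"), each = 3),
  Pclass   = rep(1:3, times = 2),
  survived = c(91, 70, 72, 45, 17, 47),
  total    = c(94, 76, 144, 122, 108, 347)
)

# Proportion of each sex-and-class group that survived
by_sex_class$prop_survived <- by_sex_class$survived / by_sex_class$total

# Arrange as a Sex x Pclass table for side-by-side comparison
prop_table <- xtabs(prop_survived ~ Sex + Pclass, data = by_sex_class)
print(round(prop_table, 3))
```

Reading across each column of the table compares females and males within the same class, which is the comparison the graph makes.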

Concluding Thoughts

In conclusion, there does appear to be a relationship between passenger class and chance of survival. First, the chi-squared test showed a statistically significant relationship between the two. Then the linear regression model, whose expected survival rates were close to the actual survival rates for each class, showed that the chance of survival decreased from first class to third class. Finally, we looked at whether gender refined the relationship between class and survival. The bar graph of the proportion of survivors by class and gender showed a visible relationship: females survived at a higher rate than males in every class.