The purpose of this project is to determine whether certain variables, such as class or gender, affect the chances of survival on the Titanic.
3/11/2021
##
##  Pearson's Chi-squared test
##
## data:  table
## X-squared = 102.89, df = 2, p-value < 2.2e-16
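The call that produced this output is not shown in the report. As a rough sketch, a test like this can be run on a class-by-outcome contingency table. The counts below are an assumption: they are the class totals of the standard Kaggle Titanic training set (891 labelled passengers), which agree with the survival rates reported later in this report. The name `class_by_survival` is hypothetical.

```r
# Assumed counts from the standard Kaggle Titanic training set;
# they are not printed anywhere in this report.
class_by_survival <- matrix(
  c( 80, 136,   # first class:  died, survived
     97,  87,   # second class: died, survived
    372, 119),  # third class:  died, survived
  nrow = 3, byrow = TRUE,
  dimnames = list(Pclass = c("1", "2", "3"),
                  Outcome = c("Died", "Survived"))
)

# Pearson's chi-squared test of independence between class and outcome.
# For a 3 x 2 table no continuity correction is applied.
chi <- chisq.test(class_by_survival)
print(chi)
```

With these counts the statistic and degrees of freedom should match the output above.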
##
## Call:
## lm(formula = titanic$Survived ~ titanic$Pclass)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -0.6416 -0.2476 -0.2476  0.3584  0.7524
##
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)
## (Intercept)     0.83863    0.04510   18.60   <2e-16 ***
## titanic$Pclass -0.19700    0.01837  -10.72   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4581 on 889 degrees of freedom
##   (418 observations deleted due to missingness)
## Multiple R-squared:  0.1146, Adjusted R-squared:  0.1136
## F-statistic:   115 on 1 and 889 DF,  p-value: < 2.2e-16
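To make the regression easier to follow without the original data file, here is a minimal sketch that rebuilds passenger-level data from class totals and refits the same model. The totals are an assumption (the standard Kaggle training set, 891 passengers with a recorded outcome), and the names `passengers`, `n_total` and `n_survived` are hypothetical. The 418 rows with a missing outcome are simply left out, which is what `lm()` does with them anyway.

```r
# Assumed class totals and survivor counts (standard Kaggle training set);
# these numbers are not printed in this report.
n_total    <- c(216, 184, 491)
n_survived <- c(136,  87, 119)

# One row per passenger: class 1 to 3 and a 0/1 survival indicator
passengers <- data.frame(
  Pclass   = rep(1:3, times = n_total),
  Survived = unlist(Map(
    function(total, alive) rep(c(1, 0), times = c(alive, total - alive)),
    n_total, n_survived
  ))
)

# Same model as above: survival indicator regressed on passenger class
fit   <- lm(Survived ~ Pclass, data = passengers)
coefs <- round(coef(fit), 5)
print(coefs)
```

Because the outcome is coded 0/1, the fitted value for a class can be read as an estimated survival rate, and the slope as the change in that rate per step down in class.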
We now want to view the number of survivors and deaths by class in a stacked bar chart. We filter out the rows with NA in the “Survived” column and count the passengers by class and survival. Then we create a bar plot with class on the x-axis, the passenger count on the y-axis, and survival as the fill.
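library(dplyr)    # filter(), count(), mutate(), %>%
library(ggplot2)  # ggplot() and the geom_*/scale_* layers

# Count passengers by class and outcome, skipping rows with no recorded outcome
byclass <- titanic %>%
  filter(!is.na(Survived)) %>%
  count(Pclass, Survived) %>%
  mutate(Survived = ifelse(Survived == 0, "Died", "Survived"))

# Stacked bars: one bar per class, each segment labelled with its count
ggplot(byclass, aes(x = Pclass, y = n, fill = Survived, label = n)) +
  geom_bar(color = "black", stat = "identity") +
  scale_fill_hue(name = "") +
  geom_text(position = position_stack(vjust = .5)) +
  xlab("Class") + ylab("Number of Passengers") +
  ggtitle("Passengers by Class")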
Linear regression analysis, continued. The fitted equation of our linear regression model:
\(\text{expected survival rate} = 0.83863 - 0.197 \times (\text{passenger class})\)
The expected survival rate for each class:
- First Class: 0.64163
- Second Class: 0.44463
- Third Class: 0.24763
The actual survival rate for each class:
- First Class: 0.62963
- Second Class: 0.47283
- Third Class: 0.24236
Because the expected and actual survival rates differ by less than 0.03 for every class, we conclude that the model fits the class-level survival rates well.
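The comparison above can be checked with a few lines of arithmetic. This sketch uses only the coefficients and the actual values quoted in this report. The names `expected`, `actual` and `comparison` are illustrative, not taken from the original analysis.

```r
# Coefficients as reported in the regression output
intercept <- 0.83863
slope     <- -0.197

pclass   <- 1:3
expected <- intercept + slope * pclass     # fitted value for each class
actual   <- c(0.62963, 0.47283, 0.24236)   # actual values quoted above

comparison <- data.frame(
  class      = pclass,
  expected   = expected,
  actual     = actual,
  difference = actual - expected
)
print(comparison)
```

The largest gap is in second class, where the straight line falls a little below the observed value. A single slope cannot match three group means exactly unless they happen to lie on a line.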
In conclusion, there does seem to be an association between passenger class and chance of survival. First, the chi-squared test (X-squared = 102.89, df = 2, p-value < 2.2e-16) indicated that class and survival are not independent. Then, comparing the expected survival rates from the linear regression model with the actual survival rates showed that survival decreased from first class to third class. Finally, we wanted to see whether the relationship between class and survival also depends on another variable, in this case gender. The bar graph of the proportion of survivors by class and gender showed a visible relationship.