For our first data set, we chose data on the passengers of the Titanic’s maiden voyage. We downloaded this data set from Kaggle.com: https://www.kaggle.com/azeembootwala/titanic?select=train_data.csv
We first read the data set into a data frame.
library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.0 --
## v ggplot2 3.3.3 v purrr 0.3.4
## v tibble 3.0.5 v dplyr 1.0.3
## v tidyr 1.1.2 v stringr 1.4.0
## v readr 1.4.0 v forcats 0.5.0
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
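# Read the CSV; empty strings are treated as missing values (NA)
titanic <- read.csv("Titanic Train.csv", header = TRUE, stringsAsFactors = FALSE, na.strings = "")
# read.csv() already returns a data frame, so this conversion is redundant but harmless
titanic_df <- data.frame(titanic)
train <- titanic_df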
After this initial setup, we then explored its dimensions and variables.
class(train)
## [1] "data.frame"
nrow(train)
## [1] 792
ncol(train)
## [1] 17
names(train)
## [1] "X" "PassengerId" "Survived" "Sex"
## [5] "Age" "Fare" "Pclass_1" "Pclass_2"
## [9] "Pclass_3" "Family_size" "Married.Man" "Married.Woman"
## [13] "Single.Man" "Single.Woman" "Chebourg" "Queenstown"
## [17] "Southampton"
head(train)
## X PassengerId Survived Sex Age Fare Pclass_1 Pclass_2 Pclass_3
## 1 0 1 0 1 0.2750 0.01415106 0 0 1
## 2 1 2 1 0 0.4750 0.13913574 1 0 0
## 3 2 3 1 0 0.3250 0.01546857 0 0 1
## 4 3 4 1 0 0.4375 0.10364430 1 0 0
## 5 4 5 0 1 0.4375 0.01571255 0 0 1
## 6 5 6 0 1 0.3500 0.01650950 0 0 1
## Family_size Married.Man Married.Woman Single.Man Single.Woman Chebourg
## 1 0.1 1 0 0 0 0
## 2 0.1 1 0 0 0 1
## 3 0.0 0 0 0 1 0
## 4 0.1 1 0 0 0 0
## 5 0.0 1 0 0 0 0
## 6 0.0 1 0 0 0 0
## Queenstown Southampton
## 1 0 1
## 2 0 0
## 3 0 1
## 4 0 1
## 5 0 1
## 6 1 0
tail(train)
## X PassengerId Survived Sex Age Fare Pclass_1 Pclass_2 Pclass_3
## 787 786 787 1 0 0.2250 0.01463083 0 0 1
## 788 787 788 0 1 0.1000 0.05684821 0 0 1
## 789 788 789 1 1 0.0125 0.04015973 0 0 1
## 790 789 790 0 1 0.5750 0.15458811 1 0 0
## 791 790 791 0 1 0.3500 0.01512699 0 0 1
## 792 791 792 0 1 0.2000 0.05074862 0 1 0
## Family_size Married.Man Married.Woman Single.Man Single.Woman Chebourg
## 787 0.0 0 0 0 1 0
## 788 0.5 0 0 1 0 0
## 789 0.3 0 0 1 0 0
## 790 0.0 1 0 0 0 1
## 791 0.0 1 0 0 0 0
## 792 0.0 1 0 0 0 0
## Queenstown Southampton
## 787 0 1
## 788 1 0
## 789 0 1
## 790 0 0
## 791 1 0
## 792 0 1
With this dataset, we wanted to determine the best variables to use to predict whether one would survive the voyage.
We first wanted to do some exploratory analysis on the data.
First, we wanted to look at the distribution of family size among the passengers. Here is the code and output.
hist(train$Family_size, col = "blue", breaks = 15)
With this histogram, we can see that the normalized family size for the majority of passengers is less than 0.20. This means there is a great disparity between the number of passengers traveling with few or no family members and the number traveling with many.
Next, we wanted to check the fares that the passengers paid.
hist(train$Fare, col = "green", breaks = 15)
Again, the data shows that the majority of normalized fares paid for the voyage are below 0.20. This means that most passengers paid a small fare, while a minority paid a much larger one. This makes sense, as the disparity in price and passenger numbers between the first, second, and third classes was pronounced on the Titanic’s maiden voyage.
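The "below 0.20" readings from the two histograms can also be checked numerically. The sketch below is a minimal illustration, not part of the original analysis: `share_below` is a hypothetical helper, and `sample_rows` holds only the six rows printed by head(train) so that the block runs without the CSV.

```r
# Six rows copied from the head(train) output above (illustration only).
sample_rows <- data.frame(
  Family_size = c(0.1, 0.1, 0.0, 0.1, 0.0, 0.0),
  Fare        = c(0.01415106, 0.13913574, 0.01546857,
                  0.10364430, 0.01571255, 0.01650950)
)

# Hypothetical helper: share of values strictly below a threshold.
# mean() of a logical vector gives the proportion of TRUE values.
share_below <- function(x, threshold = 0.2) {
  mean(x < threshold, na.rm = TRUE)
}

family_share <- share_below(sample_rows$Family_size)
fare_share   <- share_below(sample_rows$Fare)
print(c(Family_size = family_share, Fare = fare_share))
```

Calling `share_below(train$Family_size)` and `share_below(train$Fare)` on the full data frame would give the actual proportions behind the histogram observations.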
With some of the exploratory analysis finished, it is time for us to ask some questions. For this data set, we chose three:
1. Did men survive at a higher or lower rate than women?
2. Is passenger class a good predictor for survival rate?
3. Is a larger family size better for survival?
For this first question, we feel a good starting point would be to map the survival rates by Sex, to see if there is any correlation.
p <- ggplot(train) + aes(Sex,Survived)
p + geom_jitter()
This chart plots survival (1 = survived, 0 = did not survive) against sex (1 = male, 0 = female). Among men, the points for non-survivors outnumber those for survivors, while among women the survivors outnumber the non-survivors. This suggests that women most likely had a higher survival rate than men. However, more analysis is needed on whether marital status (single or married) also matters for survival.
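A jitter plot only gives a visual impression, so one possible follow-up is to compute the survival proportion within each sex directly. This is a sketch under stated assumptions: `survival_rate_by` is a hypothetical helper name, and `sample_rows` contains only the six rows shown by head(train), so the numbers it prints say nothing about the full data set.

```r
# Six rows copied from the head(train) output above (illustration only).
sample_rows <- data.frame(
  Survived = c(0, 1, 1, 1, 0, 0),
  Sex      = c(1, 0, 0, 0, 1, 1)   # 1 = male, 0 = female
)

# Hypothetical helper: because Survived is coded 0/1, its mean within
# each group is that group's survival proportion.
survival_rate_by <- function(df, group_col) {
  tapply(df$Survived, df[[group_col]], mean)
}

rate_by_sex <- survival_rate_by(sample_rows, "Sex")
print(rate_by_sex)
```

Running `survival_rate_by(train, "Sex")` on the full data frame would give the two rates needed to answer the first question.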
For this next question, we want to know whether a passenger’s class would affect their survival rate. We believe a good start would be to analyze the proportion of passengers in each class and their survival rate.
p <- ggplot(train) + aes(Pclass_1,Survived)
p + geom_jitter()
p2 <- ggplot(train) + aes(Pclass_2,Survived)
p2 + geom_jitter()
p3 <- ggplot(train) + aes(Pclass_3,Survived)
p3 + geom_jitter()
After plotting the three charts, we can see that there is some difference in survival rate between the three classes. With further analysis, we may be able to raise our confidence that passenger class is an accurate predictor of survival.
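Because passenger class is spread over three indicator columns, one way to compare the classes side by side is to collapse them into a single factor and compute the survival proportion per class. The sketch below assumes every row has exactly one of Pclass_1, Pclass_2, Pclass_3 set to 1. The `Pclass` column and the labels are introduced here for illustration, and `sample_rows` holds only the six rows shown by head(train).

```r
# Six rows copied from the head(train) output above (illustration only).
sample_rows <- data.frame(
  Survived = c(0, 1, 1, 1, 0, 0),
  Pclass_1 = c(0, 1, 0, 1, 0, 0),
  Pclass_2 = c(0, 0, 0, 0, 0, 0),
  Pclass_3 = c(1, 0, 1, 0, 1, 1)
)

class_cols <- c("Pclass_1", "Pclass_2", "Pclass_3")

# max.col() returns, for each row, the index of the column holding the
# largest value, i.e. the indicator that is set to 1 (assumption: exactly one).
class_index <- max.col(as.matrix(sample_rows[, class_cols]), ties.method = "first")
sample_rows$Pclass <- factor(class_index, levels = 1:3,
                             labels = c("1st", "2nd", "3rd"))

# Passenger count and survival proportion per class.
# A class with no passengers in the sample gets NA as its rate.
count_by_class <- table(sample_rows$Pclass)
rate_by_class  <- tapply(sample_rows$Survived, sample_rows$Pclass, mean)
print(count_by_class)
print(rate_by_class)
```

Applying the same steps to `train` instead of `sample_rows` would give one survival rate per class, which is easier to compare than three separate jitter plots.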
Our final question tasks us with determining whether a larger family size can increase the survival rate for each passenger. The best way to start answering this question is to compare survival rate with family size.
p4 <- ggplot(train) + aes(Family_size,Survived)
p4 + geom_jitter()
This chart is very interesting. It shows that the passengers with the largest number of family members did not seem to survive the voyage. However, more analysis is needed to answer our question. Comparing family size to passenger class might paint a better picture for our analysis.
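As a possible next step, the survival proportion can be tabulated for each family size value, together with the number of passengers behind it, since a rate based on very few passengers is weak evidence. This is a minimal sketch: `family_summary` and its column names are made up for illustration, and `sample_rows` contains only the six rows shown by head(train).

```r
# Six rows copied from the head(train) output above (illustration only).
sample_rows <- data.frame(
  Survived    = c(0, 1, 1, 1, 0, 0),
  Family_size = c(0.1, 0.1, 0.0, 0.1, 0.0, 0.0)
)

# Survival proportion and passenger count for each distinct family size.
rate <- tapply(sample_rows$Survived, sample_rows$Family_size, mean)
n    <- tapply(sample_rows$Survived, sample_rows$Family_size, length)

family_summary <- data.frame(
  Family_size   = as.numeric(names(rate)),
  passengers    = as.vector(n),
  survival_rate = as.vector(rate)
)
print(family_summary)
```

On the full data, replacing `sample_rows` with `train` would show whether the low survival seen for the largest families rests on more than a handful of passengers.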