The sinking of the RMS Titanic is one of the most infamous shipwrecks in history. On April 15, 1912, during her maiden voyage, the Titanic sank after colliding with an iceberg, killing 1502 out of 2224 passengers and crew. This sensational tragedy shocked the international community and led to better safety regulations for ships.
One reason the shipwreck caused such loss of life was that there were not enough lifeboats for the passengers and crew. Although surviving the sinking involved some element of luck, some groups of people were more likely to survive than others, such as women, children, and the upper class.
Here we analyze which factors were most associated with passenger deaths and examine who was more likely to survive given those features.
The purpose of this project was to gain introductory exposure to programmatic data analysis by analysing the factors that determined whether a passenger survived the Titanic disaster.
The data is split into two files: a training set (train.csv) and a test set (test.csv). The training set should be used to build your machine learning models. For the training set, we provide the outcome (also known as the “ground truth”) for each passenger. Your model will be based on “features” like passengers’ gender and class. You can also use feature engineering to create new features.
The test set should be used to see how well your model performs on unseen data. For the test set, we do not provide the ground truth for each passenger. It is your job to predict these outcomes. For each passenger in the test set, use the model you trained to predict whether or not they survived the sinking of the Titanic.
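The train/predict workflow described above can be illustrated with a very small baseline. This is only a sketch under stated assumptions: the `train` and `test` data frames are made-up toy rows (not the real train.csv and test.csv), and the "model" is simply the survival rate per sex, not a method used elsewhere in this analysis.

```r
# Toy stand-ins for train.csv and test.csv (made-up rows, not real passengers)
train <- data.frame(
  Sex      = c("female", "female", "male", "male", "male", "female"),
  Survived = c(1, 1, 0, 0, 1, 0),
  stringsAsFactors = FALSE
)
test <- data.frame(
  Sex = c("male", "female", "female"),
  stringsAsFactors = FALSE
)

# "Training": survival rate per sex, learned from the labelled training rows
rate_by_sex <- tapply(train$Survived, train$Sex, mean)

# "Prediction": predict survival when the passenger's group rate exceeds 0.5
test$Predicted <- as.integer(rate_by_sex[as.character(test$Sex)] > 0.5)
print(test)
```

The ground truth is only used on the training side; the test rows receive a prediction without ever exposing their outcome, which is the separation the two files are meant to enforce.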
library(ggplot2)
titanic <- read.csv("train.csv",stringsAsFactors = FALSE)
str(titanic)
## 'data.frame': 891 obs. of 12 variables:
## $ PassengerId: int 1 2 3 4 5 6 7 8 9 10 ...
## $ Survived : int 0 1 1 1 0 0 0 0 1 1 ...
## $ Pclass : int 3 1 3 1 3 3 1 3 3 2 ...
## $ Name : chr "Braund, Mr. Owen Harris" "Cumings, Mrs. John Bradley (Florence Briggs Thayer)" "Heikkinen, Miss. Laina" "Futrelle, Mrs. Jacques Heath (Lily May Peel)" ...
## $ Sex : chr "male" "female" "female" "female" ...
## $ Age : num 22 38 26 35 35 NA 54 2 27 14 ...
## $ SibSp : int 1 1 0 1 0 0 0 3 0 1 ...
## $ Parch : int 0 0 0 0 0 0 0 1 2 0 ...
## $ Ticket : chr "A/5 21171" "PC 17599" "STON/O2. 3101282" "113803" ...
## $ Fare : num 7.25 71.28 7.92 53.1 8.05 ...
## $ Cabin : chr "" "C85" "" "C123" ...
## $ Embarked : chr "S" "C" "S" "S" ...
head(titanic)
##   PassengerId Survived Pclass
## 1           1        0      3
## 2           2        1      1
## 3           3        1      3
## 4           4        1      1
## 5           5        0      3
## 6           6        0      3
##                                                  Name    Sex Age SibSp Parch
## 1                             Braund, Mr. Owen Harris   male  22     1     0
## 2 Cumings, Mrs. John Bradley (Florence Briggs Thayer) female  38     1     0
## 3                              Heikkinen, Miss. Laina female  26     0     0
## 4        Futrelle, Mrs. Jacques Heath (Lily May Peel) female  35     1     0
## 5                            Allen, Mr. William Henry   male  35     0     0
## 6                                    Moran, Mr. James   male  NA     0     0
##             Ticket    Fare Cabin Embarked
## 1        A/5 21171  7.2500              S
## 2         PC 17599 71.2833   C85        C
## 3 STON/O2. 3101282  7.9250              S
## 4           113803 53.1000  C123        S
## 5           373450  8.0500              S
## 6           330877  8.4583              Q
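# convert categorical columns to factors; Name and Cabin are already
# character (stringsAsFactors = FALSE), so those two calls are no-ops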
titanic$Pclass <- as.factor(titanic$Pclass)
titanic$Survived <- as.factor(titanic$Survived)
titanic$Name <- as.character(titanic$Name)
titanic$Sex <- as.factor(titanic$Sex)
# titanic$Ticket <- as.character(titanic$Ticket)
titanic$Cabin <- as.character(titanic$Cabin)
titanic$Embarked <- as.factor(titanic$Embarked)
str(titanic)
## 'data.frame': 891 obs. of 12 variables:
## $ PassengerId: int 1 2 3 4 5 6 7 8 9 10 ...
## $ Survived : Factor w/ 2 levels "0","1": 1 2 2 2 1 1 1 1 2 2 ...
## $ Pclass : Factor w/ 3 levels "1","2","3": 3 1 3 1 3 3 1 3 3 2 ...
## $ Name : chr "Braund, Mr. Owen Harris" "Cumings, Mrs. John Bradley (Florence Briggs Thayer)" "Heikkinen, Miss. Laina" "Futrelle, Mrs. Jacques Heath (Lily May Peel)" ...
## $ Sex : Factor w/ 2 levels "female","male": 2 1 1 1 2 2 2 2 1 1 ...
## $ Age : num 22 38 26 35 35 NA 54 2 27 14 ...
## $ SibSp : int 1 1 0 1 0 0 0 3 0 1 ...
## $ Parch : int 0 0 0 0 0 0 0 1 2 0 ...
## $ Ticket : chr "A/5 21171" "PC 17599" "STON/O2. 3101282" "113803" ...
## $ Fare : num 7.25 71.28 7.92 53.1 8.05 ...
## $ Cabin : chr "" "C85" "" "C123" ...
## $ Embarked : Factor w/ 4 levels "","C","Q","S": 4 2 4 4 4 3 4 4 4 2 ...
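The `str()` output above shows `Embarked` with four levels, one of which is the empty string. One possible way to avoid that phantom level is to turn empty strings into `NA` before building the factor. The sketch below uses a made-up vector named `embarked` rather than the real column, and it is an optional variation, not a step taken in this analysis.

```r
# Made-up port codes, including one empty string like the two in train.csv
embarked <- c("S", "C", "", "Q", "S")

# Treat the empty string as missing before converting to a factor
embarked[embarked == ""] <- NA
embarked_f <- factor(embarked)

print(levels(embarked_f))
print(sum(is.na(embarked_f)))
```

Note the trade-off: once the empty strings are `NA`, a later `complete.cases()` filter would drop those rows as well.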
colSums(is.na(titanic))
## PassengerId    Survived      Pclass        Name         Sex         Age
##           0           0           0           0           0         177
##       SibSp       Parch      Ticket        Fare       Cabin    Embarked
##           0           0           0           0           0           0
colSums(titanic=="")
## PassengerId    Survived      Pclass        Name         Sex         Age
##           0           0           0           0           0          NA
##       SibSp       Parch      Ticket        Fare       Cabin    Embarked
##           0           0           0           0         687           2
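The `NA` reported for `Age` in the empty-string count arises because comparing a missing value with `""` yields `NA`, and `colSums()` propagates it. A minimal sketch on a made-up data frame (`df` is hypothetical, not the Titanic data) shows the two checks side by side, with `na.rm = TRUE` added so the empty-string count stays numeric:

```r
# Made-up frame with one missing Age and two empty Cabin strings
df <- data.frame(
  Age   = c(22, NA, 35),
  Cabin = c("", "C85", ""),
  stringsAsFactors = FALSE
)

# Count true missing values per column
na_counts <- colSums(is.na(df))

# Count empty strings per column; na.rm = TRUE skips the NA comparisons
empty_counts <- colSums(df == "", na.rm = TRUE)

print(na_counts)
print(empty_counts)
```

Running both checks matters here because `Cabin` and `Embarked` encode missing data as empty strings, which `is.na()` alone does not detect.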
# library(dplyr)
# summarise(titanic, Average = mean(Age, na.rm = T))
# mean(titanic$Age, na.rm = TRUE)
summary(titanic$Age)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's
##    0.42   20.12   28.00   29.70   38.00   80.00     177
# x <- na.omit(titanic)
titanicNew <- titanic[complete.cases(titanic), ]
dim(titanicNew)
## [1] 714 12
colSums(is.na(titanicNew))
## PassengerId    Survived      Pclass        Name         Sex         Age
##           0           0           0           0           0           0
##       SibSp       Parch      Ticket        Fare       Cabin    Embarked
##           0           0           0           0           0           0
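Dropping incomplete rows removes 177 of the 891 passengers. For comparison, the sketch below contrasts that approach with median imputation on a made-up data frame (`toy` is hypothetical). Imputation is named here only as a common alternative; the analysis in this report uses the row-dropping approach.

```r
# Made-up frame: two of five Age values are missing
toy <- data.frame(
  Age  = c(22, NA, 35, 54, NA),
  Fare = c(7.25, 8.46, 53.10, 51.86, 8.05)
)

# Approach used in this report: keep only rows without any NA
dropped <- toy[complete.cases(toy), ]

# Alternative: keep every row and fill missing ages with the observed median
imputed <- toy
imputed$Age[is.na(imputed$Age)] <- median(toy$Age, na.rm = TRUE)

print(dim(dropped))
print(imputed)
```

Dropping rows keeps only observed values but shrinks the sample, while imputation keeps the sample size at the cost of inventing ages, so the choice depends on how much the `Age` column matters to the question being asked.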
summary(titanicNew)
##  PassengerId     Survived  Pclass   Name               Sex
##  Min.   :  1.0   0:424     1:186    Length:714         female:261
##  1st Qu.:222.2   1:290     2:173    Class :character   male  :453
##  Median :445.0             3:355    Mode  :character
##  Mean   :448.6
##  3rd Qu.:677.8
##  Max.   :891.0
##  Age             SibSp            Parch            Ticket
##  Min.   : 0.42   Min.   :0.0000   Min.   :0.0000   Length:714
##  1st Qu.:20.12   1st Qu.:0.0000   1st Qu.:0.0000   Class :character
##  Median :28.00   Median :0.0000   Median :0.0000   Mode  :character
##  Mean   :29.70   Mean   :0.5126   Mean   :0.4314
##  3rd Qu.:38.00   3rd Qu.:1.0000   3rd Qu.:1.0000
##  Max.   :80.00   Max.   :5.0000   Max.   :6.0000
##  Fare             Cabin              Embarked
##  Min.   :  0.00   Length:714          :  2
##  1st Qu.:  8.05   Class :character   C:130
##  Median : 15.74   Mode  :character   Q: 28
##  Mean   : 34.69                      S:554
##  3rd Qu.: 33.38
##  Max.   :512.33
Summary:
1. There are 891 passengers in total in the training set.
2. 177 Age values are missing.
3. 714 rows remain after dropping rows with missing values.
4. Southampton, Cherbourg, and Queenstown are the most popular ports of embarkation, in that order (554, 130, and 28 passengers).
5. There are 184 types of cabins.
6. The maximum ticket fare is 512.33 and the minimum is 0 (free).
7. Passengers have a maximum of 5 siblings / spouses aboard (SibSp).
8. Passengers have a maximum of 6 parents / children aboard (Parch).
9. The oldest passenger is 80 years old and the youngest is under 1 year old (0.42).
10. Men outnumber women: 453 men and 261 women.
11. Pclass 3 is the most populated class, with 355 passengers.
12. 290 passengers survived and 424 died.
aggregate(Fare~Pclass,titanicNew,mean)
## Pclass Fare
## 1 1 87.96158
## 2 2 21.47156
## 3 3 13.22944
aggregate(Fare~Pclass,titanicNew,var)
## Pclass Fare
## 1 1 6537.8850
## 2 2 173.9083
## 3 3 100.8650
aggregate(Fare~Pclass,titanicNew,sd)
## Pclass Fare
## 1 1 80.85719
## 2 2 13.18743
## 3 3 10.04316
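The three `aggregate()` calls above can also be combined into one by letting the summary function return a named vector. The sketch below runs on a made-up data frame (`toy` and `fare_stats` are hypothetical names); the standard deviation column is simply the square root of the variance column, which is also the relationship between the two tables above.

```r
# Made-up fares for two classes
toy <- data.frame(
  Pclass = factor(c(1, 1, 2, 2)),
  Fare   = c(80, 100, 20, 30)
)

# One call: FUN returns a named vector, so Fare becomes a matrix column
fare_stats <- aggregate(
  Fare ~ Pclass,
  data = toy,
  FUN = function(x) c(mean = mean(x), var = var(x), sd = sd(x))
)
print(fare_stats)
```

The large variance for first class in the real data indicates that fares within that class are far more spread out than in the other two classes.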
boxplot(titanicNew$Fare)
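A boxplot marks points beyond 1.5 times the interquartile range from the hinges as outliers. To list those values rather than only see them, `boxplot.stats()` can be used, as in this sketch on a made-up fare vector (`fares` is hypothetical and much shorter than the real column):

```r
# Made-up fares with one extreme value
fares <- c(7.25, 8.05, 13.00, 15.74, 26.00, 33.38, 512.33)

# boxplot.stats() applies the same 1.5 * IQR rule that boxplot() draws
stats <- boxplot.stats(fares)
outliers <- stats$out

print(stats$stats)  # lower whisker, lower hinge, median, upper hinge, upper whisker
print(outliers)
```

Applied to `titanicNew$Fare`, the same call would return the fares drawn as individual points in the plot above.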
## 1. Which Pclass was the most survivable?
## 2. Which gender was the most survivable?
## 3. Which ages survived most frequently?
## 4. Which port of embarkation had the most survivors?
# titanicNew already contains only complete rows, so it can be plotted directly
ggplot(data = titanicNew, aes(x = Pclass, fill = Survived)) + geom_bar()
ggplot(data = titanicNew, aes(x = Sex, fill = Survived)) + geom_bar()
ggplot(data = titanicNew, aes(x = Age, fill = Survived)) + geom_histogram(binwidth = 3)
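Stacked bar charts show counts, which makes groups of different sizes hard to compare. A row-wise proportion table gives the survival rate per group directly, and the same pattern covers the port-of-embarkation question. The sketch below uses a made-up data frame (`toy` is hypothetical); with the real data, `toy` would be replaced by `titanicNew`.

```r
# Made-up passengers: port of embarkation and outcome
toy <- data.frame(
  Embarked = c("S", "S", "S", "C", "C", "Q"),
  Survived = c(0, 0, 1, 1, 1, 0),
  stringsAsFactors = FALSE
)

# Counts per port and outcome
counts <- table(toy$Embarked, toy$Survived)

# Proportions within each port (each row sums to 1)
rates <- prop.table(counts, margin = 1)

print(counts)
print(round(rates, 2))
```

Swapping `Embarked` for `Pclass` or `Sex` gives the corresponding rates for the first two questions, and a bar chart with `aes(x = Embarked, fill = Survived)` would follow the same form as the plots above.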