The raw data was downloaded from Kaggle's Titanic survival prediction competition.

Some important aspects of data analysis learnt:

Rows: 891, Columns: 13

Note that this data set is only a subset of the actual passenger list. The Titanic sailed with about 2,240 passengers and crew on board, and more than 1,500 lost their lives in the disaster in 1912.

Passenger distribution

Males made up 65% of the data set and females 35%.
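As a rough sketch of how such a split can be computed, the snippet below counts each category and converts the counts to percentages. The helper name `percent_share` and the sample values are made up for illustration; they are not rows from the Kaggle file.

```python
from collections import Counter


def percent_share(values):
    """Return each category's share of the total as a percentage (one decimal)."""
    counts = Counter(values)
    total = sum(counts.values())
    return {category: round(100 * n / total, 1) for category, n in counts.items()}


# Made-up sample: 13 males and 7 females, chosen only to mirror a 65/35 split.
sample_sex = ["male"] * 13 + ["female"] * 7
print(percent_share(sample_sex))
```

The same helper would work on any categorical column, for example Embarked or Pclass.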

How many of each survived or perished?

Most of the passengers did not survive, and males made up the majority of those who perished.


What was the distribution of the passenger ages?

The passengers skewed young, but ages ranged from infants to the very elderly. 177 passengers had no age recorded, due to data entry or other issues. We can try to impute the missing ages based on family or other traits.
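One possible way to impute the missing ages is to fill each gap with the median age of passengers who share some traits. The sketch below assumes grouping by Pclass and Sex, which is only one choice among many and not necessarily the one used later in this project. The function name `impute_age` and the sample rows are hypothetical.

```python
from statistics import median


def impute_age(rows, keys=("Pclass", "Sex")):
    """Return copies of rows with missing Age filled by the median of the matching group.

    Falls back to the overall median when no passenger in the group has a known age.
    """
    known = [r for r in rows if r["Age"] is not None]
    overall = median(r["Age"] for r in known)
    groups = {}
    for r in known:
        groups.setdefault(tuple(r[k] for k in keys), []).append(r["Age"])
    filled = []
    for r in rows:
        if r["Age"] is None:
            ages = groups.get(tuple(r[k] for k in keys))
            r = {**r, "Age": median(ages) if ages else overall}
        filled.append(r)
    return filled


# Made-up rows, not taken from the Kaggle file.
sample_rows = [
    {"Pclass": 1, "Sex": "female", "Age": 30.0},
    {"Pclass": 1, "Sex": "female", "Age": 40.0},
    {"Pclass": 1, "Sex": "female", "Age": None},
    {"Pclass": 3, "Sex": "male", "Age": 20.0},
    {"Pclass": 3, "Sex": "male", "Age": None},
    {"Pclass": 2, "Sex": "male", "Age": None},
]
imputed = impute_age(sample_rows)
print([r["Age"] for r in imputed])
```

The input rows are left unchanged, so the imputed and original ages can still be compared afterwards.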

Let's see how the survival rates vary with the Age variable.

The following groups are created:

Adults and young age groups had the lowest survival rates. Elderly passengers may have just given up, or may not have been fast or strong enough to reach the rescue boats.
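A grouping like this can be sketched as follows. The cut points and labels below are hypothetical, since the exact groups are not listed in the text, and the sample pairs are made up rather than taken from the data.

```python
from collections import defaultdict

# Hypothetical cut points: the text does not list the groups that were used.
AGE_GROUPS = [(13, "Child"), (20, "Young"), (60, "Adult"), (float("inf"), "Elderly")]


def age_group(age):
    """Return the label of the first group whose upper bound exceeds age."""
    for upper, label in AGE_GROUPS:
        if age < upper:
            return label


def survival_rate_by_age_group(rows):
    """rows: iterable of (age, survived) pairs, with survived as 0 or 1."""
    totals = defaultdict(int)
    survivors = defaultdict(int)
    for age, survived in rows:
        if age is None:  # skip passengers with no recorded age
            continue
        group = age_group(age)
        totals[group] += 1
        survivors[group] += survived
    return {group: survivors[group] / totals[group] for group in totals}


# Made-up (age, survived) pairs, not taken from the Kaggle file.
sample = [(4, 1), (8, 0), (17, 0), (25, 1), (40, 0), (33, 0), (70, 0), (None, 1)]
rates = survival_rate_by_age_group(sample)
print(rates)
```

Rows with a missing age are skipped here; imputing them first would change the group sizes and possibly the rates.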

How was the Titanic sailing in terms of cabins used?

Now let's overlay the survival rates on each cabin to see which cabins had the highest death rates

We see that passengers in Cabins C and B did not fare well. Possible reasons are that these cabins were far from the rescue boats, or that passengers had to climb stairs or get past other obstacles to reach the top.
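Grouping by cabin letter can be sketched like this, assuming the Cabin values start with the cabin letter (for example "C85" or "B96 B98") and that missing cabins show up as empty or null. The helper names and the sample pairs are hypothetical.

```python
from collections import Counter


def deck_of(cabin):
    """Return the leading letter of a cabin value such as 'C85', or None if missing."""
    cabin = (cabin or "").strip()
    if not cabin:
        return None
    return cabin[0].upper()


def outcome_counts_by_deck(rows):
    """rows: iterable of (cabin, survived) pairs; returns a Counter keyed by (deck, survived)."""
    counts = Counter()
    for cabin, survived in rows:
        deck = deck_of(cabin)
        if deck is not None:  # rows without cabin information are left out
            counts[(deck, survived)] += 1
    return counts


# Made-up (cabin, survived) pairs, not taken from the Kaggle file.
sample = [("C85", 1), ("C123", 0), ("B96 B98", 0), (None, 0), ("", 1), ("E46", 1)]
counts = outcome_counts_by_deck(sample)
print(counts)
```

Counting survived and perished side by side per letter gives the numbers needed for an overlay of outcomes on cabins.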

Let's see how many travelled with parents or children, i.e. parents / children

Most travelled alone or with their immediate family only, not with parents or children.

Let's see how many travelled with immediate family, i.e. siblings / spouses

Most travelled without any immediate family (no siblings or spouse).

Let's see the survival rate for the Pclass variable

Class=3 had the worst of it, with 76% perished. Interestingly, Class=1 had a higher survival rate at 63%, which could be due to preferential treatment or other reasons.

Now let's analyze the various Fares present in the data

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00    7.91   14.45   32.20   31.00  512.33

Based on the summary data, we can try to create fare buckets of <10, 10-20, 20-30, 30-40, and >40.
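A minimal sketch of this bucketing, using the six values from the summary output as sample fares. How a fare that sits exactly on an edge is handled is an assumption here: it goes to the higher bucket. The names `FARE_EDGES`, `FARE_LABELS`, and `fare_bucket` are made up for illustration.

```python
from bisect import bisect_right
from collections import Counter

FARE_EDGES = [10, 20, 30, 40]
FARE_LABELS = ["<10", "10-20", "20-30", "30-40", ">40"]


def fare_bucket(fare):
    """Map a fare to its bucket label; a fare equal to an edge goes to the higher bucket."""
    return FARE_LABELS[bisect_right(FARE_EDGES, fare)]


# The six summary statistics shown above, used here as sample fares.
summary_fares = [0.00, 7.91, 14.45, 32.20, 31.00, 512.33]
bucket_counts = Counter(fare_bucket(f) for f in summary_fares)
print(bucket_counts)
```

With a median of 14.45 and a third quartile of 31.00, roughly three quarters of the fares land in the first three or four buckets, so the >40 bucket covers a long tail.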

Fares under $10 correspond to the lower class/cabins and had lower survival rates than fares above $30. Even the $10-30 bucket did not fare well.

Let's see which Fares were allocated to which Class and their survival rates

As noted above, the under-$10 fares were mostly tagged to Class=3 and had a very low survival rate. Why $30-40 fares also show up in Class=3 still needs to be checked!
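One possible check, under the assumption that Fare is recorded per ticket rather than per passenger, is to divide each fare by the number of passengers sharing the same Ticket value. A family travelling on one ticket would then show a much lower fare per person. This is a guess to be tested, not a confirmed explanation, and the rows below are made up.

```python
from collections import Counter


def per_person_fares(rows):
    """rows: list of dicts with 'Ticket' and 'Fare'; returns Fare divided by ticket group size."""
    ticket_sizes = Counter(r["Ticket"] for r in rows)
    return [r["Fare"] / ticket_sizes[r["Ticket"]] for r in rows]


# Made-up rows: three Class=3 passengers sharing one hypothetical ticket.
sample = [
    {"Ticket": "T-1", "Pclass": 3, "Fare": 36.0},
    {"Ticket": "T-1", "Pclass": 3, "Fare": 36.0},
    {"Ticket": "T-1", "Pclass": 3, "Fare": 36.0},
    {"Ticket": "T-2", "Pclass": 3, "Fare": 8.0},
]
print(per_person_fares(sample))
```

Because this data set is only a subset of the passenger list, some people sharing a ticket may be missing from it, so the group sizes can be undercounted.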

Let's see which Fares were allocated to which Cabins and their survival rates

The above has been filtered to records that have Cabin information. 327 rows were removed due to null Cabins.

Now let's compare the survival rates for Males/Females based on their Cabins

Males had low survival rates across all Classes, lowest in Class=3. Females in Class=1 and 2 fared very well, while those in Class=3 had about a 50% chance. Among males, those in Class=1 had the highest survival rate.
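A two-way breakdown like this one can be sketched as a small cross-tabulation of survival rate by sex and class. The function name `survival_table` and the sample rows are hypothetical, and the sample rates are not the real ones.

```python
from collections import defaultdict


def survival_table(rows):
    """rows: iterable of (sex, pclass, survived); returns {sex: {pclass: survival rate}}."""
    tallies = defaultdict(lambda: [0, 0])  # (sex, pclass) -> [survivors, total]
    for sex, pclass, survived in rows:
        tallies[(sex, pclass)][0] += survived
        tallies[(sex, pclass)][1] += 1
    table = defaultdict(dict)
    for (sex, pclass), (survivors, total) in sorted(tallies.items()):
        table[sex][pclass] = survivors / total
    return {sex: dict(by_class) for sex, by_class in table.items()}


# Made-up (sex, pclass, survived) rows, not taken from the Kaggle file.
sample = [
    ("female", 1, 1), ("female", 1, 1), ("female", 3, 1), ("female", 3, 0),
    ("male", 1, 1), ("male", 1, 0), ("male", 1, 0), ("male", 3, 0), ("male", 3, 0),
]
table = survival_table(sample)
print(table)
```

Reading across one row of the table compares classes within a sex, which is the comparison behind a statement like "among males, Class=1 did best".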

Let's see a summary of the Embarked variable

Most of the passengers from S=Southampton perished. The other ports were C=Cherbourg and Q=Queenstown.

In the next part of the project, let's check how the survival rates change when we tweak the values of known variables, and let's learn the skill of prediction!