This is a data visualization project on the Titanic dataset. I am going to do some exploratory analysis to get a basic idea of the survival rate in the Titanic disaster.

# First, download the dataset from Kaggle and import it into R.
titanic = read.csv("train.csv")

Let’s take a look at the dataset and see if we need to tidy it up.
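One way to take that look is sketched below. `column_overview` is a hypothetical helper (it is not part of the original analysis) that lists each column's class and how many values are missing. Calling `str(titanic)` gives a similar view directly.

```r
# Hypothetical helper, for illustration: one row per column with its
# class and the number of missing (NA) values.
column_overview <- function(df) {
  data.frame(
    column    = names(df),
    class     = unname(vapply(df, function(col) class(col)[1], character(1))),
    n_missing = unname(vapply(df, function(col) sum(is.na(col)), integer(1))),
    stringsAsFactors = FALSE
  )
}

# Usage with the data frame loaded above:
# str(titanic)
# column_overview(titanic)
```

Columns that show up as integer or character but really hold categories are the ones worth converting to factors.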

# It turns out some variables are not the data type they are supposed to be, so I need to convert them into factor variables.
titanic$Survived = as.factor(titanic$Survived)
titanic$Pclass = as.factor(titanic$Pclass)
titanic$Sex = as.factor(titanic$Sex)
titanic$Embarked = as.factor(titanic$Embarked)
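As an optional variation on the conversion above, the 0/1 codes in Survived could be given readable labels so that plot legends explain themselves. This is a minimal sketch: `label_survived` is a hypothetical helper, and it assumes that 0 means the passenger died and 1 means the passenger survived, as in the Kaggle data description.

```r
# Hypothetical helper: turn 0/1 survival codes into a labelled factor.
# Assumption: 0 = died, 1 = survived.
label_survived <- function(x) {
  factor(x, levels = c(0, 1), labels = c("Died", "Survived"))
}

# Small made-up vector, for illustration only:
example_codes <- c(0, 1, 1, 0)
label_survived(example_codes)

# Usage with the real data:
# titanic$Survived = label_survived(titanic$Survived)
```

The rest of this analysis keeps the plain 0/1 levels, so this step is not required for the plots that follow.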

Now it is time to make some graphs.

install.packages("ggplot2")
library(ggplot2)
# The first basic plot is the number of survivors versus non-survivors; a bar plot is fine here.
ggplot(titanic, aes(x = Survived)) + geom_bar()

prop.table(table(titanic$Survived)) # The bar plot is not perfect since we can't read exact numbers from it, but prop.table gives us the proportions.
## 
##         0         1 
## 0.6161616 0.3838384
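To see the exact counts next to the proportions, the two tables could be combined. The sketch below uses only base R; `count_and_share` is a hypothetical helper name, and the input vector is made up for illustration.

```r
# Hypothetical helper: counts and proportions for each level of a vector.
count_and_share <- function(x) {
  counts <- table(x)
  data.frame(
    level      = names(counts),
    count      = as.vector(counts),
    proportion = as.vector(prop.table(counts)),
    stringsAsFactors = FALSE
  )
}

# Made-up values, for illustration only:
count_and_share(c(0, 0, 1, 0, 1))

# Usage with the real data:
# count_and_share(titanic$Survived)
```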
# We could even add some customization to make it prettier.
ggplot(titanic, aes(x = Survived)) + theme_bw() + geom_bar(fill = "#FF6666") + labs(y = "Passenger Count", title = "Titanic Survival")

# Now we can see how many people survived the Titanic disaster. In this dataset, more than 500 passengers died (Survived = 0) and more than 300 survived (Survived = 1).

What if we now want to break down survival by sex? Let's create another graph to see.

ggplot(titanic, aes(x = Sex, fill = Survived)) + theme_bw() + geom_bar() + labs(y = "Passenger Count", title = "Titanic Survival by Sex")

# From this we can see that females survived at a far higher rate than males. So maybe the saying "Women and children first" is true?

Now let's break down survival by other categories.

ggplot(titanic, aes(x = Pclass, fill = Survived)) + theme_bw() + geom_bar() + labs(y = "Passenger Count", title = "Titanic Survival by Ticket Class")

ggplot(titanic, aes(x = SibSp, fill = Survived)) + theme_bw() + geom_bar() + labs(y = "Passenger Count", title = "Titanic Survival by Number of Siblings or Spouses")

ggplot(titanic, aes(x = Parch, fill = Survived)) + theme_bw() + geom_bar() + labs(y = "Passenger Count", title = "Titanic Survival by Number of Parents or Children")

# Something I found interesting is that, according to the graphs, passengers with fewer family members on board seem to have a higher survival rate. Maybe because they did not have to take care of others when getting into the lifeboats? Since these bars show counts rather than rates, this is only a first impression.
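The stacked bars above show passenger counts, so the survival rate within each group is hard to read off, especially for the small groups. A minimal sketch of how the rate per group could be checked is shown below. `survival_rate_by` is a hypothetical helper, the toy data frame is made up, and the code assumes Survived is coded 0/1 (as a factor or as a number).

```r
# Hypothetical helper: share of survivors within each level of a grouping column.
survival_rate_by <- function(df, group) {
  # Convert via character so that a factor with levels "0"/"1" becomes 0/1.
  survived <- as.integer(as.character(df$Survived))
  tapply(survived, df[[group]], mean)
}

# Made-up rows, for illustration only:
toy <- data.frame(
  SibSp    = c(0, 0, 0, 1, 1, 2),
  Survived = factor(c(0, 1, 0, 1, 1, 0))
)
survival_rate_by(toy, "SibSp")

# Usage with the real data:
# survival_rate_by(titanic, "SibSp")
# survival_rate_by(titanic, "Parch")
```

In ggplot2, `geom_bar(position = "fill")` draws the same idea by scaling every bar to the same height, so that the colored share in each bar is the rate.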

We can also break down survival by multiple categories at once, not just one. We just need to use facet_wrap to drill down into the data.

ggplot(titanic, aes(x = Sex, fill = Survived)) + theme_bw() + facet_wrap(~Pclass) + geom_bar() + labs(y = "Passenger Count", title = "Titanic Survival by Sex and Class")

# In this graph, we divide survival by sex and class. Clearly, women in first and second class had a much higher survival rate than the other groups. This drill-down provides more insight into survival on the Titanic: it reveals that the survival rate might be determined not by one factor alone but by multiple factors, such as sex and class here.
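The faceted plot can be backed up with numbers by computing the survival rate for every combination of sex and class. The sketch below uses base R's `aggregate`. `rate_by_sex_class` and the helper column `survived_num` are hypothetical names, and the toy rows are made up for illustration.

```r
# Hypothetical helper: survival rate for each Sex x Pclass combination.
# Assumes Survived is coded 0/1 (factor or numeric).
rate_by_sex_class <- function(df) {
  df$survived_num <- as.integer(as.character(df$Survived))
  aggregate(survived_num ~ Sex + Pclass, data = df, FUN = mean)
}

# Made-up rows, for illustration only:
toy <- data.frame(
  Sex      = c("female", "female", "male", "male", "female", "male"),
  Pclass   = factor(c(1, 1, 1, 3, 3, 3)),
  Survived = factor(c(1, 1, 0, 0, 1, 0))
)
rate_by_sex_class(toy)

# Usage with the real data:
# rate_by_sex_class(titanic)
```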

# Now let's take age into consideration too.
ggplot(titanic, aes(x = Age, fill = Survived)) + geom_histogram() + theme_bw() + facet_wrap(Sex ~ Pclass) + labs(x = "Age", y = "Passenger Count", title = "Survival by Age, Sex and Class")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 177 rows containing non-finite values (stat_bin).
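The message and the warning above point at two things that could be handled explicitly: the histogram fell back to the default of 30 bins, and rows with a missing Age were dropped silently. Below is a minimal sketch of one way to deal with both. The toy data frame is made up, and the 5-year bin width is an arbitrary choice for illustration rather than a recommendation.

```r
# Made-up stand-in for the Titanic data, for illustration only.
toy <- data.frame(
  Age      = c(22, 38, NA, 35, 54, 2, NA, 27),
  Survived = factor(c(0, 1, 1, 1, 0, 1, 0, 0))
)

n_missing <- sum(is.na(toy$Age))      # rows the histogram would drop
known_age <- toy[!is.na(toy$Age), ]   # keep only rows with a recorded age

# Build the histogram only if ggplot2 is installed.
if (requireNamespace("ggplot2", quietly = TRUE)) {
  p <- ggplot2::ggplot(known_age, ggplot2::aes(x = Age, fill = Survived)) +
    ggplot2::geom_histogram(binwidth = 5) +   # explicit 5-year bins
    ggplot2::theme_bw() +
    ggplot2::labs(x = "Age", y = "Passenger Count")
  # print(p) draws the plot
}
```

Filtering the rows first makes it visible how many passengers are left out of the plot, instead of leaving that to a warning.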

# This graph covers four dimensions of the passengers (age, sex, class and survival), which could give an even more detailed insight into the factors influencing survival on the Titanic.