Exploratory data analysis of the Titanic dataset: an investigation of whether you would have had a chance of surviving the disaster.

Load the cleaned data into the data frame titanic.

titanic <- read.csv("titanic_clean.csv", header = TRUE, sep = ",")

Check out the structure of titanic.

str(titanic)
## 'data.frame':    1310 obs. of  15 variables:
##  $ pclass          : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ survived        : int  1 1 0 0 0 1 1 0 1 0 ...
##  $ name            : Factor w/ 1307 levels "Abbing, Mr. Anthony",..: 22 24 25 26 27 31 46 47 51 55 ...
##  $ sex             : Factor w/ 2 levels "female","male": 1 2 1 2 1 2 1 2 1 2 ...
##  $ age             : num  29 0.917 2 30 25 ...
##  $ sibsp           : int  0 1 1 1 1 0 1 0 2 0 ...
##  $ parch           : int  0 2 2 2 2 0 0 0 0 0 ...
##  $ ticket          : Factor w/ 929 levels "110152","110413",..: 188 50 50 50 50 125 93 16 77 826 ...
##  $ fare            : num  211 152 152 152 152 ...
##  $ cabin           : Factor w/ 186 levels "A10","A11","A14",..: 44 80 80 80 80 150 146 16 62 NA ...
##  $ embarked        : Factor w/ 3 levels "C","Q","S": 3 3 3 3 3 3 3 3 3 1 ...
##  $ boat            : Factor w/ 27 levels "1","10","11",..: 12 3 NA NA NA 13 2 NA 27 NA ...
##  $ body            : int  NA NA NA 135 NA NA NA NA NA 22 ...
##  $ home.dest       : Factor w/ 369 levels "?Havana, Cuba",..: 309 231 231 231 231 237 162 24 22 229 ...
##  $ has_cabin_number: int  1 1 1 1 1 1 1 1 1 0 ...

The last passenger had missing information in nearly all fields, so that row was excluded from the dataset.

tail(titanic)
##      pclass survived                      name    sex      age sibsp parch
## 1305      3        0      Zabour, Miss. Hileni female 14.50000     1     0
## 1306      3        0     Zabour, Miss. Thamine female 29.88113     1     0
## 1307      3        0 Zakarian, Mr. Mapriededer   male 26.50000     0     0
## 1308      3        0       Zakarian, Mr. Ortin   male 27.00000     0     0
## 1309      3        0        Zimmerman, Mr. Leo   male 29.00000     0     0
## 1310     NA       NA                      <NA>   <NA> 29.88113    NA    NA
##      ticket    fare cabin embarked boat body home.dest has_cabin_number
## 1305   2665 14.4542  <NA>        C <NA>  328      <NA>                0
## 1306   2665 14.4542  <NA>        C <NA>   NA      <NA>                0
## 1307   2656  7.2250  <NA>        C <NA>  304      <NA>                0
## 1308   2670  7.2250  <NA>        C <NA>   NA      <NA>                0
## 1309 315082  7.8750  <NA>        S <NA>   NA      <NA>                0
## 1310   <NA>      NA  <NA>        S <NA>   NA      <NA>                0
titanic <- titanic[-1310,]

Use ggplot() to plot the distribution of sexes within the classes of the ship.

require(ggplot2)
## Loading required package: ggplot2
ggplot(titanic, aes(x = factor(pclass), fill = factor(sex))) +
  geom_bar(position = "dodge")

Use ggplot() to estimate your chances of survival from the distribution of sexes within the classes of the ship.

ggplot(titanic, aes(x = factor(pclass), fill = factor(sex))) +
  geom_bar(position = "dodge") +
  facet_grid(. ~ survived)

Use ggplot() to estimate your chances of survival based on your age, in addition to the distribution of sexes within the classes of the ship.

posn.j <- position_jitter(width = 0.5, height = 0)
ggplot(titanic, aes(x = factor(pclass), y = age, col = factor(sex))) +
  geom_jitter(size = 3, alpha = 0.5, position = posn.j) +
  facet_grid(. ~ survived)

Import training and testing sets.

# Import the training set: train
train_url <- "http://s3.amazonaws.com/assets.datacamp.com/course/Kaggle/train.csv"
train <- read.csv(train_url)
  
# Import the testing set: test
test_url <- "http://s3.amazonaws.com/assets.datacamp.com/course/Kaggle/test.csv"
test <- read.csv(test_url)
  
# Print train and test to the console
#train
#test

# Your train and test set are still loaded
str(train)
## 'data.frame':    891 obs. of  12 variables:
##  $ PassengerId: int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Survived   : int  0 1 1 1 0 0 0 0 1 1 ...
##  $ Pclass     : int  3 1 3 1 3 3 1 3 3 2 ...
##  $ Name       : Factor w/ 891 levels "Abbing, Mr. Anthony",..: 109 191 358 277 16 559 520 629 417 581 ...
##  $ Sex        : Factor w/ 2 levels "female","male": 2 1 1 1 2 2 2 2 1 1 ...
##  $ Age        : num  22 38 26 35 35 NA 54 2 27 14 ...
##  $ SibSp      : int  1 1 0 1 0 0 0 3 0 1 ...
##  $ Parch      : int  0 0 0 0 0 0 0 1 2 0 ...
##  $ Ticket     : Factor w/ 681 levels "110152","110413",..: 524 597 670 50 473 276 86 396 345 133 ...
##  $ Fare       : num  7.25 71.28 7.92 53.1 8.05 ...
##  $ Cabin      : Factor w/ 148 levels "","A10","A14",..: 1 83 1 57 1 1 131 1 1 1 ...
##  $ Embarked   : Factor w/ 4 levels "","C","Q","S": 4 2 4 4 4 3 4 4 4 2 ...
str(test)
## 'data.frame':    418 obs. of  11 variables:
##  $ PassengerId: int  892 893 894 895 896 897 898 899 900 901 ...
##  $ Pclass     : int  3 3 2 3 3 3 3 2 3 3 ...
##  $ Name       : Factor w/ 418 levels "Abbott, Master. Eugene Joseph",..: 210 409 273 414 182 370 85 58 5 104 ...
##  $ Sex        : Factor w/ 2 levels "female","male": 2 1 2 2 1 2 1 2 1 2 ...
##  $ Age        : num  34.5 47 62 27 22 14 30 26 18 21 ...
##  $ SibSp      : int  0 1 0 0 1 0 0 1 0 2 ...
##  $ Parch      : int  0 0 0 0 1 0 0 1 0 0 ...
##  $ Ticket     : Factor w/ 363 levels "110469","110489",..: 153 222 74 148 139 262 159 85 101 270 ...
##  $ Fare       : num  7.83 7 9.69 8.66 12.29 ...
##  $ Cabin      : Factor w/ 77 levels "","A11","A18",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ Embarked   : Factor w/ 3 levels "C","Q","S": 2 3 2 3 3 3 2 3 1 3 ...
# Survival rates in absolute numbers
table(train$Survived)
## 
##   0   1 
## 549 342
# Survival rates in proportions
prop.table(table(train$Survived))
## 
##         0         1 
## 0.6161616 0.3838384
# Two-way comparison: Sex and Survived
table(train$Sex, train$Survived)
##         
##            0   1
##   female  81 233
##   male   468 109
# Two-way comparison: row-wise proportions
prop.table(table(train$Sex,train$Survived),1)
##         
##                  0         1
##   female 0.2579618 0.7420382
##   male   0.8110919 0.1889081
# Create the column Child and indicate whether the passenger was a child (1) or an adult (0)
train$Child <- NA
train$Child[train$Age < 18] <- 1
train$Child[train$Age >= 18] <- 0

# Two-way comparison
prop.table(table(train$Child,train$Survived),1)
##    
##             0         1
##   0 0.6189684 0.3810316
##   1 0.4601770 0.5398230
# Copy of test
test_one <- test

# Initialize a Survived column to 0
test_one$Survived <- 0

# Set Survived to 1 if Sex equals "female"
test_one$Survived[test_one$Sex=="female"] <- 1
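The test set has no Survived column, so this rule can't be scored directly. As a quick sanity check, here is a minimal sketch that scores the same all-females-survive rule against the training labels instead:

# Fraction of training passengers the gender-only rule classifies correctly
mean((train$Sex == "female") == (train$Survived == 1))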

Explore Decision Trees.

The decision tree algorithm starts with all the data at the root node and scans all the variables for the best one to split on. Once a variable is chosen, you do the split, go down one level (one node), and repeat. The final nodes at the bottom of the decision tree are known as terminal nodes, and the majority vote of the observations in each terminal node determines how to predict for new observations that end up there.
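To make the terminal-node idea concrete, here is a minimal sketch using the train set imported above and the rpart package introduced just below: printing a small fitted tree lists each split, the number of observations per node, and the majority-vote prediction at the leaves.

# Fit a tiny one-variable classification tree and print its node structure;
# lines marked with * are terminal nodes and yval is the majority class
library(rpart)
small_tree <- rpart(Survived ~ Sex, data = train, method = "class")
print(small_tree)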

To create your first decision tree, you'll make use of R's rpart package. Instead of having to write the algorithm yourself, you can use this package to build a decision tree.

# Load in the R package  
#install.packages('rpart')
require(rpart)
## Loading required package: rpart
## Warning: package 'rpart' was built under R version 3.2.5

Inside rpart, there is the rpart() function to build your first decision tree. The function takes multiple arguments: a model formula, the data to train on, and a method ("class" for classification).

Your call could look like this: my_tree <- rpart(Survived ~ Sex + Age, data = train, method = "class")

To visualize the resulting tree, you can use plot(my_tree) and text(my_tree). The resulting graphs will not be that informative, but R has packages to make them fancier: rattle, rpart.plot, and RColorBrewer.

# Build the decision tree
my_tree_two <- rpart(Survived ~ Pclass + Sex + Age + SibSp + Parch + Fare + Embarked, data = train, method = "class")

# Visualize the decision tree using plot() and text()
#plot(my_tree_two)
#text(my_tree_two)

# Load in the packages to build a fancy plot
#install.packages('rattle')
#install.packages('rpart.plot')
#install.packages('RColorBrewer')
library(rattle)
## Warning: package 'rattle' was built under R version 3.2.5
## Rattle: A free graphical interface for data mining with R.
## Version 4.1.0 Copyright (c) 2006-2015 Togaware Pty Ltd.
## Type 'rattle()' to shake, rattle, and roll your data.
library(rpart.plot)
## Warning: package 'rpart.plot' was built under R version 3.2.5
library(RColorBrewer)
## Warning: package 'RColorBrewer' was built under R version 3.2.3
# Time to plot your fancy tree
fancyRpartPlot(my_tree_two)

Making Predictions.

Make predictions of the survival rates for the observations in the test set using a decision tree (the predict() function): predict(my_tree_two, test, type = "class"). Here, my_tree_two is the tree model you've just built, test is the data set to build the predictions for, and type = "class" specifies that you want to classify observations.

# Make predictions on the test set
my_prediction <- predict(my_tree_two, test, type = "class")

# Finish the data.frame() call
my_solution <- data.frame(PassengerId = test$PassengerId, Survived = my_prediction)

# Use nrow() on my_solution
nrow(my_solution)
## [1] 418
ncol(my_solution)
## [1] 2
# Finish the write.csv() call
#write.csv(my_solution, file = "my_solution.csv", row.names = FALSE)

# Grow a larger tree: minsplit = 50 requires at least 50 observations in a
# node before a split is attempted, and cp = 0 turns off the complexity
# penalty, so the tree is pruned less aggressively
my_tree_three <- rpart(Survived ~ Pclass + Sex + Age + SibSp + Parch + Fare + Embarked, 
                     data = train, method = "class", control = rpart.control(minsplit = 50, cp = 0))
  
# Visualize my_tree_three
fancyRpartPlot(my_tree_three)

# Create train_two
train_two <- train
train_two$family_size <- train_two$SibSp + train_two$Parch + 1

# Finish the command
my_tree_four <- rpart(Survived ~ Pclass + Sex + Age + SibSp + Parch + Fare + Embarked + family_size, 
                      data = train_two, method = "class")

# Visualize your new decision tree
fancyRpartPlot(my_tree_four)

New train and test sets.
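The sets train_new and test_new are assumed to be preloaded in this exercise. As a minimal sketch, assuming the standard "Surname, Title. Given names" format of the Name column, they could be built like this (the gsub pattern is just one way to pull out the title):

# Hypothetical construction of train_new/test_new: copy the sets and
# extract the title between the comma and the period,
# e.g. "Braund, Mr. Owen Harris" -> "Mr"
train_new <- train
test_new <- test
train_new$Title <- gsub("^.*, (.+?)\\..*$", "\\1", train_new$Name)
test_new$Title <- gsub("^.*, (.+?)\\..*$", "\\1", test_new$Name)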

# You have access to a new train and test set named train_new and test_new.
# These data sets contain a new column with the name Title (referring to Miss, Mr, etc.).
# Title is another example of feature engineering: it's a new variable that possibly improves the model.
# Finish the command
my_tree_five <- rpart(Survived ~ Pclass + Sex + Age + SibSp + Parch + Fare + Embarked + Title, 
                      data = train_new, method = "class")

# Visualize my_tree_five
fancyRpartPlot(my_tree_five)

# Make prediction
my_prediction <- predict(my_tree_five, test_new, type = "class")

# Make results ready for submission
my_solution <- data.frame(PassengerId = test_new$PassengerId, Survived = my_prediction)
write.csv(my_solution, file = "my_solution.csv", row.names = FALSE)

Random Forest.

The Random Forest technique handles the overfitting problem you faced with decision trees. It grows multiple (very deep) classification trees using the training set. At prediction time, each tree produces a prediction and every outcome counts as a vote. For example, if you have trained three trees, two of which say a passenger in the test set will survive and one of which says they will not, the passenger is classified as a survivor. This approach of overgrowing trees, but letting the majority vote decide the final classification, avoids overfitting.

# All data, both training and test set, in one data frame (assumed preloaded)
#all_data
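If you want to assemble all_data yourself, a minimal sketch, assuming the train and test sets imported earlier and the same gsub idiom as above:

# Hypothetical reconstruction of all_data: stack train and test, filling
# Survived with NA for the test rows
test$Survived <- NA
all_data <- rbind(train[, names(test)], test)

# Re-create the engineered features used in the age model below
all_data$family_size <- all_data$SibSp + all_data$Parch + 1
all_data$Title <- gsub("^.*, (.+?)\\..*$", "\\1", all_data$Name)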

# Passengers on rows 62 and 830 do not have a value for embarkation.
# Since most passengers embarked at Southampton, we give them the value S.
all_data$Embarked[c(62, 830)] <- "S"

# Factorize embarkment codes.
all_data$Embarked <- factor(all_data$Embarked)

# Passenger on row 1044 has an NA Fare value. Let's replace it with the median fare value.
all_data$Fare[1044] <- median(all_data$Fare, na.rm = TRUE)

# How to fill in missing Age values?
# We make a prediction of a passenger's Age using the other variables and a decision tree model. 
# This time you give method = "anova" since you are predicting a continuous variable.
library(rpart)
predicted_age <- rpart(Age ~ Pclass + Sex + SibSp + Parch + Fare + Embarked + Title + family_size,
                       data = all_data[!is.na(all_data$Age),], method = "anova")
all_data$Age[is.na(all_data$Age)] <- predict(predicted_age, all_data[is.na(all_data$Age),])

# Split the data back into a train set and a test set
train <- all_data[1:891,]
test <- all_data[892:1309,]
# Load in the package
library(randomForest)

# Train set and test set
str(train)
str(test)

# Set seed for reproducibility
set.seed(111)

# Apply the Random Forest algorithm: grow ntree = 1000 trees and keep the
# variable-importance measures (importance = TRUE) for the plot below
#my_forest <- randomForest(as.factor(Survived) ~ Pclass + Sex + Age + SibSp + Parch + Fare + Embarked + Title, data = train, importance = TRUE, ntree = 1000)

my_forest <- randomForest(as.factor(Survived) ~ Pclass + Sex + Age + SibSp + Parch + Fare + Embarked, data = train, importance = TRUE, ntree = 1000)
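Printing the fitted forest is a quick sanity check: randomForest reports an out-of-bag (OOB) error estimate computed from each tree's held-out samples, so you get an accuracy estimate without touching the test set.

# The printed summary includes the OOB error estimate and a confusion matrix
print(my_forest)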

# Make your prediction using the test set
my_prediction <- predict(my_forest, test)

# Create a data frame with two columns: PassengerId & Survived. Survived contains your predictions
my_solution <- data.frame(PassengerId = test$PassengerId,Survived=my_prediction)

# Write your solution away to a csv file with the name my_solution.csv
#write.csv(my_solution, file = "my_solution.csv", row.names = FALSE)
varImpPlot(my_forest)

The accuracy plot (mean decrease in accuracy) shows how much worse the model would perform without each of the included variables, so a large decrease (a high value on the x-axis) indicates a highly predictive variable. The second plot shows the mean decrease in the Gini index: the higher a variable scores here, the more important it is to the model.
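If you prefer the underlying numbers to the plot, the importance() function from randomForest returns the same measures as a matrix:

# Numeric variable-importance measures: type = 1 is the mean decrease in
# accuracy, type = 2 the mean decrease in the Gini index
importance(my_forest, type = 1)
importance(my_forest, type = 2)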