The sinking of the Titanic is one of the most infamous shipwrecks in history. While there was some element of luck involved in surviving, survival was also related to passenger class, gender, age, and so on. In this project I am going to build a predictive model with the passenger data (training data) from Kaggle that answers the question: “What sorts of people were more likely to survive?”
The Data Science project cycle follows the OSEMN framework (Obtain, Scrub, Explore, Model, iNterpret), where the first step is:
1)Obtain the Dataset: the first step of a Data Science project is obtaining or gathering the data. As the data is already available on the Kaggle website, I am going to set the working directory to the folder where the data was downloaded.
setwd("C:/Users/user/Desktop/Folder/TITANIC")
Now it is time to load the datasets with the ‘read.csv()’ function, where ‘header = TRUE’ tells R that the first line contains the column names and ‘stringsAsFactors = FALSE’ tells R to keep the strings as character variables rather than converting them to factors.
titanic.train<- read.csv(file = "train.csv", stringsAsFactors = FALSE, header = TRUE)
titanic.test<- read.csv(file = "test.csv", stringsAsFactors = FALSE, header = TRUE)
Next I am going to have a look at how the datasets are structured with the ‘head()’ function:
head(titanic.train)
head(titanic.test)
For each PassengerId, the variables are defined as follows:
‘Survived’: 0 = No, 1 = Yes.
‘Pclass’: ticket class.
‘Age’: age in years; Age is fractional if less than 1, and estimated ages are of the form xx.5.
‘SibSp’: number of siblings/spouses aboard the Titanic (Sibling = brother, sister, stepbrother, stepsister; Spouse = husband, wife; mistresses and fiancés were ignored).
‘Parch’: number of parents/children aboard the Titanic (Parent = mother, father; Child = daughter, son, stepdaughter, stepson; some children travelled only with a nanny, therefore Parch = 0 for them).
‘Embarked’: port of embarkation (C = Cherbourg, Q = Queenstown, S = Southampton).
Now clearly we can see that the test dataset does not have the ‘Survived’ column; that is exactly what we have to predict. So the submission file will contain the PassengerId of each test passenger and whether they survived or not. Let’s continue to the next step.
2)Scrubbing of the Dataset: cleaning and filtering the data.
Let’s combine the training and testing datasets; working on a single dataframe saves time and makes the cleaning task much easier. As we know, the test dataset does not have the Survived column, so we have to add it; let’s fill it with NA. To avoid confusion when identifying the datasets while splitting them back apart later, let’s also insert a new column in both the test and the training dataset, called ‘IsTrainset’, with TRUE in the training data and FALSE in the test data. Now let’s combine them with the ‘rbind’ function to start the scrubbing process.
titanic.test$Survived<- NA
titanic.train$IsTrainset<- TRUE
titanic.test$IsTrainset<- FALSE
titanic.combined<- rbind(titanic.train,titanic.test)
str(titanic.combined)
'data.frame': 1309 obs. of 13 variables:
$ PassengerId: int 1 2 3 4 5 6 7 8 9 10 ...
$ Survived : int 0 1 1 1 0 0 0 0 1 1 ...
$ Pclass : int 3 1 3 1 3 3 1 3 3 2 ...
$ Name : chr "Braund, Mr. Owen Harris" "Cumings, Mrs. John Bradley (Florence Briggs Thayer)" "Heikkinen, Miss. Laina" "Futrelle, Mrs. Jacques Heath (Lily May Peel)" ...
$ Sex : chr "male" "female" "female" "female" ...
$ Age : num 22 38 26 35 35 NA 54 2 27 14 ...
$ SibSp : int 1 1 0 1 0 0 0 3 0 1 ...
$ Parch : int 0 0 0 0 0 0 0 1 2 0 ...
$ Ticket : chr "A/5 21171" "PC 17599" "STON/O2. 3101282" "113803" ...
$ Fare : num 7.25 71.28 7.92 53.1 8.05 ...
$ Cabin : chr "" "C85" "" "C123" ...
$ Embarked : chr "S" "C" "S" "S" ...
$ IsTrainset : logi TRUE TRUE TRUE TRUE TRUE TRUE ...
a)The first step of data cleaning is removing unwanted observations from the dataset. This covers ‘duplicate or irrelevant observations’. As we can see from the ‘titanic.combined’ dataframe, each row is a unique PassengerId with all 13 variables given, so there is no need to check for duplicates or irrelevant observations.
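Still, a one-line sanity check costs nothing; if every PassengerId is unique, counting duplicated ids should return 0:
# Count duplicated PassengerIds; 0 means every row is a unique passenger
sum(duplicated(titanic.combined$PassengerId))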
b)Next comes ‘Fixing structural errors’. All 13 variables have unique, well-formed names, so there is little chance of typos, inconsistent capitalization, or mislabeled classes here.
c)The 3rd step is ‘Filter unwanted outliers’. Outliers can cause problems with certain types of models; for example, linear regression models are less robust to outliers than decision tree models. However, an outlier should only be removed for a good reason, such as a suspicious measurement that is unlikely to be real data, and we have no such reason here, so this step can also be skipped.
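That said, if we did want to eyeball the extreme values before deciding, a quick boxplot of a numeric variable such as Fare would surface them (inspection only, no removal):
# Visual inspection only: extreme fares appear as points beyond the whiskers
boxplot(titanic.combined$Fare, main = "Fare")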
d)The most important step is ‘Handle missing data’. The 2 most commonly recommended ways of dealing with missing data are:
1)Dropping observations that have missing values.
2)Imputing the missing values based on other observations.
Now, the best way to handle missing data for categorical features is often to simply label them as ‘Missing’. For missing numeric data, you should flag and fill the values: by flagging which values were missing and filling the holes with a constant, you essentially allow the algorithm to estimate the optimal treatment of missingness, instead of just filling in the mean.
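As a minimal sketch of flagging and filling (illustrative only; we will actually impute Age with a decision tree below, and the Age_missing name is made up for this demo):
# Flag: record which values were missing; Fill: plug the holes with a constant
age.flagged <- data.frame(Age = titanic.combined$Age, Age_missing = is.na(titanic.combined$Age))
age.flagged$Age[age.flagged$Age_missing] <- 0
To find out the missing values in our data we will use the ‘is.na’ function on the different variables, as we can see: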
table(is.na(titanic.combined))
FALSE TRUE
16335 682
Now let’s look into each column:
table(is.na(titanic.combined$Age))
FALSE TRUE
1046 263
263 values out of 1309 were missing this whole time; that’s a whopping 20%! So let’s take a look at the combined dataframe’s Age variable to see what we’re up against:
summary(titanic.combined$Age)
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
0.17 21.00 28.00 29.88 39.00 80.00 263
So let’s grow a tree on the subset of the data where the Age values are available, and then use it to replace those that are missing. rpart has a great advantage here in that it can use surrogate variables when it encounters an NA value.
install.packages('rpart')
library(rpart)
# Fit a regression tree (method = "anova", since Age is continuous) on the rows where Age is known
age.fit<- rpart(Age~ Pclass + Sex + SibSp + Parch + Fare + Embarked, data=titanic.combined[!is.na(titanic.combined$Age),],method="anova")
# Predict the missing ages from the fitted tree
titanic.combined$Age[is.na(titanic.combined$Age)] <- predict(age.fit, titanic.combined[is.na(titanic.combined$Age),])
table(is.na(titanic.combined$Age))
FALSE
1309
Now we can see all the missing values of the Age column have been replaced by predicted values from the decision tree. There are still some missing values that ‘is.na’ cannot find, because they are stored as empty strings ('') rather than NA. So let’s find them.
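One quick way to count the empty strings in every character column (a small helper sketch, not part of the original cleaning recipe):
# Blanks ('') hide from is.na(), so count them per character column
char.cols <- sapply(titanic.combined, is.character)
sapply(titanic.combined[, char.cols], function(x) sum(x == ""))
Cabin turns out to have many blanks (we leave it alone, since we won’t model on it), while Embarked has just a couple. Let’s look at Embarked: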
table(titanic.combined$Embarked)
      C   Q   S
  2 270 123 914
Now, as we can see, two observations do not have the port of embarkation specified, which is why they are blank. While a blank wouldn’t be a problem for our model the way an NA would be, since we’re cleaning anyhow, let’s get rid of it. Because it’s so few observations, and such a large majority boarded in Southampton, let’s just replace those two with “S”. First we need to find out which rows they are; we can use which for this:
which(titanic.combined$Embarked== '')
[1] 62 830
This gives us the indexes of the blank fields. Then we simply replace those two, and encode the variable as a factor:
titanic.combined$Embarked[c(62,830)] = "S"
titanic.combined$Embarked <- factor(titanic.combined$Embarked)
The other variable with a missing value is Fare, so let’s take a look, find the observation, and replace it with the median fare.
summary(titanic.combined$Fare)
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
0.000 7.896 14.454 33.295 31.275 512.329 1
which(is.na(titanic.combined$Fare))
[1] 1044
titanic.combined$Fare[1044]<- median(titanic.combined$Fare,na.rm=TRUE)
Okay. Our dataframe is now cleared of all NAs, apart from the Survived column in the test rows, which is exactly what we are going to predict.
3)Explore and visualize the dataset to find patterns, and do Feature Engineering:
Feature engineering has been described as easily the most important factor in determining the success or failure of a predictive model. In the Titanic competition it means chopping up and combining the different attributes we were given.
The ticket number, cabin, and name are all unique to each passenger; perhaps parts of those text strings could be extracted to build a new predictive attribute. Let’s start with the name field.
titanic.combined$Name <- as.character(titanic.combined$Name)
titanic.combined$Name[1]
[1] "Braund, Mr. Owen Harris"
Nicely, we see that there is a comma right after the person’s last name, and a full stop after their title. We can easily use the function strsplit, which stands for string split, to break apart our original name over these two symbols. Let’s try it out on Mr. Braund:
strsplit(titanic.combined$Name[1], split = '[,.]')
[[1]]
[1] "Braund" " Mr" " Owen Harris"
Let’s go a level deeper into the indexing mess and extract the title.
strsplit(titanic.combined$Name[1], split = '[,.]')[[1]][2]
[1] " Mr"
We feed sapply our vector of names and the function we just came up with. sapply runs through the elements of the name vector and sends each name to the function. The results of all these string splits are combined into a vector as output, which we then store in a new column of our original dataframe, called Title.
titanic.combined$Title <- sapply(titanic.combined$Name, FUN=function(x) {strsplit(x, split='[,.]')[[1]][2]})
titanic.combined$Title <- sub(' ', '', titanic.combined$Title)
table(titanic.combined$Title)
Capt Col Don Dona Dr Jonkheer
1 4 1 1 8 1
Lady Major Master Miss Mlle Mme
1 2 61 260 2 1
Mr Mrs Ms Rev Sir the Countess
757 197 2 8 1 1
There are a few very rare titles in here that won’t give our model much to work with, so let’s combine a few of the most unusual ones.
titanic.combined$Title[titanic.combined$Title %in% c('Mme', 'Mlle')] <- 'Mlle'
titanic.combined$Title[titanic.combined$Title %in% c('Capt', 'Don', 'Major', 'Sir')] <- 'Sir'
titanic.combined$Title[titanic.combined$Title %in% c('Dona', 'Lady', 'the Countess', 'Jonkheer')] <- 'Lady'
When we combine titles, the ‘%in%’ operator checks, for each entry of the Title column, whether it matches any element of the vector built by c(), and the assignment then overwrites all the matching entries at once. A tiny standalone example (values are illustrative) makes the behaviour clear:
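c("Mr", "Mme", "Miss") %in% c("Mme", "Mlle")
[1] FALSE  TRUE FALSE
Our final step is to change the variable type back to a factor, as these are essentially categories that we have created: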
titanic.combined$Title <- factor(titanic.combined$Title)
We’re done with the passenger’s title now. What else can we think up? Well, there are those two variables SibSp and Parch that indicate the number of family members the passenger is travelling with. It seems reasonable to assume that a large family might have trouble tracking down little Johnny as they all scramble to get off the sinking ship, so let’s combine the two variables into a new one, FamilySize:
titanic.combined$FamilySize <- titanic.combined$SibSp + titanic.combined$Parch + 1
We just add the number of siblings, spouses, parents and children the passenger had with them, plus one for their own existence of course, and we have a new variable indicating the size of the family they travelled with.
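A quick frequency table is a handy sanity check on the new variable:
# Distribution of family sizes across all 1309 passengers
table(titanic.combined$FamilySize)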
We just considered a large family having issues getting to lifeboats together, but maybe specific families had more trouble than others? Suppose there are three Johnsons in a family of size 3, and another three probably unrelated Johnsons all travelling solo; family size alone cannot tell them apart. Combining the surname with the family size should remedy this concern.
titanic.combined$Surname <- sapply(titanic.combined$Name, FUN=function(x) {strsplit(x, split='[,.]')[[1]][1]})
titanic.combined$FamilyID <- paste(as.character(titanic.combined$FamilySize), titanic.combined$Surname, sep="")
But those three single Johnsons would all have the same FamilyID. So let’s knock out any family size of two or less and call it a “Small” family. This fixes the Johnson problem too.
titanic.combined$FamilyID[titanic.combined$FamilySize <= 2] <- 'Small'
table(titanic.combined$FamilyID)
11Sage 3Abbott 3Appleton 3Beckwith
11 3 1 2
3Boulos 3Bourke 3Brown 3Caldwell
3 3 4 3
3Christy 3Collyer 3Compton 3Cornell
2 3 3 1
3Coutts 3Crosby 3Danbom 3Davies
3 3 3 5
3Dodge 3Douglas 3Drew 3Elias
3 1 3 3
3Frauenthal 3Frolicher 3Frolicher-Stehli 3Goldsmith
1 1 2 3
3Gustafsson 3Hamalainen 3Hansen 3Hart
2 2 1 3
3Hays 3Hickman 3Hiltunen 3Hirvonen
2 3 1 1
3Jefferys 3Johnson 3Kink 3Kink-Heilmann
2 3 2 2
3Klasen 3Lahtinen 3Mallet 3McCoy
3 2 3 3
3Minahan 3Moubarek 3Nakid 3Navratil
1 3 3 3
3Newell 3Newsom 3Nicholls 3Peacock
1 1 1 3
3Peter 3Quick 3Richards 3Rosblom
3 3 2 3
3Samaan 3Sandstrom 3Silven 3Spedden
3 3 1 3
3Strom 3Taussig 3Thayer 3Thomas
1 3 3 1
3Touma 3van Billiard 3Van Impe 3Vander Planke
3 3 3 2
3Wells 3Wick 3Widener 4Allison
3 3 3 4
4Backstrom 4Baclini 4Becker 4Carter
1 4 4 4
4Davidson 4Dean 4Herman 4Hocking
1 4 4 2
4Jacobsohn 4Johnston 4Laroche 4Renouf
1 4 4 1
4Vander Planke 4West 5Ford 5Hocking
1 4 5 1
5Kink-Heilmann 5Lefebre 5Palsson 5Ryerson
1 5 5 5
6Fortune 6Panula 6Rice 6Richards
6 6 6 1
6Skoog 7Andersson 7Asplund 8Goodwin
6 9 7 8
Small
1025
There are plenty of FamilyIDs with only one or two members, even though we wanted only family sizes of 3 or more. Perhaps some families had different last names; whatever the case, let’s store the counts in a dataframe and subset it to show only those unexpectedly small FamilyID groups.
famIDs <- data.frame(table(titanic.combined$FamilyID))
famIDs <- famIDs[famIDs$Freq <= 2,]
We then need to overwrite any family IDs in our dataset for groups that were not correctly identified and finally convert it to a factor:
titanic.combined$FamilyID[titanic.combined$FamilyID %in% famIDs$Var1] <- 'Small'
titanic.combined$FamilyID <- factor(titanic.combined$FamilyID)
With this, the feature engineering part is now complete.
4)Model the data: now we could approach the prediction modeling with simple decision trees, but they are biased towards favouring factors with many levels: a decision node can chop and change such a factor into whatever combination gives the best purity in the following nodes, so the bias towards many-levelled factors won’t go away, and the overfitting problem will grow. The Random Forest algorithm adds a restriction that we did not have with plain decision trees: instead of looking at the entire pool of available variables at each split, Random Forests take only a subset of them, typically the square root of the number available. This way, many of the trees won’t even have the gender variable available at the first split, and might not see it until several nodes deep.
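To make the “square root of the number available” point concrete: randomForest’s mtry argument controls how many variables are sampled at each split, and for classification it defaults to the floor of the square root of the number of predictors (a quick illustration; p here is just the count of predictors in our formula below):
p <- 10          # number of predictors in the formula we fit below
floor(sqrt(p))   # default mtry for classification: 3 variables tried per split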
Random Forests in R can only digest factors with up to 32 levels, and our FamilyID variable had almost double that. We’ll copy the FamilyID column to a new variable, FamilyID2, convert it from a factor back into a character string with as.character(), and then increase our cut-off for a “Small” family from 2 to 3 people. Then we just convert it back to a factor and we’re done:
titanic.combined$FamilyID2 <- titanic.combined$FamilyID
titanic.combined$FamilyID2 <- as.character(titanic.combined$FamilyID2)
titanic.combined$FamilyID2[titanic.combined$FamilySize <= 3] <- 'Small'
titanic.combined$FamilyID2 <- factor(titanic.combined$FamilyID2)
table(titanic.combined$FamilyID2)
11Sage 4Allison 4Baclini 4Becker 4Carter 4Dean 4Herman
11 4 4 4 4 4 4
4Johnston 4Laroche 4West 5Ford 5Lefebre 5Palsson 5Ryerson
4 4 4 5 5 5 5
6Fortune 6Panula 6Rice 6Skoog 7Andersson 7Asplund 8Goodwin
6 6 6 6 9 7 8
Small
1194
Now we are down to 22 levels, so we’re good. Let’s split the data back into the test and train sets and delete the placeholder Survived variable from the test set.
titanic.train<- titanic.combined[titanic.combined$IsTrainset== TRUE,]
titanic.test<-titanic.combined[titanic.combined$IsTrainset==FALSE,]
install.packages("dplyr")
library(dplyr)
titanic.test<- select(titanic.test,-Survived)
Now let’s grow a Random Forest. Install and load the package randomForest:
install.packages("randomForest")
library(randomForest)
The random forest has randomness built in, as discussed earlier, so it’s good practice to set a seed to get reproducible results the next time we run the code. The number inside isn’t important; you just need to use the same seed number each time so that the same random numbers are generated inside the Random Forest function.
set.seed(415)
Instead of specifying method=“class” as with rpart, we force the model to predict our classification by temporarily changing our target variable to a factor with only two levels using as.factor(). The importance=TRUE argument allows us to inspect variable importance as we’ll see, and the ntree argument specifies how many trees we want to grow.
fit<- randomForest(as.factor(Survived)~ Pclass+Sex+Age+SibSp+Parch+Fare+Embarked+Title+FamilySize+FamilyID2,data = titanic.train,importance=TRUE, ntree=2000)
The number of trees to grow depends on the size of the dataset; with a dataset this small we can grow a large number of trees without worrying too much about their complexity, and it will still run pretty fast. So let’s look at which variables were important:
varImpPlot(fit)
Random Forests don’t just waste the “out-of-bag” (OOB) observations; each tree is evaluated on the observations it did not see during training, which gives a built-in estimate of how well the model performs on unseen data.
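After fitting, simply printing the model displays this built-in OOB error estimate along with a confusion matrix (exact numbers will vary with the data and seed):
print(fit)   # shows the OOB estimate of error rate and the confusion matrix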
The prediction function works similarly to decision trees: all 2000 trees make their classifications and then vote on the final answer:
Prediction <- predict(fit, titanic.test)
submit <- data.frame(PassengerId = titanic.test$PassengerId, Survived = Prediction)
write.csv(submit, file = "First_randomForest.csv", row.names = FALSE)
5)Interpret the Results: this step is all about telling the story of how the model works and explaining each of the steps involved.
The job was to predict whether a passenger on the Titanic survived or not, with 0/1 values for the Survived variable. The score is the percentage of passengers correctly predicted. After submitting the prediction file 'First_randomForest.csv', this model got a score of 0.78468.
On the leaderboard this model ranked 5122, which is within the top 28%.