I am new to data science and machine learning, and I am looking for a simple introduction to Kaggle prediction competitions. This is my first competition on Kaggle.
The sinking of the RMS Titanic is one of the most infamous shipwrecks in history. On April 15, 1912, during her maiden voyage, the Titanic sank after colliding with an iceberg, killing 1502 out of 2224 passengers and crew. This sensational tragedy shocked the international community and led to better safety regulations for ships.
One of the reasons that the shipwreck led to such loss of life was that there were not enough lifeboats for the passengers and crew. Although there was some element of luck involved in surviving the sinking, some groups of people were more likely to survive than others, such as women, children, and the upper-class.
In this challenge, we are asked to complete the analysis of what sorts of people were likely to survive. In particular, we are asked to apply the tools of machine learning to predict which passengers survived the tragedy.
The data has been split into two groups:
training set (train.csv)
test set (test.csv)
The training set should be used to build your machine learning models. For the training set, we provide the outcome (also known as the “ground truth”) for each passenger. Your model will be based on “features” like passengers’ gender and class. You can also use feature engineering to create new features.
The test set should be used to see how well your model performs on unseen data. For the test set, we do not provide the ground truth for each passenger. It is your job to predict these outcomes. For each passenger in the test set, use the model you trained to predict whether or not they survived the sinking of the Titanic.
We also include gender_submission.csv, a set of predictions that assume all and only female passengers survive, as an example of what a submission file should look like.
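As a rough illustration of what such a submission looks like, here is a minimal sketch that builds the same kind of gender-only baseline on a tiny data frame. The `toy_test` data frame and the output file name are assumptions for illustration only; the real file covers all 418 test passengers with the columns PassengerId and Survived.

```r
# Tiny stand-in for test.csv (three rows only, for illustration)
toy_test <- data.frame(
  PassengerId = c(892, 893, 894),
  Sex = c("male", "female", "male"),
  stringsAsFactors = FALSE
)

# Gender baseline: predict 1 (survived) for females and 0 for everyone else
baseline <- data.frame(
  PassengerId = toy_test$PassengerId,
  Survived = as.integer(toy_test$Sex == "female")
)

# A submission is a two-column CSV with a header row (file name is hypothetical)
out_file <- file.path(tempdir(), "gender_baseline.csv")
write.csv(baseline, out_file, row.names = FALSE, quote = FALSE)
print(baseline)
```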
Variable   Definition                                    Key
survival   Survival                                      0 = No, 1 = Yes
pclass     Ticket class                                  1 = 1st, 2 = 2nd, 3 = 3rd
sex        Sex
Age        Age in years
sibsp      # of siblings / spouses aboard the Titanic
parch      # of parents / children aboard the Titanic
ticket     Ticket number
fare       Passenger fare
cabin      Cabin number
embarked   Port of Embarkation                           C = Cherbourg, Q = Queenstown, S = Southampton
Variable Notes
pclass: A proxy for socio-economic status (SES): 1st = Upper, 2nd = Middle, 3rd = Lower
age: Age is fractional if less than 1. If the age is estimated, it is in the form of xx.5
sibsp: The dataset defines family relations in this way… Sibling = brother, sister, stepbrother, stepsister Spouse = husband, wife (mistresses and fiancés were ignored)
parch: The dataset defines family relations in this way… Parent = mother, father Child = daughter, son, stepdaughter, stepson Some children travelled only with a nanny, therefore parch=0 for them.
I have downloaded the data set from Kaggle and would like to examine it further.
library(ggplot2)
library(dplyr)
#download.file("https://www.kaggle.com/c/titanic/download/test.csv")
#download.file("https://www.kaggle.com/c/titanic/download/train.csv")
train <- read.csv(file="./Data/train.csv", header=TRUE, sep=",",stringsAsFactors = FALSE)
test <- read.csv(file="./Data/test.csv", header=TRUE, sep=",",stringsAsFactors = FALSE)
str(train)
## 'data.frame': 891 obs. of 12 variables:
## $ PassengerId: int 1 2 3 4 5 6 7 8 9 10 ...
## $ Survived : int 0 1 1 1 0 0 0 0 1 1 ...
## $ Pclass : int 3 1 3 1 3 3 1 3 3 2 ...
## $ Name : chr "Braund, Mr. Owen Harris" "Cumings, Mrs. John Bradley (Florence Briggs Thayer)" "Heikkinen, Miss. Laina" "Futrelle, Mrs. Jacques Heath (Lily May Peel)" ...
## $ Sex : chr "male" "female" "female" "female" ...
## $ Age : num 22 38 26 35 35 NA 54 2 27 14 ...
## $ SibSp : int 1 1 0 1 0 0 0 3 0 1 ...
## $ Parch : int 0 0 0 0 0 0 0 1 2 0 ...
## $ Ticket : chr "A/5 21171" "PC 17599" "STON/O2. 3101282" "113803" ...
## $ Fare : num 7.25 71.28 7.92 53.1 8.05 ...
## $ Cabin : chr "" "C85" "" "C123" ...
## $ Embarked : chr "S" "C" "S" "S" ...
str(test)
## 'data.frame': 418 obs. of 11 variables:
## $ PassengerId: int 892 893 894 895 896 897 898 899 900 901 ...
## $ Pclass : int 3 3 2 3 3 3 3 2 3 3 ...
## $ Name : chr "Kelly, Mr. James" "Wilkes, Mrs. James (Ellen Needs)" "Myles, Mr. Thomas Francis" "Wirz, Mr. Albert" ...
## $ Sex : chr "male" "female" "male" "male" ...
## $ Age : num 34.5 47 62 27 22 14 30 26 18 21 ...
## $ SibSp : int 0 1 0 0 1 0 0 1 0 2 ...
## $ Parch : int 0 0 0 0 1 0 0 1 0 0 ...
## $ Ticket : chr "330911" "363272" "240276" "315154" ...
## $ Fare : num 7.83 7 9.69 8.66 12.29 ...
## $ Cabin : chr "" "" "" "" ...
## $ Embarked : chr "Q" "S" "Q" "S" ...
summary(train)
## PassengerId Survived Pclass Name
## Min. : 1.0 Min. :0.0000 Min. :1.000 Length:891
## 1st Qu.:223.5 1st Qu.:0.0000 1st Qu.:2.000 Class :character
## Median :446.0 Median :0.0000 Median :3.000 Mode :character
## Mean :446.0 Mean :0.3838 Mean :2.309
## 3rd Qu.:668.5 3rd Qu.:1.0000 3rd Qu.:3.000
## Max. :891.0 Max. :1.0000 Max. :3.000
##
## Sex Age SibSp Parch
## Length:891 Min. : 0.42 Min. :0.000 Min. :0.0000
## Class :character 1st Qu.:20.12 1st Qu.:0.000 1st Qu.:0.0000
## Mode :character Median :28.00 Median :0.000 Median :0.0000
## Mean :29.70 Mean :0.523 Mean :0.3816
## 3rd Qu.:38.00 3rd Qu.:1.000 3rd Qu.:0.0000
## Max. :80.00 Max. :8.000 Max. :6.0000
## NA's :177
## Ticket Fare Cabin Embarked
## Length:891 Min. : 0.00 Length:891 Length:891
## Class :character 1st Qu.: 7.91 Class :character Class :character
## Mode :character Median : 14.45 Mode :character Mode :character
## Mean : 32.20
## 3rd Qu.: 31.00
## Max. :512.33
##
summary(test)
## PassengerId Pclass Name Sex
## Min. : 892.0 Min. :1.000 Length:418 Length:418
## 1st Qu.: 996.2 1st Qu.:1.000 Class :character Class :character
## Median :1100.5 Median :3.000 Mode :character Mode :character
## Mean :1100.5 Mean :2.266
## 3rd Qu.:1204.8 3rd Qu.:3.000
## Max. :1309.0 Max. :3.000
##
## Age SibSp Parch Ticket
## Min. : 0.17 Min. :0.0000 Min. :0.0000 Length:418
## 1st Qu.:21.00 1st Qu.:0.0000 1st Qu.:0.0000 Class :character
## Median :27.00 Median :0.0000 Median :0.0000 Mode :character
## Mean :30.27 Mean :0.4474 Mean :0.3923
## 3rd Qu.:39.00 3rd Qu.:1.0000 3rd Qu.:0.0000
## Max. :76.00 Max. :8.0000 Max. :9.0000
## NA's :86
## Fare Cabin Embarked
## Min. : 0.000 Length:418 Length:418
## 1st Qu.: 7.896 Class :character Class :character
## Median : 14.454 Mode :character Mode :character
## Mean : 35.627
## 3rd Qu.: 31.500
## Max. :512.329
## NA's :1
set.seed(123)
We will combine both the test and train data for exploring and processing the data.
#Placeholder: the test set has no Survived column, so set it to 0 (this assumes no one survived) to allow rbind()
test$Survived <- 0
finalData <- rbind(train,test)
str(finalData)
## 'data.frame': 1309 obs. of 12 variables:
## $ PassengerId: int 1 2 3 4 5 6 7 8 9 10 ...
## $ Survived : num 0 1 1 1 0 0 0 0 1 1 ...
## $ Pclass : int 3 1 3 1 3 3 1 3 3 2 ...
## $ Name : chr "Braund, Mr. Owen Harris" "Cumings, Mrs. John Bradley (Florence Briggs Thayer)" "Heikkinen, Miss. Laina" "Futrelle, Mrs. Jacques Heath (Lily May Peel)" ...
## $ Sex : chr "male" "female" "female" "female" ...
## $ Age : num 22 38 26 35 35 NA 54 2 27 14 ...
## $ SibSp : int 1 1 0 1 0 0 0 3 0 1 ...
## $ Parch : int 0 0 0 0 0 0 0 1 2 0 ...
## $ Ticket : chr "A/5 21171" "PC 17599" "STON/O2. 3101282" "113803" ...
## $ Fare : num 7.25 71.28 7.92 53.1 8.05 ...
## $ Cabin : chr "" "C85" "" "C123" ...
## $ Embarked : chr "S" "C" "S" "S" ...
summary(finalData)
## PassengerId Survived Pclass Name
## Min. : 1 Min. :0.0000 Min. :1.000 Length:1309
## 1st Qu.: 328 1st Qu.:0.0000 1st Qu.:2.000 Class :character
## Median : 655 Median :0.0000 Median :3.000 Mode :character
## Mean : 655 Mean :0.2613 Mean :2.295
## 3rd Qu.: 982 3rd Qu.:1.0000 3rd Qu.:3.000
## Max. :1309 Max. :1.0000 Max. :3.000
##
## Sex Age SibSp Parch
## Length:1309 Min. : 0.17 Min. :0.0000 Min. :0.000
## Class :character 1st Qu.:21.00 1st Qu.:0.0000 1st Qu.:0.000
## Mode :character Median :28.00 Median :0.0000 Median :0.000
## Mean :29.88 Mean :0.4989 Mean :0.385
## 3rd Qu.:39.00 3rd Qu.:1.0000 3rd Qu.:0.000
## Max. :80.00 Max. :8.0000 Max. :9.000
## NA's :263
## Ticket Fare Cabin
## Length:1309 Min. : 0.000 Length:1309
## Class :character 1st Qu.: 7.896 Class :character
## Mode :character Median : 14.454 Mode :character
## Mean : 33.295
## 3rd Qu.: 31.275
## Max. :512.329
## NA's :1
## Embarked
## Length:1309
## Class :character
## Mode :character
##
##
##
##
So, among the numeric columns, Age has 263 missing values and Fare has one. We also have to examine the non-numeric columns.
library(Amelia) # a package to visualize missing data
missmap(finalData,main="Titanic Training Data - Missings Map", col=c("yellow", "black"), legend=FALSE)
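The blank Cabin and Embarked entries were read in as empty strings rather than NA (see the str() output above), so an NA-based view will presumably not count them. Here is a minimal sketch of a count that treats both as missing. The helper name `count_missing` and the toy data frame are my own assumptions for illustration.

```r
# Hypothetical helper: count NA values, plus empty strings in character columns
count_missing <- function(df) {
  sapply(df, function(col) {
    if (is.character(col)) sum(is.na(col) | col == "") else sum(is.na(col))
  })
}

# Toy data shaped like a few Titanic columns (values are illustrative)
toy <- data.frame(
  Age = c(22, NA, 26),
  Cabin = c("", "C85", ""),
  Embarked = c("S", "", "S"),
  stringsAsFactors = FALSE
)
print(count_missing(toy))
```

Calling `count_missing(finalData)` in the same way should list the number of missing entries per column, character columns included.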
#no. of survived and no. of expired people
as.data.frame(table(finalData$Survived))
## Var1 Freq
## 1 0 967
## 2 1 342
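So 342 people survived. Note that the 967 zeros include the 418 test passengers, whose Survived value is only the placeholder set above; among the 891 training passengers, 549 died.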
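Because the test rows were given Survived = 0 before combining, any count taken over finalData mixes real outcomes with placeholders. One way to avoid this, sketched below on toy data, is to store NA for the unknown outcomes and flag where each row came from. The `IsTrain` column and the toy data frames are assumptions for illustration, not part of the pipeline above.

```r
# Toy stand-ins for the train and test sets
toy_train <- data.frame(PassengerId = 1:4, Survived = c(0, 1, 1, 0))
toy_test  <- data.frame(PassengerId = 5:6)

# Mark unknown outcomes as NA instead of 0, and remember the origin of each row
toy_test$Survived <- NA
toy_train$IsTrain <- TRUE   # hypothetical flag column
toy_test$IsTrain  <- FALSE
combined <- rbind(toy_train, toy_test)

# table() ignores NA by default, so only the known outcomes are counted
counts <- table(combined$Survived)
print(counts)
```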
How many female and male survivors are there in each age group?
survived <- finalData[finalData$Survived == 1, ]
ggplot(survived, aes(x = Age, fill = factor(Sex))) +
geom_histogram()
#survival rate for lone travellers
travelled_alone <- survived[survived$SibSp == 0 & survived$Parch == 0,]
nrow(travelled_alone) #no. of passengers who travelled alone and survived
## [1] 163
as.data.frame(table(travelled_alone$Sex))
## Var1 Freq
## 1 female 99
## 2 male 64
tot_travelled_alone <- train[train$SibSp == 0 & train$Parch == 0,]
as.data.frame(table(tot_travelled_alone$Sex))
## Var1 Freq
## 1 female 126
## 2 male 411
Conclusion: Female lone travellers were far more likely to survive than male lone travellers (99 of 126, about 79%, versus 64 of 411, about 16%).
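The comparison is easier to read as rates than as raw counts. A small sketch of the arithmetic, using the counts from the two tables above (the variable names are my own):

```r
# Counts taken from the tables above
alone_survived <- c(female = 99, male = 64)
alone_total    <- c(female = 126, male = 411)

# Share of lone travellers of each sex who survived
survival_rate <- alone_survived / alone_total
print(round(survival_rate, 3))
```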
#Give dummy variables for sex column
finalData$Sex <- ifelse(finalData$Sex == "male",0,1)
#Count how many travelled together: the immediate family members aboard plus the passenger
finalData$Familycount <- finalData$SibSp + finalData$Parch + 1
head(finalData)
## PassengerId Survived Pclass
## 1 1 0 3
## 2 2 1 1
## 3 3 1 3
## 4 4 1 1
## 5 5 0 3
## 6 6 0 3
## Name Sex Age SibSp Parch
## 1 Braund, Mr. Owen Harris 0 22 1 0
## 2 Cumings, Mrs. John Bradley (Florence Briggs Thayer) 1 38 1 0
## 3 Heikkinen, Miss. Laina 1 26 0 0
## 4 Futrelle, Mrs. Jacques Heath (Lily May Peel) 1 35 1 0
## 5 Allen, Mr. William Henry 0 35 0 0
## 6 Moran, Mr. James 0 NA 0 0
## Ticket Fare Cabin Embarked Familycount
## 1 A/5 21171 7.2500 S 2
## 2 PC 17599 71.2833 C85 C 2
## 3 STON/O2. 3101282 7.9250 S 1
## 4 113803 53.1000 C123 S 2
## 5 373450 8.0500 S 1
## 6 330877 8.4583 Q 1
Survival probability by passenger class
#Only the first 891 rows (the training data) carry real outcomes; the test rows hold the placeholder Survived = 0
finalData[1:891, ] %>% ggplot(aes(x=Pclass,fill=factor(Survived)))+geom_bar(stat="count",position="fill")
Survival probability by sex
finalData[1:891, ] %>% ggplot(aes(x=Sex,fill=factor(Survived)))+geom_bar(stat="count",position="fill")
Family size by passenger class and survival
#Pclass is converted to a factor so that one pair of boxes is drawn per class
finalData[1:891, ] %>% ggplot()+geom_boxplot(aes(x=factor(Pclass),y=Familycount,fill=as.factor(Survived))) + theme_classic()
We will see the correlation between variables.
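A minimal sketch of such a check with base R's cor(). The toy data frame copies the first six training rows shown in the head() output above, so the numbers only demonstrate the mechanics and say nothing about the full data set.

```r
# First six training rows, numeric columns only (copied from the output above)
toy <- data.frame(
  Survived = c(0, 1, 1, 1, 0, 0),
  Pclass   = c(3, 1, 3, 1, 3, 3),
  Age      = c(22, 38, 26, 35, 35, NA),
  Fare     = c(7.25, 71.2833, 7.925, 53.1, 8.05, 8.4583)
)

# Pearson correlation, using only the pairs where both values are present
cor_matrix <- cor(toy, use = "pairwise.complete.obs")
print(round(cor_matrix, 2))
```

The `use = "pairwise.complete.obs"` argument matters here because Age still contains missing values at this point.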
Unite Family members with their family group
#To do that, split each name into Title, Surname and Firstname, plus Maidenname for Mrs. only
library(dplyr)
finalData <- finalData %>%
mutate(tempcol = strsplit(Name, "[,.]")) %>%
rowwise() %>%
mutate(Surname = unlist(tempcol)[1], Title = unlist(tempcol[2]), Firstname = unlist(tempcol[3]),
Maidenname = unlist(ifelse(Title == " Mrs",tail(strsplit(strsplit(tempcol[3],"[\\(\\)]")[[1]][2]," ")[[1]],1) ,"U"))) %>%
select(-c(Name,tempcol))
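To make the splitting step easier to follow, here is a small base R sketch that does the same kind of split on two of the names shown earlier. The helper `split_name` is hypothetical and is not the code used above; unlike the pipeline above, it trims the leading space from each part.

```r
# Hypothetical helper: split "Surname, Title. First names (Maiden name)" into parts
split_name <- function(name) {
  parts <- trimws(strsplit(name, "[,.]")[[1]])
  first <- parts[3]
  # Maiden name: last word inside the parentheses, for "Mrs" only; "U" = unknown
  maiden <- "U"
  if (parts[2] == "Mrs" && grepl("\\(", first)) {
    inside <- sub(".*\\(([^)]*)\\).*", "\\1", first)
    maiden <- tail(strsplit(inside, " ")[[1]], 1)
  }
  data.frame(Surname = parts[1], Title = parts[2], Firstname = first,
             Maidenname = maiden, stringsAsFactors = FALSE)
}

examples <- c("Braund, Mr. Owen Harris",
              "Cumings, Mrs. John Bradley (Florence Briggs Thayer)")
parsed <- do.call(rbind, lapply(examples, split_name))
print(parsed)
```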
Get family ids
#Second, create Familyid from the family count and surname; single travellers get "single" instead of a family id
finalData$Familyid <- ifelse( finalData$Familycount == 1,"single",paste(as.character(finalData$Familycount), finalData$Surname, sep=""))
unique(finalData$Title)
## [1] " Mr" " Mrs" " Miss" " Master"
## [5] " Don" " Rev" " Dr" " Mme"
## [9] " Ms" " Major" " Lady" " Sir"
## [13] " Mlle" " Col" " Capt" " the Countess"
## [17] " Jonkheer" " Dona"
finalData$Title[finalData$Title %in% c( ' Mlle', ' Ms')] <- ' Miss'
finalData$Title[finalData$Title %in% c(' Capt', ' Don', ' Major', ' Sir',' Col')] <- ' Sir'
finalData$Title[finalData$Title %in% c(' Dona', ' Lady', ' the Countess', ' Jonkheer', ' Mme')] <- ' Lady'
# Number of Rows with missing values
nrow(finalData[!complete.cases(finalData),])
## [1] 266
library(Hmisc)
#aregImpute() automatically identifies the variable types and treats them accordingly.
finalData$Title <- as.factor(finalData$Title)
#finalData$Embarked <- as.factor(finalData$Embarked)
impute_arg <- aregImpute(~ Age + Sex + Pclass + SibSp + Parch + Fare + Title , data = finalData, n.impute = 10, nk = 0 )
## Iteration 1
## Iteration 2
## Iteration 3
## Iteration 4
## Iteration 5
## Iteration 6
## Iteration 7
## Iteration 8
## Iteration 9
## Iteration 10
## Iteration 11
## Iteration 12
## Iteration 13
# Get the imputed values
imputed <-impute.transcan(impute_arg, data=finalData, imputation = 5, list.out=TRUE, pr=FALSE, check=FALSE)
# convert the list to a data frame
imputed.data <- as.data.frame(do.call(cbind,imputed))
# arrange the columns accordingly
#For now use only the imputed Age; Fare is missing for a single passenger (row 1044), so that one value is filled in directly
finalData <- cbind(finalData,imputed.data$Age)
finalData$Age <- NULL
names(finalData)[names(finalData) == 'imputed.data$Age'] <- 'Age'
finalData$Fare[1044] <- imputed.data$Fare[1044]
# Two Embarked values are empty. Both passengers are women on the same ticket, in Pclass 1, aged 38 and 62, and not related (they might be friends).
# They shared cabin B28 and both survived. Use these facts to infer where they embarked.
finalData[finalData$Embarked == "", ]
## PassengerId Survived Pclass Sex SibSp Parch Ticket Fare Cabin Embarked
## 62 62 1 1 1 0 0 113572 80 B28
## 830 830 1 1 1 0 0 113572 80 B28
## Familycount Surname Title Firstname Maidenname
## 62 1 Icard Miss Amelie U
## 830 1 Stone Mrs George Nelson (Martha Evelyn) Evelyn
## Familyid Age
## 62 single 38
## 830 single 62
prop.table(table(finalData$Embarked,finalData$Survived),1)
##
## 0 1
## 0.0000000 1.0000000
## C 0.6555556 0.3444444
## Q 0.7560976 0.2439024
## S 0.7625821 0.2374179
#With the above information, check the whole data set
table(finalData$Embarked[finalData$Pclass == 1 & finalData$Sex == 1 & finalData$Age >= 38 & finalData$Age <= 62 & finalData$Survived == 1])
##
## C S
## 2 22 14
# Most passengers matching these criteria embarked at "C", so fill the missing values with "C"
finalData$Embarked[62] <- "C"
finalData$Embarked[830] <- "C"
#rows of title = Mrs and no Maiden name has to be filled. Not of much importance, so consider it later
head(finalData)
## PassengerId Survived Pclass Sex SibSp Parch Ticket Fare
## 1 1 0 3 0 1 0 A/5 21171 7.2500
## 2 2 1 1 1 1 0 PC 17599 71.2833
## 3 3 1 3 1 0 0 STON/O2. 3101282 7.9250
## 4 4 1 1 1 1 0 113803 53.1000
## 5 5 0 3 0 0 0 373450 8.0500
## 6 6 0 3 0 0 0 330877 8.4583
## Cabin Embarked Familycount Surname Title
## 1 S 2 Braund Mr
## 2 C85 C 2 Cumings Mrs
## 3 S 1 Heikkinen Miss
## 4 C123 S 2 Futrelle Mrs
## 5 S 1 Allen Mr
## 6 Q 1 Moran Mr
## Firstname Maidenname Familyid Age
## 1 Owen Harris U 2Braund 22
## 2 John Bradley (Florence Briggs Thayer) Thayer 2Cumings 38
## 3 Laina U single 26
## 4 Jacques Heath (Lily May Peel) Peel 2Futrelle 35
## 5 William Henry U single 35
## 6 James U single 63
#Extract cabin level info; if unknown, mark it 'U'
finalData$Cabin <- sub("^$", "U" , finalData$Cabin)
finalData <- finalData %>% rowwise() %>% mutate(CabinLevel = substring(Cabin,1,1))
#create ticket category
finalData <- finalData %>% rowwise() %>% mutate(Ticketcategory = strsplit(Ticket, " ")[[1]][1])
finalData$Ticketcategory <- gsub("^\\d*$","XX",finalData$Ticketcategory)
# create family category
# if family size = 2 & 3 then small family else big family
finalData <- finalData %>% rowwise() %>% mutate(Familycategory = ifelse(Familycount %in% c(2,3), "small" , ifelse(Familycount %in% c(1), "single", "big")))
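The ticket and family categories can also be computed on whole vectors without rowwise(). A minimal sketch on a few ticket strings from the data, assuming the same rules as above (first space-separated token, with all-digit tokens replaced by "XX", and family sizes 1 / 2 to 3 / larger); the variable names are illustrative.

```r
ticket <- c("A/5 21171", "PC 17599", "113803", "STON/O2. 3101282")
family_count <- c(1, 2, 3, 5)

# First space-separated token of each ticket; all-digit tokens become "XX"
ticket_category <- sapply(strsplit(ticket, " "), `[`, 1)
ticket_category <- gsub("^\\d*$", "XX", ticket_category)

# Family size buckets: 1 = single, 2 to 3 = small, 4 or more = big
family_category <- as.character(cut(family_count,
                                    breaks = c(0, 1, 3, Inf),
                                    labels = c("single", "small", "big")))
print(data.frame(ticket, ticket_category, family_count, family_category))
```

With the default right-closed intervals, cut() places a family count of 3 in "small" and 4 in "big", which matches the nested ifelse() above.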
#Check that passengers with the same familyid have matching family counts, and compare their age, sex and cabin
#Grouping by familyid can be used to fill in missing values, because people in the same family tend to share a cabin, embark at the same place, etc.
df <- finalData %>% group_by(Surname) %>% summarise(n())
#Categorise age
finalData <- finalData %>% mutate(Agecategory = ifelse(Age < 18, "kid",ifelse(Age>60, "old", "adult")))
#Fix the family ids now; find the remaining family members later
#There are 9 Anderssons but only 7 family members, so two are other Anderssons. Change their family id; later we can combine them with their family
#How come they also have 7 family members each? Find out
finalData$Familyid[finalData$Surname == "Andersson" & finalData$Familycount == 7 & finalData$Ticket == "3101281"] <- "Andersson1"
finalData$Familyid[finalData$Surname == "Andersson" & finalData$Familycount == 7 & finalData$Ticket == "347091"] <- "Andersson2"
#Put all Richards together..they have diff familyid, same ticketid so same family..find other two siblings of Mrs. Richards ( 438 )
finalData$Familyid[finalData$Ticket == "29106" ] <- "Richards" # might be hocking is sibiling of Richards..lil bit confusing..leaving it for now
#Find two siblings of Mr. Kink-Heilmann and give them the same family id
finalData$Familyid[finalData$Ticket == "315153" ] <- "Kink-Heilmann"
#Hocking family all confusion..two children missing for Mrs. Hocking..n find her sibling also
finalData$Familyid[finalData$Ticket == "29105" ] <- "Hocking"
#Find sibling of MR Vander Planke , united Mr and MRs...find his sibling
finalData$Familyid[finalData$Ticket == "345763" ] <- "Vander Planke"
finalData$Familyid[finalData$Ticket == "345764" ] <- "Vander Planke" # Same surname and Fare
finalData$Familyid[finalData$Ticket == "31027" ] <- "Renouf"
#Unite Mr and Mrs
finalData$Familyid[finalData$Ticket == "243847" ] <- "Jacobsohn"
finalData$Familyid[finalData$Ticket =="F.C. 12750" ] <- "Davidson" # Find her 2 children ?? Mr. Davidson..no children..so correct the value
finalData$Familyid[finalData$Ticket =="3101278" ] <- "Backstrom" # Find 2 siblings of Mrs Bakstrom
finalData$Familyid[finalData$Ticket =="345763" ] <- "Vander Planke" # Some error with other Vander Planke..find out correct count of Vander Planke family members
finalData$Familyid[finalData$Ticket =="2625" ] <- "Thomas" # Find her sibling of Mrs. Thomas..she didnt come with spouse
finalData$Familyid[finalData$Ticket =="347054" ] <- "Strom" #Find sibling
finalData$Familyid[finalData$Ticket =="C.A. 33112" & finalData$Surname == "Davies" ] <- "Davies" # might be find parent of Mrs.. or correct Parch count
#Ticket cannot equal both values at once, so match either of the two tickets with %in%
finalData$Familyid[finalData$Ticket %in% c("A/4 48871", "A/4 48873") & finalData$Surname == "Davies" ] <- "Davies1"
finalData$Familyid[finalData$Surname =="Brown" & finalData$Ticket == "29750" ] <- "3Brown1" # Find sibling of 1248 another brown not this family
finalData$Familyid[finalData$Ticket =="11769"] <- "3Appleton" # find one sibling
#Wilkes sibling missing...893 id
# p id 19 Mrs. Vander Planke sibling missing check with ticket id 345763
# Mrs stephenson sibling missing ticket 36947 pid 592
# Uniting mother n child coz mothers surname and daughters maiden name same & same ticket id
finalData$Familyid[finalData$Ticket =="230433"] <- "Parrish"
# Might be father and daughter
finalData$Familyid[finalData$Ticket =="13236"] <- "Mock"
finalData$Familyid[finalData$Ticket == "31027"] <- "2Renouf"
#Sisters unite now
finalData$Familyid[finalData$Ticket == "13502" & finalData$PassengerId == "276"] <- "Andrews"
finalData$Familyid[finalData$Ticket == "13502" & finalData$PassengerId == "766"] <- "Andrews"
finalData$Familyid[finalData$Ticket == "3101298"] <- "Hirvonen"
finalData$Familyid[finalData$Ticket == "PC 17572" & finalData$Surname == "Harper"] <- "Harper1"
#Find sibling of Miss. Eustis pid 497
finalData$Familyid[finalData$PassengerId == "311"] <- "Potters"
finalData$Familyid[finalData$PassengerId == "1042"] <- "Potters" # mother n daughter
finalData$Familyid[finalData$Ticket == "113505"] <- "Bowermans"
#Fill in Maidenname
# if no maidenname for Mrs..fill with U
finalData$Maidenname[finalData$PassengerId %in% c(20,257,707,798,911,1148)] <- "U"
finalData <- finalData[finalData$CabinLevel != "T", ]
# 3. Splitting Data Manually as it was before
#Problem: do not convert all numeric and integer columns to factors, because that gives factors with more than 53 levels, which cannot be used in rfe feature selection
#finalData[] <- lapply(finalData, factor)
finalData$Pclass <- ifelse(finalData$Pclass == 1, "One", ifelse(finalData$Pclass == 2,"Two","Three"))
finalData$Survived <- ifelse(finalData$Survived == 1, "Y","N")
finalData$Survived <- as.factor(finalData$Survived)
finalData$Pclass <- as.factor(finalData$Pclass)
finalData$Sex <- as.factor(finalData$Sex)
finalData$Embarked <- as.factor(finalData$Embarked)
finalData$Title <- as.factor(finalData$Title)
finalData$CabinLevel <- as.factor(finalData$CabinLevel)
finalData$Ticketcategory <- as.factor(finalData$Ticketcategory)
finalData$Familycategory <- as.factor(finalData$Familycategory)
finalData$Agecategory <- as.factor(finalData$Agecategory)
finalData$Familyid <- as.factor(finalData$Familyid)
finalData$Familyid_num <- as.numeric(finalData$Familyid)
library(caret) # provides rfe(), rfeControl(), train() and varImp()
control <- rfeControl(functions = rfFuncs,method = "repeatedcv",number = 10)
predictors <- c("Pclass","Sex","SibSp","Parch","Fare","Embarked","Familycount","Title","Age","CabinLevel","Ticketcategory","Familycategory","Agecategory")
# Caution: finalData still includes the test rows, whose Survived value is only the placeholder "N"
Survival_Pred_Profile <- rfe(finalData[ ,predictors], finalData$Survived, sizes = c(1:13),
rfeControl = control,metric = "Accuracy")
print(Survival_Pred_Profile) # Top 5 : Title, CabinLevel, Sex, Fare, Pclass
##
## Recursive feature selection
##
## Outer resampling method: Cross-Validated (10 fold, repeated 1 times)
##
## Resampling performance over subset size:
##
## Variables Accuracy Kappa AccuracySD KappaSD Selected
## 1 0.7240 0.13103 0.030757 0.09805
## 2 0.7439 0.15768 0.030370 0.10985
## 3 0.7523 0.20380 0.030086 0.09924
## 4 0.7654 0.28003 0.044146 0.18248
## 5 0.7860 0.36214 0.033897 0.15734 *
## 6 0.7815 0.33242 0.041048 0.20652
## 7 0.7769 0.26684 0.035222 0.22929
## 8 0.7547 0.12868 0.027758 0.21292
## 9 0.7509 0.09319 0.028572 0.19672
## 10 0.7462 0.05937 0.024846 0.16952
## 11 0.7462 0.06132 0.022415 0.16140
## 12 0.7462 0.05868 0.022337 0.16182
## 13 0.7393 0.01259 0.009989 0.03392
##
## The top 5 variables (out of 5):
## Title, CabinLevel, Sex, Fare, Pclass
# list the chosen features
predictors(Survival_Pred_Profile)
## [1] "Title" "CabinLevel" "Sex" "Fare" "Pclass"
# plot the results
plot(Survival_Pred_Profile, type=c("g", "o"))
control <- trainControl(method="repeatedcv", number=10, repeats=3)
# train the model
model <- train(Survived ~ Pclass+Sex+SibSp+Parch+Fare+Embarked+Familycount+Title+Age+CabinLevel+Ticketcategory+Familycategory,
data = finalData, method="rf", preProcess="scale", trControl=control)
# estimate variable importance
importance <- varImp(model, scale=FALSE)
# summarize importance
print(importance)
# plot importance
plot(importance)
There are four ways to improve model performance: with data, with algorithms, with algorithm tuning, and with ensembles.
We have done some basic data improvements. After trying several algorithms, such as random forest, logistic regression and ranger, I finally built a model using an ensemble with caret and caretEnsemble.
#Use the selected features (11 predictors) in the predictions, then split the data into train and test sets.
predictors <- c("Pclass","Sex","Parch","Embarked","Familycount","Title","Age","CabinLevel","Fare","Ticketcategory","Familyid_num")
selectedData <- finalData[,predictors]
selectedData <- cbind(selectedData,finalData$Survived)
names(selectedData)[12] <- "Survived"
#Splitting data
anyNA(selectedData)
## [1] FALSE
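# Split on PassengerId rather than row position: the cabin-level "T" row removed earlier shifted every later row up by one
trainSet <- selectedData[finalData$PassengerId <= 891,]
testSet <- selectedData[finalData$PassengerId > 891,]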
str(trainSet)
## 'data.frame': 891 obs. of 12 variables:
## $ Pclass : Factor w/ 3 levels "One","Three",..: 2 1 2 1 2 2 1 2 2 3 ...
## $ Sex : Factor w/ 2 levels "0","1": 1 2 2 2 1 1 1 1 2 2 ...
## $ Parch : int 0 0 0 0 0 0 0 1 2 0 ...
## $ Embarked : Factor w/ 3 levels "C","Q","S": 3 1 3 3 3 2 3 3 3 1 ...
## $ Familycount : num 2 2 1 2 1 1 1 5 3 2 ...
## $ Title : Factor w/ 8 levels " Dr"," Lady",..: 5 6 4 6 5 5 5 3 6 6 ...
## $ Age : num 22 38 26 35 35 63 54 2 27 14 ...
## $ CabinLevel : Factor w/ 8 levels "A","B","C","D",..: 8 3 8 3 8 8 5 8 8 8 ...
## $ Fare : num 7.25 71.28 7.92 53.1 8.05 ...
## $ Ticketcategory: Factor w/ 51 levels "A.","A./5.","A.5.",..: 6 23 44 51 51 51 51 51 51 51 ...
## $ Familyid_num : num 13 26 215 39 215 215 215 190 149 81 ...
## $ Survived : Factor w/ 2 levels "N","Y": 1 2 2 2 1 1 1 1 2 2 ...
str(testSet)
## 'data.frame': 418 obs. of 12 variables:
## $ Pclass : Factor w/ 3 levels "One","Three",..: 2 3 2 2 2 2 3 2 2 2 ...
## $ Sex : Factor w/ 2 levels "0","1": 2 1 1 2 1 2 1 2 1 1 ...
## $ Parch : int 0 0 0 1 0 0 1 0 0 0 ...
## $ Embarked : Factor w/ 3 levels "C","Q","S": 3 2 3 3 3 2 3 1 3 3 ...
## $ Familycount : num 2 1 1 3 1 1 3 1 3 1 ...
## $ Title : Factor w/ 8 levels " Dr"," Lady",..: 6 5 5 6 5 4 5 6 5 5 ...
## $ Age : num 47 62 27 22 14 30 26 18 21 19 ...
## $ CabinLevel : Factor w/ 8 levels "A","B","C","D",..: 8 8 8 8 8 8 8 8 8 8 ...
## $ Fare : num 7 9.69 8.66 12.29 9.22 ...
## $ Ticketcategory: Factor w/ 51 levels "A.","A./5.","A.5.",..: 51 51 51 51 51 51 51 51 4 51 ...
## $ Familyid_num : num 114 215 215 207 215 215 124 215 132 215 ...
## $ Survived : Factor w/ 2 levels "N","Y": 1 1 1 1 1 1 1 1 1 1 ...
Stacking: In stacking, multiple layers of machine learning models are placed one over another. Each model passes its predictions to the model in the layer above it, and the top layer model makes its decision based on the outputs of the models in the layers below it. Two criteria must be fulfilled: the individual models should be reasonably accurate, and their predictions should not be highly correlated with one another. If the predictions are highly correlated, then combining the three base models used here might not give better results than the individual models.
STEPS for STACKING
1. Train the individual base layer models on the training data.
2. Predict using each base layer model for the training data and the test data.
3. Train the top layer model on the predictions that the bottom layer models made on the training data.
4. Finally, predict using the top layer model with the predictions that the bottom layer models made for the test data.
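To make the four steps concrete, here is a minimal base R sketch on simulated data. Everything in it is an assumption for illustration: the simulated data, the use of logistic regression (glm) for both the base and the top layer, and the variable names. It also makes the training-data predictions out-of-fold, so that the top layer never sees a prediction made on a row the base model was fitted on; that detail is my own choice and is not spelled out in the steps above.

```r
set.seed(42)

# Simulated binary outcome with two predictors (not Titanic data)
n <- 200
x1 <- rnorm(n)
x2 <- rnorm(n)
y <- rbinom(n, 1, plogis(1.5 * x1 - x2))
dat <- data.frame(x1, x2, y)
train <- dat[1:150, ]
test <- dat[151:200, ]

# Base layer: two deliberately different logistic models (hypothetical choices)
base_formulas <- list(m1 = y ~ x1, m2 = y ~ x2)

# Steps 1 and 2 (training data): out-of-fold predictions from each base model
k <- 5
folds <- sample(rep(1:k, length.out = nrow(train)))
oof <- matrix(NA_real_, nrow(train), length(base_formulas),
              dimnames = list(NULL, names(base_formulas)))
for (f in 1:k) {
  held <- folds == f
  for (m in names(base_formulas)) {
    fit <- glm(base_formulas[[m]], data = train[!held, ], family = binomial)
    oof[held, m] <- predict(fit, train[held, ], type = "response")
  }
}

# Steps 1 and 2 (test data): refit each base model on all training rows
test_preds <- sapply(base_formulas, function(fm) {
  fit <- glm(fm, data = train, family = binomial)
  predict(fit, test, type = "response")
})

# Inter-model correlation of the base predictions (lower is better for stacking)
print(cor(oof))

# Step 3: train the top layer model on the base layer predictions
meta_train <- data.frame(oof, y = train$y)
meta <- glm(y ~ m1 + m2, data = meta_train, family = binomial)

# Step 4: predict the test outcomes from the base layer test predictions
stack_prob <- predict(meta, data.frame(test_preds), type = "response")
stack_class <- as.integer(stack_prob > 0.5)
print(mean(stack_class == test$y))
```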
library(caret)
library(caretEnsemble)
library(rpart)
fitControl <- trainControl(method="repeatedcv", number=10, repeats=10, savePredictions=TRUE,classProbs=TRUE)
set.seed(1234)
algorithmList <- c( 'rpart','gbm','treebag')
models <- caretList(trainSet[,1:11],trainSet$Survived , trControl=fitControl, methodList=algorithmList)
results <- resamples(models)
summary(results)
dotplot(results)
# correlation between results
modelCor(results)
splom(results)
# stack using gbm
stackControl <- trainControl(method="repeatedcv", number=10, repeats=3, savePredictions=TRUE, classProbs=TRUE)
set.seed(1234)
stack.gbm <- caretStack(models, method="gbm", metric="Accuracy", trControl=stackControl)
print(stack.gbm)
stack1Pred <- predict(stack.gbm,testSet)
Pred <- ifelse(stack1Pred == "Y", 1,0)
out <- data.frame(PassengerId=test$PassengerId,Survived=Pred,row.names=NULL)
write.csv(x=out,file="submission_ensemble.csv",row.names=FALSE,quote=FALSE)
SCORE
Thank you for taking the time to read through this.