The sinking of the Titanic is one of the most infamous shipwrecks in history.
On April 15, 1912, during her maiden voyage, the widely considered “unsinkable” RMS Titanic sank after colliding with an iceberg. Unfortunately, there weren’t enough lifeboats for everyone onboard, resulting in the death of 1502 out of 2224 passengers and crew.
While there was some element of luck involved in surviving, it seems some groups of people were more likely to survive than others.
In this challenge, we ask you to build a predictive model that answers the question: “What sorts of people were more likely to survive?” using passenger data (i.e., name, age, gender, socio-economic class, etc.).
Given a training set of samples labeling passengers as having survived or not survived the Titanic disaster, can our model determine, for a test dataset that contains no survival information, whether each passenger in that test dataset survived?
We may also want to develop some early understanding about the domain of our problem.
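The functions used throughout this write-up come from several packages; based on the calls that appear later (read_csv, mutate, replace_na, dummyVars, initial_split, ggplot, prediction), the session presumably begins with something like:
library(readr)     # read_csv()
library(dplyr)     # mutate(), select(), bind_rows()
library(tidyr)     # replace_na()
library(caret)     # dummyVars(), preProcess(), confusionMatrix()
library(rsample)   # initial_split(), training(), testing()
library(ggplot2)   # visualizations
library(GGally)    # ggcorr()
library(ROCR)      # prediction(), performance()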
train <- read_csv("data_input/train.csv")
test <- read_csv("data_input/test.csv")
solution <- read_csv("data_input/solution.csv")
glimpse(train)
#> Observations: 891
#> Variables: 12
#> $ PassengerId <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, ...
#> $ Survived <dbl> 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 1, 0...
#> $ Pclass <dbl> 3, 1, 3, 1, 3, 3, 1, 3, 3, 2, 3, 1, 3, 3, 3, 2, 3, 2, 3...
#> $ Name <chr> "Braund, Mr. Owen Harris", "Cumings, Mrs. John Bradley ...
#> $ Sex <chr> "male", "female", "female", "female", "male", "male", "...
#> $ Age <dbl> 22, 38, 26, 35, 35, NA, 54, 2, 27, 14, 4, 58, 20, 39, 1...
#> $ SibSp <dbl> 1, 1, 0, 1, 0, 0, 0, 3, 0, 1, 1, 0, 0, 1, 0, 0, 4, 0, 1...
#> $ Parch <dbl> 0, 0, 0, 0, 0, 0, 0, 1, 2, 0, 1, 0, 0, 5, 0, 0, 1, 0, 0...
#> $ Ticket <chr> "A/5 21171", "PC 17599", "STON/O2. 3101282", "113803", ...
#> $ Fare <dbl> 7.2500, 71.2833, 7.9250, 53.1000, 8.0500, 8.4583, 51.86...
#> $ Cabin <chr> NA, "C85", NA, "C123", NA, NA, "E46", NA, NA, NA, "G6",...
#> $ Embarked <chr> "S", "C", "S", "S", "S", "Q", "S", "S", "S", "C", "S", ...
summary(train)
#> PassengerId Survived Pclass Name
#> Min. : 1.0 Min. :0.0000 Min. :1.000 Length:891
#> 1st Qu.:223.5 1st Qu.:0.0000 1st Qu.:2.000 Class :character
#> Median :446.0 Median :0.0000 Median :3.000 Mode :character
#> Mean :446.0 Mean :0.3838 Mean :2.309
#> 3rd Qu.:668.5 3rd Qu.:1.0000 3rd Qu.:3.000
#> Max. :891.0 Max. :1.0000 Max. :3.000
#>
#> Sex Age SibSp Parch
#> Length:891 Min. : 0.42 Min. :0.000 Min. :0.0000
#> Class :character 1st Qu.:20.12 1st Qu.:0.000 1st Qu.:0.0000
#> Mode :character Median :28.00 Median :0.000 Median :0.0000
#> Mean :29.70 Mean :0.523 Mean :0.3816
#> 3rd Qu.:38.00 3rd Qu.:1.000 3rd Qu.:0.0000
#> Max. :80.00 Max. :8.000 Max. :6.0000
#> NA's :177
#> Ticket Fare Cabin Embarked
#> Length:891 Min. : 0.00 Length:891 Length:891
#> Class :character 1st Qu.: 7.91 Class :character Class :character
#> Mode :character Median : 14.45 Mode :character Mode :character
#> Mean : 32.20
#> 3rd Qu.: 31.00
#> Max. :512.33
#>
Data Explanation:
After loading the datasets, we can combine the train and test data by creating a new Survived column on the test data. The Kaggle competition test set does not include the survival information, so we fill this column with NA values and store the combined data in an object called combine.
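The combining step itself is not shown in the original; a minimal sketch, assuming dplyr's bind_rows():
test$Survived <- NA                 # test set carries no survival labels
combine <- bind_rows(train, test)
dim(combine)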
#> [1] 1309 12
combine <- combine %>%
  select(-c(1, 4, 9)) %>%                  # drop PassengerId, Name, and Ticket
  mutate(Survived = as.factor(Survived)) %>%
  mutate(Pclass = as.factor(Pclass)) %>%
  mutate_if(is.character, as.factor)       # remaining character columns become factors
levels(combine$Pclass)
#> [1] "1" "2" "3"
First of all, we can check the overall survival rate in the training dataset.
prop.table(table(train$Survived))
#>
#> 0 1
#> 0.6161616 0.3838384
colSums(is.na(combine))
#> Survived Pclass Sex Age SibSp Parch Fare Cabin
#> 418 0 0 263 0 0 1 1014
#> Embarked
#> 2
These will require correcting.
In the training dataset, Cabin, Age, and Embarked contain null values, in that order of frequency; in the test dataset, Cabin, Age, and Fare are incomplete. Replace the missing Fare and Embarked values with the mean and the mode:
summary(combine$Fare)
#> Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
#> 0.000 7.896 14.454 33.295 31.275 512.329 1
summary(combine$Embarked)
#> C Q S NA's
#> 270 123 914 2
summary(combine$Fare[combine$Pclass == "3"])   # presumably restricted to 3rd class; the values match
#> Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
#> 0.00 7.75 8.05 13.30 15.25 69.55 1
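The hard-coded values in the next chunk (13.3 and "S") can also be derived from the summaries above. A sketch, under the assumption that the fare figure is the mean 3rd-class fare:
fare_mean     <- mean(combine$Fare[combine$Pclass == "3"], na.rm = TRUE)  # ~13.3
embarked_mode <- names(which.max(table(combine$Embarked)))                # "S"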
combine <- combine %>%
  mutate(Fare = replace_na(data = Fare, replace = 13.3),
         Embarked = replace_na(data = Embarked, replace = "S"))
NA Handling in Age column
There are numerous ways to deal with missing values. Imputing the mean of the feature can sometimes be an easy way to deal with it, but may lead to some bias in the model.
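For contrast, the naive mean imputation mentioned above would be a one-liner (shown for illustration only, not run here):
# Not run: impute every missing Age with the overall mean
combine$Age[is.na(combine$Age)] <- mean(combine$Age, na.rm = TRUE)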
Another way to impute missing values could be using another type of model, such as linear regression, to predict what the missing Age values would be based on the other features in the dataset.
summary(combine$Age)
#> Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
#> 0.17 21.00 28.00 29.88 39.00 80.00 263
combine$MissingAge <- ifelse(is.na(combine$Age), "Y", "N")
combine$MissingAge <- as.factor(combine$MissingAge)
# Caret supports a number of mechanisms for imputing (i.e., predicting) missing values.
# Leverage bagged decision trees to impute missing values for the Age feature.
# First, transform all features to dummy variables.
dummy.vars <- dummyVars(~ ., data = combine[, -c(1,8)])     # exclude Survived and Cabin
combine.dummy <- predict(dummy.vars, combine[, -c(1,8)])
# Now, impute!
pre.process <- preProcess(combine.dummy, method = "bagImpute")
imputed.data <- predict(pre.process, combine.dummy)
combine$Age <- imputed.data[, 6]   # column 6 holds the imputed Age
combine <- combine %>%
  mutate(Age = round(Age))
summary(combine$Age)
#> Min. 1st Qu. Median Mean 3rd Qu. Max.
#> 0.0 22.0 28.0 29.7 37.0 80.0
NA Handling in Cabin column
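The output below includes a FamilySize column, but the code that created it does not appear in this write-up. Judging from the values in the later glimpse() output (always SibSp + Parch + 1), it was presumably:
combine <- combine %>%
  mutate(FamilySize = SibSp + Parch + 1)   # inferred: self plus siblings/spouses plus parents/children
colSums(is.na(combine))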
#> Survived Pclass Sex Age SibSp Parch Fare
#> 418 0 0 0 0 0 0
#> Cabin Embarked MissingAge FamilySize
#> 1014 0 0 0
mean(is.na(train$Cabin))
#> [1] 0.7710438
Let's make this even clearer, particularly for Cabin: about 77% of the Cabin values in the training dataset are missing. (Recall that the overall survival rate in the training set is 38%.)
I created a new variable called CabinL storing the first letter of each cabin. Based on the Titanic deck plans, I saw that cabins were named according to the deck (floor) of the ship. I thought this could be useful.
combine$CabinL <- NA
combine$CabinL <- factor(substr(combine$Cabin, start=1, stop=1))
## Check if it worked
table(combine$CabinL)
#>
#> A B C D E F G T
#> 22 65 94 46 41 21 5 1
table(is.na(combine$CabinL))
#>
#> FALSE TRUE
#> 295 1014
Using the Titanic deck plan, I assigned each cabin letter its relative floor: cabins beginning with A were on the uppermost part of the ship, hence the value 1, while G cabins were the lowest on the ship and received the value 7. The single cabin with letter T, which I could not find in the diagram, was assigned floor 0: that passenger was traveling in 1st class, and 1st class occupied the upper decks according to the diagram. These floor numbers were stored in a new variable called CabinFloor. As an integer, I figured I would have better success estimating CabinFloor than I would using unordered factor levels (i.e. letters).
combine$CabinFloor <- NA
combine$CabinFloor <- ifelse(combine$CabinL=="A", 1,
                      ifelse(combine$CabinL=="B", 2,
                      ifelse(combine$CabinL=="C", 3,
                      ifelse(combine$CabinL=="D", 4,
                      ifelse(combine$CabinL=="E", 5,
                      ifelse(combine$CabinL=="F", 6,
                      ifelse(combine$CabinL=="G", 7,
                      ifelse(combine$CabinL=="T", 0, NA))))))))
combine$CabinFloor <- as.integer(combine$CabinFloor)
We can use ggcorr to determine which numeric variables are most correlated with CabinFloor. Before computing the correlations, we change the Survived and Pclass columns to numeric.
combine <- combine %>%
  mutate(Survived = as.numeric(Survived)) %>%
  mutate(Pclass = as.numeric(Pclass))
glimpse(combine)
#> Observations: 1,309
#> Variables: 13
#> $ Survived <dbl> 1, 2, 2, 2, 1, 1, 1, 1, 2, 2, 2, 2, 1, 1, 1, 2, 1, 2, 1,...
#> $ Pclass <dbl> 3, 1, 3, 1, 3, 3, 1, 3, 3, 2, 3, 1, 3, 3, 3, 2, 3, 2, 3,...
#> $ Sex <fct> male, female, female, female, male, male, male, male, fe...
#> $ Age <dbl> 22, 38, 26, 35, 35, 28, 54, 2, 27, 14, 4, 58, 20, 39, 14...
#> $ SibSp <dbl> 1, 1, 0, 1, 0, 0, 0, 3, 0, 1, 1, 0, 0, 1, 0, 0, 4, 0, 1,...
#> $ Parch <dbl> 0, 0, 0, 0, 0, 0, 0, 1, 2, 0, 1, 0, 0, 5, 0, 0, 1, 0, 0,...
#> $ Fare <dbl> 7.2500, 71.2833, 7.9250, 53.1000, 8.0500, 8.4583, 51.862...
#> $ Cabin <fct> NA, C85, NA, C123, NA, NA, E46, NA, NA, NA, G6, C103, NA...
#> $ Embarked <fct> S, C, S, S, S, Q, S, S, S, C, S, S, S, S, S, S, Q, S, S,...
#> $ MissingAge <fct> N, N, N, N, N, Y, N, N, N, N, N, N, N, N, N, N, N, Y, N,...
#> $ FamilySize <dbl> 2, 2, 1, 2, 1, 1, 1, 5, 3, 2, 3, 1, 1, 7, 1, 1, 6, 1, 2,...
#> $ CabinL <fct> NA, C, NA, C, NA, NA, E, NA, NA, NA, G, C, NA, NA, NA, N...
#> $ CabinFloor <int> NA, 3, NA, 3, NA, NA, 5, NA, NA, NA, 7, 3, NA, NA, NA, N...
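The correlation plot itself is not reproduced in this write-up. A minimal call that would generate it, assuming GGally's ggcorr():
ggcorr(select_if(combine, is.numeric), label = TRUE)   # correlation heatmap of the numeric columns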
As the correlation plot shows, CabinFloor correlated most strongly with Pclass, so we can impute the missing CabinFloor values using the per-Pclass mean and the ave() function.
combine$CabinFloor[is.na(combine$CabinFloor)] <- with(combine, ave(CabinFloor, Pclass,
    FUN = function(x) mean(x, na.rm = TRUE)))[is.na(combine$CabinFloor)]
## Check if it worked
summary(combine$CabinFloor)
#> Min. 1st Qu. Median Mean 3rd Qu. Max.
#> 0.000 5.000 6.125 5.176 6.125 7.000
Next, we can discretize CabinFloor into upper-half and lower-half values. We could create a new column for this, or simply overwrite the CabinFloor column.
combine$CabinFloor <- ifelse(combine$CabinFloor < 4, "Upper Half", "Lower Half")
combine$CabinFloor <- as.factor(combine$CabinFloor)
table(combine$CabinFloor)
#>
#> Lower Half Upper Half
#> 1060 249
Next, we can re-split our data into train and test datasets. We also remove the columns we will not use to build our model.
combine <- combine %>%
  mutate(Survived = as.factor(ifelse(Survived == 1, "0", "1"))) %>%   # map 1/2 back to factor levels 0/1
  mutate(Pclass = as.factor(Pclass))
glimpse(combine)
#> Observations: 1,309
#> Variables: 13
#> $ Survived <fct> 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 1, 0,...
#> $ Pclass <fct> 3, 1, 3, 1, 3, 3, 1, 3, 3, 2, 3, 1, 3, 3, 3, 2, 3, 2, 3,...
#> $ Sex <fct> male, female, female, female, male, male, male, male, fe...
#> $ Age <dbl> 22, 38, 26, 35, 35, 28, 54, 2, 27, 14, 4, 58, 20, 39, 14...
#> $ SibSp <dbl> 1, 1, 0, 1, 0, 0, 0, 3, 0, 1, 1, 0, 0, 1, 0, 0, 4, 0, 1,...
#> $ Parch <dbl> 0, 0, 0, 0, 0, 0, 0, 1, 2, 0, 1, 0, 0, 5, 0, 0, 1, 0, 0,...
#> $ Fare <dbl> 7.2500, 71.2833, 7.9250, 53.1000, 8.0500, 8.4583, 51.862...
#> $ Cabin <fct> NA, C85, NA, C123, NA, NA, E46, NA, NA, NA, G6, C103, NA...
#> $ Embarked <fct> S, C, S, S, S, Q, S, S, S, C, S, S, S, S, S, S, Q, S, S,...
#> $ MissingAge <fct> N, N, N, N, N, Y, N, N, N, N, N, N, N, N, N, N, N, Y, N,...
#> $ FamilySize <dbl> 2, 2, 1, 2, 1, 1, 1, 5, 3, 2, 3, 1, 1, 7, 1, 1, 6, 1, 2,...
#> $ CabinL <fct> NA, C, NA, C, NA, NA, E, NA, NA, NA, G, C, NA, NA, NA, N...
#> $ CabinFloor <fct> Lower Half, Upper Half, Lower Half, Upper Half, Lower Ha...
colSums(is.na(combine))
#> Survived Pclass Sex Age SibSp Parch Fare
#> 418 0 0 0 0 0 0
#> Cabin Embarked MissingAge FamilySize CabinL CabinFloor
#> 1014 0 0 0 1014 0
train <- combine[1:891, -c(8, 10, 12)]     # drop Cabin, MissingAge, CabinL
test  <- combine[892:1309, -c(8, 10, 12)]
colSums(is.na(train))
#> Survived Pclass Sex Age SibSp Parch Fare
#> 0 0 0 0 0 0 0
#> Embarked FamilySize CabinFloor
#> 0 0 0
We can explore the train data using the ggplot2 package for visualizations.
names(train)
#> [1] "Survived" "Pclass" "Sex" "Age" "SibSp"
#> [6] "Parch" "Fare" "Embarked" "FamilySize" "CabinFloor"
Survival Rate by Pclass
ggplot(train, aes(x = Survived)) +
  geom_bar(aes(fill = Pclass), position = "dodge") +
  labs(title = "Survival by Pclass",
       y = "Frequency") +
  theme_minimal()
Passengers in 3rd class were much more likely to die, while passengers in 1st class were more likely to survive.
Survival Rate by Gender
ggplot(train, aes(x = Survived)) +
  geom_bar(aes(fill = Sex), position = "fill") +
  labs(title = "Survival by Gender",
       y = "Percentage") +
  theme_minimal()
Females were much more likely to survive than their male counterparts.
Survival Rate by Age
ggplot(train, aes(x = Age, y = as.numeric(Survived))) +
  geom_point(alpha = 0.1) +
  geom_smooth(color = "red", fill = "#69b3a2", se = TRUE) +
  scale_y_continuous(breaks = seq(1, 2, 0.5), labels = c(0, 0.5, 1)) +
  labs(title = "Survival by Age",
       y = "Survived") +
  theme_minimal()
Older passengers were less likely to survive.
Survival Rate by Fare
ggplot(train, aes(x = Fare, y = as.numeric(Survived))) +
  geom_point(alpha = 0.1) +
  geom_smooth(color = "red", fill = "#69b3a2", se = TRUE) +
  scale_y_continuous(breaks = seq(1, 2.5, 0.5), labels = c(0, 0.5, 1, 1.5)) +
  labs(title = "Survival by Fare",
       y = "Survived") +
  theme_minimal()
Passengers who paid more for their fare were more likely to survive.
Survival Rate by Embarked
ggplot(train, aes(x = Survived)) +
  geom_bar(aes(fill = Embarked), position = "stack") +
  labs(title = "Survival by Embarked",
       y = "Frequency") +
  theme_minimal()
Passengers who embarked at port “S” were less likely to survive, but then again, “S” was the mode for these data.
Survival Rate by Family Size
ggplot(train, aes(x = FamilySize, y = as.numeric(Survived))) +
  geom_point(alpha = 0.1) +
  geom_smooth(color = "red", fill = "#69b3a2", se = TRUE) +
  scale_y_continuous(breaks = seq(1, 2, 0.5), labels = c(0, 0.5, 1)) +
  labs(title = "Survival by Family Size",
       y = "Survived") +
  theme_minimal()
Passengers with more siblings, spouses, parents, and children on board were less likely to survive. However, for family sizes between 1 and approximately 4, there was an increase in passenger survival.
Survival Rate by CabinFloor
ggplot(train, aes(x = Survived)) +
  geom_bar(aes(fill = CabinFloor), position = "dodge") +
  labs(title = "Survival by Cabin Floor",
       y = "Frequency") +
  theme_minimal()
Passengers with a cabin in the lower half of the ship were less likely to survive than passengers in the upper half.
glimpse(train)
#> Observations: 891
#> Variables: 10
#> $ Survived <fct> 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 1, 0,...
#> $ Pclass <fct> 3, 1, 3, 1, 3, 3, 1, 3, 3, 2, 3, 1, 3, 3, 3, 2, 3, 2, 3,...
#> $ Sex <fct> male, female, female, female, male, male, male, male, fe...
#> $ Age <dbl> 22, 38, 26, 35, 35, 28, 54, 2, 27, 14, 4, 58, 20, 39, 14...
#> $ SibSp <dbl> 1, 1, 0, 1, 0, 0, 0, 3, 0, 1, 1, 0, 0, 1, 0, 0, 4, 0, 1,...
#> $ Parch <dbl> 0, 0, 0, 0, 0, 0, 0, 1, 2, 0, 1, 0, 0, 5, 0, 0, 1, 0, 0,...
#> $ Fare <dbl> 7.2500, 71.2833, 7.9250, 53.1000, 8.0500, 8.4583, 51.862...
#> $ Embarked <fct> S, C, S, S, S, Q, S, S, S, C, S, S, S, S, S, S, Q, S, S,...
#> $ FamilySize <dbl> 2, 2, 1, 2, 1, 1, 1, 5, 3, 2, 3, 1, 1, 7, 1, 1, 6, 1, 2,...
#> $ CabinFloor <fct> Lower Half, Upper Half, Lower Half, Upper Half, Lower Ha...
prop.table(table(train$Survived))
#>
#> 0 1
#> 0.6161616 0.3838384
set.seed(100)
idx <- initial_split(data = train, prop = 0.8, strata = "Survived")
titanic_train <- training(idx)
titanic_test <- testing(idx)
# Split labels
train_label <- titanic_train$Survived
test_label <- titanic_test$Survived
# Check class proportions in the split data
prop.table(table(titanic_train$Survived))
#>
#> 0 1
#> 0.6162465 0.3837535
prop.table(table(titanic_test$Survived))
#>
#> 0 1
#> 0.6158192 0.3841808
model_logistic_base <- glm(formula = Survived ~ ., data = titanic_train, family = "binomial")
summary(model_logistic_base)
#>
#> Call:
#> glm(formula = Survived ~ ., family = "binomial", data = titanic_train)
#>
#> Deviance Residuals:
#> Min 1Q Median 3Q Max
#> -2.5957 -0.5999 -0.4046 0.5895 2.5191
#>
#> Coefficients: (1 not defined because of singularities)
#> Estimate Std. Error z value Pr(>|z|)
#> (Intercept) 5.318657 0.689120 7.718 1.18e-14 ***
#> Pclass2 -1.615219 0.478560 -3.375 0.000738 ***
#> Pclass3 -2.942136 0.484207 -6.076 1.23e-09 ***
#> Sexmale -2.789570 0.231299 -12.060 < 2e-16 ***
#> Age -0.044567 0.009206 -4.841 1.29e-06 ***
#> SibSp -0.336739 0.119984 -2.807 0.005008 **
#> Parch -0.135076 0.135273 -0.999 0.318016
#> Fare 0.001534 0.002541 0.604 0.546131
#> EmbarkedQ -0.362026 0.424061 -0.854 0.393264
#> EmbarkedS -0.723907 0.276873 -2.615 0.008934 **
#> FamilySize NA NA NA NA
#> CabinFloorUpper Half -0.797421 0.465473 -1.713 0.086687 .
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> (Dispersion parameter for binomial family taken to be 1)
#>
#> Null deviance: 950.86 on 713 degrees of freedom
#> Residual deviance: 619.06 on 703 degrees of freedom
#> AIC: 641.06
#>
#> Number of Fisher Scoring iterations: 5
step(model_logistic_base, direction = "backward")
#> Start: AIC=641.06
#> Survived ~ Pclass + Sex + Age + SibSp + Parch + Fare + Embarked +
#> FamilySize + CabinFloor
#>
#>
#> Step: AIC=641.06
#> Survived ~ Pclass + Sex + Age + SibSp + Parch + Fare + Embarked +
#> CabinFloor
#>
#> Df Deviance AIC
#> - Fare 1 619.44 639.44
#> - Parch 1 620.09 640.09
#> <none> 619.06 641.06
#> - CabinFloor 1 622.06 642.06
#> - Embarked 2 626.26 644.26
#> - SibSp 1 628.39 648.39
#> - Age 1 644.80 664.80
#> - Pclass 2 671.30 689.30
#> - Sex 1 805.45 825.45
#>
#> Step: AIC=639.44
#> Survived ~ Pclass + Sex + Age + SibSp + Parch + Embarked + CabinFloor
#>
#> Df Deviance AIC
#> - Parch 1 620.26 638.26
#> <none> 619.44 639.44
#> - CabinFloor 1 622.18 640.18
#> - Embarked 2 627.28 643.28
#> - SibSp 1 628.47 646.47
#> - Age 1 645.87 663.87
#> - Pclass 2 678.23 694.23
#> - Sex 1 807.94 825.94
#>
#> Step: AIC=638.26
#> Survived ~ Pclass + Sex + Age + SibSp + Embarked + CabinFloor
#>
#> Df Deviance AIC
#> <none> 620.26 638.26
#> - CabinFloor 1 623.02 639.02
#> - Embarked 2 628.30 642.30
#> - SibSp 1 632.27 648.27
#> - Age 1 646.24 662.24
#> - Pclass 2 679.18 693.18
#> - Sex 1 812.58 828.58
#>
#> Call: glm(formula = Survived ~ Pclass + Sex + Age + SibSp + Embarked +
#> CabinFloor, family = "binomial", data = titanic_train)
#>
#> Coefficients:
#> (Intercept) Pclass2 Pclass3
#> 5.3508 -1.6655 -3.0146
#> Sexmale Age SibSp
#> -2.7512 -0.0444 -0.3602
#> EmbarkedQ EmbarkedS CabinFloorUpper Half
#> -0.3459 -0.7493 -0.7538
#>
#> Degrees of Freedom: 713 Total (i.e. Null); 705 Residual
#> Null Deviance: 950.9
#> Residual Deviance: 620.3 AIC: 638.3
model_logistic <- glm(formula = Survived ~ Pclass + Sex + Age + SibSp + Embarked +
                        CabinFloor, family = "binomial", data = titanic_train)
summary(model_logistic)
#>
#> Call:
#> glm(formula = Survived ~ Pclass + Sex + Age + SibSp + Embarked +
#> CabinFloor, family = "binomial", data = titanic_train)
#>
#> Deviance Residuals:
#> Min 1Q Median 3Q Max
#> -2.6198 -0.6154 -0.4017 0.6023 2.5313
#>
#> Coefficients:
#> Estimate Std. Error z value Pr(>|z|)
#> (Intercept) 5.35084 0.65580 8.159 3.37e-16 ***
#> Pclass2 -1.66551 0.46962 -3.547 0.00039 ***
#> Pclass3 -3.01458 0.46833 -6.437 1.22e-10 ***
#> Sexmale -2.75115 0.22445 -12.257 < 2e-16 ***
#> Age -0.04440 0.00913 -4.864 1.15e-06 ***
#> SibSp -0.36024 0.11536 -3.123 0.00179 **
#> EmbarkedQ -0.34588 0.42035 -0.823 0.41060
#> EmbarkedS -0.74926 0.27368 -2.738 0.00619 **
#> CabinFloorUpper Half -0.75375 0.45779 -1.647 0.09966 .
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> (Dispersion parameter for binomial family taken to be 1)
#>
#> Null deviance: 950.86 on 713 degrees of freedom
#> Residual deviance: 620.26 on 705 degrees of freedom
#> AIC: 638.26
#>
#> Number of Fisher Scoring iterations: 5
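The code that produced the confusion matrix below does not appear in the write-up. A minimal sketch, assuming a 0.5 probability cutoff and caret's confusionMatrix(); pred_logistic, the vector of predicted probabilities, is also reused for the ROC curve later:
pred_logistic <- predict(model_logistic, newdata = titanic_test, type = "response")
pred_class <- as.factor(ifelse(pred_logistic > 0.5, "1", "0"))
Log_eval <- confusionMatrix(data = pred_class, reference = test_label, positive = "1")
Log_eval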
#> Confusion Matrix and Statistics
#>
#> Reference
#> Prediction 0 1
#> 0 100 23
#> 1 9 45
#>
#> Accuracy : 0.8192
#> 95% CI : (0.7545, 0.8729)
#> No Information Rate : 0.6158
#> P-Value [Acc > NIR] : 3.784e-09
#>
#> Kappa : 0.6025
#>
#> Mcnemar's Test P-Value : 0.02156
#>
#> Sensitivity : 0.6618
#> Specificity : 0.9174
#> Pos Pred Value : 0.8333
#> Neg Pred Value : 0.8130
#> Prevalence : 0.3842
#> Detection Rate : 0.2542
#> Detection Prevalence : 0.3051
#> Balanced Accuracy : 0.7896
#>
#> 'Positive' Class : 1
#>
AUC (Area Under the ROC Curve): the ROC curve is a performance measurement for classification problems across threshold settings, and the AUC summarizes it in a single number.
pred <- prediction(predictions = pred_logistic, labels = titanic_test$Survived)
perf <- performance(prediction.obj = pred, measure = "tpr", x.measure = "fpr")
plot(perf)
performance(prediction.obj = pred, measure = "auc")@y.values[[1]]
#> [1] 0.8505127
For the k-NN model, the predictors are converted to dummy variables and scaled; the class checks below show the dummyVars object and the resulting data frame.
#> [1] "dummyVars"
#> [1] "data.frame"
After scaling the train and test data, we choose k. A common heuristic is the square root of the number of training observations, sqrt(714):
#> [1] 26.72078
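The k-NN fitting code is also missing from the original. A sketch of what it plausibly looked like, assuming class::knn(), hypothetical object names for the encoded matrices, and k = 27 (the heuristic above, rounded):
library(class)   # knn()
# Dummy-encode the predictors (Survived is column 1), identically for train and test
dummy_knn <- dummyVars(~ ., data = titanic_train[, -1])
train_knn <- predict(dummy_knn, titanic_train[, -1])
test_knn  <- predict(dummy_knn, titanic_test[, -1])
# Scale the test set with the training set's centering/scaling parameters
train_scaled <- scale(train_knn)
test_scaled  <- scale(test_knn,
                      center = attr(train_scaled, "scaled:center"),
                      scale  = attr(train_scaled, "scaled:scale"))
pred_knn <- knn(train = train_scaled, test = test_scaled, cl = train_label, k = 27)
KNN_eval <- confusionMatrix(data = pred_knn, reference = test_label, positive = "1")
KNN_eval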
#> Confusion Matrix and Statistics
#>
#> Reference
#> Prediction 0 1
#> 0 103 26
#> 1 6 42
#>
#> Accuracy : 0.8192
#> 95% CI : (0.7545, 0.8729)
#> No Information Rate : 0.6158
#> P-Value [Acc > NIR] : 3.784e-09
#>
#> Kappa : 0.5955
#>
#> Mcnemar's Test P-Value : 0.0007829
#>
#> Sensitivity : 0.6176
#> Specificity : 0.9450
#> Pos Pred Value : 0.8750
#> Neg Pred Value : 0.7984
#> Prevalence : 0.3842
#> Detection Rate : 0.2373
#> Detection Prevalence : 0.2712
#> Balanced Accuracy : 0.7813
#>
#> 'Positive' Class : 1
#>
# Collect the logistic regression metrics into a small summary table
a <- as.data.frame(Log_eval$overall[[1]])
names(a) <- c("Value")
a$Metrics <- "Accuracy"
b <- as.data.frame(rbind(Log_eval$byClass[[1]], Log_eval$byClass[[2]],
                         Log_eval$byClass[[3]], Log_eval$byClass[[4]]))
names(b) <- c("Value")
b$Metrics <- c("Sensitivity", "Specificity", "Pos Pred Value", "Neg Pred Value")
c <- rbind(a, b)
c <- c[, c(2, 1)]
Log.Regression <- c
Log.Regression$Model <- "Logistic Regression"
# Collect the k-NN metrics in the same format as the logistic table
x <- as.data.frame(KNN_eval$overall[[1]])
names(x) <- c("Value")
x$Metrics <- "Accuracy"
y <- as.data.frame(rbind(KNN_eval$byClass[[1]], KNN_eval$byClass[[2]],
                         KNN_eval$byClass[[3]], KNN_eval$byClass[[4]]))
names(y) <- c("Value")
y$Metrics <- c("Sensitivity", "Specificity", "Pos Pred Value", "Neg Pred Value")
z <- rbind(x, y)
z <- z[, c(2, 1)]
KNN <- z
KNN$Model <- "K-Nearest Neighbour"
For the logistic model I used the backward stepwise method, while the k-NN model used all variables converted to dummy variables. Both models, logistic regression and k-NN, return the same accuracy of 0.8192: out of 100 passengers, the model predicts roughly 82 passengers' survival status correctly.
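With both tables in the same format, a side-by-side comparison is a single rbind away (a small addition not present in the original):
rbind(Log.Regression, KNN)   # stack the two metric tables for comparison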
The sensitivity and precision of the logistic regression model are 0.6618 and 0.8333, while the sensitivity and precision of the k-NN model are 0.6176 and 0.8750.
In this classification task, it is important for a model to have high accuracy so that it predicts passengers' survival status correctly and avoids misclassification. A model with high sensitivity is also preferable, since sensitivity describes the model's ability to correctly identify the positive class out of all truly positive cases.
From the above evaluation, the logistic model gave the same accuracy score and a higher sensitivity, making it the more appropriate model for predicting passengers' survival status.
Although logistic regression was picked as the best model in our case, it only achieves 0.8192 accuracy and 0.6618 sensitivity. Higher accuracy may well be attainable with other classification models and/or additional data-wrangling steps. Better still, build models with several different classification methods and compare them to find the one that best fits our data.
Thank you for reading!