library(tictoc)
tic()
start_time <- Sys.time() # Capture the current time
The Challenge
The sinking of the Titanic is one of the most infamous shipwrecks in history.
On April 15, 1912, during her maiden voyage, the widely considered “unsinkable” RMS Titanic sank after colliding with an iceberg. Unfortunately, there weren’t enough lifeboats for everyone onboard, resulting in the death of 1502 out of 2224 passengers and crew.
While there was some element of luck involved in surviving, it seems some groups of people were more likely to survive than others.
In this challenge, we ask you to build a predictive model that answers the question: “what sorts of people were more likely to survive?” using passenger data (i.e., name, age, gender, socio-economic class, etc.).
Overview
The data has been split into two groups:
training set (train.csv)
test set (test.csv)
The training set should be used to build your machine learning models. For the training set, we provide the outcome (also known as the “ground truth”) for each passenger. Your model will be based on “features” like passengers’ gender and class. You can also use feature engineering to create new features.
The test set should be used to see how well your model performs on unseen data. For the test set, we do not provide the ground truth for each passenger. It is your job to predict these outcomes. For each passenger in the test set, use the model you trained to predict whether or not they survived the sinking of the Titanic.
We also include gender_submission.csv, a set of predictions that assume all and only female passengers survive, as an example of what a submission file should look like.
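That baseline is easy to rebuild yourself. A minimal sketch, assuming the test set has already been read into a data frame named testdf, as done in the loading step below:
# Sketch: reconstruct the gender-only baseline that gender_submission.csv encodes
baseline <- data.frame(
  PassengerId = testdf$PassengerId,
  Survived = as.integer(testdf$Sex == "female")) # 1 = survived, 0 = did not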
Variable Notes
pclass: A proxy for socio-economic status (SES). 1st = Upper, 2nd = Middle, 3rd = Lower
age: Age is fractional if less than 1. If the age is estimated, it is in the form of xx.5
sibsp: The dataset defines family relations in this way…
Sibling = brother, sister, stepbrother, stepsister
Spouse = husband, wife (mistresses and fiancés were ignored)
parch: The dataset defines family relations in this way…
Parent = mother, father
Child = daughter, son, stepdaughter, stepson
Some children travelled only with a nanny, therefore parch=0 for them.
To download the data: https://www.kaggle.com/c/titanic/data
library(readr)
options(readr.num_columns = FALSE)
read_csv("C:/Users/Asus/Documents/R Pubs/Titanic/train.csv") -> traindf # Reading Training Data set
read_csv("C:/Users/Asus/Documents/R Pubs/Titanic/test.csv") -> testdf # Reading Testing Data set
read_csv("C:/Users/Asus/Documents/R Pubs/Titanic/gender_submission.csv") -> submission # Submission Template
Loading required Libraries
library(dplyr, warn.conflicts = F) # Data manipulation
library(Amelia) # Missing frequencies
library(mice) # Missing-data patterns and imputation
library(caret) # Model training and evaluation
library(ggplot2)
theme_set(theme_bw())
library(ggthemes)
library(corrplot)
library(ggcorrplot)
library(ggpubr)
traindf %>% head()
## # A tibble: 6 x 12
## PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin
## <dbl> <dbl> <dbl> <chr> <chr> <dbl> <dbl> <dbl> <chr> <dbl> <chr>
## 1 1 0 3 Braund~ male 22 1 0 A/5 2~ 7.25 <NA>
## 2 2 1 1 Cuming~ fema~ 38 1 0 PC 17~ 71.3 C85
## 3 3 1 3 Heikki~ fema~ 26 0 0 STON/~ 7.92 <NA>
## 4 4 1 1 Futrel~ fema~ 35 1 0 113803 53.1 C123
## 5 5 0 3 Allen,~ male 35 0 0 373450 8.05 <NA>
## 6 6 0 3 Moran,~ male NA 0 0 330877 8.46 <NA>
## # ... with 1 more variable: Embarked <chr>
testdf$Survived <- rep(NA, nrow(testdf))
rbind(traindf, testdf) -> full_data
as_tibble(full_data) -> full_data
full_data
## # A tibble: 1,309 x 12
## PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin
## <dbl> <dbl> <dbl> <chr> <chr> <dbl> <dbl> <dbl> <chr> <dbl> <chr>
## 1 1 0 3 Braun~ male 22 1 0 A/5 2~ 7.25 <NA>
## 2 2 1 1 Cumin~ fema~ 38 1 0 PC 17~ 71.3 C85
## 3 3 1 3 Heikk~ fema~ 26 0 0 STON/~ 7.92 <NA>
## 4 4 1 1 Futre~ fema~ 35 1 0 113803 53.1 C123
## 5 5 0 3 Allen~ male 35 0 0 373450 8.05 <NA>
## 6 6 0 3 Moran~ male NA 0 0 330877 8.46 <NA>
## 7 7 0 1 McCar~ male 54 0 0 17463 51.9 E46
## 8 8 0 3 Palss~ male 2 3 1 349909 21.1 <NA>
## 9 9 1 3 Johns~ fema~ 27 0 2 347742 11.1 <NA>
## 10 10 1 2 Nasse~ fema~ 14 1 0 237736 30.1 <NA>
## # ... with 1,299 more rows, and 1 more variable: Embarked <chr>
Missing Values
missmap(full_data,
        col = c("yellow", "steelblue")) # from the "Amelia" package
mice::md.pattern(full_data, rotate.names = T)
## PassengerId Pclass Name Sex SibSp Parch Ticket Fare Embarked Age Survived
## 183 1 1 1 1 1 1 1 1 1 1 1
## 529 1 1 1 1 1 1 1 1 1 1 1
## 87 1 1 1 1 1 1 1 1 1 1 0
## 244 1 1 1 1 1 1 1 1 1 1 0
## 19 1 1 1 1 1 1 1 1 1 0 1
## 158 1 1 1 1 1 1 1 1 1 0 1
## 4 1 1 1 1 1 1 1 1 1 0 0
## 82 1 1 1 1 1 1 1 1 1 0 0
## 2 1 1 1 1 1 1 1 1 0 1 1
## 1 1 1 1 1 1 1 1 0 1 1 0
## 0 0 0 0 0 0 0 1 2 263 418
## Cabin
## 183 1 0
## 529 0 1
## 87 1 1
## 244 0 2
## 19 1 1
## 158 0 2
## 4 1 2
## 82 0 3
## 2 1 1
## 1 0 3
## 1014 1698
full_data %>% select(-c(PassengerId, Name, Ticket, Cabin)) -> full_data
colSums(is.na(full_data))
## Survived Pclass Sex Age SibSp Parch Fare Embarked
## 418 0 0 263 0 0 1 2
par(lwd= 2)
hist(full_data$Fare,
col= "salmon2",
xlab= "Fare", breaks = 25,
main= "Histogram of Fare")
hist(full_data$Age,
lwd= 2,
border = "darkblue",
col= "skyblue", xlab= "Age",
main= "Histogram of Age")
Fare is positively (right-)skewed and needs a transformation (log, or binning into factors).
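As a quick illustration of the log option (this analysis takes the binning route below), a minimal sketch; log1p() is used here because some Fare values are zero:
log_fare <- log1p(full_data$Fare) # log(1 + Fare) maps 0 to 0 and compresses the right tail
hist(log_fare,
     col = "salmon2", xlab = "log(1 + Fare)",
     main = "Histogram of log-transformed Fare")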
Data Wrangling
Types of Variables
sapply(full_data, class) %>% data.frame() %>% `colnames<-`("Types")
## Types
## Survived numeric
## Pclass numeric
## Sex character
## Age numeric
## SibSp numeric
## Parch numeric
## Fare numeric
## Embarked character
Transforming Survived and Sex to factors.
full_data$Survived <- factor(full_data$Survived,
                             levels = c(0, 1),
                             labels = c("No 0", "Yes 1"))
full_data$Sex <- factor(full_data$Sex,
                        levels = c("female", "male"),
                        labels = c("female 0", "male 1"))
Creating a new variable FSize (Family Size), and removing SibSp and Parch.
full_data$FSize <- full_data$SibSp + full_data$Parch + 1
full_data %>% select(-c(SibSp, Parch)) -> full_data
Converting FSize to Factor.
table(full_data$FSize)
##
## 1 2 3 4 5 6 7 8 11
## 790 235 159 43 22 25 16 8 11
cut(full_data$FSize,
breaks = c(1,3,6,11),
labels = c("Low 1", "Medium 2", "Large 3"),
include.lowest = T) -> full_data$FSize
table(full_data$FSize)
##
## Low 1 Medium 2 Large 3
## 1184 90 35
colSums(is.na(full_data))
## Survived Pclass Sex Age Fare Embarked FSize
## 418 0 0 263 1 2 0
glimpse(full_data)
## Rows: 1,309
## Columns: 7
## $ Survived <fct> No 0, Yes 1, Yes 1, Yes 1, No 0, No 0, No 0, No 0, Yes 1, Yes~
## $ Pclass <dbl> 3, 1, 3, 1, 3, 3, 1, 3, 3, 2, 3, 1, 3, 3, 3, 2, 3, 2, 3, 3, 2~
## $ Sex <fct> male 1, female 0, female 0, female 0, male 1, male 1, male 1,~
## $ Age <dbl> 22, 38, 26, 35, 35, NA, 54, 2, 27, 14, 4, 58, 20, 39, 14, 55,~
## $ Fare <dbl> 7.2500, 71.2833, 7.9250, 53.1000, 8.0500, 8.4583, 51.8625, 21~
## $ Embarked <chr> "S", "C", "S", "S", "S", "Q", "S", "S", "S", "C", "S", "S", "~
## $ FSize <fct> Low 1, Low 1, Low 1, Low 1, Low 1, Low 1, Low 1, Medium 2, Lo~
Pclass as Factor
full_data$Pclass <- factor(full_data$Pclass,
                           levels = 1:3,
                           labels = c("High 1", "Med 2", "Low 3"))
table(full_data$Pclass)
##
## High 1 Med 2 Low 3
## 323 277 709
Analyzing Fare
par(mfrow= c(2,2))
boxplot(full_data$Fare,
main= "Boxplot Fare")
boxplot(full_data$Fare,
main= "Boxplot Fare Without Outliers",
outline = F, col= "steelblue")
hist(full_data$Fare,
main = "Histogram", col= "palegreen4")
hist(full_data$Fare,
xlim = c(0,100),
breaks = 150,
main= "Histogram Fare < 100", col= "yellow")
text(x= 0, y= 130,
labels= "xx\nxx", col= "red")
text(x= 60, y= 300,
labels= "Note:\nsome Fare entries\nare ZERO")
Investigating Fare
full_data[full_data$Fare == 0, ]
## # A tibble: 18 x 7
## Survived Pclass Sex Age Fare Embarked FSize
## <fct> <fct> <fct> <dbl> <dbl> <chr> <fct>
## 1 No 0 Low 3 male 1 36 0 S Low 1
## 2 No 0 High 1 male 1 40 0 S Low 1
## 3 Yes 1 Low 3 male 1 25 0 S Low 1
## 4 No 0 Med 2 male 1 NA 0 S Low 1
## 5 No 0 Low 3 male 1 19 0 S Low 1
## 6 No 0 Med 2 male 1 NA 0 S Low 1
## 7 No 0 Med 2 male 1 NA 0 S Low 1
## 8 No 0 Med 2 male 1 NA 0 S Low 1
## 9 No 0 Low 3 male 1 49 0 S Low 1
## 10 No 0 High 1 male 1 NA 0 S Low 1
## 11 No 0 Med 2 male 1 NA 0 S Low 1
## 12 No 0 Med 2 male 1 NA 0 S Low 1
## 13 No 0 High 1 male 1 39 0 S Low 1
## 14 No 0 High 1 male 1 NA 0 S Low 1
## 15 No 0 High 1 male 1 38 0 S Low 1
## 16 <NA> <NA> <NA> NA NA <NA> <NA>
## 17 <NA> High 1 male 1 NA 0 S Low 1
## 18 <NA> High 1 male 1 49 0 S Low 1
All zero-fare entries embarked from “S”, have a low family size, and come from all categories of Pclass.
Replacing the zero Fares with a median value based on specific categories.
full_data %>% na.omit() -> df
tapply(df$Fare, df$Pclass, median)
## High 1 Med 2 Low 3
## 67.9500 15.0458 8.0500
tapply(df$Fare, df$Embarked, median)
## C Q S
## 36.2521 7.7500 13.0000
Deciding to fill the zeros in Fare with the median Fare value of each Pclass.
full_data %>%
  mutate(Fare = ifelse(Fare == 0 & Pclass == "High 1", 67.95,
                ifelse(Fare == 0 & Pclass == "Med 2", 15.0458,
                ifelse(Fare == 0 & Pclass == "Low 3", 8.05, Fare)))) -> full_data
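The hard-coded values above mirror the tapply() medians. An equivalent group-wise sketch that computes them on the fly (an alternative, not what was run here; it takes the median of the non-zero fares per class, so values can differ slightly):
full_data %>%
  group_by(Pclass) %>%
  mutate(Fare = ifelse(Fare == 0, median(Fare[Fare > 0], na.rm = TRUE), Fare)) %>%
  ungroup() -> full_data_alt # hypothetical object; the pipeline above keeps full_data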
full_data[is.na(full_data$Fare), ] # Only one Fare is missing
## # A tibble: 1 x 7
## Survived Pclass Sex Age Fare Embarked FSize
## <fct> <fct> <fct> <dbl> <dbl> <chr> <fct>
## 1 <NA> Low 3 male 1 60.5 NA S Low 1
full_data$Fare <- ifelse(is.na(full_data$Fare), 8.0500, full_data$Fare) # replacing with the median Fare value of "Low 3" Pclass
Working with the Embarked Variable
colSums(is.na(full_data))
## Survived Pclass Sex Age Fare Embarked FSize
## 418 0 0 263 0 2 0
full_data[is.na(full_data$Embarked), ]
## # A tibble: 2 x 7
## Survived Pclass Sex Age Fare Embarked FSize
## <fct> <fct> <fct> <dbl> <dbl> <chr> <fct>
## 1 Yes 1 High 1 female 0 38 80 <NA> Low 1
## 2 Yes 1 High 1 female 0 62 80 <NA> Low 1
full_data %>% filter(Pclass == "High 1") %>%
  group_by(Embarked) %>% summarise(count = n())
## # A tibble: 4 x 2
## Embarked count
## <chr> <int>
## 1 C 141
## 2 Q 3
## 3 S 177
## 4 <NA> 2
full_data %>%
  group_by(Embarked) %>%
  summarise(count = n())
## # A tibble: 4 x 2
## Embarked count
## <chr> <int>
## 1 C 270
## 2 Q 123
## 3 S 914
## 4 <NA> 2
full_data$Embarked <- coalesce(full_data$Embarked, "S")
full_data %>%
  group_by(Embarked) %>%
  summarise(median(Fare))
## # A tibble: 3 x 2
## Embarked `median(Fare)`
## <chr> <dbl>
## 1 C 28.5
## 2 Q 7.75
## 3 S 13.9
full_data$Embarked <- factor(full_data$Embarked,
                             levels = c("Q", "S", "C"),
                             labels = c("Q 1", "S 2", "C 3"))
full_data$Age <- coalesce(full_data$Age,
                          median(full_data$Age, na.rm = T))
colSums(is.na(full_data))
## Survived Pclass Sex Age Fare Embarked FSize
## 418 0 0 0 0 0 0
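Instead of the overall-median fill used above, Age could have been imputed model-based with mice (already loaded for md.pattern()). A sketch of that alternative, assuming it is run before the coalesce() step while Age still has its 263 NAs; Survived is excluded because its NAs mark test rows, not missing data:
imp <- mice::mice(full_data %>% select(-Survived),
                  m = 1, seed = 20, printFlag = FALSE) # default method per column type
age_mice <- mice::complete(imp)$Age # imputed Age vector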
Correlation Plots
full_data %>%
  na.omit() %>%
  mutate_all(as.numeric) %>%
  mutate(Male = as.numeric(Sex == 2),
         Female = as.numeric(Sex == 1)) %>%
  select(-Sex) %>% cor() %>%
  corrplot(method = "pie") # from the "corrplot" package
full_data %>%
  na.omit() %>%
  mutate_all(as.numeric) %>%
  mutate(Male = as.numeric(Sex == 2),
         Female = as.numeric(Sex == 1)) %>%
  select(-Sex) %>%
  cor() %>%
  ggcorrplot(method = "circle",
             type = "upper",
             ggtheme = theme_foundation(),
             legend.title = "Correlation\nCoefficient",
             outline.color = "black")
full_data %>% head()
## # A tibble: 6 x 7
## Survived Pclass Sex Age Fare Embarked FSize
## <fct> <fct> <fct> <dbl> <dbl> <fct> <fct>
## 1 No 0 Low 3 male 1 22 7.25 S 2 Low 1
## 2 Yes 1 High 1 female 0 38 71.3 C 3 Low 1
## 3 Yes 1 Low 3 female 0 26 7.92 S 2 Low 1
## 4 Yes 1 High 1 female 0 35 53.1 S 2 Low 1
## 5 No 0 Low 3 male 1 35 8.05 S 2 Low 1
## 6 No 0 Low 3 male 1 28 8.46 Q 1 Low 1
Converting Fare into Categories
quantile(full_data$Fare)
## 0% 25% 50% 75% 100%
## 3.1708 7.9250 14.5000 31.3875 512.3292
full_data$Fare <- cut(full_data$Fare,
                      breaks = quantile(full_data$Fare),
                      labels = c("Less 1", "Economic 2", "High 3", "V-High 4"),
                      include.lowest = T)
full_data %>% head()
## # A tibble: 6 x 7
## Survived Pclass Sex Age Fare Embarked FSize
## <fct> <fct> <fct> <dbl> <fct> <fct> <fct>
## 1 No 0 Low 3 male 1 22 Less 1 S 2 Low 1
## 2 Yes 1 High 1 female 0 38 V-High 4 C 3 Low 1
## 3 Yes 1 Low 3 female 0 26 Less 1 S 2 Low 1
## 4 Yes 1 High 1 female 0 35 V-High 4 S 2 Low 1
## 5 No 0 Low 3 male 1 35 Economic 2 S 2 Low 1
## 6 No 0 Low 3 male 1 28 Economic 2 Q 1 Low 1
Fare Vs Survival
df <- full_data %>% na.omit()
mosaicplot(table(full_data$Fare, full_data$Survived),
col= c("red", "steelblue"),
main= "Fare Vs Survival",
border = F,
xlab= "Fare",
ylab= "Survived")
chisq.test
chisq.test(table(df$Fare, df$Survived))
##
## Pearson's Chi-squared test
##
## data: table(df$Fare, df$Survived)
## X-squared = 72.765, df = 3, p-value = 1.092e-15
Gender Vs Survival
mosaicplot(table(df$Sex, df$Survived),
col= c("pink", "steelblue"),
main= "Gender Vs Survival",
ylab= "Survied", xlab= "Gender",
border = F)
chisq.test
chisq.test(table(df$Sex, df$Survived))
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: table(df$Sex, df$Survived)
## X-squared = 260.72, df = 1, p-value < 2.2e-16
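The p-value establishes significance but not strength. A minimal effect-size sketch (Cramér’s V computed by hand; not part of the original analysis):
tab <- table(df$Sex, df$Survived)
chi <- chisq.test(tab, correct = FALSE) # uncorrected X-squared for the effect size
sqrt(as.numeric(chi$statistic) / (sum(tab) * (min(dim(tab)) - 1))) # Cramér's V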
df$Age <- cut(df$Age,
              breaks = c(round(quantile(df$Age))),
              include.lowest = T)
Age Vs Survival
mosaicplot(table(df$Age, df$Survived),
col= c("red", "yellow"),
main= "Age Vs Survival",
xlab= "Age",
ylab= "Survival")
chisq.test
chisq.test(table(df$Age, df$Survived))
##
## Pearson's Chi-squared test
##
## data: table(df$Age, df$Survived)
## X-squared = 6.8243, df = 3, p-value = 0.07772
Age and Survival show no statistically significant relation (p = 0.078 > 0.05).
Final full_data
full_data %>% head()
## # A tibble: 6 x 7
## Survived Pclass Sex Age Fare Embarked FSize
## <fct> <fct> <fct> <dbl> <fct> <fct> <fct>
## 1 No 0 Low 3 male 1 22 Less 1 S 2 Low 1
## 2 Yes 1 High 1 female 0 38 V-High 4 C 3 Low 1
## 3 Yes 1 Low 3 female 0 26 Less 1 S 2 Low 1
## 4 Yes 1 High 1 female 0 35 V-High 4 S 2 Low 1
## 5 No 0 Low 3 male 1 35 Economic 2 S 2 Low 1
## 6 No 0 Low 3 male 1 28 Economic 2 Q 1 Low 1
Train and Test Data
train_df <- full_data[1:nrow(traindf), ]
train_df %>% na.omit() -> train_df
test_df <- full_data[(nrow(traindf) + 1):nrow(full_data), 2:7]
rm(full_data)
rm(df)
rm(traindf)
rm(testdf)
Splitting for Validation Set
set.seed(20)
caTools::sample.split(train_df$Survived, .8) -> index
train_df[index == F, ] -> valid_df
train_df[index == T, ] -> train_df
Time of Execution
toc()
## 7.95 sec elapsed
tic()
Training Model
Setting Parameters
= "Accuracy"
m= trainControl(method= "cv", number= 18) c
LDA
set.seed(20)
fit_lda <- train(Survived ~ .,
                 data = train_df,
                 method = "lda",
                 metric = m,
                 trControl = c)
KNN
set.seed(20)
fit_knn <- train(Survived ~ .,
                 data = train_df,
                 method = "knn",
                 metric = m,
                 trControl = c)
Support Vector Machine (Radial)
set.seed(20)
fit_svmradial <- train(Survived ~ .,
                       data = train_df,
                       method = "svmRadial",
                       metric = m, trControl = c)
Support Vector Machine (Linear)
set.seed(20)
fit_svmlinear <- train(Survived ~ .,
                       data = train_df,
                       method = "svmLinear",
                       metric = m, trControl = c)
CART
set.seed(20)
fit_rpart <- train(Survived ~ .,
                   data = train_df,
                   method = "rpart",
                   metric = m, trControl = c)
Random Forest Model
set.seed(20)
fit_rf <- train(Survived ~ .,
                data = train_df,
                method = "rf",
                metric = m,
                trControl = c)
results <- resamples(list(LDA = fit_lda,
                          KNN = fit_knn,
                          cart = fit_rpart,
                          svmLinear = fit_svmlinear,
                          svmRadial = fit_svmradial,
                          RF = fit_rf))
dotplot(results)
summary(results) %>% print(digits= 2)
##
## Call:
## summary.resamples(object = results)
##
## Models: LDA, KNN, cart, svmLinear, svmRadial, RF
## Number of resamples: 18
##
## Accuracy
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## LDA 0.6666667 0.7515244 0.8000000 0.7884233 0.8461538 0.8750000 0
## KNN 0.6666667 0.7515244 0.8000000 0.7884233 0.8461538 0.8750000 0
## cart 0.6829268 0.7451923 0.8024390 0.7983253 0.8490385 0.9230769 0
## svmLinear 0.6666667 0.7548077 0.8024390 0.7939485 0.8397436 0.8750000 0
## svmRadial 0.7435897 0.7961538 0.8102564 0.8149295 0.8461538 0.8750000 0
## RF 0.7179487 0.8051282 0.8250000 0.8276058 0.8490385 0.9250000 0
##
## Kappa
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## LDA 0.2488889 0.4621944 0.5694398 0.5408761 0.6644737 0.7422680 0
## KNN 0.2488889 0.4621944 0.5694398 0.5408761 0.6644737 0.7422680 0
## cart 0.2586926 0.4277174 0.5465343 0.5448427 0.6609526 0.8354430 0
## svmLinear 0.2488889 0.4589286 0.5819480 0.5506147 0.6520556 0.7422680 0
## svmRadial 0.4144144 0.5291150 0.5952381 0.5856635 0.6578947 0.7368421 0
## RF 0.4115226 0.5851816 0.6199152 0.6243739 0.6822917 0.8378378 0
Evaluation of Accuracy Using Confusion Matrices
confusionMatrix(predict(fit_lda, valid_df), valid_df$Survived)
## Confusion Matrix and Statistics
##
## Reference
## Prediction No 0 Yes 1
## No 0 96 20
## Yes 1 14 48
##
## Accuracy : 0.809
## 95% CI : (0.7434, 0.8639)
## No Information Rate : 0.618
## P-Value [Acc > NIR] : 2.973e-08
##
## Kappa : 0.5885
##
## Mcnemar's Test P-Value : 0.3912
##
## Sensitivity : 0.8727
## Specificity : 0.7059
## Pos Pred Value : 0.8276
## Neg Pred Value : 0.7742
## Prevalence : 0.6180
## Detection Rate : 0.5393
## Detection Prevalence : 0.6517
## Balanced Accuracy : 0.7893
##
## 'Positive' Class : No 0
##
confusionMatrix(predict(fit_knn , valid_df), valid_df$Survived)
## Confusion Matrix and Statistics
##
## Reference
## Prediction No 0 Yes 1
## No 0 96 20
## Yes 1 14 48
##
## Accuracy : 0.809
## 95% CI : (0.7434, 0.8639)
## No Information Rate : 0.618
## P-Value [Acc > NIR] : 2.973e-08
##
## Kappa : 0.5885
##
## Mcnemar's Test P-Value : 0.3912
##
## Sensitivity : 0.8727
## Specificity : 0.7059
## Pos Pred Value : 0.8276
## Neg Pred Value : 0.7742
## Prevalence : 0.6180
## Detection Rate : 0.5393
## Detection Prevalence : 0.6517
## Balanced Accuracy : 0.7893
##
## 'Positive' Class : No 0
##
confusionMatrix(predict(fit_rpart, valid_df), valid_df$Survived)
## Confusion Matrix and Statistics
##
## Reference
## Prediction No 0 Yes 1
## No 0 104 27
## Yes 1 6 41
##
## Accuracy : 0.8146
## 95% CI : (0.7496, 0.8688)
## No Information Rate : 0.618
## P-Value [Acc > NIR] : 1.104e-08
##
## Kappa : 0.5828
##
## Mcnemar's Test P-Value : 0.0004985
##
## Sensitivity : 0.9455
## Specificity : 0.6029
## Pos Pred Value : 0.7939
## Neg Pred Value : 0.8723
## Prevalence : 0.6180
## Detection Rate : 0.5843
## Detection Prevalence : 0.7360
## Balanced Accuracy : 0.7742
##
## 'Positive' Class : No 0
##
confusionMatrix(predict(fit_svmlinear, valid_df), valid_df$Survived)
## Confusion Matrix and Statistics
##
## Reference
## Prediction No 0 Yes 1
## No 0 95 21
## Yes 1 15 47
##
## Accuracy : 0.7978
## 95% CI : (0.7312, 0.8541)
## No Information Rate : 0.618
## P-Value [Acc > NIR] : 1.939e-07
##
## Kappa : 0.5643
##
## Mcnemar's Test P-Value : 0.4047
##
## Sensitivity : 0.8636
## Specificity : 0.6912
## Pos Pred Value : 0.8190
## Neg Pred Value : 0.7581
## Prevalence : 0.6180
## Detection Rate : 0.5337
## Detection Prevalence : 0.6517
## Balanced Accuracy : 0.7774
##
## 'Positive' Class : No 0
##
confusionMatrix(predict(fit_svmradial, valid_df), valid_df$Survived)
## Confusion Matrix and Statistics
##
## Reference
## Prediction No 0 Yes 1
## No 0 103 25
## Yes 1 7 43
##
## Accuracy : 0.8202
## 95% CI : (0.7558, 0.8737)
## No Information Rate : 0.618
## P-Value [Acc > NIR] : 3.958e-09
##
## Kappa : 0.599
##
## Mcnemar's Test P-Value : 0.002654
##
## Sensitivity : 0.9364
## Specificity : 0.6324
## Pos Pred Value : 0.8047
## Neg Pred Value : 0.8600
## Prevalence : 0.6180
## Detection Rate : 0.5787
## Detection Prevalence : 0.7191
## Balanced Accuracy : 0.7844
##
## 'Positive' Class : No 0
##
confusionMatrix(predict(fit_rf, valid_df), valid_df$Survived)
## Confusion Matrix and Statistics
##
## Reference
## Prediction No 0 Yes 1
## No 0 99 16
## Yes 1 11 52
##
## Accuracy : 0.8483
## 95% CI : (0.787, 0.8976)
## No Information Rate : 0.618
## P-Value [Acc > NIR] : 1.317e-11
##
## Kappa : 0.6742
##
## Mcnemar's Test P-Value : 0.4414
##
## Sensitivity : 0.9000
## Specificity : 0.7647
## Pos Pred Value : 0.8609
## Neg Pred Value : 0.8254
## Prevalence : 0.6180
## Detection Rate : 0.5562
## Detection Prevalence : 0.6461
## Balanced Accuracy : 0.8324
##
## 'Positive' Class : No 0
##
RF and svmRadial give the best validation results.
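The per-model comparison above can also be collapsed into one call. A sketch that pulls the validation accuracy out of each confusion matrix:
fits <- list(LDA = fit_lda, KNN = fit_knn, CART = fit_rpart,
             svmLinear = fit_svmlinear, svmRadial = fit_svmradial, RF = fit_rf)
sapply(fits, function(f)
  confusionMatrix(predict(f, valid_df), valid_df$Survived)$overall["Accuracy"])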
Time of Execution
toc()
## 29.18 sec elapsed
tic()
Final training on the whole data (train_df + valid_df) for more accuracy
train_df <- rbind(train_df, valid_df)
Training Model
Setting Parameters
= "Accuracy"
m= trainControl(method= "cv", number= 18) c
LDA
set.seed(20)
fit_lda <- train(Survived ~ .,
                 data = train_df,
                 method = "lda",
                 metric = m,
                 trControl = c)
KNN
set.seed(20)
fit_knn <- train(Survived ~ .,
                 data = train_df,
                 method = "knn",
                 metric = m,
                 trControl = c)
Support Vector Machine (Radial)
set.seed(20)
fit_svmradial <- train(Survived ~ .,
                       data = train_df,
                       method = "svmRadial",
                       metric = m, trControl = c)
Support Vector Machine (Linear)
set.seed(20)
fit_svmlinear <- train(Survived ~ .,
                       data = train_df,
                       method = "svmLinear",
                       metric = m, trControl = c)
CART
set.seed(20)
fit_rpart <- train(Survived ~ .,
                   data = train_df,
                   method = "rpart",
                   metric = m, trControl = c)
Random Forest Model
set.seed(20)
fit_rf <- train(Survived ~ .,
                data = train_df,
                method = "rf",
                metric = m,
                trControl = c)
results <- resamples(list(LDA = fit_lda,
                          KNN = fit_knn,
                          cart = fit_rpart,
                          svmLinear = fit_svmlinear,
                          svmRadial = fit_svmradial,
                          RF = fit_rf))
dotplot(results)
summary(results) %>% print(digits= 2)
##
## Call:
## summary.resamples(object = results)
##
## Models: LDA, KNN, cart, svmLinear, svmRadial, RF
## Number of resamples: 18
##
## Accuracy
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## LDA 0.6938776 0.7437755 0.7777551 0.7931973 0.8400000 0.90 0
## KNN 0.6938776 0.7437755 0.7777551 0.7931973 0.8400000 0.90 0
## cart 0.7142857 0.7755102 0.7959184 0.8033333 0.8391837 0.90 0
## svmLinear 0.7200000 0.7551020 0.7879592 0.7943311 0.8325510 0.90 0
## svmRadial 0.7346939 0.7806122 0.8163265 0.8168027 0.8600000 0.90 0
## RF 0.7200000 0.8163265 0.8283673 0.8336508 0.8600000 0.94 0
##
## Kappa
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## LDA 0.3360434 0.4674658 0.5207002 0.5552678 0.6586760 0.7810858 0
## KNN 0.3360434 0.4674658 0.5207002 0.5552678 0.6586760 0.7810858 0
## cart 0.3740876 0.4877221 0.5340167 0.5572735 0.6450972 0.7763864 0
## svmLinear 0.3805310 0.4480546 0.5490691 0.5542787 0.6435939 0.7810858 0
## svmRadial 0.4186047 0.5183790 0.5935484 0.5961001 0.6869410 0.7810858 0
## RF 0.3670886 0.5898862 0.6221991 0.6364404 0.6998285 0.8739496 0
Time of Execution
toc()
## 33.17 sec elapsed
submission_rf <- data.frame(
  PassengerId = submission$PassengerId,
  Survived = factor(predict(fit_rf, test_df),
                    levels = c("No 0", "Yes 1"),
                    labels = c(0, 1)))
submission_svmradial <- data.frame(
  PassengerId = submission$PassengerId,
  Survived = factor(predict(fit_svmradial, test_df),
                    levels = c("No 0", "Yes 1"),
                    labels = c(0, 1)))
submission_linear <- data.frame(
  PassengerId = submission$PassengerId,
  Survived = factor(predict(fit_svmlinear, test_df),
                    levels = c("No 0", "Yes 1"),
                    labels = c(0, 1)))
submission_rpart <- data.frame(
  PassengerId = submission$PassengerId,
  Survived = factor(predict(fit_rpart, test_df),
                    levels = c("No 0", "Yes 1"),
                    labels = c(0, 1)))
Printing top 20 results
print(submission_svmradial %>% head(20))
## PassengerId Survived
## 1 892 0
## 2 893 1
## 3 894 0
## 4 895 0
## 5 896 0
## 6 897 0
## 7 898 1
## 8 899 0
## 9 900 1
## 10 901 0
## 11 902 0
## 12 903 0
## 13 904 1
## 14 905 0
## 15 906 1
## 16 907 1
## 17 908 0
## 18 909 0
## 19 910 1
## 20 911 1
readr::write_csv(submission_svmradial, "subradial1.csv")
readr::write_csv(submission_rf, "subrf1.csv")
readr::write_csv(submission_linear, "sublinear.csv")
readr::write_csv(submission_rpart, "submrpart1.csv")
Kaggle Result
kableExtra::kable(data.frame(method = c("rf", "svmradial"),
                             accuracy = c("76%", "78%")))
| method    | accuracy |
|-----------|----------|
| rf        | 76%      |
| svmradial | 78%      |
end_time <- Sys.time()
Total Execution Time
end_time - start_time
## Time difference of 1.180538 mins
END
Regards
Please visit my profile
Alok Pratap Singh (Research Scholar)
LinkedIn
Department of Psychology
University of Allahabad
“Without data you’re just another person with an opinion.”