May 01, 2016
This project is a predictive data analysis practice based on an entry level Kaggle competition, which asks participants to predict the fate of the passengers based on their social class, name, gender, age, ticket fare and so on.
The sinking of the RMS Titanic is one of the most infamous shipwrecks in history. On April 15, 1912, during her maiden voyage, the Titanic sank after colliding with an iceberg, killing 1502 out of 2224 (68 %) passengers and crew. This sensational tragedy shocked the international community and led to better safety regulations for ships.
One of the reasons that the shipwreck led to such loss of life was that there were not enough lifeboats for the passengers and crew. Although there was some element of luck involved in surviving the sinking, some groups of people were more likely to survive than others, such as women, children, and the upper-class.
Variable Descriptions
| Variable | Desctiption |
|---|---|
| survival | Survival (0 = No; 1 = Yes) |
| pclass | Passenger Class (1 = 1st; 2 = 2nd; 3 = 3rd) |
| name | Name |
| sex | Sex |
| age | Age |
| sibsp | Number of Siblings/Spouses Aboard |
| parch | Number of Parents/Children Aboard |
| ticket | Ticket Number |
| fare | Passenger Fare |
| cabin | Cabin |
| embarked | Port of Embarkation (C = Cherbourg; Q = Queenstown; S = Southampton) |
Special Notes
Pclass is a proxy for socio-economic status (SES): 1st ~ Upper; 2nd ~ Middle; 3rd ~ Lower
Age is in Years; Fractional if Age less than One (1). If the Age is Estimated, it is in the form of xx.5.
With respect to the family relation variables (i.e. sibsp and parch), some relations were ignored.
The following are the definitions used for sibsp and parch.
| Name | Definition |
|---|---|
| Sibling | Brother, Sister, Stepbrother, or Stepsister of Passenger Aboard Titanic |
| Spouse | Husband or Wife of Passenger Aboard Titanic (Mistresses and Fiances Ignored) |
| Parent | Mother or Father of Passenger Aboard Titanic |
| Child | Son, Daughter, Stepson, or Stepdaughter of Passenger Aboard Titanic |
Other family relatives excluded from this study include cousins, nephews/nieces, aunts/uncles, and in-laws. Some children travelled only with a nanny, therefore parch=0 for them. As well, some travelled with very close friends or neighbors in a village, however, the definitions do not support such relations.
## Warning: package 'survival' was built under R version 3.2.5
After loading in the data, take a quick look at the structure of the data set.
## quick look of the data set
glimpse(data.comb)Observations: 1,309
Variables: 12
$ PassengerId (int) 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15,...
$ Survived (int) 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 0,...
$ Pclass (int) 3, 1, 3, 1, 3, 3, 1, 3, 3, 2, 3, 1, 3, 3, 3, 2, 3,...
$ Name (fctr) Braund, Mr. Owen Harris, Cumings, Mrs. John Bradl...
$ Sex (fctr) male, female, female, female, male, male, male, m...
$ Age (dbl) 22, 38, 26, 35, 35, NA, 54, 2, 27, 14, 4, 58, 20, ...
$ SibSp (int) 1, 1, 0, 1, 0, 0, 0, 3, 0, 1, 1, 0, 0, 1, 0, 0, 4,...
$ Parch (int) 0, 0, 0, 0, 0, 0, 0, 1, 2, 0, 1, 0, 0, 5, 0, 0, 1,...
$ Ticket (fctr) A/5 21171, PC 17599, STON/O2. 3101282, 113803, 37...
$ Fare (dbl) 7.2500, 71.2833, 7.9250, 53.1000, 8.0500, 8.4583, ...
$ Cabin (fctr) NA, C85, NA, C123, NA, NA, E46, NA, NA, NA, G6, C...
$ Embarked (fctr) S, C, S, S, S, Q, S, S, S, C, S, S, S, S, S, S, Q...
From the summary above, the first thing that caught my eye is there are NAs, AKA missing values. Before I can make any statistical model, I need to deal with them first.
The table below shows the number and percentage of NAs in each variable.| Number and Percentage of NAs in Each Variable | ||||||||||
| Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | |
|---|---|---|---|---|---|---|---|---|---|---|
| No. of NAs | 0 | 0 | 0 | 263 | 0 | 0 | 0 | 1 | 1014 | 2 |
| % of NAs | 0 % | 0 % | 0 % | 20 % | 0 % | 0 % | 0 % | 0 % | 77 % | 0 % |
From the table, we can see that there are 263 NAs (roughly 20%) in Age variable, 1 in Fare, 1014 (roughly 77%) in Cabin, and 2 in Embarked.
In this case, we can definitely find out a way to fill in the missing values in Age since it is only 20%, but for Cabin, with 77% of missing values, it would be impractical to fill in all of them. So the Cabin variable is dropped for now.
Before dealing with the NAs, let’s look at the distribution of the variables first.
From the three plots above, we can conclude that:
What I can get from the plots above:
Age variable seems quite similar for different passenger fate. But we know that when the Titanic sank, “women and children first” rule was carried out. This is a very importtant information for the project, so I will keep Age variable in the model anyway.Fare suggests that survived people paid more than the deceased. However by looking at the histogram, we notice that people paid as high as $500 for the ticket. So I suspect this kind of ticket is group ticket. Family members or friends are likely to buy tickets like these.Based on the three points above, I’ll create two new features, Family Size and Group Ticket, in part 3 Feature Engineering.
Now it’s time to eliminate the missing values.
Age VariableAge variable, it looks normal. Actually by looking at the summary statistics, we know that the mean age is 29.88, and the median age is 28.
| Summary Statistics of ‘Age’ Variable | ||||||
| Min. | 1st Qu. | Median | Mean | 3rd Qu. | Max. | NA’s |
|---|---|---|---|---|---|---|
| 0.17 | 21 | 28 | 29.88 | 39 | 80 | 263 |
The mean and median are very close. So the easiest way would be using the median value (28) to replace all the NAs. But it might not be the best option here.
By exploring the relationship betweenAge and other variables, we may find some more hints. For example, let’s look at the relationship between age and socio-economic status as shown below.
Hover over the boxplot to observe the distribution of Age for each social class. It’s not hard to find that the median age is increasing along with the social class. So we can use this important information to fill in the NAs in Age variable. If a person’s age is missing, look at his social class first, then use the corresponding median age of that class to replace the NA. It would be much more accurate than just using the overall median.
But in this particular data set, there is actually a better way to deal with the NAs in Age. Let’s take a look at some of the passengers’ names first.
## study 'Name' variable
head(data_comb$Name)[1] "Braund, Mr. Owen Harris"
[2] "Cumings, Mrs. John Bradley (Florence Briggs Thayer)"
[3] "Heikkinen, Miss. Laina"
[4] "Futrelle, Mrs. Jacques Heath (Lily May Peel)"
[5] "Allen, Mr. William Henry"
[6] "Moran, Mr. James"
All the passengers’ names follow the same pattern: first name + title + last name. And a person’s title can be very useful. It would probably suggest the person’s gender, age, and even social class. So not only for the purpose of imputing NAs, but also as part of the feature engineering, I decided to extract the titles from passengers’ names, and make a new variable called Title to store them in the data frame.
## create 'Title' variable
data_comb$Title <- sapply(data_comb$Name, function(x) {temp = strsplit(x, split = "[,.]")
temp[[1]][2]})
data_comb$Title <- sub(" ", "", data_comb$Title)Age, and it also gives us the mean and median age for each title.
| N | Missing | Mean | Median | |
|---|---|---|---|---|
| Capt | 1 | 0 | 70 | 70 |
| Col | 4 | 0 | 54 | 54.5 |
| Don | 1 | 0 | 40 | 40 |
| Dona | 1 | 0 | 39 | 39 |
| Dr | 7 | 1 | 43.57 | 49 |
| Jonkheer | 1 | 0 | 38 | 38 |
| Lady | 1 | 0 | 48 | 48 |
| Major | 2 | 0 | 48.5 | 48.5 |
| Master | 53 | 8 | 5.48 | 4 |
| Miss | 210 | 50 | 21.77 | 22 |
| Mlle | 2 | 0 | 24 | 24 |
| Mme | 1 | 0 | 24 | 24 |
| Mr | 581 | 176 | 32.25 | 29 |
| Mrs | 170 | 27 | 36.99 | 35.5 |
| Ms | 1 | 1 | 28 | 28 |
| Rev | 8 | 0 | 41.25 | 41.5 |
| Sir | 1 | 0 | 49 | 49 |
| the Countess | 1 | 0 | 33 | 33 |
| ALL | 1046 | 263 | 29.88 | 28 |
Now let’s use the median age from each title to replace the missing values in that title gruop if there is at least one NA.
## replace missing age values with median age of the corresponding title
title.na <- c("Dr", "Master", "Miss", "Mr", "Mrs", "Ms")
medAge_byTitle <- data_comb %>%
group_by(Title) %>%
summarise(medAge = median(Age, na.rm = TRUE))
for (ttl in title.na) {
data_comb[data_comb$Title == ttl & is.na(data_comb$Age), "Age"] =
medAge_byTitle[medAge_byTitle$Title == ttl, "medAge"]
}Age variable again, we will find there is no NA left.
| Summary Statistics of ‘Age’ Variable w/o NAs | |||||
| Min. | 1st Qu. | Median | Mean | 3rd Qu. | Max. |
|---|---|---|---|---|---|
| 0.17 | 22 | 29 | 29.43 | 35.5 | 80 |
Embarked Variable| Summary Statistics of ‘Embarked’ Variable | |||
| C | Q | S | NA’s |
|---|---|---|---|
| 270 | 123 | 914 | 2 |
We can see that there are only 2 missing values in the Embarked variable. We can simply fill in ‘S’ for those two empty cells since most of the passengers are embarked from port S (Southampton). This is called mode in statistics.
## replace empty cells with mode which is "S"
data_comb$Embarked[is.na(data_comb$Embarked)] = "S"| Summary Statistics of ‘Embarked’ Variable w/o NAs | ||
| C | Q | S |
|---|---|---|
| 270 | 123 | 916 |
Fare Variable| Summary Statistics of ‘Fare’ Variable | ||||||
| Min. | 1st Qu. | Median | Mean | 3rd Qu. | Max. | NA’s |
|---|---|---|---|---|---|---|
| 0 | 7.896 | 14.45 | 33.3 | 31.28 | 512.3 | 1 |
The Fare variable is a little bit tricky. Although there is only 1 missing value, we can see from the summary statistics that there are some 0 fares in the data set. At first I thought these would be the ticket for baby passengers, they are just too young to be charged. But digging into the details, it turned out differently.
## check the zero values in 'fare'
head(filter(select(data_comb, -Survived, -Fate), Fare == 0 | is.na(Fare))) PassengerId Pclass Name Sex Age SibSp Parch Ticket Fare Embarked Title
1 180 Lower Leonard, Mr. Lionel male 36 0 0 LINE 0 S Mr
2 264 Upper Harrison, Mr. William male 40 0 0 112059 0 S Mr
3 272 Lower Tornquist, Mr. William Henry male 25 0 0 LINE 0 S Mr
4 278 Middle Parkes, Mr. Francis "Frank" male 29 0 0 239853 0 S Mr
5 303 Lower Johnson, Mr. William Cahoone Jr male 19 0 0 LINE 0 S Mr
6 414 Middle Cunningham, Mr. Alfred Fleming male 29 0 0 239853 0 S Mr
We can clearly see that these people with 0 fares are not babies. So I guess it is some kind of error. I will treat the zeroes the same as NAs.
Looking at the distribution of the Fare variable by passengers’ social class, we can see the median fare are $ 61.66, $ 15.14 and $ 8.13 for upper, middle and lower class respectively.
Now, let’s use these values to replace the missing values in Fare for each passenger class.
## replace the only missing value of 'Fare' and the 0 fare values
medFare_byPclass <- data_comb %>%
filter(Fare != 0) %>%
group_by(Pclass) %>%
summarise(medFare = median(Fare, na.rm = TRUE))
for (cls in medFare_byPclass$Pclass) {
data_comb[data_comb$Pclass == cls & (data_comb$Fare == 0 | is.na(data_comb$Fare)), "Fare"] =
medFare_byPclass[medFare_byPclass$Pclass == cls, "medFare"]
}Fare.
| Summary Statistics of ‘Fare’ Variable w/o NAs | |||||
| Min. | 1st Qu. | Median | Mean | 3rd Qu. | Max. |
|---|---|---|---|---|---|
| 3.171 | 7.925 | 14.5 | 33.7 | 31.39 | 512.3 |
Now, I will create some new features beased on the exploratory analysis.
Totally four new variables are created: Title, famSize, group, and wom_chd. I will explain each one of them step by step.
From part 2.3.1, we can clearly see that for some titles, there are only a few people on the ship. We need to merge the similar ones into groups.
Let’s see what the distribution of Age would look like across all these titles. It may give us some idea of how to create groups for titles.
From the plot above we can see that:
So I will put “Ms” in the same group with “Mrs”, and all the other title mentioned above will be in a new title group called “Noble” assuming that they all share similar social class and fate.
## group titles
data_comb$Title <- as.character(data_comb$Title)
data_comb[data_comb$Title == "Ms", "Title"] = "Mrs"
data_comb[data_comb$Title %in% c("Mlle", "Mme", "the Countess", "Dona", "Lady", "Jonkheer", "Don", "Rev", "Major", "Dr", "Sir", "Col", "Capt"), "Title"] = "Noble"| Number of Passengers with Each Title | ||||
| Master | Miss | Mr | Mrs | Noble |
|---|---|---|---|---|
| 61 | 260 | 757 | 199 | 32 |
This new feature is relatively easy to build. A passenger’s family size equals to the number of siblings/spouses plus number of parents/children plus one which himself.
## create 'famSize' variable
data_comb$famSize <- data_comb$SibSp + data_comb$Parch + 1From the distribution of passengers’ family size, we can observe that the median family size for those who survived is 2 which will be used in the next part of my feature engineering.
The Ticket variable is quite interesting. The total number of passengers is 1309, however there are only 929 unique tickets.
This indicates that a lot of people shared the same ticket. I will call it Group Ticket in this project, and in this part, I will biuld a new variable named group to represent these feature.
From last part of my feature engineering, I have already mentioned that the median family size for the survived is 2. So in this new variable, I will calculate the number of passengers with each ticket, and if that number is greater than 2, I will assign group a value “Yes”, and “No” vice versa.
## calculate the number of passengers with each ticket, and filter out those greater than 2
groupTKT <- data_comb %>%
group_by(Ticket) %>%
summarise(N = n()) %>%
filter(N > 2) %>%
arrange(desc(N))
## create 'group' variable
data_comb$group = "No"
data_comb$group[which(data_comb$Ticket %in% groupTKT$Ticket)] = "Yes"
data_comb$group <- as.factor(data_comb$group)group criteria.
| Group Ticket | ||
| No | Yes | |
|---|---|---|
| No. of Passengers | 977 | 332 |
| % of Passengers | 74.64 % | 25.36 % |
Since we have already known that women and children first rule was carried out in the incident, I will create a feature directly related to women and children. So if a passenger is either female or less than 18 years old, I will assign wom_chd a value “Yes”, and “No” vice versa.
## create 'wom_chd' variable
data_comb$wom_chd = "No"
data_comb$wom_chd[which(data_comb$Sex == "female" | data_comb$Age < 18)] = "Yes"
data_comb$wom_chd <- as.factor(data_comb$wom_chd)| Women and Children | ||
| No | Yes | |
|---|---|---|
| No. of Passengers | 753 | 556 |
| % of Passengers | 57.52 % | 42.48 % |
At last, let’s take a look at the relationship of passengers’ fate and the two new variables I just created.
We can see that passengers taking a group ticket, or being women or children, have a higher chance to survive.
After exploring the patterns and creating new features, now I will build statistical models to predict the fate of the passengers in the test data set.
Three machine learning methods are used in this project:
I will exaplain each one of them respectively. But before diving into the modeling process, I split the training data set into two parts: a new “training”" set and a new “testing” set.
## set seed and split the training set
set.seed(123)
trainIndex <- createDataPartition(data_train$Survived, p = 0.8, list = FALSE)
training <- data_train[trainIndex, ]
testing <- data_train[-trainIndex, ]So, with each method, the model is built on the data set named “traning”, and then be used to predict the fate of the passengers in the “testing” data set. Since we already know the fate in the “testing” data set, we can evalute the performance of the model, and improve the result before it is applied to the true test set and submission to Kaggle.
The first model I built is a simple Logistic Regression that uses the original variables only. I want to know how well it would be without my feature engineering.
## simple logistic regression
LR_model.train <- glm(Fate ~ Pclass + Sex + Age + SibSp + Parch + Fare + Embarked,
data = training, family = binomial(link = "logit"))Pclass, Sex, Age and SibSp are all significant variables.
| Analysis of Deviance Table | |||||
| Df | Deviance | Resid. Df | Resid. Dev | Pr(>Chi) | |
|---|---|---|---|---|---|
| NULL | 713 | 950.8649 | |||
| Pclass | 2 | 95.7096 | 711 | 855.1554 | 0 |
| Sex | 1 | 197.4841 | 710 | 657.6713 | 0 |
| Age | 1 | 13.4235 | 709 | 644.2478 | 2e-04 |
| SibSp | 1 | 12.8114 | 708 | 631.4364 | 3e-04 |
| Parch | 1 | 0.4308 | 707 | 631.0056 | 0.5116 |
| Fare | 1 | 0.7293 | 706 | 630.2763 | 0.3931 |
| Embarked | 2 | 2.428 | 704 | 627.8483 | 0.297 |
Let’s use it to predict the “testing” data set.
## prediction
LR_pred.test <- predict(LR_model.train, testing, type = "response")
testing$Fate_LR <- ifelse(LR_pred.test <= 0.5, "Deceased", "Survived")
confusionMatrix(testing$Fate_LR, testing$Fate, positive = "Survived")Confusion Matrix and Statistics
Reference
Prediction Deceased Survived
Deceased 89 15
Survived 20 53
Accuracy : 0.8023
95% CI : (0.7359, 0.8582)
No Information Rate : 0.6158
P-Value [Acc > NIR] : 7.432e-08
Kappa : 0.5878
Mcnemar's Test P-Value : 0.499
Sensitivity : 0.7794
Specificity : 0.8165
Pos Pred Value : 0.7260
Neg Pred Value : 0.8558
Prevalence : 0.3842
Detection Rate : 0.2994
Detection Prevalence : 0.4124
Balanced Accuracy : 0.7980
'Positive' Class : Survived
From the Confusion Matrix, we can see the overall Accuracy is 80.23%, Sensitivity is 77.94% and Specificity is 81.65%. Specificity is lower, but as a first model it is not so bad actually.
Now I will add in all the new features I created in part 3 to enhance the prediction.
## add in new features to the model
LR_model.trainPlus <- glm(Fate ~ Pclass + Sex + Age + Fare + Embarked + ## original features
famSize + Title + group + wom_chd, ## new features
data = training, family = binomial(link = "logit"))wom_chd is insignificant. This is probably because most of the gender and age information can be captured by Sex, Age and Title, in which case, wom_chd could become secondary.
| Analysis of Deviance Table | |||||
| Df | Deviance | Resid. Df | Resid. Dev | Pr(>Chi) | |
|---|---|---|---|---|---|
| NULL | 713 | 950.8649 | |||
| Pclass | 2 | 95.7096 | 711 | 855.1554 | 0 |
| Sex | 1 | 197.4841 | 710 | 657.6713 | 0 |
| Age | 1 | 13.4235 | 709 | 644.2478 | 2e-04 |
| Fare | 1 | 0.0015 | 708 | 644.2463 | 0.9693 |
| Embarked | 2 | 4.7335 | 706 | 639.5129 | 0.0938 |
| famSize | 1 | 10.3762 | 705 | 629.1367 | 0.0013 |
| Title | 4 | 44.7831 | 701 | 584.3535 | 0 |
| group | 1 | 8.8245 | 700 | 575.529 | 0.003 |
| wom_chd | 1 | 0.015 | 699 | 575.514 | 0.9024 |
Let’s use this model to predict the “tesing” data set again.
## prediction with new features
LR_pred.testPlus <- predict(LR_model.trainPlus, testing, type = "response")
testing$Fate_LRPlus <- ifelse(LR_pred.testPlus <= 0.5, "Deceased", "Survived")
confusionMatrix(testing$Fate_LRPlus, testing$Fate, positive = "Survived")Confusion Matrix and Statistics
Reference
Prediction Deceased Survived
Deceased 89 13
Survived 20 55
Accuracy : 0.8136
95% CI : (0.7483, 0.8681)
No Information Rate : 0.6158
P-Value [Acc > NIR] : 1.058e-08
Kappa : 0.6135
Mcnemar's Test P-Value : 0.2963
Sensitivity : 0.8088
Specificity : 0.8165
Pos Pred Value : 0.7333
Neg Pred Value : 0.8725
Prevalence : 0.3842
Detection Rate : 0.3107
Detection Prevalence : 0.4237
Balanced Accuracy : 0.8127
'Positive' Class : Survived
We can see that the overall Accuracy is increased to 81.36%. Specificity is the same, but Sensitivity is increased to 80.88%. This means my feature engineering works well. With all the new features, I successfully predicted more True Positives (people that survived are correctly predicted as survived) than my last model.
From the model comparison below, it is obvious that the second model is better than the first one.| Logistic Regression Models Comparison | |||||
| Resid. Df | Resid. Dev | Df | Deviance | Pr(>Chi) | |
|---|---|---|---|---|---|
| 1 | 704 | 627.8483 | |||
| 2 | 699 | 575.514 | 5 | 52.3343 | 0 |
After applying the better model to the true test set and submitting the prediction results to Kaggle, it scored
0.79426/1.00000.
Cross Validation is a technique being used widely in data science. The purpose of doing so is to prevent overfitting and find the best possible model for the given data set.
In this project, I use caret package to perform cross validation for the Logistic Regression model.
## set up cross validation
cv.ctrl <- trainControl(method = "repeatedcv", number = 10, repeats = 3, classProbs = TRUE,
savePredictions = TRUE, summaryFunction = twoClassSummary)
## logistic regression with cross validation
LR_model.traincv <- train(Fate ~ Pclass + Sex + Age + Fare + Embarked +
famSize + Title + group + wom_chd,
data = training, method = "glm", family = "binomial",
metric = "ROC", trControl = cv.ctrl)
summary(LR_model.traincv)
Call:
NULL
Deviance Residuals:
Min 1Q Median 3Q Max
-2.4018 -0.5480 -0.3858 0.5277 2.4914
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 1.926e+01 5.025e+02 0.038 0.969426
PclassMiddle -1.262e+00 3.591e-01 -3.514 0.000441 ***
PclassLower -2.357e+00 3.460e-01 -6.812 9.60e-12 ***
Sexmale -1.464e+01 5.025e+02 -0.029 0.976752
Age -1.844e-02 1.105e-02 -1.669 0.095178 .
Fare -6.766e-04 2.753e-03 -0.246 0.805828
EmbarkedQ -4.807e-02 4.448e-01 -0.108 0.913938
EmbarkedS -3.543e-01 2.834e-01 -1.250 0.211237
famSize -5.942e-01 1.137e-01 -5.226 1.73e-07 ***
TitleMiss -1.483e+01 5.025e+02 -0.030 0.976454
TitleMr -3.309e+00 9.651e-01 -3.429 0.000607 ***
TitleMrs -1.414e+01 5.025e+02 -0.028 0.977559
TitleNoble -3.205e+00 1.146e+00 -2.796 0.005177 **
groupYes 1.098e+00 3.705e-01 2.964 0.003032 **
wom_chdYes -1.017e-01 8.371e-01 -0.121 0.903321
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 950.86 on 713 degrees of freedom
Residual deviance: 575.51 on 699 degrees of freedom
AIC: 605.51
Number of Fisher Scoring iterations: 13
From the model summary, we can see that the Residual Deviance is the same as my best model so far. The prediction results are the same as well.
## prediction with cross validation
LR_pred.testcv <- predict(LR_model.traincv, testing)
confusionMatrix(LR_pred.testcv, testing$Fate, positive = "Survived")Confusion Matrix and Statistics
Reference
Prediction Deceased Survived
Deceased 89 13
Survived 20 55
Accuracy : 0.8136
95% CI : (0.7483, 0.8681)
No Information Rate : 0.6158
P-Value [Acc > NIR] : 1.058e-08
Kappa : 0.6135
Mcnemar's Test P-Value : 0.2963
Sensitivity : 0.8088
Specificity : 0.8165
Pos Pred Value : 0.7333
Neg Pred Value : 0.8725
Prevalence : 0.3842
Detection Rate : 0.3107
Detection Prevalence : 0.4237
Balanced Accuracy : 0.8127
'Positive' Class : Survived
Since cross validation didn’t improve my original model, I decide to use a different kind of algorithm called Support Vector Machine, AKA SVM.
The svm function from e1071 package can build SVM models in R.
## svm
SVM_model.train <- svm(Fate ~ Pclass + Sex + Age + Fare + Embarked +
famSize + Title + group + wom_chd,
data = training, probability = TRUE)Use it to predict directly. We can see that the overall Accuracy is inreased to 83.62%, Sensitivity is the same as my Logistic Regression model, but Specificity is now increased to 85.32%.
## prediction with svm
SVM_pred.test <- predict(SVM_model.train, testing)
confusionMatrix(SVM_pred.test, testing$Fate, positive = "Survived")Confusion Matrix and Statistics
Reference
Prediction Deceased Survived
Deceased 93 13
Survived 16 55
Accuracy : 0.8362
95% CI : (0.7732, 0.8874)
No Information Rate : 0.6158
P-Value [Acc > NIR] : 1.38e-10
Kappa : 0.6566
Mcnemar's Test P-Value : 0.7103
Sensitivity : 0.8088
Specificity : 0.8532
Pos Pred Value : 0.7746
Neg Pred Value : 0.8774
Prevalence : 0.3842
Detection Rate : 0.3107
Detection Prevalence : 0.4011
Balanced Accuracy : 0.8310
'Positive' Class : Survived
Just by looking at the numbers, it is a huge improvement to my Logistic Regression models. Submitting to Kaggle, this model scored:
0.78947/1.00000
Unfortunately, the score didn’t increase. So next, I want to tune the model a little bit using the same Cross Validation technique, and see if I can improve the score.
A good thing about the e1071 package is that it also provides the tune function for tuning svm models.
## tuning
SVM_model.traincv <- tune(svm, Fate ~ Pclass + Sex + Age + Fare + Embarked +
famSize + Title + group + wom_chd,
data = training, probability = TRUE, kernel = "radial",
ranges = list(cost = c(0.1, 1, 10, 100, 1000),
gamma = c(0.5, 1, 2, 3, 4)))
## get the best model
SVM_model.traincv$best.model
Call:
best.tune(method = svm, train.x = Fate ~ Pclass + Sex + Age + Fare + Embarked + famSize + Title + group + wom_chd, data = training,
ranges = list(cost = c(0.1, 1, 10, 100, 1000), gamma = c(0.5, 1, 2, 3, 4)), probability = TRUE, kernel = "radial")
Parameters:
SVM-Type: C-classification
SVM-Kernel: radial
cost: 1
gamma: 0.5
Number of Support Vectors: 366
We can see that the best svm model is the one with cost 1 and gamma 0.5. Use the best model to predict again, the overall Accuracy is 83.05%, decreased, and Sensitivity is decreased to 76.47%, but Specificity is increased to 87.16%.
## prediction with best svm model
SVM_pred.testcv <- predict(SVM_model.traincv$best.model, testing)
confusionMatrix(SVM_pred.testcv, testing$Fate, positive = "Survived")Confusion Matrix and Statistics
Reference
Prediction Deceased Survived
Deceased 95 16
Survived 14 52
Accuracy : 0.8305
95% CI : (0.767, 0.8826)
No Information Rate : 0.6158
P-Value [Acc > NIR] : 4.325e-10
Kappa : 0.6398
Mcnemar's Test P-Value : 0.8551
Sensitivity : 0.7647
Specificity : 0.8716
Pos Pred Value : 0.7879
Neg Pred Value : 0.8559
Prevalence : 0.3842
Detection Rate : 0.2938
Detection Prevalence : 0.3729
Balanced Accuracy : 0.8181
'Positive' Class : Survived
Sensitivity is decreased a lot which is not a good sign. Submitting to Kaggle, this model only scored:
0.77990/1.00000
It becomes even lower… well, since cross validation didn’t improve my SVM model either, I decide to try something else.
Random Forest is my favorite machine learning algorithm so far. I like to use the randomForest function from the randomForest package.
## random forest
RF_model.train <- randomForest(Fate ~ Pclass + Sex + Age + Fare + Embarked +
famSize + Title + group + wom_chd,
data = training, importance = TRUE, ntree = 2000)The plot below shows the importance of the variables judged by the mean decrease accuracy. A variable is considered the most important if the accuracy of the model without it decreased the most compared to the full model.
Predict using this model, we get the same result as my best SVM model.
## prediction with random forest
RF_pred.test <- predict(RF_model.train, testing)
confusionMatrix(RF_pred.test, testing$Fate, positive = "Survived")Confusion Matrix and Statistics
Reference
Prediction Deceased Survived
Deceased 95 16
Survived 14 52
Accuracy : 0.8305
95% CI : (0.767, 0.8826)
No Information Rate : 0.6158
P-Value [Acc > NIR] : 4.325e-10
Kappa : 0.6398
Mcnemar's Test P-Value : 0.8551
Sensitivity : 0.7647
Specificity : 0.8716
Pos Pred Value : 0.7879
Neg Pred Value : 0.8559
Prevalence : 0.3842
Detection Rate : 0.2938
Detection Prevalence : 0.3729
Balanced Accuracy : 0.8181
'Positive' Class : Survived
However, after submitting to Kaggle, this model only scored:
0.77512/1.00000
which is very frustrating… but the bright side is that you cannot go any lower when you’ve hit the bottom already, right? ^_^
A good thing about Random Forest is that it is an ensemble algorithm based on Decision Tree, and cforest is a function from the party package which implements the random forest and bagging ensemble algorithms utilizing Conditional Inference Trees as base learners. They make their decisions in slightly different ways, using a statistical test rather than a purity measure, but the basic construction of each tree is fairly similar.
## conditional inference trees
CF_model.train <- cforest(Fate ~ Pclass + Sex + Age + Fare + Embarked +
famSize + Title + group + wom_chd,
data = training, controls = cforest_unbiased(ntree = 2000, mtry = 3))The prediction results show an overall Accuracy 83.62%, Sensitivity is 77.94% and Specificity is 87.16%.
## prediction with conditional inference trees
CF_pred.test <- predict(CF_model.train, testing, OOB = TRUE)
confusionMatrix(CF_pred.test, testing$Fate, positive = "Survived")Confusion Matrix and Statistics
Reference
Prediction Deceased Survived
Deceased 95 15
Survived 14 53
Accuracy : 0.8362
95% CI : (0.7732, 0.8874)
No Information Rate : 0.6158
P-Value [Acc > NIR] : 1.38e-10
Kappa : 0.6528
Mcnemar's Test P-Value : 1
Sensitivity : 0.7794
Specificity : 0.8716
Pos Pred Value : 0.7910
Neg Pred Value : 0.8636
Prevalence : 0.3842
Detection Rate : 0.2994
Detection Prevalence : 0.3785
Balanced Accuracy : 0.8255
'Positive' Class : Survived
Submiiting to Kaggle, I got my best score:
0.80383/1.00000
When the response variable is binary, Receiver Operating Characteristic, AKA ROC curve, and the Area Under the Curve (AUC) are commonly used to evaluate the model performance.
A ROC curve is composed of the ture positive rate (Sensitivity) vs. false positive rate (1 - Specificity) for each probablity threshold ranged from 0 to 1. The diagolnal line indicates the result of random guessing, for example, toss a coin. So with a binary response variable, you can never do worse than random guessing which has a AUC of 0.5.
The ROC curve and AUC of the models from part 4 are drawn below. From the comparison, we can clearly see that even though there are some differences among these algorithms, the AUC are actually very close. Random Forest has the highest AUC value, while SVM is almost the same.
From the comparison plot below, we can see that the algorithm with highest AUC is Random Forest, however its Kaggle score is the lowest; and with the highest Kaggle score, Conditional Inference Trees has the second smallest value of AUC.
This truely means something. Actually, if there is a golden rule for machine learning algorithms, it would be generalization.
In this project, I practiced:
dplyr, ggplot2, and plotly packagescaret, e1071, randomForest, and party packagesROCR package for ROC curve and AUCknitr and htmlTable packagesCitation:
Titanic: Getting Started With R by Trevor Stephens
Titanic Survival Prediction: One Approach to Deriving a Model by Curt Wehrley