Following a tutorial from statsguys' blog for the Titanic Kaggle Competition.
First, read in the data:
# setwd('Dropbox/Data Science/Kaggle/Titanic/')
trainData <- read.csv("~/Dropbox/Data Science/Kaggle/Titanic/train.csv", header = TRUE,
stringsAsFactors = FALSE)
testData <- read.csv("~/Dropbox/Data Science/Kaggle/Titanic/test.csv", header = TRUE,
stringsAsFactors = FALSE)
head(trainData)
## PassengerId Survived Pclass
## 1 1 0 3
## 2 2 1 1
## 3 3 1 3
## 4 4 1 1
## 5 5 0 3
## 6 6 0 3
## Name Sex Age SibSp
## 1 Braund, Mr. Owen Harris male 22 1
## 2 Cumings, Mrs. John Bradley (Florence Briggs Thayer) female 38 1
## 3 Heikkinen, Miss. Laina female 26 0
## 4 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35 1
## 5 Allen, Mr. William Henry male 35 0
## 6 Moran, Mr. James male NA 0
## Parch Ticket Fare Cabin Embarked
## 1 0 A/5 21171 7.250 S
## 2 0 PC 17599 71.283 C85 C
## 3 0 STON/O2. 3101282 7.925 S
## 4 0 113803 53.100 C123 S
## 5 0 373450 8.050 S
## 6 0 330877 8.458 Q
Below are the information on the variables from Kaggle:
VARIABLE DESCRIPTIONS:
SPECIAL NOTES:
Pclass is a proxy for socio-economic status (SES) 1st ~ Upper; 2nd ~ Middle; 3rd ~ Lower
Age is in Years; Fractional if Age less than One (1) If the Age is Estimated, it is in the form xx.5
With respect to the family relation variables (i.e. sibsp and parch) some relations were ignored. The following are the definitions used for sibsp and parch.
Sibling: Brother, Sister, Stepbrother, or Stepsister of Passenger Aboard Titanic
Spouse: Husband or Wife of Passenger Aboard Titanic (Mistresses and Fiances Ignored)
Parent: Mother or Father of Passenger Aboard Titanic
Child: Son, Daughter, Stepson, or Stepdaughter of Passenger Aboard Titanic
Other family relatives excluded from this study include cousins, nephews/nieces, aunts/uncles, and in-laws. Some children travelled only with a nanny, therefore parch=0 for them. As well, some travelled with very close friends or neighbors in a village, however, the definitions do not support such relations.
# par(mfrow = c(2, 3))
par(mar = rep(2, 4))
plot(density(trainData$Age, na.rm = TRUE))
plot(density(trainData$Fare, na.rm = TRUE))
trainData.new <- trainData[, -c(4, 9, 11, 12)]
trainData.new$Sex = ifelse(trainData.new$Sex == "male", 1, 0)
plot(trainData.new)
Survival rate by Gender:
counts <- table(trainData$Survived, trainData$Sex)
barplot(counts, xlab = "Gender", ylab = "Number of People", main = "survived and deceased between male and female")
counts[2]/(counts[1] + counts[2]) #Female survival rate
## [1] 0.742
counts[4]/(counts[3] + counts[4]) #Male survival rate
## [1] 0.1889
Survival rate by Class:
Pclass_survival <- table(trainData$Survived, trainData$Pclass)
barplot(Pclass_survival, xlab = "Cabin Class", ylab = "Number of People", main = "survived and deceased between male and female")
Pclass_survival[2]/(Pclass_survival[1] + Pclass_survival[2]) #1st Class Survival Rate
## [1] 0.6296
Pclass_survival[4]/(Pclass_survival[3] + Pclass_survival[4]) #2nd Class Survival Rate
## [1] 0.4728
Pclass_survival[6]/(Pclass_survival[5] + Pclass_survival[6]) #3rd Class Survival Rate
## [1] 0.2424
Remove unsed variables (PassengerID, Ticket number, Cabin (77% missing)). Decided to keep “Embarked” - this means will need to create 2 dummy variables later (because 3 places of embarkation) to account for it.
trainData = trainData[-c(1, 9, 11)]
names(trainData)
## [1] "Survived" "Pclass" "Name" "Sex" "Age" "SibSp"
## [7] "Parch" "Fare" "Embarked"
Create new dummy variables for “Embarked” and deleted “Embarked” variable afterwards.
# Embarkation is Southampton by default
trainData["Embarked_C"] = ifelse(trainData$Embarked == "C", 1, 0)
trainData["Embarked_Q"] = ifelse(trainData$Embarked == "Q", 1, 0)
# Remove 'Embarked' variable
trainData = trainData[-9]
Dummy variable for Gender:
trainData$Sex = ifelse(trainData$Sex == "female", 1, 0)
Making inference on missing Age value by titles (Mr,Mrs,Ms,etc):
# Divide data by titles:
master_vector = grep("Master.", trainData$Name, fixed = TRUE)
miss_vector = grep("Miss.", trainData$Name, fixed = TRUE)
mrs_vector = grep("Mrs.", trainData$Name, fixed = TRUE)
mr_vector = grep("Mr.", trainData$Name, fixed = TRUE)
dr_vector = grep("Dr.", trainData$Name, fixed = TRUE)
# Rename every person by his/her title only:
for (i in master_vector) {
trainData$Name[i] = "Master"
}
for (i in miss_vector) {
trainData$Name[i] = "Miss"
}
for (i in mrs_vector) {
trainData$Name[i] = "Mrs"
}
for (i in mr_vector) {
trainData$Name[i] = "Mr"
}
for (i in dr_vector) {
trainData$Name[i] = "Dr"
}
# Making Inference on Missing (NA) Age Values: Inputting Title-group
# averages:
master_age = round(mean(trainData$Age[trainData$Name == "Master"], na.rm = TRUE),
digits = 2)
miss_age = round(mean(trainData$Age[trainData$Name == "Miss"], na.rm = TRUE),
digits = 2)
mrs_age = round(mean(trainData$Age[trainData$Name == "Mrs"], na.rm = TRUE),
digits = 2)
mr_age = round(mean(trainData$Age[trainData$Name == "Mr"], na.rm = TRUE), digits = 2)
dr_age = round(mean(trainData$Age[trainData$Name == "Dr"], na.rm = TRUE), digits = 2)
for (i in 1:nrow(trainData)) {
if (is.na(trainData[i, 5])) {
if (trainData$Name[i] == "Master") {
trainData$Age[i] = master_age
} else if (trainData$Name[i] == "Miss") {
trainData$Age[i] = miss_age
} else if (trainData$Name[i] == "Mrs") {
trainData$Age[i] = mrs_age
} else if (trainData$Name[i] == "Mr") {
trainData$Age[i] = mr_age
} else if (trainData$Name[i] == "Dr") {
trainData$Age[i] = dr_age
} else {
print("Uncaught Title")
}
}
}
Child - hunch is that children had a higher chance of survival (1 if age<=12, 2 otherwise):
trainData["Child"] = NA
for (i in 1:nrow(trainData)) {
if (trainData$Age[i] <= 12) {
trainData$Child[i] = 1
} else {
trainData$Child[i] = 2
}
}
Family - adding SibSp
and Parch
variables - hunch is that passengers in larger families had a lower chance of survival:
trainData["Family"] = NA
for (i in 1:nrow(trainData)) {
x = trainData$SibSp[i]
y = trainData$Parch[i]
trainData$Family[i] = x + y + 1
}
Mother
trainData["Mother"] = NA
for (i in 1:nrow(trainData)) {
if (trainData$Name[i] == "Mrs" & trainData$Parch[i] > 0) {
trainData$Mother[i] = 1
} else {
trainData$Mother[i] = 2
}
}
Do the exact same things that were just done for the training set:
PassengerId = testData[1]
testData = testData[-c(1, 8, 10)]
# Embarkation is Southampton by default
testData["Embarked_C"] = ifelse(testData$Embarked == "C", 1, 0)
testData["Embarked_Q"] = ifelse(testData$Embarked == "Q", 1, 0)
# Remove 'Embarked' variable
testData = testData[-8]
# Gender dummy variable
testData$Sex = ifelse(testData$Sex == "female", 1, 0)
test_master_vector = grep("Master.", testData$Name)
test_miss_vector = grep("Miss.|Ms.", testData$Name)
test_mrs_vector = grep("Mrs.", testData$Name)
test_mr_vector = grep("Mr.", testData$Name)
test_dr_vector = grep("Dr.", testData$Name)
for (i in test_master_vector) {
testData[i, 2] = "Master"
}
for (i in test_miss_vector) {
testData[i, 2] = "Miss"
}
for (i in test_mrs_vector) {
testData[i, 2] = "Mrs"
}
for (i in test_mr_vector) {
testData[i, 2] = "Mr"
}
for (i in test_dr_vector) {
testData[i, 2] = "Dr"
}
test_master_age = round(mean(testData$Age[testData$Name == "Master"], na.rm = TRUE),
digits = 2)
test_miss_age = round(mean(testData$Age[testData$Name == "Miss"], na.rm = TRUE),
digits = 2)
test_mrs_age = round(mean(testData$Age[testData$Name == "Mrs"], na.rm = TRUE),
digits = 2)
test_mr_age = round(mean(testData$Age[testData$Name == "Mr"], na.rm = TRUE),
digits = 2)
test_dr_age = round(mean(testData$Age[testData$Name == "Dr"], na.rm = TRUE),
digits = 2)
for (i in 1:nrow(testData)) {
if (is.na(testData[i, 4])) {
if (testData[i, 2] == "Master") {
testData[i, 4] = test_master_age
} else if (testData[i, 2] == "Miss") {
testData[i, 4] = test_miss_age
} else if (testData[i, 2] == "Mrs") {
testData[i, 4] = test_mrs_age
} else if (testData[i, 2] == "Mr") {
testData[i, 4] = test_mr_age
} else if (testData[i, 2] == "Dr") {
testData[i, 4] = test_dr_age
} else {
print(paste("Uncaught title at: ", i, sep = ""))
print(paste("The title unrecognized was: ", testData[i, 2], sep = ""))
}
}
}
testData["Child"] = ifelse(testData$Age <= 12, 1, 2)
testData["Family"] = testData$SibSp + testData$Parch + 1
testData["Mother"] = ifelse(testData$Name == "Mrs" & testData$Parch > 0, 1,
2)
# Create the Model using Training Set
train.glm <- glm(Survived ~ Pclass + Sex + Age + SibSp + Parch + Fare + Embarked_C +
Embarked_Q + Child + Mother + Sex * Pclass, family = binomial, data = trainData)
# USE SPLINES FOR FARES!!! IT'S A PIECEWISE STEP FUNCTION!! BREAK IT DOWN
# INTO BASIS FUNCTIONS. FARE DEFINITELY CONTAINS INFO - DIFFERENT FROM
# CLASS...
summary(train.glm)
##
## Call:
## glm(formula = Survived ~ Pclass + Sex + Age + SibSp + Parch +
## Fare + Embarked_C + Embarked_Q + Child + Mother + Sex * Pclass,
## family = binomial, data = trainData)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.505 -0.602 -0.454 0.457 2.298
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 6.99835 1.67954 4.17 3.1e-05 ***
## Pclass -0.78393 0.15854 -4.94 7.6e-07 ***
## Sex 6.28668 0.92621 6.79 1.1e-11 ***
## Age -0.02652 0.00985 -2.69 0.00711 **
## SibSp -0.48472 0.13328 -3.64 0.00028 ***
## Parch -0.40966 0.18832 -2.18 0.02961 *
## Fare 0.00277 0.00263 1.05 0.29177
## Embarked_C 0.46544 0.24625 1.89 0.05874 .
## Embarked_Q 0.58173 0.31378 1.85 0.06375 .
## Child -2.01950 0.49349 -4.09 4.3e-05 ***
## Mother -0.98399 0.54993 -1.79 0.07357 .
## Pclass:Sex -1.43526 0.33961 -4.23 2.4e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1186.66 on 890 degrees of freedom
## Residual deviance: 734.99 on 879 degrees of freedom
## AIC: 759
##
## Number of Fisher Scoring iterations: 6
# Make Predictions using the Test Set
p.hats <- predict.glm(train.glm, newdata = testData, type = "response")
survival = ifelse(p.hats > 0.5, 1, 0)
# Creating CSV for Kaggle Submission
kaggle.sub <- cbind(PassengerId, survival)
colnames(kaggle.sub) <- c("PassengerId", "Survived")
write.csv(kaggle.sub, file = "~/Dropbox/Data Science/Kaggle/Titanic/titanic_lm.csv",
row.names = FALSE)
Result is quite bad! (bottom of the ranking) Let's think of ways to make it better.
After changing the predictors (taking out “Family”, leaving in “SibSp”, adding back “Fare”), results improved tremendously already! “Parch” (part of “Family”) was not statistically significant. I think that “Fare” contains very significant information but need to be “adjusted” in order to be useful.
Subset Selection
library(leaps)
regfit.full = regsubsets(Survived ~ Pclass + Sex + Age + SibSp + Parch + Fare +
Embarked_C + Embarked_Q + Child + Mother + Sex * Pclass, data = trainData,
nvmax = 11) #Best Subset Selection for ALL variables
(reg.summary = summary(regfit.full))
## Subset selection object
## Call: regsubsets.formula(Survived ~ Pclass + Sex + Age + SibSp + Parch +
## Fare + Embarked_C + Embarked_Q + Child + Mother + Sex * Pclass,
## data = trainData, nvmax = 11)
## 11 Variables (and intercept)
## Forced in Forced out
## Pclass FALSE FALSE
## Sex FALSE FALSE
## Age FALSE FALSE
## SibSp FALSE FALSE
## Parch FALSE FALSE
## Fare FALSE FALSE
## Embarked_C FALSE FALSE
## Embarked_Q FALSE FALSE
## Child FALSE FALSE
## Mother FALSE FALSE
## Pclass:Sex FALSE FALSE
## 1 subsets of each size up to 11
## Selection Algorithm: exhaustive
## Pclass Sex Age SibSp Parch Fare Embarked_C Embarked_Q Child
## 1 ( 1 ) " " "*" " " " " " " " " " " " " " "
## 2 ( 1 ) "*" "*" " " " " " " " " " " " " " "
## 3 ( 1 ) "*" "*" "*" " " " " " " " " " " " "
## 4 ( 1 ) "*" "*" " " "*" " " " " " " " " "*"
## 5 ( 1 ) "*" "*" " " "*" " " " " " " " " "*"
## 6 ( 1 ) "*" "*" "*" "*" " " " " " " " " "*"
## 7 ( 1 ) "*" "*" "*" "*" " " " " "*" " " "*"
## 8 ( 1 ) "*" "*" "*" "*" " " " " "*" "*" "*"
## 9 ( 1 ) "*" "*" "*" "*" "*" " " "*" " " "*"
## 10 ( 1 ) "*" "*" "*" "*" "*" " " "*" "*" "*"
## 11 ( 1 ) "*" "*" "*" "*" "*" "*" "*" "*" "*"
## Mother Pclass:Sex
## 1 ( 1 ) " " " "
## 2 ( 1 ) " " " "
## 3 ( 1 ) " " " "
## 4 ( 1 ) " " " "
## 5 ( 1 ) " " "*"
## 6 ( 1 ) " " "*"
## 7 ( 1 ) " " "*"
## 8 ( 1 ) " " "*"
## 9 ( 1 ) "*" "*"
## 10 ( 1 ) "*" "*"
## 11 ( 1 ) "*" "*"
names(reg.summary)
## [1] "which" "rsq" "rss" "adjr2" "cp" "bic" "outmat" "obj"
reg.summary$rsq #summary of the R-squares for all 11 variables
## [1] 0.2952 0.3677 0.3853 0.3998 0.4147 0.4206 0.4225 0.4248 0.4263 0.4284
## [11] 0.4286
# Plot RSS, adjusted r-square, Cp, BIC for all the models at once
par(mfrow = c(2, 2))
# RSS Plot
plot(reg.summary$rss, xlab = "Number of Variables", ylab = "RSS", type = "l")
# Adjusted RSq plot
plot(reg.summary$adjr2, xlab = "Number of Variables", ylab = "Adjusted RSq",
type = "l")
which.max(reg.summary$adjr2)
## [1] 10
points(10, reg.summary$adjr2[10], col = "red", cex = 2, pch = 20)
# Cp plot
plot(reg.summary$cp, xlab = "Number of Variables", ylab = "Cp", type = "l")
which.min(reg.summary$cp)
## [1] 10
points(10, reg.summary$cp[10], col = "red", cex = 2, pch = 20)
# BIC plot
plot(reg.summary$bic, xlab = "Number of Variables", ylab = "BIC", type = "l")
points(6, reg.summary$bic[6], col = "red", cex = 2, pch = 20)
# Plotting built into regsubsets()
plot(regfit.full, scale = "r2")
plot(regfit.full, scale = "adjr2")
plot(regfit.full, scale = "Cp")
plot(regfit.full, scale = "bic")
# Coefficients for these best subsets:
coef(regfit.full, 10)
## (Intercept) Pclass Sex Age SibSp Parch
## 1.427873 -0.130534 0.809913 -0.003859 -0.050960 -0.046746
## Embarked_C Embarked_Q Child Mother Pclass:Sex
## 0.063382 0.083698 -0.284098 -0.121595 -0.141278
coef(regfit.full, 6)
## (Intercept) Pclass Sex Age SibSp Child
## 1.100492 -0.127182 0.833691 -0.003664 -0.062855 -0.242295
## Pclass:Sex
## -0.144821
# So now do 2 fittings: One with 10 predictor and one with 6 predictors to
# see which gives better prediction accuracy.
# Create the Model using Training Set
train.glm10 <- glm(Survived ~ Pclass + Sex + Age + SibSp + Parch + Embarked_C +
Embarked_Q + Child + Mother + Sex * Pclass, family = binomial, data = trainData)
train.glm6 <- glm(Survived ~ Pclass + Sex + Age + SibSp + Child + Sex * Pclass,
family = binomial, data = trainData)
# Make Predictions using the Test Set
p.hats10 <- predict.glm(train.glm10, newdata = testData, type = "response")
survival10 = ifelse(p.hats10 > 0.5, 1, 0)
p.hats6 <- predict.glm(train.glm6, newdata = testData, type = "response")
survival6 = ifelse(p.hats6 > 0.5, 1, 0)
# Creating CSV for Kaggle Submission
kaggle.sub <- cbind(PassengerId, survival10)
colnames(kaggle.sub) <- c("PassengerId", "Survived")
write.csv(kaggle.sub, file = "~/Dropbox/Data Science/Kaggle/Titanic/titanic_glm_10.csv",
row.names = FALSE)
kaggle.sub <- cbind(PassengerId, survival6)
colnames(kaggle.sub) <- c("PassengerId", "Survived")
write.csv(kaggle.sub, file = "~/Dropbox/Data Science/Kaggle/Titanic/titanic_glm_6.csv",
row.names = FALSE)
Awesome! Best Subset Selection (using 10 variables) actually improved prediction substantially! Now it's the highest rank overall - beating even the previous Boosting method. Need to retest Boosting and other previously promising methods using the new predictors.
Best subset using 6 variables gave even BETTER results!!
6 Variables of interest are Pclass + Sex + Age + SibSp + Child + Sex*Pclass
It's interesting to note that Embarkation didn't provide much information.
Ok. Onto Regularization methods to see if they can improve results or just stick with 6 variables.