Logistic Regressions and Subset Selection for the Titanic Kaggle Competition

Following a tutorial from statsguys' blog for the Titanic Kaggle Competition.

First, read in the data:

# setwd('Dropbox/Data Science/Kaggle/Titanic/')

trainData <- read.csv("~/Dropbox/Data Science/Kaggle/Titanic/train.csv", header = TRUE, 
    stringsAsFactors = FALSE)
testData <- read.csv("~/Dropbox/Data Science/Kaggle/Titanic/test.csv", header = TRUE, 
    stringsAsFactors = FALSE)
head(trainData)

##   PassengerId Survived Pclass
## 1           1        0      3
## 2           2        1      1
## 3           3        1      3
## 4           4        1      1
## 5           5        0      3
## 6           6        0      3
##                                                  Name    Sex Age SibSp
## 1                             Braund, Mr. Owen Harris   male  22     1
## 2 Cumings, Mrs. John Bradley (Florence Briggs Thayer) female  38     1
## 3                              Heikkinen, Miss. Laina female  26     0
## 4        Futrelle, Mrs. Jacques Heath (Lily May Peel) female  35     1
## 5                            Allen, Mr. William Henry   male  35     0
## 6                                    Moran, Mr. James   male  NA     0
##   Parch           Ticket   Fare Cabin Embarked
## 1     0        A/5 21171  7.250              S
## 2     0         PC 17599 71.283   C85        C
## 3     0 STON/O2. 3101282  7.925              S
## 4     0           113803 53.100  C123        S
## 5     0           373450  8.050              S
## 6     0           330877  8.458              Q

Below are the information on the variables from Kaggle:

VARIABLE DESCRIPTIONS:

survival Survival (0 = No; 1 = Yes)
pclass Passenger Class (1 = 1st; 2 = 2nd; 3 = 3rd)
name Name
sex Sex
age Age
sibsp Number of Siblings/Spouses Aboard
parch Number of Parents/Children Aboard
ticket Ticket Number
fare Passenger Fare
cabin Cabin -embarked Port of Embarkation (C = Cherbourg; Q = Queenstown; S = Southampton)

SPECIAL NOTES:

Pclass is a proxy for socio-economic status (SES) 1st ~ Upper; 2nd ~ Middle; 3rd ~ Lower
Age is in Years; Fractional if Age less than One (1) If the Age is Estimated, it is in the form xx.5
With respect to the family relation variables (i.e. sibsp and parch) some relations were ignored. The following are the definitions used for sibsp and parch.
Sibling: Brother, Sister, Stepbrother, or Stepsister of Passenger Aboard Titanic
Spouse: Husband or Wife of Passenger Aboard Titanic (Mistresses and Fiances Ignored)
Parent: Mother or Father of Passenger Aboard Titanic
Child: Son, Daughter, Stepson, or Stepdaughter of Passenger Aboard Titanic
Other family relatives excluded from this study include cousins, nephews/nieces, aunts/uncles, and in-laws. Some children travelled only with a nanny, therefore parch=0 for them. As well, some travelled with very close friends or neighbors in a village, however, the definitions do not support such relations.

Data Exploration

# par(mfrow = c(2, 3))
par(mar = rep(2, 4))
plot(density(trainData$Age, na.rm = TRUE))

plot of chunk unnamed-chunk-2

plot(density(trainData$Fare, na.rm = TRUE))

plot of chunk unnamed-chunk-2

trainData.new <- trainData[, -c(4, 9, 11, 12)]
trainData.new$Sex = ifelse(trainData.new$Sex == "male", 1, 0)

plot(trainData.new)

plot of chunk unnamed-chunk-3

Survival rate by Gender:

counts <- table(trainData$Survived, trainData$Sex)
barplot(counts, xlab = "Gender", ylab = "Number of People", main = "survived and deceased between male and female")

plot of chunk unnamed-chunk-4

counts[2]/(counts[1] + counts[2])  #Female survival rate

## [1] 0.742

counts[4]/(counts[3] + counts[4])  #Male survival rate

## [1] 0.1889

Survival rate by Class:

Pclass_survival <- table(trainData$Survived, trainData$Pclass)
barplot(Pclass_survival, xlab = "Cabin Class", ylab = "Number of People", main = "survived and deceased between male and female")

plot of chunk unnamed-chunk-5

Pclass_survival[2]/(Pclass_survival[1] + Pclass_survival[2])  #1st Class Survival Rate

## [1] 0.6296

Pclass_survival[4]/(Pclass_survival[3] + Pclass_survival[4])  #2nd Class Survival Rate

## [1] 0.4728

Pclass_survival[6]/(Pclass_survival[5] + Pclass_survival[6])  #3rd Class Survival Rate

## [1] 0.2424

Cleaning the Training Data

Remove unsed variables (PassengerID, Ticket number, Cabin (77% missing)). Decided to keep “Embarked” - this means will need to create 2 dummy variables later (because 3 places of embarkation) to account for it.

trainData = trainData[-c(1, 9, 11)]
names(trainData)

## [1] "Survived" "Pclass"   "Name"     "Sex"      "Age"      "SibSp"   
## [7] "Parch"    "Fare"     "Embarked"

Create new dummy variables for “Embarked” and deleted “Embarked” variable afterwards.

I'm using Southampton as the “not coded” group because that is by far the biggest group in “Embarked” (72%)

# Embarkation is Southampton by default
trainData["Embarked_C"] = ifelse(trainData$Embarked == "C", 1, 0)
trainData["Embarked_Q"] = ifelse(trainData$Embarked == "Q", 1, 0)
# Remove 'Embarked' variable
trainData = trainData[-9]

Dummy variable for Gender:

trainData$Sex = ifelse(trainData$Sex == "female", 1, 0)

Making inference on missing Age value by titles (Mr,Mrs,Ms,etc):

# Divide data by titles:
master_vector = grep("Master.", trainData$Name, fixed = TRUE)
miss_vector = grep("Miss.", trainData$Name, fixed = TRUE)
mrs_vector = grep("Mrs.", trainData$Name, fixed = TRUE)
mr_vector = grep("Mr.", trainData$Name, fixed = TRUE)
dr_vector = grep("Dr.", trainData$Name, fixed = TRUE)

# Rename every person by his/her title only:
for (i in master_vector) {
    trainData$Name[i] = "Master"
}
for (i in miss_vector) {
    trainData$Name[i] = "Miss"
}
for (i in mrs_vector) {
    trainData$Name[i] = "Mrs"
}
for (i in mr_vector) {
    trainData$Name[i] = "Mr"
}
for (i in dr_vector) {
    trainData$Name[i] = "Dr"
}

# Making Inference on Missing (NA) Age Values: Inputting Title-group
# averages:
master_age = round(mean(trainData$Age[trainData$Name == "Master"], na.rm = TRUE), 
    digits = 2)
miss_age = round(mean(trainData$Age[trainData$Name == "Miss"], na.rm = TRUE), 
    digits = 2)
mrs_age = round(mean(trainData$Age[trainData$Name == "Mrs"], na.rm = TRUE), 
    digits = 2)
mr_age = round(mean(trainData$Age[trainData$Name == "Mr"], na.rm = TRUE), digits = 2)
dr_age = round(mean(trainData$Age[trainData$Name == "Dr"], na.rm = TRUE), digits = 2)

for (i in 1:nrow(trainData)) {
    if (is.na(trainData[i, 5])) {
        if (trainData$Name[i] == "Master") {
            trainData$Age[i] = master_age
        } else if (trainData$Name[i] == "Miss") {
            trainData$Age[i] = miss_age
        } else if (trainData$Name[i] == "Mrs") {
            trainData$Age[i] = mrs_age
        } else if (trainData$Name[i] == "Mr") {
            trainData$Age[i] = mr_age
        } else if (trainData$Name[i] == "Dr") {
            trainData$Age[i] = dr_age
        } else {
            print("Uncaught Title")
        }
    }
}

Creating Additional Predictors

Child - hunch is that children had a higher chance of survival (1 if age<=12, 2 otherwise):

trainData["Child"] = NA
for (i in 1:nrow(trainData)) {
    if (trainData$Age[i] <= 12) {
        trainData$Child[i] = 1
    } else {
        trainData$Child[i] = 2
    }
}

Family - adding SibSp and Parch variables - hunch is that passengers in larger families had a lower chance of survival:

trainData["Family"] = NA

for (i in 1:nrow(trainData)) {
    x = trainData$SibSp[i]
    y = trainData$Parch[i]
    trainData$Family[i] = x + y + 1
}

Mother

trainData["Mother"] = NA

for (i in 1:nrow(trainData)) {
    if (trainData$Name[i] == "Mrs" & trainData$Parch[i] > 0) {
        trainData$Mother[i] = 1
    } else {
        trainData$Mother[i] = 2
    }
}

Cleaning the Test Data

Do the exact same things that were just done for the training set:

PassengerId = testData[1]
testData = testData[-c(1, 8, 10)]

# Embarkation is Southampton by default
testData["Embarked_C"] = ifelse(testData$Embarked == "C", 1, 0)
testData["Embarked_Q"] = ifelse(testData$Embarked == "Q", 1, 0)
# Remove 'Embarked' variable
testData = testData[-8]

# Gender dummy variable
testData$Sex = ifelse(testData$Sex == "female", 1, 0)

test_master_vector = grep("Master.", testData$Name)
test_miss_vector = grep("Miss.|Ms.", testData$Name)
test_mrs_vector = grep("Mrs.", testData$Name)
test_mr_vector = grep("Mr.", testData$Name)
test_dr_vector = grep("Dr.", testData$Name)

for (i in test_master_vector) {
    testData[i, 2] = "Master"
}
for (i in test_miss_vector) {
    testData[i, 2] = "Miss"
}
for (i in test_mrs_vector) {
    testData[i, 2] = "Mrs"
}
for (i in test_mr_vector) {
    testData[i, 2] = "Mr"
}
for (i in test_dr_vector) {
    testData[i, 2] = "Dr"
}

test_master_age = round(mean(testData$Age[testData$Name == "Master"], na.rm = TRUE), 
    digits = 2)
test_miss_age = round(mean(testData$Age[testData$Name == "Miss"], na.rm = TRUE), 
    digits = 2)
test_mrs_age = round(mean(testData$Age[testData$Name == "Mrs"], na.rm = TRUE), 
    digits = 2)
test_mr_age = round(mean(testData$Age[testData$Name == "Mr"], na.rm = TRUE), 
    digits = 2)
test_dr_age = round(mean(testData$Age[testData$Name == "Dr"], na.rm = TRUE), 
    digits = 2)

for (i in 1:nrow(testData)) {
    if (is.na(testData[i, 4])) {
        if (testData[i, 2] == "Master") {
            testData[i, 4] = test_master_age
        } else if (testData[i, 2] == "Miss") {
            testData[i, 4] = test_miss_age
        } else if (testData[i, 2] == "Mrs") {
            testData[i, 4] = test_mrs_age
        } else if (testData[i, 2] == "Mr") {
            testData[i, 4] = test_mr_age
        } else if (testData[i, 2] == "Dr") {
            testData[i, 4] = test_dr_age
        } else {
            print(paste("Uncaught title at: ", i, sep = ""))
            print(paste("The title unrecognized was: ", testData[i, 2], sep = ""))
        }
    }
}

testData["Child"] = ifelse(testData$Age <= 12, 1, 2)

testData["Family"] = testData$SibSp + testData$Parch + 1

testData["Mother"] = ifelse(testData$Name == "Mrs" & testData$Parch > 0, 1, 
    2)

Fitting Logistic Regression Model as First Attempt for Submission

# Create the Model using Training Set
train.glm <- glm(Survived ~ Pclass + Sex + Age + SibSp + Parch + Fare + Embarked_C + 
    Embarked_Q + Child + Mother + Sex * Pclass, family = binomial, data = trainData)

# USE SPLINES FOR FARES!!! IT'S A PIECEWISE STEP FUNCTION!! BREAK IT DOWN
# INTO BASIS FUNCTIONS. FARE DEFINITELY CONTAINS INFO - DIFFERENT FROM
# CLASS...

summary(train.glm)

## 
## Call:
## glm(formula = Survived ~ Pclass + Sex + Age + SibSp + Parch + 
##     Fare + Embarked_C + Embarked_Q + Child + Mother + Sex * Pclass, 
##     family = binomial, data = trainData)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -3.505  -0.602  -0.454   0.457   2.298  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  6.99835    1.67954    4.17  3.1e-05 ***
## Pclass      -0.78393    0.15854   -4.94  7.6e-07 ***
## Sex          6.28668    0.92621    6.79  1.1e-11 ***
## Age         -0.02652    0.00985   -2.69  0.00711 ** 
## SibSp       -0.48472    0.13328   -3.64  0.00028 ***
## Parch       -0.40966    0.18832   -2.18  0.02961 *  
## Fare         0.00277    0.00263    1.05  0.29177    
## Embarked_C   0.46544    0.24625    1.89  0.05874 .  
## Embarked_Q   0.58173    0.31378    1.85  0.06375 .  
## Child       -2.01950    0.49349   -4.09  4.3e-05 ***
## Mother      -0.98399    0.54993   -1.79  0.07357 .  
## Pclass:Sex  -1.43526    0.33961   -4.23  2.4e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1186.66  on 890  degrees of freedom
## Residual deviance:  734.99  on 879  degrees of freedom
## AIC: 759
## 
## Number of Fisher Scoring iterations: 6


# Make Predictions using the Test Set
p.hats <- predict.glm(train.glm, newdata = testData, type = "response")
survival = ifelse(p.hats > 0.5, 1, 0)

# Creating CSV for Kaggle Submission
kaggle.sub <- cbind(PassengerId, survival)
colnames(kaggle.sub) <- c("PassengerId", "Survived")
write.csv(kaggle.sub, file = "~/Dropbox/Data Science/Kaggle/Titanic/titanic_lm.csv", 
    row.names = FALSE)

Result is quite bad! (bottom of the ranking) Let's think of ways to make it better.

After changing the predictors (taking out “Family”, leaving in “SibSp”, adding back “Fare”), results improved tremendously already! “Parch” (part of “Family”) was not statistically significant. I think that “Fare” contains very significant information but need to be “adjusted” in order to be useful.

Subset Selection

library(leaps)
regfit.full = regsubsets(Survived ~ Pclass + Sex + Age + SibSp + Parch + Fare + 
    Embarked_C + Embarked_Q + Child + Mother + Sex * Pclass, data = trainData, 
    nvmax = 11)  #Best Subset Selection for ALL variables
(reg.summary = summary(regfit.full))

## Subset selection object
## Call: regsubsets.formula(Survived ~ Pclass + Sex + Age + SibSp + Parch + 
##     Fare + Embarked_C + Embarked_Q + Child + Mother + Sex * Pclass, 
##     data = trainData, nvmax = 11)
## 11 Variables  (and intercept)
##            Forced in Forced out
## Pclass         FALSE      FALSE
## Sex            FALSE      FALSE
## Age            FALSE      FALSE
## SibSp          FALSE      FALSE
## Parch          FALSE      FALSE
## Fare           FALSE      FALSE
## Embarked_C     FALSE      FALSE
## Embarked_Q     FALSE      FALSE
## Child          FALSE      FALSE
## Mother         FALSE      FALSE
## Pclass:Sex     FALSE      FALSE
## 1 subsets of each size up to 11
## Selection Algorithm: exhaustive
##           Pclass Sex Age SibSp Parch Fare Embarked_C Embarked_Q Child
## 1  ( 1 )  " "    "*" " " " "   " "   " "  " "        " "        " "  
## 2  ( 1 )  "*"    "*" " " " "   " "   " "  " "        " "        " "  
## 3  ( 1 )  "*"    "*" "*" " "   " "   " "  " "        " "        " "  
## 4  ( 1 )  "*"    "*" " " "*"   " "   " "  " "        " "        "*"  
## 5  ( 1 )  "*"    "*" " " "*"   " "   " "  " "        " "        "*"  
## 6  ( 1 )  "*"    "*" "*" "*"   " "   " "  " "        " "        "*"  
## 7  ( 1 )  "*"    "*" "*" "*"   " "   " "  "*"        " "        "*"  
## 8  ( 1 )  "*"    "*" "*" "*"   " "   " "  "*"        "*"        "*"  
## 9  ( 1 )  "*"    "*" "*" "*"   "*"   " "  "*"        " "        "*"  
## 10  ( 1 ) "*"    "*" "*" "*"   "*"   " "  "*"        "*"        "*"  
## 11  ( 1 ) "*"    "*" "*" "*"   "*"   "*"  "*"        "*"        "*"  
##           Mother Pclass:Sex
## 1  ( 1 )  " "    " "       
## 2  ( 1 )  " "    " "       
## 3  ( 1 )  " "    " "       
## 4  ( 1 )  " "    " "       
## 5  ( 1 )  " "    "*"       
## 6  ( 1 )  " "    "*"       
## 7  ( 1 )  " "    "*"       
## 8  ( 1 )  " "    "*"       
## 9  ( 1 )  "*"    "*"       
## 10  ( 1 ) "*"    "*"       
## 11  ( 1 ) "*"    "*"

names(reg.summary)

## [1] "which"  "rsq"    "rss"    "adjr2"  "cp"     "bic"    "outmat" "obj"

reg.summary$rsq  #summary of the R-squares for all 11 variables

##  [1] 0.2952 0.3677 0.3853 0.3998 0.4147 0.4206 0.4225 0.4248 0.4263 0.4284
## [11] 0.4286


# Plot RSS, adjusted r-square, Cp, BIC for all the models at once
par(mfrow = c(2, 2))
# RSS Plot
plot(reg.summary$rss, xlab = "Number of Variables", ylab = "RSS", type = "l")
# Adjusted RSq plot
plot(reg.summary$adjr2, xlab = "Number of Variables", ylab = "Adjusted RSq", 
    type = "l")
which.max(reg.summary$adjr2)

## [1] 10

points(10, reg.summary$adjr2[10], col = "red", cex = 2, pch = 20)
# Cp plot
plot(reg.summary$cp, xlab = "Number of Variables", ylab = "Cp", type = "l")
which.min(reg.summary$cp)

## [1] 10

points(10, reg.summary$cp[10], col = "red", cex = 2, pch = 20)
# BIC plot
plot(reg.summary$bic, xlab = "Number of Variables", ylab = "BIC", type = "l")
points(6, reg.summary$bic[6], col = "red", cex = 2, pch = 20)

plot of chunk unnamed-chunk-15


# Plotting built into regsubsets()
plot(regfit.full, scale = "r2")
plot(regfit.full, scale = "adjr2")
plot(regfit.full, scale = "Cp")
plot(regfit.full, scale = "bic")

plot of chunk unnamed-chunk-15


# Coefficients for these best subsets:
coef(regfit.full, 10)

## (Intercept)      Pclass         Sex         Age       SibSp       Parch 
##    1.427873   -0.130534    0.809913   -0.003859   -0.050960   -0.046746 
##  Embarked_C  Embarked_Q       Child      Mother  Pclass:Sex 
##    0.063382    0.083698   -0.284098   -0.121595   -0.141278

coef(regfit.full, 6)

## (Intercept)      Pclass         Sex         Age       SibSp       Child 
##    1.100492   -0.127182    0.833691   -0.003664   -0.062855   -0.242295 
##  Pclass:Sex 
##   -0.144821


# So now do 2 fittings: One with 10 predictor and one with 6 predictors to
# see which gives better prediction accuracy.

# Create the Model using Training Set
train.glm10 <- glm(Survived ~ Pclass + Sex + Age + SibSp + Parch + Embarked_C + 
    Embarked_Q + Child + Mother + Sex * Pclass, family = binomial, data = trainData)

train.glm6 <- glm(Survived ~ Pclass + Sex + Age + SibSp + Child + Sex * Pclass, 
    family = binomial, data = trainData)

# Make Predictions using the Test Set
p.hats10 <- predict.glm(train.glm10, newdata = testData, type = "response")
survival10 = ifelse(p.hats10 > 0.5, 1, 0)
p.hats6 <- predict.glm(train.glm6, newdata = testData, type = "response")
survival6 = ifelse(p.hats6 > 0.5, 1, 0)

# Creating CSV for Kaggle Submission
kaggle.sub <- cbind(PassengerId, survival10)
colnames(kaggle.sub) <- c("PassengerId", "Survived")
write.csv(kaggle.sub, file = "~/Dropbox/Data Science/Kaggle/Titanic/titanic_glm_10.csv", 
    row.names = FALSE)

kaggle.sub <- cbind(PassengerId, survival6)
colnames(kaggle.sub) <- c("PassengerId", "Survived")
write.csv(kaggle.sub, file = "~/Dropbox/Data Science/Kaggle/Titanic/titanic_glm_6.csv", 
    row.names = FALSE)

Awesome! Best Subset Selection (using 10 variables) actually improved prediction substantially! Now it's the highest rank overall - beating even the previous Boosting method. Need to retest Boosting and other previously promising methods using the new predictors.

Best subset using 6 variables gave even BETTER results!!

6 Variables of interest are Pclass + Sex + Age + SibSp + Child + Sex*Pclass

It's interesting to note that Embarkation didn't provide much information.

Ok. Onto Regularization methods to see if they can improve results or just stick with 6 variables.