Exploring Non-linearity and Interaction Terms for Kaggle Titanic Competition

Assuming the 6 variables identified with the Best Subset Selection method are good, I want to test the following:

  1. Non-linearity in any of the 6 variables
  2. Test out all interaction terms to see which are relevant
  3. What if I added the “Fare” variable back and did a “spline”/step-function for it? Would that make “Fare” relevant? Does “Fare” have some interaction with Pclass? (A quick sketch of the step-function idea follows this list.)
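
For item 3, here is a minimal sketch of what the step-function test could look like, assuming trainData carries the Fare and Pclass columns; the breakpoints are illustrative guesses, not tuned values.

# Hypothetical sketch for item 3: bin 'Fare' into steps and see if any step matters.
# Breakpoints are illustrative, not tuned.
trainData$FareBin <- cut(trainData$Fare, breaks = c(-Inf, 10, 30, 100, Inf))
fit.fare <- glm(Survived ~ Pclass + Sex + Age + SibSp + Child + FareBin, 
    family = binomial, data = trainData)
summary(fit.fare)  # any significant FareBin steps?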

Non-linearity

# Create the Model using Training Set
train.glm <- glm(Survived ~ Pclass + Sex + Age + SibSp + Child + Sex * Pclass, 
    family = binomial, data = trainData)
summary(train.glm)
## 
## Call:
## glm(formula = Survived ~ Pclass + Sex + Age + SibSp + Child + 
##     Sex * Pclass, family = binomial, data = trainData)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -3.565  -0.591  -0.471   0.428   2.537  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  4.47701    0.84708    5.29  1.3e-07 ***
## Pclass      -0.84312    0.13939   -6.05  1.5e-09 ***
## Sex          6.28883    0.88987    7.07  1.6e-12 ***
## Age         -0.02807    0.00953   -2.94  0.00323 ** 
## SibSp       -0.54917    0.12818   -4.28  1.8e-05 ***
## Child       -1.55943    0.44559   -3.50  0.00047 ***
## Pclass:Sex  -1.40634    0.32661   -4.31  1.7e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1186.66  on 890  degrees of freedom
## Residual deviance:  747.85  on 884  degrees of freedom
## AIC: 761.8
## 
## Number of Fisher Scoring iterations: 6
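
As a quick interpretability check (a sketch, not part of the original output), the logistic coefficients above can be exponentiated into odds ratios with Wald confidence intervals:

# Odds ratios with 95% Wald confidence intervals for train.glm.
exp(cbind(OR = coef(train.glm), confint.default(train.glm)))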

# Test 'Pclass' first (degree 2 at most: 'degree' must be less than the
# number of unique points, and Pclass has only 3 unique values)
fit.1 = lm(Survived ~ Pclass + Sex + Age + SibSp + Child + Sex * Pclass, family = binomial, 
    data = trainData)
fit.2 = lm(Survived ~ poly(Pclass, 2) + Sex + Age + SibSp + Child + Sex * Pclass, 
    family = binomial, data = trainData)
anova(fit.1, fit.2)
## Analysis of Variance Table
## 
## Model 1: Survived ~ Pclass + Sex + Age + SibSp + Child + Sex * Pclass
## Model 2: Survived ~ poly(Pclass, 2) + Sex + Age + SibSp + Child + Sex * 
##     Pclass
##   Res.Df RSS Df Sum of Sq  F Pr(>F)
## 1    884 122                       
## 2    883 122  1  0.000213  0   0.97
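
A caveat on these comparisons: lm() silently ignores the family = binomial argument, so the fit.X models in this section are linear probability models and the ANOVA F-tests are on RSS. Here is a sketch of the same Pclass comparison done on the logistic scale instead, using glm() and a chi-squared deviance test:

# Same comparison on the logistic scale: glm() plus a likelihood-ratio test.
g.1 <- glm(Survived ~ Pclass + Sex + Age + SibSp + Child + Sex * Pclass, 
    family = binomial, data = trainData)
g.2 <- glm(Survived ~ poly(Pclass, 2) + Sex + Age + SibSp + Child + Sex * Pclass, 
    family = binomial, data = trainData)
anova(g.1, g.2, test = "Chisq")  # deviance drop from the quadratic Pclass term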

# Test 'Age'
fit.1 = lm(Survived ~ Pclass + Sex + Age + SibSp + Child + Sex * Pclass, family = binomial, 
    data = trainData)
fit.2 = lm(Survived ~ Pclass + Sex + poly(Age, 2) + SibSp + Child + Sex * Pclass, 
    family = binomial, data = trainData)
fit.3 = lm(Survived ~ Pclass + Sex + poly(Age, 3) + SibSp + Child + Sex * Pclass, 
    family = binomial, data = trainData)
fit.4 = lm(Survived ~ Pclass + Sex + poly(Age, 4) + SibSp + Child + Sex * Pclass, 
    family = binomial, data = trainData)
fit.5 = lm(Survived ~ Pclass + Sex + poly(Age, 5) + SibSp + Child + Sex * Pclass, 
    family = binomial, data = trainData)
anova(fit.1, fit.2, fit.3, fit.4, fit.5)
## Analysis of Variance Table
## 
## Model 1: Survived ~ Pclass + Sex + Age + SibSp + Child + Sex * Pclass
## Model 2: Survived ~ Pclass + Sex + poly(Age, 2) + SibSp + Child + Sex * 
##     Pclass
## Model 3: Survived ~ Pclass + Sex + poly(Age, 3) + SibSp + Child + Sex * 
##     Pclass
## Model 4: Survived ~ Pclass + Sex + poly(Age, 4) + SibSp + Child + Sex * 
##     Pclass
## Model 5: Survived ~ Pclass + Sex + poly(Age, 5) + SibSp + Child + Sex * 
##     Pclass
##   Res.Df RSS Df Sum of Sq    F Pr(>F)   
## 1    884 122                            
## 2    883 122  1     0.023 0.17 0.6824   
## 3    882 122  1     0.127 0.93 0.3355   
## 4    881 121  1     1.340 9.78 0.0018 **
## 5    880 121  1     0.015 0.11 0.7414   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

plot(trainData$Age, trainData$Survived)

[Figure: scatterplot of Survived against Age]

par(mfrow = c(1, 2))
plot(density(trainData$Age[trainData$Survived == 1], na.rm = TRUE), main = "Distr. of Age v Survivors")
plot(density(trainData$Age[trainData$Survived == 0], na.rm = TRUE), main = "Distr. of Age v Non-Survivors")

[Figure: density of Age for survivors (left) and non-survivors (right)]

Survival seems to alternate by age band: children (survived), 10-20s (didn't), 20-30s (survived), 30-40s (didn't). That's presumably why Age is non-linear. It also suggests the pattern isn't arbitrary; these were human decisions, and perhaps that's why Decision Trees would be a better fit than tinkering with non-linearity in logistic regressions. We shall see.
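
One way to check this banding story directly, instead of going through poly(), is a step function over those age bands (a sketch; the breakpoints are just the eyeballed bands above, not tuned):

# Step function over the eyeballed age bands (illustrative breakpoints).
trainData$AgeBand <- cut(trainData$Age, breaks = c(0, 10, 20, 30, 40, Inf))
fit.band <- glm(Survived ~ Pclass + Sex + AgeBand + SibSp + Child + Sex * Pclass, 
    family = binomial, data = trainData)
summary(fit.band)  # do the band coefficients alternate like the densities suggest?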

# Test 'SibSp'
fit.1 = lm(Survived ~ Pclass + Sex + Age + SibSp + Child + Sex * Pclass, family = binomial, 
    data = trainData)
fit.2 = lm(Survived ~ Pclass + Sex + Age + poly(SibSp, 2) + Child + Sex * Pclass, 
    family = binomial, data = trainData)
fit.3 = lm(Survived ~ Pclass + Sex + Age + poly(SibSp, 3) + Child + Sex * Pclass, 
    family = binomial, data = trainData)
fit.4 = lm(Survived ~ Pclass + Sex + Age + poly(SibSp, 4) + Child + Sex * Pclass, 
    family = binomial, data = trainData)
fit.5 = lm(Survived ~ Pclass + Sex + Age + poly(SibSp, 5) + Child + Sex * Pclass, 
    family = binomial, data = trainData)
anova(fit.1, fit.2, fit.3, fit.4, fit.5)
## Analysis of Variance Table
## 
## Model 1: Survived ~ Pclass + Sex + Age + SibSp + Child + Sex * Pclass
## Model 2: Survived ~ Pclass + Sex + Age + poly(SibSp, 2) + Child + Sex * 
##     Pclass
## Model 3: Survived ~ Pclass + Sex + Age + poly(SibSp, 3) + Child + Sex * 
##     Pclass
## Model 4: Survived ~ Pclass + Sex + Age + poly(SibSp, 4) + Child + Sex * 
##     Pclass
## Model 5: Survived ~ Pclass + Sex + Age + poly(SibSp, 5) + Child + Sex * 
##     Pclass
##   Res.Df RSS Df Sum of Sq     F  Pr(>F)    
## 1    884 122                               
## 2    883 122  1     0.001  0.01 0.94129    
## 3    882 120  1     1.678 12.29 0.00048 ***
## 4    881 120  1     0.154  1.13 0.28798    
## 5    880 120  1     0.107  0.78 0.37689    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

plot(trainData$SibSp, trainData$Survived)

[Figure: scatterplot of Survived against SibSp]

par(mfrow = c(1, 2))
plot(density(trainData$SibSp[trainData$Survived == 1], na.rm = TRUE), main = "Distr. of Siblings/Spouse v Survivors")
plot(density(trainData$SibSp[trainData$Survived == 0], na.rm = TRUE), main = "Distr. of Siblings/Spouse v Non-Survivors")

[Figure: density of SibSp for survivors (left) and non-survivors (right)]

Even though this looks very significant, there's no obvious reason why the Sib/Spouse variable should be non-linear (the survivor and non-survivor groups appear to be distributed similarly over SibSp).
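
A quick empirical sanity check (a sketch): tabulate the survival rate at each SibSp value. If the cubic term is real, the rate shouldn't vary monotonically with SibSp, and the counts show how much data backs each value:

# Survival rate and group size at each SibSp value.
tapply(trainData$Survived, trainData$SibSp, mean)
table(trainData$SibSp)  # most passengers have 0 or 1 siblings/spouses aboard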

# Test 'Sex*Pclass'
fit.1 = lm(Survived ~ Pclass + Sex + Age + SibSp + Child + Sex * Pclass, family = binomial, 
    data = trainData)
fit.2 = lm(Survived ~ Pclass + Sex + Age + SibSp + Child + poly(Sex * Pclass, 
    2), family = binomial, data = trainData)
fit.3 = lm(Survived ~ Pclass + Sex + Age + SibSp + Child + poly(Sex * Pclass, 
    3), family = binomial, data = trainData)
fit.4 = lm(Survived ~ Pclass + Sex + Age + SibSp + Child + poly(Sex * Pclass, 
    4), family = binomial, data = trainData)
## Error: 'degree' must be less than number of unique points
anova(fit.1, fit.2, fit.3, fit.4)
## Analysis of Variance Table
## 
## Model 1: Survived ~ Pclass + Sex + Age + SibSp + Child + Sex * Pclass
## Model 2: Survived ~ Pclass + Sex + Age + SibSp + Child + poly(Sex * Pclass, 
##     2)
## Model 3: Survived ~ Pclass + Sex + Age + SibSp + Child + poly(Sex * Pclass, 
##     3)
## Model 4: Survived ~ Pclass + Sex + Age + poly(SibSp, 4) + Child + Sex * 
##     Pclass
##   Res.Df RSS Df Sum of Sq     F  Pr(>F)    
## 1    884 122                               
## 2    883 120  1     1.596 11.69 0.00066 ***
## 3    883 120  0     0.000                  
## 4    881 120  2     0.237  0.87 0.41937    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Again, VERY SIGNIFICANT for using non-linearity. (One caveat: the degree-4 fit errored out, so anova() reused the stale fit.4 from the SibSp chunk; that's why Model 4 in the table above reads poly(SibSp, 4).) But there's no obvious reason why Sex*Pclass should be non-linear…
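
One clue (a sketch, assuming Sex is coded 0/1 and Pclass 1/2/3): poly(Sex * Pclass, k) treats the product as a single numeric variable, so it has only a handful of unique values. That's exactly why degree 4 errored out above:

# The product variable has few unique values (assuming Sex is 0/1 and
# Pclass is 1/2/3), which caps the usable polynomial degree.
table(trainData$Sex * trainData$Pclass)
length(unique(trainData$Sex * trainData$Pclass))  # 'degree' must be less than this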

So for non-linearity, we'll try out: 4th degree for Age, 3rd degree for SibSp, and 2nd degree for Sex*Pclass.

This could be a case of overfitting! But we'll check the results and see…

After adding these non-linear terms, it turns out \( Age^{4} \) is statistically insignificant, and ANOVA shows the non-linear model doesn't improve on the linear one (with the same 6 variables). I'm going to give it a try anyway and see if it improves the prediction (the model below keeps only the cubic SibSp term).

# Create the Model using Training Set
train.glm.poly <- glm(Survived ~ Pclass + Sex + Age + I(SibSp^3) + Child + Sex * 
    Pclass, family = binomial, data = trainData)
summary(train.glm.poly)
## 
## Call:
## glm(formula = Survived ~ Pclass + Sex + Age + I(SibSp^3) + Child + 
##     Sex * Pclass, family = binomial, data = trainData)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -3.637  -0.627  -0.469   0.447   2.513  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  4.45544    0.85592    5.21  1.9e-07 ***
## Pclass      -0.80608    0.13997   -5.76  8.5e-09 ***
## Sex          6.07230    0.88212    6.88  5.8e-12 ***
## Age         -0.02759    0.00955   -2.89  0.00385 ** 
## I(SibSp^3)  -0.05306    0.01333   -3.98  6.9e-05 ***
## Child       -1.64700    0.45879   -3.59  0.00033 ***
## Pclass:Sex  -1.35473    0.32556   -4.16  3.2e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1186.66  on 890  degrees of freedom
## Residual deviance:  741.45  on 884  degrees of freedom
## AIC: 755.5
## 
## Number of Fisher Scoring iterations: 7
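
The two glm fits aren't nested (linear SibSp vs. SibSp^3), so rather than a likelihood-ratio test, a quick check is to compare AICs directly; this just restates the drop from 761.8 to 755.5 visible in the two summaries.

# Non-nested models: compare by AIC instead of a likelihood-ratio test.
AIC(train.glm, train.glm.poly)  # lower AIC favors the SibSp^3 model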

# Make Predictions using the Test Set
p.hats.poly <- predict.glm(train.glm.poly, newdata = testData, type = "response")
survival = ifelse(p.hats.poly > 0.5, 1, 0)

# Creating CSV for Kaggle Submission
kaggle.sub <- cbind(PassengerId, survival)
colnames(kaggle.sub) <- c("PassengerId", "Survived")
write.csv(kaggle.sub, file = "~/Dropbox/Data Science/Kaggle/Titanic/titanic_glm_poly2.csv", 
    row.names = FALSE)

This is CRAZY!!! Improved results SIGNIFICANTLY!! WHAT?? Moved up 451 positions - now I'm in Top 250??

Took out the non-linearity in the Sex*Pclass interaction term and it didn't hurt much at all. GOOD!

So non-linearity for the Sib/Spouse variable is HUGE!! WHY?

Anyway, this is good. Move on to finding more interaction terms.

Interaction Terms

# Testing Age * Sex (this seems like there could be some interaction)
fit.1 = lm(Survived ~ Pclass + Sex + Age + I(SibSp^3) + Child + Sex * Pclass, 
    family = binomial, data = trainData)
fit.2 = lm(Survived ~ Pclass + Sex + Age + Sex * Age + I(SibSp^3) + Child + 
    Sex * Pclass, family = binomial, data = trainData)
anova(fit.1, fit.2)
## Analysis of Variance Table
## 
## Model 1: Survived ~ Pclass + Sex + Age + I(SibSp^3) + Child + Sex * Pclass
## Model 2: Survived ~ Pclass + Sex + Age + Sex * Age + I(SibSp^3) + Child + 
##     Sex * Pclass
##   Res.Df RSS Df Sum of Sq    F Pr(>F)  
## 1    884 124                           
## 2    883 123  1      0.73 5.23  0.022 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Significant at the 5% level (p = 0.022), but only barely. Not much help. Next.

# Testing Age * Pclass
fit.1 = lm(Survived ~ Pclass + Sex + Age + I(SibSp^3) + Child + Sex * Pclass, 
    family = binomial, data = trainData)
fit.2 = lm(Survived ~ Pclass + Sex + Age + Age * Pclass + I(SibSp^3) + Child + 
    Sex * Pclass, family = binomial, data = trainData)
anova(fit.1, fit.2)
## Analysis of Variance Table
## 
## Model 1: Survived ~ Pclass + Sex + Age + I(SibSp^3) + Child + Sex * Pclass
## Model 2: Survived ~ Pclass + Sex + Age + Age * Pclass + I(SibSp^3) + Child + 
##     Sex * Pclass
##   Res.Df RSS Df Sum of Sq    F Pr(>F)
## 1    884 124                         
## 2    883 124  1   0.00925 0.07    0.8

No help at all. Next.

# Testing Child * Pclass
fit.1 = lm(Survived ~ Pclass + Sex + Age + I(SibSp^3) + Child + Sex * Pclass, 
    family = binomial, data = trainData)
fit.2 = lm(Survived ~ Pclass + Sex + Age + I(SibSp^3) + Child + Child * Class + 
    Sex * Pclass, family = binomial, data = trainData)
## Error: object 'Class' not found
anova(fit.1, fit.2)
## Analysis of Variance Table
## 
## Model 1: Survived ~ Pclass + Sex + Age + I(SibSp^3) + Child + Sex * Pclass
## Model 2: Survived ~ Pclass + Sex + Age + Age * Pclass + I(SibSp^3) + Child + 
##     Sex * Pclass
##   Res.Df RSS Df Sum of Sq    F Pr(>F)
## 1    884 124                         
## 2    883 124  1   0.00925 0.07    0.8

Oops: 'Class' should have been 'Pclass', so fit.2 never got refit and the ANOVA above just repeats the stale Age * Pclass comparison from the previous chunk. The Child * Pclass test still needs a proper rerun before concluding anything.

# Testing Child * Sex
fit.1 = lm(Survived ~ Pclass + Sex + Age + I(SibSp^3) + Child + Sex * Pclass, 
    family = binomial, data = trainData)
fit.2 = lm(Survived ~ Pclass + Sex + Age + I(SibSp^3) + Child + Child * Sex + 
    Sex * Pclass, family = binomial, data = trainData)
anova(fit.1, fit.2)
## Analysis of Variance Table
## 
## Model 1: Survived ~ Pclass + Sex + Age + I(SibSp^3) + Child + Sex * Pclass
## Model 2: Survived ~ Pclass + Sex + Age + I(SibSp^3) + Child + Child * 
##     Sex + Sex * Pclass
##   Res.Df RSS Df Sum of Sq    F  Pr(>F)    
## 1    884 124                              
## 2    883 120  1      3.57 26.2 3.8e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(fit.1)
## 
## Call:
## lm(formula = Survived ~ Pclass + Sex + Age + I(SibSp^3) + Child + 
##     Sex * Pclass, data = trainData, family = binomial)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -1.306 -0.225 -0.101  0.223  0.943 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.936400   0.104938    8.92  < 2e-16 ***
## Pclass      -0.125706   0.020231   -6.21  8.0e-10 ***
## Sex          0.820545   0.075447   10.88  < 2e-16 ***
## Age         -0.003491   0.001231   -2.84  0.00466 ** 
## I(SibSp^3)  -0.000937   0.000272   -3.45  0.00059 ***
## Child       -0.172736   0.056017   -3.08  0.00211 ** 
## Pclass:Sex  -0.144256   0.031370   -4.60  4.9e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.375 on 884 degrees of freedom
## Multiple R-squared:  0.412,  Adjusted R-squared:  0.408 
## F-statistic:  103 on 6 and 884 DF,  p-value: <2e-16
summary(fit.2)
## 
## Call:
## lm(formula = Survived ~ Pclass + Sex + Age + I(SibSp^3) + Child + 
##     Child * Sex + Sex * Pclass, data = trainData, family = binomial)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.0545 -0.2140 -0.0852  0.2037  0.9550 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.342219   0.130382   10.29  < 2e-16 ***
## Pclass      -0.128788   0.019958   -6.45  1.8e-10 ***
## Sex         -0.144317   0.202743   -0.71  0.47676    
## Age         -0.003180   0.001215   -2.62  0.00903 ** 
## I(SibSp^3)  -0.001065   0.000269   -3.96  8.2e-05 ***
## Child       -0.383873   0.068952   -5.57  3.4e-08 ***
## Sex:Child    0.476232   0.093089    5.12  3.8e-07 ***
## Pclass:Sex  -0.118807   0.031330   -3.79  0.00016 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.369 on 883 degrees of freedom
## Multiple R-squared:  0.428,  Adjusted R-squared:  0.424 
## F-statistic: 94.6 on 7 and 883 DF,  p-value: <2e-16

LOOKS like the Child*Sex interaction term is very significant (t-value above 5). But it entirely negates the main effect of Sex (the t-value drops from about 11 to -0.7), which was even more statistically significant. So this is NOT a good tradeoff.

I tested SibSp^3 interactions with other variables too, and they didn't seem to help.
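
For completeness, here is a sketch of how those one-at-a-time checks can be automated with update(), looping over candidate interaction terms (the term list is illustrative):

# Refit the base model with one extra interaction at a time and record the
# ANOVA F-test p-value. (lm() again, so these are F-tests on RSS.)
base <- lm(Survived ~ Pclass + Sex + Age + I(SibSp^3) + Child + Sex * Pclass, 
    data = trainData)
candidates <- c("Sex:Age", "Age:Pclass", "Child:Pclass", "Child:Sex")
for (term in candidates) {
    fit <- update(base, as.formula(paste(". ~ . +", term)))
    p <- anova(base, fit)[2, "Pr(>F)"]
    cat(term, "p-value:", format(p, digits = 3), "\n")
}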

Rerun Best Subset Selection

Before moving on to splines and GAMs, it's worthwhile to rerun Best Subset Selection, given how significant an impact \( SibSp^{3} \) has on the whole model!

It's the same subset selection analysis as before, but with the linear SibSp replaced by the cubic \( SibSp^{3} \) term.

library(leaps)
regfit.full = regsubsets(Survived ~ Pclass + Sex + Age + I(SibSp^3) + Parch + 
    Fare + Embarked_C + Embarked_Q + Child + Mother + Sex * Pclass, data = trainData, 
    nvmax = 11)  #Best Subset Selection for ALL variables
(reg.summary = summary(regfit.full))
## Subset selection object
## Call: regsubsets.formula(Survived ~ Pclass + Sex + Age + I(SibSp^3) + 
##     Parch + Fare + Embarked_C + Embarked_Q + Child + Mother + 
##     Sex * Pclass, data = trainData, nvmax = 11)
## 11 Variables  (and intercept)
##            Forced in Forced out
## Pclass         FALSE      FALSE
## Sex            FALSE      FALSE
## Age            FALSE      FALSE
## I(SibSp^3)     FALSE      FALSE
## Parch          FALSE      FALSE
## Fare           FALSE      FALSE
## Embarked_C     FALSE      FALSE
## Embarked_Q     FALSE      FALSE
## Child          FALSE      FALSE
## Mother         FALSE      FALSE
## Pclass:Sex     FALSE      FALSE
## 1 subsets of each size up to 11
## Selection Algorithm: exhaustive
##           Pclass Sex Age I(SibSp^3) Parch Fare Embarked_C Embarked_Q Child
## 1  ( 1 )  " "    "*" " " " "        " "   " "  " "        " "        " "  
## 2  ( 1 )  "*"    "*" " " " "        " "   " "  " "        " "        " "  
## 3  ( 1 )  "*"    "*" "*" " "        " "   " "  " "        " "        " "  
## 4  ( 1 )  "*"    "*" "*" " "        " "   " "  " "        " "        " "  
## 5  ( 1 )  "*"    "*" " " "*"        " "   " "  " "        " "        "*"  
## 6  ( 1 )  "*"    "*" "*" "*"        " "   " "  " "        " "        "*"  
## 7  ( 1 )  "*"    "*" "*" "*"        "*"   " "  " "        " "        "*"  
## 8  ( 1 )  "*"    "*" "*" "*"        "*"   " "  " "        " "        "*"  
## 9  ( 1 )  "*"    "*" "*" "*"        "*"   " "  "*"        " "        "*"  
## 10  ( 1 ) "*"    "*" "*" "*"        "*"   " "  "*"        "*"        "*"  
## 11  ( 1 ) "*"    "*" "*" "*"        "*"   "*"  "*"        "*"        "*"  
##           Mother Pclass:Sex
## 1  ( 1 )  " "    " "       
## 2  ( 1 )  " "    " "       
## 3  ( 1 )  " "    " "       
## 4  ( 1 )  " "    "*"       
## 5  ( 1 )  " "    "*"       
## 6  ( 1 )  " "    "*"       
## 7  ( 1 )  " "    "*"       
## 8  ( 1 )  "*"    "*"       
## 9  ( 1 )  "*"    "*"       
## 10  ( 1 ) "*"    "*"       
## 11  ( 1 ) "*"    "*"
names(reg.summary)
## [1] "which"  "rsq"    "rss"    "adjr2"  "cp"     "bic"    "outmat" "obj"
reg.summary$rsq  # R-squared of the best model at each subset size (1 to 11)
##  [1] 0.2952 0.3677 0.3853 0.3985 0.4061 0.4115 0.4152 0.4178 0.4200 0.4221
## [11] 0.4221

# Plot RSS, adjusted r-square, Cp, BIC for all the models at once
par(mfrow = c(2, 2))
# RSS Plot
plot(reg.summary$rss, xlab = "Number of Variables", ylab = "RSS", type = "l")
# Adjusted RSq plot
plot(reg.summary$adjr2, xlab = "Number of Variables", ylab = "Adjusted RSq", 
    type = "l")
which.max(reg.summary$adjr2)
## [1] 10
points(10, reg.summary$adjr2[10], col = "red", cex = 2, pch = 20)
# Cp plot
plot(reg.summary$cp, xlab = "Number of Variables", ylab = "Cp", type = "l")
which.min(reg.summary$cp)
## [1] 10
points(10, reg.summary$cp[10], col = "red", cex = 2, pch = 20)
# BIC plot (red point marks the 6-variable model)
plot(reg.summary$bic, xlab = "Number of Variables", ylab = "BIC", type = "l")
points(6, reg.summary$bic[6], col = "red", cex = 2, pch = 20)

[Figure: RSS, Adjusted R-squared, Cp, and BIC against number of variables]


# Coefficients for these best subsets:
coef(regfit.full, 10)
## (Intercept)      Pclass         Sex         Age  I(SibSp^3)       Parch 
##   1.3577858  -0.1301911   0.7971262  -0.0037129  -0.0006543  -0.0613591 
##  Embarked_C  Embarked_Q       Child      Mother  Pclass:Sex 
##   0.0674378   0.0834749  -0.2421988  -0.1371012  -0.1388098
coef(regfit.full, 6)
## (Intercept)      Pclass         Sex         Age  I(SibSp^3)       Child 
##   0.9364000  -0.1257057   0.8205453  -0.0034915  -0.0009367  -0.1727359 
##  Pclass:Sex 
##  -0.1442555

Well, that's good. It's the same 6 variables. Now we can move on to Splines and GAMs.
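
As a preview of that next step, here is a minimal GAM sketch using the mgcv package (my assumption; any spline library would do), putting smooths on Age and Fare:

# Hypothetical preview: smooth terms via mgcv (not fit in this post).
library(mgcv)
train.gam <- gam(Survived ~ Pclass + Sex + s(Age) + s(Fare) + I(SibSp^3) + Child, 
    family = binomial, data = trainData)
summary(train.gam)  # does s(Fare) earn its degrees of freedom?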