Assuming the 6 variables identified by the Best Subset Selection method are good, I want to test the following:
- Non-linearity in any of the 6 variables
- All interaction terms, to see which are relevant
- What happens if I add the "Fare" variable back with a spline/step function for it: would that make "Fare" relevant? And does "Fare" interact with Class? (A sketch of this test follows the list.)
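A sketch of that third test (not run here; the cut points, the spline's degrees of freedom, and the use of update() on a base fit are my assumptions, not tuned choices):

# Sketch only: does a binned or spline version of Fare add anything over the
# 6-variable base model? Breakpoints and df below are guesses.
library(splines)
fit.base <- lm(Survived ~ Pclass + Sex + Age + SibSp + Child + Sex * Pclass,
    data = trainData)
fit.step <- update(fit.base, . ~ . + cut(Fare, breaks = c(-1, 10, 30, 100, 600)))  # step function
fit.ns <- update(fit.base, . ~ . + ns(Fare, df = 4))  # natural spline
fit.int <- update(fit.base, . ~ . + Fare:Pclass)  # Fare-by-class interaction
anova(fit.base, fit.step)  # F-test each addition against the base model
anova(fit.base, fit.ns)
anova(fit.base, fit.int)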
# Create the Model using Training Set
train.glm <- glm(Survived ~ Pclass + Sex + Age + SibSp + Child + Sex * Pclass,
family = binomial, data = trainData)
summary(train.glm)
##
## Call:
## glm(formula = Survived ~ Pclass + Sex + Age + SibSp + Child +
## Sex * Pclass, family = binomial, data = trainData)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.565 -0.591 -0.471 0.428 2.537
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 4.47701 0.84708 5.29 1.3e-07 ***
## Pclass -0.84312 0.13939 -6.05 1.5e-09 ***
## Sex 6.28883 0.88987 7.07 1.6e-12 ***
## Age -0.02807 0.00953 -2.94 0.00323 **
## SibSp -0.54917 0.12818 -4.28 1.8e-05 ***
## Child -1.55943 0.44559 -3.50 0.00047 ***
## Pclass:Sex -1.40634 0.32661 -4.31 1.7e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1186.66 on 890 degrees of freedom
## Residual deviance: 747.85 on 884 degrees of freedom
## AIC: 761.8
##
## Number of Fisher Scoring iterations: 6
# Test 'Pclass' first (degree 2 at most: poly()'s 'degree' must be less than
# the number of unique points, and Pclass has only 3)
# NOTE: lm() disregards a 'family' argument, so these comparisons are
# linear-probability fits judged by F-tests (hence the RSS-based ANOVA tables)
fit.1 = lm(Survived ~ Pclass + Sex + Age + SibSp + Child + Sex * Pclass,
    data = trainData)
fit.2 = lm(Survived ~ poly(Pclass, 2) + Sex + Age + SibSp + Child + Sex * Pclass,
    data = trainData)
anova(fit.1, fit.2)
## Analysis of Variance Table
##
## Model 1: Survived ~ Pclass + Sex + Age + SibSp + Child + Sex * Pclass
## Model 2: Survived ~ poly(Pclass, 2) + Sex + Age + SibSp + Child + Sex *
## Pclass
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 884 122
## 2 883 122 1 0.000213 0 0.97
# Test 'Age'
fit.1 = lm(Survived ~ Pclass + Sex + Age + SibSp + Child + Sex * Pclass,
    data = trainData)
fit.2 = lm(Survived ~ Pclass + Sex + poly(Age, 2) + SibSp + Child + Sex * Pclass,
    data = trainData)
fit.3 = lm(Survived ~ Pclass + Sex + poly(Age, 3) + SibSp + Child + Sex * Pclass,
    data = trainData)
fit.4 = lm(Survived ~ Pclass + Sex + poly(Age, 4) + SibSp + Child + Sex * Pclass,
    data = trainData)
fit.5 = lm(Survived ~ Pclass + Sex + poly(Age, 5) + SibSp + Child + Sex * Pclass,
    data = trainData)
anova(fit.1, fit.2, fit.3, fit.4, fit.5)
## Analysis of Variance Table
##
## Model 1: Survived ~ Pclass + Sex + Age + SibSp + Child + Sex * Pclass
## Model 2: Survived ~ Pclass + Sex + poly(Age, 2) + SibSp + Child + Sex *
## Pclass
## Model 3: Survived ~ Pclass + Sex + poly(Age, 3) + SibSp + Child + Sex *
## Pclass
## Model 4: Survived ~ Pclass + Sex + poly(Age, 4) + SibSp + Child + Sex *
## Pclass
## Model 5: Survived ~ Pclass + Sex + poly(Age, 5) + SibSp + Child + Sex *
## Pclass
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 884 122
## 2 883 122 1 0.023 0.17 0.6824
## 3 882 122 1 0.127 0.93 0.3355
## 4 881 121 1 1.340 9.78 0.0018 **
## 5 880 121 1 0.015 0.11 0.7414
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
plot(trainData$Age, trainData$Survived)
par(mfrow = c(1, 2))
plot(density(trainData$Age[trainData$Survived == 1], na.rm = TRUE), main = "Distr. of Age v Survivors")
plot(density(trainData$Age[trainData$Survived == 0], na.rm = TRUE), main = "Distr. of Age v Non-Survivors")
Children (survived), 10-20s (did not survive), 20-30s (survived), 30-40s (did not survive): the age bands seem to make a big difference, which is why Age looks non-linear. This suggests the pattern isn't arbitrary; it reflects human decisions, and perhaps that's why decision trees will be a better idea than tinkering with non-linearity in a logistic regression. We shall see.
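The band reading above can also be tested directly with a step function on Age; a sketch (breakpoints eyeballed from the density plots, not tuned):

# Sketch: add eyeballed age bands on top of linear Age and F-test the addition
fit.lin <- lm(Survived ~ Pclass + Sex + Age + SibSp + Child + Sex * Pclass,
    data = trainData)
fit.band <- update(fit.lin, . ~ . + cut(Age, breaks = c(0, 10, 20, 30, 40, 100)))
anova(fit.lin, fit.band)  # do the bands add anything beyond linear Age?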
# Test 'SibSp'
fit.1 = lm(Survived ~ Pclass + Sex + Age + SibSp + Child + Sex * Pclass,
    data = trainData)
fit.2 = lm(Survived ~ Pclass + Sex + Age + poly(SibSp, 2) + Child + Sex * Pclass,
    data = trainData)
fit.3 = lm(Survived ~ Pclass + Sex + Age + poly(SibSp, 3) + Child + Sex * Pclass,
    data = trainData)
fit.4 = lm(Survived ~ Pclass + Sex + Age + poly(SibSp, 4) + Child + Sex * Pclass,
    data = trainData)
fit.5 = lm(Survived ~ Pclass + Sex + Age + poly(SibSp, 5) + Child + Sex * Pclass,
    data = trainData)
anova(fit.1, fit.2, fit.3, fit.4, fit.5)
## Analysis of Variance Table
##
## Model 1: Survived ~ Pclass + Sex + Age + SibSp + Child + Sex * Pclass
## Model 2: Survived ~ Pclass + Sex + Age + poly(SibSp, 2) + Child + Sex *
## Pclass
## Model 3: Survived ~ Pclass + Sex + Age + poly(SibSp, 3) + Child + Sex *
## Pclass
## Model 4: Survived ~ Pclass + Sex + Age + poly(SibSp, 4) + Child + Sex *
## Pclass
## Model 5: Survived ~ Pclass + Sex + Age + poly(SibSp, 5) + Child + Sex *
## Pclass
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 884 122
## 2 883 122 1 0.001 0.01 0.94129
## 3 882 120 1 1.678 12.29 0.00048 ***
## 4 881 120 1 0.154 1.13 0.28798
## 5 880 120 1 0.107 0.78 0.37689
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
plot(trainData$SibSp, trainData$Survived)
par(mfrow = c(1, 2))
plot(density(trainData$SibSp[trainData$Survived == 1], na.rm = TRUE), main = "Distr. of Siblings/Spouse v Survivors")
plot(density(trainData$SibSp[trainData$Survived == 0], na.rm = TRUE), main = "Distr. of Siblings/Spouse v Non-Survivors")
Even though this looks very significant, there is no obvious reason why SibSp should enter non-linearly: the survivor and non-survivor groups appear similarly distributed over SibSp. (A cross-tab, sketched below, is easier to read here than the densities.)
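Since SibSp takes only a handful of integer values, a cross-tab is arguably a more honest check than density plots; a quick sketch:

# Sketch: counts and survival rate at each SibSp value
with(trainData, table(SibSp, Survived))
with(trainData, tapply(Survived, SibSp, mean))  # survival proportion per level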
# Test 'Sex*Pclass' (Sex and Pclass are numeric here, so Sex * Pclass inside
# poly() is their literal product)
fit.1 = lm(Survived ~ Pclass + Sex + Age + SibSp + Child + Sex * Pclass,
    data = trainData)
fit.2 = lm(Survived ~ Pclass + Sex + Age + SibSp + Child + poly(Sex * Pclass,
    2), data = trainData)
fit.3 = lm(Survived ~ Pclass + Sex + Age + SibSp + Child + poly(Sex * Pclass,
    3), data = trainData)
fit.4 = lm(Survived ~ Pclass + Sex + Age + SibSp + Child + poly(Sex * Pclass,
    4), data = trainData)
## Error: 'degree' must be less than number of unique points
# fit.4 errored out above, so anova() silently reuses the stale fit.4 from the
# SibSp chunk; that's why Model 4 in the table shows poly(SibSp, 4)
anova(fit.1, fit.2, fit.3, fit.4)
## Analysis of Variance Table
##
## Model 1: Survived ~ Pclass + Sex + Age + SibSp + Child + Sex * Pclass
## Model 2: Survived ~ Pclass + Sex + Age + SibSp + Child + poly(Sex * Pclass,
## 2)
## Model 3: Survived ~ Pclass + Sex + Age + SibSp + Child + poly(Sex * Pclass,
## 3)
## Model 4: Survived ~ Pclass + Sex + Age + poly(SibSp, 4) + Child + Sex *
## Pclass
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 884 122
## 2 883 120 1 1.596 11.69 0.00066 ***
## 3 883 120 0 0.000
## 4 881 120 2 0.237 0.87 0.41937
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Again, VERY significant in favor of non-linearity, but there's no obvious reason why Sex*Pclass should be non-linear. (And note the caveat in the code comment above: Model 4 in that table is the stale SibSp fit.)
So for non-linearity we'll try: 4th degree for Age, 3rd degree for SibSp, and 2nd degree for Sex*Pclass.
This could be a case of overfitting, but we'll check the results and see.
After adding non-linearity for these variables, \( Age^{4} \) turned out to be statistically insignificant, and ANOVA showed that the non-linear model didn't improve on the linear one (with the same 6 variables). I'm going to give it a try anyway and see if it improves the prediction.
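One caveat before fitting: the model comparisons above went through lm(), i.e. linear-probability fits with F-tests. For the logistic fits themselves, the analogue is a deviance (chi-square) test; a sketch (the nesting choice, base model plus the cubic term, is my assumption):

# Sketch: compare the glm versions directly via a deviance test
glm.base <- glm(Survived ~ Pclass + Sex + Age + SibSp + Child + Sex * Pclass,
    family = binomial, data = trainData)
glm.cube <- update(glm.base, . ~ . + I(SibSp^3))  # add the cubic SibSp term
anova(glm.base, glm.cube, test = "Chisq")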
# Create the Model using Training Set
train.glm.poly <- glm(Survived ~ Pclass + Sex + Age + I(SibSp^3) + Child + Sex *
Pclass, family = binomial, data = trainData)
summary(train.glm.poly)
##
## Call:
## glm(formula = Survived ~ Pclass + Sex + Age + I(SibSp^3) + Child +
## Sex * Pclass, family = binomial, data = trainData)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.637 -0.627 -0.469 0.447 2.513
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 4.45544 0.85592 5.21 1.9e-07 ***
## Pclass -0.80608 0.13997 -5.76 8.5e-09 ***
## Sex 6.07230 0.88212 6.88 5.8e-12 ***
## Age -0.02759 0.00955 -2.89 0.00385 **
## I(SibSp^3) -0.05306 0.01333 -3.98 6.9e-05 ***
## Child -1.64700 0.45879 -3.59 0.00033 ***
## Pclass:Sex -1.35473 0.32556 -4.16 3.2e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1186.66 on 890 degrees of freedom
## Residual deviance: 741.45 on 884 degrees of freedom
## AIC: 755.5
##
## Number of Fisher Scoring iterations: 7
# Make Predictions using the Test Set
p.hats.poly <- predict.glm(train.glm.poly, newdata = testData, type = "response")
survival = ifelse(p.hats.poly > 0.5, 1, 0)
# Creating CSV for Kaggle Submission
kaggle.sub <- cbind(PassengerId, survival)
colnames(kaggle.sub) <- c("PassengerId", "Survived")
write.csv(kaggle.sub, file = "~/Dropbox/Data Science/Kaggle/Titanic/titanic_glm_poly2.csv",
row.names = FALSE)
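(A cheap smoke test before submitting, since the true test labels live on Kaggle's side: in-sample accuracy. Optimistic by construction, but it catches gross mistakes.)

# Sketch: rough in-sample accuracy of the new model
p.train <- predict(train.glm.poly, type = "response")  # fitted values on training data
mean(ifelse(p.train > 0.5, 1, 0) == trainData$Survived)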
This is CRAZY! The results improved significantly: moved up 451 positions, which puts me in the top 250.
Took out the non-linearity in the Sex*Pclass interaction term and it didn't hurt much at all. GOOD!
Anyway, this is good. On to finding more interaction terms.
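The pairwise tests below can also be run as a loop; a sketch that adds each candidate interaction to the base model and pulls out the F-test p-value (the candidate list just mirrors the pairs tested by hand below):

# Sketch: loop the interaction tests instead of writing each one out
base <- lm(Survived ~ Pclass + Sex + Age + I(SibSp^3) + Child + Sex * Pclass,
    data = trainData)
candidates <- c("Sex:Age", "Age:Pclass", "Child:Pclass", "Child:Sex")
for (term in candidates) {
    fit <- update(base, as.formula(paste(". ~ . +", term)))
    cat(term, "p-value:", anova(base, fit)[2, "Pr(>F)"], "\n")
}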
# Testing Sex * Age (it seems like there could be some interaction here)
fit.1 = lm(Survived ~ Pclass + Sex + Age + I(SibSp^3) + Child + Sex * Pclass,
    data = trainData)
fit.2 = lm(Survived ~ Pclass + Sex + Age + Sex * Age + I(SibSp^3) + Child +
    Sex * Pclass, data = trainData)
anova(fit.1, fit.2)
## Analysis of Variance Table
##
## Model 1: Survived ~ Pclass + Sex + Age + I(SibSp^3) + Child + Sex * Pclass
## Model 2: Survived ~ Pclass + Sex + Age + Sex * Age + I(SibSp^3) + Child +
## Sex * Pclass
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 884 124
## 2 883 123 1 0.73 5.23 0.022 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Only marginally significant (p = 0.022); not much help. Next.
# Testing Age * Pclass
fit.1 = lm(Survived ~ Pclass + Sex + Age + I(SibSp^3) + Child + Sex * Pclass,
    data = trainData)
fit.2 = lm(Survived ~ Pclass + Sex + Age + Age * Pclass + I(SibSp^3) + Child +
    Sex * Pclass, data = trainData)
anova(fit.1, fit.2)
## Analysis of Variance Table
##
## Model 1: Survived ~ Pclass + Sex + Age + I(SibSp^3) + Child + Sex * Pclass
## Model 2: Survived ~ Pclass + Sex + Age + Age * Pclass + I(SibSp^3) + Child +
## Sex * Pclass
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 884 124
## 2 883 124 1 0.00925 0.07 0.8
No help at all. Next.
# Testing Child * Pclass
fit.1 = lm(Survived ~ Pclass + Sex + Age + I(SibSp^3) + Child + Sex * Pclass,
    data = trainData)
# BUG: 'Class' should be 'Pclass', so this call errors out and the anova()
# below silently reuses the stale fit.2 from the Age * Pclass chunk (note the
# Model 2 formula in the table); Child:Pclass never actually gets tested
fit.2 = lm(Survived ~ Pclass + Sex + Age + I(SibSp^3) + Child + Child * Class +
    Sex * Pclass, data = trainData)
## Error: object 'Class' not found
anova(fit.1, fit.2)
## Analysis of Variance Table
##
## Model 1: Survived ~ Pclass + Sex + Age + I(SibSp^3) + Child + Sex * Pclass
## Model 2: Survived ~ Pclass + Sex + Age + Age * Pclass + I(SibSp^3) + Child +
## Sex * Pclass
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 884 124
## 2 883 124 1 0.00925 0.07 0.8
Nope! (Though per the note above, this table just repeats the Age * Pclass comparison; Child * Pclass still needs a clean run.)
# Testing Child * Sex
fit.1 = lm(Survived ~ Pclass + Sex + Age + I(SibSp^3) + Child + Sex * Pclass,
    data = trainData)
fit.2 = lm(Survived ~ Pclass + Sex + Age + I(SibSp^3) + Child + Child * Sex +
    Sex * Pclass, data = trainData)
anova(fit.1, fit.2)
## Analysis of Variance Table
##
## Model 1: Survived ~ Pclass + Sex + Age + I(SibSp^3) + Child + Sex * Pclass
## Model 2: Survived ~ Pclass + Sex + Age + I(SibSp^3) + Child + Child *
## Sex + Sex * Pclass
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 884 124
## 2 883 120 1 3.57 26.2 3.8e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(fit.1)
##
## Call:
## lm(formula = Survived ~ Pclass + Sex + Age + I(SibSp^3) + Child +
##     Sex * Pclass, data = trainData)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.306 -0.225 -0.101 0.223 0.943
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.936400 0.104938 8.92 < 2e-16 ***
## Pclass -0.125706 0.020231 -6.21 8.0e-10 ***
## Sex 0.820545 0.075447 10.88 < 2e-16 ***
## Age -0.003491 0.001231 -2.84 0.00466 **
## I(SibSp^3) -0.000937 0.000272 -3.45 0.00059 ***
## Child -0.172736 0.056017 -3.08 0.00211 **
## Pclass:Sex -0.144256 0.031370 -4.60 4.9e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.375 on 884 degrees of freedom
## Multiple R-squared: 0.412, Adjusted R-squared: 0.408
## F-statistic: 103 on 6 and 884 DF, p-value: <2e-16
summary(fit.2)
##
## Call:
## lm(formula = Survived ~ Pclass + Sex + Age + I(SibSp^3) + Child +
##     Child * Sex + Sex * Pclass, data = trainData)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.0545 -0.2140 -0.0852 0.2037 0.9550
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.342219 0.130382 10.29 < 2e-16 ***
## Pclass -0.128788 0.019958 -6.45 1.8e-10 ***
## Sex -0.144317 0.202743 -0.71 0.47676
## Age -0.003180 0.001215 -2.62 0.00903 **
## I(SibSp^3) -0.001065 0.000269 -3.96 8.2e-05 ***
## Child -0.383873 0.068952 -5.57 3.4e-08 ***
## Sex:Child 0.476232 0.093089 5.12 3.8e-07 ***
## Pclass:Sex -0.118807 0.031330 -3.79 0.00016 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.369 on 883 degrees of freedom
## Multiple R-squared: 0.428, Adjusted R-squared: 0.424
## F-statistic: 94.6 on 7 and 883 DF, p-value: <2e-16
It LOOKS like the Child*Sex interaction term is very significant (t-value of 5+), but it entirely wipes out the effect of Sex, which is even more significant on its own (t-value of 10+, dropping to insignificance in fit.2). So this is NOT a good tradeoff.
I tested SibSp^3 against the other variables too, and it didn't seem to help.
Before moving on to splines and GAMs, it's worth rerunning Best Subset Selection, given how significant an impact \( SibSp^{3} \) has on the whole model!
It's the same subset selection analysis as before, but with the linear SibSp replaced by the polynomial SibSp.
library(leaps)
regfit.full = regsubsets(Survived ~ Pclass + Sex + Age + I(SibSp^3) + Parch +
Fare + Embarked_C + Embarked_Q + Child + Mother + Sex * Pclass, data = trainData,
nvmax = 11) #Best Subset Selection for ALL variables
(reg.summary = summary(regfit.full))
## Subset selection object
## Call: regsubsets.formula(Survived ~ Pclass + Sex + Age + I(SibSp^3) +
## Parch + Fare + Embarked_C + Embarked_Q + Child + Mother +
## Sex * Pclass, data = trainData, nvmax = 11)
## 11 Variables (and intercept)
## Forced in Forced out
## Pclass FALSE FALSE
## Sex FALSE FALSE
## Age FALSE FALSE
## I(SibSp^3) FALSE FALSE
## Parch FALSE FALSE
## Fare FALSE FALSE
## Embarked_C FALSE FALSE
## Embarked_Q FALSE FALSE
## Child FALSE FALSE
## Mother FALSE FALSE
## Pclass:Sex FALSE FALSE
## 1 subsets of each size up to 11
## Selection Algorithm: exhaustive
## Pclass Sex Age I(SibSp^3) Parch Fare Embarked_C Embarked_Q Child
## 1 ( 1 ) " " "*" " " " " " " " " " " " " " "
## 2 ( 1 ) "*" "*" " " " " " " " " " " " " " "
## 3 ( 1 ) "*" "*" "*" " " " " " " " " " " " "
## 4 ( 1 ) "*" "*" "*" " " " " " " " " " " " "
## 5 ( 1 ) "*" "*" " " "*" " " " " " " " " "*"
## 6 ( 1 ) "*" "*" "*" "*" " " " " " " " " "*"
## 7 ( 1 ) "*" "*" "*" "*" "*" " " " " " " "*"
## 8 ( 1 ) "*" "*" "*" "*" "*" " " " " " " "*"
## 9 ( 1 ) "*" "*" "*" "*" "*" " " "*" " " "*"
## 10 ( 1 ) "*" "*" "*" "*" "*" " " "*" "*" "*"
## 11 ( 1 ) "*" "*" "*" "*" "*" "*" "*" "*" "*"
## Mother Pclass:Sex
## 1 ( 1 ) " " " "
## 2 ( 1 ) " " " "
## 3 ( 1 ) " " " "
## 4 ( 1 ) " " "*"
## 5 ( 1 ) " " "*"
## 6 ( 1 ) " " "*"
## 7 ( 1 ) " " "*"
## 8 ( 1 ) "*" "*"
## 9 ( 1 ) "*" "*"
## 10 ( 1 ) "*" "*"
## 11 ( 1 ) "*" "*"
names(reg.summary)
## [1] "which" "rsq" "rss" "adjr2" "cp" "bic" "outmat" "obj"
reg.summary$rsq #summary of the R-squares for all 11 variables
## [1] 0.2952 0.3677 0.3853 0.3985 0.4061 0.4115 0.4152 0.4178 0.4200 0.4221
## [11] 0.4221
# Plot RSS, adjusted r-square, Cp, BIC for all the models at once
par(mfrow = c(2, 2))
# RSS Plot
plot(reg.summary$rss, xlab = "Number of Variables", ylab = "RSS", type = "l")
# Adjusted RSq plot
plot(reg.summary$adjr2, xlab = "Number of Variables", ylab = "Adjusted RSq",
type = "l")
which.max(reg.summary$adjr2)
## [1] 10
points(10, reg.summary$adjr2[10], col = "red", cex = 2, pch = 20)
# Cp plot
plot(reg.summary$cp, xlab = "Number of Variables", ylab = "Cp", type = "l")
which.min(reg.summary$cp)
## [1] 10
points(10, reg.summary$cp[10], col = "red", cex = 2, pch = 20)
# BIC plot
plot(reg.summary$bic, xlab = "Number of Variables", ylab = "BIC", type = "l")
points(6, reg.summary$bic[6], col = "red", cex = 2, pch = 20)
# Coefficients for these best subsets:
coef(regfit.full, 10)
## (Intercept) Pclass Sex Age I(SibSp^3) Parch
## 1.3577858 -0.1301911 0.7971262 -0.0037129 -0.0006543 -0.0613591
## Embarked_C Embarked_Q Child Mother Pclass:Sex
## 0.0674378 0.0834749 -0.2421988 -0.1371012 -0.1388098
coef(regfit.full, 6)
## (Intercept) Pclass Sex Age I(SibSp^3) Child
## 0.9364000 -0.1257057 0.8205453 -0.0034915 -0.0009367 -0.1727359
## Pclass:Sex
## -0.1442555
Well, that's good: it's the same 6 variables. Now we can move on to splines and GAMs.