First read in both the test and training data:
train <- read.csv('shared/titanic/train.csv')
test <- read.csv('shared/titanic/test.csv')
train$model <- rep(0,891)  # simplest possible model: predict that no one survived
nrow(train[train$model==train$Survived,])/891
[1] 0.6162
If we predict that everyone dies, we're right 61.6% of the time in the training set.
(gender_model <- lm(Survived~Sex, data=train))
Call:
lm(formula = Survived ~ Sex, data = train)
Coefficients:
(Intercept)      Sexmale
      0.742       -0.553
This model predicts that females have a 74.2% chance of survival and males a 74.2 - 55.3 = 18.9% chance.
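As a quick check (just a sketch, not part of the original workflow), you can recover those two predicted probabilities directly from the model:
predict(gender_model, data.frame(Sex = c("female", "male")))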
train$gender_model_pred <- round(predict(gender_model, train))
We can make “in sample” predictions (meaning, on the training set) with this model and round them to make them 0/1 predictions rather than predicting probabilities.
nrow(train[train$gender_model_pred==train$Survived,])/891
[1] 0.7868
In sample, we're now right 78.7% of the time, but this isn't the true test. Models with additional variables will always fit at least as well in sample.
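One way to see this (a side illustration only; the noise column below is made up and has nothing to do with the Titanic data) is to add a predictor of pure random noise and check that the in-sample fit never gets worse:
set.seed(1)
train$noise <- rnorm(891)        # pure noise, unrelated to survival
noise_model <- lm(Survived ~ Sex + noise, data=train)
summary(gender_model)$r.squared  # fit with Sex alone
summary(noise_model)$r.squared   # at least as high, even though noise is meaningless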
The true test is out-of-sample accuracy: predictions on the test set. And for this we'll need Kaggle.
test$Survived <- round(predict(gender_model, test))
submit <- data.frame(PassengerId = test$PassengerId, Survived = test$Survived)
write.csv(submit, file = "gender_model.csv", row.names = FALSE)
Now, you'll need to download the “gender_model.csv” file and submit it to Kaggle to see how well it performs.
Try plotting other variables against the residuals of your model:
plot(jitter(train$Pclass), jitter(residuals(gender_model)), cex=0.5)
residual_model <- lm(residuals(gender_model)~train$Pclass)
summary(residual_model)
Call:
lm(formula = residuals(gender_model) ~ train$Pclass)
Residuals:
    Min      1Q  Median      3Q     Max
-0.9453 -0.2368 -0.0816  0.2100  0.9184

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept)    0.3585     0.0381    9.40   <2e-16 ***
train$Pclass  -0.1553     0.0155   -9.99   <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.388 on 889 degrees of freedom
Multiple R-squared: 0.101, Adjusted R-squared: 0.1
F-statistic: 99.9 on 1 and 889 DF, p-value: <2e-16
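Since Pclass clearly explains some of the leftover variation, it goes into the models below. Other columns can be screened the same way; here is a quick sketch (Fare and Embarked are just two examples you could try):
# residuals against Fare: both numeric, so an ordinary scatterplot
plot(train$Fare, residuals(gender_model), cex=0.5)
# residuals against port of embarkation: coercing to a factor makes R draw boxplots
plot(factor(train$Embarked), residuals(gender_model))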
# separate, additive effects of Sex and Pclass
gender_Pclass_model1 <- lm(Survived~Sex+Pclass, data=train)
# interaction model (since Pclass might have a different effect for males and females)
gender_Pclass_model2 <- lm(Survived~Sex*Pclass, data=train)
coef(gender_Pclass_model1)
(Intercept)     Sexmale      Pclass
     1.0833     -0.5167     -0.1580
coef(gender_Pclass_model2)[1:2]
(Intercept)     Sexmale
     1.2686     -0.8258
coef(gender_Pclass_model2)[3:4]
        Pclass Sexmale:Pclass
       -0.2439         0.1376
Try looking at the summaries of the two models, e.g. “summary(gender_Pclass_model2)”.
How do you interpret the coefficients of the interaction model?
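One way to get a feel for the coefficients (a sketch; the grid of passenger types below is just for illustration) is to look at the fitted survival probability for every Sex/Pclass combination:
passenger_types <- expand.grid(Sex = c("female", "male"), Pclass = 1:3)
cbind(passenger_types, pred = predict(gender_Pclass_model2, passenger_types))
Note that the Sexmale:Pclass coefficient is the amount by which the Pclass slope for males differs from the Pclass slope for females.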
To find out which model performs better, try using each of them to make predictions on the test set and then submitting the results from each model to Kaggle.
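A sketch of that last step (the file names are just suggestions, and the pmin/pmax clipping is an extra safeguard since lm predictions can fall outside 0-1 before rounding):
pred1 <- round(pmin(pmax(predict(gender_Pclass_model1, test), 0), 1))
pred2 <- round(pmin(pmax(predict(gender_Pclass_model2, test), 0), 1))
write.csv(data.frame(PassengerId = test$PassengerId, Survived = pred1),
          file = "gender_Pclass_model1.csv", row.names = FALSE)
write.csv(data.frame(PassengerId = test$PassengerId, Survived = pred2),
          file = "gender_Pclass_model2.csv", row.names = FALSE)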