First read in both the test and training data:
train <- read.csv('shared/titanic/train.csv')
test <- read.csv('shared/titanic/test.csv')
train$model <- rep(0,891)  # simplest possible model: predict that no one survived
nrow(train[train$model==train$Survived,])/891
[1] 0.6162
If we predict that everyone dies, we're right 61.6% of the time in the training set.
(gender_model <- lm(Survived~Sex, data=train))
Call:
lm(formula = Survived ~ Sex, data = train)
Coefficients:
(Intercept)      Sexmale
      0.742       -0.553
This model predicts that females have a 74.2% chance of survival and males a 74.2 - 55.3 = 18.9% chance.
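As a quick check (just a sketch, not part of the original workflow), you can recover those two predicted probabilities directly from the model:
predict(gender_model, data.frame(Sex = c("female", "male")))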
train$gender_model_pred <- round(predict(gender_model, train))
We can make “in sample” predictions (meaning, on the training set) with this model and round them to make them 0/1 predictions rather than predicting probabilities.
nrow(train[train$gender_model_pred==train$Survived,])/891
[1] 0.7868
In sample, we're now right 78.7% of the time, but this isn't the true test. Models with additional variables will always fit at least as well in sample.
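One way to see this (a side illustration only; the noise column below is made up and has nothing to do with the Titanic data) is to add a predictor of pure random noise and check that the in-sample fit never gets worse:
set.seed(1)
train$noise <- rnorm(891)        # pure noise, unrelated to survival
noise_model <- lm(Survived ~ Sex + noise, data=train)
summary(gender_model)$r.squared  # fit with Sex alone
summary(noise_model)$r.squared   # at least as high, even though noise is meaningless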
The true test is out-of-sample accuracy: predictions on the test set. And for this we'll need Kaggle.
test$Survived <- round(predict(gender_model, test))
submit <- data.frame(PassengerId = test$PassengerId, Survived = test$Survived)
write.csv(submit, file = "gender_model.csv", row.names = FALSE)
Now, you'll need to download the “gender_model.csv” file and submit it to Kaggle to see how well it performs.
Try plotting other variables against the residuals of your model:
plot(jitter(train$Pclass), jitter(residuals(gender_model)), cex=0.5)
residual_model <- lm(residuals(gender_model)~train$Pclass)
summary(residual_model)
Call:
lm(formula = residuals(gender_model) ~ train$Pclass)
Residuals:
    Min      1Q  Median      3Q     Max
-0.9453 -0.2368 -0.0816  0.2100  0.9184

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept)    0.3585     0.0381    9.40   <2e-16 ***
train$Pclass  -0.1553     0.0155   -9.99   <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.388 on 889 degrees of freedom
Multiple R-squared: 0.101, Adjusted R-squared: 0.1
F-statistic: 99.9 on 1 and 889 DF, p-value: <2e-16
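Since Pclass clearly explains some of the leftover variation, it goes into the models below. Other columns can be screened the same way; here is a quick sketch (Fare and Embarked are just two examples you could try):
# residuals against Fare: both numeric, so an ordinary scatterplot
plot(train$Fare, residuals(gender_model), cex=0.5)
# residuals against port of embarkation: coercing to a factor makes R draw boxplots
plot(factor(train$Embarked), residuals(gender_model))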
# separate, additive effects of Sex and Pclass
gender_Pclass_model1 <- lm(Survived~Sex+Pclass, data=train)
# interaction model (since Pclass might have a different effect for males and females)
gender_Pclass_model2 <- lm(Survived~Sex*Pclass, data=train)
coef(gender_Pclass_model1)
(Intercept)     Sexmale      Pclass
     1.0833     -0.5167     -0.1580
coef(gender_Pclass_model2)[1:2]
(Intercept)     Sexmale
     1.2686     -0.8258
coef(gender_Pclass_model2)[3:4]
        Pclass Sexmale:Pclass
       -0.2439         0.1376
Try looking at the summaries of the two models, e.g. “summary(gender_Pclass_model2)”.
How do you interpret the coefficients of the interaction model?
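One way to get a feel for the coefficients (a sketch; the grid of passenger types below is just for illustration) is to look at the fitted survival probability for every Sex/Pclass combination:
passenger_types <- expand.grid(Sex = c("female", "male"), Pclass = 1:3)
cbind(passenger_types, pred = predict(gender_Pclass_model2, passenger_types))
Note that the Sexmale:Pclass coefficient is the amount by which the Pclass slope for males differs from the Pclass slope for females.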
To find out which model performs better, try using each of them to make predictions on the test set and then submitting the results from each model to Kaggle.
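A sketch of that last step (the file names are just suggestions, and the pmin/pmax clipping is an extra safeguard since lm predictions can fall outside 0-1 before rounding):
pred1 <- round(pmin(pmax(predict(gender_Pclass_model1, test), 0), 1))
pred2 <- round(pmin(pmax(predict(gender_Pclass_model2, test), 0), 1))
write.csv(data.frame(PassengerId = test$PassengerId, Survived = pred1),
          file = "gender_Pclass_model1.csv", row.names = FALSE)
write.csv(data.frame(PassengerId = test$PassengerId, Survived = pred2),
          file = "gender_Pclass_model2.csv", row.names = FALSE)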