Kaggle Learning - Titanic Dataset - Part 1

The aim of this document is to explore how to submit predictions for Kaggle competitions, and how to optimise those predictions. This first analysis takes a simple approach: it skips exploratory analysis and fits a few quick models to see what scores they get.

Random Values

Firstly, let’s see what kind of Kaggle score I get when I submit completely random values. To do this, I use the rbinom function to draw a 0 or 1 for each passenger with equal probability.

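# Load the training and test sets
train=read.csv(file="train.csv")
test=read.csv(file="test.csv")
# One coin-flip prediction (0 or 1, each with probability 0.5) per test row (418 rows)
test$Survived=rbinom(nrow(test),1,0.5)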

Chop the extra fields out so we just have the ID and the prediction.

# Column 1 is the passenger ID, column 12 is the Survived column added above
myRand=test[,c(1,12)]

And finally, save it to a file.

write.csv(myRand,"random.csv", row.names=FALSE)
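
Before uploading, it can be worth reading the file back and checking its shape. The sketch below uses a small made-up data frame in place of test, and assumes the ID column is named PassengerId as in Kaggle's Titanic files. Selecting columns by name rather than by position is a choice made for this sketch, not something the analysis above does.

```r
# Toy stand-in for the test set (values are made up for illustration)
toy = data.frame(PassengerId=892:896,
                 Pclass=c(3,3,2,3,3),
                 Survived=c(0,1,0,0,1))

# Keep only the ID and the prediction, selected by name
submission = toy[, c("PassengerId","Survived")]

# Write to a temporary file and read it back
path = tempfile(fileext=".csv")
write.csv(submission, path, row.names=FALSE)
check = read.csv(path)

# One row per passenger, exactly two columns, and only 0/1 predictions
stopifnot(nrow(check) == nrow(toy))
stopifnot(identical(names(check), c("PassengerId","Survived")))
stopifnot(all(check$Survived %in% c(0,1)))
```

If any of the stopifnot checks fails, the file is likely to be rejected or mis-scored, so it is cheaper to catch that locally.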

This submission received a Kaggle score of 0.46411.

Random guessing should score about 0.5 on average, so this result is slightly below chance, a gap that is plausibly just sampling noise. Let’s now look at binary logistic regression models.
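
To get a feel for how much a random submission can wander from 0.5, here is a minimal simulation. It assumes the score is plain accuracy over all 418 test rows, and it uses made-up true labels with an assumed survival rate of 0.4, since the real test labels are not available. If Kaggle scores only part of the test set, the spread would be wider than shown here.

```r
set.seed(42)                      # arbitrary seed so the sketch is repeatable
n = 418                           # number of rows in the test set
truth = rbinom(n, 1, 0.4)         # made-up labels; the 0.4 survival rate is an assumption

# Accuracy of 5000 independent sets of coin-flip predictions
acc = replicate(5000, mean(rbinom(n, 1, 0.5) == truth))

mean(acc)                         # close to 0.5 whatever the true survival rate is
sd(acc)                           # close to the theoretical value on the next line
sqrt(0.5 * 0.5 / n)               # binomial standard error of the accuracy
(0.5 - 0.46411) / sqrt(0.25 / n)  # how many standard errors below 0.5 the score sits
```

Under these assumptions the observed score sits roughly one and a half standard errors below 0.5, which is not an unusual result for a single random draw.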

Binary Logistic Regression

Models

The first model, m1, contains just the main effects which were significant. The second model, m2a, also contains interactions, including one (Age by Fare) which had p=0.06. The third model, m2b, is identical to m2a, but without that marginal interaction.

m1=glm(Survived~Pclass+Sex+Age,data=train, family="binomial")
m2a=glm(Survived~Pclass+Sex+Age+Pclass*Sex+ Pclass*SibSp+ Sex*Parch+ Age*Fare+Parch*Fare,data=train, family="binomial")
m2b=glm(Survived~Pclass+Sex+Age+Pclass*Sex+ Pclass*SibSp+ Sex*Parch+ Parch*Fare,data=train, family="binomial")
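
In an R formula, a*b is shorthand for a + b + a:b, so each of the starred terms above brings in two main effects as well as the interaction. A quick way to see exactly which terms each model contains, and how m2a and m2b differ, is to expand the formulas with terms(). The names fa, fb, ta and tb below are just for this sketch.

```r
# The two interaction formulas, copied from the models above
fa = Survived~Pclass+Sex+Age+Pclass*Sex+Pclass*SibSp+Sex*Parch+Age*Fare+Parch*Fare
fb = Survived~Pclass+Sex+Age+Pclass*Sex+Pclass*SibSp+Sex*Parch+Parch*Fare

# terms() expands a formula without needing any data
ta = attr(terms(fa), "term.labels")
tb = attr(terms(fb), "term.labels")

ta                 # main effects first, then the two-way interactions
setdiff(ta, tb)    # the only term in m2a that m2b lacks
```

Note that Fare remains in m2b as a main effect, because Parch*Fare still brings it in.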

Missing Data

Our test set has NA values, and so trying to predict values from this dataset using our models above leads to NA predictions.

To handle this, we replace missing Fare values with the median Fare from the training set, and missing Age values with the mean Age from the training set.

# Index the column directly so the assignment also works when nothing is missing
test$Fare[is.na(test$Fare)]=median(train$Fare, na.rm=TRUE)
test$Age[is.na(test$Age)]=mean(train$Age, na.rm=TRUE)
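
It helps to count the missing values before and after imputing, to confirm that nothing is left to produce NA predictions. The following is a small sketch on made-up miniature data frames (toyTrain and toyTest are illustrative names, not part of the analysis), using training-set statistics only.

```r
# Made-up miniature train and test sets
toyTrain = data.frame(Age=c(22,38,NA,35), Fare=c(7.25,71.28,7.92,53.10))
toyTest  = data.frame(Age=c(34.5,NA,62), Fare=c(7.83,NA,9.69))

# How many values are missing in each column?
colSums(is.na(toyTest))

# Fill the gaps using statistics from the training set only
toyTest$Fare[is.na(toyTest$Fare)] = median(toyTrain$Fare, na.rm=TRUE)
toyTest$Age[is.na(toyTest$Age)]   = mean(toyTrain$Age, na.rm=TRUE)

# Count again: every column should now report zero
colSums(is.na(toyTest))
```

Running colSums(is.na(test)) on the real test set in the same way would also show whether any other predictor used by the models has gaps.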

We now make copies of the test set to save our predictions based on the models.

myOne=test
my2a=test
my2b=test

For each model, we predict the probability of survival. As the response needs to be in a binary format, we predict that anyone with a survival probability of 0.5 or above survives and that anyone below 0.5 does not. Results are then saved to a file so that they can be submitted on Kaggle.

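# type="response" returns predicted probabilities rather than log-odds
myOne$SurvivedOdds<- predict(m1, myOne,type="response")
# Survived is 1 at a probability of 0.5 or above, otherwise 0
myOne$Survived=ifelse(myOne$SurvivedOdds>=0.5,1,0)
# Keep the passenger ID (column 1) and the prediction (column 12)
myOne=myOne[,c(1,12)]
write.csv(myOne,"level1.csv", row.names=FALSE)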

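my2a$SurvivedOdds<- predict(m2a, my2a,type="response")
# Survived is 1 at a probability of 0.5 or above, otherwise 0
my2a$Survived=ifelse(my2a$SurvivedOdds>=0.5,1,0)
my2a=my2a[,c(1,12)]
# Save for submission (file name assumed, following level1.csv and level2b.csv)
write.csv(my2a,"level2a.csv", row.names=FALSE)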

my2b$SurvivedOdds<- predict(m2b, my2b,type="response")
# Survived is 1 at a probability of 0.5 or above, otherwise 0
my2b$Survived=ifelse(my2b$SurvivedOdds>=0.5,1,0)
my2b=my2b[,c(1,12)]
write.csv(my2b,"level2b.csv", row.names=FALSE)
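
Since the same predict, threshold and save steps are repeated for every model, they could be folded into one helper. The sketch below is one possible way to do that, not the code used for the scores reported here. The function name makeSubmission is hypothetical, the ID column is assumed to be called PassengerId, and the data are simulated so that the example runs on its own.

```r
# Hypothetical helper: predict, threshold at 0.5, keep ID and prediction, optionally save
makeSubmission = function(model, newdata, file=NULL) {
  prob = predict(model, newdata, type="response")   # predicted probabilities
  out = data.frame(PassengerId=newdata$PassengerId,
                   Survived=ifelse(prob >= 0.5, 1, 0))
  if (!is.null(file)) write.csv(out, file, row.names=FALSE)
  out
}

# Simulated data standing in for train and test (not the real Titanic data)
set.seed(1)
simTrain = data.frame(PassengerId=1:200, Age=runif(200, 1, 80))
simTrain$Survived = rbinom(200, 1, plogis(1 - 0.04*simTrain$Age))
simTest = data.frame(PassengerId=201:250, Age=runif(50, 1, 80))

simModel = glm(Survived~Age, data=simTrain, family="binomial")
simSub = makeSubmission(simModel, simTest)
head(simSub)
```

With the real data, a call such as makeSubmission(m1, test, "level1.csv") would replace one of the blocks above, provided the NA values in test have already been filled in.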

The Kaggle scores for these models were as follows:

  • Kaggle score for the level 1 model (m1) was 0.74163
  • Kaggle score for level 2 model a (m2a) was 0.76077
  • Kaggle score for level 2 model b (m2b) was 0.77033 - so of the two interaction models, the simpler one scores better!
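
To put these differences in perspective, the scores can be converted back into an implied number of correct predictions. This is a rough sketch that assumes the score is plain accuracy over all 418 test rows; if Kaggle scores only part of the test set, the counts would be proportionally smaller.

```r
# Scores as reported above
scores = c(random=0.46411, m1=0.74163, m2a=0.76077, m2b=0.77033)
n = 418    # assumes every test row counts towards the score

correct = round(scores * n)   # implied number of correct predictions
correct
diff(correct)                 # gain in correct predictions from one submission to the next
```

Under that assumption, the gap between m2a and m2b comes down to 4 passengers, so the ranking of the two interaction models should be read with some caution.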

The next job will be to go back and conduct exploratory analyses in order to better understand the variables and their relationships.