Titanic Survivors

The aim is to predict whether a given passenger survived the sinking of the Titanic. The passenger data is available from https://www.kaggle.com.

For the analysis, the libraries ggplot2 (for qplot) and randomForest are loaded.

Getting the data and a first look at it

First, the data is downloaded from kaggle.com and a local copy is saved.

# Download the training set once and keep a local copy in rawdata/.
sourceUrl <- "https://www.kaggle.com/c/titanic-gettingStarted/download/train.csv"
if (!file.exists("rawdata")) { dir.create("rawdata") }
fileName <- "rawdata/train.csv"
if (!file.exists(fileName)) {
    download.file(sourceUrl, destfile = fileName, method = "curl")
}

# Same for the test set; rawdata/ already exists at this point.
sourceUrl <- "https://www.kaggle.com/c/titanic-gettingStarted/download/test.csv"
fileName <- "rawdata/test.csv"
if (!file.exists(fileName)) {
    download.file(sourceUrl, destfile = fileName, method = "curl")
}

The data is given in the CSV files rawdata/train.csv and rawdata/test.csv, which need to be read in first.

train <- read.csv("rawdata/train.csv", header=TRUE)
test <- read.csv("rawdata/test.csv", header=TRUE)

The training set contains 891 rows and consists of the columns: PassengerId, Survived, Pclass, Name, Sex, Age, SibSp, Parch, Ticket, Fare, Cabin, Embarked. Here, the variable Survived uses ‘0’ for ‘No’ and ‘1’ for ‘Yes’. To avoid confusion, these codes are replaced by factor labels; Pclass is converted to a factor as well.

train <- within(train, Survived <- factor(Survived, labels = c("No", "Yes")))
train <- within(train, Pclass <- factor(Pclass))
test <- within(test, Pclass <- factor(Pclass))

Next, a closer look at the variables.

The entry for the Cabin variable is mostly missing.
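One way to check how much of each column is missing is sketched below. The helper name countMissing is hypothetical and not part of the original analysis. It counts both NA and empty strings as missing, because read.csv leaves blank text fields as "" rather than NA by default.

```r
# Count missing entries per column, treating both NA and "" as missing.
# countMissing is an illustrative helper, not part of the original analysis.
countMissing <- function(df) {
    sapply(df, function(column) {
        sum(is.na(column) | as.character(column) == "")
    })
}

# Only runs if the training data has already been read in as `train`.
if (exists("train")) {
    print(countMissing(train))
}
```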

qplot(Sex, Pclass, colour = Survived, data=train, geom="jitter")
qplot(Survived, data=train, geom="bar", facets = . ~ Sex, fill=Pclass)

[Plots from chunk firstplots]

From the above plots, one can see that Sex and Pclass are meaningful variables. Also, children up to 10 years of age survived more often. However, the variable Age is missing for 177 passengers. One should refrain from simply ignoring these rows: Age is known predominantly for survivors, so dropping the rows without it would bias the sample systematically.
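Whether Age is missing more often for one outcome than for the other can be checked directly. The following is a minimal sketch; the helper name missingShareByGroup is hypothetical and not part of the original analysis. It returns, for each level of a grouping variable, the share of rows in which a given column is NA.

```r
# Share of rows with an NA in `column`, computed separately per level of `group`.
# missingShareByGroup is an illustrative helper, not part of the original analysis.
missingShareByGroup <- function(df, column, group) {
    tapply(is.na(df[[column]]), df[[group]], mean)
}

# Only runs if the training data has already been read in as `train`.
if (exists("train")) {
    print(missingShareByGroup(train, "Age", "Survived"))
}
```

A clearly higher share for "No" than for "Yes" would support the concern about systematic bias.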

Fitting with only two variables

First, a random forest model is fitted with the randomForest package, using only the variables Pclass and Sex.

fit <- randomForest(Survived ~Pclass + Sex, data=train)
prediction.1 <- predict(fit, test)

Alternatively, one can consider a model including all variables for which meaningful numeric or factor data is available except for Age (for the reason given above).

fit <- randomForest(Survived ~Pclass + Sex + SibSp + Parch, data=train)
prediction.2 <- predict(fit, test)
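Since the test set carries no Survived column, the two models cannot be compared on it locally. One possible stand-in is the out-of-bag (OOB) error estimate that randomForest stores for classification fits. The sketch below is not part of the original analysis: the helper oobError, the names fit.1 and fit.2, and the seed are assumptions, and it presumes that Survived has already been converted to a factor.

```r
# Final out-of-bag (OOB) error estimate of a randomForest classification fit.
# err.rate has one row per tree; the last row covers the full forest.
# oobError is an illustrative helper, not part of the original analysis.
oobError <- function(fit) {
    unname(fit$err.rate[fit$ntree, "OOB"])
}

# Only runs if `train` exists and the randomForest package is installed.
if (exists("train") && requireNamespace("randomForest", quietly = TRUE)) {
    set.seed(1)  # arbitrary seed, only to make the comparison repeatable
    fit.1 <- randomForest::randomForest(Survived ~ Pclass + Sex, data = train)
    fit.2 <- randomForest::randomForest(Survived ~ Pclass + Sex + SibSp + Parch,
                                        data = train)
    print(c(model1 = oobError(fit.1), model2 = oobError(fit.2)))
}
```

The fits are kept under separate names here so that both remain available for the comparison.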

For the Kaggle submission, the predictions are written to CSV files. Kaggle expects Survived coded as 0/1, so the factor labels are converted back first.

# first model
result <- data.frame(PassengerId = test$PassengerId,
                     Survived = as.integer(prediction.1 == "Yes"))
write.csv(result, "prediction1.csv", row.names = FALSE)
# second model
result <- data.frame(PassengerId = test$PassengerId,
                     Survived = as.integer(prediction.2 == "Yes"))
write.csv(result, "prediction2.csv", row.names = FALSE)