In this document, we predict the survival of passengers aboard the Titanic from their compartment class, gender, age, family status, port of embarkation, and the fare they paid for the trip.
I wonder how the algorithms will deal with the ticket price, which depends not only on compartment class but also on the port of embarkation.
# load all data sets from local CSV files
# train - training data set
# test - test data set
# gendermodel - gender-based sample predictions of survival for the passengers in the test data set
train <- read.csv("C:\\Users\\Windows\\Dropbox\\AllStuff\\Titanic_Kaggle\\Data\\train.csv")
gendermodel <- read.csv("C:\\Users\\Windows\\Dropbox\\AllStuff\\Titanic_Kaggle\\Data\\gendermodel.csv")
test <- read.csv("C:\\Users\\Windows\\Dropbox\\AllStuff\\Titanic_Kaggle\\Data\\test.csv")
# what do the data sets look like?
str(train)
## 'data.frame': 891 obs. of 12 variables:
## $ PassengerId: int 1 2 3 4 5 6 7 8 9 10 ...
## $ Survived : int 0 1 1 1 0 0 0 0 1 1 ...
## $ Pclass : int 3 1 3 1 3 3 1 3 3 2 ...
## $ Name : Factor w/ 891 levels "Abbing, Mr. Anthony",..: 109 191 358 277 16 559 520 629 417 581 ...
## $ Sex : Factor w/ 2 levels "female","male": 2 1 1 1 2 2 2 2 1 1 ...
## $ Age : num 22 38 26 35 35 NA 54 2 27 14 ...
## $ SibSp : int 1 1 0 1 0 0 0 3 0 1 ...
## $ Parch : int 0 0 0 0 0 0 0 1 2 0 ...
## $ Ticket : Factor w/ 681 levels "110152","110413",..: 524 597 670 50 473 276 86 396 345 133 ...
## $ Fare : num 7.25 71.28 7.92 53.1 8.05 ...
## $ Cabin : Factor w/ 148 levels "","A10","A14",..: 1 83 1 57 1 1 131 1 1 1 ...
## $ Embarked : Factor w/ 4 levels "","C","Q","S": 4 2 4 4 4 3 4 4 4 2 ...
str(gendermodel)
## 'data.frame': 418 obs. of 2 variables:
## $ PassengerId: int 892 893 894 895 896 897 898 899 900 901 ...
## $ Survived : int 0 1 0 0 1 0 1 0 1 0 ...
str(test)
## 'data.frame': 418 obs. of 11 variables:
## $ PassengerId: int 892 893 894 895 896 897 898 899 900 901 ...
## $ Pclass : int 3 3 2 3 3 3 3 2 3 3 ...
## $ Name : Factor w/ 418 levels "Abbott, Master. Eugene Joseph",..: 210 409 273 414 182 370 85 58 5 104 ...
## $ Sex : Factor w/ 2 levels "female","male": 2 1 2 2 1 2 1 2 1 2 ...
## $ Age : num 34.5 47 62 27 22 14 30 26 18 21 ...
## $ SibSp : int 0 1 0 0 1 0 0 1 0 2 ...
## $ Parch : int 0 0 0 0 1 0 0 1 0 0 ...
## $ Ticket : Factor w/ 363 levels "110469","110489",..: 153 222 74 148 139 262 159 85 101 270 ...
## $ Fare : num 7.83 7 9.69 8.66 12.29 ...
## $ Cabin : Factor w/ 77 levels "","A11","A18",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ Embarked : Factor w/ 3 levels "C","Q","S": 2 3 2 3 3 3 2 3 1 3 ...
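In the rows printed above, `gendermodel` appears to mark every female passenger in `test` as a survivor and every male as a non-survivor. The following is a minimal sketch of how that could be checked. It assumes the two tables should be aligned on `PassengerId`, and it uses only the first ten rows shown in the `str()` output; `test_toy` and `gendermodel_toy` are hypothetical stand-ins for the full data frames.

```r
# First ten rows as printed by str(test) and str(gendermodel) above
# (stand-ins for the full data frames, not the complete data).
test_toy <- data.frame(
  PassengerId = 892:901,
  Sex = c("male", "female", "male", "male", "female",
          "male", "female", "male", "female", "male")
)
gendermodel_toy <- data.frame(
  PassengerId = 892:901,
  Survived = c(0, 1, 0, 0, 1, 0, 1, 0, 1, 0)
)

# Align the two tables on PassengerId rather than relying on row order.
both <- merge(test_toy, gendermodel_toy, by = "PassengerId")

# TRUE when every female is marked as a survivor and every male is not.
gender_rule_holds <- all(both$Survived == as.integer(both$Sex == "female"))
print(gender_rule_holds)
```

To run the same check on the full data, `test_toy` and `gendermodel_toy` would be replaced by `test` and `gendermodel`.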
# change column names to lower case
names(train) <- tolower(names(train))
names(gendermodel) <- tolower(names(gendermodel))
names(test) <- tolower(names(test))
# passenger name, ticket and cabin are removed: they might matter for an individual passenger,
# but they offer no obvious generalizable pattern for predicting survival.
train <- train[, !(names(train) %in% c("name", "ticket", "cabin"))]
test <- test[, !(names(test) %in% c("name", "ticket", "cabin"))]
head(train)
## passengerid survived pclass sex age sibsp parch fare embarked
## 1 1 0 3 male 22 1 0 7.2500 S
## 2 2 1 1 female 38 1 0 71.2833 C
## 3 3 1 3 female 26 0 0 7.9250 S
## 4 4 1 1 female 35 1 0 53.1000 S
## 5 5 0 3 male 35 0 0 8.0500 S
## 6 6 0 3 male NA 0 0 8.4583 Q
head(test)
## passengerid pclass sex age sibsp parch fare embarked
## 1 892 3 male 34.5 0 0 7.8292 Q
## 2 893 3 female 47.0 1 0 7.0000 S
## 3 894 2 male 62.0 0 0 9.6875 Q
## 4 895 3 male 27.0 0 0 8.6625 S
## 5 896 3 female 22.0 1 1 12.2875 S
## 6 897 3 male 14.0 0 0 9.2250 S
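The question raised at the start, how fare relates to both compartment class and port of embarkation, can be explored with a grouped summary. The sketch below is one possible approach, not a definitive analysis: it takes the median fare per class and port, and it uses only the six rows printed by `head(train)` (the data frame `toy` is a hypothetical stand-in for `train`).

```r
# The six rows printed by head(train) above (a stand-in for the full data).
toy <- data.frame(
  pclass   = c(3, 1, 3, 1, 3, 3),
  fare     = c(7.2500, 71.2833, 7.9250, 53.1000, 8.0500, 8.4583),
  embarked = c("S", "C", "S", "S", "S", "Q")
)

# Median fare for each class/port combination that occurs in the data.
fare_by_group <- aggregate(fare ~ pclass + embarked, data = toy, FUN = median)
print(fare_by_group)
```

Replacing `toy` with `train` gives the same summary for the full training set. The median is used here because fares are likely to be skewed; `mean` would be a reasonable alternative.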
I am interested in a simple, black-and-white picture of what the data say:
* survivability based on compartment class
* survivability based on gender
* survivability based on age
* survivability based on family status
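
The four breakdowns listed above can be sketched as grouped survival rates. This is a minimal illustration under stated assumptions: `survival_rate` is a hypothetical helper, the age cut points are arbitrary, "family status" is approximated as having any sibling, spouse, parent or child aboard, and `toy` holds only the six rows printed by `head(train)`.

```r
# The six rows printed by head(train) above (a stand-in for the full data).
toy <- data.frame(
  survived = c(0, 1, 1, 1, 0, 0),
  pclass   = c(3, 1, 3, 1, 3, 3),
  sex      = c("male", "female", "female", "female", "male", "male"),
  age      = c(22, 38, 26, 35, 35, NA),
  sibsp    = c(1, 1, 0, 1, 0, 0),
  parch    = c(0, 0, 0, 0, 0, 0)
)

# Hypothetical helper: the mean of a 0/1 column is the share of survivors.
survival_rate <- function(data, by) {
  tapply(data$survived, data[[by]], mean)
}

rate_by_class <- survival_rate(toy, "pclass")
rate_by_sex   <- survival_rate(toy, "sex")

# Age is continuous, so bin it first; the cut points are an assumption.
# Rows with a missing age fall into no bin and are left out of the rates.
toy$age_group <- cut(toy$age, breaks = c(0, 18, 40, Inf),
                     labels = c("child", "adult", "older"))
rate_by_age <- survival_rate(toy, "age_group")

# Family status approximated as travelling with at least one relative.
toy$has_family <- (toy$sibsp + toy$parch) > 0
rate_by_family <- survival_rate(toy, "has_family")
```

With only six rows the numbers themselves mean little; running the same calls on `train` gives the rates for the full training set.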