Predicting which passengers survived the sinking of the Titanic using a provided training and test dataset.
This is a classification task, specifically, a binary classification task, which means that there are only two different states we are classifying.
Below are the descriptions contained in that data dictionary, with the response variable in bold:
PassengerID - A column added by Kaggle to identify each row and make submissions easier Survived - Whether the passenger survived or not and the value we are predicting (0=No, 1=Yes) Pclass - The class of the ticket the passenger purchased (1=1st, 2=2nd, 3=3rd) Sex - The passenger’s sex Age - The passenger’s age in years SibSp - The number of siblings or spouses the passenger had aboard the Titanic Parch - The number of parents or children the passenger had aboard the Titanic Ticket - The passenger’s ticket number Fare - The fare the passenger paid Cabin - The passenger’s cabin number Embarked - The port where the passenger embarked (C=Cherbourg, Q=Queenstown, S=Southampton)
## Lets first visualise the data
glimpse(train)
## Observations: 891
## Variables: 12
## $ PassengerId <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 1…
## $ Survived <int> 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 1…
## $ Pclass <int> 3, 1, 3, 1, 3, 3, 1, 3, 3, 2, 3, 1, 3, 3, 3, 2, 3, 2…
## $ Name <chr> "Braund, Mr. Owen Harris", "Cumings, Mrs. John Bradl…
## $ Sex <chr> "male", "female", "female", "female", "male", "male"…
## $ Age <dbl> 22, 38, 26, 35, 35, NA, 54, 2, 27, 14, 4, 58, 20, 39…
## $ SibSp <int> 1, 1, 0, 1, 0, 0, 0, 3, 0, 1, 1, 0, 0, 1, 0, 0, 4, 0…
## $ Parch <int> 0, 0, 0, 0, 0, 0, 0, 1, 2, 0, 1, 0, 0, 5, 0, 0, 1, 0…
## $ Ticket <chr> "A/5 21171", "PC 17599", "STON/O2. 3101282", "113803…
## $ Fare <dbl> 7.2500, 71.2833, 7.9250, 53.1000, 8.0500, 8.4583, 51…
## $ Cabin <chr> NA, "C85", NA, "C123", NA, NA, "E46", NA, NA, NA, "G…
## $ Embarked <chr> "S", "C", "S", "S", "S", "Q", "S", "S", "S", "C", "S…
glimpse(test)
## Observations: 418
## Variables: 11
## $ PassengerId <int> 892, 893, 894, 895, 896, 897, 898, 899, 900, 901, 90…
## $ Pclass <int> 3, 3, 2, 3, 3, 3, 3, 2, 3, 3, 3, 1, 1, 2, 1, 2, 2, 3…
## $ Name <chr> "Kelly, Mr. James", "Wilkes, Mrs. James (Ellen Needs…
## $ Sex <chr> "male", "female", "male", "male", "female", "male", …
## $ Age <dbl> 34.5, 47.0, 62.0, 27.0, 22.0, 14.0, 30.0, 26.0, 18.0…
## $ SibSp <int> 0, 1, 0, 0, 1, 0, 0, 1, 0, 2, 0, 0, 1, 1, 1, 1, 0, 0…
## $ Parch <int> 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ Ticket <chr> "330911", "363272", "240276", "315154", "3101298", "…
## $ Fare <dbl> 7.8292, 7.0000, 9.6875, 8.6625, 12.2875, 9.2250, 7.6…
## $ Cabin <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, "B45…
## $ Embarked <chr> "Q", "S", "Q", "S", "S", "S", "Q", "S", "C", "S", "S…
The training-set has 891 examples and 11 features + the target variable (survived). 2 of the features are doubles, 5 are integers and 5 are character objects. The testing-set is the same minus the target variable.
Prior to modelling it is important to assess the features which may provide the most information. We assume that women and childern were favoured for spots on the life boats, that there was a large class disparity, that young may have been favoured over old. This suggests that Age, Sex, and PClass may be good predictors of survival.
Because the Survived column contains 0 if the passenger did not survive and 1 if they did, we can segment our data by sex and calculate the mean of this column.
Survived <- train %>%
filter(Survived == 1)
Died <- train %>%
filter(Survived == 0)
sum_train <- summary(train)
##Predictive utility of Age
Sur_Age_plot <- ggplot(train, aes(x = Age))+
geom_histogram()+
facet_wrap(~ Survived)+
labs(title = "Age vs survival")
## Predicitve utility of Sex
Sur_Sex_plot <- ggplot(train, aes(x = factor(Survived), fill = factor(Sex)))+
geom_bar(position = "stack")+
labs(title = "Sex vs survival")
##Predictive utility of Pclass
Sur_Pclass_plot <- ggplot(train, aes(x = factor(Survived), fill = factor(Pclass)))+
geom_bar(position = "stack")+
labs(title = "Pclass vs survival")
## Visualise the plots
print(sum_train)
## PassengerId Survived Pclass Name
## Min. : 1.0 Min. :0.0000 Min. :1.000 Length:891
## 1st Qu.:223.5 1st Qu.:0.0000 1st Qu.:2.000 Class :character
## Median :446.0 Median :0.0000 Median :3.000 Mode :character
## Mean :446.0 Mean :0.3838 Mean :2.309
## 3rd Qu.:668.5 3rd Qu.:1.0000 3rd Qu.:3.000
## Max. :891.0 Max. :1.0000 Max. :3.000
##
## Sex Age SibSp Parch
## Length:891 Min. : 0.42 Min. :0.000 Min. :0.0000
## Class :character 1st Qu.:20.12 1st Qu.:0.000 1st Qu.:0.0000
## Mode :character Median :28.00 Median :0.000 Median :0.0000
## Mean :29.70 Mean :0.523 Mean :0.3816
## 3rd Qu.:38.00 3rd Qu.:1.000 3rd Qu.:0.0000
## Max. :80.00 Max. :8.000 Max. :6.0000
## NA's :177
## Ticket Fare Cabin Embarked
## Length:891 Min. : 0.00 Length:891 Length:891
## Class :character 1st Qu.: 7.91 Class :character Class :character
## Mode :character Median : 14.45 Mode :character Mode :character
## Mean : 32.20
## 3rd Qu.: 31.00
## Max. :512.33
##
print(Sur_Age_plot)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 177 rows containing non-finite values (stat_bin).
print(Sur_Sex_plot)
print(Sur_Pclass_plot)
The tabluar summary indicates that approximatly 38% of people survived. Age contains 177 NAs. There is a large disparity in the fare paid, which is likely to be related to the Pclass. SibSP indicates that just under half of the passengers were traveling alone. We need to convert a lot of features into numeric so that the machine learning algorithms can process them. Furthermore, we can see that the features have widely different ranges, that we will need to convert into roughly the same scale. We can ignore Cabin, Name, Ticket from the dataset. It is difficult to make sense of both the SibSp and Parch together and perhaps these should be combined into a binary outcome, “alone” and “accompanied”
From Age vs Survival plot we can see that there appears to be a bimodal distribution in those that survived. The number of children aged < 10 who survived is greater than the number who died. It may be also may be useful to assume that these children were accompanied by their mothers. It is unclear if female children were favoured over male children. Since there seem to be certain ages, which have increased odds of survival and because I want every feature to be roughly on the same scale.
From Sex vs Survival plot we can see that a much larger proportion of women survived. It may be possible that the women who died were less likely to be mothers.
From Pclass vs Survival plot we can see that class 1 was more likely to survive than 2 and 2 more likely to survive than 3.
##Predictive utillity of Embarked and Fare
Sur_Embarked_Fare <- ggplot(train, aes(x = factor(Embarked), fill = factor(Survived)))+
geom_bar(position = "stack")+
#facet_wrap(~ Survived)+
labs(title = "Embarked vs Fare by survival")
##Predictive utility of Age
Sur_Age_Sex_plot <- ggplot(train, aes(x = Age, fill = Sex))+
geom_histogram(position = "stack")+
facet_wrap(~ Survived)+
labs(title = "Age vs Sex by survival")
## Predicitve utility of Sex
Sur_Fem_SibSP_plot <- ggplot(train, aes(x = factor(Survived), fill = factor(SibSp)))+
geom_bar(position = "stack")+
facet_wrap(~ Sex)+
labs(title = "Sex vs SibSp")
## Visualise the plots
print(Sur_Age_Sex_plot)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 177 rows containing non-finite values (stat_bin).
print(Sur_Fem_SibSP_plot)
print(Sur_Embarked_Fare)
From the Age vs Sex by survival it appears that there was no preferential treatment given to female children.
# Change Pclass to a factor
train$Pclass <- as.numeric(train$Pclass)
#Change Sex to numeric
train$Sex <- as.numeric(ifelse(train$Sex == "male", 1, 0))
#Replace NAs in the age variable with random digits from normal distribution
Age_replace <- rnorm(177, mean= mean(train$Age, na.rm = TRUE), sd = sd(train$Age, na.rm = TRUE))
train$Age[is.na(train$Age)] <- Age_replace
#Replace NAs in the age variable with C because it appears that there are more survivors from the same fare bracket in C than S.
Embarked_replace <- c("C", "C")
train$Embarked[is.na(train$Embarked)] <- Embarked_replace
train$Embarked <- as.numeric(as.factor(train$Embarked))
# Convert SibSp and Parch to Alone
train <- train %>%
mutate(Alone = ifelse(Parch != 0 | SibSp !=0, 0, 1))
Predictors <- c("Age", "Alone", "Pclass", "Sex", "Fare")
train[Predictors] <-scale(train[Predictors], center = TRUE, scale = TRUE)
glimpse(train)
## Observations: 891
## Variables: 13
## $ PassengerId <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 1…
## $ Survived <int> 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 1…
## $ Pclass <dbl> 0.8269128, -1.5652278, 0.8269128, -1.5652278, 0.8269…
## $ Name <chr> "Braund, Mr. Owen Harris", "Cumings, Mrs. John Bradl…
## $ Sex <dbl> 0.737281, -1.354813, -1.354813, -1.354813, 0.737281,…
## $ Age <dbl> -0.53091190, 0.55998645, -0.25818731, 0.35544301, 0.…
## $ SibSp <int> 1, 1, 0, 1, 0, 0, 0, 3, 0, 1, 1, 0, 0, 1, 0, 0, 4, 0…
## $ Parch <int> 0, 0, 0, 0, 0, 0, 0, 1, 2, 0, 1, 0, 0, 5, 0, 0, 1, 0…
## $ Ticket <chr> "A/5 21171", "PC 17599", "STON/O2. 3101282", "113803…
## $ Fare <dbl> -0.50216314, 0.78640362, -0.48857985, 0.42049407, -0…
## $ Cabin <chr> NA, "C85", NA, "C123", NA, NA, "E46", NA, NA, NA, "G…
## $ Embarked <dbl> 3, 1, 3, 3, 3, 2, 3, 3, 3, 1, 3, 3, 3, 3, 3, 3, 2, 3…
## $ Alone <dbl> -1.2309535, -1.2309535, 0.8114666, -1.2309535, 0.811…
Kaggle_glm <- glm(Survived ~ Age+Sex+Pclass+Alone, family = "binomial", train)
summary(Kaggle_glm)
##
## Call:
## glm(formula = Survived ~ Age + Sex + Pclass + Alone, family = "binomial",
## data = train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.6020 -0.6515 -0.4146 0.6285 2.6776
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.64491 0.08961 -7.197 6.16e-13 ***
## Age -0.43129 0.09798 -4.402 1.07e-05 ***
## Sex -1.25841 0.09281 -13.559 < 2e-16 ***
## Pclass -0.96836 0.10004 -9.680 < 2e-16 ***
## Alone 0.03418 0.09166 0.373 0.709
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1186.66 on 890 degrees of freedom
## Residual deviance: 806.57 on 886 degrees of freedom
## AIC: 816.57
##
## Number of Fisher Scoring iterations: 5
#Identify missing values and variables to change
summary(test)
## PassengerId Pclass Name Sex
## Min. : 892.0 Min. :1.000 Length:418 Length:418
## 1st Qu.: 996.2 1st Qu.:1.000 Class :character Class :character
## Median :1100.5 Median :3.000 Mode :character Mode :character
## Mean :1100.5 Mean :2.266
## 3rd Qu.:1204.8 3rd Qu.:3.000
## Max. :1309.0 Max. :3.000
##
## Age SibSp Parch Ticket
## Min. : 0.17 Min. :0.0000 Min. :0.0000 Length:418
## 1st Qu.:21.00 1st Qu.:0.0000 1st Qu.:0.0000 Class :character
## Median :27.00 Median :0.0000 Median :0.0000 Mode :character
## Mean :30.27 Mean :0.4474 Mean :0.3923
## 3rd Qu.:39.00 3rd Qu.:1.0000 3rd Qu.:0.0000
## Max. :76.00 Max. :8.0000 Max. :9.0000
## NA's :86
## Fare Cabin Embarked
## Min. : 0.000 Length:418 Length:418
## 1st Qu.: 7.896 Class :character Class :character
## Median : 14.454 Mode :character Mode :character
## Mean : 35.627
## 3rd Qu.: 31.500
## Max. :512.329
## NA's :1
# Change Pclass to a factor
test$Pclass <- as.numeric(test$Pclass)
#Change Sex to numeric
test$Sex <- as.numeric(ifelse(test$Sex == "male", 1, 0))
#Replace NAs in the age variable with random digits from normal distribution
Age_replace <- rnorm(86, mean= mean(test$Age, na.rm = TRUE), sd = sd(test$Age, na.rm = TRUE))
test$Age[is.na(test$Age)] <- Age_replace
#Convert embarked to factor
test$Embarked <- as.numeric(as.factor(test$Embarked))
# Convert SibSp and Parch to Alone
test <- test %>%
mutate(Alone = ifelse(Parch != 0 | SibSp !=0, 0, 1))
# Center and scale all of the variables
test[Predictors] <-scale(train[Predictors], center = TRUE, scale = TRUE)
## Warning in matrix(value, n, p): data length [4455] is not a sub-multiple or
## multiple of the number of rows [418]
glimpse(test)
## Observations: 418
## Variables: 12
## $ PassengerId <int> 892, 893, 894, 895, 896, 897, 898, 899, 900, 901, 90…
## $ Pclass <dbl> -0.59909305, -0.26141314, 0.15089957, -0.43105479, -…
## $ Name <chr> "Kelly, Mr. James", "Wilkes, Mrs. James (Ellen Needs…
## $ Sex <dbl> 0.8114666, -1.2309535, 0.8114666, -1.2309535, 0.8114…
## $ Age <dbl> -0.53091190, 0.55998645, -0.25818731, 0.35544301, 0.…
## $ SibSp <int> 0, 1, 0, 0, 1, 0, 0, 1, 0, 2, 0, 0, 1, 1, 1, 1, 0, 0…
## $ Parch <int> 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ Ticket <chr> "330911", "363272", "240276", "315154", "3101298", "…
## $ Fare <dbl> -1.2309535, 0.8114666, -1.2309535, 0.8114666, 0.8114…
## $ Cabin <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, "B45…
## $ Embarked <dbl> 2, 3, 2, 3, 3, 3, 2, 3, 1, 3, 3, 3, 3, 3, 3, 1, 2, 1…
## $ Alone <dbl> 0.01453728, -1.34908566, -0.33165146, -0.59909305, -…
eval_glm <- predict(Kaggle_glm, test)
summary(eval_glm)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -5.3564 -2.1775 -0.5646 -0.6237 0.2959 3.1727
test <- test %>%
mutate(Outcome = eval_glm)%>%
mutate(Survived = ifelse(eval_glm > 0, 1, 0))
glimpse(test)
## Observations: 418
## Variables: 14
## $ PassengerId <int> 892, 893, 894, 895, 896, 897, 898, 899, 900, 901, 90…
## $ Pclass <dbl> -0.59909305, -0.26141314, 0.15089957, -0.43105479, -…
## $ Name <chr> "Kelly, Mr. James", "Wilkes, Mrs. James (Ellen Needs…
## $ Sex <dbl> 0.8114666, -1.2309535, 0.8114666, -1.2309535, 0.8114…
## $ Age <dbl> -0.53091190, 0.55998645, -0.25818731, 0.35544301, 0.…
## $ SibSp <int> 0, 1, 0, 0, 1, 0, 0, 1, 0, 2, 0, 0, 1, 1, 1, 1, 0, 0…
## $ Parch <int> 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ Ticket <chr> "330911", "363272", "240276", "315154", "3101298", "…
## $ Fare <dbl> -1.2309535, 0.8114666, -1.2309535, 0.8114666, 0.8114…
## $ Cabin <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, "B45…
## $ Embarked <dbl> 2, 3, 2, 3, 3, 3, 2, 3, 1, 3, 3, 3, 3, 3, 3, 1, 2, 1…
## $ Alone <dbl> 0.01453728, -1.34908566, -0.33165146, -0.59909305, -…
## $ Outcome <dbl> -0.85646340, 0.86964219, -1.71217892, 1.14776886, -1…
## $ Survived <dbl> 0, 1, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0…
x <- as.data.frame(cbind(as.integer(test$PassengerId), as.integer(test$Survived)))
colnames(x) <- c("PassengerId", "Survived")
write_csv(x, "/Users/Elliot/Documents/GitHub/Teo_Git/Kaggle/titanic_submission.csv", col_names = TRUE)