# import the dataset
# the working directory is set to be desktop/titanic
predict_tree <- read.csv("predict_tree.csv")
predict_rf <- read.csv("predict_rf.csv")
predict_cv <- read.csv("predict_cv.csv")
predict_log <- read.csv("predict_log.csv")
test <- read.csv("/Users/vf/Desktop/titanic/hist_age/test.csv")
train <- read.csv("/Users/vf/Desktop/titanic/hist_age/train.csv")
# get the number of rows in the train data frame
num <- nrow(train)
The dataset analyzed in this report is “Titanic - Machine Learning from Disaster”, which is available on Kaggle, a popular machine learning platform. The primary objective is to predict whether a passenger survived the Titanic disaster based on variables such as “pclass”, “sex”, “age”, “sibsp”,…
The dataset is split into two sets (train and test), and there are 12 variables in total.
Below is a table that shows the variable names and their corresponding meanings. (Table 1)

| Variable | Definition | Key |
|---|---|---|
| survival | Survival | 0 = No, 1 = Yes |
| pclass | Ticket class | 1 = 1st, 2 = 2nd, 3 = 3rd |
| sex | Sex | |
| Age | Age in years | |
| sibsp | # of siblings / spouses aboard the Titanic | |
| parch | # of parents / children aboard the Titanic | |
| ticket | Ticket number | |
| fare | Passenger fare | |
| cabin | Cabin number | |
| embarked | Port of Embarkation | C = Cherbourg, Q = Queenstown, S = Southampton |
In this section, the main focus is on finding interesting and effective features in the dataset. The table below (Table 2) shows the number of missing values for each variable in the test and train sets.
| Variable | Test | Train |
|---|---|---|
| PassengerId | 0 | 0 |
| Pclass | 0 | 0 |
| Name | 0 | 0 |
| Sex | 0 | 0 |
| Age | 86 | 177 |
| SibSp | 0 | 0 |
| Parch | 0 | 0 |
| Ticket | 0 | 0 |
| Fare | 1 | 0 |
| Cabin | 0 | 0 |
| Embarked | 0 | 0 |
From the above table we can easily observe that there are many missing values in the “Age” variable in both data sets; these missing values are handled in a later section.
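As a quick check, a minimal sketch of how these missing-value counts could be obtained (not necessarily the exact code used to build Table 2):

# count the missing values in every column of the two data sets
colSums(is.na(test))
colSums(is.na(train))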
Now we can plot a bar chart to show the probability of survival for different age groups (a Shiny app is shown during the presentation).
# combine the train and test sets (dropping Survived so the columns match) for joint preprocessing
data_ <- rbind(train %>% select(-Survived), test)
train %>%
mutate(age_bin = cut(Age, breaks = 10)) %>%
group_by(age_bin) %>%
summarise(mean_survived = mean(Survived)) %>%
ggplot(aes(x = age_bin, y = mean_survived)) +
geom_bar(stat = "identity", fill = "orange") +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
labs(x = "Age", y = "Mean Survival Rate")
From the bar plot we can easily see that the age groups (0.34, 8.38] and (72, 80.1] have relatively higher survival probabilities, while all the other groups have similar survival probabilities.
Next, we plot a histogram to observe the distribution of the Age variable.
p1 = ggplot(train,aes(Age)) +
geom_histogram(binwidth = (80-0.42)/30,
fill = "orange") +
labs(title = "Age Distribution", x = "Age", y = "Count") +
scale_x_continuous(breaks = seq(0, 90, 5)) +
theme(plot.margin = unit(c(1, 1, 1, 1), "cm"))
p1
By analyzing the histogram, we can infer that a significant portion of
the passengers fall within the age range of [15,40], and there is a
scarcity of passengers who are older than 60. To enhance the
representativeness of the data, we can categorize the passengers into
distinct age groups instead of relying solely on their individual ages.
These age groups provide a better depiction of the distribution of the
Age variable, and at the same time they give us a way to handle the missing Age values in the data.
Eleven groups are used for this dataset, with break points at (0, 5, 10, 15, 20, 25, 30, 35, 40, 50, 60, 100); the missing ages are placed in a twelfth group below.
age_to_group <- function(age) {
group <- cut(age, breaks = c(0, 5, 10, 15, 20, 25, 30, 35, 40, 50, 60, 100),
labels = paste0("age_", 1:11))
as.character(group)
}
data_ <- data_ %>% mutate(age_group = age_to_group(Age))
data_$age_group <- ifelse(is.na(data_$age_group), "age_12", data_$age_group)
Also, the null values in the dataset have been fixed in this section: the missing ages are assigned to a twelfth group, labelled “age_12”.
We also want to know the survival probabilities for male and female passengers in the Titanic disaster.
| Sex | mean_survived |
|---|---|
| female | 0.7420382 |
| male | 0.1889081 |
We can observe from Table 3 above that the survival probability for females is much higher than for males.
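For reference, Table 3 can be reproduced with a short dplyr pipeline (a sketch, assuming dplyr is already loaded as in the rest of the report):

# mean survival rate by sex (Table 3)
train %>%
  group_by(Sex) %>%
  summarise(mean_survived = mean(Survived))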
| Embarked | Survival.Rate | Mean.Fare |
|---|---|---|
| C | 0.5535714 | 59.95414 |
| Q | 0.3896104 | 13.27603 |
| S | 0.3369565 | 27.07981 |
From the above analysis (Table 4) we can see that the survival rate at port C is higher than at the other ports, and the mean ticket price at port C is the highest of the three. This observation could indicate that wealthier passengers were more likely to survive the disaster.
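A sketch of how Table 4 could be computed (filtering out the blank Embarked entries in the training data is our assumption):

# survival rate and mean fare by port of embarkation (Table 4)
train %>%
  filter(Embarked %in% c("C", "Q", "S")) %>%
  group_by(Embarked) %>%
  summarise(Survival.Rate = mean(Survived),
            Mean.Fare = mean(Fare))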
# the line plot of the survival rate for the variable SibSp
sibsp_line <- train %>%
group_by(SibSp) %>%
summarise(mean_survival = mean(Survived)) %>%
ggplot(aes(x = SibSp, y = mean_survival)) +
geom_line(colour = "orange") +
labs(x = "SibSp", y = "Mean survival rate")
# the line plot of the survival rate for the variable parch
parch_line <- train %>%
group_by(Parch) %>%
summarise(mean_survival = mean(Survived)) %>%
ggplot(aes(x = Parch, y = mean_survival)) +
geom_line(colour = "orange") +
labs(x = "Parch", y = "Mean survival rate")
# the bar plot of the survival rate for the variable cabin
cabin_bar <- train %>%
group_by(Cabin) %>%
summarise(mean_survived = mean(Survived)) %>%
ggplot(aes(x = factor(Cabin), y = mean_survived)) +
geom_bar(stat = "identity", fill = "orange") +
xlab("Cabin") +
ylab("Mean survival rate")
grid.arrange(sibsp_line, parch_line, cabin_bar, nrow = 1)
From the line plots and the bar plot, we can tell that for certain values of SibSp and Parch the survival rate is actually higher. However, the bar plot of Cabin against Survived seems to lack detail, so we take a closer look at the Cabin data to decide whether or not to drop this variable.
The number of unique values for “Cabin” is 148, and most “Cabin” values appear only once, which suggests the raw variable is not statistically useful. However, the first letter of each “Cabin” value indicates a particular deck area on the ship, and the numbers after the letter identify the specific cabin, so some simple modification of the data is needed.
data_$Cabin <- ifelse(is.na(data_$Cabin), "null", substr(data_$Cabin, 1, 1))
After the modification, only the first letter of “Cabin” is kept, which reduces the 148 unique values to a small set of deck letters (plus “null” for the missing values).
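To verify this, a one-line check of the reduced variable (a sketch):

# frequency of each deck letter (plus "null" for missing cabins) after the modification
table(data_$Cabin)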
We can see that the Name variable contains both the title and the name of the person, but we are only interested in the title. The title not only conveys gender but also gives information about social status and education level.
# this is the code which extracts the title from the name
name_ <- function(name) {
name <- strsplit(name, "\\(")[[1]][1]
name <- strsplit(name, ", ")[[1]][2]
name <- strsplit(name, "\\.")[[1]][1]
return(name)
}
data_$Name <- apply(data_["Name"], 1, name_)
After the modification, there are a number of distinct titles, some of which appear only once or twice. To make the titles more representative, we relabel the rare ones (those appearing 10 times or fewer) as “Other”.
for (i in unique(data_$Name)){
if (sum(data_$Name == i) <= 10){
data_$Name[data_$Name == i] <- "Other"
}
}
The final counts for each unique title are given in the table below (Table 5); we can see that most people have the title “Mr” and only 34 people have the “Other” title.
| title | numbers |
|---|---|
| Mr | 757 |
| Mrs | 197 |
| Miss | 260 |
| Master | 61 |
| Other | 34 |
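Table 5 can be obtained directly from the combined data, for example:

# counts of each title after the rare titles are grouped into "Other" (Table 5)
table(data_$Name)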
After these transformations, the remaining missing values in the combined data are shown below (Table 6).
| Variable | Count |
|---|---|
| PassengerId | 0 |
| Pclass | 0 |
| Name | 0 |
| Sex | 0 |
| SibSp | 0 |
| Parch | 0 |
| Ticket | 0 |
| Fare | 1 |
| Cabin | 0 |
| Embarked | 0 |
| age_group | 0 |
data_ <- data_ %>%
mutate(Fare = ifelse(is.na(Fare),
median(Fare, na.rm = TRUE),
Fare))
Fill in the null value for “Fare” using the median of “Fare”.
# remove the variable that is not significant in the analysis
data_ <- data_ %>% select(-Ticket)
There are some character-type variables that we cannot feed directly into the logistic model, so I decided to use one-hot encoding, a commonly used method for converting categorical data into binary features. In one-hot encoding, each unique value of a categorical variable is represented as a separate binary feature, where 1 indicates the presence of that category and 0 indicates its absence.
| Id | Color | Color_G | Color_B |
|---|---|---|---|
| 1 | Red | 0 | 0 |
| 2 | Green | 1 | 0 |
| 3 | Blue | 0 | 1 |
| 4 | Red | 0 | 0 |
| 5 | Green | 1 | 0 |
From this example (Table 7) we can see that (0,0)
means “Red”, (1,0) means “Green” and (0,1) means “Blue”
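A minimal sketch reproducing this toy example with caret’s dummyVars (the object names toy and enc are ours, and the generated column names differ slightly from the table):

library(caret)

# toy data frame matching Table 7, with "Red" as the reference level
toy <- data.frame(Id = 1:5,
                  Color = factor(c("Red", "Green", "Blue", "Red", "Green"),
                                 levels = c("Red", "Green", "Blue")))

# fullRank = TRUE drops the reference level, giving the Color_G / Color_B style columns
enc <- dummyVars(~ Color, data = toy, fullRank = TRUE)
predict(enc, newdata = toy)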
# Fit One-Hot Encoder
one_hot <- predict(dummyVars(~., data_), newdata = data_)
train_one <- one_hot[1:num, ]
test_one <- one_hot[(num+1):nrow(one_hot), ]
train_one <- data.frame(train_one)
test_one <- data.frame(test_one)
Sigmoid function:
\(g(z) = \frac{1}{1+e^{-z}}\)
As z tends to negative infinity, g(z) tends to 0. When z tends to positive infinity, g(z) tends to 1.
Logistic Regression is an excellent choice for binary classification problems because it produces predicted values between 0 and 1. To obtain the desired output, simply set a cut-off value at 0.5. If the predicted value is greater than 0.5, classify it as 1, and if it is less than or equal to 0.5, classify it as 0.
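A small sketch of the sigmoid and the 0.5 cut-off rule (the function name sigmoid is ours, not part of the report’s code):

# sigmoid function g(z) = 1 / (1 + exp(-z))
sigmoid <- function(z) 1 / (1 + exp(-z))
sigmoid(-10)  # close to 0
sigmoid(0)    # exactly 0.5
sigmoid(10)   # close to 1

# classify a predicted probability using the 0.5 cut-off
as.integer(sigmoid(1.3) > 0.5)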
# Fit Logistic Regression model
model_log <- glm(train$Survived ~ ., data = train_one, family = binomial(link = "logit"))
# Predict probabilities for test data
pred_probs <- predict(model_log, newdata = test_one, type = "response")
The Kaggle score for this logistic regression is 0.77033, which indicates that this simple logistic regression predicted 77.03% of the test results correctly.
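The step that turns these probabilities into a submission file is not shown in this section; a sketch consistent with the other models (the file name predict_log_out.csv is our assumption) would be:

# apply the 0.5 cut-off and write the logistic regression submission
predict_log$Survived <- ifelse(pred_probs > 0.5, 1, 0)
write.csv(as.matrix(predict_log), file = "predict_log_out.csv", row.names = F)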
Generally speaking, the performance of the logistic regression is quite good given that it is a very simple model. Next we try the CART classification tree and compare its performance.
In this section, our goal is to perform model selection and include only relevant variables in the model. With a dataset consisting of 35 variables, applying the AIC criterion to every possible variable combination would be prohibitively time-consuming: assuming that fitting one model takes 1 s, we would have to consider \(2^{35}\) models (including the null model), which would take around 1000 years to finish. So exhaustive AIC-based selection is not a viable option here. Backward elimination offers a more efficient alternative, though it comes with certain limitations, and it is the method applied in this section.
# this section is for the model selection
# using the backward elimination method
f_removal_model_selection <- function(model, p_value_threshold) {
significant_vars <- model
removed_vars <- c()
repeat {
p_values <- summary(significant_vars)$coefficients[, 4]
p_values <- p_values[-1] # Exclude intercept
if (all(p_values <= p_value_threshold)) {
break
}
var_to_remove <- names(p_values)[which.max(p_values)]
removed_vars <- c(removed_vars, var_to_remove)
formula_update <- as.formula(
paste("train$Survived ~ ", paste(setdiff(names(p_values), var_to_remove), collapse = " + "))
)
significant_vars <- update(significant_vars, formula_update)
}
return(list(selected_model = significant_vars, removed_variables = removed_vars))
}
# Perform model selection with F removal level
removal_level <- 0.05
selected_model_info <- f_removal_model_selection(model_log, removal_level)
However, the performance of the reduced model does not improve much: the reduced model still predicts 77.033% of the results correctly. This is because the backward elimination method has some limitations:
- Arbitrariness: the choice of the threshold (e.g., 0.05) is somewhat arbitrary, and there is no one-size-fits-all value. Different thresholds may lead to different model selections, so we are not guaranteed to find the best reduced model with this method.
- Multiple testing: with a large number of predictor variables, the chance of false positives (Type I errors) increases, which can lead to the inclusion of irrelevant variables in the model.
- Model stability: stepwise selection can result in unstable models, especially when predictor variables are highly correlated; small changes in the dataset can lead to significant changes in the selected model.
The important concept used in CART is the Gini index, which measures the quality of a split in a binary decision tree. The Gini index measures the impurity of a node, where a node is considered pure if all the observations in the node belong to the same class:

\(Gini(D) = 1 - \sum_{k=1}^{K} p_k^2\)

where \(p_k\) is the proportion of observations in \(D\) belonging to class \(k\). The Gini index of splitting the node \(D\) into \(D_1\) and \(D_2\) on feature \(A\) is

\(Gini_A(D) = \frac{|D_1|}{|D|}\,Gini(D_1) + \frac{|D_2|}{|D|}\,Gini(D_2)\)

where \(A\) is one of the features in the data frame and \(|D|, |D_1|, |D_2|\) are the sample sizes of \(D, D_1, D_2\).
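As an illustration of these formulas, a small sketch (the helper names gini_node and gini_split are ours, not part of the report’s code):

# Gini impurity of a single node
gini_node <- function(labels) {
  p <- table(labels) / length(labels)
  1 - sum(p^2)
}

# weighted Gini impurity of a binary split D -> (D1, D2)
gini_split <- function(labels1, labels2) {
  n1 <- length(labels1); n2 <- length(labels2)
  (n1 * gini_node(labels1) + n2 * gini_node(labels2)) / (n1 + n2)
}

# e.g. the impurity of splitting the training data on Sex
gini_split(train$Survived[train$Sex == "male"],
           train$Survived[train$Sex == "female"])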
# CART classification tree
tree_model <- rpart(train$Survived ~ ., data = train_one, method = "class")
rpart.plot(tree_model, type = 1, extra = 104, tweak = 1.2)
predictions <- predict(tree_model, newdata = test_one, type = "class")
predict_tree$Survived <- predictions
predict_tree_out <- as.matrix(predict_tree)
write.csv(predict_tree_out, file = "predict_tree_out.csv", row.names = F)
The accuracy for CART is 77.511%, which is better than the logistic regression. The main reason is that the logistic regression above considered all of the variables in the data frame, and in most cases the full model is not the best one. A reduced model generally performs better, and the Gini index acts as a selection criterion that only brings significant variables into the model.
Then we apply cross-validation to the CART model, using 50 folds.
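The helper to_binary used below (and for the later models) is not defined in this section; a definition consistent with how it is used would be:

# threshold numeric predictions at 0.5 to get 0/1 labels (our assumption)
to_binary <- function(pred) ifelse(pred > 0.5, 1, 0)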
train_one$survived <- train$Survived
control <- trainControl(method = "cv", number = 50)
model_cv <- train(survived ~ ., data = train_one, method = "rpart", trControl = control)
rpart.plot(model_cv$finalModel)
predictions_cv <- predict(model_cv, newdata = test_one)
survival_cv <- to_binary(predictions_cv)
predict_cv$Survived <- survival_cv
predict_cv_out <- as.matrix(predict_cv)
write.csv(predict_cv_out, file = "predict_cv_out.csv",row.names = F)
We can see from the plot that the CART model has been further pruned, and after k-fold cross-validation the CART classification tree predicts 78.229% of the results correctly.
# random forest
test_one$survived <- NA
rf_model <- randomForest(survived ~ ., data = train_one, ntree = 500,
mtry = 3, importance = TRUE)
predictions_rf <- predict(rf_model, test_one)
survival_rf <- to_binary(predictions_rf)
predict_rf$Survived <- survival_rf
predict_rf_out <- as.matrix(predict_rf)
write.csv(predict_rf_out,"predict_rf_out.csv", row.names = F)
The Random Forest model predicts 78.468% of the results correctly.
In this section XGBoost is used. XGBoost is a strong model with several benefits:

- Regularization: XGBoost incorporates L1 (lasso) and L2 (ridge) regularization terms in its objective function, which helps prevent overfitting and makes the model more generalizable.
- Handling of imbalanced data: XGBoost supports assigning weights to individual data points, which is useful for imbalanced datasets where one class is significantly more frequent than the other (a parameter sketch follows below).
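For illustration only (these settings are not used in the report’s pipeline), the regularization and class-weight parameters mentioned above appear in an xgboost parameter list like this:

# lambda/alpha control L2/L1 regularization; scale_pos_weight re-weights the positive class
params_example <- list(
  objective = "binary:logistic",
  eval_metric = "error",
  lambda = 1,
  alpha = 0,
  scale_pos_weight = sum(train$Survived == 0) / sum(train$Survived == 1)
)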
So in this section we work with this strong model and use grid search to find the best combination of hyperparameters.
# prepare the data
train_x <- train_one %>%
select(-survived) %>%
as.matrix()
train_y <- train_one$survived
params <- list(
objective = "binary:logistic",
eval_metric = "error",
max_depth = 5,
eta = 0.1
)
model_xg <- xgboost(
data = train_x,
label = train_y,
params = params,
nrounds = 100,
verbose = 0
)
test_x <- test_one %>%
select(-survived) %>%
as.matrix()
predict_xg <- predict(model_xg, test_x)
As the initial performance of XGBoost is not satisfactory, model tuning is needed. The code below uses the grid search technique to select the best combination of hyperparameters for the model: we systematically search through a predefined set of hyperparameter values and train the model with each combination.
train_x <- train_one %>%
select(-survived) %>%
as.matrix()
train_y <- train_one$survived
Define the grid for tuning.
max_depth_grid <- c(3, 5, 7, 9)
eta_grid <- c(0.01, 0.5, 0.1, 0.2, 0.3)
nrounds_grid <- c(50, 100, 150)
subsample_grid <- c(0.5, 0.6, 0.7, 0.8, 0.9, 1)
colsample_bytree_grid <- c(0.5, 0.6, 0.7, 0.8, 0.9, 1)
Perform the model tuning.
best_error <- Inf
best_max_depth <- NA
best_eta <- NA
best_nrounds <- NA
best_colsample_bytree <- NA
best_subsample <- NA
cv_folds <- 5
cv_seed <- 123
# Loop through all combinations of the hyperparameters
for (max_depth in max_depth_grid) {
for (eta in eta_grid) {
for (nrounds in nrounds_grid) {
for (colsample_bytree in colsample_bytree_grid) {
for (subsample in subsample_grid) {
# Set up the parameters for the current combination
params <- list(
objective = "binary:logistic",
eval_metric = "error",
max_depth = max_depth,
eta = eta,
colsample_bytree = colsample_bytree,
subsample = subsample
)
# Perform cross-validation for the current combination
cv_result <- xgb.cv(
data = train_x,
label = train_y,
params = params,
nrounds = nrounds,
nfold = cv_folds,
stratified = TRUE,
seed = cv_seed,
verbose = 0,
early_stopping_rounds = 10
)
# Update the best values if the current combination has a lower error rate
min_error <- min(cv_result$evaluation_log$test_error_mean)
if (min_error < best_error) {
best_error <- min_error
best_max_depth <- max_depth
best_eta <- eta
best_nrounds <- nrounds
best_colsample_bytree <- colsample_bytree
best_subsample <- subsample
}
}
}
}
}
}
Train the model using the hyperparameters from the result of the grid search.
best_params <- list(
  objective = "binary:logistic",
  eval_metric = "error",
  max_depth = best_max_depth,
  eta = best_eta,
  subsample = best_subsample,
  colsample_bytree = best_colsample_bytree
)
model_xg_grid <- xgboost(
data = train_x,
label = train_y,
params = best_params,
nrounds = best_nrounds,
verbose = 0
)
predict_xg_grid <- predict(model_xg_grid, test_x)
Show the hyperparameters from grid search.
| Parameters | Values |
|---|---|
| max_depth | 7.0 |
| eta | 0.1 |
| nrounds | 100.0 |
| subsample | 0.6 |
| colsample_bytree | 0.9 |
Table 8 above shows the best hyperparameters obtained with the grid search technique.
The performance of XGBoost + grid search improves to 78.229% of the results predicted correctly.
Blending is a very common technique used in machine learning. The idea behind blending is that the strengths of individual models compensate for the weaknesses of others, resulting in a more robust and accurate prediction.
# blending: weighted average of the three model predictions
blending <- function(predict_xg, predictions_rf, predictions_cv) {
  weight1 * predict_xg + weight2 * predictions_rf + weight3 * predictions_cv
}
weight1 <- 1/3
weight2 <- 1/3
weight3 <- 1/3
prediction_blend <- blending(predict_xg, predictions_rf, predictions_cv)
By adjusting the weights of each model in the blend, we are able to improve the predicted result further, to approximately 82% of the results correct.
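Since prediction_blend is a weighted average rather than a 0/1 label, it still has to be thresholded before submission; a sketch (the object and file names here are our assumptions):

# convert the blended prediction to 0/1 and write it out in the same format as the other models
predict_blend_final <- ifelse(prediction_blend > 0.5, 1, 0)
predict_blend_out <- data.frame(PassengerId = test$PassengerId, Survived = predict_blend_final)
write.csv(predict_blend_out, file = "predict_blend_out.csv", row.names = F)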
In Table 9 we can find the performance of all the models. Among the individual models, Random Forest and XGBoost with grid search perform best on this dataset, while blending gives the highest overall accuracy.
| Model | Accuracy |
|---|---|
| Logistic Regression | 0.77033 |
| CART | 0.77511 |
| CART + Cross Validation | 0.78229 |
| Random Forest | 0.78468 |
| XGBoost | 0.77511 |
| XGBoost + Grid Search | 0.78229 |
| Blending | 0.81862 |
In this report, I used plots and mean survival rates to guide the data cleansing and visualization. After the data preparation, the models applied to this dataset were Logistic Regression, CART, Random Forest and XGBoost. A blending method was then applied to take a weighted average of these models, yielding a more robust and accurate prediction that gets approximately 82% (0.81862) of the results correct.