# import the dataset
# the working directory is set to be desktop/titanic
predict_tree <- read.csv("predict_tree.csv")
predict_rf <- read.csv("predict_rf.csv")
predict_cv <- read.csv("predict_cv.csv")
predict_log <- read.csv("predict_log.csv")
test <- read.csv("/Users/vf/Desktop/titanic/hist_age/test.csv")
train <- read.csv("/Users/vf/Desktop/titanic/hist_age/train.csv")
# get the number of rows in the train data frame
num <- nrow(train)
The dataset analyzed in this report is “Titanic - Machine Learning from Disaster”, which is available on Kaggle, a popular machine learning platform. The primary objective is to predict whether a passenger survived the Titanic disaster based on variables such as “pclass”, “sex”, “age”, “sibsp”,…
The dataset is split into two sets (train and test), and there are 12 variables in total.
Below is a table that shows the variable names and their corresponding meanings. (Table 1)

| Variable | Definition | Key |
|---|---|---|
| survival | Survival | 0 = No, 1 = Yes |
| pclass | Ticket class | 1 = 1st, 2 = 2nd, 3 = 3rd |
| sex | Sex | |
| Age | Age in years | |
| sibsp | # of siblings / spouses aboard the Titanic | |
| parch | # of parents / children aboard the Titanic | |
| ticket | Ticket number | |
| fare | Passenger fare | |
| cabin | Cabin number | |
| embarked | Port of Embarkation | C = Cherbourg, Q = Queenstown, S = Southampton |
In this section, the main focus is on finding interesting and effective features in the dataset. The table below (Table 2) shows the number of missing values for each variable in the test and train sets.
| Variable | Test | Train |
|---|---|---|
| PassengerId | 0 | 0 |
| Pclass | 0 | 0 |
| Name | 0 | 0 |
| Sex | 0 | 0 |
| Age | 86 | 177 |
| SibSp | 0 | 0 |
| Parch | 0 | 0 |
| Ticket | 0 | 0 |
| Fare | 1 | 0 |
| Cabin | 0 | 0 |
| Embarked | 0 | 0 |
From the above table we can easily observe that there are many missing values in the “Age” variable in both data sets; these missing values are handled in a later section.
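As a quick check, a minimal sketch of how these missing-value counts could be obtained (not necessarily the exact code used to build Table 2):

# count the missing values in every column of the two data sets
colSums(is.na(test))
colSums(is.na(train))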
Now we can plot a bar chart to show the probability of survival for different age groups (a Shiny app is shown during the presentation).
# combine the train and test sets (dropping Survived so the columns match) for joint preprocessing
data_ <- rbind(train %>% select(-Survived), test)
train %>%
mutate(age_bin = cut(Age, breaks = 10)) %>%
group_by(age_bin) %>%
summarise(mean_survived = mean(Survived)) %>%
ggplot(aes(x = age_bin, y = mean_survived)) +
geom_bar(stat = "identity", fill = "orange") +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
labs(x = "Age", y = "Mean Survival Rate")
From the bar plot we can easily see that the age groups (0.34, 8.38] and (72, 80.1] have relatively higher survival probabilities, while all the other groups have similar survival probabilities.
Next, we plot a histogram to observe the distribution of the Age variable.
p1 = ggplot(train,aes(Age)) +
geom_histogram(binwidth = (80-0.42)/30,
fill = "orange") +
labs(title = "Age Distribution", x = "Age", y = "Count") +
scale_x_continuous(breaks = seq(0, 90, 5)) +
theme(plot.margin = unit(c(1, 1, 1, 1), "cm"))
p1
By analyzing the histogram, we can infer that a significant portion of
the passengers fall within the age range of [15,40], and there is a
scarcity of passengers who are older than 60. To enhance the
representativeness of the data, we can categorize the passengers into
distinct age groups instead of relying solely on their individual ages.
These age groups provide a better depiction of the distribution of the
Age variable, and at the same time they give us a way to handle the missing Age values in the data.
Eleven groups are used for this dataset, with break points at (0, 5, 10, 15, 20, 25, 30, 35, 40, 50, 60, 100); the missing ages are placed in a twelfth group below.
age_to_group <- function(age) {
group <- cut(age, breaks = c(0, 5, 10, 15, 20, 25, 30, 35, 40, 50, 60, 100),
labels = paste0("age_", 1:11))
as.character(group)
}
data_ <- data_ %>% mutate(age_group = age_to_group(Age))
data_$age_group <- ifelse(is.na(data_$age_group), "age_12", data_$age_group)
Also, the null values in the dataset have been fixed in this section: the missing ages are assigned to a twelfth group, labelled “age_12”.
We also want to know the survival probabilities for male and female passengers in the Titanic disaster.
| Sex | mean_survived |
|---|---|
| female | 0.7420382 |
| male | 0.1889081 |
We can observe from Table 3 above that the survival probability for females is much higher than for males.
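For reference, Table 3 can be reproduced with a short dplyr pipeline (a sketch, assuming dplyr is already loaded as in the rest of the report):

# mean survival rate by sex (Table 3)
train %>%
  group_by(Sex) %>%
  summarise(mean_survived = mean(Survived))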
| Embarked | Survival.Rate | Mean.Fare |
|---|---|---|
| C | 0.5535714 | 59.95414 |
| Q | 0.3896104 | 13.27603 |
| S | 0.3369565 | 27.07981 |
From the above analysis (Table 4) we can see that the survival rate at port C is higher than at the other ports, and the mean ticket price at port C is the highest of the three. This observation could indicate that wealthier passengers were more likely to survive the disaster.
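A sketch of how Table 4 could be computed (filtering out the blank Embarked entries in the training data is our assumption):

# survival rate and mean fare by port of embarkation (Table 4)
train %>%
  filter(Embarked %in% c("C", "Q", "S")) %>%
  group_by(Embarked) %>%
  summarise(Survival.Rate = mean(Survived),
            Mean.Fare = mean(Fare))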
# the line plot of the survival rate for the variable SibSp
sibsp_line <- train %>%
group_by(SibSp) %>%
summarise(mean_survival = mean(Survived)) %>%
ggplot(aes(x = SibSp, y = mean_survival)) +
geom_line(colour = "orange") +
labs(x = "SibSp", y = "Mean survival rate")
# the line plot of the survival rate for the variable parch
parch_line <- train %>%
group_by(Parch) %>%
summarise(mean_survival = mean(Survived)) %>%
ggplot(aes(x = Parch, y = mean_survival)) +
geom_line(colour = "orange") +
labs(x = "Parch", y = "Mean survival rate")
# the bar plot of the survival rate for the variable cabin
cabin_bar <- train %>%
group_by(Cabin) %>%
summarise(mean_survived = mean(Survived)) %>%
ggplot(aes(x = factor(Cabin), y = mean_survived)) +
geom_bar(stat = "identity", fill = "orange") +
xlab("Cabin") +
ylab("Mean survival rate")
grid.arrange(sibsp_line, parch_line, cabin_bar, nrow = 1)
From the line plots and the bar plot, we can tell that for certain values of SibSp and Parch the survival rate is actually higher. However, the bar plot of Cabin against Survived seems to lack detail, so we take a closer look at the Cabin data to decide whether or not to drop this variable.
The number of unique values for “Cabin” is 148, and most “Cabin” values appear only once, which suggests the raw variable is not statistically useful. However, the first letter of each “Cabin” value indicates a particular deck area on the ship, and the numbers after the letter identify the specific cabin, so some simple modification of the data is needed.
data_$Cabin <- ifelse(is.na(data_$Cabin), "null", substr(data_$Cabin, 1, 1))
After the modification, only the first letter of “Cabin” is kept, which reduces the 148 unique values to a small set of deck letters (plus “null” for the missing values).
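To verify this, a one-line check of the reduced variable (a sketch):

# frequency of each deck letter (plus "null" for missing cabins) after the modification
table(data_$Cabin)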
We can see that the Name variable contains both the title and the name of the person, but we are only interested in the title. The title not only conveys gender but also gives information about social status and education level.
# this is the code which extracts the title from the name
name_ <- function(name) {
name <- strsplit(name, "\\(")[[1]][1]
name <- strsplit(name, ", ")[[1]][2]
name <- strsplit(name, "\\.")[[1]][1]
return(name)
}
data_$Name <- apply(data_["Name"], 1, name_)
After the modification, there are a number of distinct titles, some of which appear only once or twice. To make the titles more representative, we relabel the rare ones (those appearing 10 times or fewer) as “Other”.
for (i in unique(data_$Name)){
if (sum(data_$Name == i) <= 10){
data_$Name[data_$Name == i] <- "Other"
}
}
The final counts for each unique title are given in the table below (Table 5); we can see that most people have the title “Mr” and only 34 people have the “Other” title.
| title | numbers |
|---|---|
| Mr | 757 |
| Mrs | 197 |
| Miss | 260 |
| Master | 61 |
| Other | 34 |
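Table 5 can be obtained directly from the combined data, for example:

# counts of each title after the rare titles are grouped into "Other" (Table 5)
table(data_$Name)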
After these transformations, the remaining missing values in the combined data are shown below (Table 6).
| Variable | Count |
|---|---|
| PassengerId | 0 |
| Pclass | 0 |
| Name | 0 |
| Sex | 0 |
| SibSp | 0 |
| Parch | 0 |
| Ticket | 0 |
| Fare | 1 |
| Cabin | 0 |
| Embarked | 0 |
| age_group | 0 |
data_ <- data_ %>%
mutate(Fare = ifelse(is.na(Fare),
median(Fare, na.rm = TRUE),
Fare))
Fill in the null value for “Fare” using the median of “Fare”.
# remove the variable that is not significant in the analysis
data_ <- data_ %>% select(-Ticket)
There are some character-type variables that we cannot feed directly into the logistic model, so I decided to use one-hot encoding, a commonly used method for converting categorical data into binary features. In one-hot encoding, each unique value of a categorical variable is represented as a separate binary feature, where 1 indicates the presence of that category and 0 indicates its absence.
| Id | Color | Color_G | Color_B |
|---|---|---|---|
| 1 | Red | 0 | 0 |
| 2 | Green | 1 | 0 |
| 3 | Blue | 0 | 1 |
| 4 | Red | 0 | 0 |
| 5 | Green | 1 | 0 |
From this example (Table 7) we can see that (0,0)
means “Red”, (1,0) means “Green” and (0,1) means “Blue”
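A minimal sketch reproducing this toy example with caret’s dummyVars (the object names toy and enc are ours, and the generated column names differ slightly from the table):

library(caret)

# toy data frame matching Table 7, with "Red" as the reference level
toy <- data.frame(Id = 1:5,
                  Color = factor(c("Red", "Green", "Blue", "Red", "Green"),
                                 levels = c("Red", "Green", "Blue")))

# fullRank = TRUE drops the reference level, giving the Color_G / Color_B style columns
enc <- dummyVars(~ Color, data = toy, fullRank = TRUE)
predict(enc, newdata = toy)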
# Fit One-Hot Encoder
one_hot <- predict(dummyVars(~., data_), newdata = data_)
train_one <- one_hot[1:num, ]
test_one <- one_hot[(num+1):nrow(one_hot), ]
train_one <- data.frame(train_one)
test_one <- data.frame(test_one)
Sigmoid function:
\(g(z) = \frac{1}{1+e^{-z}}\)
As z tends to negative infinity, g(z) tends to 0. When z tends to positive infinity, g(z) tends to 1.
Logistic Regression is an excellent choice for binary classification problems because it produces predicted values between 0 and 1. To obtain the desired output, simply set a cut-off value at 0.5. If the predicted value is greater than 0.5, classify it as 1, and if it is less than or equal to 0.5, classify it as 0.
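A small sketch of the sigmoid and the 0.5 cut-off rule (the function name sigmoid is ours, not part of the report’s code):

# sigmoid function g(z) = 1 / (1 + exp(-z))
sigmoid <- function(z) 1 / (1 + exp(-z))
sigmoid(-10)  # close to 0
sigmoid(0)    # exactly 0.5
sigmoid(10)   # close to 1

# classify a predicted probability using the 0.5 cut-off
as.integer(sigmoid(1.3) > 0.5)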
# Fit Logistic Regression model
model_log <- glm(train$Survived ~ ., data = train_one, family = binomial(link = "logit"))
# Predict probabilities for test data
pred_probs <- predict(model_log, newdata = test_one, type = "response")
The Kaggle score for this logistic regression is 0.77033, which indicates that this simple logistic regression predicted 77.03% of the test results correctly.
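The step that turns these probabilities into a submission file is not shown in this section; a sketch consistent with the other models (the file name predict_log_out.csv is our assumption) would be:

# apply the 0.5 cut-off and write the logistic regression submission
predict_log$Survived <- ifelse(pred_probs > 0.5, 1, 0)
write.csv(as.matrix(predict_log), file = "predict_log_out.csv", row.names = F)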
Generally speaking, the performance of the logistic regression is quite good given that it is a very simple model. Next we try the CART classification tree and compare its performance.
In this section, our goal is to perform model selection and include only relevant variables in the model. With a dataset consisting of 35 variables, applying the AIC criterion to every possible variable combination would be prohibitively time-consuming: assuming that fitting one model takes 1 s, we would have to consider \(2^{35}\) models (including the null model), which would take around 1000 years to finish. So exhaustive AIC-based selection is not a viable option here. Backward elimination offers a more efficient alternative, though it comes with certain limitations, and it is the method applied in this section.
# this section is for the model selection
# using the backward elimination method
f_removal_model_selection <- function(model, p_value_threshold) {
significant_vars <- model
removed_vars <- c()
repeat {
p_values <- summary(significant_vars)$coefficients[, 4]
p_values <- p_values[-1] # Exclude intercept
if (all(p_values <= p_value_threshold)) {
break
}
var_to_remove <- names(p_values)[which.max(p_values)]
removed_vars <- c(removed_vars, var_to_remove)
formula_update <- as.formula(
paste("train$Survived ~ ", paste(setdiff(names(p_values), var_to_remove), collapse = " + "))
)
significant_vars <- update(significant_vars, formula_update)
}
return(list(selected_model = significant_vars, removed_variables = removed_vars))
}
# Perform model selection with F removal level
removal_level <- 0.05
selected_model_info <- f_removal_model_selection(model_log, removal_level)
However, the performance of the reduced model does not improve much: the reduced model still predicts 77.033% of the results correctly. This is because the backward elimination method has some limitations:
- Arbitrariness: the choice of the threshold (e.g., 0.05) is somewhat arbitrary, and there is no one-size-fits-all value. Different thresholds may lead to different model selections, so we are not guaranteed to find the best reduced model with this method.
- Multiple testing: with a large number of predictor variables, the chance of false positives (Type I errors) increases, which can lead to the inclusion of irrelevant variables in the model.
- Model stability: stepwise selection can result in unstable models, especially when predictor variables are highly correlated; small changes in the dataset can lead to significant changes in the selected model.
The important concept used in CART is the Gini index, which measures the quality of a split in a binary decision tree. The Gini index measures the impurity of a node, where a node is considered pure if all the observations in the node belong to the same class:

\(Gini(D) = 1 - \sum_{k=1}^{K} p_k^2\)

where \(p_k\) is the proportion of observations in \(D\) belonging to class \(k\). The Gini index of splitting the node \(D\) into \(D_1\) and \(D_2\) on feature \(A\) is

\(Gini_A(D) = \frac{|D_1|}{|D|}\,Gini(D_1) + \frac{|D_2|}{|D|}\,Gini(D_2)\)

where \(A\) is one of the features in the data frame and \(|D|, |D_1|, |D_2|\) are the sample sizes of \(D, D_1, D_2\).
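As an illustration of these formulas, a small sketch (the helper names gini_node and gini_split are ours, not part of the report’s code):

# Gini impurity of a single node
gini_node <- function(labels) {
  p <- table(labels) / length(labels)
  1 - sum(p^2)
}

# weighted Gini impurity of a binary split D -> (D1, D2)
gini_split <- function(labels1, labels2) {
  n1 <- length(labels1); n2 <- length(labels2)
  (n1 * gini_node(labels1) + n2 * gini_node(labels2)) / (n1 + n2)
}

# e.g. the impurity of splitting the training data on Sex
gini_split(train$Survived[train$Sex == "male"],
           train$Survived[train$Sex == "female"])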
# CART classification tree
tree_model <- rpart(train$Survived ~ ., data = train_one, method = "class")
rpart.plot(tree_model, type = 1, extra = 104, tweak = 1.2)
predictions <- predict(tree_model, newdata = test_one, type = "class")
predict_tree$Survived <- predictions
predict_tree_out <- as.matrix(predict_tree)
write.csv(predict_tree_out, file = "predict_tree_out.csv", row.names = F)
The accuracy for CART is 77.511%, which is better than the logistic regression. The main reason is that the logistic regression above considered all of the variables in the data frame, and in most cases the full model is not the best one. A reduced model generally performs better, and the Gini index acts as a selection criterion that only brings significant variables into the model.
Then we apply cross-validation to the CART model, using 50 folds.
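The helper to_binary used below (and for the later models) is not defined in this section; a definition consistent with how it is used would be:

# threshold numeric predictions at 0.5 to get 0/1 labels (our assumption)
to_binary <- function(pred) ifelse(pred > 0.5, 1, 0)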
train_one$survived <- train$Survived
control <- trainControl(method = "cv", number = 50)
model_cv <- train(survived ~ ., data = train_one, method = "rpart", trControl = control)
rpart.plot(model_cv$finalModel)
predictions_cv <- predict(model_cv, newdata = test_one)
survival_cv <- to_binary(predictions_cv)
predict_cv$Survived <- survival_cv
predict_cv_out <- as.matrix(predict_cv)
write.csv(predict_cv_out, file = "predict_cv_out.csv",row.names = F)
We can see from the plot that the CART model has been further pruned, and after k-fold cross-validation the CART classification tree predicts 78.229% of the results correctly.
# random forest
test_one$survived <- NA
rf_model <- randomForest(survived ~ ., data = train_one, ntree = 500,
mtry = 3, importance = TRUE)
predictions_rf <- predict(rf_model, test_one)
survival_rf <- to_binary(predictions_rf)
predict_rf$Survived <- survival_rf
predict_rf_out <- as.matrix(predict_rf)
write.csv(predict_rf_out,"predict_rf_out.csv", row.names = F)
The Random Forest model predicts 78.468% of the results correctly.
In this section XGBoost is used. XGBoost is a strong model with several benefits:

- Regularization: XGBoost incorporates L1 (lasso) and L2 (ridge) regularization terms in its objective function, which helps prevent overfitting and makes the model more generalizable.
- Handling of imbalanced data: XGBoost supports assigning weights to individual data points, which is useful for imbalanced datasets where one class is significantly more frequent than the other (a parameter sketch follows below).
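For illustration only (these settings are not used in the report’s pipeline), the regularization and class-weight parameters mentioned above appear in an xgboost parameter list like this:

# lambda/alpha control L2/L1 regularization; scale_pos_weight re-weights the positive class
params_example <- list(
  objective = "binary:logistic",
  eval_metric = "error",
  lambda = 1,
  alpha = 0,
  scale_pos_weight = sum(train$Survived == 0) / sum(train$Survived == 1)
)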
So in this section we work with this strong model and use grid search to find the best combination of hyperparameters.
# prepare the data
train_x <- train_one %>%
select(-survived) %>%
as.matrix()
train_y <- train_one$survived
params <- list(
objective = "binary:logistic",
eval_metric = "error",
max_depth = 5,
eta = 0.1
)
model_xg <- xgboost(
data = train_x,
label = train_y,
params = params,
nrounds = 100,
verbose = 0
)
test_x <- test_one %>%
select(-survived) %>%
as.matrix()
predict_xg <- predict(model_xg, test_x)
As the initial performance of XGBoost is not satisfactory, model tuning is needed. The code below uses the grid search technique to select the best combination of hyperparameters for the model: we systematically search through a predefined set of hyperparameter values and train the model with each combination.
train_x <- train_one %>%
select(-survived) %>%
as.matrix()
train_y <- train_one$survived
Define the grid for tuning.
max_depth_grid <- c(3, 5, 7, 9)
eta_grid <- c(0.01, 0.5, 0.1, 0.2, 0.3)
nrounds_grid <- c(50, 100, 150)
subsample_grid <- c(0.5, 0.6, 0.7, 0.8, 0.9, 1)
colsample_bytree_grid <- c(0.5, 0.6, 0.7, 0.8, 0.9, 1)
Perform the model tuning.
best_error <- Inf
best_max_depth <- NA
best_eta <- NA
best_nrounds <- NA
best_colsample_bytree <- NA
best_subsample <- NA
cv_folds <- 5
cv_seed <- 123
# Loop through all combinations of the hyperparameters
for (max_depth in max_depth_grid) {
for (eta in eta_grid) {
for (nrounds in nrounds_grid) {
for (colsample_bytree in colsample_bytree_grid) {
for (subsample in subsample_grid) {
# Set up the parameters for the current combination
params <- list(
objective = "binary:logistic",
eval_metric = "error",
max_depth = max_depth,
eta = eta,
colsample_bytree = colsample_bytree,
subsample = subsample
)
# Perform cross-validation for the current combination
cv_result <- xgb.cv(
data = train_x,
label = train_y,
params = params,
nrounds = nrounds,
nfold = cv_folds,
stratified = TRUE,
seed = cv_seed,
verbose = 0,
early_stopping_rounds = 10
)
# Update the best values if the current combination has a lower error rate
min_error <- min(cv_result$evaluation_log$test_error_mean)
if (min_error < best_error) {
best_error <- min_error
best_max_depth <- max_depth
best_eta <- eta
best_nrounds <- nrounds
best_colsample_bytree <- colsample_bytree
best_subsample <- subsample
}
}
}
}
}
}
Train the model using the hyperparameters from the result of the grid search.
best_params <- list(
  objective = "binary:logistic",
  eval_metric = "error",
  max_depth = best_max_depth,
  eta = best_eta,
  subsample = best_subsample,
  colsample_bytree = best_colsample_bytree
)
model_xg_grid <- xgboost(
data = train_x,
label = train_y,
params = best_params,
nrounds = best_nrounds,
verbose = 0
)
predict_xg_grid <- predict(model_xg_grid, test_x)
Show the hyperparameters from grid search.
| Parameters | Values |
|---|---|
| max_depth | 7.0 |
| eta | 0.1 |
| nrounds | 100.0 |
| subsample | 0.6 |
| colsample_bytree | 0.9 |
Table 8 above shows the best hyperparameters obtained with the grid search technique.
The performance of XGBoost + grid search improves to 78.229% of the results predicted correctly.
Blending is a very common technique used in machine learning. The idea behind blending is that the strengths of individual models compensate for the weaknesses of others, resulting in a more robust and accurate prediction.
# blending: weighted average of the three model predictions
blending <- function(predict_xg, predictions_rf, predictions_cv) {
  weight1 * predict_xg + weight2 * predictions_rf + weight3 * predictions_cv
}
weight1 <- 1/3
weight2 <- 1/3
weight3 <- 1/3
prediction_blend <- blending(predict_xg, predictions_rf, predictions_cv)
By adjusting the weights of each model in the blend, we are able to improve the predicted result further, to approximately 82% of the results correct.
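Since prediction_blend is a weighted average rather than a 0/1 label, it still has to be thresholded before submission; a sketch (the object and file names here are our assumptions):

# convert the blended prediction to 0/1 and write it out in the same format as the other models
predict_blend_final <- ifelse(prediction_blend > 0.5, 1, 0)
predict_blend_out <- data.frame(PassengerId = test$PassengerId, Survived = predict_blend_final)
write.csv(predict_blend_out, file = "predict_blend_out.csv", row.names = F)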
In Table 9 we can find the performance of all the models. Among the individual models, Random Forest and XGBoost with grid search perform best on this dataset, while blending gives the highest overall accuracy.
| Model | Accuracy |
|---|---|
| Logistic Regression | 0.77033 |
| CART | 0.77511 |
| CART + Cross Validation | 0.78229 |
| Random Forest | 0.78468 |
| XGBoost | 0.77511 |
| XGBoost + Grid Search | 0.78229 |
| Blending | 0.81862 |
In this report, I used plots and mean survival rates to guide the data cleansing and visualization. After the data preparation, the models applied to this dataset were Logistic Regression, CART, Random Forest and XGBoost. A blending method was then applied to take a weighted average of these models, yielding a more robust and accurate prediction that gets approximately 82% (0.81862) of the results correct.