Titanic Machine Learning Lab

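Load the h2o and titanic packages used throughout this lab.

library(h2o)
library(titanic)

Initialize the H2O cluster using all available CPU cores [1, 9].

Note: This requires the Java Runtime Environment to be installed [10].

h2o.init(nthreads = -1)

h2o.removeAll() # Deletes all data and models from the H2O engine; run after h2o.init() because it needs a running cluster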

3. Data Loading and Pre-Processing

Load the training data from the titanic package.

data("titanic_train")

PRE-PROCESSING STEP:

In H2O, to perform classification (predicting 0 or 1), the target variable must be a factor.

We also convert Pclass and Sex to factors so the models treat them as categorical variables, which also makes the output easier to interpret.

titanic_train$Survived <- as.factor(titanic_train$Survived)
titanic_train$Pclass <- as.factor(titanic_train$Pclass)
titanic_train$Sex <- as.factor(titanic_train$Sex)
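The effect of the factor conversion can be checked on a small stand-in table before touching the real data. This is a minimal sketch in base R; the data frame toy and its values are made up for illustration and are not rows from titanic_train.

```r
# Toy data frame standing in for titanic_train (values are invented)
toy <- data.frame(
  Survived = c(0, 1, 1, 0),
  Pclass   = c(3, 1, 2, 3),
  Sex      = c("male", "female", "female", "male"),
  stringsAsFactors = FALSE
)

# Convert the response and the categorical predictors to factors
toy$Survived <- as.factor(toy$Survived)
toy$Pclass   <- as.factor(toy$Pclass)
toy$Sex      <- as.factor(toy$Sex)

str(toy)               # Survived, Pclass and Sex are now factors
levels(toy$Survived)   # the two classes the model will predict
```

Running str() on titanic_train after the conversion gives the same kind of check on the real data.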

Convert the R data frame into an H2OFrame.

Just as the sf package needs its own object type to draw maps, H2O needs its own H2OFrame object and cannot process standard R data frames directly.

titanic.h2o <- as.h2o(titanic_train)

4. Training and Test Splits

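Split the data into a training set (about 80%) and a test set (about 20%). h2o.splitFrame splits probabilistically, so the proportions are approximate rather than exact.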

We set a seed to ensure the random split is reproducible.

titanic_split <- h2o.splitFrame(data = titanic.h2o, ratios = 0.8, seed = 1234)

The split returns a list of two H2OFrames: the training data and the test data.

Assign the training and test pieces to their own objects.

train_data <- titanic_split[[1]] # first element of the list: training data
test_data <- titanic_split[[2]] # second element of the list: test data

Check the dimensions to ensure the split worked.

print(paste("Training rows:", h2o.nrow(train_data)))
print(paste("Testing rows:", h2o.nrow(test_data)))
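The row counts printed above are usually close to 80/20 but not exact. The idea behind a seeded, probabilistic split can be sketched in base R. This is an illustration only, not H2O's internal algorithm: the row count n is an assumed value, and the names u and in_train are hypothetical.

```r
set.seed(1234)          # fix the random number generator so the split is reproducible

n <- 891                # assumed number of rows, for illustration
u <- runif(n)           # one uniform random draw per row
in_train <- u < 0.8     # each row lands in training with probability 0.8

n_train <- sum(in_train)
n_test  <- sum(!in_train)

c(train = n_train, test = n_test)   # near 80/20, but not forced to be exact
n_train / n                         # realized training share
```

Re-running with the same seed gives the same split; changing the seed changes which rows land in each set.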

5. Define Variables for Modeling

Select the predictors and the response variable

predictors <- c("Pclass", "Sex", "Age", "SibSp", "Parch", "Fare")
response <- "Survived"

6. Train GLM Model

I am starting with GLM as my baseline model

This model predicts the survivors using the variables above

GLM model to predict survival (0 or 1)

titanic_glm <- h2o.glm(
  x = predictors,
  y = response,
  training_frame = train_data,
  validation_frame = test_data,
  family = "binomial")

Accuracy check

h2o.auc(titanic_glm, valid = TRUE) # AUC of 0.86, so it is strong
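To make the AUC number less of a black box, here is a minimal base R sketch of one standard way to compute it: the rank-based (Mann-Whitney) formula. The helper auc_rank and the toy labels and scores are hypothetical, and H2O's own computation may differ in detail, so small differences from h2o.auc are to be expected.

```r
# AUC as the probability that a random positive scores higher than a random negative
auc_rank <- function(labels, scores) {
  r     <- rank(scores)            # average ranks, so ties are handled
  n_pos <- sum(labels == 1)
  n_neg <- sum(labels == 0)
  (sum(r[labels == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

labels <- c(0, 0, 1, 1)              # toy outcomes (1 = survived)
scores <- c(0.1, 0.4, 0.35, 0.8)     # toy predicted probabilities of survival

auc_rank(labels, scores)   # 0.75: three of the four positive/negative pairs are ordered correctly
```

An AUC of 0.5 means the scores rank survivors no better than chance, and 1 means every survivor is ranked above every non-survivor.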

Error Check

h2o.rmse(titanic_glm, valid = TRUE) # RMSE of 0.39, so it has a moderate amount of error, but still reasonable
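For a 0/1 outcome, RMSE can be read as the typical distance between the predicted survival probability and what actually happened. The sketch below assumes that reading (error measured as actual minus p1) and uses made-up toy values; it is not taken from H2O's source.

```r
actual <- c(0, 0, 1, 1)              # toy outcomes (1 = survived)
p1     <- c(0.1, 0.4, 0.35, 0.8)     # toy predicted probabilities of survival

squared_error <- (actual - p1)^2     # per-passenger squared error
rmse <- sqrt(mean(squared_error))    # root of the mean squared error

rmse
```

A model that always predicted 0.5 would score an RMSE of 0.5 under this reading, which gives a rough reference point for the values reported in this lab.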

7. Random Forest Model

Random Forest uses many decision trees to improve predictions

titanic_rf <- h2o.randomForest(
  x = predictors,
  y = response,
  training_frame = train_data,
  validation_frame = test_data,
  ntrees = 50,
  seed = 1234)

Accuracy Check

h2o.auc(titanic_rf, valid = TRUE) # AUC of 0.89, so better than the GLM
# RF is more accurate than the GLM model

Error Check

h2o.rmse(titanic_rf, valid = TRUE) # RMSE of 0.35
# Random Forest has a lower error than GLM, so it is the better model.

This shows which variables are most important for predicting survival

h2o.varimp_plot(titanic_rf)
# Sex is the most important variable when it comes to predicting survival,
# followed by Age and Fare.

Predicted survival probabilities

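pred <- h2o.predict(titanic_rf, newdata = test_data)
head(pred)
# These results show the model's predictions.
# p1 is the probability of survival and p0 is the probability of death.
# The predict column is 1 (survived) when p1 clears H2O's classification threshold and 0 (died) otherwise.
# By default H2O picks the threshold that maximizes F1, so it is not always 0.5,
# and the label can differ from simply taking the larger of p0 and p1.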
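How p0, p1 and a threshold turn into a 0/1 label can be sketched in base R. The table pred_toy, the helper label_at and the threshold 0.4 are all hypothetical and chosen only for illustration; the threshold H2O actually applies is whatever it selects for the fitted model.

```r
# Toy probabilities in the same layout as the p0 and p1 columns of pred
pred_toy <- data.frame(
  p0 = c(0.70, 0.55, 0.20),
  p1 = c(0.30, 0.45, 0.80)
)

# Label a passenger as survived (1) when p1 is at or above the threshold
label_at <- function(p1, threshold) as.integer(p1 >= threshold)

labels_half  <- label_at(pred_toy$p1, 0.5)   # 0 0 1: matches picking the larger of p0 and p1 here
labels_lower <- label_at(pred_toy$p1, 0.4)   # 0 1 1: the second passenger flips to survived

data.frame(pred_toy, labels_half, labels_lower)
```

The second row shows why the predicted label depends on the threshold and not only on which of p0 and p1 is larger.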

Overall

The Random Forest Model performed better than the GLM model.

RF had a higher AUC (0.89 vs. 0.86) and a lower RMSE (0.35 vs. 0.39) than GLM on the test set.

The most important variables for predicting survival were Sex, Age, and Fare.

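The model predicts survival by comparing the survival probability p1 with a classification threshold (chosen by H2O to maximize F1 by default, so not necessarily 0.5) and labeling the passenger as survived when p1 clears it.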