Rows: 891 Columns: 12
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (5): Name, Sex, Ticket, Cabin, Embarked
dbl (7): PassengerId, Survived, Pclass, Age, SibSp, Parch, Fare
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
titanic_test <- read_csv('test.csv')
Rows: 418 Columns: 11
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (5): Name, Sex, Ticket, Cabin, Embarked
dbl (6): PassengerId, Pclass, Age, SibSp, Parch, Fare
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Machine Learning with tidymodels
Goal: Using various other traveler characteristics, we would like to predict whether a traveler survived. That means we want to create/fit a model that can classify travelers into one of two groups; this is called a classification algorithm. The outcome, survived or not, is a categorical variable with two levels, so our model will be a binary classification model. Since we already know whether each traveler survived, we will be fitting a supervised algorithm, where the data 'supervise' the model by telling it what happened to a particular traveler.
Here are the usual steps we will carry out, from start to end, for every machine learning project.
1. Load the necessary libraries and data, then do Exploratory Data Analysis (EDA) to understand the data at hand.
2. Split the data into training and testing sets using stratified sampling (a sketch of this step appears right after this list).
3. Resample further from the training data to choose a model or model hyperparameters, via cross-validation, bootstrapping, or a single validation set.
4. Declare model specifications.
5. Declare recipes for feature engineering, using the EDA results.
6. Fit the resamples with the various hyperparameters and workflows.
7. Assess the results of these fits to choose a final model.
8. Fit the final model, with the chosen algorithm and hyperparameters, to the whole training set, and make predictions on the testing data to report the generalization accuracy/error.
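The train/test split in step 2 is already done for us here (the data come as separate train.csv and test.csv files), but for reference, here is a minimal sketch of how a stratified split could be made with rsample. The data frame titanic_all and the object names are hypothetical, used only for illustration.

# hypothetical stratified split: hold out 25% of the rows for testing,
# stratifying on the outcome so both sets have similar survival rates
set.seed(2022)
titanic_split <- initial_split(titanic_all, prop = 0.75, strata = Survived)
titanic_split_train <- training(titanic_split)
titanic_split_test  <- testing(titanic_split)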
Let's start. Normally, we would first split the data into training and testing sets using stratified sampling, but as noted above, that split has already been done for us, so we skip this step. We do, however, create bootstrap (re)samples from the training data for model selection.
library(tidymodels)
set.seed(2022)
titanic_folds <- bootstraps(data = titanic_train, times = 25)
titanic_folds
# Bootstrap sampling
# A tibble: 25 × 2
splits id
<list> <chr>
1 <split [891/328]> Bootstrap01
2 <split [891/338]> Bootstrap02
3 <split [891/318]> Bootstrap03
4 <split [891/316]> Bootstrap04
5 <split [891/328]> Bootstrap05
6 <split [891/317]> Bootstrap06
7 <split [891/330]> Bootstrap07
8 <split [891/339]> Bootstrap08
9 <split [891/322]> Bootstrap09
10 <split [891/317]> Bootstrap10
# … with 15 more rows
# ℹ Use `print(n = ...)` to see more rows
We obtain a new object, a tibble with a list column that holds the bootstrap samples along with their out-of-bag (OOB) samples. When the sample size is large, the out-of-bag sample consists of approximately \(\frac{1}{e}\) of the original data.
exp(-1)
[1] 0.3678794
For the first bootstrap sample, there are 328 out-of-bag observations.
328/891
[1] 0.3681257
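We can check this across all 25 resamples by computing the out-of-bag fraction of each bootstrap split. This is a quick sanity check, not part of the original pipeline; it only assumes the titanic_folds and titanic_train objects from above.

# fraction of rows that ended up out of bag in each bootstrap resample
titanic_folds %>%
  mutate(oob_fraction = purrr::map_dbl(splits, ~ nrow(assessment(.x)) / nrow(titanic_train))) %>%
  summarise(mean_oob = mean(oob_fraction), min_oob = min(oob_fraction), max_oob = max(oob_fraction))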
That is pretty close to the theoretical proportion. Since we have already done EDA in another Quarto, we will use that analysis during feature engineering. The next step is to define model specifications. We will use three models: Logistic Regression, Random Forest, and a Support Vector Machine with a radial kernel.
# logistic regression
titanic_glm_spec <- logistic_reg() %>%      # model
  set_engine('glm') %>%                     # package to use
  set_mode('classification')                # choose one of two: classification vs regression

# random forest
titanic_rf_spec <- rand_forest(trees = 1000) %>%   # algorithm-specific argument: 1000 trees
  set_engine('ranger') %>%
  set_mode('classification')

# support vector machine
titanic_svm_spec <- svm_rbf() %>%           # rbf - radial basis function kernel
  set_engine('kernlab') %>%
  set_mode('classification')
One can print these objects to inspect the model specifications.
titanic_svm_spec
Radial Basis Function Support Vector Machine Model Specification (classification)
Computational engine: kernlab
Now we declare a recipe, which lets us do feature engineering on the training data and then automatically apply the same steps to the testing data.
# declare recipe
titanic_recipe <- recipe(Survived ~ Pclass + Sex + Age + SibSp + Parch + Fare + Embarked,
                         data = titanic_train) %>%                   # keep the variables we want
  step_impute_median(Age, Fare) %>%                                  # imputation
  step_impute_mode(Embarked) %>%                                     # imputation
  step_mutate_at(Survived, Pclass, Sex, Embarked, fn = factor) %>%   # make these factors
  step_mutate(Travelers = SibSp + Parch + 1) %>%                     # new variable
  step_rm(SibSp, Parch) %>%                                          # remove variables
  step_dummy(all_nominal_predictors()) %>%                           # create indicator variables
  step_normalize(all_numeric_predictors())                           # normalize numerical variables
Before building workflows, we can train (prep) the recipe on the training data to verify the feature-engineering steps; the trained recipe is printed below. We will then define workflows that combine each model with the recipe and fit everything together.
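A minimal sketch of that prep step, plus an optional peek at the engineered training data, is shown here; titanic_recipe_prepped is an illustrative name.

# train (prep) the recipe steps on the training data
titanic_recipe_prepped <- prep(titanic_recipe, training = titanic_train)
titanic_recipe_prepped   # prints the trained steps, as shown below

# apply the trained steps to the training data to inspect the engineered features
bake(titanic_recipe_prepped, new_data = NULL) %>% glimpse()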
Recipe
Inputs:
role #variables
outcome 1
predictor 7
Training data contained 891 data points and 179 incomplete rows.
Operations:
Median imputation for Age, Fare [trained]
Mode imputation for Embarked [trained]
Variable mutation for Survived, Pclass, Sex, Embarked [trained]
Variable mutation for ~SibSp + Parch + 1 [trained]
Variables removed SibSp, Parch [trained]
Dummy variables from Pclass, Sex, Embarked [trained]
Centering and scaling for Age, Fare, Travelers, Pclass_X2, Pclass_X3, Sex... [trained]
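The chunk that combined the recipe with the model specifications, fit the workflows to the bootstrap resamples, and produced final_fit is not shown above. A minimal sketch of one way to do it with workflowsets follows; the object names are illustrative, and the random forest is chosen as the final model only to match the fitted object printed below.

# combine the single recipe with the three model specifications
titanic_wf_set <- workflow_set(
  preproc = list(rec = titanic_recipe),
  models  = list(glm = titanic_glm_spec, rf = titanic_rf_spec, svm = titanic_svm_spec)
)

# fit every workflow to each of the 25 bootstrap resamples
titanic_wf_res <- titanic_wf_set %>%
  workflow_map("fit_resamples", resamples = titanic_folds, seed = 2022)

# compare resampling performance to choose a model
rank_results(titanic_wf_res, rank_metric = "accuracy")

# fit the chosen model (here, the random forest) to the whole training set
final_fit <- workflow() %>%
  add_recipe(titanic_recipe) %>%
  add_model(titanic_rf_spec) %>%
  fit(data = titanic_train)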
final_fit %>% extract_fit_parsnip()
parsnip model object
Ranger result
Call:
ranger::ranger(x = maybe_data_frame(x), y = y, num.trees = ~1000, num.threads = 1, verbose = FALSE, seed = sample.int(10^5, 1), probability = TRUE)
Type: Probability estimation
Number of trees: 1000
Sample size: 891
Number of independent variables: 8
Mtry: 2
Target node size: 10
Variable importance mode: none
Splitrule: gini
OOB prediction error (Brier s.): 0.1273232
We will make predictions and more in upcoming Quartos.