In a nutshell, machine learning (ML) is nested within artificial intelligence (AI). With machine learning, algorithms are implemented to identify or “learn” patterns in data, which allows specific tasks to be carried out, such as making predictions on new data. For anyone interested in delving deeper into the ins and outs of ML or AI in R, I highly suggest the book “Machine Learning with R, the tidyverse, and mlr” by Hefin I. Rhys.
Figure taken from Rhys (2020)
With traditional AI, there are human-specified rules. For example, if a computer program classifies cakes into 3 distinct groups (strawberry shortcake, chocolate, red velvet) based on color and size, that program already has a set of coded rules that only allows 3 distinct answers. Machine learning, in contrast, “learns” these rules from the available data and sometimes develops its own rules. In that case the ML algorithm may identify types of cakes other than the three mentioned above! So with traditional AI, all possibilities have to be coded into the program, but with ML, the possibilities are learned rather than limited in advance.
Figure taken from Rhys (2020)
Algorithms are the processes that machine learning utilizes to develop new rules based on given data. The set of rules that an algorithm learns upon its completion is called a ‘model’.
There are many different algorithms that can be classified in a variety of ways. For the sake of time, I’ll go over three broad categories of ML algorithms:
Supervised: We feed ground-truthed (labeled) data to the ML algorithm. The algorithm uses the already-identified data to learn how the predictors relate to the known outcomes. After the algorithm has trained on the ground-truthed data, we can feed it new data and it generates predictions.
Unsupervised: In this case, no ground-truthed data is provided to the algorithm. Instead, the algorithm must learn its own set of rules to detect structure or groupings in the data. A classic example is clustering (e.g. k-means).
Semi-supervised: A mixture of both. We won’t touch on this here, but I suggest reading Rhys (2020) if you’re interested!
We use the terms classification and regression to distinguish between types of models based on the response variable. If the response is continuous, we refer to the model as a regression model. If the response is categorical, we refer to it as a classification model.
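As a quick base-R illustration of that distinction (the data frame dat and its columns are made up purely for this example and aren’t used anywhere else):

#Hypothetical data: one predictor, a continuous response, and a categorical response
dat <- data.frame(x = rnorm(100))
dat$y <- 2 * dat$x + rnorm(100)                                      #continuous response
dat$y_binary <- factor(ifelse(dat$x + rnorm(100) > 0, "yes", "no"))  #categorical response

fit_reg <- lm(y ~ x, data = dat)                                #regression model
fit_class <- glm(y_binary ~ x, data = dat, family = binomial)   #classification model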
Throughout this class you’ve implemented regression and classification algorithms. For example, for simple linear regression, you’ve applied a regression algorithm that minimizes the sum of squared errors (i.e. the residuals) to best fit your data. With logistic regression you used a classification algorithm to identify effects of different predictors on whether Titanic passengers survived or not.
So why didn’t we call the logistic regression you used for the Titanic dataset ‘machine learning’?
While the answer to that question depends on a wide variety of contexts, one of the main reasons is that the ultimate goal of the regressions run in the previous class was inference rather than prediction. Let me explain by posing two different questions.
Question 1: What influenced the odds of survival of the passengers on the Titanic?
Question 2: Given the Titanic data, could we predict which passengers survived or died on the Titanic or in a similar cruise line disaster?
Question 1 essentially asks, “given the data, what influenced the odds of survival?” while question 2 asks “given the data, can we predict the odds of survival?”. While similar in some ways, these two questions differ in the emphasis they place on prediction versus inference. Models built for inference often have low predictive accuracy but high interpretability. Models designed or implemented for prediction are somewhat the opposite: accuracy and precision are prioritized over interpretability.
What does interpretability mean? Quick Answer: Can we determine or identify potential mechanistic or causal drivers that would yield what our model tells us?
For this vignette we will focus more on question 2 and use the same logistic regression that you implemented in class on the exact same Titanic data. The only difference is that we will engineer this logistic regression for predictive performance, which involves some new concepts to learn and apply. We will also learn how to apply this framework with a package (more accurately, a metapackage) called “tidymodels”. tidymodels was developed to streamline, simplify, and improve the machine learning workflow. While it is relatively new, it’s likely to become an industry standard in the R environment for data scientists. It also goes hand in hand with crucial data science packages you’ve been exposed to, namely those within the tidyverse.
Let’s load the packages we need for this vignette:
library(effects)
library(tidyverse)
library(tidymodels)
library(skimr)
library(vip)
Let’s load the Titanic data:
data("TitanicSurvival")
titanic <- TitanicSurvival %>%
  as_tibble()
Many of you have probably seen this data already, so we won’t do too much data inspection/visualization. But let’s check the structure of the dataset. One thing we need to make sure of is that categorical variables (e.g. sex) are coded as factors, or else they won’t play nicely with the tidymodels package.
#Examine structure of the data
summary(titanic)
## survived sex age passengerClass
## no :809 female:466 Min. : 0.1667 1st:323
## yes:500 male :843 1st Qu.:21.0000 2nd:277
## Median :28.0000 3rd:709
## Mean :29.8811
## 3rd Qu.:39.0000
## Max. :80.0000
## NA's :263
class(titanic$sex)
## [1] "factor"
#get a quick overview of the data, including missing values per variable
skim(titanic)
| Name | titanic |
| Number of rows | 1309 |
| Number of columns | 4 |
| Column type frequency: | |
| factor | 3 |
| numeric | 1 |
| Group variables | None |
Variable type: factor
| skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
|---|---|---|---|---|---|
| survived | 0 | 1 | FALSE | 2 | no: 809, yes: 500 |
| sex | 0 | 1 | FALSE | 2 | mal: 843, fem: 466 |
| passengerClass | 0 | 1 | FALSE | 3 | 3rd: 709, 1st: 323, 2nd: 277 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| age | 263 | 0.8 | 29.88 | 14.41 | 0.17 | 21 | 28 | 39 | 80 | ▂▇▅▂▁ |
Here we see that 263 rows have at least one NA. That’s roughly one fifth of the data. For the sake of time and to keep the models running smoothly, let’s just remove these rows.
#count NAs in each row, then sum across all rows
titanic %>%
rowwise() %>%
summarise(NA_count = sum(is.na(cur_data()))) %>%
arrange(desc(NA_count)) %>%
ungroup() %>%
summarise(NA_count= sum(NA_count))
## # A tibble: 1 × 1
## NA_count
## <int>
## 1 263
#Let's remove NAs, it will make life easier
titan_df <- titanic %>% na.omit()
Let’s also see how many passengers are in the different categories. We want to make sure that there’s somewhat of a balanced distribution.
titan_df %>%
count(survived, sort = TRUE)
## # A tibble: 2 × 2
## survived n
## <fct> <int>
## 1 no 619
## 2 yes 427
Above we see that there is a relatively even split between survivors and non-survivors. Since this is our planned response variable, it’s important to have relatively balanced representation of both outcomes. When that isn’t possible because of inherent attributes of the dataset, there are measures that can be taken (see the sketch below).
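For illustration only (we don’t need it for this dataset), the themis package, which extends the recipes package from tidymodels, provides steps for rebalancing an imbalanced outcome. This is a minimal sketch that assumes themis is installed; we don’t use recipes anywhere else in this vignette.

#Sketch only: down-sample the majority class ("no") so both outcomes are equally represented
library(themis)
balance_rec <- recipe(survived ~ sex + age + passengerClass, data = titan_df) %>%
  step_downsample(survived)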
titan_df %>%
count(sex, sort = TRUE)
## # A tibble: 2 × 2
## sex n
## <fct> <int>
## 1 male 658
## 2 female 388
Above we see more males than females.
titan_df %>%
count(passengerClass, sort = TRUE)
## # A tibble: 3 × 2
## passengerClass n
## <fct> <int>
## 1 3rd 501
## 2 1st 284
## 3 2nd 261
Above we see a more or less even split between 1st and 2nd class, while 3rd class (steerage) makes up about half of the entire dataset.
Here we set the data up for our predictive modeling. To do this we need to split our data into a training set and a testing set. The training set is used to train the model and the testing set is used to validate the model’s capacity to predict. This has become common practice because testing an ML model on the same data used to train it often yields predictive performance that is too good to be true. Therefore, the gold standard for checking a model’s effectiveness is to test it on data it has never seen before.
Splitting the data is sometimes referred to as “setting a data budget”. The proportion of your data designated as training versus testing can vary, but in general it is good practice to use about 3/4 for training and 1/4 for testing. Below we use the tidymodels package to split the data randomly. Note that the prop argument lets us set the proportion of the split; here we specify 0.75 (75% training, 25% testing). When we look at the titanic_split object we see three numbers. From left to right: the number of observations in the training set (784), the number of observations in the testing set (262), and the total number of observations (1046).
#set seed
set.seed(254)
#set up data split
titanic_split <- initial_split(titan_df, prop = 0.75)
titanic_split
## <Analysis/Assess/Total>
## <784/262/1046>
Next we assign these splits as training and testing using these two functions:
#assign testing and training
titanic_train <- training(titanic_split)
titanic_test <- testing(titanic_split)
Now our data is prepped, but how do we actually train the logistic regression algorithm? You have all grown accustomed to setting up statistical models in this format:
model <- lm(y ~ x, data = data)
With tidymodels (more specifically the parsnip package, which is part of tidymodels), we set up models differently. We can divide the above line of code into separate parts, starting with the model specification.
With model specification we can set three things: the model’s engine, the model’s mode, and the model type.
The model engine is the computation engine used in the model. For linear models, the model engine would be “lm” but for generalized linear models, the engine would be “glm”.
The model’s mode specifies whether the model is a classification or regression model.
Finally, the model type specifies the structure of the model. For example, linear models are based on regression while other models may be based on decision trees (e.g. random forest models). Here we use logistic_reg().
Set up the model specification with this code:
glm_spec <- logistic_reg(mode = "classification",
engine = "glm")
glm_spec
## Logistic Regression Model Specification (classification)
##
## Computational engine: glm
But why do this? The short answer is that it standardizes how models are set up across different modelling frameworks. For example, Bayesian and frequentist regressions often involve different syntax for setting up their models. Having a standardized model specification like the one above allows for consistency across different packages and models. tidymodels can apply a wide variety of model types, and the standardized code helps with collaboration and reproducibility.
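As a sketch of what that standardization buys us, compare the two specifications below; only the specification line changes while everything downstream stays the same. Neither is used later in this vignette, and they assume the rstanarm and ranger packages are installed.

#Same model, Bayesian engine -- only the engine changes
bayes_spec <- logistic_reg(mode = "classification", engine = "stan")
#A different model type entirely, same specification syntax
rf_spec <- rand_forest(mode = "classification", engine = "ranger")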
Another component of the tidymodels approach is the workflow. A workflow combines the model specification with other elements such as the model formula (response ~ predictors). To build a workflow with the model specified above, we use this code:
titanic_wf <- workflow() %>%
add_model(glm_spec) %>% #Add specified model
add_formula(survived ~ sex + age + passengerClass) #add model formula
titanic_wf
## ══ Workflow ════════════════════════════════════════════════════════════════════
## Preprocessor: Formula
## Model: logistic_reg()
##
## ── Preprocessor ────────────────────────────────────────────────────────────────
## survived ~ sex + age + passengerClass
##
## ── Model ───────────────────────────────────────────────────────────────────────
## Logistic Regression Model Specification (classification)
##
## Computational engine: glm
Now that we have all the components of the model set up, we can fit it to the training data:
glm_titanic <- titanic_wf %>%
fit(data = titanic_train)
Once the model is fit we can examine its coefficients:
tidy(glm_titanic)
## # A tibble: 5 × 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 3.68 0.388 9.47 2.73e-21
## 2 sexmale -2.47 0.194 -12.7 3.61e-37
## 3 age -0.0366 0.00752 -4.86 1.16e- 6
## 4 passengerClass2nd -1.35 0.262 -5.16 2.44e- 7
## 5 passengerClass3rd -2.49 0.266 -9.35 8.82e-21
If we want to view these coefficients as odds ratios, we can simply add the argument exponentiate = TRUE:
tidy(glm_titanic, exponentiate = TRUE)
## # A tibble: 5 × 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 39.6 0.388 9.47 2.73e-21
## 2 sexmale 0.0850 0.194 -12.7 3.61e-37
## 3 age 0.964 0.00752 -4.86 1.16e- 6
## 4 passengerClass2nd 0.259 0.262 -5.16 2.44e- 7
## 5 passengerClass3rd 0.0832 0.266 -9.35 8.82e-21
A basic summary of the above: passengers were less likely to survive if they were male, older, or traveling in 2nd or 3rd class relative to 1st.
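As a quick sanity check, the odds ratios above are just the exponentiated log-odds coefficients from the first table:

#e.g. the sexmale coefficient from the first table
exp(-2.47) #~0.085, matching the sexmale odds ratio above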
We can use the model to predict from the testing set. These predictions can be generated as simple yes/no for survival like this:
#classes
pred_class <- predict(glm_titanic,
new_data = titanic_test,
type = "class")
head(pred_class)
## # A tibble: 6 × 1
## .pred_class
## <fct>
## 1 yes
## 2 yes
## 3 yes
## 4 yes
## 5 yes
## 6 no
Or they can be generated as probabilities for each observation from the testing set like this:
#probabilities
pred_prob <- predict(glm_titanic,
new_data = titanic_test,
type = "prob")
head(pred_prob)
## # A tibble: 6 × 2
## .pred_no .pred_yes
## <dbl> <dbl>
## 1 0.235 0.765
## 2 0.471 0.529
## 3 0.0593 0.941
## 4 0.0614 0.939
## 5 0.136 0.864
## 6 0.535 0.465
Let’s combine all these predictions with the true data from the testing set:
glm_results <- titanic_test %>%
select(survived) %>%
bind_cols(pred_class, pred_prob) %>%
rename(Original_survival = survived,
Predicted_survival = .pred_class,
Predicted_prob_no = .pred_no,
Predicted_prob_yes = .pred_yes)
head(glm_results)
## # A tibble: 6 × 4
## Original_survival Predicted_survival Predicted_prob_no Predicted_prob_yes
## <fct> <fct> <dbl> <dbl>
## 1 yes yes 0.235 0.765
## 2 no yes 0.471 0.529
## 3 no yes 0.0593 0.941
## 4 yes yes 0.0614 0.939
## 5 yes yes 0.136 0.864
## 6 yes no 0.535 0.465
Now let’s take a look at how many observations of the testing dataset were accurately predicted using a confusion matrix:
conf_mat(glm_results, truth = Original_survival,
estimate = Predicted_survival)
## Truth
## Prediction no yes
## no 127 33
## yes 27 75
Above we see that the testing dataset had a total of 108 survivors and 154 non-survivors. Our model correctly predicted 75 of the 108 survivors and 127 of the 154 non-survivors. We can also calculate an overall accuracy percentage for these predictions:
accuracy(glm_results, truth = Original_survival,
estimate = Predicted_survival)
## # A tibble: 1 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 accuracy binary 0.771
We see above that the model had an overall accuracy of about 77% ((127 + 75) / 262 ≈ 0.771).
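If you want more than overall accuracy, yardstick (the tidymodels package that supplies conf_mat() and accuracy()) also provides class-specific metrics. A quick sketch using the objects we already have (by default the first factor level, “no”, is treated as the event):

#sensitivity and specificity for the same predictions
sens(glm_results, truth = Original_survival, estimate = Predicted_survival)
spec(glm_results, truth = Original_survival, estimate = Predicted_survival)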
Another important metric for evaluating classification algorithms is the ROC-AUC (Area Under the ROC Curve) score. Without getting into too much detail, the score represents the ability of a classification model to discriminate between classes. The ROC-AUC score is bounded between 0 and 1, where 0.5 is no better than random guessing and higher scores are better.
roc_auc(glm_results,
truth = Original_survival,
Predicted_prob_yes,
event_level = "second")
## # A tibble: 1 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 roc_auc binary 0.827
Note: for the roc_auc() function, the event level should be whatever outcome is considered “positive”. In the case of the Titanic data, “yes” for surviving is the “positive” outcome, but it is the second factor level because “no” comes first alphabetically. Therefore, we use the event_level argument to set the “positive” outcome to the second level with “second”.
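An alternative sketch (not used below) is to relevel the outcome so that “yes” is the first level, in which case the default event_level = "first" applies:

#relevel so "yes" is the first level, then the default event level works
glm_results %>%
  mutate(Original_survival = fct_relevel(Original_survival, "yes")) %>%
  roc_auc(truth = Original_survival, Predicted_prob_yes)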
We can also plot a visual of the ROC curve. Ideally a decent ROC curve looks like what you see here:
autoplot(roc_curve(glm_results, Original_survival, Predicted_prob_yes,
event_level = "second"))
In the above case the curve bows well away from the 1:1 line, which is what we want.
For more details: https://towardsdatascience.com/understanding-auc-roc-curve-68b2303cc9c5
For those of us who want to visualize model coefficients:
tidy(glm_titanic) %>%
ggplot(.) +
geom_hline(yintercept = 0, lty = 3,
color = "grey", size = 1.5) +
geom_errorbar(aes(x = term, ymin = estimate-std.error,
ymax = estimate+std.error),
width = 0.3, size= 1.5) +
geom_point(aes(x = term, y = estimate,size = p.value), pch = 21,
color = "black", fill = "orange", alpha = 0.5) +
geom_point(aes(x = term, y = estimate), size = 4, pch = 21,
color = "black", fill = "lightblue") +
scale_size(range = c(7,12)) +
labs(title = "Model Coefficients",
x = "Model Terms", y = "Estimates",
size = "P-value")+
geom_text(x=2, y=2.7, label="Intercept\nRef. Level\nFemale, First Class", size = 5)+
theme_bw() +
coord_flip()
If we have a lot of data, then we can further split the training data into its own training and testing sets using a variety of methods. This helps develop better predictions and less biased models. Furthermore, if the model fits poorly, we don’t want to waste the testing data on it.
To do this we need to further split the training dataset into its own training and testing sets. There are various ways to do this that go beyond the scope of this introductory material. For now we will use vfold_cv(), which performs k-fold cross-validation. This function randomly splits the data into a specified number of sections, or “folds”; each fold is then held out in turn as an assessment set while the model is trained on the remaining folds. In the code below we set v = 10 to split the training data into ten folds. We also want to make sure each fold has a relatively equal proportion of survivors and non-survivors, so we use the strata argument and specify strata = survived. This implements a stratified sampling design.
Figure taken from Rhys (2020)
titanic_folds <- vfold_cv(data = titanic_train, v = 10, strata = survived)
We can examine one of the folds to see how many data points are being used for training and testing.
titanic_folds$splits[[1]]
## <Analysis/Assess/Total>
## <705/79/784>
Here we use fit_resamples() to fit the model on every fold of the data, saving the predictions from each fold.
glm_titanic_fold <- titanic_wf %>%
  fit_resamples(
    resamples = titanic_folds,
    control = control_resamples(save_pred = TRUE) #save predictions from each fold
  )
We can calculate the average accuracy and ROC-AUC scores across the ten models.
collect_metrics(glm_titanic_fold)
## # A tibble: 2 × 6
## .metric .estimator mean n std_err .config
## <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 accuracy binary 0.782 10 0.00928 Preprocessor1_Model1
## 2 roc_auc binary 0.843 10 0.0114 Preprocessor1_Model1
We can also plot the ROC curves for each model:
glm_titanic_fold %>%
collect_predictions() %>%
group_by(id) %>%
roc_curve(survived, .pred_no) %>%
ggplot(aes(1 - specificity, sensitivity, color = id)) +
geom_abline(lty = 3, color = "gray80", size = 1.5) +
geom_path(show.legend = FALSE, alpha = 0.6, size = 1.2) +
theme_bw() +
scale_color_viridis_d()
Machine Learning with R, the tidyverse, and mlr: https://www.manning.com/books/machine-learning-with-r-the-tidyverse-and-mlr
Julia Silge’s Blog: https://juliasilge.com/blog/