Lecture_6: Resampling

Machine Learning

This section introduces you to fundamental Machine Learning (ML) concepts and algorithms. Empowered by modern computing capabilities and statistical modelling, machine scale analysis can capture hidden values in big data that are otherwise limited to human scale thinking. In 1959, in a paper Some Studies in Machine Learning Using the Game of Checkers Arthur Samuel used the term machine learning in the game of checkers in the form we know it today: “to verify the fact that a computer can be programmed so that it will learn to play a better game of checkers than can be played by the person who wrote the program”. The paper conveys the process of passing our own methods of learning to a machine as a subfield of statistical computing. It is the application of statistical modelling to obtain a general understanding of the data, to make predictions and to improve performance based on data and empirical information.

ML can learn from the data that is processed and analysed and the more data it processes, the more it can learn. But, to create such intelligent machines it is useful to develop an understanding of classical statistics that is the LifeBlood of ML algorithms. Coated by the layers of k-nearest neighbours, linear regression, random forest, naïve bayes… machines can develop the cognitive abilities of today’s Artificial Intelligence (AI).

Translating the statistical methodologies that can run on machines through coding and programming is another vital part of ML. As such, R is a powerful statistical language for learning an important and desirable ML skill: code writing for statistical modelling. Using R greatly simplifies machine learning. Using R enables us to focus on developing an understanding of how each algorithm can solve a problem, for which we can simply use a written package to quickly generate prediction models on data with a few command lines.

Types of ML

We will touch upon a few of the most popular ML algorithms from a vast set of its algorithmic tools for understanding data. These tools are broadly classified as supervised or unsupervised. In general, supervised learning involves building a statistical model for estimating or predicting an output based on one or more inputs. There are many examples of their application in pharma and medicine, such as disease identification/diagnosis and personalised treatments, or pricing in business to name a few. With unsupervised learning, there are inputs but no supervising output; nevertheless, we can learn relationships and structure from such data. One of the most popular examples is a recommender system as a reflection of our consumer behaviour. Other examples can be found in biological genomic studies, market research, security, with the most used example being anomaly detection of malicious activity in an organization’s network etc.

Fuelled by powered computer technologies ML models have evolved considerably in the last decade, advancing into a third category known as reinforcement. Unlike supervised and unsupervised learning, reinforcement learning continuously improves its model advancing from the previous iterations. This differs from the other two categories of ML, which reach an indefinite stage after the model is formulated from the training and test data segments. Examples can be found in computer gaming, skill acquisition such as real time discussion etc.

Supervised Versus Unsupervised Learning

The majority of data analysis problems fall into one of the two ML categories: supervised or unsupervised. Sometimes we deal with a problem in which for each observation we have a predictor(s) \(x_i\), \(i = 1, 2, … n\) there is a response \(y_i\). In those cases, we seek to find a model that reflects the possible relationship between response and predictor(s). The aim is to be able to predict the response values for future observation accurately, with greater understanding of the relationship between the two. In statistical terms, we wish to conduct:

the prediction of \(y\)’s for future values of \(x\)’s and
the inference about the nature of the relationship between \(X\) and \(Y\)

The best known statistical modelling techniques that fall into this ML category that are commonly used are linear regression, logistic regression, Generalised Additive Models (GAM), boosting and Support Vector Machines (SVM).

To the problems where for each observation \(i = 1, 2, … n\) we have a set of measurements \(x_i\), but no associated responses \(y_i\) based on which we can “supervise” the analysis to understand the relationship between the variables or between the observations, we referred to as unsupervised learning. Commonly used methodology in this setting is clustering. The goal of cluster analysis is to group the observation in the data set into disjointed clusters of similar samples. It is commonly used in market segmentation where, for example, you need to develop marketing strategies for identified groups of customers clustered into them based on their behaviour.

Previously, we recognised that when dealing with the measured response variable in statistical modelling we tended to use regression. If, however we have attribute response variables such as defaults on a debt, gender, ethnicity, to name a few, we tend to use classification modelling. In multifactor data analysis, selecting the appropriate statistical modelling technique is not so straightforward as it is in bivariate statistical modelling situations, for which we fit a model based on the type of response variable. Here, we can use logistic regression for the models with binary response attribute variables as it estimates class probabilities, which on the other hand can be thought of as a regression method as well. Methods such as K-nearest neighbours and boosting can be used for both measured and attribute response variables. When dealing with multifactor statistical modelling, we have to think carefully not just about the type of response variable, but most of all about the type of problems we wish to address.

Assessing Model Accuracy

Selecting the most efficient approach can be one of the most challenging parts of performing statistical modelling in practice, as the choice of the model that has worked on one data set may not perform as well on another similar data set. Therefore, evaluating the performance of a chosen statistical method is seen as an important part of statistical modelling.

Through statistical modelling, we are trying to explain and dig deeper into understanding the nature of a process, i.e. phenomena of interest. We describe this phenomenon as an unknown function \(f\). In its related data we annotate as \(x\) the information that can explain the problem’s behaviour (this could be a single value or a vector or matrix of values or something more complex) and as \(y\) the results of its behaviour (this could also be a single value or something more complex). Mathematically we present this as

\[y_i = f(x_i)\]

However, there is one more element we need to add to this equation that would make it statistically correct

\[y_i = f(x_i) + \epsilon_i\]

We refer to this additional element epsilon as an “error” and it depicts the noise, the variation of the \(f\)’s behaviour. The standard approach is to adopt the following distribution structure: \(\epsilon \sim N(0,\sigma)\), meaning that the \(\epsilon\) value is considered to be normally distributed with a mean of \(0\) and some standard deviation \(\sigma\). This implies that negative and positive impacts from this noise are considered equally likely, and that small errors are much more likely than extreme ones.

To evaluate the performance of a statistical model on a given data set, we commonly monitor the discrepancies between the predicted response values for given observations obtained by the chosen statistical model (\(\hat{f}(x_i)\)) and the true response values for these observations (\(y_i\)). The most commonly used measure is the mean squared error (MSE):

\[MSE = \frac{1}{n} \sum^{n}_{i=1}(y_i - \hat{f}(x_i))^2\]

It is easy to note that the small \(MSE\) will correspond to the model for which the predicted responses (\(\hat{f}(x_i)\)) are very close to the true responses (\(y_i\)) and large if the true and predicted responses differ significantly.

In general, a machine learning model aims to make good predictions on new, previously unseen data. However, we don’t have the previously unseen data when experimenting with a new problem. One way of overcoming this obstacle is to split the given data into two subsets:

training data: a subset to train a model
test data: a subset to test the model

The ratio of of two split should be approximately 75/25. This means that training data should account for around 75 percent of the rows in the given data and the other 25 percent of rows is the test data. Good performance on the test data is a useful indicator of good performance on the new data in general, assuming that the test data is sufficiently large and that you don’t take advantage of using the same test data over and over again to make your model look better than it is. Often data is arranged sequentially and it is good practice to randomise the rows in the data set prior to the split, as it helps to avoid bias in the fitted model. The easiest way to split the data into a training and test set is to take a random sample.

💡 The MSE above is computed using the training data, used to fit the model, and as such it would be more correct to refer to it as the training MSE.

But we have already discussed that we are not really interested how well the model “works training” on the training data, ie. \(\hat{f}(x_i) \approx y_i\). We are more interested in the accuracy of the predictions \(\hat{f}(x_0)\) that are obtained when we apply the model to previously unseen test data \((x_0, y_0)\), ie. \(\hat{f}(x_0) \approx y_0\). In other words, we want to chose the model with the lowest \(test\) \(MSE\) and to do so we need a large enough number of observations in the test data to calculate the mean square prediction error for the test observations (\(x_0\), \(y_0\)), to which we can refer to as the \(test\) \(MSE\).

\[mean(y_0 - \hat{f}(x_0))^2\]

💡 Note that after randomising data we can begin to apply ML to the training data. The remaining test data is put aside and reserved for testing the accuracy of the model.

In statistics nothing is black and white. In other words, nothing is straightforward and there are many considerations one needs to take into account when applying statistical modelling. The same applies in this situation. We need to realise that when a given model yields a small training MSE but a large test MSE, we are said to be overfitting the data. The statistical model is too ‘preoccupied’ to find patterns in the training data and consequently is modelling the patterns that are caused by random effect, rather than by true features of the unknown function \(f\). When the model overfits the training data, the test MSE will be large because the modelled features that the model identifies in the training data just do not exist in the test data. Saying that, regardless of overfitting occurring or not, we expect the training MSE to be smaller than the test MSE, as most of the statistical models either directly or indirectly seek to minimize the training MSE. We need to be aware that the chosen model needs to be flexible and not rigid and glued to the training data.

Reducible and Irreducible Error: The Bias-Variance Trade-Off

MSE is simple to calculate and yet, despite its simplicity, it can provide us with a vital insight into modelling. It consists of two intrinsic components:

bias
variance

that can provide greater enlightenment about how the model works.

We have talked earlier about variance in general, but we have not explained the idea of bias in statistical term. Bias is the average difference between the estimator \(\hat{y}\) and true value \(y\). Mathematically we write bias as:

\[E[\hat{y} – y]\]

As it is not squared difference, it can be either positive or negative. Positive or negative bias implies that the model is over or under “predicting”, while the value of zero would indicate that the model is likely to predict too much as it is to predict too little. The latter implies that the model can be completely wrong in its prediction and still provide us with the bias of zero. This implies that bias on its own provides little information about how correct the model is in its prediction.

Remember that \(y = f(x) + \epsilon\) and therefore \(\hat{f}\) is not directly approximating \(f\). \(\hat{f}\) models \(y\) that includes the noise. It can be challenging and in some cases even impossible to meaningfully capture the behaviour of \(f\) itself when the noise term is very large. We have discussed earlier that we assess model accuracy using MSE which is calculated by

obtaining the error (i.e. discrepancy between \(\hat{f}(x_i)\) and \(y_i\))
squaring this value (making negative into the positive same, and greater error gets more severe penalty)
then averaging these results

The mean of the squared error is the same as the expectation\(^*\) of our squared error so we can go ahead and simplify this a slightly:

\[MSE=E[(y-\hat{f}(x))^2]\] Now, we can break this further and write it as \[MSE = E[(f(x)+ \epsilon - \hat{f}(x))^2]\] Knowing that computing the expectation of adding two random variables is the same as computing the expectation of each random variable and then adding them up

\[E[X+Y]=E[X] +E[Y]\] and recalling that \(\sigma^2\) represent the variance of \(\epsilon\), where the variance is calculated as \[E[X^2]-E[X]^2,\] and therefore \[Var(\epsilon) = \sigma^2 = E[\epsilon^2] - E[\epsilon]^2,\] with \(\epsilon \sim N(0,\sigma)\)

\[E[\epsilon]^2=\mu^2=0^2=0\] we get

\[E[\epsilon^2] = \sigma^2\] This helps us to rearranging MSE further and calculate it as

\[MSE=σ^2+E[−2f(x)\hat{f}(x)+f(x)^2+\hat{f}(x)^2],\]

where \(\sigma^2\) is the variance of the noise, ie. \(\epsilon\). This is a “eureka” moment: the variance of the noise in data is an irreducible part of the MSE. Regardless of how good the model is, it can never reduce the MSE to being less than the variance related to the noise, i.e. error. This error represents the lack of information in data used to adequately explain everything that can be known about the phenomena being modelled. We should not look at it as a nuisance, as it can often guide us to further explore the problem and look into other factors that might be related to it.

Knowing that \[Var(X) = E[X^2] - E[X]^2,\] we can apply further transformation and break MSE into \[MSE = \sigma^{2}+Var[f(x)-\hat{f}(x)]+E[f(x)-\hat{f}(x)]^2\] The term \(Var[f(x)-\hat{f}(x)]\) is the variance in the model predictions from the true output values and the last term \(E[f(x)-\hat{f}(x)]^2\) is just the bias squared. We mentioned earlier that that unlike variance, bias can be positive or negative, so we square this value in order to make sure it is always positive.

With this in mind, we realise that MSE consists of:

model variance
model bias and
irreducible error

\[\text{Mean Squared Error}=\text{Model Variance} + \text{Model Bias}^2 + \text{Irreducible Error}\]

We come to the conclusion that in order to minimize the expected test error, we need to select a statistical model that simultaneously achieves low variance and low bias.

💡 Note that in practice we will never know what the variance \(\sigma^2\) of the error \(\epsilon\) is, and therefore we will not be able to determine the variance and the bias of the model. However, since \(\sigma^2\) is constant, to improve the model we have to decrease either bias or variance.

Testing the model using the test data and observing its bias and variance can help us address some important issues, allowing us to reason with the model. If the model fails to find the \(f\) in data and is systematically over or under predicting, this will indicate underfitting and it will be reflected through high bias. However, high variance when working with test data indicates the issue of overfitting. What happens is that the model has learnt the training data really well and is too close to the data, so much so that it starts to mistake the \(f(x) + \epsilon\) for true \(f(x)\).

Simulation Study

To understand these concepts, let us run a small simulation study. We will:

simulate a function \(f\)
apply the error, i.e. noise sampled from a distribution with a known variance

To make it very simple and illustrative we will use a linear function \(f(x) = 3 + 2x\) to simulate response \(y\) with the error \(e\thicksim N(\mu =0, \sigma^2 =4)\), where \(x\) is going to be a sequence of numbers between \(0\) and \(10\) in steps of \(0.1\). We will examine the simulations for the models that over and under estimate the true \(f\), and since it is a linear function we will not have a problem identifying using simple linear regression modelling.

Let’s start with a simulation in which we will model the true function with \(\hat{f}_{1} = 4 + 2x\) and \(\hat{f}_{2} = 1 + 2x\).

Observing the graph, we notice that \(\hat{f}_1\) and \(\hat{f}_2\), depicted in blue and red lines respectively, follow the data nicely, but are also systematically over (in the case of \(\hat{f}_1\) and under (in the case of \(\hat{f}_2\)) estimating the values. In the simple model \(\hat{f}_3\), the line represents the value \(\bar{y}\), which cuts the data in half.

As we mentioned earlier, knowing the true function \(f\) and the distribution of \(\epsilon\) we can calculate: - the MSE using the simulated data and the estimated model, - the model’s bias and variance which will allow for the calculation of the “theoretical” MSE. This will allow for more detailed illustration about the information contained in the model’s bias and variance.

## ============================================== 
## =============== f_hat_1 ================ 
## MSE_data1 =  4.61 
## bias_1 =  1 
## variance_1 =  0 
## MSE_1 = 4 + bias_1^2 + variance_1 =  5 
## ============================================== 
## =============== f_hat_2 ================ 
## MSE_data2 =  7.64 
## bias_2 =  -2 
## variance_2 =  0 
## MSE_2 = 4 + bias_2^2 + variance_2 =  8 
## ============================================== 
## =============== f_hat_3 ================ 
## average y =  13 
## MSE_data3 =  36.7 
## bias_3 =  0 
## variance_3 =  33.84 
## MSE_3 = 4 + bias_3^2 + variance_3 =  37.84 
## ==============================================

\(\hat{f}_1\) has a positive bias because it is overestimating data points more often than it is underestimating, but as it does it so consistently in comparison to \(f\) that produces variance of zero. In contrast \(\hat{f}_2\) has a negative bias as it is underestimating simulated data, but nonetheless it also does it consistently, resulting in zero variance with \(f\).

Unlike in the previous two model estimates which follow the data points, \(\hat{f}_3\) predicts the mean value of data, resulting in no bias since it evenly underestimates and overestimates \(f(x)\). However, the variation in prediction between \(f\) and \(\hat{f}_3\) is obvious.

Given that the true function \(f\) is linear, by applying simple regression modelling, we should be able to estimate it easily in R using the \(lm()\) function.

Since the true function \(f\) is a linear model it is not surprising that \(\hat{f}_4\) can learn it, resulting in zero values of the model’s bias and variance.

## ============================================== 
## =============== f_hat_4 ================ 
## MSE_data4 =  3.62 
## bias_4 =  0 
## variance_4 =  0 
## MSE_4 = 4 + bias_4^2 + variance_4 =  4 
## ==============================================

We realise that the \(MSE\) is more than just a simple error measurement. It is a tool that informs and educates the modeller about the performance of the model being used in the analysis of a problem. It is packed with information that when unwrapped can provide a greater insight into not just the fitted model, but the nature of the problem and its data. It provides you with the desired agility for steering a boat in a sea of data.

\(^*\) The Expectation of a Random Variable is the sum of its values weighted by their probability. For example: What is the average toss of a fair six-sided die?

If the random variable is the top face of a tossed, fair six-sided die, then the probability of die landing on \(X\) is

\[f(x) = \frac{1}{6}\] for \(x = 1, 2,... 6\). Therefore, the average toss, i.e. the expected value of \(X\) is:

\[E(X) = 1(\frac{1}{6}) + 2(\frac{1}{6}) + 3(\frac{1}{6}) + 4(\frac{1}{6}) + 5(\frac{1}{6}) + 6(\frac{1}{6}) = 3.5\] Of course, we do not expect to get a fraction when tossing a die, i.e. we do not expect the toss to be 3.5, but rather an integer number between 1 to 6. So, what the expected value is really saying is what is the expected average of a large number of tosses will be. If we toss a fair, six-side die hundreds of times and calculate the average of the tosses we will not get the exact 3.5 figure, but we will expect it to be close to 3.5. This is a theoretical average, not the exact one that is realised.

YOUR TURN 👇

Identify a data set from your own work experience or from amongst many available freely on the Internet and describe a problem that can be addressed using this data. Write a brief report describing the identified data and a problem that can be addressed using this data. Discuss further in your report which of the categories of ML would be applicable for the analysis of this problem? Try to justify your thinking.
In the simulation study, notice the difference between \(MSE\) calculated from the simulated data and the one we obtain by plugging in the values of \(\sigma^2\) and estimator’s \(bias\) and \(variance\). Try to increase the sample size of the simulation and comment on the changes you may notice (hint: make a change for smaller steps in the \(seq()\) function). Can you draw any conclusions; what are they?

Resampling

We use resampling techniques for estimating model performance. We can fit a model on a subset of data, known as tanning data and using the remaining sample, better known as test data, we can assess the efficiency of the model. Often, this process is repeated multiple times and the results are aggregated and summarized. Hence, resampling method involves repeatedly drawing samples from a training data set and refitting a model to obtain addition information about that model.

Resampling methods involve fitting the same statistical method multiple times using different subsets of the training data that can be computationally taxing. Fortunately, computing power has greatly advanced in the last few decades allowing for resampling techniques to become an indispensable tool of a statistical modelling procedure.

In general, resampling techniques differ in the way in which the subsets of data are chosen. We will consider the most commonly used resampling methods: cross-validation and bootstrap.

k-Fold Cross-Validation

This approach involves randomly dividing the set of observations into k groups, known as folds of approximately equal size.

With cross-validation the test data is held out (approximately one fifth of data), and the remaining training data is randomly divided into \(k\) groups. Several different portions of this training data are used for validation, and the remaining part of the data is used for training as shown in the diagram below. Hence, a fold is treated as a validation set, and the method is fit on to the remaining \(k−1\) folds. The \(k\) resampled estimates of performance are summarized and used for testing and model building.

Since the non-holdout data is divided into five portions, we call this “5-fold cross-validation”. If there had been ten blocks, it would have been 10-fold cross-validation. The model that has been built using k-fold cross-validation is then tested on the originally held out test data subset. The \(k\) resampled estimates of model’s performance are summarized commonly using the mean and standard error to develop a better reasoning of its effectiveness in relation to its tuning parameters.

\[CV_{(k)} = \frac{1}{k}\sum^{k}_{i=1} MSE_{i}\]

A typical choice of \(k\) is \(5\) or \(10\), but there is no formal rule. One should keep in mind when making a choice that as \(k\) increases, the difference in size between the training set and the resampling subset gets smaller, causing the bias of the statistical model to become smaller.

Leave-One-Out Cross-Validation (LOOCV)

As in k-fold CV, the training set of data is split into two parts, with the difference that here they are not of comparable sizes as the validation set consists of a single observation and the remaining observations make up the training set.

It is a special case of k-fold CV in which \(k\) is equal to \(n\). Repeating the process \(n\) times produces \(n\) square errors and the estimate for the test \(MSE\) is the average of those \(n\) test error estimates:

\[CV_{(n)} = \frac{1}{n}\sum^{n}_{i=1} MSE_{i}\]

The main advantage of the LOOCV approach over a simple train and test validation is that it produces less bias. The model is fit on \(n-1\) training observations, almost as many as there are in the entire data set. Furthermore, since LOOCV is performed multiple times it yields consistent results with less randomness than in a simple train and test approach. However, when compared with k-fold CV, LOOCV results in a poorer estimate as it provides an approximately unbiased estimate for the test error that is highly variable.

One more down side to the generally high value of \(k\) is that the computation side of the procedure becomes more taxing especially if the model is rather a complex one. Come to think about it, the LOOCV requires as many model fits as data points and each model fit uses a subset that is ALMOST the same size as the training.

Stratified K-Folds cross-validator

Another variation of k-fold CV is stratified k-fold CV, which returns stratified folds. This could be particularly useful for data that exhibit a large imbalance in the distribution of the target classes. As stratification is the process of rearranging the data in such a manner that each fold is a good representative of the whole, the stratified k-fold CV facilitates the preservation of the relative class frequencies in each train and validation fold. The figure below illustrates the process in the case of a binary classification.

Choice of \(k\)

Using k-fold cross-validation increases validation sensitivity, allowing better reasoning with the model. One of the key questions is how to choose the number of folds, i.e. how big does \(k\) need to be? In general, the choice of the number of folds depends on the size of the dataset. For large datasets 5-fold cross-validation is considered to be quite accurate and when dealing with very sparse datasets, we may consider using leave-one-out in order to train on as many examples as possible.

We have seen that there is a bias-variance trade-off associated with the choice of \(k\) in k-fold cross-validation. Larger \(k\), for which training folds are closer to the total dataset, results in less bias towards overestimating the true expected error but higher variance and higher running time. We can summarise those findings as follows:

for a large number of folds
- positives:
  1. small bias of the true error rate estimator (as a result of a very accurate estimator)
- negatives:
  1. large variance of the true error rate estimator
  2. the computational time is large
for a small number of folds
- positives:
  1. small variance of the estimator
  2. the number of experiments and, therefore, computation time are reduced
- negatives:
  1. large bias of the estimator

In practice, typical choises for \(k\) in cross-validation are \(k=5\) or \(k=10\), as these values have been shown empirically to yield test error rate estimates that suffer neither from excessively high bias nor from very high variance.

Bootstrap

The bootstrap method was introduced by Efron in 1979. Since then it has evolved considerably. Due to its intuitive nature, easily grasped by practitioners and available strong computational power necessary for its application, today bootstrapping is regarded as the indispensable tool for data analysis. The method is named after Baron Münchhausen, a fictional character who in one of the stories saved his life by pulling himself out of the bottom of a deep lake by his own hair.

Bootstrapping is a computationally intensive, nonparametric technique that makes probability-based inference about a population characteristic, \(\Theta\), based on an estimator, \(\hat\Theta\), using a sample drawn from a population. The data is resampled with replacement many times in order to obtain an empirical estimate of the sampling distribution of the statistic of interest \(\Theta\). Thus, bootstrapping enables us to make inference without having to make distributional assumptions (see Efron, 1979). In machine learning, for estimation purposes the idea of bootstrapping datasets has been proposed as an alternative to CV.

A bootstrap sample is a random sample of the data taken with replacement (Efron and Tibshirani 1986). Consequently, since samples are drawn with replacement, each bootstrap sample is likely to contain duplicate values. Bootstrapping relies on analogy between the sample and the population from which the sample was drawn, by treating the sample as if it is a population. The two key features of bootstrapping a sample with replacement are:

a data point is randomly selected for the subset and returned to the original data set, so that it is still available for further selection
the bootstrap sample is the same size as the original data set from which it was constructed

Using uniform re-sampling with replacement, a \(B\) number of training sets are produced by bootstrap to produce a performance estimate of a chosen statistical method, ie. model. The model is trained and its performance is estimated on the out-of-sample observations, as depicted in the figure above. The original observations not selected in a particular bootstrap sample are usually referred to as the out-of-bag (OOB). Hence, for a given bootstrap iteration, a model is built on the selected sample and is used to predict the out-of-bag sample. On average around \(63.2\%\) of the original sample ends up in any particular bootstrap sample (Mendelson et al. 2016).

When allying the bootstrapping procedure for inferential purposes typical chose for \(B\) is in the range of a few hundreds to thousands. Efron and Tibshirani (p.52) indicate that \(B=50\) and even \(B=25\) is usually sufficient for bootstrap standard error estimates and point out that there are rare occasions for which more than \(B=200\) replications are needed for estimating a standard error. In the context of using bootstraping for validation purposes the size of \(B\) in the range of hundreds may be unacceptably high, and the validation process should be repeated for a specified number of folds \(k\), i.e. set \(B=k\). Hence, the bootstrap resampling with replacement procedure for ML from a data set of size \(n\) can be summarised as follows:

randomly select with replacement \(n\) examples and use this set for training and model building
the remaining examples that were not selected for training are used for testing
repeat this process for a specific number of folds \(k\)
the true error is estimated as the average error rate on test examples

As Efron in his paper on etimating the error rates of prediction rules points out, when performing statistical modelling one might want more than just an estimate of an error rate. Bootstrap methods are helpful in understanding variability all aspects of the prediction problem. In this paper he makes comprehensive comparisons between different resampling methods, drawing the conclusion that in general the bootstrap error rates tend to have less uncertainty than k-fold cross-validation. One should also be aware that for small sample sizes the bias is noticeable and decreases as the sample size becomes larger as shown in Young and Daniels’ paper Bootsrap Bias.

Summary

The easiest way to assess the performance of a statistical method is by evaluating the error rate on a set of data that is independent of the training data. If model selection and true error estimates are to be computed simultaneously, the data should be divided into three disjointed sets as explained by Replay in his book.

In general we can summarise the ML model building procedure algorithm as follows:

divide the available data into training + validation and test sets
select model and training parameters
1. train and build the model using the training set
2. evaluate the model using the validation set
repeat step 2 using different model structures and tuning parameters
select the best model and train it using data from the training + validation set
assess this final model using the test set

When using cv or bootstrap sub-steps i) and ii) are repeated within step 2) for each of the \(k\) folds.

YOUR TURN 👇

Provide a rigorous explanation on how standard deviation of the prediction can be estimated for a statistical model that does prediction for the response \(Y\) for a particular value of the predictor \(X\). Identify the key steps and present them in an algorithmic manner.

Practise by doing the following exercises:

Use multiple regression on the Air Quality Data data set.

Produce a correlogram which include all the variables in the data set.
Use the lm() function to perform multiple regression with Ozone as the response and all other variables as the predictors. Use the summary() function to print the results and comment on the output, in particular:
1. is there a relationship between the predictors and the response, i.e. is the model valid?
2. which predictors appear to have a statistically significant relationship to the response?
3. on the basis of your response to the previous question, fit a smaller model that only uses predictors for which there is evidence of association with the response variable.
Try out some of the variable selection methods explained in lectures/practicals. Present and discuss the results from the chosen approach(es).
Propose with justification a model (or a set of models) that seem to perform well on this data set. Make sure you provide justified explanation on the selected set of predictors in the proposed model.

Access the stroke data set from https://tanjakec.github.io/mydata/stroke.csv. Data dictionary of this data set is given in https://tanjakec.github.io/mydata/Stroke_DataDictionary.docx word document.

Split the data into a training and a test set, pre-process the data, and train a model a model of your choice from this section
Predict the response for the test set. What is the value of the performance metric?
Which predictors are most important in the model you have trained?
Train the model using a different method from the previous selection. Don’t forget to initialise the same random number seed as the one used for the previous model training.
Compare the accuracy between the trained models. Which one does it perform better?