1. The basic ideas of Machine Learning

Machine Learning is a field of Artificial Intelligence that uses statistical algorithms to analyze data. In these notes we consider some of the most important Machine Learning algorithms. The notes are meant for a course on “Fundamentals of Data Mining and Big Data 1”; the Big Data part is not covered here.

1.1. Inference and prediction

When we build models to study data there are two points of view we can take: how the variables affect the result, and what we can say about unseen data. We call these two aspects inference and prediction, and they differ in approach. For example, if we want to study the relation between unemployment and inflation, we can build a model where the independent variable is unemployment and the dependent variable is inflation. If we are interested in inference, our study will try to measure the influence of unemployment on inflation: what will be the change in inflation if unemployment increases by 1%? On the other hand, if we are interested in prediction, we want to estimate the value of inflation for a given value of unemployment: what will inflation be if unemployment is equal to 6%? Inference has its own issues; for example, if we have two (or more) predictors we need to find out whether they are related to each other. A model built with related predictors will give us the wrong influence of each of them on the dependent variable. Such a problem does not arise if our interest is in the prediction of the dependent variable.
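The two points of view can be sketched with a single fitted model. The data below are simulated purely for illustration (the slope of -0.5 and the noise level are assumptions, not real economic figures): the coefficient answers the inference question, and `predict` answers the prediction question.

```r
# Simulated (hypothetical) data: inflation decreases with unemployment
set.seed(42)
unemployment <- runif(40, min = 3, max = 10)
inflation <- 8 - 0.5 * unemployment + rnorm(40, sd = 0.5)

fit <- lm(inflation ~ unemployment)

# Inference: estimated change in inflation per unit increase in unemployment
slope <- coef(fit)[["unemployment"]]
slope

# Prediction: estimated inflation when unemployment equals 6
pred_at_6 <- predict(fit, newdata = data.frame(unemployment = 6))
pred_at_6
```

The same `lm` object serves both purposes; what changes is which part of it we look at.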

1.2. Supervised vs unsupervised learning

There are two kinds of machine learning: supervised and unsupervised. In the first one we have a “result” (a known response) that we use to build the model. For example, if we have data on unemployment and inflation for several years, we try to approximate inflation from unemployment as closely as possible. Or, if we want to predict whether an image shows a dog or a cat from images labelled “dog” and “cat”, we know what we are looking for. In unsupervised learning the data do not have “results”: we feed the algorithm lots of images and let it decide which groups of images are similar.
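As a rough illustration of the difference, the sketch below uses the iris data set that ships with R (the choice of data set, of variables, and of three clusters are assumptions made for this example). The supervised fit uses a known response; the unsupervised one only sees the measurements.

```r
data(iris)

# Supervised: the response Petal.Width is known and guides the fit
supervised_fit <- lm(Petal.Width ~ Petal.Length, data = iris)
coef(supervised_fit)

# Unsupervised: only the four measurements are used, the Species label is ignored
set.seed(1)
groups <- kmeans(iris[, 1:4], centers = 3, nstart = 10)
table(groups$cluster)
```

Comparing `groups$cluster` with `iris$Species` afterwards is a way to judge the grouping, but the algorithm itself never sees the labels.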

1.3. Bias-variance trade-off

One of the key points in modeling is the balance between bias and variance. The expected squared prediction error of a supervised model is the sum of the variance of the model, its squared bias, and the variance of the random error (coming, for example, from measurement errors). Thus, if the actual relation between the variables is \(Y = f(X) + \epsilon\), and we have a model of the form \(\widehat{Y} = \widehat{f}(X)\), then for a value \(x_0\) we get \[E(y_0 - \widehat{f}(x_0))^2 = Var(\widehat{f}(x_0)) + [Bias(\widehat{f}(x_0))]^2 + Var(\epsilon).\] The term \(Var(\epsilon)\) cannot be controlled; it is the minimum error we can expect from our (or any) model. The variance of \(\widehat{f}\) measures how much the model changes when it is trained on different data sets. The bias comes from the choice of model, for example, when we use a linear function to model data that are not linear in nature. Rigid models, such as a linear model, have small variance but can have large bias. On the other hand, a flexible model has higher variance (as the data change, the model changes) but small bias (as it can adjust better to the data).
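The decomposition can be checked by simulation. The sketch below assumes a true function \(f(x) = \sin(2\pi x)\), a fixed grid of 30 points, noise standard deviation 0.3 and \(x_0 = 0.25\); all of these are illustrative choices. It repeatedly draws a training set, fits a rigid model (a straight line) and a flexible one (a polynomial of degree 9), and estimates the variance and squared bias of each at \(x_0\).

```r
set.seed(1)
f     <- function(x) sin(2 * pi * x)   # assumed true relation
x     <- seq(0, 1, length.out = 30)    # fixed design points
x0    <- 0.25                          # point where we evaluate the models
sigma <- 0.3                           # sd of the random error
n_sim <- 500                           # number of simulated training sets

# Draw one training set, fit a polynomial of the given degree, predict at x0
predict_at_x0 <- function(degree) {
  d <- data.frame(x = x, y = f(x) + rnorm(length(x), sd = sigma))
  fit <- lm(y ~ poly(x, degree), data = d)
  unname(predict(fit, newdata = data.frame(x = x0)))
}

preds_lin  <- replicate(n_sim, predict_at_x0(1))   # rigid model
preds_flex <- replicate(n_sim, predict_at_x0(9))   # flexible model

var_lin    <- var(preds_lin)
var_flex   <- var(preds_flex)
bias2_lin  <- (mean(preds_lin)  - f(x0))^2
bias2_flex <- (mean(preds_flex) - f(x0))^2

round(c(var_lin = var_lin, bias2_lin = bias2_lin,
        var_flex = var_flex, bias2_flex = bias2_flex), 4)
```

Under these assumptions the straight line should show the smaller variance and the larger squared bias, and the degree 9 polynomial the opposite.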

1.4. Evaluation

In these notes we are more interested in prediction than in inference. Hence, we need a way to measure how big our prediction errors are over known data. There are many measures; here we list a few:

Mean Square Error (MSE) and Root Mean Square Error (RMSE).
Misclassification error.
Accuracy, precision, recall, F1.
ROC and Area under the curve (AUC).
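Most of the measures listed above can be computed in a few lines of base R. The vectors below are small made-up examples, not results from any real model; ROC and AUC are left out of this sketch.

```r
# Regression: MSE and RMSE on made-up observed and predicted values
y     <- c(3, 5, 7, 9)
y_hat <- c(2, 5, 8, 11)
mse  <- mean((y - y_hat)^2)
rmse <- sqrt(mse)

# Classification: made-up true and predicted classes (1 = positive, 0 = negative)
actual    <- c(1, 1, 1, 0, 0, 0, 1, 0)
predicted <- c(1, 0, 1, 0, 1, 0, 1, 0)

tp <- sum(predicted == 1 & actual == 1)   # true positives
fp <- sum(predicted == 1 & actual == 0)   # false positives
fn <- sum(predicted == 0 & actual == 1)   # false negatives

misclassification <- mean(predicted != actual)
accuracy  <- mean(predicted == actual)
precision <- tp / (tp + fp)
recall    <- tp / (tp + fn)
f1        <- 2 * precision * recall / (precision + recall)

c(mse = mse, rmse = rmse, misclassification = misclassification,
  accuracy = accuracy, precision = precision, recall = recall, f1 = f1)
```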

2. Linear Model Selection and Regularization

The most basic statistical model is the linear model, where we have several predictors, say \(x_1, x_2, \ldots, x_n\), and a variable to predict, \(y\). The model will be something like \[ \widehat{y} = a_0 + a_1\,x_1 + a_2\,x_2 + \cdots + a_n\,x_n\] As mentioned earlier, in the case of inference one wants a good estimate of the coefficients \(a_j\), since a change of one unit in \(x_j\) is reflected in a change of \(a_j\) units in \(y\), which tells us how important the factor \(x_j\) is in the explanation of \(y\). Here the relationship between the predictors is important, as it can give the wrong picture. For example, if \(x_2 = 2 x_1\), then changing \(x_1\) will change \(x_2\) automatically, and the influence of \(x_1\) on \(y\) is not just \(a_1\); it is \(a_1 + 2 a_2\). Hence collinearity is an important issue. However, if we want to predict, we just plug in the values of \(x_1\) and \(x_2\) and get the estimate for \(y\) (we will need the values of the other predictors, of course, but right now we are talking about the relation between the first two predictors).
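The collinearity example can be reproduced with simulated data. In this sketch the coefficients \(a_1 = 3\) and \(a_2 = 2\) are assumptions chosen for illustration, so the combined effect of \(x_1\) is \(a_1 + 2 a_2 = 7\). With exact collinearity, `lm` cannot separate the two predictors: it reports `NA` for the redundant one and assigns the whole effect to the other.

```r
set.seed(123)
n  <- 100
x1 <- rnorm(n)
x2 <- 2 * x1                                     # exact collinearity
y  <- 1 + 3 * x1 + 2 * x2 + rnorm(n, sd = 0.1)   # assumed a_1 = 3, a_2 = 2

fit <- lm(y ~ x1 + x2)

# x2 is aliased (NA); the coefficient of x1 estimates a_1 + 2 * a_2
coef(fit)
```

The individual values 3 and 2 cannot be recovered from this fit, which is the inference problem described above, even though the fitted values themselves are unaffected.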

2.1. Review of the linear model

2.2. Subset selection

2.3. Ridge regression

2.4. Lasso

3. Missing Data

One of the big problems a data scientist faces is the famous, or infamous, NA: missing data. What can we do about it? There are many ways to handle this problem; the best known are data imputation techniques, which replace missing data with substituted values. In this lesson we are going to learn a few techniques: deletion, last observation carried forward (LOCF), mean substitution and regression imputation. We will use the airquality data set, available in the basic installation of R.

3.1. Types of missing data

Missing information in a database is quite normal in real-life applications. For example, not everyone answers all the questions in a questionnaire, even when they are marked as mandatory (in an online form, one can just type a single character, say “.”, in a free-text question).

Here are some types of missing data, with cases where each might occur:

Missing completely at random: for example, some records from a survey are lost because the answer sheets got wet and are no longer readable.
Missing at random: the missingness is not random, but it can be fully accounted for by other variables with complete information. For example, males are less likely to answer questions about depression, but once we take sex into account, the missing data is random. Thus, the missingness is not random in the full database, but it is random within the factor that explains why data is missing.
Missing that depends on unobserved predictors: for example, people who suffer strange side effects from a drug might decide not to answer a questionnaire, out of embarrassment. The data is missing, but in this case it depends on another factor, the side effects. Here we are assuming that side effects are not part of the observed variables or predictors.
Missing that depends on the missing value itself: people with high earnings could be less likely to reply to a question about salaries. The data are then missing mainly for people who earn a lot, not for everyone.
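The first and last types in this list can be contrasted with a small simulation. Everything below is hypothetical: the salaries are simulated, and the probabilities of not answering (20% for everyone in the first case, 70% for the top fifth of earners and 5% for the rest in the second) are assumptions chosen only to make the effect visible.

```r
set.seed(2024)
n <- 10000
salary <- rlnorm(n, meanlog = 10, sdlog = 0.5)   # simulated salaries

# Missing completely at random: every answer has the same 20% chance of being lost
mcar <- salary
mcar[runif(n) < 0.2] <- NA

# Missing depending on the value itself: high earners answer less often
p_miss <- ifelse(salary > quantile(salary, 0.8), 0.7, 0.05)
mnar <- salary
mnar[runif(n) < p_miss] <- NA

true_mean <- mean(salary)
mcar_mean <- mean(mcar, na.rm = TRUE)
mnar_mean <- mean(mnar, na.rm = TRUE)
round(c(true = true_mean, mcar = mcar_mean, mnar = mnar_mean))
```

In the first case the mean of the observed values should stay close to the true mean; in the second it underestimates it, because the large values are the ones that go missing.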

3.2. Reasons for missing data

Sometimes missing data is unavoidable. For example, if we are doing a follow-up of a treatment for heart disease and a patient dies, we cannot fill in the information.

Exercise: Think of at least three reasons for a database to have missing information in some observations.

3.3. Filling the missing data

When we have missing data in our database, the first thing to consider is whether we should fill it in with some values or not. An important factor in this decision is the amount of missing data: filling in just a few values, less than a handful, is not the same as filling in missing data in 25% of the observations.
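Before deciding, it helps to measure how much is missing. A minimal sketch with the airquality data set, using only base R functions:

```r
data(airquality)

# Proportion of missing values in each column
na_share <- colMeans(is.na(airquality))
round(100 * na_share, 1)

# Proportion of observations (rows) with at least one missing value
incomplete_share <- mean(!complete.cases(airquality))
round(100 * incomplete_share, 1)
```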

In this lesson we will handle missing data with the following techniques:

1. Remove observations with missing data

If the data is missing completely at random, and the proportion of missing data is small, we can consider deleting the affected observations from the data set.
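In base R, deletion can be sketched as follows with the airquality data set; `na.omit` drops every row that has at least one NA, while indexing with `is.na` restricts the deletion to a single variable.

```r
data(airquality)

# Drop every observation with at least one missing value
clean <- na.omit(airquality)
c(before = nrow(airquality), after = nrow(clean))

# Drop only the observations where Ozone is missing
ozone_known <- airquality[!is.na(airquality$Ozone), ]
nrow(ozone_known)
```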

2. LOCF, Last One Carried Forward

There are situations where missing data can be replaced by existing information. For example, if we have records of temperatures in a city taken every five minutes, it is not a bad solution to fill a missing value with the temperature from the previous measurement since, after all, temperatures do not change drastically in five minutes. Of course, we could miss some extreme atmospheric phenomenon that produces a sudden change of temperature, but that will almost surely be reflected in other values.
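A minimal base R sketch of this idea follows. The helper `locf` is a hypothetical function written for this example, not part of any package, and the temperatures are made up. A missing value at the very start has nothing to carry forward, so it stays NA.

```r
# Hypothetical helper: carry the last observed value forward
locf <- function(v) {
  # For each position, the index of the last non-missing value seen so far
  idx <- cummax(ifelse(is.na(v), 0L, seq_along(v)))
  idx[idx == 0] <- NA   # nothing observed yet: keep NA
  v[idx]
}

# Made-up temperatures taken every five minutes
temps <- c(20.1, 20.3, NA, NA, 20.8, NA, 21.0)
filled <- locf(temps)
filled
```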

3. Replace by the mean or the mode

The mean (for continuous variables), or the most common value (the mode, for categorical variables), can be used to replace the missing values. The drawback of this technique is that it makes the mean (or the mode) look more representative than it really is, since the filled-in values add no variability. Let us see how to do this with just a few base R commands, and with the imputeMean function from the mlr package. We create some random data, remove a few points, replace them by the mean with both methods, and check that the results are the same.

library(mlr)

## Loading required package: ParamHelpers

set.seed(1234)
x <- rnorm(100, mean = 0, sd = 3)
x[sample(1:100, size = 20, replace = FALSE)] <- NA
paste0("NA in x: ", sum(is.na(x)))

## [1] "NA in x: 20"

y <- x
y[is.na(x)] <- mean(x, na.rm = TRUE)
paste0("NA in y: ", sum(is.na(y)))

## [1] "NA in y: 0"

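# Same imputation with mlr: impute() returns a list whose data element is the imputed data frame
x_frame <- data.frame(x)
x_imputed <- impute(x_frame, cols = list(x = imputeMean()))
z <- x_imputed$data$x
# Count the positions where both methods agree
paste0("y == z: ", sum(y == z))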

## [1] "y == z: 100"

The impute function takes an R object as its first argument, in this case a data frame; then we list the columns we want to impute, each with its imputation method, which does not need to be the same for all columns. The imputation can be done in many different ways: mean, median, mode, a constant, draws from a uniform or a normal distribution, values above the maximum or below the minimum, etc.
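Two points from this section can also be checked in base R alone, without mlr: mean substitution shrinks the spread of the variable, and a categorical variable can be filled with its mode. The data below are made up for illustration.

```r
# Mean substitution reduces the standard deviation
set.seed(1234)
x <- rnorm(100, mean = 0, sd = 3)
x[sample(1:100, size = 20, replace = FALSE)] <- NA
y <- x
y[is.na(y)] <- mean(x, na.rm = TRUE)
c(sd_observed = sd(x, na.rm = TRUE), sd_imputed = sd(y))

# Mode substitution for a categorical variable (made-up colours)
colour <- factor(c("red", "blue", NA, "red", "green", NA, "red"))
mode_value <- names(which.max(table(colour)))   # table() ignores NA by default
colour_filled <- colour
colour_filled[is.na(colour_filled)] <- mode_value
table(colour_filled)
```

The imputed values sit exactly at the mean, so they add observations without adding any variability, which is why the standard deviation goes down.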

Data Mining 1: Machine Learning

Pablo Arés Gastesi

Feb. 2024

1. The basic ideas of Machine Learning

1.1. Inference and prediction

1.2. Supervised vs unsupervised learning

1.3. Bias-variance trade-off

1.4. Evaluation

2. Linear Model Selection and Regularization

2.1. Review of the linear model

2.2. Subset selection

2.3. Ridge regression

2.4. Lasso

3. Missing Data

3.1. Types of missing data

3.2. Reasons for missing data

3.3. Filling the missing data

1. Remove observations with missing data

2. LOCF, Last One Carried Forward

3. Replace by the mean or the mode

4. Imputation by linear regression

5. Using the mlr package for imputation

4. Poisson regression

5. Logistic Regression

6. Cross Validation

7. Trees

8. KNN

9. K-Means

10. Clustering

11. Linear Discriminant Analysis

12. Principal Components Analysis