Machine Learning Cheat Sheet


require(reticulate)

1 Statistics

1.1 Descriptive Statistics

Descriptive statistics is a branch of statistics that deals with the collection, presentation, and description of data. It is used to summarize and describe the characteristics of a dataset. In simple terms, descriptive statistics is used to describe the data.
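
A quick way to get the usual descriptive summary (count, mean, standard deviation, quartiles), assuming the data is already loaded in a pandas DataFrame named df (the column names here are illustrative):

import pandas as pd

df = pd.DataFrame({'age': [23, 31, 27, 45], 'salary': [40000, 52000, 48000, 61000]}) # illustrative data
df.describe() # count, mean, std, min, quartiles, and max for each numeric column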

1.2 Inferential Statistics

Inferential statistics is a branch of statistics that deals with making inferences or predictions about a population based on a sample of data. It is used to test hypotheses, make predictions, and draw conclusions about a population based on a sample of data. In simple terms, inferential statistics is used to make predictions about a population based on a sample of data.

2 Data Preprocessing

2.1 Splitting Predictors and Target

Split the dataset into two parts: predictors (independent variables) and target (dependent variable).

X = df.iloc[:, :-1].values # all rows, all columns except the last one (predictors)
y = df.iloc[:, -1].values # all rows, only the last column (target)
or
X = df.drop(columns=['target']) # drop the target column
y = df['target'] # only the target column

2.2 Handling Missing Data

Missing data is a common problem in real-world datasets. There are several ways to handle missing data:

2.2.1 Removing Rows

This method is simple and effective, but it can lead to a loss of valuable information. It is best used when the number of missing values is small (< 5% of the total data).

2.2.2 Imputation

Imputation is the process of replacing missing values with estimated values. This method is simple and effective, but it can lead to biased estimates.

  1. Mean imputation for continuous data
    Mean imputation replaces missing values with the mean of the feature. It is simple and effective, but it can lead to biased estimates. It is best used for continuous data with no outliers.
from sklearn.impute import SimpleImputer # from library.module import class
imputer = SimpleImputer(missing_values=np.nan, strategy='mean') # object = class(parameters)
imputer.fit(X[:, 1:3]) # object.method(parameters); [:, 1:3] means all rows and columns 1 up to (but not including) 3
X[:, 1:3] = imputer.transform(X[:, 1:3]) # replace the selected columns with the imputed values
  2. Median imputation for continuous data with outliers
    Median imputation replaces missing values with the median of the feature, which is more robust to outliers than the mean.
imputer = SimpleImputer(missing_values=np.nan, strategy='median')
imputer.fit(X[:, 1:3])
X[:, 1:3] = imputer.transform(X[:, 1:3])
  3. KNN imputation for both continuous and categorical data
    KNN imputation is a more advanced method that uses the k-nearest neighbors algorithm to estimate missing values. It is more accurate than mean or median imputation, but it can be computationally expensive.
from sklearn.impute import KNNImputer
imputer = KNNImputer(n_neighbors=5) # n_neighbors is the number of similar data points used to estimate each missing value
X = imputer.fit_transform(X)
  4. Mode imputation for categorical data
    Mode imputation replaces missing values with the mode (most frequent value) of the feature. It is simple and effective, but it can lead to biased estimates.
imputer = SimpleImputer(missing_values=np.nan, strategy='most_frequent')
imputer.fit(X[:, 1:3])
X[:, 1:3] = imputer.transform(X[:, 1:3])

2.2.3 With Machine Learning Algorithms

Another way to handle missing data is to use machine learning algorithms that can handle missing values. Some algorithms, such as XGBoost and LightGBM, can handle missing values directly.
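
As an illustration (assuming the xgboost package is installed), the scikit-learn wrapper of XGBoost accepts np.nan entries directly and learns a default split direction for missing values, so no imputation step is required:

import numpy as np
from xgboost import XGBRegressor

X = np.array([[1.0, 2.0], [np.nan, 3.0], [4.0, np.nan], [5.0, 6.0]]) # illustrative data with missing values
y = np.array([10.0, 12.0, 14.0, 16.0])
model = XGBRegressor(n_estimators=50)
model.fit(X, y) # trains without any explicit handling of the NaN entries
model.predict(np.array([[np.nan, 2.5]]))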

2.3 Encoding Categorical Data

Categorical data is data that represents categories. There are two main ways to encode categorical data:

2.3.1 Label Encoding

Label encoding is a method used to convert categorical data into numerical data. Each category is assigned a unique integer value. This method is simple and effective, but it imposes an artificial order on the categories, which can mislead models when the data is not ordinal.

Label encoding is best used for ordinal data, where the categories have a natural order, e.g. Low, Medium, High to 1, 2, 3.

from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
y = le.fit_transform(y)

2.3.2 One-Hot Encoding

One-hot encoding is a method used to convert categorical data into numerical data. Each category is represented as a binary vector, where each element corresponds to a category. This method is simple and effective, but it can lead to high-dimensional data.

One-hot encoding is best used for nominal data, where the categories do not have a natural order. e.g. Red, Green, Blue to Color_Red, Color_Green, Color_Blue.

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
ct = ColumnTransformer(transformers=[('encoder', OneHotEncoder(), [0])], remainder='passthrough') # 0 is the column index to be encoded
X = ct.fit_transform(X)

2.4 Train-Test Split

Split the dataset into four parts:

  1. X_train: Independent variables for training
  2. y_train: Dependent variable for training
  3. X_test: Independent variables for testing
  4. y_test: Dependent variable for testing

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

2.5 Feature Scaling

Feature scaling is a method used to normalize the range of independent variables or features of data.

Feature scaling is done after splitting the dataset into the training and test sets. The rationale is that the test set represents unseen future data, so using it to compute the scaling parameters would cause data leakage. We fit the scaler on X_train only, then apply the same transformation to X_test.

2.5.1 Avoid Feature Scaling On

  • Decision Trees
  • Random Forest
  • XGBoost
  • LightGBM

2.5.2 When to Apply

  • Avoid on dummy variables from one-hot encoding
  • Avoid on ordinal categorical variables (e.g. 1, 2, 3)
  • Avoid on nominal/binary categorical variables (0 and 1)
  • Use when the dependent variable takes very high values with respect to the other features

2.5.3 Min-Max Scaling / Normalization

  • Min-Max scaling is a simple scaling method that rescales the data to a range of 0 to 1.
  • This is done by subtracting the minimum value of the feature and then dividing by the range of the feature, so the result lies in [0, 1].
  • Works best when the features are roughly normally distributed (see the sketch below). \[ X' = \frac{X - X_{min}}{X_{max} - X_{min}} \]
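
A minimal sketch with scikit-learn's MinMaxScaler, following the same fit-on-train / transform-test pattern as the standardization code below (the column slice 3: is illustrative):

from sklearn.preprocessing import MinMaxScaler
mm = MinMaxScaler() # default feature_range is (0, 1)
X_train[:, 3:] = mm.fit_transform(X_train[:, 3:]) # fit the min/max on the training set only
X_test[:, 3:] = mm.transform(X_test[:, 3:]) # reuse the same min/max on the test set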

2.5.4 Standardization / Standard Scaling / Z-Score Normalization

  • This is done by subtracting the mean of the feature and then dividing by the standard deviation of the feature.
  • Results mostly fall within [-3, +3]; values outside this range are likely outliers.
  • Generally works every time; when in doubt, go with this.
  • Do not apply it to the dummy variables.
  • Expects a 2D array as input. \[ X' = \frac{X - \mu}{\sigma} \quad \text{or} \quad X' = \frac{X - \text{mean of } X}{\text{std dev of } X} \]
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train[:, 3:] = sc.fit_transform(X_train[:, 3:]) # X_train gets both fit and transform
X_test[:, 3:] = sc.transform(X_test[:, 3:]) # X_test only gets transform

3 Supervised Learning

Supervised Learning is the machine learning approach defined by its use of labeled datasets to train algorithms to classify data and predict outcomes.

Supervised Learning Framework

The two most common methods of supervised learning are regression and classification.

Types of Supervised Learning

3.1 Regression / Prediction

Regression is related to continuous data (value functions). In Regression, the predicted output values are real numbers. It deals with problems such as predicting the price of a house or the trend in the stock price at a given time, etc.

3.1.1 Linear Regression Models

3.1.1.1 Simple Linear Regression

Linear Regression is used to model the linear relationship between two variables. The equation is: \[ \hat{y} = b_0 + b_1 \ast X_1 \] where:

  • \(\hat{y}\) is the predicted value
  • \(b{_0}\) is the intercept
  • \(b{_1}\) is the slope
  • \(X{_1}\) is the independent variable

Determining the best fit line is done by minimizing the sum of the squared differences between the observed values and the predicted values. This is called the Ordinary Least Squares method. \[ \text{min} \sum_{i=1}^{n} (y_i - \hat{y_i})^2 \]

from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(X_train, y_train)
y_pred = regressor.predict(X_test)

To show the coefficient and intercept:

print(regressor.coef_) # slope
print(regressor.intercept_) # intercept
3.1.1.1.1 Assumptions of Linear Regression

Anscombe’s Quartet is a set of four datasets that have nearly identical simple descriptive statistics, yet appear very different when graphed. This emphasizes the importance of graphing data before analyzing it and the effect of outliers on the regression line.

Assumptions of Linear Regression
  1. Linearity: There is a linear relationship between Y and each X.
  2. Homoscedasticity: Equal variance of the residuals. Avoid a decreasing or increasing cone-shaped scatter.
  3. Multivariate Normality: The residuals (errors) should be normally distributed.
  4. Independence: Observations should be independent, i.e. no autocorrelation, unlike the stock market where today's price (Open) depends on yesterday's (Close).
  5. Lack of Multicollinearity: Independent variables / predictors should not be correlated with each other, otherwise the coefficient estimates would be unreliable.
  6. Outlier Check: Outliers can have a large influence on the regression line. It is important to detect and handle them.

3.1.1.2 Multiple Linear Regression

3.1.1.2.1 Multiple Linear Regression Intuition

Used to predict linear correlation between multiple variables. Equation is: \[ \hat{y} = b{_0} + b{_1} \ast X{_1} + b{_2} \ast X{_2} + \ldots + b{_n} \ast X{_n} \] where:

  • \(\hat{y}\) is the predicted value
  • \(b{_0}\) is the intercept
  • \(b{_1}\), \(b{_2}\), \(\ldots\), \(b{_n}\) are the coefficients
  • \(X{_1}\), \(X{_2}\), \(\ldots\), \(X{_n}\) are the independent variables
3.1.1.2.2 Dummy Variable Trap

In manual calculation, do not convert all categories to dummy variables. Include only k-1 dummy variables to avoid the dummy variable trap: if there are 2 cities, use only 1 dummy variable; for 3 cities, use 2 dummy variables.
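
A minimal sketch with pandas, assuming the DataFrame has a categorical column named 'city' (the column name and values are illustrative); drop_first=True keeps k-1 dummies:

import pandas as pd

df = pd.DataFrame({'city': ['NY', 'LA', 'SF', 'NY']}) # illustrative data
dummies = pd.get_dummies(df['city'], drop_first=True) # k-1 dummy columns, avoiding the trap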

3.1.1.2.3 Variables Selection

In building a model, we need to select the variables that have the most impact on the dependent variable, because:

  1. Garbage In, Garbage Out (GIGO). Meaning if the data is bad, the model will be bad.
  2. Need to explain the model to the stakeholders.

5 methods to select variables:

  1. All-in Use when you have prior knowledge, you need to use all variables, or preparing for backward elimination
  2. Backward Elimination (Reduce the number of variables)
    1. Select a significance level to stay in the model (e.g. SL = 0.05)
    2. Fit the full model with all possible predictors
    3. Select the predictor with the highest P-value / insignificant. If P > SL, go to step 4, otherwise finish.
    4. Remove the predictor
    5. Fit the model without this variable
    6. Repeat step 3-5 until all P < SL
  3. Forward Selection (Add the number of variables)
    1. Select a significance level to enter the model (e.g. SL = 0.05)
    2. Fit all simple regression models y ~ Xn. Select the one with the lowest P-value.
    3. Keep this variable and fit all possible models with one extra predictor added to the one(s) you already have.
    4. Consider the predictor with the lowest P-value. If P < SL, go to step 3, otherwise finish.
  4. Bidirectional Elimination (Combination of Backward and Forward)
    1. Select a significance level to enter and to stay in the model (e.g. SLENTER = 0.05, SLSTAY = 0.05)
    2. Perform the next step of Forward Selection (new variables must have P < SLENTER to enter)
    3. Perform all steps of Backward Elimination (old variables must have P < SLSTAY to stay)
    4. No new variables can enter and no old variables can exit
  5. Score Comparison (All possible models)
    1. Select a criterion of goodness of fit (e.g. AIC Akaike criterion)
    2. Construct all possible regression models: 2^n - 1 total combinations
    3. Select the one with the best criterion

Steps 2-3-4 are usually called Stepwise Regression, but sometimes only step 4 is called Stepwise Regression. A minimal backward-elimination sketch in Python follows.
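
A minimal backward-elimination sketch using statsmodels (assumptions: X is a NumPy array of encoded predictors, y is the target, and the significance level SL is 0.05):

import numpy as np
import statsmodels.api as sm

X_opt = sm.add_constant(X) # add the intercept column b0
cols = list(range(X_opt.shape[1])) # indices of the predictors still in the model
while True:
    model = sm.OLS(y, X_opt[:, cols]).fit() # fit the model with the remaining predictors
    worst = int(np.argmax(model.pvalues)) # predictor with the highest P-value
    if model.pvalues[worst] > 0.05: # if P > SL, remove it and refit
        del cols[worst]
    else:
        break # all remaining predictors have P < SL
print(model.summary())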

3.1.1.2.4 Multiple Linear Regression Python
from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(X_train, y_train)
y_pred = regressor.predict(X_test)

But in Python the scikit-learn library takes care of the dummy variable trap and of fitting the coefficients, so manual variable selection is not required just to make predictions.

3.1.2 Non-Linear Regression Models

3.1.2.1 Polynomial Regression

3.1.2.1.1 Polynomial Regression Intuition

Polynomial Regression is used to model non-linear relationships. The equation is: \[ \hat{y} = b_0 + b_1 \ast X_1 + b_2 \ast X_1^2 + \ldots + b_n \ast X_1^n \] where:

  • \(\hat{y}\) is the predicted value
  • \(b{_0}\) is the intercept
  • \(b{_1}\), \(b{_2}\), \(\ldots\), \(b{_n}\) are the coefficients
  • \(X{_1}\) is the independent variable
3.1.2.1.2 Polynomial Regression Python

To create polynomial regression in Python, we need to combine class of PolynomialFeatures and LinearRegression.

We use polynomialfeatures to transform the original data into polynomial data. Then, we use linear regression to fit the polynomial data.

from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
poly_reg = PolynomialFeatures(degree=2) # Start with 2 then go up.
X_poly = poly_reg.fit_transform(X)
lin_reg_2 = LinearRegression() # Assuming lin_reg is already used for linear regression
lin_reg_2.fit(X_poly, y)
lin_reg_2.predict(poly_reg.fit_transform([[6.5]])) #6.5 is example of value to predict

3.1.2.2 Support Vector Regression (SVR)

3.1.2.2.1 Support Vector Regression Intuition

Linear SVR Intuition SVR can model both linear and non-linear relationships; its variants cover Linear, Polynomial, and RBF kernels. In Linear Regression, we use OLS to determine the best-fitting line and calculate the error. In Linear SVR, the line has a margin of error called the “\(\varepsilon\)-Insensitive Tube”.

  • The data points inside the tube (the yellow area) are ignored and not counted as errors. In the 2-dimensional plane of X and Y these values are simply called vectors.
  • The data points outside the tube are counted as errors, measured as the distance from the data point to the edge of the tube. These values are called support vectors because they dictate how the tube is formed.

Notation:

  • The value below the tube is \(\xi{_i}^\ast\)
  • The value above the tube is \(\xi{_i}\)
  • The distance between the regression line and the tube is \(\varepsilon\).

Formula to minimize error is: \[ \text{min} \frac{1}{2} \lVert w \rVert^2 + C \sum_{i=1}^{n} (\xi{_i} + \xi{_i}^\ast) \] where:

  • \(w\) is the weight
  • \(C\) is the penalty parameter
  • \(\xi{_i}\) is the error above the tube
  • \(\xi{_i}^\ast\) is the error below the tube
3.1.2.2.2 Support Vector Regression Python

SVR models an implicit relationship between the variables, so we must apply feature scaling, unlike explicit-equation models such as linear, multiple, or polynomial regression.

  1. Both X and y must be 2D arrays before scaling.
# In this case, assume X is already a 2D array
y = y.reshape(len(y), 1) # reshape y into a 2D array with len(y) rows and 1 column, e.g. 10 rows and 1 column
  2. Scale X and y separately so that their means and standard deviations are not mixed up. If y is a single column, using separate scalers may yield the same result as using one scaler.
from sklearn.preprocessing import StandardScaler
sc_X = StandardScaler()
sc_y = StandardScaler()
X = sc_X.fit_transform(X)
y = sc_y.fit_transform(y)
  3. The dataset is not split into training and test sets in this example; the whole (small) dataset is used to fit the SVR.
from sklearn.svm import SVR
regressor = SVR(kernel='rbf') # rbf is the most common kernel
regressor.fit(X, y.ravel()) # ravel() flattens y back to 1D, as expected by fit
  4. To predict a value, we need to invert the scaling.
y_pred = sc_y.inverse_transform(regressor.predict(sc_X.transform([[6.5]])).reshape(-1, 1)) # 6.5 is an example value to predict

Explanation
1. Transform the value to be predicted. Result in the scaled 6.5 value from dataset ranging 1-10.
sc_X.transform([[6.5]])
Output: array([[0.34815531]])

2. Predict the value. Result in the scaled value of the prediction result
regressor.predict(sc_X.transform([[6.5]]))
Output: array([-0.27861589])

3. Reshape the predicted value to 2D array
regressor.predict(sc_X.transform([[6.5]])).reshape(-1,1)
Output: array([[-0.27861589]])

4. Inverse the scaling to get the actual value
sc_y.inverse_transform(regressor.predict(sc_X.transform([[6.5]])).reshape(-1,1))
Output: array([[170370.0204065]])

3.1.2.3 Decision Tree Regression

3.1.2.3.1 Decision Tree Regression Intuition

Used to model non-linear relationships. We recursively split the data into different regions until we are left with pure leaf nodes, meaning the data in each leaf is homogeneous. The best split is the one with the highest Variance Reduction (not the same as variance): high variance reduction / low variance = low impurity. Suitable for datasets with many features.

Finding the best split:

  1. First we calculate the variance of the dependent variable in the parent node: \[ \text{Var} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \bar{y})^2 \] where:
  • \(y_i\) is the dependent variable
  • \(\bar{y}\) is the average of the dependent variable

DTR Step 1
  2. Assume we have two candidate splits (e.g. \(x_0 \leq 1\), \(x_1 \leq 2\)) and calculate the variance of each resulting child node as well:

DTR Step 2
  3. Compare the variance reduction of each split to find the best split, i.e. the one with the highest reduction: \[ \text{Var Reduction} = Var(parent) - \sum w_i \, Var(child_i) \] where:
  • \(w_i\) is the weight of the child node
  • \(Var(parent)\) is the variance of the parent node
  • \(Var(child_i)\) is the variance of the child node

DTR Step 3

We can see that the split \(x_0 \leq 1\) has the highest variance reduction, 3929.31, compared to 63.64 for \(x_1 \leq 2\), so we choose the first split.

  4. Repeat the process until we have pure leaf nodes.
  5. A new data point is predicted as the average of the dependent variable in the leaf node it falls into.

DTR Step 5

Imagine a point in the graph \((16, -2)\), we will follow the decision tree to find the leaf node and predict the value using the mean of the dependent variable in the leaf node

3.1.2.3.2 Decision Tree Regression Python

No need to apply Feature Scaling on DTR.

from sklearn.tree import DecisionTreeRegressor
regressor = DecisionTreeRegressor(random_state=0)
regressor.fit(X, y)
regressor.predict([[6.5]]) #6.5 is example of value to predict

3.1.2.4 Random Forest Regression

3.1.2.4.1 Random Forest Regression Intuition

Used to model non-linear relationships. Random Forest is part of Ensemble Learning: it combines multiple decision trees to create a more accurate and stable model. How it works:

  1. Pick at random K data points from the training set, e.g. if the total training data is 100, pick 25 data points.
  2. Build the Decision Tree associated to these K data points
  3. Choose the number of trees you want to build and repeat steps 1 and 2. Running multiple trees will give you multiple predictions.
  4. For every new data point, run the prediction through all the produced trees and average the result from multiple trees. e.g. 500 trees, 500 predictions, average out the 500 predictions.
3.1.2.4.2 Random Forest Regression Python

No need to apply Feature Scaling on Random Forest.

from sklearn.ensemble import RandomForestRegressor
regressor = RandomForestRegressor(n_estimators=10, random_state=0) # n_estimators is the number of trees
regressor.fit(X, y)
regressor.predict([[6.5]]) #6.5 is example of value to predict

3.1.2.5 Lasso Regression

3.1.2.6 Ridge Regression

3.1.2.7 Elastic Net Regression
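
Lasso, Ridge, and Elastic Net are regularized variants of linear regression: Lasso adds an L1 penalty (which can shrink coefficients exactly to zero), Ridge adds an L2 penalty, and Elastic Net mixes both. A minimal sketch of their scikit-learn classes (the alpha and l1_ratio values are illustrative):

from sklearn.linear_model import Lasso, Ridge, ElasticNet
lasso = Lasso(alpha=0.1) # L1 penalty: can zero out coefficients (implicit feature selection)
ridge = Ridge(alpha=1.0) # L2 penalty: shrinks coefficients toward zero
elastic = ElasticNet(alpha=0.1, l1_ratio=0.5) # mix of L1 and L2 penalties
for model in (lasso, ridge, elastic):
    model.fit(X_train, y_train)
    print(model.score(X_test, y_test)) # R-squared on the test set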

3.1.3 Measuring Regression Performance

There are several ways to measure the performance of a regression model. Some of the most common metrics are:

3.1.3.1 R-Squared

3.1.3.1.1 R-Squared Intuition

R-Squared

R-squared is a statistical measure of how close the data are to the fitted regression line. It is also known as the Coefficient of Determination or Goodness of Fit. R-squared usually lies between 0 and 1 (it can be negative when the model fits worse than the mean):

  • 0 indicates that the model explains none of the variability of the response data around its mean.

  • 1 indicates that the model explains all the variability of the response data around its mean. \[ R^2 = 1 - \frac{SS_{res}}{SS_{tot}} \] (the numerator \(SS_{res}\) is usually low, the denominator \(SS_{tot}\) usually high.) Where:

  • \(SS_{res}\) is the Sum of Squares Residual / \(RSS\) is the Residual Sum of Squares (same formula as \(OLS\)). Comparing our regression line to the actual data points. The lower the better. \[ SS_{res} = \sum_{i=1}^{n} (y_i - \hat{y_i})^2 \] Where:

    • \(y_i\) is the actual value
    • \(\hat{y_i}\) is the predicted value
  • \(SS_{tot}\) or \(TSS\) is the Total Sum of Squares. Comparing an average number to the actual data points. The higher the better. \[ SS_{tot} = \sum_{i=1}^{n} (y_i - \bar{y})^2 \] Where:

    • \(y_i\) is the actual value
    • \(\bar{y}\) is the average of the actual value

\(SS_{tot}\) will be higher than \(SS_{res}\) because it only compares the data to a rudimentary mean line.

Range of R-Squared:

  1. 1.0 = Perfect model (suspicious)
  2. 0.9 - 0.99 = Excellent model
  3. < 0.7 = Not great
  4. < 0.4 = Poor model
  5. < 0 = Model is worse than the mean line
3.1.3.1.2 R-Squared Narrative

The \(R^2\) value of our regression model is 0.85. This means that 85% of the variance in the dependent variable can be explained by the independent variables in the model. In other words, our model provides a good fit to the data, explaining a significant portion of the variability.

3.1.3.1.3 R-Squared Python
from sklearn.metrics import r2_score
r2_score(y_test, y_pred)

3.1.3.2 Adjusted R-Squared (The Most Important)

3.1.3.2.1 Adjusted R-Squared Intuition

Adj R-Squared

Problem with R-Squared is R-Squared can be artificially inflated by adding more independent variables to the model:

  1. Imagine we add another variable: \(\hat{y} = b_0 + b_1x_1 + b_2x_2 + b_3x_3\)
  2. Then \(SS_{tot}\) does not change,
    but \(SS_{res}\) will decrease or stay the same,
    and \(R^2\) will increase.
  3. The reason is that the \(SS_{res}\) (OLS) formula will find any line that minimizes the distance between the line and the data points (minimizes the error), and one way to achieve that is to pick a \(b_3\) coefficient that increases \(R^2\).
  4. However, if OLS cannot find a useful \(b_3\), the formula will simply set \(b_3\) to 0 and \(R^2\) will stay the same as before.
  5. We could end up with models that have lots of variables and a high \(R^2\) but are not good predictors.

Hence we use Adjusted R-Squared, a modified version of R-Squared that adjusts for the number of predictors in the model by penalizing the addition of unnecessary variables. \[ Adjusted \: R^2 = 1 - (1 - R^2) \ast \frac{n - 1}{n - k - 1} \] Where:

  • \(n\) is the number of observations
  • \(k\) (sometimes written \(p\)) is the number of predictors / independent variables
3.1.3.2.2 Adjusted R-Squared Narrative

The \(Adjusted \: R^2\) value of our regression model is 0.82. This value accounts for the number of predictors and indicates that 82% of the variance in the dependent variable is explained by the model, considering the number of predictors. This adjustment suggests our model is well fitted without overfitting by including too many predictors.

With \(R^2\) of 85% and \(Adjusted \: R^2\) of 82%, this suggests that while our model fits the data well, the predictive power is slightly lower when adjusted for the number of predictors, indicating that some predictors may not significantly improve the model

3.1.3.2.3 Adjusted R-Squared Python
from sklearn.metrics import r2_score

def adj_r2_score(y_test, y_pred, X_test):
    n = len(y_test) # number of observations
    k = X_test.shape[1] # number of predictors
    r2 = r2_score(y_test, y_pred)
    adj_r2 = 1 - (1 - r2) * (n - 1) / (n - k - 1)
    return adj_r2

3.1.3.3 Mean Absolute Error (MAE)

3.1.3.3.1 MAE Intuition

Measures the average magnitude of the errors in a set of predictions, without considering their direction. The lower the better. \[ MAE = \frac{1}{n} \sum_{i=1}^{n} \lvert y_i - \hat{y_i} \rvert \] Where:

  • \(y_i\) is the actual value
  • \(\hat{y_i}\) is the predicted value
3.1.3.3.2 MAE Narrative

If the MAE is 9, it means that, on average, the predictions are off by 9 units (e.g., $9,000 if you’re predicting house prices).

3.1.3.3.3 MAE Python
from sklearn.metrics import mean_absolute_error
mean_absolute_error(y_test, y_pred)

3.1.3.4 Mean Squared Error (MSE)

3.1.3.4.1 MSE Intuition

Measures the average of the squares of the errors. The lower the better. \[ MSE = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y_i})^2 \] Where:

  • \(y_i\) is the actual value
  • \(\hat{y_i}\) is the predicted value
3.1.3.4.2 MSE Narrative

If the MSE is 85, it means that the average of the squared differences between the predicted and actual values is 85 squared units (e.g., squared dollars if you’re predicting house prices). By squaring the errors, the MSE penalizes larger errors more than smaller ones.

3.1.3.4.3 MSE Python
from sklearn.metrics import mean_squared_error
mean_squared_error(y_test, y_pred)

3.1.3.5 Root Mean Squared Error (RMSE)

3.1.3.5.1 RMSE Intuition

Measures the square root of the average of the squares of the errors. The lower the better. RMSE is more sensitive to large errors than MAE. \[ RMSE = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y_i})^2} \] Where:

  • \(y_i\) is the actual value
  • \(\hat{y_i}\) is the predicted value
3.1.3.5.2 RMSE Narrative

If the RMSE is 9.22, it means that, on average, the predictions are off by about 9.22 units (e.g., $9,220 if you’re predicting house prices). It is expressed in the same unit as MAE but penalizes larger errors more than smaller ones. Better to use when large errors are particularly undesirable.

3.1.3.5.3 RMSE Python
from sklearn.metrics import mean_squared_error
import numpy as np
np.sqrt(mean_squared_error(y_test, y_pred))

3.1.3.6 Mean Absolute Percentage Error (MAPE)

3.1.3.6.1 MAPE Intuition

Measures the average of the absolute percentage errors. The lower the better. Useful for understanding prediction error in percentage terms. \[ MAPE = \frac{100\%}{n} \sum_{i=1}^{n} \left\lvert \frac{y_i - \hat{y_i}}{y_i} \right\rvert \] Where:

  • \(y_i\) is the actual value
  • \(\hat{y_i}\) is the predicted value
3.1.3.6.2 MAPE Narrative

If the MAPE is 5.4%, it means that, on average, the predictions are off by 5.4% from the actual values.

3.1.3.6.3 MAPE Python
import numpy as np

def mean_absolute_percentage_error(y_test, y_pred):
    return np.mean(np.abs((y_test - y_pred) / y_test)) * 100
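
Recent versions of scikit-learn (0.24+) also ship this metric directly; note that it returns a fraction rather than a percentage:

from sklearn.metrics import mean_absolute_percentage_error
mean_absolute_percentage_error(y_test, y_pred) # e.g. 0.054 means 5.4%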

3.2 Classification

A machine learning technique to identify the category of new observations. This could be things like trying to predict what objects are present in an image (a cat/ a dog) or whether it is going to rain today or not.

3.2.1 Types of Classification Decision Boundary

  1. Linear Classification: The decision boundary is linear. e.g. Logistic Regression, SVM with Linear Kernel, Naive Bayes. Meaning the output is a straight line.
  2. Non-Linear Classification: The decision boundary is non-linear. e.g. K-NN, SVM with Non-Linear Kernel, Decision Tree, Random Forest, and Neural Networks. Meaning the output is a curve or a more complex shape.

3.2.2 Types of Classification Output

  1. Binary Classification: The output is a binary value (0 or 1, True or False, Yes or No). Output is both linear and non-linear decision boundaries.
  2. Multiclass Classification: The output is a value from a set of more than two distinct possible values (e.g. Handwritten recognition, species classification). Output is both linear and non-linear decision boundaries.

3.2.3 Logistic Regression

3.2.3.1 Logistic Regression Intuition

Logistic Regression

Predicts a categorical dependent variable from a number of independent variables. The answer is binary (0 or 1, True or False, Yes or No). The output is the probability that the given input point belongs to a certain class, which is then converted to a binary value using a threshold, e.g. with a 50% threshold a probability of 0.7 becomes 1 and 0.3 becomes 0. The curve is called the Logistic Regression Curve or Sigmoid Curve.

\[ ln \frac{p}{1 - p} = b_{0} + b_{1} \cdot X_{1} + b_{2} \cdot X_{2} + \ldots + b_{n} \cdot X_{n} \] Where:

  • \(p\) is the probability of the dependent variable
  • \(ln\) is the natural logarithm
  • \(X_{1}, X_{2}, \ldots, X_{n}\) are the independent variables
  • \(b_{0}, b_{1}, b_{2}, \ldots, b_{n}\) are the coefficients

Determining the best-fit curve is done using Maximum Likelihood Estimation (MLE): Maximum Likelihood

  1. Start by plotting our data points into the graph.
  2. Draw a curve that represents the probability of the data points.
  3. Calculate the probability of each data point based on the curve. e.g
    Taking up the offer of 0.03, 0.54, 0.92, 0.95, 0.98
    Refusing the offer of (1-0.01), (1-0.04), (1-0.10), (1-0.58), (1-0.96)
  4. Multiply all the probabilities together to get the likelihood of the data points. e.g. 0.000019939
  5. Draw multiple curves and calculate the likelihood of each curve. Select the curve with the highest likelihood.

3.2.3.2 Logistic Regression Python

Logistic Regression must be scaled because the algorithm is based on the logistic function meaning the output is a probability between 0 and 1. If the input is not scaled, the output will be skewed.

from sklearn.linear_model import LogisticRegression
classifier = LogisticRegression(random_state=0)
classifier.fit(X_train, y_train)
y_pred = classifier.predict(X_test) # to get the class prediction (0 / 1), or
y_prob = classifier.predict_proba(X_test) # to get the class probabilities (e.g. [0.3, 0.7])

3.2.4 K-Nearest Neighbors (K-NN)

3.2.4.1 K-NN Intuition

K-NN Intuition

K-NN is a type of instance-based learning, or lazy learning, where the function is only approximated locally and all computation is deferred until function evaluation. The K-NN algorithm assumes that similar things exist in close proximity. In other words, similar things are near to each other.

Steps for K-NN:

  1. Choose the number K of neighbors KNN Step 1

  2. Take the K nearest neighbors of the new data point, according to the Euclidean distance, e.g. 5 neighbors. The nearest neighbours are found using the Euclidean distance: \[ d = \sqrt{(x_2 - x_1)^2 + (y_2 - y_1)^2} \] Where:

    • \(d\) is the distance
    • \(x_1\) and \(y_1\) are the coordinates of the first point
    • \(x_2\) and \(y_2\) are the coordinates of the second point KNN Step 2 Euclidean Distance
  3. Among these K neighbors, count the number of data points in each category. e.g. in 5 neighbors, 3 neighbors are red and 2 neighbors are green. KNN Step 3

  4. Assign the new data point to the category where you counted the most neighbors e.g. red. KNN Step 4

3.2.4.2 K-NN Python

K-NN must be scaled because the algorithm is based on the Euclidean distance. If the input is not scaled, the output will be skewed.

from sklearn.neighbors import KNeighborsClassifier
classifier = KNeighborsClassifier(n_neighbors=5, metric='minkowski', p=2) # minkowski & p=2 interprets as Euclidean distance
classifier.fit(X_train, y_train)
y_pred = classifier.predict(X_test)

3.2.5 Support Vector Machine (SVM)

3.2.5.1 SVM Intuition

SVM Intuition-1

SVM Intuition-2

Assumes the data are linearly separable and the algorithm will find the best hyperplane that separates the data points. SVM is a risky type of algorithm because it looks at very extreme cases close to the boundary and uses them to construct its analysis.

  • The maximum margin is the distance between the hyperplane and the nearest data point from either set. The goal is to choose a hyperplane with the greatest possible margin between the hyperplane and any point within the training set, giving a greater chance of new data points being classified correctly.
  • Support vectors refers to the points that contribute to the algorithm.

3.2.5.2 SVM Python

SVM must be scaled because the algorithm is based on the Euclidean distance. If the input is not scaled, the output will be skewed.

from sklearn.svm import SVC
classifier = SVC(kernel='linear', random_state=0) # kernel can be linear, poly, rbf, sigmoid, precomputed
classifier.fit(X_train, y_train)
y_pred = classifier.predict(X_test)

or

from sklearn.svm import LinearSVC
classifier = LinearSVC(random_state=0)
classifier.fit(X_train, y_train)
y_pred = classifier.predict(X_test)

3.2.6 Kernel SVM

3.2.6.1 Kernel SVM Intuition

Kernel SVM is used to classify non-linear data. This assumes the data are not linearly separable.

Kernel SVM Intuition-1

Example on transforming to higher dimension data:

  1. Assume 1 dimension data is not linearly separable. We need to add an extra dimension to make it linearly separable. Kernel SVM Intuition-2

  2. Create function \(f = x - 5\), then \(f = (x- 5)^2\), and project the data into 2D. Kernel SVM Intuition-3

  3. Then we can draw a linear line to separate the data. Kernel SVM Intuition-1

Separator for each dimension

  • 1D: Point
  • 2D: Line
  • 3D: Plane
  • 4D and above: Hyperplane

This process of mapping to a higher dimension is resource intensive. That is why we use the kernel trick, which performs a similar operation and produces a similar result without actually transforming the data. The most common kernel functions are:

Types of Kernel Functions
  1. Linear Kernel: Suitable for linear separable data. Formula: \[ K(\vec{x}, \vec{l}) = \vec{x} \cdot \vec{l} \] where:

    • \(\vec{x}\) and \(\vec{l}\) are the data points
  2. Polynomial Kernel: Suitable for non-linear separable data. Formula: \[ K(\vec{x}, \vec{l}) = (\vec{x} \cdot \vec{l} + c)^d \] where:

    • \(\vec{x}\) and \(\vec{l}\) are the data points
    • \(c\) is a constant term
    • \(d\) is the degree of the polynomial.
  3. Gaussian RBF Kernel: Suitable for non-linear separable data. The most common kernel. Formula: \[ K(\vec{x}, \vec{l}) = e^{-\frac{\|\vec{x} - \vec{l}\|^2}{2\sigma^2}} \] where:

    • \(K(\vec{x}, \vec{l})\) is the similarity (kernel) between the data points \(\vec{x}\) and \(\vec{l}\)
    • \(\vec{x}\) is the data point
    • \(\vec{l}\) is the landmark
    • \(\sigma\) is the standard deviation
  4. Sigmoid Kernel: Suitable for non-linear separable data. Formula: \[ K(\vec{x}, \vec{l}) = \tanh(\alpha \vec{x} \cdot \vec{l} + \beta) \] where:

    • \(\vec{x}\) and \(\vec{l}\) are vectors
    • \(\alpha\) and \(\beta\) are parameters of the sigmoid function
    • \(\tanh\) denotes the hyperbolic tangent function.

3.2.6.2 Gaussian RBF Kernel

The most common kernel function used in SVM. The Gaussian RBF kernel is a similarity function that measures the “distance” between a pair of examples, by computing the Euclidean distance between them after they have been mapped into an infinite-dimensional space

  1. Visualization of the Gaussian RBF Kernel function of a specific landmark and specific sigma

    RBF Kernel

    Where \(\vec{x}\) is the data point, \(\vec{l}\) is the landmark, \(\sigma\) is the standard deviation.

    The center of the Gaussian RBF kernel is the landmark, i.e. the point where the value of the kernel function is highest, located at (0, 0). Imagine 2 conditions:

    1. The red data point is far from the landmark (center):
      • Far from the landmark = large distance \(\lVert\vec{x} - \vec{l}\rVert\)
      • Squaring it gives an even bigger number
      • Dividing by \(2\sigma^2\), the result is still a big number
      • Exponentiating gives a small number: \(e^{-(\text{large number})} \approx 0\)
      • A small value of the kernel function K means the data point is far from the landmark
    2. The green data point is close to the landmark (center):
      • Close to the landmark = small distance \(\lVert\vec{x} - \vec{l}\rVert\)
      • Squaring it gives an even smaller number
      • Dividing by \(2\sigma^2\), the result is still a small number
      • Exponentiating gives a number close to 1: \(e^{-(\text{small number})} \approx 1\)
      • A large value of the kernel function K means the data point is close to the landmark
  2. Draw a circumference around the kernel function with a width of \(\sigma\) and project this into the data points. The circumference is the decision boundary. The data points inside the circumference are classified as 1 and outside as 0.

    Kernel Circumference

    The bigger the \(\sigma\), the wider the circumference and the smoother the decision boundary. The smaller the \(\sigma\), the narrower the circumference and the more jagged the decision boundary.

High \(\sigma\)

Small \(\sigma\)
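
As a quick numeric check of the formula above, here is a small NumPy helper (the function name rbf_kernel is illustrative, not a library call):

import numpy as np

def rbf_kernel(x, l, sigma=1.0):
    # K(x, l) = exp(-||x - l||^2 / (2 * sigma^2))
    diff = np.asarray(x, dtype=float) - np.asarray(l, dtype=float)
    return np.exp(-np.dot(diff, diff) / (2 * sigma ** 2))

rbf_kernel([0.1, 0.2], [0.0, 0.0]) # close to the landmark -> value close to 1
rbf_kernel([3.0, 4.0], [0.0, 0.0]) # far from the landmark -> value close to 0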

3.2.6.3 Non-Linear SVR

The same concept as Kernel SVM but used for regression. The most common kernel is the Gaussian RBF kernel. The steps:

  1. For example we have a dataset that is not linearly separable. We need to add an extra dimension to make it linearly separable. Non-Linear SVR 1

  2. The Gaussian RBF kernel is used to transform the data into a higher dimension. The kernel function is used to calculate the similarity between the data points with the landmark. Non-Linear SVR 2

  3. Fit a hyperplane to the data points. The hyperplane fits the region where the data points are located. Non-Linear SVR 3

  4. Draw a line where the hyperplane intersects with the Gaussian RBF kernel. Non-Linear SVR 4

  5. Add the support vectors margin to the hyperplane. The support vectors are the data points that are closest to the hyperplane. Non-Linear SVR 5

3.2.6.4 Kernel SVM Python

SVM must be scaled because the algorithm is based on the Euclidean distance. If the input is not scaled, the output will be skewed.

from sklearn.svm import SVC
classifier = SVC(kernel='rbf', random_state=0) # kernel can be linear, poly, rbf, sigmoid, precomputed
classifier.fit(X_train, y_train)
y_pred = classifier.predict(X_test)

3.2.7 Naive Bayes

3.2.7.1 Bayes Intuition

Naive Bayes is a classification technique based on Bayes’ Theorem with an assumption of independence among predictors. In simple terms, a Naive Bayes classifier assumes that the presence of a particular feature in a class is unrelated to the presence of any other feature. For example, a fruit may be considered an apple if it is red, round, and about 3 inches in diameter. Even if these features depend on each other or upon the existence of the other features, all of these properties independently contribute to the probability that this fruit is an apple and that is why it is known as ‘Naive’.

Bayes Theorem itself is a way of finding a probability when we know certain other probabilities. It is mathematically expressed as: \[ P(A|B) = \frac{P(B|A) \cdot P(A)}{P(B)} \] Where:

  • \(P(A|B)\) is the probability of A given B
  • \(P(B|A)\) is the probability of B given A
  • \(P(A)\) is the probability of A
  • \(P(B)\) is the probability of B

In this example, we have the production data of wrenches in 2 machines Bayes Theorem-1

Bayes Theorem-2

Solving the problem: \[ P(Defect|Mach2) = \frac{P(Mach2|Defect) \cdot P(Defect)}{P(Mach2)} = \frac{0.5 \cdot 0.01}{0.4} = 0.0125 \]

0.0125 is the probability that a wrench is defective given that we come up to machine 2 and just randomly pick a wrench, or 125 in 10,000 wrenches.

Alternatively, we could simply calculate it with the following method Bayes Theorem-3

\[ P(Defect|Mach1) = \frac{P(Mach1|Defect) \cdot P(Defect)}{P(Mach1)} = \frac{0.5 \cdot 0.01}{0.6} \approx 0.0083 \]

3.2.7.2 Naive Bayes Intuition

For example, we have a dataset of whether a user walks/drive to the office with features of age and salary

This is the original dataset denoted by Red and Green. New data in Grey. Naive Bayes Intuition

The calculation steps are as follow:

  1. To calculate the probability that the user walks to the office, we calculate from #1 to #4 Naive Bayes Intuition

  2. To calculate the probability that the user drives to the office, we calculate from #1 to #4 Naive Bayes Intuition

  3. Then we compare the probability that the user walks/drives to the office and decide which class the user belongs to. Naive Bayes Intuition

The solution steps are as follow:

  1. Calculate the probability of the new data walks. \[ P(Walks) = \frac{P(X|Walks) \cdot P(Walks)}{P(X)} \]
    1. \({P(Walks)}\) Prior Probability: the probability that a user walks to the office: 10 dots of walkers divided by the total of 30 observations. Naive Bayes Intuition

    2. \({P(X)}\) Marginal Likelihood: Draw a radius of our own judgement; any observation that falls inside this radius is deemed similar to the new data point. \(P(X)\) is the probability of a data point being similar to the new data point: 4 points inside the circle divided by the total of 30 observations. Naive Bayes Intuition

    3. \(P(X|Walks)\) Likelihood: the likelihood that somebody who walks exhibits features X, i.e. the probability that a randomly selected walker from our dataset is similar to the data point we are adding. Naive Bayes Intuition

    4. \(P(Walks|X)\) Posterior Probability: \[ P(Walks|X) = \frac{P(X|Walks) \cdot P(Walks)}{P(X)} = \frac{\frac{3}{10} \cdot \frac{10}{30}}{\frac{4}{30}} = 0.75 \] Naive Bayes Intuition

  2. Calculate the probability of the new data drives. \[ P(Drives) = \frac{P(X|Drives) \cdot P(Drives)}{P(X)} \]
    1. \({P(Drives)}\) Prior Probability: Calculate the probability of the user drives to the office. 20 dots of drivers divided by total observations of 30.
    2. \({P(X)}\) Marginal Likelihood: 4 in circle divided by total observations of 30.
    3. \(P(X|Drives)\) Likelihood: 1 in circle divided by total observations of 20.
    4. \(P(Drives|X)\) Posterior Probability: \[ P(Drives|X) = \frac{P(X|Drives) \cdot P(Drives)}{P(X)} = \frac{\frac{1}{20} \cdot \frac{20}{30}}{\frac{4}{30}} = 0.25 \]
  3. Compare the probability of the user walks/drives to the office then decide which class the user belongs to. \[0.75 > 0.25\] \[P(Walks|X) > P(Drives|X)\] \[\text {The user walks to the office}\] Naive Bayes Intuition

3.2.7.3 Naive Bayes Theorem

  1. Why “Naive”?

    It is called “Naive” because it assumes the variables are independent of each other which oftentimes not correct, hence the name of naive. Despite this, this algorithm is still widely used because it is simple and efficient.

    For example it assumes independence of variables between age and salary. In reality, age and salary are dependent on each other. The older the person, the higher the salary.

  2. \(P(X)\)

    The denominator \(P(X)\) is the same for all classes. It is the probability of the new data point being similar to the surrounding existing data point.

    For example in the comparison formula, \(P(X)\) is the same for both walks and drives. It is the probability of the new data point being similar to the surrounding existing data point. Hence we could remove this from the formula. Naive Bayes Intuition

  3. What happens when there are more than 2 features:

    • The formula is the same, just add more features to the formula.
    • The more features, the more accurate the prediction will be.
    • The more features, the more computationally expensive the algorithm will be.

3.2.7.4 Naive Bayes Python

Naive Bayes does not need feature scaling because the algorithm is based on probabilities; scaling has a neutral to mildly positive/negative effect. But if you are using an ensemble method, then scaling is allowed.

from sklearn.naive_bayes import GaussianNB
classifier = GaussianNB()
classifier.fit(X_train, y_train)
y_pred = classifier.predict(X_test)

3.2.8 Decision Tree Classification

3.2.8.1 Decision Tree Classification Intuition

Decision Tree is a type of model that is used in classification and regression. It works for both categorical and continuous dependent variables. The goal is to create a model that predicts the value of a target variable by learning simple decision rules inferred from the data features.

The tree keeps splitting the data until it reaches a point where it cannot split anymore, resulting in terminal (leaf) nodes. Decision Tree Classification

The steps to create a decision tree are as follows:

  1. Start at the root node and split the data based on the feature that gives the most information gain.
  2. Move to the next node and split the data again based on the feature that gives the most information gain.
  3. Repeat the process until the data cannot be split anymore.
  4. The result is a tree with decision nodes and leaf nodes.
  5. The decision nodes are the features and the leaf nodes are the output.

Decision Tree Classification

3.2.8.2 Decision Tree Classification Python

Decision Tree does not need feature scaling because the algorithm is based on information gain; scaling has a neutral effect. But if you are using an ensemble method, then scaling is allowed.

from sklearn.tree import DecisionTreeClassifier
classifier = DecisionTreeClassifier(criterion='entropy', random_state=0) # criterion can be gini or entropy
classifier.fit(X_train, y_train)
y_pred = classifier.predict(X_test)

3.2.9 Random Forest Classification

3.2.9.1 Random Forest Classification Intuition

Random Forest is an ensemble learning method that operates by constructing a multitude of decision trees at training time and outputting the class that is the mode of the classes (classification) of the individual trees. Random forests correct for decision trees’ habit of overfitting to their training set.

The steps to create a random forest are as follows:

  1. Choose the k number of trees to build, e.g. 10 trees.
  2. Randomly select the data points, e.g. 100 from 1000 data points.
  3. Randomly select the features, e.g. 3 from 10 features.
  4. Build the decision tree.
  5. Repeat steps #2 - #4 until the number of trees is reached.
  6. The result is a forest of decision trees.

3.2.9.2 Random Forest Classification Python

Random Forest does not need feature scaling because the algorithm is based on information gain; scaling has a neutral effect. But if you are using an ensemble method, then scaling is allowed.

from sklearn.ensemble import RandomForestClassifier
classifier = RandomForestClassifier(n_estimators=100, criterion='entropy', random_state=0) # criterion can be gini or entropy
classifier.fit(X_train, y_train)
y_pred = classifier.predict(X_test)

3.2.10 Classification Model Pros & Cons

Classification Pros & Cons

3.2.11 Measuring Classification Performance

3.2.11.1 Confusion Matrix

3.2.11.1.1 Confusion Matrix Intuition

A confusion matrix is a table that is often used to describe the performance of a classification model on a set of test data for which the true values are known. It allows the visualization of the performance of an algorithm. The confusion matrix is a 2x2 table that contains 4 outputs provided by the binary classifier. Each value in the confusion matrix has a specific name: \[ \begin{bmatrix} TP & FP \\ FN & TN \end{bmatrix} \] Confusion Matrix

  • True Positive (TP): we predict a label of 1 (positive), and the true label is 1.

  • False Positive (FP): we predict a label of 1 (positive), and the true label is 0.

    Also known as a “Type 1 Error”. Effectively, we have a false alarm.

    The example of Type 1 error is when we predict that a patient has a disease, but in reality, the patient does not have a disease.

  • False Negative (FN): we predict a label of 0 (negative), and the true label is 1.

    Also known as a “Type 2 Error”. Effectively, we have a miss.

    The example of Type 2 error is when we predict that a patient does not have a disease, but in reality, the patient has a disease.

  • True Negative (TN): we predict a label of 0 (negative), and the true label is 0.

3.2.11.1.2 Confusion Matrix Narrative

Reading the four counts tells you how a model errs: a large number of false positives means frequent false alarms (Type 1 errors), while a large number of false negatives means frequent misses (Type 2 errors). All of the metrics below (accuracy, precision, recall, F1 score) are computed from these four values.

3.2.11.1.3 Confusion Matrix Python
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)

3.2.11.2 Accuracy

3.2.11.2.1 Accuracy Intuition

Accuracy is the most common metric for classification. It is the ratio of the number of correct predictions to the total number of predictions made. The higher the accuracy, the better the model is. \[ Accuracy = \frac{TP + TN}{TP + TN + FP + FN} \]

3.2.11.2.2 Accuracy Narrative

If the accuracy is 0.9, it means that 90% of the predictions are correct.

3.2.11.2.3 Accuracy Python
from sklearn.metrics import accuracy_score
accuracy_score(y_test, y_pred)

3.2.11.3 Cumulative Accuracy Profile (CAP) Curve

3.2.11.3.1 CAP Curve Intuition

The CAP curve is a visual representation of the performance of a classification model. It shows the percentage of total observations (X-axis) and the percentage of correctly classified positive observations (Y-axis). The CAP curve is a way to assess the goodness of fit of a model. The higher the CAP curve, the better the model is. Analysing the CAP Curve:

  1. Imagine we have a dataset of users contacted and users who purchased, and we believe that 10% of contacted users will purchase. Cap Curve

    Blue = randomly contacted users. Contacting 20k results in 2k purchases, 60k = 6k purchases, etc.

    Red = a slight effort to contact users that we already know are likely to purchase. Contacting 20k results in 4.5k purchases, 60k = 9.5k purchases, etc.

  2. Now imagine a highly targeted effort to contact only the users that we know will purchase, maximizing that 10% ratio. Cap Curve

    Grey = knowing that 10% will purchase, we immediately contact 10k of users and 1k will purchase, so the curve reaches its maximum immediately.

  3. Then we compare the 3 graphs. The higher the curve, the better the model is. To quantify the effectiveness of the model, we use the Area Under the Curve (AUC). The higher the AUC, the better the model is. To quantify the graph: Cap Curve

    • Draw the line under the perfect model as Grey and name it \(a_P\) (area under the perfect model)
    • Draw the line under the model as Red and name it \(a_R\) (area under the model’s CAP curve)
    • Calculate with the formula: \[ AR = \frac{a_R}{a_P} \]
      • AR measures of the model’s effectiveness in comparison to a perfect model
      • AR = 1: The model is perfect, and the area under the model’s curve is equal to the area under the perfect model curve.
      • AR < 1: The model is less than perfect, with its area under the curve being less than that of the perfect model curve.
      • AR > 0: The model performs better than a random model (since the area under the random model curve is typically 0.5). However it is difficult to quantify the area under the graph visually, hence we use another approach.
  4. Put a vertical line at the graph 50% horizontal point and see where it intersects in the good model. This will say how many of the result we will get if we contact 50% of the user. The higher the intersection, the better the model is. Cap Curve

    • X < 60%: Rubbish
    • 60% < X < 70%: Poor
    • 70% < X < 80%: Good
    • 80% < X < 90%: Very Good
    • 90% < X < 100%: Too Good. Need to check for overfitting.
3.2.11.3.2 CAP Curve Narrative

The CAP curve effectively illustrates the performance differences between random selection, slight targeting, and optimal targeting. The steeper and higher the curve, the better the model’s performance. In practical terms, this means that a well-performing model can significantly improve targeting accuracy, leading to more efficient use of resources and better outcomes in marketing campaigns.
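
A minimal sketch of how such a CAP curve can be drawn, assuming y_test holds the true 0/1 purchase labels and y_prob the model's predicted probability of purchasing for each user (both variable names are assumptions):

import numpy as np
import matplotlib.pyplot as plt

order = np.argsort(y_prob)[::-1] # contact the most promising users first
gains = np.cumsum(np.asarray(y_test)[order]) / np.sum(y_test) # fraction of purchasers reached
contacted = np.arange(1, len(y_test) + 1) / len(y_test) # fraction of users contacted

plt.plot(contacted, gains, label='Model') # model CAP curve
plt.plot([0, 1], [0, 1], label='Random') # random-selection baseline
plt.xlabel('Fraction of users contacted')
plt.ylabel('Fraction of purchasers reached')
plt.legend()
plt.show()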

3.2.11.4 Precision

3.2.11.4.1 Precision Intuition

Precision is the ratio of correctly predicted positive observations to the total predicted positive observations. The higher the precision, the better the model is. \[ Precision = \frac{TP}{TP + FP(Type 1 Error)} \]

3.2.11.4.2 Precision Narrative

If the precision is 0.9, it means that 90% of the positive predictions are correct.

3.2.11.4.3 Precision Python
from sklearn.metrics import precision_score
precision_score(y_test, y_pred)

3.2.11.5 Recall

3.2.11.5.1 Recall Intuition

Recall is the ratio of correctly predicted positive observations to the all observations in actual class. The higher the recall, the better the model is. \[ Recall = \frac{TP}{TP + FN(Type 2 Error)} \]

3.2.11.5.2 Recall Narrative

If the recall is 0.9, it means that 90% of the actual positive observations are correctly identified.

3.2.11.5.3 Recall Python
from sklearn.metrics import recall_score
recall_score(y_test, y_pred)

3.2.11.6 F1 Score

3.2.11.6.1 F1 Score Intuition

The F1 score is the harmonic mean of Precision and Recall. The higher the F1 score, the better the model is. \[ F1\ Score = 2 \ast \frac{Precision \ast Recall}{Precision + Recall} \]

3.2.11.6.2 F1 Score Narrative

If the F1 score is 0.9, it means that the model achieves a good balance between precision and recall, with both being high.

3.2.11.6.3 F1 Score Python
from sklearn.metrics import f1_score
f1_score(y_test, y_pred)

3.2.11.7 ROC Curve

The ROC curve is a graphical representation of the true positive rate against the false positive rate. It shows the tradeoff between sensitivity and specificity. The area under the ROC curve (AUC) is a measure of how well a parameter can distinguish between two diagnostic groups (diseased/normal). If the AUC is 0.5, then the parameter is no better than random. The higher the AUC, the better the parameter is at distinguishing between the two groups.

\[ \begin{align*} \text{TPR} &= \frac{TP}{TP + FN} \quad & \quad \text{FPR} &= \frac{FP}{FP + TN} \end{align*} \]
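
A minimal sketch with scikit-learn, assuming y_prob holds the predicted probabilities of the positive class (e.g. classifier.predict_proba(X_test)[:, 1]):

from sklearn.metrics import roc_curve, roc_auc_score
fpr, tpr, thresholds = roc_curve(y_test, y_prob) # one (FPR, TPR) point per threshold
roc_auc_score(y_test, y_prob) # area under the ROC curve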

3.2.11.8 AUC

AUC stands for “Area under the ROC Curve.” That is, AUC measures the entire two-dimensional area underneath the entire ROC curve (think integral calculus) from (0,0) to (1,1). AUC provides an aggregate measure of performance across all possible classification thresholds. One way of interpreting AUC is as the probability that the model ranks a random positive example more highly than a random negative example. The higher the AUC, the better the model is. Numerically it can be approximated by summing trapezoids over the ROC points: \[ AUC \approx \sum_{i} \frac{1}{2} \ast (TPR_i + TPR_{i+1}) \ast (FPR_{i+1} - FPR_i) \]
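
Using the fpr and tpr arrays from the previous sketch, the same trapezoidal approximation can be computed directly:

import numpy as np
auc = np.trapz(tpr, fpr) # trapezoidal rule; matches roc_auc_score up to numerical precision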

3.2.11.9 Precision-Recall Curve

The precision-recall curve shows the tradeoff between precision and recall for different threshold. A high area under the curve represents both high recall and high precision, where high precision relates to a low false positive rate, and high recall relates to a low false negative rate. High scores for both show that the classifier is returning accurate results (high precision), as well as returning a majority of all positive results (high recall). \[ \begin{align*} \text{Precision} &= \frac{TP}{TP + FP} \quad & \quad \text{Recall} &= \frac{TP}{TP + FN} \end{align*} \] The precision-recall curve is useful when the target class is imbalanced.
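
A minimal sketch with scikit-learn, again assuming y_prob holds the positive-class probabilities:

from sklearn.metrics import precision_recall_curve, average_precision_score
precision, recall, thresholds = precision_recall_curve(y_test, y_prob) # one point per threshold
average_precision_score(y_test, y_prob) # summarizes the curve as a single number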

4 Unsupervised Learning

Training without a target label. We do not have the answers, so we ask the model to find structure by itself. The unsupervised method covered here is clustering.

4.1 Clustering

Clustering in Action

4.1.1 K-Means Clustering

4.1.1.1 K-Means clustering Intuition

Assuming we have the following clusters

Steps for K-Means:

  1. Assume we want the data to be separated into two clusters. We randomly place two centroids anywhere in the data. K-Means Intuition 1

  2. K-Means assigns each data point to the closest centroid; in this case we can draw an equidistant line that clearly separates the data points. K-Means Intuition 1

  3. Calculate the center of mass for each cluster (the centroid itself is not included in this calculation). For each cluster, we take the average of all x and y coordinates to find the new location for the centroid to move to. Then we repeat the process.

Calculate New Centroid Position

Move the Centroid to the New Location
  4. Draw another equidistant line, change the colors of the data points, calculate the center of gravity, and move the centroids.

Draw Another Line

Change the Color of the Data Points

Calculate New Centroid Position

Move the Centroid to the New Location
  5. Draw another equidistant line, change the colors of the data points, calculate the center of gravity, and move the centroids.

Draw Another Line

Change the Color of the Data Points

Calculate New Centroid Position

Move the Centroid to the New Location
  6. Draw another equidistant line, change the colors of the data points, calculate the center of gravity, and move the centroids.

Draw Another Line

Change the Color of the Data Points

Calculate New Centroid Position

Move the Centroid to the New Location
  7. Repeat until another pass no longer changes any assignments; the algorithm has converged (a minimal NumPy sketch of this loop follows the figure below).

Last Line
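A minimal NumPy sketch of the assign/update loop described above, assuming X is a 2-D array of points and k is the chosen number of clusters (edge cases such as empty clusters are not handled); in practice we use sklearn's KMeans as shown later.

import numpy as np

def kmeans_sketch(X, k, n_iter=100, seed=42):
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]  # random initial centroids
    for _ in range(n_iter):
        # assignment step: each point goes to its closest centroid
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # update step: move each centroid to the mean of its assigned points
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centroids, centroids):  # converged: nothing changes anymore
            break
        centroids = new_centroids
    return labels, centroids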

4.1.1.2 Elbow Analysis

Elbow analysis is one approach to deciding how many clusters to use at the start; alternatively, we can use domain knowledge. \[ \text{WCSS} = \sum_{P_i \in \text{Cluster}_1} \text{distance}(P_i, C_1)^2 + \sum_{P_i \in \text{Cluster}_2} \text{distance}(P_i, C_2)^2 + \dots \]

Calculate the distance between each point and its cluster's centroid, square it, and sum everything up.

Initial Data Points

Assuming 1 Centroid

Assuming 2 Centroids

Assuming 3 Centroids

With 1 cluster, the WCSS is high. With 2 clusters, it is somewhat smaller; with 3 clusters, smaller still, and so on, until the number of centroids equals the number of data points and the WCSS reaches 0. Hence, the chart looks like an elbow, and we select the point where the WCSS stops dropping rapidly. It is still something of a judgement call.

4.1.1.3 K-Means ++

We can run into problems because of different initial centroid placements, resulting in the random initialization trap.

Random Initialization Trap

Procedure for K-Means++

Initialize 1st Centroid at Random

2nd Centroid at Weighted Distance

Calculate Distance for 2 Centroids

3rd Centroid at Weighted Distance

Final Centroids Placement

Result after K-Means

K-Means++ does not guarantee that this issue never occurs, because the first centroid is still placed at random, but it at least uses a weighted random method for the remaining centroids, and the placed centroids are then refined into their final positions.

4.1.1.4 K-Means Clustering Python

Start by finding the elbow

from sklearn.cluster import KMeans
import matplotlib.pyplot as plt
wcss = []
for i in range(1, 10):
  kmeans = KMeans(n_clusters=i, init='k-means++', random_state=42)
  kmeans.fit(X)
  wcss.append(kmeans.inertia_)  # inertia_ is the WCSS of the fitted clustering
plt.plot(range(1, 10), wcss)
plt.title('The Elbow Method')
plt.xlabel('Number of Clusters')
plt.ylabel('WCSS')
plt.show()

Assuming the best elbow analysis is at 5.

After we find the elbow, then do the actual k-means

kmeans = KMeans(n_clusters=5, init='k-means++', random_state=42)
y_kmeans = kmeans.fit_predict(X)  # cluster label for each data point

Then plot the result (assuming 5 clusters)

plt.scatter(X[y_kmeans == 0,0], X[y_kmeans==0,1], s=100, color="red", label="Cluster 1")
plt.scatter(X[y_kmeans == 1,0], X[y_kmeans==1,1], s=100, color="blue", label="Cluster 2")
plt.scatter(X[y_kmeans == 2,0], X[y_kmeans==2,1], s=100, color="green", label="Cluster 3")
plt.scatter(X[y_kmeans == 3,0], X[y_kmeans==3,1], s=100, color="cyan", label="Cluster 4")
plt.scatter(X[y_kmeans == 4,0], X[y_kmeans==4,1], s=100, color="magenta", label="Cluster 5")
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1], s=300, color="yellow", label="Centroids")
plt.title('Clusters of Customers')
plt.xlabel('Annual Income (k$)')
plt.ylabel('Spending Score (1-100)')
plt.legend()
plt.show()

4.1.2 Hierarchical Clustering

4.1.2.1 Hierarchical Clustering Intuition

There are two types of hierarchical clustering:

  1. Agglomerative: Bottom up approach. Start with each data point as a single cluster and merge them together to form larger clusters.
  2. Divisive: Top down approach. Start with all data points in a single cluster and divide them into smaller clusters.

Hierarchical Clustering

To calculate distance:

  • Between points: Euclidean Distance

  • Between clusters: Closest Point, Furthest Point, Average Distance, Distance Between Centroids (these map onto the linkage options shown in the sketch after this list)

    • Closest Point: the closest pair of points between the two clusters
    • Furthest Point: the furthest pair of points between the two clusters
    • Average Distance: the average distance between all pairs of points across the two clusters (n*n pairs)
    • Distance Between Centroids: make a centroid for each cluster and calculate the distance between the two centroids
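These between-cluster distance choices correspond to the linkage options in scipy (and sklearn); a small sketch, assuming X is the feature matrix used elsewhere in this section:

import scipy.cluster.hierarchy as sch
Z_single   = sch.linkage(X, method='single')    # closest point
Z_complete = sch.linkage(X, method='complete')  # furthest point
Z_average  = sch.linkage(X, method='average')   # average distance
Z_centroid = sch.linkage(X, method='centroid')  # distance between centroids
# method='ward' (used in the code below) instead merges the pair of clusters
# that increases the total within-cluster variance the least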

Euclidean Distance for Between Points

Distance Between Clusters

Steps for HCA:

  1. Assume we have 6 data points. HCA 1

  2. Make each data point a single-point cluster. The number of clusters is equal to the number of data points. HCA 2

  3. Take the two closest clusters and merge them into a single cluster. The number of clusters is now 5. HCA 3

  4. Take another two closest clusters and merge them into a single cluster. The number of clusters is now 4. HCA 4

  5. Take another two closest clusters and merge them into a single cluster. The number of clusters is now 3. HCA 5

  6. Take another two closest clusters and merge them into a single cluster. The number of clusters is now 2. HCA 6

  7. Take another two closest clusters and merge them into a single cluster. The number of clusters is now 1. HCA 7

And the record of these merges is stored in a dendrogram, described in the next section.

4.1.2.2 Dendrogram Intuition

A dendrogram is a tree-like diagram that records the sequence of merges or splits. It is used to visualize the hierarchical clustering process. The y-axis represents the distance between clusters, and the x-axis represents the data points.

  1. Assume the left side shows the data points and the right side is an empty dendrogram. Find the two closest data points, calculate their Euclidean distance, and draw it on the right. Dendogram 1

  2. Find the two closest data points/clusters, group them, and draw the merge on the right. Dendogram 2

  3. Find the two closest data points/clusters, group them, and draw the merge on the right. Dendogram 3

  4. Find the two closest data points/clusters, group them, and draw the merge on the right. Dendogram 4

  5. Find the two closest data points/clusters, group them, and draw the merge on the right. Dendogram 5

The way we use the dendrogram is by applying a distance (dissimilarity) threshold on the y-axis. We decide that we do not want any merge above a certain dissimilarity, e.g. above 1.7, above 0.9, etc.

Example of Two Clusters

Example of Four Clusters

Example of Six Clusters

However, the standard approach is to use the longest vertical line that does not cross any horizontal line.

Example of The Longest Vertical
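A minimal sketch of cutting the tree at a chosen distance threshold instead of a fixed number of clusters, assuming X is the feature matrix; the threshold value here is purely illustrative.

import scipy.cluster.hierarchy as sch
Z = sch.linkage(X, method='ward')
labels = sch.fcluster(Z, t=100, criterion='distance')  # flat clusters formed so that no merge above distance t is applied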

4.1.2.3 Hierarchical Clustering Python

Start by drawing the dendrogram and deciding where to prune/cut it

import scipy.cluster.hierarchy as sch
import matplotlib.pyplot as plt
dendrogram = sch.dendrogram(sch.linkage(X, method='ward'))
plt.title("Dendrogram")
plt.xlabel("Customers")
plt.ylabel("Euclidean Distance")
plt.show()

Train the model

from sklearn.cluster import AgglomerativeClustering
hc = AgglomerativeClustering(n_clusters=5, metric="euclidean", linkage="ward")
y_hc = hc.fit_predict(X)

Visualize the result

plt.scatter(X[y_hc == 0, 0], X[y_hc == 0, 1], s=100, c='red', label='Cluster 1')
plt.scatter(X[y_hc == 1, 0], X[y_hc == 1, 1], s=100, c='blue', label='Cluster 2')
plt.scatter(X[y_hc == 2, 0], X[y_hc == 2, 1], s=100, c='green', label='Cluster 3')
plt.scatter(X[y_hc == 3, 0], X[y_hc == 3, 1], s=100, c='orange', label='Cluster 4')
plt.scatter(X[y_hc == 4, 0], X[y_hc == 4, 1], s=100, c='magenta', label='Cluster 5')
plt.legend()
plt.show()

4.1.3 Comparison of K-Means and Hierarchical Clustering

Comparison of the two clustering models:

  • K-Means
    • Pros: simple to understand, easily adaptable, works well on small or large datasets, fast, efficient, and performant
    • Cons: need to choose the number of clusters
  • HCA
    • Pros: the optimal number of clusters can be obtained by the model itself, practical visualization with the dendrogram
    • Cons: not appropriate for large datasets

5 Association Rule Learning

Association rule learning is a rule-based machine learning method for discovering interesting relations between variables in large databases. It is used for market basket analysis, where we find sets of products that frequently co-occur in transactions. The goal is to identify strong rules discovered in databases using some measures of interestingness.

Association rule learning is a type of unsupervised learning, which means that it does not require labeled data. It is used to find patterns in data without any prior knowledge of the data.

Pampers and Beer

Movie Recommendation

Food Purchase Basket

People who liked movie 1 are likely to also like movie 2. People who liked movie 2 are likely to also like movie 4.

5.1 Apriori Algorithm

The Apriori algorithm is a classic algorithm used for mining frequent itemsets and relevant association rules. It is based on the principle that if an itemset is frequent, then all of its subsets must also be frequent. The algorithm uses a breadth-first search strategy to count candidate itemsets and prune the infrequent ones. A classic example: people who bought bread also bought butter.

The Apriori algorithm has 3 main parts:

5.1.1 Apriori Support

Support is the proportion of transactions in the database that contain the itemset. It is used to measure the importance of an itemset. The higher the support, the more important the itemset is.

Example for movie recommendation (M) \[ \text{Support}(M) = \frac {\text{\# user watchlist containing } M} {\text{\# user watchlists}} \]

Example for market basket optimization (I) \[ \text{Support} (I) = \frac {\text{\# transactions containing } I} {\text{\# transactions}} \]

Assuming in this condition:

  • 100 people
  • 10 have watched the movie
  • \(Support\) = 10/100 = 10%

Support Calculation

5.1.2 Apriori Confidence

Confidence is the proportion of transactions that contain the itemset and also contain the consequent. It is used to measure the strength of the association rule. The higher the confidence, the stronger the association rule is.

Example for movie recommendation (M) \[ \text{Confidence}(M_1\rightarrow M_2) = \frac {\text{\# user watchlist containing } M_1 {\text{ and }} M_2} {\text{\# user watchlists containing } M_1} \]

Assuming in this condition:

  • \(M_1\) is Interstellar and \(M_2\) is Ex-Machina
  • 40 people have watched Interstellar
  • Out of 40, 7 have watched Ex-Machina
  • \(Confidence\) = 7/40 = 17.5%

Confidence Calculation

5.1.3 Apriori Lift

Lift is the ratio of the observed support to that expected if \(M_1\) and \(M_2\) were independent. It is used to measure the strength of the association rule. The higher the lift, the stronger the association rule is.

Example for movie recommendation (M) \[ \text{Lift}(M_1\rightarrow M_2) = \frac {\text{Confidence}(M_1\rightarrow M_2)} {\text{Support}(M_2)} \]

Assuming this condition:

  • \(M_1\) is Interstellar and \(M_2\) is Ex-Machina
  • The population is 100
  • Originally 10 people in our population have watched Ex-Machina (indicated by red)
  • Originally 40 people in our population have watched Interstellar (indicated by green)
  • \(Confidence\) is 7/40 = 17.5%
  • \(Support\) is 10/100 = 10%
  • \(Lift\) = 17.5% / 10% = 1.75

Lift is the improvement over randomly recommending to a group of people. The choices are:

  • Randomly recommending Ex-Machina, with a 10% chance of success
  • First asking whether they have watched Interstellar, then recommending Ex-Machina, with a 17.5% chance of success
  • The ratio, 17.5% / 10% = 1.75, is the lift (a quick numeric check follows the lift figures below)

Lift Calculation 1

Lift Calculation 2
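A tiny arithmetic check of the lift example above, using the same hypothetical counts:

n_users = 100
watched_ex_machina = 10        # users who watched Ex-Machina (M2)
watched_interstellar = 40      # users who watched Interstellar (M1)
watched_both = 7               # users who watched both

support_m2 = watched_ex_machina / n_users          # 0.10
confidence = watched_both / watched_interstellar   # 0.175
lift = confidence / support_m2                     # 1.75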

Steps for Apriori:

  1. Since Apriori is an algorithm that might be exhaustive and big to calculate, we need to set a minimum support and confidence. For example, we set the minimum support to 10% and minimum confidence to 20%. This means that we only want to find itemsets that are present in at least 10% of the transactions and have a confidence of at least 20%.

  2. Filter to include all the itemsets in the transactions that have support higher than the minimum support

  3. Filter again to include all the rules that have confidence higher than the minimum confidence

  4. Sort the rules by decreasing lift. The top 5 or top 10 are the ones we want to implement in a business decision (see the one-line sort after the code below).

Apriori Steps

5.1.4 Apriori Algorithm Python

!pip install apyori -qq

import pandas as pd
# dataset is assumed to be a DataFrame of raw transactions, one transaction per row
# (for example, loaded with pd.read_csv(..., header=None))
transactions = []
# Convert the DataFrame rows into a list of lists of item names (strings);
# this example assumes at most 20 items (columns) per transaction
for i in range(0, len(dataset)):
    transactions.append([str(dataset.values[i, j]) for j in range(0, 20)])

from apyori import apriori
rules = apriori(transactions=transactions, min_support=0.003, min_confidence=0.2, min_lift=3, min_length=2, max_length=2)

results = list(rules)
def inspect(results):
    lhs = [tuple(result[2][0][0])[0] for result in results]
    rhs = [tuple(result[2][0][1])[0] for result in results]
    supports = [result[1] for result in results]
    confidence = [result[2][0][2] for result in results]
    lifts = [result[2][0][3] for result in results]
    return list(zip(lhs, rhs, supports, confidence, lifts))

resultsinDataFrame = pd.DataFrame(inspect(results), columns = ['Left Hand Side', 'Right Hand Side', 'Support', 'Confidence', 'Lift'])
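Step 4 of the Apriori procedure sorts the rules by decreasing lift; with the DataFrame above this is one line:

resultsinDataFrame.nlargest(n = 10, columns = 'Lift')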

5.2 Eclat Algorithm

Eclat algorithm is a depth-first search algorithm for mining frequent itemsets. It uses a vertical data format, where each item is associated with a list of transactions that contain it. The algorithm uses a depth-first search strategy to find frequent itemsets by intersecting the transaction lists of the items.

The Eclat algorithm has only 1 main part:

5.2.1 Eclat Support

Support is the proportion of transactions in the database that contain the itemset (a set of items, rather than a single item). It is used to measure the importance of an itemset. The higher the support, the more important the itemset is.

Example for movie recommendation (M) \[ \text{Support}(M) = \frac {\text{\# user watchlists containing } M} {\text{\# user watchlists}} \]

Assuming in this condition:

  • \(M\) is a set of movies
  • Narrative: how often does a set of movies (Interstellar and Ex-Machina) occur across all watchlists?

Steps for Eclat:

  1. Eclat is similar to Apriori, but it uses a depth-first search strategy to find frequent itemsets. It uses a vertical data format, where each item is associated with the list of transactions that contain it (see the sketch after the figure below). First, we set a minimum support.

  2. Take all the subsets in transactions having higher support than minimum support

  3. Sort these subsets by decreasing support

Eclat Steps
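A toy sketch of Eclat's vertical data format: each item maps to the set of transaction IDs that contain it, and the support of an itemset comes from intersecting those sets. The items and IDs below are hypothetical.

tid_lists = {
    'bread':  {1, 2, 4, 5, 7},
    'butter': {2, 4, 5, 8},
}
n_transactions = 8
both = tid_lists['bread'] & tid_lists['butter']  # transactions containing both items
support = len(both) / n_transactions             # support of the itemset {bread, butter}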

5.2.2 Eclat Algorithm Python

!pip install apyori -qq
import pandas as pd
transactions = []
# Convert the DataFrame rows into a list of lists of item names (strings)
for i in range(0, len(dataset)):
    transactions.append([str(dataset.values[i, j]) for j in range(0, 20)])

# Eclat only uses support, so here we reuse apyori's apriori and simply report
# the support of each item pair (the confidence and lift thresholds are left in
# place only to keep the rule list small)
from apyori import apriori
rules = apriori(transactions=transactions, min_support=0.003, min_confidence=0.2, min_lift=3, min_length=2, max_length=2)

results = list(rules)
def inspect(results):
    lhs         = [tuple(result[2][0][0])[0] for result in results]
    rhs         = [tuple(result[2][0][1])[0] for result in results]
    supports    = [result[1] for result in results]
    return list(zip(lhs, rhs, supports))
resultsinDataFrame = pd.DataFrame(inspect(results), columns = ['Product 1', 'Product 2', 'Support'])
resultsinDataFrame.nlargest(n = 10, columns = 'Support')

6 Reinforcement Learning

Reinforcement learning is a type of machine learning where an agent learns to make decisions by taking actions in an environment to maximize a reward. The agent interacts with the environment, receives feedback in the form of rewards or penalties, and learns from this feedback to improve its decision-making over time.

6.1 The Multi-Armed Bandit Problem

A classic example is the multi-armed bandit problem. The name refers to multiple slot machines, each with a lever (or arm), where each machine has a different probability of winning. The goal is to decide how to play them so as to maximize the return. This problem can be solved with reinforcement learning.

The Multi-Armed Bandit Problem

The assumption is that each of these machines has a distribution out of which the machine picks a result. But the problem is, we do not know the distribution of each machine. The goal is to find the best machine to play.

Distribution Example

The longer we play, the more we learn about the distributions, but we do not know in advance how long that will take, and we lose money while exploring. The solution is to balance exploration and exploitation: we need to explore the machines to find out which one is the best, but we also need to exploit the best machine to maximize the return.

Moreover, we also have the concept of regret. Regret (or opportunity cost) is the difference between the reward we could have obtained by always playing the best machine (rightmost) and the reward we actually obtained by playing the machines. The goal is to minimize the regret. If we don't explore long enough because we are afraid of the regret, a suboptimal machine might appear to be the optimal one.

The ultimate goal is to explore and exploit the machines to minimize the regret. The problem is that we do not know how long it will take to explore and exploit the machines.

6.1.0.1 The Multi-Armed Bandit Example

Assuming we are running 5 ads simultaneously and we want to find out which ad is the best.

The Multi-Armed Bandit Example

We could run A/B tests, but A/B tests cost a lot of money and time, and they are pure exploration: every ad is shown equally often, so there is no exploitation.

The solution is to use Upper Confidence Bound (UCB).

6.2 Upper Confidence Bound (UCB)

The example in advertising can be summarized as:

Number 4 reads: at each round \(n\), ad \(i\) gives a reward \(r_i(n) \in \{0, 1\}\), where \(r_i(n) = 1\) if the user clicked on the ad and \(r_i(n) = 0\) if the user didn't. The goal is to maximize the total reward \(r(n)\) over \(n\) rounds.

The UCB algorithm works as follows:

UCB Formula: \[ \text{UCB}(a) = Q(a) + c \times \sqrt{\frac{\ln(t)}{N(a)}} \] Where:

  • \(Q(a)\) = average reward of ad \(a\) (or machine \(a\))
  • \(c\) = exploration parameter (controls how much we value exploration)
  • \(t\) = total number of plays so far
  • \(N(a)\) = number of times we’ve played machine \(a\)

UCB Steps:

  1. Assume we have 5 machines with equal uncertainty. We will play each machine once to get the initial reward. The reward is either 0 or 1.
  2. In the initial round, we play 5 times where each machine is played once. We calculate the average reward for each machine. (t=1-5)
  3. Let’s say that the results are (1=win, 0=loss)
    • A(1)
    • B(0)
    • C(1)
    • D(0)
    • E(0)
  4. Calculating the UCB assuming \(c=0.654\):
    • A: \(\text{UCB}(A) = 1 + c \times \sqrt{\frac{\ln(5)}{1}} = 1.83\)
    • B: \(\text{UCB}(B) = 0 + c \times \sqrt{\frac{\ln(5)}{1}} = 0.83\)
    • C: \(\text{UCB}(C) = 1 + c \times \sqrt{\frac{\ln(5)}{1}} = 1.83\)
    • D: \(\text{UCB}(D) = 0 + c \times \sqrt{\frac{\ln(5)}{1}} = 0.83\)
    • E: \(\text{UCB}(E) = 0 + c \times \sqrt{\frac{\ln(5)}{1}} = 0.83\)
  5. Based on the result, we can either choose A or C. Let’s say we pick A.
  6. After playing A again (t=6), suppose we get a loss (0). The new estimates for all machines are:
    • A: \(\text{UCB}(A) = \frac{1+0}{2} + c \times \sqrt{\frac{\ln(6)}{2}} = 1.13\)
    • B: \(\text{UCB}(B) = 0 + c \times \sqrt{\frac{\ln(6)}{1}} = 0.9\)
    • C: \(\text{UCB}(C) = 1 + c \times \sqrt{\frac{\ln(6)}{1}} = 1.90\) (now highest)
    • D: \(\text{UCB}(D) = 0 + c \times \sqrt{\frac{\ln(6)}{1}} = 0.9\)
    • E: \(\text{UCB}(E) = 0 + c \times \sqrt{\frac{\ln(6)}{1}} = 0.9\)
  7. Next, we play machine C and the process continues.
  8. As \(N(a)\) increases, the exploration term gets smaller, so the confidence bound narrows as we gather more data about a specific machine (a minimal simulation sketch follows).
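A minimal simulation sketch of the UCB loop above; the true win probabilities are hypothetical and only used to generate the 0/1 rewards, and c is the same exploration parameter used in the example.

import numpy as np
rng = np.random.default_rng(42)
true_p = [0.20, 0.10, 0.35, 0.15, 0.05]   # unknown to the algorithm, used only to simulate rewards
n_machines, n_rounds, c = len(true_p), 1000, 0.654
counts = np.zeros(n_machines)   # N(a): number of times each machine has been played
sums = np.zeros(n_machines)     # total reward collected from each machine
for t in range(1, n_rounds + 1):
    if t <= n_machines:
        a = t - 1                                              # play each machine once first
    else:
        ucb = sums / counts + c * np.sqrt(np.log(t) / counts)  # Q(a) + exploration term
        a = int(np.argmax(ucb))                                # pick the highest upper bound
    reward = rng.random() < true_p[a]                          # Bernoulli reward (0 or 1)
    counts[a] += 1
    sums[a] += reward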

6.3 Thompson Sampling

Thompson sampling is a Bayesian approach to the multi-armed bandit problem. It keeps a probability distribution over each machine's expected reward, draws a sample from each distribution every round, and plays the machine with the highest sample. The distributions start as priors and are updated with the observed rewards.

The Thompson Sampling algorithm works as follows:

For Bernoulli (0/1) rewards, each machine \(a\) keeps a Beta distribution over its unknown win probability. At each round we draw a sample for every machine: \[ \theta_a \sim \text{Beta}\big(N_a^{1} + 1,\; N_a^{0} + 1\big) \] Where:

  • \(N_a^{1}\) = number of times machine \(a\) has returned reward 1 so far
  • \(N_a^{0}\) = number of times machine \(a\) has returned reward 0 so far
  • \(\theta_a\) = a random draw of machine \(a\)’s win probability from its current distribution

Thompson Sampling Steps:

  1. Start every machine with the uniform prior \(\text{Beta}(1, 1)\), i.e. equal uncertainty across the 5 machines.
  2. At each round, draw one \(\theta_a\) for every machine from its current Beta distribution.
  3. Play the machine with the highest draw, observe the 0/1 reward, and update that machine’s \(N_a^{1}\) or \(N_a^{0}\).
  4. Repeat. Machines that keep winning are played more and more often, while machines we are still uncertain about occasionally get exploratory plays; a minimal simulation sketch follows.
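A minimal simulation sketch of Thompson Sampling for the same 5-machine Bernoulli setup; the true win probabilities below are hypothetical and only used to generate rewards.

import numpy as np
rng = np.random.default_rng(42)
true_p = [0.20, 0.10, 0.35, 0.15, 0.05]   # unknown to the algorithm, used only to simulate rewards
n_machines, n_rounds = len(true_p), 1000
wins = np.zeros(n_machines)     # N^1: number of 1-rewards per machine
losses = np.zeros(n_machines)   # N^0: number of 0-rewards per machine
for t in range(n_rounds):
    theta = rng.beta(wins + 1, losses + 1)    # one draw from each machine's Beta distribution
    a = int(np.argmax(theta))                 # play the machine with the highest draw
    reward = rng.random() < true_p[a]         # Bernoulli reward (True/False counts as 1/0)
    wins[a] += reward
    losses[a] += 1 - reward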

6.4 UCB vs Thompson Sampling

7 Dimensionality Reduction

7.1 Principal Component Analysis (PCA)

PCA is a technique used to reduce the dimensionality of a dataset while preserving as much variance as possible. It transforms the original features into a new set of features called principal components, which are linear combinations of the original features. The first principal component captures the maximum variance, the second principal component captures the second maximum variance, and so on. This is an unsupervised learning technique that is used to reduce the number of features in a dataset while retaining the most important information.

Rather than predicting y from X, PCA looks at the relationships among the feature values themselves and finds the directions along which they vary the most; note that it is sensitive to outliers.

PCA can be used for:

Main Function of PCA

7.1.1 PCA Python

from sklearn.decomposition import PCA
pca = PCA(n_components=2)
X_train = pca.fit_transform(X_train)
X_test = pca.transform(X_test)

7.2 Linear Discriminant Analysis (LDA)

LDA is a supervised learning technique used for dimensionality reduction. It is used to find a linear combination of features that best separates two or more classes. LDA maximizes the ratio of between-class variance to within-class variance, which helps to achieve better class separability. LDA is commonly used in classification problems where the goal is to find a lower-dimensional representation of the data that retains the discriminative information.

LDA projects the data onto at most (number of classes - 1) dimensions: with 2 classes, 1 dimension; with 3 classes, up to 2 dimensions.

7.2.1 LDA Python

from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
lda = LinearDiscriminantAnalysis(n_components=2)
X_train = lda.fit_transform(X_train, y_train)
X_test = lda.transform(X_test)

7.3 Kernel PCA

Kernel PCA is an extension of PCA that uses kernel methods to perform PCA in a higher-dimensional space. It allows for non-linear dimensionality reduction by applying a kernel function to the data before performing PCA. This enables the algorithm to capture complex relationships between features that may not be linearly separable in the original feature space.

7.3.1 Kernel PCA Python

from sklearn.decomposition import KernelPCA
kpca = KernelPCA(n_components = 2, kernel = 'rbf')
X_train = kpca.fit_transform(X_train)
X_test = kpca.transform(X_test)

7.4 Comparison of PCA, Kernel PCA, and LDA

  • Type of Transformation: PCA uses linear transformations only; Kernel PCA allows nonlinear transformations; LDA uses linear transformations
  • Learning Approach: PCA is unsupervised; Kernel PCA is unsupervised; LDA is supervised (requires class labels)
  • Optimization Goal: PCA maximizes variance along linear directions; Kernel PCA maximizes variance along potentially nonlinear directions; LDA maximizes between-class separation and minimizes within-class variance
  • Pattern Recognition: PCA can only find straight-line relationships; Kernel PCA can find curved relationships; LDA is focused on class separation
  • Best Use Case: PCA when the data has linear structure and simplicity is preferred; Kernel PCA when the data has nonlinear structure (curves, clusters); LDA when separating classes with labeled data
  • Visual Example Performance: PCA would draw a straight line through curved data, missing the curve; Kernel PCA would follow the curve of the data; LDA would find the best line to separate the different classes

Key Insight: PCA is about finding patterns in data, while LDA is about finding differences between groups.

7.4.1 PCA (Principal Component Analysis)

Example: Facial Recognition Systems

In facial recognition, each face image might have thousands of pixels (dimensions). PCA helps by:

  • Reducing dimensions: Compressing thousands of pixel values down to 50-100 “eigenfaces”
  • Filtering noise: Removing random variations in lighting, expression, etc.
  • Focusing on variance: Capturing the most distinctive facial features

For instance, the first few principal components might represent features like face width, nose length, or eye spacing—features with the highest variance across all faces.

Real application: Facebook’s DeepFace and many security systems use PCA as a preprocessing step before applying neural networks.

7.4.2 LDA (Linear Discriminant Analysis)

Example: Medical Diagnosis

Consider diagnosing three types of diabetes based on multiple blood markers:

  • Classes: Type 1 diabetes, Type 2 diabetes, Gestational diabetes
  • Variables: Glucose levels, insulin production, BMI, age, antibody tests, etc.

LDA would:

  • Find the specific combinations of these tests that best separate the three diabetes types
  • Create at most 2 new variables (since we have 3 classes) that maximize separation
  • Help doctors visualize how patients cluster into diagnostic groups

Real application: Medical diagnostic tools that help classify patients based on blood panels or genetic markers.

7.4.3 Kernel PCA

Example: Handwritten Digit Recognition When recognizing handwritten digits (like postal codes):

  • The challenge: Different people write the same digit in dramatically different ways
  • Linear problem: Standard PCA can’t capture the curvy, nonlinear variations in handwriting

Kernel PCA solves this by:

  • Implicitly mapping the data to a higher dimension where nonlinear patterns become linear
  • Capturing the “essence” of each digit’s shape despite widely varying writing styles
  • Finding features that would be impossible to detect with linear methods

Real application: Early versions of postal service mail sorting systems used Kernel PCA to help recognize handwritten zip codes before deep learning became dominant.

7.4.4 Choosing Between Methods

  • Use PCA when you want to compress data and don’t have class labels (like image compression)
  • Use LDA when you want to classify data into known groups (like medical diagnosis)
  • Use Kernel PCA when your data has obvious nonlinear patterns that linear methods miss (like complex image recognition)

8 Model Selection and Boosting

8.1 K-Fold Cross-Validation

8.1.1 K-Fold Cross-Validation Intuition

Normally, data is split into training and test sets. The model is trained on the training set and evaluated on the test set. However, this can lead to overfitting or underfitting (maybe we just got lucky). K-fold cross-validation is a technique used to evaluate the performance of a model by splitting the data into K subsets (or folds).

8.1.2 K-Fold Cross-Validation Steps

  1. Split the training data into K subsets (or folds). For example, if K=10, the data is split into 10 subsets.
  2. For each fold, use K-1 folds for training and the remaining fold for validation. For example, 9 folds for training, 1 fold for validation.
  3. Repeat for all K folds. Each fold is used as a validation set once.
  4. Fold assignment pattern (validation fold vs. training folds):
    • Fold 1 as validation set; training set = folds 2, 3, 4, 5, 6, 7, 8, 9, ...
    • Fold 2 as validation set; training set = folds 1, 3, 4, 5, 6, 7, 8, 9, ...
    • Fold K as validation set; training set = all folds but K
  5. Each of the K folds will result in a model and a performance metric (e.g. accuracy, precision, recall, F1 score, etc.). If the aggregate metrics are good, this means the modelling (hyperparameter) approach is valid.
  6. Using the approved hyperparameters, train on the full training set, then proceed to test on the test set as usual.

8.1.3 K-Fold Cross-Validation Python

from sklearn.model_selection import cross_val_score
accuracies = cross_val_score(estimator = classifier, X = X_train, y = y_train, cv = 10)
accuracies.mean()
accuracies.std()

8.2 Bias-Variance Tradeoff

8.2.1 Bias-Variance Tradeoff Intuition

The bias-variance tradeoff is a fundamental concept in machine learning that describes the tradeoff between two sources of error in a model: bias and variance.

Bias is a systematic error in the machine learning model itself due to incorrect assumptions in the ML process. Technically, we can define bias as the error between the average model prediction and the ground truth.

  • High bias means the model is too simple and cannot capture the underlying patterns in the data, leading to underfitting.
  • Low bias means the model is more complex and can capture the underlying patterns in the data, which can lead to overfitting if the variance is not kept in check.

Variance is how much the model can adjust depending on the given dataset. It refers to the difference in fits between data sets such as in training set and test set.

  • High variance means the model is sensitive to small fluctuations in the training data, leading to overfitting.
  • Low variance means the model is more stable and less sensitive to changes in the training data, leading to underfitting.

Bias-Variance Tradeoff

Assuming we have weight and height data

Initial Data

We know that weight increases as height increases, but at some point height stops increasing while weight keeps increasing (obesity). Blue is the training data, green is the testing data.

Fit a Linear Regression

Fitting a linear regression results in a model that cannot capture the pattern perfectly.

Fit a Squiggly Line

Fitting a squiggly line results in a model that captures the training pattern perfectly.

Regression vs Squiggly Line

Comparing the two on the training data, the squiggly line is better than the linear regression. But on the testing data, the squiggly line does not perform well because it overfits the training data.

Testing Data

On the testing data, the linear regression is better than the squiggly line, as shown by comparing the sums of squared differences.

Balancing

What we want is a model that has low bias, to accurately model the true relationship, and low variability, producing consistent predictions across different datasets.

8.3 GridSearchCV

8.3.1 GridSearchCV Intuition

GridSearchCV is a technique used to find the best hyperparameters for a machine learning model. It performs an exhaustive search over a specified parameter grid and evaluates the model’s performance using cross-validation. The goal is to find the combination of hyperparameters that results in the best performance metric (e.g., accuracy, F1 score, etc.).

8.3.2 GridSearchCV Python

from sklearn.svm import SVC
classifier = SVC(kernel = 'rbf', random_state = 0)
classifier.fit(X_train, y_train)

from sklearn.model_selection import GridSearchCV
parameters = [{'C': [0.25, 0.5, 0.75, 1], 'kernel': ['linear']},
              {'C': [0.25, 0.5, 0.75, 1], 'kernel': ['rbf'], 'gamma': [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]}]
grid_search = GridSearchCV(estimator = classifier,
                           param_grid = parameters,
                           scoring = 'accuracy',
                           cv = 10,
                           n_jobs = -1)
grid_search.fit(X_train, y_train)
best_accuracy = grid_search.best_score_
best_parameters = grid_search.best_params_

8.4 XGBoost

8.4.1 XGBoost Intuition

XGBoost (Extreme Gradient Boosting) is an optimized gradient boosting library designed to be highly efficient, flexible, and portable. It is widely used for supervised learning tasks, especially in Kaggle competitions and real-world applications. XGBoost is known for its speed and performance, making it a popular choice for machine learning practitioners.

8.4.2 XGBoost Python

from xgboost import XGBClassifier
classifier = XGBClassifier(learning_rate = 0.1,
                            n_estimators = 100,
                            max_depth = 5,
                            min_child_weight = 1,
                            gamma = 0,
                            subsample = 0.8,
                            colsample_bytree = 0.8,
                            objective = 'binary:logistic',
                            n_jobs = -1,
                            scale_pos_weight = 1,
                            random_state = 27)
classifier.fit(X_train, y_train)
y_pred = classifier.predict(X_test)

8.5 CatBoost

8.5.1 CatBoost Intuition

CatBoost is a gradient boosting library developed by Yandex. It is designed to handle categorical features automatically without the need for extensive preprocessing. CatBoost is known for its high performance, ease of use, and ability to work with large datasets. It is particularly effective for tasks involving categorical data, such as recommendation systems and natural language processing.

8.5.2 CatBoost Python

from catboost import CatBoostClassifier
classifier = CatBoostClassifier()
classifier.fit(X_train, y_train)