AI and Machine Learning Concepts
1. Ridge Regression
OLS estimated parameters are BLUE (Best Linear Unbiased Estimators).
Best: Minimum Variance
Unbiased: E(Estimates) = Parameters
Possible to produce models with smaller MSEs by allowing the estimates to be biased.
\[MSE(\hat{\bf w})=\underbrace{Var(\hat{\bf w})}_{\text{variation of }\hat{\bf w}\text{ around }E[\hat{\bf w}]}+\underbrace{Bias^2(\hat{\bf w})}_{Bias(\hat{\bf w})\,=\,E[\hat{\bf w}]-{\bf w}}\]
- Ridge regression is a biased model that adds a penalty (the squared \(L_2\) norm of \({\bf w}\), i.e. the sum of squared weights) to the SSE.
\[\min_{{\bf w} \in {\cal R}^{p+1}} [\text{SSE}(\hat{\bf w})+\text{penalty}]=\min_{{\bf w} \in {\cal R}^{p+1}}\left[\sum_{i=1}^n(y_i-\hat{y}_i)^2+\alpha\sum_{j=1}^pw_j^2\right]= \min_{{\bf w} \in {\cal R}^{p+1}} \left[\sum_{i=1}^n \left\|y_i-{\bf w}^T {\bf x}_i \right\|^2 + \alpha \left\| {\bf w} \right\|^2_2\right] \]
The ridge penalty strength \(\alpha\) is a regularization hyperparameter, and its optimal value must be tuned (e.g., via cross-validation).
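For reference (a standard result, assuming the features are centered so that the unpenalized bias \(w_0\) can be estimated separately), the ridge objective has a closed-form solution: \[\hat{\bf w}_{\text{ridge}}=\left({\bf X}^T{\bf X}+\alpha {\bf I}\right)^{-1}{\bf X}^T{\bf y}\]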
Implication:
A small amount of bias can improve generalization (decrease the test error) even though training performance decreases.
Larger values of the parameter estimates are penalized more than smaller values.
Shrinks the estimates towards 0 as the \(\alpha\) penalty becomes large (called shrinkage methods).
Used when there is multicollinearity.
Note: IN ALL OF THE REGULARIZATION METHODS, THE PENALTIES DO NOT INCLUDE THE BIAS TERM \(w_0\).
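A minimal sketch of ridge regression in scikit-learn (synthetic data and \(\alpha\) values chosen here purely for illustration), showing how the coefficient estimates shrink toward 0 as \(\alpha\) grows:

```python
# Ridge coefficients shrink toward 0 as alpha (the penalty strength) grows.
# Synthetic data; alpha values are arbitrary and only for illustration.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge

X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

for alpha in [0.1, 1.0, 10.0, 100.0]:
    model = Ridge(alpha=alpha).fit(X, y)   # the intercept (bias) is not penalized
    print(alpha, np.round(model.coef_, 2))
```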
2. Lasso Regression
Lasso (Least Absolute Shrinkage and Selection Operator)
Lasso regression is another biased model that uses an \(L_1\) norm (i.e. the sum of absolute values):
\[\min_{{\bf w} \in {\cal R}^{p+1}} [\text{SSE}(\hat{\bf w})+\text{penalty}]=\min_{{\bf w} \in {\cal R}^{p+1}}\left[\sum_{i=1}^n(y_i-\hat{y}_i)^2+\alpha\sum_{j=1}^p|w_j|\right]= \min_{{\bf w} \in {\cal R}^{p+1}} \left[\sum_{i=1}^n \left\|y_i-{\bf w}^T {\bf x}_i \right\|^2 + \alpha \left\| {\bf w} \right\|_1\right] \]
Unlike the \(L_2\) norm (i.e., ridge regression), which penalizes very large coefficients disproportionately more, the \(L_1\) norm penalizes every unit of coefficient magnitude equally.
While ridge regression shrinks the parameter estimates towards 0, lasso sets some of the estimates exactly to zero.
Lasso not only performs regularization to improve the model, it also performs a form of automatic feature selection.
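A minimal sketch of lasso in scikit-learn (synthetic data with only a few informative features, chosen for illustration), showing that some coefficients are set exactly to 0:

```python
# Lasso sets some coefficients exactly to 0, acting as a form of feature selection.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso

# Only 3 of the 10 features are truly informative in this synthetic dataset.
X, y = make_regression(n_samples=200, n_features=10, n_informative=3,
                       noise=5.0, random_state=0)

for alpha in [0.1, 1.0, 10.0]:
    model = Lasso(alpha=alpha).fit(X, y)
    print(alpha, "non-zero coefficients:", int(np.sum(model.coef_ != 0)))
```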
3. Elastic Net Regression
- Elastic Net combines both \(L_1\) and \(L_2\) penalties together (using the effect of both ridge and lasso regressions):
\[\min_{{\bf w} \in {\cal R}^{p+1}} [\text{SSE}(\hat{\bf w})+\text{penalty}]=\min_{{\bf w} \in {\cal R}^{p+1}}\left[\sum_{i=1}^n(y_i-\hat{y}_i)^2+\alpha_1\sum_{j=1}^p|w_j| + \alpha_2\sum_{j=1}^pw_j^2\right]= \min_{{\bf w} \in {\cal R}^{p+1}} \left[\sum_{i=1}^n \left\|y_i-{\bf w}^T {\bf x}_i \right\|^2 + \alpha_1 \left\| {\bf w} \right\|_1 + \alpha_2 \left\| {\bf w} \right\|_2^2\right] \]
In scikit-learn, it is parametrized differently: \[\min_{{\bf w} \in {\cal R}^{p+1}} [\text{SSE}(\hat{\bf w})+\text{penalty}]= \min_{{\bf w} \in {\cal R}^{p+1}} \left[\sum_{i=1}^n \left\|y_i-{\bf w}^T {\bf x}_i \right\|^2 + \alpha\; \ell_1\; \left\| {\bf w} \right\|_1+\alpha \;(1-\ell_1)\; \left\| {\bf w} \right\|_2^2\right] \]
where \(\ell_1\) (the l1_ratio argument of ElasticNet) captures the relative amount of \(L_1\) penalty.
\(\ell_1=1 \Rightarrow\) Lasso Regression
\(\ell_1=0 \Rightarrow\) Ridge Regression
\(\alpha=0 \Rightarrow\) OLS Regression
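A minimal sketch of the scikit-learn parametrization above, where alpha is the overall penalty strength \(\alpha\) and l1_ratio is \(\ell_1\) (values here are arbitrary):

```python
# ElasticNet: alpha = overall penalty strength, l1_ratio = relative L1 share.
# l1_ratio=1.0 behaves like lasso, l1_ratio=0.0 like ridge.
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet

X, y = make_regression(n_samples=200, n_features=10, n_informative=3,
                       noise=5.0, random_state=0)

model = ElasticNet(alpha=1.0, l1_ratio=0.5).fit(X, y)
print(model.coef_)
```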
4. Logistic Regression
Logistic regression is used to predict a binary response variable \(y\) based on a set of input features \(x_1,x_2,\cdots,x_p\).
Instead of the actual value of \(y\), we are modeling the probability of success (outcome of interest).
The linear predictor (i.e. \({\bf w}^T{\bf x}_i=w_0+w_1x_{i1}+\cdots +w_px_{ip}\)) is transformed via a logistic (sigmoid) function:
\[P(Y_i=1)=\sigma({\bf w}^T{\bf x}_i)=\frac{1}{1+e^{-{\bf w}^T{\bf x}_i}}=\frac{1}{1+e^{-(w_0+w_1x_{i1}+\cdots +w_px_{ip})}}\]
The logit (log odds) of the probability of success is modeled as the linear predictor:
\[\text{Logit }P(Y_i=1) = \log\left[\frac{P(Y_i=1)}{P(Y_i=0)}\right]=w_0+w_1x_{i1}+w_2x_{i2}+\cdots+w_px_{ip}={\bf w}^T{\bf x}_i\] \[\text{Odds}(Y_i=1) = \frac{P(Y_i=1)}{P(Y_i=0)}=e^{w_0+w_1x_{i1}+w_2x_{i2}+\cdots+w_px_{ip}}=e^{{\bf w}^T{\bf x}_i}\] \(\ \ \ \ \ \ \ \ \) where \(w_0\) is the bias and the \(w_j\)'s are the weights of the input features \(x_j\).
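As a quick illustration (with a hypothetical value): if \(w_1=0.7\), then a one-unit increase in \(x_1\) multiplies the odds of success by \(e^{0.7}\approx 2.01\), holding the other features fixed.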
The linear predictor is a decision boundary: \({\bf w}^T{\bf x}_i=w_0+w_1x_{i1}+\cdots +w_px_{ip}=0\)
If \({\bf w}^T{\bf x}_i\ge 0\Rightarrow \sigma({\bf w}^T{\bf x}_i)\ge 0.5\), the predicted class is 1 \((\hat{y}=1)\).
If \({\bf w}^T{\bf x}_i< 0\Rightarrow \sigma({\bf w}^T{\bf x}_i)< 0.5\), the predicted class is 0 \((\hat{y}=0)\).
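A minimal sketch of logistic regression in scikit-learn (synthetic data for illustration), showing the predicted probabilities \(\sigma({\bf w}^T{\bf x}_i)\) and the resulting class labels:

```python
# Logistic regression: predicted probabilities and the 0.5-threshold class labels.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, n_features=4, random_state=0)

clf = LogisticRegression().fit(X, y)
proba = clf.predict_proba(X[:5])[:, 1]   # P(Y=1) = sigma(w^T x)
pred = clf.predict(X[:5])                # 1 if P(Y=1) >= 0.5, else 0
print(proba.round(3), pred)
```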
5. Support Vector Machine (SVM)
Linear regression
Applicable only for continuous (or quantitative in general) response variables.
By default assumes a linear relationship between predictors and the outcome.
If the relationship is non-linear, transformations or manual specification are required.
Logistic regression
Used for modeling categorical outcome variables.
The model is intrinsically a linear function of the predictor variables (via the log-odds).
An SVM
Can handle both regression and classification problems.
For classification: Support Vector Classifier (SVC).
For regression problems: Support Vector Regressor (SVR).
Unlike regression models that are strictly linear unless specified otherwise, SVMs can be:
Linear, when used with a linear kernel.
Non-linear, by using kernels that implicitly map data to higher dimensions.
What is an SVM?
For Classification (SVC):
Finds the best separating hyperplane that maximizes the margin between the different classes.
Support vectors (SVs) define the decision boundary (line/plane) and are the closest points to that boundary.
Intuitively, SVM does not just separate classes — it separates them with the widest possible gap.
For regression (SVR):
- Instead of a strict boundary, it fits a function that keeps most data points within a margin of tolerance (the “tube”).
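A minimal sketch of an SVR in scikit-learn (synthetic 1-D data; hyperparameter values are arbitrary), where epsilon sets the width of the tolerance tube:

```python
# SVR fits a function that keeps most points inside an epsilon-wide "tube";
# only points on or outside the tube become support vectors.
import numpy as np
from sklearn.svm import SVR

rng = np.random.RandomState(0)
X = np.sort(rng.uniform(0, 5, size=(100, 1)), axis=0)
y = np.sin(X).ravel() + rng.normal(scale=0.1, size=100)

svr = SVR(kernel="rbf", C=1.0, epsilon=0.1).fit(X, y)
print("number of support vectors:", len(svr.support_vectors_))
```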
5.1. Linear SVC
Linear SVC looks for the hyperplane that maximizes the margin for linearly separable data.
The decision boundary for a Linear SVC is similar to that of a logistic regression model:
\[w_0+w_1x_{i1}+w_2x_{i2}+\cdots+w_px_{ip}={\bf w}^T{\bf x}_i=0\]
The SVC chooses \(\bf w\) to minimize the hinge loss (with labels coded as \(y_i\in\{-1,+1\}\)) instead of the log-loss of logistic regression: \[\min_{{\bf w}\in {\cal R}^{p+1}}\sum_{i=1}^n \max(0,1-y_i\;{\bf w}^T{\bf x}_i)\]
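For example, with \(y_i=+1\): a prediction \({\bf w}^T{\bf x}_i=0.3\) gives loss \(\max(0,1-0.3)=0.7\); a confidently correct prediction with \({\bf w}^T{\bf x}_i\ge 1\) gives loss 0; a misclassified point with \({\bf w}^T{\bf x}_i=-0.5\) gives loss 1.5.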
With an \(L_2\) regularization:
\[\min_{{\bf w}\in {\cal R}^{p+1}}\left[C\sum_{i=1}^n \max(0,1-y_i\;{\bf w}^T{\bf x}_i)+\frac{1}{2}||{\bf w}||^2\right]\]
- As in regularized logistic regression, \(C>0\) is a regularization parameter.
What is the difference between Linear SVC and Logistic Regression?
An SVC aims to find the best separating hyperplane that maximizes the margin between the different classes.
- The margin is defined as the distance from the boundary to the closest points of each class.
[Figure: possible decision boundaries of logistic regression (A, B, C) vs. a decision boundary for a Linear SVC]
In logistic and linear regression, all data points contribute to \(\bf w\); this is not true for SVMs.
Only the SVs (i.e., examples that are misclassified or lie close to the boundary) determine \(\bf w\).
If an example is not an SV, removing it has no effect on the model.
How close the SVs are to the boundary plane is controlled by regularization strength (\(C\)).
\(C>0\) balances the trade-off between the margin and the classification error.
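A minimal sketch of a linear SVC in scikit-learn (synthetic data; the C values are arbitrary), illustrating how C controls the regularization strength:

```python
# LinearSVC: smaller C -> stronger regularization (wider margin, more errors allowed);
# larger C -> weaker regularization (narrower margin, fewer training errors).
from sklearn.datasets import make_classification
from sklearn.svm import LinearSVC

X, y = make_classification(n_samples=200, n_features=4, random_state=0)

for C in [0.01, 1.0, 100.0]:
    clf = LinearSVC(C=C, max_iter=10000).fit(X, y)
    print(C, clf.coef_.round(2), clf.intercept_.round(2))
```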
5.2. Kernel SVC
When the classes are not linearly separable, no linear hyperplane can separate them.
However, projecting the data into a higher dimension makes a linear separator sufficient. How?
By simply adding a new set of nonlinear features, obtained from the original ones.
For example, mapping each data point from a 2D point, \((x_1, x_2)\), to a 3D point, \((x_1, x_2, x_2^2)\).
Often we do not know which features to add.
Adding many features (like all possible interactions) might be computationally expensive.
Fortunately, SVMs support a kernel trick that allows learning a classifier in the expanded feature space without actually computing the new representation.
There are two kernels commonly used with SVMs:
Polynomial kernel: computes all possible polynomials up to a certain degree of the original features, such as \((x_1^2\cdot x_2^5)\).
Radial basis function (RBF) kernel (gaussian kernel):
It considers all possible polynomials of all degrees.
But the importance of the features decreases for higher degrees.
\[f({\bf x}_i)=\sum_{h\in \text{SV}}\alpha_hy_h\;K({\bf x}_h,{\bf x}_i)+w_0 \;\;\;\;\; \text{ where }\;\; K({\bf x}_h,{\bf x}_i)=e^{-\gamma||{\bf x}_h-{\bf x}_i||^2}\]
\(\gamma\) represents the kernel width (how far the influence of a single training point reaches) and affects the decision boundary.
The kernel equivalent for a linear SVC is: \[f({\bf x}_i)={\bf w}^T{\bf x}_i+w_0=\sum_{h\in \text{SV}}\alpha_hy_h{\bf x}_h^T{\bf x}_i+w_0\;\;\;\;\; \text{ where }\;\; K({\bf x}_h,{\bf x}_i)={\bf x}_h^T{\bf x}_i\]
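A minimal sketch of a kernelized SVC in scikit-learn (the two-moons synthetic dataset, which is not linearly separable; gamma here corresponds to the kernel width \(\gamma\) above):

```python
# Non-linear classification with an RBF-kernel SVC; gamma is the kernel width.
from sklearn.datasets import make_moons
from sklearn.svm import SVC

X, y = make_moons(n_samples=200, noise=0.2, random_state=0)  # not linearly separable

clf = SVC(kernel="rbf", C=1.0, gamma=0.5).fit(X, y)
print("support vectors per class:", clf.n_support_)
print("training accuracy:", clf.score(X, y))
```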
6. Deep Learning
A Multi-Layer Perceptron (MLP) learns a target function \(f: {\cal X}^p\rightarrow {\cal Y}^q\) for either a classification or a regression problem.
\(p\) is the number of input dimensions
\(q\) is the number of output dimensions.
Unlike linear or logistic regression, there can be one or more non-linear layers, called hidden layers, between the input and the output layer in MLP.
For a set of features \(\bf X\) and a target \(y\), an MLP can be represented as follows.
The leftmost layer is the input layer that consists of a set of neurons \(\{x_j|x_1,x_2,\cdots,x_p\}\) representing the input features.
The rightmost layer is the output layer; for regression and binary classification problems it consists of a single neuron \(\{y\}\).
- For a multi-class classification problem, the output layer has one neuron per class (i.e., more than two output neurons).
Those layers between the input and output layers are called hidden layers, each consisting of a set of neurons.
Each neuron in the hidden layer transforms the values from the previous layer with a weighted linear summation \(w_0+w_1x_1+w_2x_2+\cdots+w_px_p\).
The linear sum is followed by a non-linear activation function \(\phi(\cdot): {\cal R}\rightarrow {\cal R}\) (e.g. logistic (sigmoid), tanh, ReLU).
MLP is a specific type of feedforward neural network.
Feedforward just means no cycles, information flows forward only (input → hidden → output).
Each layer is fully connected (dense).
There is no strict “magic number” for hidden neurons (the default in scikit-learn is one hidden layer with 100 neurons).
But there are some common starting points and rules of thumb people use.
Power of two sizes: 8, 16, 32, 64, 128, 256 → Common because they fit well in memory blocks and scale naturally.
Pyramid shape: First hidden layer has more neurons than later layers (e.g., 64 → 32 → 16) to compress features gradually.
Input–output average rule: Hidden Neurons = (Num Inputs + Num Outputs) / 2
Too few neurons → underfitting (cannot capture complexity)
Too many neurons → overfitting (memorizes training set)
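A minimal sketch of an MLP classifier in scikit-learn (synthetic data; the pyramid-shaped architecture 64 → 32 → 16 is just one of the heuristics listed above):

```python
# MLPClassifier with a pyramid-shaped hidden architecture (64 -> 32 -> 16).
# The scikit-learn default is a single hidden layer with 100 neurons.
from sklearn.datasets import make_classification
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

mlp = MLPClassifier(hidden_layer_sizes=(64, 32, 16),
                    activation="relu",        # non-linear activation phi
                    max_iter=1000,
                    random_state=0).fit(X, y)
print("training accuracy:", mlp.score(X, y))
```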