AI and Machine Learning Concepts

1. Ridge Regression

  • OLS estimated parameters are BLUE (Best Linear Unbiased Estimators).

    • Best: Minimum Variance

    • Unbiased: E(Estimates) = Parameters

  • Possible to produce models with smaller MSEs by allowing the estimates to be biased.

\[MSE(\hat{\bf w})=\underset{\hat{\bf w} - \bar{\hat{\bf w}}}{\underbrace{Var(\hat{\bf w})}}+\underset{\bar{\hat{\bf w}} - {\bf w}}{\underbrace{Bias^2(\hat{\bf w})}}\]

  • Ridge regression is a biased model that adds a penalty (the squared \(L_2\) norm of \({\bf w}\), i.e. the sum of squared weights) to the SSE.

\[\min_{{\bf w} \in {\cal R}^{p+1}} [\text{SSE}(\hat{\bf w})+\text{penalty}]=\min_{{\bf w} \in {\cal R}^{p+1}}\left[\sum_{i=1}^n(y_i-\hat{y}_i)^2+\alpha\sum_{j=1}^pw_j^2\right]= \min_{{\bf w} \in {\cal R}^{p+1}} \left[\sum_{i=1}^n \left\|y_i-{\bf w}^T {\bf x}_i \right\|^2 + \alpha \left\| {\bf w} \right\|^2_2\right] \]

  • The ridge penalty strength \(\alpha\) is a regularization hyperparameter, and its optimal value should be tuned (e.g., via cross-validation).

  • Implication:

    • A small amount of bias might help generalization (decrease the test error) even though training performance decreases.

    • Larger values of the parameter estimates are penalized more than smaller values.

    • Shrinks the estimates towards 0 as the \(\alpha\) penalty becomes large (called shrinkage methods).

    • Used when there is multicollinearity.

  • Note: IN ALL OF THE REGULARIZATION METHODS, THE PENALTY DOES NOT INCLUDE THE BIAS TERM \(w_0\).
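
A minimal ridge regression sketch in scikit-learn on synthetic data; the `alpha` value is illustrative and would normally be tuned (e.g., via cross-validation):

```python
# Ridge regression sketch: L2-penalized linear regression on synthetic data.
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=200, n_features=10, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

ridge = Ridge(alpha=1.0)                 # alpha = L2 penalty strength (illustrative)
ridge.fit(X_train, y_train)

print("bias w0 (not penalized):", ridge.intercept_)
print("weights (shrunk towards 0):", ridge.coef_)
print("test R^2:", ridge.score(X_test, y_test))
```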

2. Lasso Regression

  • Lasso (Least Absolute Shrinkage and Selection Operator)

  • Lasso regression is another biased model that uses an \(L_1\) norm (i.e. the sum of absolute values):

    \[\min_{{\bf w} \in {\cal R}^{p+1}}\ [\text{SSE}(\hat{\bf w})+\text{penalty}]=\min_{{\bf w} \in {\cal R}^{p+1}}\left[\sum_{i=1}^n(y_i-\hat{y}_i)^2+\alpha\sum_{j=1}^p|w_j|\right]= \min_{{\bf w} \in {\cal R}^{p+1}} \left[\sum_{i=1}^n \left\| y_i-{\bf w}^T {\bf x}_i \right\|^2 + \alpha \left\| {\bf w} \right\|_1\right] \]

  • Unlike the \(L_2\) norm (i.e., ridge regression), which penalizes very large coefficients disproportionately more, the \(L_1\) norm penalizes coefficients in proportion to their absolute values.

  • While ridge regression only shrinks the parameter estimates towards 0, lasso can set some of the estimates exactly to zero.

  • Lasso not only performs regularization to improve the model, but it also conducts a form of feature selection.
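
A minimal sketch of lasso's implicit feature selection on synthetic data with only a few informative features; the `alpha` value is illustrative:

```python
# Lasso sketch: the L1 penalty drives some coefficients exactly to zero.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso

X, y = make_regression(n_samples=200, n_features=20, n_informative=5,
                       noise=5.0, random_state=0)

lasso = Lasso(alpha=1.0)                 # alpha = L1 penalty strength (illustrative)
lasso.fit(X, y)

print("non-zero coefficients:", int(np.sum(lasso.coef_ != 0)), "out of", X.shape[1])
```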

3. Elastic Net Regression

  • Elastic Net combines both \(L_1\) and \(L_2\) penalties together (using the effect of both ridge and lasso regressions):

\[\min_{{\bf w} \in {\cal R}^{p+1}}\ [\text{SSE}(\hat{\bf w})+\text{penalty}]=\min_{{\bf w} \in {\cal R}^{p+1}}\left[\sum_{i=1}^n(y_i-\hat{y}_i)^2+\alpha_1\sum_{j=1}^p|w_j| + \alpha_2\sum_{j=1}^pw_j^2\right]= \min_{{\bf w} \in {\cal R}^{p+1}} \left[\sum_{i=1}^n \left\| y_i-{\bf w}^T {\bf x}_i \right\|^2 + \alpha_1 \left\| {\bf w} \right\|_1 + \alpha_2 \left\| {\bf w} \right\|_2^2\right] \]

  • In scikit-learn, it is parametrized differently:

    \[\min_{{\bf w} \in {\cal R}^{p+1}}\ [\text{SSE}(\hat{\bf w})+\text{penalty}]= \min_{{\bf w} \in {\cal R}^{p+1}} \left[\sum_{i=1}^n \left\| y_i-{\bf w}^T {\bf x}_i \right\|^2 + \alpha\, \ell_1 \left\| {\bf w} \right\|_1+\alpha\, (1-\ell_1) \left\| {\bf w} \right\|_2^2\right] \]

where \(\ell_1\) (the `l1_ratio` argument in scikit-learn) captures the relative amount of the \(L_1\) penalty.

  • \(\ell_1=1 \Rightarrow\) Lasso Regression

  • \(\ell_1=0 \Rightarrow\) Ridge Regression

  • \(\alpha=0 \Rightarrow\) OLS Regression
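
A minimal sketch of the scikit-learn parametrization on synthetic data; the `alpha` and `l1_ratio` values are illustrative (scikit-learn's `ElasticNet` also rescales the SSE term by \(1/(2n)\), so its `alpha` is not numerically identical to the \(\alpha\) above):

```python
# Elastic Net sketch: l1_ratio mixes the L1 and L2 penalties.
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet

X, y = make_regression(n_samples=200, n_features=20, n_informative=5,
                       noise=5.0, random_state=0)

# l1_ratio=1 behaves like lasso, l1_ratio=0 like ridge (values illustrative).
enet = ElasticNet(alpha=1.0, l1_ratio=0.5)
enet.fit(X, y)
print("coefficients:", enet.coef_)
```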

4. Logistic Regression

  • Used to predict a binary response variable \(y\) based on a set of input features \(x_1,x_2,\cdots,x_p\).

  • Instead of the actual value of \(y\), we are modeling the probability of success (outcome of interest).

  • The linear predictor (i.e. \({\bf w}^T{\bf x}_i=w_0+w_1x_{i1}+\cdots +w_px_{ip}\)) is transformed via a logistic (sigmoid) function:

    \[P(Y_i=1)=\sigma({\bf w}^T{\bf x}_i)=\frac{1}{1+e^{-{\bf w}^T{\bf x}_i}}=\frac{1}{1+e^{-(w_0+w_1x_{i1}+\cdots +w_px_{ip})}}\]

  • The logit (log odds) of the probability of success is set equal to the linear predictor.

\[\text{Logit }P(Y_i=1) = \log\left[\frac{P(Y_i=1)}{P(Y_i=0)}\right]=w_0+w_1x_{i1}+w_2x_{i2}+\cdots+w_px_{ip}=w_0+\sum\limits_{j=1}^pw_jx_{ij}={\bf w}^T{\bf x}_i\]

\[\text{Odds }(Y_i=1) = \frac{P(Y_i=1)}{P(Y_i=0)}=e^{w_0+w_1x_{i1}+w_2x_{i2}+\cdots+w_px_{ip}}=e^{w_0+\sum\limits_{j=1}^pw_jx_{ij}}=e^{{\bf w}^T{\bf x}_i}\]

\(\ \ \ \ \ \ \ \ \) where \(w_0\) is the bias and the \(w_j\)'s are the weights of the input features \(x_j\)'s.

  • The linear predictor is a decision boundary: \({\bf w}^T{\bf x}_i=w_0+w_1x_{i1}+\cdots +w_px_{ip}=0\)

    • If \({\bf w}^T{\bf x}_i\ge 0\Rightarrow \sigma({\bf w}^T{\bf x}_i)\ge 0.5\), the predicted class is 1 \((\hat{y}=1)\).

    • If \({\bf w}^T{\bf x}_i< 0\Rightarrow \sigma({\bf w}^T{\bf x}_i)< 0.5\), the predicted class is 0 \((\hat{y}=0)\).
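
A minimal sketch connecting the fitted linear predictor, the sigmoid, and the 0.5 decision rule, using synthetic data:

```python
# Logistic regression sketch: probability = sigmoid of the linear predictor.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, n_features=4, random_state=0)

clf = LogisticRegression()
clf.fit(X, y)

z = clf.decision_function(X[:1])            # linear predictor w^T x (includes w0)
p = 1.0 / (1.0 + np.exp(-z))                # sigmoid of the linear predictor
print(p, clf.predict_proba(X[:1])[:, 1])    # these two agree
print(clf.predict(X[:1]), (z >= 0).astype(int))  # predicted class 1 iff w^T x >= 0
```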

5. Support Vector Machine (SVM)

  • Linear regression

    • Applicable only for continuous (or quantitative in general) response variables.

    • By default assumes a linear relationship between predictors and the outcome.

    • If the relationship is non-linear, transformations or manual specification are required.

  • Logistic regression

    • Used for modeling categorical outcome variables.

    • The model is intrinsically a linear function of the predictor variables (via the log-odds).

  • An SVM

    • Can handle both regression and classification problems.

      • For classification: Support Vector Classifier (SVC).

      • For regression problems: Support Vector Regressor (SVR).

    • Unlike regression models that are strictly linear unless specified otherwise, SVMs can be:

      • Linear, when used with a linear kernel.

      • Non-linear, by using kernels that implicitly map data to higher dimensions.

What is an SVM?

  • For Classification (SVC):

    • Finds the best separating hyperplane that maximizes the margin between the different classes.

    • Support vectors (SVs) define the decision boundary (line/plane) and are the closest points to that boundary.

    • Intuitively, SVM does not just separate classes — it separates them with the widest possible gap.

  • For regression (SVR):

    • Instead of a strict boundary, it fits a function that keeps most data points within a margin of tolerance (the “tube”).
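
A minimal SVR sketch on synthetic data; `epsilon` sets the half-width of the tolerance tube, and the values below are illustrative:

```python
# SVR sketch: fit a function that keeps most points inside an epsilon-tube.
from sklearn.datasets import make_regression
from sklearn.svm import SVR

X, y = make_regression(n_samples=200, n_features=3, noise=5.0, random_state=0)

svr = SVR(kernel="rbf", C=1.0, epsilon=0.1)   # epsilon = half-width of the tube
svr.fit(X, y)
print("number of support vectors:", len(svr.support_))
```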

5.1. Linear SVC

  • Linear SVC looks for the hyperplane that maximizes the margin for linearly separable data.

  • The decision boundary for a Linear SVC is similar to that of a logistic regression model:

    \[w_0+w_1x_{i1}+w_2x_{i2}+\cdots+w_px_{ip}={\bf w}^T{\bf x}_i=0\]

  • The SVC aims to choose \(\bf w\) that minimizes the hinge loss instead of the log-loss of logistic regression: \[\min_{{\bf w}\in {\cal R}^{p+1}}\sum_{i=1}^n \max(0,1-y_i\;{\bf w}^T{\bf x}_i)\]

  • Note (assuming the class labels are encoded as \(y_i\in\{-1,+1\}\); a numeric check follows this list):

  • \(y_i=1\) and \({\bf w}^T{\bf x}_i>1\): hinge loss = 0

  • \(y_i=1\) and \(0<{\bf w}^T{\bf x}_i<1\): 0 < hinge loss < 1

  • \(y_i=1\) and \({\bf w}^T{\bf x}_i<0\): hinge loss > 1

  • \(y_i=-1\) and \({\bf w}^T{\bf x}_i<-1\): hinge loss = 0

  • \(y_i=-1\) and \({\bf w}^T{\bf x}_i>0\): hinge loss > 1
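
A small numeric check of the cases above (pure NumPy; the values of \({\bf w}^T{\bf x}_i\) are made up for illustration):

```python
# Hinge loss max(0, 1 - y * f(x)) for labels y in {-1, +1}.
import numpy as np

def hinge(y, fx):
    return np.maximum(0.0, 1.0 - y * fx)

print(hinge(+1,  1.5))   # 0.0 -> correct side, outside the margin
print(hinge(+1,  0.5))   # 0.5 -> correct side, but inside the margin
print(hinge(+1, -0.5))   # 1.5 -> wrong side of the boundary
print(hinge(-1, -1.5))   # 0.0 -> correct side, outside the margin
print(hinge(-1,  0.5))   # 1.5 -> wrong side of the boundary
```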

  • With an \(L_2\) regularization:

  • \[\min_{{\bf w}\in {\cal R}^{p+1}}\left[C\sum_{i=1}^n \max(0,1-y_i\;{\bf w}^T{\bf x}_i)+\frac{1}{2}||{\bf w}||^2\right]\]

    • As in regularized logistic regression, \(C>0\) is a regularization parameter.
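
A minimal LinearSVC sketch on synthetic data; `loss="hinge"` is passed to match the formula above (scikit-learn's default for `LinearSVC` is the squared hinge), and the value of `C` is illustrative:

```python
# Linear SVC sketch: hinge loss + L2 regularization, with C setting the trade-off.
from sklearn.datasets import make_classification
from sklearn.svm import LinearSVC

X, y = make_classification(n_samples=200, n_features=4, random_state=0)

# Larger C -> less regularization (narrower margin, fewer training errors).
svc = LinearSVC(C=1.0, loss="hinge", dual=True, max_iter=10_000)
svc.fit(X, y)
print("weights w:", svc.coef_, "bias w0:", svc.intercept_)
```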

What is the difference between Linear SVC and Logistic Regression?

    • An SVC aims to find the best separating hyperplane that maximizes the margin between the different classes.

      • The margin is defined as the distance from the boundary to the closest points of each class.

    [Figure: several possible decision boundaries under logistic regression (A, B, C) vs. the single maximum-margin decision boundary chosen by a Linear SVC.]

    • In logistic and linear regression, all points contribute to \(\bf w\), which is not true for SVMs.

      • Only the SVs (i.e., examples that are misclassified or close to the boundary) determine \(\bf w\).

      • If an example is not an SV, removing it has no effect on the model.

      • How close the SVs are to the boundary plane is controlled by regularization strength (\(C\)).

      • \(C>0\) balances the trade-off between the margin and classification error.

5.2. Non-Linear SVC (Kernels)

    • When the classes are not linearly separable, using a linear hyperplane to separate them is not feasible.

    • However, projecting the data into a higher dimension makes a linear separator sufficient. How?

      • By simply adding a new set of nonlinear features, obtained from the original ones.

      • For example, mapping each data point from a 2D point, \((x_1, x_2)\), to a 3D point, \((x_1, x_2, x_2^2)\) (see the sketch after this list).

      • Often we do not know which features to add.

      • Adding many features (like all possible interactions) might be computationally expensive.
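
A minimal sketch of this manual feature-map idea on made-up 2D data; the labels below depend only on \(x_2^2\), so the single added feature is enough to make the classes linearly separable:

```python
# Manually adding a non-linear feature so a linear separator becomes sufficient.
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
x2d = rng.uniform(-3, 3, size=(200, 2))        # original 2D points (x1, x2)
y = (x2d[:, 1] ** 2 > 2).astype(int)           # classes not linearly separable in 2D

# Add the non-linear feature x2^2 to get 3D points (x1, x2, x2^2).
x3d = np.column_stack([x2d[:, 0], x2d[:, 1], x2d[:, 1] ** 2])

print(LinearSVC(max_iter=10_000).fit(x2d, y).score(x2d, y))  # limited accuracy in 2D
print(LinearSVC(max_iter=10_000).fit(x3d, y).score(x3d, y))  # (near-)perfect in 3D
```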

    • Fortunately, the kernel trick in an SVM allows learning a non-linear decision boundary without actually computing the new high-dimensional representation.

    • There are two kernels commonly used with SVMs:

      • Polynomial kernel: computes all possible polynomials up to a certain degree of the original features, such as \((x_1^2\cdot x_2^5)\).

      • Radial basis function (RBF) kernel (gaussian kernel):

        • It considers all possible polynomials of all degrees.

        • But the importance of the features decreases for higher degrees.

    \[f({\bf x}_i)=\sum_{h\in \text{SV}}\alpha_hy_h\;K({\bf x}_h,{\bf x}_i)+w_0 \;\;\;\;\; \text{ where }\;\; K({\bf x}_h,{\bf x}_i)=e^{-\gamma||{\bf x}_h-{\bf x}_i||^2}\]

    • \(\gamma\) represents kernel width (how far the influence of a point reaches) that affects the decision boundary.

    • The kernel equivalent for a linear SVC is: \[f({\bf x}_i)={\bf w}^T{\bf x}_i+w_0=\sum_{h\in \text{SV}}\alpha_hy_h{\bf x}_h^T{\bf x}_i+w_0\;\;\;\;\; \text{ where }\;\; K({\bf x}_h,{\bf x}_i)={\bf x}_h^T{\bf x}_i\]
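
A minimal RBF-kernel SVC sketch on synthetic two-moons data; `gamma` corresponds to the kernel width \(\gamma\) above, and the values below are illustrative:

```python
# Non-linear SVC sketch with the RBF (Gaussian) kernel.
from sklearn.datasets import make_moons
from sklearn.svm import SVC

X, y = make_moons(n_samples=200, noise=0.2, random_state=0)

svc = SVC(kernel="rbf", C=1.0, gamma=0.5)   # gamma = how far a point's influence reaches
svc.fit(X, y)
print("support vectors per class:", svc.n_support_)
print("training accuracy:", svc.score(X, y))
```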

6. Deep Learning

    • A Multi-Layer Perceptron (MLP) learns a target function \(f: {\cal X}^p\rightarrow {\cal Y}^q\) for either a classification or a regression problem.

      • \(p\) is the number of input dimensions

      • \(q\) is the number of output dimensions.

    • Unlike linear or logistic regression, an MLP can have one or more non-linear layers, called hidden layers, between the input and the output layer.

    • For a set of features \(\bf X\) and a target \(y\), an MLP can be represented as a layered network of neurons.

    • The leftmost layer is the input layer, which consists of a set of neurons \(\{x_1,x_2,\cdots,x_p\}\) representing the input features.

    • The rightmost layer is the output layer; for regression and binary classification problems it consists of a single neuron \(\{y\}\).

      • For a multi-class classification problem, there is one output neuron per class (i.e., more than two).

    • Those layers between the input and output layers are called hidden layers, each consisting of a set of neurons.

      • Each neuron in the hidden layer transforms the values from the previous layer with a weighted linear summation \(w_0+w_1x_1+w_2x_2+\cdots+w_px_p\).

      • The linear sum is followed by a non-linear activation function \(\phi(\cdot): R\rightarrow R\) (e.g. logistic (sigmoid), tanh, ReLU), as in the sketch below.
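
A minimal NumPy sketch of that hidden-layer computation (random weights and made-up dimensions; this only illustrates the linear-sum-then-activation pattern, not training):

```python
# One hidden layer: weighted linear sum followed by a non-linear activation (ReLU).
import numpy as np

rng = np.random.default_rng(0)
p, hidden, q = 4, 8, 1                  # input dim, hidden neurons, output dim (made up)

x = rng.normal(size=p)                  # one input example
W1, b1 = rng.normal(size=(hidden, p)), np.zeros(hidden)
W2, b2 = rng.normal(size=(q, hidden)), np.zeros(q)

h = np.maximum(0.0, W1 @ x + b1)        # phi(w0 + w1*x1 + ... + wp*xp), phi = ReLU
y_hat = W2 @ h + b2                     # output layer (linear, e.g. for regression)
print(h.shape, y_hat.shape)
```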

    • MLP is a specific type of feedforward neural network.

    • Feedforward just means there are no cycles: information flows forward only (input → hidden → output).

    • Each layer is fully connected (dense).

    • There is no strict “magic number” for hidden neurons (the default in scikit-learn is 100).

    • But there are some common starting points and rules of thumb people use.

      • Power of two sizes: 8, 16, 32, 64, 128, 256 → Common because they fit well in memory blocks and scale naturally.

      • Pyramid shape: First hidden layer has more neurons than later layers (e.g., 64 → 32 → 16) to compress features gradually.

      • Input–output average rule: Hidden Neurons = (Num Inputs + Num Outputs) / 2

    • Too few neurons → underfitting (cannot capture complexity)

    • Too many neurons → overfitting (memorizes training set)
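
A minimal scikit-learn MLP sketch on synthetic data; the pyramid-shaped `hidden_layer_sizes` below just illustrates the rules of thumb above, and all values are illustrative:

```python
# MLP classifier sketch with pyramid-shaped hidden layers (64 -> 32 -> 16).
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

mlp = MLPClassifier(hidden_layer_sizes=(64, 32, 16),   # scikit-learn default is (100,)
                    activation="relu", max_iter=1000, random_state=0)
mlp.fit(X_train, y_train)
print("test accuracy:", mlp.score(X_test, y_test))
```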