Machine Learning


Orthogonalization

If you want to solve multiple related questions and the models for them can share features, then a single big model covering all the questions may be an ideal choice.

Error Evaluation

Regression

Linear Regression

Least Squares Estimates, LSE

Residual Sum of Squares, RSS

  • RSS corresponds to the MLE: under normal errors, minimizing RSS is equivalent to maximizing the likelihood
  • Measures the unexplained variance
  • Monotonically decreasing, i.e. "more variables, less RSS"
  • \(\min_{\beta_0, \beta_1}RSS\) gives the least squares coefficient estimates
  • Model Assumption
    • linearity
      • \(\mu(y|x) = \beta_0 + \beta_1x\)
    • normality
      • \(Y|X \sim N(\beta_0 + \beta_1x,\ \sigma^2)\)
    • equal variance

Residual Standard Error, RSE

  • RSE \(= \sqrt{\frac{RSS}{df}} = \sqrt{\frac{RSS}{n-2}}\), with \(df = n-2\) for simple linear regression

\(R^2\)

  • \(R^2 = \frac{TSS - RSS}{TSS}\)
  • \(R^2 = correlation^2\) in the case of simple linear regression

Model selection

Don't use RSS, because…

  • Its monotonically decreasing property, i.e. "more variables, less RSS", builds in an overfitting problem

Use RSE, because…

  • NOT monotonically decreasing

Don't use \(R^2\), because…

  • Monotonically increasing, i.e. "more variables, higher \(R^2\)"
  • It's hard to interpret because there is no significance test for \(R^2\).

Use F test

  • \(\frac{increased\; explained\; variance}{unexplained\; variance}\)
  • \(\frac{explained\; variance}{unexplained\; variance}\)
  • \(\beta_0\) is excluded from the degrees of freedom
  • In R, use anova() (F-test based) alongside the t-test based summary of lm()
    • \(t^2 = F\)
  • Use the F-test as the criterion for each factor variable instead of the t-test, because the t-test only reports results level by level rather than for the factor variable as a whole.

p-value

  • The probability of observing results at least as extreme as the data, given that \(H_0\) is true
  • Don't rely on p-values alone: they can change substantially when a subset is randomly sampled from the dataset, so they tend to give unreliable results

CI, Confidence Interval

  • Mean interval
    • \(\bar{y} \mid x \in (\beta_0 + \beta_1x - 2\sigma_{se}(\beta_0 + \beta_1x), \beta_0 + \beta_1x + 2\sigma_{se}(\beta_0 + \beta_1x))\)
    • This captures the possible change of the entire regression line
    • \(\sigma_{new} = \sigma \sqrt{ \frac{1}{n} + \frac{(x_{new} - \bar{x})^2}{\sum_i(x_{i} - \bar{x})^2}}\) for adding a new estimate

Prediction interval

  • This captures the possible change of an unknown data point
  • \(\sigma_{prediction} = \sigma \sqrt{ 1 + \frac{1}{n} + \frac{(x_{new} - \bar{x})^2}{\sum_i(x_{i} - \bar{x})^2}}\)
    • Prediction error \(E[(y - \hat{y})^2]\)

Training/Testing dataset

\(C_p\)

  • Unbiased estimate of the mean prediction error
  • \(C_p = \frac{RSS}{n} + \frac{2d\,\hat{\sigma}^2}{n}\)
  • \(\sigma^2 = Var(\epsilon)\), the irreducible error variance
  • \(\hat{\sigma}^2 = (RSE_{Full})^2\)
  • Use \(min C_p\) to choose your model
    • Step 1: Select features from \(d=1\) to \(d=19\), calculate each \(C_p\) and make comparison between them
    • Step 2: Kick out the variable which is not significant!
      • leaps::regsubsets with method="exhaustive" searches all subsets
      • Or approximate the global \(\min C_p\) with the forward method

        Step 1: Choose the feature with the best fit in $y \sim x$ model, then keep the best feature in the model
        Step 2: Choose the next best fit by adding one more feature into the previous model
        Step 3: Repeat step 2
  • Note: once \(p\) is larger than \(n\) (\(n < p\)), there is no single best least squares solution!
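A minimal numpy sketch of this forward search, assuming a design matrix `X` (\(n \times p\)) and response `y`, with `sigma2_hat` taken as \((RSE_{Full})^2\) as noted above; the function names are illustrative, not from leaps or any other package:

```python
import numpy as np

def rss(X, y):
    """RSS of an OLS fit of y on X (intercept column included by the caller)."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    r = y - X @ beta
    return r @ r

def forward_cp(X, y, sigma2_hat):
    """Greedy forward selection, keeping the subset that minimizes C_p."""
    n, p = X.shape
    selected, remaining = [], list(range(p))
    best_subset, best_cp = [], np.inf
    while remaining:
        # Step 2: try adding each remaining feature to the current model
        scores = [(rss(np.column_stack([np.ones(n), X[:, selected + [j]]]), y), j)
                  for j in remaining]
        r, j_best = min(scores)
        selected.append(j_best)
        remaining.remove(j_best)
        d = len(selected)
        cp = r / n + 2 * d * sigma2_hat / n   # C_p = RSS/n + 2*d*sigma^2/n
        if cp < best_cp:
            best_cp, best_subset = cp, list(selected)
    return best_subset, best_cp
```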

BIC, Bayesian Information Criterion

  • \(\frac{1}{n}\big(RSS + \log(n)\, d\, \hat{\sigma}^2\big)\)
  • Derived from Bayesian rule
  • BIC prefers smaller model because it penalizes model with a large number of variables

AIC, Akaike Information Criterion

  • \(-2 \text{loglikelihood} + 2(\text{number of parameters})\)
  • Measures the distance (via the log-likelihood) between two models
  • To find the better model, use AIC
  • Unlike \(C_p\), it also applies beyond least squares, e.g. to classification
  • For least squares models, AIC is a constant \(c\) times \(C_p\)
  • Proportional to the probability of the observed data
  • Similar in form to BIC, but with penalty \(2d\) instead of \(\log(n)\,d\)
  • bestglm()

Potential problems

Surrogate variables

  • In a simple regression with one regressor, some explanatory variables may act as a surrogate for other important missing variables.

Reverse regression

  • DON'T flip the model around and use \(y\) to explain \(x\): \(lm(x \sim y)\) is not the algebraic inverse of \(lm(y \sim x)\)

\(p < n\)

  • Ordinary regression assumes \(p < n\)
  • The high-dimensional setting (\(p \geq n\)) cannot be addressed here!

Heteroscedasticity

  • Check the residual plot (\(y\) axis: residuals; \(x\) axis: fitted values \(\hat{y}\)).
  • One solution is to transform \(y\).

Normality Assumption

  • Check that the residuals follow a normal distribution.
  • CAN'T compare \(\beta\) for the same variable in two different models!
    • The meanings of the two models are different
    • The standard errors are different
    • The effects of variables are only well-defined within a specific model
  • CAN'T rank the importance of the variables in a multiple regression model!
    • The \(\beta\) only shows the marginal effect
    • A non-significant result may come from multicollinearity.
  • Interaction term
    • Plot interaction plot
  • Nonlinearity in predictors
    • Plot pairwise scatter plot
  • Collinearity
    • \(se(\beta)\) becomes larger!
    • \((X'X)^{-1}\) blows up as some or all columns of \(X\) become highly correlated!
    • The diagonal of \(X'X\) stays the same, but the off-diagonal entries at \({i, j}\), \(i \neq j\), become larger.

Regression Shrinkage Penalty

\(\lVert y - Xw \rVert _{2}^2 + \lambda \lVert w \rVert_{p}^p\)

πŸ‘ Penalty will always lower the variance πŸ”» and increase the bias πŸ”Ί

Tuning parameter \(\lambda\)

  • Controls the model complexity: the larger the value, the simpler the model
  • \(\lambda \rightarrow \infty\), the shrinkage effect grows

Coefficient \(w\)

  • If coefficients are large, the model becomes more sensitive to the noise; thus, the variance goes up!

Ridge \(_{p=2}\)

\((X^T X + \lambda I )^{-1}X^T Y\)

👍 Performs better when there are lots of non-zero small coefficients

  • Consistent but biased
  • Convex optimization
  • Not scale invariant (standardize features first)
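A one-line numpy sketch of the closed-form solution above (the helper name is mine):

```python
import numpy as np

def ridge(X, y, lam):
    """Closed-form ridge estimate (X^T X + lambda I)^{-1} X^T y.
    Standardize the columns of X first: ridge is not scale invariant."""
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)
```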

Lasso \(_{p=1}\)

πŸ‘ Perform covariate selection

  • Variate selection consistent
  • Convex optimization
  • Scale variant

Stepwise/Streamwise/Stagewise \(_{p=0}\)

πŸ‘ Strong feature selection but no shinkage

  • Scale invariant
  • Use information theory (MDL) to pick \(\lambda\)
  • Minimum Description Length, MDL
    • \(D\) stands for data, and \(D=(x_1, y_1),...,(x_n, y_n)\)
    • \(L(D)=min_{H \in Hypothesis}(L(H)+L(D|H))\)
    • Against overfitting due to the tradeoff between the complexity of the hypothesis and the complexity of data given the hypothesis

Stepwise

πŸ‘ Within each step, pick the best out of all candidate features for inclusion in the model

  • Include linear terms, cross-product terms, and higher order terms
  • Higher model complexity
  • Compute-intensive

Streamwise

πŸ‘ Consider each feature β€˜once’ for inclusion in the model, and the order matters!

  • When the feature set is very large

Stagewise

πŸ‘ Like stepwise but freeze the previous coefficient.

  • For each addition of a feature, regress that feature against the residual
  • Compute faster then stepwise
  • Boosting

All Subset Regression

πŸ‘ Try all subsets of feature combination and compare the errors


Robust Regression


Logistic Regression, LR

\(\log (\prod_i \frac{1}{1+e^{(-y_iw^Tx_i)}})\)

  • \(n >> p\)
  • \(L_0\) loss is less when n is large.
  • \(\log odds = w^Tx\)

Linear Decision Boundary

\(P(Y=1|X) = \frac{1}{1+ e^{(\log \frac{P(Y=0)P(X|Y=0)}{P(Y=1)P(X|Y=1)})}}\)

\(P(Y=1|x, w)=P(Y=-1|x, w)=0.5 \Rightarrow w^Tx = 0\)

  • orthogonal
  • \(x\) is a plane and \(w\) is a vector
  • If the data is linearly separable, MLE fails for LR (the weights diverge); we can use MAP (i.e. add a prior, equivalently a penalty) instead to solve the problem
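A minimal numpy sketch of fitting LR by gradient ascent on the log-likelihood above, assuming labels \(y_i \in \{-1, +1\}\); the name and step size are illustrative:

```python
import numpy as np

def fit_logistic(X, y, lr=0.1, steps=1000):
    """Maximize sum_i log(1 / (1 + exp(-y_i w^T x_i))) by gradient ascent.
    Labels y must be in {-1, +1}, matching the formula above."""
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        z = y * (X @ w)
        # d/dw log sigmoid(z_i) = sigmoid(-z_i) * y_i * x_i
        w += lr * (X.T @ (y / (1.0 + np.exp(z)))) / len(y)
    return w
```

On linearly separable data this loop never converges (\(\lVert w \rVert\) keeps growing), which is the MLE failure noted above; subtracting an \(L_2\) penalty from the objective (MAP with a Gaussian prior) fixes it.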

Linear Discriminant

Linear Discriminant Analysis, LDA

πŸ‘ Fundamental assumption is that the independent variables are normally distributed


Tree

πŸ‘ Goal is to minimize the cost function

\(\min(cost|split)\)

Regression Tree

\(\sum(y-\hat{y})^2\)

Classification Tree

\(\sum(p(1-p))\)

Selection criteria

  • Recursive or greedy algorithm
  • For each branch, \(n_{points} > a\)
  • Set the depth of a tree, \(d > c\)

Remark

  • If the variance is large, it implies the tree is too complex; we can then use bagging and boosting to reduce the variance of the tree.
  • An unbalanced dataset will result in a larger bias, because a few classes dominate the splits.
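A small numpy sketch of the greedy split search, scoring candidate thresholds on a single feature with the Gini impurity \(\sum p(1-p)\) from above (function names are mine):

```python
import numpy as np

def gini(y):
    """Gini impurity sum_k p_k (1 - p_k) of the class labels y."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return np.sum(p * (1 - p))

def best_split(x, y):
    """Greedy min(cost | split): scan thresholds of one feature and
    return the one minimizing the weighted impurity of the children."""
    order = np.argsort(x)
    x, y = x[order], y[order]
    best_t, best_cost = None, np.inf
    for i in range(1, len(x)):
        if x[i] == x[i - 1]:
            continue                       # no boundary between equal values
        cost = (i * gini(y[:i]) + (len(y) - i) * gini(y[i:])) / len(y)
        if cost < best_cost:
            best_t, best_cost = (x[i] + x[i - 1]) / 2, cost
    return best_t, best_cost
```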

PCA

Concept

Eigen Value Decomposition

  • An eigenvalue is the variance of each PC
  • An eigenvector is the loading of each PC, i.e. the weight of each feature for one PC
  • Maximal variance of the data and eigenvectors of the covariance matrix
  • Direction vectors \(v_j\) are called loadings

Reconstruction error

\(min_{\mu,\{\lambda_i\},\mathbf{V}_q}\sum_i^N \lVert x_i-\mu-\mathbf{V}_q\lambda_i \rVert ^2\)

  • \(\mathbf{V}_q\) is a \(p\times q\) matrix with \(q\) orthogonal unit vector used for dimension transformation
  • \(\lambda\) is a \(q\) vector of parameters and \(\hat{\lambda}_i=\mathbf{V}_q^T(x_i-\bar{x})\)
  • \(\mu\) is a location vector in \(\mathbb{R}^p\) and \(\hat{\mu}=\bar{x}\)

Singular Value Decomposition, SVD

\(\mathbf{X}=\mathbf{U}\mathbf{D}\mathbf{V}^T\)

  • \(\mathbf{U}\) and \(\mathbf{V}\) are rotation matrices, or eigenvectors
  • Left singular vector \(\mathbf{U}\) is \(N \times p\) orthogonal matrix \((\mathbf{U}^T\mathbf{U}=\mathbf{I}_p)\)
  • Right singular vector \(\mathbf{V}\) is \(p \times p\) orthogonal matrix \((\mathbf{V}^T\mathbf{V}=\mathbf{I}_p)\)
  • \(\mathbf{D}\) is a \(p \times p\) diagonal matrix whose non-negative diagonal elements, the singular values, are the square roots of the non-zero eigenvalues of \(\mathbf{X}^T\mathbf{X}\), sorted in decreasing order
  • Principal components of \(\mathbf{X}\) are \(\mathbf{UD}\) (\(N \times p\))
  • For each \(x_i\), there is a closest point \(u_{ik}d_kv_k\) on the kth hyperplane
  • \(u_{ik}d_k\) measures distance between two points
  • \(v_k\) measures the direction of the line
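A short numpy sketch of PCA via SVD on toy data, checking that the principal components are \(\mathbf{UD}\) and that the squared singular values match the covariance eigenvalues (\(\lambda_j = d_j^2/(N-1)\) after centering):

```python
import numpy as np

X = np.random.randn(100, 5)          # toy data, N x p
Xc = X - X.mean(axis=0)              # center first (mu_hat = x_bar)

U, d, Vt = np.linalg.svd(Xc, full_matrices=False)
scores = U * d                       # principal components UD, N x p
loadings = Vt.T                      # columns of V are the loadings v_j

# Cross-check against the eigendecomposition of the covariance matrix:
evals = np.linalg.eigvalsh(np.cov(Xc, rowvar=False))[::-1]
assert np.allclose(evals, d**2 / (len(X) - 1))
```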

Principal Curve

Kernel Principal Component

\((x_1, x_2,..., x_n)\) ▶ \(\big( \phi(x_1),\phi(x_2),...,\phi(x_n) \big)\) ▶ PCA

  • Non-linear transformation in PCA

📝

  • \(\mathbf{11}^T\) is an \(n \times n\) matrix of all ones
  • \(\mathbf{1}^T\mathbf{1}=n\)
  • \((\lambda_i,\alpha_i)\) is the eigenpair of \(\mathbf{K}\), denoting its \(i\)-th largest eigenvalue \(\lambda_i\) and its corresponding eigenvector \(\alpha_i\)
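A numpy sketch of the role the \(\mathbf{11}^T\) matrix plays here: centering the kernel matrix in feature space before taking the eigenpairs \((\lambda_i, \alpha_i)\) (a standard construction, sketched under that assumption):

```python
import numpy as np

def center_kernel(K):
    """Center K in feature space: K_c = K - 1K - K1 + 1K1,
    where 1 denotes (1/n) * ones((n, n)), i.e. 11^T / n."""
    n = K.shape[0]
    one = np.ones((n, n)) / n
    return K - one @ K - K @ one + one @ K @ one

# The eigenpairs (lambda_i, alpha_i) of the centered K then give the
# kernel principal components.
```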

Normalized Kernel

\(K(x, x)=1 \quad \forall x \in X\)

Cauchy-Schwarz Inequality

\(K(x, y)^2 \leq K(x, x)K(y, y) \quad \forall x, y \in X\)

Sparse Principal Component

Independent Component Analysis


Bayesian


SVM


Clustering

K-Means K-Medians

\(\mathbf{C}(i)=\arg\min_k \lVert x_i-m_k \rVert ^2\)

👍 Select the number of clusters \(K\); \(m_k\) is the center point of the \(k\)th cluster

👍 \(O(n)\) per iteration

  • Step 1. Given a cluster assignment \(\mathbf{C}\), calculate the centers \((m_1, ..., m_K)\) for each cluster (means for k-means, medians for k-medians).
  • Step 2. Build a new \(\mathbf{C}\) s.t. \(\mathbf{C}(i)=\arg\min_k \lVert x_i-m_k \rVert ^2\); repeat the two steps until the assignment stabilizes (see the sketch below).
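A compact numpy sketch alternating the two steps above; it ignores the empty-cluster edge case, and swapping the mean for the median gives k-medians:

```python
import numpy as np

def kmeans(X, k, iters=100, seed=0):
    """Alternate: assign C(i) = argmin_k ||x_i - m_k||^2, then recompute m_k."""
    rng = np.random.default_rng(seed)
    m = X[rng.choice(len(X), size=k, replace=False)]      # initial centers
    for _ in range(iters):
        # Step 2: assign each point to its nearest center
        C = ((X[:, None, :] - m[None, :, :]) ** 2).sum(-1).argmin(axis=1)
        # Step 1: recompute each cluster's center
        m = np.array([X[C == j].mean(axis=0) for j in range(k)])
    return C, m
```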

Deep Learning

🔗CS231n 2017 Course Summary

Backpropagation

Cross-Entropy

Binary Distribution

\(Cost = \mathbf{J} = - \sum_{i=1}^N \bigg( t_i \log (y_i) + (1-t_i) \log (1-y_i) \bigg)\)

  • \(y_i\) stands for the probability of the output being equal to 1, i.e. the probability of a hit, \(p(hit)\)
  • \(t_i\) stands for the target, the actual outcome as \(0\) or \(1\), i.e. a hit or a miss
  • \(N\) stands for the total number of trials
  • \(i\) stands for the \(i\)-th trial
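A direct numpy transcription of the cost above; the eps clip is an added guard against \(\log(0)\):

```python
import numpy as np

def binary_cross_entropy(t, y, eps=1e-12):
    """J = -sum_i [ t_i log(y_i) + (1 - t_i) log(1 - y_i) ],
    with t in {0, 1} and y the predicted p(hit)."""
    y = np.clip(y, eps, 1 - eps)          # guard against log(0)
    return -np.sum(t * np.log(y) + (1 - t) * np.log(1 - y))
```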

📌 Interpretation
For a convex cost (objective) function, the global minimum has the smallest value. To search for the minimum, we move downhill along the curve, i.e. in the descending direction; that is called gradient descent. On the contrary, to maximize the objective, we do gradient ascent.

Multiple Distribution

\(Cost = \mathbf{J} = - \sum_{i=1}^N \sum_{j=1}^K \bigg( t_j^i \log (y_j^i) \bigg)\)

  • \(K\) stands for \(K\) classes

The Derivative of Cross-Entropy


Multiple Distribution

| Input layer | Weight \(W\) | Hidden layer | Weight \(V\) | Output layer |
|---|---|---|---|---|
| \(D\) features \(\{x_1, ..., x_D\}\) | \(w_{dm}\) | \(M\) features \(\{z_1, ..., z_M\}\) | \(v_{mk}\) | \(K\) classes |

  • Do backpropagation from the output layer to the hidden layer 🔻

\[ \begin{align*} \frac{\partial \mathbf{J}}{\partial v_{mk}} & = -\frac{\partial}{\partial v_{mk}} \bigg( \sum_{i \in N} \sum_{j \in K} t_j^i \log y_j^i \bigg) \\ & = -\sum_{i \in N} \sum_{j \in K} t_j^i \frac{1}{y_j^i}\frac{\partial y_j^i}{\partial a_k}\frac{\partial a_k}{\partial v^k_m} \\ & = -\sum_{i \in N} \bigg( \sum_{j \in K} t_j^i \frac{1}{y_j^i} \frac{\partial y_j^i}{\partial a_k} \bigg) z_m \\ & = -\sum_{i \in N} \sum_{j \in K} t_j^i (I_{j=k} - y^i_k) z_m \\ & = -\sum_{i \in N} \sum_{j \in K} (t_j^iI_{j=k}-t_j^iy^i_k) z_m \\ & = -\sum_{i \in N} \big( t_k^i - y^i_k \big) z_m \end{align*} \]

The second line of the above equation

\(y^i_j=\frac{e^{a_j}}{\sum_{j=1}^K e^{a_{j}}}\)

\(\frac{\partial y^i_{j=k}}{\partial a_k}=y^i_k(1-y^i_k)\)

\(\frac{\partial y^i_{j \neq k}}{\partial a_k}=-y^i_jy^i_k\)

\(\sum_{j \in K} \frac{1}{y^i_j} \frac{\partial y^i_j}{\partial a_k} = \sum_{j \in K} (I_{j=k} - y^i_k)\)

\(a_k={v^k_m}^Tz_m=\sum_m^M({v^k_m}z_m)\)

\(\frac{\partial a_k}{\partial v_m^k}=z_m\)

The last two lines of the above equation

\(\sum_{j \in K} (t_j^iI_{j=k})=t^i_1 \cdot 0 + t^i_2 \cdot 0 +...+ t^i_k \cdot 1 +...+ t^i_K \cdot 0 = t^i_k\)

\(\sum_{j \in K} t_j^iy^i_k=y^i_k \big(t^i_1 + t^i_2 +...+ t^i_k + ... + t^i_K \big)=y^i_k \cdot 1\)

  • For each data point, its target value \(t^i\) belongs only to the \(k\)-th class, so \(t^i_k=1\) and the other \(t^i_{j \neq k}=0\)
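A quick numpy check of the key intermediate step, \(\partial \mathbf{J}/\partial a_k = y_k - t_k\) for one sample, against a centered finite difference:

```python
import numpy as np

def softmax(a):
    e = np.exp(a - a.max())
    return e / e.sum()

rng = np.random.default_rng(0)
a = rng.normal(size=5)                    # pre-activations a_k
t = np.zeros(5); t[2] = 1.0               # one-hot target, t_k = 1

J = lambda a: -np.sum(t * np.log(softmax(a)))
analytic = softmax(a) - t                 # dJ/da_k = y_k - t_k

eps, numeric = 1e-6, np.zeros(5)
for k in range(5):
    d = np.zeros(5); d[k] = eps
    numeric[k] = (J(a + d) - J(a - d)) / (2 * eps)

assert np.allclose(analytic, numeric, atol=1e-6)
```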

Gradient Descent

\(\theta = \theta - \eta \cdot \bigtriangledown_{\theta} \mathbf{J}(\theta)\)

πŸ‘ Objective function, learning rate, and the level of gradient matter

πŸ‘ Adam is the best default choice for all cases

πŸ‘ L-BFGS is a good choice if you can afford to do full batch updates

πŸ‘ Minibatch Gradient Descent is the best default choice

🔗Sebastian Ruder's Website

Concept

  • For each epoch of training:
    • Batch GD uses the entire dataset
    • Stochastic GD uses one data point
    • Mini-Batch GD uses a subset of the dataset to train the network (see the sketch below)
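A generic numpy loop covering all three cases, assuming a user-supplied `grad_fn(w, X_batch, y_batch)`; `batch_size = len(X)` gives batch GD, `1` gives SGD, anything in between gives mini-batch GD:

```python
import numpy as np

def train(X, y, grad_fn, w, lr=0.01, batch_size=64, epochs=30, seed=0):
    """One loop, three flavors of gradient descent, chosen by batch_size."""
    rng = np.random.default_rng(seed)
    for _ in range(epochs):
        idx = rng.permutation(len(X))            # reshuffle every epoch
        for start in range(0, len(X), batch_size):
            b = idx[start:start + batch_size]
            w = w - lr * grad_fn(w, X[b], y[b])  # step on this batch only
    return w
```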

1st and 2nd Order Optimization

  • Newton's method
  • Vanilla second-order update
  • No learning rate needed
  • The Hessian matrix is too big for deep learning problems
  • Works for full batch only

BFGS

L-BFGS


Stochastic Gradient Descent, SGD

\(w_{t+1}=w_{t}-\alpha \bigtriangledown f(w_t)\)

  • What if one direction is more sensitive than the other?
    • For high dimension, this is a common problem
  • local minima or saddle point problem
    • Gradient is zero in these two cases
    • Local minima is common in lower dimension cases
    • For high dimension cases, saddle point is a bigger problem
  • Problem of plateaus results in a slow learning process

  • Note: these problems are not unique to SGD

Momentum

Exponentially Weighted Average

\(v_{t+1}=\rho v_t + (1-\rho)\bigtriangledown_{\theta} f(\theta)\)

\(\theta_{t+1}=\theta_{t}-\alpha v_{t+1}\)

  • It's a kind of moving average technique for the gradient across the training process
  • \(v\) is called velocity, which is a running mean of gradients
  • \(\rho\) functions as friction, usually set to \(0.9\) or \(0.99\)
  • The \((1-\rho)\) factor is optional but highly suggested
Math Intuition

\(v_{t}=\rho v_{t-1}+(1-\rho) \theta_t\)

\((1-\beta)^{1/\beta} \approx e^{-1}, \;\; \beta=1-\rho\)

Bias Correction

\(\frac{v_t}{1-\rho^t}\)

  • As \(t\) grows, \(1-\rho^t\) will be close to 1 and have less effect on \(v_t\)
  • This is used to correct the first few \(v\), e.g. \(v_0, v_1, v_2\)

Nesterov Momentum

πŸ‘ Calculate the gradient after updating the point with a big jump

\(v_{t+1}=\rho v_t -\alpha \bigtriangledown f(\theta_t+\rho v_t)\)

\(\theta_{t+1}=\theta_{t}+v_{t+1}\)

or

\(\tilde{\theta}_t=\theta_t+\rho v_t\)

\(v_{t+1}=\rho v_t - \alpha \bigtriangledown f(\tilde{\theta}_t)\)

\(\tilde{\theta}_{t+1}=\tilde{\theta}_{t}+v_{t+1}+\rho (v_{t+1}-v_t)\)


AdaGrad

\(s \leftarrow s + (\nabla \mathbf{J})^2, \quad \theta \leftarrow \theta - \alpha \frac{\nabla \mathbf{J}}{\sqrt{s}+\epsilon}\)

👍 Learning rate divided by the square root of the sum of squares of historic gradients

  • AdaGrad performs better in convex cases
  • The accumulated squared gradient makes the learning rate become very small as training progresses

RMSProp

  • Introduces a fix of AdaGrad for the non-convex case
  • Rescales \(d\theta\) with a decaying (weighted) average of squared gradients rather than the full accumulated sum; no momentum on the gradient itself

\[ \begin{align*} s_{dw} & = \rho s_{dw} + (1-\rho)dw^2 \\ s_{db} & = \rho s_{db} + (1-\rho)db^2 \\ w & = w - \alpha \frac{dw}{\sqrt{s_{dw}+\epsilon}} \\ b & = b - \alpha \frac{db}{\sqrt{s_{db}+\epsilon}} \end{align*} \]

  • \(d\theta^2\) is element-wise
  • \(d\theta^2\) is used to rescale \(d\theta\) when updating \(\theta\)'s value
  • For a \(\theta\) dimension with larger gradients, the step is divided by a larger \(\sqrt{s}\), which evens out the learning speed across dimensions (the projection of one learning step onto each \(\theta\) dimension is the key to the learning speed) and helps the slow dimensions speed up

Adam

\[ \begin{align*} v_{dw} & = \rho_v v_{dw} + (1-\rho_v)dw \\ v_{db} & = \rho_v v_{db} + (1-\rho_v)db \\ s_{dw} & = \rho_s s_{dw} + (1-\rho_s)dw^2 \\ s_{db} & = \rho_s s_{db} + (1-\rho_s)db^2 \\ v_{dw}^{\prime} & = \frac{v_{dw}}{1-\rho_v^t},s_{dw}^{\prime} = \frac{s_{dw}}{1-\rho_s^t} \\ v_{db}^{\prime} & = \frac{v_{db}}{1-\rho_v^t},s_{db}^{\prime} = \frac{s_{db}}{1-\rho_s^t} \\ w & = w - \alpha \frac{v_{dw}^{\prime}}{\sqrt{s_{dw}^{\prime}+\epsilon}} \\ b & = b - \alpha \frac{v_{db}^{\prime}}{\sqrt{s_{db}^{\prime}+\epsilon}} \end{align*} \]

  • Adaptive moment estimation
  • Combine Momentum, AdaGrad and RMSProp
  • Include weighted average, rescaling, and bias-correction
  • \(\alpha\) has to be tuned
  • \(\rho_v=0.9,\rho_s=0.999,\epsilon=10^{-8}\)
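A numpy sketch of one Adam update, transcribing the equations above (here `t` starts at 1 and the state `v`, `s` persists across calls):

```python
import numpy as np

def adam_step(w, dw, v, s, t, alpha=1e-3, rho_v=0.9, rho_s=0.999, eps=1e-8):
    """Weighted averages of dw and dw^2, bias correction, rescaled step."""
    v = rho_v * v + (1 - rho_v) * dw
    s = rho_s * s + (1 - rho_s) * dw ** 2
    v_hat = v / (1 - rho_v ** t)            # bias correction
    s_hat = s / (1 - rho_s ** t)
    w = w - alpha * v_hat / np.sqrt(s_hat + eps)
    return w, v, s
```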

Nadam


Activation Function

πŸ‘ The purpose of the activation function is for nonlinear transformation

πŸ‘ The default best selection is ReLU

| Activation Function | Range | Description | Derivative \(\frac{\partial g(z)}{\partial z}\) |
|---|---|---|---|
| sigmoid | \((0,1)\) | If \(y\) takes a binary result, i.e. \(0\) and \(1\) | \(g(z)(1-g(z))\) |
| tanh | \((-1,1)\) | If \(y\) is expected to be symmetric over a smaller range | \(1-g(z)^2\), which is \(\approx 0\) for \(\lvert z \rvert > 3\) |
| ReLU | \(\max(0,z)\) | If the range of \(y\) is larger than \(0\) | \(1\) if \(z > 0\), else \(0\) |
| LeakyReLU | \(\max(\alpha \cdot z,z)\) | To reduce the dead-unit effect resulting from ReLU | \(1\) if \(z > 0\), else \(\alpha\) |

  • \(g(z)\) stands for the activation function
  • For sigmoid and tanh the slope goes to \(0\) as \(\lvert z \rvert\) grows, and this slows down the iteration
  • A linear function will not be used as an activation function, because stacking linear layers still gives a linear outcome, so there is no point in building a deeper linear model
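The activations and derivatives from the table, written out as plain numpy helpers:

```python
import numpy as np

def sigmoid(z):     return 1.0 / (1.0 + np.exp(-z))
def d_sigmoid(z):   g = sigmoid(z); return g * (1 - g)

def tanh(z):        return np.tanh(z)
def d_tanh(z):      return 1.0 - np.tanh(z) ** 2

def relu(z):        return np.maximum(0.0, z)
def d_relu(z):      return (z > 0).astype(float)

def leaky_relu(z, alpha=0.01):    return np.maximum(alpha * z, z)
def d_leaky_relu(z, alpha=0.01):  return np.where(z > 0, 1.0, alpha)
```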

Convolutional Neural Network

πŸ‘ Parameter sharing

πŸ‘ Sparsity of connections

  • parameter sharing: a small filter functions as a sliding window which reuses its parameters across the image
    • parameter size \(= (\text{filter's size} +1) \times \text{the number of filters}\)
    • One filter contributes one constant bias parameter \(b\)
  • sparsity of connections: each output value is linked to a small region of the input layer
  • Required computing memory is proportional to the number of pixels in each layer

Computer Vision

Edge Detection


Convolution Layer

Output Dimension

\(D_{out}=\bigg\lfloor\frac{D_{in}+2\times P-D_{f}}{S}\bigg\rfloor+1\)

📝

  • \(D\) stands for dimension
  • \(D_f\) is the dimension of the filter's size
  • \(P\) stands for padding
  • \(S\) stands for stride
  • The output volume is \(D_W \times D_H \times K\), where \(K\) stands for the number of filters, not the channel depth of the image or the filter
  • Computation-intensive (see the helper below)
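The output-size formula as a tiny helper, with a worked example (the numbers are mine, for illustration):

```python
def conv_output_dim(d_in, d_f, padding=0, stride=1):
    """floor((D_in + 2P - D_f) / S) + 1, the formula above."""
    return (d_in + 2 * padding - d_f) // stride + 1

# A 32x32 input with a 5x5 filter, stride 2, no padding -> 14x14 output.
assert conv_output_dim(32, 5, padding=0, stride=2) == 14
```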

Hyperparameters 📝

Number of Parameters Calculation

  • Filter's width \(\times\) height \(\times\) channel/depth \(\times\) \(K + K\) (one bias per filter)

Filter

  • Filter's size \(= a\), or we say kernel \(=a\)
  • \(a \times a \times\) channel/depth (optional), e.g. \(5 \times 5\), \(3 \times 3\), \(1 \times 1\)
  • Choosing the filter's size decides the number of neurons which all target the same spatial region but look for different things
  • If \(K\) filters are used, the output dimension of the transformed image is width \(\times\) height \(\times K\)
  • Different channels/layers of depth are aggregated together after computation with a single filter

Stride

  • Movement of the filter’s window

Padding

  • zero padding keeps the same dims as the input's by simply adding zeros around the periphery
  • valid padding means no padding
  • same padding means the output size is equal to the input size

Example

| Layer | Shape | Size | Parameter size |
|---|---|---|---|
| Input | (32,32,3) | \(32\times32\times3=3072\) | \(0\) |
| Conv (\(f=5, s=2\)) | (12,12,6) | \(12\times12\times6=864\) | \((5\times5+1)\times6=156\) |
| Pool (\(f=2\)) | (6,6,3) | \(6\times6\times3=108\) | \(0\) |
| FC | (64,1) | \(64\times1=64\) | \(108\times64+1=6913\) |
| FC | (32,1) | \(32\times1=32\) | \(64\times32+1=2049\) |
| Softmax | (3,1) | \(3\times1\) | \(32\times3+1=97\) |

More Layers

Fully Connected Layer, FC

Dimension

  • \(n^{[l]} \times n^{[l-1]} + 1\) is the number of parameters for the \(l\)th layer
  • \(n^{[l]}\) is the output's size for the \(l\)th layer
  • Each neuron connects to the entire input volume
  • Memory-intensive
  • AlexNet and VGG have FC layers which make the networks really huge and require more memory

Pooling Layer

  • Type of pooling
    • max pooling
    • average pooling
  • Hyperparameters
    • Common settings are \(D_f=2, D_s=2\) and \(D_f=3, D_s=2\)
  • Goal: downsampling
    • Only width and height will be changed, but depth remains the same
    • \(D_{out}=\frac{D_{in} - D_f}{S} + 1\)
    • No parameters introduced
    • Zero padding is NOT COMMON for pooling due to its downsampling goal

Inception Module

Architecture

| ImageNet | Year | Description | Training time |
|---|---|---|---|
| LeNet-5 | 1998 | 5 layers, plus 2 fully connected layers; \(5\times5\) filters and \(2\times2\) avgpooling only | |
| AlexNet | 2012 | 8 layers, plus 3 fully connected layers; 2 parallel architectures; about 60M parameters; ReLU; local response normalization | |
| VGG16/19 | 2014 | 16-19 layers, plus 3 fully connected layers; forward propagation alone takes 96MB/image (24M values \(\times\) 4 bytes); about 138M parameters; \(3\times3\) filters with same padding and \(2\times2\) maxpooling only | 4 GPUs for 2 to 3 weeks |
| GoogLeNet | 2014 | 22 layers, no fully connected layers | |
| ResNet | 2015 | One residual block with one skip connection | |

Recurrent Neural Network

Truncated Backpropagation

  • check min-char-rnn.py

Gradient Flow

  • largest singular value > 1 : exploding
    • gradient clipping : scale gradient
  • largest singular value < 1 : vanishing
    • LSTM

Vanilla RNN


GRU

  • No cell state
  • Output \(h_t\) only and no \(C_{t}\)

Mathematics

  • \(r_t=\sigma(W_r \cdot [h_{t-1}, x_t]+b_r)\)
  • \(z_t=\sigma(W_z \cdot [h_{t-1}, x_t]+b_z)\)
  • \(\tilde{h_t}=\tanh(W_h \cdot [r_th_{t-1}, x_t]+b_h)\)
  • \(h_t=(1-z_t)\tilde{h_t}+z_th_{t-1}\)

LSTM

  • 1997
  • add cell state

Concept

  • Symmetric product
  • Element-wise product \(\odot\)

Keys

  • 4 Gates:
    • Input \(i\) : whether to write to the cell
    • Forget \(f\) : erase/remove the cell contents
    • Gate \(g\) : what to write to the cell
    • Output \(o\) : how much of the cell to reveal
  • \(c_{t-1}\) and \(h_{t-1}\) are the two inputs from the previous stage
  • \(c_t = f \odot c_{t-1} + i\odot g\)
  • \(h_t = o \odot \tanh(c_t)\)

Mathematics

\(f_t=\sigma(W_f \cdot [h_{t-1}, x_t]+b_f)\)

\(i_t=\sigma(W_i \cdot [h_{t-1}, x_t]+b_i)\)

\(\tilde{C_t}=\tanh(W_C \cdot [h_{t-1}, x_t]+b_C)\)

\(o_t=\sigma(W_o \cdot [h_{t-1}, x_t]+b_o)\)

\(C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C_t}\)

\(h_t = o_t \odot \tanh(C_t)\)
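A numpy sketch of one LSTM step following the equations above; the weight shapes are left to the caller and the parameter names are illustrative:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W_f, W_i, W_C, W_o, b_f, b_i, b_C, b_o):
    """One step; every gate acts on the concatenation [h_{t-1}, x_t]."""
    z = np.concatenate([h_prev, x_t])
    f = sigmoid(W_f @ z + b_f)        # forget gate: erase
    i = sigmoid(W_i @ z + b_i)        # input gate: whether to write
    g = np.tanh(W_C @ z + b_C)        # candidate cell content, C~_t
    o = sigmoid(W_o @ z + b_o)        # output gate: how much to reveal
    c = f * c_prev + i * g            # C_t = f (.) C_{t-1} + i (.) C~_t
    h = o * np.tanh(c)                # h_t = o (.) tanh(C_t)
    return h, c
```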


Generative Adversarial Networks


Training Tricks

πŸ‘ Number of epochs set as 30 should be fine

πŸ‘ Model ensembling

πŸ‘ Multi-crop at test time for images analytics

πŸ‘ Transfer learning

πŸ‘ Shuffle training data for each epoch

πŸ‘ Save the model parameters and restart the training several times

Note

Model ensembling and multi-crop tricks may help slightly improve your model performance on a certain dataset. For production, however, it's better to train a single model with acceptable performance rather than rely on these two tricks.


Accelerate Optimization Process

Normalization

πŸ‘ It’s working with both batch and minibatch gradient descent

πŸ‘ Covariate shift

Intuition

Normalization makes the distances from the minimum point of the loss function comparable along every dimension, which avoids the zig-zag gradient descent pattern during model training.

By intuition, batch norm makes each hidden layer less vulnerable to the changes imposed by the previous layers. This implies that each layer becomes more independent of the others.

Minibatch normalization has a similar effect to regularization or dropout, because the sample mean and variance vary across minibatches and thus introduce additional noise into the model.

Testing

For model testing, it is suggested to use an exponentially weighted average of the means and variances of the minibatches seen during training.


Weight Initialization

πŸ‘ Solution to gradient vanishing/exploding

πŸ‘ For deeper neural netwroks, weights slightly greater/less than 1 will result in an exponential increase/decrease of the weight for deeper layers

\(\text{Var}(w) = \frac{1}{n}\) or \(\frac{2}{n}\)

Method Math Description
\(w_{random}\sqrt{(\frac{2}{n^{[l-1]}})}\)
Xavier initialization \(\text{tanh}\sqrt{(\frac{1}{n^{[l-1]}})}\)
\(\text{tanh}\sqrt{(\frac{2}{n^{[l-1]+n^{[l]}}})}\)

Minibatch

πŸ‘ The concept behind is use a subset called minibatch sampling from the entire training set called batch

πŸ‘ Using minibatch size equal to 1 for each training epoch is called stochastic gradient descent

πŸ‘ For big dataset \((n > 2000)\), typical minibatch size will be \((64, 128, 256, 512)\), which fit in CPU/GPU memory

For batch gradient descent, the lost will be improved monotonely across iterations.


Learning Rate Decay

πŸ‘ Adam optimization functions as an alternative for learning rate decay

πŸ‘ Learning rate annealing, e.g.Β cosine annealing

\(\alpha=\frac{1}{1+r_{decay} \cdot n }\alpha_0\)

\(\alpha=\frac{k}{\sqrt{n}}\alpha_0\)

Exponential decay

\(\alpha=k^{n}\alpha_0\)

Staircase

Manual decay

  • \(k\) is a constant, with \(k < 1\) for exponential decay
  • \(n\) is the epoch number
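The decay schedules above as tiny helpers (the function names are mine):

```python
import numpy as np

def lr_inverse(alpha0, r_decay, n):    # alpha = alpha0 / (1 + r_decay * n)
    return alpha0 / (1 + r_decay * n)

def lr_sqrt(alpha0, k, n):             # alpha = k / sqrt(n) * alpha0
    return k / np.sqrt(n) * alpha0

def lr_exponential(alpha0, k, n):      # alpha = k**n * alpha0, with k < 1
    return k ** n * alpha0
```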

Hyperparameter Selection and Tuning

πŸ‘ Experience is most valuable

πŸ‘ No single one best solution for all kinds of problems across all fields

| Evaluation | Problems | Solutions |
|---|---|---|
| High bias | Training set problem | Bigger network, i.e. a wider layer or a deeper architecture (always helps); train longer (does not always help); different NN architecture |
| High variance | Dev set problem | Collect more data (best way); regularization; different NN architecture |

\(a_j^{[l]\{i\}(m)}\)

| Hyperparameter | Notation | Importance |
|---|---|---|
| learning rate | \(\alpha\) | |
| friction in momentum | \(\rho\) | |
| minibatch size | \(\{i\}\) | |
| number of hidden neurons | \(j\) | |
| number of hidden layers | \([l] \in L\) | |
| learning rate decay | | |

Train/Dev/Test Sets

πŸ‘ Dev/Test sets must come from the same distribution

  • Mismatched train/test distribution is acceptable
  • Test set is not a must

Bias-Variance Tradeoff

πŸ‘ Concept behind is the tradeoff between a general/specialized model

πŸ‘ High train set error implies high bias

πŸ‘ High dev set error implies high variance

πŸ‘ High train/dev set error implies the model is in general bad, this is a common problem in high dimension parameters space

πŸ‘ Bigger network reduces the bias without increasing the variance

πŸ‘ Data augmentation reduces the variance without increasing the variance

Regularization

πŸ‘ Please note that regularization here has a general meaning which is not just referred to the penalty term

| Trick | How to do it | Description |
|---|---|---|
| Penalty term | Additive to the loss function | This is a must-have |
| Dropout | Added to each layer | Functions like the penalty term; for image analytics with neural networks this is a nice property which boosts model performance on testing data |
| Data augmentation | Additional actions | In practice, this takes time and involves more human work |
| Activation function \(\text{tanh}\) | Built-in architecture design | \(\text{tanh}\) centers the weights around \(0\), and values around \(0\) have an approximately linear relationship, which results in a decision boundary with a less nonlinear shape |
| Early stopping | Additional actions | Not suggested, because early stopping interferes with both the optimization process and the fight against overfitting; it makes it harder to figure out the main problem if the network performs badly |
Penalty Term

πŸ‘ Also called weight decay

πŸ‘ Regularization prevents overfitting

πŸ‘ \(L2\) should be the default choice

πŸ‘ \(\lambda\), called regularization parameter, is the hyperparameter which has to be tuned

πŸ‘ The larger the \(\lambda\) is, the stronger the regularization effect is

\[ \begin{align*} \text{Loss} & = \mathbf{J} + \frac{\lambda}{2m}\lVert w^{[l]} \rVert^2_F \\ dw^{[l]} & = \frac{1}{m}dZ^{[l]} A^{[l-1]T}+\frac{\lambda}{m} w^{[l]} \\ w^{[l]} & = w^{[l]}-\alpha\, dw^{[l]} \end{align*} \]

\(\mathbf{J_{\text{regularized}}} = \bigg[-\frac{1}{m} \sum^m_{i=1} \big( y_i \log(a_i)+(1-y_i) \log(1-a_i) \big) \bigg] + \bigg[ \frac{\lambda}{2m} \sum \lVert w \rVert^2_F \bigg]\)

\(dw^{[l]} = \frac{1}{m} dZ^{[l]}A^{[l-1]T}+\frac{\lambda}{m}w^{[l]}\)

| Regularization | Math | Effect | Description |
|---|---|---|---|
| \(L2\) | \(\lVert w \rVert^2_F\) | Most of \(w\) will be smaller values | In convention this is called the Frobenius norm; no feature selection effect |
| \(L1\) | \(\lVert w \rVert_1\) | Some \(w\) will become \(0\) | Strong feature selection effect |
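A one-step numpy sketch of the update above, showing why the \(L2\) penalty is called weight decay (the helper name is mine):

```python
import numpy as np

def step_with_weight_decay(w, dw_data, alpha, lam, m):
    """dw = data gradient + (lambda/m) * w, so each step shrinks w
    by a factor (1 - alpha*lambda/m) before the data gradient acts."""
    dw = dw_data + (lam / m) * w
    return w - alpha * dw
```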

Dropout

πŸ‘ Drop features randomly

πŸ‘ Do not use dropout during testing because that will result in random predictions, and do not keep the keep-probability parameters in the calculations

Implementing dropout in the model is a bit tricky; the standard method is inverted dropout, which rescales the kept activations by the keep probability at training time. The sketch below gives the idea:
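A minimal sketch, assuming a numpy array of activations `a` (the helper name is mine):

```python
import numpy as np

def inverted_dropout(a, keep_prob, training=True, rng=None):
    """Inverted dropout: zero units at train time and rescale by keep_prob,
    so the test-time forward pass needs no extra scaling at all."""
    if not training:
        return a                          # test time: no dropout, no scaling
    if rng is None:
        rng = np.random.default_rng()
    mask = rng.random(a.shape) < keep_prob
    return a * mask / keep_prob           # rescale to keep E[a] unchanged
```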

Interpretation

It forces the neuron to spread out the weights, and has similar effects as \(L2\) norm.

The cost function \(J\) is no longer well-defined when using dropout. And because dropout is a remedy for overfitting, if there is no overfitting problem, you probably should not introduce dropout.


Data Augmentation

πŸ‘ Increasing batch size has the same effect as decaying learning rate


Activation Function

Early Stopping

Sanity Checking

Gradient Checking

πŸ‘ Debug backpropagation code

πŸ‘ Don’t work with dropout layers

πŸ‘ Don’t forget the regularization term

\(d\theta_k^{approx} \approx \frac{\mathbf{J(\theta_1,..,\theta_k+\epsilon,...,\theta_n)}-\mathbf{J(\theta_1,..,\theta_k-\epsilon,...,\theta_n)}}{2\epsilon}\)

Criteria

\(\frac{\lVert d\theta^{approx} - d\theta \rVert_2}{\lVert d\theta^{approx} \rVert_2 + \lVert d\theta \rVert_2} < 10^{-7}\)

  • \(O(\epsilon^2)\)
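A small numpy sketch of the whole check: centered differences on every \(\theta_k\), then the relative-error criterion above (the helper name is mine):

```python
import numpy as np

def grad_check(J, theta, d_theta, eps=1e-7):
    """Compare the analytic gradient d_theta with centered differences."""
    approx = np.zeros_like(theta)
    for k in range(theta.size):
        e = np.zeros_like(theta); e[k] = eps
        approx[k] = (J(theta + e) - J(theta - e)) / (2 * eps)
    # Relative error: < 1e-7 is great; around 1e-3 or above suggests a bug.
    return (np.linalg.norm(approx - d_theta)
            / (np.linalg.norm(approx) + np.linalg.norm(d_theta)))
```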

Notation

\(a^{[\text{l}](i)}_j\)

  • \(i\)-th sample, \(i=1, ..., N\)
  • \(j\)-th neuron, \(j=1, ..., M\)
  • \(l\)-th layer, \(l=1, ..., L\)
  • \(a\) stands for the activation layer
  • Input layer is not counted in
  • \(z\) is the feature of the hidden layer

Reinforcement Learning


Loss

Distance Function

\(\sum I_{y\neq\hat{y}} \log(p(x)) \geq 1\)

\(d_p(x,y) = 0 \iff x=y\)

Norm

\(d_p(x,y)=|x-y|_p\)

  • \(L_1\) and \(L_2\) norms are convex 📝
  • The \(L_0\) pseudo-norm and \(L_{\frac{1}{2}}\) are concave
  • The hinge loss is a solution to the outlier problem

πŸ“ Convex function is continuous and its second derivative is always greater than or equal to zero


Log Likelihood

MLE

MAP


Kernel

πŸ‘ Similarity measure : the larger the kernel, the higher level of similarity


Entropy

\(-\sum_i p_i \log(p_i)\)

\(E_i(-\log(p_i))\)

  • \(P(event) = 1\) : surprisal \(-\log_2 P(event) = 0\)
  • \(P(event) \rightarrow 0\) : surprisal \(\rightarrow \infty\)
  • \(P(event) = 0.5\) : surprisal \(= 1\) bit
  • Uniform distribution has the highest entropy
  • Example: Gini Index

\(Gini=\frac{\text{Area between the line of equality and Lorenz curve}}{\text{Area under the line of equality}}\)


Minimum Description Length, MDL

\(E(length) \geq - \sum P(z_i)\log_2 (P(z_i))\)


Kullback-Leibler Distance, KLD

\(D(new|old) = \sum P(new) \log \frac{P(new)}{P(old)}\)

  • Also called relative entropy
  • NOT symmetric
  • NOT a distance measure
  • NOT a metric
  • Remark: however, if the data is normalized to mimic a distribution, then KLD works
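A direct numpy sketch, normalizing both inputs first as the remark suggests (the eps guard is mine):

```python
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    """D(p || q) = sum_i p_i log(p_i / q_i). Note the asymmetry:
    kl_divergence(p, q) != kl_divergence(q, p) in general."""
    p = np.asarray(p, float); q = np.asarray(q, float)
    p, q = p / p.sum(), q / q.sum()       # normalize to distributions
    return float(np.sum(p * np.log((p + eps) / (q + eps))))
```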

Evaluation

πŸ‘ Single number evaluation metric in use can be speed up the iteration and thus is highly suggested

πŸ‘ F1 score is the harmonic mean of precision and recall

πŸ‘ Evaluation metrix is equal to one optimizing metric plus one or several satisficing metric

To evaluate the model performance, we usually have one optimizing metric, e.g. the accuracy, precision, or recall, plus one or several constraints which we call the satisficing metric here. The constraints might be the running time of the training process, or the weight penalization on the misclassification of a subgroup of classes, etc.

Bayes optimal error is the theoretical minimum error a model can achieve, and in practice we use human-level error to approximate it. The gap between this (approximately Bayes) error and the training error is the avoidable bias; it can be narrowed down, e.g. by including more human expertise. The gap between the training and dev errors, on the other hand, results from the variance.

Confusion Matrix

Evaluation Metric

Intersection over Union, IoU

Tricks

Cross Validation

CV is in general a nice way for small datasets.

Leave-one-out CV, LOOCV

πŸ‘ Learn a classifier on all the training data minus one example and evaluate its error for the remaining

  • When n is small
  • Unbiased estimate of the error

K-fold CV

πŸ‘ Split data into K subsets (usually K = 10)

  • When n is large
  • Unbiased estimate of the error
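A minimal numpy sketch producing the K train/test index splits (the function name is mine):

```python
import numpy as np

def k_fold_indices(n, k=10, seed=0):
    """Yield (train_idx, test_idx) pairs for K-fold CV (usually K = 10)."""
    idx = np.random.default_rng(seed).permutation(n)
    for fold in np.array_split(idx, k):
        yield np.setdiff1d(idx, fold), fold
```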

Bias-Variance Decomposition

πŸ‘ Goal is to minimize the total error

πŸ‘ The sum of the distances between the raw data and its central estimate is called variance

πŸ‘ The sum of the distances between the true target and its central estimate is called bias

πŸ‘ The sum of variance, bias, and noise is the total error

Concept

Assume that in a real world \(y\) can be represented by a function of \(x\) plus a noise term \(\epsilon\). We want to find an approximate function of \(x\) to mimic the true function\(^{2}\).

  • \(^{1}\) \(y=f(x)+\epsilon\)
  • \(^{2}\) \(\hat{y}=\hat{f}(x)\)

\(E_{x,y,D}\big( (\hat{f}(x)-y)^2\big)=E_{x,D}\big((\hat{f}(x)-\bar{\hat{f}}(x))^2\big) + E_{x,y}\big((\bar{\hat{f}}(x)-y)^2\big)\)

1st Term: Model variance

  • A larger variance implies a higher flexibility which enable the model to capture the noise in the training data
  • To reduce model variance
    • Less model complexity
    • Data augmentation

2nd Term: Bias and Noise

\(E_{x,y}\big((\bar{\hat{f}}(x)-y)^2\big)=E_x\big((\bar{\hat{f}}(x)-f(x))^2\big) + E_{x,y}\big((f(x)-y)^2\big)\)

  • 1st term: Bias
    • Reduce the bias with a complex model
  • 2nd term: Sample Variance or Noise
    • Cannot be reduced!
| Complexity/Flexibility | Variance | Bias | Noise | Model | Generality | Problem |
|---|---|---|---|---|---|---|
| Low | Low | High | Unchangeable | Linear | High | Underfitting |
| High | High | Low | Unchangeable | Non-linear | Low | Overfitting |

Note

  • Model variance is different from sample variance
  • Variance is to present the spread of the distribution
  • Bias is to present how far it is between your estimate and your target

Boosting

πŸ‘ A process in which we turn a weak classifier into a strong one

πŸ‘ Goal is to reduce the total error

AdaBoost

\(e_t=\sum_i e \big( F_{t-1}(x_i)+\alpha_th(x_i) \big)\)

  • \(h_i\) is the hypothesis for \(x_i\)
  • Downside is being sensitive to outliers and noise

XGBoost


LPBoost


Bagging

πŸ‘ Boostrap

\(D_1,...,D_m \sim D\)

  • m sets of bootstrap samples, each of sample size n
  • Also called bootstrap aggregating
  • Improves stability and accuracy
  • Use the average of the m samples for regression or the majority vote for classification
  • Bagging does NOT always improve the performance. It may degrade the performance of stable methods, e.g. KNN

\(\lim_{n\rightarrow\infty}(1-\frac{1}{n})^n = e^{-1} \approx 0.368\)

👍 If the sample size is large, the chance of each data point not being chosen in one bootstrap sample is about 0.368

Hardware

NVIDIA GPU

| Net | Batch size | Global memory |
|---|---|---|
| VGG-16 | 32 | |
| VGG-16 | 64 | 8GB |
| VGG-16 | 128 | 14GB |

  • Reduce batch size
  • Reduce the model's size
  • Distribute the model on multiple GPUs

TFLOPS

Easa Project: Car Detection

The section YOLO covers almost all important topics for object detection, including classification, localization, landmark detection, sliding window, fully connected layers with convolutional implementation, grid region, intersection over union, non-max suppression, and anchor boxes.


Reference

| Methods | Year | Title | Summary |
|---|---|---|---|
| R-CNN | 2013 | | |
| Fast R-CNN | 2015 | | |
| Faster R-CNN | 2016 | | |

YOLO

Redmon et al., 2015


Segmentation

Region Proposal

  • Propose regions

  • Convolute over one region or all regions at a time

  • Convolute end-to-end, including region proposals

R-CNN


Fast R-CNN


Faster R-CNN


Raccoon Project: Object Tracking

Object Recognition

Reference

Eunice Tsao

Apr 2019