Machine Learning
Orthogonalization
If there are multiple questions you want to solve and features can be shared across them, then a single big model covering all questions may be an ideal choice rather than building one model per question.
Error Evaluation
Regression
Linear Regression
Least Squares Estimates, LSE
Residual Sum of Squares, RSS
- RSS corresponds to the MLE (minimizing RSS is equivalent to maximizing the Gaussian likelihood)
- Implies the unexplained variance
- Monotonically decreasing, i.e., "more variables, less RSS"
- The minimizers of \(\min_{\beta_0, \beta_1}RSS\) are called the least squares coefficient estimates
Model Assumptions
- linearity
- \(\mu(y|x) = \beta_0 + \beta_1x\)
- normality
- \(Y|X \sim N(\beta_0 + \beta_1x,\ \sigma^2)\), i.e., the conditional variance is the constant \(\sigma^2\)
- equal variance
Residual Standard Error, RSE
- RSE \(= \sqrt{\frac{RSS}{df}}\) = \(\sqrt\frac{RSS}{n-2}\)
\(R^2\)
- \(R^2 = \frac{TSS - RSS}{TSS}\)
- \(R^2 = correlation^2\) in the case of simple linear regression
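A minimal NumPy sketch of these quantities (the toy data and variable names are illustrative, not from the text):

```python
import numpy as np

# Toy data: y = 2 + 3x + noise
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 50)
y = 2 + 3 * x + rng.normal(0, 2, 50)

# Least squares estimates for a simple linear regression
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
y_hat = b0 + b1 * x

rss = np.sum((y - y_hat) ** 2)            # residual sum of squares
tss = np.sum((y - y.mean()) ** 2)         # total sum of squares
rse = np.sqrt(rss / (len(x) - 2))         # residual standard error, df = n - 2
r2 = (tss - rss) / tss                    # equals corr(x, y)**2 for simple regression

print(rss, rse, r2, np.corrcoef(x, y)[0, 1] ** 2)
```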
Model selection
Don't use RSS, because…
- The monotonically decreasing property, i.e., "more variables, less RSS", results in an embedded overfitting problem
Use RSE, because…
- NOT monotonically decreasing
Don't use \(R^2\), because…
- Monotonically increasing, i.e., "more variables, higher \(R^2\)"
- It's hard to interpret because there is no test for \(R^2\).
Use F test
- \(\frac{increased\; explained\; variance}{unexplained\; variance}\)
- \(\frac{explained\; variance}{unexplained\; variance}\)
- \(\beta_0\) is excluded from the degrees of freedom
- In R, use `Anova()` (F-test based) compared to `lm()` (t-test based)
- \(t^2 = F\)
- Use the F-test as the criterion for each factor variable, instead of the t-test, because the t-test only shows results for individual levels of the factor variable, not for the factor as a whole.
p-value
- The probability of observing results at least as extreme as the data, given that \(H_0\) is true
- Don't rely on the p-value alone, because it tends to give unreliable (varying) results when a subset is randomly sampled from the dataset
CI, Confidence Interval
- Mean interval
- \(\bar{y} \mid x \in (\beta_0 + \beta_1x - 2\sigma_{se}(\beta_0 + \beta_1x), \beta_0 + \beta_1x + 2\sigma_{se}(\beta_0 + \beta_1x))\)
- This captures the possible change of the entire regression line
- \(\sigma_{new} = \sigma \sqrt{ \frac{1}{n} + \frac{(x_{new} - \bar{x})^2}{\sum_i(x_{i} - \bar{x})^2}}\) for the estimated mean at a new \(x_{new}\)
Predictive interval
- This captures the possible change of an unknown data point
- \(\sigma_{prediction} = \sigma \sqrt{ 1 + \frac{1}{n} + \frac{(x_{new} - \bar{x})^2}{\sum_i(x_{i} - \bar{x})^2}}\)
- Prediction error \(E[(y - \hat{y})^2]\)
Training/Testing dataset
\(C_p\)
- Unbiased estimate of the mean prediction error
- \(C_p = \frac{RSS}{n} + \frac{2}{n} d \cdot \sigma^2\)
- \(\sigma^2 = Var(\epsilon)\), the error variance
- \(\hat{\sigma^2} = (RSE_{Full})^2\)
- Use \(min C_p\) to choose your model
- Step 1: Select features from \(d=1\) to \(d=19\), calculate each \(C_p\), and compare them
- Step 2: Kick out the variables which are not significant!
- Find the global \(\min C_p\) with `leaps::regsubsets(method = "exhaustive")`
- Forward method: Step 1: choose the feature with the best fit in a \(y \sim x\) model, then keep that best feature in the model. Step 2: choose the next best fit by adding one more feature to the previous model. Step 3: repeat Step 2 (a sketch follows this list)
- Note: once \(p\) is larger than \(n\) (\(n < p\)), there is no single best solution!
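A rough sketch of the forward method, using \(C_p\) as the comparison criterion (the `cp` and `forward_selection` helpers are illustrative assumptions; in R this role is played by `leaps::regsubsets`):

```python
import numpy as np

def cp(X, y, sigma2_full):
    """Mallows' Cp = RSS/n + (2/n)*d*sigma^2 for the subset of columns in X."""
    n, d = X.shape
    Xd = np.column_stack([np.ones(n), X])            # add the intercept
    beta, *_ = np.linalg.lstsq(Xd, y, rcond=None)
    rss = np.sum((y - Xd @ beta) ** 2)
    return rss / n + 2 * d * sigma2_full / n

def forward_selection(X, y, sigma2_full):
    """Greedy forward method: add the single best feature at each step."""
    remaining, selected, path = list(range(X.shape[1])), [], []
    while remaining:
        scores = [(cp(X[:, selected + [j]], y, sigma2_full), j) for j in remaining]
        score, j = min(scores)                       # best one-feature addition
        selected.append(j)
        remaining.remove(j)
        path.append((score, list(selected)))
    return min(path)                                 # subset with the smallest Cp on the path
```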
BIC, Bayesian Information Criterion
- \(\frac{RSS}{n} + \frac{\log(n)}{n}\, d \cdot \sigma^2\)
- Derived from Bayesian rule
- BIC prefers smaller models because it penalizes models with a large number of variables
AIC, Akaike Information Criterion
- \(-2 \text{loglikelihood} + 2(\text{number of parameters})\)
- Measures the distance (in log-likelihood) between two models
- To find the better model, use AIC.
- For classification, use \(C_p\)
- A constant \(c\) times \(C_p\)
- Proportional to the probability of the observed data
- Proportional to BIC
bestglm()
Potential problems
Surrogate
- In a simple regression with one regressor, the explanatory variable may act as a surrogate for other important missing variables.
Reverse regression
- DON'T swap the response and the predictor, i.e., don't use \(y\) to explain \(x\) by reversing the model \(lm(y \sim x)\)
\(p < n\)
- We use regression to deal with data where \(p < n\)
- The high-dimensional setting cannot be addressed here!
Heteroscedasticity
- Check the residual plot (\(y\) axis: residuals; \(x\) axis: fitted values \(\hat{y}\)).
- One solution is to transform \(y\).
Normality Assumption
- Check residuals follow normal distribution.
- CAN'T compare \(\beta\) for the same variable in two different models!
- The meanings of the two models are different
- The standard errors are different
- The effects of variables are only well-defined within the specific model
- CAN'T rank the importance of the variables in a multiple regression model!
- The \(\beta\) only shows the marginal effect
- A non-significant result may be caused by multicollinearity.
- Interaction term
- Plot interaction plot
- Nonlinearity in predictors
- Plot pairwise scatter plot
- Collinearity
- \(se(\beta)\) becomes larger!
- \((X'X)^{-1}\) becomes larger as some or all \(x\) are highly correlated!
- The diagonal entries of \((X'X)\) remain the same; however, the covariance at \((i, j)\) with \(i \neq j\) becomes larger.
Regression Shrinkage Penalty
\(\lVert y - wx \rVert _{2}^2 + \lambda \lVert w \rVert_{p}^p\)
Penalty will always lower the variance ↓ and increase the bias ↑
Tuning parameter \(\lambda\)
- Controls the model complexity: the larger the value, the simpler the model
- As \(\lambda \rightarrow \infty\), the shrinkage effect grows
Coefficient \(w\)
- If coefficients are large, the model becomes more sensitive to the noise; thus, the variance goes up!
Ridge \(_{p=2}\)
\((X^TX + \lambda I )^{-1}X^TY\)
Performs better when there are lots of non-zero small coefficients
- Consistent but biased
- Convex optimization
- Scale variant
Lasso \(_{p=1}\)
Performs covariate selection
- Variable-selection consistent
- Convex optimization
- Scale variant
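A hedged scikit-learn sketch contrasting the two penalties on synthetic data (assumes scikit-learn is installed; `alpha` plays the role of \(\lambda\)):

```python
import numpy as np
from sklearn.linear_model import Ridge, Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
beta = np.array([3, 0, 0, 1.5, 0, 0, 0, 2, 0, 0])   # sparse true coefficients
y = X @ beta + rng.normal(size=100)

ridge = Ridge(alpha=1.0).fit(X, y)   # p=2: shrinks all coefficients toward 0
lasso = Lasso(alpha=0.1).fit(X, y)   # p=1: sets some coefficients exactly to 0

print(np.round(ridge.coef_, 2))      # everything small but non-zero
print(np.round(lasso.coef_, 2))      # covariate selection: exact zeros appear here
```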
Stepwise/Streamwise/Stagewise \(_{p=0}\)
Strong feature selection but no shrinkage
- Scale invariant
- Use information theory (MDL) to pick \(\lambda\)
- Minimum Description Length, MDL
- \(D\) stands for data, and \(D=(x_1, y_1),...,(x_n, y_n)\)
- \(L(D)=min_{H \in Hypothesis}(L(H)+L(D|H))\)
- Against overfitting due to the tradeoff between the complexity of the hypothesis and the complexity of data given the hypothesis
Stepwise
π Within each step, pick the best out of all candidate features for inclusion in the model
- Include linear terms, cross-product terms, and higher order terms
- Higher model complexity
- Compute-intensive
Streamwise
Consider each feature "once" for inclusion in the model, and the order matters!
- When the feature set is very large
Stagewise
Like stepwise but freezes the previous coefficients.
- For each addition of a feature, regress that feature against the residual
- Computes faster than stepwise
- Boosting
All Subset Regression
π Try all subsets of feature combination and compare the errors
Robust Regression
Logistic Regression, LR
\(\log (\prod_i \frac{1}{1+e^{(-y_iw^Tx_i)}})\)
- \(n >> p\)
- \(L_0\) loss is less when n is large.
- \(\log odds = w^Tx\)
Linear Decision Boundary
\(P(Y=1|X) = \frac{1}{1+ e^{(\log \frac{P(Y=0)P(X|Y=0)}{P(Y=1)P(X|Y=1)})}}\)
\(P(Y=1|x, w)=P(Y=-1|x, w)=0.5 \Rightarrow w^Tx = 0\)
- \(w\) is orthogonal to the decision boundary: the boundary \(w^Tx = 0\) is a plane in \(x\)-space and \(w\) is its normal vector
- If the data is linearly separable, `MLE` fails for LR, and we can use `MAP` instead to solve the problem
Linear Discriminant
Linear Discriminant Analysis, LDA
π Fundamental assumption is that the independent variables are normally distributed
Tree
π Goal is to minimize the cost function
\(\min(cost|split)\)
Regression Tree
\(\sum(y-\hat{y})^2\)
Classification Tree
\(\sum(p(1-p))\)
Selection criteria
- Recursive or greedy algorithm
- For each branch, \(n_{points} > a\)
- Set the depth of a tree, \(d > c\)
Remark
- If the variance is large, it implies the tree is complex; we can then use `bagging` and `boosting` to reduce the variance of the tree.
- An unbalanced dataset will result in a larger bias, which may result from the domination of a few classes.
PCA
Concept
Eigen Value Decomposition
- eigenvalue is the variance of each PC
- eigenvector is the loading of each PC, or we say the weight of each feature for one PC
- Maximal variance of the data and eigenvectors of the covariance matrix
- Direction vector \(v_j\) called loadings
Reconstruction error
\(min_{\mu,\{\lambda_i\},\mathbf{V}_q}\sum_i^N \lVert x_i-\mu-\mathbf{V}_q\lambda_i \rVert ^2\)
- \(\mathbf{V}_q\) is a \(p\times q\) matrix with \(q\) orthogonal unit vector used for dimension transformation
- \(\lambda\) is a \(q\) vector of parameters and \(\hat{\lambda}_i=\mathbf{V}_q^T(x_i-\bar{x})\)
- \(\mu\) is a location vector in \(\mathbb{R}^p\) and \(\hat{\mu}=\bar{x}\)
Singular Value Decomposition, SVD
\(\mathbf{X}=\mathbf{U}\mathbf{D}\mathbf{V}^T\)
- \(\mathbf{U}\) and \(\mathbf{V}\) are rotation matrices, or eigenvectors
- Left singular vector \(\mathbf{U}\) is \(N \times p\) orthogonal matrix \((\mathbf{U}^T\mathbf{U}=\mathbf{I}_p)\)
- Right singular vector \(\mathbf{V}\) is \(p \times p\) orthogonal matrix \((\mathbf{V}^T\mathbf{V}=\mathbf{I}_p)\)
- \(\mathbf{D}\) holds the square roots of the non-zero eigenvalues
- \(\mathbf{D}\) is a \(p \times p\) diagonal matrix with non-negative singular values as diagonal elements
- Principal components of \(\mathbf{X}\) are \(\mathbf{UD}\) (\(N \times p\))
- For each \(x_i\), there is a closest point \(u_{ik}d_kv_k\) on the kth hyperplane
- \(u_{ik}d_k\) measures distance between two points
- \(v_k\) measures the direction of the line
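A minimal NumPy check of these SVD/eigendecomposition relationships (synthetic data; assumes the columns are centered first):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
Xc = X - X.mean(axis=0)                    # center each column

U, d, Vt = np.linalg.svd(Xc, full_matrices=False)
pcs = U * d                                # principal components UD (N x p)

# The PC variances should match the eigenvalues of the covariance matrix
evals, evecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
print(np.allclose(np.sort(d**2 / (len(X) - 1)), np.sort(evals)))   # True
```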
Principal Curve
Kernel Principal Component
\((x_1, x_2,..., x_n)\) → \(\big( \phi(x_1),\phi(x_2),...,\phi(x_n) \big)\) → PCA
- Non-linear transformation in PCA
- \(\mathbf{11}^T\) is a n-by-n matrix of all ones
- \(\mathbf{1}^T\mathbf{1}=n\)
- \((\lambda_i,\alpha_i)\) is the eigenpair of \(\mathbf{K}\), denoting its \(i\)-th largest eigenvalue \(\lambda_i\) and its corresponding eigenvector \(\alpha_i\)
Normalized Kernel
\(K(x, x)=1 \;\; \forall x \in X\)
Cauchy-Schwarz Inequality
\(K(x, y)^2 \leq K(x, x)K(y, y) \;\; \forall x, y \in X\)
Sparse Principal Component
Independent Component Analysis
Bayesian
SVM
Clustering
K-Means K-Medians
\(\mathbf{C}(i)=\arg\min_k \lVert x_i-m_k \rVert ^2\)
π Select \(K=k\) and \(m_k\) is the center point of the \(kth\) cluster
π \(O(n)\)
- Step 1. Given a cluster assignment \(\mathbf{C}\), calculate \((m_1, ..., m_K)\) for each cluster.
- Step 2. Build new \(\mathbf{C}\) \(s.t. \mathbf{C}(i)=\arg\min \lVert x_i-m_k \rVert ^2\)
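A minimal NumPy sketch of the two alternating steps (illustrative only; empty-cluster handling is ignored):

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    m = X[rng.choice(len(X), k, replace=False)]       # initial cluster centers
    for _ in range(n_iter):
        # Step 2: assignment C(i) = argmin_k ||x_i - m_k||^2
        C = np.argmin(((X[:, None, :] - m[None, :, :]) ** 2).sum(-1), axis=1)
        # Step 1: recompute each cluster center (sketch assumes no cluster goes empty)
        m = np.array([X[C == j].mean(axis=0) for j in range(k)])
    return C, m
```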
Deep Learning
CS231n 2017 Course Summary
Backward Propagation
Cross-Entropy
Binary Distribution
\(Cost = \mathbf{J} = - \sum_{i=1}^N \bigg( t_i \log (y_i) + (1-t_i) \log (1-y_i) \bigg)\)
- \(y_i\) stands for the probability of the output being equal to 1, or we say the probability of a hit, i.e. \(p(hit)\)
- \(t_i\) stands for the target, the actual target as \(0\) or \(1\), or we say it's a hit or a miss
- \(N\) stands for the total number of trials
- \(i\) stands for the \(i\)-th trial
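A small NumPy sketch of the binary cross-entropy cost as defined above (the clipping constant is an assumption to avoid \(\log(0)\)):

```python
import numpy as np

def binary_cross_entropy(t, y):
    """J = -sum(t*log(y) + (1-t)*log(1-y)); t in {0,1}, y = predicted p(hit)."""
    y = np.clip(y, 1e-12, 1 - 1e-12)   # avoid log(0)
    return -np.sum(t * np.log(y) + (1 - t) * np.log(1 - y))

print(binary_cross_entropy(np.array([1, 0, 1]), np.array([0.9, 0.2, 0.7])))
```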
Interpretation
For a convex cost (objective) function, the global minimum has the smallest value. To search for the minimum, we go down the curve, i.e., in the descending direction; that's called gradient descent. On the contrary, to maximize the objective, we do gradient ascent.
Multiple Distribution
\(Cost = \mathbf{J} = - \sum_{i=1}^N \sum_{j=1}^K \bigg( t_j^i \log (y_j^i) \bigg)\)
- \(K\) stands for \(K\) classes
The Derivative of Cross-Entropy
Multiple Distribution
| Input layer | Weight \(W\) | Hidden layer | Weight \(V\) | Output layer |
|---|---|---|---|---|
| \(D\) features \(\{x_1, ..., x_D\}\) | \(w_{dm}\) | M features \(\{z_1, ..., z_M\}\) | \(v_{mk}\) | \(K\) classes |
- Do backward propagation from the output layer to the hidden layer ↓
\[ \begin{align*} \frac{\partial \mathbf(J)}{\partial v_{mk}} & = -\frac{\partial}{\partial v_{mk}} \bigg( \sum_{i \in N} \sum_{j \in K} t_j^i \log y_j^i \bigg) \\ & = -\sum_{i \in N} \sum_{j \in K} t_j^i \frac{1}{y_j^i}\frac{\partial y_j^i}{\partial a_k}\frac{\partial a_k}{\partial v^k_m} \\ & = -\sum_{i \in N} \bigg( \sum_{j \in K} t_j^i \frac{1}{y_j^i} \frac{\partial y_j^i}{\partial a_k} \bigg) z_m \\ & = -\sum_{i \in N} \sum_{j \in K} t_j^i (I_{j=k} - y^i_k) z_m \\ & = -\sum_{i \in N} \sum_{j \in K} (t_j^iI_{j=k}-t_j^iy^i_k) z_m \\ & = -\sum_{i \in N} \big( t_k^i - y^i_k \big) z_m \end{align*} \]
The second line of the above equation
\(y^i_j=\frac{e^{a_j}}{\sum_{j=1}^K e^{a_{j}}}\)
\(\frac{\partial y^i_{j=k}}{\partial a_k}=y^i_k(1-y^i_k)\)
\(\frac{\partial y^i_{j \neq k}}{\partial a_k}=-y^i_jy^i_k\)
\(\sum_{j \in K} \frac{1}{y^i_j} \frac{\partial y^i_j}{\partial a_k} = \sum_{j \in K} (I_{j=k} - y^i_k)\)
\(a_k={v^k_m}^Tz_m=\sum_m^M({v^k_m}z_m)\)
\(\frac{\partial a_k}{\partial v_m^k}=z_m\)
The last two lines of the above equation
\(\sum_{j \in K} (t_j^iI_{j=k})=t^i_1 \cdot 0 + t^i_2 \cdot 0 +...+ t^i_k \cdot 1 +...+ t^i_K \cdot 0 = t^i_k\)
\(\sum_{j \in K} t_j^iy^i_k=y^i_k \big(t^i_1 + t^i_2 +...+ t^i_k + ... + t^i_K \big)=y^i_k \cdot 1\)
- For each data point, its target value \(t^i\) belongs only to the \(k\)-th class, so \(t^i_k=1\) and the other \(t^i_{j \neq k}=0\)
Gradient Descent
\(\theta = \theta - \eta \cdot \bigtriangledown_{\theta} \mathbf{J}(\theta)\)
Objective function, learning rate, and the quality of the gradient all matter
Adam is the best default choice of optimizer for all cases
L-BFGS is a good choice if you can afford to do full-batch updates
Minibatch gradient descent is the best default choice for how much data to use per update
Sebastian Ruder's website
Concept
- For each epoch of training:
- Batch GD uses entire dataset
- Stochastic GD uses one data point
- Mini-Batch GD uses a subset of dataset, to train the network
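A hedged sketch of the minibatch update loop (the `grad_fn` callback is a placeholder assumption; `batch_size=len(X)` recovers batch GD and `batch_size=1` recovers SGD):

```python
import numpy as np

def minibatch_gd(X, y, grad_fn, theta, lr=0.01, batch_size=64, epochs=10, seed=0):
    """One pass over the shuffled data per epoch, updating theta per minibatch."""
    rng = np.random.default_rng(seed)
    n = len(X)
    for _ in range(epochs):
        idx = rng.permutation(n)                      # shuffle each epoch
        for start in range(0, n, batch_size):
            b = idx[start:start + batch_size]
            theta = theta - lr * grad_fn(theta, X[b], y[b])
    return theta
```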
1st and 2nd Order Optimization
- Newton parameter update (vanilla method)
- No learning rate needed
- The Hessian matrix is too big for deep learning problems
- Works for full batch only
BFGS
L-BFGS
Stochastic Gradient Descent, SGD
\(w_{t+1}=w_{t}-\alpha \bigtriangledown f(w_t)\)
- What if one direction is more sensitive than the other? For high dimensions, this is a common problem
- Local minima or saddle point problem: the gradient is zero in both cases
- Local minima are common in lower-dimensional cases
- For high-dimensional cases, saddle points are a bigger problem
- The problem of plateaus results in a slow learning process
- Note: these problems are not specific to SGD
Momentum
Exponentially Weighted Average
\(v_{t+1}=\rho v_t + (1-\rho)\bigtriangledown_{\theta} f(\theta)\)
\(\theta_{t+1}=\theta_{t}-\alpha v_{t+1}\)
- It's a kind of moving-average technique applied to the gradients across the training process
- \(v\) is called velocity, which is a running mean of gradients
- \(\rho\) functions as friction, usually set to \(0.9\) or \(0.99\)
- \((1-\rho)\) is optional but highly suggested
Math Intuition
\(v_{t}=\rho v_{t-1}+(1-\rho) \theta_t\)
\(e^{-1}=(1-\beta)^{1/\beta}, \;\; \beta=1-\rho\)
Bias Correction
\(\frac{v_t}{1-\rho^t}\)
- As \(t\) grows, \(1-\rho^t\) will be close to 1 and have less effect on \(v_t\)
- This is used to correct the first few \(v\), e.g. \(v_0, v_1, v_2\)
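A small sketch of one momentum step with bias correction, mirroring the formulas above (function and argument names are illustrative):

```python
def momentum_step(theta, grad, v, t, lr=0.01, rho=0.9):
    """One SGD-with-momentum update; t is the 1-based step counter."""
    v = rho * v + (1 - rho) * grad      # exponentially weighted average of gradients
    v_hat = v / (1 - rho ** t)          # bias correction for the first few steps
    return theta - lr * v_hat, v
```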
Nesterov Momentum
π Calculate the gradient after updating the point with a big jump
\(v_{t+1}=\rho v_t -\alpha \bigtriangledown f(\theta_t+\rho v_t)\)
\(\theta_{t+1}=\theta_{t}+v_{t+1}\)
or
\(\tilde{\theta}_t=\theta_t+\rho v_t\)
\(v_{t+1}=\rho v_t - \alpha \bigtriangledown f(\tilde{\theta}_t)\)
\(\tilde{\theta}_{t+1}=\tilde{\theta}_{t}+v_{t+1}+\rho (v_{t+1}-v_t)\)
AdaGrad
\(s \mathrel{+}= (\nabla \mathbf{J})^2\), i.e., accumulate the squared gradients
Learning rate divided by the square root of the sum of squares of historic gradients
- AdaGrad (Adaptive Gradient) performs better in convex cases
- Accumulated squared gradient makes learning rate become very small as the network gets deeper
RMSProp
- Introduces a fix of AdaGrad for the non-convex case
- Rescales \(d\theta\); only the squared gradient is averaged, not the gradient itself
\[ \begin{align*} s_{dw} & = \rho s_{dw} + (1-\rho)dw^2 \\ s_{db} & = \rho s_{db} + (1-\rho)db^2 \\ w & = w - \alpha \frac{dw}{\sqrt{s_{dw}+\epsilon}} \\ b & = b - \alpha \frac{db}{\sqrt{s_{db}+\epsilon}} \end{align*} \]
- \(d\theta^2\) is element-wise
- \(d\theta^2\) is to rescale the \(d\theta\) when updating \(\theta\)βs value
- For a dimension where \(d\theta\) is large, dividing by \(\sqrt{s_{d\theta}}\) damps that step, while a dimension with a small gradient gets a relatively larger step (the projection of one learning step onto each dimension is the key to its learning speed). Rescaling by the accumulated \(d\theta^2\) therefore balances and speeds up the overall learning process.
Adam
\[ \begin{align*} v_{dw} & = \rho_v v_{dw} + (1-\rho_v)dw \\ v_{db} & = \rho_v v_{db} + (1-\rho_v)db \\ s_{dw} & = \rho_s s_{dw} + (1-\rho_s)dw^2 \\ s_{db} & = \rho_s s_{db} + (1-\rho_s)db^2 \\ v_{dw}^{\prime} & = \frac{v_{dw}}{1-\rho_v^t},s_{dw}^{\prime} = \frac{s_{dw}}{1-\rho_s^t} \\ v_{db}^{\prime} & = \frac{v_{db}}{1-\rho_v^t},s_{db}^{\prime} = \frac{s_{db}}{1-\rho_s^t} \\ w & = w - \alpha \frac{v_{dw}^{\prime}}{\sqrt{s_{dw}^{\prime}+\epsilon}} \\ b & = b - \alpha \frac{v_{db}^{\prime}}{\sqrt{s_{db}^{\prime}+\epsilon}} \end{align*} \]
- Adaptive moment estimation
- Combines Momentum, AdaGrad, and RMSProp
- Includes weighted average, rescaling, and bias correction
- \(\alpha\) has to be tuned
- \(\rho_v=0.9,\rho_s=0.999,\epsilon=10^{-8}\)
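A sketch of one Adam step mirroring the equations above (names are illustrative; \(t\) starts at 1):

```python
import numpy as np

def adam_step(w, dw, v, s, t, lr=1e-3, rho_v=0.9, rho_s=0.999, eps=1e-8):
    """One Adam update for a single parameter tensor."""
    v = rho_v * v + (1 - rho_v) * dw           # first moment (momentum)
    s = rho_s * s + (1 - rho_s) * dw ** 2      # second moment (RMSProp-style rescaling)
    v_hat = v / (1 - rho_v ** t)               # bias correction
    s_hat = s / (1 - rho_s ** t)
    w = w - lr * v_hat / np.sqrt(s_hat + eps)
    return w, v, s
```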
Nadam
Activation Function
π The purpose of the activation function is for nonlinear transformation
π The default best selection is ReLU
| Activation Function | Range | Description | Derivatives \(\frac{\partial g(z)}{\partial z}\) |
|---|---|---|---|
| sigmoid | \((0,1)\) | If \(y\) takes a binary result, i.e. \(0\) and \(1\) | \(g(z)(1-g(z))\) |
| tanh | \((-1,1)\) | If \(y\) is expected to be symmetric over a smaller range | \[ g'(z) = \begin{cases} \approx 0 & \quad \text{if } \lvert z \rvert > 3\\ \in (0,1] & \quad \text{if } \lvert z \rvert \leq 3 \end{cases} \] |
| ReLU | \(\max(0,z)\) | If the range of \(y\) is larger than \(0\) | \[ g'(z) = \begin{cases} 1 & \quad \text{if } z > 0\\ 0 & \quad \text{if } z \leq 0 \end{cases} \] |
| LeakyReLU | \(\max(\alpha \cdot z,z)\) | To reduce the shrinkage effect resulting from ReLU | \[ g'(z) = \begin{cases} 1 & \quad \text{if } z > 0\\ \alpha & \quad \text{if } z \leq 0 \end{cases} \] |
- \(g(z)\) stands for the activation function
- For sigmoid/tanh, the slope goes to \(0\) as \(\lvert z \rvert\) grows large, and this slows down the iterations
- A linear function will not be used as an activation function, because it also gives a linear outcome, and stacking linear layers into a deeper model adds nothing
Convolutional Neural Network
π Parameter sharing
π Sparsity of connections
- parameter sharing: a small filter functions as a sliding window which reuses its parameters across the image
- parameter size \(= (\text{filter's size} + 1) \times \text{the number of filters}\); each filter contributes one constant bias parameter \(b\)
- sparsity of connections: each output value is linked to a small region of the input layer
- Required computing memory is proportional to the number of pixels in each layer
Computer Vision
Edge Detection
Convolution Layer
Output Dimension
\(D_{out}=\bigg\lfloor\frac{D_{in}+2\times P-D_{f}}{S}\bigg\rfloor+1\)
- \(D\) stands for dimension
- \(D_f\) is the dimension of the filterβs size
- \(P\) stands for padding
- \(S\) stands for stride
- \(D_W \times D_H \times K_F\), where \(K\) stands for the number of filters, not the channel of the image or the filter
- Computation-intensive
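A tiny helper implementing the output-dimension formula above (illustrative only):

```python
def conv_output_dim(d_in, d_filter, padding=0, stride=1):
    """D_out = floor((D_in + 2*P - D_f) / S) + 1"""
    return (d_in + 2 * padding - d_filter) // stride + 1

print(conv_output_dim(32, 5))                       # 28: 5x5 filter, stride 1, no padding
print(conv_output_dim(32, 3, padding=1, stride=1))  # 32: "same" padding keeps the size
```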
Hyperparameters π
Number of Parameters Calculation
- Filter's width \(\times\) height \(\times\) channel/depth \(\times K + K\) (the trailing \(+K\) is one bias per filter)
Filter
- Filter's size \(= a\), or we say kernel \(= a\)
- \(a \times a \times\) channel/depth (optional), e.g. \(5 \times 5\), \(3 \times 3\), \(1 \times 1\)
- Choosing the filter's size decides the number of neurons which all target the same spatial region but look for different things
- If \(K\) filters are used, the output dimension of the transformed image is width \(\times\) height \(\times K\)
- Different channels/layers of depth will be aggregated together after computation with a single filter
Stride
- Movement of the filterβs window
Padding
- zero padding keeps the same dims as the input's by simply adding zeros around the periphery
- valid padding means no padding
- same padding means the output size is equal to the input size
Example
| Layer | Shape | Size | Parameter size |
|---|---|---|---|
| Input | (32,32,3) | \(32\times32\times3=3072\) | \(0\) |
| Conv (\(f=5, s=2\)) | (12,12,6) | \(12\times12\times6=864\) | \((5\times5+1)\times6=156\) |
| Pool (\(f=2\)) | (6,6,3) | \(6\times6\times3=108\) | \(0\) |
| FC | (64,1) | \(64\times1=64\) | \(108\times64+1=6913\) |
| FC | (32,1) | \(32\times1=32\) | \(64\times32+1=2049\) |
| Softmax | \((3,1)\) | \(3\times1\) | \(32\times3+1=97\) |
More Layers
Fully Connected Layer, FC
Dimension
- \(n^{[l]} \times n^{[l-1]} + 1\) is the number of parameters for the \(l\)th layer
- \(n^{[l]}\) is the output's size of the \(l\)th layer
- Each neuron connects to the entire input volume
- Memory-intensive
- AlexNet and VGG have FC layers which make the network really huge and require more memory
Pooling Layer
- Type of pooling
- max pooling
- average pooling
- Hyperparameters
- Common settings are \(D_f=2, D_s=2\) and \(D_f=3, D_s=2\)
- Goal: downsampling
- Only width and height will be changed, but depth remains the same
- \(Dim_{out}=\frac{Dim_{in} - Dim_{filter}}{Stride} + 1\)
- No parameters introduced
- Zero padding is NOT COMMON for pooling due to its downsampling goal
Inception Module
Architecture
| ImageNet | Year | Description | Training time |
|---|---|---|---|
| LeNet-5 | 1998 | 5 layers, plus 2 fully connected layers \(5\)x\(5\) filter and \(2\)x\(2\) avgpooling only | |
| AlexNet | 2012 | 8 layers, plus 3 fully connected layers 2 parallel architectures About 60M parameters ReLU Local response normalization | |
| VGG16/19 | 2014 | 16-19 layers, plus 3 fully connected layers In forward propagation, it takes 96MB/image, i.e. 24M values * 4 bytes About 138M parameters \(3\)x\(3\) filter with the same padding and \(2\)x\(2\) maxpooling only | 4 GPUs for 2 to 3 weeks |
| GoogLeNet | 2014 | 22 layers, no fully connected layers | |
| ResNet | 2015 | One residual block with one skip connection |
Recurrent Neural Network
Truncated Backpropagation
- check min-char-rnn.py
Gradient Flow
- largest singular value > 1 : exploding
- gradient clipping : scale gradient
- largest singular value < 1 : vanishing
- LSTM
Vanilla RNN
GRU
- No cell state
- Output \(h_t\) only and no \(C_{t}\)
Mathematics
- \(r_t=\sigma(W_r \cdot [h_{t-1}, x_t]+b_r)\)
- \(z_t=\sigma(W_z \cdot [h_{t-1}, x_t]+b_z)\)
- \(\tilde{h_t}=tanh(W_h \cdot [r_th_{t-1}, x_t]+b_h)\)
- \(h_t=(1-z_t)\tilde{h_t}+z_th_{t-1}\)
LSTM
- 1997
- add cell state
Concept
- Symmetric product
- Element-wise product \(\odot\)
Keys
- 4 Gates :
- Input :
- Forget : erase/remove
- Gate : write
- Output :
- \(c_{t-1}\) and \(h_{t-1}\) are two inputs from the previous stage
- \(c_t = f \odot c_{t-1} + i\odot g\)
- \(h_t = o \odot \tanh(c_t)\)
Mathematics
\(f_t=\sigma(W_f \cdot [h_{t-1}, x_t]+b_f)\)
\(i_t=\sigma(W_i \cdot [h_{t-1}, x_t]+b_i)\)
\(\tilde{C_t}=tanh(W_C \cdot [h_{t-1}, x_t]+b_C)\)
\(C_t = f_t \times C_{t-1} + i_t \times \tilde{C_t}\)
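For completeness, the standard LSTM output gate and hidden-state equations (consistent with the Keys above, though not written out in the original notes):

\(o_t=\sigma(W_o \cdot [h_{t-1}, x_t]+b_o)\)

\(h_t = o_t \odot \tanh(C_t)\)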
Generative Adversarial Networks
Training Tricks
Setting the number of epochs to around 30 should be fine
Model ensembling
Multi-crop at test time for image analytics
Transfer learning
Shuffle the training data for each epoch
Save the model parameters and restart the training several times
Note
Model ensembling and multi-crop tricks may help slightly improve your model performance for a certain dataset you are working with. However, at the production level, it's better to train a model with acceptable performance rather than using these two tricks.
Accelerate Optimization Process
Normalization
It works with both batch and minibatch gradient descent
Covariate shift
Intuition
Normalization makes the radial distances from all dimensions to the minimum point of the loss function more comparable, which avoids a zig-zag gradient descent pattern during model training.
By intuition, batch norm for each hidden layer makes that layer less vulnerable to the changes imposed by the previous layers. This implies that each layer becomes more independent of the others.
Minibatch normalization has a similar effect to regularization or dropout, because the samples' mean and variance vary across minibatches and thus introduce additional noise to the model.
Testing
For model testing, using exponential weighted average to calculate the mean and variance of the minibatches used for training is suggested.
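A simplified NumPy sketch of a batch-norm forward pass that keeps exponentially weighted averages for test time (names and the momentum value are illustrative assumptions):

```python
import numpy as np

def batchnorm_forward(x, gamma, beta, running_mean, running_var, momentum=0.9,
                      eps=1e-5, training=True):
    """Batch norm over a minibatch x of shape (batch, features)."""
    if training:
        mu, var = x.mean(axis=0), x.var(axis=0)
        # exponentially weighted averages reused at test time
        running_mean = momentum * running_mean + (1 - momentum) * mu
        running_var = momentum * running_var + (1 - momentum) * var
    else:
        mu, var = running_mean, running_var      # use the training-time averages
    x_hat = (x - mu) / np.sqrt(var + eps)
    return gamma * x_hat + beta, running_mean, running_var
```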
Weight Initialization
Solution to gradient vanishing/exploding
For deeper neural networks, weights slightly greater/less than 1 will result in an exponential increase/decrease of the activations (and gradients) in deeper layers
\(\text{Var}(w) = \frac{1}{n}\) or \(\frac{2}{n}\)
| Method | Math | Description |
|---|---|---|
| He initialization (for ReLU) | \(w_{random}\sqrt{\frac{2}{n^{[l-1]}}}\) | |
| Xavier initialization | \(\text{tanh: }\sqrt{\frac{1}{n^{[l-1]}}}\) | |
| Xavier initialization (alternative form) | \(\text{tanh: }\sqrt{\frac{2}{n^{[l-1]}+n^{[l]}}}\) | |
Minibatch
The concept is to use a subset, called a minibatch, sampled from the entire training set, called the batch
Using a minibatch size of 1 is called stochastic gradient descent
For a big dataset \((n > 2000)\), typical minibatch sizes are \(64, 128, 256, 512\), chosen to fit in CPU/GPU memory
For batch gradient descent, the loss decreases monotonically across iterations.
Learning Rate Decay
π Adam optimization functions as an alternative for learning rate decay
π Learning rate annealing, e.g.Β cosine annealing
\(\alpha=\frac{1}{1+r_{decay} \cdot n }\alpha_0\)
\(\alpha=\frac{k}{\sqrt{n}}\alpha_0\)
Exponential decay
\(\alpha=k^{n}\alpha_0\)
Staircase
Manual decay
- \(k\) is a constant
- \(n\) is the epoch number
- \(k < 1\) for exponential decay
Hyperparameter Selection and Tuning
π Experience is most valuable
π No single one best solution for all kinds of problems across all fields
| Evaluation | Problems | Solutions |
|---|---|---|
| High bias | Training set problem | Bigger network, including a wider layer or a deeper architecture (always help) Train longer (not always help) Different nn architecture |
| High variance | Dev set problem | Collect more data (best way) Regularization Different nn architecture |
\(a_j^{[l]\{i\}(m)}\)
| Hyperparameter | Notation | Importance |
|---|---|---|
| learning rate | \(\alpha\) | |
| friction in momentum | \(\rho\) | |
| minibatch size | \(\{i\}\) | |
| number of hidden neurons | \(j\) | |
| number of hidden layers | \([l] \in L\) | |
| learning rate decay |
Train/Dev/Test Sets
π Dev/Test sets must come from the same distribution
- Mismatched train/test distribution is acceptable
- Test set is not a must
Bias-Variance Tradeoff
Concept behind is the tradeoff between a general/specialized model
High train set error implies high bias
High dev set error implies high variance
High train/dev set error implies the model is in general bad; this is a common problem in a high-dimensional parameter space
A bigger network reduces the bias without increasing the variance
Data augmentation reduces the variance without increasing the bias
Regularization
Please note that regularization here has a general meaning; it is not limited to the penalty term
| Tricks | How to do | Description |
|---|---|---|
| Penalty term | Additive to the loss function | This is a must property |
| Dropout | Additive to each layer | It functions like the penalty term, and for images analytics with neural networks, this is a nice property which boosts the model performance on testing data |
| Data augmentation | Additional actions | In practice, this action takes time and involves more human works |
| Activation function \(\text{tanh}\) | Built-in architecture design | \(\text{tanh}\) makes the weights center around \(0\), and values around \(0\) have an approximately linear relationship, which results in a decision boundary with a less nonlinear shape |
| Early stopping | Additional actions | This is not suggested because early stopping couples the optimization process with the handling of overfitting. It makes it harder to figure out where the main problem is if the network performs badly |
Penalty Term
Also called weight decay
Regularization prevents overfitting
\(L2\) should be the default choice
\(\lambda\), called the regularization parameter, is the hyperparameter which has to be tuned
The larger \(\lambda\) is, the stronger the regularization effect
\[ \begin{align*} \text{Loss} & = \mathbf{J} + \frac{\lambda}{m}\rVert w^{[l]} \lVert^2_F \\ dw^{[l]} & = \frac{1}{m}dZ^{[l]} A^{[l-1]T}+\frac{\lambda}{m} w^{[l]} \\ w^{[l]} & = w^{[l]}-\alpha dw^{[l]} \end{align*} \]
\(\mathbf{J_{\text{regularized}}} = \bigg[-\frac{1}{m} \sum^m_{i=1} \big( y_i \log(a_i)+(1-y_i) \log(1-a_i) \bigg) \bigg] + \bigg[ \frac{1}{m} \frac{\lambda}{2} \sum w^2 \bigg]\)
\(dw^{[l]} = \frac{1}{m} dZ^{[l]}A^{[l-1]T}+\frac{\lambda}{m}w^{[l]}\)
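A one-line sketch of the regularized gradient step above for a single layer (names are illustrative; shapes follow the \(dZ^{[l]} A^{[l-1]T}\) convention):

```python
import numpy as np

def l2_regularized_step(w, dZ, A_prev, lam, m, lr):
    """dw = (1/m) dZ @ A_prev.T + (lambda/m) w, then w := w - lr * dw."""
    dw = (dZ @ A_prev.T) / m + (lam / m) * w
    return w - lr * dw
```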
| Regularization | Math | Effect | Description |
|---|---|---|---|
| \(L2\) | \(\lVert w \rVert^2_F\) | Most of \(w\) will be small values; in convention, this is called the Frobenius norm | No feature selection effect |
| \(L1\) | \(\lVert w \rVert_1\) | Some \(w\) will become \(0\) | Strong feature selection effect |
Dropout
Drop features randomly
Do not use dropout during testing, because that would result in random predictions; with inverted dropout, the keep-probability does not need to be applied in the test-time calculations either
Implementing dropout in the model is a bit tricky; check the code below to get an idea:
| Method | Math | Description |
|---|---|---|
| Inverted dropout |
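A minimal NumPy sketch of inverted dropout (function and parameter names are illustrative assumptions):

```python
import numpy as np

def inverted_dropout(a, keep_prob=0.8, training=True, seed=0):
    """Inverted dropout on activations a."""
    if not training:
        return a                                   # no dropout (and no rescaling) at test time
    rng = np.random.default_rng(seed)
    mask = rng.random(a.shape) < keep_prob         # keep each unit with prob keep_prob
    return a * mask / keep_prob                    # scale so E[a] is unchanged
```

Because the kept activations are divided by `keep_prob` during training, their expected value is unchanged, which is why nothing extra is needed at test time.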
Interpretation
It forces the neurons to spread out the weights, and has a similar effect to the \(L2\) norm.
The cost function \(J\) becomes not well-defined when using dropout. Because dropout is a solution to overfitting, if there is no overfitting problem, probably do not introduce dropout.
Data Augmentation
π Increasing batch size has the same effect as decaying learning rate
Activation Function
Early Stopping
Sanity Checking
Gradient Checking
Debug backpropagation code
Doesn't work with dropout layers
Don't forget the regularization term
\(d\theta_k^{approx} \approx \frac{\mathbf{J(\theta_1,..,\theta_k+\epsilon,...,\theta_n)}-\mathbf{J(\theta_1,..,\theta_k-\epsilon,...,\theta_n)}}{2\epsilon}\)
Criteria
\(\frac{\lVert d\theta^{approx} - d\theta \rVert_2}{\lVert d\theta^{approx} \rVert_2 + \lVert d\theta \rVert_2} < 10^{-7}\)
- \(O(\epsilon^2)\)
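A hedged sketch of the check for a flattened parameter vector (names are illustrative; `J` is any scalar cost function of \(\theta\)):

```python
import numpy as np

def grad_check(J, theta, dtheta, eps=1e-7):
    """Compare the analytic gradient dtheta with a centered-difference approximation."""
    approx = np.zeros_like(theta, dtype=float)
    for k in range(theta.size):
        plus, minus = theta.copy(), theta.copy()
        plus[k] += eps
        minus[k] -= eps
        approx[k] = (J(plus) - J(minus)) / (2 * eps)
    num = np.linalg.norm(approx - dtheta)
    return num / (np.linalg.norm(approx) + np.linalg.norm(dtheta))  # want < 1e-7
```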
Notation
\(a^{[\text{l}](i)}_j\)
- \(i\)-th sample, \(i=1, ..., N\)
- \(j\)-th neuron, \(j=1, ..., M\)
- \(l\)-th layer, \(l=1, ..., L\)
- \(a\) stands for the activation layer
- Input layer is not counted in
- \(z\) is the feature of the hidden layer
Reinforcement Learning
Loss
Distance Function
\(\sum I_{y\neq\hat{y}} \log(p(x)) \geq 1\)
\(d_p(x,y) = 0 \iff x=y\)
Norm
\(d_p(x,y)=|x-y|_p\)
- \(L_1\) and \(L_2\) norms are convex
- \(L_0\) pseudo-norm and \(L_{\frac{1}{2}}\) are concave
- The hinge loss is a solution to the outlier problem
A convex function is continuous and its second derivative is always greater than or equal to zero
Log Likelihood
MLE
MAP
Kernel
π Similarity measure : the larger the kernel, the higher level of similarity
Entropy
\(H = -\sum_i p_i \log(p_i)\)
\(H = -E_i(\log(p_i))\)
- \(P(event) = 1\) : entropy \(=\) 0
- \(P(event) = 0\) : entropy \(=\infty\)
- \(P(event) = 0.5\) : entropy \(= 1\)
- Uniform distribution has the highest entropy
- Example:
Gini Index
\(Gini=\frac{\text{Area between the line of equality and Lorenz curve}}{\text{Area under the line of equality}}\)
Minimum Description Length, MDL
\(E(length) \geq - \sum P(z_i)\log_2 (P(z_i))\)
Kullback-Leibler Distance, KLD
\(D(new|old) = \sum P(new) \log \frac{P(new)}{P(old)}\)
- Also called relative entropy
- NOT symmetric
- NOT a distance measure
- NOT a metric measure
- Remark: however, if the data is normalized to mimic a distribution, then KLD works
Evaluation
Using a single-number evaluation metric can speed up the iteration and is thus highly suggested
The F1 score is the harmonic mean of precision and recall
The evaluation metric equals one optimizing metric plus one or several satisficing metrics
To evaluate the model performance, we usually have one optimizing metric, e.g. the accuracy, precision, or recall, plus one or several constraints, which we call the satisficing metrics here. The constraints might be the running time of the training process, or the weight penalization on the misclassification of a subgroup of classes, etc.
Bayes optimal error is the theoretical minimum error a model can achieve, and in practice we use the human-level error to approximate it. On one hand, the gap between the training error and the human-level error is the avoidable bias; the approximation of the Bayes error by the human-level error can be improved by including more human expertise. On the other hand, the gap between the training and dev errors results from the variance.
Confusion Matrix
Evaluation Metric
Intersection over Union, IoU
Tricks
Cross Validation
CV is in general a nice way for small datasets.
Leave-one-out CV, LOOCV
Learn a classifier on all the training data minus one example and evaluate its error on the held-out example
- When n is small
- Unbiased estimate of the error
K-fold CV
π Split data into K subsets (usually K = 10)
- When n is large
- Unbiased estimate of the error
Bias-Variance Decomposition
π Goal is to minimize the total error
π The sum of the distances between the raw data and its central estimate is called variance
π The sum of the distances between the true target and its central estimate is called bias
π The sum of variance, bias, and noise is the total error
Concept
Assume that in a real world \(y\) can be represented by a function of \(x\) plus a noise term \(\epsilon\). We want to find an approximate function of \(x\) to mimic the true function\(^{2}\).
- \(^{1}\) \(y=f(x)+\epsilon\)
- \(^{2}\) \(\hat{y}=\hat{f}(x)\)
\(E_{x,y,D}\big( (\hat{f}(x)-y)^2\big)=E_{x,D}\big((\hat{f}(x)-\bar{\hat{f}}(x))^2\big) + E_{x,y}\big((\bar{\hat{f}}(x)-y)^2\big)\)
1st Term: Model variance
- A larger variance implies a higher flexibility, which enables the model to capture the noise in the training data
- To reduce model variance
- Less model complexity
- Data augmentation
2nd Term: Bias and Noise
\(E_{x,y}\big((\bar{\hat{f}}(x)-y)^2\big)=E_x\big((\bar{\hat{f}}(x)-f(x))^2\big) + E_{x,y}\big((f(x)-y)^2\big)\)
- 1st term: Bias
- Reduce the bias with a complex model
- 2nd term: Sample Variance or Noise
- Cannot be reduced!
| Complexity/Flexibility | Variance | Bias | Noise | Model | Generality | Problem |
|---|---|---|---|---|---|---|
| Low | Low | High | Unchangeable | Linear | High | Underfitting |
| High | High | Low | Unchangeable | Non-linear | Low | Overfitting |
Note
- Model variance is different from sample variance
- Variance is to present the spread of the distribution
- Bias is to present how far it is between your estimate and your target
Boosting
π A process in which we turn a weak classifier into a strong one
π Goal is to reduce the total error
AdaBoost
\(e_t=\sum_i e \big( F_{t-1}(x_i)+\alpha_th(x_i) \big)\)
- \(h_i\) is the hypothesis for \(x_i\)
- Downside is being sensitive to outliers and noise
XGBoost
LPBoost
Bagging
Bootstrap
\(D_1,...,D_m \sim D\)
- m sets of bootstrap samples, each of sample size n
- Also called bootstrap aggregating
- Improves stability and accuracy
- Use the average of the m samples for regression or the majority vote for classification
- Bagging does NOT always improve the performance. It may degrade the performance of stable methods, e.g.Β KNN
\(\lim_{n\rightarrow\infty}(1+\frac{-1}{n})^n = e^{-1} \sim 0.368\)
If the sample size is large, the chance that a given data point is not chosen in a bootstrap sample is about 0.368
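A quick simulation of that probability (purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
n, trials = 1000, 5000
# fraction of bootstrap samples of size n that never pick data point 0
not_chosen = np.mean([0 not in rng.integers(0, n, n) for _ in range(trials)])
print(not_chosen, np.exp(-1))   # both close to 0.368
```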
Hardware
NVIDIA GPU
| Net | Batch size | Global memory |
|---|---|---|
| VGG-16 | 32 | |
| VGG-16 | 64 | 8GB |
| VGG-16 | 128 | 14GB |
- Reduce batch size
- Reduce modelβs size
- Distribute the model on multiple GPUs
TFLOPS
Easa Project: Car Detection
The section YOLO covers almost all important topics for object detection, including classification, localization, landmark detection, sliding window, fully connected layers with convolutional implementation, grid region, intersection over union, non-max suppression, and anchor boxes.
Reference
| Methods | Year | Title | Summary |
|---|---|---|---|
| R-CNN | 2013 | ||
| Fast R-CNN | 2015 | ||
| Faster R-CNN | 2016 |
YOLO
Redmon et al., 2015
Segmentation
Region Proposal
Propose regions
Convolve over one region or all regions at a time
Convolve end-to-end, including region proposals