Machine Learning
Orthogonalization
If there are multiple questions you want to solve and features can be shared across them, then a single big model covering all questions may be an ideal choice rather than building one model per question.
Error Evaluation
Regression
Linear Regression
Least Squares Estimates, LSE
Residual Sum of Squares, RSS
- RSS corresponds to the MLE (minimizing RSS is equivalent to maximizing the Gaussian likelihood)
- Implies the unexplained variance
- Monotonically decreasing, i.e., "more variables, less RSS"
- The minimizers of \(\min_{\beta_0, \beta_1}RSS\) are called the least squares coefficient estimates
Model Assumptions
- linearity
- \(\mu(y|x) = \beta_0 + \beta_1x\)
- normality
- \(Y|X \sim N(\beta_0 + \beta_1x,\ \sigma^2)\), i.e., the conditional variance is the constant \(\sigma^2\)
- equal variance
Residual Standard Error, RSE
- RSE \(= \sqrt{\frac{RSS}{df}}\) = \(\sqrt\frac{RSS}{n-2}\)
\(R^2\)
- \(R^2 = \frac{TSS - RSS}{TSS}\)
- \(R^2 = correlation^2\) in the case of simple linear regression
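A minimal NumPy sketch of these quantities (the toy data and variable names are illustrative, not from the text):

```python
import numpy as np

# Toy data: y = 2 + 3x + noise
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 50)
y = 2 + 3 * x + rng.normal(0, 2, 50)

# Least squares estimates for a simple linear regression
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
y_hat = b0 + b1 * x

rss = np.sum((y - y_hat) ** 2)            # residual sum of squares
tss = np.sum((y - y.mean()) ** 2)         # total sum of squares
rse = np.sqrt(rss / (len(x) - 2))         # residual standard error, df = n - 2
r2 = (tss - rss) / tss                    # equals corr(x, y)**2 for simple regression

print(rss, rse, r2, np.corrcoef(x, y)[0, 1] ** 2)
```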
Model selection
Don't use RSS, because…
- The monotonically decreasing property, i.e., "more variables, less RSS", results in an embedded overfitting problem
Use RSE, because…
- NOT monotonically decreasing
Don't use \(R^2\), because…
- Monotonically increasing, i.e., "more variables, higher \(R^2\)"
- It's hard to interpret because there is no test for \(R^2\).
Use F test
- \(\frac{increased\; explained\; variance}{unexplained\; variance}\)
- \(\frac{explained\; variance}{unexplained\; variance}\)
- \(\beta_0\) is excluded from the degrees of freedom
- In R, use `Anova()` (F-test based) compared to `lm()` (t-test based)
- \(t^2 = F\)
- Use the F-test as the criterion for each factor variable, instead of the t-test, because the t-test only shows results for individual levels of the factor variable, not for the factor as a whole.
p-value
- The probability of observing results at least as extreme as the data, given that \(H_0\) is true
- Don't rely on the p-value alone, because it tends to give unreliable (varying) results when a subset is randomly sampled from the dataset
CI, Confidence Interval
- Mean interval
- \(\bar{y} \mid x \in (\beta_0 + \beta_1x - 2\sigma_{se}(\beta_0 + \beta_1x), \beta_0 + \beta_1x + 2\sigma_{se}(\beta_0 + \beta_1x))\)
- This captures the possible change of the entire regression line
- \(\sigma_{new} = \sigma \sqrt{ \frac{1}{n} + \frac{(x_{new} - \bar{x})^2}{\sum_i(x_{i} - \bar{x})^2}}\) for the estimated mean at a new \(x_{new}\)
Predictive interval
- This captures the possible change of an unknown data point
- \(\sigma_{prediction} = \sigma \sqrt{ 1 + \frac{1}{n} + \frac{(x_{new} - \bar{x})^2}{\sum_i(x_{i} - \bar{x})^2}}\)
- Prediction error \(E[(y - \hat{y})^2]\)
Training/Testing dataset
\(C_p\)
- Unbiased estimate of the mean prediction error
- \(C_p = \frac{RSS}{n} + \frac{2}{n} d \cdot \sigma^2\)
- \(\sigma^2 = Var(\epsilon)\), the error variance
- \(\hat{\sigma^2} = (RSE_{Full})^2\)
- Use \(min C_p\) to choose your model
- Step 1: Select features from \(d=1\) to \(d=19\), calculate each \(C_p\), and compare them
- Step 2: Kick out the variables which are not significant!
- Find the global \(\min C_p\) with `leaps::regsubsets(method = "exhaustive")`
- Forward method: Step 1: choose the feature with the best fit in a \(y \sim x\) model, then keep that best feature in the model. Step 2: choose the next best fit by adding one more feature to the previous model. Step 3: repeat Step 2 (a sketch follows this list)
- Note: once \(p\) is larger than \(n\) (\(n < p\)), there is no single best solution!
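A rough sketch of the forward method, using \(C_p\) as the comparison criterion (the `cp` and `forward_selection` helpers are illustrative assumptions; in R this role is played by `leaps::regsubsets`):

```python
import numpy as np

def cp(X, y, sigma2_full):
    """Mallows' Cp = RSS/n + (2/n)*d*sigma^2 for the subset of columns in X."""
    n, d = X.shape
    Xd = np.column_stack([np.ones(n), X])            # add the intercept
    beta, *_ = np.linalg.lstsq(Xd, y, rcond=None)
    rss = np.sum((y - Xd @ beta) ** 2)
    return rss / n + 2 * d * sigma2_full / n

def forward_selection(X, y, sigma2_full):
    """Greedy forward method: add the single best feature at each step."""
    remaining, selected, path = list(range(X.shape[1])), [], []
    while remaining:
        scores = [(cp(X[:, selected + [j]], y, sigma2_full), j) for j in remaining]
        score, j = min(scores)                       # best one-feature addition
        selected.append(j)
        remaining.remove(j)
        path.append((score, list(selected)))
    return min(path)                                 # subset with the smallest Cp on the path
```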
BIC, Bayesian Information Criterion
- \(\frac{RSS}{n} + \frac{\log(n)}{n}\, d \cdot \sigma^2\)
- Derived from Bayesian rule
- BIC prefers smaller models because it penalizes models with a large number of variables
AIC, Akaike Information Criterion
- \(-2 \text{loglikelihood} + 2(\text{number of parameters})\)
- Measures the distance (in log-likelihood) between two models
- To find the better model, use AIC.
- For classification, use \(C_p\)
- A constant \(c\) times \(C_p\)
- Proportional to the probability of the observed data
- Proportional to BIC
bestglm()
Potential problems
Surrogate
- In a simple regression with one regressor, the explanatory variable may act as a surrogate for other important missing variables.
Reverse regression
- DON'T swap the response and the predictor, i.e., don't use \(y\) to explain \(x\) by reversing the model \(lm(y \sim x)\)
\(p < n\)
- We use regression to deal with data where \(p < n\)
- The high-dimensional setting cannot be addressed here!
Heteroscedasticity
- Check the residual plot (\(y\) axis: residuals; \(x\) axis: fitted values \(\hat{y}\)).
- One solution is to transform \(y\).
Normality Assumption
- Check residuals follow normal distribution.
- CAN'T compare \(\beta\) for the same variable in two different models!
- The meanings of the two models are different
- The standard errors are different
- The effects of variables are only well-defined within the specific model
- CAN'T rank the importance of the variables in a multiple regression model!
- The \(\beta\) only shows the marginal effect
- A non-significant result may be caused by multicollinearity.
- Interaction term
- Plot interaction plot
- Nonlinearity in predictors
- Plot pairwise scatter plot
- Collinearity
- \(se(\beta)\) becomes larger!
- \((X'X)^{-1}\) becomes larger as some or all \(x\) are highly correlated!
- The diagonal entries of \((X'X)\) remain the same; however, the covariance at \((i, j)\) with \(i \neq j\) becomes larger.
Regression Shrinkage Penalty
\(\lVert y - wx \rVert _{2}^2 + \lambda \lVert w \rVert_{p}^p\)
Penalty will always lower the variance ↓ and increase the bias ↑
Tuning parameter \(\lambda\)
- Controls the model complexity: the larger the value, the simpler the model
- As \(\lambda \rightarrow \infty\), the shrinkage effect grows
Coefficient \(w\)
- If coefficients are large, the model becomes more sensitive to the noise; thus, the variance goes up!
Ridge \(_{p=2}\)
\((X^TX + \lambda I )^{-1}X^TY\)
Performs better when there are lots of non-zero small coefficients
- Consistent but biased
- Convex optimization
- Scale variant
Lasso \(_{p=1}\)
Performs covariate selection
- Variable-selection consistent
- Convex optimization
- Scale variant
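A hedged scikit-learn sketch contrasting the two penalties on synthetic data (assumes scikit-learn is installed; `alpha` plays the role of \(\lambda\)):

```python
import numpy as np
from sklearn.linear_model import Ridge, Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
beta = np.array([3, 0, 0, 1.5, 0, 0, 0, 2, 0, 0])   # sparse true coefficients
y = X @ beta + rng.normal(size=100)

ridge = Ridge(alpha=1.0).fit(X, y)   # p=2: shrinks all coefficients toward 0
lasso = Lasso(alpha=0.1).fit(X, y)   # p=1: sets some coefficients exactly to 0

print(np.round(ridge.coef_, 2))      # everything small but non-zero
print(np.round(lasso.coef_, 2))      # covariate selection: exact zeros appear here
```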
Stepwise/Streamwise/Stagewise \(_{p=0}\)
Strong feature selection but no shrinkage
- Scale invariant
- Use information theory (MDL) to pick \(\lambda\)
- Minimum Description Length, MDL
- \(D\) stands for data, and \(D=(x_1, y_1),...,(x_n, y_n)\)
- \(L(D)=min_{H \in Hypothesis}(L(H)+L(D|H))\)
- Against overfitting due to the tradeoff between the complexity of the hypothesis and the complexity of data given the hypothesis
Stepwise
π Within each step, pick the best out of all candidate features for inclusion in the model
- Include linear terms, cross-product terms, and higher order terms
- Higher model complexity
- Compute-intensive
Streamwise
Consider each feature "once" for inclusion in the model, and the order matters!
- When the feature set is very large
Stagewise
Like stepwise but freezes the previous coefficients.
- For each addition of a feature, regress that feature against the residual
- Computes faster than stepwise
- Boosting
All Subset Regression
π Try all subsets of feature combination and compare the errors
Robust Regression
Logistic Regression, LR
\(\log (\prod_i \frac{1}{1+e^{(-y_iw^Tx_i)}})\)
- \(n >> p\)
- \(L_0\) loss is less when n is large.
- \(\log odds = w^Tx\)
Linear Decision Boundary
\(P(Y=1|X) = \frac{1}{1+ e^{(\log \frac{P(Y=0)P(X|Y=0)}{P(Y=1)P(X|Y=1)})}}\)
\(P(Y=1|x, w)=P(Y=-1|x, w)=0.5 \Rightarrow w^Tx = 0\)
- \(w\) is orthogonal to the decision boundary: the boundary \(w^Tx = 0\) is a plane in \(x\)-space and \(w\) is its normal vector
- If the data is linearly separable, `MLE` fails for LR, and we can use `MAP` instead to solve the problem
Linear Discriminant
Linear Discriminant Analysis, LDA
π Fundamental assumption is that the independent variables are normally distributed
Tree
π Goal is to minimize the cost function
\(\min(cost|split)\)
Regression Tree
\(\sum(y-\hat{y})^2\)
Classification Tree
\(\sum(p(1-p))\)
Selection criteria
- Recursive or greedy algorithm
- For each branch, \(n_{points} > a\)
- Set the depth of a tree, \(d > c\)
Remark
- If the variance is large, it implies the tree is complex; we can then use `bagging` and `boosting` to reduce the variance of the tree.
- An unbalanced dataset will result in a larger bias, which may result from the domination of a few classes.
PCA
Concept
Eigen Value Decomposition
- eigenvalue is the variance of each PC
- eigenvector is the loading of each PC, or we say the weight of each feature for one PC
- Maximal variance of the data and eigenvectors of the covariance matrix
- Direction vector \(v_j\) called loadings
Reconstruction error
\(min_{\mu,\{\lambda_i\},\mathbf{V}_q}\sum_i^N \lVert x_i-\mu-\mathbf{V}_q\lambda_i \rVert ^2\)
- \(\mathbf{V}_q\) is a \(p\times q\) matrix with \(q\) orthogonal unit vector used for dimension transformation
- \(\lambda\) is a \(q\) vector of parameters and \(\hat{\lambda}_i=\mathbf{V}_q^T(x_i-\bar{x})\)
- \(\mu\) is a location vector in \(\mathbb{R}^p\) and \(\hat{\mu}=\bar{x}\)
Singular Value Decomposition, SVD
\(\mathbf{X}=\mathbf{U}\mathbf{D}\mathbf{V}^T\)
- \(\mathbf{U}\) and \(\mathbf{V}\) are rotation matrices, or eigenvectors
- Left singular vector \(\mathbf{U}\) is \(N \times p\) orthogonal matrix \((\mathbf{U}^T\mathbf{U}=\mathbf{I}_p)\)
- Right singular vector \(\mathbf{V}\) is \(p \times p\) orthogonal matrix \((\mathbf{V}^T\mathbf{V}=\mathbf{I}_p)\)
- \(\mathbf{D}\) holds the square roots of the non-zero eigenvalues
- \(\mathbf{D}\) is a \(p \times p\) diagonal matrix with non-negative singular values as diagonal elements
- Principal components of \(\mathbf{X}\) are \(\mathbf{UD}\) (\(N \times p\))
- For each \(x_i\), there is a closest point \(u_{ik}d_kv_k\) on the kth hyperplane
- \(u_{ik}d_k\) measures distance between two points
- \(v_k\) measures the direction of the line
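A minimal NumPy check of these SVD/eigendecomposition relationships (synthetic data; assumes the columns are centered first):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
Xc = X - X.mean(axis=0)                    # center each column

U, d, Vt = np.linalg.svd(Xc, full_matrices=False)
pcs = U * d                                # principal components UD (N x p)

# The PC variances should match the eigenvalues of the covariance matrix
evals, evecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
print(np.allclose(np.sort(d**2 / (len(X) - 1)), np.sort(evals)))   # True
```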
Principal Curve
Kernel Principal Component
\((x_1, x_2,..., x_n)\) → \(\big( \phi(x_1),\phi(x_2),...,\phi(x_n) \big)\) → PCA
- Non-linear transformation in PCA
- \(\mathbf{11}^T\) is a n-by-n matrix of all ones
- \(\mathbf{1}^T\mathbf{1}=n\)
- \((\lambda_i,\alpha_i)\) is the eigenpair of \(\mathbf{K}\), denoting its \(i\)-th largest eigenvalue \(\lambda_i\) and its corresponding eigenvector \(\alpha_i\)
Normalized Kernel
\(K(x, x)=1 \;\; \forall x \in X\)
Cauchy-Schwarz Inequality
\(K(x, y)^2 \leq K(x, x)K(y, y) \;\; \forall x, y \in X\)
Sparse Principal Component
Independent Component Analysis
Bayesian
SVM
Clustering
K-Means K-Medians
\(\mathbf{C}(i)=\arg\min_k \lVert x_i-m_k \rVert ^2\)
π Select \(K=k\) and \(m_k\) is the center point of the \(kth\) cluster
π \(O(n)\)
- Step 1. Given a cluster assignment \(\mathbf{C}\), calculate \((m_1, ..., m_K)\) for each cluster.
- Step 2. Build new \(\mathbf{C}\) \(s.t. \mathbf{C}(i)=\arg\min \lVert x_i-m_k \rVert ^2\)
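A minimal NumPy sketch of the two alternating steps (illustrative only; empty-cluster handling is ignored):

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    m = X[rng.choice(len(X), k, replace=False)]       # initial cluster centers
    for _ in range(n_iter):
        # Step 2: assignment C(i) = argmin_k ||x_i - m_k||^2
        C = np.argmin(((X[:, None, :] - m[None, :, :]) ** 2).sum(-1), axis=1)
        # Step 1: recompute each cluster center (sketch assumes no cluster goes empty)
        m = np.array([X[C == j].mean(axis=0) for j in range(k)])
    return C, m
```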
Deep Learning
CS231n 2017 Course Summary
Backward Propagation
Cross-Entropy
Binary Distribution
\(Cost = \mathbf{J} = - \sum_{i=1}^N \bigg( t_i \log (y_i) + (1-t_i) \log (1-y_i) \bigg)\)
- \(y_i\) stands for the probability of the output being equal to 1, or we say the probability of a hit, i.e. \(p(hit)\)
- \(t_i\) stands for the target, the actual target as \(0\) or \(1\), or we say it's a hit or a miss
- \(N\) stands for the total number of trials
- \(i\) stands for the \(i\)-th trial
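A small NumPy sketch of the binary cross-entropy cost as defined above (the clipping constant is an assumption to avoid \(\log(0)\)):

```python
import numpy as np

def binary_cross_entropy(t, y):
    """J = -sum(t*log(y) + (1-t)*log(1-y)); t in {0,1}, y = predicted p(hit)."""
    y = np.clip(y, 1e-12, 1 - 1e-12)   # avoid log(0)
    return -np.sum(t * np.log(y) + (1 - t) * np.log(1 - y))

print(binary_cross_entropy(np.array([1, 0, 1]), np.array([0.9, 0.2, 0.7])))
```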
Interpretation
For a convex cost (objective) function, the global minimum has the smallest value. To search for the minimum, we go down the curve, i.e., in the descending direction; that's called gradient descent. On the contrary, to maximize the objective, we do gradient ascent.
Multiple Distribution
\(Cost = \mathbf{J} = - \sum_{i=1}^N \sum_{j=1}^K \bigg( t_j^i \log (y_j^i) \bigg)\)
- \(K\) stands for \(K\) classes
The Derivative of Cross-Entropy
Multiple Distribution
| Input layer | Weight \(W\) | Hidden layer | Weight \(V\) | Output layer |
|---|---|---|---|---|
| \(D\) features \(\{x_1, ..., x_D\}\) | \(w_{dm}\) | M features \(\{z_1, ..., z_M\}\) | \(v_{mk}\) | \(K\) classes |
- Do backward propagation from the output layer to the hidden layer ↓
\[ \begin{align*} \frac{\partial \mathbf(J)}{\partial v_{mk}} & = -\frac{\partial}{\partial v_{mk}} \bigg( \sum_{i \in N} \sum_{j \in K} t_j^i \log y_j^i \bigg) \\ & = -\sum_{i \in N} \sum_{j \in K} t_j^i \frac{1}{y_j^i}\frac{\partial y_j^i}{\partial a_k}\frac{\partial a_k}{\partial v^k_m} \\ & = -\sum_{i \in N} \bigg( \sum_{j \in K} t_j^i \frac{1}{y_j^i} \frac{\partial y_j^i}{\partial a_k} \bigg) z_m \\ & = -\sum_{i \in N} \sum_{j \in K} t_j^i (I_{j=k} - y^i_k) z_m \\ & = -\sum_{i \in N} \sum_{j \in K} (t_j^iI_{j=k}-t_j^iy^i_k) z_m \\ & = -\sum_{i \in N} \big( t_k^i - y^i_k \big) z_m \end{align*} \]
The second line of the above equation
\(y^i_j=\frac{e^{a_j}}{\sum_{j=1}^K e^{a_{j}}}\)
\(\frac{\partial y^i_{j=k}}{\partial a_k}=y^i_k(1-y^i_k)\)
\(\frac{\partial y^i_{j \neq k}}{\partial a_k}=-y^i_jy^i_k\)
\(\sum_{j \in K} \frac{1}{y^i_j} \frac{\partial y^i_j}{\partial a_k} = \sum_{j \in K} (I_{j=k} - y^i_k)\)
\(a_k={v^k_m}^Tz_m=\sum_m^M({v^k_m}z_m)\)
\(\frac{\partial a_k}{\partial v_m^k}=z_m\)
The last two lines of the above equation
\(\sum_{j \in K} (t_j^iI_{j=k})=t^i_1 \cdot 0 + t^i_2 \cdot 0 +...+ t^i_k \cdot 1 +...+ t^i_K \cdot 0 = t^i_k\)
\(\sum_{j \in K} t_j^iy^i_k=y^i_k \big(t^i_1 + t^i_2 +...+ t^i_k + ... + t^i_K \big)=y^i_k \cdot 1\)
- For each data point, its target value \(t^i\) belongs only to the \(k\)-th class, so \(t^i_k=1\) and the other \(t^i_{j \neq k}=0\)
Gradient Descent
\(\theta = \theta - \eta \cdot \bigtriangledown_{\theta} \mathbf{J}(\theta)\)
Objective function, learning rate, and the quality of the gradient all matter
Adam is the best default choice of optimizer for all cases
L-BFGS is a good choice if you can afford to do full-batch updates
Minibatch gradient descent is the best default choice for how much data to use per update
Sebastian Ruder's website
Concept
- For each epoch of training:
- Batch GD uses entire dataset
- Stochastic GD uses one data point
- Mini-Batch GD uses a subset of dataset, to train the network
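A hedged sketch of the minibatch update loop (the `grad_fn` callback is a placeholder assumption; `batch_size=len(X)` recovers batch GD and `batch_size=1` recovers SGD):

```python
import numpy as np

def minibatch_gd(X, y, grad_fn, theta, lr=0.01, batch_size=64, epochs=10, seed=0):
    """One pass over the shuffled data per epoch, updating theta per minibatch."""
    rng = np.random.default_rng(seed)
    n = len(X)
    for _ in range(epochs):
        idx = rng.permutation(n)                      # shuffle each epoch
        for start in range(0, n, batch_size):
            b = idx[start:start + batch_size]
            theta = theta - lr * grad_fn(theta, X[b], y[b])
    return theta
```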
1st and 2nd Order Optimization
- Newton parameter update (vanilla method)
- No learning rate needed
- The Hessian matrix is too big for deep learning problems
- Works for full batch only
BFGS
L-BFGS
Stochastic Gradient Descent, SGD
\(w_{t+1}=w_{t}-\alpha \bigtriangledown f(w_t)\)
- What if one direction is more sensitive than the other? For high dimensions, this is a common problem
- Local minima or saddle point problem: the gradient is zero in both cases
- Local minima are common in lower-dimensional cases
- For high-dimensional cases, saddle points are a bigger problem
- The problem of plateaus results in a slow learning process
- Note: these problems are not specific to SGD
Momentum
Exponentially Weighted Average
\(v_{t+1}=\rho v_t + (1-\rho)\bigtriangledown_{\theta} f(\theta)\)
\(\theta_{t+1}=\theta_{t}-\alpha v_{t+1}\)
- It's a kind of moving-average technique applied to the gradients across the training process
- \(v\) is called velocity, which is a running mean of gradients
- \(\rho\) functions as friction, usually set to \(0.9\) or \(0.99\)
- \((1-\rho)\) is optional but highly suggested
Math Intuition
\(v_{t}=\rho v_{t-1}+(1-\rho) \theta_t\)
\(e^{-1}=(1-\beta)^{1/\beta}, \;\; \beta=1-\rho\)
Bias Correction
\(\frac{v_t}{1-\rho^t}\)
- As \(t\) grows, \(1-\rho^t\) will be close to 1 and have less effect on \(v_t\)
- This is used to correct the first few \(v\), e.g. \(v_0, v_1, v_2\)
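A small sketch of one momentum step with bias correction, mirroring the formulas above (function and argument names are illustrative):

```python
def momentum_step(theta, grad, v, t, lr=0.01, rho=0.9):
    """One SGD-with-momentum update; t is the 1-based step counter."""
    v = rho * v + (1 - rho) * grad      # exponentially weighted average of gradients
    v_hat = v / (1 - rho ** t)          # bias correction for the first few steps
    return theta - lr * v_hat, v
```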
Nesterov Momentum
π Calculate the gradient after updating the point with a big jump
\(v_{t+1}=\rho v_t -\alpha \bigtriangledown f(\theta_t+\rho v_t)\)
\(\theta_{t+1}=\theta_{t}+v_{t+1}\)
or
\(\tilde{\theta}_t=\theta_t+\rho v_t\)
\(v_{t+1}=\rho v_t - \alpha \bigtriangledown f(\tilde{\theta}_t)\)
\(\tilde{\theta}_{t+1}=\tilde{\theta}_{t}+v_{t+1}+\rho (v_{t+1}-v_t)\)
AdaGrad
\(s \mathrel{+}= (\nabla \mathbf{J})^2\), i.e., accumulate the squared gradients
Learning rate divided by the square root of the sum of squares of historic gradients
- AdaGrad (Adaptive Gradient) performs better in convex cases
- Accumulated squared gradient makes learning rate become very small as the network gets deeper
RMSProp
- Introduces a fix of AdaGrad for the non-convex case
- Rescales \(d\theta\); only the squared gradient is averaged, not the gradient itself
\[ \begin{align*} s_{dw} & = \rho s_{dw} + (1-\rho)dw^2 \\ s_{db} & = \rho s_{db} + (1-\rho)db^2 \\ w & = w - \alpha \frac{dw}{\sqrt{s_{dw}+\epsilon}} \\ b & = b - \alpha \frac{db}{\sqrt{s_{db}+\epsilon}} \end{align*} \]
- \(d\theta^2\) is element-wise
- \(d\theta^2\) is to rescale the \(d\theta\) when updating \(\theta\)βs value
- For a dimension where \(d\theta\) is large, dividing by \(\sqrt{s_{d\theta}}\) damps that step, while a dimension with a small gradient gets a relatively larger step (the projection of one learning step onto each dimension is the key to its learning speed). Rescaling by the accumulated \(d\theta^2\) therefore balances and speeds up the overall learning process.
Adam
\[ \begin{align*} v_{dw} & = \rho_v v_{dw} + (1-\rho_v)dw \\ v_{db} & = \rho_v v_{db} + (1-\rho_v)db \\ s_{dw} & = \rho_s s_{dw} + (1-\rho_s)dw^2 \\ s_{db} & = \rho_s s_{db} + (1-\rho_s)db^2 \\ v_{dw}^{\prime} & = \frac{v_{dw}}{1-\rho_v^t},s_{dw}^{\prime} = \frac{s_{dw}}{1-\rho_s^t} \\ v_{db}^{\prime} & = \frac{v_{db}}{1-\rho_v^t},s_{db}^{\prime} = \frac{s_{db}}{1-\rho_s^t} \\ w & = w - \alpha \frac{v_{dw}^{\prime}}{\sqrt{s_{dw}^{\prime}+\epsilon}} \\ b & = b - \alpha \frac{v_{db}^{\prime}}{\sqrt{s_{db}^{\prime}+\epsilon}} \end{align*} \]
- Adaptive moment estimation
- Combines Momentum, AdaGrad, and RMSProp
- Includes weighted average, rescaling, and bias correction
- \(\alpha\) has to be tuned
- \(\rho_v=0.9,\rho_s=0.999,\epsilon=10^{-8}\)
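A sketch of one Adam step mirroring the equations above (names are illustrative; \(t\) starts at 1):

```python
import numpy as np

def adam_step(w, dw, v, s, t, lr=1e-3, rho_v=0.9, rho_s=0.999, eps=1e-8):
    """One Adam update for a single parameter tensor."""
    v = rho_v * v + (1 - rho_v) * dw           # first moment (momentum)
    s = rho_s * s + (1 - rho_s) * dw ** 2      # second moment (RMSProp-style rescaling)
    v_hat = v / (1 - rho_v ** t)               # bias correction
    s_hat = s / (1 - rho_s ** t)
    w = w - lr * v_hat / np.sqrt(s_hat + eps)
    return w, v, s
```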
Nadam
Activation Function
π The purpose of the activation function is for nonlinear transformation
π The default best selection is ReLU
| Activation Function | Range | Description | Derivatives \(\frac{\partial g(z)}{\partial z}\) |
|---|---|---|---|
| sigmoid | \((0,1)\) | If \(y\) takes a binary result, i.e. \(0\) and \(1\) | \(g(z)(1-g(z))\) |
| tanh | \((-1,1)\) | If \(y\) is expected to be symmetric over a smaller range | \[ g'(z) = \begin{cases} \approx 0 & \quad \text{if } \lvert z \rvert > 3\\ \in (0,1] & \quad \text{if } \lvert z \rvert \leq 3 \end{cases} \] |
| ReLU | \(\max(0,z)\) | If the range of \(y\) is larger than \(0\) | \[ g'(z) = \begin{cases} 1 & \quad \text{if } z > 0\\ 0 & \quad \text{if } z \leq 0 \end{cases} \] |
| LeakyReLU | \(\max(\alpha \cdot z,z)\) | To reduce the shrinkage effect resulting from ReLU | \[ g'(z) = \begin{cases} 1 & \quad \text{if } z > 0\\ \alpha & \quad \text{if } z \leq 0 \end{cases} \] |
- \(g(z)\) stands for the activation function
- For sigmoid/tanh, the slope goes to \(0\) as \(\lvert z \rvert\) grows large, and this slows down the iterations
- A linear function will not be used as an activation function, because it also gives a linear outcome, and stacking linear layers into a deeper model adds nothing
Convolutional Neural Network
π Parameter sharing
π Sparsity of connections
- parameter sharing: a small filter functions as a sliding window which reuses its parameters across the image
- parameter size \(= (\text{filter's size} + 1) \times \text{the number of filters}\); each filter contributes one constant bias parameter \(b\)
- sparsity of connections: each output value is linked to a small region of the input layer
- Required computing memory is proportional to the number of pixels in each layer
Computer Vision
Edge Detection
Convolution Layer
Output Dimension
\(D_{out}=\bigg\lfloor\frac{D_{in}+2\times P-D_{f}}{S}\bigg\rfloor+1\)
- \(D\) stands for dimension
- \(D_f\) is the dimension of the filterβs size
- \(P\) stands for padding
- \(S\) stands for stride
- \(D_W \times D_H \times K_F\), where \(K\) stands for the number of filters, not the channel of the image or the filter
- Computation-intensive
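A tiny helper implementing the output-dimension formula above (illustrative only):

```python
def conv_output_dim(d_in, d_filter, padding=0, stride=1):
    """D_out = floor((D_in + 2*P - D_f) / S) + 1"""
    return (d_in + 2 * padding - d_filter) // stride + 1

print(conv_output_dim(32, 5))                       # 28: 5x5 filter, stride 1, no padding
print(conv_output_dim(32, 3, padding=1, stride=1))  # 32: "same" padding keeps the size
```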
Hyperparameters π
Number of Parameters Calculation
- Filter's width \(\times\) height \(\times\) channel/depth \(\times K + K\) (the trailing \(+K\) is one bias per filter)
Filter
- Filter's size \(= a\), or we say kernel \(= a\)
- \(a \times a \times\) channel/depth (optional), e.g. \(5 \times 5\), \(3 \times 3\), \(1 \times 1\)
- Choosing the filter's size decides the number of neurons which all target the same spatial region but look for different things
- If \(K\) filters are used, the output dimension of the transformed image is width \(\times\) height \(\times K\)
- Different channels/layers of depth will be aggregated together after computation with a single filter
Stride
- Movement of the filterβs window
Padding
- zero padding keeps the same dims as the input's by simply adding zeros around the periphery
- valid padding means no padding
- same padding means the output size is equal to the input size
Example
| Layer | Shape | Size | Parameter size |
|---|---|---|---|
| Input | (32,32,3) | \(32\times32\times3=3072\) | \(0\) |
| Conv (\(f=5, s=2\)) | (12,12,6) | \(12\times12\times6=864\) | \((5\times5+1)\times6=156\) |
| Pool (\(f=2\)) | (6,6,3) | \(6\times6\times3=108\) | \(0\) |
| FC | (64,1) | \(64\times1=64\) | \(108\times64+1=6913\) |
| FC | (32,1) | \(32\times1=32\) | \(64\times32+1=2049\) |
| Softmax | \((3,1)\) | \(3\times1\) | \(32\times3+1=97\) |
More Layers
Fully Connected Layer, FC
Dimension
- \(n^{[l]} \times n^{[l-1]} + 1\) is the number of parameters for the \(l\)th layer
- \(n^{[l]}\) is the output's size of the \(l\)th layer
- Each neuron connects to the entire input volume
- Memory-intensive
- AlexNet and VGG have FC layers which make the network really huge and require more memory
Pooling Layer
- Type of pooling
- max pooling
- average pooling
- Hyperparameters
- Common settings are \(D_f=2, D_s=2\) and \(D_f=3, D_s=2\)
- Goal: downsampling
- Only width and height will be changed, but depth remains the same
- \(Dim_{out}=\frac{Dim_{in} - Dim_{filter}}{Stride} + 1\)
- No parameters introduced
- Zero padding is NOT COMMON for pooling due to its downsampling goal
Inception Module
Architecture
| ImageNet | Year | Description | Training time |
|---|---|---|---|
| LeNet-5 | 1998 | 5 layers, plus 2 fully connected layers \(5\)x\(5\) filter and \(2\)x\(2\) avgpooling only | |
| AlexNet | 2012 | 8 layers, plus 3 fully connected layers 2 parallel architectures About 60M parameters ReLU Local response normalization | |
| VGG16/19 | 2014 | 16-19 layers, plus 3 fully connected layers In forward propagation, it takes 96MB/image, i.e. 24M values * 4 bytes About 138M parameters \(3\)x\(3\) filter with the same padding and \(2\)x\(2\) maxpooling only | 4 GPUs for 2 to 3 weeks |
| GoogLeNet | 2014 | 22 layers, no fully connected layers | |
| ResNet | 2015 | One residual block with one skip connection |
Recurrent Neural Network
Truncated Backpropagation
- check min-char-rnn.py
Gradient Flow
- largest singular value > 1 : exploding
- gradient clipping : scale gradient
- largest singular value < 1 : vanishing
- LSTM
Vanilla RNN
GRU
- No cell state
- Output \(h_t\) only and no \(C_{t}\)
Mathematics
- \(r_t=\sigma(W_r \cdot [h_{t-1}, x_t]+b_r)\)
- \(z_t=\sigma(W_z \cdot [h_{t-1}, x_t]+b_z)\)
- \(\tilde{h_t}=tanh(W_h \cdot [r_th_{t-1}, x_t]+b_h)\)
- \(h_t=(1-z_t)\tilde{h_t}+z_th_{t-1}\)
LSTM
- 1997
- add cell state
Concept
- Symmetric product
- Element-wise product \(\odot\)
Keys
- 4 Gates :
- Input :
- Forget : erase/remove
- Gate : write
- Output :
- \(c_{t-1}\) and \(h_{t-1}\) are two inputs from the previous stage
- \(c_t = f \odot c_{t-1} + i\odot g\)
- \(h_t = o \odot \tanh(c_t)\)
Mathematics
\(f_t=\sigma(W_f \cdot [h_{t-1}, x_t]+b_f)\)
\(i_t=\sigma(W_i \cdot [h_{t-1}, x_t]+b_i)\)
\(\tilde{C_t}=tanh(W_C \cdot [h_{t-1}, x_t]+b_C)\)
\(C_t = f_t \times C_{t-1} + i_t \times \tilde{C_t}\)
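For completeness, the standard LSTM output gate and hidden-state equations (consistent with the Keys above, though not written out in the original notes):

\(o_t=\sigma(W_o \cdot [h_{t-1}, x_t]+b_o)\)

\(h_t = o_t \odot \tanh(C_t)\)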
Generative Adversarial Networks
Training Tricks
Setting the number of epochs to around 30 should be fine
Model ensembling
Multi-crop at test time for image analytics
Transfer learning
Shuffle the training data for each epoch
Save the model parameters and restart the training several times
Note
Model ensembling and multi-crop tricks may help slightly improve your model performance for a certain dataset you are working with. However, at the production level, it's better to train a model with acceptable performance rather than using these two tricks.
Accelerate Optimization Process
Normalization
It works with both batch and minibatch gradient descent
Covariate shift
Intuition
Normalization makes the radial distances from all dimensions to the minimum point of the loss function more comparable, which avoids a zig-zag gradient descent pattern during model training.
By intuition, batch norm for each hidden layer makes that layer less vulnerable to the changes imposed by the previous layers. This implies that each layer becomes more independent of the others.
Minibatch normalization has a similar effect to regularization or dropout, because the samples' mean and variance vary across minibatches and thus introduce additional noise to the model.
Testing
For model testing, using exponential weighted average to calculate the mean and variance of the minibatches used for training is suggested.
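A simplified NumPy sketch of a batch-norm forward pass that keeps exponentially weighted averages for test time (names and the momentum value are illustrative assumptions):

```python
import numpy as np

def batchnorm_forward(x, gamma, beta, running_mean, running_var, momentum=0.9,
                      eps=1e-5, training=True):
    """Batch norm over a minibatch x of shape (batch, features)."""
    if training:
        mu, var = x.mean(axis=0), x.var(axis=0)
        # exponentially weighted averages reused at test time
        running_mean = momentum * running_mean + (1 - momentum) * mu
        running_var = momentum * running_var + (1 - momentum) * var
    else:
        mu, var = running_mean, running_var      # use the training-time averages
    x_hat = (x - mu) / np.sqrt(var + eps)
    return gamma * x_hat + beta, running_mean, running_var
```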
Weight Initialization
Solution to gradient vanishing/exploding
For deeper neural networks, weights slightly greater/less than 1 will result in an exponential increase/decrease of the activations (and gradients) in deeper layers
\(\text{Var}(w) = \frac{1}{n}\) or \(\frac{2}{n}\)
| Method | Math | Description |
|---|---|---|
| He initialization (for ReLU) | \(w_{random}\sqrt{\frac{2}{n^{[l-1]}}}\) | |
| Xavier initialization | \(\text{tanh: }\sqrt{\frac{1}{n^{[l-1]}}}\) | |
| Xavier initialization (alternative form) | \(\text{tanh: }\sqrt{\frac{2}{n^{[l-1]}+n^{[l]}}}\) | |
Minibatch
The concept is to use a subset, called a minibatch, sampled from the entire training set, called the batch
Using a minibatch size of 1 is called stochastic gradient descent
For a big dataset \((n > 2000)\), typical minibatch sizes are \(64, 128, 256, 512\), chosen to fit in CPU/GPU memory
For batch gradient descent, the loss decreases monotonically across iterations.
Learning Rate Decay
π Adam optimization functions as an alternative for learning rate decay
π Learning rate annealing, e.g.Β cosine annealing
\(\alpha=\frac{1}{1+r_{decay} \cdot n }\alpha_0\)
\(\alpha=\frac{k}{\sqrt{n}}\alpha_0\)
Exponential decay
\(\alpha=k^{n}\alpha_0\)
Staircase
Manual decay
- \(k\) is a constant
- \(n\) is the epoch number
- \(k < 1\) for exponential decay
Hyperparameter Selection and Tuning
π Experience is most valuable
π No single one best solution for all kinds of problems across all fields
| Evaluation | Problems | Solutions |
|---|---|---|
| High bias | Training set problem | Bigger network, including a wider layer or a deeper architecture (always help) Train longer (not always help) Different nn architecture |
| High variance | Dev set problem | Collect more data (best way) Regularization Different nn architecture |
\(a_j^{[l]\{i\}(m)}\)
| Hyperparameter | Notation | Importance |
|---|---|---|
| learning rate | \(\alpha\) | |
| friction in momentum | \(\rho\) | |
| minibatch size | \(\{i\}\) | |
| number of hidden neurons | \(j\) | |
| number of hidden layers | \([l] \in L\) | |
| learning rate decay |
Train/Dev/Test Sets
π Dev/Test sets must come from the same distribution
- Mismatched train/test distribution is acceptable
- Test set is not a must
Bias-Variance Tradeoff
Concept behind is the tradeoff between a general/specialized model
High train set error implies high bias
High dev set error implies high variance
High train/dev set error implies the model is in general bad; this is a common problem in a high-dimensional parameter space
A bigger network reduces the bias without increasing the variance
Data augmentation reduces the variance without increasing the bias
Regularization
Please note that regularization here has a general meaning; it is not limited to the penalty term
| Tricks | How to do | Description |
|---|---|---|
| Penalty term | Additive to the loss function | This is a must property |
| Dropout | Additive to each layer | It functions like the penalty term, and for images analytics with neural networks, this is a nice property which boosts the model performance on testing data |
| Data augmentation | Additional actions | In practice, this action takes time and involves more human works |
| Activation function \(\text{tanh}\) | Built-in architecture design | \(\text{tanh}\) makes the weights center around \(0\), and values around \(0\) have an approximately linear relationship, which results in a decision boundary with a less nonlinear shape |
| Early stopping | Additional actions | This is not suggested because early stopping couples the optimization process with the handling of overfitting. It makes it harder to figure out where the main problem is if the network performs badly |
Penalty Term
Also called weight decay
Regularization prevents overfitting
\(L2\) should be the default choice
\(\lambda\), called the regularization parameter, is the hyperparameter which has to be tuned
The larger \(\lambda\) is, the stronger the regularization effect
\[ \begin{align*} \text{Loss} & = \mathbf{J} + \frac{\lambda}{m}\rVert w^{[l]} \lVert^2_F \\ dw^{[l]} & = \frac{1}{m}dZ^{[l]} A^{[l-1]T}+\frac{\lambda}{m} w^{[l]} \\ w^{[l]} & = w^{[l]}-\alpha dw^{[l]} \end{align*} \]
\(\mathbf{J_{\text{regularized}}} = \bigg[-\frac{1}{m} \sum^m_{i=1} \big( y_i \log(a_i)+(1-y_i) \log(1-a_i) \bigg) \bigg] + \bigg[ \frac{1}{m} \frac{\lambda}{2} \sum w^2 \bigg]\)
\(dw^{[l]} = \frac{1}{m} dZ^{[l]}A^{[l-1]T}+\frac{\lambda}{m}w^{[l]}\)
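A one-line sketch of the regularized gradient step above for a single layer (names are illustrative; shapes follow the \(dZ^{[l]} A^{[l-1]T}\) convention):

```python
import numpy as np

def l2_regularized_step(w, dZ, A_prev, lam, m, lr):
    """dw = (1/m) dZ @ A_prev.T + (lambda/m) w, then w := w - lr * dw."""
    dw = (dZ @ A_prev.T) / m + (lam / m) * w
    return w - lr * dw
```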
| Regularization | Math | Effect | Description |
|---|---|---|---|
| \(L2\) | \(\lVert w \rVert^2_F\) | Most of \(w\) will be small values; in convention, this is called the Frobenius norm | No feature selection effect |
| \(L1\) | \(\lVert w \rVert_1\) | Some \(w\) will become \(0\) | Strong feature selection effect |
Dropout
Drop features randomly
Do not use dropout during testing, because that would result in random predictions; with inverted dropout, the keep-probability does not need to be applied in the test-time calculations either
Implementing dropout in the model is a bit tricky; check the code below to get an idea:
| Method | Math | Description |
|---|---|---|
| Inverted dropout |
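A minimal NumPy sketch of inverted dropout (function and parameter names are illustrative assumptions):

```python
import numpy as np

def inverted_dropout(a, keep_prob=0.8, training=True, seed=0):
    """Inverted dropout on activations a."""
    if not training:
        return a                                   # no dropout (and no rescaling) at test time
    rng = np.random.default_rng(seed)
    mask = rng.random(a.shape) < keep_prob         # keep each unit with prob keep_prob
    return a * mask / keep_prob                    # scale so E[a] is unchanged
```

Because the kept activations are divided by `keep_prob` during training, their expected value is unchanged, which is why nothing extra is needed at test time.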
Interpretation
It forces the neurons to spread out the weights, and has a similar effect to the \(L2\) norm.
The cost function \(J\) becomes not well-defined when using dropout. Because dropout is a solution to overfitting, if there is no overfitting problem, probably do not introduce dropout.
Data Augmentation
π Increasing batch size has the same effect as decaying learning rate
Activation Function
Early Stopping
Sanity Checking
Gradient Checking
Debug backpropagation code
Doesn't work with dropout layers
Don't forget the regularization term
\(d\theta_k^{approx} \approx \frac{\mathbf{J(\theta_1,..,\theta_k+\epsilon,...,\theta_n)}-\mathbf{J(\theta_1,..,\theta_k-\epsilon,...,\theta_n)}}{2\epsilon}\)
Criteria
\(\frac{\lVert d\theta^{approx} - d\theta \rVert_2}{\lVert d\theta^{approx} \rVert_2 + \lVert d\theta \rVert_2} < 10^{-7}\)
- \(O(\epsilon^2)\)
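A hedged sketch of the check for a flattened parameter vector (names are illustrative; `J` is any scalar cost function of \(\theta\)):

```python
import numpy as np

def grad_check(J, theta, dtheta, eps=1e-7):
    """Compare the analytic gradient dtheta with a centered-difference approximation."""
    approx = np.zeros_like(theta, dtype=float)
    for k in range(theta.size):
        plus, minus = theta.copy(), theta.copy()
        plus[k] += eps
        minus[k] -= eps
        approx[k] = (J(plus) - J(minus)) / (2 * eps)
    num = np.linalg.norm(approx - dtheta)
    return num / (np.linalg.norm(approx) + np.linalg.norm(dtheta))  # want < 1e-7
```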
Notation
\(a^{[\text{l}](i)}_j\)
- \(i\)-th sample, \(i=1, ..., N\)
- \(j\)-th neuron, \(j=1, ..., M\)
- \(l\)-th layer, \(l=1, ..., L\)
- \(a\) stands for the activation layer
- Input layer is not counted in
- \(z\) is the feature of the hidden layer
Reinforcement Learning
Loss
Distance Function
\(\sum I_{y\neq\hat{y}} \log(p(x)) \geq 1\)
\(d_p(x,y) = 0 \iff x=y\)
Norm
\(d_p(x,y)=|x-y|_p\)
- \(L_1\) and \(L_2\) norms are convex
- \(L_0\) pseudo-norm and \(L_{\frac{1}{2}}\) are concave
- The hinge loss is a solution to the outlier problem
A convex function is continuous and its second derivative is always greater than or equal to zero
Log Likelihood
MLE
MAP
Kernel
π Similarity measure : the larger the kernel, the higher level of similarity
Entropy
\(H = -\sum_i p_i \log(p_i)\)
\(H = -E_i(\log(p_i))\)
- \(P(event) = 1\) : entropy \(=\) 0
- \(P(event) = 0\) : entropy \(=\infty\)
- \(P(event) = 0.5\) : entropy \(= 1\)
- Uniform distribution has the highest entropy
- Example:
Gini Index
\(Gini=\frac{\text{Area between the line of equality and Lorenz curve}}{\text{Area under the line of equality}}\)
Minimum Description Length, MDL
\(E(length) \geq - \sum P(z_i)\log_2 (P(z_i))\)
Kullback-Leibler Distance, KLD
\(D(new|old) = \sum P(new) \log \frac{P(new)}{P(old)}\)
- Also called relative entropy
- NOT symmetric
- NOT a distance measure
- NOT a metric measure
- Remark: however, if the data is normalized to mimic a distribution, then KLD works
Evaluation
Using a single-number evaluation metric can speed up the iteration and is thus highly suggested
The F1 score is the harmonic mean of precision and recall
The evaluation metric equals one optimizing metric plus one or several satisficing metrics
To evaluate the model performance, we usually have one optimizing metric, e.g. the accuracy, precision, or recall, plus one or several constraints, which we call the satisficing metrics here. The constraints might be the running time of the training process, or the weight penalization on the misclassification of a subgroup of classes, etc.
Bayes optimal error is the theoretical minimum error a model can achieve, and in practice we use the human-level error to approximate it. On one hand, the gap between the training error and the human-level error is the avoidable bias; the approximation of the Bayes error by the human-level error can be improved by including more human expertise. On the other hand, the gap between the training and dev errors results from the variance.
Confusion Matrix
Evaluation Metric
Intersection over Union, IoU
Tricks
Cross Validation
CV is in general a nice way for small datasets.
Leave-one-out CV, LOOCV
Learn a classifier on all the training data minus one example and evaluate its error on the held-out example
- When n is small
- Unbiased estimate of the error
K-fold CV
π Split data into K subsets (usually K = 10)
- When n is large
- Unbiased estimate of the error
Bias-Variance Decomposition
π Goal is to minimize the total error
π The sum of the distances between the raw data and its central estimate is called variance
π The sum of the distances between the true target and its central estimate is called bias
π The sum of variance, bias, and noise is the total error
Concept
Assume that in a real world \(y\) can be represented by a function of \(x\) plus a noise term \(\epsilon\). We want to find an approximate function of \(x\) to mimic the true function\(^{2}\).
- \(^{1}\) \(y=f(x)+\epsilon\)
- \(^{2}\) \(\hat{y}=\hat{f}(x)\)
\(E_{x,y,D}\big( (\hat{f}(x)-y)^2\big)=E_{x,D}\big((\hat{f}(x)-\bar{\hat{f}}(x))^2\big) + E_{x,y}\big((\bar{\hat{f}}(x)-y)^2\big)\)
1st Term: Model variance
- A larger variance implies a higher flexibility, which enables the model to capture the noise in the training data
- To reduce model variance
- Less model complexity
- Data augmentation
2nd Term: Bias and Noise
\(E_{x,y}\big((\bar{\hat{f}}(x)-y)^2\big)=E_x\big((\bar{\hat{f}}(x)-f(x))^2\big) + E_{x,y}\big((f(x)-y)^2\big)\)
- 1st term: Bias
- Reduce the bias with a complex model
- 2nd term: Sample Variance or Noise
- Cannot be reduced!
| Complexity/Flexibility | Variance | Bias | Noise | Model | Generality | Problem |
|---|---|---|---|---|---|---|
| Low | Low | High | Unchangeable | Linear | High | Underfitting |
| High | High | Low | Unchangeable | Non-linear | Low | Overfitting |
Note
- Model variance is different from sample variance
- Variance is to present the spread of the distribution
- Bias is to present how far it is between your estimate and your target
Boosting
π A process in which we turn a weak classifier into a strong one
π Goal is to reduce the total error
AdaBoost
\(e_t=\sum_i e \big( F_{t-1}(x_i)+\alpha_th(x_i) \big)\)
- \(h_i\) is the hypothesis for \(x_i\)
- Downside is being sensitive to outliers and noise
XGBoost
LPBoost
Bagging
Bootstrap
\(D_1,...,D_m \sim D\)
- m sets of bootstrap samples, each of sample size n
- Also called bootstrap aggregating
- Improves stability and accuracy
- Use the average of the m samples for regression or the majority vote for classification
- Bagging does NOT always improve the performance. It may degrade the performance of stable methods, e.g.Β KNN
\(\lim_{n\rightarrow\infty}(1+\frac{-1}{n})^n = e^{-1} \sim 0.368\)
If the sample size is large, the chance that a given data point is not chosen in a bootstrap sample is about 0.368
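A quick simulation of that probability (purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
n, trials = 1000, 5000
# fraction of bootstrap samples of size n that never pick data point 0
not_chosen = np.mean([0 not in rng.integers(0, n, n) for _ in range(trials)])
print(not_chosen, np.exp(-1))   # both close to 0.368
```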
Hardware
NVIDIA GPU
| Net | Batch size | Global memory |
|---|---|---|
| VGG-16 | 32 | |
| VGG-16 | 64 | 8GB |
| VGG-16 | 128 | 14GB |
- Reduce batch size
- Reduce modelβs size
- Distribute the model on multiple GPUs
TFLOPS
Easa Project: Car Detection
The section YOLO covers almost all important topics for object detection, including classification, localization, landmark detection, sliding window, fully connected layers with convolutional implementation, grid region, intersection over union, non-max suppression, and anchor boxes.
Reference
| Methods | Year | Title | Summary |
|---|---|---|---|
| R-CNN | 2013 | ||
| Fast R-CNN | 2015 | ||
| Faster R-CNN | 2016 |
YOLO
Redmon et al., 2015
Segmentation
Region Proposal
Propose regions
Convolve over one region or all regions at a time
Convolve end-to-end, including region proposals