0.1 Week 7

Reference: https://arxiv.org/pdf/1603.05027.pdf

Article Reviewed: Identity Mappings in Deep Residual Networks

0.2 Basics:

0.2.1 Summary:

[[ResNet]] uses identity mappings (an identity skip connection and an identity after-addition activation) to directly propagate forward and backward signals from one block to any other block. Making both the shortcut connection and the after-addition activation identity mappings makes training faster and easier, and makes information propagation smoother.

? Questions:

  1. Do we want the gradient to vanish across layers?

  2. Why do we need BN?

  3. How is the constant scaling chosen (always 0.5)?

  4. Why is full pre-activation the best?
    x_l -> BN -> ReLU -> weight -> BN -> ReLU -> weight -> x_{l+1}

  5. Why are the dropout and 1x1 conv shortcuts bad for optimization?

📍 Keywords Category: ResNet, Identity mapping, direct path

Context: ResNet's central idea is to learn an additive residual function \(F\) with respect to \(h(x_l)\), with the key choice of using an identity mapping \(h(x_l)=x_l\).

Correctness:

Contribution: The direct path makes training easier and faster when \(h(x_l)\) and \(f(y_l)\) are both identity mappings.
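In formulas (following the paper's notation), when both mappings are identities each Residual Unit reduces to
\[ x_{l+1} = f(y_l) = y_l = h(x_l) + F(x_l, W_l) = x_l + F(x_l, W_l), \]
so the input is carried forward unchanged and only the residual \(F(x_l, W_l)\) is added.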

Clarity:

0.3 Outline:

0.3.1 1. Introduction

Deep Residual Networks (ResNets) are built from Residual Units:
\[ y_l = h(x_l) + F(x_l, W_l) \]
\[ x_{l+1} = f(y_l) \]
Input to the l-th unit: \(x_l\)
Output of the l-th unit: \(x_{l+1}\)
Identity shortcut: \(h(x_l) = x_l\)
Residual function \(F\): a stack of 3x3 convolutions
Pre-activation: BN and ReLU placed before each weight layer
ResNets exploit network depth, a key to the success of modern deep learning.
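With \(h\) and \(f\) both identities, the recursion unrolls into a direct path from any shallow unit \(l\) to any deep unit \(L\), and the backward pass has a matching direct term (these are the paper's propagation formulas):
\[ x_L = x_l + \sum_{i=l}^{L-1} F(x_i, W_i) \]
\[ \frac{\partial \mathcal{E}}{\partial x_l} = \frac{\partial \mathcal{E}}{\partial x_L}\left(1 + \frac{\partial}{\partial x_l} \sum_{i=l}^{L-1} F(x_i, W_i)\right) \]
The additive term \(1\) carries the gradient \(\partial \mathcal{E} / \partial x_L\) directly to any shallower unit, so it is unlikely to vanish, which addresses the vanishing-gradient question above.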

0.3.2 2. Analysis

Full pre-activation wins: BN and ReLU are moved before each weight layer.
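A minimal sketch of a full pre-activation Residual Unit, written in PyTorch for illustration (the paper specifies the ordering, not this exact code; `PreActBlock` is a hypothetical name and only covers the same-dimension case):

```python
import torch
import torch.nn as nn

class PreActBlock(nn.Module):
    """Full pre-activation Residual Unit: BN -> ReLU -> conv -> BN -> ReLU -> conv, plus identity shortcut."""
    def __init__(self, channels):
        super().__init__()
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)

    def forward(self, x):
        # Pre-activation: BN and ReLU come before each weight layer.
        out = self.conv1(torch.relu(self.bn1(x)))
        out = self.conv2(torch.relu(self.bn2(out)))
        # Identity shortcut with no after-addition activation:
        # x_{l+1} = x_l + F(x_l, W_l)
        return x + out
```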

Skip connection variants studied (the gated shortcuts are sketched after this list):

  1. Constant scaling: shortcut scaled by \(\lambda = 0.5\), \(f = \text{ReLU}\). (? How should the constant be chosen?) Optimization has difficulties when the shortcut signal is scaled down.

  2. Exclusive gating: \(g(x) = \sigma(W_g x + b_g)\), with \(g(x)\) computed by a 1x1 conv layer; \(F\) is scaled by \(g(x)\) and the shortcut by \(1 - g(x)\).

  3. Shortcut-only gating: \(F\) is not scaled; only the shortcut path is scaled by \(1 - g(x)\).

  4. 1x1 conv shortcut: the identity shortcut is replaced by a 1x1 convolution. Works when there are not too many Residual Units (e.g. 34 layers), but gives poor results once the depth reaches 110+ layers.

  5. Dropout shortcut: dropout (ratio 0.5) on the shortcut; fails to converge to a good solution.

These methods can hamper information propagation and lead to optimization problems. The 1x1 conv and dropout shortcuts (variants 4 and 5) have stronger representational ability but are worse at optimization.
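A minimal sketch of the exclusive-gating variant under the same PyTorch assumptions as the block above (hypothetical `GatedShortcutBlock`; the residual-branch ordering is incidental). Shortcut-only gating would keep `(1 - g) * x` but leave the residual unscaled:

```python
import torch
import torch.nn as nn

class GatedShortcutBlock(nn.Module):
    """Exclusive gating: F is scaled by g(x), the shortcut by 1 - g(x)."""
    def __init__(self, channels):
        super().__init__()
        self.residual = nn.Sequential(
            nn.BatchNorm2d(channels), nn.ReLU(),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(channels), nn.ReLU(),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False),
        )
        # g(x) = sigma(W_g x + b_g), implemented as a 1x1 convolution.
        self.gate = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, x):
        g = torch.sigmoid(self.gate(x))
        # The two paths compete through g and 1 - g, so the shortcut is no longer an identity.
        return (1 - g) * x + g * self.residual(x)
```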

0.3.2.1 Goal:

Ease of optimization: when \(f\) is ReLU, the signal is affected whenever it is negative; with many Residual Units this truncation accumulates, and \(f\) is no longer a good approximation to an identity mapping.

Reducing overfitting: in the original (post-activation) design, BN normalizes the signal inside \(F\), but the merged (added) signal is not normalized before being passed as input to the next weight layer.
In the pre-activation version, the inputs to all weight layers have been normalized.

0.3.3 3. Activation

  1. BN after addition: \(f\) then involves both BN and ReLU, so the shortcut signal is altered, which is reflected in difficulty reducing the training loss.

  2. ReLU before addition: this forces a non-negative output from the transform \(F\), so the forward-propagated signal is monotonically increasing, while a residual function should be able to take values in \((-\infty, \infty)\).

  3. Post-activation or pre-activation? Use an asymmetric form in which the activation \(f\) only affects the \(F\) path (see the rearrangement after this list).
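The asymmetric rearrangement from the paper: applying an activation \(\hat{f}\) only on the \(F\) path gives
\[ x_{l+1} = x_l + F(\hat{f}(x_l), W_l), \]
so the addition remains an identity mapping and \(\hat{f}\) (BN + ReLU) becomes the pre-activation of the next unit's weight layers.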

0.3.4 4. Results:

Comparisons on CIFAR-10/100 (state of the art): pre-activation works well on these smaller datasets and lets ResNets be pushed deeper.

ImageNet: very deep ResNets face similar optimization difficulties there as well.