Chapter 5. Neural network (Deep learning)

Neural network는 1980년대 후반부터 주목받기 시작.
이후, Support vector machine, Boosting, Random Forest 등의 방법이 등장하고 큰 성공을 거둠.
이와 함께, Neural network에 대한 관심도가 떨어지기 시작함.
- 이유는 Neural network는 tuning하는 과정이 까다롭고, 제대로 train이 되지 않을 경우, SVM, Boosting 등의 방버에 비해 퍼포먼스가 떨어짐.
2010년 이후에 “Deep learning”이라는 이름으로 다시 주목받기 시작.
특히 이미지, 영상, 텍스트 등의 데이터를 다룰 때, 기존의 방법론 대비 큰 성공을 거둠.
이러한 성공의 이유는 훨씬 큰 training dataset으로 학습할 수 있기 때문인 것으로 추정.

Section 5.1 R에서 deep learning 활용

본 강의에서는 R을 이용하여 deep learning을 수행할 것이다.
그러나 대부분의 머신러닝을 수행하는 연구자들은 pythong을 기반으로 작업하고 대다수의 관련 라이브러리도 파이썬 언어를 기반으로 공유되기 때문에 R자체만으로 deep learning을 수행하는 것은 실행하기에 매우 힘들고 비효율적이다.
따라서 R 환경에서 코딩을 하지만 실제 계산은 파이썬에서 실행되어 효율적으로 deep learning을 수행할 것이다.
이를 위한 패키지가 keras패키지이다.
- Keras는 high-level neural network로 파이썬을 기반으로 하며 빠른 계산을 초점을 맞추고 있다.
- R 패키지 keras는 python Keras API와 연계하여 R에서 빠르게 Deep learning을 수행할 수 있게 만들어준다.
keras패키지는 파이썬과 연계하여 실행되기 때문에 파이썬이 깔려있어야 한다. 이를 포함한 여러가지 이유로 인해 keras 패키지를 사용할 수 있는 환경을 조성하는 것은 여러 어려움이 따른다.
여기서는 keras패키지를 사용할 수 있는 환경을 조성하는 방법에 대해서 다루고자 한다. 자세한 내용은 https://hastie.su.domains/ISLR2/keras-instructions.html 참조.
- keras패키지외에 torch라는 새로운 R패키지가 등장하였다.
- torch패키지는 python 설치를 하지 않아도 된다라는 장점을 가진다.
- 대신 keras패키지보다는 약간 느리다는 단점을 가지고 있다.
- 본 강의에서는 torch가 아닌 keras패키지를 활용하여 진행될 것이다.
keras패키지 사용 환경 조성 방법은 다음과 같다 :

먼저 최신 버전의 ISLR2패키지를 설치한다.

### step 1
if(sum(installed.packages()[,1] %in% "ISLR2") == 0){ 
  install.packages("ISLR2") 
}

기존에 설치되어있는 keras, reticulate, tensorflow 패키지를 삭제한다.

### step 2
tryCatch(
  remove.packages(c("keras", "tensorflow", "reticulate")),
  error = function(e) "Some or all packages not previously installed, that's ok!"
)

최신버전의 keras패키지를 설치한다. 이 때 reticulate 패키지와 tensorflow 패키지도 함께 설치된다.

### step 3
install.packages("keras", repos = 'https://cloud.r-project.org')

miniconda를 PC의 어느 디렉토리에 설치할지 결정한다. 여기서는 miniconda가 설치되어있지 않다고 가정하고 islr-miniconda에 miniconda를 설치하도록 한다.

### step 4
write('RETICULATE_AUTOCONFIGURE=FALSE', file = "~/.Renviron", append = TRUE)
write(sprintf('RETICULATE_MINICONDA_PATH=%s', normalizePath("~/islr-miniconda", winslash = "/", mustWork = FALSE)),
      file = "~/.Renviron", append = TRUE)

R을 끄고 다시 실행한다. 혹은 다음을 입력한다.

### step 5 - restart!!!!
# Sys.setenv(RETICULATE_AUTOCONFIGURE='FALSE',
#            RETICULATE_MINICONDA_PATH=normalizePath("~/islr-miniconda", winslash = "/", mustWork = FALSE))

ISLR2패키지의 helpers 함수를 사용할 것이다. 이후 install_miniconda()와 install_tensorflow()함수를 통해 설치를 완료한다.

### step 6
install_miniconda()
install_tensorflow()

마지막으로 다음을 통해 버전 등 설치 및 구성 현황을 확인할 수 있다.

### step 7
print_py_config()

Section 5.2 Single Layer Neural Network

Neural network의 목적: input \(X=(X_1, \dots , X_p)\)로부터 output \(Y\)를 예측하는 \[y=f(X) + \epsilon\]에서 nonlinear function \(f\)를 다음 그림과 같은 형태로 구축하는 것.
- (Feed-forward neural network의 한 예시)
  - input : \(p=4\)차원의 데이터 \(X=(X_1, \dots , X_4)\)
  - output : continuous response \(Y\)
  - input \(X_1, \dots ,X_4\)가 input layer를 이룸
  - 이 예시에서는 한 개의 hidden layer만 존재.
  - 각 화살표는 각 input이 \(K\)개의 Hidden unit에 관여된다는 것을 나타냄.
    - 이 때, \(K\)는 우리가 미리 지정하는 hyperparameter로 위 그림의 예시에서는 \(K=5\)를 사용함.
  - 이 때, Neural network 모형은 \[\begin{equation} \begin{split} f(X) & = \beta_0 + \sum_{k=1}^{K}{\beta_k X_k} \\ & = \beta_0 + \sum_{k=1}^{K}{\beta_k h_k(X)} \\ & = \beta_0 + \sum_{k=1}^{K}{\beta_k g(w_{k0} + \sum_{j=1}^{p}{w_{kj}X_j})} \\ \end{split} \end{equation}\]
    - Output layer는 input으로 \(A_k\)를 이용하여 선형 모형으로 계산됨.
  - 이 때, \(A_k\)를 activation이라고 부름. \[A_k = h_k (X) = g(w_{k0}+\sum_{j=1}^{p}{w_{kj}X_j})\]
    - 즉, \(A_k\)들은 직접적으로 관측되는 값들이 아님.
  - \(g\)는 activation function이라고 부르며, nonlinear한 함수를 사용.
  - \(\beta_0, \dots \beta_K\)와 \(w_{10}, \dots , w_{Kp}\)는 training해서 추정해야하는 parameter들
activation function \(g\) :
- (Option 1) Sigmoid : \[g(z) = {e^z \over 1+e^z} = {1\over 1+e^{-z}} \]
  - 0부터 1까지의 확률로 변환시켜주는 개념이라고 생각할 수 있음.
- (Option 2) ReLU (rectified linear unit) : \[g(z) = (z)_+ = \begin{cases} 0 \text{ if } z < 0 \\ 1 \text{ if } z \ge 0 \\ \end{cases}\]
과거에는 sigmoid를 사용했으나, 현대의 neural network에서는 ReLU를 일반적으로 사용함.
- ReLU가 훨씬 계산적으로 안정적. (차원의 저주; curse of dimensionality)
- Deep 하게 hidden layer를 쌓을 경우 training을 하면서 parameter를 계산하기 위해 미분한 값들을 계속해서 곱해나가게 됨 (뒤에서 배울 Backpropagation 참조)
  - 이 때, sigmoid의 경우 미분했을 때 0부터 1사이의 값을 가지기 때문에 계속해서 곱해나갈 경우, 프로그래밍 언어로 표현할 수 있는 범위를 벗어나게 작아지면 0이 나오게 됨.
  - 반면, ReLU의 경우 미분했을 때 0 또는 1의 값이 나오므로 계속해서 곱해나간다고 하더라도 sigmoid와 같은 문제가 발생하지 않고 안정적으로 계산 가능. 따라서 “Deep” learning에서 훨씬 안정적으로 계산됨.
activation function \(g\)의 nonlinearity
- \(g\)는 nonlinear해야함.
- 왜냐하면 \(g\)가 linear하면 \(f(X)\)도 결국엔 linear하게 되기 때문에 \(X\)와 \(Y\)의 복잡한 관계를 설명할 수 없음.
- 반면, \(g\)가 nonlinear하게 되면, \(X\)와 \(Y\)의 복잡한 nonlinear한 관계 및 \(X\)간의 interaction 까지도 잡아낼 수 있음.
- (예시) \(p=2\)인 input \(X=(X_1,X_2)\)에 대해서 \(K=2\)이고 \(g(z)=z^2\)인 Feed-forward neural network를 고려할 때
  - parameter들을 다음과 같이 놓게 되면 \[\begin{equation} \begin{split} \beta_0=0, \ & \beta_1={1\over 4}, \ \beta_2=-{1\over 4}, \\ w_{10}=0, \ & w_{11}=1, \ w_{12}=1, \\ w_{20}=0, \ & w_{21}=1, \ w_{22}=1, \\ \end{split} \end{equation}\]
  - \(h_1(X)\)와 \(h_2(X)\)는 다음과 같은 형태가 되고 \[\begin{equation} \begin{split} h_1(X) & = (0+X_1 + X_2)^2 \\ h_2(X) & = (0+X_1 - X_2)^2 \\ \end{split} \end{equation}\]
  - 이 때 \(f(X)\)는 다음과 같이 계산된다. \[\begin{equation} \begin{split} f(X) & = 0 + {1\over 4} (0+X_1 + X_2)^2 - {1\over 4} (0+X_1 - X_2)^2 \\ & = {1\over 4} [ (X_1 + X_2)^2 - (X_1 - X_2)^2 ] \\ & = X_1 X_2\end{split} \end{equation}\]
  - 물론 실제로 \(g(z)=z^2\)를 사용하지는 않음. 왜냐하면 이 경우에는 \(f(X)\)가 항상 2차 다항식 (2nd degree polynomial) 형태로 나올 것이기 때문. 실제로는 대부분 ReLU를 사용하며, ReLU의 경우 이러한 문제점이 생기지 않음.
Neural network를 fitting하는 방법
- parameter인 \(\beta_0, \dots \beta_K\)와 \(w_{10}, \dots , w_{Kp}\)를 데이터로부터 어떻게 fitting할 것인가?
- \(y\)가 continuous인 경우는 일반적으로 squared-error loss를 사용. (자세한 내용은 추후에 다룰 예정. )
  - 즉, \(\sum_{i=1}^{n}{(y_i - f(x_i))^2}\)를 최소화시키는 최적화문제를 풂으로써 fitting.

R코드 예제

######################################################
# library
######################################################
library(ISLR2)
library(dplyr)
library(ggplot2)


######################################################
# data
######################################################
Gitters <- na.omit(Hitters)
n <- nrow (Gitters)

######################################################
# test set
######################################################
set.seed (13)
ntest <- trunc(n / 3)
testid <- sample(1:n, ntest)


######################################################
# linear regression
######################################################
lr.fit <- lm(Salary ~ ., data=Gitters[-testid,])
lr.pred <- predict(lr.fit , Gitters[testid , ])

# mean absolute prediction error
mean(abs(Gitters[testid,]$Salary - lr.pred))

[1] 254.6687

######################################################
# Single layer Neural network
######################################################
library(keras)
# library(tensorflow)
# install_tensorflow()

# reshaping data
x <- scale (model.matrix(Salary ~ . -1, data=Gitters))
y <- Gitters$Salary

# construct the model
modnn <- keras_model_sequential() %>% 
  layer_dense(units=50, activation="relu", input_shape = ncol(x)) %>% # first hidden layer with 50 units with ReLU
  layer_dropout(rate=0.4) %>% # dropout layer (you will learn this later)
  layer_dense(units=1) # output layer - one unit with no activation function

# fitting algorithm
modnn %>% compile(loss="mse", # minimize the squared error loss
                  optimizer=optimizer_rmsprop(),
                  metric=list("mean_absolute_error"))

# fitting
history <- modnn %>% fit (
  x[-testid , ], y[-testid],
  epochs = 1500,   # an epoch amounts to the number of SGD steps required to process n observations
  batch_size = 32, # randomly select 32 training observations for each step of SGD
  validation_data = list (x[testid , ], y[testid])
)
plot(history)

# prediction
npred <- predict(modnn, x[testid , ])
mean(abs(y[testid] - npred))

[1] 250.7136

Section 5.3 Multi-layer Neural Network

현대의 Neural network는 일반적으로 다수의 Hidden layer를 둠 → 그래서 “Deep” learning 이라는 이름이 붙음.
이론적으로, single hidden layrer로도 hidden unit이 많으면 대부분의 \(f\)를 근사할 수 있음.
그러나 multiple hidden layer를 사용하면 hidden unit을 적정 수준의 개수를 사용해도 됨.
(예시) Digit recognition problem :
- MNIST 데이터 - 다음 그림과 같이 손글씨 그림에 대한 데이터로 60000개의 training 이미지와 10000개의 test 이미지가 있음.
- \(y\) : 0 ~ 9 까지의 숫자를 classification하는 문제
  - 이 때, \(Y=(Y_0,\dots Y_9)\)의 길이가 10인 ’dummy variable’의 벡터 형태로 표현.
  - 즉, \(y=3\)이면 \(Y_3=1\)이고 나머지 \(Y_j\)는 0인 구조를 가짐.
  - 이를 machine learning 관련 문헌에서는 “one-hot encoding”이라고 부름
  - 통계학에서 일반적으로 활용하는 categorical 변수를 dummy variable로 표현할 때와는 약간 다른 모습.
  - 통계학에서는 base가 되는 카테고리를 하나 지정하고 dummy variable로 나타낼 때 \(y\)의 카테고리 개수보다 하나 더 적은 길이의 벡터로 표현.
  - 반면에 one-hot encoding은 base가 되는 카테고리를 따로 정하지 않고 \(y\)의 카테고리 개수만큼의 길이의 벡터로 표현함.
- 각 이미지는 \(p=28\times 28=784\) 픽셀로 이루어져 있고 각 픽셀은 8bit의 gray scale (0-255)로 나타내어짐.
  - 즉, \(X=(X_1,\dots ,X_p)\)는 \(p=784\)길이의 vector가 되고 각 \(X_j\)는 손글씨를 나타내는 이미지의 한 픽셀을 나타내며 0-255 사이의 정수 값을 가짐.
- 이러한 손글씨 이미지를 input으로 받아들이고 classification하는 문제에는 Multi-layer Neural network를 적용할 수 있다.
Multi-layer Neural Network는 다음과 같은 개념도를 가짐.
- 위의 예제에서는 두개의 hidden layer를 사용.
  - 첫번째 hidden layer에는 \(L_1=256\)개의 unit을, 두번째 hidden layer에는 \(L_2=128\)개의 unit을 사용함.
  - 1st hidden layer : \[ A_k^{(1)} = h_k^{(1)}(X) = g(w_{k0}^{(1)} + \sum_{j=1}^{p}{ w_{kj}^{(1)}X_j }) \ \text{ for } k=1,\dots , K_1 \]
  - 2nd hidden layer : \[ A_l^{(2)} = h_l^{(2)}(X) = g(w_{l0}^{(2)} + \sum_{m=1}^{K_1}{ w_{lm}^{(1)}A_m^{(1)} }) \ \text{ for } l=1,\dots , K_2 \]
    - 2nd hidden layer의 경우 1st hidden layer의 \(A_k^{(1)}\)를 input으로 사용.
    - 이 때, \(A_k^{(1)}\)는 \(X\)의 함수이므로 \(A_l^{(2)}\)는 다시 \(X\)에 대한 복잡한 형태의 nonlinear함수가 됨.
  - \(w_{kj}^{(1)}\)을 \(\mathbf{w}_1\)이라는 \(785 \times 256\) matrix로 표현.
    - \(w_{k0}^{(1)}\)때문에 row의 개수가 \(p=784\)가 아닌 \(p+1=785\)임.
  - \(w_{lm}^{(2)}\)을 \(\mathbf{w}_2\)이라는 \(257 \times 128\) matrix로 표현.
- Output은 10개의 변수로 이루어짐. 이를 multi-task learning이라고 함.
  - Output layer : \[\begin{equation} \begin{split} Z_m & = \beta_{m0} + \sum_{l=1}^{K_2}{\beta_{ml}h_l^{(2)}(X)} \\ & = \beta_{m0} + \sum_{l=1}^{K_2}{\beta_{ml}A_l^{(2)}} \ \text{ for } m=0,\dots,9 \\ \end{split} \end{equation}\]
  - \(\beta_{ml}\)을 \(\mathbf{B}\)이라는 \(129 \times 10\) matrix로 표현
  - \(y\)가 continuous response라면 \(f_m(X) = Z_m\)
  - \(y\)가 categorical response라면 softmax 함수를 사용
    - 다음과 같이 \[ f_m(X) = Pr(Y=m|X) = {e^{Z_m} \over \sum_{l=0}^{9}{ e^{Z_l} } } \] 로 표현하고, 이후 가장 높은 확률을 가진 카테고리로 예측.
- parameter를 fitting하는 최적화문제를 풀 때 사용하는 loss function은
  - \(y\)가 continuous response라면 squared-error loss를 사용 : \[\text{minimize } \ \sum_{i=1}^{n}{(y_i - f(x_i))^2}\]
  - \(y\)가 categorical response라면 cross-entropy (negative multinomial log-likelihood)를 사용 : \[\text{minimize } \ -\sum_{i=1}^{n}{ \sum_{l=0}^{9}{y_{il}\log{(f_l(x_i))}} }\]
- parameter 개수 : 총 235146개! ( \(\mathbf{w}_1\) - \(785 \times 256 = 200960\) & \(\mathbf{w}_w\) - \(257 \times 128 = 32896\) & \(\mathbf{B}\) - \(129 \times 10 = 1290\))
  - 반면 통계학 분야에서 일반적으로 활용할 수 있는 multinomial logistic linear regression이라면 \(765 \times 9 = 7065\)개의 parameter로 훨씬 적은 수의 parameter를 가지고 있음.
  - MNIST 데이터를 기준으로 training 이미지는 60000개이므로 overfitting의 위험성이 상당히 높음
  - overifitting의 위험성을 극복하기 위하여 regularization 테크닉을 일반적으로 사용함
  - Ridge regularization, drop-out regularization 등을 활용. 최근에는 Ridge를 대부분 사용함. 뒤에서 다룰 예정.
용어 차이 (통계 vs 기계학습) : 위의 Multi-layer Neural Network에서
- (통계) \(w_{kj}^{(1)}\) 및 \(w_{lm}^{(2)}\) 같은 parameter들을 다른 표현으로 coefficient라고도 부름 & \(w_{k0}^{(1)}\) 같은 parameter는 intercept라고 부름.
- (기계학습) \(w_{kj}^{(1)}\) 및 \(w_{lm}^{(2)}\) 같은 parameter들을 다른 표현으로 weight라고도 부름 & \(w_{k0}^{(1)}\) 같은 parameter는 bias라고 부름.

R코드 예제

######################################################
# library
######################################################
library(dplyr)
library(ggplot2)
library(keras)


######################################################
# data
######################################################

### load MNIST data
mnist <- dataset_mnist ()

### train & test datta
x_train <- mnist$train$x
g_train <- mnist$train$y
x_test <- mnist$test$x
g_test <- mnist$test$y
dim(x_train)

[1] 60000    28    28

dim(x_test)

[1] 10000    28    28

### reshape X from a three dimensional array to a matrix
x_train <- array_reshape(x_train , c( nrow(x_train), 784))
x_test <- array_reshape(x_test , c( nrow(x_test), 784))

### reshape Y (one-hot-encoding)
y_train <- to_categorical(g_train , 10)
y_test <- to_categorical(g_test , 10)

### scaling input
# Neural networks are somewhat sensitive to the scale of the inputs
# For example, ridge and lasso regularization are affected by scaling
# 0~255 to 0~1
x_train <- x_train/255
x_test <- x_test/255


######################################################
# Multi layer Neural network
######################################################

### construct the model
modelnn <- keras_model_sequential()
modelnn %>% layer_dense(units = 256, 
                        activation = "relu",
                        input_shape = c(784)) %>%
  layer_dropout(rate = 0.4) %>%
  layer_dense(units = 128, activation = "relu") %>% 
  layer_dropout(rate = 0.3) %>%
  layer_dense(units = 10, activation = "softmax")
summary(modelnn)

Model: "sequential_1"
______________________________________________________________________
Layer (type)                   Output Shape                Param #    
======================================================================
dense_4 (Dense)                (None, 256)                 200960     
______________________________________________________________________
dropout_2 (Dropout)            (None, 256)                 0          
______________________________________________________________________
dense_3 (Dense)                (None, 128)                 32896      
______________________________________________________________________
dropout_1 (Dropout)            (None, 128)                 0          
______________________________________________________________________
dense_2 (Dense)                (None, 10)                  1290       
======================================================================
Total params: 235,146
Trainable params: 235,146
Non-trainable params: 0
______________________________________________________________________

### fitting algorithm
modelnn %>% compile(loss = "categorical_crossentropy",
                    optimizer = optimizer_rmsprop(), metrics = c("accuracy"))

### fitting 
history <- modelnn %>% 
  fit(x_train , y_train , 
      epochs = 30, batch_size = 128, 
      validation_split = 0.2)
plot(history)

### prediction
result <- modelnn %>% predict(x_test) %>% k_argmax() 
mean((result %>% as.numeric()) == g_test)

[1] 0.9806

######################################################
# Multinomial logistic regression (for comparison) 
######################################################
# much faster than glmnet
modellr <- keras_model_sequential () %>% 
  layer_dense(input_shape = 784, units = 10, 
              activation = "softmax")
summary(modellr)

Model: "sequential_2"
______________________________________________________________________
Layer (type)                   Output Shape                Param #    
======================================================================
dense_5 (Dense)                (None, 10)                  7850       
======================================================================
Total params: 7,850
Trainable params: 7,850
Non-trainable params: 0
______________________________________________________________________

modellr %>% compile(loss = "categorical_crossentropy",
                    optimizer = optimizer_rmsprop (), 
                    metrics = c("accuracy"))
modellr %>% fit(x_train , y_train, epochs = 30,
                batch_size = 128, validation_split = 0.2)
### prediction
result2 <- modellr %>% predict(x_test) %>% k_argmax() 
mean((result2 %>% as.numeric()) == g_test)

[1] 0.9276

Section 5.4 Convolutional Neural Netowrk (CNN)

Neural network는 2010년 즈음에 image classification에서 놀라운 성과를 보이며 재조명 받기 시작함.
특히 Convolutional Neural Network (CNN)이 image classification에서 잘 작동했음.
(예시) CIFAR100 데이터셋
- 60000개의 이미지. 즉, \(n=60000\). 50000개의 이미지는 training 데이터로, 나머지 10000개는 test 데이터.
- 20개의 super class (예를 들어, 수중 포유류 등) 와 각 super class 당 5개의 class로 카테고리를 이루고 있음. 즉, categorical response \(y\).
- \(X\)는 각 이미지로, \(32 \times 32\) 픽셀로 이루어짐. 각 픽셀은 RGB (red, green, blue)를 나타내는 3차원의 숫자로 이루어짐. 각각은 0-255 사이의 정수.
  - 즉, \(32 \times 32 \times 3\)의 array 형태. 이를 feature map 이라고도 부름.
  - 이 때 \(32 \times 32\)의 처음 두 axis는 spatial axis라고 부르고, 마지막 \(\times 3\)의 axis는 channel axis라고 부름.
(예시2)
위 CIFAR100 데이터 예시에 대해서 CNN을 적용했을 때 다음과 같은 개념도로 표현할 수 있음.
- CNN은 기본적으로 사람이 이미지를 분류하는 과정을 흉내냄.
  - CNN은 먼저 이미지의 low-level feature를 인식함. (위 그림에서 밑에서 두번째 줄) → Convolution layer
    - 예를 들어, 국소적인 경계선, 색깔 등.
  - 이 후 low-level feature를 조합해서 high-level feature를 인식. (위 그림에서 위에서 두번째 줄) → Pooling layer
    - 예를 들어, 눈, 귀 등
  - 그리고 high-level feature를 combine해서 최종적으로 각 class에 대한 확률을 계산하고 classifcation 수행. (위 그림에서 맨 윗줄)
  - 이러한 방식으로 수많은 convolution layer와 pooling layer를 조합하여 CNN 모형 수립.
- 밑의 그림은 CNN에 대한 대략적인 구조를 나타냄.

5.4.1 Convolution layer

Convolution layer는 다수의 convolution filter로 이루어짐.
각각의 convolution filter는 이미지에 특정한 국소적인 feature가 있는지를 나타냄.
convolution filter는 다음 예시와 같이 작동.
(예시 1)
- original image의 각 \(2\times 2\) submatrix가 convolution filter와 비슷하다면 convolved image에서 해당 element가 큰 값으로, 비슷하지않다면 작은 값으로 나올 것.
- 따라서, convolution filter와 비슷한 부분을 강조하는 형태로 필터링 될 것.
(예시 2)
- input인 original 이미지는 왼편의 호랑이 그림으로 \(32 \times 32 \times 3\)의 array 형태
- convolution filter는 가운데열의 두개의 \(3 \times 3\) matrix.
  - 검은색 부분은 0. 하얀색 부분은 1
- 오른편의 두 이미지가 각 convoltuion filter를 적용한 convolved image.
  - 첫번째 이미지는 convolution filter가 \[\begin{bmatrix} 0 & 1 & 0 \\ 0 & 1 & 0 \\ 0 & 1 & 0 \\ \end{bmatrix}\] 이므로 수직 방향으로 이미지가 강조된 모습.
  - 두번째 이미지는 convolution filter가 \[\begin{bmatrix} 0 & 0 & 0 \\ 1 & 1 & 1 \\ 0 & 0 & 0 \\ \end{bmatrix}\] 이므로 수평 방향으로 이미지가 강조된 모습.
- input인 original 이미지가 \(32 \times 32 \times 3\) arrya이므로 3개의 channel 각각에 convolution filter를 적용하고 3개의 convolved image를 더해서 2차원 matrix 형태의 output feature map으로 나옴.(오른편의 이미지)
  - 이는 (예시 2) 한정으로 일반적으로는 depth는 동일한 convolution filter를 적용한다. 즉, \(M \times L \times 3\) array 형태의 convoltuion filter를 사용함.
- 만약 \(K\)개의 convolution filter를 적용하면 output feature map은 \(K\)개의 channel을 가진 3차원 array 형태가 됨.
- 일반적으로 convolution filter를 적용한 이후에 convovled image의 각 element에 대해서 ReLU activation function 적용.
  - 가끔 이를 별도의 layer로 취급하기도 함. 이 경우, 이를 detect layer라고 부름

5.4.2 Pooling layer

Pooling layer를 통해 이미지를 보다 작은 크기의 이미지로 요약함.
대표적인 max pooling은 다음과 같이 겹치지 않는 submatrix에 대해서 max값으로 pooling 함.

5.4.3 CNN의 구조

실제 CNN을 적용할 떄는 다음 그림과 같이 다수의 layer를 통해 “Deep” CNN의 형태로 사용한다.

R코드 예제

######################################################
# library
######################################################
library(dplyr)
library(ggplot2)
library(keras)


######################################################
# data
######################################################

### load MNIST data
cifar100 <- dataset_cifar100()

### train & test datta
x_train <- cifar100$train$x
g_train <- cifar100$train$y
x_test <- cifar100$test$x
g_test <- cifar100$test$y

### We keep the array structure for X

### reshape Y (one-hot-encoding)
y_train <- to_categorical(g_train , 100)
y_test <- to_categorical(g_test , 100)

### scaling input
# Neural networks are somewhat sensitive to the scale of the inputs
# For example, ridge and lasso regularization are affected by scaling
# 0~255 to 0~1
x_train <- x_train/255
x_test <- x_test/255

### plot
library(jpeg)
par(mar = c(0, 0, 0, 0), mfrow = c(4, 4))
index <- sample( seq(50000), 16)
for(i in index){plot(as.raster(x_train[i,,,]))}

# as.raster() function converts the feature map so that 
# it can be plotted as.raster() as a color image

######################################################
# CNN
######################################################

### construct the model
model <- keras_model_sequential() %>% 
  layer_conv_2d(filters = 32, kernel_size = c(3, 3), 
                padding = "same", # which means that  output channels have the same dimension as the input channels
                activation = "relu",
                input_shape = c(32, 32, 3)) %>%
  layer_max_pooling_2d(pool_size = c(2, 2)) %>%
  layer_conv_2d(filters = 64, kernel_size = c(3, 3),
                padding = "same", activation = "relu") %>%
  layer_max_pooling_2d(pool_size = c(2, 2)) %>%
  layer_conv_2d(filters = 128, kernel_size = c(3, 3),
                padding = "same", activation = "relu") %>%
  layer_max_pooling_2d(pool_size = c(2, 2)) %>%
  layer_conv_2d(filters = 256, kernel_size = c(3, 3), 
                padding = "same", activation = "relu") %>%
  layer_max_pooling_2d(pool_size = c(2, 2)) %>%
  layer_flatten() %>%
  layer_dropout(rate = 0.5) %>%
  layer_dense(units = 512, activation = "relu") %>%
  layer_dense(units = 100, activation = "softmax")
summary(model)

Model: "sequential_3"
______________________________________________________________________
Layer (type)                   Output Shape                Param #    
======================================================================
conv2d_3 (Conv2D)              (None, 32, 32, 32)          896        
______________________________________________________________________
max_pooling2d_3 (MaxPooling2D) (None, 16, 16, 32)          0          
______________________________________________________________________
conv2d_2 (Conv2D)              (None, 16, 16, 64)          18496      
______________________________________________________________________
max_pooling2d_2 (MaxPooling2D) (None, 8, 8, 64)            0          
______________________________________________________________________
conv2d_1 (Conv2D)              (None, 8, 8, 128)           73856      
______________________________________________________________________
max_pooling2d_1 (MaxPooling2D) (None, 4, 4, 128)           0          
______________________________________________________________________
conv2d (Conv2D)                (None, 4, 4, 256)           295168     
______________________________________________________________________
max_pooling2d (MaxPooling2D)   (None, 2, 2, 256)           0          
______________________________________________________________________
flatten (Flatten)              (None, 1024)                0          
______________________________________________________________________
dropout_3 (Dropout)            (None, 1024)                0          
______________________________________________________________________
dense_7 (Dense)                (None, 512)                 524800     
______________________________________________________________________
dense_6 (Dense)                (None, 100)                 51300      
======================================================================
Total params: 964,516
Trainable params: 964,516
Non-trainable params: 0
______________________________________________________________________

### fitting algorithm
model %>% compile(loss = "categorical_crossentropy",
                  optimizer = optimizer_rmsprop(), metrics = c("accuracy"))

### fitting
history <- model %>% fit(x_train , y_train , epochs = 30,
                         batch_size = 128, validation_split = 0.2)
### prediction
result <- model %>% predict(x_test) %>% k_argmax() 
mean((result %>% as.numeric()) == g_test)

[1] 0.4247

5.4.4 Data augmentation

이미지를 training할때 data augmentation을 통해 training data의 사이즈를 크게 늘릴 수 있음.
original image에 노이즈를 주거나 회전하는 등의 방법을 통해 sample size를 늘릴 수 있음. (\(y\)는 그대로)

5.4.5 Transfer learning

CNN을 학습하는 과정에서의 핵심은 convolution filter를 학습하는 과정.
이 때, 비슷한 종류의 classification을 한다면 기존에 학습된 classifier의 convolution filter의 weight를 그대로 사용하고 (weight freezing)
- 예를 들어, 개 vs 고양이 그림을 분류하는 CNN을 새로 학습하고자 할 때, 기존에 이미 학습된 동물을 분류하는 CNN 모형이 있다면 이 모형의 구조와 weight 값들을 그대로 빌려올 수 있음.
마지막 소수의 layer만 새로 학습하는 방식을 사용하기도 함 : Transfer learning!!
다음의 R코드 예제는
- imagenet이라는 데이터셋을 학습한 resnet50 classifier를 불러와서 사용
  - 여기 예시에서는 추가적인 학습은 하지 않음.
- 다음의 6가지 그림에 적용해서 어떻게 분류되는지 확인

######################################################
# library
######################################################
library(dplyr)
library(ggplot2)
library(keras)


######################################################
# data
######################################################
img_dir <- "book_images"
image_names <- list.files(img_dir)
num_images <- length(image_names)
x <- array( dim = c(num_images , 224, 224, 3))
for(i in 1:num_images){
  
  img_path <- paste(img_dir, image_names[i], sep = "/")
  img <- image_load(img_path, target_size = c(224, 224))
  x[i,,, ] <- image_to_array(img)
}
x <- imagenet_preprocess_input (x)

# # python pillow module installation
# library(reticulate)
# py_install("pillow")


######################################################
# load the trained network
######################################################
model <- application_resnet50(weights = "imagenet")
# summary(model)


######################################################
# Classification
######################################################
pred6 <- model %>% predict(x) %>%
  imagenet_decode_predictions (top = 3)
names(pred6) <- image_names
print(pred6)

$flamingo.jpg
  class_name class_description       score
1  n02007558          flamingo 0.930109262
2  n02006656         spoonbill 0.068014868
3  n02002556       white_stork 0.001172375

$hawk.jpg
  class_name class_description     score
1  n03388043          fountain 0.2772168
2  n03532672              hook 0.1794709
3  n03804744              nail 0.1103582

$hawk_cropped.jpeg
  class_name class_description      score
1  n01608432              kite 0.72434253
2  n01622779    great_grey_owl 0.08683522
3  n01532829       house_finch 0.03997646

$huey.jpg
  class_name class_description      score
1  n02097474   Tibetan_terrier 0.52462161
2  n02098413             Lhasa 0.40590140
3  n02094114   Norfolk_terrier 0.01699529

$kitty.jpg
  class_name    class_description      score
1  n02105641 Old_English_sheepdog 0.83900332
2  n02086240             Shih-Tzu 0.04235809
3  n03223299              doormat 0.03226566

$weaver.jpg
  class_name class_description      score
1  n01843065           jacamar 0.48659834
2  n01818515             macaw 0.23028962
3  n02494079   squirrel_monkey 0.04457098

Section 5.5 Recurrent Neural Network (RNN)

5.5.1 Document classification

문서를 그 내용에 따라 분류하는 문제
(예시) IMDb 데이터셋 :
- 위 그림과 같이 input \(X\)는 영화 리뷰 문서 & output \(y\)는 긍정적 or 부정적으로 두개의 카테고리를 가진 범주형 변수
문서는 길이도 제각각 다르고, 비속어나 유행어 등을 쓸 수도 있으며 오타도 있을 수 있음
- 따라서 각 document를 잘 “featureize” 해야 함.
가장 단순하고 대표적인 방법이 bag-of-words model
- 사전에 있는 각 단어에 대해 해당 문서에 포함되어 있는지 아닌지에 따라 점수화한 것.
- 즉, 사전에 있는 모든 단어의 수가 \(M\)이라면, 길이가 \(M\)인 binary vector를 정의하고, 각 단어가 해당 문서에 포함되면 1 아니면 0으로 표현.
  - 단순히 포함되냐 안되느냐를 넘어서 빈도 수로 표현하는 경우도 있음.
- 그런데 사전에 있는 단어 수는 너무 많으므로 training 데이터로부터 가장 자주 등장하는 \(M\)개의 단어 (예를 들어 \(M=10000\)개)에 대해서만 featureize함
- featureize된 vector에 대해서 classification 모형 적용.
(예시 1) IMDb 데이터 :
- 23000개의 리뷰는 training 데이터로 / 2000개의리뷰는 test 데이터로 놓았을 때,
- 모형은
  1. Lasso logistic regression
  2. Neural network with 2 hidden layer (각 layer에 16개의 unit & ReLU activation function 사용)
- 결과는 다음과 같음.
  - Lasso의 경우 \(\lambda\)에 따라 fitting된 classifier들에 대한 결과
  - Neural network의 경우 fitting 과정에서 stochastic gradient descent의 iteration 횟수에 따라 fitting된 classfier에 대한 결과
  - validation error를 최소화하는 \(\lambda\) 혹은 epoch를 선택할 경우
    - 두 모형 모두 test accuracy가 88%로 비슷한 수준.
Bag-of-words 모형은 각 document에 대해서 각 단어들이 포함되는지 아닌지만 체크하고 내용 자체에 대해서는 고려하지 않음
- 리뷰에 “good”이라는 단어가 들어가면 긍정적인 리뷰일 가능성 ↑
- 그러나 “not good”이라는 단어가 들어가면 반대로 부정적인 리뷰일 가능성 ↑
- 그런데 “not”이라는 단어와 “good”이라는 단어가 붙어있지 않고 떨어져 있다면?
  - Bag-of-words 모형은 각 단어의 포함여부만 고려하고 내용을 고려안하므로 이러한 상황에 대해서 고려할 수 없음.
내용 자체를 고려하는 방법으로는
1. Bag-of-n-grams 모형
  - 예를 들어 Bag-of-2-grams는 연속적인 2단어가 포함되는지 아닌지를 체크함.
2. Document를 단어들의 sequence로 인식하고 단어들의 순서도 고려하는 방법
  - 대표적인 방법이 Recurrent neural network (RNN)
  - 문서 분류 뿐만 아니라, 날씨 예측, 번역 등의 문제에도 적용할 수 있음.

######################################################
# library
######################################################
library(dplyr)
library(ggplot2)
library(keras)


######################################################
# data
######################################################
max_features <- 10000
imdb <- dataset_imdb (num_words = max_features)
c(c(x_train , y_train), c(x_test , y_test)) %<-% imdb

### see the words in the document
word_index <- dataset_imdb_word_index()
decode_review <- function(text , word_index){
  word <- names(word_index)
  idx <- unlist(word_index , use.names = FALSE)
  word <- c("<PAD>", "<START>", "<UNK>", "<UNUSED>", word)
  idx <- c(0:3, idx + 3)
  words <- word[ match(text, idx , 2)]
  paste(words , collapse = " ")
}
decode_review(x_train[[1]][1:12], word_index)

[1] "<START> this film was just brilliant casting location scenery story direction everyone's"

### for one-hot encoding
library (Matrix)
one_hot <- function(sequences, dimension){
  seqlen <- sapply(sequences, length)
  n <- length(seqlen)
  rowind <- rep (1:n, seqlen)
  colind <- unlist (sequences)
  sparseMatrix(i = rowind, j = colind,
               dims = c(n, dimension))
}
  
### data : one-hot econding
x_train_1h <- one_hot(x_train , 10000)
x_test_1h <- one_hot(x_test , 10000)
nnzero(x_train_1h) / (25000 * 10000)

[1] 0.01316987

### validation set : 2000 / training set : 23000
set.seed (3)
ival <- sample(seq(along = y_train), 2000)



######################################################
# Lasso logistic regression
######################################################
library(glmnet)

fitlm <- glmnet(x_train_1h[-ival,], y_train[-ival],
                 family = "binomial", standardize = FALSE)
classlmv <- predict(fitlm, x_train_1h[ival,]) > 0
acclmv <- apply(classlmv, 2, function(x){ mean(x==(y_train[ival] > 0), na.rm=T) })
plot(-log(fitlm$lambda), acclmv, type='l')

######################################################
# Neural network
######################################################
# fully-connected neural network with two hidden layers, 
# each with 16 units and ReLU activatio
######################################################
model <- keras_model_sequential() %>%
  layer_dense(units = 16, activation = "relu",
               input_shape = c (10000)) %>%
  layer_dense(units = 16, activation = "relu") %>%
  layer_dense(units = 1, activation = "sigmoid")
model %>% compile(optimizer = "rmsprop",
                  loss = "binary_crossentropy", metrics = c("accuracy"))
# history <- model %>% fit(x_train_1h[-ival,], y_train[-ival],
#                          epochs = 20, batch_size = 512,
#                          validation_data = list(x_train_1h[ival,], y_train[ival]))
# history <- model %>% 
#   fit( x_train_1h[-ival,], y_train[-ival], epochs = 20, 
#        batch_size = 512, validation_data = list (x_test_1h, y_test))
# 
# history <- model %>% 
#   fit( x_train_1h[-ival, ], y_train[-ival], epochs = 20, 
#        batch_size = 512, validation_split = 0.2 )
# plot(history)

5.5.2 Recurrent Neural network

문서나 time series data (시계열 자료; 예: 온도, 강수량) 처럼 순차적인 데이터를 다룰 때 주료 활용
- CNN이 image의 spatial한 정보를 활용하는 것처럼, RNN은 데이터의 순서에 대한 정보를 활용.
input \(X=\{ X_1, \dots , X_L \}\)은 sequence.
- 문서에 대한 예시라면 각 \(X_l\)은 단어를 가리킴
output \(Y\)는 scalar일수도 (예: 영화 리뷰 - binary 변수) 혹은 sequence일 수도 있음 (예: 번역)
다음은 RNN에 대한 개념도
- input - \(X=\{ X_1, \dots , X_L \}\) & output \(Y\) & hidden layer sequence \(\{ A_1, \dots A_L \}\)
- 각 \(X_l\)은 vector 형태.
  - 문서 분류에 대한 예시에서는 문서의 \(l\)번째 단어에 대해서 dummy variable 형태로 이러우진 vector 형태
  - 위 그림에서 검은색은 1 & 회색은 0으로 된 dummy variable
  - 이를 one-hot encoding이라고 machine learning 관련 문헌에서는 부름.
- 순차적으로 \(A_l\)을 update하는데, 이 때 \(X_{l}\)과 \(A_{l-1}\)을 사용함.
- 그리고 각 \(A_l\)마다 output layer로 가서 \(Y\)에 대한 prediction 값이 \(O_l\)을 계산.
- 마지막 값이 \(O_L\)이 가장 \(Y\)와 관계있는 최종 prediction
- 이러한 구조에서 \(A_l\)은 \(X_1\)부터 \(X_l\)까지의 내역에 대한 정보를 함축하고 있는 값으로 이해할 수 있음.
위 개념도에서 수학적으로 보다 자세히 살펴보면 다음과 같음.
- \(X_l\)은 \(p\)차원, \(A_l\)은 \(K\)차원으로 가정할 때, 즉 \(X_l=(X_{l1}, \dots , X_{lp})'\) & \(A_l = (A_{l1}, \dots , A_{lK})'\)
- 이 때, \(A_l\)은 \(A_{l-1}\)과 \(X_l\)을 이용해서 다음과 같이 구해짐 \[A_{lm} = g(w_{m0} + \sum_{j=1}^{p}{w_{mj}X_{lj}} + \sum_{s=1}^{K}{u_{ms}A_{l-1,s}} )\]
- 또한 \(O_l\)은 \(y\)가 continuous인경우 다음과 같이 구해짐. (\(y\)가 categorical인 경우는 sigmoid 함수 적용) \[ O_l = \beta_0 + \sum_{k=1}^{K}{\beta_k A_{lk}} \]
- activation function \(g\)는 ReLU를 일반적으로 활용
- 이 때, \(\mathbf{W}, \mathbf{U}, \mathbf{B}\) (각각, \(w_{mj}\), \(u_{ms}\), \(\beta_k\)로 이루어진 matrix )는 전체 sequence에서 공통적으로 사용함. : weight sharing
- fitting의 경우, \(y\)가 continuous일때 \[ \text{minimize } \ \sum_{i=1}^{n}{(y_i - O_{iL})^2} = \sum_{i=1}^{n}{ \left( y_i - \left(\beta_0 + \sum_{k=1}^{K}{ \beta_k g(w_{k0} + \sum_{j=1}^{p}{w_{kj}x_{iLj}} + \sum_{s=1}^{K}{u_{ms}a_{i,L-1,s}} ) } \right) \right)^2 } \]
- 이 예시에서는 \(O_l\)은 fitting 과정에서 사용되지 않음. 즉, 단순히 계산만 되는 값으로 prediction이 sequence가 진행됨에 따라 어떻게 변하는지 확인할 수 있는 역할 정도.
- 그러나 \(y\) 또한 sequence인 경우에는 \(\{ O_1, \dots O_L \}\) 모두가 쓰일 것.

R코드 예제 : Sequential models for document classification

5.5.1에서 다룬 IMDb 데이터셋 예시에 RNN 모형 적용
이 때, one-hot encoding 형태로 표현한 벡터의 차원 (\(M=10000\))이 너무 크므로, 더 적은 차원 (\(m\))의 embedding space로 표현
이를 위해 embedding layer를 도입. 이 때 필요한 weight는 \(m \times M\) matrix \(E\)로 표현.
- 이 때, \(E\)는 새로 학습하기 보다는, 문서 관련 machine learning 연구자들에 의해 미리 학습된 값을 사용 : word2vec 또는 GloVe를 많이 사용.
- 위 그림에서는 \(M=20\) 차원의 one-hot encoding 된 형태의 벡터를 \(m=5\)차원의 벡터로 변환시켜서 표현함
문서를 마지막 \(L\)개의 단어로만 제한. \(L\)개 보다 적은 수의 단어들로 이루어진 문서는 zero padding. 즉, 모자란 앞쪽 word를 0으로 넣기.
이렇게 되면, 각 문서는 \(X=\{ X_1 , \dots X_L \}\)로 표현되고, \(X_l\)은 \(m\) 길이의 벡터가 됨.
input \(X\)에 대해서 위에서 설명한 RNN 모형을 적용.
이 때 parameter 개수는 \(K\)개의 hidden unit을 사용할 경우
- \(\mathbf{W}\) : \(K \times (m+1)\) matrix
- \(\mathbf{U}\) : \(K \times K\) matrix
- \(\mathbf{B}\) : \(2(K+1) \times 1\) matrix
\(m=32\) & \(K=32\)로 놓고 dropout regularization을 사용하며 25000개의 training 데이터를 이용해 fitting할 경우 대략 76%의 정확도를 얻을 수 있음.
여기서는 예시를 위해 매우 단순한 RNN을 사용했지만, 보다 발전된 형태의 long term and short term memory (LSTM)을 사용할 수 있음
- 예시에서 사용한 RNN은 긴 시간 이전의 데이터에 대한 정보를 고려하기 힘들지만
- LSTM은 긴 시간 이전의 데이터 까지도 잘고려할 수 있는 구조로 만들어져있어서 성능을 크게 향상시킬 수 있음.

###
wc <- sapply(x_train, length)
median(wc)

[1] 178

sum(wc <= 500) / length(wc)

[1] 0.91568

###
maxlen <- 500
x_train <- pad_sequences(x_train, maxlen = maxlen)
x_test <- pad_sequences(x_test, maxlen = maxlen)

###
model <- keras_model_sequential() %>%
  layer_embedding(input_dim = 10000, output_dim = 32) %>%
  layer_lstm(units = 32) %>%
  layer_dense(units = 1, activation = "sigmoid")
###
model %>% compile(optimizer = "rmsprop",
                  loss = "binary_crossentropy", metrics = c("acc"))
history <- model %>% fit(x_train, y_train, epochs = 10,
                         batch_size = 128, validation_data = list(x_test, y_test))
plot(history)

predy <- predict(model, x_test) > 0.5
mean(abs(y_test == as.numeric(predy)))

[1] 0.86676

R코드 예제 : Time Series forecasting

주가 정보 예측 모형 : 3차원의 시계열 데이터
- Log trading volume
- Dow Jones return
- Log volatility
목적 : trading volume을 예측.
\(t\)번째 날에 \(V_t, r_t, z_t\) 관측 (\(t=1,\dots T=6051\))
- \(V_t\) - Log trading volume
- \(r_t\) - Dow Jones return
- \(z_t\) - Log volatility
autocorrelation이 뚜렷함. 즉 가까운 시간에 있는 수치끼리 correlation ↑
이전 예시와는 다르게 데이터셋으로 1개의 sequence만 있음.
타겟이 되는 outut \(y\)는 \(V_t\)이고, input으로는 이전의 \(V_t, r_t, z_t\)들이 그대로 쓰임.
다음과 같이 RNN 적용
- input으로 mini-series를 사용.
  - 즉, \(L\)을 미리 지정하고 \(X=\{ X_1, \dots , X_L \}\)을 input으로 사용하고, 즉, \[X_1 = \begin{pmatrix} V_{t-L} \\ r_{t-L} \\ z_{t-L} \\ \end{pmatrix} , \dots , X_L = \begin{pmatrix} V_{t-1} \\ r_{t-1} \\ z_{t-1} \\ \end{pmatrix} \] 로 놓고 \(Y=V_t\)로 놓음.
- 결국 \(t=L+1, \dots , T\)까지의 총 \(n=T-L\)개의 데이터셋을 얻음.
- 이 때, \(L\)은 CV나 validation data approach로부터 결정.
\(K=12\)개의 hidden unit을 사용했을 때 결과는 다음과 같음

######################################################
# library
######################################################
library(dplyr)
library(ggplot2)
library(keras)


######################################################
# data
######################################################
library(ISLR2)
xdata <- data.matrix( NYSE[, c("DJ_return", "log_volume","log_volatility")] )
istrain <- NYSE[, "train"]
xdata <- scale(xdata)

### lag
lagm <- function(x, k = 1) {
  n <- nrow(x)
  pad <- matrix(NA, k, ncol(x))
  rbind(pad, x[1:(n - k), ])
}

### data
arframe <- data.frame(log_volume = xdata[, "log_volume"],
                      L1 = lagm(xdata, 1), L2 = lagm(xdata, 2),
                      L3 = lagm(xdata, 3), L4 = lagm(xdata, 4),
                      L5 = lagm(xdata, 5))
arframe <- arframe[-(1:5), ]
istrain <- istrain[-(1:5)]


######################################################
# AR model
######################################################

### fitting
arfit <- lm(log_volume ~ ., data = arframe[istrain, ])

### prediction
arpred <- predict(arfit, arframe[!istrain, ])
V0 <- var(arframe[!istrain, "log_volume"])
1 - mean((arpred - arframe[!istrain, "log_volume"])^2) / V0

[1] 0.413223

### including DOW
arframed <- data.frame(day = NYSE[-(1:5), "day_of_week"], arframe)
arfitd <- lm(log_volume ~ ., data = arframed[istrain, ])
arpredd <- predict(arfitd, arframed[!istrain, ])
1 - mean((arpredd - arframe[!istrain, "log_volume"])^2) / V0

[1] 0.4598616

######################################################
# RNN
######################################################
n <- nrow(arframe)
xrnn <- data.matrix(arframe[, -1])
xrnn <- array(xrnn, c(n, 3, 5))
xrnn <- xrnn[,, 5:1]
xrnn <- aperm(xrnn, c(1, 3, 2))
dim(xrnn)

[1] 6046    5    3

### construct the model
model <- keras_model_sequential() %>%
  layer_simple_rnn(units = 12,
                   input_shape = list(5, 3),
                   dropout = 0.1, recurrent_dropout = 0.1) %>%
  layer_dense(units = 1)

### fitting algorithm
model %>% compile(optimizer = optimizer_rmsprop(),
                  loss = "mse")
### fitting
history <- model %>% fit(
  xrnn[istrain,, ], arframe[istrain, "log_volume"],
  batch_size = 64, epochs = 200,
  validation_data =
    list(xrnn[!istrain,, ], arframe[!istrain, "log_volume"])
)
kpred <- predict(model, xrnn[!istrain,, ])
1 - mean((kpred - arframe[!istrain, "log_volume"])^2) / V0

[1] 0.4130743

######################################################
# Comparison
######################################################

### linear AR model
model <- keras_model_sequential() %>%
  layer_flatten(input_shape = c(5, 3)) %>%
  layer_dense(units = 1)

### nonlinear AR model 
x <- model.matrix(log_volume ~ . - 1, data = arframed)
colnames(x)

 [1] "dayfri"            "daymon"            "daythur"          
 [4] "daytues"           "daywed"            "L1.DJ_return"     
 [7] "L1.log_volume"     "L1.log_volatility" "L2.DJ_return"     
[10] "L2.log_volume"     "L2.log_volatility" "L3.DJ_return"     
[13] "L3.log_volume"     "L3.log_volatility" "L4.DJ_return"     
[16] "L4.log_volume"     "L4.log_volatility" "L5.DJ_return"     
[19] "L5.log_volume"     "L5.log_volatility"

### result
arnnd <- keras_model_sequential() %>%
  layer_dense(units = 32, activation = 'relu',
              input_shape = ncol(x)) %>%
  layer_dropout(rate = 0.5) %>%
  layer_dense(units = 1)
arnnd %>% compile(loss = "mse",
                  optimizer = optimizer_rmsprop())
history <- arnnd %>% fit(
  x[istrain, ], arframe[istrain, "log_volume"], epochs = 100,
  batch_size = 32, validation_data =
    list(x[!istrain, ], arframe[!istrain, "log_volume"])
)
plot(history)

npred <- predict(arnnd, x[!istrain, ])
1 - mean((arframe[!istrain, "log_volume"] - npred)^2) / V0

[1] 0.4678054

###

Section 5.6 Deep learning을 적용할 때 고려해야하는 사항들.

샘플 사이즈 \(n\)이 매우 크고 inference (혹은 interpretability)가 관심사항이 아닐 때, Deep learning 기법은 매우 유용할 수 있음.
그러나 Deep learning은 기존 통계 모형보다 훨씬 복잡하므로, 비슷한 퍼포먼스를 낸다면 더 단순한 기존 모형들이 선호될 수 있다.
- “Occam’s razor principle : When faced with several methods that gives roughly equivalent performance, pick the simplest”
(예시) Single layer neural network에서의 R코드 예제 참조.
- Linear regression과 Single layer neural network의 퍼포먼스가 비슷하게 나왔기 때문에 훨씬 단순한 모형인 Linear regression을 사용하는 것이 더 선호됨.

Section 5.7 Neural Network를 fitting하는 방법

5.7.1 Backpropagation

Single layer neural network의 경우
- parameter : \(w_k = (w_{k0}, \dots , w_{kp})\) for \(k=1,\dots ,K\) & \(\beta=(\beta_0,\dots , \beta_K)\)
- observations : \((x_i,y_i)\) for \(i=1,\dots , n\)
- 에 대해서 다음과 같은 최적화 문제를 풀어야함: \[\text{minimize}_{ \{w_k\}_1^K, \beta } {1\over 2} \sum_{i=1}^{n}{(y_i - f(x_i))^2} \ \text{ where } f(x_i)= \beta_0 + \sum_{k=1}^{K}{\beta_k g(w_{k0}+\sum_{j=1}^{p}{w_{kj}x_{ij}})}\]
- 문제는 이 최적화 문제가 nonconvex optimization이라는 것.
- 따라서 local minimum과 global minimum이 다름
- local minimum을 찾는 것은 빠르게 수행할 수 있지만 (convex optimization) global minimum을 찾는 것은 계산하기가 힘들다는 단점.
이 문제를 극복하고 overifitting을 피하기 위하여 다음과 같은 전략을 사용
1. Slow learning : gradient descent 처럼 iterative하게 learning하며 그 속도를 느리게 함. 그리고 overfitting이 감지되면 멈추는 방식
2. Regularization : Ridge나 Lasso처럼 penalty를 parameter에 부여하여 fitting함.
\(\theta = (w_{10}, \dots , w_{1p}, \dots , , w_{K0}, \dots , , w_{Kp}, \beta_0, \dots , \beta_K)'\)로 놓고, \(R(\theta) = {1\over 2}\sum_{i=1}^{n}{(y_i - f_\theta(x_i))^2}\)를 최소화하는 것과 동일함.
- 이 때, 최적화문제를 풀기 위해 다음의 Gradient descent 방식을 고려할 수 있음.
  1. \(\theta^0\)를 initial value로 놓고 시작. \(t=0\)으로 놓음
  2. \(R(\theta)\)가 감소하지 않을 때까지 iteration을 돌림.
    1. \(\theta^{t+1} = \theta^{t} + \delta\)에 대하여 \(R(\theta^{t+1}) < R(\theta^t)\) 가 성립하는 \(\delta\) 찾기.
    2. \(t \leftarrow t+1\)로 놓기.
- 그렇다면 2-a에서 어떻게 \(R(\theta)\)가 감소하는 \(\theta\)의 방향을 찾을 것인가?
  - \(R(\theta)\)의 gradient인 \(\nabla R(\theta) = {\partial R(\theta) \over \partial \theta}\)는 \(\theta\)의 공간에서 \(R(\theta)\)가 가장 급격하게 증가하는 방향.
  - 따라서 \(-\nabla R(\theta)\)은 가장 급격하게 감소하는 방향 (\(f(\theta)=-R(\theta)\)를 고려)
  - 그러므로, \(\theta^{t+1} \leftarrow \theta^t - \rho \nabla R(\theta^t)\)
  - 이 때, \(\rho\)는 learning rate를 나타내며 작은 값을 사용하여 slow learning하도록 만듦.
Backpropagation : Neural network의 경우 gradient를 계산하는 과정이 Chain rule에 의해서 비교적 간단하게 수행할 수 있음.
- \(R(\theta)=\sum_{i=1}^{n}{R_i(\theta)}=\sum_{i=1}^{n}{(y_i - f_{\theta}(x_i))^2}\)이므로 \(R_i(\theta)=(y_i - f_{\theta}(x_i))^2\)만 고려하면
- \(R_i(\theta)=(y_i - f_{\theta}(x_i))^2 = {1\over 2}\left( y_i - \beta_0 - \sum_{k=1}^{K}{\beta_kg(w_{k0}+\sum_{j=1}^{p}{w_{kj}x_{ij}})} \right)^2\)
  1. \[\begin{equation} \begin{split} {\partial R_i(\theta) \over \partial \beta_k} & = {\partial R_i(\theta) \over \partial f_\theta(x_i)} \cdot {\partial f_\theta(x_i) \over \partial \beta_k} \\ & = - (y_i-f_\theta (x_i)) \cdot g(z_{ik}) \end{split} \end{equation}\]
  2. \[\begin{equation} \begin{split} {\partial R_i(\theta) \over \partial w_{kj}} & = {\partial R_i(\theta) \over \partial f_\theta(x_i)} \cdot {\partial f_\theta(x_i) \over \partial g(z_{ik})} \cdot {\partial g(z_{ik}) \over \partial z_{ik}} \cdot {\partial z_{ik} \over \partial w_{kj}} \\ & = - (y_i-f_\theta (x_i)) \cdot \beta_k \cdot g'(z_{ik}) \cdot x_{ij} \end{split} \end{equation}\]
- 따라서 \(y_i - f_\theta(x_i)\) & \(\beta_k\) & \(g(z_{ik})\) & \(g'(z_{ik})\) 값들을 계산 과정에서 저장해놓음으로써 gradient는 손쉽게 계산이 가능함.

5.7.2 Stochastic Gradient Descent and Regularization

Gradient descent는 local minimum까지 도달하기 위해 많은 시간이 걸리는편.
- 보통 \(n\)이 크기 때문에 gradient를 계산할때 필요한 계산량이 크기 때문.
- 좀 더 속도를 높이기 위한 방법 중 하나라 Stochastic gradient descent (SGD)
- \(\nabla R(\theta) = \sum_{i=1}^{n}{\nabla R_i(\theta)}\)에서 모든 \(i=1,\dots ,n\)을 계산하는 것이 아닌 일부 \(i\)만 (minibatch) 랜덤하게 선택해서 gradient 계산
- 모든 \(i\)가 아닌 일부 \(i\)만 랜덤하게 선택하기 때문에 이러한 랜덤성에 의해 local minimum에 빠지기 보다 더 좋은 minimum을 찾아갈 수 있음.
- SGD보다 빨리 minimum을 찾기 위한 방법으로 SGD momentum, Adam 등의 방법이 있음.
(Regularization) Neural network는 학습해야하는 parameter가 매우 많기 떄문에 regularization을 필수적으로 한다.
- Digit Recognition problem의 예시를 다시 생각해보면
  - 당시 사용한 loss function은 \(R(\theta) = -\sum_{i=1}^{n}{ \sum_{m=0}^{9}{y_{im}\log(f_m(x_i))} }\)이었으나
  - Regularization에 의해 \(R(\theta) + \text{penalty}\)의 형태에 대해서 최소화하는 최적화문제를 풀게 됨
  - \[ R(\theta;\lambda) = -\sum_{i=1}^{n}{ -\sum_{i=1}^{n}{ \sum_{m=0}^{9}{y_{im}\log(f_m(x_i))} } + \lambda \sum_{j}{\theta_j^2} } \]
  - 이 때 penalty로 \(\lambda \sum_{j}{\theta_j^2}\)를 쓰는 것은 Ridge regularization이라고 부르고
  - 또다른 방법으로는 Lasso regularization으로 \(\lambda \sum_{j}{|\theta_j|}\)를 쓰는 것임.
  - 일반적으로 Ridge를 많이 사용.
  - 이 때, \(\lambda\)는 작은 값으로 미리 지정하거나 valdiation set approach로 정함.

5.7.3 Dropout

Dropout : Regularization의 한 방법으로 Random forest의 spirit을 따라서 매 SGD interation마다 layer에서 \(\phi\)비율 만큼의 activation을 랜덤하게 제외하고 update
- 이 떄 제외되지 않은 activation은 \(1/(1-\phi)\)를 곱하여 평균을 맞추면서 계산하는 개념
- 실제 계산할때는 제외되는 activation 값은 0으로 놓고 제외되지 않은 activation 값을 \(1/(1-\phi)\)을 곱한 상태에서 neural network 구조는 그대로 해서 계산.
- 그러나 현시점에서는 Dropout보다 ridge regularization을 일반적으로 활용하는 추세.

5.7.4 Neural netowrk의 튜닝

튜닝해야할 hyperparameter들 :
- hidden layer의 개수 & layer 당 unit의 개수
  - unit은 크게 놓고 regularization 등을 통해 overifitting을 방지.
- Regularization tuning parameter : \(\lambda\) (Ridge나 Lasso) 또는 \(\phi\) (dropout)
  - layer마다 따로 조정함.
- SGD : Minibatch의 사이즈, 학습할때 epoch의 개수 등
Hyperparameter tuning에 따라 퍼포먼스가 크게 차이가 날 수 있음. 다만 무분별하게 수행할 경우 overfitting이 있을 수 있으니 주의해야함.

1152-4015 Machine Learning & Practice

Chapter 5. Neural network (Deep learning)

Section 5.1 R에서 deep learning 활용

Section 5.2 Single Layer Neural Network

R코드 예제

Section 5.3 Multi-layer Neural Network

R코드 예제

Section 5.4 Convolutional Neural Netowrk (CNN)

5.4.1 Convolution layer

5.4.2 Pooling layer

5.4.3 CNN의 구조

R코드 예제

5.4.4 Data augmentation

5.4.5 Transfer learning

Section 5.5 Recurrent Neural Network (RNN)

5.5.1 Document classification

5.5.2 Recurrent Neural network

R코드 예제 : Sequential models for document classification

R코드 예제 : Time Series forecasting

Section 5.6 Deep learning을 적용할 때 고려해야하는 사항들.

Section 5.7 Neural Network를 fitting하는 방법

5.7.1 Backpropagation

5.7.2 Stochastic Gradient Descent and Regularization

5.7.3 Dropout

5.7.4 Neural netowrk의 튜닝