1 The Goal of These Notes

(Figure: overshooting and a small learning rate)

2 Getting Down to Business

2.1 Objective

  • Our goal is to minimize \(f(\theta)\) where \(\theta\) is a vector or a scalar

  • \(\theta\) is our parameter; there can be one or many of them

  • here is how we update \(\theta\):

\[\theta^{i+1} \leftarrow \theta^{i}-\eta\,\frac{\partial f(\theta^{i})}{\partial\theta^{i}} \]

  • Note that \(i\) is not an exponent but the index of the iteration. This is an iterative algorithm, so you must start from some initial guess; \(\eta\) is the so-called learning rate, which controls how far each step goes, and \(\frac{\partial f(\theta^{i})}{\partial\theta^{i}}\) is the direction we move in!

But when \(\theta\) has only one dimension, the direction is simply either left or right, and this rule is truly, utterly intuitive. Think about it: \(\eta\) is a positive number, so if

\[\frac{\partial f(\theta^{i})}{\partial\theta^{i}}\]

is positive, the slope of the tangent at that point is positive, which means that moving a small step to the right makes \(f(\theta)\) larger and moving left makes it smaller. In that case \(-\eta\,\frac{\partial f(\theta^{i})}{\partial\theta^{i}}\) is negative!

That means we move left, to the new point \(\theta^{i}-\eta\,\frac{\partial f(\theta^{i})}{\partial\theta^{i}}\),

and \(f(\theta^{i+1})\) really does get smaller.

The reader can verify the opposite case in the same way.
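To make this argument precise, a first-order Taylor expansion (valid for sufficiently small \(\eta\)) shows why a step against the derivative decreases \(f\):

\[f(\theta^{i+1}) = f\!\left(\theta^{i}-\eta\,\frac{\partial f(\theta^{i})}{\partial\theta^{i}}\right) \approx f(\theta^{i})-\eta\left(\frac{\partial f(\theta^{i})}{\partial\theta^{i}}\right)^{2} \leq f(\theta^{i})\]

since \(\eta>0\) and a squared number is never negative. The approximation only holds when the step is small, which is exactly why the choice of learning rate will matter so much below.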

Now I'm going to start plugging in numbers. Suppose our function is very simple and has only one parameter: \(y=f(x)=x^{2}\), so \(\theta\) in this case is \(x\). Our initial guess \(x^{1}\) is -3, and for ease of computation the learning rate is \(\eta=0.1\). Let's substitute step by step and see what happens.

2.2 A Numerical Example of the Algorithm

Step 1: first work out the form of \(\frac{\partial f(\theta^{i})}{\partial\theta^{i}}\)

Friends who know how to differentiate will see that the answer is \(2x^{i}\)
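For completeness, this is just the power rule:

\[\frac{\partial f(x)}{\partial x}=\frac{\partial}{\partial x}\,x^{2}=2x\]

evaluated at the current iterate \(x^{i}\).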

Step 2: time to plug in the numbers

So when \(i=1\):

\[x^{2}=x^{1}-\eta\,\frac{\partial f(x^{1})}{\partial x^{1}} \]

\[x^{2}=x^{1}-0.1*2*(x^{1}) \] Substituting -3 for \(x^{1}\): \[x^{2}=-3-0.1*2*(-3) \] \[x^{2}=-2.4\]

So when \(i=2\):

Substitute -2.4 for \(x^{2}\):

\[x^{3}=-2.4-0.1*2*(-2.4)\] \[x^{3}=-1.92 \]
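In fact, for this particular function every update is just multiplication by a fixed constant, which explains the pattern:

\[x^{i+1}=x^{i}-\eta*2x^{i}=(1-2\eta)\,x^{i}=0.8\,x^{i}\]

so \(x^{i+1}=-3\cdot 0.8^{i}\), and the iterates shrink geometrically toward 0.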

The rest can be continued in the same way. To wrap up, let's write a program that lists this entire process in detail.

2.3 Code Implementation (Function 1)

Throughout, let \(y=f(x)\).

Our function is \(\min f(x)=x^{2}\); this function attains its minimum at \(x=0\), and the minimum value is also \(0\).

  • Let's first plot the function
x <- seq(-3, 3, 0.01)
plot(x, x^2, type = "l", xlab = "x", ylab = "f(x)")

  • Here we do not use a convergence criterion for the iteration; instead we fix the number of iterations and show the results (each x and its corresponding f(x))
# R is the number of iterations
R <- 10
# eta is the learning rate
eta <- 0.1
# initial guess of x
initial_x <- -3
# objective function
f <- function(x) { x^2 }
# partial derivative of the objective function
partialf <- function(x) { 2*x }

x <- numeric(length = R + 1)
x[1] <- initial_x

for (i in 1:R) {
  x[i+1] <- x[i] - eta * partialf(x[i])
}
df <- data.frame(x, y = f(x))
df
##             x         y
## 1  -3.0000000 9.0000000
## 2  -2.4000000 5.7600000
## 3  -1.9200000 3.6864000
## 4  -1.5360000 2.3592960
## 5  -1.2288000 1.5099494
## 6  -0.9830400 0.9663676
## 7  -0.7864320 0.6184753
## 8  -0.6291456 0.3958242
## 9  -0.5033165 0.2533275
## 10 -0.4026532 0.1621296
## 11 -0.3221225 0.1037629
x <- seq(-3, 3, 0.01)
plot(x, x^2, type = "l", xlab = "x", ylab = "f(x)")
y <- df$y
points(df$x, y, col = "red", pch = 20, cex = 1.5)
s <- seq(length(df$x) - 1)
arrows(df$x[s], y[s], df$x[s+1], df$y[s+1], col = 1:2)

You can see that our algorithm works: the reader can check that the values of x match the hand calculation, and that x indeed gets closer to the true answer 0 as the number of iterations increases.
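As an aside, instead of a fixed number of iterations we could stop once the gradient becomes small. Here is a minimal sketch of that idea; the stopping tolerance `tol` and the variable `x_cur` are my own illustrative choices, not part of the original notes:

# a minimal sketch: iterate until the slope is essentially flat
# (tol is an assumed stopping tolerance, not from the original notes)
eta <- 0.1
tol <- 1e-6
x_cur <- -3
partialf <- function(x) { 2*x }
repeat {
  g <- partialf(x_cur)
  if (abs(g) < tol) break   # gradient is nearly zero, so stop
  x_cur <- x_cur - eta * g
}
x_cur   # ends up very close to the true minimizer 0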

But we're not done yet! If the learning rate is set badly, the following disasters can happen.

2.3.1 if \(\eta=1\) in this case

# R is the number of iterations
R <- 10
# eta is the learning rate
eta <- 1
# initial guess of x
initial_x <- -3
# objective function
f <- function(x) { x^2 }
# partial derivative of the objective function
partialf <- function(x) { 2*x }

x <- numeric(length = R + 1)
x[1] <- initial_x

for (i in 1:R) {
  x[i+1] <- x[i] - eta * partialf(x[i])
}
data.frame(x, y = f(x))
##     x y
## 1  -3 9
## 2   3 9
## 3  -3 9
## 4   3 9
## 5  -3 9
## 6   3 9
## 7  -3 9
## 8   3 9
## 9  -3 9
## 10  3 9
## 11 -3 9
  • The iterates just shoot back and forth, landing on the same two positions every time
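This bouncing follows directly from the update rule: with \(\eta=1\),

\[x^{i+1}=(1-2\eta)\,x^{i}=(1-2)\,x^{i}=-x^{i}\]

so each step simply flips the sign of \(x\), and the iterates alternate between -3 and 3 forever.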

2.3.2 if \(\eta=1.15\) in this case

# R is the number of iterations
R <- 10
# eta is the learning rate
eta <- 1.15
# initial guess of x
initial_x <- -10
# objective function
f <- function(x) { x^2 }
# partial derivative of the objective function
partialf <- function(x) { 2*x }

x <- numeric(length = R + 1)
x[1] <- initial_x

for (i in 1:R) {
  x[i+1] <- x[i] - eta * partialf(x[i])
}
df <- data.frame(x, y = f(x))
df
##             x          y
## 1   -10.00000   100.0000
## 2    13.00000   169.0000
## 3   -16.90000   285.6100
## 4    21.97000   482.6809
## 5   -28.56100   815.7307
## 6    37.12930  1378.5849
## 7   -48.26809  2329.8085
## 8    62.74852  3937.3764
## 9   -81.57307  6654.1661
## 10  106.04499 11245.5407
## 11 -137.85849 19004.9638
  • This is the case called overshooting: the steps shoot too far!
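Again the update rule tells the whole story: with \(\eta=1.15\),

\[x^{i+1}=(1-2\eta)\,x^{i}=-1.3\,x^{i}\]

so the magnitude of \(x\) grows by a factor of 1.3 at every iteration and the iterates diverge, exactly as the table shows (-10, 13, -16.9, ...).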
x <- seq(-50, 50, 0.01)
plot(x, x^2, type = "l", xlab = "x", ylab = "f(x)")
y <- df$y
points(df$x, y, col = "red", pch = 20, cex = 1.5)
s <- seq(length(df$x) - 1)
arrows(df$x[s], y[s], df$x[s+1], df$y[s+1], col = 1:2)

(Figure: overshooting on the left, a too-small learning rate on the right)

  • Comparing with the figure, you can see this is exactly the overshooting shown on its left side

  • Some readers might conclude that the smaller the learning rate the better, but take a look at the following example

2.4 Code Implementation (Function 2)

Throughout, let \(y=f(x)\).

Our objective function is \(x^{4}-8x^{3}\). This function attains its minimum at \(x=6\), and the minimum value is \(-432\).
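Where do these values come from? Setting the derivative to zero:

\[\frac{\partial f(x)}{\partial x}=4x^{3}-24x^{2}=4x^{2}(x-6)=0\]

gives the critical points \(x=0\) and \(x=6\), with \(f(6)=1296-1728=-432\) as the minimum. Note that \(x=0\) is a flat plateau where the derivative also vanishes; it will come back to haunt us in Case 1.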

  • Let's first plot the function
x <- seq(-7, 12, 0.0001)
plot(x, x^4 - 8*x^3, type = "l", xlab = "x", ylab = "f(x)")

2.4.1 Case 1:

\(\eta=0.001\) - note one thing: a smaller learning rate is not always better!

# R is the number of iterations
R <- 35
# eta is the learning rate
eta <- 0.001
# initial guess of x
initial_x <- -3
# objective function
f <- function(x) { x^4 - 8*x^3 }
# partial derivative of the objective function
partialf <- function(x) { 4*x^3 - 24*x^2 }
x <- numeric(length = R + 1)
x[1] <- initial_x
for (i in 1:R) {
  x[i+1] <- x[i] - eta * partialf(x[i])
}
data.frame(x, y = f(x))
##             x          y
## 1  -3.0000000 297.000000
## 2  -2.6760000 204.581751
## 3  -2.4274855 149.159020
## 4  -2.2288434 113.257036
## 5  -2.0653283  88.673779
## 6  -1.9277150  71.117717
## 7  -1.8098748  58.157944
## 8  -1.7075452  48.331031
## 9  -1.6176533  40.712286
## 10 -1.5379178  34.693883
## 11 -1.4666033  29.862912
## 12 -1.4023629  25.930938
## 13 -1.3441323  22.691595
## 14 -1.2910580  19.994124
## 15 -1.2424461  17.726362
## 16 -1.1977263  15.803497
## 17 -1.1564243  14.160467
## 18 -1.1181427  12.746709
## 19 -1.0825451  11.522468
## 20 -1.0493448  10.456150
## 21 -1.0182960   9.522403
## 22 -0.9891861   8.700705
## 23 -0.9618308   7.974303
## 24 -0.9360687   7.329421
## 25 -0.9117585   6.754650
## 26 -0.8887754   6.240480
## 27 -0.8670090   5.778938
## 28 -0.8463612   5.363297
## 29 -0.8267442   4.987858
## 30 -0.8080798   4.647762
## 31 -0.7902972   4.338853
## 32 -0.7733332   4.057556
## 33 -0.7571302   3.800787
## 34 -0.7416362   3.565870
## 35 -0.7268039   3.350480
## 36 -0.7125904   3.152587
  • Here you can see that it has accidentally gotten stuck in the plateau region (near x = 0)
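The culprit is the factored derivative from above: near \(x=0\),

\[\frac{\partial f(x)}{\partial x}=4x^{2}(x-6)\approx -24x^{2}\]

vanishes quadratically, so with \(\eta=0.001\) the step sizes shrink like \(x^{2}\) and the iterates crawl ever more slowly toward the plateau at \(x=0\), never making it over to the true minimum at \(x=6\).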

2.4.2 Case 2:

\(\eta=0.01\)

# R is the number of iterations
R <- 35
# eta is the learning rate
eta <- 0.01
# initial guess of x
initial_x <- -3
# objective function
f <- function(x) { x^4 - 8*x^3 }
# partial derivative of the objective function
partialf <- function(x) { 4*x^3 - 24*x^2 }
x <- numeric(length = R + 1)
x[1] <- initial_x
for (i in 1:R) {
  x[i+1] <- x[i] - eta * partialf(x[i])
}
data.frame(x, y = f(x))
##             x            y
## 1  -3.0000000  297.0000000
## 2   0.2400000   -0.1072742
## 3   0.2532710   -0.1258563
## 4   0.2680163   -0.1488588
## 5   0.2844860   -0.1776429
## 6   0.3029888   -0.2140927
## 7   0.3239088   -0.2608604
## 8   0.3477295   -0.3217472
## 9   0.3750674   -0.4023130
## 10  0.4067190   -0.5108731
## 11  0.4437288   -0.6601768
## 12  0.4874889   -0.8703206
## 13  0.5398898   -1.1739797
## 14  0.6035505   -1.6261635
## 15  0.6821818   -2.3231756
## 16  0.7811724   -3.4411793
## 17  0.9085598   -5.3185720
## 18  1.0766753   -8.6410988
## 19  1.3049659  -14.8781960
## 20  1.6247796  -27.3450405
## 21  2.0867869  -53.7350715
## 22  2.7684185 -111.0013359
## 23  3.7591103 -225.2745756
## 24  5.0257427 -377.5541873
## 25  6.0100578 -431.9927002
## 26  5.9955260 -431.9985602
## 27  6.0019590 -431.9997236
## 28  5.9991362 -431.9999463
## 29  6.0003797 -431.9999896
## 30  5.9998329 -431.9999980
## 31  6.0000735 -431.9999996
## 32  5.9999676 -431.9999999
## 33  6.0000142 -432.0000000
## 34  5.9999937 -432.0000000
## 35  6.0000028 -432.0000000
## 36  5.9999988 -432.0000000
  • You can see that by around the 25th iteration we already have an excellent answer
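Even the flip-flopping at the end has a clean explanation. As a sketch, linearize the update around the minimum: using \(f''(6)=12\cdot 6^{2}-48\cdot 6=144\),

\[x^{i+1}-6\approx\left(1-\eta\,f''(6)\right)(x^{i}-6)=(1-1.44)\,(x^{i}-6)=-0.44\,(x^{i}-6)\]

so near \(x=6\) the error flips sign and shrinks by a factor of about 0.44 per step, matching the alternating 6.01, 5.995, 6.002, ... pattern in the table.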

3 Summary