\[ \nabla f(\theta) = (-2\theta_1, -4\theta_2, -6\theta_3)^{T} \]
\[ \nabla^2 f(\theta) = \begin{bmatrix} -2 & 0 & 0\\ 0 & -4 & 0\\ 0 & 0 & -6 \end{bmatrix} \]
We test whether \(d^{T}\nabla f(\theta_0) > 0\) using the code below. The output is -16, which is negative, so \(d = (1,1,1)^{T}\) is not an ascent direction at the initial value \(\theta_0 = (4,2,0)^{T}\).
# Objective function f(theta) = -(theta1^2 + 2*theta2^2 + 3*theta3^2);
# expects theta as a 3x1 matrix (column vector)
fn <- function(theta) {
theta1 = theta[1,]
theta2 = theta[2,]
theta3 = theta[3,]
val = theta1^2 + 2*theta2^2 + 3*theta3^2
return(-1*val)
}
# Gradient of f; returns a 3x1 matrix
gradient <- function(theta) {
theta1 = theta[1,]
theta2 = theta[2,]
theta3 = theta[3,]
val1 = -2*theta1
val2 = -4*theta2
val3 = -6*theta3
return(matrix(c(val1, val2, val3), ncol=1))
}
d = matrix(c(1,1,1), ncol=1)
theta.0 = matrix(c(4,2,0), ncol=1)
t(d) %*% gradient(theta.0)
## [,1]
## [1,] -16
#-16 No. It is not an ascent direction.
By the definition of steepest ascent, the direction \(d_n\) is given by the gradient \(\nabla f(\theta_n)\). We compute this in the code below.
steep.d = gradient(theta.0)
steep.d
## [,1]
## [1,] -8
## [2,] -8
## [3,] 0
#d = (-8, -8, 0)
Thus, from the code above, the steepest-ascent direction is \(d = (-8,-8,0)^{T}\).
We check whether the new theta, call it \(\theta_1 = \theta_0 + d\), has a greater function value than the previous \(\theta_0\). In other words, we compute \(\theta_1\) and check whether \(f(\theta_1) > f(\theta_0)\). If this is true, no step-halving is required; otherwise, step-halving is required. We do this in the code below.
next.theta = theta.0 + steep.d
l1 = fn(theta.0) #-24
l2 = fn(next.theta) #-88
if(l2 < l1) {
print("Step Halving Required")
} else {
print("No step halve required")
}
## [1] "Step Halving Required"
Now we use Newton’s method to compute the direction at \(\theta_0\). According to the class notes, the direction is \(d_n = -(\nabla^2 f(\theta_n))^{-1} \nabla f(\theta_n)\). We compute this in the code below.
# Hessian of f; constant because f is quadratic, so theta is not used
hessian <- function(theta) {
val = c(-2, 0, 0, 0, -4, 0, 0, 0, -6)
mat = matrix(val, nrow=3, byrow=TRUE)
return(mat)
}
h = hessian(theta.0)
h.inv = solve(h)
dir = -1*h.inv %*% gradient(theta.0)
dir
## [,1]
## [1,] -4
## [2,] -2
## [3,] 0
next.theta = theta.0 + dir
next.theta
## [,1]
## [1,] 0
## [2,] 0
## [3,] 0
Thus, the Newton direction is \(d = (-4,-2,0)^{T}\) and \(\theta_1 = (0,0,0)^{T}\).
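Since \(f\) is a negative-definite quadratic, a single Newton step lands exactly on the maximizer. A quick check with the fn and gradient functions defined above confirms that the objective increases and the gradient vanishes at \(\theta_1\):
fn(next.theta) # 0, an increase over fn(theta.0) = -24
gradient(next.theta) # the zero vector, so theta_1 = (0,0,0) is the stationary point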
fun <- function(teta){
#Objective function
f = -(teta[1] ^2 + 2 * teta[2]^2 + 3 * teta[3]^2)
#Gradient
g = -c(2 * teta[1], 4 * teta[2], 6 * teta[3] )
list(f = f, g = g)
}
# Steepest Ascent Algorithm
steep <- function(teta0, maxit = NULL, tolgrad = NULL, tolerr = NULL){
for(it in 1:maxit){
comp <- fun(teta0)
f0 <- comp$f
d <- comp$g
teta1 <- teta0 + d
comp <- fun(teta1)
f1 <- comp$f
#--- begin step-halving if needed
halve = 0
while(f1 < f0 & halve <= 20) {
halve = halve + 1
teta1 = teta0 + d/(2^halve)
comp = fun(teta1)
f1 = comp$f
}
#--------End Step halving
# --- check convergence
mre = max(abs(teta1 - teta0)/pmax(1, abs(teta1)))
norm_grad = norm(as.matrix(comp$g), "F")
print(c(it, halve, f1, norm_grad, mre))
print("__________________________-")
if(norm_grad < tolgrad & mre < tolerr){
break
}
teta0 <- teta1
}
}
theta.init = c(1,1,1)
y = steep(theta.init, maxit=20, tolgrad = 10^-2, tolerr = 10^-2)
## [1] 1.000000 2.000000 -1.000000 3.162278 1.500000
## [1] "__________________________-"
## [1] 2.000000 2.000000 -0.250000 1.581139 0.750000
## [1] "__________________________-"
## [1] 3.0000000 2.0000000 -0.0625000 0.7905694 0.3750000
## [1] "__________________________-"
## [1] 4.0000000 2.0000000 -0.0156250 0.3952847 0.1875000
## [1] "__________________________-"
## [1] 5.00000000 2.00000000 -0.00390625 0.19764235 0.09375000
## [1] "__________________________-"
## [1] 6.0000000000 2.0000000000 -0.0009765625 0.0988211769 0.0468750000
## [1] "__________________________-"
## [1] 7.0000000000 2.0000000000 -0.0002441406 0.0494105884 0.0234375000
## [1] "__________________________-"
## [1] 8.000000e+00 2.000000e+00 -6.103516e-05 2.470529e-02 1.171875e-02
## [1] "__________________________-"
## [1] 9.000000e+00 2.000000e+00 -1.525879e-05 1.235265e-02 5.859375e-03
## [1] "__________________________-"
## [1] 1.000000e+01 2.000000e+00 -3.814697e-06 6.176324e-03 2.929688e-03
## [1] "__________________________-"
\[ d\pi_i(\beta)(d\beta) = \frac{(x_i^{T}d\beta)\exp(x_i^{T}\beta)}{(1+\exp(x_i^{T}\beta))^2} \]
\[ \frac{\partial\ell(\beta)}{\partial\beta_1} = \sum_{i=1}^n(y_i - \pi_i(\beta)) \]
\[ \frac{\partial\ell(\beta)}{\partial\beta_2} = \sum_{i=1}^n(y_i - \pi_i(\beta))\log(D_i) \] \[ \frac{\partial\ell(\beta)}{\partial\beta_3} = \sum_{i=1}^n(y_i - \pi_i(\beta))S_i \] In general, if we wanted to write a formula that worked for all \(j\) we would write: \[ \frac{\partial\ell(\beta)}{\partial\beta_j} = \sum_{i=1}^n(y_i - \pi_i(\beta))x_{ij} \]
A general formula that holds is the following:
\[ \frac{\partial^2\ell(\beta)}{\partial\beta_j \partial \beta_k}= \sum_{i=1}^{n}\pi_i(\beta)[\pi_i(\beta) - 1]x_{ij}x_{ik} \] The above is derived by setting \(d\beta\) equal to the basis vector \(e_j\), for the appropriate choices of \(j\) and \(k\).
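A minimal sketch of this gradient in R (score is a hypothetical helper name, not part of the assignment; it assumes a design matrix X with rows \(x_i^{T}\), a 0/1 response vector y, and a coefficient vector beta):
# Score (gradient) of the log-likelihood: sum_i (y_i - pi_i) x_i, i.e. t(X) %*% (y - pi)
score <- function(X, y, beta) {
p = as.vector(exp(X %*% beta) / (1 + exp(X %*% beta))) # pi_i(beta)
t(X) %*% (y - p)
}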
data = read.table("blowBF.txt", header=TRUE)
head(data)
## D S y
## 1 9 0.0242120 0
## 2 11 0.0305947 0
## 3 9 0.0305947 0
## 4 9 0.0341815 0
## 5 5 0.0341815 0
## 6 8 0.0341815 0
#X = matrix(0,nrow(data), 3)
# Build the design matrix row by row: columns (1, log(D_i), S_i)
rowlist = list()
for(i in 1:nrow(data)) {
current.row1 = 1
current.row2 = log(data[i,1])
current.row3 = data[i,2]
rowlist[[i]] = c(current.row1, current.row2, current.row3)
}
X = matrix(unlist(rowlist),ncol=3, byrow=TRUE)
head(X)
## [,1] [,2] [,3]
## [1,] 1 2.197225 0.0242120
## [2,] 1 2.397895 0.0305947
## [3,] 1 2.197225 0.0305947
## [4,] 1 2.197225 0.0341815
## [5,] 1 1.609438 0.0341815
## [6,] 1 2.079442 0.0341815
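The same design matrix can also be built in one step with cbind, using the column names shown in head(data) above:
X <- cbind(1, log(data$D), data$S) # identical to the loop construction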
data <- read.table(file = "blowBF.txt", header = TRUE)
n = dim(data)[1]
rowlist = list()
#-----------------------------
for(i in 1:nrow(data)) {
current.row1 = 1
current.row2 = log(data[i,1])
current.row3 = data[i,2]
rowlist[[i]] = c(current.row1, current.row2, current.row3)
}
X = matrix(unlist(rowlist),ncol=3, byrow=TRUE)
#-----------------------------
beta = c(-9.562079,3.197561,4.5085)
Hessian <- function(n, X = NULL, beta = NULL){
beta = matrix(beta, nrow = 3, ncol = 1)
E = exp(X %*% beta)
P = E / (1 + E)
#--- begin writing the code for the Hessian
#diagonal entries
H = matrix(0, 3, 3)
for(j in 1:3) {
sum = 0
for(i in 1:nrow(X)) {
val = P[i,]*(P[i,] - 1)*X[i,j]*X[i,j]
sum = sum + val
}
H[j,j] = sum
}
#off-diagonal entries
for(j in 1:3) {
sum = 0
k = if(j == 3) 1 else (j+1) # pairs (1,2), (2,3), (3,1)
for(i in 1:nrow(X)) {
val = P[i,]*(P[i,] - 1)*X[i,j]*X[i,k]
sum = sum + val
}
H[j,k] = sum
H[k,j] = sum
}
#---- end of computing Hessian
return(H)
}
out = Hessian(n, X, beta)
out
## [,1] [,2] [,3]
## [1,] -90.10259 -202.59774 -37.17247
## [2,] -202.59774 -467.02995 -82.39195
## [3,] -37.17247 -82.39195 -19.21699
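As an optional cross-check (not part of the original solution), the same matrix can be obtained with a single vectorized product, since \(H = -X^{T}WX\) with \(W = \mathrm{diag}(\pi_i(\beta)(1-\pi_i(\beta)))\):
P = as.vector(exp(X %*% beta) / (1 + exp(X %*% beta)))
H.vec = -t(X) %*% (X * (P * (1 - P))) # rows of X weighted by pi_i(1 - pi_i)
H.vec # should match `out` above up to rounding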
We check whether the Hessian evaluated at beta.star is negative definite by checking whether all of its eigenvalues are negative. If the Hessian evaluated at beta.star is negative definite, then beta.star is a local maximum.
e.values = eigen(out)$values
e.values
## [1] -1.408105 -4.843684 -570.097749
Therefore, we conclude that the Hessian evaluated at beta.star is negative definite, which means that beta.star is a local maximum.
We appeal to a theorem from the class notes, which states that for large \(n\):
\[
SE(\hat{\theta}_i) \approx \sqrt{\left[\left(-\nabla^{2}\ell(\hat{\theta})\right)^{-1}\right]_{ii}}
\]
h.inv = solve(-1*out)
se.vec = rep(0,3)
for(i in 1:3) {
se.vec[i] = sqrt(h.inv[i,i])
}
se.vec
## [1] 0.7498812 0.2998976 0.5158705
Therefore, \[ SE(\beta_1) \approx 0.7498\\ SE(\beta_2) \approx 0.2998\\ SE(\beta_3) \approx 0.5158 \]
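The loop above is equivalent to the one-liner below, since diag extracts the diagonal of the inverted negative Hessian:
sqrt(diag(solve(-1*out))) # same standard errors in one line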