The problem

A simulated dataset

To show how bad scaling can produce convergence problems or numerical issues when inverting the Hessian, I create an artificial mixed logit dataset:

rm(list = ls(all = TRUE))          # Clean objects
library("gmnl")
library("mlogit")

I first set the globals (seed, number of individuals, number of alternatives, etc.) for the simulation:

# Globals
set.seed(250586)             # Set seed
N <- 300                     # Number of individuals
J <- 3                       # Number of alternatives
t <- 10                      # Number of choice situations
R <- 50                      # Number of draws
id <- rep(1:N, each = J * t) # Id for each individual

Now we create two normally distributed random coefficients and two explanatory variables.

# Create random parameters and variables
b1 <- 1 + 1 * rnorm(N) # Normal
b2 <- 2 + 1 * rnorm(N) # Normal
x1 <- as.numeric(runif(N * J * t) < 0.5) # dummy variable
x2 <- rnorm(N * J * t)                   # continuous variable
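
As a quick sanity check (a sketch, not part of the original script), we can verify that the simulated random coefficients have roughly the intended moments, N(1, 1) for b1 and N(2, 1) for b2:

# Sample moments of the simulated random coefficients
round(c(mean_b1 = mean(b1), sd_b1 = sd(b1),
        mean_b2 = mean(b2), sd_b2 = sd(b2)), 2)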

Note that x1 is a dummy variable, whereas x2 is continuous. The utility of each alternative in each choice situation is created as follows:

# The true data generating process
U <-  b1[id] * x1 + b2[id] * x2 - log(-log(runif(N * J * t))) 

Note that the error term -log(-log(u)), with u uniform on (0, 1), follows a type I extreme value (Gumbel) distribution. The chosen alternative in each choice situation is determined by the following code:

# Chosen alternative: the one with the highest utility in each choice situation
chosen <- rep(0, N * J * t)
for (i in 1:(N * t)) {
  U_j <- U[((i - 1) * J + 1):(i * J)]   # utilities of the J alternatives in situation i
  pos <- which.max(U_j)                 # position of the utility-maximizing alternative
  chosen[((i - 1) * J + 1):(i * J)][pos] <- 1
}
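
As a further check (again a sketch, not part of the original script), each of the N * t choice situations should now contain exactly one chosen alternative:

# Every block of J rows (one choice situation) should sum to exactly 1
situation <- rep(1:(N * t), each = J)   # choice-situation index for each row
all(tapply(chosen, situation, sum) == 1)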

We then put the data in the correct format using the mlogit.data function from the mlogit package:

data <- as.data.frame(cbind(id, chosen, x1, x2))
data <- mlogit.data(data, 
                    choice = "chosen",
                    shape = "long", 
                    alt.levels = c("1", "2", "3"), 
                    id.var = "id")

Now we estimate a MIXL model:

mix1 <- gmnl(chosen ~ x1 + x2 | 0, 
            data = data,
            model = "mixl",
            ranp = c(x1 = "n", x2 = "n"),
            print.init = TRUE,
            R = R,
            panel = TRUE,
            iterlim = 300,
            method = "bhhh",
            print.level = 2)
## 
## Starting Values:
##        x1        x2     sd.x1     sd.x2 
## 0.7574765 1.5132982 0.1000000 0.1000000 
## Estimating MIXL model 
## ----- Initial parameters: -----
## fcn value: -2109.154 
##       parameter initial gradient free
## x1    0.7574765      -0.03071733    1
## x2    1.5132982       3.56397182    1
## sd.x1 0.1000000      18.41496933    1
## sd.x2 0.1000000      93.80607253    1
## Condition number of the (active) hessian: 160.454 
## -----Iteration 1 -----
## -----Iteration 2 -----
## -----Iteration 3 -----
## -----Iteration 4 -----
## -----Iteration 5 -----
## -----Iteration 6 -----
## -----Iteration 7 -----
## -----Iteration 8 -----
## -----Iteration 9 -----
## -----Iteration 10 -----
## -----Iteration 11 -----
## -----Iteration 12 -----
## --------------
## successive function values within tolerance limit 
## 12  iterations
## estimate: 0.8749408 2.062459 0.9824276 0.9992621 
## Function value: -2003.711
summary(mix1)
## 
## Model estimated on: Sat Oct 07 15:34:09 2017 
## 
## Call:
## gmnl(formula = chosen ~ x1 + x2 | 0, data = data, model = "mixl", 
##     ranp = c(x1 = "n", x2 = "n"), R = R, panel = TRUE, print.init = TRUE, 
##     iterlim = 300, method = "bhhh", print.level = 2)
## 
## Frequencies of categories:
## 
##       1       2       3 
## 0.31933 0.34100 0.33967 
## 
## The estimation took: 0h:0m:13s 
## 
## Coefficients:
##       Estimate Std. Error z-value  Pr(>|z|)    
## x1    0.874941   0.071802 12.1855 < 2.2e-16 ***
## x2    2.062459   0.078690 26.2099 < 2.2e-16 ***
## sd.x1 0.982428   0.118759  8.2725  2.22e-16 ***
## sd.x2 0.999262   0.072016 13.8755 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Optimization of log-likelihood by BHHH maximisation
## Log Likelihood: -2003.7
## Number of observations: 3000
## Number of iterations: 12
## Exit of MLE: successive function values within tolerance limit
## Simulation based on 50 draws

The results show that the point estimates are very close to the true population parameters. It is also important to notice that I have included the argument print.level = 2 in the gmnl function, which allows tracing the optimization procedure in real time. The information provided by the optimizer is very informative. For example, the initial gradient for the standard deviation parameters is large compared with that for the mean parameters (recall that the gradient should be close to zero at the optimum). The output also reports the condition number of the Hessian, where a huge number indicates a poorly conditioned (numerically unstable) Hessian. If this number is very large, the parameter estimates will be less accurate and the numerical optimization may take a long time to converge, if it converges at all.
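
The condition number printed above refers to the Hessian at the initial parameters. To inspect the conditioning at the solution as well, a rough check (a sketch, assuming the usual vcov() extractor works for gmnl fits, as it does for most R model objects) is to compute the condition number of the estimated covariance matrix; for a symmetric positive definite matrix this equals the condition number of its inverse, the (negative) Hessian:

# Condition number (ratio of largest to smallest singular value) of the
# estimated covariance matrix at the optimum; a huge value would again
# signal an ill-conditioned Hessian
kappa(vcov(mix1), exact = TRUE)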

Bad-scaling problem

To see how badly scaled variables can lead to convergence problems, we modify the variable x2 by multiplying it by a huge number:

data$x2m <- data$x2 * 100000
mix2 <- gmnl(chosen ~ x1 + x2m | 0, 
            data = data,
            model = "mixl",
            ranp = c(x1 = "n", x2m = "n"),
            print.init = TRUE,
            R = R,
            panel = TRUE,
            iterlim = 300,
            method = "bhhh",
            print.level = 2)
## 
## Starting Values:
##           x1          x2m        sd.x1       sd.x2m 
## 7.574765e-01 1.513298e-05 1.000000e-01 1.000000e-01 
## Estimating MIXL model 
## ----- Initial parameters: -----
## fcn value: -Inf 
##           parameter initial gradient free
## x1     7.574765e-01        3.6801446    1
## x2m    1.513298e-05    -9168.4143317    1
## sd.x1  1.000000e-01       -0.9129622    1
## sd.x2m 1.000000e-01      -37.4459849    1
## Condition number of the (active) hessian: 12069350 
## -----Iteration 1 -----
## Error in if (is.null(newVal) && sum(f1) - sum(f0) < slot(control, "tol")) {: missing value where TRUE/FALSE needed

As expected, the optimizer runs into problems due to the scale difference between the variables x1 and x2m: the initial log-likelihood is -Inf and the first iteration fails. The output, however, contains important clues to the source of the problem. For example, the initial gradient for the badly scaled variable x2m and the condition number of the Hessian are both huge compared with those of the mix1 model.
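
A simple diagnostic that would have flagged this before estimation (a sketch, nothing specific to gmnl) is to compare the scales of the covariates entering the model; here x2m is five orders of magnitude larger than x1:

# Compare the scale of the covariates entering the model
sapply(data[, c("x1", "x2m")], function(v) c(mean = mean(v), sd = sd(v)))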

One way to solve the problem is to standardize the variable (or simply rescale it; a sketch of the rescaling alternative appears at the end of this section). After standardizing x2m, the estimation converges without difficulty:

# standardizing the variable
x2_bar <- mean(data$x2m)
x2_sd  <- sd(data$x2m)
data$x2m_n <- (data$x2m - x2_bar) / x2_sd
mix3 <- gmnl(chosen ~ x1 + x2m_n | 0, 
            data = data,
            model = "mixl",
            ranp = c(x1 = "n", x2m_n = "n"),
            print.init = TRUE,
            R = R,
            panel = TRUE,
            iterlim = 300,
            method = "bhhh",
            print.level = 2)
## 
## Starting Values:
##        x1     x2m_n     sd.x1  sd.x2m_n 
## 0.7574765 1.5105171 0.1000000 0.1000000 
## Estimating MIXL model 
## ----- Initial parameters: -----
## fcn value: -2109.136 
##          parameter initial gradient free
## x1       0.7574765       -0.0315253    1
## x2m_n    1.5105171        3.5816950    1
## sd.x1    0.1000000       18.4136633    1
## sd.x2m_n 0.1000000       94.1192559    1
## Condition number of the (active) hessian: 161.0625 
## -----Iteration 1 -----
## -----Iteration 2 -----
## -----Iteration 3 -----
## -----Iteration 4 -----
## -----Iteration 5 -----
## -----Iteration 6 -----
## -----Iteration 7 -----
## -----Iteration 8 -----
## -----Iteration 9 -----
## -----Iteration 10 -----
## -----Iteration 11 -----
## -----Iteration 12 -----
## --------------
## successive function values within tolerance limit 
## 12  iterations
## estimate: 0.8749401 2.058671 0.9824277 0.9974254 
## Function value: -2003.711
summary(mix3)
## 
## Model estimated on: Sat Oct 07 15:34:22 2017 
## 
## Call:
## gmnl(formula = chosen ~ x1 + x2m_n | 0, data = data, model = "mixl", 
##     ranp = c(x1 = "n", x2m_n = "n"), R = R, panel = TRUE, print.init = TRUE, 
##     iterlim = 300, method = "bhhh", print.level = 2)
## 
## Frequencies of categories:
## 
##       1       2       3 
## 0.31933 0.34100 0.33967 
## 
## The estimation took: 0h:0m:12s 
## 
## Coefficients:
##          Estimate Std. Error z-value  Pr(>|z|)    
## x1       0.874940   0.071802 12.1855 < 2.2e-16 ***
## x2m_n    2.058671   0.078545 26.2099 < 2.2e-16 ***
## sd.x1    0.982428   0.118759  8.2725  2.22e-16 ***
## sd.x2m_n 0.997425   0.071884 13.8755 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Optimization of log-likelihood by BHHH maximisation
## Log Likelihood: -2003.7
## Number of observations: 3000
## Number of iterations: 12
## Exit of MLE: successive function values within tolerance limit
## Simulation based on 50 draws
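
Finally, as mentioned above, simply rescaling the variable also works. A minimal sketch (the variable x2m_r and the model object mix4 are just illustrative names): dividing x2m by the same constant used to inflate it recovers the original x2, so the estimates should essentially reproduce those of mix1.

# Alternative fix: rescale instead of standardize
data$x2m_r <- data$x2m / 100000
mix4 <- gmnl(chosen ~ x1 + x2m_r | 0, 
             data = data,
             model = "mixl",
             ranp = c(x1 = "n", x2m_r = "n"),
             R = R,
             panel = TRUE,
             method = "bhhh")
summary(mix4)

In general, when a variable is divided by a constant c, the estimated mean and standard deviation of its coefficient are multiplied by c, so the coefficients must be interpreted in the rescaled units.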