Multiclass Classification with Gradient Descent: Regularized Softmax Multinomial Logit vs. One-vs-All Regularized Binary Logistic Regression

In this article, gradient-descent-based implementations of two techniques, the softmax multinomial logit classifier and the one-vs-all binary logistic regression classifier (both with \(L_2\) regularization), are compared for multiclass classification on the handwritten digits dataset.

  1. The dataset contains 5000 examples of handwritten digits, some of which are shown below. There are 10 class labels: the digits 0 through 9. Given an image, we want to classify it as one of the 10 classes.

    (Figure: sample handwritten digits from the dataset)

  2. The next set of equations shows the cross-entropy cost function and its gradient for the binary sigmoid logistic regression classifier. In the one-vs-all technique we need to learn 10 such logistic regression classifiers, combine the parameter vectors learnt (using batch gradient descent) into a 10-column matrix, and then classify an image by picking the class whose classifier outputs the highest probability; a minimal code sketch of this training loop is given after the results figure below.

    \(\begin{align*} J_L &= -y\cdot\log(\sigma_{\theta}(X))-(1-y)\cdot\log(1-\sigma_{\theta}(X))\\ \nabla J_L &= (\sigma_{\theta}(X)-y)X\\ \sigma_{\theta}(X) &= \frac{1}{1+e^{-X\theta}} \end{align*}\)

  3. The next set of equations shows the cross-entropy softmax cost function and its gradient for the multiclass softmax multinomial logit classifier. Here we learn the full 10-column parameter matrix at once (using batch gradient descent) and then classify an image; a corresponding sketch follows the one-vs-all code below.

    \(\begin{align*} J_S &= -\sum_{i=1}^{10}y_i\cdot\log(s_{\theta_i}(X))\\ \nabla J_S &= (s_{\Theta}(X)-y)X\\ s_{\theta_j}(X) &= \frac{e^{X\theta_j}}{\sum_{i=1}^{10}e^{X\theta_{i}}} \end{align*}\)

  4. The entire dataset was divided into two parts: 4000 samples for training and 1000 samples for test. The following figure shows the performance of the classifiers in terms of accuracy, along with the time taken to train and predict. Both classifiers were trained with the same gradient descent parameters and regularization (for one-vs-all logistic regression, each of the 10 classifiers was trained with gradient descent for up to 1500 iterations, whereas for the softmax multinomial logit a single classifier was trained for up to 1500 iterations). As the results show, the two classifiers achieve almost the same accuracy on the test dataset (about 0.9), but the softmax classifier took only about one-seventh of the time taken by the one-vs-all logistic regression classifier; a sketch of this timing comparison is given at the end of the article.

(Figure: accuracy and training/prediction time for the two classifiers)
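
Here is a minimal NumPy sketch of the one-vs-all approach described in item 2, under some assumptions the article does not spell out: the learning rate `alpha`, the regularization strength `lam`, the \(1/m\) scaling of the cost, and the convention of leaving the bias column unpenalized are all hypothetical choices.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_cost_grad(theta, X, y, lam):
    # Regularized cross-entropy cost and gradient for one binary classifier.
    # X is assumed to carry a leading bias column of ones, which is not penalized.
    m = X.shape[0]
    h = sigmoid(X @ theta)
    reg = theta.copy()
    reg[0] = 0.0                            # do not regularize the bias term
    J = (-(y @ np.log(h + 1e-12)) - ((1 - y) @ np.log(1 - h + 1e-12))) / m \
        + lam / (2 * m) * (reg @ reg)
    grad = X.T @ (h - y) / m + lam / m * reg
    return J, grad

def one_vs_all(X, y, num_labels=10, lam=0.1, alpha=0.5, n_iter=1500):
    # Train one binary classifier per digit with batch gradient descent and
    # stack the learnt parameter vectors into a 10-column matrix Theta.
    Theta = np.zeros((X.shape[1], num_labels))
    for c in range(num_labels):
        theta = np.zeros(X.shape[1])
        yc = (y == c).astype(float)         # 1 for digit c, 0 for all others
        for _ in range(n_iter):
            _, grad = logistic_cost_grad(theta, X, yc, lam)
            theta -= alpha * grad
        Theta[:, c] = theta
    return Theta

def predict_one_vs_all(Theta, X):
    # Classify by the class whose binary classifier gives the highest probability.
    return np.argmax(sigmoid(X @ Theta), axis=1)
```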

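A corresponding sketch of the softmax multinomial logit classifier from item 3, under the same hypothetical hyperparameters; the row-max subtraction inside `softmax` is a standard numerical-stability trick rather than part of the equations above.

```python
import numpy as np

def softmax(Z):
    Z = Z - Z.max(axis=1, keepdims=True)   # subtract row max for numerical stability
    E = np.exp(Z)
    return E / E.sum(axis=1, keepdims=True)

def softmax_cost_grad(Theta, X, Y, lam):
    # Regularized cross-entropy cost and gradient; Y is the one-hot (m x 10)
    # label matrix and the bias row of Theta is not penalized.
    m = X.shape[0]
    P = softmax(X @ Theta)                 # (m x 10) class probabilities
    reg = Theta.copy()
    reg[0, :] = 0.0
    J = -np.sum(Y * np.log(P + 1e-12)) / m + lam / (2 * m) * np.sum(reg ** 2)
    grad = X.T @ (P - Y) / m + lam / m * reg
    return J, grad

def train_softmax(X, y, num_labels=10, lam=0.1, alpha=0.5, n_iter=1500):
    # Learn the full 10-column parameter matrix in a single run of
    # batch gradient descent; y must hold integer labels 0..9.
    Y = np.eye(num_labels)[y]              # one-hot encode the labels
    Theta = np.zeros((X.shape[1], num_labels))
    for _ in range(n_iter):
        _, grad = softmax_cost_grad(Theta, X, Y, lam)
        Theta -= alpha * grad
    return Theta

def predict_softmax(Theta, X):
    return np.argmax(X @ Theta, axis=1)
```
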
  5. The next figure shows how gradient descent reduces the cost function (error) when run for 5000 iterations, both for the 10 one-vs-all binary logistic regression classifiers and for the softmax multinomial logit classifier.

(Figure: cost function vs. number of gradient descent iterations for both classifiers)

  6. The next figure shows how the training and test dataset errors change with the number of gradient descent iterations for the softmax multinomial logit classifier.

(Figure: training and test error vs. number of iterations for the softmax classifier)
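
Finally, a sketch of how the timing comparison above could be reproduced, reusing the hypothetical helpers from the two sketches. `X` and `y` are assumed to be the already-loaded, bias-augmented feature matrix and integer labels; the contiguous 4000/1000 split is also an assumption, since the article does not say how the samples were shuffled.

```python
import time
import numpy as np

# X: (5000 x n) feature matrix with a leading bias column; y: integer labels 0..9.
# A contiguous 4000/1000 train/test split is assumed here.
Xtr, Xte, ytr, yte = X[:4000], X[4000:], y[:4000], y[4000:]

t0 = time.time()
Theta_ova = one_vs_all(Xtr, ytr)           # 10 separate gradient descents
t_ova = time.time() - t0

t0 = time.time()
Theta_sm = train_softmax(Xtr, ytr)         # a single gradient descent
t_sm = time.time() - t0

acc_ova = np.mean(predict_one_vs_all(Theta_ova, Xte) == yte)
acc_sm = np.mean(predict_softmax(Theta_sm, Xte) == yte)
print("one-vs-all: accuracy=%.3f, time=%.1fs" % (acc_ova, t_ova))
print("softmax   : accuracy=%.3f, time=%.1fs" % (acc_sm, t_sm))
```

To reproduce the cost and train/test error curves in the last two figures, the descent loops would record the cost \(J\) and the misclassification rates on both splits at each iteration instead of discarding them.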