In this article, gradient-descent-based implementations of two different techniques, the softmax multinomial logit classifier and the one-vs-all binary logistic regression classifier (both with \(L_2\) regularization), are compared for multiclass classification on the handwritten digits dataset.
The dataset contains 5000 examples of handwritten digits, some of which are shown below. There are 10 class labels: the digits 0 through 9. Given an image, we want to classify it as one of these 10 classes.
The next set of equations shows the cross-entropy cost function and its gradient for the binary sigmoid logistic regression classifier. In the one-vs-all technique we need to learn 10 such logistic regression classifiers (each trained with batch gradient descent), combine the learnt parameter vectors into a 10-column matrix, and then classify an image.
\(\begin{align*} J_L &= -y \cdot \log(\sigma_{\theta}(X)) - (1-y) \cdot \log(1-\sigma_{\theta}(X))\\ \nabla J_L &= (\sigma_{\theta}(X)-y)X\\ \sigma_{\theta}(X) &= \frac{1}{1+e^{-X\theta}} \end{align*}\)
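A minimal NumPy sketch of this cost/gradient and of the one-vs-all training loop might look as follows. The learning rate alpha, the regularization strength lam, and the 1/m scaling of the cost and gradient are illustrative assumptions, not the exact settings used in the experiment.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_cost_grad(theta, X, y, lam):
    """Cross-entropy cost and gradient for binary logistic regression
    with L2 regularization (the bias term is not regularized)."""
    m = X.shape[0]
    h = sigmoid(X @ theta)                        # sigma_theta(X)
    reg = (lam / (2 * m)) * np.sum(theta[1:] ** 2)
    cost = (-(y @ np.log(h)) - ((1 - y) @ np.log(1 - h))) / m + reg
    grad = (X.T @ (h - y)) / m
    grad[1:] += (lam / m) * theta[1:]
    return cost, grad

def train_one_vs_all(X, y, num_labels=10, lam=0.1, alpha=0.5, n_iters=1500):
    """Learn one binary classifier per digit with batch gradient descent and
    stack the learnt parameter vectors into a 10-column matrix."""
    m, n = X.shape
    Theta = np.zeros((n, num_labels))
    for c in range(num_labels):
        theta = np.zeros(n)
        yc = (y == c).astype(float)               # one-vs-all labels for class c
        for _ in range(n_iters):
            _, grad = logistic_cost_grad(theta, X, yc, lam)
            theta -= alpha * grad
        Theta[:, c] = theta
    return Theta
```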
The next set of equations shows the cross-entropy softmax cost function and its gradient for the multiclass softmax multinomial logit classifier. Here we learn the 10-column parameter matrix all at once (using batch gradient descent) and then classify an image.
\(\begin{align*} J_S &= -\sum_{i=1}^{10} y_i \cdot \log(s_{\theta_i}(X))\\ \nabla J_S &= (s_{\Theta}(X)-y)X\\ s_{\theta_j}(X) &= \frac{e^{X\theta_j}}{\sum_{i=1}^{10}e^{X\theta_{i}}} \end{align*}\)
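A corresponding sketch of the softmax cost/gradient (with Y the one-hot encoded label matrix) and of batch gradient descent over the full 10-column parameter matrix could look like this; again, alpha, lam and the 1/m scaling are assumptions for illustration only.

```python
import numpy as np

def softmax(Z):
    """Row-wise softmax: s_theta_j(X) = exp(X theta_j) / sum_i exp(X theta_i)."""
    Z = Z - Z.max(axis=1, keepdims=True)          # shift for numerical stability
    E = np.exp(Z)
    return E / E.sum(axis=1, keepdims=True)

def softmax_cost_grad(Theta, X, Y, lam):
    """Cross-entropy cost and gradient for the softmax classifier with L2
    regularization; Y is the one-hot encoded (m x 10) label matrix."""
    m = X.shape[0]
    P = softmax(X @ Theta)                        # m x 10 class probabilities
    reg = (lam / (2 * m)) * np.sum(Theta[1:, :] ** 2)
    cost = -np.sum(Y * np.log(P)) / m + reg
    grad = (X.T @ (P - Y)) / m
    grad[1:, :] += (lam / m) * Theta[1:, :]
    return cost, grad

def train_softmax(X, Y, lam=0.1, alpha=0.5, n_iters=1500):
    """Learn the full 10-column parameter matrix at once with batch gradient descent."""
    Theta = np.zeros((X.shape[1], Y.shape[1]))
    for _ in range(n_iters):
        _, grad = softmax_cost_grad(Theta, X, Y, lam)
        Theta -= alpha * grad
    return Theta
```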
The entire dataset was divided into two parts: 4000 samples were used for training and 1000 samples for testing. The following figure shows the performance of the classifiers in terms of accuracy and the time taken to train and predict. Both classifiers were trained with the same gradient descent settings and regularization (for one-vs-all logistic regression, each of the 10 classifiers was trained with gradient descent for up to 1500 iterations, whereas for the softmax multinomial logit a single classifier was trained with gradient descent for up to 1500 iterations). As can be seen from the results, the performance of the two classifiers on the test dataset is almost the same (with accuracy 0.9), but the softmax classifier took about one-seventh of the time taken by the one-vs-all logistic regression classifier.
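For illustration, an evaluation loop for this kind of comparison might look like the sketch below. It reuses the hypothetical train_one_vs_all and train_softmax helpers from the earlier sketches and assumes the design matrix X (with a bias column) and the integer labels y have already been loaded and shuffled; the 4000/1000 split indices are therefore only a placeholder.

```python
import time
import numpy as np

def predict(Theta, X):
    """Assign each image to the class whose column of Theta gives the largest score."""
    return np.argmax(X @ Theta, axis=1)

# Assumed: X is the (5000 x n) design matrix with a bias column, y the (5000,) labels.
X_train, y_train = X[:4000], y[:4000]
X_test,  y_test  = X[4000:], y[4000:]

Y_train = np.eye(10)[y_train]                     # one-hot labels for the softmax classifier

t0 = time.time()
Theta_ova = train_one_vs_all(X_train, y_train)    # 10 separate binary classifiers
t_ova = time.time() - t0

t0 = time.time()
Theta_soft = train_softmax(X_train, Y_train)      # one softmax classifier
t_soft = time.time() - t0

acc_ova  = np.mean(predict(Theta_ova,  X_test) == y_test)
acc_soft = np.mean(predict(Theta_soft, X_test) == y_test)
print(f"one-vs-all: accuracy={acc_ova:.3f}, training time={t_ova:.1f}s")
print(f"softmax   : accuracy={acc_soft:.3f}, training time={t_soft:.1f}s")
```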