Title: Negative Correlation Ensemble Learning for Ordinal Regression
Authors: Fernandez et al. 
Year: 2013
Journal: IEEE Transactions on Neural Networks and Learning Systems
DOI: https://doi.org/10.1109/TNNLS.2013.2268279


1. Topic

  • Neural network threshold ensemble models for ordinal regression problems
  • Uses a diversity-encouraging error function within the negative correlation learning (NCL) framework


2. Motivation

  • No previous work on ordinal regression designs the member classifiers of an ensemble while taking into account the correlation among them


3. Proposed method

  • Ordinal regression
    • Assign patterns into a set of finite ordered classes
      • example: students’ performance using A, B, C and D (order relation \(\small A > B > C> D\))
    • The difference from regression
      • The number of ranks is finite
      • The exact amounts of difference between ranks are not defined
    • Application
      • teaching assistant evaluation [1]
      • car insurance risk rating [2]
      • credit rating [3]


  • Threshold models for ordinal regression
    • Model the ordinal response as a latent continuous variable
    • Need to determine potential function
      • To project the patterns into a real line
      • A set of thresholds is used to divide the real line into consecutive intervals, each representing one of the ordered classes
    • Threshold regression model example with the standard multilayer perceptron as potential function


  • Decision function \[r_{(f(\mathbf{x}, \mathbf{w}, \mathbf{\beta}), \mathbf{\theta})}\]
    • \(\small \mathbf{w} = \{w_{s0}, \cdots, w_{sk}\}\): vector of connections between the input and hidden layer
    • \(\small \mathbf{\beta} = \{\beta_1, \cdots,\beta_S \}\): vector of connections between the hidden and output layer
    • \(\small \mathbf{\theta}\): vector of thresholds

    • Prediction (\(\small \mathbf{z} = \{\mathbf{w}, \mathbf{\beta}, \mathbf{\theta} \}\)) \[\small r_{(f(\mathbf{x}), \mathbf{z})} = \min \{j: f(\mathbf{x}) \le \theta_j \}\]

    • Posterior probability if \(\small f(\mathbf{x})\) follows a logistic cumulative distribution \[\small \begin{aligned} P(y=C_j|\mathbf{x}, \mathbf{z}) &= P(y \le C_j|\mathbf{x}, \mathbf{z}) - P(y \le C_{j-1}|\mathbf{x}, \mathbf{z}) \\ &= \frac {1}{1+\exp(f(\mathbf{x})-\theta_j)} - \frac {1}{1+\exp(f(\mathbf{x})-\theta_{j-1})} \end{aligned}\]
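The prediction rule and the logistic posterior above can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, and the boundary conventions \(\small P(y \le C_0) = 0\) and \(\small P(y \le C_J) = 1\) (that is, \(\small \theta_0 = -\infty\), \(\small \theta_J = +\infty\)) are assumptions added to close the formula at the extreme classes.

```python
import math


def predict_rank(f_x, thresholds):
    """Return the 1-based rank min{j : f(x) <= theta_j}.

    `thresholds` holds theta_1 <= ... <= theta_{J-1}; theta_J is treated
    as +infinity (assumption), so larger values fall into the last class J.
    """
    for j, theta in enumerate(thresholds, start=1):
        if f_x <= theta:
            return j
    return len(thresholds) + 1


def cumulative_prob(f_x, theta):
    """P(y <= C_j | x) as written in the notes: 1 / (1 + exp(f(x) - theta_j))."""
    return 1.0 / (1.0 + math.exp(f_x - theta))


def class_probs(f_x, thresholds):
    """P(y = C_j | x) for j = 1..J as differences of cumulative probabilities."""
    # Pad with P(y <= C_0) = 0 and P(y <= C_J) = 1 (assumed conventions).
    cum = [0.0] + [cumulative_prob(f_x, t) for t in thresholds] + [1.0]
    return [cum[j] - cum[j - 1] for j in range(1, len(cum))]


thresholds = [-1.0, 0.0, 1.0]           # hypothetical values, J = 4 classes
rank = predict_rank(0.5, thresholds)    # 0.5 lies in (theta_2, theta_3]
probs = class_probs(0.5, thresholds)    # four probabilities summing to 1
```

Note that the rank given by the threshold rule and the most probable class under the posterior need not coincide for a given pattern.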


  • Example of good and bad threshold models

    • The potential value \(\small f(\mathbf{x})\) should lie far from the boundaries of the desired interval
    • By averaging the diverse projections of the ensemble members, the model aims to better estimate the true value of the latent variable


  • NCL for Ordinal Regression using ensemble model with Fixed Thresholds (NCLOR-FT)
    • Fixed Thresholds means that the vector \(\small \theta\) is not modified during the learning procedure
      • \(\small \theta_j=\theta_1+(j-1)*width\)
    • Ensemble potential function \[\small \bar{f}(\mathbf{x}) = \frac 1M \sum_{i=1}^M f_i(\mathbf{x})\]
    • Prediction \[\small r_{(\bar{f}(\mathbf{x}), \mathbf{z})} = \min \{j: \bar{f}(\mathbf{x}) \le \theta_j \} \]
    • Error function of \(\small i\)th neural network \[\small e_i = \frac 1N \sum_{n=1}^N(f_i(\mathbf{x}_n) - t_n)^2 + \frac {\lambda}N \sum_{n=1}^N((f_i(\mathbf{x}_n) - \bar{f}(\mathbf{x}_n))\sum_{j\neq i}(f_j(\mathbf{x}_n) - \bar{f}(\mathbf{x}_n))) \]
      • target value \(\small t_n\) of pattern \(\small \mathbf{x}_n\) is the center of the interval of the class the pattern belongs to (\(\small t_n = \theta_{j-1} + (\theta_j - \theta_{j-1})/2\))
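The NCLOR-FT quantities above can be sketched as follows. This is only an illustration under stated assumptions: the function names and toy numbers are hypothetical, the first term is taken to be the usual squared error of negative correlation learning, and the targets are passed in directly rather than derived from class intervals. Because deviations from the ensemble mean sum to zero, the inner sum over \(\small j \neq i\) equals \(\small -(f_i(\mathbf{x}_n) - \bar{f}(\mathbf{x}_n))\), so the penalty rewards members that move away from the ensemble mean.

```python
def fixed_thresholds(theta_1, width, n_classes):
    """theta_j = theta_1 + (j - 1) * width for j = 1..J-1."""
    return [theta_1 + j * width for j in range(n_classes - 1)]


def ensemble_mean(outputs):
    """Average the M member outputs per pattern; outputs[i][n] = f_i(x_n)."""
    m = len(outputs)
    return [sum(column) / m for column in zip(*outputs)]


def ncl_error(outputs, targets, i, lam):
    """NCL error of member i: mean squared error plus lam times the
    correlation penalty (f_i - f_bar) * sum_{j != i} (f_j - f_bar)."""
    f_bar = ensemble_mean(outputs)
    n_patterns = len(targets)
    total = 0.0
    for n in range(n_patterns):
        dev_i = outputs[i][n] - f_bar[n]
        dev_others = sum(
            outputs[j][n] - f_bar[n] for j in range(len(outputs)) if j != i
        )
        total += (outputs[i][n] - targets[n]) ** 2 + lam * dev_i * dev_others
    return total / n_patterns


# Toy example (hypothetical values): M = 2 members, N = 2 patterns.
outputs = [[0.0, 1.0], [2.0, 3.0]]
targets = [1.0, 2.0]
error_plain = ncl_error(outputs, targets, 0, 0.0)   # squared error only
error_ncl = ncl_error(outputs, targets, 0, 0.5)     # with diversity penalty
```

With `lam = 0` each member is trained independently; larger values lower the error of members that disagree with the ensemble mean.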


  • NCL for Ordinal Regression using ensemble model with Adaptive Thresholds (NCLOR-AT)
    • In this case, each model of the ensemble uses a different \(\small \mathbf{\theta}\) vector
    • Adaptive Thresholds means that the vector \(\small \theta\) is updated during the learning procedure, driven by the posterior probabilities
    • Ensemble probability \[\small \bar{\hat{P}}(y=C_j|\mathbf{x}, \mathbf{Z})=\frac 1M \sum_{i=1}^M \hat{P}_i(y=C_j|\mathbf{x}, \mathbf{z}_i) \]
    • Optimal classification rule \[\small C(\mathbf{x}) = \hat{j}, \quad \text{where} \quad \hat{j}= \arg\max_j \{\bar{\hat{P}}(y=C_j|\mathbf{x}, \mathbf{Z}) \}\]

    • Error function of \(\small i\)th neural network \[\small e_i = \frac 1{N \cdot J} \sum_{n=1}^N \sum_{j=1}^J (\hat{P}_{n, i}^j - y_n^{(j)})^2 + \frac {\lambda}{N \cdot J} \sum_{n=1}^N \sum_{j=1}^J ((\hat{P}_{n, i}^j - \bar{\hat{P}}_{n}^j) \sum_{l \neq i}(\hat{P}_{n, l}^j - \bar{\hat{P}}_{n}^j) )\]
      • \(\small \mathbf{y}_n=\{y_n^{(1)}, \cdots, y_n^{(J)}\}\) with \(\small y_n^{(j)}=1\) if the pattern is from class \(\small j\), and \(\small y_n^{(j)}=0\) if it is not
      • \(\small \hat{P}_{n, i}^j = \hat{P}_i(y_n=C_j|\mathbf{x}_n, \mathbf{z}_i)\)
      • \(\small \bar{\hat{P}}_{n}^j = \bar{\hat{P}}(y_n=C_j|\mathbf{x}_n, \mathbf{Z})\)
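A rough sketch of the NCLOR-AT ensemble rule and per-member error, again in plain Python with hypothetical function names and toy values. The member probabilities are taken as given (in the model they would come from each member's own threshold posterior), and the penalty weight `lam` mirrors the \(\small \lambda\) of the NCLOR-FT error; setting it to 1 leaves the penalty unscaled.

```python
def ensemble_probs(member_probs):
    """Average class probabilities over the M members for one pattern.

    member_probs[i][j] is member i's probability for class j + 1.
    """
    m = len(member_probs)
    return [sum(column) / m for column in zip(*member_probs)]


def predict_class(member_probs):
    """Return the 1-based class with the largest ensemble probability."""
    p_bar = ensemble_probs(member_probs)
    return max(range(len(p_bar)), key=p_bar.__getitem__) + 1


def ncl_prob_error(probs, labels, i, lam):
    """Error of member i over all patterns and classes.

    probs[m][n][j]: probability from member m for pattern n and class j + 1.
    labels[n][j]: 1 if pattern n belongs to class j + 1, else 0.
    """
    n_members = len(probs)
    n_patterns = len(labels)
    n_classes = len(labels[0])
    total = 0.0
    for n in range(n_patterns):
        for j in range(n_classes):
            p_bar = sum(probs[m][n][j] for m in range(n_members)) / n_members
            dev_i = probs[i][n][j] - p_bar
            dev_others = sum(
                probs[l][n][j] - p_bar for l in range(n_members) if l != i
            )
            total += (probs[i][n][j] - labels[n][j]) ** 2 + lam * dev_i * dev_others
    return total / (n_patterns * n_classes)


# Toy example (hypothetical values): two members, three classes.
predicted = predict_class([[0.1, 0.6, 0.3], [0.3, 0.4, 0.3]])

# Two members that disagree completely on a single two-class pattern.
probs = [[[1.0, 0.0]], [[0.0, 1.0]]]
labels = [[1, 0]]
error_member_0 = ncl_prob_error(probs, labels, 0, 1.0)
error_member_1 = ncl_prob_error(probs, labels, 1, 1.0)
```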


4. Experiments

  • dataset
  • Comparison methods

    • Ensemble approaches for ordinal regression
      • A Simple Approach to Ordinal Regression (ASAOR) [4]
      • MultiClass Ordinal Support Vector Machines (MCOSvm) [5]
      • Ordinal Regression Boosting (ORBoost) [6]
    • Threshold models for ordinal regression
      • Ordinal Neural Network (oNN) [7]
      • Support Vector Ordinal Regression (SVOR) [8], [9]
      • Gaussian Processes for Ordinal Regression (GPOR) [10]
  • Results
    • Comparison of NCLOR-AT to other methods
      • Measure \[\small Accuracy = \frac 1N \sum_{i=1}^N I(\hat{y}_i=y_i)\]
    • Comparison of NCLOR and regression neural network model on synthetic toy dataset
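The accuracy measure listed under Results is the fraction of patterns whose predicted rank equals the true rank. A minimal sketch, with a hypothetical function name and made-up labels:

```python
def accuracy(y_pred, y_true):
    """Fraction of patterns with y_pred[i] == y_true[i] (the indicator sum / N)."""
    if len(y_pred) != len(y_true) or not y_true:
        raise ValueError("y_pred and y_true must be non-empty and equally long")
    return sum(p == t for p, t in zip(y_pred, y_true)) / len(y_true)


score = accuracy([1, 2, 3, 3], [1, 2, 2, 3])  # three of four ranks match
```

Accuracy treats every misclassification equally, regardless of how far the predicted rank is from the true one.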


5. References

[1] T.-S. Lim, W.-Y. Loh, and Y.-S. Shih, “A comparison of prediction accuracy, complexity, and training time of thirty-three old and new classification algorithms,” Mach. Learn., vol. 40, no. 3, pp. 203–228, 2000.
[2] D. Kibler, D. W. Aha, and M. K. Albert, “Instance-based prediction of real-valued attributes,” Comput. Intell., vol. 5, no. 2, pp. 51–57, 1989.
[3] K.-J. Kim and H. Ahn, “A corporate credit rating model using multi-class support vector machines with an ordinal pairwise partitioning approach,” Comput. Oper. Res., vol. 39, no. 8, pp. 1800–1811, 2012.
[4] E. Frank and M. Hall, “A simple approach to ordinal classification,” in Proc. ECML, 2001, pp. 145–156.
[5] W. Waegeman and L. Boullart, “An ensemble of weighted support vector machines for ordinal regression,” Int. J. Comput. Syst. Sci. Eng., vol. 3, no. 1, pp. 47–51, 2009.
[6] H.-T. Lin and L. Li, “Large-margin thresholded ensembles for ordinal regression: Theory and practice,” in Proc. 17th Int. Conf. Algorithmic Learn. Theory, 2006, pp. 319–333.
[7] J. S. Cardoso and J. F. Pinto da Costa, “Learning to classify ordinal data: The data replication method,” J. Mach. Learn. Res., vol. 8, pp. 1393–1429, Sep. 2007.
[8] W. Chu and S. S. Keerthi, “Support vector ordinal regression,” Neural Comput., vol. 19, no. 3, pp. 792–815, 2007.
[9] W. Chu and S. S. Keerthi, “New approaches to support vector ordinal regression,” in Proc. 22nd Int. Conf. Mach. Learn., 2005, pp. 145–152.
[10] W. Chu and Z. Ghahramani, “Gaussian processes for ordinal regression,” J. Mach. Learn. Res., vol. 6, pp. 1019–1041, Jan. 2005.