Title: Negative Correlation Ensemble Learning for Ordinal Regression
Authors: Fernandez et al.
Year: 2013
Journal: IEEE Transactions on Neural Networks and Learning Systems
DOI: https://doi.org/10.1109/TNNLS.2013.2268279

1. Topic

Neural network threshold ensemble models for ordinal regression problems
Using a diversity-encouraging error function within the negative correlation learning (NCL) framework

2. Motivation

No previous work on ordinal regression designs the member classifiers of an ensemble by taking the correlation among them into account

3. Proposed method

Ordinal regression
- Assign patterns into a set of finite ordered classes
  - Example: students’ performance graded as A, B, C and D (order relation \(\small A > B > C > D\))
- The difference from regression
  - The number of ranks is finite
  - The exact amounts of difference between ranks are not defined
- Application
  - teaching assistant evaluation [1]
  - car insurance risk rating [2]
  - credit rating [3]

Threshold models for ordinal regression
- Model the ordinal response as a latent continuous variable
- Need to determine a potential function
  - To project the patterns onto the real line
  - A set of thresholds is used to divide the real line into consecutive intervals, each of them representing one of the ordered classes
- Threshold regression model example with the standard multilayer perceptron as potential function

Decision function \[r_{(f(\mathbf{x}, \mathbf{w}, \mathbf{\beta}), \mathbf{\theta})}\]
- \(\small \mathbf{w} = \{w_{s0}, \cdots, w_{sk}\}\): vector of connections between the input and hidden layer
- \(\small \mathbf{\beta} = \{\beta_1, \cdots,\beta_S \}\): vector of connections between the hidden and output layer
- \(\small \mathbf{\theta}\): vector of thresholds
- Prediction (\(\small \mathbf{z} = \{\mathbf{w}, \mathbf{\beta}, \mathbf{\theta} \}\)) \[\small r_{(f(\mathbf{x}), \mathbf{z})} = \min \{j: f(\mathbf{x}) \le \theta_j \}\]
- Posterior probability if \(\small f(\mathbf{x})\) follows a logistic cumulative distribution \[\small \begin{aligned} P(y=C_j|\mathbf{x}, \mathbf{z}) &= P(y \le C_j|\mathbf{x}, \mathbf{z}) - P(y \le C_{j-1}|\mathbf{x}, \mathbf{z}) \\ &= \frac {1}{1+\exp(f(\mathbf{x})-\theta_j)} - \frac {1}{1+\exp(f(\mathbf{x})-\theta_{j-1})} \end{aligned}\]
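
The prediction rule and the logistic posterior above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the function names `predict_rank` and `class_probabilities` are hypothetical, the potential value `f_x` is taken as given (no network is trained here), and the outer thresholds are assumed to be \(\small \theta_0=-\infty\) and \(\small \theta_J=+\infty\), so only the \(\small J-1\) finite thresholds are passed in.

```python
import numpy as np

def predict_rank(f_x, thetas):
    """Return the 1-based rank min{j : f(x) <= theta_j}.

    thetas: the J-1 finite thresholds in increasing order
    (assumption: the last class has theta_J = +inf).
    """
    thetas = np.asarray(thetas, dtype=float)
    # side="left" gives the first index i with f_x <= thetas[i]
    return int(np.searchsorted(thetas, f_x, side="left")) + 1

def class_probabilities(f_x, thetas):
    """Return P(y = C_j | x) for j = 1..J under the logistic model."""
    thetas = np.asarray(thetas, dtype=float)
    # Cumulative P(y <= C_j) = 1 / (1 + exp(f(x) - theta_j))
    cum = 1.0 / (1.0 + np.exp(f_x - thetas))
    # Pad with P(y <= C_0) = 0 and P(y <= C_J) = 1 (assumed outer thresholds)
    cum = np.concatenate(([0.0], cum, [1.0]))
    # Adjacent differences give the per-class probabilities
    return np.diff(cum)

thetas = [-1.0, 0.0, 1.0]               # illustrative values for J = 4 classes
rank = predict_rank(0.5, thetas)         # 0.5 falls in (theta_2, theta_3]
probs = class_probabilities(0.5, thetas)
```

With increasing thresholds the cumulative values are increasing too, so every class probability is non-negative and the vector sums to one.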

Example of good and bad threshold models
- The potential value \(\small f(\mathbf{x})\) should be far from the extremes or boundaries of the desired interval
- Averaging the diverse projections of the ensemble members is intended to give a better estimate of the true value of the latent variable

NCL for Ordinal Regression using ensemble model with Fixed Thresholds (NCLOR-FT)
- Fixed Thresholds means that the vector \(\small \theta\) is not modified during the learning procedure
  - \(\small \theta_j=\theta_1+(j-1) \cdot \text{width}\)
- Ensemble potential function \[\small \bar{f}(\mathbf{x}) = \frac 1M \sum_{i=1}^M f_i(\mathbf{x})\]
- Prediction \[\small r_{(\bar{f}(\mathbf{x}), \mathbf{z})} = \min \{j: \bar{f}(\mathbf{x}) \le \theta_j \} \]
- Error function of the \(\small i\)th neural network \[\small e_i = \frac 1N \sum_{n=1}^N(f_i(\mathbf{x}_n) - t_n)^2 + \frac {\lambda}N \sum_{n=1}^N((f_i(\mathbf{x}_n) - \bar{f}(\mathbf{x}_n))\sum_{j\neq i}(f_j(\mathbf{x}_n) - \bar{f}(\mathbf{x}_n))) \]
  - The target value \(\small t_n\) of pattern \(\small \mathbf{x}_n\) is the center of the interval of the class the pattern belongs to (\(\small t_n = \theta_{j-1} + (\theta_j - \theta_{j-1})/2\))
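
A minimal NumPy sketch of the NCLOR-FT quantities listed above, not the authors' implementation. Assumptions: the fit term is a squared error, as is standard in negative correlation learning; the two open-ended outer classes are given an interval of the same `width` so that they have a center; all function names are hypothetical, and the member outputs `F` are supplied directly instead of coming from trained networks.

```python
import numpy as np

def fixed_thresholds(theta_1, width, n_classes):
    """theta_j = theta_1 + (j - 1) * width for j = 1..J-1."""
    j = np.arange(1, n_classes)
    return theta_1 + (j - 1) * width

def interval_centers(thetas, width):
    """Center of each class interval, usable as regression targets t_n.

    Assumption: the first and last class get an interval of the same width.
    """
    edges = np.concatenate(([thetas[0] - width], thetas, [thetas[-1] + width]))
    return (edges[:-1] + edges[1:]) / 2.0

def ncl_errors(F, t, lam):
    """NCL error e_i of every ensemble member.

    F: (M, N) array of outputs f_i(x_n); t: (N,) targets; lam: lambda.
    """
    f_bar = F.mean(axis=0)                  # ensemble potential f_bar(x_n)
    dev = F - f_bar                         # f_i(x_n) - f_bar(x_n)
    others = dev.sum(axis=0) - dev          # sum over j != i of (f_j - f_bar)
    fit = ((F - t) ** 2).mean(axis=1)       # squared-error term
    penalty = (dev * others).mean(axis=1)   # correlation penalty
    return fit + lam * penalty

thetas = fixed_thresholds(theta_1=0.0, width=1.0, n_classes=4)
centers = interval_centers(thetas, width=1.0)
F = np.array([[1.0, 2.0], [3.0, 2.0]])     # two members, two patterns
errors = ncl_errors(F, t=np.array([2.0, 2.0]), lam=0.5)
```

Because the deviations from the ensemble mean sum to zero over the members, `others` equals `-dev`, so the penalty is the negated squared deviation of a member from the ensemble. A larger \(\small \lambda\) therefore rewards members for disagreeing with the ensemble average.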

NCL for Ordinal Regression using ensemble model with Adaptive Thresholds (NCLOR-AT)
- In this case, each model of the ensemble uses a different \(\small \mathbf{\theta}\) vector
- Adaptive Thresholds means that the vector \(\small \theta\) is updated during the learning procedure, based on the posterior probabilities
- Ensemble probability \[\small \bar{\hat{P}}(y=C_j|\mathbf{x}, \mathbf{Z})=\frac 1M \sum_{i=1}^M \hat{P}_i(y=C_j|\mathbf{x}, \mathbf{z}_i) \]
- Optimal classification rule \[\small C(\mathbf{x}) = \hat{j}, \quad \text{where} \quad \hat{j}= \arg\max_j \{\bar{\hat{P}}(y=C_j|\mathbf{x}, \mathbf{Z}) \}\]
- Error function of \(\small i\)th neural network \[\small e_i = \frac 1{N \cdot J} \sum_{n=1}^N \sum_{j=1}^J (\hat{P}_{n, i}^j - y_n^{(j)})^2 + \frac 1{N \cdot J} \sum_{n=1}^N \sum_{j=1}^J ((\hat{P}_{n, i}^j - \bar{\hat{P}}_{n}^j) \sum_{l \neq i}(\hat{P}_{n, l}^j - \bar{\hat{P}}_{n}^j) )\]
  - \(\small \mathbf{y}_n=\{y_n^{(1)}, \cdots, y_n^{(J)}\}\) with \(\small y_n^{(j)}=1\) if the pattern is from class \(\small j\), and \(\small y_n^{(j)}=0\) if it is not
  - \(\small \hat{P}_{n, i}^j = \hat{P}_i(y_n=C_j|\mathbf{x}_n, \mathbf{z}_i)\)
  - \(\small \bar{\hat{P}}_{n}^j = \bar{\hat{P}}(y_n=C_j|\mathbf{x}_n, \mathbf{Z})\)
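
The NCLOR-AT averaging, classification rule and error function can be sketched the same way. This is an illustration under assumptions, not the paper's code: the member posteriors `P` are supplied as a ready-made array, the function names are hypothetical, and `lam` is an added knob whose default of 1 reproduces the error function exactly as written above.

```python
import numpy as np

def ensemble_probabilities(P):
    """Average the member posteriors. P: (M, N, J) -> (N, J)."""
    return P.mean(axis=0)

def classify(P):
    """1-based class with the largest averaged posterior, one per pattern."""
    return ensemble_probabilities(P).argmax(axis=1) + 1

def ncl_at_errors(P, Y, lam=1.0):
    """Error e_i of every member on probability outputs.

    P: (M, N, J) member posteriors; Y: (N, J) one-hot targets.
    lam is an assumption; lam=1 matches the formula as written.
    """
    P_bar = P.mean(axis=0)
    dev = P - P_bar                             # P_i - P_bar
    others = dev.sum(axis=0) - dev              # sum over l != i of (P_l - P_bar)
    fit = ((P - Y) ** 2).mean(axis=(1, 2))      # 1/(N*J) * squared error
    penalty = (dev * others).mean(axis=(1, 2))  # 1/(N*J) * correlation term
    return fit + lam * penalty

# Two members, one pattern, two classes (illustrative numbers only)
P = np.array([[[0.8, 0.2]], [[0.6, 0.4]]])
Y = np.array([[1.0, 0.0]])
labels = classify(P)
errors = ncl_at_errors(P, Y)
```

Taking the mean over the pattern and class axes implements the \(\small \frac 1{N \cdot J}\) factor of both terms.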

4. Experiments

Dataset
Comparison methods
- Ensemble approaches for ordinal regression
  - A Simple Approach to Ordinal Regression (ASAOR) [4]
  - Multiclass Ordinal Support Vector Machines (MCOSvm) [5]
  - Ordinal Regression Boosting (ORBoost) [6]
- Threshold models for ordinal regression
  - Ordinal Neural Network (oNN) [7]
  - Support Vector Ordinal Regression (SVOR) [8], [9]
  - Gaussian Processes for Ordinal Regression (GPOR) [10]
Results
- Comparison of NCLOR-AT to other methods
  - Measure \[\small \text{Accuracy} = \frac 1N \sum_{i=1}^N I(\hat{y}_i=y_i)\]
- Comparison of NCLOR and regression neural network model on synthetic toy dataset
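
The accuracy measure used in the comparison is the fraction of patterns whose predicted rank equals the true rank. A small sketch, where the function name and the label values are illustrative assumptions rather than results from the paper:

```python
import numpy as np

def accuracy(y_true, y_pred):
    """Mean of the indicator I(y_pred_i == y_true_i)."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    return float(np.mean(y_true == y_pred))

# Three of the four made-up predictions match the true ranks
score = accuracy([1, 2, 3, 4], [1, 2, 4, 4])
```

This measure counts every misclassification equally, so it ignores how far a predicted rank is from the true one.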

5. References

[1] T.-S. Lim, W.-Y. Loh, and Y.-S. Shih, “A comparison of prediction accuracy, complexity, and training time of thirty-three old and new classification algorithms,” Mach. Learn., vol. 40, no. 3, pp. 203–228, 2000.
[2] D. Kibler, D. W. Aha, and M. K. Albert, “Instance-based prediction of real-valued attributes,” Comput. Intell., vol. 5, no. 2, pp. 51–57, 1989.
[3] K.-J. Kim and H. Ahn, “A corporate credit rating model using multi-class support vector machines with an ordinal pairwise partitioning approach,” Comput. Oper. Res., vol. 39, no. 8, pp. 1800–1811, 2012.
[4] E. Frank and M. Hall, “A simple approach to ordinal classification,” in Proc. ECML, 2001, pp. 145–156.
[5] W. Waegeman and L. Boullart, “An ensemble of weighted support vector machines for ordinal regression,” Int. J. Comput. Syst. Sci. Eng., vol. 3, no. 1, pp. 47–51, 2009.
[6] H.-T. Lin and L. Li, “Large-margin thresholded ensembles for ordinal regression: Theory and practice,” in Proc. 17th Int. Conf. Algorithmic Learn. Theory, 2006, pp. 319–333.
[7] J. S. Cardoso and J. F. Pinto da Costa, “Learning to classify ordinal data: The data replication method,” J. Mach. Learn. Res., vol. 8, pp. 1393–1429, Sep. 2007.
[8] W. Chu and S. S. Keerthi, “Support vector ordinal regression,” Neural Comput., vol. 19, no. 3, pp. 792–815, 2007.
[9] W. Chu and S. S. Keerthi, “New approaches to support vector ordinal regression,” in Proc. 22nd Int. Conf. Mach. Learn., 2005, pp. 145–152.
[10] W. Chu and Z. Ghahramani, “Gaussian processes for ordinal regression,” J. Mach. Learn. Res., vol. 6, pp. 1019–1041, Jan. 2005.

Negative Correlation Ensemble Learning for Ordinal Regression

by Stoney

2020-04-18
