1. Introduction

We develop a class of rules spanning the range between quadratic discriminant analysis and naive Bayes through a path of sparse graphical models. A group lasso penalty introduces shrinkage and encourages a common pattern of sparsity across the within-class precision matrices, yielding sparse estimates of interactions and interpretable models.

In the generic classification problem, the outcome of interest \(G\) falls into \(K\) unordered classes, which for convenience we denote by {\(1, 2, ..., K\)}. Our goal is to build a rule for predicting the class label of an item based on measurements of \(p\) features \(X\in R^p\). The training set consists of the class labels and features for \(n\) items. This is an important practical problem with applications in many fields.

2. Classification Methods

a. Quadratic Discriminant Analysis

  • Let \(n_k\) denote the number of samples in Class \(k\), so \(\sum_{k=1}^K n_k=n\), and let \(X^k\) denote the \(n_k\times p\) data matrix for the samples in Class \(k\).

  • A simple rule assigns \(x\) to the class \(\underset{1\leq k \leq K}{\mathrm{arg\,min}}\,d(x,\hat \mu_k)\), where \(\hat \mu_k= \frac{1}{n_k}\sum_{l=1}^{n_k}X_l^k\) is the centroid of Class \(k\).

  • Quadratic Gaussian discriminant functions (QDA): \[\delta_k(x)=\log\pi_k-\frac{1}{2}\log(\det\Sigma_k)-\frac{1}{2}(x-\mu_k)^T\Sigma_k^{-1}(x-\mu_k)\]

  • This is essentially a Bayesian rule: \(\delta_k(x)\) equals the log posterior probability of Class \(k\) up to an additive constant, and \(\pi_k\) is the prior probability of Class \(k\).

  • Discriminant rule: \[k=\underset{1\leq k \leq K}{\mathrm{arg\max}}{\delta_k(x)}\]

  • Often the parameters are unknown but can be estimated from the data.

\[\hat\pi_k=n_k/n\] \[\hat\mu_k=\frac{1}{n_k}\sum_{l=1}^{n_k}X_l^k\] \[\hat\Sigma_k=\frac{1}{n_k-1}\sum_{l=1}^{n_k}(X_l^k-\hat\mu_k)(X_l^k-\hat\mu_k)^T\]

  • In this case, \(\delta_{hl}(x)=\delta_h(x)-\delta_l(x)\) is a quadratic function of \(x\), so the decision boundary between each pair of classes \(h\) and \(l\), \(\{x: \delta_h(x)-\delta_l(x)=0\}\), is quadratic.

This method is called Quadratic Discriminant Analysis (QDA). It is a favored tool when there are strong interactions and linear boundaries cannot separate the classes. However, QDA is poorly posed when any class sample size \(n_k\) is not considerably larger than \(p\), and clearly ill-posed if \(n_k\leq p\) for some class.
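As a concrete illustration, here is a minimal sketch of the plug-in QDA rule built from the estimates above. The helper names (qda_fit, qda_delta, qda_predict) are hypothetical; the experiments below use the qda function from the MASS package instead.

# Minimal plug-in QDA sketch (hypothetical helpers; MASS::qda is used in the experiments)
qda_fit <- function(X, y) {
  y <- as.character(y)
  lapply(sort(unique(y)), function(k) {
    Xk <- as.matrix(X[y == k, , drop = FALSE])
    list(class = k,
         pi    = nrow(Xk) / nrow(X),   # hat pi_k = n_k / n
         mu    = colMeans(Xk),         # hat mu_k
         Sigma = cov(Xk))              # hat Sigma_k, with 1/(n_k - 1) scaling
  })
}

qda_delta <- function(x, fit_k) {      # delta_k(x) from the formula above
  d <- as.numeric(x) - fit_k$mu
  log(fit_k$pi) - 0.5 * determinant(fit_k$Sigma, logarithm = TRUE)$modulus -
    0.5 * drop(t(d) %*% solve(fit_k$Sigma, d))
}

qda_predict <- function(x, fit) {      # arg max_k delta_k(x)
  fit[[which.max(sapply(fit, function(fk) qda_delta(x, fk)))]]$class
}

Note that solve(Sigma, d) fails when \(\hat\Sigma_k\) is singular, which is exactly the ill-posed regime described above.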

This motivates regularizing QDA. Friedman (Regularized Discriminant Analysis) suggests applying a ridge penalty to the within-class covariance matrices, but no elements of the resulting within-class precision matrices are zero, leading to dense interaction terms in the discriminant functions and difficulties in interpretation. At the other extreme is an approach that assumes conditional independence of the features: naive Bayes.

b. Naive Bayes

  • Use Bayes' rule: \(P(Y|X)=\frac{P(Y)P(X|Y)}{P(X)}\)

  • \(k=\underset{1\leq k \leq K}{\mathrm{arg\max}}P(Y=k|X_1=x_1,..., X_p=x_p)=\underset{1\leq k \leq K}{\mathrm{arg\max}}P(Y=k)P(X_1=x_1,..., X_p=x_p|Y=k)\)

    • \(P(Y=k)\): from prior knowledge or estimated from data.

    • \(P(X_1=x_1,..., X_p=x_p|Y=k)\): estimated from data.

  • Often, \(P(Y=k)\) can be easily estimated by \(n_k/n\), but the estimation of \(P(X_1=x_1,..., X_p=x_p|Y=k)\) can be quite hard when \(p\) is not very small (the curse of dimensionality).

  • The simplest approach is to assume that \(X_1, ..., X_p\) are independent within each class (given the class label \(k\)), i.e. \[P(X_1=x_1,..., X_p=x_p|Y=k)=\prod_{j=1}^pP(X_j=x_j|Y=k)\]
    • When \(X_j\) is discrete: \(P(X_j=x_j|Y=k)=\#(X_j=x_j,Y=k)/\#(Y=k)\)

    • When \(X_j\) is continuous, \(P\) is replaced by the density \(f\). We often assume that \(X_j|Y=k\sim N(\mu_j(k),\sigma_j^2(k))\), with estimates \(\hat\mu_j(k)=\frac{1}{n_k}\sum_{l=1}^{n_k}X_{lj}^k\) and \(\hat\sigma_j^2(k)=\frac{1}{n_k-1}\sum_{l=1}^{n_k}(X_{lj}^k-\hat\mu_j(k))^2\).

This is the Naive Bayes Classifier. It often outperforms far more sophisticated alternatives when the dimension \(p\) of the feature space is high, but the conditional independence assumption may be too rigid in the presence of strong interactions.
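To make the Gaussian case concrete, here is a minimal hand-rolled sketch of the rule above under the conditional independence assumption. The helper names (nb_fit, nb_predict) and the small variance floor are hypothetical choices; the experiments below use naiveBayes from the e1071 package.

# Gaussian naive Bayes sketch (hypothetical helpers; e1071::naiveBayes is used later)
nb_fit <- function(X, y) {
  y <- as.character(y)
  lapply(sort(unique(y)), function(k) {
    Xk <- as.matrix(X[y == k, , drop = FALSE])
    list(class = k,
         prior = nrow(Xk) / nrow(X),                 # P(Y = k) estimated by n_k / n
         mu    = colMeans(Xk),                       # per-feature class means
         sd    = pmax(apply(Xk, 2, sd), 1e-6))       # class sds, floored for constant features
  })
}

nb_predict <- function(x, fit) {
  # log P(Y = k) + sum_j log f(x_j | Y = k), using the independence assumption
  scores <- sapply(fit, function(fk)
    log(fk$prior) + sum(dnorm(as.numeric(x), mean = fk$mu, sd = fk$sd, log = TRUE)))
  fit[[which.max(scores)]]$class
}

The variance floor is only there to keep the Gaussian densities finite for features that are constant within a class, which does occur for image data.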

Therefore, we develop a class of rules spanning the range between QDA and naive Bayes using a group lasso penalty. A group lasso penalty is applied to the \((i,j)\)th element across all \(K\) within-class precision matrices, which forces the zeros in the \(K\) estimated precision matrices to occur in the same places. This shared pattern of sparsity results in a sparse estimate of interaction terms, making the classifier easy to interpret. We refer to this classification method as Sparse Quadratic Discriminant Analysis (SQDA).

c. Sparse Quadratic Discriminant Analysis

  • Training data \((x_i,g_i)\in R^p\times \{1, ..., K\}, i=1,2, ..., n\), where the \(x_i\)'s are observations with measurements on a set of \(p\) features and the \(g_i\)'s are class labels.

  • Assume the \(n_k\) observations within the \(k\)th class are identically distributed as \(N(\mu_k, \Sigma_k)\), and that all \(n=\sum_kn_k\) observations are independent.

  • Quadratic discriminant function (\(\pi_k\) denotes the prior for class \(k\)): \[\delta_k(x)=\log\pi_k+\frac{1}{2}\log(\det\Sigma_k^{-1})-\frac{1}{2}(x-\mu_k)^T\Sigma_k^{-1}(x-\mu_k)\]

  • Let \(\Theta^{(k)}=\Sigma_k^{-1}\) denote the true precision matrix for class \(k\), and let \(\theta_{ij}=(\theta_{ij}^{(1)}, ..., \theta_{ij}^{(K)})\) denote the vector of the \((i,j)\)th elements across all \(K\) precision matrices.

  • Estimate \(\mu\), \(\pi\), \(\Theta\) by maximizing the penalized log-likelihood \[\sum_{k=1}^K\Big[\frac{n_k}{2}\log(\det\Theta^{(k)})-\frac{1}{2}\sum_{g_i=k}(x_i-\mu_k)^T\Theta^{(k)}(x_i-\mu_k)+n_k\log(\pi_k)\Big]-\frac{\lambda}{2}\sum_{i\neq j}||\theta_{ij}||_2\]

    • The last term is a group lasso penalty, which forces a common pattern of sparsity across all \(K\) precision matrices.

    • Let \(\Theta=(\Theta^{(1)},...,\Theta^{(K)})\). The estimates are

\[\hat\pi_k=\frac{n_k}{n}\]

\[\hat\mu_k=\frac{1}{n_k}\sum_{g_i=k}x_i\] \[\hat\Theta = \underset{\Theta^{(k)}\succ 0,\,k=1,...,K}{\mathrm{arg\,max}}\sum_{k=1}^Kn_k[\log(\det\Theta^{(k)})-\mathrm{tr}(S^{(k)}\Theta^{(k)})]-\lambda\sum_{i\neq j}||\theta_{ij}||_2\]

where \(S^{(k)}=\frac{1}{n_k}\sum_{g_i=k}(x_i-\hat\mu_k)(x_i-\hat\mu_k)^T\) is the sample covariance matrix for class \(k\).

This model has several important features:

  • The tuning parameter \(\lambda\geq 0\) controls the simultaneous shrinkage of all precision matrices toward diagonal matrices.

    • The value \(\lambda=0\) gives rise to QDA.

    • The value \(\lambda=\infty\) yields the Naive Bayes Classifier.

    • Intermediate values \(0<\lambda<\infty\) represent intermediate degrees of regularization (SQDA).

  • Interpretability: The group lasso penalty forces \(\hat\Theta^{(1)}, ..., \hat\Theta^{(K)}\) to share the locations of the nonzero elements, leading to a sparse estimate of interactions.

  • The optimization problem for \(\hat\Theta\) can be solved with the R package JGL and can be separated into independent subproblems of the same form by simple screening rules on \(S^{(1)}, ..., S^{(K)}\), which leads to a potentially massive reduction in computational complexity.
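A minimal sketch of the \(\hat\Theta\) step with the JGL package is given below. It assumes that JGL's group penalty (the lambda2 argument) plays the role of the group lasso term above with the element-wise lasso weight lambda1 set to 0, and it ignores the exact scaling of \(\lambda\), class weighting, and cross-validation, so it should be read as illustrative rather than as the exact fitting procedure. The helper name fit_sqda_theta is hypothetical.

# Sketch: joint estimation of the K precision matrices with the JGL package.
# Illustrative only: assumes penalty = "group" with lambda1 = 0 matches the
# group lasso term above; the exact lambda scaling and class weighting of
# JGL() may differ from the n_k-weighted objective above.
library(JGL)

fit_sqda_theta <- function(X, y, lambda) {
  y <- as.character(y)
  # One centered data matrix per class; JGL() takes a list of data matrices
  Ylist <- lapply(sort(unique(y)), function(k)
    scale(as.matrix(X[y == k, , drop = FALSE]), center = TRUE, scale = FALSE))
  fit <- JGL(Y = Ylist, penalty = "group",
             lambda1 = 0,             # no element-wise l1 penalty
             lambda2 = lambda,        # group lasso penalty across classes
             return.whole.theta = TRUE)
  fit$theta                           # list of K estimated precision matrices
}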

3. Experimental Analysis

Handwritten digit recognition examples

Here we focus on the sub-task of distinguishing handwritten \(3s\) and \(8s\). We use the same data as Le Cun et al., who normalized binary images for size and orientation, resulting in \(8\)-bit, \(16\times 16\) gray scale images. There are \(658\) threes and \(542\) eights in our training set, and \(166\) test samples for each. Figure 1 shows a random selection of \(3s\) and \(8s\).

train = read.csv("zip.train.txt", header = F, sep = '') # 7291 training images
test = read.csv("zip.test.txt", header = F, sep = '')   # 2007 test images

dim(train[which(train[,1]==3),]); dim(train[which(train[,1]==8),])
## [1] 658 257
## [1] 542 257
dim(test[which(test[,1]==3),]); dim(test[which(test[,1]==8),])
## [1] 166 257
## [1] 166 257
train38 = train[which(train[,1]==3 | train[,1]==8), ] 
test38 = test[which(test[,1]==3 | test[,1]==8), ]
dim(train38); dim(test38)
## [1] 1200  257
## [1] 332 257

The following function takes a row of the data (which corresponds to a single image) and displays it. The argument \(n\) must be between \(1\) and \(1200\). We can change train38 to test38 inside the function to look at test images. The true digit label is printed as the title. Note that each row of the training data is filled column-wise into the matrix im.

show_an_image = function(n)
{
  v  = as.numeric(train38[n,2:257])  # pixel values of the n-th image
  im = matrix(v,16,16)               # fill column-wise into a 16 x 16 matrix
  im = im[,nrow(im):1]               # reverse columns so the digit displays upright
  image(im, col = gray((0:255)/255), main = train38[n,1])  # title shows the true label
}

set.seed(523)
par(mfrow = c(2,5))
for (i in sample(1:1200, 10)){
  show_an_image(i)
}

Figure 1: Examples of digitized handwritten \(3s\) and \(8s\). Each image is an \(8\)-bit, \(16\times 16\) gray scale version of the original binary image.

Because of their spatial arrangement, the features are highly correlated, and some kind of smoothing or filtering always helps. We filtered the data by replacing each non-overlapping \(2\times 2\) pixel block with its average. This reduces the dimension of the feature space from \(256\) to \(64\).

# Averages each non-overlapping 2x2 pixel block
filterdigit <- function(x)
{
  x = matrix(x,16,16)          # reshape the 256 pixels into 16 x 16
  evens = 2 * (1:8)
  x = x[evens,] + x[-evens,]   # sum adjacent rows    -> 8 x 16
  x = x[,evens] + x[,-evens]   # sum adjacent columns -> 8 x 8
  as.vector(x)/4               # average of each 2x2 block, flattened to length 64
}

# Filter data
xtrain = as.data.frame(t(apply(train38[,-1],1,filterdigit)))
xtest = as.data.frame(t(apply(test38[,-1],1,filterdigit)))
dim(xtrain); dim(xtest)
## [1] 1200   64
## [1] 332  64
ytrain = factor(train38[,1])
ytest = factor(test38[,1])

We compare our method with QDA, naive Bayes and a variant of RDA which we refer to as diagonal regularized discriminant analysis (DRDA). The estimated covariance matrix of DRDA for class \(k\) is

\(\hat\Sigma^{(k)}(\lambda)=(1-\lambda)S^{(k)}+\lambda\,\mathrm{diag}(S^{(k)})\), with \(\lambda\in[0,1]\).

The tuning parameters are selected by 5-fold cross-validation. Table 1 shows the misclassification errors of each method and the corresponding standardized tuning parameters \(s=P(\hat\Theta(\lambda))/P(\hat\Theta(0))\), where \(P(\Theta) = \sum_{i\neq j}||(\theta_{ij}^{(1)},..., \theta_{ij}^{(K)})||_2\).
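DRDA is not provided by the packages used below, so here is a minimal sketch of its shrunken covariance estimate. The helper name drda_fit is hypothetical, scoring can reuse qda_delta from the QDA sketch in Section 2, and the selection of \(\lambda\) by 5-fold cross-validation is omitted.

# DRDA sketch: shrink each within-class covariance toward its diagonal
# (hypothetical helper; lambda in [0, 1] would be chosen by 5-fold CV,
#  and classification reuses qda_delta() from the earlier QDA sketch)
drda_fit <- function(X, y, lambda) {
  y <- as.character(y)
  lapply(sort(unique(y)), function(k) {
    Xk <- as.matrix(X[y == k, , drop = FALSE])
    Sk <- cov(Xk)
    list(class = k,
         pi    = nrow(Xk) / nrow(X),
         mu    = colMeans(Xk),
         Sigma = (1 - lambda) * Sk + lambda * diag(diag(Sk)))  # hat Sigma^(k)(lambda)
  })
}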

# Naive Bayes
library(e1071)
set.seed(1)
MNIST.nb <- naiveBayes(xtrain, ytrain)
label.train <- predict(MNIST.nb, newdata=xtrain)
table(label.train, ytrain) # confusion matrix
##            ytrain
## label.train   3   8
##           3 528  14
##           8 130 528
sum(label.train != ytrain) / length(label.train) # training error
## [1] 0.12
label.test <- predict(MNIST.nb, newdata=xtest)
table(label.test, ytest) # confusion matrix
##           ytest
## label.test   3   8
##          3 123  10
##          8  43 156
sum(label.test != ytest) / length(label.test) # test error 
## [1] 0.1596386
# QDA
library(MASS)
MNIST.qda <- qda(xtrain, ytrain)
y.hat <- predict(MNIST.qda, xtrain)$class
confmat <- table(y.hat, ytrain) # confusion matrix
print(confmat)
##      ytrain
## y.hat   3   8
##     3 639   7
##     8  19 535
sum(y.hat != ytrain) / length(ytrain) # train error
## [1] 0.02166667
y.hat <- predict(MNIST.qda, xtest)$class
confmat <- table(y.hat, ytest) # confusion matrix
print(confmat)
##      ytest
## y.hat   3   8
##     3 157  12
##     8   9 154
sum(y.hat != ytest) / length(ytest) # test error
## [1] 0.06325301
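For completeness, here is a hedged sketch of how SQDA itself might be applied to the filtered digits, reusing fit_sqda_theta from the earlier sketch. The value of \(\lambda\) is fixed for illustration; in the actual experiments it is chosen by 5-fold cross-validation, so this code should not be taken as the exact procedure behind Table 1.

# SQDA sketch on the filtered digits (illustrative; lambda fixed rather than tuned by CV)
sqda_predict <- function(Xnew, X, y, lambda) {
  y       <- as.character(y)
  classes <- sort(unique(y))
  theta   <- fit_sqda_theta(X, y, lambda)               # K estimated precision matrices
  scores  <- sapply(seq_along(classes), function(k) {
    Xk <- as.matrix(X[y == classes[k], , drop = FALSE])
    D  <- sweep(as.matrix(Xnew), 2, colMeans(Xk))       # center test points at hat mu_k
    log(nrow(Xk) / nrow(X)) +
      0.5 * determinant(theta[[k]], logarithm = TRUE)$modulus -
      0.5 * rowSums((D %*% theta[[k]]) * D)             # delta_k(x) with Theta = Sigma^{-1}
  })
  classes[max.col(scores)]                              # arg max_k delta_k(x) per test point
}

# e.g. yhat <- sqda_predict(xtest, xtrain, ytrain, lambda = 0.1)
#      mean(yhat != ytest)   # test misclassification error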

4. Conclusion

The results show that SQDA outperforms DRDA, QDA, and naive Bayes on this example.

Figure 2: The 5-fold cross-validation errors (blue) and the test errors (red) of SQDA and DRDA on \(3s\) and \(8s\).

Figure 3: Heat maps of the sample precision matrices and the estimated precision matrices of SQDA and DRDA. Estimates are standardized to have unit diagonal. The first row corresponds to the precision matrices of the \(3s\) and the second row to those of the \(8s\).

The estimated precision matrices of SQDA are often block diagonal under a suitable ordering of the features, which implies that the features can be partitioned into several communities that are mutually independent within each class. This idea can be generalized to non-Gaussian data and other classification models, and we refer to it as the community Bayes model. For more details, see the paper "Sparse Quadratic Discriminant Analysis and Community Bayes".