The Background

Can we identify likely credit card fraud before a consumer complains? This is the question we seek to answer using data from a set of European credit card transactions. The dataset contains two days’ worth of data along with a flag indicating whether a given transaction was fraudulent. Most columns of interest have been anonymized through the use of Principal Component Analysis. (PCA is a technique frequently used to mitigate feature collinearity and to aid in dimension reduction; leveraging PCA to preserve confidentiality was a clever twist.)
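To make the anonymization idea concrete, here is a minimal sketch on simulated data. The variables `income` and `spend` are hypothetical and have nothing to do with the real dataset. It suggests why the original columns can't simply be read off the components: each component is a centered mixture of all the inputs, and the components are uncorrelated with one another.

```r
# Minimal sketch: PCA on two correlated, made-up features.
set.seed(1)
income <- rnorm(500, mean = 50, sd = 10)      # hypothetical feature
spend  <- 0.6 * income + rnorm(500, sd = 4)   # correlated with income
toy    <- data.frame(income, spend)

pca <- prcomp(toy, center = TRUE, scale. = FALSE)

# Each component is a weighted mix of both inputs.
pca$rotation

# Component scores have mean ~0 and are uncorrelated with each other.
pc_means <- colMeans(pca$x)
pc_cor   <- cor(pca$x)[1, 2]
round(pc_means, 10)
round(pc_cor, 10)

# Standard deviations come out sorted from largest to smallest.
pca$sdev
```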

The Data

The data contain 284,807 observations of 31 variables (Time from first record, Transaction Amount, 28 Principal Components, and a Fraud Indicator).

library(data.table)
cred <- fread('creditcard.csv')
str(cred)
## Classes 'data.table' and 'data.frame':   284807 obs. of  31 variables:
##  $ Time  : num  0 0 1 1 2 2 4 7 7 9 ...
##  $ V1    : num  -1.36 1.192 -1.358 -0.966 -1.158 ...
##  $ V2    : num  -0.0728 0.2662 -1.3402 -0.1852 0.8777 ...
##  $ V3    : num  2.536 0.166 1.773 1.793 1.549 ...
##  $ V4    : num  1.378 0.448 0.38 -0.863 0.403 ...
##  $ V5    : num  -0.3383 0.06 -0.5032 -0.0103 -0.4072 ...
##  $ V6    : num  0.4624 -0.0824 1.8005 1.2472 0.0959 ...
##  $ V7    : num  0.2396 -0.0788 0.7915 0.2376 0.5929 ...
##  $ V8    : num  0.0987 0.0851 0.2477 0.3774 -0.2705 ...
##  $ V9    : num  0.364 -0.255 -1.515 -1.387 0.818 ...
##  $ V10   : num  0.0908 -0.167 0.2076 -0.055 0.7531 ...
##  $ V11   : num  -0.552 1.613 0.625 -0.226 -0.823 ...
##  $ V12   : num  -0.6178 1.0652 0.0661 0.1782 0.5382 ...
##  $ V13   : num  -0.991 0.489 0.717 0.508 1.346 ...
##  $ V14   : num  -0.311 -0.144 -0.166 -0.288 -1.12 ...
##  $ V15   : num  1.468 0.636 2.346 -0.631 0.175 ...
##  $ V16   : num  -0.47 0.464 -2.89 -1.06 -0.451 ...
##  $ V17   : num  0.208 -0.115 1.11 -0.684 -0.237 ...
##  $ V18   : num  0.0258 -0.1834 -0.1214 1.9658 -0.0382 ...
##  $ V19   : num  0.404 -0.146 -2.262 -1.233 0.803 ...
##  $ V20   : num  0.2514 -0.0691 0.525 -0.208 0.4085 ...
##  $ V21   : num  -0.01831 -0.22578 0.248 -0.1083 -0.00943 ...
##  $ V22   : num  0.27784 -0.63867 0.77168 0.00527 0.79828 ...
##  $ V23   : num  -0.11 0.101 0.909 -0.19 -0.137 ...
##  $ V24   : num  0.0669 -0.3398 -0.6893 -1.1756 0.1413 ...
##  $ V25   : num  0.129 0.167 -0.328 0.647 -0.206 ...
##  $ V26   : num  -0.189 0.126 -0.139 -0.222 0.502 ...
##  $ V27   : num  0.13356 -0.00898 -0.05535 0.06272 0.21942 ...
##  $ V28   : num  -0.0211 0.0147 -0.0598 0.0615 0.2152 ...
##  $ Amount: num  149.62 2.69 378.66 123.5 69.99 ...
##  $ Class : int  0 0 0 0 0 0 0 0 0 0 ...
##  - attr(*, ".internal.selfref")=<externalptr>

There are no missing values, which is consistent with the idea that the data were preprocessed to ensure confidentiality. (Had any values been missing, we would have needed deeper context to choose an appropriate imputation method.)

sum(is.na(cred)) #how many missing values
## [1] 0

It’s helpful to take a peek at the mean and standard deviation of the numeric variables to make sure things look okay there.

aperm(as.matrix(vapply(cred[,-31]
       , function(x){
            c(mean = round(mean(x),5)
              ,sd = round(sd(x),5))
            }
       , numeric(2))))
##               mean          sd
## Time   94813.85958 47488.14595
## V1         0.00000     1.95870
## V2         0.00000     1.65131
## V3         0.00000     1.51626
## V4         0.00000     1.41587
## V5         0.00000     1.38025
## V6         0.00000     1.33227
## V7         0.00000     1.23709
## V8         0.00000     1.19435
## V9         0.00000     1.09863
## V10        0.00000     1.08885
## V11        0.00000     1.02071
## V12        0.00000     0.99920
## V13        0.00000     0.99527
## V14        0.00000     0.95860
## V15        0.00000     0.91532
## V16        0.00000     0.87625
## V17        0.00000     0.84934
## V18        0.00000     0.83818
## V19        0.00000     0.81404
## V20        0.00000     0.77093
## V21        0.00000     0.73452
## V22        0.00000     0.72570
## V23        0.00000     0.62446
## V24        0.00000     0.60565
## V25        0.00000     0.52128
## V26        0.00000     0.48223
## V27        0.00000     0.40363
## V28        0.00000     0.33008
## Amount    88.34962   250.12011

It’s comforting to note that the PCA appears to have been conducted on centered data. (Centering is generally considered a crucial pre-processing step for PCA.) We may need to do something about the time variable, as I’m not sure it’s particularly useful in its current form.

summary(cred$Time, digits = 9)
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
##      0.00  54201.50  84692.00  94813.86 139320.50 172792.00

The relevance of the time variable is hampered by the lack of an explicit reference value. It’s plausible that a time of 0 corresponds to midnight since 172,800 is the number of seconds in 2 calendar days, but it’s possible that the data simply span some arbitrary 48-hour period. We’ll work from the former premise to transform the data into an hour-of-day variable, but we should keep in mind that we lack certainty of context here. I’ll probably leave this variable out of my model fits to be safe.

cred$hod = (cred$Time %% (60*60*24)) / 60 / 60
summary(cred$hod) #verify new hour-of-day variable computed correctly
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00   10.60   15.01   14.54   19.33   24.00
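As a quick check on the arithmetic (a sketch that uses only the maximum `Time` value from the earlier summary; `to_hod` is a helper name I made up), values this close to midnight land just under 24 and display as 24.00 after rounding.

```r
# Convert seconds since the first record to hour of day.
secs_per_day <- 60 * 60 * 24
to_hod <- function(secs) (secs %% secs_per_day) / 3600

hod_last  <- to_hod(172792)   # largest Time value in the summary
hod_noon2 <- to_hod(129600)   # noon on day two, if Time 0 is midnight
hod_last
hod_noon2
```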

Finally, let’s take a look at the prevalence of fraud in our dataset.

cred$Class = factor(cred$Class, levels = 0:1, labels = c('legit', 'FRAUD'))
table(cred$Class) #raw counts
## 
##  legit  FRAUD 
## 284315    492
table(cred$Class)/length(cred$Class) #proportions
## 
##       legit       FRAUD 
## 0.998272514 0.001727486

The low proportion of fraudulent transactions is striking and has important ramifications for our model development.

Analysis

Preparatory Considerations

Any analysis of the fraud dataset needs to reckon with the binary class imbalance within our dataset because it impacts model evaluation and may even affect model construction. The model evaluation concern is relatively straightforward: if frauds rarely occur, then any model that rarely predicts ‘Fraud’ can appear successful. (If a model always predicted ‘legit’, then it’d be right 99.8% of the time). This evaluative vulnerability is present in traditional measures such as ‘Accuracy’ or ‘ROC’, so it’s safer to rely on a ‘Precision-Recall’ combination. In the context of our problem, we’ll define Precision to be the proportion of predicted frauds that are actually fraudulent and Recall to be the proportion of actual frauds that are correctly predicted. A successful model should possess both high precision and high recall. In terms of model construction, we’ll need to keep in mind that class imbalance may adversely affect some model-building algorithms.
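Here is a small worked example of those definitions. The class counts are the ones reported above; the second classifier and its true/false positive counts are made up purely for illustration.

```r
# Class counts reported earlier in this post.
n_legit <- 284315
n_fraud <- 492

# Baseline: a "model" that always predicts legit.
acc_always_legit    <- n_legit / (n_legit + n_fraud)
recall_always_legit <- 0 / n_fraud   # it never catches a fraud

# Hypothetical classifier (made-up numbers): flags 500 transactions,
# 400 of which are truly fraudulent.
tp <- 400            # frauds correctly flagged
fp <- 100            # legit transactions wrongly flagged
fn <- n_fraud - tp   # frauds missed

precision <- tp / (tp + fp)   # share of flagged transactions that are fraud
recall    <- tp / (tp + fn)   # share of frauds that were flagged
accuracy  <- (n_legit - fp + tp) / (n_legit + n_fraud)

round(c(acc_always_legit = acc_always_legit, accuracy = accuracy,
        precision = precision, recall = recall), 4)
```

The two accuracy figures are nearly indistinguishable even though one model catches no fraud at all, which is the reason accuracy is a poor yardstick here.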

Let’s start building predictive models. We’ll begin by designating a holdout dataset to mitigate some of the bias in our final model evaluations.

library(caret) #createDataPartition, createFolds, trainControl, train
set.seed(100)
cred_train_init = createDataPartition(cred$Class, p = 0.8, list = FALSE)
cred_traini = cred[cred_train_init,]
cred_testi = cred[-cred_train_init,]
rm(cred); gc() #optional RAM consideration
##            used (Mb) gc trigger  (Mb) max used  (Mb)
## Ncells  1703376 91.0    2637877 140.9  2164898 115.7
## Vcells 11286956 86.2   33603935 256.4 33475627 255.4

We can proceed directly to model-fitting at this point, but the computational limitations of my machine recommend further treatment of the training data. (It would take multiple hours for a single model to be fit to the training data on my machine, and I intend to fit over a dozen.) How can we increase the model-fitting efficiency? I’m inclined to try subsampling.

Subsampling is a general technique that addresses class imbalance by subsetting the majority class, augmenting the minority class, or both. Subsampling is capable of alleviating some of the concerns I mentioned just a few paragraphs ago, but my use of it is motivated by a desire for computational savings. Subsampling does carry some risk, but I believe it’s conducive to the core intentions of this analysis. I’ve opted for the SMOTE hybrid subsampling method, which combines majority subsetting with the synthesis of new minority cases (each one interpolated between an existing minority case and one of its nearest neighbors). The SMOTE approach preserves more of the majority class than straight downsampling, but it will still result in a significant reduction of the training set.
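For intuition, here is a rough base R sketch of the interpolation step at the heart of SMOTE. It is a simplified illustration on made-up data, and `synthesize_one` is a hypothetical helper rather than anything from the DMwR package.

```r
# Sketch of SMOTE's core step: synthesize a minority point by
# interpolating between a sample and one of its nearest neighbours.
set.seed(2)
minority <- matrix(rnorm(20), ncol = 2)   # 10 made-up minority rows

synthesize_one <- function(x, i, k = 3) {
  # Euclidean distance from row i to every row (including itself).
  d <- sqrt(rowSums((x - matrix(x[i, ], nrow(x), ncol(x), byrow = TRUE))^2))
  neighbours <- order(d)[2:(k + 1)]     # skip the point itself
  j   <- neighbours[sample.int(k, 1)]   # pick one neighbour at random
  gap <- runif(1)                       # position along the segment
  x[i, ] + gap * (x[j, ] - x[i, ])
}

new_point <- synthesize_one(minority, i = 1)
new_point
```

Because the new point sits on the line segment between two real minority cases, it stays inside the region the minority class already occupies.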

library(DMwR) #SMOTE
set.seed(200)
cred_smote = SMOTE(Class ~ ., data = cred_traini)
table(cred_smote$Class)
## 
## legit FRAUD 
##  1576  1182
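These counts can be reconstructed with a little arithmetic, assuming what I understand to be the `DMwR::SMOTE` defaults (`perc.over = 200`, `perc.under = 200`). The figure of 394 training frauds is an assumption inferred from the output (1182 / 3), and the training-set size is approximated as 80% of the full data.

```r
# Assumed inputs (see the note above): 394 frauds in the training
# split, and SMOTE defaults of perc.over = 200, perc.under = 200.
n_fraud_train <- 394
perc_over  <- 200
perc_under <- 200

# perc.over = 200 creates 2 synthetic frauds per original fraud.
n_synthetic <- n_fraud_train * perc_over / 100
n_fraud_out <- n_fraud_train + n_synthetic

# perc.under = 200 keeps 2 legit rows per synthetic fraud.
n_legit_out <- n_synthetic * perc_under / 100

c(legit = n_legit_out, FRAUD = n_fraud_out)

# Approximate size reduction relative to an 80% training split.
n_train_approx <- round(0.8 * 284807)
reduction <- 1 - (n_legit_out + n_fraud_out) / n_train_approx
round(100 * reduction, 1)
```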

The balance within the training set has been drastically improved and the size of the training set has been reduced by nearly 99%! Now, let’s build some models!

Model-Building

I want to fit multiple models, so let’s go ahead and create some static parameters to help stabilize any model comparisons. I want to train the models using 5-fold cross-validation, and I plan to evaluate the models using the area under the Precision-Recall curve.

set.seed(300)
credsm_folds = createFolds(cred_smote$Class, k = 5
                           , returnTrain = TRUE) #index expects training rows
credsm_control = trainControl(
     method = 'cv'
     , summaryFunction = prSummary #precision-recall
     , classProbs = TRUE
     , verboseIter = FALSE #muted for publication
     , savePredictions = TRUE
     , index = credsm_folds
     )
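For reference, here is a base R sketch of what 5-fold indices look like. The fold assignment is a simplified stand-in, not caret's stratified algorithm. As I understand the caret documentation, the `index` argument of `trainControl` expects the rows used for fitting in each resample, so the `train_rows` list below has the shape it needs.

```r
# Sketch: build 5 folds by hand for n rows.
set.seed(300)
n <- 20
fold_id <- sample(rep(1:5, length.out = n))   # random fold label per row

held_out   <- split(seq_len(n), fold_id)      # rows assessed in each fold
train_rows <- lapply(held_out, function(idx) setdiff(seq_len(n), idx))

sapply(held_out, length)     # rows held out per fold
sapply(train_rows, length)   # rows used for fitting per fold
```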

I’ll begin by fitting an elastic net model (i.e., regularized regression) to see how it performs. These models are typically fairly zippy, and they lend themselves to straightforward interpretations. Interpretability is admittedly less relevant here, since the undisclosed PCA obscures the meaning of our features. Still, elastic net seems worth a try.

glmn_sm1 = train(Class ~ . -Time -hod, data = cred_smote
                 , method = 'glmnet', metric = 'AUC'
                 , trControl = credsm_control)
plot(glmn_sm1)

I was actually a little startled by how effective that model turned out to be. It appears that our data lend themselves to strong classifications, but we’ll have to wait until we test the models on the holdout data to be sure.

I’m going to tweak the elastic net model with custom settings to probe the margins a bit, but the initial model has already set a high bar.

glmn_sm2 = train(Class ~ . -Time -hod, data = cred_smote
                 , method = 'glmnet', metric = 'AUC'
                 , tuneGrid = expand.grid(
                      lambda = seq(0.0001, .1, length = 50), alpha = c(0, 0.55, 1))
                 , trControl = credsm_control)
plot(glmn_sm2)
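The three `alpha` values in that grid correspond to ridge (0), a blend (0.55), and lasso (1). Here is a small sketch of the penalty term as glmnet defines it, evaluated on made-up coefficients; `enet_penalty` is a helper name I invented for illustration.

```r
# glmnet-style elastic net penalty for a coefficient vector beta:
#   lambda * ((1 - alpha) / 2 * sum(beta^2) + alpha * sum(abs(beta)))
enet_penalty <- function(beta, lambda, alpha) {
  lambda * ((1 - alpha) / 2 * sum(beta^2) + alpha * sum(abs(beta)))
}

beta <- c(0.5, -2, 0)   # made-up coefficients for illustration
pen <- sapply(c(ridge = 0, blend = 0.55, lasso = 1),
              function(a) enet_penalty(beta, lambda = 0.1, alpha = a))
pen
```

The ridge end penalizes large coefficients heavily, while the lasso end penalizes all nonzero coefficients in proportion to their size, which is what lets it shrink some of them exactly to zero.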

Now it’s time to give random forests a look. These models, crafted using bootstrap aggregation with decision trees, are a personal favorite of mine and a common choice in general.
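Before fitting the real thing, here is a toy sketch of bootstrap aggregation in base R. It uses a deliberately crude base learner (a single split at the median) on made-up data. Actual random forests grow full trees and also sample a subset of variables at each split, which this sketch leaves out.

```r
# Sketch of bootstrap aggregation ("bagging") with a very simple
# base learner: the mean of y on each side of the median of x.
set.seed(400)
x <- runif(200)
y <- as.numeric(x > 0.5) + rnorm(200, sd = 0.3)   # made-up data

fit_stump <- function(x, y, split = median(x)) {
  list(split = split, left = mean(y[x <= split]), right = mean(y[x > split]))
}
predict_stump <- function(fit, x) ifelse(x <= fit$split, fit$left, fit$right)

n_boot <- 50
preds <- sapply(seq_len(n_boot), function(b) {
  idx <- sample.int(length(x), replace = TRUE)   # bootstrap resample
  predict_stump(fit_stump(x[idx], y[idx]), c(0.1, 0.9))
})

bagged <- rowMeans(preds)   # aggregate by averaging over resamples
bagged
```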

set.seed(400)
rf_sm1 = train(Class ~ . -Time -hod, data = cred_smote
               , method = 'ranger', metric = 'AUC'
               , trControl = credsm_control
               , importance = 'impurity')
plot(rf_sm1)

Another strong showing. While we’re here, let’s go ahead and take a look at which variables appear to have strong predictive value. A friend of mine was a big fan of using random forests to decide which variables are important, and I tend to admire that approach.

varImp(rf_sm1) #which variables are important?
## ranger variable importance
## 
##   only 20 most important variables shown (out of 29)
## 
##     Overall
## V14 100.000
## V10  73.565
## V17  71.655
## V12  65.300
## V4   56.968
## V11  55.583
## V16  38.140
## V3   36.201
## V2   34.319
## V7   31.877
## V9   24.854
## V21  19.758
## V27  17.921
## V18  16.956
## V5   13.617
## V1   10.207
## V19  10.148
## V6    9.907
## V8    7.875
## V20   5.981

I’ve created some quick exploratory graphs that visualize the relationship between the most ‘important’ variables and the fraud classification. I’ll include the link to these graphs in the End Notes in case you’d like to review them.

Let’s go ahead and fit some random forests using small variable batches at each node.

set.seed(400)
rf_sm2 = train(Class ~ . -Time -hod, data = cred_smote
               , method = 'ranger', metric = 'AUC'
               , tuneGrid = expand.grid(mtry = 2:7
                                        , splitrule = 'gini'
                                        , min.node.size = 1) #recent caret versions need all 3 columns
               , trControl = credsm_control
               , importance = 'impurity')
plot(rf_sm2)

‘Boosting’ is another famous model-building approach. Boosting attempts to combine individually weak models into a collectively strong model, typically by having each new model focus on the errors of the ones before it. It’s a clever approach that generally has impressive results.
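To show the idea in miniature, here is a base R sketch of gradient boosting for squared error, with one-split stumps as the weak learners. The data are simulated, and this is only an illustration of the principle, not what the gbm or xgboost packages do internally.

```r
# Sketch of gradient boosting for squared error with one-split
# "stumps" as the weak learners (toy data, base R only).
set.seed(400)
x <- seq(0, 1, length.out = 100)
y <- sin(2 * pi * x) + rnorm(100, sd = 0.1)

# Weak learner: best single split on x for the current residuals.
fit_stump <- function(x, r) {
  cuts <- head(sort(unique(x)), -1)   # keep both sides non-empty
  sse <- sapply(cuts, function(cp) {
    lft <- r[x <= cp]
    rgt <- r[x > cp]
    sum((lft - mean(lft))^2) + sum((rgt - mean(rgt))^2)
  })
  cp <- cuts[which.min(sse)]
  list(cp = cp, left = mean(r[x <= cp]), right = mean(r[x > cp]))
}
predict_stump <- function(s, x) ifelse(x <= s$cp, s$left, s$right)

shrinkage <- 0.1
pred <- rep(mean(y), length(y))   # start from a constant model
mse_start <- mean((y - pred)^2)

for (m in 1:100) {
  s <- fit_stump(x, y - pred)     # fit the next stump to the residuals
  pred <- pred + shrinkage * predict_stump(s, x)
}
mse_end <- mean((y - pred)^2)
c(start = mse_start, end = mse_end)
```

No single stump can trace a sine wave, but a hundred small corrections stacked together get quite close.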

I’m going to try two separate boosting algorithms (‘Gradient Boosting’ and ‘Extreme Gradient Boosting’). I’ll stick to my recent pattern of an initial model followed by a custom version to probe the fringes. The classic ‘Gradient Boosting’ approach is up first.

set.seed(400)
gbm_sm1 = train(Class ~ . -Time -hod, data = cred_smote
               , method = 'gbm', metric = 'AUC'
               , trControl = credsm_control)
#plot(gbm_sm1)

It’s now old news that all the models appear to be working well. Still, we’ll continue to fit model variants so that we have a wide pool for comparison on the holdout data. I’ll spare you the output from the custom GBM, but the settings are as follows.

set.seed(400)
gbm_sm2 = train(Class ~ . -Time -hod, data = cred_smote
                , method = 'gbm', metric = 'AUC'
                , tuneGrid = 
                     expand.grid(n.trees = 25*(1:10)+50
                                 , interaction.depth = 1:4
                                 , n.minobsinnode = 10
                                 , shrinkage = 0.05)
                , trControl = credsm_control)

Let’s try our hand with ‘Extreme Gradient Boosting’:

set.seed(400)
gbt_sm1 = train(Class ~ . -Time -hod, data = cred_smote
                , method = 'xgbTree', metric = 'AUC'
                , trControl = credsm_control)

Nicely done. Now for the custom version (output redacted):

set.seed(400)
gbt_sm2 = train(Class ~ . -Time -hod, data = cred_smote
                , method = 'xgbTree', metric = 'AUC'
                , tuneGrid = 
                     expand.grid(nrounds = c(20, 35, 50)
                                 , max_depth = c(2, 3, 4)
                                 , eta = c(0.2, 0.4), gamma = 0
                                 , colsample_bytree = c(0.8, 0.6, 0.4)
                                 , min_child_weight = 1
                                 , subsample = c(0.5, 0.75, 1))
                , trControl = credsm_control)

I know what you’re thinking. Aren’t we done? We can definitely stop whenever, but the central goal of this exploration is process overview. In that spirit, I’m going to quickly funnel the four model types we’ve already fit through two additional approaches to see if those models perform better on the holdout data.

The first approach I’m going to apply is feature reduction. Model parsimony is generally a worthy goal as it can help mitigate overfit and reduce data requirements. I’ll go ahead and use the top 6 variables I identified with the initial random forest model along with the ‘Amount’ column to rerun each of the four model types. The code is below (the output is hidden).

###feature-reduction models
cred_smote_top6 = cred_smote[,c('V14', 'V10', 'V17', 'V12', 'V4', 'V11', 'Amount', 'Class')]

glmn_smtop6 = train(Class ~ . #Time & hod already removed
                    , data = cred_smote_top6
                    , method = 'glmnet', metric = 'AUC'
                    , trControl = credsm_control)
#plot(glmn_smtop6)

set.seed(400)
rf_smtop6 = train(Class ~ . #Time & hod already removed
                  , data = cred_smote_top6
                  , method = 'ranger', metric = 'AUC'
                  , trControl = credsm_control
                  , importance = 'impurity')
#plot(rf_smtop6)

set.seed(400)
gbm_smtop6 = train(Class ~ . #Time & hod already removed
                   , data = cred_smote_top6
                   , method = 'gbm', metric = 'AUC'
                   , trControl = credsm_control) 
#plot(gbm_smtop6)

set.seed(400)
gbt_smtop6 = train(Class ~ . #Time & hod already removed
                   , data = cred_smote_top6
                   , method = 'xgbTree', metric = 'AUC'
                   , trControl = credsm_control)
#plot(gbt_smtop6)

The next approach is a bit of a Frankenstein, but I’d like to re-apply PCA to this dataset. There are two reasons: I’d like to see what happens if the Amount column is incorporated into the PCA, and I feel a bit unsettled by the notion that the initial PCA was clandestine.

#pca of pca #already sounds like a bad idea ;-P
cred_sm_pca_itera = prcomp(~ . , center = TRUE, scale. = TRUE #full argument name, no partial matching
                           , data = cred_smote[,-c('Time', 'hod', 'Class')])
cred_sm_pca_iter = data.table(cred_sm_pca_itera$x, cred_smote[,'Class'])

glmn_sm_pcaiter = train(Class ~ . #Time & hod excluded already
                        , data = cred_sm_pca_iter
                        , method = 'glmnet', metric = 'AUC'
                        , trControl = credsm_control)
#plot(glmn_sm_pcaiter)

set.seed(400)
rf_sm_pcaiter = train(Class ~ . #Time & hod excluded already
                      , data = cred_sm_pca_iter
                      , method = 'ranger', metric = 'AUC'
                      , trControl = credsm_control
                      , importance = 'impurity')
#plot(rf_sm_pcaiter)

set.seed(400)
gbm_sm_pcaiter = train(Class ~ . #Time & hod excluded already
                       , data = cred_sm_pca_iter
                       , method = 'gbm', metric = 'AUC'
                       , trControl = credsm_control) 
#plot(gbm_sm_pcaiter)

set.seed(400)
gbt_sm_pcaiter = train(Class ~ . #Time & hod excluded already
                       , data = cred_sm_pca_iter
                       , method = 'xgbTree', metric = 'AUC'
                       , trControl = credsm_control)
#plot(gbt_sm_pcaiter)
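One practical wrinkle with refitting PCA is that holdout rows have to be projected with the centers, scales, and rotation learned from the training rows. Here is a small sketch on made-up data (the columns `a`, `b`, and `c` are hypothetical) showing that the manual projection agrees with `predict()` on a `prcomp` object.

```r
# Sketch: project new rows onto a PCA fitted on training rows.
set.seed(5)
train <- data.frame(a = rnorm(50, 10, 2), b = rnorm(50, 0, 5), c = runif(50))
test  <- data.frame(a = rnorm(5, 10, 2),  b = rnorm(5, 0, 5),  c = runif(5))

pca <- prcomp(train, center = TRUE, scale. = TRUE)

# Manual projection: reuse the TRAINING centers, scales and rotation.
manual <- scale(as.matrix(test), center = pca$center, scale = pca$scale) %*%
  pca$rotation

# predict() on a prcomp object performs the same projection.
auto <- predict(pca, newdata = test)

max_diff <- max(abs(manual - auto))
max_diff
```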

Model-Evaluation

Okay, I recognize that only the bravest of souls has made it this far. We’re basically done. I’m just going to apply each of the 16 models to the holdout data and collect the stats relating to the area under the Precision-Recall Curve (max value of 1). The code is surprisingly tedious for this, but the result will provide a high-level recap of the model performances and whether they held up in the Test Data. Please feel free to skip to the results.

#######EVALUATING the models #PRAUC
smote_models = list(
     glmn_sm1 = glmn_sm1, glmn_sm2 = glmn_sm2
     , glmn_smtop6 = glmn_smtop6, rf_sm1 = rf_sm1
     , rf_sm2 = rf_sm2, rf_smtop6 = rf_smtop6
     , gbm_sm1 = gbm_sm1, gbm_sm2 = gbm_sm2
     , gbm_smtop6 = gbm_smtop6, gbt_sm1 = gbt_sm1
     , gbt_sm2 = gbt_sm2, gbt_smtop6 = gbt_smtop6
     )
sm_pca_iter_models = list(
     glmn_sm_pcaiter = glmn_sm_pcaiter
     , rf_sm_pcaiter = rf_sm_pcaiter
     , gbm_sm_pcaiter = gbm_sm_pcaiter
     , gbt_sm_pcaiter = gbt_sm_pcaiter)

#grabbing training results
smote_resamp = resamples(smote_models)
sm_pca_iter_resamp = resamples(sm_pca_iter_models)

#grabbing test results
library(MLmetrics, quietly = TRUE) #PRAUC
test_pred_smote = lapply(smote_models, predict
                    , newdata = cred_testi
                    , type = 'prob')
test_preds_smote = lapply(test_pred_smote
                     , function(x){x[,'FRAUD']})
test_aucs_smote = sapply(test_preds_smote, PRAUC
                    , cred_testi$Class)

#pca_iter section requires test data transformation
cred_test_pca_iter = 
     scale(cred_testi[,-c('Time', 'hod', 'Class')]
      , center = cred_sm_pca_itera$center
      , scale = cred_sm_pca_itera$scale) %*%
     cred_sm_pca_itera$rotation
test_pred_sm_pca_iter = 
     lapply(sm_pca_iter_models, predict
            , newdata = cred_test_pca_iter
            , type = 'prob')
test_preds_sm_pca_iter = 
     lapply(test_pred_sm_pca_iter
            , function(x){x[,'FRAUD']})
test_aucs_sm_pca_iter = sapply(test_preds_sm_pca_iter
                          , PRAUC, cred_testi$Class)

resamp_AUC_smote = 
     summary(smote_resamp, metric = 'AUC')$statistics$AUC
resamp_AUC_sm_pca_iter =
     summary(sm_pca_iter_resamp, metric = 'AUC')$statistics$AUC

#formatting results
smote_results = 
     data.table(resamp_AUC_smote, test_aucs_smote
                , keep.rownames=TRUE)
sm_pca_iter_res = 
     data.table(resamp_AUC_sm_pca_iter
                , test_aucs_sm_pca_iter
                , keep.rownames = TRUE)

smote_results[,c('1st Qu.','3rd Qu.','Mean', "NA's")] <- NULL
names(smote_results) = 
     c('model'
       , paste0('Train', c('Min', 'Med', 'Max'))
       , 'Test_AUC')

sm_pca_iter_res[,c('1st Qu.','3rd Qu.','Mean', "NA's")] <- NULL
names(sm_pca_iter_res) = 
     c('model'
       , paste0('Train', c('Min', 'Med', 'Max'))
       , 'Test_AUC')
cred_mod_results = rbindlist(list(smote_results, sm_pca_iter_res))
setkey(cred_mod_results, model)
cred_mod_results[,c('TrainRnk','TestRnk'):=
                      .(frank(-TrainMed, ties.method = 'dense')
                        ,frank(-Test_AUC, ties.method = 'dense'))]
library(stringr) #str_pad
cred_mod_results[,TestRank := 
                      paste0(TestRnk,' (trn_rk'
                             , str_pad(TrainRnk, 2, pad = '0'),')')]
cred_mod_results[,c('TestRnk','TrainRnk')] <- NULL
cred_mod_results
##               model TrainMin TrainMed TrainMax  Test_AUC      TestRank
##  1:         gbm_sm1   0.9851   0.9872   0.9902 0.9897626  9 (trn_rk04)
##  2:         gbm_sm2   0.9848   0.9876   0.9910 0.9897431 11 (trn_rk03)
##  3:  gbm_sm_pcaiter   0.9744   0.9806   0.9848 0.9897835  7 (trn_rk13)
##  4:      gbm_smtop6   0.9837   0.9864   0.9869 0.9897993  4 (trn_rk06)
##  5:         gbt_sm1   0.9866   0.9878   0.9901 0.9896969 13 (trn_rk02)
##  6:         gbt_sm2   0.9849   0.9880   0.9911 0.9897102 12 (trn_rk01)
##  7:  gbt_sm_pcaiter   0.9793   0.9833   0.9854 0.9897976  5 (trn_rk12)
##  8:      gbt_smtop6   0.9837   0.9865   0.9874 0.9897484 10 (trn_rk05)
##  9:        glmn_sm1   0.9834   0.9857   0.9870 0.9898586  2 (trn_rk09)
## 10:        glmn_sm2   0.9835   0.9858   0.9868 0.9898577  3 (trn_rk08)
## 11: glmn_sm_pcaiter   0.9833   0.9860   0.9892 0.9904169  1 (trn_rk07)
## 12:     glmn_smtop6   0.9766   0.9841   0.9877 0.9897874  6 (trn_rk10)
## 13:          rf_sm1   0.9822   0.9833   0.9844 0.9896390 15 (trn_rk12)
## 14:          rf_sm2   0.9830   0.9836   0.9848 0.9896318 16 (trn_rk11)
## 15:   rf_sm_pcaiter   0.9715   0.9784   0.9814 0.9897634  8 (trn_rk14)
## 16:       rf_smtop6   0.9169   0.9545   0.9718 0.9896528 14 (trn_rk15)
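If you’re curious what an area under the Precision-Recall curve summarizes, here is a base R sketch of average precision, one common step-wise estimate of that area. The scores and labels are made up, `avg_precision` is a hypothetical helper, and `MLmetrics::PRAUC` may interpolate the curve differently, so treat this as intuition rather than a reproduction of the numbers above.

```r
# Sketch: average precision, a step-wise estimate of the area under
# the precision-recall curve. Inputs are made up for illustration.
avg_precision <- function(score, is_pos) {
  ord  <- order(score, decreasing = TRUE)   # rank by predicted risk
  hits <- is_pos[ord]
  prec_at_k <- cumsum(hits) / seq_along(hits)
  mean(prec_at_k[hits])                     # precision at each true positive
}

score  <- c(0.9, 0.8, 0.7, 0.4, 0.2)
is_pos <- c(TRUE, FALSE, TRUE, FALSE, FALSE)

ap_example <- avg_precision(score, is_pos)   # (1/1 + 2/3) / 2
ap_perfect <- avg_precision(score, c(TRUE, TRUE, FALSE, FALSE, FALSE))
c(example = ap_example, perfect = ap_perfect)
```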

Wrap-Up

The classification methods we used were generally successful on our dataset. While each approach has some intrigue, the relative parity of performance encourages us to choose the model that seems most practical and interpretable. For this reason, I would advocate the use of the elastic net model built on the 6 most ‘important’ principal components and the transaction amount (‘glmn_smtop6’). It seems like it’d provide the most straightforward classifier. It might also enable the initial data providers to combine model insights with their knowledge of the principal components’ construction to develop an even deeper intuition regarding credit card fraud.

Thank you for your interest in this analysis. Have a great day!

End Notes and Data Source:

Some .pdfs containing several very basic exploratory graphs are located at the following link: https://drive.google.com/open?id=0B8cL0MhaQRNYd0huNmNyOU15bFk If you have access to Tableau software, you may find it more convenient to download the packaged workbook containing the graphs: https://drive.google.com/open?id=0B8cL0MhaQRNYSUFtakdxenFxdk0

It’s probably worth pointing out that predictive goals are motivated by the problem’s context. This analysis searched for the ‘closest’ predictive fit because that was consistent with its instructional slant. However, I suspect that many financial institutions would be willing to tolerate increased false positives if they could be assured that fraud was always flagged. Perspective shapes the outcome.
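A toy sketch of that trade-off follows, with made-up scores and labels; `pr_at` is a helper name invented for the example. Lowering the decision threshold flags more transactions, which raises recall at the cost of precision.

```r
# Sketch: lowering the decision threshold trades precision for recall.
# Scores and labels are made up for illustration.
score    <- c(0.95, 0.80, 0.60, 0.45, 0.30, 0.25, 0.10, 0.05)
is_fraud <- c(TRUE, TRUE, FALSE, TRUE, FALSE, FALSE, TRUE, FALSE)

pr_at <- function(threshold) {
  flagged <- score >= threshold
  c(threshold = threshold,
    precision = sum(flagged & is_fraud) / sum(flagged),
    recall    = sum(flagged & is_fraud) / sum(is_fraud))
}

res <- rbind(pr_at(0.5), pr_at(0.08))
res
```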

Data Source (accessed via Kaggle): Andrea Dal Pozzolo, Olivier Caelen, Reid A. Johnson, and Gianluca Bontempi. Calibrating Probability with Undersampling for Unbalanced Classification. In Symposium on Computational Intelligence and Data Mining (CIDM), IEEE, 2015.