Abstract

Early diagnosis of any disease plays a critical role in the successful treatment of patients. Every year, thousands of patients are diagnosed with breast cancer. Although large amounts of clinical data are collected and stored every year, only a small subset of predictive factors is used in determining outcomes.

In this project, we use a data mining approach to diagnose breast cancer. The data-driven approach used here can efficiently process clinical datasets to discover patterns and reveal hidden information for the early detection and successful treatment of breast cancer patients.

Introduction and Background

Breast cancer is one of the most severe cancers, taking hundreds of thousands of lives every year. Early prediction of breast cancer plays an important role in successful treatment and in saving the lives of thousands of patients every year. However, conventional approaches are limited in providing such capability. Recent breakthroughs in data analytics and data mining techniques have opened a new door for healthcare diagnosis and prediction.

Over the past decades, medical records and clinical data have been collected and stored in electronic databases. Both the government and other public organizations have pushed the technology toward transparency by making massively stored data usable, searchable, and actionable. Despite the massive healthcare databases available, only a small part of the data has been used by domain experts for diagnosing and curing diseases. This is because the massive healthcare data is too complex and voluminous to be processed and analyzed effectively and efficiently by conventional methods.

Most breast cancers are detected by patients as a lump in the breast. These lumps can be benign or malignant. It is the physician’s responsibility to diagnose the cancer and determine whether it is benign or malignant.

There are different ways of diagnosing breast cancer, such as mammography and surgical biopsy. Our main purpose in this project is to predict whether the cells of a tumor are benign or malignant (cancer), based on dimensional characteristics of the cell nuclei. As secondary objectives, we want to know whether the quality of our prediction is comparable to that reported in the scientific literature, and whether linear models are still competitive against nonlinear models based on rules and trees.

Features are computed from a digitized image of a fine needle aspirate (FNA) of a breast mass. An FNA involves taking one or more samples of breast cells with a fine needle and syringe. The features describe characteristics of the cell nuclei present in the image.

The mean, standard error, and “worst” or largest (mean of the three largest values) of these dimensional features were computed for each image, resulting in 30 features. For instance, field 3 is Mean Radius, field 13 is Radius SE, field 23 is Worst Radius.

Data Source

UCI Machine Learning Repository: Breast Cancer Wisconsin (Diagnostic) Data Set, from the University of Wisconsin Clinical Sciences Center [1].

Methodology

Data Exploration

Variable description

  1. ID number
  2. diagnosis (M = malignant, B = benign), which we recoded as (1 = malignant, 0 = benign).

Ten real-valued features are computed for each cell nucleus:

  1. radius (mean of distances from center to points on the perimeter)
  2. texture (standard deviation of gray-scale values)
  3. perimeter (the distance around a shape)
  4. area
  5. smoothness (local variation in radius lengths)
  6. compactness (\((perimeter^2 / area) - 1.0\))
  7. concavity (severity of concave portions of the contour)
  8. concave points (number of concave portions of the contour)
  9. symmetry
  10. fractal dimension (coastline approximation - 1)

For each variable we have three different statistics: mean, standard error, and worst case, giving a total of 30 predictors.

Recoding the diagnosis predictor (malignant = 1, benign = 0) was done in this phase; it helps us see the relationship between the response and the predictors.
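
A minimal sketch of this step, mirroring the loading code in the R appendix (it assumes the data frame WDBCdata has already been read in):

    # drop the index columns and recode the response: B (benign) -> 0, M (malignant) -> 1
    WDBCdata$mycol <- NULL
    WDBCdata$id <- NULL
    WDBCdata$diagnosis <- ifelse(WDBCdata$diagnosis == "B", 0, 1)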

Data Analysis

Below is the structure and the summary of the dataset.

Selected Stats
vars n mean sd median min max range skew kurtosis
diagnosis 1 569 0.37 0.48 0.00 0.00 1.00 1.00 0.53 -1.73
radius_mean 2 569 14.13 3.52 13.37 6.98 28.11 21.13 0.94 0.81
texture_mean 3 569 19.29 4.30 18.84 9.71 39.28 29.57 0.65 0.73
perimeter_mean 4 569 91.97 24.30 86.24 43.79 188.50 144.71 0.99 0.94
area_mean 5 569 654.89 351.91 551.10 143.50 2501.00 2357.50 1.64 3.59
smoothness_mean 6 569 0.10 0.01 0.10 0.05 0.16 0.11 0.45 0.82
compactness_mean 7 569 0.10 0.05 0.09 0.02 0.35 0.33 1.18 1.61
concavity_mean 8 569 0.09 0.08 0.06 0.00 0.43 0.43 1.39 1.95
concave.points_mean 9 569 0.05 0.04 0.03 0.00 0.20 0.20 1.17 1.03
symmetry_mean 10 569 0.18 0.03 0.18 0.11 0.30 0.20 0.72 1.25
fractal_dimension_mean 11 569 0.06 0.01 0.06 0.05 0.10 0.05 1.30 2.95
radius_se 12 569 0.41 0.28 0.32 0.11 2.87 2.76 3.07 17.45
texture_se 13 569 1.22 0.55 1.11 0.36 4.88 4.52 1.64 5.26
perimeter_se 14 569 2.87 2.02 2.29 0.76 21.98 21.22 3.43 21.12
area_se 15 569 40.34 45.49 24.53 6.80 542.20 535.40 5.42 48.59
smoothness_se 16 569 0.01 0.00 0.01 0.00 0.03 0.03 2.30 10.32
compactness_se 17 569 0.03 0.02 0.02 0.00 0.14 0.13 1.89 5.02
concavity_se 18 569 0.03 0.03 0.03 0.00 0.40 0.40 5.08 48.24
concave.points_se 19 569 0.01 0.01 0.01 0.00 0.05 0.05 1.44 5.04
symmetry_se 20 569 0.02 0.01 0.02 0.01 0.08 0.07 2.18 7.78
fractal_dimension_se 21 569 0.00 0.00 0.00 0.00 0.03 0.03 3.90 25.94
radius_worst 22 569 16.27 4.83 14.97 7.93 36.04 28.11 1.10 0.91
texture_worst 23 569 25.68 6.15 25.41 12.02 49.54 37.52 0.50 0.20
perimeter_worst 24 569 107.26 33.60 97.66 50.41 251.20 200.79 1.12 1.04
area_worst 25 569 880.58 569.36 686.50 185.20 4254.00 4068.80 1.85 4.32
smoothness_worst 26 569 0.13 0.02 0.13 0.07 0.22 0.15 0.41 0.49
compactness_worst 27 569 0.25 0.16 0.21 0.03 1.06 1.03 1.47 2.98
concavity_worst 28 569 0.27 0.21 0.23 0.00 1.25 1.25 1.14 1.57
concave.points_worst 29 569 0.11 0.07 0.10 0.00 0.29 0.29 0.49 -0.55
symmetry_worst 30 569 0.29 0.06 0.28 0.16 0.66 0.51 1.43 4.37
fractal_dimension_worst 31 569 0.08 0.02 0.08 0.06 0.21 0.15 1.65 5.16

Below are the inferences from the summary.

  1. diagnosis is the target variable.
  2. mycol and id are index columns and are not required as predictors.
  3. There are no NAs in any of the predictors, so imputation is not required.
  4. Each variable is on a different scale.
  5. Each cell nucleus has properties such as radius, texture, perimeter, area, smoothness, compactness, concavity, concave points, symmetry and fractal dimension, and the dataset holds the mean, standard error and worst measure of each.

As a next step, we will remove the unwanted variables and analyze each set of dimensions individually.

Data Visualization

Histogram

As a first visualization, we plot histograms of all the predictor variables.

[Figure 1: Histograms of all the predictor variables]

Box Plot

Let's split the predictors according to their category (mean, standard error, worst).

[Figure 2: Box plots of the mean, SE and worst predictor groups]

Some inferences from the charts:

  1. The mean perimeter and area are on a much larger scale than the other predictors.
  2. The standard error of the area is also large.
  3. In the worst-case measurements, the area reaches extremely high values.

Correlation matrix

Now let's plot the correlation matrices to understand the correlated predictors better.

Below is the overall correlation matrix of all the predictors.

Correlation list (only high correlation values, r>0.9)
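
A sketch of how this list is produced, mirroring the appendix code (adapted from [2]):

    z <- cor(WDBCdata)                        # 31 x 31 correlation matrix
    z[abs(z) < 0.9] <- NA                     # blank out pairs with |r| < 0.9
    z[lower.tri(z, diag = TRUE)] <- NA        # drop duplicates and the diagonal
    z <- na.omit(as.data.frame(as.table(z)))  # flatten to pairs with their r
    z[order(-abs(z$Freq)), ]                  # sort by absolute correlation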

Diagnosis Plots

Let's dive deeper into the response variable and look at its distribution.

Summary of data exploration findings:

  • Most histograms are strongly asymmetric, resembling exponential distributions.
  • Some predictors, such as radius_se, perimeter_se, area_se, concavity_se and fractal_dimension_se, look close to exponential.
  • There are no true outliers; the apparent outliers in the box plots are due to the shape of the distributions.
  • There are no missing values.
  • We identified 21 pairs of highly correlated predictors (r > 0.9). This is expected, since several predictors measure related quantities: radius, perimeter and area.
  • Fourteen predictors are correlated with the response, diagnosis, at r ≥ 0.6, which is good news.

Data Preparation

To deal with the highly correlated variables, we follow two approaches:

  1. Transform predictors and remove correlated variables.
  2. Transform the correlated variables with PCA.

Predictor transformations and removal of correlated variables

In this method the following steps are performed:

  • Resolve the high correlations by removing the following predictors: area_mean, radius_mean, area_worst, compactness_mean, perimeter_worst, compactness_se, concavity_worst and fractal_dimension_worst.
  • Verify the data behavior under a power transformation for the following predictors: radius_se, perimeter_se, area_se and fractal_dimension_se.

Some predictor variables are skewed; we can apply power transformations to those variables.

As the chart in figure 3 shows, the power transformations improved the distributions.
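
A sketch of the transformation for one predictor, using caret's BoxCoxTrans as in the appendix (the estimated lambda is applied as a simple power):

    library(caret)
    bc1 <- BoxCoxTrans(WDBCdata$radius_se)          # estimate the Box-Cox lambda
    radius_se_tr <- WDBCdata$radius_se^bc1$lambda   # apply the power transformation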

Now we will remove the highly correlated variables and split the data into a training set (70%) and a test set (30%).
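
The split is done with caret's createDataPartition, which preserves the class proportions (condensed from the appendix):

    set.seed(123)
    indx <- createDataPartition(WDBCdata$diagnosis, p = 0.7, list = FALSE)
    train_x <- WDBCdata[indx, -1]        # predictors, 70% of rows
    train_y <- WDBCdata$diagnosis[indx]
    test_x  <- WDBCdata[-indx, -1]       # predictors, remaining 30%
    test_y  <- WDBCdata$diagnosis[-indx]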

PCA transformation

In this approach, since the variables are highly correlated, we transform the predictors using principal component analysis (PCA), which yields uncorrelated transformed variables.

The summary above shows that 10 principal components explain 95% of the variation and 17 components explain 99.1%. We will select 17 components to represent the variation in the data. The variable map also shows around 7 clusters. Now let's look at the clusters in detail.
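
A sketch of the PCA step (condensed from the appendix): prcomp centers and scales the predictors before rotation, and summary() reports the cumulative proportion of variance.

    pca_wdbc <- prcomp(train_x, center = TRUE, scale. = TRUE)
    summary(pca_wdbc)                  # ~95% of variance by PC10, 99.1% by PC17
    pca_scores <- pca_wdbc$x[, 1:17]   # keep the first 17 component scores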

MODELS

The analyzed models are divided into two families:
  • Linear models: logistic regression, probit regression
  • Nonlinear models: Cubist, neural network, and SVM (support vector machine)

Linear Models

    Logistic Regression

    Logistic regression is used to estimate the probability of a binary response based on one or more predictors. The model itself only models the probability of the output in terms of the inputs and does not perform statistical classification (it is not a classifier), though it can be used to build one, for instance by choosing a cutoff value and classifying inputs with probability above the cutoff as one class and those below it as the other.
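
    A minimal sketch of the probability-plus-cutoff idea, using the 0.5 cutoff applied throughout this report:

    fit <- glm(train_y ~ ., family = binomial, data = train_x)  # model P(malignant)
    p_hat <- predict(fit, type = "response")                    # fitted probabilities
    class_hat <- ifelse(p_hat > 0.5, 1, 0)                      # cutoff -> class labels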

    Now we will build different models and compare the results.

    As an initial set of models, we try logistic regression with several variations.

    Logistic Regression Model (Logit) with all variables

    As a first model, we build a logistic regression model that includes all predictors, even the correlated ones.

    ## 
    ## Call:
    ## glm(formula = trainfull_y ~ ., family = binomial, data = trainfull_x)
    ## 
    ## Deviance Residuals: 
    ##        Min          1Q      Median          3Q         Max  
    ## -4.085e-04  -2.000e-08  -2.000e-08   2.000e-08   4.427e-04  
    ## 
    ## Coefficients:
    ##                           Estimate Std. Error z value Pr(>|z|)
    ## (Intercept)             -4.125e+03  1.062e+06  -0.004    0.997
    ## radius_mean             -1.689e+03  1.323e+05  -0.013    0.990
    ## texture_mean             2.697e+01  8.738e+03   0.003    0.998
    ## perimeter_mean           2.606e+02  2.497e+04   0.010    0.992
    ## area_mean               -1.522e+00  7.836e+02  -0.002    0.998
    ## smoothness_mean         -2.641e+03  9.064e+05  -0.003    0.998
    ## compactness_mean        -1.526e+04  1.898e+06  -0.008    0.994
    ## concavity_mean          -1.625e+02  8.434e+05   0.000    1.000
    ## concave.points_mean      8.913e+03  2.021e+06   0.004    0.996
    ## symmetry_mean           -1.826e+03  2.701e+05  -0.007    0.995
    ## fractal_dimension_mean   2.986e+04  2.587e+06   0.012    0.991
    ## radius_se                3.444e+03  6.046e+05   0.006    0.995
    ## texture_se               4.161e+01  6.254e+04   0.001    0.999
    ## perimeter_se            -3.411e+02  3.773e+04  -0.009    0.993
    ## area_se                 -3.528e+00  5.808e+03  -0.001    1.000
    ## smoothness_se           -8.118e+04  7.394e+06  -0.011    0.991
    ## compactness_se           2.619e+04  1.849e+06   0.014    0.989
    ## concavity_se            -1.046e+04  7.599e+05  -0.014    0.989
    ## concave.points_se        3.982e+04  4.285e+06   0.009    0.993
    ## symmetry_se              7.354e+03  1.928e+06   0.004    0.997
    ## fractal_dimension_se    -2.080e+05  1.870e+07  -0.011    0.991
    ## radius_worst            -4.495e+01  8.104e+04  -0.001    1.000
    ## texture_worst            3.704e+00  6.233e+03   0.001    1.000
    ## perimeter_worst          1.242e+01  5.608e+03   0.002    0.998
    ## area_worst               1.688e+00  3.940e+02   0.004    0.997
    ## smoothness_worst         8.448e+03  1.482e+06   0.006    0.995
    ## compactness_worst       -2.650e+03  2.002e+05  -0.013    0.989
    ## concavity_worst          2.083e+03  2.053e+05   0.010    0.992
    ## concave.points_worst     1.531e+03  4.357e+05   0.004    0.997
    ## symmetry_worst           1.064e+03  3.589e+05   0.003    0.998
    ## fractal_dimension_worst  1.029e+04  8.558e+05   0.012    0.990
    ## 
    ## (Dispersion parameter for binomial family taken to be 1)
    ## 
    ##     Null deviance: 5.2517e+02  on 398  degrees of freedom
    ## Residual deviance: 1.9353e-06  on 368  degrees of freedom
    ## AIC: 62
    ## 
    ## Number of Fisher Scoring iterations: 25

    None of the predictors is significant, and the model did not converge: the near-zero residual deviance and the enormous standard errors indicate complete separation driven by the correlated predictors, so this model provides no usable results.

    Logistic Regression without correlated variables

    ## 
    ## Call:
    ## glm(formula = train_y ~ ., family = binomial, data = train_x)
    ## 
    ## Deviance Residuals: 
    ##     Min       1Q   Median       3Q      Max  
    ## -1.5623  -0.0088  -0.0002   0.0000   3.4069  
    ## 
    ## Coefficients:
    ##                         Estimate Std. Error z value Pr(>|z|)  
    ## (Intercept)             -46.7853    37.6301  -1.243    0.214  
    ## texture_mean              0.2764     0.4418   0.626    0.531  
    ## perimeter_mean           -0.6385     0.3766  -1.696    0.090 .
    ## smoothness_mean        -194.1969   163.4813  -1.188    0.235  
    ## concavity_mean           89.8349    63.9501   1.405    0.160  
    ## concave.points_mean      81.0294   138.0329   0.587    0.557  
    ## symmetry_mean           -29.7323    53.1203  -0.560    0.576  
    ## fractal_dimension_mean  114.7941   251.3729   0.457    0.648  
    ## radius_se                14.8761    16.7799   0.887    0.375  
    ## texture_se                0.5514     3.4229   0.161    0.872  
    ## perimeter_se              3.5570    15.2526   0.233    0.816  
    ## area_se                -121.8861    97.8252  -1.246    0.213  
    ## smoothness_se          -755.0038   857.5639  -0.880    0.379  
    ## concavity_se           -132.9106    93.5363  -1.421    0.155  
    ## concave.points_se       677.3859   533.8223   1.269    0.204  
    ## symmetry_se              -7.0271   254.5291  -0.028    0.978  
    ## fractal_dimension_se      1.8134     1.5961   1.136    0.256  
    ## radius_worst              4.0617     2.4981   1.626    0.104  
    ## texture_worst             0.1956     0.4214   0.464    0.643  
    ## smoothness_worst        189.4718   128.4210   1.475    0.140  
    ## compactness_worst       -12.2243    10.7555  -1.137    0.256  
    ## concave.points_worst     -4.1564    49.2015  -0.084    0.933  
    ## symmetry_worst           34.9436    40.2384   0.868    0.385  
    ## ---
    ## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    ## 
    ## (Dispersion parameter for binomial family taken to be 1)
    ## 
    ##     Null deviance: 525.172  on 398  degrees of freedom
    ## Residual deviance:  35.976  on 376  degrees of freedom
    ## AIC: 81.976
    ## 
    ## Number of Fisher Scoring iterations: 11

    This time the model converged; however, many variables are not significant. We will perform backward stepwise elimination on the model.

    ## 
    ## Call:
    ## glm(formula = train_y ~ perimeter_mean + concavity_mean + concavity_se + 
    ##     concave.points_se + fractal_dimension_se + radius_worst + 
    ##     texture_worst + smoothness_worst, family = binomial, data = train_x)
    ## 
    ## Deviance Residuals: 
    ##     Min       1Q   Median       3Q      Max  
    ## -1.7767  -0.0259  -0.0016   0.0000   3.8179  
    ## 
    ## Coefficients:
    ##                      Estimate Std. Error z value Pr(>|z|)    
    ## (Intercept)          -62.0316    16.4291  -3.776 0.000160 ***
    ## perimeter_mean        -0.4903     0.1901  -2.580 0.009888 ** 
    ## concavity_mean        64.7415    22.2160   2.914 0.003566 ** 
    ## concavity_se         -76.3333    36.4414  -2.095 0.036199 *  
    ## concave.points_se    451.3646   183.1439   2.465 0.013719 *  
    ## fractal_dimension_se   1.7057     0.8748   1.950 0.051198 .  
    ## radius_worst           4.3085     1.2386   3.478 0.000504 ***
    ## texture_worst          0.3653     0.1013   3.604 0.000313 ***
    ## smoothness_worst      72.0289    33.1361   2.174 0.029726 *  
    ## ---
    ## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    ## 
    ## (Dispersion parameter for binomial family taken to be 1)
    ## 
    ##     Null deviance: 525.17  on 398  degrees of freedom
    ## Residual deviance:  47.10  on 390  degrees of freedom
    ## AIC: 65.1
    ## 
    ## Number of Fisher Scoring iterations: 10

    Now all the variables are statistically significant, and the AIC dropped from 81.976 to 65.1.

    The following predictors were chosen:

      • perimeter_mean
      • concavity_mean
      • concavity_se
      • concave.points_se
      • fractal_dimension_se
      • radius_worst
      • texture_worst
      • smoothness_worst
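
    The elimination was done manually with update(), dropping the least significant predictors pass by pass; stats::step(model, direction = "backward") would be an automated, AIC-based alternative. The final call from the appendix:

    model_13_logit_corr_final <- update(model_12_logit_corr, . ~ .
        - symmetry_se - concave.points_worst - texture_se - perimeter_se
        - radius_se - symmetry_mean - fractal_dimension_mean - concave.points_mean
        - texture_mean - compactness_worst - symmetry_worst - smoothness_mean
        - area_se - smoothness_se)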

    Now let's try out other models.

    Logistic regression on PCA variables

    We can also perform logistic regression on the PCA-transformed variables. The problem is that such a model is not interpretable: one of the biggest advantages of logistic regression is interpretability, and by using PCA variables we lose that advantage. Still, let's see how the model reacts to the PCA variables.

    ## 
    ## Call:
    ## glm(formula = train_y ~ . - PC11 - PC6 - PC8 - PC10 - PC5 - PC9 - 
    ##     PC13 - PC14 - PC12 - PC7, family = binomial, data = pca_wdbc$x[, 
    ##     c(1:14)] %>% data.frame())
    ## 
    ## Deviance Residuals: 
    ##     Min       1Q   Median       3Q      Max  
    ## -1.4551  -0.0467  -0.0028   0.0009   3.3087  
    ## 
    ## Coefficients:
    ##             Estimate Std. Error z value Pr(>|z|)    
    ## (Intercept)  -1.3741     0.4019  -3.419 0.000629 ***
    ## PC1          -3.0425     0.5450  -5.583 2.36e-08 ***
    ## PC2           2.8103     0.5622   4.999 5.77e-07 ***
    ## PC3          -1.7361     0.4486  -3.870 0.000109 ***
    ## PC4          -2.5773     0.6424  -4.012 6.02e-05 ***
    ## ---
    ## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    ## 
    ## (Dispersion parameter for binomial family taken to be 1)
    ## 
    ##     Null deviance: 525.172  on 398  degrees of freedom
    ## Residual deviance:  57.999  on 394  degrees of freedom
    ## AIC: 67.999
    ## 
    ## Number of Fisher Scoring iterations: 9

    PCA models are difficult to interpret, so this model does not provide additional insight.

    Probit Regression

    In probit regression, the dependent variable can take only two values. The purpose of the model is to estimate the probability that an observation with particular characteristics falls into a specific category. The probit model is a popular specification for ordinal and binary response models.
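
    Fitting it only requires swapping the link function in glm: the logit link models log(p/(1-p)), while the probit link models the quantile function of the standard normal. A minimal sketch before variable elimination:

    model_probit_full <- glm(train_y ~ ., family = binomial(link = "probit"),
                             data = train_x)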

    Now let's fit the regression with the probit link function. After backward stepwise elimination, below is the model we obtained.

    ## 
    ## Call:
    ## glm(formula = train_y ~ . - perimeter_se - texture_worst - concave.points_worst - 
    ##     symmetry_se - fractal_dimension_mean - symmetry_mean - compactness_worst - 
    ##     concavity_se - concave.points_se - concavity_mean - smoothness_se - 
    ##     texture_se - symmetry_worst, family = binomial(link = "probit"), 
    ##     data = train_x)
    ## 
    ## Deviance Residuals: 
    ##    Min      1Q  Median      3Q     Max  
    ## -1.819  -0.003   0.000   0.000   3.281  
    ## 
    ## Coefficients:
    ##                       Estimate Std. Error z value Pr(>|z|)    
    ## (Intercept)          -10.07635    5.98008  -1.685 0.091991 .  
    ## texture_mean           0.25963    0.06897   3.764 0.000167 ***
    ## perimeter_mean        -0.41491    0.10733  -3.866 0.000111 ***
    ## smoothness_mean      -81.77502   42.32285  -1.932 0.053338 .  
    ## concave.points_mean  131.35210   34.53616   3.803 0.000143 ***
    ## radius_se             13.05659    5.87379   2.223 0.026226 *  
    ## area_se              -84.30666   38.45030  -2.193 0.028335 *  
    ## fractal_dimension_se   0.66639    0.37989   1.754 0.079402 .  
    ## radius_worst           2.20845    0.62953   3.508 0.000451 ***
    ## smoothness_worst      54.55936   20.46253   2.666 0.007669 ** 
    ## ---
    ## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    ## 
    ## (Dispersion parameter for binomial family taken to be 1)
    ## 
    ##     Null deviance: 525.172  on 398  degrees of freedom
    ## Residual deviance:  45.703  on 389  degrees of freedom
    ## AIC: 65.703
    ## 
    ## Number of Fisher Scoring iterations: 12

    There are nine predictors that are significant (at least at the 10% level), and the AIC is similar to that of the logit model.

    Linear Models Summary

    Train set performance

    Performance metrics (train)

                  accuracy    auc
    Logistic         0.980  0.976
    Logistic PCA     0.980  0.974
    Probit           0.975  0.969

    Test set performance

    Performance metrics (test)

                  accuracy    auc
    Logistic         0.976  0.975
    Logistic PCA     0.100  0.916
    Probit           0.971  0.967

    Note the sharp drop for the PCA model on the test set. This likely reflects how the test scores were produced: a separate prcomp was fit on the test data rather than projecting the test set onto the training loadings, so the components are not comparable between the two sets.

    Non-linear Models

    Cubist Model

    Cubist is a prediction-oriented regression model that combines the ideas in Quinlan (1992) and Quinlan (1993).
    Although it initially creates a tree structure, it collapses each path through the tree into a rule. A regression model is fit for each rule based on the data subset defined by the rules. The set of rules is pruned or possibly combined, and the candidate variables for the linear regression models are the predictors that were used in the parts of the rule that were pruned away. This part of the algorithm is consistent with the “M5” or Model Tree approach.
    Cubist generalizes this model to add boosting (when committees > 1) and instance-based corrections (see predict.cubist()). The number of instances is set at prediction time by the user and is not needed for model building.
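
    We fit Cubist through caret, treating the 0/1 response as a regression target and thresholding the predicted value at 0.5; a condensed version of the appendix code (100 committees, 3 or 4 neighbors for the instance-based correction):

    ctrl <- trainControl(method = "repeatedcv", repeats = 5)
    model_31_cubit <- train(train_x, train_y, method = "cubist",
                            preProcess = c("center", "scale"),
                            tuneGrid = expand.grid(committees = 100, neighbors = c(3, 4)),
                            trControl = ctrl)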

    [Figure: predictor importance from the Cubist model]

    Average Neural Network Model

    Averaged neural network regression (avNNet) fits several neural networks from different random starting weights and averages their predictions. A neural network (NN) is a graph of computational units that receive inputs and transform them into an output that is passed on. The units are organized into layers, connecting the features of an input vector to the features of an output vector. Through training, for example with the back-propagation algorithm, neural networks can be designed and trained to model the underlying relationship in data.
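
    A condensed version of the fitting code (the full version, including the MaxNWts bound, is in the appendix): one hidden unit, two weight-decay candidates, and no bagging. The resampling summary below comes from this fit.

    ctrl <- trainControl(method = "repeatedcv", repeats = 5)
    nnetGrid <- expand.grid(.decay = c(0.01, 0.1), .size = 1, .bag = FALSE)
    model_41_neural <- train(x = train_x, y = as.factor(train_y), method = "avNNet",
                             tuneGrid = nnetGrid, trControl = ctrl,
                             preProcess = c("center", "scale"),
                             trace = FALSE, maxit = 500)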

    ## Model Averaged Neural Network 
    ## 
    ## 399 samples
    ##  22 predictor
    ##   2 classes: '0', '1' 
    ## 
    ## Pre-processing: centered (22), scaled (22) 
    ## Resampling: Cross-Validated (10 fold, repeated 5 times) 
    ## Summary of sample sizes: 360, 360, 359, 359, 360, 359, ... 
    ## Resampling results across tuning parameters:
    ## 
    ##   decay  Accuracy   Kappa    
    ##   0.01   0.9709681  0.9375156
    ##   0.10   0.9750322  0.9460329
    ## 
    ## Tuning parameter 'size' was held constant at a value of 1
    ## Tuning
    ##  parameter 'bag' was held constant at a value of FALSE
    ## Accuracy was used to select the optimal model using the largest value.
    ## The final values used for the model were size = 1, decay = 0.1 and bag
    ##  = FALSE.

    SVM

    Support Vector Machines (SVM) are a class of methods, developed originally for classification, that find the support vectors which best separate the classes. The SVM for regression is called Support Vector Regression (SVR).
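
    We use the radial basis function kernel; tuneLength = 14 makes caret evaluate 14 values of the cost parameter C (powers of two starting at 0.25) while sigma is estimated analytically. Condensed from the appendix; the resampling results follow.

    ctrl <- trainControl(method = "repeatedcv", repeats = 5)
    SVM_Radial_Fit <- train(x = train_x, y = as.factor(train_y), method = "svmRadial",
                            preProc = c("center", "scale"),
                            tuneLength = 14, trControl = ctrl)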

    ## Support Vector Machines with Radial Basis Function Kernel 
    ## 
    ## 399 samples
    ##  22 predictor
    ##   2 classes: '0', '1' 
    ## 
    ## Pre-processing: centered (22), scaled (22) 
    ## Resampling: Cross-Validated (10 fold, repeated 5 times) 
    ## Summary of sample sizes: 360, 360, 359, 359, 360, 359, ... 
    ## Resampling results across tuning parameters:
    ## 
    ##   C        Accuracy   Kappa    
    ##      0.25  0.9569418  0.9052807
    ##      0.50  0.9639931  0.9214983
    ##      1.00  0.9674675  0.9287819
    ##      2.00  0.9724681  0.9397658
    ##      4.00  0.9719431  0.9386826
    ##      8.00  0.9729431  0.9412281
    ##     16.00  0.9609515  0.9158910
    ##     32.00  0.9559619  0.9052550
    ##     64.00  0.9494362  0.8911838
    ##    128.00  0.9494362  0.8911838
    ##    256.00  0.9494362  0.8911838
    ##    512.00  0.9494362  0.8911838
    ##   1024.00  0.9494362  0.8911838
    ##   2048.00  0.9494362  0.8911838
    ## 
    ## Tuning parameter 'sigma' was held constant at a value of 0.04889095
    ## Accuracy was used to select the optimal model using the largest value.
    ## The final values used for the model were sigma = 0.04889095 and C = 8.

    Nonlinear Models Summary

    Train set performance

    Performance metrics (train)

                  accuracy    auc
    AvNeural         0.990  0.986
    SVM              0.995  0.993
    Cubist           1.000  1.000

    Test set performance

    Performance metrics (test)

                  accuracy    auc
    AvNeural         0.971  0.970
    SVM              0.971  0.970
    Cubist           0.976  0.975

    Model Analysis and Diagnostics

    Best Linear and Nonlinear Models

    In the following section, we will describe the best model.

    Logistic regression (reduced model without highly correlated predictors)

    Confusion Matrix Test set

    ## Confusion Matrix and Statistics
    ## 
    ##                test_y
    ## conv_13_logit_t   0   1
    ##               0 103   2
    ##               1   2  63
    ##                                           
    ##                Accuracy : 0.9765          
    ##                  95% CI : (0.9409, 0.9936)
    ##     No Information Rate : 0.6176          
    ##     P-Value [Acc > NIR] : <2e-16          
    ##                                           
    ##                   Kappa : 0.9502          
    ##  Mcnemar's Test P-Value : 1               
    ##                                           
    ##             Sensitivity : 0.9692          
    ##             Specificity : 0.9810          
    ##          Pos Pred Value : 0.9692          
    ##          Neg Pred Value : 0.9810          
    ##              Prevalence : 0.3824          
    ##          Detection Rate : 0.3706          
    ##    Detection Prevalence : 0.3824          
    ##       Balanced Accuracy : 0.9751          
    ##                                           
    ##        'Positive' Class : 1               
    ## 

    Cubist Regression

    ## Confusion Matrix and Statistics
    ## 
    ##                test_y
    ## conv_31_cubit_t   0   1
    ##               0 103   2
    ##               1   2  63
    ##                                           
    ##                Accuracy : 0.9765          
    ##                  95% CI : (0.9409, 0.9936)
    ##     No Information Rate : 0.6176          
    ##     P-Value [Acc > NIR] : <2e-16          
    ##                                           
    ##                   Kappa : 0.9502          
    ##  Mcnemar's Test P-Value : 1               
    ##                                           
    ##             Sensitivity : 0.9692          
    ##             Specificity : 0.9810          
    ##          Pos Pred Value : 0.9692          
    ##          Neg Pred Value : 0.9810          
    ##              Prevalence : 0.3824          
    ##          Detection Rate : 0.3706          
    ##    Detection Prevalence : 0.3824          
    ##       Balanced Accuracy : 0.9751          
    ##                                           
    ##        'Positive' Class : 1               
    ## 

    Comparison with scientific paper

    Simple Learning Algorithms for Training Support Vector Machines

    Charles Campbell and Nello Cristianini cover in depth the use of the Kernel-Adatron algorithm in their paper Simple Learning Algorithms for Training Support Vector Machines [5]. Their method was adapted from the statistical mechanics approach to learning [8]. The core concept of the algorithm is to find hyperplanes with optimal stability.

    The authors of the paper tested several statistical models before applying the Kernel-Adatron support vector machine. The non-SVM models include CART, RBF networks, and Linear Discriminant Analysis (LDA). CART models take the form of a binary tree in which each internal node represents a single input variable (x) and a split on that variable, while the leaf nodes contain the output variable (y) used to make the prediction [9]. Radial basis function (RBF) networks can be employed for linear or nonlinear problems and have been used in both single-layer and multilayer networks [10]. Linear Discriminant Analysis (LDA) uses statistical properties of the data calculated for each class: for a single input variable these are per-class means and variances, and for multiple variables they are computed over the multivariate Gaussian [11]. The multi-layer neural network (back-propagation) is a supervised learning technique consisting of at least three layers of nodes, where each node is a neuron with a nonlinear activation function.

    To achieve the ideal model performance, the authors [5] used 10-fold cross-validation on the whole dataset. The results for each model were: CART 94.2%, RBF 95.9%, LDA 96.0%, multi-layer neural network (back-propagation) 96.6%, with the optimal performance being 99.48%.

    Our optimal model was achieved using 10-fold cross-validation with centered and scaled variables. The model used was svmRadial, applied via the caret package in R 3.3.1, with the data split 70% training and 30% testing. The performance was 98.7% on the training set; when the model was tested against the holdout dataset, the accuracy increased slightly to 98.8%.

    The svmRadial model performed with high accuracy on both the training and test sets. The area under the curve (AUC) was also exceptional, at 99.3% on the training set and 97.0% on the test set. The AUC captures the discriminative power of the model; in this context, anything above 90% can be considered an A grade.

    Summary and Conclusions

    From the analysis we can conclude that the Cubist and logistic models have similar performance. The logistic model is the easiest to implement, whereas nonlinear models like SVM carry more implementation complexity, and the logistic model lets us read off the coefficients of all predictors. Although the logistic model is more interpretable than the Cubist model, the Cubist model seems to be the best model because its accuracy of 0.976 is the highest.

    REFERENCES:
    [1] Dua, D. and Karra Taniskidou, E. (2017). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.
    [2] https://stackoverflow.com/questions/7074246/show-correlations-as-an-ordered-list-not-as-a-large-matrix
    [3] https://cran.r-project.org/web/packages/Cubist/Cubist.pdf
    [4] Andrew I. Schein and Lyle H. Ungar. A-Optimality for Active Learning of Logistic Regression Classifiers. Department of Computer and Information Science Levine Hall.
    [5] Charles Campbell and Nello Cristianini. Simple Learning Algorithms for Training Support Vector Machines. Dept. of Engineering Mathematics.
    [6] O.L. Mangasarian, W.N. Street and W.H. Wolberg. Breast cancer diagnosis and prognosis via linear programming. Operations Research, 43(4), pages 570-577, July-August 1995.
    [7] Kuhn, M. and Johnson, K. Applied Predictive Modeling. Springer.
    [8] Watkin, T., Rau, A. & Biehl, M. (1993). The Statistical Mechanics of Learning a Rule. Rev. Mod. Phys. 65(2).
    [9] Brownlee, J. (2017, September 20). Classification And Regression Trees for Machine Learning. Retrieved May 20, 2018, from https://machinelearningmastery.com/classification-and-regression-trees-for-machine-learning/
    [10] Orr, M. J. L. (1996). Introduction to Radial Basis Function Networks.
    [11] Brownlee, J. (2016, April 6). Linear Discriminant Analysis for Machine Learning. Retrieved May 20, 2018, from https://machinelearningmastery.com/linear-discriminant-analysis-for-machine-learning/

    R code


    knitr::opts_chunk$set(fig.width=12, fig.height=8, fig.path='Figs/', echo=FALSE, warning=FALSE, message=FALSE)
    #load libraries
    library("xlsx")
    library("psych")
    library("GGally")
    library("ggplot2")
    library("caret")
    library("dplyr")
    library("pROC")
    library("car")
    library("leaps")
    library("sjstats")
    library("PerformanceAnalytics")
    library("factoextra")
    library("BGLR")
    library("keras")
    library("dplyr")
    library("kableExtra")
    library("Cubist")
    #load dataset
    
    url<-"https://raw.githubusercontent.com/bvshyam/Cancer_prediction/master/data/data_with groups.xlsx"
    
    temp.file <- paste(tempfile(),".xlsx",sep = "")
    download.file(url,temp.file, mode="wb")
    WDBCdata<<-xlsx::read.xlsx(temp.file, sheetName = "ics data", header=T)
    
    
    
    #WDBCdata<- read.xlsx("data/data_with groups.xlsx", sheetName = "ics data")
    
    # First step: remove the index columns to avoid future trouble
    WDBCdata$mycol<-NULL
    WDBCdata$id<-NULL
    # recode the response: M = malignant -> 1, B = benign -> 0
    WDBCdata$diagnosis<-ifelse(WDBCdata$diagnosis == "B", 0, 1)
    
    # summary stats
    WDBC_tbl<-describe(WDBCdata,IQR=T)[,c(1:5,8:10,11,12)]
    kable(round(WDBC_tbl,2), caption = "Selected Stats", format = "html") %>% 
            kable_styling(latex_options = "striped", font_size=10)
    rm(WDBC_tbl)
    library("Hmisc")
    hist.data.frame(WDBCdata, n.unique=1, mtitl = "Breast Cancer Histogram")
    WDBCdata_mean = cbind(diagnosis=WDBCdata[,c(1)], WDBCdata[,c(2:11)])
    WDBCdata_se = cbind(diagnosis=WDBCdata[,c(1)], WDBCdata[,c(12:21)])
    WDBCdata_worst = cbind(diagnosis=WDBCdata[,c(1)], WDBCdata[,c(22:31)])
    
    # boxplot
    par(cex.axis=0.8) # is for x-axis
    boxplot(WDBCdata_mean, las=2, col="green", main="Breast Cancer Box-Plot for Mean", ylim = c(0,150))
    boxplot(WDBCdata_se, las=2, col="green", main="Breast Cancer Box-Plot for SE", ylim = c(0,150))
    boxplot(WDBCdata_worst, las=2, col="green", main="Breast Cancer Box-Plot for Worst", ylim = c(0,150))
    
    #Mean
    chart.Correlation(WDBCdata_mean,histogram=TRUE,pch=19)
    
    # SE
    chart.Correlation(WDBCdata_se,histogram=TRUE,pch=19)
    # Worst
    chart.Correlation(WDBCdata_worst,histogram=TRUE,pch=19)
    ggcorr(WDBCdata, nbreaks=8, palette='PRGn', label=TRUE, label_size=2, size = 1.8, label_color='black') + ggtitle("Breast Cancer Correlation Matrix") + theme(plot.title = element_text(hjust = 0.5, color = "grey15"))
    z = cor(WDBCdata)
    
    z = round(z,4)
    z[abs(z)<0.9]=NA # remove low relationship
    z[lower.tri(z,diag=TRUE)]=NA  #Prepare to drop duplicates and meaningless information
    z=as.data.frame(as.table(z))  #Turn into a 3-column table
    z=na.omit(z)  #Get rid of the junk we flagged above
    z=z[order(-abs(z$Freq)),]    #Sort by highest correlation (whether +ve or -ve)
    z
    rm(z)
    
    
    #qplot(as.factor(WDBCdata$diagnosis))+geom_bar() + #labs(x='Diagnosis', y ='Count')  
    
    barplot(table(WDBCdata$diagnosis), col="blue", ylab="count", xlab="Diagnosis", main="Response distribution")
    
    
    
    bc1<-BoxCoxTrans(WDBCdata$radius_se)
    bc2<-BoxCoxTrans(WDBCdata$perimeter_se)
    bc3<-BoxCoxTrans(WDBCdata$area_se )
    bc5<-BoxCoxTrans(WDBCdata$fractal_dimension_se)
    
    par(mfrow=c(1,2))
    hist(WDBCdata$radius_se, main="Histogram radius_se", xlab="", col="yellow")
    hist(WDBCdata$radius_se^bc1$lambda, main="Histogram radius_se transf.", xlab="", col="green")
    par(mfrow=c(1,2))
    hist(WDBCdata$perimeter_se, main="Histogram perimeter_se", xlab="", col="yellow")
    hist(WDBCdata$perimeter_se^bc2$lambda, main="Histogram perimeter_se transf.", xlab="",col="green")
    par(mfrow=c(1,2))
    hist(WDBCdata$area_se, main="Histogram area_se",xlab="", col="yellow")
    hist(WDBCdata$area_se^bc3$lambda, main="Histogram area_se transf.",xlab="",col="green")
    par(mfrow=c(1,2))
    hist(WDBCdata$fractal_dimension_se, main="Histogram dimension_se",xlab="", col="yellow")
    hist(WDBCdata$fractal_dimension_se^bc5$lambda, main="Histogram dimension_se transf.",xlab="",col="green")
    par(mfrow=c(1,1))
    
    
    WDBCdatafull<-WDBCdata #copy data set
    
    WDBCdata<-subset(WDBCdata, select=-c(area_mean,radius_mean,area_worst,compactness_mean,perimeter_worst,compactness_se,concavity_worst,fractal_dimension_worst)) # remove predictors
    
    
    # WDBCdatafull all predictors
    # WDBCdata remove highly correlated predictors
    
    
    set.seed(123)
    indx<-createDataPartition(WDBCdata$diagnosis, p=0.7, list=FALSE)
    
    train_x<-WDBCdata[indx,-1]
    train_y<-WDBCdata$diagnosis[indx] 
    
    test_x<-WDBCdata[-indx,-1]
    test_y<-WDBCdata$diagnosis[-indx]
    
    
    # the subsets below keep the full set of predictors; they cannot be used with the logistic model, only with models that tolerate highly correlated predictors.
    
    trainfull_x<-WDBCdatafull[indx,-1]
    trainfull_y<-WDBCdatafull$diagnosis[indx]
    
    testfull_x<-WDBCdatafull[-indx,-1]
    testfull_y<-WDBCdatafull$diagnosis[-indx]
    
    # applying the power transformation for the test and train data set
    
    train_x$radius_se<-train_x$radius_se^bc1$lambda
    train_x$perimeter_se<-train_x$perimeter_se^bc2$lambda    
    train_x$area_se<-train_x$area_se^bc3$lambda
    train_x$fractal_dimension_se<-train_x$fractal_dimension_se^bc5$lambda
    
    test_x$radius_se<-test_x$radius_se^bc1$lambda
    test_x$perimeter_se<-test_x$perimeter_se^bc2$lambda    
    test_x$area_se<-test_x$area_se^bc3$lambda
    test_x$fractal_dimension_se<-test_x$fractal_dimension_se^bc5$lambda
    
    
    pca_wdbc <- prcomp(train_x[,2:ncol(train_x)],center = TRUE, scale=TRUE)
    # NOTE: a separate PCA is fit on the test set below; projecting instead with
    # predict(pca_wdbc, newdata = test_x) would keep the components comparable
    pca_wbdc_test<-prcomp(test_x[,2:ncol(test_x)],center = TRUE, scale=TRUE)
    
    plot(pca_wdbc, type='l', main="PCA - Principal Components Analysis Chart", col="red")
    
    #summary(pca_wdbc$x)
    
    
    pca_wdbc_var <- get_pca_var(pca_wdbc)
    
    
    
    res <- kmeans(pca_wdbc_var$coord,centers = 5, nstart=25)
    grp <- as.factor(res$cluster)
    
    fviz_pca_var(pca_wdbc, col.var=grp, palette='jco', legend.title='Cluster')
    
    model_11_logit_full <-glm(trainfull_y ~ . ,family=binomial, trainfull_x)
    summary(model_11_logit_full)
    model_12_logit_corr <-glm(train_y~.,family=binomial,data=train_x)
    summary(model_12_logit_corr)
    
    model_13_logit_corr_final <- update(model_12_logit_corr, .~.-symmetry_se-concave.points_worst-texture_se-perimeter_se-radius_se-symmetry_mean-fractal_dimension_mean-concave.points_mean-texture_mean-compactness_worst-symmetry_worst-smoothness_mean-area_se-smoothness_se )
    summary(model_13_logit_corr_final)
    
    model_14_pca <-glm( train_y~.-PC11 -PC6 -PC8 -PC10 -PC5 -PC9 -PC13 -PC14 -PC12 -PC7,family=binomial,data=pca_wdbc$x[,c(1:14)] %>% data.frame())
    summary(model_14_pca)
    
    model_15_probit <-glm( train_y~.-perimeter_se-texture_worst-concave.points_worst-symmetry_se-fractal_dimension_mean-symmetry_mean-compactness_worst-concavity_se-concave.points_se-concavity_mean-smoothness_se-texture_se-symmetry_worst,family=binomial(link = 'probit'),data=train_x)
    summary(model_15_probit)
    #Convert to 0/1
    conv_13_logit_corr <- ifelse(predict(model_13_logit_corr_final, type="response") > 0.5,1,0) # threshold fitted probabilities, not the link scale
    #conf_13_logit_corr <- confusionMatrix(conv_13_logit_corr, train_y, positive="1")
    conf_13_logit_corr <- confusionMatrix(table(conv_13_logit_corr, train_y), positive = "1")
    
    conv_14_pca <- ifelse(predict(model_14_pca, type="response") > 0.5,1,0)
    #conf_14_pca <- confusionMatrix(conv_14_pca, train_y, positive="1")
    conf_14_pca <- confusionMatrix(table(conv_14_pca, train_y), positive="1")
    
    conv_15_probit <- ifelse(predict(model_15_probit, type="response") > 0.5,1,0)
    #conf_15_probit <- confusionMatrix(conv_15_probit,train_y,positive="1")
    conf_15_probit <- confusionMatrix(table(conv_15_probit, train_y), positive="1")
    
    # compute accuracy
    
    acc13<-conf_13_logit_corr$overall["Accuracy"]
    acc14<-conf_14_pca$overall["Accuracy"]
    acc15<-conf_15_probit$overall["Accuracy"]
    
    # compute AUC
    
    auc13<-roc(train_y ~ conv_13_logit_corr, train_x)$auc
    auc14<-roc(train_y ~ conv_14_pca, train_x)$auc
    auc15<-roc(train_y ~ conv_15_probit, train_x)$auc
    
    df<-data.frame(accuracy=c(acc13,acc14,acc15),auc=c(auc13,auc14,auc15))
    row.names(df)<-c("Logistic","Logistic PCA","Probit")
    
    
    kable(round(df,3), caption = "Performance metrics train")        
    
    #Convert to 0/1
    conv_13_logit_t <- ifelse(predict(object=model_13_logit_corr_final, newdata=test_x, type="response") > 0.5,1,0)
    #conf_13_logit_t <- confusionMatrix(conv_13_logit_t, test_y, positive="1")
    conf_13_logit_t <- confusionMatrix(table(conv_13_logit_t, test_y), positive="1")
    
    conv_14_pca_t <- ifelse(predict(model_14_pca, newdata=as.data.frame(pca_wbdc_test$x[,c(1:14)]), type="response") > 0.5,1,0)
    #conf_14_pca_t <- confusionMatrix(conv_14_pca_t, test_y, positive="1")
    conf_14_pca_t <- confusionMatrix(table(conv_14_pca_t, test_y), positive="1")
    
    conv_15_probit_t <- ifelse(predict(model_15_probit, newdata=test_x, type="response") > 0.5,1,0)
    #conf_15_probit_t <- confusionMatrix(conv_15_probit_t, test_y, positive="1")
    conf_15_probit_t <- confusionMatrix(table(conv_15_probit_t, test_y), positive="1")
    
    # compute accuracy
    
    acc13_t<-conf_13_logit_t$overall["Accuracy"]
    acc14_t<-conf_14_pca_t$overall["Accuracy"]
    acc15_t<-conf_15_probit_t$overall["Accuracy"]
    
    # compute AUC
    
    auc13_t<-roc(test_y ~ conv_13_logit_t, test_x)$auc
    auc14_t<-roc(test_y ~ conv_14_pca_t, test_x)$auc
    auc15_t<-roc(test_y ~ conv_15_probit_t, test_x)$auc
    
    df<-data.frame(accuracy=c(acc13_t,acc14_t,acc15_t),auc=c(auc13_t,auc14_t,auc15_t))
    row.names(df)<-c("Logistic","Logistic PCA","Probit")
    
    kable(round(df,3), caption = "Performance metrics test")   
    
    
    
    set.seed(123)
    ctrl=(trainControl(method="repeatedcv", repeats=5))
    
    # Cubist tuning grid: 100 committees; 3 or 4 neighbors for the instance-based correction
    cubistGrid <- expand.grid(committees = c(100), neighbors = c(3, 4))
    
    model_31_cubit<-train(train_x,train_y, method="cubist",
                    preProcess = c("center", "scale"),
                    tuneGrid = cubistGrid,
                    trControl = ctrl)
    
    dotPlot(varImp(model_31_cubit), main="Cubist Predictor importance")
    
    set.seed(123)
    
    ctrl=(trainControl(method="repeatedcv", repeats=5))
    nnetGrid<-expand.grid(.decay=c(0.01,0.1), .size=1, .bag=FALSE)
    
    model_41_neural<-train(x=train_x,y=as.factor(train_y),method="avNNet",
                     tuneGrid = nnetGrid,
                     trControl = ctrl,
                     preProcess = c("center","scale"),
                     linout=TRUE,
                     trace=FALSE,
                     MaxNWts=10*(ncol(train_x)+1)+10+1,
                     maxit=500)
    
    model_41_neural
    set.seed(123)
    ctrl=(trainControl(method="repeatedcv", repeats=5))
    
    SVM_Radial_Fit = train(x=train_x,y=as.factor(train_y), method = "svmRadial",
    preProc = c("center", "scale"),
    tuneLength = 14,trControl = ctrl)
    
    SVM_Radial_Fit
    
    
    set.seed(123)
    Chris_SVM_Pred = data.frame(Pred= predict(SVM_Radial_Fit, newdata = test_x))
    
    
    plot(SVM_Radial_Fit, scales = list(x = list(log = 2)))
    
    #Chris_Con =confusionMatrix(Chris_SVM_Pred$Pred,test_y,dnn = c("Prediction", "Reference"))
    
    #Chris_SVM_acc = data.frame( Val =postResample(pred = Chris_SVM_Pred$Pred, obs = test_y))
    
    #fourfoldplot(Chris_Con$table,main=paste("SVM (",round(Chris_SVM_acc[1,]*100),"%)",sep=""),color = c("#ed3b3b", "#0099ff"))
    
    #Chris_Con
    #Convert to 0/1
    conv_41_neural <- predict(model_41_neural) 
    #conf_41_neural<-confusionMatrix(conv_41_neural, train_y, positive="1")
    conf_41_neural<-confusionMatrix(table(conv_41_neural, train_y), positive="1")
    
    conv_21_svm <- predict(SVM_Radial_Fit)
    #conf_21_svm<-confusionMatrix(conv_21_svm, train_y, positive="1")
    conf_21_svm<-confusionMatrix(table(conv_21_svm, train_y), positive="1")
    
    conv_31_cubit <- ifelse(predict(model_31_cubit) > 0.5,1,0)
    #conf_31_cubit<-confusionMatrix(conv_31_cubit, train_y, positive="1")
    conf_31_cubit<-confusionMatrix(table(conv_31_cubit, train_y), positive="1")
    
    # compute accuracy
    
    acc41<-conf_41_neural$overall["Accuracy"]
    acc21<-conf_21_svm$overall["Accuracy"]
    acc31<-conf_31_cubit$overall["Accuracy"]
    
    # compute AUC
    
    auc41<-roc(train_y ~ as.numeric(as.character(conv_41_neural)), train_x)$auc
    auc21<-roc(train_y ~ as.numeric(as.character(conv_21_svm)), train_x)$auc
    auc31<-roc(train_y ~ conv_31_cubit, train_x)$auc
    
    df1<-data.frame(accuracy=c(acc41,acc21,acc31),auc=c(auc41,auc21,auc31))
    row.names(df1)<-c("AvNeural","SVM","Cubist")
    
    kable(round(df1,3), caption = "Performance metrics train")        
    
    #Convert to 0/1
    conv_41_neural_t <- predict(model_41_neural, newdata=test_x, type="raw") 
    #conf_41_neural_t<-confusionMatrix(conv_41_neural_t,test_y,positive="1")
    conf_41_neural_t<-confusionMatrix(table(conv_41_neural_t, test_y), positive="1")
    
    conv_21_svm_t <- predict(SVM_Radial_Fit,newdata=test_x, type="raw")
    #conf_21_svm_t<-confusionMatrix(conv_21_svm_t,test_y,positive="1")
    conf_21_svm_t<-confusionMatrix(table(conv_21_svm_t, test_y), positive="1")
    
    conv_31_cubit_t <- ifelse(predict(model_31_cubit,newdata=test_x) > 0.5,1,0)
    #conf_31_cubit_t<-confusionMatrix(conv_31_cubit_t,test_y,positive="1")
    conf_31_cubit_t<-confusionMatrix(table(conv_31_cubit_t, test_y), positive="1")
    
    # compute accuracy
    
    acc41_t<-conf_41_neural_t$overall["Accuracy"]
    acc21_t<-conf_21_svm_t$overall["Accuracy"]
    acc31_t<-conf_31_cubit_t$overall["Accuracy"]
    
    # compute AUC
    
    auc41_t<-roc(test_y ~ as.numeric(as.character(conv_41_neural_t)), test_x)$auc
    auc21_t<-roc(test_y ~ as.numeric(as.character(conv_21_svm_t)), test_x)$auc
    auc31_t<-roc(test_y ~ conv_31_cubit_t, test_x)$auc
    
    df2<-data.frame(accuracy=c(acc41_t,acc21_t,acc31_t),auc=c(auc41_t,auc21_t,auc31_t))
    row.names(df2)<-c("AvNeural","SVM","Cubist")
    
    kable(round(df2,3), caption = "Performance metrics test")        
    
    
    # Function for printing a confusion matrix (cutoff 0.5, positive class "1")
    confusion_analysis <- function(df, model){
      predicted <- if_else(predict(model, df, type = "response") >= 0.5, 1, 0)
      confusionMatrix(table(predicted, df$diagnosis), positive = "1")
    }
    
    # Function for calculating evaluation metrics
    summary_analysis <- function(df, model){
      print(summary(model))
      print(paste0("BIC: ", BIC(model)))
      print(paste0("VIF: ", vif(model)))
    
      n <- length(df$diagnosis)
      print(paste0("Nagelkerke pseudo-R2: ",
                   (1 - exp((model$dev - model$null)/n)) / (1 - exp(-model$null/n))))
      print("Confusion Matrix:")
      confusion_analysis(df, model)
    }
    
    
    
    conf_13_logit_t
    
    fourfoldplot(conf_13_logit_t$table,color = c("#ed3b3b", "#0099ff"))
    RocCurve <- roc(response = test_y,
                    ifelse(predict(object=model_13_logit_corr_final, newdata=test_x, type="response") > 0.5,1,0),levels = c(1,0))
    plot(RocCurve, ylab = "Sensitivity", xlab = "1 - Specificity", main = "ROC Curve - Logistic", col = "red")
    conf_31_cubit_t
    fourfoldplot(conf_31_cubit_t$table,color = c("#ed3b3b", "#0099ff"))
    rocCurve <- roc(response = test_y, predictor = predict(model_31_cubit,newdata = test_x), levels = c(1,0))
    plot(rocCurve, ylab = "Sensitivity", xlab = "1 - Specificity", main = "ROC Curve - Cubist", col = "red")