

Introduction to Breast Cancer.

Worldwide, breast cancer is the most common invasive cancer in women, affecting about 1 in 7 (\(14\%\)) of women. In 2008, breast cancer caused \(458,503\) deaths worldwide (i.e., \(13.7\%\) of cancer deaths in women and \(6.0\%\) of all cancer deaths in men and women combined).

The incidence of breast cancer varies greatly around the world: it is lowest in developing countries and highest in developed countries. The number of cases worldwide has increased significantly since the 1970s, a phenomenon partly attributed to modern lifestyles. Breast cancer is also strongly related to age, with only \(5\%\) of all breast cancers occurring in women under 40 years old.

Based on U.S. statistics, in 2015 there were 2.8 million women affected by breast cancer in the United States. However, age-adjusted deaths from breast cancer per 100,000 women rose only slightly, from 31.4 in 1975 to 33.2 in 1989, and have since declined steadily to 20.5 in 2014.

Breast cancer is one of the most severe and common reproductive cancers, affecting mostly women. A breast tumor is an abnormal growth of tissue in the breast, and it may be felt as a lump or noticed as nipple discharge or a change of skin texture around the nipple region. Cancers are abnormal cells that divide uncontrollably and are able to invade other tissues; cancer cells can spread to other parts of the body through the blood and lymphatic systems. Breast cancer has become a major health issue over the past 50 years, and its incidence has risen in recent years.

In today’s world, advances in technology, medicine, and health systems are accelerating progress in cancer detection and diagnosis, and a myriad of preventive, detection, and diagnostic approaches have been put in place to reduce new cancer cases in the community. Among the adopted techniques, artificial intelligence (AI) and machine learning (ML) cannot be left out, since they contribute significantly, and sometimes surprisingly, to cancer detection and prediction beyond human expectations. For this reason, the intention of this work is to employ a machine learning based ensemble approach to classify, detect, and predict the breast cancer diagnosis decision.

Key characteristics of Breast Cancer.

A breast mass or lump is generally one of the possible signs of breast cancer in men or women. Breast cancer can also cause several changes to the skin on and around the breast. Here below are potential signs and symptoms of breast cancer that may occur without a noticeable lump in the breast (note: they are not common to all persons):

  • Change to the skin’s texture. Breast cancer can cause changes and inflammation in skin cells that lead to texture changes, for example skin that looks sunburned or extremely dry, or skin thickening in any part of the breast.

  • Nipple discharge. An individual may observe discharge from the nipple, which can be thin or thick and can range in color from clear to milky to yellow, green, or red. However, nipple discharge does not always indicate breast cancer; other possible causes include breast infections, a side effect of birth control pills or of taking certain medications, variations in body physiology, and certain medical conditions, such as thyroid disease.

  • Dimpling. Skin dimpling can sometimes be a sign of inflammatory breast cancer, an aggressive type of breast cancer. Cancer cells can cause a buildup of lymph fluid in the breast that leads to swelling as well as dimpling or pitted skin. Doctors call this change in the skin “peau d’orange” because the dimpled skin resembles the surface of an orange.

  • Breast or nipple pain. Breast cancer can cause changes in skin cells that lead to feelings of pain, tenderness, and discomfort in the breast. Although breast cancer is often painless, it is important not to ignore any signs that could be due to breast cancer.

  • Redness. Breast cancer can cause changes to the skin that may make it appear discolored or even bruised. The skin may be red or purple or have a bluish tint.

  • Swelling. Breast cancer can cause the entire breast or an area of the breast to swell.

It is advisable that people should not panic or be fearful when they notice breast changes. Aging, changes in hormone levels, and other factors can lead to breast changes throughout a person’s lifetime. However, people should be proactive about their health and visit a doctor to determine the cause of any breast symptoms.

Different Types of Breast Tumor.

According to the National Cancer Institute (https://www.cancer.gov/), a tumor is defined as “an abnormal mass of tissue that results when cells divide more than they should or do not die when they should.”

In a healthy human body, cells grow, divide, and replace each other; as new cells form, the old ones die. When a person has cancer, new cells form when the body doesn’t need them at all. If there are too many cells, a group of cells, or tumor, can develop.

In the human body, tumors can be non-cancerous (benign) or cancerous (malignant). Malignant tumors can spread to other parts of the body. A tumor develops when cells reproduce too quickly, and tumors can vary in size from a tiny nodule to a large mass, depending on the type; they can appear almost anywhere in the body.

According to the National Cancer Institute, there are three main types of tumor:

  • Benign: These are not cancerous cells. Most benign tumors are not harmful. They either cannot spread or grow, or they do so very slowly. Once removed from the body, they do not generally return. However, they can cause pain or other problems if they press against nerves or blood vessels or if they trigger the overproduction of hormones, as in the endocrine system.

  • Premalignant: In these tumors, the cells are not yet cancerous, but they have the potential to become malignant.

  • Malignant: Malignant tumors are cancerous. The cells can grow and spread to other parts of the body through the process of metastasis. They develop when cells grow uncontrollably, and if they continue to grow and spread, the disease becomes life-threatening. Notably, the cancer cells that move to other parts of the body are the same as the original ones and retain the ability to invade other organs. If breast cancer spreads to the lungs, for example, the cancer cells in the lungs are still breast cancer cells.

It is not always clear how a tumor will behave in the future. Some benign tumors can become premalignant and then malignant. For this reason, it is better to monitor any growth through regular health checks with an expert physician.

Causes and Risk Factors of Breast Cancer.

The Mayo Clinic (https://mayoclinic.org) defines a breast cancer risk factor as anything that makes it more likely you will get breast cancer. But having one or several breast cancer risk factors doesn’t necessarily mean you will develop breast cancer; many women who developed breast cancer had no known risk factors other than simply being women.

Factors that are associated with an increased risk of breast cancer include:

  • Being female. Women are much more likely than men to develop breast cancer.

  • Age. The risk of getting breast cancer increases as you age. Nearly \(80\%\) of breast cancers are found in women over the age of \(50\).

  • Personal history of breast conditions. If a breast biopsy found lobular carcinoma in situ (LCIS) or atypical hyperplasia of the breast, you have an increased risk of breast cancer.

  • A personal history of breast cancer. If you have had breast cancer in one breast, you have an increased risk of developing cancer in the other breast.

  • A family history of breast cancer. If your mother, sister, or daughter was diagnosed with breast cancer, particularly at a young age (before \(40\)), your risk of breast cancer is higher. Having other relatives with breast cancer may also raise the risk.

  • Inherited genes. Certain gene mutations that increase the risk of breast cancer can be passed from parents to children. Women with certain genetic mutations, including changes to the BRCA1 and BRCA2 genes, are at higher risk of developing breast cancer during their lifetime.

  • Radiation exposure. If you received radiation treatments to your chest as a child or young adult, your risk of breast cancer is increased.

  • Beginning your period at a young age and menopause at an older age. Women who menstruate for the first time at an early age (before \(12\)) and women who go through menopause late (after age \(55\)) are more likely to develop breast cancer.

  • Having a first child at an older age or having never been pregnant. Women who give birth to their first child after age \(30\), or who have never been pregnant, have a greater risk of breast cancer.

There are other risk factors for breast cancer, and having them does not always mean a person will necessarily develop the disease. It is advisable to routinely ask your doctor about breast cancer screening and to practice all kinds of cancer preventive measures.

The Main Motivation of This Work.

Today’s personalized medicine increases the workload and complexity for physicians in several types of cancer detection and diagnosis. In hospitals, radiologists and pathologists are the key players in deciding on the presence of cancer and its diagnosis. Based on the radiology diagnostic decision, the results are submitted to the pathologist for further diagnosis. Pathologists and radiologists form the core of cancer diagnosis based on the anatomical and physiological information gained from Nuclear Magnetic Resonance Imaging (NMRI) or CT scanner images of tumor cells inside the human body. In several hospitals, the communication between them remains on paper, each report covering the respective findings on the same patient. This can slow down cancer diagnosis decision making and lower the patient survival rate.

With the enormous technological and scientific advances currently occurring in all fields, the opportunity has emerged to develop an integrated diagnostic reporting system that supports both medical fields (radiology and pathology), improving the overall quality of patient care through accurate communication of DICOM (Digital Imaging and Communications in Medicine) images. In this work, we are highly motivated to contribute to faster processing of cancer information and early preparedness by focusing on accurate diagnostic prediction.

The emergence of Fourth Industrial Revolution (4IR) technology has allowed huge amounts of data (big data) to be collected, which adds to the complexity of the radiology and pathology workload. To address the aforementioned challenges, artificial intelligence is leveraged to improve medical diagnostics. Accordingly, our study employs an ensemble technique that combines several machine learning models into a single accurate predictive model in order to precisely predict the breast cancer diagnostic decision.

Machine Learning and Ensemble Technique

This section clarifies the approach used in this work and gives an overview of the machine learning process and the ensemble technique used to tackle the problem at hand.

Machine Learning

Machine Learning (ML) is the science of building computations and algorithms that allow software applications to become more accurate at predicting outcomes without being explicitly programmed, simplifying the work of humans. ML comprises a broad class of statistical analysis algorithms that iteratively improve in response to training data to build models for autonomous prediction. ML is powered by data and enables computers to learn from it to make automatic data-based decisions and predictions.

Machine learning is often categorized into supervised, unsupervised, and reinforcement learning. Supervised learning algorithms have been used in this study to distinguish benign from malignant breast cancer, and their combination has been used to accurately predict the cancer diagnosis labels. Here below is the summarized workflow of the supervised machine learning algorithms employed.

Workflow of Employed Machine Learning Algorithms (Source: https://www.wikipedia.org)


Simply put, the workflow chart above summarizes the processes undertaken: data acquisition, preparation and transformation, feature engineering and data pre-processing, hyper-parameter tuning and model building, model testing and validation, and finally model comparison based on performance.

Ensemble Technique

In statistics and machine learning, ensemble methods combine multiple learning algorithms to obtain better predictive performance than any single learning algorithm. An ensemble is itself a supervised learning algorithm, because it can be trained and then used to make predictions; the trained ensemble represents a single hypothesis and result.

The three most popular methods for combining the predictions from different models are:

  • Bagging. Building multiple models (typically of the same type) from different subsamples of the training dataset.

  • Boosting. Building multiple models (typically of the same type) each of which learns to fix the prediction errors of a prior model in the chain.

  • Stacking. Building multiple models (typically of differing types) and a supervisor model that learns how to best combine the predictions of the primary models.

This study will not explain and employ each of these methods; we will focus only on the stacking technique (sometimes called stacked generalization). In this method, the multiple sub-models contribute equally to a combined prediction.

A lot of exciting data-related activities lie ahead. To keep things simple, this tutorial is organized to cover the following: data preparation and pre-processing, data visualization and identification of important variables, feature selection using different approaches, training and tuning the models, model comparison based on several performance metrics, and finally ensembling the predictions.

Data Source and Preparation.

The dataset used is a real-valued, continuous, multivariate set of data extracted from breast cell nuclei. The features are computed from a digitized image of a fine needle aspirate (FNA) of a breast mass and describe characteristics of the cell nuclei present in the image. The dataset is publicly available from various repositories, including the Kaggle and UCI Machine Learning repositories. The table below gives a clear description of the dataset.

Table 1: Breast Cancer Diagnostic Data Description



Name Class Values
patient_id integer Num: 8670 to 911320502
radius_mean numeric Num: 6.981 to 28.11
texture_mean numeric Num: 9.71 to 39.28
perimeter_mean numeric Num: 43.79 to 188.5
area_mean numeric Num: 143.5 to 2501
smoothness_mean numeric Num: 0.053 to 0.163
compactness_mean numeric Num: 0.019 to 0.345
concavity_mean numeric Num: 0 to 0.427
concave.points_mean numeric Num: 0 to 0.201
symmetry_mean numeric Num: 0.106 to 0.304
fractal_dimension_mean numeric Num: 0.05 to 0.097
radius_se numeric Num: 0.112 to 2.873
texture_se numeric Num: 0.36 to 4.885
perimeter_se numeric Num: 0.757 to 21.98
area_se numeric Num: 6.802 to 542.2
smoothness_se numeric Num: 0.002 to 0.031
compactness_se numeric Num: 0.002 to 0.135
concavity_se numeric Num: 0 to 0.396
concave.points_se numeric Num: 0 to 0.053
symmetry_se numeric Num: 0.008 to 0.079
fractal_dimension_se numeric Num: 0.001 to 0.03
radius_worst numeric Num: 7.93 to 36.04
texture_worst numeric Num: 12.02 to 49.54
perimeter_worst numeric Num: 50.41 to 251.2
area_worst numeric Num: 185.2 to 4254
smoothness_worst numeric Num: 0.071 to 0.223
compactness_worst numeric Num: 0.027 to 1.058
concavity_worst numeric Num: 0 to 1.252
concave.points_worst numeric Num: 0 to 0.291
symmetry_worst numeric Num: 0.156 to 0.664
fractal_dimension_worst numeric Num: 0.055 to 0.208
diagnosis factor ‘B’ ‘M’
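
Here below are the loaded libraries and the dataset. The following is a minimal sketch of this step; the file name data.csv and the object name bcancer are illustrative assumptions, not taken from the original post.

# Load the required libraries (an assumed set, based on the functions used later)
library(tidyverse)   # data manipulation and plotting
library(caret)       # model training, tuning and evaluation

# Read the breast cancer diagnostic data (file name is an assumption)
bcancer <- read.csv("data.csv", stringsAsFactors = TRUE)

# Inspect dimensions, feature summaries and missingness, as printed below
cat("The dimension of data set is (", dim(bcancer), ")\n")
summary(bcancer[, 2:11])    # summary of the ten "mean" features
str(bcancer[, 2:11])
colSums(is.na(bcancer))     # per-column count of missing values
cat("The total missing value of data set is (", sum(is.na(bcancer)), ")\n")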
## The dimension of data set is ( 569 32 )
##   radius_mean      texture_mean   perimeter_mean     area_mean     
##  Min.   : 6.981   Min.   : 9.71   Min.   : 43.79   Min.   : 143.5  
##  1st Qu.:11.700   1st Qu.:16.17   1st Qu.: 75.17   1st Qu.: 420.3  
##  Median :13.370   Median :18.84   Median : 86.24   Median : 551.1  
##  Mean   :14.127   Mean   :19.29   Mean   : 91.97   Mean   : 654.9  
##  3rd Qu.:15.780   3rd Qu.:21.80   3rd Qu.:104.10   3rd Qu.: 782.7  
##  Max.   :28.110   Max.   :39.28   Max.   :188.50   Max.   :2501.0  
##  smoothness_mean   compactness_mean  concavity_mean    concave.points_mean
##  Min.   :0.05263   Min.   :0.01938   Min.   :0.00000   Min.   :0.00000    
##  1st Qu.:0.08637   1st Qu.:0.06492   1st Qu.:0.02956   1st Qu.:0.02031    
##  Median :0.09587   Median :0.09263   Median :0.06154   Median :0.03350    
##  Mean   :0.09636   Mean   :0.10434   Mean   :0.08880   Mean   :0.04892    
##  3rd Qu.:0.10530   3rd Qu.:0.13040   3rd Qu.:0.13070   3rd Qu.:0.07400    
##  Max.   :0.16340   Max.   :0.34540   Max.   :0.42680   Max.   :0.20120    
##  symmetry_mean    fractal_dimension_mean
##  Min.   :0.1060   Min.   :0.04996       
##  1st Qu.:0.1619   1st Qu.:0.05770       
##  Median :0.1792   Median :0.06154       
##  Mean   :0.1812   Mean   :0.06280       
##  3rd Qu.:0.1957   3rd Qu.:0.06612       
##  Max.   :0.3040   Max.   :0.09744
## 'data.frame':    569 obs. of  10 variables:
##  $ radius_mean           : num  18 20.6 19.7 11.4 20.3 ...
##  $ texture_mean          : num  10.4 17.8 21.2 20.4 14.3 ...
##  $ perimeter_mean        : num  122.8 132.9 130 77.6 135.1 ...
##  $ area_mean             : num  1001 1326 1203 386 1297 ...
##  $ smoothness_mean       : num  0.1184 0.0847 0.1096 0.1425 0.1003 ...
##  $ compactness_mean      : num  0.2776 0.0786 0.1599 0.2839 0.1328 ...
##  $ concavity_mean        : num  0.3001 0.0869 0.1974 0.2414 0.198 ...
##  $ concave.points_mean   : num  0.1471 0.0702 0.1279 0.1052 0.1043 ...
##  $ symmetry_mean         : num  0.242 0.181 0.207 0.26 0.181 ...
##  $ fractal_dimension_mean: num  0.0787 0.0567 0.06 0.0974 0.0588 ...
##           texture_mean         perimeter_mean              area_mean 
##                      0                      0                      0 
##        smoothness_mean       compactness_mean         concavity_mean 
##                      0                      0                      0 
##    concave.points_mean          symmetry_mean fractal_dimension_mean 
##                      0                      0                      0 
##              radius_se 
##                      0
## The total missing value of data set is ( 0 )

We don’t have any missing data, so we’re good to go. The main aim with this dataset is to predict which of the two breast cancer diagnosis decisions (malignant, M, or benign, B) applies to the patient being tested.

Data Preparation and Feature Engineering.

The data set has \(569\) observations and \(32\) features. Feature selection is a critical feature engineering task. The R FSelector package has been used to cut down and eliminate unnecessary features prior to building a prediction model. The random.forest.importance() function is used to rate the importance of each feature for recognizing and classifying the outcome. The function returns a data frame containing the name of each attribute and its importance value based on the mean decrease in accuracy.

From the returned data frame, we can select the top features according to their importance. The cutoff.biggest.diff() function automatically identifies the features whose importance values are significantly higher than those of the other features, cutoff.k() returns the top \(k\) features with the highest importance values, and, similarly, cutoff.k.percent() returns the top \(k\) percent of features by importance.
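
Here below is a minimal sketch of this step, assuming the data frame is named bcancer (with the identifier column dropped first) and the label column is diagnosis:

# Drop the identifier before ranking features
bcancer <- dplyr::select(bcancer, -patient_id)

library(FSelector)   # feature ranking and cutoff utilities

# Rate each feature by the mean decrease in accuracy of a random forest
att_scores <- random.forest.importance(diagnosis ~ ., bcancer)

# Features with markedly higher importance than the rest
cutoff.biggest.diff(att_scores)

# The ten highest-ranked features ...
top_features <- cutoff.k(att_scores, k = 10)

# ... or the top 40% of features by importance
cutoff.k.percent(att_scores, k = 0.4)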

## [1] "diagonis"

Let’s tabulate the dataset with the top ten features.

Exploratory Data Analysis

Summary of the selected features, Overall (N = 569):

Variable              Mean (SD)         Median [Min, Max]
area_worst            881 (569)         687 [185, 4250]
radius_worst          16.3 (4.83)       15.0 [7.93, 36.0]
perimeter_worst       107 (33.6)        97.7 [50.4, 251]
concave.points_worst  0.115 (0.0657)    0.0999 [0, 0.291]
concave.points_mean   0.0489 (0.0388)   0.0335 [0, 0.201]
area_se               40.3 (45.5)       24.5 [6.80, 542]
texture_worst         25.7 (6.15)       25.4 [12.0, 49.5]
concavity_worst       0.272 (0.209)     0.227 [0, 1.25]
texture_mean          19.3 (4.30)       18.8 [9.71, 39.3]
## 
##     B     M 
## 62.74 37.26

This means that \(62.74\%\) and \(37.26\%\) of the patients in the data set tested benign and malignant, respectively. Here below are the selected top columns.

  • diagnosis: This is the diagnosis decision indicating whether a patient is breast cancer tested malignant (M) or Benign (B).

  • perimeter_worst: A worst or largest mean value of cell nuclei perimeter.

  • area_worst: A worst or largest mean value of cell nuclei area.

  • concave.points_worst: A “worst” or largest mean value for number of concave portions of the contour.

  • radius_worst: A “worst” or largest mean value for mean of distances from center to points on the perimeter.

  • texture_worst: A “worst” or largest mean value for standard deviation of gray-scale values.

  • concave.points_mean: A mean for number of concave portions of the contour.

  • area_se: A standard error of the cell nuclei area.

  • concavity_worst: A “worst” or largest mean value for severity of concave portions of the contour.

  • concavity_mean: A mean of severity of concave portions of the cell nuclei contour.

Data Exploration and Visualization.

Exploration and visualization of response variable

  • B: Represents the total number of patients who tested positive for a benign breast tumor.

  • M: Represents the total number of patients who tested positive for a malignant breast tumor.

We can see that most of the patients tested benign, at a rate of \(62.74\%\). Now let’s explore the distribution of all features through histograms, since all features are continuous.
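
Here below is a minimal sketch of this exploration, assuming the data frame with the top features is named bcancer_top (an illustrative name):

library(tidyr)   # pivot_longer()

# One histogram per feature, after reshaping the data to long format
bcancer_top %>%
  pivot_longer(-diagnosis, names_to = "feature", values_to = "value") %>%
  ggplot(aes(x = value)) +
  geom_histogram(bins = 30, fill = "steelblue", colour = "white") +
  facet_wrap(~ feature, scales = "free") +
  theme_bw() +
  labs(title = "Distributions of the Selected Features")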

We can see that texture_mean is approximately normally distributed compared to the other features.

We can clearly see the difference between these two groups of patients.

Multicollinearity Testing

This refers to testing the correlation among the predictors (independent features). Collinearity can worsen predictive accuracy and also makes it challenging to determine which features to include in a predictive model. Here below is the correlation table among the features.
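
A minimal sketch of computing and visualizing the correlations (the corrplot package and the bcancer_top name are assumptions):

library(corrplot)   # correlation matrix plots

# Correlation among the numeric predictors (the response is excluded)
cor_mat <- cor(dplyr::select(bcancer_top, -diagnosis))
round(cor_mat, 2)                                 # correlation table
corrplot(cor_mat, method = "circle", order = "hclust")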

We can notice that some variables are highly correlated (i.e., multicollinearity is present), so we will eliminate some of them in the predictive models to ensure the highest possible predictive performance.

Exploration of Importance of Features on Diagnosis Decision.

Now it is time to visually examine how the best selected features of the breast cell nuclei influence the diagnosis decision. Both box plots and density plots are commonly used to investigate this, and caret’s featurePlot() function makes it very convenient.
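
Here below is a minimal sketch of the stratified train/test split and the density plots; the seed value is an assumption, while bcancerTrain and bcancerTest are the names used later in this post.

# Stratified 80/20 split, reproducing the dimensions printed below
set.seed(100)                              # assumed seed
in_train     <- createDataPartition(bcancer_top$diagnosis, p = 0.8, list = FALSE)
bcancerTrain <- bcancer_top[in_train, ]
bcancerTest  <- bcancer_top[-in_train, ]

# Density of each feature, split by diagnosis class
featurePlot(x = bcancerTrain[, setdiff(names(bcancerTrain), "diagnosis")],
            y = bcancerTrain$diagnosis,
            plot = "density",
            scales = list(x = list(relation = "free"),
                          y = list(relation = "free")),
            auto.key = list(columns = 2))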

## The dimension of the training data set is ( 456 10 )
## The dimension of test data set is ( 113 10 )

In this case, for a variable to be extremely important, one would expect its density curves to be significantly different for the two classes, both in height (kurtosis) and placement (skewness). Looking at the density curves of the two diagnosis categories for all features, we notice that area_se, area_worst, perimeter_worst, concavity_worst, concave.points_mean, and radius_worst are most likely to be important for predicting the breast cancer diagnosis decision compared to the others. But it may not be wise to conclude which variables are NOT important unless we apply the various feature selection techniques.

Feature Selection

Feature selection always plays a crucial role in machine learning. There are several techniques for selecting the best features to build predictive models. Here below are the two techniques adopted in this post:

  • Chi-Square Test: It is used in statistics to test the independence of two events. For observed counts \(O_i\) and expected counts \(E_i\), the statistic is \(\chi^2 = \sum_i (O_i - E_i)^2 / E_i\). In feature selection, we aim to select the features that are highly dependent on the response (output label). When a feature and the response are independent, the observed counts are close to the expected counts, giving a small Chi-Square value; simply put, the higher the Chi-Square value, the better the feature is for model training.
variable chi.square p.value
perimeter_worst 536.2035 0.2313225
area_worst 558.3055 0.3154948
concave.points_worst 546.1851 0.0427027
radius_worst 537.6295 0.0049616
concave.points_mean 562.5833 0.2521223
area_se 552.6018 0.2128353
texture_worst 514.8145 0.4320580
texture_mean 498.4163 0.2505930
concavity_worst 555.4536 0.2922900

We can notice that all features have large Chi-Square values, which confirms our hypothesis that almost all features provide useful information about the response (target) variable. On the other hand, we can confidently select only radius_worst and concave.points_worst, since their p-values are less than \(0.05\). To be safe, let’s not arrive at conclusions about excluding variables prematurely, and try another machine learning feature selection technique called recursive feature elimination.

  • Recursive Feature Elimination (RFE): Most machine learning algorithms are able to determine which features are important for predicting the response, but in many scenarios RFE is a good choice for automatically selecting the important features to include in a predictive model. Here below is the code to implement RFE.
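
A minimal sketch of the RFE call, assuming caret’s random-forest ranking functions and the subset sizes shown in the output below:

# Recursive feature elimination with a random-forest ranker,
# resampled by 10-fold cross-validation repeated 5 times
set.seed(100)                              # assumed seed
rfe_ctrl <- rfeControl(functions = rfFuncs,
                       method   = "repeatedcv",
                       number   = 10,
                       repeats  = 5)
rfe_fit <- rfe(x = bcancerTrain[, setdiff(names(bcancerTrain), "diagnosis")],
               y = bcancerTrain$diagnosis,
               sizes = c(4, 8, 9),         # candidate subset sizes
               rfeControl = rfe_ctrl)
rfe_fit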
## 
## Recursive feature selection
## 
## Outer resampling method: Cross-Validated (10 fold, repeated 5 times) 
## 
## Resampling performance over subset size:
## 
##  Variables Accuracy  Kappa AccuracySD KappaSD Selected
##          4   0.9437 0.8794    0.03638 0.07764         
##          8   0.9591 0.9120    0.02744 0.05934         
##          9   0.9613 0.9166    0.02615 0.05672        *
## 
## The top 5 variables (out of 9):
##    concave.points_worst, area_se, area_worst, concave.points_mean, texture_worst

In the above code, the cross-validation method is repeatedcv, which implements k-fold cross-validation repeated 5 times as the re-sampling method; this is rigorous enough for our case. From the above output, the model automatically selects the top five of the nine features, which seem to achieve the optimal accuracy.

In the next step, we will train the machine learning models on the training set with those top five features.

Correcting Class Imbalance with SMOTE

The target of our dataset is imbalanced, with far more patients testing benign than malignant. In such cases, a machine learning algorithm can simply classify all examples as the majority class and still achieve very high accuracy, yet such a classifier is practically useless. Hence, in order to maximize classification accuracy on both classes, a weight is typically assigned to the minority class, which relatively increases the penalty of misclassifying it compared to misclassifying the majority class. Among the popular approaches for handling class imbalance, this post uses an over-sampling technique, namely the Synthetic Minority Over-Sampling Technique (SMOTE).

## 
##     B     M 
## 0.627 0.373
## 
##     B     M 
## 0.627 0.373
## 
##     B     M 
## 0.628 0.372

The class labels are not balanced: 62.7% and 37.3% for benign (B) and malignant (M), respectively. To balance the target class, the SMOTE() function built into the DMwR R package implements this task.
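
A minimal sketch of this balancing step; the perc.over and perc.under values are assumed settings that yield the 50/50 split printed below (note that DMwR has since been archived on CRAN):

library(DMwR)   # provides SMOTE()

# Balance the training set only, never the test set
set.seed(100)                              # assumed seed
bcancerTrain <- SMOTE(diagnosis ~ ., data = bcancerTrain,
                      perc.over = 100,     # double the minority class
                      perc.under = 200)    # keep an equal number of majority cases
round(prop.table(table(bcancerTrain$diagnosis)), 3)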

## 
##   B   M 
## 0.5 0.5

Now the class labels are balanced at a rate of 50% for both benign and malignant breast cell tumors.

This technique is a heuristic approach based on the k-nearest neighbours (kNN) algorithm. It generates synthetic observations for the minority class by interpolating a new point between a minority-class observation \(x_i\) and one of its nearest neighbours \(x_{nn}\), i.e., \(x_{new} = x_i + \lambda\,(x_{nn} - x_i)\) with \(\lambda \in [0, 1]\).

Note: here are some important points one must keep in mind when using this technique.

  • This technique should be applied only to the training dataset. The aim is to balance the training data in order to train a classification algorithm properly, while model performance should be tested on the actual, unchanged test dataset, which is a representative part of the original data.

  • It is better to use stratified random sampling for the training/test split; this ensures that the class distribution in each of these splits is the same. Also, SMOTE should be applied only to datasets containing numeric variables, since it is based on distance calculations (kNN).

  • Various studies have shown that AUROC is usually the preferred model performance measure in the presence of imbalanced datasets. Also, SMOTE applies only to binary classification problems; for multiclass classification tasks, the SCUT algorithm combines SMOTE with cluster-based undersampling.

Here we are! Finally, let’s train multiple machine learning algorithms on the prepared and preprocessed breast cancer dataset to predict the diagnosis decision. The test set (bcancerTest) will be used only to evaluate performance (such as to compare models), and the training set (bcancerTrain) will be used for all other activities, such as training the predictive models.

Combining the Predictions of Multiple Models.

Twelve individual learning models have been trained on the training dataset using 10-fold cross-validation repeated five times as the re-sampling method, and those models have been tuned to find their optimal configurations. We did not forget to set the random seed that initializes the pseudo-random number generator, to avoid variability in the results and obtain trustworthy experimental results.
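
A minimal sketch of this multi-model training, assuming the caretEnsemble package; the seed and the trainControl options are assumptions consistent with the output below:

library(caretEnsemble)   # caretList(), caretStack(), modelCor()

# Shared resampling scheme: 10-fold CV repeated 5 times, saving class
# probabilities and held-out predictions for the later stacking step
set.seed(100)
train_ctrl <- trainControl(method = "repeatedcv", number = 10, repeats = 5,
                           classProbs = TRUE, savePredictions = "final")

# Train the twelve base learners in a single call
model_list <- caretList(diagnosis ~ ., data = bcancerTrain,
                        trControl = train_ctrl,
                        methodList = c("glmboost", "rpart", "rf", "xgbTree",
                                       "naive_bayes", "earth", "kknn",
                                       "svmRadial", "lda", "Linda",
                                       "C5.0Tree", "adaboost"))

# Compare resampled Accuracy and Kappa across the models
results <- resamples(model_list)
summary(results)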

## 
## Call:
## summary.resamples(object = results)
## 
## Models: glmboost, rpart, rf, xgbTree, naive_bayes, earth, kknn, svmRadial, lda, Linda, C5.0Tree, adaboost 
## Number of resamples: 50 
## 
## Accuracy 
##                  Min.   1st Qu.    Median      Mean   3rd Qu.      Max. NA's
## glmboost    0.8823529 0.9411765 0.9705882 0.9573529 0.9705882 1.0000000    0
## rpart       0.8529412 0.9117647 0.9338235 0.9270588 0.9411765 0.9852941    0
## rf          0.9558824 0.9852941 0.9852941 0.9879412 1.0000000 1.0000000    0
## xgbTree     0.9558824 0.9852941 0.9852941 0.9882353 1.0000000 1.0000000    0
## naive_bayes 0.8970588 0.9411765 0.9558824 0.9550000 0.9705882 1.0000000    0
## earth       0.9264706 0.9558824 0.9705882 0.9702941 0.9852941 1.0000000    0
## kknn        0.9705882 0.9889706 1.0000000 0.9955882 1.0000000 1.0000000    0
## svmRadial   0.9411765 0.9705882 0.9705882 0.9761765 0.9852941 1.0000000    0
## lda         0.8676471 0.9301471 0.9411765 0.9435294 0.9558824 1.0000000    0
## Linda       0.8970588 0.9411765 0.9558824 0.9588235 0.9816176 1.0000000    0
## C5.0Tree    0.9117647 0.9558824 0.9705882 0.9714706 0.9852941 1.0000000    0
## adaboost    0.9558824 0.9852941 1.0000000 0.9932353 1.0000000 1.0000000    0
## 
## Kappa 
##                  Min.   1st Qu.    Median      Mean   3rd Qu.      Max. NA's
## glmboost    0.7647059 0.8823529 0.9411765 0.9147059 0.9411765 1.0000000    0
## rpart       0.7058824 0.8235294 0.8676471 0.8541176 0.8823529 0.9705882    0
## rf          0.9117647 0.9705882 0.9705882 0.9758824 1.0000000 1.0000000    0
## xgbTree     0.9117647 0.9705882 0.9705882 0.9764706 1.0000000 1.0000000    0
## naive_bayes 0.7941176 0.8823529 0.9117647 0.9100000 0.9411765 1.0000000    0
## earth       0.8529412 0.9117647 0.9411765 0.9405882 0.9705882 1.0000000    0
## kknn        0.9411765 0.9779412 1.0000000 0.9911765 1.0000000 1.0000000    0
## svmRadial   0.8823529 0.9411765 0.9411765 0.9523529 0.9705882 1.0000000    0
## lda         0.7352941 0.8602941 0.8823529 0.8870588 0.9117647 1.0000000    0
## Linda       0.7941176 0.8823529 0.9117647 0.9176471 0.9632353 1.0000000    0
## C5.0Tree    0.8235294 0.9117647 0.9411765 0.9429412 0.9705882 1.0000000    0
## adaboost    0.9117647 0.9705882 1.0000000 0.9864706 1.0000000 1.0000000    0

In the above output you can clearly see how the algorithms performed in terms of Accuracy and Cohen’s Kappa statistic. This leads us to select both kernel k-nearest neighbours (kknn) and AdaBoost as the superior, optimally performing models relative to the others, because of their highest accuracy and kappa.

Here below are other ways to clearly visualize the performance of the models employed, including both pie charts and bar plots.

# Visualize the comparative model performance
library(ggplot2)     # plotting
library(dplyr)       # the %>% pipe
library(gridExtra)   # grid.arrange()

Compadata <- read.csv("C:/Users/Murera Gisa/Desktop/Predicting_Cancer/ModelPerformance.csv")

# Circular (polar) bar chart of the Kappa metric for each model
Plot_Kappa <- Compadata %>%
  ggplot(aes(x = Models, y = Kappa, fill = Kappa)) +
  geom_col() +
  coord_polar() +
  theme_bw() +
  labs(title = "Kappa Performance Metric.",
       caption = "Source: Breast Cancer Analytics@mgisa") +
  theme(legend.position = "none",
        axis.title  = element_blank(),
        axis.line   = element_blank(),
        axis.text.y = element_blank(),
        axis.text.x = element_text(angle = 45, face = "bold",
                                   colour = "purple", size = 15),
        plot.title  = element_text(size = 16, face = "bold",
                                   color = "forestgreen"))

# Circular (polar) bar chart of the Accuracy metric for each model
Plot_Accuracy <- Compadata %>%
  ggplot(aes(x = Models, y = Accuracy, fill = Accuracy)) +
  geom_col() +
  coord_polar() +
  theme_bw() +
  labs(title = "Accuracy Performance Metric.",
       caption = "Source: Breast Cancer Analytics@mgisa") +
  theme(legend.position = "none",
        axis.title  = element_blank(),
        axis.line   = element_blank(),
        axis.text.y = element_blank(),
        axis.text.x = element_text(angle = 45, face = "bold",
                                   colour = "orange", size = 15),
        plot.title  = element_text(size = 16, face = "bold",
                                   color = "blue"))

# Draw the two charts side by side
grid.arrange(Plot_Kappa, Plot_Accuracy, nrow = 1)

Cohen’s Kappa will be widely used here as the model performance metric, since it is an excellent measure when the classes are imbalanced, as in our breast cancer prediction case. Cohen’s kappa essentially measures how well the classifier performed compared to how well it would have performed simply by chance: for observed accuracy \(p_o\) and chance (expected) accuracy \(p_e\), \(\kappa = \frac{p_o - p_e}{1 - p_e}\). Additionally, the kappa statistic indicates the level of agreement between the actual and predicted breast cancer diagnostic decisions.

Here below are bar plots of the ML models’ performance, showing both Cohen’s Kappa statistic and Accuracy for each individual model employed.

All the above bar plots, box plots, and pie charts show kernel kNN and AdaBoost as the superior classification algorithms; they take the lead in breast cancer detection and classification relative to the others, with the highest Cohen’s kappa and accuracy.

Those predictive models have high kappa scores because of the big difference between their accuracies and the null error rate, i.e., the error committed when always predicting the majority class.

As previously mentioned, the “optimal” models are selected as the candidate meta-learners for the stacking ensemble technique for breast cancer prediction.
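
A minimal sketch of the stacking step, using kknn and then adaboost as the meta-learner (the control settings are assumptions consistent with the output below):

# Stack the twelve base models with two different meta-learners
stack_ctrl <- trainControl(method = "repeatedcv", number = 10, repeats = 5,
                           classProbs = TRUE)

set.seed(100)
stack_kknn <- caretStack(model_list, method = "kknn",
                         metric = "Kappa", trControl = stack_ctrl)

set.seed(100)
stack_adaboost <- caretStack(model_list, method = "adaboost",
                             metric = "Kappa", trControl = stack_ctrl)

print(stack_kknn)
print(stack_adaboost)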

## A kknn ensemble of 12 base models: glmboost, rpart, rf, xgbTree, naive_bayes, earth, kknn, svmRadial, lda, Linda, C5.0Tree, adaboost
## 
## Ensemble results:
## k-Nearest Neighbors 
## 
## 3400 samples
##   12 predictor
##    2 classes: 'B', 'M' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 5 times) 
## Summary of sample sizes: 3060, 3060, 3060, 3060, 3060, 3060, ... 
## Resampling results across tuning parameters:
## 
##   kmax  Accuracy   Kappa    
##   5     0.9984706  0.9969412
##   7     0.9984706  0.9969412
##   9     0.9984706  0.9969412
## 
## Tuning parameter 'distance' was held constant at a value of 2
## Tuning
##  parameter 'kernel' was held constant at a value of optimal
## Kappa was used to select the optimal model using the largest value.
## The final values used for the model were kmax = 9, distance = 2 and kernel
##  = optimal.
## A adaboost ensemble of 12 base models: glmboost, rpart, rf, xgbTree, naive_bayes, earth, kknn, svmRadial, lda, Linda, C5.0Tree, adaboost
## 
## Ensemble results:
## AdaBoost Classification Trees 
## 
## 3400 samples
##   12 predictor
##    2 classes: 'B', 'M' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 5 times) 
## Summary of sample sizes: 3060, 3060, 3060, 3060, 3060, 3060, ... 
## Resampling results across tuning parameters:
## 
##   nIter  method         Accuracy   Kappa    
##    50    Adaboost.M1    0.9990588  0.9981176
##    50    Real adaboost  0.9986471  0.9972941
##   100    Adaboost.M1    0.9990588  0.9981176
##   100    Real adaboost  0.9988235  0.9976471
##   150    Adaboost.M1    0.9988235  0.9976471
##   150    Real adaboost  0.9988235  0.9976471
## 
## Kappa was used to select the optimal model using the largest value.
## The final values used for the model were nIter = 50 and method = Adaboost.M1.

By combining the predictions of the classifiers using the kernel kNN meta-model, we have lifted accuracy and Cohen’s kappa above \(99.50\%\) and \(99.00\%\) respectively, which is an impressive improvement over using kkNN alone. The same holds for the improvement over using AdaBoost alone on the breast cancer dataset, as observed above.

Furthermore, one may want to try passing different types of models, both high- and low-performing, rather than just sticking to high-accuracy models, to caretStack to check their performance levels.

Correlation of Model Results

Stacking ensembles tend to perform best when the predictions of the base models are weakly correlated with each other; this suggests that the predictive algorithms are skillful, but in different ways. To allow a new classifier to figure out how to get the best from each model for an improved score, we need to test the correlation among the base learners’ predictions.

If the predictions of two learning models are highly correlated (\(> 0.9\)), they are making the same or very similar predictions, and most of the time this worsens and reduces the ultimate benefit of combining the multiple predictions.
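
A minimal sketch of this check, using caret’s modelCor() on the resamples object created above:

# Pairwise correlation of the base models' resampled results
model_cor <- modelCor(results)
round(model_cor, 2)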

From the above results, we notice that all pairs of predictions have generally low correlation compared to the set threshold. This means that we will include all classifiers in the stacking ensemble to predict the breast cancer diagnostic decision.

Excellent! We can now implement the stacking ensemble technique.

Ensemble Technique for Breast Cancer Prediction.

Finally, we can predict the breast cancer diagnosis decision using the combined models. We will then explore the superiority of the ensemble predictive model by analyzing, through confusion matrices, the different types of errors made in ML-based classification.
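
A minimal sketch of the prediction step on the held-out test set (object names follow the sketches above):

# Predict diagnosis labels on the test set with both stacked models
pred_kknn     <- predict(stack_kknn, newdata = bcancerTest)
pred_adaboost <- predict(stack_adaboost, newdata = bcancerTest)
head(pred_kknn, 10)
head(pred_adaboost, 10)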

##  [1] M M M M M M M M M M
## Levels: B M
##  [1] M M M M M M M M M M
## Levels: B M

Exploration of Confusion Matrix for Winning Classifiers.

A confusion matrix is a performance measurement technique for machine learning classification. It is a table that helps us understand the performance of the classification models, showing where the classifiers get confused during the recognition, classification, and prediction task.
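
A minimal sketch producing the two matrices below, assuming caret’s confusionMatrix():

# Confusion matrices of both stacked models on the test set
confusionMatrix(pred_kknn, bcancerTest$diagnosis)
confusionMatrix(pred_adaboost, bcancerTest$diagnosis)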

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  B  M
##          B 69  2
##          M  2 40
##                                           
##                Accuracy : 0.9646          
##                  95% CI : (0.9118, 0.9903)
##     No Information Rate : 0.6283          
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.9242          
##                                           
##  Mcnemar's Test P-Value : 1               
##                                           
##             Sensitivity : 0.9718          
##             Specificity : 0.9524          
##          Pos Pred Value : 0.9718          
##          Neg Pred Value : 0.9524          
##              Prevalence : 0.6283          
##          Detection Rate : 0.6106          
##    Detection Prevalence : 0.6283          
##       Balanced Accuracy : 0.9621          
##                                           
##        'Positive' Class : B               
## 
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  B  M
##          B 69  2
##          M  2 40
##                                           
##                Accuracy : 0.9646          
##                  95% CI : (0.9118, 0.9903)
##     No Information Rate : 0.6283          
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.9242          
##                                           
##  Mcnemar's Test P-Value : 1               
##                                           
##             Sensitivity : 0.9718          
##             Specificity : 0.9524          
##          Pos Pred Value : 0.9718          
##          Neg Pred Value : 0.9524          
##              Prevalence : 0.6283          
##          Detection Rate : 0.6106          
##    Detection Prevalence : 0.6283          
##       Balanced Accuracy : 0.9621          
##                                           
##        'Positive' Class : B               
## 

The above confusion matrix charts show how the stacked ensemble classification models performed when predicting the breast cancer diagnostic decision. Additionally, they give insightful information on the errors, and types of errors, made during the prediction problem.

Finally, let’s tabulate the final predicted breast cancer diagnostic decisions. The table shows randomly generated patient identities, based on the last patient_id appearing in the actual dataset, together with the predicted diagnostic labels.
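
A minimal sketch of the tabulation (the dplyr verbs are assumptions matching the tibble printed below):

# Count and percentage of each predicted diagnosis label
tibble(Diagnosis = pred_kknn) %>%
  count(Diagnosis) %>%
  mutate(Percentage = round(100 * n / sum(n), 1))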

## # A tibble: 2 x 3
##   Diagnosis     n Percentage
##   <fct>     <int>      <dbl>
## 1 B            71       62.8
## 2 M            42       37.2

The above table indicates the total number and percentage of benign and malignant breast tumors predicted by the stacking ensemble technique.

Finally, let’s create the submission file, named BreastCancer_Class, which includes each predicted breast cancer diagnostic decision in text format.
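
A minimal sketch of writing the submission file (the CSV format is an assumption):

# Save the predicted diagnostic decisions to a submission file
BreastCancer_Class <- data.frame(Diagnosis = as.character(pred_kknn))
write.csv(BreastCancer_Class, "BreastCancer_Class.csv", row.names = FALSE)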

Conclusion.

In this blog post, a stacked ensemble model was developed from twelve different base learners carefully selected from the family of supervised machine learning algorithms. Those learning models were trained on a class-imbalanced breast cancer dataset and then combined (the stacked ensemble technique) with the crucial intention of optimizing the overall performance. The optimal models were used to train the stacked ensemble (parallel training that encourages a division of labor, a mixture of experts), which was then used to accurately predict the breast cancer diagnostic decision.

This information can serve as a reference, and also as a template others can use to build a standardised, relevant machine learning workflow for different purposes.



References

  1. Thomas G. Dietterich (2013), Ensemble Methods in Machine Learning, Oregon State University, Corvallis, Oregon, USA.

  2. Shahnorbanun Sahran et al. (November 5, 2018), Machine Learning Methods for Breast Cancer Diagnostic, DOI: 10.5772/intechopen.79446

  3. McGuire A, Brown JA, Malone C, McLaughlin R, Kerin MJ (May 2015). “Effects of age on the detection and management of breast cancer”. doi:10.3390/cancers7020815.

  4. Balasubramanian R, Rolph R, Morgan C, Hamed H (2019). “Genetics of breast cancer: management strategies and risk-reducing surgery”. doi:10.12968/hmed.2019.80.12.720.

  5. “World Cancer Report”. International Agency for Research on Cancer. (2008). Archived from the original on 31 December 2011. Retrieved 26 February 2011. (cancer statistics often exclude non-melanoma skin cancers such as basal-cell carcinoma, which are common but rarely fatal)

  6. Laurance, Jeremy (29 September 2006). “Breast cancer cases rise 80% since Seventies”. The Independent. London, UK.

  7. Opitz, D.; Maclin, R. (1999). “Popular ensemble methods: An empirical study”. Journal of Artificial Intelligence Research. 11: 169–198. doi:10.1613/jair.614.

  8. Rousseauw, J., du Plessis, J., Benade, A., Jordaan, P., Kotze, J. and Ferreira, J. (1983). Coronary risk factor screening in three rural communities, South African Medical Journal 64: 430–436.

  9. S.Murthy, J.Kurumathur and B.R. Reddy. (2016). Online International Conference on Green Engineering and Technologies (IC-GET).

  10. J. Gorodkin, (2004). Computational Biology and Chemistry, 2004,28, 367-374