Worldwide, breast cancer is the most common invasive cancer in women, affecting about 1 in 7 (\(14\%\)) women. In 2008, breast cancer caused \(458,503\) deaths worldwide (i.e., \(13.7\%\) of cancer deaths in women and \(6.0\%\) of all cancer deaths for men and women together).
The incidence of breast cancer varies greatly around the world: it is lowest in developing countries and highest in developed countries. The number of cases worldwide has increased significantly since the 1970s, a phenomenon partly attributed to modern lifestyles. Breast cancer is also strongly related to age, with only \(5\%\) of all breast cancers occurring in women under 40 years old.
Based on U.S. statistics, in 2015 there were 2.8 million women affected by breast cancer worldwide. Age-adjusted deaths from breast cancer per 100,000 women rose only slightly, from 31.4 in 1975 to 33.2 in 1989, and have since declined steadily to 20.5 in 2014.
Breast cancer is one of the most severe and common reproductive cancers, and it mostly affects women. A breast tumor is an abnormal growth of tissue in the breast; it may be felt as a lump or show up as nipple discharge or a change of skin texture around the nipple region. Cancers are abnormal cells that divide uncontrollably and are able to invade other tissues. Cancer cells can spread to other parts of the body through the blood and lymphatic systems. Breast cancer has become a major health issue in the past 50 years, and its incidence has increased in recent years.
Today, advances in technology, medicine and health systems are accelerating progress in cancer detection and diagnosis, and a myriad of prevention, detection and diagnosis approaches have been put in place to reduce the number of new cancer cases in the community. Among these techniques, artificial intelligence (AI) and machine learning (ML) cannot be left out, since they support cancer detection and prediction significantly, at times beyond human expectations. For this reason, this work employs a machine learning based ensemble approach to classify and predict the breast cancer diagnosis decision.
A breast mass or lump is generally one of the possible signs of breast cancer in men or women. Breast cancer can also cause several additional changes to the skin on and around the breast. Below are potential signs and symptoms of breast cancer that may occur without a noticeable lump in the breast (note: not every person experiences all of them):
Change to the skin’s texture: Breast cancer can cause changes and inflammation in skin cells that can lead to texture changes. For example, the skin may look sunburned or extremely dry, or it may thicken in any part of the breast.
Nipple discharge: An individual may observe discharge from the nipple, which can be thin or thick and can range in color from clear to milky to yellow, green, or red. However, nipple discharge does not always indicate breast cancer. Other possible causes include breast infections, a side effect of birth control pills, a side effect of certain medications, variations in body physiology, and certain medical conditions such as thyroid disease.
Dimpling: Skin dimpling can sometimes be a sign of inflammatory breast cancer, an aggressive type of breast cancer. Cancer cells can cause a buildup of lymph fluid in the breast that leads to swelling as well as dimpling or pitted skin. Doctors call this change in the skin’s appearance “peau d’orange” because the dimpled skin resembles the surface of an orange.
Breast or nipple pain: Breast cancer can cause changes in skin cells that lead to feelings of pain, tenderness, and discomfort in the breast. Although breast cancer is often painless, it is important not to ignore any signs that could be due to breast cancer.
Redness: Breast cancer can cause changes to the skin that may make it appear discolored or even bruised. The skin may be red or purple or have a bluish tint.
Swelling: Breast cancer can cause the entire breast or an area of the breast to swell.
People should not panic or be fearful when they notice breast changes. Aging, changes in hormone levels, and other factors can lead to breast changes throughout a person’s lifetime. However, they should be proactive about their health and visit a doctor to determine the cause of any breast symptoms.
According to the National Cancer Institute (https://www.cancer.gov/), a tumor is “an abnormal mass of tissue that results when cells divide more than they should or do not die when they should.”
In a healthy human body, cells grow, divide and replace each other. As new cells form, the old ones die. When a person has cancer, new cells form when the body doesn’t need them. If there are too many cells, a group of cells, or tumor, can develop.
Tumors in the human body can be non-cancerous (benign) or cancerous (malignant). Malignant tumors can spread to other parts of the body. A tumor develops when cells reproduce too quickly. Tumors can vary in size from a tiny nodule to a large mass, depending on the type, and they can appear almost anywhere on the body.
According to the National Cancer Institute, there are three main types of tumor:
Benign: These tumors are not cancerous. Most benign tumors are not harmful. They either cannot spread or grow, or they do so very slowly. Once removed, they do not generally return. However, they can cause pain or other problems if they press against nerves or blood vessels or if they trigger the overproduction of hormones, as in the endocrine system.
Premalignant: In these tumors, the cells are not yet cancerous, but they have the potential to become malignant.
Malignant: Malignant tumors are cancerous. They develop when cells grow uncontrollably, and the cells can spread to other parts of the body through a process called metastasis. If they continue to grow and spread, the disease becomes life threatening. The cancer cells that move to other parts of the body are the same as the original ones, and they can invade other organs. If breast cancer spreads to the liver, for example, the cancer cells in the liver are still breast cancer cells.
It is not always clear how a tumor will act in the future. Some benign tumors can become premalignant and then malignant. For this reason, it is better to monitor any growth through regular health checks with a physician.
Mayo Clinic (https://mayoclinic.org) defines a breast cancer risk factor as anything that makes it more likely you will get breast cancer. But having one or several breast cancer risk factors doesn’t necessarily mean you will develop breast cancer. Many women who developed breast cancer had no known risk factors other than simply being women.
Factors that are associated with an increased risk of breast cancer include:
Being female. Women are much more likely than men to develop breast cancer.
Age. The risk of getting breast cancer increases as you age. Nearly \(80\%\) of breast cancers are found in women over the age of \(50\).
Personal history of breast conditions. If you have had a breast biopsy that found lobular carcinoma in situ (LCIS) or atypical hyperplasia of the breast, you have an increased risk of breast cancer.
A personal history of breast cancer. If you have had breast cancer in one breast, you have an increased risk of developing cancer in the other breast.
A family history of breast cancer. If your mother, sister or daughter was diagnosed with breast cancer, particularly at a young age (before \(40\)), your risk of breast cancer is high. Having other relatives with breast cancer may also raise the risk.
Inherited genes. Certain gene mutations that increase the risk of breast cancer can be passed from parents to children. Women with certain genetic mutations, including changes to the BRCA1 and BRCA2 genes, are at higher risk of developing breast cancer during their lifetime.
Radiation exposure. If you received radiation treatments to your chest as a child or young adult, your risk of breast cancer is increased.
Beginning your period at a young age or beginning menopause at an older age. Women who menstruate for the first time at an early age (before \(12\)) and women who go through menopause late (after age \(55\)) are more likely to develop breast cancer.
Having your first child at an older age or never having been pregnant. Women who give birth to their first child after age \(30\) and women who have never been pregnant have a greater risk of breast cancer.
There are other risk factors for breast cancer, and having them does not necessarily mean that a person will develop breast cancer. It is advisable to routinely ask your doctor about breast cancer screening and to practice preventive measures.
Today’s personalized medicine increases the workload and complexity that physicians face in detecting and diagnosing several types of cancer. In hospitals, radiologists and pathologists are the key players in deciding on the presence and diagnosis of cancer. The radiology findings are submitted to the pathologist for further diagnosis. Together they form the core of cancer diagnosis, based on the anatomical and physiological information gained from Nuclear Magnetic Resonance Imaging (NMRI) or CT scanner images of the tumor inside the human body. In several hospitals, communication between them still happens on paper, with each specialist writing a separate report on the same patient. This can slow down diagnostic decision making and reduce the patient survival rate.
With the enormous technological and scientific advances currently occurring in all fields, the opportunity has emerged to develop an integrated diagnostic reporting system that supports both medical fields (radiology and pathology). Such a system improves the overall quality of patient care through accurate communication of DICOM (Digital Imaging and Communications in Medicine) images. In this work, we aim to contribute to faster processing of cancer information and early preparedness by focusing on accurate diagnostic prediction.
The emergence of Fourth Industrial Revolution (4IR) technology has allowed huge amounts of data (big data) to be collected, which adds to the complexity of the radiology and pathology workload. To address these challenges, artificial intelligence is leveraged to improve medical diagnostics. Our study therefore employs an ensemble technique, which combines several machine learning techniques into a single predictive model, in order to predict the breast cancer diagnostic decision more accurately.
This section clarifies the approach used in this work and gives an overview of the machine learning process and the ensemble technique used to tackle the problem.
Machine Learning (ML) is the science of building algorithms that allow software applications to become more accurate at predicting outcomes without being explicitly programmed, which simplifies human work. ML comprises a broad class of statistical algorithms that iteratively improve in response to training data to build models for autonomous prediction. ML is driven by data: it enables computers to learn from data and make automatic, data-based decisions and predictions.
Machine learning is often categorized into supervised, unsupervised and reinforcement learning. Supervised learning algorithms are used in this study to distinguish benign from malignant breast tumors, and their combination is used to predict the diagnosis labels. Below is the summarized workflow of the supervised machine learning algorithms employed.
Workflow of Employed Machine Learning Algorithms (Source: https://www.wikipedia.org)
In short, the above workflow chart summarizes the steps undertaken: data acquisition, preparation and transformation, feature engineering and pre-processing, hyper-parameter tuning and model building, model testing and validation, and finally model comparison based on performance.
In statistics and machine learning, ensemble methods combine multiple learning algorithms to obtain better predictive performance than any single learning algorithm. An ensemble is itself a supervised learning algorithm, because it can be trained and then used to make predictions; the trained ensemble represents a single hypothesis.
The three most popular methods for combining the predictions from different models are:
Bagging. Building multiple models (typically of the same type) from different subsamples of the training dataset.
Boosting. Building multiple models (typically of the same type) each of which learns to fix the prediction errors of a prior model in the chain.
Stacking. Building multiple models (typically of differing types) and a supervisor model that learns how to best combine the predictions of the primary models.
This study will not explain and employ each of these methods; we focus only on the stacking technique (sometimes called stacked generalization). In this method, a supervisor model learns how best to combine the predictions of the multiple sub-models.
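To make the idea of stacking concrete, here is a minimal base-R sketch on simulated data. It is not the caretEnsemble workflow used later in this post: the data, the two single-feature logistic regressions used as base learners, and all object names (`fit_a`, `fit_b`, `meta`, `oof`) are assumptions made only for illustration. The key point it shows is that the supervisor model is trained on out-of-fold predictions, so it never sees predictions that a base learner made on its own training rows.

```r
# Minimal stacking sketch in base R on simulated data (all names are illustrative).
set.seed(42)
n  <- 400
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- rbinom(n, 1, plogis(1.5 * x1 - 1.0 * x2))
dat <- data.frame(y = y, x1 = x1, x2 = x2)

# Hold out a test set; the rest is used for base learners and the supervisor.
test_idx <- sample(n, 100)
train <- dat[-test_idx, ]
test  <- dat[test_idx, ]

# Two deliberately different base learners, each seeing one feature only.
fit_a <- function(d) glm(y ~ x1, data = d, family = binomial)
fit_b <- function(d) glm(y ~ x2, data = d, family = binomial)

# Out-of-fold predictions: every training row is predicted by models
# that were fitted without that row.
k     <- 5
folds <- sample(rep(seq_len(k), length.out = nrow(train)))
oof   <- data.frame(pa = numeric(nrow(train)), pb = numeric(nrow(train)))
for (f in seq_len(k)) {
  held <- folds == f
  oof$pa[held] <- predict(fit_a(train[!held, ]), train[held, ], type = "response")
  oof$pb[held] <- predict(fit_b(train[!held, ]), train[held, ], type = "response")
}
oof$y <- train$y

# Supervisor (meta) model: learns how to weight the base predictions.
meta <- glm(y ~ pa + pb, data = oof, family = binomial)

# At prediction time: refit base learners on all training rows, then stack.
base_test <- data.frame(
  pa = predict(fit_a(train), test, type = "response"),
  pb = predict(fit_b(train), test, type = "response")
)
stack_prob <- predict(meta, base_test, type = "response")
stack_acc  <- mean((stack_prob > 0.5) == test$y)
print(stack_acc)
```

The coefficients of `meta` show how much weight each base learner receives, which is what distinguishes stacking from simple averaging.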
A lot of exciting data-related activities lie ahead. To keep things simple, this tutorial is organized to cover the following: data preparation and pre-processing, data visualization and identification of important variables, feature selection using different approaches, training and tuning the models, comparing models on several performance metrics, and finally ensembling the predictions.
The dataset is a real-valued, continuous, multivariate set of measurements extracted from breast cell nuclei. Features are computed from a digitized image of a fine needle aspirate (FNA) of a breast mass. They describe characteristics of the cell nuclei present in the image. The data set is publicly available from various repositories, including Kaggle and the UCI Machine Learning Repository. The table below describes the data set.
Table 1: Breast Cancer Diagnostic Data Description
Here below are the loaded libraries and the dataset.
#Loading libraries
#____________________
library(tidyverse)
library(gganimate)
library(gifski)
library(formatR)
library(gridExtra)
library(grid)
library(vcd)
library(knitr)
library(corrplot)
library(ggcorrplot)
library(scales)
library(lme4)
library(DMwR)
library(InformationValue)
library(ROCR)
library(ggthemes)
library(patchwork)
library(caret)
library(caretEnsemble)
library(skimr)
library(plotly)
library(table1)
library(vtable)
library(FSelector)
library(doSNOW)
library(doParallel)
library(MLmetrics)
library(parallel)
library(iterators)
library(DT)
library(foreach)
#________________________
#Importing dataset
bcancer<-read.csv("/Users/Murera Gisa/Desktop/Predicting_Cancer/data.csv")
vtable(bcancer, factor.limit =0) # Data Variable documentation
| Name | Class | Values |
|---|---|---|
| patient_id | integer | Num: 8670 to 911320502 |
| radius_mean | numeric | Num: 6.981 to 28.11 |
| texture_mean | numeric | Num: 9.71 to 39.28 |
| perimeter_mean | numeric | Num: 43.79 to 188.5 |
| area_mean | numeric | Num: 143.5 to 2501 |
| smoothness_mean | numeric | Num: 0.053 to 0.163 |
| compactness_mean | numeric | Num: 0.019 to 0.345 |
| concavity_mean | numeric | Num: 0 to 0.427 |
| concave.points_mean | numeric | Num: 0 to 0.201 |
| symmetry_mean | numeric | Num: 0.106 to 0.304 |
| fractal_dimension_mean | numeric | Num: 0.05 to 0.097 |
| radius_se | numeric | Num: 0.112 to 2.873 |
| texture_se | numeric | Num: 0.36 to 4.885 |
| perimeter_se | numeric | Num: 0.757 to 21.98 |
| area_se | numeric | Num: 6.802 to 542.2 |
| smoothness_se | numeric | Num: 0.002 to 0.031 |
| compactness_se | numeric | Num: 0.002 to 0.135 |
| concavity_se | numeric | Num: 0 to 0.396 |
| concave.points_se | numeric | Num: 0 to 0.053 |
| symmetry_se | numeric | Num: 0.008 to 0.079 |
| fractal_dimension_se | numeric | Num: 0.001 to 0.03 |
| radius_worst | numeric | Num: 7.93 to 36.04 |
| texture_worst | numeric | Num: 12.02 to 49.54 |
| perimeter_worst | numeric | Num: 50.41 to 251.2 |
| area_worst | numeric | Num: 185.2 to 4254 |
| smoothness_worst | numeric | Num: 0.071 to 0.223 |
| compactness_worst | numeric | Num: 0.027 to 1.058 |
| concavity_worst | numeric | Num: 0 to 1.252 |
| concave.points_worst | numeric | Num: 0 to 0.291 |
| symmetry_worst | numeric | Num: 0.156 to 0.664 |
| fractal_dimension_worst | numeric | Num: 0.055 to 0.208 |
| diagnosis | factor | ‘B’ ‘M’ |
## The dimension of data set is ( 569 32 )
## radius_mean texture_mean perimeter_mean area_mean
## Min. : 6.981 Min. : 9.71 Min. : 43.79 Min. : 143.5
## 1st Qu.:11.700 1st Qu.:16.17 1st Qu.: 75.17 1st Qu.: 420.3
## Median :13.370 Median :18.84 Median : 86.24 Median : 551.1
## Mean :14.127 Mean :19.29 Mean : 91.97 Mean : 654.9
## 3rd Qu.:15.780 3rd Qu.:21.80 3rd Qu.:104.10 3rd Qu.: 782.7
## Max. :28.110 Max. :39.28 Max. :188.50 Max. :2501.0
## smoothness_mean compactness_mean concavity_mean concave.points_mean
## Min. :0.05263 Min. :0.01938 Min. :0.00000 Min. :0.00000
## 1st Qu.:0.08637 1st Qu.:0.06492 1st Qu.:0.02956 1st Qu.:0.02031
## Median :0.09587 Median :0.09263 Median :0.06154 Median :0.03350
## Mean :0.09636 Mean :0.10434 Mean :0.08880 Mean :0.04892
## 3rd Qu.:0.10530 3rd Qu.:0.13040 3rd Qu.:0.13070 3rd Qu.:0.07400
## Max. :0.16340 Max. :0.34540 Max. :0.42680 Max. :0.20120
## symmetry_mean fractal_dimension_mean
## Min. :0.1060 Min. :0.04996
## 1st Qu.:0.1619 1st Qu.:0.05770
## Median :0.1792 Median :0.06154
## Mean :0.1812 Mean :0.06280
## 3rd Qu.:0.1957 3rd Qu.:0.06612
## Max. :0.3040 Max. :0.09744
## 'data.frame': 569 obs. of 10 variables:
## $ radius_mean : num 18 20.6 19.7 11.4 20.3 ...
## $ texture_mean : num 10.4 17.8 21.2 20.4 14.3 ...
## $ perimeter_mean : num 122.8 132.9 130 77.6 135.1 ...
## $ area_mean : num 1001 1326 1203 386 1297 ...
## $ smoothness_mean : num 0.1184 0.0847 0.1096 0.1425 0.1003 ...
## $ compactness_mean : num 0.2776 0.0786 0.1599 0.2839 0.1328 ...
## $ concavity_mean : num 0.3001 0.0869 0.1974 0.2414 0.198 ...
## $ concave.points_mean : num 0.1471 0.0702 0.1279 0.1052 0.1043 ...
## $ symmetry_mean : num 0.242 0.181 0.207 0.26 0.181 ...
## $ fractal_dimension_mean: num 0.0787 0.0567 0.06 0.0974 0.0588 ...
#___________DATA CLEANING_____________
#Removing the patient identity
bcancer <- bcancer %>%
dplyr::select(-patient_id)
# Check NA entries
sapply(bcancer[,2:11], function(x) sum(is.na(x))) # Total NA per column
## texture_mean perimeter_mean area_mean
## 0 0 0
## smoothness_mean compactness_mean concavity_mean
## 0 0 0
## concave.points_mean symmetry_mean fractal_dimension_mean
## 0 0 0
## radius_se
## 0
## The total missing value of data set is ( 0 )
We don’t have any missing data, so we’re good to go. The main goal with this dataset is to predict which of the two breast cancer diagnosis decisions (malignant, M, or benign, B) applies to the patient being tested.
The data set has \(569\) observations and \(32\) columns (30 numeric features plus the patient ID and the diagnosis label). Feature selection is a critical part of feature engineering. The R FSelector package has been used to eliminate unnecessary features prior to building a prediction model. The random.forest.importance() function rates the importance of each feature for classifying the outcome. It returns a data frame containing the name of each attribute and its importance value based on the mean decrease in accuracy.
# Note: `diagonis` (sic) is a new column holding a factor copy of `diagnosis`;
# it is renamed back to `diagnosis` after feature selection below.
# Caution: the formula diagnosis ~ . also scores this copy as a predictor.
bcancer$diagonis <- as.factor(bcancer$diagnosis)
attribute.scores <- random.forest.importance(diagnosis ~ ., bcancer)
attribute.scores
From the returned data frame, we can select the best features according to their importance. The cutoff.biggest.diff() function automatically identifies the features whose importance is significantly higher than that of the others. cutoff.k returns the \(k\) features with the highest importance values (here, the top ten). Similarly, cutoff.k.percent returns the given percentage of features with the highest importance values.
## [1] "diagonis"
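The three cutoff strategies described above can be illustrated in a few lines of base R. This is only a sketch of the idea, not FSelector's implementation: the importance values are made up, and the helper names (`top_k`, `top_k_percent`, `biggest_gap`) are hypothetical. It assumes the scores are stored the way the text describes, as a one-column data frame with feature names as row names.

```r
# Illustrative importance table: one numeric column, feature names as row names.
# The values are invented for this example.
scores <- data.frame(
  attr_importance = c(12.1, 3.4, 25.7, 8.9, 19.3),
  row.names = c("f_a", "f_b", "f_c", "f_d", "f_e")
)

# Top-k selection: sort by importance and keep the first k names.
top_k <- function(scores, k) {
  ord <- order(scores[[1]], decreasing = TRUE)
  rownames(scores)[ord][seq_len(min(k, nrow(scores)))]
}

# Top-fraction selection: convert the fraction into a count first.
top_k_percent <- function(scores, frac) {
  top_k(scores, max(1, round(frac * nrow(scores))))
}

# Biggest-gap cutoff: keep everything above the largest drop
# between neighbouring (sorted) importance values.
biggest_gap <- function(scores) {
  ord  <- order(scores[[1]], decreasing = TRUE)
  vals <- scores[[1]][ord]
  gaps <- vals[-length(vals)] - vals[-1]
  rownames(scores)[ord][seq_len(which.max(gaps))]
}

top_k(scores, 3)            # "f_c" "f_e" "f_a"
top_k_percent(scores, 0.4)  # "f_c" "f_e"
biggest_gap(scores)         # "f_c" "f_e"
```

A fixed top-k is predictable but arbitrary, whereas the biggest-gap rule adapts to the shape of the scores and can return very few features when one score dominates.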
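Let’s tabulate the data set with the top ten features.
# Top_10_features is expected to be a character vector of column names
bcancer<-bcancer%>%
dplyr::select(all_of(Top_10_features))%>%
rename(diagnosis=diagonis) # restore the correctly spelled label name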
#Tabulating
datatable(bcancer, colnames = c('ID' = 0),class = 'cell-border stripe',
caption = htmltools::tags$caption(style = 'caption-side: bottom; text-align: center;',
'Table 2: ',htmltools::em("Source: UCI ML Repository."))
#setting header black
,options = list(autoWidth = FALSE,
initComplete = JS(
"function(settings, json) {",
"$(this.api().table().header()).css({'background-color': '#000', 'color': 'blue'});","}")))
#Descriptive Statistics and getting insight from data
#______________________
# Tuning Descriptive statistics table
table1::table1(~area_worst + radius_worst + perimeter_worst + concave.points_worst +
concave.points_mean + area_se + texture_worst + concavity_worst +
texture_mean, data = bcancer)
| | Overall (N=569) |
|---|---|
| area_worst | |
| Mean (SD) | 881 (569) |
| Median [Min, Max] | 687 [185, 4250] |
| radius_worst | |
| Mean (SD) | 16.3 (4.83) |
| Median [Min, Max] | 15.0 [7.93, 36.0] |
| perimeter_worst | |
| Mean (SD) | 107 (33.6) |
| Median [Min, Max] | 97.7 [50.4, 251] |
| concave.points_worst | |
| Mean (SD) | 0.115 (0.0657) |
| Median [Min, Max] | 0.0999 [0, 0.291] |
| concave.points_mean | |
| Mean (SD) | 0.0489 (0.0388) |
| Median [Min, Max] | 0.0335 [0, 0.201] |
| area_se | |
| Mean (SD) | 40.3 (45.5) |
| Median [Min, Max] | 24.5 [6.80, 542] |
| texture_worst | |
| Mean (SD) | 25.7 (6.15) |
| Median [Min, Max] | 25.4 [12.0, 49.5] |
| concavity_worst | |
| Mean (SD) | 0.272 (0.209) |
| Median [Min, Max] | 0.227 [0, 1.25] |
| texture_mean | |
| Mean (SD) | 19.3 (4.30) |
| Median [Min, Max] | 18.8 [9.71, 39.3] |
#skimmed <- skim(bcancer) #You can uncomment it to see the outputs
#print(skimmed)
#Convert output diagnosis into factor and finding both Malignant and Benign rates (%)
#______________________________
factor_variable_position<-c(1)
bcancer<-bcancer%>% mutate_at(.,vars(factor_variable_position),~as.factor(.))
Patient_percentage<-round(prop.table(table(bcancer$diagnosis))*100,digits = 2)
print(Patient_percentage)
##
## B M
## 62.74 37.26
In other words, \(62.74\%\) of the patients in the data set tested benign and \(37.26\%\) tested malignant. Below are the selected columns.
diagnosis: The diagnosis decision indicating whether a patient’s breast tumor tested malignant (M) or benign (B).
perimeter_worst: The “worst” or largest value (mean of the three largest values) of the cell nuclei perimeter.
area_worst: The “worst” or largest value of the cell nuclei area.
concave.points_worst: The “worst” or largest value of the number of concave portions of the contour.
radius_worst: The “worst” or largest value of the radius, i.e. the mean of distances from the center to points on the perimeter.
texture_worst: The “worst” or largest value of the texture, i.e. the standard deviation of gray-scale values.
concave.points_mean: The mean number of concave portions of the contour.
area_se: The standard error of the cell nuclei area.
concavity_worst: The “worst” or largest value of the severity of concave portions of the contour.
texture_mean: The mean of the texture, i.e. the standard deviation of gray-scale values.
Exploration and visualization of response variable
B: Represents the total number of patients whose breast tumor tested benign.
M: Represents the total number of patients whose breast tumor tested malignant.
#CREATION OF THEME
blank_theme <- theme_bw() +
theme(
axis.title.x = element_text(angle = 360, face = "bold", colour = "red", size = 15),
axis.text.x = element_text(angle = 360, face = "bold", colour = "black", size = 12),
axis.title.y = element_text(angle = 90, face = "bold", colour = "red", size = 15),
axis.text.y = element_text(angle = 45, face = "bold", colour = "black", size = 12),
legend.position= "none",
plot.title = element_text(size=16, face="bold", color="forest green")
)
counting_cancer_status <- bcancer %>% count(diagnosis) %>%
mutate(Rate=round(prop.table(table(bcancer$diagnosis))*100, digits = 2)) %>%
mutate(diagnosislabels= as.factor(fct_recode(diagnosis, Benign= "B",Malignant= "M"))) %>%
ggplot(aes(x=diagnosislabels , y= n, fill = diagnosislabels)) +
geom_bar(stat = "identity", width = 0.8, show.legend = FALSE) + theme_bw() + blank_theme+
labs(x= "Type of Breast Tumor Present", y= "Total Number of Patients", caption = "Source: mgisa Breast Cancer Analytics") +scale_fill_manual(values = c("Benign" = "blue", "Malignant"= "orange"), aesthetics = "fill")+
scale_x_discrete(limits = c("Benign","Malignant")) + ggtitle("Total Test Patients with the Breast Tumor Diagnosis ") +
geom_text(aes(label= str_c(Rate,"%")),vjust= 4.5,size= 6, color= "black")
print(counting_cancer_status)
We can see that most patients (\(62.74\%\)) tested benign. Since all features are continuous, let’s explore their distributions through histograms.
#Overall Distribution For All Continuous Features
HistCont<-bcancer %>%
keep(is.numeric) %>%
gather() %>%
ggplot() +
geom_histogram(mapping = aes(x=value,fill=key), color="black") +theme_bw()+
facet_wrap(~ key, scales = "free") +
theme_minimal() + labs(x="Corresponding Value",y= "Total Numbers",
caption = "Source: mgisa Breast Cancer Analytics")+
ggtitle("DISTRIBUTION OF CONTINUOUS FEATURES ")+
theme(legend.position = 'none',
#axis.text.x = element_text(angle = 45, face = "bold", colour = "black",size = 15),
axis.title.x= element_text(size = 18,face = "bold",color="red"),
axis.title.y = element_text(size=18, angle = 90,vjust = 0.3,face = "bold",color = "red"),
plot.title = element_text(size=16, face="bold", color="forest green"))
#print(HistCont)
plotly_build(HistCont)
We can see that texture_mean looks approximately normally distributed compared with the other features.
#Visualization of diagnostic decision separation by numeric features
#_____________________________________________________
plot_box <- function(df, cols,col_x="diagnosis"){options(repr.plot.width=4,repr.plot.height=3.5)
for (col in cols){
chart <- ggplot(df, aes_string(col_x,col,fill=col_x))+
geom_boxplot(show.legend = FALSE)+
scale_fill_manual(values = c("B" = "blue", "M"= "red"), aesthetics = "fill")+
labs(x="Tested Breast Tumor",title = paste("Box Plot of",col,"\n Vs",col_x),
caption = "Source: Breast Cancer Analytics@mgisa")+
theme_bw()+blank_theme+theme(axis.text.x = element_text(size=12,angle = 360,
vjust= 0,face="bold"), axis.title.y = element_text(size=15,face="bold"),
axis.title.x = element_text(size=15, face="bold"),plot.title=element_text(size=15, face="bold",colour = "red"),legend.position = "none")
print(chart)
}
}
num_cols = bcancer %>% select_if(is.numeric)%>%colnames()
plot_box(bcancer,num_cols)
We can clearly see the difference between these two groups of patients.
This section tests the correlation among the predictors (independent features). Collinearity can worsen predictive accuracy and make it harder to determine which features to include in a predictive model. Below is the correlation matrix of the features.
#TESTING THE CORRELATION AMONG VARIABLES
#_________________________________________
numericVarName <- names(which(sapply(bcancer, is.numeric)))
corr <- cor(bcancer[,numericVarName], use = 'pairwise.complete.obs')
ggcorrplot(corr, lab = TRUE, title = "CORRELATION AMONG CONTINUOUS VARIABLES")
We can see that some variables are highly correlated (i.e., multicollinearity is present), so we’ll eliminate some of them from the predictive models to ensure the highest possible predictive performance.
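Reading highly correlated pairs off a heat map works for nine features but gets tedious quickly. As a rough sketch, the pairs can also be listed programmatically. The example below uses simulated data and a hypothetical helper, `high_corr_pairs()`, with an assumed cutoff of 0.9; none of these come from the analysis above. caret also ships a findCorrelation() function for the same purpose.

```r
# Simulated size measurements: perimeter and area are near-deterministic
# functions of radius, so the three are strongly correlated by construction.
set.seed(1)
n <- 200
radius    <- rnorm(n, mean = 14, sd = 3.5)
perimeter <- 2 * pi * radius + rnorm(n, sd = 1)
area      <- pi * radius^2 + rnorm(n, sd = 20)
texture   <- rnorm(n, mean = 19, sd = 4)   # independent of the size features
toy <- data.frame(radius, perimeter, area, texture)

corr_toy <- cor(toy)

# List each pair once (upper triangle) whose absolute correlation
# exceeds the cutoff. The cutoff value is a judgment call.
high_corr_pairs <- function(cm, cutoff = 0.9) {
  idx <- which(abs(cm) > cutoff & upper.tri(cm), arr.ind = TRUE)
  data.frame(
    var1 = rownames(cm)[idx[, "row"]],
    var2 = colnames(cm)[idx[, "col"]],
    r    = cm[idx],
    stringsAsFactors = FALSE
  )
}

pairs_found <- high_corr_pairs(corr_toy, cutoff = 0.9)
print(pairs_found)
```

From each listed pair, one would typically keep the feature that is easier to interpret or that ranks higher in the importance scores.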
Now it is time to examine visually how the selected features of breast cell nuclei influence the diagnosis decision. Box plots and density plots are commonly used for this, and caret’s featurePlot() function makes it convenient.
#Split data set into training and test set
#___________________________________________
# 1. Get row numbers for the training data
partition <- createDataPartition(bcancer$diagnosis, p = 0.8, list = FALSE)
# 2. Create the training sample
bcancerTrain<- bcancer[partition, ]
#Shape of training set.
cat("The dimension of the training data set is (", dim(bcancerTrain), ")")
## The dimension of the training data set is ( 456 10 )
# 3. Create the test sample
bcancerTest <- bcancer[-partition, ]
cat("The dimension of test data set is (", dim(bcancerTest), ")")
## The dimension of test data set is ( 113 10 )
# Scaling the continuous variables
preProcess_scale_model <- preProcess(bcancerTrain, method = c("center", "scale"))
bcancerTrain <- predict(preProcess_scale_model, bcancerTrain)
bcancerTest <- predict(preProcess_scale_model, bcancerTest)
#Density plots Visually Presenting of Important Features
#__________________________________________________
featurePlot(x = bcancerTrain[, -1], #Removing output column.
y = bcancerTrain$diagnosis,
plot = "density", main= "Density Visually Presenting of the Important Features",
strip=strip.custom(par.strip.text=list(cex=1)),
scales = list(x = list(relation="free"),
y = list(relation="free")))
For a variable to be important, one should expect its density curves to differ clearly between the two classes, both in peak height and in location along the axis. Looking at the density curves of the two diagnosis categories for all features, we notice that area_se, area_worst, perimeter_worst, concavity_worst, concave.points_mean, and radius_worst are the most likely to be important for predicting the breast cancer diagnosis decision. But it may not be wise to conclude which variables are NOT important before applying several feature selection techniques.
Feature selection always plays a crucial role in machine learning. There are several techniques for selecting the best features to build predictive models. Below are the two techniques adopted in this post:
#FEATURES SELECTION BY CHI-SQUARE TEST METHOD
chi.square <- vector()
p.value <- vector()
cateVar <- bcancer %>%
dplyr::select(-diagnosis) %>%
keep(is.numeric)
for (i in seq_along(cateVar)) {
  # Run the test once per feature and reuse the result
  test_i <- chisq.test(bcancer$diagnosis, cateVar[[i]], correct = FALSE)
  p.value[i] <- test_i$p.value
  chi.square[i] <- unname(test_i$statistic)
}
chi_square_test <- tibble(variable =names(cateVar)) %>%
add_column(chi.square = chi.square) %>%
add_column(p.value = p.value)
knitr::kable(chi_square_test)
| variable | chi.square | p.value |
|---|---|---|
| perimeter_worst | 536.2035 | 0.2313225 |
| area_worst | 558.3055 | 0.3154948 |
| concave.points_worst | 546.1851 | 0.0427027 |
| radius_worst | 537.6295 | 0.0049616 |
| concave.points_mean | 562.5833 | 0.2521223 |
| area_se | 552.6018 | 0.2128353 |
| texture_worst | 514.8145 | 0.4320580 |
| texture_mean | 498.4163 | 0.2505930 |
| concavity_worst | 555.4536 | 0.2922900 |
We can see that all features have large chi-square values, which is consistent with our hypothesis that almost all features carry useful information about the response (target) variable. Note, however, that chisq.test() treats every distinct value of a continuous feature as its own category, so these statistics and p-values should be read with caution. On the other hand, only radius_worst and concave.points_worst have p-values below \(0.05\). To be safe, let’s not draw premature conclusions about excluding variables and instead try another feature selection technique, recursive feature elimination.
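Because the features here are continuous, a two-sample test is a common alternative for per-feature screening. The sketch below swaps in Welch's t-test and the Wilcoxon rank-sum test, which is not the method used in this post, and runs them on simulated data. The feature names `feat_signal` and `feat_noise` are made up: the first differs between classes by construction and the second does not.

```r
# Simulated two-class data: one informative and one uninformative feature.
set.seed(7)
n_b <- 120
n_m <- 80
toy <- data.frame(
  diagnosis   = factor(rep(c("B", "M"), times = c(n_b, n_m))),
  feat_signal = c(rnorm(n_b, mean = 0), rnorm(n_m, mean = 1.5)),
  feat_noise  = rnorm(n_b + n_m)
)

features <- setdiff(names(toy), "diagnosis")

# One Welch t-test and one Wilcoxon rank-sum test per feature.
screen <- do.call(rbind, lapply(features, function(f) {
  value <- toy[[f]]
  group <- toy$diagnosis
  tt <- t.test(value ~ group)        # Welch's test is the default (var.equal = FALSE)
  wt <- wilcox.test(value ~ group)   # rank-based, no normality assumption
  data.frame(variable       = f,
             t.statistic    = unname(tt$statistic),
             t.p.value      = tt$p.value,
             wilcox.p.value = wt$p.value)
}))
print(screen)
```

With several features tested at once, the p-values would normally be adjusted for multiple comparisons, for example with p.adjust().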
#Feature selection using rfe in caret (Recursive Feature Elimination)
#__________________________________________________________________
control <- rfeControl(functions = rfFuncs,
                      method = "repeatedcv",
                      number = 10,
                      repeats = 5,
                      verbose = FALSE,
                      allowParallel = TRUE)
outcomeName <- 'diagnosis'
predictors <- names(bcancerTrain)[!names(bcancerTrain) %in% outcomeName]
bcancer_Pred_Profile <- rfe(bcancerTrain[,predictors], bcancerTrain[,outcomeName],
                            rfeControl = control)
print(bcancer_Pred_Profile)
##
## Recursive feature selection
##
## Outer resampling method: Cross-Validated (10 fold, repeated 5 times)
##
## Resampling performance over subset size:
##
## Variables Accuracy Kappa AccuracySD KappaSD Selected
## 4 0.9437 0.8794 0.03638 0.07764
## 8 0.9591 0.9120 0.02744 0.05934
## 9 0.9613 0.9166 0.02615 0.05672 *
##
## The top 5 variables (out of 9):
## concave.points_worst, area_se, area_worst, concave.points_mean, texture_worst
In the code above, the resampling method is repeatedcv, that is, 10-fold cross-validation repeated 5 times, which is rigorous enough for our case. In the output, the best resampled accuracy (0.9613) is obtained with all nine features, and rfe lists the five highest-ranked ones. We keep these top five features for the models that follow.
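As a rough illustration of what repeatedcv does, the base-R sketch below builds the held-out index sets for 10-fold cross-validation repeated 5 times. The helper `make_repeated_folds()` and the sample size of 100 are hypothetical and are not part of caret, which builds its resamples internally; the sketch only shows where the 10 x 5 = 50 resamples come from.

```r
# Hypothetical helper (not a caret function): returns a list with one
# vector of held-out row indices per fold and repeat
make_repeated_folds <- function(n, k = 10, repeats = 5) {
  folds <- list()
  for (r in seq_len(repeats)) {
    # Shuffle the fold labels 1..k across the n rows for this repeat
    fold_id <- sample(rep(seq_len(k), length.out = n))
    for (f in seq_len(k)) {
      name <- sprintf("Fold%02d.Rep%d", f, r)
      folds[[name]] <- which(fold_id == f)
    }
  }
  folds
}

set.seed(12345)
folds <- make_repeated_folds(n = 100, k = 10, repeats = 5)
length(folds)        # number of resamples
lengths(folds)[1:3]  # size of the first three held-out sets
```

Within one repeat, every row is held out exactly once; across repeats, the rows are reshuffled, so each model is evaluated on 50 different held-out sets.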
In the next step, we will train the machine learning models on the training set using those top five features.
#Taking only the top 5 predictors
#__________________________________
#Training data set
bcancerTrain <- bcancerTrain %>%
  dplyr::select(concave.points_worst,
                texture_worst,
                area_se,
                area_worst,
                concave.points_mean,
                diagnosis)
#Test data set
bcancerTest <- bcancerTest %>%
  dplyr::select(concave.points_worst,
                texture_worst,
                area_se,
                area_worst,
                concave.points_mean,
                diagnosis)
The target of our data set is imbalanced, with more patients diagnosed with a benign tumor than with a malignant one. In such cases, a machine learning algorithm can simply assign every example to the majority class and still reach a high accuracy, and such a classifier is practically useless. To perform well on both classes, one typically either gives the minority class a larger weight, which increases the penalty for misclassifying it relative to the majority class, or resamples the data. Among the popular approaches to class imbalance, this post uses over-sampling, specifically the Synthetic Minority Over-sampling Technique (SMOTE).
#Let's look at the class distribution again.
round(prop.table(table(bcancer$diagnosis)),3) # Entire dataset
##
## B M
## 0.627 0.373
##
## B M
## 0.627 0.373
##
## B M
## 0.628 0.372
The class labels are not balanced: about 62.7% Benign (B) and 37.3% Malignant (M). To balance the target class, we use the SMOTE() function from the DMwR R package.
#SMOTE function to balance the imbalanced class (give both classes the same share)
bcancerTrain <- SMOTE(diagnosis~., data.frame(bcancerTrain),
                      perc.over = 100, perc.under = 200)
#Take a look at the data
round(prop.table(table(dplyr::select(bcancerTrain, diagnosis), exclude = NULL)),3)
##
## B M
## 0.5 0.5
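Now the class labels are balanced, at 50% each for Benign and Malignant breast tumors. With perc.over = 100, one synthetic case is generated for every minority case, which doubles the minority class; with perc.under = 200, two majority cases are sampled for every synthetic case, so both classes end up the same size.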
SMOTE is a heuristic approach based on the k-nearest neighbours (kNN) algorithm. It generates synthetic minority-class observations by interpolating a point on the line segment between a minority observation and one of its nearest minority-class neighbours.
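The interpolation step can be sketched in a few lines of base R. This is a simplified illustration under stated assumptions, not the DMwR implementation: the helper `smote_sketch()`, the toy matrix `minority`, and the choice k = 2 are all hypothetical.

```r
# Hypothetical helper: creates n_new synthetic rows from a numeric matrix
# that holds only minority-class observations (at least two rows)
smote_sketch <- function(minority, n_new, k = 5) {
  minority <- as.matrix(minority)
  n <- nrow(minority)
  k <- min(k, n - 1)                 # cannot have more neighbours than other rows
  d <- as.matrix(dist(minority))     # pairwise Euclidean distances
  diag(d) <- Inf                     # a point is not its own neighbour
  synthetic <- matrix(NA_real_, nrow = n_new, ncol = ncol(minority),
                      dimnames = list(NULL, colnames(minority)))
  for (j in seq_len(n_new)) {
    i <- sample.int(n, 1)                      # pick a minority observation
    neighbours <- order(d[i, ])[seq_len(k)]    # its k nearest minority neighbours
    nb <- neighbours[sample.int(k, 1)]         # choose one neighbour at random
    gap <- runif(1)                            # position along the segment
    synthetic[j, ] <- minority[i, ] + gap * (minority[nb, ] - minority[i, ])
  }
  synthetic
}

set.seed(12345)
minority <- cbind(x = c(1, 2, 3, 4), y = c(1, 3, 2, 4))  # toy minority class
synthetic <- smote_sketch(minority, n_new = 10, k = 2)
head(synthetic, 3)
```

Because each synthetic row is a convex combination of two existing minority rows, it always stays inside the range already covered by the minority class.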
Note: here are some important points to keep in mind when using this technique.
This technique should only be applied to the training data set. The aim is to balance the training data so that a classification algorithm can be trained properly. Model performance should then be tested on the unchanged test data set, which is a representative part of the original data.
Prefer stratified random sampling for the training and test split, which ensures that the class distribution is the same in both splits. SMOTE should also be applied only to data sets that contain only numeric variables, since it relies on distance calculations (kNN).
Various studies have shown that AUROC is usually the preferred performance measure for imbalanced data sets. The SMOTE variant used here targets binary classification problems; for multiclass tasks, the SCUT algorithm, which combines SMOTE with cluster-based undersampling, can be used.
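The stratified split recommended above can be sketched in base R as follows. The helper `stratified_split()` is hypothetical (caret offers createDataPartition() for the same purpose), and the toy label vector only mimics the roughly 63%/37% class shares reported earlier; its counts are an assumption made for illustration.

```r
# Hypothetical helper: samples the same fraction p from every class
stratified_split <- function(labels, p = 0.8) {
  train_idx <- unlist(lapply(split(seq_along(labels), labels), function(idx) {
    idx[sample.int(length(idx), size = round(p * length(idx)))]
  }), use.names = FALSE)
  list(train = sort(train_idx),
       test = setdiff(seq_along(labels), train_idx))
}

set.seed(12345)
# Toy labels (assumed counts, chosen to mimic a 63%/37% class ratio)
labels <- factor(rep(c("B", "M"), times = c(357, 212)))
parts <- stratified_split(labels, p = 0.8)

# Class shares in the full data, the training part and the test part
round(prop.table(table(labels)), 3)
round(prop.table(table(labels[parts$train])), 3)
round(prop.table(table(labels[parts$test])), 3)
```

Sampling within each class separately is what keeps the three class distributions nearly identical; a plain random split would only match them on average.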
Here we are! Finally, let's train multiple machine learning algorithms on the prepared and preprocessed breast cancer data set to predict the diagnosis decision. The test data set (bcancerTest) will be used only to evaluate performance (for example, to compare models), and the training set (bcancerTrain) will be used for all other activities, such as training the predictive models.
Twelve individual learning models were trained on the training data set, using 10-fold cross-validation repeated five times as the resampling method, and tuned to find the optimal configuration of each. We also set a random seed to initialize the pseudo-random number generator, so that the experimental results are reproducible.
#ENSEMBLING METHODS
#__________________________________
## Set the seed for reproducibility
seed<-12345
set.seed(seed)
# Define the training control
trainControl <- trainControl(
  method="repeatedcv",       #Repeated k-fold cross validation
  number=10,                 #Number of folds
  repeats=5,                 #Number of times the k-fold procedure is repeated
  savePredictions="final",   #Saves predictions for optimal tuning parameter
  classProbs=TRUE,           #Should class probabilities be returned
  allowParallel = TRUE       #Allow parallel computing, here core=4 is used
)
#Model List
algorithmList <- c("glmboost",'rpart',"rf","xgbTree","naive_bayes",
'earth', 'kknn', 'svmRadial',
"lda", "Linda","C5.0Tree","adaboost")
#Train model
models <- caretList(diagnosis~., data=bcancerTrain,
trControl = trainControl, preProcess=c("center","scale"),
methodList = algorithmList,metric = "Kappa")
# Resamples and Summary of the models performances
results <- resamples(models)
summary(results)
##
## Call:
## summary.resamples(object = results)
##
## Models: glmboost, rpart, rf, xgbTree, naive_bayes, earth, kknn, svmRadial, lda, Linda, C5.0Tree, adaboost
## Number of resamples: 50
##
## Accuracy
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## glmboost 0.8823529 0.9411765 0.9705882 0.9573529 0.9705882 1.0000000 0
## rpart 0.8529412 0.9117647 0.9338235 0.9270588 0.9411765 0.9852941 0
## rf 0.9558824 0.9852941 0.9852941 0.9879412 1.0000000 1.0000000 0
## xgbTree 0.9558824 0.9852941 0.9852941 0.9882353 1.0000000 1.0000000 0
## naive_bayes 0.8970588 0.9411765 0.9558824 0.9550000 0.9705882 1.0000000 0
## earth 0.9264706 0.9558824 0.9705882 0.9702941 0.9852941 1.0000000 0
## kknn 0.9705882 0.9889706 1.0000000 0.9955882 1.0000000 1.0000000 0
## svmRadial 0.9411765 0.9705882 0.9705882 0.9761765 0.9852941 1.0000000 0
## lda 0.8676471 0.9301471 0.9411765 0.9435294 0.9558824 1.0000000 0
## Linda 0.8970588 0.9411765 0.9558824 0.9588235 0.9816176 1.0000000 0
## C5.0Tree 0.9117647 0.9558824 0.9705882 0.9714706 0.9852941 1.0000000 0
## adaboost 0.9558824 0.9852941 1.0000000 0.9932353 1.0000000 1.0000000 0
##
## Kappa
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## glmboost 0.7647059 0.8823529 0.9411765 0.9147059 0.9411765 1.0000000 0
## rpart 0.7058824 0.8235294 0.8676471 0.8541176 0.8823529 0.9705882 0
## rf 0.9117647 0.9705882 0.9705882 0.9758824 1.0000000 1.0000000 0
## xgbTree 0.9117647 0.9705882 0.9705882 0.9764706 1.0000000 1.0000000 0
## naive_bayes 0.7941176 0.8823529 0.9117647 0.9100000 0.9411765 1.0000000 0
## earth 0.8529412 0.9117647 0.9411765 0.9405882 0.9705882 1.0000000 0
## kknn 0.9411765 0.9779412 1.0000000 0.9911765 1.0000000 1.0000000 0
## svmRadial 0.8823529 0.9411765 0.9411765 0.9523529 0.9705882 1.0000000 0
## lda 0.7352941 0.8602941 0.8823529 0.8870588 0.9117647 1.0000000 0
## Linda 0.7941176 0.8823529 0.9117647 0.9176471 0.9632353 1.0000000 0
## C5.0Tree 0.8235294 0.9117647 0.9411765 0.9429412 0.9705882 1.0000000 0
## adaboost 0.9117647 0.9705882 1.0000000 0.9864706 1.0000000 1.0000000 0
#____________________________
# Box plots to compare models
scales <- list(x=list(relation="free"), y=list(relation="free"))
bwplot(results, scales=scales,main="Comparison of ML Models for Breast Cancer Detection")
In the output above you can clearly see how the algorithms performed in terms of accuracy and Cohen's Kappa statistic. Kernel k-Nearest Neighbor (kknn) and AdaBoost stand out as the best-performing models, with the highest mean accuracy (0.9956 and 0.9932) and the highest mean kappa (0.9912 and 0.9865).
Below are other ways to visualize the performance of the models employed, using pie charts and bar plots.
# Visualize the comparative model performance
Compadata <- read.csv("C:/Users/Murera Gisa/Desktop/Predicting_Cancer/ModelPerformance.csv")
#geom_col() already draws the bars, so the duplicate geom_bar(stat = "identity") layer is dropped
Plot_Kappa <- Compadata %>% ggplot(aes(x=Models, y=Kappa, fill= Kappa)) +
  geom_col() +
  coord_polar() + theme_bw() +
  labs(title = "Kappa Performance Metric.",
       caption= "Source: Breast Cancer Analytics@mgisa") +
  theme(legend.position = "none", axis.title = element_blank(),
        axis.line = element_blank(), axis.text.y= element_blank(),
        axis.text.x = element_text(angle = 45, face = "bold", colour = "purple", size = 15),
        plot.title = element_text(size=16, face="bold", color="forest green")) +
  theme(legend.text = element_text(colour="blue", size=10,face= "bold")) +
  #geom_text(aes(label= str_c(Kappa,"%")),size= 5, color= "red", angle=360)+
  theme(legend.title = element_text(colour="red", size = 20,face="bold"))
Plot_Accuracy <- Compadata %>% ggplot(aes(x=Models, y=Accuracy, fill= Accuracy)) +
  geom_col() +
  coord_polar() + theme_bw() +
  labs(title = "Accuracy Performance Metric.",
       caption= "Source: Breast Cancer Analytics@mgisa") +
  theme(legend.position = "none", axis.title = element_blank(),
        axis.line = element_blank(), axis.text.y= element_blank(),
        axis.text.x = element_text(angle = 45, face = "bold", colour = "orange", size = 15),
        plot.title = element_text(size=16, face="bold", color="blue")) +
  theme(legend.text = element_text(colour="blue", size=10,face= "bold")) +
  #geom_text(aes(label= str_c(Accuracy,"%")),size= 5, color= "black", angle=360)+
  theme(legend.title = element_text(colour="red", size = 20,face="bold"))
grid.arrange(Plot_Kappa,Plot_Accuracy, nrow=1)
Cohen's Kappa is used as the main model performance metric, since it is a good performance measure when the classes are unbalanced, as in our case of breast cancer prediction. Cohen's kappa measures how well the classifier performed compared with how well it would have performed simply by chance. In other words, the kappa statistic indicates the level of agreement between the actual and the predicted breast cancer diagnostic decisions, corrected for chance agreement.
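To see how kappa corrects accuracy for chance agreement, the following base-R sketch computes both from a 2 x 2 table of counts, using kappa = (p_observed - p_chance) / (1 - p_chance). The helper `cohen_kappa()` is hypothetical; the counts are those of the test-set confusion matrix reported later in this post.

```r
# Hypothetical helper: Cohen's kappa from a square table of counts
# (rows = predicted class, columns = actual class)
cohen_kappa <- function(tab) {
  n <- sum(tab)
  p_observed <- sum(diag(tab)) / n                     # observed agreement (accuracy)
  p_chance <- sum(rowSums(tab) * colSums(tab)) / n^2   # agreement expected by chance
  (p_observed - p_chance) / (1 - p_chance)
}

tab <- matrix(c(69, 2,
                2, 40),
              nrow = 2, byrow = TRUE,
              dimnames = list(Prediction = c("B", "M"), Reference = c("B", "M")))

accuracy <- sum(diag(tab)) / sum(tab)
kappa_value <- cohen_kappa(tab)
round(c(accuracy = accuracy, kappa = kappa_value), 4)
```

The result, an accuracy of about 0.9646 and a kappa of about 0.9242, matches the values caret prints for that confusion matrix. Kappa is lower than accuracy because roughly 53% agreement would already be expected by chance with these class shares.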
Below are bar plots of model performance, showing Cohen's Kappa statistic and accuracy for each individual model.
#Bar plot of ML model performance by Cohen's Kappa statistic
Compadata <- Compadata %>%
  mutate(name = fct_reorder(Models, Kappa))
bar_plot <- ggplot(Compadata, aes(x=name, y=Kappa)) +
  geom_bar(stat="identity", width=.8, fill="#990033") + theme_bw() + xlab("") +
  #geom_text(aes(label = str_c(Kappa,"%")),vjust=.005,angle=360,size = 5, color = "black")+
  labs(y="Cohen's Kappa Statistics", caption = "Source:Breast Cancer Analytics@mgisa") +
  ggtitle(" Bar Plot of ML Models Performance, Cohen's Kappa.") +
  theme(axis.title.y = element_text(size = 17, angle = 90, vjust = 0.5, colour = "red", face = "bold"),
        axis.text.y = element_text(angle = 45, face = "bold", colour = "black", size = 12),
        axis.text.x = element_text(angle = 360, face = "bold", colour = "black", size = 12),
        axis.title.x = element_text(size = 17, angle = 360, vjust = 5, colour = "red", face = "bold"),
        plot.title = element_text(size = 20, face="bold", colour="forest green"),
        legend.position="none")
anim <- bar_plot+transition_reveal(along= Kappa)+shadow_mark(wake_length=0.5)+
  enter_grow()+ enter_fade()
animate(anim,fps=20, height =400, width=800,
        rewind=TRUE, duration=30)
All the bar plots, box plots and pie charts above show Kernel kNN and AdaBoost as the superior classification algorithms. They lead the other models in breast cancer detection and classification, with the highest Cohen's kappa and accuracy.
These predictive models have high kappa scores because their error rates are far below the null error rate, that is, the error made by always predicting the majority class.
As previously mentioned, the "optimal" models are selected as candidates for the stacking ensemble technique for breast cancer prediction. In the code below, each of them serves as the meta-learner that combines the predictions of all twelve base models.
#______________________________
#ENSEMBLING METHOD BY STACKING
stackControl <- trainControl(method="repeatedcv",
number=10, repeats=5,
savePredictions=TRUE,
classProbs=TRUE,
allowParallel = TRUE
)
#Ensembling prediction by Top Performing models
#__________________________________________
# 1. Stacking using Kernel kNN
set.seed(seed)
stack.kknn <- caretStack(models, method="kknn",
metric="Kappa",trControl=stackControl)
print(stack.kknn)
## A kknn ensemble of 12 base models: glmboost, rpart, rf, xgbTree, naive_bayes, earth, kknn, svmRadial, lda, Linda, C5.0Tree, adaboost
##
## Ensemble results:
## k-Nearest Neighbors
##
## 3400 samples
## 12 predictor
## 2 classes: 'B', 'M'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 5 times)
## Summary of sample sizes: 3060, 3060, 3060, 3060, 3060, 3060, ...
## Resampling results across tuning parameters:
##
## kmax Accuracy Kappa
## 5 0.9984706 0.9969412
## 7 0.9984706 0.9969412
## 9 0.9984706 0.9969412
##
## Tuning parameter 'distance' was held constant at a value of 2
## Tuning
## parameter 'kernel' was held constant at a value of optimal
## Kappa was used to select the optimal model using the largest value.
## The final values used for the model were kmax = 9, distance = 2 and kernel
## = optimal.
#2. stack using adaptive boosting machine (adaboost)
set.seed(seed)
stack.adaboost <- caretStack(models, method="adaboost",
metric="Kappa",trControl=stackControl)
print(stack.adaboost)
## A adaboost ensemble of 12 base models: glmboost, rpart, rf, xgbTree, naive_bayes, earth, kknn, svmRadial, lda, Linda, C5.0Tree, adaboost
##
## Ensemble results:
## AdaBoost Classification Trees
##
## 3400 samples
## 12 predictor
## 2 classes: 'B', 'M'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 5 times)
## Summary of sample sizes: 3060, 3060, 3060, 3060, 3060, 3060, ...
## Resampling results across tuning parameters:
##
## nIter method Accuracy Kappa
## 50 Adaboost.M1 0.9990588 0.9981176
## 50 Real adaboost 0.9986471 0.9972941
## 100 Adaboost.M1 0.9990588 0.9981176
## 100 Real adaboost 0.9988235 0.9976471
## 150 Adaboost.M1 0.9988235 0.9976471
## 150 Real adaboost 0.9988235 0.9976471
##
## Kappa was used to select the optimal model using the largest value.
## The final values used for the model were nIter = 50 and method = Adaboost.M1.
By combining the predictions of the classifiers with a Kernel kNN meta-model, resampled accuracy and Cohen's kappa rise to about \(99.85\%\) and \(99.69\%\) respectively, an improvement over using kknn alone (\(99.56\%\) and \(99.12\%\)). The same holds for the AdaBoost stack, which reaches \(99.91\%\) and \(99.81\%\), compared with \(99.32\%\) and \(98.65\%\) for adaboost alone.
Furthermore, one may want to pass different types of models to caretStack, both high- and low-performing ones, rather than only high-accuracy models, and compare the resulting performance.
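For readers who want to see the idea behind stacking without the caret machinery, here is a minimal base-R sketch on simulated data. Everything in it is an assumption made for illustration: the toy data, the two logistic-regression base learners (each seeing a single feature), and the logistic-regression meta-learner. The internals of caretStack differ in detail.

```r
set.seed(42)
n <- 200
x1 <- rnorm(n)
x2 <- rnorm(n)
y <- rbinom(n, 1, plogis(1.5 * x1 + 1.5 * x2))   # simulated binary outcome
toy <- data.frame(y, x1, x2)

# Out-of-fold predictions of the two base learners (5-fold cross-validation)
folds <- sample(rep(1:5, length.out = n))
oof <- matrix(NA_real_, nrow = n, ncol = 2, dimnames = list(NULL, c("m1", "m2")))
for (f in 1:5) {
  train <- toy[folds != f, ]
  held <- toy[folds == f, ]
  m1 <- glm(y ~ x1, data = train, family = binomial)   # base learner 1
  m2 <- glm(y ~ x2, data = train, family = binomial)   # base learner 2
  oof[folds == f, "m1"] <- predict(m1, held, type = "response")
  oof[folds == f, "m2"] <- predict(m2, held, type = "response")
}

# Meta-learner: trained on the base learners' out-of-fold predictions
meta <- glm(y ~ m1 + m2, data = data.frame(y = toy$y, oof), family = binomial)

# Accuracy of each base learner and of the stack (the stack's figure is
# in-sample for the meta-learner, so it is optimistic)
base_acc <- colMeans((oof > 0.5) == toy$y)
stacked <- as.integer(fitted(meta) > 0.5)
stack_acc <- mean(stacked == toy$y)
round(c(base_acc, stack = stack_acc), 3)
```

The key design point is that the meta-learner only ever sees predictions made on rows the base learners were not trained on, which is why saved resampling predictions are needed before stacking.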
Stacking ensembles tend to perform best when the predictions of the base models are weakly correlated with each other. This suggests that the predictive algorithms are skillful, but in different ways. To let the meta-classifier work out how to get the best from each model, we therefore need to check the correlation among the base learners.
If the predictions of the learning models are highly correlated (\(>90\%\)), they are making the same or very similar predictions, which most of the time reduces the benefit of combining them.
# correlation between results
Modelcorr <- modelCor(results)
ggcorrplot(Modelcorr, lab = TRUE,title = "CORRELATION BETWEEN MODEL RESULTS.")
From the results above, we notice that all pairs of predictions have generally low correlation, below the chosen threshold. This means that all classifiers can be included in the stacking ensemble to predict the breast cancer diagnostic decision.
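The threshold check can also be done programmatically. The sketch below is a simplified stand-in for inspecting the correlation plot by eye: it uses a small, made-up matrix of per-resample accuracies for three hypothetical models and lists the pairs whose correlation exceeds 0.9.

```r
# Made-up per-resample accuracies for three hypothetical models
acc <- cbind(modelA = c(0.90, 0.92, 0.94, 0.96, 0.98),
             modelB = c(0.91, 0.93, 0.95, 0.97, 0.99),
             modelC = c(0.95, 0.90, 0.97, 0.91, 0.96))

cors <- cor(acc)   # pairwise Pearson correlations between models
threshold <- 0.9   # same cut-off as discussed in the text

# Keep each pair once (upper triangle) and flag those above the threshold
idx <- which(abs(cors) > threshold & upper.tri(cors), arr.ind = TRUE)
high_pairs <- data.frame(model_1 = rownames(cors)[idx[, "row"]],
                         model_2 = colnames(cors)[idx[, "col"]],
                         correlation = cors[idx])
print(round(cors, 2))
print(high_pairs)
```

In this toy example only modelA and modelB move together, so one of them would add little to a stack that already contains the other.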
Excellent! We can now apply the stacking ensemble technique.
Finally, we can predict the breast cancer diagnosis decision using the combined models. Afterwards, we will examine the ensemble predictive model by analyzing, through confusion matrices, the different types of errors made in ML-based classification.
#PREDICTION WITH ENSEMBLE MODELS ON TEST DATA SET
#___________________________________
#1. Breast cancer prediction by kKnn
stack_predicted_kknn <- predict(stack.kknn, newdata = bcancerTest)
head(stack_predicted_kknn, n = 10)
## [1] M M M M M M M M M M
## Levels: B M
#2. Breast cancer prediction by adaboost
stack_predicted_adaboost <- predict(stack.adaboost, newdata = bcancerTest)
head(stack_predicted_adaboost , n = 10)
## [1] M M M M M M M M M M
## Levels: B M
A confusion matrix is a performance measurement tool for machine learning classification. It is a table that cross-tabulates predicted against actual classes, which shows how well a classification model performs and where the classifier gets confused during recognition, classification and prediction tasks.
kknn.cmx <- caret::confusionMatrix(stack_predicted_kknn,
bcancerTest$diagnosis, positive = "B")
adaboost.cmx <- caret::confusionMatrix(stack_predicted_adaboost,
bcancerTest$diagnosis, positive = "B")
print(kknn.cmx)
## Confusion Matrix and Statistics
##
## Reference
## Prediction B M
## B 69 2
## M 2 40
##
## Accuracy : 0.9646
## 95% CI : (0.9118, 0.9903)
## No Information Rate : 0.6283
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.9242
##
## Mcnemar's Test P-Value : 1
##
## Sensitivity : 0.9718
## Specificity : 0.9524
## Pos Pred Value : 0.9718
## Neg Pred Value : 0.9524
## Prevalence : 0.6283
## Detection Rate : 0.6106
## Detection Prevalence : 0.6283
## Balanced Accuracy : 0.9621
##
## 'Positive' Class : B
##
## Confusion Matrix and Statistics
##
## Reference
## Prediction B M
## B 69 2
## M 2 40
##
## Accuracy : 0.9646
## 95% CI : (0.9118, 0.9903)
## No Information Rate : 0.6283
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.9242
##
## Mcnemar's Test P-Value : 1
##
## Sensitivity : 0.9718
## Specificity : 0.9524
## Pos Pred Value : 0.9718
## Neg Pred Value : 0.9524
## Prevalence : 0.6283
## Detection Rate : 0.6106
## Detection Prevalence : 0.6283
## Balanced Accuracy : 0.9621
##
## 'Positive' Class : B
##
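The confusion matrices above show how the stacked ensemble classification models perform when predicting the breast cancer diagnostic decision on the test set. In each of them, 109 of the 113 test cases are classified correctly (accuracy 0.9646, kappa 0.9242), with two malignant tumors predicted as benign and two benign tumors predicted as malignant. They therefore give insight into both the number and the types of errors made during prediction.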
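As a check on how the summary statistics follow from the four counts, the base-R sketch below recomputes them by hand. The variable names are my own; as in the confusionMatrix() calls above, Benign (B) is treated as the positive class.

```r
# Counts from the confusion matrix above (rows = prediction, columns = reference)
tp <- 69  # predicted B, actually B (true positives, with B as positive class)
fp <- 2   # predicted B, actually M (false positives)
fn <- 2   # predicted M, actually B (false negatives)
tn <- 40  # predicted M, actually M (true negatives)

sensitivity <- tp / (tp + fn)        # share of actual B that is predicted B
specificity <- tn / (tn + fp)        # share of actual M that is predicted M
pos_pred_value <- tp / (tp + fp)     # share of predicted B that is actually B
neg_pred_value <- tn / (tn + fn)     # share of predicted M that is actually M
accuracy <- (tp + tn) / (tp + fp + fn + tn)
balanced_accuracy <- (sensitivity + specificity) / 2

round(c(sensitivity = sensitivity, specificity = specificity,
        pos_pred_value = pos_pred_value, neg_pred_value = neg_pred_value,
        accuracy = accuracy, balanced_accuracy = balanced_accuracy), 4)
```

With Malignant as the positive class instead, sensitivity and specificity would swap (0.9524 and 0.9718), which is worth keeping in mind when missed malignant cases are the main concern.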
Finally, let's tabulate the predicted breast cancer diagnostic decisions. The table shows randomly generated patient identifiers, drawn from a range based on the last patient_id in the actual data set, together with the predicted diagnostic labels.
#Convert submission data frame
Predicted_BreastCancer <- tibble(Diagnosis = stack_predicted_kknn)
Random_Patient_ID <- sample(92751:928634, 113, replace = FALSE) #Generating random patient IDs
Random_Patient_ID <- tibble(Random_patient_ID = Random_Patient_ID)
#Tabulating the Predicted Breast Cancer Diagnosis to be submitted.
Predicted_BreastCancer_table <- cbind(Random_Patient_ID,Predicted_BreastCancer)
head(Predicted_BreastCancer_table, n=10)
#Checking the prediction rate
Predicted_BreastCancer_table <- Predicted_BreastCancer_table %>%
  count(Diagnosis) %>% mutate(Percentage = round(n / sum(n) * 100, digits=3))
print(Predicted_BreastCancer_table)
## # A tibble: 2 x 3
## Diagnosis n Percentage
## <fct> <int> <dbl>
## 1 B 71 62.8
## 2 M 42 37.2
The table above gives both the number and the percentage of Benign and Malignant breast tumors predicted by the stacking ensemble.
Finally, let's write the submission files, named BreastCancer_Class, each containing one predicted breast cancer diagnostic decision in text format.
# Write files for submission
#The prefix argument keeps the two sets of files from overwriting each other
pml_write_files = function(x, prefix = "BreastCancer_Class_"){
  for(i in seq_along(x)){
    filename = paste0(prefix,i,".txt")
    write.table(x[i],file=filename,quote=FALSE,row.names=FALSE,col.names=FALSE)
  }
}
#Submission file for kknn
pml_write_files(stack_predicted_kknn)
#Submission file for adaboost
pml_write_files(stack_predicted_adaboost, prefix = "BreastCancer_Class_adaboost_")
In this blog post, a stacking ensemble model was developed. Twelve base learners, carefully selected from the family of supervised learning algorithms, were trained on an imbalanced breast cancer data set that was balanced with SMOTE before training. They were then combined by stacking, with the aim of improving overall performance. The best-performing algorithms were used to train the stacked ensemble model (parallel training to encourage the division of labor/mixture of experts), which was then used to predict the breast cancer diagnostic decision.
This post can serve as a reference, and as a template that others can use to build a standardised and relevant machine learning workflow for other purposes.