Early diagnosis of any disease plays a critical role in successful treatment of patients. Every year thousands of patients are diagnosed with breast cancer. Though large amounts of patient clinical data is collected and stored every year, but a small subset of predictive factors are used in determining outcomes.
In this project, we use a data mining approach to diagnose breast cancer. The data-driven approach used here can efficiently process clinical dataset to discover patterns and reveal hidden information for early detection and successfully treatment of breast cancer patients.
Breast cancer is one of the most severe cancers. It has taken hundreds of thousands of lives every year. Early prediction of breast cancer plays an important role in successful treatment and saving lives of thousands of patients every year. However, the conventional approaches are limited in providing such capability. The recent breakthrough of data analytics and data mining techniques have opened a new door for healthcare diagnostic and prediction.
Over the past decades medical records and clinical data have been collected and stored in electronic databases. Both the government and other public organizations have accelerated the technology toward transparency by making massively stored data usable, searchable, and actionable. Despite of the massive healthcare databases available, only small part of the data has been used by domain-experts for diagnostic and cure of diseases. This is because the massive healthcare data is too complex and voluminous to be effectively and efficiently processed and analyzed by the conventional methods.
Most breast cancers are detected by patients as a lump in the breast. These lumps can be benign or malignant. It is the physician’s responsibility to diagnose the cancer and determine whether it is benign or malignant.
There are different ways of diagnosing breast cancer. For example, mammography and surgical biopsy. Our main purpose in this project is to predict if the cells (tumor) are benign or malignant (cancer), based on dimensional characteristics of the nuclei cells. As secondary objectives we want to know if the quality of our prediction is similar to the scientific articles and if linear models are still competitive against nonlinear models, based on rules / trees.
Features are computed from a digitized image of a fine needle aspirate (FNA) of a breast mass. FNA’s involves taking one or more samples of breast cells using a fine needle and syringe. They describe characteristics of the cell nuclei present in the image.
The mean, standard error, and “worst” or largest (mean of the three largest values) of these dimensional features were computed for each image, resulting in 30 features. For instance, field 3 is Mean Radius, field 13 is Radius SE, field 23 is Worst Radius.
Data repository UC-Irvine, Breast Cancer Wisconsin (Diagnostic) Data Set from University of Wisconsin, Clinical Sciences Center.
Ten real-valued features are computed for each cell nucleus:
For each variable we have 3 different statistics: Mean, standard error and worst case. We have a total of 30 predictors.
The change diagnosis
predictor from malignant for 1, benign for 0, help us to see relationship beetwen the response and the predictors, and was done in this phase.
Below is the structure and the summary of the dataset.
vars | n | mean | sd | median | min | max | range | skew | kurtosis | |
---|---|---|---|---|---|---|---|---|---|---|
diagnosis | 1 | 569 | 0.37 | 0.48 | 0.00 | 0.00 | 1.00 | 1.00 | 0.53 | -1.73 |
radius_mean | 2 | 569 | 14.13 | 3.52 | 13.37 | 6.98 | 28.11 | 21.13 | 0.94 | 0.81 |
texture_mean | 3 | 569 | 19.29 | 4.30 | 18.84 | 9.71 | 39.28 | 29.57 | 0.65 | 0.73 |
perimeter_mean | 4 | 569 | 91.97 | 24.30 | 86.24 | 43.79 | 188.50 | 144.71 | 0.99 | 0.94 |
area_mean | 5 | 569 | 654.89 | 351.91 | 551.10 | 143.50 | 2501.00 | 2357.50 | 1.64 | 3.59 |
smoothness_mean | 6 | 569 | 0.10 | 0.01 | 0.10 | 0.05 | 0.16 | 0.11 | 0.45 | 0.82 |
compactness_mean | 7 | 569 | 0.10 | 0.05 | 0.09 | 0.02 | 0.35 | 0.33 | 1.18 | 1.61 |
concavity_mean | 8 | 569 | 0.09 | 0.08 | 0.06 | 0.00 | 0.43 | 0.43 | 1.39 | 1.95 |
concave.points_mean | 9 | 569 | 0.05 | 0.04 | 0.03 | 0.00 | 0.20 | 0.20 | 1.17 | 1.03 |
symmetry_mean | 10 | 569 | 0.18 | 0.03 | 0.18 | 0.11 | 0.30 | 0.20 | 0.72 | 1.25 |
fractal_dimension_mean | 11 | 569 | 0.06 | 0.01 | 0.06 | 0.05 | 0.10 | 0.05 | 1.30 | 2.95 |
radius_se | 12 | 569 | 0.41 | 0.28 | 0.32 | 0.11 | 2.87 | 2.76 | 3.07 | 17.45 |
texture_se | 13 | 569 | 1.22 | 0.55 | 1.11 | 0.36 | 4.88 | 4.52 | 1.64 | 5.26 |
perimeter_se | 14 | 569 | 2.87 | 2.02 | 2.29 | 0.76 | 21.98 | 21.22 | 3.43 | 21.12 |
area_se | 15 | 569 | 40.34 | 45.49 | 24.53 | 6.80 | 542.20 | 535.40 | 5.42 | 48.59 |
smoothness_se | 16 | 569 | 0.01 | 0.00 | 0.01 | 0.00 | 0.03 | 0.03 | 2.30 | 10.32 |
compactness_se | 17 | 569 | 0.03 | 0.02 | 0.02 | 0.00 | 0.14 | 0.13 | 1.89 | 5.02 |
concavity_se | 18 | 569 | 0.03 | 0.03 | 0.03 | 0.00 | 0.40 | 0.40 | 5.08 | 48.24 |
concave.points_se | 19 | 569 | 0.01 | 0.01 | 0.01 | 0.00 | 0.05 | 0.05 | 1.44 | 5.04 |
symmetry_se | 20 | 569 | 0.02 | 0.01 | 0.02 | 0.01 | 0.08 | 0.07 | 2.18 | 7.78 |
fractal_dimension_se | 21 | 569 | 0.00 | 0.00 | 0.00 | 0.00 | 0.03 | 0.03 | 3.90 | 25.94 |
radius_worst | 22 | 569 | 16.27 | 4.83 | 14.97 | 7.93 | 36.04 | 28.11 | 1.10 | 0.91 |
texture_worst | 23 | 569 | 25.68 | 6.15 | 25.41 | 12.02 | 49.54 | 37.52 | 0.50 | 0.20 |
perimeter_worst | 24 | 569 | 107.26 | 33.60 | 97.66 | 50.41 | 251.20 | 200.79 | 1.12 | 1.04 |
area_worst | 25 | 569 | 880.58 | 569.36 | 686.50 | 185.20 | 4254.00 | 4068.80 | 1.85 | 4.32 |
smoothness_worst | 26 | 569 | 0.13 | 0.02 | 0.13 | 0.07 | 0.22 | 0.15 | 0.41 | 0.49 |
compactness_worst | 27 | 569 | 0.25 | 0.16 | 0.21 | 0.03 | 1.06 | 1.03 | 1.47 | 2.98 |
concavity_worst | 28 | 569 | 0.27 | 0.21 | 0.23 | 0.00 | 1.25 | 1.25 | 1.14 | 1.57 |
concave.points_worst | 29 | 569 | 0.11 | 0.07 | 0.10 | 0.00 | 0.29 | 0.29 | 0.49 | -0.55 |
symmetry_worst | 30 | 569 | 0.29 | 0.06 | 0.28 | 0.16 | 0.66 | 0.51 | 1.43 | 4.37 |
fractal_dimension_worst | 31 | 569 | 0.08 | 0.02 | 0.08 | 0.06 | 0.21 | 0.15 | 1.65 | 5.16 |
Below are the inference of the summary.
diagonsis
is the target variable.myCol and id
are index. It is not required in predictors.NA
in all the predictors. Imputation is not required.radius, texture, perimeter, area, smoothness, compactness, concavity, concave points, symmetry_mean, fractal dimenstion
. Dataset also has Mean, standard error and the worst measure of that particular cell.As a next step we will remove the unwanted variables and analyze individual set of dimensions.
As a first visualization, we will plot the histogram of all the predictor variables.
Histograms
Lets split the predictors according to its category
BOX-PLOT
Some inference from the chart
Now lets plot the correlation plots to understand more about the correlated predictors.
Below is the overall correlation matrix of all the predictors
Correlation list (only high correlation values, r>0.9)Lets deepdive into response variable and see its distribution.
Summary of data explorations findings:
radius_se
, perimeter_se
, area_se
, concavity_se
and fractal_dimension_se
.Diagnosis
, with r>=0.6, this is a good news.To solve the highly correlated variables, we will follow two types.
In this type, as the variables are highly correlated, we will transform the predictors using principal component analysis(PCA). PCA will provide the transformed variables.
Above summary shows that 10 PCA’s shows the 95% of variation. And 17 PCA’s shows 99.1% of variation in the dataset. We will select 17 PCA’s to show the variation in the data. It also has a clusters of around 7. Now lets see the clusters in detail.
Logistic Regression is used to estimate the probability of a binary response based on one ore more predictors. The model itself simply models probability of output in terms of input, and does not perform statistical classification (it is not a classifier), though it can be used to make a classifier, for instance by choosing a cutoff value and classifying inputs with probability greater than the cutoff as one class, below the cutoff as the other.
Now we will build different models and compare the results.
As a initial set of models, we will try logistic regression with different variations in it.
Logistic Regression Model(Logit) with all variables
As a first model, we will build logistic regression model which has all the predictors with correlated variables..
##
## Call:
## glm(formula = trainfull_y ~ ., family = binomial, data = trainfull_x)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -4.085e-04 -2.000e-08 -2.000e-08 2.000e-08 4.427e-04
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -4.125e+03 1.062e+06 -0.004 0.997
## radius_mean -1.689e+03 1.323e+05 -0.013 0.990
## texture_mean 2.697e+01 8.738e+03 0.003 0.998
## perimeter_mean 2.606e+02 2.497e+04 0.010 0.992
## area_mean -1.522e+00 7.836e+02 -0.002 0.998
## smoothness_mean -2.641e+03 9.064e+05 -0.003 0.998
## compactness_mean -1.526e+04 1.898e+06 -0.008 0.994
## concavity_mean -1.625e+02 8.434e+05 0.000 1.000
## concave.points_mean 8.913e+03 2.021e+06 0.004 0.996
## symmetry_mean -1.826e+03 2.701e+05 -0.007 0.995
## fractal_dimension_mean 2.986e+04 2.587e+06 0.012 0.991
## radius_se 3.444e+03 6.046e+05 0.006 0.995
## texture_se 4.161e+01 6.254e+04 0.001 0.999
## perimeter_se -3.411e+02 3.773e+04 -0.009 0.993
## area_se -3.528e+00 5.808e+03 -0.001 1.000
## smoothness_se -8.118e+04 7.394e+06 -0.011 0.991
## compactness_se 2.619e+04 1.849e+06 0.014 0.989
## concavity_se -1.046e+04 7.599e+05 -0.014 0.989
## concave.points_se 3.982e+04 4.285e+06 0.009 0.993
## symmetry_se 7.354e+03 1.928e+06 0.004 0.997
## fractal_dimension_se -2.080e+05 1.870e+07 -0.011 0.991
## radius_worst -4.495e+01 8.104e+04 -0.001 1.000
## texture_worst 3.704e+00 6.233e+03 0.001 1.000
## perimeter_worst 1.242e+01 5.608e+03 0.002 0.998
## area_worst 1.688e+00 3.940e+02 0.004 0.997
## smoothness_worst 8.448e+03 1.482e+06 0.006 0.995
## compactness_worst -2.650e+03 2.002e+05 -0.013 0.989
## concavity_worst 2.083e+03 2.053e+05 0.010 0.992
## concave.points_worst 1.531e+03 4.357e+05 0.004 0.997
## symmetry_worst 1.064e+03 3.589e+05 0.003 0.998
## fractal_dimension_worst 1.029e+04 8.558e+05 0.012 0.990
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 5.2517e+02 on 398 degrees of freedom
## Residual deviance: 1.9353e-06 on 368 degrees of freedom
## AIC: 62
##
## Number of Fisher Scoring iterations: 25
None of the predictors are signficant due to correlated variables. This model did not provide any results and did not converge.
We can also perform logistic regression on PCA transformed variables. However, the problem is the model is not interpretable. One of the biggest advantage of Logistic regression is interpretability. By using PCA variables, we loose that advantage. But lets see how the model reacts to the PCA variables.
##
## Call:
## glm(formula = train_y ~ . - PC11 - PC6 - PC8 - PC10 - PC5 - PC9 -
## PC13 - PC14 - PC12 - PC7, family = binomial, data = pca_wdbc$x[,
## c(1:14)] %>% data.frame())
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.4551 -0.0467 -0.0028 0.0009 3.3087
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.3741 0.4019 -3.419 0.000629 ***
## PC1 -3.0425 0.5450 -5.583 2.36e-08 ***
## PC2 2.8103 0.5622 4.999 5.77e-07 ***
## PC3 -1.7361 0.4486 -3.870 0.000109 ***
## PC4 -2.5773 0.6424 -4.012 6.02e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 525.172 on 398 degrees of freedom
## Residual deviance: 57.999 on 394 degrees of freedom
## AIC: 67.999
##
## Number of Fisher Scoring iterations: 9
PCA models are difficult to interpret. So this model does not provide any other information.
In Probit Regression, the dependent variable can take only two values. The purpose of the model is to estimate the probability that an observation with a particular characteristics will fall into a specific one of the categories. A probit model is a popular specification for an ordinal or a binary response model.
Now lets perform logistic regression with probit link function. After backward stepwise eliminiation, below is the model we got.
##
## Call:
## glm(formula = train_y ~ . - perimeter_se - texture_worst - concave.points_worst -
## symmetry_se - fractal_dimension_mean - symmetry_mean - compactness_worst -
## concavity_se - concave.points_se - concavity_mean - smoothness_se -
## texture_se - symmetry_worst, family = binomial(link = "probit"),
## data = train_x)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.819 -0.003 0.000 0.000 3.281
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -10.07635 5.98008 -1.685 0.091991 .
## texture_mean 0.25963 0.06897 3.764 0.000167 ***
## perimeter_mean -0.41491 0.10733 -3.866 0.000111 ***
## smoothness_mean -81.77502 42.32285 -1.932 0.053338 .
## concave.points_mean 131.35210 34.53616 3.803 0.000143 ***
## radius_se 13.05659 5.87379 2.223 0.026226 *
## area_se -84.30666 38.45030 -2.193 0.028335 *
## fractal_dimension_se 0.66639 0.37989 1.754 0.079402 .
## radius_worst 2.20845 0.62953 3.508 0.000451 ***
## smoothness_worst 54.55936 20.46253 2.666 0.007669 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 525.172 on 398 degrees of freedom
## Residual deviance: 45.703 on 389 degrees of freedom
## AIC: 65.703
##
## Number of Fisher Scoring iterations: 12
Seems there are nine predictors which are significant. It has similar AIC score compared to logit model.
Train set performance
accuracy | auc | |
---|---|---|
Logistic | 0.980 | 0.976 |
Logistic PCA | 0.980 | 0.974 |
Probit | 0.975 | 0.969 |
Test set performance
accuracy | auc | |
---|---|---|
Logistic | 0.976 | 0.975 |
Logistic PCA | 0.100 | 0.916 |
Probit | 0.971 | 0.967 |
Cubist is a prediction-oriented regression model that combines the ideas in Quinlan (1992) and Quinlan (1993).
Although it initially creates a tree structure, it collapses each path through the tree into a rule. A regression model is fit for each rule based on the data subset defined by the rules. The set of rules are pruned or possibly combined. and the candidate variables for the linear regression models are the predictors that were used in the parts of the rule that were pruned away. This part of the algorithm is consistent with the “M5” or Model Tree approach.
Cubist generalizes this model to add boosting (when committees > 1) and instance based corrections (see predict.cubist()). The number of instances is set at prediction time by the user and is not needed for model building.
Predictor importance from cubist model
Average Neural Networks regression AVNNET. A Neural Network (NN) is a graph of computational units that receive inputs and transfer the result into an output that is passed on. The units are ordered into layers to connect the features of an input vector to the features of an output vector. With training, such as the Back-Propagation algorithm, neural networks can be designed and trained to model the underlying relationship in data.
## Model Averaged Neural Network
##
## 399 samples
## 22 predictor
## 2 classes: '0', '1'
##
## Pre-processing: centered (22), scaled (22)
## Resampling: Cross-Validated (10 fold, repeated 5 times)
## Summary of sample sizes: 360, 360, 359, 359, 360, 359, ...
## Resampling results across tuning parameters:
##
## decay Accuracy Kappa
## 0.01 0.9709681 0.9375156
## 0.10 0.9750322 0.9460329
##
## Tuning parameter 'size' was held constant at a value of 1
## Tuning
## parameter 'bag' was held constant at a value of FALSE
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were size = 1, decay = 0.1 and bag
## = FALSE.
Support Vector Machines (SVM) are a class of methods, developed originally for classification, that find support points that best separate classes. SVM for regression is called Support Vector Regression (SVM).
## Support Vector Machines with Radial Basis Function Kernel
##
## 399 samples
## 22 predictor
## 2 classes: '0', '1'
##
## Pre-processing: centered (22), scaled (22)
## Resampling: Cross-Validated (10 fold, repeated 5 times)
## Summary of sample sizes: 360, 360, 359, 359, 360, 359, ...
## Resampling results across tuning parameters:
##
## C Accuracy Kappa
## 0.25 0.9569418 0.9052807
## 0.50 0.9639931 0.9214983
## 1.00 0.9674675 0.9287819
## 2.00 0.9724681 0.9397658
## 4.00 0.9719431 0.9386826
## 8.00 0.9729431 0.9412281
## 16.00 0.9609515 0.9158910
## 32.00 0.9559619 0.9052550
## 64.00 0.9494362 0.8911838
## 128.00 0.9494362 0.8911838
## 256.00 0.9494362 0.8911838
## 512.00 0.9494362 0.8911838
## 1024.00 0.9494362 0.8911838
## 2048.00 0.9494362 0.8911838
##
## Tuning parameter 'sigma' was held constant at a value of 0.04889095
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were sigma = 0.04889095 and C = 8.
Train set performance
accuracy | auc | |
---|---|---|
AvNeural | 0.990 | 0.986 |
SVM | 0.995 | 0.993 |
Cubist | 1.000 | 1.000 |
Test set performance
accuracy | auc | |
---|---|---|
AvNeural | 0.971 | 0.970 |
SVM | 0.971 | 0.970 |
Cubist | 0.976 | 0.975 |
In the following section, we will describe the best model.
## Confusion Matrix and Statistics
##
## test_y
## conv_31_cubit_t 0 1
## 0 103 2
## 1 2 63
##
## Accuracy : 0.9765
## 95% CI : (0.9409, 0.9936)
## No Information Rate : 0.6176
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.9502
## Mcnemar's Test P-Value : 1
##
## Sensitivity : 0.9692
## Specificity : 0.9810
## Pos Pred Value : 0.9692
## Neg Pred Value : 0.9810
## Prevalence : 0.3824
## Detection Rate : 0.3706
## Detection Prevalence : 0.3824
## Balanced Accuracy : 0.9751
##
## 'Positive' Class : 1
##
Charles Campbell and Nello Cristianini Cover in depth the usage of the Kernel-Adatron Algorithm in their paper Simple Learning Algorithms for Training Support Vector Machines[5]. Their method was adapted from the Statistical Mechanics approach to learning[8]. The core concept of implementing this algorithm is to find hyperplanes which have optimal stability.
The authors of the paper tested different statistical models before applying the Kernel-Adaton support Vector Machine model. Non SVM models include CART, RBF, and Linear Discriminant Analysis(LDA). CART models are representative of a binary tree where each root node represents a single input variable (x) and is then split on that variable, the leaf nodes of the tree will contain the output variable (y) that will make the prediction[9]. Radial Basis Function(RBF) is a type of model that can be employed for methods that require Linear or nonlinear problems. These types of functions have been associated with a single layer and multilayer networks[10]. Linear Discriminant Analysis(LDA) use statistical properties of the data that are calculated by each class for input variables in a single variable. For multiple variables, the properties are calculated over the multivariate Gaussian[11]. Multi-Layer Neural Network(Back-Propagation) is a supervised learning technique that consists of at least three-layered nodes, where each node is a neuron with a nonlinear activation function.
To achieve the ideal model performance, SLATSVM[5] used 10 fold cross-validation and the whole dataset. The results for each model were the following CART 94.2%, RBF 95.9%, LDA 96.0%, Multi-Layer Neural Network(Back-Propagation) 96.6% with the optimal performance being 99.48%.
The optimal model was achieved using cross-validation 10 fold, Centering and scaling the variables. The model used was svmRadial applied using the caret package in R-3.3.1. The Data was split 70% training and 30% testing. The performance was 98.7% against the training set. While high in accuracy when the code was tested against the holdout dataset, the accuracy increased slightly to 98.8% percent.
The svmRadial model performed with high accuracy in the training and test set. Other parameters that were exceptional where Area Under the Curve(AUC) with a value of 99.3% and 97.0% in the test set. The AUC captures the efficiency of the model in this example 90% being an A grade.
From the analysis we can conclude that the cubist and the logistic models have similar performance. The logistic model seems to be the easiest to implement whereas nonlinear models like SVM have implementation complexity. In logistic models we can see the coefficient of all predictors. Although, the logistic model is more interpretable than the cubist model, the cubist model seems to be best model because the accuracy in the cubist model is the highest 0.976
.
REFERENCES:
[1] Dua, D. and Karra Taniskidou, E. (2017). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.
[2] https://stackoverflow.com/questions/7074246/show-correlations-as-an-ordered-list-not-as-a-large-matrix
[3] https://cran.r-project.org/web/packages/Cubist/Cubist.pdf
[4] Andrew I. Schein and Lyle H. Ungar. A-Optimality for Active Learning of Logistic Regression Classifiers. Department of Computer and Information Science Levine Hall.
[5] Charles Campbell and Nello Cristianini. Simple Learning Algorithms for Training Support Vector Machines. Dept. of Engineering Mathematics.
[6] O.L. Mangasarian, W.N. Street and W.H. Wolberg. Breast cancer diagnosis and prognosis via linear programming. Operations Research, 43(4), pages 570-577, July-August 1995.
[7] Kuhn, Johnson, Applied Predictive Modeling, Springer [8] Watkin, T., Ran, A. & Biehl, M. (1993). The Statistical Mechanics of Learning a Rule, Rev. Mod. Phys. 65(2). [9]Brownlee, J., Ph.D. (2017, September 20). Classification And Regression Trees for Machine Learning. Retrieved May 20, 2018, from https://machinelearningmastery.com/classification-and-regression-trees-for-machine-learning/ [10]Mark J L Orr. (1996). Introduction to Radial Basis Function Networks. [11]Brownlee, J., Ph.D. (2016, April 6).Linear Discriminant Analysis for Machine Learning. Retrieved May 20 2018, from https://machinelearningmastery.com/linear-discriminant-analysis-for-machine-learning/
R code
knitr::opts_chunk$set(fig.width=12, fig.height=8, fig.path='Figs/', echo=FALSE, warning=FALSE, message=FALSE)
#load libraries
library("xlsx")
library("psych")
library("GGally")
library("ggplot2")
library("caret")
library("dplyr")
library("pROC")
library("car")
library("leaps")
library("sjstats")
library("PerformanceAnalytics")
library("factoextra")
library("BGLR")
library("keras")
library("dplyr")
library("kableExtra")
library("Cubist")
#load dataset
url<-"https://raw.githubusercontent.com/bvshyam/Cancer_prediction/master/data/data_with groups.xlsx"
temp.file <- paste(tempfile(),".xlsx",sep = "")
download.file(url,temp.file, mode="wb")
WDBCdata<<-xlsx::read.xlsx(temp.file, sheetName = "ics data", header=T)
#WDBCdata<- read.xlsx("data/data_with groups.xlsx", sheetName = "ics data")
# First step the id will be removed to avoid any future trouble
# change M = malignant for 1 and B for B = benign for zero
WDBCdata$mycol<-NULL
WDBCdata$id<-NULL
WDBCdata$diagnosis<-ifelse(WDBCdata$diagnosis == "B", 0, 1)
# summary stats
WDBC_tbl<-describe(WDBCdata,IQR=T)[,c(1:5,8:10,11,12)]
kable(round(WDBC_tbl,2), caption = "Selected Stats", format = "html") %>%
kable_styling(latex_options = "striped", font_size=10)
rm(WDBC_tbl)
library("Hmisc")
hist.data.frame(WDBCdata, n.unique=1, mtitl = "Breast Cancer Histogram")
WDBCdata_mean = cbind(diagnosis=WDBCdata[,c(1)], WDBCdata[,c(2:11)])
WDBCdata_se = cbind(diagnosis=WDBCdata[,c(1)], WDBCdata[,c(12:21)])
WDBCdata_worst = cbind(diagnosis=WDBCdata[,c(1)], WDBCdata[,c(22:31)])
# boxplot
par(cex.axis=0.8) # is for x-axis
boxplot(WDBCdata_mean, las=2, col="green", main="Breast Cancer Box-Plot for Mean", ylim = c(0,150))
boxplot(WDBCdata_se, las=2, col="green", main="Breast Cancer Box-Plot for SE", ylim = c(0,150))
boxplot(WDBCdata_worst, las=2, col="green", main="Breast Cancer Box-Plot for Worst", ylim = c(0,150))
#Mean
chart.Correlation(WDBCdata_mean,histogram=TRUE,pch=19)
# SE
chart.Correlation(WDBCdata_se,histogram=TRUE,pch=19)
# Worst
chart.Correlation(WDBCdata_worst,histogram=TRUE,pch=19)
ggcorr(WDBCdata, nbreaks=8, palette='PRGn', label=TRUE, label_size=2, size = 1.8, label_color='black') + ggtitle("Breast Cancer Correlation Matrix") + theme(plot.title = element_text(hjust = 0.5, color = "grey15"))
z = cor(WDBCdata)
z = round(z,4)
z[abs(z)<0.9]=NA # remove low relationship
z[lower.tri(z,diag=TRUE)]=NA #Prepare to drop duplicates and meaningless information
z=as.data.frame(as.table(z)) #Turn into a 3-column table
z=na.omit(z) #Get rid of the junk we flagged above
z=z[order(-abs(z$Freq)),] #Sort by highest correlation (whether +ve or -ve)
z
rm(z)
#qplot(as.factor(WDBCdata$diagnosis))+geom_bar() + #labs(x='Diagnosis', y ='Count')
barplot(table(WDBCdata$diagnosis), col="blue", ylab="count", xlab="Diagnosis", main="Response distribution")
bc1<-BoxCoxTrans(WDBCdata$radius_se)
bc2<-BoxCoxTrans(WDBCdata$perimeter_se)
bc3<-BoxCoxTrans(WDBCdata$area_se )
bc5<-BoxCoxTrans(WDBCdata$fractal_dimension_se)
par(mfrow=c(1,2))
hist(WDBCdata$radius_se, main="Histogram radius_se", xlab="", col="yellow")
hist(WDBCdata$radius_se^bc1$lambda, main="Histogram radius_se transf.", xlab="", col="green")
par(mfrow=c(1,2))
hist(WDBCdata$perimeter_se, main="Histogram perimeter_se", xlab="", col="yellow")
hist(WDBCdata$perimeter_se^bc2$lambda, main="Histogram perimeter_se transf.", xlab="",col="green")
par(mfrow=c(1,2))
hist(WDBCdata$area_se, main="Histogram area_se",xlab="", col="yellow")
hist(WDBCdata$area_se^bc3$lambda, main="Histogram area_se transf.",xlab="",col="green")
par(mfrow=c(1,2))
hist(WDBCdata$fractal_dimension_se, main="Histogram dimension_se",xlab="", col="yellow")
hist(WDBCdata$fractal_dimension_se^bc5$lambda, main="Histogram dimension_se transf.",xlab="",col="green")
par(mfrow=c(1,1))
WDBCdatafull<-WDBCdata #copy data set
WDBCdata<-subset(WDBCdata, select=-c(area_mean,radius_mean,area_worst,compactness_mean,perimeter_worst,compactness_se,concavity_worst,fractal_dimension_worst)) # remove predictors
# WDBCdatafull all predictors
# WDBCdata remove highly correlated predictors
set.seed(123)
indx<-createDataPartition(WDBCdata$diagnosis, p=0.7, list=FALSE)
train_x<-WDBCdata[indx,-1]
train_y<-WDBCdata$diagnosis[indx]
test_x<-WDBCdata[-indx,-1]
test_y<-WDBCdata$diagnosis[-indx]
# the subset bellow is with full predictors, cannot be used with logistic model, only can be used with model that accept highly correlated predictors.
trainfull_x<-WDBCdatafull[indx,-1]
trainfull_y<-WDBCdatafull$diagnosis[indx]
testfull_x<-WDBCdatafull[-indx,-1]
testfull_y<-WDBCdatafull$diagnosis[-indx]
# applying the power transformation for the test and train data set
train_x$radius_se<-train_x$radius_se^bc1$lambda
train_x$perimeter_se<-train_x$perimeter_se^bc2$lambda
train_x$area_se<-train_x$area_se^bc3$lambda
train_x$fractal_dimension_se<-train_x$fractal_dimension_se^bc5$lambda
test_x$radius_se<-test_x$radius_se^bc1$lambda
test_x$perimeter_se<-test_x$perimeter_se^bc2$lambda
test_x$area_se<-test_x$area_se^bc3$lambda
test_x$fractal_dimension_se<-test_x$fractal_dimension_se^bc5$lambda
pca_wdbc <- prcomp(train_x[,2:ncol(train_x)],center = TRUE, scale=TRUE)
pca_wbdc_test<-prcomp(test_x[,2:ncol(test_x)],center = TRUE, scale=TRUE)
plot(pca_wdbc, type='l', main="PCA - Principal Components Analysis Chart", col="red")
#summary(pca_wdbc$x)
pca_wdbc_var <- get_pca_var(pca_wdbc)
res <- kmeans(pca_wdbc_var$coord,centers = 5, nstart=25)
grp <- as.factor(res$cluster)
fviz_pca_var(pca_wdbc, col.var=grp, palette='jco', legend.title='Cluster')
model_11_logit_full <-glm(trainfull_y ~ . ,family=binomial, trainfull_x)
summary(model_11_logit_full)
model_12_logit_corr <-glm(train_y~.,family=binomial,data=train_x)
summary(model_12_logit_corr)
model_13_logit_corr_final <- update(model_12_logit_corr, .~.-symmetry_se-concave.points_worst-texture_se-perimeter_se-radius_se-symmetry_mean-fractal_dimension_mean-concave.points_mean-texture_mean-compactness_worst-symmetry_worst-smoothness_mean-area_se-smoothness_se )
summary(model_13_logit_corr_final)
model_14_pca <-glm( train_y~.-PC11 -PC6 -PC8 -PC10 -PC5 -PC9 -PC13 -PC14 -PC12 -PC7,family=binomial,data=pca_wdbc$x[,c(1:14)] %>% data.frame())
summary(model_14_pca)
model_15_probit <-glm( train_y~.-perimeter_se-texture_worst-concave.points_worst-symmetry_se-fractal_dimension_mean-symmetry_mean-compactness_worst-concavity_se-concave.points_se-concavity_mean-smoothness_se-texture_se-symmetry_worst,family=binomial(link = 'probit'),data=train_x)
summary(model_15_probit)
#Convert to 0/1
conv_13_logit_corr <- ifelse(predict(model_13_logit_corr_final) > 0.5,1,0)
#conf_13_logit_corr <- confusionMatrix(conv_13_logit_corr, train_y, positive="1")
conf_13_logit_corr <- confusionMatrix(table(conv_13_logit_corr, train_y), positive = "1")
conv_14_pca <- ifelse(predict(model_14_pca) > 0.5,1,0)
#conf_14_pca <- confusionMatrix(conv_14_pca, train_y, positive="1")
conf_14_pca <- confusionMatrix(table(conv_14_pca, train_y), positive="1")
conv_15_probit <- ifelse(predict(model_15_probit ) > 0.5,1,0)
#conf_15_probit <- confusionMatrix(conv_15_probit,train_y,positive="1")
conf_15_probit <- confusionMatrix(table(conv_15_probit, train_y), positive="1")
# compute accuracy
acc13<-conf_13_logit_corr$overall["Accuracy"]
acc14<-conf_14_pca$overall["Accuracy"]
acc15<-conf_15_probit$overall["Accuracy"]
# compute AUC
auc13<-roc(train_y ~ conv_13_logit_corr, train_x)$auc
auc14<-roc(train_y ~ conv_14_pca, train_x)$auc
auc15<-roc(train_y ~ conv_15_probit, train_x)$auc
df<-data.frame(accuracy=c(acc13,acc14,acc15),auc=c(auc13,auc14,auc15))
row.names(df)<-c("Logistic","Logistic PCA","Probit")
kable(round(df,3), caption = "Performance metrics train")
#Convert to 0/1
conv_13_logit_t <- ifelse(predict(object=model_13_logit_corr_final, newdata=test_x, type="response") > 0.5,1,0)
#conf_13_logit_t <- confusionMatrix(conv_13_logit_t, test_y, positive="1")
conf_13_logit_t <- confusionMatrix(table(conv_13_logit_t, test_y), positive="1")
conv_14_pca_t <- ifelse(predict(model_14_pca, newdata=as.data.frame(pca_wbdc_test$x[,c(1:14)]), type="response") > 0.5,1,0)
#conf_14_pca_t <- confusionMatrix(conv_14_pca_t, test_y, positive="1")
conf_14_pca_t <- confusionMatrix(table(conv_14_pca_t, test_y), positive="1")
conv_15_probit_t <- ifelse(predict(model_15_probit, newdata=test_x, type="response") > 0.5,1,0)
#conf_15_probit_t <- confusionMatrix(conv_15_probit_t, test_y, positive="1")
conf_15_probit_t <- confusionMatrix(table(conv_15_probit_t, test_y), positive="1")
# compute accuracy
acc13_t<-conf_13_logit_t$overall["Accuracy"]
acc14_t<-conf_14_pca_t$overall["Accuracy"]
acc15_t<-conf_15_probit_t$overall["Accuracy"]
# compute AUC
auc13_t<-roc(test_y ~ conv_13_logit_t, test_x)$auc
auc14_t<-roc(test_y ~ conv_14_pca_t, test_x)$auc
auc15_t<-roc(test_y ~ conv_15_probit_t, test_x)$auc
df<-data.frame(accuracy=c(acc13_t,acc14_t,acc15_t),auc=c(auc13_t,auc14_t,auc15_t))
row.names(df)<-c("Logistic","Logistic PCA","Probit")
kable(round(df,3), caption = "Performance metrics test")
set.seed(123)
ctrl=(trainControl(method="repeatedcv", repeats=5))
c<-c(100)
n<-c(3,4)
model_31_cubit<-train(train_x,train_y, method="cubist",
preProcess = c("center", "scale"),
tuneGrid = expand.grid(committees=c,neighbors=n),
trControl = ctrl)
dotPlot(varImp(model_31_cubit), main="Cubist Predictor importance")
set.seed(123)
ctrl=(trainControl(method="repeatedcv", repeats=5))
nnetGrid<-expand.grid(.decay=c(0.01,0.1), .size=1, .bag=FALSE)
model_41_neural<-train(x=train_x,y=as.factor(train_y),method="avNNet",
tuneGrid = nnetGrid,
trControl = ctrl,
preProcess = c("center","scale"),
linout=TRUE,
trace=FALSE,
MaxNWts=10*(ncol(train_x)+1)+10+1,
maxit=500)
model_41_neural
set.seed(123)
ctrl=(trainControl(method="repeatedcv", repeats=5))
SVM_Radial_Fit = train(x=train_x,y=as.factor(train_y), method = "svmRadial",
preProc = c("center", "scale"),
tuneLength = 14,trControl = ctrl)
SVM_Radial_Fit
set.seed(123)
Chris_SVM_Pred = data.frame(Pred= predict(SVM_Radial_Fit, newdata = test_x))
plot(SVM_Radial_Fit, scales = list(x = list(log = 2)))
#Chris_Con =confusionMatrix(Chris_SVM_Pred$Pred,test_y,dnn = c("Prediction", "Reference"))
#Chris_SVM_acc = data.frame( Val =postResample(pred = Chris_SVM_Pred$Pred, obs = test_y))
#fourfoldplot(Chris_Con$table,main=paste("SVM (",round(Chris_SVM_acc[1,]*100),"%)",sep=""),color = c("#ed3b3b", "#0099ff"))
#Chris_Con
#Convert to 0/1
conv_41_neural <- predict(model_41_neural)
#conf_41_neural<-confusionMatrix(conv_41_neural, train_y, positive="1")
conf_41_neural<-confusionMatrix(table(conv_41_neural, train_y), positive="1")
conv_21_svm <- predict(SVM_Radial_Fit)
#conf_21_svm<-confusionMatrix(conv_21_svm, train_y, positive="1")
conf_21_svm<-confusionMatrix(table(conv_21_svm, train_y), positive="1")
conv_31_cubit <- ifelse(predict(model_31_cubit) > 0.5,1,0)
#conf_31_cubit<-confusionMatrix(conv_31_cubit, train_y, positive="1")
conf_31_cubit<-confusionMatrix(table(conv_31_cubit, train_y), positive="1")
# compute accuracy
acc41<-conf_41_neural$overall["Accuracy"]
acc21<-conf_21_svm$overall["Accuracy"]
acc31<-conf_31_cubit$overall["Accuracy"]
# compute AUC
auc41<-roc(train_y ~ as.numeric(as.character(conv_41_neural)), train_x)$auc
auc21<-roc(train_y ~ as.numeric(as.character(conv_21_svm)), train_x)$auc
auc31<-roc(train_y ~ conv_31_cubit, train_x)$auc
df1<-data.frame(accuracy=c(acc41,acc21,acc31),auc=c(auc41,auc21,auc31))
row.names(df1)<-c("AvNeural","SVM","Cubist")
kable(round(df1,3), caption = "Performance metrics train")
#Convert to 0/1
conv_41_neural_t <- predict(model_41_neural, newdata=test_x, type="raw")
#conf_41_neural_t<-confusionMatrix(conv_41_neural_t,test_y,positive="1")
conf_41_neural_t<-confusionMatrix(table(conv_41_neural_t, test_y), positive="1")
conv_21_svm_t <- predict(SVM_Radial_Fit,newdata=test_x, type="raw")
#conf_21_svm_t<-confusionMatrix(conv_21_svm_t,test_y,positive="1")
conf_21_svm_t<-confusionMatrix(table(conv_21_svm_t, test_y), positive="1")
conv_31_cubit_t <- ifelse(predict(model_31_cubit,newdata=test_x) > 0.5,1,0)
#conf_31_cubit_t<-confusionMatrix(conv_31_cubit_t,test_y,positive="1")
conf_31_cubit_t<-confusionMatrix(table(conv_31_cubit_t, test_y), positive="1")
# compute accuracy
acc41_t<-conf_41_neural_t$overall["Accuracy"]
acc21_t<-conf_21_svm_t$overall["Accuracy"]
acc31_t<-conf_31_cubit_t$overall["Accuracy"]
# compute AUC
auc41_t<-roc(test_y ~ as.numeric(as.character(conv_41_neural_t)), test_x)$auc
auc21_t<-roc(test_y ~ as.numeric(as.character(conv_21_svm_t)), test_x)$auc
auc31_t<-roc(test_y ~ conv_31_cubit_t, test_x)$auc
df2<-data.frame(accuracy=c(acc41_t,acc21_t,acc31_t),auc=c(auc41_t,auc21_t,auc31_t))
row.names(df2)<-c("AvNeural","SVM","Cubist")
kable(round(df2,3), caption = "Performance metrics test")
# Function for printing confusion matrix
confusion_analysis <- function(df,model){
# Threshold value is 0.5, positive class is 1
predicted = if_else(predict(model,df)>=0.5, 1,0)
confusionMatrix(data = predicted,
reference = df$Diagnosis,
positive = "1")
}
# Function for calculating evaluation metrics
summary_analysis <- function(df,model){
print(summary(model))
print(paste0("BIC: ",BIC(model)))
print(paste0("VIF: ",vif(model)))
n = length(df$target)
print(paste0("Naglekerke-pseudo-R2:",(1-exp((model$dev-model$null)/n))/(1-exp(-model$null/n))))
print("Confusion Matrix:")
confusion_analysis(df,model)
}
conf_13_logit_t
fourfoldplot(conf_13_logit_t$table,color = c("#ed3b3b", "#0099ff"))
RocCurve <- roc(response = test_y,
ifelse(predict(object=model_13_logit_corr_final, newdata=test_x, type="response") > 0.5,1,0),levels = c(1,0))
plot(RocCurve, ylab = "Sensitivity", xlab = "1 - Specificity", main = "ROC Curve - Logistic", col = "red")
conf_31_cubit_t
fourfoldplot(conf_31_cubit_t$table,color = c("#ed3b3b", "#0099ff"))
rocCurve <- roc(response = test_y, predictor = predict(model_31_cubit,newdata = test_x), levels = c(1,0))
plot(rocCurve, ylab = "Sensitivity", xlab = "1 - Specificity", main = "ROC Curve - Cubist", col = "red")