The following dataset has been taken from the UCI Machine Learning Repository.
We will use the data to help classify breast tumors as benign or malignant.
We have 569 observations with 33 variables.
Ten real-valued features are computed for each cell nucleus, each reported as a mean, a standard error, and a worst value.
The SVM technique has been used to model the data.
The following packages have been used for the analysis:
library(tidyverse) # data manipulation and visualization (loads ggplot2 and dplyr)
library(ggplot2)   # plotting
library(dplyr)     # data wrangling verbs
library(corrplot)  # correlation matrix plots
library(caret)     # data partitioning and findCorrelation()
library(corrr)     # correlation analysis helpers
library(kernlab)   # kernel methods
library(e1071)     # svm() and tune()
library(DT)        # interactive data tables
The data file can be found at the UCI Machine Learning Repository.
We have 569 observations containing 33 variables. A quick view of the dataset is below:
setwd("C:/Users/heman/Desktop/study/BreastCancer")
cancer_data <- read.csv("Breast_Cancer.csv", header = TRUE, na.strings = c("",NA))
cancer_data %>% datatable(caption = "Cancer Data")
We have an id variable acting as the key for each observation.
The diagnosis variable is categorical and indicates whether the tumor is malignant (M) or benign (B). The X variable appears to contain only null values;
all remaining variables hold numerical data.
glimpse(cancer_data)
## Observations: 569
## Variables: 33
## $ id <int> 842302, 842517, 84300903, 84348301, 84...
## $ diagnosis <fct> M, M, M, M, M, M, M, M, M, M, M, M, M,...
## $ radius_mean <dbl> 17.990, 20.570, 19.690, 11.420, 20.290...
## $ texture_mean <dbl> 10.38, 17.77, 21.25, 20.38, 14.34, 15....
## $ perimeter_mean <dbl> 122.80, 132.90, 130.00, 77.58, 135.10,...
## $ area_mean <dbl> 1001.0, 1326.0, 1203.0, 386.1, 1297.0,...
## $ smoothness_mean <dbl> 0.11840, 0.08474, 0.10960, 0.14250, 0....
## $ compactness_mean <dbl> 0.27760, 0.07864, 0.15990, 0.28390, 0....
## $ concavity_mean <dbl> 0.30010, 0.08690, 0.19740, 0.24140, 0....
## $ concave.points_mean <dbl> 0.14710, 0.07017, 0.12790, 0.10520, 0....
## $ symmetry_mean <dbl> 0.2419, 0.1812, 0.2069, 0.2597, 0.1809...
## $ fractal_dimension_mean <dbl> 0.07871, 0.05667, 0.05999, 0.09744, 0....
## $ radius_se <dbl> 1.0950, 0.5435, 0.7456, 0.4956, 0.7572...
## $ texture_se <dbl> 0.9053, 0.7339, 0.7869, 1.1560, 0.7813...
## $ perimeter_se <dbl> 8.589, 3.398, 4.585, 3.445, 5.438, 2.2...
## $ area_se <dbl> 153.40, 74.08, 94.03, 27.23, 94.44, 27...
## $ smoothness_se <dbl> 0.006399, 0.005225, 0.006150, 0.009110...
## $ compactness_se <dbl> 0.049040, 0.013080, 0.040060, 0.074580...
## $ concavity_se <dbl> 0.05373, 0.01860, 0.03832, 0.05661, 0....
## $ concave.points_se <dbl> 0.015870, 0.013400, 0.020580, 0.018670...
## $ symmetry_se <dbl> 0.03003, 0.01389, 0.02250, 0.05963, 0....
## $ fractal_dimension_se <dbl> 0.006193, 0.003532, 0.004571, 0.009208...
## $ radius_worst <dbl> 25.38, 24.99, 23.57, 14.91, 22.54, 15....
## $ texture_worst <dbl> 17.33, 23.41, 25.53, 26.50, 16.67, 23....
## $ perimeter_worst <dbl> 184.60, 158.80, 152.50, 98.87, 152.20,...
## $ area_worst <dbl> 2019.0, 1956.0, 1709.0, 567.7, 1575.0,...
## $ smoothness_worst <dbl> 0.1622, 0.1238, 0.1444, 0.2098, 0.1374...
## $ compactness_worst <dbl> 0.6656, 0.1866, 0.4245, 0.8663, 0.2050...
## $ concavity_worst <dbl> 0.71190, 0.24160, 0.45040, 0.68690, 0....
## $ concave.points_worst <dbl> 0.26540, 0.18600, 0.24300, 0.25750, 0....
## $ symmetry_worst <dbl> 0.4601, 0.2750, 0.3613, 0.6638, 0.2364...
## $ fractal_dimension_worst <dbl> 0.11890, 0.08902, 0.08758, 0.17300, 0....
## $ X <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
Out of 569 observations, we have 357 benign and 212 malignant tumors.
summary(cancer_data)
## id diagnosis radius_mean texture_mean
## Min. : 8670 B:357 Min. : 6.981 Min. : 9.71
## 1st Qu.: 869218 M:212 1st Qu.:11.700 1st Qu.:16.17
## Median : 906024 Median :13.370 Median :18.84
## Mean : 30371831 Mean :14.127 Mean :19.29
## 3rd Qu.: 8813129 3rd Qu.:15.780 3rd Qu.:21.80
## Max. :911320502 Max. :28.110 Max. :39.28
## perimeter_mean area_mean smoothness_mean compactness_mean
## Min. : 43.79 Min. : 143.5 Min. :0.05263 Min. :0.01938
## 1st Qu.: 75.17 1st Qu.: 420.3 1st Qu.:0.08637 1st Qu.:0.06492
## Median : 86.24 Median : 551.1 Median :0.09587 Median :0.09263
## Mean : 91.97 Mean : 654.9 Mean :0.09636 Mean :0.10434
## 3rd Qu.:104.10 3rd Qu.: 782.7 3rd Qu.:0.10530 3rd Qu.:0.13040
## Max. :188.50 Max. :2501.0 Max. :0.16340 Max. :0.34540
## concavity_mean concave.points_mean symmetry_mean
## Min. :0.00000 Min. :0.00000 Min. :0.1060
## 1st Qu.:0.02956 1st Qu.:0.02031 1st Qu.:0.1619
## Median :0.06154 Median :0.03350 Median :0.1792
## Mean :0.08880 Mean :0.04892 Mean :0.1812
## 3rd Qu.:0.13070 3rd Qu.:0.07400 3rd Qu.:0.1957
## Max. :0.42680 Max. :0.20120 Max. :0.3040
## fractal_dimension_mean radius_se texture_se perimeter_se
## Min. :0.04996 Min. :0.1115 Min. :0.3602 Min. : 0.757
## 1st Qu.:0.05770 1st Qu.:0.2324 1st Qu.:0.8339 1st Qu.: 1.606
## Median :0.06154 Median :0.3242 Median :1.1080 Median : 2.287
## Mean :0.06280 Mean :0.4052 Mean :1.2169 Mean : 2.866
## 3rd Qu.:0.06612 3rd Qu.:0.4789 3rd Qu.:1.4740 3rd Qu.: 3.357
## Max. :0.09744 Max. :2.8730 Max. :4.8850 Max. :21.980
## area_se smoothness_se compactness_se concavity_se
## Min. : 6.802 Min. :0.001713 Min. :0.002252 Min. :0.00000
## 1st Qu.: 17.850 1st Qu.:0.005169 1st Qu.:0.013080 1st Qu.:0.01509
## Median : 24.530 Median :0.006380 Median :0.020450 Median :0.02589
## Mean : 40.337 Mean :0.007041 Mean :0.025478 Mean :0.03189
## 3rd Qu.: 45.190 3rd Qu.:0.008146 3rd Qu.:0.032450 3rd Qu.:0.04205
## Max. :542.200 Max. :0.031130 Max. :0.135400 Max. :0.39600
## concave.points_se symmetry_se fractal_dimension_se
## Min. :0.000000 Min. :0.007882 Min. :0.0008948
## 1st Qu.:0.007638 1st Qu.:0.015160 1st Qu.:0.0022480
## Median :0.010930 Median :0.018730 Median :0.0031870
## Mean :0.011796 Mean :0.020542 Mean :0.0037949
## 3rd Qu.:0.014710 3rd Qu.:0.023480 3rd Qu.:0.0045580
## Max. :0.052790 Max. :0.078950 Max. :0.0298400
## radius_worst texture_worst perimeter_worst area_worst
## Min. : 7.93 Min. :12.02 Min. : 50.41 Min. : 185.2
## 1st Qu.:13.01 1st Qu.:21.08 1st Qu.: 84.11 1st Qu.: 515.3
## Median :14.97 Median :25.41 Median : 97.66 Median : 686.5
## Mean :16.27 Mean :25.68 Mean :107.26 Mean : 880.6
## 3rd Qu.:18.79 3rd Qu.:29.72 3rd Qu.:125.40 3rd Qu.:1084.0
## Max. :36.04 Max. :49.54 Max. :251.20 Max. :4254.0
## smoothness_worst compactness_worst concavity_worst concave.points_worst
## Min. :0.07117 Min. :0.02729 Min. :0.0000 Min. :0.00000
## 1st Qu.:0.11660 1st Qu.:0.14720 1st Qu.:0.1145 1st Qu.:0.06493
## Median :0.13130 Median :0.21190 Median :0.2267 Median :0.09993
## Mean :0.13237 Mean :0.25427 Mean :0.2722 Mean :0.11461
## 3rd Qu.:0.14600 3rd Qu.:0.33910 3rd Qu.:0.3829 3rd Qu.:0.16140
## Max. :0.22260 Max. :1.05800 Max. :1.2520 Max. :0.29100
## symmetry_worst fractal_dimension_worst X
## Min. :0.1565 Min. :0.05504 Mode:logical
## 1st Qu.:0.2504 1st Qu.:0.07146 NA's:569
## Median :0.2822 Median :0.08004
## Mean :0.2901 Mean :0.08395
## 3rd Qu.:0.3179 3rd Qu.:0.09208
## Max. :0.6638 Max. :0.20750
The variable X has no values associated with it; hence it can be dropped.
colSums(is.na(cancer_data))
## id diagnosis radius_mean
## 0 0 0
## texture_mean perimeter_mean area_mean
## 0 0 0
## smoothness_mean compactness_mean concavity_mean
## 0 0 0
## concave.points_mean symmetry_mean fractal_dimension_mean
## 0 0 0
## radius_se texture_se perimeter_se
## 0 0 0
## area_se smoothness_se compactness_se
## 0 0 0
## concavity_se concave.points_se symmetry_se
## 0 0 0
## fractal_dimension_se radius_worst texture_worst
## 0 0 0
## perimeter_worst area_worst smoothness_worst
## 0 0 0
## compactness_worst concavity_worst concave.points_worst
## 0 0 0
## symmetry_worst fractal_dimension_worst X
## 0 0 569
cancer_data <- select(cancer_data, -X)
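A quick dimension check confirms the drop, leaving 32 of the original 33 variables:
dim(cancer_data) # expect 569 rows and 32 columns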
The correlation plot below shows a high degree of correlation among several of the variables.
# Drop id and diagnosis, keeping only the numeric predictors
corrdata <- cancer_data[, -c(1, 2)]
corrplot(cor(corrdata), order = "hclust")
Checking for pairs of variables with correlation above 0.9, we keep only one variable from each highly correlated group.
We end up removing the following variables from our analysis:
* area_se
* radius_mean
* area_worst
* perimeter_worst
* radius_worst
* concave.points_mean
# Indices of columns with pairwise correlation above the 0.9 cutoff
highly_correlated <- findCorrelation(cor(corrdata), cutoff = 0.9)
corrplot(cor(corrdata[, highly_correlated]), method = "number", order = "hclust")
cancer_data <- select(cancer_data, -area_se, -radius_mean, -area_worst, -perimeter_worst, -radius_worst, -concave.points_mean)
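As a quick check, the index vector returned by findCorrelation can be mapped back to column names, and we can verify that no strong pairwise correlation survives the removal. A small sketch using the objects defined above (assuming the flagged indices correspond to the variables dropped manually):
# Column names flagged by findCorrelation()
names(corrdata)[highly_correlated]
# Sanity check: the maximum absolute pairwise correlation among the
# remaining predictors should now fall below the 0.9 cutoff
cor_reduced <- cor(corrdata[, -highly_correlated])
max(abs(cor_reduced[upper.tri(cor_reduced)]))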
As noted earlier, out of 569 observations we have 357 benign and 212 malignant tumors; the bar chart below shows this class distribution.
ggplot(data = cancer_data, aes(x = diagnosis, fill = diagnosis)) +
geom_bar()
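The raw counts behind the chart can be confirmed directly:
table(cancer_data$diagnosis) # B: 357, M: 212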
We split our data into an 80:20 train:test proportion.
set.seed(10) # for a reproducible partition
ind.train <- createDataPartition(cancer_data$diagnosis, p = 0.8, list = FALSE)
cancer_data_train <- cancer_data[ind.train, ]
cancer_data_test <- cancer_data[-ind.train, ]
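Since createDataPartition samples within each outcome class, the benign/malignant proportions should be roughly preserved in both splits; a quick check:
# Class proportions should be similar across train and test
prop.table(table(cancer_data_train$diagnosis))
prop.table(table(cancer_data_test$diagnosis))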
We first try to fit a linear hyperplane separating the classes.
As there is no perfect separation, we need to find the best cost value for the soft-margin linear model.
We use the tune function, which performs 10-fold cross-validation and returns the cost value giving the least amount of error.
We get a cost value of 0.05 for this data.
# Candidate values for the soft-margin cost penalty
cost_range <- c(0.001, 0.005, 0.01, 0.02, 0.05, 0.1, 0.2, 0.5, 1, 1.5, 2, 5)
tune.out <- tune(svm, diagnosis ~ . - id, data = cancer_data_train, kernel = "linear",
                 ranges = list(cost = cost_range))
bestmod_linear <- tune.out$best.model
summary(bestmod_linear)
##
## Call:
## best.tune(method = svm, train.x = diagnosis ~ . - id, data = cancer_data_train,
## ranges = list(cost = cost_range), kernel = "linear")
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: linear
## cost: 0.05
## gamma: 0.04166667
##
## Number of Support Vectors: 76
##
## ( 37 39 )
##
##
## Number of Classes: 2
##
## Levels:
## B M
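Beyond the winning model, the tuning object also records the cross-validated error for every candidate cost, which is worth inspecting before trusting the selected value. A short sketch using the tune.out object from above:
# 10-fold CV error and dispersion for each candidate cost;
# best.model corresponds to the row with the lowest error
tune.out$performances
plot(tune.out) # e1071's plot method for tuning objects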
On the training data, we get an accuracy of 0.9846, with a Sensitivity of 0.9965 and Specificity of 0.9647
predictions_train <- predict(bestmod_linear)
confusionMatrix(predictions_train, cancer_data_train$diagnosis)
## Confusion Matrix and Statistics
##
## Reference
## Prediction B M
## B 285 6
## M 1 164
##
## Accuracy : 0.9846
## 95% CI : (0.9686, 0.9938)
## No Information Rate : 0.6272
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.967
## Mcnemar's Test P-Value : 0.1306
##
## Sensitivity : 0.9965
## Specificity : 0.9647
## Pos Pred Value : 0.9794
## Neg Pred Value : 0.9939
## Prevalence : 0.6272
## Detection Rate : 0.6250
## Detection Prevalence : 0.6382
## Balanced Accuracy : 0.9806
##
## 'Positive' Class : B
##
On the test data, we get an accuracy of 0.9823, with a Sensitivity of 0.9718 and Specificity of 1.00
predictions_test <- predict(bestmod_linear, newdata = cancer_data_test)
confusionMatrix(predictions_test, cancer_data_test$diagnosis)
## Confusion Matrix and Statistics
##
## Reference
## Prediction B M
## B 69 0
## M 2 42
##
## Accuracy : 0.9823
## 95% CI : (0.9375, 0.9978)
## No Information Rate : 0.6283
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.9625
## Mcnemar's Test P-Value : 0.4795
##
## Sensitivity : 0.9718
## Specificity : 1.0000
## Pos Pred Value : 1.0000
## Neg Pred Value : 0.9545
## Prevalence : 0.6283
## Detection Rate : 0.6106
## Detection Prevalence : 0.6106
## Balanced Accuracy : 0.9859
##
## 'Positive' Class : B
##
Next we try to fit a non-linear boundary between the classes using an SVM with a polynomial kernel.
We again use the tune function with 10-fold cross-validation to find the cost value giving the least amount of error (the polynomial degree is left at its default of 3).
We get a cost value of 5 for this data.
tune.out <- tune(svm, diagnosis~. -id, data = cancer_data_train, kernel = "polynomial",
ranges = list(cost = cost_range))
bestmod_polynomial <- tune.out$best.model
summary(bestmod_polynomial)
##
## Call:
## best.tune(method = svm, train.x = diagnosis ~ . - id, data = cancer_data_train,
## ranges = list(cost = cost_range), kernel = "polynomial")
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: polynomial
## cost: 5
## degree: 3
## gamma: 0.04166667
## coef.0: 0
##
## Number of Support Vectors: 119
##
## ( 56 63 )
##
##
## Number of Classes: 2
##
## Levels:
## B M
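Note that the degree of 3 reported above is e1071's default rather than a tuned value, since only cost was included in the ranges list. If desired, the degree could be searched jointly with cost; a hypothetical sketch (tune.poly is an illustrative name, not an object used elsewhere):
# Hypothetical: tune cost and polynomial degree together
tune.poly <- tune(svm, diagnosis ~ . - id, data = cancer_data_train,
                  kernel = "polynomial",
                  ranges = list(cost = cost_range, degree = 2:4))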
On the training data, we get an accuracy of 0.9693, with a Sensitivity of 1.0000 and Specificity of 0.9176
predictions_train <- predict(bestmod_polynomial)
confusionMatrix(predictions_train, cancer_data_train$diagnosis)
## Confusion Matrix and Statistics
##
## Reference
## Prediction B M
## B 286 14
## M 0 156
##
## Accuracy : 0.9693
## 95% CI : (0.949, 0.9831)
## No Information Rate : 0.6272
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9332
## Mcnemar's Test P-Value : 0.000512
##
## Sensitivity : 1.0000
## Specificity : 0.9176
## Pos Pred Value : 0.9533
## Neg Pred Value : 1.0000
## Prevalence : 0.6272
## Detection Rate : 0.6272
## Detection Prevalence : 0.6579
## Balanced Accuracy : 0.9588
##
## 'Positive' Class : B
##
On the test data, we get an accuracy of 0.9381, with a Sensitivity of 1.0000 and Specificity of 0.8333
predictions_test <- predict(bestmod_polynomial, newdata = cancer_data_test)
confusionMatrix(predictions_test, cancer_data_test$diagnosis)
## Confusion Matrix and Statistics
##
## Reference
## Prediction B M
## B 71 7
## M 0 35
##
## Accuracy : 0.9381
## 95% CI : (0.8765, 0.9747)
## No Information Rate : 0.6283
## P-Value [Acc > NIR] : 1.718e-14
##
## Kappa : 0.8627
## Mcnemar's Test P-Value : 0.02334
##
## Sensitivity : 1.0000
## Specificity : 0.8333
## Pos Pred Value : 0.9103
## Neg Pred Value : 1.0000
## Prevalence : 0.6283
## Detection Rate : 0.6283
## Detection Prevalence : 0.6903
## Balanced Accuracy : 0.9167
##
## 'Positive' Class : B
##
Finally, to fit a radial boundary between the classes, we use kernel = "radial".
We use the tune function with 10-fold cross-validation to find the cost and gamma values giving the least amount of error.
We get a cost value of 1.5 and a gamma of 0.5 for this data.
# Candidate gamma values for the radial kernel
gamma_range <- c(0.5, 1, 2, 3, 4)
tune.out <- tune(svm, diagnosis~. -id, data = cancer_data_train, kernel = "radial",
ranges = list(cost = cost_range,
gamma = gamma_range))
bestmod_radial <- tune.out$best.model
summary(bestmod_radial)
##
## Call:
## best.tune(method = svm, train.x = diagnosis ~ . - id, data = cancer_data_train,
## ranges = list(cost = cost_range, gamma = gamma_range), kernel = "radial")
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: radial
## cost: 1.5
## gamma: 0.5
##
## Number of Support Vectors: 416
##
## ( 168 248 )
##
##
## Number of Classes: 2
##
## Levels:
## B M
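The number of support vectors (416, out of 456 training rows) is strikingly high, which already hints that the radial kernel with these gamma values is overfitting. The per-gamma cross-validation error can be pulled from the tuning object; a sketch using dplyr, which is already loaded:
# Best CV error achieved at each candidate gamma
tune.out$performances %>%
  group_by(gamma) %>%
  summarise(best_error = min(error))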
On the training data, we are able to predict the classes with 100% accuracy, which, given the large number of support vectors, is itself a warning sign of overfitting.
predictions_train <- predict(bestmod_radial)
confusionMatrix(predictions_train, cancer_data_train$diagnosis)
## Confusion Matrix and Statistics
##
## Reference
## Prediction B M
## B 286 0
## M 0 170
##
## Accuracy : 1
## 95% CI : (0.9919, 1)
## No Information Rate : 0.6272
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 1
## Mcnemar's Test P-Value : NA
##
## Sensitivity : 1.0000
## Specificity : 1.0000
## Pos Pred Value : 1.0000
## Neg Pred Value : 1.0000
## Prevalence : 0.6272
## Detection Rate : 0.6272
## Detection Prevalence : 0.6272
## Balanced Accuracy : 1.0000
##
## 'Positive' Class : B
##
On the test data, we get an accuracy of 0.8584, with a Sensitivity of 0.9437 and Specificity of 0.7143
predictions_test <- predict(bestmod_radial, newdata = cancer_data_test)
confusionMatrix(predictions_test, cancer_data_test$diagnosis)
## Confusion Matrix and Statistics
##
## Reference
## Prediction B M
## B 67 12
## M 4 30
##
## Accuracy : 0.8584
## 95% CI : (0.7803, 0.9168)
## No Information Rate : 0.6283
## P-Value [Acc > NIR] : 5.334e-08
##
## Kappa : 0.6846
## Mcnemar's Test P-Value : 0.08012
##
## Sensitivity : 0.9437
## Specificity : 0.7143
## Pos Pred Value : 0.8481
## Neg Pred Value : 0.8824
## Prevalence : 0.6283
## Detection Rate : 0.5929
## Detection Prevalence : 0.6991
## Balanced Accuracy : 0.8290
##
## 'Positive' Class : B
##
Using the SVM technique with linear, polynomial and radial kernels, we have been able to get a good separation between the two tumor classes. The radial model, which is the most flexible of the three, fits the training data perfectly, correctly predicting every tumor class, but does not do so well on the test data. The polynomial model does a decent job on both the training data and the test data.
The linear model, however, outperforms the other two, consistently achieving high accuracy on both the training and the test set. Hence we decide to use the linear model as our final model.
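As a compact way to reproduce this comparison, the test-set accuracy of the three tuned models can be collected in one step; a sketch assuming the three best-model objects fitted in the sections above:
# Test-set accuracy for each kernel, side by side
models <- list(linear = bestmod_linear,
               polynomial = bestmod_polynomial,
               radial = bestmod_radial)
sapply(models, function(m)
  mean(predict(m, newdata = cancer_data_test) == cancer_data_test$diagnosis))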