Introduction

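The dataset comes from the UCI Machine Learning Repository.

We will use the data to classify breast tumors as benign or malignant.

The data contain 569 observations of 33 variables.

Ten real-valued features are computed for each cell nucleus: radius, texture, perimeter, area, smoothness, compactness, concavity, concave points, symmetry, and fractal dimension. Each feature is recorded as a mean, a standard error, and a worst value.

A support vector machine (SVM) is used to model the data.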
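
As a rough check on the variable count above, the 33 columns can be rebuilt from the ten features. This is only a sketch: the feature spellings and the trailing X column follow the column names printed later in this analysis, and the object names are illustrative.

```r
# Ten nucleus features, spelled as in the dataset's column names
features <- c("radius", "texture", "perimeter", "area", "smoothness",
              "compactness", "concavity", "concave.points", "symmetry",
              "fractal_dimension")
# Each feature is stored three times: mean, standard error, worst value
stats <- c("mean", "se", "worst")

# outer() pastes every feature with every statistic (30 measurement columns)
measure_cols <- as.vector(outer(features, stats, paste, sep = "_"))

# Add the id key, the diagnosis label, and the empty X column
all_cols <- c("id", "diagnosis", measure_cols, "X")
length(all_cols)  # 33
```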

Packages used

The following packages have been used for the analysis:

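library(tidyverse)    # data manipulation and plotting (attaches ggplot2 and dplyr)
library(ggplot2)      # plotting (already attached by tidyverse)
library(dplyr)        # data manipulation (already attached by tidyverse)
library(corrplot)     # correlation matrix plots
library(caret)        # createDataPartition, findCorrelation, confusionMatrix
library(corrr)        # correlation data frames
library(kernlab)      # kernel-based learning methods
library(e1071)        # svm and tune
library(DT)           # interactive tables via datatable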

Initial Data Exploration

importing the data

The data file is available from the UCI Machine Learning Repository.

The data contain 569 observations of 33 variables. A quick view of the dataset:

setwd("C:/Users/heman/Desktop/study/BreastCancer")
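# stringsAsFactors = TRUE keeps diagnosis a factor on R >= 4.0, as the output below assumes
cancer_data <- read.csv("Breast_Cancer.csv", header = TRUE, na.strings = c("", "NA"),
                        stringsAsFactors = TRUE)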

cancer_data %>% datatable(caption = "Cancer Data")

checking for data structure

The id variable acts as a key for each observation.
The diagnosis variable is categorical and records whether the tumor is malignant (M) or benign (B). The X variable contains only missing values.
All remaining variables are numeric.

glimpse(cancer_data)
## Observations: 569
## Variables: 33
## $ id                      <int> 842302, 842517, 84300903, 84348301, 84...
## $ diagnosis               <fct> M, M, M, M, M, M, M, M, M, M, M, M, M,...
## $ radius_mean             <dbl> 17.990, 20.570, 19.690, 11.420, 20.290...
## $ texture_mean            <dbl> 10.38, 17.77, 21.25, 20.38, 14.34, 15....
## $ perimeter_mean          <dbl> 122.80, 132.90, 130.00, 77.58, 135.10,...
## $ area_mean               <dbl> 1001.0, 1326.0, 1203.0, 386.1, 1297.0,...
## $ smoothness_mean         <dbl> 0.11840, 0.08474, 0.10960, 0.14250, 0....
## $ compactness_mean        <dbl> 0.27760, 0.07864, 0.15990, 0.28390, 0....
## $ concavity_mean          <dbl> 0.30010, 0.08690, 0.19740, 0.24140, 0....
## $ concave.points_mean     <dbl> 0.14710, 0.07017, 0.12790, 0.10520, 0....
## $ symmetry_mean           <dbl> 0.2419, 0.1812, 0.2069, 0.2597, 0.1809...
## $ fractal_dimension_mean  <dbl> 0.07871, 0.05667, 0.05999, 0.09744, 0....
## $ radius_se               <dbl> 1.0950, 0.5435, 0.7456, 0.4956, 0.7572...
## $ texture_se              <dbl> 0.9053, 0.7339, 0.7869, 1.1560, 0.7813...
## $ perimeter_se            <dbl> 8.589, 3.398, 4.585, 3.445, 5.438, 2.2...
## $ area_se                 <dbl> 153.40, 74.08, 94.03, 27.23, 94.44, 27...
## $ smoothness_se           <dbl> 0.006399, 0.005225, 0.006150, 0.009110...
## $ compactness_se          <dbl> 0.049040, 0.013080, 0.040060, 0.074580...
## $ concavity_se            <dbl> 0.05373, 0.01860, 0.03832, 0.05661, 0....
## $ concave.points_se       <dbl> 0.015870, 0.013400, 0.020580, 0.018670...
## $ symmetry_se             <dbl> 0.03003, 0.01389, 0.02250, 0.05963, 0....
## $ fractal_dimension_se    <dbl> 0.006193, 0.003532, 0.004571, 0.009208...
## $ radius_worst            <dbl> 25.38, 24.99, 23.57, 14.91, 22.54, 15....
## $ texture_worst           <dbl> 17.33, 23.41, 25.53, 26.50, 16.67, 23....
## $ perimeter_worst         <dbl> 184.60, 158.80, 152.50, 98.87, 152.20,...
## $ area_worst              <dbl> 2019.0, 1956.0, 1709.0, 567.7, 1575.0,...
## $ smoothness_worst        <dbl> 0.1622, 0.1238, 0.1444, 0.2098, 0.1374...
## $ compactness_worst       <dbl> 0.6656, 0.1866, 0.4245, 0.8663, 0.2050...
## $ concavity_worst         <dbl> 0.71190, 0.24160, 0.45040, 0.68690, 0....
## $ concave.points_worst    <dbl> 0.26540, 0.18600, 0.24300, 0.25750, 0....
## $ symmetry_worst          <dbl> 0.4601, 0.2750, 0.3613, 0.6638, 0.2364...
## $ fractal_dimension_worst <dbl> 0.11890, 0.08902, 0.08758, 0.17300, 0....
## $ X                       <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...

data summary

Out of 569 observations, 357 tumors are benign and 212 are malignant.

summary(cancer_data)
##        id            diagnosis  radius_mean      texture_mean  
##  Min.   :     8670   B:357     Min.   : 6.981   Min.   : 9.71  
##  1st Qu.:   869218   M:212     1st Qu.:11.700   1st Qu.:16.17  
##  Median :   906024             Median :13.370   Median :18.84  
##  Mean   : 30371831             Mean   :14.127   Mean   :19.29  
##  3rd Qu.:  8813129             3rd Qu.:15.780   3rd Qu.:21.80  
##  Max.   :911320502             Max.   :28.110   Max.   :39.28  
##  perimeter_mean     area_mean      smoothness_mean   compactness_mean 
##  Min.   : 43.79   Min.   : 143.5   Min.   :0.05263   Min.   :0.01938  
##  1st Qu.: 75.17   1st Qu.: 420.3   1st Qu.:0.08637   1st Qu.:0.06492  
##  Median : 86.24   Median : 551.1   Median :0.09587   Median :0.09263  
##  Mean   : 91.97   Mean   : 654.9   Mean   :0.09636   Mean   :0.10434  
##  3rd Qu.:104.10   3rd Qu.: 782.7   3rd Qu.:0.10530   3rd Qu.:0.13040  
##  Max.   :188.50   Max.   :2501.0   Max.   :0.16340   Max.   :0.34540  
##  concavity_mean    concave.points_mean symmetry_mean   
##  Min.   :0.00000   Min.   :0.00000     Min.   :0.1060  
##  1st Qu.:0.02956   1st Qu.:0.02031     1st Qu.:0.1619  
##  Median :0.06154   Median :0.03350     Median :0.1792  
##  Mean   :0.08880   Mean   :0.04892     Mean   :0.1812  
##  3rd Qu.:0.13070   3rd Qu.:0.07400     3rd Qu.:0.1957  
##  Max.   :0.42680   Max.   :0.20120     Max.   :0.3040  
##  fractal_dimension_mean   radius_se        texture_se      perimeter_se   
##  Min.   :0.04996        Min.   :0.1115   Min.   :0.3602   Min.   : 0.757  
##  1st Qu.:0.05770        1st Qu.:0.2324   1st Qu.:0.8339   1st Qu.: 1.606  
##  Median :0.06154        Median :0.3242   Median :1.1080   Median : 2.287  
##  Mean   :0.06280        Mean   :0.4052   Mean   :1.2169   Mean   : 2.866  
##  3rd Qu.:0.06612        3rd Qu.:0.4789   3rd Qu.:1.4740   3rd Qu.: 3.357  
##  Max.   :0.09744        Max.   :2.8730   Max.   :4.8850   Max.   :21.980  
##     area_se        smoothness_se      compactness_se      concavity_se    
##  Min.   :  6.802   Min.   :0.001713   Min.   :0.002252   Min.   :0.00000  
##  1st Qu.: 17.850   1st Qu.:0.005169   1st Qu.:0.013080   1st Qu.:0.01509  
##  Median : 24.530   Median :0.006380   Median :0.020450   Median :0.02589  
##  Mean   : 40.337   Mean   :0.007041   Mean   :0.025478   Mean   :0.03189  
##  3rd Qu.: 45.190   3rd Qu.:0.008146   3rd Qu.:0.032450   3rd Qu.:0.04205  
##  Max.   :542.200   Max.   :0.031130   Max.   :0.135400   Max.   :0.39600  
##  concave.points_se   symmetry_se       fractal_dimension_se
##  Min.   :0.000000   Min.   :0.007882   Min.   :0.0008948   
##  1st Qu.:0.007638   1st Qu.:0.015160   1st Qu.:0.0022480   
##  Median :0.010930   Median :0.018730   Median :0.0031870   
##  Mean   :0.011796   Mean   :0.020542   Mean   :0.0037949   
##  3rd Qu.:0.014710   3rd Qu.:0.023480   3rd Qu.:0.0045580   
##  Max.   :0.052790   Max.   :0.078950   Max.   :0.0298400   
##   radius_worst   texture_worst   perimeter_worst    area_worst    
##  Min.   : 7.93   Min.   :12.02   Min.   : 50.41   Min.   : 185.2  
##  1st Qu.:13.01   1st Qu.:21.08   1st Qu.: 84.11   1st Qu.: 515.3  
##  Median :14.97   Median :25.41   Median : 97.66   Median : 686.5  
##  Mean   :16.27   Mean   :25.68   Mean   :107.26   Mean   : 880.6  
##  3rd Qu.:18.79   3rd Qu.:29.72   3rd Qu.:125.40   3rd Qu.:1084.0  
##  Max.   :36.04   Max.   :49.54   Max.   :251.20   Max.   :4254.0  
##  smoothness_worst  compactness_worst concavity_worst  concave.points_worst
##  Min.   :0.07117   Min.   :0.02729   Min.   :0.0000   Min.   :0.00000     
##  1st Qu.:0.11660   1st Qu.:0.14720   1st Qu.:0.1145   1st Qu.:0.06493     
##  Median :0.13130   Median :0.21190   Median :0.2267   Median :0.09993     
##  Mean   :0.13237   Mean   :0.25427   Mean   :0.2722   Mean   :0.11461     
##  3rd Qu.:0.14600   3rd Qu.:0.33910   3rd Qu.:0.3829   3rd Qu.:0.16140     
##  Max.   :0.22260   Max.   :1.05800   Max.   :1.2520   Max.   :0.29100     
##  symmetry_worst   fractal_dimension_worst    X          
##  Min.   :0.1565   Min.   :0.05504         Mode:logical  
##  1st Qu.:0.2504   1st Qu.:0.07146         NA's:569      
##  Median :0.2822   Median :0.08004                       
##  Mean   :0.2901   Mean   :0.08395                       
##  3rd Qu.:0.3179   3rd Qu.:0.09208                       
##  Max.   :0.6638   Max.   :0.20750

checking for null values

The variable X has no values associated with it, so it can be dropped.

colSums(is.na(cancer_data))
##                      id               diagnosis             radius_mean 
##                       0                       0                       0 
##            texture_mean          perimeter_mean               area_mean 
##                       0                       0                       0 
##         smoothness_mean        compactness_mean          concavity_mean 
##                       0                       0                       0 
##     concave.points_mean           symmetry_mean  fractal_dimension_mean 
##                       0                       0                       0 
##               radius_se              texture_se            perimeter_se 
##                       0                       0                       0 
##                 area_se           smoothness_se          compactness_se 
##                       0                       0                       0 
##            concavity_se       concave.points_se             symmetry_se 
##                       0                       0                       0 
##    fractal_dimension_se            radius_worst           texture_worst 
##                       0                       0                       0 
##         perimeter_worst              area_worst        smoothness_worst 
##                       0                       0                       0 
##       compactness_worst         concavity_worst    concave.points_worst 
##                       0                       0                       0 
##          symmetry_worst fractal_dimension_worst                       X 
##                       0                       0                     569
cancer_data <- select(cancer_data, -X)
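
An all-NA column named X typically appears when every line of a CSV file ends with a trailing comma; whether that is the cause here is an assumption, since the raw file is not shown. A minimal sketch on made-up rows (toy_csv and the other names are hypothetical), which also shows one way to find such columns programmatically:

```r
# Made-up rows; note the trailing comma on every line, header included
toy_csv <- "id,diagnosis,radius_mean,
1,M,17.99,
2,B,13.54,"

toy <- read.csv(text = toy_csv, header = TRUE)
str(toy)  # the unnamed fourth field becomes a logical column X of NAs

# Columns in which every value is missing
empty_cols <- names(toy)[colSums(is.na(toy)) == nrow(toy)]
empty_cols

# Drop them with base R
toy_clean <- toy[, !(names(toy) %in% empty_cols), drop = FALSE]
names(toy_clean)
```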

checking for correlation

The first chart shows strong correlation among many of the variables.

corrdata <- cancer_data[,-c(1,2)]
corrplot(cor(corrdata), order = "hclust")

Among groups of variables with pairwise correlation above 0.9, we keep only one variable per group.
We end up removing the following variables from our analysis:
* area_se
* radius_mean
* area_worst
* perimeter_worst
* radius_worst
* concave.points_mean

highly_correlated <- findCorrelation(cor(corrdata), cutoff = 0.9)
corrplot(cor(corrdata[,highly_correlated]),method="number", order = "hclust")

cancer_data <- select(cancer_data, -area_se, -radius_mean, -area_worst, -perimeter_worst, -radius_worst, -concave.points_mean)
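
To illustrate the idea behind the 0.9 cutoff, here is a base R sketch on synthetic data that lists the variable pairs whose absolute correlation exceeds the threshold. It is not a reimplementation of caret's findCorrelation, which additionally decides which member of each pair to drop; the data and names are invented.

```r
set.seed(10)
n <- 200
radius <- runif(n, 6, 28)
toy <- data.frame(
  radius    = radius,
  perimeter = 2 * pi * radius + rnorm(n, sd = 1),  # nearly a linear function of radius
  area      = pi * radius^2,                       # monotone function of radius
  texture   = rnorm(n, mean = 19, sd = 4)          # generated independently of radius
)

cm <- cor(toy)

# Look only above the diagonal so each pair is reported once
high <- which(abs(cm) > 0.9 & upper.tri(cm), arr.ind = TRUE)
data.frame(
  var1 = rownames(cm)[high[, "row"]],
  var2 = colnames(cm)[high[, "col"]],
  r    = round(cm[high], 3)
)
```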

distribution of benign vs malignant

Out of 569 observations, 357 tumors are benign and 212 are malignant.

ggplot(data = cancer_data, aes(x = diagnosis, fill = diagnosis)) +
  geom_bar()
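
The class counts also set a baseline: a classifier that always predicts the majority class would be right about 63% of the time, which is what the confusion matrix output further down reports as the no information rate (computed there on the training and test subsets). A short sketch of the arithmetic, with the counts copied from the data summary:

```r
counts <- c(B = 357, M = 212)   # benign and malignant counts from the data summary
props <- counts / sum(counts)
round(props, 4)                 # B 0.6274, M 0.3726

# Accuracy of always predicting the majority class
baseline_accuracy <- max(props)
baseline_accuracy
```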

splitting data into test and train

We split the data into training and test sets in an 80:20 proportion.

set.seed(10)
ind.train <- createDataPartition(cancer_data$diagnosis, p=0.8, list=FALSE)
cancer_data_train <- cancer_data[ind.train,]
cancer_data_test <- cancer_data[-ind.train,]
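
createDataPartition draws the sample within each level of the outcome, so that both subsets keep roughly the original class balance. The following base R sketch mimics that idea on a label vector with the same class counts; it is an approximation under that assumption, not caret's actual code, and it will not select the same rows as the call above.

```r
set.seed(10)
# Stand-in label vector with the class counts of this dataset
diagnosis <- factor(rep(c("B", "M"), times = c(357, 212)))

# Sample 80% of the row indices separately within each class
idx_by_class <- split(seq_along(diagnosis), diagnosis)
train_idx <- unlist(
  lapply(idx_by_class, function(idx) sample(idx, ceiling(0.8 * length(idx)))),
  use.names = FALSE
)

table(diagnosis[train_idx])   # B 286, M 170
table(diagnosis[-train_idx])  # B  71, M  42
```

These counts agree with the class totals in the confusion matrices below (456 training and 113 test observations).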

Linear Model

We try to create a linear hyperplane separating the classes

Selecting the right cost

As the classes are not perfectly separable, we look for the best cost value for the linear model.
We use the tune function, which runs 10-fold cross-validation by default, to find the cost with the lowest error.
We get a cost value of 0.05 for the data.

cost_range <-c(0.001, 0.005, 0.01, 0.02, 0.05, 0.1, 0.2, 0.5, 1, 1.5, 2, 5)

tune.out <- tune(svm, diagnosis~. -id, data = cancer_data_train, kernel = "linear",
                 ranges = list(cost=cost_range))

bestmod_linear <- tune.out$best.model
summary(bestmod_linear)
## 
## Call:
## best.tune(method = svm, train.x = diagnosis ~ . - id, data = cancer_data_train, 
##     ranges = list(cost = cost_range), kernel = "linear")
## 
## 
## Parameters:
##    SVM-Type:  C-classification 
##  SVM-Kernel:  linear 
##        cost:  0.05 
##       gamma:  0.04166667 
## 
## Number of Support Vectors:  76
## 
##  ( 37 39 )
## 
## 
## Number of Classes:  2 
## 
## Levels: 
##  B M
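
What tune does for each candidate cost can be sketched by hand: split the training rows into ten folds, fit on nine, measure the error on the held-out fold, and average. The sketch below uses two classes of the built-in iris data as a stand-in for the tumor data and a shortened cost grid; the fold assignment is not the one tune uses internally, so treat it as an illustration rather than a reproduction.

```r
library(e1071)

set.seed(10)
# Two-class stand-in data (assumption: any binary factor outcome works the same way)
toy <- droplevels(subset(iris, Species != "setosa"))

# Assign every row to one of ten folds at random
folds <- sample(rep(1:10, length.out = nrow(toy)))

# Mean misclassification rate over the ten held-out folds for one cost value
cv_error <- function(cost) {
  mean(sapply(1:10, function(k) {
    fit  <- svm(Species ~ ., data = toy[folds != k, ], kernel = "linear", cost = cost)
    pred <- predict(fit, newdata = toy[folds == k, ])
    mean(pred != toy$Species[folds == k])
  }))
}

costs  <- c(0.01, 0.1, 1, 5)
errors <- sapply(costs, cv_error)
data.frame(cost = costs, error = errors)

# The cost with the lowest cross-validated error
best_cost <- costs[which.min(errors)]
best_cost
```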

Confusion Matrix

Train Data

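On the training data, we get an accuracy of 0.9846, with a sensitivity of 0.9965 and a specificity of 0.9647. The positive class is B (benign), so sensitivity here is the share of benign tumors correctly identified and specificity is the share of malignant tumors correctly identified.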

predictions_train <- predict(bestmod_linear)
confusionMatrix(predictions_train, cancer_data_train$diagnosis)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   B   M
##          B 285   6
##          M   1 164
##                                           
##                Accuracy : 0.9846          
##                  95% CI : (0.9686, 0.9938)
##     No Information Rate : 0.6272          
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.967           
##  Mcnemar's Test P-Value : 0.1306          
##                                           
##             Sensitivity : 0.9965          
##             Specificity : 0.9647          
##          Pos Pred Value : 0.9794          
##          Neg Pred Value : 0.9939          
##              Prevalence : 0.6272          
##          Detection Rate : 0.6250          
##    Detection Prevalence : 0.6382          
##       Balanced Accuracy : 0.9806          
##                                           
##        'Positive' Class : B               
## 
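
The headline figures can be recomputed from the four cells of the confusion matrix. The sketch below copies the training counts from the output above and takes B as the positive class, as caret did; the variable names are only illustrative.

```r
# Rows are predictions, columns are the reference labels
cm <- matrix(c(285, 1, 6, 164), nrow = 2,
             dimnames = list(Prediction = c("B", "M"), Reference = c("B", "M")))

accuracy    <- sum(diag(cm)) / sum(cm)          # (285 + 164) / 456
sensitivity <- cm["B", "B"] / sum(cm[, "B"])    # benign correctly predicted: 285 / 286
specificity <- cm["M", "M"] / sum(cm[, "M"])    # malignant correctly predicted: 164 / 170

round(c(accuracy = accuracy, sensitivity = sensitivity, specificity = specificity), 4)
# accuracy 0.9846, sensitivity 0.9965, specificity 0.9647
```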

Test Data

On the test data, we get an accuracy of 0.9823, with a sensitivity of 0.9718 and a specificity of 1.0000: 2 of the 71 benign tumors are flagged as malignant, and no malignant tumor is missed.

predictions_test <- predict(bestmod_linear, newdata = cancer_data_test)
confusionMatrix(predictions_test, cancer_data_test$diagnosis)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  B  M
##          B 69  0
##          M  2 42
##                                           
##                Accuracy : 0.9823          
##                  95% CI : (0.9375, 0.9978)
##     No Information Rate : 0.6283          
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.9625          
##  Mcnemar's Test P-Value : 0.4795          
##                                           
##             Sensitivity : 0.9718          
##             Specificity : 1.0000          
##          Pos Pred Value : 1.0000          
##          Neg Pred Value : 0.9545          
##              Prevalence : 0.6283          
##          Detection Rate : 0.6106          
##    Detection Prevalence : 0.6106          
##       Balanced Accuracy : 0.9859          
##                                           
##        'Positive' Class : B               
## 

Polynomial Model

We try to fit a non-linear boundary between the classes using an SVM with a polynomial kernel.
We use the tune function, which runs 10-fold cross-validation by default, to find the cost with the lowest error. Only cost is tuned here; the degree is left at the svm default of 3.
We get a cost value of 5, with a degree of 3, for the data.

Selecting the right cost

tune.out <- tune(svm,  diagnosis~. -id, data = cancer_data_train, kernel = "polynomial",
                 ranges = list(cost = cost_range))

bestmod_polynomial <- tune.out$best.model
summary(bestmod_polynomial)
## 
## Call:
## best.tune(method = svm, train.x = diagnosis ~ . - id, data = cancer_data_train, 
##     ranges = list(cost = cost_range), kernel = "polynomial")
## 
## 
## Parameters:
##    SVM-Type:  C-classification 
##  SVM-Kernel:  polynomial 
##        cost:  5 
##      degree:  3 
##       gamma:  0.04166667 
##      coef.0:  0 
## 
## Number of Support Vectors:  119
## 
##  ( 56 63 )
## 
## 
## Number of Classes:  2 
## 
## Levels: 
##  B M
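
The kernels compared in this analysis differ only in how they measure the similarity between two observations. The functions below follow the kernel formulas given in the e1071 svm documentation; the two input vectors are made up. The gamma printed in the summaries (0.04166667) is the svm default of one over the number of predictors, here 1/24, since 24 measurement columns remain after removing the six correlated ones and excluding id.

```r
# Kernel formulas as documented for e1071::svm
linear_kernel <- function(u, v) sum(u * v)
poly_kernel   <- function(u, v, gamma, coef0 = 0, degree = 3) {
  (gamma * sum(u * v) + coef0)^degree
}
radial_kernel <- function(u, v, gamma) exp(-gamma * sum((u - v)^2))

# Two made-up observations
u <- c(1, 2)
v <- c(3, 4)

linear_kernel(u, v)               # 1*3 + 2*4 = 11
poly_kernel(u, v, gamma = 0.5)    # (0.5 * 11)^3 = 166.375
radial_kernel(u, v, gamma = 0.5)  # exp(-0.5 * 8) = exp(-4)

1 / 24                            # default gamma with 24 predictors, about 0.04166667
```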

Confusion Matrix

Train Data

On the training data, we get an accuracy of 0.9693, with a sensitivity of 1.0000 and a specificity of 0.9176.

predictions_train <- predict(bestmod_polynomial)
confusionMatrix(predictions_train, cancer_data_train$diagnosis)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   B   M
##          B 286  14
##          M   0 156
##                                          
##                Accuracy : 0.9693         
##                  95% CI : (0.949, 0.9831)
##     No Information Rate : 0.6272         
##     P-Value [Acc > NIR] : < 2.2e-16      
##                                          
##                   Kappa : 0.9332         
##  Mcnemar's Test P-Value : 0.000512       
##                                          
##             Sensitivity : 1.0000         
##             Specificity : 0.9176         
##          Pos Pred Value : 0.9533         
##          Neg Pred Value : 1.0000         
##              Prevalence : 0.6272         
##          Detection Rate : 0.6272         
##    Detection Prevalence : 0.6579         
##       Balanced Accuracy : 0.9588         
##                                          
##        'Positive' Class : B              
## 

Test Data

On the test data, we get an accuracy of 0.9381, with a sensitivity of 1.0000 and a specificity of 0.8333: 7 of the 42 malignant tumors are classified as benign.

predictions_test <- predict(bestmod_polynomial, newdata = cancer_data_test)
confusionMatrix(predictions_test, cancer_data_test$diagnosis)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  B  M
##          B 71  7
##          M  0 35
##                                           
##                Accuracy : 0.9381          
##                  95% CI : (0.8765, 0.9747)
##     No Information Rate : 0.6283          
##     P-Value [Acc > NIR] : 1.718e-14       
##                                           
##                   Kappa : 0.8627          
##  Mcnemar's Test P-Value : 0.02334         
##                                           
##             Sensitivity : 1.0000          
##             Specificity : 0.8333          
##          Pos Pred Value : 0.9103          
##          Neg Pred Value : 1.0000          
##              Prevalence : 0.6283          
##          Detection Rate : 0.6283          
##    Detection Prevalence : 0.6903          
##       Balanced Accuracy : 0.9167          
##                                           
##        'Positive' Class : B               
## 

Radial Model

Selecting the right cost

To get a radial decision boundary between the classes, we use kernel = "radial".
We use the tune function, which runs 10-fold cross-validation by default, to find the cost and gamma values with the lowest error.
We get a cost value of 1.5 and a gamma of 0.5 for the data.

gamma_range <- c(0.5, 1, 2, 3, 4)

tune.out <- tune(svm,  diagnosis~. -id, data = cancer_data_train, kernel = "radial",
                 ranges = list(cost = cost_range,
                               gamma = gamma_range))
bestmod_radial <- tune.out$best.model
summary(bestmod_radial)
## 
## Call:
## best.tune(method = svm, train.x = diagnosis ~ . - id, data = cancer_data_train, 
##     ranges = list(cost = cost_range, gamma = gamma_range), kernel = "radial")
## 
## 
## Parameters:
##    SVM-Type:  C-classification 
##  SVM-Kernel:  radial 
##        cost:  1.5 
##       gamma:  0.5 
## 
## Number of Support Vectors:  416
## 
##  ( 168 248 )
## 
## 
## Number of Classes:  2 
## 
## Levels: 
##  B M
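
With 416 of the 456 training observations kept as support vectors, the radial model is close to memorising the training set. One plausible reason, offered here as an assumption rather than a diagnosis, is that the smallest gamma in the grid (0.5) is already large for 24 predictors that svm scales by default: the kernel value between two typical observations then falls to almost zero, so each point mostly resembles only itself. A sketch on simulated standardised data:

```r
set.seed(10)
p <- 24                                  # number of predictors in the model
x <- matrix(rnorm(200 * p), ncol = p)    # simulated, already standardised data

d2 <- as.vector(dist(x))^2               # squared Euclidean distances between all pairs

# Average radial kernel value between two different observations
mean_k_grid    <- mean(exp(-0.5 * d2))       # gamma = 0.5, smallest value in gamma_range
mean_k_default <- mean(exp(-(1 / p) * d2))   # gamma = 1/24, the svm default

c(gamma_0.5 = mean_k_grid, gamma_default = mean_k_default)
```

If the real data behave similarly, extending gamma_range downwards (for example towards 1/24) would be a natural next experiment.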

Confusion Matrix

Train Data

On the training data, the model predicts the classes with 100% accuracy.

predictions_train <- predict(bestmod_radial)
confusionMatrix(predictions_train, cancer_data_train$diagnosis)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   B   M
##          B 286   0
##          M   0 170
##                                      
##                Accuracy : 1          
##                  95% CI : (0.9919, 1)
##     No Information Rate : 0.6272     
##     P-Value [Acc > NIR] : < 2.2e-16  
##                                      
##                   Kappa : 1          
##  Mcnemar's Test P-Value : NA         
##                                      
##             Sensitivity : 1.0000     
##             Specificity : 1.0000     
##          Pos Pred Value : 1.0000     
##          Neg Pred Value : 1.0000     
##              Prevalence : 0.6272     
##          Detection Rate : 0.6272     
##    Detection Prevalence : 0.6272     
##       Balanced Accuracy : 1.0000     
##                                      
##        'Positive' Class : B          
## 

Test Data

On the test data, we get an accuracy of 0.8584, with a sensitivity of 0.9437 and a specificity of 0.7143: 12 of the 42 malignant tumors are classified as benign.

predictions_test <- predict(bestmod_radial, newdata = cancer_data_test)
confusionMatrix(predictions_test, cancer_data_test$diagnosis)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  B  M
##          B 67 12
##          M  4 30
##                                           
##                Accuracy : 0.8584          
##                  95% CI : (0.7803, 0.9168)
##     No Information Rate : 0.6283          
##     P-Value [Acc > NIR] : 5.334e-08       
##                                           
##                   Kappa : 0.6846          
##  Mcnemar's Test P-Value : 0.08012         
##                                           
##             Sensitivity : 0.9437          
##             Specificity : 0.7143          
##          Pos Pred Value : 0.8481          
##          Neg Pred Value : 0.8824          
##              Prevalence : 0.6283          
##          Detection Rate : 0.5929          
##    Detection Prevalence : 0.6991          
##       Balanced Accuracy : 0.8290          
##                                           
##        'Positive' Class : B               
## 

Conclusion

Using SVMs with linear, polynomial and radial kernels, we have been able to get a good separation of the two tumor classes. The radial model, the most flexible of the three, fits the training data perfectly, correctly predicting every tumor class, but does much worse on the test data (accuracy 0.8584). The polynomial model does a decent job on both the training data (0.9693) and the test data (0.9381).
The linear model, however, outperforms the other two, with high accuracy on both the training set (0.9846) and the test set (0.9823). Hence we choose the linear model as our final model.
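
The comparison behind this conclusion can be tabulated from the confusion matrices above. The counts of correct predictions are copied from those outputs; the data frame and its column names are only for illustration.

```r
n_train <- 456   # training observations
n_test  <- 113   # test observations

# Correct predictions = diagonal of each confusion matrix
results <- data.frame(
  model         = c("linear", "polynomial", "radial"),
  train_correct = c(285 + 164, 286 + 156, 286 + 170),
  test_correct  = c(69 + 42, 71 + 35, 67 + 30)
)
results$train_accuracy <- round(results$train_correct / n_train, 4)
results$test_accuracy  <- round(results$test_correct / n_test, 4)

# Gap between training and test accuracy: a large gap suggests overfitting
results$gap <- results$train_accuracy - results$test_accuracy

results[, c("model", "train_accuracy", "test_accuracy", "gap")]
```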