Detecting Breast Cancer Using Machine Learning

Introduction
Step 1: First, Let’s Load the Data

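library(knitr); library(kableExtra)  # kable(), kable_styling() and the %>% pipe
library(Amelia)                      # missmap()
library(caret)                       # createDataPartition(), train(), confusionMatrix()
data <- read.csv(file = file.choose(), header = TRUE, stringsAsFactors = FALSE, strip.white = TRUE, sep = ",")
data$ID <- NULL  # drop the identifier column; it carries no predictive information
kable(head(data, n = 10), format = 'html') %>%
  kable_styling(bootstrap_options = c('striped', 'hover'))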

diagnosis	radius_mean	radius_sd_error	radius_worst	texture_mean	texture_sd_error	texture_worst	perimeter_mean	perimeter_sd_error	perimeter_worst	area_mean	area_sd_error	area_worst	smoothness_mean	smoothness_sd_error	smoothness_worst	compactness_mean	compactness_sd_error	compactness_worst	concavity_mean	concavity_sd_error	concavity_worst	concave_points_mean	concave_points_sd_error	concave_points_worst	symmetry_mean	symmetry_sd_error	symmetry_worst	fractal_dimension_mean	fractal_dimension_sd_error	fractal_dimension_worst
M	17.99	10.38	122.80	1001.0	0.11840	0.27760	0.30010	0.14710	0.2419	0.07871	1.0950	0.9053	8.589	153.40	0.006399	0.04904	0.05373	0.01587	0.03003	0.006193	25.38	17.33	184.60	2019.0	0.1622	0.6656	0.7119	0.2654	0.4601	0.11890
M	20.57	17.77	132.90	1326.0	0.08474	0.07864	0.08690	0.07017	0.1812	0.05667	0.5435	0.7339	3.398	74.08	0.005225	0.01308	0.01860	0.01340	0.01389	0.003532	24.99	23.41	158.80	1956.0	0.1238	0.1866	0.2416	0.1860	0.2750	0.08902
M	19.69	21.25	130.00	1203.0	0.10960	0.15990	0.19740	0.12790	0.2069	0.05999	0.7456	0.7869	4.585	94.03	0.006150	0.04006	0.03832	0.02058	0.02250	0.004571	23.57	25.53	152.50	1709.0	0.1444	0.4245	0.4504	0.2430	0.3613	0.08758
M	11.42	20.38	77.58	386.1	0.14250	0.28390	0.24140	0.10520	0.2597	0.09744	0.4956	1.1560	3.445	27.23	0.009110	0.07458	0.05661	0.01867	0.05963	0.009208	14.91	26.50	98.87	567.7	0.2098	0.8663	0.6869	0.2575	0.6638	0.17300
M	20.29	14.34	135.10	1297.0	0.10030	0.13280	0.19800	0.10430	0.1809	0.05883	0.7572	0.7813	5.438	94.44	0.011490	0.02461	0.05688	0.01885	0.01756	0.005115	22.54	16.67	152.20	1575.0	0.1374	0.2050	0.4000	0.1625	0.2364	0.07678
M	12.45	15.70	82.57	477.1	0.12780	0.17000	0.15780	0.08089	0.2087	0.07613	0.3345	0.8902	2.217	27.19	0.007510	0.03345	0.03672	0.01137	0.02165	0.005082	15.47	23.75	103.40	741.6	0.1791	0.5249	0.5355	0.1741	0.3985	0.12440
M	18.25	19.98	119.60	1040.0	0.09463	0.10900	0.11270	0.07400	0.1794	0.05742	0.4467	0.7732	3.180	53.91	0.004314	0.01382	0.02254	0.01039	0.01369	0.002179	22.88	27.66	153.20	1606.0	0.1442	0.2576	0.3784	0.1932	0.3063	0.08368
M	13.71	20.83	90.20	577.9	0.11890	0.16450	0.09366	0.05985	0.2196	0.07451	0.5835	1.3770	3.856	50.96	0.008805	0.03029	0.02488	0.01448	0.01486	0.005412	17.06	28.14	110.60	897.0	0.1654	0.3682	0.2678	0.1556	0.3196	0.11510
M	13.00	21.82	87.50	519.8	0.12730	0.19320	0.18590	0.09353	0.2350	0.07389	0.3063	1.0020	2.406	24.32	0.005731	0.03502	0.03553	0.01226	0.02143	0.003749	15.49	30.73	106.20	739.3	0.1703	0.5401	0.5390	0.2060	0.4378	0.10720
M	12.46	24.04	83.97	475.9	0.11860	0.23960	0.22730	0.08543	0.2030	0.08243	0.2976	1.5990	2.039	23.94	0.007149	0.07217	0.07743	0.01432	0.01789	0.010080	15.09	40.68	97.65	711.4	0.1853	1.0580	1.1050	0.2210	0.4366	0.20750

Step 2: Data Cleaning - Convert ‘0’ values into NA (optional; the line below is left commented out)

#data[, 2:31][data[, 2:31] == 0] <- NA
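
The commented-out line above replaces every zero in the numeric columns with NA, which is only appropriate if a zero really encodes a missing measurement. As a minimal sketch of the same idea, the example below uses a small made-up data frame (`toy` and its values are hypothetical, not taken from the data set) and then counts the missing values per column:

```r
# Toy data frame standing in for the real data (values are made up).
toy <- data.frame(
  diagnosis = c("M", "B", "B"),
  radius    = c(17.99, 0, 12.45),
  texture   = c(10.38, 15.70, 0),
  stringsAsFactors = FALSE
)

num_cols <- sapply(toy, is.numeric)      # logical mask of the numeric columns
toy[num_cols][toy[num_cols] == 0] <- NA  # replace exact zeros with NA

na_per_column <- colSums(is.na(toy))     # number of missing values per column
print(na_per_column)
```

Selecting the numeric columns with `is.numeric` rather than by position keeps the character column `diagnosis` out of the comparison.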

Step 3: Visualize the missing data (if any)

missmap(data)

Step 4: Summary of the Data Set

summary(data)

##   diagnosis          radius_mean     radius_sd_error  radius_worst   
##  Length:569         Min.   : 6.981   Min.   : 9.71   Min.   : 43.79  
##  Class :character   1st Qu.:11.700   1st Qu.:16.17   1st Qu.: 75.17  
##  Mode  :character   Median :13.370   Median :18.84   Median : 86.24  
##                     Mean   :14.127   Mean   :19.29   Mean   : 91.97  
##                     3rd Qu.:15.780   3rd Qu.:21.80   3rd Qu.:104.10  
##                     Max.   :28.110   Max.   :39.28   Max.   :188.50  
##   texture_mean    texture_sd_error  texture_worst     perimeter_mean   
##  Min.   : 143.5   Min.   :0.05263   Min.   :0.01938   Min.   :0.00000  
##  1st Qu.: 420.3   1st Qu.:0.08637   1st Qu.:0.06492   1st Qu.:0.02956  
##  Median : 551.1   Median :0.09587   Median :0.09263   Median :0.06154  
##  Mean   : 654.9   Mean   :0.09636   Mean   :0.10434   Mean   :0.08880  
##  3rd Qu.: 782.7   3rd Qu.:0.10530   3rd Qu.:0.13040   3rd Qu.:0.13070  
##  Max.   :2501.0   Max.   :0.16340   Max.   :0.34540   Max.   :0.42680  
##  perimeter_sd_error perimeter_worst    area_mean       area_sd_error   
##  Min.   :0.00000    Min.   :0.1060   Min.   :0.04996   Min.   :0.1115  
##  1st Qu.:0.02031    1st Qu.:0.1619   1st Qu.:0.05770   1st Qu.:0.2324  
##  Median :0.03350    Median :0.1792   Median :0.06154   Median :0.3242  
##  Mean   :0.04892    Mean   :0.1812   Mean   :0.06280   Mean   :0.4052  
##  3rd Qu.:0.07400    3rd Qu.:0.1957   3rd Qu.:0.06612   3rd Qu.:0.4789  
##  Max.   :0.20120    Max.   :0.3040   Max.   :0.09744   Max.   :2.8730  
##    area_worst     smoothness_mean  smoothness_sd_error smoothness_worst  
##  Min.   :0.3602   Min.   : 0.757   Min.   :  6.802     Min.   :0.001713  
##  1st Qu.:0.8339   1st Qu.: 1.606   1st Qu.: 17.850     1st Qu.:0.005169  
##  Median :1.1080   Median : 2.287   Median : 24.530     Median :0.006380  
##  Mean   :1.2169   Mean   : 2.866   Mean   : 40.337     Mean   :0.007041  
##  3rd Qu.:1.4740   3rd Qu.: 3.357   3rd Qu.: 45.190     3rd Qu.:0.008146  
##  Max.   :4.8850   Max.   :21.980   Max.   :542.200     Max.   :0.031130  
##  compactness_mean   compactness_sd_error compactness_worst  concavity_mean    
##  Min.   :0.002252   Min.   :0.00000      Min.   :0.000000   Min.   :0.007882  
##  1st Qu.:0.013080   1st Qu.:0.01509      1st Qu.:0.007638   1st Qu.:0.015160  
##  Median :0.020450   Median :0.02589      Median :0.010930   Median :0.018730  
##  Mean   :0.025478   Mean   :0.03189      Mean   :0.011796   Mean   :0.020542  
##  3rd Qu.:0.032450   3rd Qu.:0.04205      3rd Qu.:0.014710   3rd Qu.:0.023480  
##  Max.   :0.135400   Max.   :0.39600      Max.   :0.052790   Max.   :0.078950  
##  concavity_sd_error  concavity_worst concave_points_mean
##  Min.   :0.0008948   Min.   : 7.93   Min.   :12.02      
##  1st Qu.:0.0022480   1st Qu.:13.01   1st Qu.:21.08      
##  Median :0.0031870   Median :14.97   Median :25.41      
##  Mean   :0.0037949   Mean   :16.27   Mean   :25.68      
##  3rd Qu.:0.0045580   3rd Qu.:18.79   3rd Qu.:29.72      
##  Max.   :0.0298400   Max.   :36.04   Max.   :49.54      
##  concave_points_sd_error concave_points_worst symmetry_mean    
##  Min.   : 50.41          Min.   : 185.2       Min.   :0.07117  
##  1st Qu.: 84.11          1st Qu.: 515.3       1st Qu.:0.11660  
##  Median : 97.66          Median : 686.5       Median :0.13130  
##  Mean   :107.26          Mean   : 880.6       Mean   :0.13237  
##  3rd Qu.:125.40          3rd Qu.:1084.0       3rd Qu.:0.14600  
##  Max.   :251.20          Max.   :4254.0       Max.   :0.22260  
##  symmetry_sd_error symmetry_worst   fractal_dimension_mean
##  Min.   :0.02729   Min.   :0.0000   Min.   :0.00000       
##  1st Qu.:0.14720   1st Qu.:0.1145   1st Qu.:0.06493       
##  Median :0.21190   Median :0.2267   Median :0.09993       
##  Mean   :0.25427   Mean   :0.2722   Mean   :0.11461       
##  3rd Qu.:0.33910   3rd Qu.:0.3829   3rd Qu.:0.16140       
##  Max.   :1.05800   Max.   :1.2520   Max.   :0.29100       
##  fractal_dimension_sd_error fractal_dimension_worst
##  Min.   :0.1565             Min.   :0.05504        
##  1st Qu.:0.2504             1st Qu.:0.07146        
##  Median :0.2822             Median :0.08004        
##  Mean   :0.2901             Mean   :0.08395        
##  3rd Qu.:0.3179             3rd Qu.:0.09208        
##  Max.   :0.6638             Max.   :0.20750

Step 5: Because each variable is measured in different units, it is sometimes necessary to “normalize” the numerical data. We can easily accomplish this using R’s scale() function, which is applied in Step 7.

If we want to detect whether a patient may have breast cancer, it is important to inspect the “diagnosis” variable. In this variable, B stands for benign (i.e. no breast cancer) and M stands for malignant (i.e. breast cancer).

prop.table(table(data$diagnosis))

## 
##         B         M 
## 0.6274165 0.3725835

Step 6: Many R machine learning classification algorithms require the target feature to be coded as a factor, so we need to convert the diagnosis variable from a character vector to a factor.

data$diagnosis <- factor(data$diagnosis, levels = c("B", "M"), labels = c("Benign", "Malignant"))
str(data$diagnosis)

##  Factor w/ 2 levels "Benign","Malignant": 2 2 2 2 2 2 2 2 2 2 ...
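
To make the effect of `levels` and `labels` concrete, here is a small sketch on a made-up character vector (`raw` is hypothetical). The `levels` argument fixes the order of the underlying integer codes and `labels` renames them:

```r
raw <- c("M", "B", "B", "M")  # made-up diagnosis codes

diag_f <- factor(raw, levels = c("B", "M"), labels = c("Benign", "Malignant"))

levels(diag_f)      # the two labels, in the order given by `levels`
as.integer(diag_f)  # underlying integer codes: 1 for Benign, 2 for Malignant
table(diag_f)       # counts per class
```

This is why the str() output above shows the value 2 for the first rows: they are all malignant cases.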

Step 7: Because each numerical variable is measured in different units, we need to “normalize” each relevant numerical variable so that no variable dominates simply because of its scale. We can easily achieve this using R’s scale() function. Each numerical value will be converted to its Z-score equivalent.

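# Note: after dropping ID there are 31 columns (diagnosis + 30 features), so 2:30
# scales 29 features and leaves out fractal_dimension_worst; use 2:31 to include it.
data.norm <- scale(data[c(2:30)])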
summary(data.norm)

##   radius_mean      radius_sd_error    radius_worst      texture_mean    
##  Min.   :-2.0279   Min.   :-2.2273   Min.   :-1.9828   Min.   :-1.4532  
##  1st Qu.:-0.6888   1st Qu.:-0.7253   1st Qu.:-0.6913   1st Qu.:-0.6666  
##  Median :-0.2149   Median :-0.1045   Median :-0.2358   Median :-0.2949  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.4690   3rd Qu.: 0.5837   3rd Qu.: 0.4992   3rd Qu.: 0.3632  
##  Max.   : 3.9678   Max.   : 4.6478   Max.   : 3.9726   Max.   : 5.2459  
##  texture_sd_error   texture_worst     perimeter_mean    perimeter_sd_error
##  Min.   :-3.10935   Min.   :-1.6087   Min.   :-1.1139   Min.   :-1.2607   
##  1st Qu.:-0.71034   1st Qu.:-0.7464   1st Qu.:-0.7431   1st Qu.:-0.7373   
##  Median :-0.03486   Median :-0.2217   Median :-0.3419   Median :-0.3974   
##  Mean   : 0.00000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   
##  3rd Qu.: 0.63564   3rd Qu.: 0.4934   3rd Qu.: 0.5256   3rd Qu.: 0.6464   
##  Max.   : 4.76672   Max.   : 4.5644   Max.   : 4.2399   Max.   : 3.9245   
##  perimeter_worst      area_mean       area_sd_error       area_worst     
##  Min.   :-2.74171   Min.   :-1.8183   Min.   :-1.0590   Min.   :-1.5529  
##  1st Qu.:-0.70262   1st Qu.:-0.7220   1st Qu.:-0.6230   1st Qu.:-0.6942  
##  Median :-0.07156   Median :-0.1781   Median :-0.2920   Median :-0.1973  
##  Mean   : 0.00000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.53031   3rd Qu.: 0.4706   3rd Qu.: 0.2659   3rd Qu.: 0.4661  
##  Max.   : 4.48081   Max.   : 4.9066   Max.   : 8.8991   Max.   : 6.6494  
##  smoothness_mean   smoothness_sd_error smoothness_worst  compactness_mean 
##  Min.   :-1.0431   Min.   :-0.7372     Min.   :-1.7745   Min.   :-1.2970  
##  1st Qu.:-0.6232   1st Qu.:-0.4943     1st Qu.:-0.6235   1st Qu.:-0.6923  
##  Median :-0.2864   Median :-0.3475     Median :-0.2201   Median :-0.2808  
##  Mean   : 0.0000   Mean   : 0.0000     Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.2428   3rd Qu.: 0.1067     3rd Qu.: 0.3680   3rd Qu.: 0.3893  
##  Max.   : 9.4537   Max.   :11.0321     Max.   : 8.0229   Max.   : 6.1381  
##  compactness_sd_error compactness_worst concavity_mean    concavity_sd_error
##  Min.   :-1.0566      Min.   :-1.9118   Min.   :-1.5315   Min.   :-1.0960   
##  1st Qu.:-0.5567      1st Qu.:-0.6739   1st Qu.:-0.6511   1st Qu.:-0.5846   
##  Median :-0.1989      Median :-0.1404   Median :-0.2192   Median :-0.2297   
##  Mean   : 0.0000      Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   
##  3rd Qu.: 0.3365      3rd Qu.: 0.4722   3rd Qu.: 0.3554   3rd Qu.: 0.2884   
##  Max.   :12.0621      Max.   : 6.6438   Max.   : 7.0657   Max.   : 9.8429   
##  concavity_worst   concave_points_mean concave_points_sd_error
##  Min.   :-1.7254   Min.   :-2.22204    Min.   :-1.6919        
##  1st Qu.:-0.6743   1st Qu.:-0.74797    1st Qu.:-0.6890        
##  Median :-0.2688   Median :-0.04348    Median :-0.2857        
##  Mean   : 0.0000   Mean   : 0.00000    Mean   : 0.0000        
##  3rd Qu.: 0.5216   3rd Qu.: 0.65776    3rd Qu.: 0.5398        
##  Max.   : 4.0906   Max.   : 3.88249    Max.   : 4.2836        
##  concave_points_worst symmetry_mean     symmetry_sd_error symmetry_worst   
##  Min.   :-1.2213      Min.   :-2.6803   Min.   :-1.4426   Min.   :-1.3047  
##  1st Qu.:-0.6416      1st Qu.:-0.6906   1st Qu.:-0.6805   1st Qu.:-0.7558  
##  Median :-0.3409      Median :-0.0468   Median :-0.2693   Median :-0.2180  
##  Mean   : 0.0000      Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.3573      3rd Qu.: 0.5970   3rd Qu.: 0.5392   3rd Qu.: 0.5307  
##  Max.   : 5.9250      Max.   : 3.9519   Max.   : 5.1084   Max.   : 4.6965  
##  fractal_dimension_mean fractal_dimension_sd_error
##  Min.   :-1.7435        Min.   :-2.1591           
##  1st Qu.:-0.7557        1st Qu.:-0.6413           
##  Median :-0.2233        Median :-0.1273           
##  Mean   : 0.0000        Mean   : 0.0000           
##  3rd Qu.: 0.7119        3rd Qu.: 0.4497           
##  Max.   : 2.6835        Max.   : 6.0407
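
As a small worked example of what scale() computes, the sketch below standardizes five values by hand and compares the result with scale(). The five numbers are the minimum, quartiles and maximum of radius_mean from the Step 4 summary, used here purely as sample input:

```r
x <- c(6.981, 11.700, 13.370, 15.780, 28.110)  # sample values from the radius_mean summary

z_manual <- (x - mean(x)) / sd(x)  # z-score: subtract the mean, divide by the standard deviation
z_scale  <- as.numeric(scale(x))   # scale() centers and scales by default

all.equal(z_manual, z_scale)       # the two computations agree
mean(z_scale)                      # approximately 0
sd(z_scale)                        # 1
```

This is why every column in the summary above has a mean of 0.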

Dividing Our Data Into Training and Test Sets

Step 8: Before we can fit our model, we first have to divide our data into training and test sets. This allows us to assess the accuracy of our model on data it has not seen during training.

indxTrain <- createDataPartition(y = data$diagnosis,p = 0.75,list = FALSE)
training <- data[indxTrain,]
testing <- data[-indxTrain,]
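
createDataPartition() samples within each class so that both sets keep roughly the original class balance. The base-R sketch below illustrates that idea only; it is not caret's exact algorithm. The labels are synthetic (357 benign and 212 malignant, which follows from the 569 rows and the proportions printed in Step 5), and the seed is an arbitrary choice that makes the split reproducible:

```r
set.seed(42)  # arbitrary seed so the split can be reproduced

# Synthetic labels with the class balance of the data set.
labels <- factor(rep(c("Benign", "Malignant"), times = c(357, 212)))

# Sample 75% of the row indices within each class, then combine them.
idx_by_class <- split(seq_along(labels), labels)
train_idx <- sort(unlist(
  lapply(idx_by_class, function(i) sample(i, size = round(0.75 * length(i)))),
  use.names = FALSE
))

train_labels <- labels[train_idx]
test_labels  <- labels[-train_idx]

# Class percentages in each part stay close to the overall 62.7 / 37.3 split.
round(prop.table(table(train_labels)) * 100, 1)
round(prop.table(table(test_labels)) * 100, 1)
```

Calling set.seed() before createDataPartition() in the code above would make that split reproducible in the same way.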

Step 9: Check the class proportions of the split

prop.table(table(data$diagnosis)) * 100

## 
##    Benign Malignant 
##  62.74165  37.25835

prop.table(table(training$diagnosis)) * 100

## 
##    Benign Malignant 
##  62.76347  37.23653

prop.table(table(testing$diagnosis)) * 100

## 
##    Benign Malignant 
##  62.67606  37.32394

Step 10: To compare the outcomes of the training and testing phases, let’s create separate variables that store the predictors and the response variable:

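# Caution: diagnosis is column 1, so -30 drops fractal_dimension_sd_error and keeps
# diagnosis among the predictors; training[, -1] would exclude the response instead.
x = training[,-30]
y = training$diagnosis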
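
Selecting the predictors by name rather than by position is one way to make sure the response never ends up among them. The sketch below shows this on a small made-up data frame (`toy_training` and its values are hypothetical):

```r
# Made-up stand-in for the training set.
toy_training <- data.frame(
  diagnosis               = factor(c("Benign", "Malignant", "Benign")),
  radius_mean             = c(12.1, 20.3, 11.4),
  fractal_dimension_worst = c(0.08, 0.12, 0.07)
)

# Keep every column except the response, whatever its position.
x_toy <- toy_training[, setdiff(names(toy_training), "diagnosis"), drop = FALSE]
y_toy <- toy_training$diagnosis

names(x_toy)  # predictors only
```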

Step 11: Create a Naive Bayes model using the training data set:

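# 'nb' fits a naive Bayes classifier (via the klaR package), evaluated with 30-fold cross-validation
model = train(x, y, 'nb', trControl = trainControl(method = 'cv', number = 30))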
model

## Naive Bayes 
## 
## 427 samples
##  30 predictor
##   2 classes: 'Benign', 'Malignant' 
## 
## No pre-processing
## Resampling: Cross-Validated (30 fold) 
## Summary of sample sizes: 413, 413, 412, 413, 412, 412, ... 
## Resampling results across tuning parameters:
## 
##   usekernel  Accuracy   Kappa    
##   FALSE      0.9722222  0.9393870
##    TRUE      0.9744444  0.9450779
## 
## Tuning parameter 'fL' was held constant at a value of 0
## Tuning
##  parameter 'adjust' was held constant at a value of 1
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were fL = 0, usekernel = TRUE and adjust
##  = 1.
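
To show the idea behind the classifier, here is a minimal Gaussian naive Bayes written in base R. It is an illustrative sketch, not the klaR implementation that caret calls: it corresponds roughly to the usekernel = FALSE setting (a normal density per feature and class), whereas the model selected above uses kernel density estimates. The function names `fit_gnb` and `predict_gnb` are made up for this example, and R's built-in iris data stands in for the breast cancer data:

```r
# Estimate class priors and per-class feature means / standard deviations.
fit_gnb <- function(x, y) {
  classes <- levels(y)
  list(
    classes = classes,
    prior   = as.numeric(table(y)[classes]) / length(y),
    mu      = sapply(classes, function(k) colMeans(x[y == k, , drop = FALSE])),
    sigma   = sapply(classes, function(k) apply(x[y == k, , drop = FALSE], 2, sd))
  )
}

# Score each class by log prior + sum of per-feature log densities,
# treating the features as independent given the class.
predict_gnb <- function(fit, newx) {
  newx <- as.matrix(newx)
  scores <- matrix(0, nrow(newx), length(fit$classes))
  for (j in seq_along(fit$classes)) {
    scores[, j] <- log(fit$prior[j])
    for (f in seq_len(ncol(newx))) {
      scores[, j] <- scores[, j] +
        dnorm(newx[, f], mean = fit$mu[f, j], sd = fit$sigma[f, j], log = TRUE)
    }
  }
  # Pick the class with the highest score for each row.
  factor(fit$classes[max.col(scores, ties.method = "first")], levels = fit$classes)
}

fit  <- fit_gnb(iris[, 1:4], iris$Species)
pred <- predict_gnb(fit, iris[, 1:4])
train_acc <- mean(pred == iris$Species)  # accuracy on the data used for fitting
train_acc
```

Working with log densities avoids numerical underflow when many small probabilities are multiplied together.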

Step 12: Model Evaluation

To check the performance of the model, we now run it on the testing data set and then evaluate its accuracy using a confusion matrix.

Predict on the testing set

Predict <- predict(model,newdata = testing )

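Step 13: Get the confusion matrix to see the accuracy and other performance metrics

In the output below, Accuracy is the overall share of correct predictions on the test set, while Balanced Accuracy is the mean of Sensitivity and Specificity. Note that Benign is treated as the 'Positive' class here, so Sensitivity refers to benign cases and Specificity to malignant ones.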

confusionMatrix(Predict, testing$diagnosis )

## Confusion Matrix and Statistics
## 
##            Reference
## Prediction  Benign Malignant
##   Benign        87         4
##   Malignant      2        49
##                                           
##                Accuracy : 0.9577          
##                  95% CI : (0.9103, 0.9843)
##     No Information Rate : 0.6268          
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.909           
##                                           
##  Mcnemar's Test P-Value : 0.6831          
##                                           
##             Sensitivity : 0.9775          
##             Specificity : 0.9245          
##          Pos Pred Value : 0.9560          
##          Neg Pred Value : 0.9608          
##              Prevalence : 0.6268          
##          Detection Rate : 0.6127          
##    Detection Prevalence : 0.6408          
##       Balanced Accuracy : 0.9510          
##                                           
##        'Positive' Class : Benign          
##
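
The headline statistics can be recomputed by hand from the four counts in the confusion matrix above, which makes clear how each one is defined. A short sketch, with Benign as the positive class as in the output:

```r
# Counts copied from the confusion matrix above (rows: prediction, columns: reference).
cm <- matrix(c(87, 2, 4, 49), nrow = 2,
             dimnames = list(Prediction = c("Benign", "Malignant"),
                             Reference  = c("Benign", "Malignant")))

accuracy    <- sum(diag(cm)) / sum(cm)                              # correct / all
sensitivity <- cm["Benign", "Benign"] / sum(cm[, "Benign"])         # benign cases found
specificity <- cm["Malignant", "Malignant"] / sum(cm[, "Malignant"])  # malignant cases found
balanced    <- (sensitivity + specificity) / 2

round(c(Accuracy = accuracy, Sensitivity = sensitivity,
        Specificity = specificity, BalancedAccuracy = balanced), 4)
# Accuracy 0.9577, Sensitivity 0.9775, Specificity 0.9245, BalancedAccuracy 0.9510
```

Because malignant is the clinically critical class, the specificity of 0.9245 (4 of 53 malignant cases predicted as benign) deserves as much attention as the overall accuracy.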

Abir Chakraborty

11/26/2019