This is an R Markdown document. Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents.

Summary

Early diagnosis of cancer is critical for its successful treatment, so there is a high demand for accurate and inexpensive diagnostic methods. In this project we explored the applicability of decision-tree-based machine learning techniques (CART, Random Forests, and Boosted Trees), along with Naive Bayes, for breast cancer diagnosis using digitized images of tissue samples. The data was obtained from the UC Irvine Machine Learning Repository ("Breast Cancer Wisconsin data set", created by William H. Wolberg, W. Nick Street, and Olvi L. Mangasarian). The most accurate traditional diagnostic method is a rather invasive technique called breast biopsy, in which a small piece of breast tissue is surgically removed and then examined by a specialist. A much less invasive alternative is the fine needle aspirate method; the sample obtained this way can easily be digitized and used for computational diagnosis. Using machine learning methods for diagnosis can significantly increase processing speed and, at scale, can make diagnosis significantly cheaper.

Here we studied the applicability of the Random Forests, Boosted Trees, and Naive Bayes methods for cancer prediction, using the CART method as a baseline for comparison. On this dataset, the CART model achieved an estimated accuracy of about 91%, Naive Bayes about 92%, Random Forests about 95%, and Boosted Trees about 97%.

Data Cleaning and Loading

First, the necessary libraries are loaded into the R environment. ggplot2 is used to make plots, corrplot to make correlation plots, and caret for data processing and machine learning.

library("readr")
library("dplyr")
library("ggplot2")
library("corrplot")
library("gridExtra")
library("pROC")
## Type 'citation("pROC")' for a citation.
## 
## Attaching package: 'pROC'
## The following objects are masked from 'package:stats':
## 
##     cov, smooth, var
library("MASS")
library("caTools")
library("caret")
FALSE Loading required package: lattice

Data loading

The data is taken from the UCI Repository, downloaded, and saved on the local machine.

data <- read.csv("C:/Users/USER/Documents/data.csv")

We inspect the structure and the summary of the data.

## Reading cancer data
str(data)
## 'data.frame':    569 obs. of  32 variables:
##  $ id                     : int  842302 842517 84300903 84348301 84358402 843786 844359 84458202 844981 84501001 ...
##  $ diagnosis              : Factor w/ 2 levels "B","M": 2 2 2 2 2 2 2 2 2 1 ...
##  $ radius_mean            : num  18 20.6 19.7 11.4 20.3 ...
##  $ texture_mean           : num  10.4 17.8 21.2 20.4 14.3 ...
##  $ perimeter_mean         : num  122.8 132.9 130 77.6 135.1 ...
##  $ area_mean              : num  1001 1326 1203 386 1297 ...
##  $ smoothness_mean        : num  0.1184 0.0847 0.1096 0.1425 0.1003 ...
##  $ compactness_mean       : num  0.2776 0.0786 0.1599 0.2839 0.1328 ...
##  $ concavity_mean         : num  0.3001 0.0869 0.1974 0.2414 0.198 ...
##  $ concave.points_mean    : num  0.1471 0.0702 0.1279 0.1052 0.1043 ...
##  $ symmetry_mean          : num  0.242 0.181 0.207 0.26 0.181 ...
##  $ fractal_dimension_mean : num  0.0787 0.0567 0.06 0.0974 0.0588 ...
##  $ radius_se              : num  1.095 0.543 0.746 0.496 0.757 ...
##  $ texture_se             : num  0.905 0.734 0.787 1.156 0.781 ...
##  $ perimeter_se           : num  8.59 3.4 4.58 3.44 5.44 ...
##  $ area_se                : num  153.4 74.1 94 27.2 94.4 ...
##  $ smoothness_se          : num  0.0064 0.00522 0.00615 0.00911 0.01149 ...
##  $ compactness_se         : num  0.049 0.0131 0.0401 0.0746 0.0246 ...
##  $ concavity_se           : num  0.0537 0.0186 0.0383 0.0566 0.0569 ...
##  $ concave.points_se      : num  0.0159 0.0134 0.0206 0.0187 0.0188 ...
##  $ symmetry_se            : num  0.03 0.0139 0.0225 0.0596 0.0176 ...
##  $ fractal_dimension_se   : num  0.00619 0.00353 0.00457 0.00921 0.00511 ...
##  $ radius_worst           : num  25.4 25 23.6 14.9 22.5 ...
##  $ texture_worst          : num  17.3 23.4 25.5 26.5 16.7 ...
##  $ perimeter_worst        : num  184.6 158.8 152.5 98.9 152.2 ...
##  $ area_worst             : num  2019 1956 1709 568 1575 ...
##  $ smoothness_worst       : num  0.162 0.124 0.144 0.21 0.137 ...
##  $ compactness_worst      : num  0.666 0.187 0.424 0.866 0.205 ...
##  $ concavity_worst        : num  0.712 0.242 0.45 0.687 0.4 ...
##  $ concave.points_worst   : num  0.265 0.186 0.243 0.258 0.163 ...
##  $ symmetry_worst         : num  0.46 0.275 0.361 0.664 0.236 ...
##  $ fractal_dimension_worst: num  0.1189 0.089 0.0876 0.173 0.0768 ...
##
## encode the diagnosis as a factor (B = benign, M = malignant)
data$diagnosis <- as.factor(data$diagnosis)

## drop the extra empty trailing column present in the raw CSV
data[,33] <- NULL
## We then find the summary of the dataset
summary(data)
##        id            diagnosis  radius_mean      texture_mean  
##  Min.   :     8670   B:354     Min.   : 6.981   Min.   : 9.71  
##  1st Qu.:   869218   M:215     1st Qu.:11.700   1st Qu.:16.17  
##  Median :   906024             Median :13.370   Median :18.84  
##  Mean   : 30371831             Mean   :14.127   Mean   :19.29  
##  3rd Qu.:  8813129             3rd Qu.:15.780   3rd Qu.:21.80  
##  Max.   :911320502             Max.   :28.110   Max.   :39.28  
##  perimeter_mean     area_mean      smoothness_mean   compactness_mean 
##  Min.   : 43.79   Min.   : 143.5   Min.   :0.05263   Min.   :0.01938  
##  1st Qu.: 75.17   1st Qu.: 420.3   1st Qu.:0.08637   1st Qu.:0.06492  
##  Median : 86.24   Median : 551.1   Median :0.09587   Median :0.09263  
##  Mean   : 91.97   Mean   : 654.9   Mean   :0.09636   Mean   :0.10434  
##  3rd Qu.:104.10   3rd Qu.: 782.7   3rd Qu.:0.10530   3rd Qu.:0.13040  
##  Max.   :188.50   Max.   :2501.0   Max.   :0.16340   Max.   :0.34540  
##  concavity_mean    concave.points_mean symmetry_mean   
##  Min.   :0.00000   Min.   :0.00000     Min.   :0.1060  
##  1st Qu.:0.02956   1st Qu.:0.02031     1st Qu.:0.1619  
##  Median :0.06154   Median :0.03350     Median :0.1792  
##  Mean   :0.08880   Mean   :0.04892     Mean   :0.1812  
##  3rd Qu.:0.13070   3rd Qu.:0.07400     3rd Qu.:0.1957  
##  Max.   :0.42680   Max.   :0.20120     Max.   :0.3040  
##  fractal_dimension_mean   radius_se        texture_se      perimeter_se   
##  Min.   :0.04996        Min.   :0.1115   Min.   :0.3602   Min.   : 0.757  
##  1st Qu.:0.05770        1st Qu.:0.2324   1st Qu.:0.8339   1st Qu.: 1.606  
##  Median :0.06154        Median :0.3242   Median :1.1080   Median : 2.287  
##  Mean   :0.06280        Mean   :0.4052   Mean   :1.2169   Mean   : 2.866  
##  3rd Qu.:0.06612        3rd Qu.:0.4789   3rd Qu.:1.4740   3rd Qu.: 3.357  
##  Max.   :0.09744        Max.   :2.8730   Max.   :4.8850   Max.   :21.980  
##     area_se        smoothness_se      compactness_se      concavity_se    
##  Min.   :  6.802   Min.   :0.001713   Min.   :0.002252   Min.   :0.00000  
##  1st Qu.: 17.850   1st Qu.:0.005169   1st Qu.:0.013080   1st Qu.:0.01509  
##  Median : 24.530   Median :0.006380   Median :0.020450   Median :0.02589  
##  Mean   : 40.337   Mean   :0.007041   Mean   :0.025478   Mean   :0.03189  
##  3rd Qu.: 45.190   3rd Qu.:0.008146   3rd Qu.:0.032450   3rd Qu.:0.04205  
##  Max.   :542.200   Max.   :0.031130   Max.   :0.135400   Max.   :0.39600  
##  concave.points_se   symmetry_se       fractal_dimension_se
##  Min.   :0.000000   Min.   :0.007882   Min.   :0.0008948   
##  1st Qu.:0.007638   1st Qu.:0.015160   1st Qu.:0.0022480   
##  Median :0.010930   Median :0.018730   Median :0.0031870   
##  Mean   :0.011796   Mean   :0.020542   Mean   :0.0037949   
##  3rd Qu.:0.014710   3rd Qu.:0.023480   3rd Qu.:0.0045580   
##  Max.   :0.052790   Max.   :0.078950   Max.   :0.0298400   
##   radius_worst   texture_worst   perimeter_worst    area_worst    
##  Min.   : 7.93   Min.   :12.02   Min.   : 50.41   Min.   : 185.2  
##  1st Qu.:13.01   1st Qu.:21.08   1st Qu.: 84.11   1st Qu.: 515.3  
##  Median :14.97   Median :25.41   Median : 97.66   Median : 686.5  
##  Mean   :16.27   Mean   :25.68   Mean   :107.26   Mean   : 880.6  
##  3rd Qu.:18.79   3rd Qu.:29.72   3rd Qu.:125.40   3rd Qu.:1084.0  
##  Max.   :36.04   Max.   :49.54   Max.   :251.20   Max.   :4254.0  
##  smoothness_worst  compactness_worst concavity_worst  concave.points_worst
##  Min.   :0.07117   Min.   :0.02729   Min.   :0.0000   Min.   :0.00000     
##  1st Qu.:0.11660   1st Qu.:0.14720   1st Qu.:0.1145   1st Qu.:0.06493     
##  Median :0.13130   Median :0.21190   Median :0.2267   Median :0.09993     
##  Mean   :0.13237   Mean   :0.25427   Mean   :0.2722   Mean   :0.11461     
##  3rd Qu.:0.14600   3rd Qu.:0.33910   3rd Qu.:0.3829   3rd Qu.:0.16140     
##  Max.   :0.22260   Max.   :1.05800   Max.   :1.2520   Max.   :0.29100     
##  symmetry_worst   fractal_dimension_worst
##  Min.   :0.1565   Min.   :0.05504        
##  1st Qu.:0.2504   1st Qu.:0.07146        
##  Median :0.2822   Median :0.08004        
##  Mean   :0.2901   Mean   :0.08395        
##  3rd Qu.:0.3179   3rd Qu.:0.09208        
##  Max.   :0.6638   Max.   :0.20750

We find that the data is somewhat imbalanced, and that there is a lot of correlation between the attributes.

## we find that there are no missing values
## we find that the data is slightly imbalanced
prop.table(table(data$diagnosis))
## 
##         B         M 
## 0.6221441 0.3778559
## we then examine the correlations between the predictors
corr_mat <- cor(data[,3:ncol(data)])
corrplot(corr_mat)
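
Many of the size-related features (radius, perimeter, area) are strongly correlated with one another. As an optional follow-up, not used in the models below, the most redundant predictors could be flagged with caret's findCorrelation; the 0.9 cutoff here is an assumption made purely for illustration.

## identify predictors whose pairwise correlation exceeds 0.9 (cutoff chosen for illustration)
high_corr <- findCorrelation(corr_mat, cutoff = 0.9)
## map the indices back to column names (corr_mat was built from columns 3 onwards of data)
colnames(data)[high_corr + 2]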

Modelling

We are going to get a training and a testing set to use when building some models:

## We are going to get a training and a testing set to use when building some models:
set.seed(1234)
data_index <- createDataPartition(data$diagnosis, p=0.7, list = FALSE)
train_data <- data[data_index, -1]
test_data <- data[-data_index, -1]
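
Because createDataPartition samples within each class, the benign/malignant proportions should be roughly preserved in both subsets; this can be verified with a quick check:

## class proportions in the training and testing sets
prop.table(table(train_data$diagnosis))
prop.table(table(test_data$diagnosis))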

Applying learning models

## Applying learning models
fitControl <- trainControl(method="cv",
                           number = 5,
                           preProcOptions = list(thresh = 0.99), # threshold for pca preprocess
                           classProbs = TRUE,
                           summaryFunction = twoClassSummary)

Model 1: Random Forest

Building the model on the training data

## random forest
model_rf <- train(diagnosis~.,
                  train_data,
                  method="ranger",
                  metric="ROC",
                  #tuneLength=10,
                  #tuneGrid = expand.grid(mtry = c(2, 3, 6)),
                  preProcess = c('center', 'scale'),
                  trControl=fitControl)
## Loading required package: e1071
## Loading required package: ranger

Testing on the testing data

## testing the random forest model
pred_rf <- predict(model_rf, test_data)
cm_rf <- confusionMatrix(pred_rf, test_data$diagnosis, positive = "M")
cm_rf
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   B   M
##          B 103   5
##          M   3  59
##                                           
##                Accuracy : 0.9529          
##                  95% CI : (0.9094, 0.9795)
##     No Information Rate : 0.6235          
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.8991          
##  Mcnemar's Test P-Value : 0.7237          
##                                           
##             Sensitivity : 0.9219          
##             Specificity : 0.9717          
##          Pos Pred Value : 0.9516          
##          Neg Pred Value : 0.9537          
##              Prevalence : 0.3765          
##          Detection Rate : 0.3471          
##    Detection Prevalence : 0.3647          
##       Balanced Accuracy : 0.9468          
##                                           
##        'Positive' Class : M               
## 

We find that the accuracy of this model on the test set is about 95%.
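
Since the models are tuned on ROC, the test-set ROC curve and AUC can also be examined with the pROC package loaded earlier. This is a minimal sketch; it assumes class probabilities are requested from the fitted model.

## predicted probability of the malignant class on the test set
prob_rf <- predict(model_rf, test_data, type = "prob")
## ROC curve and AUC, treating "M" as the positive class
roc_rf <- roc(test_data$diagnosis, prob_rf[, "M"], levels = c("B", "M"))
auc(roc_rf)
plot(roc_rf)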

Model 2: Naive Bayes

Building and testing the model

model_nb <- train(diagnosis~.,
                  train_data,
                  method="nb",
                  metric="ROC",
                  preProcess=c('center', 'scale'),
                  trace=FALSE,
                  trControl=fitControl)
## Loading required package: klaR
## predicting for test data
pred_nb <- predict(model_nb, test_data)
cm_nb <- confusionMatrix(pred_nb, test_data$diagnosis, positive = "M")
cm_nb
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   B   M
##          B 100   7
##          M   6  57
##                                           
##                Accuracy : 0.9235          
##                  95% CI : (0.8728, 0.9587)
##     No Information Rate : 0.6235          
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.8366          
##  Mcnemar's Test P-Value : 1               
##                                           
##             Sensitivity : 0.8906          
##             Specificity : 0.9434          
##          Pos Pred Value : 0.9048          
##          Neg Pred Value : 0.9346          
##              Prevalence : 0.3765          
##          Detection Rate : 0.3353          
##    Detection Prevalence : 0.3706          
##       Balanced Accuracy : 0.9170          
##                                           
##        'Positive' Class : M               
## 

The accuracy of this model on the test set is found to be about 92%.

Model 3: CART Model

## cart model
library("rpart")
library(rattle)
## Warning: package 'rattle' was built under R version 3.3.3
## Rattle: A free graphical interface for data mining with R.
## Version 4.1.0 Copyright (c) 2006-2015 Togaware Pty Ltd.
## Type 'rattle()' to shake, rattle, and roll your data.
set.seed(1)
cart_model <- train(diagnosis ~ ., train_data, method="rpart")

cart_model
## CART 
## 
## 399 samples
##  30 predictor
##   2 classes: 'B', 'M' 
## 
## No pre-processing
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 399, 399, 399, 399, 399, 399, ... 
## Resampling results across tuning parameters:
## 
##   cp          Accuracy   Kappa    
##   0.00000000  0.9142129  0.8148022
##   0.04966887  0.9016065  0.7852476
##   0.78145695  0.8433296  0.6310219
## 
## Accuracy was used to select the optimal model using  the largest value.
## The final value used for the model was cp = 0.
fancyRpartPlot(cart_model$finalModel, sub="")

The resampled accuracy on the training data was found to be about 91%.
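
The 91% above is a resampling estimate on the training data. For consistency with the other models, the CART model can also be scored on the held-out test set; a minimal sketch:

## test-set performance of the CART model
pred_cart <- predict(cart_model, test_data)
confusionMatrix(pred_cart, test_data$diagnosis, positive = "M")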

Model 4: Boosted Tree

library("gbm")
## Loading required package: survival
## 
## Attaching package: 'survival'
## The following object is masked from 'package:caret':
## 
##     cluster
## Loading required package: splines
## Loading required package: parallel
## Loaded gbm 2.1.3
set.seed(1)
gbm_model <- train(diagnosis ~ ., train_data, method="gbm", verbose=FALSE)
## Loading required package: plyr
gbm_model
## Stochastic Gradient Boosting 
## 
## 399 samples
##  30 predictor
##   2 classes: 'B', 'M' 
## 
## No pre-processing
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 399, 399, 399, 399, 399, 399, ... 
## Resampling results across tuning parameters:
## 
##   interaction.depth  n.trees  Accuracy   Kappa    
##   1                   50      0.9346846  0.8582378
##   1                  100      0.9420355  0.8745503
##   1                  150      0.9461964  0.8834817
##   2                   50      0.9434785  0.8778883
##   2                  100      0.9478252  0.8871747
##   2                  150      0.9484099  0.8880967
##   3                   50      0.9399320  0.8701763
##   3                  100      0.9459135  0.8828523
##   3                  150      0.9461732  0.8836462
## 
## Tuning parameter 'shrinkage' was held constant at a value of 0.1
## 
## Tuning parameter 'n.minobsinnode' was held constant at a value of 10
## Accuracy was used to select the optimal model using  the largest value.
## The final values used for the model were n.trees = 150,
##  interaction.depth = 2, shrinkage = 0.1 and n.minobsinnode = 10.

Testing on the testing data

gbm_model$finalModel
## A gradient boosted model with bernoulli loss function.
## 150 iterations were performed.
## There were 30 predictors of which 30 had non-zero influence.
#Performance on testing set:

pred5 <- predict(gbm_model, test_data)
confusionMatrix(pred5, test_data$diagnosis, positive="M")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   B   M
##          B 104   3
##          M   2  61
##                                           
##                Accuracy : 0.9706          
##                  95% CI : (0.9327, 0.9904)
##     No Information Rate : 0.6235          
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.9372          
##  Mcnemar's Test P-Value : 1               
##                                           
##             Sensitivity : 0.9531          
##             Specificity : 0.9811          
##          Pos Pred Value : 0.9683          
##          Neg Pred Value : 0.9720          
##              Prevalence : 0.3765          
##          Detection Rate : 0.3588          
##    Detection Prevalence : 0.3706          
##       Balanced Accuracy : 0.9671          
##                                           
##        'Positive' Class : M               
## 

The accuracy on the test set was found to be about 97%.
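
It can also be instructive to see which predictors drive the boosted tree model. caret's varImp reports the relative influence of each feature; a minimal sketch (showing only the top 10 is an arbitrary choice):

## relative influence of the predictors in the boosted tree model
gbm_imp <- varImp(gbm_model)
plot(gbm_imp, top = 10)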

Accuracy Measure

Boosted Tree: 97%
Random Forest: 95%
Naive Bayes: 92%
CART: 91%

The Boosted Tree method gave the best accuracy among the four models.
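
As a quick programmatic cross-check of the figures above, the test-set accuracies can be collected from the confusion matrices into one table. This is a minimal sketch; cm_cart and cm_gbm are names introduced here for the CART and boosted tree confusion matrices, which were only printed (not stored) above.

## confusion matrices for the two models whose results were only printed above
cm_cart <- confusionMatrix(predict(cart_model, test_data), test_data$diagnosis, positive = "M")
cm_gbm  <- confusionMatrix(pred5, test_data$diagnosis, positive = "M")

## collect the test-set accuracy of all four models in one table
acc <- data.frame(
  Model    = c("Boosted Tree", "Random Forest", "Naive Bayes", "CART"),
  Accuracy = c(cm_gbm$overall["Accuracy"],
               cm_rf$overall["Accuracy"],
               cm_nb$overall["Accuracy"],
               cm_cart$overall["Accuracy"])
)
acc[order(-acc$Accuracy), ]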