This is an R Markdown document. Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents.

## Summary

Early diagnosis of cancer is critical for successful treatment, so there is high demand for accurate and inexpensive diagnostic methods. In this project we explored the applicability of decision tree machine learning techniques (CART, Random Forests, and Boosted Trees) and Naive Bayes for breast cancer diagnosis using digitized images of tissue samples. The data was obtained from the UC Irvine Machine Learning Repository ("Breast Cancer Wisconsin data set", created by William H. Wolberg, W. Nick Street, and Olvi L. Mangasarian). The most accurate traditional diagnostic method is a rather invasive technique called breast biopsy, in which a small piece of breast tissue is surgically removed and then examined by a specialist. A much less invasive alternative obtains samples by fine needle aspirate; samples obtained this way can be easily digitized and used for computational diagnosis. Using machine learning methods for diagnosis can significantly increase processing speed and, at scale, make diagnosis significantly cheaper.
Here we studied the applicability of the Random Forests, Boosted Trees, and Naive Bayes methods for cancer prediction, with CART as a baseline for comparison. On this dataset the CART model achieved an estimated accuracy of about 91%, Naive Bayes about 92%, Random Forests about 95%, and Boosted Trees about 97%.
First, the necessary libraries are loaded into the R environment: ggplot2 is used to make plots, corrplot to make correlation plots, and caret for data preprocessing and machine learning.
library("readr")
library("dplyr")
library("ggplot2")
library("corrplot")
library("gridExtra")
library("pROC")
## Type 'citation("pROC")' for a citation.
##
## Attaching package: 'pROC'
## The following objects are masked from 'package:stats':
##
##     cov, smooth, var
library("MASS")
library("caTools")
library("caret")
## Loading required package: lattice
The data is taken from the UCI Repository, downloaded, and saved on the local machine.
data <- read.csv("C:/Users/USER/Documents/data.csv")
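For portability, the same file can also be read relative to the working directory rather than an absolute Windows path. This is a sketch assuming data.csv has been placed in the working directory:
# Sketch: read data.csv from the working directory instead of an absolute path.
# stringsAsFactors = TRUE keeps the diagnosis column as a factor, matching the
# str() output below.
data <- read.csv("data.csv", stringsAsFactors = TRUE)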
## Reading cancer data
str(data)
## 'data.frame': 569 obs. of 32 variables:
## $ id : int 842302 842517 84300903 84348301 84358402 843786 844359 84458202 844981 84501001 ...
## $ diagnosis : Factor w/ 2 levels "B","M": 2 2 2 2 2 2 2 2 2 1 ...
## $ radius_mean : num 18 20.6 19.7 11.4 20.3 ...
## $ texture_mean : num 10.4 17.8 21.2 20.4 14.3 ...
## $ perimeter_mean : num 122.8 132.9 130 77.6 135.1 ...
## $ area_mean : num 1001 1326 1203 386 1297 ...
## $ smoothness_mean : num 0.1184 0.0847 0.1096 0.1425 0.1003 ...
## $ compactness_mean : num 0.2776 0.0786 0.1599 0.2839 0.1328 ...
## $ concavity_mean : num 0.3001 0.0869 0.1974 0.2414 0.198 ...
## $ concave.points_mean : num 0.1471 0.0702 0.1279 0.1052 0.1043 ...
## $ symmetry_mean : num 0.242 0.181 0.207 0.26 0.181 ...
## $ fractal_dimension_mean : num 0.0787 0.0567 0.06 0.0974 0.0588 ...
## $ radius_se : num 1.095 0.543 0.746 0.496 0.757 ...
## $ texture_se : num 0.905 0.734 0.787 1.156 0.781 ...
## $ perimeter_se : num 8.59 3.4 4.58 3.44 5.44 ...
## $ area_se : num 153.4 74.1 94 27.2 94.4 ...
## $ smoothness_se : num 0.0064 0.00522 0.00615 0.00911 0.01149 ...
## $ compactness_se : num 0.049 0.0131 0.0401 0.0746 0.0246 ...
## $ concavity_se : num 0.0537 0.0186 0.0383 0.0566 0.0569 ...
## $ concave.points_se : num 0.0159 0.0134 0.0206 0.0187 0.0188 ...
## $ symmetry_se : num 0.03 0.0139 0.0225 0.0596 0.0176 ...
## $ fractal_dimension_se : num 0.00619 0.00353 0.00457 0.00921 0.00511 ...
## $ radius_worst : num 25.4 25 23.6 14.9 22.5 ...
## $ texture_worst : num 17.3 23.4 25.5 26.5 16.7 ...
## $ perimeter_worst : num 184.6 158.8 152.5 98.9 152.2 ...
## $ area_worst : num 2019 1956 1709 568 1575 ...
## $ smoothness_worst : num 0.162 0.124 0.144 0.21 0.137 ...
## $ compactness_worst : num 0.666 0.187 0.424 0.866 0.205 ...
## $ concavity_worst : num 0.712 0.242 0.45 0.687 0.4 ...
## $ concave.points_worst : num 0.265 0.186 0.243 0.258 0.163 ...
## $ symmetry_worst : num 0.46 0.275 0.361 0.664 0.236 ...
## $ fractal_dimension_worst: num 0.1189 0.089 0.0876 0.173 0.0768 ...
##
data$diagnosis <- as.factor(data$diagnosis)  # encode diagnosis as a factor (B = benign, M = malignant)
data[,33] <- NULL                            # drop the empty trailing column (an artifact of the CSV export)
## We then look at a summary of the dataset
summary(data)
## id diagnosis radius_mean texture_mean
## Min. : 8670 B:354 Min. : 6.981 Min. : 9.71
## 1st Qu.: 869218 M:215 1st Qu.:11.700 1st Qu.:16.17
## Median : 906024 Median :13.370 Median :18.84
## Mean : 30371831 Mean :14.127 Mean :19.29
## 3rd Qu.: 8813129 3rd Qu.:15.780 3rd Qu.:21.80
## Max. :911320502 Max. :28.110 Max. :39.28
## perimeter_mean area_mean smoothness_mean compactness_mean
## Min. : 43.79 Min. : 143.5 Min. :0.05263 Min. :0.01938
## 1st Qu.: 75.17 1st Qu.: 420.3 1st Qu.:0.08637 1st Qu.:0.06492
## Median : 86.24 Median : 551.1 Median :0.09587 Median :0.09263
## Mean : 91.97 Mean : 654.9 Mean :0.09636 Mean :0.10434
## 3rd Qu.:104.10 3rd Qu.: 782.7 3rd Qu.:0.10530 3rd Qu.:0.13040
## Max. :188.50 Max. :2501.0 Max. :0.16340 Max. :0.34540
## concavity_mean concave.points_mean symmetry_mean
## Min. :0.00000 Min. :0.00000 Min. :0.1060
## 1st Qu.:0.02956 1st Qu.:0.02031 1st Qu.:0.1619
## Median :0.06154 Median :0.03350 Median :0.1792
## Mean :0.08880 Mean :0.04892 Mean :0.1812
## 3rd Qu.:0.13070 3rd Qu.:0.07400 3rd Qu.:0.1957
## Max. :0.42680 Max. :0.20120 Max. :0.3040
## fractal_dimension_mean radius_se texture_se perimeter_se
## Min. :0.04996 Min. :0.1115 Min. :0.3602 Min. : 0.757
## 1st Qu.:0.05770 1st Qu.:0.2324 1st Qu.:0.8339 1st Qu.: 1.606
## Median :0.06154 Median :0.3242 Median :1.1080 Median : 2.287
## Mean :0.06280 Mean :0.4052 Mean :1.2169 Mean : 2.866
## 3rd Qu.:0.06612 3rd Qu.:0.4789 3rd Qu.:1.4740 3rd Qu.: 3.357
## Max. :0.09744 Max. :2.8730 Max. :4.8850 Max. :21.980
## area_se smoothness_se compactness_se concavity_se
## Min. : 6.802 Min. :0.001713 Min. :0.002252 Min. :0.00000
## 1st Qu.: 17.850 1st Qu.:0.005169 1st Qu.:0.013080 1st Qu.:0.01509
## Median : 24.530 Median :0.006380 Median :0.020450 Median :0.02589
## Mean : 40.337 Mean :0.007041 Mean :0.025478 Mean :0.03189
## 3rd Qu.: 45.190 3rd Qu.:0.008146 3rd Qu.:0.032450 3rd Qu.:0.04205
## Max. :542.200 Max. :0.031130 Max. :0.135400 Max. :0.39600
## concave.points_se symmetry_se fractal_dimension_se
## Min. :0.000000 Min. :0.007882 Min. :0.0008948
## 1st Qu.:0.007638 1st Qu.:0.015160 1st Qu.:0.0022480
## Median :0.010930 Median :0.018730 Median :0.0031870
## Mean :0.011796 Mean :0.020542 Mean :0.0037949
## 3rd Qu.:0.014710 3rd Qu.:0.023480 3rd Qu.:0.0045580
## Max. :0.052790 Max. :0.078950 Max. :0.0298400
## radius_worst texture_worst perimeter_worst area_worst
## Min. : 7.93 Min. :12.02 Min. : 50.41 Min. : 185.2
## 1st Qu.:13.01 1st Qu.:21.08 1st Qu.: 84.11 1st Qu.: 515.3
## Median :14.97 Median :25.41 Median : 97.66 Median : 686.5
## Mean :16.27 Mean :25.68 Mean :107.26 Mean : 880.6
## 3rd Qu.:18.79 3rd Qu.:29.72 3rd Qu.:125.40 3rd Qu.:1084.0
## Max. :36.04 Max. :49.54 Max. :251.20 Max. :4254.0
## smoothness_worst compactness_worst concavity_worst concave.points_worst
## Min. :0.07117 Min. :0.02729 Min. :0.0000 Min. :0.00000
## 1st Qu.:0.11660 1st Qu.:0.14720 1st Qu.:0.1145 1st Qu.:0.06493
## Median :0.13130 Median :0.21190 Median :0.2267 Median :0.09993
## Mean :0.13237 Mean :0.25427 Mean :0.2722 Mean :0.11461
## 3rd Qu.:0.14600 3rd Qu.:0.33910 3rd Qu.:0.3829 3rd Qu.:0.16140
## Max. :0.22260 Max. :1.05800 Max. :1.2520 Max. :0.29100
## symmetry_worst fractal_dimension_worst
## Min. :0.1565 Min. :0.05504
## 1st Qu.:0.2504 1st Qu.:0.07146
## Median :0.2822 Median :0.08004
## Mean :0.2901 Mean :0.08395
## 3rd Qu.:0.3179 3rd Qu.:0.09208
## Max. :0.6638 Max. :0.20750
We find that the data is somewhat imbalanced and that there is substantial correlation between the attributes.
## we find that there are no missing values
## we find that the data is somewhat imbalanced
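# Illustrative check (not in the original output): total number of NAs, expected to be 0
sum(is.na(data))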
prop.table(table(data$diagnosis))
##
## B M
## 0.6221441 0.3778559
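As a quick visual check of this imbalance, the class distribution can be drawn with the already loaded ggplot2; this plot is illustrative and not part of the original output:
# Illustrative: bar chart of the class distribution (B vs. M)
ggplot(data, aes(x = diagnosis, fill = diagnosis)) +
  geom_bar() +
  labs(title = "Diagnosis class distribution", x = "Diagnosis", y = "Count")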
## we then examine the correlations between the features
corr_mat <- cor(data[,3:ncol(data)])
corrplot(corr_mat)
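Because many features are strongly correlated, one could, for example, use caret's findCorrelation to flag features above a correlation cutoff; this is an illustrative step not performed in the original analysis:
# Illustrative: names of features with pairwise correlation above 0.9;
# such features are candidates for removal before modeling
high_corr <- findCorrelation(corr_mat, cutoff = 0.9, names = TRUE)
high_corr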
We are going to get a training and a testing set to use when building some models:
set.seed(1234)
data_index <- createDataPartition(data$diagnosis, p=0.7, list = FALSE)
train_data <- data[data_index, -1]
test_data <- data[-data_index, -1]
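Since createDataPartition stratifies on the outcome, the class proportions in the training and testing sets should match the full data; a quick illustrative check (not in the original output):
# Illustrative: verify that the stratified split preserved the B/M proportions
prop.table(table(train_data$diagnosis))
prop.table(table(test_data$diagnosis))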
Applying the learning models
fitControl <- trainControl(method="cv",
number = 5,
preProcOptions = list(thresh = 0.99), # variance threshold, used only if 'pca' is added to preProcess
classProbs = TRUE,
summaryFunction = twoClassSummary)
Building the model on the training data
## random forest
model_rf <- train(diagnosis~.,
train_data,
method="ranger",
metric="ROC",
#tuneLength=10,
#tuneGrid = expand.grid(mtry = c(2, 3, 6)),
preProcess = c('center', 'scale'),
trControl=fitControl)
## Loading required package: e1071
## Loading required package: ranger
Testing on the testing data
## testing the random forest
pred_rf <- predict(model_rf, test_data)
cm_rf <- confusionMatrix(pred_rf, test_data$diagnosis, positive = "M")
cm_rf
## Confusion Matrix and Statistics
##
## Reference
## Prediction B M
## B 103 5
## M 3 59
##
## Accuracy : 0.9529
## 95% CI : (0.9094, 0.9795)
## No Information Rate : 0.6235
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.8991
## Mcnemar's Test P-Value : 0.7237
##
## Sensitivity : 0.9219
## Specificity : 0.9717
## Pos Pred Value : 0.9516
## Neg Pred Value : 0.9537
## Prevalence : 0.3765
## Detection Rate : 0.3471
## Detection Prevalence : 0.3647
## Balanced Accuracy : 0.9468
##
## 'Positive' Class : M
##
We find that the accuracy of this model on the test set is about 95%.
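Because the models are tuned on ROC, it is natural to also look at the test-set ROC curve and AUC. The following is a sketch using the already loaded pROC package; this step is not part of the original output:
# Sketch: class probabilities on the test set, then the ROC curve and AUC
# for the random forest model
prob_rf <- predict(model_rf, test_data, type = "prob")
roc_rf <- roc(test_data$diagnosis, prob_rf[, "M"])
auc(roc_rf)
plot(roc_rf, main = "Random forest ROC (test set)")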
Building and testing the Naive Bayes model
model_nb <- train(diagnosis~.,
train_data,
method="nb",
metric="ROC",
preProcess=c('center', 'scale'),
trace=FALSE,
trControl=fitControl)
## Loading required package: klaR
## predicting on the test data
pred_nb <- predict(model_nb, test_data)
cm_nb <- confusionMatrix(pred_nb, test_data$diagnosis, positive = "M")
cm_nb
## Confusion Matrix and Statistics
##
## Reference
## Prediction B M
## B 100 7
## M 6 57
##
## Accuracy : 0.9235
## 95% CI : (0.8728, 0.9587)
## No Information Rate : 0.6235
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.8366
## Mcnemar's Test P-Value : 1
##
## Sensitivity : 0.8906
## Specificity : 0.9434
## Pos Pred Value : 0.9048
## Neg Pred Value : 0.9346
## Prevalence : 0.3765
## Detection Rate : 0.3353
## Detection Prevalence : 0.3706
## Balanced Accuracy : 0.9170
##
## 'Positive' Class : M
##
The accuracy of this model is found to be about 92%.
## cart model
library("rpart")
library(rattle)
## Warning: package 'rattle' was built under R version 3.3.3
## Rattle: A free graphical interface for data mining with R.
## Version 4.1.0 Copyright (c) 2006-2015 Togaware Pty Ltd.
## Type 'rattle()' to shake, rattle, and roll your data.
set.seed(1)
cart_model <- train(diagnosis ~ ., train_data, method="rpart")
cart_model
## CART
##
## 399 samples
## 30 predictor
## 2 classes: 'B', 'M'
##
## No pre-processing
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 399, 399, 399, 399, 399, 399, ...
## Resampling results across tuning parameters:
##
## cp Accuracy Kappa
## 0.00000000 0.9142129 0.8148022
## 0.04966887 0.9016065 0.7852476
## 0.78145695 0.8433296 0.6310219
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was cp = 0.
fancyRpartPlot(cart_model$finalModel, sub="")
The CART accuracy was found to be about 91% (a bootstrap resampling estimate).
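Note that, unlike the other models, CART was not evaluated on the held-out test set above. A sketch of that evaluation, for a comparison consistent with the other models (not run in the original analysis):
# Sketch: evaluate the CART model on the test set as well
pred_cart <- predict(cart_model, test_data)
confusionMatrix(pred_cart, test_data$diagnosis, positive = "M")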
library("gbm")
## Loading required package: survival
##
## Attaching package: 'survival'
## The following object is masked from 'package:caret':
##
## cluster
## Loading required package: splines
## Loading required package: parallel
## Loaded gbm 2.1.3
set.seed(1)
gbm_model <- train(diagnosis ~ ., train_data, method="gbm", verbose=FALSE)
## Loading required package: plyr
gbm_model
## Stochastic Gradient Boosting
##
## 399 samples
## 30 predictor
## 2 classes: 'B', 'M'
##
## No pre-processing
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 399, 399, 399, 399, 399, 399, ...
## Resampling results across tuning parameters:
##
## interaction.depth n.trees Accuracy Kappa
## 1 50 0.9346846 0.8582378
## 1 100 0.9420355 0.8745503
## 1 150 0.9461964 0.8834817
## 2 50 0.9434785 0.8778883
## 2 100 0.9478252 0.8871747
## 2 150 0.9484099 0.8880967
## 3 50 0.9399320 0.8701763
## 3 100 0.9459135 0.8828523
## 3 150 0.9461732 0.8836462
##
## Tuning parameter 'shrinkage' was held constant at a value of 0.1
##
## Tuning parameter 'n.minobsinnode' was held constant at a value of 10
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were n.trees = 150,
## interaction.depth = 2, shrinkage = 0.1 and n.minobsinnode = 10.
Testing on the testing data
gbm_model$finalModel
## A gradient boosted model with bernoulli loss function.
## 150 iterations were performed.
## There were 30 predictors of which 30 had non-zero influence.
# Performance on the testing set:
pred5 <- predict(gbm_model, test_data)
confusionMatrix(pred5, test_data$diagnosis, positive="M")
## Confusion Matrix and Statistics
##
## Reference
## Prediction B M
## B 104 3
## M 2 61
##
## Accuracy : 0.9706
## 95% CI : (0.9327, 0.9904)
## No Information Rate : 0.6235
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.9372
## Mcnemar's Test P-Value : 1
##
## Sensitivity : 0.9531
## Specificity : 0.9811
## Pos Pred Value : 0.9683
## Neg Pred Value : 0.9720
## Prevalence : 0.3765
## Detection Rate : 0.3588
## Detection Prevalence : 0.3706
## Balanced Accuracy : 0.9671
##
## 'Positive' Class : M
##
The accuracy was found to be about 97%.
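To see which features drive the boosted tree model, caret's varImp can be applied; this is an illustrative step not included in the original output:
# Illustrative: relative influence of the predictors in the gbm model
varImp(gbm_model)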
Boosted Trees: 97%
Random Forest: 95%
Naive Bayes: 92%
CART: 91% (resampling estimate)
The Boosted Trees method gave the best accuracy among the four.
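As a compact programmatic recap of the test-set results (a sketch; cm_gbm is introduced here for illustration, since the boosted tree confusion matrix was printed above rather than stored):
# Sketch: store the boosted tree confusion matrix, then tabulate the
# test-set accuracies of the models evaluated above
cm_gbm <- confusionMatrix(pred5, test_data$diagnosis, positive = "M")
sapply(list(BoostedTrees = cm_gbm, RandomForest = cm_rf, NaiveBayes = cm_nb),
       function(cm) round(cm$overall["Accuracy"], 3))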