Machine Learning in Cancer prediction using R

Cancer prediction has long been regarded as a critical topic. With big data and Machine Learning growth in biomedical and healthcare communities, accurate analysis of medical data benefits early disease detection, patient care,and community services.The main aim of this project is to build Machine Learning Models to predict the type of Cancer (Malignant or Benign).

Step 1: Load the given dataset

library(knitr)
library(readr)
CancerData <- read_csv("CancerData.csv")

## Parsed with column specification:
## cols(
##   .default = col_double(),
##   id = col_integer(),
##   diagnosis = col_character(),
##   X33 = col_character()
## )

## See spec(...) for full column specifications.

Step 2: Check for the missing values.

anyNA(CancerData)

## [1] TRUE

step 3: Plot the structure of the cancer data.

library(DataExplorer)
str(CancerData)

## Classes 'tbl_df', 'tbl' and 'data.frame':    569 obs. of  33 variables:
##  $ id                     : int  842302 842517 84300903 84348301 84358402 843786 844359 84458202 844981 84501001 ...
##  $ diagnosis              : chr  "M" "M" "M" "M" ...
##  $ radius_mean            : num  18 20.6 19.7 11.4 20.3 ...
##  $ texture_mean           : num  10.4 17.8 21.2 20.4 14.3 ...
##  $ perimeter_mean         : num  122.8 132.9 130 77.6 135.1 ...
##  $ area_mean              : num  1001 1326 1203 386 1297 ...
##  $ smoothness_mean        : num  0.1184 0.0847 0.1096 0.1425 0.1003 ...
##  $ compactness_mean       : num  0.2776 0.0786 0.1599 0.2839 0.1328 ...
##  $ concavity_mean         : num  0.3001 0.0869 0.1974 0.2414 0.198 ...
##  $ concave points_mean    : num  0.1471 0.0702 0.1279 0.1052 0.1043 ...
##  $ symmetry_mean          : num  0.242 0.181 0.207 0.26 0.181 ...
##  $ fractal_dimension_mean : num  0.0787 0.0567 0.06 0.0974 0.0588 ...
##  $ radius_se              : num  1.095 0.543 0.746 0.496 0.757 ...
##  $ texture_se             : num  0.905 0.734 0.787 1.156 0.781 ...
##  $ perimeter_se           : num  8.59 3.4 4.58 3.44 5.44 ...
##  $ area_se                : num  153.4 74.1 94 27.2 94.4 ...
##  $ smoothness_se          : num  0.0064 0.00522 0.00615 0.00911 0.01149 ...
##  $ compactness_se         : num  0.049 0.0131 0.0401 0.0746 0.0246 ...
##  $ concavity_se           : num  0.0537 0.0186 0.0383 0.0566 0.0569 ...
##  $ concave points_se      : num  0.0159 0.0134 0.0206 0.0187 0.0188 ...
##  $ symmetry_se            : num  0.03 0.0139 0.0225 0.0596 0.0176 ...
##  $ fractal_dimension_se   : num  0.00619 0.00353 0.00457 0.00921 0.00511 ...
##  $ radius_worst           : num  25.4 25 23.6 14.9 22.5 ...
##  $ texture_worst          : num  17.3 23.4 25.5 26.5 16.7 ...
##  $ perimeter_worst        : num  184.6 158.8 152.5 98.9 152.2 ...
##  $ area_worst             : num  2019 1956 1709 568 1575 ...
##  $ smoothness_worst       : num  0.162 0.124 0.144 0.21 0.137 ...
##  $ compactness_worst      : num  0.666 0.187 0.424 0.866 0.205 ...
##  $ concavity_worst        : num  0.712 0.242 0.45 0.687 0.4 ...
##  $ concave points_worst   : num  0.265 0.186 0.243 0.258 0.163 ...
##  $ symmetry_worst         : num  0.46 0.275 0.361 0.664 0.236 ...
##  $ fractal_dimension_worst: num  0.1189 0.089 0.0876 0.173 0.0768 ...
##  $ X33                    : chr  NA NA NA NA ...
##  - attr(*, "problems")=Classes 'tbl_df', 'tbl' and 'data.frame': 569 obs. of  5 variables:
##   ..$ row     : int  1 2 3 4 5 6 7 8 9 10 ...
##   ..$ col     : chr  NA NA NA NA ...
##   ..$ expected: chr  "33 columns" "33 columns" "33 columns" "33 columns" ...
##   ..$ actual  : chr  "32 columns" "32 columns" "32 columns" "32 columns" ...
##   ..$ file    : chr  "'CancerData.csv'" "'CancerData.csv'" "'CancerData.csv'" "'CancerData.csv'" ...
##  - attr(*, "spec")=List of 2
##   ..$ cols   :List of 33
##   .. ..$ id                     : list()
##   .. .. ..- attr(*, "class")= chr  "collector_integer" "collector"
##   .. ..$ diagnosis              : list()
##   .. .. ..- attr(*, "class")= chr  "collector_character" "collector"
##   .. ..$ radius_mean            : list()
##   .. .. ..- attr(*, "class")= chr  "collector_double" "collector"
##   .. ..$ texture_mean           : list()
##   .. .. ..- attr(*, "class")= chr  "collector_double" "collector"
##   .. ..$ perimeter_mean         : list()
##   .. .. ..- attr(*, "class")= chr  "collector_double" "collector"
##   .. ..$ area_mean              : list()
##   .. .. ..- attr(*, "class")= chr  "collector_double" "collector"
##   .. ..$ smoothness_mean        : list()
##   .. .. ..- attr(*, "class")= chr  "collector_double" "collector"
##   .. ..$ compactness_mean       : list()
##   .. .. ..- attr(*, "class")= chr  "collector_double" "collector"
##   .. ..$ concavity_mean         : list()
##   .. .. ..- attr(*, "class")= chr  "collector_double" "collector"
##   .. ..$ concave points_mean    : list()
##   .. .. ..- attr(*, "class")= chr  "collector_double" "collector"
##   .. ..$ symmetry_mean          : list()
##   .. .. ..- attr(*, "class")= chr  "collector_double" "collector"
##   .. ..$ fractal_dimension_mean : list()
##   .. .. ..- attr(*, "class")= chr  "collector_double" "collector"
##   .. ..$ radius_se              : list()
##   .. .. ..- attr(*, "class")= chr  "collector_double" "collector"
##   .. ..$ texture_se             : list()
##   .. .. ..- attr(*, "class")= chr  "collector_double" "collector"
##   .. ..$ perimeter_se           : list()
##   .. .. ..- attr(*, "class")= chr  "collector_double" "collector"
##   .. ..$ area_se                : list()
##   .. .. ..- attr(*, "class")= chr  "collector_double" "collector"
##   .. ..$ smoothness_se          : list()
##   .. .. ..- attr(*, "class")= chr  "collector_double" "collector"
##   .. ..$ compactness_se         : list()
##   .. .. ..- attr(*, "class")= chr  "collector_double" "collector"
##   .. ..$ concavity_se           : list()
##   .. .. ..- attr(*, "class")= chr  "collector_double" "collector"
##   .. ..$ concave points_se      : list()
##   .. .. ..- attr(*, "class")= chr  "collector_double" "collector"
##   .. ..$ symmetry_se            : list()
##   .. .. ..- attr(*, "class")= chr  "collector_double" "collector"
##   .. ..$ fractal_dimension_se   : list()
##   .. .. ..- attr(*, "class")= chr  "collector_double" "collector"
##   .. ..$ radius_worst           : list()
##   .. .. ..- attr(*, "class")= chr  "collector_double" "collector"
##   .. ..$ texture_worst          : list()
##   .. .. ..- attr(*, "class")= chr  "collector_double" "collector"
##   .. ..$ perimeter_worst        : list()
##   .. .. ..- attr(*, "class")= chr  "collector_double" "collector"
##   .. ..$ area_worst             : list()
##   .. .. ..- attr(*, "class")= chr  "collector_double" "collector"
##   .. ..$ smoothness_worst       : list()
##   .. .. ..- attr(*, "class")= chr  "collector_double" "collector"
##   .. ..$ compactness_worst      : list()
##   .. .. ..- attr(*, "class")= chr  "collector_double" "collector"
##   .. ..$ concavity_worst        : list()
##   .. .. ..- attr(*, "class")= chr  "collector_double" "collector"
##   .. ..$ concave points_worst   : list()
##   .. .. ..- attr(*, "class")= chr  "collector_double" "collector"
##   .. ..$ symmetry_worst         : list()
##   .. .. ..- attr(*, "class")= chr  "collector_double" "collector"
##   .. ..$ fractal_dimension_worst: list()
##   .. .. ..- attr(*, "class")= chr  "collector_double" "collector"
##   .. ..$ X33                    : list()
##   .. .. ..- attr(*, "class")= chr  "collector_character" "collector"
##   ..$ default: list()
##   .. ..- attr(*, "class")= chr  "collector_guess" "collector"
##   ..- attr(*, "class")= chr "col_spec"

step 4: Remove the missing values.

here column x33 contains missing values. so set it to NULL

CancerData$X33<-NULL

step 5: Find out the total number of diagnosis for better understanding.

text(barplot(table(CancerData$diagnosis),col=c('green','red'),main='Bar plot of Diagnosis')
     ,0,table(CancerData$diagnosis),cex=2,pos=3)

In the given dataset 357 patients are of the cancer type Benign and 212 patients are of the cancer type Malignant.

Malignant means the type of cancer which spreads throughout the body.

Benign means the type of cancer which does not spread throughout the body.

Now our main aim is to predict the chances of cancer turning to be malignant out of 357 patients on further checkup.

Step 6: Plot the correlated variables.

library(corrplot)
corrplot(cor(CancerData[sapply(CancerData, is.numeric)]))

keep the highly correlated variables. correlation means how the variables are dependent on each other. Remove the non correlated variables.Then remove the unnecessary variables which are not required.

CancerData$id<-NULL
CancerData$smoothness_se<-NULL
CancerData$compactness_se<-NULL
CancerData$concavity_se<-NULL
CancerData$`concave points_se`<-NULL
CancerData$symmetry_se <-NULL
CancerData$radius_se<-NULL
CancerData$texture_se<-NULL
CancerData$perimeter_se<-NULL
CancerData$area_se<-NULL
CancerData$fractal_dimension_se<-NULL
CancerData$symmetry_mean<-NULL
CancerData$fractal_dimension_mean<-NULL
CancerData$symmetry_worst<-NULL
CancerData$fractal_dimension_worst<-NULL
CancerData$smoothness_worst<-NULL
CancerData$smoothness_mean<-NULL
CancerData$texture_worst<-NULL
CancerData$compactness_mean<-NULL
CancerData$concavity_mean<-NULL
CancerData$`concave points_mean`<-NULL
CancerData$texture_mean<-NULL
CancerData$compactness_worst<-NULL
CancerData$concavity_worst<-NULL
CancerData$`concave points_worst`<-NULL

The variables which are set to NULL means they are not required further.

Step 7: Convert Malignant(M) in diagnosis column to 1 and Benign(B) in diagnosis column to 0.

CancerData$diagnosis[CancerData$diagnosis=="M"]<-'1'
CancerData$diagnosis[CancerData$diagnosis=="B"]<-'0'
CancerData$diagnosis<-as.numeric(CancerData$diagnosis)

Step 8: Perform Logistic Regression.

library(caTools)
set.seed(101)
split<-sample.split(CancerData,SplitRatio = 0.75)
split

## [1]  TRUE  TRUE FALSE FALSE  TRUE  TRUE  TRUE

In the above chunk of codes the dataset is split into two parts in the ratio 75% : 25%

Whichever values are TRUE in the split, make it as a training dataset. and whichever values are FALSE make it as a testing dataset.

training<-subset(CancerData,split=="TRUE")
testing<-subset(CancerData,split=="FALSE")

Now create the model using glm function keeping Churn as the target variable in the training dataset.

model<-glm(diagnosis~.,training,family ="binomial")
summary(model)

## 
## Call:
## glm(formula = diagnosis ~ ., family = "binomial", data = training)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -2.18733  -0.14008  -0.05353   0.00059   2.94649  
## 
## Coefficients:
##                  Estimate Std. Error z value Pr(>|z|)   
## (Intercept)      11.89728   10.49377   1.134  0.25690   
## radius_mean     -11.29113    3.59800  -3.138  0.00170 **
## perimeter_mean    1.01916    0.32132   3.172  0.00152 **
## area_mean         0.02613    0.03049   0.857  0.39157   
## radius_worst      0.33808    2.33011   0.145  0.88464   
## perimeter_worst   0.16468    0.12802   1.286  0.19832   
## area_worst        0.01757    0.02342   0.750  0.45305   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 535.758  on 406  degrees of freedom
## Residual deviance:  84.154  on 400  degrees of freedom
## AIC: 98.154
## 
## Number of Fisher Scoring iterations: 9

optimize the model by removing independent variables. After removing the independent variables from the model make sure the residual deviance should not increase and AIC value should decrease.

model<-glm(diagnosis~.-radius_worst,training,family ="binomial")
summary(model)

## 
## Call:
## glm(formula = diagnosis ~ . - radius_worst, family = "binomial", 
##     data = training)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -2.16799  -0.14255  -0.05439   0.00055   2.96254  
## 
## Coefficients:
##                   Estimate Std. Error z value Pr(>|z|)    
## (Intercept)      12.163442  10.327984   1.178  0.23891    
## radius_mean     -10.940445   2.648767  -4.130 3.62e-05 ***
## perimeter_mean    1.009230   0.312801   3.226  0.00125 ** 
## area_mean         0.022844   0.020381   1.121  0.26234    
## perimeter_worst   0.171246   0.120103   1.426  0.15392    
## area_worst        0.020665   0.009712   2.128  0.03336 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 535.758  on 406  degrees of freedom
## Residual deviance:  84.175  on 401  degrees of freedom
## AIC: 96.175
## 
## Number of Fisher Scoring iterations: 9

Initially the AIC value was 98.154 but after removing the radius_worst from the model AIC value became 96.175

Step 9: Now predict the value for testing dataset and predict the accuracy of the model.(type=reponse means we want the probability of the testing dataset.)

res<-predict(model,testing,type = "response")
res

##            1            2            3            4            5 
## 1.000000e+00 9.526745e-01 9.471076e-01 9.766886e-01 9.994231e-01 
##            6            7            8            9           10 
## 9.999988e-01 1.000000e+00 1.000000e+00 1.000000e+00 9.999611e-01 
##           11           12           13           14           15 
## 9.069354e-05 1.756094e-04 4.503439e-01 9.999999e-01 2.565323e-03 
##           16           17           18           19           20 
## 1.071703e-03 1.065442e-03 1.999505e-03 9.410985e-01 1.914518e-03 
##           21           22           23           24           25 
## 1.000000e+00 8.130160e-01 1.192414e-02 1.606432e-02 2.596316e-01 
##           26           27           28           29           30 
## 9.999899e-01 2.318959e-02 9.991549e-01 9.409155e-01 1.365559e-02 
##           31           32           33           34           35 
## 1.152920e-03 1.000000e+00 8.545624e-03 1.692638e-02 9.999941e-01 
##           36           37           38           39           40 
## 1.000000e+00 3.783901e-02 9.999972e-01 1.423396e-02 8.076163e-03 
##           41           42           43           44           45 
## 7.359270e-03 5.675181e-02 1.452678e-02 3.612446e-03 9.998635e-01 
##           46           47           48           49           50 
## 3.455325e-01 5.189878e-03 1.000000e+00 2.019712e-03 9.949405e-01 
##           51           52           53           54           55 
## 9.450403e-01 4.533332e-04 6.584084e-01 1.021147e-02 2.586460e-03 
##           56           57           58           59           60 
## 8.543183e-05 1.000000e+00 9.972185e-01 9.679912e-01 6.778920e-04 
##           61           62           63           64           65 
## 1.000000e+00 6.249878e-01 1.000000e+00 2.149825e-02 1.445052e-03 
##           66           67           68           69           70 
## 1.545478e-01 1.000000e+00 7.828286e-04 4.081557e-03 3.677118e-04 
##           71           72           73           74           75 
## 4.411889e-01 6.581162e-03 1.000000e+00 7.409356e-01 9.432439e-01 
##           76           77           78           79           80 
## 9.999721e-01 2.098160e-03 4.472053e-03 4.549159e-04 5.177999e-04 
##           81           82           83           84           85 
## 9.999993e-01 9.967413e-01 6.903889e-04 3.944481e-01 1.833639e-04 
##           86           87           88           89           90 
## 7.084269e-03 4.937584e-04 4.259237e-03 5.997133e-04 2.148278e-02 
##           91           92           93           94           95 
## 9.999816e-01 6.538703e-02 1.134285e-02 2.022639e-03 6.270656e-02 
##           96           97           98           99          100 
## 9.535633e-04 4.398883e-03 1.000000e+00 1.644318e-03 3.040722e-03 
##          101          102          103          104          105 
## 1.000000e+00 9.926756e-01 9.354318e-02 1.268432e-03 9.999999e-01 
##          106          107          108          109          110 
## 5.039383e-02 1.000000e+00 7.444032e-03 1.641544e-02 2.759074e-03 
##          111          112          113          114          115 
## 1.203750e-02 5.649281e-03 1.051701e-02 1.486491e-03 1.462480e-02 
##          116          117          118          119          120 
## 3.469472e-02 9.997780e-01 1.161365e-02 3.924309e-03 5.544760e-03 
##          121          122          123          124          125 
## 4.016923e-03 1.366339e-01 1.539029e-03 9.985333e-01 1.302597e-02 
##          126          127          128          129          130 
## 1.728119e-02 1.496133e-04 9.964365e-01 4.030264e-03 9.992768e-01 
##          131          132          133          134          135 
## 1.525470e-03 1.975726e-03 1.637397e-02 9.035011e-01 6.344683e-03 
##          136          137          138          139          140 
## 4.695678e-01 2.710624e-03 9.846137e-01 3.118842e-01 8.421589e-02 
##          141          142          143          144          145 
## 9.999913e-01 2.239503e-04 1.000000e+00 3.781347e-01 3.179473e-03 
##          146          147          148          149          150 
## 8.804819e-04 3.219131e-01 7.635766e-01 8.502197e-03 1.000000e+00 
##          151          152          153          154          155 
## 2.840503e-03 2.619077e-03 1.594852e-03 1.000000e+00 8.450897e-01 
##          156          157          158          159          160 
## 5.831363e-02 1.881867e-03 2.633348e-02 5.571214e-04 4.338274e-04 
##          161          162 
## 9.993441e-01 1.000000e+00

testing

## # A tibble: 162 x 7
##    diagnosis radius_mean perimeter_mean area_mean radius_worst
##        <dbl>       <dbl>          <dbl>     <dbl>        <dbl>
##  1         1        19.7          130       1203          23.6
##  2         1        11.4           77.6      386.         14.9
##  3         1        12.5           84.0      476.         15.1
##  4         1        16.0          103.       798.         19.2
##  5         1        14.7           94.7      684.         19.1
##  6         1        16.1          108.       799.         21.0
##  7         1        21.2          137.      1404          29.2
##  8         1        16.6          110        905.         26.5
##  9         1        18.6          125.      1088          23.2
## 10         1        11.8           77.9      441.         16.8
## # ... with 152 more rows, and 2 more variables: perimeter_worst <dbl>,
## #   area_worst <dbl>

step 10: Now find the threshold using ROC curve.

library(ROCR)
res<-predict(model,training,type = "response")
ROCRpred=prediction(res,training$diagnosis)
ROCRperf<-performance(ROCRpred,"tpr","fpr")
plot(ROCRperf,colorize=TRUE,print.cutoffs.at=seq(0.1,by=0.1))

Now the graph says threshold value is 0.2

Step 11: Now Create the confusion matrix.

res<-predict(model,testing,type = "response")
table(Actualvalue=testing$diagnosis,Predictedvalue=res>0.2)

##            Predictedvalue
## Actualvalue FALSE TRUE
##           0    91    9
##           1     3   59

Now calculate the Accuracy of the model using Confusion matrix.

(91+59)/(91+59+3+9)

## [1] 0.9259259

Now our model is 92.59% accurate.

Step 12 : Now calculate the odds ratio.

exp(confint(model))

## Waiting for profiling to be done...

##                        2.5 %       97.5 %
## (Intercept)     5.552209e-05 4.632513e+13
## radius_mean     5.813501e-08 2.306326e-03
## perimeter_mean  1.564613e+00 5.458182e+00
## area_mean       9.827457e-01 1.066776e+00
## perimeter_worst 9.419771e-01 1.515445e+00
## area_worst      1.002565e+00 1.041715e+00

The odds ratio says that with the increase in 1 unit of perimeter_mean the probability of increasing the cancer to be malignant is 56.46% in the patients next checkup.

This approximation is based on statistics.

“Statistics is approximate Science but not exact Science.”.

Machine Learning in Cancer prediction using R

Ashish S Shanbhag

22 October 2018