Cancer prediction has long been regarded as a critical topic. With big data and Machine Learning growth in biomedical and healthcare communities, accurate analysis of medical data benefits early disease detection, patient care,and community services.The main aim of this project is to build Machine Learning Models to predict the type of Cancer (Malignant or Benign).
Step 1: Load the given dataset
library(knitr)
library(readr)
CancerData <- read_csv("CancerData.csv")
## Parsed with column specification:
## cols(
## .default = col_double(),
## id = col_integer(),
## diagnosis = col_character(),
## X33 = col_character()
## )
## See spec(...) for full column specifications.
Step 2: Check for the missing values.
anyNA(CancerData)
## [1] TRUE
step 3: Plot the structure of the cancer data.
library(DataExplorer)
str(CancerData)
## Classes 'tbl_df', 'tbl' and 'data.frame': 569 obs. of 33 variables:
## $ id : int 842302 842517 84300903 84348301 84358402 843786 844359 84458202 844981 84501001 ...
## $ diagnosis : chr "M" "M" "M" "M" ...
## $ radius_mean : num 18 20.6 19.7 11.4 20.3 ...
## $ texture_mean : num 10.4 17.8 21.2 20.4 14.3 ...
## $ perimeter_mean : num 122.8 132.9 130 77.6 135.1 ...
## $ area_mean : num 1001 1326 1203 386 1297 ...
## $ smoothness_mean : num 0.1184 0.0847 0.1096 0.1425 0.1003 ...
## $ compactness_mean : num 0.2776 0.0786 0.1599 0.2839 0.1328 ...
## $ concavity_mean : num 0.3001 0.0869 0.1974 0.2414 0.198 ...
## $ concave points_mean : num 0.1471 0.0702 0.1279 0.1052 0.1043 ...
## $ symmetry_mean : num 0.242 0.181 0.207 0.26 0.181 ...
## $ fractal_dimension_mean : num 0.0787 0.0567 0.06 0.0974 0.0588 ...
## $ radius_se : num 1.095 0.543 0.746 0.496 0.757 ...
## $ texture_se : num 0.905 0.734 0.787 1.156 0.781 ...
## $ perimeter_se : num 8.59 3.4 4.58 3.44 5.44 ...
## $ area_se : num 153.4 74.1 94 27.2 94.4 ...
## $ smoothness_se : num 0.0064 0.00522 0.00615 0.00911 0.01149 ...
## $ compactness_se : num 0.049 0.0131 0.0401 0.0746 0.0246 ...
## $ concavity_se : num 0.0537 0.0186 0.0383 0.0566 0.0569 ...
## $ concave points_se : num 0.0159 0.0134 0.0206 0.0187 0.0188 ...
## $ symmetry_se : num 0.03 0.0139 0.0225 0.0596 0.0176 ...
## $ fractal_dimension_se : num 0.00619 0.00353 0.00457 0.00921 0.00511 ...
## $ radius_worst : num 25.4 25 23.6 14.9 22.5 ...
## $ texture_worst : num 17.3 23.4 25.5 26.5 16.7 ...
## $ perimeter_worst : num 184.6 158.8 152.5 98.9 152.2 ...
## $ area_worst : num 2019 1956 1709 568 1575 ...
## $ smoothness_worst : num 0.162 0.124 0.144 0.21 0.137 ...
## $ compactness_worst : num 0.666 0.187 0.424 0.866 0.205 ...
## $ concavity_worst : num 0.712 0.242 0.45 0.687 0.4 ...
## $ concave points_worst : num 0.265 0.186 0.243 0.258 0.163 ...
## $ symmetry_worst : num 0.46 0.275 0.361 0.664 0.236 ...
## $ fractal_dimension_worst: num 0.1189 0.089 0.0876 0.173 0.0768 ...
## $ X33 : chr NA NA NA NA ...
## - attr(*, "problems")=Classes 'tbl_df', 'tbl' and 'data.frame': 569 obs. of 5 variables:
## ..$ row : int 1 2 3 4 5 6 7 8 9 10 ...
## ..$ col : chr NA NA NA NA ...
## ..$ expected: chr "33 columns" "33 columns" "33 columns" "33 columns" ...
## ..$ actual : chr "32 columns" "32 columns" "32 columns" "32 columns" ...
## ..$ file : chr "'CancerData.csv'" "'CancerData.csv'" "'CancerData.csv'" "'CancerData.csv'" ...
## - attr(*, "spec")=List of 2
## ..$ cols :List of 33
## .. ..$ id : list()
## .. .. ..- attr(*, "class")= chr "collector_integer" "collector"
## .. ..$ diagnosis : list()
## .. .. ..- attr(*, "class")= chr "collector_character" "collector"
## .. ..$ radius_mean : list()
## .. .. ..- attr(*, "class")= chr "collector_double" "collector"
## .. ..$ texture_mean : list()
## .. .. ..- attr(*, "class")= chr "collector_double" "collector"
## .. ..$ perimeter_mean : list()
## .. .. ..- attr(*, "class")= chr "collector_double" "collector"
## .. ..$ area_mean : list()
## .. .. ..- attr(*, "class")= chr "collector_double" "collector"
## .. ..$ smoothness_mean : list()
## .. .. ..- attr(*, "class")= chr "collector_double" "collector"
## .. ..$ compactness_mean : list()
## .. .. ..- attr(*, "class")= chr "collector_double" "collector"
## .. ..$ concavity_mean : list()
## .. .. ..- attr(*, "class")= chr "collector_double" "collector"
## .. ..$ concave points_mean : list()
## .. .. ..- attr(*, "class")= chr "collector_double" "collector"
## .. ..$ symmetry_mean : list()
## .. .. ..- attr(*, "class")= chr "collector_double" "collector"
## .. ..$ fractal_dimension_mean : list()
## .. .. ..- attr(*, "class")= chr "collector_double" "collector"
## .. ..$ radius_se : list()
## .. .. ..- attr(*, "class")= chr "collector_double" "collector"
## .. ..$ texture_se : list()
## .. .. ..- attr(*, "class")= chr "collector_double" "collector"
## .. ..$ perimeter_se : list()
## .. .. ..- attr(*, "class")= chr "collector_double" "collector"
## .. ..$ area_se : list()
## .. .. ..- attr(*, "class")= chr "collector_double" "collector"
## .. ..$ smoothness_se : list()
## .. .. ..- attr(*, "class")= chr "collector_double" "collector"
## .. ..$ compactness_se : list()
## .. .. ..- attr(*, "class")= chr "collector_double" "collector"
## .. ..$ concavity_se : list()
## .. .. ..- attr(*, "class")= chr "collector_double" "collector"
## .. ..$ concave points_se : list()
## .. .. ..- attr(*, "class")= chr "collector_double" "collector"
## .. ..$ symmetry_se : list()
## .. .. ..- attr(*, "class")= chr "collector_double" "collector"
## .. ..$ fractal_dimension_se : list()
## .. .. ..- attr(*, "class")= chr "collector_double" "collector"
## .. ..$ radius_worst : list()
## .. .. ..- attr(*, "class")= chr "collector_double" "collector"
## .. ..$ texture_worst : list()
## .. .. ..- attr(*, "class")= chr "collector_double" "collector"
## .. ..$ perimeter_worst : list()
## .. .. ..- attr(*, "class")= chr "collector_double" "collector"
## .. ..$ area_worst : list()
## .. .. ..- attr(*, "class")= chr "collector_double" "collector"
## .. ..$ smoothness_worst : list()
## .. .. ..- attr(*, "class")= chr "collector_double" "collector"
## .. ..$ compactness_worst : list()
## .. .. ..- attr(*, "class")= chr "collector_double" "collector"
## .. ..$ concavity_worst : list()
## .. .. ..- attr(*, "class")= chr "collector_double" "collector"
## .. ..$ concave points_worst : list()
## .. .. ..- attr(*, "class")= chr "collector_double" "collector"
## .. ..$ symmetry_worst : list()
## .. .. ..- attr(*, "class")= chr "collector_double" "collector"
## .. ..$ fractal_dimension_worst: list()
## .. .. ..- attr(*, "class")= chr "collector_double" "collector"
## .. ..$ X33 : list()
## .. .. ..- attr(*, "class")= chr "collector_character" "collector"
## ..$ default: list()
## .. ..- attr(*, "class")= chr "collector_guess" "collector"
## ..- attr(*, "class")= chr "col_spec"
step 4: Remove the missing values.
here column x33 contains missing values. so set it to NULL
CancerData$X33<-NULL
step 5: Find out the total number of diagnosis for better understanding.
text(barplot(table(CancerData$diagnosis),col=c('green','red'),main='Bar plot of Diagnosis')
,0,table(CancerData$diagnosis),cex=2,pos=3)
In the given dataset 357 patients are of the cancer type Benign and 212 patients are of the cancer type Malignant.
Malignant means the type of cancer which spreads throughout the body.
Benign means the type of cancer which does not spread throughout the body.
Now our main aim is to predict the chances of cancer turning to be malignant out of 357 patients on further checkup.
Step 6: Plot the correlated variables.
library(corrplot)
corrplot(cor(CancerData[sapply(CancerData, is.numeric)]))
keep the highly correlated variables. correlation means how the variables are dependent on each other. Remove the non correlated variables.Then remove the unnecessary variables which are not required.
CancerData$id<-NULL
CancerData$smoothness_se<-NULL
CancerData$compactness_se<-NULL
CancerData$concavity_se<-NULL
CancerData$`concave points_se`<-NULL
CancerData$symmetry_se <-NULL
CancerData$radius_se<-NULL
CancerData$texture_se<-NULL
CancerData$perimeter_se<-NULL
CancerData$area_se<-NULL
CancerData$fractal_dimension_se<-NULL
CancerData$symmetry_mean<-NULL
CancerData$fractal_dimension_mean<-NULL
CancerData$symmetry_worst<-NULL
CancerData$fractal_dimension_worst<-NULL
CancerData$smoothness_worst<-NULL
CancerData$smoothness_mean<-NULL
CancerData$texture_worst<-NULL
CancerData$compactness_mean<-NULL
CancerData$concavity_mean<-NULL
CancerData$`concave points_mean`<-NULL
CancerData$texture_mean<-NULL
CancerData$compactness_worst<-NULL
CancerData$concavity_worst<-NULL
CancerData$`concave points_worst`<-NULL
The variables which are set to NULL means they are not required further.
Step 7: Convert Malignant(M) in diagnosis column to 1 and Benign(B) in diagnosis column to 0.
CancerData$diagnosis[CancerData$diagnosis=="M"]<-'1'
CancerData$diagnosis[CancerData$diagnosis=="B"]<-'0'
CancerData$diagnosis<-as.numeric(CancerData$diagnosis)
Step 8: Perform Logistic Regression.
library(caTools)
set.seed(101)
split<-sample.split(CancerData,SplitRatio = 0.75)
split
## [1] TRUE TRUE FALSE FALSE TRUE TRUE TRUE
In the above chunk of codes the dataset is split into two parts in the ratio 75% : 25%
Whichever values are TRUE in the split, make it as a training dataset. and whichever values are FALSE make it as a testing dataset.
training<-subset(CancerData,split=="TRUE")
testing<-subset(CancerData,split=="FALSE")
Now create the model using glm function keeping Churn as the target variable in the training dataset.
model<-glm(diagnosis~.,training,family ="binomial")
summary(model)
##
## Call:
## glm(formula = diagnosis ~ ., family = "binomial", data = training)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.18733 -0.14008 -0.05353 0.00059 2.94649
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 11.89728 10.49377 1.134 0.25690
## radius_mean -11.29113 3.59800 -3.138 0.00170 **
## perimeter_mean 1.01916 0.32132 3.172 0.00152 **
## area_mean 0.02613 0.03049 0.857 0.39157
## radius_worst 0.33808 2.33011 0.145 0.88464
## perimeter_worst 0.16468 0.12802 1.286 0.19832
## area_worst 0.01757 0.02342 0.750 0.45305
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 535.758 on 406 degrees of freedom
## Residual deviance: 84.154 on 400 degrees of freedom
## AIC: 98.154
##
## Number of Fisher Scoring iterations: 9
optimize the model by removing independent variables. After removing the independent variables from the model make sure the residual deviance should not increase and AIC value should decrease.
model<-glm(diagnosis~.-radius_worst,training,family ="binomial")
summary(model)
##
## Call:
## glm(formula = diagnosis ~ . - radius_worst, family = "binomial",
## data = training)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.16799 -0.14255 -0.05439 0.00055 2.96254
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 12.163442 10.327984 1.178 0.23891
## radius_mean -10.940445 2.648767 -4.130 3.62e-05 ***
## perimeter_mean 1.009230 0.312801 3.226 0.00125 **
## area_mean 0.022844 0.020381 1.121 0.26234
## perimeter_worst 0.171246 0.120103 1.426 0.15392
## area_worst 0.020665 0.009712 2.128 0.03336 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 535.758 on 406 degrees of freedom
## Residual deviance: 84.175 on 401 degrees of freedom
## AIC: 96.175
##
## Number of Fisher Scoring iterations: 9
Initially the AIC value was 98.154 but after removing the radius_worst from the model AIC value became 96.175
Step 9: Now predict the value for testing dataset and predict the accuracy of the model.(type=reponse means we want the probability of the testing dataset.)
res<-predict(model,testing,type = "response")
res
## 1 2 3 4 5
## 1.000000e+00 9.526745e-01 9.471076e-01 9.766886e-01 9.994231e-01
## 6 7 8 9 10
## 9.999988e-01 1.000000e+00 1.000000e+00 1.000000e+00 9.999611e-01
## 11 12 13 14 15
## 9.069354e-05 1.756094e-04 4.503439e-01 9.999999e-01 2.565323e-03
## 16 17 18 19 20
## 1.071703e-03 1.065442e-03 1.999505e-03 9.410985e-01 1.914518e-03
## 21 22 23 24 25
## 1.000000e+00 8.130160e-01 1.192414e-02 1.606432e-02 2.596316e-01
## 26 27 28 29 30
## 9.999899e-01 2.318959e-02 9.991549e-01 9.409155e-01 1.365559e-02
## 31 32 33 34 35
## 1.152920e-03 1.000000e+00 8.545624e-03 1.692638e-02 9.999941e-01
## 36 37 38 39 40
## 1.000000e+00 3.783901e-02 9.999972e-01 1.423396e-02 8.076163e-03
## 41 42 43 44 45
## 7.359270e-03 5.675181e-02 1.452678e-02 3.612446e-03 9.998635e-01
## 46 47 48 49 50
## 3.455325e-01 5.189878e-03 1.000000e+00 2.019712e-03 9.949405e-01
## 51 52 53 54 55
## 9.450403e-01 4.533332e-04 6.584084e-01 1.021147e-02 2.586460e-03
## 56 57 58 59 60
## 8.543183e-05 1.000000e+00 9.972185e-01 9.679912e-01 6.778920e-04
## 61 62 63 64 65
## 1.000000e+00 6.249878e-01 1.000000e+00 2.149825e-02 1.445052e-03
## 66 67 68 69 70
## 1.545478e-01 1.000000e+00 7.828286e-04 4.081557e-03 3.677118e-04
## 71 72 73 74 75
## 4.411889e-01 6.581162e-03 1.000000e+00 7.409356e-01 9.432439e-01
## 76 77 78 79 80
## 9.999721e-01 2.098160e-03 4.472053e-03 4.549159e-04 5.177999e-04
## 81 82 83 84 85
## 9.999993e-01 9.967413e-01 6.903889e-04 3.944481e-01 1.833639e-04
## 86 87 88 89 90
## 7.084269e-03 4.937584e-04 4.259237e-03 5.997133e-04 2.148278e-02
## 91 92 93 94 95
## 9.999816e-01 6.538703e-02 1.134285e-02 2.022639e-03 6.270656e-02
## 96 97 98 99 100
## 9.535633e-04 4.398883e-03 1.000000e+00 1.644318e-03 3.040722e-03
## 101 102 103 104 105
## 1.000000e+00 9.926756e-01 9.354318e-02 1.268432e-03 9.999999e-01
## 106 107 108 109 110
## 5.039383e-02 1.000000e+00 7.444032e-03 1.641544e-02 2.759074e-03
## 111 112 113 114 115
## 1.203750e-02 5.649281e-03 1.051701e-02 1.486491e-03 1.462480e-02
## 116 117 118 119 120
## 3.469472e-02 9.997780e-01 1.161365e-02 3.924309e-03 5.544760e-03
## 121 122 123 124 125
## 4.016923e-03 1.366339e-01 1.539029e-03 9.985333e-01 1.302597e-02
## 126 127 128 129 130
## 1.728119e-02 1.496133e-04 9.964365e-01 4.030264e-03 9.992768e-01
## 131 132 133 134 135
## 1.525470e-03 1.975726e-03 1.637397e-02 9.035011e-01 6.344683e-03
## 136 137 138 139 140
## 4.695678e-01 2.710624e-03 9.846137e-01 3.118842e-01 8.421589e-02
## 141 142 143 144 145
## 9.999913e-01 2.239503e-04 1.000000e+00 3.781347e-01 3.179473e-03
## 146 147 148 149 150
## 8.804819e-04 3.219131e-01 7.635766e-01 8.502197e-03 1.000000e+00
## 151 152 153 154 155
## 2.840503e-03 2.619077e-03 1.594852e-03 1.000000e+00 8.450897e-01
## 156 157 158 159 160
## 5.831363e-02 1.881867e-03 2.633348e-02 5.571214e-04 4.338274e-04
## 161 162
## 9.993441e-01 1.000000e+00
testing
## # A tibble: 162 x 7
## diagnosis radius_mean perimeter_mean area_mean radius_worst
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 19.7 130 1203 23.6
## 2 1 11.4 77.6 386. 14.9
## 3 1 12.5 84.0 476. 15.1
## 4 1 16.0 103. 798. 19.2
## 5 1 14.7 94.7 684. 19.1
## 6 1 16.1 108. 799. 21.0
## 7 1 21.2 137. 1404 29.2
## 8 1 16.6 110 905. 26.5
## 9 1 18.6 125. 1088 23.2
## 10 1 11.8 77.9 441. 16.8
## # ... with 152 more rows, and 2 more variables: perimeter_worst <dbl>,
## # area_worst <dbl>
step 10: Now find the threshold using ROC curve.
library(ROCR)
res<-predict(model,training,type = "response")
ROCRpred=prediction(res,training$diagnosis)
ROCRperf<-performance(ROCRpred,"tpr","fpr")
plot(ROCRperf,colorize=TRUE,print.cutoffs.at=seq(0.1,by=0.1))
Now the graph says threshold value is 0.2
Step 11: Now Create the confusion matrix.
res<-predict(model,testing,type = "response")
table(Actualvalue=testing$diagnosis,Predictedvalue=res>0.2)
## Predictedvalue
## Actualvalue FALSE TRUE
## 0 91 9
## 1 3 59
Now calculate the Accuracy of the model using Confusion matrix.
(91+59)/(91+59+3+9)
## [1] 0.9259259
Now our model is 92.59% accurate.
Step 12 : Now calculate the odds ratio.
exp(confint(model))
## Waiting for profiling to be done...
## 2.5 % 97.5 %
## (Intercept) 5.552209e-05 4.632513e+13
## radius_mean 5.813501e-08 2.306326e-03
## perimeter_mean 1.564613e+00 5.458182e+00
## area_mean 9.827457e-01 1.066776e+00
## perimeter_worst 9.419771e-01 1.515445e+00
## area_worst 1.002565e+00 1.041715e+00
The odds ratio says that with the increase in 1 unit of perimeter_mean the probability of increasing the cancer to be malignant is 56.46% in the patients next checkup.
This approximation is based on statistics.
“Statistics is approximate Science but not exact Science.”.