In this solution, I build a kNN (k-nearest neighbors) classifier in the R programming language, using the caret machine learning package. kNN classifies an observation by the classes of its nearest neighbors, so it needs a distance measure. The most commonly used one is Euclidean distance, the ordinary straight-line distance between two points. It is a good default when the data is dense and continuous, provided the attributes are on comparable scales, although it is not the best proximity measure for every kind of data.
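As a small illustration of the distance measure, the sketch below computes the Euclidean distance between two points in base R. The helper name euclidean_distance and the points p and q are made up for this example and are not part of the analysis.

```r
# Euclidean distance between two numeric vectors of equal length.
euclidean_distance <- function(a, b) {
  stopifnot(length(a) == length(b))
  sqrt(sum((a - b)^2))
}

# Two hypothetical observations described by two features each.
p <- c(1, 2)
q <- c(4, 6)

euclidean_distance(p, q)   # 5, since sqrt(3^2 + 4^2) = 5

# Base R's dist() gives the same value for the rows of a matrix.
as.numeric(dist(rbind(p, q), method = "euclidean"))
```

Because every feature contributes its squared difference, a feature measured in large units can dominate the distance, which is why the attributes are standardized before training.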

Load Required Libraries

First, we need to load the libraries.

library('TSA')
## Warning: package 'TSA' was built under R version 3.4.3
## Loading required package: leaps
## Warning: package 'leaps' was built under R version 3.4.3
## Loading required package: locfit
## Warning: package 'locfit' was built under R version 3.4.3
## locfit 1.5-9.1    2013-03-22
## Loading required package: mgcv
## Warning: package 'mgcv' was built under R version 3.4.3
## Loading required package: nlme
## Warning: package 'nlme' was built under R version 3.4.3
## This is mgcv 1.8-23. For overview type 'help("mgcv-package")'.
## Loading required package: tseries
## Warning: package 'tseries' was built under R version 3.4.3
## 
## Attaching package: 'TSA'
## The following objects are masked from 'package:stats':
## 
##     acf, arima
## The following object is masked from 'package:utils':
## 
##     tar
library('forecast')
## Warning: package 'forecast' was built under R version 3.4.3
## 
## Attaching package: 'forecast'
## The following object is masked from 'package:nlme':
## 
##     getResponse
library('tseries')
library('ggplot2') # visualization
library('ggthemes') # visualization
## Warning: package 'ggthemes' was built under R version 3.4.3
library('scales') # visualization
library('dplyr') # data manipulation
## 
## Attaching package: 'dplyr'
## The following object is masked from 'package:nlme':
## 
##     collapse
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library('mice') # imputation
## Warning: package 'mice' was built under R version 3.4.3
## Loading required package: lattice
library('randomForest') # classification algorithm
## Warning: package 'randomForest' was built under R version 3.4.3
## randomForest 4.6-12
## Type rfNews() to see new features/changes/bug fixes.
## 
## Attaching package: 'randomForest'
## The following object is masked from 'package:dplyr':
## 
##     combine
## The following object is masked from 'package:ggplot2':
## 
##     margin
library('rpart') # for decision tree
## Warning: package 'rpart' was built under R version 3.4.3
library('ROCR')
## Warning: package 'ROCR' was built under R version 3.4.3
## Loading required package: gplots
## 
## Attaching package: 'gplots'
## The following object is masked from 'package:stats':
## 
##     lowess
# library('ROCR')
# library('randomForest')
# library('corrr')
# library('corrplot')
# library('glue')
# library('caTools')
# library('data.table')
# require("knitr")
# require("geosphere")
# require("gmapsdistance")
require("tidyr")
## Loading required package: tidyr
## Warning: package 'tidyr' was built under R version 3.4.3
## 
## Attaching package: 'tidyr'
## The following object is masked from 'package:mice':
## 
##     complete
library('corrplot')
## Warning: package 'corrplot' was built under R version 3.4.3
## corrplot 0.84 loaded
#source("distance.R")
library('car')
## 
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
## 
##     recode
library('caret')
## Warning: package 'caret' was built under R version 3.4.3
library('gclus')
## Loading required package: cluster
library('MASS')
## Warning: package 'MASS' was built under R version 3.4.3
## 
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
## 
##     select
library('ggcorrplot')
## Warning: package 'ggcorrplot' was built under R version 3.4.3
library('cluster')
library('caTools')
## Warning: package 'caTools' was built under R version 3.4.3
library('rpart')
library('rpart.plot')
## Warning: package 'rpart.plot' was built under R version 3.4.3
library('rattle')
## Warning: package 'rattle' was built under R version 3.4.3
## Rattle: A free graphical interface for data science with R.
## Version 5.1.0 Copyright (c) 2006-2017 Togaware Pty Ltd.
## Type 'rattle()' to shake, rattle, and roll your data.
## 
## Attaching package: 'rattle'
## The following object is masked from 'package:randomForest':
## 
##     importance
library('RColorBrewer')
library('data.table')
## Warning: package 'data.table' was built under R version 3.4.3
## 
## Attaching package: 'data.table'
## The following objects are masked from 'package:dplyr':
## 
##     between, first, last
library('ROCR')
library('purrr')
## Warning: package 'purrr' was built under R version 3.4.3
## 
## Attaching package: 'purrr'
## The following object is masked from 'package:data.table':
## 
##     transpose
## The following object is masked from 'package:caret':
## 
##     lift
## The following object is masked from 'package:car':
## 
##     some
## The following object is masked from 'package:scales':
## 
##     discard
library('tidyr')
library('ggplot2')
library('dummies')
## dummies-1.5.6 provided by Decision Patterns
library('corrplot')
library('usdm')
## Warning: package 'usdm' was built under R version 3.4.3
## Loading required package: sp
## Warning: package 'sp' was built under R version 3.4.3
## Loading required package: raster
## 
## Attaching package: 'raster'
## The following object is masked from 'package:data.table':
## 
##     shift
## The following objects are masked from 'package:MASS':
## 
##     area, select
## The following object is masked from 'package:tidyr':
## 
##     extract
## The following object is masked from 'package:dplyr':
## 
##     select
## The following object is masked from 'package:nlme':
## 
##     getData
## 
## Attaching package: 'usdm'
## The following object is masked from 'package:car':
## 
##     vif
## The following object is masked from 'package:nlme':
## 
##     Variogram
library('e1071')
## Warning: package 'e1071' was built under R version 3.4.3
## 
## Attaching package: 'e1071'
## The following object is masked from 'package:raster':
## 
##     interpolate
## The following objects are masked from 'package:TSA':
## 
##     kurtosis, skewness
library('ElemStatLearn')
## Warning: package 'ElemStatLearn' was built under R version 3.4.3
library('pROC')
## Warning: package 'pROC' was built under R version 3.4.3
## Type 'citation("pROC")' for a citation.
## 
## Attaching package: 'pROC'
## The following objects are masked from 'package:stats':
## 
##     cov, smooth, var

The data is taken from https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+(Diagnostic) and https://www.kaggle.com/uciml/breast-cancer-wisconsin-data/data

Loading and PreProcessing

Preprocessing is about correcting problems in the data before building a machine learning model with it. Problems come in many forms, such as missing values or attributes with very different ranges. The first thing we need to do is load the data set. To check whether the data contains missing values, we can use the anyNA() function; NA means Not Available.

To see summarized details of the data, we can use the summary() function. It gives us a basic idea of the range of each attribute.

medical_data <- read.csv('bstcancer_data.csv')
medical_data <- within(medical_data, rm('X','id'))  ## remove the columns X and id, which are not required
str(medical_data)
## 'data.frame':    569 obs. of  31 variables:
##  $ diagnosis              : Factor w/ 2 levels "B","M": 2 2 2 2 2 2 2 2 2 2 ...
##  $ radius_mean            : num  18 20.6 19.7 11.4 20.3 ...
##  $ texture_mean           : num  10.4 17.8 21.2 20.4 14.3 ...
##  $ perimeter_mean         : num  122.8 132.9 130 77.6 135.1 ...
##  $ area_mean              : num  1001 1326 1203 386 1297 ...
##  $ smoothness_mean        : num  0.1184 0.0847 0.1096 0.1425 0.1003 ...
##  $ compactness_mean       : num  0.2776 0.0786 0.1599 0.2839 0.1328 ...
##  $ concavity_mean         : num  0.3001 0.0869 0.1974 0.2414 0.198 ...
##  $ concave.points_mean    : num  0.1471 0.0702 0.1279 0.1052 0.1043 ...
##  $ symmetry_mean          : num  0.242 0.181 0.207 0.26 0.181 ...
##  $ fractal_dimension_mean : num  0.0787 0.0567 0.06 0.0974 0.0588 ...
##  $ radius_se              : num  1.095 0.543 0.746 0.496 0.757 ...
##  $ texture_se             : num  0.905 0.734 0.787 1.156 0.781 ...
##  $ perimeter_se           : num  8.59 3.4 4.58 3.44 5.44 ...
##  $ area_se                : num  153.4 74.1 94 27.2 94.4 ...
##  $ smoothness_se          : num  0.0064 0.00522 0.00615 0.00911 0.01149 ...
##  $ compactness_se         : num  0.049 0.0131 0.0401 0.0746 0.0246 ...
##  $ concavity_se           : num  0.0537 0.0186 0.0383 0.0566 0.0569 ...
##  $ concave.points_se      : num  0.0159 0.0134 0.0206 0.0187 0.0188 ...
##  $ symmetry_se            : num  0.03 0.0139 0.0225 0.0596 0.0176 ...
##  $ fractal_dimension_se   : num  0.00619 0.00353 0.00457 0.00921 0.00511 ...
##  $ radius_worst           : num  25.4 25 23.6 14.9 22.5 ...
##  $ texture_worst          : num  17.3 23.4 25.5 26.5 16.7 ...
##  $ perimeter_worst        : num  184.6 158.8 152.5 98.9 152.2 ...
##  $ area_worst             : num  2019 1956 1709 568 1575 ...
##  $ smoothness_worst       : num  0.162 0.124 0.144 0.21 0.137 ...
##  $ compactness_worst      : num  0.666 0.187 0.424 0.866 0.205 ...
##  $ concavity_worst        : num  0.712 0.242 0.45 0.687 0.4 ...
##  $ concave.points_worst   : num  0.265 0.186 0.243 0.258 0.163 ...
##  $ symmetry_worst         : num  0.46 0.275 0.361 0.664 0.236 ...
##  $ fractal_dimension_worst: num  0.1189 0.089 0.0876 0.173 0.0768 ...
anyNA(medical_data)
## [1] FALSE
summary(medical_data)
##  diagnosis  radius_mean      texture_mean   perimeter_mean  
##  B:357     Min.   : 6.981   Min.   : 9.71   Min.   : 43.79  
##  M:212     1st Qu.:11.700   1st Qu.:16.17   1st Qu.: 75.17  
##            Median :13.370   Median :18.84   Median : 86.24  
##            Mean   :14.127   Mean   :19.29   Mean   : 91.97  
##            3rd Qu.:15.780   3rd Qu.:21.80   3rd Qu.:104.10  
##            Max.   :28.110   Max.   :39.28   Max.   :188.50  
##    area_mean      smoothness_mean   compactness_mean  concavity_mean   
##  Min.   : 143.5   Min.   :0.05263   Min.   :0.01938   Min.   :0.00000  
##  1st Qu.: 420.3   1st Qu.:0.08637   1st Qu.:0.06492   1st Qu.:0.02956  
##  Median : 551.1   Median :0.09587   Median :0.09263   Median :0.06154  
##  Mean   : 654.9   Mean   :0.09636   Mean   :0.10434   Mean   :0.08880  
##  3rd Qu.: 782.7   3rd Qu.:0.10530   3rd Qu.:0.13040   3rd Qu.:0.13070  
##  Max.   :2501.0   Max.   :0.16340   Max.   :0.34540   Max.   :0.42680  
##  concave.points_mean symmetry_mean    fractal_dimension_mean
##  Min.   :0.00000     Min.   :0.1060   Min.   :0.04996       
##  1st Qu.:0.02031     1st Qu.:0.1619   1st Qu.:0.05770       
##  Median :0.03350     Median :0.1792   Median :0.06154       
##  Mean   :0.04892     Mean   :0.1812   Mean   :0.06280       
##  3rd Qu.:0.07400     3rd Qu.:0.1957   3rd Qu.:0.06612       
##  Max.   :0.20120     Max.   :0.3040   Max.   :0.09744       
##    radius_se        texture_se      perimeter_se       area_se       
##  Min.   :0.1115   Min.   :0.3602   Min.   : 0.757   Min.   :  6.802  
##  1st Qu.:0.2324   1st Qu.:0.8339   1st Qu.: 1.606   1st Qu.: 17.850  
##  Median :0.3242   Median :1.1080   Median : 2.287   Median : 24.530  
##  Mean   :0.4052   Mean   :1.2169   Mean   : 2.866   Mean   : 40.337  
##  3rd Qu.:0.4789   3rd Qu.:1.4740   3rd Qu.: 3.357   3rd Qu.: 45.190  
##  Max.   :2.8730   Max.   :4.8850   Max.   :21.980   Max.   :542.200  
##  smoothness_se      compactness_se      concavity_se    
##  Min.   :0.001713   Min.   :0.002252   Min.   :0.00000  
##  1st Qu.:0.005169   1st Qu.:0.013080   1st Qu.:0.01509  
##  Median :0.006380   Median :0.020450   Median :0.02589  
##  Mean   :0.007041   Mean   :0.025478   Mean   :0.03189  
##  3rd Qu.:0.008146   3rd Qu.:0.032450   3rd Qu.:0.04205  
##  Max.   :0.031130   Max.   :0.135400   Max.   :0.39600  
##  concave.points_se   symmetry_se       fractal_dimension_se
##  Min.   :0.000000   Min.   :0.007882   Min.   :0.0008948   
##  1st Qu.:0.007638   1st Qu.:0.015160   1st Qu.:0.0022480   
##  Median :0.010930   Median :0.018730   Median :0.0031870   
##  Mean   :0.011796   Mean   :0.020542   Mean   :0.0037949   
##  3rd Qu.:0.014710   3rd Qu.:0.023480   3rd Qu.:0.0045580   
##  Max.   :0.052790   Max.   :0.078950   Max.   :0.0298400   
##   radius_worst   texture_worst   perimeter_worst    area_worst    
##  Min.   : 7.93   Min.   :12.02   Min.   : 50.41   Min.   : 185.2  
##  1st Qu.:13.01   1st Qu.:21.08   1st Qu.: 84.11   1st Qu.: 515.3  
##  Median :14.97   Median :25.41   Median : 97.66   Median : 686.5  
##  Mean   :16.27   Mean   :25.68   Mean   :107.26   Mean   : 880.6  
##  3rd Qu.:18.79   3rd Qu.:29.72   3rd Qu.:125.40   3rd Qu.:1084.0  
##  Max.   :36.04   Max.   :49.54   Max.   :251.20   Max.   :4254.0  
##  smoothness_worst  compactness_worst concavity_worst  concave.points_worst
##  Min.   :0.07117   Min.   :0.02729   Min.   :0.0000   Min.   :0.00000     
##  1st Qu.:0.11660   1st Qu.:0.14720   1st Qu.:0.1145   1st Qu.:0.06493     
##  Median :0.13130   Median :0.21190   Median :0.2267   Median :0.09993     
##  Mean   :0.13237   Mean   :0.25427   Mean   :0.2722   Mean   :0.11461     
##  3rd Qu.:0.14600   3rd Qu.:0.33910   3rd Qu.:0.3829   3rd Qu.:0.16140     
##  Max.   :0.22260   Max.   :1.05800   Max.   :1.2520   Max.   :0.29100     
##  symmetry_worst   fractal_dimension_worst
##  Min.   :0.1565   Min.   :0.05504        
##  1st Qu.:0.2504   1st Qu.:0.07146        
##  Median :0.2822   Median :0.08004        
##  Mean   :0.2901   Mean   :0.08395        
##  3rd Qu.:0.3179   3rd Qu.:0.09208        
##  Max.   :0.6638   Max.   :0.20750
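
The data set here has no missing values, so anyNA() returns FALSE. For contrast, here is a minimal sketch of what the check looks like when values are missing. The data frame toy is hypothetical and only serves the illustration.

```r
# Hypothetical toy data frame with two missing values (not the cancer data).
toy <- data.frame(
  radius  = c(10.2, NA, 14.8, 12.1),
  texture = c(17.5, 20.1, NA, 18.9)
)

anyNA(toy)                 # TRUE: at least one value is missing
colSums(is.na(toy))        # number of missing values per column
sum(complete.cases(toy))   # number of rows with no missing value
```

colSums(is.na(...)) is often more useful than anyNA() alone, because it shows which attributes would need imputation or removal.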

Review the Data

A glimpse of the data

medical_data %>%
  keep(is.numeric) %>%                     # Keep only numeric columns
  gather() %>%                             # Convert to key-value pairs
  ggplot(aes(value)) +                     # Plot the values
    facet_wrap(~ key, scales = "free") +   # In separate panels
    geom_density()                         # as density

table(medical_data$diagnosis)
## 
##   B   M 
## 357 212
boxplot(radius_mean ~ diagnosis, data = medical_data)

Split the data into Training and Test Data for model training and testing

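Data slicing splits the data into a training set and a test set. The training set is used only for building the model, and the test set must not be involved in model building. This also holds for standardization: the centering and scaling parameters should not be computed from the test set. Here, each row is assigned to the training set with probability 0.7 and to the test set with probability 0.3, so the split is approximately, not exactly, 70/30. The set.seed() function makes the work replicable: with the same seed value, the same random split is produced each time.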

set.seed(341)
mypd<-sample(2,nrow(medical_data),replace=TRUE, prob=c(0.7,0.3))

medical_data_train<-medical_data[mypd==1,]
medical_data_val<-medical_data[mypd==2,]
dim(medical_data_train);dim(medical_data_val)
## [1] 397  31
## [1] 172  31
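
The following sketch illustrates two properties of this kind of split: resetting the seed reproduces the assignment, and the proportions are only approximately 70/30 because each row is drawn independently. The variable names split_a, split_b, and train_idx are introduced for this example, and the exact-size alternative at the end is an assumption about what one might prefer, not what the analysis above does.

```r
n <- 569  # number of rows in the data set, as reported by str()

# Same seed, same assignment: the split is reproducible.
set.seed(341)
split_a <- sample(2, n, replace = TRUE, prob = c(0.7, 0.3))
set.seed(341)
split_b <- sample(2, n, replace = TRUE, prob = c(0.7, 0.3))
identical(split_a, split_b)   # TRUE

# Each row is assigned independently, so the shares are only close to 70/30.
prop.table(table(split_a))

# Hypothetical alternative: an exact 70% training set drawn without replacement.
set.seed(341)
train_idx <- sample(n, size = round(0.7 * n))
length(train_idx)             # 398
```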

Model Development on Training Data

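The summary statistics above show that the attributes have very different ranges, so we need to standardize the data. caret can do this through the preProcess argument of train() (or the standalone preProcess() function). Note: the centering and scaling parameters are estimated from the training data only; caret stores them in the fitted model and applies the same transformation to new data at prediction time.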
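
To make the idea concrete, here is a base R sketch of standardization in which the mean and standard deviation come from the training rows only and are then reused for the test rows. The matrices train_x and test_x are hypothetical toy values, and this is an illustration of the principle rather than caret's internal code.

```r
# Hypothetical toy feature matrices (not the cancer data).
train_x <- matrix(c(10, 12, 14, 16,
                    200, 400, 600, 800),
                  ncol = 2, dimnames = list(NULL, c("radius", "area")))
test_x  <- matrix(c(13, 15,
                    500, 700),
                  ncol = 2, dimnames = list(NULL, c("radius", "area")))

# Estimate centering and scaling parameters from the training set only.
train_mean <- colMeans(train_x)
train_sd   <- apply(train_x, 2, sd)

# Apply the same parameters to both sets.
train_scaled <- scale(train_x, center = train_mean, scale = train_sd)
test_scaled  <- scale(test_x,  center = train_mean, scale = train_sd)

round(colMeans(train_scaled), 10)   # 0 for each column
apply(train_scaled, 2, sd)          # 1 for each column
test_scaled                         # on the training scale; not forced to mean 0, sd 1
```

Reusing the training parameters keeps the test set on the same scale as the data the model saw, without letting test information leak into model building.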

Training the Knn model

The caret package provides the train() function for training models with various algorithms; different algorithms take different parameter values. Before calling train(), we first use trainControl(), which controls the computational details of train(), such as the resampling scheme.

Three parameters of trainControl() are set. The “method” parameter specifies the resampling method. It accepts many values, such as “boot”, “boot632”, “cv”, “repeatedcv”, “LOOCV”, and “LGOCV”. For this tutorial, let’s use repeatedcv, i.e., repeated cross-validation. The “number” parameter holds the number of folds (or resampling iterations). The “repeats” parameter holds the number of complete sets of folds to compute for repeated cross-validation. We are setting number = 10 and repeats = 3. trainControl() returns a list, which is passed to train().
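
As a rough picture of what number = 10 and repeats = 3 mean, the base R sketch below assigns each training row to one of 10 folds, three separate times. It is a simplified illustration: the name fold_id is made up here, and the folds are drawn purely at random, whereas caret additionally tries to balance the classes within folds.

```r
n_obs   <- 397   # training rows, as reported by dim() above
number  <- 10    # folds per repeat
repeats <- 3     # complete sets of folds

set.seed(341)
# One column per repeat; each entry is the fold (1..10) in which a row is held out.
fold_id <- sapply(seq_len(repeats), function(r) {
  sample(rep(seq_len(number), length.out = n_obs))
})

dim(fold_id)          # 397 rows, 3 repeats
number * repeats      # 30 models are fitted for each candidate k
table(fold_id[, 1])   # held-out fold sizes in the first repeat: 39 or 40 rows
```

With 39 or 40 rows held out each time, every model is fitted on 357 or 358 rows, which matches the sample sizes in the training output.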

To train a kNN classifier, train() should be called with the “method” parameter set to “knn”. The formula diagnosis ~ . sets diagnosis as the target (response) variable and uses all other attributes as predictors. The “trControl” parameter takes the result of our trainControl() call. The “preProcess” parameter specifies how to preprocess the training data.

For our data, preprocessing is a mandatory task. We pass two values to the “preProcess” parameter, “center” and “scale”, which center and scale the data. After preprocessing, each attribute of the training data has a mean of approximately 0 and a standard deviation of 1. The “tuneLength” parameter holds an integer: the number of candidate values to try for the tuning parameter. With tuneLength = 20, caret evaluates 20 values of k (5, 7, ..., 43 in the output below).
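
The 20 candidate values of k that appear in the training output can be written down directly. The sketch below reconstructs that grid in base R; the name k_grid is introduced for this example, and the commented line shows the tuneGrid argument of train() as one possible way to supply such a grid explicitly instead of using tuneLength.

```r
# Candidate k values matching the 20 rows of the resampling results:
# odd numbers starting at 5.
k_grid <- seq(from = 5, by = 2, length.out = 20)
k_grid
range(k_grid)           # 5 and 43

# Odd k avoids tied votes in a two-class problem.
all(k_grid %% 2 == 1)   # TRUE

# Possible explicit alternative to tuneLength (not used in this analysis):
# train(..., tuneGrid = data.frame(k = k_grid))
```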

sum(is.na(medical_data_train))
## [1] 0
# Optional additions to trainControl(): classProbs = TRUE, summaryFunction = twoClassSummary
medctrl <- trainControl(method = "repeatedcv", number = 10, repeats = 3)
medknnFit <- train(diagnosis ~ ., data = medical_data_train, method = "knn",
                   trControl = medctrl, preProcess = c("center", "scale"),
                   tuneLength = 20, na.action = na.omit)

print(medknnFit)
## k-Nearest Neighbors 
## 
## 397 samples
##  30 predictor
##   2 classes: 'B', 'M' 
## 
## Pre-processing: centered (30), scaled (30) 
## Resampling: Cross-Validated (10 fold, repeated 3 times) 
## Summary of sample sizes: 357, 357, 357, 357, 358, 357, ... 
## Resampling results across tuning parameters:
## 
##   k   Accuracy   Kappa    
##    5  0.9681624  0.9324949
##    7  0.9689957  0.9342910
##    9  0.9698291  0.9360454
##   11  0.9664957  0.9288774
##   13  0.9631410  0.9216257
##   15  0.9605983  0.9161315
##   17  0.9614530  0.9178006
##   19  0.9597863  0.9142910
##   21  0.9589316  0.9125083
##   23  0.9589316  0.9125083
##   25  0.9597863  0.9143836
##   27  0.9606410  0.9162589
##   29  0.9606410  0.9162589
##   31  0.9606197  0.9161765
##   33  0.9589103  0.9124645
##   35  0.9563889  0.9070514
##   37  0.9555556  0.9051814
##   39  0.9555342  0.9051171
##   41  0.9530128  0.8996838
##   43  0.9521795  0.8978137
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was k = 9.
summary(medknnFit)
##             Length Class      Mode     
## learn        2     -none-     list     
## k            1     -none-     numeric  
## theDots      0     -none-     list     
## xNames      30     -none-     character
## problemType  1     -none-     character
## tuneValue    1     data.frame list     
## obsLevels    2     -none-     character
## param        0     -none-     list

Plotting the model result

The printed results show the Accuracy and Kappa metrics for each value of k, and train() automatically selects the best one. Here, the model chooses k = 9 as its final value, the k with the highest cross-validated accuracy. The plot shows accuracy against k.

plot(medknnFit)
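
The selection rule itself is simple: take the k with the largest cross-validated accuracy. The sketch below applies it to the first five rows of the printed results, copied by hand into a data frame named cv_results for this example.

```r
# Cross-validated accuracy copied from the first rows of the printed results.
cv_results <- data.frame(
  k        = c(5, 7, 9, 11, 13),
  Accuracy = c(0.9681624, 0.9689957, 0.9698291, 0.9664957, 0.9631410)
)

# Keep the row with the highest accuracy.
best <- cv_results[which.max(cv_results$Accuracy), ]
best$k   # 9
```

On the fitted object, the full table and the chosen value are available as medknnFit$results and medknnFit$bestTune.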

Model Testing on Test Data

Now our model is trained with k = 9, and we are ready to predict classes for the test set with predict(). We pass two arguments: the first is the trained model, and the second, “newdata”, holds the test data frame. For a classifier, predict() returns a factor of predicted classes, which we save in a variable.

medical_pred <- predict(medknnFit, newdata = medical_data_val)
medical_pred
##   [1] M M M M M M M B M M M M M B M B M B B M B B M B B B M B B M B M B B B
##  [36] B M B B M B B B B M B B B B B M B M B M B M M B M M B M B B B B B B B
##  [71] B B M B B M B M M M B B B B B B B B B B B B B B B B B M B B M B B M B
## [106] B B B B B B M B B B B B B B B B M M B B B B B B B B B B B B B M B M B
## [141] B B B B B B M B B B B B B B M M B M B B B B B M B B B B B B M M
## Levels: B M
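
To show what a kNN prediction amounts to, here is a minimal from-scratch sketch for a single query point: compute the Euclidean distance to every training row, take the k closest, and return the majority class. The function knn_predict_one and the toy data are hypothetical and are not how caret implements the model.

```r
# Minimal kNN classifier for a single query point (a sketch, not caret's code).
knn_predict_one <- function(train_x, train_y, query, k) {
  # Euclidean distance from the query to every training row.
  d <- sqrt(rowSums(sweep(train_x, 2, query)^2))
  # Labels of the k closest training rows.
  nearest <- train_y[order(d)[seq_len(k)]]
  # Majority vote; a tie resolves to the first factor level.
  names(which.max(table(nearest)))
}

# Hypothetical standardized toy data: two features, two classes.
train_x <- rbind(c(-1.0, -1.2), c(-0.8, -0.9), c(-1.1, -0.7),
                 c( 1.0,  1.1), c( 0.9,  1.3), c( 1.2,  0.8))
train_y <- factor(c("B", "B", "B", "M", "M", "M"))

knn_predict_one(train_x, train_y, query = c(-0.9, -1.0), k = 3)  # "B"
knn_predict_one(train_x, train_y, query = c( 1.1,  1.0), k = 3)  # "M"
```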

Testing the model Accuracy

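Using a confusion matrix, we can print statistics of our results. The model's accuracy on the test set is 95.93% (165 of 172 cases classified correctly). Note that when confusionMatrix() is given a table, it treats the rows as predictions and the columns as the reference. The table here has the actual classes in its rows, so the class-specific statistics that depend on orientation are swapped in the printed output (for example, Sensitivity with Pos Pred Value, and Specificity with Neg Pred Value); Accuracy and Kappa are unaffected.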

medical_data_val$diagnosis_pred <- medical_pred
medical_dataRev <- table(actualclass=medical_data_val$diagnosis, predictedclass=medical_data_val$diagnosis_pred)

medical_dataRevCF <- confusionMatrix(medical_dataRev)
print(medical_dataRevCF)
## Confusion Matrix and Statistics
## 
##            predictedclass
## actualclass   B   M
##           B 116   1
##           M   6  49
##                                           
##                Accuracy : 0.9593          
##                  95% CI : (0.9179, 0.9835)
##     No Information Rate : 0.7093          
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.9041          
##  Mcnemar's Test P-Value : 0.1306          
##                                           
##             Sensitivity : 0.9508          
##             Specificity : 0.9800          
##          Pos Pred Value : 0.9915          
##          Neg Pred Value : 0.8909          
##              Prevalence : 0.7093          
##          Detection Rate : 0.6744          
##    Detection Prevalence : 0.6802          
##       Balanced Accuracy : 0.9654          
##                                           
##        'Positive' Class : B               
## 
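
The main figures can be checked by hand from the four cell counts of the table. In the sketch below, the variable names spell out actual and predicted classes explicitly, so the results can be compared with the printed statistics without relying on the row and column orientation of the table. The names are introduced for this example only.

```r
# Cell counts copied from the printed confusion matrix.
actual_B_pred_B <- 116
actual_B_pred_M <- 1
actual_M_pred_B <- 6
actual_M_pred_M <- 49
total <- actual_B_pred_B + actual_B_pred_M + actual_M_pred_B + actual_M_pred_M

# Share of all cases classified correctly.
accuracy <- (actual_B_pred_B + actual_M_pred_M) / total
# Share of actual B cases predicted as B (recall for B).
recall_B <- actual_B_pred_B / (actual_B_pred_B + actual_B_pred_M)
# Share of predicted B cases that are actually B (precision for B).
precision_B <- actual_B_pred_B / (actual_B_pred_B + actual_M_pred_B)
# Share of actual M cases predicted as M (recall for M).
recall_M <- actual_M_pred_M / (actual_M_pred_M + actual_M_pred_B)

round(c(accuracy = accuracy, recall_B = recall_B,
        precision_B = precision_B, recall_M = recall_M), 4)
```

In a diagnostic setting, recall for the malignant class is usually the figure to watch, since the 6 malignant cases predicted as benign are the costly errors.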
class(medical_data_val$diagnosis_pred)
## [1] "factor"
class(medical_data_val$diagnosis)
## [1] "factor"
medical_pred_prob <- predict(medknnFit, newdata = medical_data_val,type = 'prob')
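
With type = 'prob', predict() returns a data frame with one column of probabilities per class. The sketch below shows one way such probabilities could be turned into class labels, first with the usual highest-probability rule and then with a lower cut-off for the malignant class. The probabilities and the 0.3 threshold are hypothetical values chosen for illustration, not results from the model.

```r
# Hypothetical class probabilities in the shape returned by type = "prob":
# one row per observation, one column per class.
prob <- data.frame(B = c(0.89, 0.44, 0.11, 0.67),
                   M = c(0.11, 0.56, 0.89, 0.33))

# Default rule: pick the class with the highest probability.
default_class <- factor(colnames(prob)[max.col(prob, ties.method = "first")],
                        levels = c("B", "M"))

# Hypothetical stricter rule: flag as malignant when P(M) >= 0.3.
cautious_class <- factor(ifelse(prob$M >= 0.3, "M", "B"), levels = c("B", "M"))

data.frame(prob, default_class, cautious_class)
```

Lowering the threshold trades more false alarms for fewer missed malignant cases; the right balance depends on the cost of each kind of error.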