In this solution, I am going to build a knn (k-nearest neighbours) classifier in the R programming language, using the caret machine learning package. The most commonly used distance measure for knn is Euclidean distance, the ordinary straight-line distance between two points, which is why it is often called simply "distance". It is a sensible default when the data is dense or continuous, although no single proximity measure is best for every data set.
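To make the role of the distance measure concrete, here is a minimal sketch of a knn prediction in base R. The toy points, the labels, and the helper names euclidean() and knn_predict() are illustrative assumptions; they are not part of caret and are not used in the analysis that follows.

```r
# Euclidean distance between two numeric vectors of equal length
euclidean <- function(a, b) sqrt(sum((a - b)^2))

# Classify one query point by majority vote among its k nearest training points
knn_predict <- function(train_x, train_y, query, k = 3) {
  d <- apply(train_x, 1, euclidean, b = query)  # distance to every training row
  nearest <- order(d)[seq_len(k)]               # indices of the k closest rows
  votes <- table(train_y[nearest])              # count class labels among them
  names(votes)[which.max(votes)]                # most frequent label wins
}

# Toy data (hypothetical): two features, two classes
train_x <- matrix(c(1, 1,
                    1, 2,
                    2, 1,
                    6, 6,
                    7, 6,
                    6, 7), ncol = 2, byrow = TRUE)
train_y <- c("B", "B", "B", "M", "M", "M")

euclidean(c(0, 0), c(3, 4))                                  # 5
knn_predict(train_x, train_y, query = c(1.5, 1.5), k = 3)    # "B"
knn_predict(train_x, train_y, query = c(6.5, 6.5), k = 3)    # "M"
```

Because the vote depends only on distances, an attribute measured on a large scale can dominate the result, which is why the data is standardized later on.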
First, we need to load the libraries.
library('TSA')
## Warning: package 'TSA' was built under R version 3.4.3
## Loading required package: leaps
## Warning: package 'leaps' was built under R version 3.4.3
## Loading required package: locfit
## Warning: package 'locfit' was built under R version 3.4.3
## locfit 1.5-9.1 2013-03-22
## Loading required package: mgcv
## Warning: package 'mgcv' was built under R version 3.4.3
## Loading required package: nlme
## Warning: package 'nlme' was built under R version 3.4.3
## This is mgcv 1.8-23. For overview type 'help("mgcv-package")'.
## Loading required package: tseries
## Warning: package 'tseries' was built under R version 3.4.3
##
## Attaching package: 'TSA'
## The following objects are masked from 'package:stats':
##
## acf, arima
## The following object is masked from 'package:utils':
##
## tar
library('forecast')
## Warning: package 'forecast' was built under R version 3.4.3
##
## Attaching package: 'forecast'
## The following object is masked from 'package:nlme':
##
## getResponse
library('tseries')
library('ggplot2') # visualization
library('ggthemes') # visualization
## Warning: package 'ggthemes' was built under R version 3.4.3
library('scales') # visualization
library('dplyr') # data manipulation
##
## Attaching package: 'dplyr'
## The following object is masked from 'package:nlme':
##
## collapse
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library('mice') # imputation
## Warning: package 'mice' was built under R version 3.4.3
## Loading required package: lattice
library('randomForest') # classification algorithm
## Warning: package 'randomForest' was built under R version 3.4.3
## randomForest 4.6-12
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:dplyr':
##
## combine
## The following object is masked from 'package:ggplot2':
##
## margin
library('rpart') # for decision tree
## Warning: package 'rpart' was built under R version 3.4.3
library('ROCR')
## Warning: package 'ROCR' was built under R version 3.4.3
## Loading required package: gplots
##
## Attaching package: 'gplots'
## The following object is masked from 'package:stats':
##
## lowess
# library('ROCR')
# library('randomForest')
# library('corrr')
# library('corrplot')
# library('glue')
# library('caTools')
# library('data.table')
# require("knitr")
# require("geosphere")
# require("gmapsdistance")
require("tidyr")
## Loading required package: tidyr
## Warning: package 'tidyr' was built under R version 3.4.3
##
## Attaching package: 'tidyr'
## The following object is masked from 'package:mice':
##
## complete
library('corrplot')
## Warning: package 'corrplot' was built under R version 3.4.3
## corrplot 0.84 loaded
#source("distance.R")
library('car')
##
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
##
## recode
library('caret')
## Warning: package 'caret' was built under R version 3.4.3
library('gclus')
## Loading required package: cluster
library('MASS')
## Warning: package 'MASS' was built under R version 3.4.3
##
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
##
## select
library('ggcorrplot')
## Warning: package 'ggcorrplot' was built under R version 3.4.3
library('cluster')
library('caTools')
## Warning: package 'caTools' was built under R version 3.4.3
library('rpart')
library('rpart.plot')
## Warning: package 'rpart.plot' was built under R version 3.4.3
library('rattle')
## Warning: package 'rattle' was built under R version 3.4.3
## Rattle: A free graphical interface for data science with R.
## Version 5.1.0 Copyright (c) 2006-2017 Togaware Pty Ltd.
## Type 'rattle()' to shake, rattle, and roll your data.
##
## Attaching package: 'rattle'
## The following object is masked from 'package:randomForest':
##
## importance
library('RColorBrewer')
library('data.table')
## Warning: package 'data.table' was built under R version 3.4.3
##
## Attaching package: 'data.table'
## The following objects are masked from 'package:dplyr':
##
## between, first, last
library('ROCR')
library('purrr')
## Warning: package 'purrr' was built under R version 3.4.3
##
## Attaching package: 'purrr'
## The following object is masked from 'package:data.table':
##
## transpose
## The following object is masked from 'package:caret':
##
## lift
## The following object is masked from 'package:car':
##
## some
## The following object is masked from 'package:scales':
##
## discard
library('tidyr')
library('ggplot2')
library('dummies')
## dummies-1.5.6 provided by Decision Patterns
library('corrplot')
library('usdm')
## Warning: package 'usdm' was built under R version 3.4.3
## Loading required package: sp
## Warning: package 'sp' was built under R version 3.4.3
## Loading required package: raster
##
## Attaching package: 'raster'
## The following object is masked from 'package:data.table':
##
## shift
## The following objects are masked from 'package:MASS':
##
## area, select
## The following object is masked from 'package:tidyr':
##
## extract
## The following object is masked from 'package:dplyr':
##
## select
## The following object is masked from 'package:nlme':
##
## getData
##
## Attaching package: 'usdm'
## The following object is masked from 'package:car':
##
## vif
## The following object is masked from 'package:nlme':
##
## Variogram
library('e1071')
## Warning: package 'e1071' was built under R version 3.4.3
##
## Attaching package: 'e1071'
## The following object is masked from 'package:raster':
##
## interpolate
## The following objects are masked from 'package:TSA':
##
## kurtosis, skewness
library('ElemStatLearn')
## Warning: package 'ElemStatLearn' was built under R version 3.4.3
library('purrr')
library('tidyr')
library('ggplot2')
library('caret')
library('ROCR')
library('pROC')
## Warning: package 'pROC' was built under R version 3.4.3
## Type 'citation("pROC")' for a citation.
##
## Attaching package: 'pROC'
## The following objects are masked from 'package:stats':
##
## cov, smooth, var
The data is taken from https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+(Diagnostic) and https://www.kaggle.com/uciml/breast-cancer-wisconsin-data/data
The first thing we need to do is load the data set. To check whether the data contains missing values, we can use the anyNA() function; NA means Not Available. Preprocessing is about correcting problems in the data before building a machine learning model with it. Such problems include missing values, attributes with different ranges, and so on.
To get a summary of the data, we can use the summary() function. It gives us a basic idea of the range of each attribute in the data set.
medical_data <- read.csv('bstcancer_data.csv')
medical_data <- within(medical_data, rm('X', 'id')) ## remove the columns X and id, which are not required
str(medical_data)
## 'data.frame': 569 obs. of 31 variables:
## $ diagnosis : Factor w/ 2 levels "B","M": 2 2 2 2 2 2 2 2 2 2 ...
## $ radius_mean : num 18 20.6 19.7 11.4 20.3 ...
## $ texture_mean : num 10.4 17.8 21.2 20.4 14.3 ...
## $ perimeter_mean : num 122.8 132.9 130 77.6 135.1 ...
## $ area_mean : num 1001 1326 1203 386 1297 ...
## $ smoothness_mean : num 0.1184 0.0847 0.1096 0.1425 0.1003 ...
## $ compactness_mean : num 0.2776 0.0786 0.1599 0.2839 0.1328 ...
## $ concavity_mean : num 0.3001 0.0869 0.1974 0.2414 0.198 ...
## $ concave.points_mean : num 0.1471 0.0702 0.1279 0.1052 0.1043 ...
## $ symmetry_mean : num 0.242 0.181 0.207 0.26 0.181 ...
## $ fractal_dimension_mean : num 0.0787 0.0567 0.06 0.0974 0.0588 ...
## $ radius_se : num 1.095 0.543 0.746 0.496 0.757 ...
## $ texture_se : num 0.905 0.734 0.787 1.156 0.781 ...
## $ perimeter_se : num 8.59 3.4 4.58 3.44 5.44 ...
## $ area_se : num 153.4 74.1 94 27.2 94.4 ...
## $ smoothness_se : num 0.0064 0.00522 0.00615 0.00911 0.01149 ...
## $ compactness_se : num 0.049 0.0131 0.0401 0.0746 0.0246 ...
## $ concavity_se : num 0.0537 0.0186 0.0383 0.0566 0.0569 ...
## $ concave.points_se : num 0.0159 0.0134 0.0206 0.0187 0.0188 ...
## $ symmetry_se : num 0.03 0.0139 0.0225 0.0596 0.0176 ...
## $ fractal_dimension_se : num 0.00619 0.00353 0.00457 0.00921 0.00511 ...
## $ radius_worst : num 25.4 25 23.6 14.9 22.5 ...
## $ texture_worst : num 17.3 23.4 25.5 26.5 16.7 ...
## $ perimeter_worst : num 184.6 158.8 152.5 98.9 152.2 ...
## $ area_worst : num 2019 1956 1709 568 1575 ...
## $ smoothness_worst : num 0.162 0.124 0.144 0.21 0.137 ...
## $ compactness_worst : num 0.666 0.187 0.424 0.866 0.205 ...
## $ concavity_worst : num 0.712 0.242 0.45 0.687 0.4 ...
## $ concave.points_worst : num 0.265 0.186 0.243 0.258 0.163 ...
## $ symmetry_worst : num 0.46 0.275 0.361 0.664 0.236 ...
## $ fractal_dimension_worst: num 0.1189 0.089 0.0876 0.173 0.0768 ...
anyNA(medical_data)
## [1] FALSE
summary(medical_data)
## diagnosis radius_mean texture_mean perimeter_mean
## B:357 Min. : 6.981 Min. : 9.71 Min. : 43.79
## M:212 1st Qu.:11.700 1st Qu.:16.17 1st Qu.: 75.17
## Median :13.370 Median :18.84 Median : 86.24
## Mean :14.127 Mean :19.29 Mean : 91.97
## 3rd Qu.:15.780 3rd Qu.:21.80 3rd Qu.:104.10
## Max. :28.110 Max. :39.28 Max. :188.50
## area_mean smoothness_mean compactness_mean concavity_mean
## Min. : 143.5 Min. :0.05263 Min. :0.01938 Min. :0.00000
## 1st Qu.: 420.3 1st Qu.:0.08637 1st Qu.:0.06492 1st Qu.:0.02956
## Median : 551.1 Median :0.09587 Median :0.09263 Median :0.06154
## Mean : 654.9 Mean :0.09636 Mean :0.10434 Mean :0.08880
## 3rd Qu.: 782.7 3rd Qu.:0.10530 3rd Qu.:0.13040 3rd Qu.:0.13070
## Max. :2501.0 Max. :0.16340 Max. :0.34540 Max. :0.42680
## concave.points_mean symmetry_mean fractal_dimension_mean
## Min. :0.00000 Min. :0.1060 Min. :0.04996
## 1st Qu.:0.02031 1st Qu.:0.1619 1st Qu.:0.05770
## Median :0.03350 Median :0.1792 Median :0.06154
## Mean :0.04892 Mean :0.1812 Mean :0.06280
## 3rd Qu.:0.07400 3rd Qu.:0.1957 3rd Qu.:0.06612
## Max. :0.20120 Max. :0.3040 Max. :0.09744
## radius_se texture_se perimeter_se area_se
## Min. :0.1115 Min. :0.3602 Min. : 0.757 Min. : 6.802
## 1st Qu.:0.2324 1st Qu.:0.8339 1st Qu.: 1.606 1st Qu.: 17.850
## Median :0.3242 Median :1.1080 Median : 2.287 Median : 24.530
## Mean :0.4052 Mean :1.2169 Mean : 2.866 Mean : 40.337
## 3rd Qu.:0.4789 3rd Qu.:1.4740 3rd Qu.: 3.357 3rd Qu.: 45.190
## Max. :2.8730 Max. :4.8850 Max. :21.980 Max. :542.200
## smoothness_se compactness_se concavity_se
## Min. :0.001713 Min. :0.002252 Min. :0.00000
## 1st Qu.:0.005169 1st Qu.:0.013080 1st Qu.:0.01509
## Median :0.006380 Median :0.020450 Median :0.02589
## Mean :0.007041 Mean :0.025478 Mean :0.03189
## 3rd Qu.:0.008146 3rd Qu.:0.032450 3rd Qu.:0.04205
## Max. :0.031130 Max. :0.135400 Max. :0.39600
## concave.points_se symmetry_se fractal_dimension_se
## Min. :0.000000 Min. :0.007882 Min. :0.0008948
## 1st Qu.:0.007638 1st Qu.:0.015160 1st Qu.:0.0022480
## Median :0.010930 Median :0.018730 Median :0.0031870
## Mean :0.011796 Mean :0.020542 Mean :0.0037949
## 3rd Qu.:0.014710 3rd Qu.:0.023480 3rd Qu.:0.0045580
## Max. :0.052790 Max. :0.078950 Max. :0.0298400
## radius_worst texture_worst perimeter_worst area_worst
## Min. : 7.93 Min. :12.02 Min. : 50.41 Min. : 185.2
## 1st Qu.:13.01 1st Qu.:21.08 1st Qu.: 84.11 1st Qu.: 515.3
## Median :14.97 Median :25.41 Median : 97.66 Median : 686.5
## Mean :16.27 Mean :25.68 Mean :107.26 Mean : 880.6
## 3rd Qu.:18.79 3rd Qu.:29.72 3rd Qu.:125.40 3rd Qu.:1084.0
## Max. :36.04 Max. :49.54 Max. :251.20 Max. :4254.0
## smoothness_worst compactness_worst concavity_worst concave.points_worst
## Min. :0.07117 Min. :0.02729 Min. :0.0000 Min. :0.00000
## 1st Qu.:0.11660 1st Qu.:0.14720 1st Qu.:0.1145 1st Qu.:0.06493
## Median :0.13130 Median :0.21190 Median :0.2267 Median :0.09993
## Mean :0.13237 Mean :0.25427 Mean :0.2722 Mean :0.11461
## 3rd Qu.:0.14600 3rd Qu.:0.33910 3rd Qu.:0.3829 3rd Qu.:0.16140
## Max. :0.22260 Max. :1.05800 Max. :1.2520 Max. :0.29100
## symmetry_worst fractal_dimension_worst
## Min. :0.1565 Min. :0.05504
## 1st Qu.:0.2504 1st Qu.:0.07146
## Median :0.2822 Median :0.08004
## Mean :0.2901 Mean :0.08395
## 3rd Qu.:0.3179 3rd Qu.:0.09208
## Max. :0.6638 Max. :0.20750
A glimpse of the distribution of each numeric attribute
medical_data %>%
keep(is.numeric) %>% # Keep only numeric columns
gather() %>% # Convert to key-value pairs
ggplot(aes(value)) + # Plot the values
facet_wrap(~ key, scales = "free") + # In separate panels
geom_density() # as density
table(medical_data$diagnosis)
##
## B M
## 357 212
boxplot(radius_mean ~ diagnosis, data = medical_data)
Data slicing splits the data into a training set and a test set. The training set is used for model building, and the test set must be kept out of that process entirely. This applies to standardization as well: the centering and scaling parameters should be estimated from the training set only and then applied unchanged to the test set. The set.seed() function makes our work replicable, because the random split is the same every time the code is run with the same seed.
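The sketch below illustrates two ways of drawing such a split in base R and shows what the seed does. The exact-size variant is an alternative offered for comparison, not the approach used in this post, and the object names are hypothetical.

```r
set.seed(341)                      # fix the random number stream
n <- 569                           # number of rows in the data set used here

# Approach used in this post: each row is assigned independently,
# so the split is only approximately 70/30
grp <- sample(2, n, replace = TRUE, prob = c(0.7, 0.3))
table(grp)

# Alternative (an assumption, not the author's method):
# draw exactly 70% of the row indices for training
train_idx <- sample(seq_len(n), size = floor(0.7 * n))
test_idx  <- setdiff(seq_len(n), train_idx)
length(train_idx); length(test_idx)

# Re-setting the same seed reproduces the same draw
set.seed(341); a <- sample(seq_len(n), 5)
set.seed(341); b <- sample(seq_len(n), 5)
identical(a, b)
```

The actual row assignments depend on the R version as well as the seed, since the default sampling algorithm has changed between releases.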
set.seed(341)
mypd<-sample(2,nrow(medical_data),replace=TRUE, prob=c(0.7,0.3))
medical_data_train<-medical_data[mypd==1,]
medical_data_val<-medical_data[mypd==2,]
dim(medical_data_train);dim(medical_data_val)
## [1] 397 31
## [1] 172 31
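The summary statistics above show that the attributes have very different ranges, so we need to standardize the data. We can do this with caret’s preProcess() function or, as here, through the preProcess argument of train(). Note: the standardization parameters are estimated from the training data only; when predict() is later called on the fitted model, caret applies the same transformation to the new data.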
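As a rough illustration of what standardizing with training-set parameters means, here is a base R sketch. It uses the built-in iris data as a stand-in for the cancer data, and the variable names are hypothetical; caret performs the equivalent steps internally when preProcess is passed to train().

```r
# Toy split of a built-in data set (iris is a stand-in, not the cancer data)
set.seed(1)
idx   <- sample(seq_len(nrow(iris)), size = 105)
train <- iris[idx, 1:4]
test  <- iris[-idx, 1:4]

# Estimate centering and scaling parameters on the training rows only
mu    <- colMeans(train)
sigma <- apply(train, 2, sd)

# Apply the same parameters to both sets
train_std <- scale(train, center = mu, scale = sigma)
test_std  <- scale(test,  center = mu, scale = sigma)

round(colMeans(train_std), 10)      # zero by construction
round(apply(train_std, 2, sd), 10)  # one by construction
colMeans(test_std)                  # close to, but generally not exactly, zero
```

The test set means are not exactly zero because the test rows never influenced mu and sigma, which is the point of the exercise.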
The caret package provides the train() function for fitting many different algorithms. We need to pass different parameter values for different algorithms. Before calling train(), we first use trainControl(), which controls the computational details of train().
Three parameters of trainControl() are set. The “method” parameter specifies the resampling method. It can take values such as “boot”, “boot632”, “cv”, “repeatedcv”, “LOOCV”, “LGOCV”, etc. For this tutorial, let’s use repeatedcv, i.e. repeated cross-validation. The “number” parameter holds the number of folds (resampling iterations). The “repeats” parameter is the number of complete sets of folds to compute for repeated cross-validation. We are setting number = 10 and repeats = 3. trainControl() returns a list, which is passed to train().
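To picture what number = 10 and repeats = 3 amount to, the following base R sketch assigns rows to folds by hand. It is a simplified illustration under stated assumptions: caret builds its own folds (for example, balancing the classes within them), and the names used here are hypothetical.

```r
set.seed(341)
n       <- 397   # training rows in this post
k       <- 10    # corresponds to number = 10
repeats <- 3     # corresponds to repeats = 3

# For each repeat, shuffle a balanced vector of fold labels 1..k
fold_ids <- replicate(repeats, sample(rep(seq_len(k), length.out = n)))
dim(fold_ids)                  # n rows, one column per repeat

# Each (fold, repeat) pair defines one held-out set; the rest is used for fitting
held_out <- which(fold_ids[, 1] == 1)
length(held_out)               # about n / k rows
n - length(held_out)           # rows used for fitting in that resample
k * repeats                    # total number of resamples
```

Each candidate value of k is therefore fitted and scored 30 times, and the reported accuracy is the average over those resamples.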
For a knn classifier, train() should be called with the “method” parameter set to “knn”. The target variable, diagnosis, is our response variable, and the formula diagnosis ~ . uses all other attributes as predictors. The “trControl” parameter takes the result of trainControl(). The “preProcess” parameter specifies the preprocessing to apply to the training data.
Because the attributes have different ranges, preprocessing is a mandatory task. We are passing two values in the “preProcess” parameter, “center” and “scale”, which center and scale the data. After preprocessing, each training attribute has a mean of approximately 0 and a standard deviation of 1. The “tuneLength” parameter holds an integer: the number of candidate values of the tuning parameter k to evaluate.
sum(is.na(medical_data_train))
## [1] 0
medctrl <- trainControl(method="repeatedcv",number =10 ,repeats = 3) #,classProbs=TRUE,summaryFunction = twoClassSummary)
medknnFit <- train(diagnosis ~ ., data = medical_data_train, method = "knn", trControl = medctrl, preProcess = c("center","scale"), tuneLength = 20, na.action="na.omit")
print(medknnFit)
## k-Nearest Neighbors
##
## 397 samples
## 30 predictor
## 2 classes: 'B', 'M'
##
## Pre-processing: centered (30), scaled (30)
## Resampling: Cross-Validated (10 fold, repeated 3 times)
## Summary of sample sizes: 357, 357, 357, 357, 358, 357, ...
## Resampling results across tuning parameters:
##
## k Accuracy Kappa
## 5 0.9681624 0.9324949
## 7 0.9689957 0.9342910
## 9 0.9698291 0.9360454
## 11 0.9664957 0.9288774
## 13 0.9631410 0.9216257
## 15 0.9605983 0.9161315
## 17 0.9614530 0.9178006
## 19 0.9597863 0.9142910
## 21 0.9589316 0.9125083
## 23 0.9589316 0.9125083
## 25 0.9597863 0.9143836
## 27 0.9606410 0.9162589
## 29 0.9606410 0.9162589
## 31 0.9606197 0.9161765
## 33 0.9589103 0.9124645
## 35 0.9563889 0.9070514
## 37 0.9555556 0.9051814
## 39 0.9555342 0.9051171
## 41 0.9530128 0.8996838
## 43 0.9521795 0.8978137
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was k = 9.
summary(medknnFit)
## Length Class Mode
## learn 2 -none- list
## k 1 -none- numeric
## theDots 0 -none- list
## xNames 30 -none- character
## problemType 1 -none- character
## tuneValue 1 data.frame list
## obsLevels 2 -none- character
## param 0 -none- list
The output shows the accuracy and kappa metrics for different values of k. From these results, train() automatically selects the best k. Here, the model chooses k = 9 as its final value, the value with the highest cross-validated accuracy.
plot(medknnFit)
Now our model is trained with k = 9, and we are ready to predict classes for our test set with predict(). We pass two arguments: the first is our trained model, and the second, “newdata”, holds our test data frame. predict() returns a factor of predicted classes, which we save in a variable.
medical_pred <- predict(medknnFit, newdata = medical_data_val)
medical_pred
## [1] M M M M M M M B M M M M M B M B M B B M B B M B B B M B B M B M B B B
## [36] B M B B M B B B B M B B B B B M B M B M B M M B M M B M B B B B B B B
## [71] B B M B B M B M M M B B B B B B B B B B B B B B B B B M B B M B B M B
## [106] B B B B B B M B B B B B B B B B M M B B B B B B B B B B B B B M B M B
## [141] B B B B B B M B B B B B B B M M B M B B B B B M B B B B B B M M
## Levels: B M
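Using a confusion matrix, we can print statistics of our results. It shows that our model’s accuracy on the test set is 95.93%. Note that confusionMatrix() treats the rows of a table as predictions and the columns as the reference; because the table built here has the actual classes in its rows, the per-class statistics such as sensitivity and specificity are reported with the two roles swapped, while accuracy and kappa are unaffected.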
medical_data_val$diagnosis_pred <- medical_pred
medical_dataRev <- table(actualclass=medical_data_val$diagnosis, predictedclass=medical_data_val$diagnosis_pred)
medical_dataRevCF <- confusionMatrix(medical_dataRev)
print(medical_dataRevCF)
## Confusion Matrix and Statistics
##
## predictedclass
## actualclass B M
## B 116 1
## M 6 49
##
## Accuracy : 0.9593
## 95% CI : (0.9179, 0.9835)
## No Information Rate : 0.7093
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.9041
## Mcnemar's Test P-Value : 0.1306
##
## Sensitivity : 0.9508
## Specificity : 0.9800
## Pos Pred Value : 0.9915
## Neg Pred Value : 0.8909
## Prevalence : 0.7093
## Detection Rate : 0.6744
## Detection Prevalence : 0.6802
## Balanced Accuracy : 0.9654
##
## 'Positive' Class : B
##
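As a cross-check, the accuracy and kappa values in this output can be recomputed by hand from the four counts in the table. The sketch below copies those counts into a matrix; the object names are hypothetical.

```r
# Counts copied from the confusion matrix above (rows = actual, columns = predicted)
cm <- matrix(c(116,  1,
                 6, 49), nrow = 2, byrow = TRUE,
             dimnames = list(actual = c("B", "M"), predicted = c("B", "M")))

n  <- sum(cm)
po <- sum(diag(cm)) / n                       # observed agreement = accuracy
pe <- sum(rowSums(cm) * colSums(cm)) / n^2    # agreement expected by chance
kappa_hat <- (po - pe) / (1 - pe)             # Cohen's kappa

round(po, 4)          # 0.9593
round(kappa_hat, 4)   # 0.9041

# Share of actual malignant cases that were predicted malignant
round(cm["M", "M"] / sum(cm["M", ]), 4)       # 0.8909
```

Of the 55 malignant cases in the test set, 6 were predicted benign, which is the figure that matters most in a diagnostic setting.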
class(medical_data_val$diagnosis_pred)
## [1] "factor"
class(medical_data_val$diagnosis)
## [1] "factor"
medical_pred_prob <- predict(medknnFit, newdata = medical_data_val,type = 'prob')