Diagnosing breast cancer with the k-NN algorithm

Breast cancer diagnosis

Cancer screening makes it possible to treat the disease before it produces noticeable symptoms. In this tutorial we will work through a diagnostic task using the Wisconsin Breast Cancer Diagnostic dataset.

Cancer screening

Machine learning approach

Machine learning can help automate the diagnostic process, potentially improving accuracy by removing human subjectivity.

This would allow physicians to spend less time on diagnosis and more time on treatment. The data come from the Wisconsin Breast Cancer Diagnostic dataset, UCI Machine Learning Repository: http://archive.ics.uci.edu/ml

The dataset

The dataset contains features computed from digitized images of fine-needle aspirates (FNA) of breast masses.

It includes 569 examples of cancer biopsies, each with 32 variables.

Step 1. Loading the data

setwd("C:/Users/alfa2/Dropbox/projects_interview")
wbcd <- read.csv("wisc_bc_data.csv", stringsAsFactors = FALSE)

Step 2. Check the structure of the data

str(wbcd)
## 'data.frame':    569 obs. of  32 variables:
##  $ id                     : int  842302 842517 84300903 84348301 84358402 843786 844359 84458202 844981 84501001 ...
##  $ diagnosis              : chr  "M" "M" "M" "M" ...
##  $ radius_mean            : num  18 20.6 19.7 11.4 20.3 ...
##  $ texture_mean           : num  10.4 17.8 21.2 20.4 14.3 ...
##  $ perimeter_mean         : num  122.8 132.9 130 77.6 135.1 ...
##  $ area_mean              : num  1001 1326 1203 386 1297 ...
##  $ smoothness_mean        : num  0.1184 0.0847 0.1096 0.1425 0.1003 ...
##  $ compactness_mean       : num  0.2776 0.0786 0.1599 0.2839 0.1328 ...
##  $ concavity_mean         : num  0.3001 0.0869 0.1974 0.2414 0.198 ...
##  $ concave.points_mean    : num  0.1471 0.0702 0.1279 0.1052 0.1043 ...
##  $ symmetry_mean          : num  0.242 0.181 0.207 0.26 0.181 ...
##  $ fractal_dimension_mean : num  0.0787 0.0567 0.06 0.0974 0.0588 ...
##  $ radius_se              : num  1.095 0.543 0.746 0.496 0.757 ...
##  $ texture_se             : num  0.905 0.734 0.787 1.156 0.781 ...
##  $ perimeter_se           : num  8.59 3.4 4.58 3.44 5.44 ...
##  $ area_se                : num  153.4 74.1 94 27.2 94.4 ...
##  $ smoothness_se          : num  0.0064 0.00522 0.00615 0.00911 0.01149 ...
##  $ compactness_se         : num  0.049 0.0131 0.0401 0.0746 0.0246 ...
##  $ concavity_se           : num  0.0537 0.0186 0.0383 0.0566 0.0569 ...
##  $ concave.points_se      : num  0.0159 0.0134 0.0206 0.0187 0.0188 ...
##  $ symmetry_se            : num  0.03 0.0139 0.0225 0.0596 0.0176 ...
##  $ fractal_dimension_se   : num  0.00619 0.00353 0.00457 0.00921 0.00511 ...
##  $ radius_worst           : num  25.4 25 23.6 14.9 22.5 ...
##  $ texture_worst          : num  17.3 23.4 25.5 26.5 16.7 ...
##  $ perimeter_worst        : num  184.6 158.8 152.5 98.9 152.2 ...
##  $ area_worst             : num  2019 1956 1709 568 1575 ...
##  $ smoothness_worst       : num  0.162 0.124 0.144 0.21 0.137 ...
##  $ compactness_worst      : num  0.666 0.187 0.424 0.866 0.205 ...
##  $ concavity_worst        : num  0.712 0.242 0.45 0.687 0.4 ...
##  $ concave.points_worst   : num  0.265 0.186 0.243 0.258 0.163 ...
##  $ symmetry_worst         : num  0.46 0.275 0.361 0.664 0.236 ...
##  $ fractal_dimension_worst: num  0.1189 0.089 0.0876 0.173 0.0768 ...

Excluding variables

Always exclude ID variables from your analysis. A model that includes a unique identifier will overfit the training data and is unlikely to generalize well to new data.

wbcd <- wbcd[-1]  # drop the first column (id)

The outcome variable

The outcome variable is the quantity we attempt to predict.

In this dataset, the outcome variable is diagnosis.

It indicates whether each example belongs to the benign (B) or the malignant (M) class.

table(wbcd$diagnosis)
## 
##   B   M 
## 357 212

Recoding the outcome variable as a factor

Many machine learning algorithms require the target variable to be coded as a factor. We will do the conversion while also giving B and M more meaningful labels.

wbcd$diagnosis <- factor(wbcd$diagnosis,
                         levels = c("B", "M"),
                         labels = c("Benign", "Malignant"))

Table of proportions

round(prop.table(table(wbcd$diagnosis)) * 100, digits = 1)
## 
##    Benign Malignant 
##      62.7      37.3

Summary of the remaining features

All remaining features are numeric. Let's check the summary statistics of three of them.

Do you notice anything problematic?

summary(wbcd[c("radius_mean", "area_mean", "smoothness_mean")])
##   radius_mean       area_mean      smoothness_mean  
##  Min.   : 6.981   Min.   : 143.5   Min.   :0.05263  
##  1st Qu.:11.700   1st Qu.: 420.3   1st Qu.:0.08637  
##  Median :13.370   Median : 551.1   Median :0.09587  
##  Mean   :14.127   Mean   : 654.9   Mean   :0.09636  
##  3rd Qu.:15.780   3rd Qu.: 782.7   3rd Qu.:0.10530  
##  Max.   :28.110   Max.   :2501.0   Max.   :0.16340
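
The problem is the scale. To see it explicitly, we can compare the ranges side by side (a small check of our own):

# area_mean spans two orders of magnitude more than smoothness_mean, so an
# unscaled Euclidean distance would be dominated almost entirely by area_mean
sapply(wbcd[c("radius_mean", "area_mean", "smoothness_mean")], range)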

Normalizing numeric data

Remember that distance computations in k-NN are highly dependent on the scale of the features.

Min-max normalization

\[X_{new} = \frac{X-min(X)} {max(X)-min(X)}\]

Let’s create a normalize function

# Rescale a numeric vector to the [0, 1] range
normalize <- function(x) {
  return((x - min(x)) / (max(x) - min(x)))
}

Testing our normalize function

normalize(c(1, 2, 3, 4, 5))
## [1] 0.00 0.25 0.50 0.75 1.00
normalize(c(10, 20, 30, 40, 50))
## [1] 0.00 0.25 0.50 0.75 1.00
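
One caveat worth knowing: if a feature were constant, max(x) - min(x) would be zero and normalize() would return NaN. A defensive variant (our own sketch, not part of the original tutorial):

# Returns zeros for constant vectors instead of NaN
normalize_safe <- function(x) {
  rng <- max(x) - min(x)
  if (rng == 0) return(rep(0, length(x)))
  (x - min(x)) / rng
}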

Before normalizing our data, let's explore the lapply() function

The lapply() function takes a list (or a data frame, which is a list of columns) and applies a specified function to each element. Its cousin sapply() does the same but simplifies the result to a vector or matrix when possible.

prueba <- data.frame(x=c(.1,.2,.3,.4,.5,.6,.7,.8,.9,1.0),
                     y=c(1,2,3,4,5,6,7,8,9,10),
                     z=c(10,20,30,40,50,60,70,80,90,100))
mean(prueba$x)
## [1] 0.55
mean(prueba$y)
## [1] 5.5
mean(prueba$z)
## [1] 55
lapply(prueba, mean)
## $x
## [1] 0.55
## 
## $y
## [1] 5.5
## 
## $z
## [1] 55
sapply(prueba, mean)
##     x     y     z 
##  0.55  5.50 55.00
sapply(prueba, normalize)
##               x         y         z
##  [1,] 0.0000000 0.0000000 0.0000000
##  [2,] 0.1111111 0.1111111 0.1111111
##  [3,] 0.2222222 0.2222222 0.2222222
##  [4,] 0.3333333 0.3333333 0.3333333
##  [5,] 0.4444444 0.4444444 0.4444444
##  [6,] 0.5555556 0.5555556 0.5555556
##  [7,] 0.6666667 0.6666667 0.6666667
##  [8,] 0.7777778 0.7777778 0.7777778
##  [9,] 0.8888889 0.8888889 0.8888889
## [10,] 1.0000000 1.0000000 1.0000000
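
The difference matters for the next step: lapply() returns a list, while sapply() simplifies the result to a matrix, so in either case we need as.data.frame() to get a data frame back (a quick check of our own):

class(lapply(prueba, normalize))                 # "list"
class(sapply(prueba, normalize))                 # "matrix" "array"
class(as.data.frame(lapply(prueba, normalize)))  # "data.frame"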

Normalizing features

Apply our normalize function to our 30 numeric variables.

wbcd_n <- as.data.frame(lapply(wbcd[2:31], normalize))
summary(wbcd_n$area_mean)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.1174  0.1729  0.2169  0.2711  1.0000
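
As a sanity check (our own addition), every normalized column should now span exactly [0, 1]:

# Each of the 30 columns was rescaled independently
range(sapply(wbcd_n, min))  # should be 0 0
range(sapply(wbcd_n, max))  # should be 1 1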

Data preparation. Training and test datasets

Split the data frame into training and test sets:

wbcd_train <- wbcd_n[1:469, ]
wbcd_test <- wbcd_n[470:569, ]

dim(wbcd_train)
## [1] 469  30
dim(wbcd_test)
## [1] 100  30

Important note

It is important that each subset is representative of the full dataset. Taking the first 469 rows for training, as we did above, is only safe if the records are not stored in any meaningful order.
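
If the rows were sorted (for example, by diagnosis), a random split would be needed instead. A minimal sketch of our own (the seed value is arbitrary; this is an alternative, not part of the pipeline above):

# Randomly sample 469 row indices for training; the rest form the test set
set.seed(123)
train_idx <- sample(nrow(wbcd_n), 469)
wbcd_train <- wbcd_n[train_idx, ]
wbcd_test <- wbcd_n[-train_idx, ]
wbcd_train_labels <- wbcd[train_idx, 1]
wbcd_test_labels <- wbcd[-train_idx, 1]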

The target variable

When splitting the data into training and test sets, we omitted the target variable. We'll store it in two separate vectors:

wbcd_train_labels <- wbcd[1:469, 1]
wbcd_test_labels <- wbcd[470:569, 1]
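
A quick check (our own addition) that both label vectors keep roughly the 62.7/37.3 class balance seen earlier:

prop.table(table(wbcd_train_labels))
prop.table(table(wbcd_test_labels))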

Step 3. Training a model on the data

Lazy learners such as k-NN do not have a true training phase; they simply store the input data in a structured format.

We will use the k-NN implementation from the class package.

library(class)

Using the knn() function

We choose k = 21: an odd number close to the square root of the 469 training examples. An odd k avoids tied votes between the two classes.

wbcd_test_pred <- knn(train = wbcd_train,
                      test = wbcd_test,
                      cl = wbcd_train_labels,
                      k = 21)
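
knn() returns a factor with one predicted label per test row. Before the formal evaluation, we can inspect it directly (a small addition of ours):

length(wbcd_test_pred)  # 100, one prediction per test observation
table(wbcd_test_pred)   # how many cases were assigned to each class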

Step 4. Evaluating model performance

Next, we need to know how well k-NN predicted the classes of the observations in the test set.

To do this, we can use the CrossTable() function from the gmodels package. It produces a cross-tabulation of the agreement between two vectors; here, the true classes of the test set against the k-NN predictions. Setting prop.chisq = FALSE drops the chi-square contribution of each cell from the output.

library(gmodels)
CrossTable(x = wbcd_test_labels,
           y = wbcd_test_pred,
           prop.chisq = FALSE)
## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |           N / Row Total |
## |           N / Col Total |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  100 
## 
##  
##                  | wbcd_test_pred 
## wbcd_test_labels |    Benign | Malignant | Row Total | 
## -----------------|-----------|-----------|-----------|
##           Benign |        77 |         0 |        77 | 
##                  |     1.000 |     0.000 |     0.770 | 
##                  |     0.975 |     0.000 |           | 
##                  |     0.770 |     0.000 |           | 
## -----------------|-----------|-----------|-----------|
##        Malignant |         2 |        21 |        23 | 
##                  |     0.087 |     0.913 |     0.230 | 
##                  |     0.025 |     1.000 |           | 
##                  |     0.020 |     0.210 |           | 
## -----------------|-----------|-----------|-----------|
##     Column Total |        79 |        21 |       100 | 
##                  |     0.790 |     0.210 |           | 
## -----------------|-----------|-----------|-----------|
## 
## 

Confusion matrix

The cross-tabulation above is a confusion matrix: of 100 test observations, 98 were classified correctly. The 2 errors are false negatives, malignant masses labeled as benign, which is the most dangerous kind of mistake in this application.
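
The overall accuracy can also be computed in one line from the two vectors (a simple supplement to CrossTable()):

# Proportion of test cases where the prediction matches the true label
mean(wbcd_test_pred == wbcd_test_labels)  # 0.98 for the table above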

Step 5. Improving model performance

Transformation: z-score normalization

# scale() standardizes each column to mean 0 and standard deviation 1
wbcd_z <- as.data.frame(scale(wbcd[-1]))
summary(wbcd_z$area_mean)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -1.4532 -0.6666 -0.2949  0.0000  0.3632  5.2459
wbcd_train <- wbcd_z[1:469, ]
wbcd_test <- wbcd_z[470:569, ]
wbcd_train_labels <- wbcd[1:469, 1]
wbcd_test_labels <- wbcd[470:569, 1]
wbcd_test_pred <- knn(train = wbcd_train,
                      test = wbcd_test,
                      cl = wbcd_train_labels,
                      k = 21)

CrossTable(x = wbcd_test_labels, y = wbcd_test_pred, prop.chisq = FALSE)
## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |           N / Row Total |
## |           N / Col Total |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  100 
## 
##  
##                  | wbcd_test_pred 
## wbcd_test_labels |    Benign | Malignant | Row Total | 
## -----------------|-----------|-----------|-----------|
##           Benign |        77 |         0 |        77 | 
##                  |     1.000 |     0.000 |     0.770 | 
##                  |     0.975 |     0.000 |           | 
##                  |     0.770 |     0.000 |           | 
## -----------------|-----------|-----------|-----------|
##        Malignant |         2 |        21 |        23 | 
##                  |     0.087 |     0.913 |     0.230 | 
##                  |     0.025 |     1.000 |           | 
##                  |     0.020 |     0.210 |           | 
## -----------------|-----------|-----------|-----------|
##     Column Total |        79 |        21 |       100 | 
##                  |     0.790 |     0.210 |           | 
## -----------------|-----------|-----------|-----------|
## 
## 
table(wbcd_test_labels, wbcd_test_pred)
##                 wbcd_test_pred
## wbcd_test_labels  Benign Malignant
##         Benign        77         0
##        Malignant       2        21

With z-score standardization, the classifier produces exactly the same confusion matrix on this split: 98 of 100 correct, again with 2 false negatives.

Z-score normalization

\[X_{new} = \frac{X-\mu} {\sigma} = \frac{X-Mean(X)} {StdDev(X)}\]
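
This formula is exactly what scale() computes with its default arguments; a quick verification (our own sketch):

# The manual z-score of one feature should match scale()'s output
z_manual <- (wbcd$area_mean - mean(wbcd$area_mean)) / sd(wbcd$area_mean)
all.equal(z_manual, as.numeric(scale(wbcd$area_mean)))  # TRUE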