Cancer screening makes it possible to treat the disease before it causes noticeable symptoms. Machine learning can help automate the diagnostic process, providing greater accuracy by removing human subjectivity.
This would allow physicians to spend less time diagnosing and more time on treatment.
In this analysis we use the Wisconsin Breast Cancer Diagnostic dataset from the UCI Machine Learning Repository: http://archive.ics.uci.edu/ml
The dataset contains features computed from digitized images of fine-needle aspirates of breast masses. It includes 569 examples of cancer biopsies, each with 32 features.
setwd("C:/Users/alfa2/Dropbox/projects_interview")
wbcd <- read.csv("wisc_bc_data.csv", stringsAsFactors = FALSE)
str(wbcd)
## 'data.frame': 569 obs. of 32 variables:
## $ id : int 842302 842517 84300903 84348301 84358402 843786 844359 84458202 844981 84501001 ...
## $ diagnosis : chr "M" "M" "M" "M" ...
## $ radius_mean : num 18 20.6 19.7 11.4 20.3 ...
## $ texture_mean : num 10.4 17.8 21.2 20.4 14.3 ...
## $ perimeter_mean : num 122.8 132.9 130 77.6 135.1 ...
## $ area_mean : num 1001 1326 1203 386 1297 ...
## $ smoothness_mean : num 0.1184 0.0847 0.1096 0.1425 0.1003 ...
## $ compactness_mean : num 0.2776 0.0786 0.1599 0.2839 0.1328 ...
## $ concavity_mean : num 0.3001 0.0869 0.1974 0.2414 0.198 ...
## $ concave.points_mean : num 0.1471 0.0702 0.1279 0.1052 0.1043 ...
## $ symmetry_mean : num 0.242 0.181 0.207 0.26 0.181 ...
## $ fractal_dimension_mean : num 0.0787 0.0567 0.06 0.0974 0.0588 ...
## $ radius_se : num 1.095 0.543 0.746 0.496 0.757 ...
## $ texture_se : num 0.905 0.734 0.787 1.156 0.781 ...
## $ perimeter_se : num 8.59 3.4 4.58 3.44 5.44 ...
## $ area_se : num 153.4 74.1 94 27.2 94.4 ...
## $ smoothness_se : num 0.0064 0.00522 0.00615 0.00911 0.01149 ...
## $ compactness_se : num 0.049 0.0131 0.0401 0.0746 0.0246 ...
## $ concavity_se : num 0.0537 0.0186 0.0383 0.0566 0.0569 ...
## $ concave.points_se : num 0.0159 0.0134 0.0206 0.0187 0.0188 ...
## $ symmetry_se : num 0.03 0.0139 0.0225 0.0596 0.0176 ...
## $ fractal_dimension_se : num 0.00619 0.00353 0.00457 0.00921 0.00511 ...
## $ radius_worst : num 25.4 25 23.6 14.9 22.5 ...
## $ texture_worst : num 17.3 23.4 25.5 26.5 16.7 ...
## $ perimeter_worst : num 184.6 158.8 152.5 98.9 152.2 ...
## $ area_worst : num 2019 1956 1709 568 1575 ...
## $ smoothness_worst : num 0.162 0.124 0.144 0.21 0.137 ...
## $ compactness_worst : num 0.666 0.187 0.424 0.866 0.205 ...
## $ concavity_worst : num 0.712 0.242 0.45 0.687 0.4 ...
## $ concave.points_worst : num 0.265 0.186 0.243 0.258 0.163 ...
## $ symmetry_worst : num 0.46 0.275 0.361 0.664 0.236 ...
## $ fractal_dimension_worst: num 0.1189 0.089 0.0876 0.173 0.0768 ...
Always exclude identifier variables from the analysis: a model that includes an identifier will overfit and is unlikely to generalize well to new data.
wbcd <- wbcd[-1]
The target (outcome) variable is the one we attempt to predict. In this dataset it is diagnosis, which indicates whether the example comes from a benign or a malignant mass.
table(wbcd$diagnosis)
##
## B M
## 357 212
Some machine learning algorithms require the target variable to be coded as a factor. We will do this while recoding B and M into more meaningful labels.
wbcd$diagnosis <- factor(wbcd$diagnosis,
                         levels = c("B", "M"),
                         labels = c("Benign", "Malignant"))
round(prop.table(table(wbcd$diagnosis)) * 100, digits = 1)
##
## Benign Malignant
## 62.7 37.3
All remaining features are numeric. Let's check summary statistics for three of them.
Do you notice anything problematic?
summary(wbcd[c("radius_mean", "area_mean", "smoothness_mean")])
## radius_mean area_mean smoothness_mean
## Min. : 6.981 Min. : 143.5 Min. :0.05263
## 1st Qu.:11.700 1st Qu.: 420.3 1st Qu.:0.08637
## Median :13.370 Median : 551.1 Median :0.09587
## Mean :14.127 Mean : 654.9 Mean :0.09636
## 3rd Qu.:15.780 3rd Qu.: 782.7 3rd Qu.:0.10530
## Max. :28.110 Max. :2501.0 Max. :0.16340
Remember that distance computations in k-NN are highly dependent on the scale of the features.
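For instance, take the first two patients from the str() output above: area_mean is measured in the hundreds while smoothness_mean is around 0.1, so an unscaled Euclidean distance between them is driven almost entirely by area. A small illustration using those values:

p1 <- c(area = 1001, smoothness = 0.1184)  # patient 1, from str() above
p2 <- c(area = 1326, smoothness = 0.0847)  # patient 2
sqrt(sum((p1 - p2)^2))                     # ~325: the smoothness difference is invisible

Min-max normalization puts every feature on a common [0, 1] scale: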
\[X_{new} = \frac{X - \min(X)}{\max(X) - \min(X)}\]
Let's create a normalize() function that implements this rescaling:
normalize <- function(x) {
  return((x - min(x)) / (max(x) - min(x)))
}
normalize(c(1, 2, 3, 4, 5))
## [1] 0.00 0.25 0.50 0.75 1.00
normalize(c(10, 20, 30, 40, 50))
## [1] 0.00 0.25 0.50 0.75 1.00
The lapply() function takes a list and applies a specified function to each list element.
prueba <- data.frame(x = c(.1, .2, .3, .4, .5, .6, .7, .8, .9, 1.0),
                     y = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10),
                     z = c(10, 20, 30, 40, 50, 60, 70, 80, 90, 100))
mean(prueba$x)
## [1] 0.55
mean(prueba$y)
## [1] 5.5
mean(prueba$z)
## [1] 55
lapply(prueba, mean)
## $x
## [1] 0.55
##
## $y
## [1] 5.5
##
## $z
## [1] 55
sapply(prueba, mean)
## x y z
## 0.55 5.50 55.00
sapply(prueba, normalize)
## x y z
## [1,] 0.0000000 0.0000000 0.0000000
## [2,] 0.1111111 0.1111111 0.1111111
## [3,] 0.2222222 0.2222222 0.2222222
## [4,] 0.3333333 0.3333333 0.3333333
## [5,] 0.4444444 0.4444444 0.4444444
## [6,] 0.5555556 0.5555556 0.5555556
## [7,] 0.6666667 0.6666667 0.6666667
## [8,] 0.7777778 0.7777778 0.7777778
## [9,] 0.8888889 0.8888889 0.8888889
## [10,] 1.0000000 1.0000000 1.0000000
Now we apply our normalize() function to the 30 numeric variables.
wbcd_n <- as.data.frame(lapply(wbcd[2:31], normalize))
summary(wbcd_n$area_mean)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.1174 0.1729 0.2169 0.2711 1.0000
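One subtlety: we normalized before splitting, so the test rows influence each feature's min and max. A stricter protocol derives the rescaling parameters from the training rows only. A minimal sketch (norm_by_train and wbcd_n2 are hypothetical names; we keep the simpler full-data normalization below):

train_rows <- 1:469
norm_by_train <- function(x) {
  # min and max are taken from the training portion only,
  # so no information leaks from the test set
  (x - min(x[train_rows])) / (max(x[train_rows]) - min(x[train_rows]))
}
wbcd_n2 <- as.data.frame(lapply(wbcd[2:31], norm_by_train))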
wbcd_train <- wbcd_n[1:469, ]
wbcd_test  <- wbcd_n[470:569, ]
dim(wbcd_train)
## [1] 469 30
dim(wbcd_test)
## [1] 100 30
It is important that each subset is representative of the full dataset; a sequential split such as this one is only safe when the rows are not sorted in any meaningful order (a random split, sketched below, avoids this concern). When splitting the data into training and test sets we omitted the target variable, so we now form two vectors with the corresponding labels.
wbcd_train_labels <- wbcd[1:469, 1]
wbcd_test_labels  <- wbcd[470:569, 1]
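If the row order were not trustworthy, a random partition would be the safer choice. A minimal sketch for the same 469/100 split (the seed and the _r-suffixed names are arbitrary; the analysis below keeps the sequential split):

set.seed(123)                           # arbitrary seed, for reproducibility
train_idx <- sample(nrow(wbcd_n), 469)  # draw 469 random row indices
wbcd_train_r <- wbcd_n[train_idx, ]
wbcd_test_r  <- wbcd_n[-train_idx, ]
wbcd_train_labels_r <- wbcd$diagnosis[train_idx]
wbcd_test_labels_r  <- wbcd$diagnosis[-train_idx]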
Lazy learners such as k-NN do not really have a training phase; instead, they simply store the input data in a structured format. We will use the k-NN implementation in the class package.
library(class)
wbcd_test_pred <- knn(train = wbcd_train,
                      test = wbcd_test,
                      cl = wbcd_train_labels,
                      k = 21)
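Why k = 21? A common heuristic is the square root of the number of training examples (sqrt(469) is about 21.7), rounded to an odd number so that a two-class vote cannot tie. It is still worth comparing a few candidates; a minimal sketch (the candidate values are arbitrary):

for (k in c(1, 5, 11, 15, 21, 27)) {
  pred <- knn(train = wbcd_train, test = wbcd_test,
              cl = wbcd_train_labels, k = k)
  cat("k =", k, "test accuracy =", mean(pred == wbcd_test_labels), "\n")
}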
Next, we need to know how well k-NN predicted the classes of the observations in the test set.
To do this, we can use the CrossTable() function in the gmodels package. Setting prop.chisq = FALSE drops the chi-square contributions from the output, leaving a cross tabulation of the agreement between two vectors: here, the k-NN predictions and the true classes in the test set.
library(gmodels)
CrossTable(x = wbcd_test_labels,
           y = wbcd_test_pred,
           prop.chisq = FALSE)
##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Row Total |
## | N / Col Total |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 100
##
##
## | wbcd_test_pred
## wbcd_test_labels | Benign | Malignant | Row Total |
## -----------------|-----------|-----------|-----------|
## Benign | 77 | 0 | 77 |
## | 1.000 | 0.000 | 0.770 |
## | 0.975 | 0.000 | |
## | 0.770 | 0.000 | |
## -----------------|-----------|-----------|-----------|
## Malignant | 2 | 21 | 23 |
## | 0.087 | 0.913 | 0.230 |
## | 0.025 | 1.000 | |
## | 0.020 | 0.210 | |
## -----------------|-----------|-----------|-----------|
## Column Total | 79 | 21 | 100 |
## | 0.790 | 0.210 | |
## -----------------|-----------|-----------|-----------|
##
##
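The table shows 77 + 21 = 98 correct predictions out of 100. Both errors are false negatives (malignant masses labeled benign), which is the costlier mistake in a screening setting. Overall accuracy can be computed directly:

mean(wbcd_test_pred == wbcd_test_labels)  # (77 + 21) / 100 = 0.98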
An alternative rescaling is z-score standardization:
\[X_{new} = \frac{X - \mu}{\sigma} = \frac{X - \mathrm{Mean}(X)}{\mathrm{StdDev}(X)}\]
R's built-in scale() function applies this transformation.
wbcd_z <- as.data.frame(scale(wbcd[-1]))
summary(wbcd_z$area_mean)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -1.4532 -0.6666 -0.2949 0.0000 0.3632 5.2459
wbcd_train <- wbcd_z[1:469, ]
wbcd_test  <- wbcd_z[470:569, ]
wbcd_train_labels <- wbcd[1:469, 1]
wbcd_test_labels  <- wbcd[470:569, 1]
wbcd_test_pred <- knn(train = wbcd_train,
                      test = wbcd_test,
                      cl = wbcd_train_labels,
                      k = 21)
CrossTable(x = wbcd_test_labels, y = wbcd_test_pred, prop.chisq = FALSE)
##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Row Total |
## | N / Col Total |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 100
##
##
## | wbcd_test_pred
## wbcd_test_labels | Benign | Malignant | Row Total |
## -----------------|-----------|-----------|-----------|
## Benign | 77 | 0 | 77 |
## | 1.000 | 0.000 | 0.770 |
## | 0.975 | 0.000 | |
## | 0.770 | 0.000 | |
## -----------------|-----------|-----------|-----------|
## Malignant | 2 | 21 | 23 |
## | 0.087 | 0.913 | 0.230 |
## | 0.025 | 1.000 | |
## | 0.020 | 0.210 | |
## -----------------|-----------|-----------|-----------|
## Column Total | 79 | 21 | 100 |
## | 0.790 | 0.210 | |
## -----------------|-----------|-----------|-----------|
##
##
table(wbcd_test_labels, wbcd_test_pred)
## wbcd_test_pred
## wbcd_test_labels Benign Malignant
## Benign 77 0
## Malignant 2 21
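In this run, z-score standardization produced the same confusion matrix as min-max normalization. From this table we can also derive sensitivity and specificity for the malignant class:

tab <- table(wbcd_test_labels, wbcd_test_pred)
tab["Malignant", "Malignant"] / sum(tab["Malignant", ])  # sensitivity: 21/23, about 0.913
tab["Benign", "Benign"] / sum(tab["Benign", ])           # specificity: 77/77 = 1.000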