Caveat emptor

This is an unfinished draft. The walk-throughs are incomplete and possibly innaccurate, the writing may be charitably described as “choppy,” and at some point this page will be removed from the Web and replaced with shorter, more focused walkthroughs.

Until then, I hope it provides something useful.

Introduction

Libraries

To do the analysis, I’m going to rely on readr for import, magrittr and dplyr for chaining data objects and functions, ggplot2 for plotting. The actual cluster analysis will be carried out using knn() from the class package, and model checking will be done using functions in gmodels, caret and ROCR.

library(knitr)
library(skimr)
library(readr)
library(magrittr)
library(dplyr)
library(tidyr)
library(ggplot2)
library(scales)

Distance between observations

Cluster analysis works by calculating the distance between every observation, or point, in a data set, and then dividing the observations into groups—or clusters. The clustering method may be based on assigning observations to predetermined number of clusters, \(k\), by grouping observations with their \(k\) nearest neighbors to form approximately \(k\) clusters, or by other, similar methods. Typically, the analyst makes a choice regarding either the number of clusters, or the number of nearest neighbors to consider in a cluster.

The most common distance metric is the Euclidean distance, a generalization of the Pythagorean Theorem.

Assume two points \(\bar{p}\) and \(\bar{q}\) as displayed in the plane below:

The variables \(x\) and \(y\) are independent variables, a.k.a. features, in our data set. We can define the distance between them as \(c\), which itself is a function of the distance \(a\) in the \(x\) coordinate and the distance \(b\) in the \(y\) coordinate.

The distance \(c\) between \(\bar{p}\) and \(\bar{q}\) is defined by the Pythagorean Theorem:

\[ \begin{align} c &= \sqrt{a^2 + b^2} \\ &= \sqrt{\left(q_x - p_x \right)^2 + \left(q_y - p_y\right)^2}\\ &= \sqrt{\left(4 - 2 \right)^2 + \left(3 - 1 \right)^2}\\ &= \sqrt{\left(2 \right)^2 + \left(2 \right)^2}\\ &= \sqrt{4 + 4}\\ &= 2.8 \end{align} \]

For coordinate systems with more than two dimensions, we can generalize this to the Euclidean distance:

\[ \begin{aligned} \mathrm{d}(\bar{p}, \bar{q}) &= \sqrt{(q_1 - p_1)^2 + (q_2 - p_2)^2 + \ldots + (q_n - p_n)^2} \\ &= \sqrt{\sum_{i=0}^n \left(q_i + p_i \right)^2} \end{aligned} \]

Utility function

If any variable in the above equation covers a much larger scale than the other dimensions, then the corresponding distance metric will be dominated by that variable. For instance,

\[ \begin{aligned} \mathrm{d}(\bar{p}, \bar{q}) &= \sqrt{(q_1 - p_1)^2 + (q_2 - p_2)^2 + (q_3 - p_3)^2} \\ &= \sqrt{(2)^2 + (1000)^2 + (1)^2}\\ &= 1000.002\\ &\approx \sqrt{1000^2} \end{aligned} \]

The clusters will be defined entirely by the second variable, and we might as well leave all other variables out of the model. However, because we are often using dissimilar variables to compare and cluster, it is rare that one variable should be correspondingly more important than the others; we don’t want one variable to dominate the clustering.

To correct for this, we need to normalize the data. There are a couple of ways to do this, but one of the nicer ones is to compute the z-score:

\[z_{score}=\frac{x - \bar{x}}{\sigma_{s}}\]

I’ll convert each value of the independent variables to their z-score equivalent using the function below.

#' zscore_scale
#' @param x A numeric vector to compute z scores from
#' @return A numeric vector of length \code{length(x)} where every element is the z score of every element of \code{x}
zscore_scale <- function(x) {
  x_mean <- mean(x, na.rm = TRUE)
  x_sd <- sd(x, na.rm = TRUE)
  x_scaled <- (x - x_mean) / x_sd
  return(x_scaled)
}

Data import and cleaning

Loading the data file

We’ll use a publically-available data set, the Breast Cancer Wisconsin (Diagnostic) Data Set. The data is available from the Machine Learning Repository at the Center for Machine Learning and Intelligent Systems, part of the University of California, Irvine.

After downloading from the website, we read in the data.

# Download data and column names from
# \url{https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+(Diagnostic)}

# Load data
wdbc_df <- read_csv("data-raw/wdbc.data", col_names = FALSE) %>% na.omit()

The data doesn’t come with column names embedded in the wdbc.data file. There is a description file, wdbc.names, but it doesn’t actually list column names in a machine-readable format, so we’ll create our own.

colnames(wdbc_df) <- wdbc_names

wdbc_df$Diagnosis <- as.factor(wdbc_df$Diagnosis)

wdbc_df %>% skim()

## Skim summary statistics
##  n obs: 569 
##  n variables: 32 
## 
## Variable type: factor 
##   variable missing complete   n n_unique            top_counts ordered
##  Diagnosis       0      569 569        2 B: 357, M: 212, NA: 0   FALSE
## 
## Variable type: integer 
##  variable missing complete   n  mean      sd   p0    p25 median     p75
##        ID       0      569 569 3e+07 1.3e+08 8670 869218 906024 8813129
##     p100     hist
##  9.1e+08 ▇▁▁▁▁▁▁▁
## 
## Variable type: numeric 
##          variable missing complete   n     mean       sd        p0
##            area_m       0      569 569 654.89   351.91   143.5    
##            area_s       0      569 569  40.34    45.49     6.8    
##            area_w       0      569 569 880.58   569.36   185.2    
##     compactness_m       0      569 569   0.1      0.053    0.019  
##     compactness_s       0      569 569   0.025    0.018    0.0023 
##     compactness_w       0      569 569   0.25     0.16     0.027  
##  concave_points_m       0      569 569   0.049    0.039    0      
##  concave_points_s       0      569 569   0.012    0.0062   0      
##  concave_points_w       0      569 569   0.11     0.066    0      
##       concavity_m       0      569 569   0.089    0.08     0      
##       concavity_s       0      569 569   0.032    0.03     0      
##       concavity_w       0      569 569   0.27     0.21     0      
##     fractal_dim_m       0      569 569   0.063    0.0071   0.05   
##     fractal_dim_s       0      569 569   0.0038   0.0026   0.00089
##     fractal_dim_w       0      569 569   0.084    0.018    0.055  
##       perimeter_m       0      569 569  91.97    24.3     43.79   
##       perimeter_s       0      569 569   2.87     2.02     0.76   
##       perimeter_w       0      569 569 107.26    33.6     50.41   
##          radius_m       0      569 569  14.13     3.52     6.98   
##          radius_s       0      569 569   0.41     0.28     0.11   
##          radius_w       0      569 569  16.27     4.83     7.93   
##      smoothness_m       0      569 569   0.096    0.014    0.053  
##      smoothness_s       0      569 569   0.007    0.003    0.0017 
##      smoothness_w       0      569 569   0.13     0.023    0.071  
##        symmetry_m       0      569 569   0.18     0.027    0.11   
##        symmetry_s       0      569 569   0.021    0.0083   0.0079 
##        symmetry_w       0      569 569   0.29     0.062    0.16   
##         texture_m       0      569 569  19.29     4.3      9.71   
##         texture_s       0      569 569   1.22     0.55     0.36   
##         texture_w       0      569 569  25.68     6.15    12.02   
##       p25   median       p75     p100     hist
##  420.3    551.1     782.7    2501     ▅▇▂▂▁▁▁▁
##   17.85    24.53     45.19    542.2   ▇▁▁▁▁▁▁▁
##  515.3    686.5    1084      4254     ▇▅▂▁▁▁▁▁
##    0.065    0.093     0.13      0.35  ▅▇▆▃▁▁▁▁
##    0.013    0.02      0.032     0.14  ▇▆▂▁▁▁▁▁
##    0.15     0.21      0.34      1.06  ▆▇▅▂▁▁▁▁
##    0.02     0.034     0.074     0.2   ▇▆▃▃▁▁▁▁
##    0.0076   0.011     0.015     0.053 ▃▇▅▁▁▁▁▁
##    0.065    0.1       0.16      0.29  ▃▇▇▅▅▃▂▁
##    0.03     0.062     0.13      0.43  ▇▃▂▂▁▁▁▁
##    0.015    0.026     0.042     0.4   ▇▂▁▁▁▁▁▁
##    0.11     0.23      0.38      1.25  ▇▆▅▂▁▁▁▁
##    0.058    0.062     0.066     0.097 ▃▇▆▂▁▁▁▁
##    0.0022   0.0032    0.0046    0.03  ▇▂▁▁▁▁▁▁
##    0.071    0.08      0.092     0.21  ▆▇▃▁▁▁▁▁
##   75.17    86.24    104.1     188.5   ▂▇▇▃▂▁▁▁
##    1.61     2.29      3.36     21.98  ▇▂▁▁▁▁▁▁
##   84.11    97.66    125.4     251.2   ▂▇▃▂▂▁▁▁
##   11.7     13.37     15.78     28.11  ▁▆▇▃▂▁▁▁
##    0.23     0.32      0.48      2.87  ▇▂▁▁▁▁▁▁
##   13.01    14.97     18.79     36.04  ▂▇▅▂▂▁▁▁
##    0.086    0.096     0.11      0.16  ▁▂▇▇▃▁▁▁
##    0.0052   0.0064    0.0081    0.031 ▅▇▂▁▁▁▁▁
##    0.12     0.13      0.15      0.22  ▁▃▆▇▃▁▁▁
##    0.16     0.18      0.2       0.3   ▁▃▇▇▂▁▁▁
##    0.015    0.019     0.023     0.079 ▇▇▃▁▁▁▁▁
##    0.25     0.28      0.32      0.66  ▁▇▆▂▁▁▁▁
##   16.17    18.84     21.8      39.28  ▂▆▇▅▂▁▁▁
##    0.83     1.11      1.47      4.88  ▆▇▃▁▁▁▁▁
##   21.08    25.41     29.72     49.54  ▂▆▇▆▅▁▁▁

Normalizing

# Normalize the data, so Euclidian distance is not dominated by differences in scale
wdbc_norm_df <- wdbc_df %>% 
  select(radius_m:fractal_dim_w) %>% 
  lapply(X = ., FUN = zscore_scale) %>% 
  as_data_frame()

Cluster Analysis

Analysis using `kNN()`

Training and test data

# Set the split size
sample_size <- floor(0.8 * nrow(wdbc_norm_df))

# Create a vector of row number indicators for the training set
train_ind <- wdbc_norm_df %>%
  nrow() %>% 
  seq_len() %>% 
  sample(., size = sample_size, replace = FALSE)

# Split data into training and test data sets
wdbc_train_df <- wdbc_norm_df[train_ind, ]
wdbc_test_df <- wdbc_norm_df[-train_ind, ]

# Split the dependent response variable "labels" (column 2) into training and test sets
train_labels <- as.vector(wdbc_df[train_ind,]$Diagnosis)
# alternatively: train_labels <- wdbc_df[train_ind, 2][[1]]
test_labels <- as.vector(wdbc_df[-train_ind, ]$Diagnosis)

Cluster analysis

# Set a reasonable value for k; ensure odd to avoid ties
k_size <- 2 * round(((sqrt(length(train_ind)) + 1) / 2), 0) - 1
k_size

## [1] 21

From the class package, use knn() to cluster.

library(class)

# k-means cluster
wdbc_test_pred <- knn(train = wdbc_train_df, 
                      test = wdbc_test_df, 
                      cl = train_labels, 
                      k = k_size, prob = TRUE)

Convert to a data frame, and extract the probabilities from the attributes of the wdbc_test_pred vector. The probabilities are relative to the predicted value. We want to change these to be the probability of a positive result (“M”) so we re-compute rows where B is predicted as \(\Pr = 1 - \Pr\).

wdbc_pred_df <- data_frame(actual = test_labels, 
                           predicted = wdbc_test_pred, 
                           probability = attr(x = wdbc_test_pred, which = "prob")) 
wdbc_pred_df[wdbc_pred_df$predicted == "B",]$probability <- 1 - wdbc_pred_df[wdbc_pred_df$predicted == "B",]$probability
head(wdbc_pred_df)

## # A tibble: 6 x 3
##   actual predicted probability
##   <chr>  <fct>           <dbl>
## 1 M      M               1.00 
## 2 M      M               1.00 
## 3 M      M               0.762
## 4 B      B               0    
## 5 M      M               1.00 
## 6 M      M               1.00

Checking the model

Use gmodels::CrossTable()

library(gmodels)

CrossTable(x = wdbc_pred_df$actual, y = wdbc_pred_df$predicted, prop.chisq = FALSE)

## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |           N / Row Total |
## |           N / Col Total |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  114 
## 
##  
##                     | wdbc_pred_df$predicted 
## wdbc_pred_df$actual |         B |         M | Row Total | 
## --------------------|-----------|-----------|-----------|
##                   B |        72 |         0 |        72 | 
##                     |     1.000 |     0.000 |     0.632 | 
##                     |     0.923 |     0.000 |           | 
##                     |     0.632 |     0.000 |           | 
## --------------------|-----------|-----------|-----------|
##                   M |         6 |        36 |        42 | 
##                     |     0.143 |     0.857 |     0.368 | 
##                     |     0.077 |     1.000 |           | 
##                     |     0.053 |     0.316 |           | 
## --------------------|-----------|-----------|-----------|
##        Column Total |        78 |        36 |       114 | 
##                     |     0.684 |     0.316 |           | 
## --------------------|-----------|-----------|-----------|
## 
##

Use caret::confusionMatrix

library(caret)

## Loading required package: lattice

print(wdbc_knn_cm <- confusionMatrix(wdbc_test_pred, test_labels, positive = "M"))

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  B  M
##          B 72  6
##          M  0 36
##                                          
##                Accuracy : 0.9474         
##                  95% CI : (0.889, 0.9804)
##     No Information Rate : 0.6316         
##     P-Value [Acc > NIR] : 2.054e-15      
##                                          
##                   Kappa : 0.8834         
##  Mcnemar's Test P-Value : 0.04123        
##                                          
##             Sensitivity : 0.8571         
##             Specificity : 1.0000         
##          Pos Pred Value : 1.0000         
##          Neg Pred Value : 0.9231         
##              Prevalence : 0.3684         
##          Detection Rate : 0.3158         
##    Detection Prevalence : 0.3158         
##       Balanced Accuracy : 0.9286         
##                                          
##        'Positive' Class : M              
##

wdbc_knn_acc <- wdbc_knn_cm$overall["Accuracy"]

The accuracy is 94.7%

library(ROCR)

## Loading required package: gplots

## 
## Attaching package: 'gplots'

## The following object is masked from 'package:stats':
## 
##     lowess

wdbc_pred <- prediction(predictions = wdbc_pred_df$probability, labels = wdbc_pred_df$actual)
wdbc_perf <- performance(wdbc_pred, measure = "tpr", x.measure = "fpr")
plot(wdbc_perf, colorize = TRUE)
abline(a = 0, b = 1, lwd = 1, lty = 2, col = "gray")

wdbc_auc <- performance(wdbc_pred, measure = "auc")
unlist(wdbc_auc@y.values)

## [1] 0.994709

Cross validation

#library(class)

wdbc_knn_cv <- knn.cv(train = wdbc_df[, 3:ncol(wdbc_df)], 
                      cl = wdbc_df$Diagnosis, 
                      k = k_size, 
                      prob = TRUE)
#str(wdbc_knn_cv)
#summary(wdbc_knn_cv)
#attributes(wdbc_knn_cv)
wdbc_cv_pred_df <- data_frame(actual = wdbc_df$Diagnosis,
                              predicted = wdbc_knn_cv,
                              probability = attr(x = wdbc_knn_cv, which = "prob"))
wdbc_cv_pred_df[wdbc_cv_pred_df$predicted == "B",]$probability <- 1 - wdbc_cv_pred_df[wdbc_cv_pred_df$predicted == "B",]$probability

CrossTable(x = wdbc_cv_pred_df$actual, y = wdbc_cv_pred_df$predicted, prop.chisq = FALSE)

## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |           N / Row Total |
## |           N / Col Total |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  569 
## 
##  
##                        | wdbc_cv_pred_df$predicted 
## wdbc_cv_pred_df$actual |         B |         M | Row Total | 
## -----------------------|-----------|-----------|-----------|
##                      B |       347 |        10 |       357 | 
##                        |     0.972 |     0.028 |     0.627 | 
##                        |     0.920 |     0.052 |           | 
##                        |     0.610 |     0.018 |           | 
## -----------------------|-----------|-----------|-----------|
##                      M |        30 |       182 |       212 | 
##                        |     0.142 |     0.858 |     0.373 | 
##                        |     0.080 |     0.948 |           | 
##                        |     0.053 |     0.320 |           | 
## -----------------------|-----------|-----------|-----------|
##           Column Total |       377 |       192 |       569 | 
##                        |     0.663 |     0.337 |           | 
## -----------------------|-----------|-----------|-----------|
## 
##

Analysis using `kmeans()`

cluster_lookup <- c(`1` = "M", `2` = "B")
set.seed(1)
wdbc_cluster_df <- wdbc_norm_df %>% 
  mutate(cluster = kmeans(., centers = 2)$cluster,
         predicted = as.factor(cluster_lookup[cluster]),
         diagnosis = wdbc_df$Diagnosis)
set.seed(NULL)

CrossTable(wdbc_cluster_df$diagnosis, wdbc_cluster_df$predicted, prop.chisq = FALSE)

## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |           N / Row Total |
## |           N / Col Total |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  569 
## 
##  
##                           | wdbc_cluster_df$predicted 
## wdbc_cluster_df$diagnosis |         B |         M | Row Total | 
## --------------------------|-----------|-----------|-----------|
##                         B |       343 |        14 |       357 | 
##                           |     0.961 |     0.039 |     0.627 | 
##                           |     0.903 |     0.074 |           | 
##                           |     0.603 |     0.025 |           | 
## --------------------------|-----------|-----------|-----------|
##                         M |        37 |       175 |       212 | 
##                           |     0.175 |     0.825 |     0.373 | 
##                           |     0.097 |     0.926 |           | 
##                           |     0.065 |     0.308 |           | 
## --------------------------|-----------|-----------|-----------|
##              Column Total |       380 |       189 |       569 | 
##                           |     0.668 |     0.332 |           | 
## --------------------------|-----------|-----------|-----------|
## 
##

print(km_cm <- confusionMatrix(data = wdbc_cluster_df$predicted, wdbc_cluster_df$diagnosis))

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   B   M
##          B 343  37
##          M  14 175
##                                           
##                Accuracy : 0.9104          
##                  95% CI : (0.8838, 0.9325)
##     No Information Rate : 0.6274          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.804           
##  Mcnemar's Test P-Value : 0.002066        
##                                           
##             Sensitivity : 0.9608          
##             Specificity : 0.8255          
##          Pos Pred Value : 0.9026          
##          Neg Pred Value : 0.9259          
##              Prevalence : 0.6274          
##          Detection Rate : 0.6028          
##    Detection Prevalence : 0.6678          
##       Balanced Accuracy : 0.8931          
##                                           
##        'Positive' Class : B               
##

wdbc_km_acc <- km_cm$overall["Accuracy"]

The accuracy using kmeans() is 91%

Xgboost

library(xgboost)
library(Matrix)

wdbc_boost_df <- cbind(wdbc_df[,2], wdbc_norm_df) %>% 
  mutate(Diagnosis = ifelse(Diagnosis == "B", 0, 1))
wdbc_boost_train_df <- wdbc_boost_df[train_ind,]
wdbc_boost_test_df <- wdbc_boost_df[-train_ind,]

train_mx <- sparse.model.matrix(Diagnosis ~ ., wdbc_boost_train_df)
test_mx <- sparse.model.matrix(Diagnosis ~ ., wdbc_boost_test_df)

dtrain <- xgb.DMatrix(train_mx, label = wdbc_boost_train_df$Diagnosis)
dtest <- xgb.DMatrix(test_mx, label = wdbc_boost_test_df$Diagnosis)

train_gdbt <- xgb.train(params=list(booster = "gbtree", objective="multi:softmax", num_class=5, eval_metric="mlogloss", eta=0.3, max_depth=5, subsample=0.5, colsample_bytree=0.5), data=dtrain, nrounds=70, watchlist=list(train=dtrain,test=dtest))

## [1]  train-mlogloss:1.027319 test-mlogloss:1.084157 
## [2]  train-mlogloss:0.734214 test-mlogloss:0.820508 
## [3]  train-mlogloss:0.544498 test-mlogloss:0.652891 
## [4]  train-mlogloss:0.414367 test-mlogloss:0.526565 
## [5]  train-mlogloss:0.322072 test-mlogloss:0.429897 
## [6]  train-mlogloss:0.254539 test-mlogloss:0.373557 
## [7]  train-mlogloss:0.204383 test-mlogloss:0.319509 
## [8]  train-mlogloss:0.168338 test-mlogloss:0.275545 
## [9]  train-mlogloss:0.138529 test-mlogloss:0.255286 
## [10] train-mlogloss:0.116342 test-mlogloss:0.241146 
## [11] train-mlogloss:0.098863 test-mlogloss:0.222835 
## [12] train-mlogloss:0.087315 test-mlogloss:0.217464 
## [13] train-mlogloss:0.075777 test-mlogloss:0.210402 
## [14] train-mlogloss:0.068001 test-mlogloss:0.206243 
## [15] train-mlogloss:0.061641 test-mlogloss:0.206999 
## [16] train-mlogloss:0.054738 test-mlogloss:0.209115 
## [17] train-mlogloss:0.049147 test-mlogloss:0.207105 
## [18] train-mlogloss:0.045712 test-mlogloss:0.202369 
## [19] train-mlogloss:0.042529 test-mlogloss:0.199345 
## [20] train-mlogloss:0.039504 test-mlogloss:0.202202 
## [21] train-mlogloss:0.036606 test-mlogloss:0.196084 
## [22] train-mlogloss:0.033360 test-mlogloss:0.198013 
## [23] train-mlogloss:0.031255 test-mlogloss:0.191757 
## [24] train-mlogloss:0.030143 test-mlogloss:0.188708 
## [25] train-mlogloss:0.028304 test-mlogloss:0.186164 
## [26] train-mlogloss:0.026988 test-mlogloss:0.187461 
## [27] train-mlogloss:0.026514 test-mlogloss:0.189752 
## [28] train-mlogloss:0.025938 test-mlogloss:0.189481 
## [29] train-mlogloss:0.025147 test-mlogloss:0.192066 
## [30] train-mlogloss:0.023949 test-mlogloss:0.185547 
## [31] train-mlogloss:0.023468 test-mlogloss:0.186001 
## [32] train-mlogloss:0.022915 test-mlogloss:0.181186 
## [33] train-mlogloss:0.022461 test-mlogloss:0.183141 
## [34] train-mlogloss:0.022382 test-mlogloss:0.186404 
## [35] train-mlogloss:0.021549 test-mlogloss:0.186259 
## [36] train-mlogloss:0.020860 test-mlogloss:0.184464 
## [37] train-mlogloss:0.020238 test-mlogloss:0.181917 
## [38] train-mlogloss:0.019789 test-mlogloss:0.179912 
## [39] train-mlogloss:0.019088 test-mlogloss:0.176082 
## [40] train-mlogloss:0.018339 test-mlogloss:0.172613 
## [41] train-mlogloss:0.018058 test-mlogloss:0.172749 
## [42] train-mlogloss:0.017435 test-mlogloss:0.174132 
## [43] train-mlogloss:0.017008 test-mlogloss:0.174312 
## [44] train-mlogloss:0.016529 test-mlogloss:0.175471 
## [45] train-mlogloss:0.016301 test-mlogloss:0.174211 
## [46] train-mlogloss:0.016051 test-mlogloss:0.174410 
## [47] train-mlogloss:0.015778 test-mlogloss:0.172593 
## [48] train-mlogloss:0.015131 test-mlogloss:0.170877 
## [49] train-mlogloss:0.014766 test-mlogloss:0.167649 
## [50] train-mlogloss:0.014472 test-mlogloss:0.171999 
## [51] train-mlogloss:0.014364 test-mlogloss:0.171108 
## [52] train-mlogloss:0.014182 test-mlogloss:0.168475 
## [53] train-mlogloss:0.013799 test-mlogloss:0.166454 
## [54] train-mlogloss:0.013575 test-mlogloss:0.166015 
## [55] train-mlogloss:0.013414 test-mlogloss:0.170298 
## [56] train-mlogloss:0.013449 test-mlogloss:0.169793 
## [57] train-mlogloss:0.013190 test-mlogloss:0.169707 
## [58] train-mlogloss:0.012924 test-mlogloss:0.166498 
## [59] train-mlogloss:0.012790 test-mlogloss:0.167273 
## [60] train-mlogloss:0.012560 test-mlogloss:0.167709 
## [61] train-mlogloss:0.012351 test-mlogloss:0.163600 
## [62] train-mlogloss:0.012001 test-mlogloss:0.161160 
## [63] train-mlogloss:0.011831 test-mlogloss:0.158813 
## [64] train-mlogloss:0.011726 test-mlogloss:0.157154 
## [65] train-mlogloss:0.011405 test-mlogloss:0.159416 
## [66] train-mlogloss:0.011197 test-mlogloss:0.161758 
## [67] train-mlogloss:0.011060 test-mlogloss:0.161476 
## [68] train-mlogloss:0.010884 test-mlogloss:0.161595 
## [69] train-mlogloss:0.010814 test-mlogloss:0.159599 
## [70] train-mlogloss:0.010622 test-mlogloss:0.159925

Summary:

table(wdbc_boost_test_df$Diagnosis, predict(train_gdbt, newdata = dtest))

##    
##      0  1
##   0 66  6
##   1  1 41

Accuracy of result:

print(wdbc_xg_accuracy <- sum(diag(table(wdbc_boost_test_df$Diagnosis,predict(train_gdbt,newdata=dtest))))/nrow(wdbc_boost_test_df))

## [1] 0.9385965

The accuracy from XGBoost is 93.9%

Random forest

library(randomForest)

## randomForest 4.6-12

## Type rfNews() to see new features/changes/bug fixes.

## 
## Attaching package: 'randomForest'

## The following object is masked from 'package:ggplot2':
## 
##     margin

## The following object is masked from 'package:dplyr':
## 
##     combine

wdbc_rf_df <- cbind(wdbc_df[,2], wdbc_norm_df)
wdbc_rf_train_df <- wdbc_rf_df[train_ind,]
wdbc_rf_test_df <- wdbc_rf_df[-train_ind,]
wdbc_fgl <- randomForest(Diagnosis ~ ., data = wdbc_rf_train_df, model = randomForest, importance = TRUE)
print(wdbc_fgl)

## 
## Call:
##  randomForest(formula = Diagnosis ~ ., data = wdbc_rf_train_df,      model = randomForest, importance = TRUE) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 5
## 
##         OOB estimate of  error rate: 4.18%
## Confusion matrix:
##     B   M class.error
## B 277   8  0.02807018
## M  11 159  0.06470588

#wdbc_fgl_acc <- mean(1 - wdbc_fgl$confusion[, 'class.error'])
wdbc_fgl_acc <- 1 - (wdbc_fgl$confusion[2] + wdbc_fgl$confusion[3])/(sum(wdbc_fgl$confusion[1:4]))

The overall accuracy is:

mean: 95.8%

wdbc_rf_predict <- predict(wdbc_fgl, newdata = wdbc_rf_test_df)

wdbc_rf_predict_df <- data_frame(index = as.integer(names(wdbc_rf_predict)),
                                 predicted = as.character(wdbc_rf_predict),
                                 actual = wdbc_rf_df[index,]$Diagnosis)

CrossTable(wdbc_rf_predict_df$actual, wdbc_rf_predict_df$predicted)

## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## | Chi-square contribution |
## |           N / Row Total |
## |           N / Col Total |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  114 
## 
##  
##                           | wdbc_rf_predict_df$predicted 
## wdbc_rf_predict_df$actual |         B |         M | Row Total | 
## --------------------------|-----------|-----------|-----------|
##                         B |        68 |         4 |        72 | 
##                           |    12.801 |    20.365 |           | 
##                           |     0.944 |     0.056 |     0.632 | 
##                           |     0.971 |     0.091 |           | 
##                           |     0.596 |     0.035 |           | 
## --------------------------|-----------|-----------|-----------|
##                         M |         2 |        40 |        42 | 
##                           |    21.945 |    34.912 |           | 
##                           |     0.048 |     0.952 |     0.368 | 
##                           |     0.029 |     0.909 |           | 
##                           |     0.018 |     0.351 |           | 
## --------------------------|-----------|-----------|-----------|
##              Column Total |        70 |        44 |       114 | 
##                           |     0.614 |     0.386 |           | 
## --------------------------|-----------|-----------|-----------|
## 
##

wdbc_rf_cm <- confusionMatrix(wdbc_rf_predict_df$actual, wdbc_rf_predict_df$predicted)
print(wdbc_rf_cm)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  B  M
##          B 68  4
##          M  2 40
##                                          
##                Accuracy : 0.9474         
##                  95% CI : (0.889, 0.9804)
##     No Information Rate : 0.614          
##     P-Value [Acc > NIR] : <2e-16         
##                                          
##                   Kappa : 0.888          
##  Mcnemar's Test P-Value : 0.6831         
##                                          
##             Sensitivity : 0.9714         
##             Specificity : 0.9091         
##          Pos Pred Value : 0.9444         
##          Neg Pred Value : 0.9524         
##              Prevalence : 0.6140         
##          Detection Rate : 0.5965         
##    Detection Prevalence : 0.6316         
##       Balanced Accuracy : 0.9403         
##                                          
##        'Positive' Class : B              
##

wdbc_fgl_acc_2 <- wdbc_rf_cm$overall['Accuracy']

Logistic regression

Here we use stepwise regression using the BIC score to select the best variables for the regression.

wdbc_glm_df <- wdbc_df %>% 
  select(Diagnosis) %>% 
  bind_cols(wdbc_norm_df) %>% 
  mutate(Diagnosis = Diagnosis == "M")

wdbc_glm_train_df <- wdbc_glm_df[train_ind,]
wdbc_glm_test_df <- wdbc_glm_df[-train_ind,]

Independent variables in a regression should be independent; we can check this.

wdbc_cor_df <- wdbc_glm_df %>% 
  select(-Diagnosis) %>% 
  cor() %>% 
  as.matrix()
wdbc_cor_df <- data_frame(var1 = rownames(wdbc_cor_df)) %>% 
  bind_cols(as_data_frame(wdbc_cor_df))
wdbc_cor_df <- wdbc_cor_df%>% 
  gather(var2, correlation, -var1) %>% 
  mutate(var1 = as.factor(var1),
         var2 = as.factor(var2))
wdbc_cor_df$var2 <- factor(wdbc_cor_df$var2, levels = rev(levels(wdbc_cor_df$var2)))
wdbc_cor_df %>% 
  ggplot(aes(x = var1, y = var2, fill = correlation)) +
  geom_tile() +
  scale_fill_gradient2(low = ("#DD3333"),mid = ("#CCFFCC"), high = ("#DD3333")) +
  scale_x_discrete(position = "top") +
  theme(axis.text.x = element_text(angle = 45, hjust = 0))

We can see that some of the variables—particularly area, perimeter and radius— are highly-correlated. For our regression to work, we need to down-select to only the most significant variables. One approach is step-wise logistic regression, in which an algorithm tests a large number of models with different variables selected and chooses the simplest model with the best fit based on the BIC score.

wdbc_glm_f0 <- glm(formula = Diagnosis ~ 1, data = wdbc_glm_train_df, family = binomial(link = "logit"))
wdbc_glm_fa <- glm(formula = Diagnosis ~ ., data = wdbc_glm_train_df, family = binomial(link = "logit"))

wdbc_glm_step <- step(object = wdbc_glm_f0, 
                      scope = formula(wdbc_glm_fa), 
                      k = log(ncol(wdbc_glm_train_df)-1), 
                      direction = "both", trace = 0)

Next, we perform the logistic regression using the formula selected by stepwise regression.

wdbc_glm <- glm(formula = formula(wdbc_glm_step), data = wdbc_glm_train_df, family = binomial(link = "logit"))

summary(wdbc_glm)

## 
## Call:
## glm(formula = formula(wdbc_glm_step), family = binomial(link = "logit"), 
##     data = wdbc_glm_train_df)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -1.55694  -0.02725  -0.00188   0.00002   2.82787  
## 
## Coefficients:
##                  Estimate Std. Error z value Pr(>|z|)
## (Intercept)        0.7661     0.5915   1.295  0.19528
## smoothness_w       1.1887     0.6136   1.937  0.05271
## texture_w          2.1767     0.5408   4.025 5.71e-05
## radius_s           4.6803     1.4722   3.179  0.00148
## compactness_s     -1.3498     0.5194  -2.599  0.00936
## concave_points_w   3.3416     1.2974   2.576  0.01000
## concavity_w        1.6462     0.7813   2.107  0.03513
## area_w             7.5282     1.8473   4.075 4.60e-05
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 601.38  on 454  degrees of freedom
## Residual deviance:  51.34  on 447  degrees of freedom
## AIC: 67.34
## 
## Number of Fisher Scoring iterations: 10

wdbc_glm_pred <- predict(wdbc_glm, newdata = wdbc_glm_test_df, type = "response")

# Convert prediction results to a data frame, with predicted diagnosis
# and actual diagnosis.
wdbc_glm_predict_df <- data_frame(prob = (wdbc_glm_pred),
                                  actual = wdbc_glm_df[-train_ind,]$Diagnosis,
                                  predicted = ifelse(prob > 0.5, "M", "B")) %>% 
  mutate(actual = ifelse(actual, "M", "B"))

Now we check the model. First we’ll just look at the distribution of probabilities, with mis-classification errors in red. Then we’ll use CrossTable() and confusionmatrix() to look more closely at the accuracy, sensitivity and specificity of the model.

wdbc_glm_predict_df %>% 
  mutate(correct_class = ifelse(actual == predicted, TRUE, FALSE)) %>% 
  ggplot(aes(x = prob, fill = correct_class)) +
  geom_histogram(binwidth = 0.1)

CrossTable(wdbc_glm_predict_df$actual, wdbc_glm_predict_df$predicted)

## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## | Chi-square contribution |
## |           N / Row Total |
## |           N / Col Total |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  114 
## 
##  
##                            | wdbc_glm_predict_df$predicted 
## wdbc_glm_predict_df$actual |         B |         M | Row Total | 
## ---------------------------|-----------|-----------|-----------|
##                          B |        70 |         2 |        72 | 
##                            |    14.114 |    23.305 |           | 
##                            |     0.972 |     0.028 |     0.632 | 
##                            |     0.986 |     0.047 |           | 
##                            |     0.614 |     0.018 |           | 
## ---------------------------|-----------|-----------|-----------|
##                          M |         1 |        41 |        42 | 
##                            |    24.196 |    39.952 |           | 
##                            |     0.024 |     0.976 |     0.368 | 
##                            |     0.014 |     0.953 |           | 
##                            |     0.009 |     0.360 |           | 
## ---------------------------|-----------|-----------|-----------|
##               Column Total |        71 |        43 |       114 | 
##                            |     0.623 |     0.377 |           | 
## ---------------------------|-----------|-----------|-----------|
## 
##

#print(class(wdbc_glm_predict_df$actual))
#print(class(wdbc_glm_predict_df$predicted))
wdbc_glm_cm <- confusionMatrix(wdbc_glm_predict_df$actual, wdbc_glm_predict_df$predicted)
print(wdbc_glm_cm)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  B  M
##          B 70  2
##          M  1 41
##                                          
##                Accuracy : 0.9737         
##                  95% CI : (0.925, 0.9945)
##     No Information Rate : 0.6228         
##     P-Value [Acc > NIR] : <2e-16         
##                                          
##                   Kappa : 0.9437         
##  Mcnemar's Test P-Value : 1              
##                                          
##             Sensitivity : 0.9859         
##             Specificity : 0.9535         
##          Pos Pred Value : 0.9722         
##          Neg Pred Value : 0.9762         
##              Prevalence : 0.6228         
##          Detection Rate : 0.6140         
##    Detection Prevalence : 0.6316         
##       Balanced Accuracy : 0.9697         
##                                          
##        'Positive' Class : B              
##

#wdbc_glm_acc_2 <- NULL
wdbc_glm_acc_2 <- wdbc_glm_cm$overall['Accuracy']

Comparison of Methods

Accuracies

method	accuracy
k-means	91.0%
knn	94.7%
xgboost	93.9%
random forest	94.7%
logistic regression	97.4%

Each of the above will vary each time they are run due to differences in the random number seed used.

Learning Data Science with R:

Classification

Thomas Hopper

January 31, 2018

Caveat emptor

Introduction

Libraries

Distance between observations

Utility function

Data import and cleaning

Loading the data file

Normalizing

Cluster Analysis

Analysis using `kNN()`

Training and test data

Cluster analysis

Checking the model

Cross validation

Analysis using `kmeans()`

Xgboost

Random forest

Logistic regression

Comparison of Methods

Accuracies

Learning Data Science with R:

Classification

Thomas Hopper

January 31, 2018

Caveat emptor

Introduction

Libraries

Distance between observations

Utility function

Data import and cleaning

Loading the data file

Normalizing

Cluster Analysis

Analysis using kNN()

Training and test data

Cluster analysis

Checking the model

Cross validation

Analysis using kmeans()

Xgboost

Random forest

Logistic regression

Comparison of Methods

Accuracies

Analysis using `kNN()`

Analysis using `kmeans()`