1 Introduction

According to U.S. breast cancer statistics, about 1 in 8 U.S. women (about 12%) will develop invasive breast cancer over the course of her lifetime. In 2020, an estimated 276,480 new cases of invasive breast cancer were expected to be diagnosed in women in the U.S., and about 42,170 women in the U.S. were expected to die from the disease. The same statistics show that the overall death rate from breast cancer decreased by 1.3% per year from 2013 to 2017, a decline thought to result from treatment advances and earlier detection through screening.

Earlier detection is considered essential for diagnosing potential breast cancer. Breast cancer occurs as a result of abnormal growth of cells in the breast tissue, commonly referred to as a tumor, which can be benign (not cancerous), pre-malignant (pre-cancerous), or malignant (cancerous). Tests such as MRI, mammogram, ultrasound, and biopsy are commonly performed to diagnose breast cancer.

In this project we will analyze the Breast Cancer Wisconsin (Diagnostic) DataSet, obtained from Kaggle. This data set was created by Dr. William H. Wolberg, a physician at the University of Wisconsin Hospital in Madison, Wisconsin, USA. The data were obtained from digitized images of breast cell samples taken by biopsy from patients with solid breast masses. The dataset includes the mean value, extreme value, and standard error of 10 measures of breast cells, yielding 30 real-valued features per sample.

Approaches and objectives:

The project objective is to find the best model for classifying types of breast cancer (benign or malignant). The approaches listed below will be conducted:

  • Data preparation and EDA
  • Modeling using different methods, including generalized linear models, decision trees, random forests, and neural networks
  • Tuning hyperparameters to optimize the models
  • Testing the models to find the best one

Packages Required

library(dplyr)          # manipulate data frame
library(knitr)          # dynamic report generation and table display
library(ggplot2)        # ggplot function
library(tidyverse)      # pipe function
library(ggpubr)         # combined ggplots
library(plotrix)        # 3D Exploded Pie Chart
library(corrplot)       # correlation of variables
library(ROCR)           # get AUC
library(rpart)          # classification tree
library(rpart.plot)     # plot trees
library(randomForest)   # randomforest model
library(nnet)           # neural network
library(psych)          # perform summary statistics by group
library(ipred)          # bagging
library(caret)          # creating predictive models
library(e1071)          # SVM modeling and tuning
library(vip)            # construct variable importance plots
library(gbm)            # Gradient Boosting Machine
library(keras)          # model tuning
library(mltools)        # machine learning helper
library(data.table)     # fast data tables

Data preparation

Data introduction

The original dataset contains 569 observations and 32 variables, with no missing values. It contains the patient id, the diagnosis, and the mean, standard error, and "worst" or largest (mean of the three largest values) of 10 measurement features, resulting in 30 features (see Table 1-1 Breast Cancer Wisconsin Data). For instance, field 3 is Mean Radius, field 13 is Radius SE, and field 23 is Worst Radius.

BCD <- readr::read_csv("C:/Users/yizha/Desktop/Spring 2020/Data Mining-ll/Final/Breat cancer data.csv")
# structure of the data
str(BCD)
## Classes 'spec_tbl_df', 'tbl_df', 'tbl' and 'data.frame': 569 obs. of  32 variables:
##  $ id                     : num  842302 842517 84300903 84348301 84358402 ...
##  $ diagnosis              : chr  "M" "M" "M" "M" ...
##  $ radius_mean            : num  18 20.6 19.7 11.4 20.3 ...
##  $ texture_mean           : num  10.4 17.8 21.2 20.4 14.3 ...
##  $ perimeter_mean         : num  122.8 132.9 130 77.6 135.1 ...
##  $ area_mean              : num  1001 1326 1203 386 1297 ...
##  $ smoothness_mean        : num  0.1184 0.0847 0.1096 0.1425 0.1003 ...
##  $ compactness_mean       : num  0.2776 0.0786 0.1599 0.2839 0.1328 ...
##  $ concavity_mean         : num  0.3001 0.0869 0.1974 0.2414 0.198 ...
##  $ concave points_mean    : num  0.1471 0.0702 0.1279 0.1052 0.1043 ...
##  $ symmetry_mean          : num  0.242 0.181 0.207 0.26 0.181 ...
##  $ fractal_dimension_mean : num  0.0787 0.0567 0.06 0.0974 0.0588 ...
##  $ radius_se              : num  1.095 0.543 0.746 0.496 0.757 ...
##  $ texture_se             : num  0.905 0.734 0.787 1.156 0.781 ...
##  $ perimeter_se           : num  8.59 3.4 4.58 3.44 5.44 ...
##  $ area_se                : num  153.4 74.1 94 27.2 94.4 ...
##  $ smoothness_se          : num  0.0064 0.00522 0.00615 0.00911 0.01149 ...
##  $ compactness_se         : num  0.049 0.0131 0.0401 0.0746 0.0246 ...
##  $ concavity_se           : num  0.0537 0.0186 0.0383 0.0566 0.0569 ...
##  $ concave points_se      : num  0.0159 0.0134 0.0206 0.0187 0.0188 ...
##  $ symmetry_se            : num  0.03 0.0139 0.0225 0.0596 0.0176 ...
##  $ fractal_dimension_se   : num  0.00619 0.00353 0.00457 0.00921 0.00511 ...
##  $ radius_worst           : num  25.4 25 23.6 14.9 22.5 ...
##  $ texture_worst          : num  17.3 23.4 25.5 26.5 16.7 ...
##  $ perimeter_worst        : num  184.6 158.8 152.5 98.9 152.2 ...
##  $ area_worst             : num  2019 1956 1709 568 1575 ...
##  $ smoothness_worst       : num  0.162 0.124 0.144 0.21 0.137 ...
##  $ compactness_worst      : num  0.666 0.187 0.424 0.866 0.205 ...
##  $ concavity_worst        : num  0.712 0.242 0.45 0.687 0.4 ...
##  $ concave points_worst   : num  0.265 0.186 0.243 0.258 0.163 ...
##  $ symmetry_worst         : num  0.46 0.275 0.361 0.664 0.236 ...
##  $ fractal_dimension_worst: num  0.1189 0.089 0.0876 0.173 0.0768 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   id = col_double(),
##   ..   diagnosis = col_character(),
##   ..   radius_mean = col_double(),
##   ..   texture_mean = col_double(),
##   ..   perimeter_mean = col_double(),
##   ..   area_mean = col_double(),
##   ..   smoothness_mean = col_double(),
##   ..   compactness_mean = col_double(),
##   ..   concavity_mean = col_double(),
##   ..   `concave points_mean` = col_double(),
##   ..   symmetry_mean = col_double(),
##   ..   fractal_dimension_mean = col_double(),
##   ..   radius_se = col_double(),
##   ..   texture_se = col_double(),
##   ..   perimeter_se = col_double(),
##   ..   area_se = col_double(),
##   ..   smoothness_se = col_double(),
##   ..   compactness_se = col_double(),
##   ..   concavity_se = col_double(),
##   ..   `concave points_se` = col_double(),
##   ..   symmetry_se = col_double(),
##   ..   fractal_dimension_se = col_double(),
##   ..   radius_worst = col_double(),
##   ..   texture_worst = col_double(),
##   ..   perimeter_worst = col_double(),
##   ..   area_worst = col_double(),
##   ..   smoothness_worst = col_double(),
##   ..   compactness_worst = col_double(),
##   ..   concavity_worst = col_double(),
##   ..   `concave points_worst` = col_double(),
##   ..   symmetry_worst = col_double(),
##   ..   fractal_dimension_worst = col_double()
##   .. )
summary(BCD)
##        id             diagnosis          radius_mean      texture_mean  
##  Min.   :     8670   Length:569         Min.   : 6.981   Min.   : 9.71  
##  1st Qu.:   869218   Class :character   1st Qu.:11.700   1st Qu.:16.17  
##  Median :   906024   Mode  :character   Median :13.370   Median :18.84  
##  Mean   : 30371831                      Mean   :14.127   Mean   :19.29  
##  3rd Qu.:  8813129                      3rd Qu.:15.780   3rd Qu.:21.80  
##  Max.   :911320502                      Max.   :28.110   Max.   :39.28  
##  perimeter_mean     area_mean      smoothness_mean   compactness_mean 
##  Min.   : 43.79   Min.   : 143.5   Min.   :0.05263   Min.   :0.01938  
##  1st Qu.: 75.17   1st Qu.: 420.3   1st Qu.:0.08637   1st Qu.:0.06492  
##  Median : 86.24   Median : 551.1   Median :0.09587   Median :0.09263  
##  Mean   : 91.97   Mean   : 654.9   Mean   :0.09636   Mean   :0.10434  
##  3rd Qu.:104.10   3rd Qu.: 782.7   3rd Qu.:0.10530   3rd Qu.:0.13040  
##  Max.   :188.50   Max.   :2501.0   Max.   :0.16340   Max.   :0.34540  
##  concavity_mean    concave points_mean symmetry_mean    fractal_dimension_mean
##  Min.   :0.00000   Min.   :0.00000     Min.   :0.1060   Min.   :0.04996       
##  1st Qu.:0.02956   1st Qu.:0.02031     1st Qu.:0.1619   1st Qu.:0.05770       
##  Median :0.06154   Median :0.03350     Median :0.1792   Median :0.06154       
##  Mean   :0.08880   Mean   :0.04892     Mean   :0.1812   Mean   :0.06280       
##  3rd Qu.:0.13070   3rd Qu.:0.07400     3rd Qu.:0.1957   3rd Qu.:0.06612       
##  Max.   :0.42680   Max.   :0.20120     Max.   :0.3040   Max.   :0.09744       
##    radius_se        texture_se      perimeter_se       area_se       
##  Min.   :0.1115   Min.   :0.3602   Min.   : 0.757   Min.   :  6.802  
##  1st Qu.:0.2324   1st Qu.:0.8339   1st Qu.: 1.606   1st Qu.: 17.850  
##  Median :0.3242   Median :1.1080   Median : 2.287   Median : 24.530  
##  Mean   :0.4052   Mean   :1.2169   Mean   : 2.866   Mean   : 40.337  
##  3rd Qu.:0.4789   3rd Qu.:1.4740   3rd Qu.: 3.357   3rd Qu.: 45.190  
##  Max.   :2.8730   Max.   :4.8850   Max.   :21.980   Max.   :542.200  
##  smoothness_se      compactness_se      concavity_se     concave points_se 
##  Min.   :0.001713   Min.   :0.002252   Min.   :0.00000   Min.   :0.000000  
##  1st Qu.:0.005169   1st Qu.:0.013080   1st Qu.:0.01509   1st Qu.:0.007638  
##  Median :0.006380   Median :0.020450   Median :0.02589   Median :0.010930  
##  Mean   :0.007041   Mean   :0.025478   Mean   :0.03189   Mean   :0.011796  
##  3rd Qu.:0.008146   3rd Qu.:0.032450   3rd Qu.:0.04205   3rd Qu.:0.014710  
##  Max.   :0.031130   Max.   :0.135400   Max.   :0.39600   Max.   :0.052790  
##   symmetry_se       fractal_dimension_se  radius_worst   texture_worst  
##  Min.   :0.007882   Min.   :0.0008948    Min.   : 7.93   Min.   :12.02  
##  1st Qu.:0.015160   1st Qu.:0.0022480    1st Qu.:13.01   1st Qu.:21.08  
##  Median :0.018730   Median :0.0031870    Median :14.97   Median :25.41  
##  Mean   :0.020542   Mean   :0.0037949    Mean   :16.27   Mean   :25.68  
##  3rd Qu.:0.023480   3rd Qu.:0.0045580    3rd Qu.:18.79   3rd Qu.:29.72  
##  Max.   :0.078950   Max.   :0.0298400    Max.   :36.04   Max.   :49.54  
##  perimeter_worst    area_worst     smoothness_worst  compactness_worst
##  Min.   : 50.41   Min.   : 185.2   Min.   :0.07117   Min.   :0.02729  
##  1st Qu.: 84.11   1st Qu.: 515.3   1st Qu.:0.11660   1st Qu.:0.14720  
##  Median : 97.66   Median : 686.5   Median :0.13130   Median :0.21190  
##  Mean   :107.26   Mean   : 880.6   Mean   :0.13237   Mean   :0.25427  
##  3rd Qu.:125.40   3rd Qu.:1084.0   3rd Qu.:0.14600   3rd Qu.:0.33910  
##  Max.   :251.20   Max.   :4254.0   Max.   :0.22260   Max.   :1.05800  
##  concavity_worst  concave points_worst symmetry_worst   fractal_dimension_worst
##  Min.   :0.0000   Min.   :0.00000      Min.   :0.1565   Min.   :0.05504        
##  1st Qu.:0.1145   1st Qu.:0.06493      1st Qu.:0.2504   1st Qu.:0.07146        
##  Median :0.2267   Median :0.09993      Median :0.2822   Median :0.08004        
##  Mean   :0.2722   Mean   :0.11461      Mean   :0.2901   Mean   :0.08395        
##  3rd Qu.:0.3829   3rd Qu.:0.16140      3rd Qu.:0.3179   3rd Qu.:0.09208        
##  Max.   :1.2520   Max.   :0.29100      Max.   :0.6638   Max.   :0.20750
# check the missing value
colSums(is.na(BCD))
##                      id               diagnosis             radius_mean 
##                       0                       0                       0 
##            texture_mean          perimeter_mean               area_mean 
##                       0                       0                       0 
##         smoothness_mean        compactness_mean          concavity_mean 
##                       0                       0                       0 
##     concave points_mean           symmetry_mean  fractal_dimension_mean 
##                       0                       0                       0 
##               radius_se              texture_se            perimeter_se 
##                       0                       0                       0 
##                 area_se           smoothness_se          compactness_se 
##                       0                       0                       0 
##            concavity_se       concave points_se             symmetry_se 
##                       0                       0                       0 
##    fractal_dimension_se            radius_worst           texture_worst 
##                       0                       0                       0 
##         perimeter_worst              area_worst        smoothness_worst 
##                       0                       0                       0 
##       compactness_worst         concavity_worst    concave points_worst 
##                       0                       0                       0 
##          symmetry_worst fractal_dimension_worst 
##                       0                       0
# remove column id
BCD <- BCD[,-1]
# convert the diagnosis to factor
BCD$diagnosis <- as.factor(BCD$diagnosis)

The description of the ten measurements is listed as follows:

BCD.type <- lapply(BCD, class)  # keep diagnosis so the 31 types align with the 31 names and descriptions below
BCD.var_desc <- c("The diagnosis of breast tissues (M = malignant, B = benign)",
                  "mean of distances from center to points on the perimeter",
                  "standard deviation of gray-scale values",
                  "mean size of the core tumor",
                  "Not provided",
                  "mean of local variation in radius lengths",
                  "mean of perimeter^2 / area - 1.0",
                  "mean of severity of concave portions of the contour",
                  "mean for number of concave portions of the contour",
                  "Not provided",
                  "mean for 'coastline approximation' - 1",
                  "standard error for the mean of distances from center to points on the perimeter",
                  "standard error for standard deviation of gray-scale values",
                  "Not provided",
                  "Not provided",
                  "standard error for local variation in radius lengths",
                  "standard error for perimeter^2 / area - 1.0",
                  "standard error for severity of concave portions of the contour",
                  "standard error for number of concave portions of the contour",
                  "Not provided",
                  "standard error for 'coastline approximation' - 1",
                  "'worst' or largest mean value for mean of distances from center to points on the perimeter",
                  "'worst' or largest mean value for standard deviation of gray-scale values",
                  "Not provided",
                  "Not provided",
                  "'worst' or largest mean value for local variation in radius lengths",
                  "'worst' or largest mean value for perimeter^2 / area - 1.0",
                  "'worst' or largest mean value for severity of concave portions of the contour",
                  "'worst' or largest mean value for number of concave portions of the contour",
                  "Not provided",
                  "'worst' or largest mean value for 'coastline approximation' - 1")
BCD.var_names <- colnames(BCD)
data.description <- as_data_frame(cbind(BCD.var_names, BCD.type, BCD.var_desc))
colnames(data.description) <- c("Variable name", "Data Type", "Variable Description")
kable(data.description[1:11,], caption = "Table 1-2 Data Description")
Table 1-2 Data Description

Variable name            Data Type  Variable Description
-----------------------  ---------  ------------------------------------------------------------
diagnosis                factor     The diagnosis of breast tissues (M = malignant, B = benign)
radius_mean              numeric    mean of distances from center to points on the perimeter
texture_mean             numeric    standard deviation of gray-scale values
perimeter_mean           numeric    mean size of the core tumor
area_mean                numeric    Not provided
smoothness_mean          numeric    mean of local variation in radius lengths
compactness_mean         numeric    mean of perimeter^2 / area - 1.0
concavity_mean           numeric    mean of severity of concave portions of the contour
concave points_mean      numeric    mean for number of concave portions of the contour
symmetry_mean            numeric    Not provided
fractal_dimension_mean   numeric    mean for 'coastline approximation' - 1

Exploratory Data Analysis (EDA)

  1. Data summary and histogram

The patient id was removed, so the dataset now has 30 numeric variables and one binary response variable, “diagnosis”, with two categories: “M” (malignant) and “B” (benign). The pie chart shows that malignant cases account for 37.26% of the data (see Figure 1-1 Pie chart of diagnosis). A quick summary and histograms (see Figure 1-2) of variables 1 to 11 (the mean values of the 10 measurements) are listed below. All variables show various degrees of skewness, especially “concavity_mean” and “concave points_mean”. Skewness is usually undesirable, so transformations might be needed.

pie <- BCD %>% 
  group_by(diagnosis) %>% 
  count(diagnosis) %>% 
  summarise(percent = round(n/nrow(BCD)*100,2), total = n)

# plot the pie of diagnosis
slices <- c(sum(BCD$diagnosis=="B"),sum(BCD$diagnosis=="M"))
lbls <- c("Benign","Malignant")
pie3D(slices,labels=lbls,explode=0.1,
   main="Figure 1-1: Pie Chart of diagnosis ")  

BCD_mean <- BCD[,1:11]
ggplot(gather(BCD_mean,key,value,2:11), aes(x=value,fill=diagnosis)) + 
    geom_histogram(bins = 20) +
  #geom_density() +
  facet_wrap(~key,scales = "free_x") +
  ggtitle("Figure 1_2 Histograms of Breast Cancer Variables")

  2. Boxplot and outliers

The outlier percentage in each variable is no more than 2%; therefore, we won’t remove any outliers unless further evidence shows they are influential. The boxplots (Figure 1-3) also show that the median values of most variables are higher for malignant cases than for benign cases.

# Percent outliers in every column
outlier_percentage <- BCD[,2:11] %>% 
  summarise(across(everything(), ~ sum(. > mean(.) + 3*sd(.) | . < mean(.) - 3*sd(.))/n()))
round(outlier_percentage*100,2)
##   radius_mean texture_mean perimeter_mean area_mean smoothness_mean
## 1        0.88          0.7           1.23      1.41            0.88
##   compactness_mean concavity_mean concave points_mean symmetry_mean
## 1             1.58           1.58                1.05          0.88
##   fractal_dimension_mean
## 1                   1.23
  3. Bivariate correlation

A heatmap was created to examine the bivariate relationships among these 30 predictors, which were derived from images of breast mass cells and carry size, shape, and density information about the samples. It is not surprising that some variables are highly correlated (Figure 1-4). Multicollinearity may cause trouble in linear regression models, where it can be handled by variable selection approaches such as stepwise selection or LASSO; however, it is usually not problematic in other models such as random forests, gradient boosting machines, or neural networks. We can tune our models to achieve the best performance.
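
As an aside on the LASSO remedy mentioned above, here is a minimal sketch; it assumes the glmnet package (not among the packages loaded earlier) and is illustrative rather than part of this report’s modeling pipeline.

# Hedged sketch: LASSO-penalized logistic regression. Coefficients shrunk to
# zero indicate predictors dropped by the penalty. Assumes glmnet is installed.
library(glmnet)
x.lasso <- as.matrix(BCD[,-1])   # the 30 numeric predictors
y.lasso <- BCD$diagnosis         # factor with levels B, M
cv.lasso <- cv.glmnet(x.lasso, y.lasso, family = "binomial", alpha = 1)
coef(cv.lasso, s = "lambda.1se") # sparse coefficient vector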

BCD_sub <- BCD[,-1]
corrplot(cor(BCD_sub, method = "pearson"), number.cex = .9, method = "square", 
         hclust.method = "ward", order = "FPC",
         type = "full", tl.cex=0.8,tl.col = "black",
         title = "Figure 1-4 Collinearity of predicters")

We can tell from the heatmap that the highest correlations are between:

  • area_worst and radius_mean, perimeter_mean, concave points_mean.
  • perimeter_se and area_se, area_mean, area_worst, radius_mean, radius_worst, perimeter_worst, perimeter_mean, concave points_worst, concavity_mean.
  • texture_mean and texture_worst.
  • perimeter_mean and radius_worst

Let’s show plots for some highly correlated features, inversely correlated features, and weakly correlated features, grouped by diagnosis.

  • Highly correlated variables:
high1 <- BCD %>% 
  ggplot(aes(x = radius_worst, y = perimeter_mean, color = diagnosis))+
  geom_point()+
  stat_ellipse()
high2 <- BCD %>% 
  ggplot(aes(x = radius_worst, y = area_worst, color = diagnosis))+
  geom_point()+
  stat_ellipse()
high3 <- BCD %>% 
  ggplot(aes(x = texture_worst, y = texture_mean, color = diagnosis))+
  geom_point()+
  stat_ellipse()
high4 <- BCD %>% 
  ggplot(aes(x = area_worst, y = perimeter_mean, color = diagnosis))+
  geom_point()+
  stat_ellipse()
ggarrange(high1,high2, labels = c("Perimeter mean vs. Radius worst", "Area worst vs. Radius worst"),
          common.legend = TRUE, legend = "bottom", nrow = 1, ncol = 2)

ggarrange(high3,high4, labels = c("Texture mean vs. Texture worst", "Perimeter mean vs. Area worst"),
          common.legend = TRUE, legend = "bottom", nrow = 1, ncol = 2)

  • Inversely correlated variables:
inverse1 <- BCD %>% 
  ggplot(aes(x = radius_mean, y = fractal_dimension_mean, color = diagnosis))+
  geom_point()+
  stat_ellipse()
inverse2 <- BCD %>% 
  ggplot(aes(x = area_mean, y = fractal_dimension_mean, color = diagnosis))+
  geom_point()+
  stat_ellipse()

ggarrange(inverse1,inverse2,labels = c("Fractal dimension mean vs. Radius mean", "Fractal dimension mean vs. Area mean"),common.legend = TRUE, legend = "bottom", nrow = 1, ncol = 2)

  • Weakly correlated variables:
low1 <- BCD %>% 
  ggplot(aes(x = texture_mean, y = smoothness_mean, color = diagnosis))+
  geom_point()+
  stat_ellipse()
low2 <- BCD %>% 
  ggplot(aes(x = perimeter_worst, y = fractal_dimension_se, color = diagnosis))+
  geom_point()+
  stat_ellipse()

ggarrange(low1,low2,labels = c("Smoothness mean vs. Texture mean", "Fractal dimension SE vs. Perimeter worst"),common.legend = TRUE, legend = "bottom", nrow = 1, ncol = 2)

2 Statistical Modeling

Create training and test samples by splitting the data set into training (80%) and testing (20%) sets.

set.seed(20470011)
BCD <- rename(BCD, concave_points_mean = `concave points_mean`,concave_points_se = `concave points_se`, concave_points_worst = `concave points_worst`)
index <- sample(nrow(BCD),nrow(BCD)*0.8)
BCD.train <- BCD[index,]
BCD.test <-  BCD[-index,]
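
Note that sample() draws rows at random without regard to class balance. A stratified alternative (a sketch only; not the split used for the results below) is caret::createDataPartition(), which preserves the B/M ratio in both subsets.

# Sketch of a stratified 80/20 split; shown for comparison, not used below.
set.seed(20470011)
idx <- caret::createDataPartition(BCD$diagnosis, p = 0.8, list = FALSE)
# e.g. BCD[idx,] as training data and BCD[-idx,] as test data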

2.1 SVM model

The support-vector machine (SVM) is one of the most popular classification algorithms in the machine learning community.

2.1.1 Fit a SVM model

We apply this approach to the BCD dataset. We first fit a general SVM model without specifying values for cost and gamma. This fit uses the radial kernel with cost 1, and has a misclassification rate of 0.0088 and an accuracy of 0.9912. 114 support vectors are used to build the model.

svm.model <- svm(diagnosis ~ ., data = BCD.train)
summary(svm.model)
## 
## Call:
## svm(formula = diagnosis ~ ., data = BCD.train)
## 
## 
## Parameters:
##    SVM-Type:  C-classification 
##  SVM-Kernel:  radial 
##        cost:  1 
## 
## Number of Support Vectors:  114
## 
##  ( 57 57 )
## 
## 
## Number of Classes:  2 
## 
## Levels: 
##  B M
svm.pred <- predict(svm.model, BCD.test[,-1])
confusionMatrix(svm.pred, BCD.test$diagnosis)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  B  M
##          B 75  0
##          M  1 38
##                                           
##                Accuracy : 0.9912          
##                  95% CI : (0.9521, 0.9998)
##     No Information Rate : 0.6667          
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.9804          
##                                           
##  Mcnemar's Test P-Value : 1               
##                                           
##             Sensitivity : 0.9868          
##             Specificity : 1.0000          
##          Pos Pred Value : 1.0000          
##          Neg Pred Value : 0.9744          
##              Prevalence : 0.6667          
##          Detection Rate : 0.6579          
##    Detection Prevalence : 0.6579          
##       Balanced Accuracy : 0.9934          
##                                           
##        'Positive' Class : B               
## 

2.1.2 Tune parameters by tune.svm()

Since SVM is very sensitive to the choice of parameters, we use the tune.svm() method to tune the parameters cost and gamma. This algorithm returns the values of cost and gamma that minimize the classification error under 10-fold cross validation. After tuning these two parameters, we find the best values are 10 for cost and 0.01 for gamma. The best performance (cross-validation error) is 0.0287, and the number of support vectors is 60.

tuned_parameters <- tune.svm(diagnosis ~ ., data = BCD.train, gamma = 10^(-5:-1), cost = 10^(-3:1))
summary(tuned_parameters)
## 
## Parameter tuning of 'svm':
## 
## - sampling method: 10-fold cross validation 
## 
## - best parameters:
##  gamma cost
##   0.01   10
## 
## - best performance: 0.02869565 
## 
## - Detailed performance results:
##    gamma  cost      error dispersion
## 1  1e-05 1e-03 0.38217391 0.06786038
## 2  1e-04 1e-03 0.38217391 0.06786038
## 3  1e-03 1e-03 0.38217391 0.06786038
## 4  1e-02 1e-03 0.38217391 0.06786038
## 5  1e-01 1e-03 0.38217391 0.06786038
## 6  1e-05 1e-02 0.38217391 0.06786038
## 7  1e-04 1e-02 0.38217391 0.06786038
## 8  1e-03 1e-02 0.38217391 0.06786038
## 9  1e-02 1e-02 0.38217391 0.06786038
## 10 1e-01 1e-02 0.38217391 0.06786038
## 11 1e-05 1e-01 0.38217391 0.06786038
## 12 1e-04 1e-01 0.38217391 0.06786038
## 13 1e-03 1e-01 0.22821256 0.08066595
## 14 1e-02 1e-01 0.05053140 0.01801845
## 15 1e-01 1e-01 0.06371981 0.03513749
## 16 1e-05 1e+00 0.38217391 0.06786038
## 17 1e-04 1e+00 0.22599034 0.07979023
## 18 1e-03 1e+00 0.05053140 0.01801845
## 19 1e-02 1e+00 0.03971014 0.02526980
## 20 1e-01 1e+00 0.04410628 0.03288026
## 21 1e-05 1e+01 0.22376812 0.08089630
## 22 1e-04 1e+01 0.05053140 0.01801845
## 23 1e-03 1e+01 0.03526570 0.02387167
## 24 1e-02 1e+01 0.02869565 0.02573186
## 25 1e-01 1e+01 0.05077295 0.03466047
tuned_parameters$best.model
## 
## Call:
## best.svm(x = diagnosis ~ ., data = BCD.train, gamma = 10^(-5:-1), 
##     cost = 10^(-3:1))
## 
## 
## Parameters:
##    SVM-Type:  C-classification 
##  SVM-Kernel:  radial 
##        cost:  10 
## 
## Number of Support Vectors:  60

2.1.3 Fit an SVM model using the optimal parameters

We now use the tuned parameters cost = 10 and gamma = 0.01 to fit the model. This model also has an accuracy of 0.9912 on the test data, just like the general model. However, this time only 60 support vectors are used.

BCD.svm <- svm(diagnosis ~ ., data = BCD.train, cost = 10, gamma = 0.01)
summary(BCD.svm)
## 
## Call:
## svm(formula = diagnosis ~ ., data = BCD.train, cost = 10, gamma = 0.01)
## 
## 
## Parameters:
##    SVM-Type:  C-classification 
##  SVM-Kernel:  radial 
##        cost:  10 
## 
## Number of Support Vectors:  60
## 
##  ( 27 33 )
## 
## 
## Number of Classes:  2 
## 
## Levels: 
##  B M
svm.predict <- predict(BCD.svm, BCD.test[,-1])
confusionMatrix(svm.predict, BCD.test$diagnosis)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  B  M
##          B 76  1
##          M  0 37
##                                           
##                Accuracy : 0.9912          
##                  95% CI : (0.9521, 0.9998)
##     No Information Rate : 0.6667          
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.9801          
##                                           
##  Mcnemar's Test P-Value : 1               
##                                           
##             Sensitivity : 1.0000          
##             Specificity : 0.9737          
##          Pos Pred Value : 0.9870          
##          Neg Pred Value : 1.0000          
##              Prevalence : 0.6667          
##          Detection Rate : 0.6667          
##    Detection Prevalence : 0.6754          
##       Balanced Accuracy : 0.9868          
##                                           
##        'Positive' Class : B               
## 

The in-sample Accuracy is 0.9868.

svm.predict.train <- predict(BCD.svm, BCD.train[,-1])
confusionMatrix(svm.predict.train, BCD.train$diagnosis)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   B   M
##          B 281   6
##          M   0 168
##                                           
##                Accuracy : 0.9868          
##                  95% CI : (0.9715, 0.9951)
##     No Information Rate : 0.6176          
##     P-Value [Acc > NIR] : < 2e-16         
##                                           
##                   Kappa : 0.9719          
##                                           
##  Mcnemar's Test P-Value : 0.04123         
##                                           
##             Sensitivity : 1.0000          
##             Specificity : 0.9655          
##          Pos Pred Value : 0.9791          
##          Neg Pred Value : 1.0000          
##              Prevalence : 0.6176          
##          Detection Rate : 0.6176          
##    Detection Prevalence : 0.6308          
##       Balanced Accuracy : 0.9828          
##                                           
##        'Positive' Class : B               
## 

The AUC before tuning is 0.967; it increases to 0.98 after tuning.

BCD.svm0 <- svm(diagnosis ~ ., data = BCD.train, probability = TRUE)
pre.svm0 <-  predict(BCD.svm0, BCD.test, probability = TRUE)
prob.svm0  <-  attr(pre.svm0, 'probabilities')[,2] 
pred.svm0  <-  as.numeric((prob.svm0 <= 5/6))
svm.pred0 <- predict(BCD.svm0, BCD.test[,-1], type="prob")
pred0 <- prediction(pred.svm0, BCD.test$diagnosis)
perf0 <- performance(pred0, "tpr", "fpr")
plot(perf0, colorize=TRUE)

AUC0 <- slot(performance(pred0, "auc"), "y.values")[[1]]
AUC0
## [1] 0.9671053
BCD.svm1 <- svm(diagnosis ~ ., data = BCD.train, cost = 10, gamma = 0.01, probability = TRUE)
pre.svm1 <-  predict(BCD.svm1, BCD.test, probability = TRUE)
prob.svm  <-  attr(pre.svm1, 'probabilities')[,2] 
pred.svm  <-  as.numeric((prob.svm <= 5/6))
svm.pred <- predict(BCD.svm1, BCD.test[,-1], type="prob")
pred <- prediction(pred.svm, BCD.test$diagnosis)
perf <- performance(pred, "tpr", "fpr")
plot(perf, colorize=TRUE)

AUC <- slot(performance(pred, "auc"), "y.values")[[1]]
AUC
## [1] 0.9802632
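
Note that the ROC curves above are built from hard 0/1 predictions at the fixed cutoff of 5/6, so the reported AUCs reflect that single threshold. A common alternative, sketched below, is to pass the class probabilities to ROCR directly so the curve sweeps all thresholds; the “M” column of the probability matrix is assumed to be the positive class.

# Sketch: threshold-free AUC using the malignant-class probabilities directly.
prob.M1 <- attr(pre.svm1, "probabilities")[, "M"]
pred.cont <- prediction(prob.M1, BCD.test$diagnosis)
unlist(slot(performance(pred.cont, "auc"), "y.values"))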

2.1.4 Tune parameters by train()

Use the train() function to tune and train the SVM model, specifying the radial basis kernel with automatic tuning of the sigma parameter and 10-fold CV. From the results, we can see that moderate values of the cost parameter C (around 4) provide the best cross-validated accuracy scores for these training data. The result is slightly different from that obtained by the tune.svm() function, possibly because tune.svm() minimizes overall cross-validation error on its own grid, while train() ranks candidate values of C by cross-validated accuracy on the training data.

set.seed(20470011) 
BCD.svm.1 <- train(
  diagnosis ~ ., 
  data = BCD.train,
  method = "svmRadial",               
  preProcess = c("center", "scale"),  
  trControl = trainControl(method = "cv", number = 10),
  tuneLength = 10
)

ggplot(BCD.svm.1) + theme_light()

BCD.svm.1$results
##         sigma      C  Accuracy     Kappa AccuracySD    KappaSD
## 1  0.03883888   0.25 0.9537198 0.9001411 0.03354604 0.07295343
## 2  0.03883888   0.50 0.9648289 0.9241270 0.02594200 0.05660139
## 3  0.03883888   1.00 0.9714935 0.9387529 0.02088992 0.04573638
## 4  0.03883888   2.00 0.9737157 0.9434258 0.02273335 0.04960782
## 5  0.03883888   4.00 0.9757950 0.9479094 0.01949094 0.04297412
## 6  0.03883888   8.00 0.9736694 0.9438579 0.02271594 0.04877337
## 7  0.03883888  16.00 0.9603844 0.9166942 0.03730449 0.07869705
## 8  0.03883888  32.00 0.9559400 0.9077688 0.04173570 0.08731562
## 9  0.03883888  64.00 0.9492733 0.8935080 0.04446626 0.09238656
## 10 0.03883888 128.00 0.9492733 0.8935080 0.04446626 0.09238656
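
The single sigma value in the table comes from caret’s automatic estimate; under the hood, the svmRadial method bases it on kernlab’s sigest() heuristic. A sketch of the underlying call (kernlab is required by svmRadial anyway):

# Sketch: the sigma range that caret's svmRadial autotuning draws on.
kernlab::sigest(diagnosis ~ ., data = BCD.train)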

The in-sample accuracy of the model tuned by the train() function is 0.989, and the out-of-sample accuracy is 0.9912, the same as the model tuned by the tune.svm() function.

svm.predict.train.1 <- predict(BCD.svm.1, BCD.train[,-1])
confusionMatrix(svm.predict.train.1, BCD.train$diagnosis)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   B   M
##          B 281   5
##          M   0 169
##                                           
##                Accuracy : 0.989           
##                  95% CI : (0.9745, 0.9964)
##     No Information Rate : 0.6176          
##     P-Value [Acc > NIR] : < 2e-16         
##                                           
##                   Kappa : 0.9766          
##                                           
##  Mcnemar's Test P-Value : 0.07364         
##                                           
##             Sensitivity : 1.0000          
##             Specificity : 0.9713          
##          Pos Pred Value : 0.9825          
##          Neg Pred Value : 1.0000          
##              Prevalence : 0.6176          
##          Detection Rate : 0.6176          
##    Detection Prevalence : 0.6286          
##       Balanced Accuracy : 0.9856          
##                                           
##        'Positive' Class : B               
## 
svm.predict.1 <- predict(BCD.svm.1, BCD.test[,-1])
confusionMatrix(svm.predict.1, BCD.test$diagnosis)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  B  M
##          B 75  0
##          M  1 38
##                                           
##                Accuracy : 0.9912          
##                  95% CI : (0.9521, 0.9998)
##     No Information Rate : 0.6667          
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.9804          
##                                           
##  Mcnemar's Test P-Value : 1               
##                                           
##             Sensitivity : 0.9868          
##             Specificity : 1.0000          
##          Pos Pred Value : 1.0000          
##          Neg Pred Value : 0.9744          
##              Prevalence : 0.6667          
##          Detection Rate : 0.6579          
##    Detection Prevalence : 0.6579          
##       Balanced Accuracy : 0.9934          
##                                           
##        'Positive' Class : B               
## 

2.1.5 Class probabilities

SVMs classify new observations by determining which side of the decision boundary they fall on, so they do not automatically provide predicted class probabilities; additional parameters need to be estimated to obtain them. Predicted class probabilities are often more useful than the predicted class labels; for instance, we need them when using an optimization metric like AUC, as opposed to classification accuracy. In terms of AUC, we can see that, in general, smaller values of the cost parameter C provide better cross-validated AUC scores on the training data. When C is around 2, the model achieves the highest AUC. The result also shows the corresponding sensitivity and specificity.

ctrl <- trainControl(
  method = "cv", 
  number = 10, 
  classProbs = TRUE,                 
  summaryFunction = twoClassSummary
)

set.seed(20470011)
BCD.svm.auc <- train(
  diagnosis ~ ., 
  data = BCD.train,
  method = "svmRadial",               
  preProcess = c("center", "scale"),  
  metric = "ROC",  # area under ROC curve (AUC)       
  trControl = ctrl,
  tuneLength = 10
)

ggplot(BCD.svm.auc) + theme_light()

BCD.svm.auc$results
##         sigma      C       ROC      Sens      Spec      ROCSD     SensSD
## 1  0.03883888   0.25 0.9894491 0.9645320 0.9196078 0.01338089 0.02883399
## 2  0.03883888   0.50 0.9910899 0.9752463 0.9428105 0.01083490 0.02358516
## 3  0.03883888   1.00 0.9925557 0.9788177 0.9542484 0.01066835 0.02966933
## 4  0.03883888   2.00 0.9929525 0.9822660 0.9542484 0.01000411 0.02515988
## 5  0.03883888   4.00 0.9925207 0.9821429 0.9598039 0.01098303 0.03035131
## 6  0.03883888   8.00 0.9912952 0.9928571 0.9542484 0.01172856 0.01505847
## 7  0.03883888  16.00 0.9902866 0.9678571 0.9542484 0.01405453 0.04595300
## 8  0.03883888  32.00 0.9886526 0.9642857 0.9486928 0.01643318 0.05323971
## 9  0.03883888  64.00 0.9861248 0.9642857 0.9486928 0.01604885 0.05832118
## 10 0.03883888 128.00 0.9859147 0.9642857 0.9372549 0.01601354 0.05323971
##        SpecSD
## 1  0.06738334
## 2  0.06671168
## 3  0.05298643
## 4  0.04606182
## 5  0.05548182
## 6  0.05376454
## 7  0.05376454
## 8  0.06849851
## 9  0.06329435
## 10 0.07365317

In this case it is clear that our model does a good job at predicting both the Ms and the Bs. The average cross-validated accuracy is 0.9714.

confusionMatrix(BCD.svm.auc)
## Cross-Validated (10 fold) Confusion Matrix 
## 
## (entries are percentual average cell counts across resamples)
##  
##           Reference
## Prediction    B    M
##          B 60.7  1.8
##          M  1.1 36.5
##                             
##  Accuracy (average) : 0.9714
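
With classProbs = TRUE, the fitted caret model can also return the per-class probabilities themselves; a one-line sketch:

# Sketch: predicted class probabilities for the test observations.
head(predict(BCD.svm.auc, newdata = BCD.test, type = "prob"))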

2.1.6 Feature interpretation

We use vip() to quantify the importance of each feature using the permutation approach, since SVMs do not provide measures of feature importance automatically. In this case, the top 10 most important predictors in terms of prediction are as follows.

prob.M <- function(object, newdata) {
  predict(object, newdata = newdata, type = "prob")[, "M"]
}
set.seed(20470011) 
vip(BCD.svm.auc, method = "permute", nsim = 5, train = BCD.train, 
    target = "diagnosis", metric = "auc", reference_class = "M", 
    pred_wrapper = prob.M)

2.1.7 Visualize the svm classification

In order to visualize the SVM classification, we reduce the dimensionality to 2 by using the top 2 most important predictors as the dimensions. We use the model tuned by the tune.svm() function. The plots for both the training data and the test data show relatively clear boundaries. However, there is no obvious separating hyperplane in the plots, probably due to the high dimensionality of the data.

plot(BCD.svm, BCD.train, texture_worst ~ radius_se,
     slice = list(compactness_se = 3, radius_worst = 4), main = "in-sample svm classification plot")

plot(BCD.svm, BCD.test, texture_worst ~ radius_se,
     slice = list(compactness_se = 3, radius_worst = 4), main = "out-of-sample svm classification plot")

In general, SVM models perform very well with or without tuned parameters. This is probably because the data itself is quite well suited to SVM.

2.2 RandomForest

2.2.1 Model building and performance check

Fit an RF model on the BCD.train dataset using the defaults ntree = 500 and mtry = 5:

BCD.rf<-randomForest(diagnosis~., data = BCD.train, importance=TRUE)
BCD.rf
## 
## Call:
##  randomForest(formula = diagnosis ~ ., data = BCD.train, importance = TRUE) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 5
## 
##         OOB estimate of  error rate: 3.96%
## Confusion matrix:
##     B   M class.error
## B 274   7  0.02491103
## M  11 163  0.06321839

Check variable importance:

imp <- data.frame(BCD.rf$importance)
imp %>% 
  ggplot(aes(x=reorder(rownames(imp),MeanDecreaseAccuracy),y=MeanDecreaseAccuracy))+
  geom_bar(stat="identity")+
  coord_flip()+
  ggtitle("Variable importance (accuracy)")

imp %>% 
  ggplot(aes(x=reorder(rownames(imp),MeanDecreaseGini),y=MeanDecreaseGini))+
  geom_bar(stat="identity")+
  coord_flip()+
  ggtitle("Variable importance (Gini)")

AUC

BCD.rf.pred<- predict(BCD.rf, newdata = BCD.test, type = "prob")[,2]
pred <- prediction(BCD.rf.pred, BCD.test$diagnosis)
perf <- performance(pred, "tpr", "fpr")
plot(perf, colorize=TRUE)

unlist(slot(performance(pred, "auc"), "y.values"))
## [1] 0.9951524

Misclassification rate (MR) and confusion matrix on the test set:

BCD.test.pred <- predict(BCD.rf, BCD.test[,-1])
cm.test <- confusionMatrix(BCD.test.pred, BCD.test$diagnosis)
cm.test
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  B  M
##          B 73  2
##          M  3 36
##                                           
##                Accuracy : 0.9561          
##                  95% CI : (0.9006, 0.9856)
##     No Information Rate : 0.6667          
##     P-Value [Acc > NIR] : 4.243e-14       
##                                           
##                   Kappa : 0.902           
##                                           
##  Mcnemar's Test P-Value : 1               
##                                           
##             Sensitivity : 0.9605          
##             Specificity : 0.9474          
##          Pos Pred Value : 0.9733          
##          Neg Pred Value : 0.9231          
##              Prevalence : 0.6667          
##          Detection Rate : 0.6404          
##    Detection Prevalence : 0.6579          
##       Balanced Accuracy : 0.9539          
##                                           
##        'Positive' Class : B               
## 

Plot error vs. ntree. As we can see, the OOB error, FPR, and FNR are all very low, which means the RF model works well for the BCD.train dataset. However, we will still try tuning the parameters to see whether we can get even better results.

plot(BCD.rf, lwd=rep(2, 3))
legend("right", legend = c("OOB Error", "FPR", "FNR"), lwd=rep(2, 3), lty = c(1,2,3), col = c("black", "red", "green"))

2.2.2 Tuning parameters

Let’s create a baseline for comparison by using the recommended defaults for each parameter: mtry = floor(sqrt(ncol(BCD.train))), which is 5 here, and ntree = 500.

Create a model with default parameters:

control <- trainControl(method="repeatedcv", number=10, repeats=3)
set.seed(20470011)
metric <- "Accuracy"
mtry <- sqrt(ncol(BCD.train))
tunegrid <- expand.grid(.mtry=floor(sqrt(ncol(BCD.train))))
rf_default <- train(diagnosis~., data=BCD.train, method="rf", metric=metric, tuneGrid=tunegrid, trControl=control)
print(rf_default)
## Random Forest 
## 
## 455 samples
##  30 predictor
##   2 classes: 'B', 'M' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 3 times) 
## Summary of sample sizes: 410, 409, 410, 409, 410, 409, ... 
## Resampling results:
## 
##   Accuracy   Kappa    
##   0.9574355  0.9091911
## 
## Tuning parameter 'mtry' was held constant at a value of 5
rf_default.test<-train(diagnosis~., data=BCD.test, method="rf", metric=metric, tuneGrid=tunegrid, trControl=control)
print(rf_default.test)
## Random Forest 
## 
## 114 samples
##  30 predictor
##   2 classes: 'B', 'M' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 3 times) 
## Summary of sample sizes: 102, 103, 104, 103, 102, 102, ... 
## Resampling results:
## 
##   Accuracy   Kappa    
##   0.9338384  0.8445196
## 
## Tuning parameter 'mtry' was held constant at a value of 5

One search strategy is to try random values within a range. This can be useful when we are unsure what the value should be and want to overcome any bias we may have in setting the parameter (such as the suggested equation above). Let’s try a random search for mtry using caret:

Tuning mtry (random search):

control <- trainControl(method="repeatedcv", number=10, repeats=3, search="random")
set.seed(20470011)
mtry <- sqrt(ncol(BCD.train))
rf_random <- train(diagnosis~., data=BCD.train, method="rf", metric=metric, tuneLength=15, trControl=control)
print(rf_random)
## Random Forest 
## 
## 455 samples
##  30 predictor
##   2 classes: 'B', 'M' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 3 times) 
## Summary of sample sizes: 410, 409, 410, 409, 410, 409, ... 
## Resampling results across tuning parameters:
## 
##   mtry  Accuracy   Kappa    
##    1    0.9516833  0.8964967
##    6    0.9567109  0.9075165
##   12    0.9611231  0.9171652
##   13    0.9582400  0.9109516
##   20    0.9611868  0.9172547
##   21    0.9626368  0.9202428
##   22    0.9597208  0.9140079
##   25    0.9611553  0.9170178
##   26    0.9619115  0.9186939
##   29    0.9597053  0.9139444
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 21.
plot(rf_random)

Another search strategy is to define a grid of algorithm parameters to try. Each axis of the grid is an algorithm parameter, and points in the grid are specific combinations of parameters. Because we are only tuning one parameter here, the grid search is a linear search through a vector of candidate values.

control <- trainControl(method="repeatedcv", number=10, repeats=3, search="grid")
set.seed(20470011)
tunegrid <- expand.grid(.mtry=c(1:25))
rf_gridsearch <- train(diagnosis~., data=BCD.train, method="rf", metric=metric, tuneGrid=tunegrid, trControl=control)
print(rf_gridsearch)
## Random Forest 
## 
## 455 samples
##  30 predictor
##   2 classes: 'B', 'M' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 3 times) 
## Summary of sample sizes: 410, 409, 410, 409, 410, 409, ... 
## Resampling results across tuning parameters:
## 
##   mtry  Accuracy   Kappa    
##    1    0.9523925  0.8978999
##    2    0.9574509  0.9089967
##    3    0.9581917  0.9106700
##    4    0.9574509  0.9091969
##    5    0.9574831  0.9093567
##    6    0.9574355  0.9094084
##    7    0.9589009  0.9123072
##    8    0.9581762  0.9108933
##    9    0.9581923  0.9108763
##   10    0.9589170  0.9125201
##   11    0.9589009  0.9124260
##   12    0.9589485  0.9125103
##   13    0.9596738  0.9139902
##   14    0.9589807  0.9124711
##   15    0.9633768  0.9218791
##   16    0.9604461  0.9156278
##   17    0.9633299  0.9216946
##   18    0.9633607  0.9218299
##   19    0.9590599  0.9127599
##   20    0.9582715  0.9110545
##   21    0.9619115  0.9186012
##   22    0.9597530  0.9140706
##   23    0.9597369  0.9141453
##   24    0.9589961  0.9124855
##   25    0.9583030  0.9111178
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 15.
plot(rf_gridsearch)

Tuning ntree

We want to keep using caret because it provides a direct point of comparison to our previous models (apples to apples, even the same data splits) and because its repeated cross-validation test harness reduces the severity of overfitting. One approach is to create several caret models, passing different parameter values directly to the algorithm manually. Let’s look at an example that evaluates different values of ntree while holding mtry constant.

Manual Search

control <- trainControl(method="repeatedcv", number=10, repeats=3, search="grid")
tunegrid <- expand.grid(.mtry=c(sqrt(ncol(BCD.train))))
modellist <- list()
for (ntree in c(500,1000, 1500, 2000, 2500, 3000)) {
  set.seed(20470011)
  fit <- train(diagnosis~., data=BCD.train, method="rf", metric=metric, tuneGrid=tunegrid, trControl=control, ntree=ntree)
  key <- toString(ntree)
  modellist[[key]] <- fit
}

Compare the results:

results <- resamples(modellist)
summary(results)
## 
## Call:
## summary.resamples(object = results)
## 
## Models: 500, 1000, 1500, 2000, 2500, 3000 
## Number of resamples: 30 
## 
## Accuracy 
##           Min.   1st Qu.    Median      Mean   3rd Qu. Max. NA's
## 500  0.8888889 0.9388889 0.9565217 0.9581917 0.9782609    1    0
## 1000 0.8936170 0.9555556 0.9565217 0.9611231 0.9782609    1    0
## 1500 0.8888889 0.9555556 0.9565217 0.9611231 0.9782609    1    0
## 2000 0.8888889 0.9555556 0.9565217 0.9603824 0.9781401    1    0
## 2500 0.8888889 0.9555556 0.9565217 0.9596416 0.9781401    1    0
## 3000 0.8888889 0.9555556 0.9565217 0.9603824 0.9781401    1    0
## 
## Kappa 
##           Min.   1st Qu.    Median      Mean   3rd Qu. Max. NA's
## 500  0.7608927 0.8706663 0.9068826 0.9106062 0.9543530    1    0
## 1000 0.7772512 0.9032258 0.9078064 0.9170550 0.9545870    1    0
## 1500 0.7608927 0.9032258 0.9078064 0.9170581 0.9545870    1    0
## 2000 0.7608927 0.9032258 0.9078064 0.9155005 0.9537486    1    0
## 2500 0.7608927 0.9032258 0.9078064 0.9139788 0.9537486    1    0
## 3000 0.7608927 0.9032258 0.9078064 0.9155005 0.9537486    1    0
dotplot(results)

Extend caret tuning for both mtry and ntree

Another approach is to create a “new” algorithm for caret to support. This is the same random forest algorithm, only modified so that it supports tuning of multiple parameters. A risk with this approach is that caret’s native support for the algorithm may wrap it with additional code that subtly but importantly changes its behavior, so you may need to repeat prior experiments with your custom algorithm support. We can define our own algorithm for caret by creating a list that contains a number of named elements that the caret package looks for, such as how to fit and how to predict. Below is a definition of a custom random forest method for caret that takes both mtry and ntree as tuning parameters.

Define x and y first

x <- BCD.train[, 2:31]
y <- BCD.train$diagnosis   # $ returns a factor vector; BCD.train[, 1] would return a one-column tibble


customRF <- list(type = "Classification", library = "randomForest", loop = NULL)
customRF$parameters <- data.frame(parameter = c("mtry", "ntree"), class = rep("numeric", 2), label = c("mtry", "ntree"))
customRF$grid <- function(x, y, len = NULL, search = "grid") {}
customRF$fit <- function(x, y, wts, param, lev, last, weights, classProbs, ...) {
  randomForest(x, y, mtry = param$mtry, ntree=param$ntree, ...)
}
customRF$predict <- function(modelFit, newdata, preProc = NULL, submodels = NULL)
  predict(modelFit, newdata)
customRF$prob <- function(modelFit, newdata, preProc = NULL, submodels = NULL)
  predict(modelFit, newdata, type = "prob")
customRF$sort <- function(x) x[order(x[,1]),]
customRF$levels <- function(x) x$classes
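
A sketch of how this custom method can then be passed to train() with a joint grid over mtry and ntree is shown below. The grid values are illustrative, and the run is slow (9 combinations x 30 resamples), so it is not executed in this report.

# Sketch: joint tuning of mtry and ntree via the custom method defined above.
control <- trainControl(method = "repeatedcv", number = 10, repeats = 3)
tunegrid <- expand.grid(mtry = c(5, 15, 21), ntree = c(500, 1000, 1500))
set.seed(20470011)
rf_custom <- train(x, y, method = customRF, metric = "Accuracy",
                   tuneGrid = tunegrid, trControl = control)
plot(rf_custom)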

2.3 Gradient Boosting Machine (GBM)

GBM constructs a forward stagewise additive model by implementing gradient descent in function space. We will also use cross validation with 5 folds.

In order to find the best number of trees to use for prediction on the test data, we can use the gbm.perf() function, which returns the optimal number of trees. We can tell from the plot that the best number of trees is in the 150-200 range; the cross-validation criterion below selects 157.

BCD.train$diagnosis = as.integer(factor(BCD.train$diagnosis))-1
BCD.test$diagnosis = as.integer(factor(BCD.test$diagnosis))-1

n<-names(BCD.train)
gbm.formula <- as.formula(paste("diagnosis ~", paste(n[!n %in% "diagnosis"], collapse = " + ")))
BCD.gbmCV <-  gbm(formula = gbm.formula,
               distribution = "bernoulli",
               data = BCD.train,
               n.trees = 500,
               shrinkage = .1,
               n.minobsinnode = 15,
               cv.folds = 5,
               n.cores = 1)
# optimal number of trees for prediction
gbm.perf(BCD.gbmCV)

## [1] 157
# prediction
pred.gbm.test<- predict(BCD.gbmCV, newdata = BCD.test, type="response")
## Using 157 trees...
pred <- prediction(pred.gbm.test, BCD.test$diagnosis)
perf <- performance(pred, "tpr", "fpr")
plot(perf, colorize=TRUE)

unlist(slot(performance(pred, "auc"), "y.values"))
## [1] 0.9993075
# get binary prediction
pcut1 <- mean(BCD.train$diagnosis)  # cutoff = training prevalence of malignant (diagnosis is now 0/1)
class.gbm.test<- (pred.gbm.test>pcut1)*1
# get confusion matrix
table(BCD.test$diagnosis, class.gbm.test, dnn = c("True", "Predicted"))
##     Predicted
## True  0  1
##    0 74  2
##    1  0 38
# misclassification rate
MR<- mean(BCD.test$diagnosis!=class.gbm.test)
MR
## [1] 0.01754386
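
For reference, the test-set AUCs reported in the sections above can be collected into a small table (values copied from the outputs shown earlier, not recomputed):

# Recap of the test-set AUCs reported above.
data.frame(model = c("SVM (tuned)", "Random forest", "GBM"),
           test_AUC = c(0.9803, 0.9952, 0.9993))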