According to U.S. breast cancer statistics, about 1 in 8 U.S. women (about 12%) will develop invasive breast cancer over the course of her lifetime. In 2020, an estimated 276,480 new cases of invasive breast cancer were expected to be diagnosed in women in the U.S., and about 42,170 U.S. women were expected to die from the disease. The same statistics show that the overall death rate from breast cancer decreased by 1.3% per year from 2013 to 2017, a decline attributed to treatment advances and earlier detection through screening.
Early detection is therefore essential for diagnosing potential breast cancer. Breast cancer arises from abnormal growth of cells in the breast tissue, commonly referred to as a tumor, which can be benign (not cancerous), pre-malignant (pre-cancerous), or malignant (cancerous). Tests such as MRI, mammography, ultrasound, and biopsy are commonly performed to diagnose breast cancer.
In this project we analyze the Breast Cancer Wisconsin (Diagnostic) DataSet, obtained from Kaggle. The data set was created by Dr. William H. Wolberg, a physician at the University of Wisconsin Hospital in Madison, Wisconsin, USA. The data were derived from digitized images of breast cell samples taken by biopsy from patients with solid breast masses. The dataset includes the mean value, extreme ("worst") value, and standard error of 10 measurements of the breast cells, giving 30 real-valued features per sample.
Approaches and objectives:
The project objective is to find the best model for classifying breast tumors as benign or malignant. The approaches conducted below include support vector machine, random forest, and gradient boosting models; the required R packages are loaded first:
library(dplyr) # manipulate data frame
library(knitr) # dynamic report generation
library(ggplot2) # ggplot function
library(tidyverse) # pipe function
library(ggpubr) # combined ggplots
library(plotrix) # 3D Exploded Pie Chart
library(corrplot) # correlation of variables
library(ROCR) # get AUC
library(rpart) # classification tree
library(rpart.plot) # plot trees
library(randomForest) # randomforest model
library(nnet) # neural network
library(knitr) # display tables
library(psych) # perform summary statistics by group
library(ipred) # bagging
library(caret) # creating predictive models
library(e1071) # SVM modeling and tuning
library(vip) # construct variable importance plots
library(gbm) # Gradient Boosting Machine
library(keras) # model tuning
library(mltools) # machine learning helper
library(data.table) # create great table
Data introduction
The original dataset contains 569 observations and 32 variables, with no missing values. Its columns are the patient id, the diagnosis, and the mean, standard error, and "worst" (mean of the three largest values) of 10 measurement features, giving 30 numeric features in total (see Table 1-1 Breast Cancer Wisconsin Data). For instance, field 3 is Mean Radius, field 13 is Radius SE, and field 23 is Worst Radius.
BCD <- readr::read_csv("C:/Users/yizha/Desktop/Spring 2020/Data Mining-ll/Final/Breat cancer data.csv")
# structure of the data
str(BCD)
## Classes 'spec_tbl_df', 'tbl_df', 'tbl' and 'data.frame': 569 obs. of 32 variables:
## $ id : num 842302 842517 84300903 84348301 84358402 ...
## $ diagnosis : chr "M" "M" "M" "M" ...
## $ radius_mean : num 18 20.6 19.7 11.4 20.3 ...
## $ texture_mean : num 10.4 17.8 21.2 20.4 14.3 ...
## $ perimeter_mean : num 122.8 132.9 130 77.6 135.1 ...
## $ area_mean : num 1001 1326 1203 386 1297 ...
## $ smoothness_mean : num 0.1184 0.0847 0.1096 0.1425 0.1003 ...
## $ compactness_mean : num 0.2776 0.0786 0.1599 0.2839 0.1328 ...
## $ concavity_mean : num 0.3001 0.0869 0.1974 0.2414 0.198 ...
## $ concave points_mean : num 0.1471 0.0702 0.1279 0.1052 0.1043 ...
## $ symmetry_mean : num 0.242 0.181 0.207 0.26 0.181 ...
## $ fractal_dimension_mean : num 0.0787 0.0567 0.06 0.0974 0.0588 ...
## $ radius_se : num 1.095 0.543 0.746 0.496 0.757 ...
## $ texture_se : num 0.905 0.734 0.787 1.156 0.781 ...
## $ perimeter_se : num 8.59 3.4 4.58 3.44 5.44 ...
## $ area_se : num 153.4 74.1 94 27.2 94.4 ...
## $ smoothness_se : num 0.0064 0.00522 0.00615 0.00911 0.01149 ...
## $ compactness_se : num 0.049 0.0131 0.0401 0.0746 0.0246 ...
## $ concavity_se : num 0.0537 0.0186 0.0383 0.0566 0.0569 ...
## $ concave points_se : num 0.0159 0.0134 0.0206 0.0187 0.0188 ...
## $ symmetry_se : num 0.03 0.0139 0.0225 0.0596 0.0176 ...
## $ fractal_dimension_se : num 0.00619 0.00353 0.00457 0.00921 0.00511 ...
## $ radius_worst : num 25.4 25 23.6 14.9 22.5 ...
## $ texture_worst : num 17.3 23.4 25.5 26.5 16.7 ...
## $ perimeter_worst : num 184.6 158.8 152.5 98.9 152.2 ...
## $ area_worst : num 2019 1956 1709 568 1575 ...
## $ smoothness_worst : num 0.162 0.124 0.144 0.21 0.137 ...
## $ compactness_worst : num 0.666 0.187 0.424 0.866 0.205 ...
## $ concavity_worst : num 0.712 0.242 0.45 0.687 0.4 ...
## $ concave points_worst : num 0.265 0.186 0.243 0.258 0.163 ...
## $ symmetry_worst : num 0.46 0.275 0.361 0.664 0.236 ...
## $ fractal_dimension_worst: num 0.1189 0.089 0.0876 0.173 0.0768 ...
## - attr(*, "spec")=
## .. cols(
## .. id = col_double(),
## .. diagnosis = col_character(),
## .. radius_mean = col_double(),
## .. texture_mean = col_double(),
## .. perimeter_mean = col_double(),
## .. area_mean = col_double(),
## .. smoothness_mean = col_double(),
## .. compactness_mean = col_double(),
## .. concavity_mean = col_double(),
## .. `concave points_mean` = col_double(),
## .. symmetry_mean = col_double(),
## .. fractal_dimension_mean = col_double(),
## .. radius_se = col_double(),
## .. texture_se = col_double(),
## .. perimeter_se = col_double(),
## .. area_se = col_double(),
## .. smoothness_se = col_double(),
## .. compactness_se = col_double(),
## .. concavity_se = col_double(),
## .. `concave points_se` = col_double(),
## .. symmetry_se = col_double(),
## .. fractal_dimension_se = col_double(),
## .. radius_worst = col_double(),
## .. texture_worst = col_double(),
## .. perimeter_worst = col_double(),
## .. area_worst = col_double(),
## .. smoothness_worst = col_double(),
## .. compactness_worst = col_double(),
## .. concavity_worst = col_double(),
## .. `concave points_worst` = col_double(),
## .. symmetry_worst = col_double(),
## .. fractal_dimension_worst = col_double()
## .. )
summary(BCD)
## id diagnosis radius_mean texture_mean
## Min. : 8670 Length:569 Min. : 6.981 Min. : 9.71
## 1st Qu.: 869218 Class :character 1st Qu.:11.700 1st Qu.:16.17
## Median : 906024 Mode :character Median :13.370 Median :18.84
## Mean : 30371831 Mean :14.127 Mean :19.29
## 3rd Qu.: 8813129 3rd Qu.:15.780 3rd Qu.:21.80
## Max. :911320502 Max. :28.110 Max. :39.28
## perimeter_mean area_mean smoothness_mean compactness_mean
## Min. : 43.79 Min. : 143.5 Min. :0.05263 Min. :0.01938
## 1st Qu.: 75.17 1st Qu.: 420.3 1st Qu.:0.08637 1st Qu.:0.06492
## Median : 86.24 Median : 551.1 Median :0.09587 Median :0.09263
## Mean : 91.97 Mean : 654.9 Mean :0.09636 Mean :0.10434
## 3rd Qu.:104.10 3rd Qu.: 782.7 3rd Qu.:0.10530 3rd Qu.:0.13040
## Max. :188.50 Max. :2501.0 Max. :0.16340 Max. :0.34540
## concavity_mean concave points_mean symmetry_mean fractal_dimension_mean
## Min. :0.00000 Min. :0.00000 Min. :0.1060 Min. :0.04996
## 1st Qu.:0.02956 1st Qu.:0.02031 1st Qu.:0.1619 1st Qu.:0.05770
## Median :0.06154 Median :0.03350 Median :0.1792 Median :0.06154
## Mean :0.08880 Mean :0.04892 Mean :0.1812 Mean :0.06280
## 3rd Qu.:0.13070 3rd Qu.:0.07400 3rd Qu.:0.1957 3rd Qu.:0.06612
## Max. :0.42680 Max. :0.20120 Max. :0.3040 Max. :0.09744
## radius_se texture_se perimeter_se area_se
## Min. :0.1115 Min. :0.3602 Min. : 0.757 Min. : 6.802
## 1st Qu.:0.2324 1st Qu.:0.8339 1st Qu.: 1.606 1st Qu.: 17.850
## Median :0.3242 Median :1.1080 Median : 2.287 Median : 24.530
## Mean :0.4052 Mean :1.2169 Mean : 2.866 Mean : 40.337
## 3rd Qu.:0.4789 3rd Qu.:1.4740 3rd Qu.: 3.357 3rd Qu.: 45.190
## Max. :2.8730 Max. :4.8850 Max. :21.980 Max. :542.200
## smoothness_se compactness_se concavity_se concave points_se
## Min. :0.001713 Min. :0.002252 Min. :0.00000 Min. :0.000000
## 1st Qu.:0.005169 1st Qu.:0.013080 1st Qu.:0.01509 1st Qu.:0.007638
## Median :0.006380 Median :0.020450 Median :0.02589 Median :0.010930
## Mean :0.007041 Mean :0.025478 Mean :0.03189 Mean :0.011796
## 3rd Qu.:0.008146 3rd Qu.:0.032450 3rd Qu.:0.04205 3rd Qu.:0.014710
## Max. :0.031130 Max. :0.135400 Max. :0.39600 Max. :0.052790
## symmetry_se fractal_dimension_se radius_worst texture_worst
## Min. :0.007882 Min. :0.0008948 Min. : 7.93 Min. :12.02
## 1st Qu.:0.015160 1st Qu.:0.0022480 1st Qu.:13.01 1st Qu.:21.08
## Median :0.018730 Median :0.0031870 Median :14.97 Median :25.41
## Mean :0.020542 Mean :0.0037949 Mean :16.27 Mean :25.68
## 3rd Qu.:0.023480 3rd Qu.:0.0045580 3rd Qu.:18.79 3rd Qu.:29.72
## Max. :0.078950 Max. :0.0298400 Max. :36.04 Max. :49.54
## perimeter_worst area_worst smoothness_worst compactness_worst
## Min. : 50.41 Min. : 185.2 Min. :0.07117 Min. :0.02729
## 1st Qu.: 84.11 1st Qu.: 515.3 1st Qu.:0.11660 1st Qu.:0.14720
## Median : 97.66 Median : 686.5 Median :0.13130 Median :0.21190
## Mean :107.26 Mean : 880.6 Mean :0.13237 Mean :0.25427
## 3rd Qu.:125.40 3rd Qu.:1084.0 3rd Qu.:0.14600 3rd Qu.:0.33910
## Max. :251.20 Max. :4254.0 Max. :0.22260 Max. :1.05800
## concavity_worst concave points_worst symmetry_worst fractal_dimension_worst
## Min. :0.0000 Min. :0.00000 Min. :0.1565 Min. :0.05504
## 1st Qu.:0.1145 1st Qu.:0.06493 1st Qu.:0.2504 1st Qu.:0.07146
## Median :0.2267 Median :0.09993 Median :0.2822 Median :0.08004
## Mean :0.2722 Mean :0.11461 Mean :0.2901 Mean :0.08395
## 3rd Qu.:0.3829 3rd Qu.:0.16140 3rd Qu.:0.3179 3rd Qu.:0.09208
## Max. :1.2520 Max. :0.29100 Max. :0.6638 Max. :0.20750
# check the missing value
colSums(is.na(BCD))
## id diagnosis radius_mean
## 0 0 0
## texture_mean perimeter_mean area_mean
## 0 0 0
## smoothness_mean compactness_mean concavity_mean
## 0 0 0
## concave points_mean symmetry_mean fractal_dimension_mean
## 0 0 0
## radius_se texture_se perimeter_se
## 0 0 0
## area_se smoothness_se compactness_se
## 0 0 0
## concavity_se concave points_se symmetry_se
## 0 0 0
## fractal_dimension_se radius_worst texture_worst
## 0 0 0
## perimeter_worst area_worst smoothness_worst
## 0 0 0
## compactness_worst concavity_worst concave points_worst
## 0 0 0
## symmetry_worst fractal_dimension_worst
## 0 0
# remove column id
BCD <- BCD[,-1]
# convert the diagnosis to factor
BCD$diagnosis <- as.factor(BCD$diagnosis)
The descriptions of the ten measurements are listed as follows:
BCD.type <- lapply(BCD, class)
BCD.var_desc <- c("The diagnosis of breast tissues (M = malignant, B = benign)",
"mean of distances from center to points on the perimeter",
"standard deviation of gray-scale values",
"mean size of the core tumor",
"Not provided",
"mean of local variation in radius lengths",
"mean of perimeter^2 / area - 1.0",
"mean of severity of concave portions of the contour",
"mean for number of concave portions of the contour",
"Not provided",
"mean for 'coastline approximation' - 1",
"standard error for the mean of distances from center to points on the perimeter",
"standard error for standard deviation of gray-scale values",
"Not provided",
"Not provided",
"standard error for local variation in radius lengths",
"standard error for perimeter^2 / area - 1.0",
"standard error for severity of concave portions of the contour",
"standard error for number of concave portions of the contour",
"Not provided",
"standard error for 'coastline approximation' - 1",
"'worst' or largest mean value for mean of distances from center to points on the perimeter",
"'worst' or largest mean value for standard deviation of gray-scale values",
"Not provided",
"Not provided",
"'worst' or largest mean value for local variation in radius lengths",
"'worst' or largest mean value for perimeter^2 / area - 1.0",
"'worst' or largest mean value for severity of concave portions of the contour",
"'worst' or largest mean value for number of concave portions of the contour",
"Not provided",
"'worst' or largest mean value for 'coastline approximation' - 1")
BCD.var_names <- colnames(BCD)
data.description <- as_data_frame(cbind(BCD.var_names, BCD.type, BCD.var_desc))
colnames(data.description) <- c("Variable name", "Data Type", "Variable Description")
kable(data.description[1:11,], caption = "Table 1-2 Data Description")
| Variable name | Data Type | Variable Description |
|---|---|---|
| diagnosis | factor | The diagnosis of breast tissues (M = malignant, B = benign) |
| radius_mean | numeric | mean of distances from center to points on the perimeter |
| texture_mean | numeric | standard deviation of gray-scale values |
| perimeter_mean | numeric | mean size of the core tumor |
| area_mean | numeric | Not provided |
| smoothness_mean | numeric | mean of local variation in radius lengths |
| compactness_mean | numeric | mean of perimeter^2 / area - 1.0 |
| concavity_mean | numeric | mean of severity of concave portions of the contour |
| concave points_mean | numeric | mean for number of concave portions of the contour |
| symmetry_mean | numeric | Not provided |
| fractal_dimension_mean | numeric | mean for ‘coastline approximation’ - 1 |
With the patient id removed, the dataset has 30 numeric variables and one binary response variable, "diagnosis", with two categories, "M" (malignant) and "B" (benign). The pie chart shows that malignant cases account for 37.26% of the data (see Figure 1-1 Pie chart of diagnosis). Histograms of the mean values of the 10 measurements (variables 2 to 11) are shown below (see Figure 1-2). All variables show some degree of skewness, especially "concavity_mean" and "concave points_mean". Skewness is usually undesirable, so transformations might be needed; a quick numeric check of the skewness is sketched after the histograms.
pie <- BCD %>%
group_by(diagnosis) %>%
count(diagnosis) %>%
summarise(percent = round(n/nrow(BCD)*100,2), total = n)
# plot the pie of diagnosis
slices <- c(sum(BCD$diagnosis=="B"),sum(BCD$diagnosis=="M"))
lbls <- c("Benign","Malignant")
pie3D(slices,labels=lbls,explode=0.1,
main="Figure 1-1: Pie Chart of diagnosis ")
BCD_mean <- BCD[,1:11]
ggplot(gather(BCD_mean,key,value,2:11), aes(x=value,fill=diagnosis)) +
geom_histogram(bins = 20) +
#geom_density() +
facet_wrap(~key,scales = "free_x") +
ggtitle("Figure 1_2 Histograms of Breast Cancer Variables")
The outlier percentage in each variable is no more than 2%; therefore, we will not remove any outliers unless further evidence shows that they are influential. The boxplots (Figure 1-3) also show that the median values of most variables are higher for malignant cases than for benign cases.
# Percent outliers in every column
outlier_percentage <- BCD[,2:11] %>%
summarise_each(funs(sum(. > mean(.) + 3*sd(.) | . < mean(.) - 3*sd(.))/n()))
round(outlier_percentage*100,2)
## radius_mean texture_mean perimeter_mean area_mean smoothness_mean
## 1 0.88 0.7 1.23 1.41 0.88
## compactness_mean concavity_mean concave points_mean symmetry_mean
## 1 1.58 1.58 1.05 0.88
## fractal_dimension_mean
## 1 1.23
A heatmap was created to examine the bivariate relationships among these 30 predictors, which are derived from images of breast mass cells and carry information on the size, shape, and density of the samples. It is not surprising that some variables are highly correlated (Figure 1-4). Multicollinearity can cause trouble in linear regression models, where it can be handled by variable selection approaches such as stepwise selection or LASSO; it is much less problematic for models such as random forests, gradient boosting, or neural networks. We can tune our models to achieve the best performance.
BCD_sub <- BCD[,-1]
corrplot(cor(BCD_sub, method = "pearson"), number.cex = .9, method = "square",
hclust.method = "ward", order = "FPC",
type = "full", tl.cex=0.8,tl.col = "black",
title = "Figure 1-4 Collinearity of predicters")
From the heatmap we can tell that the highest correlations are between the size-related measures (radius, perimeter, and area) and between the mean and "worst" versions of the same measurement; a short numeric check is sketched below.
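This is only a sketch to back up the statement above; the 0.9 cutoff is an illustrative choice.
# list predictor pairs whose Pearson correlation exceeds 0.9 (each pair once)
cor.mat <- cor(BCD_sub, method = "pearson")
cor.mat[lower.tri(cor.mat, diag = TRUE)] <- NA
cor.pairs <- subset(as.data.frame(as.table(cor.mat)), !is.na(Freq) & abs(Freq) > 0.9)
cor.pairs[order(-abs(cor.pairs$Freq)), ]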
Let's plot some highly correlated features, inversely correlated features, and weakly correlated features, grouped by diagnosis.
high1 <- BCD %>%
ggplot(aes(x = radius_worst, y = perimeter_mean, color = diagnosis))+
geom_point()+
stat_ellipse()
high2 <- BCD %>%
ggplot(aes(x = radius_worst, y = area_worst, color = diagnosis))+
geom_point()+
stat_ellipse()
high3 <- BCD %>%
ggplot(aes(x = texture_worst, y = texture_mean, color = diagnosis))+
geom_point()+
stat_ellipse()
high4 <- BCD %>%
ggplot(aes(x = area_worst, y = perimeter_mean, color = diagnosis))+
geom_point()+
stat_ellipse()
ggarrange(high1,high2, labels = c("Perimeter mean vs. Radius worst", "Area worst vs. Radius worst"),
common.legend = TRUE, legend = "bottom", nrow = 1, ncol = 2)
ggarrange(high3,high4, labels = c("Texture mean vs.Texture worst", "Perimeter mean vs.Area worst"),
common.legend = TRUE, legend = "bottom", nrow = 1, ncol = 2)
inverse1 <- BCD %>%
ggplot(aes(x = radius_mean, y = fractal_dimension_mean, color = diagnosis))+
geom_point()+
stat_ellipse()
inverse2 <- BCD %>%
ggplot(aes(x = area_mean, y = fractal_dimension_mean, color = diagnosis))+
geom_point()+
stat_ellipse()
ggarrange(inverse1,inverse2,labels = c("Fractal dimension mean vs.Radius mean", "Fractal dimension mean vs. Area mean"),common.legend = TRUE, legend = "bottom", nrow = 1, ncol = 2)
low1 <- BCD %>%
ggplot(aes(x = texture_mean, y = smoothness_mean, color = diagnosis))+
geom_point()+
stat_ellipse()
low2 <- BCD %>%
ggplot(aes(x = perimeter_worst, y = fractal_dimension_se, color = diagnosis))+
geom_point()+
stat_ellipse()
ggarrange(low1,low2,labels = c("Smoothness mean vs.Texture mean", "Fractal dimension SE vs. Perimeter worst"),common.legend = TRUE, legend = "bottom", nrow = 1, ncol = 2)
We create training and test samples by splitting the data set into a training set (80%) and a test set (20%).
set.seed(20470011)
BCD <- rename(BCD, concave_points_mean = `concave points_mean`,concave_points_se = `concave points_se`, concave_points_worst = `concave points_worst`)
index <- sample(nrow(BCD),nrow(BCD)*0.8)
BCD.train <- BCD[index,]
BCD.test <- BCD[-index,]
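Note that sample() produces a simple random split. A stratified split, for example with caret::createDataPartition(), would preserve the benign/malignant proportions in both sets; the sketch below is optional and the simple split above is what the rest of the analysis uses.
# optional: stratified 80/20 split preserving the class proportions
set.seed(20470011)
strat.index <- createDataPartition(BCD$diagnosis, p = 0.8, list = FALSE)
BCD.train.strat <- BCD[strat.index,]
BCD.test.strat <- BCD[-strat.index,]
prop.table(table(BCD.train.strat$diagnosis))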
The support-vector machine (SVM) is one of the most popular classification algorithms in the machine learning community.
We apply this approach to the BCD dataset. We first fit a default SVM model without specifying values for cost and gamma. This fit uses the radial kernel with cost 1; on the test data it achieves a misclassification rate of 0.0088 and an accuracy of 0.9912, and 114 support vectors are used to build the model.
svm.model <- svm(diagnosis ~ ., data = BCD.train)
summary(svm.model)
##
## Call:
## svm(formula = diagnosis ~ ., data = BCD.train)
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: radial
## cost: 1
##
## Number of Support Vectors: 114
##
## ( 57 57 )
##
##
## Number of Classes: 2
##
## Levels:
## B M
svm.pred <- predict(svm.model, BCD.test[,-1])
confusionMatrix(svm.pred, BCD.test$diagnosis)
## Confusion Matrix and Statistics
##
## Reference
## Prediction B M
## B 75 0
## M 1 38
##
## Accuracy : 0.9912
## 95% CI : (0.9521, 0.9998)
## No Information Rate : 0.6667
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.9804
##
## Mcnemar's Test P-Value : 1
##
## Sensitivity : 0.9868
## Specificity : 1.0000
## Pos Pred Value : 1.0000
## Neg Pred Value : 0.9744
## Prevalence : 0.6667
## Detection Rate : 0.6579
## Detection Prevalence : 0.6579
## Balanced Accuracy : 0.9934
##
## 'Positive' Class : B
##
Since the SVM is very sensitive to the choice of parameters, we use the tune.svm() function to tune cost and gamma. It returns the values of cost and gamma that minimize the classification error under 10-fold cross-validation. After tuning, the best values are cost = 10 and gamma = 0.01, with a best performance (CV error) of 0.0287, and the resulting model uses 60 support vectors.
tuned_parameters <- tune.svm(diagnosis ~ ., data = BCD.train, gamma = 10^(-5:-1), cost = 10^(-3:1))
summary(tuned_parameters)
##
## Parameter tuning of 'svm':
##
## - sampling method: 10-fold cross validation
##
## - best parameters:
## gamma cost
## 0.01 10
##
## - best performance: 0.02869565
##
## - Detailed performance results:
## gamma cost error dispersion
## 1 1e-05 1e-03 0.38217391 0.06786038
## 2 1e-04 1e-03 0.38217391 0.06786038
## 3 1e-03 1e-03 0.38217391 0.06786038
## 4 1e-02 1e-03 0.38217391 0.06786038
## 5 1e-01 1e-03 0.38217391 0.06786038
## 6 1e-05 1e-02 0.38217391 0.06786038
## 7 1e-04 1e-02 0.38217391 0.06786038
## 8 1e-03 1e-02 0.38217391 0.06786038
## 9 1e-02 1e-02 0.38217391 0.06786038
## 10 1e-01 1e-02 0.38217391 0.06786038
## 11 1e-05 1e-01 0.38217391 0.06786038
## 12 1e-04 1e-01 0.38217391 0.06786038
## 13 1e-03 1e-01 0.22821256 0.08066595
## 14 1e-02 1e-01 0.05053140 0.01801845
## 15 1e-01 1e-01 0.06371981 0.03513749
## 16 1e-05 1e+00 0.38217391 0.06786038
## 17 1e-04 1e+00 0.22599034 0.07979023
## 18 1e-03 1e+00 0.05053140 0.01801845
## 19 1e-02 1e+00 0.03971014 0.02526980
## 20 1e-01 1e+00 0.04410628 0.03288026
## 21 1e-05 1e+01 0.22376812 0.08089630
## 22 1e-04 1e+01 0.05053140 0.01801845
## 23 1e-03 1e+01 0.03526570 0.02387167
## 24 1e-02 1e+01 0.02869565 0.02573186
## 25 1e-01 1e+01 0.05077295 0.03466047
tuned_parameters$best.model
##
## Call:
## best.svm(x = diagnosis ~ ., data = BCD.train, gamma = 10^(-5:-1),
## cost = 10^(-3:1))
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: radial
## cost: 10
##
## Number of Support Vectors: 60
We then refit the model with the tuned parameters cost = 10 and gamma = 0.01. This model also has an accuracy of 0.9912 on the test data, just like the default model, but this time only 60 support vectors are used.
BCD.svm <- svm(diagnosis ~ ., data = BCD.train, cost = 10, gamma = 0.01)
summary(BCD.svm)
##
## Call:
## svm(formula = diagnosis ~ ., data = BCD.train, cost = 10, gamma = 0.01)
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: radial
## cost: 10
##
## Number of Support Vectors: 60
##
## ( 27 33 )
##
##
## Number of Classes: 2
##
## Levels:
## B M
svm.predict <- predict(BCD.svm, BCD.test[,-1])
confusionMatrix(svm.predict, BCD.test$diagnosis)
## Confusion Matrix and Statistics
##
## Reference
## Prediction B M
## B 76 1
## M 0 37
##
## Accuracy : 0.9912
## 95% CI : (0.9521, 0.9998)
## No Information Rate : 0.6667
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.9801
##
## Mcnemar's Test P-Value : 1
##
## Sensitivity : 1.0000
## Specificity : 0.9737
## Pos Pred Value : 0.9870
## Neg Pred Value : 1.0000
## Prevalence : 0.6667
## Detection Rate : 0.6667
## Detection Prevalence : 0.6754
## Balanced Accuracy : 0.9868
##
## 'Positive' Class : B
##
The in-sample Accuracy is 0.9868.
svm.predict.train <- predict(BCD.svm, BCD.train[,-1])
confusionMatrix(svm.predict.train, BCD.train$diagnosis)
## Confusion Matrix and Statistics
##
## Reference
## Prediction B M
## B 281 6
## M 0 168
##
## Accuracy : 0.9868
## 95% CI : (0.9715, 0.9951)
## No Information Rate : 0.6176
## P-Value [Acc > NIR] : < 2e-16
##
## Kappa : 0.9719
##
## Mcnemar's Test P-Value : 0.04123
##
## Sensitivity : 1.0000
## Specificity : 0.9655
## Pos Pred Value : 0.9791
## Neg Pred Value : 1.0000
## Prevalence : 0.6176
## Detection Rate : 0.6176
## Detection Prevalence : 0.6308
## Balanced Accuracy : 0.9828
##
## 'Positive' Class : B
##
The AUC before tuning is 0.967; it increases to 0.98 after tuning.
# refit the default SVM with probability estimates enabled
BCD.svm0 <- svm(diagnosis ~ ., data = BCD.train, probability = TRUE)
pre.svm0 <- predict(BCD.svm0, BCD.test, probability = TRUE)
# extract one column of the predicted class probabilities
prob.svm0 <- attr(pre.svm0, 'probabilities')[,2]
# binarize the probabilities at a 5/6 cutoff before building the ROC object
pred.svm0 <- as.numeric((prob.svm0 <= 5/6))
svm.pred0 <- predict(BCD.svm0, BCD.test[,-1], type="prob")
# ROC curve and AUC of the untuned model
pred0 <- prediction(pred.svm0, BCD.test$diagnosis)
perf0 <- performance(pred0, "tpr", "fpr")
plot(perf0, colorize=TRUE)
AUC0 <- slot(performance(pred0, "auc"), "y.values")[[1]]
AUC0
## [1] 0.9671053
BCD.svm1 <- svm(diagnosis ~ ., data = BCD.train, cost = 10, gamma = 0.01, probability = TRUE)
pre.svm1 <- predict(BCD.svm1, BCD.test, probability = TRUE)
prob.svm <- attr(pre.svm1, 'probabilities')[,2]
pred.svm <- as.numeric((prob.svm <= 5/6))
svm.pred <- predict(BCD.svm1, BCD.test[,-1], type="prob")
pred <- prediction(pred.svm, BCD.test$diagnosis)
perf <- performance(pred, "tpr", "fpr")
plot(perf, colorize=TRUE)
AUC <- slot(performance(pred, "auc"), "y.values")[[1]]
AUC
## [1] 0.9802632
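The ROC curves above are built from thresholded (0/1) predictions. As an optional cross-check, the AUC can also be computed directly from the continuous class probabilities of the tuned model; this sketch assumes the probability column named "M" (the malignant class) is used as the score.
# AUC of the tuned SVM using the malignant-class probabilities directly
prob.M.tuned <- attr(pre.svm1, "probabilities")[, "M"]
pred.prob <- prediction(prob.M.tuned, BCD.test$diagnosis)
perf.prob <- performance(pred.prob, "tpr", "fpr")
plot(perf.prob, colorize = TRUE)
slot(performance(pred.prob, "auc"), "y.values")[[1]]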
Next we use the caret train() function to tune and train the SVM model, with the radial basis kernel, automatic tuning of the sigma parameter, and 10-fold CV. From the plot, we can see that moderately small values of the cost parameter C (around 4) give the best cross-validated accuracy on the training data. The result differs slightly from that obtained with tune.svm(), possibly because the two functions search different parameter grids and use different resampling splits.
set.seed(20470011)
BCD.svm.1 <- train(
diagnosis ~ .,
data = BCD.train,
method = "svmRadial",
preProcess = c("center", "scale"),
trControl = trainControl(method = "cv", number = 10),
tuneLength = 10
)
ggplot(BCD.svm.1) + theme_light()
BCD.svm.1$results
## sigma C Accuracy Kappa AccuracySD KappaSD
## 1 0.03883888 0.25 0.9537198 0.9001411 0.03354604 0.07295343
## 2 0.03883888 0.50 0.9648289 0.9241270 0.02594200 0.05660139
## 3 0.03883888 1.00 0.9714935 0.9387529 0.02088992 0.04573638
## 4 0.03883888 2.00 0.9737157 0.9434258 0.02273335 0.04960782
## 5 0.03883888 4.00 0.9757950 0.9479094 0.01949094 0.04297412
## 6 0.03883888 8.00 0.9736694 0.9438579 0.02271594 0.04877337
## 7 0.03883888 16.00 0.9603844 0.9166942 0.03730449 0.07869705
## 8 0.03883888 32.00 0.9559400 0.9077688 0.04173570 0.08731562
## 9 0.03883888 64.00 0.9492733 0.8935080 0.04446626 0.09238656
## 10 0.03883888 128.00 0.9492733 0.8935080 0.04446626 0.09238656
The model tuned by train() has an in-sample accuracy of 0.989 and an out-of-sample accuracy of 0.9912, essentially the same performance as the model tuned by tune.svm().
svm.predict.train.1 <- predict(BCD.svm.1, BCD.train[,-1])
confusionMatrix(svm.predict.train.1, BCD.train$diagnosis)
## Confusion Matrix and Statistics
##
## Reference
## Prediction B M
## B 281 5
## M 0 169
##
## Accuracy : 0.989
## 95% CI : (0.9745, 0.9964)
## No Information Rate : 0.6176
## P-Value [Acc > NIR] : < 2e-16
##
## Kappa : 0.9766
##
## Mcnemar's Test P-Value : 0.07364
##
## Sensitivity : 1.0000
## Specificity : 0.9713
## Pos Pred Value : 0.9825
## Neg Pred Value : 1.0000
## Prevalence : 0.6176
## Detection Rate : 0.6176
## Detection Prevalence : 0.6286
## Balanced Accuracy : 0.9856
##
## 'Positive' Class : B
##
svm.predict.1 <- predict(BCD.svm.1, BCD.test[,-1])
confusionMatrix(svm.predict.1, BCD.test$diagnosis)
## Confusion Matrix and Statistics
##
## Reference
## Prediction B M
## B 75 0
## M 1 38
##
## Accuracy : 0.9912
## 95% CI : (0.9521, 0.9998)
## No Information Rate : 0.6667
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.9804
##
## Mcnemar's Test P-Value : 1
##
## Sensitivity : 0.9868
## Specificity : 1.0000
## Pos Pred Value : 1.0000
## Neg Pred Value : 0.9744
## Prevalence : 0.6667
## Detection Rate : 0.6579
## Detection Prevalence : 0.6579
## Balanced Accuracy : 0.9934
##
## 'Positive' Class : B
##
SVMs classify new observations by determining which side of the decision boundary they fall on, so they do not automatically provide predicted class probabilities; additional parameters need to be estimated to obtain them. In practice, predicted class probabilities are often more useful than the predicted class labels. For instance, we need the predicted class probabilities if we optimize a metric like AUC rather than classification accuracy. In terms of AUC, we can see that, in general, smaller values of the cost parameter C provide better cross-validated AUC scores on the training data; the model achieves the highest AUC when C is around 2. The results also show the corresponding sensitivity and specificity.
ctrl <- trainControl(
method = "cv",
number = 10,
classProbs = TRUE,
summaryFunction = twoClassSummary
)
set.seed(20470011)
BCD.svm.auc <- train(
diagnosis ~ .,
data = BCD.train,
method = "svmRadial",
preProcess = c("center", "scale"),
metric = "ROC", # area under ROC curve (AUC)
trControl = ctrl,
tuneLength = 10
)
ggplot(BCD.svm.auc) + theme_light()
BCD.svm.auc$results
## sigma C ROC Sens Spec ROCSD SensSD
## 1 0.03883888 0.25 0.9894491 0.9645320 0.9196078 0.01338089 0.02883399
## 2 0.03883888 0.50 0.9910899 0.9752463 0.9428105 0.01083490 0.02358516
## 3 0.03883888 1.00 0.9925557 0.9788177 0.9542484 0.01066835 0.02966933
## 4 0.03883888 2.00 0.9929525 0.9822660 0.9542484 0.01000411 0.02515988
## 5 0.03883888 4.00 0.9925207 0.9821429 0.9598039 0.01098303 0.03035131
## 6 0.03883888 8.00 0.9912952 0.9928571 0.9542484 0.01172856 0.01505847
## 7 0.03883888 16.00 0.9902866 0.9678571 0.9542484 0.01405453 0.04595300
## 8 0.03883888 32.00 0.9886526 0.9642857 0.9486928 0.01643318 0.05323971
## 9 0.03883888 64.00 0.9861248 0.9642857 0.9486928 0.01604885 0.05832118
## 10 0.03883888 128.00 0.9859147 0.9642857 0.9372549 0.01601354 0.05323971
## SpecSD
## 1 0.06738334
## 2 0.06671168
## 3 0.05298643
## 4 0.04606182
## 5 0.05548182
## 6 0.05376454
## 7 0.05376454
## 8 0.06849851
## 9 0.06329435
## 10 0.07365317
In this case it is clear that our model does a good job of predicting both the M and B classes; the cross-validated accuracy is 0.9714.
confusionMatrix(BCD.svm.auc)
## Cross-Validated (10 fold) Confusion Matrix
##
## (entries are percentual average cell counts across resamples)
##
## Reference
## Prediction B M
## B 60.7 1.8
## M 1.1 36.5
##
## Accuracy (average) : 0.9714
We use vip() to quantify the importance of each feature with the permutation approach, since SVMs do not provide feature importance measures directly. The top 10 most important predictors are shown below.
prob.M <- function(object, newdata) {
predict(object, newdata = newdata, type = "prob")[, "M"]
}
set.seed(20470011)
vip(BCD.svm.auc, method = "permute", nsim = 5, train = BCD.train,
target = "diagnosis", metric = "auc", reference_class = "M",
pred_wrapper = prob.M)
To visualize the SVM classification, we reduce the view to two dimensions by using the top two important predictors as the plotting dimensions, with the model tuned by tune.svm(). The plots for both the training and test data show relatively clear boundaries; however, there is no obvious separating hyperplane in either plot, probably because of the high dimensionality of the data.
plot(BCD.svm, BCD.train, texture_worst ~ radius_se,
slice = list(compactness_se = 3, radius_worst = 4), main = "in-sample svm classification plot")
plot(BCD.svm, BCD.test, texture_worst ~ radius_se,
slice = list(compactness_se = 3, radius_worst = 4), main = "out-of-sample svm classification plot")
In general, the SVM models perform very well with or without parameter tuning, probably because the data are well suited to an SVM classifier.
Fit an RF model on the BCD.train dataset using the defaults ntree = 500 and mtry = 5.
BCD.rf<-randomForest(diagnosis~., data = BCD.train, importance=TRUE)
BCD.rf
##
## Call:
## randomForest(formula = diagnosis ~ ., data = BCD.train, importance = TRUE)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 5
##
## OOB estimate of error rate: 3.96%
## Confusion matrix:
## B M class.error
## B 274 7 0.02491103
## M 11 163 0.06321839
Check variable importance.
imp <- data.frame(BCD.rf$importance)
imp %>%
ggplot(aes(x=reorder(rownames(imp),MeanDecreaseAccuracy),y=MeanDecreaseAccuracy))+
geom_bar(stat="identity")+
coord_flip()+
ggtitle("Variable importance (accuracy)")
imp %>%
ggplot(aes(x=reorder(rownames(imp),MeanDecreaseGini),y=MeanDecreaseGini))+
geom_bar(stat="identity")+
coord_flip()+
ggtitle("Variable importance (Gini)")
AUC
BCD.rf.pred<- predict(BCD.rf, newdata = BCD.test, type = "prob")[,2]
pred <- prediction(BCD.rf.pred, BCD.test$diagnosis)
perf <- performance(pred, "tpr", "fpr")
plot(perf, colorize=TRUE)
unlist(slot(performance(pred, "auc"), "y.values"))
## [1] 0.9951524
MR
BCD.test.pred <- predict(BCD.rf, BCD.test[,-1]) ############
cm.test <- confusionMatrix(BCD.test.pred, BCD.test$diagnosis)
cm.test
## Confusion Matrix and Statistics
##
## Reference
## Prediction B M
## B 73 2
## M 3 36
##
## Accuracy : 0.9561
## 95% CI : (0.9006, 0.9856)
## No Information Rate : 0.6667
## P-Value [Acc > NIR] : 4.243e-14
##
## Kappa : 0.902
##
## Mcnemar's Test P-Value : 1
##
## Sensitivity : 0.9605
## Specificity : 0.9474
## Pos Pred Value : 0.9733
## Neg Pred Value : 0.9231
## Prevalence : 0.6667
## Detection Rate : 0.6404
## Detection Prevalence : 0.6579
## Balanced Accuracy : 0.9539
##
## 'Positive' Class : B
##
Plot error vs. number of trees. As we can see, the OOB error, FPR, and FNR are all very low, which means the RF model works well on the BCD.train dataset. However, we will still try parameter tuning to see whether we can get even better results.
plot(BCD.rf, lwd=rep(2, 3))
legend("right", legend = c("OOB Error", "FPR", "FNR"), lwd=rep(2, 3), lty = c(1,2,3), col = c("black", "red", "green"))
Let's create a baseline for comparison by using the recommended defaults for each parameter: mtry = floor(sqrt(ncol(BCD.train))) = 5 and ntree = 500.
Create a model with the default parameters
control <- trainControl(method="repeatedcv", number=10, repeats=3)
set.seed(20470011)
metric <- "Accuracy"
mtry <- sqrt(ncol(BCD.train))
tunegrid <- expand.grid(.mtry=floor(sqrt(ncol(BCD.train))))
rf_default <- train(diagnosis~., data=BCD.train, method="rf", metric=metric, tuneGrid=tunegrid, trControl=control)
print(rf_default)
## Random Forest
##
## 455 samples
## 30 predictor
## 2 classes: 'B', 'M'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 3 times)
## Summary of sample sizes: 410, 409, 410, 409, 410, 409, ...
## Resampling results:
##
## Accuracy Kappa
## 0.9574355 0.9091911
##
## Tuning parameter 'mtry' was held constant at a value of 5
rf_default.test<-train(diagnosis~., data=BCD.test, method="rf", metric=metric, tuneGrid=tunegrid, trControl=control)
print(rf_default.test)
## Random Forest
##
## 114 samples
## 30 predictor
## 2 classes: 'B', 'M'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 3 times)
## Summary of sample sizes: 102, 103, 104, 103, 102, 102, ...
## Resampling results:
##
## Accuracy Kappa
## 0.9338384 0.8445196
##
## Tuning parameter 'mtry' was held constant at a value of 5
One search strategy is to try random values within a range. This can be useful when we are unsure what the value should be and want to avoid any bias in setting the parameter (such as the sqrt(p) rule of thumb above). Let's try a random search for mtry using caret:
Tuning mtry (Random Search)
control <- trainControl(method="repeatedcv", number=10, repeats=3, search="random")
set.seed(20470011)
mtry <- sqrt(ncol(BCD.train))
rf_random <- train(diagnosis~., data=BCD.train, method="rf", metric=metric, tuneLength=15, trControl=control)
print(rf_random)
## Random Forest
##
## 455 samples
## 30 predictor
## 2 classes: 'B', 'M'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 3 times)
## Summary of sample sizes: 410, 409, 410, 409, 410, 409, ...
## Resampling results across tuning parameters:
##
## mtry Accuracy Kappa
## 1 0.9516833 0.8964967
## 6 0.9567109 0.9075165
## 12 0.9611231 0.9171652
## 13 0.9582400 0.9109516
## 20 0.9611868 0.9172547
## 21 0.9626368 0.9202428
## 22 0.9597208 0.9140079
## 25 0.9611553 0.9170178
## 26 0.9619115 0.9186939
## 29 0.9597053 0.9139444
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 21.
plot(rf_random)
Another approach is to define a grid of algorithm parameters to try. Each axis of the grid is an algorithm parameter, and the points in the grid are specific combinations of parameters. Because we are only tuning one parameter, the grid search is a linear search through a vector of candidate values.
control <- trainControl(method="repeatedcv", number=10, repeats=3, search="grid")
set.seed(20470011)
tunegrid <- expand.grid(.mtry=c(1:25))
rf_gridsearch <- train(diagnosis~., data=BCD.train, method="rf", metric=metric, tuneGrid=tunegrid, trControl=control)
print(rf_gridsearch)
## Random Forest
##
## 455 samples
## 30 predictor
## 2 classes: 'B', 'M'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 3 times)
## Summary of sample sizes: 410, 409, 410, 409, 410, 409, ...
## Resampling results across tuning parameters:
##
## mtry Accuracy Kappa
## 1 0.9523925 0.8978999
## 2 0.9574509 0.9089967
## 3 0.9581917 0.9106700
## 4 0.9574509 0.9091969
## 5 0.9574831 0.9093567
## 6 0.9574355 0.9094084
## 7 0.9589009 0.9123072
## 8 0.9581762 0.9108933
## 9 0.9581923 0.9108763
## 10 0.9589170 0.9125201
## 11 0.9589009 0.9124260
## 12 0.9589485 0.9125103
## 13 0.9596738 0.9139902
## 14 0.9589807 0.9124711
## 15 0.9633768 0.9218791
## 16 0.9604461 0.9156278
## 17 0.9633299 0.9216946
## 18 0.9633607 0.9218299
## 19 0.9590599 0.9127599
## 20 0.9582715 0.9110545
## 21 0.9619115 0.9186012
## 22 0.9597530 0.9140706
## 23 0.9597369 0.9141453
## 24 0.9589961 0.9124855
## 25 0.9583030 0.9111178
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 15.
plot(rf_gridsearch)
Tuning ntree
We want to keep using caret because it provides a direct point of comparison to our previous models (apples to apples, even the same data splits) and because its repeated cross-validation test harness reduces the risk of overfitting. One approach is to create several caret models and pass different parameter values directly to the algorithm. Let's look at an example that evaluates different values of ntree while holding mtry constant.
Manual Search
control <- trainControl(method="repeatedcv", number=10, repeats=3, search="grid")
tunegrid <- expand.grid(.mtry=c(sqrt(ncol(BCD.train))))
modellist <- list()
for (ntree in c(500,1000, 1500, 2000, 2500, 3000)) {
set.seed(20470011)
fit <- train(diagnosis~., data=BCD.train, method="rf", metric=metric, tuneGrid=tunegrid, trControl=control, ntree=ntree)
key <- toString(ntree)
modellist[[key]] <- fit
}
Compare the results
results <- resamples(modellist)
summary(results)
##
## Call:
## summary.resamples(object = results)
##
## Models: 500, 1000, 1500, 2000, 2500, 3000
## Number of resamples: 30
##
## Accuracy
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 500 0.8888889 0.9388889 0.9565217 0.9581917 0.9782609 1 0
## 1000 0.8936170 0.9555556 0.9565217 0.9611231 0.9782609 1 0
## 1500 0.8888889 0.9555556 0.9565217 0.9611231 0.9782609 1 0
## 2000 0.8888889 0.9555556 0.9565217 0.9603824 0.9781401 1 0
## 2500 0.8888889 0.9555556 0.9565217 0.9596416 0.9781401 1 0
## 3000 0.8888889 0.9555556 0.9565217 0.9603824 0.9781401 1 0
##
## Kappa
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 500 0.7608927 0.8706663 0.9068826 0.9106062 0.9543530 1 0
## 1000 0.7772512 0.9032258 0.9078064 0.9170550 0.9545870 1 0
## 1500 0.7608927 0.9032258 0.9078064 0.9170581 0.9545870 1 0
## 2000 0.7608927 0.9032258 0.9078064 0.9155005 0.9537486 1 0
## 2500 0.7608927 0.9032258 0.9078064 0.9139788 0.9537486 1 0
## 3000 0.7608927 0.9032258 0.9078064 0.9155005 0.9537486 1 0
dotplot(results)
Extend caret tuning to both mtry and ntree
Another approach is to create a "new" algorithm for caret to support. This is the same random forest algorithm, only modified so that it supports tuning of multiple parameters. A risk with this approach is that caret's native support for the algorithm includes extra wrapper code that subtly but importantly changes its behavior, so you may need to repeat prior experiments with the custom algorithm. We can define our own algorithm for caret by creating a list with the named elements that the caret package looks for, such as how to fit the model and how to predict. Below is a definition of a custom random forest method for caret that takes both mtry and ntree as tuning parameters; a usage sketch follows the definition.
Define x and y first
x <- BCD.train[, 2:31]
y <- BCD.train[, 1]
customRF <- list(type = "Classification", library = "randomForest", loop = NULL)
customRF$parameters <- data.frame(parameter = c("mtry", "ntree"), class = rep("numeric", 2), label = c("mtry", "ntree"))
customRF$grid <- function(x, y, len = NULL, search = "grid") {}
customRF$fit <- function(x, y, wts, param, lev, last, weights, classProbs, ...) {
randomForest(x, y, mtry = param$mtry, ntree=param$ntree, ...)
}
customRF$predict <- function(modelFit, newdata, preProc = NULL, submodels = NULL)
predict(modelFit, newdata)
customRF$prob <- function(modelFit, newdata, preProc = NULL, submodels = NULL)
predict(modelFit, newdata, type = "prob")
customRF$sort <- function(x) x[order(x[,1]),]
customRF$levels <- function(x) x$classes
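A sketch of how the custom method could be used; the grid values below are illustrative only, and fitting can take a while because every mtry/ntree combination is cross-validated.
# joint tuning of mtry and ntree with the custom caret method defined above
control <- trainControl(method="repeatedcv", number=10, repeats=3)
tunegrid <- expand.grid(.mtry=c(5, 10, 15), .ntree=c(500, 1000, 1500))
set.seed(20470011)
rf_custom <- train(diagnosis~., data=BCD.train, method=customRF, metric=metric, tuneGrid=tunegrid, trControl=control)
print(rf_custom)
plot(rf_custom)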
GBM constructs a forward stage-wise additive model by performing gradient descent in function space. We also use 5-fold cross-validation.
To find the best number of trees to use for prediction on the test data, we can use the gbm.perf() function, which returns the optimal number of trees. From the plot, the optimal number of trees is around 160 (gbm.perf() returns 157).
BCD.train$diagnosis = as.integer(factor(BCD.train$diagnosis))-1
BCD.test$diagnosis = as.integer(factor(BCD.test$diagnosis))-1
n<-names(BCD.train)
gbm.formula <- as.formula(paste("diagnosis ~", paste(n[!n %in% "diagnosis"], collapse = " + ")))
BCD.gbmCV <- gbm(formula = gbm.formula,
distribution = "bernoulli",
data = BCD.train,
n.trees = 500,
shrinkage = .1,
n.minobsinnode = 15,
cv.folds = 5,
n.cores = 1)
# optimal number of trees for prediction
gbm.perf(BCD.gbmCV)
## [1] 157
# prediction
pred.gbm.test<- predict(BCD.gbmCV, newdata = BCD.test, type="response")
## Using 157 trees...
pred <- prediction(pred.gbm.test, BCD.test$diagnosis)
perf <- performance(pred, "tpr", "fpr")
plot(perf, colorize=TRUE)
unlist(slot(performance(pred, "auc"), "y.values"))
## [1] 0.9993075
# get binary prediction
pcut1<- mean(BCD.train$diagnosis)
class.gbm.test<- (pred.gbm.test>pcut1)*1
# get confusion matrix
table(BCD.test$diagnosis, class.gbm.test, dnn = c("True", "Predicted"))
## Predicted
## True 0 1
## 0 74 2
## 1 0 38
# misclassification rate
MR<- mean(BCD.test$diagnosis!=class.gbm.test)
MR
## [1] 0.01754386
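As a quick wrap-up, the out-of-sample accuracies reported in this section can be collected into a single table. This is only a summary sketch: the SVM and random forest values are copied from their confusion matrices above, and only the GBM value is computed here from MR.
# side-by-side view of the out-of-sample accuracies reported in this section
model.compare <- data.table(Model = c("SVM (tuned)", "Random forest (default)", "GBM (157 trees)"),
Accuracy = c(0.9912, 0.9561, round(1 - MR, 4)))
kable(model.compare, caption = "Out-of-sample accuracy by model")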