In this .Rmd file we carry out an analysis of a dataset that collects information about breast cancer. Specifically, it contains measurements of the cell nuclei that appear in digitized images obtained from fine-needle aspiration (FNA) punctures of breast tissue containing cancerous cells. The dataset is available on the public FTP server ftp.cs.wisc.edu.
This dataset has 569 rows and 32 columns: an id, the diagnosis, and 30 measurements (10 nucleus features, each reported as its mean, standard error and worst value).
Finally, this dataset contains no missing values: every row and column is fully populated.
What we aim to do with this dataset is the following: explore and visualize the data, study its principal components, and compare several classification algorithms (KNN, SVM, Naive Bayes, Random Forest) and clustering algorithms (K-Means and agglomerative hierarchical clustering) for distinguishing benign from malignant tumors.
We install and load the libraries we will use.
if(!is.element("GGally", installed.packages()[, 1]))
install.packages("GGally",repos = 'http://cran.us.r-project.org')
if(!is.element("ggplot2", installed.packages()[, 1]))
install.packages("ggplot2",repos = 'http://cran.us.r-project.org')
if(!is.element("class", installed.packages()[, 1]))
install.packages("class",repos = 'http://cran.us.r-project.org')
if(!is.element("e1071", installed.packages()[, 1]))
install.packages("e1071",repos = 'http://cran.us.r-project.org')
if(!is.element("caret", installed.packages()[, 1]))
install.packages("caret",repos = 'http://cran.us.r-project.org')
if(!is.element("klaR", installed.packages()[, 1]))
install.packages("klaR",repos = 'http://cran.us.r-project.org')
if(!is.element("gmodels", installed.packages()[, 1]))
install.packages("gmodels",repos = 'http://cran.us.r-project.org')
if(!is.element("randomForest", installed.packages()[, 1]))
install.packages("randomForest",repos = 'http://cran.us.r-project.org')
if(!is.element("factoextra", installed.packages()[, 1]))
install.packages("factoextra",repos = 'http://cran.us.r-project.org')
if(!is.element("FactoMineR", installed.packages()[, 1]))
install.packages("FactoMineR",repos = 'http://cran.us.r-project.org')
if(!is.element("gridExtra", installed.packages()[, 1]))
install.packages("gridExtra",repos = 'http://cran.us.r-project.org')
suppressPackageStartupMessages(library(GGally))
suppressPackageStartupMessages(library(ggplot2))
suppressPackageStartupMessages(library(class))
suppressPackageStartupMessages(library(e1071))
suppressPackageStartupMessages(library(caret))
suppressPackageStartupMessages(library(klaR))
suppressPackageStartupMessages(library(gmodels))
suppressPackageStartupMessages(library(randomForest))
suppressPackageStartupMessages(library(factoextra))
suppressPackageStartupMessages(library(FactoMineR))
suppressPackageStartupMessages(library(gridExtra))
We load the file located at the following path. The file has no header row, so we read it with header = FALSE.
WDBC <- read.csv("../datos/WDBC.dat", header=FALSE)
We run a small procedure to build and assign the column names.
tipos <- c("mean", "se", "worst")
mediciones <- c("radius", "texture", "perimeter", "area", "smoothness","compactness",
"concavity", "concave_points", "symmetry","fractal_dimension")
col_names <- c("id","analysis", paste0(rep(mediciones, 3), "_", rep(tipos, each=10)))
colnames(WDBC) <- col_names
We perform a small data cleaning and preparation step. There are no missing values or nulls among the attributes, so we simply drop the id column, which provides no useful information, and relabel the B and M values.
WDBC <- WDBC[,-1]
WDBC$analysis <- factor(WDBC$analysis, levels = c("B", "M"), labels = c("Benign", "Malignant"))
attach(WDBC)
str(WDBC)
## 'data.frame': 569 obs. of 31 variables:
## $ analysis : Factor w/ 2 levels "Benign","Malignant": 2 2 2 2 2 2 2 2 2 2 ...
## $ radius_mean : num 18 20.6 19.7 11.4 20.3 ...
## $ texture_mean : num 10.4 17.8 21.2 20.4 14.3 ...
## $ perimeter_mean : num 122.8 132.9 130 77.6 135.1 ...
## $ area_mean : num 1001 1326 1203 386 1297 ...
## $ smoothness_mean : num 0.1184 0.0847 0.1096 0.1425 0.1003 ...
## $ compactness_mean : num 0.2776 0.0786 0.1599 0.2839 0.1328 ...
## $ concavity_mean : num 0.3001 0.0869 0.1974 0.2414 0.198 ...
## $ concave_points_mean : num 0.1471 0.0702 0.1279 0.1052 0.1043 ...
## $ symmetry_mean : num 0.242 0.181 0.207 0.26 0.181 ...
## $ fractal_dimension_mean : num 0.0787 0.0567 0.06 0.0974 0.0588 ...
## $ radius_se : num 1.095 0.543 0.746 0.496 0.757 ...
## $ texture_se : num 0.905 0.734 0.787 1.156 0.781 ...
## $ perimeter_se : num 8.59 3.4 4.58 3.44 5.44 ...
## $ area_se : num 153.4 74.1 94 27.2 94.4 ...
## $ smoothness_se : num 0.0064 0.00522 0.00615 0.00911 0.01149 ...
## $ compactness_se : num 0.049 0.0131 0.0401 0.0746 0.0246 ...
## $ concavity_se : num 0.0537 0.0186 0.0383 0.0566 0.0569 ...
## $ concave_points_se : num 0.0159 0.0134 0.0206 0.0187 0.0188 ...
## $ symmetry_se : num 0.03 0.0139 0.0225 0.0596 0.0176 ...
## $ fractal_dimension_se : num 0.00619 0.00353 0.00457 0.00921 0.00511 ...
## $ radius_worst : num 25.4 25 23.6 14.9 22.5 ...
## $ texture_worst : num 17.3 23.4 25.5 26.5 16.7 ...
## $ perimeter_worst : num 184.6 158.8 152.5 98.9 152.2 ...
## $ area_worst : num 2019 1956 1709 568 1575 ...
## $ smoothness_worst : num 0.162 0.124 0.144 0.21 0.137 ...
## $ compactness_worst : num 0.666 0.187 0.424 0.866 0.205 ...
## $ concavity_worst : num 0.712 0.242 0.45 0.687 0.4 ...
## $ concave_points_worst : num 0.265 0.186 0.243 0.258 0.163 ...
## $ symmetry_worst : num 0.46 0.275 0.361 0.664 0.236 ...
## $ fractal_dimension_worst: num 0.1189 0.089 0.0876 0.173 0.0768 ...
head(WDBC)
## analysis radius_mean texture_mean perimeter_mean area_mean
## 1 Malignant 17.99 10.38 122.80 1001.0
## 2 Malignant 20.57 17.77 132.90 1326.0
## 3 Malignant 19.69 21.25 130.00 1203.0
## 4 Malignant 11.42 20.38 77.58 386.1
## 5 Malignant 20.29 14.34 135.10 1297.0
## 6 Malignant 12.45 15.70 82.57 477.1
## smoothness_mean compactness_mean concavity_mean concave_points_mean
## 1 0.11840 0.27760 0.3001 0.14710
## 2 0.08474 0.07864 0.0869 0.07017
## 3 0.10960 0.15990 0.1974 0.12790
## 4 0.14250 0.28390 0.2414 0.10520
## 5 0.10030 0.13280 0.1980 0.10430
## 6 0.12780 0.17000 0.1578 0.08089
## symmetry_mean fractal_dimension_mean radius_se texture_se perimeter_se
## 1 0.2419 0.07871 1.0950 0.9053 8.589
## 2 0.1812 0.05667 0.5435 0.7339 3.398
## 3 0.2069 0.05999 0.7456 0.7869 4.585
## 4 0.2597 0.09744 0.4956 1.1560 3.445
## 5 0.1809 0.05883 0.7572 0.7813 5.438
## 6 0.2087 0.07613 0.3345 0.8902 2.217
## area_se smoothness_se compactness_se concavity_se concave_points_se
## 1 153.40 0.006399 0.04904 0.05373 0.01587
## 2 74.08 0.005225 0.01308 0.01860 0.01340
## 3 94.03 0.006150 0.04006 0.03832 0.02058
## 4 27.23 0.009110 0.07458 0.05661 0.01867
## 5 94.44 0.011490 0.02461 0.05688 0.01885
## 6 27.19 0.007510 0.03345 0.03672 0.01137
## symmetry_se fractal_dimension_se radius_worst texture_worst
## 1 0.03003 0.006193 25.38 17.33
## 2 0.01389 0.003532 24.99 23.41
## 3 0.02250 0.004571 23.57 25.53
## 4 0.05963 0.009208 14.91 26.50
## 5 0.01756 0.005115 22.54 16.67
## 6 0.02165 0.005082 15.47 23.75
## perimeter_worst area_worst smoothness_worst compactness_worst
## 1 184.60 2019.0 0.1622 0.6656
## 2 158.80 1956.0 0.1238 0.1866
## 3 152.50 1709.0 0.1444 0.4245
## 4 98.87 567.7 0.2098 0.8663
## 5 152.20 1575.0 0.1374 0.2050
## 6 103.40 741.6 0.1791 0.5249
## concavity_worst concave_points_worst symmetry_worst
## 1 0.7119 0.2654 0.4601
## 2 0.2416 0.1860 0.2750
## 3 0.4504 0.2430 0.3613
## 4 0.6869 0.2575 0.6638
## 5 0.4000 0.1625 0.2364
## 6 0.5355 0.1741 0.3985
## fractal_dimension_worst
## 1 0.11890
## 2 0.08902
## 3 0.08758
## 4 0.17300
## 5 0.07678
## 6 0.12440
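Before plotting, the earlier claim that there are no missing values can be verified with a quick check (a minimal snippet, no new assumptions beyond the data already loaded):
# Total number of missing values in the whole data frame (expected to be 0)
sum(is.na(WDBC))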
p <- ggplot(WDBC, aes(analysis))
p <- p + geom_bar(stat = "count", aes(fill = analysis))
p <- p + xlab("Types") + ylab("Count")
p <- p + ggtitle("Types of Cancer")
p <- p + geom_text(stat='count', aes(label=..count..), vjust=-0.5)
p <- p + theme_minimal()
p <- p + theme(legend.title = element_blank())
show(p)
We can see that the sample is slightly imbalanced: there are 357 cases of benign tumors and 212 of malignant ones.
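These counts can be confirmed directly from the diagnosis factor (a quick check, no new assumptions):
# Absolute and relative frequency of each diagnosis
table(WDBC$analysis)
prop.table(table(WDBC$analysis))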
Given the large number of variables in the dataset, plotting all of them at once would be complicated and confusing. We have therefore chosen to plot them grouped by Mean, SE (standard error) and Worst (worst or largest value) to make the charts more readable.
p <- ggpairs(
WDBC,
aes(alpha = 0.65, color = analysis),
columns = 2:11,
legend = 1,
upper = list(continuous = wrap("cor", alpha = 1, size = 3, alignPercent=1)),
lower = list(continuous = wrap("points", alpha=0.35)),
title = "Cancer Features. Mean")
p <- p + theme_minimal()
p <- p + theme(legend.position = "bottom",legend.title = element_blank())
print(p)
p <- ggpairs(
WDBC,
aes(alpha = 0.65, color = analysis),
columns = 12:21,
legend = 1,
upper = list(continuous = wrap("cor", alpha = 1, size = 3, alignPercent=1)),
lower = list(continuous = wrap("points", alpha=0.35)),
title = "Cancer Features. Standard Error")
p <- p + theme_minimal()
p <- p + theme(legend.position = "bottom",legend.title = element_blank())
print(p)
p <- ggpairs(
WDBC,
aes(alpha = 0.65, color = analysis),
columns = 22:31,
legend = 1,
upper = list(continuous = wrap("cor", alpha = 1, size = 3, alignPercent=1)),
lower = list(continuous = wrap("points", alpha=0.35)),
title = "Cancer Features. Worst o largest features")
p <- p + theme_minimal()
p <- p + theme(legend.position = "bottom",legend.title = element_blank())
print(p)
The charts above have also been generated through an R Shiny visual output (code attached to this project) and subsequently published on shinyapps.io at the following link:
In this case, we show the correlation between the variables in an alternative way to the one above, which is perhaps one of the most effective representations of correlation.
p <- ggcorr(
WDBC[,2:11],
geom = "tile",
label = TRUE,
label_round = 2,
hjust = 0.80,
vjust = 1,
nbreaks = 5,
drop = TRUE,
palette = "Accent",
layout.exp = 1,
legend.position = "bottom")
p <- p + ggplot2::labs(title = "Cancer Correlations Features. Mean")
print(p)
p <- ggcorr(
WDBC[,12:21],
geom = "tile",
label = TRUE,
label_round = 2,
hjust = 0.80,
vjust = 1,
nbreaks = 5,
drop = TRUE,
palette = "Accent",
layout.exp = 1,
legend.position = "bottom")
p <- p + ggplot2::labs(title = "Cancer Correlations Features. SE")
print(p)
p <- ggcorr(
WDBC[,22:31],
geom = "tile",
label = TRUE,
label_round = 2,
hjust = 0.80,
vjust = 1,
nbreaks = 5,
drop = TRUE,
palette = "Accent",
layout.exp = 1,
legend.position = "bottom")
p <- p + ggplot2::labs(title = "Cancer Correlations Features. Worst or largest features")
print(p)
We normalize the data to make sure that variables with different scales or with extreme values (maxima and minima far from the central bulk, which can distort the mean) do not have a disproportionate influence on the analysis.
We set a seed to ensure the results are reproducible.
Finally, we create a 'train' dataset and a 'test' dataset that we will use later.
set.seed(1234)
normalize <- function(x) {
return ((x - min(x)) / (max(x) - min(x))) }
wdbc_norm <- as.data.frame(lapply(WDBC[2:ncol(WDBC)], normalize))
ind <- sample(2, nrow(WDBC), replace=TRUE, prob=c(0.67, 0.33))
wdbc.train <- wdbc_norm[ind==1, 1:ncol(wdbc_norm)]
wdbc.test <- wdbc_norm[ind==2, 1:ncol(wdbc_norm)]
wdbc.trainLabels <- WDBC[ind==1, 1]
wdbc.testLabels <- WDBC[ind==2, 1]
We run a principal component analysis, both on all the variables together and split by group (MEAN, SE, WORST).
all.pca <- prcomp(wdbc_norm, center = TRUE, scale = TRUE)
mean.pca <- prcomp(wdbc_norm[,1:10], center = TRUE, scale = TRUE)
se.pca <- prcomp(wdbc_norm[,11:20], center = TRUE, scale = TRUE)
worst.pca <- prcomp(wdbc_norm[,21:30], center = TRUE, scale = TRUE)
summary(all.pca)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6
## Standard deviation 3.6444 2.3857 1.67867 1.40735 1.28403 1.09880
## Proportion of Variance 0.4427 0.1897 0.09393 0.06602 0.05496 0.04025
## Cumulative Proportion 0.4427 0.6324 0.72636 0.79239 0.84734 0.88759
## PC7 PC8 PC9 PC10 PC11 PC12
## Standard deviation 0.82172 0.69037 0.6457 0.59219 0.5421 0.51104
## Proportion of Variance 0.02251 0.01589 0.0139 0.01169 0.0098 0.00871
## Cumulative Proportion 0.91010 0.92598 0.9399 0.95157 0.9614 0.97007
## PC13 PC14 PC15 PC16 PC17 PC18
## Standard deviation 0.49128 0.39624 0.30681 0.28260 0.24372 0.22939
## Proportion of Variance 0.00805 0.00523 0.00314 0.00266 0.00198 0.00175
## Cumulative Proportion 0.97812 0.98335 0.98649 0.98915 0.99113 0.99288
## PC19 PC20 PC21 PC22 PC23 PC24
## Standard deviation 0.22244 0.17652 0.1731 0.16565 0.15602 0.1344
## Proportion of Variance 0.00165 0.00104 0.0010 0.00091 0.00081 0.0006
## Cumulative Proportion 0.99453 0.99557 0.9966 0.99749 0.99830 0.9989
## PC25 PC26 PC27 PC28 PC29 PC30
## Standard deviation 0.12442 0.09043 0.08307 0.03987 0.02736 0.01153
## Proportion of Variance 0.00052 0.00027 0.00023 0.00005 0.00002 0.00000
## Cumulative Proportion 0.99942 0.99969 0.99992 0.99997 1.00000 1.00000
We can see that from PC1 through PC5 the cumulative proportion of variance is roughly 85%, which means that these 5 components explain about 85% of the variance in the data.
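As a sketch of how to read these figures programmatically (using only the all.pca object fitted above), the cumulative proportion can be pulled out of the PCA summary:
# Cumulative proportion of variance explained by the first 5 components
pca_importance <- summary(all.pca)$importance
pca_importance["Cumulative Proportion", 1:5]
# Smallest number of components needed to reach 85% of the variance
min(which(pca_importance["Cumulative Proportion", ] >= 0.85))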
In this scree plot we show the percentage of variability explained by each principal component. The plot orders the principal components by decreasing eigenvalue (the portion of the total variance explained by each component).
p <- fviz_screeplot(all.pca, barfill ="#72dbde", barcolor="#333333", addlabels = TRUE, ylim = c(0, 50), linecolor = "red",ncp = 30)
p <- p + theme_minimal()
p <- p + labs(title = "Screeplot", x = "Principal Components")
print(p)
We repeat the previous plot, but grouped (MEAN, SE, WORST) and placed on the same row so the differences can be observed.
p1 <- fviz_screeplot(mean.pca, barfill ="#72dbde", barcolor="#333333", addlabels = TRUE, ylim = c(0, 60), linecolor = "red",ncp = 10)
p1 <- p1 + theme_minimal()
p1 <- p1 + labs(title = "MEAN PCA", x = "Principal Components")
p2 <- fviz_screeplot(se.pca, barfill ="#f8b1ac", barcolor="#333333", addlabels = TRUE, ylim = c(0, 60), linecolor = "red",ncp = 10)
p2 <- p2 + theme_minimal()
p2 <- p2 + labs(title = "SE PCA", x = "Principal Components")
p3 <- fviz_screeplot(worst.pca, barfill ="#beaed4", barcolor="#333333", addlabels = TRUE, ylim = c(0, 60), linecolor = "red",ncp = 10)
p3 <- p3 + theme_minimal()
p3 <- p3 + labs(title = "WORST PCA", x = "Principal Components")
grid.arrange(p1,p2,p3,ncol=3)
This chart shows the contribution of each variable to each of the 3 most important principal components. If a variable exceeds the horizontal red reference line, it can be considered an important contributor to that component.
p1 <- fviz_contrib(all.pca, fill="#72dbde", color="#333333", choice = "var", axes = 1, top = 10)
p1 <- p1 + theme_minimal()
p1 <- p1 + theme(axis.text.x = element_text(angle=90))
p1 <- p1 + labs(title = "Contribution of variables to PC1", x = "Principal Components")
p2 <- fviz_contrib(all.pca, fill="#f8b1ac", color="#333333", choice = "var", axes = 2, top = 10)
p2 <- p2 + theme_minimal()
p2 <- p2 + theme(axis.text.x = element_text(angle=90))
p2 <- p2 + labs(title = "Contribution of variables to PC2", x = "Principal Components")
p3 <- fviz_contrib(all.pca, fill="#beaed4", color="#333333", choice = "var", axes = 3, top = 10)
p3 <- p3 + theme_minimal()
p3 <- p3 + theme(axis.text.x = element_text(angle=90))
p3 <- p3 + labs(title = "Contribution of variables to PC3", x = "Principal Components")
grid.arrange(p1,p2,p3,ncol=3)
In this section we will run different experiments with several classification algorithms such as KNN, SVM, etc. For this we will use the 'train' dataset (67% of the original dataset, 380 rows) and the 'test' dataset (33% of the original dataset, 189 rows) created in the previous section.
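These sizes can be double-checked directly from the objects created earlier (a quick verification, no new assumptions):
# Number of rows in the training and test sets, and class balance of the training labels
nrow(wdbc.train)
nrow(wdbc.test)
table(wdbc.trainLabels)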
We start with KNN in its basic configuration, k = 1.
knn_pred <- knn(train = wdbc.train, test = wdbc.test, cl = wdbc.trainLabels, k=1)
CrossTable(x = wdbc.testLabels, y = knn_pred, prop.chisq=FALSE)
##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Row Total |
## | N / Col Total |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 189
##
##
## | knn_pred
## wdbc.testLabels | Benign | Malignant | Row Total |
## ----------------|-----------|-----------|-----------|
## Benign | 107 | 5 | 112 |
## | 0.955 | 0.045 | 0.593 |
## | 0.955 | 0.065 | |
## | 0.566 | 0.026 | |
## ----------------|-----------|-----------|-----------|
## Malignant | 5 | 72 | 77 |
## | 0.065 | 0.935 | 0.407 |
## | 0.045 | 0.935 | |
## | 0.026 | 0.381 | |
## ----------------|-----------|-----------|-----------|
## Column Total | 112 | 77 | 189 |
## | 0.593 | 0.407 | |
## ----------------|-----------|-----------|-----------|
##
##
Using the tune.knn function, we will try to find the best 'k' for the KNN algorithm, in order to maximize the accuracy of the classifier.
Several parameters of tune.knn can be varied. In this case we will try k between 1 and 100 and use different sampling techniques, which are used to estimate how good the predictor is.
We show the k suggested by the tuning together with a plot of the different attempts.
knn_tune <- tune.knn(wdbc.train, wdbc.trainLabels, k = 1:100, tunecontrol = tune.control(sampling = "boot"))
summary(knn_tune)
##
## Parameter tuning of 'knn.wrapper':
##
## - sampling method: bootstrapping
##
## - best parameters:
## k
## 15
##
## - best performance: 0.04082349
##
## - Detailed performance results:
## k error dispersion
## 1 1 0.04767389 0.013949478
## 2 2 0.05144260 0.015990369
## 3 3 0.04507377 0.010183908
## 4 4 0.05203528 0.014338113
## 5 5 0.05150333 0.010009822
## 6 6 0.04384690 0.009135431
## 7 7 0.04317199 0.014997490
## 8 8 0.04706106 0.015966426
## 9 9 0.05024447 0.014526648
## 10 10 0.04445005 0.016238217
## 11 11 0.04316794 0.016036153
## 12 12 0.04318572 0.013224810
## 13 13 0.04582779 0.014286435
## 14 14 0.04146163 0.016298855
## 15 15 0.04082349 0.018119871
## 16 16 0.04531075 0.018597775
## 17 17 0.04336326 0.018307819
## 18 18 0.04649794 0.017791950
## 19 19 0.04643885 0.017597991
## 20 20 0.04708412 0.019193422
## 21 21 0.04709826 0.020356919
## 22 22 0.04648756 0.020932460
## 23 23 0.04780832 0.022411680
## 24 24 0.05029144 0.018811267
## 25 25 0.04773268 0.019144259
## 26 26 0.04704748 0.021099083
## 27 27 0.04834496 0.020047617
## 28 28 0.04834496 0.020047617
## 29 29 0.04965516 0.014725234
## 30 30 0.04965516 0.017081968
## 31 31 0.04778986 0.017279687
## 32 32 0.05155405 0.015881268
## 33 33 0.05095296 0.015519683
## 34 34 0.05095296 0.013238942
## 35 35 0.05094146 0.014139398
## 36 36 0.05413291 0.014422238
## 37 37 0.05411560 0.013310236
## 38 38 0.05475254 0.012976223
## 39 39 0.05544312 0.014873289
## 40 40 0.05607174 0.012725794
## 41 41 0.05672109 0.014389670
## 42 42 0.05737044 0.016143498
## 43 43 0.05804135 0.015251180
## 44 44 0.05801366 0.016622714
## 45 45 0.05544720 0.012859257
## 46 46 0.05674477 0.016538468
## 47 47 0.05799521 0.016181130
## 48 48 0.05735469 0.016336318
## 49 49 0.05737134 0.016395482
## 50 50 0.05992523 0.019083285
## 51 51 0.05932788 0.018713982
## 52 52 0.06059073 0.020396388
## 53 53 0.05990179 0.017180177
## 54 54 0.06119216 0.018557857
## 55 55 0.06054740 0.019223699
## 56 56 0.05857778 0.017073663
## 57 57 0.05921473 0.016875367
## 58 58 0.06050919 0.018589023
## 59 59 0.06118418 0.020395315
## 60 60 0.05926673 0.020152640
## 61 61 0.06247471 0.020370538
## 62 62 0.06314546 0.020860433
## 63 63 0.06314138 0.020630404
## 64 64 0.06439229 0.021715951
## 65 65 0.06441573 0.021573149
## 66 66 0.06502923 0.021368682
## 67 67 0.06441573 0.021573149
## 68 68 0.06693310 0.022921239
## 69 69 0.06692936 0.023267927
## 70 70 0.07017489 0.024680136
## 71 71 0.07013904 0.022799185
## 72 72 0.06945996 0.022531712
## 73 73 0.07010931 0.022059576
## 74 74 0.07204496 0.026149863
## 75 75 0.07143146 0.026180762
## 76 76 0.07271163 0.025818388
## 77 77 0.07333663 0.025105197
## 78 78 0.07333663 0.025105197
## 79 79 0.07333663 0.025105197
## 80 80 0.07459115 0.024311436
## 81 81 0.07459115 0.024311436
## 82 82 0.07332293 0.025367300
## 83 83 0.07332293 0.025367300
## 84 84 0.07398060 0.025423624
## 85 85 0.07398060 0.025423624
## 86 86 0.07459410 0.025323277
## 87 87 0.07526077 0.025567979
## 88 88 0.07399792 0.026216868
## 89 89 0.07403377 0.025947090
## 90 90 0.07587427 0.025597811
## 91 91 0.07653261 0.026675712
## 92 92 0.07715541 0.025185751
## 93 93 0.07718938 0.024577626
## 94 94 0.07848400 0.024757509
## 95 95 0.07909750 0.024529621
## 96 96 0.07969434 0.024392168
## 97 97 0.08034369 0.026095370
## 98 98 0.07967702 0.025810247
## 99 99 0.08031397 0.025974021
## 100 100 0.08162998 0.027964443
plot(knn_tune)
vectorK<-c(knn_tune$best.parameters$k)
knn_pred <- knn(train = wdbc.train, test = wdbc.test, cl = wdbc.trainLabels, k=knn_tune$best.parameters$k)
CrossTable(x = wdbc.testLabels, y = knn_pred, prop.chisq=FALSE)
##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Row Total |
## | N / Col Total |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 189
##
##
## | knn_pred
## wdbc.testLabels | Benign | Malignant | Row Total |
## ----------------|-----------|-----------|-----------|
## Benign | 112 | 0 | 112 |
## | 1.000 | 0.000 | 0.593 |
## | 0.949 | 0.000 | |
## | 0.593 | 0.000 | |
## ----------------|-----------|-----------|-----------|
## Malignant | 6 | 71 | 77 |
## | 0.078 | 0.922 | 0.407 |
## | 0.051 | 1.000 | |
## | 0.032 | 0.376 | |
## ----------------|-----------|-----------|-----------|
## Column Total | 118 | 71 | 189 |
## | 0.624 | 0.376 | |
## ----------------|-----------|-----------|-----------|
##
##
knn_tune <- tune.knn(wdbc.train, wdbc.trainLabels, k = 1:100, tunecontrol = tune.control(sampling = "cross"),cross = 10 )
summary(knn_tune)
##
## Parameter tuning of 'knn.wrapper':
##
## - sampling method: 10-fold cross validation
##
## - best parameters:
## k
## 3
##
## - best performance: 0.02368421
##
## - Detailed performance results:
## k error dispersion
## 1 1 0.05263158 0.04641669
## 2 2 0.04473684 0.04649950
## 3 3 0.02368421 0.01493806
## 4 4 0.02894737 0.02616918
## 5 5 0.02631579 0.02481076
## 6 6 0.03157895 0.02075817
## 7 7 0.03157895 0.02075817
## 8 8 0.02894737 0.01493806
## 9 9 0.02894737 0.02616918
## 10 10 0.02631579 0.02148675
## 11 11 0.02368421 0.01941749
## 12 12 0.03684211 0.02828862
## 13 13 0.04210526 0.02542347
## 14 14 0.03947368 0.02842430
## 15 15 0.04210526 0.02219142
## 16 16 0.04473684 0.03051321
## 17 17 0.04473684 0.03051321
## 18 18 0.04473684 0.03051321
## 19 19 0.04473684 0.03051321
## 20 20 0.04473684 0.03051321
## 21 21 0.04736842 0.02717883
## 22 22 0.04736842 0.02717883
## 23 23 0.04473684 0.02787763
## 24 24 0.04736842 0.02717883
## 25 25 0.05000000 0.03150576
## 26 26 0.05263158 0.03282156
## 27 27 0.05000000 0.03150576
## 28 28 0.05000000 0.03386010
## 29 29 0.04736842 0.03234928
## 30 30 0.05000000 0.03386010
## 31 31 0.04736842 0.03234928
## 32 32 0.04736842 0.03234928
## 33 33 0.04736842 0.03234928
## 34 34 0.04736842 0.03234928
## 35 35 0.05000000 0.03813520
## 36 36 0.05000000 0.03813520
## 37 37 0.05263158 0.03922926
## 38 38 0.05263158 0.03922926
## 39 39 0.04736842 0.03883499
## 40 40 0.05000000 0.04377184
## 41 41 0.05000000 0.03813520
## 42 42 0.05263158 0.03922926
## 43 43 0.04736842 0.03883499
## 44 44 0.04736842 0.03883499
## 45 45 0.04736842 0.03883499
## 46 46 0.04736842 0.03883499
## 47 47 0.05263158 0.04472824
## 48 48 0.05526316 0.04377184
## 49 49 0.05263158 0.04472824
## 50 50 0.05789474 0.04261389
## 51 51 0.05263158 0.04472824
## 52 52 0.05526316 0.04377184
## 53 53 0.05263158 0.04472824
## 54 54 0.05263158 0.04472824
## 55 55 0.05526316 0.04377184
## 56 56 0.05526316 0.04377184
## 57 57 0.05789474 0.04261389
## 58 58 0.05789474 0.04261389
## 59 59 0.05789474 0.04261389
## 60 60 0.05263158 0.04641669
## 61 61 0.05263158 0.04114400
## 62 62 0.05000000 0.04197714
## 63 63 0.05526316 0.04377184
## 64 64 0.05526316 0.04377184
## 65 65 0.05526316 0.04377184
## 66 66 0.05526316 0.04377184
## 67 67 0.05789474 0.04261389
## 68 68 0.05789474 0.04261389
## 69 69 0.05789474 0.04261389
## 70 70 0.05789474 0.04261389
## 71 71 0.06315789 0.03961961
## 72 72 0.06578947 0.03772950
## 73 73 0.06578947 0.03772950
## 74 74 0.06578947 0.03772950
## 75 75 0.06842105 0.03762739
## 76 76 0.06842105 0.03762739
## 77 77 0.06578947 0.03772950
## 78 78 0.06578947 0.03772950
## 79 79 0.06578947 0.03772950
## 80 80 0.06315789 0.03961961
## 81 81 0.06842105 0.03961961
## 82 82 0.06842105 0.03961961
## 83 83 0.06842105 0.03961961
## 84 84 0.06842105 0.03961961
## 85 85 0.06842105 0.03961961
## 86 86 0.07105263 0.03731938
## 87 87 0.07105263 0.03731938
## 88 88 0.06842105 0.03961961
## 89 89 0.07105263 0.03731938
## 90 90 0.06842105 0.03961961
## 91 91 0.06842105 0.03961961
## 92 92 0.06842105 0.03961961
## 93 93 0.07105263 0.03731938
## 94 94 0.07105263 0.03731938
## 95 95 0.07105263 0.03731938
## 96 96 0.07631579 0.04010221
## 97 97 0.07631579 0.04010221
## 98 98 0.07631579 0.04010221
## 99 99 0.07631579 0.04010221
## 100 100 0.07894737 0.03922926
plot(knn_tune)
vectorK<-c(vectorK, knn_tune$best.parameters$k)
knn_pred <- knn(train = wdbc.train, test = wdbc.test, cl = wdbc.trainLabels, k=knn_tune$best.parameters$k)
CrossTable(x = wdbc.testLabels, y = knn_pred, prop.chisq=FALSE)
##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Row Total |
## | N / Col Total |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 189
##
##
## | knn_pred
## wdbc.testLabels | Benign | Malignant | Row Total |
## ----------------|-----------|-----------|-----------|
## Benign | 109 | 3 | 112 |
## | 0.973 | 0.027 | 0.593 |
## | 0.940 | 0.041 | |
## | 0.577 | 0.016 | |
## ----------------|-----------|-----------|-----------|
## Malignant | 7 | 70 | 77 |
## | 0.091 | 0.909 | 0.407 |
## | 0.060 | 0.959 | |
## | 0.037 | 0.370 | |
## ----------------|-----------|-----------|-----------|
## Column Total | 116 | 73 | 189 |
## | 0.614 | 0.386 | |
## ----------------|-----------|-----------|-----------|
##
##
knn_tune <- tune.knn(wdbc.train, wdbc.trainLabels, k = 1:100, tunecontrol= tune.control(sampling = "fix"),fix = 10 )
summary(knn_tune)
##
## Parameter tuning of 'knn.wrapper':
##
## - sampling method: fixed training/validation set
##
## - best parameters:
## k
## 3
##
## - best performance: 0.01574803
##
## - Detailed performance results:
## k error dispersion
## 1 1 0.03937008 NA
## 2 2 0.03149606 NA
## 3 3 0.01574803 NA
## 4 4 0.03937008 NA
## 5 5 0.02362205 NA
## 6 6 0.02362205 NA
## 7 7 0.02362205 NA
## 8 8 0.02362205 NA
## 9 9 0.03149606 NA
## 10 10 0.03149606 NA
## 11 11 0.03149606 NA
## 12 12 0.03937008 NA
## 13 13 0.03937008 NA
## 14 14 0.04724409 NA
## 15 15 0.04724409 NA
## 16 16 0.04724409 NA
## 17 17 0.04724409 NA
## 18 18 0.04724409 NA
## 19 19 0.04724409 NA
## 20 20 0.03937008 NA
## 21 21 0.03937008 NA
## 22 22 0.03937008 NA
## 23 23 0.03937008 NA
## 24 24 0.03937008 NA
## 25 25 0.03937008 NA
## 26 26 0.03937008 NA
## 27 27 0.03937008 NA
## 28 28 0.03937008 NA
## 29 29 0.03937008 NA
## 30 30 0.03937008 NA
## 31 31 0.03937008 NA
## 32 32 0.03937008 NA
## 33 33 0.03937008 NA
## 34 34 0.03937008 NA
## 35 35 0.03937008 NA
## 36 36 0.04724409 NA
## 37 37 0.03937008 NA
## 38 38 0.03937008 NA
## 39 39 0.03937008 NA
## 40 40 0.03937008 NA
## 41 41 0.04724409 NA
## 42 42 0.04724409 NA
## 43 43 0.04724409 NA
## 44 44 0.04724409 NA
## 45 45 0.05511811 NA
## 46 46 0.05511811 NA
## 47 47 0.05511811 NA
## 48 48 0.05511811 NA
## 49 49 0.05511811 NA
## 50 50 0.05511811 NA
## 51 51 0.05511811 NA
## 52 52 0.05511811 NA
## 53 53 0.06299213 NA
## 54 54 0.06299213 NA
## 55 55 0.06299213 NA
## 56 56 0.06299213 NA
## 57 57 0.06299213 NA
## 58 58 0.06299213 NA
## 59 59 0.06299213 NA
## 60 60 0.06299213 NA
## 61 61 0.06299213 NA
## 62 62 0.06299213 NA
## 63 63 0.07086614 NA
## 64 64 0.07086614 NA
## 65 65 0.07086614 NA
## 66 66 0.07086614 NA
## 67 67 0.07086614 NA
## 68 68 0.07086614 NA
## 69 69 0.07086614 NA
## 70 70 0.07086614 NA
## 71 71 0.07086614 NA
## 72 72 0.07086614 NA
## 73 73 0.07086614 NA
## 74 74 0.07086614 NA
## 75 75 0.07086614 NA
## 76 76 0.07086614 NA
## 77 77 0.07086614 NA
## 78 78 0.07086614 NA
## 79 79 0.07086614 NA
## 80 80 0.07086614 NA
## 81 81 0.07086614 NA
## 82 82 0.07086614 NA
## 83 83 0.07086614 NA
## 84 84 0.07086614 NA
## 85 85 0.07086614 NA
## 86 86 0.07086614 NA
## 87 87 0.07086614 NA
## 88 88 0.07086614 NA
## 89 89 0.07086614 NA
## 90 90 0.07086614 NA
## 91 91 0.07086614 NA
## 92 92 0.07086614 NA
## 93 93 0.07086614 NA
## 94 94 0.07874016 NA
## 95 95 0.07086614 NA
## 96 96 0.07086614 NA
## 97 97 0.07874016 NA
## 98 98 0.07874016 NA
## 99 99 0.07874016 NA
## 100 100 0.07086614 NA
plot(knn_tune)
vectorK<-c(vectorK, knn_tune$best.parameters$k)
knn_pred <- knn(train = wdbc.train, test = wdbc.test, cl = wdbc.trainLabels, k=knn_tune$best.parameters$k)
CrossTable(x = wdbc.testLabels, y = knn_pred, prop.chisq=FALSE)
##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Row Total |
## | N / Col Total |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 189
##
##
## | knn_pred
## wdbc.testLabels | Benign | Malignant | Row Total |
## ----------------|-----------|-----------|-----------|
## Benign | 109 | 3 | 112 |
## | 0.973 | 0.027 | 0.593 |
## | 0.940 | 0.041 | |
## | 0.577 | 0.016 | |
## ----------------|-----------|-----------|-----------|
## Malignant | 7 | 70 | 77 |
## | 0.091 | 0.909 | 0.407 |
## | 0.060 | 0.959 | |
## | 0.037 | 0.370 | |
## ----------------|-----------|-----------|-----------|
## Column Total | 116 | 73 | 189 |
## | 0.614 | 0.386 | |
## ----------------|-----------|-----------|-----------|
##
##
The tuning runs above give us different values of k, namely 15, 3 and 3. We can see that for the different k values suggested by the tuning with its different sampling methods, the results are practically identical, differing at most by 1 or 2 correct predictions of benign tumors and with essentially the same accuracy for malignant tumors.
Therefore, the tuning has slightly improved the initial results obtained with the default values.
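To summarize these cross-tables with a single set of metrics, one option (a sketch using the caret package loaded above, applied to the last knn_pred computed) is confusionMatrix:
# Accuracy, sensitivity and specificity of the last tuned KNN run
confusionMatrix(data = knn_pred, reference = wdbc.testLabels, positive = "Malignant")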
In this case, we directly use the tune.svm function to obtain the best cost and gamma parameters. With this function we can try different values of gamma, cost and kernel.
We will try different kernels ("linear", "polynomial", "radial" and "sigmoid") with cost values of 0.1, 1, 10, 100 and 1000 and gamma values of 0.037, 0.11, 0.33, 1 and 3.
svm_tune <- tune.svm(x=wdbc.train, y=wdbc.trainLabels, kernel="linear", cost=10^(-1:3), gamma=3^(-3:1))
summary(svm_tune)
##
## Parameter tuning of 'svm':
##
## - sampling method: 10-fold cross validation
##
## - best parameters:
## gamma cost
## 0.03703704 0.1
##
## - best performance: 0.02368421
##
## - Detailed performance results:
## gamma cost error dispersion
## 1 0.03703704 1e-01 0.02368421 0.01493806
## 2 0.11111111 1e-01 0.02368421 0.01493806
## 3 0.33333333 1e-01 0.02368421 0.01493806
## 4 1.00000000 1e-01 0.02368421 0.01493806
## 5 3.00000000 1e-01 0.02368421 0.01493806
## 6 0.03703704 1e+00 0.03684211 0.02828862
## 7 0.11111111 1e+00 0.03684211 0.02828862
## 8 0.33333333 1e+00 0.03684211 0.02828862
## 9 1.00000000 1e+00 0.03684211 0.02828862
## 10 3.00000000 1e+00 0.03684211 0.02828862
## 11 0.03703704 1e+01 0.03947368 0.03101346
## 12 0.11111111 1e+01 0.03947368 0.03101346
## 13 0.33333333 1e+01 0.03947368 0.03101346
## 14 1.00000000 1e+01 0.03947368 0.03101346
## 15 3.00000000 1e+01 0.03947368 0.03101346
## 16 0.03703704 1e+02 0.03947368 0.03101346
## 17 0.11111111 1e+02 0.03947368 0.03101346
## 18 0.33333333 1e+02 0.03947368 0.03101346
## 19 1.00000000 1e+02 0.03947368 0.03101346
## 20 3.00000000 1e+02 0.03947368 0.03101346
## 21 0.03703704 1e+03 0.03947368 0.03101346
## 22 0.11111111 1e+03 0.03947368 0.03101346
## 23 0.33333333 1e+03 0.03947368 0.03101346
## 24 1.00000000 1e+03 0.03947368 0.03101346
## 25 3.00000000 1e+03 0.03947368 0.03101346
plot(svm_tune)
vectorSVM <- c("linear", svm_tune$best.parameters$cost, svm_tune$best.parameters$gamma)
svm_pred <- svm(wdbc.train,wdbc.trainLabels, kernel="linear", cost=svm_tune$best.parameters$cost,gamma = svm_tune$best.parameters$gamma)
summary(svm_pred)
##
## Call:
## svm.default(x = wdbc.train, y = wdbc.trainLabels, gamma = svm_tune$best.parameters$gamma,
## cost = svm_tune$best.parameters$cost, kernerl = "linear")
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: radial
## cost: 0.1
## gamma: 0.03703704
##
## Number of Support Vectors: 181
##
## ( 89 92 )
##
##
## Number of Classes: 2
##
## Levels:
## Benign Malignant
pred <- predict(svm_pred, new=wdbc.test)
CrossTable(x = wdbc.testLabels, y = pred,prop.chisq=FALSE)
##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Row Total |
## | N / Col Total |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 189
##
##
## | pred
## wdbc.testLabels | Benign | Malignant | Row Total |
## ----------------|-----------|-----------|-----------|
## Benign | 108 | 4 | 112 |
## | 0.964 | 0.036 | 0.593 |
## | 0.931 | 0.055 | |
## | 0.571 | 0.021 | |
## ----------------|-----------|-----------|-----------|
## Malignant | 8 | 69 | 77 |
## | 0.104 | 0.896 | 0.407 |
## | 0.069 | 0.945 | |
## | 0.042 | 0.365 | |
## ----------------|-----------|-----------|-----------|
## Column Total | 116 | 73 | 189 |
## | 0.614 | 0.386 | |
## ----------------|-----------|-----------|-----------|
##
##
svm_tune <- tune.svm(x=wdbc.train, y=wdbc.trainLabels, kernel="polynomial", cost=10^(-1:3), gamma=3^(-3:1))
summary(svm_tune)
##
## Parameter tuning of 'svm':
##
## - sampling method: 10-fold cross validation
##
## - best parameters:
## gamma cost
## 0.3333333 0.1
##
## - best performance: 0.03157895
##
## - Detailed performance results:
## gamma cost error dispersion
## 1 0.03703704 1e-01 0.16315789 0.04438284
## 2 0.11111111 1e-01 0.07105263 0.03519720
## 3 0.33333333 1e-01 0.03157895 0.02717883
## 4 1.00000000 1e-01 0.05526316 0.03150576
## 5 3.00000000 1e-01 0.05526316 0.03150576
## 6 0.03703704 1e+00 0.10526316 0.04297350
## 7 0.11111111 1e+00 0.03421053 0.02166507
## 8 0.33333333 1e+00 0.05526316 0.03150576
## 9 1.00000000 1e+00 0.05526316 0.03150576
## 10 3.00000000 1e+00 0.05526316 0.03150576
## 11 0.03703704 1e+01 0.03684211 0.02542347
## 12 0.11111111 1e+01 0.05263158 0.03508772
## 13 0.33333333 1e+01 0.05526316 0.03150576
## 14 1.00000000 1e+01 0.05526316 0.03150576
## 15 3.00000000 1e+01 0.05526316 0.03150576
## 16 0.03703704 1e+02 0.03421053 0.02787763
## 17 0.11111111 1e+02 0.05526316 0.03150576
## 18 0.33333333 1e+02 0.05526316 0.03150576
## 19 1.00000000 1e+02 0.05526316 0.03150576
## 20 3.00000000 1e+02 0.05526316 0.03150576
## 21 0.03703704 1e+03 0.05526316 0.03150576
## 22 0.11111111 1e+03 0.05526316 0.03150576
## 23 0.33333333 1e+03 0.05526316 0.03150576
## 24 1.00000000 1e+03 0.05526316 0.03150576
## 25 3.00000000 1e+03 0.05526316 0.03150576
plot(svm_tune)
vectorSVM <- c(vectorSVM, c("polynomial", svm_tune$best.parameters$cost, svm_tune$best.parameters$gamma))
svm_pred <- svm(wdbc.train,wdbc.trainLabels, kernel="polynomial", cost=svm_tune$best.parameters$cost,gamma = svm_tune$best.parameters$gamma)
summary(svm_pred)
##
## Call:
## svm.default(x = wdbc.train, y = wdbc.trainLabels, gamma = svm_tune$best.parameters$gamma,
## cost = svm_tune$best.parameters$cost, kernerl = "polynomial")
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: radial
## cost: 0.1
## gamma: 0.3333333
##
## Number of Support Vectors: 322
##
## ( 135 187 )
##
##
## Number of Classes: 2
##
## Levels:
## Benign Malignant
pred <- predict(svm_pred,new=wdbc.test)
CrossTable(x = wdbc.testLabels, y = pred,prop.chisq=FALSE)
##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 189
##
##
## | pred
## wdbc.testLabels | Benign | Row Total |
## ----------------|-----------|-----------|
## Benign | 112 | 112 |
## | 0.593 | |
## ----------------|-----------|-----------|
## Malignant | 77 | 77 |
## | 0.407 | |
## ----------------|-----------|-----------|
## Column Total | 189 | 189 |
## ----------------|-----------|-----------|
##
##
svm_tune <- tune.svm(x=wdbc.train, y=wdbc.trainLabels, kernel="radial", cost=10^(-1:3), gamma=3^(-3:1))
summary(svm_tune)
##
## Parameter tuning of 'svm':
##
## - sampling method: 10-fold cross validation
##
## - best parameters:
## gamma cost
## 0.03703704 10
##
## - best performance: 0.01842105
##
## - Detailed performance results:
## gamma cost error dispersion
## 1 0.03703704 1e-01 0.05789474 0.04438284
## 2 0.11111111 1e-01 0.09473684 0.06817319
## 3 0.33333333 1e-01 0.35526316 0.07260064
## 4 1.00000000 1e-01 0.35526316 0.07260064
## 5 3.00000000 1e-01 0.35526316 0.07260064
## 6 0.03703704 1e+00 0.02631579 0.02481076
## 7 0.11111111 1e+00 0.03947368 0.02842430
## 8 0.33333333 1e+00 0.09210526 0.06709236
## 9 1.00000000 1e+00 0.35000000 0.07131746
## 10 3.00000000 1e+00 0.35526316 0.07260064
## 11 0.03703704 1e+01 0.01842105 0.01776180
## 12 0.11111111 1e+01 0.04473684 0.03731938
## 13 0.33333333 1e+01 0.08684211 0.07023024
## 14 1.00000000 1e+01 0.34736842 0.07525477
## 15 3.00000000 1e+01 0.35526316 0.07260064
## 16 0.03703704 1e+02 0.03157895 0.03234928
## 17 0.11111111 1e+02 0.04473684 0.03731938
## 18 0.33333333 1e+02 0.08684211 0.07023024
## 19 1.00000000 1e+02 0.34736842 0.07525477
## 20 3.00000000 1e+02 0.35526316 0.07260064
## 21 0.03703704 1e+03 0.03157895 0.03234928
## 22 0.11111111 1e+03 0.04473684 0.03731938
## 23 0.33333333 1e+03 0.08684211 0.07023024
## 24 1.00000000 1e+03 0.34736842 0.07525477
## 25 3.00000000 1e+03 0.35526316 0.07260064
plot(svm_tune)
vectorSVM <- c(vectorSVM, c("radial", svm_tune$best.parameters$cost, svm_tune$best.parameters$gamma))
svm_pred <- svm(wdbc.train,wdbc.trainLabels, kernel="radial", cost=100, gamma = svm_tune$best.parameters$gamma)
summary(svm_pred)
##
## Call:
## svm.default(x = wdbc.train, y = wdbc.trainLabels, gamma = svm_tune$best.parameters$gamma,
## cost = 100, kernerl = "radial")
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: radial
## cost: 100
## gamma: 0.03703704
##
## Number of Support Vectors: 78
##
## ( 41 37 )
##
##
## Number of Classes: 2
##
## Levels:
## Benign Malignant
pred <- predict(svm_pred,new=wdbc.test)
CrossTable(x = wdbc.testLabels, y = pred,prop.chisq=FALSE)
##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Row Total |
## | N / Col Total |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 189
##
##
## | pred
## wdbc.testLabels | Benign | Malignant | Row Total |
## ----------------|-----------|-----------|-----------|
## Benign | 107 | 5 | 112 |
## | 0.955 | 0.045 | 0.593 |
## | 0.930 | 0.068 | |
## | 0.566 | 0.026 | |
## ----------------|-----------|-----------|-----------|
## Malignant | 8 | 69 | 77 |
## | 0.104 | 0.896 | 0.407 |
## | 0.070 | 0.932 | |
## | 0.042 | 0.365 | |
## ----------------|-----------|-----------|-----------|
## Column Total | 115 | 74 | 189 |
## | 0.608 | 0.392 | |
## ----------------|-----------|-----------|-----------|
##
##
svm_tune <- tune.svm(x=wdbc.train, y=wdbc.trainLabels, kernel="sigmoid", cost=10^(-1:3), gamma=3^(-3:1))
summary(svm_tune)
##
## Parameter tuning of 'svm':
##
## - sampling method: 10-fold cross validation
##
## - best parameters:
## gamma cost
## 0.03703704 0.1
##
## - best performance: 0.04736842
##
## - Detailed performance results:
## gamma cost error dispersion
## 1 0.03703704 1e-01 0.04736842 0.05657725
## 2 0.11111111 1e-01 0.06315789 0.05435766
## 3 0.33333333 1e-01 0.08421053 0.04608395
## 4 1.00000000 1e-01 0.10263158 0.04549580
## 5 3.00000000 1e-01 0.07894737 0.03922926
## 6 0.03703704 1e+00 0.05263158 0.03721615
## 7 0.11111111 1e+00 0.09473684 0.04674706
## 8 0.33333333 1e+00 0.11578947 0.04333014
## 9 1.00000000 1e+00 0.11578947 0.05575526
## 10 3.00000000 1e+00 0.11315789 0.04812585
## 11 0.03703704 1e+01 0.07631579 0.04715677
## 12 0.11111111 1e+01 0.10789474 0.05031449
## 13 0.33333333 1e+01 0.12894737 0.04377184
## 14 1.00000000 1e+01 0.11578947 0.05144870
## 15 3.00000000 1e+01 0.11578947 0.05292317
## 16 0.03703704 1e+02 0.06842105 0.04836508
## 17 0.11111111 1e+02 0.09210526 0.04160892
## 18 0.33333333 1e+02 0.12631579 0.03883499
## 19 1.00000000 1e+02 0.12631579 0.05084694
## 20 3.00000000 1e+02 0.11578947 0.04836508
## 21 0.03703704 1e+03 0.07105263 0.04649950
## 22 0.11111111 1e+03 0.10000000 0.02987612
## 23 0.33333333 1e+03 0.13684211 0.04261389
## 24 1.00000000 1e+03 0.11578947 0.05292317
## 25 3.00000000 1e+03 0.11578947 0.04836508
plot(svm_tune)
vectorSVM <- c(vectorSVM, c("sigmoid", svm_tune$best.parameters$cost, svm_tune$best.parameters$gamma))
svm_pred <- svm(wdbc.train,wdbc.trainLabels, kernel="sigmoid", cost=svm_tune$best.parameters$cost,gamma = svm_tune$best.parameters$gamma)
summary(svm_pred)
##
## Call:
## svm.default(x = wdbc.train, y = wdbc.trainLabels, gamma = svm_tune$best.parameters$gamma,
## cost = svm_tune$best.parameters$cost, kernerl = "sigmoid")
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: radial
## cost: 0.1
## gamma: 0.03703704
##
## Number of Support Vectors: 181
##
## ( 89 92 )
##
##
## Number of Classes: 2
##
## Levels:
## Benign Malignant
pred <- predict(svm_pred,new=wdbc.test)
CrossTable(x = wdbc.testLabels, y = pred,prop.chisq=FALSE)
##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Row Total |
## | N / Col Total |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 189
##
##
## | pred
## wdbc.testLabels | Benign | Malignant | Row Total |
## ----------------|-----------|-----------|-----------|
## Benign | 108 | 4 | 112 |
## | 0.964 | 0.036 | 0.593 |
## | 0.931 | 0.055 | |
## | 0.571 | 0.021 | |
## ----------------|-----------|-----------|-----------|
## Malignant | 8 | 69 | 77 |
## | 0.104 | 0.896 | 0.407 |
## | 0.069 | 0.945 | |
## | 0.042 | 0.365 | |
## ----------------|-----------|-----------|-----------|
## Column Total | 116 | 73 | 189 |
## | 0.614 | 0.386 | |
## ----------------|-----------|-----------|-----------|
##
##
The tuning runs above give us different cost and gamma values for each kernel, namely:
| Kernel | Costo | Gamma |
|---|---|---|
| linear | 0.1 | 0.037037037037037 |
| polynomial | 0.1 | 0.333333333333333 |
| radial | 10 | 0.037037037037037 |
| sigmoid | 0.1 | 0.037037037037037 |
Looking at the test tables, the SVMs tuned for the 'linear', 'sigmoid' and 'radial' kernels give very similar results, classifying around 108 of 112 benign and 69 of 77 malignant cases correctly. The model fitted with the 'polynomial' tuning parameters, on the other hand, classifies every test observation as benign, so although it trivially gets all benign cases right it misses every malignant case. Apart from that model, the differences between the kernels are small.
The cost and gamma values may vary between runs, which can change the final results.
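If reproducible tuning is needed, one option (a minimal sketch, not part of the original runs) is to fix the random seed before each call to tune.svm, since the 10-fold cross-validation folds are drawn at random:
# Fixing the seed makes the folds, and hence the selected cost/gamma, reproducible
set.seed(1234)
svm_tune <- tune.svm(x=wdbc.train, y=wdbc.trainLabels, kernel="radial", cost=10^(-1:3), gamma=3^(-3:1))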
In this case, as with KNN, we will use different sampling techniques or methods, but we will not modify any additional parameters; we will simply measure how well fitted the predictor is in four different ways.
nb_pred = train(wdbc.train,wdbc.trainLabels,method = 'nb', trControl=trainControl(method='cv',number=10))
nb_pred
## Naive Bayes
##
## 380 samples
## 30 predictor
## 2 classes: 'Benign', 'Malignant'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 342, 342, 342, 341, 342, 342, ...
## Resampling results across tuning parameters:
##
## usekernel Accuracy Kappa
## FALSE 0.9344695 0.8547734
## TRUE 0.9343345 0.8558323
##
## Tuning parameter 'fL' was held constant at a value of 0
## Tuning
## parameter 'adjust' was held constant at a value of 1
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were fL = 0, usekernel = FALSE
## and adjust = 1.
pred <- predict(nb_pred$finalModel, wdbc.test)
CrossTable(x = wdbc.testLabels, y = pred$class, prop.chisq=FALSE)
##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Row Total |
## | N / Col Total |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 189
##
##
## | pred$class
## wdbc.testLabels | Benign | Malignant | Row Total |
## ----------------|-----------|-----------|-----------|
## Benign | 106 | 6 | 112 |
## | 0.946 | 0.054 | 0.593 |
## | 0.930 | 0.080 | |
## | 0.561 | 0.032 | |
## ----------------|-----------|-----------|-----------|
## Malignant | 8 | 69 | 77 |
## | 0.104 | 0.896 | 0.407 |
## | 0.070 | 0.920 | |
## | 0.042 | 0.365 | |
## ----------------|-----------|-----------|-----------|
## Column Total | 114 | 75 | 189 |
## | 0.603 | 0.397 | |
## ----------------|-----------|-----------|-----------|
##
##
nb_pred = train(wdbc.train,wdbc.trainLabels,method = 'nb', trControl=trainControl(method='repeatedcv',number=10, repeats = 3))
nb_pred
## Naive Bayes
##
## 380 samples
## 30 predictor
## 2 classes: 'Benign', 'Malignant'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 3 times)
## Summary of sample sizes: 343, 342, 343, 342, 342, 341, ...
## Resampling results across tuning parameters:
##
## usekernel Accuracy Kappa
## FALSE 0.9343084 0.8549976
## TRUE 0.9360877 0.8597369
##
## Tuning parameter 'fL' was held constant at a value of 0
## Tuning
## parameter 'adjust' was held constant at a value of 1
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were fL = 0, usekernel = TRUE
## and adjust = 1.
pred <- predict(nb_pred$finalModel, wdbc.test)
CrossTable(x = wdbc.testLabels, y = pred$class, prop.chisq=FALSE)
##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Row Total |
## | N / Col Total |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 189
##
##
## | pred$class
## wdbc.testLabels | Benign | Malignant | Row Total |
## ----------------|-----------|-----------|-----------|
## Benign | 105 | 7 | 112 |
## | 0.938 | 0.062 | 0.593 |
## | 0.955 | 0.089 | |
## | 0.556 | 0.037 | |
## ----------------|-----------|-----------|-----------|
## Malignant | 5 | 72 | 77 |
## | 0.065 | 0.935 | 0.407 |
## | 0.045 | 0.911 | |
## | 0.026 | 0.381 | |
## ----------------|-----------|-----------|-----------|
## Column Total | 110 | 79 | 189 |
## | 0.582 | 0.418 | |
## ----------------|-----------|-----------|-----------|
##
##
nb_pred = train(wdbc.train, wdbc.trainLabels, method = 'nb', trControl=trainControl(method='boot',number=100))
nb_pred
## Naive Bayes
##
## 380 samples
## 30 predictor
## 2 classes: 'Benign', 'Malignant'
##
## No pre-processing
## Resampling: Bootstrapped (100 reps)
## Summary of sample sizes: 380, 380, 380, 380, 380, 380, ...
## Resampling results across tuning parameters:
##
## usekernel Accuracy Kappa
## FALSE 0.9393848 0.8659063
## TRUE 0.9397078 0.8678455
##
## Tuning parameter 'fL' was held constant at a value of 0
## Tuning
## parameter 'adjust' was held constant at a value of 1
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were fL = 0, usekernel = TRUE
## and adjust = 1.
pred <- predict(nb_pred$finalModel, wdbc.test)
CrossTable(x = wdbc.testLabels, y = pred$class, prop.chisq=FALSE)
##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Row Total |
## | N / Col Total |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 189
##
##
## | pred$class
## wdbc.testLabels | Benign | Malignant | Row Total |
## ----------------|-----------|-----------|-----------|
## Benign | 105 | 7 | 112 |
## | 0.938 | 0.062 | 0.593 |
## | 0.955 | 0.089 | |
## | 0.556 | 0.037 | |
## ----------------|-----------|-----------|-----------|
## Malignant | 5 | 72 | 77 |
## | 0.065 | 0.935 | 0.407 |
## | 0.045 | 0.911 | |
## | 0.026 | 0.381 | |
## ----------------|-----------|-----------|-----------|
## Column Total | 110 | 79 | 189 |
## | 0.582 | 0.418 | |
## ----------------|-----------|-----------|-----------|
##
##
nb_pred = train(wdbc.train,wdbc.trainLabels,method = 'nb', trControl=trainControl(method='LOOCV'))
nb_pred
## Naive Bayes
##
## 380 samples
## 30 predictor
## 2 classes: 'Benign', 'Malignant'
##
## No pre-processing
## Resampling: Leave-One-Out Cross-Validation
## Summary of sample sizes: 379, 379, 379, 379, 379, 379, ...
## Resampling results across tuning parameters:
##
## usekernel Accuracy Kappa
## FALSE 0.9342105 0.8551829
## TRUE 0.9368421 0.8621315
##
## Tuning parameter 'fL' was held constant at a value of 0
## Tuning
## parameter 'adjust' was held constant at a value of 1
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were fL = 0, usekernel = TRUE
## and adjust = 1.
pred <- predict(nb_pred$finalModel, wdbc.test)
CrossTable(x = wdbc.testLabels, y = pred$class, prop.chisq=FALSE)
##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Row Total |
## | N / Col Total |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 189
##
##
## | pred$class
## wdbc.testLabels | Benign | Malignant | Row Total |
## ----------------|-----------|-----------|-----------|
## Benign | 105 | 7 | 112 |
## | 0.938 | 0.062 | 0.593 |
## | 0.955 | 0.089 | |
## | 0.556 | 0.037 | |
## ----------------|-----------|-----------|-----------|
## Malignant | 5 | 72 | 77 |
## | 0.065 | 0.935 | 0.407 |
## | 0.045 | 0.911 | |
## | 0.026 | 0.381 | |
## ----------------|-----------|-----------|-----------|
## Column Total | 110 | 79 | 189 |
## | 0.582 | 0.418 | |
## ----------------|-----------|-----------|-----------|
##
##
The four runs give practically the same results, which was to be expected since the only thing we change in each run is the way the precision of the model is measured. Even so, we can see that accuracy and kappa are very close to 1, which indicates how good the predictor is in each of the runs. These values are confirmed by the results obtained on the 'test' data, which are good and in line with what the 'train' precision suggested.
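As a simple complement (a sketch using the last pred object computed above), the test accuracy can be computed directly from the predictions:
# Proportion of correctly classified test observations for the last Naive Bayes model
mean(pred$class == wdbc.testLabels)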
For Random Forest, we will run the classifier with 10,000 trees and no restriction on the number of nodes. Afterwards, we will run the tuneRF function to try to improve the classifier's configuration.
rf_pred <- randomForest(x = wdbc.train, y = wdbc.trainLabels,ntree=10000,proximity=TRUE)
rf_pred
##
## Call:
## randomForest(x = wdbc.train, y = wdbc.trainLabels, ntree = 10000, proximity = TRUE)
## Type of random forest: classification
## Number of trees: 10000
## No. of variables tried at each split: 5
##
## OOB estimate of error rate: 3.68%
## Confusion matrix:
## Benign Malignant class.error
## Benign 241 4 0.01632653
## Malignant 10 125 0.07407407
pred <- predict(rf_pred, newdata=wdbc.test)
CrossTable(x = wdbc.testLabels, y = pred, prop.chisq=FALSE)
##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Row Total |
## | N / Col Total |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 189
##
##
## | pred
## wdbc.testLabels | Benign | Malignant | Row Total |
## ----------------|-----------|-----------|-----------|
## Benign | 108 | 4 | 112 |
## | 0.964 | 0.036 | 0.593 |
## | 0.973 | 0.051 | |
## | 0.571 | 0.021 | |
## ----------------|-----------|-----------|-----------|
## Malignant | 3 | 74 | 77 |
## | 0.039 | 0.961 | 0.407 |
## | 0.027 | 0.949 | |
## | 0.016 | 0.392 | |
## ----------------|-----------|-----------|-----------|
## Column Total | 111 | 78 | 189 |
## | 0.587 | 0.413 | |
## ----------------|-----------|-----------|-----------|
##
##
We try to obtain a better configuration for the classifier, tuning the values of 'mtry' (the number of variables randomly considered at each tree split) and 'ntree' (the number of trees).
tune.rf <- tuneRF(wdbc.train,wdbc.trainLabels, stepFactor=0.5,doBest = TRUE, ntreeTry = 10001,improve = 0.01)
## mtry = 5 OOB error = 3.42%
## Searching left ...
## mtry = 10 OOB error = 3.16%
## 0.07692308 0.01
## mtry = 20 OOB error = 3.16%
## 0 0.01
## Searching right ...
## mtry = 2 OOB error = 3.95%
## -0.25 0.01
rf_pred <- randomForest(x = wdbc.train, y = wdbc.trainLabels,ntree=tune.rf$ntree,proximity=TRUE,mtry = tune.rf$mtry)
rf_pred
##
## Call:
## randomForest(x = wdbc.train, y = wdbc.trainLabels, ntree = tune.rf$ntree, mtry = tune.rf$mtry, proximity = TRUE)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 10
##
## OOB estimate of error rate: 3.68%
## Confusion matrix:
## Benign Malignant class.error
## Benign 241 4 0.01632653
## Malignant 10 125 0.07407407
pred <- predict(rf_pred, newdata=wdbc.test)
CrossTable(x = wdbc.testLabels, y = pred, prop.chisq=FALSE)
##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Row Total |
## | N / Col Total |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 189
##
##
## | pred
## wdbc.testLabels | Benign | Malignant | Row Total |
## ----------------|-----------|-----------|-----------|
## Benign | 108 | 4 | 112 |
## | 0.964 | 0.036 | 0.593 |
## | 0.973 | 0.051 | |
## | 0.571 | 0.021 | |
## ----------------|-----------|-----------|-----------|
## Malignant | 3 | 74 | 77 |
## | 0.039 | 0.961 | 0.407 |
## | 0.027 | 0.949 | |
## | 0.016 | 0.392 | |
## ----------------|-----------|-----------|-----------|
## Column Total | 111 | 78 | 189 |
## | 0.587 | 0.413 | |
## ----------------|-----------|-----------|-----------|
##
##
The results obtained for Random Forest are quite good with its initial configuration. When using the tuneRF function to look for the best number of trees and mtry, the out-of-bag error improves slightly during the search (3.16% versus 3.42%), but the refitted model gives the same out-of-bag error and the same results on the 'test' data as the initial configuration. The values suggested by the function are:
ntrees: 500
mtry: 10
In conclusion, the results obtained in both cases are quite satisfactory.
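As an optional follow-up (a sketch using functions from the randomForest package loaded above), the fitted forest can also tell us which variables it relies on most:
# Variable importance (mean decrease in Gini) of the tuned random forest
importance(rf_pred)
varImpPlot(rf_pred, main = "Variable importance")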
In this case, we will try different variants of the K-Means algorithm, with 'iter.max' (maximum number of iterations) set to 1000 and 'nstart' (number of random starting sets) set to 100. In addition, we change the seed in each case so that the chosen initial centroids vary.
set.seed(1)
result <- kmeans(x = wdbc_norm, centers = 2,iter.max = 1000,algorithm = "Hartigan-Wong", nstart = 100)
CrossTable(x = WDBC$analysis, y = result$cluster, prop.chisq=FALSE)
##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Row Total |
## | N / Col Total |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 569
##
##
## | result$cluster
## WDBC$analysis | 1 | 2 | Row Total |
## --------------|-----------|-----------|-----------|
## Benign | 9 | 348 | 357 |
## | 0.025 | 0.975 | 0.627 |
## | 0.048 | 0.916 | |
## | 0.016 | 0.612 | |
## --------------|-----------|-----------|-----------|
## Malignant | 180 | 32 | 212 |
## | 0.849 | 0.151 | 0.373 |
## | 0.952 | 0.084 | |
## | 0.316 | 0.056 | |
## --------------|-----------|-----------|-----------|
## Column Total | 189 | 380 | 569 |
## | 0.332 | 0.668 | |
## --------------|-----------|-----------|-----------|
##
##
set.seed(12)
result <- kmeans(x = wdbc_norm,centers = 2,iter.max = 1000,algorithm = "Lloyd",nstart = 100)
CrossTable(x = WDBC$analysis, y = result$cluster, prop.chisq=FALSE)
##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Row Total |
## | N / Col Total |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 569
##
##
## | result$cluster
## WDBC$analysis | 1 | 2 | Row Total |
## --------------|-----------|-----------|-----------|
## Benign | 9 | 348 | 357 |
## | 0.025 | 0.975 | 0.627 |
## | 0.048 | 0.916 | |
## | 0.016 | 0.612 | |
## --------------|-----------|-----------|-----------|
## Malignant | 180 | 32 | 212 |
## | 0.849 | 0.151 | 0.373 |
## | 0.952 | 0.084 | |
## | 0.316 | 0.056 | |
## --------------|-----------|-----------|-----------|
## Column Total | 189 | 380 | 569 |
## | 0.332 | 0.668 | |
## --------------|-----------|-----------|-----------|
##
##
set.seed(123)
result <- kmeans(x = wdbc_norm, centers = 2, iter.max = 1000, algorithm = "Forgy", nstart = 100)
CrossTable(x = WDBC$analysis, y = result$cluster, prop.chisq=FALSE)
##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Row Total |
## | N / Col Total |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 569
##
##
## | result$cluster
## WDBC$analysis | 1 | 2 | Row Total |
## --------------|-----------|-----------|-----------|
## Benign | 348 | 9 | 357 |
## | 0.975 | 0.025 | 0.627 |
## | 0.916 | 0.048 | |
## | 0.612 | 0.016 | |
## --------------|-----------|-----------|-----------|
## Malignant | 32 | 180 | 212 |
## | 0.151 | 0.849 | 0.373 |
## | 0.084 | 0.952 | |
## | 0.056 | 0.316 | |
## --------------|-----------|-----------|-----------|
## Column Total | 380 | 189 | 569 |
## | 0.668 | 0.332 | |
## --------------|-----------|-----------|-----------|
##
##
set.seed(1234)
result <- kmeans(x = wdbc_norm, centers = 2, iter.max = 1000, algorithm = "MacQueen", nstart = 100)
CrossTable(x = WDBC$analysis, y = result$cluster, prop.chisq=FALSE)
##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Row Total |
## | N / Col Total |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 569
##
##
## | result$cluster
## WDBC$analysis | 1 | 2 | Row Total |
## --------------|-----------|-----------|-----------|
## Benign | 348 | 9 | 357 |
## | 0.975 | 0.025 | 0.627 |
## | 0.916 | 0.048 | |
## | 0.612 | 0.016 | |
## --------------|-----------|-----------|-----------|
## Malignant | 32 | 180 | 212 |
## | 0.151 | 0.849 | 0.373 |
## | 0.084 | 0.952 | |
## | 0.056 | 0.316 | |
## --------------|-----------|-----------|-----------|
## Column Total | 380 | 189 | 569 |
## | 0.668 | 0.332 | |
## --------------|-----------|-----------|-----------|
##
##
For this clustering task, using the different variants of the K-Means algorithm produced no change: the four variants yielded the same partition despite forcing a different seed in each run (only the arbitrary cluster ids 1 and 2 are swapped in some of the tables). This may be due, among other factors, to the small number of clusters and the size of the dataset.
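This can be verified programmatically by running the four variants in a loop and computing a label-invariant agreement score (the cluster ids 1 and 2 are arbitrary, so both possible matchings against the diagnosis are tried and the best one is kept). A minimal sketch, assuming wdbc_norm and WDBC are the objects defined earlier:
algorithms <- c("Hartigan-Wong", "Lloyd", "Forgy", "MacQueen")
sapply(algorithms, function(alg) {
  set.seed(1)
  km <- kmeans(wdbc_norm, centers = 2, iter.max = 1000, nstart = 100, algorithm = alg)
  cont <- table(WDBC$analysis, km$cluster)  # contingency table: diagnosis vs. cluster
  # best of the two possible cluster-to-label matchings, as a fraction of all samples
  max(sum(diag(cont)), sum(cont[cbind(1:2, 2:1)])) / sum(cont)
})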
Finally, we will try the different variants of the agglomerative hierarchical clustering algorithm, cutting the resulting tree into 2 groups.
result = hclust(dist(wdbc_norm), method = "ward.D")
result_cut <- cutree (result, 2)
CrossTable(x = WDBC$analysis, y = result_cut, prop.chisq=FALSE)
##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Row Total |
## | N / Col Total |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 569
##
##
## | result_cut
## WDBC$analysis | 1 | 2 | Row Total |
## --------------|-----------|-----------|-----------|
## Benign | 10 | 347 | 357 |
## | 0.028 | 0.972 | 0.627 |
## | 0.053 | 0.913 | |
## | 0.018 | 0.610 | |
## --------------|-----------|-----------|-----------|
## Malignant | 179 | 33 | 212 |
## | 0.844 | 0.156 | 0.373 |
## | 0.947 | 0.087 | |
## | 0.315 | 0.058 | |
## --------------|-----------|-----------|-----------|
## Column Total | 189 | 380 | 569 |
## | 0.332 | 0.668 | |
## --------------|-----------|-----------|-----------|
##
##
result = hclust(dist(wdbc_norm), method = "ward.D2")
result_cut <- cutree (result, 2)
CrossTable(x = WDBC$analysis, y = result_cut, prop.chisq=FALSE)
##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Row Total |
## | N / Col Total |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 569
##
##
## | result_cut
## WDBC$analysis | 1 | 2 | Row Total |
## --------------|-----------|-----------|-----------|
## Benign | 27 | 330 | 357 |
## | 0.076 | 0.924 | 0.627 |
## | 0.141 | 0.873 | |
## | 0.047 | 0.580 | |
## --------------|-----------|-----------|-----------|
## Malignant | 164 | 48 | 212 |
## | 0.774 | 0.226 | 0.373 |
## | 0.859 | 0.127 | |
## | 0.288 | 0.084 | |
## --------------|-----------|-----------|-----------|
## Column Total | 191 | 378 | 569 |
## | 0.336 | 0.664 | |
## --------------|-----------|-----------|-----------|
##
##
result = hclust(dist(wdbc_norm), method = "single")
result_cut <- cutree (result, 2)
CrossTable(x = WDBC$analysis, y = result_cut, prop.chisq=FALSE)
##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Row Total |
## | N / Col Total |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 569
##
##
## | result_cut
## WDBC$analysis | 1 | 2 | Row Total |
## --------------|-----------|-----------|-----------|
## Benign | 357 | 0 | 357 |
## | 1.000 | 0.000 | 0.627 |
## | 0.630 | 0.000 | |
## | 0.627 | 0.000 | |
## --------------|-----------|-----------|-----------|
## Malignant | 210 | 2 | 212 |
## | 0.991 | 0.009 | 0.373 |
## | 0.370 | 1.000 | |
## | 0.369 | 0.004 | |
## --------------|-----------|-----------|-----------|
## Column Total | 567 | 2 | 569 |
## | 0.996 | 0.004 | |
## --------------|-----------|-----------|-----------|
##
##
result = hclust(dist(wdbc_norm), method = "complete")
result_cut <- cutree (result, 2)
CrossTable(x = WDBC$analysis, y = result_cut, prop.chisq=FALSE)
##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Row Total |
## | N / Col Total |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 569
##
##
## | result_cut
## WDBC$analysis | 1 | 2 | Row Total |
## --------------|-----------|-----------|-----------|
## Benign | 0 | 357 | 357 |
## | 0.000 | 1.000 | 0.627 |
## | 0.000 | 0.758 | |
## | 0.000 | 0.627 | |
## --------------|-----------|-----------|-----------|
## Malignant | 98 | 114 | 212 |
## | 0.462 | 0.538 | 0.373 |
## | 1.000 | 0.242 | |
## | 0.172 | 0.200 | |
## --------------|-----------|-----------|-----------|
## Column Total | 98 | 471 | 569 |
## | 0.172 | 0.828 | |
## --------------|-----------|-----------|-----------|
##
##
result = hclust(dist(wdbc_norm), method = "average")
result_cut <- cutree (result, 2)
CrossTable(x = WDBC$analysis, y = result_cut, prop.chisq=FALSE)
##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Row Total |
## | N / Col Total |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 569
##
##
## | result_cut
## WDBC$analysis | 1 | 2 | Row Total |
## --------------|-----------|-----------|-----------|
## Benign | 357 | 0 | 357 |
## | 1.000 | 0.000 | 0.627 |
## | 0.630 | 0.000 | |
## | 0.627 | 0.000 | |
## --------------|-----------|-----------|-----------|
## Malignant | 210 | 2 | 212 |
## | 0.991 | 0.009 | 0.373 |
## | 0.370 | 1.000 | |
## | 0.369 | 0.004 | |
## --------------|-----------|-----------|-----------|
## Column Total | 567 | 2 | 569 |
## | 0.996 | 0.004 | |
## --------------|-----------|-----------|-----------|
##
##
result = hclust(dist(wdbc_norm), method = "mcquitty")
result_cut <- cutree (result, 2)
CrossTable(x = WDBC$analysis, y = result_cut, prop.chisq=FALSE)
##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Row Total |
## | N / Col Total |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 569
##
##
## | result_cut
## WDBC$analysis | 1 | 2 | Row Total |
## --------------|-----------|-----------|-----------|
## Benign | 0 | 357 | 357 |
## | 0.000 | 1.000 | 0.627 |
## | 0.000 | 0.655 | |
## | 0.000 | 0.627 | |
## --------------|-----------|-----------|-----------|
## Malignant | 24 | 188 | 212 |
## | 0.113 | 0.887 | 0.373 |
## | 1.000 | 0.345 | |
## | 0.042 | 0.330 | |
## --------------|-----------|-----------|-----------|
## Column Total | 24 | 545 | 569 |
## | 0.042 | 0.958 | |
## --------------|-----------|-----------|-----------|
##
##
result = hclust(dist(wdbc_norm), method = "median")
result_cut <- cutree (result, 2)
CrossTable(x = WDBC$analysis, y = result_cut, prop.chisq=FALSE)
##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Row Total |
## | N / Col Total |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 569
##
##
## | result_cut
## WDBC$analysis | 1 | 2 | Row Total |
## --------------|-----------|-----------|-----------|
## Benign | 357 | 0 | 357 |
## | 1.000 | 0.000 | 0.627 |
## | 0.630 | 0.000 | |
## | 0.627 | 0.000 | |
## --------------|-----------|-----------|-----------|
## Malignant | 210 | 2 | 212 |
## | 0.991 | 0.009 | 0.373 |
## | 0.370 | 1.000 | |
## | 0.369 | 0.004 | |
## --------------|-----------|-----------|-----------|
## Column Total | 567 | 2 | 569 |
## | 0.996 | 0.004 | |
## --------------|-----------|-----------|-----------|
##
##
result = hclust(dist(wdbc_norm), method = "centroid")
result_cut <- cutree (result, 2)
CrossTable(x = WDBC$analysis, y = result_cut, prop.chisq=FALSE)
##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Row Total |
## | N / Col Total |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 569
##
##
## | result_cut
## WDBC$analysis | 1 | 2 | Row Total |
## --------------|-----------|-----------|-----------|
## Benign | 357 | 0 | 357 |
## | 1.000 | 0.000 | 0.627 |
## | 0.630 | 0.000 | |
## | 0.627 | 0.000 | |
## --------------|-----------|-----------|-----------|
## Malignant | 210 | 2 | 212 |
## | 0.991 | 0.009 | 0.373 |
## | 0.370 | 1.000 | |
## | 0.369 | 0.004 | |
## --------------|-----------|-----------|-----------|
## Column Total | 567 | 2 | 569 |
## | 0.996 | 0.004 | |
## --------------|-----------|-----------|-----------|
##
##
Of the different methods shown, the only ones that give acceptable results are "ward.D" and "ward.D2". The remaining methods place almost all of the samples in a single group, which is of no use at all for this analysis.
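The degenerate splits can be spotted at a glance by comparing the sizes of the two groups produced by each linkage method: a heavily unbalanced partition (e.g. 567 vs. 2) cannot separate benign from malignant cases. A minimal sketch, assuming wdbc_norm is the normalized data used above:
d <- dist(wdbc_norm)
methods <- c("ward.D", "ward.D2", "single", "complete",
             "average", "mcquitty", "median", "centroid")
# one column per linkage method; rows give the sizes of cluster 1 and cluster 2
sapply(methods, function(m) table(cutree(hclust(d, method = m), k = 2)))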
Regarding classification, both the supervised and the unsupervised algorithms gave good results. Specifically, the figures for each algorithm (taking the best results obtained) were as follows:
IMPORTANT: bear in mind that the results may vary from run to run, so the final figures can change slightly; in any case, these variations are not very significant in most cases.
For the supervised learning algorithms, evaluated on a test data.frame of 189 entries:
| Type | KNN | SVM | Naive Bayes | Random Forest |
|---|---|---|---|---|
| Benign correct | 109 / 97.3% | 108 / 96.4% | 105 / 93.8% | 108 / 96.4% |
| Benign errors | 3 / 2.7% | 4 / 3.6% | 7 / 6.2% | 4 / 3.6% |
| Malignant correct | 71 / 92.2% | 69 / 89.6% | 72 / 93.5% | 74 / 96.1% |
| Malignant errors | 6 / 7.8% | 8 / 10.4% | 5 / 6.5% | 3 / 3.9% |
From the table above, the hit rate when predicting benign cases is very high (above 93%) for all the algorithms, with KNN standing out at 97.3%. For malignant cases, the accuracy drops slightly, although it remains acceptably high (above 89%); here Random Forest stands out with 96.1%.
Overall, the supervised learning algorithm with the best results was Random Forest, which stays above 96% accuracy for both benign and malignant cases.
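The per-class percentages shown above are simply row proportions of each classifier's test contingency table. A quick, hypothetical check (not part of the original analysis), assuming the random-forest prediction vector pred and the test labels are still in memory:
cm <- table(wdbc.testLabels, pred)          # rows = true diagnosis, columns = prediction
round(100 * prop.table(cm, margin = 1), 1)  # e.g. malignant errors for Random Forest: 3/77 ~ 3.9%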
For the unsupervised learning algorithms, applied to the full data.frame of 569 entries:
| Type | K-means | Agglomerative hierarchical |
|---|---|---|
| Benign correct | 348 / 97.5% | 347 / 97.2% |
| Benign errors | 9 / 2.5% | 10 / 2.8% |
| Malignant correct | 180 / 84.9% | 179 / 84.4% |
| Malignant errors | 32 / 15.1% | 33 / 15.6% |
From the table above, the hit rate when predicting benign cases is very high (above 97%), with K-means ahead by a very small margin (just a few tenths of a point). For malignant cases the hit rate drops to around 84-85% in both cases, again with K-means slightly ahead.
Overall, the results are acceptably good, especially for benign cases; for malignant cases the hit rate drops but remains reasonably high, and K-means stands out in both comparisons.
Based on these figures and comparing them in percentage terms, there seems to be a relative advantage in using the supervised algorithms over the unsupervised ones (among those used in this study) when predicting malignant cases. Specifically, comparing the percentages of Random Forest and K-means, K-means is slightly better at predicting benign cases (an improvement of 1.1 percentage points), whereas for malignant cases Random Forest is clearly better (an improvement of 11.2 percentage points).
In conclusion, Random Forest is the algorithm that gives the best overall results and therefore the best prediction percentages for both types of diagnosis.
Dataset with the breast cancer information
ftp.cs.wisc.edu/math-prog/cpo-dataset/machine-learn/cancer/WDBC/WDBC.dat
Relevant information about the dataset
https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+(Diagnostic)
URL of the code on GitHub
https://github.com/alu4216/WDBC
URL of the code on RPubs
https://rpubs.com/alu4216/WDBC
URL of the visualization on shinyapp
sessionInfo()
## R version 3.4.4 (2018-03-15)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 17134)
##
## Matrix products: default
##
## locale:
## [1] LC_COLLATE=Spanish_Spain.1252 LC_CTYPE=Spanish_Spain.1252
## [3] LC_MONETARY=Spanish_Spain.1252 LC_NUMERIC=C
## [5] LC_TIME=Spanish_Spain.1252
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] gridExtra_2.3 FactoMineR_1.41 factoextra_1.0.5
## [4] randomForest_4.6-14 gmodels_2.16.2 klaR_0.6-14
## [7] MASS_7.3-49 caret_6.0-80 lattice_0.20-35
## [10] e1071_1.6-8 class_7.3-14 GGally_1.4.0
## [13] ggplot2_2.2.1
##
## loaded via a namespace (and not attached):
## [1] nlme_3.1-131.1 lubridate_1.7.4 dimRed_0.1.0
## [4] RColorBrewer_1.1-2 rprojroot_1.3-2 tools_3.4.4
## [7] backports_1.1.2 R6_2.2.2 rpart_4.1-13
## [10] lazyeval_0.2.1 questionr_0.6.3 colorspace_1.3-2
## [13] nnet_7.3-12 withr_2.1.2 tidyselect_0.2.4
## [16] mnormt_1.5-5 compiler_3.4.4 flashClust_1.01-2
## [19] labeling_0.3 scales_0.5.0 sfsmisc_1.1-2
## [22] DEoptimR_1.0-8 psych_1.8.4 robustbase_0.93-1
## [25] stringr_1.3.0 digest_0.6.15 foreign_0.8-69
## [28] rmarkdown_1.9 pkgconfig_2.0.1 htmltools_0.3.6
## [31] highr_0.6 rlang_0.2.0 ddalpha_1.3.4
## [34] rstudioapi_0.7 shiny_1.0.5 bindr_0.1.1
## [37] combinat_0.0-8 gtools_3.5.0 dplyr_0.7.4
## [40] ModelMetrics_1.1.0 magrittr_1.5 leaps_3.0
## [43] Matrix_1.2-12 Rcpp_0.12.16 munsell_0.4.3
## [46] abind_1.4-5 scatterplot3d_0.3-41 stringi_1.1.7
## [49] yaml_2.1.18 plyr_1.8.4 recipes_0.1.3
## [52] grid_3.4.4 ggrepel_0.8.0 gdata_2.18.0
## [55] parallel_3.4.4 pls_2.6-0 promises_1.0.1
## [58] miniUI_0.1.1.1 splines_3.4.4 knitr_1.20
## [61] pillar_1.2.1 ggpubr_0.1.7 reshape2_1.4.3
## [64] codetools_0.2-15 stats4_3.4.4 CVST_0.2-2
## [67] magic_1.5-8 glue_1.2.0 evaluate_0.10.1
## [70] httpuv_1.4.4.2 foreach_1.4.4 gtable_0.2.0
## [73] purrr_0.2.4 tidyr_0.8.0 reshape_0.8.7
## [76] kernlab_0.9-26 assertthat_0.2.0 DRR_0.0.3
## [79] gower_0.1.2 mime_0.5 prodlim_2018.04.18
## [82] xtable_1.8-2 broom_0.4.5 later_0.7.3
## [85] survival_2.41-3 geometry_0.3-6 timeDate_3043.102
## [88] RcppRoll_0.3.0 tibble_1.4.2 iterators_1.0.9
## [91] cluster_2.0.6 bindrcpp_0.2.2 lava_1.6.2
## [94] ipred_0.9-6