Introduction

In this .Rmd file we will analyze a dataset that gathers information on breast cancer. Specifically, it contains measurements of the cell nuclei that appear in digitized images obtained from fine-needle aspiration (FNA) biopsies of breast tissue containing cancerous cells. The dataset is available on the following public FTP server: ftp.cs.wisc.edu

This dataset has 569 rows and 32 columns, which are:

  1. Sample id
  2. Diagnosis (M = malignant, B = benign)
  3. The MEAN (mean), SE (standard error) and WORST (worst/largest) values of the following ten characteristics: radius, texture, perimeter, area, smoothness, compactness, concavity, concave points, symmetry and fractal dimension

Finally, this dataset contains no missing values; every row and column is fully populated.

What we aim to achieve with this dataset is the following:

First block. Data loading and representation

Prerequisites

We install and load the libraries we are going to use.

if(!is.element("GGally", installed.packages()[, 1]))
  install.packages("GGally",repos = 'http://cran.us.r-project.org')

if(!is.element("ggplot2", installed.packages()[, 1]))
  install.packages("ggplot2",repos = 'http://cran.us.r-project.org')

if(!is.element("class", installed.packages()[, 1]))
  install.packages('class')

if(!is.element("e1071", installed.packages()[, 1]))
  install.packages("e1071",repos = 'http://cran.us.r-project.org')

if(!is.element("caret", installed.packages()[, 1]))
  install.packages("caret",repos = 'http://cran.us.r-project.org')

if(!is.element("klaR", installed.packages()[, 1]))
  install.packages("klaR",repos = 'http://cran.us.r-project.org')

if(!is.element("gmodels", installed.packages()[, 1]))
  install.packages("gmodels",repos = 'http://cran.us.r-project.org')

if(!is.element("randomForest", installed.packages()[, 1]))
  install.packages("randomForest",repos = 'http://cran.us.r-project.org')

if(!is.element("factoextra", installed.packages()[, 1]))
  install.packages("factoextra",repos = 'http://cran.us.r-project.org')

if(!is.element("FactoMineR", installed.packages()[, 1]))
  install.packages("FactoMineR",repos = 'http://cran.us.r-project.org')

if(!is.element("gridExtra", installed.packages()[, 1]))
  install.packages("gridExtra",repos = 'http://cran.us.r-project.org')

suppressPackageStartupMessages(library(GGally))
suppressPackageStartupMessages(library(ggplot2))
suppressPackageStartupMessages(library(class))
suppressPackageStartupMessages(library(e1071))
suppressPackageStartupMessages(library(caret))
suppressPackageStartupMessages(library(klaR))
suppressPackageStartupMessages(library(gmodels))
suppressPackageStartupMessages(library(randomForest))
suppressPackageStartupMessages(library(factoextra))
suppressPackageStartupMessages(library(FactoMineR))
suppressPackageStartupMessages(library(gridExtra))

Loading and preparing the data

Obtaining the data

We load the file located at the following path. We ignore the headers.

WDBC <- read.csv("../datos/WDBC.dat", header=FALSE)

Setting the headers

We perform a small procedure to set the header of each column.

tipos <- c("mean", "se", "worst")
mediciones <- c("radius", "texture", "perimeter", "area", "smoothness","compactness",
                "concavity", "concave_points", "symmetry","fractal_dimension")

names <- c("id","analysis", paste0(rep(mediciones, 3), "_", rep(tipos, each=10)))
colnames(WDBC) <- names

Preparing the data (ETL)

We carry out a small cleaning and preparation step. In this case there are neither missing values nor null values for the attributes, so we simply drop the id column, which provides no useful information, and relabel the B and M values.

WDBC <- WDBC[,-1]
WDBC$analysis <- factor(WDBC$analysis, levels = c("B", "M"), labels = c("Benign", "Malignant"))
attach(WDBC)
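
As a quick sanity check (a small sketch, not part of the original pipeline), we can confirm that there really are no missing values and that the class variable has the two expected levels:

# Sanity check: no missing values and the expected factor levels
sum(is.na(WDBC))
levels(WDBC$analysis)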

Visual representation of the data

We start by showing the structure and the first records.

str(WDBC)
## 'data.frame':    569 obs. of  31 variables:
##  $ analysis               : Factor w/ 2 levels "Benign","Malignant": 2 2 2 2 2 2 2 2 2 2 ...
##  $ radius_mean            : num  18 20.6 19.7 11.4 20.3 ...
##  $ texture_mean           : num  10.4 17.8 21.2 20.4 14.3 ...
##  $ perimeter_mean         : num  122.8 132.9 130 77.6 135.1 ...
##  $ area_mean              : num  1001 1326 1203 386 1297 ...
##  $ smoothness_mean        : num  0.1184 0.0847 0.1096 0.1425 0.1003 ...
##  $ compactness_mean       : num  0.2776 0.0786 0.1599 0.2839 0.1328 ...
##  $ concavity_mean         : num  0.3001 0.0869 0.1974 0.2414 0.198 ...
##  $ concave_points_mean    : num  0.1471 0.0702 0.1279 0.1052 0.1043 ...
##  $ symmetry_mean          : num  0.242 0.181 0.207 0.26 0.181 ...
##  $ fractal_dimension_mean : num  0.0787 0.0567 0.06 0.0974 0.0588 ...
##  $ radius_se              : num  1.095 0.543 0.746 0.496 0.757 ...
##  $ texture_se             : num  0.905 0.734 0.787 1.156 0.781 ...
##  $ perimeter_se           : num  8.59 3.4 4.58 3.44 5.44 ...
##  $ area_se                : num  153.4 74.1 94 27.2 94.4 ...
##  $ smoothness_se          : num  0.0064 0.00522 0.00615 0.00911 0.01149 ...
##  $ compactness_se         : num  0.049 0.0131 0.0401 0.0746 0.0246 ...
##  $ concavity_se           : num  0.0537 0.0186 0.0383 0.0566 0.0569 ...
##  $ concave_points_se      : num  0.0159 0.0134 0.0206 0.0187 0.0188 ...
##  $ symmetry_se            : num  0.03 0.0139 0.0225 0.0596 0.0176 ...
##  $ fractal_dimension_se   : num  0.00619 0.00353 0.00457 0.00921 0.00511 ...
##  $ radius_worst           : num  25.4 25 23.6 14.9 22.5 ...
##  $ texture_worst          : num  17.3 23.4 25.5 26.5 16.7 ...
##  $ perimeter_worst        : num  184.6 158.8 152.5 98.9 152.2 ...
##  $ area_worst             : num  2019 1956 1709 568 1575 ...
##  $ smoothness_worst       : num  0.162 0.124 0.144 0.21 0.137 ...
##  $ compactness_worst      : num  0.666 0.187 0.424 0.866 0.205 ...
##  $ concavity_worst        : num  0.712 0.242 0.45 0.687 0.4 ...
##  $ concave_points_worst   : num  0.265 0.186 0.243 0.258 0.163 ...
##  $ symmetry_worst         : num  0.46 0.275 0.361 0.664 0.236 ...
##  $ fractal_dimension_worst: num  0.1189 0.089 0.0876 0.173 0.0768 ...
head(WDBC)
##    analysis radius_mean texture_mean perimeter_mean area_mean
## 1 Malignant       17.99        10.38         122.80    1001.0
## 2 Malignant       20.57        17.77         132.90    1326.0
## 3 Malignant       19.69        21.25         130.00    1203.0
## 4 Malignant       11.42        20.38          77.58     386.1
## 5 Malignant       20.29        14.34         135.10    1297.0
## 6 Malignant       12.45        15.70          82.57     477.1
##   smoothness_mean compactness_mean concavity_mean concave_points_mean
## 1         0.11840          0.27760         0.3001             0.14710
## 2         0.08474          0.07864         0.0869             0.07017
## 3         0.10960          0.15990         0.1974             0.12790
## 4         0.14250          0.28390         0.2414             0.10520
## 5         0.10030          0.13280         0.1980             0.10430
## 6         0.12780          0.17000         0.1578             0.08089
##   symmetry_mean fractal_dimension_mean radius_se texture_se perimeter_se
## 1        0.2419                0.07871    1.0950     0.9053        8.589
## 2        0.1812                0.05667    0.5435     0.7339        3.398
## 3        0.2069                0.05999    0.7456     0.7869        4.585
## 4        0.2597                0.09744    0.4956     1.1560        3.445
## 5        0.1809                0.05883    0.7572     0.7813        5.438
## 6        0.2087                0.07613    0.3345     0.8902        2.217
##   area_se smoothness_se compactness_se concavity_se concave_points_se
## 1  153.40      0.006399        0.04904      0.05373           0.01587
## 2   74.08      0.005225        0.01308      0.01860           0.01340
## 3   94.03      0.006150        0.04006      0.03832           0.02058
## 4   27.23      0.009110        0.07458      0.05661           0.01867
## 5   94.44      0.011490        0.02461      0.05688           0.01885
## 6   27.19      0.007510        0.03345      0.03672           0.01137
##   symmetry_se fractal_dimension_se radius_worst texture_worst
## 1     0.03003             0.006193        25.38         17.33
## 2     0.01389             0.003532        24.99         23.41
## 3     0.02250             0.004571        23.57         25.53
## 4     0.05963             0.009208        14.91         26.50
## 5     0.01756             0.005115        22.54         16.67
## 6     0.02165             0.005082        15.47         23.75
##   perimeter_worst area_worst smoothness_worst compactness_worst
## 1          184.60     2019.0           0.1622            0.6656
## 2          158.80     1956.0           0.1238            0.1866
## 3          152.50     1709.0           0.1444            0.4245
## 4           98.87      567.7           0.2098            0.8663
## 5          152.20     1575.0           0.1374            0.2050
## 6          103.40      741.6           0.1791            0.5249
##   concavity_worst concave_points_worst symmetry_worst
## 1          0.7119               0.2654         0.4601
## 2          0.2416               0.1860         0.2750
## 3          0.4504               0.2430         0.3613
## 4          0.6869               0.2575         0.6638
## 5          0.4000               0.1625         0.2364
## 6          0.5355               0.1741         0.3985
##   fractal_dimension_worst
## 1                 0.11890
## 2                 0.08902
## 3                 0.08758
## 4                 0.17300
## 5                 0.07678
## 6                 0.12440

A simple chart of how the dataset is distributed by cancer type.

p <- ggplot(WDBC, aes(analysis))
p <- p + geom_bar(stat = "count", aes(fill = analysis))
p <- p + xlab("Types") + ylab("Count")
p <- p + ggtitle("Types of Cancer") 
p <- p + geom_text(stat='count', aes(label=..count..), vjust=-0.5)
p <- p + theme_minimal()
p <- p + theme(legend.title = element_blank())
show(p)

It can be seen that the sample is slightly unbalanced: there are 357 cases of benign tumors and 212 malignant ones.
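
The exact counts behind the bar chart can also be read directly from the factor; a small verification sketch:

# Counts and proportions of each diagnosis
table(WDBC$analysis)
round(prop.table(table(WDBC$analysis)), 3)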

We continue with the representation of the dataset

Given the large number of variables in the dataset, plotting all of them at once would be cumbersome and confusing. We have therefore chosen to plot them grouped by Mean, SE (standard error) and Worst (worst or largest value) to make the plots more readable.

MEAN. Pairwise comparison of the variables and their correlations

p <- ggpairs(
     WDBC, 
     aes(alpha = 0.65, color = analysis), 
     columns = 2:11,
     legend = 1,
     upper = list(continuous = wrap("cor", alpha = 1, size = 3, alignPercent=1)),
     lower = list(continuous = wrap("points", alpha=0.35)),
     title = "Cancer Features. Mean")
p <- p + theme_minimal()
p <- p + theme(legend.position = "bottom",legend.title = element_blank())

print(p)

SE. Pairwise comparison of the variables and their correlations

p <- ggpairs(
     WDBC, 
     aes(alpha = 0.65, color = analysis), 
     columns = 12:21,
     legend = 1,
     upper = list(continuous = wrap("cor", alpha = 1, size = 3, alignPercent=1)),
     lower = list(continuous = wrap("points", alpha=0.35)),
     title = "Cancer Features. Standard Error")
p <- p + theme_minimal()
p <- p + theme(legend.position = "bottom",legend.title = element_blank())

print(p)

WORST. Pairwise comparison of the variables and their correlations

p <- ggpairs(
     WDBC, 
     aes(alpha = 0.65, color = analysis),  
     columns = 22:31,
     legend = 1,
     upper = list(continuous = wrap("cor", alpha = 1, size = 3, alignPercent=1)),
     lower = list(continuous = wrap("points", alpha=0.35)),
     title = "Cancer Features. Worst o largest features")
p <- p + theme_minimal()
p <- p + theme(legend.position = "bottom",legend.title = element_blank())

print(p)

The plots above have also been generated through an R Shiny app (code included with this project) and subsequently published on shinyapps.io at the following link:

https://alu4216.shinyapps.io/tfm_uned_wdbc/

We now produce a somewhat clearer alternative representation of the correlations

Here the correlations between the variables are shown in an alternative way to the one above, which is arguably one of the most effective representations of correlation.

MEAN. Correlation of the variables

p <- ggcorr(
     WDBC[,2:11],
     geom = "tile", 
     label = TRUE, 
     label_round = 2,
     hjust = 0.80,
     vjust = 1,
     nbreaks = 5,
     drop = TRUE,
     palette = "Accent",
     layout.exp = 1,
     legend.position = "bottom")

p <- p + ggplot2::labs(title = "Cancer Correlations Features. Mean")
print(p)

SE. Correlation of the variables

p <- ggcorr(
     WDBC[,12:21],
     geom = "tile", 
     label = TRUE, 
     label_round = 2,
     hjust = 0.80,
     vjust = 1,
     nbreaks = 5,
     drop = TRUE,
     palette = "Accent",
     layout.exp = 1,
     legend.position = "bottom")

p <- p + ggplot2::labs(title = "Cancer Correlations Features. SE")
print(p)

WORST. Correlation of the variables

p <- ggcorr(
     WDBC[,22:31],
     geom = "tile", 
     label = TRUE, 
     label_round = 2,
     hjust = 0.80,
     vjust = 1,
     nbreaks = 5,
     drop = TRUE,
     palette = "Accent",
     layout.exp = 1,
     legend.position = "bottom")

p <- p + ggplot2::labs(title = "Cancer Correlations Features. Worst o largest features")
print(p)

Second block. Preparing the data for analysis

Data preparation

We normalize the data to make sure that variables with different scales or with extreme values (maximum and minimum values far away from the central group, which can distort the mean) do not have an undue influence on the analysis.

We set a seed to ensure reproducibility.

Finally, we create a 'train' dataset and a 'test' dataset that we will use later on.

set.seed(1234)

normalize <- function(x) {
  return((x - min(x)) / (max(x) - min(x)))
}

wdbc_norm <- as.data.frame(lapply(WDBC[2:ncol(WDBC)], normalize))

ind <- sample(2, nrow(WDBC), replace=TRUE, prob=c(0.67, 0.33))

wdbc.train <- wdbc_norm[ind==1, 1:ncol(wdbc_norm)]
wdbc.test <- wdbc_norm[ind==2, 1:ncol(wdbc_norm)]

wdbc.trainLabels <- WDBC[ind==1, 1]
wdbc.testLabels <- WDBC[ind==2, 1]
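
As a quick verification sketch, every normalized column should now lie in the [0, 1] range:

# Min-max normalization maps each column to [0, 1]
range(wdbc_norm$radius_mean)
sapply(wdbc_norm, function(x) c(min = min(x), max = max(x)))[, 1:3]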

Third block. PCA analysis

Principal component analysis

We run a principal component analysis both on the full dataset and split into the three groups (MEAN, SE, WORST).

all.pca <- prcomp(wdbc_norm, center = TRUE, scale = TRUE)
mean.pca <- prcomp(wdbc_norm[,1:10], center = TRUE, scale = TRUE)
se.pca <- prcomp(wdbc_norm[,11:20], center = TRUE, scale = TRUE)
worst.pca <- prcomp(wdbc_norm[,21:30], center = TRUE, scale = TRUE)
summary(all.pca)
## Importance of components:
##                           PC1    PC2     PC3     PC4     PC5     PC6
## Standard deviation     3.6444 2.3857 1.67867 1.40735 1.28403 1.09880
## Proportion of Variance 0.4427 0.1897 0.09393 0.06602 0.05496 0.04025
## Cumulative Proportion  0.4427 0.6324 0.72636 0.79239 0.84734 0.88759
##                            PC7     PC8    PC9    PC10   PC11    PC12
## Standard deviation     0.82172 0.69037 0.6457 0.59219 0.5421 0.51104
## Proportion of Variance 0.02251 0.01589 0.0139 0.01169 0.0098 0.00871
## Cumulative Proportion  0.91010 0.92598 0.9399 0.95157 0.9614 0.97007
##                           PC13    PC14    PC15    PC16    PC17    PC18
## Standard deviation     0.49128 0.39624 0.30681 0.28260 0.24372 0.22939
## Proportion of Variance 0.00805 0.00523 0.00314 0.00266 0.00198 0.00175
## Cumulative Proportion  0.97812 0.98335 0.98649 0.98915 0.99113 0.99288
##                           PC19    PC20   PC21    PC22    PC23   PC24
## Standard deviation     0.22244 0.17652 0.1731 0.16565 0.15602 0.1344
## Proportion of Variance 0.00165 0.00104 0.0010 0.00091 0.00081 0.0006
## Cumulative Proportion  0.99453 0.99557 0.9966 0.99749 0.99830 0.9989
##                           PC25    PC26    PC27    PC28    PC29    PC30
## Standard deviation     0.12442 0.09043 0.08307 0.03987 0.02736 0.01153
## Proportion of Variance 0.00052 0.00027 0.00023 0.00005 0.00002 0.00000
## Cumulative Proportion  0.99942 0.99969 0.99992 0.99997 1.00000 1.00000

It can be seen that from PC1 to PC5 the cumulative proportion of variance is approximately 85%, which means that these 5 components explain about 85% of the variance in the data.
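
The same figure can be derived directly from the PCA object; a small sketch based on the component standard deviations stored in all.pca:

# Proportion of variance explained by each component and its cumulative sum
var_explained <- all.pca$sdev^2 / sum(all.pca$sdev^2)
cum_var <- cumsum(var_explained)
# First number of components reaching ~85% of the variance (5 in this case)
which(cum_var >= 0.85)[1]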

Representation of the principal components.

SCREEPLOT

In this scree plot we show the percentage of variability explained by each principal component. The plot orders the principal components by decreasing eigenvalue (the portion of the total variance explained by a component).

p <- fviz_screeplot(all.pca,  barfill ="#72dbde", barcolor="#333333", addlabels = TRUE, ylim = c(0, 50), linecolor = "red",ncp = 30)
p <- p + theme_minimal() 
p <- p + labs(title = "Screeplot", x = "Principal Components")
print(p)

SCREEPLOT grouped by MEAN, SE and WORST

We repeat the previous plot, but grouping by MEAN, SE and WORST on the same row so that the differences can be observed.

p1 <- fviz_screeplot(mean.pca,  barfill ="#72dbde", barcolor="#333333", addlabels = TRUE, ylim = c(0, 60), linecolor = "red",ncp = 10)
p1 <- p1 + theme_minimal() 
p1 <- p1 + labs(title = "MEAN PCA", x = "Principal Components")

p2 <- fviz_screeplot(se.pca,  barfill ="#f8b1ac", barcolor="#333333", addlabels = TRUE, ylim = c(0, 60), linecolor = "red",ncp = 10)
p2 <- p2 + theme_minimal() 
p2 <- p2 + labs(title = "SE PCA", x = "Principal Components")

p3 <- fviz_screeplot(worst.pca, barfill ="#beaed4", barcolor="#333333", addlabels = TRUE, ylim = c(0, 60), linecolor = "red",ncp = 10)
p3 <- p3 + theme_minimal() 
p3 <- p3 + labs(title = "WORST PCA", x = "Principal Components")

grid.arrange(p1,p2,p3,ncol=3)

Contribution of the variables to PC1, PC2 and PC3

This plot shows the contribution of each variable to each of the three most important principal components. If a variable rises above the horizontal red reference line, it can be considered an important contributor to that component.

p1 <- fviz_contrib(all.pca, fill="#72dbde", color="#333333", choice = "var", axes = 1, top = 10)
p1 <- p1 + theme_minimal() 
p1 <- p1 + theme(axis.text.x = element_text(angle=90))
p1 <- p1 + labs(title = "Contribution of variables to PC1", x = "Principal Components")

p2 <- fviz_contrib(all.pca, fill="#f8b1ac", color="#333333", choice = "var", axes = 2, top = 10)
p2 <- p2 + theme_minimal() 
p2 <- p2 + theme(axis.text.x = element_text(angle=90))
p2 <- p2 + labs(title = "Contribution of variables to PC2", x = "Principal Components")

p3 <- fviz_contrib(all.pca, fill="#beaed4", color="#333333", choice = "var", axes = 3, top = 10)
p3 <- p3 + theme_minimal() 
p3 <- p3 + theme(axis.text.x = element_text(angle=90))
p3 <- p3 + labs(title = "Contribution of variables to PC3", x = "Principal Components")

grid.arrange(p1,p2,p3,ncol=3)
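
For reference, the contribution percentages plotted above can also be inspected as a plain matrix through factoextra (a small sketch; get_pca_var() belongs to the already loaded factoextra package):

# Percentage contribution of every variable to PC1-PC3
var_info <- get_pca_var(all.pca)
round(var_info$contrib[, 1:3], 2)
# Expected average contribution, i.e. the red reference line drawn by fviz_contrib
100 / ncol(wdbc_norm)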

Fourth block. Classification methods

Machine learning algorithms for classification

Supervised learning

In this section we will run several tests with different classification algorithms such as KNN, SVM, etc. For this we will use the 'train' dataset (67% of the original dataset, 380 rows) and the 'test' dataset (33% of the original dataset, 189 rows) created in the previous section.
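
A quick check of that split (a verification sketch, nothing new is computed):

# Sizes and class balance of the train/test partitions created earlier
nrow(wdbc.train)
nrow(wdbc.test)
table(wdbc.trainLabels)
table(wdbc.testLabels)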

KNN (k-nearest neighbors)

We start with KNN in its basic configuration, k = 1.

knn_pred <- knn(train = wdbc.train, test = wdbc.test, cl = wdbc.trainLabels, k=1)
CrossTable(x = wdbc.testLabels, y = knn_pred, prop.chisq=FALSE)
## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |           N / Row Total |
## |           N / Col Total |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  189 
## 
##  
##                 | knn_pred 
## wdbc.testLabels |    Benign | Malignant | Row Total | 
## ----------------|-----------|-----------|-----------|
##          Benign |       107 |         5 |       112 | 
##                 |     0.955 |     0.045 |     0.593 | 
##                 |     0.955 |     0.065 |           | 
##                 |     0.566 |     0.026 |           | 
## ----------------|-----------|-----------|-----------|
##       Malignant |         5 |        72 |        77 | 
##                 |     0.065 |     0.935 |     0.407 | 
##                 |     0.045 |     0.935 |           | 
##                 |     0.026 |     0.381 |           | 
## ----------------|-----------|-----------|-----------|
##    Column Total |       112 |        77 |       189 | 
##                 |     0.593 |     0.407 |           | 
## ----------------|-----------|-----------|-----------|
## 
## 
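
Besides CrossTable, the same confusion matrix can be summarized into accuracy, sensitivity and specificity with caret's confusionMatrix (already loaded); a small optional sketch:

# Overall accuracy, sensitivity and specificity, taking Malignant as the positive class
confusionMatrix(data = knn_pred, reference = wdbc.testLabels, positive = "Malignant")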

Tuning KNN

Using the tune.knn function, we will try to find the best 'k' for the KNN algorithm, in order to maximize or improve the accuracy of the algorithm's predictions.

Several parameters of tune.knn can be varied. Here we will try k between 1 and 100 and use different sampling techniques, which are used to estimate how good the predictor is.

Let's start with the tuning.

Sampling Bootstrap with k from 1 to 100

In this case we show the k suggested by the tuning, together with a plot of the different attempts.

knn_tune <- tune.knn(wdbc.train, wdbc.trainLabels, k = 1:100, tunecontrol = tune.control(sampling = "boot")) 
summary(knn_tune)
## 
## Parameter tuning of 'knn.wrapper':
## 
## - sampling method: bootstrapping 
## 
## - best parameters:
##   k
##  15
## 
## - best performance: 0.04082349 
## 
## - Detailed performance results:
##       k      error  dispersion
## 1     1 0.04767389 0.013949478
## 2     2 0.05144260 0.015990369
## 3     3 0.04507377 0.010183908
## 4     4 0.05203528 0.014338113
## 5     5 0.05150333 0.010009822
## 6     6 0.04384690 0.009135431
## 7     7 0.04317199 0.014997490
## 8     8 0.04706106 0.015966426
## 9     9 0.05024447 0.014526648
## 10   10 0.04445005 0.016238217
## 11   11 0.04316794 0.016036153
## 12   12 0.04318572 0.013224810
## 13   13 0.04582779 0.014286435
## 14   14 0.04146163 0.016298855
## 15   15 0.04082349 0.018119871
## 16   16 0.04531075 0.018597775
## 17   17 0.04336326 0.018307819
## 18   18 0.04649794 0.017791950
## 19   19 0.04643885 0.017597991
## 20   20 0.04708412 0.019193422
## 21   21 0.04709826 0.020356919
## 22   22 0.04648756 0.020932460
## 23   23 0.04780832 0.022411680
## 24   24 0.05029144 0.018811267
## 25   25 0.04773268 0.019144259
## 26   26 0.04704748 0.021099083
## 27   27 0.04834496 0.020047617
## 28   28 0.04834496 0.020047617
## 29   29 0.04965516 0.014725234
## 30   30 0.04965516 0.017081968
## 31   31 0.04778986 0.017279687
## 32   32 0.05155405 0.015881268
## 33   33 0.05095296 0.015519683
## 34   34 0.05095296 0.013238942
## 35   35 0.05094146 0.014139398
## 36   36 0.05413291 0.014422238
## 37   37 0.05411560 0.013310236
## 38   38 0.05475254 0.012976223
## 39   39 0.05544312 0.014873289
## 40   40 0.05607174 0.012725794
## 41   41 0.05672109 0.014389670
## 42   42 0.05737044 0.016143498
## 43   43 0.05804135 0.015251180
## 44   44 0.05801366 0.016622714
## 45   45 0.05544720 0.012859257
## 46   46 0.05674477 0.016538468
## 47   47 0.05799521 0.016181130
## 48   48 0.05735469 0.016336318
## 49   49 0.05737134 0.016395482
## 50   50 0.05992523 0.019083285
## 51   51 0.05932788 0.018713982
## 52   52 0.06059073 0.020396388
## 53   53 0.05990179 0.017180177
## 54   54 0.06119216 0.018557857
## 55   55 0.06054740 0.019223699
## 56   56 0.05857778 0.017073663
## 57   57 0.05921473 0.016875367
## 58   58 0.06050919 0.018589023
## 59   59 0.06118418 0.020395315
## 60   60 0.05926673 0.020152640
## 61   61 0.06247471 0.020370538
## 62   62 0.06314546 0.020860433
## 63   63 0.06314138 0.020630404
## 64   64 0.06439229 0.021715951
## 65   65 0.06441573 0.021573149
## 66   66 0.06502923 0.021368682
## 67   67 0.06441573 0.021573149
## 68   68 0.06693310 0.022921239
## 69   69 0.06692936 0.023267927
## 70   70 0.07017489 0.024680136
## 71   71 0.07013904 0.022799185
## 72   72 0.06945996 0.022531712
## 73   73 0.07010931 0.022059576
## 74   74 0.07204496 0.026149863
## 75   75 0.07143146 0.026180762
## 76   76 0.07271163 0.025818388
## 77   77 0.07333663 0.025105197
## 78   78 0.07333663 0.025105197
## 79   79 0.07333663 0.025105197
## 80   80 0.07459115 0.024311436
## 81   81 0.07459115 0.024311436
## 82   82 0.07332293 0.025367300
## 83   83 0.07332293 0.025367300
## 84   84 0.07398060 0.025423624
## 85   85 0.07398060 0.025423624
## 86   86 0.07459410 0.025323277
## 87   87 0.07526077 0.025567979
## 88   88 0.07399792 0.026216868
## 89   89 0.07403377 0.025947090
## 90   90 0.07587427 0.025597811
## 91   91 0.07653261 0.026675712
## 92   92 0.07715541 0.025185751
## 93   93 0.07718938 0.024577626
## 94   94 0.07848400 0.024757509
## 95   95 0.07909750 0.024529621
## 96   96 0.07969434 0.024392168
## 97   97 0.08034369 0.026095370
## 98   98 0.07967702 0.025810247
## 99   99 0.08031397 0.025974021
## 100 100 0.08162998 0.027964443
plot(knn_tune)

vectorK<-c(knn_tune$best.parameters$k)

We run and test on the test data with k = 15.

knn_pred <- knn(train = wdbc.train, test = wdbc.test, cl = wdbc.trainLabels, k=knn_tune$best.parameters$k)
CrossTable(x = wdbc.testLabels, y = knn_pred, prop.chisq=FALSE)
## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |           N / Row Total |
## |           N / Col Total |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  189 
## 
##  
##                 | knn_pred 
## wdbc.testLabels |    Benign | Malignant | Row Total | 
## ----------------|-----------|-----------|-----------|
##          Benign |       112 |         0 |       112 | 
##                 |     1.000 |     0.000 |     0.593 | 
##                 |     0.949 |     0.000 |           | 
##                 |     0.593 |     0.000 |           | 
## ----------------|-----------|-----------|-----------|
##       Malignant |         6 |        71 |        77 | 
##                 |     0.078 |     0.922 |     0.407 | 
##                 |     0.051 |     1.000 |           | 
##                 |     0.032 |     0.376 |           | 
## ----------------|-----------|-----------|-----------|
##    Column Total |       118 |        71 |       189 | 
##                 |     0.624 |     0.376 |           | 
## ----------------|-----------|-----------|-----------|
## 
## 

Sampling Cross Validation with k from 1 to 100 and 10-fold cross-validation

knn_tune <- tune.knn(wdbc.train, wdbc.trainLabels, k = 1:100, tunecontrol = tune.control(sampling = "cross"),cross = 10 )
summary(knn_tune)
## 
## Parameter tuning of 'knn.wrapper':
## 
## - sampling method: 10-fold cross validation 
## 
## - best parameters:
##  k
##  3
## 
## - best performance: 0.02368421 
## 
## - Detailed performance results:
##       k      error dispersion
## 1     1 0.05263158 0.04641669
## 2     2 0.04473684 0.04649950
## 3     3 0.02368421 0.01493806
## 4     4 0.02894737 0.02616918
## 5     5 0.02631579 0.02481076
## 6     6 0.03157895 0.02075817
## 7     7 0.03157895 0.02075817
## 8     8 0.02894737 0.01493806
## 9     9 0.02894737 0.02616918
## 10   10 0.02631579 0.02148675
## 11   11 0.02368421 0.01941749
## 12   12 0.03684211 0.02828862
## 13   13 0.04210526 0.02542347
## 14   14 0.03947368 0.02842430
## 15   15 0.04210526 0.02219142
## 16   16 0.04473684 0.03051321
## 17   17 0.04473684 0.03051321
## 18   18 0.04473684 0.03051321
## 19   19 0.04473684 0.03051321
## 20   20 0.04473684 0.03051321
## 21   21 0.04736842 0.02717883
## 22   22 0.04736842 0.02717883
## 23   23 0.04473684 0.02787763
## 24   24 0.04736842 0.02717883
## 25   25 0.05000000 0.03150576
## 26   26 0.05263158 0.03282156
## 27   27 0.05000000 0.03150576
## 28   28 0.05000000 0.03386010
## 29   29 0.04736842 0.03234928
## 30   30 0.05000000 0.03386010
## 31   31 0.04736842 0.03234928
## 32   32 0.04736842 0.03234928
## 33   33 0.04736842 0.03234928
## 34   34 0.04736842 0.03234928
## 35   35 0.05000000 0.03813520
## 36   36 0.05000000 0.03813520
## 37   37 0.05263158 0.03922926
## 38   38 0.05263158 0.03922926
## 39   39 0.04736842 0.03883499
## 40   40 0.05000000 0.04377184
## 41   41 0.05000000 0.03813520
## 42   42 0.05263158 0.03922926
## 43   43 0.04736842 0.03883499
## 44   44 0.04736842 0.03883499
## 45   45 0.04736842 0.03883499
## 46   46 0.04736842 0.03883499
## 47   47 0.05263158 0.04472824
## 48   48 0.05526316 0.04377184
## 49   49 0.05263158 0.04472824
## 50   50 0.05789474 0.04261389
## 51   51 0.05263158 0.04472824
## 52   52 0.05526316 0.04377184
## 53   53 0.05263158 0.04472824
## 54   54 0.05263158 0.04472824
## 55   55 0.05526316 0.04377184
## 56   56 0.05526316 0.04377184
## 57   57 0.05789474 0.04261389
## 58   58 0.05789474 0.04261389
## 59   59 0.05789474 0.04261389
## 60   60 0.05263158 0.04641669
## 61   61 0.05263158 0.04114400
## 62   62 0.05000000 0.04197714
## 63   63 0.05526316 0.04377184
## 64   64 0.05526316 0.04377184
## 65   65 0.05526316 0.04377184
## 66   66 0.05526316 0.04377184
## 67   67 0.05789474 0.04261389
## 68   68 0.05789474 0.04261389
## 69   69 0.05789474 0.04261389
## 70   70 0.05789474 0.04261389
## 71   71 0.06315789 0.03961961
## 72   72 0.06578947 0.03772950
## 73   73 0.06578947 0.03772950
## 74   74 0.06578947 0.03772950
## 75   75 0.06842105 0.03762739
## 76   76 0.06842105 0.03762739
## 77   77 0.06578947 0.03772950
## 78   78 0.06578947 0.03772950
## 79   79 0.06578947 0.03772950
## 80   80 0.06315789 0.03961961
## 81   81 0.06842105 0.03961961
## 82   82 0.06842105 0.03961961
## 83   83 0.06842105 0.03961961
## 84   84 0.06842105 0.03961961
## 85   85 0.06842105 0.03961961
## 86   86 0.07105263 0.03731938
## 87   87 0.07105263 0.03731938
## 88   88 0.06842105 0.03961961
## 89   89 0.07105263 0.03731938
## 90   90 0.06842105 0.03961961
## 91   91 0.06842105 0.03961961
## 92   92 0.06842105 0.03961961
## 93   93 0.07105263 0.03731938
## 94   94 0.07105263 0.03731938
## 95   95 0.07105263 0.03731938
## 96   96 0.07631579 0.04010221
## 97   97 0.07631579 0.04010221
## 98   98 0.07631579 0.04010221
## 99   99 0.07631579 0.04010221
## 100 100 0.07894737 0.03922926
plot(knn_tune)

vectorK<-c(vectorK, knn_tune$best.parameters$k)

We run and test on the test data with k = 3.

knn_pred <- knn(train = wdbc.train, test = wdbc.test, cl = wdbc.trainLabels, k=knn_tune$best.parameters$k)
CrossTable(x = wdbc.testLabels, y = knn_pred, prop.chisq=FALSE)
## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |           N / Row Total |
## |           N / Col Total |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  189 
## 
##  
##                 | knn_pred 
## wdbc.testLabels |    Benign | Malignant | Row Total | 
## ----------------|-----------|-----------|-----------|
##          Benign |       109 |         3 |       112 | 
##                 |     0.973 |     0.027 |     0.593 | 
##                 |     0.940 |     0.041 |           | 
##                 |     0.577 |     0.016 |           | 
## ----------------|-----------|-----------|-----------|
##       Malignant |         7 |        70 |        77 | 
##                 |     0.091 |     0.909 |     0.407 | 
##                 |     0.060 |     0.959 |           | 
##                 |     0.037 |     0.370 |           | 
## ----------------|-----------|-----------|-----------|
##    Column Total |       116 |        73 |       189 | 
##                 |     0.614 |     0.386 |           | 
## ----------------|-----------|-----------|-----------|
## 
## 

Sampling Fix with k from 1 to 100

knn_tune <- tune.knn(wdbc.train, wdbc.trainLabels, k = 1:100, tunecontrol= tune.control(sampling = "fix"),fix = 10 )
summary(knn_tune)
## 
## Parameter tuning of 'knn.wrapper':
## 
## - sampling method: fixed training/validation set 
## 
## - best parameters:
##  k
##  3
## 
## - best performance: 0.01574803 
## 
## - Detailed performance results:
##       k      error dispersion
## 1     1 0.03937008         NA
## 2     2 0.03149606         NA
## 3     3 0.01574803         NA
## 4     4 0.03937008         NA
## 5     5 0.02362205         NA
## 6     6 0.02362205         NA
## 7     7 0.02362205         NA
## 8     8 0.02362205         NA
## 9     9 0.03149606         NA
## 10   10 0.03149606         NA
## 11   11 0.03149606         NA
## 12   12 0.03937008         NA
## 13   13 0.03937008         NA
## 14   14 0.04724409         NA
## 15   15 0.04724409         NA
## 16   16 0.04724409         NA
## 17   17 0.04724409         NA
## 18   18 0.04724409         NA
## 19   19 0.04724409         NA
## 20   20 0.03937008         NA
## 21   21 0.03937008         NA
## 22   22 0.03937008         NA
## 23   23 0.03937008         NA
## 24   24 0.03937008         NA
## 25   25 0.03937008         NA
## 26   26 0.03937008         NA
## 27   27 0.03937008         NA
## 28   28 0.03937008         NA
## 29   29 0.03937008         NA
## 30   30 0.03937008         NA
## 31   31 0.03937008         NA
## 32   32 0.03937008         NA
## 33   33 0.03937008         NA
## 34   34 0.03937008         NA
## 35   35 0.03937008         NA
## 36   36 0.04724409         NA
## 37   37 0.03937008         NA
## 38   38 0.03937008         NA
## 39   39 0.03937008         NA
## 40   40 0.03937008         NA
## 41   41 0.04724409         NA
## 42   42 0.04724409         NA
## 43   43 0.04724409         NA
## 44   44 0.04724409         NA
## 45   45 0.05511811         NA
## 46   46 0.05511811         NA
## 47   47 0.05511811         NA
## 48   48 0.05511811         NA
## 49   49 0.05511811         NA
## 50   50 0.05511811         NA
## 51   51 0.05511811         NA
## 52   52 0.05511811         NA
## 53   53 0.06299213         NA
## 54   54 0.06299213         NA
## 55   55 0.06299213         NA
## 56   56 0.06299213         NA
## 57   57 0.06299213         NA
## 58   58 0.06299213         NA
## 59   59 0.06299213         NA
## 60   60 0.06299213         NA
## 61   61 0.06299213         NA
## 62   62 0.06299213         NA
## 63   63 0.07086614         NA
## 64   64 0.07086614         NA
## 65   65 0.07086614         NA
## 66   66 0.07086614         NA
## 67   67 0.07086614         NA
## 68   68 0.07086614         NA
## 69   69 0.07086614         NA
## 70   70 0.07086614         NA
## 71   71 0.07086614         NA
## 72   72 0.07086614         NA
## 73   73 0.07086614         NA
## 74   74 0.07086614         NA
## 75   75 0.07086614         NA
## 76   76 0.07086614         NA
## 77   77 0.07086614         NA
## 78   78 0.07086614         NA
## 79   79 0.07086614         NA
## 80   80 0.07086614         NA
## 81   81 0.07086614         NA
## 82   82 0.07086614         NA
## 83   83 0.07086614         NA
## 84   84 0.07086614         NA
## 85   85 0.07086614         NA
## 86   86 0.07086614         NA
## 87   87 0.07086614         NA
## 88   88 0.07086614         NA
## 89   89 0.07086614         NA
## 90   90 0.07086614         NA
## 91   91 0.07086614         NA
## 92   92 0.07086614         NA
## 93   93 0.07086614         NA
## 94   94 0.07874016         NA
## 95   95 0.07086614         NA
## 96   96 0.07086614         NA
## 97   97 0.07874016         NA
## 98   98 0.07874016         NA
## 99   99 0.07874016         NA
## 100 100 0.07086614         NA
plot(knn_tune)

vectorK<-c(vectorK, knn_tune$best.parameters$k)

We run and test on the test data with k = 3.

knn_pred <- knn(train = wdbc.train, test = wdbc.test, cl = wdbc.trainLabels, k=knn_tune$best.parameters$k)
CrossTable(x = wdbc.testLabels, y = knn_pred, prop.chisq=FALSE)
## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |           N / Row Total |
## |           N / Col Total |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  189 
## 
##  
##                 | knn_pred 
## wdbc.testLabels |    Benign | Malignant | Row Total | 
## ----------------|-----------|-----------|-----------|
##          Benign |       109 |         3 |       112 | 
##                 |     0.973 |     0.027 |     0.593 | 
##                 |     0.940 |     0.041 |           | 
##                 |     0.577 |     0.016 |           | 
## ----------------|-----------|-----------|-----------|
##       Malignant |         7 |        70 |        77 | 
##                 |     0.091 |     0.909 |     0.407 | 
##                 |     0.060 |     0.959 |           | 
##                 |     0.037 |     0.370 |           | 
## ----------------|-----------|-----------|-----------|
##    Column Total |       116 |        73 |       189 | 
##                 |     0.614 |     0.386 |           | 
## ----------------|-----------|-----------|-----------|
## 
## 

Conclusions for KNN

The tuning above suggests different values for k: 15, 3 and 3. For the different k values obtained with the different sampling methods, the results are practically identical, differing by at most 1 or 2 correct predictions for benign tumors and being virtually identical for malignant tumors.

Thus, the tuning has slightly improved the initial results obtained with the default values.
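
To make the comparison explicit, the k values collected in vectorK can be re-evaluated on the test set; a minimal sketch (results may vary slightly between runs because knn breaks ties at random):

# Test accuracy for each distinct k suggested by the tuning runs
for (k in unique(vectorK)) {
  pred_k <- knn(train = wdbc.train, test = wdbc.test, cl = wdbc.trainLabels, k = k)
  cat("k =", k, "-> test accuracy:", round(mean(pred_k == wdbc.testLabels), 3), "\n")
}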

SVM (Support Vector Machines)

In this case we use the tune.svm function directly to obtain the best cost and gamma parameters. This function lets us try different values of gamma, cost and kernel.

We will try different kernels ("linear", "polynomial", "radial" and "sigmoid") with cost values of 0.1, 1, 10, 100 and 1000 and gamma values of 0.037, 0.11, 0.33, 1 and 3.
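
For reference, the grids passed to tune.svm below expand to exactly those values:

10^(-1:3)  # cost values: 0.1, 1, 10, 100, 1000
3^(-3:1)   # gamma values: 0.037, 0.111, 0.333, 1, 3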

Linear kernel

svm_tune <- tune.svm(x=wdbc.train, y=wdbc.trainLabels, kernel="linear", cost=10^(-1:3), gamma=3^(-3:1))
summary(svm_tune)
## 
## Parameter tuning of 'svm':
## 
## - sampling method: 10-fold cross validation 
## 
## - best parameters:
##       gamma cost
##  0.03703704  0.1
## 
## - best performance: 0.02368421 
## 
## - Detailed performance results:
##         gamma  cost      error dispersion
## 1  0.03703704 1e-01 0.02368421 0.01493806
## 2  0.11111111 1e-01 0.02368421 0.01493806
## 3  0.33333333 1e-01 0.02368421 0.01493806
## 4  1.00000000 1e-01 0.02368421 0.01493806
## 5  3.00000000 1e-01 0.02368421 0.01493806
## 6  0.03703704 1e+00 0.03684211 0.02828862
## 7  0.11111111 1e+00 0.03684211 0.02828862
## 8  0.33333333 1e+00 0.03684211 0.02828862
## 9  1.00000000 1e+00 0.03684211 0.02828862
## 10 3.00000000 1e+00 0.03684211 0.02828862
## 11 0.03703704 1e+01 0.03947368 0.03101346
## 12 0.11111111 1e+01 0.03947368 0.03101346
## 13 0.33333333 1e+01 0.03947368 0.03101346
## 14 1.00000000 1e+01 0.03947368 0.03101346
## 15 3.00000000 1e+01 0.03947368 0.03101346
## 16 0.03703704 1e+02 0.03947368 0.03101346
## 17 0.11111111 1e+02 0.03947368 0.03101346
## 18 0.33333333 1e+02 0.03947368 0.03101346
## 19 1.00000000 1e+02 0.03947368 0.03101346
## 20 3.00000000 1e+02 0.03947368 0.03101346
## 21 0.03703704 1e+03 0.03947368 0.03101346
## 22 0.11111111 1e+03 0.03947368 0.03101346
## 23 0.33333333 1e+03 0.03947368 0.03101346
## 24 1.00000000 1e+03 0.03947368 0.03101346
## 25 3.00000000 1e+03 0.03947368 0.03101346
plot(svm_tune)

vectorSVM <- c("linear", svm_tune$best.parameters$cost, svm_tune$best.parameters$gamma)

We run and test on the test data with cost = 0.1 and gamma = 0.037037.

svm_pred <- svm(wdbc.train, wdbc.trainLabels, kernel = "linear", cost = svm_tune$best.parameters$cost, gamma = svm_tune$best.parameters$gamma)
summary(svm_pred)
## 
## Call:
## svm.default(x = wdbc.train, y = wdbc.trainLabels, gamma = svm_tune$best.parameters$gamma, 
##     cost = svm_tune$best.parameters$cost, kernerl = "linear")
## 
## 
## Parameters:
##    SVM-Type:  C-classification 
##  SVM-Kernel:  radial 
##        cost:  0.1 
##       gamma:  0.03703704 
## 
## Number of Support Vectors:  181
## 
##  ( 89 92 )
## 
## 
## Number of Classes:  2 
## 
## Levels: 
##  Benign Malignant
pred <- predict(svm_pred, new=wdbc.test)
CrossTable(x = wdbc.testLabels, y = pred,prop.chisq=FALSE)
## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |           N / Row Total |
## |           N / Col Total |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  189 
## 
##  
##                 | pred 
## wdbc.testLabels |    Benign | Malignant | Row Total | 
## ----------------|-----------|-----------|-----------|
##          Benign |       108 |         4 |       112 | 
##                 |     0.964 |     0.036 |     0.593 | 
##                 |     0.931 |     0.055 |           | 
##                 |     0.571 |     0.021 |           | 
## ----------------|-----------|-----------|-----------|
##       Malignant |         8 |        69 |        77 | 
##                 |     0.104 |     0.896 |     0.407 | 
##                 |     0.069 |     0.945 |           | 
##                 |     0.042 |     0.365 |           | 
## ----------------|-----------|-----------|-----------|
##    Column Total |       116 |        73 |       189 | 
##                 |     0.614 |     0.386 |           | 
## ----------------|-----------|-----------|-----------|
## 
## 

Polynomial kernel

svm_tune <- tune.svm(x=wdbc.train, y=wdbc.trainLabels, kernel="polynomial", cost=10^(-1:3), gamma=3^(-3:1))
summary(svm_tune)
## 
## Parameter tuning of 'svm':
## 
## - sampling method: 10-fold cross validation 
## 
## - best parameters:
##      gamma cost
##  0.3333333  0.1
## 
## - best performance: 0.03157895 
## 
## - Detailed performance results:
##         gamma  cost      error dispersion
## 1  0.03703704 1e-01 0.16315789 0.04438284
## 2  0.11111111 1e-01 0.07105263 0.03519720
## 3  0.33333333 1e-01 0.03157895 0.02717883
## 4  1.00000000 1e-01 0.05526316 0.03150576
## 5  3.00000000 1e-01 0.05526316 0.03150576
## 6  0.03703704 1e+00 0.10526316 0.04297350
## 7  0.11111111 1e+00 0.03421053 0.02166507
## 8  0.33333333 1e+00 0.05526316 0.03150576
## 9  1.00000000 1e+00 0.05526316 0.03150576
## 10 3.00000000 1e+00 0.05526316 0.03150576
## 11 0.03703704 1e+01 0.03684211 0.02542347
## 12 0.11111111 1e+01 0.05263158 0.03508772
## 13 0.33333333 1e+01 0.05526316 0.03150576
## 14 1.00000000 1e+01 0.05526316 0.03150576
## 15 3.00000000 1e+01 0.05526316 0.03150576
## 16 0.03703704 1e+02 0.03421053 0.02787763
## 17 0.11111111 1e+02 0.05526316 0.03150576
## 18 0.33333333 1e+02 0.05526316 0.03150576
## 19 1.00000000 1e+02 0.05526316 0.03150576
## 20 3.00000000 1e+02 0.05526316 0.03150576
## 21 0.03703704 1e+03 0.05526316 0.03150576
## 22 0.11111111 1e+03 0.05526316 0.03150576
## 23 0.33333333 1e+03 0.05526316 0.03150576
## 24 1.00000000 1e+03 0.05526316 0.03150576
## 25 3.00000000 1e+03 0.05526316 0.03150576
plot(svm_tune)

vectorSVM <- c(vectorSVM, c("polynomial", svm_tune$best.parameters$cost, svm_tune$best.parameters$gamma))

We run and test on the test data with cost = 0.1 and gamma = 0.3333333.

svm_pred <- svm(wdbc.train, wdbc.trainLabels, kernel = "polynomial", cost = svm_tune$best.parameters$cost, gamma = svm_tune$best.parameters$gamma)
summary(svm_pred)
## 
## Call:
## svm.default(x = wdbc.train, y = wdbc.trainLabels, gamma = svm_tune$best.parameters$gamma, 
##     cost = svm_tune$best.parameters$cost, kernerl = "polynomial")
## 
## 
## Parameters:
##    SVM-Type:  C-classification 
##  SVM-Kernel:  radial 
##        cost:  0.1 
##       gamma:  0.3333333 
## 
## Number of Support Vectors:  322
## 
##  ( 135 187 )
## 
## 
## Number of Classes:  2 
## 
## Levels: 
##  Benign Malignant
pred <- predict(svm_pred,new=wdbc.test)
CrossTable(x = wdbc.testLabels, y = pred,prop.chisq=FALSE)
## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  189 
## 
##  
##                 | pred 
## wdbc.testLabels |    Benign | Row Total | 
## ----------------|-----------|-----------|
##          Benign |       112 |       112 | 
##                 |     0.593 |           | 
## ----------------|-----------|-----------|
##       Malignant |        77 |        77 | 
##                 |     0.407 |           | 
## ----------------|-----------|-----------|
##    Column Total |       189 |       189 | 
## ----------------|-----------|-----------|
## 
## 

Radial kernel

svm_tune <- tune.svm(x=wdbc.train, y=wdbc.trainLabels, kernel="radial", cost=10^(-1:3), gamma=3^(-3:1))
summary(svm_tune)
## 
## Parameter tuning of 'svm':
## 
## - sampling method: 10-fold cross validation 
## 
## - best parameters:
##       gamma cost
##  0.03703704   10
## 
## - best performance: 0.01842105 
## 
## - Detailed performance results:
##         gamma  cost      error dispersion
## 1  0.03703704 1e-01 0.05789474 0.04438284
## 2  0.11111111 1e-01 0.09473684 0.06817319
## 3  0.33333333 1e-01 0.35526316 0.07260064
## 4  1.00000000 1e-01 0.35526316 0.07260064
## 5  3.00000000 1e-01 0.35526316 0.07260064
## 6  0.03703704 1e+00 0.02631579 0.02481076
## 7  0.11111111 1e+00 0.03947368 0.02842430
## 8  0.33333333 1e+00 0.09210526 0.06709236
## 9  1.00000000 1e+00 0.35000000 0.07131746
## 10 3.00000000 1e+00 0.35526316 0.07260064
## 11 0.03703704 1e+01 0.01842105 0.01776180
## 12 0.11111111 1e+01 0.04473684 0.03731938
## 13 0.33333333 1e+01 0.08684211 0.07023024
## 14 1.00000000 1e+01 0.34736842 0.07525477
## 15 3.00000000 1e+01 0.35526316 0.07260064
## 16 0.03703704 1e+02 0.03157895 0.03234928
## 17 0.11111111 1e+02 0.04473684 0.03731938
## 18 0.33333333 1e+02 0.08684211 0.07023024
## 19 1.00000000 1e+02 0.34736842 0.07525477
## 20 3.00000000 1e+02 0.35526316 0.07260064
## 21 0.03703704 1e+03 0.03157895 0.03234928
## 22 0.11111111 1e+03 0.04473684 0.03731938
## 23 0.33333333 1e+03 0.08684211 0.07023024
## 24 1.00000000 1e+03 0.34736842 0.07525477
## 25 3.00000000 1e+03 0.35526316 0.07260064
plot(svm_tune)

vectorSVM <- c(vectorSVM, c("radial", svm_tune$best.parameters$cost, svm_tune$best.parameters$gamma))

We run and test on the test data with cost = 10 and gamma = 0.037037.

svm_pred <- svm(wdbc.train, wdbc.trainLabels, kernel = "radial", cost = 100, gamma = svm_tune$best.parameters$gamma)

summary(svm_pred)
## 
## Call:
## svm.default(x = wdbc.train, y = wdbc.trainLabels, gamma = svm_tune$best.parameters$gamma, 
##     cost = 100, kernerl = "radial")
## 
## 
## Parameters:
##    SVM-Type:  C-classification 
##  SVM-Kernel:  radial 
##        cost:  100 
##       gamma:  0.03703704 
## 
## Number of Support Vectors:  78
## 
##  ( 41 37 )
## 
## 
## Number of Classes:  2 
## 
## Levels: 
##  Benign Malignant
pred <- predict(svm_pred,new=wdbc.test)
CrossTable(x = wdbc.testLabels, y = pred,prop.chisq=FALSE)
## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |           N / Row Total |
## |           N / Col Total |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  189 
## 
##  
##                 | pred 
## wdbc.testLabels |    Benign | Malignant | Row Total | 
## ----------------|-----------|-----------|-----------|
##          Benign |       107 |         5 |       112 | 
##                 |     0.955 |     0.045 |     0.593 | 
##                 |     0.930 |     0.068 |           | 
##                 |     0.566 |     0.026 |           | 
## ----------------|-----------|-----------|-----------|
##       Malignant |         8 |        69 |        77 | 
##                 |     0.104 |     0.896 |     0.407 | 
##                 |     0.070 |     0.932 |           | 
##                 |     0.042 |     0.365 |           | 
## ----------------|-----------|-----------|-----------|
##    Column Total |       115 |        74 |       189 | 
##                 |     0.608 |     0.392 |           | 
## ----------------|-----------|-----------|-----------|
## 
## 

Sigmoid kernel

svm_tune <- tune.svm(x=wdbc.train, y=wdbc.trainLabels, kernel="sigmoid", cost=10^(-1:3), gamma=3^(-3:1))
summary(svm_tune)
## 
## Parameter tuning of 'svm':
## 
## - sampling method: 10-fold cross validation 
## 
## - best parameters:
##       gamma cost
##  0.03703704  0.1
## 
## - best performance: 0.04736842 
## 
## - Detailed performance results:
##         gamma  cost      error dispersion
## 1  0.03703704 1e-01 0.04736842 0.05657725
## 2  0.11111111 1e-01 0.06315789 0.05435766
## 3  0.33333333 1e-01 0.08421053 0.04608395
## 4  1.00000000 1e-01 0.10263158 0.04549580
## 5  3.00000000 1e-01 0.07894737 0.03922926
## 6  0.03703704 1e+00 0.05263158 0.03721615
## 7  0.11111111 1e+00 0.09473684 0.04674706
## 8  0.33333333 1e+00 0.11578947 0.04333014
## 9  1.00000000 1e+00 0.11578947 0.05575526
## 10 3.00000000 1e+00 0.11315789 0.04812585
## 11 0.03703704 1e+01 0.07631579 0.04715677
## 12 0.11111111 1e+01 0.10789474 0.05031449
## 13 0.33333333 1e+01 0.12894737 0.04377184
## 14 1.00000000 1e+01 0.11578947 0.05144870
## 15 3.00000000 1e+01 0.11578947 0.05292317
## 16 0.03703704 1e+02 0.06842105 0.04836508
## 17 0.11111111 1e+02 0.09210526 0.04160892
## 18 0.33333333 1e+02 0.12631579 0.03883499
## 19 1.00000000 1e+02 0.12631579 0.05084694
## 20 3.00000000 1e+02 0.11578947 0.04836508
## 21 0.03703704 1e+03 0.07105263 0.04649950
## 22 0.11111111 1e+03 0.10000000 0.02987612
## 23 0.33333333 1e+03 0.13684211 0.04261389
## 24 1.00000000 1e+03 0.11578947 0.05292317
## 25 3.00000000 1e+03 0.11578947 0.04836508
plot(svm_tune)

vectorSVM <- c(vectorSVM, c("sigmoid", svm_tune$best.parameters$cost, svm_tune$best.parameters$gamma))

We run and test on the test data with cost = 0.1 and gamma = 0.037037.

svm_pred <- svm(wdbc.train, wdbc.trainLabels, kernel = "sigmoid", cost = svm_tune$best.parameters$cost, gamma = svm_tune$best.parameters$gamma)
summary(svm_pred)
## 
## Call:
## svm.default(x = wdbc.train, y = wdbc.trainLabels, gamma = svm_tune$best.parameters$gamma, 
##     cost = svm_tune$best.parameters$cost, kernerl = "sigmoid")
## 
## 
## Parameters:
##    SVM-Type:  C-classification 
##  SVM-Kernel:  radial 
##        cost:  0.1 
##       gamma:  0.03703704 
## 
## Number of Support Vectors:  181
## 
##  ( 89 92 )
## 
## 
## Number of Classes:  2 
## 
## Levels: 
##  Benign Malignant
pred <- predict(svm_pred,new=wdbc.test)
CrossTable(x = wdbc.testLabels, y = pred,prop.chisq=FALSE)
## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |           N / Row Total |
## |           N / Col Total |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  189 
## 
##  
##                 | pred 
## wdbc.testLabels |    Benign | Malignant | Row Total | 
## ----------------|-----------|-----------|-----------|
##          Benign |       108 |         4 |       112 | 
##                 |     0.964 |     0.036 |     0.593 | 
##                 |     0.931 |     0.055 |           | 
##                 |     0.571 |     0.021 |           | 
## ----------------|-----------|-----------|-----------|
##       Malignant |         8 |        69 |        77 | 
##                 |     0.104 |     0.896 |     0.407 | 
##                 |     0.069 |     0.945 |           | 
##                 |     0.042 |     0.365 |           | 
## ----------------|-----------|-----------|-----------|
##    Column Total |       116 |        73 |       189 | 
##                 |     0.614 |     0.386 |           | 
## ----------------|-----------|-----------|-----------|
## 
## 

Conclusions for SVM

The tuning above provides different cost and gamma values for each kernel, which are:

Kernel        Cost   Gamma
linear        0.1    0.037037037037037
polynomial    0.1    0.333333333333333
radial        10     0.037037037037037
sigmoid       0.1    0.037037037037037

It can be seen that the SVMs with the "linear" or "sigmoid" kernels obtain the best results when it comes to correctly predicting benign tumors, while the SVM with the "polynomial" kernel has the highest hit rate for malignant tumors. Even so, the results are fairly close for both benign and malignant predictions, with no large differences.

The cost and gamma values may vary between runs, which can change the final results.
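
As a side-by-side comparison, the tuned combinations from the table above can be refit and scored on the test set; a minimal sketch (the parameter values are copied from that table and, as noted, may change between runs):

# Refit one SVM per kernel with its tuned cost/gamma and report test accuracy
params <- data.frame(kernel = c("linear", "polynomial", "radial", "sigmoid"),
                     cost   = c(0.1, 0.1, 10, 0.1),
                     gamma  = c(1/27, 1/3, 1/27, 1/27),
                     stringsAsFactors = FALSE)
for (i in seq_len(nrow(params))) {
  fit    <- svm(wdbc.train, wdbc.trainLabels, kernel = params$kernel[i],
                cost = params$cost[i], gamma = params$gamma[i])
  pred_i <- predict(fit, wdbc.test)
  cat(sprintf("%-10s test accuracy: %.3f\n", params$kernel[i], mean(pred_i == wdbc.testLabels)))
}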

Naive Bayes

Here, as with KNN, we will use different sampling techniques, but in this case we will not modify any additional parameters; we will simply measure how well the predictor fits in four different ways.

Sampling Cross validation

nb_pred = train(wdbc.train,wdbc.trainLabels,method = 'nb', trControl=trainControl(method='cv',number=10))
nb_pred
## Naive Bayes 
## 
## 380 samples
##  30 predictor
##   2 classes: 'Benign', 'Malignant' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 342, 342, 342, 341, 342, 342, ... 
## Resampling results across tuning parameters:
## 
##   usekernel  Accuracy   Kappa    
##   FALSE      0.9344695  0.8547734
##    TRUE      0.9343345  0.8558323
## 
## Tuning parameter 'fL' was held constant at a value of 0
## Tuning
##  parameter 'adjust' was held constant at a value of 1
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were fL = 0, usekernel = FALSE
##  and adjust = 1.
pred <- predict(nb_pred$finalModel, wdbc.test)
CrossTable(x = wdbc.testLabels, y = pred$class, prop.chisq=FALSE)
## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |           N / Row Total |
## |           N / Col Total |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  189 
## 
##  
##                 | pred$class 
## wdbc.testLabels |    Benign | Malignant | Row Total | 
## ----------------|-----------|-----------|-----------|
##          Benign |       106 |         6 |       112 | 
##                 |     0.946 |     0.054 |     0.593 | 
##                 |     0.930 |     0.080 |           | 
##                 |     0.561 |     0.032 |           | 
## ----------------|-----------|-----------|-----------|
##       Malignant |         8 |        69 |        77 | 
##                 |     0.104 |     0.896 |     0.407 | 
##                 |     0.070 |     0.920 |           | 
##                 |     0.042 |     0.365 |           | 
## ----------------|-----------|-----------|-----------|
##    Column Total |       114 |        75 |       189 | 
##                 |     0.603 |     0.397 |           | 
## ----------------|-----------|-----------|-----------|
## 
## 
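
To inspect the resampled accuracies that caret stored for the selected model (a small optional sketch):

# Accuracy/kappa per tuning parameter combination and per-fold accuracy of the final model
nb_pred$results
head(nb_pred$resample)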

Sampling Cross validation with repeats

nb_pred = train(wdbc.train,wdbc.trainLabels,method = 'nb', trControl=trainControl(method='repeatedcv',number=10, repeats = 3))
nb_pred
## Naive Bayes 
## 
## 380 samples
##  30 predictor
##   2 classes: 'Benign', 'Malignant' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 3 times) 
## Summary of sample sizes: 343, 342, 343, 342, 342, 341, ... 
## Resampling results across tuning parameters:
## 
##   usekernel  Accuracy   Kappa    
##   FALSE      0.9343084  0.8549976
##    TRUE      0.9360877  0.8597369
## 
## Tuning parameter 'fL' was held constant at a value of 0
## Tuning
##  parameter 'adjust' was held constant at a value of 1
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were fL = 0, usekernel = TRUE
##  and adjust = 1.
pred <- predict(nb_pred$finalModel, wdbc.test)
CrossTable(x = wdbc.testLabels, y = pred$class, prop.chisq=FALSE)
## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |           N / Row Total |
## |           N / Col Total |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  189 
## 
##  
##                 | pred$class 
## wdbc.testLabels |    Benign | Malignant | Row Total | 
## ----------------|-----------|-----------|-----------|
##          Benign |       105 |         7 |       112 | 
##                 |     0.938 |     0.062 |     0.593 | 
##                 |     0.955 |     0.089 |           | 
##                 |     0.556 |     0.037 |           | 
## ----------------|-----------|-----------|-----------|
##       Malignant |         5 |        72 |        77 | 
##                 |     0.065 |     0.935 |     0.407 | 
##                 |     0.045 |     0.911 |           | 
##                 |     0.026 |     0.381 |           | 
## ----------------|-----------|-----------|-----------|
##    Column Total |       110 |        79 |       189 | 
##                 |     0.582 |     0.418 |           | 
## ----------------|-----------|-----------|-----------|
## 
## 

Sampling: bootstrap

nb_pred = train(wdbc.train, wdbc.trainLabels, method = 'nb', trControl=trainControl(method='boot',number=100))
nb_pred
## Naive Bayes 
## 
## 380 samples
##  30 predictor
##   2 classes: 'Benign', 'Malignant' 
## 
## No pre-processing
## Resampling: Bootstrapped (100 reps) 
## Summary of sample sizes: 380, 380, 380, 380, 380, 380, ... 
## Resampling results across tuning parameters:
## 
##   usekernel  Accuracy   Kappa    
##   FALSE      0.9393848  0.8659063
##    TRUE      0.9397078  0.8678455
## 
## Tuning parameter 'fL' was held constant at a value of 0
## Tuning
##  parameter 'adjust' was held constant at a value of 1
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were fL = 0, usekernel = TRUE
##  and adjust = 1.
pred <- predict(nb_pred$finalModel, wdbc.test)
CrossTable(x = wdbc.testLabels, y = pred$class, prop.chisq=FALSE)
## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |           N / Row Total |
## |           N / Col Total |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  189 
## 
##  
##                 | pred$class 
## wdbc.testLabels |    Benign | Malignant | Row Total | 
## ----------------|-----------|-----------|-----------|
##          Benign |       105 |         7 |       112 | 
##                 |     0.938 |     0.062 |     0.593 | 
##                 |     0.955 |     0.089 |           | 
##                 |     0.556 |     0.037 |           | 
## ----------------|-----------|-----------|-----------|
##       Malignant |         5 |        72 |        77 | 
##                 |     0.065 |     0.935 |     0.407 | 
##                 |     0.045 |     0.911 |           | 
##                 |     0.026 |     0.381 |           | 
## ----------------|-----------|-----------|-----------|
##    Column Total |       110 |        79 |       189 | 
##                 |     0.582 |     0.418 |           | 
## ----------------|-----------|-----------|-----------|
## 
## 

Sampling: leave-one-out cross-validation (LOOCV)

nb_pred = train(wdbc.train,wdbc.trainLabels,method = 'nb', trControl=trainControl(method='LOOCV'))
nb_pred
## Naive Bayes 
## 
## 380 samples
##  30 predictor
##   2 classes: 'Benign', 'Malignant' 
## 
## No pre-processing
## Resampling: Leave-One-Out Cross-Validation 
## Summary of sample sizes: 379, 379, 379, 379, 379, 379, ... 
## Resampling results across tuning parameters:
## 
##   usekernel  Accuracy   Kappa    
##   FALSE      0.9342105  0.8551829
##    TRUE      0.9368421  0.8621315
## 
## Tuning parameter 'fL' was held constant at a value of 0
## Tuning
##  parameter 'adjust' was held constant at a value of 1
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were fL = 0, usekernel = TRUE
##  and adjust = 1.
pred <- predict(nb_pred$finalModel, wdbc.test)
CrossTable(x = wdbc.testLabels, y = pred$class, prop.chisq=FALSE)
## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |           N / Row Total |
## |           N / Col Total |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  189 
## 
##  
##                 | pred$class 
## wdbc.testLabels |    Benign | Malignant | Row Total | 
## ----------------|-----------|-----------|-----------|
##          Benign |       105 |         7 |       112 | 
##                 |     0.938 |     0.062 |     0.593 | 
##                 |     0.955 |     0.089 |           | 
##                 |     0.556 |     0.037 |           | 
## ----------------|-----------|-----------|-----------|
##       Malignant |         5 |        72 |        77 | 
##                 |     0.065 |     0.935 |     0.407 | 
##                 |     0.045 |     0.911 |           | 
##                 |     0.026 |     0.381 |           | 
## ----------------|-----------|-----------|-----------|
##    Column Total |       110 |        79 |       189 | 
##                 |     0.582 |     0.418 |           | 
## ----------------|-----------|-----------|-----------|
## 
## 

Conclusions for Naive Bayes

The four runs gave practically the same results, which was to be expected, since the only thing we change between runs is how the accuracy of the model (predictor) is estimated. In all of them the estimated accuracy and kappa are high (around 0.93–0.94 and 0.85–0.86 respectively), which indicates that the predictor itself is good. These estimates are confirmed on the ‘test’ data, whose results are good and consistent with the accuracy obtained on the ‘train’ data.
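
As a quick check (a minimal sketch, assuming the objects pred and wdbc.testLabels from the last Naive Bayes run above are still in the workspace), the test-set accuracy and kappa can be computed with caret's confusionMatrix and compared against the resampling estimates:

cm <- confusionMatrix(data = pred$class, reference = wdbc.testLabels)
cm$overall[c("Accuracy", "Kappa")]  # held-out accuracy and kappa for the Naive Bayes predictor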

Random Forest

For Random Forest, we run the predictor with the number of trees set to 10000 and no restriction on the number of nodes. Afterwards, we run the tuneRF function to try to improve the classifier's configuration.

rf_pred <- randomForest(x = wdbc.train, y = wdbc.trainLabels,ntree=10000,proximity=TRUE)
rf_pred
## 
## Call:
##  randomForest(x = wdbc.train, y = wdbc.trainLabels, ntree = 10000,      proximity = TRUE) 
##                Type of random forest: classification
##                      Number of trees: 10000
## No. of variables tried at each split: 5
## 
##         OOB estimate of  error rate: 3.68%
## Confusion matrix:
##           Benign Malignant class.error
## Benign       241         4  0.01632653
## Malignant     10       125  0.07407407
pred <- predict(rf_pred, newdata=wdbc.test)
CrossTable(x = wdbc.testLabels, y = pred, prop.chisq=FALSE)
## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |           N / Row Total |
## |           N / Col Total |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  189 
## 
##  
##                 | pred 
## wdbc.testLabels |    Benign | Malignant | Row Total | 
## ----------------|-----------|-----------|-----------|
##          Benign |       108 |         4 |       112 | 
##                 |     0.964 |     0.036 |     0.593 | 
##                 |     0.973 |     0.051 |           | 
##                 |     0.571 |     0.021 |           | 
## ----------------|-----------|-----------|-----------|
##       Malignant |         3 |        74 |        77 | 
##                 |     0.039 |     0.961 |     0.407 | 
##                 |     0.027 |     0.949 |           | 
##                 |     0.016 |     0.392 |           | 
## ----------------|-----------|-----------|-----------|
##    Column Total |       111 |        78 |       189 | 
##                 |     0.587 |     0.413 |           | 
## ----------------|-----------|-----------|-----------|
## 
## 

We try to improve the results with tuneRF.

We try to obtain a better configuration for the classifier by tuning ‘mtry’ (the number of variables randomly considered at each tree split) and ‘ntree’ (the number of trees).

tune.rf <- tuneRF(wdbc.train,wdbc.trainLabels, stepFactor=0.5,doBest = TRUE, ntreeTry = 10001,improve = 0.01)
## mtry = 5  OOB error = 3.42% 
## Searching left ...
## mtry = 10    OOB error = 3.16% 
## 0.07692308 0.01 
## mtry = 20    OOB error = 3.16% 
## 0 0.01 
## Searching right ...
## mtry = 2     OOB error = 3.95% 
## -0.25 0.01

rf_pred <- randomForest(x = wdbc.train, y = wdbc.trainLabels,ntree=tune.rf$ntree,proximity=TRUE,mtry = tune.rf$mtry)
rf_pred
## 
## Call:
##  randomForest(x = wdbc.train, y = wdbc.trainLabels, ntree = tune.rf$ntree,      mtry = tune.rf$mtry, proximity = TRUE) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 10
## 
##         OOB estimate of  error rate: 3.68%
## Confusion matrix:
##           Benign Malignant class.error
## Benign       241         4  0.01632653
## Malignant     10       125  0.07407407
pred <- predict(rf_pred, newdata=wdbc.test) 
CrossTable(x = wdbc.testLabels, y = pred, prop.chisq=FALSE)
## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |           N / Row Total |
## |           N / Col Total |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  189 
## 
##  
##                 | pred 
## wdbc.testLabels |    Benign | Malignant | Row Total | 
## ----------------|-----------|-----------|-----------|
##          Benign |       108 |         4 |       112 | 
##                 |     0.964 |     0.036 |     0.593 | 
##                 |     0.973 |     0.051 |           | 
##                 |     0.571 |     0.021 |           | 
## ----------------|-----------|-----------|-----------|
##       Malignant |         3 |        74 |        77 | 
##                 |     0.039 |     0.961 |     0.407 | 
##                 |     0.027 |     0.949 |           | 
##                 |     0.016 |     0.392 |           | 
## ----------------|-----------|-----------|-----------|
##    Column Total |       111 |        78 |       189 | 
##                 |     0.587 |     0.413 |           | 
## ----------------|-----------|-----------|-----------|
## 
## 

Conclusions for Random Forest

The results obtained for Random Forest are quite good with its initial configuration. Using the tuneRF function to look for a better number of trees and mtry, the OOB error improves slightly during the search (3.16% with mtry = 10 versus 3.42% with mtry = 5), but in this run the refitted model ends up with the same OOB error (3.68%) and the same test-set results as the initial configuration. The values indicated by the function are:

 ntrees: 500 
   mtry: 10  

In conclusion, the results obtained in both cases are quite satisfactory.
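
As a hedged follow-up (a sketch, assuming rf_pred is the tuned forest fitted above), the OOB error curve can be inspected directly; it usually stabilises long before thousands of trees, which is consistent with the 500-tree model behaving essentially like the 10000-tree one.

plot(rf_pred)                          # OOB and per-class error as a function of the number of trees
min(rf_pred$err.rate[, "OOB"])         # lowest OOB error reached
which.min(rf_pred$err.rate[, "OOB"])   # number of trees at which it is first reached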

Unsupervised learning

k-means

In this case, we will try the different variants of the K-Means algorithm, with ‘iter.max’ (maximum number of iterations) set to 1000 and ‘nstart’ (number of random starting sets) set to 100. In addition, we change the seed for each variant so that the initial centroids chosen can vary.

K-Means. Hartigan-Wong

set.seed(1)
result <- kmeans(x = wdbc_norm, centers = 2,iter.max = 1000,algorithm = "Hartigan-Wong", nstart = 100)
CrossTable(x = WDBC$analysis, y = result$cluster, prop.chisq=FALSE)
## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |           N / Row Total |
## |           N / Col Total |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  569 
## 
##  
##               | result$cluster 
## WDBC$analysis |         1 |         2 | Row Total | 
## --------------|-----------|-----------|-----------|
##        Benign |         9 |       348 |       357 | 
##               |     0.025 |     0.975 |     0.627 | 
##               |     0.048 |     0.916 |           | 
##               |     0.016 |     0.612 |           | 
## --------------|-----------|-----------|-----------|
##     Malignant |       180 |        32 |       212 | 
##               |     0.849 |     0.151 |     0.373 | 
##               |     0.952 |     0.084 |           | 
##               |     0.316 |     0.056 |           | 
## --------------|-----------|-----------|-----------|
##  Column Total |       189 |       380 |       569 | 
##               |     0.332 |     0.668 |           | 
## --------------|-----------|-----------|-----------|
## 
## 

K-Means. Lloyd

set.seed(12)
result <- kmeans(x = wdbc_norm,centers = 2,iter.max = 1000,algorithm = "Lloyd",nstart = 100)
CrossTable(x = WDBC$analysis, y = result$cluster, prop.chisq=FALSE)
## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |           N / Row Total |
## |           N / Col Total |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  569 
## 
##  
##               | result$cluster 
## WDBC$analysis |         1 |         2 | Row Total | 
## --------------|-----------|-----------|-----------|
##        Benign |         9 |       348 |       357 | 
##               |     0.025 |     0.975 |     0.627 | 
##               |     0.048 |     0.916 |           | 
##               |     0.016 |     0.612 |           | 
## --------------|-----------|-----------|-----------|
##     Malignant |       180 |        32 |       212 | 
##               |     0.849 |     0.151 |     0.373 | 
##               |     0.952 |     0.084 |           | 
##               |     0.316 |     0.056 |           | 
## --------------|-----------|-----------|-----------|
##  Column Total |       189 |       380 |       569 | 
##               |     0.332 |     0.668 |           | 
## --------------|-----------|-----------|-----------|
## 
## 

K-Means. Forgy

set.seed(123)
result <- kmeans(x = wdbc_norm,centers = 2,iter.max = 1000,algorithm = "Forgy",nstart = 100)
CrossTable(x = WDBC$analysis, y = result$cluster, prop.chisq=FALSE)
## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |           N / Row Total |
## |           N / Col Total |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  569 
## 
##  
##               | result$cluster 
## WDBC$analysis |         1 |         2 | Row Total | 
## --------------|-----------|-----------|-----------|
##        Benign |       348 |         9 |       357 | 
##               |     0.975 |     0.025 |     0.627 | 
##               |     0.916 |     0.048 |           | 
##               |     0.612 |     0.016 |           | 
## --------------|-----------|-----------|-----------|
##     Malignant |        32 |       180 |       212 | 
##               |     0.151 |     0.849 |     0.373 | 
##               |     0.084 |     0.952 |           | 
##               |     0.056 |     0.316 |           | 
## --------------|-----------|-----------|-----------|
##  Column Total |       380 |       189 |       569 | 
##               |     0.668 |     0.332 |           | 
## --------------|-----------|-----------|-----------|
## 
## 

K-Means. MacQueen

set.seed(1234)
result <- kmeans(x = wdbc_norm,centers = 2,iter.max = 1000,algorithm = "MacQueen",nstart = 100)
CrossTable(x = WDBC$analysis, y = result$cluster, prop.chisq=FALSE)
## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |           N / Row Total |
## |           N / Col Total |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  569 
## 
##  
##               | result$cluster 
## WDBC$analysis |         1 |         2 | Row Total | 
## --------------|-----------|-----------|-----------|
##        Benign |       348 |         9 |       357 | 
##               |     0.975 |     0.025 |     0.627 | 
##               |     0.916 |     0.048 |           | 
##               |     0.612 |     0.016 |           | 
## --------------|-----------|-----------|-----------|
##     Malignant |        32 |       180 |       212 | 
##               |     0.151 |     0.849 |     0.373 | 
##               |     0.084 |     0.952 |           | 
##               |     0.056 |     0.316 |           | 
## --------------|-----------|-----------|-----------|
##  Column Total |       380 |       189 |       569 | 
##               |     0.668 |     0.332 |           | 
## --------------|-----------|-----------|-----------|
## 
## 

Conclusions for K-means

For this algorithm, using the different variants of K-Means produced no change: all four variants yield the same partition, even though we forced a different seed in each case (only the arbitrary cluster labels 1 and 2 are swapped in some runs). This may be due, among other factors, to the number of clusters requested and the size of the dataset.
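
A minimal sketch of how to confirm this: re-run two of the variants and cross-tabulate their assignments. Identical partitions show up as a table with only two non-zero cells (the 1/2 labels may be swapped, as seen above for Forgy and MacQueen).

set.seed(1)
hw <- kmeans(wdbc_norm, centers = 2, iter.max = 1000, algorithm = "Hartigan-Wong", nstart = 100)
set.seed(123)
fg <- kmeans(wdbc_norm, centers = 2, iter.max = 1000, algorithm = "Forgy", nstart = 100)
table(hw$cluster, fg$cluster)  # only two non-zero cells => same partition, labels possibly swapped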

Agglomerative hierarchical clustering

Finally, we will try the different variants (linkage methods) of the agglomerative hierarchical algorithm, cutting the resulting tree into 2 groups.
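
Before fitting each variant, a minimal sketch of what cutting the tree into 2 groups means, visualised on the dendrogram (ward.D is used here purely as an example):

hc <- hclust(dist(wdbc_norm), method = "ward.D")
plot(hc, labels = FALSE, hang = -1)  # full dendrogram for the 569 samples
rect.hclust(hc, k = 2)               # boxes around the two groups that cutree(hc, 2) returns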

Ward.D method

result = hclust(dist(wdbc_norm), method = "ward.D")
result_cut <- cutree (result, 2)
CrossTable(x = WDBC$analysis, y = result_cut, prop.chisq=FALSE)
## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |           N / Row Total |
## |           N / Col Total |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  569 
## 
##  
##               | result_cut 
## WDBC$analysis |         1 |         2 | Row Total | 
## --------------|-----------|-----------|-----------|
##        Benign |        10 |       347 |       357 | 
##               |     0.028 |     0.972 |     0.627 | 
##               |     0.053 |     0.913 |           | 
##               |     0.018 |     0.610 |           | 
## --------------|-----------|-----------|-----------|
##     Malignant |       179 |        33 |       212 | 
##               |     0.844 |     0.156 |     0.373 | 
##               |     0.947 |     0.087 |           | 
##               |     0.315 |     0.058 |           | 
## --------------|-----------|-----------|-----------|
##  Column Total |       189 |       380 |       569 | 
##               |     0.332 |     0.668 |           | 
## --------------|-----------|-----------|-----------|
## 
## 

Ward.D2 method

result = hclust(dist(wdbc_norm), method = "ward.D2")
result_cut <- cutree (result, 2)
CrossTable(x = WDBC$analysis, y = result_cut, prop.chisq=FALSE)
## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |           N / Row Total |
## |           N / Col Total |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  569 
## 
##  
##               | result_cut 
## WDBC$analysis |         1 |         2 | Row Total | 
## --------------|-----------|-----------|-----------|
##        Benign |        27 |       330 |       357 | 
##               |     0.076 |     0.924 |     0.627 | 
##               |     0.141 |     0.873 |           | 
##               |     0.047 |     0.580 |           | 
## --------------|-----------|-----------|-----------|
##     Malignant |       164 |        48 |       212 | 
##               |     0.774 |     0.226 |     0.373 | 
##               |     0.859 |     0.127 |           | 
##               |     0.288 |     0.084 |           | 
## --------------|-----------|-----------|-----------|
##  Column Total |       191 |       378 |       569 | 
##               |     0.336 |     0.664 |           | 
## --------------|-----------|-----------|-----------|
## 
## 

Single method

result = hclust(dist(wdbc_norm), method = "single")
result_cut <- cutree (result, 2)
CrossTable(x = WDBC$analysis, y = result_cut, prop.chisq=FALSE)
## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |           N / Row Total |
## |           N / Col Total |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  569 
## 
##  
##               | result_cut 
## WDBC$analysis |         1 |         2 | Row Total | 
## --------------|-----------|-----------|-----------|
##        Benign |       357 |         0 |       357 | 
##               |     1.000 |     0.000 |     0.627 | 
##               |     0.630 |     0.000 |           | 
##               |     0.627 |     0.000 |           | 
## --------------|-----------|-----------|-----------|
##     Malignant |       210 |         2 |       212 | 
##               |     0.991 |     0.009 |     0.373 | 
##               |     0.370 |     1.000 |           | 
##               |     0.369 |     0.004 |           | 
## --------------|-----------|-----------|-----------|
##  Column Total |       567 |         2 |       569 | 
##               |     0.996 |     0.004 |           | 
## --------------|-----------|-----------|-----------|
## 
## 

Complete method

result = hclust(dist(wdbc_norm), method = "complete")
result_cut <- cutree (result, 2)
CrossTable(x = WDBC$analysis, y = result_cut, prop.chisq=FALSE)
## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |           N / Row Total |
## |           N / Col Total |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  569 
## 
##  
##               | result_cut 
## WDBC$analysis |         1 |         2 | Row Total | 
## --------------|-----------|-----------|-----------|
##        Benign |         0 |       357 |       357 | 
##               |     0.000 |     1.000 |     0.627 | 
##               |     0.000 |     0.758 |           | 
##               |     0.000 |     0.627 |           | 
## --------------|-----------|-----------|-----------|
##     Malignant |        98 |       114 |       212 | 
##               |     0.462 |     0.538 |     0.373 | 
##               |     1.000 |     0.242 |           | 
##               |     0.172 |     0.200 |           | 
## --------------|-----------|-----------|-----------|
##  Column Total |        98 |       471 |       569 | 
##               |     0.172 |     0.828 |           | 
## --------------|-----------|-----------|-----------|
## 
## 

Average method

result = hclust(dist(wdbc_norm), method = "average")
result_cut <- cutree (result, 2)
CrossTable(x = WDBC$analysis, y = result_cut, prop.chisq=FALSE)
## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |           N / Row Total |
## |           N / Col Total |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  569 
## 
##  
##               | result_cut 
## WDBC$analysis |         1 |         2 | Row Total | 
## --------------|-----------|-----------|-----------|
##        Benign |       357 |         0 |       357 | 
##               |     1.000 |     0.000 |     0.627 | 
##               |     0.630 |     0.000 |           | 
##               |     0.627 |     0.000 |           | 
## --------------|-----------|-----------|-----------|
##     Malignant |       210 |         2 |       212 | 
##               |     0.991 |     0.009 |     0.373 | 
##               |     0.370 |     1.000 |           | 
##               |     0.369 |     0.004 |           | 
## --------------|-----------|-----------|-----------|
##  Column Total |       567 |         2 |       569 | 
##               |     0.996 |     0.004 |           | 
## --------------|-----------|-----------|-----------|
## 
## 

McQuitty method

result = hclust(dist(wdbc_norm), method = "mcquitty")
result_cut <- cutree (result, 2)
CrossTable(x = WDBC$analysis, y = result_cut, prop.chisq=FALSE)
## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |           N / Row Total |
## |           N / Col Total |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  569 
## 
##  
##               | result_cut 
## WDBC$analysis |         1 |         2 | Row Total | 
## --------------|-----------|-----------|-----------|
##        Benign |         0 |       357 |       357 | 
##               |     0.000 |     1.000 |     0.627 | 
##               |     0.000 |     0.655 |           | 
##               |     0.000 |     0.627 |           | 
## --------------|-----------|-----------|-----------|
##     Malignant |        24 |       188 |       212 | 
##               |     0.113 |     0.887 |     0.373 | 
##               |     1.000 |     0.345 |           | 
##               |     0.042 |     0.330 |           | 
## --------------|-----------|-----------|-----------|
##  Column Total |        24 |       545 |       569 | 
##               |     0.042 |     0.958 |           | 
## --------------|-----------|-----------|-----------|
## 
## 

Median method

result = hclust(dist(wdbc_norm), method = "median")
result_cut <- cutree (result, 2)
CrossTable(x = WDBC$analysis, y = result_cut, prop.chisq=FALSE)
## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |           N / Row Total |
## |           N / Col Total |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  569 
## 
##  
##               | result_cut 
## WDBC$analysis |         1 |         2 | Row Total | 
## --------------|-----------|-----------|-----------|
##        Benign |       357 |         0 |       357 | 
##               |     1.000 |     0.000 |     0.627 | 
##               |     0.630 |     0.000 |           | 
##               |     0.627 |     0.000 |           | 
## --------------|-----------|-----------|-----------|
##     Malignant |       210 |         2 |       212 | 
##               |     0.991 |     0.009 |     0.373 | 
##               |     0.370 |     1.000 |           | 
##               |     0.369 |     0.004 |           | 
## --------------|-----------|-----------|-----------|
##  Column Total |       567 |         2 |       569 | 
##               |     0.996 |     0.004 |           | 
## --------------|-----------|-----------|-----------|
## 
## 

Centroid method

result = hclust(dist(wdbc_norm), method = "centroid")
result_cut <- cutree (result, 2)
CrossTable(x = WDBC$analysis, y = result_cut, prop.chisq=FALSE)
## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |           N / Row Total |
## |           N / Col Total |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  569 
## 
##  
##               | result_cut 
## WDBC$analysis |         1 |         2 | Row Total | 
## --------------|-----------|-----------|-----------|
##        Benign |       357 |         0 |       357 | 
##               |     1.000 |     0.000 |     0.627 | 
##               |     0.630 |     0.000 |           | 
##               |     0.627 |     0.000 |           | 
## --------------|-----------|-----------|-----------|
##     Malignant |       210 |         2 |       212 | 
##               |     0.991 |     0.009 |     0.373 | 
##               |     0.370 |     1.000 |           | 
##               |     0.369 |     0.004 |           | 
## --------------|-----------|-----------|-----------|
##  Column Total |       567 |         2 |       569 | 
##               |     0.996 |     0.004 |           | 
## --------------|-----------|-----------|-----------|
## 
## 

Conclusions for agglomerative hierarchical clustering

Of the different methods shown, the only ones that give acceptable results are “Ward.D” and “Ward.D2”. The remaining methods group practically all of the samples into a single cluster, which is of no use whatsoever for this analysis.
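
A quick way to spot these degenerate cuts is to look at the group sizes directly; a minimal sketch for the “single” linkage (the same check applies to “average”, “median” and “centroid”):

cut_single <- cutree(hclust(dist(wdbc_norm), method = "single"), 2)
table(cut_single)  # e.g. 567 vs 2 samples: such a cut carries no diagnostic value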

Final conclusions on classification

Regarding classification, both the supervised and the unsupervised algorithms have given good results. Specifically, taking the best results obtained, the figures for each of the algorithms were the following:

IMPORTANT: remember that the results may vary from run to run, and therefore the final figures may change. In any case, these variations are not very significant in most cases.

For the supervised learning algorithms, evaluated on a test data.frame of 189 entries (a sketch for reproducing these per-class figures is given after the table):

| Type              | KNN         | SVM         | Naive Bayes | Random Forest |
|-------------------|-------------|-------------|-------------|---------------|
| Benign correct    | 109 / 97.3% | 108 / 96.4% | 105 / 93.8% | 108 / 96.4%   |
| Benign errors     | 3 / 2.7%    | 4 / 3.6%    | 7 / 6.2%    | 4 / 3.6%      |
| Malignant correct | 71 / 92.2%  | 69 / 89.6%  | 72 / 93.5%  | 74 / 96.1%    |
| Malignant errors  | 6 / 7.8%    | 8 / 10.4%   | 5 / 6.5%    | 3 / 3.9%      |
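
A minimal, hedged sketch of where these per-class percentages come from (assuming pred and wdbc.testLabels from the Random Forest run above are still available). Taking ‘Malignant’ as the positive class, sensitivity is the malignant hit rate and specificity the benign hit rate:

cm <- confusionMatrix(data = pred, reference = wdbc.testLabels, positive = "Malignant")
cm$table                                     # 2x2 confusion matrix over the 189 test samples
cm$byClass[c("Sensitivity", "Specificity")]  # e.g. 74/77 = 0.961 and 108/112 = 0.964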

From the table above we can see that the hit rate when predicting benign cancers is very high (above 93%) for all the algorithms, with KNN standing out with a hit rate of 97.3%. For malignant cancers the accuracy drops slightly, although it remains acceptably high (above 89%), with Random Forest standing out in this case with 96.1%.

Overall, the supervised learning algorithm that has given the best results is Random Forest, staying above 96% accuracy in the prediction of both malignant and benign cancers.

For the unsupervised learning algorithms, run on the full data.frame of 569 entries (the cluster-to-class mapping these figures assume is sketched after the table):

| Type              | K-means     | Agglomerative hierarchical |
|-------------------|-------------|----------------------------|
| Benign correct    | 348 / 97.5% | 347 / 97.2%                |
| Benign errors     | 9 / 2.5%    | 10 / 2.8%                  |
| Malignant correct | 180 / 84.9% | 179 / 84.4%                |
| Malignant errors  | 32 / 15.1%  | 33 / 15.6%                 |
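
Since clustering produces unlabeled groups, the hit counts above implicitly map each cluster to its majority diagnosis class. A minimal sketch of that mapping, using the last K-means result and the WDBC$analysis column as an illustration:

tab <- table(WDBC$analysis, result$cluster)  # diagnosis (rows) vs. cluster assignment (columns)
tab
sum(apply(tab, 2, max)) / sum(tab)           # overall agreement after assigning each cluster to its majority class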

From the table above, the hit rate when predicting benign cancers is very high (above 97%), with K-means ahead by a very small margin (barely a few tenths of a percent). For malignant cancers the hit rate drops to around 84% in both cases, again with K-means marginally ahead.

In general, the results obtained are acceptably good, above all for the prediction of benign cancers. For malignant cancers the hit rate drops, although it remains fairly high. K-means stands out slightly in both cases.

Comparing the figures above in percentage terms, there seems to be a relative advantage of the supervised algorithms over the unsupervised ones (among those used in this study) when predicting malignant cancers. Specifically, comparing the percentages of Random Forest and K-means, K-means is slightly better at predicting benign cancers (97.5% vs 96.4%, an improvement of 1.1 points), whereas for malignant cancers Random Forest is clearly better (96.1% vs 84.9%, an improvement of 11.2 points).

In conclusion, Random Forest is the algorithm that gives the best results overall, and therefore the best prediction percentages for both types of cancer.

Links of interest

Dataset with the breast cancer information

ftp.cs.wisc.edu/math-prog/cpo-dataset/machine-learn/cancer/WDBC/WDBC.dat

Relevant information about the dataset

https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+(Diagnostic)

URL of the code on GitHub

https://github.com/alu4216/WDBC

URL of the code on RPubs

https://rpubs.com/alu4216/WDBC

URL of the visualization (Shiny app)

https://alu4216.shinyapps.io/WDBC/

Wrapping up

sessionInfo()
## R version 3.4.4 (2018-03-15)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 17134)
## 
## Matrix products: default
## 
## locale:
## [1] LC_COLLATE=Spanish_Spain.1252  LC_CTYPE=Spanish_Spain.1252   
## [3] LC_MONETARY=Spanish_Spain.1252 LC_NUMERIC=C                  
## [5] LC_TIME=Spanish_Spain.1252    
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] gridExtra_2.3       FactoMineR_1.41     factoextra_1.0.5   
##  [4] randomForest_4.6-14 gmodels_2.16.2      klaR_0.6-14        
##  [7] MASS_7.3-49         caret_6.0-80        lattice_0.20-35    
## [10] e1071_1.6-8         class_7.3-14        GGally_1.4.0       
## [13] ggplot2_2.2.1      
## 
## loaded via a namespace (and not attached):
##  [1] nlme_3.1-131.1       lubridate_1.7.4      dimRed_0.1.0        
##  [4] RColorBrewer_1.1-2   rprojroot_1.3-2      tools_3.4.4         
##  [7] backports_1.1.2      R6_2.2.2             rpart_4.1-13        
## [10] lazyeval_0.2.1       questionr_0.6.3      colorspace_1.3-2    
## [13] nnet_7.3-12          withr_2.1.2          tidyselect_0.2.4    
## [16] mnormt_1.5-5         compiler_3.4.4       flashClust_1.01-2   
## [19] labeling_0.3         scales_0.5.0         sfsmisc_1.1-2       
## [22] DEoptimR_1.0-8       psych_1.8.4          robustbase_0.93-1   
## [25] stringr_1.3.0        digest_0.6.15        foreign_0.8-69      
## [28] rmarkdown_1.9        pkgconfig_2.0.1      htmltools_0.3.6     
## [31] highr_0.6            rlang_0.2.0          ddalpha_1.3.4       
## [34] rstudioapi_0.7       shiny_1.0.5          bindr_0.1.1         
## [37] combinat_0.0-8       gtools_3.5.0         dplyr_0.7.4         
## [40] ModelMetrics_1.1.0   magrittr_1.5         leaps_3.0           
## [43] Matrix_1.2-12        Rcpp_0.12.16         munsell_0.4.3       
## [46] abind_1.4-5          scatterplot3d_0.3-41 stringi_1.1.7       
## [49] yaml_2.1.18          plyr_1.8.4           recipes_0.1.3       
## [52] grid_3.4.4           ggrepel_0.8.0        gdata_2.18.0        
## [55] parallel_3.4.4       pls_2.6-0            promises_1.0.1      
## [58] miniUI_0.1.1.1       splines_3.4.4        knitr_1.20          
## [61] pillar_1.2.1         ggpubr_0.1.7         reshape2_1.4.3      
## [64] codetools_0.2-15     stats4_3.4.4         CVST_0.2-2          
## [67] magic_1.5-8          glue_1.2.0           evaluate_0.10.1     
## [70] httpuv_1.4.4.2       foreach_1.4.4        gtable_0.2.0        
## [73] purrr_0.2.4          tidyr_0.8.0          reshape_0.8.7       
## [76] kernlab_0.9-26       assertthat_0.2.0     DRR_0.0.3           
## [79] gower_0.1.2          mime_0.5             prodlim_2018.04.18  
## [82] xtable_1.8-2         broom_0.4.5          later_0.7.3         
## [85] survival_2.41-3      geometry_0.3-6       timeDate_3043.102   
## [88] RcppRoll_0.3.0       tibble_1.4.2         iterators_1.0.9     
## [91] cluster_2.0.6        bindrcpp_0.2.2       lava_1.6.2          
## [94] ipred_0.9-6