In this research project we apply several classification models, both supervised and unsupervised, to breast cancer data. The main objective is to classify tumors as benign or malignant based on their characteristics. To do so, we use a dataset describing various properties of the cancerous cells.
First we import our dataset:
setwd("C:/Users/Tatiana/Documents/Universidad Stuff/Material de ESTA/ESTA 5504/bases de datos/Para proyecto")
library(foreign)
## Warning: package 'foreign' was built under R version 4.3.2
Data <- read.table("wdbc.data", sep = ",", header = F)
Next we name each variable after the characteristic it represents:
colnames(Data) <- c("ID", "Diagnosis", "radius_mean","texture_mean","perimeter_mean",
"area_mean","smoothness_mean","compactness_mean","concavity_mean",
"concave_points_mean","symmetry_mean","fractal_dimension_mean",
"radius_se","texture_se","perimeter_se","area_se","smoothness_se",
"compactness_se","concavity_se","concave_points_se","symmetry_se",
"fractal_dimension_se","radius_worst","texture_worst","perimeter_worst",
"area_worst","smoothness_worst","compactness_worst","concavity_worst",
"concave_points_worst","symmetry_worst","fractal_dimension_worst")
head(Data)
## ID Diagnosis radius_mean texture_mean perimeter_mean area_mean
## 1 842302 M 17.99 10.38 122.80 1001.0
## 2 842517 M 20.57 17.77 132.90 1326.0
## 3 84300903 M 19.69 21.25 130.00 1203.0
## 4 84348301 M 11.42 20.38 77.58 386.1
## 5 84358402 M 20.29 14.34 135.10 1297.0
## 6 843786 M 12.45 15.70 82.57 477.1
## smoothness_mean compactness_mean concavity_mean concave_points_mean
## 1 0.11840 0.27760 0.3001 0.14710
## 2 0.08474 0.07864 0.0869 0.07017
## 3 0.10960 0.15990 0.1974 0.12790
## 4 0.14250 0.28390 0.2414 0.10520
## 5 0.10030 0.13280 0.1980 0.10430
## 6 0.12780 0.17000 0.1578 0.08089
## symmetry_mean fractal_dimension_mean radius_se texture_se perimeter_se
## 1 0.2419 0.07871 1.0950 0.9053 8.589
## 2 0.1812 0.05667 0.5435 0.7339 3.398
## 3 0.2069 0.05999 0.7456 0.7869 4.585
## 4 0.2597 0.09744 0.4956 1.1560 3.445
## 5 0.1809 0.05883 0.7572 0.7813 5.438
## 6 0.2087 0.07613 0.3345 0.8902 2.217
## area_se smoothness_se compactness_se concavity_se concave_points_se
## 1 153.40 0.006399 0.04904 0.05373 0.01587
## 2 74.08 0.005225 0.01308 0.01860 0.01340
## 3 94.03 0.006150 0.04006 0.03832 0.02058
## 4 27.23 0.009110 0.07458 0.05661 0.01867
## 5 94.44 0.011490 0.02461 0.05688 0.01885
## 6 27.19 0.007510 0.03345 0.03672 0.01137
## symmetry_se fractal_dimension_se radius_worst texture_worst perimeter_worst
## 1 0.03003 0.006193 25.38 17.33 184.60
## 2 0.01389 0.003532 24.99 23.41 158.80
## 3 0.02250 0.004571 23.57 25.53 152.50
## 4 0.05963 0.009208 14.91 26.50 98.87
## 5 0.01756 0.005115 22.54 16.67 152.20
## 6 0.02165 0.005082 15.47 23.75 103.40
## area_worst smoothness_worst compactness_worst concavity_worst
## 1 2019.0 0.1622 0.6656 0.7119
## 2 1956.0 0.1238 0.1866 0.2416
## 3 1709.0 0.1444 0.4245 0.4504
## 4 567.7 0.2098 0.8663 0.6869
## 5 1575.0 0.1374 0.2050 0.4000
## 6 741.6 0.1791 0.5249 0.5355
## concave_points_worst symmetry_worst fractal_dimension_worst
## 1 0.2654 0.4601 0.11890
## 2 0.1860 0.2750 0.08902
## 3 0.2430 0.3613 0.08758
## 4 0.2575 0.6638 0.17300
## 5 0.1625 0.2364 0.07678
## 6 0.1741 0.3985 0.12440
We examine the structure of the dataset using the skimr::skim function:
library(skimr)
## Warning: package 'skimr' was built under R version 4.3.2
skimr::skim(Data)
| Name | Data |
| Number of rows | 569 |
| Number of columns | 32 |
| _______________________ | |
| Column type frequency: | |
| character | 1 |
| numeric | 31 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| Diagnosis | 0 | 1 | 1 | 1 | 0 | 2 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| ID | 0 | 1 | 30371831.43 | 125020585.61 | 8670.00 | 869218.00 | 906024.00 | 8813129.00 | 911320502.00 | ▇▁▁▁▁ |
| radius_mean | 0 | 1 | 14.13 | 3.52 | 6.98 | 11.70 | 13.37 | 15.78 | 28.11 | ▂▇▃▁▁ |
| texture_mean | 0 | 1 | 19.29 | 4.30 | 9.71 | 16.17 | 18.84 | 21.80 | 39.28 | ▃▇▃▁▁ |
| perimeter_mean | 0 | 1 | 91.97 | 24.30 | 43.79 | 75.17 | 86.24 | 104.10 | 188.50 | ▃▇▃▁▁ |
| area_mean | 0 | 1 | 654.89 | 351.91 | 143.50 | 420.30 | 551.10 | 782.70 | 2501.00 | ▇▃▂▁▁ |
| smoothness_mean | 0 | 1 | 0.10 | 0.01 | 0.05 | 0.09 | 0.10 | 0.11 | 0.16 | ▁▇▇▁▁ |
| compactness_mean | 0 | 1 | 0.10 | 0.05 | 0.02 | 0.06 | 0.09 | 0.13 | 0.35 | ▇▇▂▁▁ |
| concavity_mean | 0 | 1 | 0.09 | 0.08 | 0.00 | 0.03 | 0.06 | 0.13 | 0.43 | ▇▃▂▁▁ |
| concave_points_mean | 0 | 1 | 0.05 | 0.04 | 0.00 | 0.02 | 0.03 | 0.07 | 0.20 | ▇▃▂▁▁ |
| symmetry_mean | 0 | 1 | 0.18 | 0.03 | 0.11 | 0.16 | 0.18 | 0.20 | 0.30 | ▁▇▅▁▁ |
| fractal_dimension_mean | 0 | 1 | 0.06 | 0.01 | 0.05 | 0.06 | 0.06 | 0.07 | 0.10 | ▆▇▂▁▁ |
| radius_se | 0 | 1 | 0.41 | 0.28 | 0.11 | 0.23 | 0.32 | 0.48 | 2.87 | ▇▁▁▁▁ |
| texture_se | 0 | 1 | 1.22 | 0.55 | 0.36 | 0.83 | 1.11 | 1.47 | 4.88 | ▇▅▁▁▁ |
| perimeter_se | 0 | 1 | 2.87 | 2.02 | 0.76 | 1.61 | 2.29 | 3.36 | 21.98 | ▇▁▁▁▁ |
| area_se | 0 | 1 | 40.34 | 45.49 | 6.80 | 17.85 | 24.53 | 45.19 | 542.20 | ▇▁▁▁▁ |
| smoothness_se | 0 | 1 | 0.01 | 0.00 | 0.00 | 0.01 | 0.01 | 0.01 | 0.03 | ▇▃▁▁▁ |
| compactness_se | 0 | 1 | 0.03 | 0.02 | 0.00 | 0.01 | 0.02 | 0.03 | 0.14 | ▇▃▁▁▁ |
| concavity_se | 0 | 1 | 0.03 | 0.03 | 0.00 | 0.02 | 0.03 | 0.04 | 0.40 | ▇▁▁▁▁ |
| concave_points_se | 0 | 1 | 0.01 | 0.01 | 0.00 | 0.01 | 0.01 | 0.01 | 0.05 | ▇▇▁▁▁ |
| symmetry_se | 0 | 1 | 0.02 | 0.01 | 0.01 | 0.02 | 0.02 | 0.02 | 0.08 | ▇▃▁▁▁ |
| fractal_dimension_se | 0 | 1 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.03 | ▇▁▁▁▁ |
| radius_worst | 0 | 1 | 16.27 | 4.83 | 7.93 | 13.01 | 14.97 | 18.79 | 36.04 | ▆▇▃▁▁ |
| texture_worst | 0 | 1 | 25.68 | 6.15 | 12.02 | 21.08 | 25.41 | 29.72 | 49.54 | ▃▇▆▁▁ |
| perimeter_worst | 0 | 1 | 107.26 | 33.60 | 50.41 | 84.11 | 97.66 | 125.40 | 251.20 | ▇▇▃▁▁ |
| area_worst | 0 | 1 | 880.58 | 569.36 | 185.20 | 515.30 | 686.50 | 1084.00 | 4254.00 | ▇▂▁▁▁ |
| smoothness_worst | 0 | 1 | 0.13 | 0.02 | 0.07 | 0.12 | 0.13 | 0.15 | 0.22 | ▂▇▇▂▁ |
| compactness_worst | 0 | 1 | 0.25 | 0.16 | 0.03 | 0.15 | 0.21 | 0.34 | 1.06 | ▇▅▁▁▁ |
| concavity_worst | 0 | 1 | 0.27 | 0.21 | 0.00 | 0.11 | 0.23 | 0.38 | 1.25 | ▇▅▂▁▁ |
| concave_points_worst | 0 | 1 | 0.11 | 0.07 | 0.00 | 0.06 | 0.10 | 0.16 | 0.29 | ▅▇▅▃▁ |
| symmetry_worst | 0 | 1 | 0.29 | 0.06 | 0.16 | 0.25 | 0.28 | 0.32 | 0.66 | ▅▇▁▁▁ |
| fractal_dimension_worst | 0 | 1 | 0.08 | 0.02 | 0.06 | 0.07 | 0.08 | 0.09 | 0.21 | ▇▃▁▁▁ |
str(Data)
## 'data.frame': 569 obs. of 32 variables:
## $ ID : int 842302 842517 84300903 84348301 84358402 843786 844359 84458202 844981 84501001 ...
## $ Diagnosis : chr "M" "M" "M" "M" ...
## $ radius_mean : num 18 20.6 19.7 11.4 20.3 ...
## $ texture_mean : num 10.4 17.8 21.2 20.4 14.3 ...
## $ perimeter_mean : num 122.8 132.9 130 77.6 135.1 ...
## $ area_mean : num 1001 1326 1203 386 1297 ...
## $ smoothness_mean : num 0.1184 0.0847 0.1096 0.1425 0.1003 ...
## $ compactness_mean : num 0.2776 0.0786 0.1599 0.2839 0.1328 ...
## $ concavity_mean : num 0.3001 0.0869 0.1974 0.2414 0.198 ...
## $ concave_points_mean : num 0.1471 0.0702 0.1279 0.1052 0.1043 ...
## $ symmetry_mean : num 0.242 0.181 0.207 0.26 0.181 ...
## $ fractal_dimension_mean : num 0.0787 0.0567 0.06 0.0974 0.0588 ...
## $ radius_se : num 1.095 0.543 0.746 0.496 0.757 ...
## $ texture_se : num 0.905 0.734 0.787 1.156 0.781 ...
## $ perimeter_se : num 8.59 3.4 4.58 3.44 5.44 ...
## $ area_se : num 153.4 74.1 94 27.2 94.4 ...
## $ smoothness_se : num 0.0064 0.00522 0.00615 0.00911 0.01149 ...
## $ compactness_se : num 0.049 0.0131 0.0401 0.0746 0.0246 ...
## $ concavity_se : num 0.0537 0.0186 0.0383 0.0566 0.0569 ...
## $ concave_points_se : num 0.0159 0.0134 0.0206 0.0187 0.0188 ...
## $ symmetry_se : num 0.03 0.0139 0.0225 0.0596 0.0176 ...
## $ fractal_dimension_se : num 0.00619 0.00353 0.00457 0.00921 0.00511 ...
## $ radius_worst : num 25.4 25 23.6 14.9 22.5 ...
## $ texture_worst : num 17.3 23.4 25.5 26.5 16.7 ...
## $ perimeter_worst : num 184.6 158.8 152.5 98.9 152.2 ...
## $ area_worst : num 2019 1956 1709 568 1575 ...
## $ smoothness_worst : num 0.162 0.124 0.144 0.21 0.137 ...
## $ compactness_worst : num 0.666 0.187 0.424 0.866 0.205 ...
## $ concavity_worst : num 0.712 0.242 0.45 0.687 0.4 ...
## $ concave_points_worst : num 0.265 0.186 0.243 0.258 0.163 ...
## $ symmetry_worst : num 0.46 0.275 0.361 0.664 0.236 ...
## $ fractal_dimension_worst: num 0.1189 0.089 0.0876 0.173 0.0768 ...
The data will need to be standardized later on before building the classification models.
The dataset contains no missing values and no duplicates.
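Both claims can be verified directly on the data frame; a minimal sketch:
# Check for missing values and duplicated rows (both should return 0)
sum(is.na(Data))
sum(duplicated(Data))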
We adjust the classes of the “Diagnosis” and “ID” variables:
# Modifications to the dataset
Data$ID <- as.character(Data$ID)
Data$Diagnosis <- as.factor(Data$Diagnosis)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(forcats)
Data <- mutate(Data,Diagnosis = fct_recode(Data$Diagnosis,
"Benigno" = "B","Maligno" = "M"))
To become more familiar with the structure of the data, we create a plot showing pairwise correlations, both graphical and numerical, together with the distribution of each variable, using the ggpairs function from the GGally package:
library(GGally)
## Loading required package: ggplot2
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
ggpairs(Data[,2:10])
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
We count the benign and malignant tumors in the dataset, along with their proportions:
table(Data$Diagnosis)
##
## Benigno Maligno
## 357 212
round(prop.table(table(Data$Diagnosis)),2)
##
## Benigno Maligno
## 0.63 0.37
We plot the distribution:
plot(Data$Diagnosis, col = c("lightgreen","indianred1"), main = "Distribucion de datos por grupos")
K-Nearest Neighbors
First we normalize the data to the [0, 1] range (scales::rescale performs min-max scaling):
library(scales)
## Warning: package 'scales' was built under R version 4.3.3
Datan <- as.data.frame(lapply(Data[,-c(1,2)], rescale))
Datan <- cbind(Data$Diagnosis,Datan)
colnames(Datan)[colnames(Datan) == "Data$Diagnosis"] <- "Diagnosis"
Next we split the data into training and test sets:
set.seed(2020)
library(caret)
## Warning: package 'caret' was built under R version 4.3.3
## Loading required package: lattice
folds <- createFolds(Datan$Diagnosis, k = 5)
entrenamiento <- Datan[c(folds$Fold1,folds$Fold2,folds$Fold3,folds$Fold4),]
prueba <- Datan[folds$Fold5,]
# Number of benign and malignant tumors in each of the created sets:
table(entrenamiento$Diagnosis)
##
## Benigno Maligno
## 285 170
table(prueba$Diagnosis)
##
## Benigno Maligno
## 72 42
We fit the model:
library(kknn)
## Warning: package 'kknn' was built under R version 4.3.3
##
## Attaching package: 'kknn'
## The following object is masked from 'package:caret':
##
## contr.dummy
modelo <- train.kknn(Diagnosis ~ ., data = entrenamiento, kmax = 30)
modelo
##
## Call:
## train.kknn(formula = Diagnosis ~ ., data = entrenamiento, kmax = 30)
##
## Type of response variable: nominal
## Minimal misclassification: 0.02857143
## Best kernel: optimal
## Best k: 10
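The misclassification rate for each value of k tested can also be inspected graphically; a minimal sketch using the plot method that kknn provides for train.kknn objects:
# Misclassification rate as a function of k for the fitted train.kknn object
plot(modelo)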
We now obtain the predictions for the training data together with the accuracy rate:
# Prediction table:
Pred <- predict(modelo, entrenamiento[,-1])
tt <- table(entrenamiento[,1],Pred)
tt
## Pred
## Benigno Maligno
## Benigno 283 2
## Maligno 5 165
# Accuracy rate:
TA <- (sum(diag(tt)))/sum(tt)
round(TA,2)
## [1] 0.98
We now evaluate the model on the test data, obtaining its predictions and accuracy rate:
# Prediction table:
Pre <- predict(modelo, prueba[,-1])
tabla <- table(prueba[,1],Pre)
tabla
## Pre
## Benigno Maligno
## Benigno 72 0
## Maligno 3 39
# Accuracy rate:
T_A <- (sum(diag(tabla)))/sum(tabla)
round(T_A,2)
## [1] 0.97
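For metrics beyond the raw accuracy rate (sensitivity, specificity, kappa), the confusionMatrix function from caret, already loaded above, could be applied to the test predictions; a minimal sketch:
# Detailed performance measures for the test-set predictions
confusionMatrix(Pre, prueba$Diagnosis)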
Decision Trees
The second supervised classification method we apply is decision trees. With the dataset already recoded and normalized, we split it into training (80%) and test (20%) sets:
N <- nrow(Datan)
n <- round(N*0.8)
set.seed(111)
indices <- sample(1:N,n)
entrenamiento <- Datan[indices,]
prueba <- Datan[-indices,]
We fit the model to the training data using the rpart function and plot the decision tree:
# Libraries:
library(rpart)
library(rpart.plot)
## Warning: package 'rpart.plot' was built under R version 4.3.3
# Model fitting:
arbol <- rpart(Diagnosis ~ ., data = entrenamiento)
# Tree plot:
rpart.plot(arbol)
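The complexity-parameter table built by rpart can help decide whether the tree should be pruned; a minimal sketch, where arbol_podado is a hypothetical name for the pruned tree:
# Cross-validated error for each complexity parameter value
printcp(arbol)
plotcp(arbol)
# Prune at the cp with the lowest cross-validated error
arbol_podado <- prune(arbol, cp = arbol$cptable[which.min(arbol$cptable[, "xerror"]), "CP"])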
We compute the predictions for the training data:
pred <- predict(arbol,entrenamiento, type = "class")
confusionMatrix(pred, entrenamiento[,"Diagnosis"])
## Confusion Matrix and Statistics
##
## Reference
## Prediction Benigno Maligno
## Benigno 277 10
## Maligno 5 163
##
## Accuracy : 0.967
## 95% CI : (0.9462, 0.9814)
## No Information Rate : 0.6198
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.9297
##
## Mcnemar's Test P-Value : 0.3017
##
## Sensitivity : 0.9823
## Specificity : 0.9422
## Pos Pred Value : 0.9652
## Neg Pred Value : 0.9702
## Prevalence : 0.6198
## Detection Rate : 0.6088
## Detection Prevalence : 0.6308
## Balanced Accuracy : 0.9622
##
## 'Positive' Class : Benigno
##
Now we evaluate the model on the test data:
pred1 <- predict(arbol,prueba, type = "class")
confusionMatrix(pred1, prueba[,"Diagnosis"])
## Confusion Matrix and Statistics
##
## Reference
## Prediction Benigno Maligno
## Benigno 70 5
## Maligno 5 34
##
## Accuracy : 0.9123
## 95% CI : (0.8446, 0.9571)
## No Information Rate : 0.6579
## P-Value [Acc > NIR] : 2.23e-10
##
## Kappa : 0.8051
##
## Mcnemar's Test P-Value : 1
##
## Sensitivity : 0.9333
## Specificity : 0.8718
## Pos Pred Value : 0.9333
## Neg Pred Value : 0.8718
## Prevalence : 0.6579
## Detection Rate : 0.6140
## Detection Prevalence : 0.6579
## Balanced Accuracy : 0.9026
##
## 'Positive' Class : Benigno
##
Artificial Neural Networks
The fourth supervised classification method we apply is artificial neural networks. With the dataset already recoded, normalized, and split into training and test sets, we fit the model to the training data using the neuralnet function:
library(neuralnet)
## Warning: package 'neuralnet' was built under R version 4.3.3
##
## Attaching package: 'neuralnet'
## The following object is masked from 'package:dplyr':
##
## compute
RNA1 <- neuralnet(as.numeric(Diagnosis) ~ .,entrenamiento, hidden = 4)
# plot(RNA1, rep = "best")
We now evaluate the model on the training data, obtaining the predictions and accuracy rate:
# Predictions:
output <- compute(RNA1, entrenamiento[,-1])
pred <- levels(Datan$Diagnosis)[round(output$net.result)]
tt <- table(entrenamiento[,1],pred)
tt
## pred
## Benigno Maligno
## Benigno 282 0
## Maligno 0 173
# Accuracy rate:
TA1 <- (sum(diag(tt)))/sum(tt)
round(TA1,4)
## [1] 1
We now compute the predictions and accuracy rate for the test data:
# Prediction table:
output <- compute(RNA1, prueba[,-1])
pred <- levels(Datan$Diagnosis)[round(output$net.result)]
tt <- table(prueba[,1],pred)
tt
## pred
## Benigno Maligno
## Benigno 74 1
## Maligno 4 34
# Accuracy rate:
TA2 <- (sum(diag(tt)))/sum(tt) # accuracy rate
round(TA2,2)
## [1] 0.96
Support Vector Machine
The last supervised classification method we apply is the support vector machine. With the dataset already recoded, normalized, and split into training and test sets, we fit the model to the training data using the svm function:
library(e1071)
## Warning: package 'e1071' was built under R version 4.3.2
svm1 <- svm(Diagnosis ~ ., entrenamiento, kernel = "linear",cost = 10, scale = F)
summary(svm1)
##
## Call:
## svm(formula = Diagnosis ~ ., data = entrenamiento, kernel = "linear",
## cost = 10, scale = F)
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: linear
## cost: 10
##
## Number of Support Vectors: 39
##
## ( 19 20 )
##
##
## Number of Classes: 2
##
## Levels:
## Benigno Maligno
# Observations acting as support vectors
svm1$index
## [1] 3 62 88 95 110 129 154 219 247 269 293 297 340 364 365 366 376 390 403
## [20] 11 15 26 35 50 52 73 87 102 120 159 197 228 238 252 273 276 302 303
## [39] 305
We then tune the cost hyperparameter by cross-validation with the tune function:
set.seed(111)
svm_vc <- tune("svm", Diagnosis ~ ., data = entrenamiento,
kernel = 'linear',
ranges = list(cost = c(0.001, 0.01, 0.1, 1, 5, 10, 20, 50, 100,150,200)))
summary(svm_vc)
##
## Parameter tuning of 'svm':
##
## - sampling method: 10-fold cross validation
##
## - best parameters:
## cost
## 0.1
##
## - best performance: 0.02869565
##
## - Detailed performance results:
## cost error dispersion
## 1 0.001 0.06603865 0.04168363
## 2 0.010 0.03299517 0.02598841
## 3 0.100 0.02869565 0.02573186
## 4 1.000 0.03082126 0.02138350
## 5 5.000 0.03299517 0.02146248
## 6 10.000 0.03299517 0.02607905
## 7 20.000 0.03734300 0.02740365
## 8 50.000 0.04386473 0.02038650
## 9 100.000 0.04386473 0.02038650
## 10 150.000 0.04386473 0.02038650
## 11 200.000 0.04386473 0.02038650
mejor_modelo <- svm_vc$best.model
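The cross-validated error across the tuned cost values can also be visualized; a minimal sketch using the plot method that e1071 provides for tune objects:
# Error profile over the cost grid explored by tune()
plot(svm_vc)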
We compute the predictions and accuracy rate for the training data:
pred1 <- predict(object = mejor_modelo,entrenamiento)
# Confusion matrix
MC1 <- table(pred1, entrenamiento$Diagnosis)
MC1
##
## pred1 Benigno Maligno
## Benigno 280 4
## Maligno 2 169
# Accuracy rate
total <- sum(MC1)
TA1 <- sum(MC1[1,1]+MC1[2,2])/total
TA1
## [1] 0.9868132
We evaluate the model on the test data:
# Predictions
pred1_p <- predict(object = mejor_modelo, prueba)
# Confusion matrix
MC1_p <- table(pred1_p, prueba$Diagnosis)
MC1_p
##
## pred1_p Benigno Maligno
## Benigno 75 3
## Maligno 0 36
# Accuracy rate
total <- sum(MC1_p)
TA1 <- sum(MC1_p[1,1]+MC1_p[2,2])/total
TA1
## [1] 0.9736842
For reference, the class counts in the training and test sets used here:
table(entrenamiento$Diagnosis)
##
## Benigno Maligno
## 282 173
table(prueba$Diagnosis)
##
## Benigno Maligno
## 75 39
Hierarchical Clustering
The first unsupervised classification method we apply is hierarchical clustering.
First we standardize the data:
data.scaled <- scale(x = Data[,-c(1,2)],
center = TRUE,
scale = TRUE)
head(data.scaled)
## radius_mean texture_mean perimeter_mean area_mean smoothness_mean
## [1,] 1.0960995 -2.0715123 1.2688173 0.9835095 1.5670875
## [2,] 1.8282120 -0.3533215 1.6844726 1.9070303 -0.8262354
## [3,] 1.5784992 0.4557859 1.5651260 1.5575132 0.9413821
## [4,] -0.7682333 0.2535091 -0.5921661 -0.7637917 3.2806668
## [5,] 1.7487579 -1.1508038 1.7750113 1.8246238 0.2801253
## [6,] -0.4759559 -0.8346009 -0.3868077 -0.5052059 2.2354545
## compactness_mean concavity_mean concave_points_mean symmetry_mean
## [1,] 3.2806281 2.65054179 2.5302489 2.215565542
## [2,] -0.4866435 -0.02382489 0.5476623 0.001391139
## [3,] 1.0519999 1.36227979 2.0354398 0.938858720
## [4,] 3.3999174 1.91421287 1.4504311 2.864862154
## [5,] 0.5388663 1.36980615 1.4272370 -0.009552062
## [6,] 1.2432416 0.86554001 0.8239307 1.004517928
## fractal_dimension_mean radius_se texture_se perimeter_se area_se
## [1,] 2.2537638 2.4875451 -0.5647681 2.8305403 2.4853907
## [2,] -0.8678888 0.4988157 -0.8754733 0.2630955 0.7417493
## [3,] -0.3976580 1.2275958 -0.7793976 0.8501802 1.1802975
## [4,] 4.9066020 0.3260865 -0.1103120 0.2863415 -0.2881246
## [5,] -0.5619555 1.2694258 -0.7895490 1.2720701 1.1893103
## [6,] 1.8883435 -0.2548461 -0.5921406 -0.3210217 -0.2890039
## smoothness_se compactness_se concavity_se concave_points_se symmetry_se
## [1,] -0.2138135 1.31570389 0.7233897 0.66023900 1.1477468
## [2,] -0.6048187 -0.69231710 -0.4403926 0.25993335 -0.8047423
## [3,] -0.2967439 0.81425704 0.2128891 1.42357487 0.2368272
## [4,] 0.6890953 2.74186785 0.8187979 1.11402678 4.7285198
## [5,] 1.4817634 -0.04847723 0.8277425 1.14319885 -0.3607748
## [6,] 0.1562093 0.44515196 0.1598845 -0.06906279 0.1340009
## fractal_dimension_se radius_worst texture_worst perimeter_worst area_worst
## [1,] 0.90628565 1.8850310 -1.35809849 2.3015755 1.9994782
## [2,] -0.09935632 1.8043398 -0.36887865 1.5337764 1.8888270
## [3,] 0.29330133 1.5105411 -0.02395331 1.3462906 1.4550043
## [4,] 2.04571087 -0.2812170 0.13386631 -0.2497196 -0.5495377
## [5,] 0.49888916 1.2974336 -1.46548091 1.3373627 1.2196511
## [6,] 0.48641784 -0.1653528 -0.31356043 -0.1149083 -0.2441054
## smoothness_worst compactness_worst concavity_worst concave_points_worst
## [1,] 1.3065367 2.6143647 2.1076718 2.2940576
## [2,] -0.3752817 -0.4300658 -0.1466200 1.0861286
## [3,] 0.5269438 1.0819801 0.8542223 1.9532817
## [4,] 3.3912907 3.8899747 1.9878392 2.1738732
## [5,] 0.2203623 -0.3131190 0.6126397 0.7286181
## [6,] 2.0467119 1.7201029 1.2621327 0.9050914
## symmetry_worst fractal_dimension_worst
## [1,] 2.7482041 1.9353117
## [2,] -0.2436753 0.2809428
## [3,] 1.1512420 0.2012142
## [4,] 6.0407261 4.9306719
## [5,] -0.8675896 -0.3967505
## [6,] 1.7525273 2.2398308
We compute the distance matrix and display a portion of it:
dist <- dist(data.scaled,method = "euclidean")
# distance matrix
dist_mat <- as.matrix(round(dist,3))
as.dist(dist_mat[1:6,1:6])
## 1 2 3 4 5
## 2 10.309
## 3 6.772 5.028
## 4 10.463 16.236 12.833
## 5 8.663 4.375 4.458 15.362
## 6 8.402 8.636 6.885 9.363 8.236
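A heatmap view of the full distance matrix can also be drawn; a minimal sketch with fviz_dist from factoextra (the package loaded in the next step):
library(factoextra)
# Heatmap of the pairwise Euclidean distances between observations
fviz_dist(dist, show_labels = FALSE)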
We determine the optimal number of clusters using the fviz_nbclust function with several different methods:
library(factoextra)
## Warning: package 'factoextra' was built under R version 4.3.3
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
fviz_nbclust(data.scaled, FUN = hcut, method = "silhouette")
fviz_nbclust(data.scaled, FUN = hcut, method = "wss")
fviz_nbclust(data.scaled, FUN = hcut, method = "gap_stat")
We fit the model:
modelo2 <- hclust(dist, method = "complete")
We create a dendrogram to visualize the groupings:
fviz_dend(modelo2, cex = 0.5, k=4,rect = TRUE, k_colors = "jco", rect_border = "jco", rect_fill = TRUE,horiz = TRUE,ggtheme = theme_bw())
## Warning: The `<scale>` argument of `guides()` cannot be `FALSE`. Use "none" instead as
## of ggplot2 3.3.4.
## ℹ The deprecated feature was likely used in the factoextra package.
## Please report the issue at <https://github.com/kassambara/factoextra/issues>.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
# phylogenic layout
fviz_dend(modelo2, cex = 0.5, k =5,
rect = TRUE,
k_colors = "jco",
rect_border = "jco",
rect_fill = TRUE,
type = "phylogenic")
# circular layout
fviz_dend(modelo2, cex = 0.5, k = 5,
rect = TRUE,
k_colors = "jco",
rect_border = "jco",
rect_fill = TRUE,
type = "circular")
We count the number of elements in each group according to the model:
grupos <- cutree(modelo2, k = 4)
table(grupos)
## grupos
## 1 2 3 4
## 177 7 383 2
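To see how these four clusters relate to the known diagnosis, the cluster assignments can be cross-tabulated against the Diagnosis column; a minimal sketch:
# Cluster membership versus the actual diagnosis
table(grupos, Data$Diagnosis)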
Partitioning Method
The second unsupervised classification method we apply is a partitioning method. With the dataset already scaled, we check the optimal number of clusters k using the NbClust function:
library(NbClust)
res.nbclust<-NbClust(data.scaled, distance="euclidean",min.nc=2,max.nc=10,method="complete",index="all")
## *** : The Hubert index is a graphical method of determining the number of clusters.
## In the plot of Hubert index, we seek a significant knee that corresponds to a
## significant increase of the value of the measure i.e the significant peak in Hubert
## index second differences plot.
##
## *** : The D index is a graphical method of determining the number of clusters.
## In the plot of D index, we seek a significant knee (the significant peak in Dindex
## second differences plot) that corresponds to a significant increase of the value of
## the measure.
##
## *******************************************************************
## * Among all indices:
## * 7 proposed 2 as the best number of clusters
## * 2 proposed 3 as the best number of clusters
## * 12 proposed 4 as the best number of clusters
## * 2 proposed 8 as the best number of clusters
## * 1 proposed 10 as the best number of clusters
##
## ***** Conclusion *****
##
## * According to the majority rule, the best number of clusters is 4
##
##
## *******************************************************************
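The votes of the individual indices can also be summarized graphically; a minimal sketch, assuming the fviz_nbclust method for NbClust objects provided by factoextra:
# Barplot of the number of clusters proposed by each NbClust index
fviz_nbclust(res.nbclust)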
We now corroborate this with the clValid function:
# Obtain partitions
library(clValid)
## Warning: package 'clValid' was built under R version 4.3.3
## Loading required package: cluster
validclus<-clValid(data.scaled,nClust=2:5,clMethos=c("hierarchical","kmeans","diana","fanny","pam","clara","agnes"),validation="internal")
## Warning in clValid(data.scaled, nClust = 2:5, clMethos = c("hierarchical", :
## rownames for data not specified, using 1:nrow(data)
summary(validclus)
##
## Clustering Methods:
## hierarchical
##
## Cluster sizes:
## 2 3 4 5
##
## Validation Measures:
## 2 3 4 5
##
## hierarchical Connectivity 6.7202 11.5782 11.7448 14.6738
## Dunn 0.3405 0.3825 0.3825 0.3825
## Silhouette 0.6340 0.5846 0.5543 0.4550
##
## Optimal Scores:
##
## Score Method Clusters
## Connectivity 6.7202 hierarchical 2
## Dunn 0.3825 hierarchical 3
## Silhouette 0.6340 hierarchical 2
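Note that the argument name in the clValid call above appears to be misspelled (clMethos instead of clMethods), which is likely why only the default "hierarchical" method shows up in the summary. A corrected call comparing several algorithms might look like the following sketch, where validclus2 is a hypothetical object name:
# Corrected argument name: clMethods
validclus2 <- clValid(data.scaled, nClust = 2:5,
                      clMethods = c("hierarchical", "kmeans", "pam"),
                      validation = "internal")
summary(validclus2)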
We recompute the distance matrix and fit the model:
dist <- dist(data.scaled, method = "euclidean")
modelo3 <- hclust(dist, method = "complete")
library(factoextra)
We plot a dendrogram to visualize the groupings:
fviz_dend(modelo3, cex = 0.5, k = 4,
rect = TRUE,
k_colors = "jco",
rect_border = "jco",
rect_fill = TRUE,
horiz = TRUE,
ggtheme = theme_bw())
# phylogenic layout
fviz_dend(modelo3, cex = 0.5, k = 4,
rect = TRUE,
k_colors = "jco",
rect_border = "jco",
rect_fill = TRUE,
type = "phylogenic")
We count the number of elements in each group according to the model:
grupos1 <- cutree(modelo3, k = 4)
table(grupos1)
## grupos1
## 1 2 3 4
## 177 7 383 2
The group sizes are identical to those from the previous method: 177, 7, 383, and 2 for groups 1, 2, 3, and 4, respectively. This is expected, since the model fitted here is again complete-linkage hierarchical clustering on the same distance matrix.
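A true partitioning fit, in line with the section title, could be obtained with k-means on the scaled data, using the k = 4 suggested by NbClust; a minimal sketch, where km is a hypothetical object name:
set.seed(111)
# k-means partitioning with the k = 4 suggested by NbClust
km <- kmeans(data.scaled, centers = 4, nstart = 25)
table(km$cluster)
# Visualize the clusters on the first two principal components
fviz_cluster(km, data = data.scaled, geom = "point", ggtheme = theme_bw())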