Theory

The caret package (Classification and Regression Training) gives R a single interface for training, tuning, and evaluating machine learning models.

Packages and libraries

# install.packages("caret")
library(caret)         # machine learning algorithms
# datasets ships with base R, so it needs no installation
library(datasets)      # provides the "iris" dataset
library(ggplot2)       # plotting (loaded as a caret dependency)
# install.packages("lattice")
library(lattice)       # plotting (loaded as a caret dependency)
# install.packages("DataExplorer")
library(DataExplorer)  # automated exploratory plots

Import the dataset

df <- data.frame(iris)

Descriptive analysis

# create_report(df)
plot_histogram(df)

plot_missing(df)

plot_correlation(df)

**NOTE: For classification, the variable we want to predict must be stored as a FACTOR.**
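
The factor requirement can be checked before training. The sketch below uses base R only; `species_chr` and `species_fct` are hypothetical names introduced here to show how a target stored as text could be converted.

```r
# iris ships with the datasets package, which R attaches by default.
df <- data.frame(iris)

is.factor(df$Species)  # TRUE: Species is already a factor in iris
levels(df$Species)     # the three class labels

# Hypothetical case: the target arrives as character and must be converted.
species_chr <- as.character(df$Species)
species_fct <- factor(species_chr)
is.factor(species_fct)
```

If the target were numeric, caret would treat the problem as regression, so converting it with `factor()` is what makes `train()` fit a classifier.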

80/20 data split

set.seed(123)  # make the random split reproducible
r_train <- createDataPartition(df$Species, p = 0.8, list = FALSE)
train <- df[r_train, ]
test <- df[-r_train, ]
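
Because `createDataPartition()` samples within each class, the split should keep the three species balanced. A minimal check, assuming the same seed and proportion as above (`idx` is a name introduced only for this sketch):

```r
library(caret)

set.seed(123)
idx <- createDataPartition(iris$Species, p = 0.8, list = FALSE)

# Class counts in the training rows and in the held-out rows.
table(iris$Species[idx])
table(iris$Species[-idx])
```

With 50 flowers per species, this gives 40 training rows and 10 test rows per class, which matches the column totals of the confusion matrices reported further down.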

Models

Commonly used machine learning methods, listed with their caret method names, include:

  • SVM: Support Vector Machine. There are several variants: linear (svmLinear), radial (svmRadial), polynomial (svmPoly), etc.
  • Decision tree: rpart
  • Neural networks: nnet
  • Random forest: rf

Cross-validation (CV) is a technique for evaluating a model's performance. It splits the data into several subsets (folds), trains on all but one of them, evaluates on the fold left out, and rotates until every fold has been used for evaluation. This measures how well the model generalizes and helps detect overfitting.
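
To make the fold idea concrete, the sketch below builds 10 folds by hand with caret's `createFolds()`. It is only an illustration on the full iris data; in the models that follow, `trainControl(method = "cv", number = 10)` handles the resampling on the training set, and the name `folds` is introduced here for the example.

```r
library(caret)

set.seed(123)
# Each list element holds the row indices held out in one fold.
folds <- createFolds(iris$Species, k = 10)

length(folds)          # number of folds
sapply(folds, length)  # rows held out per fold

# Every row is held out exactly once across the folds.
all(sort(unlist(folds)) == seq_len(nrow(iris)))
```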

The confusion matrix shows how well a model works and what types of errors it makes. It compares the model's predictions with the true values of the target variable.
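
The headline statistics can be recomputed by hand from the counts. The sketch below copies the counts from the linear SVM test-set matrix reported later in this document; the object names (`classes`, `cm`, `accuracy`, and so on) are introduced here for illustration and are not part of caret.

```r
classes <- c("setosa", "versicolor", "virginica")

# Rows are predictions, columns are the reference (true) labels.
cm <- matrix(c(10,  0, 0,
                0, 10, 1,
                0,  0, 9),
             nrow = 3, byrow = TRUE,
             dimnames = list(Prediction = classes, Reference = classes))

accuracy    <- sum(diag(cm)) / sum(cm)  # correct predictions / all predictions
sensitivity <- diag(cm) / colSums(cm)   # per class: correct / true members
pos_pred    <- diag(cm) / rowSums(cm)   # per class: correct / predicted members

accuracy  # 29/30, about 0.9667
```

The diagonal holds the correct predictions, so every off-diagonal cell is a specific kind of error: here, one virginica flower predicted as versicolor.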

If accuracy is very high on the training set (95% to 100%) but low on the test set (60% to 70%), that is a sign of overfitting.

Model 1. Linear SVM

modelo1 <- train(
  Species ~ .,
  data = train,
  method = "svmLinear",  # caret method name; change it to fit another model
  preProcess = c("scale", "center"),  # standardize the predictors
  trControl = trainControl(method = "cv", number = 10),  # 10-fold cross-validation
  tuneGrid = data.frame(C = 1)  # fix the cost parameter at C = 1 (no tuning)
)

Predictions

resultado_train <- predict(modelo1, train)
resultado_test <- predict(modelo1, test)

Confusion matrix

mcrp1<- confusionMatrix(resultado_train, train$Species)
mcrp1
## Confusion Matrix and Statistics
## 
##             Reference
## Prediction   setosa versicolor virginica
##   setosa         40          0         0
##   versicolor      0         39         0
##   virginica       0          1        40
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9917          
##                  95% CI : (0.9544, 0.9998)
##     No Information Rate : 0.3333          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.9875          
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: setosa Class: versicolor Class: virginica
## Sensitivity                 1.0000            0.9750           1.0000
## Specificity                 1.0000            1.0000           0.9875
## Pos Pred Value              1.0000            1.0000           0.9756
## Neg Pred Value              1.0000            0.9877           1.0000
## Prevalence                  0.3333            0.3333           0.3333
## Detection Rate              0.3333            0.3250           0.3333
## Detection Prevalence        0.3333            0.3250           0.3417
## Balanced Accuracy           1.0000            0.9875           0.9938
mcrp2<- confusionMatrix(resultado_test, test$Species)
mcrp2
## Confusion Matrix and Statistics
## 
##             Reference
## Prediction   setosa versicolor virginica
##   setosa         10          0         0
##   versicolor      0         10         1
##   virginica       0          0         9
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9667          
##                  95% CI : (0.8278, 0.9992)
##     No Information Rate : 0.3333          
##     P-Value [Acc > NIR] : 2.963e-13       
##                                           
##                   Kappa : 0.95            
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: setosa Class: versicolor Class: virginica
## Sensitivity                 1.0000            1.0000           0.9000
## Specificity                 1.0000            0.9500           1.0000
## Pos Pred Value              1.0000            0.9091           1.0000
## Neg Pred Value              1.0000            1.0000           0.9524
## Prevalence                  0.3333            0.3333           0.3333
## Detection Rate              0.3333            0.3333           0.3000
## Detection Prevalence        0.3333            0.3667           0.3000
## Balanced Accuracy           1.0000            0.9750           0.9500

Model 2. Radial SVM

modelo2 <- train(
  Species ~ .,
  data = train,
  method = "svmRadial",  # SVM with a radial (RBF) kernel
  preProcess = c("scale", "center"),  # standardize the predictors
  trControl = trainControl(method = "cv", number = 10),  # 10-fold cross-validation
  tuneGrid = data.frame(sigma = 1, C = 1)  # fix kernel width and cost (no tuning)
)

Predictions

resultado_train2 <- predict(modelo2, train)
resultado_test2 <- predict(modelo2, test)

Confusion matrix

mcrp3<- confusionMatrix(resultado_train2, train$Species)
mcrp3
## Confusion Matrix and Statistics
## 
##             Reference
## Prediction   setosa versicolor virginica
##   setosa         40          0         0
##   versicolor      0         39         0
##   virginica       0          1        40
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9917          
##                  95% CI : (0.9544, 0.9998)
##     No Information Rate : 0.3333          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.9875          
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: setosa Class: versicolor Class: virginica
## Sensitivity                 1.0000            0.9750           1.0000
## Specificity                 1.0000            1.0000           0.9875
## Pos Pred Value              1.0000            1.0000           0.9756
## Neg Pred Value              1.0000            0.9877           1.0000
## Prevalence                  0.3333            0.3333           0.3333
## Detection Rate              0.3333            0.3250           0.3333
## Detection Prevalence        0.3333            0.3250           0.3417
## Balanced Accuracy           1.0000            0.9875           0.9938
mcrp4<- confusionMatrix(resultado_test2, test$Species)
mcrp4
## Confusion Matrix and Statistics
## 
##             Reference
## Prediction   setosa versicolor virginica
##   setosa         10          0         0
##   versicolor      0         10         2
##   virginica       0          0         8
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9333          
##                  95% CI : (0.7793, 0.9918)
##     No Information Rate : 0.3333          
##     P-Value [Acc > NIR] : 8.747e-12       
##                                           
##                   Kappa : 0.9             
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: setosa Class: versicolor Class: virginica
## Sensitivity                 1.0000            1.0000           0.8000
## Specificity                 1.0000            0.9000           1.0000
## Pos Pred Value              1.0000            0.8333           1.0000
## Neg Pred Value              1.0000            1.0000           0.9091
## Prevalence                  0.3333            0.3333           0.3333
## Detection Rate              0.3333            0.3333           0.2667
## Detection Prevalence        0.3333            0.4000           0.2667
## Balanced Accuracy           1.0000            0.9500           0.9000

Model 3. Polynomial SVM

modelo3 <- train(
  Species ~ .,
  data = train,
  method = "svmPoly",  # SVM with a polynomial kernel
  preProcess = c("scale", "center"),  # standardize the predictors
  trControl = trainControl(method = "cv", number = 10),  # 10-fold cross-validation
  tuneGrid = data.frame(degree = 1, scale = 1, C = 1)  # fixed values; degree = 1 keeps the kernel linear in the inputs
)

Predictions

resultado_train3 <- predict(modelo3, train)
resultado_test3 <- predict(modelo3, test)

Confusion matrix

mcrp5<- confusionMatrix(resultado_train3, train$Species)
mcrp5
## Confusion Matrix and Statistics
## 
##             Reference
## Prediction   setosa versicolor virginica
##   setosa         40          0         0
##   versicolor      0         39         0
##   virginica       0          1        40
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9917          
##                  95% CI : (0.9544, 0.9998)
##     No Information Rate : 0.3333          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.9875          
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: setosa Class: versicolor Class: virginica
## Sensitivity                 1.0000            0.9750           1.0000
## Specificity                 1.0000            1.0000           0.9875
## Pos Pred Value              1.0000            1.0000           0.9756
## Neg Pred Value              1.0000            0.9877           1.0000
## Prevalence                  0.3333            0.3333           0.3333
## Detection Rate              0.3333            0.3250           0.3333
## Detection Prevalence        0.3333            0.3250           0.3417
## Balanced Accuracy           1.0000            0.9875           0.9938
mcrp6<- confusionMatrix(resultado_test3, test$Species)
mcrp6
## Confusion Matrix and Statistics
## 
##             Reference
## Prediction   setosa versicolor virginica
##   setosa         10          0         0
##   versicolor      0         10         1
##   virginica       0          0         9
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9667          
##                  95% CI : (0.8278, 0.9992)
##     No Information Rate : 0.3333          
##     P-Value [Acc > NIR] : 2.963e-13       
##                                           
##                   Kappa : 0.95            
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: setosa Class: versicolor Class: virginica
## Sensitivity                 1.0000            1.0000           0.9000
## Specificity                 1.0000            0.9500           1.0000
## Pos Pred Value              1.0000            0.9091           1.0000
## Neg Pred Value              1.0000            1.0000           0.9524
## Prevalence                  0.3333            0.3333           0.3333
## Detection Rate              0.3333            0.3333           0.3000
## Detection Prevalence        0.3333            0.3667           0.3000
## Balanced Accuracy           1.0000            0.9750           0.9500

Model 4. Decision Tree

modelo4 <- train(
  Species ~ .,
  data = train,
  method = "rpart",  # classification tree
  preProcess = c("scale", "center"),  # standardize the predictors
  trControl = trainControl(method = "cv", number = 10),  # 10-fold cross-validation
  tuneLength = 10  # let caret try up to 10 values of the tuning parameter (cp)
)

Predictions

resultado_train4 <- predict(modelo4, train)
resultado_test4<- predict(modelo4, test)

Confusion matrix

mcrp7<- confusionMatrix(resultado_train4, train$Species)
mcrp7
## Confusion Matrix and Statistics
## 
##             Reference
## Prediction   setosa versicolor virginica
##   setosa         40          0         0
##   versicolor      0         39         3
##   virginica       0          1        37
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9667          
##                  95% CI : (0.9169, 0.9908)
##     No Information Rate : 0.3333          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.95            
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: setosa Class: versicolor Class: virginica
## Sensitivity                 1.0000            0.9750           0.9250
## Specificity                 1.0000            0.9625           0.9875
## Pos Pred Value              1.0000            0.9286           0.9737
## Neg Pred Value              1.0000            0.9872           0.9634
## Prevalence                  0.3333            0.3333           0.3333
## Detection Rate              0.3333            0.3250           0.3083
## Detection Prevalence        0.3333            0.3500           0.3167
## Balanced Accuracy           1.0000            0.9688           0.9563
mcrp8<- confusionMatrix(resultado_test4, test$Species)
mcrp8
## Confusion Matrix and Statistics
## 
##             Reference
## Prediction   setosa versicolor virginica
##   setosa         10          0         0
##   versicolor      0         10         2
##   virginica       0          0         8
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9333          
##                  95% CI : (0.7793, 0.9918)
##     No Information Rate : 0.3333          
##     P-Value [Acc > NIR] : 8.747e-12       
##                                           
##                   Kappa : 0.9             
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: setosa Class: versicolor Class: virginica
## Sensitivity                 1.0000            1.0000           0.8000
## Specificity                 1.0000            0.9000           1.0000
## Pos Pred Value              1.0000            0.8333           1.0000
## Neg Pred Value              1.0000            1.0000           0.9091
## Prevalence                  0.3333            0.3333           0.3333
## Detection Rate              0.3333            0.3333           0.2667
## Detection Prevalence        0.3333            0.4000           0.2667
## Balanced Accuracy           1.0000            0.9500           0.9000

Model 5. Neural Network

modelo5 <- train(
  Species ~ .,
  data = train,
  method = "nnet",  # single-hidden-layer neural network
  preProcess = c("scale", "center"),  # standardize the predictors
  trControl = trainControl(method = "cv", number = 10),  # 10-fold cross-validation
  trace = FALSE  # passed on to nnet to silence the optimization log
)

Predictions

resultado_train5 <- predict(modelo5, train)
resultado_test5 <- predict(modelo5, test)

Confusion matrix

mcrp9<- confusionMatrix(resultado_train5, train$Species)
mcrp9
## Confusion Matrix and Statistics
## 
##             Reference
## Prediction   setosa versicolor virginica
##   setosa         40          0         0
##   versicolor      0         36         0
##   virginica       0          4        40
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9667          
##                  95% CI : (0.9169, 0.9908)
##     No Information Rate : 0.3333          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.95            
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: setosa Class: versicolor Class: virginica
## Sensitivity                 1.0000            0.9000           1.0000
## Specificity                 1.0000            1.0000           0.9500
## Pos Pred Value              1.0000            1.0000           0.9091
## Neg Pred Value              1.0000            0.9524           1.0000
## Prevalence                  0.3333            0.3333           0.3333
## Detection Rate              0.3333            0.3000           0.3333
## Detection Prevalence        0.3333            0.3000           0.3667
## Balanced Accuracy           1.0000            0.9500           0.9750
mcrp10<- confusionMatrix(resultado_test5, test$Species)
mcrp10
## Confusion Matrix and Statistics
## 
##             Reference
## Prediction   setosa versicolor virginica
##   setosa         10          0         0
##   versicolor      0          9         0
##   virginica       0          1        10
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9667          
##                  95% CI : (0.8278, 0.9992)
##     No Information Rate : 0.3333          
##     P-Value [Acc > NIR] : 2.963e-13       
##                                           
##                   Kappa : 0.95            
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: setosa Class: versicolor Class: virginica
## Sensitivity                 1.0000            0.9000           1.0000
## Specificity                 1.0000            1.0000           0.9500
## Pos Pred Value              1.0000            1.0000           0.9091
## Neg Pred Value              1.0000            0.9524           1.0000
## Prevalence                  0.3333            0.3333           0.3333
## Detection Rate              0.3333            0.3000           0.3333
## Detection Prevalence        0.3333            0.3000           0.3667
## Balanced Accuracy           1.0000            0.9500           0.9750

Modelo 6. Random Forest

modelo6 <- train(
  Species ~ .,
  data = train,
  method = "rf",  # random forest
  preProcess = c("scale", "center"),  # standardize the predictors
  trControl = trainControl(method = "cv", number = 10),  # 10-fold cross-validation
  tuneLength = 10  # request up to 10 values of the tuning parameter (mtry)
)
## note: only 3 unique complexity parameters in default grid. Truncating the grid to 3 .

Predictions

resultado_train6 <- predict(modelo6, train)
resultado_test6 <- predict(modelo6, test)

Confusion matrix

mcrp11<- confusionMatrix(resultado_train6, train$Species)
mcrp11
## Confusion Matrix and Statistics
## 
##             Reference
## Prediction   setosa versicolor virginica
##   setosa         40          0         0
##   versicolor      0         40         0
##   virginica       0          0        40
## 
## Overall Statistics
##                                      
##                Accuracy : 1          
##                  95% CI : (0.9697, 1)
##     No Information Rate : 0.3333     
##     P-Value [Acc > NIR] : < 2.2e-16  
##                                      
##                   Kappa : 1          
##                                      
##  Mcnemar's Test P-Value : NA         
## 
## Statistics by Class:
## 
##                      Class: setosa Class: versicolor Class: virginica
## Sensitivity                 1.0000            1.0000           1.0000
## Specificity                 1.0000            1.0000           1.0000
## Pos Pred Value              1.0000            1.0000           1.0000
## Neg Pred Value              1.0000            1.0000           1.0000
## Prevalence                  0.3333            0.3333           0.3333
## Detection Rate              0.3333            0.3333           0.3333
## Detection Prevalence        0.3333            0.3333           0.3333
## Balanced Accuracy           1.0000            1.0000           1.0000
mcrp12<- confusionMatrix(resultado_test6, test$Species)
mcrp12
## Confusion Matrix and Statistics
## 
##             Reference
## Prediction   setosa versicolor virginica
##   setosa         10          0         0
##   versicolor      0         10         2
##   virginica       0          0         8
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9333          
##                  95% CI : (0.7793, 0.9918)
##     No Information Rate : 0.3333          
##     P-Value [Acc > NIR] : 8.747e-12       
##                                           
##                   Kappa : 0.9             
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: setosa Class: versicolor Class: virginica
## Sensitivity                 1.0000            1.0000           0.8000
## Specificity                 1.0000            0.9000           1.0000
## Pos Pred Value              1.0000            0.8333           1.0000
## Neg Pred Value              1.0000            1.0000           0.9091
## Prevalence                  0.3333            0.3333           0.3333
## Detection Rate              0.3333            0.3333           0.2667
## Detection Prevalence        0.3333            0.4000           0.2667
## Balanced Accuracy           1.0000            0.9500           0.9000

Model results

resultados <- data.frame(
  "SVM Lineal" = c(mcrp1$overall["Accuracy"], mcrp2$overall["Accuracy"]),
  
  "SVM Radial" = c(mcrp3$overall["Accuracy"], mcrp4$overall["Accuracy"]),
  
  "SVM Polinómico" = c(mcrp5$overall["Accuracy"], mcrp6$overall["Accuracy"]),
  
  "Árbol de decisión" = c(mcrp7$overall["Accuracy"], mcrp8$overall["Accuracy"]),
  
  "Redes Neuronales" = c(mcrp9$overall["Accuracy"], mcrp10$overall["Accuracy"]),
  
  "Bosques Aleatorios" = c(mcrp11$overall["Accuracy"], mcrp12$overall["Accuracy"])
  
)
rownames(resultados)<- c("Precisión de Entrenamiento","Precision de prueba")
resultados
##                            SVM.Lineal SVM.Radial SVM.Polinómico
## Precisión de Entrenamiento  0.9916667  0.9916667      0.9916667
## Precision de prueba         0.9666667  0.9333333      0.9666667
##                            Árbol.de.decisión Redes.Neuronales
## Precisión de Entrenamiento         0.9666667        0.9666667
## Precision de prueba                0.9333333        0.9666667
##                            Bosques.Aleatorios
## Precisión de Entrenamiento          1.0000000
## Precision de prueba                 0.9333333
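
One way to read this table against the overfitting rule of thumb from the Models section is to compute the gap between training and test accuracy for each model. The sketch below retypes the accuracies printed above into a new data frame; `acc` and its column names are assumptions made for this example.

```r
# Accuracies copied from the results table above, keyed by caret method name.
acc <- data.frame(
  model = c("svmLinear", "svmRadial", "svmPoly", "rpart", "nnet", "rf"),
  train = c(0.9916667, 0.9916667, 0.9916667, 0.9666667, 0.9666667, 1.0000000),
  test  = c(0.9666667, 0.9333333, 0.9666667, 0.9333333, 0.9666667, 0.9333333)
)

# A large positive gap would point to overfitting.
acc$gap <- acc$train - acc$test
acc[order(-acc$gap), ]
```

The random forest shows the largest gap and the neural network none, but with only 30 test rows a single misclassified flower shifts test accuracy by about 3.3 percentage points, so differences of this size are weak evidence either way.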
