Supervised learning is a way of training a model using labeled data: each training example comes with the correct answer that the model must learn to predict. Examples:
* Detecting spam emails
* Customer segmentation
* Data-driven recommendations
The main supervised machine learning algorithms include the following (a brief illustrative fitting sketch in R follows the list):
* Logistic Regression: used for classification; it models the probability that an observation belongs to a specific category.
* Decision Trees: model decisions and their possible consequences as a tree whose branches represent choices and whose leaves represent outcomes.
* Random Forest: an ensemble of decision trees that improves the robustness and accuracy of the classification and reduces the overfitting risk associated with individual decision trees.
* Support Vector Machines (SVM): look for the hyperplane that best separates the classes in feature space. They are effective in high-dimensional spaces and in cases where the number of dimensions exceeds the number of samples.
* K-Nearest Neighbors (KNN): classifies a sample based on the labels of the 'k' closest samples in feature space.
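As a quick reference, the sketch below fits each of the algorithms listed above on a two-class subset of the built-in iris data. It is only an illustrative sketch, separate from the analysis that follows, and the package choices (rpart, randomForest, e1071, class) are assumptions; they are one common option among several.
# Illustrative sketch: one call per algorithm on a two-class toy problem
library(rpart)         # decision trees
library(randomForest)  # random forest
library(e1071)         # support vector machines
library(class)         # k-nearest neighbors
toy <- droplevels(subset(iris, Species != "setosa"))                  # keep two classes only
log_fit  <- glm(Species ~ ., data = toy, family = "binomial")         # logistic regression
tree_fit <- rpart(Species ~ ., data = toy, method = "class")          # decision tree
rf_fit   <- randomForest(Species ~ ., data = toy, ntree = 100)        # random forest
svm_fit  <- svm(Species ~ ., data = toy, kernel = "linear")           # SVM
knn_pred <- knn(train = toy[, 1:4], test = toy[, 1:4],
                cl = toy$Species, k = 5)                              # KNN (lazy learner, no fit step)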
Confusion matrix: a table that shows the number of correct and incorrect predictions broken down by class. Its main components are true positives, false positives, true negatives, and false negatives.
* Kappa statistic: helps evaluate accuracy in situations where agreement by chance could be significant, especially on imbalanced datasets.
* Relationship between AUC and the ROC curve: the ROC curve plots the true positive rate against the false positive rate. AUC measures the total area under the ROC curve and provides an aggregate measure of performance across all possible classification thresholds. A higher AUC indicates that the model is better at distinguishing between the positive and negative classes.
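A minimal worked example of these metrics, using made-up labels and probabilities and assuming the caret and pROC packages (both used later in this report):
library(caret)
library(pROC)
truth <- factor(c("No","No","Si","Si","No","Si","No","Si","No","No"))
prob  <- c(0.10, 0.40, 0.80, 0.35, 0.20, 0.90, 0.60, 0.70, 0.15, 0.55)  # hypothetical P(Si)
pred  <- factor(ifelse(prob > 0.5, "Si", "No"), levels = levels(truth))
cm <- confusionMatrix(pred, truth, positive = "Si")   # counts of TP, FP, TN, FN
cm$overall["Kappa"]                                   # Kappa = (p_o - p_e) / (1 - p_e)
roc_toy <- roc(truth, prob, levels = c("No", "Si"))   # ROC: TPR vs FPR across thresholds
auc(roc_toy)                                          # area under the ROC curve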
df <- read.csv("/Users/kikepablos/Documents/Development/escuela/concentracion_ai/modulo_3/Act_2/bank_marketing_strategy.csv")
summary(df)
## age job marital education
## Min. :18.00 Length:45211 Length:45211 Length:45211
## 1st Qu.:33.00 Class :character Class :character Class :character
## Median :39.00 Mode :character Mode :character Mode :character
## Mean :40.94
## 3rd Qu.:48.00
## Max. :95.00
## default balance housing loan
## Length:45211 Min. : -8019 Length:45211 Length:45211
## Class :character 1st Qu.: 72 Class :character Class :character
## Mode :character Median : 448 Mode :character Mode :character
## Mean : 1362
## 3rd Qu.: 1428
## Max. :102127
## contact day month duration
## Length:45211 Min. : 1.00 Length:45211 Min. : 0.0
## Class :character 1st Qu.: 8.00 Class :character 1st Qu.: 103.0
## Mode :character Median :16.00 Mode :character Median : 180.0
## Mean :15.81 Mean : 258.2
## 3rd Qu.:21.00 3rd Qu.: 319.0
## Max. :31.00 Max. :4918.0
## campaign pdays previous poutcome
## Min. : 1.000 Min. : -1.0 Min. : 0.0000 Length:45211
## 1st Qu.: 1.000 1st Qu.: -1.0 1st Qu.: 0.0000 Class :character
## Median : 2.000 Median : -1.0 Median : 0.0000 Mode :character
## Mean : 2.764 Mean : 40.2 Mean : 0.5803
## 3rd Qu.: 3.000 3rd Qu.: -1.0 3rd Qu.: 0.0000
## Max. :63.000 Max. :871.0 Max. :275.0000
## outcome
## Min. :1.000
## 1st Qu.:1.000
## Median :1.000
## Mean :1.117
## 3rd Qu.:1.000
## Max. :2.000
plot_missing(df) # no NA values were found
# c. Descriptive measures
str(df)
## 'data.frame': 45211 obs. of 17 variables:
## $ age : int 58 44 33 47 33 35 28 42 58 43 ...
## $ job : chr "management" "technician" "entrepreneur" "blue-collar" ...
## $ marital : chr "married" "single" "married" "married" ...
## $ education: chr "tertiary" "secondary" "secondary" "unknown" ...
## $ default : chr "no" "no" "no" "no" ...
## $ balance : int 2143 29 2 1506 1 231 447 2 121 593 ...
## $ housing : chr "yes" "yes" "yes" "yes" ...
## $ loan : chr "no" "no" "yes" "no" ...
## $ contact : chr "unknown" "unknown" "unknown" "unknown" ...
## $ day : int 5 5 5 5 5 5 5 5 5 5 ...
## $ month : chr "may" "may" "may" "may" ...
## $ duration : int 261 151 76 92 198 139 217 380 50 55 ...
## $ campaign : int 1 1 1 1 1 1 1 1 1 1 ...
## $ pdays : int -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 ...
## $ previous : int 0 0 0 0 0 0 0 0 0 0 ...
## $ poutcome : chr "unknown" "unknown" "unknown" "unknown" ...
## $ outcome : int 1 1 1 1 1 1 1 1 1 1 ...
summary(df)
## age job marital education
## Min. :18.00 Length:45211 Length:45211 Length:45211
## 1st Qu.:33.00 Class :character Class :character Class :character
## Median :39.00 Mode :character Mode :character Mode :character
## Mean :40.94
## 3rd Qu.:48.00
## Max. :95.00
## default balance housing loan
## Length:45211 Min. : -8019 Length:45211 Length:45211
## Class :character 1st Qu.: 72 Class :character Class :character
## Mode :character Median : 448 Mode :character Mode :character
## Mean : 1362
## 3rd Qu.: 1428
## Max. :102127
## contact day month duration
## Length:45211 Min. : 1.00 Length:45211 Min. : 0.0
## Class :character 1st Qu.: 8.00 Class :character 1st Qu.: 103.0
## Mode :character Median :16.00 Mode :character Median : 180.0
## Mean :15.81 Mean : 258.2
## 3rd Qu.:21.00 3rd Qu.: 319.0
## Max. :31.00 Max. :4918.0
## campaign pdays previous poutcome
## Min. : 1.000 Min. : -1.0 Min. : 0.0000 Length:45211
## 1st Qu.: 1.000 1st Qu.: -1.0 1st Qu.: 0.0000 Class :character
## Median : 2.000 Median : -1.0 Median : 0.0000 Mode :character
## Mean : 2.764 Mean : 40.2 Mean : 0.5803
## 3rd Qu.: 3.000 3rd Qu.: -1.0 3rd Qu.: 0.0000
## Max. :63.000 Max. :871.0 Max. :275.0000
## outcome
## Min. :1.000
## 1st Qu.:1.000
## Median :1.000
## Mean :1.117
## 3rd Qu.:1.000
## Max. :2.000
df_var <- df %>% select_if(is.numeric)
varianza_df <- var(df_var)
varianza_df
## age balance day duration campaign
## age 112.75810728 3161.47670 -0.80597850 -12.7117159 0.15660067
## balance 3161.47670016 9270598.95447 114.09546669 16905.7505445 -137.51323247
## day -0.80597850 114.09547 69.26360932 -64.7403169 4.18951855
## duration -12.71171595 16905.75054 -64.74031687 66320.5740901 -67.47179379
## campaign 0.15660067 -137.51323 4.18951855 -67.4717938 9.59773339
## pdays -25.26054997 1047.32171 -77.53540389 -40.3490729 -27.49238644
## previous 0.03151189 116.93953 -0.99130745 0.7136535 -0.23445891
## outcome 0.08585234 51.70794 -0.07582723 32.6548655 -0.07285895
## pdays previous outcome
## age -25.260550 0.03151189 0.08585234
## balance 1047.321715 116.93952715 51.70794264
## day -77.535404 -0.99130745 -0.07582723
## duration -40.349073 0.71365352 32.65486554
## campaign -27.492386 -0.23445891 -0.07285895
## pdays 10025.765774 104.89990241 3.33474205
## previous 104.899902 5.30584065 0.06902609
## outcome 3.334742 0.06902609 0.10330164
plot_histogram(df)
# co_df <- select(df, -default, -contact, -campaign, -pdays,-previous)
plot_correlation(df)
# DataExplorer::create_report()
# Split the dataset into training and test sets
chr_cols <- names(df)[sapply(df, is.character)]
df[chr_cols] <- lapply(df[chr_cols], as.factor)
int_cols <- names(df)[sapply(df, is.integer)]
df[int_cols] <- lapply(df[int_cols], as.numeric)
df$outcome <- ifelse(df$outcome == 2, "Si", "No")
df$outcome <- as.factor(df$outcome)
set.seed(123)
sample <- createDataPartition(y = df$outcome, p=0.7, list=F)
train <- df[sample, ]
test <- df[-sample, ]
options(scipen=999)
multiple_logistic <- glm(outcome ~., family = "binomial", data = train)
summary(multiple_logistic)
##
## Call:
## glm(formula = outcome ~ ., family = "binomial", data = train)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.267365560 0.217801302 -10.410 < 0.0000000000000002 ***
## age -0.003181404 0.002616648 -1.216 0.224049
## jobblue-collar -0.320256903 0.086638789 -3.696 0.000219 ***
## jobentrepreneur -0.459257820 0.149368282 -3.075 0.002107 **
## jobhousemaid -0.597136662 0.164836430 -3.623 0.000292 ***
## jobmanagement -0.245156038 0.087210989 -2.811 0.004938 **
## jobretired 0.276127575 0.116259917 2.375 0.017545 *
## jobself-employed -0.399419091 0.135610981 -2.945 0.003226 **
## jobservices -0.164707835 0.098648897 -1.670 0.094991 .
## jobstudent 0.450216538 0.128415901 3.506 0.000455 ***
## jobtechnician -0.166960918 0.081455919 -2.050 0.040393 *
## jobunemployed -0.178794987 0.134906806 -1.325 0.185064
## jobunknown -0.253464661 0.265615557 -0.954 0.339955
## maritalmarried -0.174567705 0.071065791 -2.456 0.014033 *
## maritalsingle 0.109789930 0.080663057 1.361 0.173484
## educationsecondary 0.150791035 0.077325331 1.950 0.051166 .
## educationtertiary 0.402913613 0.089660781 4.494 0.000006997802904691 ***
## educationunknown 0.268353881 0.121690077 2.205 0.027438 *
## defaultyes -0.051630877 0.199841618 -0.258 0.796130
## balance 0.000014194 0.000006288 2.257 0.024001 *
## housingyes -0.755941948 0.052183243 -14.486 < 0.0000000000000002 ***
## loanyes -0.438544040 0.070915008 -6.184 0.000000000624659114 ***
## contacttelephone -0.057923482 0.087273935 -0.664 0.506884
## contactunknown -1.606632789 0.087653550 -18.329 < 0.0000000000000002 ***
## day 0.009198765 0.002963887 3.104 0.001912 **
## monthaug -0.702336821 0.093468897 -7.514 0.000000000000057294 ***
## monthdec 0.665279606 0.207549966 3.205 0.001349 **
## monthfeb -0.101259116 0.105771806 -0.957 0.338398
## monthjan -1.261970428 0.144724558 -8.720 < 0.0000000000000002 ***
## monthjul -0.733312021 0.091221042 -8.039 0.000000000000000907 ***
## monthjun 0.428405485 0.110900645 3.863 0.000112 ***
## monthmar 1.603075401 0.141189652 11.354 < 0.0000000000000002 ***
## monthmay -0.350577860 0.085331500 -4.108 0.000039837070002395 ***
## monthnov -0.783920553 0.099063795 -7.913 0.000000000000002507 ***
## monthoct 0.951451976 0.127150373 7.483 0.000000000000072707 ***
## monthsep 0.696651439 0.143836393 4.843 0.000001276616463076 ***
## duration 0.004084763 0.000076387 53.474 < 0.0000000000000002 ***
## campaign -0.094067513 0.012201792 -7.709 0.000000000000012649 ***
## pdays -0.000080165 0.000361504 -0.222 0.824505
## previous 0.007343230 0.006422649 1.143 0.252900
## poutcomeother 0.133008375 0.105844524 1.257 0.208884
## poutcomesuccess 2.167107058 0.096875980 22.370 < 0.0000000000000002 ***
## poutcomeunknown -0.161300405 0.108965010 -1.480 0.138794
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 22845 on 31648 degrees of freedom
## Residual deviance: 15242 on 31606 degrees of freedom
## AIC: 15328
##
## Number of Fisher Scoring iterations: 6
varImp(multiple_logistic)
## Overall
## age 1.2158318
## jobblue-collar 3.6964610
## jobentrepreneur 3.0746676
## jobhousemaid 3.6226013
## jobmanagement 2.8110682
## jobretired 2.3750884
## jobself-employed 2.9453300
## jobservices 1.6696369
## jobstudent 3.5059252
## jobtechnician 2.0497088
## jobunemployed 1.3253222
## jobunknown 0.9542538
## maritalmarried 2.4564239
## maritalsingle 1.3610931
## educationsecondary 1.9500859
## educationtertiary 4.4937553
## educationunknown 2.2052240
## defaultyes 0.2583590
## balance 2.2571141
## housingyes 14.4862969
## loanyes 6.1840794
## contacttelephone 0.6636974
## contactunknown 18.3293522
## day 3.1036155
## monthaug 7.5141233
## monthdec 3.2053949
## monthfeb 0.9573356
## monthjan 8.7198085
## monthjul 8.0388473
## monthjun 3.8629666
## monthmar 11.3540573
## monthmay 4.1084226
## monthnov 7.9132901
## monthoct 7.4828878
## monthsep 4.8433601
## duration 53.4743740
## campaign 7.7093193
## pdays 0.2217550
## previous 1.1433335
## poutcomeother 1.2566392
## poutcomesuccess 22.3699110
## poutcomeunknown 1.4802954
# Prediction with the logistic regression model on the test set
test_multiple_logistic <- predict(multiple_logistic, newdata = test, type = "response")
# mL_cm <- confusionMatrix(test_multiple_logistic, test$outcome)
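The confusionMatrix call above is left commented out because predict(..., type = "response") returns probabilities rather than class labels; a minimal sketch of the conversion, assuming the same 0.5 cutoff used in the performance table further below:
# Hypothetical conversion of probabilities to class labels for caret::confusionMatrix
logistic_class <- factor(ifelse(test_multiple_logistic > 0.5, "Si", "No"),
                         levels = levels(test$outcome))
mL_cm <- confusionMatrix(data = logistic_class, reference = test$outcome, positive = "Si")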
multiple_logistic_data <- augment(multiple_logistic) %>%
mutate(index = 1:n())
# Get the 5 observations with the largest Cook's distance
top_cooksd <- multiple_logistic_data %>%
top_n(5, .cooksd)
# Evaluate model performance using a contingency table of proportions
model_performance <- table(test$outcome, test_multiple_logistic > 0.5) %>%
prop.table() %>%
round(3)
# List the model performance
list(multiple_logistic_performance = model_performance)
## $multiple_logistic_performance
##
## FALSE TRUE
## No 0.862 0.021
## Si 0.076 0.041
prediction(test_multiple_logistic, test$outcome) %>%
performance(measure = "tpr", x.measure = "fpr") %>%
plot()
mlr_aes <- prediction(test_multiple_logistic, test$outcome) %>% performance(measure = "auc") %>% .@y.values
mlr_aes
## [[1]]
## [1] 0.9087577
roc_log <- roc(test$outcome, predict(multiple_logistic, newdata = test))
## Setting levels: control = No, case = Si
## Setting direction: controls < cases
auc_log <- pROC::auc(roc_log)
auc_log
## Area under the curve: 0.9088
set.seed(123)
options(scipen=999)
dt.rpart <- rpart(outcome ~ .,data = train, method = "class")
#
dt.rpart$variable.importance
## duration poutcome month pdays age campaign
## 991.2553044 535.2120007 229.8670632 3.0851575 1.6231573 1.5362460
## contact day balance default job
## 0.7708972 0.5965242 0.5693730 0.2569657 0.2569657
prp(dt.rpart, extra = 2)
#
#
training_pred <- predict(object = dt.rpart, newdata = train, type = "class")
# Training confusion matrix
training_conf <- confusionMatrix(data = training_pred, reference = train$outcome, positive = "Si", mode = "everything")
training_conf
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Si
## No 27006 2189
## Si 940 1514
##
## Accuracy : 0.9011
## 95% CI : (0.8978, 0.9044)
## No Information Rate : 0.883
## P-Value [Acc > NIR] : < 0.00000000000000022
##
## Kappa : 0.4395
##
## Mcnemar's Test P-Value : < 0.00000000000000022
##
## Sensitivity : 0.40886
## Specificity : 0.96636
## Pos Pred Value : 0.61695
## Neg Pred Value : 0.92502
## Precision : 0.61695
## Recall : 0.40886
## F1 : 0.49180
## Prevalence : 0.11700
## Detection Rate : 0.04784
## Detection Prevalence : 0.07754
## Balanced Accuracy : 0.68761
##
## 'Positive' Class : Si
##
### TESTING PERFORMANCE
test_pred <- predict(object = dt.rpart, newdata = test, type = "class")
test_conf <- confusionMatrix(data = test_pred, reference = test$outcome, positive = "Si", mode = "everything")
test_conf
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Si
## No 11574 929
## Si 402 657
##
## Accuracy : 0.9019
## 95% CI : (0.8967, 0.9068)
## No Information Rate : 0.8831
## P-Value [Acc > NIR] : 0.000000000001619
##
## Kappa : 0.4448
##
## Mcnemar's Test P-Value : < 0.00000000000000022
##
## Sensitivity : 0.41425
## Specificity : 0.96643
## Pos Pred Value : 0.62040
## Neg Pred Value : 0.92570
## Precision : 0.62040
## Recall : 0.41425
## F1 : 0.49679
## Prevalence : 0.11694
## Detection Rate : 0.04844
## Detection Prevalence : 0.07809
## Balanced Accuracy : 0.69034
##
## 'Positive' Class : Si
##
accuracy_tree <- test_conf$overall['Accuracy']
accuracy_tree
## Accuracy
## 0.9018581
# Get the Kappa statistic
kappa_tree <- test_conf$overall['Kappa']
kappa_tree
## Kappa
## 0.4447948
printcp(x = dt.rpart)
##
## Classification tree:
## rpart(formula = outcome ~ ., data = train, method = "class")
##
## Variables actually used in tree construction:
## [1] duration month poutcome
##
## Root node error: 3703/31649 = 0.117
##
## n= 31649
##
## CP nsplit rel error xerror xstd
## 1 0.035377 0 1.00000 1.00000 0.015442
## 2 0.026465 3 0.89387 0.89441 0.014706
## 3 0.011207 4 0.86740 0.86876 0.014518
## 4 0.010000 6 0.84499 0.86713 0.014506
min_cross_validated_error <- dt.rpart$cptable[which.min(dt.rpart$cptable[,"xerror"]),"CP"]
dt.rpart_alt <- prune(tree = dt.rpart, cp = min_cross_validated_error)
prp(dt.rpart_alt, extra = 2)
predicted_prob <- predict(dt.rpart, newdata = test, type = "prob")[,2]
roc_tree <- roc(test$outcome, predict(dt.rpart_alt, newdata = test, type = "prob")[, "Si"])
## Setting levels: control = No, case = Si
## Setting direction: controls < cases
auc_tree <- pROC::auc(roc_tree)
auc_tree
## Area under the curve: 0.8027
svm_model_1 <- svm(outcome ~ ., data = train, type = "C-classification", cost = 1, kernel = "linear", scale = FALSE)
svm_pred <- predict(svm_model_1, newdata = test)
confusionMatrix_svm <- confusionMatrix(svm_pred, test$outcome)
# plot(svm_model_1, train)
summary(svm_model_1)
##
## Call:
## svm(formula = outcome ~ ., data = train, type = "C-classification",
## cost = 1, kernel = "linear", scale = FALSE)
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: linear
## cost: 1
##
## Number of Support Vectors: 1399
##
## ( 511 888 )
##
##
## Number of Classes: 2
##
## Levels:
## No Si
accuracy_svm <- confusionMatrix_svm$overall['Accuracy']
accuracy_svm
## Accuracy
## 0.765595
# Get the Kappa statistic
kappa_svm <- confusionMatrix_svm$overall['Kappa']
kappa_svm
## Kappa
## 0.1679636
roc_svm <- roc(test$outcome, as.numeric(predict(svm_model_1, newdata = test) == "Si"))
## Setting levels: control = No, case = Si
## Setting direction: controls < cases
auc_svm <- pROC::auc(roc_svm)
auc_svm
## Area under the curve: 0.6146
train_n <- names(train)[sapply(train, is.numeric)]
train_num <- dplyr::select(train, all_of(train_n))
head(train_num)
## age balance day duration campaign pdays previous
## 1 58 2143 5 261 1 -1 0
## 2 44 29 5 151 1 -1 0
## 3 33 2 5 76 1 -1 0
## 4 47 1506 5 92 1 -1 0
## 5 33 1 5 198 1 -1 0
## 6 35 231 5 139 1 -1 0
set.seed(123)
scl_df <- scale(train_num)
k4 <- kmeans(scl_df, centers = 4, nstart = 20)
fviz_cluster(k4, data = scl_df)
## remove the outlier
# scl_df <- scl_df[-c(29183), ]
c_29183 <- scl_df[29183, ]
df_sf <- scl_df[-29183, ]
set.seed(123)
wss <- function(k) {
kmeans(df_sf, k, nstart = 10 )$tot.withinss
}
# Compute and plot wss for k = 1 to k = 15
k.values <- 1:15
# extract wss for k = 1 to 15 clusters
wss_values <- map_dbl(k.values, wss)
plot(k.values, wss_values,
type="b", pch = 19, frame = FALSE,
xlab="Number of clusters K",
ylab="Total within-clusters sum of squares")
k7 <- kmeans(df_sf, centers = 7, nstart = 20)
fviz_cluster(k7, data = df_sf)
df_sf <- as.data.frame(df_sf)
df_sf %>%
mutate(Cluster = k7$cluster) %>%
group_by(Cluster) %>%
summarise_all("mean")
## # A tibble: 7 × 8
## Cluster age balance day duration campaign pdays previous
## <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 -0.468 -0.150 0.949 -0.226 -0.0674 -0.365 -0.203
## 2 2 1.29 -0.0236 -0.0757 -0.212 -0.112 -0.309 -0.162
## 3 3 -0.535 -0.185 -0.907 -0.187 -0.200 -0.368 -0.197
## 4 4 -0.0233 -0.149 0.716 -0.461 4.14 -0.404 -0.229
## 5 5 -0.150 -0.0879 -0.260 -0.102 -0.204 2.28 1.24
## 6 6 0.248 4.45 0.0410 -0.0532 -0.0817 -0.0702 -0.0171
## 7 7 -0.0505 -0.0415 0.00165 2.74 -0.106 -0.233 -0.131
# AUC cannot be computed for K-Means because it is not a supervised classification model
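Since K-Means has no labels to score against, an internal measure such as the average silhouette width is one possible alternative; a minimal sketch, assuming the cluster package and a subsample to keep the distance matrix manageable:
library(cluster)
set.seed(123)
idx <- sample(nrow(df_sf), 5000)                        # subsample: dist() on all rows is too large
sil <- silhouette(k7$cluster[idx], dist(df_sf[idx, ]))  # silhouette width per observation
mean(sil[, "sil_width"])                                # average silhouette width for k = 7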
k <- 4
model_knn <- train(
  outcome ~ job + education + housing + pdays + duration + loan + campaign + poutcome + month + balance,
  data = train,
  method = "knn",
  preProcess = c("center", "scale"),
  trControl = trainControl(method = "cv", number = 4),
  tuneGrid = expand.grid(k = k)
)
# Model summary
print(model_knn)
## k-Nearest Neighbors
##
## 31649 samples
## 10 predictor
## 2 classes: 'No', 'Si'
##
## Pre-processing: centered (34), scaled (34)
## Resampling: Cross-Validated (4 fold)
## Summary of sample sizes: 23737, 23737, 23736, 23737
## Resampling results:
##
## Accuracy Kappa
## 0.8844513 0.3491849
##
## Tuning parameter 'k' was held constant at a value of 4
# Make predictions on the test set
knn_predictions_test <- predict(model_knn, newdata = test)
knn_predictions_train <- predict(model_knn, newdata = train)
# Confusion Matrix
confusionMatrix_knn <- confusionMatrix(knn_predictions_test, test$outcome)
confusionMatrix(knn_predictions_train, train$outcome)
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Si
## No 27165 1753
## Si 781 1950
##
## Accuracy : 0.9199
## 95% CI : (0.9169, 0.9229)
## No Information Rate : 0.883
## P-Value [Acc > NIR] : < 0.00000000000000022
##
## Kappa : 0.5627
##
## Mcnemar's Test P-Value : < 0.00000000000000022
##
## Sensitivity : 0.9721
## Specificity : 0.5266
## Pos Pred Value : 0.9394
## Neg Pred Value : 0.7140
## Prevalence : 0.8830
## Detection Rate : 0.8583
## Detection Prevalence : 0.9137
## Balanced Accuracy : 0.7493
##
## 'Positive' Class : No
##
accuracy_knn <- confusionMatrix_knn$overall['Accuracy']
accuracy_knn
## Accuracy
## 0.884899
# Get the Kappa statistic
kappa_knn <- confusionMatrix_knn$overall['Kappa']
kappa_knn
## Kappa
## 0.3484375
# roc_knn <- roc(test$outcome, knn_predictions_test)
# auc_knn <- pROC::auc(roc_knn)
# auc_knn
roc_knn <- roc(test$outcome, as.numeric(knn_predictions_test == "Si"))
## Setting levels: control = No, case = Si
## Setting direction: controls < cases
auc_knn <- pROC::auc(roc_knn)
auc_knn
## Area under the curve: 0.649
table(train$outcome) %>% prop.table()
##
## No Si
## 0.8829979 0.1170021
table(test$outcome) %>% prop.table()
##
## No Si
## 0.8830556 0.1169444
naive_bayes <- naiveBayes(outcome ~ ., data = train)
predicted <- predict(naive_bayes, as.data.frame(test))
confusionMatrix_nb <- confusionMatrix(data = predicted, reference = test$outcome)
accuracy_nb <- confusionMatrix_nb$overall['Accuracy']
accuracy_nb
## Accuracy
## 0.8811385
# Get the Kappa statistic
kappa_nb <- confusionMatrix_nb$overall['Kappa']
kappa_nb
## Kappa
## 0.4463092
predicted_prob <- predict(naive_bayes, as.data.frame(test), type = "raw")
predicted_prob_pos <- predicted_prob[,2]
# Compute the ROC curve and the AUC
roc_nb <- roc(test$outcome, predicted_prob_pos)
## Setting levels: control = No, case = Si
## Setting direction: controls < cases
auc_nb <- pROC::auc(roc_nb)
auc_nb
## Area under the curve: 0.8601
naive_bayes_b <- NaiveBayes(outcome ~ ., data = train)
plot(naive_bayes_b)
rf_model <- randomForest(outcome ~ ., data = train)
test$outcome <- as.factor(test$outcome)
# Model prediction and evaluation
rf_pred <- predict(rf_model, test)
confusionMatrix_rf <- confusionMatrix(rf_pred, test$outcome)
accuracy_rf <- confusionMatrix_rf$overall['Accuracy']
accuracy_rf
## Accuracy
## 0.9114437
# Get the Kappa statistic
kappa_rf <- confusionMatrix_rf$overall['Kappa']
kappa_rf
## Kappa
## 0.5172272
roc_rf <- roc(test$outcome, predict(rf_model, newdata = test, type = "prob")[, "Si"])
## Setting levels: control = No, case = Si
## Setting direction: controls < cases
auc_rf <- pROC::auc(roc_rf)
auc_rf
## Area under the curve: 0.9354
model_metrics <- data.frame(
  Model = c("Multiple Logistic Regression", "Decision Trees", "Support Vector Machine (SVM)",
            "K-Means Clustering", "KNN", "Naive Bayes", "Random Forest"),
  accuracy = c(NA, accuracy_tree, accuracy_svm, NA, accuracy_knn, accuracy_nb, accuracy_rf),
  kappa = c(NA, kappa_tree, kappa_svm, NA, kappa_knn, kappa_nb, kappa_rf),
  # roc_m = c(roc_log, roc_tree, roc_svm, NA, roc_knn, roc_nb, roc_rf),
  auc_m = c(mlr_aes[[1]], as.numeric(auc_tree), as.numeric(auc_svm), NA,
            as.numeric(auc_knn), as.numeric(auc_nb), as.numeric(auc_rf))
)
model_metrics
##                          Model  accuracy     kappa     auc_m
## 1 Multiple Logistic Regression        NA        NA 0.9087577
## 2               Decision Trees 0.9018581 0.4447948 0.8027379
## 3 Support Vector Machine (SVM) 0.7655950 0.1679636 0.6145545
## 4           K-Means Clustering        NA        NA        NA
## 5                          KNN 0.8848990 0.3484375 0.6490118
## 6                  Naive Bayes 0.8811385 0.4463092 0.8600622
## 7                Random Forest 0.9114437 0.5172272 0.9354014
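The commented-out roc_m column cannot be stored in a data frame, but the ROC objects computed above can be overlaid for a visual comparison; a minimal sketch, assuming ggplot2 is available alongside pROC:
library(pROC)
library(ggplot2)
roc_list <- list(
  "Logistic Regression" = roc_log,
  "Decision Tree"       = roc_tree,
  "SVM"                 = roc_svm,
  "KNN"                 = roc_knn,
  "Naive Bayes"         = roc_nb,
  "Random Forest"       = roc_rf
)
ggroc(roc_list, legacy.axes = TRUE) +                            # plot TPR vs FPR for every model
  geom_abline(slope = 1, intercept = 0, linetype = "dashed") +   # chance diagonal
  labs(x = "False positive rate", y = "True positive rate", colour = "Model")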
After analyzing all of the models' performance metrics, I concluded that Random Forest is the most suitable model for this task. It reaches an accuracy of 91.14%, which indicates a high success rate in correctly identifying the outcomes. Its Kappa value of 0.517, while not very high, is still meaningful: it shows that the model's predictions are considerably better than random guessing once chance agreement is accounted for. Finally, its AUC of 0.935 indicates that it is very good at distinguishing between classes and can effectively separate positive from negative cases.