1) Briefly answer, in your own words, 2 of the following 3 questions:

i) What is Supervised Machine Learning and what are some of its applications in classification analysis?

It is a way of training a model using labeled data: each training example comes with the correct answer that the model must learn to predict. Some applications in classification (a minimal code sketch follows the list):

* Detecting spam emails
* Customer segmentation
* Data-driven recommendations
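
As a minimal sketch of the idea, assuming only base R and the built-in iris data (not the assignment dataset): the model is fit on rows whose labels are known and then predicts labels for new rows.

# Minimal supervised-classification sketch on built-in data (illustrative only).
iris_bin <- iris[iris$Species != "setosa", ]      # keep two classes
iris_bin$Species <- droplevels(iris_bin$Species)  # the labels the model must learn
fit <- glm(Species ~ Sepal.Length + Petal.Length, family = "binomial", data = iris_bin)
predict(fit, newdata = iris_bin[1:3, ], type = "response")  # predicted probabilities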

ii) What are the main Supervised Machine Learning - Classification algorithms? Briefly describe, in your own words, 4 – 6 of the main Supervised Machine Learning - Classification algorithms.

The main Supervised Machine Learning classification algorithms are (an illustrative code sketch follows the list):

* Logistic Regression: used for classification by modeling the probability that an observation belongs to a specific category.
* Decision Trees: model decisions and their possible consequences, building a tree that represents decisions and their outcomes.
* Random Forest: an ensemble of decision trees that improves the robustness and accuracy of classification. It reduces the overfitting risk associated with individual decision trees.
* Support Vector Machines (SVM): find the hyperplane that best separates the classes in feature space. They are effective in high-dimensional spaces and in cases where the number of dimensions exceeds the number of samples.
* K-Nearest Neighbors (KNN): classifies a sample based on the labels of the 'k' closest samples in feature space.
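
As an illustrative sketch (assuming a factor response y in a data frame train_set, plus the usual packages rpart, randomForest, e1071, and class; every object name here is a hypothetical placeholder, not from this report), each algorithm has a standard R entry point:

# Hypothetical one-liners for the algorithms above; train_set, train_x, test_x,
# and train_y are placeholder objects used only for illustration.
logit_fit <- glm(y ~ ., family = "binomial", data = train_set)               # Logistic Regression
tree_fit  <- rpart::rpart(y ~ ., data = train_set, method = "class")         # Decision Tree
rf_fit    <- randomForest::randomForest(y ~ ., data = train_set)             # Random Forest
svm_fit   <- e1071::svm(y ~ ., data = train_set, type = "C-classification")  # SVM
knn_pred  <- class::knn(train_x, test_x, cl = train_y, k = 5)                # KNN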

iii) Regarding the evaluation of classification model results: What is the confusion matrix? What is the Kappa statistic? What is the relationship between AUC and the ROC Curve?

* Confusion matrix: a table that shows the number of correct and incorrect predictions broken down by class. Its main components are true positives, false positives, true negatives, and false negatives.
* Kappa statistic: measures the agreement between predictions and true labels after correcting for the agreement expected by chance. It helps evaluate accuracy in situations where chance classification could be significant, especially on imbalanced datasets.
* Relationship between AUC and the ROC Curve: the ROC Curve is a plot of the true positive rate against the false positive rate across classification thresholds. AUC measures the total area under the ROC curve and provides an aggregate measure of performance over all possible classification thresholds. A higher AUC indicates better model performance in distinguishing the positive and negative classes.
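
A toy worked example tying the three concepts together, assuming the caret and pROC packages used later in this report (the labels and probabilities are made up for illustration):

# Toy example (hypothetical labels) of confusion matrix, Kappa, ROC, and AUC.
library(caret)
library(pROC)
truth <- factor(c("No", "No", "Si", "Si", "No", "Si", "No", "No"), levels = c("No", "Si"))
prob  <- c(0.1, 0.4, 0.8, 0.6, 0.3, 0.2, 0.7, 0.1)   # predicted P(Si)
pred  <- factor(ifelse(prob > 0.5, "Si", "No"), levels = c("No", "Si"))
confusionMatrix(pred, truth)   # the matrix, plus Kappa = (p_o - p_e) / (1 - p_e)
roc_obj <- roc(truth, prob)    # ROC: TPR vs FPR over all thresholds
auc(roc_obj)                   # AUC: area under that ROC curve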

df <- read.csv("/Users/kikepablos/Documents/Development/escuela/concentracion_ai/modulo_3/Act_2/bank_marketing_strategy.csv")

2) Develop an Exploratory Data Analysis (EDA) that includes the following elements:

summary(df)
##       age            job              marital           education        
##  Min.   :18.00   Length:45211       Length:45211       Length:45211      
##  1st Qu.:33.00   Class :character   Class :character   Class :character  
##  Median :39.00   Mode  :character   Mode  :character   Mode  :character  
##  Mean   :40.94                                                           
##  3rd Qu.:48.00                                                           
##  Max.   :95.00                                                           
##    default             balance         housing              loan          
##  Length:45211       Min.   : -8019   Length:45211       Length:45211      
##  Class :character   1st Qu.:    72   Class :character   Class :character  
##  Mode  :character   Median :   448   Mode  :character   Mode  :character  
##                     Mean   :  1362                                        
##                     3rd Qu.:  1428                                        
##                     Max.   :102127                                        
##    contact               day           month              duration     
##  Length:45211       Min.   : 1.00   Length:45211       Min.   :   0.0  
##  Class :character   1st Qu.: 8.00   Class :character   1st Qu.: 103.0  
##  Mode  :character   Median :16.00   Mode  :character   Median : 180.0  
##                     Mean   :15.81                      Mean   : 258.2  
##                     3rd Qu.:21.00                      3rd Qu.: 319.0  
##                     Max.   :31.00                      Max.   :4918.0  
##     campaign          pdays          previous          poutcome        
##  Min.   : 1.000   Min.   : -1.0   Min.   :  0.0000   Length:45211      
##  1st Qu.: 1.000   1st Qu.: -1.0   1st Qu.:  0.0000   Class :character  
##  Median : 2.000   Median : -1.0   Median :  0.0000   Mode  :character  
##  Mean   : 2.764   Mean   : 40.2   Mean   :  0.5803                     
##  3rd Qu.: 3.000   3rd Qu.: -1.0   3rd Qu.:  0.0000                     
##  Max.   :63.000   Max.   :871.0   Max.   :275.0000                     
##     outcome     
##  Min.   :1.000  
##  1st Qu.:1.000  
##  Median :1.000  
##  Mean   :1.117  
##  3rd Qu.:1.000  
##  Max.   :2.000
plot_missing(df) # No NA values were found

# c. Descriptive measures

str(df)
## 'data.frame':    45211 obs. of  17 variables:
##  $ age      : int  58 44 33 47 33 35 28 42 58 43 ...
##  $ job      : chr  "management" "technician" "entrepreneur" "blue-collar" ...
##  $ marital  : chr  "married" "single" "married" "married" ...
##  $ education: chr  "tertiary" "secondary" "secondary" "unknown" ...
##  $ default  : chr  "no" "no" "no" "no" ...
##  $ balance  : int  2143 29 2 1506 1 231 447 2 121 593 ...
##  $ housing  : chr  "yes" "yes" "yes" "yes" ...
##  $ loan     : chr  "no" "no" "yes" "no" ...
##  $ contact  : chr  "unknown" "unknown" "unknown" "unknown" ...
##  $ day      : int  5 5 5 5 5 5 5 5 5 5 ...
##  $ month    : chr  "may" "may" "may" "may" ...
##  $ duration : int  261 151 76 92 198 139 217 380 50 55 ...
##  $ campaign : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ pdays    : int  -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 ...
##  $ previous : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ poutcome : chr  "unknown" "unknown" "unknown" "unknown" ...
##  $ outcome  : int  1 1 1 1 1 1 1 1 1 1 ...
summary(df)
##       age            job              marital           education        
##  Min.   :18.00   Length:45211       Length:45211       Length:45211      
##  1st Qu.:33.00   Class :character   Class :character   Class :character  
##  Median :39.00   Mode  :character   Mode  :character   Mode  :character  
##  Mean   :40.94                                                           
##  3rd Qu.:48.00                                                           
##  Max.   :95.00                                                           
##    default             balance         housing              loan          
##  Length:45211       Min.   : -8019   Length:45211       Length:45211      
##  Class :character   1st Qu.:    72   Class :character   Class :character  
##  Mode  :character   Median :   448   Mode  :character   Mode  :character  
##                     Mean   :  1362                                        
##                     3rd Qu.:  1428                                        
##                     Max.   :102127                                        
##    contact               day           month              duration     
##  Length:45211       Min.   : 1.00   Length:45211       Min.   :   0.0  
##  Class :character   1st Qu.: 8.00   Class :character   1st Qu.: 103.0  
##  Mode  :character   Median :16.00   Mode  :character   Median : 180.0  
##                     Mean   :15.81                      Mean   : 258.2  
##                     3rd Qu.:21.00                      3rd Qu.: 319.0  
##                     Max.   :31.00                      Max.   :4918.0  
##     campaign          pdays          previous          poutcome        
##  Min.   : 1.000   Min.   : -1.0   Min.   :  0.0000   Length:45211      
##  1st Qu.: 1.000   1st Qu.: -1.0   1st Qu.:  0.0000   Class :character  
##  Median : 2.000   Median : -1.0   Median :  0.0000   Mode  :character  
##  Mean   : 2.764   Mean   : 40.2   Mean   :  0.5803                     
##  3rd Qu.: 3.000   3rd Qu.: -1.0   3rd Qu.:  0.0000                     
##  Max.   :63.000   Max.   :871.0   Max.   :275.0000                     
##     outcome     
##  Min.   :1.000  
##  1st Qu.:1.000  
##  Median :1.000  
##  Mean   :1.117  
##  3rd Qu.:1.000  
##  Max.   :2.000

d. Measures of dispersion

df_var <- df %>% select_if(is.numeric)
varianza_df <- var(df_var)
varianza_df
##                    age       balance          day      duration      campaign
## age       112.75810728    3161.47670  -0.80597850   -12.7117159    0.15660067
## balance  3161.47670016 9270598.95447 114.09546669 16905.7505445 -137.51323247
## day        -0.80597850     114.09547  69.26360932   -64.7403169    4.18951855
## duration  -12.71171595   16905.75054 -64.74031687 66320.5740901  -67.47179379
## campaign    0.15660067    -137.51323   4.18951855   -67.4717938    9.59773339
## pdays     -25.26054997    1047.32171 -77.53540389   -40.3490729  -27.49238644
## previous    0.03151189     116.93953  -0.99130745     0.7136535   -0.23445891
## outcome     0.08585234      51.70794  -0.07582723    32.6548655   -0.07285895
##                 pdays     previous     outcome
## age        -25.260550   0.03151189  0.08585234
## balance   1047.321715 116.93952715 51.70794264
## day        -77.535404  -0.99130745 -0.07582723
## duration   -40.349073   0.71365352 32.65486554
## campaign   -27.492386  -0.23445891 -0.07285895
## pdays    10025.765774 104.89990241  3.33474205
## previous   104.899902   5.30584065  0.06902609
## outcome      3.334742   0.06902609  0.10330164
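
Note that var() applied to a data frame returns the covariance matrix, so the per-variable variances sit on its diagonal. A short sketch to pull the variances and standard deviations out of the objects computed above:

# The diagonal of the covariance matrix holds each variable's variance;
# the standard deviations are their square roots.
diag(varianza_df)
sapply(df_var, sd)
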
  1. Identification of patterns and/or trends in the data using plots, including bar plots, line plots, pie plots, histograms, a correlation matrix, box plots, scatter plots, qq-plots, etc. Show at least 4 – 6 plots.
plot_histogram(df)

# co_df <- select(df, -default, -contact, -campaign, -pdays,-previous)
plot_correlation(df)

# DataExplorer::create_report()
# Split the dataset into training and test sets
chr_cols <- names(df)[sapply(df, is.character)]
df[chr_cols] <- lapply(df[chr_cols], as.factor)

int_cols <- names(df)[sapply(df, is.integer)]
df[int_cols] <- lapply(df[int_cols], as.numeric)

df$outcome <- ifelse(df$outcome == 2, "Si", "No")
df$outcome <- as.factor(df$outcome)
set.seed(123)
sample <- createDataPartition(y = df$outcome, p=0.7, list=F)
train  <- df[sample, ]
test   <- df[-sample, ]

a. Logistic Regression

options(scipen=999) 
multiple_logistic <- glm(outcome ~., family = "binomial", data = train)
summary(multiple_logistic)
## 
## Call:
## glm(formula = outcome ~ ., family = "binomial", data = train)
## 
## Coefficients:
##                        Estimate   Std. Error z value             Pr(>|z|)    
## (Intercept)        -2.267365560  0.217801302 -10.410 < 0.0000000000000002 ***
## age                -0.003181404  0.002616648  -1.216             0.224049    
## jobblue-collar     -0.320256903  0.086638789  -3.696             0.000219 ***
## jobentrepreneur    -0.459257820  0.149368282  -3.075             0.002107 ** 
## jobhousemaid       -0.597136662  0.164836430  -3.623             0.000292 ***
## jobmanagement      -0.245156038  0.087210989  -2.811             0.004938 ** 
## jobretired          0.276127575  0.116259917   2.375             0.017545 *  
## jobself-employed   -0.399419091  0.135610981  -2.945             0.003226 ** 
## jobservices        -0.164707835  0.098648897  -1.670             0.094991 .  
## jobstudent          0.450216538  0.128415901   3.506             0.000455 ***
## jobtechnician      -0.166960918  0.081455919  -2.050             0.040393 *  
## jobunemployed      -0.178794987  0.134906806  -1.325             0.185064    
## jobunknown         -0.253464661  0.265615557  -0.954             0.339955    
## maritalmarried     -0.174567705  0.071065791  -2.456             0.014033 *  
## maritalsingle       0.109789930  0.080663057   1.361             0.173484    
## educationsecondary  0.150791035  0.077325331   1.950             0.051166 .  
## educationtertiary   0.402913613  0.089660781   4.494 0.000006997802904691 ***
## educationunknown    0.268353881  0.121690077   2.205             0.027438 *  
## defaultyes         -0.051630877  0.199841618  -0.258             0.796130    
## balance             0.000014194  0.000006288   2.257             0.024001 *  
## housingyes         -0.755941948  0.052183243 -14.486 < 0.0000000000000002 ***
## loanyes            -0.438544040  0.070915008  -6.184 0.000000000624659114 ***
## contacttelephone   -0.057923482  0.087273935  -0.664             0.506884    
## contactunknown     -1.606632789  0.087653550 -18.329 < 0.0000000000000002 ***
## day                 0.009198765  0.002963887   3.104             0.001912 ** 
## monthaug           -0.702336821  0.093468897  -7.514 0.000000000000057294 ***
## monthdec            0.665279606  0.207549966   3.205             0.001349 ** 
## monthfeb           -0.101259116  0.105771806  -0.957             0.338398    
## monthjan           -1.261970428  0.144724558  -8.720 < 0.0000000000000002 ***
## monthjul           -0.733312021  0.091221042  -8.039 0.000000000000000907 ***
## monthjun            0.428405485  0.110900645   3.863             0.000112 ***
## monthmar            1.603075401  0.141189652  11.354 < 0.0000000000000002 ***
## monthmay           -0.350577860  0.085331500  -4.108 0.000039837070002395 ***
## monthnov           -0.783920553  0.099063795  -7.913 0.000000000000002507 ***
## monthoct            0.951451976  0.127150373   7.483 0.000000000000072707 ***
## monthsep            0.696651439  0.143836393   4.843 0.000001276616463076 ***
## duration            0.004084763  0.000076387  53.474 < 0.0000000000000002 ***
## campaign           -0.094067513  0.012201792  -7.709 0.000000000000012649 ***
## pdays              -0.000080165  0.000361504  -0.222             0.824505    
## previous            0.007343230  0.006422649   1.143             0.252900    
## poutcomeother       0.133008375  0.105844524   1.257             0.208884    
## poutcomesuccess     2.167107058  0.096875980  22.370 < 0.0000000000000002 ***
## poutcomeunknown    -0.161300405  0.108965010  -1.480             0.138794    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 22845  on 31648  degrees of freedom
## Residual deviance: 15242  on 31606  degrees of freedom
## AIC: 15328
## 
## Number of Fisher Scoring iterations: 6
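
Since these coefficients are on the log-odds scale, exponentiating them yields odds ratios, which are often easier to read (a quick sketch using the fitted model above):

# Odds ratios: exp() of the log-odds coefficients; values < 1 decrease the odds of "Si".
exp(coef(multiple_logistic))[c("housingyes", "loanyes", "poutcomesuccess")]
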
varImp(multiple_logistic)
##                       Overall
## age                 1.2158318
## jobblue-collar      3.6964610
## jobentrepreneur     3.0746676
## jobhousemaid        3.6226013
## jobmanagement       2.8110682
## jobretired          2.3750884
## jobself-employed    2.9453300
## jobservices         1.6696369
## jobstudent          3.5059252
## jobtechnician       2.0497088
## jobunemployed       1.3253222
## jobunknown          0.9542538
## maritalmarried      2.4564239
## maritalsingle       1.3610931
## educationsecondary  1.9500859
## educationtertiary   4.4937553
## educationunknown    2.2052240
## defaultyes          0.2583590
## balance             2.2571141
## housingyes         14.4862969
## loanyes             6.1840794
## contacttelephone    0.6636974
## contactunknown     18.3293522
## day                 3.1036155
## monthaug            7.5141233
## monthdec            3.2053949
## monthfeb            0.9573356
## monthjan            8.7198085
## monthjul            8.0388473
## monthjun            3.8629666
## monthmar           11.3540573
## monthmay            4.1084226
## monthnov            7.9132901
## monthoct            7.4828878
## monthsep            4.8433601
## duration           53.4743740
## campaign            7.7093193
## pdays               0.2217550
## previous            1.1433335
## poutcomeother       1.2566392
## poutcomesuccess    22.3699110
## poutcomeunknown     1.4802954

Diagnostics

# Predict with the logistic regression model on the test set
test_multiple_logistic <- predict(multiple_logistic, newdata = test, type = "response")

# mL_cm <- confusionMatrix(test_multiple_logistic, test$outcome)

multiple_logistic_data <- augment(multiple_logistic) %>%  
  mutate(index = 1:n())

# Get the 5 observations with the largest Cook's distance
top_cooksd <- multiple_logistic_data %>% 
  top_n(5, .cooksd)

# Evaluate model performance using a contingency table with proportions
model_performance <- table(test$outcome, test_multiple_logistic > 0.5) %>% 
  prop.table() %>% 
  round(3)

# List the model performance
list(multiple_logistic_performance = model_performance)
## $multiple_logistic_performance
##     
##      FALSE  TRUE
##   No 0.862 0.021
##   Si 0.076 0.041
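
The confusionMatrix() call commented out earlier fails because predict(..., type = "response") returns probabilities, not class labels. A possible fix (ml_pred_class is a placeholder name) is to threshold the probabilities into factor labels first:

# Sketch: threshold probabilities into class labels so caret::confusionMatrix()
# (which expects factors) can be applied to the logistic model as well.
ml_pred_class <- factor(ifelse(test_multiple_logistic > 0.5, "Si", "No"),
                        levels = levels(test$outcome))
mL_cm <- confusionMatrix(ml_pred_class, test$outcome, positive = "Si")
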
prediction(test_multiple_logistic, test$outcome) %>%
  performance(measure = "tpr", x.measure = "fpr") %>%
  plot()

mlr_auc <- prediction(test_multiple_logistic, test$outcome) %>%
  performance(measure = "auc") %>%
  .@y.values
mlr_auc
## [[1]]
## [1] 0.9087577
roc_log <- roc(test$outcome, predict(multiple_logistic, newdata = test))
## Setting levels: control = No, case = Si
## Setting direction: controls < cases
auc_log <- pROC::auc(roc_log)
auc_log
## Area under the curve: 0.9088

b. Decision Trees

set.seed(123)
options(scipen=999)
dt.rpart <- rpart(outcome ~ .,data = train, method = "class")


dt.rpart$variable.importance
##    duration    poutcome       month       pdays         age    campaign 
## 991.2553044 535.2120007 229.8670632   3.0851575   1.6231573   1.5362460 
##     contact         day     balance     default         job 
##   0.7708972   0.5965242   0.5693730   0.2569657   0.2569657
prp(dt.rpart, extra = 2)
training_pred <- predict(object = dt.rpart, newdata = train, type = "class")

# training confusion matrix
training_conf <- confusionMatrix(data = training_pred, reference = train$outcome, positive = "Si", mode = "everything")
training_conf
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    No    Si
##         No 27006  2189
##         Si   940  1514
##                                                
##                Accuracy : 0.9011               
##                  95% CI : (0.8978, 0.9044)     
##     No Information Rate : 0.883                
##     P-Value [Acc > NIR] : < 0.00000000000000022
##                                                
##                   Kappa : 0.4395               
##                                                
##  Mcnemar's Test P-Value : < 0.00000000000000022
##                                                
##             Sensitivity : 0.40886              
##             Specificity : 0.96636              
##          Pos Pred Value : 0.61695              
##          Neg Pred Value : 0.92502              
##               Precision : 0.61695              
##                  Recall : 0.40886              
##                      F1 : 0.49180              
##              Prevalence : 0.11700              
##          Detection Rate : 0.04784              
##    Detection Prevalence : 0.07754              
##       Balanced Accuracy : 0.68761              
##                                                
##        'Positive' Class : Si                   
## 
### TESTING PERFORMANCE
test_pred <- predict(object = dt.rpart, newdata = test, type = "class")

test_conf <- confusionMatrix(data = test_pred, reference = test$outcome, positive = "Si", mode = "everything")
test_conf
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    No    Si
##         No 11574   929
##         Si   402   657
##                                                
##                Accuracy : 0.9019               
##                  95% CI : (0.8967, 0.9068)     
##     No Information Rate : 0.8831               
##     P-Value [Acc > NIR] : 0.000000000001619    
##                                                
##                   Kappa : 0.4448               
##                                                
##  Mcnemar's Test P-Value : < 0.00000000000000022
##                                                
##             Sensitivity : 0.41425              
##             Specificity : 0.96643              
##          Pos Pred Value : 0.62040              
##          Neg Pred Value : 0.92570              
##               Precision : 0.62040              
##                  Recall : 0.41425              
##                      F1 : 0.49679              
##              Prevalence : 0.11694              
##          Detection Rate : 0.04844              
##    Detection Prevalence : 0.07809              
##       Balanced Accuracy : 0.69034              
##                                                
##        'Positive' Class : Si                   
## 
accuracy_tree <- test_conf$overall['Accuracy']
accuracy_tree
##  Accuracy 
## 0.9018581
# Get the Kappa statistic
kappa_tree <- test_conf$overall['Kappa']
kappa_tree
##     Kappa 
## 0.4447948
printcp(x = dt.rpart)
## 
## Classification tree:
## rpart(formula = outcome ~ ., data = train, method = "class")
## 
## Variables actually used in tree construction:
## [1] duration month    poutcome
## 
## Root node error: 3703/31649 = 0.117
## 
## n= 31649 
## 
##         CP nsplit rel error  xerror     xstd
## 1 0.035377      0   1.00000 1.00000 0.015442
## 2 0.026465      3   0.89387 0.89441 0.014706
## 3 0.011207      4   0.86740 0.86876 0.014518
## 4 0.010000      6   0.84499 0.86713 0.014506
min_cross_validated_error <- dt.rpart$cptable[which.min(dt.rpart$cptable[, "xerror"]), "CP"]
dt.rpart_alt <- prune(tree = dt.rpart, cp = min_cross_validated_error)
prp(dt.rpart_alt, extra = 2)

predicted_prob <- predict(dt.rpart, newdata = test, type = "prob")[,2]

roc_tree <- roc(test$outcome, predict(dt.rpart_alt, newdata = test, type = "prob")[, "Si"])
## Setting levels: control = No, case = Si
## Setting direction: controls < cases
auc_tree <- pROC::auc(roc_tree)
auc_tree
## Area under the curve: 0.8027

c. Support Vector Machine (SVM)

svm_model_1 <- svm(outcome ~ ., data = train, type = "C-classification",
                   cost = 1, kernel = "linear", scale = FALSE)
svm_pred <- predict(svm_model_1, newdata = test)  # avoid masking base::predict
confusionMatrix_svm <- confusionMatrix(svm_pred, test$outcome)
# plot(svm_model_1, train)
summary(svm_model_1)
## 
## Call:
## svm(formula = outcome ~ ., data = train, type = "C-classification", 
##     cost = 1, kernel = "linear", scale = FALSE)
## 
## 
## Parameters:
##    SVM-Type:  C-classification 
##  SVM-Kernel:  linear 
##        cost:  1 
## 
## Number of Support Vectors:  1399
## 
##  ( 511 888 )
## 
## 
## Number of Classes:  2 
## 
## Levels: 
##  No Si
accuracy_svm <- confusionMatrix_svm$overall['Accuracy']
accuracy_svm
## Accuracy 
## 0.765595
# Get the Kappa statistic
kappa_svm <- confusionMatrix_svm$overall['Kappa']
kappa_svm
##     Kappa 
## 0.1679636
roc_svm <- roc(test$outcome, as.numeric(predict(svm_model_1, newdata = test) == "Si"))
## Setting levels: control = No, case = Si
## Setting direction: controls < cases
auc_svm <- pROC::auc(roc_svm)
auc_svm
## Area under the curve: 0.6146
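
The linear SVM clearly underperforms here (AUC 0.61 on hard labels). A hedged sketch of how the cost parameter could be tuned with e1071's built-in cross-validation helper, run on a random subsample because tuning on all ~31k training rows would be slow (the subsample size of 3000 is an arbitrary choice):

# Possible cost tuning via e1071::tune.svm (illustrative; uses a subsample).
set.seed(123)
svm_tune <- tune.svm(outcome ~ ., data = train[sample(nrow(train), 3000), ],
                     kernel = "linear", cost = c(0.1, 1, 10))
summary(svm_tune)  # best cost by cross-validated error
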
train_n <- names(train)[sapply(train, is.numeric)]

train_num <- dplyr::select(train, all_of(train_n))
head(train_num)
##   age balance day duration campaign pdays previous
## 1  58    2143   5      261        1    -1        0
## 2  44      29   5      151        1    -1        0
## 3  33       2   5       76        1    -1        0
## 4  47    1506   5       92        1    -1        0
## 5  33       1   5      198        1    -1        0
## 6  35     231   5      139        1    -1        0
d. K-Means Clustering
set.seed(123)
scl_df <- scale(train_num)
k4 <- kmeans(scl_df, centers = 4, nstart = 20)
fviz_cluster(k4, data = scl_df)

## remove the outlier

# scl_df <- scl_df[-c(29183), ]

c_29183 <- scl_df[29183, ]
df_sf <- scl_df[-29183, ]
set.seed(123)
wss <- function(k) {
  kmeans(df_sf, k, nstart = 10 )$tot.withinss    
}

# Compute and plot wss for k = 1 to k = 15
k.values <- 1:15

# extract wss for k = 1 to 15 clusters
wss_values <- map_dbl(k.values, wss)

plot(k.values, wss_values,
     type="b", pch = 19, frame = FALSE, 
     xlab="Number of clusters K",
     ylab="Total within-clusters sum of squares")
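
Since factoextra is already in use (fviz_cluster), the same elbow curve can likely be produced in one line; a convenience alternative, not a replacement for the manual loop above:

# Equivalent elbow plot via factoextra, using the same wss criterion.
fviz_nbclust(df_sf, kmeans, method = "wss", k.max = 15)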

k7 <- kmeans(df_sf, centers = 7, nstart = 20)
fviz_cluster(k7, data = df_sf)

df_sf <- as.data.frame(df_sf)
df_sf %>%
  mutate(Cluster = k7$cluster) %>%
  group_by(Cluster) %>%
  summarise_all("mean")
## # A tibble: 7 × 8
##   Cluster     age balance      day duration campaign   pdays previous
##     <int>   <dbl>   <dbl>    <dbl>    <dbl>    <dbl>   <dbl>    <dbl>
## 1       1 -0.468  -0.150   0.949    -0.226   -0.0674 -0.365   -0.203 
## 2       2  1.29   -0.0236 -0.0757   -0.212   -0.112  -0.309   -0.162 
## 3       3 -0.535  -0.185  -0.907    -0.187   -0.200  -0.368   -0.197 
## 4       4 -0.0233 -0.149   0.716    -0.461    4.14   -0.404   -0.229 
## 5       5 -0.150  -0.0879 -0.260    -0.102   -0.204   2.28     1.24  
## 6       6  0.248   4.45    0.0410   -0.0532  -0.0817 -0.0702  -0.0171
## 7       7 -0.0505 -0.0415  0.00165   2.74    -0.106  -0.233   -0.131
# AUC cannot be computed for K-Means because it is not a supervised classification model
e. KNN
k <- 4
model_knn <- train(outcome ~ job + education + housing + pdays + duration + loan +
                     campaign + poutcome + month + balance,
                   data = train, method = "knn",
                   preProcess = c("center", "scale"),
                   trControl = trainControl(method = "cv", number = 4),
                   tuneGrid = expand.grid(k = k))

# Model summary
print(model_knn)
## k-Nearest Neighbors 
## 
## 31649 samples
##    10 predictor
##     2 classes: 'No', 'Si' 
## 
## Pre-processing: centered (34), scaled (34) 
## Resampling: Cross-Validated (4 fold) 
## Summary of sample sizes: 23737, 23737, 23736, 23737 
## Resampling results:
## 
##   Accuracy   Kappa    
##   0.8844513  0.3491849
## 
## Tuning parameter 'k' was held constant at a value of 4
# Make predictions on the test set
knn_predictions_test <- predict(model_knn, newdata = test)
knn_predictions_train <- predict(model_knn, newdata = train)

# Confusion Matrix
confusionMatrix_knn <- confusionMatrix(knn_predictions_test, test$outcome)
confusionMatrix(knn_predictions_train, train$outcome)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    No    Si
##         No 27165  1753
##         Si   781  1950
##                                                
##                Accuracy : 0.9199               
##                  95% CI : (0.9169, 0.9229)     
##     No Information Rate : 0.883                
##     P-Value [Acc > NIR] : < 0.00000000000000022
##                                                
##                   Kappa : 0.5627               
##                                                
##  Mcnemar's Test P-Value : < 0.00000000000000022
##                                                
##             Sensitivity : 0.9721               
##             Specificity : 0.5266               
##          Pos Pred Value : 0.9394               
##          Neg Pred Value : 0.7140               
##              Prevalence : 0.8830               
##          Detection Rate : 0.8583               
##    Detection Prevalence : 0.9137               
##       Balanced Accuracy : 0.7493               
##                                                
##        'Positive' Class : No                   
## 
accuracy_knn <- confusionMatrix_knn$overall['Accuracy']
accuracy_knn
## Accuracy 
## 0.884899
# Get the Kappa statistic
kappa_knn <- confusionMatrix_knn$overall['Kappa']
kappa_knn
##     Kappa 
## 0.3484375
roc_knn <- roc(test$outcome, as.numeric(knn_predictions_test == "Si"))
## Setting levels: control = No, case = Si
## Setting direction: controls < cases
auc_knn <- pROC::auc(roc_knn)
auc_knn
## Area under the curve: 0.649
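
Computing the ROC from hard class labels, as above, gives only a single operating point on the curve. caret can return class probabilities for knn, which should yield a proper curve (a sketch; knn_prob and roc_knn_prob are placeholder names):

# Probability-based ROC/AUC for KNN (sketch): predict class probabilities instead.
knn_prob <- predict(model_knn, newdata = test, type = "prob")[, "Si"]
roc_knn_prob <- roc(test$outcome, knn_prob)
pROC::auc(roc_knn_prob)
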
f. Naïve Bayes
table(train$outcome) %>% prop.table()
## 
##        No        Si 
## 0.8829979 0.1170021
table(test$outcome) %>% prop.table()
## 
##        No        Si 
## 0.8830556 0.1169444
naive_bayes <- naiveBayes(outcome ~ ., data = train)

# confusionMatrix(data = predictions, reference = truth)
predicted <- predict(naive_bayes, as.data.frame(test))
confusionMatrix_nb <- confusionMatrix(predicted, test$outcome)
accuracy_nb <- confusionMatrix_nb$overall['Accuracy']
accuracy_nb
##  Accuracy 
## 0.8811385
# Get the Kappa statistic
kappa_nb <- confusionMatrix_nb$overall['Kappa']
kappa_nb
##     Kappa 
## 0.4463092
predicted_prob <- predict(naive_bayes, as.data.frame(test), type = "raw")
predicted_prob_pos <- predicted_prob[,2]

# Compute the ROC curve and the AUC
roc_nb <- roc(test$outcome, predicted_prob_pos)
## Setting levels: control = No, case = Si
## Setting direction: controls < cases
auc_nb <- pROC::auc(roc_nb)
auc_nb
## Area under the curve: 0.8601
naive_bayes_b <- NaiveBayes(outcome ~ ., data = train)  # klaR's NaiveBayes, which supports plot()
plot(naive_bayes_b)

g. Random Forest
rf_model <- randomForest(outcome ~ ., data = train)

test$outcome <- as.factor(test$outcome)

# Model prediction and evaluation
rf_pred <- predict(rf_model, test)  # avoid masking ROCR::prediction used above
confusionMatrix_rf <- confusionMatrix(rf_pred, test$outcome)
accuracy_rf <- confusionMatrix_rf$overall['Accuracy']
accuracy_rf
##  Accuracy 
## 0.9114437
# Get the Kappa statistic
kappa_rf <- confusionMatrix_rf$overall['Kappa']
kappa_rf
##     Kappa 
## 0.5172272
roc_rf <- roc(test$outcome, predict(rf_model, newdata = test, type = "prob")[, "Si"])
## Setting levels: control = No, case = Si
## Setting direction: controls < cases
auc_rf <- pROC::auc(roc_rf)
auc_rf
## Area under the curve: 0.9354
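
For parity with the variable-importance check done for the logistic model, the forest's own importance measure could be inspected as well (a sketch using randomForest's built-ins):

# Mean decrease in Gini impurity per predictor, plus the standard importance plot.
importance(rf_model)
varImpPlot(rf_model)
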
model_metrics <- data.frame(
  Model = c("Logistic Regression", "Decision Trees", "Support Vector Machine (SVM)",
            "K-Means Clustering", "KNN", "Naive Bayes", "Random Forest"),
  accuracy = c(NA, accuracy_tree, accuracy_svm, NA, accuracy_knn, accuracy_nb, accuracy_rf),
  kappa    = c(NA, kappa_tree, kappa_svm, NA, kappa_knn, kappa_nb, kappa_rf),
  # as.numeric() keeps each AUC a plain number; passing the raw list returned by
  # ROCR (mlr_auc) would turn auc_m into a list column and garble the table.
  auc_m    = c(as.numeric(auc_log), as.numeric(auc_tree), as.numeric(auc_svm), NA,
               as.numeric(auc_knn), as.numeric(auc_nb), as.numeric(auc_rf))
)
model_metrics
##                          Model  accuracy     kappa     auc_m
## 1          Logistic Regression        NA        NA 0.9087577
## 2               Decision Trees 0.9018581 0.4447948 0.8027379
## 3 Support Vector Machine (SVM) 0.7655950 0.1679636 0.6145545
## 4           K-Means Clustering        NA        NA        NA
## 5                          KNN 0.8848990 0.3484375 0.6490118
## 6                  Naive Bayes 0.8811385 0.4463092 0.8600622
## 7                Random Forest 0.9114437 0.5172272 0.9354014
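
To compare the supervised models visually, the pROC objects computed above can be overlaid on a single plot (a sketch using the roc_* objects already in the workspace):

# Overlay the ROC curves of the supervised models for a visual comparison.
plot(roc_log, col = "black", legacy.axes = TRUE)
plot(roc_tree, col = "blue",      add = TRUE)
plot(roc_svm,  col = "red",       add = TRUE)
plot(roc_knn,  col = "orange",    add = TRUE)
plot(roc_nb,   col = "purple",    add = TRUE)
plot(roc_rf,   col = "darkgreen", add = TRUE)
legend("bottomright",
       legend = c("Logistic", "Tree", "SVM", "KNN", "Naive Bayes", "Random Forest"),
       col = c("black", "blue", "red", "orange", "purple", "darkgreen"), lwd = 2)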

Best model:

After analyzing all the models' performance metrics, I concluded that Random Forest is the most suitable model for this task. It reaches an accuracy of 91.14%, which indicates a high success rate in correctly identifying the outcomes. Its Kappa value of 0.517, while not very high, is notably significant: it indicates that the model's predictions are considerably better than random guessing, showing agreement beyond what would be expected by chance. Finally, it has an AUC of 0.935, which means it is very good at distinguishing between classes and can effectively differentiate positive from negative cases.