Objetivo

Realizar predicciones con modelos basados en algoritmos de clasificación evaluando la exactitud de cada modelo.

Descripción

Se cargan librerías y se descargan los datos: https://raw.githubusercontent.com/rpizarrog/Machine-Learning-con-R/main/datos/heart_2020_cleaned.csv

Se buscan datos de entrenamiento y validación al 80% y 20% cada uno.

Se construyen los modelos de:

  • Regresión Logística binaria

  • Árbol de Clasificacón tipo class

  • SVM Lineal

  • SVM Polinomial

  • SVM Radial

Los modelo se aceptan si tienen un valor de exactitud por encima del 70%..

Desarrollo

Cargar librerías

library(readr)
library(dplyr)
library(caret)
library(rpart)
library(rpart.plot)
library(knitr)
library(e1071)        # Vectores de Soporte SVM
library(rpart)        # Arboles de clasificación

Cargar datos

Cargar datos de manera local.

# datos <- read.csv("https://raw.githubusercontent.com/rpizarrog/Machine-Learning-con-R/main/datos/heart_2020_cleaned.csv")
datos <- read.csv("../datos/heart_2020_cleaned.csv", encoding = "UTF-8", stringsAsFactors = TRUE)

Explorar datos

str(datos)
## 'data.frame':    319795 obs. of  18 variables:
##  $ HeartDisease    : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 2 1 1 1 1 ...
##  $ BMI             : num  16.6 20.3 26.6 24.2 23.7 ...
##  $ Smoking         : Factor w/ 2 levels "No","Yes": 2 1 2 1 1 2 1 2 1 1 ...
##  $ AlcoholDrinking : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Stroke          : Factor w/ 2 levels "No","Yes": 1 2 1 1 1 1 1 1 1 1 ...
##  $ PhysicalHealth  : num  3 0 20 0 28 6 15 5 0 0 ...
##  $ MentalHealth    : num  30 0 30 0 0 0 0 0 0 0 ...
##  $ DiffWalking     : Factor w/ 2 levels "No","Yes": 1 1 1 1 2 2 1 2 1 2 ...
##  $ Sex             : Factor w/ 2 levels "Female","Male": 1 1 2 1 1 1 1 1 1 2 ...
##  $ AgeCategory     : Factor w/ 13 levels "18-24","25-29",..: 8 13 10 12 5 12 11 13 13 10 ...
##  $ Race            : Factor w/ 6 levels "American Indian/Alaskan Native",..: 6 6 6 6 6 3 6 6 6 6 ...
##  $ Diabetic        : Factor w/ 4 levels "No","No, borderline diabetes",..: 3 1 3 1 1 1 1 3 2 1 ...
##  $ PhysicalActivity: Factor w/ 2 levels "No","Yes": 2 2 2 1 2 1 2 1 1 2 ...
##  $ GenHealth       : Factor w/ 5 levels "Excellent","Fair",..: 5 5 2 3 5 2 2 3 2 3 ...
##  $ SleepTime       : num  5 7 8 6 8 12 4 9 5 10 ...
##  $ Asthma          : Factor w/ 2 levels "No","Yes": 2 1 2 1 1 1 2 2 1 1 ...
##  $ KidneyDisease   : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 1 2 1 ...
##  $ SkinCancer      : Factor w/ 2 levels "No","Yes": 2 1 1 2 1 1 2 1 1 1 ...
summary(datos)
##  HeartDisease      BMI        Smoking      AlcoholDrinking Stroke      
##  No :292422   Min.   :12.02   No :187887   No :298018      No :307726  
##  Yes: 27373   1st Qu.:24.03   Yes:131908   Yes: 21777      Yes: 12069  
##               Median :27.34                                            
##               Mean   :28.33                                            
##               3rd Qu.:31.42                                            
##               Max.   :94.85                                            
##                                                                        
##  PhysicalHealth    MentalHealth    DiffWalking      Sex        
##  Min.   : 0.000   Min.   : 0.000   No :275385   Female:167805  
##  1st Qu.: 0.000   1st Qu.: 0.000   Yes: 44410   Male  :151990  
##  Median : 0.000   Median : 0.000                               
##  Mean   : 3.372   Mean   : 3.898                               
##  3rd Qu.: 2.000   3rd Qu.: 3.000                               
##  Max.   :30.000   Max.   :30.000                               
##                                                                
##       AgeCategory                                 Race       
##  65-69      : 34151   American Indian/Alaskan Native:  5202  
##  60-64      : 33686   Asian                         :  8068  
##  70-74      : 31065   Black                         : 22939  
##  55-59      : 29757   Hispanic                      : 27446  
##  50-54      : 25382   Other                         : 10928  
##  80 or older: 24153   White                         :245212  
##  (Other)    :141601                                          
##                     Diabetic      PhysicalActivity     GenHealth     
##  No                     :269653   No : 71838       Excellent: 66842  
##  No, borderline diabetes:  6781   Yes:247957       Fair     : 34677  
##  Yes                    : 40802                    Good     : 93129  
##  Yes (during pregnancy) :  2559                    Poor     : 11289  
##                                                    Very good:113858  
##                                                                      
##                                                                      
##    SleepTime      Asthma       KidneyDisease SkinCancer  
##  Min.   : 1.000   No :276923   No :308016    No :289976  
##  1st Qu.: 6.000   Yes: 42872   Yes: 11779    Yes: 29819  
##  Median : 7.000                                          
##  Mean   : 7.097                                          
##  3rd Qu.: 8.000                                          
##  Max.   :24.000                                          
## 

Limpiar datos

Crear variable llamada HeartDisease01 que se utilizará en el modelo de regresión Logística tendrá valores o de para no daño y 1 para daño.

datos = mutate (datos,HeartDisease_01=if_else(HeartDisease=='Yes',1,0))

Las variables de interés

Todas las variables son de entrada o variables independientes

La variable de interés como dependiente o variable de salida es la de daño al corazón (HeartDisease).

Datos de entrenamiento y validación

80% y 20%

set.seed(2022)
entrena <- createDataPartition(y = datos$HeartDisease, 
                               p = 0.8, 
                               list = FALSE, 
                               times = 1)

# Datos entrenamiento
datos.entrenamiento <- datos[entrena, ]  # [renglones, columna]

# Datos validación
datos.validacion <- datos[-entrena, ]

Modelos a construir

Regresión logística

modelo.rl = glm(data = datos.entrenamiento,formula =    HeartDisease_01 ~ BMI+Smoking+AlcoholDrinking+Stroke+PhysicalHealth+MentalHealth+DiffWalking+Sex
+AgeCategory+Race+Diabetic+PhysicalActivity+GenHealth+SleepTime+Asthma+KidneyDisease+SkinCancer, family = "binomial")

summary(modelo.rl)
## 
## Call:
## glm(formula = HeartDisease_01 ~ BMI + Smoking + AlcoholDrinking + 
##     Stroke + PhysicalHealth + MentalHealth + DiffWalking + Sex + 
##     AgeCategory + Race + Diabetic + PhysicalActivity + GenHealth + 
##     SleepTime + Asthma + KidneyDisease + SkinCancer, family = "binomial", 
##     data = datos.entrenamiento)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.1314  -0.4107  -0.2425  -0.1284   3.6298  
## 
## Coefficients:
##                                   Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                     -6.3411940  0.1303031 -48.665  < 2e-16 ***
## BMI                              0.0087849  0.0012805   6.860 6.87e-12 ***
## SmokingYes                       0.3585650  0.0160884  22.287  < 2e-16 ***
## AlcoholDrinkingYes              -0.2461803  0.0375446  -6.557 5.49e-11 ***
## StrokeYes                        1.0402871  0.0253478  41.041  < 2e-16 ***
## PhysicalHealth                   0.0028693  0.0009676   2.965  0.00302 ** 
## MentalHealth                     0.0051477  0.0009867   5.217 1.82e-07 ***
## DiffWalkingYes                   0.2225086  0.0202815  10.971  < 2e-16 ***
## SexMale                          0.7081741  0.0163019  43.441  < 2e-16 ***
## AgeCategory25-29                 0.1993694  0.1401300   1.423  0.15481    
## AgeCategory30-34                 0.5290687  0.1266554   4.177 2.95e-05 ***
## AgeCategory35-39                 0.6670041  0.1210068   5.512 3.55e-08 ***
## AgeCategory40-44                 1.0134329  0.1149100   8.819  < 2e-16 ***
## AgeCategory45-49                 1.3576044  0.1106412  12.270  < 2e-16 ***
## AgeCategory50-54                 1.7724467  0.1069570  16.572  < 2e-16 ***
## AgeCategory55-59                 2.0105269  0.1053642  19.082  < 2e-16 ***
## AgeCategory60-64                 2.2670547  0.1044561  21.703  < 2e-16 ***
## AgeCategory65-69                 2.5199492  0.1041334  24.199  < 2e-16 ***
## AgeCategory70-74                 2.8148759  0.1040561  27.052  < 2e-16 ***
## AgeCategory75-79                 3.0241344  0.1046196  28.906  < 2e-16 ***
## AgeCategory80 or older           3.2813849  0.1043562  31.444  < 2e-16 ***
## RaceAsian                       -0.5548084  0.0941154  -5.895 3.75e-09 ***
## RaceBlack                       -0.3819882  0.0639401  -5.974 2.31e-09 ***
## RaceHispanic                    -0.2801229  0.0652138  -4.295 1.74e-05 ***
## RaceOther                       -0.0820434  0.0710104  -1.155  0.24794    
## RaceWhite                       -0.1058944  0.0569547  -1.859  0.06299 .  
## DiabeticNo, borderline diabetes  0.1308946  0.0466680   2.805  0.00503 ** 
## DiabeticYes                      0.4644979  0.0187271  24.803  < 2e-16 ***
## DiabeticYes (during pregnancy)   0.1426133  0.1181732   1.207  0.22750    
## PhysicalActivityYes              0.0087935  0.0179610   0.490  0.62442    
## GenHealthFair                    1.5211294  0.0368863  41.238  < 2e-16 ***
## GenHealthGood                    1.0642420  0.0332160  32.040  < 2e-16 ***
## GenHealthPoor                    1.9221866  0.0458775  41.898  < 2e-16 ***
## GenHealthVery good               0.4752773  0.0341450  13.919  < 2e-16 ***
## SleepTime                       -0.0238408  0.0048407  -4.925 8.43e-07 ***
## AsthmaYes                        0.2862537  0.0214746  13.330  < 2e-16 ***
## KidneyDiseaseYes                 0.5620791  0.0272492  20.627  < 2e-16 ***
## SkinCancerYes                    0.1130570  0.0217684   5.194 2.06e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 149527  on 255836  degrees of freedom
## Residual deviance: 115839  on 255799  degrees of freedom
## AIC: 115915
## 
## Number of Fisher Scoring iterations: 7

Generar predicciones del modelo …

Predicciones con datos de validación

prediciones_rl = predict(object = modelo.rl,newdata = datos.validacion, se.fit = TRUE)
# convertir a valores probabilisticos 
# Mediante la función logit se transforman los a probabilidades.
prediciones_rl_prob <- exp(prediciones_rl$fit) / (1 + exp(prediciones_rl$fit))


t_comparativa = data.frame(datos.validacion,prediciones_rl_prob)

t_comparativa <- t_comparativa %>%
  mutate(heartDiseasePred = if_else(prediciones_rl_prob < 0.50, 0, 1))

top20 = head(t_comparativa,20)
kable(top20,caption = 'Primeros 20 registros')
Primeros 20 registros
HeartDisease BMI Smoking AlcoholDrinking Stroke PhysicalHealth MentalHealth DiffWalking Sex AgeCategory Race Diabetic PhysicalActivity GenHealth SleepTime Asthma KidneyDisease SkinCancer HeartDisease_01 prediciones_rl_prob heartDiseasePred
19 No 29.86 Yes No No 0 0 Yes Female 75-79 Black Yes No Fair 5 No Yes No 0 0.3948057 0
25 No 25.75 No No No 0 0 No Female 80 or older White No Yes Very good 6 No No Yes 0 0.0768873 0
27 No 34.34 Yes No No 21 8 Yes Female 65-69 White No Yes Fair 9 No No No 0 0.1641593 0
30 No 36.58 No No No 0 0 No Female 60-64 White Yes No Good 5 No No Yes 0 0.0881778 0
43 Yes 25.06 No No No 0 0 Yes Female 80 or older White Yes No Good 7 No No Yes 1 0.2230232 0
47 No 33.23 No No No 0 0 No Male 65-69 White Yes Yes Very good 8 No No No 0 0.1025505 0
48 No 25.11 No No No 5 5 No Female 65-69 Black No Yes Good 7 No No No 0 0.0458078 0
55 No 32.10 No No No 14 0 No Male 65-69 White Yes Yes Very good 9 No No No 0 0.1031412 0
60 No 27.20 Yes No Yes 0 0 No Male 80 or older White No No Very good 8 No No Yes 0 0.3960094 0
61 No 28.94 Yes No No 0 0 No Female 70-74 White Yes Yes Good 5 Yes No No 0 0.2116797 0
62 No 21.03 No No No 1 0 No Female 80 or older White No Yes Excellent 8 No No No 0 0.0406982 0
64 No 31.46 Yes No No 0 0 No Male 75-79 White No Yes Very good 8 No No No 0 0.1435026 0
72 No 27.76 Yes No No 15 0 Yes Female 80 or older White No No Good 8 Yes No No 0 0.2426964 0
74 No 30.23 No No No 0 5 No Female 65-69 White No Yes Good 6 No No No 0 0.0626294 0
79 Yes 28.29 Yes No No 30 30 No Female 70-74 White Yes Yes Poor 9 No Yes No 1 0.4895958 0
89 No 32.81 Yes No No 0 0 Yes Female 70-74 White Yes Yes Good 5 No No No 0 0.2067582 0
91 No 44.29 No No No 30 10 Yes Female 70-74 White No No Fair 7 No No Yes 0 0.1953263 0
93 No 21.80 Yes No No 0 0 No Female 75-79 White No Yes Very good 8 No No Yes 0 0.0782419 0
99 No 24.37 No No No 0 0 No Female 55-59 White No Yes Very good 7 Yes No No 0 0.0261086 0
111 No 26.63 No No No 0 0 No Female 75-79 Black Yes Yes Good 8 No No No 0 0.1073309 0

Matriz de confusion

factorizar las columnas “prediciones_rl_prob” & “heasrtDiseasePred” de la tabla comparativa

Factorizar en R ==> categorizar con la funcion “as.factor” o “factor”

t_comparativa$HeartDisease_01 = as.factor(t_comparativa$HeartDisease_01)
t_comparativa$heartDiseasePred = as.factor(t_comparativa$heartDiseasePred)

Creacion de la matriz de confusion

matrixConfusion <- confusionMatrix(t_comparativa$HeartDisease_01,t_comparativa$heartDiseasePred)
matrixConfusion
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction     0     1
##          0 57987   497
##          1  4921   553
##                                           
##                Accuracy : 0.9153          
##                  95% CI : (0.9131, 0.9174)
##     No Information Rate : 0.9836          
##     P-Value [Acc > NIR] : 1               
##                                           
##                   Kappa : 0.146           
##                                           
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.9218          
##             Specificity : 0.5267          
##          Pos Pred Value : 0.9915          
##          Neg Pred Value : 0.1010          
##              Prevalence : 0.9836          
##          Detection Rate : 0.9066          
##    Detection Prevalence : 0.9144          
##       Balanced Accuracy : 0.7242          
##                                           
##        'Positive' Class : 0               
## 

Arbol de clasificación

Pendiente

Máquinas de Soporte Vectorial (SVM) lineal

Máquinas de Soporte Vectorial (SVM) Polinomial

Máquinas de Soporte Vectorial (SVM) Radial

Generar predicciones del modelo …

Predicciones con datos de validación

Evaluar predicciones del modelo …

Predicciones con datos nuevos

BMI <- 20
Smoking <- 'Yes'
AlcoholDrinking = 'Yes'
Stroke <- 'No'
PhysicalHealth <- 13
MentalHealth = 22
DiffWalking = 'Yes'
Sex = 'Male'
AgeCategory = '60-64'
Race = 'Hispanic'
Diabetic <- 'Yes'
PhysicalActivity = "No"
GenHealth = "Fair"
SleepTime = 8
Asthma = "No"
KidneyDisease = "Yes"
SkinCancer = 'No'

persona <- data.frame(BMI,Smoking, AlcoholDrinking, Stroke, PhysicalHealth, MentalHealth, DiffWalking, Sex, AgeCategory, Race, Diabetic, PhysicalActivity, GenHealth, SleepTime, Asthma, KidneyDisease, SkinCancer)
persona
##   BMI Smoking AlcoholDrinking Stroke PhysicalHealth MentalHealth DiffWalking
## 1  20     Yes             Yes     No             13           22         Yes
##    Sex AgeCategory     Race Diabetic PhysicalActivity GenHealth SleepTime
## 1 Male       60-64 Hispanic      Yes               No      Fair         8
##   Asthma KidneyDisease SkinCancer
## 1     No           Yes         No

Interpretación

Bibliografría