Realizar predicciones con modelos basados en algoritmos de clasificación evaluando la exactitud de cada modelo.
Se cargan librerías y se descargan los datos: https://raw.githubusercontent.com/rpizarrog/Machine-Learning-con-R/main/datos/heart_2020_cleaned.csv
Se buscan datos de entrenamiento y validación al 80% y 20% cada uno.
Se construyen los modelos de:
Regresión Logística binaria
Árbol de Clasificacón tipo class
SVM Lineal
SVM Polinomial
SVM Radial
Los modelo se aceptan si tienen un valor de exactitud por encima del 70%..
library(readr)
library(dplyr)
library(caret)
library(rpart)
library(rpart.plot)
library(knitr)
library(e1071) # Vectores de Soporte SVM
library(rpart) # Arboles de clasificación
Cargar datos de manera local.
# datos <- read.csv("https://raw.githubusercontent.com/rpizarrog/Machine-Learning-con-R/main/datos/heart_2020_cleaned.csv")
datos <- read.csv("../datos/heart_2020_cleaned.csv", encoding = "UTF-8", stringsAsFactors = TRUE)
str(datos)
## 'data.frame': 319795 obs. of 18 variables:
## $ HeartDisease : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 2 1 1 1 1 ...
## $ BMI : num 16.6 20.3 26.6 24.2 23.7 ...
## $ Smoking : Factor w/ 2 levels "No","Yes": 2 1 2 1 1 2 1 2 1 1 ...
## $ AlcoholDrinking : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 1 1 1 ...
## $ Stroke : Factor w/ 2 levels "No","Yes": 1 2 1 1 1 1 1 1 1 1 ...
## $ PhysicalHealth : num 3 0 20 0 28 6 15 5 0 0 ...
## $ MentalHealth : num 30 0 30 0 0 0 0 0 0 0 ...
## $ DiffWalking : Factor w/ 2 levels "No","Yes": 1 1 1 1 2 2 1 2 1 2 ...
## $ Sex : Factor w/ 2 levels "Female","Male": 1 1 2 1 1 1 1 1 1 2 ...
## $ AgeCategory : Factor w/ 13 levels "18-24","25-29",..: 8 13 10 12 5 12 11 13 13 10 ...
## $ Race : Factor w/ 6 levels "American Indian/Alaskan Native",..: 6 6 6 6 6 3 6 6 6 6 ...
## $ Diabetic : Factor w/ 4 levels "No","No, borderline diabetes",..: 3 1 3 1 1 1 1 3 2 1 ...
## $ PhysicalActivity: Factor w/ 2 levels "No","Yes": 2 2 2 1 2 1 2 1 1 2 ...
## $ GenHealth : Factor w/ 5 levels "Excellent","Fair",..: 5 5 2 3 5 2 2 3 2 3 ...
## $ SleepTime : num 5 7 8 6 8 12 4 9 5 10 ...
## $ Asthma : Factor w/ 2 levels "No","Yes": 2 1 2 1 1 1 2 2 1 1 ...
## $ KidneyDisease : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 1 2 1 ...
## $ SkinCancer : Factor w/ 2 levels "No","Yes": 2 1 1 2 1 1 2 1 1 1 ...
summary(datos)
## HeartDisease BMI Smoking AlcoholDrinking Stroke
## No :292422 Min. :12.02 No :187887 No :298018 No :307726
## Yes: 27373 1st Qu.:24.03 Yes:131908 Yes: 21777 Yes: 12069
## Median :27.34
## Mean :28.33
## 3rd Qu.:31.42
## Max. :94.85
##
## PhysicalHealth MentalHealth DiffWalking Sex
## Min. : 0.000 Min. : 0.000 No :275385 Female:167805
## 1st Qu.: 0.000 1st Qu.: 0.000 Yes: 44410 Male :151990
## Median : 0.000 Median : 0.000
## Mean : 3.372 Mean : 3.898
## 3rd Qu.: 2.000 3rd Qu.: 3.000
## Max. :30.000 Max. :30.000
##
## AgeCategory Race
## 65-69 : 34151 American Indian/Alaskan Native: 5202
## 60-64 : 33686 Asian : 8068
## 70-74 : 31065 Black : 22939
## 55-59 : 29757 Hispanic : 27446
## 50-54 : 25382 Other : 10928
## 80 or older: 24153 White :245212
## (Other) :141601
## Diabetic PhysicalActivity GenHealth
## No :269653 No : 71838 Excellent: 66842
## No, borderline diabetes: 6781 Yes:247957 Fair : 34677
## Yes : 40802 Good : 93129
## Yes (during pregnancy) : 2559 Poor : 11289
## Very good:113858
##
##
## SleepTime Asthma KidneyDisease SkinCancer
## Min. : 1.000 No :276923 No :308016 No :289976
## 1st Qu.: 6.000 Yes: 42872 Yes: 11779 Yes: 29819
## Median : 7.000
## Mean : 7.097
## 3rd Qu.: 8.000
## Max. :24.000
##
Crear variable llamada HeartDisease01 que se utilizará en el modelo de regresión Logística tendrá valores o de para no daño y 1 para daño.
datos = mutate (datos,HeartDisease_01=if_else(HeartDisease=='Yes',1,0))
Todas las variables son de entrada o variables independientes
La variable de interés como dependiente o variable de salida es la de daño al corazón (HeartDisease).
80% y 20%
set.seed(2022)
entrena <- createDataPartition(y = datos$HeartDisease,
p = 0.8,
list = FALSE,
times = 1)
# Datos entrenamiento
datos.entrenamiento <- datos[entrena, ] # [renglones, columna]
# Datos validación
datos.validacion <- datos[-entrena, ]
modelo.rl = glm(data = datos.entrenamiento,formula = HeartDisease_01 ~ BMI+Smoking+AlcoholDrinking+Stroke+PhysicalHealth+MentalHealth+DiffWalking+Sex
+AgeCategory+Race+Diabetic+PhysicalActivity+GenHealth+SleepTime+Asthma+KidneyDisease+SkinCancer, family = "binomial")
summary(modelo.rl)
##
## Call:
## glm(formula = HeartDisease_01 ~ BMI + Smoking + AlcoholDrinking +
## Stroke + PhysicalHealth + MentalHealth + DiffWalking + Sex +
## AgeCategory + Race + Diabetic + PhysicalActivity + GenHealth +
## SleepTime + Asthma + KidneyDisease + SkinCancer, family = "binomial",
## data = datos.entrenamiento)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.1314 -0.4107 -0.2425 -0.1284 3.6298
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -6.3411940 0.1303031 -48.665 < 2e-16 ***
## BMI 0.0087849 0.0012805 6.860 6.87e-12 ***
## SmokingYes 0.3585650 0.0160884 22.287 < 2e-16 ***
## AlcoholDrinkingYes -0.2461803 0.0375446 -6.557 5.49e-11 ***
## StrokeYes 1.0402871 0.0253478 41.041 < 2e-16 ***
## PhysicalHealth 0.0028693 0.0009676 2.965 0.00302 **
## MentalHealth 0.0051477 0.0009867 5.217 1.82e-07 ***
## DiffWalkingYes 0.2225086 0.0202815 10.971 < 2e-16 ***
## SexMale 0.7081741 0.0163019 43.441 < 2e-16 ***
## AgeCategory25-29 0.1993694 0.1401300 1.423 0.15481
## AgeCategory30-34 0.5290687 0.1266554 4.177 2.95e-05 ***
## AgeCategory35-39 0.6670041 0.1210068 5.512 3.55e-08 ***
## AgeCategory40-44 1.0134329 0.1149100 8.819 < 2e-16 ***
## AgeCategory45-49 1.3576044 0.1106412 12.270 < 2e-16 ***
## AgeCategory50-54 1.7724467 0.1069570 16.572 < 2e-16 ***
## AgeCategory55-59 2.0105269 0.1053642 19.082 < 2e-16 ***
## AgeCategory60-64 2.2670547 0.1044561 21.703 < 2e-16 ***
## AgeCategory65-69 2.5199492 0.1041334 24.199 < 2e-16 ***
## AgeCategory70-74 2.8148759 0.1040561 27.052 < 2e-16 ***
## AgeCategory75-79 3.0241344 0.1046196 28.906 < 2e-16 ***
## AgeCategory80 or older 3.2813849 0.1043562 31.444 < 2e-16 ***
## RaceAsian -0.5548084 0.0941154 -5.895 3.75e-09 ***
## RaceBlack -0.3819882 0.0639401 -5.974 2.31e-09 ***
## RaceHispanic -0.2801229 0.0652138 -4.295 1.74e-05 ***
## RaceOther -0.0820434 0.0710104 -1.155 0.24794
## RaceWhite -0.1058944 0.0569547 -1.859 0.06299 .
## DiabeticNo, borderline diabetes 0.1308946 0.0466680 2.805 0.00503 **
## DiabeticYes 0.4644979 0.0187271 24.803 < 2e-16 ***
## DiabeticYes (during pregnancy) 0.1426133 0.1181732 1.207 0.22750
## PhysicalActivityYes 0.0087935 0.0179610 0.490 0.62442
## GenHealthFair 1.5211294 0.0368863 41.238 < 2e-16 ***
## GenHealthGood 1.0642420 0.0332160 32.040 < 2e-16 ***
## GenHealthPoor 1.9221866 0.0458775 41.898 < 2e-16 ***
## GenHealthVery good 0.4752773 0.0341450 13.919 < 2e-16 ***
## SleepTime -0.0238408 0.0048407 -4.925 8.43e-07 ***
## AsthmaYes 0.2862537 0.0214746 13.330 < 2e-16 ***
## KidneyDiseaseYes 0.5620791 0.0272492 20.627 < 2e-16 ***
## SkinCancerYes 0.1130570 0.0217684 5.194 2.06e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 149527 on 255836 degrees of freedom
## Residual deviance: 115839 on 255799 degrees of freedom
## AIC: 115915
##
## Number of Fisher Scoring iterations: 7
Predicciones con datos de validación
prediciones_rl = predict(object = modelo.rl,newdata = datos.validacion, se.fit = TRUE)
# convertir a valores probabilisticos
# Mediante la función logit se transforman los a probabilidades.
prediciones_rl_prob <- exp(prediciones_rl$fit) / (1 + exp(prediciones_rl$fit))
t_comparativa = data.frame(datos.validacion,prediciones_rl_prob)
t_comparativa <- t_comparativa %>%
mutate(heartDiseasePred = if_else(prediciones_rl_prob < 0.50, 0, 1))
top20 = head(t_comparativa,20)
kable(top20,caption = 'Primeros 20 registros')
| HeartDisease | BMI | Smoking | AlcoholDrinking | Stroke | PhysicalHealth | MentalHealth | DiffWalking | Sex | AgeCategory | Race | Diabetic | PhysicalActivity | GenHealth | SleepTime | Asthma | KidneyDisease | SkinCancer | HeartDisease_01 | prediciones_rl_prob | heartDiseasePred | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 19 | No | 29.86 | Yes | No | No | 0 | 0 | Yes | Female | 75-79 | Black | Yes | No | Fair | 5 | No | Yes | No | 0 | 0.3948057 | 0 |
| 25 | No | 25.75 | No | No | No | 0 | 0 | No | Female | 80 or older | White | No | Yes | Very good | 6 | No | No | Yes | 0 | 0.0768873 | 0 |
| 27 | No | 34.34 | Yes | No | No | 21 | 8 | Yes | Female | 65-69 | White | No | Yes | Fair | 9 | No | No | No | 0 | 0.1641593 | 0 |
| 30 | No | 36.58 | No | No | No | 0 | 0 | No | Female | 60-64 | White | Yes | No | Good | 5 | No | No | Yes | 0 | 0.0881778 | 0 |
| 43 | Yes | 25.06 | No | No | No | 0 | 0 | Yes | Female | 80 or older | White | Yes | No | Good | 7 | No | No | Yes | 1 | 0.2230232 | 0 |
| 47 | No | 33.23 | No | No | No | 0 | 0 | No | Male | 65-69 | White | Yes | Yes | Very good | 8 | No | No | No | 0 | 0.1025505 | 0 |
| 48 | No | 25.11 | No | No | No | 5 | 5 | No | Female | 65-69 | Black | No | Yes | Good | 7 | No | No | No | 0 | 0.0458078 | 0 |
| 55 | No | 32.10 | No | No | No | 14 | 0 | No | Male | 65-69 | White | Yes | Yes | Very good | 9 | No | No | No | 0 | 0.1031412 | 0 |
| 60 | No | 27.20 | Yes | No | Yes | 0 | 0 | No | Male | 80 or older | White | No | No | Very good | 8 | No | No | Yes | 0 | 0.3960094 | 0 |
| 61 | No | 28.94 | Yes | No | No | 0 | 0 | No | Female | 70-74 | White | Yes | Yes | Good | 5 | Yes | No | No | 0 | 0.2116797 | 0 |
| 62 | No | 21.03 | No | No | No | 1 | 0 | No | Female | 80 or older | White | No | Yes | Excellent | 8 | No | No | No | 0 | 0.0406982 | 0 |
| 64 | No | 31.46 | Yes | No | No | 0 | 0 | No | Male | 75-79 | White | No | Yes | Very good | 8 | No | No | No | 0 | 0.1435026 | 0 |
| 72 | No | 27.76 | Yes | No | No | 15 | 0 | Yes | Female | 80 or older | White | No | No | Good | 8 | Yes | No | No | 0 | 0.2426964 | 0 |
| 74 | No | 30.23 | No | No | No | 0 | 5 | No | Female | 65-69 | White | No | Yes | Good | 6 | No | No | No | 0 | 0.0626294 | 0 |
| 79 | Yes | 28.29 | Yes | No | No | 30 | 30 | No | Female | 70-74 | White | Yes | Yes | Poor | 9 | No | Yes | No | 1 | 0.4895958 | 0 |
| 89 | No | 32.81 | Yes | No | No | 0 | 0 | Yes | Female | 70-74 | White | Yes | Yes | Good | 5 | No | No | No | 0 | 0.2067582 | 0 |
| 91 | No | 44.29 | No | No | No | 30 | 10 | Yes | Female | 70-74 | White | No | No | Fair | 7 | No | No | Yes | 0 | 0.1953263 | 0 |
| 93 | No | 21.80 | Yes | No | No | 0 | 0 | No | Female | 75-79 | White | No | Yes | Very good | 8 | No | No | Yes | 0 | 0.0782419 | 0 |
| 99 | No | 24.37 | No | No | No | 0 | 0 | No | Female | 55-59 | White | No | Yes | Very good | 7 | Yes | No | No | 0 | 0.0261086 | 0 |
| 111 | No | 26.63 | No | No | No | 0 | 0 | No | Female | 75-79 | Black | Yes | Yes | Good | 8 | No | No | No | 0 | 0.1073309 | 0 |
factorizar las columnas “prediciones_rl_prob” & “heasrtDiseasePred” de la tabla comparativa
Factorizar en R ==> categorizar con la funcion “as.factor” o “factor”
t_comparativa$HeartDisease_01 = as.factor(t_comparativa$HeartDisease_01)
t_comparativa$heartDiseasePred = as.factor(t_comparativa$heartDiseasePred)
Creacion de la matriz de confusion
matrixConfusion <- confusionMatrix(t_comparativa$HeartDisease_01,t_comparativa$heartDiseasePred)
matrixConfusion
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 57987 497
## 1 4921 553
##
## Accuracy : 0.9153
## 95% CI : (0.9131, 0.9174)
## No Information Rate : 0.9836
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.146
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.9218
## Specificity : 0.5267
## Pos Pred Value : 0.9915
## Neg Pred Value : 0.1010
## Prevalence : 0.9836
## Detection Rate : 0.9066
## Detection Prevalence : 0.9144
## Balanced Accuracy : 0.7242
##
## 'Positive' Class : 0
##
Pendiente
Predicciones con datos de validación
BMI <- 20
Smoking <- 'Yes'
AlcoholDrinking = 'Yes'
Stroke <- 'No'
PhysicalHealth <- 13
MentalHealth = 22
DiffWalking = 'Yes'
Sex = 'Male'
AgeCategory = '60-64'
Race = 'Hispanic'
Diabetic <- 'Yes'
PhysicalActivity = "No"
GenHealth = "Fair"
SleepTime = 8
Asthma = "No"
KidneyDisease = "Yes"
SkinCancer = 'No'
persona <- data.frame(BMI,Smoking, AlcoholDrinking, Stroke, PhysicalHealth, MentalHealth, DiffWalking, Sex, AgeCategory, Race, Diabetic, PhysicalActivity, GenHealth, SleepTime, Asthma, KidneyDisease, SkinCancer)
persona
## BMI Smoking AlcoholDrinking Stroke PhysicalHealth MentalHealth DiffWalking
## 1 20 Yes Yes No 13 22 Yes
## Sex AgeCategory Race Diabetic PhysicalActivity GenHealth SleepTime
## 1 Male 60-64 Hispanic Yes No Fair 8
## Asthma KidneyDisease SkinCancer
## 1 No Yes No