output: html_document: toc: TRUE toc_float: TRUE code_dowland: TRUE theme: cosmo —

Teoría

La regresión logística es un modelo estadístico de clasificación binaria, que estima la probabilidad de que ocurra un evento (valor 1) frente a que no ocurra (valor 0), en función de variables independientes.

Instalar paquetes y llamar librerías

 #install.packages("caret")
library(caret)
## Loading required package: ggplot2
## Loading required package: lattice
 #install.packages("tidyverse")
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.2.0     ✔ readr     2.1.6
## ✔ forcats   1.0.1     ✔ stringr   1.6.0
## ✔ lubridate 1.9.5     ✔ tibble    3.3.1
## ✔ purrr     1.2.1     ✔ tidyr     1.3.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ✖ purrr::lift()   masks caret::lift()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

Crear la base de datos

#file.choose()
df1 <- read.csv("/Users/eduardojuniormedinahernandez/Downloads/heart.csv")

Entender la base de datos

summary(df1)
##       age             sex               cp            trestbps    
##  Min.   :29.00   Min.   :0.0000   Min.   :0.0000   Min.   : 94.0  
##  1st Qu.:48.00   1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:120.0  
##  Median :56.00   Median :1.0000   Median :1.0000   Median :130.0  
##  Mean   :54.43   Mean   :0.6956   Mean   :0.9424   Mean   :131.6  
##  3rd Qu.:61.00   3rd Qu.:1.0000   3rd Qu.:2.0000   3rd Qu.:140.0  
##  Max.   :77.00   Max.   :1.0000   Max.   :3.0000   Max.   :200.0  
##       chol          fbs            restecg          thalach     
##  Min.   :126   Min.   :0.0000   Min.   :0.0000   Min.   : 71.0  
##  1st Qu.:211   1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:132.0  
##  Median :240   Median :0.0000   Median :1.0000   Median :152.0  
##  Mean   :246   Mean   :0.1493   Mean   :0.5298   Mean   :149.1  
##  3rd Qu.:275   3rd Qu.:0.0000   3rd Qu.:1.0000   3rd Qu.:166.0  
##  Max.   :564   Max.   :1.0000   Max.   :2.0000   Max.   :202.0  
##      exang           oldpeak          slope             ca        
##  Min.   :0.0000   Min.   :0.000   Min.   :0.000   Min.   :0.0000  
##  1st Qu.:0.0000   1st Qu.:0.000   1st Qu.:1.000   1st Qu.:0.0000  
##  Median :0.0000   Median :0.800   Median :1.000   Median :0.0000  
##  Mean   :0.3366   Mean   :1.072   Mean   :1.385   Mean   :0.7541  
##  3rd Qu.:1.0000   3rd Qu.:1.800   3rd Qu.:2.000   3rd Qu.:1.0000  
##  Max.   :1.0000   Max.   :6.200   Max.   :2.000   Max.   :4.0000  
##       thal           target      
##  Min.   :0.000   Min.   :0.0000  
##  1st Qu.:2.000   1st Qu.:0.0000  
##  Median :2.000   Median :1.0000  
##  Mean   :2.324   Mean   :0.5132  
##  3rd Qu.:3.000   3rd Qu.:1.0000  
##  Max.   :3.000   Max.   :1.0000
str(df1)
## 'data.frame':    1025 obs. of  14 variables:
##  $ age     : int  52 53 70 61 62 58 58 55 46 54 ...
##  $ sex     : int  1 1 1 1 0 0 1 1 1 1 ...
##  $ cp      : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ trestbps: int  125 140 145 148 138 100 114 160 120 122 ...
##  $ chol    : int  212 203 174 203 294 248 318 289 249 286 ...
##  $ fbs     : int  0 1 0 0 1 0 0 0 0 0 ...
##  $ restecg : int  1 0 1 1 1 0 2 0 0 0 ...
##  $ thalach : int  168 155 125 161 106 122 140 145 144 116 ...
##  $ exang   : int  0 1 1 0 0 0 0 1 0 1 ...
##  $ oldpeak : num  1 3.1 2.6 0 1.9 1 4.4 0.8 0.8 3.2 ...
##  $ slope   : int  2 0 0 2 1 1 0 1 2 1 ...
##  $ ca      : int  2 0 0 1 3 0 3 1 0 2 ...
##  $ thal    : int  3 3 3 3 2 2 1 3 3 2 ...
##  $ target  : int  0 0 0 0 0 1 0 0 0 0 ...
df1 <- df1[, c("target","age","sex","cp","thalach","exang","oldpeak","ca","thal")]
df1 <- na.omit(df1)
df1$target <- as.factor(df1$target)
df1$sex <- as.factor(df1$sex)
df1$cp <- as.factor(df1$cp)
df1$exang <- as.factor(df1$exang)
df1$ca <- as.factor(df1$ca)
df1$thal <- as.factor(df1$thal)

Crear el modelo

modelo <- glm(target ~ ., data=df1, family=binomial)
summary(modelo)
## 
## Call:
## glm(formula = target ~ ., family = binomial, data = df1)
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -2.307050   1.658865  -1.391 0.164304    
## age         -0.002845   0.012627  -0.225 0.821741    
## sex1        -1.317370   0.265582  -4.960 7.04e-07 ***
## cp1          0.965560   0.292954   3.296 0.000981 ***
## cp2          1.865984   0.264683   7.050 1.79e-12 ***
## cp3          1.826127   0.354608   5.150 2.61e-07 ***
## thalach      0.020410   0.005732   3.561 0.000370 ***
## exang1      -0.833013   0.236454  -3.523 0.000427 ***
## oldpeak     -0.663828   0.115890  -5.728 1.02e-08 ***
## ca1         -1.968795   0.259073  -7.599 2.98e-14 ***
## ca2         -2.651901   0.375660  -7.059 1.67e-12 ***
## ca3         -1.976884   0.473077  -4.179 2.93e-05 ***
## ca4          0.853130   0.820865   1.039 0.298662    
## thal1        1.975621   1.215095   1.626 0.103971    
## thal2        1.929371   1.161709   1.661 0.096753 .  
## thal3        0.384922   1.163201   0.331 0.740708    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1420.24  on 1024  degrees of freedom
## Residual deviance:  665.09  on 1009  degrees of freedom
## AIC: 697.09
## 
## Number of Fisher Scoring iterations: 6
# Crear dos perfiles extremos

prueba <- data.frame(age = c(30, 70), sex = as.factor(c(0, 1)), cp = as.factor(c(1, 3)), thalach = c(180, 90), exang = as.factor(c(0, 1)), oldpeak = c(0.2, 3.5), ca = as.factor(c(0, 3)), thal = as.factor(c(2, 1)))

probabilidad <- predict(modelo, newdata=prueba, type="response")

cbind(prueba, Probabilidad_de_estar_sano=probabilidad)
##   age sex cp thalach exang oldpeak ca thal Probabilidad_de_estar_sano
## 1  30   0  1     180     0     0.2  0    2                  0.9827668
## 2  70   1  3      90     1     3.5  3    1                  0.0349510

Conclusiones

El modelo discrimina correctamente entre perfiles extremos:

El paciente joven, sin factores de riesgo relevantes, presenta una probabilidad extremadamente alta de estar sano (≈98%).

En contraste, el paciente mayor con múltiples factores clínicos de riesgo presenta una probabilidad muy baja de estar sano (≈3%).

Esto indica que el modelo tiene buena capacidad de separación.