Si es la primera vez que se ejecuta este documento, descomentar y ejecutar el bloque de instalación.
# Ejecutar UNA sola vez si los paquetes no están instalados
install.packages(c(
"tidyverse", # manipulación y gráficos
"readxl", # leer .xlsx
"pROC", # curva ROC y test de DeLong
"knitr", # tablas formateadas
"kableExtra", # tablas bonitas
"mice", # imputación múltiple / EM
"ggcorrplot", # mapas de correlación
"GGally", # pares de variables
"broom", # tidy() para modelos
"scales" # formato porcentaje
))library(tidyverse)
library(readxl)
library(pROC)
library(knitr)
library(kableExtra)
library(mice)
library(broom)
library(scales)
set.seed(2026)Una distribución pertenece a la familia exponencial si su función de densidad puede escribirse como:
\[ p(x \mid \theta) = h(x)\,\exp\!\bigl(\eta(\theta)\,t(x) - a(\theta)\bigr) \]
A continuación demostramos que Bernoulli, Normal y Poisson pertenecen a esta familia.
Sea \(X \sim \text{Bernoulli}(p)\) con \(x \in \{0,1\}\):
\[ p(x \mid p) = p^x(1-p)^{1-x} = \exp\!\bigl(x\log p + (1-x)\log(1-p)\bigr) \]
Reordenando:
\[ = \exp\!\Bigl(x\log\tfrac{p}{1-p} + \log(1-p)\Bigr) \]
Por tanto:
| Componente | Expresión |
|---|---|
| \(h(x)\) | \(1\) |
| \(t(x)\) | \(x\) |
| \(\eta(p)\) | \(\log\!\bigl(\frac{p}{1-p}\bigr)\) — función logit |
| \(a(p)\) | \(-\log(1-p)\) |
Sea \(X \sim \mathcal{N}(\mu, \sigma^2)\):
\[ p(x \mid \mu) = \frac{1}{\sqrt{2\pi\sigma^2}}\exp\!\Bigl(-\frac{(x-\mu)^2}{2\sigma^2}\Bigr) \]
Expandiendo \((x-\mu)^2 = x^2 - 2\mu x + \mu^2\):
\[ = \underbrace{\frac{1}{\sqrt{2\pi\sigma^2}}\exp\!\Bigl(-\frac{x^2}{2\sigma^2}\Bigr)}_{h(x)}\,\exp\!\Bigl(\underbrace{\tfrac{\mu}{\sigma^2}}_{\eta(\mu)}\,x - \underbrace{\tfrac{\mu^2}{2\sigma^2}}_{a(\mu)}\Bigr) \]
| Componente | Expresión |
|---|---|
| \(h(x)\) | \(\frac{1}{\sqrt{2\pi\sigma^2}}\exp(-x^2/2\sigma^2)\) |
| \(t(x)\) | \(x\) |
| \(\eta(\mu)\) | \(\mu/\sigma^2\) — función identidad (escalada) |
| \(a(\mu)\) | \(\mu^2/(2\sigma^2)\) |
Sea \(X \sim \text{Poisson}(\lambda)\) con \(x \in \{0,1,2,\dots\}\):
\[ p(x \mid \lambda) = \frac{\lambda^x e^{-\lambda}}{x!} = \frac{1}{x!}\exp\!\bigl(x\log\lambda - \lambda\bigr) \]
| Componente | Expresión |
|---|---|
| \(h(x)\) | \(1/x!\) |
| \(t(x)\) | \(x\) |
| \(\eta(\lambda)\) | \(\log\lambda\) — función log |
| \(a(\lambda)\) | \(\lambda\) |
| Distribución | \(h(x)\) | \(t(x)\) | \(\eta(\theta)\) (link canónico) | \(a(\theta)\) |
|---|---|---|---|---|
| Bernoulli | \(1\) | \(x\) | \(\log\!\bigl(\frac{p}{1-p}\bigr)\) | \(-\log(1-p)\) |
| Normal | \(\frac{1}{\sqrt{2\pi\sigma^2}}e^{-x^2/2\sigma^2}\) | \(x\) | \(\mu/\sigma^2\) | \(\mu^2/2\sigma^2\) |
| Poisson | \(1/x!\) | \(x\) | \(\log\lambda\) | \(\lambda\) |
Las funciones \(\eta(\theta)\) son los enlaces canónicos que usa cada GLM: logit para Bernoulli (logística), identidad para Normal (lineal) y log para Poisson.
El archivo processed.cleveland.data contiene los datos
UCI sin encabezado y con ? como marcador de NA.
nombres <- c("age","sex","cp","trestbps","chol","fbs","restecg",
"thalach","exang","oldpeak","slope","ca","thal","num")
heart <- read.csv(
"processed.cleveland.data",
header = FALSE,
na.strings = "?",
col.names = nombres
)
# Variable objetivo: 1 si tiene enfermedad (num > 0), 0 en caso contrario
heart$hd <- as.integer(heart$num > 0)
cat("Dimensiones:", dim(heart), "\n")## Dimensiones: 303 15
## Distribución hd:
##
## 0 1
## 164 139
## ca thal
## 4 2
# Imputar con mediana las columnas con NA
heart_imp <- heart
for (col in c("ca", "thal")) {
med <- median(heart_imp[[col]], na.rm = TRUE)
heart_imp[[col]][is.na(heart_imp[[col]])] <- med
cat(sprintf("Mediana de %s = %.1f\n", col, med))
}## Mediana de ca = 0.0
## Mediana de thal = 3.0
# Variables continuas para boxplot vs hd
continuas <- c("age","trestbps","chol","thalach","oldpeak")
heart_long <- heart_imp %>%
select(all_of(continuas), hd) %>%
pivot_longer(-hd, names_to = "variable", values_to = "valor")
ggplot(heart_long, aes(x = factor(hd), y = valor, fill = factor(hd))) +
geom_boxplot(alpha = 0.7) +
facet_wrap(~variable, scales = "free_y") +
labs(x = "Enfermedad coronaria", y = "Valor",
title = "Variables continuas vs hd",
fill = "hd") +
theme_minimal()# Variables categóricas vs hd
categoricas <- c("sex","cp","fbs","restecg","exang","slope","ca","thal")
heart_cat <- heart_imp %>%
select(all_of(categoricas), hd) %>%
pivot_longer(-hd, names_to = "variable", values_to = "categoria") %>%
count(variable, categoria, hd) %>%
group_by(variable, categoria) %>%
mutate(prop = n / sum(n))
ggplot(heart_cat, aes(x = factor(categoria), y = prop, fill = factor(hd))) +
geom_col(position = "fill") +
facet_wrap(~variable, scales = "free_x") +
labs(x = "Categoría", y = "Proporción",
title = "Proporción de hd por categoría",
fill = "hd") +
theme_minimal()fbsCalculamos \(\beta_0\) y \(\beta_1\) desde la tabla de
contingencia y los comparamos con los del glm.
# Tabla 2x2: fbs (filas) vs hd (columnas)
tab <- table(fbs = heart_imp$fbs, hd = heart_imp$hd)
print(tab)## hd
## fbs 0 1
## 0 141 117
## 1 23 22
n00 <- tab["0","0"]; n01 <- tab["0","1"]
n10 <- tab["1","0"]; n11 <- tab["1","1"]
# β0 = log(odds | fbs=0)
beta0_manual <- log(n01 / n00)
# β1 = log(odds_ratio entre fbs=1 y fbs=0)
beta1_manual <- log((n11/n10) / (n01/n00))
OR_manual <- exp(beta1_manual)
cat(sprintf("β0 (manual) = %.4f\n", beta0_manual))## β0 (manual) = -0.1866
## β1 (manual) = 0.1421
## OR (manual) = 1.1527
# Comparación con glm()
mod_biv <- glm(hd ~ fbs, data = heart_imp, family = binomial)
summary(mod_biv)##
## Call:
## glm(formula = hd ~ fbs, family = binomial, data = heart_imp)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.1866 0.1251 -1.492 0.136
## fbs 0.1421 0.3234 0.440 0.660
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 417.98 on 302 degrees of freedom
## Residual deviance: 417.79 on 301 degrees of freedom
## AIC: 421.79
##
## Number of Fisher Scoring iterations: 3
##
## OR (glm) = 1.1527
Los coeficientes coinciden exactamente porque la estimación de máxima verosimilitud del modelo logístico saturado con un único predictor binario es justamente la log-odds y log-odds-ratio empíricos. El p-valor de Wald indica que
fbsno es significativo.
# Convertimos categóricas a factor para que R cree dummies
vars_factor <- c("sex","cp","fbs","restecg","exang","slope","ca","thal")
heart_imp[vars_factor] <- lapply(heart_imp[vars_factor], factor)
# Modelo completo (excluyendo num que es la variable original)
mod_full <- glm(hd ~ age + sex + cp + trestbps + chol + fbs + restecg +
thalach + exang + oldpeak + slope + ca + thal,
data = heart_imp, family = binomial)
summary(mod_full)##
## Call:
## glm(formula = hd ~ age + sex + cp + trestbps + chol + fbs + restecg +
## thalach + exang + oldpeak + slope + ca + thal, family = binomial,
## data = heart_imp)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -6.360179 2.959798 -2.149 0.031646 *
## age -0.024629 0.025119 -0.980 0.326855
## sex1 1.705567 0.549410 3.104 0.001907 **
## cp2 1.378119 0.801650 1.719 0.085596 .
## cp3 0.401816 0.697483 0.576 0.564552
## cp4 2.450529 0.707253 3.465 0.000531 ***
## trestbps 0.027700 0.011725 2.362 0.018157 *
## chol 0.004344 0.004121 1.054 0.291854
## fbs1 -0.377176 0.564759 -0.668 0.504227
## restecg1 1.036882 2.690194 0.385 0.699919
## restecg2 0.485676 0.392792 1.236 0.216283
## thalach -0.018826 0.011700 -1.609 0.107594
## exang1 0.735620 0.443279 1.659 0.097016 .
## oldpeak 0.398376 0.238953 1.667 0.095481 .
## slope2 1.314503 0.478528 2.747 0.006015 **
## slope3 0.588936 0.945911 0.623 0.533539
## ca1 2.228217 0.511914 4.353 1.34e-05 ***
## ca2 3.261752 0.786434 4.148 3.36e-05 ***
## ca3 2.148739 0.914676 2.349 0.018815 *
## thal6 -0.268720 0.804929 -0.334 0.738498
## thal7 1.339207 0.432244 3.098 0.001947 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 417.98 on 302 degrees of freedom
## Residual deviance: 186.19 on 282 degrees of freedom
## AIC: 228.19
##
## Number of Fisher Scoring iterations: 6
# Tabla resumen con OR y significancia
res <- broom::tidy(mod_full, conf.int = TRUE, exponentiate = TRUE) %>%
mutate(
significativo = ifelse(p.value < 0.05, "Sí", "No"),
estimate = round(estimate, 3),
std.error = round(std.error, 3),
statistic = round(statistic, 3),
p.value = round(p.value, 4),
conf.low = round(conf.low, 3),
conf.high = round(conf.high, 3)
)
kable(res, caption = "Coeficientes (en escala OR), error estándar y test de Wald") %>%
kable_styling(bootstrap_options = c("striped","hover","condensed"))| term | estimate | std.error | statistic | p.value | conf.low | conf.high | significativo |
|---|---|---|---|---|---|---|---|
| (Intercept) | 0.002 | 2.960 | -2.149 | 0.0316 | 0.000 | 0.497 | Sí |
| age | 0.976 | 0.025 | -0.980 | 0.3269 | 0.928 | 1.025 | No |
| sex1 | 5.505 | 0.549 | 3.104 | 0.0019 | 1.938 | 16.919 | Sí |
| cp2 | 3.967 | 0.802 | 1.719 | 0.0856 | 0.839 | 19.959 | No |
| cp3 | 1.495 | 0.697 | 0.576 | 0.5646 | 0.388 | 6.109 | No |
| cp4 | 11.594 | 0.707 | 3.465 | 0.0005 | 3.063 | 50.076 | Sí |
| trestbps | 1.028 | 0.012 | 2.362 | 0.0182 | 1.005 | 1.053 | Sí |
| chol | 1.004 | 0.004 | 1.054 | 0.2919 | 0.996 | 1.012 | No |
| fbs1 | 0.686 | 0.565 | -0.668 | 0.5042 | 0.221 | 2.050 | No |
| restecg1 | 2.820 | 2.690 | 0.385 | 0.6999 | 0.049 | 319.506 | No |
| restecg2 | 1.625 | 0.393 | 1.236 | 0.2163 | 0.755 | 3.553 | No |
| thalach | 0.981 | 0.012 | -1.609 | 0.1076 | 0.958 | 1.004 | No |
| exang1 | 2.087 | 0.443 | 1.659 | 0.0970 | 0.872 | 5.003 | No |
| oldpeak | 1.489 | 0.239 | 1.667 | 0.0955 | 0.943 | 2.418 | No |
| slope2 | 3.723 | 0.479 | 2.747 | 0.0060 | 1.480 | 9.772 | Sí |
| slope3 | 1.802 | 0.946 | 0.623 | 0.5335 | 0.261 | 11.081 | No |
| ca1 | 9.283 | 0.512 | 4.353 | 0.0000 | 3.510 | 26.412 | Sí |
| ca2 | 26.095 | 0.786 | 4.148 | 0.0000 | 5.989 | 132.729 | Sí |
| ca3 | 8.574 | 0.915 | 2.349 | 0.0188 | 1.635 | 62.961 | Sí |
| thal6 | 0.764 | 0.805 | -0.334 | 0.7385 | 0.156 | 3.818 | No |
| thal7 | 3.816 | 0.432 | 3.098 | 0.0019 | 1.653 | 9.087 | Sí |
Variables significativas (\(p<0.05\)):
sex,cp=4,slope=2,ca(1, 2, 3),thal=7,trestbps.
Variables no significativas:age,chol,fbs,restecg,exang,thalach,oldpeak.
Inconveniente:restecg=1tiene muy pocas observaciones (4), lo que produce errores estándar grandes.
heart_imp$prob <- predict(mod_full, type = "response")
heart_imp$pred <- as.integer(heart_imp$prob > 0.5)
# Matriz de confusión
mc <- table(real = heart_imp$hd, pred = heart_imp$pred)
print(mc)## pred
## real 0 1
## 0 151 13
## 1 25 114
accuracy <- sum(diag(mc)) / sum(mc)
sensibilidad <- mc["1","1"] / sum(mc["1",])
especifidad <- mc["0","0"] / sum(mc["0",])
cat(sprintf("Accuracy = %.3f\n", accuracy))## Accuracy = 0.875
## Sensibilidad = 0.820
## Especificidad= 0.921
roc_obj <- roc(heart_imp$hd, heart_imp$prob, quiet = TRUE)
auc_val <- auc(roc_obj)
par(mfrow = c(1, 2))
# Probabilidades ordenadas
ord <- order(heart_imp$prob)
plot(seq_along(ord), heart_imp$prob[ord],
col = ifelse(heart_imp$hd[ord] == 1, "red", "steelblue"),
pch = 16, cex = 0.6,
xlab = "Observación (ordenada por prob)",
ylab = "Probabilidad predicha",
main = "Probabilidades predichas")
abline(h = 0.5, lty = 2, col = "gray")
legend("topleft", c("hd=0","hd=1"), col = c("steelblue","red"), pch = 16, bty = "n")
# Curva ROC
plot(roc_obj, col = "darkblue", lwd = 2,
main = sprintf("Curva ROC (AUC = %.3f)", auc_val))# Ajusta la ruta al archivo .xlsx que entregó el profesor
ruta_xlsx <- "AAD-taller03.xlsx"
if (file.exists(ruta_xlsx)) {
cred <- read_xlsx(ruta_xlsx)
} else {
warning("No se encontró AAD-taller03.xlsx; coloque el archivo en el mismo directorio que el .Rmd.")
cred <- NULL
}
if (!is.null(cred)) {
cat("Dimensiones:", dim(cred), "\n")
cat("Columnas:", names(cred), "\n")
cat("Distribución de Incumplimiento:\n"); print(table(cred$Incumplimiento))
}## Dimensiones: 9080 3
## Columnas: Incumplimiento ScoreLogisticoA ScoreLogisticoB
## Distribución de Incumplimiento:
##
## 0 1
## 4450 4630
calc_metricas <- function(y, score, nombre) {
r <- roc(y, score, quiet = TRUE)
auc_v <- as.numeric(auc(r))
gini <- 2 * auc_v - 1
# KS = máxima diferencia entre TPR y FPR
coords_r <- coords(r, x = "all", ret = c("threshold","tpr","fpr"),
transpose = FALSE)
ks <- max(coords_r$tpr - coords_r$fpr)
# Brier score y log-loss (recortando para evitar log(0))
p <- pmin(pmax(score, 1e-15), 1 - 1e-15)
brier <- mean((p - y)^2)
logloss <- -mean(y * log(p) + (1 - y) * log(1 - p))
data.frame(
Modelo = nombre,
AUC = round(auc_v, 4),
Gini = round(gini, 4),
KS = round(ks, 4),
Brier = round(brier, 4),
LogLoss = round(logloss, 4)
)
}
m_A <- calc_metricas(cred$Incumplimiento, cred$ScoreLogisticoA, "Modelo A")
m_B <- calc_metricas(cred$Incumplimiento, cred$ScoreLogisticoB, "Modelo B")
tabla_metricas <- rbind(m_A, m_B)
kable(tabla_metricas, caption = "Métricas comparativas") %>%
kable_styling(bootstrap_options = c("striped","hover"))| Modelo | AUC | Gini | KS | Brier | LogLoss |
|---|---|---|---|---|---|
| Modelo A | 0.6060 | 0.2121 | 0.1915 | 0.4936 | 2.3148 |
| Modelo B | 0.6779 | 0.3557 | 0.2624 | 0.4021 | 1.2724 |
roc_A <- roc(cred$Incumplimiento, cred$ScoreLogisticoA, quiet = TRUE)
roc_B <- roc(cred$Incumplimiento, cred$ScoreLogisticoB, quiet = TRUE)
# Test de DeLong para diferencia de AUC
test_delong <- roc.test(roc_A, roc_B, method = "delong")
print(test_delong)##
## DeLong's test for two correlated ROC curves
##
## data: roc_A and roc_B
## Z = -8.4336, p-value < 2.2e-16
## alternative hypothesis: true difference in AUC is not equal to 0
## 95 percent confidence interval:
## -0.08851072 -0.05512900
## sample estimates:
## AUC of roc1 AUC of roc2
## 0.6060437 0.6778636
par(mfrow = c(2, 2))
# 1. Curvas ROC
plot(roc_A, col = "steelblue", lwd = 2, main = "Curvas ROC")
lines(roc_B, col = "tomato", lwd = 2)
abline(0, 1, lty = 2, col = "gray")
legend("bottomright",
legend = c(sprintf("Modelo A (AUC=%.3f)", auc(roc_A)),
sprintf("Modelo B (AUC=%.3f)", auc(roc_B))),
col = c("steelblue","tomato"), lwd = 2, bty = "n")
# 2. Distribución de scores Modelo A
plot(density(cred$ScoreLogisticoA[cred$Incumplimiento == 0]),
col = "steelblue", lwd = 2, main = "Distribución scores - Modelo A",
xlab = "Score", ylim = c(0, 3))
lines(density(cred$ScoreLogisticoA[cred$Incumplimiento == 1]),
col = "tomato", lwd = 2)
legend("topright", c("No incumple","Incumple"),
col = c("steelblue","tomato"), lwd = 2, bty = "n")
# 3. Distribución de scores Modelo B
plot(density(cred$ScoreLogisticoB[cred$Incumplimiento == 0]),
col = "steelblue", lwd = 2, main = "Distribución scores - Modelo B",
xlab = "Score", ylim = c(0, 3))
lines(density(cred$ScoreLogisticoB[cred$Incumplimiento == 1]),
col = "tomato", lwd = 2)
legend("topright", c("No incumple","Incumple"),
col = c("steelblue","tomato"), lwd = 2, bty = "n")
# 4. Comparación de métricas
barplot(
rbind(c(m_A$AUC, m_A$Gini, m_A$KS),
c(m_B$AUC, m_B$Gini, m_B$KS)),
beside = TRUE, col = c("steelblue","tomato"),
names.arg = c("AUC","Gini","KS"),
main = "Métricas de discriminación", ylim = c(-0.5, 1)
)
abline(h = 0, col = "black")
legend("topright", c("A","B"), fill = c("steelblue","tomato"), bty = "n")1 - ScoreB), su AUC sería ≈ 0.68.Brier por estar mejor calibrado al
rango \([0,1]\), eso es
calibración, no discriminación, que es
lo evaluado.Conclusión: el Modelo A tiene mayor poder predictivo de incumplimiento.
Repetimos el Problema 2 reemplazando la mediana por
imputación EM (algoritmo basado en mice
con pmm o norm).
# Recargamos crudo (con NA)
heart_em <- read.csv(
"processed.cleveland.data",
header = FALSE,
na.strings = "?",
col.names = nombres
)
heart_em$hd <- as.integer(heart_em$num > 0)
heart_em$num <- NULL
# Imputación con mice (m=1, equivalente a un EM single-imputation con norm)
imp <- mice(heart_em, m = 1, method = "norm", maxit = 20,
seed = 2026, printFlag = FALSE)
heart_em_imp <- complete(imp, 1)
# Las variables ca y thal son enteras: redondeamos a la categoría más cercana válida
heart_em_imp$ca <- pmin(pmax(round(heart_em_imp$ca), 0), 3)
heart_em_imp$thal <- ifelse(round(heart_em_imp$thal) <= 4, 3,
ifelse(round(heart_em_imp$thal) <= 6, 6, 7))
# Comparación: índices que estaban NA originalmente
idx_ca_na <- which(is.na(heart_em$ca))
idx_thal_na <- which(is.na(heart_em$thal))
cat("Imputados con EM en 'ca' :", heart_em_imp$ca[idx_ca_na], "\n")## Imputados con EM en 'ca' : 1 2 1 0
## Imputados con EM en 'thal' : 3 3
## (Mediana asignaba: ca=0, thal=3)
fbs (datos EM)## hd
## fbs 0 1
## 0 141 117
## 1 23 22
n00 <- tab_em["0","0"]; n01 <- tab_em["0","1"]
n10 <- tab_em["1","0"]; n11 <- tab_em["1","1"]
beta0_em <- log(n01 / n00)
beta1_em <- log((n11/n10) / (n01/n00))
cat(sprintf("β0 (manual EM) = %.4f\n", beta0_em))## β0 (manual EM) = -0.1866
## β1 (manual EM) = 0.1421
## OR (manual EM) = 1.1527
heart_em_imp[vars_factor] <- lapply(heart_em_imp[vars_factor], factor)
mod_em <- glm(hd ~ age + sex + cp + trestbps + chol + fbs + restecg +
thalach + exang + oldpeak + slope + ca + thal,
data = heart_em_imp, family = binomial)
summary(mod_em)##
## Call:
## glm(formula = hd ~ age + sex + cp + trestbps + chol + fbs + restecg +
## thalach + exang + oldpeak + slope + ca + thal, family = binomial,
## data = heart_em_imp)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -6.162372 2.915173 -2.114 0.034524 *
## age -0.022751 0.025068 -0.908 0.364113
## sex1 1.646077 0.543071 3.031 0.002437 **
## cp2 1.230947 0.797250 1.544 0.122590
## cp3 0.391048 0.688699 0.568 0.570166
## cp4 2.378896 0.695873 3.419 0.000629 ***
## trestbps 0.026669 0.011531 2.313 0.020732 *
## chol 0.004374 0.004070 1.075 0.282501
## fbs1 -0.349657 0.564349 -0.620 0.535537
## restecg1 1.011044 2.605265 0.388 0.697959
## restecg2 0.521309 0.389607 1.338 0.180883
## thalach -0.018739 0.011580 -1.618 0.105618
## exang1 0.744037 0.439629 1.692 0.090566 .
## oldpeak 0.406429 0.237566 1.711 0.087117 .
## slope2 1.204663 0.472785 2.548 0.010834 *
## slope3 0.519735 0.944829 0.550 0.582262
## ca1 2.009146 0.492605 4.079 4.53e-05 ***
## ca2 3.159137 0.773409 4.085 4.41e-05 ***
## ca3 2.084797 0.900351 2.316 0.020583 *
## thal6 -0.219647 0.800672 -0.274 0.783832
## thal7 1.292571 0.427318 3.025 0.002488 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 417.98 on 302 degrees of freedom
## Residual deviance: 189.14 on 282 degrees of freedom
## AIC: 231.14
##
## Number of Fisher Scoring iterations: 6
heart_em_imp$prob <- predict(mod_em, type = "response")
roc_med <- roc(heart_imp$hd, heart_imp$prob, quiet = TRUE)
roc_em <- roc(heart_em_imp$hd, heart_em_imp$prob, quiet = TRUE)
cat(sprintf("AUC con imputación por mediana: %.4f\n", auc(roc_med)))## AUC con imputación por mediana: 0.9403
## AUC con imputación EM : 0.9378
plot(roc_med, col = "steelblue", lwd = 2,
main = "ROC - Mediana vs EM")
lines(roc_em, col = "darkorange", lwd = 2, lty = 2)
legend("bottomright",
legend = c(sprintf("Mediana (AUC=%.3f)", auc(roc_med)),
sprintf("EM (AUC=%.3f)", auc(roc_em))),
col = c("steelblue","darkorange"), lwd = 2, lty = c(1,2), bty = "n")sex, cp=4, slope=2,
ca, thal=7, trestbps).thal=6: con
EM su valor varía respecto a la imputación por mediana porque uno de los
NA originales fue imputado con valor cercano a 6,
modificando ligeramente la masa de esa categoría.Recomendación práctica: con tan pocos
NA(6 de 303), cualquier estrategia razonable produce resultados equivalentes. La imputación EM es preferible cuando hay másNAo cuando la imputación por mediana introduciría sesgos perceptibles.
## R version 4.5.0 (2025-04-11 ucrt)
## Platform: x86_64-w64-mingw32/x64
## Running under: Windows 11 x64 (build 26200)
##
## Matrix products: default
## LAPACK version 3.12.1
##
## locale:
## [1] LC_COLLATE=Spanish_Mexico.utf8 LC_CTYPE=Spanish_Mexico.utf8
## [3] LC_MONETARY=Spanish_Mexico.utf8 LC_NUMERIC=C
## [5] LC_TIME=Spanish_Mexico.utf8
##
## time zone: America/Bogota
## tzcode source: internal
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] scales_1.4.0 broom_1.0.12 mice_3.19.0 kableExtra_1.4.0
## [5] knitr_1.51 pROC_1.19.0.1 readxl_1.4.5 lubridate_1.9.5
## [9] forcats_1.0.1 stringr_1.5.1 dplyr_1.1.4 purrr_1.0.4
## [13] readr_2.1.6 tidyr_1.3.1 tibble_3.2.1 ggplot2_4.0.2
## [17] tidyverse_2.0.0
##
## loaded via a namespace (and not attached):
## [1] gtable_0.3.6 shape_1.4.6.1 xfun_0.52 bslib_0.9.0
## [5] lattice_0.22-6 tzdb_0.5.0 Rdpack_2.6.6 vctrs_0.6.5
## [9] tools_4.5.0 generics_0.1.4 pan_1.9 pkgconfig_2.0.3
## [13] jomo_2.7-6 Matrix_1.7-3 RColorBrewer_1.1-3 S7_0.2.1
## [17] lifecycle_1.0.4 compiler_4.5.0 farver_2.1.2 textshaping_1.0.4
## [21] codetools_0.2-20 htmltools_0.5.8.1 sass_0.4.10 yaml_2.3.10
## [25] glmnet_4.1-10 nloptr_2.2.1 pillar_1.10.2 jquerylib_0.1.4
## [29] MASS_7.3-65 cachem_1.1.0 reformulas_0.4.4 iterators_1.0.14
## [33] rpart_4.1.24 boot_1.3-31 foreach_1.5.2 mitml_0.4-5
## [37] nlme_3.1-168 tidyselect_1.2.1 digest_0.6.37 stringi_1.8.7
## [41] labeling_0.4.3 splines_4.5.0 fastmap_1.2.0 grid_4.5.0
## [45] cli_3.6.5 magrittr_2.0.3 survival_3.8-3 withr_3.0.2
## [49] backports_1.5.0 timechange_0.4.0 rmarkdown_2.31 nnet_7.3-20
## [53] lme4_2.0-1 cellranger_1.1.0 hms_1.1.4 evaluate_1.0.3
## [57] rbibutils_2.4.1 viridisLite_0.4.2 rlang_1.1.6 Rcpp_1.0.14
## [61] glue_1.8.0 xml2_1.5.2 minqa_1.2.8 svglite_2.2.2
## [65] rstudioapi_0.18.0 jsonlite_2.0.0 R6_2.6.1 systemfonts_1.3.1