La regresión logística binaria es una extensión directa de la regresión lineal que se usa cuando la variable que queremos predecir tiene solo dos posibles resultados: sí o no, fraude o no fraude, 0 o 1.
En lugar de predecir un número, el modelo estima la probabilidad de que ocurra el evento de interés. Si esa probabilidad supera un umbral (por ejemplo 0.5), la observación se clasifica como positiva.
| Aspecto | Regresión Lineal | Regresión Logística |
|---|---|---|
| Variable dependiente | Continua | Binaria (0/1) |
| Salida del modelo | Valor numérico \(\hat{y}\) | Probabilidad \(P \in [0,1]\) |
| Función de enlace | Identidad | Logit |
| Estimación | Mínimos cuadrados (OLS) | Máxima verosimilitud (MLE) |
| Interpretación \(\beta\) | Cambio en \(Y\) | Cambio en log-odds |
Función logística (sigmoide):
\[P(Y=1 \mid X) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 X_1 + \cdots + \beta_k X_k)}}\]
Forma log-odds:
\[\ln\left(\frac{P}{1-P}\right) = \beta_0 + \beta_1 X_1 + \cdots + \beta_k X_k\]
data.frame(
Variable = names(df_log),
Tipo = sapply(df_log, class),
Descripcion = c(
"Segundos desde la primera transaccion",
"Componente PCA 1 (anonimizado)",
"Componente PCA 2 (anonimizado)",
"Componente PCA 3 (anonimizado)",
"Componente PCA 4 (anonimizado)",
"Componente PCA 5 (anonimizado)",
"Monto de la transaccion",
"0 = legitima | 1 = fraudulenta"
)
) %>%
kable(caption = "Tabla 1. Estructura de la base de datos", booktabs = TRUE) %>%
kable_styling(bootstrap_options = c("striped","hover","condensed"),
full_width = FALSE)| Variable | Tipo | Descripcion | |
|---|---|---|---|
| Time | Time | integer | Segundos desde la primera transaccion |
| V1 | V1 | numeric | Componente PCA 1 (anonimizado) |
| V2 | V2 | numeric | Componente PCA 2 (anonimizado) |
| V3 | V3 | numeric | Componente PCA 3 (anonimizado) |
| V4 | V4 | numeric | Componente PCA 4 (anonimizado) |
| V5 | V5 | numeric | Componente PCA 5 (anonimizado) |
| Amount | Amount | numeric | Monto de la transaccion |
| Class | Class | integer | 0 = legitima | 1 = fraudulenta |
df_log %>%
count(Class) %>%
mutate(
Categoria = ifelse(Class == 0, "Legitima (0)", "Fraudulenta (1)"),
Porcentaje = paste0(round(n / sum(n) * 100, 1), "%")
) %>%
select(Categoria, n, Porcentaje) %>%
kable(col.names = c("Tipo de transaccion", "N", "%"),
caption = "Tabla 2. Distribucion de la variable Class",
align = "lcc", booktabs = TRUE) %>%
kable_styling(bootstrap_options = c("striped","hover"), full_width = FALSE)| Tipo de transaccion | N | % |
|---|---|---|
| Legitima (0) | 982 | 98.2% |
| Fraudulenta (1) | 18 | 1.8% |
⚠️ Solo el 1.8% de las transacciones son fraude — la base está muy desbalanceada.
modelo_log <- glm(Class ~ V1 + V2 + V3 + V4 + V5 + Amount,
data = df_log,
family = binomial(link = "logit"))coef_df <- summary(modelo_log)$coefficients
coef_tabla <- data.frame(
Variable = rownames(coef_df),
Estimado = round(coef_df[, 1], 4),
Std.Error = round(coef_df[, 2], 4),
z_value = round(coef_df[, 3], 3),
p_valor = round(coef_df[, 4], 4)
)
coef_tabla %>%
kable(row.names = FALSE,
col.names = c("Variable","Estimado","Error Estandar","Estadistico z","p-valor"),
caption = "Tabla 3. Coeficientes del modelo logistico",
booktabs = TRUE) %>%
kable_styling(bootstrap_options = c("striped","hover","condensed"),
full_width = FALSE) %>%
row_spec(which(coef_tabla$p_valor < 0.05),
bold = TRUE, color = "white", background = "#2c7fb8")| Variable | Estimado | Error Estandar | Estadistico z | p-valor |
|---|---|---|---|---|
| (Intercept) | -3.8233 | 0.4688 | -8.155 | 0.0000 |
| V1 | 0.1448 | 0.2288 | 0.633 | 0.5268 |
| V2 | -0.1058 | 0.2282 | -0.464 | 0.6428 |
| V3 | -0.1627 | 0.2428 | -0.670 | 0.5030 |
| V4 | 0.2114 | 0.2434 | 0.869 | 0.3851 |
| V5 | 0.0001 | 0.2386 | 0.001 | 0.9996 |
| Amount | -0.0001 | 0.0002 | -0.559 | 0.5763 |
df_log$prob <- predict(modelo_log, type = "response")
df_log$pred <- ifelse(df_log$prob > 0.5, 1, 0)
cm <- confusionMatrix(factor(df_log$pred), factor(df_log$Class), positive = "1")
data.frame(
Metrica = c("Exactitud (Accuracy)", "Sensibilidad (Recall)",
"Especificidad", "Kappa"),
Valor = round(c(cm$overall["Accuracy"],
cm$byClass["Sensitivity"],
cm$byClass["Specificity"],
cm$overall["Kappa"]), 4)
) %>%
kable(row.names = FALSE,
caption = "Tabla 4. Metricas de desempeno",
booktabs = TRUE) %>%
kable_styling(bootstrap_options = c("striped"), full_width = FALSE)| Metrica | Valor |
|---|---|
| Exactitud (Accuracy) | 0.982 |
| Sensibilidad (Recall) | 0.000 |
| Especificidad | 1.000 |
| Kappa | 0.000 |
roc_obj <- roc(df_log$Class, df_log$prob, quiet = TRUE)
auc_val <- round(auc(roc_obj), 4)
ggroc(roc_obj, colour = "#2c7fb8", size = 1.2) +
geom_abline(intercept = 1, slope = 1,
linetype = "dashed", colour = "grey50") +
labs(title = paste0("Curva ROC — AUC = ", auc_val),
x = "Especificidad", y = "Sensibilidad") +
theme_minimal(base_size = 13)Un AUC de 0.6176 supera al azar (0.5) pero el desbalance limita el modelo.
ggplot(df_log, aes(x = prob, fill = factor(Class))) +
geom_histogram(bins = 40, alpha = 0.7, position = "identity") +
scale_fill_manual(values = c("#41b6c4","#e31a1c"),
labels = c("Legitima","Fraudulenta"),
name = "Transaccion") +
labs(title = "Distribucion de probabilidades predichas",
x = "Probabilidad estimada de fraude",
y = "Frecuencia") +
theme_minimal(base_size = 13)Un árbol de decisión aprende reglas del tipo “si X supera cierto valor, la categoría es Y”. Las organiza en forma de árbol donde cada nodo es una pregunta y cada rama es una respuesta posible. No requiere que los datos sigan ninguna distribución específica y sus resultados son fáciles de interpretar.
datos <- read.csv("economic_stress_score.csv", stringsAsFactors = TRUE)
# Filtrar filas con datos completos en la variable objetivo
datos <- datos[!is.na(datos$final_economic_stress_score) &
datos$final_economic_stress_score != "", ]
# Asegurar tipos correctos
datos$final_economic_stress_score <- as.numeric(as.character(datos$final_economic_stress_score))
datos$inflation_score <- as.numeric(as.character(datos$inflation_score))
datos$unemployment_score <- as.numeric(as.character(datos$unemployment_score))
datos$gdp_growth_score <- as.numeric(as.character(datos$gdp_growth_score))
datos$income_vulnerability_score <- as.numeric(as.character(datos$income_vulnerability_score))
datos$food_pressure_score <- as.numeric(as.character(datos$food_pressure_score))data.frame(
Variable = c("country_code","country_name","year","region","income_group",
"inflation_score","unemployment_score","gdp_growth_score",
"income_vulnerability_score","food_pressure_score",
"final_economic_stress_score","stress_category"),
Tipo = c("Categorica","Categorica","Numerica entera","Categorica","Categorica",
"Numerica","Numerica","Numerica","Numerica","Numerica",
"Numerica","Categorica"),
Rol = c("Identificador","Identificador","Temporal","Contextual","Contextual",
"Predictor","Predictor","Predictor","Predictor","Predictor",
"Score global","Variable objetivo"),
Descripcion = c("Codigo ISO del pais","Nombre del pais","Anno de observacion",
"Region geografica (Banco Mundial)","Grupo de ingreso (Banco Mundial)",
"Score de inflacion (0-100)","Score de desempleo (0-100)",
"Score de crecimiento del PIB (0-100)",
"Score de vulnerabilidad de ingresos (0-100)",
"Score de presion alimentaria (0-100)",
"Promedio ponderado de los sub-scores",
"Low / Moderate / High / Severe")
) %>%
kable(caption = "Tabla 1. Descripcion del dataset economic_stress_score",
booktabs = TRUE) %>%
kable_styling(bootstrap_options = c("striped","hover","condensed"),
font_size = 13, full_width = FALSE)| Variable | Tipo | Rol | Descripcion |
|---|---|---|---|
| country_code | Categorica | Identificador | Codigo ISO del pais |
| country_name | Categorica | Identificador | Nombre del pais |
| year | Numerica entera | Temporal | Anno de observacion |
| region | Categorica | Contextual | Region geografica (Banco Mundial) |
| income_group | Categorica | Contextual | Grupo de ingreso (Banco Mundial) |
| inflation_score | Numerica | Predictor | Score de inflacion (0-100) |
| unemployment_score | Numerica | Predictor | Score de desempleo (0-100) |
| gdp_growth_score | Numerica | Predictor | Score de crecimiento del PIB (0-100) |
| income_vulnerability_score | Numerica | Predictor | Score de vulnerabilidad de ingresos (0-100) |
| food_pressure_score | Numerica | Predictor | Score de presion alimentaria (0-100) |
| final_economic_stress_score | Numerica | Score global | Promedio ponderado de los sub-scores |
| stress_category | Categorica | Variable objetivo | Low / Moderate / High / Severe |
conteos <- as.data.frame(table(datos$stress_category))
names(conteos) <- c("Categoria","n")
conteos$pct <- round(conteos$n / sum(conteos$n) * 100, 1)
ggplot(conteos, aes(x = Categoria, y = n, fill = Categoria)) +
geom_col(width = 0.6, show.legend = FALSE) +
geom_text(aes(label = paste0(n, "\n(", pct, "%)")),
vjust = -0.3, size = 3.8, fontface = "bold") +
scale_fill_manual(values = c("Low"="#2ecc71","Moderate"="#f39c12",
"High"="#e67e22","Severe"="#e74c3c")) +
labs(title = "Distribucion de categorias de estres economico",
x = "Categoria", y = "Frecuencia") +
theme_minimal(base_size = 13)datos_ing <- datos %>%
filter(income_group %in% c("Low income","Lower middle income",
"Upper middle income","High income"))
ggplot(datos_ing,
aes(x = income_group, y = final_economic_stress_score, fill = income_group)) +
geom_boxplot(show.legend = FALSE, outlier.alpha = 0.3, outlier.size = 0.8) +
scale_fill_manual(values = c("Low income" = "#e74c3c",
"Lower middle income" = "#e67e22",
"Upper middle income" = "#3498db",
"High income" = "#2ecc71")) +
labs(title = "Score de estres economico por grupo de ingreso",
x = "Grupo de ingreso (Banco Mundial)",
y = "Score de estres economico (0-100)") +
theme_minimal(base_size = 13)tabla_cruzada <- table(datos$income_group, datos$stress_category)
n_cats <- ncol(tabla_cruzada)
as.data.frame.matrix(tabla_cruzada) %>%
tibble::rownames_to_column("Grupo de ingreso") %>%
kable(caption = "Tabla 3. Frecuencia de categorias de estres por grupo de ingreso",
booktabs = TRUE) %>%
kable_styling(bootstrap_options = c("striped","hover"), full_width = FALSE) %>%
add_header_above(setNames(c(1, n_cats), c(" ", "Categoria de estres")))| Grupo de ingreso | V1 | High | Low | Moderate | Severe |
|---|---|---|---|---|---|
| High income | 0 | 1048 | 731 | 2863 | 87 |
| Low income | 0 | 803 | 42 | 397 | 336 |
| Lower middle income | 0 | 1809 | 31 | 943 | 375 |
| Not classified | 0 | 68 | 1 | 50 | 13 |
| Upper middle income | 0 | 1560 | 115 | 1334 | 156 |
datos_modelo <- datos %>%
select(inflation_score, unemployment_score, gdp_growth_score,
income_vulnerability_score, food_pressure_score,
stress_category) %>%
na.omit()
cat("Registros disponibles para modelar:", nrow(datos_modelo), "\n")## Registros disponibles para modelar: 5438
set.seed(42)
idx <- sample(1:nrow(datos_modelo), size = round(0.7 * nrow(datos_modelo)))
entrenamiento <- datos_modelo[idx, ]
prueba <- datos_modelo[-idx, ]
cat("Entrenamiento:", nrow(entrenamiento), "| Prueba:", nrow(prueba), "\n")## Entrenamiento: 3807 | Prueba: 1631
prop_train <- round(prop.table(table(entrenamiento$stress_category)) * 100, 1)
prop_test <- round(prop.table(table(prueba$stress_category)) * 100, 1)
data.frame(
Categoria = names(prop_train),
Entrenamiento = paste0(prop_train, "%"),
Prueba = paste0(prop_test, "%")
) %>%
kable(caption = "Tabla 4. Proporcion de categorias en cada conjunto",
booktabs = TRUE) %>%
kable_styling(bootstrap_options = c("striped"), full_width = FALSE)| Categoria | Entrenamiento | Prueba |
|---|---|---|
| 0% | 0% | |
| High | 44.3% | 44.4% |
| Low | 3.2% | 2.2% |
| Moderate | 48.2% | 49.6% |
| Severe | 4.2% | 3.8% |
modelo <- rpart(
stress_category ~ inflation_score + unemployment_score +
gdp_growth_score + income_vulnerability_score + food_pressure_score,
data = entrenamiento,
method = "class",
control = rpart.control(minsplit = 30, cp = 0.005, maxdepth = 6)
)
printcp(modelo)##
## Classification tree:
## rpart(formula = stress_category ~ inflation_score + unemployment_score +
## gdp_growth_score + income_vulnerability_score + food_pressure_score,
## data = entrenamiento, method = "class", control = rpart.control(minsplit = 30,
## cp = 0.005, maxdepth = 6))
##
## Variables actually used in tree construction:
## [1] food_pressure_score gdp_growth_score
## [3] income_vulnerability_score inflation_score
## [5] unemployment_score
##
## Root node error: 1971/3807 = 0.51773
##
## n= 3807
##
## CP nsplit rel error xerror xstd
## 1 0.3952308 0 1.00000 1.00000 0.015642
## 2 0.1024860 1 0.60477 0.61898 0.014608
## 3 0.0324708 2 0.50228 0.52613 0.013936
## 4 0.0157281 3 0.46981 0.49518 0.013668
## 5 0.0139523 4 0.45408 0.46220 0.013356
## 6 0.0123457 9 0.37443 0.44039 0.013134
## 7 0.0088787 12 0.33739 0.41451 0.012852
## 8 0.0086251 14 0.31963 0.41299 0.012835
## 9 0.0083714 15 0.31101 0.40386 0.012730
## 10 0.0055809 17 0.29427 0.37950 0.012438
## 11 0.0053272 19 0.28311 0.37037 0.012324
## 12 0.0050000 21 0.27245 0.36986 0.012317
imp <- modelo$variable.importance
imp_norm <- round(imp / sum(imp) * 100, 2)
imp_df <- data.frame(
Variable = names(imp_norm),
Importancia = paste0(as.numeric(imp_norm), "%")
)
imp_df <- imp_df[order(-as.numeric(sub("%","", imp_df$Importancia))), ]
imp_df %>%
kable(row.names = FALSE,
caption = "Tabla 5. Importancia relativa de las variables predictoras",
booktabs = TRUE) %>%
kable_styling(bootstrap_options = c("striped","hover"), full_width = FALSE)| Variable | Importancia |
|---|---|
| inflation_score | 33.08% |
| unemployment_score | 24.71% |
| income_vulnerability_score | 17.36% |
| gdp_growth_score | 17.2% |
| food_pressure_score | 7.66% |
pred_clase <- predict(modelo, newdata = prueba, type = "class")
mc <- table(Real = prueba$stress_category, Predicho = pred_clase)
n_clases <- ncol(mc)
as.data.frame.matrix(mc) %>%
tibble::rownames_to_column("Real / Predicho") %>%
kable(caption = "Tabla 6. Matriz de confusion — Arbol de decision (conjunto de prueba)",
booktabs = TRUE) %>%
kable_styling(bootstrap_options = c("striped","hover"), full_width = FALSE) %>%
add_header_above(setNames(c(1, n_clases), c(" ", "Clase predicha")))| Real / Predicho | V1 | High | Low | Moderate | Severe |
|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | |
| High | 0 | 623 | 0 | 85 | 16 |
| Low | 0 | 0 | 11 | 25 | 0 |
| Moderate | 0 | 124 | 15 | 670 | 0 |
| Severe | 0 | 40 | 0 | 0 | 22 |
exactitud <- round(sum(diag(mc)) / sum(mc) * 100, 2)
n <- nrow(mc)
metricas_df <- data.frame(
Clase = character(n),
Sensibilidad = numeric(n),
Especificidad = numeric(n),
Precision = numeric(n),
stringsAsFactors = FALSE
)
for (i in seq_len(n)) {
VP <- as.numeric(mc[i, i])
FP <- as.numeric(sum(mc[, i])) - VP
FN <- as.numeric(sum(mc[i, ])) - VP
VN <- as.numeric(sum(mc)) - VP - FP - FN
metricas_df[i, "Clase"] <- rownames(mc)[i]
metricas_df[i, "Sensibilidad"] <- round(VP / max(VP + FN, 1) * 100, 1)
metricas_df[i, "Especificidad"] <- round(VN / max(VN + FP, 1) * 100, 1)
metricas_df[i, "Precision"] <- round(VP / max(VP + FP, 1) * 100, 1)
}
metricas_df %>%
kable(row.names = FALSE,
caption = paste0("Tabla 7. Metricas por clase — Exactitud global: ",
exactitud, "%"),
booktabs = TRUE) %>%
kable_styling(bootstrap_options = c("striped","hover"), full_width = FALSE)| Clase | Sensibilidad | Especificidad | Precision |
|---|---|---|---|
| 0.0 | 100.0 | 0.0 | |
| High | 86.0 | 81.9 | 79.2 |
| Low | 30.6 | 99.1 | 42.3 |
| Moderate | 82.8 | 86.6 | 85.9 |
| Severe | 35.5 | 99.0 | 57.9 |
data.frame(
Aspecto = c("Variable respuesta","Forma del modelo","Salida",
"Interpretacion","Requiere distribucion",
"Manejo de desbalance","Mejor para"),
Logistica = c("Binaria (0/1)",
"Funcion logistica (sigmoide)",
"Probabilidad entre 0 y 1",
"Coeficientes (log-odds)",
"No (distribucion Bernoulli)",
"Sensible — necesita SMOTE u otros ajustes",
"Cuantificar el efecto de cada variable"),
Arbol = c("Categorica (2 o mas clases)",
"Particiones binarias recursivas",
"Clase directa",
"Reglas Si/No legibles",
"No (no parametrico)",
"Mas robusto al desbalance",
"Explicar decisiones de forma simple")
) %>%
kable(col.names = c("Aspecto","Regresion Logistica","Arbol de Decision"),
caption = "Tabla 8. Comparacion entre modelos",
booktabs = TRUE) %>%
kable_styling(bootstrap_options = c("striped","hover"), full_width = FALSE) %>%
column_spec(1, bold = TRUE)| Aspecto | Regresion Logistica | Arbol de Decision |
|---|---|---|
| Variable respuesta | Binaria (0/1) | Categorica (2 o mas clases) |
| Forma del modelo | Funcion logistica (sigmoide) | Particiones binarias recursivas |
| Salida | Probabilidad entre 0 y 1 | Clase directa |
| Interpretacion | Coeficientes (log-odds) | Reglas Si/No legibles |
| Requiere distribucion | No (distribucion Bernoulli) | No (no parametrico) |
| Manejo de desbalance | Sensible — necesita SMOTE u otros ajustes | Mas robusto al desbalance |
| Mejor para | Cuantificar el efecto de cada variable | Explicar decisiones de forma simple |
La regresion logistica pregunta “cuanto cambia la probabilidad cuando sube el monto?”. El arbol pregunta “en que punto de la inflacion separamos los paises?”. Uno cuantifica efectos, el otro traza fronteras.