Demostrar que las distribuciones Bernoulli, Normal y Poisson pertenecen a la familia exponencial de distribuciones.
\[ p(x|\eta) = \eta^x(1-\eta)^{1-x} \]
Reescribir en la forma de la familia exponencial:
\[ p(x|\eta) = \exp\left(x \log(\eta) + (1-x) \log(1-\eta)\right) \]
\[ f(x|\mu,\sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right) \]
Reescribir en la forma de la familia exponencial:
\[ f(x|\mu,\sigma^2) = \exp\left(-\frac{(x^2 - 2\mu x + \mu^2)}{2\sigma^2} - \frac{\log(2\pi\sigma^2)}{2}\right) \]
\[ P(X=k) = \frac{\lambda^k e^{-\lambda}}{k!} \]
Reescribir en la forma de la familia exponencial:
\[ P(X=k) = \exp\left(k \log(\lambda) - \lambda - \log(k!)\right) \]
Construir un modelo de regresión logística utilizando el conjunto de datos de enfermedades cardíacas de la UCI.
# Descargar y leer los datos localmente
download.file("http://archive.ics.uci.edu/ml/machine-learning-databases/heart-disease/processed.cleveland.data",
destfile = "processed.cleveland.data")
col_names <- c("age", "sex", "cp", "trestbps", "chol", "fbs", "restecg", "thalach", "exang", "oldpeak", "slope", "ca", "thal", "target")
heart_data <- read.table("processed.cleveland.data", sep = ",", col.names = col_names, na.strings = "?")
# Imputar los datos faltantes con la mediana
heart_data <- heart_data %>%
mutate(across(everything(), ~ifelse(is.na(.), median(., na.rm = TRUE), .)))
# Convertir la variable respuesta en categórica
heart_data$target <- ifelse(heart_data$target > 0, 1, 0)
heart_data$target <- as.factor(heart_data$target)
# Ver las primeras filas del conjunto de datos
head(heart_data)
## age sex cp trestbps chol fbs restecg thalach exang oldpeak slope ca thal
## 1 63 1 1 145 233 1 2 150 0 2.3 3 0 6
## 2 67 1 4 160 286 0 2 108 1 1.5 2 3 3
## 3 67 1 4 120 229 0 2 129 1 2.6 2 2 7
## 4 37 1 3 130 250 0 0 187 0 3.5 3 0 3
## 5 41 0 2 130 204 0 2 172 0 1.4 1 0 3
## 6 56 1 2 120 236 0 0 178 0 0.8 1 0 3
## target
## 1 0
## 2 1
## 3 1
## 4 0
## 5 0
## 6 0
# Distribución de la variable respuesta para cada covariable categórica
heart_data %>%
gather(key = "variable", value = "value", -target) %>%
ggplot(aes(x = value, fill = target)) +
geom_histogram(position = "dodge", bins = 30) +
facet_wrap(~variable, scales = "free_x") +
theme_minimal()
# Ajustar el modelo bivariado
model_bivariado <- glm(target ~ fbs, data = heart_data, family = binomial)
summary(model_bivariado)
##
## Call:
## glm(formula = target ~ fbs, family = binomial, data = heart_data)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.1866 0.1251 -1.492 0.136
## fbs 0.1421 0.3234 0.440 0.660
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 417.98 on 302 degrees of freedom
## Residual deviance: 417.79 on 301 degrees of freedom
## AIC: 421.79
##
## Number of Fisher Scoring iterations: 3
# Ajustar el modelo multivariado
model_multivariado <- glm(target ~ ., data = heart_data, family = binomial)
summary(model_multivariado)
##
## Call:
## glm(formula = target ~ ., family = binomial, data = heart_data)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -7.551087 2.857180 -2.643 0.008221 **
## age -0.014491 0.023877 -0.607 0.543921
## sex 1.371145 0.488398 2.807 0.004994 **
## cp 0.611789 0.190567 3.210 0.001326 **
## trestbps 0.023837 0.010710 2.226 0.026028 *
## chol 0.005030 0.003767 1.335 0.181740
## fbs -0.777545 0.529833 -1.468 0.142232
## restecg 0.235034 0.182607 1.287 0.198059
## thalach -0.020028 0.010158 -1.972 0.048651 *
## exang 1.032238 0.409469 2.521 0.011705 *
## oldpeak 0.243988 0.210239 1.161 0.245835
## slope 0.591238 0.359671 1.644 0.100211
## ca 1.237096 0.262772 4.708 2.5e-06 ***
## thal 0.324359 0.098555 3.291 0.000998 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 417.98 on 302 degrees of freedom
## Residual deviance: 209.02 on 289 degrees of freedom
## AIC: 237.02
##
## Number of Fisher Scoring iterations: 6
# Predecir probabilidades
heart_data$predicted_prob <- predict(model_multivariado, type = "response")
# Visualizar las probabilidades predichas
ggplot(heart_data, aes(x = predicted_prob, fill = target)) +
geom_histogram(position = "dodge", bins = 30) +
theme_minimal() +
labs(title = "Probabilidades predichas de enfermedad cardíaca",
x = "Probabilidad predicha",
y = "Frecuencia")
Evaluar cuál de los dos modelos logísticos (A y B) tiene mayor poder de predicción para incumplimiento de pago de tarjeta de crédito.
# Cargar los datos
library(readxl)
credit_data <- read_excel("C:/Users/Neyde/Downloads/AAD-taller03 (2).xlsx")
# Calcular AUC para ambos modelos
roc_A <- roc(credit_data$Incumplimiento, credit_data$ScoreLogisticoA)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
roc_B <- roc(credit_data$Incumplimiento, credit_data$ScoreLogisticoB)
## Setting levels: control = 0, case = 1
## Setting direction: controls > cases
auc_A <- auc(roc_A)
auc_B <- auc(roc_B)
# Graficar las curvas ROC
ggplot() +
geom_line(aes(x = 1 - roc_A$specificities, y = roc_A$sensitivities), color = "blue", linetype = "dashed") +
geom_line(aes(x = 1 - roc_B$specificities, y = roc_B$sensitivities), color = "red") +
geom_abline(linetype = "dotted") +
theme_minimal() +
labs(title = "Curvas ROC para los Modelos A y B",
x = "Tasa de Falsos Positivos",
y = "Tasa de Verdaderos Positivos") +
annotate("text", x = 0.5, y = 0.2, label = paste("AUC Modelo A:", round(auc_A, 2)), color = "blue") +
annotate("text", x = 0.5, y = 0.1, label = paste("AUC Modelo B:", round(auc_B, 2)), color = "red")
# Umbral de clasificación
threshold <- 0.5
# Predicciones binarias usando el umbral
pred_A <- ifelse(credit_data$ScoreLogisticoA >= threshold, 1, 0)
pred_B <- ifelse(credit_data$ScoreLogisticoB >= threshold, 1, 0)
# Matrices de confusión
cm_A <- confusionMatrix(factor(pred_A), factor(credit_data$Incumplimiento))
## Warning in confusionMatrix.default(factor(pred_A),
## factor(credit_data$Incumplimiento)): Levels are not in the same order for
## reference and data. Refactoring data to match.
cm_B <- confusionMatrix(factor(pred_B), factor(credit_data$Incumplimiento))
## Warning in confusionMatrix.default(factor(pred_B),
## factor(credit_data$Incumplimiento)): Levels are not in the same order for
## reference and data. Refactoring data to match.
# Reporte de clasificación
report_A <- cm_A
report_B <- cm_B
cm_A
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 4450 4630
## 1 0 0
##
## Accuracy : 0.4901
## 95% CI : (0.4798, 0.5004)
## No Information Rate : 0.5099
## P-Value [Acc > NIR] : 0.9999
##
## Kappa : 0
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 1.0000
## Specificity : 0.0000
## Pos Pred Value : 0.4901
## Neg Pred Value : NaN
## Prevalence : 0.4901
## Detection Rate : 0.4901
## Detection Prevalence : 1.0000
## Balanced Accuracy : 0.5000
##
## 'Positive' Class : 0
##
cm_B
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 4450 4630
## 1 0 0
##
## Accuracy : 0.4901
## 95% CI : (0.4798, 0.5004)
## No Information Rate : 0.5099
## P-Value [Acc > NIR] : 0.9999
##
## Kappa : 0
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 1.0000
## Specificity : 0.0000
## Pos Pred Value : 0.4901
## Neg Pred Value : NaN
## Prevalence : 0.4901
## Detection Rate : 0.4901
## Detection Prevalence : 1.0000
## Balanced Accuracy : 0.5000
##
## 'Positive' Class : 0
##
Dado que la imputación con Amelia no es necesaria porque
los datos ya no tienen valores faltantes, podemos omitir este paso.
Repetir el problema 2 sin el uso de Amelia ya que no hay
datos faltantes.
# Ajustar el modelo multivariado sin imputación EM (ya que no hay datos faltantes)
model_multivariado_em <- glm(target ~ ., data = heart_data, family = binomial)
summary(model_multivariado_em)
##
## Call:
## glm(formula = target ~ ., family = binomial, data = heart_data)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -7.850480 3.633065 -2.161 0.0307 *
## age -0.015604 0.025248 -0.618 0.5366
## sex 1.459212 0.820168 1.779 0.0752 .
## cp 0.655657 0.378459 1.732 0.0832 .
## trestbps 0.025336 0.015501 1.635 0.1022
## chol 0.005356 0.004486 1.194 0.2326
## fbs -0.829849 0.658139 -1.261 0.2073
## restecg 0.251860 0.221431 1.137 0.2554
## thalach -0.021419 0.014521 -1.475 0.1402
## exang 1.104567 0.676745 1.632 0.1026
## oldpeak 0.261158 0.246067 1.061 0.2885
## slope 0.629975 0.461337 1.366 0.1721
## ca 1.323027 0.693465 1.908 0.0564 .
## thal 0.349445 0.211015 1.656 0.0977 .
## predicted_prob -0.410500 3.050447 -0.135 0.8930
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 417.98 on 302 degrees of freedom
## Residual deviance: 209.00 on 288 degrees of freedom
## AIC: 239
##
## Number of Fisher Scoring iterations: 7
# Predecir probabilidades
heart_data$predicted_prob_em <- predict(model_multivariado_em, type = "response")
# Visualizar las probabilidades predichas
ggplot(heart_data, aes(x = predicted_prob_em, fill = target)) +
geom_histogram(position = "dodge", bins = 30) +
theme_minimal() +
labs(title = "Probabilidades predichas de enfermedad cardíaca",
x = "Probabilidad predicha",
y = "Frecuencia")