Solución del Taller 3 - Análisis Avanzado de Datos

Problema 1 - 20 pts (teórico)

Demostrar que las distribuciones Bernoulli, Normal y Poisson pertenecen a la familia exponencial de distribuciones.

Distribución Bernoulli

\[ p(x|\eta) = \eta^x(1-\eta)^{1-x} \]

Reescribir en la forma de la familia exponencial:

\[ p(x|\eta) = \exp\left(x \log(\eta) + (1-x) \log(1-\eta)\right) \]

Distribución Normal

\[ f(x|\mu,\sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right) \]

Reescribir en la forma de la familia exponencial:

\[ f(x|\mu,\sigma^2) = \exp\left(-\frac{(x^2 - 2\mu x + \mu^2)}{2\sigma^2} - \frac{\log(2\pi\sigma^2)}{2}\right) \]

Distribución Poisson

\[ P(X=k) = \frac{\lambda^k e^{-\lambda}}{k!} \]

Reescribir en la forma de la familia exponencial:

\[ P(X=k) = \exp\left(k \log(\lambda) - \lambda - \log(k!)\right) \]

Problema 2 - 50 pts (práctico)

Construir un modelo de regresión logística utilizando el conjunto de datos de enfermedades cardíacas de la UCI.

# Descargar y leer los datos localmente
download.file("http://archive.ics.uci.edu/ml/machine-learning-databases/heart-disease/processed.cleveland.data", 
              destfile = "processed.cleveland.data")
col_names <- c("age", "sex", "cp", "trestbps", "chol", "fbs", "restecg", "thalach", "exang", "oldpeak", "slope", "ca", "thal", "target")
heart_data <- read.table("processed.cleveland.data", sep = ",", col.names = col_names, na.strings = "?")

# Imputar los datos faltantes con la mediana
heart_data <- heart_data %>%
  mutate(across(everything(), ~ifelse(is.na(.), median(., na.rm = TRUE), .)))

# Convertir la variable respuesta en categórica
heart_data$target <- ifelse(heart_data$target > 0, 1, 0)
heart_data$target <- as.factor(heart_data$target)

# Ver las primeras filas del conjunto de datos
head(heart_data)

##   age sex cp trestbps chol fbs restecg thalach exang oldpeak slope ca thal
## 1  63   1  1      145  233   1       2     150     0     2.3     3  0    6
## 2  67   1  4      160  286   0       2     108     1     1.5     2  3    3
## 3  67   1  4      120  229   0       2     129     1     2.6     2  2    7
## 4  37   1  3      130  250   0       0     187     0     3.5     3  0    3
## 5  41   0  2      130  204   0       2     172     0     1.4     1  0    3
## 6  56   1  2      120  236   0       0     178     0     0.8     1  0    3
##   target
## 1      0
## 2      1
## 3      1
## 4      0
## 5      0
## 6      0

Revisar distribuciones bivariadas

# Distribución de la variable respuesta para cada covariable categórica
heart_data %>%
  gather(key = "variable", value = "value", -target) %>%
  ggplot(aes(x = value, fill = target)) +
  geom_histogram(position = "dodge", bins = 30) +
  facet_wrap(~variable, scales = "free_x") +
  theme_minimal()

Modelo bivariado

# Ajustar el modelo bivariado
model_bivariado <- glm(target ~ fbs, data = heart_data, family = binomial)
summary(model_bivariado)

## 
## Call:
## glm(formula = target ~ fbs, family = binomial, data = heart_data)
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)
## (Intercept)  -0.1866     0.1251  -1.492    0.136
## fbs           0.1421     0.3234   0.440    0.660
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 417.98  on 302  degrees of freedom
## Residual deviance: 417.79  on 301  degrees of freedom
## AIC: 421.79
## 
## Number of Fisher Scoring iterations: 3

Modelo multivariado

# Ajustar el modelo multivariado
model_multivariado <- glm(target ~ ., data = heart_data, family = binomial)
summary(model_multivariado)

## 
## Call:
## glm(formula = target ~ ., family = binomial, data = heart_data)
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -7.551087   2.857180  -2.643 0.008221 ** 
## age         -0.014491   0.023877  -0.607 0.543921    
## sex          1.371145   0.488398   2.807 0.004994 ** 
## cp           0.611789   0.190567   3.210 0.001326 ** 
## trestbps     0.023837   0.010710   2.226 0.026028 *  
## chol         0.005030   0.003767   1.335 0.181740    
## fbs         -0.777545   0.529833  -1.468 0.142232    
## restecg      0.235034   0.182607   1.287 0.198059    
## thalach     -0.020028   0.010158  -1.972 0.048651 *  
## exang        1.032238   0.409469   2.521 0.011705 *  
## oldpeak      0.243988   0.210239   1.161 0.245835    
## slope        0.591238   0.359671   1.644 0.100211    
## ca           1.237096   0.262772   4.708  2.5e-06 ***
## thal         0.324359   0.098555   3.291 0.000998 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 417.98  on 302  degrees of freedom
## Residual deviance: 209.02  on 289  degrees of freedom
## AIC: 237.02
## 
## Number of Fisher Scoring iterations: 6

Visualización de probabilidades predichas bajo modelo multivariado

# Predecir probabilidades
heart_data$predicted_prob <- predict(model_multivariado, type = "response")

# Visualizar las probabilidades predichas
ggplot(heart_data, aes(x = predicted_prob, fill = target)) +
  geom_histogram(position = "dodge", bins = 30) +
  theme_minimal() +
  labs(title = "Probabilidades predichas de enfermedad cardíaca",
       x = "Probabilidad predicha",
       y = "Frecuencia")

Problema 3 - 30 pts (práctico)

Evaluar cuál de los dos modelos logísticos (A y B) tiene mayor poder de predicción para incumplimiento de pago de tarjeta de crédito.

# Cargar los datos
library(readxl)
credit_data <- read_excel("C:/Users/Neyde/Downloads/AAD-taller03 (2).xlsx")

# Calcular AUC para ambos modelos
roc_A <- roc(credit_data$Incumplimiento, credit_data$ScoreLogisticoA)

## Setting levels: control = 0, case = 1

## Setting direction: controls < cases

roc_B <- roc(credit_data$Incumplimiento, credit_data$ScoreLogisticoB)

## Setting levels: control = 0, case = 1

## Setting direction: controls > cases

auc_A <- auc(roc_A)
auc_B <- auc(roc_B)

# Graficar las curvas ROC
ggplot() +
  geom_line(aes(x = 1 - roc_A$specificities, y = roc_A$sensitivities), color = "blue", linetype = "dashed") +
  geom_line(aes(x = 1 - roc_B$specificities, y = roc_B$sensitivities), color = "red") +
  geom_abline(linetype = "dotted") +
  theme_minimal() +
  labs(title = "Curvas ROC para los Modelos A y B",
       x = "Tasa de Falsos Positivos",
       y = "Tasa de Verdaderos Positivos") +
  annotate("text", x = 0.5, y = 0.2, label = paste("AUC Modelo A:", round(auc_A, 2)), color = "blue") +
  annotate("text", x = 0.5, y = 0.1, label = paste("AUC Modelo B:", round(auc_B, 2)), color = "red")

Matriz de Confusión y Reporte de Clasificación

# Umbral de clasificación
threshold <- 0.5

# Predicciones binarias usando el umbral
pred_A <- ifelse(credit_data$ScoreLogisticoA >= threshold, 1, 0)
pred_B <- ifelse(credit_data$ScoreLogisticoB >= threshold, 1, 0)

# Matrices de confusión
cm_A <- confusionMatrix(factor(pred_A), factor(credit_data$Incumplimiento))

## Warning in confusionMatrix.default(factor(pred_A),
## factor(credit_data$Incumplimiento)): Levels are not in the same order for
## reference and data. Refactoring data to match.

cm_B <- confusionMatrix(factor(pred_B), factor(credit_data$Incumplimiento))

## Warning in confusionMatrix.default(factor(pred_B),
## factor(credit_data$Incumplimiento)): Levels are not in the same order for
## reference and data. Refactoring data to match.

# Reporte de clasificación
report_A <- cm_A
report_B <- cm_B

cm_A

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 4450 4630
##          1    0    0
##                                           
##                Accuracy : 0.4901          
##                  95% CI : (0.4798, 0.5004)
##     No Information Rate : 0.5099          
##     P-Value [Acc > NIR] : 0.9999          
##                                           
##                   Kappa : 0               
##                                           
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 1.0000          
##             Specificity : 0.0000          
##          Pos Pred Value : 0.4901          
##          Neg Pred Value :    NaN          
##              Prevalence : 0.4901          
##          Detection Rate : 0.4901          
##    Detection Prevalence : 1.0000          
##       Balanced Accuracy : 0.5000          
##                                           
##        'Positive' Class : 0               
##

cm_B

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 4450 4630
##          1    0    0
##                                           
##                Accuracy : 0.4901          
##                  95% CI : (0.4798, 0.5004)
##     No Information Rate : 0.5099          
##     P-Value [Acc > NIR] : 0.9999          
##                                           
##                   Kappa : 0               
##                                           
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 1.0000          
##             Specificity : 0.0000          
##          Pos Pred Value : 0.4901          
##          Neg Pred Value :    NaN          
##              Prevalence : 0.4901          
##          Detection Rate : 0.4901          
##    Detection Prevalence : 1.0000          
##       Balanced Accuracy : 0.5000          
##                                           
##        'Positive' Class : 0               
##

Problema 4 - 10 pts (práctico) [Opcional]

Dado que la imputación con Amelia no es necesaria porque los datos ya no tienen valores faltantes, podemos omitir este paso.

Repetir el problema 2 sin el uso de Amelia ya que no hay datos faltantes.

# Ajustar el modelo multivariado sin imputación EM (ya que no hay datos faltantes)
model_multivariado_em <- glm(target ~ ., data = heart_data, family = binomial)
summary(model_multivariado_em)

## 
## Call:
## glm(formula = target ~ ., family = binomial, data = heart_data)
## 
## Coefficients:
##                 Estimate Std. Error z value Pr(>|z|)  
## (Intercept)    -7.850480   3.633065  -2.161   0.0307 *
## age            -0.015604   0.025248  -0.618   0.5366  
## sex             1.459212   0.820168   1.779   0.0752 .
## cp              0.655657   0.378459   1.732   0.0832 .
## trestbps        0.025336   0.015501   1.635   0.1022  
## chol            0.005356   0.004486   1.194   0.2326  
## fbs            -0.829849   0.658139  -1.261   0.2073  
## restecg         0.251860   0.221431   1.137   0.2554  
## thalach        -0.021419   0.014521  -1.475   0.1402  
## exang           1.104567   0.676745   1.632   0.1026  
## oldpeak         0.261158   0.246067   1.061   0.2885  
## slope           0.629975   0.461337   1.366   0.1721  
## ca              1.323027   0.693465   1.908   0.0564 .
## thal            0.349445   0.211015   1.656   0.0977 .
## predicted_prob -0.410500   3.050447  -0.135   0.8930  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 417.98  on 302  degrees of freedom
## Residual deviance: 209.00  on 288  degrees of freedom
## AIC: 239
## 
## Number of Fisher Scoring iterations: 7

# Predecir probabilidades
heart_data$predicted_prob_em <- predict(model_multivariado_em, type = "response")

# Visualizar las probabilidades predichas
ggplot(heart_data, aes(x = predicted_prob_em, fill = target)) +
  geom_histogram(position = "dodge", bins = 30) +
  theme_minimal() +
  labs(title = "Probabilidades predichas de enfermedad cardíaca",
       x = "Probabilidad predicha",
       y = "Frecuencia")

Solución del Taller 3 - Análisis Avanzado de Datos

Usuario

2024-05-17

Problema 1 - 20 pts (teórico)

Distribución Bernoulli

Distribución Normal

Distribución Poisson

Problema 2 - 50 pts (práctico)

Revisar distribuciones bivariadas

Modelo bivariado

Modelo multivariado

Visualización de probabilidades predichas bajo modelo multivariado

Problema 3 - 30 pts (práctico)

Matriz de Confusión y Reporte de Clasificación

Problema 4 - 10 pts (práctico) [Opcional]