Datos Perdidos

Columna 1

Grupo 3: Jair Colcas, Angello Custodio, Verónica Armas, Diego Cortés, Ed Castañeda.

DataFrame

El siguiente trabajo es de un estudio cardiovascular de los residentes de Framingham, Massachuttes. Esta investigación tiene como objetivo identificar los factores de riesgo / más relevantes de la enfermedad cardíaca, así como predecir el riesgo general mediante regresión logística, usando como variable dependiente si el paciente tiene un riesgo de 10 años de enfermedad coronaria en el futuro (TenYearCHD).

Se incluye mas de 4000 datos registrados y 15 variables que representan un factor de riesgo que incluye variables sociodemográficas. La base de datos se encuentra en Kaggle.

Columna 2

Prop. Faltantes por Variable

Prop. de Missing

Numero de Datos Perdidos por Variable

library("VIM")
aggr(dataset, plot = FALSE)

 Missings in variables:
   Variable Count
  education   105
 cigsPerDay    29
     BPMeds    53
    totChol    50
        BMI    19
  heartRate     1
    glucose   388

Patron de Perdida

Tabla Patron de Perdida

male age currentSmoker prevalentStroke prevalentHyp diabetes sysBP diaBP TenYearCHD heartRate BMI cigsPerDay totChol BPMeds education glucose
3656 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0
331 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1
93 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1
8 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 2
51 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 0 2
9 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1
38 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 0 2
1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 0 1 2
1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 1 0 3
23 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1
4 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 0 2
2 1 1 1 1 1 1 1 1 1 1 1 0 1 1 0 1 2
13 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1
4 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 0 2
1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 0 1 2
1 1 1 1 1 1 1 1 1 1 1 0 1 0 1 1 0 3
1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1
0 0 0 0 0 0 0 0 0 1 19 29 50 53 105 388 645

Faltantes Cruzados

male age education currentSmoker cigsPerDay BPMeds prevalentStroke prevalentHyp diabetes totChol sysBP diaBP BMI heartRate glucose TenYearCHD
male 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
age 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
education 0 0 105 0 2 0 0 0 0 1 0 0 1 0 8 0
currentSmoker 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
cigsPerDay 0 0 2 0 29 0 0 0 0 0 0 0 0 0 4 0
BPMeds 0 0 0 0 0 53 0 0 0 1 0 0 0 0 2 0
prevalentStroke 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
prevalentHyp 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
diabetes 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
totChol 0 0 1 0 0 1 0 0 0 50 0 0 1 0 40 0
sysBP 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
diaBP 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
BMI 0 0 1 0 0 0 0 0 0 1 0 0 19 0 5 0
heartRate 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0
glucose 0 0 8 0 4 2 0 0 0 40 0 0 5 0 388 0
TenYearCHD 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

Nivel de Colesterol vs. Glucosa

Presion Arterial vs. Glucosa

IMC vs. Glucosa

Frec. Cardiaca vs. Glucosa

Imputacion

Columna 1

Simple Univariada

Frecuencia Cardiaca

Ya que solo hay 1 valor perdido de la variable y es continua entonces imputamos por la media.

data_imp=dataset

data_imp$heartRate[which(is.na(data_imp$heartRate))]=mean(data_imp$heartRate , na.rm = T)

Nivel de Educacion

Ya que la variables es categorica se procedera a imputar por la moda

data_imp$education <- as.factor(data_imp$education)
levels(data_imp$education)<-c("Some High School","High School or GED","Some College or Vocational Schooll", "College")

table(data_imp$education)

                  Some High School                 High School or GED 
                              1720                               1253 
Some College or Vocational Schooll                            College 
                               687                                473 

Vemos que la mayor frecuencia la tiene el Nivel de Educacion “Some High School” entonces imputamos por eso valor.

data_imp$education[which(is.na(data_imp$education))]="Some High School"
  • Actualizando los Datos perdidos
library("VIM")
aggr(data_imp, plot = FALSE)

 Missings in variables:
   Variable Count
 cigsPerDay    29
     BPMeds    53
    totChol    50
        BMI    19
    glucose   388

Multiple Multivariada

Utilizaremos la libreria MICE para generar multiples dataset para realizar distintas imputaciones y hallar una combinancion optima de ellas.

library(mice)
imp1 <- mice(data_imp, m = 5, seed = 2)

 iter imp variable
  1   1  cigsPerDay  BPMeds  totChol  BMI  glucose
  1   2  cigsPerDay  BPMeds  totChol  BMI  glucose
  1   3  cigsPerDay  BPMeds  totChol  BMI  glucose
  1   4  cigsPerDay  BPMeds  totChol  BMI  glucose
  1   5  cigsPerDay  BPMeds  totChol  BMI  glucose
  2   1  cigsPerDay  BPMeds  totChol  BMI  glucose
  2   2  cigsPerDay  BPMeds  totChol  BMI  glucose
  2   3  cigsPerDay  BPMeds  totChol  BMI  glucose
  2   4  cigsPerDay  BPMeds  totChol  BMI  glucose
  2   5  cigsPerDay  BPMeds  totChol  BMI  glucose
  3   1  cigsPerDay  BPMeds  totChol  BMI  glucose
  3   2  cigsPerDay  BPMeds  totChol  BMI  glucose
  3   3  cigsPerDay  BPMeds  totChol  BMI  glucose
  3   4  cigsPerDay  BPMeds  totChol  BMI  glucose
  3   5  cigsPerDay  BPMeds  totChol  BMI  glucose
  4   1  cigsPerDay  BPMeds  totChol  BMI  glucose
  4   2  cigsPerDay  BPMeds  totChol  BMI  glucose
  4   3  cigsPerDay  BPMeds  totChol  BMI  glucose
  4   4  cigsPerDay  BPMeds  totChol  BMI  glucose
  4   5  cigsPerDay  BPMeds  totChol  BMI  glucose
  5   1  cigsPerDay  BPMeds  totChol  BMI  glucose
  5   2  cigsPerDay  BPMeds  totChol  BMI  glucose
  5   3  cigsPerDay  BPMeds  totChol  BMI  glucose
  5   4  cigsPerDay  BPMeds  totChol  BMI  glucose
  5   5  cigsPerDay  BPMeds  totChol  BMI  glucose
imp1
Class: mids
Number of multiple imputations:  5 
Imputation methods:
           male             age       education   currentSmoker      cigsPerDay 
             ""              ""              ""              ""           "pmm" 
         BPMeds prevalentStroke    prevalentHyp        diabetes         totChol 
          "pmm"              ""              ""              ""           "pmm" 
          sysBP           diaBP             BMI       heartRate         glucose 
             ""              ""           "pmm"              ""           "pmm" 
     TenYearCHD 
             "" 
PredictorMatrix:
              male age education currentSmoker cigsPerDay BPMeds
male             0   1         1             1          1      1
age              1   0         1             1          1      1
education        1   1         0             1          1      1
currentSmoker    1   1         1             0          1      1
cigsPerDay       1   1         1             1          0      1
BPMeds           1   1         1             1          1      0
              prevalentStroke prevalentHyp diabetes totChol sysBP diaBP BMI
male                        1            1        1       1     1     1   1
age                         1            1        1       1     1     1   1
education                   1            1        1       1     1     1   1
currentSmoker               1            1        1       1     1     1   1
cigsPerDay                  1            1        1       1     1     1   1
BPMeds                      1            1        1       1     1     1   1
              heartRate glucose TenYearCHD
male                  1       1          1
age                   1       1          1
education             1       1          1
currentSmoker         1       1          1
cigsPerDay            1       1          1
BPMeds                1       1          1
imp1$method
           male             age       education   currentSmoker      cigsPerDay 
             ""              ""              ""              ""           "pmm" 
         BPMeds prevalentStroke    prevalentHyp        diabetes         totChol 
          "pmm"              ""              ""              ""           "pmm" 
          sysBP           diaBP             BMI       heartRate         glucose 
             ""              ""           "pmm"              ""           "pmm" 
     TenYearCHD 
             "" 

Columna 2

Simple Univariada

Frecuencia Cardiaca

Multiple Multivariada

Visualizacion de Datos Imputados

Datos Imputados

Visualizacion de datos que seran imputados

Colesterol vs. Glucosa MICE

IMC vs. Glucosa MICE

Data Completa

data<- mice::complete(imp1)
aggr(data, plot = FALSE )

 Missings in variables:
[1] Variable Count   
<0 rows> (or 0-length row.names)

Se comprueba que la base ya no posee valores perdidos.

Análisis Exploratorio

Columna 2

GENERO

EDAD

NIVEL DE EDUCACION

FUMADOR

CIGARROS DIARIOS

PRESION ARTERIAL

PROBLEMA CARDIOVASCULAR PREVIO

Columna 3

HIPERTENSION

DIABETES

COLESTEROL

PRESION SISTOLICA

PRESION DIASTOLICA

IMC

FRECUENCIA CARDIACA

GLUCOSA

Particionamiento y Balanceo

Columna 1

Particionamiento de la Base

Se particiona la data en 80 porciento para el entrenamiento y el 20 porciento para el testeo.

library(ROSE)
library(caret)


split = 0.8
trainIndex = createDataPartition(data$TenYearCHD, p = split, list = FALSE, times = 1 )
train = data[trainIndex,]
test = data[-trainIndex,]
table(train$TenYearCHD)

No Riesgo    Riesgo 
     2876       516 
prop.table(table(train$TenYearCHD))

No Riesgo    Riesgo 
0.8478774 0.1521226 

Balanceo de la Data de Entrenamiento

Se aplica el metodo de balanceo de ROSE

train_bal <- ROSE(TenYearCHD ~ ., data = train , seed = 2021)$data
Conteo de la clase del target Riesgo Cardiaco
table(train_bal$TenYearCHD)

No Riesgo    Riesgo 
     1706      1686 
prop.table(table(train_bal$TenYearCHD))

No Riesgo    Riesgo 
0.5029481 0.4970519 

Columna 2

Balanceo

Se utiliza el metodo de ROSE

Modelo

Columna 1

Ajuste del Modelo

Regresion Logistica

Utilizaremos el modelo de Regresion Logistica ya que contempla predictores numericos como categoricos y la respuesta es binaria, el Riesgo o No Riesgo Cardiaco.

logit <- glm(TenYearCHD ~., data = train_bal, family = binomial(link = "logit"))
summary(logit)

Call:
glm(formula = TenYearCHD ~ ., family = binomial(link = "logit"), 
    data = train_bal)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-2.1290  -1.0378  -0.5257   1.0678   2.1137  

Coefficients:
                                             Estimate Std. Error z value
(Intercept)                                 -5.208842   0.469969 -11.083
maleMasculino                                0.423891   0.080795   5.247
age                                          0.045694   0.004441  10.289
educationHigh School or GED                 -0.082629   0.090737  -0.911
educationSome College or Vocational Schooll  0.171788   0.112522   1.527
educationCollege                            -0.220583   0.127080  -1.736
currentSmokerFumador                         0.219497   0.103246   2.126
cigsPerDay                                   0.008162   0.003749   2.177
BPMedsPresion Alta                           0.715293   0.198514   3.603
prevalentStrokePro. Previo                   0.648169   0.514834   1.259
prevalentHypHipertenso                       0.202979   0.100421   2.021
diabetesDiabetico                           -0.113726   0.260551  -0.436
totChol                                      0.002279   0.000758   3.007
sysBP                                        0.010347   0.001959   5.282
diaBP                                       -0.001902   0.003283  -0.579
BMI                                          0.011245   0.008230   1.366
heartRate                                   -0.004808   0.002782  -1.728
glucose                                      0.007586   0.001521   4.989
                                            Pr(>|z|)    
(Intercept)                                  < 2e-16 ***
maleMasculino                               1.55e-07 ***
age                                          < 2e-16 ***
educationHigh School or GED                 0.362486    
educationSome College or Vocational Schooll 0.126835    
educationCollege                            0.082601 .  
currentSmokerFumador                        0.033507 *  
cigsPerDay                                  0.029483 *  
BPMedsPresion Alta                          0.000314 ***
prevalentStrokePro. Previo                  0.208035    
prevalentHypHipertenso                      0.043251 *  
diabetesDiabetico                           0.662487    
totChol                                     0.002637 ** 
sysBP                                       1.28e-07 ***
diaBP                                       0.562326    
BMI                                         0.171818    
heartRate                                   0.083954 .  
glucose                                     6.08e-07 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 4702.2  on 3391  degrees of freedom
Residual deviance: 4211.1  on 3374  degrees of freedom
AIC: 4247.1

Number of Fisher Scoring iterations: 4

Predicciones

Prediccion Data de Entrenamiento

train_pred <- predict(logit, newdata = train_bal, type = "response")

head(train_pred)
        1         2         3         4         5         6 
0.5924946 0.3794478 0.3899629 0.3129622 0.2121198 0.5089698 

Segun la probabilidad se categoriza como No Riesgo si la probabilidad predicha es menor a 0.5 y Riesgo si es mayor.

train_pred_clase <- factor(ifelse(train_pred > 0.5, 1, 0))
levels(train_pred_clase) <- c("No Riesgo","Riesgo")

head(train_pred_clase)
        1         2         3         4         5         6 
   Riesgo No Riesgo No Riesgo No Riesgo No Riesgo    Riesgo 
Levels: No Riesgo Riesgo

Prediccion Data Testeo

test_pred <- predict(logit, newdata = test, type = "response")

test_pred_clase <- factor(ifelse(test_pred > 0.5, 1, 0))
levels(test_pred_clase) <- c("No Riesgo","Riesgo")

head(test_pred_clase)
        2         3         4         5        24        27 
No Riesgo No Riesgo    Riesgo No Riesgo No Riesgo No Riesgo 
Levels: No Riesgo Riesgo

Matriz de Confusion

Data Entrenamiento

train_bal$TenYearCHD <- as.factor(train_bal$TenYearCHD)
levels(train_bal$TenYearCHD)<-c("No Riesgo","Riesgo")

table(Predicho = train_pred_clase, Real = train_bal$TenYearCHD)
           Real
Predicho    No Riesgo Riesgo
  No Riesgo      1146    645
  Riesgo          560   1041

Data Testeo

test$TenYearCHD <- as.factor(test$TenYearCHD)
levels(test$TenYearCHD)<-c("No Riesgo","Riesgo")

table(Predicho = test_pred_clase, Real = test$TenYearCHD)
           Real
Predicho    No Riesgo Riesgo
  No Riesgo       502     39
  Riesgo          216     89

Metricas de Evaluacion Entrenamiento

library(caret)
confusionMatrix(train_pred_clase, train_bal$TenYearCHD )
Confusion Matrix and Statistics

           Reference
Prediction  No Riesgo Riesgo
  No Riesgo      1146    645
  Riesgo          560   1041
                                          
               Accuracy : 0.6448          
                 95% CI : (0.6284, 0.6609)
    No Information Rate : 0.5029          
    P-Value [Acc > NIR] : < 2e-16         
                                          
                  Kappa : 0.2893          
                                          
 Mcnemar's Test P-Value : 0.01553         
                                          
            Sensitivity : 0.6717          
            Specificity : 0.6174          
         Pos Pred Value : 0.6399          
         Neg Pred Value : 0.6502          
             Prevalence : 0.5029          
         Detection Rate : 0.3379          
   Detection Prevalence : 0.5280          
      Balanced Accuracy : 0.6446          
                                          
       'Positive' Class : No Riesgo       
                                          

Metricas de Evaluacion Testeo

library(caret)
confusionMatrix(test_pred_clase, test$TenYearCHD )
Confusion Matrix and Statistics

           Reference
Prediction  No Riesgo Riesgo
  No Riesgo       502     39
  Riesgo          216     89
                                          
               Accuracy : 0.6986          
                 95% CI : (0.6664, 0.7294)
    No Information Rate : 0.8487          
    P-Value [Acc > NIR] : 1               
                                          
                  Kappa : 0.2516          
                                          
 Mcnemar's Test P-Value : <2e-16          
                                          
            Sensitivity : 0.6992          
            Specificity : 0.6953          
         Pos Pred Value : 0.9279          
         Neg Pred Value : 0.2918          
             Prevalence : 0.8487          
         Detection Rate : 0.5934          
   Detection Prevalence : 0.6395          
      Balanced Accuracy : 0.6972          
                                          
       'Positive' Class : No Riesgo       
                                          

Columna 2

M. Confusion Entrenamiento

M. Confusion Testeo

AUC

Datos de Entrenamiento

library(ROCR)

pred_train <- prediction(train_pred, train_bal$TenYearCHD)

AUC_train <- performance(pred_train ,measure="auc")

AUCaltura_train <- AUC_train@y.values
cat("AUC:", AUCaltura_train[[1]]) 
AUC: 0.7095458

Datos de Testeo

library(ROCR)

pred_test <- prediction(test_pred, test$TenYearCHD)

AUC_test <- performance(pred_test ,measure="auc")

AUCaltura_test <- AUC_test@y.values
cat("AUC:", AUCaltura_test[[1]]) 
AUC: 0.74333

ROC Data Entrenamiento

ROC Data Testeo