Grupo 3: Jair Colcas, Angello Custodio, Verónica Armas, Diego Cortés, Ed Castañeda.
El siguiente trabajo es de un estudio cardiovascular de los residentes de Framingham, Massachuttes. Esta investigación tiene como objetivo identificar los factores de riesgo / más relevantes de la enfermedad cardíaca, así como predecir el riesgo general mediante regresión logística, usando como variable dependiente si el paciente tiene un riesgo de 10 años de enfermedad coronaria en el futuro (TenYearCHD).
Se incluye mas de 4000 datos registrados y 15 variables que representan un factor de riesgo que incluye variables sociodemográficas. La base de datos se encuentra en Kaggle.
Numero de Datos Perdidos por Variable
library("VIM")
aggr(dataset, plot = FALSE)
Missings in variables:
Variable Count
education 105
cigsPerDay 29
BPMeds 53
totChol 50
BMI 19
heartRate 1
glucose 388
| male | age | currentSmoker | prevalentStroke | prevalentHyp | diabetes | sysBP | diaBP | TenYearCHD | heartRate | BMI | cigsPerDay | totChol | BPMeds | education | glucose | ||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 3656 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 |
| 331 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 1 |
| 93 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 1 | 1 |
| 8 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 2 |
| 51 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 1 | 1 | 1 |
| 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 1 | 0 | 2 |
| 9 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 1 | 1 | 1 | 1 |
| 38 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 1 | 1 | 0 | 2 |
| 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 1 | 0 | 1 | 2 |
| 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 1 | 0 | 3 |
| 23 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 1 | 1 | 1 | 1 | 1 |
| 4 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 1 | 1 | 1 | 0 | 2 |
| 2 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 1 | 1 | 0 | 1 | 2 |
| 13 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 1 | 1 | 1 | 1 | 1 | 1 |
| 4 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 1 | 1 | 1 | 1 | 0 | 2 |
| 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 1 | 1 | 1 | 0 | 1 | 2 |
| 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 1 | 0 | 1 | 1 | 0 | 3 |
| 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 19 | 29 | 50 | 53 | 105 | 388 | 645 |
| male | age | education | currentSmoker | cigsPerDay | BPMeds | prevalentStroke | prevalentHyp | diabetes | totChol | sysBP | diaBP | BMI | heartRate | glucose | TenYearCHD | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| male | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| age | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| education | 0 | 0 | 105 | 0 | 2 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 8 | 0 |
| currentSmoker | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| cigsPerDay | 0 | 0 | 2 | 0 | 29 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 4 | 0 |
| BPMeds | 0 | 0 | 0 | 0 | 0 | 53 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 2 | 0 |
| prevalentStroke | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| prevalentHyp | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| diabetes | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| totChol | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 50 | 0 | 0 | 1 | 0 | 40 | 0 |
| sysBP | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| diaBP | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| BMI | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 19 | 0 | 5 | 0 |
| heartRate | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| glucose | 0 | 0 | 8 | 0 | 4 | 2 | 0 | 0 | 0 | 40 | 0 | 0 | 5 | 0 | 388 | 0 |
| TenYearCHD | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
Frecuencia Cardiaca
Ya que solo hay 1 valor perdido de la variable y es continua entonces imputamos por la media.
data_imp=dataset
data_imp$heartRate[which(is.na(data_imp$heartRate))]=mean(data_imp$heartRate , na.rm = T)Nivel de Educacion
Ya que la variables es categorica se procedera a imputar por la moda
data_imp$education <- as.factor(data_imp$education)
levels(data_imp$education)<-c("Some High School","High School or GED","Some College or Vocational Schooll", "College")
table(data_imp$education)
Some High School High School or GED
1720 1253
Some College or Vocational Schooll College
687 473
Vemos que la mayor frecuencia la tiene el Nivel de Educacion “Some High School” entonces imputamos por eso valor.
data_imp$education[which(is.na(data_imp$education))]="Some High School"library("VIM")
aggr(data_imp, plot = FALSE)
Missings in variables:
Variable Count
cigsPerDay 29
BPMeds 53
totChol 50
BMI 19
glucose 388
Utilizaremos la libreria MICE para generar multiples dataset para realizar distintas imputaciones y hallar una combinancion optima de ellas.
library(mice)
imp1 <- mice(data_imp, m = 5, seed = 2)
iter imp variable
1 1 cigsPerDay BPMeds totChol BMI glucose
1 2 cigsPerDay BPMeds totChol BMI glucose
1 3 cigsPerDay BPMeds totChol BMI glucose
1 4 cigsPerDay BPMeds totChol BMI glucose
1 5 cigsPerDay BPMeds totChol BMI glucose
2 1 cigsPerDay BPMeds totChol BMI glucose
2 2 cigsPerDay BPMeds totChol BMI glucose
2 3 cigsPerDay BPMeds totChol BMI glucose
2 4 cigsPerDay BPMeds totChol BMI glucose
2 5 cigsPerDay BPMeds totChol BMI glucose
3 1 cigsPerDay BPMeds totChol BMI glucose
3 2 cigsPerDay BPMeds totChol BMI glucose
3 3 cigsPerDay BPMeds totChol BMI glucose
3 4 cigsPerDay BPMeds totChol BMI glucose
3 5 cigsPerDay BPMeds totChol BMI glucose
4 1 cigsPerDay BPMeds totChol BMI glucose
4 2 cigsPerDay BPMeds totChol BMI glucose
4 3 cigsPerDay BPMeds totChol BMI glucose
4 4 cigsPerDay BPMeds totChol BMI glucose
4 5 cigsPerDay BPMeds totChol BMI glucose
5 1 cigsPerDay BPMeds totChol BMI glucose
5 2 cigsPerDay BPMeds totChol BMI glucose
5 3 cigsPerDay BPMeds totChol BMI glucose
5 4 cigsPerDay BPMeds totChol BMI glucose
5 5 cigsPerDay BPMeds totChol BMI glucose
imp1Class: mids
Number of multiple imputations: 5
Imputation methods:
male age education currentSmoker cigsPerDay
"" "" "" "" "pmm"
BPMeds prevalentStroke prevalentHyp diabetes totChol
"pmm" "" "" "" "pmm"
sysBP diaBP BMI heartRate glucose
"" "" "pmm" "" "pmm"
TenYearCHD
""
PredictorMatrix:
male age education currentSmoker cigsPerDay BPMeds
male 0 1 1 1 1 1
age 1 0 1 1 1 1
education 1 1 0 1 1 1
currentSmoker 1 1 1 0 1 1
cigsPerDay 1 1 1 1 0 1
BPMeds 1 1 1 1 1 0
prevalentStroke prevalentHyp diabetes totChol sysBP diaBP BMI
male 1 1 1 1 1 1 1
age 1 1 1 1 1 1 1
education 1 1 1 1 1 1 1
currentSmoker 1 1 1 1 1 1 1
cigsPerDay 1 1 1 1 1 1 1
BPMeds 1 1 1 1 1 1 1
heartRate glucose TenYearCHD
male 1 1 1
age 1 1 1
education 1 1 1
currentSmoker 1 1 1
cigsPerDay 1 1 1
BPMeds 1 1 1
imp1$method male age education currentSmoker cigsPerDay
"" "" "" "" "pmm"
BPMeds prevalentStroke prevalentHyp diabetes totChol
"pmm" "" "" "" "pmm"
sysBP diaBP BMI heartRate glucose
"" "" "pmm" "" "pmm"
TenYearCHD
""
Frecuencia Cardiaca
Visualizacion de Datos Imputados
Visualizacion de datos que seran imputados
data<- mice::complete(imp1)
aggr(data, plot = FALSE )
Missings in variables:
[1] Variable Count
<0 rows> (or 0-length row.names)
Se comprueba que la base ya no posee valores perdidos.
Se particiona la data en 80 porciento para el entrenamiento y el 20 porciento para el testeo.
library(ROSE)
library(caret)
split = 0.8
trainIndex = createDataPartition(data$TenYearCHD, p = split, list = FALSE, times = 1 )
train = data[trainIndex,]
test = data[-trainIndex,]table(train$TenYearCHD)
No Riesgo Riesgo
2876 516
prop.table(table(train$TenYearCHD))
No Riesgo Riesgo
0.8478774 0.1521226
Se aplica el metodo de balanceo de ROSE
train_bal <- ROSE(TenYearCHD ~ ., data = train , seed = 2021)$datatable(train_bal$TenYearCHD)
No Riesgo Riesgo
1706 1686
prop.table(table(train_bal$TenYearCHD))
No Riesgo Riesgo
0.5029481 0.4970519
Se utiliza el metodo de ROSE
Regresion Logistica
Utilizaremos el modelo de Regresion Logistica ya que contempla predictores numericos como categoricos y la respuesta es binaria, el Riesgo o No Riesgo Cardiaco.
logit <- glm(TenYearCHD ~., data = train_bal, family = binomial(link = "logit"))
summary(logit)
Call:
glm(formula = TenYearCHD ~ ., family = binomial(link = "logit"),
data = train_bal)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.1290 -1.0378 -0.5257 1.0678 2.1137
Coefficients:
Estimate Std. Error z value
(Intercept) -5.208842 0.469969 -11.083
maleMasculino 0.423891 0.080795 5.247
age 0.045694 0.004441 10.289
educationHigh School or GED -0.082629 0.090737 -0.911
educationSome College or Vocational Schooll 0.171788 0.112522 1.527
educationCollege -0.220583 0.127080 -1.736
currentSmokerFumador 0.219497 0.103246 2.126
cigsPerDay 0.008162 0.003749 2.177
BPMedsPresion Alta 0.715293 0.198514 3.603
prevalentStrokePro. Previo 0.648169 0.514834 1.259
prevalentHypHipertenso 0.202979 0.100421 2.021
diabetesDiabetico -0.113726 0.260551 -0.436
totChol 0.002279 0.000758 3.007
sysBP 0.010347 0.001959 5.282
diaBP -0.001902 0.003283 -0.579
BMI 0.011245 0.008230 1.366
heartRate -0.004808 0.002782 -1.728
glucose 0.007586 0.001521 4.989
Pr(>|z|)
(Intercept) < 2e-16 ***
maleMasculino 1.55e-07 ***
age < 2e-16 ***
educationHigh School or GED 0.362486
educationSome College or Vocational Schooll 0.126835
educationCollege 0.082601 .
currentSmokerFumador 0.033507 *
cigsPerDay 0.029483 *
BPMedsPresion Alta 0.000314 ***
prevalentStrokePro. Previo 0.208035
prevalentHypHipertenso 0.043251 *
diabetesDiabetico 0.662487
totChol 0.002637 **
sysBP 1.28e-07 ***
diaBP 0.562326
BMI 0.171818
heartRate 0.083954 .
glucose 6.08e-07 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 4702.2 on 3391 degrees of freedom
Residual deviance: 4211.1 on 3374 degrees of freedom
AIC: 4247.1
Number of Fisher Scoring iterations: 4
Prediccion Data de Entrenamiento
train_pred <- predict(logit, newdata = train_bal, type = "response")
head(train_pred) 1 2 3 4 5 6
0.5924946 0.3794478 0.3899629 0.3129622 0.2121198 0.5089698
Segun la probabilidad se categoriza como No Riesgo si la probabilidad predicha es menor a 0.5 y Riesgo si es mayor.
train_pred_clase <- factor(ifelse(train_pred > 0.5, 1, 0))
levels(train_pred_clase) <- c("No Riesgo","Riesgo")
head(train_pred_clase) 1 2 3 4 5 6
Riesgo No Riesgo No Riesgo No Riesgo No Riesgo Riesgo
Levels: No Riesgo Riesgo
Prediccion Data Testeo
test_pred <- predict(logit, newdata = test, type = "response")
test_pred_clase <- factor(ifelse(test_pred > 0.5, 1, 0))
levels(test_pred_clase) <- c("No Riesgo","Riesgo")
head(test_pred_clase) 2 3 4 5 24 27
No Riesgo No Riesgo Riesgo No Riesgo No Riesgo No Riesgo
Levels: No Riesgo Riesgo
Data Entrenamiento
train_bal$TenYearCHD <- as.factor(train_bal$TenYearCHD)
levels(train_bal$TenYearCHD)<-c("No Riesgo","Riesgo")
table(Predicho = train_pred_clase, Real = train_bal$TenYearCHD) Real
Predicho No Riesgo Riesgo
No Riesgo 1146 645
Riesgo 560 1041
Data Testeo
test$TenYearCHD <- as.factor(test$TenYearCHD)
levels(test$TenYearCHD)<-c("No Riesgo","Riesgo")
table(Predicho = test_pred_clase, Real = test$TenYearCHD) Real
Predicho No Riesgo Riesgo
No Riesgo 502 39
Riesgo 216 89
library(caret)
confusionMatrix(train_pred_clase, train_bal$TenYearCHD )Confusion Matrix and Statistics
Reference
Prediction No Riesgo Riesgo
No Riesgo 1146 645
Riesgo 560 1041
Accuracy : 0.6448
95% CI : (0.6284, 0.6609)
No Information Rate : 0.5029
P-Value [Acc > NIR] : < 2e-16
Kappa : 0.2893
Mcnemar's Test P-Value : 0.01553
Sensitivity : 0.6717
Specificity : 0.6174
Pos Pred Value : 0.6399
Neg Pred Value : 0.6502
Prevalence : 0.5029
Detection Rate : 0.3379
Detection Prevalence : 0.5280
Balanced Accuracy : 0.6446
'Positive' Class : No Riesgo
library(caret)
confusionMatrix(test_pred_clase, test$TenYearCHD )Confusion Matrix and Statistics
Reference
Prediction No Riesgo Riesgo
No Riesgo 502 39
Riesgo 216 89
Accuracy : 0.6986
95% CI : (0.6664, 0.7294)
No Information Rate : 0.8487
P-Value [Acc > NIR] : 1
Kappa : 0.2516
Mcnemar's Test P-Value : <2e-16
Sensitivity : 0.6992
Specificity : 0.6953
Pos Pred Value : 0.9279
Neg Pred Value : 0.2918
Prevalence : 0.8487
Detection Rate : 0.5934
Detection Prevalence : 0.6395
Balanced Accuracy : 0.6972
'Positive' Class : No Riesgo
Datos de Entrenamiento
library(ROCR)
pred_train <- prediction(train_pred, train_bal$TenYearCHD)
AUC_train <- performance(pred_train ,measure="auc")
AUCaltura_train <- AUC_train@y.values
cat("AUC:", AUCaltura_train[[1]]) AUC: 0.7095458
Datos de Testeo
library(ROCR)
pred_test <- prediction(test_pred, test$TenYearCHD)
AUC_test <- performance(pred_test ,measure="auc")
AUCaltura_test <- AUC_test@y.values
cat("AUC:", AUCaltura_test[[1]]) AUC: 0.74333