Analysis of NAs

The dataset includes several medical predictor (independent) variables and one target (dependent) variable, the outcome (Outcome). The independent variables include the patient's number of pregnancies and body mass index, among others.

Libraries required for the study

library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ forcats   1.0.0     ✔ readr     2.1.5
## ✔ ggplot2   3.5.2     ✔ stringr   1.5.1
## ✔ lubridate 1.9.4     ✔ tibble    3.3.0
## ✔ purrr     1.1.0     ✔ tidyr     1.3.1

## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(mice)

## 
## Attaching package: 'mice'
## 
## The following object is masked from 'package:stats':
## 
##     filter
## 
## The following objects are masked from 'package:base':
## 
##     cbind, rbind

library(VIM)

## Loading required package: colorspace
## Loading required package: grid
## VIM is ready to use.
## 
## Suggestions and bug-reports can be submitted at: https://github.com/statistikat/VIM/issues
## 
## Attaching package: 'VIM'
## 
## The following object is masked from 'package:datasets':
## 
##     sleep

library(outliers)    
library(EnvStats)

## 
## Attaching package: 'EnvStats'
## 
## The following objects are masked from 'package:stats':
## 
##     predict, predict.lm
## 
## The following object is masked from 'package:base':
## 
##     print.default

library(naniar)      
library(robustbase)  
library(psych)

## 
## Attaching package: 'psych'
## 
## The following object is masked from 'package:outliers':
## 
##     outlier
## 
## The following objects are masked from 'package:ggplot2':
## 
##     %+%, alpha

library(GGally)

Dataset

df <- read.csv("https://raw.githubusercontent.com/Kalbam/Datos/refs/heads/main/diabetes.csv",
               header = TRUE, fileEncoding = "UTF-8")

Here we get to know the dataset and start to see what kind of data we are dealing with

head(df)

##   Pregnancies Glucose BloodPressure SkinThickness Insulin  BMI
## 1           6     148            72            35       0 33.6
## 2           1      85            66            29       0 26.6
## 3           8     183            64             0       0 23.3
## 4           1      89            66            23      94 28.1
## 5           0     137            40            35     168 43.1
## 6           5     116            74             0       0 25.6
##   DiabetesPedigreeFunction Age Outcome
## 1                    0.627  50       1
## 2                    0.351  31       0
## 3                    0.672  32       1
## 4                    0.167  21       0
## 5                    2.288  33       1
## 6                    0.201  30       0

tail(df)

##     Pregnancies Glucose BloodPressure SkinThickness Insulin  BMI
## 763           9      89            62             0       0 22.5
## 764          10     101            76            48     180 32.9
## 765           2     122            70            27       0 36.8
## 766           5     121            72            23     112 26.2
## 767           1     126            60             0       0 30.1
## 768           1      93            70            31       0 30.4
##     DiabetesPedigreeFunction Age Outcome
## 763                    0.142  33       0
## 764                    0.171  63       0
## 765                    0.340  27       0
## 766                    0.245  30       0
## 767                    0.349  47       1
## 768                    0.315  23       0

dim(df)

## [1] 768   9

colnames(df)

## [1] "Pregnancies"              "Glucose"                 
## [3] "BloodPressure"            "SkinThickness"           
## [5] "Insulin"                  "BMI"                     
## [7] "DiabetesPedigreeFunction" "Age"                     
## [9] "Outcome"

summary(df)

##   Pregnancies        Glucose      BloodPressure    SkinThickness  
##  Min.   : 0.000   Min.   :  0.0   Min.   :  0.00   Min.   : 0.00  
##  1st Qu.: 1.000   1st Qu.: 99.0   1st Qu.: 62.00   1st Qu.: 0.00  
##  Median : 3.000   Median :117.0   Median : 72.00   Median :23.00  
##  Mean   : 3.845   Mean   :120.9   Mean   : 69.11   Mean   :20.54  
##  3rd Qu.: 6.000   3rd Qu.:140.2   3rd Qu.: 80.00   3rd Qu.:32.00  
##  Max.   :17.000   Max.   :199.0   Max.   :122.00   Max.   :99.00  
##     Insulin           BMI        DiabetesPedigreeFunction      Age       
##  Min.   :  0.0   Min.   : 0.00   Min.   :0.0780           Min.   :21.00  
##  1st Qu.:  0.0   1st Qu.:27.30   1st Qu.:0.2437           1st Qu.:24.00  
##  Median : 30.5   Median :32.00   Median :0.3725           Median :29.00  
##  Mean   : 79.8   Mean   :31.99   Mean   :0.4719           Mean   :33.24  
##  3rd Qu.:127.2   3rd Qu.:36.60   3rd Qu.:0.6262           3rd Qu.:41.00  
##  Max.   :846.0   Max.   :67.10   Max.   :2.4200           Max.   :81.00  
##     Outcome     
##  Min.   :0.000  
##  1st Qu.:0.000  
##  Median :0.000  
##  Mean   :0.349  
##  3rd Qu.:1.000  
##  Max.   :1.000

We check the variable types

str(df)

## 'data.frame':    768 obs. of  9 variables:
##  $ Pregnancies             : int  6 1 8 1 0 5 3 10 2 8 ...
##  $ Glucose                 : int  148 85 183 89 137 116 78 115 197 125 ...
##  $ BloodPressure           : int  72 66 64 66 40 74 50 0 70 96 ...
##  $ SkinThickness           : int  35 29 0 23 35 0 32 0 45 0 ...
##  $ Insulin                 : int  0 0 0 94 168 0 88 0 543 0 ...
##  $ BMI                     : num  33.6 26.6 23.3 28.1 43.1 25.6 31 35.3 30.5 0 ...
##  $ DiabetesPedigreeFunction: num  0.627 0.351 0.672 0.167 2.288 ...
##  $ Age                     : int  50 31 32 21 33 30 26 29 53 54 ...
##  $ Outcome                 : int  1 0 1 0 1 0 1 0 1 1 ...

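All the variables are numeric. We replace the zeros that stand for missing values with NA: in Glucose, BloodPressure, SkinThickness, Insulin and BMI a value of 0 is not physiologically plausible, so it is treated as a missing measurement.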

cols_zero_to_na <- c("Glucose","BloodPressure","SkinThickness","Insulin","BMI")
for(v in intersect(cols_zero_to_na, names(df))) {
  df[[v]][df[[v]] == 0] <- NA
}
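As a quick sanity check of the replacement step, the same idea can be sketched on a small made-up data frame. The `toy` data and the `zero_means_missing` vector are hypothetical; they only illustrate that zeros become NA in the listed columns while legitimate zeros elsewhere (for example, zero pregnancies) are left alone.

```r
# Hypothetical toy data: 0 is a valid count for Pregnancies,
# but stands for "not measured" in Glucose and BMI.
toy <- data.frame(
  Pregnancies = c(0, 2, 1),
  Glucose     = c(148, 0, 85),
  BMI         = c(33.6, 26.6, 0)
)

zero_means_missing <- c("Glucose", "BMI")

# Replace zeros with NA only in the selected columns.
# which() keeps the indexing safe if a column already contains NA.
toy[zero_means_missing] <- lapply(toy[zero_means_missing], function(x) {
  x[which(x == 0)] <- NA
  x
})

print(toy)
print(colSums(is.na(toy)))
```

The column-wise `lapply()` does the same job as the `for` loop above; which form to use is a matter of taste.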

We look at the missing data

faltantes <- tibble(
  variable = names(df),
  n_na     = sapply(df, function(x) sum(is.na(x))),
  pct_na   = round(100 * sapply(df, function(x) mean(is.na(x))), 2)
) %>% arrange(desc(pct_na))
print(faltantes)

##                   variable n_na pct_na
## 1                  Insulin  374  48.70
## 2            SkinThickness  227  29.56
## 3            BloodPressure   35   4.56
## 4                      BMI   11   1.43
## 5                  Glucose    5   0.65
## 6              Pregnancies    0   0.00
## 7 DiabetesPedigreeFunction    0   0.00
## 8                      Age    0   0.00
## 9                  Outcome    0   0.00

Insulin has 374 missing values, SkinThickness has 227, BloodPressure has 35, BMI has 11 and Glucose has 5. We visualize them:

naniar::vis_miss(df)

Using the mice package, we look at the pattern of the missing data: which combinations of variables are missing together and how many rows fall into each combination

mice::md.pattern(df, plot = FALSE)

##     Pregnancies DiabetesPedigreeFunction Age Outcome Glucose BMI BloodPressure
## 392           1                        1   1       1       1   1             1
## 140           1                        1   1       1       1   1             1
## 192           1                        1   1       1       1   1             1
## 2             1                        1   1       1       1   1             0
## 26            1                        1   1       1       1   1             0
## 1             1                        1   1       1       1   0             1
## 1             1                        1   1       1       1   0             1
## 2             1                        1   1       1       1   0             1
## 7             1                        1   1       1       1   0             0
## 1             1                        1   1       1       0   1             1
## 4             1                        1   1       1       0   1             1
##               0                        0   0       0       5  11            35
##     SkinThickness Insulin    
## 392             1       1   0
## 140             1       0   1
## 192             0       0   2
## 2               1       0   2
## 26              0       0   3
## 1               1       1   1
## 1               1       0   2
## 2               0       0   3
## 7               0       0   4
## 1               1       1   1
## 4               1       0   2
##               227     374 652
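To make the pattern table easier to read: each row is one combination of observed (1) and missing (0) variables, the left margin is the number of rows with that combination, the right margin is how many variables are missing in it, and the bottom row is the count of NAs per column. The sketch below rebuilds those counts in base R on a hypothetical toy data frame; it illustrates the idea and is not how mice computes the table internally.

```r
# Hypothetical toy data with three missingness patterns
toy <- data.frame(
  a = c(1, NA, 3, NA, 5),
  b = c(1, 2, 3, 4, NA),
  c = 1:5
)

# Same coding as md.pattern(): 1 = observed, 0 = missing
observed <- ifelse(is.na(toy), 0L, 1L)

# One string per row, e.g. "011" means a is missing, b and c are observed
pattern_id <- apply(observed, 1, paste, collapse = "")

# Rows per pattern (left margin of the md.pattern() table)
pattern_counts <- sort(table(pattern_id), decreasing = TRUE)
print(pattern_counts)

# NAs per column (bottom row of the md.pattern() table)
missing_per_column <- colSums(is.na(toy))
print(missing_per_column)
```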

Amelia::missmap(df, main = "Missingness map", col = c("steelblue","tomato"))

Using a summary of the data, we check how the types and characteristics look after replacing the zeros with NA (no values have been imputed yet)

str(df)

## 'data.frame':    768 obs. of  9 variables:
##  $ Pregnancies             : int  6 1 8 1 0 5 3 10 2 8 ...
##  $ Glucose                 : int  148 85 183 89 137 116 78 115 197 125 ...
##  $ BloodPressure           : int  72 66 64 66 40 74 50 NA 70 96 ...
##  $ SkinThickness           : int  35 29 NA 23 35 NA 32 NA 45 NA ...
##  $ Insulin                 : int  NA NA NA 94 168 NA 88 NA 543 NA ...
##  $ BMI                     : num  33.6 26.6 23.3 28.1 43.1 25.6 31 35.3 30.5 NA ...
##  $ DiabetesPedigreeFunction: num  0.627 0.351 0.672 0.167 2.288 ...
##  $ Age                     : int  50 31 32 21 33 30 26 29 53 54 ...
##  $ Outcome                 : int  1 0 1 0 1 0 1 0 1 1 ...

summary(df)

##   Pregnancies        Glucose      BloodPressure    SkinThickness  
##  Min.   : 0.000   Min.   : 44.0   Min.   : 24.00   Min.   : 7.00  
##  1st Qu.: 1.000   1st Qu.: 99.0   1st Qu.: 64.00   1st Qu.:22.00  
##  Median : 3.000   Median :117.0   Median : 72.00   Median :29.00  
##  Mean   : 3.845   Mean   :121.7   Mean   : 72.41   Mean   :29.15  
##  3rd Qu.: 6.000   3rd Qu.:141.0   3rd Qu.: 80.00   3rd Qu.:36.00  
##  Max.   :17.000   Max.   :199.0   Max.   :122.00   Max.   :99.00  
##                   NA's   :5       NA's   :35       NA's   :227    
##     Insulin            BMI        DiabetesPedigreeFunction      Age       
##  Min.   : 14.00   Min.   :18.20   Min.   :0.0780           Min.   :21.00  
##  1st Qu.: 76.25   1st Qu.:27.50   1st Qu.:0.2437           1st Qu.:24.00  
##  Median :125.00   Median :32.30   Median :0.3725           Median :29.00  
##  Mean   :155.55   Mean   :32.46   Mean   :0.4719           Mean   :33.24  
##  3rd Qu.:190.00   3rd Qu.:36.60   3rd Qu.:0.6262           3rd Qu.:41.00  
##  Max.   :846.00   Max.   :67.10   Max.   :2.4200           Max.   :81.00  
##  NA's   :374      NA's   :11                                              
##     Outcome     
##  Min.   :0.000  
##  1st Qu.:0.000  
##  Median :0.000  
##  Mean   :0.349  
##  3rd Qu.:1.000  
##  Max.   :1.000  
##

We save a copy of the dataset, with the NAs still in place, as the input for the imputation

df_imputado <- df

With this, we move on to imputation with the mice package. First we look at the total number of NAs per column:

colSums(is.na(df_imputado))

##              Pregnancies                  Glucose            BloodPressure 
##                        0                        5                       35 
##            SkinThickness                  Insulin                      BMI 
##                      227                      374                       11 
## DiabetesPedigreeFunction                      Age                  Outcome 
##                        0                        0                        0

metodos <- c("pmm", "norm.predict", "norm.nob", "norm")
df_imputados <- list()

We run the imputation for each method

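for (m in metodos) {
  cat("\n=== Imputando con method =", m, "===\n")

  # One imputed dataset per method (m = 1), 10 iterations, fixed seed
  imp <- mice(df_imputado, method = m, m = 1, maxit = 10, seed = 123)

  # Extract the completed dataset and store it under the method name
  df_completo <- complete(imp)
  df_imputados[[m]] <- df_completo

  print(summary(df_completo))
}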

## 
## === Imputando con method = pmm ===
## 
##  iter imp variable
##   1   1  Glucose  BloodPressure  SkinThickness  Insulin  BMI
##   2   1  Glucose  BloodPressure  SkinThickness  Insulin  BMI
##   3   1  Glucose  BloodPressure  SkinThickness  Insulin  BMI
##   4   1  Glucose  BloodPressure  SkinThickness  Insulin  BMI
##   5   1  Glucose  BloodPressure  SkinThickness  Insulin  BMI
##   6   1  Glucose  BloodPressure  SkinThickness  Insulin  BMI
##   7   1  Glucose  BloodPressure  SkinThickness  Insulin  BMI
##   8   1  Glucose  BloodPressure  SkinThickness  Insulin  BMI
##   9   1  Glucose  BloodPressure  SkinThickness  Insulin  BMI
##   10   1  Glucose  BloodPressure  SkinThickness  Insulin  BMI
##   Pregnancies        Glucose      BloodPressure    SkinThickness  
##  Min.   : 0.000   Min.   : 44.0   Min.   : 24.00   Min.   : 7.00  
##  1st Qu.: 1.000   1st Qu.: 99.0   1st Qu.: 64.00   1st Qu.:21.00  
##  Median : 3.000   Median :117.0   Median : 72.00   Median :29.00  
##  Mean   : 3.845   Mean   :121.7   Mean   : 72.45   Mean   :29.03  
##  3rd Qu.: 6.000   3rd Qu.:141.0   3rd Qu.: 80.00   3rd Qu.:36.00  
##  Max.   :17.000   Max.   :199.0   Max.   :122.00   Max.   :99.00  
##     Insulin           BMI        DiabetesPedigreeFunction      Age       
##  Min.   : 14.0   Min.   :18.20   Min.   :0.0780           Min.   :21.00  
##  1st Qu.: 78.0   1st Qu.:27.50   1st Qu.:0.2437           1st Qu.:24.00  
##  Median :125.0   Median :32.30   Median :0.3725           Median :29.00  
##  Mean   :154.8   Mean   :32.47   Mean   :0.4719           Mean   :33.24  
##  3rd Qu.:190.0   3rd Qu.:36.62   3rd Qu.:0.6262           3rd Qu.:41.00  
##  Max.   :846.0   Max.   :67.10   Max.   :2.4200           Max.   :81.00  
##     Outcome     
##  Min.   :0.000  
##  1st Qu.:0.000  
##  Median :0.000  
##  Mean   :0.349  
##  3rd Qu.:1.000  
##  Max.   :1.000  
## 
## === Imputando con method = norm.predict ===
## 
##  iter imp variable
##   1   1  Glucose  BloodPressure  SkinThickness  Insulin  BMI
##   2   1  Glucose  BloodPressure  SkinThickness  Insulin  BMI
##   3   1  Glucose  BloodPressure  SkinThickness  Insulin  BMI
##   4   1  Glucose  BloodPressure  SkinThickness  Insulin  BMI
##   5   1  Glucose  BloodPressure  SkinThickness  Insulin  BMI
##   6   1  Glucose  BloodPressure  SkinThickness  Insulin  BMI
##   7   1  Glucose  BloodPressure  SkinThickness  Insulin  BMI
##   8   1  Glucose  BloodPressure  SkinThickness  Insulin  BMI
##   9   1  Glucose  BloodPressure  SkinThickness  Insulin  BMI
##   10   1  Glucose  BloodPressure  SkinThickness  Insulin  BMI
##   Pregnancies        Glucose      BloodPressure    SkinThickness  
##  Min.   : 0.000   Min.   : 44.0   Min.   : 24.00   Min.   : 7.00  
##  1st Qu.: 1.000   1st Qu.: 99.0   1st Qu.: 64.00   1st Qu.:22.00  
##  Median : 3.000   Median :117.0   Median : 72.00   Median :28.50  
##  Mean   : 3.845   Mean   :121.7   Mean   : 72.35   Mean   :28.89  
##  3rd Qu.: 6.000   3rd Qu.:141.0   3rd Qu.: 80.00   3rd Qu.:35.00  
##  Max.   :17.000   Max.   :199.0   Max.   :122.00   Max.   :99.00  
##     Insulin            BMI        DiabetesPedigreeFunction      Age       
##  Min.   :-22.18   Min.   :18.20   Min.   :0.0780           Min.   :21.00  
##  1st Qu.: 88.00   1st Qu.:27.50   1st Qu.:0.2437           1st Qu.:24.00  
##  Median :130.00   Median :32.05   Median :0.3725           Median :29.00  
##  Mean   :151.74   Mean   :32.44   Mean   :0.4719           Mean   :33.24  
##  3rd Qu.:190.47   3rd Qu.:36.60   3rd Qu.:0.6262           3rd Qu.:41.00  
##  Max.   :846.00   Max.   :67.10   Max.   :2.4200           Max.   :81.00  
##     Outcome     
##  Min.   :0.000  
##  1st Qu.:0.000  
##  Median :0.000  
##  Mean   :0.349  
##  3rd Qu.:1.000  
##  Max.   :1.000  
## 
## === Imputando con method = norm.nob ===
## 
##  iter imp variable
##   1   1  Glucose  BloodPressure  SkinThickness  Insulin  BMI
##   2   1  Glucose  BloodPressure  SkinThickness  Insulin  BMI
##   3   1  Glucose  BloodPressure  SkinThickness  Insulin  BMI
##   4   1  Glucose  BloodPressure  SkinThickness  Insulin  BMI
##   5   1  Glucose  BloodPressure  SkinThickness  Insulin  BMI
##   6   1  Glucose  BloodPressure  SkinThickness  Insulin  BMI
##   7   1  Glucose  BloodPressure  SkinThickness  Insulin  BMI
##   8   1  Glucose  BloodPressure  SkinThickness  Insulin  BMI
##   9   1  Glucose  BloodPressure  SkinThickness  Insulin  BMI
##   10   1  Glucose  BloodPressure  SkinThickness  Insulin  BMI
##   Pregnancies        Glucose      BloodPressure    SkinThickness   
##  Min.   : 0.000   Min.   : 44.0   Min.   : 24.00   Min.   :-1.689  
##  1st Qu.: 1.000   1st Qu.: 99.0   1st Qu.: 64.00   1st Qu.:21.000  
##  Median : 3.000   Median :117.0   Median : 72.00   Median :28.906  
##  Mean   : 3.845   Mean   :121.7   Mean   : 72.35   Mean   :28.749  
##  3rd Qu.: 6.000   3rd Qu.:141.0   3rd Qu.: 80.00   3rd Qu.:35.452  
##  Max.   :17.000   Max.   :199.0   Max.   :122.00   Max.   :99.000  
##     Insulin             BMI        DiabetesPedigreeFunction      Age       
##  Min.   :-210.67   Min.   :12.31   Min.   :0.0780           Min.   :21.00  
##  1st Qu.:  71.75   1st Qu.:27.48   1st Qu.:0.2437           1st Qu.:24.00  
##  Median : 130.00   Median :32.30   Median :0.3725           Median :29.00  
##  Mean   : 150.30   Mean   :32.43   Mean   :0.4719           Mean   :33.24  
##  3rd Qu.: 207.00   3rd Qu.:36.61   3rd Qu.:0.6262           3rd Qu.:41.00  
##  Max.   : 846.00   Max.   :67.10   Max.   :2.4200           Max.   :81.00  
##     Outcome     
##  Min.   :0.000  
##  1st Qu.:0.000  
##  Median :0.000  
##  Mean   :0.349  
##  3rd Qu.:1.000  
##  Max.   :1.000  
## 
## === Imputando con method = norm ===
## 
##  iter imp variable
##   1   1  Glucose  BloodPressure  SkinThickness  Insulin  BMI
##   2   1  Glucose  BloodPressure  SkinThickness  Insulin  BMI
##   3   1  Glucose  BloodPressure  SkinThickness  Insulin  BMI
##   4   1  Glucose  BloodPressure  SkinThickness  Insulin  BMI
##   5   1  Glucose  BloodPressure  SkinThickness  Insulin  BMI
##   6   1  Glucose  BloodPressure  SkinThickness  Insulin  BMI
##   7   1  Glucose  BloodPressure  SkinThickness  Insulin  BMI
##   8   1  Glucose  BloodPressure  SkinThickness  Insulin  BMI
##   9   1  Glucose  BloodPressure  SkinThickness  Insulin  BMI
##   10   1  Glucose  BloodPressure  SkinThickness  Insulin  BMI
##   Pregnancies        Glucose      BloodPressure    SkinThickness   
##  Min.   : 0.000   Min.   : 44.0   Min.   : 24.00   Min.   : 3.459  
##  1st Qu.: 1.000   1st Qu.: 99.0   1st Qu.: 64.00   1st Qu.:21.374  
##  Median : 3.000   Median :117.0   Median : 72.00   Median :28.794  
##  Mean   : 3.845   Mean   :121.7   Mean   : 72.44   Mean   :28.802  
##  3rd Qu.: 6.000   3rd Qu.:141.0   3rd Qu.: 80.00   3rd Qu.:36.000  
##  Max.   :17.000   Max.   :199.0   Max.   :122.00   Max.   :99.000  
##     Insulin            BMI        DiabetesPedigreeFunction      Age       
##  Min.   :-194.2   Min.   :12.81   Min.   :0.0780           Min.   :21.00  
##  1st Qu.:  77.0   1st Qu.:27.50   1st Qu.:0.2437           1st Qu.:24.00  
##  Median : 135.0   Median :32.30   Median :0.3725           Median :29.00  
##  Mean   : 152.7   Mean   :32.43   Mean   :0.4719           Mean   :33.24  
##  3rd Qu.: 206.0   3rd Qu.:36.60   3rd Qu.:0.6262           3rd Qu.:41.00  
##  Max.   : 846.0   Max.   :67.10   Max.   :2.4200           Max.   :81.00  
##     Outcome     
##  Min.   :0.000  
##  1st Qu.:0.000  
##  Median :0.000  
##  Mean   :0.349  
##  3rd Qu.:1.000  
##  Max.   :1.000
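The summaries above show negative minimums for Insulin under the three normal-based methods (and for SkinThickness under norm.nob), which cannot occur for these measurements. A small helper like the following could count such values per method. The function `count_negatives` and the toy list are hypothetical, and the numbers in the toy list are made up for illustration.

```r
# Count negative values per variable in each imputed data frame.
# Returns a matrix: one row per variable, one column per method.
count_negatives <- function(imputed_list, vars) {
  sapply(imputed_list, function(d) colSums(d[vars] < 0, na.rm = TRUE))
}

# Hypothetical stand-in for the list of imputed data frames
toy_imputed <- list(
  pmm  = data.frame(Insulin = c(14, 94, 168),   BMI = c(18.2, 33.6, 26.6)),
  norm = data.frame(Insulin = c(-190, 94, 168), BMI = c(12.8, 33.6, 26.6))
)

neg <- count_negatives(toy_imputed, c("Insulin", "BMI"))
print(neg)
```

With the objects from this document, the call would be `count_negatives(df_imputados, cols_zero_to_na)`.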

We now have four imputed data frames, one per method. To choose the most suitable method for the study, we assess normality to see which distribution our data resemble; that tells us which method is likely to be most effective.

vars_num <- df_imputado

Function for the Kolmogorov-Smirnov (KS) normality test

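ks_normalidad <- function(x) {
  # Standardize so the sample can be compared against N(0, 1)
  x_scaled <- scale(x)
  # One-sample KS test; return only the p-value
  ks.test(x_scaled, "pnorm", mean = 0, sd = 1)$p.value
}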

We apply it to all the numeric variables (ks.test() drops the NAs, so only the observed values are tested)

ks_resultados <- sapply(vars_num, ks_normalidad)

## Warning in ks.test.default(x_scaled, "pnorm", mean = 0, sd = 1): ties should
## not be present for the one-sample Kolmogorov-Smirnov test
## Warning in ks.test.default(x_scaled, "pnorm", mean = 0, sd = 1): ties should
## not be present for the one-sample Kolmogorov-Smirnov test
## Warning in ks.test.default(x_scaled, "pnorm", mean = 0, sd = 1): ties should
## not be present for the one-sample Kolmogorov-Smirnov test
## Warning in ks.test.default(x_scaled, "pnorm", mean = 0, sd = 1): ties should
## not be present for the one-sample Kolmogorov-Smirnov test
## Warning in ks.test.default(x_scaled, "pnorm", mean = 0, sd = 1): ties should
## not be present for the one-sample Kolmogorov-Smirnov test
## Warning in ks.test.default(x_scaled, "pnorm", mean = 0, sd = 1): ties should
## not be present for the one-sample Kolmogorov-Smirnov test
## Warning in ks.test.default(x_scaled, "pnorm", mean = 0, sd = 1): ties should
## not be present for the one-sample Kolmogorov-Smirnov test
## Warning in ks.test.default(x_scaled, "pnorm", mean = 0, sd = 1): ties should
## not be present for the one-sample Kolmogorov-Smirnov test
## Warning in ks.test.default(x_scaled, "pnorm", mean = 0, sd = 1): ties should
## not be present for the one-sample Kolmogorov-Smirnov test

Data frame of results

ks_tabla <- data.frame(
  Variable = names(ks_resultados),
  KS_pvalue = round(ks_resultados, 4)
)
print(ks_tabla)

##                                          Variable KS_pvalue
## Pregnancies                           Pregnancies    0.0000
## Glucose                                   Glucose    0.0006
## BloodPressure                       BloodPressure    0.0897
## SkinThickness                       SkinThickness    0.1981
## Insulin                                   Insulin    0.0000
## BMI                                           BMI    0.3102
## DiabetesPedigreeFunction DiabetesPedigreeFunction    0.0000
## Age                                           Age    0.0000
## Outcome                                   Outcome    0.0000
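The ties warnings arise because the KS test assumes a continuous distribution with no repeated values, while most of these variables are recorded as integers. In addition, standardizing with the sample mean and standard deviation before testing against N(0, 1) tends to make the KS p-values too large. As a possible cross-check, the Shapiro-Wilk test from base R (`shapiro.test`, which accepts 3 to 5000 observations) can be run next to it. The helper name `normality_pvalues` and the simulated data below are assumptions made for illustration only.

```r
# Return KS and Shapiro-Wilk p-values side by side for one numeric vector
normality_pvalues <- function(x) {
  x <- x[!is.na(x)]                 # both tests need complete values
  z <- as.numeric(scale(x))         # standardized copy for the KS test
  c(
    ks      = unname(suppressWarnings(ks.test(z, "pnorm")$p.value)),
    shapiro = unname(shapiro.test(x)$p.value)
  )
}

set.seed(123)
skewed <- rexp(300)                 # right-skewed, clearly non-normal sample

p <- normality_pvalues(skewed)
print(round(p, 4))
```

For the data in this document, something like `t(sapply(vars_num, normality_pvalues))` would give one row per variable.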

Normality is rejected for the variables whose p-values fall below the significance level (5%). In this case, normality is not rejected only for BloodPressure, SkinThickness and BMI. The most appropriate method for continuing the EDA is PMM (predictive mean matching): it is usually the recommended choice when all or most of the variables do not follow a normal distribution, and it fills each gap with a value that was actually observed. The summaries above support this choice, since the normal-based methods produced impossible negative values (for example, a minimum Insulin of -210.67 with norm.nob), while PMM stayed within the observed range.
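To illustrate why PMM only returns values that were actually observed, here is a deliberately simplified, single-predictor sketch in base R. The function `pmm_sketch` and the simulated data are hypothetical; the implementation in mice is more elaborate (among other things, it uses several predictors and adds random variation to the regression coefficients), so this should be read as an outline of the matching idea only.

```r
# Simplified predictive mean matching for one variable y with one predictor x
pmm_sketch <- function(y, x, donors = 5) {
  obs  <- !is.na(y)
  fit  <- lm(y ~ x, subset = obs)                   # model fitted on observed cases
  pred <- predict(fit, newdata = data.frame(x = x)) # predictions for every row

  y_imp <- y
  for (i in which(!obs)) {
    # Observed cases whose predicted value is closest to the prediction for row i
    gap     <- abs(pred[obs] - pred[i])
    nearest <- order(gap)[seq_len(min(donors, sum(obs)))]
    # Draw one donor at random and copy its observed value
    pick     <- nearest[sample.int(length(nearest), 1)]
    y_imp[i] <- y[obs][pick]
  }
  y_imp
}

set.seed(42)
x     <- runif(60, 0, 10)
y_obs <- 3 + 2 * x + rnorm(60)
miss  <- c(5, 17, 40)
y_obs[miss] <- NA

y_imp <- pmm_sketch(y_obs, x)
print(y_imp[miss])
```

Because every imputed value is copied from a donor, the result cannot fall outside the range of the observed data.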

df_final <- df_imputados[["pmm"]]

To detect outliers in our EDA we use Rosner's test, since the sample size (n = 768) is larger than 25 observations. First we inspect them visually:

df_final %>%
  pivot_longer(cols = where(is.numeric), names_to = "Variable", values_to = "Valor") %>%
  ggplot(aes(x = Variable, y = Valor)) +
  geom_boxplot(outlier.colour = "red", fill = "lightblue") +
  theme_minimal() +
  coord_flip() +
  labs(title = "Visual outlier detection", x = "Variable", y = "Valor")
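The points a boxplot draws as outliers follow the usual 1.5 × IQR rule: anything beyond 1.5 interquartile ranges from the box is flagged. As a rough numeric counterpart to the plot, the rule can be sketched in base R on a made-up vector (the values below are hypothetical):

```r
# Hypothetical sample with one obviously extreme value
x <- c(10, 12, 11, 13, 12, 11, 50)

# Quartiles and interquartile range
q   <- unname(quantile(x, c(0.25, 0.75)))
iqr <- q[2] - q[1]

# Fences at 1.5 * IQR beyond the quartiles
lower <- q[1] - 1.5 * iqr
upper <- q[2] + 1.5 * iqr

# Values outside the fences are the ones a boxplot would mark
flagged <- x[x < lower | x > upper]
print(flagged)
```

Applied per column of `df_final`, the same rule gives a quick count of visually flagged points before running a formal test.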

Rosner's test:

outliers_rosner <- list()

for (var in names(df_final)) {
  x <- df_final[[var]]
  
  # Rosner: maximum number of outliers to look for
  k_max <- min(10, floor(length(x) * 0.1))
  
  ros <- rosnerTest(x, k = k_max)
  outliers_rosner[[var]] <- ros
  
  cat("\n\n=== Variable:", var, "===\n")
  print(ros)
}

## 
## 
## === Variable: Pregnancies ===
## 
## Results of Outlier Test
## -------------------------
## 
## Test Method:                     Rosner's Test for Outliers
## 
## Hypothesized Distribution:       Normal
## 
## Data:                            x
## 
## Sample Size:                     768
## 
## Test Statistics:                 R.1  = 3.904034
##                                  R.2  = 3.346880
##                                  R.3  = 3.072258
##                                  R.4  = 3.093431
##                                  R.5  = 2.810050
##                                  R.6  = 2.826572
##                                  R.7  = 2.843389
##                                  R.8  = 2.860510
##                                  R.9  = 2.877944
##                                  R.10 = 2.895700
## 
## Test Statistic Parameter:        k = 10
## 
## Alternative Hypothesis:          Up to 10 observations are not
##                                  from the same Distribution.
## 
## Type I Error:                    5%
## 
## Number of Outliers Detected:     0
## 
##    i   Mean.i     SD.i Value Obs.Num    R.i+1 lambda.i+1 Outlier
## 1  0 3.845052 3.369578    17     160 3.904034   3.974092   FALSE
## 2  1 3.827901 3.338063    15      89 3.346880   3.973762   FALSE
## 3  2 3.813316 3.315699    14     299 3.072258   3.973432   FALSE
## 4  3 3.800000 3.297310    14     456 3.093431   3.973102   FALSE
## 5  4 3.786649 3.278714    13      29 2.810050   3.972771   FALSE
## 6  5 3.774574 3.263821    13      73 2.826572   3.972440   FALSE
## 7  6 3.762467 3.248775    13      87 2.843389   3.972108   FALSE
## 8  7 3.750329 3.233574    13     275 2.860510   3.971776   FALSE
## 9  8 3.738158 3.218215    13     324 2.877944   3.971443   FALSE
## 10 9 3.725955 3.202695    13     358 2.895700   3.971110   FALSE
## 
## 
## 
## 
## === Variable: Glucose ===
## 
## Results of Outlier Test
## -------------------------
## 
## Test Method:                     Rosner's Test for Outliers
## 
## Hypothesized Distribution:       Normal
## 
## Data:                            x
## 
## Sample Size:                     768
## 
## Test Statistics:                 R.1  = 2.546119
##                                  R.2  = 2.539714
##                                  R.3  = 2.519140
##                                  R.4  = 2.498192
##                                  R.5  = 2.510111
##                                  R.6  = 2.522203
##                                  R.7  = 2.534470
##                                  R.8  = 2.513323
##                                  R.9  = 2.525505
##                                  R.10 = 2.537866
## 
## Test Statistic Parameter:        k = 10
## 
## Alternative Hypothesis:          Up to 10 observations are not
##                                  from the same Distribution.
## 
## Type I Error:                    5%
## 
## Number of Outliers Detected:     0
## 
##    i   Mean.i     SD.i Value Obs.Num    R.i+1 lambda.i+1 Outlier
## 1  0 121.6862 30.51162    44      63 2.546119   3.974092   FALSE
## 2  1 121.7875 30.40206   199     662 2.539714   3.973762   FALSE
## 3  2 121.6867 30.29340   198     562 2.519140   3.973432   FALSE
## 4  3 121.5869 30.18706   197       9 2.498192   3.973102   FALSE
## 5  4 121.4882 30.08304   197     229 2.510111   3.972771   FALSE
## 6  5 121.3893 29.97806   197     409 2.522203   3.972440   FALSE
## 7  6 121.2900 29.87211   197     580 2.534470   3.972108   FALSE
## 8  7 121.1905 29.76516   196      23 2.513323   3.971776   FALSE
## 9  8 121.0921 29.66056   196     207 2.525505   3.971443   FALSE
## 10 9 120.9934 29.55499   196     360 2.537866   3.971110   FALSE
## 
## 
## 
## 
## === Variable: BloodPressure ===
## 
## Results of Outlier Test
## -------------------------
## 
## Test Method:                     Rosner's Test for Outliers
## 
## Hypothesized Distribution:       Normal
## 
## Data:                            x
## 
## Sample Size:                     768
## 
## Test Statistics:                 R.1  = 4.004059
##                                  R.2  = 3.948398
##                                  R.3  = 3.497458
##                                  R.4  = 3.528098
##                                  R.5  = 3.466499
##                                  R.6  = 3.159423
##                                  R.7  = 3.182433
##                                  R.8  = 3.205954
##                                  R.9  = 3.058405
##                                  R.10 = 3.079476
## 
## Test Statistic Parameter:        k = 10
## 
## Alternative Hypothesis:          Up to 10 observations are not
##                                  from the same Distribution.
## 
## Type I Error:                    5%
## 
## Number of Outliers Detected:     1
## 
##    i   Mean.i     SD.i Value Obs.Num    R.i+1 lambda.i+1 Outlier
## 1  0 72.44661 12.37579   122     107 4.004059   3.974092    TRUE
## 2  1 72.38201 12.25358    24     598 3.948398   3.973762   FALSE
## 3  2 72.44517 12.13600    30      19 3.497458   3.973432   FALSE
## 4  3 72.50065 12.04634    30     126 3.528098   3.973102   FALSE
## 5  4 72.55628 11.95550   114     692 3.466499   3.972771   FALSE
## 6  5 72.50197 11.86864   110      44 3.159423   3.972440   FALSE
## 7  6 72.45276 11.79828   110     178 3.182433   3.972108   FALSE
## 8  7 72.40342 11.72711   110     550 3.205954   3.971776   FALSE
## 9  8 72.35395 11.65511   108      85 3.058405   3.971443   FALSE
## 10 9 72.30698 11.59061   108     363 3.079476   3.971110   FALSE
## 
## 
## 
## 
## === Variable: SkinThickness ===
## 
## Results of Outlier Test
## -------------------------
## 
## Test Method:                     Rosner's Test for Outliers
## 
## Hypothesized Distribution:       Normal
## 
## Data:                            x
## 
## Sample Size:                     768
## 
## Test Statistics:                 R.1  = 6.659459
##                                  R.2  = 3.337693
##                                  R.3  = 3.068501
##                                  R.4  = 3.089600
##                                  R.5  = 2.712096
##                                  R.6  = 2.727065
##                                  R.7  = 2.541068
##                                  R.8  = 2.553606
##                                  R.9  = 2.566331
##                                  R.10 = 2.375814
## 
## Test Statistic Parameter:        k = 10
## 
## Alternative Hypothesis:          Up to 10 observations are not
##                                  from the same Distribution.
## 
## Type I Error:                    5%
## 
## Number of Outliers Detected:     1
## 
##    i   Mean.i      SD.i Value Obs.Num    R.i+1 lambda.i+1 Outlier
## 1  0 29.03125 10.506672    99     580 6.659459   3.974092    TRUE
## 2  1 28.94003 10.204645    63     446 3.337693   3.973762   FALSE
## 3  2 28.89556 10.136690    60      58 3.068501   3.973432   FALSE
## 4  3 28.85490 10.080624    60     692 3.089600   3.973102   FALSE
## 5  4 28.81414 10.023930    56     121 2.712096   3.972771   FALSE
## 6  5 28.77851  9.981976    56     304 2.727065   3.972440   FALSE
## 7  6 28.74278  9.939606    54      87 2.541068   3.972108   FALSE
## 8  7 28.70959  9.903802    54     212 2.553606   3.971776   FALSE
## 9  8 28.67632  9.867660    54     584 2.566331   3.971443   FALSE
## 10 9 28.64295  9.831177    52     276 2.375814   3.971110   FALSE
## 
## 
## 
## 
## === Variable: Insulin ===
## 
## Results of Outlier Test
## -------------------------
## 
## Test Method:                     Rosner's Test for Outliers
## 
## Hypothesized Distribution:       Normal
## 
## Data:                            x
## 
## Sample Size:                     768
## 
## Test Statistics:                 R.1  = 6.012751
##                                  R.2  = 5.255403
##                                  R.3  = 4.776185
##                                  R.4  = 4.852292
##                                  R.5  = 4.932157
##                                  R.6  = 4.257371
##                                  R.7  = 4.110369
##                                  R.8  = 4.159560
##                                  R.9  = 4.210560
##                                  R.10 = 3.926510
## 
## Test Statistic Parameter:        k = 10
## 
## Alternative Hypothesis:          Up to 10 observations are not
##                                  from the same Distribution.
## 
## Type I Error:                    5%
## 
## Number of Outliers Detected:     9
## 
##    i   Mean.i     SD.i Value Obs.Num    R.i+1 lambda.i+1 Outlier
## 1  0 154.8477 114.9478   846      14 6.012751   3.974092    TRUE
## 2  1 153.9465 112.2756   744     229 5.255403   3.973762    TRUE
## 3  2 153.1762 110.3022   680     248 4.776185   3.973432    TRUE
## 4  3 152.4876 108.7141   680     328 4.852292   3.973102    TRUE
## 5  4 151.7971 107.0937   680     409 4.932157   3.972771    TRUE
## 6  5 151.1048 105.4395   600     585 4.257371   3.972440    TRUE
## 7  6 150.5157 104.2447   579     333 4.110369   3.972108    TRUE
## 8  7 149.9527 103.1473   579     410 4.159560   3.971776    TRUE
## 9  8 149.3882 102.0320   579     441 4.210560   3.971443    TRUE
## 10 9 148.8221 100.8982   545     287 3.926510   3.971110   FALSE
## 
## 
## 
## 
## === Variable: BMI ===
## 
## Results of Outlier Test
## -------------------------
## 
## Test Method:                     Rosner's Test for Outliers
## 
## Hypothesized Distribution:       Normal
## 
## Data:                            x
## 
## Sample Size:                     768
## 
## Test Statistics:                 R.1  = 4.995880
##                                  R.2  = 3.954005
##                                  R.3  = 3.686847
##                                  R.4  = 3.379276
##                                  R.5  = 3.136599
##                                  R.6  = 3.113767
##                                  R.7  = 3.044636
##                                  R.8  = 3.065385
##                                  R.9  = 2.733037
##                                  R.10 = 2.702099
## 
## Test Statistic Parameter:        k = 10
## 
## Alternative Hypothesis:          Up to 10 observations are not
##                                  from the same Distribution.
## 
## Type I Error:                    5%
## 
## Number of Outliers Detected:     1
## 
##    i   Mean.i     SD.i Value Obs.Num    R.i+1 lambda.i+1 Outlier
## 1  0 32.46888 6.931936  67.1     178 4.995880   3.974092    TRUE
## 2  1 32.42373 6.822518  59.4     446 3.954005   3.973762   FALSE
## 3  2 32.38851 6.756855  57.3     674 3.686847   3.973432   FALSE
## 4  3 32.35595 6.700858  55.0     126 3.379276   3.973102   FALSE
## 5  4 32.32631 6.654881  53.2     121 3.136599   3.972771   FALSE
## 6  5 32.29895 6.616118  52.9     304 3.113767   3.972440   FALSE
## 7  6 32.27192 6.578154  52.3     194 3.044636   3.972108   FALSE
## 8  7 32.24560 6.542214  52.3     248 3.065385   3.971776   FALSE
## 9  8 32.21921 6.505872  50.0     156 2.733037   3.971443   FALSE
## 10 9 32.19578 6.478007  49.7     100 2.702099   3.971110   FALSE
## 
## 
## 
## 
## === Variable: DiabetesPedigreeFunction ===
## 
## Results of Outlier Test
## -------------------------
## 
## Test Method:                     Rosner's Test for Outliers
## 
## Hypothesized Distribution:       Normal
## 
## Data:                            x
## 
## Sample Size:                     768
## 
## Test Statistics:                 R.1  = 5.879733
##                                  R.2  = 5.740114
##                                  R.3  = 5.742410
##                                  R.4  = 5.387880
##                                  R.5  = 4.696067
##                                  R.6  = 4.395823
##                                  R.7  = 4.287150
##                                  R.8  = 4.233710
##                                  R.9  = 4.283971
##                                  R.10 = 3.997834
## 
## Test Statistic Parameter:        k = 10
## 
## Alternative Hypothesis:          Up to 10 observations are not
##                                  from the same Distribution.
## 
## Type I Error:                    5%
## 
## Number of Outliers Detected:     10
## 
##    i    Mean.i      SD.i Value Obs.Num    R.i+1 lambda.i+1 Outlier
## 1  0 0.4718763 0.3313286 2.420     446 5.879733   3.974092    TRUE
## 2  1 0.4693364 0.3239768 2.329     229 5.740114   3.973762    TRUE
## 3  2 0.4669086 0.3171301 2.288       5 5.742410   3.973432    TRUE
## 4  3 0.4645281 0.3104137 2.137     371 5.387880   3.973102    TRUE
## 5  4 0.4623390 0.3046509 1.893      46 4.696067   3.972771    TRUE
## 6  5 0.4604640 0.3004070 1.781      59 4.395823   3.972440    TRUE
## 7  6 0.4587310 0.2967633 1.731     372 4.287150   3.972108    TRUE
## 8  7 0.4570591 0.2933457 1.699     594 4.233710   3.971776    TRUE
## 9  8 0.4554250 0.2900522 1.698     622 4.283971   3.971443    TRUE
## 10 9 0.4537879 0.2867083 1.600     396 3.997834   3.971110    TRUE
## 
## 
## 
## 
## === Variable: Age ===
## 
## Results of Outlier Test
## -------------------------
## 
## Test Method:                     Rosner's Test for Outliers
## 
## Hypothesized Distribution:       Normal
## 
## Data:                            x
## 
## Sample Size:                     768
## 
## Test Statistics:                 R.1  = 4.061069
##                                  R.2  = 3.335018
##                                  R.3  = 3.188755
##                                  R.4  = 3.125278
##                                  R.5  = 3.147531
##                                  R.6  = 3.082239
##                                  R.7  = 3.015167
##                                  R.8  = 3.035354
##                                  R.9  = 3.055953
##                                  R.10 = 2.986993
## 
## Test Statistic Parameter:        k = 10
## 
## Alternative Hypothesis:          Up to 10 observations are not
##                                  from the same Distribution.
## 
## Type I Error:                    5%
## 
## Number of Outliers Detected:     1
## 
##    i   Mean.i     SD.i Value Obs.Num    R.i+1 lambda.i+1 Outlier
## 1  0 33.24089 11.76023    81     460 4.061069   3.974092    TRUE
## 2  1 33.17862 11.64053    72     454 3.335018   3.973762   FALSE
## 3  2 33.12794 11.56315    70     667 3.188755   3.973432   FALSE
## 4  3 33.07974 11.49346    69     124 3.125278   3.973102   FALSE
## 5  4 33.03272 11.42714    69     685 3.147531   3.972771   FALSE
## 6  5 32.98558 11.36006    68     675 3.082239   3.972440   FALSE
## 7  6 32.93963 11.29634    67     364 3.015167   3.972108   FALSE
## 8  7 32.89488 11.23596    67     490 3.035354   3.971776   FALSE
## 9  8 32.85000 11.17491    67     538 3.055953   3.971443   FALSE
## 10 9 32.80501 11.11318    66     222 2.986993   3.971110   FALSE
## 
## 
## 
## 
## === Variable: Outcome ===
## 
## Results of Outlier Test
## -------------------------
## 
## Test Method:                     Rosner's Test for Outliers
## 
## Hypothesized Distribution:       Normal
## 
## Data:                            x
## 
## Sample Size:                     768
## 
## Test Statistics:                 R.1  = 1.365006
##                                  R.2  = 1.367559
##                                  R.3  = 1.370126
##                                  R.4  = 1.372708
##                                  R.5  = 1.375304
##                                  R.6  = 1.377915
##                                  R.7  = 1.380541
##                                  R.8  = 1.383182
##                                  R.9  = 1.385838
##                                  R.10 = 1.388509
## 
## Test Statistic Parameter:        k = 10
## 
## Alternative Hypothesis:          Up to 10 observations are not
##                                  from the same Distribution.
## 
## Type I Error:                    5%
## 
## Number of Outliers Detected:     0
## 
##    i    Mean.i      SD.i Value Obs.Num    R.i+1 lambda.i+1 Outlier
## 1  0 0.3489583 0.4769514     1       1 1.365006   3.974092   FALSE
## 2  1 0.3481095 0.4766818     1       3 1.367559   3.973762   FALSE
## 3  2 0.3472585 0.4764098     1       5 1.370126   3.973432   FALSE
## 4  3 0.3464052 0.4761355     1       7 1.372708   3.973102   FALSE
## 5  4 0.3455497 0.4758587     1       9 1.375304   3.972771   FALSE
## 6  5 0.3446920 0.4755795     1      10 1.377915   3.972440   FALSE
## 7  6 0.3438320 0.4752978     1      12 1.380541   3.972108   FALSE
## 8  7 0.3429698 0.4750137     1      14 1.383182   3.971776   FALSE
## 9  8 0.3421053 0.4747271     1      15 1.385838   3.971443   FALSE
## 10 9 0.3412385 0.4744379     1      16 1.388509   3.971110   FALSE

With k = 10, Rosner's test flags no outliers among the 10 most extreme observations of Pregnancies, Glucose, and Outcome. Every other variable has at least one flagged outlier. To look more closely:

outliers_rosner[["Glucose"]]$all.stats

##    i   Mean.i     SD.i Value Obs.Num    R.i+1 lambda.i+1 Outlier
## 1  0 121.6862 30.51162    44      63 2.546119   3.974092   FALSE
## 2  1 121.7875 30.40206   199     662 2.539714   3.973762   FALSE
## 3  2 121.6867 30.29340   198     562 2.519140   3.973432   FALSE
## 4  3 121.5869 30.18706   197       9 2.498192   3.973102   FALSE
## 5  4 121.4882 30.08304   197     229 2.510111   3.972771   FALSE
## 6  5 121.3893 29.97806   197     409 2.522203   3.972440   FALSE
## 7  6 121.2900 29.87211   197     580 2.534470   3.972108   FALSE
## 8  7 121.1905 29.76516   196      23 2.513323   3.971776   FALSE
## 9  8 121.0921 29.66056   196     207 2.525505   3.971443   FALSE
## 10 9 120.9934 29.55499   196     360 2.537866   3.971110   FALSE

outliers_rosner[["Pregnancies"]]$all.stats

##    i   Mean.i     SD.i Value Obs.Num    R.i+1 lambda.i+1 Outlier
## 1  0 3.845052 3.369578    17     160 3.904034   3.974092   FALSE
## 2  1 3.827901 3.338063    15      89 3.346880   3.973762   FALSE
## 3  2 3.813316 3.315699    14     299 3.072258   3.973432   FALSE
## 4  3 3.800000 3.297310    14     456 3.093431   3.973102   FALSE
## 5  4 3.786649 3.278714    13      29 2.810050   3.972771   FALSE
## 6  5 3.774574 3.263821    13      73 2.826572   3.972440   FALSE
## 7  6 3.762467 3.248775    13      87 2.843389   3.972108   FALSE
## 8  7 3.750329 3.233574    13     275 2.860510   3.971776   FALSE
## 9  8 3.738158 3.218215    13     324 2.877944   3.971443   FALSE
## 10 9 3.725955 3.202695    13     358 2.895700   3.971110   FALSE

outliers_rosner[["BloodPressure"]]$all.stats

##    i   Mean.i     SD.i Value Obs.Num    R.i+1 lambda.i+1 Outlier
## 1  0 72.44661 12.37579   122     107 4.004059   3.974092    TRUE
## 2  1 72.38201 12.25358    24     598 3.948398   3.973762   FALSE
## 3  2 72.44517 12.13600    30      19 3.497458   3.973432   FALSE
## 4  3 72.50065 12.04634    30     126 3.528098   3.973102   FALSE
## 5  4 72.55628 11.95550   114     692 3.466499   3.972771   FALSE
## 6  5 72.50197 11.86864   110      44 3.159423   3.972440   FALSE
## 7  6 72.45276 11.79828   110     178 3.182433   3.972108   FALSE
## 8  7 72.40342 11.72711   110     550 3.205954   3.971776   FALSE
## 9  8 72.35395 11.65511   108      85 3.058405   3.971443   FALSE
## 10 9 72.30698 11.59061   108     363 3.079476   3.971110   FALSE

outliers_rosner[["SkinThickness"]]$all.stats

##    i   Mean.i      SD.i Value Obs.Num    R.i+1 lambda.i+1 Outlier
## 1  0 29.03125 10.506672    99     580 6.659459   3.974092    TRUE
## 2  1 28.94003 10.204645    63     446 3.337693   3.973762   FALSE
## 3  2 28.89556 10.136690    60      58 3.068501   3.973432   FALSE
## 4  3 28.85490 10.080624    60     692 3.089600   3.973102   FALSE
## 5  4 28.81414 10.023930    56     121 2.712096   3.972771   FALSE
## 6  5 28.77851  9.981976    56     304 2.727065   3.972440   FALSE
## 7  6 28.74278  9.939606    54      87 2.541068   3.972108   FALSE
## 8  7 28.70959  9.903802    54     212 2.553606   3.971776   FALSE
## 9  8 28.67632  9.867660    54     584 2.566331   3.971443   FALSE
## 10 9 28.64295  9.831177    52     276 2.375814   3.971110   FALSE

outliers_rosner[["Insulin"]]$all.stats

##    i   Mean.i     SD.i Value Obs.Num    R.i+1 lambda.i+1 Outlier
## 1  0 154.8477 114.9478   846      14 6.012751   3.974092    TRUE
## 2  1 153.9465 112.2756   744     229 5.255403   3.973762    TRUE
## 3  2 153.1762 110.3022   680     248 4.776185   3.973432    TRUE
## 4  3 152.4876 108.7141   680     328 4.852292   3.973102    TRUE
## 5  4 151.7971 107.0937   680     409 4.932157   3.972771    TRUE
## 6  5 151.1048 105.4395   600     585 4.257371   3.972440    TRUE
## 7  6 150.5157 104.2447   579     333 4.110369   3.972108    TRUE
## 8  7 149.9527 103.1473   579     410 4.159560   3.971776    TRUE
## 9  8 149.3882 102.0320   579     441 4.210560   3.971443    TRUE
## 10 9 148.8221 100.8982   545     287 3.926510   3.971110   FALSE

outliers_rosner[["BMI"]]$all.stats

##    i   Mean.i     SD.i Value Obs.Num    R.i+1 lambda.i+1 Outlier
## 1  0 32.46888 6.931936  67.1     178 4.995880   3.974092    TRUE
## 2  1 32.42373 6.822518  59.4     446 3.954005   3.973762   FALSE
## 3  2 32.38851 6.756855  57.3     674 3.686847   3.973432   FALSE
## 4  3 32.35595 6.700858  55.0     126 3.379276   3.973102   FALSE
## 5  4 32.32631 6.654881  53.2     121 3.136599   3.972771   FALSE
## 6  5 32.29895 6.616118  52.9     304 3.113767   3.972440   FALSE
## 7  6 32.27192 6.578154  52.3     194 3.044636   3.972108   FALSE
## 8  7 32.24560 6.542214  52.3     248 3.065385   3.971776   FALSE
## 9  8 32.21921 6.505872  50.0     156 2.733037   3.971443   FALSE
## 10 9 32.19578 6.478007  49.7     100 2.702099   3.971110   FALSE

outliers_rosner[["DiabetesPedigreeFunction"]]$all.stats

##    i    Mean.i      SD.i Value Obs.Num    R.i+1 lambda.i+1 Outlier
## 1  0 0.4718763 0.3313286 2.420     446 5.879733   3.974092    TRUE
## 2  1 0.4693364 0.3239768 2.329     229 5.740114   3.973762    TRUE
## 3  2 0.4669086 0.3171301 2.288       5 5.742410   3.973432    TRUE
## 4  3 0.4645281 0.3104137 2.137     371 5.387880   3.973102    TRUE
## 5  4 0.4623390 0.3046509 1.893      46 4.696067   3.972771    TRUE
## 6  5 0.4604640 0.3004070 1.781      59 4.395823   3.972440    TRUE
## 7  6 0.4587310 0.2967633 1.731     372 4.287150   3.972108    TRUE
## 8  7 0.4570591 0.2933457 1.699     594 4.233710   3.971776    TRUE
## 9  8 0.4554250 0.2900522 1.698     622 4.283971   3.971443    TRUE
## 10 9 0.4537879 0.2867083 1.600     396 3.997834   3.971110    TRUE

outliers_rosner[["Age"]]$all.stats

##    i   Mean.i     SD.i Value Obs.Num    R.i+1 lambda.i+1 Outlier
## 1  0 33.24089 11.76023    81     460 4.061069   3.974092    TRUE
## 2  1 33.17862 11.64053    72     454 3.335018   3.973762   FALSE
## 3  2 33.12794 11.56315    70     667 3.188755   3.973432   FALSE
## 4  3 33.07974 11.49346    69     124 3.125278   3.973102   FALSE
## 5  4 33.03272 11.42714    69     685 3.147531   3.972771   FALSE
## 6  5 32.98558 11.36006    68     675 3.082239   3.972440   FALSE
## 7  6 32.93963 11.29634    67     364 3.015167   3.972108   FALSE
## 8  7 32.89488 11.23596    67     490 3.035354   3.971776   FALSE
## 9  8 32.85000 11.17491    67     538 3.055953   3.971443   FALSE
## 10 9 32.80501 11.11318    66     222 2.986993   3.971110   FALSE

outliers_rosner[["Outcome"]]$all.stats

##    i    Mean.i      SD.i Value Obs.Num    R.i+1 lambda.i+1 Outlier
## 1  0 0.3489583 0.4769514     1       1 1.365006   3.974092   FALSE
## 2  1 0.3481095 0.4766818     1       3 1.367559   3.973762   FALSE
## 3  2 0.3472585 0.4764098     1       5 1.370126   3.973432   FALSE
## 4  3 0.3464052 0.4761355     1       7 1.372708   3.973102   FALSE
## 5  4 0.3455497 0.4758587     1       9 1.375304   3.972771   FALSE
## 6  5 0.3446920 0.4755795     1      10 1.377915   3.972440   FALSE
## 7  6 0.3438320 0.4752978     1      12 1.380541   3.972108   FALSE
## 8  7 0.3429698 0.4750137     1      14 1.383182   3.971776   FALSE
## 9  8 0.3421053 0.4747271     1      15 1.385838   3.971443   FALSE
## 10 9 0.3412385 0.4744379     1      16 1.388509   3.971110   FALSE
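As a reading aid for these tables: each `R.i+1` is the largest absolute deviation from `Mean.i` divided by `SD.i`, and it is compared against the critical value `lambda.i+1`. The following is a minimal sketch that recomputes the first statistic by hand for three variables, using only numbers copied from the tables above. The object name `first_rows` and its column names are illustrative and not part of the original analysis.

```r
# Values copied from the first row of the all.stats tables above
first_rows <- data.frame(
  variable = c("Insulin", "BloodPressure", "Pregnancies"),
  value    = c(846, 122, 17),
  mean_0   = c(154.8477, 72.44661, 3.845052),
  sd_0     = c(114.9478, 12.37579, 3.369578),
  lambda_1 = 3.974092   # same critical value for every variable (n = 768)
)

# R.1 = |most extreme value - mean| / sd
first_rows$R_1 <- abs(first_rows$value - first_rows$mean_0) / first_rows$sd_0

# Does the first statistic exceed its critical value?
first_rows$exceeds <- first_rows$R_1 > first_rows$lambda_1

print(first_rows)
```

The recomputed `R_1` values should agree with the `R.i+1` column to rounding, and only Pregnancies stays below the critical value, which matches its `FALSE` flag.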

To handle these outliers, we copy the data set, replace the flagged values with NA, and then impute them:

df_ati <- df_final  

for (var in names(df_ati)) {
  
  ros <- outliers_rosner[[var]]$all.stats
  
  # Obs.Num is the row of each flagged value in the data;
  # which(ros$Outlier) alone would only give positions 1..k within all.stats
  filas_out <- ros$Obs.Num[ros$Outlier]
  
  if (length(filas_out) > 0) {
    cat("\nVariable:", var, "- Replacing", length(filas_out), "outlier(s) with NA")
    df_ati[filas_out, var] <- NA
  }
}

## 
## Variable: BloodPressure - Replacing 1 outlier(s) with NA
## Variable: SkinThickness - Replacing 1 outlier(s) with NA
## Variable: Insulin - Replacing 9 outlier(s) with NA
## Variable: BMI - Replacing 1 outlier(s) with NA
## Variable: DiabetesPedigreeFunction - Replacing 10 outlier(s) with NA
## Variable: Age - Replacing 1 outlier(s) with NA
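
One detail worth checking in a loop like this is which index is used to blank out values. In the `all.stats` table, the row position is the step of the test, while `Obs.Num` is the row of the value in the data. The sketch below illustrates the difference with a hand-built stand-in for `all.stats`. The data and the names `x` and `toy_stats` are hypothetical, and it assumes `Obs.Num` refers to rows of the vector that was tested.

```r
# Hypothetical toy vector with one extreme value in row 6
x <- c(10, 11, 9, 10, 12, 95, 11, 10)

# Hand-built stand-in for an all.stats table (EnvStats is not needed here)
toy_stats <- data.frame(
  Value   = c(95, 12),
  Obs.Num = c(6, 5),
  Outlier = c(TRUE, FALSE)
)

# Position of the flagged step inside the table
pos_in_table <- which(toy_stats$Outlier)

# Row of the flagged value inside the data
rows_in_data <- toy_stats$Obs.Num[toy_stats$Outlier]

# Blank out the flagged value using the data row
x_clean <- x
x_clean[rows_in_data] <- NA
print(x_clean)
```

Indexing with `pos_in_table` would have blanked row 1 (a regular value) and left the extreme value in row 6 untouched.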

Check how many NAs were introduced:

colSums(is.na(df_ati))

##              Pregnancies                  Glucose            BloodPressure 
##                        0                        0                        1 
##            SkinThickness                  Insulin                      BMI 
##                        1                        9                        1 
## DiabetesPedigreeFunction                      Age                  Outcome 
##                       10                        1                        0

Impute those NAs with the median of each variable:

for (var in names(df_ati)) {
  if (any(is.na(df_ati[[var]]))) {
    mediana_val <- median(df_ati[[var]], na.rm = TRUE)
    df_ati[[var]][is.na(df_ati[[var]])] <- mediana_val
  }
}

No NAs remain:

colSums(is.na(df_ati))

##              Pregnancies                  Glucose            BloodPressure 
##                        0                        0                        0 
##            SkinThickness                  Insulin                      BMI 
##                        0                        0                        0 
## DiabetesPedigreeFunction                      Age                  Outcome 
##                        0                        0                        0

The data set is now clean:

df_oficial <- df_ati

With the data set clean, we start the descriptive analysis and visualize the relationships between variables to draw conclusions.

summary(df_oficial)

##   Pregnancies        Glucose      BloodPressure    SkinThickness  
##  Min.   : 0.000   Min.   : 44.0   Min.   : 24.00   Min.   : 7.00  
##  1st Qu.: 1.000   1st Qu.: 99.0   1st Qu.: 64.00   1st Qu.:21.00  
##  Median : 3.000   Median :117.0   Median : 72.00   Median :29.00  
##  Mean   : 3.845   Mean   :121.7   Mean   : 72.45   Mean   :29.02  
##  3rd Qu.: 6.000   3rd Qu.:141.0   3rd Qu.: 80.00   3rd Qu.:36.00  
##  Max.   :17.000   Max.   :199.0   Max.   :122.00   Max.   :99.00  
##     Insulin           BMI        DiabetesPedigreeFunction      Age       
##  Min.   : 14.0   Min.   :18.20   Min.   :0.0780           Min.   :21.00  
##  1st Qu.: 78.0   1st Qu.:27.50   1st Qu.:0.2450           1st Qu.:24.00  
##  Median :125.0   Median :32.30   Median :0.3755           Median :29.00  
##  Mean   :153.8   Mean   :32.47   Mean   :0.4702           Mean   :33.21  
##  3rd Qu.:188.0   3rd Qu.:36.62   3rd Qu.:0.6160           3rd Qu.:41.00  
##  Max.   :846.0   Max.   :67.10   Max.   :2.4200           Max.   :81.00  
##     Outcome     
##  Min.   :0.000  
##  1st Qu.:0.000  
##  Median :0.000  
##  Mean   :0.349  
##  3rd Qu.:1.000  
##  Max.   :1.000

descriptivos <- df_oficial %>%
  summarise(across(everything(),
                   list(
                     # anonymous functions replace the deprecated `...` of across()
                     media = \(x) mean(x, na.rm = TRUE),
                     mediana = \(x) median(x, na.rm = TRUE),
                     sd = \(x) sd(x, na.rm = TRUE),
                     min = \(x) min(x, na.rm = TRUE),
                     max = \(x) max(x, na.rm = TRUE)
                   )))

print(descriptivos)

##   Pregnancies_media Pregnancies_mediana Pregnancies_sd Pregnancies_min
## 1          3.845052                   3       3.369578               0
##   Pregnancies_max Glucose_media Glucose_mediana Glucose_sd Glucose_min
## 1              17      121.6862             117   30.51162          44
##   Glucose_max BloodPressure_media BloodPressure_mediana BloodPressure_sd
## 1         199            72.44661                    72         12.37579
##   BloodPressure_min BloodPressure_max SkinThickness_media SkinThickness_mediana
## 1                24               122            29.02344                    29
##   SkinThickness_sd SkinThickness_min SkinThickness_max Insulin_media
## 1         10.50446                 7                99      153.8268
##   Insulin_mediana Insulin_sd Insulin_min Insulin_max BMI_media BMI_mediana
## 1             125   113.3815          14         846  32.46719        32.3
##     BMI_sd BMI_min BMI_max DiabetesPedigreeFunction_media
## 1 6.931819    18.2    67.1                      0.4701536
##   DiabetesPedigreeFunction_mediana DiabetesPedigreeFunction_sd
## 1                           0.3755                   0.3238199
##   DiabetesPedigreeFunction_min DiabetesPedigreeFunction_max Age_media
## 1                        0.078                         2.42  33.21354
##   Age_mediana   Age_sd Age_min Age_max Outcome_media Outcome_mediana Outcome_sd
## 1          29 11.74562      21      81     0.3489583               0  0.4769514
##   Outcome_min Outcome_max
## 1           0           1

Some key findings from this summary:

- Pregnancies has a median of 3 (mean 3.85), so 50% of the patients had 3 or fewer pregnancies.
- Glucose has values below 70 mg/dL (minimum 44), which could indicate hypoglycemia in those patients.
- DiabetesPedigreeFunction ranges from 0.078 to 2.42, which could indicate a genetic risk of diabetes in some patients.

Insulin and Glucose also vary widely (SD 113.4 and 30.5), which suggests a population with very diverse metabolic profiles. The mean BMI (32.5) is in the obese range (BMI of 30 or more). The gap between mean and median for Age and Pregnancies suggests right-skewed distributions with some high values. Finally, 34.9% of the cases in this data set were diagnosed with diabetes.
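
The skewness and prevalence statements can be checked with simple arithmetic on the numbers printed in the descriptive table. This is a small sketch using only those reported values. The names `means`, `medians`, `gap`, and `n_pos` are illustrative, and 768 is the sample size reported in the Rosner output.

```r
# Means and medians copied from the descriptive table above
means   <- c(Age = 33.21354, Pregnancies = 3.845052, Insulin = 153.8268)
medians <- c(Age = 29,       Pregnancies = 3,        Insulin = 125)

# A mean above the median is consistent with a right-skewed distribution
gap <- means - medians
print(round(gap, 2))

# Outcome is coded 0/1, so its mean is the proportion of positive cases
n_total <- 768
n_pos   <- round(0.3489583 * n_total)
print(n_pos)
```

All three gaps are positive, and the Outcome mean corresponds to 268 diagnosed cases out of 768.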

Next, a histogram per variable shows the distributions after cleaning:

df_oficial %>%
  pivot_longer(cols = everything(), names_to = "Variable", values_to = "Valor") %>%
  ggplot(aes(x = Valor)) +
  geom_histogram(bins = 30, fill = "steelblue", color = "white") +
  facet_wrap(~Variable, scales = "free") +
  theme_minimal() +
  labs(title = "Variable distributions", x = "Value", y = "Frequency")

The histograms show that Pregnancies and Insulin have long right tails, as the descriptive summary suggested. BloodPressure and Glucose are the closest to normal, with roughly symmetric distributions.

We also look at the boxplots after the final cleaning:

df_oficial %>%
  pivot_longer(cols = everything(), names_to = "Variable", values_to = "Valor") %>%
  ggplot(aes(x = Variable, y = Valor)) +
  geom_boxplot(fill = "lightgreen", outlier.colour = "red") +
  coord_flip() +
  theme_minimal() +
  labs(title = "Boxplots after cleaning", x = "Variable", y = "Value")

In these boxplots, Insulin shows the greatest variability, with possible outliers in the upper range. Glucose also covers a wide range.

Outcome gets a boxplot as well, although it is a categorical variable coded 0 for patients without diabetes and 1 for patients with diabetes.

Variables with a smaller range, such as DiabetesPedigreeFunction, BloodPressure, and Pregnancies, are harder to read here because every variable is drawn on the same axis.

The boxplots after cleaning therefore reflect how missing values and outliers were handled, with only small differences from the original ones.

Finally, we look at the correlation matrix to see which variables are most strongly related, so they can be examined in more depth in later studies:

ggpairs(df_oficial, progress = FALSE)

The correlation matrix shows important relationships between variables associated with diabetes, such as Glucose and Insulin, and between anthropometric variables such as BMI and SkinThickness, which suggests a close pathophysiological relationship between these parameters.

Moderate correlations appear between Age and Pregnancies, and between BMI and DiabetesPedigreeFunction.

Overall, metabolic and anthropometric factors are strongly interconnected, while others such as BloodPressure have more modest associations, which could reflect different pathways of influence in the development of diabetes.
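
To back the visual reading with numbers, a correlation matrix can be computed directly. The sketch below uses a small made-up data frame (`toy`, hypothetical values) in place of `df_oficial`, and it swaps in Spearman correlation as an assumption, since it is rank-based and less affected by the long right tails seen in Insulin.

```r
# Hypothetical toy data standing in for df_oficial (values are made up)
toy <- data.frame(
  Glucose       = c(85, 90, 110, 130, 150, 170),
  Insulin       = c(40, 60, 95, 140, 180, 300),
  BloodPressure = c(70, 64, 80, 72, 66, 78)
)

# Rank-based correlation matrix, rounded for display
cor_mat <- round(cor(toy, method = "spearman"), 2)
print(cor_mat)
```

With the real data, the equivalent call would be `cor(df_oficial, method = "spearman")`, assuming every column is numeric.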

2025-08-14