The dataset includes several medical predictor (independent) variables and one target (dependent) variable, the outcome (Outcome). The independent variables include the patient's number of pregnancies and body mass index, among others.
Libraries required for the study
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ forcats 1.0.0 ✔ readr 2.1.5
## ✔ ggplot2 3.5.2 ✔ stringr 1.5.1
## ✔ lubridate 1.9.4 ✔ tibble 3.3.0
## ✔ purrr 1.1.0 ✔ tidyr 1.3.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(mice)
##
## Attaching package: 'mice'
##
## The following object is masked from 'package:stats':
##
## filter
##
## The following objects are masked from 'package:base':
##
## cbind, rbind
library(VIM)
## Loading required package: colorspace
## Loading required package: grid
## VIM is ready to use.
##
## Suggestions and bug-reports can be submitted at: https://github.com/statistikat/VIM/issues
##
## Attaching package: 'VIM'
##
## The following object is masked from 'package:datasets':
##
## sleep
library(outliers)
library(EnvStats)
##
## Attaching package: 'EnvStats'
##
## The following objects are masked from 'package:stats':
##
## predict, predict.lm
##
## The following object is masked from 'package:base':
##
## print.default
library(naniar)
library(robustbase)
library(psych)
##
## Attaching package: 'psych'
##
## The following object is masked from 'package:outliers':
##
## outlier
##
## The following objects are masked from 'package:ggplot2':
##
## %+%, alpha
library(GGally)
Dataset
df <- read.csv("https://raw.githubusercontent.com/Kalbam/Datos/refs/heads/main/diabetes.csv",
               header = TRUE, fileEncoding = "UTF-8")
Here we get to know the dataset and start to see what kind of data we are dealing with
head(df)
## Pregnancies Glucose BloodPressure SkinThickness Insulin BMI
## 1 6 148 72 35 0 33.6
## 2 1 85 66 29 0 26.6
## 3 8 183 64 0 0 23.3
## 4 1 89 66 23 94 28.1
## 5 0 137 40 35 168 43.1
## 6 5 116 74 0 0 25.6
## DiabetesPedigreeFunction Age Outcome
## 1 0.627 50 1
## 2 0.351 31 0
## 3 0.672 32 1
## 4 0.167 21 0
## 5 2.288 33 1
## 6 0.201 30 0
tail(df)
## Pregnancies Glucose BloodPressure SkinThickness Insulin BMI
## 763 9 89 62 0 0 22.5
## 764 10 101 76 48 180 32.9
## 765 2 122 70 27 0 36.8
## 766 5 121 72 23 112 26.2
## 767 1 126 60 0 0 30.1
## 768 1 93 70 31 0 30.4
## DiabetesPedigreeFunction Age Outcome
## 763 0.142 33 0
## 764 0.171 63 0
## 765 0.340 27 0
## 766 0.245 30 0
## 767 0.349 47 1
## 768 0.315 23 0
dim(df)
## [1] 768 9
colnames(df)
## [1] "Pregnancies" "Glucose"
## [3] "BloodPressure" "SkinThickness"
## [5] "Insulin" "BMI"
## [7] "DiabetesPedigreeFunction" "Age"
## [9] "Outcome"
summary(df)
## Pregnancies Glucose BloodPressure SkinThickness
## Min. : 0.000 Min. : 0.0 Min. : 0.00 Min. : 0.00
## 1st Qu.: 1.000 1st Qu.: 99.0 1st Qu.: 62.00 1st Qu.: 0.00
## Median : 3.000 Median :117.0 Median : 72.00 Median :23.00
## Mean : 3.845 Mean :120.9 Mean : 69.11 Mean :20.54
## 3rd Qu.: 6.000 3rd Qu.:140.2 3rd Qu.: 80.00 3rd Qu.:32.00
## Max. :17.000 Max. :199.0 Max. :122.00 Max. :99.00
## Insulin BMI DiabetesPedigreeFunction Age
## Min. : 0.0 Min. : 0.00 Min. :0.0780 Min. :21.00
## 1st Qu.: 0.0 1st Qu.:27.30 1st Qu.:0.2437 1st Qu.:24.00
## Median : 30.5 Median :32.00 Median :0.3725 Median :29.00
## Mean : 79.8 Mean :31.99 Mean :0.4719 Mean :33.24
## 3rd Qu.:127.2 3rd Qu.:36.60 3rd Qu.:0.6262 3rd Qu.:41.00
## Max. :846.0 Max. :67.10 Max. :2.4200 Max. :81.00
## Outcome
## Min. :0.000
## 1st Qu.:0.000
## Median :0.000
## Mean :0.349
## 3rd Qu.:1.000
## Max. :1.000
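The summary above reports a minimum of 0 for Glucose, BloodPressure, SkinThickness, Insulin and BMI, which is not physiologically plausible and suggests that zeros stand in for missing measurements. A minimal sketch of how the zeros could be counted per column before recoding them; the data frame `toy` and its values are made up for illustration, and in the analysis `df` would take its place.

```r
# Hypothetical toy data; in the analysis this would be df.
toy <- data.frame(
  Glucose = c(148, 0, 183, 89),
  Insulin = c(0, 0, 94, 168),
  Age     = c(50, 31, 32, 21)
)

# Number and percentage of exact zeros in each column.
zero_counts <- colSums(toy == 0, na.rm = TRUE)
zero_pct    <- round(100 * zero_counts / nrow(toy), 2)

print(data.frame(n_zero = zero_counts, pct_zero = zero_pct))
```

Columns where a zero is a legitimate value, such as Pregnancies or Outcome, would be left out of any recoding.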
We check the structure of the data
str(df)
## 'data.frame': 768 obs. of 9 variables:
## $ Pregnancies : int 6 1 8 1 0 5 3 10 2 8 ...
## $ Glucose : int 148 85 183 89 137 116 78 115 197 125 ...
## $ BloodPressure : int 72 66 64 66 40 74 50 0 70 96 ...
## $ SkinThickness : int 35 29 0 23 35 0 32 0 45 0 ...
## $ Insulin : int 0 0 0 94 168 0 88 0 543 0 ...
## $ BMI : num 33.6 26.6 23.3 28.1 43.1 25.6 31 35.3 30.5 0 ...
## $ DiabetesPedigreeFunction: num 0.627 0.351 0.672 0.167 2.288 ...
## $ Age : int 50 31 32 21 33 30 26 29 53 54 ...
## $ Outcome : int 1 0 1 0 1 0 1 0 1 1 ...
All variables are numeric. We replace the zero values, which stand in for missing data in these columns, with NA
cols_zero_to_na <- c("Glucose","BloodPressure","SkinThickness","Insulin","BMI")
for (v in intersect(cols_zero_to_na, names(df))) {
  df[[v]][df[[v]] == 0] <- NA
}
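Once zeros are overwritten with NA, the information about which cells were recoded is only recoverable from `is.na()`. A possible variant, sketched here on made-up data, keeps an explicit logical mask of the recoded cells so that imputed values can be traced later. The names `toy`, `cols` and `was_zero` are assumptions for this illustration.

```r
# Hypothetical toy data; in the analysis this would be df.
toy <- data.frame(
  Glucose = c(148L, 0L, 183L),
  BMI     = c(33.6, 26.6, 0),
  Age     = c(50L, 31L, 32L)
)
cols <- intersect(c("Glucose", "BMI"), names(toy))

# Remember which cells were zero so imputed values can be traced later.
was_zero <- as.data.frame(lapply(toy[cols], function(x) !is.na(x) & x == 0))

# Recode zeros as NA in the selected columns only.
toy[cols] <- lapply(toy[cols], function(x) {
  x[which(x == 0)] <- NA
  x
})

print(toy)
print(colSums(was_zero))
```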
We look at the missing data
faltantes <- tibble(
  variable = names(df),
  n_na = sapply(df, function(x) sum(is.na(x))),
  pct_na = round(100 * sapply(df, function(x) mean(is.na(x))), 2)
) %>% arrange(desc(pct_na))
print(faltantes)
## variable n_na pct_na
## 1 Insulin 374 48.70
## 2 SkinThickness 227 29.56
## 3 BloodPressure 35 4.56
## 4 BMI 11 1.43
## 5 Glucose 5 0.65
## 6 Pregnancies 0 0.00
## 7 DiabetesPedigreeFunction 0 0.00
## 8 Age 0 0.00
## 9 Outcome 0 0.00
Insulin has 374 missing values, SkinThickness 227, BloodPressure 35, BMI 11 and Glucose 5. We visualize them:
naniar::vis_miss(df)
Using the mice package, we inspect the pattern of the missing data,
along with the number of rows and missing values in each pattern
mice::md.pattern(df, plot = FALSE)
## Pregnancies DiabetesPedigreeFunction Age Outcome Glucose BMI BloodPressure
## 392 1 1 1 1 1 1 1
## 140 1 1 1 1 1 1 1
## 192 1 1 1 1 1 1 1
## 2 1 1 1 1 1 1 0
## 26 1 1 1 1 1 1 0
## 1 1 1 1 1 1 0 1
## 1 1 1 1 1 1 0 1
## 2 1 1 1 1 1 0 1
## 7 1 1 1 1 1 0 0
## 1 1 1 1 1 0 1 1
## 4 1 1 1 1 0 1 1
## 0 0 0 0 5 11 35
## SkinThickness Insulin
## 392 1 1 0
## 140 1 0 1
## 192 0 0 2
## 2 1 0 2
## 26 0 0 3
## 1 1 1 1
## 1 1 0 2
## 2 0 0 3
## 7 0 0 4
## 1 1 1 1
## 4 1 0 2
## 227 374 652
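In the pattern table above, each row is one missingness pattern (1 = observed, 0 = missing), the label on the left is the number of rows with that pattern, the last column is the number of missing variables in the pattern, and the bottom row gives the missing count per variable (652 cells in total). The first row therefore says that 392 rows are complete. A minimal base R sketch of the same bookkeeping on made-up data; `toy` is hypothetical.

```r
# Hypothetical toy data with missing values in columns a and b.
toy <- data.frame(
  a = c(1, NA, 3, 4, NA),
  b = c(1, 2, NA, 4, NA),
  c = c(1, 2, 3, 4, 5)
)

# Rows with no missing values (the all-ones pattern in md.pattern).
n_complete <- sum(complete.cases(toy))

# One string per row: 1 = observed, 0 = missing, in column order.
pattern <- apply(!is.na(toy), 1, function(r) paste(as.integer(r), collapse = ""))
pattern_counts <- sort(table(pattern), decreasing = TRUE)

# Total number of missing cells (the bottom-right corner of md.pattern).
n_missing_cells <- sum(is.na(toy))

print(n_complete)
print(pattern_counts)
print(n_missing_cells)
```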
Amelia::missmap(df, main = "Mapa de faltantes", col = c("steelblue","tomato"))
We check, through a summary of the data, how their structure and
characteristics look after recoding the zeros as NA (no imputation has been applied yet)
str(df)
## 'data.frame': 768 obs. of 9 variables:
## $ Pregnancies : int 6 1 8 1 0 5 3 10 2 8 ...
## $ Glucose : int 148 85 183 89 137 116 78 115 197 125 ...
## $ BloodPressure : int 72 66 64 66 40 74 50 NA 70 96 ...
## $ SkinThickness : int 35 29 NA 23 35 NA 32 NA 45 NA ...
## $ Insulin : int NA NA NA 94 168 NA 88 NA 543 NA ...
## $ BMI : num 33.6 26.6 23.3 28.1 43.1 25.6 31 35.3 30.5 NA ...
## $ DiabetesPedigreeFunction: num 0.627 0.351 0.672 0.167 2.288 ...
## $ Age : int 50 31 32 21 33 30 26 29 53 54 ...
## $ Outcome : int 1 0 1 0 1 0 1 0 1 1 ...
summary(df)
## Pregnancies Glucose BloodPressure SkinThickness
## Min. : 0.000 Min. : 44.0 Min. : 24.00 Min. : 7.00
## 1st Qu.: 1.000 1st Qu.: 99.0 1st Qu.: 64.00 1st Qu.:22.00
## Median : 3.000 Median :117.0 Median : 72.00 Median :29.00
## Mean : 3.845 Mean :121.7 Mean : 72.41 Mean :29.15
## 3rd Qu.: 6.000 3rd Qu.:141.0 3rd Qu.: 80.00 3rd Qu.:36.00
## Max. :17.000 Max. :199.0 Max. :122.00 Max. :99.00
## NA's :5 NA's :35 NA's :227
## Insulin BMI DiabetesPedigreeFunction Age
## Min. : 14.00 Min. :18.20 Min. :0.0780 Min. :21.00
## 1st Qu.: 76.25 1st Qu.:27.50 1st Qu.:0.2437 1st Qu.:24.00
## Median :125.00 Median :32.30 Median :0.3725 Median :29.00
## Mean :155.55 Mean :32.46 Mean :0.4719 Mean :33.24
## 3rd Qu.:190.00 3rd Qu.:36.60 3rd Qu.:0.6262 3rd Qu.:41.00
## Max. :846.00 Max. :67.10 Max. :2.4200 Max. :81.00
## NA's :374 NA's :11
## Outcome
## Min. :0.000
## 1st Qu.:0.000
## Median :0.000
## Mean :0.349
## 3rd Qu.:1.000
## Max. :1.000
##
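We save a copy of the dataset, with zeros recoded as NA, to use as the input for imputation
df_imputado <- df
With this, we move on to imputation with the mice package. First we look at the total number of missing values per variable: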
colSums(is.na(df_imputado))
## Pregnancies Glucose BloodPressure
## 0 5 35
## SkinThickness Insulin BMI
## 227 374 11
## DiabetesPedigreeFunction Age Outcome
## 0 0 0
metodos <- c("pmm", "norm.predict", "norm.nob", "norm")
df_imputados <- list()
We run the imputation for each method
for (m in metodos) {
  cat("\n=== Imputando con method =", m, "===\n")
  imp <- mice(df_imputado, method = m, m = 1, maxit = 10, seed = 123)
  df_completo <- complete(imp)
  df_imputados[[m]] <- df_completo
  print(summary(df_completo))
}
##
## === Imputando con method = pmm ===
##
## iter imp variable
## 1 1 Glucose BloodPressure SkinThickness Insulin BMI
## 2 1 Glucose BloodPressure SkinThickness Insulin BMI
## 3 1 Glucose BloodPressure SkinThickness Insulin BMI
## 4 1 Glucose BloodPressure SkinThickness Insulin BMI
## 5 1 Glucose BloodPressure SkinThickness Insulin BMI
## 6 1 Glucose BloodPressure SkinThickness Insulin BMI
## 7 1 Glucose BloodPressure SkinThickness Insulin BMI
## 8 1 Glucose BloodPressure SkinThickness Insulin BMI
## 9 1 Glucose BloodPressure SkinThickness Insulin BMI
## 10 1 Glucose BloodPressure SkinThickness Insulin BMI
## Pregnancies Glucose BloodPressure SkinThickness
## Min. : 0.000 Min. : 44.0 Min. : 24.00 Min. : 7.00
## 1st Qu.: 1.000 1st Qu.: 99.0 1st Qu.: 64.00 1st Qu.:21.00
## Median : 3.000 Median :117.0 Median : 72.00 Median :29.00
## Mean : 3.845 Mean :121.7 Mean : 72.45 Mean :29.03
## 3rd Qu.: 6.000 3rd Qu.:141.0 3rd Qu.: 80.00 3rd Qu.:36.00
## Max. :17.000 Max. :199.0 Max. :122.00 Max. :99.00
## Insulin BMI DiabetesPedigreeFunction Age
## Min. : 14.0 Min. :18.20 Min. :0.0780 Min. :21.00
## 1st Qu.: 78.0 1st Qu.:27.50 1st Qu.:0.2437 1st Qu.:24.00
## Median :125.0 Median :32.30 Median :0.3725 Median :29.00
## Mean :154.8 Mean :32.47 Mean :0.4719 Mean :33.24
## 3rd Qu.:190.0 3rd Qu.:36.62 3rd Qu.:0.6262 3rd Qu.:41.00
## Max. :846.0 Max. :67.10 Max. :2.4200 Max. :81.00
## Outcome
## Min. :0.000
## 1st Qu.:0.000
## Median :0.000
## Mean :0.349
## 3rd Qu.:1.000
## Max. :1.000
##
## === Imputando con method = norm.predict ===
##
## iter imp variable
## 1 1 Glucose BloodPressure SkinThickness Insulin BMI
## 2 1 Glucose BloodPressure SkinThickness Insulin BMI
## 3 1 Glucose BloodPressure SkinThickness Insulin BMI
## 4 1 Glucose BloodPressure SkinThickness Insulin BMI
## 5 1 Glucose BloodPressure SkinThickness Insulin BMI
## 6 1 Glucose BloodPressure SkinThickness Insulin BMI
## 7 1 Glucose BloodPressure SkinThickness Insulin BMI
## 8 1 Glucose BloodPressure SkinThickness Insulin BMI
## 9 1 Glucose BloodPressure SkinThickness Insulin BMI
## 10 1 Glucose BloodPressure SkinThickness Insulin BMI
## Pregnancies Glucose BloodPressure SkinThickness
## Min. : 0.000 Min. : 44.0 Min. : 24.00 Min. : 7.00
## 1st Qu.: 1.000 1st Qu.: 99.0 1st Qu.: 64.00 1st Qu.:22.00
## Median : 3.000 Median :117.0 Median : 72.00 Median :28.50
## Mean : 3.845 Mean :121.7 Mean : 72.35 Mean :28.89
## 3rd Qu.: 6.000 3rd Qu.:141.0 3rd Qu.: 80.00 3rd Qu.:35.00
## Max. :17.000 Max. :199.0 Max. :122.00 Max. :99.00
## Insulin BMI DiabetesPedigreeFunction Age
## Min. :-22.18 Min. :18.20 Min. :0.0780 Min. :21.00
## 1st Qu.: 88.00 1st Qu.:27.50 1st Qu.:0.2437 1st Qu.:24.00
## Median :130.00 Median :32.05 Median :0.3725 Median :29.00
## Mean :151.74 Mean :32.44 Mean :0.4719 Mean :33.24
## 3rd Qu.:190.47 3rd Qu.:36.60 3rd Qu.:0.6262 3rd Qu.:41.00
## Max. :846.00 Max. :67.10 Max. :2.4200 Max. :81.00
## Outcome
## Min. :0.000
## 1st Qu.:0.000
## Median :0.000
## Mean :0.349
## 3rd Qu.:1.000
## Max. :1.000
##
## === Imputando con method = norm.nob ===
##
## iter imp variable
## 1 1 Glucose BloodPressure SkinThickness Insulin BMI
## 2 1 Glucose BloodPressure SkinThickness Insulin BMI
## 3 1 Glucose BloodPressure SkinThickness Insulin BMI
## 4 1 Glucose BloodPressure SkinThickness Insulin BMI
## 5 1 Glucose BloodPressure SkinThickness Insulin BMI
## 6 1 Glucose BloodPressure SkinThickness Insulin BMI
## 7 1 Glucose BloodPressure SkinThickness Insulin BMI
## 8 1 Glucose BloodPressure SkinThickness Insulin BMI
## 9 1 Glucose BloodPressure SkinThickness Insulin BMI
## 10 1 Glucose BloodPressure SkinThickness Insulin BMI
## Pregnancies Glucose BloodPressure SkinThickness
## Min. : 0.000 Min. : 44.0 Min. : 24.00 Min. :-1.689
## 1st Qu.: 1.000 1st Qu.: 99.0 1st Qu.: 64.00 1st Qu.:21.000
## Median : 3.000 Median :117.0 Median : 72.00 Median :28.906
## Mean : 3.845 Mean :121.7 Mean : 72.35 Mean :28.749
## 3rd Qu.: 6.000 3rd Qu.:141.0 3rd Qu.: 80.00 3rd Qu.:35.452
## Max. :17.000 Max. :199.0 Max. :122.00 Max. :99.000
## Insulin BMI DiabetesPedigreeFunction Age
## Min. :-210.67 Min. :12.31 Min. :0.0780 Min. :21.00
## 1st Qu.: 71.75 1st Qu.:27.48 1st Qu.:0.2437 1st Qu.:24.00
## Median : 130.00 Median :32.30 Median :0.3725 Median :29.00
## Mean : 150.30 Mean :32.43 Mean :0.4719 Mean :33.24
## 3rd Qu.: 207.00 3rd Qu.:36.61 3rd Qu.:0.6262 3rd Qu.:41.00
## Max. : 846.00 Max. :67.10 Max. :2.4200 Max. :81.00
## Outcome
## Min. :0.000
## 1st Qu.:0.000
## Median :0.000
## Mean :0.349
## 3rd Qu.:1.000
## Max. :1.000
##
## === Imputando con method = norm ===
##
## iter imp variable
## 1 1 Glucose BloodPressure SkinThickness Insulin BMI
## 2 1 Glucose BloodPressure SkinThickness Insulin BMI
## 3 1 Glucose BloodPressure SkinThickness Insulin BMI
## 4 1 Glucose BloodPressure SkinThickness Insulin BMI
## 5 1 Glucose BloodPressure SkinThickness Insulin BMI
## 6 1 Glucose BloodPressure SkinThickness Insulin BMI
## 7 1 Glucose BloodPressure SkinThickness Insulin BMI
## 8 1 Glucose BloodPressure SkinThickness Insulin BMI
## 9 1 Glucose BloodPressure SkinThickness Insulin BMI
## 10 1 Glucose BloodPressure SkinThickness Insulin BMI
## Pregnancies Glucose BloodPressure SkinThickness
## Min. : 0.000 Min. : 44.0 Min. : 24.00 Min. : 3.459
## 1st Qu.: 1.000 1st Qu.: 99.0 1st Qu.: 64.00 1st Qu.:21.374
## Median : 3.000 Median :117.0 Median : 72.00 Median :28.794
## Mean : 3.845 Mean :121.7 Mean : 72.44 Mean :28.802
## 3rd Qu.: 6.000 3rd Qu.:141.0 3rd Qu.: 80.00 3rd Qu.:36.000
## Max. :17.000 Max. :199.0 Max. :122.00 Max. :99.000
## Insulin BMI DiabetesPedigreeFunction Age
## Min. :-194.2 Min. :12.81 Min. :0.0780 Min. :21.00
## 1st Qu.: 77.0 1st Qu.:27.50 1st Qu.:0.2437 1st Qu.:24.00
## Median : 135.0 Median :32.30 Median :0.3725 Median :29.00
## Mean : 152.7 Mean :32.43 Mean :0.4719 Mean :33.24
## 3rd Qu.: 206.0 3rd Qu.:36.60 3rd Qu.:0.6262 3rd Qu.:41.00
## Max. : 846.0 Max. :67.10 Max. :2.4200 Max. :81.00
## Outcome
## Min. :0.000
## 1st Qu.:0.000
## Median :0.000
## Mean :0.349
## 3rd Qu.:1.000
## Max. :1.000
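The summaries above already hint at a practical difference between the methods: norm.predict, norm.nob and norm produce negative minima for Insulin (and norm.nob a negative SkinThickness), while pmm stays within the observed range. A minimal sketch of a helper that counts completed values falling outside the observed range; the function name `out_of_range` and the toy data frames are assumptions made for this illustration.

```r
# Count, per variable, completed values that fall outside the range of the
# observed (non-missing) values in the original data.
out_of_range <- function(original, completed) {
  sapply(names(original), function(v) {
    rng <- range(original[[v]], na.rm = TRUE)
    sum(completed[[v]] < rng[1] | completed[[v]] > rng[2])
  })
}

# Hypothetical toy data standing in for df_imputado and two completed copies.
original <- data.frame(
  Insulin = c(94, NA, 168, NA, 88),
  BMI     = c(28.1, 43.1, NA, 31, 30.5)
)
donor_like <- transform(original,
  Insulin = ifelse(is.na(Insulin), 94, Insulin),
  BMI     = ifelse(is.na(BMI), 31, BMI))
model_like <- transform(original,
  Insulin = ifelse(is.na(Insulin), -22.18, Insulin),
  BMI     = ifelse(is.na(BMI), 31, BMI))

print(out_of_range(original, donor_like))
print(out_of_range(original, model_like))
```

In the analysis, a call such as `lapply(df_imputados, out_of_range, original = df_imputado)` would apply the same check to the four completed data frames.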
There are now four imputed data frames, one per method. To choose the most suitable method for the study, we assess normality on the observed data to see which distribution our variables are close to, and pick the method that fits that distribution best
vars_num <- df_imputado
Function for the KS normality test
ks_normalidad <- function(x) {
  x_scaled <- scale(x)
  ks.test(x_scaled, "pnorm", mean = 0, sd = 1)$p.value
}
We apply it to all numeric variables
ks_resultados <- sapply(vars_num, ks_normalidad)
## Warning in ks.test.default(x_scaled, "pnorm", mean = 0, sd = 1): ties should
## not be present for the one-sample Kolmogorov-Smirnov test
## Warning in ks.test.default(x_scaled, "pnorm", mean = 0, sd = 1): ties should
## not be present for the one-sample Kolmogorov-Smirnov test
## Warning in ks.test.default(x_scaled, "pnorm", mean = 0, sd = 1): ties should
## not be present for the one-sample Kolmogorov-Smirnov test
## Warning in ks.test.default(x_scaled, "pnorm", mean = 0, sd = 1): ties should
## not be present for the one-sample Kolmogorov-Smirnov test
## Warning in ks.test.default(x_scaled, "pnorm", mean = 0, sd = 1): ties should
## not be present for the one-sample Kolmogorov-Smirnov test
## Warning in ks.test.default(x_scaled, "pnorm", mean = 0, sd = 1): ties should
## not be present for the one-sample Kolmogorov-Smirnov test
## Warning in ks.test.default(x_scaled, "pnorm", mean = 0, sd = 1): ties should
## not be present for the one-sample Kolmogorov-Smirnov test
## Warning in ks.test.default(x_scaled, "pnorm", mean = 0, sd = 1): ties should
## not be present for the one-sample Kolmogorov-Smirnov test
## Warning in ks.test.default(x_scaled, "pnorm", mean = 0, sd = 1): ties should
## not be present for the one-sample Kolmogorov-Smirnov test
Data frame of results
ks_tabla <- data.frame(
  Variable = names(ks_resultados),
  KS_pvalue = round(ks_resultados, 4)
)
print(ks_tabla)
## Variable KS_pvalue
## Pregnancies Pregnancies 0.0000
## Glucose Glucose 0.0006
## BloodPressure BloodPressure 0.0897
## SkinThickness SkinThickness 0.1981
## Insulin Insulin 0.0000
## BMI BMI 0.3102
## DiabetesPedigreeFunction DiabetesPedigreeFunction 0.0000
## Age Age 0.0000
## Outcome Outcome 0.0000
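The KS p-values above are approximate: the mean and standard deviation are estimated from the same sample (through `scale()`), and the warnings show that the data contain ties. As a cross-check, one option is the Shapiro-Wilk test from base R, which is designed for testing normality with estimated parameters. This is a sketch on simulated data, not part of the original analysis; `toy` and `sw_tabla` are hypothetical names.

```r
set.seed(1)
# Hypothetical toy data: one roughly normal column, one strongly skewed.
toy <- data.frame(
  normal_like = rnorm(200, mean = 70, sd = 12),
  skewed      = rexp(200, rate = 1 / 150)
)

# Shapiro-Wilk p-value per column; shapiro.test() drops NA values itself
# and accepts between 3 and 5000 observations.
sw_pvalue <- sapply(toy, function(x) shapiro.test(x)$p.value)

sw_tabla <- data.frame(
  Variable  = names(sw_pvalue),
  SW_pvalue = signif(sw_pvalue, 3)
)
print(sw_tabla)
```

Applied to the study data, the same `sapply()` call could be run on `vars_num` and its p-values compared with the KS column.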
Normality is rejected for every variable whose p-value is below the significance level (taken here as 0.05). Only BloodPressure (p = 0.0897), SkinThickness (p = 0.1981) and BMI (p = 0.3102) do not reject it, so they are the only variables compatible with a normal distribution. Since most variables are not normal, the most appropriate method for the rest of the EDA is PMM (predictive mean matching): it is usually recommended when all or most of the variables do not follow a normal distribution, and it preserves realistic values because it imputes only values that were actually observed.
df_final <- df_imputados[["pmm"]]
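To make the idea behind PMM concrete: each missing value is filled with an observed value borrowed from a "donor" case whose predicted mean is close to that of the incomplete case, so imputations never leave the set of observed values. The following is a simplified base R sketch, not the mice implementation (which, among other things, also draws the regression parameters at random). The function `pmm_simple` and the simulated data are assumptions made for illustration.

```r
# Simplified predictive mean matching for one target vector y given a data
# frame of predictors X (all predictors assumed complete).
pmm_simple <- function(y, X, donors = 5) {
  obs <- !is.na(y)
  d <- data.frame(y = y, X)
  fit <- lm(y ~ ., data = d[obs, , drop = FALSE])
  pred <- predict(fit, newdata = X)
  for (i in which(!obs)) {
    # Observed cases ranked by closeness of their predicted means.
    dist <- abs(pred[obs] - pred[i])
    pool <- which(obs)[order(dist)][seq_len(min(donors, sum(obs)))]
    # Borrow the observed value of one randomly chosen donor.
    y[i] <- y[pool[sample.int(length(pool), 1)]]
  }
  y
}

set.seed(123)
X <- data.frame(age = round(runif(50, 21, 80)))
y <- 80 + 2 * X$age + rnorm(50, sd = 20)
y_missing <- y
y_missing[sample(50, 10)] <- NA

y_completed <- pmm_simple(y_missing, X)

# Every imputed value is one of the observed values.
print(all(y_completed[is.na(y_missing)] %in% y_missing[!is.na(y_missing)]))
```

This borrowing step is why the pmm summary keeps the observed minimum of 14 for Insulin while the normal-model methods produce negative values.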
To detect outliers in our EDA we use Rosner's test, since the sample has more than 25 observations (n = 768). First we inspect them visually:
df_final %>%
  pivot_longer(cols = where(is.numeric), names_to = "Variable", values_to = "Valor") %>%
  ggplot(aes(x = Variable, y = Valor)) +
  geom_boxplot(outlier.colour = "red", fill = "lightblue") +
  theme_minimal() +
  coord_flip() +
  labs(title = "Detección visual de outliers", x = "Variable", y = "Valor")
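The red points in a boxplot are the values lying more than 1.5 times the interquartile range beyond the box. A minimal base R sketch to list those values numerically with `boxplot.stats()`; note that it uses Tukey's hinges, so edge cases may differ slightly from the quartiles ggplot2 computes. The helper `boxplot_outliers` and the toy data are hypothetical.

```r
# Values flagged by the 1.5 * IQR boxplot rule, per numeric column.
boxplot_outliers <- function(data, coef = 1.5) {
  num <- data[sapply(data, is.numeric)]
  lapply(num, function(x) boxplot.stats(x, coef = coef)$out)
}

# Hypothetical toy data; in the analysis this would be df_final.
toy <- data.frame(
  Insulin = c(94, 168, 88, 125, 140, 110, 846),
  Age     = c(29, 31, 32, 30, 33, 30, 28)
)

flagged <- boxplot_outliers(toy)
print(flagged)
print(lengths(flagged))
```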
Rosner's test:
outliers_rosner <- list()
for (var in names(df_final)) {
  x <- df_final[[var]]
  # Rosner: maximum number of outliers to look for
  k_max <- min(10, floor(length(x) * 0.1))
  ros <- rosnerTest(x, k = k_max)
  outliers_rosner[[var]] <- ros
  cat("\n\n=== Variable:", var, "===\n")
  print(ros)
}
##
##
## === Variable: Pregnancies ===
##
## Results of Outlier Test
## -------------------------
##
## Test Method: Rosner's Test for Outliers
##
## Hypothesized Distribution: Normal
##
## Data: x
##
## Sample Size: 768
##
## Test Statistics: R.1 = 3.904034
## R.2 = 3.346880
## R.3 = 3.072258
## R.4 = 3.093431
## R.5 = 2.810050
## R.6 = 2.826572
## R.7 = 2.843389
## R.8 = 2.860510
## R.9 = 2.877944
## R.10 = 2.895700
##
## Test Statistic Parameter: k = 10
##
## Alternative Hypothesis: Up to 10 observations are not
## from the same Distribution.
##
## Type I Error: 5%
##
## Number of Outliers Detected: 0
##
## i Mean.i SD.i Value Obs.Num R.i+1 lambda.i+1 Outlier
## 1 0 3.845052 3.369578 17 160 3.904034 3.974092 FALSE
## 2 1 3.827901 3.338063 15 89 3.346880 3.973762 FALSE
## 3 2 3.813316 3.315699 14 299 3.072258 3.973432 FALSE
## 4 3 3.800000 3.297310 14 456 3.093431 3.973102 FALSE
## 5 4 3.786649 3.278714 13 29 2.810050 3.972771 FALSE
## 6 5 3.774574 3.263821 13 73 2.826572 3.972440 FALSE
## 7 6 3.762467 3.248775 13 87 2.843389 3.972108 FALSE
## 8 7 3.750329 3.233574 13 275 2.860510 3.971776 FALSE
## 9 8 3.738158 3.218215 13 324 2.877944 3.971443 FALSE
## 10 9 3.725955 3.202695 13 358 2.895700 3.971110 FALSE
##
##
##
##
## === Variable: Glucose ===
##
## Results of Outlier Test
## -------------------------
##
## Test Method: Rosner's Test for Outliers
##
## Hypothesized Distribution: Normal
##
## Data: x
##
## Sample Size: 768
##
## Test Statistics: R.1 = 2.546119
## R.2 = 2.539714
## R.3 = 2.519140
## R.4 = 2.498192
## R.5 = 2.510111
## R.6 = 2.522203
## R.7 = 2.534470
## R.8 = 2.513323
## R.9 = 2.525505
## R.10 = 2.537866
##
## Test Statistic Parameter: k = 10
##
## Alternative Hypothesis: Up to 10 observations are not
## from the same Distribution.
##
## Type I Error: 5%
##
## Number of Outliers Detected: 0
##
## i Mean.i SD.i Value Obs.Num R.i+1 lambda.i+1 Outlier
## 1 0 121.6862 30.51162 44 63 2.546119 3.974092 FALSE
## 2 1 121.7875 30.40206 199 662 2.539714 3.973762 FALSE
## 3 2 121.6867 30.29340 198 562 2.519140 3.973432 FALSE
## 4 3 121.5869 30.18706 197 9 2.498192 3.973102 FALSE
## 5 4 121.4882 30.08304 197 229 2.510111 3.972771 FALSE
## 6 5 121.3893 29.97806 197 409 2.522203 3.972440 FALSE
## 7 6 121.2900 29.87211 197 580 2.534470 3.972108 FALSE
## 8 7 121.1905 29.76516 196 23 2.513323 3.971776 FALSE
## 9 8 121.0921 29.66056 196 207 2.525505 3.971443 FALSE
## 10 9 120.9934 29.55499 196 360 2.537866 3.971110 FALSE
##
##
##
##
## === Variable: BloodPressure ===
##
## Results of Outlier Test
## -------------------------
##
## Test Method: Rosner's Test for Outliers
##
## Hypothesized Distribution: Normal
##
## Data: x
##
## Sample Size: 768
##
## Test Statistics: R.1 = 4.004059
## R.2 = 3.948398
## R.3 = 3.497458
## R.4 = 3.528098
## R.5 = 3.466499
## R.6 = 3.159423
## R.7 = 3.182433
## R.8 = 3.205954
## R.9 = 3.058405
## R.10 = 3.079476
##
## Test Statistic Parameter: k = 10
##
## Alternative Hypothesis: Up to 10 observations are not
## from the same Distribution.
##
## Type I Error: 5%
##
## Number of Outliers Detected: 1
##
## i Mean.i SD.i Value Obs.Num R.i+1 lambda.i+1 Outlier
## 1 0 72.44661 12.37579 122 107 4.004059 3.974092 TRUE
## 2 1 72.38201 12.25358 24 598 3.948398 3.973762 FALSE
## 3 2 72.44517 12.13600 30 19 3.497458 3.973432 FALSE
## 4 3 72.50065 12.04634 30 126 3.528098 3.973102 FALSE
## 5 4 72.55628 11.95550 114 692 3.466499 3.972771 FALSE
## 6 5 72.50197 11.86864 110 44 3.159423 3.972440 FALSE
## 7 6 72.45276 11.79828 110 178 3.182433 3.972108 FALSE
## 8 7 72.40342 11.72711 110 550 3.205954 3.971776 FALSE
## 9 8 72.35395 11.65511 108 85 3.058405 3.971443 FALSE
## 10 9 72.30698 11.59061 108 363 3.079476 3.971110 FALSE
##
##
##
##
## === Variable: SkinThickness ===
##
## Results of Outlier Test
## -------------------------
##
## Test Method: Rosner's Test for Outliers
##
## Hypothesized Distribution: Normal
##
## Data: x
##
## Sample Size: 768
##
## Test Statistics: R.1 = 6.659459
## R.2 = 3.337693
## R.3 = 3.068501
## R.4 = 3.089600
## R.5 = 2.712096
## R.6 = 2.727065
## R.7 = 2.541068
## R.8 = 2.553606
## R.9 = 2.566331
## R.10 = 2.375814
##
## Test Statistic Parameter: k = 10
##
## Alternative Hypothesis: Up to 10 observations are not
## from the same Distribution.
##
## Type I Error: 5%
##
## Number of Outliers Detected: 1
##
## i Mean.i SD.i Value Obs.Num R.i+1 lambda.i+1 Outlier
## 1 0 29.03125 10.506672 99 580 6.659459 3.974092 TRUE
## 2 1 28.94003 10.204645 63 446 3.337693 3.973762 FALSE
## 3 2 28.89556 10.136690 60 58 3.068501 3.973432 FALSE
## 4 3 28.85490 10.080624 60 692 3.089600 3.973102 FALSE
## 5 4 28.81414 10.023930 56 121 2.712096 3.972771 FALSE
## 6 5 28.77851 9.981976 56 304 2.727065 3.972440 FALSE
## 7 6 28.74278 9.939606 54 87 2.541068 3.972108 FALSE
## 8 7 28.70959 9.903802 54 212 2.553606 3.971776 FALSE
## 9 8 28.67632 9.867660 54 584 2.566331 3.971443 FALSE
## 10 9 28.64295 9.831177 52 276 2.375814 3.971110 FALSE
##
##
##
##
## === Variable: Insulin ===
##
## Results of Outlier Test
## -------------------------
##
## Test Method: Rosner's Test for Outliers
##
## Hypothesized Distribution: Normal
##
## Data: x
##
## Sample Size: 768
##
## Test Statistics: R.1 = 6.012751
## R.2 = 5.255403
## R.3 = 4.776185
## R.4 = 4.852292
## R.5 = 4.932157
## R.6 = 4.257371
## R.7 = 4.110369
## R.8 = 4.159560
## R.9 = 4.210560
## R.10 = 3.926510
##
## Test Statistic Parameter: k = 10
##
## Alternative Hypothesis: Up to 10 observations are not
## from the same Distribution.
##
## Type I Error: 5%
##
## Number of Outliers Detected: 9
##
## i Mean.i SD.i Value Obs.Num R.i+1 lambda.i+1 Outlier
## 1 0 154.8477 114.9478 846 14 6.012751 3.974092 TRUE
## 2 1 153.9465 112.2756 744 229 5.255403 3.973762 TRUE
## 3 2 153.1762 110.3022 680 248 4.776185 3.973432 TRUE
## 4 3 152.4876 108.7141 680 328 4.852292 3.973102 TRUE
## 5 4 151.7971 107.0937 680 409 4.932157 3.972771 TRUE
## 6 5 151.1048 105.4395 600 585 4.257371 3.972440 TRUE
## 7 6 150.5157 104.2447 579 333 4.110369 3.972108 TRUE
## 8 7 149.9527 103.1473 579 410 4.159560 3.971776 TRUE
## 9 8 149.3882 102.0320 579 441 4.210560 3.971443 TRUE
## 10 9 148.8221 100.8982 545 287 3.926510 3.971110 FALSE
##
##
##
##
## === Variable: BMI ===
##
## Results of Outlier Test
## -------------------------
##
## Test Method: Rosner's Test for Outliers
##
## Hypothesized Distribution: Normal
##
## Data: x
##
## Sample Size: 768
##
## Test Statistics: R.1 = 4.995880
## R.2 = 3.954005
## R.3 = 3.686847
## R.4 = 3.379276
## R.5 = 3.136599
## R.6 = 3.113767
## R.7 = 3.044636
## R.8 = 3.065385
## R.9 = 2.733037
## R.10 = 2.702099
##
## Test Statistic Parameter: k = 10
##
## Alternative Hypothesis: Up to 10 observations are not
## from the same Distribution.
##
## Type I Error: 5%
##
## Number of Outliers Detected: 1
##
## i Mean.i SD.i Value Obs.Num R.i+1 lambda.i+1 Outlier
## 1 0 32.46888 6.931936 67.1 178 4.995880 3.974092 TRUE
## 2 1 32.42373 6.822518 59.4 446 3.954005 3.973762 FALSE
## 3 2 32.38851 6.756855 57.3 674 3.686847 3.973432 FALSE
## 4 3 32.35595 6.700858 55.0 126 3.379276 3.973102 FALSE
## 5 4 32.32631 6.654881 53.2 121 3.136599 3.972771 FALSE
## 6 5 32.29895 6.616118 52.9 304 3.113767 3.972440 FALSE
## 7 6 32.27192 6.578154 52.3 194 3.044636 3.972108 FALSE
## 8 7 32.24560 6.542214 52.3 248 3.065385 3.971776 FALSE
## 9 8 32.21921 6.505872 50.0 156 2.733037 3.971443 FALSE
## 10 9 32.19578 6.478007 49.7 100 2.702099 3.971110 FALSE
##
##
##
##
## === Variable: DiabetesPedigreeFunction ===
##
## Results of Outlier Test
## -------------------------
##
## Test Method: Rosner's Test for Outliers
##
## Hypothesized Distribution: Normal
##
## Data: x
##
## Sample Size: 768
##
## Test Statistics: R.1 = 5.879733
## R.2 = 5.740114
## R.3 = 5.742410
## R.4 = 5.387880
## R.5 = 4.696067
## R.6 = 4.395823
## R.7 = 4.287150
## R.8 = 4.233710
## R.9 = 4.283971
## R.10 = 3.997834
##
## Test Statistic Parameter: k = 10
##
## Alternative Hypothesis: Up to 10 observations are not
## from the same Distribution.
##
## Type I Error: 5%
##
## Number of Outliers Detected: 10
##
## i Mean.i SD.i Value Obs.Num R.i+1 lambda.i+1 Outlier
## 1 0 0.4718763 0.3313286 2.420 446 5.879733 3.974092 TRUE
## 2 1 0.4693364 0.3239768 2.329 229 5.740114 3.973762 TRUE
## 3 2 0.4669086 0.3171301 2.288 5 5.742410 3.973432 TRUE
## 4 3 0.4645281 0.3104137 2.137 371 5.387880 3.973102 TRUE
## 5 4 0.4623390 0.3046509 1.893 46 4.696067 3.972771 TRUE
## 6 5 0.4604640 0.3004070 1.781 59 4.395823 3.972440 TRUE
## 7 6 0.4587310 0.2967633 1.731 372 4.287150 3.972108 TRUE
## 8 7 0.4570591 0.2933457 1.699 594 4.233710 3.971776 TRUE
## 9 8 0.4554250 0.2900522 1.698 622 4.283971 3.971443 TRUE
## 10 9 0.4537879 0.2867083 1.600 396 3.997834 3.971110 TRUE
##
##
##
##
## === Variable: Age ===
##
## Results of Outlier Test
## -------------------------
##
## Test Method: Rosner's Test for Outliers
##
## Hypothesized Distribution: Normal
##
## Data: x
##
## Sample Size: 768
##
## Test Statistics: R.1 = 4.061069
## R.2 = 3.335018
## R.3 = 3.188755
## R.4 = 3.125278
## R.5 = 3.147531
## R.6 = 3.082239
## R.7 = 3.015167
## R.8 = 3.035354
## R.9 = 3.055953
## R.10 = 2.986993
##
## Test Statistic Parameter: k = 10
##
## Alternative Hypothesis: Up to 10 observations are not
## from the same Distribution.
##
## Type I Error: 5%
##
## Number of Outliers Detected: 1
##
## i Mean.i SD.i Value Obs.Num R.i+1 lambda.i+1 Outlier
## 1 0 33.24089 11.76023 81 460 4.061069 3.974092 TRUE
## 2 1 33.17862 11.64053 72 454 3.335018 3.973762 FALSE
## 3 2 33.12794 11.56315 70 667 3.188755 3.973432 FALSE
## 4 3 33.07974 11.49346 69 124 3.125278 3.973102 FALSE
## 5 4 33.03272 11.42714 69 685 3.147531 3.972771 FALSE
## 6 5 32.98558 11.36006 68 675 3.082239 3.972440 FALSE
## 7 6 32.93963 11.29634 67 364 3.015167 3.972108 FALSE
## 8 7 32.89488 11.23596 67 490 3.035354 3.971776 FALSE
## 9 8 32.85000 11.17491 67 538 3.055953 3.971443 FALSE
## 10 9 32.80501 11.11318 66 222 2.986993 3.971110 FALSE
##
##
##
##
## === Variable: Outcome ===
##
## Results of Outlier Test
## -------------------------
##
## Test Method: Rosner's Test for Outliers
##
## Hypothesized Distribution: Normal
##
## Data: x
##
## Sample Size: 768
##
## Test Statistics: R.1 = 1.365006
## R.2 = 1.367559
## R.3 = 1.370126
## R.4 = 1.372708
## R.5 = 1.375304
## R.6 = 1.377915
## R.7 = 1.380541
## R.8 = 1.383182
## R.9 = 1.385838
## R.10 = 1.388509
##
## Test Statistic Parameter: k = 10
##
## Alternative Hypothesis: Up to 10 observations are not
## from the same Distribution.
##
## Type I Error: 5%
##
## Number of Outliers Detected: 0
##
## i Mean.i SD.i Value Obs.Num R.i+1 lambda.i+1 Outlier
## 1 0 0.3489583 0.4769514 1 1 1.365006 3.974092 FALSE
## 2 1 0.3481095 0.4766818 1 3 1.367559 3.973762 FALSE
## 3 2 0.3472585 0.4764098 1 5 1.370126 3.973432 FALSE
## 4 3 0.3464052 0.4761355 1 7 1.372708 3.973102 FALSE
## 5 4 0.3455497 0.4758587 1 9 1.375304 3.972771 FALSE
## 6 5 0.3446920 0.4755795 1 10 1.377915 3.972440 FALSE
## 7 6 0.3438320 0.4752978 1 12 1.380541 3.972108 FALSE
## 8 7 0.3429698 0.4750137 1 14 1.383182 3.971776 FALSE
## 9 8 0.3421053 0.4747271 1 15 1.385838 3.971443 FALSE
## 10 9 0.3412385 0.4744379 1 16 1.388509 3.971110 FALSE
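To read the tables above: at each step the most extreme remaining value is removed, R is its absolute distance from the current mean divided by the current standard deviation, and lambda is the critical value for that step. For Pregnancies, for instance, R.1 = (17 - 3.845052) / 3.369578 ≈ 3.904, which is below lambda.1 = 3.974, so the value 17 is not flagged. The number of outliers is the largest step at which R exceeds lambda. The following is an illustrative base R reimplementation that follows the standard generalized ESD formulas; it is not the EnvStats code, and `rosner_sketch` and the toy vector are hypothetical.

```r
# Generalized ESD (Rosner) test: returns the statistics R_i, the critical
# values lambda_i, and the number of outliers declared.
rosner_sketch <- function(x, k = 10, alpha = 0.05) {
  x <- x[!is.na(x)]
  n <- length(x)
  R <- lambda <- value <- numeric(k)
  for (i in seq_len(k)) {
    dev <- abs(x - mean(x))
    j <- which.max(dev)
    R[i] <- dev[j] / sd(x)
    value[i] <- x[j]
    # Critical value for step i, based on a t quantile.
    p <- 1 - alpha / (2 * (n - i + 1))
    t_crit <- qt(p, df = n - i - 1)
    lambda[i] <- (n - i) * t_crit / sqrt((n - i - 1 + t_crit^2) * (n - i + 1))
    # Remove the most extreme value before the next step.
    x <- x[-j]
  }
  # Number of outliers: the largest i with R_i > lambda_i.
  exceed <- which(R > lambda)
  list(
    stats = data.frame(i = seq_len(k), Value = value, R = R, lambda = lambda),
    n_outliers = if (length(exceed) > 0) max(exceed) else 0L
  )
}

# Hypothetical toy vector: 50 evenly spaced values plus two extreme ones.
x <- c(seq(60, 80, length.out = 50), 160, 175)
res <- rosner_sketch(x, k = 5)
print(res$stats)
print(res$n_outliers)
```

Because the count is the largest step that exceeds its critical value, earlier steps are declared outliers even if one of them falls below lambda on its own.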
The Rosner test results show no outliers among the k = 10 most extreme observations for Pregnancies, Glucose and Outcome. The remaining variables do have outliers: BloodPressure, SkinThickness, BMI and Age have 1 each, Insulin has 9, and DiabetesPedigreeFunction has 10 (the maximum tested, so there may be more). To look at them more closely:
outliers_rosner[["Glucose"]]$all.stats
## i Mean.i SD.i Value Obs.Num R.i+1 lambda.i+1 Outlier
## 1 0 121.6862 30.51162 44 63 2.546119 3.974092 FALSE
## 2 1 121.7875 30.40206 199 662 2.539714 3.973762 FALSE
## 3 2 121.6867 30.29340 198 562 2.519140 3.973432 FALSE
## 4 3 121.5869 30.18706 197 9 2.498192 3.973102 FALSE
## 5 4 121.4882 30.08304 197 229 2.510111 3.972771 FALSE
## 6 5 121.3893 29.97806 197 409 2.522203 3.972440 FALSE
## 7 6 121.2900 29.87211 197 580 2.534470 3.972108 FALSE
## 8 7 121.1905 29.76516 196 23 2.513323 3.971776 FALSE
## 9 8 121.0921 29.66056 196 207 2.525505 3.971443 FALSE
## 10 9 120.9934 29.55499 196 360 2.537866 3.971110 FALSE
outliers_rosner[["Pregnancies"]]$all.stats
## i Mean.i SD.i Value Obs.Num R.i+1 lambda.i+1 Outlier
## 1 0 3.845052 3.369578 17 160 3.904034 3.974092 FALSE
## 2 1 3.827901 3.338063 15 89 3.346880 3.973762 FALSE
## 3 2 3.813316 3.315699 14 299 3.072258 3.973432 FALSE
## 4 3 3.800000 3.297310 14 456 3.093431 3.973102 FALSE
## 5 4 3.786649 3.278714 13 29 2.810050 3.972771 FALSE
## 6 5 3.774574 3.263821 13 73 2.826572 3.972440 FALSE
## 7 6 3.762467 3.248775 13 87 2.843389 3.972108 FALSE
## 8 7 3.750329 3.233574 13 275 2.860510 3.971776 FALSE
## 9 8 3.738158 3.218215 13 324 2.877944 3.971443 FALSE
## 10 9 3.725955 3.202695 13 358 2.895700 3.971110 FALSE
outliers_rosner[["BloodPressure"]]$all.stats
## i Mean.i SD.i Value Obs.Num R.i+1 lambda.i+1 Outlier
## 1 0 72.44661 12.37579 122 107 4.004059 3.974092 TRUE
## 2 1 72.38201 12.25358 24 598 3.948398 3.973762 FALSE
## 3 2 72.44517 12.13600 30 19 3.497458 3.973432 FALSE
## 4 3 72.50065 12.04634 30 126 3.528098 3.973102 FALSE
## 5 4 72.55628 11.95550 114 692 3.466499 3.972771 FALSE
## 6 5 72.50197 11.86864 110 44 3.159423 3.972440 FALSE
## 7 6 72.45276 11.79828 110 178 3.182433 3.972108 FALSE
## 8 7 72.40342 11.72711 110 550 3.205954 3.971776 FALSE
## 9 8 72.35395 11.65511 108 85 3.058405 3.971443 FALSE
## 10 9 72.30698 11.59061 108 363 3.079476 3.971110 FALSE
outliers_rosner[["SkinThickness"]]$all.stats
## i Mean.i SD.i Value Obs.Num R.i+1 lambda.i+1 Outlier
## 1 0 29.03125 10.506672 99 580 6.659459 3.974092 TRUE
## 2 1 28.94003 10.204645 63 446 3.337693 3.973762 FALSE
## 3 2 28.89556 10.136690 60 58 3.068501 3.973432 FALSE
## 4 3 28.85490 10.080624 60 692 3.089600 3.973102 FALSE
## 5 4 28.81414 10.023930 56 121 2.712096 3.972771 FALSE
## 6 5 28.77851 9.981976 56 304 2.727065 3.972440 FALSE
## 7 6 28.74278 9.939606 54 87 2.541068 3.972108 FALSE
## 8 7 28.70959 9.903802 54 212 2.553606 3.971776 FALSE
## 9 8 28.67632 9.867660 54 584 2.566331 3.971443 FALSE
## 10 9 28.64295 9.831177 52 276 2.375814 3.971110 FALSE
outliers_rosner[["Insulin"]]$all.stats
## i Mean.i SD.i Value Obs.Num R.i+1 lambda.i+1 Outlier
## 1 0 154.8477 114.9478 846 14 6.012751 3.974092 TRUE
## 2 1 153.9465 112.2756 744 229 5.255403 3.973762 TRUE
## 3 2 153.1762 110.3022 680 248 4.776185 3.973432 TRUE
## 4 3 152.4876 108.7141 680 328 4.852292 3.973102 TRUE
## 5 4 151.7971 107.0937 680 409 4.932157 3.972771 TRUE
## 6 5 151.1048 105.4395 600 585 4.257371 3.972440 TRUE
## 7 6 150.5157 104.2447 579 333 4.110369 3.972108 TRUE
## 8 7 149.9527 103.1473 579 410 4.159560 3.971776 TRUE
## 9 8 149.3882 102.0320 579 441 4.210560 3.971443 TRUE
## 10 9 148.8221 100.8982 545 287 3.926510 3.971110 FALSE
outliers_rosner[["BMI"]]$all.stats
## i Mean.i SD.i Value Obs.Num R.i+1 lambda.i+1 Outlier
## 1 0 32.46888 6.931936 67.1 178 4.995880 3.974092 TRUE
## 2 1 32.42373 6.822518 59.4 446 3.954005 3.973762 FALSE
## 3 2 32.38851 6.756855 57.3 674 3.686847 3.973432 FALSE
## 4 3 32.35595 6.700858 55.0 126 3.379276 3.973102 FALSE
## 5 4 32.32631 6.654881 53.2 121 3.136599 3.972771 FALSE
## 6 5 32.29895 6.616118 52.9 304 3.113767 3.972440 FALSE
## 7 6 32.27192 6.578154 52.3 194 3.044636 3.972108 FALSE
## 8 7 32.24560 6.542214 52.3 248 3.065385 3.971776 FALSE
## 9 8 32.21921 6.505872 50.0 156 2.733037 3.971443 FALSE
## 10 9 32.19578 6.478007 49.7 100 2.702099 3.971110 FALSE
outliers_rosner[["DiabetesPedigreeFunction"]]$all.stats
## i Mean.i SD.i Value Obs.Num R.i+1 lambda.i+1 Outlier
## 1 0 0.4718763 0.3313286 2.420 446 5.879733 3.974092 TRUE
## 2 1 0.4693364 0.3239768 2.329 229 5.740114 3.973762 TRUE
## 3 2 0.4669086 0.3171301 2.288 5 5.742410 3.973432 TRUE
## 4 3 0.4645281 0.3104137 2.137 371 5.387880 3.973102 TRUE
## 5 4 0.4623390 0.3046509 1.893 46 4.696067 3.972771 TRUE
## 6 5 0.4604640 0.3004070 1.781 59 4.395823 3.972440 TRUE
## 7 6 0.4587310 0.2967633 1.731 372 4.287150 3.972108 TRUE
## 8 7 0.4570591 0.2933457 1.699 594 4.233710 3.971776 TRUE
## 9 8 0.4554250 0.2900522 1.698 622 4.283971 3.971443 TRUE
## 10 9 0.4537879 0.2867083 1.600 396 3.997834 3.971110 TRUE
outliers_rosner[["Age"]]$all.stats
## i Mean.i SD.i Value Obs.Num R.i+1 lambda.i+1 Outlier
## 1 0 33.24089 11.76023 81 460 4.061069 3.974092 TRUE
## 2 1 33.17862 11.64053 72 454 3.335018 3.973762 FALSE
## 3 2 33.12794 11.56315 70 667 3.188755 3.973432 FALSE
## 4 3 33.07974 11.49346 69 124 3.125278 3.973102 FALSE
## 5 4 33.03272 11.42714 69 685 3.147531 3.972771 FALSE
## 6 5 32.98558 11.36006 68 675 3.082239 3.972440 FALSE
## 7 6 32.93963 11.29634 67 364 3.015167 3.972108 FALSE
## 8 7 32.89488 11.23596 67 490 3.035354 3.971776 FALSE
## 9 8 32.85000 11.17491 67 538 3.055953 3.971443 FALSE
## 10 9 32.80501 11.11318 66 222 2.986993 3.971110 FALSE
outliers_rosner[["Outcome"]]$all.stats
## i Mean.i SD.i Value Obs.Num R.i+1 lambda.i+1 Outlier
## 1 0 0.3489583 0.4769514 1 1 1.365006 3.974092 FALSE
## 2 1 0.3481095 0.4766818 1 3 1.367559 3.973762 FALSE
## 3 2 0.3472585 0.4764098 1 5 1.370126 3.973432 FALSE
## 4 3 0.3464052 0.4761355 1 7 1.372708 3.973102 FALSE
## 5 4 0.3455497 0.4758587 1 9 1.375304 3.972771 FALSE
## 6 5 0.3446920 0.4755795 1 10 1.377915 3.972440 FALSE
## 7 6 0.3438320 0.4752978 1 12 1.380541 3.972108 FALSE
## 8 7 0.3429698 0.4750137 1 14 1.383182 3.971776 FALSE
## 9 8 0.3421053 0.4747271 1 15 1.385838 3.971443 FALSE
## 10 9 0.3412385 0.4744379 1 16 1.388509 3.971110 FALSE
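As a guide to reading these tables, the R.i+1 statistic is the largest absolute deviation from the current mean, expressed in standard deviations, and a value is flagged only when that statistic exceeds lambda.i+1. A minimal sketch that recomputes the first Pregnancies row from the numbers printed above (the variable names are illustrative, not part of the analysis):

```r
# Values copied from the first row of the Pregnancies table above
mean_0   <- 3.845052  # Mean.i
sd_0     <- 3.369578  # SD.i
value    <- 17        # most extreme observation at this step
lambda_1 <- 3.974092  # critical value lambda.i+1

# Rosner statistic: absolute deviation from the mean in SD units
r_1 <- abs(value - mean_0) / sd_0

# The value is a candidate outlier only if the statistic exceeds lambda
is_outlier <- r_1 > lambda_1

print(c(R = r_1, lambda = lambda_1))
print(is_outlier)
```

The same comparison on the first BloodPressure row (4.004 against 3.974) is what marks the value 122 as an outlier.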
To handle these outliers, we copy the dataset and replace the flagged values with NA so that they can be imputed afterwards.
df_ati <- df_final
for (var in names(df_ati)) {
  ros <- outliers_rosner[[var]]$all.stats
  # which() returns positions within the all.stats table (1 to 10);
  # the dataset row of each flagged value is stored in ros$Obs.Num
  filas_out <- which(ros$Outlier)
  if (length(filas_out) > 0) {
    cat("\nVariable:", var, "- Reemplazando", length(filas_out), "outliers por NA")
    df_ati[filas_out, var] <- NA
  }
}
##
## Variable: BloodPressure - Reemplazando 1 outliers por NA
## Variable: SkinThickness - Reemplazando 1 outliers por NA
## Variable: Insulin - Reemplazando 9 outliers por NA
## Variable: BMI - Reemplazando 1 outliers por NA
## Variable: DiabetesPedigreeFunction - Reemplazando 10 outliers por NA
## Variable: Age - Reemplazando 1 outliers por NA
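Note that which(ros$Outlier) in the loop above returns positions within the ten-row all.stats table, not rows of the dataset, which would explain why the maxima in the summary further down (for example 846 for Insulin) are unchanged. A minimal sketch, on hypothetical toy data, of indexing by Obs.Num instead; df_toy and ros_toy are illustrative stand-ins, not objects from the analysis:

```r
# Hypothetical toy data, not taken from the study
df_toy <- data.frame(Insulin = c(100, 120, 846, 110, 744, 130))

# Mimics the shape of all.stats: one row per candidate value,
# with Obs.Num giving the dataset row where that value is located
ros_toy <- data.frame(
  Value   = c(846, 744, 130),
  Obs.Num = c(3, 5, 6),
  Outlier = c(TRUE, TRUE, FALSE)
)

pos_in_table <- which(ros_toy$Outlier)            # positions in ros_toy
rows_in_data <- ros_toy$Obs.Num[ros_toy$Outlier]  # rows in df_toy

# Index the dataset with the observation numbers, not the table positions
df_fixed <- df_toy
df_fixed[rows_in_data, "Insulin"] <- NA

print(pos_in_table)
print(rows_in_data)
print(df_fixed)
```

With this indexing, the NA values land on the rows that actually hold the extreme values, so the later median imputation would replace them.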
Check how many NA values each variable now contains:
colSums(is.na(df_ati))
## Pregnancies Glucose BloodPressure
## 0 0 1
## SkinThickness Insulin BMI
## 1 9 1
## DiabetesPedigreeFunction Age Outcome
## 10 1 0
Impute those NA values with the median of each variable:
for (var in names(df_ati)) {
  if (any(is.na(df_ati[[var]]))) {
    # Median of the observed (non-missing) values of this variable
    mediana_val <- median(df_ati[[var]], na.rm = TRUE)
    df_ati[[var]][is.na(df_ati[[var]])] <- mediana_val
  }
}
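The same median imputation can be written as a small reusable helper. This is only a sketch on hypothetical toy data; impute_median and df_toy are illustrative names that do not appear in the analysis:

```r
# Replace the NA entries of a numeric vector with the median of the rest
impute_median <- function(x) {
  x[is.na(x)] <- median(x, na.rm = TRUE)
  x
}

# Hypothetical toy data, not taken from the study
df_toy <- data.frame(
  Insulin = c(100, NA, 120, 300),
  BMI     = c(25.0, 31.5, NA, 40.0)
)

# Apply the helper to every column while keeping the data frame structure
df_toy[] <- lapply(df_toy, impute_median)
print(df_toy)
```

The median is a reasonable choice here because it is not pulled upwards by the long right tails seen in variables such as insulin.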
No NA values remain:
colSums(is.na(df_ati))
## Pregnancies Glucose BloodPressure
## 0 0 0
## SkinThickness Insulin BMI
## 0 0 0
## DiabetesPedigreeFunction Age Outcome
## 0 0 0
With the data clean, we store the final dataset:
df_oficial <- df_ati
With the dataset clean, we begin the descriptive analysis and the visualizations of the relationships between variables in order to draw conclusions.
summary(df_oficial)
## Pregnancies Glucose BloodPressure SkinThickness
## Min. : 0.000 Min. : 44.0 Min. : 24.00 Min. : 7.00
## 1st Qu.: 1.000 1st Qu.: 99.0 1st Qu.: 64.00 1st Qu.:21.00
## Median : 3.000 Median :117.0 Median : 72.00 Median :29.00
## Mean : 3.845 Mean :121.7 Mean : 72.45 Mean :29.02
## 3rd Qu.: 6.000 3rd Qu.:141.0 3rd Qu.: 80.00 3rd Qu.:36.00
## Max. :17.000 Max. :199.0 Max. :122.00 Max. :99.00
## Insulin BMI DiabetesPedigreeFunction Age
## Min. : 14.0 Min. :18.20 Min. :0.0780 Min. :21.00
## 1st Qu.: 78.0 1st Qu.:27.50 1st Qu.:0.2450 1st Qu.:24.00
## Median :125.0 Median :32.30 Median :0.3755 Median :29.00
## Mean :153.8 Mean :32.47 Mean :0.4702 Mean :33.21
## 3rd Qu.:188.0 3rd Qu.:36.62 3rd Qu.:0.6160 3rd Qu.:41.00
## Max. :846.0 Max. :67.10 Max. :2.4200 Max. :81.00
## Outcome
## Min. :0.000
## 1st Qu.:0.000
## Median :0.000
## Mean :0.349
## 3rd Qu.:1.000
## Max. :1.000
# Anonymous functions pass na.rm directly, as required since dplyr 1.1.0
descriptivos <- df_oficial %>%
  summarise(across(everything(),
                   list(
                     media   = \(x) mean(x, na.rm = TRUE),
                     mediana = \(x) median(x, na.rm = TRUE),
                     sd      = \(x) sd(x, na.rm = TRUE),
                     min     = \(x) min(x, na.rm = TRUE),
                     max     = \(x) max(x, na.rm = TRUE)
                   )))
print(descriptivos)
## Pregnancies_media Pregnancies_mediana Pregnancies_sd Pregnancies_min
## 1 3.845052 3 3.369578 0
## Pregnancies_max Glucose_media Glucose_mediana Glucose_sd Glucose_min
## 1 17 121.6862 117 30.51162 44
## Glucose_max BloodPressure_media BloodPressure_mediana BloodPressure_sd
## 1 199 72.44661 72 12.37579
## BloodPressure_min BloodPressure_max SkinThickness_media SkinThickness_mediana
## 1 24 122 29.02344 29
## SkinThickness_sd SkinThickness_min SkinThickness_max Insulin_media
## 1 10.50446 7 99 153.8268
## Insulin_mediana Insulin_sd Insulin_min Insulin_max BMI_media BMI_mediana
## 1 125 113.3815 14 846 32.46719 32.3
## BMI_sd BMI_min BMI_max DiabetesPedigreeFunction_media
## 1 6.931819 18.2 67.1 0.4701536
## DiabetesPedigreeFunction_mediana DiabetesPedigreeFunction_sd
## 1 0.3755 0.3238199
## DiabetesPedigreeFunction_min DiabetesPedigreeFunction_max Age_media
## 1 0.078 2.42 33.21354
## Age_mediana Age_sd Age_min Age_max Outcome_media Outcome_mediana Outcome_sd
## 1 29 11.74562 21 81 0.3489583 0 0.4769514
## Outcome_min Outcome_max
## 1 0 1
Some of the main findings from this summary are the following:
- Pregnancies has a median of 3 (mean 3.85), so 50% of the patients had 3 or fewer pregnancies.
- Glucose includes values below 70 mg/dL (the minimum is 44), which could indicate hypoglycemia in those patients.
- DiabetesPedigreeFunction ranges from 0.078 to 2.42, so the genetic risk of diabetes could vary considerably among these patients.
Insulin and glucose are also highly variable (standard deviations of 113.4 and 30.5), which would indicate a population with very diverse metabolic profiles. The mean BMI of 32.5 falls in the obesity range. The gap between mean and median in age (33.2 versus 29) and in pregnancies (3.85 versus 3) suggests right-skewed distributions with some high values. Finally, 34.9% of the cases in this dataset were diagnosed with diabetes.
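The two checks used in this reading, the gap between mean and median and the share of values below a clinical cut-off, can be sketched as follows. The data are hypothetical, and edad_toy and share_below are illustrative names rather than objects from the analysis:

```r
# Hypothetical right-skewed sample of ages, not taken from the study
edad_toy <- c(21, 22, 24, 25, 29, 30, 41, 60, 81)

# A mean well above the median points to a long right tail
gap <- mean(edad_toy) - median(edad_toy)

# Share of observations below a cut-off (for glucose, 70 mg/dL would apply)
share_below <- function(x, cutoff) mean(x < cutoff)

print(gap)
print(share_below(edad_toy, 25))
```

Applied to df_oficial, the same two calls would quantify how many glucose readings fall below 70 mg/dL and how far each mean sits from its median.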
Next, we plot a histogram for each variable to see its distribution after cleaning:
df_oficial %>%
  pivot_longer(cols = everything(), names_to = "Variable", values_to = "Valor") %>%
  ggplot(aes(x = Valor)) +
  geom_histogram(bins = 30, fill = "steelblue", color = "white") +
  facet_wrap(~Variable, scales = "free") +
  theme_minimal() +
  labs(title = "Distribuciones de variables", x = "Valor", y = "Frecuencia")
These histograms show that variables such as pregnancies and insulin
have long right tails, as the descriptive summary already suggested.
Blood pressure and glucose are the closest to a normal distribution,
since their histograms are roughly symmetric.
We also look at the boxplots after the final cleaning:
df_oficial %>%
  pivot_longer(cols = everything(), names_to = "Variable", values_to = "Valor") %>%
  ggplot(aes(x = Variable, y = Valor)) +
  geom_boxplot(fill = "lightgreen", outlier.colour = "red") +
  coord_flip() +
  theme_minimal() +
  labs(title = "Boxplots después de limpieza", x = "Variable", y = "Valor")
In these boxplots, insulin shows the greatest dispersion, with possible
outliers in the upper range; glucose also spans a wide range.
Outcome also appears as a boxplot, although it is a categorical variable coded 0 for patients without diabetes and 1 for patients with diabetes.
Variables with a smaller range, such as DiabetesPedigreeFunction, pregnancies and, to a lesser extent, blood pressure, are drawn on the same value axis as the rest, so their boxes look compressed.
Overall, the boxplots after cleaning reflect the handling of missing values and differ only slightly from the originals.
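To see which points a boxplot treats as outliers, base R exposes the underlying rule through boxplot.stats(). The sketch below uses hypothetical insulin-like values; geom_boxplot() follows the same 1.5 times IQR convention, although it computes its hinges from quantiles, so the limits can differ slightly:

```r
# Hypothetical insulin-like values, not taken from the study
x <- c(14, 78, 100, 125, 150, 188, 846)

# boxplot.stats() reports in $out the points lying more than
# 1.5 times the hinge spread beyond the lower or upper hinge
bs <- boxplot.stats(x)

print(bs$stats)  # lower whisker, lower hinge, median, upper hinge, upper whisker
print(bs$out)    # values that would be drawn as outlier points
```

Here only the value 846 lies beyond the upper limit, which mirrors the red points at the high end of the insulin boxplot.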
Finally, we look at the correlation matrix to see which variables are most strongly related, so that they can be examined in more depth in later studies:
ggpairs(df_oficial, progress = FALSE)
In the correlation matrix, the most notable relationships are between
variables associated with diabetes, such as glucose and insulin, and
between anthropometric variables, such as BMI and skin thickness, which
suggests a close pathophysiological link between these parameters.
Moderate correlations appear between age and pregnancies, and between BMI and the diabetes pedigree function.
Overall, metabolic and anthropometric factors are strongly interconnected, while others, such as blood pressure, show more modest associations, which could reflect different pathways of influence in the development of diabetes.
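A numeric correlation matrix makes it easier to rank these relationships than reading them off the plot. A minimal sketch on hypothetical toy data (df_toy is illustrative; Spearman is chosen here as an assumption, because several variables are skewed):

```r
# Hypothetical toy data, not taken from the study
df_toy <- data.frame(
  Glucose = c(85, 90, 110, 130, 150, 170),
  Insulin = c(60, 75, 100, 140, 180, 230),
  Age     = c(45, 23, 51, 30, 28, 39)
)

# Spearman is rank-based, so it is less sensitive to skew and extreme values
cor_mat <- cor(df_toy, method = "spearman")
print(round(cor_mat, 2))
```

Running the same call on df_oficial would give the coefficients behind the strong, moderate and weak relationships described above.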