Activity #3, NA exercise.

First of all, we load the libraries needed for the exercise:

suppressWarnings({
  library(Amelia)
  library(tidyverse)
  library(VIM)
  library(mice)
  library(mi)
  library(EnvStats)
  library(dplyr)
  library(hrbrthemes) 
  library(gridExtra)  
})

Now we import the dataset needed for the analysis.

data<-read.csv("C:/DataViz/actividades_en_clase/bases_de_datos/diabetes.csv")

Exploratory data analysis (EDA).

Let's take a first look at the data with the head() and summary() functions.

head(data)

##   Pregnancies Glucose BloodPressure SkinThickness Insulin  BMI
## 1           6     148            72            35       0 33.6
## 2           1      85            66            29       0 26.6
## 3           8     183            64             0       0 23.3
## 4           1      89            66            23      94 28.1
## 5           0     137            40            35     168 43.1
## 6           5     116            74             0       0 25.6
##   DiabetesPedigreeFunction Age Outcome
## 1                    0.627  50       1
## 2                    0.351  31       0
## 3                    0.672  32       1
## 4                    0.167  21       0
## 5                    2.288  33       1
## 6                    0.201  30       0

Thanks to head(), we see that the explanatory variables are numeric and that the target variable, Outcome, is categorical (coded as 0/1).

summary(data)

##   Pregnancies        Glucose      BloodPressure    SkinThickness  
##  Min.   : 0.000   Min.   :  0.0   Min.   :  0.00   Min.   : 0.00  
##  1st Qu.: 1.000   1st Qu.: 99.0   1st Qu.: 62.00   1st Qu.: 0.00  
##  Median : 3.000   Median :117.0   Median : 72.00   Median :23.00  
##  Mean   : 3.845   Mean   :120.9   Mean   : 69.11   Mean   :20.54  
##  3rd Qu.: 6.000   3rd Qu.:140.2   3rd Qu.: 80.00   3rd Qu.:32.00  
##  Max.   :17.000   Max.   :199.0   Max.   :122.00   Max.   :99.00  
##     Insulin           BMI        DiabetesPedigreeFunction      Age       
##  Min.   :  0.0   Min.   : 0.00   Min.   :0.0780           Min.   :21.00  
##  1st Qu.:  0.0   1st Qu.:27.30   1st Qu.:0.2437           1st Qu.:24.00  
##  Median : 30.5   Median :32.00   Median :0.3725           Median :29.00  
##  Mean   : 79.8   Mean   :31.99   Mean   :0.4719           Mean   :33.24  
##  3rd Qu.:127.2   3rd Qu.:36.60   3rd Qu.:0.6262           3rd Qu.:41.00  
##  Max.   :846.0   Max.   :67.10   Max.   :2.4200           Max.   :81.00  
##     Outcome     
##  Min.   :0.000  
##  1st Qu.:0.000  
##  Median :0.000  
##  Mean   :0.349  
##  3rd Qu.:1.000  
##  Max.   :1.000

At first glance, the maximum of Pregnancies (17) stands out and may hint at outliers in that variable, as does the maximum of Insulin (846).

The methodology for this exercise is as follows: first we detect the missing values of each variable, then we detect the outliers of each variable, and finally we apply a single imputation to both kinds of values. Plots and analytical tests accompany each step to check that the imputation is being carried out correctly.

Missing data detection

For this exercise, values recorded as 0 are treated as missing for the variables Glucose, BloodPressure, SkinThickness, Insulin and BMI:

n=length(data$Outcome)
m=length(data)
for (j in 2: (m-3)){
  for (i in 1:n){
    if (data[i,j]==0){
      data[i,j]<-NA
    }
  }
}
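
The nested loop above can also be written in vectorized form. The following is a minimal sketch on a toy data frame (the values and the names `toy` and `zero_as_na` are illustrative assumptions, not part of the original analysis):

```r
# Toy data frame standing in for the diabetes data (illustrative values only).
toy <- data.frame(
  Pregnancies = c(0, 2, 1),
  Glucose     = c(148, 0, 183),
  BMI         = c(33.6, 26.6, 0)
)

# Columns where a recorded 0 is treated as a missing value.
zero_as_na <- c("Glucose", "BMI")

# Replace zeros with NA column by column; other columns are left untouched.
toy[zero_as_na] <- lapply(toy[zero_as_na], function(x) replace(x, x == 0, NA))

toy
```

Naming the columns explicitly, instead of relying on positions such as 2:(m-3), keeps the replacement correct if the column order ever changes.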

Now let's look at the NAs per column with summary():

summary(data)

##   Pregnancies        Glucose      BloodPressure    SkinThickness  
##  Min.   : 0.000   Min.   : 44.0   Min.   : 24.00   Min.   : 7.00  
##  1st Qu.: 1.000   1st Qu.: 99.0   1st Qu.: 64.00   1st Qu.:22.00  
##  Median : 3.000   Median :117.0   Median : 72.00   Median :29.00  
##  Mean   : 3.845   Mean   :121.7   Mean   : 72.41   Mean   :29.15  
##  3rd Qu.: 6.000   3rd Qu.:141.0   3rd Qu.: 80.00   3rd Qu.:36.00  
##  Max.   :17.000   Max.   :199.0   Max.   :122.00   Max.   :99.00  
##                   NA's   :5       NA's   :35       NA's   :227    
##     Insulin            BMI        DiabetesPedigreeFunction      Age       
##  Min.   : 14.00   Min.   :18.20   Min.   :0.0780           Min.   :21.00  
##  1st Qu.: 76.25   1st Qu.:27.50   1st Qu.:0.2437           1st Qu.:24.00  
##  Median :125.00   Median :32.30   Median :0.3725           Median :29.00  
##  Mean   :155.55   Mean   :32.46   Mean   :0.4719           Mean   :33.24  
##  3rd Qu.:190.00   3rd Qu.:36.60   3rd Qu.:0.6262           3rd Qu.:41.00  
##  Max.   :846.00   Max.   :67.10   Max.   :2.4200           Max.   :81.00  
##  NA's   :374      NA's   :11                                              
##     Outcome     
##  Min.   :0.000  
##  1st Qu.:0.000  
##  Median :0.000  
##  Mean   :0.349  
##  3rd Qu.:1.000  
##  Max.   :1.000  
##

And graphically:

missmap(data)

Overall, about 9% of the values in the dataset are missing, most of them in Insulin and SkinThickness. Let's see exactly how many there are per variable.

We create a function to count the NAs per column:

# Count the NAs in each column of a data frame
NAs<-function(dataf){
  nas<-c()
  for (j in 1: length(dataf)){
    cont=0
    # iterate over the rows of dataf itself rather than the global n
    for (i in 1:nrow(dataf)){
      if (is.na(dataf[i,j])){
        cont<-cont+1
      }
    }
    nas[j]<-cont
  }
  return(nas)
}

We compute the proportion of NAs:

nas<-NAs(data)
# Proportion of NAs per column
nasprp<-nas/n
nasprp

## [1] 0.000000000 0.006510417 0.045572917 0.295572917 0.486979167 0.014322917
## [7] 0.000000000 0.000000000 0.000000000
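
As an alternative to the hand-written counter, base R can produce the same counts and proportions directly. A minimal sketch on a toy data frame (the names `toy` and `missing_summary` are hypothetical):

```r
# Toy data frame with a known missingness pattern (illustrative values only).
toy <- data.frame(
  a = c(1, NA, 3, NA),
  b = c(1, 2, 3, 4),
  c = c(NA, 2, 3, 4)
)

# is.na() returns a logical matrix; column sums give counts, column means give proportions.
na_count <- colSums(is.na(toy))
na_prop  <- colMeans(is.na(toy))

# Collect everything in one table, one row per variable.
missing_summary <- data.frame(
  variable     = names(na_prop),
  n_missing    = na_count,
  prop_missing = na_prop,
  row.names    = NULL
)

missing_summary
```

Because the proportions are computed from the rows of the data frame passed in, this version does not depend on a global row count.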

We create a data frame with the proportion of NAs:

colnames<-colnames(data)

faltantes<-data.frame(nasprp,colnames)
faltantes

##        nasprp                 colnames
## 1 0.000000000              Pregnancies
## 2 0.006510417                  Glucose
## 3 0.045572917            BloodPressure
## 4 0.295572917            SkinThickness
## 5 0.486979167                  Insulin
## 6 0.014322917                      BMI
## 7 0.000000000 DiabetesPedigreeFunction
## 8 0.000000000                      Age
## 9 0.000000000                  Outcome

This analysis confirms what missmap() showed: Insulin and SkinThickness have the most NAs, with approximately 48.7% and 29.6% missing respectively, so they will probably need more careful imputation. Glucose, BloodPressure and BMI each have less than 5% missing, so simply dropping the records that contain NAs may be enough for them.

As mentioned earlier, the missing data will be handled after outlier detection.

Outlier detection

For each variable, let's look at the outliers using the interquartile range and boxplots:

Pregnancies

ggplot(data)+
  aes(x="", Pregnancies)+
  geom_boxplot(fill="red")+
  theme_minimal()

out_preg<-boxplot.stats(data$Pregnancies)$out
out_preg

## [1] 15 17 14 14
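
For reference, boxplot.stats() flags the points lying more than 1.5 times the hinge spread beyond the lower or upper hinge. A minimal sketch that reproduces this rule by hand on a toy vector (the vector and the variable names are illustrative assumptions):

```r
# Toy vector with one obviously extreme value.
x <- c(1, 2, 2, 3, 3, 3, 4, 4, 5, 15)

# Lower and upper hinges, as used by boxplot.stats() (2nd and 4th of Tukey's five numbers).
hinges <- fivenum(x)[c(2, 4)]
spread <- diff(hinges)

# Fences at 1.5 times the hinge spread beyond each hinge.
lower_fence <- hinges[1] - 1.5 * spread
upper_fence <- hinges[2] + 1.5 * spread

# Points outside the fences are the boxplot outlier candidates.
manual_out <- x[x < lower_fence | x > upper_fence]

# Same result as the built-in helper.
identical(manual_out, boxplot.stats(x)$out)
```

These points are only candidates: the rule is descriptive and does not by itself test whether a value is inconsistent with the rest of the sample.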

Glucose

ggplot(data)+
  aes(x="", Glucose)+
  geom_boxplot(fill="red")+
  theme_minimal()

## Warning: Removed 5 rows containing non-finite outside the scale range
## (`stat_boxplot()`).

out_glu<-boxplot.stats(data$Glucose)$out
out_glu

## integer(0)

In this case, Glucose has no outliers.

BloodPressure

ggplot(data)+
  aes(x="", BloodPressure)+
  geom_boxplot(fill="red")+
  theme_minimal()

## Warning: Removed 35 rows containing non-finite outside the scale range
## (`stat_boxplot()`).

out_blood<-boxplot.stats(data$BloodPressure)$out
out_blood

##  [1]  30 110 108 122  30 110 108 110  24  38 106 106 106 114

SkinThickness

ggplot(data)+
  aes(x="", SkinThickness)+
  geom_boxplot(fill="red")+
  theme_minimal()

## Warning: Removed 227 rows containing non-finite outside the scale range
## (`stat_boxplot()`).

out_skin<-boxplot.stats(data$SkinThickness)$out
out_skin

## [1] 60 63 99

Insulin

ggplot(data)+
  aes(x="", Insulin)+
  geom_boxplot(fill="red")+
  theme_minimal()

## Warning: Removed 374 rows containing non-finite outside the scale range
## (`stat_boxplot()`).

out_insulin<-boxplot.stats(data$Insulin)$out
out_insulin

##  [1] 543 846 495 485 495 478 744 370 680 402 375 545 465 415 579 474 480 600 440
## [20] 540 480 387 392 510

BMI

ggplot(data)+
  aes(x="", BMI)+
  geom_boxplot(fill="red")+
  theme_minimal()

## Warning: Removed 11 rows containing non-finite outside the scale range
## (`stat_boxplot()`).

out_BMI<-boxplot.stats(data$BMI)$out
out_BMI

## [1] 53.2 55.0 67.1 52.3 52.3 52.9 59.4 57.3

DiabetesPedigreeFunction

ggplot(data)+
  aes(x="", DiabetesPedigreeFunction)+
  geom_boxplot(fill="red")+
  theme_minimal()

out_diab<-boxplot.stats(data$DiabetesPedigreeFunction)$out
out_diab

##  [1] 2.288 1.441 1.390 1.893 1.781 1.222 1.400 1.321 1.224 2.329 1.318 1.213
## [13] 1.353 1.224 1.391 1.476 2.137 1.731 1.268 1.600 2.420 1.251 1.699 1.258
## [25] 1.282 1.698 1.461 1.292 1.394

Age

ggplot(data)+
  aes(x="", Age)+
  geom_boxplot(fill="red")+
  theme_minimal()

out_age<-boxplot.stats(data$Age)$out
out_age

## [1] 69 67 72 81 67 67 70 68 69

Now that we have these outlier candidates, we use Rosner's test to check whether they really are outliers, or whether we need to capture even more of them:

test <- rosnerTest(data$Pregnancies, k = length(out_preg))
test$all.stats

##   i   Mean.i     SD.i Value Obs.Num    R.i+1 lambda.i+1 Outlier
## 1 0 3.845052 3.369578    17     160 3.904034   3.974092   FALSE
## 2 1 3.827901 3.338063    15      89 3.346880   3.973762   FALSE
## 3 2 3.813316 3.315699    14     299 3.072258   3.973432   FALSE
## 4 3 3.800000 3.297310    14     456 3.093431   3.973102   FALSE

For Pregnancies, the test flags no outliers.
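
To make the R.i+1 column easier to read: at each step it is the largest absolute deviation from the mean, measured in standard deviations, of the observations that remain. In the first Pregnancies row, (17 - 3.845052) / 3.369578 is approximately 3.904, which is below lambda.i+1 = 3.974, so the value is not flagged. A minimal sketch of the first-step statistic on a toy vector (the critical value lambda is not computed here; names are illustrative):

```r
# Toy vector with one extreme value (illustrative only).
x <- c(2, 3, 3, 4, 4, 5, 5, 6, 30)

# Absolute deviations from the sample mean.
abs_dev <- abs(x - mean(x))

# The most extreme observation is the first candidate.
candidate <- which.max(abs_dev)

# First-step statistic: deviation of the candidate in standard-deviation units.
r_stat <- abs_dev[candidate] / sd(x)

c(position = candidate, value = x[candidate], R = r_stat)
```

In the full procedure the candidate is removed and the statistic is recomputed on the remaining observations, up to k times.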

out_preg<-c()

test <- rosnerTest(data$Glucose, k = 1)

## Warning in rosnerTest(data$Glucose, k = 1): 5 observations with NA/NaN/Inf in
## 'x' removed.

test$all.stats

##   i   Mean.i     SD.i Value Obs.Num    R.i+1 lambda.i+1 Outlier
## 1 0 121.6868 30.53564    44      63 2.544134    3.97244   FALSE

From the boxplot we already suspected that Glucose had no outliers, which this test confirms.

out_glu<-c()

test <- rosnerTest(data$BloodPressure, k = length(out_blood))

## Warning in rosnerTest(data$BloodPressure, k = length(out_blood)): 35
## observations with NA/NaN/Inf in 'x' removed.

## Warning in rosnerTest(data$BloodPressure, k = length(out_blood)): The true Type I error may be larger than assumed.
## Although the help file for 'rosnerTest' has a table with information
## on the estimated Type I error level,
## simulations were not run for k > 10 or k > floor(n/2).

test$all.stats

##     i   Mean.i     SD.i Value Obs.Num    R.i+1 lambda.i+1 Outlier
## 1   0 72.40518 12.38216   122     107 4.005345   3.962273    TRUE
## 2   1 72.33743 12.25391    24     598 3.944655   3.961926   FALSE
## 3   2 72.40356 12.13090    30      19 3.495498   3.961579   FALSE
## 4   3 72.46164 12.03706    30     126 3.527576   3.961231   FALSE
## 5   4 72.51989 11.94194   114     692 3.473483   3.960883   FALSE
## 6   5 72.46291 11.85057   110      44 3.167534   3.960534   FALSE
## 7   6 72.41128 11.77650   110     178 3.191841   3.960185   FALSE
## 8   7 72.35950 11.70153   110     550 3.216716   3.959835   FALSE
## 9   8 72.30759 11.62563   108      85 3.070149   3.959484   FALSE
## 10  9 72.25829 11.55758   108     363 3.092490   3.959133   FALSE
## 11 10 72.20885 11.48873    38     600 2.977600   3.958782   FALSE
## 12 11 72.25623 11.42579   106     659 2.953298   3.958430   FALSE
## 13 12 72.20943 11.36426   106     663 2.973407   3.958077   FALSE
## 14 13 72.16250 11.30202   106     673 2.993933   3.957724   FALSE

We store the position in the dataset of the outlier:

out_blood<-test$all.stats[1,5]

test <- rosnerTest(data$SkinThickness, k = length(out_skin))

## Warning in rosnerTest(data$SkinThickness, k = length(out_skin)): 227
## observations with NA/NaN/Inf in 'x' removed.

test$all.stats

##   i   Mean.i      SD.i Value Obs.Num    R.i+1 lambda.i+1 Outlier
## 1 0 29.15342 10.476982    99     580 6.666670   3.883895    TRUE
## 2 1 29.02407 10.045046    63     446 3.382357   3.883409   FALSE
## 3 2 28.96104  9.946902    60      58 3.120465   3.882923   FALSE

out_skin<-test$all.stats[1,5]

test <- rosnerTest(data$Insulin, k = length(out_insulin))

## Warning in rosnerTest(data$Insulin, k = length(out_insulin)): 374 observations
## with NA/NaN/Inf in 'x' removed.

## Warning in rosnerTest(data$Insulin, k = length(out_insulin)): The true Type I error may be larger than assumed.
## Although the help file for 'rosnerTest' has a table with information
## on the estimated Type I error level,
## simulations were not run for k > 10 or k > floor(n/2).

test$all.stats

##     i   Mean.i      SD.i Value Obs.Num    R.i+1 lambda.i+1 Outlier
## 1   0 155.5482 118.77586   846      14 5.813065   3.799155    TRUE
## 2   1 153.7913 113.68582   744     229 5.191576   3.798462    TRUE
## 3   2 152.2857 109.83778   680     248 4.804488   3.797768    TRUE
## 4   3 150.9361 106.67422   600     585 4.209677   3.797071    TRUE
## 5   4 149.7846 104.34994   579     410 4.113231   3.796373    TRUE
## 6   5 148.6812 102.18094   545     287 3.878598   3.795672    TRUE
## 7   6 147.6598 100.30461   543       9 3.941396   3.794970    TRUE
## 8   7 146.6382  98.39272   540     656 3.997875   3.794265    TRUE
## 9   8 145.6192  96.45376   510     754 3.777777   3.793558    TRUE
## 10  9 144.6727  94.76757   495     112 3.696700   3.792850    TRUE
## 11 10 143.7604  93.18297   495     187 3.769354   3.792139    TRUE
## 12 11 142.8433  91.55324   485     154 3.737242   3.791426    TRUE
## 13 12 141.9476  89.97732   480     487 3.757084   3.790711    TRUE
## 14 13 141.0604  88.40644   480     696 3.833879   3.789994    TRUE
## 15 14 140.1684  86.78946   478     221 3.892542   3.789275    TRUE
## 16 15 139.2770  85.14463   474     416 3.931228   3.788554    TRUE
## 17 16 138.3915  83.49171   465     371 3.911867   3.787830    TRUE
## 18 17 137.5252  81.88374   440     646 3.693954   3.787104   FALSE
## 19 18 136.7207  80.48728   415     393 3.457432   3.786377   FALSE
## 20 19 135.9787  79.29637   402     249 3.354773   3.785646   FALSE
## 21 20 135.2674  78.19552   392     716 3.283214   3.784914   FALSE
## 22 21 134.5791  77.15776   387     711 3.271491   3.784180   FALSE
## 23 22 133.9005  76.13910   375     259 3.166566   3.783443   FALSE
## 24 23 133.2507  75.20174   370     232 3.148189   3.782704   FALSE

Recall that Insulin has 48.7% NAs, and to that we must now add a good number of outliers. Although the last outlier flagged by the test is 465, there are still values below it that are extremely high. From the numerical summary, the third quartile of this variable is 190, a very high value for this variable, so we could conclude that a large share of its records are either wrong or simply were not recorded. For this reason we decided to drop this variable from the analysis, since it is too problematic.

data<-select(data,-Insulin)

test <- rosnerTest(data$BMI, k = length(out_BMI))

## Warning in rosnerTest(data$BMI, k = length(out_BMI)): 11 observations with
## NA/NaN/Inf in 'x' removed.

test$all.stats

##   i   Mean.i     SD.i Value Obs.Num    R.i+1 lambda.i+1 Outlier
## 1 0 32.45746 6.924988  67.1     178 5.002541   3.970442    TRUE
## 2 1 32.41164 6.813761  59.4     446 3.960861   3.970107   FALSE
## 3 2 32.37589 6.746971  57.3     674 3.694118   3.969772   FALSE
## 4 3 32.34284 6.689992  55.0     126 3.386725   3.969437   FALSE
## 5 4 32.31275 6.643189  53.2     121 3.144160   3.969101   FALSE
## 6 5 32.28497 6.603713  52.9     304 3.121733   3.968764   FALSE
## 7 6 32.25752 6.565042  52.3     194 3.052909   3.968427   FALSE
## 8 7 32.23080 6.528422  52.3     248 3.074127   3.968089   FALSE

out_BMI<-test$all.stats[1,5]

test <- rosnerTest(data$DiabetesPedigreeFunction, k = length(out_diab))

## Warning in rosnerTest(data$DiabetesPedigreeFunction, k = length(out_diab)): The true Type I error may be larger than assumed.
## Although the help file for 'rosnerTest' has a table with information
## on the estimated Type I error level,
## simulations were not run for k > 10 or k > floor(n/2).

test$all.stats

##     i    Mean.i      SD.i Value Obs.Num    R.i+1 lambda.i+1 Outlier
## 1   0 0.4718763 0.3313286 2.420     446 5.879733   3.974092    TRUE
## 2   1 0.4693364 0.3239768 2.329     229 5.740114   3.973762    TRUE
## 3   2 0.4669086 0.3171301 2.288       5 5.742410   3.973432    TRUE
## 4   3 0.4645281 0.3104137 2.137     371 5.387880   3.973102    TRUE
## 5   4 0.4623390 0.3046509 1.893      46 4.696067   3.972771    TRUE
## 6   5 0.4604640 0.3004070 1.781      59 4.395823   3.972440    TRUE
## 7   6 0.4587310 0.2967633 1.731     372 4.287150   3.972108    TRUE
## 8   7 0.4570591 0.2933457 1.699     594 4.233710   3.971776    TRUE
## 9   8 0.4554250 0.2900522 1.698     622 4.283971   3.971443    TRUE
## 10  9 0.4537879 0.2867083 1.600     396 3.997834   3.971110    TRUE
## 11 10 0.4522757 0.2838528 1.476     331 3.606533   3.970776   FALSE
## 12 11 0.4509234 0.2815864 1.461     623 3.587093   3.970442   FALSE
## 13 12 0.4495873 0.2793614 1.441      13 3.548854   3.970107   FALSE
## 14 13 0.4482742 0.2772021 1.400     148 3.433329   3.969772   FALSE
## 15 14 0.4470119 0.2752064 1.394     662 3.441011   3.969437   FALSE
## 16 15 0.4457543 0.2732126 1.391     309 3.459744   3.969101   FALSE
## 17 16 0.4444973 0.2712070 1.390      40 3.486277   3.968764   FALSE
## 18 17 0.4432383 0.2691797 1.353     260 3.379755   3.968427   FALSE
## 19 18 0.4420253 0.2672975 1.321     188 3.288376   3.968089   FALSE
## 20 19 0.4408518 0.2655357 1.318     244 3.303315   3.967751   FALSE
## 21 20 0.4396791 0.2637656 1.292     660 3.231358   3.967412   FALSE
## 22 21 0.4385382 0.2620886 1.282     619 3.218232   3.967073   FALSE
## 23 22 0.4374075 0.2604350 1.268     384 3.189250   3.966734   FALSE
## 24 23 0.4362926 0.2588225 1.258     607 3.174792   3.966394   FALSE
## 25 24 0.4351882 0.2572339 1.251     535 3.171479   3.966053   FALSE
## 26 25 0.4340902 0.2556565 1.224     219 3.089731   3.965712   FALSE
## 27 26 0.4330256 0.2541757 1.224     293 3.111920   3.965370   FALSE
## 28 27 0.4319582 0.2526776 1.222     101 3.126679   3.965028   FALSE
## 29 28 0.4308905 0.2511705 1.213     246 3.113858   3.964686   FALSE

out_diab<-c()
for (i in 1:10){
  out_diab[i]<-test$all.stats[i,5]
}
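
Instead of hard-coding how many rows were flagged, the positions can be read from the Outlier column of the results table. A minimal sketch on a small data frame that mimics the layout of test$all.stats, using four rows copied from the output above (the names `all_stats` and `flagged_rows` are hypothetical):

```r
# Small table with the same column layout as test$all.stats (rows taken from the output above).
all_stats <- data.frame(
  Obs.Num = c(446L, 229L, 5L, 331L),
  Value   = c(2.420, 2.329, 2.288, 1.476),
  Outlier = c(TRUE, TRUE, TRUE, FALSE)
)

# Keep the observation numbers of the rows flagged as outliers.
flagged_rows <- all_stats$Obs.Num[all_stats$Outlier]

flagged_rows
```

The same expression works for any variable, including those where no row is flagged, in which case it returns an empty vector.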

test <- rosnerTest(data$Age, k = length(out_age))
test$all.stats

##   i   Mean.i     SD.i Value Obs.Num    R.i+1 lambda.i+1 Outlier
## 1 0 33.24089 11.76023    81     460 4.061069   3.974092    TRUE
## 2 1 33.17862 11.64053    72     454 3.335018   3.973762   FALSE
## 3 2 33.12794 11.56315    70     667 3.188755   3.973432   FALSE
## 4 3 33.07974 11.49346    69     124 3.125278   3.973102   FALSE
## 5 4 33.03272 11.42714    69     685 3.147531   3.972771   FALSE
## 6 5 32.98558 11.36006    68     675 3.082239   3.972440   FALSE
## 7 6 32.93963 11.29634    67     364 3.015167   3.972108   FALSE
## 8 7 32.89488 11.23596    67     490 3.035354   3.971776   FALSE
## 9 8 32.85000 11.17491    67     538 3.055953   3.971443   FALSE

out_age<-test$all.stats[1,5]

As anticipated, the outliers found will receive exactly the same imputation as the NAs, so we treat them as missing data:

new_dataf<-data
new_dataf$BloodPressure[out_blood]<-NA
new_dataf$SkinThickness[out_skin]<-NA
new_dataf$BMI[out_BMI]<-NA
for (i in 1:length(out_diab)){
  new_dataf$DiabetesPedigreeFunction[out_diab[i]]<-NA
}
new_dataf$Age[out_age]<-NA

Again, we check the proportion of NAs:

# Recompute the number of NAs per column
nas<-NAs(new_dataf)

# Proportion of NAs per column
nasprp<-nas/n
nasprp

## [1] 0.000000000 0.006510417 0.046875000 0.296875000 0.015625000 0.013020833
## [7] 0.001302083 0.000000000

summary(new_dataf)

##   Pregnancies        Glucose      BloodPressure    SkinThickness  
##  Min.   : 0.000   Min.   : 44.0   Min.   : 24.00   Min.   : 7.00  
##  1st Qu.: 1.000   1st Qu.: 99.0   1st Qu.: 64.00   1st Qu.:22.00  
##  Median : 3.000   Median :117.0   Median : 72.00   Median :29.00  
##  Mean   : 3.845   Mean   :121.7   Mean   : 72.34   Mean   :29.02  
##  3rd Qu.: 6.000   3rd Qu.:141.0   3rd Qu.: 80.00   3rd Qu.:36.00  
##  Max.   :17.000   Max.   :199.0   Max.   :114.00   Max.   :63.00  
##                   NA's   :5       NA's   :36       NA's   :228    
##       BMI        DiabetesPedigreeFunction      Age           Outcome     
##  Min.   :18.20   Min.   :0.0780           Min.   :21.00   Min.   :0.000  
##  1st Qu.:27.50   1st Qu.:0.2402           1st Qu.:24.00   1st Qu.:0.000  
##  Median :32.30   Median :0.3670           Median :29.00   Median :0.000  
##  Mean   :32.41   Mean   :0.4523           Mean   :33.18   Mean   :0.349  
##  3rd Qu.:36.60   3rd Qu.:0.6092           3rd Qu.:41.00   3rd Qu.:1.000  
##  Max.   :59.40   Max.   :1.4760           Max.   :72.00   Max.   :1.000  
##  NA's   :12      NA's   :10               NA's   :1

colnames<-colnames(new_dataf)

faltantes<-data.frame(nasprp,colnames)
faltantes

##        nasprp                 colnames
## 1 0.000000000              Pregnancies
## 2 0.006510417                  Glucose
## 3 0.046875000            BloodPressure
## 4 0.296875000            SkinThickness
## 5 0.015625000                      BMI
## 6 0.013020833 DiabetesPedigreeFunction
## 7 0.001302083                      Age
## 8 0.000000000                  Outcome

Since every column to be imputed has less than 5% NAs, except SkinThickness, we can drop the rows with NAs in those columns and then impute SkinThickness.

new_dataf<-new_dataf %>% filter(!is.na(Glucose)& !is.na(BloodPressure)
                                & !is.na(BMI) & !is.na(DiabetesPedigreeFunction)
                                & !is.na(Age))
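
The same row filter can be expressed with base R's complete.cases() restricted to the columns of interest. A minimal sketch on a toy data frame (values and the names `toy`, `drop_cols` and `kept` are illustrative assumptions):

```r
# Toy data frame: NAs in columns to drop on, plus one NA in the column to impute later.
toy <- data.frame(
  Glucose       = c(148, NA, 183, 89),
  SkinThickness = c(35, 29, NA, 23),
  BMI           = c(33.6, 26.6, 23.3, NA)
)

# Only these columns decide whether a row is dropped.
drop_cols <- c("Glucose", "BMI")

# Keep rows that are complete in drop_cols; NAs in SkinThickness are preserved.
kept <- toy[complete.cases(toy[drop_cols]), ]

kept
```

Rows with a missing SkinThickness survive the filter, which is what allows that variable to be imputed afterwards.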

Evaluating the imputation

First we look graphically at the distribution of each version of the variable, as a first assessment of the imputation.

Then we run normality tests to decide which test to use to compare the distribution of the variables before and after imputation.

Glucose:

ggp1 <- ggplot(data, aes(x=Glucose)) +
  geom_histogram(fill="#FD0000", color="#E52521", alpha=0.9) +
  ggtitle("Original variable") +
  xlab("Glucose") + ylab("Frequency") +
  theme_ipsum() +
  theme(plot.title = element_text(size=15))

ggp2 <- ggplot(new_dataf, aes(x=Glucose)) +
  geom_histogram(fill="#43B047", color="#049DCB", alpha=0.9) +
  ggtitle("Imputed variable") +
  xlab("Glucose") + ylab("Frequency") +
  theme_ipsum() +
  theme(plot.title = element_text(size=15))
  
grid.arrange(ggp1, ggp2, ncol = 2)

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

## Warning: Removed 5 rows containing non-finite outside the scale range
## (`stat_bin()`).

The distribution of the two versions of the variable does not differ much; we now use an analytical test to check this.

ks.test(data$Glucose,"pnorm")

## Warning in ks.test.default(data$Glucose, "pnorm"): ties should not be present
## for the one-sample Kolmogorov-Smirnov test

## 
##  Asymptotic one-sample Kolmogorov-Smirnov test
## 
## data:  data$Glucose
## D = 1, p-value < 2.2e-16
## alternative hypothesis: two-sided

ks.test(new_dataf$Glucose,"pnorm")

## Warning in ks.test.default(new_dataf$Glucose, "pnorm"): ties should not be
## present for the one-sample Kolmogorov-Smirnov test

## 
##  Asymptotic one-sample Kolmogorov-Smirnov test
## 
## data:  new_dataf$Glucose
## D = 1, p-value < 2.2e-16
## alternative hypothesis: two-sided
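
A variant of this normality check that compares against a normal distribution with the sample's own mean and standard deviation could look like the sketch below. It uses simulated data rather than the diabetes dataset, and the names are illustrative. Two caveats are assumptions worth stating: estimating the parameters from the same sample makes the Kolmogorov-Smirnov p-value conservative, and shapiro.test() is shown only as one common alternative.

```r
# Simulated, genuinely normal data on a scale similar to Glucose (illustrative only).
set.seed(1)
x <- rnorm(200, mean = 120, sd = 30)

# Without mean and sd, "pnorm" means the standard normal N(0, 1):
# every value lies far in the upper tail, so D is essentially 1.
ks_std <- ks.test(x, "pnorm")

# Passing the sample mean and sd compares against a normal on the data's own scale.
ks_fit <- ks.test(x, "pnorm", mean = mean(x), sd = sd(x))

# Shapiro-Wilk as an alternative normality test.
sw <- shapiro.test(x)

c(D_standard = unname(ks_std$statistic), D_fitted = unname(ks_fit$statistic))
```

Even though the simulated sample is normal, the first call reports a distance close to 1, which shows that D = 1 on its own reflects the scale of the data rather than the shape of its distribution.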

Both tests reject normality (bearing in mind that ks.test(x, "pnorm") with no mean or sd compares against the standard normal), so we use a nonparametric test to check whether there is a significant difference between the two versions of the variable. In this case we use the Wilcoxon test.

wilcox.test(data$Glucose, new_dataf$Glucose, alternative = "two.sided")

## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  data$Glucose and new_dataf$Glucose
## W = 272264, p-value = 0.938
## alternative hypothesis: true location shift is not equal to 0

Since the p-value is greater than the 0.05 significance level, there is no significant difference between the imputed and the non-imputed variable.

BloodPressure:

ggp1 <- ggplot(data, aes(x=BloodPressure)) +
  geom_histogram(fill="#FD0000", color="#E52521", alpha=0.9) +
  ggtitle("Original variable") +
  xlab("BloodPressure") + ylab("Frequency") +
  theme_ipsum() +
  theme(plot.title = element_text(size=15))

ggp2 <- ggplot(new_dataf, aes(x=BloodPressure)) +
  geom_histogram(fill="#43B047", color="#049DCB", alpha=0.9) +
  ggtitle("Imputed variable") +
  xlab("BloodPressure") + ylab("Frequency") +
  theme_ipsum() +
  theme(plot.title = element_text(size=15))
  
grid.arrange(ggp1, ggp2, ncol = 2)

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

## Warning: Removed 35 rows containing non-finite outside the scale range
## (`stat_bin()`).

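The histograms suggest that the distribution of the variable changes little after imputation. We now check this with formal tests. Note that `ks.test(x, "pnorm")` without `mean` and `sd` arguments compares the data with the standard normal N(0, 1), so D = 1 below mainly reflects that the values lie far from zero.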

ks.test(data$BloodPressure,"pnorm")

## Warning in ks.test.default(data$BloodPressure, "pnorm"): ties should not be
## present for the one-sample Kolmogorov-Smirnov test

## 
##  Asymptotic one-sample Kolmogorov-Smirnov test
## 
## data:  data$BloodPressure
## D = 1, p-value < 2.2e-16
## alternative hypothesis: two-sided

ks.test(new_dataf$BloodPressure,"pnorm")

## Warning in ks.test.default(new_dataf$BloodPressure, "pnorm"): ties should not
## be present for the one-sample Kolmogorov-Smirnov test

## 
##  Asymptotic one-sample Kolmogorov-Smirnov test
## 
## data:  new_dataf$BloodPressure
## D = 1, p-value < 2.2e-16
## alternative hypothesis: two-sided

wilcox.test(data$BloodPressure, new_dataf$BloodPressure, alternative = "two.sided")

## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  data$BloodPressure and new_dataf$BloodPressure
## W = 261315, p-value = 0.9631
## alternative hypothesis: true location shift is not equal to 0

Since the p-value (0.9631) is greater than the 0.05 significance level, we find no significant difference between the original and the imputed variable.
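The `ks.test(x, "pnorm")` calls above compare each variable with the standard normal N(0, 1), which is why D equals 1. A minimal sketch of a check against a normal with matching mean and standard deviation follows. It uses simulated data (the vector `x` is hypothetical, not the data set of this exercise). Since the parameters are estimated from the same sample, the resulting p-value is only approximate; `shapiro.test()` is shown as a common alternative.

```r
# Simulated stand-in for a blood-pressure-like variable
# (hypothetical values, not the data used in this exercise).
set.seed(123)
x <- rnorm(500, mean = 72, sd = 12)

# Default call: compares x with the standard normal N(0, 1).
ks_default <- ks.test(x, "pnorm")

# Passing the sample mean and sd compares x with a normal that has
# the same location and scale as the data.
ks_fitted <- ks.test(x, "pnorm", mean = mean(x), sd = sd(x))

print(ks_default$statistic)  # close to 1: the values lie far above 0
print(ks_fitted$statistic)   # small: the shape is compatible with a normal

# Shapiro-Wilk does not require the mean and sd to be specified.
sw <- shapiro.test(x)
print(sw$p.value)
```

With real data containing `NA` values, `ks.test()` drops them before testing, so the same call pattern applies to the original, non-imputed column.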

BMI:

ggp1 <- ggplot(data, aes(x=BMI)) +
  geom_histogram(fill="#FD0000", color="#E52521", alpha=0.9) +
  ggtitle("Variable original") +
  xlab("BMI") + ylab("Frequency") +
  theme_ipsum() +
  theme(plot.title = element_text(size=15))

ggp2 <- ggplot(new_dataf, aes(x=BMI)) +
  geom_histogram(fill="#43B047", color="#049DCB", alpha=0.9) +
  ggtitle("Variable imputada") +
  xlab("BMI") + ylab("Frequency") +
  theme_ipsum() +
  theme(plot.title = element_text(size=15))
  
grid.arrange(ggp1, ggp2, ncol = 2)

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

## Warning: Removed 11 rows containing non-finite outside the scale range
## (`stat_bin()`).

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

The histograms suggest that the distribution of the variable changes little after imputation. We now check this with formal tests.

ks.test(data$BMI,"pnorm")

## Warning in ks.test.default(data$BMI, "pnorm"): ties should not be present for
## the one-sample Kolmogorov-Smirnov test

## 
##  Asymptotic one-sample Kolmogorov-Smirnov test
## 
## data:  data$BMI
## D = 1, p-value < 2.2e-16
## alternative hypothesis: two-sided

ks.test(new_dataf$BMI,"pnorm")

## Warning in ks.test.default(new_dataf$BMI, "pnorm"): ties should not be present
## for the one-sample Kolmogorov-Smirnov test

## 
##  Asymptotic one-sample Kolmogorov-Smirnov test
## 
## data:  new_dataf$BMI
## D = 1, p-value < 2.2e-16
## alternative hypothesis: two-sided

wilcox.test(data$BMI, new_dataf$BMI, alternative = "two.sided")

## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  data$BMI and new_dataf$BMI
## W = 269884, p-value = 0.9616
## alternative hypothesis: true location shift is not equal to 0

Since the p-value (0.9616) is greater than the 0.05 significance level, we find no significant difference between the original and the imputed variable.
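The Wilcoxon test above looks for a location shift. If the goal is to compare the whole distribution before and after imputation, one option is the two-sample Kolmogorov-Smirnov test. The sketch below assumes simulated data and a simple mean imputation, both hypothetical stand-ins for `data$BMI` and `new_dataf$BMI`. Because the imputed vector contains every observed value, the two samples are not independent, so the p-value is best read as a rough guide.

```r
# Hypothetical stand-in for a variable with missing values.
set.seed(99)
original <- rnorm(400, mean = 32, sd = 7)
original[sample(400, 40)] <- NA

# Simple mean imputation, used here only to have something to compare.
imputed <- original
imputed[is.na(imputed)] <- mean(original, na.rm = TRUE)

# Two-sample KS test: compares the two empirical distribution functions.
# NA values in `original` are dropped by ks.test().
ks_two <- ks.test(original, imputed)
print(ks_two$statistic)
print(ks_two$p.value)
```

The statistic D is the largest vertical gap between the two empirical distribution functions, so it also reacts to changes in spread or shape that a location test can miss.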

DiabetesPedigreeFunction:

ggp1 <- ggplot(data, aes(x=DiabetesPedigreeFunction)) +
  geom_histogram(fill="#FD0000", color="#E52521", alpha=0.9) +
  ggtitle("Variable original") +
  xlab("DiabetesPedigreeFunction") + ylab("Frequency") +
  theme_ipsum() +
  theme(plot.title = element_text(size=15))

ggp2 <- ggplot(new_dataf, aes(x=DiabetesPedigreeFunction)) +
  geom_histogram(fill="#43B047", color="#049DCB", alpha=0.9) +
  ggtitle("Variable imputada") +
  xlab("DiabetesPedigreeFunction") + ylab("Frequency") +
  theme_ipsum() +
  theme(plot.title = element_text(size=15))
  
grid.arrange(ggp1, ggp2, ncol = 2)

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

The histograms suggest that the distribution of the variable changes little after imputation. We now check this with formal tests.

ks.test(data$DiabetesPedigreeFunction,"pnorm")

## Warning in ks.test.default(data$DiabetesPedigreeFunction, "pnorm"): ties should
## not be present for the one-sample Kolmogorov-Smirnov test

## 
##  Asymptotic one-sample Kolmogorov-Smirnov test
## 
## data:  data$DiabetesPedigreeFunction
## D = 0.53217, p-value < 2.2e-16
## alternative hypothesis: two-sided

ks.test(new_dataf$DiabetesPedigreeFunction,"pnorm")

## Warning in ks.test.default(new_dataf$DiabetesPedigreeFunction, "pnorm"): ties
## should not be present for the one-sample Kolmogorov-Smirnov test

## 
##  Asymptotic one-sample Kolmogorov-Smirnov test
## 
## data:  new_dataf$DiabetesPedigreeFunction
## D = 0.53207, p-value < 2.2e-16
## alternative hypothesis: two-sided

wilcox.test(data$DiabetesPedigreeFunction, new_dataf$DiabetesPedigreeFunction, alternative = "two.sided")

## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  data$DiabetesPedigreeFunction and new_dataf$DiabetesPedigreeFunction
## W = 274787, p-value = 0.8668
## alternative hypothesis: true location shift is not equal to 0

Since the p-value (0.8668) is greater than the 0.05 significance level, we find no significant difference between the original and the imputed variable.

Age:

ggp1 <- ggplot(data, aes(x=Age)) +
  geom_histogram(fill="#FD0000", color="#E52521", alpha=0.9) +
  ggtitle("Variable original") +
  xlab("Age") + ylab("Frequency") +
  theme_ipsum() +
  theme(plot.title = element_text(size=15))

ggp2 <- ggplot(new_dataf, aes(x=Age)) +
  geom_histogram(fill="#43B047", color="#049DCB", alpha=0.9) +
  ggtitle("Variable imputada") +
  xlab("Age") + ylab("Frequency") +
  theme_ipsum() +
  theme(plot.title = element_text(size=15))
  
grid.arrange(ggp1, ggp2, ncol = 2)

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

The histograms suggest that the distribution of the variable changes little after imputation. We now check this with formal tests.

ks.test(data$Age,"pnorm")

## Warning in ks.test.default(data$Age, "pnorm"): ties should not be present for
## the one-sample Kolmogorov-Smirnov test

## 
##  Asymptotic one-sample Kolmogorov-Smirnov test
## 
## data:  data$Age
## D = 1, p-value < 2.2e-16
## alternative hypothesis: two-sided

ks.test(new_dataf$Age,"pnorm")

## Warning in ks.test.default(new_dataf$Age, "pnorm"): ties should not be present
## for the one-sample Kolmogorov-Smirnov test

## 
##  Asymptotic one-sample Kolmogorov-Smirnov test
## 
## data:  new_dataf$Age
## D = 1, p-value < 2.2e-16
## alternative hypothesis: two-sided

wilcox.test(data$Age, new_dataf$Age, alternative = "two.sided")

## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  data$Age and new_dataf$Age
## W = 271622, p-value = 0.8277
## alternative hypothesis: true location shift is not equal to 0

Since the p-value (0.8277) is greater than the 0.05 significance level, we find no significant difference between the original and the imputed variable.

For all the variables imputed so far, the tests found no significant difference between the values before and after imputation, so we can move on to the next imputation.
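The same comparison was repeated by hand for each variable. As a sketch, it could be wrapped in a small helper that returns one row per variable. The function name `compare_imputation` and the simulated data frames are hypothetical; in this exercise the inputs would be `data` and `new_dataf`.

```r
# Hypothetical helper: run the Wilcoxon comparison for several columns
# and collect the results in one data frame.
compare_imputation <- function(original, imputed, vars) {
  rows <- lapply(vars, function(v) {
    test <- wilcox.test(original[[v]], imputed[[v]],
                        alternative = "two.sided")
    data.frame(variable  = v,
               n_missing = sum(is.na(original[[v]])),
               W         = unname(test$statistic),
               p_value   = test$p.value)
  })
  do.call(rbind, rows)
}

# Simulated stand-ins for the original and imputed data frames.
set.seed(1)
original <- data.frame(a = rnorm(200, 70, 10), b = rnorm(200, 32, 6))
original$a[sample(200, 20)] <- NA
original$b[sample(200, 10)] <- NA

# Median imputation, used here only to build an "imputed" version.
imputed <- original
imputed$a[is.na(imputed$a)] <- median(original$a, na.rm = TRUE)
imputed$b[is.na(imputed$b)] <- median(original$b, na.rm = TRUE)

result <- compare_imputation(original, imputed, c("a", "b"))
print(result)
```

Collecting the p-values in one table makes it easier to see at a glance which variables, if any, fall below the 0.05 level.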

Imputing the SkinThickness variable

We now impute this variable with four different `mice` methods: `pmm`, `norm.predict`, `norm.nob`, and `norm`.

`pmm`:

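# Run multiple imputation (m = 5 data sets) with predictive mean matching.
imp <- mice(new_dataf, m=5, maxit=50, method='pmm', seed=500, printFlag = FALSE)
# complete() returns the first of the five imputed data sets by default (action = 1).
imp_df <- mice::complete(imp)
summary(imp_df)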

##   Pregnancies        Glucose       BloodPressure    SkinThickness  
##  Min.   : 0.000   Min.   : 44.00   Min.   : 24.00   Min.   : 7.00  
##  1st Qu.: 1.000   1st Qu.: 99.75   1st Qu.: 64.00   1st Qu.:21.00  
##  Median : 3.000   Median :117.00   Median : 72.00   Median :29.00  
##  Mean   : 3.899   Mean   :121.59   Mean   : 72.34   Mean   :28.52  
##  3rd Qu.: 6.000   3rd Qu.:141.00   3rd Qu.: 80.00   3rd Qu.:36.00  
##  Max.   :17.000   Max.   :199.00   Max.   :114.00   Max.   :60.00  
##       BMI        DiabetesPedigreeFunction      Age           Outcome      
##  Min.   :18.20   Min.   :0.0780           Min.   :21.00   Min.   :0.0000  
##  1st Qu.:27.50   1st Qu.:0.2447           1st Qu.:24.00   1st Qu.:0.0000  
##  Median :32.35   Median :0.3745           Median :29.00   Median :0.0000  
##  Mean   :32.37   Mean   :0.4563           Mean   :33.36   Mean   :0.3427  
##  3rd Qu.:36.50   3rd Qu.:0.6132           3rd Qu.:41.00   3rd Qu.:1.0000  
##  Max.   :57.30   Max.   :1.4760           Max.   :70.00   Max.   :1.0000

ggp1 <- ggplot(data, aes(x=SkinThickness)) +
  geom_histogram(fill="#FD0000", color="#E52521", alpha=0.9) +
  ggtitle("Variable original") +
  xlab("SkinThickness") + ylab("Frequency") +
  theme_ipsum() +
  theme(plot.title = element_text(size=15))+
  coord_cartesian(xlim = c(0, 60))

ggp2 <- ggplot(imp_df, aes(x=SkinThickness)) +
  geom_histogram(fill="#43B047", color="#049DCB", alpha=0.9) +
  ggtitle("Variable imputada") +
  xlab("SkinThickness") + ylab("Frequency") +
  theme_ipsum() +
  theme(plot.title = element_text(size=15))
  
grid.arrange(ggp1, ggp2, ncol = 2)

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

## Warning: Removed 227 rows containing non-finite outside the scale range
## (`stat_bin()`).

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

We can see a slight change in the distribution. Let us check with a formal test whether the two distributions differ significantly.

Again, we first check normality to decide which test to use:

ks.test(data$SkinThickness,"pnorm")

## Warning in ks.test.default(data$SkinThickness, "pnorm"): ties should not be
## present for the one-sample Kolmogorov-Smirnov test

## 
##  Asymptotic one-sample Kolmogorov-Smirnov test
## 
## data:  data$SkinThickness
## D = 1, p-value < 2.2e-16
## alternative hypothesis: two-sided

ks.test(imp_df$SkinThickness,"pnorm")

## Warning in ks.test.default(imp_df$SkinThickness, "pnorm"): ties should not be
## present for the one-sample Kolmogorov-Smirnov test

## 
##  Asymptotic one-sample Kolmogorov-Smirnov test
## 
## data:  imp_df$SkinThickness
## D = 1, p-value < 2.2e-16
## alternative hypothesis: two-sided

Since the test rejects normality for both versions, we again use the Wilcoxon test:

wilcox.test(data$SkinThickness, imp_df$SkinThickness, alternative = "two.sided")

## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  data$SkinThickness and imp_df$SkinThickness
## W = 198366, p-value = 0.3629
## alternative hypothesis: true location shift is not equal to 0

Since the p-value (0.3629) is greater than the 0.05 significance level, we find no significant difference between the original and the imputed variable.
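Predictive mean matching fills each missing entry with a value actually observed in a "donor" case whose predicted value is close. This is why the `pmm` summary above keeps SkinThickness within its observed range (7 to 60). The following is a simplified base-R sketch of the idea, not the `mice` implementation: it uses a single predictor, a plain least-squares fit, and hypothetical simulated variables `bmi` and `skin`.

```r
# Hypothetical data: one predictor (bmi) and one target with gaps (skin).
set.seed(42)
n    <- 300
bmi  <- rnorm(n, mean = 32, sd = 6)
skin <- round(5 + 0.7 * bmi + rnorm(n, sd = 6))
skin[sample(n, 90)] <- NA

obs <- !is.na(skin)

# Fit a regression on the observed cases and predict for every case.
fit  <- lm(skin ~ bmi, subset = obs)
pred <- predict(fit, newdata = data.frame(bmi = bmi))

k <- 5            # number of candidate donors per missing case
imputed <- skin
for (i in which(!obs)) {
  # Observed cases whose predicted value is closest to that of case i.
  nearest <- order(abs(pred[obs] - pred[i]))[1:k]
  donors  <- which(obs)[nearest]
  # Copy the observed value of one randomly chosen donor.
  imputed[i] <- skin[donors[sample.int(k, 1)]]
}

print(range(skin, na.rm = TRUE))
print(range(imputed))
```

Because every imputed value is copied from an observed case, the completed variable cannot leave the observed range, which is convenient for bounded measurements such as skin thickness.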

`norm.predict`:

imp <- mice(new_dataf, m=5, maxit=50, method='norm.predict', seed=500, printFlag = FALSE)
imp_df <- mice::complete(imp)   
summary(imp_df)

##   Pregnancies        Glucose       BloodPressure    SkinThickness  
##  Min.   : 0.000   Min.   : 44.00   Min.   : 24.00   Min.   : 7.00  
##  1st Qu.: 1.000   1st Qu.: 99.75   1st Qu.: 64.00   1st Qu.:22.00  
##  Median : 3.000   Median :117.00   Median : 72.00   Median :28.43  
##  Mean   : 3.899   Mean   :121.59   Mean   : 72.34   Mean   :28.66  
##  3rd Qu.: 6.000   3rd Qu.:141.00   3rd Qu.: 80.00   3rd Qu.:35.00  
##  Max.   :17.000   Max.   :199.00   Max.   :114.00   Max.   :60.00  
##       BMI        DiabetesPedigreeFunction      Age           Outcome      
##  Min.   :18.20   Min.   :0.0780           Min.   :21.00   Min.   :0.0000  
##  1st Qu.:27.50   1st Qu.:0.2447           1st Qu.:24.00   1st Qu.:0.0000  
##  Median :32.35   Median :0.3745           Median :29.00   Median :0.0000  
##  Mean   :32.37   Mean   :0.4563           Mean   :33.36   Mean   :0.3427  
##  3rd Qu.:36.50   3rd Qu.:0.6132           3rd Qu.:41.00   3rd Qu.:1.0000  
##  Max.   :57.30   Max.   :1.4760           Max.   :70.00   Max.   :1.0000

ggp1 <- ggplot(data, aes(x=SkinThickness)) +
  geom_histogram(fill="#FD0000", color="#E52521", alpha=0.9) +
  ggtitle("Variable original") +
  xlab("SkinThickness") + ylab("Frequency") +
  theme_ipsum() +
  theme(plot.title = element_text(size=15))+
  coord_cartesian(xlim = c(0, 60))

ggp2 <- ggplot(imp_df, aes(x=SkinThickness)) +
  geom_histogram(fill="#43B047", color="#049DCB", alpha=0.9) +
  ggtitle("Variable imputada") +
  xlab("SkinThickness") + ylab("Frequency") +
  theme_ipsum() +
  theme(plot.title = element_text(size=15))
  
grid.arrange(ggp1, ggp2, ncol = 2)

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

## Warning: Removed 227 rows containing non-finite outside the scale range
## (`stat_bin()`).

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

We can see a slight change in the distribution. Let us check with a formal test whether the two distributions differ significantly.

Again, we first check normality to decide which test to use:

ks.test(data$SkinThickness,"pnorm")

## Warning in ks.test.default(data$SkinThickness, "pnorm"): ties should not be
## present for the one-sample Kolmogorov-Smirnov test

## 
##  Asymptotic one-sample Kolmogorov-Smirnov test
## 
## data:  data$SkinThickness
## D = 1, p-value < 2.2e-16
## alternative hypothesis: two-sided

ks.test(imp_df$SkinThickness,"pnorm")

## Warning in ks.test.default(imp_df$SkinThickness, "pnorm"): ties should not be
## present for the one-sample Kolmogorov-Smirnov test

## 
##  Asymptotic one-sample Kolmogorov-Smirnov test
## 
## data:  imp_df$SkinThickness
## D = 1, p-value < 2.2e-16
## alternative hypothesis: two-sided

Since the test rejects normality for both versions, we again use the Wilcoxon test:

wilcox.test(data$SkinThickness, imp_df$SkinThickness, alternative = "two.sided")

## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  data$SkinThickness and imp_df$SkinThickness
## W = 196903, p-value = 0.4972
## alternative hypothesis: true location shift is not equal to 0

Since the p-value (0.4972) is greater than the 0.05 significance level, we find no significant difference between the original and the imputed variable.

`norm.nob`:

imp <- mice(new_dataf, m=5, maxit=50, method='norm.nob', seed=500, printFlag = FALSE)
imp_df <- mice::complete(imp)   
summary(imp_df)

##   Pregnancies        Glucose       BloodPressure    SkinThickness  
##  Min.   : 0.000   Min.   : 44.00   Min.   : 24.00   Min.   : 2.17  
##  1st Qu.: 1.000   1st Qu.: 99.75   1st Qu.: 64.00   1st Qu.:21.00  
##  Median : 3.000   Median :117.00   Median : 72.00   Median :28.00  
##  Mean   : 3.899   Mean   :121.59   Mean   : 72.34   Mean   :28.36  
##  3rd Qu.: 6.000   3rd Qu.:141.00   3rd Qu.: 80.00   3rd Qu.:36.00  
##  Max.   :17.000   Max.   :199.00   Max.   :114.00   Max.   :60.00  
##       BMI        DiabetesPedigreeFunction      Age           Outcome      
##  Min.   :18.20   Min.   :0.0780           Min.   :21.00   Min.   :0.0000  
##  1st Qu.:27.50   1st Qu.:0.2447           1st Qu.:24.00   1st Qu.:0.0000  
##  Median :32.35   Median :0.3745           Median :29.00   Median :0.0000  
##  Mean   :32.37   Mean   :0.4563           Mean   :33.36   Mean   :0.3427  
##  3rd Qu.:36.50   3rd Qu.:0.6132           3rd Qu.:41.00   3rd Qu.:1.0000  
##  Max.   :57.30   Max.   :1.4760           Max.   :70.00   Max.   :1.0000

ggp1 <- ggplot(data, aes(x=SkinThickness)) +
  geom_histogram(fill="#FD0000", color="#E52521", alpha=0.9) +
  ggtitle("Variable original") +
  xlab("SkinThickness") + ylab("Frequency") +
  theme_ipsum() +
  theme(plot.title = element_text(size=15))+
  coord_cartesian(xlim = c(0, 60))

ggp2 <- ggplot(imp_df, aes(x=SkinThickness)) +
  geom_histogram(fill="#43B047", color="#049DCB", alpha=0.9) +
  ggtitle("Variable imputada") +
  xlab("SkinThickness") + ylab("Frequency") +
  theme_ipsum() +
  theme(plot.title = element_text(size=15))
  
grid.arrange(ggp1, ggp2, ncol = 2)

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

## Warning: Removed 227 rows containing non-finite outside the scale range
## (`stat_bin()`).

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

In this case the changes are somewhat more pronounced than with the previous methods. Let us check with a formal test whether the two distributions differ significantly.

Again, we first check normality to decide which test to use:

ks.test(data$SkinThickness,"pnorm")

## Warning in ks.test.default(data$SkinThickness, "pnorm"): ties should not be
## present for the one-sample Kolmogorov-Smirnov test

## 
##  Asymptotic one-sample Kolmogorov-Smirnov test
## 
## data:  data$SkinThickness
## D = 1, p-value < 2.2e-16
## alternative hypothesis: two-sided

ks.test(imp_df$SkinThickness,"pnorm")

## Warning in ks.test.default(imp_df$SkinThickness, "pnorm"): ties should not be
## present for the one-sample Kolmogorov-Smirnov test

## 
##  Asymptotic one-sample Kolmogorov-Smirnov test
## 
## data:  imp_df$SkinThickness
## D = 0.99858, p-value < 2.2e-16
## alternative hypothesis: two-sided

Since the test rejects normality for both versions, we again use the Wilcoxon test:

wilcox.test(data$SkinThickness, imp_df$SkinThickness, alternative = "two.sided")

## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  data$SkinThickness and imp_df$SkinThickness
## W = 199681, p-value = 0.264
## alternative hypothesis: true location shift is not equal to 0

Since the p-value (0.264) is greater than the 0.05 significance level, we find no significant difference between the original and the imputed variable.
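The summaries above show that `norm.predict` keeps SkinThickness between 7 and 60, while `norm.nob` produces a minimum of 2.17. A likely reason is that `norm.predict` imputes the regression prediction itself, whereas `norm.nob` adds random residual noise to that prediction. The base-R sketch below illustrates the difference under simplifying assumptions: one predictor, hypothetical simulated variables `bmi` and `skin`, and no attempt to reproduce the `mice` internals.

```r
# Hypothetical data: one predictor (bmi) and one target with gaps (skin).
set.seed(7)
n    <- 300
bmi  <- rnorm(n, mean = 32, sd = 6)
skin <- 5 + 0.7 * bmi + rnorm(n, sd = 8)
miss <- sample(n, 90)
skin_obs <- skin
skin_obs[miss] <- NA

# lm() drops the rows with a missing response by default.
fit       <- lm(skin_obs ~ bmi)
pred      <- predict(fit, newdata = data.frame(bmi = bmi[miss]))
sigma_hat <- summary(fit)$sigma   # estimated residual standard deviation

# Prediction only: every imputed value lies on the regression line.
imp_predict <- pred
# Prediction plus noise: values scatter around the line.
imp_noise   <- pred + rnorm(length(pred), mean = 0, sd = sigma_hat)

print(sd(imp_predict))
print(sd(imp_noise))
print(sd(skin_obs, na.rm = TRUE))
```

Imputing the bare prediction understates the variability of the variable, while adding noise restores it at the cost of occasionally producing values outside the observed range.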

`norm`:

imp <- mice(new_dataf, m=5, maxit=50, method='norm', seed=500, printFlag = FALSE)
imp_df <- mice::complete(imp)   
summary(imp_df)

##   Pregnancies        Glucose       BloodPressure    SkinThickness   
##  Min.   : 0.000   Min.   : 44.00   Min.   : 24.00   Min.   : 3.702  
##  1st Qu.: 1.000   1st Qu.: 99.75   1st Qu.: 64.00   1st Qu.:21.089  
##  Median : 3.000   Median :117.00   Median : 72.00   Median :28.687  
##  Mean   : 3.899   Mean   :121.59   Mean   : 72.34   Mean   :28.587  
##  3rd Qu.: 6.000   3rd Qu.:141.00   3rd Qu.: 80.00   3rd Qu.:35.022  
##  Max.   :17.000   Max.   :199.00   Max.   :114.00   Max.   :60.281  
##       BMI        DiabetesPedigreeFunction      Age           Outcome      
##  Min.   :18.20   Min.   :0.0780           Min.   :21.00   Min.   :0.0000  
##  1st Qu.:27.50   1st Qu.:0.2447           1st Qu.:24.00   1st Qu.:0.0000  
##  Median :32.35   Median :0.3745           Median :29.00   Median :0.0000  
##  Mean   :32.37   Mean   :0.4563           Mean   :33.36   Mean   :0.3427  
##  3rd Qu.:36.50   3rd Qu.:0.6132           3rd Qu.:41.00   3rd Qu.:1.0000  
##  Max.   :57.30   Max.   :1.4760           Max.   :70.00   Max.   :1.0000

ggp1 <- ggplot(data, aes(x=SkinThickness)) +
  geom_histogram(fill="#FD0000", color="#E52521", alpha=0.9) +
  ggtitle("Variable original") +
  xlab("SkinThickness") + ylab("Frequency") +
  theme_ipsum() +
  theme(plot.title = element_text(size=15))+
  coord_cartesian(xlim=c(0,60))

ggp2 <- ggplot(imp_df, aes(x=SkinThickness)) +
  geom_histogram(fill="#43B047", color="#049DCB", alpha=0.9) +
  ggtitle("Variable imputada") +
  xlab("SkinThickness") + ylab("Frequency") +
  theme_ipsum() +
  theme(plot.title = element_text(size=15))
  
grid.arrange(ggp1, ggp2, ncol = 2)

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

## Warning: Removed 227 rows containing non-finite outside the scale range
## (`stat_bin()`).

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

In this case the changes are somewhat more pronounced than with the previous methods. Let us check with an analytical test whether the two distributions differ significantly.

Once again we check normality to decide which test to use:

ks.test(data$SkinThickness,"pnorm")

## Warning in ks.test.default(data$SkinThickness, "pnorm"): ties should not be
## present for the one-sample Kolmogorov-Smirnov test

## 
##  Asymptotic one-sample Kolmogorov-Smirnov test
## 
## data:  data$SkinThickness
## D = 1, p-value < 2.2e-16
## alternative hypothesis: two-sided

ks.test(imp_df$SkinThickness,"pnorm")

## Warning in ks.test.default(imp_df$SkinThickness, "pnorm"): ties should not be
## present for the one-sample Kolmogorov-Smirnov test

## 
##  Asymptotic one-sample Kolmogorov-Smirnov test
## 
## data:  imp_df$SkinThickness
## D = 0.99989, p-value < 2.2e-16
## alternative hypothesis: two-sided
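
One detail worth keeping in mind: `ks.test(x, "pnorm")` with no further arguments compares the sample against the standard normal N(0, 1), which is why D comes out at or near 1 for a variable measured in tens of units. The following is a minimal sketch, on simulated data, of a comparison against a normal with the sample's own mean and standard deviation, together with `shapiro.test()` as an alternative. The vector `x` is a hypothetical stand-in, not the SkinThickness column, and the p-value of the fitted KS test is only approximate because the parameters are estimated from the same sample.

```r
# Simulated right-skewed stand-in for a skin-fold measurement (hypothetical data).
set.seed(123)
x <- rgamma(500, shape = 8, rate = 0.28)

# Default call: compares x against N(0, 1), so D is near 1 whatever the shape.
ks_std <- ks.test(x, "pnorm")

# Compare against a normal with the sample's own mean and sd instead.
# The p-value is approximate because the parameters are estimated from x
# (the Lilliefors correction addresses this).
ks_fit <- ks.test(x, "pnorm", mean = mean(x), sd = sd(x))

# Shapiro-Wilk is a common alternative for sample sizes between 3 and 5000.
sw <- shapiro.test(x)

print(ks_std$statistic)
print(ks_fit$statistic)
print(sw$p.value)
```

With the fitted parameters, D measures departure from the normal shape rather than the difference in location and scale.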

Since normality is rejected for both variables, we again proceed with the Wilcoxon test:

wilcox.test(data$SkinThickness, imp_df$SkinThickness, alternative = "two.sided")

## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  data$SkinThickness and imp_df$SkinThickness
## W = 197405, p-value = 0.4484
## alternative hypothesis: true location shift is not equal to 0

Since the p-value (0.4484) is greater than the 0.05 significance level, we do not reject the null hypothesis: the test finds no significant difference between the imputed and the non-imputed variable.
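
The comparison can be sketched in base R as follows. This is an illustration under stated assumptions, not the imputation used in this exercise: the data are simulated, and the fill-in step is a crude stand-in that draws from a normal fitted to the observed values. Because the rank-sum test targets a location shift, a two-sample Kolmogorov-Smirnov test is added as a complementary check on the overall shape.

```r
set.seed(42)
# Hypothetical stand-in for a variable with missing values (not the real data).
observed <- rnorm(400, mean = 29, sd = 10)
observed[sample(400, 120)] <- NA

# Crude stand-in for an imputation step: fill NAs with draws from a normal
# fitted to the observed values (an assumption made only for illustration).
completed <- observed
n_miss <- sum(is.na(observed))
completed[is.na(observed)] <- rnorm(n_miss,
                                    mean = mean(observed, na.rm = TRUE),
                                    sd = sd(observed, na.rm = TRUE))

# Rank-sum test for a location shift; wilcox.test() drops NAs itself.
w <- wilcox.test(observed, completed, alternative = "two.sided")

# Two-sample KS test as a complementary check on the overall shape.
# The completed vector reuses the observed values, so ties are expected.
k <- ks.test(observed[!is.na(observed)], completed)

print(w$p.value)
print(k$p.value)
```

A large p-value in either test means only that no difference was detected, and the two samples are not independent because the completed vector contains every observed value.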

Conclusions

Of course, the approach taken here is not the only way to treat the data: there are many other imputation methods, such as regression models or machine learning models. Likewise, outliers can be detected in different ways, for example with percentiles or the interquartile range, or with analytical tests such as the Rosner test used here. What matters is to verify, graphically or analytically, that imputation does not change the distribution of the data.
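
As an illustration of the interquartile-range alternative mentioned above, here is a minimal base R sketch. The helper `iqr_outliers()` and the sample values are hypothetical and are not part of this exercise. The multiplier `k = 1.5` is the usual Tukey convention.

```r
# Flag values outside the Tukey fences Q1 - k*IQR and Q3 + k*IQR.
iqr_outliers <- function(x, k = 1.5) {
  q <- quantile(x, probs = c(0.25, 0.75), na.rm = TRUE, names = FALSE)
  fence <- k * (q[2] - q[1])
  # Missing values are never flagged as outliers.
  !is.na(x) & (x < q[1] - fence | x > q[2] + fence)
}

# Hypothetical values, not taken from the data set.
x <- c(21, 24, 25, 27, 29, 30, 33, 41, NA, 99)
flags <- iqr_outliers(x)
print(x[flags])  # only 99 falls outside the fences (13, 45)
```

Unlike the Rosner test, this rule does not require fixing the number of suspected outliers in advance, but it provides no p-value either.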

Activity #3, NA exercise.

Hector Sanjuan Fabregas.

2025-08-12

Exploratory data analysis (EDA).

Missing data detection

Outlier detection

Pregnancies

Glucose

BloodPressure

SkinThickness

Insulin

BMI

DiabetesPedigreeFunction

Age

Evaluation of the imputation

Glucose:

BloodPressure:

BMI:

DiabetesPedigreeFunction:

Age:

Imputation of the SkinThickness variable

`pmm`:

`norm.predict`:

`norm.nob`:

`norm`:

Conclusions
