First, we load the libraries needed for this exercise:
suppressWarnings({
library(Amelia)
library(tidyverse)
library(VIM)
library(mice)
library(mi)
library(EnvStats)
library(dplyr)
library(hrbrthemes)
library(gridExtra)
})
Next, we load the dataset used in the analysis.
data<-read.csv("C:/DataViz/actividades_en_clase/bases_de_datos/diabetes.csv")
Let's take a first look at the data with the functions head() and summary().
head(data)
## Pregnancies Glucose BloodPressure SkinThickness Insulin BMI
## 1 6 148 72 35 0 33.6
## 2 1 85 66 29 0 26.6
## 3 8 183 64 0 0 23.3
## 4 1 89 66 23 94 28.1
## 5 0 137 40 35 168 43.1
## 6 5 116 74 0 0 25.6
## DiabetesPedigreeFunction Age Outcome
## 1 0.627 50 1
## 2 0.351 31 0
## 3 0.672 32 1
## 4 0.167 21 0
## 5 2.288 33 1
## 6 0.201 30 0
The output of head() shows that the explanatory variables are numeric and that the target variable, Outcome, is categorical (coded as 0/1).
summary(data)
## Pregnancies Glucose BloodPressure SkinThickness
## Min. : 0.000 Min. : 0.0 Min. : 0.00 Min. : 0.00
## 1st Qu.: 1.000 1st Qu.: 99.0 1st Qu.: 62.00 1st Qu.: 0.00
## Median : 3.000 Median :117.0 Median : 72.00 Median :23.00
## Mean : 3.845 Mean :120.9 Mean : 69.11 Mean :20.54
## 3rd Qu.: 6.000 3rd Qu.:140.2 3rd Qu.: 80.00 3rd Qu.:32.00
## Max. :17.000 Max. :199.0 Max. :122.00 Max. :99.00
## Insulin BMI DiabetesPedigreeFunction Age
## Min. : 0.0 Min. : 0.00 Min. :0.0780 Min. :21.00
## 1st Qu.: 0.0 1st Qu.:27.30 1st Qu.:0.2437 1st Qu.:24.00
## Median : 30.5 Median :32.00 Median :0.3725 Median :29.00
## Mean : 79.8 Mean :31.99 Mean :0.4719 Mean :33.24
## 3rd Qu.:127.2 3rd Qu.:36.60 3rd Qu.:0.6262 3rd Qu.:41.00
## Max. :846.0 Max. :67.10 Max. :2.4200 Max. :81.00
## Outcome
## Min. :0.000
## 1st Qu.:0.000
## Median :0.000
## Mean :0.349
## 3rd Qu.:1.000
## Max. :1.000
At first glance, the maximum of Pregnancies (17) stands out, which hints at outliers in that variable, as does the maximum of Insulin (846).
The methodology for this exercise is as follows: first we detect the missing values in each variable, then we detect the outliers in each variable, and finally we apply a single imputation to both kinds of values. Each step is accompanied by plots and statistical tests to check that the imputation behaves correctly.
For this exercise, values recorded as 0 are treated as missing data in the variables Glucose, BloodPressure, SkinThickness, Insulin and BMI:
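n <- nrow(data)   # number of observations
m <- ncol(data)   # number of columns
# Columns 2 to m-3 are Glucose, BloodPressure, SkinThickness, Insulin and BMI
for (j in 2:(m - 3)) {
  for (i in 1:n) {
    if (data[i, j] == 0) {
      data[i, j] <- NA   # a recorded 0 is treated as missing
    }
  }
}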
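The same recoding can also be written without explicit loops. The following is a minimal sketch on a small made-up data frame (the `toy` data and the `zero_as_na` vector are illustrative assumptions, not part of the original analysis). It selects the columns by name rather than by position, so it keeps working if the column order changes.

```r
# Toy data frame standing in for the diabetes data (hypothetical values)
toy <- data.frame(
  Pregnancies = c(0, 2, 1),
  Glucose     = c(148, 0, 85),
  BMI         = c(33.6, 26.6, 0),
  Outcome     = c(1, 0, 1)
)

# Columns where a recorded 0 should be read as "missing"
zero_as_na <- c("Glucose", "BMI")

# Logical-matrix indexing replaces every 0 in those columns at once
toy[zero_as_na][toy[zero_as_na] == 0] <- NA

print(toy)
```

The zeros in Pregnancies and Outcome are left untouched because those columns are not listed in `zero_as_na`.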
Now summary() reports the number of NAs per column:
summary(data)
## Pregnancies Glucose BloodPressure SkinThickness
## Min. : 0.000 Min. : 44.0 Min. : 24.00 Min. : 7.00
## 1st Qu.: 1.000 1st Qu.: 99.0 1st Qu.: 64.00 1st Qu.:22.00
## Median : 3.000 Median :117.0 Median : 72.00 Median :29.00
## Mean : 3.845 Mean :121.7 Mean : 72.41 Mean :29.15
## 3rd Qu.: 6.000 3rd Qu.:141.0 3rd Qu.: 80.00 3rd Qu.:36.00
## Max. :17.000 Max. :199.0 Max. :122.00 Max. :99.00
## NA's :5 NA's :35 NA's :227
## Insulin BMI DiabetesPedigreeFunction Age
## Min. : 14.00 Min. :18.20 Min. :0.0780 Min. :21.00
## 1st Qu.: 76.25 1st Qu.:27.50 1st Qu.:0.2437 1st Qu.:24.00
## Median :125.00 Median :32.30 Median :0.3725 Median :29.00
## Mean :155.55 Mean :32.46 Mean :0.4719 Mean :33.24
## 3rd Qu.:190.00 3rd Qu.:36.60 3rd Qu.:0.6262 3rd Qu.:41.00
## Max. :846.00 Max. :67.10 Max. :2.4200 Max. :81.00
## NA's :374 NA's :11
## Outcome
## Min. :0.000
## 1st Qu.:0.000
## Median :0.000
## Mean :0.349
## 3rd Qu.:1.000
## Max. :1.000
##
And graphically:
missmap(data)
In total, about 9% of the cells in the dataset are missing, and many of them are concentrated in Insulin and SkinThickness. Let's see exactly how many there are per variable.
We write a function that counts the NAs in each column:
NAs <- function(dataf) {
  nas <- c()
  for (j in 1:length(dataf)) {
    cont <- 0
    # Use the rows of the data frame passed in, not the global n
    for (i in 1:nrow(dataf)) {
      if (is.na(dataf[i, j])) {
        cont <- cont + 1
      }
    }
    nas[j] <- cont
  }
  return(nas)
}
We compute the proportion of NAs:
nas<-NAs(data)
# Proportion of NAs per column
nasprp<-nas/n
nasprp
## [1] 0.000000000 0.006510417 0.045572917 0.295572917 0.486979167 0.014322917
## [7] 0.000000000 0.000000000 0.000000000
We build a data frame with the proportion of NAs:
colnames<-colnames(data)
faltantes<-data.frame(nasprp,colnames)
faltantes
## nasprp colnames
## 1 0.000000000 Pregnancies
## 2 0.006510417 Glucose
## 3 0.045572917 BloodPressure
## 4 0.295572917 SkinThickness
## 5 0.486979167 Insulin
## 6 0.014322917 BMI
## 7 0.000000000 DiabetesPedigreeFunction
## 8 0.000000000 Age
## 9 0.000000000 Outcome
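The counts and proportions above can also be obtained with vectorized base R functions. This is a small sketch on made-up data (the `toy` data frame and the `missing_tbl` name are assumptions for illustration only), using `colSums()`, `colMeans()` and `mean()` on the logical matrix returned by `is.na()`.

```r
# Toy data frame with a few missing values (hypothetical values)
toy <- data.frame(
  Glucose = c(148, NA, 85, 89),
  Insulin = c(NA, NA, 94, NA),
  Age     = c(50, 31, 32, 21)
)

na_count <- colSums(is.na(toy))    # number of NAs per column
na_prop  <- colMeans(is.na(toy))   # proportion of NAs per column
na_total <- mean(is.na(toy))       # proportion of NAs over all cells

# One table with the variable name next to its count and proportion
missing_tbl <- data.frame(
  variable = names(toy),
  na_count,
  na_prop,
  row.names = NULL
)
print(missing_tbl)
print(na_total)
```

Because `is.na()` returns a logical matrix, taking its mean gives the overall share of missing cells directly.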
This analysis confirms what missmap() showed: Insulin and SkinThickness are the variables with the most NAs, with roughly 48.7% and 29.6% missing values respectively, so they will probably need a more careful imputation. Glucose, BloodPressure and BMI each have fewer than 5% missing values, so simply dropping the records that contain NAs may be enough for them.
As mentioned earlier, the missing data will be treated after the outliers have been detected.
For each variable, we look for outliers using the interquartile range and boxplots:
ggplot(data)+
aes(x="", Pregnancies)+
geom_boxplot(fill="red")+
theme_minimal()
out_preg<-boxplot.stats(data$Pregnancies)$out
out_preg
## [1] 15 17 14 14
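As a reminder of what the boxplot rule does, here is a minimal sketch on a made-up vector (the data and variable names are illustrative assumptions). It reproduces by hand the rule that `boxplot.stats()` applies with its default `coef = 1.5`: a value is flagged when it lies more than 1.5 times the hinge-based interquartile range beyond the lower or upper hinge.

```r
# Toy vector with one obvious extreme value (hypothetical data)
x <- c(2, 3, 3, 4, 5, 5, 6, 7, 30)

h <- fivenum(x)                   # min, lower hinge, median, upper hinge, max
iqr <- h[4] - h[2]                # hinge-based interquartile range
lower_fence <- h[2] - 1.5 * iqr
upper_fence <- h[4] + 1.5 * iqr

# Values outside the fences are the boxplot's outlier candidates
manual_out <- x[x < lower_fence | x > upper_fence]

print(manual_out)
print(boxplot.stats(x)$out)       # same rule, computed by base R
```

These values are only candidates: the rule is purely descriptive and does not test anything.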
ggplot(data)+
aes(x="", Glucose)+
geom_boxplot(fill="red")+
theme_minimal()
## Warning: Removed 5 rows containing non-finite outside the scale range
## (`stat_boxplot()`).
out_glu<-boxplot.stats(data$Glucose)$out
out_glu
## integer(0)
In this case, the variable Glucose has no outliers.
ggplot(data)+
aes(x="", BloodPressure)+
geom_boxplot(fill="red")+
theme_minimal()
## Warning: Removed 35 rows containing non-finite outside the scale range
## (`stat_boxplot()`).
out_blood<-boxplot.stats(data$BloodPressure)$out
out_blood
## [1] 30 110 108 122 30 110 108 110 24 38 106 106 106 114
ggplot(data)+
aes(x="", SkinThickness)+
geom_boxplot(fill="red")+
theme_minimal()
## Warning: Removed 227 rows containing non-finite outside the scale range
## (`stat_boxplot()`).
out_skin<-boxplot.stats(data$SkinThickness)$out
out_skin
## [1] 60 63 99
ggplot(data)+
aes(x="", Insulin)+
geom_boxplot(fill="red")+
theme_minimal()
## Warning: Removed 374 rows containing non-finite outside the scale range
## (`stat_boxplot()`).
out_insulin<-boxplot.stats(data$Insulin)$out
out_insulin
## [1] 543 846 495 485 495 478 744 370 680 402 375 545 465 415 579 474 480 600 440
## [20] 540 480 387 392 510
ggplot(data)+
aes(x="", BMI)+
geom_boxplot(fill="red")+
theme_minimal()
## Warning: Removed 11 rows containing non-finite outside the scale range
## (`stat_boxplot()`).
out_BMI<-boxplot.stats(data$BMI)$out
out_BMI
## [1] 53.2 55.0 67.1 52.3 52.3 52.9 59.4 57.3
ggplot(data)+
aes(x="", DiabetesPedigreeFunction)+
geom_boxplot(fill="red")+
theme_minimal()
out_diab<-boxplot.stats(data$DiabetesPedigreeFunction)$out
out_diab
## [1] 2.288 1.441 1.390 1.893 1.781 1.222 1.400 1.321 1.224 2.329 1.318 1.213
## [13] 1.353 1.224 1.391 1.476 2.137 1.731 1.268 1.600 2.420 1.251 1.699 1.258
## [25] 1.282 1.698 1.461 1.292 1.394
ggplot(data)+
aes(x="", Age)+
geom_boxplot(fill="red")+
theme_minimal()
out_age<-boxplot.stats(data$Age)$out
out_age
## [1] 69 67 72 81 67 67 70 68 69
Now that we have these candidate outliers, we use Rosner's test to check whether they really are outliers, or whether more of them need to be captured:
test <- rosnerTest(data$Pregnancies, k = length(out_preg))
test$all.stats
## i Mean.i SD.i Value Obs.Num R.i+1 lambda.i+1 Outlier
## 1 0 3.845052 3.369578 17 160 3.904034 3.974092 FALSE
## 2 1 3.827901 3.338063 15 89 3.346880 3.973762 FALSE
## 3 2 3.813316 3.315699 14 299 3.072258 3.973432 FALSE
## 4 3 3.800000 3.297310 14 456 3.093431 3.973102 FALSE
For the variable Pregnancies, the test finds no outliers.
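For readers who want to see where the columns `R.i+1` and `lambda.i+1` come from, here is a base R sketch of the generalized ESD procedure behind Rosner's test. The function name `rosner_sketch` and the toy data are illustrative assumptions. This is not the EnvStats implementation, only an outline of the computation under the usual formulation of the test.

```r
# Generalized ESD (Rosner) procedure written out in base R, for illustration
rosner_sketch <- function(x, k, alpha = 0.05) {
  idx <- which(!is.na(x))          # keep positions in the original vector
  x <- x[idx]
  n <- length(x)
  R <- lambda <- numeric(k)
  removed <- integer(k)
  for (i in seq_len(k)) {
    dev <- abs(x - mean(x))
    worst <- which.max(dev)        # most extreme remaining observation
    R[i] <- dev[worst] / sd(x)     # test statistic for this step
    p <- 1 - alpha / (2 * (n - i + 1))
    t_crit <- qt(p, df = n - i - 1)
    lambda[i] <- (n - i) * t_crit /
      sqrt((n - i - 1 + t_crit^2) * (n - i + 1))   # critical value
    removed[i] <- idx[worst]
    x <- x[-worst]                 # drop it before the next step
    idx <- idx[-worst]
  }
  # Number of outliers: the largest step whose statistic exceeds its critical value
  n_out <- if (any(R > lambda)) max(which(R > lambda)) else 0
  list(R = R, lambda = lambda, outliers = removed[seq_len(n_out)])
}

set.seed(1)
toy <- c(rnorm(50), 12)            # 50 standard normal values plus one extreme
res <- rosner_sketch(toy, k = 3)
print(res)
```

Note that the number of outliers is taken from the last step at which the statistic exceeds the critical value, so earlier steps can be declared outliers even if their own statistic did not exceed it.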
out_preg<-c()
test <- rosnerTest(data$Glucose, k = 1)
## Warning in rosnerTest(data$Glucose, k = 1): 5 observations with NA/NaN/Inf in
## 'x' removed.
test$all.stats
## i Mean.i SD.i Value Obs.Num R.i+1 lambda.i+1 Outlier
## 1 0 121.6868 30.53564 44 63 2.544134 3.97244 FALSE
The boxplot already suggested that Glucose has no outliers, and this test confirms it.
out_glu<-c()
test <- rosnerTest(data$BloodPressure, k = length(out_blood))
## Warning in rosnerTest(data$BloodPressure, k = length(out_blood)): 35
## observations with NA/NaN/Inf in 'x' removed.
## Warning in rosnerTest(data$BloodPressure, k = length(out_blood)): The true Type I error may be larger than assumed.
## Although the help file for 'rosnerTest' has a table with information
## on the estimated Type I error level,
## simulations were not run for k > 10 or k > floor(n/2).
test$all.stats
## i Mean.i SD.i Value Obs.Num R.i+1 lambda.i+1 Outlier
## 1 0 72.40518 12.38216 122 107 4.005345 3.962273 TRUE
## 2 1 72.33743 12.25391 24 598 3.944655 3.961926 FALSE
## 3 2 72.40356 12.13090 30 19 3.495498 3.961579 FALSE
## 4 3 72.46164 12.03706 30 126 3.527576 3.961231 FALSE
## 5 4 72.51989 11.94194 114 692 3.473483 3.960883 FALSE
## 6 5 72.46291 11.85057 110 44 3.167534 3.960534 FALSE
## 7 6 72.41128 11.77650 110 178 3.191841 3.960185 FALSE
## 8 7 72.35950 11.70153 110 550 3.216716 3.959835 FALSE
## 9 8 72.30759 11.62563 108 85 3.070149 3.959484 FALSE
## 10 9 72.25829 11.55758 108 363 3.092490 3.959133 FALSE
## 11 10 72.20885 11.48873 38 600 2.977600 3.958782 FALSE
## 12 11 72.25623 11.42579 106 659 2.953298 3.958430 FALSE
## 13 12 72.20943 11.36426 106 663 2.973407 3.958077 FALSE
## 14 13 72.16250 11.30202 106 673 2.993933 3.957724 FALSE
We store the position of the outlier in the dataset:
out_blood<-test$all.stats[1,5]
test <- rosnerTest(data$SkinThickness, k = length(out_skin))
## Warning in rosnerTest(data$SkinThickness, k = length(out_skin)): 227
## observations with NA/NaN/Inf in 'x' removed.
test$all.stats
## i Mean.i SD.i Value Obs.Num R.i+1 lambda.i+1 Outlier
## 1 0 29.15342 10.476982 99 580 6.666670 3.883895 TRUE
## 2 1 29.02407 10.045046 63 446 3.382357 3.883409 FALSE
## 3 2 28.96104 9.946902 60 58 3.120465 3.882923 FALSE
out_skin<-test$all.stats[1,5]
test <- rosnerTest(data$Insulin, k = length(out_insulin))
## Warning in rosnerTest(data$Insulin, k = length(out_insulin)): 374 observations
## with NA/NaN/Inf in 'x' removed.
## Warning in rosnerTest(data$Insulin, k = length(out_insulin)): The true Type I error may be larger than assumed.
## Although the help file for 'rosnerTest' has a table with information
## on the estimated Type I error level,
## simulations were not run for k > 10 or k > floor(n/2).
test$all.stats
## i Mean.i SD.i Value Obs.Num R.i+1 lambda.i+1 Outlier
## 1 0 155.5482 118.77586 846 14 5.813065 3.799155 TRUE
## 2 1 153.7913 113.68582 744 229 5.191576 3.798462 TRUE
## 3 2 152.2857 109.83778 680 248 4.804488 3.797768 TRUE
## 4 3 150.9361 106.67422 600 585 4.209677 3.797071 TRUE
## 5 4 149.7846 104.34994 579 410 4.113231 3.796373 TRUE
## 6 5 148.6812 102.18094 545 287 3.878598 3.795672 TRUE
## 7 6 147.6598 100.30461 543 9 3.941396 3.794970 TRUE
## 8 7 146.6382 98.39272 540 656 3.997875 3.794265 TRUE
## 9 8 145.6192 96.45376 510 754 3.777777 3.793558 TRUE
## 10 9 144.6727 94.76757 495 112 3.696700 3.792850 TRUE
## 11 10 143.7604 93.18297 495 187 3.769354 3.792139 TRUE
## 12 11 142.8433 91.55324 485 154 3.737242 3.791426 TRUE
## 13 12 141.9476 89.97732 480 487 3.757084 3.790711 TRUE
## 14 13 141.0604 88.40644 480 696 3.833879 3.789994 TRUE
## 15 14 140.1684 86.78946 478 221 3.892542 3.789275 TRUE
## 16 15 139.2770 85.14463 474 416 3.931228 3.788554 TRUE
## 17 16 138.3915 83.49171 465 371 3.911867 3.787830 TRUE
## 18 17 137.5252 81.88374 440 646 3.693954 3.787104 FALSE
## 19 18 136.7207 80.48728 415 393 3.457432 3.786377 FALSE
## 20 19 135.9787 79.29637 402 249 3.354773 3.785646 FALSE
## 21 20 135.2674 78.19552 392 716 3.283214 3.784914 FALSE
## 22 21 134.5791 77.15776 387 711 3.271491 3.784180 FALSE
## 23 22 133.9005 76.13910 375 259 3.166566 3.783443 FALSE
## 24 23 133.2507 75.20174 370 232 3.148189 3.782704 FALSE
Recall that Insulin has 48.7% NAs, and to that we must now add a considerable number of outliers. Although the last outlier detected by the test has a value of 465, there are still values below it that are extremely high. From the numerical summary, the third quartile of this variable is 190, a very high value for this variable, so we could conclude that a large share of its records are either wrong or were simply not recorded. For this reason we decided to drop this variable from the analysis, since it is too problematic.
data<-select(data,-Insulin)
test <- rosnerTest(data$BMI, k = length(out_BMI))
## Warning in rosnerTest(data$BMI, k = length(out_BMI)): 11 observations with
## NA/NaN/Inf in 'x' removed.
test$all.stats
## i Mean.i SD.i Value Obs.Num R.i+1 lambda.i+1 Outlier
## 1 0 32.45746 6.924988 67.1 178 5.002541 3.970442 TRUE
## 2 1 32.41164 6.813761 59.4 446 3.960861 3.970107 FALSE
## 3 2 32.37589 6.746971 57.3 674 3.694118 3.969772 FALSE
## 4 3 32.34284 6.689992 55.0 126 3.386725 3.969437 FALSE
## 5 4 32.31275 6.643189 53.2 121 3.144160 3.969101 FALSE
## 6 5 32.28497 6.603713 52.9 304 3.121733 3.968764 FALSE
## 7 6 32.25752 6.565042 52.3 194 3.052909 3.968427 FALSE
## 8 7 32.23080 6.528422 52.3 248 3.074127 3.968089 FALSE
out_BMI<-test$all.stats[1,5]
test <- rosnerTest(data$DiabetesPedigreeFunction, k = length(out_diab))
## Warning in rosnerTest(data$DiabetesPedigreeFunction, k = length(out_diab)): The true Type I error may be larger than assumed.
## Although the help file for 'rosnerTest' has a table with information
## on the estimated Type I error level,
## simulations were not run for k > 10 or k > floor(n/2).
test$all.stats
## i Mean.i SD.i Value Obs.Num R.i+1 lambda.i+1 Outlier
## 1 0 0.4718763 0.3313286 2.420 446 5.879733 3.974092 TRUE
## 2 1 0.4693364 0.3239768 2.329 229 5.740114 3.973762 TRUE
## 3 2 0.4669086 0.3171301 2.288 5 5.742410 3.973432 TRUE
## 4 3 0.4645281 0.3104137 2.137 371 5.387880 3.973102 TRUE
## 5 4 0.4623390 0.3046509 1.893 46 4.696067 3.972771 TRUE
## 6 5 0.4604640 0.3004070 1.781 59 4.395823 3.972440 TRUE
## 7 6 0.4587310 0.2967633 1.731 372 4.287150 3.972108 TRUE
## 8 7 0.4570591 0.2933457 1.699 594 4.233710 3.971776 TRUE
## 9 8 0.4554250 0.2900522 1.698 622 4.283971 3.971443 TRUE
## 10 9 0.4537879 0.2867083 1.600 396 3.997834 3.971110 TRUE
## 11 10 0.4522757 0.2838528 1.476 331 3.606533 3.970776 FALSE
## 12 11 0.4509234 0.2815864 1.461 623 3.587093 3.970442 FALSE
## 13 12 0.4495873 0.2793614 1.441 13 3.548854 3.970107 FALSE
## 14 13 0.4482742 0.2772021 1.400 148 3.433329 3.969772 FALSE
## 15 14 0.4470119 0.2752064 1.394 662 3.441011 3.969437 FALSE
## 16 15 0.4457543 0.2732126 1.391 309 3.459744 3.969101 FALSE
## 17 16 0.4444973 0.2712070 1.390 40 3.486277 3.968764 FALSE
## 18 17 0.4432383 0.2691797 1.353 260 3.379755 3.968427 FALSE
## 19 18 0.4420253 0.2672975 1.321 188 3.288376 3.968089 FALSE
## 20 19 0.4408518 0.2655357 1.318 244 3.303315 3.967751 FALSE
## 21 20 0.4396791 0.2637656 1.292 660 3.231358 3.967412 FALSE
## 22 21 0.4385382 0.2620886 1.282 619 3.218232 3.967073 FALSE
## 23 22 0.4374075 0.2604350 1.268 384 3.189250 3.966734 FALSE
## 24 23 0.4362926 0.2588225 1.258 607 3.174792 3.966394 FALSE
## 25 24 0.4351882 0.2572339 1.251 535 3.171479 3.966053 FALSE
## 26 25 0.4340902 0.2556565 1.224 219 3.089731 3.965712 FALSE
## 27 26 0.4330256 0.2541757 1.224 293 3.111920 3.965370 FALSE
## 28 27 0.4319582 0.2526776 1.222 101 3.126679 3.965028 FALSE
## 29 28 0.4308905 0.2511705 1.213 246 3.113858 3.964686 FALSE
out_diab<-c()
for (i in 1:10){
out_diab[i]<-test$all.stats[i,5]
}
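The flagged positions can also be read from the `Outlier` column instead of being hard-coded by row count. The sketch below uses a small stand-in for `test$all.stats` built from three rows of the table above. The names `all_stats`, `flagged` and the placeholder vector `x` are assumptions for illustration.

```r
# Stand-in for test$all.stats: three rows copied from the table above
all_stats <- data.frame(
  Value   = c(2.420, 2.329, 1.476),
  Obs.Num = c(446L, 229L, 331L),
  Outlier = c(TRUE, TRUE, FALSE)
)

# Keep the observation numbers of the rows the test marks as outliers
flagged <- all_stats$Obs.Num[all_stats$Outlier]
print(flagged)

# Vectorized assignment marks all flagged positions as missing at once
x <- rep(1, 500)   # placeholder vector, hypothetical data
x[flagged] <- NA
print(sum(is.na(x)))
```

Selecting by the `Outlier` column avoids having to update a hard-coded count if the data or the significance level change.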
test <- rosnerTest(data$Age, k = length(out_age))
test$all.stats
## i Mean.i SD.i Value Obs.Num R.i+1 lambda.i+1 Outlier
## 1 0 33.24089 11.76023 81 460 4.061069 3.974092 TRUE
## 2 1 33.17862 11.64053 72 454 3.335018 3.973762 FALSE
## 3 2 33.12794 11.56315 70 667 3.188755 3.973432 FALSE
## 4 3 33.07974 11.49346 69 124 3.125278 3.973102 FALSE
## 5 4 33.03272 11.42714 69 685 3.147531 3.972771 FALSE
## 6 5 32.98558 11.36006 68 675 3.082239 3.972440 FALSE
## 7 6 32.93963 11.29634 67 364 3.015167 3.972108 FALSE
## 8 7 32.89488 11.23596 67 490 3.035354 3.971776 FALSE
## 9 8 32.85000 11.17491 67 538 3.055953 3.971443 FALSE
out_age<-test$all.stats[1,5]
As anticipated, the outliers found will receive exactly the same imputation as the NAs, so we treat them as missing data:
new_dataf<-data
new_dataf$BloodPressure[out_blood]<-NA
new_dataf$SkinThickness[out_skin]<-NA
new_dataf$BMI[out_BMI]<-NA
for (i in 1:length(out_diab)){
new_dataf$DiabetesPedigreeFunction[out_diab[i]]<-NA
}
new_dataf$Age[out_age]<-NA
Once again, we check the proportion of NAs:
# Recompute the number of NAs per column
nas<-NAs(new_dataf)
# Proportion of NAs per column
nasprp<-nas/n
nasprp
## [1] 0.000000000 0.006510417 0.046875000 0.296875000 0.015625000 0.013020833
## [7] 0.001302083 0.000000000
summary(new_dataf)
## Pregnancies Glucose BloodPressure SkinThickness
## Min. : 0.000 Min. : 44.0 Min. : 24.00 Min. : 7.00
## 1st Qu.: 1.000 1st Qu.: 99.0 1st Qu.: 64.00 1st Qu.:22.00
## Median : 3.000 Median :117.0 Median : 72.00 Median :29.00
## Mean : 3.845 Mean :121.7 Mean : 72.34 Mean :29.02
## 3rd Qu.: 6.000 3rd Qu.:141.0 3rd Qu.: 80.00 3rd Qu.:36.00
## Max. :17.000 Max. :199.0 Max. :114.00 Max. :63.00
## NA's :5 NA's :36 NA's :228
## BMI DiabetesPedigreeFunction Age Outcome
## Min. :18.20 Min. :0.0780 Min. :21.00 Min. :0.000
## 1st Qu.:27.50 1st Qu.:0.2402 1st Qu.:24.00 1st Qu.:0.000
## Median :32.30 Median :0.3670 Median :29.00 Median :0.000
## Mean :32.41 Mean :0.4523 Mean :33.18 Mean :0.349
## 3rd Qu.:36.60 3rd Qu.:0.6092 3rd Qu.:41.00 3rd Qu.:1.000
## Max. :59.40 Max. :1.4760 Max. :72.00 Max. :1.000
## NA's :12 NA's :10 NA's :1
colnames<-colnames(new_dataf)
faltantes<-data.frame(nasprp,colnames)
faltantes
## nasprp colnames
## 1 0.000000000 Pregnancies
## 2 0.006510417 Glucose
## 3 0.046875000 BloodPressure
## 4 0.296875000 SkinThickness
## 5 0.015625000 BMI
## 6 0.013020833 DiabetesPedigreeFunction
## 7 0.001302083 Age
## 8 0.000000000 Outcome
Since every column to be imputed has fewer than 5% NAs, except SkinThickness, we can drop the rows with NAs in those columns and then impute SkinThickness.
new_dataf<-new_dataf %>% filter(!is.na(Glucose)& !is.na(BloodPressure)
& !is.na(BMI) & !is.na(DiabetesPedigreeFunction)
& !is.na(Age))
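The same row filter can be expressed in base R with `complete.cases()` restricted to the columns of interest. This is a minimal sketch on made-up data (the `toy` data frame and the `low_na_cols` vector are illustrative assumptions). Rows are dropped only when one of the listed columns is missing, so the NAs in SkinThickness are kept for later imputation.

```r
# Toy data frame (hypothetical values)
toy <- data.frame(
  Glucose       = c(148, NA, 85, 89),
  SkinThickness = c(35, 29, NA, 23),
  BMI           = c(33.6, 26.6, 23.3, NA)
)

# Columns with few NAs, where dropping rows is acceptable
low_na_cols <- c("Glucose", "BMI")

# TRUE for rows with no NA in the selected columns
keep <- complete.cases(toy[, low_na_cols])
toy_kept <- toy[keep, ]

print(toy_kept)
```

Listing the columns in a vector makes it easy to add or remove a variable from the filter.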
First we look at the distribution of each version of the variable graphically, as an initial assessment of the imputation.
Then we run normality tests to decide which test to use when comparing the distribution of each variable before and after the imputation.
ggp1 <- ggplot(data, aes(x=Glucose)) +
geom_histogram(fill="#FD0000", color="#E52521", alpha=0.9) +
ggtitle("Original variable") +
xlab("Glucose") + ylab("Frequency") +
theme_ipsum() +
theme(plot.title = element_text(size=15))
ggp2 <- ggplot(new_dataf, aes(x=Glucose)) +
geom_histogram(fill="#43B047", color="#049DCB", alpha=0.9) +
ggtitle("Imputed variable") +
xlab("Glucose") + ylab("Frequency") +
theme_ipsum() +
theme(plot.title = element_text(size=15))
grid.arrange(ggp1, ggp2, ncol = 2)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 5 rows containing non-finite outside the scale range
## (`stat_bin()`).
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
The distribution of the two versions of the variable does not differ much. We now use a statistical test to check this.
ks.test(data$Glucose,"pnorm")
## Warning in ks.test.default(data$Glucose, "pnorm"): ties should not be present
## for the one-sample Kolmogorov-Smirnov test
##
## Asymptotic one-sample Kolmogorov-Smirnov test
##
## data: data$Glucose
## D = 1, p-value < 2.2e-16
## alternative hypothesis: two-sided
ks.test(new_dataf$Glucose,"pnorm")
## Warning in ks.test.default(new_dataf$Glucose, "pnorm"): ties should not be
## present for the one-sample Kolmogorov-Smirnov test
##
## Asymptotic one-sample Kolmogorov-Smirnov test
##
## data: new_dataf$Glucose
## D = 1, p-value < 2.2e-16
## alternative hypothesis: two-sided
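One caveat about the calls above: `ks.test(x, "pnorm")` with no further arguments compares the data with the standard normal N(0, 1), so D = 1 mainly reflects that every value lies far above 0, not the shape of the distribution. The sketch below, on simulated data (an assumption, not the diabetes data), contrasts that call with one that passes the sample mean and standard deviation. Estimating the parameters from the same sample makes the Kolmogorov-Smirnov p-value only approximate, so `shapiro.test()` is shown as a possible alternative.

```r
set.seed(42)
# Simulated stand-in for a variable such as Glucose (hypothetical data)
x <- rnorm(200, mean = 120, sd = 30)

# Comparison with N(0, 1): D is close to 1 because all values are far from 0
ks_std <- ks.test(x, "pnorm")

# Comparison with a normal that has the sample's own mean and sd
ks_fit <- ks.test(x, "pnorm", mean = mean(x), sd = sd(x))

# Shapiro-Wilk test, which does not need the parameters to be specified
sw <- shapiro.test(x)

print(ks_std$statistic)
print(ks_fit$statistic)
print(sw$p.value)
```

On real data with NAs, `shapiro.test()` drops missing values itself, while the mean and standard deviation passed to `ks.test()` would need `na.rm = TRUE`.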
Since the test rejects normality for both versions of the variable, we use a nonparametric test to check whether the two versions differ significantly. Here we use the Wilcoxon rank-sum test.
wilcox.test(data$Glucose, new_dataf$Glucose, alternative = "two.sided")
##
## Wilcoxon rank sum test with continuity correction
##
## data: data$Glucose and new_dataf$Glucose
## W = 272264, p-value = 0.938
## alternative hypothesis: true location shift is not equal to 0
Since the p-value is greater than the significance level (0.05), we find no significant difference between the imputed and the non-imputed variable.
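Because the same comparison is repeated for each variable, it can be wrapped in a small helper. The following is a sketch under assumptions: the function name `compare_versions`, its `alpha` argument and the simulated data are all hypothetical, and it only packages the `wilcox.test()` call used above together with the decision rule at the 0.05 level.

```r
# Hypothetical helper: Wilcoxon rank-sum comparison of two versions of a variable
compare_versions <- function(before, after, alpha = 0.05) {
  res <- wilcox.test(before, after, alternative = "two.sided")
  list(
    W       = unname(res$statistic),
    p_value = res$p.value,
    differs = res$p.value < alpha   # TRUE if the difference is significant
  )
}

set.seed(7)
# Simulated "original" variable with a few NAs (hypothetical data)
before <- c(rnorm(300, mean = 120, sd = 30), rep(NA, 5))
# "Treated" version: NAs removed and a handful of rows dropped
after <- before[!is.na(before)][-(1:10)]

out <- compare_versions(before, after)
print(out)
```

With the real data, a call such as `compare_versions(data$BMI, new_dataf$BMI)` would return the same statistic and p-value as the direct `wilcox.test()` call.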
ggp1 <- ggplot(data, aes(x=BloodPressure)) +
geom_histogram(fill="#FD0000", color="#E52521", alpha=0.9) +
ggtitle("Original variable") +
xlab("BloodPressure") + ylab("Frequency") +
theme_ipsum() +
theme(plot.title = element_text(size=15))
ggp2 <- ggplot(new_dataf, aes(x=BloodPressure)) +
geom_histogram(fill="#43B047", color="#049DCB", alpha=0.9) +
ggtitle("Imputed variable") +
xlab("BloodPressure") + ylab("Frequency") +
theme_ipsum() +
theme(plot.title = element_text(size=15))
grid.arrange(ggp1, ggp2, ncol = 2)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 35 rows containing non-finite outside the scale range
## (`stat_bin()`).
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
The distribution of the two versions of the variable does not differ much. We now use a statistical test to check this.
ks.test(data$BloodPressure,"pnorm")
## Warning in ks.test.default(data$BloodPressure, "pnorm"): ties should not be
## present for the one-sample Kolmogorov-Smirnov test
##
## Asymptotic one-sample Kolmogorov-Smirnov test
##
## data: data$BloodPressure
## D = 1, p-value < 2.2e-16
## alternative hypothesis: two-sided
ks.test(new_dataf$BloodPressure,"pnorm")
## Warning in ks.test.default(new_dataf$BloodPressure, "pnorm"): ties should not
## be present for the one-sample Kolmogorov-Smirnov test
##
## Asymptotic one-sample Kolmogorov-Smirnov test
##
## data: new_dataf$BloodPressure
## D = 1, p-value < 2.2e-16
## alternative hypothesis: two-sided
Since the test rejects normality for both versions of the variable, we use a nonparametric test to check whether the two versions differ significantly. Here we use the Wilcoxon rank-sum test.
wilcox.test(data$BloodPressure, new_dataf$BloodPressure, alternative = "two.sided")
##
## Wilcoxon rank sum test with continuity correction
##
## data: data$BloodPressure and new_dataf$BloodPressure
## W = 261315, p-value = 0.9631
## alternative hypothesis: true location shift is not equal to 0
Since the p-value is greater than the significance level (0.05), we find no significant difference between the imputed and the non-imputed variable.
ggp1 <- ggplot(data, aes(x=BMI)) +
geom_histogram(fill="#FD0000", color="#E52521", alpha=0.9) +
ggtitle("Original variable") +
xlab("BMI") + ylab("Frequency") +
theme_ipsum() +
theme(plot.title = element_text(size=15))
ggp2 <- ggplot(new_dataf, aes(x=BMI)) +
geom_histogram(fill="#43B047", color="#049DCB", alpha=0.9) +
ggtitle("Imputed variable") +
xlab("BMI") + ylab("Frequency") +
theme_ipsum() +
theme(plot.title = element_text(size=15))
grid.arrange(ggp1, ggp2, ncol = 2)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 11 rows containing non-finite outside the scale range
## (`stat_bin()`).
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
The distribution of BMI looks much the same before and after imputation. We now use an analytical test to confirm this.
ks.test(data$BMI,"pnorm")
## Warning in ks.test.default(data$BMI, "pnorm"): ties should not be present for
## the one-sample Kolmogorov-Smirnov test
##
## Asymptotic one-sample Kolmogorov-Smirnov test
##
## data: data$BMI
## D = 1, p-value < 2.2e-16
## alternative hypothesis: two-sided
ks.test(new_dataf$BMI,"pnorm")
## Warning in ks.test.default(new_dataf$BMI, "pnorm"): ties should not be present
## for the one-sample Kolmogorov-Smirnov test
##
## Asymptotic one-sample Kolmogorov-Smirnov test
##
## data: new_dataf$BMI
## D = 1, p-value < 2.2e-16
## alternative hypothesis: two-sided
Both tests reject normality (ks.test with "pnorm" and no mean or sd compares the data with the standard normal N(0, 1), hence D = 1). Treating both versions of the variable as non-normal, we use a nonparametric test, the Wilcoxon rank-sum test, to check for a significant difference between them.
wilcox.test(data$BMI, new_dataf$BMI, alternative = "two.sided")
##
## Wilcoxon rank sum test with continuity correction
##
## data: data$BMI and new_dataf$BMI
## W = 269884, p-value = 0.9616
## alternative hypothesis: true location shift is not equal to 0
Since the p-value (0.9616) is above the 0.05 significance level, we fail to reject the null hypothesis: there is no significant difference between the imputed and the original BMI.
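The normality checks in this section call ks.test with "pnorm" and no mean or sd, so the reference distribution is N(0, 1). A minimal sketch of two alternatives follows, on simulated data (x is a hypothetical stand-in, not the report's BMI column): passing the sample mean and standard deviation to ks.test, and using shapiro.test. Estimating the parameters from the same sample makes the KS p-value only approximate, so this is an illustration rather than a definitive procedure.

```r
set.seed(1)
# Hypothetical BMI-like variable: truly normal, but far from N(0, 1)
x <- rnorm(500, mean = 32, sd = 7)

# Reference N(0, 1): D is essentially 1 even though x is normal
ks_std <- ks.test(x, "pnorm")

# Reference normal with the sample's own mean and standard deviation
ks_fit <- ks.test(x, "pnorm", mean = mean(x), sd = sd(x))

# Shapiro-Wilk test (accepts between 3 and 5000 observations)
sw <- shapiro.test(x)

c(D_standard = unname(ks_std$statistic),
  D_fitted   = unname(ks_fit$statistic),
  shapiro_p  = sw$p.value)
```

With the report's data, mean() and sd() would need na.rm = TRUE, because the original columns contain missing values.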
ggp1 <- ggplot(data, aes(x=DiabetesPedigreeFunction)) +
geom_histogram(fill="#FD0000", color="#E52521", alpha=0.9) +
ggtitle("Original variable") +
xlab("DiabetesPedigreeFunction") + ylab("Frequency") +
theme_ipsum() +
theme(plot.title = element_text(size=15))
ggp2 <- ggplot(new_dataf, aes(x=DiabetesPedigreeFunction)) +
geom_histogram(fill="#43B047", color="#049DCB", alpha=0.9) +
ggtitle("Imputed variable") +
xlab("DiabetesPedigreeFunction") + ylab("Frequency") +
theme_ipsum() +
theme(plot.title = element_text(size=15))
grid.arrange(ggp1, ggp2, ncol = 2)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
The distribution of DiabetesPedigreeFunction looks much the same before and after imputation. We now use an analytical test to confirm this.
ks.test(data$DiabetesPedigreeFunction,"pnorm")
## Warning in ks.test.default(data$DiabetesPedigreeFunction, "pnorm"): ties should
## not be present for the one-sample Kolmogorov-Smirnov test
##
## Asymptotic one-sample Kolmogorov-Smirnov test
##
## data: data$DiabetesPedigreeFunction
## D = 0.53217, p-value < 2.2e-16
## alternative hypothesis: two-sided
ks.test(new_dataf$DiabetesPedigreeFunction,"pnorm")
## Warning in ks.test.default(new_dataf$DiabetesPedigreeFunction, "pnorm"): ties
## should not be present for the one-sample Kolmogorov-Smirnov test
##
## Asymptotic one-sample Kolmogorov-Smirnov test
##
## data: new_dataf$DiabetesPedigreeFunction
## D = 0.53207, p-value < 2.2e-16
## alternative hypothesis: two-sided
Neither version passes the normality check, so we again use the nonparametric Wilcoxon rank-sum test to look for a significant difference between the two versions of DiabetesPedigreeFunction.
wilcox.test(data$DiabetesPedigreeFunction, new_dataf$DiabetesPedigreeFunction, alternative = "two.sided")
##
## Wilcoxon rank sum test with continuity correction
##
## data: data$DiabetesPedigreeFunction and new_dataf$DiabetesPedigreeFunction
## W = 274787, p-value = 0.8668
## alternative hypothesis: true location shift is not equal to 0
Since the p-value (0.8668) is above the 0.05 significance level, we fail to reject the null hypothesis: there is no significant difference between the imputed and the original DiabetesPedigreeFunction.
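The Wilcoxon rank-sum test is mainly sensitive to a shift in location. As a complementary check, a two-sample Kolmogorov-Smirnov test compares the two empirical distributions as a whole. The sketch below uses simulated data (original and imputed are hypothetical names) and a deliberately crude mean imputation, which barely moves the location but distorts the shape; it is not the method used in this report.

```r
set.seed(42)
original <- rnorm(600, mean = 30, sd = 6)   # hypothetical symmetric variable
original[sample(600, 200)] <- NA            # remove a third of the values

# Crude mean imputation: keeps the centre, piles mass onto a single value
imputed <- ifelse(is.na(original), mean(original, na.rm = TRUE), original)

obs <- original[!is.na(original)]
w  <- wilcox.test(obs, imputed, alternative = "two.sided", exact = FALSE)
ks <- suppressWarnings(ks.test(obs, imputed))   # ties may trigger a warning

c(wilcoxon_p = w$p.value, ks_p = ks$p.value)
```

In this setting the Wilcoxon test sees no shift while the KS test flags the change in shape, which is why the two checks can be worth running side by side.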
ggp1 <- ggplot(data, aes(x=Age)) +
geom_histogram(fill="#FD0000", color="#E52521", alpha=0.9) +
ggtitle("Original variable") +
xlab("Age") + ylab("Frequency") +
theme_ipsum() +
theme(plot.title = element_text(size=15))
ggp2 <- ggplot(new_dataf, aes(x=Age)) +
geom_histogram(fill="#43B047", color="#049DCB", alpha=0.9) +
ggtitle("Imputed variable") +
xlab("Age") + ylab("Frequency") +
theme_ipsum() +
theme(plot.title = element_text(size=15))
grid.arrange(ggp1, ggp2, ncol = 2)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
The distribution of Age looks much the same before and after imputation. We now use an analytical test to confirm this.
ks.test(data$Age,"pnorm")
## Warning in ks.test.default(data$Age, "pnorm"): ties should not be present for
## the one-sample Kolmogorov-Smirnov test
##
## Asymptotic one-sample Kolmogorov-Smirnov test
##
## data: data$Age
## D = 1, p-value < 2.2e-16
## alternative hypothesis: two-sided
ks.test(new_dataf$Age,"pnorm")
## Warning in ks.test.default(new_dataf$Age, "pnorm"): ties should not be present
## for the one-sample Kolmogorov-Smirnov test
##
## Asymptotic one-sample Kolmogorov-Smirnov test
##
## data: new_dataf$Age
## D = 1, p-value < 2.2e-16
## alternative hypothesis: two-sided
Neither version passes the normality check, so we again use the nonparametric Wilcoxon rank-sum test to look for a significant difference between the two versions of Age.
wilcox.test(data$Age, new_dataf$Age, alternative = "two.sided")
##
## Wilcoxon rank sum test with continuity correction
##
## data: data$Age and new_dataf$Age
## W = 271622, p-value = 0.8277
## alternative hypothesis: true location shift is not equal to 0
Since the p-value (0.8277) is above the 0.05 significance level, we fail to reject the null hypothesis: there is no significant difference between the imputed and the original Age.
For all the imputed variables, there were no significant differences before and after imputation, so we can move on to the next imputation.
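The same check was repeated by hand for each variable. A small helper can run it across columns in one pass. The sketch below is an illustration on simulated data: compare_versions, raw and filled are hypothetical names, and the hot-deck fill only stands in for whatever imputation was actually used.

```r
# Wilcoxon rank-sum comparison of an original variable (with NAs)
# against its imputed version, returned as a one-row data frame.
compare_versions <- function(original, imputed, name = "variable") {
  obs <- original[!is.na(original)]
  w <- wilcox.test(obs, imputed, alternative = "two.sided", exact = FALSE)
  data.frame(
    variable   = name,
    n_missing  = sum(is.na(original)),
    wilcoxon_W = unname(w$statistic),
    p_value    = w$p.value
  )
}

set.seed(7)
raw <- data.frame(a = rnorm(200, 30, 6), b = rgamma(200, shape = 2, rate = 4))
raw$a[sample(200, 20)] <- NA
raw$b[sample(200, 30)] <- NA

# Hypothetical imputed copy: fill each NA with a randomly drawn observed value
filled <- raw
for (v in names(filled)) {
  miss <- is.na(filled[[v]])
  filled[[v]][miss] <- sample(filled[[v]][!miss], sum(miss), replace = TRUE)
}

checks <- do.call(rbind, lapply(names(raw), function(v) {
  compare_versions(raw[[v]], filled[[v]], v)
}))
checks
```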
Let's look at the imputation of this variable (SkinThickness) with four different methods: pmm, norm.predict, norm.nob, and norm.
pmm:
imp <- mice(new_dataf, m=5, maxit=50, method='pmm', seed=500, printFlag = FALSE)
imp_df <- mice::complete(imp)
summary(imp_df)
## Pregnancies Glucose BloodPressure SkinThickness
## Min. : 0.000 Min. : 44.00 Min. : 24.00 Min. : 7.00
## 1st Qu.: 1.000 1st Qu.: 99.75 1st Qu.: 64.00 1st Qu.:21.00
## Median : 3.000 Median :117.00 Median : 72.00 Median :29.00
## Mean : 3.899 Mean :121.59 Mean : 72.34 Mean :28.52
## 3rd Qu.: 6.000 3rd Qu.:141.00 3rd Qu.: 80.00 3rd Qu.:36.00
## Max. :17.000 Max. :199.00 Max. :114.00 Max. :60.00
## BMI DiabetesPedigreeFunction Age Outcome
## Min. :18.20 Min. :0.0780 Min. :21.00 Min. :0.0000
## 1st Qu.:27.50 1st Qu.:0.2447 1st Qu.:24.00 1st Qu.:0.0000
## Median :32.35 Median :0.3745 Median :29.00 Median :0.0000
## Mean :32.37 Mean :0.4563 Mean :33.36 Mean :0.3427
## 3rd Qu.:36.50 3rd Qu.:0.6132 3rd Qu.:41.00 3rd Qu.:1.0000
## Max. :57.30 Max. :1.4760 Max. :70.00 Max. :1.0000
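Note that mice::complete(imp) with default arguments returns only the first of the m = 5 imputed datasets. A minimal sketch of how the others can be inspected, using the small nhanes dataset that ships with mice rather than the report's data (an assumption made to keep the example self-contained; imp_demo and the other object names are hypothetical):

```r
if (requireNamespace("mice", quietly = TRUE)) {
  library(mice)

  imp_demo <- mice(nhanes, m = 5, method = "pmm", seed = 500, printFlag = FALSE)

  first <- mice::complete(imp_demo)           # default: first imputed dataset
  third <- mice::complete(imp_demo, 3)        # third imputed dataset
  all_m <- mice::complete(imp_demo, "long")   # all five stacked; adds .imp and .id

  # Mean of the completed bmi column in each of the five datasets
  print(tapply(all_m$bmi, all_m$.imp, mean))
}
```

Comparing these per-imputation summaries gives a rough idea of how much the result depends on which single completed dataset is analysed.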
ggp1 <- ggplot(data, aes(x=SkinThickness)) +
geom_histogram(fill="#FD0000", color="#E52521", alpha=0.9) +
ggtitle("Original variable") +
xlab("SkinThickness") + ylab("Frequency") +
theme_ipsum() +
theme(plot.title = element_text(size=15))+
coord_cartesian(xlim = c(0, 60))
ggp2 <- ggplot(imp_df, aes(x=SkinThickness)) +
geom_histogram(fill="#43B047", color="#049DCB", alpha=0.9) +
ggtitle("Imputed variable") +
xlab("SkinThickness") + ylab("Frequency") +
theme_ipsum() +
theme(plot.title = element_text(size=15))
grid.arrange(ggp1, ggp2, ncol = 2)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 227 rows containing non-finite outside the scale range
## (`stat_bin()`).
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
There is a slight change in the distribution. Let's check with an analytical test whether the two distributions differ significantly.
We first check normality again to decide which test to use:
ks.test(data$SkinThickness,"pnorm")
## Warning in ks.test.default(data$SkinThickness, "pnorm"): ties should not be
## present for the one-sample Kolmogorov-Smirnov test
##
## Asymptotic one-sample Kolmogorov-Smirnov test
##
## data: data$SkinThickness
## D = 1, p-value < 2.2e-16
## alternative hypothesis: two-sided
ks.test(imp_df$SkinThickness,"pnorm")
## Warning in ks.test.default(imp_df$SkinThickness, "pnorm"): ties should not be
## present for the one-sample Kolmogorov-Smirnov test
##
## Asymptotic one-sample Kolmogorov-Smirnov test
##
## data: imp_df$SkinThickness
## D = 1, p-value < 2.2e-16
## alternative hypothesis: two-sided
Since neither version is normal, we again use the Wilcoxon test:
wilcox.test(data$SkinThickness, imp_df$SkinThickness, alternative = "two.sided")
##
## Wilcoxon rank sum test with continuity correction
##
## data: data$SkinThickness and imp_df$SkinThickness
## W = 198366, p-value = 0.3629
## alternative hypothesis: true location shift is not equal to 0
Since the p-value (0.3629) is above the 0.05 significance level, we fail to reject the null hypothesis: there is no significant difference between the imputed and the original SkinThickness.
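To make the idea behind pmm (predictive mean matching) concrete, here is a simplified sketch in base R. It is not the mice implementation: it uses a plain least-squares fit with a single predictor, whereas mice also draws the regression coefficients at random before matching. The function pmm_sketch and the simulated x and y are hypothetical.

```r
# Simplified predictive mean matching for y given one predictor x.
pmm_sketch <- function(y, x, donors = 5) {
  miss <- is.na(y)
  fit  <- lm(y ~ x)                                  # rows with missing y are dropped
  pred <- predict(fit, newdata = data.frame(x = x))  # predictions for every row
  obs_idx <- which(!miss)
  for (i in which(miss)) {
    # Observed cases whose predicted value is closest to that of case i
    nearest <- obs_idx[order(abs(pred[obs_idx] - pred[i]))[seq_len(donors)]]
    # Copy the observed value of one randomly chosen donor
    y[i] <- y[nearest[sample.int(length(nearest), 1)]]
  }
  y
}

set.seed(123)
x <- runif(300, 10, 50)
y <- round(5 + 0.6 * x + rnorm(300, sd = 4))
y_obs <- y
y_obs[sample(300, 90)] <- NA

y_imp <- pmm_sketch(y_obs, x)

# Every imputed value is a value that was actually observed
all(y_imp[is.na(y_obs)] %in% y_obs[!is.na(y_obs)])
```

Because imputed values are copied from donors, they stay within the observed range, which is consistent with the pmm summary above, where SkinThickness stays between 7 and 60.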
norm.predict:
imp <- mice(new_dataf, m=5, maxit=50, method='norm.predict', seed=500, printFlag = FALSE)
imp_df <- mice::complete(imp)
summary(imp_df)
## Pregnancies Glucose BloodPressure SkinThickness
## Min. : 0.000 Min. : 44.00 Min. : 24.00 Min. : 7.00
## 1st Qu.: 1.000 1st Qu.: 99.75 1st Qu.: 64.00 1st Qu.:22.00
## Median : 3.000 Median :117.00 Median : 72.00 Median :28.43
## Mean : 3.899 Mean :121.59 Mean : 72.34 Mean :28.66
## 3rd Qu.: 6.000 3rd Qu.:141.00 3rd Qu.: 80.00 3rd Qu.:35.00
## Max. :17.000 Max. :199.00 Max. :114.00 Max. :60.00
## BMI DiabetesPedigreeFunction Age Outcome
## Min. :18.20 Min. :0.0780 Min. :21.00 Min. :0.0000
## 1st Qu.:27.50 1st Qu.:0.2447 1st Qu.:24.00 1st Qu.:0.0000
## Median :32.35 Median :0.3745 Median :29.00 Median :0.0000
## Mean :32.37 Mean :0.4563 Mean :33.36 Mean :0.3427
## 3rd Qu.:36.50 3rd Qu.:0.6132 3rd Qu.:41.00 3rd Qu.:1.0000
## Max. :57.30 Max. :1.4760 Max. :70.00 Max. :1.0000
ggp1 <- ggplot(data, aes(x=SkinThickness)) +
geom_histogram(fill="#FD0000", color="#E52521", alpha=0.9) +
ggtitle("Original variable") +
xlab("SkinThickness") + ylab("Frequency") +
theme_ipsum() +
theme(plot.title = element_text(size=15))+
coord_cartesian(xlim = c(0, 60))
ggp2 <- ggplot(imp_df, aes(x=SkinThickness)) +
geom_histogram(fill="#43B047", color="#049DCB", alpha=0.9) +
ggtitle("Imputed variable") +
xlab("SkinThickness") + ylab("Frequency") +
theme_ipsum() +
theme(plot.title = element_text(size=15))
grid.arrange(ggp1, ggp2, ncol = 2)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 227 rows containing non-finite outside the scale range
## (`stat_bin()`).
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
There is a slight change in the distribution. Let's check with an analytical test whether the two distributions differ significantly.
We first check normality again to decide which test to use:
ks.test(data$SkinThickness,"pnorm")
## Warning in ks.test.default(data$SkinThickness, "pnorm"): ties should not be
## present for the one-sample Kolmogorov-Smirnov test
##
## Asymptotic one-sample Kolmogorov-Smirnov test
##
## data: data$SkinThickness
## D = 1, p-value < 2.2e-16
## alternative hypothesis: two-sided
ks.test(imp_df$SkinThickness,"pnorm")
## Warning in ks.test.default(imp_df$SkinThickness, "pnorm"): ties should not be
## present for the one-sample Kolmogorov-Smirnov test
##
## Asymptotic one-sample Kolmogorov-Smirnov test
##
## data: imp_df$SkinThickness
## D = 1, p-value < 2.2e-16
## alternative hypothesis: two-sided
Since neither version is normal, we again use the Wilcoxon test:
wilcox.test(data$SkinThickness, imp_df$SkinThickness, alternative = "two.sided")
##
## Wilcoxon rank sum test with continuity correction
##
## data: data$SkinThickness and imp_df$SkinThickness
## W = 196903, p-value = 0.4972
## alternative hypothesis: true location shift is not equal to 0
Since the p-value (0.4972) is above the 0.05 significance level, we fail to reject the null hypothesis: there is no significant difference between the imputed and the original SkinThickness.
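norm.predict fills each missing value with the prediction of a linear regression, so the imputed points lie exactly on the fitted line. A minimal base R sketch on simulated data (x, y and the other names are hypothetical, and a single predictor is assumed) shows the usual side effect: the location is preserved, but the spread of the completed variable shrinks.

```r
set.seed(2024)
n <- 400
x <- rnorm(n, 30, 6)
y <- 10 + 0.8 * x + rnorm(n, sd = 5)
y_obs <- y
y_obs[sample(n, 120)] <- NA
miss <- is.na(y_obs)

fit <- lm(y_obs ~ x)   # rows with missing y are dropped by default

# Deterministic regression imputation: plug in the fitted values
y_pred <- y_obs
y_pred[miss] <- predict(fit, newdata = data.frame(x = x[miss]))

c(sd_observed = sd(y_obs, na.rm = TRUE), sd_after_imputation = sd(y_pred))
```

A location test such as Wilcoxon is not designed to notice this loss of variability, so a look at the standard deviation or the histogram remains useful.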
norm.nob:
imp <- mice(new_dataf, m=5, maxit=50, method='norm.nob', seed=500, printFlag = FALSE)
imp_df <- mice::complete(imp)
summary(imp_df)
## Pregnancies Glucose BloodPressure SkinThickness
## Min. : 0.000 Min. : 44.00 Min. : 24.00 Min. : 2.17
## 1st Qu.: 1.000 1st Qu.: 99.75 1st Qu.: 64.00 1st Qu.:21.00
## Median : 3.000 Median :117.00 Median : 72.00 Median :28.00
## Mean : 3.899 Mean :121.59 Mean : 72.34 Mean :28.36
## 3rd Qu.: 6.000 3rd Qu.:141.00 3rd Qu.: 80.00 3rd Qu.:36.00
## Max. :17.000 Max. :199.00 Max. :114.00 Max. :60.00
## BMI DiabetesPedigreeFunction Age Outcome
## Min. :18.20 Min. :0.0780 Min. :21.00 Min. :0.0000
## 1st Qu.:27.50 1st Qu.:0.2447 1st Qu.:24.00 1st Qu.:0.0000
## Median :32.35 Median :0.3745 Median :29.00 Median :0.0000
## Mean :32.37 Mean :0.4563 Mean :33.36 Mean :0.3427
## 3rd Qu.:36.50 3rd Qu.:0.6132 3rd Qu.:41.00 3rd Qu.:1.0000
## Max. :57.30 Max. :1.4760 Max. :70.00 Max. :1.0000
ggp1 <- ggplot(data, aes(x=SkinThickness)) +
geom_histogram(fill="#FD0000", color="#E52521", alpha=0.9) +
ggtitle("Original variable") +
xlab("SkinThickness") + ylab("Frequency") +
theme_ipsum() +
theme(plot.title = element_text(size=15))+
coord_cartesian(xlim = c(0, 60))
ggp2 <- ggplot(imp_df, aes(x=SkinThickness)) +
geom_histogram(fill="#43B047", color="#049DCB", alpha=0.9) +
ggtitle("Imputed variable") +
xlab("SkinThickness") + ylab("Frequency") +
theme_ipsum() +
theme(plot.title = element_text(size=15))
grid.arrange(ggp1, ggp2, ncol = 2)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 227 rows containing non-finite outside the scale range
## (`stat_bin()`).
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Here the changes are somewhat more pronounced than with the previous methods. Let's check with an analytical test whether the two distributions differ significantly.
We first check normality again to decide which test to use:
ks.test(data$SkinThickness,"pnorm")
## Warning in ks.test.default(data$SkinThickness, "pnorm"): ties should not be
## present for the one-sample Kolmogorov-Smirnov test
##
## Asymptotic one-sample Kolmogorov-Smirnov test
##
## data: data$SkinThickness
## D = 1, p-value < 2.2e-16
## alternative hypothesis: two-sided
ks.test(imp_df$SkinThickness,"pnorm")
## Warning in ks.test.default(imp_df$SkinThickness, "pnorm"): ties should not be
## present for the one-sample Kolmogorov-Smirnov test
##
## Asymptotic one-sample Kolmogorov-Smirnov test
##
## data: imp_df$SkinThickness
## D = 0.99858, p-value < 2.2e-16
## alternative hypothesis: two-sided
Since neither version is normal, we again use the Wilcoxon test:
wilcox.test(data$SkinThickness, imp_df$SkinThickness, alternative = "two.sided")
##
## Wilcoxon rank sum test with continuity correction
##
## data: data$SkinThickness and imp_df$SkinThickness
## W = 199681, p-value = 0.264
## alternative hypothesis: true location shift is not equal to 0
Since the p-value (0.264) is above the 0.05 significance level, we fail to reject the null hypothesis: there is no significant difference between the imputed and the original SkinThickness.
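norm.nob is stochastic regression imputation: the regression prediction plus random normal noise, with the fitted coefficients treated as known. The base R sketch below illustrates the idea on simulated data (all names are hypothetical and a single predictor is assumed; it is not the mice code).

```r
set.seed(2024)
n <- 400
x <- rnorm(n, 30, 6)
y <- 10 + 0.8 * x + rnorm(n, sd = 5)
y_obs <- y
y_obs[sample(n, 120)] <- NA
miss <- is.na(y_obs)

fit <- lm(y_obs ~ x)   # rows with missing y are dropped by default

# Prediction plus a normal draw with the residual standard deviation
y_noise <- y_obs
y_noise[miss] <- predict(fit, newdata = data.frame(x = x[miss])) +
  rnorm(sum(miss), mean = 0, sd = sigma(fit))

rbind(observed = range(y_obs, na.rm = TRUE),
      imputed  = range(y_noise[miss]))
```

Since the noise is unbounded, imputed values can fall outside the observed range. That matches the summary above, where the minimum of SkinThickness drops to 2.17 although the smallest observed value is 7.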
norm:
imp <- mice(new_dataf, m=5, maxit=50, method='norm', seed=500, printFlag = FALSE)
imp_df <- mice::complete(imp)
summary(imp_df)
## Pregnancies Glucose BloodPressure SkinThickness
## Min. : 0.000 Min. : 44.00 Min. : 24.00 Min. : 3.702
## 1st Qu.: 1.000 1st Qu.: 99.75 1st Qu.: 64.00 1st Qu.:21.089
## Median : 3.000 Median :117.00 Median : 72.00 Median :28.687
## Mean : 3.899 Mean :121.59 Mean : 72.34 Mean :28.587
## 3rd Qu.: 6.000 3rd Qu.:141.00 3rd Qu.: 80.00 3rd Qu.:35.022
## Max. :17.000 Max. :199.00 Max. :114.00 Max. :60.281
## BMI DiabetesPedigreeFunction Age Outcome
## Min. :18.20 Min. :0.0780 Min. :21.00 Min. :0.0000
## 1st Qu.:27.50 1st Qu.:0.2447 1st Qu.:24.00 1st Qu.:0.0000
## Median :32.35 Median :0.3745 Median :29.00 Median :0.0000
## Mean :32.37 Mean :0.4563 Mean :33.36 Mean :0.3427
## 3rd Qu.:36.50 3rd Qu.:0.6132 3rd Qu.:41.00 3rd Qu.:1.0000
## Max. :57.30 Max. :1.4760 Max. :70.00 Max. :1.0000
ggp1 <- ggplot(data, aes(x=SkinThickness)) +
geom_histogram(fill="#FD0000", color="#E52521", alpha=0.9) +
ggtitle("Original variable") +
xlab("SkinThickness") + ylab("Frequency") +
theme_ipsum() +
theme(plot.title = element_text(size=15))+
coord_cartesian(xlim=c(0,60))
ggp2 <- ggplot(imp_df, aes(x=SkinThickness)) +
geom_histogram(fill="#43B047", color="#049DCB", alpha=0.9) +
ggtitle("Imputed variable") +
xlab("SkinThickness") + ylab("Frequency") +
theme_ipsum() +
theme(plot.title = element_text(size=15))
grid.arrange(ggp1, ggp2, ncol = 2)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 227 rows containing non-finite outside the scale range
## (`stat_bin()`).
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Here too the changes are somewhat more pronounced than with the first two methods. Let's check with an analytical test whether the two distributions differ significantly.
We first check normality again to decide which test to use:
ks.test(data$SkinThickness,"pnorm")
## Warning in ks.test.default(data$SkinThickness, "pnorm"): ties should not be
## present for the one-sample Kolmogorov-Smirnov test
##
## Asymptotic one-sample Kolmogorov-Smirnov test
##
## data: data$SkinThickness
## D = 1, p-value < 2.2e-16
## alternative hypothesis: two-sided
ks.test(imp_df$SkinThickness,"pnorm")
## Warning in ks.test.default(imp_df$SkinThickness, "pnorm"): ties should not be
## present for the one-sample Kolmogorov-Smirnov test
##
## Asymptotic one-sample Kolmogorov-Smirnov test
##
## data: imp_df$SkinThickness
## D = 0.99989, p-value < 2.2e-16
## alternative hypothesis: two-sided
Since neither version is normal, we again use the Wilcoxon test:
wilcox.test(data$SkinThickness, imp_df$SkinThickness, alternative = "two.sided")
##
## Wilcoxon rank sum test with continuity correction
##
## data: data$SkinThickness and imp_df$SkinThickness
## W = 197405, p-value = 0.4484
## alternative hypothesis: true location shift is not equal to 0
Since the p-value (0.4484) is above the 0.05 significance level, we fail to reject the null hypothesis: there is no significant difference between the imputed and the original SkinThickness.
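The four runs above differ only in the method argument, so they can be written as a loop. The sketch below assumes the nhanes dataset bundled with mice and its bmi column instead of the report's data, and a shorter maxit to keep it quick; imp_methods and results are hypothetical names.

```r
if (requireNamespace("mice", quietly = TRUE)) {
  library(mice)

  imp_methods <- c("pmm", "norm.predict", "norm.nob", "norm")

  results <- do.call(rbind, lapply(imp_methods, function(m) {
    imp_m  <- mice(nhanes, m = 5, maxit = 10, method = m,
                   seed = 500, printFlag = FALSE)
    filled <- mice::complete(imp_m)   # first completed dataset
    # Original (NAs dropped by the test) against the completed variable
    w <- wilcox.test(nhanes$bmi, filled$bmi,
                     alternative = "two.sided", exact = FALSE)
    data.frame(method = m, W = unname(w$statistic), p_value = w$p.value)
  }))

  print(results)
}
```

Collecting the statistics in one table makes it easier to compare methods than scrolling through four separate outputs.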
Of course, this is not the only way to treat the data: there are many other imputation methods, such as regression models or machine learning models. Likewise, outliers can be detected with other methods, for example percentiles or the interquartile range, or with analytical tests such as the Rosner test used here. What matters is to verify, graphically or analytically, that the imputation does not change the distribution of the data.
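As an illustration of the interquartile-range alternative mentioned above, here is a minimal sketch on a made-up vector. The 1.5 multiplier is the conventional choice for this rule, not something taken from this report.

```r
x <- c(10, 12, 11, 13, 12, 11, 50)   # hypothetical measurements

q     <- unname(quantile(x, c(0.25, 0.75), na.rm = TRUE))
fence <- 1.5 * (q[2] - q[1])         # 1.5 times the interquartile range

# Values beyond the lower or upper fence are flagged as outliers
is_outlier <- x < q[1] - fence | x > q[2] + fence
x[is_outlier]
# → 50
```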