Analyze the normality of the Tasas data set, which contains 5 variables (TN: birth rate, TM: death rate, EV: life expectancy, EVM: female life expectancy, and EVH: male life expectancy) measured in 194 countries around the world.
library(MVN)
Data <- read.csv("Tasas.csv")
head(Data)
## Paises TN TM EV EVM EVH
## 1 Afganistan 35.84 7.34 61.98 65.28 58.92
## 2 Albania 8.90 8.60 75.50 77.70 73.60
## 3 Alemania 8.80 12.70 80.70 83.30 78.40
## 4 Andorra 6.20 4.60 83.70 86.00 81.30
## 5 Angola 38.81 8.01 61.64 64.31 59.03
## 6 Antigua y Barbuda 12.12 6.37 78.50 80.94 75.78
### 1. Mardia
Mardia <- mvn(Data[,-1], mvnTest = "mardia")
Mardia$multivariateNormality
## Test Statistic p value Result
## 1 Mardia Skewness 201.323801536089 2.98182323473629e-25 NO
## 2 Mardia Kurtosis 4.20204709407368 2.64512086503021e-05 NO
## 3 MVN <NA> <NA> NO
### 2. Henze-Zirkler
HZ <- mvn(Data[,-1], mvnTest = "hz")
HZ$multivariateNormality
## Test HZ p value MVN
## 1 Henze-Zirkler 2.294973 0 NO
### 3. Royston
Royston <- mvn(Data[,-1], mvnTest = "royston")
Royston$multivariateNormality
## Test H p value MVN
## 1 Royston 25.55369 1.527186e-06 NO
### 4. Doornik-Hansen
DH <- mvn(Data[,-1], mvnTest = "dh")
DH$multivariateNormality
## Test E df p value MVN
## 1 Doornik-Hansen 239.7374 10 7.77862e-46 NO
### 5. Energy
Energy <- mvn(Data[,-1], mvnTest = "energy")
Energy$multivariateNormality
## Test Statistic p value MVN
## 1 E-statistic 3.151856 0 NO
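All five multivariate normality tests (Mardia, Henze-Zirkler, Royston, Doornik-Hansen, and Energy) reject multivariate normality, with p-values far below 0.05, so the data are not multivariate normal.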
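To make the Mardia result less of a black box, the two statistics can be computed by hand. The following is a minimal sketch in base R under stated assumptions: `mardia_stats` is a hypothetical helper written for illustration (it is not the MVN implementation), it uses the maximum-likelihood covariance estimate, and it runs on simulated data because Tasas.csv is not reproduced here.

```r
# Hypothetical helper: Mardia's multivariate skewness and kurtosis tests.
mardia_stats <- function(X) {
  X  <- as.matrix(X)
  n  <- nrow(X)
  p  <- ncol(X)
  Xc <- scale(X, center = TRUE, scale = FALSE)  # center each column
  S  <- crossprod(Xc) / n                       # ML covariance (divide by n)
  D  <- Xc %*% solve(S) %*% t(Xc)               # n x n generalized distances

  b1 <- sum(D^3) / n^2                          # multivariate skewness
  b2 <- sum(diag(D)^2) / n                      # multivariate kurtosis

  skew_stat <- n * b1 / 6                       # approx. chi-square under normality
  skew_df   <- p * (p + 1) * (p + 2) / 6
  kurt_z    <- (b2 - p * (p + 2)) / sqrt(8 * p * (p + 2) / n)  # approx. N(0, 1)

  list(skew_stat = skew_stat,
       skew_df   = skew_df,
       skew_p    = pchisq(skew_stat, df = skew_df, lower.tail = FALSE),
       kurt_z    = kurt_z,
       kurt_p    = 2 * pnorm(-abs(kurt_z)))
}

set.seed(2024)
X_normal <- matrix(rnorm(200 * 3), ncol = 3)  # simulated normal data
X_skewed <- matrix(rexp(200 * 3),  ncol = 3)  # simulated right-skewed data

res_normal <- mardia_stats(X_normal)
res_skewed <- mardia_stats(X_skewed)
res_skewed$skew_p  # very small: skewness is detected
```

A small skewness p-value, as in the Tasas output, means the joint distribution is too asymmetric to be multivariate normal; a large absolute kurtosis z value points to tails that are heavier or lighter than normal.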
library(GGally)
ggpairs(Data[,-1])
This scatterplot matrix suggests that the variables are not normally distributed.
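A complementary visual check for multivariate normality is a chi-square Q-Q plot of squared Mahalanobis distances. The sketch below is illustrative only: it uses simulated skewed data as a stand-in for the five Tasas variables, and the object names are assumptions. Under multivariate normality the points should fall close to the dashed identity line.

```r
set.seed(123)
n <- 194                              # same number of rows as the Tasas data
p <- 5                                # same number of variables
X <- matrix(rexp(n * p), nrow = n)    # simulated skewed stand-in, not the real data

# Squared Mahalanobis distance of each row from the sample mean
d2 <- mahalanobis(X, center = colMeans(X), cov = cov(X))

# Theoretical chi-square quantiles with p degrees of freedom
q <- qchisq(ppoints(n), df = p)

plot(q, sort(d2),
     xlab = "Chi-square quantiles",
     ylab = "Ordered squared Mahalanobis distances",
     main = "Chi-square Q-Q plot")
abline(0, 1, lty = 2)                 # reference line for multivariate normality
```

Points that bend away from the line in the upper right indicate observations that are farther from the center than a multivariate normal distribution would allow.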
### 1. Shapiro-Wilk
SW <- mvn(Data[,-1], univariateTest = "SW", desc = TRUE)
SW
## $multivariateNormality
## Test HZ p value MVN
## 1 Henze-Zirkler 2.294973 0 NO
##
## $univariateNormality
## Test Variable Statistic p value Normality
## 1 Shapiro-Wilk TN 0.9233 <0.001 NO
## 2 Shapiro-Wilk TM 0.9589 <0.001 NO
## 3 Shapiro-Wilk EV 0.9776 0.0034 NO
## 4 Shapiro-Wilk EVM 0.9707 4e-04 NO
## 5 Shapiro-Wilk EVH 0.9804 0.008 NO
##
## $Descriptives
## n Mean Std.Dev Median Min Max 25th 75th Skew
## TN 194 19.06469 9.872522 16.885 5.00 45.29 10.4250 27.2400 0.65598906
## TM 194 8.57701 3.074472 8.090 1.31 18.40 6.6150 9.9975 0.72290852
## EV 194 71.31052 7.764204 71.860 52.53 85.60 65.6700 76.8050 -0.23215863
## EVM 194 74.01974 7.876189 75.165 53.07 87.90 68.4375 79.5975 -0.39876259
## EVH 194 68.73345 7.767232 68.700 50.37 84.10 63.0850 73.6750 -0.07285581
## Kurtosis
## TN -0.6175703
## TM 0.6721286
## EV -0.6532162
## EVM -0.5952345
## EVH -0.6863484
### 3. Lilliefors (Kolmogorov-Smirnov correction)
L <- mvn(Data[,-1], univariateTest = "Lillie", desc = TRUE)
L$univariateNormality
## Test Variable Statistic p value Normality
## 1 Lilliefors (Kolmogorov-Smirnov) TN 0.1290 <0.001 NO
## 2 Lilliefors (Kolmogorov-Smirnov) TM 0.0859 0.0014 NO
## 3 Lilliefors (Kolmogorov-Smirnov) EV 0.0576 0.1199 YES
## 4 Lilliefors (Kolmogorov-Smirnov) EVM 0.0745 0.0107 NO
## 5 Lilliefors (Kolmogorov-Smirnov) EVH 0.0613 0.0728 YES
### 4. Shapiro-Francia
SF <- mvn(Data[,-1], univariateTest = "SF", desc = TRUE)
SF$univariateNormality
## Test Variable Statistic p value Normality
## 1 Shapiro-Francia TN 0.9267 <0.001 NO
## 2 Shapiro-Francia TM 0.9588 1e-04 NO
## 3 Shapiro-Francia EV 0.9809 0.0112 NO
## 4 Shapiro-Francia EVM 0.9736 0.0015 NO
## 5 Shapiro-Francia EVH 0.9835 0.0237 NO
### 5. Anderson-Darling
AD <- mvn(Data[,-1], univariateTest = "AD", desc = TRUE)
AD$univariateNormality
## Test Variable Statistic p value Normality
## 1 Anderson-Darling TN 4.9804 <0.001 NO
## 2 Anderson-Darling TM 2.6962 <0.001 NO
## 3 Anderson-Darling EV 0.9374 0.0172 NO
## 4 Anderson-Darling EVM 1.5290 6e-04 NO
## 5 Anderson-Darling EVH 0.8141 0.0347 NO
The univariate tests also show that the data are not normal. Shapiro-Wilk, Shapiro-Francia, and Anderson-Darling reject normality for all five variables at the 5% level; only Lilliefors fails to reject, and only for EV (p = 0.1199) and EVH (p = 0.0728). Because the Lilliefors test estimates the mean and standard deviation from the sample and is generally less powerful than the other three tests, we rely on the others. With the data from the 194 countries, we conclude that birth rate, death rate, life expectancy, and female and male life expectancy are not normally distributed.
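The role of the estimated mean and standard deviation in the Lilliefors test can be illustrated with a short base R sketch. The helper `lillie_D` and the simulated sample are assumptions made for illustration. The statistic is the ordinary Kolmogorov-Smirnov distance evaluated at the estimated parameters; what changes is the null distribution, which is approximated here by Monte Carlo simulation instead of the tabulated correction.

```r
# Hypothetical helper: KS distance to a normal with estimated mean and sd
lillie_D <- function(x) {
  x  <- sort(x)
  n  <- length(x)
  Fx <- pnorm((x - mean(x)) / sd(x))   # fitted normal CDF at each point
  i  <- seq_len(n)
  max(max(i / n - Fx), max(Fx - (i - 1) / n))
}

set.seed(1)
x <- rexp(100)                         # simulated right-skewed sample
D_obs <- lillie_D(x)

# Null distribution of D when the data really are normal and the
# parameters are re-estimated in every replicate
D_null <- replicate(2000, lillie_D(rnorm(length(x))))
p_mc   <- mean(D_null >= D_obs)        # Monte Carlo p-value
p_mc
```

Using the standard Kolmogorov-Smirnov p-value with estimated parameters would be too conservative, which is the problem the Lilliefors correction addresses.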
The gapminder package is loaded, installing it from GitHub first only if it is missing.
if (!requireNamespace("gapminder", quietly = TRUE)) {
  devtools::install_github("jennybc/gapminder")
}
library(gapminder)
gapminder <- gapminder::gapminder
For the following exercises I use the observations from the year 1952.
Data <- data.frame(
pob = gapminder$pop[gapminder$year == 1952],
lf = gapminder$lifeExp[gapminder$year == 1952],
gdp = gapminder$gdpPercap[gapminder$year == 1952])
The histograms of the three variables show that the data are not normally distributed.
par(mfrow=c(1,3))
hist(Data[,1])
hist(Data[,2])
hist(Data[,3])
SW1 <- shapiro.test(Data$pob)
SW1
##
## Shapiro-Wilk normality test
##
## data: Data$pob
## W = 0.2562, p-value < 2.2e-16
SW2 <- shapiro.test(Data$lf)
SW2
##
## Shapiro-Wilk normality test
##
## data: Data$lf
## W = 0.93078, p-value = 2.014e-06
SW3 <- shapiro.test(Data$gdp)
SW3
##
## Shapiro-Wilk normality test
##
## data: Data$gdp
## W = 0.24778, p-value < 2.2e-16
The p-values of all three tests indicate that these variables are far from normal, so we proceed to transform them.
### Transformations
We use bestNormalize to select a normalizing transformation for each variable.
library(bestNormalize)
best_trans1 <- bestNormalize(Data$pob)
best_trans1
## Best Normalizing transformation with 142 Observations
## Estimated Normality Statistics (Pearson P / df, lower => more normal):
## - arcsinh(x): 1.291
## - Box-Cox: 1.3992
## - Center+scale: 7.4903
## - Double Reversed Log_b(x+a): 6.6969
## - Log_b(x+a): 1.291
## - orderNorm (ORQ): 1.3928
## - sqrt(x + a): 3.4236
## - Yeo-Johnson: 1.3992
## Estimation method: Out-of-sample via CV with 10 folds and 5 repeats
##
## Based off these, bestNormalize chose:
## Standardized asinh(x) Transformation with 142 nonmissing obs.:
## Relevant statistics:
## - mean (before standardization) = 15.89392
## - sd (before standardization) = 1.627995
best_trans2 <- bestNormalize(Data$lf)
best_trans2
## Best Normalizing transformation with 142 Observations
## Estimated Normality Statistics (Pearson P / df, lower => more normal):
## - arcsinh(x): 1.6796
## - Box-Cox: 1.6705
## - Center+scale: 1.7055
## - Double Reversed Log_b(x+a): 3.2328
## - Exp(x): 18.429
## - Log_b(x+a): 1.6796
## - orderNorm (ORQ): 1.2175
## - sqrt(x + a): 1.6682
## - Yeo-Johnson: 1.6705
## Estimation method: Out-of-sample via CV with 10 folds and 5 repeats
##
## Based off these, bestNormalize chose:
## orderNorm Transformation with 142 nonmissing obs and no ties
## - Original quantiles:
## 0% 25% 50% 75% 100%
## 28.801 39.059 45.136 59.765 72.670
best_trans3 <- bestNormalize(Data$gdp)
best_trans3
## Best Normalizing transformation with 142 Observations
## Estimated Normality Statistics (Pearson P / df, lower => more normal):
## - arcsinh(x): 1.1867
## - Box-Cox: 1.2735
## - Center+scale: 4.6884
## - Double Reversed Log_b(x+a): 3.28
## - Log_b(x+a): 1.1867
## - orderNorm (ORQ): 1.5227
## - sqrt(x + a): 1.8354
## - Yeo-Johnson: 1.2735
## Estimation method: Out-of-sample via CV with 10 folds and 5 repeats
##
## Based off these, bestNormalize chose:
## Standardized asinh(x) Transformation with 142 nonmissing obs.:
## Relevant statistics:
## - mean (before standardization) = 8.270125
## - sd (before standardization) = 1.035393
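The standardized asinh(x) transformation reported above can be reproduced by hand, which makes clear what is stored in `x.t`. This is a minimal sketch on simulated log-normal data (an assumption standing in for a skewed positive variable such as population); the variable names are hypothetical.

```r
set.seed(42)
x <- rlnorm(142, meanlog = 15, sdlog = 1.6)  # simulated skewed, positive data

z   <- asinh(x)                  # inverse hyperbolic sine, close to log(2x) for large x
x_t <- (z - mean(z)) / sd(z)     # standardize to mean 0 and sd 1

p_before <- shapiro.test(x)$p.value    # raw data: normality clearly rejected
p_after  <- shapiro.test(x_t)$p.value  # transformed data
c(before = p_before, after = p_after)

# The transformation is invertible, so results can be mapped back
x_back <- sinh(x_t * sd(z) + mean(z))
```

Unlike a rank-based transformation, asinh keeps a simple closed-form inverse, which is convenient when predictions need to be reported on the original scale.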
The bestNormalize output indicates the best transformation for each variable: standardized asinh(x) for pob and gdp, and orderNorm (ORQ) for lf.
We now check whether the transformed data are normal.
library(MVN)
SW1_trans <- shapiro.test(best_trans1$x.t)
SW1_trans
##
## Shapiro-Wilk normality test
##
## data: best_trans1$x.t
## W = 0.99222, p-value = 0.6295
SW2_trans <- shapiro.test(best_trans2$x.t)
SW2_trans
##
## Shapiro-Wilk normality test
##
## data: best_trans2$x.t
## W = 0.99968, p-value = 1
SW3_trans <- shapiro.test(best_trans3$x.t)
SW3_trans
##
## Shapiro-Wilk normality test
##
## data: best_trans3$x.t
## W = 0.97604, p-value = 0.01348
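Conclusion: after transformation, the Shapiro-Wilk test no longer rejects normality for pob (p = 0.6295) or lf (p = 1). For lf the orderNorm transformation maps ranks to normal quantiles, so a p-value of 1 is expected by construction. For gdp the p-value (0.01348) is still below 0.05, so normality is rejected at the 5% level, although W rose from 0.24778 to 0.97604, which means the transformed variable is much closer to normal than the original.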
We have the following data:
library(bestNormalize)
data("autotrader")
names(autotrader)
## [1] "Car_Info" "Link" "Make" "Year" "Location" "Radius"
## [7] "price" "mileage" "status" "model"
Data3 <- data.frame(
  kilom = autotrader$mileage,
  ant = autotrader$Year,
  precio = autotrader$price)
library(nortest)
AD_kilom <- ad.test(Data3$kilom)
AD_kilom
##
## Anderson-Darling normality test
##
## data: Data3$kilom
## A = 271.27, p-value < 2.2e-16
AD_ant <- ad.test(Data3$ant)
AD_ant
##
## Anderson-Darling normality test
##
## data: Data3$ant
## A = 399.77, p-value < 2.2e-16
AD_price <- ad.test(Data3$precio)
AD_price
##
## Anderson-Darling normality test
##
## data: Data3$precio
## A = 49.948, p-value < 2.2e-16
The p-values are extremely small (< 2.2e-16), far below 0.05, which indicates that none of the variables is normally distributed.
We proceed to transform the data toward normality.
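For reference, the Anderson-Darling statistic A can be computed directly from its definition. The sketch below is a base R illustration on simulated data; `ad_stat` is a hypothetical helper, not the nortest implementation, and it returns only the statistic (no p-value).

```r
# Hypothetical helper: Anderson-Darling statistic against a normal
# distribution with mean and sd estimated from the sample
ad_stat <- function(x) {
  x <- sort(x)
  n <- length(x)
  z <- (x - mean(x)) / sd(x)
  log_F  <- pnorm(z,  log.p = TRUE)    # log F(x_(i))
  log_1F <- pnorm(-z, log.p = TRUE)    # log(1 - F(x_(i))), numerically stable
  i <- seq_len(n)
  -n - mean((2 * i - 1) * (log_F + rev(log_1F)))
}

set.seed(10)
A_normal <- ad_stat(rnorm(500))   # simulated normal sample: small A
A_skewed <- ad_stat(rexp(500))    # simulated skewed sample: large A
c(normal = A_normal, skewed = A_skewed)
```

The weights in the sum emphasize the tails, so long right tails such as those typical of mileage or price data inflate A quickly. With several thousand observations, even modest departures from normality produce very small p-values.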
library(bestNormalize)
best_transk <- bestNormalize(Data3$kilom)
best_transk
## Best Normalizing transformation with 6283 Observations
## Estimated Normality Statistics (Pearson P / df, lower => more normal):
## - arcsinh(x): 3.3915
## - Box-Cox: 3.0323
## - Center+scale: 14.7862
## - Double Reversed Log_b(x+a): 23.0862
## - Log_b(x+a): 3.3933
## - orderNorm (ORQ): 1.0801
## - sqrt(x + a): 5.0822
## - Yeo-Johnson: 3.032
## Estimation method: Out-of-sample via CV with 10 folds and 5 repeats
##
## Based off these, bestNormalize chose:
## orderNorm Transformation with 6283 nonmissing obs and ties
## - 6077 unique values
## - Original quantiles:
## 0% 25% 50% 75% 100%
## 2 29099 44800 88950 325556
best_transa <- bestNormalize(Data3$ant)
best_transa
## Best Normalizing transformation with 6283 Observations
## Estimated Normality Statistics (Pearson P / df, lower => more normal):
## - arcsinh(x): 83.5038
## - Box-Cox: 83.5196
## - Center+scale: 83.5038
## - Double Reversed Log_b(x+a): 83.2412
## - Log_b(x+a): 83.5038
## - orderNorm (ORQ): 81.448
## - sqrt(x + a): 83.5038
## - Yeo-Johnson: 83.5525
## Estimation method: Out-of-sample via CV with 10 folds and 5 repeats
##
## Based off these, bestNormalize chose:
## orderNorm Transformation with 6283 nonmissing obs and ties
## - 17 unique values
## - Original quantiles:
## 0% 25% 50% 75% 100%
## 2000 2010 2013 2014 2016
best_transp <- bestNormalize(Data3$precio)
best_transp
## Best Normalizing transformation with 6283 Observations
## Estimated Normality Statistics (Pearson P / df, lower => more normal):
## - arcsinh(x): 3.9171
## - Box-Cox: 2.1633
## - Center+scale: 3.3967
## - Double Reversed Log_b(x+a): 6.4884
## - Log_b(x+a): 3.9171
## - orderNorm (ORQ): 1.0466
## - sqrt(x + a): 2.1873
## - Yeo-Johnson: 2.1633
## Estimation method: Out-of-sample via CV with 10 folds and 5 repeats
##
## Based off these, bestNormalize chose:
## orderNorm Transformation with 6283 nonmissing obs and ties
## - 2465 unique values
## - Original quantiles:
## 0% 25% 50% 75% 100%
## 722 11499 15998 21497 64998
The bestNormalize output indicates that the best transformation for all three variables is orderNorm (ORQ).
Finally, we check whether the transformed data are normal:
trans_kilom <- ad.test(best_transk$x.t)
trans_kilom
##
## Anderson-Darling normality test
##
## data: best_transk$x.t
## A = 0.00032915, p-value = 1
trans_ant <- ad.test(best_transa$x.t)
trans_ant
##
## Anderson-Darling normality test
##
## data: best_transa$x.t
## A = 98.365, p-value < 2.2e-16
trans_precio <- ad.test(best_transp$x.t)
trans_precio
##
## Anderson-Darling normality test
##
## data: best_transp$x.t
## A = 0.01806, p-value = 1
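We conclude that the p-value is 1 for kilom and precio, so the Anderson-Darling test does not reject normality for the transformed versions of these variables. However, the variable ant (Year) could not be normalized (A = 98.365, p-value < 2.2e-16): it takes only 17 unique values, and a rank-based transformation such as orderNorm cannot spread that many ties into a continuous normal shape.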
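The effect of ties on a rank-based transformation can be reproduced with a small simulation. The sketch below assumes a simplified version of the ordered quantile idea, `qnorm((rank(x) - 0.5) / n)`; the helper `orq` and the simulated variables are hypothetical and are not the bestNormalize code or the autotrader data.

```r
# Hypothetical, simplified ordered-quantile transformation
orq <- function(x) qnorm((rank(x) - 0.5) / length(x))

set.seed(7)
cont <- rexp(1000)                               # continuous, no ties
disc <- sample(2000:2016, 1000, replace = TRUE)  # only 17 possible values

p_cont <- shapiro.test(orq(cont))$p.value  # ranks map to exact normal quantiles
p_disc <- shapiro.test(orq(disc))$p.value  # tied values stay tied after transforming

c(continuous = p_cont, discrete = p_disc)
length(unique(orq(disc)))                  # still at most 17 distinct values
```

Tied observations share the same rank and therefore the same transformed value, so a variable with few distinct levels remains discrete. Treating such a variable as categorical or ordinal is usually more appropriate than trying to normalize it.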