First Presentation

Assess the normality of the Tasas dataset, which contains 5 variables (TN: birth rate, TM: death rate, EV: life expectancy, EVM: female life expectancy, and EVH: male life expectancy), measured in 194 countries around the world.

library(MVN)

## Warning: package 'MVN' was built under R version 4.3.3

Data <- read.csv("Tasas.csv")
head(Data)

##               Paises    TN    TM    EV   EVM   EVH
## 1        Afganistan  35.84  7.34 61.98 65.28 58.92
## 2           Albania   8.90  8.60 75.50 77.70 73.60
## 3          Alemania   8.80 12.70 80.70 83.30 78.40
## 4           Andorra   6.20  4.60 83.70 86.00 81.30
## 5            Angola  38.81  8.01 61.64 64.31 59.03
## 6 Antigua y Barbuda  12.12  6.37 78.50 80.94 75.78

### 1. Mardia 
Mardia <- mvn(Data[,-1], mvnTest = "mardia")
Mardia$multivariateNormality

##              Test        Statistic              p value Result
## 1 Mardia Skewness 201.323801536089 2.98182323473629e-25     NO
## 2 Mardia Kurtosis 4.20204709407368 2.64512086503021e-05     NO
## 3             MVN             <NA>                 <NA>     NO

### 2. Henze-Zirkler
HZ <- mvn(Data[,-1], mvnTest = "hz")
HZ$multivariateNormality

##            Test       HZ p value MVN
## 1 Henze-Zirkler 2.294973       0  NO

### 3. Royston
Royston <- mvn(Data[,-1], mvnTest = "royston")
Royston$multivariateNormality

##      Test        H      p value MVN
## 1 Royston 25.55369 1.527186e-06  NO

### 4. Doornik-Hansen
DH <- mvn(Data[,-1], mvnTest = "dh")
DH$multivariateNormality

##             Test        E df     p value MVN
## 1 Doornik-Hansen 239.7374 10 7.77862e-46  NO

### 5. Energy
Energy <- mvn(Data[,-1], mvnTest = "energy")
Energy$multivariateNormality

##          Test Statistic p value MVN
## 1 E-statistic  3.151856       0  NO

All five multivariate normality tests (Mardia, Henze-Zirkler, Royston, Doornik-Hansen, and Energy) reject multivariate normality for these data, each with a p-value below 0.001.
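
To make explicit what the first of these tests measures, here is a minimal base-R sketch of Mardia's skewness and kurtosis statistics. It is not the MVN implementation: the function name `mardia_sketch` and the simulated matrix `sim` are hypothetical, and the sketch assumes the maximum-likelihood covariance (divisor n) with no small-sample correction.

```r
# Sketch of Mardia's multivariate skewness and kurtosis tests (base R only).
# Assumption: ML covariance estimate (divide by n), no small-sample correction.
mardia_sketch <- function(X) {
  X  <- as.matrix(X)
  n  <- nrow(X)
  p  <- ncol(X)
  Xc <- scale(X, center = TRUE, scale = FALSE)  # centre each column
  S  <- crossprod(Xc) / n                       # ML covariance matrix
  G  <- Xc %*% solve(S) %*% t(Xc)               # g_ij = (x_i - m)' S^-1 (x_j - m)

  b1 <- sum(G^3) / n^2                          # multivariate skewness
  b2 <- sum(diag(G)^2) / n                      # multivariate kurtosis

  skew_stat <- n * b1 / 6                       # approx. chi-square under normality
  skew_df   <- p * (p + 1) * (p + 2) / 6
  kurt_stat <- (b2 - p * (p + 2)) / sqrt(8 * p * (p + 2) / n)  # approx. N(0, 1)

  list(
    skew_stat = skew_stat,
    skew_df   = skew_df,
    skew_p    = pchisq(skew_stat, df = skew_df, lower.tail = FALSE),
    kurt_stat = kurt_stat,
    kurt_p    = 2 * pnorm(abs(kurt_stat), lower.tail = FALSE)
  )
}

# Hypothetical stand-in for Data[, -1]: 194 rows, 5 independent normal columns
set.seed(1)
sim <- matrix(rnorm(194 * 5), ncol = 5)
res <- mardia_sketch(sim)
res
```

With five variables the skewness statistic is compared against a chi-square distribution with 35 degrees of freedom, which helps put the reported value of 201.32 in context.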

library(GGally)

## Warning: package 'GGally' was built under R version 4.3.3

## Loading required package: ggplot2

## Warning: package 'ggplot2' was built under R version 4.3.3

## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2

ggpairs(Data[,-1])

This scatterplot matrix suggests that the variables are not normally distributed.
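
A complementary graphical check for multivariate normality is a chi-square Q-Q plot of squared Mahalanobis distances. The sketch below uses only base R; the matrix `sim` is a hypothetical stand-in for `Data[,-1]`, and the correlation `qq_cor` is just an informal summary of how straight the plot is, not a formal test.

```r
# Chi-square Q-Q check of squared Mahalanobis distances (base R only).
set.seed(2)
sim <- matrix(rnorm(194 * 5), ncol = 5)  # hypothetical stand-in for Data[, -1]

# Squared distance of every row from the centroid
d2 <- mahalanobis(sim, center = colMeans(sim), cov = cov(sim))

# Under multivariate normality, d2 is approximately chi-square with p df
theo <- qchisq(ppoints(nrow(sim)), df = ncol(sim))

# Informal summary: correlation between ordered distances and quantiles
qq_cor <- cor(sort(d2), theo)

# Draw the plot only in an interactive session
if (interactive()) {
  plot(theo, sort(d2),
       xlab = "Chi-square quantiles",
       ylab = "Ordered squared Mahalanobis distances")
  abline(0, 1)
}
```

Points that bend away from the reference line, especially in the upper tail, point to a departure from multivariate normality.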

### 1. Shapiro-Wilk
SW <- mvn(Data[,-1], univariateTest = "SW", desc = TRUE)
SW

## $multivariateNormality
##            Test       HZ p value MVN
## 1 Henze-Zirkler 2.294973       0  NO
## 
## $univariateNormality
##           Test  Variable Statistic   p value Normality
## 1 Shapiro-Wilk    TN        0.9233  <0.001      NO    
## 2 Shapiro-Wilk    TM        0.9589  <0.001      NO    
## 3 Shapiro-Wilk    EV        0.9776  0.0034      NO    
## 4 Shapiro-Wilk    EVM       0.9707   4e-04      NO    
## 5 Shapiro-Wilk    EVH       0.9804   0.008      NO    
## 
## $Descriptives
##       n     Mean  Std.Dev Median   Min   Max    25th    75th        Skew
## TN  194 19.06469 9.872522 16.885  5.00 45.29 10.4250 27.2400  0.65598906
## TM  194  8.57701 3.074472  8.090  1.31 18.40  6.6150  9.9975  0.72290852
## EV  194 71.31052 7.764204 71.860 52.53 85.60 65.6700 76.8050 -0.23215863
## EVM 194 74.01974 7.876189 75.165 53.07 87.90 68.4375 79.5975 -0.39876259
## EVH 194 68.73345 7.767232 68.700 50.37 84.10 63.0850 73.6750 -0.07285581
##       Kurtosis
## TN  -0.6175703
## TM   0.6721286
## EV  -0.6532162
## EVM -0.5952345
## EVH -0.6863484

### 3. Lilliefors (Kolmogorov-Smirnov correction)
L <- mvn(Data[,-1], univariateTest = "Lillie", desc = TRUE)
L$univariateNormality

##                              Test  Variable Statistic   p value Normality
## 1 Lilliefors (Kolmogorov-Smirnov)    TN        0.1290  <0.001      NO    
## 2 Lilliefors (Kolmogorov-Smirnov)    TM        0.0859  0.0014      NO    
## 3 Lilliefors (Kolmogorov-Smirnov)    EV        0.0576  0.1199      YES   
## 4 Lilliefors (Kolmogorov-Smirnov)    EVM       0.0745  0.0107      NO    
## 5 Lilliefors (Kolmogorov-Smirnov)    EVH       0.0613  0.0728      YES

### 4. Shapiro-Francia
SF <- mvn(Data[,-1], univariateTest = "SF", desc = TRUE)
SF$univariateNormality

##              Test  Variable Statistic   p value Normality
## 1 Shapiro-Francia    TN        0.9267  <0.001      NO    
## 2 Shapiro-Francia    TM        0.9588   1e-04      NO    
## 3 Shapiro-Francia    EV        0.9809  0.0112      NO    
## 4 Shapiro-Francia    EVM       0.9736  0.0015      NO    
## 5 Shapiro-Francia    EVH       0.9835  0.0237      NO

### 5. Anderson-Darling
AD <- mvn(Data[,-1], univariateTest = "AD", desc = TRUE)
AD$univariateNormality

##               Test  Variable Statistic   p value Normality
## 1 Anderson-Darling    TN        4.9804  <0.001      NO    
## 2 Anderson-Darling    TM        2.6962  <0.001      NO    
## 3 Anderson-Darling    EV        0.9374  0.0172      NO    
## 4 Anderson-Darling    EVM       1.5290   6e-04      NO    
## 5 Anderson-Darling    EVH       0.8141  0.0347      NO

The univariate normality tests agree with the multivariate results. Shapiro-Wilk, Shapiro-Francia, and Anderson-Darling reject normality for all five variables at the 5% level; only Lilliefors fails to reject it, and only for EV (p = 0.1199) and EVH (p = 0.0728). Lilliefors is a Kolmogorov-Smirnov test adjusted for a mean and standard deviation estimated from the sample, and it is generally less powerful than the other three tests, so we do not treat its result as conclusive and rely on the others. Using the data from the 194 countries, we conclude that the birth rate, death rate, life expectancy, female life expectancy, and male life expectancy are not normally distributed.
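
To illustrate what the Lilliefors test computes, the sketch below evaluates the Kolmogorov-Smirnov distance with the mean and standard deviation estimated from the sample, and obtains a p-value by simulating the null distribution. This Monte Carlo p-value is a swapped-in technique, not the approximation used by MVN or nortest; the names `lillie_stat` and `lillie_mc` and the simulated sample are hypothetical.

```r
# Lilliefors-type statistic with a Monte Carlo p-value (base R only).
lillie_stat <- function(x) {
  x  <- sort(x)
  n  <- length(x)
  Fz <- pnorm((x - mean(x)) / sd(x))  # fitted normal CDF at each ordered point
  # Largest gap between the empirical CDF and the fitted CDF
  max(seq_len(n) / n - Fz, Fz - (seq_len(n) - 1) / n)
}

lillie_mc <- function(x, B = 2000) {
  d_obs  <- lillie_stat(x)
  # Null distribution: same statistic on normal samples of the same size
  d_null <- replicate(B, lillie_stat(rnorm(length(x))))
  list(statistic = d_obs, p.value = mean(d_null >= d_obs))
}

# Hypothetical right-skewed sample with the same size as the Tasas data
set.seed(123)
x_skewed <- rexp(194)

res_l  <- lillie_mc(x_skewed)
res_sw <- shapiro.test(x_skewed)  # Shapiro-Wilk on the same sample, for comparison
res_l
res_sw
```

Running both tests on the same sample makes it easy to compare how strongly each one reacts to a given departure from normality.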

Second Presentation

library(gapminder)

## Warning: package 'gapminder' was built under R version 4.3.3

library(devtools)

## Warning: package 'devtools' was built under R version 4.3.3

## Loading required package: usethis

## Warning: package 'usethis' was built under R version 4.3.3

devtools::install_github("jennybc/gapminder")

## Warning: package 'gapminder' is in use and will not be installed

Load the data

gapminder <- gapminder::gapminder

Population data

For the following exercises I use the data for the year 1952.

Data <- data.frame(
  pob = gapminder$pop[gapminder$year == 1952],
  lf = gapminder$lifeExp[gapminder$year == 1952],
  gdp = gapminder$gdpPercap[gapminder$year == 1952])

The histograms below suggest that the variables are not normally distributed.

par(mfrow=c(1,3))
hist(Data[,1])
hist(Data[,2])
hist(Data[,3])

Analytical tests

SW1 <- shapiro.test(Data$pob)
SW1

## 
##  Shapiro-Wilk normality test
## 
## data:  Data$pob
## W = 0.2562, p-value < 2.2e-16

SW2 <- shapiro.test(Data$lf)
SW2

## 
##  Shapiro-Wilk normality test
## 
## data:  Data$lf
## W = 0.93078, p-value = 2.014e-06

SW3 <- shapiro.test(Data$gdp)
SW3

## 
##  Shapiro-Wilk normality test
## 
## data:  Data$gdp
## W = 0.24778, p-value < 2.2e-16

The p-values of all three tests indicate that these variables are far from normal, so we proceed to transform them.

Transformations

We use the bestNormalize package to choose the transformation.

library(bestNormalize)

## Warning: package 'bestNormalize' was built under R version 4.3.3

best_trans1 <- bestNormalize(Data$pob)
best_trans1

## Best Normalizing transformation with 142 Observations
##  Estimated Normality Statistics (Pearson P / df, lower => more normal):
##  - arcsinh(x): 1.291
##  - Box-Cox: 1.3992
##  - Center+scale: 7.4903
##  - Double Reversed Log_b(x+a): 6.6969
##  - Log_b(x+a): 1.291
##  - orderNorm (ORQ): 1.3928
##  - sqrt(x + a): 3.4236
##  - Yeo-Johnson: 1.3992
## Estimation method: Out-of-sample via CV with 10 folds and 5 repeats
##  
## Based off these, bestNormalize chose:
## Standardized asinh(x) Transformation with 142 nonmissing obs.:
##  Relevant statistics:
##  - mean (before standardization) = 15.89392 
##  - sd (before standardization) = 1.627995

best_trans2 <- bestNormalize(Data$lf)
best_trans2

## Best Normalizing transformation with 142 Observations
##  Estimated Normality Statistics (Pearson P / df, lower => more normal):
##  - arcsinh(x): 1.6796
##  - Box-Cox: 1.6705
##  - Center+scale: 1.7055
##  - Double Reversed Log_b(x+a): 3.2328
##  - Exp(x): 18.429
##  - Log_b(x+a): 1.6796
##  - orderNorm (ORQ): 1.2175
##  - sqrt(x + a): 1.6682
##  - Yeo-Johnson: 1.6705
## Estimation method: Out-of-sample via CV with 10 folds and 5 repeats
##  
## Based off these, bestNormalize chose:
## orderNorm Transformation with 142 nonmissing obs and no ties 
##  - Original quantiles:
##     0%    25%    50%    75%   100% 
## 28.801 39.059 45.136 59.765 72.670

best_trans3 <- bestNormalize(Data$gdp)
best_trans3

## Best Normalizing transformation with 142 Observations
##  Estimated Normality Statistics (Pearson P / df, lower => more normal):
##  - arcsinh(x): 1.1867
##  - Box-Cox: 1.2735
##  - Center+scale: 4.6884
##  - Double Reversed Log_b(x+a): 3.28
##  - Log_b(x+a): 1.1867
##  - orderNorm (ORQ): 1.5227
##  - sqrt(x + a): 1.8354
##  - Yeo-Johnson: 1.2735
## Estimation method: Out-of-sample via CV with 10 folds and 5 repeats
##  
## Based off these, bestNormalize chose:
## Standardized asinh(x) Transformation with 142 nonmissing obs.:
##  Relevant statistics:
##  - mean (before standardization) = 8.270125 
##  - sd (before standardization) = 1.035393

The bestNormalize output indicates that the best transformation for each variable is the following:

Variable 1 pob: arcsinh transformation
Variable 2 lf: orderNorm transformation
Variable 3 gdp: arcsinh transformation
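
The standardized arcsinh transformation reported in the bestNormalize output for pob and gdp can be sketched by hand in base R. The sample `x` below is hypothetical (log-normal values of roughly population size), not the gapminder data. The sketch also illustrates why arcsinh(x) and Log_b(x+a) receive the same score in that output: for large x, asinh(x) is very close to log(2x), so after standardization both give essentially the same values.

```r
# Standardized arcsinh transformation by hand (base R only).
set.seed(3)
x <- rlnorm(142, meanlog = 15, sdlog = 1.6)  # hypothetical population-sized values

a <- asinh(x)                  # arcsinh transform
z <- (a - mean(a)) / sd(a)     # centre and scale

# For large x, asinh(x) is approximately log(2 * x)
max_gap <- max(abs(a - log(2 * x)))

# A standardized logarithm therefore gives almost identical values
l  <- log10(x)
zl <- (l - mean(l)) / sd(l)
max_diff <- max(abs(z - zl))

c(mean_before = mean(a), sd_before = sd(a), max_gap = max_gap, max_diff = max_diff)
```

The values `mean(a)` and `sd(a)` play the role of the "mean (before standardization)" and "sd (before standardization)" lines in the output.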

We now test the transformed data for normality.

library(MVN)
SW1_trans <- shapiro.test(best_trans1$x.t)
SW1_trans

## 
##  Shapiro-Wilk normality test
## 
## data:  best_trans1$x.t
## W = 0.99222, p-value = 0.6295

SW2_trans <- shapiro.test(best_trans2$x.t)
SW2_trans

## 
##  Shapiro-Wilk normality test
## 
## data:  best_trans2$x.t
## W = 0.99968, p-value = 1

SW3_trans <- shapiro.test(best_trans3$x.t)
SW3_trans

## 
##  Shapiro-Wilk normality test
## 
## data:  best_trans3$x.t
## W = 0.97604, p-value = 0.01348

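Conclusion: After the transformation, Shapiro-Wilk no longer rejects normality for pob (p = 0.6295) or lf (p = 1). For gdp the p-value rises to 0.01348, which still rejects normality at the 5% level, although its W statistic (0.97604) is far closer to 1 than before the transformation (0.24778). The p-value of 1 for lf is expected, because orderNorm maps the sample ranks onto normal quantiles.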
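
The rank-based idea behind orderNorm can be sketched in a few lines of base R. This is an illustration under stated assumptions, not the bestNormalize code: the function name `order_norm_sketch`, the `(rank - 0.5) / n` plotting position, and the simulated sample are assumptions of the sketch.

```r
# Rank-based normal-scores transformation, sketched by hand (base R only).
order_norm_sketch <- function(x) {
  # Map each observation's rank to the matching standard normal quantile
  qnorm((rank(x) - 0.5) / length(x))
}

# Hypothetical right-skewed sample without ties, same size as the 1952 data
set.seed(4)
x <- rexp(142)
z <- order_norm_sketch(x)

sw_before <- shapiro.test(x)  # raw, skewed data
sw_after  <- shapiro.test(z)  # after the rank-based transformation
sw_before
sw_after
```

Because the transformed values depend only on the ranks, any tie-free sample of the same size yields the same set of transformed values, so a near-perfect Shapiro-Wilk result on the transformed sample says little about the original distribution.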

EXERCISE 2

We have the following data:

library(bestNormalize)
data("autotrader")
names(autotrader)

##  [1] "Car_Info" "Link"     "Make"     "Year"     "Location" "Radius"  
##  [7] "price"    "mileage"  "status"   "model"

Data3 <- data.frame(
  kilom = autotrader$mileage,
  ant = autotrader$Year,
  precio = autotrader$price)

Analytical test

library(nortest)
AD_kilom <- ad.test(Data3$kilom)
AD_kilom

## 
##  Anderson-Darling normality test
## 
## data:  Data3$kilom
## A = 271.27, p-value < 2.2e-16

AD_ant <- ad.test(Data3$ant)
AD_ant

## 
##  Anderson-Darling normality test
## 
## data:  Data3$ant
## A = 399.77, p-value < 2.2e-16

AD_price <- ad.test(Data3$precio)
AD_price

## 
##  Anderson-Darling normality test
## 
## data:  Data3$precio
## A = 49.948, p-value < 2.2e-16

The p-values are extremely small (all below 2.2e-16), far below 0.05, which indicates that the variables are not normally distributed.
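
For reference, the Anderson-Darling statistic A reported above can be computed by hand. The sketch below is base R only and covers the statistic, not the p-value; the function name `ad_stat_sketch` and both simulated samples are hypothetical.

```r
# Anderson-Darling statistic for normality, computed by hand (base R only).
ad_stat_sketch <- function(x) {
  x <- sort(x)
  n <- length(x)
  z <- (x - mean(x)) / sd(x)        # standardize with the sample mean and sd
  log_p  <- pnorm(z, log.p = TRUE)  # log of the fitted normal CDF
  log_1p <- pnorm(-z, log.p = TRUE) # log of (1 - CDF), numerically stable
  i <- seq_len(n)
  -n - mean((2 * i - 1) * (log_p + rev(log_1p)))
}

set.seed(6)
A_norm <- ad_stat_sketch(rnorm(500))  # hypothetical normal sample
A_skew <- ad_stat_sketch(rexp(500))   # hypothetical right-skewed sample
c(A_norm = A_norm, A_skew = A_skew)
```

The statistic weights the tails heavily, so skewed samples give values far above those of normal samples. With 6283 observations, as in autotrader, the test also has enough power to flag fairly modest departures.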

Transformation

We now transform the data to bring them closer to normality.

library(bestNormalize)
best_transk <- bestNormalize(Data3$kilom)

## Warning: `progress_estimated()` was deprecated in dplyr 1.0.0.
## ℹ The deprecated feature was likely used in the bestNormalize package.
##   Please report the issue to the authors.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

best_transk

## Best Normalizing transformation with 6283 Observations
##  Estimated Normality Statistics (Pearson P / df, lower => more normal):
##  - arcsinh(x): 3.3915
##  - Box-Cox: 3.0323
##  - Center+scale: 14.7862
##  - Double Reversed Log_b(x+a): 23.0862
##  - Log_b(x+a): 3.3933
##  - orderNorm (ORQ): 1.0801
##  - sqrt(x + a): 5.0822
##  - Yeo-Johnson: 3.032
## Estimation method: Out-of-sample via CV with 10 folds and 5 repeats
##  
## Based off these, bestNormalize chose:
## orderNorm Transformation with 6283 nonmissing obs and ties
##  - 6077 unique values 
##  - Original quantiles:
##     0%    25%    50%    75%   100% 
##      2  29099  44800  88950 325556

best_transa <- bestNormalize(Data3$ant)

best_transa

## Best Normalizing transformation with 6283 Observations
##  Estimated Normality Statistics (Pearson P / df, lower => more normal):
##  - arcsinh(x): 83.5038
##  - Box-Cox: 83.5196
##  - Center+scale: 83.5038
##  - Double Reversed Log_b(x+a): 83.2412
##  - Log_b(x+a): 83.5038
##  - orderNorm (ORQ): 81.448
##  - sqrt(x + a): 83.5038
##  - Yeo-Johnson: 83.5525
## Estimation method: Out-of-sample via CV with 10 folds and 5 repeats
##  
## Based off these, bestNormalize chose:
## orderNorm Transformation with 6283 nonmissing obs and ties
##  - 17 unique values 
##  - Original quantiles:
##   0%  25%  50%  75% 100% 
## 2000 2010 2013 2014 2016

best_transp <- bestNormalize(Data3$precio)

best_transp

## Best Normalizing transformation with 6283 Observations
##  Estimated Normality Statistics (Pearson P / df, lower => more normal):
##  - arcsinh(x): 3.9171
##  - Box-Cox: 2.1633
##  - Center+scale: 3.3967
##  - Double Reversed Log_b(x+a): 6.4884
##  - Log_b(x+a): 3.9171
##  - orderNorm (ORQ): 1.0466
##  - sqrt(x + a): 2.1873
##  - Yeo-Johnson: 2.1633
## Estimation method: Out-of-sample via CV with 10 folds and 5 repeats
##  
## Based off these, bestNormalize chose:
## orderNorm Transformation with 6283 nonmissing obs and ties
##  - 2465 unique values 
##  - Original quantiles:
##    0%   25%   50%   75%  100% 
##   722 11499 15998 21497 64998

The bestNormalize output indicates that the best transformation for all three variables is orderNorm.

Finally, we check whether the transformed data are normal:

trans_kilom <- ad.test(best_transk$x.t)
trans_kilom

## 
##  Anderson-Darling normality test
## 
## data:  best_transk$x.t
## A = 0.00032915, p-value = 1

trans_ant <- ad.test(best_transa$x.t)
trans_ant

## 
##  Anderson-Darling normality test
## 
## data:  best_transa$x.t
## A = 98.365, p-value < 2.2e-16

trans_precio <- ad.test(best_transp$x.t)
trans_precio

## 
##  Anderson-Darling normality test
## 
## data:  best_transp$x.t
## A = 0.01806, p-value = 1

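We conclude that the p-value is 1 for kilom and precio, so the Anderson-Darling test does not reject normality for these two transformed variables. However, the variable ant (Year) could not be normalized (A = 98.365, p-value < 2.2e-16): it takes only 17 distinct values, and with so many ties no rank-preserving transformation can produce a normal distribution.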
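
The effect of ties can be reproduced on simulated data. In this sketch, `order_norm_sketch` is a hypothetical rank-based transformation (an assumption, not the bestNormalize code), `year_like` is a made-up discrete variable with at most 17 distinct values, and Shapiro-Wilk from base R is used in place of `ad.test` so that no extra package is needed.

```r
# Why a heavily tied variable cannot be normalized by a rank transformation.
order_norm_sketch <- function(x) {
  # Tied observations share an average rank, so they share a transformed value
  qnorm((rank(x) - 0.5) / length(x))
}

# Hypothetical year-like variable: few distinct values, later years more common
set.seed(5)
year_like <- sample(2000:2016, size = 2000, replace = TRUE, prob = 1:17)

z <- order_norm_sketch(year_like)

n_before <- length(unique(year_like))  # distinct values before
n_after  <- length(unique(z))          # distinct values after
sw <- shapiro.test(z)                  # swapped in for ad.test

c(n_before = n_before, n_after = n_after, p_value = sw$p.value)
```

The transformation moves the distinct values to new positions but cannot create new ones, so the transformed variable remains a discrete distribution concentrated on a handful of points.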

Assignment 5

ALEXANDRA ARROYO

2024-10-10