This assignment explores the mtcars dataset.

First, we load the libraries.

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(tidyquant)
## Registered S3 method overwritten by 'quantmod':
##   method            from
##   as.zoo.data.frame zoo 
## ── Attaching core tidyquant packages ─────────────────────── tidyquant 1.0.10 ──
## ✔ PerformanceAnalytics 2.0.8      ✔ TTR                  0.24.4
## ✔ quantmod             0.4.26     ✔ xts                  0.14.1
## ── Conflicts ────────────────────────────────────────── tidyquant_conflicts() ──
## ✖ zoo::as.Date()                 masks base::as.Date()
## ✖ zoo::as.Date.numeric()         masks base::as.Date.numeric()
## ✖ dplyr::filter()                masks stats::filter()
## ✖ xts::first()                   masks dplyr::first()
## ✖ dplyr::lag()                   masks stats::lag()
## ✖ xts::last()                    masks dplyr::last()
## ✖ PerformanceAnalytics::legend() masks graphics::legend()
## ✖ quantmod::summary()            masks base::summary()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggdist)
library(ggthemes)

Next, we load the dataset.

coches<-as.data.frame(mtcars)

Tasks

  1. Draw a histogram of every variable
# Histograms
for (col in names(coches)) {
  # hist() uses base graphics, so ggplot2 themes such as
  # theme_fivethirtyeight() have no effect here and are omitted.
  hist(coches[[col]],
       main = sprintf('Histogram of %s', col),
       xlab = col)
}
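
To compare the distributions at a glance, the eleven histograms can also be drawn on a single page. The following is a minimal base-R sketch, assuming the unmodified `mtcars` data; the 3 x 4 grid, the margin values and the name `hist_data` are illustrative choices, not part of the original analysis.

```r
# Sketch: all histograms on one page using base graphics.
coches <- as.data.frame(mtcars)

# Switch to a 3 x 4 grid; par() returns the previous settings,
# which are kept so the layout can be restored afterwards.
old_par <- par(mfrow = c(3, 4), mar = c(4, 4, 2, 1))

# hist() returns its bin data (class "histogram"); keep it per column.
hist_data <- lapply(names(coches), function(col) {
  hist(coches[[col]],
       main = sprintf("Histogram of %s", col),
       xlab = col)
})
names(hist_data) <- names(coches)

par(old_par)  # restore the previous layout
```

Each element of `hist_data` holds the breaks and counts of one histogram, for example `hist_data$mpg$counts`.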

  2. Draw a raincloud plot of every variable
# Raincloud plot: half-violin plot, boxplot and dots
for (col in names(coches)) {
  print(
    # .data[[col]] looks the column up by name inside aes()
    ggplot(coches, aes(x = .data[[col]])) +
      # Density (half violin)
      stat_halfeye(adjust = 0.5,
                   width = .6,
                   justification = -0.2,
                   .width = 0.95,
                   point_colour = NA) +
      # Boxplot
      geom_boxplot(width = .2,
                   outlier.shape = NA,
                   alpha = 0.5) +
      # Raw observations as dots
      stat_dots(side = "left",
                justification = 1.1,
                binwidth = NA,
                dotsize = 0.1) +
      labs(
        title = sprintf("Raincloud %s", col),
        # The variable is mapped to x, so its name labels the x axis
        x = col,
        y = "") +
      theme_fivethirtyeight()
  )
}

  3. Draw a heatmap of the dataset*

*In the next assignment this is replaced by a heatmap of the correlations between the variables.

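x <- as.matrix(coches)
# Colv = NA keeps the column order; rows are still clustered.
# heatmap() scales by row unless scale is set explicitly.
heatmap(x, Colv = NA, col = cm.colors(256),
        #scale = "column",
        margins = c(5, 10),
        xlab = "Specification variables", ylab = "Car models",
        main = "Heatmap of mtcars data")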
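
The footnote above refers to a heatmap of correlations rather than of raw values. A minimal sketch of that idea with base R's `heatmap()` follows, assuming Pearson correlation and no dendrogram reordering. The name `cor_mat` and the plot title are hypothetical, and the later assignment may approach this differently.

```r
coches <- as.data.frame(mtcars)

# Pearson correlation between every pair of columns (11 x 11 matrix).
cor_mat <- cor(coches)

# Rowv = NA and Colv = NA keep the original variable order;
# scale = "none" plots the correlations unchanged;
# symm = TRUE tells heatmap() that the matrix is symmetric.
heatmap(cor_mat,
        Rowv = NA, Colv = NA,
        scale = "none", symm = TRUE,
        col = cm.colors(256),
        margins = c(5, 5),
        main = "Correlation heatmap of mtcars")
```

Because correlations already share the range [-1, 1], no row or column scaling is needed here, unlike the raw-value heatmap.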

  4. Which variables are normally distributed, and why? Which test should be used to check this?

We use the Shapiro-Wilk test.

It is most appropriate for small samples, such as the 32 rows of mtcars, but it can also be used for larger ones. If the p-value of the test is greater than α = 0.05, normality is not rejected and the data are treated as approximately normal.

P-Value

The p-value is a probability that measures the evidence against the null hypothesis. For the Shapiro-Wilk test, the null hypothesis (H0) is that the data come from a normal distribution. A smaller p-value provides stronger evidence against the null hypothesis.

P-value ≤ α: The data do not follow a normal distribution (Reject H0). If the p-value is less than or equal to the significance level, reject the null hypothesis and conclude that the data do not follow a normal distribution.

P-value > α: You cannot conclude that the data do not follow a normal distribution (Fail to reject H0). If the p-value is larger than the significance level, fail to reject the null hypothesis: there is not enough evidence to conclude that the data do not follow a normal distribution.

for (col in names(coches)) {
  # Shapiro-Wilk test: H0 is that the column is normally distributed
  test <- shapiro.test(coches[[col]])
  p_value <- test$p.value
  print(sprintf("Shapiro-Wilk test for column %s: p-value = %.5f", col, p_value))
  # Normal Q-Q plot as a visual check; the sample values are on the y axis
  qqnorm(coches[[col]],
         main = sprintf('Q-Q plot of %s', col),
         ylab = col, col = 'blue')
  qqline(coches[[col]], col = 'red')
}
## [1] "Shapiro-Wilk test for column mpg: p-value = 0.12288"

## [1] "Shapiro-Wilk test for column cyl: p-value = 0.00001"

## [1] "Shapiro-Wilk test for column disp: p-value = 0.02081"

## [1] "Shapiro-Wilk test for column hp: p-value = 0.04881"

## [1] "Shapiro-Wilk test for column drat: p-value = 0.11006"

## [1] "Shapiro-Wilk test for column wt: p-value = 0.09265"

## [1] "Shapiro-Wilk test for column qsec: p-value = 0.59352"

## [1] "Shapiro-Wilk test for column vs: p-value = 0.00000"

## [1] "Shapiro-Wilk test for column am: p-value = 0.00000"

## [1] "Shapiro-Wilk test for column gear: p-value = 0.00001"

## [1] "Shapiro-Wilk test for column carb: p-value = 0.00044"
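
The decision rule described above can be applied to all columns at once to answer the question directly. This is a minimal sketch, assuming α = 0.05 as in the text; the names `alpha`, `p_values` and `not_rejected` are introduced here for illustration only.

```r
coches <- as.data.frame(mtcars)
alpha <- 0.05  # significance level (assumed, as in the text)

# Shapiro-Wilk p-value for every column, as a named numeric vector.
p_values <- sapply(coches, function(x) shapiro.test(x)$p.value)

# Columns for which normality is NOT rejected at level alpha.
not_rejected <- names(p_values)[p_values > alpha]
print(not_rejected)
```

With the p-values printed above, this selects mpg, drat, wt and qsec. hp (p = 0.04881) falls just below the threshold, so the Q-Q plot is worth checking alongside the test. cyl, vs, am and gear take only two or three distinct values and carb only six, which is probably why normality is rejected so clearly for them.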

  5. Compute the correlations between the "hp" column and the other variables

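# Pearson correlation of hp with each column, in column order
# (mpg, cyl, disp, hp, ...); the fourth value is hp with itself.
cor_hp <- numeric(ncol(coches))
for (i in seq_along(coches)) {
  cor_hp[i] <- cor(coches$hp, coches[[i]])
}
print(cor_hp)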
##  [1] -0.7761684  0.8324475  0.7909486  1.0000000 -0.4487591  0.6587479
##  [7] -0.7082234 -0.7230967 -0.2432043 -0.1257043  0.7498125
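
The unnamed vector above is hard to read without counting columns. A minimal sketch of a loop-free alternative that keeps the variable names and ranks them by strength follows; `cor_hp_sorted` is a name chosen here for illustration.

```r
coches <- as.data.frame(mtcars)

# cor() accepts a vector and a data frame and returns a 1 x 11 matrix;
# taking the first row gives a numeric vector named by column.
cor_hp <- cor(coches$hp, coches)[1, ]

# Leave out hp's correlation with itself, then sort by absolute strength.
cor_hp_sorted <- cor_hp[names(cor_hp) != "hp"]
cor_hp_sorted <- cor_hp_sorted[order(abs(cor_hp_sorted), decreasing = TRUE)]
print(round(cor_hp_sorted, 3))
```

Matching the values printed above to the column order, hp is most strongly correlated with cyl (0.832) and least with gear (-0.126).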