La librería DataExplorer es la más conocida para el análisis exploratorio. Es muy simmple de usar y muy poderosa, pues ofrece como salida un informe con mucha información.
La función para crear el informe es create_report, y para ver cada gráfica de forma individual, las funciones son:
# install.packages("DataExplorer")
library(DataExplorer)
# install.packages("nycflights13")
library(nycflights13)
El paquete nycflights13 contiene información sobre todos los vuelos que partieron desde Nueva York (EWR, JFK y LGA) a destinos en los Estados Unidos en 2013. Fueron 336,776 vuelos en total.
Las tablas de este paquete y sus relaciones son las siguientes:
flights <- flights
weather <- weather
planes <- planes
airports <- airports
airlines <- airlines
df <- merge(flights, airlines, by = "carrier")
df <- merge(df, planes, by = "tailnum",
na.rm=TRUE)
create_report(df)
##
##
## processing file: report.rmd
##
|
| | 0%
|
|. | 2%
|
|.. | 5% [global_options]
|
|... | 7%
|
|.... | 10% [introduce]
|
|.... | 12%
|
|..... | 14% [plot_intro]
|
|...... | 17%
|
|....... | 19% [data_structure]
|
|........ | 21%
|
|......... | 24% [missing_profile]
|
|.......... | 26%
|
|........... | 29% [univariate_distribution_header]
|
|........... | 31%
|
|............ | 33% [plot_histogram]
|
|............. | 36%
|
|.............. | 38% [plot_density]
|
|............... | 40%
|
|................ | 43% [plot_frequency_bar]
|
|................. | 45%
|
|.................. | 48% [plot_response_bar]
|
|.................. | 50%
|
|................... | 52% [plot_with_bar]
|
|.................... | 55%
|
|..................... | 57% [plot_normal_qq]
|
|...................... | 60%
|
|....................... | 62% [plot_response_qq]
|
|........................ | 64%
|
|......................... | 67% [plot_by_qq]
|
|.......................... | 69%
|
|.......................... | 71% [correlation_analysis]
|
|........................... | 74%
|
|............................ | 76% [principal_component_analysis]
|
|............................. | 79%
|
|.............................. | 81% [bivariate_distribution_header]
|
|............................... | 83%
|
|................................ | 86% [plot_response_boxplot]
|
|................................. | 88%
|
|................................. | 90% [plot_by_boxplot]
|
|.................................. | 93%
|
|................................... | 95% [plot_response_scatterplot]
|
|.................................... | 98%
|
|.....................................| 100% [plot_by_scatterplot]
## output file: C:/Users/kathi/OneDrive/Escritorio/M2_IA con Impacto Empresarial/report.knit.md
## "C:/Program Files/RStudio/resources/app/bin/quarto/bin/tools/pandoc" +RTS -K512m -RTS "C:/Users/kathi/OneDrive/Escritorio/M2_IA con Impacto Empresarial/report.knit.md" --to html4 --from markdown+autolink_bare_uris+tex_math_single_backslash --output pandoca1477bb74f6.html --lua-filter "C:\Users\kathi\AppData\Local\R\win-library\4.3\rmarkdown\rmarkdown\lua\pagebreak.lua" --lua-filter "C:\Users\kathi\AppData\Local\R\win-library\4.3\rmarkdown\rmarkdown\lua\latex-div.lua" --embed-resources --standalone --variable bs3=TRUE --section-divs --table-of-contents --toc-depth 6 --template "C:\Users\kathi\AppData\Local\R\win-library\4.3\rmarkdown\rmd\h\default.html" --no-highlight --variable highlightjs=1 --variable theme=yeti --mathjax --variable "mathjax-url=https://mathjax.rstudio.com/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML" --include-in-header "C:\Users\kathi\AppData\Local\Temp\RtmpE5UQi8\rmarkdown-stra1445e81d7e.html"
##
## Output created: report.html
introduce(df)
## rows columns discrete_columns continuous_columns all_missing_columns
## 1 284170 28 10 18 0
## total_missing_values complete_rows total_observations memory_usage
## 1 311768 920 7956760 50225296
plot_intro(df)
plot_boxplot(df, by = "carrier")
## Warning: Removed 23255 rows containing non-finite outside the scale range
## (`stat_boxplot()`).
## Warning: Removed 288513 rows containing non-finite outside the scale range
## (`stat_boxplot()`).
plot_missing(df)
plot_histogram(df)
plot_bar(df)
## 4 columns ignored with more than 50 categories.
## tailnum: 3322 categories
## dest: 104 categories
## time_hour: 6934 categories
## model: 127 categories
plot_correlation(df)
## 5 features with more than 20 categories ignored!
## tailnum: 3322 categories
## dest: 104 categories
## time_hour: 6934 categories
## manufacturer: 35 categories
## model: 127 categories
## Warning in cor(x = structure(list(year.x = c(2013L, 2013L, 2013L, 2013L, : the
## standard deviation is zero