Theory

The DataExplorer package is among the best-known R packages for exploratory data analysis. It is simple to use yet powerful: a single function call produces a report covering the data structure, missing values, distributions and correlations.

The function that creates the report is create_report(). To produce each table or plot individually, use the following functions:

  • introduce()
  • plot_intro()
  • plot_boxplot()
  • plot_missing()
  • plot_histogram()
  • plot_bar()
  • plot_correlation()
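
As a quick illustration of calling these functions one at a time, the sketch below uses R's built-in airquality data frame, which is an assumption made for brevity and is not part of this tutorial's data. It shows only two of the functions listed above.

```r
library(DataExplorer)

# airquality ships with base R (datasets package); used here only as a small example
info <- introduce(airquality)   # one-row summary table of the data frame
info$rows                       # number of rows
info$columns                    # number of columns
info$total_missing_values       # total count of NA cells

# Chart of the share of missing values per column
plot_missing(airquality)
```

Each function takes the data frame as its first argument, so the same calls work on any other data frame.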

Install packages and load libraries

# Install each package once if it is not already available, then load it
# install.packages("DataExplorer")
library(DataExplorer)

# install.packages("nycflights13")
library(nycflights13)
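
Instead of commenting and uncommenting install.packages(), one possible approach is to check which packages are missing first. The helper name missing_pkgs below is hypothetical and exists only for this sketch. The install call is left commented out, as in the code above, so the snippet has no side effects.

```r
# Hypothetical helper: returns the names of packages that are not installed
missing_pkgs <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

to_install <- missing_pkgs(c("DataExplorer", "nycflights13"))

if (length(to_install) > 0) {
  message("Packages to install: ", paste(to_install, collapse = ", "))
  # install.packages(to_install)
}
```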

Context

The nycflights13 package contains information on all flights that departed from New York City (EWR, JFK and LGA) to destinations in the United States in 2013: 336,776 flights in total.
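
These figures can be checked directly against the flights table. This is a minimal sketch that assumes nycflights13 is installed.

```r
library(nycflights13)

n_flights <- nrow(flights)                 # one row per departing flight
origins   <- sort(unique(flights$origin))  # departure airport codes
years     <- unique(flights$year)          # year(s) covered by the table

n_flights   # 336776
origins     # "EWR" "JFK" "LGA"
years       # 2013
```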

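The tables in this package and their relationships are shown below:

[Figure: relational diagram of the nycflights13 tables (flights, weather, planes, airports, airlines)]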
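
The relationships can also be explored from the column names the tables share. The sketch below is one way to look for candidate join keys. A shared name is not always a key: year in planes is the year of manufacture, not the flight year.

```r
library(nycflights13)

# Column names that flights shares with each lookup table
key_airlines <- intersect(names(flights), names(airlines))
key_planes   <- intersect(names(flights), names(planes))

key_airlines  # "carrier"
key_planes    # "year" "tailnum" (only tailnum is a join key)

# airports shares no column name with flights; it is linked through the faa code
origin_in_airports <- all(flights$origin %in% airports$faa)
origin_in_airports
```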

Build the dataset

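# Copy the lazy-loaded nycflights13 datasets into the global environment
flights <- flights
weather <- weather
planes <- planes
airports <- airports
airlines <- airlines

# Attach the airline name by carrier, then the aircraft details by tailnum.
# merge() keeps only rows with a match in both tables.
df <- merge(flights, airlines, by = "carrier")
df <- merge(df, planes, by = "tailnum")

# Generate the full HTML report (report.html)
create_report(df)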
## processing file: report.rmd
## output file: /Users/david3/Desktop/report.knit.md
## Output created: report.html
introduce(df)
##     rows columns discrete_columns continuous_columns all_missing_columns
## 1 284170      28               10                 18                   0
##   total_missing_values complete_rows total_observations memory_usage
## 1               311768           920            7956760     50225296
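
The 284,170 rows reported by introduce(df) are fewer than the 336,776 flights in the package. A likely reason is that merge() by default (all = FALSE) keeps only rows whose key appears in both tables, so flights whose tailnum has no entry in planes are dropped. The toy data frames below are hypothetical stand-ins for the real tables; all.x = TRUE is the base R option that keeps unmatched rows from the left table.

```r
# Hypothetical stand-ins for flights and planes
toy_flights <- data.frame(
  flight  = c(1, 2, 3, 4),
  tailnum = c("N1", "N2", NA, "N9"),
  stringsAsFactors = FALSE
)
toy_planes <- data.frame(
  tailnum = c("N1", "N2"),
  seats   = c(150, 180),
  stringsAsFactors = FALSE
)

# Default merge: only tail numbers present in both tables survive
inner <- merge(toy_flights, toy_planes, by = "tailnum")

# all.x = TRUE keeps every flight and fills missing plane details with NA
left <- merge(toy_flights, toy_planes, by = "tailnum", all.x = TRUE)

nrow(inner)  # 2
nrow(left)   # 4
```

Whether to keep unmatched flights depends on the analysis; keeping them adds missing values to every column that comes from planes.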
plot_intro(df)

plot_boxplot(df, by="carrier")
## Warning: Removed 23255 rows containing non-finite outside the scale range
## (`stat_boxplot()`).

## Warning: Removed 288513 rows containing non-finite outside the scale range
## (`stat_boxplot()`).

plot_missing(df)
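
plot_missing() charts the share of missing values in each column. As a rough cross-check, the same percentages can be computed in base R. The toy data frame below is hypothetical and only illustrates the calculation.

```r
# Hypothetical toy data with some missing values
toy <- data.frame(
  dep_delay = c(5, NA, 12, NA),
  arr_delay = c(3, 7, NA, 1),
  carrier   = c("AA", "UA", "AA", "DL")
)

# Percentage of missing values per column, from most to least missing
pct_missing <- sort(colMeans(is.na(toy)) * 100, decreasing = TRUE)
pct_missing
```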

plot_histogram(df)

plot_bar(df)
## 4 columns ignored with more than 50 categories.
## tailnum: 3322 categories
## dest: 104 categories
## time_hour: 6934 categories
## model: 127 categories
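
The message above comes from the maxcat argument of plot_bar(), which defaults to 50: discrete columns with more categories are skipped. The sketch below counts distinct values in the character columns of the original flights table, so the counts can differ slightly from those of the merged df. The commented call shows how the threshold could be raised.

```r
library(nycflights13)

# Distinct values in each character column of flights
char_cols <- Filter(is.character, flights)
n_cat <- vapply(char_cols, function(x) length(unique(x)), integer(1))

n_cat               # all character columns
n_cat[n_cat > 50]   # columns plot_bar() would skip at the default threshold

# With df built as above, a higher threshold would include dest and model:
# plot_bar(df, maxcat = 150)
```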

plot_correlation(df)
## 5 features with more than 20 categories ignored!
## tailnum: 3322 categories
## dest: 104 categories
## time_hour: 6934 categories
## manufacturer: 35 categories
## model: 127 categories
## Warning in cor(x = structure(list(year.x = c(2013L, 2013L, 2013L, 2013L, : the
## standard deviation is zero
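
The "standard deviation is zero" warning is what cor() emits when a column never varies. One likely cause here is year.x: both flights and planes have a year column, so merge() renames them year.x and year.y, and the flight year is 2013 in every row. The toy vectors below are hypothetical and only reproduce the behaviour.

```r
# Hypothetical toy columns: year is constant, like year.x after the merge
year  <- rep(2013, 5)
delay <- c(4, -2, 15, 0, 7)

# cor() returns NA (with the warning shown above) when an input has zero variance
r <- suppressWarnings(cor(year, delay))
is.na(r)

# One way to spot constant columns before calling plot_correlation()
toy <- data.frame(year = year, delay = delay)
is_constant <- vapply(toy, function(x) length(unique(x)) == 1, logical(1))
constant_cols <- names(toy)[is_constant]
constant_cols
```

Dropping such columns before computing correlations avoids the warning without changing the other coefficients.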
