Theory

DataExplorer is one of the best-known R packages for exploratory data analysis. It is simple to use yet powerful: a single call produces a detailed report on the data.

The function that creates the report is create_report. To inspect each component of the report individually, use the following functions:

introduce() plot_intro() plot_boxplot() plot_missing() plot_histogram() plot_bar() plot_correlation()
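As a minimal sketch of the report workflow described above, the call below builds the full HTML report in one step. It uses the built-in airquality data set as a stand-in, and the file name, report title, and temporary output directory are illustrative choices rather than part of the original analysis. Rendering the report requires the rmarkdown package and pandoc.

```r
library(DataExplorer)

# Stand-in data: `airquality` ships with base R (assumption for illustration;
# the analysis in this document uses the merged nycflights13 data instead).
report_dir <- tempdir()

create_report(
  airquality,
  output_file  = "airquality_report.html",  # illustrative file name
  output_dir   = report_dir,                # illustrative location
  report_title = "airquality: exploratory report"
)

# Path of the generated HTML file
file.path(report_dir, "airquality_report.html")
```

The same call should work on any data frame, so the stand-in can be swapped for the data set under study.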

Install packages and load libraries

#install.packages("DataExplorer")
library(DataExplorer)
## Warning: package 'DataExplorer' was built under R version 4.3.1
#install.packages("nycflights13")
library(nycflights13)

Context

The nycflights13 package contains information about all flights that departed from New York City (EWR, JFK, and LGA) to destinations in the United States in 2013, 336,776 flights in total.

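The package provides five related tables: flights, weather, planes, airports, and airlines. flights links to airlines through carrier and to planes through tailnum, the two keys used in the merges below.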

Create the dataset

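# Copy the nycflights13 tables into the workspace
flights <- flights
weather <- weather
planes <- planes
airports <- airports
airlines <- airlines
# Add the airline name to each flight (inner join on carrier)
df <- merge(flights, airlines, by = "carrier")
# Add aircraft details (inner join on tailnum; flights without a plane record are dropped)
df <- merge(df, planes, by = "tailnum")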
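Because merge() performs an inner join by default, flights whose tailnum has no record in planes are dropped, which is why the merged data has fewer rows than the 336,776 flights in the package. The sketch below counts those flights and shows a left-join variant with all.x = TRUE. The object names has_plane and df_all are illustrative.

```r
library(nycflights13)

# TRUE for flights whose tail number has a record in `planes`
has_plane <- flights$tailnum %in% planes$tailnum

nrow(flights)     # all flights in the package
sum(has_plane)    # flights an inner merge on tailnum would keep
sum(!has_plane)   # flights dropped because no plane record matches

# Left-join variant: keep every flight and fill the plane columns with NA
df_all <- merge(flights, planes, by = "tailnum", all.x = TRUE)
nrow(df_all) == nrow(flights)
```

Whether to keep the unmatched flights is a design choice: the inner join gives complete aircraft details at the cost of discarding flights, while the left join keeps all flights but adds missing values.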
introduce(df)
##     rows columns discrete_columns continuous_columns all_missing_columns
## 1 284170      28               10                 18                   0
##   total_missing_values complete_rows total_observations memory_usage
## 1               311768           920            7956760     50225296
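The figures in this summary can be cross-checked by hand: total_observations is rows times columns (284,170 × 28 = 7,956,760), and complete_rows counts the rows without any missing value. The base R sketch below recomputes some of these figures on the built-in airquality data set, used here as a stand-in. The helper name basic_profile is hypothetical and is not part of DataExplorer.

```r
# Hypothetical helper that recomputes a few of the figures introduce() reports
basic_profile <- function(data) {
  data.frame(
    rows                 = nrow(data),
    columns              = ncol(data),
    total_missing_values = sum(is.na(data)),
    complete_rows        = sum(complete.cases(data)),
    total_observations   = nrow(data) * ncol(data)
  )
}

# Stand-in data: `airquality` ships with base R
profile <- basic_profile(airquality)
profile

# Share of cells that are missing
profile$total_missing_values / profile$total_observations
```

For the merged flights data, the same ratio is 311,768 / 7,956,760, or roughly 3.9% of all cells.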
plot_intro(df)

plot_boxplot(df, by = "carrier")
## Warning: Removed 23255 rows containing non-finite values (`stat_boxplot()`).

## Warning: Removed 288513 rows containing non-finite values (`stat_boxplot()`).

plot_missing(df)
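When the numbers behind the missing-values plot are needed, DataExplorer also offers profile_missing(), which returns one row per column with the count and share of missing values. The sketch below uses the built-in airquality data set as a stand-in; missing_tbl is an illustrative name.

```r
library(DataExplorer)

# One row per column: feature, num_missing, pct_missing
missing_tbl <- profile_missing(airquality)

# Sort so the columns with the largest share of missing values come first
missing_tbl[order(-missing_tbl$pct_missing), ]
```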

plot_histogram(df)

plot_bar(df)
## 4 columns ignored with more than 50 categories.
## tailnum: 3322 categories
## dest: 104 categories
## time_hour: 6934 categories
## model: 127 categories
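The message above lists the discrete columns that plot_bar() skipped for having more than 50 categories, which is the default value of its maxcat argument. A minimal sketch of including one such column by raising the limit follows. It works on the original flights table rather than the merged df, and the limit of 110 is an arbitrary illustrative value.

```r
library(DataExplorer)
library(nycflights13)

# Number of distinct destinations in the original flights table
length(unique(flights$dest))

# Raise the category limit so `dest` is no longer ignored
plot_bar(flights["dest"], maxcat = 110L)
```

Bar charts with this many categories tend to be hard to read, so keeping the default limit and inspecting high-cardinality columns separately may be the better choice.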

plot_correlation(df)
## 5 features with more than 20 categories ignored!
## tailnum: 3322 categories
## dest: 104 categories
## time_hour: 6934 categories
## manufacturer: 35 categories
## model: 127 categories
## Warning in cor(x = structure(list(year.x = c(2013L, 2013L, 2013L, 2013L, : the
## standard deviation is zero
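The final warning arises because a correlation cannot be computed for a column that never varies, and the warning text itself shows that year.x is 2013 in every row. One possible way to avoid it is to drop constant numeric columns before computing correlations. The base R sketch below illustrates this on a small made-up data frame; toy and is_constant are illustrative names.

```r
# Toy data: `year` is constant, like year.x in the merged flights data
toy <- data.frame(
  year  = rep(2013L, 5),
  delay = c(3, 12, -4, 20, 7),
  dist  = c(200, 950, 310, 1400, 720)
)

# Flag numeric columns whose standard deviation is exactly zero
is_constant <- vapply(
  toy,
  function(x) is.numeric(x) && isTRUE(sd(x, na.rm = TRUE) == 0),
  logical(1)
)
names(toy)[is_constant]

# Correlation matrix of the remaining columns, computed without the warning
cor(toy[!is_constant])
```

The filtered data frame can then be passed to plot_correlation() in place of the full one.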
