The DataExplorer package is one of the best-known tools for exploratory data analysis. It is simple to use and powerful: its main output is a single report that gathers a large amount of information about the data.
The function that creates the report is create_report(). To produce each plot individually, use the following functions:
introduce() plot_intro() plot_boxplot() plot_missing() plot_histogram() plot_bar() plot_correlation()
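As a minimal sketch of the report workflow, the call below renders the full HTML report for a small built-in dataset. It assumes that DataExplorer and its rendering dependencies (rmarkdown and pandoc) are installed; the `iris` dataset, the file name, and the use of `tempdir()` are illustrative choices, not part of the original analysis.

```r
library(DataExplorer)

# Illustrative output location; any writable directory works.
report_dir  <- tempdir()
report_file <- "iris_report.html"

# Render the profiling report for a built-in dataset.
create_report(
  iris,
  output_file  = report_file,
  output_dir   = report_dir,
  report_title = "iris: exploratory report"
)

# Full path of the rendered report.
report_path <- file.path(report_dir, report_file)
```

The same call should work on any data frame; for a large table, rendering can take noticeably longer.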
#install.packages("DataExplorer")
library(DataExplorer)
## Warning: package 'DataExplorer' was built under R version 4.3.1
#install.packages("nycflights13")
library(nycflights13)
The nycflights13 package contains information about all flights that departed from New York City (EWR, JFK, and LGA) to destinations in the United States in 2013, 336,776 flights in total.
The tables in this package and their relationships are the following:
flights <- flights
weather <- weather
planes <- planes
airports <- airports
airlines <- airlines
df <- merge(flights,airlines,by = "carrier")
df <- merge(df, planes, by = "tailnum")
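Note that merge() performs an inner join by default, so flights whose tailnum has no match in planes are dropped; this is why df ends up with fewer rows than the 336,776 flights in the package. Both tables also contain a `year` column, which merge() disambiguates with suffixes. The sketch below illustrates both points on toy stand-in tables; the values and the suffix names are hypothetical.

```r
# Toy stand-ins for flights and planes (hypothetical values).
toy_flights <- data.frame(
  flight  = c(1, 2, 3, 4),
  tailnum = c("N1", "N2", "N9", NA),
  year    = 2013
)
toy_planes <- data.frame(
  tailnum = c("N1", "N2"),
  year    = c(1999, 2004)
)

# Default merge(): inner join, rows without a match are dropped.
inner <- merge(toy_flights, toy_planes, by = "tailnum")

# all.x = TRUE keeps every flight; suffixes label the clashing `year` columns.
left <- merge(
  toy_flights, toy_planes,
  by = "tailnum", all.x = TRUE,
  suffixes = c("_flight", "_plane")
)

nrow(inner)  # 2
nrow(left)   # 4
names(left)
```

Whether the inner or the left join is the right choice depends on whether flights without plane metadata should stay in the analysis.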
introduce(df)
##     rows columns discrete_columns continuous_columns all_missing_columns
## 1 284170      28               10                 18                   0
##   total_missing_values complete_rows total_observations memory_usage
## 1               311768           920            7956760     50225296
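To make these columns concrete, the following sketch recomputes the main counts with base R on a toy data frame (hypothetical values). It is not how DataExplorer computes them internally, only an equivalent reading of the numbers: for instance, total_observations is rows times columns, which matches 284170 × 28 = 7956760 above.

```r
# Toy data frame with a few missing values (hypothetical values).
toy <- data.frame(
  carrier   = c("AA", "UA", "DL", NA),
  dep_delay = c(5, NA, -2, 10),
  distance  = c(1400, 719, 1089, 2475)
)

n_rows     <- nrow(toy)                 # rows
n_cols     <- ncol(toy)                 # columns
n_missing  <- sum(is.na(toy))           # total_missing_values
n_complete <- sum(complete.cases(toy))  # complete_rows
n_obs      <- n_rows * n_cols           # total_observations

c(rows = n_rows, columns = n_cols, missing = n_missing,
  complete = n_complete, observations = n_obs)
```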
plot_intro(df)
plot_boxplot(df, by = "carrier")
## Warning: Removed 23255 rows containing non-finite values (`stat_boxplot()`).
## Warning: Removed 288513 rows containing non-finite values (`stat_boxplot()`).
plot_missing(df)
plot_histogram(df)
plot_bar(df)
## 4 columns ignored with more than 50 categories.
## tailnum: 3322 categories
## dest: 104 categories
## time_hour: 6934 categories
## model: 127 categories
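The columns listed above are skipped because they exceed the 50-category limit, which plot_bar() exposes through its maxcat argument. Raising that limit is one option; another is to collapse infrequent values before plotting. The sketch below shows the second idea in base R; `lump_rare()` is a hypothetical helper written for this example, not a DataExplorer function, and the toy vector stands in for a column such as dest.

```r
# Keep the `n_keep` most frequent values of x and relabel the rest as "OTHER".
# Missing values are left as NA.
lump_rare <- function(x, n_keep, other = "OTHER") {
  x <- as.character(x)
  counts <- sort(table(x), decreasing = TRUE)
  keep <- names(counts)[seq_len(min(n_keep, length(counts)))]
  ifelse(is.na(x) | x %in% keep, x, other)
}

# Toy stand-in for a high-cardinality column (hypothetical values).
dest <- c("ORD", "ORD", "ORD", "ATL", "ATL", "LAX", "BOS", NA)
lumped <- lump_rare(dest, n_keep = 2)

table(lumped, useNA = "ifany")
```

A column lumped this way stays under the category limit, so it would no longer be ignored by plot_bar().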
plot_correlation(df)
## 5 features with more than 20 categories ignored!
## tailnum: 3322 categories
## dest: 104 categories
## time_hour: 6934 categories
## manufacturer: 35 categories
## model: 127 categories
## Warning in cor(x = structure(list(year.x = c(2013L, 2013L, 2013L, 2013L, : the
## standard deviation is zero
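The last warning most likely comes from columns that never vary, such as year.x, which is 2013 for every flight: a correlation cannot be computed against a constant. One way to avoid it, sketched below in base R, is to drop constant columns before calling plot_correlation(). `drop_constant()` is a hypothetical helper for this example, and the toy data frame uses made-up values.

```r
# Return only the columns of `data` that take more than one distinct
# non-missing value.
drop_constant <- function(data) {
  varies <- vapply(
    data,
    function(col) length(unique(col[!is.na(col)])) > 1,
    logical(1)
  )
  data[, varies, drop = FALSE]
}

# Toy data frame with one constant column (hypothetical values).
toy <- data.frame(
  year      = c(2013L, 2013L, 2013L),
  dep_delay = c(5, -2, 10),
  distance  = c(1400, 719, 1089)
)

toy_varying <- drop_constant(toy)
names(toy_varying)
# The filtered data frame can then be passed to plot_correlation().
```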