Background Information
Overview
The DataExplorer package was created to automate data exploration and treatment. These processes help users, typically analysts or modelers, get a quick first look at the data they are analyzing, so they can focus on understanding the data and gathering insights that would otherwise remain hidden.
It also aims to automate the exploratory data analysis (EDA) process and simplify report generation so that users can focus on the information the data displays. The commands are user-friendly and concise, and the syntax is intuitive, so the analyst does not have to spend time on manual code. Manual coding has been identified as a time-consuming step in the data science pipeline for analysts, so this package is intended to help alleviate this issue through automation.
Version History
- Current version: 0.8.1
- This package was recently updated; the newest version was released in January 2020. Since version 0.2.3, the names of the functions have changed, and the package's functionality has grown with the addition of various new functions. Version 0.2.3 included functions such as CollapseCategory, SplitColType, DensityContinuous, GenerateReport, PlotMissing, CorrelationContinuous, and BarDiscrete. This document discusses functions from the most current version of the package.
Dependency on Other Packages
This package relies on:
- ggplot2
- xfun
- farver
- evaluate
- tinytex
- igraph
- data.table
- scales
- rmarkdown
- networkD3
- gridExtra
- htmlwidgets
When you install the DataExplorer package, all of the above packages will be installed automatically.
Similarities to Other Packages
- ggplot2
- Both ggplot2 and DataExplorer are able to create graphs; however, plotting is the sole purpose of ggplot2. With DataExplorer, a user can also easily analyze the data to gather insights.
- funModeling
- Like DataExplorer, the funModeling package is a set of tools used for exploratory data analysis. However, funModeling sets itself apart by including visualizations that can be used for predictive models and non-standard correlation analysis.
- SmartEDA
- SmartEDA and DataExplorer both conduct exploratory data analysis (EDA). SmartEDA, however, has additional features: it extends the data.table package, can plot both numerical and categorical data values, and also provides descriptive statistics.
- dlookr
- The dlookr package is used for creating high-quality reports, similar to DataExplorer. What makes this package unique is that it is capable of selecting skewed variables, which can also help in seeing dependencies in the data.
- dataMaid
- dataMaid has two functions that DataExplorer does not: check function and makeDataReport. The check function looks at data consistency and validity, while makeDataReport creates a report in different formats. Both DataExplorer and dataMaid, though, are capable of providing graphs of the data.
- exploreR
- The exploreR package is also useful for data exploration. What differs about this package is that its analysis is based on linear regression.
- RtutoR
- The RtutoR package is used for automated reporting, which can be saved in PPTX format. It is similar to DataExplorer in that it is used for reporting purposes and data exploration. However, this package is unique because it can identify the top k most relevant variables for a specific criterion.
Examples of Usage
Set-Up
First, the DataExplorer package needs to be installed and loaded. Also, load the datasets package and run the require(utils) command.
# install.packages("DataExplorer")
library(DataExplorer)
library(datasets)
require(utils)
The data set "diamonds" will be used in the following examples. The data can be loaded as follows.
library(ggplot2)
data(diamonds)
Functions
create_report()
Objective
- The main objective of this function is to create a data profiling report based on the settings configured with the configure_report() function.
Arguments
- data: the input data to be used for the report generation
- output_format: specifies the type of report to be made. The default is html_document(toc = TRUE, toc_depth = 6, theme = "yeti").
- output_file: specifies the output file name. The default is "report.html".
- output_dir: specifies the directory for the report to be saved in. The default is the user's current directory.
- y: the name of the response variable, if any. This response variable will be passed to the appropriate plotting functions automatically.
- config: the report configuration generated by configure_report().
- report_title: the report title. The default is "Data Profiling Report".
Example of create_report()
- Create a report where:
- the data used is the “diamonds” dataset
- the output file will be called “practice_report.html”
- the output file will be saved in the current working directory
- the report is configured with the configure_report() function settings
# create_report(
# data = diamonds,
# output_file = "practice_report.html",
# output_dir = getwd(),
# config = configure_report()
# )
Screenshots of a sample Data Profiling Report
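In addition to the basic call above, a response variable can be passed through the y argument so that it is forwarded to the relevant plotting functions automatically. A minimal, hedged sketch (the "price" response and the custom title are illustrative choices, not part of the original example):
# create_report(
#   data = diamonds,
#   y = "price",                                 # illustrative response variable
#   report_title = "Diamonds Profiling Report"   # illustrative title
# )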
configure_report()
Objective
- The main objective of this function is to configure a report template.
Arguments
There are many arguments that can be placed into the configure_report() function. The majority of them can be separated into two categories:
- adding a plot, or
- specifying the arguments for the added plot
For example, there is an argument add_plot_intro that plots basic information about the input data. There is then a companion argument plot_intro_args that takes in the arguments to be passed to the plot_intro() function (a sketch of this pairing is shown after the argument tables below). However, there are still some arguments that are unique. These are described below.
| Argument | Explanation | Default Value |
|---|---|---|
| add_introduce | a boolean argument that specifies whether or not the introduce() function is included in the report | TRUE |
| global_ggtheme | the global setting for the theme of the report | theme_gray() |
- The complete list of arguments is shown and described below.
| Argument Name | Explanation | Default Value |
|---|---|---|
| add_introduce | a boolean argument that specifies whether or not the introduce() function is included in the report | TRUE |
| add_plot_intro | a boolean argument that specifies whether or not the plot_intro() function is included in the report | TRUE |
| add_plot_str | a boolean argument that specifies whether or not the plot_str() function is included in the report | TRUE |
| add_plot_missing | a boolean argument that specifies whether or not the plot_missing() function is included in the report | TRUE |
| add_plot_histogram | a boolean argument that specifies whether or not the plot_histogram() function is included in the report | TRUE |
| add_plot_density | a boolean argument that specifies whether or not the plot_density() function is included in the report | FALSE |
| add_plot_qq | a boolean argument that specifies whether or not the plot_qq() function is included in the report | TRUE |
| add_plot_bar | a boolean argument that specifies whether or not the plot_bar() function is included in the report | TRUE |
| add_plot_correlation | a boolean argument that specifies whether or not the plot_correlation() function is included in the report | TRUE |
| add_plot_prcomp | a boolean argument that specifies whether or not the plot_prcomp() function is included in the report | TRUE |
| add_plot_boxplot | a boolean argument that specifies whether or not the plot_boxplot() function is included in the report | TRUE |
| add_plot_scatterplot | a boolean argument that specifies whether or not the plot_scatterplot() function is included in the report | TRUE |
| introduce_args | a list of arguments to be passed to the introduce() function | list() |
| plot_intro_args | a list of arguments to be passed to the plot_intro() function | list() |
| plot_str_args | a list of arguments to be passed to the plot_str() function | list(type = "diagonal", fontSize = 35, width = 1000, margin = list(left = 350, right = 250)) |
| plot_missing_args | a list of arguments to be passed to the plot_missing() function | list() |
| plot_histogram_args | a list of arguments to be passed to the plot_histogram() function | list() |
| plot_density_args | a list of arguments to be passed to the plot_density() function | list() |
| plot_qq_args | a list of arguments to be passed to the plot_qq() function | list(sampled_rows = 1000L) |
| plot_bar_args | a list of arguments to be passed to the plot_bar() function | list() |
| plot_correlation_args | a list of arguments to be passed to the plot_correlation() function | list(cor_args = list(use = "pairwise.complete.obs")) |
| plot_prcomp_args | a list of arguments to be passed to the plot_prcomp() function | list() |
| plot_boxplot_args | a list of arguments to be passed to the plot_boxplot() function | list() |
| plot_scatterplot_args | a list of arguments to be passed to the plot_scatterplot() function | list(sampled_rows = 1000L) |
| global_ggtheme | the global setting for the theme of the report | theme_gray() |
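As noted above, each add_plot_* switch has a matching *_args list that is forwarded to the corresponding plotting function. A minimal sketch of this pairing, using only arguments from the table (the specific values are illustrative):
configure_report(
  add_plot_prcomp = FALSE,                  # drop the plot_prcomp() section
  plot_qq_args = list(sampled_rows = 500L)  # forwarded to plot_qq()
)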
Example of configure_report()
Configure a report where:
- A density plot is included
- Bar and QQ plots are not included
- The global theme is theme_light()
configure_report(
  add_plot_density = TRUE,
  add_plot_bar = FALSE,
  add_plot_qq = FALSE,
  global_ggtheme = quote(theme_light())
)
drop_columns()
Objective
- The main purpose of this function is to quickly drop selected variables in the data frame.
Arguments
- data: the input data to be used
- ind: a vector of either the names or column positions of the variables to be dropped
Example of drop_columns()
- This is the original dataset.
head(diamonds)
## # A tibble: 6 x 10
## carat cut color clarity depth table price x y z
## <dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43
## 2 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31
## 3 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31
## 4 0.290 Premium I VS2 62.4 58 334 4.2 4.23 2.63
## 5 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75
## 6 0.24 Very Good J VVS2 62.8 57 336 3.94 3.96 2.48
- Columns 2 through 5 are dropped from the diamonds dataset using the seq() function.
drop_columns(diamonds, seq(2,5))
## # A tibble: 53,940 x 6
## carat table price x y z
## <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1 0.23 55 326 3.95 3.98 2.43
## 2 0.21 61 326 3.89 3.84 2.31
## 3 0.23 65 327 4.05 4.07 2.31
## 4 0.290 58 334 4.2 4.23 2.63
## 5 0.31 58 335 4.34 4.35 2.75
## 6 0.24 57 336 3.94 3.96 2.48
## 7 0.24 57 336 3.95 3.98 2.47
## 8 0.26 55 337 4.07 4.11 2.53
## 9 0.22 61 337 3.87 3.78 2.49
## 10 0.23 61 338 4 4.05 2.39
## # ... with 53,930 more rows
- Columns 1, 2, and 4 are dropped from the diamonds dataset using the concatenate function c().
drop_columns(diamonds, c(1,2,4))
## # A tibble: 53,940 x 7
## color depth table price x y z
## <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1 E 61.5 55 326 3.95 3.98 2.43
## 2 E 59.8 61 326 3.89 3.84 2.31
## 3 E 56.9 65 327 4.05 4.07 2.31
## 4 I 62.4 58 334 4.2 4.23 2.63
## 5 J 63.3 58 335 4.34 4.35 2.75
## 6 J 62.8 57 336 3.94 3.96 2.48
## 7 I 62.3 57 336 3.95 3.98 2.47
## 8 H 61.9 55 337 4.07 4.11 2.53
## 9 E 65.1 61 337 3.87 3.78 2.49
## 10 H 59.4 61 338 4 4.05 2.39
## # ... with 53,930 more rows
- The “clarity” column is dropped from the diamonds dataset by directly calling the column name.
drop_columns(diamonds, "clarity") ## # A tibble: 53,940 x 9
## carat cut color depth table price x y z
## <dbl> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1 0.23 Ideal E 61.5 55 326 3.95 3.98 2.43
## 2 0.21 Premium E 59.8 61 326 3.89 3.84 2.31
## 3 0.23 Good E 56.9 65 327 4.05 4.07 2.31
## 4 0.290 Premium I 62.4 58 334 4.2 4.23 2.63
## 5 0.31 Good J 63.3 58 335 4.34 4.35 2.75
## 6 0.24 Very Good J 62.8 57 336 3.94 3.96 2.48
## 7 0.24 Very Good I 62.3 57 336 3.95 3.98 2.47
## 8 0.26 Very Good H 61.9 55 337 4.07 4.11 2.53
## 9 0.22 Fair E 65.1 61 337 3.87 3.78 2.49
## 10 0.23 Very Good H 59.4 61 338 4 4.05 2.39
## # ... with 53,930 more rows
- The "depth" and "table" columns are dropped from the diamonds dataset by passing the column names to the concatenate function c().
drop_columns(diamonds, c("depth", "table"))
## # A tibble: 53,940 x 8
## carat cut color clarity price x y z
## <dbl> <ord> <ord> <ord> <int> <dbl> <dbl> <dbl>
## 1 0.23 Ideal E SI2 326 3.95 3.98 2.43
## 2 0.21 Premium E SI1 326 3.89 3.84 2.31
## 3 0.23 Good E VS1 327 4.05 4.07 2.31
## 4 0.290 Premium I VS2 334 4.2 4.23 2.63
## 5 0.31 Good J SI2 335 4.34 4.35 2.75
## 6 0.24 Very Good J VVS2 336 3.94 3.96 2.48
## 7 0.24 Very Good I VVS1 336 3.95 3.98 2.47
## 8 0.26 Very Good H SI1 337 4.07 4.11 2.53
## 9 0.22 Fair E VS2 337 3.87 3.78 2.49
## 10 0.23 Very Good H VS1 338 4 4.05 2.39
## # ... with 53,930 more rows
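Note that in these examples drop_columns() returns the reduced data without modifying diamonds itself (later examples still show all ten columns), so the result can be assigned if a trimmed copy is needed. A minimal sketch (the object name is illustrative):
diamonds_trimmed <- drop_columns(diamonds, c("x", "y", "z"))  # keep a reduced copy
head(diamonds_trimmed)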
dummify()
Overview
- Data dummification is also known as "one hot encoding" or "feature binarization". It turns each category into a distinct column with binary (numeric) values. These binary values make calculations, such as those related to probability, easier to perform on categorical data.
Arguments
- data: input data
- maxcat: the maximum number of categories for a discrete feature; the default is 50
- select: the names of selected features to be dummified; the default is NULL
Note that continuous features passed to select are ignored, and selected features whose number of categories exceeds maxcat are ignored as well.
Datasets that have been dummified keep all of the other (non-dummified) features, but the order of the columns may be different.
Examples of dummify()
- Here, no select argument is passed, so all columns that R recognizes as discrete will be dummified.
str(dummify(diamonds, maxcat = 5))
## 2 features with more than 5 categories ignored!
## color: 7 categories
## clarity: 8 categories
## Classes 'tbl_df', 'tbl' and 'data.frame': 53940 obs. of 14 variables:
## $ carat : num 0.23 0.21 0.23 0.29 0.31 0.24 0.24 0.26 0.22 0.23 ...
## $ depth : num 61.5 59.8 56.9 62.4 63.3 62.8 62.3 61.9 65.1 59.4 ...
## $ table : num 55 61 65 58 58 57 57 55 61 61 ...
## $ price : int 326 326 327 334 335 336 336 337 337 338 ...
## $ x : num 3.95 3.89 4.05 4.2 4.34 3.94 3.95 4.07 3.87 4 ...
## $ y : num 3.98 3.84 4.07 4.23 4.35 3.96 3.98 4.11 3.78 4.05 ...
## $ z : num 2.43 2.31 2.31 2.63 2.75 2.48 2.47 2.53 2.49 2.39 ...
## $ color : Ord.factor w/ 7 levels "D"<"E"<"F"<"G"<..: 2 2 2 6 7 7 6 5 2 5 ...
## $ clarity : Ord.factor w/ 8 levels "I1"<"SI2"<"SI1"<..: 2 3 5 4 2 6 7 3 4 5 ...
## $ cut_Fair : int 0 0 0 0 0 0 0 0 1 0 ...
## $ cut_Good : int 0 0 1 0 1 0 0 0 0 0 ...
## $ cut_Ideal : int 1 0 0 0 0 0 0 0 0 0 ...
## $ cut_Premium : int 0 1 0 1 0 0 0 0 0 0 ...
## $ cut_Very.Good: int 0 0 0 0 0 1 1 1 0 1 ...
## - attr(*, ".internal.selfref")=<externalptr>
- Here, only the manually-selected columns will be dummified.
str(dummify(diamonds, select = c("cut", "color")))
## Classes 'tbl_df', 'tbl' and 'data.frame': 53940 obs. of 20 variables:
## $ carat : num 0.23 0.21 0.23 0.29 0.31 0.24 0.24 0.26 0.22 0.23 ...
## $ depth : num 61.5 59.8 56.9 62.4 63.3 62.8 62.3 61.9 65.1 59.4 ...
## $ table : num 55 61 65 58 58 57 57 55 61 61 ...
## $ price : int 326 326 327 334 335 336 336 337 337 338 ...
## $ x : num 3.95 3.89 4.05 4.2 4.34 3.94 3.95 4.07 3.87 4 ...
## $ y : num 3.98 3.84 4.07 4.23 4.35 3.96 3.98 4.11 3.78 4.05 ...
## $ z : num 2.43 2.31 2.31 2.63 2.75 2.48 2.47 2.53 2.49 2.39 ...
## $ clarity : Ord.factor w/ 8 levels "I1"<"SI2"<"SI1"<..: 2 3 5 4 2 6 7 3 4 5 ...
## $ cut_Fair : int 0 0 0 0 0 0 0 0 1 0 ...
## $ cut_Good : int 0 0 1 0 1 0 0 0 0 0 ...
## $ cut_Ideal : int 1 0 0 0 0 0 0 0 0 0 ...
## $ cut_Premium : int 0 1 0 1 0 0 0 0 0 0 ...
## $ cut_Very.Good: int 0 0 0 0 0 1 1 1 0 1 ...
## $ color_D : int 0 0 0 0 0 0 0 0 0 0 ...
## $ color_E : int 1 1 1 0 0 0 0 0 1 0 ...
## $ color_F : int 0 0 0 0 0 0 0 0 0 0 ...
## $ color_G : int 0 0 0 0 0 0 0 0 0 0 ...
## $ color_H : int 0 0 0 0 0 0 0 1 0 1 ...
## $ color_I : int 0 0 0 1 0 0 1 0 0 0 ...
## $ color_J : int 0 0 0 0 1 1 0 0 0 0 ...
## - attr(*, ".internal.selfref")=<externalptr>
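If every discrete column should be dummified, maxcat can simply be raised above the largest category count (8, for clarity, in this dataset). A minimal sketch, with the output omitted:
str(dummify(diamonds, maxcat = 10))  # cut, color, and clarity are all dummified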
introduce()
Overview
- Describes basic information for the input data, including:
- rows: number of rows
- columns: number of columns
- discrete_columns: number of discrete columns
- continuous_columns: number of continuous columns
- all_missing_columns: number of columns with everything missing
- total_missing_values: number of missing observations
- complete_rows: number of rows without missing values
- total_observations: total number of observations
- memory_usage: estimated memory allocation in bytes
Arguments
- The only argument is data, which refers to the input data.
Example of introduce()
introduce(diamonds)
## # A tibble: 1 x 9
## rows columns discrete_columns continuous_colu~ all_missing_col~
## <int> <int> <int> <int> <int>
## 1 53940 10 3 7 0
## # ... with 4 more variables: total_missing_values <int>, complete_rows <int>,
## # total_observations <int>, memory_usage <dbl>
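The same summary can also be viewed graphically with plot_intro(), the plotting counterpart referenced in the configure_report() table above. A minimal sketch:
plot_intro(diamonds)  # bar chart of the introduce() metrics as percentages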
plot_bar()
Overview
- This function plots bar charts for discrete features, aggregated either by frequency (the default) or by another continuous feature.
Arguments
- data: input data
- with: name of a continuous feature to be summed. The default is NULL, i.e., frequency.
- maxcat: maximum categories allowed for each feature. The default is 50.
- order_bar: logical, indicating if bars should be ordered. The default is TRUE.
- binary_as_factor: treat binary as categorical? The default is TRUE.
- title: plot title
- ggtheme: complete ggplot2 theme. The default is theme_gray().
- theme_config: a list of configurations to be passed to theme()
- nrow: number of rows per page. The default is 3.
- ncol: number of columns per page. The default is 3.
- parallel: enable parallel? The default is FALSE.
Examples of plot_bar()
- Plotting the categorical data within the diamonds data set
plot_bar(diamonds)
- Plotting the categorical data with 5 categories or fewer within the diamonds data set
plot_bar(diamonds, maxcat = 5, ggtheme = theme_light(), theme_config = list(
"plot.background" = element_rect(fill = "light gray"),
"aspect.ratio" = 1
), title = "Frequency of Each Cut of Diamonds")## 2 columns ignored with more than 5 categories.
## color: 7 categories
## clarity: 8 categories
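The with argument described above can be used to aggregate a continuous feature instead of plotting frequencies. A minimal sketch that sums price within each discrete category:
plot_bar(diamonds, with = "price")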
split_columns()
Overview
- This function splits the input data into its discrete and continuous components.
Arguments
- data: input data
- binary_as_factor: a boolean value that specifies whether binary columns should be treated as categorical (factor) columns. The default is FALSE.
Example of split_columns()
- Split the diamonds dataset
output <- split_columns(diamonds)
- Show only the discrete components of the dataset
output$discrete
## # A tibble: 53,940 x 3
## cut color clarity
## <ord> <ord> <ord>
## 1 Ideal E SI2
## 2 Premium E SI1
## 3 Good E VS1
## 4 Premium I VS2
## 5 Good J SI2
## 6 Very Good J VVS2
## 7 Very Good I VVS1
## 8 Very Good H SI1
## 9 Fair E VS2
## 10 Very Good H VS1
## # ... with 53,930 more rows
- Show only the continuous components of the dataset
output$continuous
## # A tibble: 53,940 x 7
## carat depth table price x y z
## <dbl> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1 0.23 61.5 55 326 3.95 3.98 2.43
## 2 0.21 59.8 61 326 3.89 3.84 2.31
## 3 0.23 56.9 65 327 4.05 4.07 2.31
## 4 0.290 62.4 58 334 4.2 4.23 2.63
## 5 0.31 63.3 58 335 4.34 4.35 2.75
## 6 0.24 62.8 57 336 3.94 3.96 2.48
## 7 0.24 62.3 57 336 3.95 3.98 2.47
## 8 0.26 61.9 55 337 4.07 4.11 2.53
## 9 0.22 65.1 61 337 3.87 3.78 2.49
## 10 0.23 59.4 61 338 4 4.05 2.39
## # ... with 53,930 more rows
- Show the number of discrete components of the dataset
output$num_discrete
## [1] 3
- Show the number of continuous components of the dataset
output$num_continuous
## [1] 7
- Show the number of missing components of the dataset
output$num_all_missing
## [1] 0
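The split pieces can then be passed on to the package's other functions; for example, the discrete component can be fed to plot_bar(), described above. A minimal sketch:
plot_bar(output$discrete)  # bar charts for cut, color, and clarity only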
Reflection
In summary, we believe this package is one of the better packages to use for exploratory data analysis (EDA). The EDA process can be frustrating at times because analysts need to examine the data and generate hypotheses in order to determine the next steps, which occupies a lot of their time. The DataExplorer package allows an analyst to view their data with one function (create_report()). DataExplorer automates the data handling and visualization processes so that users can focus on interpreting the data and extracting insights, which ultimately saves time and resources. Below, we list some of the pros and cons of the package.
- Pros:
- Quick and easy to use functions
- Provides functions for
- Summarizing data
- Finding missing values
- Plotting distributions
- Normalizing data
- Computing and visualizing principal component analysis
- Computing correlations
- Plotting relationships and transforming data
- Automates time-intensive processes of EDA
- Cons:
- This package almost entirely relies on visual techniques for understanding the data
- If too many graphs are displayed at once, it is difficult to focus on the details of a particular graph (information overload)
- Features that could be added or tweaked:
- Since the create_report function is one of the most important functions, we would like to suggest tweaks to that function
- We believe the graphs could include default colors to improve the visualizations.
- We would like to be able to:
- Modify the layout of the report
- Change the size of the graphs
- Set a report page limit
- Customize the summary statistics in the introduce() function
- Overall, being able to customize more features would help create more optimal reports.
Resources
- https://boxuancui.github.io/DataExplorer/
- https://blog.revolutionanalytics.com/2018/02/dataexplorer.html
- https://www.rdocumentation.org/packages/DataExplorer/versions/0.8.1
- https://cran.r-project.org/web/packages/DataExplorer/index.html
- https://cran.r-project.org/web/packages/DataExplorer/DataExplorer.pdf
- https://cran.r-project.org/web/packages/DataExplorer/vignettes/dataexplorer-intro.html
- https://arxiv.org/pdf/1904.02101.pdf
- https://www.rdocumentation.org/packages/DataExplorer/versions/0.2.3