Background Information
Overview
The DataExplorer package was created to automate data exploration and treatment. These processes help users, typically analysts or modelers, get a quick first look at the data they are analyzing, so they can focus on understanding the data and gathering insights that would otherwise remain hidden.
It also aims to automate the exploratory data analysis (EDA) process and simplify report generation so that users can focus on the information the data displays. The commands are user-friendly and concise, and the syntax is intuitive, so the analyst does not have to spend time on manual code. Manual coding has been identified as a time-consuming step in the data science pipeline for analysts, so this package is intended to help alleviate this issue through automation.
Version History
- Current version: 0.8.1
- This package was recently updated; the newest version was released in January 2020. Since version 0.2.3, the names of the functions have changed, and the package's functionality has grown with the addition of various new functions. Version 0.2.3 included functions such as CollapseCategory, SplitColType, DensityContinuous, GenerateReport, PlotMissing, CorrelationContinuous, and BarDiscrete. This document discusses functions from the most current version of the package.
Dependency on Other Packages
This package relies on:
- ggplot2
- xfun
- farver
- evaluate
- tinytex
- igraph
- data.table
- scales
- rmarkdown
- networkD3
- gridExtra
- htmlwidgets
When you install the DataExplorer package, all of the above packages will be installed automatically.
Similarities to Other Packages
- ggplot2
- Both ggplot2 and DataExplorer are able to create graphs; however, plotting is the sole purpose of ggplot2. With DataExplorer, a user can also easily analyze the data to gather insights.
- funModeling
- Like DataExplorer, the funModeling package is a set of tools used for exploratory data analysis. However, funModeling sets itself apart by including visualizations that can be used for predictive models and non-standard correlation analysis.
- SmartEDA
- SmartEDA and DataExplorer both conduct exploratory data analysis (EDA). SmartEDA, however, has additional features: it extends the data.table package, can plot both numerical and categorical data values, and also provides descriptive statistics.
- dlookr
- The dlookr package is used for creating high-quality reports, similar to DataExplorer. What makes this package unique is that it is capable of selecting skewed variables, which can also help in seeing dependencies in the data.
- dataMaid
- dataMaid has two functions that DataExplorer does not: check function and makeDataReport. The check function looks at data consistency and validity, while makeDataReport creates a report in different formats. Both DataExplorer and dataMaid, though, are capable of providing graphs of the data.
- exploreR
- The exploreR package is also useful for data exploration. What differs about this package is that its analysis is based on linear regression.
- RtutoR
- The RtutoR package is used for automated reporting, which can be saved in PPTX format. It is similar to DataExplorer in that it is used for reporting purposes and data exploration. However, this package is unique because it can identify the top k most relevant variables for a specific criterion.
Examples of Usage
Set-Up
First, the DataExplorer package needs to be installed and loaded. Also, load the datasets package and run the require(utils) command.
# install.packages("DataExplorer")
library(DataExplorer)
library(datasets)
require(utils)
The data set "diamonds" will be used in the following examples. The data can be loaded as follows.
library(ggplot2)
data(diamonds)
Functions
create_report()
Objective
- The main objective of this function is to create a data profiling report based on the settings configured with the configure_report() function.
Arguments
- data: the input data to be used for the report generation
- output_format: specifies the type of report to be made. The default is html_document(toc = TRUE, toc_depth = 6, theme = "yeti").
- output_file: specifies the output file name. The default is "report.html".
- output_dir: specifies the directory for the report to be saved in. The default is the user's current directory.
- y: the name of the response variable, if any. This response variable will be passed to the appropriate plotting functions automatically.
- config: the report configuration generated by configure_report().
- report_title: the report title. The default is "Data Profiling Report".
Example of create_report()
- Create a report where:
- the data used is the “diamonds” dataset
- the output file will be called “practice_report.html”
- the output file will be saved in the current working directory
- the report is configured with the configure_report() function settings
# create_report(
# data = diamonds,
# output_file = "practice_report.html",
# output_dir = getwd(),
# config = configure_report()
# )
Screenshots of a sample Data Profiling Report
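In addition to the basic call above, a response variable can be passed through the y argument so that it is forwarded to the relevant plotting functions automatically. A minimal, hedged sketch (the "price" response and the custom title are illustrative choices, not part of the original example):
# create_report(
#   data = diamonds,
#   y = "price",                                 # illustrative response variable
#   report_title = "Diamonds Profiling Report"   # illustrative title
# )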
configure_report()
Objective
- The main objective of this function is to configure a report template.
Arguments
There are many arguments that can be placed into the configure_report() function. The majority of them can be separated into two categories:
- adding a plot, or
- specifying the arguments for the added plot
For example, there is an argument add_plot_intro that plots basic information about the input data. There is then a companion argument plot_intro_args that takes in the arguments to be passed to the plot_intro() function (a sketch of this pairing is shown after the argument tables below). However, there are still some arguments that are unique. These are described below.
| Argument | Explanation | Default Value |
|---|---|---|
| add_introduce | a boolean argument that specifies whether or not the introduce() function is included in the report | TRUE |
| global_ggtheme | the global setting for the theme of the report | theme_gray() |
- The complete list of arguments is shown and described below.
| Argument Name | Explanation | Default Value |
|---|---|---|
| add_introduce | a boolean argument that specifies whether or not the introduce() function is included in the report | TRUE |
| add_plot_intro | a boolean argument that specifies whether or not the plot_intro() function is included in the report | TRUE |
| add_plot_str | a boolean argument that specifies whether or not the plot_str() function is included in the report | TRUE |
| add_plot_missing | a boolean argument that specifies whether or not the plot_missing() function is included in the report | TRUE |
| add_plot_histogram | a boolean argument that specifies whether or not the plot_histogram() function is included in the report | TRUE |
| add_plot_density | a boolean argument that specifies whether or not the plot_density() function is included in the report | FALSE |
| add_plot_qq | a boolean argument that specifies whether or not the plot_qq() function is included in the report | TRUE |
| add_plot_bar | a boolean argument that specifies whether or not the plot_bar() function is included in the report | TRUE |
| add_plot_correlation | a boolean argument that specifies whether or not the plot_correlation() function is included in the report | TRUE |
| add_plot_prcomp | a boolean argument that specifies whether or not the plot_prcomp() function is included in the report | TRUE |
| add_plot_boxplot | a boolean argument that specifies whether or not the plot_boxplot() function is included in the report | TRUE |
| add_plot_scatterplot | a boolean argument that specifies whether or not the plot_scatterplot() function is included in the report | TRUE |
| introduce_args | a list of arguments to be passed to the introduce() function | list() |
| plot_intro_args | a list of arguments to be passed to the plot_intro() function | list() |
| plot_str_args | a list of arguments to be passed to the plot_str() function | list(type = "diagonal", fontSize = 35, width = 1000, margin = list(left = 350, right = 250)) |
| plot_missing_args | a list of arguments to be passed to the plot_missing() function | list() |
| plot_histogram_args | a list of arguments to be passed to the plot_histogram() function | list() |
| plot_density_args | a list of arguments to be passed to the plot_density() function | list() |
| plot_qq_args | a list of arguments to be passed to the plot_qq() function | list(sampled_rows = 1000L) |
| plot_bar_args | a list of arguments to be passed to the plot_bar() function | list() |
| plot_correlation_args | a list of arguments to be passed to the plot_correlation() function | list(cor_args = list(use = "pairwise.complete.obs")) |
| plot_prcomp_args | a list of arguments to be passed to the plot_prcomp() function | list() |
| plot_boxplot_args | a list of arguments to be passed to the plot_boxplot() function | list() |
| plot_scatterplot_args | a list of arguments to be passed to the plot_scatterplot() function | list(sampled_rows = 1000L) |
| global_ggtheme | the global setting for the theme of the report | theme_gray() |
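As noted above, each add_plot_* switch has a matching *_args list that is forwarded to the corresponding plotting function. A minimal sketch of this pairing, using only arguments from the table (the specific values are illustrative):
configure_report(
  add_plot_prcomp = FALSE,                  # drop the plot_prcomp() section
  plot_qq_args = list(sampled_rows = 500L)  # forwarded to plot_qq()
)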
Example of configure_report()
Configure a report where:
- A density plot is included
- Bar and QQ plots are not included
- The global theme is theme_light()
configure_report(
  add_plot_density = TRUE,
  add_plot_bar = FALSE,
  add_plot_qq = FALSE,
  global_ggtheme = quote(theme_light())
)
drop_columns()
Objective
- The main purpose of this function is to quickly drop selected variables in the data frame.
Arguments
- data: the input data to be used
- ind: a vector of either the names or column positions of the variables to be dropped
Example of drop_columns()
- This is the original dataset.
head(diamonds)
## # A tibble: 6 x 10
## carat cut color clarity depth table price x y z
## <dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43
## 2 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31
## 3 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31
## 4 0.290 Premium I VS2 62.4 58 334 4.2 4.23 2.63
## 5 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75
## 6 0.24 Very Good J VVS2 62.8 57 336 3.94 3.96 2.48
- Columns 2 through 5 are dropped from the diamonds dataset using the seq() function.
drop_columns(diamonds, seq(2,5))
## # A tibble: 53,940 x 6
## carat table price x y z
## <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1 0.23 55 326 3.95 3.98 2.43
## 2 0.21 61 326 3.89 3.84 2.31
## 3 0.23 65 327 4.05 4.07 2.31
## 4 0.290 58 334 4.2 4.23 2.63
## 5 0.31 58 335 4.34 4.35 2.75
## 6 0.24 57 336 3.94 3.96 2.48
## 7 0.24 57 336 3.95 3.98 2.47
## 8 0.26 55 337 4.07 4.11 2.53
## 9 0.22 61 337 3.87 3.78 2.49
## 10 0.23 61 338 4 4.05 2.39
## # ... with 53,930 more rows
- Columns 1, 2, and 4 are dropped from the diamonds dataset using the concatenate function c().
drop_columns(diamonds, c(1,2,4))
## # A tibble: 53,940 x 7
## color depth table price x y z
## <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1 E 61.5 55 326 3.95 3.98 2.43
## 2 E 59.8 61 326 3.89 3.84 2.31
## 3 E 56.9 65 327 4.05 4.07 2.31
## 4 I 62.4 58 334 4.2 4.23 2.63
## 5 J 63.3 58 335 4.34 4.35 2.75
## 6 J 62.8 57 336 3.94 3.96 2.48
## 7 I 62.3 57 336 3.95 3.98 2.47
## 8 H 61.9 55 337 4.07 4.11 2.53
## 9 E 65.1 61 337 3.87 3.78 2.49
## 10 H 59.4 61 338 4 4.05 2.39
## # ... with 53,930 more rows
- The “clarity” column is dropped from the diamonds dataset by directly calling the column name.
drop_columns(diamonds, "clarity") ## # A tibble: 53,940 x 9
## carat cut color depth table price x y z
## <dbl> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1 0.23 Ideal E 61.5 55 326 3.95 3.98 2.43
## 2 0.21 Premium E 59.8 61 326 3.89 3.84 2.31
## 3 0.23 Good E 56.9 65 327 4.05 4.07 2.31
## 4 0.290 Premium I 62.4 58 334 4.2 4.23 2.63
## 5 0.31 Good J 63.3 58 335 4.34 4.35 2.75
## 6 0.24 Very Good J 62.8 57 336 3.94 3.96 2.48
## 7 0.24 Very Good I 62.3 57 336 3.95 3.98 2.47
## 8 0.26 Very Good H 61.9 55 337 4.07 4.11 2.53
## 9 0.22 Fair E 65.1 61 337 3.87 3.78 2.49
## 10 0.23 Very Good H 59.4 61 338 4 4.05 2.39
## # ... with 53,930 more rows
- The "depth" and "table" columns are dropped from the diamonds dataset by passing the column names to the concatenate function c().
drop_columns(diamonds, c("depth", "table"))
## # A tibble: 53,940 x 8
## carat cut color clarity price x y z
## <dbl> <ord> <ord> <ord> <int> <dbl> <dbl> <dbl>
## 1 0.23 Ideal E SI2 326 3.95 3.98 2.43
## 2 0.21 Premium E SI1 326 3.89 3.84 2.31
## 3 0.23 Good E VS1 327 4.05 4.07 2.31
## 4 0.290 Premium I VS2 334 4.2 4.23 2.63
## 5 0.31 Good J SI2 335 4.34 4.35 2.75
## 6 0.24 Very Good J VVS2 336 3.94 3.96 2.48
## 7 0.24 Very Good I VVS1 336 3.95 3.98 2.47
## 8 0.26 Very Good H SI1 337 4.07 4.11 2.53
## 9 0.22 Fair E VS2 337 3.87 3.78 2.49
## 10 0.23 Very Good H VS1 338 4 4.05 2.39
## # ... with 53,930 more rows
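Note that in these examples drop_columns() returns the reduced data without modifying diamonds itself (later examples still show all ten columns), so the result can be assigned if a trimmed copy is needed. A minimal sketch (the object name is illustrative):
diamonds_trimmed <- drop_columns(diamonds, c("x", "y", "z"))  # keep a reduced copy
head(diamonds_trimmed)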
dummify()
Overview
- Data dummification is also known as "one hot encoding" or "feature binarization". It turns each category into a distinct column with binary (numeric) values. These binary values make calculations, such as those related to probability, easier to perform on categorical data.
Arguments
- data: input data
- maxcat: the maximum number of categories for a discrete feature; the default is 50
- select: the names of selected features to be dummified; the default is NULL
Note that continuous features passed to select are ignored, and selected features whose number of categories exceeds maxcat are ignored as well.
Datasets that have been dummified keep all of the other (non-dummified) features, but the order of the columns may be different.
Examples of dummify()
- Here, no select argument is passed, so all columns that R recognizes as discrete will be dummified.
str(dummify(diamonds, maxcat = 5))
## 2 features with more than 5 categories ignored!
## color: 7 categories
## clarity: 8 categories
## Classes 'tbl_df', 'tbl' and 'data.frame': 53940 obs. of 14 variables:
## $ carat : num 0.23 0.21 0.23 0.29 0.31 0.24 0.24 0.26 0.22 0.23 ...
## $ depth : num 61.5 59.8 56.9 62.4 63.3 62.8 62.3 61.9 65.1 59.4 ...
## $ table : num 55 61 65 58 58 57 57 55 61 61 ...
## $ price : int 326 326 327 334 335 336 336 337 337 338 ...
## $ x : num 3.95 3.89 4.05 4.2 4.34 3.94 3.95 4.07 3.87 4 ...
## $ y : num 3.98 3.84 4.07 4.23 4.35 3.96 3.98 4.11 3.78 4.05 ...
## $ z : num 2.43 2.31 2.31 2.63 2.75 2.48 2.47 2.53 2.49 2.39 ...
## $ color : Ord.factor w/ 7 levels "D"<"E"<"F"<"G"<..: 2 2 2 6 7 7 6 5 2 5 ...
## $ clarity : Ord.factor w/ 8 levels "I1"<"SI2"<"SI1"<..: 2 3 5 4 2 6 7 3 4 5 ...
## $ cut_Fair : int 0 0 0 0 0 0 0 0 1 0 ...
## $ cut_Good : int 0 0 1 0 1 0 0 0 0 0 ...
## $ cut_Ideal : int 1 0 0 0 0 0 0 0 0 0 ...
## $ cut_Premium : int 0 1 0 1 0 0 0 0 0 0 ...
## $ cut_Very.Good: int 0 0 0 0 0 1 1 1 0 1 ...
## - attr(*, ".internal.selfref")=<externalptr>
- Here, only the manually-selected columns will be dummified.
str(dummify(diamonds, select = c("cut", "color")))
## Classes 'tbl_df', 'tbl' and 'data.frame': 53940 obs. of 20 variables:
## $ carat : num 0.23 0.21 0.23 0.29 0.31 0.24 0.24 0.26 0.22 0.23 ...
## $ depth : num 61.5 59.8 56.9 62.4 63.3 62.8 62.3 61.9 65.1 59.4 ...
## $ table : num 55 61 65 58 58 57 57 55 61 61 ...
## $ price : int 326 326 327 334 335 336 336 337 337 338 ...
## $ x : num 3.95 3.89 4.05 4.2 4.34 3.94 3.95 4.07 3.87 4 ...
## $ y : num 3.98 3.84 4.07 4.23 4.35 3.96 3.98 4.11 3.78 4.05 ...
## $ z : num 2.43 2.31 2.31 2.63 2.75 2.48 2.47 2.53 2.49 2.39 ...
## $ clarity : Ord.factor w/ 8 levels "I1"<"SI2"<"SI1"<..: 2 3 5 4 2 6 7 3 4 5 ...
## $ cut_Fair : int 0 0 0 0 0 0 0 0 1 0 ...
## $ cut_Good : int 0 0 1 0 1 0 0 0 0 0 ...
## $ cut_Ideal : int 1 0 0 0 0 0 0 0 0 0 ...
## $ cut_Premium : int 0 1 0 1 0 0 0 0 0 0 ...
## $ cut_Very.Good: int 0 0 0 0 0 1 1 1 0 1 ...
## $ color_D : int 0 0 0 0 0 0 0 0 0 0 ...
## $ color_E : int 1 1 1 0 0 0 0 0 1 0 ...
## $ color_F : int 0 0 0 0 0 0 0 0 0 0 ...
## $ color_G : int 0 0 0 0 0 0 0 0 0 0 ...
## $ color_H : int 0 0 0 0 0 0 0 1 0 1 ...
## $ color_I : int 0 0 0 1 0 0 1 0 0 0 ...
## $ color_J : int 0 0 0 0 1 1 0 0 0 0 ...
## - attr(*, ".internal.selfref")=<externalptr>
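If every discrete column should be dummified, maxcat can simply be raised above the largest category count (8, for clarity, in this dataset). A minimal sketch, with the output omitted:
str(dummify(diamonds, maxcat = 10))  # cut, color, and clarity are all dummified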
introduce()
Overview
- Describes basic information for the input data, including:
- rows: number of rows
- columns: number of columns
- discrete_columns: number of discrete columns
- continuous_columns: number of continuous columns
- all_missing_columns: number of columns with everything missing
- total_missing_values: number of missing observations
- complete_rows: number of rows without missing values
- total_observations: total number of observations
- memory_usage: estimated memory allocation in bytes
Arguments
- The only argument is data, which refers to the input data.
Example of introduce()
introduce(diamonds)
## # A tibble: 1 x 9
## rows columns discrete_columns continuous_colu~ all_missing_col~
## <int> <int> <int> <int> <int>
## 1 53940 10 3 7 0
## # ... with 4 more variables: total_missing_values <int>, complete_rows <int>,
## # total_observations <int>, memory_usage <dbl>
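The same summary can also be viewed graphically with plot_intro(), the plotting counterpart referenced in the configure_report() table above. A minimal sketch:
plot_intro(diamonds)  # bar chart of the introduce() metrics as percentages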
plot_bar()
Overview
- This function plots bar charts for discrete features, aggregated either by frequency (the default) or by another continuous feature.
Arguments
- data: input data
- with: name of a continuous feature to be summed. The default is NULL, i.e., frequency.
- maxcat: maximum categories allowed for each feature. The default is 50.
- order_bar: logical, indicating if bars should be ordered. The default is TRUE.
- binary_as_factor: treat binary as categorical? The default is TRUE.
- title: plot title
- ggtheme: complete ggplot2 theme. The default is theme_gray().
- theme_config: a list of configurations to be passed to theme()
- nrow: number of rows per page. The default is 3.
- ncol: number of columns per page. The default is 3.
- parallel: enable parallel? The default is FALSE.
Examples of plot_bar()
- Plotting the categorical data within the diamonds data set
plot_bar(diamonds)
- Plotting the categorical data with 5 categories or fewer within the diamonds data set
plot_bar(diamonds, maxcat = 5, ggtheme = theme_light(), theme_config = list(
"plot.background" = element_rect(fill = "light gray"),
"aspect.ratio" = 1
), title = "Frequency of Each Cut of Diamonds")## 2 columns ignored with more than 5 categories.
## color: 7 categories
## clarity: 8 categories
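The with argument described above can be used to aggregate a continuous feature instead of plotting frequencies. A minimal sketch that sums price within each discrete category:
plot_bar(diamonds, with = "price")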
split_columns()
Overview
- This function splits the input data into its discrete and continuous components.
Arguments
- data: input data
- binary_as_factor: a boolean value that specifies whether binary columns should be treated as categorical (factor) columns. The default is FALSE.
Example of split_columns()
- Split the diamonds dataset
output <- split_columns(diamonds)
- Show only the discrete components of the dataset
output$discrete
## # A tibble: 53,940 x 3
## cut color clarity
## <ord> <ord> <ord>
## 1 Ideal E SI2
## 2 Premium E SI1
## 3 Good E VS1
## 4 Premium I VS2
## 5 Good J SI2
## 6 Very Good J VVS2
## 7 Very Good I VVS1
## 8 Very Good H SI1
## 9 Fair E VS2
## 10 Very Good H VS1
## # ... with 53,930 more rows
- Show only the continuous components of the dataset
output$continuous
## # A tibble: 53,940 x 7
## carat depth table price x y z
## <dbl> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1 0.23 61.5 55 326 3.95 3.98 2.43
## 2 0.21 59.8 61 326 3.89 3.84 2.31
## 3 0.23 56.9 65 327 4.05 4.07 2.31
## 4 0.290 62.4 58 334 4.2 4.23 2.63
## 5 0.31 63.3 58 335 4.34 4.35 2.75
## 6 0.24 62.8 57 336 3.94 3.96 2.48
## 7 0.24 62.3 57 336 3.95 3.98 2.47
## 8 0.26 61.9 55 337 4.07 4.11 2.53
## 9 0.22 65.1 61 337 3.87 3.78 2.49
## 10 0.23 59.4 61 338 4 4.05 2.39
## # ... with 53,930 more rows
- Show the number of discrete components of the dataset
output$num_discrete
## [1] 3
- Show the number of continuous components of the dataset
output$num_continuous
## [1] 7
- Show the number of missing components of the dataset
output$num_all_missing
## [1] 0
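The split pieces can then be passed on to the package's other functions; for example, the discrete component can be fed to plot_bar(), described above. A minimal sketch:
plot_bar(output$discrete)  # bar charts for cut, color, and clarity only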
Reflection
In summary, we believe this package is one of the better packages to use for exploratory data analysis (EDA). The EDA process can be frustrating at times because analysts need to examine the data and generate hypotheses in order to determine the next steps, which occupies a lot of their time. The DataExplorer package allows an analyst to view their data with one function (create_report()). DataExplorer automates the data handling and visualization processes so that users can focus on interpreting the data and extracting insights, which ultimately saves time and resources. Below, we list some of the pros and cons of the package.
- Pros:
- Quick and easy to use functions
- Provides functions for
- Summarizing data
- Finding missing values
- Plotting distributions
- Normalizing data
- Computing and visualizing principal component analysis
- Computing correlations
- Plotting relationships and transforming data
- Automates time-intensive processes of EDA
- Cons:
- This package almost entirely relies on visual techniques for understanding the data
- If too many graphs are displayed at once, it is difficult to focus on the details of a particular graph (information overload)
- Features that could be added or tweaked:
- Since the create_report function is one of the most important functions, we would like to suggest tweaks to that function
- We believe the graphs could include default colors to improve the visualizations.
- We would like to be able to:
- Modify the layout of the report
- Change the size of the graphs
- Set a report page limit
- Customize the summary statistics in the introduce() function
- Overall, being able to customize more features would help create more optimal reports.
Resources
- https://boxuancui.github.io/DataExplorer/
- https://blog.revolutionanalytics.com/2018/02/dataexplorer.html
- https://www.rdocumentation.org/packages/DataExplorer/versions/0.8.1
- https://cran.r-project.org/web/packages/DataExplorer/index.html
- https://cran.r-project.org/web/packages/DataExplorer/DataExplorer.pdf
- https://cran.r-project.org/web/packages/DataExplorer/vignettes/dataexplorer-intro.html
- https://arxiv.org/pdf/1904.02101.pdf
- https://www.rdocumentation.org/packages/DataExplorer/versions/0.2.3