Background Information

Overview

The DataExplorer package was created to automate data exploration and treatment. It gives users, typically analysts or modelers, a quick first look at the data they are analyzing, so that they can focus on understanding the data and on gathering insights that would otherwise remain hidden.

It also aims to automate the exploratory data analysis (EDA) process and simplify report generation so that users can focus on what the data shows. The commands are concise and the syntax is intuitive, so the analyst does not have to spend time writing exploration code by hand. Manual coding has been identified as a time-consuming step in the data science pipeline, and this package is intended to reduce that cost through automation.

Version History

  • Current version: 0.8.1
  • This package was recently updated; the newest version was released in January 2020. Since version 0.2.3, the function names have changed and the package has gained a number of new functions. Version 0.2.3 included, among others, the functions CollapseCategory, SplitColType, DensityContinuous, GenerateReport, PlotMissing, CorrelationContinuous, and BarDiscrete. This document discusses functions from the current version of the package.

Dependency on Other Packages

This package relies on:

  • ggplot2
  • xfun
  • farver
  • evaluate
  • tinytex
  • igraph
  • data.table
  • scales
  • rmarkdown
  • networkD3
  • gridExtra
  • htmlwidgets

When you install the DataExplorer package, all of the above packages are installed automatically.
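
The list above can be checked on a given machine. The following is a minimal sketch that assumes DataExplorer is already installed; the variable names are illustrative. It prints the installed version, the packages that DataExplorer imports directly, and the full set of direct and indirect dependencies found among the installed packages.

```r
# Installed version, for comparison with the version discussed in this document
packageVersion("DataExplorer")

# Packages imported directly, read from the package's DESCRIPTION file
imports_field <- packageDescription("DataExplorer")$Imports
# Strip version constraints such as "(>= 1.0)" and split on commas
direct_names <- trimws(strsplit(gsub("\\s*\\([^)]*\\)", "", imports_field), ",")[[1]])
print(direct_names)

# Direct plus indirect dependencies, resolved against the installed packages
all_deps <- tools::package_dependencies(
  "DataExplorer",
  db = installed.packages(),
  which = c("Depends", "Imports"),
  recursive = TRUE
)[["DataExplorer"]]
print(sort(all_deps))
```

Packages that appear only in the second list are pulled in indirectly, through one of the direct imports.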

Similarities to Other Packages

  • ggplot2
    • Both ggplot2 and DataExplorer can create graphs. However, plotting is the sole purpose of ggplot2, whereas DataExplorer also lets a user quickly analyze the data to gather insights.
  • funModeling
    • Like DataExplorer, the funModeling package is a set of tools for exploratory data analysis. Unlike DataExplorer, funModeling also includes visualizations that can be used for predictive models and non-standard correlation analysis.
  • SmartEDA
    • SmartEDA and DataExplorer both conduct exploratory data analysis (EDA). SmartEDA, however, has additional features, such as an extension of the data.table package and plots for both numerical and categorical values. SmartEDA also gives descriptive statistics.
  • dlookr
    • The dlookr package is used for creating high-quality reports, similar to DataExplorer. What makes this package unique is that it can select skewed variables, which can also help reveal dependencies in the data.
  • dataMaid
    • dataMaid has two functions that DataExplorer does not: check and makeDataReport. The check function examines data consistency and validity, while makeDataReport creates a report in different formats. Both DataExplorer and dataMaid, though, can produce graphs of the data.
  • exploreR
    • The exploreR package is also useful for data exploration. It differs in that its analysis is based on linear regression.
  • RtutoR
    • The RtutoR package is used for automated reporting, and its reports can be saved in PPTX format. It is similar to DataExplorer in that it is used for reporting and data exploration. However, this package is unique because it can identify the top k relevant variables for a specific criterion.



Examples of Usage

Set-Up

First, the DataExplorer package needs to be installed and loaded. The datasets and utils packages ship with R, so they do not need to be installed; they only need to be loaded.

# install.packages("DataExplorer")
library(DataExplorer)
library(datasets)
require(utils)


The diamonds data set, which ships with the ggplot2 package, is used in the following examples. The data can be loaded as follows.

library(ggplot2)
data(diamonds)


Functions

create_report()

Objective

  • The main objective of this function is to create a data profiling report based on the settings configured with the configure_report() function.

Arguments

  • data : the input data to be used for the report generation
  • output_format : specifies the type of report to be made. The default is html_document(toc = TRUE, toc_depth = 6, theme = "yeti").
  • output_file : specifies the output file name. The default is “report.html”.
  • output_dir : specifies the directory for the report to be saved in. The default is the user’s current directory.
  • y : the name of the response variable, if any. This response variable will be passed to appropriate plotting functions automatically.
  • config : the report configuration generated by configure_report.
  • report_title : the report title. The default is “Data Profiling Report”.

Example of create_report()

  • Create a report where:
    • the data used is the “diamonds” dataset
    • the output file will be called “practice_report.html”
    • the output file will be saved in the current working directory
    • the report is configured with the default settings returned by configure_report()
# create_report(
#  data = diamonds,
#  output_file = "practice_report.html",
#  output_dir = getwd(),
#  config = configure_report()
# )
Screenshots of a sample Data Profiling Report

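
The y and report_title arguments described above can be combined in the same call. The following is a minimal sketch under a few assumptions: the file name, the title, and the 2,000-row sample (used only to keep rendering fast) are illustrative choices, and rendering requires a working rmarkdown and pandoc setup.

```r
library(DataExplorer)
library(ggplot2)  # provides the diamonds data set

# Work on a random sample so that the report renders quickly (illustrative choice)
set.seed(1)
diamonds_sample <- diamonds[sample(nrow(diamonds), 2000), ]

# Write the report to a temporary directory instead of the working directory
out_dir <- tempdir()

create_report(
  data = diamonds_sample,
  y = "price",                                  # response variable
  output_file = "diamonds_price_report.html",   # hypothetical file name
  output_dir = out_dir,
  report_title = "Diamonds Profiling Report"    # hypothetical title
)

# Confirm that the report file was written
file.exists(file.path(out_dir, "diamonds_price_report.html"))
```

Because y is set, the response variable is passed on to the plotting functions that accept one.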



configure_report()

Objective

  • The main objective of this function is to configure a report template.

Arguments

  • There are many arguments that can be placed into the configure_report() function. The majority of them can be separated into two categories:

    • adding a plot, or
    • specifying the arguments for the added plot
  • For example, the argument add_plot_intro controls whether a plot of basic information about the input data is added. A second argument, plot_intro_args, takes the list of arguments to be passed to the plot_intro() function.

  • However, there are still some arguments that are unique. These are described below.

| Argument | Explanation | Default Value |
|---|---|---|
| add_introduce | a boolean argument that specifies whether or not the introduce() function is included in the report | TRUE |
| global_ggtheme | the global setting for the theme of the report | quote(theme_gray()) |


  • The complete list of arguments is shown and described below.
| Argument Name | Explanation | Default Value |
|---|---|---|
| add_introduce | a boolean argument that specifies whether or not the introduce() function is included in the report | TRUE |
| add_plot_intro | a boolean argument that specifies whether or not the plot_intro() function is included in the report | TRUE |
| add_plot_str | a boolean argument that specifies whether or not the plot_str() function is included in the report | TRUE |
| add_plot_missing | a boolean argument that specifies whether or not the plot_missing() function is included in the report | TRUE |
| add_plot_histogram | a boolean argument that specifies whether or not the plot_histogram() function is included in the report | TRUE |
| add_plot_density | a boolean argument that specifies whether or not the plot_density() function is included in the report | FALSE |
| add_plot_qq | a boolean argument that specifies whether or not the plot_qq() function is included in the report | TRUE |
| add_plot_bar | a boolean argument that specifies whether or not the plot_bar() function is included in the report | TRUE |
| add_plot_correlation | a boolean argument that specifies whether or not the plot_correlation() function is included in the report | TRUE |
| add_plot_prcomp | a boolean argument that specifies whether or not the plot_prcomp() function is included in the report | TRUE |
| add_plot_boxplot | a boolean argument that specifies whether or not the plot_boxplot() function is included in the report | TRUE |
| add_plot_scatterplot | a boolean argument that specifies whether or not the plot_scatterplot() function is included in the report | TRUE |
| introduce_args | a list of arguments to be passed to the introduce() function | list() |
| plot_intro_args | a list of arguments to be passed to the plot_intro() function | list() |
| plot_str_args | a list of arguments to be passed to the plot_str() function | list(type = "diagonal", fontSize = 35, width = 1000, margin = list(left = 350, right = 250)) |
| plot_missing_args | a list of arguments to be passed to the plot_missing() function | list() |
| plot_histogram_args | a list of arguments to be passed to the plot_histogram() function | list() |
| plot_density_args | a list of arguments to be passed to the plot_density() function | list() |
| plot_qq_args | a list of arguments to be passed to the plot_qq() function | list(sampled_rows = 1000L) |
| plot_bar_args | a list of arguments to be passed to the plot_bar() function | list() |
| plot_correlation_args | a list of arguments to be passed to the plot_correlation() function | list(cor_args = list(use = "pairwise.complete.obs")) |
| plot_prcomp_args | a list of arguments to be passed to the plot_prcomp() function | list() |
| plot_boxplot_args | a list of arguments to be passed to the plot_boxplot() function | list() |
| plot_scatterplot_args | a list of arguments to be passed to the plot_scatterplot() function | list(sampled_rows = 1000L) |
| global_ggtheme | the global setting for the theme of the report | quote(theme_gray()) |


Example of configure_report()

  • Configure a report where:

    • A density plot is included
    • QQ plots and bar plots are not included
    • The global theme is “theme_light”
configure_report(add_plot_density = TRUE, add_plot_bar = FALSE, add_plot_qq = FALSE, global_ggtheme = quote(theme_light()))
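
The add_* switches and the *_args lists can be combined in one configuration. The following is a minimal sketch, assuming the argument names listed in the table above; the specific settings (maxcat = 5 for the bar charts) are illustrative. The result is stored so that it can be inspected and later passed to create_report() through its config argument.

```r
library(DataExplorer)

# Add density plots, skip QQ plots, and forward maxcat = 5 to plot_bar()
config <- configure_report(
  add_plot_density = TRUE,
  add_plot_qq = FALSE,
  plot_bar_args = list(maxcat = 5),
  global_ggtheme = quote(theme_light())
)

# Inspect the top level of the configuration object
str(config, max.level = 1)

# The configuration would then be handed to create_report(), for example:
# create_report(data = diamonds, config = config)
```

Storing the configuration in a variable makes it easy to reuse the same settings across several reports.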



drop_columns()

Objective

  • The main purpose of this function is to quickly drop selected variables in the data frame.

Arguments

  • data : the input data to be used

  • ind : a vector of either the names or column positions of the variables to be dropped

Example of drop_columns()

  • This is the original dataset.
head(diamonds)
## # A tibble: 6 x 10
##   carat cut       color clarity depth table price     x     y     z
##   <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1 0.23  Ideal     E     SI2      61.5    55   326  3.95  3.98  2.43
## 2 0.21  Premium   E     SI1      59.8    61   326  3.89  3.84  2.31
## 3 0.23  Good      E     VS1      56.9    65   327  4.05  4.07  2.31
## 4 0.290 Premium   I     VS2      62.4    58   334  4.2   4.23  2.63
## 5 0.31  Good      J     SI2      63.3    58   335  4.34  4.35  2.75
## 6 0.24  Very Good J     VVS2     62.8    57   336  3.94  3.96  2.48


  • Columns 2 through 5 are dropped from the diamonds dataset using the seq() function.
drop_columns(diamonds, seq(2,5))                
## # A tibble: 53,940 x 6
##    carat table price     x     y     z
##    <dbl> <dbl> <int> <dbl> <dbl> <dbl>
##  1 0.23     55   326  3.95  3.98  2.43
##  2 0.21     61   326  3.89  3.84  2.31
##  3 0.23     65   327  4.05  4.07  2.31
##  4 0.290    58   334  4.2   4.23  2.63
##  5 0.31     58   335  4.34  4.35  2.75
##  6 0.24     57   336  3.94  3.96  2.48
##  7 0.24     57   336  3.95  3.98  2.47
##  8 0.26     55   337  4.07  4.11  2.53
##  9 0.22     61   337  3.87  3.78  2.49
## 10 0.23     61   338  4     4.05  2.39
## # ... with 53,930 more rows


  • Columns 1, 2, and 4 are dropped from the diamonds dataset by passing their positions in a vector built with the c() function.
drop_columns(diamonds, c(1,2,4))               
## # A tibble: 53,940 x 7
##    color depth table price     x     y     z
##    <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
##  1 E      61.5    55   326  3.95  3.98  2.43
##  2 E      59.8    61   326  3.89  3.84  2.31
##  3 E      56.9    65   327  4.05  4.07  2.31
##  4 I      62.4    58   334  4.2   4.23  2.63
##  5 J      63.3    58   335  4.34  4.35  2.75
##  6 J      62.8    57   336  3.94  3.96  2.48
##  7 I      62.3    57   336  3.95  3.98  2.47
##  8 H      61.9    55   337  4.07  4.11  2.53
##  9 E      65.1    61   337  3.87  3.78  2.49
## 10 H      59.4    61   338  4     4.05  2.39
## # ... with 53,930 more rows


  • The “clarity” column is dropped from the diamonds dataset by directly calling the column name.
drop_columns(diamonds, "clarity")               
## # A tibble: 53,940 x 9
##    carat cut       color depth table price     x     y     z
##    <dbl> <ord>     <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
##  1 0.23  Ideal     E      61.5    55   326  3.95  3.98  2.43
##  2 0.21  Premium   E      59.8    61   326  3.89  3.84  2.31
##  3 0.23  Good      E      56.9    65   327  4.05  4.07  2.31
##  4 0.290 Premium   I      62.4    58   334  4.2   4.23  2.63
##  5 0.31  Good      J      63.3    58   335  4.34  4.35  2.75
##  6 0.24  Very Good J      62.8    57   336  3.94  3.96  2.48
##  7 0.24  Very Good I      62.3    57   336  3.95  3.98  2.47
##  8 0.26  Very Good H      61.9    55   337  4.07  4.11  2.53
##  9 0.22  Fair      E      65.1    61   337  3.87  3.78  2.49
## 10 0.23  Very Good H      59.4    61   338  4     4.05  2.39
## # ... with 53,930 more rows


  • The “depth” and “table” columns are dropped from the diamonds dataset by passing their names in a vector built with the c() function.
drop_columns(diamonds, c("depth", "table"))     
## # A tibble: 53,940 x 8
##    carat cut       color clarity price     x     y     z
##    <dbl> <ord>     <ord> <ord>   <int> <dbl> <dbl> <dbl>
##  1 0.23  Ideal     E     SI2       326  3.95  3.98  2.43
##  2 0.21  Premium   E     SI1       326  3.89  3.84  2.31
##  3 0.23  Good      E     VS1       327  4.05  4.07  2.31
##  4 0.290 Premium   I     VS2       334  4.2   4.23  2.63
##  5 0.31  Good      J     SI2       335  4.34  4.35  2.75
##  6 0.24  Very Good J     VVS2      336  3.94  3.96  2.48
##  7 0.24  Very Good I     VVS1      336  3.95  3.98  2.47
##  8 0.26  Very Good H     SI1       337  4.07  4.11  2.53
##  9 0.22  Fair      E     VS2       337  3.87  3.78  2.49
## 10 0.23  Very Good H     VS1       338  4     4.05  2.39
## # ... with 53,930 more rows
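
In the calls above, the result is only printed, and each later call still sees all ten columns of diamonds. To keep the reduced data, the result has to be assigned. The following is a minimal sketch; the object names are illustrative, and the base R line is included only for comparison.

```r
library(DataExplorer)
library(ggplot2)  # provides the diamonds data set

# Assign the result to keep the reduced data set
diamonds_no_dims <- drop_columns(diamonds, c("x", "y", "z"))
names(diamonds_no_dims)

# The original tibble is unchanged
ncol(diamonds)

# Base R selection of the same columns, for comparison
base_version <- diamonds[, !(names(diamonds) %in% c("x", "y", "z"))]
identical(names(base_version), names(diamonds_no_dims))
```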



dummify()

Overview

  • Data dummification is also known as “one-hot encoding” or “feature binarization”. It turns each category of a discrete feature into a distinct column with binary (0/1) values. These indicator values make it easier to use categorical data in numeric calculations, such as those related to probability.

Arguments

  • data : input data
  • maxcat : the maximum number of categories allowed for a discrete feature. The default is 50.
  • select : the names of the features to be dummified. The default is NULL, which selects all discrete features.

Note that continuous features are ignored even when they are named in select, and any discrete feature with more than maxcat categories is skipped.

The dummified data set keeps the features that were not dummified, but each dummified feature is replaced by its indicator columns and the order of the columns may change.

Examples of dummify()

  • Here, no select argument is passed, so every discrete column with at most maxcat categories is dummified. Columns with more categories are reported and left as they are.
str(dummify(diamonds, maxcat = 5))
## 2 features with more than 5 categories ignored!
## color: 7 categories
## clarity: 8 categories
## Classes 'tbl_df', 'tbl' and 'data.frame':    53940 obs. of  14 variables:
##  $ carat        : num  0.23 0.21 0.23 0.29 0.31 0.24 0.24 0.26 0.22 0.23 ...
##  $ depth        : num  61.5 59.8 56.9 62.4 63.3 62.8 62.3 61.9 65.1 59.4 ...
##  $ table        : num  55 61 65 58 58 57 57 55 61 61 ...
##  $ price        : int  326 326 327 334 335 336 336 337 337 338 ...
##  $ x            : num  3.95 3.89 4.05 4.2 4.34 3.94 3.95 4.07 3.87 4 ...
##  $ y            : num  3.98 3.84 4.07 4.23 4.35 3.96 3.98 4.11 3.78 4.05 ...
##  $ z            : num  2.43 2.31 2.31 2.63 2.75 2.48 2.47 2.53 2.49 2.39 ...
##  $ color        : Ord.factor w/ 7 levels "D"<"E"<"F"<"G"<..: 2 2 2 6 7 7 6 5 2 5 ...
##  $ clarity      : Ord.factor w/ 8 levels "I1"<"SI2"<"SI1"<..: 2 3 5 4 2 6 7 3 4 5 ...
##  $ cut_Fair     : int  0 0 0 0 0 0 0 0 1 0 ...
##  $ cut_Good     : int  0 0 1 0 1 0 0 0 0 0 ...
##  $ cut_Ideal    : int  1 0 0 0 0 0 0 0 0 0 ...
##  $ cut_Premium  : int  0 1 0 1 0 0 0 0 0 0 ...
##  $ cut_Very.Good: int  0 0 0 0 0 1 1 1 0 1 ...
##  - attr(*, ".internal.selfref")=<externalptr>


  • Here, only the manually-selected columns will be dummified.
str(dummify(diamonds, select = c("cut", "color")))
## Classes 'tbl_df', 'tbl' and 'data.frame':    53940 obs. of  20 variables:
##  $ carat        : num  0.23 0.21 0.23 0.29 0.31 0.24 0.24 0.26 0.22 0.23 ...
##  $ depth        : num  61.5 59.8 56.9 62.4 63.3 62.8 62.3 61.9 65.1 59.4 ...
##  $ table        : num  55 61 65 58 58 57 57 55 61 61 ...
##  $ price        : int  326 326 327 334 335 336 336 337 337 338 ...
##  $ x            : num  3.95 3.89 4.05 4.2 4.34 3.94 3.95 4.07 3.87 4 ...
##  $ y            : num  3.98 3.84 4.07 4.23 4.35 3.96 3.98 4.11 3.78 4.05 ...
##  $ z            : num  2.43 2.31 2.31 2.63 2.75 2.48 2.47 2.53 2.49 2.39 ...
##  $ clarity      : Ord.factor w/ 8 levels "I1"<"SI2"<"SI1"<..: 2 3 5 4 2 6 7 3 4 5 ...
##  $ cut_Fair     : int  0 0 0 0 0 0 0 0 1 0 ...
##  $ cut_Good     : int  0 0 1 0 1 0 0 0 0 0 ...
##  $ cut_Ideal    : int  1 0 0 0 0 0 0 0 0 0 ...
##  $ cut_Premium  : int  0 1 0 1 0 0 0 0 0 0 ...
##  $ cut_Very.Good: int  0 0 0 0 0 1 1 1 0 1 ...
##  $ color_D      : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ color_E      : int  1 1 1 0 0 0 0 0 1 0 ...
##  $ color_F      : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ color_G      : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ color_H      : int  0 0 0 0 0 0 0 1 0 1 ...
##  $ color_I      : int  0 0 0 1 0 0 1 0 0 0 ...
##  $ color_J      : int  0 0 0 0 1 1 0 0 0 0 ...
##  - attr(*, ".internal.selfref")=<externalptr>
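
The encoding is easier to follow on a very small data set. The following is a minimal sketch using a hypothetical four-row data frame (the names toy, weight, and size are made up for illustration). It dummifies the single discrete feature and checks that each row has exactly one indicator set.

```r
library(DataExplorer)

# Hypothetical toy data: one continuous feature and one discrete feature
toy <- data.frame(
  weight = c(1.2, 0.7, 3.4, 0.9),
  size = factor(c("S", "M", "L", "M"))
)

encoded <- dummify(toy)
print(encoded)

# Indicator columns are named after the feature and the category
indicator_cols <- grep("^size_", names(encoded), value = TRUE)

# In a one-hot encoding, each row sums to 1 across the indicator columns
row_totals <- rowSums(as.data.frame(encoded)[, indicator_cols])
print(row_totals)
```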



introduce()

Overview

  • Describes basic information for the input data, including:
    • rows: number of rows
    • columns: number of columns
    • discrete_columns: number of discrete columns
    • continuous_columns: number of continuous columns
    • all_missing_columns: number of columns with everything missing
    • total_missing_values: number of missing observations
    • complete_rows: number of rows without missing values
    • total_observations: total number of observations
    • memory_usage: estimated memory allocation in bytes

Arguments

  • The only argument is data, which refers to the input data.

Example of introduce()

introduce(diamonds)
## # A tibble: 1 x 9
##    rows columns discrete_columns continuous_colu~ all_missing_col~
##   <int>   <int>            <int>            <int>            <int>
## 1 53940      10                3                7                0
## # ... with 4 more variables: total_missing_values <int>, complete_rows <int>,
## #   total_observations <int>, memory_usage <dbl>
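
Because the printed tibble truncates some columns, it can help to store the result and read individual fields. The following is a minimal sketch that derives a few ratios from the fields listed above; the variable name info is illustrative.

```r
library(DataExplorer)
library(ggplot2)  # provides the diamonds data set

info <- introduce(diamonds)

# Share of all cells that are missing
info$total_missing_values / info$total_observations

# Share of rows without any missing value
info$complete_rows / info$rows

# memory_usage is reported in bytes; convert to megabytes for readability
info$memory_usage / 1024^2
```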



plot_bar()

Overview

  • This function plots a bar chart for each discrete feature, based either on the frequency of each category or on the sum of a continuous feature within each category.

Arguments

  • data : input data
  • with : name of continuous feature to be summed. Default is NULL, i.e., frequency.
  • maxcat : maximum categories allowed for each feature. Default is 50.
  • order_bar : logical, indicating if bars should be ordered. Default is TRUE.
  • binary_as_factor : logical, indicating whether binary features should be treated as categorical. Default is TRUE.
  • title : plot title
  • ggtheme : a complete ggplot2 theme. Default is theme_gray().
  • theme_config : a list of configurations to be passed to theme
  • nrow : number of rows per page. Default is 3.
  • ncol : number of columns per page. Default is 3.
  • parallel : logical, indicating whether parallel processing is enabled. Default is FALSE.

Examples of plot_bar()

  • Plotting the categorical data within the diamonds data set
plot_bar(diamonds)


  • Plotting the categorical data with 5 categories or fewer within the diamonds data set
plot_bar(diamonds, maxcat = 5, ggtheme = theme_light(), theme_config = list(
  "plot.background" = element_rect(fill = "light gray"),
  "aspect.ratio" = 1
), title = "Frequency of Each Cut of Diamonds")
## 2 columns ignored with more than 5 categories.
## color: 7 categories
## clarity: 8 categories
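
The with argument described above changes what the bars measure. The following is a minimal sketch, assuming a DataExplorer version that supports the with argument, in which the bars show the summed price per category instead of the row count.

```r
library(DataExplorer)
library(ggplot2)  # provides the diamonds data set

# Bars show the total price within each category of cut, color, and clarity
plot_bar(diamonds, with = "price")
```

Comparing this plot with the frequency plot shows whether a category is large because it has many diamonds or because its diamonds are expensive.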



split_columns()

Overview

  • This function splits the input data into its discrete and continuous components.

Arguments

  • data : input data
  • binary_as_factor : a boolean value that specifies whether binary variables are treated as categorical (discrete). The default is FALSE.

Example of split_columns()

  • Split the diamonds dataset
output <- split_columns(diamonds)


  • Show only the discrete components of the dataset
output$discrete
## # A tibble: 53,940 x 3
##    cut       color clarity
##    <ord>     <ord> <ord>  
##  1 Ideal     E     SI2    
##  2 Premium   E     SI1    
##  3 Good      E     VS1    
##  4 Premium   I     VS2    
##  5 Good      J     SI2    
##  6 Very Good J     VVS2   
##  7 Very Good I     VVS1   
##  8 Very Good H     SI1    
##  9 Fair      E     VS2    
## 10 Very Good H     VS1    
## # ... with 53,930 more rows


  • Show only the continuous components of the dataset
output$continuous
## # A tibble: 53,940 x 7
##    carat depth table price     x     y     z
##    <dbl> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
##  1 0.23   61.5    55   326  3.95  3.98  2.43
##  2 0.21   59.8    61   326  3.89  3.84  2.31
##  3 0.23   56.9    65   327  4.05  4.07  2.31
##  4 0.290  62.4    58   334  4.2   4.23  2.63
##  5 0.31   63.3    58   335  4.34  4.35  2.75
##  6 0.24   62.8    57   336  3.94  3.96  2.48
##  7 0.24   62.3    57   336  3.95  3.98  2.47
##  8 0.26   61.9    55   337  4.07  4.11  2.53
##  9 0.22   65.1    61   337  3.87  3.78  2.49
## 10 0.23   59.4    61   338  4     4.05  2.39
## # ... with 53,930 more rows


  • Show the number of discrete components of the dataset
output$num_discrete
## [1] 3


  • Show the number of continuous components of the dataset
output$num_continuous
## [1] 7


  • Show the number of columns in the dataset whose values are all missing
output$num_all_missing
## [1] 0
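
The diamonds data set has no binary columns, so the effect of binary_as_factor is not visible in the example above. The following is a minimal sketch on a hypothetical toy data frame (the names toy, amount, flag, and group are made up for illustration) that compares the two settings.

```r
library(DataExplorer)

# Hypothetical toy data with a 0/1 flag stored as a number
toy <- data.frame(
  amount = c(10.5, 3.2, 8.8, 1.1),
  flag = c(0, 1, 1, 0),
  group = c("a", "b", "a", "c"),
  stringsAsFactors = FALSE
)

default_split <- split_columns(toy)
binary_split <- split_columns(toy, binary_as_factor = TRUE)

# Compare which columns land on the discrete side in each case
names(default_split$discrete)
names(binary_split$discrete)
```

With binary_as_factor = TRUE, the numeric flag column is expected to move from the continuous side to the discrete side.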



Reflection



Resources