Introduction to DataExplorer

'DataExplorer' is a package that distills and condenses datasets in R for quick, simple overviews. This can be especially useful for researchers who need to quickly check one aspect of their data, clients needing an immediate report, or students learning to navigate R.

The package was created by Boxuan Cui, who lists three main goals for 'DataExplorer':

Exploratory Data Analysis
Feature Engineering
Data Reporting

In this introduction, we'll be looking at creating the report, and then investigating some of the more useful functions.

Dig Deeper: You can learn more on Cui's GitHub page or their Introduction to DataExplorer directly on CRAN.

Exploring 'DataExplorer'

'DataExplorer' is a collection of functions that produce quick summaries for exploratory data analysis.

Install and Load

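Install and load 'DataExplorer' using the code below. The examples in this introduction also rely on 'dplyr' (for the 'storms' dataset and the '%>%' pipe) and 'pander' (for table formatting), so load those as well:

install.packages("DataExplorer")

library(DataExplorer)
library(dplyr)   # storms dataset and the %>% pipe
library(pander)  # pander() table formatting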

Easy Reporting

Perhaps the greatest tool in 'DataExplorer' is the 'create_report()' function. It runs the package's main profiling and plotting functions and combines their output into a pre-designed document. The code below writes an HTML report (report.html in the working directory by default) and opens it in your browser.

create_report(storms)

The report can also be customized with the 'configure_report()' function. For example, its 'add_*' arguments include or exclude the corresponding sections of the report, and the resulting configuration is passed to 'create_report()' through the 'config' argument. In the example below, the graphical introduction section of the report has been suppressed.

config <- configure_report(add_plot_intro = FALSE)
create_report(storms, config = config)

Detailed instructions on modifying the report can be found in the CRAN intro.

Some Useful Individual Functions

Correlation analysis is a quick example of how a single 'DataExplorer' function can turn a previously multi-step task into one line of code. The 'plot_correlation()' function draws a correlation heatmap. Below, rows with missing values are dropped first with 'na.omit()', and 'maxcat = 2L' leaves out discrete columns with more than two categories.

plot_correlation(na.omit(storms), maxcat = 2L)
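
As a variation on the call above, 'plot_correlation()' also takes a 'type' argument that restricts the plot to one kind of column. The sketch below assumes 'dplyr' is installed (it supplies 'storms') and keeps only the continuous columns, so 'maxcat' is not needed. The name 'storms_complete' is only illustrative.

```r
library(DataExplorer)
library(dplyr)  # assumed to be installed; supplies the storms dataset

# Drop incomplete rows first so every correlation uses the same observations
storms_complete <- na.omit(storms)

# type = "continuous" limits the heatmap to numeric columns;
# the other accepted values are "all" (the default) and "discrete"
plot_correlation(storms_complete, type = "continuous")
```

Restricting the plot this way can make the heatmap easier to read when a dataset has many categorical columns.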

Let's move on to the most basic, and often most useful, applications of 'DataExplorer'. Users of any ability level can use the package to examine the limits and depth of a dataset, either to gain a quick but full understanding of the information or to answer small but important questions.

To start, let's take a look at general overviews of a dataset. In base R, the function 'head()' is often used to give a quick view of the first few rows.

head(storms[,1:6]) %>%
  pander()

name	year	month	day	hour	lat
Amy	1975	6	27	0	27.5
Amy	1975	6	27	6	28.5
Amy	1975	6	27	12	29.5
Amy	1975	6	27	18	30.5
Amy	1975	6	28	0	31.5
Amy	1975	6	28	6	32.4

This is helpful, but it is only a small window that leaves a lot of unknowns. With 'DataExplorer', you can immediately see the full dimensions of the dataset. The function 'introduce()' returns a one-row summary that includes the number of rows and columns, how many columns are discrete or continuous, the total number of observations, and the number of missing values.

introduce(storms) %>%
  pander()

rows	columns	discrete_columns	continuous_columns	all_missing_columns	total_missing_values	complete_rows	total_observations	memory_usage
10010	13	3	10	0	13056	3482	130130	897000
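
To make the columns in that summary concrete, here is a minimal base R sketch that recomputes several of them by hand. It assumes 'dplyr' is installed for the 'storms' data. The name 'manual_summary' is only illustrative, and the exact numbers depend on the version of 'storms' that your 'dplyr' release ships.

```r
library(dplyr)  # assumed to be installed; supplies the storms dataset

# Recompute a few of the summary figures with base R functions
manual_summary <- data.frame(
  rows = nrow(storms),
  columns = ncol(storms),
  total_missing_values = sum(is.na(storms)),     # NA cells across the whole table
  complete_rows = sum(complete.cases(storms)),   # rows with no NA in any column
  total_observations = nrow(storms) * ncol(storms)
)

manual_summary
```

Read this way, 'total_observations' is simply rows times columns, and 'complete_rows' counts the rows without a single missing value.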

That information can also be shown graphically with the function 'plot_intro()', which gives a quick visual overview of the column types and of how much of the dataset is missing.

plot_intro(storms)

You can drill down further with 'plot_missing()', which shows the share of missing values in each column, to find exactly where the data is missing.

plot_missing(storms)
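
The per-column counts behind a chart like this can be sketched in base R as well. The example below is an illustration rather than the package's own implementation. It assumes 'dplyr' is installed for 'storms', and the variable names are hypothetical.

```r
library(dplyr)  # assumed to be installed; supplies the storms dataset

# Number and share of missing values in each column
missing_counts <- colSums(is.na(storms))
missing_share <- colMeans(is.na(storms))

# Keep only the columns with at least one missing value,
# sorted so the most incomplete column comes first
incomplete <- sort(missing_counts[missing_counts > 0], decreasing = TRUE)
incomplete

# Matching shares for those columns, rounded for display
round(missing_share[names(incomplete)], 3)
```

Having the raw counts at hand is useful when you need to quote a figure in a report instead of reading it off a plot.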

Summary

'DataExplorer' is a powerful tool to make summarizing your dataset simple and painless. Quick, seemingly simple questions such as "where is there missing data" now have quick, simple answers.

Additionally, data can now be reported immediately and efficiently by those who do not have the time or resources to customize intricate documents. The functions that create and customize the report allow data scientists to focus on interpreting the data rather than on building visualizations.

Further Resources

Check out these links to learn more about package 'DataExplorer'.

Code Through Assignment - DataExplorer

Tim Ness

08 October, 2020