The DataExplorer package simplifies Exploratory Data Analysis (EDA) in R. It lets the user focus on understanding the data and extracting insights by automating data handling and visualisation.
This guide covers the following DataExplorer functions:

- introduce()
- plot_intro()
- plot_boxplot()
- plot_missing()
- plot_histogram()
- plot_bar()
- plot_correlation()
- create_report(), which creates all the plots above and more.

This vignette focuses on the visualisation capabilities of DataExplorer. However, the package also contains functions for feature engineering.
The main package used in this guide is DataExplorer. For this example, we will work with the penguins dataset from the palmerpenguins package. Two additional packages, ggplot2 and dplyr, will be used alongside DataExplorer. To keep things simple, the tidyverse package will be installed, which contains both ggplot2 and dplyr.
Use the commands below to download the required packages from CRAN:
# Install required packages
install.packages("DataExplorer")
install.packages("tidyverse")
install.packages("palmerpenguins")

Once downloaded, use the command below to load the packages for use.
# Load in libraries
library(palmerpenguins)
library(DataExplorer)
library(tidyverse)

First, run the glimpse() function from the dplyr package to look at the dataset and its columns. This function is similar to str() from the R utils package, but it shows as much of each column as fits on screen.
# Glimpse at the penguins data
glimpse(penguins)
## Rows: 344
## Columns: 8
## $ species <fct> Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Adel…
## $ island <fct> Torgersen, Torgersen, Torgersen, Torgersen, Torgerse…
## $ bill_length_mm <dbl> 39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2, 34.1, …
## $ bill_depth_mm <dbl> 18.7, 17.4, 18.0, NA, 19.3, 20.6, 17.8, 19.6, 18.1, …
## $ flipper_length_mm <int> 181, 186, 195, NA, 193, 190, 181, 195, 193, 190, 186…
## $ body_mass_g <int> 3750, 3800, 3250, NA, 3450, 3650, 3625, 4675, 3475, …
## $ sex <fct> male, female, female, NA, female, male, female, male…
## $ year <int> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007…
The introduce() function from DataExplorer provides some basic information about the dataset. It takes a single argument: the dataset.
# introduce() to get basic information on dataset
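introduce(penguins)

introduce() displays the number of rows, columns, discrete columns, continuous columns, all-missing columns, total missing values, complete rows, and total observations, along with the memory usage of the dataset.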
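As a rough illustration of where such figures come from, the sketch below recomputes a few comparable counts in base R. It is not DataExplorer's implementation. The data frame `first_rows` is an illustrative name and holds only the first six rows of five columns, copied from the glimpse() output above, so the numbers describe that small sample rather than the full dataset.

```r
# First six rows of five penguins columns, copied from the glimpse() output.
# `first_rows` is an illustrative name, not part of any package.
first_rows <- data.frame(
  species = factor(rep("Adelie", 6)),
  bill_length_mm = c(39.1, 39.5, 40.3, NA, 36.7, 39.3),
  flipper_length_mm = c(181L, 186L, 195L, NA, 193L, 190L),
  body_mass_g = c(3750L, 3800L, 3250L, NA, 3450L, 3650L),
  sex = factor(c("male", "female", "female", NA, "female", "male"))
)

# Assumption: numeric columns count as continuous, everything else as discrete.
is_continuous <- vapply(first_rows, is.numeric, logical(1))

overview <- data.frame(
  rows = nrow(first_rows),
  columns = ncol(first_rows),
  discrete_columns = sum(!is_continuous),
  continuous_columns = sum(is_continuous),
  total_missing_values = sum(is.na(first_rows)),
  complete_rows = sum(complete.cases(first_rows)),
  total_observations = nrow(first_rows) * ncol(first_rows)
)
print(overview)
```

In this sample a single row carries all four missing values, which is why five of the six rows count as complete.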
Additionally the plot_intro() function can be used to generate a plot that easily visualises the information from the introduce() function.
# plot introduction
plot_intro(penguins)

plot_intro() also takes the optional arguments shown in the table below. Most of them, such as title, ggtheme, and theme_config, are also accepted by the other plotting functions explored in this guide.
| Argument | What it does |
|---|---|
| geom_label_args | A list of arguments to geom_label |
| title | Adds a title to the plot |
| ggtheme | Applies a full ggplot2 theme |
| theme_config | A list of ggplot2 configurations passed to the theme |
These optional arguments may appear complex without prior ggplot2 knowledge. The examples below show how to apply them.
DataExplorer allows plots to be customised with the help of the ggplot2 package. The ggtheme argument applies a complete ggplot2 theme. For this example, we have used theme_minimal().
The full list of complete themes is given in the ggplot2 documentation.
The title argument adds a title to the plot.
# plot_intro() with a title and a theme
plot_intro(penguins,
           title = "Missing Penguin Data Plot Title",
           ggtheme = theme_minimal())

DataExplorer allows further ggplot2 customisation of the plot's visual elements through the theme_config argument. For this example, plot.title is used to set the title colour to orange.
The available modification arguments are described in the ggplot2 theme() documentation.
# plot_intro() with a theme, a title, and a custom title colour
plot_intro(
  penguins,
  ggtheme = theme_minimal(),
  title = "A Plot Title",
  theme_config = theme(plot.title = element_text(color = "orange"))
)

As the plot shows, some of the labels are cut off. The quickest fix is to use geom_label_args to pass the hjust argument to ggplot2's geom_label() with the value "inward", which aligns the text towards the centre of the plot.
For more information on geom_label(), see the ggplot2 documentation.
# plot_intro() with fixed labels
plot_intro(
  penguins,
  ggtheme = theme_minimal(),
  title = "Plot with Text Labels Not Cut Off",
  theme_config = theme(plot.title = element_text(color = "orange")),
  geom_label_args = list(hjust = "inward")
)

The plot_missing() function returns a plot showing the frequency of missing values for each column.
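To make "frequency of missing values" concrete, the per-column figures can be approximated in base R. This is a minimal sketch, not the code DataExplorer runs. The names `first_rows` and `missing_profile` are illustrative, and the data are the first six rows of three columns as printed in the glimpse() output.

```r
# First six rows of three penguins columns, copied from the glimpse() output.
first_rows <- data.frame(
  bill_length_mm = c(39.1, 39.5, 40.3, NA, 36.7, 39.3),
  body_mass_g = c(3750L, 3800L, 3250L, NA, 3450L, 3650L),
  year = rep(2007L, 6)
)

# is.na() gives a logical matrix; column sums and means turn it into
# a count and a proportion of missing values per column.
missing_profile <- data.frame(
  feature = names(first_rows),
  num_missing = colSums(is.na(first_rows)),
  pct_missing = colMeans(is.na(first_rows)),
  row.names = NULL
)
print(missing_profile)
```

A table like this is useful when exact counts are needed alongside the plot.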
# Plot missing values
plot_missing(penguins)

plot_histogram() draws a separate histogram for each continuous column in the dataset.
# Plot each continuous column
plot_histogram(penguins)

plot_boxplot() is shown below. It requires the by argument, which names the feature the boxplots are broken down by. For this example, we have used the species column. We have also used ncol to set the number of columns in the displayed output.
# plot a boxplot by species of penguins
plot_boxplot(penguins, by = "species", ncol = 2L)

plot_bar() visualises each discrete column of the dataset. For this example, we have also used the argument order_bar = TRUE, which orders the bars in the chart.
# Plot each discrete column
plot_bar(penguins,
         order_bar = TRUE)

plot_correlation() generates a correlation heatmap. The type argument selects which columns are included and can be set to "all", "discrete", or "continuous". This example uses type = "all".
# plot correlation matrix for all features
plot_correlation(
  penguins,
  type = "all"
)

Note: There are more arguments you can apply to these individual plots.
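One such argument is cor_args on plot_correlation(), which the package documentation describes as a list passed on to cor(). This matters for penguins because cor() returns NA for any pair of columns that contains missing values unless told otherwise. The sketch below shows the effect in base R on the first six rows from the glimpse() output (`first_rows` is an illustrative name), then passes the same setting to plot_correlation() only if DataExplorer is installed.

```r
# First six rows of two continuous penguins columns, from the glimpse() output.
first_rows <- data.frame(
  flipper_length_mm = c(181, 186, 195, NA, 193, 190),
  body_mass_g = c(3750, 3800, 3250, NA, 3450, 3650)
)

# Default behaviour: a single NA makes the whole coefficient NA.
default_cor <- cor(first_rows$flipper_length_mm, first_rows$body_mass_g)

# Pairwise deletion: only rows where both values are present are used.
pairwise_cor <- cor(
  first_rows$flipper_length_mm,
  first_rows$body_mass_g,
  use = "pairwise.complete.obs"
)

print(c(default = default_cor, pairwise = pairwise_cor))

# Assumption: cor_args is forwarded to cor() by plot_correlation().
if (requireNamespace("DataExplorer", quietly = TRUE)) {
  DataExplorer::plot_correlation(
    first_rows,
    type = "continuous",
    cor_args = list(use = "pairwise.complete.obs")
  )
}
```

Pairwise deletion is only one of the options cor() offers. Whether it suits a given analysis depends on how the missing values arose.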
To check out more on what DataExplorer has to offer or for additional help, run the code below in your RStudio console.
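??dataexplorer

The create_report() function creates a full data profiling report that includes all the individual plots this tutorial has gone through. By default, the report covers basic statistics, the data structure, a missing data profile, univariate distributions, a correlation analysis, and a principal component analysis.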
Run the code below to generate the report.
create_report(
  penguins,
  output_file = "report_example.html",
  output_dir = getwd(),
  config = configure_report(),
  report_title = "Data Report"
)

An overview of the arguments:
| Argument | What it does |
|---|---|
| data | The dataset you want to explore. For this example, we are using the penguins dataset. |
| output_file | The name of the output file. |
| output_dir | The directory the file is written to. The default is the current working directory, which is what getwd() returns. |
| config | The configuration of the report. The configure_report() function lets you choose which plots the report includes. |
| report_title | The title of the report. |
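As a hedged sketch of what customising config can look like, the block below builds a configuration that leaves out two sections. The argument names add_plot_prcomp and add_plot_qq are taken from the configure_report() documentation as I understand it. The choice of sections to drop and the name `report_config` are purely illustrative.

```r
library(DataExplorer)

# Illustrative choice: skip the principal component analysis and QQ plot
# sections, and keep every other section at its default setting.
report_config <- configure_report(
  add_plot_prcomp = FALSE,
  add_plot_qq = FALSE
)

# The result is a list describing the sections to render.
str(report_config, max.level = 1)
```

Passing `report_config` as the config argument of create_report(), in place of configure_report(), would then apply these choices to the generated report.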
Because create_report() writes a file to your computer, screenshots of the report are displayed below.