The R package DataExplorer was built by Boxuan Cui to make Exploratory Data Analysis (EDA) easy. It can also be used for feature engineering and data reporting. This code-through demonstrates how to use DataExplorer to find missing values in a dataset and to quickly visualize your data.
These examples will use the included “airquality” dataset, which contains daily air quality values from New York City in May through September 1973. It includes daily observations of 6 variables: Ozone, Solar.R, Wind, Temp, Month, and Day of month. You can see the first 6 rows of data here.
head( airquality )
It is important for any data analyst to understand the structure of the dataset they’re working with. This can be a complicated process with many steps, but DataExplorer can reduce it to a single line of code while still allowing the user to customize the data visualizations. It can also be used for “feature engineering” to replace missing values or combine categories.
To start, install the DataExplorer package from CRAN and load it with library().
# install.packages("DataExplorer")   # run once, then leave commented out
library( DataExplorer )
With no customization and only a single command, you can generate a detailed report of your data:
create_report( airquality )
The create_report() command produces an HTML document that opens in your browser. By default the report includes all of the following:
Basic Statistics: A table describing the data. Here you can see that the airquality dataset has 153 rows, 6 columns, 918 total observations, and 44 missing observations.
Percentages: A bar plot showing the percentage of discrete and continuous columns, complete rows, and missing observations. Airquality has all continuous columns, 72% of its rows are complete, and 4.8% of its observations are missing.
Data Structure: Showing variable classes. Five airquality variables are integers, one is numeric.
Missing Data Profile: Percent of each column that is missing data, color coded according to how problematic this may be. The Solar.R column is missing 4.58% of observations, while Ozone is missing 24.18%.
Univariate Distribution: Histograms for each variable. A quick glance at the Day histogram should raise a question: why is there a spike at 14? Every month from May to September should have exactly one 14th of the month.
What else do you see in the data? Temp and Wind look fairly normally distributed with a couple of outliers, while Ozone seems skewed.
QQ Plot: Plots the quantiles of each variable against the quantiles of a theoretical normal distribution. Points that fall along a fairly straight line suggest the variable is approximately normally distributed. As we saw in the histograms, Temp and Wind form quite straight lines here, Ozone and Solar.R curve a bit, especially at the high and low ends, and Day and Month are not normally distributed, just as we’d expect.
Correlation Analysis: Color-coded correlation table. Here we can see that Temp and Ozone have a strong positive correlation of about 0.7, Wind and Ozone are negatively correlated at about -0.6, and Month correlates with Temp at about 0.4. Day of the month does not correlate strongly with any of the other variables, which makes sense because we would not expect systematic differences based on whether it is the 1st, the 12th, or the 27th of the month.
Principal Component Analysis: PCA is a dimensionality-reduction method that combines correlated variables into a smaller set of principal components, and the plot shows how much of the variance each component accounts for. Here there are 3 components, and you can see the relative importance of each variable within them.
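The figures quoted in the walkthrough above can be cross-checked without generating the report. The following is a minimal sketch that uses only base R on the built-in airquality data. The object names (n_missing, pct_complete, day_counts, and so on) are illustrative and not part of DataExplorer.

```r
# Cross-check the report's headline numbers with base R only.
data(airquality)

n_obs        <- nrow(airquality) * ncol(airquality)      # total cells
n_missing    <- sum(is.na(airquality))                   # missing cells
pct_complete <- mean(complete.cases(airquality)) * 100   # % of rows with no NA

# Percent missing in each column
pct_missing_by_col <- round(colMeans(is.na(airquality)) * 100, 2)

# How often does each day of the month occur?
day_counts <- table(airquality$Day)

# Pairwise correlations, dropping missing values pair by pair
cors <- round(cor(airquality, use = "pairwise.complete.obs"), 2)

print(c(observations = n_obs, missing = n_missing))
print(round(pct_complete, 1))
print(pct_missing_by_col)
print(day_counts)
print(cors["Ozone", c("Temp", "Wind")])
```

Every day from 1 to 30 occurs five times, once per month, so the spike at 14 in the Day histogram most likely comes from how the histogram bins integer values rather than from the data itself.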
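In addition to the simple create_report() command, you can customize the report by passing configure_report() to its config argument, switching plot types on or off and supplying arguments for each. The call below lists the available options with their default values: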
configure_report(
  add_introduce = TRUE,
  add_plot_intro = TRUE,
  add_plot_str = TRUE,
  add_plot_missing = TRUE,
  add_plot_histogram = TRUE,
  add_plot_density = FALSE,
  add_plot_qq = TRUE,
  add_plot_bar = TRUE,
  add_plot_correlation = TRUE,
  add_plot_prcomp = TRUE,
  add_plot_boxplot = TRUE,
  add_plot_scatterplot = TRUE,
  introduce_args = list(),
  plot_intro_args = list(),
  plot_str_args = list(type = "diagonal", fontSize = 35, width = 1000, margin = list(left = 350, right = 250)),
  plot_missing_args = list(),
  plot_histogram_args = list(),
  plot_density_args = list(),
  plot_qq_args = list(sampled_rows = 1000L),
  plot_bar_args = list(),
  plot_correlation_args = list(cor_args = list(use = "pairwise.complete.obs")),
  plot_prcomp_args = list(),
  plot_boxplot_args = list(),
  plot_scatterplot_args = list(sampled_rows = 1000L),
  global_ggtheme = quote(theme_gray()),
  global_theme_config = list()
)
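On its own, configure_report() only builds a configuration, so it has to be handed to create_report() to take effect. A minimal sketch, assuming DataExplorer and the rmarkdown/pandoc toolchain it relies on are installed. The object name aq_config, the output file name, and the report title are illustrative choices.

```r
library(DataExplorer)

# Start from the defaults and change only what differs:
# add density plots and skip the PCA section.
aq_config <- configure_report(
  add_plot_density = TRUE,
  add_plot_prcomp  = FALSE
)

# Pass the configuration to create_report() through its config argument.
create_report(
  airquality,
  output_file  = "airquality_report.html",    # illustrative file name
  report_title = "Air Quality Data Profile",  # illustrative title
  config       = aq_config
)
```

Only the arguments that differ from the defaults need to be spelled out, which keeps the call much shorter than the full listing.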
You can also run any of the plot types individually on the whole dataset or slice and dice by including certain variables:
plot_bar()
plot_boxplot()
plot_correlation()
plot_density()
plot_histogram()
plot_intro()
plot_missing()
plot_prcomp()
plot_qq()
plot_scatterplot()
plot_str()
For example, here are boxplots by Temp…
plot_boxplot( airquality, by = "Temp" )
… and here are scatterplots by Ozone.
plot_scatterplot( airquality, by = "Ozone")
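To tie this back to finding missing values, the information in the report's Missing Data Profile is also available directly. A short sketch, assuming DataExplorer is installed: profile_missing() returns the per-column missing counts as a data frame and plot_missing() draws them. The object name missing_profile is illustrative.

```r
library(DataExplorer)

# Tabulate missing values per column
missing_profile <- profile_missing(airquality)
print(missing_profile)

# Draw the same information as a bar chart
plot_missing(airquality)

# Slice and dice: histograms for only the two columns with missing data
plot_histogram(airquality[, c("Ozone", "Solar.R")])
```

Subsetting the data frame before calling a plot function is a simple way to limit any of these plots to the variables of interest.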
The DataExplorer package should be one of the first R tools a data analyst turns to when starting to work with a new dataset. It helps you identify patterns of missing data, see the structure of the data, and visualize it using multiple plot types. It’s not the only tool you’ll need, but it will help you start your analysis with a good understanding of the data.
Learn more about the DataExplorer package with the following:
Resource I DataExplorer
Resource II DataExplorer Vignette
Resource III DataExplorer: Automate Data Exploration and Treatment