The document is demo for the EDA packages in R.

library(visdat)
library(inspectdf)
library(skimr)
library(tidyverse)

We will load a dataset, we will use Kaggle’s Melbourne Housing dataset https://www.kaggle.com/dansbecker/melbourne-housing-snapshot

data <- read_csv("melb_data.csv")
glimpse(data)
## Rows: 13,580
## Columns: 21
## $ Suburb        <chr> "Abbotsford", "Abbotsford", "Abbotsford", "Abbotsford...
## $ Address       <chr> "85 Turner St", "25 Bloomburg St", "5 Charles St", "4...
## $ Rooms         <dbl> 2, 2, 3, 3, 4, 2, 3, 2, 1, 2, 2, 3, 2, 2, 1, 2, 3, 3,...
## $ Type          <chr> "h", "h", "h", "h", "h", "h", "h", "h", "u", "h", "u"...
## $ Price         <dbl> 1480000, 1035000, 1465000, 850000, 1600000, 941000, 1...
## $ Method        <chr> "S", "S", "SP", "PI", "VB", "S", "S", "S", "S", "S", ...
## $ SellerG       <chr> "Biggin", "Biggin", "Biggin", "Biggin", "Nelson", "Je...
## $ Date          <chr> "3/12/2016", "4/02/2016", "4/03/2017", "4/03/2017", "...
## $ Distance      <dbl> 2.5, 2.5, 2.5, 2.5, 2.5, 2.5, 2.5, 2.5, 2.5, 2.5, 2.5...
## $ Postcode      <dbl> 3067, 3067, 3067, 3067, 3067, 3067, 3067, 3067, 3067,...
## $ Bedroom2      <dbl> 2, 2, 3, 3, 3, 2, 4, 2, 1, 3, 2, 3, 2, 2, 1, 2, 3, 2,...
## $ Bathroom      <dbl> 1, 1, 2, 2, 1, 1, 2, 1, 1, 1, 2, 2, 2, 1, 1, 1, 2, 1,...
## $ Car           <dbl> 1, 0, 0, 1, 2, 0, 0, 2, 1, 2, 1, 2, 1, 1, 1, 2, 1, 1,...
## $ Landsize      <dbl> 202, 156, 134, 94, 120, 181, 245, 256, 0, 220, 0, 214...
## $ BuildingArea  <dbl> NA, 79, 150, NA, 142, NA, 210, 107, NA, 75, NA, 190, ...
## $ YearBuilt     <dbl> NA, 1900, 1900, NA, 2014, NA, 1910, 1890, NA, 1900, N...
## $ CouncilArea   <chr> "Yarra", "Yarra", "Yarra", "Yarra", "Yarra", "Yarra",...
## $ Lattitude     <dbl> -37.7996, -37.8079, -37.8093, -37.7969, -37.8072, -37...
## $ Longtitude    <dbl> 144.9984, 144.9934, 144.9944, 144.9969, 144.9941, 144...
## $ Regionname    <chr> "Northern Metropolitan", "Northern Metropolitan", "No...
## $ Propertycount <dbl> 4019, 4019, 4019, 4019, 4019, 4019, 4019, 4019, 4019,...

Skimr Package

Complete console output view

skim(data)
Data summary
Name data
Number of rows 13580
Number of columns 21
_______________________
Column type frequency:
character 8
numeric 13
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
Suburb 0 1.0 3 18 0 314 0
Address 0 1.0 8 27 0 13378 0
Type 0 1.0 1 1 0 3 0
Method 0 1.0 1 2 0 5 0
SellerG 0 1.0 1 23 0 268 0
Date 0 1.0 9 10 0 58 0
CouncilArea 1369 0.9 4 17 0 33 0
Regionname 0 1.0 16 26 0 8 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
Rooms 0 1.00 2.94 0.96 1.00 2.00 3.0 3.00 10.00 ▅▇▁▁▁
Price 0 1.00 1075684.08 639310.72 85000.00 650000.00 903000.0 1330000.00 9000000.00 ▇▁▁▁▁
Distance 0 1.00 10.14 5.87 0.00 6.10 9.2 13.00 48.10 ▇▆▁▁▁
Postcode 0 1.00 3105.30 90.68 3000.00 3044.00 3084.0 3148.00 3977.00 ▇▁▁▁▁
Bedroom2 0 1.00 2.91 0.97 0.00 2.00 3.0 3.00 20.00 ▇▁▁▁▁
Bathroom 0 1.00 1.53 0.69 0.00 1.00 1.0 2.00 8.00 ▇▆▁▁▁
Car 62 1.00 1.61 0.96 0.00 1.00 2.0 2.00 10.00 ▇▁▁▁▁
Landsize 0 1.00 558.42 3990.67 0.00 177.00 440.0 651.00 433014.00 ▇▁▁▁▁
BuildingArea 6450 0.53 151.97 541.01 0.00 93.00 126.0 174.00 44515.00 ▇▁▁▁▁
YearBuilt 5375 0.60 1964.68 37.27 1196.00 1940.00 1970.0 1999.00 2018.00 ▁▁▁▁▇
Lattitude 0 1.00 -37.81 0.08 -38.18 -37.86 -37.8 -37.76 -37.41 ▁▂▇▁▁
Longtitude 0 1.00 145.00 0.10 144.43 144.93 145.0 145.06 145.53 ▁▁▇▂▁
Propertycount 0 1.00 7454.42 4378.58 249.00 4380.00 6555.0 10331.00 21650.00 ▆▇▅▂▁

skim_df introduced skim_variable, n_missing etc. If drop skim_variable and skim_type it will result coercion to a tibble.

## pipeable
skim(data) %>% 
  filter(skim_variable == "Price")
Data summary
Name data
Number of rows 13580
Number of columns 21
_______________________
Column type frequency:
numeric 1
________________________
Group variables None

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
Price 0 1 1075684 639310.7 85000 650000 903000 1330000 9e+06 ▇▁▁▁▁

partition() creates a list of smaller skim_df data frames. Each entry in the list is a data type from the original skim_df. The inverse of partition() is bind(), which takes the list and produces the original skim_df. While partition() keeps all of the subtables as list entries, yank() gives you a single subtable for a data type.

skim(data) %>% yank("numeric")

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
Rooms 0 1.00 2.94 0.96 1.00 2.00 3.0 3.00 10.00 ▅▇▁▁▁
Price 0 1.00 1075684.08 639310.72 85000.00 650000.00 903000.0 1330000.00 9000000.00 ▇▁▁▁▁
Distance 0 1.00 10.14 5.87 0.00 6.10 9.2 13.00 48.10 ▇▆▁▁▁
Postcode 0 1.00 3105.30 90.68 3000.00 3044.00 3084.0 3148.00 3977.00 ▇▁▁▁▁
Bedroom2 0 1.00 2.91 0.97 0.00 2.00 3.0 3.00 20.00 ▇▁▁▁▁
Bathroom 0 1.00 1.53 0.69 0.00 1.00 1.0 2.00 8.00 ▇▆▁▁▁
Car 62 1.00 1.61 0.96 0.00 1.00 2.0 2.00 10.00 ▇▁▁▁▁
Landsize 0 1.00 558.42 3990.67 0.00 177.00 440.0 651.00 433014.00 ▇▁▁▁▁
BuildingArea 6450 0.53 151.97 541.01 0.00 93.00 126.0 174.00 44515.00 ▇▁▁▁▁
YearBuilt 5375 0.60 1964.68 37.27 1196.00 1940.00 1970.0 1999.00 2018.00 ▁▁▁▁▇
Lattitude 0 1.00 -37.81 0.08 -38.18 -37.86 -37.8 -37.76 -37.41 ▁▂▇▁▁
Longtitude 0 1.00 145.00 0.10 144.43 144.93 145.0 145.06 145.53 ▁▁▇▂▁
Propertycount 0 1.00 7454.42 4378.58 249.00 4380.00 6555.0 10331.00 21650.00 ▆▇▅▂▁

Vis_* Package

Useful for visualization, particular missing data.

## check character and numerical variables with missing
vis_dat(data)

Options to check more about missing data

## check missing percentage
vis_miss(data)

## can cluster and sort
args(vis_miss)
## function (x, cluster = FALSE, sort_miss = FALSE, show_perc = TRUE, 
##     show_perc_col = TRUE, large_data_size = 9e+05, warn_large_data = TRUE) 
## NULL

Options for checking correlations

data %>% select(where(is.numeric)) %>% 
  vis_cor()

Options for checking value condition

vis_expect(data, ~ .x > 2)

Other functions are vis_compare for compare dataframe, vis_guess for individual class of each value, vis_value for heatmap (missing?), vis_binary for occurrence of binary (missing?)

Inspectdf

All can be piped to show plot and plot can be two tibble

## data type
inspect_types(data) %>% show_plot()

## memory usage
inspect_mem(data) %>% show_plot()

## check na
data_price_dummy <- data %>% 
  mutate(price_dummy = if_else(Price > 1000000, "High", "Low"))
inspect_na(data_price_dummy %>% filter(price_dummy == "High"),
           data_price_dummy %>% filter(price_dummy == "Low")) %>%
  show_plot()

## check numerical variable distribution
inspect_num(data) %>% show_plot()

## check categorical variable distribution
inspect_imb(data) %>% show_plot()

## check two categorical
inspect_imb(data_price_dummy %>% filter(price_dummy == "High"),
           data_price_dummy %>% filter(price_dummy == "Low")) %>%
  show_plot() + theme(legend.position = "none")

## similiar to inspect_imb, but for all levels
inspect_cat(data) %>% show_plot()

inspect_cor(data) %>% show_plot()