EDA Package

This document is a demo of several exploratory data analysis (EDA) packages in R.

library(visdat)
library(inspectdf)
library(skimr)
library(tidyverse)

We first load a dataset. This demo uses Kaggle's Melbourne Housing snapshot dataset: https://www.kaggle.com/dansbecker/melbourne-housing-snapshot

data <- read_csv("melb_data.csv")
glimpse(data)

## Rows: 13,580
## Columns: 21
## $ Suburb        <chr> "Abbotsford", "Abbotsford", "Abbotsford", "Abbotsford...
## $ Address       <chr> "85 Turner St", "25 Bloomburg St", "5 Charles St", "4...
## $ Rooms         <dbl> 2, 2, 3, 3, 4, 2, 3, 2, 1, 2, 2, 3, 2, 2, 1, 2, 3, 3,...
## $ Type          <chr> "h", "h", "h", "h", "h", "h", "h", "h", "u", "h", "u"...
## $ Price         <dbl> 1480000, 1035000, 1465000, 850000, 1600000, 941000, 1...
## $ Method        <chr> "S", "S", "SP", "PI", "VB", "S", "S", "S", "S", "S", ...
## $ SellerG       <chr> "Biggin", "Biggin", "Biggin", "Biggin", "Nelson", "Je...
## $ Date          <chr> "3/12/2016", "4/02/2016", "4/03/2017", "4/03/2017", "...
## $ Distance      <dbl> 2.5, 2.5, 2.5, 2.5, 2.5, 2.5, 2.5, 2.5, 2.5, 2.5, 2.5...
## $ Postcode      <dbl> 3067, 3067, 3067, 3067, 3067, 3067, 3067, 3067, 3067,...
## $ Bedroom2      <dbl> 2, 2, 3, 3, 3, 2, 4, 2, 1, 3, 2, 3, 2, 2, 1, 2, 3, 2,...
## $ Bathroom      <dbl> 1, 1, 2, 2, 1, 1, 2, 1, 1, 1, 2, 2, 2, 1, 1, 1, 2, 1,...
## $ Car           <dbl> 1, 0, 0, 1, 2, 0, 0, 2, 1, 2, 1, 2, 1, 1, 1, 2, 1, 1,...
## $ Landsize      <dbl> 202, 156, 134, 94, 120, 181, 245, 256, 0, 220, 0, 214...
## $ BuildingArea  <dbl> NA, 79, 150, NA, 142, NA, 210, 107, NA, 75, NA, 190, ...
## $ YearBuilt     <dbl> NA, 1900, 1900, NA, 2014, NA, 1910, 1890, NA, 1900, N...
## $ CouncilArea   <chr> "Yarra", "Yarra", "Yarra", "Yarra", "Yarra", "Yarra",...
## $ Lattitude     <dbl> -37.7996, -37.8079, -37.8093, -37.7969, -37.8072, -37...
## $ Longtitude    <dbl> 144.9984, 144.9934, 144.9944, 144.9969, 144.9941, 144...
## $ Regionname    <chr> "Northern Metropolitan", "Northern Metropolitan", "No...
## $ Propertycount <dbl> 4019, 4019, 4019, 4019, 4019, 4019, 4019, 4019, 4019,...

Skimr Package

skim() prints a complete summary of the data frame to the console.

skim(data)

Data summary
Name	data
Number of rows	13580
Number of columns	21
_______________________
Column type frequency:
character	8
numeric	13
________________________
Group variables	None

Variable type: character

skim_variable	n_missing	complete_rate	min	max	n_unique
Suburb	0	1.0	3	18	314
Address	0	1.0	8	27	13378
Type	0	1.0	1	1	3
Method	0	1.0	1	2	5
SellerG	0	1.0	1	23	268
Date	0	1.0	9	10	58
CouncilArea	1369	0.9	4	17	33
Regionname	0	1.0	16	26	8

Variable type: numeric

skim_variable	n_missing	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
Rooms	0	1.00	2.94	0.96	1.00	2.00	3.0	3.00	10.00	▅▇▁▁▁
Price	0	1.00	1075684.08	639310.72	85000.00	650000.00	903000.0	1330000.00	9000000.00	▇▁▁▁▁
Distance	0	1.00	10.14	5.87	0.00	6.10	9.2	13.00	48.10	▇▆▁▁▁
Postcode	0	1.00	3105.30	90.68	3000.00	3044.00	3084.0	3148.00	3977.00	▇▁▁▁▁
Bedroom2	0	1.00	2.91	0.97	0.00	2.00	3.0	3.00	20.00	▇▁▁▁▁
Bathroom	0	1.00	1.53	0.69	0.00	1.00	1.0	2.00	8.00	▇▆▁▁▁
Car	62	1.00	1.61	0.96	0.00	1.00	2.0	2.00	10.00	▇▁▁▁▁
Landsize	0	1.00	558.42	3990.67	0.00	177.00	440.0	651.00	433014.00	▇▁▁▁▁
BuildingArea	6450	0.53	151.97	541.01	0.00	93.00	126.0	174.00	44515.00	▇▁▁▁▁
YearBuilt	5375	0.60	1964.68	37.27	1196.00	1940.00	1970.0	1999.00	2018.00	▁▁▁▁▇
Lattitude	0	1.00	-37.81	0.08	-38.18	-37.86	-37.8	-37.76	-37.41	▁▂▇▁▁
Longtitude	0	1.00	145.00	0.10	144.43	144.93	145.0	145.06	145.53	▁▁▇▂▁
Propertycount	0	1.00	7454.42	4378.58	249.00	4380.00	6555.0	10331.00	21650.00	▆▇▅▂▁

skim() returns a skim_df object, which introduces columns such as skim_variable and n_missing. If you drop skim_variable or skim_type, the result is coerced to a plain tibble.

## skim() output is pipeable, e.g. into dplyr::filter()
skim(data) %>% 
  filter(skim_variable == "Price")

Data summary
Name	data
Number of rows	13580
Number of columns	21
_______________________
Column type frequency:
numeric	1
________________________
Group variables	None

Variable type: numeric

skim_variable	n_missing	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
Price	0	1	1075684	639310.7	85000	650000	903000	1330000	9e+06	▇▁▁▁▁

partition() creates a list of smaller skim_df data frames. Each entry in the list is a data type from the original skim_df. The inverse of partition() is bind(), which takes the list and produces the original skim_df. While partition() keeps all of the subtables as list entries, yank() gives you a single subtable for a data type.

skim(data) %>% yank("numeric")

Variable type: numeric

skim_variable	n_missing	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
Rooms	0	1.00	2.94	0.96	1.00	2.00	3.0	3.00	10.00	▅▇▁▁▁
Price	0	1.00	1075684.08	639310.72	85000.00	650000.00	903000.0	1330000.00	9000000.00	▇▁▁▁▁
Distance	0	1.00	10.14	5.87	0.00	6.10	9.2	13.00	48.10	▇▆▁▁▁
Postcode	0	1.00	3105.30	90.68	3000.00	3044.00	3084.0	3148.00	3977.00	▇▁▁▁▁
Bedroom2	0	1.00	2.91	0.97	0.00	2.00	3.0	3.00	20.00	▇▁▁▁▁
Bathroom	0	1.00	1.53	0.69	0.00	1.00	1.0	2.00	8.00	▇▆▁▁▁
Car	62	1.00	1.61	0.96	0.00	1.00	2.0	2.00	10.00	▇▁▁▁▁
Landsize	0	1.00	558.42	3990.67	0.00	177.00	440.0	651.00	433014.00	▇▁▁▁▁
BuildingArea	6450	0.53	151.97	541.01	0.00	93.00	126.0	174.00	44515.00	▇▁▁▁▁
YearBuilt	5375	0.60	1964.68	37.27	1196.00	1940.00	1970.0	1999.00	2018.00	▁▁▁▁▇
Lattitude	0	1.00	-37.81	0.08	-38.18	-37.86	-37.8	-37.76	-37.41	▁▂▇▁▁
Longtitude	0	1.00	145.00	0.10	144.43	144.93	145.0	145.06	145.53	▁▁▇▂▁
Propertycount	0	1.00	7454.42	4378.58	249.00	4380.00	6555.0	10331.00	21650.00	▆▇▅▂▁
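
The partition() and bind() round trip described above is not shown in this demo, so here is a minimal sketch of it. It assumes the built-in iris data frame instead of the Melbourne data, purely so that it runs without melb_data.csv; the variable names are illustrative.

```r
library(skimr)

# Assumption: iris (built into R) stands in for the housing data
skimmed <- skim(iris)

# partition() splits the skim_df into a list with one entry per data type
parts <- partition(skimmed)
names(parts)

# bind() reverses partition() and rebuilds a single skim_df
rebuilt <- bind(parts)

# yank() extracts one subtable by its type name
yank(skimmed, "factor")
```

The same calls should apply unchanged to skim(data), with "character" and "numeric" as the type names.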

Vis_* Package

The vis_*() functions come from the visdat package. They are useful for visualizing a whole data frame, particularly its missing data.

## show the class of each column, with missing values highlighted
vis_dat(data)

vis_miss() gives more detail on the missing data:

## check missing percentage
vis_miss(data)

## can cluster and sort
args(vis_miss)

## function (x, cluster = FALSE, sort_miss = FALSE, show_perc = TRUE, 
##     show_perc_col = TRUE, large_data_size = 9e+05, warn_large_data = TRUE) 
## NULL
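
As a small illustration of the cluster and sort_miss arguments listed above, the sketch below turns both on. It assumes the built-in airquality data frame (which contains missing values) rather than the housing data, so that it is self-contained.

```r
library(visdat)

# Assumption: airquality (built into R) stands in for the housing data
# cluster = TRUE groups rows with similar missingness patterns;
# sort_miss = TRUE orders columns from most to least missing
p <- vis_miss(airquality, cluster = TRUE, sort_miss = TRUE)
p
```

The result is a ggplot object, so it can be modified further with the usual ggplot2 layers.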

vis_cor() plots the correlations between numeric columns:

data %>% select(where(is.numeric)) %>% 
  vis_cor()

vis_expect() shows which values satisfy a condition:

vis_expect(data, ~ .x > 2)

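Other functions include vis_compare() for comparing two data frames, vis_guess() for showing the guessed class of each individual value, vis_value() for a heatmap of values, and vis_binary() for the occurrence of binary values. The last two may be missing, depending on the installed version of visdat.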
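
Since vis_compare() is only mentioned in passing, here is a minimal sketch of how it might be used. The data are an assumption: a copy of the built-in iris data frame with a few cells overwritten, since vis_compare() needs two data frames of the same dimensions.

```r
library(visdat)

# Hypothetical example data: iris with ten values changed in one column
iris_changed <- iris
iris_changed$Sepal.Length[1:10] <- 0

# Cells that differ between the two data frames are highlighted
p <- vis_compare(iris, iris_changed)
p
```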

Inspectdf

The result of every inspect_*() function can be piped to show_plot(), and the functions can also take two data frames to compare them in one plot.

## data type
inspect_types(data) %>% show_plot()

## memory usage
inspect_mem(data) %>% show_plot()

## check na
data_price_dummy <- data %>% 
  mutate(price_dummy = if_else(Price > 1000000, "High", "Low"))
inspect_na(data_price_dummy %>% filter(price_dummy == "High"),
           data_price_dummy %>% filter(price_dummy == "Low")) %>%
  show_plot()

## check numerical variable distribution
inspect_num(data) %>% show_plot()

## check categorical variable distribution
inspect_imb(data) %>% show_plot()

## compare categorical imbalance between two data frames
inspect_imb(data_price_dummy %>% filter(price_dummy == "High"),
            data_price_dummy %>% filter(price_dummy == "Low")) %>%
  show_plot() + theme(legend.position = "none")

## similar to inspect_imb, but for all levels
inspect_cat(data) %>% show_plot()

## check correlations between numeric variables
inspect_cor(data) %>% show_plot()
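
The inspect_*() calls above are all piped straight into show_plot(), but each one first returns an ordinary tibble that can be inspected directly. The sketch below shows this for inspect_na(); it assumes the built-in airquality data frame in place of the housing data so that it runs on its own.

```r
library(inspectdf)

# Assumption: airquality (built into R) stands in for the housing data
na_tbl <- inspect_na(airquality)

# One row per column, with the count and percentage of missing values
na_tbl

# The same tibble can still be plotted afterwards
show_plot(na_tbl)
```

Keeping the intermediate tibble is handy when the summary numbers are needed for filtering or reporting rather than only for the plot.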

Liu Zhang

02/04/2021
