The document is demo for the EDA packages in R.
library(visdat)
library(inspectdf)
library(skimr)
library(tidyverse)
We will load a dataset, we will use Kaggle’s Melbourne Housing dataset https://www.kaggle.com/dansbecker/melbourne-housing-snapshot
data <- read_csv("melb_data.csv")
glimpse(data)
## Rows: 13,580
## Columns: 21
## $ Suburb <chr> "Abbotsford", "Abbotsford", "Abbotsford", "Abbotsford...
## $ Address <chr> "85 Turner St", "25 Bloomburg St", "5 Charles St", "4...
## $ Rooms <dbl> 2, 2, 3, 3, 4, 2, 3, 2, 1, 2, 2, 3, 2, 2, 1, 2, 3, 3,...
## $ Type <chr> "h", "h", "h", "h", "h", "h", "h", "h", "u", "h", "u"...
## $ Price <dbl> 1480000, 1035000, 1465000, 850000, 1600000, 941000, 1...
## $ Method <chr> "S", "S", "SP", "PI", "VB", "S", "S", "S", "S", "S", ...
## $ SellerG <chr> "Biggin", "Biggin", "Biggin", "Biggin", "Nelson", "Je...
## $ Date <chr> "3/12/2016", "4/02/2016", "4/03/2017", "4/03/2017", "...
## $ Distance <dbl> 2.5, 2.5, 2.5, 2.5, 2.5, 2.5, 2.5, 2.5, 2.5, 2.5, 2.5...
## $ Postcode <dbl> 3067, 3067, 3067, 3067, 3067, 3067, 3067, 3067, 3067,...
## $ Bedroom2 <dbl> 2, 2, 3, 3, 3, 2, 4, 2, 1, 3, 2, 3, 2, 2, 1, 2, 3, 2,...
## $ Bathroom <dbl> 1, 1, 2, 2, 1, 1, 2, 1, 1, 1, 2, 2, 2, 1, 1, 1, 2, 1,...
## $ Car <dbl> 1, 0, 0, 1, 2, 0, 0, 2, 1, 2, 1, 2, 1, 1, 1, 2, 1, 1,...
## $ Landsize <dbl> 202, 156, 134, 94, 120, 181, 245, 256, 0, 220, 0, 214...
## $ BuildingArea <dbl> NA, 79, 150, NA, 142, NA, 210, 107, NA, 75, NA, 190, ...
## $ YearBuilt <dbl> NA, 1900, 1900, NA, 2014, NA, 1910, 1890, NA, 1900, N...
## $ CouncilArea <chr> "Yarra", "Yarra", "Yarra", "Yarra", "Yarra", "Yarra",...
## $ Lattitude <dbl> -37.7996, -37.8079, -37.8093, -37.7969, -37.8072, -37...
## $ Longtitude <dbl> 144.9984, 144.9934, 144.9944, 144.9969, 144.9941, 144...
## $ Regionname <chr> "Northern Metropolitan", "Northern Metropolitan", "No...
## $ Propertycount <dbl> 4019, 4019, 4019, 4019, 4019, 4019, 4019, 4019, 4019,...
Complete console output view
skim(data)
Name | data |
Number of rows | 13580 |
Number of columns | 21 |
_______________________ | |
Column type frequency: | |
character | 8 |
numeric | 13 |
________________________ | |
Group variables | None |
Variable type: character
skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
---|---|---|---|---|---|---|---|
Suburb | 0 | 1.0 | 3 | 18 | 0 | 314 | 0 |
Address | 0 | 1.0 | 8 | 27 | 0 | 13378 | 0 |
Type | 0 | 1.0 | 1 | 1 | 0 | 3 | 0 |
Method | 0 | 1.0 | 1 | 2 | 0 | 5 | 0 |
SellerG | 0 | 1.0 | 1 | 23 | 0 | 268 | 0 |
Date | 0 | 1.0 | 9 | 10 | 0 | 58 | 0 |
CouncilArea | 1369 | 0.9 | 4 | 17 | 0 | 33 | 0 |
Regionname | 0 | 1.0 | 16 | 26 | 0 | 8 | 0 |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
Rooms | 0 | 1.00 | 2.94 | 0.96 | 1.00 | 2.00 | 3.0 | 3.00 | 10.00 | ▅▇▁▁▁ |
Price | 0 | 1.00 | 1075684.08 | 639310.72 | 85000.00 | 650000.00 | 903000.0 | 1330000.00 | 9000000.00 | ▇▁▁▁▁ |
Distance | 0 | 1.00 | 10.14 | 5.87 | 0.00 | 6.10 | 9.2 | 13.00 | 48.10 | ▇▆▁▁▁ |
Postcode | 0 | 1.00 | 3105.30 | 90.68 | 3000.00 | 3044.00 | 3084.0 | 3148.00 | 3977.00 | ▇▁▁▁▁ |
Bedroom2 | 0 | 1.00 | 2.91 | 0.97 | 0.00 | 2.00 | 3.0 | 3.00 | 20.00 | ▇▁▁▁▁ |
Bathroom | 0 | 1.00 | 1.53 | 0.69 | 0.00 | 1.00 | 1.0 | 2.00 | 8.00 | ▇▆▁▁▁ |
Car | 62 | 1.00 | 1.61 | 0.96 | 0.00 | 1.00 | 2.0 | 2.00 | 10.00 | ▇▁▁▁▁ |
Landsize | 0 | 1.00 | 558.42 | 3990.67 | 0.00 | 177.00 | 440.0 | 651.00 | 433014.00 | ▇▁▁▁▁ |
BuildingArea | 6450 | 0.53 | 151.97 | 541.01 | 0.00 | 93.00 | 126.0 | 174.00 | 44515.00 | ▇▁▁▁▁ |
YearBuilt | 5375 | 0.60 | 1964.68 | 37.27 | 1196.00 | 1940.00 | 1970.0 | 1999.00 | 2018.00 | ▁▁▁▁▇ |
Lattitude | 0 | 1.00 | -37.81 | 0.08 | -38.18 | -37.86 | -37.8 | -37.76 | -37.41 | ▁▂▇▁▁ |
Longtitude | 0 | 1.00 | 145.00 | 0.10 | 144.43 | 144.93 | 145.0 | 145.06 | 145.53 | ▁▁▇▂▁ |
Propertycount | 0 | 1.00 | 7454.42 | 4378.58 | 249.00 | 4380.00 | 6555.0 | 10331.00 | 21650.00 | ▆▇▅▂▁ |
skim_df introduced skim_variable, n_missing etc. If drop skim_variable and skim_type it will result coercion to a tibble.
## pipeable
skim(data) %>%
filter(skim_variable == "Price")
Name | data |
Number of rows | 13580 |
Number of columns | 21 |
_______________________ | |
Column type frequency: | |
numeric | 1 |
________________________ | |
Group variables | None |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
Price | 0 | 1 | 1075684 | 639310.7 | 85000 | 650000 | 903000 | 1330000 | 9e+06 | ▇▁▁▁▁ |
partition() creates a list of smaller skim_df data frames. Each entry in the list is a data type from the original skim_df. The inverse of partition() is bind(), which takes the list and produces the original skim_df. While partition() keeps all of the subtables as list entries, yank() gives you a single subtable for a data type.
skim(data) %>% yank("numeric")
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
Rooms | 0 | 1.00 | 2.94 | 0.96 | 1.00 | 2.00 | 3.0 | 3.00 | 10.00 | ▅▇▁▁▁ |
Price | 0 | 1.00 | 1075684.08 | 639310.72 | 85000.00 | 650000.00 | 903000.0 | 1330000.00 | 9000000.00 | ▇▁▁▁▁ |
Distance | 0 | 1.00 | 10.14 | 5.87 | 0.00 | 6.10 | 9.2 | 13.00 | 48.10 | ▇▆▁▁▁ |
Postcode | 0 | 1.00 | 3105.30 | 90.68 | 3000.00 | 3044.00 | 3084.0 | 3148.00 | 3977.00 | ▇▁▁▁▁ |
Bedroom2 | 0 | 1.00 | 2.91 | 0.97 | 0.00 | 2.00 | 3.0 | 3.00 | 20.00 | ▇▁▁▁▁ |
Bathroom | 0 | 1.00 | 1.53 | 0.69 | 0.00 | 1.00 | 1.0 | 2.00 | 8.00 | ▇▆▁▁▁ |
Car | 62 | 1.00 | 1.61 | 0.96 | 0.00 | 1.00 | 2.0 | 2.00 | 10.00 | ▇▁▁▁▁ |
Landsize | 0 | 1.00 | 558.42 | 3990.67 | 0.00 | 177.00 | 440.0 | 651.00 | 433014.00 | ▇▁▁▁▁ |
BuildingArea | 6450 | 0.53 | 151.97 | 541.01 | 0.00 | 93.00 | 126.0 | 174.00 | 44515.00 | ▇▁▁▁▁ |
YearBuilt | 5375 | 0.60 | 1964.68 | 37.27 | 1196.00 | 1940.00 | 1970.0 | 1999.00 | 2018.00 | ▁▁▁▁▇ |
Lattitude | 0 | 1.00 | -37.81 | 0.08 | -38.18 | -37.86 | -37.8 | -37.76 | -37.41 | ▁▂▇▁▁ |
Longtitude | 0 | 1.00 | 145.00 | 0.10 | 144.43 | 144.93 | 145.0 | 145.06 | 145.53 | ▁▁▇▂▁ |
Propertycount | 0 | 1.00 | 7454.42 | 4378.58 | 249.00 | 4380.00 | 6555.0 | 10331.00 | 21650.00 | ▆▇▅▂▁ |
Useful for visualization, particular missing data.
## check character and numerical variables with missing
vis_dat(data)
Options to check more about missing data
## check missing percentage
vis_miss(data)
## can cluster and sort
args(vis_miss)
## function (x, cluster = FALSE, sort_miss = FALSE, show_perc = TRUE,
## show_perc_col = TRUE, large_data_size = 9e+05, warn_large_data = TRUE)
## NULL
Options for checking correlations
data %>% select(where(is.numeric)) %>%
vis_cor()
Options for checking value condition
vis_expect(data, ~ .x > 2)
Other functions are vis_compare for compare dataframe, vis_guess for individual class of each value, vis_value for heatmap (missing?), vis_binary for occurrence of binary (missing?)
All can be piped to show plot and plot can be two tibble
## data type
inspect_types(data) %>% show_plot()
## memory usage
inspect_mem(data) %>% show_plot()
## check na
data_price_dummy <- data %>%
mutate(price_dummy = if_else(Price > 1000000, "High", "Low"))
inspect_na(data_price_dummy %>% filter(price_dummy == "High"),
data_price_dummy %>% filter(price_dummy == "Low")) %>%
show_plot()
## check numerical variable distribution
inspect_num(data) %>% show_plot()
## check categorical variable distribution
inspect_imb(data) %>% show_plot()
## check two categorical
inspect_imb(data_price_dummy %>% filter(price_dummy == "High"),
data_price_dummy %>% filter(price_dummy == "Low")) %>%
show_plot() + theme(legend.position = "none")
## similiar to inspect_imb, but for all levels
inspect_cat(data) %>% show_plot()
inspect_cor(data) %>% show_plot()