# 2_exploratory.r
source("../fannie_mae/00_setup.r")
## Warning: package 'fst' was built under R version 3.5.3
## 
## Attaching package: 'lubridate'
## The following object is masked from 'package:base':
## 
##     date
## Loading required package: future
## Loading required package: dplyr
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:lubridate':
## 
##     intersect, setdiff, union
## The following object is masked from 'package:xgboost':
## 
##     slice
## The following object is masked from 'package:glue':
## 
##     collapse
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
## Loading required package: data.table
## 
## Attaching package: 'data.table'
## The following objects are masked from 'package:dplyr':
## 
##     between, first, last
## The following objects are masked from 'package:lubridate':
## 
##     hour, isoweek, mday, minute, month, quarter, second, wday,
##     week, yday, year
## The following object is masked from 'package:purrr':
## 
##     transpose
## We have 4 workers to use with disk.frame. To change that use setup_disk.frame(workers = n)
## 
## Attaching package: 'disk.frame'
## The following objects are masked from 'package:purrr':
## 
##     map, map2
## The following objects are masked from 'package:base':
## 
##     colnames, ncol, nrow

Assign a disk.frame

You use disk.frame(path) to assign a disk.frame

dir(file.path(outpath, "fm_with_harp")) %>% 
  head
## [1] "1.fst"   "10.fst"  "100.fst" "101.fst" "102.fst" "103.fst"

Use the function disk.frame to assign a folder

fm_with_harp = 
  disk.frame(file.path(outpath, "fm_with_harp"))

fm_with_harp
## path: "d:/data/fannie_mae_disk_frame//fm_with_harp"
## nchunks: 500
## nrow: 1820760716
## ncol: 38

Simple info about disk.frame

nrow(fm_with_harp)
## [1] 1820760716
ncol(fm_with_harp)
## [1] 38
head(fm_with_harp)
tail(fm_with_harp)

Simple group-by example

I want create the sum(default_12m) group by month. The sql statement looks like

select
  monthly.rpt.prd,
  sum(default_12m) as n_defaults,
  count(*) as cnt,
  n_default/cnt as odr # observed default rate
from
  table
group by
  monthly.rpt.prd;

This analysis only uses two columns, namely default_12m and monthly.rpt.prd. So I use srckeep to ensure that only those two columns are loaded.

The backend for disk.frame is fst which allows random access to columns AKA load only what you need! This is importance for performance!

system.time(a_wh1 <- fm_with_harp %>% 
  srckeep(c("default_12m","monthly.rpt.prd")) %>% 
  group_by(monthly.rpt.prd) %>% 
  summarise(
    N = n(), 
    n_defaults = sum(default_12m, na.rm = T)))
## Warning in group_by.disk.frame(., monthly.rpt.prd): The shardkeys 'loan_id'
## are NOT identical to shardby = 'monthly.rpt.prd'. The group_by operation
## is applied WITHIN each chunk, hence the results may not be as expected. To
## address this issue, you can rechunk(df, shardby = your_group_keys) which
## can be computationally expensive. Otherwise, you may use a second stage
## summary to obtain the desired result.
##    user  system elapsed 
##    0.01    0.00    0.02

The error recognises that the shardkey is loan_id which is not the group-by variable - monthly.rpt.prd.

The operations returned almost instantly, this is because the computation was done lazily. For now operations are recorded. No computation will take place until collect or compute is called.

Now play the recording by calling collect. This will take a minute. Back to the slides:

system.time(a_wh2 <- a_wh1 %>% collect) # 60~70 plugged in
##    user  system elapsed 
##    0.15    0.07   32.72
a_wh2
# temporary
# a_wh1 %>% 
#   srckeep(c("monthly.rpt.prd", "default_12m")) %>% 
#   map(function(chunk) {
#     chunk[1,]
#   }) %>% 
#   collect
# temporary
# a_wh1 %>% 
#   srckeep(c("monthly.rpt.prd", "default_12m")) %>% 
#   map(~{
#     .x[1,.SD]
#   }) %>% 
  # collect

once collect is called the resultant data is stored as a data.frame.

However this is not the correct result, as the group by was performed within each chunk. Hence we need a second stage group by. The second group by takes no time at all, as everything was done in memory

system.time(a_wh3 <- a_wh2 %>%
  group_by(monthly.rpt.prd) %>% 
  summarise(
    odr = sum(n_defaults)/sum(N)
  ) %>% 
  rename(
    Date = monthly.rpt.prd,
    `Observed Default Rate%` = odr
  ))
##    user  system elapsed 
##    0.02    0.00    0.01

Manipulate the data and plot

a_wh4 = a_wh3 %>% 
  gather(key = type, value=rate, -Date)

ggplot(a_wh4) + 
  geom_line(aes(x=Date, y = rate, colour = type)) +
  ggtitle("Fannie Mae Observed Default Rate over time & HARP Conversion Rate")