Performance Comparisons in data ETL using “data.table” & “dplyr”

This is a comparison in the difference in time required for various operations used to manipulate data in any given data science project.

The two packages that will be used for this comparison are: “data.table” and “dplyr”.

For this analysis, we will use a proprietary dataset that was provided by the AHA. The data contains information on participants in AHA findraising events for the years 2015-2017.


Loading the data and transforming its structure:

The first thing we need to do is fix the data so that we can perform operations on it by removing special characters and spacing. There are two ways to do this: the first is using base_R functionality and subsequent “dplyr” selection, and the second is using “data.table” operations. Lets see how they compare:

## Base R data transformation: 40.602 sec elapsed

The previous evaluation was using Base R functionality. Now, lets see how “data.table” performs:

## data.table transformation: 16.678 sec elapsed

From this comparison we can see that “data.table”" is significantly faster for loading the data and preparing it for manipulation. The next step will be to actually perform operations on the data such as aggregation and summarization, etc.


Data Exploration & Manipulation:

The AHA’s basic purpose of giving the data to us was to see if we could find a way to predict teams and individuals who would donate the largest sums of money to the fundraising events. With this knowledge, the AHA would theoretically be able to undertake targeted marketing efforts to encourage these groups and keep them active.

In order to help solve this issue, we will first have to perform some operations on the data to get a better idea of its structure. Then, we will have to aggregate it into a useful format. Finally, we will need to develop some useful summaries of its structure that we can use as inputs for the machine learning algorithms.

Our procedure in doing this will follow these steps:

  1. Subset the top 10% of teams and individuals in terms of total donations.
  2. Summarize this data by year.
  3. Create flags to mark individuals belonging to the top 10%.
  4. Summarize the amount raised per year by teams, individuals, events, and segments.
  5. Merge the summary information in a usefull way so that it can be used by the machine learning algorithms.

dplyr-

# Start the timer
tic("dplyr approach to subsetting")

# Add count of individuals per team
count <- df2 %>% group_by(team_id) %>% mutate(team_count = n())

# Aggregate total funds raised by team and individual
team_totals <- aggregate(total_gifts ~ team_id + event_year, data = count, FUN = sum)
personal_totals <- aggregate(total_gifts ~ participant_id + event_year, data = count, FUN = sum)

# Summarize totals by segment and event 
seg_summ <- count %>% group_by(tap_desc) %>% summarise(raised = sum(total_gifts), count = n())

event_summ <- count %>% group_by(name) %>% summarise(event_total = sum(total_gifts), count = n())

# Top 10% of teams and individuals and teams by year
n <- 10

top_teams_total <- subset(team_totals, total_gifts > quantile(total_gifts, prob = 1 - n/100))

top_walkers_total <- subset(personal_totals, total_gifts > quantile(total_gifts, prob = 1 - n/100))

top_team_y1 <- team_totals %>% filter(event_year == "FY2015") %>% subset(total_gifts > quantile(total_gifts, prob = 1 - n/100))
top_team_y2 <- team_totals %>% filter(event_year == "FY2016") %>% subset(total_gifts > quantile(total_gifts, prob = 1 - n/100))
top_team_y3 <- team_totals %>% filter(event_year == "FY2017") %>% subset(total_gifts > quantile(total_gifts, prob = 1 - n/100))

top_walker_y1 <- personal_totals %>% filter(event_year == "FY2015") %>% subset(total_gifts > quantile(total_gifts, prob = 1 - n/100))
top_walker_y2 <- personal_totals %>% filter(event_year == "FY2016") %>% subset(total_gifts > quantile(total_gifts, prob = 1 - n/100))
top_walker_y3 <- personal_totals %>% filter(event_year == "FY2017") %>% subset(total_gifts > quantile(total_gifts, prob = 1 - n/100))

# Top 10% of segments and events by year
top_segments_total <- subset(seg_summ, raised > quantile(raised, prob = 1 - n/100))
top_events_total <- subset(event_summ, event_total > quantile(event_total, prob = 1 - n/100))

# By year
y1_top_event <- count %>% filter(event_year == "FY2015") %>% group_by(name) %>% summarise(event_year_total = sum(total_gifts)) %>% subset(event_year_total > quantile(event_year_total, prob = 1 - n/100))
y2_top_event <- count %>% filter(event_year == "FY2016") %>% group_by(name) %>% summarise(event_year_total = sum(total_gifts)) %>% subset(event_year_total > quantile(event_year_total, prob = 1 - n/100)) 
y3_top_event <- count %>% filter(event_year == "FY2017") %>% group_by(name) %>% summarise(event_year_total = sum(total_gifts)) %>% subset(event_year_total > quantile(event_year_total, prob = 1 - n/100)) 

y1_top_segment <- count %>% filter(event_year == "FY2015") %>% group_by(tap_desc) %>% summarise(segment_total = sum(total_gifts)) %>% subset(segment_total > quantile(segment_total, prob = 1 - n/100))
y2_top_segment <- count %>% filter(event_year == "FY2016") %>% group_by(tap_desc) %>% summarise(segment_total = sum(total_gifts)) %>% subset(segment_total > quantile(segment_total, prob = 1 - n/100)) 
y3_top_segment <- count %>% filter(event_year == "FY2017") %>% group_by(tap_desc) %>% summarise(segment_total = sum(total_gifts)) %>% subset(segment_total > quantile(segment_total, prob = 1 - n/100)) 

# End timer
toc()
## dplyr approach to subsetting: 29.087 sec elapsed
# Start timer
tic("dplyr approach to filtering and flagging")

# Combine top teams and top walkers
count$top_team <- as.factor(ifelse(count$team_id %in% top_teams_total$team_id, 1, 0))
count$top_walker <- as.factor(ifelse(count$participant_id %in% top_walkers_total$participant_id, 1, 0))

count$y1_top_walker <- as.factor(ifelse(count$participant_id %in% top_walker_y1$participant_id, 1, 0))
count$y2_top_walker <- as.factor(ifelse(count$participant_id %in% top_walker_y2$participant_id, 1, 0))
count$y3_top_walker <- as.factor(ifelse(count$participant_id %in% top_walker_y3$participant_id, 1, 0))

count$y1_top_team <- as.factor(ifelse(count$team_id %in% top_team_y1$team_id, 1, 0))
count$y2_top_team <- as.factor(ifelse(count$team_id %in% top_team_y2$team_id, 1, 0))
count$y3_top_team <- as.factor(ifelse(count$team_id %in% top_team_y3$team_id, 1, 0))

# Flag for events
count$y1_top_event <- as.factor(ifelse(count$name %in% y1_top_event$name, 1, 0))
count$y2_top_event <- as.factor(ifelse(count$name %in% y2_top_event$name, 1, 0))
count$y3_top_event <- as.factor(ifelse(count$name %in% y3_top_event$name, 1, 0))

count$top_event <- as.factor(ifelse(count$name %in% top_events_total$name, 1, 0))

# Flag for segments
count$y1_top_segment <- as.factor(ifelse(count$tap_desc %in% y1_top_segment$tap_desc, 1, 0))
count$y2_top_segment <- as.factor(ifelse(count$tap_desc %in% y2_top_segment$tap_desc, 1, 0))
count$y3_top_segment <- as.factor(ifelse(count$tap_desc %in% y3_top_segment$tap_desc, 1, 0))

count$top_segment <- as.factor(ifelse(count$tap_desc %in% top_segments_total$tap_desc, 1, 0))

# End timer
toc()
## dplyr approach to filtering and flagging: 1.938 sec elapsed

So, it took “dplyr” roughly 30 seconds to subset the data and another 2.5 seconds to create the flags.

What about “data.table”?

data.table-

# Start the timer
tic("data.table approach to subsetting")

# count by category
dt2[, ':=' (team_count = .N), by = team_id]
dt2[, ':=' (event_count = .N), by = name]
dt2[, ':=' (segment_count = .N), by = tap_desc]

# sum $ by category
dt2[, ':=' (personal_sum = sum(total_gifts)), by = participant_id]

dt2[, ':=' (segment_sum = sum(total_gifts)), by = tap_desc]

dt2[, ':=' (event_sum = sum(total_gifts)), by = name]

dt2[, ':=' (year_sum = sum(total_gifts)), by = event_year] 

# top 10% 
n <- 10

dt2[, top_walker := as.factor(ifelse(participant_id %in% dt2[personal_sum > quantile(personal_sum, prob = 1 - n/100)]$participant_id, 1, 0))]

dt2[, top_team := as.factor(ifelse(team_id %in% dt2[team_total_gifts > quantile(team_total_gifts, prob = 1 - n/100)]$team_id, 1, 0))]

dt2[, top_segment := as.factor(ifelse(tap_desc %in% dt2[segment_sum > quantile(segment_sum, prob = 1 - n/100)]$tap_desc, 1, 0))]

dt2[, top_event := as.factor(ifelse(name %in% dt2[event_sum > quantile(event_sum, prob = 1 - n/100)]$name, 1, 0))]

# End timer
toc()
## data.table approach to subsetting: 0.717 sec elapsed
# Start timer
tic("data.table approach to filtering and flagging")

# By year
dt2[, y1_top_walker := as.factor(ifelse(event_year == "FY2015" & participant_id %in% dt2[personal_sum > quantile(personal_sum, prob = 1 - n/100)]$participant_id, 1, 0))]
dt2[, y2_top_walker := as.factor(ifelse(event_year == "FY2016" & participant_id %in% dt2[personal_sum > quantile(personal_sum, prob = 1 - n/100)]$participant_id, 1, 0))]
dt2[, y3_top_walker := as.factor(ifelse(event_year == "FY2017" & participant_id %in% dt2[personal_sum > quantile(personal_sum, prob = 1 - n/100)]$participant_id, 1, 0))]

dt2[, y1_top_team := as.factor(ifelse(event_year == "FY2015" & team_id %in% dt2[team_total_gifts > quantile(team_total_gifts, prob = 1 - n/100)]$team_id, 1, 0))]
dt2[, y2_top_team := as.factor(ifelse(event_year == "FY2016" & team_id %in% dt2[team_total_gifts > quantile(team_total_gifts, prob = 1 - n/100)]$team_id, 1, 0))]
dt2[, y3_top_team := as.factor(ifelse(event_year == "FY2017" & team_id %in% dt2[team_total_gifts > quantile(team_total_gifts, prob = 1 - n/100)]$team_id, 1, 0))]

dt2[, y1_top_segment := as.factor(ifelse(event_year == "FY2015" & tap_desc %in% dt2[segment_sum > quantile(segment_sum, prob = 1 - n/100)]$tap_desc, 1, 0))]
dt2[, y2_top_segment := as.factor(ifelse(event_year == "FY2016" & tap_desc %in% dt2[segment_sum > quantile(segment_sum, prob = 1 - n/100)]$tap_desc, 1, 0))]
dt2[, y3_top_segment := as.factor(ifelse(event_year == "FY2017" & tap_desc %in% dt2[segment_sum > quantile(segment_sum, prob = 1 - n/100)]$tap_desc, 1, 0))]

dt2[, y1_top_event := as.factor(ifelse(event_year == "FY2015" & name %in% dt2[event_sum > quantile(event_sum, prob = 1 - n/100)]$name, 1, 0))]
dt2[, y2_top_event := as.factor(ifelse(event_year == "FY2016" & name %in% dt2[event_sum > quantile(event_sum, prob = 1 - n/100)]$name, 1, 0))]
dt2[, y3_top_event := as.factor(ifelse(event_year == "FY2017" & name %in% dt2[event_sum > quantile(event_sum, prob = 1 - n/100)]$name, 1, 0))]

# End timer
toc()
## data.table approach to filtering and flagging: 1.879 sec elapsed

The “data.table” approach took only .85 seconds to subset the data and another 1.7 seconds to filter it and create the flags. From this we can see that there is an incredible difference in speed using these two different packages. It is also useful to note that with “data.table” we were able to subset the data and create the flags all in one step as opposed to using two like we did in the “dplyr” approach.

There is one final step that remains before we are ready to use our data as input for the random forest model: we must merge all of our tables into one for each of three models- top_walker, top_event, and top_segment.


Creating the final outputs:

dplyr-

# Start timer
tic("dplyr approach to merging and summarizing")

# top_walker df

# Yearly totals for teams and individuals
walker_df <- count %>% group_by(participant_id) %>% mutate(y1_personal = ifelse(event_year == "FY2015", sum(total_gifts), 0),
                                                           y2_personal = ifelse(event_year == "FY2016", sum(total_gifts), 0),
                                                           y3_personal = ifelse(event_year == "FY2017", sum(total_gifts), 0),
                                                           y1_team = ifelse(event_year == "FY2015", sum(team_total_gifts), 0),
                                                           y2_team = ifelse(event_year == "FY2016", sum(team_total_gifts), 0),
                                                           y3_team = ifelse(event_year == "FY2017", sum(team_total_gifts), 0))

# top_event df

# Count teams per event
teams <- with(count, aggregate(team_id ~ name, FUN = function(x){length(unique(x))}))

# Combine by event
event_df <- sqldf("select a.*, b.team_id as num_teams from walker_df a join teams b on a.name = b.name")

# Summarize demographics by event
event_demo <- count %>% group_by(name) %>% summarize(age = mean(medage_cy),
                                                     diversity = mean(divindx_cy),
                                                     house_inc = mean(medhinc_cy),
                                                     disposable_inc = mean(meddi_cy),
                                                     worth = mean(mednw_cy),
                                                     raised = sum(total_gifts))

# Create markers for top grossing
event_demo$y1_top_event <- as.factor(ifelse(event_demo$name %in% y1_top_event$name, 1, 0)) 
event_demo$y2_top_event <- as.factor(ifelse(event_demo$name %in% y2_top_event$name, 1, 0))
event_demo$y3_top_event <- as.factor(ifelse(event_demo$name %in% y3_top_event$name, 1, 0))
event_demo$top_event <- as.factor(ifelse(event_demo$name %in% top_events_total$name, 1, 0))

# Join counts and demographic info
event_df2 <- sqldf("select a.*, b.* from event_df a join event_demo b on a.name = b.name ")

# top_segment df

# Count participants
seg_df <- count %>% group_by(tap_desc) %>% mutate(seg_count = n())

seg_demo <- seg_df %>% group_by(tap_desc) %>% summarise(age = mean(medage_cy),
                                                         diversity = mean(divindx_cy),
                                                         house_inc = mean(medhinc_cy),
                                                         disposable_inc = mean(meddi_cy),
                                                         worth = mean(mednw_cy))

# Combine by segment 
seg_df <- sqldf("select a.*, b.* from seg_df a join seg_demo b on a.tap_desc = b.tap_desc")

# End timer
toc()
## dplyr approach to merging and summarizing: 93.64 sec elapsed

data.table-

# Start timer
tic("data.table approach to merging and summarizing")

# Create df for RF model of top_walker
walk_dt <- dt2[, ltf_gifts := sum(total_gifts), by = participant_id][]

y1_personal <- walk_dt[(event_year == "FY2015"), list(y1_personal = sum(total_gifts)), by = participant_id]
y2_personal <- walk_dt[(event_year == "FY2016"), list(y2_personal = sum(total_gifts)), by = participant_id]
y3_personal <- walk_dt[(event_year == "FY2017"), list(y3_personal = sum(total_gifts)), by = participant_id]

setkey(walk_dt, participant_id)
setkey(y1_personal, participant_id)
setkey(y2_personal, participant_id)
setkey(y3_personal, participant_id)

walk_dt2 <- merge(walk_dt, y1_personal, all.x = TRUE)
walk_dt2[is.na(y1_personal), y1_personal := 0]
setkey(walk_dt2, participant_id)

walk_dt3 <- merge(walk_dt2, y2_personal, all.x = TRUE)
walk_dt3[is.na(y2_personal), y2_personal := 0]

walk_dt3 <- merge(walk_dt3, y3_personal, all.x = TRUE)
walk_dt3[is.na(y3_personal), y3_personal := 0]

# Summarize demographics for event RF
event_demo_dt <- walk_dt3[, list(age = mean(medage_cy), diversity = mean(divindx_cy), house_inc = mean(medhinc_cy), disposable_inc = mean(meddi_cy), worth = mean(mednw_cy), raised = sum(total_gifts)), by = "name"]

# Final df for Event model
event_rf_dat <- sqldf("select a.*, b.age, b.diversity, b.house_inc, b.disposable_inc, b.worth, b.raised from walk_dt3 a join event_demo_dt b on a.name = b.name")

# Summarize info for tapestry RF
segments_dt <- walk_dt3[, list(age = mean(medage_cy), diversity = mean(divindx_cy), house_inc = mean(medhinc_cy), disposable_inc = mean(meddi_cy), worth = mean(mednw_cy), raised = sum(total_gifts)), by = "tap_desc"]
## Warning in gmean(mednw_cy): Group 1 summed to more than type 'integer'
## can hold so the result has been coerced to 'numeric' automatically, for
## convenience.
# Final df for segment model
segment_rf_dat <- sqldf("select a.*, b.age, b.diversity, b.house_inc, b.disposable_inc, b.worth, b.raised from walk_dt3 a join segments_dt b on a.tap_desc = b.tap_desc")

# End timer
toc()
## data.table approach to merging and summarizing: 10.604 sec elapsed

Conclusion:

In the end, what we can see from all this is that the “data.table” package for R performs signficantly faster than “dplyr” at every single step of the process involved in manioulating and transforming a large dataset to prepare for its ingestion by a machine learning algorithm. The syntax is more complex, but the trade-off in terms of time is well worth the additional effort required to learn and understand this package. The one advantage that “dplyr” offers is its ability to creat pipes with the “%>%” operator and chain multiple operations together in one call. There may be a “data.table” solution to this issue as well, however, I was unable to use it for this comparison.


Full code:

library(knitr)
library(dplyr)
library(data.table)
library(tictoc)
library(sqldf)

set.seed(100)

options(scipen = 3)

# Start timer
tic("dplyr approach")

# Load the data
data <- read.csv("Data/LuminateDataExport_UTDP2_011818.csv") # Note, using the "fread" function would be faster but will not auto-detect the data types for each column

# Clean the data
names(data) <- tolower(names(data))

# Remove the unwanted characters
data <- as.data.frame(lapply(data, gsub, pattern = "\\$", replacement = ""))

data <- as.data.frame(lapply(data, gsub, pattern = ",", replacement = ""))

data <- as.data.frame(lapply(data, gsub, pattern = " ", replacement = ""))

# Use dplyr to filter relevant columns
df <- select(data, -event_date, -city, -company_goal, -company_name, -fundraising_goal, -state, -street, -team_count, -team_member_goal, -team_name, -zip, -gifts_count, -registration_gift)

# Convert everything to character
df <- data.frame(lapply(df, as.character), stringsAsFactors = FALSE)

# Function to convert numeric columns
conversion <- function(x) {
  
  stopifnot(is.list(x))
  
  x[] <- rapply(x, utils::type.convert, classes = "character", how = "replace", as.is = TRUE)
  
  return(x)
}

# Conversion
df2 <- conversion(df)

# Factor columns
cols <- c("event_year", "name", "participant_id", "team_captain", "team_id", "match_code", "tap_level", "tap_desc")

df2[cols] <- lapply(df2[cols], factor)

# Fill empty segments info
df2$tap_desc <- sub("^$", 0, df2$tap_desc)

# Add count of individuals per team
count <- df2 %>% group_by(team_id) %>% mutate(team_count = n())

# Aggregate total funds raised by team and individual
team_totals <- aggregate(total_gifts ~ team_id + event_year, data = count, FUN = sum)
personal_totals <- aggregate(total_gifts ~ participant_id + event_year, data = count, FUN = sum)

# Summarize totals by segment and event 
seg_summ <- count %>% group_by(tap_desc) %>% summarise(raised = sum(total_gifts), count = n())

event_summ <- count %>% group_by(name) %>% summarise(event_total = sum(total_gifts), count = n())

# Top 10% of teams and individuals and teams by year
n <- 10

top_teams_total <- subset(team_totals, total_gifts > quantile(total_gifts, prob = 1 - n/100))

top_walkers_total <- subset(personal_totals, total_gifts > quantile(total_gifts, prob = 1 - n/100))

top_team_y1 <- team_totals %>% filter(event_year == "FY2015") %>% subset(total_gifts > quantile(total_gifts, prob = 1 - n/100))
top_team_y2 <- team_totals %>% filter(event_year == "FY2016") %>% subset(total_gifts > quantile(total_gifts, prob = 1 - n/100))
top_team_y3 <- team_totals %>% filter(event_year == "FY2017") %>% subset(total_gifts > quantile(total_gifts, prob = 1 - n/100))

top_walker_y1 <- personal_totals %>% filter(event_year == "FY2015") %>% subset(total_gifts > quantile(total_gifts, prob = 1 - n/100))
top_walker_y2 <- personal_totals %>% filter(event_year == "FY2016") %>% subset(total_gifts > quantile(total_gifts, prob = 1 - n/100))
top_walker_y3 <- personal_totals %>% filter(event_year == "FY2017") %>% subset(total_gifts > quantile(total_gifts, prob = 1 - n/100))

# Top 10% of segments and events by year
top_segments_total <- subset(seg_summ, raised > quantile(raised, prob = 1 - n/100))
top_events_total <- subset(event_summ, event_total > quantile(event_total, prob = 1 - n/100))

# By year
y1_top_event <- count %>% filter(event_year == "FY2015") %>% group_by(name) %>% summarise(event_year_total = sum(total_gifts)) %>% subset(event_year_total > quantile(event_year_total, prob = 1 - n/100))
y2_top_event <- count %>% filter(event_year == "FY2016") %>% group_by(name) %>% summarise(event_year_total = sum(total_gifts)) %>% subset(event_year_total > quantile(event_year_total, prob = 1 - n/100)) 
y3_top_event <- count %>% filter(event_year == "FY2017") %>% group_by(name) %>% summarise(event_year_total = sum(total_gifts)) %>% subset(event_year_total > quantile(event_year_total, prob = 1 - n/100)) 

y1_top_segment <- count %>% filter(event_year == "FY2015") %>% group_by(tap_desc) %>% summarise(segment_total = sum(total_gifts)) %>% subset(segment_total > quantile(segment_total, prob = 1 - n/100))
y2_top_segment <- count %>% filter(event_year == "FY2016") %>% group_by(tap_desc) %>% summarise(segment_total = sum(total_gifts)) %>% subset(segment_total > quantile(segment_total, prob = 1 - n/100)) 
y3_top_segment <- count %>% filter(event_year == "FY2017") %>% group_by(tap_desc) %>% summarise(segment_total = sum(total_gifts)) %>% subset(segment_total > quantile(segment_total, prob = 1 - n/100)) 

# top_walker df

# Yearly totals for teams and individuals
walker_df <- count %>% group_by(participant_id) %>% mutate(y1_personal = ifelse(event_year == "FY2015", sum(total_gifts), 0),
                                                           y2_personal = ifelse(event_year == "FY2016", sum(total_gifts), 0),
                                                           y3_personal = ifelse(event_year == "FY2017", sum(total_gifts), 0),
                                                           y1_team = ifelse(event_year == "FY2015", sum(team_total_gifts), 0),
                                                           y2_team = ifelse(event_year == "FY2016", sum(team_total_gifts), 0),
                                                           y3_team = ifelse(event_year == "FY2017", sum(team_total_gifts), 0))

# top_event df

# Count teams per event
teams <- with(count, aggregate(team_id ~ name, FUN = function(x){length(unique(x))}))

# Combine by event
event_df <- sqldf("select a.*, b.team_id as num_teams from walker_df a join teams b on a.name = b.name")

# Summarize demographics by event
event_demo <- count %>% group_by(name) %>% summarize(age = mean(medage_cy),
                                                     diversity = mean(divindx_cy),
                                                     house_inc = mean(medhinc_cy),
                                                     disposable_inc = mean(meddi_cy),
                                                     worth = mean(mednw_cy),
                                                     raised = sum(total_gifts))

# Create markers for top grossing
event_demo$y1_top_event <- as.factor(ifelse(event_demo$name %in% y1_top_event$name, 1, 0)) 
event_demo$y2_top_event <- as.factor(ifelse(event_demo$name %in% y2_top_event$name, 1, 0))
event_demo$y3_top_event <- as.factor(ifelse(event_demo$name %in% y3_top_event$name, 1, 0))
event_demo$top_event <- as.factor(ifelse(event_demo$name %in% top_events_total$name, 1, 0))

# Join counts and demographic info
event_df2 <- sqldf("select a.*, b.* from event_df a join event_demo b on a.name = b.name ")

# top_segment df

# Count participants
seg_df <- count %>% group_by(tap_desc) %>% mutate(seg_count = n())

seg_demo <- seg_df %>% group_by(tap_desc) %>% summarise(age = mean(medage_cy),
                                                         diversity = mean(divindx_cy),
                                                         house_inc = mean(medhinc_cy),
                                                         disposable_inc = mean(meddi_cy),
                                                         worth = mean(mednw_cy))

# Combine by segment 
seg_df <- sqldf("select a.*, b.* from seg_df a join seg_demo b on a.tap_desc = b.tap_desc")

# End timer
toc()

# Start timer
tic("data.table approach")

# Reload the data 
data <- setDT(read.csv("Data/LuminateDataExport_UTDP2_011818.csv"))

# Remove the unwanted characters
dt <- dt[, lapply(.SD, function(x) {gsub(c("\\$| |,"), "", x)})]

# Lower-casing
setnames(dt, names(dt), tolower(names(dt)))

# Exclude irrelevant columns
dt1 <- dt[,  c("event_date", "city", "company_goal", "company_name", "fundraising_goal", "state", "street", "team_count", "team_member_goal", "team_name", "zip", "gifts_count", "registration_gift") := NULL]

# Convert everything to character
dt1 <- dt1[, lapply(.SD, as.character, stringsAsFactors = FALSE)]

# Function to convert numeric columns
conversion <- function(x) {
  
  stopifnot(is.list(x))
  
  x[] <- rapply(x, utils::type.convert, classes = "character", how = "replace", as.is = TRUE)
  
  return(x)
}

# Conversion
dt2 <- conversion(dt1)

# Factor columns
cols <- c("event_year", "name", "participant_id", "team_captain", "team_id", "match_code", "tap_level", "tap_desc")

dt2 <- dt2[, (cols) := lapply(.SD, as.factor), .SDcols = cols]

# Fill empty segments info
dt2 <- dt2[, ("tap_desc") := sub("tap_desc", "^$", 0)]

# count by category
dt2[, ':=' (team_count = .N), by = team_id]
dt2[, ':=' (event_count = .N), by = name]
dt2[, ':=' (segment_count = .N), by = tap_desc]

# sum $ by category
dt2[, ':=' (personal_sum = sum(total_gifts)), by = participant_id]

dt2[, ':=' (segment_sum = sum(total_gifts)), by = tap_desc]

dt2[, ':=' (event_sum = sum(total_gifts)), by = name]

dt2[, ':=' (year_sum = sum(total_gifts)), by = event_year] 

# top 10% 
n <- 10

dt2[, top_walker := as.factor(ifelse(participant_id %in% dt2[personal_sum > quantile(personal_sum, prob = 1 - n/100)]$participant_id, 1, 0))]

dt2[, top_team := as.factor(ifelse(team_id %in% dt2[team_total_gifts > quantile(team_total_gifts, prob = 1 - n/100)]$team_id, 1, 0))]

dt2[, top_segment := as.factor(ifelse(tap_desc %in% dt2[segment_sum > quantile(segment_sum, prob = 1 - n/100)]$tap_desc, 1, 0))]

dt2[, top_event := as.factor(ifelse(name %in% dt2[event_sum > quantile(event_sum, prob = 1 - n/100)]$name, 1, 0))]

# By year
dt2[, y1_top_walker := as.factor(ifelse(event_year == "FY2015" & participant_id %in% dt2[personal_sum > quantile(personal_sum, prob = 1 - n/100)]$participant_id, 1, 0))]
dt2[, y2_top_walker := as.factor(ifelse(event_year == "FY2016" & participant_id %in% dt2[personal_sum > quantile(personal_sum, prob = 1 - n/100)]$participant_id, 1, 0))]
dt2[, y3_top_walker := as.factor(ifelse(event_year == "FY2017" & participant_id %in% dt2[personal_sum > quantile(personal_sum, prob = 1 - n/100)]$participant_id, 1, 0))]

dt2[, y1_top_team := as.factor(ifelse(event_year == "FY2015" & team_id %in% dt2[team_total_gifts > quantile(team_total_gifts, prob = 1 - n/100)]$team_id, 1, 0))]
dt2[, y2_top_team := as.factor(ifelse(event_year == "FY2016" & team_id %in% dt2[team_total_gifts > quantile(team_total_gifts, prob = 1 - n/100)]$team_id, 1, 0))]
dt2[, y3_top_team := as.factor(ifelse(event_year == "FY2017" & team_id %in% dt2[team_total_gifts > quantile(team_total_gifts, prob = 1 - n/100)]$team_id, 1, 0))]

dt2[, y1_top_segment := as.factor(ifelse(event_year == "FY2015" & tap_desc %in% dt2[segment_sum > quantile(segment_sum, prob = 1 - n/100)]$tap_desc, 1, 0))]
dt2[, y2_top_segment := as.factor(ifelse(event_year == "FY2016" & tap_desc %in% dt2[segment_sum > quantile(segment_sum, prob = 1 - n/100)]$tap_desc, 1, 0))]
dt2[, y3_top_segment := as.factor(ifelse(event_year == "FY2017" & tap_desc %in% dt2[segment_sum > quantile(segment_sum, prob = 1 - n/100)]$tap_desc, 1, 0))]

dt2[, y1_top_event := as.factor(ifelse(event_year == "FY2015" & name %in% dt2[event_sum > quantile(event_sum, prob = 1 - n/100)]$name, 1, 0))]
dt2[, y2_top_event := as.factor(ifelse(event_year == "FY2016" & name %in% dt2[event_sum > quantile(event_sum, prob = 1 - n/100)]$name, 1, 0))]
dt2[, y3_top_event := as.factor(ifelse(event_year == "FY2017" & name %in% dt2[event_sum > quantile(event_sum, prob = 1 - n/100)]$name, 1, 0))]

# Create df for RF model of top_walker
walk_dt <- dt2[, ltf_gifts := sum(total_gifts), by = participant_id][]

y1_personal <- walk_dt[(event_year == "FY2015"), list(y1_personal = sum(total_gifts)), by = participant_id]
y2_personal <- walk_dt[(event_year == "FY2016"), list(y2_personal = sum(total_gifts)), by = participant_id]
y3_personal <- walk_dt[(event_year == "FY2017"), list(y3_personal = sum(total_gifts)), by = participant_id]

setkey(walk_dt, participant_id)
setkey(y1_personal, participant_id)
setkey(y2_personal, participant_id)
setkey(y3_personal, participant_id)

walk_dt2 <- merge(walk_dt, y1_personal, all.x = TRUE)
walk_dt2[is.na(y1_personal), y1_personal := 0]
setkey(walk_dt2, participant_id)

walk_dt3 <- merge(walk_dt2, y2_personal, all.x = TRUE)
walk_dt3[is.na(y2_personal), y2_personal := 0]

walk_dt3 <- merge(walk_dt3, y3_personal, all.x = TRUE)
walk_dt3[is.na(y3_personal), y3_personal := 0]

# Summarize demographics for event RF
event_demo_dt <- walk_dt3[, list(age = mean(medage_cy), diversity = mean(divindx_cy), house_inc = mean(medhinc_cy), disposable_inc = mean(meddi_cy), worth = mean(mednw_cy), raised = sum(total_gifts)), by = "name"]

# Final df for Event model
event_rf_dat <- sqldf("select a.*, b.age, b.diversity, b.house_inc, b.disposable_inc, b.worth, b.raised from walk_dt3 a join event_demo_dt b on a.name = b.name")

# Summarize info for tapestry RF
segments_dt <- walk_dt3[, list(age = mean(medage_cy), diversity = mean(divindx_cy), house_inc = mean(medhinc_cy), disposable_inc = mean(meddi_cy), worth = mean(mednw_cy), raised = sum(total_gifts)), by = "tap_desc"]

# Final df for segment model
segment_rf_dat <- sqldf("select a.*, b.age, b.diversity, b.house_inc, b.disposable_inc, b.worth, b.raised from walk_dt3 a join segments_dt b on a.tap_desc = b.tap_desc")

# End timer
toc()