Task 0

What tf is my data?

getwd()
## [1] "/Users/isaiahmireles/Desktop/FingerHut"
setwd("/Users/isaiahmireles/Desktop/FingerHut")
df <- read.csv("dat_train1.csv")
df |> head()
##   customer_id account_id ed_id      event_name      event_timestamp
## 1    15849251  383997507     4 browse_products 2021-11-04T14:11:15Z
## 2    15849251  383997507     4 browse_products 2021-11-04T14:11:29Z
## 3    15849251  383997507     4 browse_products 2021-11-04T14:12:10Z
## 4    15849251  383997507     4 browse_products 2021-11-04T14:12:21Z
## 5    15849251  383997507     4 browse_products 2021-11-04T14:12:24Z
## 6    15849251  383997507     2  campaign_click 2021-11-29T06:00:00Z
##   journey_steps_until_end                 id sep
## 1                       1 15849251 383997507   -
## 2                       2 15849251 383997507   -
## 3                       3 15849251 383997507   -
## 4                       4 15849251 383997507   -
## 5                       5 15849251 383997507   -
## 6                       6 15849251 383997507   -
df |> str()
## 'data.frame':    54960961 obs. of  8 variables:
##  $ customer_id            : int  15849251 15849251 15849251 15849251 15849251 15849251 15849251 15849251 15849251 15849251 ...
##  $ account_id             : int  383997507 383997507 383997507 383997507 383997507 383997507 383997507 383997507 383997507 383997507 ...
##  $ ed_id                  : int  4 4 4 4 4 2 4 4 19 19 ...
##  $ event_name             : chr  "browse_products" "browse_products" "browse_products" "browse_products" ...
##  $ event_timestamp        : chr  "2021-11-04T14:11:15Z" "2021-11-04T14:11:29Z" "2021-11-04T14:12:10Z" "2021-11-04T14:12:21Z" ...
##  $ journey_steps_until_end: int  1 2 3 4 5 6 7 8 9 10 ...
##  $ id                     : chr  "15849251 383997507" "15849251 383997507" "15849251 383997507" "15849251 383997507" ...
##  $ sep                    : chr  "-" "-" "-" "-" ...

Task 1

  1. How many rows does your data set have?
df |> nrow()
## [1] 54960961
  2. How many unique IDs are there?
length(unique(df$id))
## [1] 1430445
  3. What are the earliest and latest timestamps?
# Peek at a few random timestamps first to identify the format before parsing
set.seed(123)
samp <- sample(df$event_timestamp, 3); samp
## [1] "2022-06-03T22:13:06Z" "2022-04-03T20:57:37Z" "2021-04-26T23:54:48Z"

Time Type : ISO 8601 format

%Y-%m-%dT%H:%M:%SZ

Component   Meaning
%Y          4-digit year
%m          month (01-12)
%d          day of month (01-31)
T           literal date/time separator
%H          hour (24h, 00-23)
%M          minute
%S          second
Z           literal suffix marking UTC

Z stands for the “zero” UTC offset (also called “Zulu” time), which indicates Coordinated Universal Time (UTC).
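The format string above can be checked against one of the sampled timestamps. This is a minimal sketch in base R, not part of the original analysis; the variable names `ts_chr` and `ts` are hypothetical.

```r
# Sample value taken from the sampled timestamps shown earlier
ts_chr <- "2022-06-03T22:13:06Z"

# Parse with the documented format; tz = "UTC" because of the trailing Z
ts <- as.POSIXct(ts_chr, format = "%Y-%m-%dT%H:%M:%SZ", tz = "UTC")

# Round-trip back to a string to confirm nothing was lost in parsing
round_trip <- format(ts, "%Y-%m-%dT%H:%M:%SZ", tz = "UTC")
identical(round_trip, ts_chr)

# A value that does not match the format becomes NA rather than raising an error
bad <- as.POSIXct("2022-06-03 22:13", format = "%Y-%m-%dT%H:%M:%SZ", tz = "UTC")
is.na(bad)
```

Counting `NA`s after parsing the full column would be one way to confirm that every row follows this format.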

What tf is UTC time? (YouTube)

Date Features :

Because the timestamps are in UTC, we can't assume every customer is in the same time zone, so time-of-day features aren't necessarily meaningful. Broader signals such as season, day of week, and longer-term trends are likely more relevant.
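To make the UTC concern concrete, the sketch below renders a single UTC instant in a few time zones. The zones are illustrative assumptions only, since the data does not say where customers are located, and the names `x`, `zones`, and `local_view` are hypothetical.

```r
# One UTC instant, shortly after midnight UTC
x <- as.POSIXct("2021-11-04T02:30:00Z", format = "%Y-%m-%dT%H:%M:%SZ", tz = "UTC")

# Illustrative zones only (assumption): the data set carries no location field
zones <- c("UTC", "America/Chicago", "Asia/Tokyo")

# Render the same instant as a local date and local hour in each zone
local_view <- data.frame(
  zone       = zones,
  local_date = vapply(zones, function(z) format(x, "%Y-%m-%d", tz = z), character(1)),
  local_hour = vapply(zones, function(z) format(x, "%H", tz = z), character(1)),
  row.names  = NULL
)
local_view
```

The same event falls in the evening of the previous day in one zone and late morning in another, which is why hour-of-day is a weak feature here while the calendar date shifts by at most one day.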

library(lubridate)
library(dplyr)

df <- df |>
  mutate(
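    # --- Core date features (SAFE in UTC) ---
    # as.Date() reads the leading YYYY-MM-DD and ignores the time part,
    # so event_date is the UTC calendar date
    event_date = as.Date(event_timestamp),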
    weekday = wday(event_date, label = TRUE),
    month = month(event_date),
    day_of_month = day(event_date),

    # --- Weekend (still useful globally) ---
    # wday(label = TRUE) returns locale-dependent names; this assumes an English locale
    is_weekend = weekday %in% c("Sat", "Sun"),

    # --- Pay cycle (VERY IMPORTANT; UTC day of month is at most one day off local) ---
    pay_cycle = case_when(
      day_of_month <= 7  ~ "early_month", # benefits / rent timing
      day_of_month <= 15 ~ "mid_month", # paycheck #1
      day_of_month <= 23 ~ "late_month", # paycheck #2
      TRUE ~ "end_month" # cash-constrained period -- pay bills
     ),

    # --- Coarse seasonal signal (keep but low priority) ---
    season = case_when(
      month %in% c(12, 1, 2)  ~ "winter",
      month %in% c(3, 4, 5)   ~ "spring",
      month %in% c(6, 7, 8)   ~ "summer",
      month %in% c(9, 10, 11) ~ "fall"
    ),

    # --- Stronger macro timing features ---
    is_holiday_season = month %in% c(11, 12),   # retail spike
    is_tax_season = month %in% c(2, 3, 4)       # refund liquidity
  )
range(df$event_date)
## [1] "2020-11-03" "2023-01-23"
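The range above is at day resolution. If the earliest and latest timestamps are needed to the second, one option is to take the range of the raw strings, which works only because every value shares the same fixed-width UTC format. The sketch below uses the three sampled timestamps as toy data and cross-checks against parsed values; `ts`, `rng_chr`, `parsed`, and `rng_psx` are hypothetical names.

```r
# Toy timestamps in the same fixed-width UTC format as event_timestamp
ts <- c("2022-06-03T22:13:06Z", "2022-04-03T20:57:37Z", "2021-04-26T23:54:48Z")

# Character range: valid here only because the format is fixed-width and all UTC
rng_chr <- range(ts)

# Cross-check by parsing to POSIXct and taking the range of the actual instants
parsed  <- as.POSIXct(ts, format = "%Y-%m-%dT%H:%M:%SZ", tz = "UTC")
rng_psx <- format(range(parsed), "%Y-%m-%dT%H:%M:%SZ", tz = "UTC")

# Both approaches should agree on earliest and latest
identical(rng_chr, rng_psx)
rng_chr
```

On the full data this would amount to `range(df$event_timestamp)`, which avoids parsing roughly 55 million strings, assuming no row deviates from the format.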

Data Fast Facts :