What tf is my data?
getwd()
## [1] "/Users/isaiahmireles/Desktop/FingerHut"
setwd("/Users/isaiahmireles/Desktop/FingerHut")
df <- read.csv("dat_train1.csv")
df |> head()
## customer_id account_id ed_id event_name event_timestamp
## 1 15849251 383997507 4 browse_products 2021-11-04T14:11:15Z
## 2 15849251 383997507 4 browse_products 2021-11-04T14:11:29Z
## 3 15849251 383997507 4 browse_products 2021-11-04T14:12:10Z
## 4 15849251 383997507 4 browse_products 2021-11-04T14:12:21Z
## 5 15849251 383997507 4 browse_products 2021-11-04T14:12:24Z
## 6 15849251 383997507 2 campaign_click 2021-11-29T06:00:00Z
## journey_steps_until_end id sep
## 1 1 15849251 383997507 -
## 2 2 15849251 383997507 -
## 3 3 15849251 383997507 -
## 4 4 15849251 383997507 -
## 5 5 15849251 383997507 -
## 6 6 15849251 383997507 -
df |> str()
## 'data.frame': 54960961 obs. of 8 variables:
## $ customer_id : int 15849251 15849251 15849251 15849251 15849251 15849251 15849251 15849251 15849251 15849251 ...
## $ account_id : int 383997507 383997507 383997507 383997507 383997507 383997507 383997507 383997507 383997507 383997507 ...
## $ ed_id : int 4 4 4 4 4 2 4 4 19 19 ...
## $ event_name : chr "browse_products" "browse_products" "browse_products" "browse_products" ...
## $ event_timestamp : chr "2021-11-04T14:11:15Z" "2021-11-04T14:11:29Z" "2021-11-04T14:12:10Z" "2021-11-04T14:12:21Z" ...
## $ journey_steps_until_end: int 1 2 3 4 5 6 7 8 9 10 ...
## $ id : chr "15849251 383997507" "15849251 383997507" "15849251 383997507" "15849251 383997507" ...
## $ sep : chr "-" "-" "-" "-" ...
df |> nrow()
## [1] 54960961
length(unique(df$id))
## [1] 1430445
set.seed(123)
samp <- sample(df$event_timestamp, 3); samp
## [1] "2022-06-03T22:13:06Z" "2022-04-03T20:57:37Z" "2021-04-26T23:54:48Z"
Time Type : ISO 8601 format
%Y-%m-%dT%H:%M:%SZ
| Component | Meaning |
|---|---|
| %Y | 4-digit year |
| %m | month |
| %d | day |
| T | literal separator |
| %H | hour (24h) |
| %M | minute |
| %S | second |
| Z | UTC timezone |
The trailing Z stands for the "Zero" offset (also called "Zulu" time), which indicates Coordinated Universal Time (UTC).
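To sanity-check the format string above, here is a minimal base-R sketch that parses the three sampled timestamps. The variable names `iso_examples` and `ts` are made up for illustration. Any string that does not match the format comes back as `NA`, so the `stopifnot()` doubles as a validity check.

```r
# Example timestamps copied from the sample printed above.
iso_examples <- c("2022-06-03T22:13:06Z",
                  "2022-04-03T20:57:37Z",
                  "2021-04-26T23:54:48Z")

# The literal "T" and "Z" in the format string match those characters in the
# input; tz = "UTC" tells R to read the clock time as UTC.
ts <- as.POSIXct(iso_examples, format = "%Y-%m-%dT%H:%M:%SZ", tz = "UTC")

# A value that fails to match the format becomes NA.
stopifnot(!anyNA(ts))

format(ts, "%Y-%m-%d %H:%M:%S", tz = "UTC")
```

On the full data the same call could be applied to `df$event_timestamp`, though on ~55 million rows it may be slow.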
What tf is UTC time? (YouTube)
Our timestamps are in UTC, so we need to be careful about assuming everyone is in the same place. Time-of-day thinking therefore isn't necessarily relevant. Broader signals (season, day of week, and longer-run trends) are likely more relevant.
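A small base-R sketch of why time of day is shaky here: one UTC instant from the sample, shown in a few time zones. The zones are hypothetical customer locations picked for illustration; nothing in the data says where customers actually are.

```r
# One UTC instant taken from the sampled timestamps.
t_utc <- as.POSIXct("2022-06-03T22:13:06Z",
                    format = "%Y-%m-%dT%H:%M:%SZ", tz = "UTC")

# Hypothetical customer locations, for illustration only.
zones <- c("UTC", "America/Los_Angeles", "America/New_York", "Asia/Tokyo")

# Render the same instant as local clock time and local weekday in each zone.
local_view <- data.frame(
  tz         = zones,
  local_time = vapply(zones, function(z) format(t_utc, "%Y-%m-%d %H:%M", tz = z),
                      character(1)),
  weekday    = vapply(zones, function(z) format(t_utc, "%a", tz = z),
                      character(1)),
  row.names  = NULL
)
local_view
```

The local hour differs in every zone, and in Tokyo the same instant already falls on the next calendar day. So even date-level features taken from UTC can be off by a day for events near midnight.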
library(lubridate)
library(dplyr)
df <- df |>
  mutate(
    # --- Core date features (SAFE in UTC) ---
    event_date   = as.Date(event_timestamp),
    weekday      = wday(event_date, label = TRUE),
    month        = month(event_date),
    day_of_month = day(event_date),
    # --- Weekend (still useful globally) ---
    is_weekend = weekday %in% c("Sat", "Sun"),
    # --- Pay cycle (VERY IMPORTANT, timezone invariant) ---
    pay_cycle = case_when(
      day_of_month <= 7  ~ "early_month", # benefits / rent timing
      day_of_month <= 15 ~ "mid_month",   # paycheck #1
      day_of_month <= 23 ~ "late_month",  # paycheck #2
      TRUE               ~ "end_month"    # cash-constrained period -- pay bills
    ),
    # --- Coarse seasonal signal (keep but low priority) ---
    season = case_when(
      month %in% c(12, 1, 2)  ~ "winter",
      month %in% c(3, 4, 5)   ~ "spring",
      month %in% c(6, 7, 8)   ~ "summer",
      month %in% c(9, 10, 11) ~ "fall"
    ),
    # --- Stronger macro timing features ---
    is_holiday_season = month %in% c(11, 12), # retail spike
    is_tax_season     = month %in% c(2, 3, 4) # refund liquidity
  )
range(df$event_date)
## [1] "2020-11-03" "2023-01-23"
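The `pay_cycle` buckets defined above rely on `case_when()` evaluating its conditions in order, so each day lands in the first bucket whose cutoff it satisfies. A quick sketch to confirm the boundaries on a toy frame (the `check` data frame and its boundary days are made up for illustration):

```r
library(dplyr)

# Hypothetical days of the month chosen to straddle each cutoff.
check <- data.frame(day_of_month = c(1, 7, 8, 15, 16, 23, 24, 31)) |>
  mutate(
    # Same rules as the feature code: the first matching condition wins.
    pay_cycle = case_when(
      day_of_month <= 7  ~ "early_month",
      day_of_month <= 15 ~ "mid_month",
      day_of_month <= 23 ~ "late_month",
      TRUE               ~ "end_month"
    )
  )
check
```

Days 7, 15, and 23 should fall in the earlier bucket of each pair, and everything from day 24 on should be `end_month`.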