This project utilizes event-based regression modeling to quantify how specific calendar events, such as paydays and holiday closures, drive daily customer traffic patterns at a bank branch.
## tibble [254 × 13] (S3: tbl_df/tbl/data.frame)
## $ CUST : num [1:254] 1825 1257 969 1672 1098 ...
## $ DAYCAT : chr [1:254] "Tuesday" "Wednesday" "Thursday" "Friday" ...
## $ DATE : POSIXct[1:254], format: "2007-01-02" "2007-01-03" ...
## $ MONTH : num [1:254] 1 1 1 1 1 1 1 1 1 1 ...
## $ DAYMON : num [1:254] 2 3 4 5 8 9 10 11 12 15 ...
## $ DAYWEEK : num [1:254] 2 3 4 5 1 2 3 4 5 1 ...
## $ SPECIAL : chr [1:254] "SP,FAC,AH" "0" "0" "SP" ...
## $ Payday : num [1:254] 1 0 0 1 0 0 0 0 0 0 ...
## $ SP : num [1:254] 1 0 0 1 0 0 0 0 0 0 ...
## $ FAC : num [1:254] 1 0 0 0 0 0 0 0 0 0 ...
## $ Holidays: num [1:254] 1 0 0 0 0 0 0 0 0 0 ...
## $ BH : num [1:254] 0 0 0 0 0 0 0 0 0 0 ...
## $ AH : num [1:254] 1 0 0 0 0 0 0 0 0 0 ...
## [1] "Date"
## [1] "Date"
Q4$DAYCAT <- factor(Q4$DAYCAT,
levels = c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday"),
ordered = TRUE)
str(Q4)## tibble [254 × 14] (S3: tbl_df/tbl/data.frame)
## $ CUST : num [1:254] 1825 1257 969 1672 1098 ...
## $ DAYCAT : Ord.factor w/ 5 levels "Monday"<"Tuesday"<..: 2 3 4 5 1 2 3 4 5 1 ...
## $ DATE : Date[1:254], format: "2007-01-02" "2007-01-03" ...
## $ MONTH : num [1:254] 1 1 1 1 1 1 1 1 1 1 ...
## $ DAYMON : num [1:254] 2 3 4 5 8 9 10 11 12 15 ...
## $ DAYWEEK : num [1:254] 2 3 4 5 1 2 3 4 5 1 ...
## $ SPECIAL : chr [1:254] "SP,FAC,AH" "0" "0" "SP" ...
## $ Payday : num [1:254] 1 0 0 1 0 0 0 0 0 0 ...
## $ SP : num [1:254] 1 0 0 1 0 0 0 0 0 0 ...
## $ FAC : num [1:254] 1 0 0 0 0 0 0 0 0 0 ...
## $ Holidays : num [1:254] 1 0 0 0 0 0 0 0 0 0 ...
## $ BH : num [1:254] 0 0 0 0 0 0 0 0 0 0 ...
## $ AH : num [1:254] 1 0 0 0 0 0 0 0 0 0 ...
## $ TimeIndex: int [1:254] 1 2 3 4 5 6 7 8 9 10 ...
## CUST DAYCAT DATE MONTH
## Min. : 404.0 Monday :50 Min. :2007-01-02 Min. : 1.000
## 1st Qu.: 785.8 Tuesday :51 1st Qu.:2007-03-30 1st Qu.: 3.250
## Median : 930.5 Wednesday:50 Median :2007-06-30 Median : 6.500
## Mean :1037.5 Thursday :51 Mean :2007-06-30 Mean : 6.476
## 3rd Qu.:1183.5 Friday :52 3rd Qu.:2007-09-30 3rd Qu.: 9.750
## Max. :2068.0 Max. :2007-12-31 Max. :12.000
## DAYMON DAYWEEK SPECIAL Payday
## Min. : 1.00 Min. :1.000 Length:254 Min. :0.000
## 1st Qu.: 8.00 1st Qu.:2.000 Class :character 1st Qu.:0.000
## Median :16.00 Median :3.000 Mode :character Median :0.000
## Mean :15.83 Mean :3.016 Mean :0.122
## 3rd Qu.:23.00 3rd Qu.:4.000 3rd Qu.:0.000
## Max. :31.00 Max. :5.000 Max. :1.000
## SP FAC Holidays BH
## Min. :0.00000 Min. :0.00000 Min. :0.00000 Min. :0.00000
## 1st Qu.:0.00000 1st Qu.:0.00000 1st Qu.:0.00000 1st Qu.:0.00000
## Median :0.00000 Median :0.00000 Median :0.00000 Median :0.00000
## Mean :0.09843 Mean :0.03937 Mean :0.04724 Mean :0.01969
## 3rd Qu.:0.00000 3rd Qu.:0.00000 3rd Qu.:0.00000 3rd Qu.:0.00000
## Max. :1.00000 Max. :1.00000 Max. :1.00000 Max. :1.00000
## AH TimeIndex
## Min. :0.00000 Min. : 1.00
## 1st Qu.:0.00000 1st Qu.: 64.25
## Median :0.00000 Median :127.50
## Mean :0.02756 Mean :127.50
## 3rd Qu.:0.00000 3rd Qu.:190.75
## Max. :1.00000 Max. :254.00
##
## 0 1
## 223 31
plot(CUST ~ DATE, data = Q4, main="Customer Visits over Time", col="blue", type="l")
fit_index = lm(CUST ~ TimeIndex, data = Q4)
abline(fit_index, col="red", lwd=2) ggplot(Q4, aes(x=DATE, y=CUST)) +
geom_line(color="steelblue", alpha=0.6) +
# Adding vertical lines for Christmas (Closure) and Black Friday (Peak)
geom_vline(xintercept = as.Date("2007-12-25"), color="red", linetype="dashed") +
annotate("text", x=as.Date("2007-12-25"), y=max(Q4$CUST, na.rm=TRUE),
label="Christmas (Closed)", color="red", angle=90, vjust=-0.5) +
geom_vline(xintercept = as.Date("2007-11-23"), color="purple", linetype="dotted") +
annotate("text", x=as.Date("2007-11-23"), y=max(Q4$CUST, na.rm=TRUE),
label="Black Friday", color="purple", angle=90, vjust=-0.5) +
labs(title="Bank Visits: Contextualizing Gaps and Spikes",
subtitle="Dashed lines represent holiday closures",
x="Date", y="Number of Customers") +
theme_minimal()Q4$HolidayStatus <- "Normal"
Q4$HolidayStatus[Q4$BH == 1] <- "Before Holiday"
Q4$HolidayStatus[Q4$AH == 1] <- "After Holiday"
Q4$Holidays <- ifelse(Q4$BH == 1 | Q4$AH == 1, 0, Q4$Holidays)
ggplot(Q4, aes(x=HolidayStatus, y=CUST, fill=HolidayStatus)) +
geom_boxplot() +
scale_fill_brewer(palette="Set2") +
labs(title="The 'Holiday Sandwich' Effect: Before vs After",
subtitle="Analyzing shifts in customer behavior surrounding closures",
y="Number of Customers") +
theme_classic()ggplot(Q4, aes(x=DAYCAT, y=CUST, group=1)) +
geom_line(stat='summary', fun='mean', color="blue") +
geom_point(stat='summary', fun='mean') +
facet_wrap(~MONTH, ncol=4) +
labs(title="Average Weekly Traffic Pattern per Month",
subtitle="Identifying seasonal variations in the weekly rhythm",
x="Day of Week", y="Avg Customers") +
theme_bw() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))ggplot(Q4, aes(x=DAYCAT, y=CUST, fill=factor(Payday))) +
geom_boxplot() +
scale_fill_manual(values=c("white", "orange"), name="Is Payday?") +
labs(title="Joint Effect: Payday Impact by Day of the Week",
subtitle="Comparing the lift of paychecks across different workdays") +
theme_minimal()##
## Welch Two Sample t-test
##
## data: CUST by Payday
## t = -19.292, df = 46.166, p-value < 0.00000000000000022
## alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
## 95 percent confidence interval:
## -804.7076 -652.6606
## sample estimates:
## mean in group 0 mean in group 1
## 948.574 1677.258
model_full <- lm(CUST ~ TimeIndex + DAYCAT + Payday + Holidays +
SP + FAC + AH + BH + BlackFriday + NewYear,
data = Q4)
summary(model_full)##
## Call:
## lm(formula = CUST ~ TimeIndex + DAYCAT + Payday + Holidays +
## SP + FAC + AH + BH + BlackFriday + NewYear, data = Q4)
##
## Residuals:
## Min 1Q Median 3Q Max
## -457.39 -121.75 -12.67 84.84 787.59
##
## Coefficients: (1 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 952.6333 22.9678 41.477 < 0.0000000000000002 ***
## TimeIndex 0.1099 0.1533 0.716 0.474390
## DAYCAT.L 139.1587 28.6142 4.863 0.0000020849 ***
## DAYCAT.Q 382.6423 27.7622 13.783 < 0.0000000000000002 ***
## DAYCAT.C 49.6587 25.8921 1.918 0.056305 .
## DAYCAT^4 64.1263 25.0414 2.561 0.011053 *
## Payday 472.3189 125.4703 3.764 0.000210 ***
## Holidays NA NA NA NA
## SP -49.4312 120.0055 -0.412 0.680773
## FAC 51.7481 101.0730 0.512 0.609128
## AH 462.6465 82.5738 5.603 0.0000000574 ***
## BH 309.8121 82.1928 3.769 0.000206 ***
## BlackFriday -671.3497 197.9632 -3.391 0.000813 ***
## NewYear -405.6660 196.1959 -2.068 0.039740 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 175.5 on 241 degrees of freedom
## Multiple R-squared: 0.7478, Adjusted R-squared: 0.7353
## F-statistic: 59.56 on 12 and 241 DF, p-value: < 0.00000000000000022
model_interaction <- lm(CUST ~ TimeIndex + DAYCAT * Payday + Holidays +
SP + FAC + AH + BH + BlackFriday + NewYear,
data = Q4)
summary(model_interaction)##
## Call:
## lm(formula = CUST ~ TimeIndex + DAYCAT * Payday + Holidays +
## SP + FAC + AH + BH + BlackFriday + NewYear, data = Q4)
##
## Residuals:
## Min 1Q Median 3Q Max
## -466.26 -120.14 -12.53 83.22 780.25
##
## Coefficients: (2 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 951.0044 23.1407 41.097 < 0.0000000000000002 ***
## TimeIndex 0.1314 0.1556 0.844 0.399372
## DAYCAT.L 145.7562 29.3871 4.960 0.00000134 ***
## DAYCAT.Q 387.0169 28.2953 13.678 < 0.0000000000000002 ***
## DAYCAT.C 49.1905 26.4381 1.861 0.064036 .
## DAYCAT^4 66.4801 25.2966 2.628 0.009147 **
## Payday 533.4398 137.4658 3.881 0.000135 ***
## Holidays NA NA NA NA
## SP -2.9503 163.5602 -0.018 0.985623
## FAC 9.4583 107.2439 0.088 0.929796
## AH 432.8733 86.4317 5.008 0.00000107 ***
## BH 316.3382 82.6420 3.828 0.000165 ***
## BlackFriday -651.5159 199.2858 -3.269 0.001238 **
## NewYear -378.3208 198.1448 -1.909 0.057424 .
## DAYCAT.L:Payday -104.2121 148.5209 -0.702 0.483573
## DAYCAT.Q:Payday -118.2256 168.3749 -0.702 0.483267
## DAYCAT.C:Payday 43.5906 150.2296 0.290 0.771946
## DAYCAT^4:Payday NA NA NA NA
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 176.1 on 238 degrees of freedom
## Multiple R-squared: 0.7494, Adjusted R-squared: 0.7336
## F-statistic: 47.44 on 15 and 238 DF, p-value: < 0.00000000000000022
## Adjusted R-Squared (Full Model): 0.7352767
## Adjusted R-Squared (Interaction Model): 0.7335897
## Estimate Std. Error t value
## (Intercept) 952.6333163 22.9678326 41.4768486
## TimeIndex 0.1098659 0.1533416 0.7164784
## DAYCAT.L 139.1587302 28.6142177 4.8632722
## DAYCAT.Q 382.6423341 27.7621838 13.7828615
## DAYCAT.C 49.6586705 25.8921457 1.9179048
## DAYCAT^4 64.1263009 25.0413752 2.5608139
## Payday 472.3188735 125.4703094 3.7643876
## SP -49.4312286 120.0055033 -0.4119080
## FAC 51.7481383 101.0729918 0.5119878
## AH 462.6464554 82.5737948 5.6028242
## BH 309.8120979 82.1927932 3.7693341
## BlackFriday -671.3497184 197.9632142 -3.3912852
## NewYear -405.6660483 196.1958932 -2.0676582
## Pr(>|t|)
## (Intercept) 0.000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000001045745
## TimeIndex 0.474389715715175075416709660203196108341217041015625000000000000000000000000000000000000000000000000000000000000000000
## DAYCAT.L 0.000002084862032572900535879346223255836889620695728808641433715820312500000000000000000000000000000000000000000000000
## DAYCAT.Q 0.000000000000000000000000000000029419939302317286580971465783721597027489517557179127606766554995291607611519103956076
## DAYCAT.C 0.056304930212458741134451400967009249143302440643310546875000000000000000000000000000000000000000000000000000000000000
## DAYCAT^4 0.011053139624710026783782623738261463586241006851196289062500000000000000000000000000000000000000000000000000000000000
## Payday 0.000209796479739211377780680467530771693418500944972038269042968750000000000000000000000000000000000000000000000000000
## SP 0.680772965783324801591902541986200958490371704101562500000000000000000000000000000000000000000000000000000000000000000
## FAC 0.609128389496732092212027964706066995859146118164062500000000000000000000000000000000000000000000000000000000000000000
## AH 0.000000057423729770029243095267581977608761789610980486031621694564819335937500000000000000000000000000000000000000000
## BH 0.000205910402132394507439955289029853702231775969266891479492187500000000000000000000000000000000000000000000000000000
## BlackFriday 0.000812861082052640190366565864366066307411529123783111572265625000000000000000000000000000000000000000000000000000000
## NewYear 0.039739711348324505135742867878434481099247932434082031250000000000000000000000000000000000000000000000000000000000000
model_no_events <- lm(CUST ~ TimeIndex, data = Q4)
model_with_events <- lm(CUST ~ TimeIndex + DAYCAT + Payday + Holidays +
SP + FAC + AH + BH + BlackFriday + NewYear, data = Q4)
cat("Adj R-Squared without events:", summary(model_no_events)$adj.r.squared, "\n")## Adj R-Squared without events: -0.002867611
## Adj R-Squared with events: 0.7352767
holiday_impacts <- summary(model_with_events)$coefficients[c("BH", "AH"), ]
print("Comparison of Before vs After Holiday Impacts:")## [1] "Comparison of Before vs After Holiday Impacts:"
## Estimate Std. Error t value Pr(>|t|)
## BH 309.8121 82.19279 3.769334 0.00020591040213
## AH 462.6465 82.57379 5.602824 0.00000005742373
days_list <- levels(Q4$DAYCAT)
for (d in days_list) {
# Create a temporary subset for the specific day
day_data <- subset(Q4, DAYCAT == d)
# Run a simple trend model for just that day
day_model <- lm(CUST ~ TimeIndex, data = day_data)
# Print the result summary for that day
cat("\n--- Trend Analysis for:", d, "---")
print(summary(day_model)$coefficients)
}##
## --- Trend Analysis for: Monday --- Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1023.5862029 58.4553776 17.510557 0.0000000000000000000001757007
## TimeIndex 0.5353975 0.3902505 1.371933 0.1764626951631595164204924231
##
## --- Trend Analysis for: Tuesday --- Estimate Std. Error t value Pr(>|t|)
## (Intercept) 901.5152426 70.941548 12.7078598 0.00000000000000003983542
## TimeIndex -0.2250241 0.495008 -0.4545867 0.65141412229922857068942
##
## --- Trend Analysis for: Wednesday --- Estimate Std. Error t value Pr(>|t|)
## (Intercept) 784.4812464 51.4322993 15.2526964 0.0000000000000000000492083
## TimeIndex 0.3062118 0.3478923 0.8801914 0.3831413840660894409850812
##
## --- Trend Analysis for: Thursday --- Estimate Std. Error t value Pr(>|t|)
## (Intercept) 903.4937847 52.0234419 17.3670513 0.0000000000000000000001404155
## TimeIndex -0.1375904 0.3569098 -0.3855047 0.7015320604403076920618786971
##
## --- Trend Analysis for: Friday --- Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1506.07466374 87.2293942 17.26567835 0.0000000000000000000001032999
## TimeIndex -0.05542249 0.5868338 -0.09444325 0.9251345262963919235943421882
Q4$resids <- resid(model_with_events)
library(ggplot2)
ggplot(Q4, aes(x = DATE, y = resids)) +
geom_point(color = "steelblue", alpha = 0.7) +
geom_hline(yintercept = 0, linetype = "dashed", color = "red") +
# Adding labels for any extreme outliers (errors > 500 customers)
geom_text(aes(label = ifelse(abs(resids) > 500, as.character(DATE), "")),
vjust = -1, size = 3) +
labs(title = "Residual Plot: Identifying Unexplained Traffic Outliers",
subtitle = "Labeled dates represent days where the model 'missed' significantly",
x = "Date", y = "Prediction Error (Residual)") +
theme_minimal()Outliers <- Q4[order(abs(Q4$resids), decreasing = TRUE), ]
head(Outliers[, c("DATE", "CUST", "resids")], 5)Prompt 1: I am working on an event-based time series project for a bank. How can I use ggplot2 to overlay vertical lines for Christmas and Black Friday even if the bank was closed on those dates and there are no rows in my dataset for them?
Prompt 2: I have a multiple regression model with CUST as the dependent variable. How do I interpret the interaction between DAYCAT and Payday to see if paydays have a significantly different boost on Fridays compared to other weekdays?
How my prompts evolved: Initially, my prompts were too general. I improved them by providing specific variable names from my dataset, such as CUST and DAYCAT.
What worked: Comparing the Adjusted R^2 between a simple trend model and our event-driven model proved that specific calendar events account for the majority of the traffic.
Key Findings: The Power of Events: The Adjusted R-Squared value of our model confirms that bank traffic is primarily driven by specific calendar events rather than just the passage of time.
The Payday “Super-Boost”: Our interaction analysis revealed that paydays occurring on Fridays create a unique synergy, resulting in the highest traffic volume of the month.
The Holiday Sandwich: The data shows that the day immediately following a closure (AH) is often as busy as the day preceding (BH) it, requiring maximum staffing levels.
Residual Insights: Through residual analysis, we identified specific outlier dates where our model under-predicted traffic. These dates represent “mystery” spikes that warrant further investigation into localized events or promotions.
Managerial Implication: Instead of uniform staffing, the bank should strategically increase availability during “Holiday Sandwich” periods and “Payday Fridays” to optimize service levels and minimize customer wait times.