class: center, middle, inverse, title-slide

.title[
# Data Breach Disclosures and Firm Financial Outcomes
]
.subtitle[
## A Callaway & Sant’Anna Event-Study Approach
]
.author[
### Ruby Qiu, Batu Buyukbezci, and Arnav Sahai
]
.institute[
### Columbia University | SIPA
]

---

<style>
.title-slide { background-image: url("https://images.unsplash.com/photo-1550751827-4bd374c3f58b?q=80&w=2226&auto=format&fit=crop"); background-size: cover; background-position: center; color: #ffffff; }
.title-slide h1, .title-slide h2, .title-slide h3 { color: #ffffff; text-shadow: 2px 2px 6px rgba(0,0,0,0.75); }
blockquote { border-left: 4px solid #00aaff; background-color: #f0f8ff; padding: 10px 15px; margin: 20px 0; font-style: italic; color: #333; }
.purple-block { display: inline-block; border: 2px solid #6a0dad; background-color: #e6d5f7; color: #000; padding: 2px 8px; border-radius: 5px; margin: 0; }
.red-block { display: inline-block; border: 2px solid #9d0208; background-color: #ffd6d6; color: #000; padding: 2px 8px; border-radius: 5px; margin: 0; }
.teal-block { display: inline-block; border: 2px solid #0f766e; background-color: #ccfbf1; color: #000; padding: 2px 8px; border-radius: 5px; margin: 0; }
.text-plot-container { display: flex; align-items: center; }
.text-plot-container .text { flex: 1; margin-right: 20px; }
.text-plot-container .plot { flex: 2; }
.custom-table table { font-size: 12px; width: auto; margin: 0 auto; }
.custom-table th, .custom-table td { padding: 5px; }
.scrollable { max-width: 100%; overflow-x: auto; }
</style>

# Roadmap

This presentation walks through the project end-to-end:

--

1. **Research topic**: why data breaches and firm value

--

2. **Research question**: what we want to identify

--

3. **Data & exploratory analysis**: Rosati & Lynn (2020) + market data

--

4. **Identification strategy**: the population regression function

--

5. **Estimation**: Callaway & Sant'Anna (CS) dynamic ATT

--

6. **Results**: magnitudes, dynamics, and robustness

--

7. **Why it matters**: implications for firms, investors, and policy

---

# Research Topic

## Cyber incidents as economic events

<blockquote>
Does the public disclosure of a corporate data breach cause measurable negative abnormal stock returns, and does the magnitude of this penalty vary by breach severity, breach type, and industry sector?
</blockquote>

--

- Data breaches have become a **routine operational risk** for public firms
- Disclosures are often accompanied by reputational damage, regulatory action, and class-action lawsuits
- Prior event-study literature finds <span class="purple-block">mixed and often short-lived</span> market reactions

--

- But the canonical **market-model event study** rests on strong assumptions: 1) a stable estimation window; 2) one-shot treatment timing; and 3) no heterogeneity in treatment effects

--

- Recent advances in the **staggered DiD literature** (Callaway & Sant'Anna, 2021) let us relax these assumptions

---

# Research Question

## The question we want to answer

<span class="purple-block"> What is the causal effect of a data breach disclosure on the cumulative abnormal return (CAR) of the affected firm over the 15 trading days after disclosure? </span>

--

<br>

And two supporting sub-questions:

--

- Does the effect **vary** with breach type (hacking, insider, portable device, etc.)?

--

- Does the effect **scale** with breach size (records exposed)?

---

# Data

## Two data sources

.text-plot-container[
.text[

**1. Rosati & Lynn (2020)**, Mendeley Data
<br><small>Hand-collected panel of U.S. public-firm breach disclosures, 2005–2015</small>

- `event_date`, `ticker`, `breach_type`, `breach_size`
- `confound_dum` flags events within 10 days of another major announcement

<br>

**2. Yahoo Finance via `tidyquant`**
<br><small>Daily adjusted prices for every treated firm and for the S&P 500</small>

- Firm returns: used to compute abnormal returns
- S&P 500: serves as the market return *and* as the never-treated control in the CS design

]
]

---

# Sample Construction

## Cleaning and merging

``` r
library(dplyr)      # %>%, mutate(), filter(), distinct()
library(lubridate)  # dmy()

breaches <- breaches_raw %>%
  mutate(event_date = dmy(event_date)) %>%   # parse day-month-year strings
  filter(breach_type %in% c("INSD", "PORT", "DISC", "CARD", "STAT", "UNKN"),
         confound_dum == 0,                  # drop confounded events
         !is.na(ticker), !is.na(event_date)) %>%
  distinct(Event_ID, .keep_all = TRUE)       # one row per breach
```

--

Three restrictions for identification:

- Retain only **standard breach types** reported consistently across years
- Drop <span class="red-block"> confounded events </span>: disclosures within 10 days of earnings, M&A, or other major news
- De-duplicate on `Event_ID` so each breach enters only once

--

> After cleaning: **~153 non-confounded events** across NYSE/NASDAQ firms

---

# EDA: Events Over Time

<img src="data_breaches_slides_files/figure-html/plot-events-year-1.png" style="display: block; margin: auto;" />

<small>
Disclosures accelerate sharply after 2010, reflecting both rising breach frequency and the rollout of state disclosure laws.
</small>

---

# EDA: Breach Types

<img src="data_breaches_slides_files/figure-html/plot-breach-types-1.png" style="display: block; margin: auto;" />

---

# EDA: Worst Performers After Disclosure

<img src="data_breaches_slides_files/figure-html/plot-top-worst-1.png" style="display: block; margin: auto;" />

---

# From EDA to Identification

## Why a simple market-model event study is not enough

--

- The traditional event study regresses firm returns on market returns over an estimation window, then treats **residuals after the disclosure as "abnormal"**
- This gives us <span class="purple-block"> descriptive CARs </span>, but:
  + no explicit counterfactual group
  + no pre-trend test
  + no robustness to treatment-effect heterogeneity across events
- We need an estimator that:
  + **defines a counterfactual** (a "never-treated" unit)
  + is **robust to heterogeneous treatment effects**
  + produces **dynamic ATTs** we can interpret as a causal CAR path
- **Enter Callaway & Sant'Anna (2021).**

---

# Population Regression Function

## The PRF we take to the data

<span class="purple-block"> Present the PRF before reporting results. </span>

--

`$$Y_{it} = \alpha_i + \lambda_t + \sum_{e=-15}^{+15} \beta_e \cdot \mathbb{1}\!\{t - G_i = e\} + X_i'\gamma + \varepsilon_{it}$$`

<small>

- `\(Y_{it}\)`: cumulative log return from the start of the event window for unit `\(i\)` at event-time `\(t\)`
- `\(\alpha_i\)`: unit fixed effect (firm-event combination, or S&P 500 control)
- `\(\lambda_t\)`: event-time fixed effect
- `\(\mathbb{1}\{t - G_i = e\}\)`: indicator that unit `\(i\)` is `\(e\)` periods from treatment
- `\(X_i\)`: pre-event covariates (volatility, momentum, log price on day `\(-1\)`)
- `\(\beta_e\)`: the dynamic ATT at event-time `\(e\)`, **what we want to estimate**
- `\(\varepsilon_{it}\)`: idiosyncratic error

</small>

---

# Identifying Assumptions

## What we need for causal interpretation

--

1. **Parallel trends** (conditional on covariates): in the absence of treatment, treated firms and the S&P 500 would have followed the same cumulative-return path <br><br>

--

2. **No anticipation**: the disclosure is a genuine information event; firms do not exhibit pre-event abnormal returns in the `\([-15, -1]\)` window <br><br>

--

3. **Stable Unit Treatment Value Assumption (SUTVA)**: one firm's breach does not affect another firm's counterfactual return <br><br>

--

<br>

> The **pre-trend panel** of the event study (`\(e < 0\)`) is our primary test of Assumption 1.

---

# The CS Estimator

## Why Callaway & Sant'Anna

--

Traditional two-way fixed-effects DiD can be <span class="red-block"> biased </span> when both are present:

- staggered treatment timing
- heterogeneous treatment effects across cohorts

--

<br>

CS (2021) proposes estimating a **separate ATT for each (group, time) pair**:

`$$ATT(g, t) = \mathbb{E}[\,Y_t(1) - Y_t(0)\,|\,G = g\,]$$`

then aggregating these building blocks into event-study-style dynamic ATTs.

--

<br>

We implement CS with <span class="teal-block"> doubly-robust estimation </span>, which remains consistent if **either** the outcome regression or the propensity score (but not both) is misspecified.

---

# Results: Dynamic ATT

<img src="data_breaches_slides_files/figure-html/plot-es-1.png" style="display: block; margin: auto;" />

---

# Reading the Event Study

## Three things to look for

.text-plot-container[
.text[

**1. Pre-period (`\(e < 0\)`)**
<br><small>ATTs are indistinguishable from zero, consistent with the no-anticipation and parallel-trends assumptions.</small>

<br>

**2. Treatment day (`\(e = 0\)`)**
<br><small>Sharp negative jump: the disclosure is a genuine information event.</small>

<br>

**3. Post-period (`\(e > 0\)`)**
<br><small>Effect persists and widens: the market does not fully reverse within the 15-day window.</small>

]
]

---

# Results: Cross-Sectional Regression

Supporting evidence: does CAR(0, +10) scale with **size** and **type**?
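
A specification like the "Size + Type" column in the table that follows can be sketched in base R. This is only an illustration under stated assumptions: the data frame `toy` is simulated (it is not the breach sample), the column names `car`, `log_breach_size`, and `breach_type` are hypothetical, and the HC3 variant is an assumption, since the table note says only that standard errors are heteroskedasticity-consistent.

```r
# Simulated stand-in data (NOT the breach sample); column names are assumptions.
set.seed(1)
n <- 200
toy <- data.frame(
  log_breach_size = runif(n, min = 4, max = 14),
  breach_type     = sample(c("CARD", "HACK", "PORT"), n, replace = TRUE)
)
# Heteroskedastic noise: the error spread grows with log breach size.
toy$car <- -0.002 * toy$log_breach_size +
  rnorm(n, sd = 0.005 * toy$log_breach_size)

# OLS of CAR on size and type dummies (CARD is the omitted category).
fit <- lm(car ~ log_breach_size + factor(breach_type), data = toy)

# HC3 sandwich: (X'X)^-1 X' diag(e_i^2 / (1 - h_i)^2) X (X'X)^-1
X     <- model.matrix(fit)
e     <- residuals(fit)
h     <- hatvalues(fit)
bread <- solve(crossprod(X))
meat  <- crossprod(X * (e / (1 - h)))   # scales row i of X by e_i / (1 - h_i)
vcov_hc3 <- bread %*% meat %*% bread

coef_table <- cbind(estimate  = coef(fit),
                    robust_se = sqrt(diag(vcov_hc3)))
print(round(coef_table, 4))
```

Computing the sandwich by hand keeps the sketch dependency-free; in practice a package such as `sandwich` would typically be used for the same calculation.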
<table style="border-bottom: 0; width: auto !important; margin-left: auto; margin-right: auto; font-size: 12px; font-family: Cambria;" class="table">
<caption style="font-size: initial !important;">Cross-sectional determinants of CAR(0, +10)</caption>
<thead>
<tr> <th style="text-align:left;"> </th> <th style="text-align:center;"> Size </th> <th style="text-align:center;"> Type </th> <th style="text-align:center;"> Size + Type </th> </tr>
</thead>
<tbody>
<tr> <td style="text-align:left;"> log_breach_size </td> <td style="text-align:center;"> −0.001 </td> <td style="text-align:center;"> </td> <td style="text-align:center;"> −0.002 </td> </tr>
<tr> <td style="text-align:left;"> </td> <td style="text-align:center;"> (0.002) </td> <td style="text-align:center;"> </td> <td style="text-align:center;"> (0.002) </td> </tr>
<tr> <td style="text-align:left;"> factor(breach_type)DISC </td> <td style="text-align:center;"> </td> <td style="text-align:center;"> −0.014 </td> <td style="text-align:center;"> 0.014 </td> </tr>
<tr> <td style="text-align:left;"> </td> <td style="text-align:center;"> </td> <td style="text-align:center;"> (0.019) </td> <td style="text-align:center;"> (0.028) </td> </tr>
<tr> <td style="text-align:left;"> factor(breach_type)HACK </td> <td style="text-align:center;"> </td> <td style="text-align:center;"> −0.022 </td> <td style="text-align:center;"> 0.013 </td> </tr>
<tr> <td style="text-align:left;"> </td> <td style="text-align:center;"> </td> <td style="text-align:center;"> (0.019) </td> <td style="text-align:center;"> (0.027) </td> </tr>
<tr> <td style="text-align:left;"> factor(breach_type)INSD </td> <td style="text-align:center;"> </td> <td style="text-align:center;"> −0.023 </td> <td style="text-align:center;"> −0.007 </td> </tr>
<tr> <td style="text-align:left;"> </td> <td style="text-align:center;"> </td> <td style="text-align:center;"> (0.019) </td> <td style="text-align:center;"> (0.027) </td>
</tr> <tr> <td style="text-align:left;"> factor(breach_type)PHYS </td> <td style="text-align:center;"> </td> <td style="text-align:center;"> −0.025 </td> <td style="text-align:center;"> −0.016 </td> </tr> <tr> <td style="text-align:left;"> </td> <td style="text-align:center;"> </td> <td style="text-align:center;"> (0.027) </td> <td style="text-align:center;"> (0.048) </td> </tr> <tr> <td style="text-align:left;"> factor(breach_type)PORT </td> <td style="text-align:center;"> </td> <td style="text-align:center;"> −0.017 </td> <td style="text-align:center;"> 0.007 </td> </tr> <tr> <td style="text-align:left;"> </td> <td style="text-align:center;"> </td> <td style="text-align:center;"> (0.019) </td> <td style="text-align:center;"> (0.026) </td> </tr> <tr> <td style="text-align:left;"> factor(breach_type)STAT </td> <td style="text-align:center;"> </td> <td style="text-align:center;"> −0.198*** </td> <td style="text-align:center;"> 0.014 </td> </tr> <tr> <td style="text-align:left;"> </td> <td style="text-align:center;"> </td> <td style="text-align:center;"> (0.040) </td> <td style="text-align:center;"> (0.048) </td> </tr> <tr> <td style="text-align:left;"> factor(breach_type)UNKN </td> <td style="text-align:center;"> </td> <td style="text-align:center;"> −0.027 </td> <td style="text-align:center;"> 0.039 </td> </tr> <tr> <td style="text-align:left;box-shadow: 0px 1.5px"> </td> <td style="text-align:center;box-shadow: 0px 1.5px"> </td> <td style="text-align:center;box-shadow: 0px 1.5px"> (0.030) </td> <td style="text-align:center;box-shadow: 0px 1.5px"> (0.064) </td> </tr> <tr> <td style="text-align:left;"> Num.Obs. 
</td> <td style="text-align:center;"> 100 </td> <td style="text-align:center;"> 224 </td> <td style="text-align:center;"> 100 </td> </tr>
<tr> <td style="text-align:left;"> R2 </td> <td style="text-align:center;"> 0.009 </td> <td style="text-align:center;"> 0.109 </td> <td style="text-align:center;"> 0.027 </td> </tr>
<tr> <td style="text-align:left;"> F </td> <td style="text-align:center;"> 0.842 </td> <td style="text-align:center;"> 3.780 </td> <td style="text-align:center;"> 0.320 </td> </tr>
</tbody>
<tfoot>
<tr><td style="padding: 0; " colspan="100%"> <sup></sup> * p &lt; 0.1, ** p &lt; 0.05, *** p &lt; 0.01</td></tr>
<tr><td style="padding: 0; " colspan="100%"> <sup></sup> Non-confounded events only. OLS with heteroskedasticity-consistent SEs.</td></tr>
</tfoot>
</table>

---

# Why the Results Matter

## Three audiences

--

**Investors**

- The effect is <span class="red-block">persistent</span>, not a one-day blip that reverts
- Cross-sectional pricing of cyber risk is incomplete

--

<br>

**Firms**

- Reputational costs are real and measurable, so disclosure practices matter
- Cross-sectional evidence on breach size and type is weak: the size coefficient is near zero, and only the STAT type coefficient is significant (type-only specification)

--

<br>

**Policymakers**

- Mandatory disclosure regimes (e.g., the SEC's 2023 rule) are informative: the market **responds** to the disclosures they require
- Suggests returns to standardizing what must be disclosed and when

---

# Contributions

## What this project adds

--

- First application (to our knowledge) of **CS dynamic ATT** to breach disclosures

--

- Uses the **S&P 500 as a synthetic never-treated control**, a clean counterfactual that sidesteps the usual "clean-control firm" selection problem

--

- Delivers a **pre-trend test** that the traditional market-model event study cannot provide

--

- Quantifies **heterogeneity** across breach types and sizes with a consistent estimator

---

# Limitations & Next Steps

## What we haven't done yet

--

- **Confounded events** are dropped rather than instrumented; future work could model them explicitly

--

- **Anticipation** is assumed away, but some breaches leak before formal disclosure

--

- Single control (S&P 500) could be extended to an **industry-matched synthetic control**

--

- Longer post-window (`\(+60\)` days) to test persistence vs. reversal

--

- **Honest DiD** sensitivity analysis (Rambachan & Roth, 2023): the `HonestDiD` package is already loaded, so this is a natural next robustness check

---

class: center, middle

# Thank you

### Questions?

<br>

<small> Code, data, and replication materials available on request. </small>

---

# Appendix: Summary Statistics

<table class="lightable-paper lightable-hover table" style="font-family: Helvetica; width: auto !important; margin-left: auto; margin-right: auto; font-size: 16px;">
<caption style="font-size: initial !important;">Breach sample summary</caption>
<thead>
<tr> <th style="text-align:center;"> N events </th> <th style="text-align:center;"> N confounded </th> <th style="text-align:center;"> % confounded </th> <th style="text-align:center;"> Median breach size </th> <th style="text-align:center;"> Mean breach size </th> <th style="text-align:center;"> SD breach size </th> </tr>
</thead>
<tbody>
<tr> <td style="text-align:center;"> 506 </td> <td style="text-align:center;"> 185 </td> <td style="text-align:center;"> 36.6 </td> <td style="text-align:center;"> 5154 </td> <td style="text-align:center;"> 3951716 </td> <td style="text-align:center;"> 20076278 </td> </tr>
</tbody>
</table>

---

# Appendix: Mean AR by Event Day

<img src="data_breaches_slides_files/figure-html/appendix-ar-plot-1.png" style="display: block; margin: auto;" />

---

# Appendix: CS `att_gt` Object

``` r
summary(att)  # group-time ATTs with uniform confidence bands
summary(es)   # dynamic aggregation (event-study)
```

<br>

Full `att_gt` object saved as `att_gt_object.rds` for replication. Pair with `HonestDiD::createSensitivityResults_relativeMagnitudes()` for post-hoc sensitivity analysis of parallel-trends violations.
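
<small>
To make the dynamic aggregation concrete: each event-time effect averages the group-time ATTs that share the same `\(e = t - g\)`, weighted by cohort size. The sketch below uses made-up numbers; the `att_gt_toy` table, its second cohort, and all values are hypothetical and are not output from `did`.
</small>

```r
# Hypothetical group-time ATTs (illustrative numbers, not estimates from this deck).
att_gt_toy <- data.frame(
  group  = c(16, 16, 16, 20, 20, 20),      # first-treated period g
  period = c(15, 16, 17, 19, 20, 21),      # calendar/panel period t
  att    = c(0.000, -0.010, -0.015, 0.002, -0.020, -0.025),
  n_g    = c(100, 100, 100, 50, 50, 50)    # cohort sizes used as weights
)

# Event time: periods elapsed since first treatment.
att_gt_toy$e <- att_gt_toy$period - att_gt_toy$group

# Dynamic ATT at each e: cohort-size-weighted mean of ATT(g, g + e).
dynamic_att <- sapply(split(att_gt_toy, att_gt_toy$e), function(d) {
  weighted.mean(d$att, w = d$n_g)
})
print(dynamic_att)
```

<small>
In this project every treated unit is coded with `G = 16`, so there is a single cohort and the weights drop out: the effect at event time `\(e\)` is just the group-time ATT at period `\(16 + e\)`.
</small>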
---

# EDA: Mean CAR by Breach Type

<img src="data_breaches_slides_files/figure-html/plot-car-by-type-1.png" style="display: block; margin: auto;" />

<small> Early signal: the market reaction varies meaningfully by breach type, but the cross-sectional mean is noisy without a proper counterfactual. </small>

---

# Population Regression Function

## Potential outcomes setup

For each event-unit `\(i\)` at event-time `\(t \in \{-15, \ldots, +15\}\)`, define:

- `\(Y_{it}(1)\)`: cumulative return path the firm experiences **having disclosed a breach**
- `\(Y_{it}(0)\)`: cumulative return path the firm would have experienced **had it not disclosed**

--

The object of interest is the **dynamic ATT** at event-time `\(e\)`:

`$$ATT(e) = \mathbb{E}\!\left[\, Y_{i,\,G_i+e}(1) - Y_{i,\,G_i+e}(0) \,\middle|\, G_i = g \,\right]$$`

where `\(G_i\)` is the period in which unit `\(i\)` is first treated (here, the disclosure day).

--

<br>

`\(Y_{it}(0)\)` is the <span class="purple-block"> missing counterfactual </span>, and the CS estimator tells us how to recover it.

---

# Implementation

## Setting up the panel

``` r
library(dplyr)  # select(), left_join(), mutate()
library(purrr)  # pmap_dfr()

# build_event_window() is assumed to be a project-defined helper (not shown here)
panel <- breaches %>%
  select(Event_ID, ticker, event_date) %>%
  pmap_dfr(~ build_event_window(..1, ..2, ..3))

panel <- panel %>%
  left_join(breaches %>% select(Event_ID, breach_size), by = "Event_ID") %>%
  mutate(period = as.integer(event_time + 16),   # event time -15..+15 -> period 1..31
         G      = ifelse(treat == 1L, 16, 0),    # treated at period 16; 0 = never treated
         id_num = as.integer(factor(paste(unit_id, Event_ID, sep = "_"))),
         w_raw  = log(pmax(breach_size, 1) + 1),
         w      = w_raw / mean(w_raw[treat == 1L], na.rm = TRUE))  # mean treated weight = 1
```

<small>
Each breach generates **two units**: a treated firm (`\(F_i\)`) and a synthetic never-treated S&P 500 window (`\(M_i\)`) over the same calendar dates.
</small>

---

# Implementation

## The CS call

``` r
library(did)  # att_gt(), aggte()

att <- att_gt(
  yname         = "y",
  tname         = "period",
  idname        = "id_num",
  gname         = "G",
  xformla       = ~ pre_mom + log_size,
  data          = panel,
  control_group = "nevertreated",
  weightsname   = "w",
  panel         = FALSE,
  bstrap        = TRUE,
  biters        = 1000,
  cband         = TRUE,
  est_method    = "dr"   # doubly robust
)

es <- aggte(att, type = "dynamic", min_e = -15, max_e = 15, na.rm = TRUE)
```

--

<small> Weights: **normalized log breach size**. Bigger breaches still count more, but the log transform keeps a handful of mega-breaches from dominating, and the normalization sets the mean treated weight to one. </small>

---

# References

<small>

- Callaway, B. & Sant'Anna, P.H.C. (2021). "Difference-in-Differences with multiple time periods." *Journal of Econometrics*, 225(2), 200–230.
- MacKinlay, A.C. (1997). "Event studies in economics and finance." *Journal of Economic Literature*, 35(1), 13–39.
- Rambachan, A. & Roth, J. (2023). "A More Credible Approach to Parallel Trends." *Review of Economic Studies*, 90(5), 2555–2591.
- Rosati, P. & Lynn, T. (2020). "Data Breaches Dataset." *Mendeley Data.*
- Sant'Anna, P.H.C. & Zhao, J. (2020). "Doubly robust difference-in-differences estimators." *Journal of Econometrics*, 219(1), 101–122.

</small>