1 Overview

This report documents every data selection step applied to the Tracking the Sun (TTS) 2025 public data file from Lawrence Berkeley National Laboratory. The pipeline follows the sample-selection methodology of Bukow (2022) and produces three nested subsets of the data, progressing from raw to estimation-ready.

overview <- data.table(
  Sample = c("Full TTS", "Price sample", "Estimation sample", "Final sample"),
  Rows   = comma(c(n_full, n_price, n_est, n_final)),
  `% of Full` = sprintf("%.1f%%", c(100, 100*n_price/n_full, 100*n_est/n_full, 100*n_final/n_full))
)

kable(overview, align = "lrr") %>%
  kable_styling(full_width = FALSE, font_size = 14) %>%
  row_spec(0, bold = TRUE) %>%
  row_spec(4, bold = TRUE, color = ft_claret)
Sample Rows % of Full
Full TTS 3,664,197 100.0%
Price sample 1,656,407 45.2%
Estimation sample 1,204,816 32.9%
Final sample 481,399 13.1%

Pipeline run: 24 June 2026, 17:17.


2 Sample Flow: Waterfall Chart

The chart below shows how each filter progressively reduces the dataset from the full TTS sample to the final estimation-ready sample.

wf <- fl[, .(step, filter_name, n_dropped)]
wf <- rbindlist(list(
  data.table(step = 0L, filter_name = "Full TTS", n_dropped = 0L),
  wf
))
wf[, running_total := n_full - cumsum(n_dropped)]
wf[, filter_name := factor(filter_name, levels = filter_name)]
wf[, bar_fill := fifelse(step == 0, ft_teal, ft_claret)]

ggplot(wf, aes(x = filter_name, y = running_total)) +
  geom_col(fill = wf$bar_fill, width = 0.7) +
  geom_text(aes(label = comma(running_total)),
            vjust = -0.4, size = 3.3, colour = ft_ink, fontface = "bold") +
  scale_y_continuous(labels = comma, expand = expansion(mult = c(0, 0.12))) +
  labs(
    title    = "Sample size after each filter",
    subtitle = "Starting from the full TTS dataset, each bar shows the remaining observations",
    x = NULL, y = NULL
  ) +
  theme_ft() +
  theme(
    axis.text.x  = element_text(angle = 40, hjust = 1, size = 10),
    panel.grid.major.x = element_blank()
  )


3 Filter-by-Filter Breakdown

3.1 Summary Table

display <- fl[, .(
  Step          = step,
  Stage         = stage,
  Filter        = filter_name,
  Before        = comma(n_before),
  Dropped       = comma(n_dropped),
  After         = comma(n_after),
  `% Dropped`   = sprintf("%.2f%%", pct_dropped_step),
  `% of Full`   = sprintf("%.1f%%", pct_of_full)
)]

price_rows <- which(fl$stage == "Price")
est_rows   <- which(fl$stage == "Estimation")
final_rows <- which(fl$stage == "Final")

kable(display, align = "clcrrrrr") %>%
  kable_styling(full_width = FALSE, font_size = 13) %>%
  row_spec(price_rows, background = "#f3f4f6") %>%
  row_spec(est_rows,   background = "#e6f4f1") %>%
  row_spec(final_rows, background = "#eef6f0") %>%
  row_spec(0, bold = TRUE)
Step Stage Filter Before Dropped After % Dropped % of Full
1 Price Missing price 3,664,197 949,240 2,714,957 25.91% 74.1%
2 Price Price outliers 2,714,957 92,397 2,622,560 3.40% 71.6%
3 Price Third-party owned 2,622,560 789,321 1,833,239 30.10% 50.0%
4 Price Self-installed 1,833,239 34,587 1,798,652 1.89% 49.1%
5 Price Battery storage 1,798,652 142,245 1,656,407 7.91% 45.2%
6 Estimation Non-residential 1,656,407 41,135 1,615,272 2.48% 44.1%
7 Estimation Size outliers 1,615,272 25,403 1,589,869 1.57% 43.4%
8 Estimation Pre-2010 installations 1,589,869 53,662 1,536,207 3.38% 41.9%
9 Estimation Negative efficiency 1,536,207 0 1,536,207 0.00% 41.9%
10 Estimation Missing module tech 1,536,207 331,391 1,204,816 21.57% 32.9%
11 Final Missing required vars 1,204,816 723,417 481,399 60.04% 13.1%
12 Final (ext) Missing vars (no poly) 1,204,816 17,971 1,186,845 1.49% 32.4%

Price filters · Estimation filters · Final sample

3.2 Detailed Descriptions

for (i in seq_len(nrow(fl))) {
  row <- fl[i]
  cat(sprintf(
    "### Step %d: %s\n\n**Stage:** %s &nbsp;&nbsp;|&nbsp;&nbsp; **Dropped:** %s observations (%.2f%%) &nbsp;&nbsp;|&nbsp;&nbsp; **Remaining:** %s (%.1f%% of full)\n\n%s\n\n---\n\n",
    row$step, row$filter_name, row$stage,
    comma(row$n_dropped), row$pct_dropped_step,
    comma(row$n_after), row$pct_of_full,
    row$description
  ))
}

3.2.1 Step 1: Missing price

Stage: Price   |   Dropped: 949,240 observations (25.91%)   |   Remaining: 2,714,957 (74.1% of full)

Price per watt could not be computed (installed price or system size missing/sentinel).


3.2.2 Step 2: Price outliers

Stage: Price   |   Dropped: 92,397 observations (3.40%)   |   Remaining: 2,622,560 (71.6% of full)

Price per watt outside the plausible $1-$10/W range (data errors or non-standard systems).


3.2.3 Step 3: Third-party owned

Stage: Price   |   Dropped: 789,321 observations (30.10%)   |   Remaining: 1,833,239 (50.0% of full)

TPO systems (leases, PPAs) excluded: reported price does not reflect true installed cost to the owner.


3.2.4 Step 4: Self-installed

Stage: Price   |   Dropped: 34,587 observations (1.89%)   |   Remaining: 1,798,652 (49.1% of full)

DIY/self-installed systems excluded: they lack standard installer labour costs.


3.2.5 Step 5: Battery storage

Stage: Price   |   Dropped: 142,245 observations (7.91%)   |   Remaining: 1,656,407 (45.2% of full)

PV+Storage and storage-only systems excluded so prices reflect PV-only cost.


3.2.6 Step 6: Non-residential

Stage: Estimation   |   Dropped: 41,135 observations (2.48%)   |   Remaining: 1,615,272 (44.1% of full)

Non-residential systems (commercial, agricultural, government) excluded.


3.2.7 Step 7: Size outliers

Stage: Estimation   |   Dropped: 25,403 observations (1.57%)   |   Remaining: 1,589,869 (43.4% of full)

Systems outside 1 W - 20 kW, implausible for residential installations.


3.2.8 Step 8: Pre-2010 installations

Stage: Estimation   |   Dropped: 53,662 observations (3.38%)   |   Remaining: 1,536,207 (41.9% of full)

Pre-2010 dropped: limited coverage and no weekly polysilicon price data.


3.2.9 Step 9: Negative efficiency

Stage: Estimation   |   Dropped: 0 observations (0.00%)   |   Remaining: 1,536,207 (41.9% of full)

Negative module efficiency, a reporting error (< 0.1% of sample).


3.2.10 Step 10: Missing module tech

Stage: Estimation   |   Dropped: 331,391 observations (21.57%)   |   Remaining: 1,204,816 (32.9% of full)

Module technology (Mono/Poly/Other) undeterminable from any of the three module fields.


3.2.11 Step 11: Missing required vars

Stage: Final   |   Dropped: 723,417 observations (60.04%)   |   Remaining: 481,399 (13.1% of full)

NA in any estimation variable after merges (mainly polysilicon / IRS / tax match failures).


3.2.12 Step 12: Missing vars (no poly)

Stage: Final (ext)   |   Dropped: 17,971 observations (1.49%)   |   Remaining: 1,186,845 (32.4% of full)

Extended sample: NA in any required variable except polysilicon. Full date range for tax-only IV.



4 Relative Impact of Each Filter

fl_plot <- fl[n_dropped > 0]

ggplot(fl_plot, aes(x = reorder(filter_name, n_dropped), y = n_dropped)) +
  geom_col(fill = ft_teal, width = 0.6) +
  geom_text(aes(label = comma(n_dropped)), hjust = -0.1, size = 3.3,
            colour = ft_ink, fontface = "bold") +
  coord_flip() +
  scale_y_continuous(labels = comma, expand = expansion(mult = c(0, 0.18))) +
  labs(
    title    = "Observations dropped by each filter",
    subtitle = "Ranked by absolute number of rows removed",
    x = NULL, y = NULL
  ) +
  theme_ft() +
  theme(panel.grid.major.y = element_blank())

ggplot(fl_plot, aes(x = reorder(filter_name, pct_dropped_step), y = pct_dropped_step)) +
  geom_col(fill = ft_claret, width = 0.6) +
  geom_text(aes(label = sprintf("%.1f%%", pct_dropped_step)), hjust = -0.1, size = 3.3,
            colour = ft_ink, fontface = "bold") +
  coord_flip() +
  scale_y_continuous(expand = expansion(mult = c(0, 0.18))) +
  labs(
    title    = "Share of rows dropped at each step",
    subtitle = "As percentage of rows entering that step (not of full sample)",
    x = NULL, y = NULL
  ) +
  theme_ft() +
  theme(panel.grid.major.y = element_blank())


5 External Data Merges

Three external datasets are joined to the estimation sample before the final complete-cases filter is applied.

mr_display <- mr[, .(
  Source       = source,
  `Join Key`   = join_key,
  `Match Rate` = sprintf("%.1f%%", match_pct),
  Description  = description
)]

kable(mr_display) %>%
  kable_styling(full_width = TRUE, font_size = 13) %>%
  row_spec(0, bold = TRUE)
Source Join Key Match Rate Description
Polysilicon (PVinsights) year + week 40.8% Weekly polysilicon spot prices joined on installation year and ISO week. First instrument for 2SLS.
IRS SOI (zip x year) zip5 + year 98.8% Household count and AGI per household, joined on 5-digit zip and year (capped at the latest IRS year).
Sales Tax (Tax Foundation) state + year 100.0% State sales tax rates converted to installation-level tax/W (Bukow B2: rate x 0.55 x price). Second instrument.

Unmatched rows in the merge stage are the primary driver of the final sample being smaller than the estimation sample. They introduce NAs in required variables which are then dropped by the complete-cases filter in Step 11.


6 Geographic Coverage

st_display <- state_cts[, .(
  State = state_raw,
  N     = comma(N),
  `% of Estimation` = sprintf("%.1f%%", 100 * N / n_est),
  Group = ifelse(state_raw %in% major_st, "Major (> 20k)", "Other")
)]

kable(head(st_display, 25), caption = "Top 25 states in estimation sample") %>%
  kable_styling(full_width = FALSE, font_size = 13) %>%
  row_spec(0, bold = TRUE) %>%
  row_spec(which(head(st_display, 25)$Group == "Major (> 20k)"),
           bold = TRUE, color = ft_teal)
Top 25 states in estimation sample
State N % of Estimation Group
CA 903,086 75.0% Major (> 20k)
NY 74,046 6.1% Major (> 20k)
AZ 64,395 5.3% Major (> 20k)
TX 48,840 4.1% Major (> 20k)
MA 46,597 3.9% Major (> 20k)
NM 15,098 1.3% Other
WI 13,801 1.1% Other
CT 12,655 1.1% Other
UT 7,954 0.7% Other
WA 6,964 0.6% Other
PA 4,630 0.4% Other
NH 2,282 0.2% Other
VT 1,980 0.2% Other
MN 1,038 0.1% Other
OR 953 0.1% Other
FL 371 0.0% Other
AR 64 0.0% Other
RI 61 0.0% Other
IL 1 0.0% Other

Major states (> 20,000 observations each): CA, NY, AZ, TX, MA. All other states are grouped as “Other” in the state fixed-effect specification.


7 Model Results

2SLS IV estimation of the residential solar price elasticity of demand (05_run_model.R). The dependent variable is ln(system size); the reported elasticity is the coefficient on ln(price/W). Instruments: polysilicon spot price and a synthetic sales-tax term. First-stage F is the weak-instrument statistic (rule of thumb: > 10).

if (!is.null(model_res)) {
  mres <- copy(model_res)
  mres[, `:=`(
    Elasticity      = sprintf("%+.3f", Elasticity),
    N               = comma(N),
    `First-stage F` = ifelse(is.na(FirstStageF), "—", comma(round(FirstStageF)))
  )]
  kable(mres[, .(Model, Elasticity, N, `First-stage F`)], align = "lrrr") %>%
    kable_styling(full_width = FALSE, font_size = 13) %>%
    row_spec(0, bold = TRUE) %>%
    row_spec(which(model_res$Model == "IV: Preferred"), bold = TRUE, color = ft_claret) %>%
    row_spec(which(model_res$Model == "IV: Poly Only"), bold = TRUE, color = ft_teal)
} else {
  cat("Run 05_run_model.R to generate model_results.rds")
}
Model Elasticity N First-stage F
OLS -0.436 481,399
IV: Preferred -0.442 481,399 1,622,303
IV: Poly Only -0.269 481,399 356
IV: No FE -0.492 481,399 688,215
IV: 2010-17 -0.471 270,817 715,783
IV: 2018+ -0.426 210,582 1,510,119
IV: CA -0.441 357,830 3,184,165
IV: CA 2010-17 -0.492 195,742 1,616,687
IV: Non-CA -0.463 123,569 103,493
IV: Tax Full -0.292 1,186,845 11,876,373
IV: Tax Pre-21 -0.423 596,514 3,994,974
IV: Tax Post-20 -0.215 590,331 11,023,836
IV: Tax to-2022 -0.333 920,811 7,912,747
OLS Extended -0.311 1,186,845
IV: Rate (preferred) +0.759 481,399 214
IV: Rate (tax only) +1.558 1,186,845 1,745

Bukow (2022) baseline: −0.443 (California-dominated, 2010–2017).

The forest plot makes the spread visible at a glance — every estimate as a point with its 95% CI, so re-running after a pipeline change shows immediately how the whole set shifts.

if (!is.null(model_res)) {
  mp <- copy(model_res)
  mp[, Model := factor(Model, levels = rev(Model))]   # keep script order, top-to-bottom
  ggplot(mp, aes(x = Elasticity, y = Model)) +
    geom_vline(xintercept = 0, colour = ft_rule, linewidth = 0.5) +
    geom_vline(xintercept = -0.443, linetype = "dashed", colour = ft_ink_lt, linewidth = 0.5) +
    geom_errorbarh(aes(xmin = CI_low, xmax = CI_high), height = 0.3, colour = ft_ink_lt) +
    geom_point(colour = ft_claret, size = 2.4) +
    coord_cartesian(xlim = c(-0.9, 1.9)) +
    labs(
      title    = "Price elasticity across specifications",
      subtitle = "Point estimate with 95% CI (clustered by state). Dashed line = Bukow −0.443; solid = 0.",
      x = "Elasticity (coefficient on ln price/W)", y = NULL
    ) +
    theme_ft()
}

Estimates left of the dashed line are more elastic than Bukow; the two rate-instrument models sit implausibly to the right of zero (positive), flagging that instrument’s failure.

if (!is.null(state_el)) {
  ggplot(state_el, aes(x = Elasticity, y = reorder(State, Elasticity))) +
    geom_vline(xintercept = -0.443, linetype = "dashed", colour = ft_ink_lt, linewidth = 0.5) +
    geom_errorbarh(aes(xmin = CI_low, xmax = CI_high), height = 0.22, colour = ft_ink_lt) +
    geom_point(colour = ft_teal, size = 3) +
    labs(
      title    = "Price elasticity by state (preferred IV)",
      subtitle = "Point + 95% CI (clustered by year). Dashed line = Bukow −0.443.",
      x = "Elasticity (coefficient on ln price/W)", y = NULL
    ) +
    theme_ft()
}

7.1 Key insights

  • The preferred IV reproduces Bukow almost exactly (−0.442 vs −0.443) — but read with care (next point).
  • The “preferred” estimate is close to OLS by construction. OLS is −0.436 and the preferred IV is −0.442 because the synthetic tax instrument is built from price (tax/W = rate × 0.55 × price), so its first-stage F is in the millions and 2SLS collapses back toward OLS. A first-stage F of 10⁶ is a red flag, not a strength.
  • Polysilicon is the only credible instrument. The polysilicon-only IV (highlighted teal) gives −0.269 with a sensible F ≈ 356 — a genuine global cost-shifter moving the estimate away from OLS. This implies demand is more inelastic than the preferred specification suggests.
  • The exogenous rate-based tax instrument fails too. Instrumenting with the tax wedge (rate only, no price) yields implausible positive elasticities (+0.76 to +1.56): state tax-rate variation is confounded with other state-level demand drivers and is too weak after fixed effects. So the tax instrument is unusable in both forms — lean on polysilicon.
  • External validity holds. California (−0.441) and non-California (−0.463) elasticities are close, and every subsample sits in a tight −0.42 to −0.49 band.
  • The post-2020 extension is weakly identified. With sales-tax rates frozen at 2020, the tax-only post-2020 model has no exogenous within-state variation; its estimate (−0.215) should be treated as indicative only.

8 Reproducibility

This report reads pipeline_meta.rds generated by 04_clean_tts.R. To regenerate:

source("scripts/04_clean_tts.R")
rmarkdown::render("06_pipeline_report.Rmd")

Session info:

sessionInfo()
## R version 4.5.2 (2025-10-31)
## Platform: aarch64-apple-darwin20
## Running under: macOS Tahoe 26.3
## 
## Matrix products: default
## BLAS:   /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib 
## LAPACK: /Library/Frameworks/R.framework/Versions/4.5-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.12.1
## 
## locale:
## [1] en_GB/en_GB/en_GB/C/en_GB/en_GB
## 
## time zone: Europe/London
## tzcode source: internal
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] kableExtra_1.4.0    knitr_1.51          scales_1.4.0       
##  [4] lubridate_1.9.5     forcats_1.0.1       stringr_1.6.0      
##  [7] dplyr_1.2.0         purrr_1.2.1         readr_2.2.0        
## [10] tidyr_1.3.2         tibble_3.3.1        ggplot2_4.0.2      
## [13] tidyverse_2.0.0     data.table_1.18.2.1
## 
## loaded via a namespace (and not attached):
##  [1] rappdirs_0.3.4     sass_0.4.10        generics_0.1.4     xml2_1.5.2        
##  [5] stringi_1.8.7      hms_1.1.4          digest_0.6.39      magrittr_2.0.4    
##  [9] evaluate_1.0.5     grid_4.5.2         timechange_0.4.0   RColorBrewer_1.1-3
## [13] fastmap_1.2.0      rprojroot_2.1.1    jsonlite_2.0.0     viridisLite_0.4.3 
## [17] textshaping_1.0.4  jquerylib_0.1.4    cli_3.6.5          rlang_1.1.7       
## [21] withr_3.0.2        cachem_1.1.0       yaml_2.3.12        otel_0.2.0        
## [25] tools_4.5.2        tzdb_0.5.0         pandoc_0.2.0       here_1.0.2        
## [29] vctrs_0.7.1        R6_2.6.1           lifecycle_1.0.5    fs_1.6.6          
## [33] pkgconfig_2.0.3    pillar_1.11.1      bslib_0.10.0       gtable_0.3.6      
## [37] glue_1.8.0         systemfonts_1.3.1  xfun_0.56          tidyselect_1.2.1  
## [41] rstudioapi_0.18.0  farver_2.1.2       htmltools_0.5.9    labeling_0.4.3    
## [45] rmarkdown_2.30     svglite_2.2.2      compiler_4.5.2     S7_0.2.1