This report documents every data selection step applied to the Tracking the Sun (TTS) 2025 public data file from Lawrence Berkeley National Laboratory. The pipeline follows the sample-selection methodology of Bukow (2022) and produces three nested subsets of the data, progressing from raw to estimation-ready.
overview <- data.table(
Sample = c("Full TTS", "Price sample", "Estimation sample", "Final sample"),
Rows = comma(c(n_full, n_price, n_est, n_final)),
`% of Full` = sprintf("%.1f%%", c(100, 100*n_price/n_full, 100*n_est/n_full, 100*n_final/n_full))
)
kable(overview, align = "lrr") %>%
kable_styling(full_width = FALSE, font_size = 14) %>%
row_spec(0, bold = TRUE) %>%
row_spec(4, bold = TRUE, color = ft_claret)
| Sample | Rows | % of Full |
|---|---|---|
| Full TTS | 3,664,197 | 100.0% |
| Price sample | 1,656,407 | 45.2% |
| Estimation sample | 1,204,816 | 32.9% |
| Final sample | 481,399 | 13.1% |
Pipeline run: 24 June 2026, 17:17.
The chart below shows how each filter progressively reduces the dataset from the full TTS sample to the final estimation-ready sample.
wf <- fl[, .(step, filter_name, n_dropped)]
wf <- rbindlist(list(
data.table(step = 0L, filter_name = "Full TTS", n_dropped = 0L),
wf
))
wf[, running_total := n_full - cumsum(n_dropped)]
wf[, filter_name := factor(filter_name, levels = filter_name)]
wf[, bar_fill := fifelse(step == 0, ft_teal, ft_claret)]
ggplot(wf, aes(x = filter_name, y = running_total)) +
geom_col(fill = wf$bar_fill, width = 0.7) +
geom_text(aes(label = comma(running_total)),
vjust = -0.4, size = 3.3, colour = ft_ink, fontface = "bold") +
scale_y_continuous(labels = comma, expand = expansion(mult = c(0, 0.12))) +
labs(
title = "Sample size after each filter",
subtitle = "Starting from the full TTS dataset, each bar shows the remaining observations",
x = NULL, y = NULL
) +
theme_ft() +
theme(
axis.text.x = element_text(angle = 40, hjust = 1, size = 10),
panel.grid.major.x = element_blank()
)
display <- fl[, .(
Step = step,
Stage = stage,
Filter = filter_name,
Before = comma(n_before),
Dropped = comma(n_dropped),
After = comma(n_after),
`% Dropped` = sprintf("%.2f%%", pct_dropped_step),
`% of Full` = sprintf("%.1f%%", pct_of_full)
)]
price_rows <- which(fl$stage == "Price")
est_rows <- which(fl$stage == "Estimation")
final_rows <- which(fl$stage == "Final")
kable(display, align = "clcrrrrr") %>%
kable_styling(full_width = FALSE, font_size = 13) %>%
row_spec(price_rows, background = "#f3f4f6") %>%
row_spec(est_rows, background = "#e6f4f1") %>%
row_spec(final_rows, background = "#eef6f0") %>%
row_spec(0, bold = TRUE)
| Step | Stage | Filter | Before | Dropped | After | % Dropped | % of Full |
|---|---|---|---|---|---|---|---|
| 1 | Price | Missing price | 3,664,197 | 949,240 | 2,714,957 | 25.91% | 74.1% |
| 2 | Price | Price outliers | 2,714,957 | 92,397 | 2,622,560 | 3.40% | 71.6% |
| 3 | Price | Third-party owned | 2,622,560 | 789,321 | 1,833,239 | 30.10% | 50.0% |
| 4 | Price | Self-installed | 1,833,239 | 34,587 | 1,798,652 | 1.89% | 49.1% |
| 5 | Price | Battery storage | 1,798,652 | 142,245 | 1,656,407 | 7.91% | 45.2% |
| 6 | Estimation | Non-residential | 1,656,407 | 41,135 | 1,615,272 | 2.48% | 44.1% |
| 7 | Estimation | Size outliers | 1,615,272 | 25,403 | 1,589,869 | 1.57% | 43.4% |
| 8 | Estimation | Pre-2010 installations | 1,589,869 | 53,662 | 1,536,207 | 3.38% | 41.9% |
| 9 | Estimation | Negative efficiency | 1,536,207 | 0 | 1,536,207 | 0.00% | 41.9% |
| 10 | Estimation | Missing module tech | 1,536,207 | 331,391 | 1,204,816 | 21.57% | 32.9% |
| 11 | Final | Missing required vars | 1,204,816 | 723,417 | 481,399 | 60.04% | 13.1% |
| 12 | Final (ext) | Missing vars (no poly) | 1,204,816 | 17,971 | 1,186,845 | 1.49% | 32.4% |
Price filters · Estimation filters · Final sample
for (i in seq_len(nrow(fl))) {
row <- fl[i]
cat(sprintf(
"### Step %d: %s\n\n**Stage:** %s | **Dropped:** %s observations (%.2f%%) | **Remaining:** %s (%.1f%% of full)\n\n%s\n\n---\n\n",
row$step, row$filter_name, row$stage,
comma(row$n_dropped), row$pct_dropped_step,
comma(row$n_after), row$pct_of_full,
row$description
))
}
Stage: Price | Dropped: 949,240 observations (25.91%) | Remaining: 2,714,957 (74.1% of full)
Price per watt could not be computed (installed price or system size missing/sentinel).
Stage: Price | Dropped: 92,397 observations (3.40%) | Remaining: 2,622,560 (71.6% of full)
Price per watt outside the plausible $1-$10/W range (data errors or non-standard systems).
Stage: Price | Dropped: 789,321 observations (30.10%) | Remaining: 1,833,239 (50.0% of full)
TPO systems (leases, PPAs) excluded: reported price does not reflect true installed cost to the owner.
Stage: Price | Dropped: 34,587 observations (1.89%) | Remaining: 1,798,652 (49.1% of full)
DIY/self-installed systems excluded: they lack standard installer labour costs.
Stage: Price | Dropped: 142,245 observations (7.91%) | Remaining: 1,656,407 (45.2% of full)
PV+Storage and storage-only systems excluded so prices reflect PV-only cost.
Stage: Estimation | Dropped: 41,135 observations (2.48%) | Remaining: 1,615,272 (44.1% of full)
Non-residential systems (commercial, agricultural, government) excluded.
Stage: Estimation | Dropped: 25,403 observations (1.57%) | Remaining: 1,589,869 (43.4% of full)
Systems outside 1 W - 20 kW, implausible for residential installations.
Stage: Estimation | Dropped: 53,662 observations (3.38%) | Remaining: 1,536,207 (41.9% of full)
Pre-2010 dropped: limited coverage and no weekly polysilicon price data.
Stage: Estimation | Dropped: 0 observations (0.00%) | Remaining: 1,536,207 (41.9% of full)
Negative module efficiency, a reporting error (< 0.1% of sample).
Stage: Estimation | Dropped: 331,391 observations (21.57%) | Remaining: 1,204,816 (32.9% of full)
Module technology (Mono/Poly/Other) undeterminable from any of the three module fields.
Stage: Final | Dropped: 723,417 observations (60.04%) | Remaining: 481,399 (13.1% of full)
NA in any estimation variable after merges (mainly polysilicon / IRS / tax match failures).
Stage: Final (ext) | Dropped: 17,971 observations (1.49%) | Remaining: 1,186,845 (32.4% of full)
Extended sample: NA in any required variable except polysilicon. Full date range for tax-only IV.
fl_plot <- fl[n_dropped > 0]
ggplot(fl_plot, aes(x = reorder(filter_name, n_dropped), y = n_dropped)) +
geom_col(fill = ft_teal, width = 0.6) +
geom_text(aes(label = comma(n_dropped)), hjust = -0.1, size = 3.3,
colour = ft_ink, fontface = "bold") +
coord_flip() +
scale_y_continuous(labels = comma, expand = expansion(mult = c(0, 0.18))) +
labs(
title = "Observations dropped by each filter",
subtitle = "Ranked by absolute number of rows removed",
x = NULL, y = NULL
) +
theme_ft() +
theme(panel.grid.major.y = element_blank())
ggplot(fl_plot, aes(x = reorder(filter_name, pct_dropped_step), y = pct_dropped_step)) +
geom_col(fill = ft_claret, width = 0.6) +
geom_text(aes(label = sprintf("%.1f%%", pct_dropped_step)), hjust = -0.1, size = 3.3,
colour = ft_ink, fontface = "bold") +
coord_flip() +
scale_y_continuous(expand = expansion(mult = c(0, 0.18))) +
labs(
title = "Share of rows dropped at each step",
subtitle = "As percentage of rows entering that step (not of full sample)",
x = NULL, y = NULL
) +
theme_ft() +
theme(panel.grid.major.y = element_blank())
Three external datasets are joined to the estimation sample before the final complete-cases filter is applied.
mr_display <- mr[, .(
Source = source,
`Join Key` = join_key,
`Match Rate` = sprintf("%.1f%%", match_pct),
Description = description
)]
kable(mr_display) %>%
kable_styling(full_width = TRUE, font_size = 13) %>%
row_spec(0, bold = TRUE)
| Source | Join Key | Match Rate | Description |
|---|---|---|---|
| Polysilicon (PVinsights) | year + week | 40.8% | Weekly polysilicon spot prices joined on installation year and ISO week. First instrument for 2SLS. |
| IRS SOI (zip x year) | zip5 + year | 98.8% | Household count and AGI per household, joined on 5-digit zip and year (capped at the latest IRS year). |
| Sales Tax (Tax Foundation) | state + year | 100.0% | State sales tax rates converted to installation-level tax/W (Bukow B2: rate x 0.55 x price). Second instrument. |
Unmatched rows in the merge stage are the primary driver of the final sample being smaller than the estimation sample. They introduce NAs in required variables which are then dropped by the complete-cases filter in Step 11.
st_display <- state_cts[, .(
State = state_raw,
N = comma(N),
`% of Estimation` = sprintf("%.1f%%", 100 * N / n_est),
Group = ifelse(state_raw %in% major_st, "Major (> 20k)", "Other")
)]
kable(head(st_display, 25), caption = "Top 25 states in estimation sample") %>%
kable_styling(full_width = FALSE, font_size = 13) %>%
row_spec(0, bold = TRUE) %>%
row_spec(which(head(st_display, 25)$Group == "Major (> 20k)"),
bold = TRUE, color = ft_teal)
| State | N | % of Estimation | Group |
|---|---|---|---|
| CA | 903,086 | 75.0% | Major (> 20k) |
| NY | 74,046 | 6.1% | Major (> 20k) |
| AZ | 64,395 | 5.3% | Major (> 20k) |
| TX | 48,840 | 4.1% | Major (> 20k) |
| MA | 46,597 | 3.9% | Major (> 20k) |
| NM | 15,098 | 1.3% | Other |
| WI | 13,801 | 1.1% | Other |
| CT | 12,655 | 1.1% | Other |
| UT | 7,954 | 0.7% | Other |
| WA | 6,964 | 0.6% | Other |
| PA | 4,630 | 0.4% | Other |
| NH | 2,282 | 0.2% | Other |
| VT | 1,980 | 0.2% | Other |
| MN | 1,038 | 0.1% | Other |
| OR | 953 | 0.1% | Other |
| FL | 371 | 0.0% | Other |
| AR | 64 | 0.0% | Other |
| RI | 61 | 0.0% | Other |
| IL | 1 | 0.0% | Other |
Major states (> 20,000 observations each): CA, NY, AZ, TX, MA. All other states are grouped as “Other” in the state fixed-effect specification.
2SLS IV estimation of the residential solar price elasticity
of demand (05_run_model.R). The dependent variable
is ln(system size); the reported elasticity is the coefficient on
ln(price/W). Instruments: polysilicon spot price and a synthetic
sales-tax term. First-stage F is the weak-instrument statistic (rule of
thumb: > 10).
if (!is.null(model_res)) {
mres <- copy(model_res)
mres[, `:=`(
Elasticity = sprintf("%+.3f", Elasticity),
N = comma(N),
`First-stage F` = ifelse(is.na(FirstStageF), "—", comma(round(FirstStageF)))
)]
kable(mres[, .(Model, Elasticity, N, `First-stage F`)], align = "lrrr") %>%
kable_styling(full_width = FALSE, font_size = 13) %>%
row_spec(0, bold = TRUE) %>%
row_spec(which(model_res$Model == "IV: Preferred"), bold = TRUE, color = ft_claret) %>%
row_spec(which(model_res$Model == "IV: Poly Only"), bold = TRUE, color = ft_teal)
} else {
cat("Run 05_run_model.R to generate model_results.rds")
}
| Model | Elasticity | N | First-stage F |
|---|---|---|---|
| OLS | -0.436 | 481,399 | — |
| IV: Preferred | -0.442 | 481,399 | 1,622,303 |
| IV: Poly Only | -0.269 | 481,399 | 356 |
| IV: No FE | -0.492 | 481,399 | 688,215 |
| IV: 2010-17 | -0.471 | 270,817 | 715,783 |
| IV: 2018+ | -0.426 | 210,582 | 1,510,119 |
| IV: CA | -0.441 | 357,830 | 3,184,165 |
| IV: CA 2010-17 | -0.492 | 195,742 | 1,616,687 |
| IV: Non-CA | -0.463 | 123,569 | 103,493 |
| IV: Tax Full | -0.292 | 1,186,845 | 11,876,373 |
| IV: Tax Pre-21 | -0.423 | 596,514 | 3,994,974 |
| IV: Tax Post-20 | -0.215 | 590,331 | 11,023,836 |
| IV: Tax to-2022 | -0.333 | 920,811 | 7,912,747 |
| OLS Extended | -0.311 | 1,186,845 | — |
| IV: Rate (preferred) | +0.759 | 481,399 | 214 |
| IV: Rate (tax only) | +1.558 | 1,186,845 | 1,745 |
Bukow (2022) baseline: −0.443 (California-dominated, 2010–2017).
The forest plot makes the spread visible at a glance — every estimate as a point with its 95% CI, so re-running after a pipeline change shows immediately how the whole set shifts.
if (!is.null(model_res)) {
mp <- copy(model_res)
mp[, Model := factor(Model, levels = rev(Model))] # keep script order, top-to-bottom
ggplot(mp, aes(x = Elasticity, y = Model)) +
geom_vline(xintercept = 0, colour = ft_rule, linewidth = 0.5) +
geom_vline(xintercept = -0.443, linetype = "dashed", colour = ft_ink_lt, linewidth = 0.5) +
geom_errorbarh(aes(xmin = CI_low, xmax = CI_high), height = 0.3, colour = ft_ink_lt) +
geom_point(colour = ft_claret, size = 2.4) +
coord_cartesian(xlim = c(-0.9, 1.9)) +
labs(
title = "Price elasticity across specifications",
subtitle = "Point estimate with 95% CI (clustered by state). Dashed line = Bukow −0.443; solid = 0.",
x = "Elasticity (coefficient on ln price/W)", y = NULL
) +
theme_ft()
}
Estimates left of the dashed line are more elastic than Bukow; the two rate-instrument models sit implausibly to the right of zero (positive), flagging that instrument’s failure.
if (!is.null(state_el)) {
ggplot(state_el, aes(x = Elasticity, y = reorder(State, Elasticity))) +
geom_vline(xintercept = -0.443, linetype = "dashed", colour = ft_ink_lt, linewidth = 0.5) +
geom_errorbarh(aes(xmin = CI_low, xmax = CI_high), height = 0.22, colour = ft_ink_lt) +
geom_point(colour = ft_teal, size = 3) +
labs(
title = "Price elasticity by state (preferred IV)",
subtitle = "Point + 95% CI (clustered by year). Dashed line = Bukow −0.443.",
x = "Elasticity (coefficient on ln price/W)", y = NULL
) +
theme_ft()
}
This report reads pipeline_meta.rds generated by
04_clean_tts.R. To regenerate:
source("scripts/04_clean_tts.R")
rmarkdown::render("06_pipeline_report.Rmd")
Session info:
sessionInfo()
## R version 4.5.2 (2025-10-31)
## Platform: aarch64-apple-darwin20
## Running under: macOS Tahoe 26.3
##
## Matrix products: default
## BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.5-arm64/Resources/lib/libRlapack.dylib; LAPACK version 3.12.1
##
## locale:
## [1] en_GB/en_GB/en_GB/C/en_GB/en_GB
##
## time zone: Europe/London
## tzcode source: internal
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] kableExtra_1.4.0 knitr_1.51 scales_1.4.0
## [4] lubridate_1.9.5 forcats_1.0.1 stringr_1.6.0
## [7] dplyr_1.2.0 purrr_1.2.1 readr_2.2.0
## [10] tidyr_1.3.2 tibble_3.3.1 ggplot2_4.0.2
## [13] tidyverse_2.0.0 data.table_1.18.2.1
##
## loaded via a namespace (and not attached):
## [1] rappdirs_0.3.4 sass_0.4.10 generics_0.1.4 xml2_1.5.2
## [5] stringi_1.8.7 hms_1.1.4 digest_0.6.39 magrittr_2.0.4
## [9] evaluate_1.0.5 grid_4.5.2 timechange_0.4.0 RColorBrewer_1.1-3
## [13] fastmap_1.2.0 rprojroot_2.1.1 jsonlite_2.0.0 viridisLite_0.4.3
## [17] textshaping_1.0.4 jquerylib_0.1.4 cli_3.6.5 rlang_1.1.7
## [21] withr_3.0.2 cachem_1.1.0 yaml_2.3.12 otel_0.2.0
## [25] tools_4.5.2 tzdb_0.5.0 pandoc_0.2.0 here_1.0.2
## [29] vctrs_0.7.1 R6_2.6.1 lifecycle_1.0.5 fs_1.6.6
## [33] pkgconfig_2.0.3 pillar_1.11.1 bslib_0.10.0 gtable_0.3.6
## [37] glue_1.8.0 systemfonts_1.3.1 xfun_0.56 tidyselect_1.2.1
## [41] rstudioapi_0.18.0 farver_2.1.2 htmltools_0.5.9 labeling_0.4.3
## [45] rmarkdown_2.30 svglite_2.2.2 compiler_4.5.2 S7_0.2.1