1 Overview

This report documents every data selection step applied to the Tracking the Sun (TTS) 2025 public data file from Lawrence Berkeley National Laboratory. The pipeline follows the sample-selection methodology of Bukow (2022) and produces three nested subsets of the data, progressing from raw to estimation-ready.

overview <- data.table(
  Sample = c("Full TTS", "Price sample", "Estimation sample", "Final sample"),
  Rows   = comma(c(n_full, n_price, n_est, n_final)),
  `% of Full` = sprintf("%.1f%%", c(100, 100*n_price/n_full, 100*n_est/n_full, 100*n_final/n_full))
)

kable(overview, align = "lrr") %>%
  kable_styling(full_width = FALSE, font_size = 14) %>%
  row_spec(0, bold = TRUE) %>%
  row_spec(4, bold = TRUE, color = ft_claret)

Sample	Rows	% of Full
Full TTS	3,664,197	100.0%
Price sample	1,656,407	45.2%
Estimation sample	1,204,816	32.9%
Final sample	481,399	13.1%

Pipeline run: 24 June 2026, 17:17.

2 Sample Flow: Waterfall Chart

The chart below shows how each filter progressively reduces the dataset from the full TTS sample to the final estimation-ready sample.

wf <- fl[, .(step, filter_name, n_dropped)]
wf <- rbindlist(list(
  data.table(step = 0L, filter_name = "Full TTS", n_dropped = 0L),
  wf
))
wf[, running_total := n_full - cumsum(n_dropped)]
wf[, filter_name := factor(filter_name, levels = filter_name)]
wf[, bar_fill := fifelse(step == 0, ft_teal, ft_claret)]

ggplot(wf, aes(x = filter_name, y = running_total)) +
  geom_col(fill = wf$bar_fill, width = 0.7) +
  geom_text(aes(label = comma(running_total)),
            vjust = -0.4, size = 3.3, colour = ft_ink, fontface = "bold") +
  scale_y_continuous(labels = comma, expand = expansion(mult = c(0, 0.12))) +
  labs(
    title    = "Sample size after each filter",
    subtitle = "Starting from the full TTS dataset, each bar shows the remaining observations",
    x = NULL, y = NULL
  ) +
  theme_ft() +
  theme(
    axis.text.x  = element_text(angle = 40, hjust = 1, size = 10),
    panel.grid.major.x = element_blank()
  )

3 Filter-by-Filter Breakdown

3.1 Summary Table

display <- fl[, .(
  Step          = step,
  Stage         = stage,
  Filter        = filter_name,
  Before        = comma(n_before),
  Dropped       = comma(n_dropped),
  After         = comma(n_after),
  `% Dropped`   = sprintf("%.2f%%", pct_dropped_step),
  `% of Full`   = sprintf("%.1f%%", pct_of_full)
)]

price_rows <- which(fl$stage == "Price")
est_rows   <- which(fl$stage == "Estimation")
final_rows <- which(fl$stage == "Final")

kable(display, align = "clcrrrrr") %>%
  kable_styling(full_width = FALSE, font_size = 13) %>%
  row_spec(price_rows, background = "#f3f4f6") %>%
  row_spec(est_rows,   background = "#e6f4f1") %>%
  row_spec(final_rows, background = "#eef6f0") %>%
  row_spec(0, bold = TRUE)

Step	Stage	Filter	Before	Dropped	After	% Dropped	% of Full
1	Price	Missing price	3,664,197	949,240	2,714,957	25.91%	74.1%
2	Price	Price outliers	2,714,957	92,397	2,622,560	3.40%	71.6%
3	Price	Third-party owned	2,622,560	789,321	1,833,239	30.10%	50.0%
4	Price	Self-installed	1,833,239	34,587	1,798,652	1.89%	49.1%
5	Price	Battery storage	1,798,652	142,245	1,656,407	7.91%	45.2%
6	Estimation	Non-residential	1,656,407	41,135	1,615,272	2.48%	44.1%
7	Estimation	Size outliers	1,615,272	25,403	1,589,869	1.57%	43.4%
8	Estimation	Pre-2010 installations	1,589,869	53,662	1,536,207	3.38%	41.9%
9	Estimation	Negative efficiency	1,536,207	0	1,536,207	0.00%	41.9%
10	Estimation	Missing module tech	1,536,207	331,391	1,204,816	21.57%	32.9%
11	Final	Missing required vars	1,204,816	723,417	481,399	60.04%	13.1%
12	Final (ext)	Missing vars (no poly)	1,204,816	17,971	1,186,845	1.49%	32.4%

Price filters · Estimation filters · Final sample

3.2 Detailed Descriptions

for (i in seq_len(nrow(fl))) {
  row <- fl[i]
  cat(sprintf(
    "### Step %d: %s\n\n**Stage:** %s &nbsp;&nbsp;|&nbsp;&nbsp; **Dropped:** %s observations (%.2f%%) &nbsp;&nbsp;|&nbsp;&nbsp; **Remaining:** %s (%.1f%% of full)\n\n%s\n\n---\n\n",
    row$step, row$filter_name, row$stage,
    comma(row$n_dropped), row$pct_dropped_step,
    comma(row$n_after), row$pct_of_full,
    row$description
  ))
}

3.2.1 Step 1: Missing price

Stage: Price | Dropped: 949,240 observations (25.91%) | Remaining: 2,714,957 (74.1% of full)

Price per watt could not be computed (installed price or system size missing/sentinel).

3.2.2 Step 2: Price outliers

Stage: Price | Dropped: 92,397 observations (3.40%) | Remaining: 2,622,560 (71.6% of full)

Price per watt outside the plausible $1-$10/W range (data errors or non-standard systems).

3.2.3 Step 3: Third-party owned

Stage: Price | Dropped: 789,321 observations (30.10%) | Remaining: 1,833,239 (50.0% of full)

TPO systems (leases, PPAs) excluded: reported price does not reflect true installed cost to the owner.

3.2.4 Step 4: Self-installed

Stage: Price | Dropped: 34,587 observations (1.89%) | Remaining: 1,798,652 (49.1% of full)

DIY/self-installed systems excluded: they lack standard installer labour costs.

3.2.5 Step 5: Battery storage

Stage: Price | Dropped: 142,245 observations (7.91%) | Remaining: 1,656,407 (45.2% of full)

PV+Storage and storage-only systems excluded so prices reflect PV-only cost.

3.2.6 Step 6: Non-residential

Stage: Estimation | Dropped: 41,135 observations (2.48%) | Remaining: 1,615,272 (44.1% of full)

Non-residential systems (commercial, agricultural, government) excluded.

3.2.7 Step 7: Size outliers

Stage: Estimation | Dropped: 25,403 observations (1.57%) | Remaining: 1,589,869 (43.4% of full)

Systems outside 1 W - 20 kW, implausible for residential installations.

3.2.8 Step 8: Pre-2010 installations

Stage: Estimation | Dropped: 53,662 observations (3.38%) | Remaining: 1,536,207 (41.9% of full)

Pre-2010 dropped: limited coverage and no weekly polysilicon price data.

3.2.9 Step 9: Negative efficiency

Stage: Estimation | Dropped: 0 observations (0.00%) | Remaining: 1,536,207 (41.9% of full)

Negative module efficiency, a reporting error (< 0.1% of sample).

3.2.10 Step 10: Missing module tech

Stage: Estimation | Dropped: 331,391 observations (21.57%) | Remaining: 1,204,816 (32.9% of full)

Module technology (Mono/Poly/Other) undeterminable from any of the three module fields.

3.2.11 Step 11: Missing required vars

Stage: Final | Dropped: 723,417 observations (60.04%) | Remaining: 481,399 (13.1% of full)

NA in any estimation variable after merges (mainly polysilicon / IRS / tax match failures).

3.2.12 Step 12: Missing vars (no poly)

Stage: Final (ext) | Dropped: 17,971 observations (1.49%) | Remaining: 1,186,845 (32.4% of full)

Extended sample: NA in any required variable except polysilicon. Full date range for tax-only IV.

4 Relative Impact of Each Filter

fl_plot <- fl[n_dropped > 0]

ggplot(fl_plot, aes(x = reorder(filter_name, n_dropped), y = n_dropped)) +
  geom_col(fill = ft_teal, width = 0.6) +
  geom_text(aes(label = comma(n_dropped)), hjust = -0.1, size = 3.3,
            colour = ft_ink, fontface = "bold") +
  coord_flip() +
  scale_y_continuous(labels = comma, expand = expansion(mult = c(0, 0.18))) +
  labs(
    title    = "Observations dropped by each filter",
    subtitle = "Ranked by absolute number of rows removed",
    x = NULL, y = NULL
  ) +
  theme_ft() +
  theme(panel.grid.major.y = element_blank())

ggplot(fl_plot, aes(x = reorder(filter_name, pct_dropped_step), y = pct_dropped_step)) +
  geom_col(fill = ft_claret, width = 0.6) +
  geom_text(aes(label = sprintf("%.1f%%", pct_dropped_step)), hjust = -0.1, size = 3.3,
            colour = ft_ink, fontface = "bold") +
  coord_flip() +
  scale_y_continuous(expand = expansion(mult = c(0, 0.18))) +
  labs(
    title    = "Share of rows dropped at each step",
    subtitle = "As percentage of rows entering that step (not of full sample)",
    x = NULL, y = NULL
  ) +
  theme_ft() +
  theme(panel.grid.major.y = element_blank())

5 External Data Merges

Three external datasets are joined to the estimation sample before the final complete-cases filter is applied.

mr_display <- mr[, .(
  Source       = source,
  `Join Key`   = join_key,
  `Match Rate` = sprintf("%.1f%%", match_pct),
  Description  = description
)]

kable(mr_display) %>%
  kable_styling(full_width = TRUE, font_size = 13) %>%
  row_spec(0, bold = TRUE)

Source	Join Key	Match Rate	Description
Polysilicon (PVinsights)	year + week	40.8%	Weekly polysilicon spot prices joined on installation year and ISO week. First instrument for 2SLS.
IRS SOI (zip x year)	zip5 + year	98.8%	Household count and AGI per household, joined on 5-digit zip and year (capped at the latest IRS year).
Sales Tax (Tax Foundation)	state + year	100.0%	State sales tax rates converted to installation-level tax/W (Bukow B2: rate x 0.55 x price). Second instrument.

Unmatched rows in the merge stage are the primary driver of the final sample being smaller than the estimation sample. They introduce NAs in required variables which are then dropped by the complete-cases filter in Step 11.

6 Geographic Coverage

st_display <- state_cts[, .(
  State = state_raw,
  N     = comma(N),
  `% of Estimation` = sprintf("%.1f%%", 100 * N / n_est),
  Group = ifelse(state_raw %in% major_st, "Major (> 20k)", "Other")
)]

kable(head(st_display, 25), caption = "Top 25 states in estimation sample") %>%
  kable_styling(full_width = FALSE, font_size = 13) %>%
  row_spec(0, bold = TRUE) %>%
  row_spec(which(head(st_display, 25)$Group == "Major (> 20k)"),
           bold = TRUE, color = ft_teal)

Top 25 states in estimation sample
State	N	% of Estimation	Group
CA	903,086	75.0%	Major (> 20k)
NY	74,046	6.1%	Major (> 20k)
AZ	64,395	5.3%	Major (> 20k)
TX	48,840	4.1%	Major (> 20k)
MA	46,597	3.9%	Major (> 20k)
NM	15,098	1.3%	Other
WI	13,801	1.1%	Other
CT	12,655	1.1%	Other
UT	7,954	0.7%	Other
WA	6,964	0.6%	Other
PA	4,630	0.4%	Other
NH	2,282	0.2%	Other
VT	1,980	0.2%	Other
MN	1,038	0.1%	Other
OR	953	0.1%	Other
FL	371	0.0%	Other
AR	64	0.0%	Other
RI	61	0.0%	Other
IL	1	0.0%	Other

Major states (> 20,000 observations each): CA, NY, AZ, TX, MA. All other states are grouped as “Other” in the state fixed-effect specification.

7 Model Results

2SLS IV estimation of the residential solar price elasticity of demand (05_run_model.R). The dependent variable is ln(system size); the reported elasticity is the coefficient on ln(price/W). Instruments: polysilicon spot price and a synthetic sales-tax term. First-stage F is the weak-instrument statistic (rule of thumb: > 10).

if (!is.null(model_res)) {
  mres <- copy(model_res)
  mres[, `:=`(
    Elasticity      = sprintf("%+.3f", Elasticity),
    N               = comma(N),
    `First-stage F` = ifelse(is.na(FirstStageF), "—", comma(round(FirstStageF)))
  )]
  kable(mres[, .(Model, Elasticity, N, `First-stage F`)], align = "lrrr") %>%
    kable_styling(full_width = FALSE, font_size = 13) %>%
    row_spec(0, bold = TRUE) %>%
    row_spec(which(model_res$Model == "IV: Preferred"), bold = TRUE, color = ft_claret) %>%
    row_spec(which(model_res$Model == "IV: Poly Only"), bold = TRUE, color = ft_teal)
} else {
  cat("Run 05_run_model.R to generate model_results.rds")
}

Model	Elasticity	N	First-stage F
OLS	-0.436	481,399	—
IV: Preferred	-0.442	481,399	1,622,303
IV: Poly Only	-0.269	481,399	356
IV: No FE	-0.492	481,399	688,215
IV: 2010-17	-0.471	270,817	715,783
IV: 2018+	-0.426	210,582	1,510,119
IV: CA	-0.441	357,830	3,184,165
IV: CA 2010-17	-0.492	195,742	1,616,687
IV: Non-CA	-0.463	123,569	103,493
IV: Tax Full	-0.292	1,186,845	11,876,373
IV: Tax Pre-21	-0.423	596,514	3,994,974
IV: Tax Post-20	-0.215	590,331	11,023,836
IV: Tax to-2022	-0.333	920,811	7,912,747
OLS Extended	-0.311	1,186,845	—
IV: Rate (preferred)	+0.759	481,399	214
IV: Rate (tax only)	+1.558	1,186,845	1,745

Bukow (2022) baseline: −0.443 (California-dominated, 2010–2017).

The forest plot makes the spread visible at a glance — every estimate as a point with its 95% CI, so re-running after a pipeline change shows immediately how the whole set shifts.

if (!is.null(model_res)) {
  mp <- copy(model_res)
  mp[, Model := factor(Model, levels = rev(Model))]   # keep script order, top-to-bottom
  ggplot(mp, aes(x = Elasticity, y = Model)) +
    geom_vline(xintercept = 0, colour = ft_rule, linewidth = 0.5) +
    geom_vline(xintercept = -0.443, linetype = "dashed", colour = ft_ink_lt, linewidth = 0.5) +
    geom_errorbarh(aes(xmin = CI_low, xmax = CI_high), height = 0.3, colour = ft_ink_lt) +
    geom_point(colour = ft_claret, size = 2.4) +
    coord_cartesian(xlim = c(-0.9, 1.9)) +
    labs(
      title    = "Price elasticity across specifications",
      subtitle = "Point estimate with 95% CI (clustered by state). Dashed line = Bukow −0.443; solid = 0.",
      x = "Elasticity (coefficient on ln price/W)", y = NULL
    ) +
    theme_ft()
}

Estimates left of the dashed line are more elastic than Bukow; the two rate-instrument models sit implausibly to the right of zero (positive), flagging that instrument’s failure.

if (!is.null(state_el)) {
  ggplot(state_el, aes(x = Elasticity, y = reorder(State, Elasticity))) +
    geom_vline(xintercept = -0.443, linetype = "dashed", colour = ft_ink_lt, linewidth = 0.5) +
    geom_errorbarh(aes(xmin = CI_low, xmax = CI_high), height = 0.22, colour = ft_ink_lt) +
    geom_point(colour = ft_teal, size = 3) +
    labs(
      title    = "Price elasticity by state (preferred IV)",
      subtitle = "Point + 95% CI (clustered by year). Dashed line = Bukow −0.443.",
      x = "Elasticity (coefficient on ln price/W)", y = NULL
    ) +
    theme_ft()
}

7.1 Key insights

The preferred IV reproduces Bukow almost exactly (−0.442 vs −0.443) — but read with care (next point).
The “preferred” estimate is close to OLS by construction. OLS is −0.436 and the preferred IV is −0.442 because the synthetic tax instrument is built from price (tax/W = rate × 0.55 × price), so its first-stage F is in the millions and 2SLS collapses back toward OLS. A first-stage F of 10⁶ is a red flag, not a strength.
Polysilicon is the only credible instrument. The polysilicon-only IV (highlighted teal) gives −0.269 with a sensible F ≈ 356 — a genuine global cost-shifter moving the estimate away from OLS. This implies demand is more inelastic than the preferred specification suggests.
The exogenous rate-based tax instrument fails too. Instrumenting with the tax wedge (rate only, no price) yields implausible positive elasticities (+0.76 to +1.56): state tax-rate variation is confounded with other state-level demand drivers and is too weak after fixed effects. So the tax instrument is unusable in both forms — lean on polysilicon.
External validity holds. California (−0.441) and non-California (−0.463) elasticities are close, and every subsample sits in a tight −0.42 to −0.49 band.
The post-2020 extension is weakly identified. With sales-tax rates frozen at 2020, the tax-only post-2020 model has no exogenous within-state variation; its estimate (−0.215) should be treated as indicative only.

8 Reproducibility

This report reads pipeline_meta.rds generated by 04_clean_tts.R. To regenerate:

source("scripts/04_clean_tts.R")
rmarkdown::render("06_pipeline_report.Rmd")

Session info:

sessionInfo()

## R version 4.5.2 (2025-10-31)
## Platform: aarch64-apple-darwin20
## Running under: macOS Tahoe 26.3
## 
## Matrix products: default
## BLAS:   /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib 
## LAPACK: /Library/Frameworks/R.framework/Versions/4.5-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.12.1
## 
## locale:
## [1] en_GB/en_GB/en_GB/C/en_GB/en_GB
## 
## time zone: Europe/London
## tzcode source: internal
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] kableExtra_1.4.0    knitr_1.51          scales_1.4.0       
##  [4] lubridate_1.9.5     forcats_1.0.1       stringr_1.6.0      
##  [7] dplyr_1.2.0         purrr_1.2.1         readr_2.2.0        
## [10] tidyr_1.3.2         tibble_3.3.1        ggplot2_4.0.2      
## [13] tidyverse_2.0.0     data.table_1.18.2.1
## 
## loaded via a namespace (and not attached):
##  [1] rappdirs_0.3.4     sass_0.4.10        generics_0.1.4     xml2_1.5.2        
##  [5] stringi_1.8.7      hms_1.1.4          digest_0.6.39      magrittr_2.0.4    
##  [9] evaluate_1.0.5     grid_4.5.2         timechange_0.4.0   RColorBrewer_1.1-3
## [13] fastmap_1.2.0      rprojroot_2.1.1    jsonlite_2.0.0     viridisLite_0.4.3 
## [17] textshaping_1.0.4  jquerylib_0.1.4    cli_3.6.5          rlang_1.1.7       
## [21] withr_3.0.2        cachem_1.1.0       yaml_2.3.12        otel_0.2.0        
## [25] tools_4.5.2        tzdb_0.5.0         pandoc_0.2.0       here_1.0.2        
## [29] vctrs_0.7.1        R6_2.6.1           lifecycle_1.0.5    fs_1.6.6          
## [33] pkgconfig_2.0.3    pillar_1.11.1      bslib_0.10.0       gtable_0.3.6      
## [37] glue_1.8.0         systemfonts_1.3.1  xfun_0.56          tidyselect_1.2.1  
## [41] rstudioapi_0.18.0  farver_2.1.2       htmltools_0.5.9    labeling_0.4.3    
## [45] rmarkdown_2.30     svglite_2.2.2      compiler_4.5.2     S7_0.2.1

TTS Data Pipeline - Filter & Sample Report

Tracking the Sun 2025 - Demand Estimation for Solar PV

Auto-generated by 04_clean_tts.R

24 June 2026