1 Overview

This report documents every data selection step applied to the Tracking the Sun (TTS) 2025 public data file from Lawrence Berkeley National Laboratory. The pipeline follows the sample-selection methodology of Bukow (2022) and produces three nested subsets:

Sample	Rows	% of Full
Full TTS	3,664,197	100%
Price sample	1,282,013	35.0%
Estimation sample	1,075,769	29.4%
Final sample	396,188	10.8%

Pipeline run: 28 March 2026, 14:36.

2 Sample Flow: Waterfall Chart

The chart below shows how each filter progressively reduces the dataset from the full TTS sample to the final estimation-ready sample.

# Build waterfall data
wf <- fl[, .(step, filter_name, n_dropped)]
wf <- rbindlist(list(
  data.table(step = 0L, filter_name = "Full TTS", n_dropped = 0L),
  wf
))
wf[, running_total := n_full - cumsum(n_dropped)]
wf[, filter_name := factor(filter_name, levels = filter_name)]

ggplot(wf, aes(x = filter_name, y = running_total)) +
  geom_col(fill = ifelse(wf$step == 0, "#2c7fb8", "#d95f02"), width = 0.7) +
  geom_text(aes(label = comma(running_total)),
            vjust = -0.4, size = 3.2, colour = "grey30") +
  scale_y_continuous(labels = comma, expand = expansion(mult = c(0, 0.12))) +
  labs(
    title = "Sample Size After Each Filter",
    x = NULL, y = "Observations remaining"
  ) +
  theme_minimal(base_size = 12) +
  theme(axis.text.x = element_text(angle = 40, hjust = 1))

3 Filter-by-Filter Breakdown

3.1 Summary Table

display <- fl[, .(
  Step          = step,
  Stage         = stage,
  Filter        = filter_name,
  `Before`      = comma(n_before),
  `Dropped`     = comma(n_dropped),
  `After`       = comma(n_after),
  `% Dropped`   = sprintf("%.2f%%", pct_dropped_step),
  `% of Full`   = sprintf("%.1f%%", pct_of_full)
)]

kable(display, align = "clcrrrrr") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"),
                full_width = FALSE, font_size = 13) %>%
  row_spec(which(fl$stage == "Price"), background = "#fff3e0") %>%
  row_spec(which(fl$stage == "Estimation"), background = "#e3f2fd") %>%
  row_spec(which(fl$stage == "Final"), background = "#e8f5e9")

Step	Stage	Filter	Before	Dropped	After	% Dropped	% of Full
1	Price	Missing price	3,664,197	0	3,664,197	0.00%	100.0%
2	Price	Price outliers	3,664,197	1,041,637	2,622,560	28.43%	71.6%
3	Price	Third-party owned	2,622,560	1,079,344	1,543,216	41.16%	42.1%
4	Price	Self-installed	1,543,216	128,983	1,414,233	8.36%	38.6%
5	Price	Battery storage	1,414,233	132,220	1,282,013	9.35%	35.0%
6	Estimation	Non-residential	1,282,013	32,014	1,249,999	2.50%	34.1%
7	Estimation	Size outliers	1,249,999	16,663	1,233,336	1.33%	33.7%
8	Estimation	Pre-2010 installations	1,233,336	35,842	1,197,494	2.91%	32.7%
9	Estimation	Negative efficiency	1,197,494	121,725	1,075,769	10.16%	29.4%
10	Estimation	Missing module tech	1,075,769	0	1,075,769	0.00%	29.4%
11	Final	Missing required vars	1,075,769	679,581	396,188	63.17%	10.8%

Colour key: Price filters · Estimation filters · Final sample

3.2 Detailed Descriptions

for (i in seq_len(nrow(fl))) {
  row <- fl[i]
  cat(sprintf(
    "### Step %d: %s\n\n**Stage:** %s \n**Dropped:** %s observations (%.2f%% of rows entering this step) \n**Remaining:** %s (%.1f%% of full sample)\n\n%s\n\n---\n\n",
    row$step, row$filter_name, row$stage,
    comma(row$n_dropped), row$pct_dropped_step,
    comma(row$n_after), row$pct_of_full,
    row$description
  ))
}

3.2.1 Step 1: Missing price

Stage: Price Dropped: 0 observations (0.00% of rows entering this step) Remaining: 3,664,197 (100.0% of full sample)

Rows where price per watt could not be computed (missing size or cost).

3.2.2 Step 2: Price outliers

Stage: Price Dropped: 1,041,637 observations (28.43% of rows entering this step) Remaining: 2,622,560 (71.6% of full sample)

Price per watt outside the plausible $1-$10/W range (likely data errors or non-standard systems).

3.2.3 Step 3: Third-party owned

Stage: Price Dropped: 1,079,344 observations (41.16% of rows entering this step) Remaining: 1,543,216 (42.1% of full sample)

TPO systems (leases, PPAs) excluded because their reported prices do not reflect true installed cost to the homeowner.

3.2.4 Step 4: Self-installed

Stage: Price Dropped: 128,983 observations (8.36% of rows entering this step) Remaining: 1,414,233 (38.6% of full sample)

DIY/self-installed systems excluded because they lack standard installer labour costs.

3.2.5 Step 5: Battery storage

Stage: Price Dropped: 132,220 observations (9.35% of rows entering this step) Remaining: 1,282,013 (35.0% of full sample)

PV+Storage and storage-only systems excluded so prices reflect PV-only installed cost.

3.2.6 Step 6: Non-residential

Stage: Estimation Dropped: 32,014 observations (2.50% of rows entering this step) Remaining: 1,249,999 (34.1% of full sample)

Non-residential systems (commercial, agricultural, government, etc.) excluded to focus on the residential segment.

3.2.7 Step 7: Size outliers

Stage: Estimation Dropped: 16,663 observations (1.33% of rows entering this step) Remaining: 1,233,336 (33.7% of full sample)

Systems outside 1 W - 20,000 W (20 kW) range, which are implausible for residential installations.

3.2.8 Step 8: Pre-2010 installations

Stage: Estimation Dropped: 35,842 observations (2.91% of rows entering this step) Remaining: 1,197,494 (32.7% of full sample)

Installations before 2010 dropped due to limited data coverage and unavailability of weekly polysilicon price data.

3.2.9 Step 9: Negative efficiency

Stage: Estimation Dropped: 121,725 observations (10.16% of rows entering this step) Remaining: 1,075,769 (29.4% of full sample)

Observations with negative module efficiency, which are likely reporting errors (< 0.1% of sample).

3.2.10 Step 10: Missing module tech

Stage: Estimation Dropped: 0 observations (0.00% of rows entering this step) Remaining: 1,075,769 (29.4% of full sample)

Observations where module technology (Mono/Poly/Other) could not be determined from any of the three module fields.

3.2.11 Step 11: Missing required vars

Stage: Final Dropped: 679,581 observations (63.17% of rows entering this step) Remaining: 396,188 (10.8% of full sample)

Rows with NA in any variable required for estimation (after merges — mainly driven by polysilicon or IRS match failures).

4 Relative Impact of Each Filter

fl_plot <- fl[n_dropped > 0]

ggplot(fl_plot, aes(x = reorder(filter_name, n_dropped), y = n_dropped)) +
  geom_col(fill = "#7570b3", width = 0.6) +
  geom_text(aes(label = comma(n_dropped)), hjust = -0.1, size = 3.2) +
  coord_flip() +
  scale_y_continuous(labels = comma, expand = expansion(mult = c(0, 0.15))) +
  labs(
    title = "Observations Dropped by Each Filter",
    subtitle = "Ranked by absolute impact",
    x = NULL, y = "Rows dropped"
  ) +
  theme_minimal(base_size = 12)

ggplot(fl_plot, aes(x = reorder(filter_name, pct_dropped_step), y = pct_dropped_step)) +
  geom_col(fill = "#e7298a", width = 0.6) +
  geom_text(aes(label = sprintf("%.1f%%", pct_dropped_step)), hjust = -0.1, size = 3.2) +
  coord_flip() +
  scale_y_continuous(expand = expansion(mult = c(0, 0.15))) +
  labs(
    title = "Percentage of Rows Dropped at Each Step",
    subtitle = "As % of rows entering that step (not of full sample)",
    x = NULL, y = "% dropped"
  ) +
  theme_minimal(base_size = 12)

5 External Data Merges

Three external datasets are joined to the estimation sample before the final complete-cases filter is applied.

mr_display <- mr[, .(
  Source       = source,
  `Join Key`   = join_key,
  `Match Rate` = sprintf("%.1f%%", match_pct),
  Description  = description
)]

kable(mr_display) %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"),
                full_width = TRUE, font_size = 13)

Source	Join Key	Match Rate	Description
Polysilicon (PVinsights)	year + week	40.1%	Weekly polysilicon spot prices joined on installation year and ISO week.
IRS SOI (zip x year)	zip5 + year	95.7%	Household count and AGI per household from IRS Statistics of Income, joined on 5-digit zip code and year (capped at max available IRS year).
EIA Electricity (state x year)	state + year	94.6%	Average retail electricity price by state and year from EIA.

Unmatched rows in the merge stage are the primary driver of the final sample being smaller than the estimation sample — they introduce NAs in required variables which are then dropped by the complete-cases filter in Step 11.

6 Geographic Coverage

st_display <- state_cts[, .(
  State = state_raw,
  N     = comma(N),
  `% of Estimation` = sprintf("%.1f%%", 100 * N / n_est),
  Group = ifelse(state_raw %in% major_st, "Major (> 20k)", "Other")
)]

kable(head(st_display, 25), caption = "Top 25 states in estimation sample") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"),
                full_width = FALSE, font_size = 13)

Top 25 states in estimation sample
State	N	% of Estimation	Group
CA	889,690	82.7%	Major (> 20k)
NY	71,910	6.7%	Major (> 20k)
AZ	64,365	6.0%	Major (> 20k)
MA	26,177	2.4%	Major (> 20k)
NM	15,037	1.4%	Other
TX	5,888	0.5%	Other
NH	2,272	0.2%	Other
FL	360	0.0%	Other
AR	64	0.0%	Other
OR	6	0.0%	Other

Major states (> 20,000 observations each): CA, NY, AZ, MA. All other states are grouped as “Other” in the state fixed-effect specification.

7 Reproducibility

This report reads pipeline_meta.rds generated by 01_clean_tts.R. To regenerate:

source("01_clean_tts.R")                       # produces data/processed/pipeline_meta.rds
rmarkdown::render("02_pipeline_report.Rmd")     # produces this report

Session info:

sessionInfo()

## R version 4.5.2 (2025-10-31)
## Platform: aarch64-apple-darwin20
## Running under: macOS Tahoe 26.3
## 
## Matrix products: default
## BLAS:   /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib 
## LAPACK: /Library/Frameworks/R.framework/Versions/4.5-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.12.1
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## time zone: Europe/London
## tzcode source: internal
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] kableExtra_1.4.0    knitr_1.51          scales_1.4.0       
##  [4] lubridate_1.9.5     forcats_1.0.1       stringr_1.6.0      
##  [7] dplyr_1.2.0         purrr_1.2.1         readr_2.2.0        
## [10] tidyr_1.3.2         tibble_3.3.1        ggplot2_4.0.2      
## [13] tidyverse_2.0.0     data.table_1.18.2.1
## 
## loaded via a namespace (and not attached):
##  [1] gtable_0.3.6       jsonlite_2.0.0     compiler_4.5.2     tidyselect_1.2.1  
##  [5] xml2_1.5.2         jquerylib_0.1.4    textshaping_1.0.4  systemfonts_1.3.1 
##  [9] yaml_2.3.12        fastmap_1.2.0      R6_2.6.1           labeling_0.4.3    
## [13] generics_0.1.4     svglite_2.2.2      bslib_0.10.0       pillar_1.11.1     
## [17] RColorBrewer_1.1-3 tzdb_0.5.0         rlang_1.1.7        stringi_1.8.7     
## [21] cachem_1.1.0       xfun_0.56          sass_0.4.10        S7_0.2.1          
## [25] otel_0.2.0         viridisLite_0.4.3  timechange_0.4.0   cli_3.6.5         
## [29] withr_3.0.2        magrittr_2.0.4     digest_0.6.39      grid_4.5.2        
## [33] rstudioapi_0.18.0  hms_1.1.4          lifecycle_1.0.5    vctrs_0.7.1       
## [37] evaluate_1.0.5     glue_1.8.0         farver_2.1.2       rmarkdown_2.30    
## [41] tools_4.5.2        pkgconfig_2.0.3    htmltools_0.5.9

TTS Data Pipeline — Filter & Sample Report

Tracking the Sun 2025 · Demand Estimation for Solar PV

Auto-generated by 01_clean_tts.R

28 March 2026