This report documents every data selection step applied to the Tracking the Sun (TTS) 2025 public data file from Lawrence Berkeley National Laboratory. The pipeline follows the sample-selection methodology of Bukow (2022) and produces three nested subsets:
| Sample | Rows | % of Full |
|---|---|---|
| Full TTS | 3,664,197 | 100% |
| Price sample | 1,282,013 | 35.0% |
| Estimation sample | 1,075,769 | 29.4% |
| Final sample | 396,188 | 10.8% |
Pipeline run: 28 March 2026, 14:36.
The chart below shows how each filter progressively reduces the dataset from the full TTS sample to the final estimation-ready sample.
# Build waterfall data
wf <- fl[, .(step, filter_name, n_dropped)]
wf <- rbindlist(list(
data.table(step = 0L, filter_name = "Full TTS", n_dropped = 0L),
wf
))
wf[, running_total := n_full - cumsum(n_dropped)]
wf[, filter_name := factor(filter_name, levels = filter_name)]
ggplot(wf, aes(x = filter_name, y = running_total)) +
geom_col(fill = ifelse(wf$step == 0, "#2c7fb8", "#d95f02"), width = 0.7) +
geom_text(aes(label = comma(running_total)),
vjust = -0.4, size = 3.2, colour = "grey30") +
scale_y_continuous(labels = comma, expand = expansion(mult = c(0, 0.12))) +
labs(
title = "Sample Size After Each Filter",
x = NULL, y = "Observations remaining"
) +
theme_minimal(base_size = 12) +
theme(axis.text.x = element_text(angle = 40, hjust = 1))
display <- fl[, .(
Step = step,
Stage = stage,
Filter = filter_name,
`Before` = comma(n_before),
`Dropped` = comma(n_dropped),
`After` = comma(n_after),
`% Dropped` = sprintf("%.2f%%", pct_dropped_step),
`% of Full` = sprintf("%.1f%%", pct_of_full)
)]
kable(display, align = "clcrrrrr") %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"),
full_width = FALSE, font_size = 13) %>%
row_spec(which(fl$stage == "Price"), background = "#fff3e0") %>%
row_spec(which(fl$stage == "Estimation"), background = "#e3f2fd") %>%
row_spec(which(fl$stage == "Final"), background = "#e8f5e9")
| Step | Stage | Filter | Before | Dropped | After | % Dropped | % of Full |
|---|---|---|---|---|---|---|---|
| 1 | Price | Missing price | 3,664,197 | 0 | 3,664,197 | 0.00% | 100.0% |
| 2 | Price | Price outliers | 3,664,197 | 1,041,637 | 2,622,560 | 28.43% | 71.6% |
| 3 | Price | Third-party owned | 2,622,560 | 1,079,344 | 1,543,216 | 41.16% | 42.1% |
| 4 | Price | Self-installed | 1,543,216 | 128,983 | 1,414,233 | 8.36% | 38.6% |
| 5 | Price | Battery storage | 1,414,233 | 132,220 | 1,282,013 | 9.35% | 35.0% |
| 6 | Estimation | Non-residential | 1,282,013 | 32,014 | 1,249,999 | 2.50% | 34.1% |
| 7 | Estimation | Size outliers | 1,249,999 | 16,663 | 1,233,336 | 1.33% | 33.7% |
| 8 | Estimation | Pre-2010 installations | 1,233,336 | 35,842 | 1,197,494 | 2.91% | 32.7% |
| 9 | Estimation | Negative efficiency | 1,197,494 | 121,725 | 1,075,769 | 10.16% | 29.4% |
| 10 | Estimation | Missing module tech | 1,075,769 | 0 | 1,075,769 | 0.00% | 29.4% |
| 11 | Final | Missing required vars | 1,075,769 | 679,581 | 396,188 | 63.17% | 10.8% |
Colour key: Price filters · Estimation filters · Final sample
for (i in seq_len(nrow(fl))) {
row <- fl[i]
cat(sprintf(
"### Step %d: %s\n\n**Stage:** %s \n**Dropped:** %s observations (%.2f%% of rows entering this step) \n**Remaining:** %s (%.1f%% of full sample)\n\n%s\n\n---\n\n",
row$step, row$filter_name, row$stage,
comma(row$n_dropped), row$pct_dropped_step,
comma(row$n_after), row$pct_of_full,
row$description
))
}
Stage: Price Dropped: 0 observations (0.00% of rows entering this step) Remaining: 3,664,197 (100.0% of full sample)
Rows where price per watt could not be computed (missing size or cost).
Stage: Price Dropped: 1,041,637 observations (28.43% of rows entering this step) Remaining: 2,622,560 (71.6% of full sample)
Price per watt outside the plausible $1-$10/W range (likely data errors or non-standard systems).
Stage: Price Dropped: 1,079,344 observations (41.16% of rows entering this step) Remaining: 1,543,216 (42.1% of full sample)
TPO systems (leases, PPAs) excluded because their reported prices do not reflect true installed cost to the homeowner.
Stage: Price Dropped: 128,983 observations (8.36% of rows entering this step) Remaining: 1,414,233 (38.6% of full sample)
DIY/self-installed systems excluded because they lack standard installer labour costs.
Stage: Price Dropped: 132,220 observations (9.35% of rows entering this step) Remaining: 1,282,013 (35.0% of full sample)
PV+Storage and storage-only systems excluded so prices reflect PV-only installed cost.
Stage: Estimation Dropped: 32,014 observations (2.50% of rows entering this step) Remaining: 1,249,999 (34.1% of full sample)
Non-residential systems (commercial, agricultural, government, etc.) excluded to focus on the residential segment.
Stage: Estimation Dropped: 16,663 observations (1.33% of rows entering this step) Remaining: 1,233,336 (33.7% of full sample)
Systems outside 1 W - 20,000 W (20 kW) range, which are implausible for residential installations.
Stage: Estimation Dropped: 35,842 observations (2.91% of rows entering this step) Remaining: 1,197,494 (32.7% of full sample)
Installations before 2010 dropped due to limited data coverage and unavailability of weekly polysilicon price data.
Stage: Estimation Dropped: 121,725 observations (10.16% of rows entering this step) Remaining: 1,075,769 (29.4% of full sample)
Observations with negative module efficiency, which are likely reporting errors (< 0.1% of sample).
Stage: Estimation Dropped: 0 observations (0.00% of rows entering this step) Remaining: 1,075,769 (29.4% of full sample)
Observations where module technology (Mono/Poly/Other) could not be determined from any of the three module fields.
Stage: Final Dropped: 679,581 observations (63.17% of rows entering this step) Remaining: 396,188 (10.8% of full sample)
Rows with NA in any variable required for estimation (after merges — mainly driven by polysilicon or IRS match failures).
fl_plot <- fl[n_dropped > 0]
ggplot(fl_plot, aes(x = reorder(filter_name, n_dropped), y = n_dropped)) +
geom_col(fill = "#7570b3", width = 0.6) +
geom_text(aes(label = comma(n_dropped)), hjust = -0.1, size = 3.2) +
coord_flip() +
scale_y_continuous(labels = comma, expand = expansion(mult = c(0, 0.15))) +
labs(
title = "Observations Dropped by Each Filter",
subtitle = "Ranked by absolute impact",
x = NULL, y = "Rows dropped"
) +
theme_minimal(base_size = 12)
ggplot(fl_plot, aes(x = reorder(filter_name, pct_dropped_step), y = pct_dropped_step)) +
geom_col(fill = "#e7298a", width = 0.6) +
geom_text(aes(label = sprintf("%.1f%%", pct_dropped_step)), hjust = -0.1, size = 3.2) +
coord_flip() +
scale_y_continuous(expand = expansion(mult = c(0, 0.15))) +
labs(
title = "Percentage of Rows Dropped at Each Step",
subtitle = "As % of rows entering that step (not of full sample)",
x = NULL, y = "% dropped"
) +
theme_minimal(base_size = 12)
Three external datasets are joined to the estimation sample before the final complete-cases filter is applied.
mr_display <- mr[, .(
Source = source,
`Join Key` = join_key,
`Match Rate` = sprintf("%.1f%%", match_pct),
Description = description
)]
kable(mr_display) %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"),
full_width = TRUE, font_size = 13)
| Source | Join Key | Match Rate | Description |
|---|---|---|---|
| Polysilicon (PVinsights) | year + week | 40.1% | Weekly polysilicon spot prices joined on installation year and ISO week. |
| IRS SOI (zip x year) | zip5 + year | 95.7% | Household count and AGI per household from IRS Statistics of Income, joined on 5-digit zip code and year (capped at max available IRS year). |
| EIA Electricity (state x year) | state + year | 94.6% | Average retail electricity price by state and year from EIA. |
Unmatched rows in the merge stage are the primary driver of the final sample being smaller than the estimation sample — they introduce NAs in required variables which are then dropped by the complete-cases filter in Step 11.
st_display <- state_cts[, .(
State = state_raw,
N = comma(N),
`% of Estimation` = sprintf("%.1f%%", 100 * N / n_est),
Group = ifelse(state_raw %in% major_st, "Major (> 20k)", "Other")
)]
kable(head(st_display, 25), caption = "Top 25 states in estimation sample") %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"),
full_width = FALSE, font_size = 13)
| State | N | % of Estimation | Group |
|---|---|---|---|
| CA | 889,690 | 82.7% | Major (> 20k) |
| NY | 71,910 | 6.7% | Major (> 20k) |
| AZ | 64,365 | 6.0% | Major (> 20k) |
| MA | 26,177 | 2.4% | Major (> 20k) |
| NM | 15,037 | 1.4% | Other |
| TX | 5,888 | 0.5% | Other |
| NH | 2,272 | 0.2% | Other |
| FL | 360 | 0.0% | Other |
| AR | 64 | 0.0% | Other |
| OR | 6 | 0.0% | Other |
Major states (> 20,000 observations each): CA, NY, AZ, MA. All other states are grouped as “Other” in the state fixed-effect specification.
This report reads pipeline_meta.rds generated by
01_clean_tts.R. To regenerate:
source("01_clean_tts.R") # produces data/processed/pipeline_meta.rds
rmarkdown::render("02_pipeline_report.Rmd") # produces this report
Session info:
sessionInfo()
## R version 4.5.2 (2025-10-31)
## Platform: aarch64-apple-darwin20
## Running under: macOS Tahoe 26.3
##
## Matrix products: default
## BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.5-arm64/Resources/lib/libRlapack.dylib; LAPACK version 3.12.1
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## time zone: Europe/London
## tzcode source: internal
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] kableExtra_1.4.0 knitr_1.51 scales_1.4.0
## [4] lubridate_1.9.5 forcats_1.0.1 stringr_1.6.0
## [7] dplyr_1.2.0 purrr_1.2.1 readr_2.2.0
## [10] tidyr_1.3.2 tibble_3.3.1 ggplot2_4.0.2
## [13] tidyverse_2.0.0 data.table_1.18.2.1
##
## loaded via a namespace (and not attached):
## [1] gtable_0.3.6 jsonlite_2.0.0 compiler_4.5.2 tidyselect_1.2.1
## [5] xml2_1.5.2 jquerylib_0.1.4 textshaping_1.0.4 systemfonts_1.3.1
## [9] yaml_2.3.12 fastmap_1.2.0 R6_2.6.1 labeling_0.4.3
## [13] generics_0.1.4 svglite_2.2.2 bslib_0.10.0 pillar_1.11.1
## [17] RColorBrewer_1.1-3 tzdb_0.5.0 rlang_1.1.7 stringi_1.8.7
## [21] cachem_1.1.0 xfun_0.56 sass_0.4.10 S7_0.2.1
## [25] otel_0.2.0 viridisLite_0.4.3 timechange_0.4.0 cli_3.6.5
## [29] withr_3.0.2 magrittr_2.0.4 digest_0.6.39 grid_4.5.2
## [33] rstudioapi_0.18.0 hms_1.1.4 lifecycle_1.0.5 vctrs_0.7.1
## [37] evaluate_1.0.5 glue_1.8.0 farver_2.1.2 rmarkdown_2.30
## [41] tools_4.5.2 pkgconfig_2.0.3 htmltools_0.5.9