1 Overview

This report documents every data selection step applied to the Tracking the Sun (TTS) 2025 public data file from Lawrence Berkeley National Laboratory. The pipeline follows the sample-selection methodology of Bukow (2022) and produces three nested subsets:

Sample Rows % of Full
Full TTS 3,664,197 100%
Price sample 1,282,013 35.0%
Estimation sample 1,075,769 29.4%
Final sample 396,188 10.8%

Pipeline run: 28 March 2026, 14:36.


2 Sample Flow: Waterfall Chart

The chart below shows how each filter progressively reduces the dataset from the full TTS sample to the final estimation-ready sample.

# Build waterfall data
wf <- fl[, .(step, filter_name, n_dropped)]
wf <- rbindlist(list(
  data.table(step = 0L, filter_name = "Full TTS", n_dropped = 0L),
  wf
))
wf[, running_total := n_full - cumsum(n_dropped)]
wf[, filter_name := factor(filter_name, levels = filter_name)]

ggplot(wf, aes(x = filter_name, y = running_total)) +
  geom_col(fill = ifelse(wf$step == 0, "#2c7fb8", "#d95f02"), width = 0.7) +
  geom_text(aes(label = comma(running_total)),
            vjust = -0.4, size = 3.2, colour = "grey30") +
  scale_y_continuous(labels = comma, expand = expansion(mult = c(0, 0.12))) +
  labs(
    title = "Sample Size After Each Filter",
    x = NULL, y = "Observations remaining"
  ) +
  theme_minimal(base_size = 12) +
  theme(axis.text.x = element_text(angle = 40, hjust = 1))


3 Filter-by-Filter Breakdown

3.1 Summary Table

display <- fl[, .(
  Step          = step,
  Stage         = stage,
  Filter        = filter_name,
  `Before`      = comma(n_before),
  `Dropped`     = comma(n_dropped),
  `After`       = comma(n_after),
  `% Dropped`   = sprintf("%.2f%%", pct_dropped_step),
  `% of Full`   = sprintf("%.1f%%", pct_of_full)
)]

kable(display, align = "clcrrrrr") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"),
                full_width = FALSE, font_size = 13) %>%
  row_spec(which(fl$stage == "Price"), background = "#fff3e0") %>%
  row_spec(which(fl$stage == "Estimation"), background = "#e3f2fd") %>%
  row_spec(which(fl$stage == "Final"), background = "#e8f5e9")
Step Stage Filter Before Dropped After % Dropped % of Full
1 Price Missing price 3,664,197 0 3,664,197 0.00% 100.0%
2 Price Price outliers 3,664,197 1,041,637 2,622,560 28.43% 71.6%
3 Price Third-party owned 2,622,560 1,079,344 1,543,216 41.16% 42.1%
4 Price Self-installed 1,543,216 128,983 1,414,233 8.36% 38.6%
5 Price Battery storage 1,414,233 132,220 1,282,013 9.35% 35.0%
6 Estimation Non-residential 1,282,013 32,014 1,249,999 2.50% 34.1%
7 Estimation Size outliers 1,249,999 16,663 1,233,336 1.33% 33.7%
8 Estimation Pre-2010 installations 1,233,336 35,842 1,197,494 2.91% 32.7%
9 Estimation Negative efficiency 1,197,494 121,725 1,075,769 10.16% 29.4%
10 Estimation Missing module tech 1,075,769 0 1,075,769 0.00% 29.4%
11 Final Missing required vars 1,075,769 679,581 396,188 63.17% 10.8%

Colour key: Price filters · Estimation filters · Final sample

3.2 Detailed Descriptions

for (i in seq_len(nrow(fl))) {
  row <- fl[i]
  cat(sprintf(
    "### Step %d: %s\n\n**Stage:** %s \n**Dropped:** %s observations (%.2f%% of rows entering this step) \n**Remaining:** %s (%.1f%% of full sample)\n\n%s\n\n---\n\n",
    row$step, row$filter_name, row$stage,
    comma(row$n_dropped), row$pct_dropped_step,
    comma(row$n_after), row$pct_of_full,
    row$description
  ))
}

3.2.1 Step 1: Missing price

Stage: Price Dropped: 0 observations (0.00% of rows entering this step) Remaining: 3,664,197 (100.0% of full sample)

Rows where price per watt could not be computed (missing size or cost).


3.2.2 Step 2: Price outliers

Stage: Price Dropped: 1,041,637 observations (28.43% of rows entering this step) Remaining: 2,622,560 (71.6% of full sample)

Price per watt outside the plausible $1-$10/W range (likely data errors or non-standard systems).


3.2.3 Step 3: Third-party owned

Stage: Price Dropped: 1,079,344 observations (41.16% of rows entering this step) Remaining: 1,543,216 (42.1% of full sample)

TPO systems (leases, PPAs) excluded because their reported prices do not reflect true installed cost to the homeowner.


3.2.4 Step 4: Self-installed

Stage: Price Dropped: 128,983 observations (8.36% of rows entering this step) Remaining: 1,414,233 (38.6% of full sample)

DIY/self-installed systems excluded because they lack standard installer labour costs.


3.2.5 Step 5: Battery storage

Stage: Price Dropped: 132,220 observations (9.35% of rows entering this step) Remaining: 1,282,013 (35.0% of full sample)

PV+Storage and storage-only systems excluded so prices reflect PV-only installed cost.


3.2.6 Step 6: Non-residential

Stage: Estimation Dropped: 32,014 observations (2.50% of rows entering this step) Remaining: 1,249,999 (34.1% of full sample)

Non-residential systems (commercial, agricultural, government, etc.) excluded to focus on the residential segment.


3.2.7 Step 7: Size outliers

Stage: Estimation Dropped: 16,663 observations (1.33% of rows entering this step) Remaining: 1,233,336 (33.7% of full sample)

Systems outside 1 W - 20,000 W (20 kW) range, which are implausible for residential installations.


3.2.8 Step 8: Pre-2010 installations

Stage: Estimation Dropped: 35,842 observations (2.91% of rows entering this step) Remaining: 1,197,494 (32.7% of full sample)

Installations before 2010 dropped due to limited data coverage and unavailability of weekly polysilicon price data.


3.2.9 Step 9: Negative efficiency

Stage: Estimation Dropped: 121,725 observations (10.16% of rows entering this step) Remaining: 1,075,769 (29.4% of full sample)

Observations with negative module efficiency, which are likely reporting errors (< 0.1% of sample).


3.2.10 Step 10: Missing module tech

Stage: Estimation Dropped: 0 observations (0.00% of rows entering this step) Remaining: 1,075,769 (29.4% of full sample)

Observations where module technology (Mono/Poly/Other) could not be determined from any of the three module fields.


3.2.11 Step 11: Missing required vars

Stage: Final Dropped: 679,581 observations (63.17% of rows entering this step) Remaining: 396,188 (10.8% of full sample)

Rows with NA in any variable required for estimation (after merges — mainly driven by polysilicon or IRS match failures).



4 Relative Impact of Each Filter

fl_plot <- fl[n_dropped > 0]

ggplot(fl_plot, aes(x = reorder(filter_name, n_dropped), y = n_dropped)) +
  geom_col(fill = "#7570b3", width = 0.6) +
  geom_text(aes(label = comma(n_dropped)), hjust = -0.1, size = 3.2) +
  coord_flip() +
  scale_y_continuous(labels = comma, expand = expansion(mult = c(0, 0.15))) +
  labs(
    title = "Observations Dropped by Each Filter",
    subtitle = "Ranked by absolute impact",
    x = NULL, y = "Rows dropped"
  ) +
  theme_minimal(base_size = 12)

ggplot(fl_plot, aes(x = reorder(filter_name, pct_dropped_step), y = pct_dropped_step)) +
  geom_col(fill = "#e7298a", width = 0.6) +
  geom_text(aes(label = sprintf("%.1f%%", pct_dropped_step)), hjust = -0.1, size = 3.2) +
  coord_flip() +
  scale_y_continuous(expand = expansion(mult = c(0, 0.15))) +
  labs(
    title = "Percentage of Rows Dropped at Each Step",
    subtitle = "As % of rows entering that step (not of full sample)",
    x = NULL, y = "% dropped"
  ) +
  theme_minimal(base_size = 12)


5 External Data Merges

Three external datasets are joined to the estimation sample before the final complete-cases filter is applied.

mr_display <- mr[, .(
  Source       = source,
  `Join Key`   = join_key,
  `Match Rate` = sprintf("%.1f%%", match_pct),
  Description  = description
)]

kable(mr_display) %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"),
                full_width = TRUE, font_size = 13)
Source Join Key Match Rate Description
Polysilicon (PVinsights) year + week 40.1% Weekly polysilicon spot prices joined on installation year and ISO week.
IRS SOI (zip x year) zip5 + year 95.7% Household count and AGI per household from IRS Statistics of Income, joined on 5-digit zip code and year (capped at max available IRS year).
EIA Electricity (state x year) state + year 94.6% Average retail electricity price by state and year from EIA.

Unmatched rows in the merge stage are the primary driver of the final sample being smaller than the estimation sample — they introduce NAs in required variables which are then dropped by the complete-cases filter in Step 11.


6 Geographic Coverage

st_display <- state_cts[, .(
  State = state_raw,
  N     = comma(N),
  `% of Estimation` = sprintf("%.1f%%", 100 * N / n_est),
  Group = ifelse(state_raw %in% major_st, "Major (> 20k)", "Other")
)]

kable(head(st_display, 25), caption = "Top 25 states in estimation sample") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"),
                full_width = FALSE, font_size = 13)
Top 25 states in estimation sample
State N % of Estimation Group
CA 889,690 82.7% Major (> 20k)
NY 71,910 6.7% Major (> 20k)
AZ 64,365 6.0% Major (> 20k)
MA 26,177 2.4% Major (> 20k)
NM 15,037 1.4% Other
TX 5,888 0.5% Other
NH 2,272 0.2% Other
FL 360 0.0% Other
AR 64 0.0% Other
OR 6 0.0% Other

Major states (> 20,000 observations each): CA, NY, AZ, MA. All other states are grouped as “Other” in the state fixed-effect specification.


7 Reproducibility

This report reads pipeline_meta.rds generated by 01_clean_tts.R. To regenerate:

source("01_clean_tts.R")                       # produces data/processed/pipeline_meta.rds
rmarkdown::render("02_pipeline_report.Rmd")     # produces this report

Session info:

sessionInfo()
## R version 4.5.2 (2025-10-31)
## Platform: aarch64-apple-darwin20
## Running under: macOS Tahoe 26.3
## 
## Matrix products: default
## BLAS:   /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib 
## LAPACK: /Library/Frameworks/R.framework/Versions/4.5-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.12.1
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## time zone: Europe/London
## tzcode source: internal
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] kableExtra_1.4.0    knitr_1.51          scales_1.4.0       
##  [4] lubridate_1.9.5     forcats_1.0.1       stringr_1.6.0      
##  [7] dplyr_1.2.0         purrr_1.2.1         readr_2.2.0        
## [10] tidyr_1.3.2         tibble_3.3.1        ggplot2_4.0.2      
## [13] tidyverse_2.0.0     data.table_1.18.2.1
## 
## loaded via a namespace (and not attached):
##  [1] gtable_0.3.6       jsonlite_2.0.0     compiler_4.5.2     tidyselect_1.2.1  
##  [5] xml2_1.5.2         jquerylib_0.1.4    textshaping_1.0.4  systemfonts_1.3.1 
##  [9] yaml_2.3.12        fastmap_1.2.0      R6_2.6.1           labeling_0.4.3    
## [13] generics_0.1.4     svglite_2.2.2      bslib_0.10.0       pillar_1.11.1     
## [17] RColorBrewer_1.1-3 tzdb_0.5.0         rlang_1.1.7        stringi_1.8.7     
## [21] cachem_1.1.0       xfun_0.56          sass_0.4.10        S7_0.2.1          
## [25] otel_0.2.0         viridisLite_0.4.3  timechange_0.4.0   cli_3.6.5         
## [29] withr_3.0.2        magrittr_2.0.4     digest_0.6.39      grid_4.5.2        
## [33] rstudioapi_0.18.0  hms_1.1.4          lifecycle_1.0.5    vctrs_0.7.1       
## [37] evaluate_1.0.5     glue_1.8.0         farver_2.1.2       rmarkdown_2.30    
## [41] tools_4.5.2        pkgconfig_2.0.3    htmltools_0.5.9