In this data dive, I explore the time aspect of the building energy dataset by examining how greenhouse gas emissions change over time. The time variable I use is the reporting year, which I convert into a Date format in R. I then create a time series object and analyze trends in emissions over time.
This is meaningful because changes in emissions over time may reflect broader shifts in building operations, policy, energy efficiency efforts, or reporting practices.
library(tidyverse)
library(janitor)
library(tsibble)
library(feasts)
library(fabletools)
energy <- read_csv("Building_Energy_Benchmarking_Data__2015-Present.csv", show_col_types = FALSE) %>%
  clean_names()
glimpse(energy)
## Rows: 34,699
## Columns: 46
## $ ose_building_id <dbl> 1, 2, 3, 5, 8, 9, 10, 11, 12, 13,…
## $ data_year <dbl> 2024, 2024, 2024, 2024, 2024, 202…
## $ building_name <chr> "MAYFLOWER PARK HOTEL", "PARAMOUN…
## $ building_type <chr> "NonResidential", "NonResidential…
## $ tax_parcel_identification_number <chr> "659000030", "659000220", "659000…
## $ address <chr> "405 OLIVE WAY", "724 PINE ST", "…
## $ city <chr> "SEATTLE", "SEATTLE", "SEATTLE", …
## $ state <chr> "WA", "WA", "WA", "WA", "WA", "WA…
## $ zip_code <dbl> 98101, 98101, 98101, 98101, 98121…
## $ latitude <dbl> 47.61220, 47.61307, 47.61367, 47.…
## $ longitude <dbl> -122.3380, -122.3336, -122.3382, …
## $ neighborhood <chr> "DOWNTOWN", "DOWNTOWN", "DOWNTOWN…
## $ council_district_code <dbl> 7, 7, 7, 7, 7, 7, 7, 7, 1, 1, 7, …
## $ year_built <dbl> 1927, 1996, 1969, 1926, 1980, 199…
## $ numberof_floors <dbl> 12, 11, 41, 10, 18, 2, 11, 8, 15,…
## $ numberof_buildings <dbl> 1, 1, 3, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ property_gfa_total <dbl> 88434, 103566, 956110, 61320, 175…
## $ property_gfa_buildings <dbl> 88434, 88502, 759392, 61320, 1135…
## $ property_gfa_parking <dbl> 0, 15064, 196718, 0, 62000, 37198…
## $ self_report_gfa_total <dbl> 115387, 103566, 947059, 61320, 20…
## $ self_report_gfa_buildings <dbl> 115387, 88502, 827566, 61320, 123…
## $ self_report_parking <dbl> 0, 15064, 119493, 0, 80497, 40971…
## $ energystar_score <dbl> 59, 85, 71, 50, 87, NA, 10, NA, 5…
## $ site_euiwn_k_btu_sf <dbl> 62.2, 71.9, 82.0, 87.2, 97.6, 168…
## $ site_eui_k_btu_sf <dbl> 61.7, 71.5, 81.7, 86.0, 97.1, 167…
## $ site_energy_use_k_btu <dbl> 7113958, 6330664, 67613264, 52739…
## $ site_energy_use_wn_k_btu <dbl> 7172158, 6362478, 67852608, 53463…
## $ source_euiwn_k_btu_sf <dbl> 122.9, 128.7, 171.8, 174.7, 167.6…
## $ source_eui_k_btu_sf <dbl> 121.4, 128.3, 171.5, 171.4, 167.2…
## $ epa_property_type <chr> "Hotel", "Hotel", "Hotel", "Hotel…
## $ largest_property_use_type <chr> "Hotel", "Hotel", "Hotel", "Hotel…
## $ largest_property_use_type_gfa <dbl> 115387, 88502, 827566, 61320, 123…
## $ second_largest_property_use_type <chr> NA, "Parking", "Parking", NA, "Pa…
## $ second_largest_property_use_type_gfa <dbl> NA, 15064, 117783, NA, 68009, 409…
## $ third_largest_property_use_type <chr> NA, NA, "Swimming Pool", NA, "Swi…
## $ third_largest_property_use_type_gfa <dbl> NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ electricity_k_wh <dbl> 1045040, 787838, 11279080, 796976…
## $ steam_use_k_btu <dbl> 1949686, NA, 23256386, 1389935, N…
## $ natural_gas_therms <dbl> 15986, 36426, 58726, 11648, 73811…
## $ compliance_status <chr> "Not Compliant", "Compliant", "Co…
## $ compliance_issue <chr> "Default Data", "No Issue", "No I…
## $ electricity_k_btu <dbl> 3565676, 2688104, 38484221, 27192…
## $ natural_gas_k_btu <dbl> 1598590, 3642560, 5872650, 116476…
## $ total_ghg_emissions <dbl> 263.3, 208.6, 2418.2, 190.1, 417.…
## $ ghg_emissions_intensity <dbl> 2.98, 2.36, 3.18, 3.10, 3.68, 2.8…
## $ demolished <lgl> FALSE, FALSE, FALSE, FALSE, FALSE…
names(energy)
## [1] "ose_building_id"
## [2] "data_year"
## [3] "building_name"
## [4] "building_type"
## [5] "tax_parcel_identification_number"
## [6] "address"
## [7] "city"
## [8] "state"
## [9] "zip_code"
## [10] "latitude"
## [11] "longitude"
## [12] "neighborhood"
## [13] "council_district_code"
## [14] "year_built"
## [15] "numberof_floors"
## [16] "numberof_buildings"
## [17] "property_gfa_total"
## [18] "property_gfa_buildings"
## [19] "property_gfa_parking"
## [20] "self_report_gfa_total"
## [21] "self_report_gfa_buildings"
## [22] "self_report_parking"
## [23] "energystar_score"
## [24] "site_euiwn_k_btu_sf"
## [25] "site_eui_k_btu_sf"
## [26] "site_energy_use_k_btu"
## [27] "site_energy_use_wn_k_btu"
## [28] "source_euiwn_k_btu_sf"
## [29] "source_eui_k_btu_sf"
## [30] "epa_property_type"
## [31] "largest_property_use_type"
## [32] "largest_property_use_type_gfa"
## [33] "second_largest_property_use_type"
## [34] "second_largest_property_use_type_gfa"
## [35] "third_largest_property_use_type"
## [36] "third_largest_property_use_type_gfa"
## [37] "electricity_k_wh"
## [38] "steam_use_k_btu"
## [39] "natural_gas_therms"
## [40] "compliance_status"
## [41] "compliance_issue"
## [42] "electricity_k_btu"
## [43] "natural_gas_k_btu"
## [44] "total_ghg_emissions"
## [45] "ghg_emissions_intensity"
## [46] "demolished"
energy_time <- energy %>%
  filter(!is.na(data_year), !is.na(total_ghg_emissions)) %>%
  mutate(date = as.Date(paste0(data_year, "-01-01")))
summary(energy_time$date)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## "2015-01-01" "2017-01-01" "2020-01-01" "2019-08-13" "2022-01-01" "2024-01-01"
head(energy_time)
## # A tibble: 6 × 47
## ose_building_id data_year building_name building_type tax_parcel_identific…¹
## <dbl> <dbl> <chr> <chr> <chr>
## 1 1 2024 MAYFLOWER PARK… NonResidenti… 659000030
## 2 2 2024 PARAMOUNT HOTEL NonResidenti… 659000220
## 3 3 2024 WESTIN HOTEL (… NonResidenti… 659000475
## 4 5 2024 HOTEL MAX NonResidenti… 659000640
## 5 8 2024 WARWICK SEATTL… NonResidenti… 659000970
## 6 9 2024 WEST PRECINCT … Nonresidenti… 660000560
## # ℹ abbreviated name: ¹​tax_parcel_identification_number
## # ℹ 42 more variables: address <chr>, city <chr>, state <chr>, zip_code <dbl>,
## # latitude <dbl>, longitude <dbl>, neighborhood <chr>,
## # council_district_code <dbl>, year_built <dbl>, numberof_floors <dbl>,
## # numberof_buildings <dbl>, property_gfa_total <dbl>,
## # property_gfa_buildings <dbl>, property_gfa_parking <dbl>,
## # self_report_gfa_total <dbl>, self_report_gfa_buildings <dbl>, …
For this analysis, I use the reporting year as the time column. Since year alone is not a full Date, I convert it into January 1 of each reporting year.
# Check that no rows are missing the reporting year (stored in data_year)
energy %>%
  filter(!is.na(data_year))
## # A tibble: 34,699 × 46
## ose_building_id data_year building_name building_type tax_parcel_identific…¹
## <dbl> <dbl> <chr> <chr> <chr>
## 1 1 2024 MAYFLOWER PAR… NonResidenti… 659000030
## 2 2 2024 PARAMOUNT HOT… NonResidenti… 659000220
## 3 3 2024 WESTIN HOTEL … NonResidenti… 659000475
## 4 5 2024 HOTEL MAX NonResidenti… 659000640
## 5 8 2024 WARWICK SEATT… NonResidenti… 659000970
## 6 9 2024 WEST PRECINCT… Nonresidenti… 660000560
## 7 10 2024 CAMLIN WORLDM… NonResidenti… 660000825
## 8 11 2024 PARAMOUNT THE… NonResidenti… 660000955
## 9 12 2024 COURTYARD BY … NonResidenti… 939000080
## 10 13 2024 LYON BUILDING Multifamily … 939000105
## # ℹ 34,689 more rows
## # ℹ abbreviated name: ¹​tax_parcel_identification_number
## # ℹ 41 more variables: address <chr>, city <chr>, state <chr>, zip_code <dbl>,
## # latitude <dbl>, longitude <dbl>, neighborhood <chr>,
## # council_district_code <dbl>, year_built <dbl>, numberof_floors <dbl>,
## # numberof_buildings <dbl>, property_gfa_total <dbl>,
## # property_gfa_buildings <dbl>, property_gfa_parking <dbl>, …
This conversion allows me to treat the reporting year as a valid time variable in R.
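As a small illustration of the conversion, the sketch below uses three hypothetical reporting years standing in for data_year and anchors each one to January 1. It uses base R only. It also shows that consecutive January 1 dates are 365 or 366 days apart, which is worth keeping in mind when a Date column is later used as a time index.

```r
# Hypothetical reporting years standing in for energy$data_year
data_year <- c(2015, 2016, 2017)

# Anchor each reporting year to January 1 of that year
date <- as.Date(paste0(data_year, "-01-01"))

# The result is a true Date vector, and the year can be recovered from it
class(date)
year_back <- as.integer(format(date, "%Y"))

# Spacing between consecutive dates is measured in days (2016 is a leap year)
gap_days <- as.integer(diff(date))
gap_days
```

The year is fully recoverable from the Date, so no information is lost by the conversion.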
Because the dataset contains many buildings for each year, I aggregate greenhouse gas emissions by year. I use the average annual emissions so that the time series reflects typical emissions levels rather than raw totals.
yearly_emissions <- energy_time %>%
  group_by(date) %>%
  summarise(avg_ghg = mean(total_ghg_emissions, na.rm = TRUE)) %>%
  ungroup()
yearly_emissions
## # A tibble: 10 × 2
## date avg_ghg
## <date> <dbl>
## 1 2015-01-01 107.
## 2 2016-01-01 117.
## 3 2017-01-01 135.
## 4 2018-01-01 129.
## 5 2019-01-01 279.
## 6 2020-01-01 263.
## 7 2021-01-01 145.
## 8 2022-01-01 148.
## 9 2023-01-01 189.
## 10 2024-01-01 285.
This step is important because it converts the building-level data into a yearly series that can be analyzed over time.
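Because a mean can be pulled around by a few very large buildings, it may help to look at the count and the median next to the average for each year. The sketch below is a minimal illustration on made-up numbers (the toy data frame and its values are assumptions, not rows from the dataset), written in base R.

```r
# Toy data: two reporting years, with one very large emitter in 2019
toy <- data.frame(
  data_year = c(2019, 2019, 2019, 2020, 2020, 2020),
  total_ghg_emissions = c(100, 120, 5000, 110, 130, 140)
)

# One row per year: number of buildings, mean, and median emissions
yearly_summary <- data.frame(
  data_year   = sort(unique(toy$data_year)),
  n_buildings = as.vector(tapply(toy$total_ghg_emissions, toy$data_year, length)),
  avg_ghg     = as.vector(tapply(toy$total_ghg_emissions, toy$data_year, mean)),
  median_ghg  = as.vector(tapply(toy$total_ghg_emissions, toy$data_year, median))
)
yearly_summary
```

In this toy case the 2019 mean is far above the 2019 median because of a single building, while the two medians are close. A similar check on the real data would show whether year-to-year jumps in the average come from typical buildings or from a few outliers.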
emissions_ts <- yearly_emissions %>%
  as_tsibble(index = date)
emissions_ts
## # A tsibble: 10 x 2 [1D]
## date avg_ghg
## <date> <dbl>
## 1 2015-01-01 107.
## 2 2016-01-01 117.
## 3 2017-01-01 135.
## 4 2018-01-01 129.
## 5 2019-01-01 279.
## 6 2020-01-01 263.
## 7 2021-01-01 145.
## 8 2022-01-01 148.
## 9 2023-01-01 189.
## 10 2024-01-01 285.
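The tsibble format makes it easier to work with time-based data in R and supports later tools such as autocorrelation analysis. Note that the [1D] in the header shows that tsibble inferred a daily interval from the Date index, even though there is only one observation per year.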
ggplot(emissions_ts, aes(x = date, y = avg_ghg)) +
  geom_line(color = "steelblue", linewidth = 1) +
  geom_point(color = "darkred", size = 2) +
  theme_minimal() +
  labs(
    title = "Average Greenhouse Gas Emissions Over Time",
    x = "Year",
    y = "Average Total GHG Emissions"
  )
This plot provides an initial view of how average emissions change over time. The main feature to look for here is whether emissions appear to increase, decrease, or stay fairly stable across years.
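From the time plot, I can assess whether emissions trend upward or downward or simply fluctuate from year to year. In the yearly averages above, emissions rise from about 107 in 2015 to about 129 in 2018, jump to 279 in 2019 and 263 in 2020, fall back to about 145 in 2021, and climb again to 285 in 2024. The series therefore does not slope downward. It moves upward overall but unevenly, so it shows no clear sign of the falling average emissions that would point to improvements in building efficiency or environmental policy.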
This is significant because it shifts the focus from individual buildings to broader time-based patterns in the dataset.
To estimate an overall time trend, I fit a simple linear regression model using time as a numeric variable.
trend_df <- emissions_ts %>%
  mutate(time_num = row_number())
trend_model <- lm(avg_ghg ~ time_num, data = trend_df)
summary(trend_model)
##
## Call:
## lm(formula = avg_ghg ~ time_num, data = trend_df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -65.10 -35.78 -14.22 31.02 105.98
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 106.131 41.250 2.573 0.0330 *
## time_num 13.363 6.648 2.010 0.0793 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 60.38 on 8 degrees of freedom
## Multiple R-squared: 0.3356, Adjusted R-squared: 0.2525
## F-statistic: 4.04 on 1 and 8 DF, p-value: 0.07927
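The coefficient for time_num measures the average change in mean greenhouse gas emissions per reporting year. Here the estimate is about 13.4 per year, which points to an increase, but with a p-value of 0.079 and an R-squared of 0.34 the evidence for a linear trend is weak, especially with only 10 yearly observations.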
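To make the slope concrete, the sketch below refits the same kind of model on the ten yearly averages as printed above. The values are the rounded figures from the table, so the estimates will differ slightly from the model output in this report. It also computes the slope by hand as cov(x, y) / var(x) and extracts a 95% confidence interval.

```r
# Rounded yearly averages copied from the printed table (2015 to 2024)
avg_ghg <- c(107, 117, 135, 129, 279, 263, 145, 148, 189, 285)
trend_toy <- data.frame(time_num = seq_along(avg_ghg), avg_ghg = avg_ghg)

# Simple linear trend: average change in avg_ghg per year
fit <- lm(avg_ghg ~ time_num, data = trend_toy)
slope <- unname(coef(fit)["time_num"])

# The same slope from the textbook formula cov(x, y) / var(x)
slope_manual <- cov(trend_toy$time_num, trend_toy$avg_ghg) / var(trend_toy$time_num)

# 95% confidence interval for the slope
ci <- confint(fit, "time_num", level = 0.95)

slope
slope_manual
ci
```

If the interval contains zero, the data are compatible with no linear trend at the 5% level, which matches a p-value above 0.05.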
ggplot(trend_df, aes(x = date, y = avg_ghg)) +
  geom_line(color = "steelblue", linewidth = 1) +
  geom_point(color = "darkred", size = 2) +
  geom_smooth(method = "lm", se = TRUE, color = "black") +
  theme_minimal() +
  labs(
    title = "Trend in Average Greenhouse Gas Emissions Over Time",
    x = "Year",
    y = "Average Total GHG Emissions"
  )
The fitted regression line makes the overall trend easier to see. The slope is positive here (about 13.4 per year), which indicates increasing average emissions over time; a negative slope would have indicated a decline.
This is useful because it gives a simple quantitative summary of the direction of change.
ggplot(trend_df, aes(x = date, y = avg_ghg)) +
  geom_line(color = "gray50") +
  geom_point(size = 2, color = "darkred") +
  geom_smooth(se = FALSE, color = "blue", linewidth = 1.2) +
  theme_minimal() +
  labs(
    title = "Smoothed Emissions Trend Over Time",
    x = "Year",
    y = "Average Total GHG Emissions"
  )
The smoothing curve helps show the broader pattern without focusing too much on short-term variation from year to year.
Because this dataset is annual rather than monthly or daily, I do not expect strong seasonal effects in the traditional sense. However, smoothing is still useful for identifying whether the series rises, falls, or changes direction over time.
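The smoothing idea can also be shown numerically. The sketch below uses a centered 3-year moving average, which is a simpler smoother swapped in for illustration and not the curve that geom_smooth() draws. The input is the rounded yearly averages from the table above.

```r
# Rounded yearly averages copied from the printed table (2015 to 2024)
avg_ghg <- c(107, 117, 135, 129, 279, 263, 145, 148, 189, 285)

# Centered 3-year moving average; stats::filter is called explicitly
# because dplyr also exports a function named filter()
ma3 <- as.numeric(stats::filter(avg_ghg, rep(1 / 3, 3), sides = 2))

# The first and last years have no complete 3-year window, so they are NA
data.frame(year = 2015:2024, avg_ghg = avg_ghg, ma3 = round(ma3, 1))
```

A wider window would smooth more aggressively, but with only ten observations each extra year in the window removes usable points at both ends of the series.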
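# emissions_ts has a daily interval ([1D]), so fill_gaps() would insert a row
# for every calendar day between the yearly points. Re-index by integer year
# instead so that each lag in the ACF and PACF corresponds to one year.
emissions_ts_yearly <- yearly_emissions %>%
  mutate(year = as.integer(format(date, "%Y"))) %>%
  select(year, avg_ghg) %>%
  as_tsibble(index = year)
emissions_ts_yearly %>%
  ACF(avg_ghg) %>%
  autoplot() +
  ggtitle("ACF of Average GHG Emissions")
emissions_ts_yearly %>%
  PACF(avg_ghg) %>%
  autoplot() +
  ggtitle("PACF of Average GHG Emissions")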
The ACF and PACF plots help assess whether values in one year are related to values in nearby years.
If strong autocorrelation appears at short lags, that suggests persistence over time. If repeated spikes appear at regular intervals, that could suggest cyclical behavior. Because the data are annual, within-year seasonality cannot appear at all; only multi-year cycles could show up, and ten observations are too few to identify them reliably.
If the ACF shows positive correlation at the first few lags, that suggests that years with relatively high emissions tend to be followed by nearby years with relatively high emissions as well. This would imply persistence in the series.
If the PACF drops off quickly after lag 1, that suggests the most important direct relationship is between adjacent years rather than longer lag structures.
These results are important because they help determine whether the data behave like a smooth trend, a persistent time series, or something with repeating cycles.
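To show what a single ACF bar represents, the sketch below computes the lag-1 autocorrelation of the rounded yearly averages by hand and compares it with base R's acf(). It also computes the usual approximate 95% bound of 1.96 / sqrt(n), which is the kind of reference line drawn on ACF plots. This uses base R's stats package as a stand-in for the feasts functions used in the report.

```r
# Rounded yearly averages copied from the printed table (2015 to 2024)
avg_ghg <- c(107, 117, 135, 129, 279, 263, 145, 148, 189, 285)
n <- length(avg_ghg)

# Lag-1 autocorrelation: products of deviations one year apart,
# divided by the total sum of squared deviations
dev <- avg_ghg - mean(avg_ghg)
r1_manual <- sum(dev[-n] * dev[-1]) / sum(dev^2)

# The same quantity from stats::acf (element 1 is lag 0, element 2 is lag 1)
r1_acf <- acf(avg_ghg, lag.max = 1, plot = FALSE)$acf[2]

# Approximate 95% bound for a series with no autocorrelation
bound <- 1.96 / sqrt(n)

c(r1_manual = r1_manual, r1_acf = r1_acf, bound = bound)
```

With n = 10 the bound is about 0.62, so only quite large autocorrelations would stand out from noise in a series this short.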
This analysis adds a time-based perspective to the building dataset by asking whether greenhouse gas emissions are changing across years rather than only varying across buildings.
If emissions are decreasing over time, that could suggest progress in energy efficiency, policy changes, or emissions reduction efforts. If they are not decreasing, that may indicate that additional intervention is needed.
Looking at the data over time is significant because it provides a dynamic view of building emissions and helps identify whether meaningful change is occurring.
This time-based analysis has several limitations.
First, the data are annual, so there may not be enough detail to reveal strong seasonal patterns in the usual sense. Second, averaging emissions across buildings may hide important differences across building types. Third, changes over time may partly reflect differences in which buildings were included in the dataset from one year to the next.
Because of these limitations, the results should be interpreted as a high-level view rather than a complete explanation of emissions behavior.
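One way to probe the third limitation is to recompute the yearly average using only buildings that report in every year. The sketch below does this on a toy data frame. The rows are made up, and the column names simply mirror ose_building_id, data_year, and total_ghg_emissions from the dataset.

```r
# Toy data: buildings 1 and 2 report in both years, 3 and 4 in one year each
toy <- data.frame(
  ose_building_id = c(1, 2, 3, 1, 2, 4),
  data_year = c(2015, 2015, 2015, 2016, 2016, 2016),
  total_ghg_emissions = c(100, 200, 900, 110, 190, 50)
)

# Keep only buildings that appear in every reporting year
n_years <- length(unique(toy$data_year))
years_per_building <- tapply(toy$data_year, toy$ose_building_id,
                             function(x) length(unique(x)))
balanced_ids <- as.numeric(names(years_per_building)[years_per_building == n_years])
balanced <- toy[toy$ose_building_id %in% balanced_ids, ]

# Yearly averages for all buildings versus the balanced subset
all_avg <- tapply(toy$total_ghg_emissions, toy$data_year, mean)
balanced_avg <- tapply(balanced$total_ghg_emissions, balanced$data_year, mean)

all_avg
balanced_avg
```

In the toy data the all-building average drops sharply between years only because a large building leaves and a small one enters, while the balanced subset is flat. A large gap between the two series in the real data would suggest that changes in the average reflect who reported rather than how emissions changed.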
This analysis raises several follow-up questions:
In this data dive, I used reporting year as a time variable, converted it into a Date, and created a time series of average greenhouse gas emissions by year. I then used a time plot, linear regression, smoothing, and autocorrelation tools to explore how emissions behave over time.
Overall, this analysis shows that adding a time dimension can reveal broader patterns and trends that are not visible in cross-sectional analysis alone. Even though the data are yearly and may not show strong seasonality, the time-based approach still provides useful insight into whether building emissions are changing over time.