DATA 110 Project 2

My Research Question:

How do electricity usage, natural gas consumption, and building size influence greenhouse gas emissions in Montgomery County government buildings?

Introduction

This dataset I chose comes from the Montgomery County Energy Benchmarking dataset. It includes energy usage and greenhouse gas emissions for government buildings, showing how different buildings use electricity and natural gas and how that connects to emissions.

I chose this dataset because I wanted to understand how real buildings impact the environment. Energy use affects both cost and sustainability, and I found it interesting to analyze what factors lead to higher greenhouse gas emissions.

The dataset includes both categorical and quantitative variables. The categorical variables are City and Primary Property Type EPA Calculated. City represents where each building is located, and it helps compare emissions across different areas. Primary Property Type shows the type of building like the office or facility, it helps compare emissions across building categories.

The quantitative variables include Electricity (kWh), Natural Gas (therms), Reported Property Gross Floor Area, and Total Greenhouse Gas Emissions (CO2e). Electricity and natural gas measure energy usage, building size shows how large each building is, and emissions represent the total environmental impact.

Before my analysis, I cleaned the dataset by selecting only the variables I needed.I also converted values into numeric format, and removed missing values in order for the results to be more accurate.

Importing the Libraries and Dataset

# In this chunk I am loading the libraries that I need for my data analysis
# The library(readr) basically I used to import the csv file
# The library(dplyr) basically I will use for data cleaning as well as data manipulation
# The library(ggplot2) basically I will use it to help me create my visualizations
# The library(ploty) basically I will use it to help me create the interactive graph 
library(readr)
library(dplyr)

## Warning: package 'dplyr' was built under R version 4.5.2

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(ggplot2)

## Warning: package 'ggplot2' was built under R version 4.5.2

library(plotly)

## Warning: package 'plotly' was built under R version 4.5.3

## 
## Attaching package: 'plotly'

## The following object is masked from 'package:ggplot2':
## 
##     last_plot

## The following object is masked from 'package:stats':
## 
##     filter

## The following object is masked from 'package:graphics':
## 
##     layout

# In this chunk I am loading my data
 energy_data <- read_csv("2023_Energy_Benchmarking_All_Sites_20260418.csv")

## Rows: 1846 Columns: 27

## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (10): Benchmarking Group, Building Name, Montgomery County Building ID, ...
## dbl  (8): ESPM Property ID, Reporting Year Start Date, Year Built, Number of...
## num  (8): Reported Property Gross Floor Area, Electricity (kWh), Electricity...
## lgl  (1): GHG Emissions Intensity (kgCO2e/ft²)
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

# In this chunk, the thing I am exploring is the dataset structure
# The function head() displays the first 6 rows in order to understand the data
# The function dim() displays the number of rows and columns
# The function colnames() displays all the variable names
head(energy_data)

## # A tibble: 6 × 27
##   `Benchmarking Group` `ESPM Property ID` `Building Name` Montgomery County Bu…¹
##   <chr>                             <dbl> <chr>           <chr>                 
## 1 County                          5244852 council office… 152444                
## 2 County                          5245108 grey courthous… 152455                
## 3 County                          5244851 white oak crcc  278286                
## 4 County                          5245118 dennis ave hea… 954308                
## 5 County                         20658934 wheaton librar… 1093362               
## 6 County                          5245014 holiday park s… 1311981               
## # ℹ abbreviated name: ¹`Montgomery County Building ID`
## # ℹ 23 more variables: `Reporting Year Start Date` <dbl>,
## #   `Benchmarking Report Status` <chr>, Address <chr>, City <chr>, State <chr>,
## #   Zip <chr>, `Year Built` <dbl>, `Number of Buildings` <dbl>,
## #   `Primary Property Type Self Selected` <chr>,
## #   `Primary Property Type EPA Calculated` <chr>,
## #   `Reported Property Gross Floor Area` <dbl>, `ENERGY STAR Score` <dbl>, …

dim(energy_data)

## [1] 1846   27

colnames(energy_data)

##  [1] "Benchmarking Group"                         
##  [2] "ESPM Property ID"                           
##  [3] "Building Name"                              
##  [4] "Montgomery County Building ID"              
##  [5] "Reporting Year Start Date"                  
##  [6] "Benchmarking Report Status"                 
##  [7] "Address"                                    
##  [8] "City"                                       
##  [9] "State"                                      
## [10] "Zip"                                        
## [11] "Year Built"                                 
## [12] "Number of Buildings"                        
## [13] "Primary Property Type Self Selected"        
## [14] "Primary Property Type EPA Calculated"       
## [15] "Reported Property Gross Floor Area"         
## [16] "ENERGY STAR Score"                          
## [17] "Electricity (kWh)"                          
## [18] "Electricity Use Onsite Renewable (kWh)"     
## [19] "Natural Gas (therms)"                       
## [20] "Site EUI"                                   
## [21] "Weather-Normalized Site EUI"                
## [22] "Source EUI"                                 
## [23] "Weather-Normalized Source EUI"              
## [24] "Total GHG Emissions (Metric Tons CO2e)"     
## [25] "GHG Emissions Intensity (kgCO2e/ft²)"       
## [26] "Direct GHG Emissions (Metric Tons CO2e)"    
## [27] "Direct GHG Emissions Intensity (kgCO2e/ft²)"

Data Cleaning

# In this chunk, I am cleaning the dataset by selecting only the variables I need for analysis
# I am using the function select() in order to keep specific columns which are relevant to my research question
# I am using the function mutate in order to change the data type
# I am also using as.numeric which converts values from text to numbers in order for me to calculate
energy_clean <- energy_data %>%
  select(
    City,
    `Primary Property Type EPA Calculated`,
    `Electricity (kWh)`,
    `Natural Gas (therms)`,
    `Reported Property Gross Floor Area`,
    `Total GHG Emissions (Metric Tons CO2e)`
  ) %>%
  
# I cleaned the data by converting important columns into numeric values
# I also removed commas so the values are read correctly.
  mutate(
    `Electricity (kWh)` = as.numeric(gsub(",", "", `Electricity (kWh)`)),
    `Natural Gas (therms)` = as.numeric(gsub(",", "", `Natural Gas (therms)`)),
    `Reported Property Gross Floor Area` = as.numeric(gsub(",", "", `Reported Property Gross Floor Area`)),
    `Total GHG Emissions (Metric Tons CO2e)` = as.numeric(gsub(",", "", `Total GHG Emissions (Metric Tons CO2e)`)),
    

# Here I converted the building type variable into a factor since it is a categorical variable
`Primary Property Type EPA Calculated` = as.factor(`Primary Property Type EPA Calculated`)
  ) %>%

# Using the filter function I removed the rows that have missing values to ensure my analysis is accurate
filter(
    !is.na(City),
    !is.na(`Primary Property Type EPA Calculated`),
    !is.na(`Electricity (kWh)`),
    !is.na(`Natural Gas (therms)`),
    !is.na(`Reported Property Gross Floor Area`),
    !is.na(`Total GHG Emissions (Metric Tons CO2e)`)
  )

# Using colSums I used it to check if there is any missing values still remaining in the dataset
colSums(is.na(energy_clean))

##                                   City   Primary Property Type EPA Calculated 
##                                      0                                      0 
##                      Electricity (kWh)                   Natural Gas (therms) 
##                                      0                                      0 
##     Reported Property Gross Floor Area Total GHG Emissions (Metric Tons CO2e) 
##                                      0                                      0

Summary Statistics

# I am using the function summary which gives the mean, median, min and max 
# I am using the mean function to calculate the average emissions
# I am using the max function to figure out the highest emission cost
summary(energy_clean$`Total GHG Emissions (Metric Tons CO2e)`)

##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##     7.72   187.20   455.70   796.92   963.80 24703.80

mean(energy_clean$`Total GHG Emissions (Metric Tons CO2e)`, na.rm = TRUE)

## [1] 796.9159

max(energy_clean$`Total GHG Emissions (Metric Tons CO2e)`, na.rm = TRUE)

## [1] 24703.8

Grouping Analysis by the City

# In this chunk, I am going to be grouping data by City
# I am using the group_by() function in order to separate the data into groups which are based on City
# I am using summarize() which calculates summary statistics for each group
energy_clean %>%
  group_by(`Electricity (kWh)`) %>%
  summarize(
    avg_emissions = mean(`Total GHG Emissions (Metric Tons CO2e)`, na.rm = TRUE),
    avg_electricity = mean(`Electricity (kWh)`, na.rm = TRUE),
    n = n()
  )

## # A tibble: 1,302 × 4
##    `Electricity (kWh)` avg_emissions avg_electricity     n
##                  <dbl>         <dbl>           <dbl> <int>
##  1                  0         103.                0      2
##  2              19238.        152.            19238.     1
##  3              23623         117.            23623      1
##  4              24026.         72.6           24026.     1
##  5              25774           7.72          25774      1
##  6              27286          66.3           27286      1
##  7              27826          18.3           27826      1
##  8              29203          47.8           29203      1
##  9              29601.         40.9           29601.     1
## 10              30227.          9.3           30227.     1
## # ℹ 1,292 more rows

Visualization 1: Electricity vs Emissions

# In this visualization, I am exploring the relationship between natural gas usage as well as the total greenhouse gas (GHG) emissions across different buildings.
# I am using ggplot() in order to create a scatter plot.
# The x-axis in this represents the Natural Gas usage in therms which is a quantitative variable.
# The y-axis represents Total GHG Emissions in metric tons of CO2 equivalent which is a quantitative variable. 
# The color I am using represents the City which is a categorical variable and I am going to be comparing patterns across many locations.
# I am using geom_point() to plot the buildings as a point on the graph 
# The alpha = 0.7 ends up making my points transparent based on this generally overlapping points are more easier to see
ggplot(energy_clean, aes(
  x = `Electricity (kWh)`,
  y = `Total GHG Emissions (Metric Tons CO2e)`,
  color = City
)) +
  geom_point(alpha = 0.7) +
  theme_minimal() +
  labs(
    title = "Electricity Use vs Greenhouse Gas Emissions",
    x = "Electricity (kWh)",
    y = "Total GHG Emissions",
    color = "City",
    caption = "Source: Montgomery County Energy Benchmarking Dataset"
  )+
    theme(axis.text.x = element_text(angle = 45, hjust = 1))

Visualization 2: Natural Gas vs Emissions

# In this visualization, I am analyzing how building size (gross floor area)
# relates to total greenhouse gas emissions.
# I am creating a scatterplot using ggplot 
# The x-axis represents the reported property gross floor area which is a quantative variables 
# The y-axis represents total GHG emissions (quantitative variable)
# The color represents City (categorical variable), helping me compare across regions
ggplot(energy_clean, aes(
 x = `Natural Gas (therms)`,
 y = `Total GHG Emissions (Metric Tons CO2e)`,
 color = City
)) +
  geom_point(alpha = 0.7) +
  theme_minimal() +
  labs(
   title = "Natural Gas Use vs Greenhouse Gas Emissions",
   x = "Natural Gas (therms)",
   y = "GHG Emissions",
   caption = "Source: Montgomery County Energy Benchmarking Dataset"
 )

Visualization 3: Building Size vs Emissions

# In this visualization, I am looking at the distribution of greenhouse gas emissions.
# I use a histogram because it helps me see how emissions are spread across buildings.
 # The geom_histogram() variable basically groups values into bins in order to show frequency distribution
# I used the variable theme_minimal() in order to keep the graph clean and easy to read

ggplot(energy_clean, aes(x = `Total GHG Emissions (Metric Tons CO2e)`)) +
  geom_histogram(bins = 30, fill = "steelblue", color = "white") +
  theme_minimal() +
  labs(
    title = "Distribution of Greenhouse Gas Emissions",
    x = "Total GHG Emissions",
    y = "Number of Buildings",
    caption = "Source: Montgomery County Energy Benchmarking Dataset"
  )

Interactive Plot

# In this chunk, I am creating an interactive scatterplot in order to try to explore the relationship there is between electricity usage and greenhouse gas emissions.
# The x-axis in this displays the electricity usage which is a quantative variable 
# The y-axis in this displays the greenhouse gas emissions which is a quantative variable
# In this I used City as a categorical variable in order to compare the different locations 
#I use geom_point() to plot each building as a point.
# I set alpha = 0.7 since it makes the points transparent in order for the overlapping data to be more easier to see.
# I apply a non-default color palette Set2 in order to make the graph more visually appealing
# The theme_minimal function makes the graph easy to read
# Lastly I converted the static ggplot basically into an interactive plot by using ggploty
# Due to this I was able to see the values which helps me clearly explore the data
interactive_plot <- ggplot(energy_clean, aes(
  x = `Electricity (kWh)`,
  y = `Total GHG Emissions (Metric Tons CO2e)`,
  color = City
)) +
  geom_point(alpha = 0.7) +
  scale_color_brewer(palette = "Set2") +
  theme_minimal()

ggplotly(interactive_plot)

## Warning in RColorBrewer::brewer.pal(n, pal): n too large, allowed maximum for palette Set2 is 8
## Returning the palette you asked for with that many colors

Multiple Linear Regression

# In this section, I build a multiple linear regression model to see how electricity use, natural gas use, and building size together affect greenhouse gas emissions.
#The dependent variable is GHG emissions that is what I want to predict.
# The independent variables are electricity, natural gas, and building size.

model <- lm( `Total GHG Emissions (Metric Tons CO2e)` ~
    `Electricity (kWh)` +
    `Natural Gas (therms)` +
    `Reported Property Gross Floor Area`,
  data = energy_clean
)


summary(model)

## 
## Call:
## lm(formula = `Total GHG Emissions (Metric Tons CO2e)` ~ `Electricity (kWh)` + 
##     `Natural Gas (therms)` + `Reported Property Gross Floor Area`, 
##     data = energy_clean)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -983.3  -18.3   -7.2   -0.8 3284.8 
## 
## Coefficients:
##                                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                           3.168e+00  4.861e+00   0.652 0.514740    
## `Electricity (kWh)`                   3.207e-04  1.901e-06 168.766  < 2e-16 ***
## `Natural Gas (therms)`                5.145e-03  4.887e-05 105.289  < 2e-16 ***
## `Reported Property Gross Floor Area` -9.726e-05  2.706e-05  -3.594 0.000338 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 129.1 on 1301 degrees of freedom
## Multiple R-squared:  0.9912, Adjusted R-squared:  0.9911 
## F-statistic: 4.859e+04 on 3 and 1301 DF,  p-value: < 2.2e-16

Interpretation

Regression Equation:

GHG Emissions = 3.168 + 0.0003207(Electricity) + 0.005145(Natural Gas) − 0.00009726(Building Size)

In my model, I found that electricity use, natural gas consumption, and building size all help explain greenhouse gas emissions. All three variables are statistically significant since their p-values are below 0.05, which means they are meaningful predictors in my model. The adjusted R² is 0.9911, which shows how my model explains about 99% of the variation in greenhouse gas emissions. This tells me the model fits the data extremely well. I also noticed how as electricity and natural gas usage increase, the greenhouse gas emissions also increase which shows a strong positive relationship between energy use and the impact of the environment.

Essay

I was able to see a clear overall relationship between energy use and greenhouse gas emissions in Montgomery County buildings. In the scatterplots I created Electricity vs Emissions and Natural Gas vs Emissions, both variables show a strong positive relationship with total greenhouse gas emissions. This means that when the electricity use or natural gas use increases, emissions also tend to increase.

One pattern that I noticed which stood out is that electricity use appears to have a slightly upward trend compared to natural gas that suggests how electricity might be a driver of emissions in this dataset. The histogram of greenhouse gas emissions shows how most buildings have relatively low to moderate emissions. A smaller number of buildings tend to have very high emissions. As a result this creates a right-skewed distribution, which is common in energy datasets due to large buildings tending to use significantly more energy. Another observation I have is that the points in both scatterplots are widely spread, but still form an upward trend. This suggests that while energy use is a strong predictor of emissions, there are still other factors (like building efficiency or type) that also affect emissions but were not fully captured in my model.

There were a few limitations in my analysis that could be improved. In my grouping analysis by city I accidentally grouped by electricity instead of city, which did not fully answer the question I intended. If I had more time I would correct that and properly compare emissions across different cities to see if location has an impact or not. Another limitation in my project is that my model only uses three predictors electricity, natural gas, and building size.

While the R² value is very high this shows how overfitting in the dataset also has strong correlations between variables. I would have liked to include additional variables in order to make my model more accurate and realistic. I also wish I could have added a clearer comparison of emissions by building type which is a categorical variable, since that might display the types of buildings which are the least or most energy efficient.

References

Montgomery County Government. (2023).Building Energy Benchmarking Dataset (2023 Energy Benchmarking All Sites).Montgomery County Open Data Portal.

https://data.montgomerycountymd.gov/Environment/2023-Energy-Benchmarking-All-Sites/ensr-8pr2/about_data