Final Project

Source: Ellen Kennedy

Source: Ellen Kennedy

Introduction

The Heating Gas Consumption from 2010-2023 is from NYC Open Data. The variables I will be using for this Project are Development Name, Revenue Month, Service Start Day, Service End Date, Number of Days and Current Charges to explore how much gas does each organization need in order to maintain a their area.
To prepare the dataset for analysis, the following steps were taken:
- Missing Values: Checked for null or missing meter counts or boroughs. Rows with incomplete information were either removed or imputed as appropriate.
- Data Types: Verified and ensured all variables were in their appropriate formats: borough and rate class as categorical, meter counts as numerical.
I would like to explore the trend analysis from this dataset such: Are there noticeable peaks or dips in certain years? Is there a consistent relationship between gas consumption and costs over time? I chose this dataset specifically because in NYC, where winters can be severe and gas is the main source of heating for many households, heating expenses are a major concern. Understanding the effects on the economy and the environment is related to this issue. The analysis could contribute to energy-saving initiatives or better planning for heating infrastructure in NYC.

library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
setwd("/Users/itshannahuc/Desktop/Data 110")
HeatingGas_Consumption <-read_csv ("~/Desktop/Data 110/Heating_Gas_Consumption_And_Cost2010-2023.csv")
Rows: 230213 Columns: 25
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (20): Development Name, Borough, Account Name, Location, Meter AMR, Mete...
dbl  (5): EDP, UMIS BILL ID, # days, Current Charges, Consumption (Therms)

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
library(highcharter)
Registered S3 method overwritten by 'quantmod':
  method            from
  as.zoo.data.frame zoo 
colnames(HeatingGas_Consumption)
 [1] "Development Name"     "Borough"              "Account Name"        
 [4] "Location"             "Meter AMR"            "Meter Scope"         
 [7] "TDS #"                "EDP"                  "RC Code"             
[10] "Funding Source"       "AMP #"                "Vendor Name"         
[13] "UMIS BILL ID"         "Revenue Month"        "Service Start Date"  
[16] "Service End Date"     "# days"               "Meter Number"        
[19] "Estimated"            "Current Charges"      "Rate Class"          
[22] "Bill Analyzed"        "Consumption (Therms)" "ES Commodity"        
[25] "Underlying Utility"  

Check structure and summary

head(HeatingGas_Consumption)
# A tibble: 6 × 25
  `Development Name` Borough `Account Name` Location `Meter AMR` `Meter Scope`
  <chr>              <chr>   <chr>          <chr>    <chr>       <chr>        
1 ADAMS              BRONX   ADAMS          BLD 04   NONE        BLD 1-7      
2 ADAMS              BRONX   ADAMS          BLD 04   NONE        BLD 1-7      
3 ADAMS              BRONX   ADAMS          BLD 04   NONE        BLD 1-7      
4 ADAMS              BRONX   ADAMS          BLD 04   NONE        BLD 1-7      
5 ADAMS              BRONX   ADAMS          BLD 04   NONE        BLD 1-7      
6 ADAMS              BRONX   ADAMS          BLD 04   NONE        BLD 1-7      
# ℹ 19 more variables: `TDS #` <chr>, EDP <dbl>, `RC Code` <chr>,
#   `Funding Source` <chr>, `AMP #` <chr>, `Vendor Name` <chr>,
#   `UMIS BILL ID` <dbl>, `Revenue Month` <chr>, `Service Start Date` <chr>,
#   `Service End Date` <chr>, `# days` <dbl>, `Meter Number` <chr>,
#   Estimated <chr>, `Current Charges` <dbl>, `Rate Class` <chr>,
#   `Bill Analyzed` <chr>, `Consumption (Therms)` <dbl>, `ES Commodity` <chr>,
#   `Underlying Utility` <chr>
str(HeatingGas_Consumption)
spc_tbl_ [230,213 × 25] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
 $ Development Name    : chr [1:230213] "ADAMS" "ADAMS" "ADAMS" "ADAMS" ...
 $ Borough             : chr [1:230213] "BRONX" "BRONX" "BRONX" "BRONX" ...
 $ Account Name        : chr [1:230213] "ADAMS" "ADAMS" "ADAMS" "ADAMS" ...
 $ Location            : chr [1:230213] "BLD 04" "BLD 04" "BLD 04" "BLD 04" ...
 $ Meter AMR           : chr [1:230213] "NONE" "NONE" "NONE" "NONE" ...
 $ Meter Scope         : chr [1:230213] "BLD 1-7" "BLD 1-7" "BLD 1-7" "BLD 1-7" ...
 $ TDS #               : chr [1:230213] "118" "118" "118" "118" ...
 $ EDP                 : num [1:230213] 248 248 248 248 248 248 248 248 248 248 ...
 $ RC Code             : chr [1:230213] "B011800" "B011800" "B011800" "B011800" ...
 $ Funding Source      : chr [1:230213] "FEDERAL" "FEDERAL" "FEDERAL" "FEDERAL" ...
 $ AMP #               : chr [1:230213] "NY005001180P" "NY005001180P" "NY005001180P" "NY005001180P" ...
 $ Vendor Name         : chr [1:230213] "CONSOLIDATED EDISON COMPANY OF NY" "CONSOLIDATED EDISON COMPANY OF NY" "CONSOLIDATED EDISON COMPANY OF NY" "CONSOLIDATED EDISON COMPANY OF NY" ...
 $ UMIS BILL ID        : num [1:230213] 8526158 8628611 8723233 9616046 9616039 ...
 $ Revenue Month       : chr [1:230213] "2020-01" "2020-02" "2020-03" "2020-04" ...
 $ Service Start Date  : chr [1:230213] "12/24/2019" "01/24/2020" "02/25/2020" "03/25/2020" ...
 $ Service End Date    : chr [1:230213] "01/24/2020" "02/25/2020" "03/25/2020" "04/23/2020" ...
 $ # days              : num [1:230213] 31 32 29 29 29 32 7 23 29 32 ...
 $ Meter Number        : chr [1:230213] "3299599" "3299599" "3299599" "3299599" ...
 $ Estimated           : chr [1:230213] "N" "N" "Y" "Y" ...
 $ Current Charges     : num [1:230213] 83213 123544 114655 70949 20747 ...
 $ Rate Class          : chr [1:230213] "TRMDHDF" "TRMDHDF" "TRMDHDF" "TRMDHDF" ...
 $ Bill Analyzed       : chr [1:230213] "Yes" "Yes" "Yes" "Yes" ...
 $ Consumption (Therms): num [1:230213] 113350 162913 136940 114511 48166 ...
 $ ES Commodity        : chr [1:230213] "UTILITY GAS" "UTILITY GAS" "UTILITY GAS" "UTILITY GAS" ...
 $ Underlying Utility  : chr [1:230213] "ConEd" "ConEd" "ConEd" "ConEd" ...
 - attr(*, "spec")=
  .. cols(
  ..   `Development Name` = col_character(),
  ..   Borough = col_character(),
  ..   `Account Name` = col_character(),
  ..   Location = col_character(),
  ..   `Meter AMR` = col_character(),
  ..   `Meter Scope` = col_character(),
  ..   `TDS #` = col_character(),
  ..   EDP = col_double(),
  ..   `RC Code` = col_character(),
  ..   `Funding Source` = col_character(),
  ..   `AMP #` = col_character(),
  ..   `Vendor Name` = col_character(),
  ..   `UMIS BILL ID` = col_double(),
  ..   `Revenue Month` = col_character(),
  ..   `Service Start Date` = col_character(),
  ..   `Service End Date` = col_character(),
  ..   `# days` = col_double(),
  ..   `Meter Number` = col_character(),
  ..   Estimated = col_character(),
  ..   `Current Charges` = col_double(),
  ..   `Rate Class` = col_character(),
  ..   `Bill Analyzed` = col_character(),
  ..   `Consumption (Therms)` = col_double(),
  ..   `ES Commodity` = col_character(),
  ..   `Underlying Utility` = col_character()
  .. )
 - attr(*, "problems")=<externalptr> 
summary(HeatingGas_Consumption)
 Development Name     Borough          Account Name         Location        
 Length:230213      Length:230213      Length:230213      Length:230213     
 Class :character   Class :character   Class :character   Class :character  
 Mode  :character   Mode  :character   Mode  :character   Mode  :character  
                                                                            
                                                                            
                                                                            
                                                                            
  Meter AMR         Meter Scope           TDS #                EDP       
 Length:230213      Length:230213      Length:230213      Min.   :200.0  
 Class :character   Class :character   Class :character   1st Qu.:322.0  
 Mode  :character   Mode  :character   Mode  :character   Median :507.0  
                                                          Mean   :500.7  
                                                          3rd Qu.:650.0  
                                                          Max.   :999.0  
                                                                         
   RC Code          Funding Source        AMP #           Vendor Name       
 Length:230213      Length:230213      Length:230213      Length:230213     
 Class :character   Class :character   Class :character   Class :character  
 Mode  :character   Mode  :character   Mode  :character   Mode  :character  
                                                                            
                                                                            
                                                                            
                                                                            
  UMIS BILL ID      Revenue Month      Service Start Date Service End Date  
 Min.   : 1755537   Length:230213      Length:230213      Length:230213     
 1st Qu.: 2414370   Class :character   Class :character   Class :character  
 Median : 5244017   Mode  :character   Mode  :character   Mode  :character  
 Mean   : 5721687                                                           
 3rd Qu.: 8526052                                                           
 Max.   :12236444                                                           
                                                                            
     # days        Meter Number        Estimated         Current Charges   
 Min.   :-335.00   Length:230213      Length:230213      Min.   :-54559.0  
 1st Qu.:  29.00   Class :character   Class :character   1st Qu.:   327.3  
 Median :  30.00   Mode  :character   Mode  :character   Median :  1176.5  
 Mean   :  30.28                                         Mean   : 10370.9  
 3rd Qu.:  31.00                                         3rd Qu.:  6121.9  
 Max.   : 761.00                                         Max.   :982291.3  
 NA's   :9                                                                 
  Rate Class        Bill Analyzed      Consumption (Therms) ES Commodity      
 Length:230213      Length:230213      Min.   :  -1744      Length:230213     
 Class :character   Class :character   1st Qu.:    583      Class :character  
 Mode  :character   Mode  :character   Median :   2157      Mode  :character  
                                       Mean   :  20164                        
                                       3rd Qu.:  12564                        
                                       Max.   :2351667                        
                                                                              
 Underlying Utility
 Length:230213     
 Class :character  
 Mode  :character  
                   
                   
                   
                   

Data Cleaning

# Rename columns for clarity
data <- HeatingGas_Consumption %>%
  rename_with(~ gsub(" ", "_", .), everything())  

# Handle missing values 
data <- HeatingGas_Consumption %>%
  mutate(`Current Charges` = ifelse(is.na(`Current Charges`), median(`Current Charges`, na.rm = TRUE), `Current Charges`))

These cleaning procedures are essential for preserving data integrity and preventing mistakes during modeling and visualization.

Quantitative Variables

Distribution of Consumption (Therms)

ggplot(data, aes(x = `Consumption (Therms)`)) +
  geom_histogram(fill = "steelblue", color = "black", bins = 30) +
  labs(title = "Distribution of Gas Consumption (Therms)", x = "Therms", y = "Frequency") +
  theme_minimal()

The histogram shows the distribution of gas consumption (in therms) among developments and also shows the frequency of specific consumption levels. By emphasizing basic patterns, dispersion, and potential skewness, it highlights outliers or irregularities and offers information on typical consumption ranges. Interesting patterns include gaps and clusters that display atypical usage patterns, skewness that shows most projects consume either low or large volumes of gas, and peaks that represent normal consumption levels. Surprises might happen if the distribution exhibits extreme outliers, unexpected homogeneity, or consumption levels that differ from economic or seasonal expectations.

Categorical Variables

Count of Developments by Borough

ggplot(data, aes(x = Borough, fill = Borough)) +
  geom_bar() +
  labs(title = "Number of Developments by Borough", x = "Borough", y = "Count") +
  theme_classic() +
  scale_fill_brewer(palette = "Set3") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

This bar graph uses the Borough variable to aggregate and color-code the number of developments in each NYC borough. The number of developments is displayed on the y-axis, while the x-axis depicts the boroughs. Colors from the “Set3” palette and slanted axis labels for readability are used in the visualization’s simple, classic look. This graph illustrates the variations in the quantity of developments in each borough, which may indicate intriguing trends like a surprisingly uniform distribution or one borough having noticeably more developments than others. Unexpected outcomes might include boroughs with fewer projects than predicted or distinct regional patterns associated with the density of development.

Summarize average gas consumption by Borough

summary_data <- HeatingGas_Consumption %>%
  group_by(Borough) %>%
  summarize(Average_Consumption = mean(`Consumption (Therms)`, na.rm = TRUE),
            Total_Charges = sum(`Current Charges`, na.rm = TRUE)) %>%
  arrange(desc(Average_Consumption))

# Filter for high-consumption developments
high_consumption <- HeatingGas_Consumption %>%
  filter(`Consumption (Therms)` > 1000)

I summarize the boroughs with the greatest average gas use along with a comparison of their total charges. Larger developments or higher energy demands may be represented by boroughs with high average consumption. Inefficiencies can be found by comparing average consumption to total charges. For example, a borough with high consumption but relatively low prices may be the result of price differentials or discounts.

I filtered to only include projects with gas usage over 1000 therms. Only focuses on significant or energy-intensive developments. Helps in identifying instances where energy use is noticeably higher than usual.

1st Visualization (For Grading)

ggplot(summary_data, aes(x = reorder(Borough, -Average_Consumption), y = Average_Consumption, fill = Borough)) +
  geom_bar(stat = "identity") +
  geom_text(aes(label = round(Average_Consumption, 2)), vjust = -0.5, size = 2.5) +
  labs(title = "Average Gas Consumption by Borough",
       x = "Borough", 
       y = "Average Consumption (Therms)",
       caption = "Sources: NYC Open Data") +
  scale_fill_brewer(palette = "Set3") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 35, hjust = 1))

Observation

This bar graph displays the typical gas usage in each of NYC’s boroughs. Borough with the highest gas consumption:With an average gas usage of 43,420.44 therms, Queens has the greatest gas consumption. Notable Levels of Consumption: The next two largest users are Manhattan (21,500.56 therms) and Staten Island (27,574.83 therms). Minimal Utilization: Significantly lower gas consumption is shown by FHA (193.43 therms) and Non-Development Facility (1,348.22 therms), which may indicate that these facilities are either minor or anomalies. Middle Range: At 20,487.24 therms and 17,553.77 therms, respectively, the Bronx and Brooklyn have moderate levels of typical gas consumption.

Compared to other boroughs, Queens has a significantly higher level of consumption. There are many single-family houses in Queens, and these residences often require more gas for heating than apartment complexes. This might be the reason why Queens uses the most gas. On the other hand, cities such as Manhattan often have a greater number of apartment buildings, which frequently share heating systems (centralized systems) and use gas better per unit. It is noticeable how much less gas is used in the boroughs compared to the FHA or non-development facilities.Because non-development facilities and FHA are probably smaller, less energy-intensive establishments, such administrative buildings, temporary dwelling units, or underutilized government buildings, they use a lot less gas.

Borough-specific building styles, infrastructure, and population density may all have an impact on changes in gas use.

Statistic Analysis

Predicting Charges based on Consumption

model <- lm(`Current Charges` ~ `Consumption (Therms)` + Borough, data = data)
summary(model)

Call:
lm(formula = `Current Charges` ~ `Consumption (Therms)` + Borough, 
    data = data)

Residuals:
    Min      1Q  Median      3Q     Max 
-464464   -1357    -430    1525  745612 

Coefficients:
                                  Estimate Std. Error  t value Pr(>|t|)    
(Intercept)                      1.344e+03  4.218e+01   31.861  < 2e-16 ***
`Consumption (Therms)`           5.140e-01  4.272e-04 1203.182  < 2e-16 ***
BoroughBROOKLYN                 -2.908e+03  5.289e+01  -54.993  < 2e-16 ***
BoroughFHA                      -1.336e+03  1.641e+02   -8.143 3.87e-16 ***
BoroughMANHATTAN                 4.847e+01  5.818e+01    0.833    0.405    
BoroughNON DEVELOPMENT FACILITY -1.328e+03  1.930e+02   -6.882 5.90e-12 ***
BoroughQUEENS                   -1.912e+03  1.070e+02  -17.868  < 2e-16 ***
BoroughSTATEN ISLAND            -3.972e+03  1.579e+02  -25.153  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 9981 on 230205 degrees of freedom
Multiple R-squared:  0.8654,    Adjusted R-squared:  0.8654 
F-statistic: 2.114e+05 on 7 and 230205 DF,  p-value: < 2.2e-16

Current Charges are positively and significantly impacted by gas use. This makes obvious because increasing gas use naturally results in higher costs. Borough Differences: The majority of other boroughs, including Brooklyn, Queens, Staten Island, FHA, and Non-Development Facility, have far cheaper fees than the reference borough, which is probably Manhattan. The biggest fee drop (-3,972) is seen on Staten Island, with Brooklyn coming in second. Although the baseline costs in Manhattan seem to be the highest, this difference is not statistically significant.

The Adjusted R-squared is 0.8654 indicates that the predictors Borough and Consumption (Therms) account for 86.54% of the variation in Current Charges. This number is really high, suggesting that the model fits the data quite well.The high Adjusted R-squared indicates that even when using numerous borough categories, the model does not overfit.

Equation: Current_Charges = β0 + β1 * Consumption_Therms + β2 * Borough (Reference ChatGPT)

Diagnostic plots

# Diagnostic plots
par(mfrow = c(2, 2))
plot(model)

Summarize the count of Meter AMR by Rate Class and Borough

summarized_rate_data <- HeatingGas_Consumption %>%
  group_by(`Rate Class`, Borough) %>%
  summarize(
    Meter_Count = n_distinct(`Meter AMR`) 
  ) %>%
  ungroup()
`summarise()` has grouped output by 'Rate Class'. You can override using the
`.groups` argument.
print(summarized_rate_data)
# A tibble: 72 × 3
   `Rate Class`                  Borough   Meter_Count
   <chr>                         <chr>           <int>
 1 120 Del&Sup,Res,Gen Use       FHA                 1
 2 140 Del&Sup,Res,Home Heat     FHA                 1
 3 158 Del,Com,Multidwel,Heat    QUEENS              1
 4 177 Del,Com,Heat              QUEENS              1
 5 250 Gas Non Resid General Use QUEENS              1
 6 GS2                           BRONX               2
 7 GS2                           MANHATTAN           1
 8 GS2                           QUEENS              1
 9 GS2 General Non-Residential   BRONX               1
10 GS2 General Non-Residential   MANHATTAN           1
# ℹ 62 more rows

Data is grouped by Borough and Rate Class to illustrate the distinct meter counts. This enables the evaluation of the spatial and rate-type concentrations of gas consumption activities.The code is necessary to efficiently summarize and organize the data, producing a clear and useful dataset. This summary is a fundamental step in comprehending patterns of heating gas usage since it may guide decision-making, resource allocation, and additional analysis or visualizations.

2nd Visualization (For Grading)

heatmap_data <- summarized_rate_data %>%
  group_by(Borough, `Rate Class`) %>%
  summarize(Meter_Count = sum(Meter_Count, na.rm = TRUE)) %>%
  ungroup()
`summarise()` has grouped output by 'Borough'. You can override using the
`.groups` argument.
hc <- hchart(
  heatmap_data, 
  type = "heatmap", 
  hcaes(x = Borough, y = `Rate Class`, value = Meter_Count)
) %>%
  hc_title(text = "Heatmap: Meter Count Distribution by Borough and Rate Class") %>%
  hc_xAxis(title = list(text = "Borough")) %>%
  hc_yAxis(title = list(text = "Rate Class")) %>%
  hc_colorAxis(
    minColor = "#ffffcc",  # Light color for low values
    maxColor = "#ff0000",  # Dark color for high values
    stops = list(
      list(0, "#ffffcc"),
      list(0.5, "#f03b20"),
      list(1, "#bd0026")
    )
  ) %>%
  hc_tooltip(
    pointFormat = "<b>Borough:</b> {point.x}<br><b>Rate Class:</b> {point.y}<br><b>Meter Count:</b> {point.value}"
  ) %>%
  hc_add_annotation(
    labels = list(
      list(
        point = list(x = 0.5, y = -0.1, xAxis = 0, yAxis = 0),
        text = "Sources: NYC Open Data",
        style = list(color = "gray", fontSize = "12px")
      )
    )
  )

hc

High Concentration in Particular Rate Classes: In boroughs like the Bronx, Brooklyn, and Queens, some rate classes, such “TRMDHDF” and “T2-2 TRAN GENERAL,” display greater meter counts. Throughout several boroughs, the rate class “TRMDHDF” is very noticeable. Staten Island and Non-Development Facilities have extremely few rate classes with high meter counts. This indicates either a lower concentration of structures or fewer meters measured in certain regions. Queens and Manhattan show distinct patterns, with deeper shades clustered in specific rate classes, most likely due to differences in infrastructure or home heating demands. Manhattan has a wide yet low-intensity distribution, which might indicate a larger density of meters scattered across more buildings.Overall, the Bronx and Brooklyn appear to have more high-meter count clusters than the other boroughs, which might indicate denser growth and bigger structures. The heatmap clearly displays differences in meter count distributions between boroughs and rate classes. It identifies boroughs like the Bronx and Brooklyn as having larger numbers of meters, while other locations, such as Staten Island, have less. Patterns in various rate classes demonstrate how infrastructure or use rates vary around the city, offering useful information for resource planning and energy management.

Background Bibliography

Climate, building types, and population density are some of the elements that influence energy usage in New York State, which includes New York City. Natural gas is the main source of heating for New York’s residential structures, accounting for a sizeable amount of energy use, according to the U.S. Energy Information Administration (EIA). The paper points out that because older buildings are more common in urban regions like Manhattan and Brooklyn and require more energy to heat, household gas usage is greater there.When heating demands are at their highest during the winter, these places often see higher demand. On the other hand, average consumption is often lower in more suburban boroughs like Queens and Staten Island. Cold winters and other seasonal conditions make gas use even worse, particularly in places with high population densities where heating is necessary (U.S. EIA, n.d.). This site offers a thorough analysis of New York State’s energy consumption trends, with an emphasis on gas use in residential structures. It provides useful information on energy sources, consumption patterns, and the ways that urban infrastructure and climate affect regional differences in gas use.

Source: U.S. Energy Information Administration. (n.d.). State Energy Profile: New York. U.S. Department of Energy. Retrieved from https://www.eia.gov/state/print.php?sid=NY