Final Project: Exploring Electricity Consumption in the U.S.

Author

SMukrabine

source: https://www.bing.com/images/search?view=detailV2&insightstoken=ccid_vgAvEZOvcp_6B26AC0439783198E8E6AF256E56BBA2mid_884EB0294EFF67720A9B92FE19DBB5B95A8282BDthid_OIP.vgAvEZOvzwRPr29v2AM3QQHaEK&form=vissbi&vsimg=https%3A%2F%2Ftricloverindustries.co.ke%2Fwp-content%2Fuploads%2F2022%2F10%2Fsdfaads.jpg&iss=SBI&sbisrc=ImgDropper&idpbck=1&sbifsz=1500+x+844+%c2%b7+37.01+kB+%c2%b7+webp&sbifnm=file_show.webp&thw=1500&thh=844&ptime=107&dlen=50536&expw=1920&exph=1080&selectedindex=0&id=https%3A%2F%2Ftricloverindustries.co.ke%2Fwp-content%2Fuploads%2F2022%2F10%2Fsdfaads.jpg&ccid=vgAvEZOv&vt=3&sim=11&ck=6B26AC0439783198E8E6AF256E56BBA2&thid=OIP.vgAvEZOvzwRPr29v2AM3QQHaEK&mediaurl=https%3A%2F%2Ftricloverindustries.co.ke%2Fwp-content%2Fuploads%2F2022%2F10%2Fsdfaads.jpg&cdnurl=https%3A%2F%2Fth.bing.com%2Fth%2Fid%2FR.be002f1193afcf044faf6f6fd8033741%3Frik%3DvYKCWrm12xn%252bkg%26pid%3DImgRaw%26r%3D0&pivotparams=insightsToken%3Dbcid_TqSmntVr3koJDvZO.2h2iyxvWpag…..6sccid_pKae1Wve%26amp%3B%26amp%3Bcal%3D0%26amp%3Bcat%3D0%26amp%3Bcar%3D1%26amp%3Bcab%3D1%26amp%3Bann%3D%26amp%3Bhotspot%3D&first=1&q=imgurl%3Ahttps%3A%2F%2Ftricloverindustries.co.ke%2Fwp-content%2Fuploads%2F2022%2F10%2Fsdfaads.jpg&cal=0&cab=1&cat=0&car=1

Introduction

Specifically, here I analyze the electricity consumption and usage behavior of the electric utilities in the USA with reference to the CORGIS Electricity Dataset, which is from U.S. Energy Information Administration (EIA), https://corgis-edu.github.io/corgis/csv/electricity/. The dataset features self reported data about over 3,000 U.S. electric utilities. This data consists of aspects of utility attributes, electricity requirements, generation mode of electricity, power consumption, sales of electricity in the retail sector, and so on. I used this dataset because electricity is one of the most vital resources people have access to on a daily basis and studying how electricity is consumed in different states gives us insights about population size, economic activity, and demand for energy. The resulting dataset is well organized, it has real data from daily life, both numerical and categorical variables, and it can be used to clean, analyze and visualize the data. Specifically, it is the dataset which makes comparison possible between states electricity utilization, consistent with the primary task of this project.

The dataset had 38 variables and 3,174 observations as it is a large database. The data used in this investigation include: utility.state, retail.residential.sales, retail.commercial.sales, retail.industrial.sales, retail.total.sales, and uses.total. These variables are mainly quantitative numerical variables in terms of megawatt hours (MWh), while utility.state is a categorical variable that reflects U.S. state abbreviations. These variables were chosen because they directly measure electricity consumption and allow comparison of its levels across states. The data has been loaded into the R environment using the read_csv() function and has been subjected to manipulation, such as the R packages tidyverse, dplyr, ggplot2, and plotly that were used for data manipulation and visualization. This ensures uniformity, so to clean the data, all column names were also in lower case. The spaces in column names were replaced with underscores with gsub() function to get a quick reference for the variables in R. To gain insight regarding the types of variables and the distributions, a summary of the dataset was conducted on the variables. The extracted variables only for electricity consumption were selected, removing from the dataset the features that could not be considered. The number of sales of residential, commercial, and industrial electricity was summed together to obtain a new variable known as total_consumption. The data was further classified by utility.state and summarized to get total energy consumption per state. The method was used to concentrate the analysis on identifying the states that consume the most electricity without looking specifically at the utilities. NA was used to handle missing values with na.rm = TRUE to ensure correct results. This dataset does not contain a rich ReadMe file that summarizes the complete data collection process. However, EIA says this data was collected through the EIA-861 survey, by which electric utilities annually provide electricity sales and usage reports. This standardized methodology of reporting contributes to the integrity of the dataset and makes it suitable for studying electricity consumption in the U.S. states.

load Required Packages

library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.2     ✔ tibble    3.3.0
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.1.0     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(dplyr)
library(highcharter)
Registered S3 method overwritten by 'quantmod':
  method            from
  as.zoo.data.frame zoo 
Highcharts (www.highcharts.com) is a Highsoft software product which is
not free for commercial and Governmental use
library(RColorBrewer)
library(ggfortify)
library(ggplot2)
library(plotly)

Attaching package: 'plotly'

The following object is masked from 'package:ggplot2':

    last_plot

The following object is masked from 'package:stats':

    filter

The following object is masked from 'package:graphics':

    layout

load Dataset

setwd("C:/Users/sajut/OneDrive/Desktop/DATA_110")
electricity <- read_csv("electricity.csv")
Rows: 3174 Columns: 38
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (3): Utility.Name, Utility.State, Utility.Type
dbl (35): Utility.Number, Demand.Summer Peak, Demand.Winter Peak, Sources.Ge...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Clean up The Data

# Clean column names
names(electricity) <- tolower(names(electricity))
names(electricity) <- gsub(" ","_",names(electricity))

head(electricity)
# A tibble: 6 × 38
  utility.number utility.name      utility.state utility.type demand.summer_peak
           <dbl> <chr>             <chr>         <chr>                     <dbl>
1             34 City of Abbevill… SC            Municipal                  13.7
2             55 City of Aberdeen… MS            Municipal                  32.4
3             59 City of Abbevill… LA            Municipal                  28.9
4             84 A & N Electric C… VA            Cooperative               154  
5             87 City of Ada - (M… MN            Municipal                   2.1
6             97 Adams Electric C… IL            Cooperative                37  
# ℹ 33 more variables: demand.winter_peak <dbl>, sources.generation <dbl>,
#   sources.purchased <dbl>, sources.other <dbl>, sources.total <dbl>,
#   uses.retail <dbl>, uses.resale <dbl>, uses.no_charge <dbl>,
#   uses.consumed <dbl>, uses.losses <dbl>, uses.total <dbl>,
#   revenues.retail <dbl>, revenue.delivery <dbl>, revenue.resale <dbl>,
#   revenue.adjustments <dbl>, revenue.transmission <dbl>, revenue.other <dbl>,
#   revenue.total <dbl>, retail.residential.revenue <dbl>, …

Summarizing

summary(electricity)
 utility.number  utility.name       utility.state      utility.type      
 Min.   :   34   Length:3174        Length:3174        Length:3174       
 1st Qu.: 6281   Class :character   Class :character   Class :character  
 Median :12796   Mode  :character   Mode  :character   Mode  :character  
 Mean   :16299                                                           
 3rd Qu.:18335                                                           
 Max.   :63546                                                           
 demand.summer_peak demand.winter_peak sources.generation  sources.purchased  
 Min.   :    0.00   Min.   :    0.00   Min.   :   -51740   Min.   :        0  
 1st Qu.:    4.00   1st Qu.:    3.50   1st Qu.:        0   1st Qu.:    25467  
 Median :   20.60   Median :   18.50   Median :        0   Median :   118510  
 Mean   :  268.27   Mean   :  235.64   Mean   :   699959   Mean   :  1123244  
 3rd Qu.:   80.15   3rd Qu.:   75.05   3rd Qu.:        0   3rd Qu.:   428025  
 Max.   :29569.00   Max.   :27796.00   Max.   :209485522   Max.   :338991695  
 sources.other      sources.total        uses.retail         uses.resale       
 Min.   :-1239123   Min.   :        0   Min.   :        0   Min.   :        0  
 1st Qu.:       0   1st Qu.:    27679   1st Qu.:    25546   1st Qu.:        0  
 Median :       0   Median :   123182   Median :   114762   Median :        0  
 Mean   :    7138   Mean   :  1830341   Mean   :  1188013   Mean   :   579621  
 3rd Qu.:       0   3rd Qu.:   454580   3rd Qu.:   424538   3rd Qu.:        0  
 Max.   :16746871   Max.   :392375881   Max.   :138138510   Max.   :392229863  
 uses.no_charge     uses.consumed        uses.losses          uses.total       
 Min.   :     0.0   Min.   :      0.0   Min.   :0.000e+00   Min.   :        0  
 1st Qu.:     0.0   1st Qu.:      0.0   1st Qu.:3.905e+02   1st Qu.:    27679  
 Median :     0.0   Median :      0.0   Median :4.232e+03   Median :   123182  
 Mean   :   927.2   Mean   :   3203.9   Mean   :5.858e+04   Mean   :  1830341  
 3rd Qu.:     0.0   3rd Qu.:    439.8   3rd Qu.:1.654e+04   3rd Qu.:   454580  
 Max.   :533482.0   Max.   :2179687.0   Max.   :1.730e+07   Max.   :392375881  
 revenues.retail    revenue.delivery  revenue.resale     revenue.adjustments
 Min.   :       0   Min.   :      0   Min.   :       0   Min.   :-330287.3  
 1st Qu.:    2766   1st Qu.:      0   1st Qu.:       0   1st Qu.:      0.0  
 Median :   12564   Median :      0   Median :       0   Median :      0.0  
 Mean   :  117155   Mean   :   8700   Mean   :   24057   Mean   :   -322.1  
 3rd Qu.:   43695   3rd Qu.:      0   3rd Qu.:       0   3rd Qu.:      0.0  
 Max.   :11088125   Max.   :4913677   Max.   :17785527   Max.   :  70119.7  
 revenue.transmission revenue.other       revenue.total     
 Min.   :     0       Min.   :-180145.1   Min.   :       0  
 1st Qu.:     0       1st Qu.:      0.0   1st Qu.:    2830  
 Median :     0       Median :      0.0   Median :   13111  
 Mean   :  3532       Mean   :   3586.6   Mean   :  156708  
 3rd Qu.:     0       3rd Qu.:    454.6   3rd Qu.:   46209  
 Max.   :910441       Max.   :1933051.0   Max.   :19725962  
 retail.residential.revenue retail.residential.sales
 Min.   :      0            Min.   :       0        
 1st Qu.:   1115            1st Qu.:    9420        
 Median :   4954            Median :   43092        
 Mean   :  58933            Mean   :  485023        
 3rd Qu.:  21115            3rd Qu.:  175659        
 Max.   :6655476            Max.   :60338973        
 retail.residential.customers retail.commercial.revenue retail.commercial.sales
 Min.   :      0.0            Min.   :      0.0         Min.   :        0      
 1st Qu.:    972.2            1st Qu.:    662.2         1st Qu.:     5812      
 Median :   3840.0            Median :   3095.9         Median :    27574      
 Mean   :  46664.3            Mean   :  47006.2         Mean   :   542879      
 3rd Qu.:  14838.0            3rd Qu.:  11383.5         3rd Qu.:   109026      
 Max.   :4845482.0            Max.   :6473222.8         Max.   :104394989      
 retail.commercial.customers retail.industrial.revenue retail.industrial.sales
 Min.   :     0.0            Min.   :   -216.0         Min.   :       0       
 1st Qu.:   160.5            1st Qu.:      0.0         1st Qu.:       0       
 Median :   566.0            Median :    928.5         Median :    9870       
 Mean   :  6491.2            Mean   :  19667.5         Mean   :  339542       
 3rd Qu.:  2076.0            3rd Qu.:   6449.7         3rd Qu.:   81182       
 Max.   :619752.0            Max.   :2717660.0         Max.   :33929014       
 retail.industrial.customers retail.transportation.revenue
 Min.   :     0.0            Min.   :     0.0             
 1st Qu.:     0.0            1st Qu.:     0.0             
 Median :     6.0            Median :     0.0             
 Mean   :   315.1            Mean   :   240.8             
 3rd Qu.:    37.0            3rd Qu.:     0.0             
 Max.   :141310.0            Max.   :205131.0             
 retail.transportation.sales retail.transportation.customers
 Min.   :      0             Min.   :0.00000                
 1st Qu.:      0             1st Qu.:0.00000                
 Median :      0             Median :0.00000                
 Mean   :   4356             Mean   :0.03466                
 3rd Qu.:      0             3rd Qu.:0.00000                
 Max.   :2851057             Max.   :7.00000                
 retail.total.revenue retail.total.sales  retail.total.customers
 Min.   :       0     Min.   :        0   Min.   :      0       
 1st Qu.:    2766     1st Qu.:    25555   1st Qu.:   1203       
 Median :   12608     Median :   114944   Median :   4796       
 Mean   :  125848     Mean   :  1371799   Mean   :  53476       
 3rd Qu.:   43712     3rd Qu.:   424735   3rd Qu.:  17505       
 Max.   :12799425     Max.   :138138510   Max.   :5524077       

Select only the columns which is needed for this project

electricity_clean <- electricity |>
  select(utility.state, retail.residential.sales, retail.commercial.sales, retail.industrial.sales,
         retail.total.sales, uses.total)
head(electricity_clean)
# A tibble: 6 × 6
  utility.state retail.residential.sales retail.commercial.sales
  <chr>                            <dbl>                   <dbl>
1 SC                               33000                   25000
2 MS                               34781                   48943
3 LA                               59232                   32466
4 VA                              360082                  167766
5 MN                                9873                    7540
6 IL                              118628                   57474
# ℹ 3 more variables: retail.industrial.sales <dbl>, retail.total.sales <dbl>,
#   uses.total <dbl>

Correlation Analysis

p1 <- ggplot(electricity_clean, aes(x = retail.total.sales, y = uses.total)) +
  labs(title = "Correlation Between Retail Electricity Sales and Total Usage by State",
       caption = "Source: U.S. Electricity Dataset",
       x = "Total Retail Electricity Sales (MWh)", 
       y = "Total Electricity Usage (MWh)") +
  theme_minimal(base_size = 12)

p1 + geom_point()

This scatterplot shows a moderate positive correlation between retail electricity sales and total usage.

Scatterplot appears to show a correlation

p2 <- p1 + 
  geom_point() + xlim(0, 600) + ylim(0, 600)

p2
Warning: Removed 3146 rows containing missing values or values outside the scale range
(`geom_point()`).

Adding a smoother in red with a confidence interval

p3 <- p2 + geom_smooth(method='lm',formula=y~x)
p3
Warning: Removed 3146 rows containing non-finite outside the scale range
(`stat_smooth()`).
Warning: Removed 3146 rows containing missing values or values outside the scale range
(`geom_point()`).
Warning: Removed 8 rows containing missing values or values outside the scale range
(`geom_smooth()`).

Correlation Number

cor(electricity_clean$retail.total.sales, electricity_clean$uses.total)
[1] 0.5481364

Linear Regression Model

fit1 <- lm(uses.total ~ retail.total.sales, 
           data = electricity_clean)

summary(fit1)

Call:
lm(formula = uses.total ~ retail.total.sales, data = electricity_clean)

Residuals:
      Min        1Q    Median        3Q       Max 
-58429134   -432611   -429900   -421217 391794250 

Coefficients:
                    Estimate Std. Error t value Pr(>|t|)    
(Intercept)        4.329e+05  1.870e+05   2.315   0.0207 *  
retail.total.sales 1.019e+00  2.760e-02  36.910   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 10320000 on 3172 degrees of freedom
Multiple R-squared:  0.3005,    Adjusted R-squared:  0.3002 
F-statistic:  1362 on 1 and 3172 DF,  p-value: < 2.2e-16
autoplot(fit1, 1:4, nrow = 2, ncol = 2)

Statiscal Analysis

With uses.total as the dependent variable and retail.total.sales as the independent variable. The regression model is: Uses.Total = 432,900 + 1.019 × Retail.Total.Sales. The correlation of the two measures was 0.548, which is a medium positive linear relationship. retail.total.sales is statistically significant (p-value: < 2.2e-16), indicating retail electricity sales are a strong predictor of total electricity use. The intercept is also significant (p = 0.0207). The model adjusted R-squared is 0.3002, and it can be observed that retail electricity sales explain about 30% of the variance in electricity usage. Though this means a significant relationship, it also indicates that some other variables that have not been factored into the model affect electricity usage. Diagnostic plots constructed by autoplot() reveal no gross violations of linear regression assumptions, although residuals spread is showing some dispersion at higher values. These calculations yield an overall conclusion that higher retail electricity sales are associated with higher total electricity usage and so, give a reasonable but incomplete conclusion on the pattern of total electricity consumption.

Top 15 States

top_states <- electricity |>
  mutate(total_consumption = retail.residential.sales +
           retail.commercial.sales +
           retail.industrial.sales) |>
  group_by(utility.state) |>
  summarize(total_consumption = sum(retail.total.sales, na.rm = TRUE)) |>
  arrange(desc(total_consumption)) |>
  slice_head(n = 15) #help from data 101

head(top_states)
# A tibble: 6 × 2
  utility.state total_consumption
  <chr>                     <dbl>
1 TX                    502050278
2 CA                    331716319
3 OH                    273877518
4 PA                    218196746
5 NY                    216840169
6 FL                    205960537

Visualizations

Plot 1: Static Bar Plot

ggplot(top_states, aes(x = reorder(utility.state, total_consumption), y = total_consumption,
                       fill = total_consumption)) +
  geom_bar(stat = "identity", color = "#888888") +
  scale_fill_gradient(low = "#a3b7ca",high = "#79021c") +
  labs(title = "Top 15 States by Total Electricity Consumption",
       caption = "Source: U.S. Energy Information Administration (EIA)",
       x = "State",
       y = "Total Electricity Consumption (MWh)"
       ) +
  theme_minimal(base_size = 12) 

This bar-plot shows the top 15 states with the biggest electricity consumption. To make this visualisation, I had to cross state comparison of electricity and visualize which states consume the most power. X-axis shows states and Y-axis shows total electricity consumption in megawatt-hours. Different colors indicate different amounts of consumption of electricity, from different color levels that give a variety of contrasts in usage figures and make clearly visible the different states. After changing the default theme to make it easier for an audience to read, I incorporated a caption indicating where the data came from. This visualization also shows us that states such as Texas and California are consuming much more electricity than the rest of the states by much larger amounts suggesting that overall demand for power comes from its total value.

Plot 2: Highcharter Bar Plot

# set color 
cols <- c("dodgerblue")

highchart() |>
  hc_add_series(
    data = top_states$total_consumption,
    name = "Total Electricity Consumption (MWh)",
    type = "bar"
  ) |>
  hc_xAxis(
    categories = top_states$utility.state,
    title = list(text = "State")
  ) |>
  hc_yAxis(
    title = list(text = "Total Electricity Consumption (MWh)")) |>
  hc_chart(style = list(
    fontFamily = "AvantGarde", 
    fontWeight = "bold"
    )
  )

I then created an interactive bar chart using Highcharter where I displayed the total electricity consumption by state. The x-axis displays the states, while the y-axis indicates total electricity consumption in megawatt-hours (MWh). Only one color was used to make this chart straightforward and easy to read, and a bold font helped in understanding the information better. Because the chart is interactive, you can hover over the bars to see exact values. This chart shows that Texas and California consume much more electricity than the other states, supporting the results from the earlier bar plot.

3D Scatter Plot with Plotly

#source: https://plotly.com/r/3d-scatter-plots/
#https://plotly.com/r/reference/


fig <- plot_ly(                 #fig are using for 3d disply plot  
  data = top_states,
  x = ~utility.state,         #x-axis shows state abbreviations
  y = ~total_consumption,       #y-axis shows total electricity consumption 
  z = ~rank,                  #z-axis shows the rank of each state  
  marker = list(                #marker (dot) settings
    color = ~total_consumption,
    showscale = TRUE ))       #show color scale legend

#Add points (markers) to the 3D plot
fig <- fig %>% add_markers()

fig <- fig %>% layout(
  title = "Top 15 States by Electricity Consumption",
  caption = "Source: U.S. Energy Information Administration (EIA)",
  scene = list(                                             #3D scene settings
    xaxis = list(title = "State"),                    #label for x-axis, y-axis, z-axis
    yaxis = list(title = "Total Electricity Consumption (MWh)"),
    zaxis = list(title = "Rank")
  ),
  annotations = list(                     #add extra text annotation 
    x = 1.13,                            #X position and y position of annotation
    y = 1.05,
    text = "Total Electricity Consumption (MWh)",
    xref = "paper",                    #use overall plot coordinates
    yref = "paper",
    showarrow = FALSE
  )
)

fig
Warning: 'layout' objects don't have these attributes: 'caption'
Valid attributes include:
'_deprecated', 'activeshape', 'annotations', 'autosize', 'autotypenumbers', 'calendar', 'clickmode', 'coloraxis', 'colorscale', 'colorway', 'computed', 'datarevision', 'dragmode', 'editrevision', 'editType', 'font', 'geo', 'grid', 'height', 'hidesources', 'hoverdistance', 'hoverlabel', 'hovermode', 'images', 'legend', 'mapbox', 'margin', 'meta', 'metasrc', 'modebar', 'newshape', 'paper_bgcolor', 'plot_bgcolor', 'polar', 'scene', 'selectdirection', 'selectionrevision', 'separators', 'shapes', 'showlegend', 'sliders', 'smith', 'spikedistance', 'template', 'ternary', 'title', 'transition', 'uirevision', 'uniformtext', 'updatemenus', 'width', 'xaxis', 'yaxis', 'boxmode', 'barmode', 'bargap', 'mapType'

I tried out a new sort of approach that we didn’t cover in class, the interactive 3D scatter plot using Plotly. On my visualization, I showed the leading states by level of electricity consumption. The x-axis is the states, the y-axis is the aggregate electricity consumption in megawatt-hours (MWh), and the z-axis is the state rank. I used color to show different levels of electricity consumption so that higher-consuming states were more pronounced. As for the plot, this is interactive; you can rotate this plot and explore data from different perspectives. I modified this code from the Plotly 3D scatter plot documentation, and this visualization demonstrates that I can use techniques outside of the education I received in class.

Conclusion

In this project, I looked at electricity use across the United States. My linear regression showed that retail electricity sales can predict total usage, though other factors also matter. The charts, both static and interactive, showed that Texas and California use the most electricity. Making a 3D scatter plot helped me try to learn a new visualization technique. Overall, this project taught me how to clean data, run models, and create clear visualizations using a real world dataset.