In this project I analyze the electricity consumption and usage patterns of electric utilities in the United States using the CORGIS Electricity Dataset (https://corgis-edu.github.io/corgis/csv/electricity/), which comes from the U.S. Energy Information Administration (EIA). The dataset contains self-reported data on more than 3,000 U.S. electric utilities, covering utility attributes, electricity demand, generation sources, electricity uses, and retail sales, among other things. I chose this dataset because electricity is one of the most vital resources people use every day, and studying how consumption differs across states offers insight into population size, economic activity, and energy demand. The dataset is well organized, reflects real-world data, contains both numerical and categorical variables, and lends itself to cleaning, analysis, and visualization. Most importantly, it allows electricity use to be compared across states, which is the primary goal of this project.
The dataset is fairly large, with 38 variables and 3,174 observations. The variables used in this analysis are utility.state, retail.residential.sales, retail.commercial.sales, retail.industrial.sales, retail.total.sales, and uses.total. All of these are quantitative variables measured in megawatt-hours (MWh), except utility.state, which is a categorical variable holding U.S. state abbreviations. I chose them because they directly measure electricity consumption and allow it to be compared across states.
The data was loaded into R with read_csv(), and the tidyverse, dplyr, ggplot2, and plotly packages were used for data manipulation and visualization. To clean the data, all column names were converted to lower case for consistency, and spaces in column names were replaced with underscores using gsub() so the variables are easier to reference in R. A summary of the dataset was then run to check variable types and distributions. Only the variables related to electricity consumption were kept; the remaining columns were dropped. Residential, commercial, and industrial sales were summed to create a new variable, total_consumption. The data was then grouped by utility.state and summarized to obtain total consumption per state, which focuses the analysis on the states that consume the most electricity rather than on individual utilities. Missing values (NA) were excluded from the sums with na.rm = TRUE so that they do not turn the totals into NA. The dataset does not come with a detailed README describing the full data collection process.
However, according to the EIA, the data was collected through the EIA-861 survey, in which electric utilities report their electricity sales and usage every year. This standardized reporting method supports the integrity of the dataset and makes it suitable for studying electricity consumption across U.S. states.
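The cleaning and aggregation steps described above can be sketched roughly as follows. This is a minimal illustration, not the report's actual code: it uses a small made-up data frame in place of the real CSV, and it uses base R (rowSums() and aggregate()) rather than dplyr so that it runs without extra packages. The object names electricity, sales_cols, and state_totals are hypothetical.

```r
# Toy stand-in for the raw CSV. Values are invented; the column names follow
# the pattern visible in the read_csv() output (e.g. "Utility.State").
electricity <- data.frame(
  Utility.State            = c("TX", "TX", "CA", "NY"),
  `Demand.Summer Peak`     = c(10, 20, 30, 40),
  Retail.Residential.Sales = c(100, 200, 300, NA),
  Retail.Commercial.Sales  = c(50, 60, 70, 80),
  Retail.Industrial.Sales  = c(25, NA, 35, 45),
  check.names = FALSE      # keep the space in "Demand.Summer Peak"
)

# Lower-case all column names, then replace spaces with underscores
names(electricity) <- gsub(" ", "_", tolower(names(electricity)))

# Sum the three retail sectors row by row; na.rm = TRUE skips missing values
sales_cols <- c("retail.residential.sales",
                "retail.commercial.sales",
                "retail.industrial.sales")
electricity$total_consumption <- rowSums(electricity[, sales_cols], na.rm = TRUE)

# Collapse utilities into one total per state
state_totals <- aggregate(total_consumption ~ utility.state,
                          data = electricity, FUN = sum)
state_totals
```

Without na.rm = TRUE, any utility with a missing sector would get an NA total, and that NA would carry into its state's sum.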
Load Required Packages
library(tidyverse)
library(dplyr)
library(highcharter)
library(plotly)
Rows: 3174 Columns: 38
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (3): Utility.Name, Utility.State, Utility.Type
dbl (35): Utility.Number, Demand.Summer Peak, Demand.Winter Peak, Sources.Ge...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
fit1 <- lm(uses.total ~ retail.total.sales, data = electricity_clean)
summary(fit1)
Call:
lm(formula = uses.total ~ retail.total.sales, data = electricity_clean)
Residuals:
      Min        1Q    Median        3Q       Max
-58429134   -432611   -429900   -421217 391794250
Coefficients:
                    Estimate Std. Error t value Pr(>|t|)
(Intercept)        4.329e+05  1.870e+05   2.315   0.0207 *
retail.total.sales 1.019e+00  2.760e-02  36.910   <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 10320000 on 3172 degrees of freedom
Multiple R-squared: 0.3005, Adjusted R-squared: 0.3002
F-statistic: 1362 on 1 and 3172 DF, p-value: < 2.2e-16
library(ggfortify)  # provides the autoplot() method for lm objects
autoplot(fit1, which = 1:4, nrow = 2, ncol = 2)
Statistical Analysis
I fit a simple linear regression with uses.total as the dependent variable and retail.total.sales as the independent variable. The fitted model is uses.total = 432,900 + 1.019 × retail.total.sales. The correlation between the two variables is 0.548 (the square root of the multiple R-squared, 0.3005), which indicates a moderate positive linear relationship. The coefficient on retail.total.sales is statistically significant (p < 2.2e-16), so retail electricity sales are a significant predictor of total electricity use; the intercept is also significant (p = 0.0207). The adjusted R-squared is 0.3002, meaning retail electricity sales explain about 30% of the variance in total electricity use. The relationship is therefore real but incomplete: variables not included in the model also affect electricity use. The diagnostic plots from autoplot() show no gross violation of linearity, but the residual spread increases at higher fitted values and the residual summary is strongly right-skewed (median about -430,000 MWh against a maximum of about 392 million MWh), so the constant-variance and normality assumptions should be treated with caution. Overall, higher retail electricity sales are associated with higher total electricity use, which gives a reasonable but incomplete picture of total electricity consumption.
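The link between the reported correlation (0.548) and the multiple R-squared (0.3005) holds for any regression with a single predictor: the squared Pearson correlation equals R-squared. The sketch below checks this on simulated data, since the cleaned dataset is not reproduced here; x, y, and fit are hypothetical names and the simulated values are not meant to resemble the utility data.

```r
set.seed(1)                                # make the simulation reproducible
x <- runif(200, min = 0, max = 100)        # hypothetical predictor
y <- 5 + 1.0 * x + rnorm(200, sd = 40)     # hypothetical noisy response

fit <- lm(y ~ x)
r  <- cor(x, y)                            # Pearson correlation
r2 <- summary(fit)$r.squared               # multiple R-squared of the model

# With one predictor, r^2 and R-squared agree up to floating-point error
isTRUE(all.equal(r^2, r2))

# The same relationship applied to the reported R-squared
round(sqrt(0.3005), 3)                     # 0.548
```

The same identity does not hold for adjusted R-squared, which is why the check uses 0.3005 rather than 0.3002.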
# A tibble: 6 × 2
  utility.state total_consumption
  <chr>                     <dbl>
1 TX                    502050278
2 CA                    331716319
3 OH                    273877518
4 PA                    218196746
5 NY                    216840169
6 FL                    205960537
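The plots that follow use a top_states object with a rank column, but the code that builds it is not shown. One way such a table could be built from the state totals is sketched below in base R. It uses the six totals printed above, deliberately shuffled, and keeps only the top 3 to stay short (the report uses 15). The names state_totals, n_top, and ordered, and the way rank is assigned, are assumptions.

```r
# State totals from the printed table, in shuffled order
state_totals <- data.frame(
  utility.state     = c("NY", "TX", "FL", "CA", "PA", "OH"),
  total_consumption = c(216840169, 502050278, 205960537,
                        331716319, 218196746, 273877518)
)

n_top <- 3  # the report keeps 15 states; 3 keeps this example small

# Sort from highest to lowest consumption and keep the first n_top rows
ordered    <- state_totals[order(-state_totals$total_consumption), ]
top_states <- head(ordered, n_top)

# Rank 1 is the highest-consuming state
top_states$rank <- seq_len(nrow(top_states))
rownames(top_states) <- NULL
top_states
```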
Visualizations
Plot 1: Static Bar Plot
ggplot(top_states, aes(x = reorder(utility.state, total_consumption),
                       y = total_consumption,
                       fill = total_consumption)) +
  geom_bar(stat = "identity", color = "#888888") +
  scale_fill_gradient(low = "#a3b7ca", high = "#79021c") +
  labs(
    title = "Top 15 States by Total Electricity Consumption",
    caption = "Source: U.S. Energy Information Administration (EIA)",
    x = "State",
    y = "Total Electricity Consumption (MWh)"
  ) +
  theme_minimal(base_size = 12)
This bar plot shows the 15 states with the highest electricity consumption. I made it to compare consumption across states and to see which states use the most power. The x-axis shows the states and the y-axis shows total electricity consumption in megawatt-hours (MWh). The fill color follows a gradient from light blue-gray to dark red as consumption increases, which makes the differences between states easier to see. I replaced the default theme with theme_minimal() to make the plot easier to read and added a caption citing the data source. The plot shows that Texas and California consume far more electricity than the other states, so a few large states account for a disproportionate share of total demand.
I then created an interactive bar chart using Highcharter where I displayed the total electricity consumption by state. The x-axis displays the states, while the y-axis indicates total electricity consumption in megawatt-hours (MWh). Only one color was used to make this chart straightforward and easy to read, and a bold font helped in understanding the information better. Because the chart is interactive, you can hover over the bars to see exact values. This chart shows that Texas and California consume much more electricity than the other states, supporting the results from the earlier bar plot.
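The code for the Highcharter chart is not included above. A chart of this kind could be built roughly as follows; this is a sketch, not the original code. It uses a three-row stand-in for top_states, and the bar color, series name, and chart title are assumptions. The chart is only built if the highcharter package is installed.

```r
# Three-row stand-in for top_states (values from the table printed earlier)
top_states <- data.frame(
  utility.state     = c("TX", "CA", "OH"),
  total_consumption = c(502050278, 331716319, 273877518)
)

if (requireNamespace("highcharter", quietly = TRUE)) {
  library(highcharter)

  # Single-color column chart; hovering over a bar shows the exact value
  hc <- hchart(
    top_states, "column",
    hcaes(x = utility.state, y = total_consumption),
    color = "#79021c",                      # assumed color
    name  = "Total consumption (MWh)"       # assumed series name
  )
  hc <- hc_title(hc, text = "Total Electricity Consumption by State")
  hc <- hc_xAxis(hc, title = list(text = "State"))
  hc <- hc_yAxis(hc, title = list(text = "Total Electricity Consumption (MWh)"))
  hc <- hc_credits(hc, enabled = TRUE,
                   text = "Source: U.S. Energy Information Administration (EIA)")
}
```

Typing hc at the console, or leaving it as the last line of a chunk, renders the interactive chart.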
3D Scatter Plot with Plotly
# Sources: https://plotly.com/r/3d-scatter-plots/ and https://plotly.com/r/reference/
fig <- plot_ly(                     # fig holds the 3D plot object
  data = top_states,
  x = ~utility.state,               # x-axis: state abbreviations
  y = ~total_consumption,           # y-axis: total electricity consumption
  z = ~rank,                        # z-axis: rank of each state
  marker = list(                    # marker (dot) settings
    color = ~total_consumption,
    showscale = TRUE                # show the color scale legend
  )
)
# Add points (markers) to the 3D plot
fig <- fig %>% add_markers()
fig <- fig %>% layout(
  title = "Top 15 States by Electricity Consumption",
  scene = list(                     # 3D scene settings: axis labels
    xaxis = list(title = "State"),
    yaxis = list(title = "Total Electricity Consumption (MWh)"),
    zaxis = list(title = "Rank")
  ),
  annotations = list(
    list(                           # label for the color scale
      x = 1.13, y = 1.05,           # position in overall plot ("paper") coordinates
      text = "Total Electricity Consumption (MWh)",
      xref = "paper", yref = "paper",
      showarrow = FALSE
    ),
    list(                           # layout() has no caption attribute, so the source is an annotation
      x = 0, y = -0.05,
      text = "Source: U.S. Energy Information Administration (EIA)",
      xref = "paper", yref = "paper",
      showarrow = FALSE
    )
  )
)
fig
I also tried an approach we did not cover in class: an interactive 3D scatter plot built with Plotly. The plot shows the leading states by electricity consumption. The x-axis is the state, the y-axis is total electricity consumption in megawatt-hours (MWh), and the z-axis is the state's rank. Marker color encodes consumption, so higher-consuming states stand out. Because the plot is interactive, it can be rotated to explore the data from different angles. I adapted the code from the Plotly 3D scatter plot documentation, and the visualization shows that I can apply techniques beyond what was covered in class.
Conclusion
In this project, I looked at electricity use across the United States. My linear regression showed that retail electricity sales are a significant predictor of total usage, though they explain only about 30% of its variance, so other factors also matter. The charts, both static and interactive, showed that Texas and California use the most electricity. Making a 3D scatter plot helped me learn a new visualization technique. Overall, this project taught me how to clean data, run models, and create clear visualizations using a real-world dataset.