I. Introduction
Since the discovery of modern energy sources, humans’ lives have
changed dramatically. The world entered an industrial revolution that
impacted our lives positively and negatively. A couple of decades ago,
we started understanding the negative impacts more clearly, specifically
climate change and its effects. This led scientists to begin
investigating other types of energy sources. Currently, the U.S is using
five different sources of energy:
1- Coal
2- Natural Gas
3- Petroleum
4- Renewal Energy: Solar
5- Nuclear Energy
Overview
For this project, I want to investigate energy demand and use
throughout the United States. I will look at the past 20 years and try
to figure out the following questions:
1- Which State consumes the highest level of energy?
2- Which State has the highest energy expenditure?
3- Is there a correlation between the State’s level of consumption
and expenditure?
4- Analyzing the State with the highest consumption, what energy
sources are being used?
5- What about Maryland? How are we doing?
6- Finally, I want to compare the best energy source (renewable) to
the most commonly used (petroleum), and see if there is a particular
usage trend overall.
Dataset and Resources
My dataset is a combination of several excel worksheets that were
compiled into one comma-separated values file.
Load the needed libraries and the
dataset
# load the libraries
library(tidyverse)
## Warning: package 'tidyverse' was built under R version 4.1.3
## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --
## v ggplot2 3.3.6 v purrr 0.3.4
## v tibble 3.1.7 v dplyr 1.0.9
## v tidyr 1.2.0 v stringr 1.4.0
## v readr 2.1.2 v forcats 0.5.1
## Warning: package 'ggplot2' was built under R version 4.1.3
## Warning: package 'tibble' was built under R version 4.1.3
## Warning: package 'tidyr' was built under R version 4.1.3
## Warning: package 'readr' was built under R version 4.1.3
## Warning: package 'purrr' was built under R version 4.1.3
## Warning: package 'dplyr' was built under R version 4.1.3
## Warning: package 'stringr' was built under R version 4.1.3
## Warning: package 'forcats' was built under R version 4.1.3
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(highcharter)
## Warning: package 'highcharter' was built under R version 4.1.3
## Registered S3 method overwritten by 'quantmod':
## method from
## as.zoo.data.frame zoo
## Highcharts (www.highcharts.com) is a Highsoft software product which is
## not free for commercial and Governmental use
library(dplyr)
library(RColorBrewer)
## Warning: package 'RColorBrewer' was built under R version 4.1.3
library(knitr)
## Warning: package 'knitr' was built under R version 4.1.3
library(ggpubr)
## Warning: package 'ggpubr' was built under R version 4.1.3
#Set Working station and load the datase.
setwd("C:/Users/mayss/Desktop/Data Science/Datasets")
energy <- read.csv('state_energy.csv')
Let us look at the dataset
structure
str(energy)
## 'data.frame': 1071 obs. of 12 variables:
## $ state : chr "AK" "AK" "AK" "AK" ...
## $ year : int 2000 2001 2002 2003 2004 2005 2006 2007 2008 2009 ...
## $ total_energy_consumption : int 742061 735745 732909 733471 775019 797723 746457 723227 650500 639557 ...
## $ total_energy_.expenditures : num 2842 2785 2525 2817 4148 ...
## $ energy_expenditures_percent_of_gdp : num 10.6 9.77 8.49 8.79 11.75 ...
## $ total_energy_expenditures_.pc : num 4525 4395 3931 4344 6292 ...
## $ gdp : int 39407 40959 42979 42355 44055 45657 49190 51721 51252 56215 ...
## $ coal_consumption : int 16455 15911 16429 12552 14075 13994 14981 13662 14705 14505 ...
## $ natural_gas_consumption : int 437972 413049 420808 415891 407944 434704 375719 372186 343938 343971 ...
## $ petroleum_consumption : int 275255 289334 277400 285417 334121 333091 342272 323089 278725 265166 ...
## $ nuclear_energy_consumed_for_electricity: int 0 0 0 0 0 0 0 0 0 0 ...
## $ renewable_energy_consumption : int 12375 17446 18269 19607 18876 15930 13481 14286 13126 15912 ...
What are the variables?
Variable <-c("State", "Year", "Total Consumption", "Total Energy Expenditure", "Expenditures as Percent of Current-Dollar GDP ", "Total Expenditure per Capita", "GDP", "Coal Consumption", "Natural Gas Consumption", "Petroleum Consumption", "Renewable Energy Consumption", "Nuclear Consumption" )
Description <-c("US State", "Years:2000-2021", "Total energy consumption (Billion Btu)", "Total energy expenditures (million dollars)", "Energy expenditures as percent of current-dollar Gross Domestic Product (GDP)", "Total energy expenditures per capita in dollars", "Real gross domestic product (GDP) in million Chained (2012) Dollars", "Coal total consumption in billion Btu", "Natural gas total consumption (excluding supplemental gaseous fuels) in billion Btu", "All petroleum products total consumption, excluding biofuels in billion Btu", "Renewable energy sources total consumption in billion Btu", "Nuclear energy consumed for electricity generation in billion Btu")
Variable_Table <-data.frame(Variable, Description)
knitr::kable(Variable_Table, "pipe", align = "l", "c")
State |
US State |
Year |
Years:2000-2021 |
Total Consumption |
Total energy consumption (Billion Btu) |
Total Energy Expenditure |
Total energy expenditures (million dollars) |
Expenditures as Percent of Current-Dollar GDP |
Energy expenditures as percent of current-dollar Gross
Domestic Product (GDP) |
Total Expenditure per Capita |
Total energy expenditures per capita in dollars |
GDP |
Real gross domestic product (GDP) in million Chained
(2012) Dollars |
Coal Consumption |
Coal total consumption in billion Btu |
Natural Gas Consumption |
Natural gas total consumption (excluding supplemental
gaseous fuels) in billion Btu |
Petroleum Consumption |
All petroleum products total consumption, excluding
biofuels in billion Btu |
Renewable Energy Consumption |
Renewable energy sources total consumption in billion
Btu |
Nuclear Consumption |
Nuclear energy consumed for electricity generation in
billion Btu |
II. Investigate Some
Answers
1- Which State had the highest energy
consumption and in which year?
As you can see, Texas takes the lead; California and Florida come
next, trailing by five billion BTUs or more. Most other states cluster
in a similar range consuming between 1.5 and 2.5 billion BTUs.
If we take a closer look, We can pinpoint the year of the
highest consumption.
# finding the maximum value of total energy consumption
energy[which.max(energy$total_energy_consumption),]
## state year total_energy_consumption total_energy_.expenditures
## 923 TX 2019 14234096 137807.1
## energy_expenditures_percent_of_gdp total_energy_expenditures_.pc gdp
## 923 7.39 4754.1 1785318
## coal_consumption natural_gas_consumption petroleum_consumption
## 923 992721 4795213 6846892
## nuclear_energy_consumed_for_electricity renewable_energy_consumption
## 923 431234 1071707
Texas’s highest energy consumption was in 2019, with a total
consumption of 11,829,075 billion BTUs!
2- Which state had the highest
expenditure in millions of dollars?
#Find the row with the maximum expenditure
energy[which.max(energy$total_energy_.expenditures),]
## state year total_energy_consumption total_energy_.expenditures
## 915 TX 2011 11829075 162237.9
## energy_expenditures_percent_of_gdp total_energy_expenditures_.pc gdp
## 915 12.1 6326.2 1353600
## coal_consumption natural_gas_consumption petroleum_consumption
## 915 1695239 3800578 5394068
## nuclear_energy_consumed_for_electricity renewable_energy_consumption
## 915 414881 524941
Once again, Texas takes the lead. However, it is surprising to see
that the highest total expenditure was in 2011, while Texas’s highest
consumption was in 2019! So here is a question, Is there a relationship
between the State’s consumption and expenditure?
I will construct a linear regression and calculate the correlation
to answer this question.
But first, let us narrow it down to the State of
Texas.
# creating a subset using the filter function
texas <- filter(energy, state == 'TX')
3- Is there a relationship between the
Texas’s energy consumption and its energy expenditures?
# Create a scattor plot with a non-default theme
p1 <- ggplot(texas, aes(x = total_energy_consumption , y = total_energy_.expenditures)) +
theme_light(base_size = 12) +
labs(title = "Consumption Vs. Expenditure For the State of Texas",
caption = "Source: U.S. Energy Information Administraion EIA\n") +
xlab("Total Energy Consumption in billion Btu") +
ylab("Total Energy Expenditures in million dollors")
# convert the numbers on the x-axis to a standard form
options(scipen = 999)
p2<- p1 + geom_point() + geom_smooth(method=lm,se=FALSE,fullrange=TRUE,color="purple")+
theme_classic()+
stat_regline_equation(label.y = 180000, aes(label = ..eq.label..)) +
stat_regline_equation(label.y = 175000, aes(label = ..rr.label..))
p2
## `geom_smooth()` using formula 'y ~ x'

As the visualization and the linear regression equation show, there
is almost no association between the State’s energy consumption and
expenditures. In fact, if we calculate the r value, we can see how weak
the correlation is
Calculating the correlation
cor(texas$total_energy_consumption, texas$total_energy_.expenditures)
## [1] 0.07184423
r is about 0.07. This is a very week correlation!
4- What type of energy sources does Texas
State use?
Since I am focusing on the energy sources, I will eliminate
the other variables.
texas_source <- texas[,!names(texas) %in% c("total_energy_.expenditures", "total_energy_expenditures_.pc", "gdp", "energy_expenditures_percent_of_gdp", "total_energy_consumption", "state")]
convert the wide dataset to a long one
texas_source_long <- texas_source %>%
pivot_longer( #converting columns -25, into one column called 'year'.
cols = 2:6,
names_to = c("energy_type"),
values_to = "billion_btu" #converting the amount consumed for each type of energy source to a single column, called 'billion_btu'.
)
texas_source_long
## # A tibble: 105 x 3
## year energy_type billion_btu
## <int> <chr> <int>
## 1 2000 coal_consumption 1548184
## 2 2000 natural_gas_consumption 4550106
## 3 2000 petroleum_consumption 5767121
## 4 2000 nuclear_energy_consumed_for_electricity 391670
## 5 2000 renewable_energy_consumption 101491
## 6 2001 coal_consumption 1492977
## 7 2001 natural_gas_consumption 4382935
## 8 2001 petroleum_consumption 5773604
## 9 2001 nuclear_energy_consumed_for_electricity 398535
## 10 2001 renewable_energy_consumption 102166
## # ... with 95 more rows
Let us plot
library(ggthemes)
## Warning: package 'ggthemes' was built under R version 4.1.3
library(ggrepel)
## Warning: package 'ggrepel' was built under R version 4.1.3
texas_source_long %>%
ggplot(aes(x = year, y = billion_btu)) +
geom_point(aes(color=energy_type), size = 2) +
scale_x_log10("Year") +
scale_y_log10("Total Consumption in Billion Btu") +
ggtitle("Texas Energy Sources 2000-2020\n") +
scale_color_discrete(name="Energy Source")+
theme_clean()+
scale_color_manual(labels = c("Coal", "Natural Gas", "Nuclear", "Petroleum", "Renewable"),
values = c("red", "blue", "orange", "purple", "green"))+
labs(col="Energy Sources")
## Scale for 'colour' is already present. Adding another scale for 'colour',
## which will replace the existing scale.

Even though petroleum usage is the highest, it is refreshing to see
that renewable energy is making its way up the chart.
5- What about Maryland?
I will create a subset of the original dataset to see how Maryland
is doing and only include the energy source columns. This will take
several steps.
# create a subset for MD
maryland <- filter(energy, state == 'MD')
# eliminate unneeded columns
maryland_source <- maryland[,!names(maryland) %in% c("total_energy_.expenditures", "total_energy_expenditures_.pc", "gdp", "energy_expenditures_percent_of_gdp", "total_energy_consumption", "state")]
# convert to long dataset
maryland_source_long <- maryland_source %>%
pivot_longer( #converting columns 2 to 5 into one column called 'energy_type'.
cols = 2:6,
names_to = c("energy_type"),
values_to = "billion_btu" #converting the amount consumed for each type of energy source to a single column, called 'billion_btu'.
)
# check it out!
maryland_source_long
## # A tibble: 105 x 3
## year energy_type billion_btu
## <int> <chr> <int>
## 1 2000 coal_consumption 312162
## 2 2000 natural_gas_consumption 219031
## 3 2000 petroleum_consumption 546998
## 4 2000 nuclear_energy_consumed_for_electricity 144204
## 5 2000 renewable_energy_consumption 54124
## 6 2001 coal_consumption 318851
## 7 2001 natural_gas_consumption 184783
## 8 2001 petroleum_consumption 566433
## 9 2001 nuclear_energy_consumed_for_electricity 142612
## 10 2001 renewable_energy_consumption 33277
## # ... with 95 more rows
I will use the time series in highchart again!
highchart() %>%
hc_add_series(data = maryland_source_long,
type = "line", hcaes(x = year,
y = billion_btu,
group = energy_type)) %>%
hc_xAxis(title = list(text="Year")) %>%
hc_yAxis(title = list(text="Total Energy Consumptuin in millions")) %>%
hc_title(text = "Maryland Energy Consumption 2000-2020") %>%
hc_plotOptions(series = list(marker = list(symbol = "circle"))) %>%
hc_legend(align = "right",
verticalAlign = "bottom")
While the overall consumption is lower than in Texas, looking at the
sources’ trends, it isn’t very reassuring that Maryland’s renewable and
nuclear energy usage is not increasing.
III. Reflection
With the rise of gas prices and the rapid climate change, we need to
seriously start using healthier and more economical energy sources. This
study showed that while more states are using renewable and nuclear
energy, we still need a heavier push in that direction. While my
comparison was based on the State’s GDP, it would have been more
accurate if I also had the population count for each State throughout
the 20 years.
This was an eye-opener study, as I thought Maryland was doing better
than what the chart showed. Hopefully, with more awareness, we become
one of the pioneer States in using more efficient energy sources.