Analyzing Coal Production and Commercial Coal Consumption in the U.S.
Author
Smukrabine
Introduction
In this project, I looked at the connection between coal production (in tons) and commercial coal consumption (also in tons) across multiple U.S. states and years. The dataset comes from the U.S. Energy Information Administration (EIA), the U.S. government's source of official energy statistics: https://www.eia.gov/state/seds/seds-data-complete.php
I concentrated on the following variables:
Year – the calendar year of the observation.
State – the U.S. state that the production and consumption figures refer to.
Production_Coal – total coal production (tons).
Consumption_Commercial_Coal – total coal consumed by the commercial sector (tons).
I wanted to see how coal production and coal use relate over time and across states. I also planned to build a simple linear regression model that could predict how much coal will be consumed based on production, year, and state.
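A model of the kind described here could be sketched as follows. This is only an illustration under stated assumptions: the data frame `toy` and its values are made up, and only the column names follow the variables listed above. It uses base R's `lm()` with `State` as a categorical predictor.

```r
# Made-up toy data; only the column names mirror the project's variables
set.seed(42)
toy <- data.frame(
  State = factor(rep(c("A", "B", "C"), each = 10)),
  Year = rep(2001:2010, times = 3),
  Production_Coal = runif(30, min = 100, max = 1000)
)
# Simulated consumption that rises with production, plus noise
toy$Consumption_Commercial_Coal <-
  5 + 0.02 * toy$Production_Coal + rnorm(30, sd = 1)

# Linear regression: consumption explained by production, year, and state
model <- lm(Consumption_Commercial_Coal ~ Production_Coal + Year + State,
            data = toy)
summary(model)

# Predict consumption for a new (hypothetical) state-year observation
new_obs <- data.frame(
  State = factor("B", levels = levels(toy$State)),
  Year = 2011,
  Production_Coal = 500
)
predicted <- predict(model, newdata = new_obs)
predicted
```

Because `State` is a factor, `lm()` estimates one coefficient per state relative to a baseline state, so the production and year effects are estimated after accounting for differences between states.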
Loading Necessary Libraries
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.5
✔ forcats 1.0.0 ✔ stringr 1.5.1
✔ ggplot2 3.5.2 ✔ tibble 3.3.0
✔ lubridate 1.9.4 ✔ tidyr 1.3.1
✔ purrr 1.1.0
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
Rows: 3060 Columns: 84
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): State
dbl (83): Year, Production.Coal, Consumption.Commercial.Coal, Consumption.Co...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
ggplot(energy_clean, aes(x = Production_Coal, y = Consumption_Commercial_Coal)) +
  geom_point(color = "#d92774") +
  # Add the fitted line that the plot title refers to
  geom_smooth(method = "lm", se = FALSE, color = "black") +
  labs(
    x = "Coal Production",
    y = "Commercial Coal Consumption",
    title = "Consumption vs Production With Regression Line",
    caption = "Source: U.S. Energy Information Administration (EIA)"
  ) +
  theme_minimal()
Colored Scatter Plot by State
ggplot(energy_clean, aes(x = Production_Coal, y = Consumption_Commercial_Coal, color = State)) +
  geom_point(alpha = 0.6, size = 5) +
  labs(
    title = "Coal Production vs Commercial Coal Consumption By State",
    x = "Coal Production (tons)",
    y = "Commercial Coal Consumption (tons)",
    caption = "Source: U.S. Energy Information Administration (EIA)",
    color = "State"
  ) +
  theme_minimal() +
  scale_color_brewer(palette = "Set1")
Warning in RColorBrewer::brewer.pal(n, pal): n too large, allowed maximum for palette Set1 is 9
Returning the palette you asked for with that many colors
Warning: Removed 2520 rows containing missing values or values outside the scale range
(`geom_point()`).
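The two warnings above appear to be linked: the Set1 palette provides at most 9 colors, so states beyond the ninth get no color and their points are dropped from the plot. One way to avoid this is to plot only a handful of states, which is presumably what `energy_selected` does. The report does not show how `energy_selected` was built, so the sketch below is an assumption: it uses made-up toy data in place of `energy_clean` and keeps the 8 states with the largest total production (Set2, used in the refined plot, provides at most 8 colors).

```r
library(dplyr)

# Made-up stand-in for energy_clean (12 hypothetical states, 3 years each)
set.seed(1)
energy_clean <- tibble(
  State = rep(paste("State", 1:12), each = 3),
  Year = rep(2000:2002, times = 12),
  Production_Coal = runif(36, min = 0, max = 1000),
  Consumption_Commercial_Coal = runif(36, min = 0, max = 50)
)

# Find the 8 states with the largest total production
top_states <- energy_clean %>%
  group_by(State) %>%
  summarise(total_production = sum(Production_Coal), .groups = "drop") %>%
  slice_max(total_production, n = 8, with_ties = FALSE) %>%
  pull(State)

# Keep only those states so every state can receive its own color
energy_selected <- energy_clean %>%
  filter(State %in% top_states)
```

Choosing states by total production is just one option; a hand-picked list of states of interest would work the same way with `filter(State %in% c(...))`.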
Refined Visualization
ggplot(energy_selected, aes(x = Production_Coal, y = Consumption_Commercial_Coal, color = State)) +
  geom_point(alpha = 0.7, size = 3) +
  labs(
    title = "Coal Production vs Commercial Coal Consumption By State",
    x = "Coal Production (tons)",
    y = "Commercial Coal Consumption (tons)",
    caption = "Source: U.S. Energy Information Administration (EIA)"
  ) +
  theme_minimal() +
  scale_color_brewer(palette = "Set2")
Essay
I first read in the data using read_csv() to get it ready for analysis. Then I cleaned the column names so they would be easier to use in my script (for example, replacing Production.Coal with Production_Coal). I also used the filter() function to remove any rows with missing data in the key variables: Production_Coal, Consumption_Commercial_Coal, Year, and State. This step kept the analysis and visualizations consistent, since none of the key variables contained missing values afterwards.
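The loading and cleaning steps described above might look roughly like this. The report does not show the original code or file name, so this sketch reads a few made-up rows from an inline string instead of the real CSV; the column names follow the ones shown in the report's output.

```r
library(readr)
library(dplyr)

# Made-up rows standing in for the real CSV file (values are illustrative)
raw_csv <- I("State,Year,Production.Coal,Consumption.Commercial.Coal
Alabama,1960,100,5
Alabama,1961,NA,6
Alaska,1960,20,NA
Alaska,1961,25,1")

energy <- read_csv(raw_csv, show_col_types = FALSE)

energy_clean <- energy %>%
  # Replace dots with underscores: Production.Coal -> Production_Coal
  rename_with(~ gsub(".", "_", .x, fixed = TRUE)) %>%
  # Drop rows with a missing value in any of the key variables
  filter(
    !is.na(Production_Coal),
    !is.na(Consumption_Commercial_Coal),
    !is.na(Year),
    !is.na(State)
  )

energy_clean
```

With the real data, the inline string would be replaced by the path to the downloaded file.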
The visualizations show the relationship between coal production and commercial coal consumption in various U.S. states over different years. Each dot on the scatterplot corresponds to one observation (a state in a given year). The graphs show that some states produce and consume more coal than others, and they suggest that the amount of coal a state produces may be positively related to how much it consumes: where I see high production, I also tend to see high consumption.
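The visual impression of a positive relationship could be checked numerically with a correlation coefficient. The sketch below uses base R's `cor()` and `cor.test()` on made-up data, so the numbers it produces say nothing about the real dataset; only the column names are taken from the report.

```r
# Made-up toy data with one missing value, to show NA handling
set.seed(7)
toy <- data.frame(Production_Coal = c(runif(20, min = 0, max = 1000), NA))
toy$Consumption_Commercial_Coal <-
  0.05 * toy$Production_Coal + rnorm(21, sd = 5)

# Pearson correlation, ignoring rows where either value is missing
r <- cor(toy$Production_Coal, toy$Consumption_Commercial_Coal,
         use = "complete.obs")
r

# Significance test and confidence interval for the same correlation
result <- cor.test(toy$Production_Coal, toy$Consumption_Commercial_Coal)
result
```

A value of `r` close to 1 would support the reading of the scatterplot, while a value near 0 would suggest the apparent pattern is driven by a few large states.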
Something I would have liked to develop further, but did not manage to explore in depth, is more sophisticated visualization. This would have provided a clearer picture of how production and consumption evolved through the years. If time had permitted, I would also have included more variables, such as energy exports or imports, to enrich the analysis.