Analyzing Coal Production and Commercial Coal Consumption in the U.S.

Author

Smukrabine

#Introduction In this project, I looked at the connection between the production of coal (in tons) and commercial consumption of coal (also in tons) over multiple U.S. states and years. The dataset is sourced from https://www.eia.gov/state/seds/seds-data-complete.php and U.S. Energy Information Administration - EIA - Independent Statistics and Analysis, official energy statistics from the US government.

I concentrated on the latter variables:

Year – the year of data (calendar).

State – the U.S. state for the production and consumption data set.

Production_Coal – total production of coal (tons).

Consumption_Commercial_Coal – sum of commercial coal consumed (tons).

I wanted to see how coal production and coal use relate over time and across states. I also going to construct an easy linear regression model that could be used to predict how much coal will be consumed based on production, year, and state.

Loading Necessary Libraries

library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.2     ✔ tibble    3.3.0
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.1.0     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(dplyr)
library(ggplot2)

Loading Dataset

setwd("C:/Users/sajut/OneDrive/Desktop/DATA_110")

energy_data <- read_csv("energy (3).csv")
Rows: 3060 Columns: 84
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (1): State
dbl (83): Year, Production.Coal, Consumption.Commercial.Coal, Consumption.Co...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Data Cleaning and Preparation

energy_clean <- energy_data |>
  rename(Production_Coal = Production.Coal,
    Consumption_Commercial_Coal = Consumption.Commercial.Coal
  )
energy_clean2 <- energy_clean |>
  filter(!is.na(Production_Coal), !is.na(Consumption_Commercial_Coal), !is.na(Year), !is.na(State))
energy_selected <- energy_clean |>
  filter(State %in% c("Alabama", "Wyoming", "Texas"))  # pick 3 states

Linear Regression Model

model <- lm(Consumption_Commercial_Coal ~ Production_Coal + Year + State, data = energy_clean)
summary(model)

Call:
lm(formula = Consumption_Commercial_Coal ~ Production_Coal + 
    Year + State, data = energy_clean)

Residuals:
   Min     1Q Median     3Q    Max 
-23717  -1361   -104   1151  59511 

Coefficients:
                            Estimate Std. Error t value Pr(>|t|)    
(Intercept)                1.772e+05  8.319e+03  21.298  < 2e-16 ***
Production_Coal            1.396e-03  1.673e-04   8.344  < 2e-16 ***
Year                      -8.883e+01  4.178e+00 -21.259  < 2e-16 ***
StateAlaska                3.924e+03  7.295e+02   5.380 8.03e-08 ***
StateArizona              -6.820e+02  7.271e+02  -0.938  0.34831    
StateArkansas             -4.430e+02  7.297e+02  -0.607  0.54384    
StateCalifornia           -2.851e+01  7.298e+02  -0.039  0.96884    
StateColorado              1.072e+03  7.253e+02   1.478  0.13948    
StateConnecticut          -6.456e+01  7.298e+02  -0.088  0.92952    
StateDelaware             -3.240e+02  7.298e+02  -0.444  0.65707    
StateDistrict of Columbia  2.916e+02  7.298e+02   0.400  0.68954    
StateFlorida              -2.735e+02  7.298e+02  -0.375  0.70784    
StateGeorgia               3.364e+02  7.298e+02   0.461  0.64488    
StateHawaii               -4.471e+02  7.298e+02  -0.613  0.54021    
StateIdaho                 1.126e+03  7.298e+02   1.543  0.12284    
StateIllinois              1.091e+04  7.327e+02  14.892  < 2e-16 ***
StateIndiana               6.520e+03  7.256e+02   8.986  < 2e-16 ***
StateIowa                  3.669e+03  7.297e+02   5.029 5.22e-07 ***
StateKansas               -3.279e+02  7.295e+02  -0.450  0.65310    
StateKentucky              2.878e+02  8.338e+02   0.345  0.73003    
StateLouisiana            -4.751e+02  7.293e+02  -0.651  0.51486    
StateMaine                 3.322e+01  7.298e+02   0.046  0.96369    
StateMaryland              7.501e+02  7.286e+02   1.030  0.30328    
StateMassachusetts         1.216e+03  7.298e+02   1.667  0.09567 .  
StateMichigan              6.108e+03  7.298e+02   8.369  < 2e-16 ***
StateMinnesota             2.278e+03  7.298e+02   3.122  0.00181 ** 
StateMississippi          -4.485e+02  7.296e+02  -0.615  0.53880    
StateMissouri              2.822e+03  7.289e+02   3.871  0.00011 ***
StateMontana              -8.636e+02  7.251e+02  -1.191  0.23377    
StateNebraska             -1.239e+02  7.298e+02  -0.170  0.86523    
StateNevada               -3.106e+02  7.298e+02  -0.426  0.67039    
StateNew Hampshire        -3.355e+02  7.298e+02  -0.460  0.64571    
StateNew Jersey            5.178e+02  7.298e+02   0.710  0.47803    
StateNew Mexico           -8.082e+02  7.257e+02  -1.114  0.26549    
StateNew York              5.505e+03  7.298e+02   7.543 6.04e-14 ***
StateNorth Carolina        3.977e+03  7.298e+02   5.449 5.48e-08 ***
StateNorth Dakota          1.023e+03  7.260e+02   1.408  0.15909    
StateOhio                  1.117e+04  7.265e+02  15.376  < 2e-16 ***
StateOklahoma             -3.602e+02  7.289e+02  -0.494  0.62124    
StateOregon               -1.765e+02  7.298e+02  -0.242  0.80894    
StatePennsylvania          2.438e+04  7.612e+02  32.028  < 2e-16 ***
StateRhode Island         -3.757e+02  7.298e+02  -0.515  0.60677    
StateSouth Carolina        1.105e+03  7.298e+02   1.514  0.13010    
StateSouth Dakota         -2.309e+02  7.298e+02  -0.316  0.75174    
StateTennessee             3.038e+03  7.277e+02   4.174 3.08e-05 ***
StateTexas                -8.424e+02  7.252e+02  -1.162  0.24553    
StateUtah                  1.483e+03  7.255e+02   2.043  0.04109 *  
StateVermont              -2.483e+02  7.298e+02  -0.340  0.73375    
StateVirginia              3.163e+03  7.275e+02   4.347 1.42e-05 ***
StateWashington            5.307e+02  7.290e+02   0.728  0.46667    
StateWest Virginia        -2.834e+03  8.723e+02  -3.248  0.00117 ** 
StateWisconsin             5.527e+03  7.298e+02   7.573 4.82e-14 ***
StateWyoming              -3.897e+03  8.749e+02  -4.454 8.74e-06 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3972 on 3007 degrees of freedom
Multiple R-squared:  0.601, Adjusted R-squared:  0.5941 
F-statistic: 87.12 on 52 and 3007 DF,  p-value: < 2.2e-16

Data Visualization

Scatter Plot with Regression Line

ggplot(energy_clean, aes(x = Production_Coal, y = Consumption_Commercial_Coal)) +
  geom_point(color = "#d92774") +
  labs(x = "Coal Production", y = "Commercial Coal Consumption",
       title = "Consumption vs Production With Regression Line",
       caption = "Source: U.S. Energy Information Administration (EIA)") +
  theme_minimal()

Colored Scatter Plot by State

ggplot(energy_clean, aes(x = Production_Coal, 
                         y = Consumption_Commercial_Coal, 
                         color = State)) +
  geom_point(alpha = 0.6, size = 5) +
  labs(
    title = "Coal Production vs Commercial Coal Consumption By State",
    x = "Coal Production (tons)",
    y = "Commercial Coal Consumption (tons)",
    caption = "Source: U.S. Energy Information Administration (EIA)",
    color = "State"
  ) +
  theme_minimal() +
  scale_color_brewer(palette = "Set1")
Warning in RColorBrewer::brewer.pal(n, pal): n too large, allowed maximum for palette Set1 is 9
Returning the palette you asked for with that many colors
Warning: Removed 2520 rows containing missing values or values outside the scale range
(`geom_point()`).

Refined Visualization

ggplot(energy_selected, aes(x = Production_Coal, 
                            y = Consumption_Commercial_Coal,
                            color = State)) +
  geom_point(alpha = 0.7, size = 3) +
  labs(
    title = "Coal Production vs Commercial Coal Consumption By State",
    x = "Coal Production (tons)",
    y = "Commercial Coal Consumption (tons)",
    caption = "Source: U.S. Energy Information Administration (EIA)"
  ) +
  theme_minimal() + 
  scale_color_brewer(palette = "Set2")

Essay

I first read in the data using read_csv() to get it ready for analysis. Then, I did some simple data cleaning for the column names to be more easily used in my script (for example, replacing Production. Coal with Production_Coal). I have also used the filter() function to eliminate any rows that contained missing data in the key variables: Production_Coal, Consumption_Commercial_Coal, Year, and State. This step was responsible for guaranteeing that my analysis and visualizations were coherent and error-free since no missing data was present.

The visualizations depict the correlation between coal production and commercial coal consumption in various U.S. states over different years. Each dot on the scatterplot corresponds to one observation (state and year). I saw from the graph that some states produce and consume more coal than other states, but that the amount of coal produced might be correlated with how much people in a state consume in general, if I see high production, I also see high consumption.

Something I would have loved to develop further but couldn’t get my head around in depth was more sophisticated visualizations. This would have provided a clearer picture of how production and consumption evolved through the years. I also wish I would have included more variables instead, if time permitted—to the tune of energy exports or imports—in order to enrich the analysis.