US coal exports

Preliminaries:

Load libraries

It’s a good idea to load our libraries at the top of the Rmd document so that everyone can see what we’re using. Similarly, it’s good practice to set cache=FALSE to ensure that the libraries are dynamically loaded each time you knit the document.

I’ve only added the libraries needed to download and read the data. We’ll need to load additional libraries to complete this assignment. Add them here once we discover that we need them.

## Install the pacman package if necessary
if (!require("pacman")) install.packages("pacman")
## Install other packages using pacman::p_load()
pacman::p_load(httr, readxl, here,tinytex,tidyverse,data.table,rlang,
               ggplot2, janitor, tidyfast, dtplyr, microbenchmark, skimr, lubridate, ggiraph,naniar,dtplyr, ggthemes, hrbrthemes, gganimate, gifski, countrycode, plotly)
theme_set(hrbrthemes::theme_ipsum()) ## ggplot2 theme I like

Read in the data

Use httr::GET() to fetch the EIA excel file for us from web.

# library(here) ## Already loaded
# library(httr) ## Already loaded
url = "https://www.eia.gov/coal/archive/coal_historical_exports.xlsx"
## Only download the file if we need to
if(!file.exists(here::here("data/coal.xlsx"))) {
  GET(url, write_disk(here::here("data/coal.xlsx")))
}

Next, we read in the file.

coal = read_excel(here::here("data/coal.xlsx"), skip = 3, na = ".")

We are now ready to go.

1) Clean the column names

The column (i.e. variable) names aren’t great: Spacing, uppercase letters, etc.

#now let's make our data fram into a data table
coal_dt = as.data.table(coal)

#clean the variable names in our data set
#names(coal) #check to make sure they're uniform
coal = 
  coal %>% 
  clean_names() %>%
  rename(yr = year, qtr = quarter, dest = coal_destination_country)
coal

## # A tibble: 11,118 × 14
##       yr   qtr type         customs_district coal_origin_coun… dest   steam_coal
##    <dbl> <dbl> <chr>        <chr>            <chr>             <chr>       <dbl>
##  1  2002     1 Coal Exports Anchorage, AK    United States     South…      66874
##  2  2002     1 Coal Exports Baltimore, MD    United States     Belgi…     125856
##  3  2002     1 Coal Exports Baltimore, MD    United States     Brazil         NA
##  4  2002     1 Coal Exports Baltimore, MD    United States     Canada     149335
##  5  2002     1 Coal Exports Baltimore, MD    United States     Germa…     153465
##  6  2002     1 Coal Exports Baltimore, MD    United States     Irela…     132827
##  7  2002     1 Coal Exports Baltimore, MD    United States     Israel     130693
##  8  2002     1 Coal Exports Baltimore, MD    United States     Jamai…       4049
##  9  2002     1 Coal Exports Baltimore, MD    United States     Nethe…     139460
## 10  2002     1 Coal Exports Baltimore, MD    United States     Norway      12116
## # … with 11,108 more rows, and 7 more variables: steam_revenue <dbl>,
## #   metallurgical <dbl>, metallurgical_revenue <dbl>, total <dbl>,
## #   total_revenue <dbl>, coke <dbl>, coke_revenue <dbl>

2) Total US coal exports over time (year only)

Plot the US’s total coal exports over time by year ONLY. What secular trends do you notice in the data?

# names(coal_dt) = gsub(x = tolower(names(coal_dt)), " ", "_")
coal_dt = clean_names(coal_dt)
setnames(coal_dt, 
         old = c("year", "quarter", "coal_destination_country"),
         new = c("yr", "qtr", "dest"))
coal_dt

##          yr qtr         type customs_district coal_origin_country
##     1: 2002   1 Coal Exports    Anchorage, AK       United States
##     2: 2002   1 Coal Exports    Baltimore, MD       United States
##     3: 2002   1 Coal Exports    Baltimore, MD       United States
##     4: 2002   1 Coal Exports    Baltimore, MD       United States
##     5: 2002   1 Coal Exports    Baltimore, MD       United States
##    ---                                                           
## 11114: 2021   1 Coal Exports      Seattle, WA       United States
## 11115: 2021   1 Coal Exports      Seattle, WA       United States
## 11116: 2021   1 Coal Exports      Seattle, WA       United States
## 11117: 2021   1 Coal Exports      Seattle, WA       United States
## 11118: 2021   1 Coal Exports        Tampa, FL       United States
##                                   dest steam_coal steam_revenue metallurgical
##     1: South Korea (Republic of Korea)      66874       1669556            NA
##     2:                         Belgium     125856       3703190            NA
##     3:                          Brazil         NA            NA         44924
##     4:                          Canada     149335       4377923            NA
##     5:    Germany, Federal Republic of     153465       4788694         83010
##    ---                                                                       
## 11114:                           Japan     246086       7813540            NA
## 11115:                       Singapore         NA            NA            NA
## 11116: South Korea (Republic of Korea)    1241507      39447683            NA
## 11117:                        Thailand         47         32179            NA
## 11118:                          Panama         NA            NA            NA
##        metallurgical_revenue   total total_revenue coke coke_revenue
##     1:                    NA   66874       1669556   NA           NA
##     2:                    NA  125856       3703190   NA           NA
##     3:               2237029   44924       2237029   NA           NA
##     4:                    NA  149335       4377923   NA           NA
##     5:               2918079  236475       7706773   NA           NA
##    ---                                                              
## 11114:                    NA  246086       7813540   37        36038
## 11115:                    NA      NA            NA    6         2794
## 11116:                    NA 1241507      39447683   NA           NA
## 11117:                    NA      47         32179    7        10224
## 11118:                    NA      NA            NA   19        25499

Hints: If you want nicely formatted y-axis label, add + scale_y_continuous(labels = scales::comma) to your ggplot2 code.

The trends in the data can be deciphered through the graph above. We see there is a steady increase of coal exports between the beginning years 2002 and 2006. In year 2006 and beyond, we start to see sections of a dramatic, positive increase followed by a dramatic decrease. It is oscillating in a linear manner and getting more dramatic in magnitude as time goes on. Thus, coal is being exported immediately after it is harvested, in a quick fashion–and then the exports slow down as resources run out, and need time to replenish. It is also seen to slow down because once nations buy coal, they need time to use it up before they need to repurchase more coal, for more use… and so on. These secular trends can be defined over years 2006-2009, 2009-2018, and from 2018-2020. The magnitude of change in coal exports in these periods seem to get multiply the increasing and decreasing amounts of coal exports as time goes on. It is seen to have been given a small start though, with not too many exports. The secular trends seem to start around year 2007 and have gone on to increase overall coal exports. The level of exports doesn’t seems to drop below the quantity of 50,000,000 a year though. Since 2006, there hadn’t been less than 50,000,000 coal exports a year.

# set plotting theme
#now we want to plot total exports by year using data.table and ggplot2
coal_dt[, .(total = sum(total, na.rm=TRUE)), by = yr] %>%
  ggplot(aes(x=yr, y=total)) +
  geom_area(alpha=0.5, fill = "dodgerblue") +
  geom_line(alpha = 0.5, color = "black") +
  scale_y_continuous(labels = scales::comma) + 
  labs(title= "Total Coal Exports in The United States From Years 2002 to 2020", x = "Year", y = "Total Coal     Exports") + 
  scale_x_continuous(labels = seq(2002, 2020, 1), breaks = seq(2002, 2020, 1)) + # display every year's value
  theme(axis.text.x = element_text(angle=90,size=7))

3) Total US coal exports over time (year AND quarter)

Now do the same as the above, expect aggregated quarter of year (2001Q1, 2002Q2, etc.). Do you notice any seasonality that was masked from the yearly averages?

#aggregate by quarter every year now
coal_dt[, yr_qtr := as.IDate(paste(yr , qtr*3-2, '01', sep = '-'))]
#ggplot2 is going to want to convert your quarterly data into actual date format before it plots nicely. (i.e. Don't leave it as a string.

coal_dt

##          yr qtr         type customs_district coal_origin_country
##     1: 2002   1 Coal Exports    Anchorage, AK       United States
##     2: 2002   1 Coal Exports    Baltimore, MD       United States
##     3: 2002   1 Coal Exports    Baltimore, MD       United States
##     4: 2002   1 Coal Exports    Baltimore, MD       United States
##     5: 2002   1 Coal Exports    Baltimore, MD       United States
##    ---                                                           
## 11114: 2021   1 Coal Exports      Seattle, WA       United States
## 11115: 2021   1 Coal Exports      Seattle, WA       United States
## 11116: 2021   1 Coal Exports      Seattle, WA       United States
## 11117: 2021   1 Coal Exports      Seattle, WA       United States
## 11118: 2021   1 Coal Exports        Tampa, FL       United States
##                                   dest steam_coal steam_revenue metallurgical
##     1: South Korea (Republic of Korea)      66874       1669556            NA
##     2:                         Belgium     125856       3703190            NA
##     3:                          Brazil         NA            NA         44924
##     4:                          Canada     149335       4377923            NA
##     5:    Germany, Federal Republic of     153465       4788694         83010
##    ---                                                                       
## 11114:                           Japan     246086       7813540            NA
## 11115:                       Singapore         NA            NA            NA
## 11116: South Korea (Republic of Korea)    1241507      39447683            NA
## 11117:                        Thailand         47         32179            NA
## 11118:                          Panama         NA            NA            NA
##        metallurgical_revenue   total total_revenue coke coke_revenue     yr_qtr
##     1:                    NA   66874       1669556   NA           NA 2002-01-01
##     2:                    NA  125856       3703190   NA           NA 2002-01-01
##     3:               2237029   44924       2237029   NA           NA 2002-01-01
##     4:                    NA  149335       4377923   NA           NA 2002-01-01
##     5:               2918079  236475       7706773   NA           NA 2002-01-01
##    ---                                                                         
## 11114:                    NA  246086       7813540   37        36038 2021-01-01
## 11115:                    NA      NA            NA    6         2794 2021-01-01
## 11116:                    NA 1241507      39447683   NA           NA 2021-01-01
## 11117:                    NA      47         32179    7        10224 2021-01-01
## 11118:                    NA      NA            NA   19        25499 2021-01-01

#plot it 
coal_dt[, .(total = sum(total, na.rm=TRUE)), by = yr_qtr] %>%
  ggplot(aes(x=yr_qtr, y=total)) +
  geom_area(alpha=0.5, fill = "dodgerblue") + 
  geom_line(alpha = 1, color = "black") +
  scale_y_continuous(labels = scales::comma)  +
  labs(title= "Total Coal Exports in The United States by Quarter From Years 2002 to 2020", x = "Quarter Year", y = "Total Coal     Exports")

There was definitely seasonality that was masked from the yearly averages. We can see this because there exists more variation and more cycles within the given time period. This is likely due to the fact that the yearly totals were summations of the quarters, which cuts out the ability to see more than one trend in a given year. From our more detailed data display, we can see more secular trends, and more variation among the annual exports. Looking at the shape pattern, we see the triangles that signify a second quarter of a year, are mostly the times when exports reach their highest each year: the second quarter receives the most coal exports in the given year. We also see the oscillation periods getting wider, signifying an export activity level that increases as time goes on.

4) Exports by destination country

4.1) Create a new data frame

Create a new data frame called coal_country that aggregates total exports by destination country (and quarter of year). Make sure you print the resulting data frame so that it appears in the knitted R markdown document.

#new data frame with total exports by country
# coal %>% distinct(type)
#data table way
coal_country_dt = coal_dt[, 
                          .(total = sum(total, na.rm=TRUE)), 
                          by = .(dest, yr_qtr)]

setorder(coal_country_dt, dest, yr_qtr)

coal_country_dt

##                dest     yr_qtr  total
##    1:       Albania 2016-10-01     74
##    2:       Algeria 2002-01-01 129305
##    3:       Algeria 2002-07-01  62931
##    4:       Algeria 2002-10-01 129563
##    5:       Algeria 2003-01-01 128525
##   ---                                
## 4682:       Vietnam 2020-01-01   1826
## 4683:       Vietnam 2020-04-01    137
## 4684:       Vietnam 2020-07-01     57
## 4685:       Vietnam 2021-01-01     24
## 4686: Western Samoa 2017-07-01    106

4.2) Inspect the data frame

It looks like some countries are missing data for a number of years and periods (e.g. Albania). Confirm that this is the case. What do you think is happening here?

#we can compare rows and column with the original data.table set (not the copy) to see which variable inputs are missing

#first let us look at 
#names(coal_country$coal_destination_country)

#ungroup and then count how many each country destination has
coal_country_dt[, .N, by = dest]

##                dest  N
##   1:        Albania  1
##   2:        Algeria 47
##   3:        Andorra  1
##   4:         Angola 43
##   5:       Anguilla  1
##  ---                  
## 147: United Kingdom 77
## 148:        Uruguay 46
## 149:      Venezuela 72
## 150:        Vietnam 37
## 151:  Western Samoa  1

Certain country’s are getting omitted from our data.table because we need complete data on BOTH yearly quarter AND destination country in order to group by both objects. We removed NA values in our data sets when calculating sum totals. Since each country destination has multiple coal exports in a given quarter, we will have repeat amounts for our variable “summed_total” when looking at rows of JUST one destination country and one summed_total. Printing this out we can see this is the case due to the fact that all of the countries do not uniformly share the same amount of data entries.

4.3) Complete the data frame

Fill in the implicit missing values, so that each country has a representative row for every year-quarter time period. In other words, you should modify the data frame so that every destination country has row entries for all possible year-quarter combinations (from 2002Q1 through the most recent quarter). Order your updated data frame by country, year and, quarter.

# Create a temporary data table with all possible combinations of countries and year-quarters
temp = CJ(dest = unique(coal_country_dt$dest), yr_qtr = unique(coal_country_dt$yr_qtr))

# Full join our temporary data table and our original caol_country_dt data table
coal_country_dt = merge(coal_country_dt, 
                        temp, 
                        all = TRUE, 
                        by = c('dest', 'yr_qtr'))

#rm(temp) # remove our temporary data table from our working environment

#print out our new merged data table containing all the quarterly years from 2002 to 2020 with the existing data filled in!
#coal_country_dt

4.4) Some more tidying up

#as dplyr's data frame,  recall our data frame object


#we need to check if wee have any missing data in any quarters 
all_missing =
  coal %>%
  ## First count how many obs there per year-quarter combo
  add_count(yr, qtr) %>%
  ## Now we look at only the NA values
  filter(is.na(total)) %>%
  group_by(yr, qtr) %>%
  mutate(missing_n = n()) %>%
  ## See if we have any matches
  filter(n == missing_n) %>%
  distinct(yr, qtr)

#print out the missing ones
#all_missing ## there are none

When using dplyr’s data.frame’s versus DataTable’s data.table, you have to switch the formats of your data sets in order to use their corresponding package in R-studio. When data.table is used in conjunction with multiple steps, and multiple variable manipulation, it is imperative we create copies of our data in order to keep the original data set as it was imported. Otherwise, deletions,mutations, and other changes will permanently alter the observation values. However, I did not run into this issue and I have zero missing rows a.k.a. missing quarters.

4.5) Culmulative top 10 US coal export destinations

Produce a vector

#make our vector
coal10_culm = coal_country_dt[, sum(total, na.rm=TRUE), by = dest][order(-V1), dest[1:10]]
#print out vector
coal10_culm

##  [1] "Canada"                          "Netherlands"                    
##  [3] "Brazil"                          "India"                          
##  [5] "South Korea (Republic of Korea)" "United Kingdom"                 
##  [7] "Japan"                           "Italy"                          
##  [9] "Germany, Federal Republic of"    "China"

4.6) Recent top 10 US coal export destinations

Now do the same, except for most recent period on record (i.e. final quarter in the dataset). Call this vector coal10_recent and make sure to print it so that I can see it too. Are there any interesting differences between the two vectors? Apart from any secular trends, what else might explain these differences?

#make our vector
coal10_recent =
  coal_country_dt[yr_qtr==max(yr_qtr), sum(total, na.rm=TRUE), by = dest
                  ][order(-V1), dest[1:10]]
#print out the vector
coal10_recent

##  [1] "India"                           "China"                          
##  [3] "Netherlands"                     "Brazil"                         
##  [5] "Japan"                           "South Korea (Republic of Korea)"
##  [7] "Ukraine"                         "Canada"                         
##  [9] "Germany, Federal Republic of"    "Egypt"

From our two vectors produced, we can see that they are the exact same vectors, with different arrangement of country destination names. Although they share the same top exports, we can gather that Canada is the Number 1 in most coal imports(since they’re receiving US’ coal exports). However, we also see India is #1 according to the second vector. If they are typical, and regular trading partners, then we would expect to see the same trends in top export destinations. Which we did overall.

4.7) US coal exports over time by country

Plot the quarterly coal exports over time, but now disaggregated by country. In particular, highlight the top 10 (cumulative) export destinations and then sum the remaining countries into a combined “Other” category. (In other words, your figure should contain the time series of eleven different countries/categories.)

#work with the datatable
`%nchin%` = Negate(`%chin%`)
coal_country_dt[dest %nchin% coal10_culm, dest := 'Other']
coal_country_dt = 
  coal_country_dt[, .(total = sum(total, na.rm=TRUE)), by = .(dest, yr_qtr)]

coal_country_dt

##                dest     yr_qtr   total
##   1:          Other 2002-01-01 2605613
##   2:          Other 2002-04-01 2281543
##   3:          Other 2002-07-01 2547928
##   4:          Other 2002-10-01 2782358
##   5:          Other 2003-01-01 3493542
##  ---                                  
## 843: United Kingdom 2020-01-01  354027
## 844: United Kingdom 2020-04-01  133708
## 845: United Kingdom 2020-07-01  317618
## 846: United Kingdom 2020-10-01  330598
## 847: United Kingdom 2021-01-01  323505

library(directlabels)

#now plot it 
p = coal_country_dt %>%
  ggplot(aes(x=yr_qtr, y=total, fill= dest, col = dest)) + 
  scale_y_continuous(labels = scales::comma) +
  geom_area() +
  labs(x = "Quarterly Year", y = "Total Coal Exports", title = "Total US Coal Exports in the Top 10 Countries from 2002 to 2020") + 
  theme(legend.title = element_text(color = "Black", face = "bold")) +
#  scale_x_date() + 
  theme(plot.title = element_text(hjust = 0.5))

p

#now make another version of the same graph
p = coal_country_dt %>%
  ggplot(aes(x=yr_qtr, y=total, fill= dest, col = dest)) + 
  scale_y_continuous(labels = scales::comma) + 
  labs(x = "Quarterly Year", y = "Total Coal Exports", title = "Total US Coal Exports in the Top 10 Countries from 2002 to 2020") 


#get rid of the legend
p + theme(legend.position = "none") +
  geom_line(size = 0.29) + 
    # Add, from the directlabels package, 
  # geom_dl, using method = 'last.bumpup' to put the 
  # labels at the end, and make sure that if they intersect, 
  # one is bumped up
  geom_dl(aes(label = dest), method = 'last.bumpup')  +
  theme(plot.title = element_text(hjust = 0.5))

4.8) Make it pretty

Take your previous plot and add some swag to it. That is, try to make it as visually appealing as possible without overloading it with chart junk.

Hint: You’ve got loads of options here. If you haven’t already done so, consider a more bespoke theme with the ggthemes, hrbrthemes, or cowplot packages. Try out scale_fill_brewer() and scale_colour_brewer() for a range of interesting colour palettes. Try some transparency effects with alpha. Give your axis labels more refined names with the labs() layer in ggplot2. While you’re at it, you might want to scale (i.e. normalise) your y-variable to get rid of all those zeros. You can shorten any country names to their ISO abbreviation; see ?countrycode::countrycode. More substantively — but more complicated — you might want to re-order your legend (and the plot itself) according to the relative importance of the destination countries. See ?forcats::fct_reorder or forcats::fct_relevel`.

# library(countrycode) ## Already loaded

coal_country_dt[, iso := fifelse(dest!='Other',
                              countrycode(dest, 'country.name', 'iso3c'), 
                              'Other')]

## Warning in countrycode_convert(sourcevar = sourcevar, origin = origin, destination = dest, : Some values were not matched unambiguously: Other

#spaghetti plot, faceted 
## Create a copy of our faceting variable (here: iso)
coal_country_dt$iso_copy = coal_country_dt$iso
coal_country = as.data.frame(coal_country_dt)
p1 = coal_country %>%
  ggplot(aes(x=yr_qtr, y=total)) +
  ## Extra geom_line() call. Note that i) we must remove the actual faceting 
  ## variable to ensure all the lines get printed in each facet, and ii) the
  ## copied facet variable is used in the group aesthetic.
  geom_line(
      data = . %>% select(-iso),
      aes(group=iso_copy), alpha=0.5, lwd=0.1
      ) +
  ## Main geom_line() call. We'll use a different colour/width for emphasis.
  geom_line(col = "#377EB8", lwd=0.8) +
  facet_wrap(~iso) +
  ## Since we're just emphasizing trends, I'll use a very plain plot theme
  theme_void()
p1

# the area map
# library(forcats)    ## Already loaded
# library(hrbrthemes) ## Already loaded

p2 =
  coal_country_dt %>%
  ggplot(
    aes(
      x=yr_qtr, y=total, 
      fill = forcats::fct_reorder2(iso, yr_qtr, total, .desc = FALSE), 
      col = forcats::fct_reorder2(iso, yr_qtr, total, .desc = FALSE)
      )) + 
  geom_area(alpha = 0.7) + ## Add some transparency
  geom_line(position="stack", lwd = 0.2) +
  scale_fill_brewer(palette="Spectral", name="Destination") + ## "Rainbow" palette (see note above)
  scale_colour_brewer(palette="Spectral", name="Destination") + ## See above.
  labs(y = "Short tons", title = "US coal exports") +
  theme(axis.title.x = element_blank()) ## Do we really to tell people that our x-axis is "Date"?
p2

4.9) Make it interactive

Create an interactive version of your previous figure.

#From our variable assignments, we can add our animations
ggplotly(p2)

Hint: Take a look at plotly::ggplotly(), or the gganimate package.

5) Show me something interesting

There’s a lot still to explore with this data set. Your final task is to show me something interesting. Drill down into the data and explain what’s driving the secular trends that we have observed above. Or highlight interesting seasonality within a particular country. Or go back to the original coal data frame and look at exports by customs district, or by coal type. Do we changes or trends there? Etcetera. Etcetera. My only requirement is that you show your work and tell me what you have found.

coal_country_dt[dest == "Canada"] %>%
  ggplot(aes(x=yr_qtr, y=total)) + 
  scale_y_continuous(labels = scales::comma) + 
  geom_area(alpha = 0.5) + ## Add some transparency
  geom_line(position="stack", lwd = 0.2) +
  geom_vline(xintercept = yq("2009q2"), col = "red") +
  labs(
    y = "Short tons (millions)", 
    title = "US coal exports to Canada",
    subtitle = "Note the summer seasonal peaks and post-GFC downturn."
    ) +
  theme(axis.title.x = element_blank())