It’s a good idea to load our libraries at the top of the Rmd document so that everyone can see what we’re using. Similarly, it’s good practice to set cache=FALSE to ensure that the libraries are dynamically loaded each time you knit the document.
I’ve only added the libraries needed to download and read the data. We’ll need to load additional libraries to complete this assignment. Add them here once we discover that we need them.
## Install the pacman package if necessary
if (!require("pacman")) install.packages("pacman")
## Install other packages using pacman::p_load()
pacman::p_load(httr, readxl, here,tinytex,tidyverse,data.table,rlang,
ggplot2, janitor, tidyfast, dtplyr, microbenchmark, skimr, lubridate, ggiraph,naniar,dtplyr, ggthemes, hrbrthemes, gganimate, gifski, countrycode, plotly)
theme_set(hrbrthemes::theme_ipsum()) ## ggplot2 theme I likeUse httr::GET() to fetch the EIA excel file for us from web.
# library(here) ## Already loaded
# library(httr) ## Already loaded
url = "https://www.eia.gov/coal/archive/coal_historical_exports.xlsx"
## Only download the file if we need to
if(!file.exists(here::here("data/coal.xlsx"))) {
GET(url, write_disk(here::here("data/coal.xlsx")))
}Next, we read in the file.
We are now ready to go.
The column (i.e. variable) names aren’t great: Spacing, uppercase letters, etc.
#now let's make our data fram into a data table
coal_dt = as.data.table(coal)
#clean the variable names in our data set
#names(coal) #check to make sure they're uniform
coal =
coal %>%
clean_names() %>%
rename(yr = year, qtr = quarter, dest = coal_destination_country)
coal## # A tibble: 11,118 × 14
## yr qtr type customs_district coal_origin_coun… dest steam_coal
## <dbl> <dbl> <chr> <chr> <chr> <chr> <dbl>
## 1 2002 1 Coal Exports Anchorage, AK United States South… 66874
## 2 2002 1 Coal Exports Baltimore, MD United States Belgi… 125856
## 3 2002 1 Coal Exports Baltimore, MD United States Brazil NA
## 4 2002 1 Coal Exports Baltimore, MD United States Canada 149335
## 5 2002 1 Coal Exports Baltimore, MD United States Germa… 153465
## 6 2002 1 Coal Exports Baltimore, MD United States Irela… 132827
## 7 2002 1 Coal Exports Baltimore, MD United States Israel 130693
## 8 2002 1 Coal Exports Baltimore, MD United States Jamai… 4049
## 9 2002 1 Coal Exports Baltimore, MD United States Nethe… 139460
## 10 2002 1 Coal Exports Baltimore, MD United States Norway 12116
## # … with 11,108 more rows, and 7 more variables: steam_revenue <dbl>,
## # metallurgical <dbl>, metallurgical_revenue <dbl>, total <dbl>,
## # total_revenue <dbl>, coke <dbl>, coke_revenue <dbl>
Plot the US’s total coal exports over time by year ONLY. What secular trends do you notice in the data?
# names(coal_dt) = gsub(x = tolower(names(coal_dt)), " ", "_")
coal_dt = clean_names(coal_dt)
setnames(coal_dt,
old = c("year", "quarter", "coal_destination_country"),
new = c("yr", "qtr", "dest"))
coal_dt## yr qtr type customs_district coal_origin_country
## 1: 2002 1 Coal Exports Anchorage, AK United States
## 2: 2002 1 Coal Exports Baltimore, MD United States
## 3: 2002 1 Coal Exports Baltimore, MD United States
## 4: 2002 1 Coal Exports Baltimore, MD United States
## 5: 2002 1 Coal Exports Baltimore, MD United States
## ---
## 11114: 2021 1 Coal Exports Seattle, WA United States
## 11115: 2021 1 Coal Exports Seattle, WA United States
## 11116: 2021 1 Coal Exports Seattle, WA United States
## 11117: 2021 1 Coal Exports Seattle, WA United States
## 11118: 2021 1 Coal Exports Tampa, FL United States
## dest steam_coal steam_revenue metallurgical
## 1: South Korea (Republic of Korea) 66874 1669556 NA
## 2: Belgium 125856 3703190 NA
## 3: Brazil NA NA 44924
## 4: Canada 149335 4377923 NA
## 5: Germany, Federal Republic of 153465 4788694 83010
## ---
## 11114: Japan 246086 7813540 NA
## 11115: Singapore NA NA NA
## 11116: South Korea (Republic of Korea) 1241507 39447683 NA
## 11117: Thailand 47 32179 NA
## 11118: Panama NA NA NA
## metallurgical_revenue total total_revenue coke coke_revenue
## 1: NA 66874 1669556 NA NA
## 2: NA 125856 3703190 NA NA
## 3: 2237029 44924 2237029 NA NA
## 4: NA 149335 4377923 NA NA
## 5: 2918079 236475 7706773 NA NA
## ---
## 11114: NA 246086 7813540 37 36038
## 11115: NA NA NA 6 2794
## 11116: NA 1241507 39447683 NA NA
## 11117: NA 47 32179 7 10224
## 11118: NA NA NA 19 25499
Hints: If you want nicely formatted y-axis label, add + scale_y_continuous(labels = scales::comma) to your ggplot2 code.
The trends in the data can be deciphered through the graph above. We see there is a steady increase of coal exports between the beginning years 2002 and 2006. In year 2006 and beyond, we start to see sections of a dramatic, positive increase followed by a dramatic decrease. It is oscillating in a linear manner and getting more dramatic in magnitude as time goes on. Thus, coal is being exported immediately after it is harvested, in a quick fashion–and then the exports slow down as resources run out, and need time to replenish. It is also seen to slow down because once nations buy coal, they need time to use it up before they need to repurchase more coal, for more use… and so on. These secular trends can be defined over years 2006-2009, 2009-2018, and from 2018-2020. The magnitude of change in coal exports in these periods seem to get multiply the increasing and decreasing amounts of coal exports as time goes on. It is seen to have been given a small start though, with not too many exports. The secular trends seem to start around year 2007 and have gone on to increase overall coal exports. The level of exports doesn’t seems to drop below the quantity of 50,000,000 a year though. Since 2006, there hadn’t been less than 50,000,000 coal exports a year.
# set plotting theme
#now we want to plot total exports by year using data.table and ggplot2
coal_dt[, .(total = sum(total, na.rm=TRUE)), by = yr] %>%
ggplot(aes(x=yr, y=total)) +
geom_area(alpha=0.5, fill = "dodgerblue") +
geom_line(alpha = 0.5, color = "black") +
scale_y_continuous(labels = scales::comma) +
labs(title= "Total Coal Exports in The United States From Years 2002 to 2020", x = "Year", y = "Total Coal Exports") +
scale_x_continuous(labels = seq(2002, 2020, 1), breaks = seq(2002, 2020, 1)) + # display every year's value
theme(axis.text.x = element_text(angle=90,size=7))Now do the same as the above, expect aggregated quarter of year (2001Q1, 2002Q2, etc.). Do you notice any seasonality that was masked from the yearly averages?
#aggregate by quarter every year now
coal_dt[, yr_qtr := as.IDate(paste(yr , qtr*3-2, '01', sep = '-'))]
#ggplot2 is going to want to convert your quarterly data into actual date format before it plots nicely. (i.e. Don't leave it as a string.
coal_dt## yr qtr type customs_district coal_origin_country
## 1: 2002 1 Coal Exports Anchorage, AK United States
## 2: 2002 1 Coal Exports Baltimore, MD United States
## 3: 2002 1 Coal Exports Baltimore, MD United States
## 4: 2002 1 Coal Exports Baltimore, MD United States
## 5: 2002 1 Coal Exports Baltimore, MD United States
## ---
## 11114: 2021 1 Coal Exports Seattle, WA United States
## 11115: 2021 1 Coal Exports Seattle, WA United States
## 11116: 2021 1 Coal Exports Seattle, WA United States
## 11117: 2021 1 Coal Exports Seattle, WA United States
## 11118: 2021 1 Coal Exports Tampa, FL United States
## dest steam_coal steam_revenue metallurgical
## 1: South Korea (Republic of Korea) 66874 1669556 NA
## 2: Belgium 125856 3703190 NA
## 3: Brazil NA NA 44924
## 4: Canada 149335 4377923 NA
## 5: Germany, Federal Republic of 153465 4788694 83010
## ---
## 11114: Japan 246086 7813540 NA
## 11115: Singapore NA NA NA
## 11116: South Korea (Republic of Korea) 1241507 39447683 NA
## 11117: Thailand 47 32179 NA
## 11118: Panama NA NA NA
## metallurgical_revenue total total_revenue coke coke_revenue yr_qtr
## 1: NA 66874 1669556 NA NA 2002-01-01
## 2: NA 125856 3703190 NA NA 2002-01-01
## 3: 2237029 44924 2237029 NA NA 2002-01-01
## 4: NA 149335 4377923 NA NA 2002-01-01
## 5: 2918079 236475 7706773 NA NA 2002-01-01
## ---
## 11114: NA 246086 7813540 37 36038 2021-01-01
## 11115: NA NA NA 6 2794 2021-01-01
## 11116: NA 1241507 39447683 NA NA 2021-01-01
## 11117: NA 47 32179 7 10224 2021-01-01
## 11118: NA NA NA 19 25499 2021-01-01
#plot it
coal_dt[, .(total = sum(total, na.rm=TRUE)), by = yr_qtr] %>%
ggplot(aes(x=yr_qtr, y=total)) +
geom_area(alpha=0.5, fill = "dodgerblue") +
geom_line(alpha = 1, color = "black") +
scale_y_continuous(labels = scales::comma) +
labs(title= "Total Coal Exports in The United States by Quarter From Years 2002 to 2020", x = "Quarter Year", y = "Total Coal Exports") There was definitely seasonality that was masked from the yearly averages. We can see this because there exists more variation and more cycles within the given time period. This is likely due to the fact that the yearly totals were summations of the quarters, which cuts out the ability to see more than one trend in a given year. From our more detailed data display, we can see more secular trends, and more variation among the annual exports. Looking at the shape pattern, we see the triangles that signify a second quarter of a year, are mostly the times when exports reach their highest each year: the second quarter receives the most coal exports in the given year. We also see the oscillation periods getting wider, signifying an export activity level that increases as time goes on.
Create a new data frame called coal_country that aggregates total exports by destination country (and quarter of year). Make sure you print the resulting data frame so that it appears in the knitted R markdown document.
#new data frame with total exports by country
# coal %>% distinct(type)
#data table way
coal_country_dt = coal_dt[,
.(total = sum(total, na.rm=TRUE)),
by = .(dest, yr_qtr)]
setorder(coal_country_dt, dest, yr_qtr)
coal_country_dt## dest yr_qtr total
## 1: Albania 2016-10-01 74
## 2: Algeria 2002-01-01 129305
## 3: Algeria 2002-07-01 62931
## 4: Algeria 2002-10-01 129563
## 5: Algeria 2003-01-01 128525
## ---
## 4682: Vietnam 2020-01-01 1826
## 4683: Vietnam 2020-04-01 137
## 4684: Vietnam 2020-07-01 57
## 4685: Vietnam 2021-01-01 24
## 4686: Western Samoa 2017-07-01 106
It looks like some countries are missing data for a number of years and periods (e.g. Albania). Confirm that this is the case. What do you think is happening here?
#we can compare rows and column with the original data.table set (not the copy) to see which variable inputs are missing
#first let us look at
#names(coal_country$coal_destination_country)
#ungroup and then count how many each country destination has
coal_country_dt[, .N, by = dest]## dest N
## 1: Albania 1
## 2: Algeria 47
## 3: Andorra 1
## 4: Angola 43
## 5: Anguilla 1
## ---
## 147: United Kingdom 77
## 148: Uruguay 46
## 149: Venezuela 72
## 150: Vietnam 37
## 151: Western Samoa 1
Certain country’s are getting omitted from our data.table because we need complete data on BOTH yearly quarter AND destination country in order to group by both objects. We removed NA values in our data sets when calculating sum totals. Since each country destination has multiple coal exports in a given quarter, we will have repeat amounts for our variable “summed_total” when looking at rows of JUST one destination country and one summed_total. Printing this out we can see this is the case due to the fact that all of the countries do not uniformly share the same amount of data entries.
Fill in the implicit missing values, so that each country has a representative row for every year-quarter time period. In other words, you should modify the data frame so that every destination country has row entries for all possible year-quarter combinations (from 2002Q1 through the most recent quarter). Order your updated data frame by country, year and, quarter.
# Create a temporary data table with all possible combinations of countries and year-quarters
temp = CJ(dest = unique(coal_country_dt$dest), yr_qtr = unique(coal_country_dt$yr_qtr))
# Full join our temporary data table and our original caol_country_dt data table
coal_country_dt = merge(coal_country_dt,
temp,
all = TRUE,
by = c('dest', 'yr_qtr'))
#rm(temp) # remove our temporary data table from our working environment
#print out our new merged data table containing all the quarterly years from 2002 to 2020 with the existing data filled in!
#coal_country_dt#as dplyr's data frame, recall our data frame object
#we need to check if wee have any missing data in any quarters
all_missing =
coal %>%
## First count how many obs there per year-quarter combo
add_count(yr, qtr) %>%
## Now we look at only the NA values
filter(is.na(total)) %>%
group_by(yr, qtr) %>%
mutate(missing_n = n()) %>%
## See if we have any matches
filter(n == missing_n) %>%
distinct(yr, qtr)
#print out the missing ones
#all_missing ## there are noneWhen using dplyr’s data.frame’s versus DataTable’s data.table, you have to switch the formats of your data sets in order to use their corresponding package in R-studio. When data.table is used in conjunction with multiple steps, and multiple variable manipulation, it is imperative we create copies of our data in order to keep the original data set as it was imported. Otherwise, deletions,mutations, and other changes will permanently alter the observation values. However, I did not run into this issue and I have zero missing rows a.k.a. missing quarters.
Produce a vector
#make our vector
coal10_culm = coal_country_dt[, sum(total, na.rm=TRUE), by = dest][order(-V1), dest[1:10]]
#print out vector
coal10_culm## [1] "Canada" "Netherlands"
## [3] "Brazil" "India"
## [5] "South Korea (Republic of Korea)" "United Kingdom"
## [7] "Japan" "Italy"
## [9] "Germany, Federal Republic of" "China"
Now do the same, except for most recent period on record (i.e. final quarter in the dataset). Call this vector coal10_recent and make sure to print it so that I can see it too. Are there any interesting differences between the two vectors? Apart from any secular trends, what else might explain these differences?
#make our vector
coal10_recent =
coal_country_dt[yr_qtr==max(yr_qtr), sum(total, na.rm=TRUE), by = dest
][order(-V1), dest[1:10]]
#print out the vector
coal10_recent## [1] "India" "China"
## [3] "Netherlands" "Brazil"
## [5] "Japan" "South Korea (Republic of Korea)"
## [7] "Ukraine" "Canada"
## [9] "Germany, Federal Republic of" "Egypt"
From our two vectors produced, we can see that they are the exact same vectors, with different arrangement of country destination names. Although they share the same top exports, we can gather that Canada is the Number 1 in most coal imports(since they’re receiving US’ coal exports). However, we also see India is #1 according to the second vector. If they are typical, and regular trading partners, then we would expect to see the same trends in top export destinations. Which we did overall.
Plot the quarterly coal exports over time, but now disaggregated by country. In particular, highlight the top 10 (cumulative) export destinations and then sum the remaining countries into a combined “Other” category. (In other words, your figure should contain the time series of eleven different countries/categories.)
#work with the datatable
`%nchin%` = Negate(`%chin%`)
coal_country_dt[dest %nchin% coal10_culm, dest := 'Other']
coal_country_dt =
coal_country_dt[, .(total = sum(total, na.rm=TRUE)), by = .(dest, yr_qtr)]
coal_country_dt## dest yr_qtr total
## 1: Other 2002-01-01 2605613
## 2: Other 2002-04-01 2281543
## 3: Other 2002-07-01 2547928
## 4: Other 2002-10-01 2782358
## 5: Other 2003-01-01 3493542
## ---
## 843: United Kingdom 2020-01-01 354027
## 844: United Kingdom 2020-04-01 133708
## 845: United Kingdom 2020-07-01 317618
## 846: United Kingdom 2020-10-01 330598
## 847: United Kingdom 2021-01-01 323505
library(directlabels)
#now plot it
p = coal_country_dt %>%
ggplot(aes(x=yr_qtr, y=total, fill= dest, col = dest)) +
scale_y_continuous(labels = scales::comma) +
geom_area() +
labs(x = "Quarterly Year", y = "Total Coal Exports", title = "Total US Coal Exports in the Top 10 Countries from 2002 to 2020") +
theme(legend.title = element_text(color = "Black", face = "bold")) +
# scale_x_date() +
theme(plot.title = element_text(hjust = 0.5))
p#now make another version of the same graph
p = coal_country_dt %>%
ggplot(aes(x=yr_qtr, y=total, fill= dest, col = dest)) +
scale_y_continuous(labels = scales::comma) +
labs(x = "Quarterly Year", y = "Total Coal Exports", title = "Total US Coal Exports in the Top 10 Countries from 2002 to 2020")
#get rid of the legend
p + theme(legend.position = "none") +
geom_line(size = 0.29) +
# Add, from the directlabels package,
# geom_dl, using method = 'last.bumpup' to put the
# labels at the end, and make sure that if they intersect,
# one is bumped up
geom_dl(aes(label = dest), method = 'last.bumpup') +
theme(plot.title = element_text(hjust = 0.5)) Take your previous plot and add some swag to it. That is, try to make it as visually appealing as possible without overloading it with chart junk.
Hint: You’ve got loads of options here. If you haven’t already done so, consider a more bespoke theme with the ggthemes, hrbrthemes, or cowplot packages. Try out scale_fill_brewer() and scale_colour_brewer() for a range of interesting colour palettes. Try some transparency effects with alpha. Give your axis labels more refined names with the labs() layer in ggplot2. While you’re at it, you might want to scale (i.e. normalise) your y-variable to get rid of all those zeros. You can shorten any country names to their ISO abbreviation; see ?countrycode::countrycode. More substantively — but more complicated — you might want to re-order your legend (and the plot itself) according to the relative importance of the destination countries. See ?forcats::fct_reorder or forcats::fct_relevel`.
# library(countrycode) ## Already loaded
coal_country_dt[, iso := fifelse(dest!='Other',
countrycode(dest, 'country.name', 'iso3c'),
'Other')]## Warning in countrycode_convert(sourcevar = sourcevar, origin = origin, destination = dest, : Some values were not matched unambiguously: Other
#spaghetti plot, faceted
## Create a copy of our faceting variable (here: iso)
coal_country_dt$iso_copy = coal_country_dt$iso
coal_country = as.data.frame(coal_country_dt)
p1 = coal_country %>%
ggplot(aes(x=yr_qtr, y=total)) +
## Extra geom_line() call. Note that i) we must remove the actual faceting
## variable to ensure all the lines get printed in each facet, and ii) the
## copied facet variable is used in the group aesthetic.
geom_line(
data = . %>% select(-iso),
aes(group=iso_copy), alpha=0.5, lwd=0.1
) +
## Main geom_line() call. We'll use a different colour/width for emphasis.
geom_line(col = "#377EB8", lwd=0.8) +
facet_wrap(~iso) +
## Since we're just emphasizing trends, I'll use a very plain plot theme
theme_void()
p1# the area map
# library(forcats) ## Already loaded
# library(hrbrthemes) ## Already loaded
p2 =
coal_country_dt %>%
ggplot(
aes(
x=yr_qtr, y=total,
fill = forcats::fct_reorder2(iso, yr_qtr, total, .desc = FALSE),
col = forcats::fct_reorder2(iso, yr_qtr, total, .desc = FALSE)
)) +
geom_area(alpha = 0.7) + ## Add some transparency
geom_line(position="stack", lwd = 0.2) +
scale_fill_brewer(palette="Spectral", name="Destination") + ## "Rainbow" palette (see note above)
scale_colour_brewer(palette="Spectral", name="Destination") + ## See above.
labs(y = "Short tons", title = "US coal exports") +
theme(axis.title.x = element_blank()) ## Do we really to tell people that our x-axis is "Date"?
p2There’s a lot still to explore with this data set. Your final task is to show me something interesting. Drill down into the data and explain what’s driving the secular trends that we have observed above. Or highlight interesting seasonality within a particular country. Or go back to the original coal data frame and look at exports by customs district, or by coal type. Do we changes or trends there? Etcetera. Etcetera. My only requirement is that you show your work and tell me what you have found.
coal_country_dt[dest == "Canada"] %>%
ggplot(aes(x=yr_qtr, y=total)) +
scale_y_continuous(labels = scales::comma) +
geom_area(alpha = 0.5) + ## Add some transparency
geom_line(position="stack", lwd = 0.2) +
geom_vline(xintercept = yq("2009q2"), col = "red") +
labs(
y = "Short tons (millions)",
title = "US coal exports to Canada",
subtitle = "Note the summer seasonal peaks and post-GFC downturn."
) +
theme(axis.title.x = element_blank())