# SET GLOBAL KNITR OPTIONS
knitr::opts_chunk$set(echo = TRUE,
message = FALSE,
warning = FALSE,
fig.width = 10,
fig.height = 8)
# LOAD PACKAGES
library(pander)
library(kableExtra)
# It's a lot of packages, but they all can be useful for ggplot applications.
# Even if I don't use some of them below, hopefully they'll still be of use in some way.
# All packages below were introduced to me (mostly) by my Data Viz class.
# See the works-cited section for the class website link, among others.
# GGPLOT RELATED PACKAGES (Some useful ones)
library(tidyverse) # For ggplot, dplyr, and others
library(tidyquant) # For accessing financial related data. Like FRED, S&P 500, Stock Prices from Yahoo, etc.
library(scales) # To make your axis labels better. Functions like 'dollar' or 'comma'
library(timetk) # Time series help (I haven't used it much tbh, but it has neat functions like forecast)
library(ggtext) # For fancy text stuff
library(ggrepel) # prevent text/labels from overlapping
library(sf) # For creating maps. (very easy)
library(png) # read png files into R
library(grid) # for making the png files into a plottable object in ggplot
library(WDI) # download data directly from world bank. Very useful
library(ggthemes) # one of the many theme packages available. See Further resources section
library(patchwork) # combine ggplots!!!! Incredibly useful!!!
library(viridis) # continuous color scale options
# possible "toy data sets" to play with
# When you load the above libraries like tidyverse, a bunch of data sets in R will become available.
# To see all of them, do: data() in your console
# then if you'd like to call any one of them, do: data("name of data you want")
# here are some more
library(gapminder) # Useful data for demonstrations.
library(ISLR) # some data for teaching stats.
library(wooldridge) # "introductory econometrics a modern approach 6th ed" textbook data
# library(datasets) # available by default I think, if not already present in your R, install once.
This code-through explores the ggplot2 package, which is used to create data visualizations in R. This tutorial has three goals within it:
It would be impractical (if not impossible) to show everything ggplot can do in this condensed format. Instead, I will showcase a few cool features that I’ve learned over the last two months and that I think might interest everyone. I will also include as many references and links as I can for further exploration of ggplot and the many things it can do (that I’m aware of). I annotated the code as much as possible to explain what I am doing. I won’t (and can’t) show every possible argument for each function, because the internet will do a better job than I ever would.
Specifically, we’ll explain and demonstrate:
Besides being super cool and powerful, learning the ggplot package will allow you to create amazing data visualizations to enhance your data storytelling, as Ben Wellington would put it. Visualizations are an extremely efficient method (if done correctly) to get your message across and summarize big chunks of data in a single (or multiple) picture. Also, ggplot looks good on a resume.
Specifically, you’ll learn how to
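Let’s get the concept of “tidy data” out of the way first, along with why it’s relevant to ggplot: ggplot maps each aesthetic (x, y, color, and so on) to a column, so it works best when every variable lives in its own column. As per the RStudio Cloud Primers (linked in the works-cited section),
Tidy data is defined as data in which each variable is in its own column, each observation is in its own row, and each value is in its own cell.
For example, this is untidy data: the years are spread across the column names, when year should be its own variable. By the way, these tables become available when you load tidyverse.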
# untidy data example
pander(table4a)

| country | 1999 | 2000 |
|---|---|---|
| Afghanistan | 745 | 2666 |
| Brazil | 37737 | 80488 |
| China | 212258 | 213766 |
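One way to tidy a table like this is to pivot the year columns into rows. This is only a minimal sketch, assuming tidyr's `pivot_longer()` is the tool you want; the name `tidy4a` and the new column names `year` and `cases` are my own choices for illustration.

```r
library(tidyr)

# table4a ships with tidyr: one row per country, one column per year
tidy4a <- pivot_longer(
  table4a,
  cols = c(`1999`, `2000`),  # the columns whose names are really values of "year"
  names_to = "year",         # new column that holds the old column names
  values_to = "cases"        # new column that holds the cell values
)

# the old column names arrive as text, so convert year to a number
tidy4a$year <- as.integer(tidy4a$year)

tidy4a
```

The result has one row per country and year, which is the shape ggplot expects when you map `year` to an axis.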
Here is an example of tidy data; it follows all three conventions.
pander(table1)

| country | year | cases | population |
|---|---|---|---|
| Afghanistan | 1999 | 745 | 19987071 |
| Afghanistan | 2000 | 2666 | 20595360 |
| Brazil | 1999 | 37737 | 1.72e+08 |
| Brazil | 2000 | 80488 | 174504898 |
| China | 1999 | 212258 | 1.273e+09 |
| China | 2000 | 213766 | 1.28e+09 |
ggplot follows something called the grammar of graphics, which is based on a book written by Leland Wilkinson. Other software, like Tableau, also uses this approach. Kieran Healy’s textbook, linked in the works-cited section, explains this more thoroughly.
Basically, the structure of a ggplot is as follows:
And that’s it. Everything else you add is an extension of those three fields.
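Here is a bare-bones sketch of that structure. I am assuming the three fields are the data, the aesthetic mappings, and the geometry, which is how the rest of this code-through uses them. The `mpg` data set and the object name `p` are just stand-ins for illustration.

```r
library(ggplot2)

# 1. data: the data frame to draw from (mpg ships with ggplot2)
# 2. aesthetics: which columns map to x, y, color, and so on
# 3. geometry: the shape used to draw each row of the data
p <- ggplot(data = mpg,
            mapping = aes(x = displ, y = hwy)) +
  geom_point()

p
```

Scales, labels, and themes all get added onto this skeleton with `+`, the same way `geom_point()` was.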
A lot of what I’ve learned about data visualization with ggplot comes from my experience taking PMAP 8921-Data Visualization, a course taught by Dr. Andrew Heiss. I highly recommend the class website, which is provided in my works-cited section. There is a wealth of resources, videos, and examples on that platform, more than I can provide here. Even if you don’t plan on officially taking his course, I suggest visiting the website and “informally” enrolling to learn more about ggplot if you’re interested. I’m sure he won’t mind helping you out or communicating with you.
Here are some very basic ggplot examples. Visualizing data allows you to see patterns in it. I only include two basic examples, to familiarize you with the structure of ggplot. The advanced ones are more spicy :)
# let's make a simple scatter plot
ggplot(data = Default, # use data 'Default' from ISLR package
# Define the aesthetics
aes(x = balance, # Credit card balance on x-axis
y = income, # Annual Income on y-axis
color = default)) + # color points on default status.
# alpha controls the transparency of the dots
geom_point(alpha = 0.4,
aes(shape = default)) # You can use aes() inside a geom too!

# Make two box-plots, one for balance and one for income, filled by default status
# combine them in the same image using patchwork library
# guides(fill = 'none') turns off the legend
b1 = ggplot(data = Default, aes(x = default, y = balance, fill = default))+
geom_boxplot() +
guides(fill = 'none') +
labs(title = "Balance & default status")
b2 = ggplot(data = Default, aes(x = default, y = income, fill = default))+
geom_boxplot()+
guides(fill = 'none') +
labs(title = "Income & default status")
# using patchwork library, we can add two plots together!!!!!!
# See Further resources for more information on patchwork
b1 + b2 +
plot_annotation(title = "Combining our two plots using patchwork!")

These advanced examples will be way more useful because I can demonstrate a lot of things at once. Hopefully some of the code behind the graphics I created will be of use to you!
Here’s a cool time series plot of GME stock over the last 19 years. I use tidyquant to pull the data and a few other packages (grid, png, and ggtext) to polish the plot. See the first code chunk for what each one contributes.
NOTE: The prices in this plot might look wrong, because I think GameStop’s share price peaked at around $350 per share in early 2021, not around $85 as shown below. A likely reason is that Yahoo Finance restates historical prices after a stock split, and GameStop did a 4-for-1 split in July 2022, so earlier prices show up at roughly a quarter of what was quoted at the time.
# We use tidyquant package here to pull data from yahoo finance
# I will provide a link to the documentation below
# load stock data from yahoo finance
GameStop <- tq_get(x = "GME", from = "2002-01-02", to = "2021-07-20")
# load a random png file I downloaded off Google and make it ggplot friendly
# Credit to Dr. Andrew Heiss for explaining how to plot pngs on ggplot! Blog in citations
wsb_logo_file <- readPNG("C:/Users/celaj/Downloads/wallstreetbetslogo.png")
wsb_logo_plot_piece <- rasterGrob(wsb_logo_file)
# make plot of adjusted stock price (took a bit but worth it)
ggplot(GameStop, aes(x = date, y = adjusted)) +
geom_line(color = "blue", size =1) +
# plot that red dot. This is where GME got popular on Reddit (approx)
geom_point(data= GameStop %>% filter(date == "2021-01-13"),
size=3,
colour="red") +
# use scales library to make y-axis USD amounts with labels = dollar
# adjusting the ticks for the x-axis
scale_y_continuous(labels = dollar) +
scale_x_date(date_breaks = "2 years", date_labels = "%Y") +
# adding labels like title, subtitle, caption, axis
labs(title = "GameStop Stock Price over the last 19 years",
subtitle = "How Reddit exploded GME stock prices in 2021",
x = NULL,
y = "share price",
caption = "No one really understands how this happened") +
# changing themes, font face
theme_bw() +
theme(plot.title = element_text(face = "bold")) +
theme(plot.subtitle = element_text(face = "italic")) +
theme(plot.caption = element_text(face = "italic")) +
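# add an arrow that points toward... the point
# annotate() draws the arrow once; geom_segment(aes(...)) with constants redraws it for every row of the data
annotate(geom = "segment",
x = as.Date("2017-06-01"), y = 50, xend = as.Date("2020-06-15"), yend = 15,
arrow = arrow(length = unit(0.1, "in"), type = "closed")) +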
# add a text label. Note: this is not normal text geom.
# this is rich_text provided by the ggtext package.
# allows you to do this fancy text manipulation :)
# like bold and color a specific word within a ggplot!!! Credit to Stack Overflow lol (and ggtext documentation)
annotate(geom = "rich_text",
x = as.Date("2015-06-01"),
y = 50,
label = "GME starts getting Popular on <br>
<b> <span style='color: red;'>r/WallStreetBets</span> </b>",
fontface = 3, fill = NA, label.color = NA) +
# insert picture of wsb logo using annotation_custom! Couldn't find the one I wanted tho
annotation_custom(wsb_logo_plot_piece,
xmin = as.Date("2014-06-01"), xmax = as.Date("2016-06-01"),
ymin = 55, ymax = 75)

Here we use the WDI package to pull some World Bank data and plot it on a world map using the sf package. It was one of my favorite things I’ve learned with ggplot. The process is basically: choose a World Bank indicator, download the data, clean it, make sure the ISO3 codes (or whatever key you merge by) match in both files, and left-join it to a shapefile.
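The "make sure the codes match" step is the one that bit me, so here is a small sketch of one way to check it before joining, using dplyr's `anti_join()`. The two tiny data frames are hypothetical stand-ins that I made up (including the `-99` placeholder codes and the rounded population numbers); they are not the real shapefile or the real World Bank download.

```r
library(dplyr)

# Hypothetical stand-in for the shapefile's attribute table
shapes <- tibble(
  ADMIN  = c("Norway", "France", "Brazil"),
  ISO_A3 = c("-99", "-99", "BRA")    # placeholder codes that will not match
)

# Hypothetical stand-in for the cleaned WDI data (made-up round numbers)
wdi <- tibble(
  iso3c    = c("NOR", "FRA", "BRA"),
  TotalPop = c(5.4e6, 67e6, 213e6)
)

# Rows in shapes with no partner in wdi: these would be grayed out on the map
unmatched <- anti_join(shapes, wdi, by = c("ISO_A3" = "iso3c"))
unmatched
```

Anything that shows up in `unmatched` is a candidate for a manual fix before the `left_join()`.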
# download some data from World bank using WDI package
# Credit to Dr. Andrew Heiss for teaching me how to use these two packages
# Also found a GitHub post that is helpful. Links below in the references/works-cited sections
indicators = c("SP.POP.TOTL") # Population, Total
# if you're wondering where this code comes from:
# go to this website: https://data.worldbank.org/indicator
# pick an indicator and look at the URL.
wdi_raw = WDI(country = "all", indicators, extra = TRUE,
start = 2020, end = 2020)
# download world shapefile from Natural Earth
# https://www.naturalearthdata.com/downloads/110m-cultural-vectors/
# Ideally you should move the file into a data directory in your project...
world_shapes = read_sf("C:/Users/celaj/Downloads/ne_110m_admin_0_countries/ne_110m_admin_0_countries/ne_110m_admin_0_countries.shp")
# selecting the two primary variables. Country Code & total population
wdi_clean <- wdi_raw %>%
select(TotalPop = SP.POP.TOTL, iso3c)
# left join the WDI data we downloaded to the shape file
# We match by the ISO3 code in each file
# Some parts of the map are grayed out
# That means data is missing or the ISO3 code is incorrect.
# I corrected two of them below.
merged_map_data = world_shapes %>%
# fix the two countries
mutate(ISO_A3 = case_when(
ADMIN == "Norway" ~ "NOR",
ADMIN == "France" ~ "FRA",
TRUE ~ ISO_A3)) %>%
left_join(wdi_clean, by = c("ISO_A3" = "iso3c")) %>%
filter(ISO_A3 != "ATA") # no one lives in Antarctica
# make the world map graph
ggplot() +
geom_sf(data = merged_map_data, aes(fill = TotalPop)) +
coord_sf(crs = st_crs("EPSG:4326")) + # WGS 84: DOD GPS coordinates
# the viridis library can be used for continuous color scales
# but you can also create your own scale with your preferred colors
# Like this: scale_fill_gradient(low = "#AF7AC5", high = "#E74C3C") # purple and red gradient scale
scale_fill_viridis(option = "plasma") +
# give it a title
labs(title = "Total Population by country in 2020",
subtitle = "Some of the data is missing or incorrect for certain countries.",
caption = "China and India are the most populous countries") +
# hjust is the horizontal adjustment. 0.5 = center
theme_void() +
theme(plot.title = element_text(hjust = 0.5, face = "bold")) +
theme(plot.subtitle = element_text(hjust = 0.5, face = "italic")) +
theme(legend.position = "bottom")

# pull the data like before. this time unemployment numbers
indicators = c("SL.UEM.TOTL.NE.ZS") # Unemployment, total (% of total labor force)
wdi_raw = WDI(country = "all", indicators, extra = TRUE,
start = 2019, end = 2019)
# selecting the variables I'll use
wdi_clean <- wdi_raw %>%
select(UnemployPerc= SL.UEM.TOTL.NE.ZS, iso3c, country, year)
wdi_clean = na.omit(wdi_clean) # bunch of missing in there. Ignore those
# filter some weird things and aggregates out to keep only countries (what I want to compare)
mylist = c('WLD', 'EAP', 'ECA', 'SAS', 'CEB', 'OED', 'TLA','TEA', 'TSA', 'TEC','EMU',
'EAR', 'LTE','PST','LAC','LMC','LMY', 'MIC','UMC','NAC', 'EAS', 'ECS','IBT')
set.seed(1010102) # so we get consistent results
mydata = wdi_clean %>%
filter(!iso3c %in% mylist) %>%
slice_sample(n = 10) %>% # get random sample because can't decide
arrange(desc(UnemployPerc))
ggplot(data = mydata, aes(x = reorder(country, UnemployPerc),
y = UnemployPerc,
label = paste0(round(UnemployPerc,2), "%")))+
geom_point(color = "black", size = 12) +
geom_segment(aes(x=country, xend=country, y=0, yend=UnemployPerc)) +
coord_flip() +
guides(color = "none") +
geom_text(color = "white", size = 2.5, fontface = 2) +
labs(title = "% Unemployment of 10 random countries" ,
x = "countries",
caption = "Wall Street Journal theme provided by ggthemes") +
theme_wsj() +
theme(plot.title = element_text(size = 13, hjust = 0.5, face = "bold")) +
theme(plot.caption = element_text(size = 11, hjust = 0.5))
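Typing out the aggregate codes by hand, like the `mylist` vector above, is easy to get wrong. A possible alternative is to filter on the `region` column that WDI adds when `extra = TRUE`. I am assuming here that aggregate rows are labeled "Aggregates" in that column, so check your own download before relying on it. The data frame below is a made-up miniature with invented values, not real World Bank numbers.

```r
library(dplyr)

# Hypothetical miniature of a WDI(..., extra = TRUE) result
wdi_toy <- tibble(
  country = c("World", "Euro area", "Albania", "Japan"),
  iso3c   = c("WLD", "EMU", "ALB", "JPN"),
  region  = c("Aggregates", "Aggregates",
              "Europe & Central Asia", "East Asia & Pacific"),
  UnemployPerc = c(5.4, 7.6, 11.5, 2.4)   # invented values, for illustration only
)

# keep only real countries by dropping the rows flagged as aggregates
countries_only <- wdi_toy %>%
  filter(region != "Aggregates")

countries_only
```

To use this with the real data, `region` would also need to be kept in the `select()` step.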
Learn more about ggplot and the related things I’ve shown with the following:
Resource I Tidyquant tutorial
Resource II WDI package info
Resource III Data Viz class website, HIGHLY RECOMMENDED
Resource IV GG themes
Resource V Modify components of themes
Resource VI R primers, ggplot basics (recommended!)
Resource VII ggplot cheatsheet, useful
Resource VIII FREE textbook-Kieran Healy
Resource IX ggplot plot functions,comprehensive
Resource X combine graphs, patchwork doc
Resource XI fancy text stuff, ggtext (super cool!)
There are so many other sources out there. Check GitHub, Reddit, etc.
This code-through references and cites the following sources:
Andrew Heiss (2021). Source I. Creating maps with sf
Andrew Heiss (2021). Source II. Plot pngs in ggplot
Some dude on Stack Overflow (2020). Source III. ggtext coloring
RStudio Cloud (2021). Source IV. Tidy-Data, RStudio primers
Kieran Healy (2021). Source V. ggplot structure