Read comments please :)

# SET GLOBAL KNITR OPTIONS

knitr::opts_chunk$set(echo = TRUE, 
                      message = FALSE, 
                      warning = FALSE, 
                      fig.width = 10, 
                      fig.height = 8)


# LOAD PACKAGES

library(pander)
library(kableExtra)

# It's a lot of packages, but they all can be useful for ggplot applications.
# Even if I don't use some of them below, hopefully they'll still be of use in some way.
# All packages below were introduced to me (mostly) by my Data Viz class. 
# See the works-cited section for the class website link, among others.



# GGPLOT RELATED PACKAGES (Some useful ones)

library(tidyverse)  # For ggplot, dplyr, and others
library(tidyquant)  # For accessing financial related data. Like FRED, S&P 500, Stock Prices from Yahoo, etc.
library(scales)     # To make your axis labels better. Functions like 'dollar' or 'comma'
library(timetk)     # Time series help (I haven't used it much tbh, but it has neat functions like forecast)
library(ggtext)     # For fancy text stuff
library(ggrepel)    # prevent text/labels from overlapping
library(sf)         # For creating maps. (very easy)
library(png)        # read png files into R
library(grid)       # for making the png files into a plottable object in ggplot
library(WDI)        # download data directly from the World Bank. Very useful
library(ggthemes)   # one of the many theme packages available. See Further resources section
library(patchwork)  # combine ggplots!!!! Incredibly useful!!!
library(viridis)    # continuous color scale options


# possible "toy data sets" to play with

# When you load the above libraries like tidyverse, a bunch of data sets in R will become available.
# To see all of them, do: data() in your console
# then if you'd like to call any one of them, do: data("name of data you want")

# here are some more 

library(gapminder)  # Useful data for demonstrations.
library(ISLR)       # some data for teaching stats. 
library(wooldridge) # "introductory econometrics a modern approach 6th ed" textbook data

# library(datasets) # available by default I think, if not already present in your R, install once.



Introduction

This code-through explores the ggplot2 package, which is used to create data visualizations in R. This tutorial has three goals:

  • Explain what ggplot2 is and how it works
  • Demonstrate some common ggplot2 functions and visuals
  • Provide a few advanced ggplot examples to spark some curiosity.

It would be impractical (if not impossible) to show everything possible with ggplot in this condensed format. Instead, I will try to showcase a few cool features that I’ve learned over the last two months that I think might interest everyone. I will also include as many references and links as I’m aware of for further exploration into ggplot and the many things that can be done with it. I tried to annotate the code as much as possible to explain what I am doing. I won’t (and can’t) show every possible argument for each function because the internet will do a better job than I ever would.


Content Overview

Specifically, we’ll explain and demonstrate:

  • Tidy data: why it is important and relevant to making our plots
  • The logic of ggplot structure
  • Different geometry functions, such as points, lines, box-plots, and maps
  • Customizing your visuals using themes
  • Annotating your graph with text and images
  • Combining plots together
  • Using the WDI package to pull data from the World Bank
  • Pulling stock prices from Yahoo Finance for a time series chart


Why You Should Care

Besides being super cool and powerful, learning the ggplot package will allow you to create amazing data visualizations to enhance your data storytelling, as Ben Wellington would put it. Visualizations are an extremely efficient method (if done correctly) to get your message across and summarize big chunks of data in a single (or multiple) picture. Also, ggplot looks good on a resume.


Learning Objectives

Specifically, you’ll learn how to

  • Annotate a plot with text, arrows, labels, and an inserted PNG image
  • Change the aesthetics of a graph, such as color or fill, themes, etc.
  • Download data from the World Bank through the WDI package and create maps with it
  • Use some common geom_ functions with ggplot (there are many available)



Before we begin…


“Tidy” Data

Let’s get the concept of “tidy data” out of the way and see why it’s relevant to ggplot. According to the RStudio Cloud Primers, tidy data is defined as follows:

  • Each variable is in its own column
  • Each observation is in its own row
  • Each value is in its own cell

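Here is an example of untidy data. Year should be its own variable, but instead its values are spread across the column names. Because ggplot maps columns to aesthetics, there is no single column here that you could put on an axis as “year”. By the way, these tables are available when you load tidyverse.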

# untidy data example
pander(table4a)

country       1999     2000
Afghanistan   745      2666
Brazil        37737    80488
China         212258   213766

Here is an example of tidy data; it follows all three conventions.

pander(table1)

country       year   cases    population
Afghanistan   1999   745      19987071
Afghanistan   2000   2666     20595360
Brazil        1999   37737    1.72e+08
Brazil        2000   80488    174504898
China         1999   212258   1.273e+09
China         2000   213766   1.28e+09
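
To connect the two tables: the untidy table4a can be reshaped into tidy form by gathering the year columns into a single variable. The following is a minimal sketch using pivot_longer() from tidyr (loaded with tidyverse); the column names "year" and "cases" are my own choices for illustration, not something fixed by the data.

```r
library(tidyr)   # provides table4a and pivot_longer()

# Reshape the untidy table4a: the two year columns become one "year" variable,
# and the counts they held go into a "cases" column
table4a_tidy <- pivot_longer(
  table4a,
  cols = c(`1999`, `2000`),
  names_to = "year",
  values_to = "cases"
)

table4a_tidy
```

After this step each row is one country-year observation, so year can be mapped to an axis or a color like any other column.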

Logic of ggplot

ggplot follows something called the grammar of graphics, which is based on a book written by Leland Wilkinson. Other software, like Tableau, also uses this approach. See the link in the works-cited section for Kieran Healy’s textbook, where he explains this more thoroughly.

Basically the structure of a ggplot follows as such:

  • Provide some data for the plot
  • Input what aesthetics you want on the plot
  • Tell it the geometry you want to use

And that’s it. Everything else you add is an extension of those three pieces.
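
The three pieces can be sketched in a few lines. This example uses the built-in mtcars data set purely as a stand-in (it is not used elsewhere in this code-through), and the object name p is just for illustration.

```r
library(ggplot2)

p <- ggplot(data = mtcars,           # 1. the data
            aes(x = wt, y = mpg)) +  # 2. the aesthetics: which columns go where
  geom_point()                       # 3. the geometry: how to draw them

p
```

Swapping geom_point() for another geom, or adding labs() and theme() calls after it, changes the look of the plot without changing this basic structure.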


Further Exposition

A lot of what I’ve learned about data visualization with ggplot comes from my experience taking PMAP 8921-Data Visualization, a course taught by Dr. Andrew Heiss. I highly recommend the class website, which is provided in my works-cited section. There is a wealth of resources, videos, and examples on that site, more than I can provide here. Even if you don’t plan on officially taking his course, I suggest visiting the website and enrolling “informally” to learn more about ggplot if you’re interested. I’m sure he won’t mind helping you out or communicating with you.


Basic Example

Here are some very basic ggplot examples. Visualizing data allows you to see patterns in it. I only include two basic examples, to familiarize you with the structure of ggplot. The advanced ones are more spicy :)


Scatter-plot

# let's make a simple scatter plot

ggplot(data = Default,                    # use data 'Default' from ISLR package
                                       
                                          # Define the aesthetics
       aes(x = balance,                   # Credit card balance on x-axis
           y = income,                    # Annual Income on y-axis
           color = default)) +            # color points on default status.
  
  
  # alpha controls the transparency of the dots
  geom_point(alpha = 0.4, 
             aes(shape = default)) # You can use aes() inside a geom too!

Box-plots (and combining them using patchwork)

# Make two box-plots, one for balance and one for income, filled by default status
# combine them in the same image using patchwork library
# guides(fill = 'none') turns off the legend


b1 = ggplot(data = Default, aes(x = default, y = balance, fill = default))+
  geom_boxplot() +
  guides(fill = 'none') +
  labs(title = "Balance & default status")

b2 = ggplot(data = Default, aes(x = default, y = income, fill = default))+
  geom_boxplot()+
  guides(fill = 'none') +
  labs(title = "Income & default status")


# using patchwork library, we can add two plots together!!!!!! 
# See Further resources for more information on patchwork

b1 + b2 + 
  plot_annotation(title = "Combining our two plots using patchwork!")
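
Besides +, patchwork also provides the | and / operators for controlling layout. Here is a small sketch on the built-in mtcars data (a stand-in, with hypothetical object names p1, p2, p3) that places two plots side by side above a third.

```r
library(ggplot2)
library(patchwork)

p1 <- ggplot(mtcars, aes(x = wt, y = mpg)) + geom_point()
p2 <- ggplot(mtcars, aes(x = factor(cyl), y = mpg)) + geom_boxplot()
p3 <- ggplot(mtcars, aes(x = hp)) + geom_histogram(bins = 10)

# `|` places plots side by side, `/` stacks them vertically
combined <- (p1 | p2) / p3

combined
```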


Advanced Examples

These will be way more useful because I can demonstrate a lot of things at once. Hopefully some of the code in the graphics I created might be of some use for you!

Time series!

Here’s a cool time series plot of the GME stock over the last 19 years. I use tidyquant to pull the data, plus a few other packages to polish the plot: grid, png, and ggtext. See the first code chunk for what each one contributes.

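NOTE: The prices in this plot may look wrong: GameStop’s share price peaked at around $350 per share, not around $85 as shown below. A likely explanation is the 4-for-1 stock split in July 2022. Yahoo Finance adjusts historical prices for splits, so data pulled after that date shows earlier prices at about a quarter of what was quoted at the time (350 / 4 is roughly 87).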

# We use tidyquant package here to pull data from yahoo finance
# I will provide a link to the documentation below

# load stock data from yahoo finance
GameStop <- tq_get(x = "GME", from = "2002-01-02", to = "2021-07-20")



# load a random png file I downloaded off Google and make it ggplot friendly
# Credit to Dr. Andrew Heiss for explaining how to plot pngs on ggplot! Blog in citations

wsb_logo_file <- readPNG("C:/Users/celaj/Downloads/wallstreetbetslogo.png")
wsb_logo_plot_piece <- rasterGrob(wsb_logo_file) 




# make plot of adjusted stock price (took a bit but worth it)
ggplot(GameStop, aes(x = date, y = adjusted)) +
  
  # linewidth replaces size for lines in ggplot2 >= 3.4.0 (use size = 1 on older versions)
  geom_line(color = "blue", linewidth = 1) + 
  
  
  # plot that red dot. This is where GME got popular on Reddit (approx)
  geom_point(data= GameStop %>% filter(date == "2021-01-13"),
             size=3,
             colour="red") +
  
  # use scales library to make y-axis USD amounts with labels = dollar
  # adjusting the ticks for the x-axis
  
  scale_y_continuous(labels = dollar) + 
  
  scale_x_date(date_breaks = "2 years", date_labels = "%Y") +
  
  
  # adding labels like title, subtitle, caption, axis
  
  labs(title = "GameStop Stock Price over the last 19 years",
       subtitle = "How Reddit exploded GME stock prices in 2021",
       x = NULL,
       y = "share price",
       caption = "No one really understands how this happened") +
  
  # changing themes, font face
  
  theme_bw() +
  theme(plot.title = element_text(face = "bold")) +
  theme(plot.subtitle = element_text(face = "italic")) +
  theme(plot.caption = element_text(face = "italic")) +
  
  
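  # add an arrow that points toward... the point
  # annotate() draws the segment once; geom_segment() with constants inside aes()
  # would redraw the same arrow for every row of the data

  annotate(geom = "segment",
           x = as.Date("2017-06-01"), y = 50,
           xend = as.Date("2020-06-15"), yend = 15,
           arrow = arrow(length = unit(0.1, "in"), type = "closed")) +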

  
  # add a text label. Note: this is not normal text geom.
  # this is rich_text provided by the ggtext package. 
  # allows you to do this fancy text manipulation :)
  # like bold and color a specific word within a ggplot!!! Credit to Stack Overflow lol (and ggtext documentation)
  
annotate(geom = "rich_text", 
           x = as.Date("2015-06-01"), 
           y = 50, 
           label = "GME starts getting Popular on <br> 
         <b> <span style='color: red;'>r/WallStreetBets</span> </b>",
           fontface = 3, fill = NA, label.color = NA) +
  
  
  
  # insert picture of wsb logo using annotation_custom! Couldn't find the one I wanted tho
  # ymin is the bottom edge and ymax the top edge of the image, in data units
  
  annotation_custom(wsb_logo_plot_piece, 
                    xmin = as.Date("2014-06-01"), xmax = as.Date("2016-06-01"),
                    ymin = 55, ymax = 75) 
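
Since the PNG path above only exists on my machine, here is a sketch of the same image-annotation idea that runs without any file. It swaps in a randomly generated array for the output of readPNG() (which also returns an array of values between 0 and 1) and uses the economics data set that ships with ggplot2. The names fake_image, logo_grob, and p_logo, and the placement coordinates, are assumptions made for illustration.

```r
library(ggplot2)
library(grid)

# Stand-in for readPNG(): a 20 x 20 RGB array with values between 0 and 1
set.seed(1)
fake_image <- array(runif(20 * 20 * 3), dim = c(20, 20, 3))

# Turn the array into a grob that ggplot can draw
logo_grob <- rasterGrob(fake_image)

p_logo <- ggplot(economics, aes(x = date, y = unemploy)) +
  geom_line() +
  # the box for the image is given in data units (dates on x, counts on y)
  annotation_custom(logo_grob,
                    xmin = as.Date("1970-01-01"), xmax = as.Date("1980-01-01"),
                    ymin = 12000, ymax = 15000)

p_logo
```

With a real image, only the first step changes: replace the generated array with readPNG("path/to/your/file.png").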


Maps!!

Here we use the WDI package to pull some World Bank data and map it onto a world map using the sf package. It is one of my favorite things I’ve learned with ggplot. The process is basically: choose a World Bank indicator, download the data, clean it, make sure the ISO3 codes (or whatever key you merge by) match, and left-join it to a shapefile.

# download some data from World bank using WDI package
# Credit to Dr. Andrew Heiss for teaching me how to use these two packages
# Also found a GitHub post that is helpful. Links below in references/works cited


indicators = c("SP.POP.TOTL")  # Population, Total

# if you're wondering where this code comes from:
# go to this website: https://data.worldbank.org/indicator
# pick an indicator and look at the URL. 


wdi_raw = WDI(country = "all", indicator = indicators, extra = TRUE, 
               start = 2020, end = 2020)


# download world shapefile from Natural Earth
# https://www.naturalearthdata.com/downloads/110m-cultural-vectors/

# Ideally you should move into a data directory... 
world_shapes = read_sf("C:/Users/celaj/Downloads/ne_110m_admin_0_countries/ne_110m_admin_0_countries/ne_110m_admin_0_countries.shp")


# selecting the two primary variables. Country Code & total population
wdi_clean <- wdi_raw %>% 
  select(TotalPop = SP.POP.TOTL, iso3c)



# left join the WDI data we downloaded to the shape file
# We match by the ISO3 code in each file
# Some parts of the map are grayed out
# That means data is missing or the ISO3 code is incorrect. 
# I corrected two of them below.

merged_map_data = world_shapes %>%
  # fix the two countries
  mutate(ISO_A3 = case_when(
    ADMIN == "Norway" ~ "NOR",
    ADMIN == "France" ~ "FRA",
    TRUE ~ ISO_A3)) %>%
  
  left_join(wdi_clean, by = c("ISO_A3" = "iso3c")) %>%
  filter(ISO_A3 != "ATA") # no one lives in Antarctica



# make the world map graph
ggplot() + 
  geom_sf(data = merged_map_data, aes(fill = TotalPop)) +
  
  coord_sf(crs = st_crs("EPSG:4326")) +  # WGS 84: DOD GPS coordinates 
  
  # the viridis library can be used for continuous color scales
  # but you can also create your own scale with your preferred colors
  # Like this: scale_fill_gradient(low = "#AF7AC5", high = "#E74C3C") # purple and red gradient scale
  
  scale_fill_viridis(option = "plasma")  +
  
  # give it a title
  labs(title = "Total Population by country in 2020",
       subtitle = "Some of the data is missing or incorrect for certain countries.",
       caption = "China and India are the most populous countries") +
  
  
  # hjust is the horizontal adjustment. 0.5 = center
  theme_void() +
  theme(plot.title = element_text(hjust = 0.5, face = "bold")) +
  theme(plot.subtitle = element_text(hjust = 0.5, face = "italic")) +
  theme(legend.position = "bottom")
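
The “make sure the ISO3 codes match” step is easy to skip, so here is a small sketch of how one might check it before mapping. It uses tiny made-up tables instead of the real shapefile and WDI download; shapes_toy, wdi_toy, the "-99" placeholder code, and the population numbers are all hypothetical values chosen for illustration.

```r
library(dplyr)

# Hypothetical toy stand-ins for the shapefile attributes and the WDI data
shapes_toy <- tibble(
  ADMIN  = c("Norway", "France", "Brazil"),
  ISO_A3 = c("-99", "FRA", "BRA")      # "-99" plays the role of a bad code
)
wdi_toy <- tibble(
  iso3c    = c("NOR", "FRA", "BRA"),
  TotalPop = c(5.4e6, 67e6, 213e6)     # made-up round numbers
)

# anti_join() keeps the rows of the shape data with no partner in the WDI data;
# these are the countries that would show up gray on the map
unmatched <- anti_join(shapes_toy, wdi_toy, by = c("ISO_A3" = "iso3c"))
unmatched

# After repairing the code, every row finds a match in the left join
fixed <- shapes_toy %>%
  mutate(ISO_A3 = if_else(ADMIN == "Norway", "NOR", ISO_A3)) %>%
  left_join(wdi_toy, by = c("ISO_A3" = "iso3c"))
fixed
```

Running the same anti_join() on the real merged data is one way to list which countries still need a case_when() fix like the two above.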


Combine point and line to make a lollipop chart…because why not?

# pull the data like before. this time unemployment numbers

indicators = c("SL.UEM.TOTL.NE.ZS")  # Unemployment, total (% of total labor force) 
wdi_raw = WDI(country = "all", indicator = indicators, extra = TRUE, 
               start = 2019, end = 2019)

# selecting the variables I'll use
wdi_clean <- wdi_raw %>% 
  select(UnemployPerc= SL.UEM.TOTL.NE.ZS, iso3c, country, year)

wdi_clean = na.omit(wdi_clean) # bunch of missing in there. Ignore those

# filter some weird things and aggregates out to keep only countries (what I want to compare)

mylist = c('WLD', 'EAP', 'ECA', 'SAS', 'CEB', 'OED', 'TLA','TEA', 'TSA', 'TEC','EMU', 
           'EAR', 'LTE','PST','LAC','LMC','LMY', 'MIC','UMC','NAC', 'EAS', 'ECS','IBT')

  
set.seed(1010102) # so we get consistent results
mydata = wdi_clean %>%
  filter(!iso3c %in% mylist)  %>%
  slice_sample(n = 10) %>%         # get random sample because can't decide
  arrange(desc(UnemployPerc)) 
  

ggplot(data = mydata, aes(x = reorder(country, UnemployPerc), 
                          y = UnemployPerc, 
                          label = paste0(round(UnemployPerc,2), "%")))+
  
  geom_point(color = "black", size = 12) +
  geom_segment(aes(x=country, xend=country, y=0, yend=UnemployPerc)) +
  
  
  coord_flip() +
  guides(color = "none") +
  geom_text(color = "white", size = 2.5, fontface = 2) + 
  labs(title = "% Unemployment of 10 random countries" ,
       x = "countries",
       caption = "Wall Street Journal theme provided by ggthemes") +
  
  theme_wsj() +
  
  theme(plot.title = element_text(size = 13, hjust = 0.5, face = "bold"),
        plot.caption = element_text(size = 11, hjust = 0.5))
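
The lollipop idea itself, stripped of the data download and theming, can be sketched as a segment plus a point per category. This version uses a made-up data frame (toy, with hypothetical values that are not real unemployment figures) and maps the category to the y-axis directly instead of using coord_flip().

```r
library(ggplot2)

# Hypothetical values, not real unemployment figures
toy <- data.frame(
  group = c("A", "B", "C", "D"),
  value = c(3.2, 7.5, 5.1, 9.8)
)

p_lollipop <- ggplot(toy, aes(x = value, y = reorder(group, value))) +
  # the stick: from 0 to the value, drawn first so the dot sits on top of it
  geom_segment(aes(x = 0, xend = value, yend = reorder(group, value))) +
  # the candy: one point per group at its value
  geom_point(size = 6) +
  labs(x = "value", y = NULL)

p_lollipop
```

reorder(group, value) sorts the categories by their value, which is what gives the chart its staircase look.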



Further Resources

Learn more about ggplot and the related things I’ve shown with the following:


There are so many other sources out there. Check GitHub, Reddit, etc.



Works Cited

This code-through references and cites the following sources: