FIRST: Many thanks to Alison Horst for her gorgerous aRt.

Data Visualization with ggplot2()

What is ggplot2()?

Artwork by the incredible Alison Horst

ggplot2() is the bread and butter of data visualization; it takes clean, organized, manipulated data, and (with your direction) builds beautiful, & more importantly, communicative, plots. ggplot2 is based on the grammar of graphics, which asserts that we can build every graph from the same few components: 1) a clean and tidy data set; 2) a set of geoms that map out data points; and 3) a coordinate system (in the broadest sense; ie. how will the data be mapped onto a graphical surface). To actually display data values, we map out our variables to aesthetic things like size, color, and x and y locations; we also tell ggplot2 the type of visualization we are interested in building (e.g., bar graph, box plot, line graph, density plots, etc.). ggplot2 opens up data science to broader audiences and helps all of us communicate our science. Download this cheat sheet and save it somewhere accessible–it’s an incredible tool to refer back to!

ggplot2 works with data.frames, the data type we built in our previous tidyverse workshop. The data we feed into ggplot2 consists of rows and columns that “live peacefully” together in a tidy data.frame. This means, often, we need to do some data cleaning up front to get our data into the tidy format we need for visualization.

Let’s get started on some REALish data!

Load the tidyverse and any additional required packages:

#install.packages("DALEX")
library(tidyverse)
library(DALEX)
library(RColorBrewer)

data(dragons)
dragons <- view(dragons)

In honor of Black Dragon Canyon Wash Let’s pretend we’re trying to understand how different species of chromatic dragons (a decently tempermental critter) with various life lengths compare in terms of number of scars and BMI.

Explore

We should first familiarize ourselves with the data.

dim(dragons) # view dimensions of the df
glimpse(dragons) # view the data structure and first few observations
str(dragons) # view data structure of df
colnames(dragons) # view the columns of df

Wrangle

Let’s quickly brush up on wrangling and clean up this data set for our purposes. We will:

  1. filter for black & blue dragons
  2. select relevant columns of data
  3. rename columns
  4. create new columns
  5. combine genus and species into a single column
  6. create categorical variable to group dragons by age!

Let’s create an efficient workflow by combining ALL of these data wrangling steps into one; ie. let’s review harnessing the great POWERS of tidyverse()!

Now let’s pull all of these steps together!

We split each wrangling step up into a separate data frame, but you could have linked all these functions together in one chunk using the pipe operator, like this:

# recall how we clean and manipulate data
dragons_simple <- dragons %>% 
  filter(colour %in% c("black","blue")) %>% # filter for blue & back dragons
  select(2:5,7, life_length) %>%  # select only relevant columns (height, weight, scars, colour, number_of_lost_teeth, and life_length)
  rename(teeth = number_of_lost_teeth,
         age = life_length) %>% # rename columns with terribly long names
  mutate(genus = ifelse(colour %in% c("blue"), "Sauroniops", "Jaggermeryx"), # create new column, genus, based on color; if blue = Sauroniops, if black = Jaggermeryx
         species = case_when(
              colour == "blue" & age < 1200 ~ "reike",
              colour == "blue" & age > 1200 ~ "naida",
              colour == "black" & age < 1000 ~ "ozzyi",
              colour == "black" & (age >= 1000 & age <= 1700) ~ "whido",
              colour == "black" & age > 1700 ~ "strummeri")) %>% # create species column based on colour and age
  mutate(BMI = (weight*2000)/0.45359237 / height^2) %>% # create new column, BMI (formula = (weight*2000)/0.45359237 / height^2)
  unite(genus_species, genus, species, sep = " ") %>% # combine genus and species
  mutate(age_group = as.factor(case_when(
              age < 1000 ~ "dragonling",
              age >= 1000 & age <= 2000 ~ "jouvenile",
              age > 2000 ~ "adult"))) # create categorical variables to group dragons by age

# recall how we save data
saveRDS(dragons_simple, "./out-data/dragons-tidy.RDS")

With this simplified and cleaned data set, we’re ready to explore! Let’s first isolate data we want to visualize by:

  1. grouping observations by age_group & genus_species
  2. finding the average scars, BMI, and number of teeth for each species-age combination
  3. pivot_longer() into tidy format for visualizing
fav_spp <- dragons_simple %>% 
  group_by(age_group,genus_species) %>% # group by age group and species
  summarise(ave_scars = ave(scars),
            ave_BMI = ave(BMI),
            ave_teeth = ave(teeth)) %>% # summarise average scars, BMI, and teeth lost
  distinct() %>% # keep only distinct observations
  pivot_longer(cols = c(`ave_scars`, `ave_BMI`, `ave_teeth`), names_to = "summary_var", values_to = "ave") # pivot_longer() to format for column graph
## `summarise()` regrouping output by 'age_group', 'genus_species' (override with `.groups` argument)

Column graph

Now that we have our data summarised and in tidy format, we’re ready to make a plot! We want to:

  1. create a column graph showing the summarised by species and by age_group
  2. create a different panel for each dragon species
  3. make it pretty

Note: Only the first 3 lines of the following code are necessary to make the plot. Everything else simply modifies the appearance and make it a bit more presentable. There are tons of ways to customize plots – we explore only a few options below.

ggplot(fav_spp, aes(x = age_group, y = ave, fill = summary_var)) + # fill = summary variables
  geom_col(position = "dodge") + # separate columns for each age group
  facet_wrap(~genus_species) + # create separate panels for each species
  ggtitle("Chromatic dragon traits") + # add a title
  labs(x = "Age Group", y = "", fill = "Summary Variable") + # add labels on the axes and legend
  scale_fill_manual(labels = c("Average BMI", "Average Scars", "Average Teeth Lost"), values = c("darkseagreen3", "cadetblue", "orange")) + # change colors and names
  theme_bw() + # change the theme
  theme(panel.border = element_rect(colour = "black", fill = NA, size = 0.7), # change the border color and size
        axis.text.x = element_text(angle = 45, hjust = 0.9)) + # change the angle of the x axis text, change the horizontal adjustment
  ggsave("./out-plots/dragon_plot.png") # save the plot
## Saving 15 x 10 in image

Line graph

Let’s visualize this data in a different way; Let’s:

  1. create a line graph that shows how height and weight relate by species, and let’s only filter for rare species (ie., NOT
    Sauroniops naida or Sauroniops reike)
  2. make it pretty
dragons_simple %>% filter(genus_species != "Sauroniops reike" & genus_species != "Sauroniops naida") %>% 
ggplot(.) + # fill = summary variables
  geom_point(aes(x = weight, y = height, color = genus_species, size = BMI), alpha = 0.6) + # separate columns for each age group
  scale_size_continuous(range = c(1, 11)) +
  ggtitle("Chromatic dragon height, weight, and BMI") + # add a title
  labs(x = "Weight (tons)", y = "Height (meters)", color = "Species") + # add labels on the axes and legend
  scale_color_manual(values = c("#44AA99", "#88CCEE", "#DDCC77", "#CC6677", "#117733")) + # change colors and names
  theme_bw() + # change the theme
  ggsave("./out-plots/dragon_BMI.png") # save the plot
## Saving 15 x 10 in image

Histogram

Let’s visualize this data in a different way; Let’s:

  1. create a histogram that shows the range of teeth lost by species
  2. make it pretty
ggplot(dragons_simple, aes(x = genus_species, y = teeth, fill = genus_species)) +
  geom_boxplot() + # make a boxplot
  labs(x = "", y = "Teeth Lost", fill = "Species") + # give the graph labels
  theme_minimal() + # change the theme
  scale_fill_brewer(palette = "Dark2") + # use a preset palette
  theme(axis.text.x= element_text(angle = 45, hjust = 1)) + # adjust the angle of the x axis text
  theme(legend.position = "bottom") + # move legend to the bottom
  guides(fill = guide_legend(nrow=2)) + # give legend two rows
  ggsave("./out-plots/dragon_BMI.png") # save the plot
## Saving 7 x 5 in image

Plots, plots, plots, plots, plots, plots, plots!

There are just about a bazillion different ways to visualize the same data:

Thank you stackoverflow, for this wild viz of all the ways we coulod visualize data about diamonds!

… In the end we just have to choose the method that makes the most sense for our audience, our research questions and our data!

Additionally, aRtists everywhere are using R to make ridiculously cool pieces of generative and patterned art. Check out some talented RLadies here and here! Plus, #TidyTuesday submissions can be absolutely, wildly, brilliantly, good.

Bonus

What’s especially neat about ggplot2 is that it interfaces seamlessly with sf, a package for spatial data wrangling, analysis, and viz.sf treats its spatial objects as data.frames meaning…. we can use all of the magic & power of the tidyverse to play with the data! Plus, these data look just like any other data.frame, except attached to them is a column geometry that houses the coordinate information for the observation!

Below, we’ll take the the EPA air quality data Ellie & Dakoeta gathered and make it spatial for visualization!

# install.packages("sf") # if you haven't done so already!
library(sf)
## Linking to GEOS 3.8.0, GDAL 3.0.4, PROJ 6.3.1
# read in the data
epa <- readRDS("./in-data/alldata_1980_2020.RDS")

# let's explore a bit
dim(epa) # view dimensions of the df
## [1] 251214     16
glimpse(epa) # view the data structure and first few observations
## Rows: 251,214
## Columns: 16
## $ county             <chr> "San Bernardino", "San Bernardino", "San Bernard...
## $ county_code        <chr> "071", "071", "071", "071", "071", "071", "071",...
## $ date_local         <chr> "1988-08-24", "1988-08-20", "1988-08-17", "1988-...
## $ parameter_code     <chr> "88502", "88502", "88502", "88502", "88502", "88...
## $ arithmetic_mean    <dbl> 4.03279, 16.52027, 13.55705, 8.23171, 15.81250, ...
## $ aqi                <int> 17, 60, 54, 34, 59, 20, 46, 67, 39, 65, 48, 56, ...
## $ state_code         <chr> "06", "06", "06", "06", "06", "06", "06", "06", ...
## $ site_number        <chr> "9010", "9010", "9010", "9010", "9010", "9010", ...
## $ latitude           <dbl> 34.19393, 34.19393, 34.19393, 34.19393, 34.19393...
## $ longitude          <dbl> -116.9141, -116.9141, -116.9141, -116.9141, -116...
## $ sample_duration    <chr> "24 HOUR", "24 HOUR", "24 HOUR", "24 HOUR", "24 ...
## $ datum              <chr> "WGS84", "WGS84", "WGS84", "WGS84", "WGS84", "WG...
## $ pollutant_standard <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ validity_indicator <chr> "Y", "Y", "Y", "Y", "Y", "Y", "Y", "Y", "Y", "Y"...
## $ method             <chr> "IMPROVE Module A with Cyclone Inlet-Teflon Filt...
## $ event_type         <chr> "None", "None", "None", "None", "None", "None", ...
str(epa) # view data structure of df
## tibble [251,214 x 16] (S3: tbl_df/tbl/data.frame)
##  $ county            : chr [1:251214] "San Bernardino" "San Bernardino" "San Bernardino" "San Bernardino" ...
##  $ county_code       : chr [1:251214] "071" "071" "071" "071" ...
##  $ date_local        : chr [1:251214] "1988-08-24" "1988-08-20" "1988-08-17" "1988-08-13" ...
##  $ parameter_code    : chr [1:251214] "88502" "88502" "88502" "88502" ...
##  $ arithmetic_mean   : num [1:251214] 4.03 16.52 13.56 8.23 15.81 ...
##  $ aqi               : int [1:251214] 17 60 54 34 59 20 46 67 39 65 ...
##  $ state_code        : chr [1:251214] "06" "06" "06" "06" ...
##  $ site_number       : chr [1:251214] "9010" "9010" "9010" "9010" ...
##  $ latitude          : num [1:251214] 34.2 34.2 34.2 34.2 34.2 ...
##  $ longitude         : num [1:251214] -117 -117 -117 -117 -117 ...
##  $ sample_duration   : chr [1:251214] "24 HOUR" "24 HOUR" "24 HOUR" "24 HOUR" ...
##  $ datum             : chr [1:251214] "WGS84" "WGS84" "WGS84" "WGS84" ...
##  $ pollutant_standard: chr [1:251214] NA NA NA NA ...
##  $ validity_indicator: chr [1:251214] "Y" "Y" "Y" "Y" ...
##  $ method            : chr [1:251214] "IMPROVE Module A with Cyclone Inlet-Teflon Filter, 2.2 sq. cm. - GRAVIMETRIC" "IMPROVE Module A with Cyclone Inlet-Teflon Filter, 2.2 sq. cm. - GRAVIMETRIC" "IMPROVE Module A with Cyclone Inlet-Teflon Filter, 2.2 sq. cm. - GRAVIMETRIC" "IMPROVE Module A with Cyclone Inlet-Teflon Filter, 2.2 sq. cm. - GRAVIMETRIC" ...
##  $ event_type        : chr [1:251214] "None" "None" "None" "None" ...
colnames(epa) # view the columns of df
##  [1] "county"             "county_code"        "date_local"        
##  [4] "parameter_code"     "arithmetic_mean"    "aqi"               
##  [7] "state_code"         "site_number"        "latitude"          
## [10] "longitude"          "sample_duration"    "datum"             
## [13] "pollutant_standard" "validity_indicator" "method"            
## [16] "event_type"
# give this data.frame real dates!
dates <- data.frame(str_split_fixed(epa$date_local, "-", n = 3)) %>%
  rename(year = X1,
         month = X2,
         day = X3)

# bind year, month, day columns to epa data.frame
epa <- bind_cols(dates, epa)

# build_spatial 
epa_sf <- st_as_sf(x = epa, 
                        coords = c("longitude", "latitude"),
                        crs = "+proj=aea +lat_0=0 +lon_0=-120 +lat_1=34 +lat_2=40.5 +x_0=0 +y_0=-4000000 +datum=NAD83 +units=m +no_defs")

glimpse(epa_sf) # notice the new column, geometry
## Rows: 251,214
## Columns: 18
## $ year               <fct> 1988, 1988, 1988, 1988, 1988, 1988, 1988, 1988, ...
## $ month              <fct> 08, 08, 08, 08, 08, 08, 08, 07, 07, 07, 07, 07, ...
## $ day                <fct> 24, 20, 17, 13, 10, 06, 03, 30, 27, 23, 20, 16, ...
## $ county             <chr> "San Bernardino", "San Bernardino", "San Bernard...
## $ county_code        <chr> "071", "071", "071", "071", "071", "071", "071",...
## $ date_local         <chr> "1988-08-24", "1988-08-20", "1988-08-17", "1988-...
## $ parameter_code     <chr> "88502", "88502", "88502", "88502", "88502", "88...
## $ arithmetic_mean    <dbl> 4.03279, 16.52027, 13.55705, 8.23171, 15.81250, ...
## $ aqi                <int> 17, 60, 54, 34, 59, 20, 46, 67, 39, 65, 48, 56, ...
## $ state_code         <chr> "06", "06", "06", "06", "06", "06", "06", "06", ...
## $ site_number        <chr> "9010", "9010", "9010", "9010", "9010", "9010", ...
## $ sample_duration    <chr> "24 HOUR", "24 HOUR", "24 HOUR", "24 HOUR", "24 ...
## $ datum              <chr> "WGS84", "WGS84", "WGS84", "WGS84", "WGS84", "WG...
## $ pollutant_standard <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ validity_indicator <chr> "Y", "Y", "Y", "Y", "Y", "Y", "Y", "Y", "Y", "Y"...
## $ method             <chr> "IMPROVE Module A with Cyclone Inlet-Teflon Filt...
## $ event_type         <chr> "None", "None", "None", "None", "None", "None", ...
## $ geometry           <POINT [m]> POINT (-116.9141 34.19393), POINT (-116.91...
# visualize
epa_sf %>% filter(year == 2019, event_type == "None") %>% 
ggplot(.) +
  geom_sf(aes(color = aqi)) +
  theme_bw()

There are many, many ways to make maps beautiful in ggplot using sf. Anyone who says R isn’t built for spatial viz just hasn’t tried hard enough–check out this if you don’t believe me (just about the most gorgeous, creative, and well-built map I’ve ever seen).

Resources