FIRST: Many thanks to Alison Horst for her gorgerous aRt.
Artwork by the incredible Alison Horst
ggplot2() is the bread and butter of data visualization; it takes clean, organized, manipulated data, and (with your direction) builds beautiful, & more importantly, communicative, plots. ggplot2 is based on the grammar of graphics, which asserts that we can build every graph from the same few components: 1) a clean and tidy data set; 2) a set of geoms that map out data points; and 3) a coordinate system (in the broadest sense; ie. how will the data be mapped onto a graphical surface). To actually display data values, we map out our variables to aesthetic things like size, color, and x and y locations; we also tell ggplot2 the type of visualization we are interested in building (e.g., bar graph, box plot, line graph, density plots, etc.). ggplot2 opens up data science to broader audiences and helps all of us communicate our science. Download this cheat sheet and save it somewhere accessible–it’s an incredible tool to refer back to!
ggplot2 works with data.frames, the data type we built in our previous tidyverse workshop. The data we feed into ggplot2 consists of rows and columns that “live peacefully” together in a tidy data.frame. This means, often, we need to do some data cleaning up front to get our data into the tidy format we need for visualization.
#install.packages("DALEX")
library(tidyverse)
library(DALEX)
library(RColorBrewer)
data(dragons)
dragons <- view(dragons)
In honor of Black Dragon Canyon Wash Let’s pretend we’re trying to understand how different species of chromatic dragons (a decently tempermental critter) with various life lengths compare in terms of number of scars and BMI.
We should first familiarize ourselves with the data.
dim(dragons) # view dimensions of the df
glimpse(dragons) # view the data structure and first few observations
str(dragons) # view data structure of df
colnames(dragons) # view the columns of df
Let’s quickly brush up on wrangling and clean up this data set for our purposes. We will:
Let’s create an efficient workflow by combining ALL of these data wrangling steps into one; ie. let’s review harnessing the great POWERS of tidyverse()!
We split each wrangling step up into a separate data frame, but you could have linked all these functions together in one chunk using the pipe operator, like this:
# recall how we clean and manipulate data
dragons_simple <- dragons %>%
filter(colour %in% c("black","blue")) %>% # filter for blue & back dragons
select(2:5,7, life_length) %>% # select only relevant columns (height, weight, scars, colour, number_of_lost_teeth, and life_length)
rename(teeth = number_of_lost_teeth,
age = life_length) %>% # rename columns with terribly long names
mutate(genus = ifelse(colour %in% c("blue"), "Sauroniops", "Jaggermeryx"), # create new column, genus, based on color; if blue = Sauroniops, if black = Jaggermeryx
species = case_when(
colour == "blue" & age < 1200 ~ "reike",
colour == "blue" & age > 1200 ~ "naida",
colour == "black" & age < 1000 ~ "ozzyi",
colour == "black" & (age >= 1000 & age <= 1700) ~ "whido",
colour == "black" & age > 1700 ~ "strummeri")) %>% # create species column based on colour and age
mutate(BMI = (weight*2000)/0.45359237 / height^2) %>% # create new column, BMI (formula = (weight*2000)/0.45359237 / height^2)
unite(genus_species, genus, species, sep = " ") %>% # combine genus and species
mutate(age_group = as.factor(case_when(
age < 1000 ~ "dragonling",
age >= 1000 & age <= 2000 ~ "jouvenile",
age > 2000 ~ "adult"))) # create categorical variables to group dragons by age
# recall how we save data
saveRDS(dragons_simple, "./out-data/dragons-tidy.RDS")
With this simplified and cleaned data set, we’re ready to explore! Let’s first isolate data we want to visualize by:
fav_spp <- dragons_simple %>%
group_by(age_group,genus_species) %>% # group by age group and species
summarise(ave_scars = ave(scars),
ave_BMI = ave(BMI),
ave_teeth = ave(teeth)) %>% # summarise average scars, BMI, and teeth lost
distinct() %>% # keep only distinct observations
pivot_longer(cols = c(`ave_scars`, `ave_BMI`, `ave_teeth`), names_to = "summary_var", values_to = "ave") # pivot_longer() to format for column graph
## `summarise()` regrouping output by 'age_group', 'genus_species' (override with `.groups` argument)
Now that we have our data summarised and in tidy format, we’re ready to make a plot! We want to:
Note: Only the first 3 lines of the following code are necessary to make the plot. Everything else simply modifies the appearance and make it a bit more presentable. There are tons of ways to customize plots – we explore only a few options below.
ggplot(fav_spp, aes(x = age_group, y = ave, fill = summary_var)) + # fill = summary variables
geom_col(position = "dodge") + # separate columns for each age group
facet_wrap(~genus_species) + # create separate panels for each species
ggtitle("Chromatic dragon traits") + # add a title
labs(x = "Age Group", y = "", fill = "Summary Variable") + # add labels on the axes and legend
scale_fill_manual(labels = c("Average BMI", "Average Scars", "Average Teeth Lost"), values = c("darkseagreen3", "cadetblue", "orange")) + # change colors and names
theme_bw() + # change the theme
theme(panel.border = element_rect(colour = "black", fill = NA, size = 0.7), # change the border color and size
axis.text.x = element_text(angle = 45, hjust = 0.9)) + # change the angle of the x axis text, change the horizontal adjustment
ggsave("./out-plots/dragon_plot.png") # save the plot
## Saving 15 x 10 in image
Let’s visualize this data in a different way; Let’s:
dragons_simple %>% filter(genus_species != "Sauroniops reike" & genus_species != "Sauroniops naida") %>%
ggplot(.) + # fill = summary variables
geom_point(aes(x = weight, y = height, color = genus_species, size = BMI), alpha = 0.6) + # separate columns for each age group
scale_size_continuous(range = c(1, 11)) +
ggtitle("Chromatic dragon height, weight, and BMI") + # add a title
labs(x = "Weight (tons)", y = "Height (meters)", color = "Species") + # add labels on the axes and legend
scale_color_manual(values = c("#44AA99", "#88CCEE", "#DDCC77", "#CC6677", "#117733")) + # change colors and names
theme_bw() + # change the theme
ggsave("./out-plots/dragon_BMI.png") # save the plot
## Saving 15 x 10 in image
Let’s visualize this data in a different way; Let’s:
ggplot(dragons_simple, aes(x = genus_species, y = teeth, fill = genus_species)) +
geom_boxplot() + # make a boxplot
labs(x = "", y = "Teeth Lost", fill = "Species") + # give the graph labels
theme_minimal() + # change the theme
scale_fill_brewer(palette = "Dark2") + # use a preset palette
theme(axis.text.x= element_text(angle = 45, hjust = 1)) + # adjust the angle of the x axis text
theme(legend.position = "bottom") + # move legend to the bottom
guides(fill = guide_legend(nrow=2)) + # give legend two rows
ggsave("./out-plots/dragon_BMI.png") # save the plot
## Saving 7 x 5 in image
There are just about a bazillion different ways to visualize the same data:
Thank you stackoverflow, for this wild viz of all the ways we coulod visualize data about diamonds!
… In the end we just have to choose the method that makes the most sense for our audience, our research questions and our data!
Additionally, aRtists everywhere are using R to make ridiculously cool pieces of generative and patterned art. Check out some talented RLadies here and here! Plus, #TidyTuesday submissions can be absolutely, wildly, brilliantly, good.
What’s especially neat about ggplot2 is that it interfaces seamlessly with sf, a package for spatial data wrangling, analysis, and viz.sf treats its spatial objects as data.frames meaning…. we can use all of the magic & power of the tidyverse to play with the data! Plus, these data look just like any other data.frame, except attached to them is a column geometry that houses the coordinate information for the observation!
Below, we’ll take the the EPA air quality data Ellie & Dakoeta gathered and make it spatial for visualization!
# install.packages("sf") # if you haven't done so already!
library(sf)
## Linking to GEOS 3.8.0, GDAL 3.0.4, PROJ 6.3.1
# read in the data
epa <- readRDS("./in-data/alldata_1980_2020.RDS")
# let's explore a bit
dim(epa) # view dimensions of the df
## [1] 251214 16
glimpse(epa) # view the data structure and first few observations
## Rows: 251,214
## Columns: 16
## $ county <chr> "San Bernardino", "San Bernardino", "San Bernard...
## $ county_code <chr> "071", "071", "071", "071", "071", "071", "071",...
## $ date_local <chr> "1988-08-24", "1988-08-20", "1988-08-17", "1988-...
## $ parameter_code <chr> "88502", "88502", "88502", "88502", "88502", "88...
## $ arithmetic_mean <dbl> 4.03279, 16.52027, 13.55705, 8.23171, 15.81250, ...
## $ aqi <int> 17, 60, 54, 34, 59, 20, 46, 67, 39, 65, 48, 56, ...
## $ state_code <chr> "06", "06", "06", "06", "06", "06", "06", "06", ...
## $ site_number <chr> "9010", "9010", "9010", "9010", "9010", "9010", ...
## $ latitude <dbl> 34.19393, 34.19393, 34.19393, 34.19393, 34.19393...
## $ longitude <dbl> -116.9141, -116.9141, -116.9141, -116.9141, -116...
## $ sample_duration <chr> "24 HOUR", "24 HOUR", "24 HOUR", "24 HOUR", "24 ...
## $ datum <chr> "WGS84", "WGS84", "WGS84", "WGS84", "WGS84", "WG...
## $ pollutant_standard <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ validity_indicator <chr> "Y", "Y", "Y", "Y", "Y", "Y", "Y", "Y", "Y", "Y"...
## $ method <chr> "IMPROVE Module A with Cyclone Inlet-Teflon Filt...
## $ event_type <chr> "None", "None", "None", "None", "None", "None", ...
str(epa) # view data structure of df
## tibble [251,214 x 16] (S3: tbl_df/tbl/data.frame)
## $ county : chr [1:251214] "San Bernardino" "San Bernardino" "San Bernardino" "San Bernardino" ...
## $ county_code : chr [1:251214] "071" "071" "071" "071" ...
## $ date_local : chr [1:251214] "1988-08-24" "1988-08-20" "1988-08-17" "1988-08-13" ...
## $ parameter_code : chr [1:251214] "88502" "88502" "88502" "88502" ...
## $ arithmetic_mean : num [1:251214] 4.03 16.52 13.56 8.23 15.81 ...
## $ aqi : int [1:251214] 17 60 54 34 59 20 46 67 39 65 ...
## $ state_code : chr [1:251214] "06" "06" "06" "06" ...
## $ site_number : chr [1:251214] "9010" "9010" "9010" "9010" ...
## $ latitude : num [1:251214] 34.2 34.2 34.2 34.2 34.2 ...
## $ longitude : num [1:251214] -117 -117 -117 -117 -117 ...
## $ sample_duration : chr [1:251214] "24 HOUR" "24 HOUR" "24 HOUR" "24 HOUR" ...
## $ datum : chr [1:251214] "WGS84" "WGS84" "WGS84" "WGS84" ...
## $ pollutant_standard: chr [1:251214] NA NA NA NA ...
## $ validity_indicator: chr [1:251214] "Y" "Y" "Y" "Y" ...
## $ method : chr [1:251214] "IMPROVE Module A with Cyclone Inlet-Teflon Filter, 2.2 sq. cm. - GRAVIMETRIC" "IMPROVE Module A with Cyclone Inlet-Teflon Filter, 2.2 sq. cm. - GRAVIMETRIC" "IMPROVE Module A with Cyclone Inlet-Teflon Filter, 2.2 sq. cm. - GRAVIMETRIC" "IMPROVE Module A with Cyclone Inlet-Teflon Filter, 2.2 sq. cm. - GRAVIMETRIC" ...
## $ event_type : chr [1:251214] "None" "None" "None" "None" ...
colnames(epa) # view the columns of df
## [1] "county" "county_code" "date_local"
## [4] "parameter_code" "arithmetic_mean" "aqi"
## [7] "state_code" "site_number" "latitude"
## [10] "longitude" "sample_duration" "datum"
## [13] "pollutant_standard" "validity_indicator" "method"
## [16] "event_type"
# give this data.frame real dates!
dates <- data.frame(str_split_fixed(epa$date_local, "-", n = 3)) %>%
rename(year = X1,
month = X2,
day = X3)
# bind year, month, day columns to epa data.frame
epa <- bind_cols(dates, epa)
# build_spatial
epa_sf <- st_as_sf(x = epa,
coords = c("longitude", "latitude"),
crs = "+proj=aea +lat_0=0 +lon_0=-120 +lat_1=34 +lat_2=40.5 +x_0=0 +y_0=-4000000 +datum=NAD83 +units=m +no_defs")
glimpse(epa_sf) # notice the new column, geometry
## Rows: 251,214
## Columns: 18
## $ year <fct> 1988, 1988, 1988, 1988, 1988, 1988, 1988, 1988, ...
## $ month <fct> 08, 08, 08, 08, 08, 08, 08, 07, 07, 07, 07, 07, ...
## $ day <fct> 24, 20, 17, 13, 10, 06, 03, 30, 27, 23, 20, 16, ...
## $ county <chr> "San Bernardino", "San Bernardino", "San Bernard...
## $ county_code <chr> "071", "071", "071", "071", "071", "071", "071",...
## $ date_local <chr> "1988-08-24", "1988-08-20", "1988-08-17", "1988-...
## $ parameter_code <chr> "88502", "88502", "88502", "88502", "88502", "88...
## $ arithmetic_mean <dbl> 4.03279, 16.52027, 13.55705, 8.23171, 15.81250, ...
## $ aqi <int> 17, 60, 54, 34, 59, 20, 46, 67, 39, 65, 48, 56, ...
## $ state_code <chr> "06", "06", "06", "06", "06", "06", "06", "06", ...
## $ site_number <chr> "9010", "9010", "9010", "9010", "9010", "9010", ...
## $ sample_duration <chr> "24 HOUR", "24 HOUR", "24 HOUR", "24 HOUR", "24 ...
## $ datum <chr> "WGS84", "WGS84", "WGS84", "WGS84", "WGS84", "WG...
## $ pollutant_standard <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ validity_indicator <chr> "Y", "Y", "Y", "Y", "Y", "Y", "Y", "Y", "Y", "Y"...
## $ method <chr> "IMPROVE Module A with Cyclone Inlet-Teflon Filt...
## $ event_type <chr> "None", "None", "None", "None", "None", "None", ...
## $ geometry <POINT [m]> POINT (-116.9141 34.19393), POINT (-116.91...
# visualize
epa_sf %>% filter(year == 2019, event_type == "None") %>%
ggplot(.) +
geom_sf(aes(color = aqi)) +
theme_bw()
There are many, many ways to make maps beautiful in ggplot using sf. Anyone who says R isn’t built for spatial viz just hasn’t tried hard enough–check out this if you don’t believe me (just about the most gorgeous, creative, and well-built map I’ve ever seen).