Welcome back to PHW251B! This lab module will be covering various
interactive plots and tables that can be created in R with different
packages. We will be extending various types of plots that we have been
working with thus far in ways that allow for increased engagement and
flexibility for our audiences. Interactive plots are incredibly powerful
in R, since the reproducibility can allow for on-the-fly dashboards and
interactive reports that need minimal upkeep after the initial coding.
Creating interactive plots also encompasses the ideals of accessibility
needed for static visualizations, and requires additional care when
considering the user interactivity and experience. We will try to keep
this in mind while exploring options in packages such as
plotly, datatables and others.
plotly and how to create interactive plots from
existing ggplot2 objectsplotly objectsggiraphDT and
reactableToday we’ll be using a variety of datasets.
US EPA Air Quality Index, Historical Data
AQI, or Air Quality Index, is the primary way to measure the current quality of the air.
AQI values are given these ranges:
| AQI Ranges | Category | Colors |
|---|---|---|
| 0-50 | Good | Green |
| 51-100 | Moderate | Yellow |
| 101-150 | Unhealthy for Sensitive Groups | Orange |
| 151-200 | Unhealthy | Red |
| 201-300 | Very Unhealthy | Purple |
| 301-500 | Hazardous | Maroon |
This dataset includes observations for individual air monitoring areas across the entire United States and their daily air pollution measurements dating from 1980 to present day. An area is made up of a series of air pollution monitors within it. Each observation also includes the latitude and longitude of the air monitoring area and residential population from where it was taken.
Here are the individual variables and their descriptions:
These data are provided by the US EPA through their EPA Data Mart. For more information, check here: https://aqs.epa.gov/aqsweb/airdata/download_files.html
The general plotly structure is fairly straightforward,
and is similar to ggplot2 in many ways:
plot_ly(data = dataframe, x = var1, y = var2, type = "bar", color = "", alpha = "", size = "", ... )
Take a look here for more native plotly code structure:
https://plotly.com/r/reference/index/
#pull up documentation for plotly
?plot_ly
#note how plotly uses the %>% instead of + that ggplot uses to stack functions on top of each other!
#basic horizontal bar chart
plot_ly() %>%
add_trace(data = aqi_categories_percentage,
#specify x axis
x = ~percentage,
#specify y axis
y = ~category,
#specify type of chart
type = "bar") %>%
layout(title = "Basic Plotly Horizontal Bar Chart")
#plotly object with more modifications to be in line with the ggplot object we made earlier
#initialize plotly (like making an empty ggplot item)
plot_ly() %>%
#add trace is akin to adding a geom_object
add_trace(data = aqi_categories_percentage,
#specify x axis
x = ~percentage,
#specify y axis
y = ~category,
#specify type of chart
type = "bar",
#indicate data labels and position of said labels
text = ~percentage, textposition = "outside",
#create manual discrete color scale for filled bars, emulating viridis palette, alpha = 0.5 (half transparency)
marker = list(color = c("rgba(50, 205, 50, 0.5)",
"rgba(255, 200, 0, 0.5)",
"rgba(255, 165, 0, 0.5)",
"rgba(255, 0, 0, 0.5)",
"rgba(138, 43, 226, 0.5)",
"rgba(178, 34, 34, 0.5)"),
#manual discrete color scale for border, viridis palette, alpha = 1.0
line = list(color = c("rgba(50, 205, 50, 1.0)",
"rgba(255, 200, 0, 1.0)",
"rgba(255, 165, 0, 1.0)",
"rgba(255, 0, 0, 1.0)",
"rgba(138, 43, 226, 1.0)",
"rgba(178, 34, 34, 1.0)"),
#width of border = 2
width = 2))) %>%
#order the categories (it should be descending, but plotly reads the factors in reverse)
layout(yaxis = list(title = "", categoryorder = "total ascending"),
#add xaxis title
xaxis = list(title = "Percentage (%)"),
title = "Plotly duplicate of ggplot chart")
Plotly isn’t without its own limitations and quirks – above is an example of my attempt to replicate the original ggplot object, but with some caveats (some of which are the limitations of my own documentation-searching and implementation skills):
geom_textWhich is the better implementation? That’s really up to your
discretion! In this bar charts example, there’s a few decisions to make
about legibility and customization, and my personal preference would be
the ggplotly implementation. In my experience,
plotly charts end up looking quite similar to each other
since the range of visual changes is limited compared to
ggplot. However, if the interactivity is your main
priority, it may not matter.
Where plotly stumbles especially is the consistency and
availability of documentation. Despite plotly being a major
cross-language library (it’s widely used in Python), the
R-verse doesn’t have great documentation and expansion
(yet!), and it can leave people who want more granular control over
aesthetics wanting. This is where ggplotly can fill the
gap.
Next, we’ll move onto some univariate and multivariate plots with
plotly and their ggplot counterparts.
Let’s look at the distribution of AQI values within this dataset, first with some summarizing functions and then with plots. Note that AQI is measured in integers (whole numbers), so it might not lend to the smoothest univariate plots.
Some takeaways:
aqi has a right skew distribution (we can see this with
the mean being higher than median) with several very large outliers#see a continuous summary of the variable -- we see that it is right skewed with a huge outlier
summary(ca_aqi$aqi)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 37.00 47.00 63.82 77.00 20646.00
#how many outliers do we see above 500 AQI (what is considered "extremely hazardous")?
nrow(ca_aqi %>% filter(aqi > 500))/nrow(ca_aqi)*100
## [1] 0.0439496
#let's subset our data to values below 500
aqi_histogram <- ggplot(data = ca_aqi %>% filter(aqi < 500), aes(x = aqi)) +
#add horizontal reference line at y = 0, in a dashed forest green
geom_hline(yintercept = 0, color = "forestgreen", linetype = "dashed") +
#add vertical reference line at x = 0, in dashed forest green
geom_vline(xintercept = 0, color = "forestgreen", linetype = "dashed") +
#add histogram, specifying density instead of count, half transparency and binwidth of 12.5 starting at x = 0
geom_histogram(aes(y = ..density..),
alpha = 0.5,
color = "grey25",
binwidth = 12.5,
boundary = 0) +
#add density line in orange with a gaussian kernel density smoothing factor of 2^6
geom_density(color = "orange", n = 2^6, size = 1.0, kernel = "gaussian")+
#add custom breaks on x axis
scale_x_continuous(breaks = seq(0, 500, 50)) +
theme_minimal() +
#remove grids in panel
theme(panel.grid = element_blank()) +
labs(y = "Density\n",
x = "\nAir Quality Index (AQI) values")
aqi_histogram
## Warning: The dot-dot notation (`..density..`) was deprecated in ggplot2 3.4.0.
## ℹ Please use `after_stat(density)` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
#let's look at an untouched ggplotly conversion
ggplotly(aqi_histogram)
In this case, the ggplotly conversion works wonderfully!
I have little critique of it, except for the hover tooltip, which can be
additionally customized, and the repeated orange density line at the
horizontal intercept. It’s a great example of why ggplotly
can really streamline a workflow! In this case, I would likely remove
the density curve, as it doesn’t seem to add much information that we
would be missing. For kicks, let’s try to make a native
plotly rendition:
#without modifications, base plotly plot
plot_ly(data = ca_aqi %>% filter(aqi < 500), x = ~aqi, type = "histogram")
Note the very fine density – the binwidth is really small, it might
even be a width of 1 (turns out – it is). If you hover your mouse over
individual bars it’ll return an individual AQI value. Neat!). Let’s add
a few more modifications to better match our original
ggplot object, some of which can be found here: https://plotly.com/r/histograms/
#create functions for reference lines
hline <- function(y = 0, color = "forestgreen") {
list(
type = "line",
#x extents of the reference line
x0 = 0,
x1 = 1,
xref = "paper",
#y extents will be determined by the input y = #
y0 = y,
y1 = y,
line = list(color = color, dash = "dot")
)
}
vline <- function(x = 0, color = "forestgreen") {
list(
type = "line",
#x extents determined by input x = #
x0 = x,
x1 = x,
xref = "paper",
y0 = 0,
#note that I set the y boundary here
y1 = 0.4,
line = list(color = color, dash = "dot")
)
}
plot_ly(data = ca_aqi %>% filter(aqi < 500), x = ~aqi, type = "histogram",
#change histogram to density function rather than counts, specify 40 bins in the entire range
histnorm = "probability", nbinsx = 40,
marker = list(color = "#adb5bd", line = list(color = "343a40", width = 2))) %>%
layout(shapes = list(hline(0), vline(0)), xaxis = list(title = "Air Quality Index (AQI) values"), yaxis = list(title = "Density\n"))
Since plotly begins plots at the origin (0,0) and
doesn’t “float” in space like ggplot does, the reference
lines don’t add much, in my opinion. Also, I couldn’t find a way to
programatically add a density plot to the plotly object (if
anyone figures it out, please give a shout!), so in this use-case
scenario, I recommend the ggplotly method.
Let’s move onto some bivariate plots! First, we’ll use some of the geographic information systems tools we used in Week 9 to add information to our dataset: namely, what county the observations reside in.
#use the tigris library again to pull county shapefiles
counties <- tigris::counties(state = "California", progress_bar = FALSE) %>%
rename_with(~ tolower(gsub(" ","_", .x, fixed=TRUE)))
## Retrieving data for the year 2021
#convert the ca_aqi dataset into an sf object, setting crs = 4269 (same as counties)
ca_aqi_geo <- st_as_sf(ca_aqi, coords = c("lng", "lat"), crs = 4269)
#here, we'll indicate that we want to keep the geometry of ca_aqi (points, indicated by the lat/long) but we do want to know which counties it is within (y = counties)
ca_aqi_geo <- sf::st_join(x = ca_aqi_geo, y = counties %>% rename("county" = "name") %>% select(county, geometry), left = TRUE)
#you'll note that in the geometry field it stays as a sfc_POINT, instead of MULTIPOLYGON, and we have one more column called "county"
head(ca_aqi_geo)
Great! Let’s make a lineplot of AQI values over the temporal span of the dataset for a few select counties. For this step, since it won’t be an explicitly geographic visualization, we can make a copy of the dataset that removes the geographic geometry to speed things along.
#create list of Bay Area Counties that we'll filter for
bayarea_counties <- c("San Francisco", "Alameda", "Contra Costa", "Santa Clara", "San Mateo", "Marin", "Solano", "Sonoma", "Napa", "Santa Cruz", "San Benito")
#use st_drop_geometry to remove geographic component -- this saves computing power if you don't need geography!
aqi_lineplot <- st_drop_geometry(ca_aqi_geo) %>%
#select columns of interest -- we don't need much since it'll be AQI over timespan by county
select(date, aqi, category, county, population) %>%
#filter observations that have a 'county' value in the list we made before
filter(county %in% bayarea_counties) %>%
#create year and month indicator variables from date variable
mutate(year = lubridate::year(date),
month = lubridate::month(date),
quarter = lubridate::quarter(date)) %>%
#group by month and year, and then county
group_by(year, county) %>%
#get minimum date in each month/year/county combination
summarise(date = min(date),
#calcuated an aqi average weighted by population
aqi = weighted.mean(aqi, population),
#let's keep a population field just in case
pop = mean(population, na.rm = T))
## `summarise()` has grouped output by 'year'. You can override using the
## `.groups` argument.
#take a peek at the dataset
head(aqi_lineplot)
#let's check that our filter worked correctly
unique(aqi_lineplot$county)
## [1] "Napa" "San Francisco" "Santa Clara" "Santa Cruz"
## [5] "Solano" "Sonoma"
#it looks like not every county showed up -- which is a complete possibility; more rural geographies like Marin and Sonoma might not have monitors and may rely on simulation from other monitors in Napa or San Francisco proper.
#just to check, we can check the unique counties in the overall ca_aqi_geo dataset: we do notice not every county makes it in!
unique(ca_aqi_geo$county)
## [1] "Kern" "Inyo" "Butte" "Imperial"
## [5] "Fresno" "Kings" "Merced" "Stanislaus"
## [9] "Ventura" "Tehama" "Shasta" "Riverside"
## [13] "Sacramento" "San Diego" "San Francisco" "Santa Clara"
## [17] "Santa Barbara" "Sonoma" "Tuolumne" "San Joaquin"
## [21] "Nevada" "Solano" "Tulare" "Sutter"
## [25] "Lake" "Del Norte" "Humboldt" "Los Angeles"
## [29] "Madera" "Napa" "Monterey" "San Luis Obispo"
## [33] "Santa Cruz" "Mendocino" "Lassen"
#let's move on and make this lineplot, first in ggplot!
ggplot(data = aqi_lineplot, aes(x = date, y = aqi, color = county)) +
#add a line connecting observations
geom_line() +
#can add geom_point if you'd like, may not be as legible
#geom_point(alpha = 0.5) +
#add horizontal reference line at the weighted mean of aqi values
geom_hline(yintercept = mean(aqi_lineplot$aqi),
color = "gray25", linetype = "dashed", linewidth = 0.75) +
#start x axis at 0
scale_x_date(expand = c(0,0)) +
#give gridlines color grey at 98% transparency/alpha
theme(panel.grid = element_line(color = "grey98"),
#give custom x and y axis lines at 70% grey, size 0.5
axis.line = element_line(color = "grey70", linewidth = 0.5),
#give axis ticks that push into the plot instead of out
axis.ticks.length = unit(-0.25, "cm"),
#indicate x axis ticks
axis.ticks.x = element_line(color = "grey40", linewidth = 0.5),
#make white background
panel.background = element_rect(fill = "grey100", color = "grey100", linewidth = 0.5),
#give custom position for legend
legend.position = c(0.70,0.95),
#bold plot title
plot.title = element_text(face="bold")) +
#customize legend for color, all contained in 1 row, and actual lines in legend are size 6
guides(color = guide_legend(nrow = 1,
byrow = TRUE,
override.aes = list(size = 6)))+
#give legends. \n indicates newlines, which spaces out the titles
labs(x = "\nYears",
y = "Air Quality Index (AQI) values \n",
color = "",
caption = "Horizontal reference line indicates average AQI value \n Data courtesy Environmental Protection Agency (EPA)",
title = "Time trends of Air Quality Index (AQI) values for Bay Area counties\n")
Next, I’ll convert it to a ggplotly object with a
customized hover text! Note that I have two major additions: the
text argument in the aes() function, and the
group = 1 argument (a requirement when modifying the hover
text). Then, I use the ggplotly function and inside it, I
indicate that I want this custom hover text.
The custom text aesthetic is built such that we have a string of text
“Year:” concatenated to a variable (lubridate::year(date),
which returns the year value of the date of the observation). The
concatenation is done by the paste() function.
#store modified ggplot code into object called lineplot
lineplot <- ggplot(data = aqi_lineplot, aes(x = date,
y = aqi,
color = county,
#make custom text aesthetic
text = paste("Year:", lubridate::year(date),
"\nAQI:", round(aqi,2),
"\nCounty:", county),
group = 1
)
) +
#add a line connecting observations
geom_line() +
#can add geom_point if you'd like, may not be as legible
#geom_point(alpha = 0.5) +
#add horizontal reference line at the weighted mean of aqi values
geom_hline(yintercept = mean(aqi_lineplot$aqi),
color = "gray25", linetype = "dashed", size = 0.75) +
#start x axis at 0
scale_x_date(expand = c(0,0)) +
#give gridlines color grey at 98% transparency/alpha
theme(panel.grid = element_line(color = "grey98"),
#give custom x and y axis lines at 70% grey, size 0.5
axis.line = element_line(color = "grey70", size = 0.5),
#give axis ticks that push into the plot instead of out
axis.ticks.length = unit(-0.25, "cm"),
#indicate x axis ticks
axis.ticks.x = element_line(color = "grey40", size = 0.5),
#make white background
panel.background = element_rect(fill = "grey100", color = "grey100", size = 0.5),
#give custom position for legend
legend.position = c(0.70,0.95),
#bold plot title
plot.title = element_text(face="bold")) +
#customize legend for color, all contained in 1 row, and actual lines in legend are size 6
guides(color = guide_legend(nrow = 1,
byrow = TRUE,
override.aes = list(size = 6)))+
#give legends. \n indicates newlines, which spaces out the titles
labs(x = "\nYears",
y = "Air Quality Index (AQI) values \n",
color = "",
caption = "Horizontal reference line indicates average AQI value \n Data courtesy Environmental Protection Agency (EPA)",
title = "Time trends of Air Quality Index (AQI) values for Bay Area counties\n")
## Warning: The `size` argument of `element_rect()` is deprecated as of ggplot2 3.4.0.
## ℹ Please use the `linewidth` argument instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
#convert to ggplotly object!
ggplotly(lineplot, tooltip = "text")
I’m pretty happy with the ggplotly conversion! The most
noticeable differences are:
Line plots, especially time trend plots like the one above, are where I find interactive plots especially helpful. This example is a spaghetti plot, which would require faceting or other visual changes to help clarify the display if it were static. In this case, I think that the spaghetti-ness can be alleviated somewhat by making the plot interactive! It’s useful to be able to move about the plot, zoom in, and have the hover text clarify values that might be quite hard to parse if the plot were static. Additionally, one can click on an individual county to toggle it on/off – thereby giving the viewer flexibility to simplify their graph viewing experience to reduce “spaghetti-ness” and understand the data piecewise.
Next, let’s do a chart that we haven’t explicitly covered before: side-by-side bar charts, or what population graphs are usually comprised of! The R for Epidemiologist Handbook and other resources also refer to this as a Demographic Pyramid. I’m referencing these posts for the following code:
Cut for R (make categories of AQI values): https://r-coder.com/cut-r/ Population Pyramid in ggplot: https://stackoverflow.com/questions/14680075/simpler-population-pyramid-in-ggplot2 Skipping discrete labels: https://stackoverflow.com/questions/66275692/r-skip-labels-in-discrete-x-axis R Epidemiologist Handbook Chapter 33: https://epirhandbook.com/en/demographic-pyramids-and-likert-scales.html
#create dataset for side-by-side barchart of AQI values, done without geometry
aqi_cut_cat <- st_drop_geometry(ca_aqi_geo) %>%
#create categories with breaks every 20 units of AQI
mutate(aqi_cat = cut(ca_aqi_geo$aqi, breaks = seq(0, 600, 20))) %>%
#group by county and AQI category
group_by(county, aqi_cat) %>%
#get number of counts in each county-aqi category combination
summarise(n = n()) %>%
#filter out NAs in aqi_cat (when they're larger than 600)
filter(is.na(aqi_cat) == F)
## `summarise()` has grouped output by 'county'. You can override using the
## `.groups` argument.
#filter dataset to be SF and LA counties
aqi_sbs_plot <- ggplot(data = aqi_cut_cat %>% filter(county %in% c("San Francisco", "Los Angeles")),
#if county is SF, make the counts "negative"
mapping = aes(x = ifelse(test = county == "San Francisco", yes = -n, no = n),
#y axis will be the categories, encoding by county
y = aqi_cat, color = county, fill = county)) +
geom_col(alpha = 0.5, size = 0.5) +
#absolute value for labels on the x axis (so negative counts will be positive!); 10 breaks in total
scale_x_symmetric(labels = abs, n.breaks = 10) +
#Take every first label (TRUE), skip every second label (FALSE)
scale_y_discrete(breaks = function(x){x[c(TRUE, FALSE)]}) +
#give manual color choices
scale_fill_manual(values = c("#81b29a", "#4361ee")) +
scale_color_manual(values = c("#81b29a", "#4361ee")) +
#remove background fill
theme(panel.background = element_rect(fill = NA),
#remove y gridlines
panel.grid.major.y = element_blank(),
#emphasize major x gridlines
panel.grid.major.x = element_line(color = "grey90", linetype = "dashed"),
#remove minor x gridlines
panel.grid.minor.x = element_blank(),
#bold axis and legend titles
axis.title = element_text(face = "bold"),
legend.title = element_text(face = "bold"),
#modify legend position and background
legend.position = c(0.8, 0.8),
legend.background = element_rect(color = "grey90")) +
#remove legend for color (duplicates fill)
guides(color = "none") +
#custom lables
labs(x = "\nCount",
y = "Air Quality Index (AQI) bins\n",
fill = "County")
aqi_sbs_plot
#what does the native ggplotly conversion look like?
ggplotly(aqi_sbs_plot)
Again, the ggplotly conversion looks so good that I am
not very tempted to recreate a native plotly object, and
some exploration online to find out how to do so suggests the relative
difficulty of such a task. Therein lies the strength of
ggplot and plotly working harmoniously through
ggplotly – missing or to-be-developed features in
plotly do not necessarily need to be given up if you can
make do with some workarounds in ggplot.
Let’s modify the hover text of the chart above for
ggplotly, just to round out a HTML/webpage ready
visualization. These more complex visualizations with comparison usage
scenarios are really where interactivity can shine!
#filter dataset to be SF and LA counties
aqi_sbs_plot_2 <- ggplot(data = aqi_cut_cat %>% filter(county %in% c("San Francisco", "Los Angeles")),
#if county is SF, make the counts "negative"
mapping = aes(x = ifelse(test = county == "San Francisco", yes = -n, no = n),
#y axis will be the categories, encoding by county
y = aqi_cat, color = county, fill = county,
text = paste("County:", county,
"\nAQI Category:", aqi_cat,
"\nCount:", abs(n)))) +
geom_col(alpha = 0.5, size = 0.5) +
#absolute value for labels on the x axis (so negative counts will be positive!); 10 breaks in total
scale_x_symmetric(labels = abs, n.breaks = 10) +
#Take every first label (TRUE), skip every second label (FALSE)
scale_y_discrete(breaks = function(x){x[c(TRUE, FALSE)]}) +
#give manual color choices
scale_fill_manual(values = c("#81b29a", "#4361ee")) +
scale_color_manual(values = c("#81b29a", "#4361ee")) +
#remove background fill
theme(panel.background = element_rect(fill = NA),
#remove y gridlines
panel.grid.major.y = element_blank(),
#emphasize major x gridlines
panel.grid.major.x = element_line(color = "grey90", linetype = "dashed"),
#remove minor x gridlines
panel.grid.minor.x = element_blank(),
#bold axis and legend titles
axis.title = element_text(face = "bold"),
legend.title = element_text(face = "bold"),
plot.title = element_text(face = "bold"),
#modify legend position and background
legend.position = c(0.8, 0.8),
legend.background = element_rect(color = "grey90")) +
#remove legend for color (duplicates fill)
guides(color = "none") +
#custom lables
labs(x = "\nCount",
y = "Air Quality Index (AQI) bins\n",
fill = "County",
title = "Comparing AQI distributions across SF and LA counties")
ggplotly(aqi_sbs_plot_2, tooltip = "text")
Next, let’s take a look at bubble charts. The principles here can be extended to conventional scatterplots. Again, interactivity here is really useful – you can easily take a look at outliers in your EDA process, and these graphs can be more accessible and immediately actionable to a lay audience viewing your interactive report. I’d like to focus on making sure, as we have with the previous plots, that we pay attention to classic data visualization principles however – we still run the risk of overplotting, and we should ensure that hover texts are clear, legible, and present relevant information.
Below, we take a look at AQI and population (two continuous
variables) at the city-region level, with differential sizing of dots by
population density. You may notice that our first plot is very busy on
the left-hand side – many observations (in this case, city-regions) sit
along a smaller population size. There’s a few outliers on the
right-hand side that can be explored immediately using
plotly, as we’ll see in a bit.
What’s a strategy we can use to address this overplotting on the left-hand side while still preserving the general relationships we see in these data? This may be covered more in biostatistics courses, but you can transform variables to normalize and/or linearize the relationships between variables. There are some benefits (and drawbacks) in terms of model fitting and interpretation of results, but for now, we’ll operate on the knowledge that transforming variables is an existing practice and can be a viable tool in data visualization as well!
Also in the spirit of sharing the visualization thought process, here are some search strings I used while making these plots and the links that eventually helped:
#make our dataset
data_bubble <- ca_aqi %>%
#group by city-region
group_by(city_ascii) %>%
#create weighted AQI measure, mean of population, mean of density
summarise(aqi = weighted.mean(aqi, population),
population = mean(population),
density = mean(density))
ggplot(data = data_bubble, aes(x = population, y = aqi)) +
#bubbles are symbolized dynamically in color and size by density
geom_point(aes(size = density, color = density)) +
#change x axis labels from scientific to commas
scale_x_continuous(labels = scales::comma) +
#use viridis color palette
scale_color_viridis_c()+
#set the range of sizes that bubbles can take on
scale_size(range = c(1, 20),
#we also scale the size (driven by density) by cubing it, emphasizing larger values
trans = scales::trans_new(name = "cube",
transform = function(x) x**3,
inverse = function(x) x**3)) +
#add stata theme (from ggthemes package)
theme_stata() +
#take out size legend
guides(size = "none") +
#increase width of color legend
theme(legend.key.width = unit(1.85, "cm")) +
#labels
labs(color = expression(paste("Density (person per ", km^{2}, ")")),
x = "\nPopulation Size (natural log)",
y = "AQI\n (population weighted average) \n")
The results are interesting, and don’t look half-bad, especially with
the preset Stata theme. To address the overplotting on the left-hand
side, however, we’re going to perform a log-transformation on
population, the x-axis. Remembering back to our algebra
roots, this translates to very large values shrinking down, and very
small units will increase in relative size. In general,
log-transformations are great for skewed data, which we have here with
the population variable. Take a look at the difference:
#lets make a static ggplot, now with a log-transformation for population --
aqi_bubble <- ggplot(data = data_bubble, aes(x = log(population), y = aqi)) +
geom_point(aes(size = density, color = density)) +
scale_x_continuous(labels = scales::comma) +
scale_color_viridis_c()+
scale_size(range = c(1, 20),
trans = scales::trans_new(name = "cube",
transform = function(x) x**3,
inverse = function(x) x**3)) +
theme_stata() +
guides(size = "none") +
theme(legend.key.width = unit(1.80, "cm")) +
labs(color = expression(paste("Density (person per ", km^{2}, ")")),
x = "\nPopulation Size (natural log)",
y = "AQI\n (population weighted average) \n")
aqi_bubble
The general plot looks a lot less crowded on the left-hand side! This transformation also now suggests a fairly linear relationship between the natural log of population size and AQI measures, which may or may not be true. This should be taken into consideration when using transformations in any analysis or visualization. Further, we want to emphasize the interpretability of log (or any) transformations – it can be confusing for many audiences and should be reserved for those who may be familiar with the practice (showing EDA to your biostatistics advisor, for example).
Next, we’ll convert the ggplot object we just made using
ggplotly.
#convert into a plotly object using ggplotly
ggplotly(aqi_bubble)
You’ll notice that this conversion isn’t the most successful – most things carry over, but the font became serif from sans serif, the legend for city is still present, and the legend for density has an incorrect title. However, it is still convenient to explore the outliers (Riverside, Los Angeles, for example), using the plot.
Let’s try to make an alternative with native plotly.
Here’s the plotly page for bubble charts: https://plotly.com/r/bubble-charts/
#plotly object reference same dataset as before
plot_ly(data_bubble,
x = ~log(population),
y = ~aqi,
color = ~density,
#next two lines define scatterplot
type = 'scatter',
mode = 'markers',
#inclusion of size argument makes it a bubble chart, here we scale by x^3 for clarity
size = ~density**3,
#range of sizes that bubbles can take on
sizes = c(10, 2500),
#give bubbles transparency/alpha of 0.75
marker = list(opacity = 0.75),
hoverinfo = 'text',
#custom hover text, <b> text </b> means bolded text
text = ~paste("<b>City/Region:</b>", city_ascii,
#<br> means a newline, or break, here we round aqi to 2 digits
"<br><b>AQI (pop-weighted mean):</b>", round(aqi, 2),
"<br><b>Population:</b>", population,
"<br><b>Density (people per sq km):</b>", density)) %>%
#like ggplot, different parts of the layout get identical arguments
layout(xaxis = list(showgrid = FALSE, title = "Population size (natural log)",
#give border lines for x axis
linecolor = "black", linewidth = 1),
yaxis = list(title = "AQI \n(population-weighted average)",
linecolor = "black", linewidth = 1),
#remove legend for size
showlegend = FALSE,
#background color
paper_bgcolor = 'rgb(225, 240, 245)',
#custom font
font = list(family = "Arial", size = 12)) %>%
#title of color legend
colorbar(title = "Density \n(people per sq km)")
## Warning: `line.width` does not currently support multiple values.
With these modifications annotated in the code, I’m reasonable happy
with the parity between this plotly graph and
ggplot original. I prefer the scaling of the
plotly plot, which accentuates how much more dense San
Francisco is compared to peer city/regions (and the sans serif font).
You can achieve the same using scales::. Of course, as you
can throughout any module, feel free to modify code as you go along to
see how it affects your final visualization – it’s the best way to
learn!
ggigraph is a newer graphing library that, likely
ggplotly, also attempts to bridge ggplot2
objects with interactivity. It offers some features that should be
interesting to implement for interactive documents, like linked units
across graphs (which will be showcased shortly). From their
documentation, the necessary syntax adjustments we’d like to make aside
from our standard ggplot2 syntax include:
_interactive to the geom names (for example, change
geom_point to geom_point_interactive)tooltip, data_id, or both to the
aes functiontooltip: what is displayed when hovering over a graphic
elementdata_id: unique IDs for units that we’d like to assign
linked interactivity toLet’s extend the graphs we’ve already made by making a combined plot of AQI trends over time and population density bubble charts.
#let's copy our lineplot from before, with some additions
lineplot_igraph <- ggplot(data = aqi_lineplot, aes(x = date, y = aqi, color = county,
#note we add these two arguments to indicate the hover text
#and the interactive grouping
data_id = county, tooltip = county)) +
#add a line connecting observations NOTING that it is _interactive
geom_line_interactive() +
#can add geom_point if you'd like, may not be as legible
#geom_point(alpha = 0.5) +
#add horizontal reference line at the weighted mean of aqi values
geom_hline(yintercept = mean(aqi_lineplot$aqi),
color = "gray25", linetype = "dashed", linewidth = 0.75) +
#start x axis at 0
scale_x_date(expand = c(0,0)) +
#give gridlines color grey at 98% transparency/alpha
theme(panel.grid = element_line(color = "grey98"),
#give custom x and y axis lines at 70% grey, size 0.5
axis.line = element_line(color = "grey70", linewidth = 0.5),
#give axis ticks that push into the plot instead of out
axis.ticks.length = unit(-0.25, "cm"),
#indicate x axis ticks
axis.ticks.x = element_line(color = "grey40", linewidth = 0.5),
#make white background
panel.background = element_rect(fill = "grey100", color = "grey100", linewidth = 0.5),
#give custom position for legend
legend.position = c(0.70,0.95),
#bold plot title
plot.title = element_text(face="bold")) +
#customize legend for color, all contained in 1 row, and actual lines in legend are size 6
guides(color = guide_legend(nrow = 1,
byrow = TRUE,
override.aes = list(size = 6)))+
#give legends. \n indicates newlines, which spaces out the titles
labs(x = "\nYears",
y = "Air Quality Index (AQI) values \n",
color = "",
caption = "Horizontal reference line indicates average AQI value \n Data courtesy Environmental Protection Agency (EPA)",
title = "Time trends of Air Quality Index (AQI) values for Bay Area counties\n")
#we call the igraph object with the following function:
girafe(ggobj = lineplot_igraph,
#these options simply indicate we'd like the colors to extend to the hover
#by default, the hover has white text on a black background
options = list(opts_tooltip(use_fill = TRUE)))
#next, let's do the same with the population density against aqi scatterplot (we're not able to make bubble charts with ggigraph, yet)
aqi_scatter_igraph <- ggplot(data = aqi_lineplot %>%
group_by(county) %>%
summarise(aqi = weighted.mean(aqi, pop),
population = mean(pop)), aes(x = log(population), y = aqi,
#remember our additions
data_id = county, tooltip = county)) +
#remember to make our geom interactive
geom_point_interactive() +
scale_x_continuous(labels = scales::comma) +
scale_color_viridis_c()+
scale_size(range = c(1, 20),
trans = scales::trans_new(name = "cube",
transform = function(x) x**3,
inverse = function(x) x**3)) +
theme_stata() +
guides(size = "none") +
#theme(legend.key.width = unit(1.80, "cm")) +
labs(color = expression(paste("Density (person per ", km^{2}, ")")),
x = "\nPopulation Size (natural log)",
y = "AQI\n (population weighted average) \n")
#next, we call our ggigraph object
girafe(ggobj = aqi_scatter_igraph,
options = list(opts_tooltip(use_fill = TRUE)))
This was a brief introduction to ggigraph, but hopefully it illustrates the potential to combine graphs and cross-reference the same units. Since it is still a new package, however, visual customization is limited (note how we were not able to symbolize geom_point_interactive by a dynamic size).
While this course has not devoted much time to visualizing tables, based on the assumption that you have covered table construction in your stats courses, they are intrinsically tied into communicating data. In any public health practice that involves data, tables are just as important as visualizations (and in some cases, even more so!). This section aims to give a quick overview of interactive tables that R users can leverage in interactive reports.
A note on accessibility: A neat thing about both
dt and reactable is that since both generate a
HTML table, the resulting output(s) should be compatible with screen
readers, making your interactive documents accessible to a wider
audience!
We’ll be exploring these two packages:
DT / datatables: A venerated javascript library for interactive tables https://rstudio.github.io/DT/
reactable: A newer library in development, with simpler syntax: https://glin.github.io/reactable/index.html
Let’s start first with datatables! We’ll be using the same dataset that we made for the bubble chart, for ease.
#wrapping our dataset in the datatable function to start
datatable(data_bubble)
The default result is usable! As you can test for yourself, it’s an interactive paginated version of what our data viewer in R is, but it’ll stay interactive when knit to an HTML file! There’s a few options for customization and improvement we can explore.
Let’s first start with some visual changes:
datatable(data_bubble,
#add CSS class for theming
class = 'cell-border stripe',
#remove numbered index
rownames = F,
#add custom column names instead of renaming columns
colnames = c("City/Region", "AQI (population-weighted average)",
"Population", "Density (people per sq. km)")) %>%
#bold the `city_ascii` column
formatStyle("city_ascii", fontWeight = "bold") %>%
#round the `aqi` column to one decimal place
formatRound("aqi", 1) %>%
#format the `aqi` column to have red text over values of 75, black otherwise
formatStyle(c("aqi"),
color = styleInterval(c(75), c("black", "red")),
#bold values over 75, normal weight otherwise
fontWeight = styleInterval(c(75), c("normal", "bold")))
Next, let’s look at some functional customizations:
datatable(data_bubble,
#add CSS class for theming
class = 'cell-border stripe',
#remove numbered index
rownames = F,
#add custom column names instead of renaming columns
colnames = c("City/Region", "AQI (population-weighted average)",
"Population", "Density (people per sq. km)"),
#add buttons to export the data within the html (cool!)
#note that the `dom` argument can accept many, check out documentation for more
#by calling `Btp`, we keep buttons (B), the table itself (t), and pagination control (p)
extensions = 'Buttons', options = list(dom = 'Btpl',
buttons = c('copy',
'csv',
'excel',
'pdf',
'print'),
#how many entries per page
pageLength = 10,
#options for "Show how many entries"
lengthMenu = c(10, 20, 30))) %>%
#bold the `city_ascii` column
formatStyle("city_ascii", fontWeight = "bold") %>%
#round the `aqi` column to one decimal place
formatRound("aqi", 1) %>%
#format the `aqi` column to have red text over values of 75, black otherwise
formatStyle(c("aqi"),
color = styleInterval(c(75), c("black", "red")),
#bold values over 75, normal weight otherwise
fontWeight = styleInterval(c(75), c("normal", "bold")))
This is only a survey of a very few functions available in datatable, so do check out the aforementioned documentation to learn more!
Now, let’s move onto reactable:
Again, reactable is a newer library and may offer
simpler construction and more flexibility than dt. It still
is in development however, so documentation and help may not be as
available.
For this, we’ll use the dataset we used for the earlier lineplot to showcase convenient functionality to group table entries by a categorical variable (in this case, grouping observations each year by county).
#subset out an unneeded column, groupby county
reactable(aqi_lineplot %>% select(-date), groupBy = 'year',
#rename first column
columns = list(county = colDef(name = "County"),
#rename second column
year = colDef(name = "Year"),
#rename third column, rounding to 1 digit
aqi = colDef(name = "AQI", format = colFormat(digits = 1))))
We can already see a few stylistic changes, with
reactable leaning towards a more minimalist design
language, which might be preferable. I would argue that the use case I
presented with this dataset isn’t actually very useful in a table format
(it would make sense to be able to more easily compare across countries
and years), but hopefully it illustrates the possibilities with
reactable.
Another neat feature afforded by both dt and
reactable is the inclusion of plots right within our
tables! When distributions of a variable of interest vary by a grouping
variable or geography, it can be extremely useful and succinct to
include these. Here’s an example below using our main CA AQI dataset
grouped by year, drawing directly from documentation found here: https://glin.github.io/reactable/articles/examples.html#custom-rendering
reactable_data <- ca_aqi %>%
#create year variable from date variable
mutate(year = lubridate::year(date)) %>%
#group by year
group_by(year) %>%
#new aqi variable will be a list of all aqi values in that year
summarise(aqi = list(aqi)) %>%
#make empty variables for boxplot and sparkline
mutate(boxplot = NA, sparkline = NA)
reactable(reactable_data,
columns = list(
#barchart for aqi values in a year, max chart range of 500
aqi = colDef(cell = function(values) {
sparkline(values, type = "bar",
chartRangeMin = 0, chartRangeMax = 500)
}),
#boxplot for aqi values in a year
boxplot = colDef(cell = function(value, index) {
sparkline(reactable_data$aqi[[index]], type = "box",
chartRangeMin = 0, chartRangeMax = 200)
}),
#lineplot for aqi values in a year
sparkline = colDef(cell = function(value, index) {
sparkline(reactable_data$aqi[[index]])
})
))
The best way to get a sense of the usages of different interactivity
packages in R is to try it out yourself. In this exercise, we’d like you
to take the dataset we’ve prepared (the 5 cities with highest average
AQI in California) and create either a violin plot or a
box plot to depict the spread of AQI measurements
across the five cities. We’d like you to try constructing either of
these plots through ggplot with a ggplotly
conversion, and then by a native plotly object
construction.
Of importance is to take note as to how the tooltip is being generated. Refer to earlier code examples to customize the tooltip if you deem it necessary.
#here we generate the names of cities we are interested in
top5cities <- ca_aqi %>%
#group by city
group_by(city_ascii) %>%
#population-weighted average of AQI
summarise(mean_aqi = weighted.mean(aqi, population, na.rm = T)) %>%
#sort dataframe by mean aqi, in descending order
dplyr::arrange(desc(mean_aqi)) %>%
#take the top 5
head(5) %>%
#select just the city names
select(city_ascii) %>%
#convert the dataframe into a vector
as_vector()
#create dataframe with all unaggregated AQI measurements
top5_df <- ca_aqi %>%
#filter for cities within our list
filter(city_ascii %in% top5cities,
#filter out extreme aqi values
aqi < 1000)
# -------------------------------------------------------------
library(plotly)
# Violin plot using ggplot2
gg_violin <- ggplot(top5_df, aes(x = city_ascii, y = aqi)) +
geom_violin() +
labs(title = "Violin Plot of AQI in Top 5 Cities", x = "City", y = "AQI")
# Convert to interactive plotly object
interactive_violin <- ggplotly(gg_violin)
# Render the plot
interactive_violin
# Box plot using plotly
plotly_box <- plot_ly(data = top5_df, x = ~city_ascii, y = ~aqi, type = "box",
hoverinfo = 'y+x') # Customizing tooltip information
# Render the plot
plotly_box
And there you have it! A brief survey of just some of several
interactivity methods available to us in R. There are unlimited
possibilities in visualization, and while some are more involved than
others (and may require more patience with searching up documentation),
the deep investment in development within the R community points to
having a solution more often than not. We recommend you exploring the
linked resources at the beginning of this module (especially
plotly’s) for more examples. As always, feel free to ask
and questions or start a discussion in our class EdStem or by emailing
us!