Introduction

Welcome back to PHW251B! This lab module will be covering various interactive plots and tables that can be created in R with different packages. We will be extending various types of plots that we have been working with thus far in ways that allow for increased engagement and flexibility for our audiences. Interactive plots are incredibly powerful in R, since the reproducibility can allow for on-the-fly dashboards and interactive reports that need minimal upkeep after the initial coding. Creating interactive plots also encompasses the ideals of accessibility needed for static visualizations, and requires additional care when considering the user interactivity and experience. We will try to keep this in mind while exploring options in packages such as plotly, datatables and others.

Objectives

The Data

Today we’ll be using a variety of datasets.

US EPA Air Quality Index, Historical Data

AQI, or Air Quality Index, is the primary way to measure the current quality of the air.

AQI values are given these ranges:

AQI Ranges Category Colors
0-50 Good Green
51-100 Moderate Yellow
101-150 Unhealthy for Sensitive Groups Orange
151-200 Unhealthy Red
201-300 Very Unhealthy Purple
301-500 Hazardous Maroon

This dataset includes observations for individual air monitoring areas across the entire United States and their daily air pollution measurements dating from 1980 to present day. An area is made up of a series of air pollution monitors within it. Each observation also includes the latitude and longitude of the air monitoring area and residential population from where it was taken.

Here are the individual variables and their descriptions:

These data are provided by the US EPA through their EPA Data Mart. For more information, check here: https://aqs.epa.gov/aqsweb/airdata/download_files.html

Plotly Structure

The general plotly structure is fairly straightforward, and is similar to ggplot2 in many ways:

plot_ly(data = dataframe, x = var1, y = var2, type = "bar", color = "", alpha = "", size = "", ... )

Take a look here for more native plotly code structure: https://plotly.com/r/reference/index/

#pull up documentation for plotly
?plot_ly

#note how plotly uses the %>% instead of + that ggplot uses to stack functions on top of each other!
#basic horizontal bar chart
plot_ly() %>%
  add_trace(data = aqi_categories_percentage, 
            #specify x axis
            x = ~percentage,
            #specify y axis
            y = ~category, 
            #specify type of chart
            type = "bar") %>%
  layout(title = "Basic Plotly Horizontal Bar Chart")
#plotly object with more modifications to be in line with the ggplot object we made earlier

#initialize plotly (like making an empty ggplot item)
plot_ly() %>% 
  #add trace is akin to adding a geom_object
  add_trace(data = aqi_categories_percentage, 
            #specify x axis
            x = ~percentage,
            #specify y axis
            y = ~category, 
            #specify type of chart
            type = "bar",
            #indicate data labels and position of said labels
            text = ~percentage, textposition = "outside",
            #create manual discrete color scale for filled bars, emulating viridis palette, alpha = 0.5 (half transparency)
            marker = list(color = c("rgba(50, 205, 50, 0.5)",
                                    "rgba(255, 200, 0, 0.5)",
                                    "rgba(255, 165, 0, 0.5)",
                                    "rgba(255, 0, 0, 0.5)",
                                    "rgba(138, 43, 226, 0.5)",
                                    "rgba(178, 34, 34, 0.5)"),
                          #manual discrete color scale for border, viridis palette, alpha = 1.0
                          line = list(color = c("rgba(50, 205, 50, 1.0)",
                                    "rgba(255, 200, 0, 1.0)",
                                    "rgba(255, 165, 0, 1.0)",
                                    "rgba(255, 0, 0, 1.0)",
                                    "rgba(138, 43, 226, 1.0)",
                                    "rgba(178, 34, 34, 1.0)"),
                                    #width of border = 2
                                    width = 2))) %>%
  #order the categories (it should be descending, but plotly reads the factors in reverse)
  layout(yaxis = list(title = "", categoryorder = "total ascending"),
         #add xaxis title
         xaxis = list(title = "Percentage (%)"),
         title = "Plotly duplicate of ggplot chart")

Plotly isn’t without its own limitations and quirks – above is an example of my attempt to replicate the original ggplot object, but with some caveats (some of which are the limitations of my own documentation-searching and implementation skills):

Which is the better implementation? That’s really up to your discretion! In this bar charts example, there’s a few decisions to make about legibility and customization, and my personal preference would be the ggplotly implementation. In my experience, plotly charts end up looking quite similar to each other since the range of visual changes is limited compared to ggplot. However, if the interactivity is your main priority, it may not matter.

Where plotly stumbles especially is the consistency and availability of documentation. Despite plotly being a major cross-language library (it’s widely used in Python), the R-verse doesn’t have great documentation and expansion (yet!), and it can leave people who want more granular control over aesthetics wanting. This is where ggplotly can fill the gap.

Continuous Data

Next, we’ll move onto some univariate and multivariate plots with plotly and their ggplot counterparts.

Let’s look at the distribution of AQI values within this dataset, first with some summarizing functions and then with plots. Note that AQI is measured in integers (whole numbers), so it might not lend to the smoothest univariate plots.

Some takeaways:

Density & Univariate Distribution

#see a continuous summary of the variable -- we see that it is right skewed with a huge outlier
summary(ca_aqi$aqi)
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##     0.00    37.00    47.00    63.82    77.00 20646.00
#how many outliers do we see above 500 AQI (what is considered "extremely hazardous")? 
nrow(ca_aqi %>% filter(aqi > 500))/nrow(ca_aqi)*100
## [1] 0.0439496
#let's subset our data to values below 500
aqi_histogram <- ggplot(data = ca_aqi %>% filter(aqi < 500), aes(x = aqi)) + 
  #add horizontal reference line at y = 0, in a dashed forest green
  geom_hline(yintercept = 0, color = "forestgreen", linetype = "dashed") +
  #add vertical reference line at x = 0, in dashed forest green
  geom_vline(xintercept = 0, color = "forestgreen", linetype = "dashed") + 
  #add histogram, specifying density instead of count, half transparency and binwidth of 12.5 starting at x = 0
  geom_histogram(aes(y = ..density..), 
                 alpha = 0.5, 
                 color = "grey25", 
                 binwidth = 12.5, 
                 boundary = 0) + 
  #add density line in orange with a gaussian kernel density smoothing factor of 2^6
  geom_density(color = "orange", n = 2^6, size = 1.0, kernel = "gaussian")+
  #add custom breaks on x axis
  scale_x_continuous(breaks = seq(0, 500, 50)) + 
  theme_minimal() + 
  #remove grids in panel
  theme(panel.grid = element_blank()) + 
  labs(y = "Density\n",
       x = "\nAir Quality Index (AQI) values")

aqi_histogram
## Warning: The dot-dot notation (`..density..`) was deprecated in ggplot2 3.4.0.
## ℹ Please use `after_stat(density)` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

#let's look at an untouched ggplotly conversion
ggplotly(aqi_histogram)

In this case, the ggplotly conversion works wonderfully! I have little critique of it, except for the hover tooltip, which can be additionally customized, and the repeated orange density line at the horizontal intercept. It’s a great example of why ggplotly can really streamline a workflow! In this case, I would likely remove the density curve, as it doesn’t seem to add much information that we would be missing. For kicks, let’s try to make a native plotly rendition:

#without modifications, base plotly plot
plot_ly(data = ca_aqi %>% filter(aqi < 500), x = ~aqi, type = "histogram")

Note the very fine density – the binwidth is really small, it might even be a width of 1 (turns out – it is). If you hover your mouse over individual bars it’ll return an individual AQI value. Neat!). Let’s add a few more modifications to better match our original ggplot object, some of which can be found here: https://plotly.com/r/histograms/

#create functions for reference lines
hline <- function(y = 0, color = "forestgreen") {
  list(
    type = "line",
    #x extents of the reference line
    x0 = 0,
    x1 = 1,
    xref = "paper",
    #y extents will be determined by the input y = #
    y0 = y,
    y1 = y,
    line = list(color = color, dash = "dot")
  )
}

vline <- function(x = 0, color = "forestgreen") {
  list(
    type = "line",
    #x extents determined by input x = #
    x0 = x,
    x1 = x,
    xref = "paper",
    y0 = 0,
    #note that I set the y boundary here
    y1 = 0.4,
    line = list(color = color, dash = "dot")
  )
}

plot_ly(data = ca_aqi %>% filter(aqi < 500), x = ~aqi, type = "histogram",
        #change histogram to density function rather than counts, specify 40 bins in the entire range
        histnorm = "probability", nbinsx = 40,
        marker = list(color = "#adb5bd", line = list(color = "343a40", width = 2))) %>%
  layout(shapes = list(hline(0), vline(0)), xaxis = list(title = "Air Quality Index (AQI) values"), yaxis = list(title = "Density\n"))

Since plotly begins plots at the origin (0,0) and doesn’t “float” in space like ggplot does, the reference lines don’t add much, in my opinion. Also, I couldn’t find a way to programatically add a density plot to the plotly object (if anyone figures it out, please give a shout!), so in this use-case scenario, I recommend the ggplotly method.

Let’s move onto some bivariate plots! First, we’ll use some of the geographic information systems tools we used in Week 9 to add information to our dataset: namely, what county the observations reside in.

#use the tigris library again to pull county shapefiles 
counties <- tigris::counties(state = "California", progress_bar = FALSE) %>%
  rename_with(~ tolower(gsub(" ","_", .x, fixed=TRUE)))
## Retrieving data for the year 2021
#convert the ca_aqi dataset into an sf object, setting crs = 4269 (same as counties)
ca_aqi_geo <- st_as_sf(ca_aqi, coords = c("lng", "lat"), crs = 4269)

#here, we'll indicate that we want to keep the geometry of ca_aqi (points, indicated by the lat/long) but we do want to know which counties it is within (y = counties)
ca_aqi_geo <- sf::st_join(x = ca_aqi_geo, y = counties %>% rename("county" = "name") %>% select(county, geometry), left = TRUE)

#you'll note that in the geometry field it stays as a sfc_POINT, instead of MULTIPOLYGON, and we have one more column called "county"
head(ca_aqi_geo)

AQI and Time Series Plots

Great! Let’s make a lineplot of AQI values over the temporal span of the dataset for a few select counties. For this step, since it won’t be an explicitly geographic visualization, we can make a copy of the dataset that removes the geographic geometry to speed things along.

#create list of Bay Area Counties that we'll filter for
bayarea_counties <- c("San Francisco", "Alameda", "Contra Costa", "Santa Clara", "San Mateo", "Marin", "Solano", "Sonoma", "Napa", "Santa Cruz", "San Benito")

#use st_drop_geometry to remove geographic component -- this saves computing power if you don't need geography!
aqi_lineplot <- st_drop_geometry(ca_aqi_geo) %>%
  #select columns of interest -- we don't need much since it'll be AQI over timespan by county
  select(date, aqi, category, county, population) %>%
  #filter observations that have a 'county' value in the list we made before
  filter(county %in% bayarea_counties) %>%
  #create year and month indicator variables from date variable
  mutate(year = lubridate::year(date),
         month = lubridate::month(date),
         quarter = lubridate::quarter(date)) %>%
  #group by month and year, and then county
  group_by(year, county) %>%
  #get minimum date in each month/year/county combination
  summarise(date = min(date),
            #calcuated an aqi average weighted by population
            aqi = weighted.mean(aqi, population),
            #let's keep a population field just in case
            pop = mean(population, na.rm = T))
## `summarise()` has grouped output by 'year'. You can override using the
## `.groups` argument.
#take a peek at the dataset
head(aqi_lineplot)
#let's check that our filter worked correctly
unique(aqi_lineplot$county)
## [1] "Napa"          "San Francisco" "Santa Clara"   "Santa Cruz"   
## [5] "Solano"        "Sonoma"
#it looks like not every county showed up -- which is a complete possibility; more rural geographies like Marin and Sonoma might not have monitors and may rely on simulation from other monitors in Napa or San Francisco proper. 

#just to check, we can check the unique counties in the overall ca_aqi_geo dataset: we do notice not every county makes it in!
unique(ca_aqi_geo$county)
##  [1] "Kern"            "Inyo"            "Butte"           "Imperial"       
##  [5] "Fresno"          "Kings"           "Merced"          "Stanislaus"     
##  [9] "Ventura"         "Tehama"          "Shasta"          "Riverside"      
## [13] "Sacramento"      "San Diego"       "San Francisco"   "Santa Clara"    
## [17] "Santa Barbara"   "Sonoma"          "Tuolumne"        "San Joaquin"    
## [21] "Nevada"          "Solano"          "Tulare"          "Sutter"         
## [25] "Lake"            "Del Norte"       "Humboldt"        "Los Angeles"    
## [29] "Madera"          "Napa"            "Monterey"        "San Luis Obispo"
## [33] "Santa Cruz"      "Mendocino"       "Lassen"
#let's move on and make this lineplot, first in ggplot!
ggplot(data = aqi_lineplot, aes(x = date, y = aqi, color = county)) + 
  #add a line connecting observations
  geom_line() +
  #can add geom_point if you'd like, may not be as legible
  #geom_point(alpha = 0.5) + 
  #add horizontal reference line at the weighted mean of aqi values
  geom_hline(yintercept = mean(aqi_lineplot$aqi), 
             color = "gray25", linetype = "dashed", linewidth = 0.75) +
  #start x axis at 0
  scale_x_date(expand = c(0,0)) +
  #give gridlines color grey at 98% transparency/alpha
  theme(panel.grid = element_line(color = "grey98"),
        #give custom x and y axis lines at 70% grey, size 0.5
        axis.line = element_line(color = "grey70", linewidth = 0.5),
        #give axis ticks that push into the plot instead of out
        axis.ticks.length = unit(-0.25, "cm"),
        #indicate x axis ticks
        axis.ticks.x = element_line(color = "grey40", linewidth = 0.5),
        #make white background
        panel.background = element_rect(fill = "grey100", color = "grey100", linewidth = 0.5),
        #give custom position for legend
        legend.position = c(0.70,0.95),
        #bold plot title
        plot.title = element_text(face="bold")) +
  #customize legend for color, all contained in 1 row, and actual lines in legend are size 6
  guides(color = guide_legend(nrow = 1, 
                              byrow = TRUE, 
                              override.aes = list(size = 6)))+
  #give legends. \n indicates newlines, which spaces out the titles
  labs(x = "\nYears", 
       y = "Air Quality Index (AQI) values \n",
       color = "",
       caption = "Horizontal reference line indicates average AQI value \n Data courtesy Environmental Protection Agency (EPA)",
       title = "Time trends of Air Quality Index (AQI) values for Bay Area counties\n")

Next, I’ll convert it to a ggplotly object with a customized hover text! Note that I have two major additions: the text argument in the aes() function, and the group = 1 argument (a requirement when modifying the hover text). Then, I use the ggplotly function and inside it, I indicate that I want this custom hover text.

The custom text aesthetic is built such that we have a string of text “Year:” concatenated to a variable (lubridate::year(date), which returns the year value of the date of the observation). The concatenation is done by the paste() function.

#store modified ggplot code into object called lineplot
lineplot <- ggplot(data = aqi_lineplot, aes(x = date, 
                                            y = aqi, 
                                            color = county, 
                                            #make custom text aesthetic
                                            text = paste("Year:", lubridate::year(date),
                                                          "\nAQI:", round(aqi,2),
                                                          "\nCounty:", county),
                                            group = 1
                                            )
                   ) + 
  #add a line connecting observations
  geom_line() +
  #can add geom_point if you'd like, may not be as legible
  #geom_point(alpha = 0.5) + 
  #add horizontal reference line at the weighted mean of aqi values
  geom_hline(yintercept = mean(aqi_lineplot$aqi), 
             color = "gray25", linetype = "dashed", size = 0.75) +
  #start x axis at 0
  scale_x_date(expand = c(0,0)) +
  #give gridlines color grey at 98% transparency/alpha
  theme(panel.grid = element_line(color = "grey98"),
        #give custom x and y axis lines at 70% grey, size 0.5
        axis.line = element_line(color = "grey70", size = 0.5),
        #give axis ticks that push into the plot instead of out
        axis.ticks.length = unit(-0.25, "cm"),
        #indicate x axis ticks
        axis.ticks.x = element_line(color = "grey40", size = 0.5),
        #make white background
        panel.background = element_rect(fill = "grey100", color = "grey100", size = 0.5),
        #give custom position for legend
        legend.position = c(0.70,0.95),
        #bold plot title
        plot.title = element_text(face="bold")) +
  #customize legend for color, all contained in 1 row, and actual lines in legend are size 6
  guides(color = guide_legend(nrow = 1, 
                              byrow = TRUE, 
                              override.aes = list(size = 6)))+
  #give legends. \n indicates newlines, which spaces out the titles
  labs(x = "\nYears", 
       y = "Air Quality Index (AQI) values \n",
       color = "",
       caption = "Horizontal reference line indicates average AQI value \n Data courtesy Environmental Protection Agency (EPA)",
       title = "Time trends of Air Quality Index (AQI) values for Bay Area counties\n")
## Warning: The `size` argument of `element_rect()` is deprecated as of ggplot2 3.4.0.
## ℹ Please use the `linewidth` argument instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
#convert to ggplotly object!
ggplotly(lineplot, tooltip = "text")

I’m pretty happy with the ggplotly conversion! The most noticeable differences are:

  • The positions on the legend and tick marks on the x and y axes
  • Missing the specified caption. Optional challenge: try to find and implement a solution for this, and let me know what you find!

Line plots, especially time trend plots like the one above, are where I find interactive plots especially helpful. This example is a spaghetti plot, which would require faceting or other visual changes to help clarify the display if it were static. In this case, I think that the spaghetti-ness can be alleviated somewhat by making the plot interactive! It’s useful to be able to move about the plot, zoom in, and have the hover text clarify values that might be quite hard to parse if the plot were static. Additionally, one can click on an individual county to toggle it on/off – thereby giving the viewer flexibility to simplify their graph viewing experience to reduce “spaghetti-ness” and understand the data piecewise.

Side-by-side bar charts / Demographic Pyramid

Next, let’s do a chart that we haven’t explicitly covered before: side-by-side bar charts, or what population graphs are usually comprised of! The R for Epidemiologist Handbook and other resources also refer to this as a Demographic Pyramid. I’m referencing these posts for the following code:

Cut for R (make categories of AQI values): https://r-coder.com/cut-r/ Population Pyramid in ggplot: https://stackoverflow.com/questions/14680075/simpler-population-pyramid-in-ggplot2 Skipping discrete labels: https://stackoverflow.com/questions/66275692/r-skip-labels-in-discrete-x-axis R Epidemiologist Handbook Chapter 33: https://epirhandbook.com/en/demographic-pyramids-and-likert-scales.html

#create dataset for side-by-side barchart of AQI values, done without geometry
aqi_cut_cat <- st_drop_geometry(ca_aqi_geo) %>%
  #create categories with breaks every 20 units of AQI 
  mutate(aqi_cat = cut(ca_aqi_geo$aqi, breaks = seq(0, 600, 20))) %>%
  #group by county and AQI category
  group_by(county, aqi_cat) %>%
  #get number of counts in each county-aqi category combination
  summarise(n = n()) %>%
  #filter out NAs in aqi_cat (when they're larger than 600)
  filter(is.na(aqi_cat) == F)
## `summarise()` has grouped output by 'county'. You can override using the
## `.groups` argument.
#filter dataset to be SF and LA counties
aqi_sbs_plot <- ggplot(data = aqi_cut_cat %>% filter(county %in% c("San Francisco", "Los Angeles")), 
       #if county is SF, make the counts "negative"
       mapping = aes(x = ifelse(test = county == "San Francisco", yes = -n, no = n), 
                     #y axis will be the categories, encoding by county
                     y = aqi_cat, color = county, fill = county)) +
  geom_col(alpha = 0.5, size = 0.5) +
  #absolute value for labels on the x axis (so negative counts will be positive!); 10 breaks in total
  scale_x_symmetric(labels = abs, n.breaks = 10) +
  #Take every first label (TRUE), skip every second label (FALSE)
  scale_y_discrete(breaks = function(x){x[c(TRUE, FALSE)]}) +
  #give manual color choices
  scale_fill_manual(values = c("#81b29a", "#4361ee")) +
  scale_color_manual(values = c("#81b29a", "#4361ee")) +
  #remove background fill
  theme(panel.background = element_rect(fill = NA),
        #remove y gridlines
        panel.grid.major.y = element_blank(),
        #emphasize major x gridlines
        panel.grid.major.x = element_line(color = "grey90", linetype = "dashed"),
        #remove minor x gridlines
        panel.grid.minor.x = element_blank(),
        #bold axis and legend titles
        axis.title = element_text(face = "bold"),
        legend.title = element_text(face = "bold"),
        #modify legend position and background
        legend.position = c(0.8, 0.8),
        legend.background = element_rect(color = "grey90")) +
  #remove legend for color (duplicates fill)
  guides(color = "none") +
  #custom lables
  labs(x = "\nCount",
       y = "Air Quality Index (AQI) bins\n",
       fill = "County")

aqi_sbs_plot

#what does the native ggplotly conversion look like?
ggplotly(aqi_sbs_plot)

Again, the ggplotly conversion looks so good that I am not very tempted to recreate a native plotly object, and some exploration online to find out how to do so suggests the relative difficulty of such a task. Therein lies the strength of ggplot and plotly working harmoniously through ggplotly – missing or to-be-developed features in plotly do not necessarily need to be given up if you can make do with some workarounds in ggplot.

Let’s modify the hover text of the chart above for ggplotly, just to round out a HTML/webpage ready visualization. These more complex visualizations with comparison usage scenarios are really where interactivity can shine!

#filter dataset to be SF and LA counties
aqi_sbs_plot_2 <- ggplot(data = aqi_cut_cat %>% filter(county %in% c("San Francisco", "Los Angeles")), 
       #if county is SF, make the counts "negative"
       mapping = aes(x = ifelse(test = county == "San Francisco", yes = -n, no = n), 
                     #y axis will be the categories, encoding by county
                     y = aqi_cat, color = county, fill = county, 
                     text = paste("County:", county,
                                  "\nAQI Category:", aqi_cat,
                                  "\nCount:", abs(n)))) +
  geom_col(alpha = 0.5, size = 0.5) +
  #absolute value for labels on the x axis (so negative counts will be positive!); 10 breaks in total
  scale_x_symmetric(labels = abs, n.breaks = 10) +
  #Take every first label (TRUE), skip every second label (FALSE)
  scale_y_discrete(breaks = function(x){x[c(TRUE, FALSE)]}) +
  #give manual color choices
  scale_fill_manual(values = c("#81b29a", "#4361ee")) +
  scale_color_manual(values = c("#81b29a", "#4361ee")) +
  #remove background fill
  theme(panel.background = element_rect(fill = NA),
        #remove y gridlines
        panel.grid.major.y = element_blank(),
        #emphasize major x gridlines
        panel.grid.major.x = element_line(color = "grey90", linetype = "dashed"),
        #remove minor x gridlines
        panel.grid.minor.x = element_blank(),
        #bold axis and legend titles
        axis.title = element_text(face = "bold"),
        legend.title = element_text(face = "bold"),
        plot.title = element_text(face = "bold"),
        #modify legend position and background
        legend.position = c(0.8, 0.8),
        legend.background = element_rect(color = "grey90")) +
  #remove legend for color (duplicates fill)
  guides(color = "none") +
  #custom lables
  labs(x = "\nCount",
       y = "Air Quality Index (AQI) bins\n",
       fill = "County", 
       title = "Comparing AQI distributions across SF and LA counties")

ggplotly(aqi_sbs_plot_2, tooltip = "text")

Bubble charts

Next, let’s take a look at bubble charts. The principles here can be extended to conventional scatterplots. Again, interactivity here is really useful – you can easily take a look at outliers in your EDA process, and these graphs can be more accessible and immediately actionable to a lay audience viewing your interactive report. I’d like to focus on making sure, as we have with the previous plots, that we pay attention to classic data visualization principles however – we still run the risk of overplotting, and we should ensure that hover texts are clear, legible, and present relevant information.

Below, we take a look at AQI and population (two continuous variables) at the city-region level, with differential sizing of dots by population density. You may notice that our first plot is very busy on the left-hand side – many observations (in this case, city-regions) sit along a smaller population size. There’s a few outliers on the right-hand side that can be explored immediately using plotly, as we’ll see in a bit.

What’s a strategy we can use to address this overplotting on the left-hand side while still preserving the general relationships we see in these data? This may be covered more in biostatistics courses, but you can transform variables to normalize and/or linearize the relationships between variables. There are some benefits (and drawbacks) in terms of model fitting and interpretation of results, but for now, we’ll operate on the knowledge that transforming variables is an existing practice and can be a viable tool in data visualization as well!

Also in the spirit of sharing the visualization thought process, here are some search strings I used while making these plots and the links that eventually helped:

#make our dataset
data_bubble <- ca_aqi %>%
  #group by city-region
  group_by(city_ascii) %>%
  #create weighted AQI measure, mean of population, mean of density
  summarise(aqi = weighted.mean(aqi, population),
            population = mean(population),
            density = mean(density))

ggplot(data = data_bubble, aes(x = population, y = aqi)) +
  #bubbles are symbolized dynamically in color and size by density
  geom_point(aes(size = density, color = density)) +
  #change x axis labels from scientific to commas
  scale_x_continuous(labels = scales::comma) +
  #use viridis color palette
  scale_color_viridis_c()+
  #set the range of sizes that bubbles can take on
  scale_size(range = c(1, 20), 
             #we also scale the size (driven by density) by cubing it, emphasizing larger values
             trans = scales::trans_new(name = "cube", 
                                       transform = function(x) x**3, 
                                       inverse = function(x) x**3)) +
  #add stata theme (from ggthemes package)
  theme_stata() + 
  #take out size legend
  guides(size = "none") +
  #increase width of color legend
  theme(legend.key.width = unit(1.85, "cm")) +
  #labels
  labs(color = expression(paste("Density (person per ", km^{2}, ")")),
       x = "\nPopulation Size (natural log)",
       y = "AQI\n (population weighted average) \n")

The results are interesting, and don’t look half-bad, especially with the preset Stata theme. To address the overplotting on the left-hand side, however, we’re going to perform a log-transformation on population, the x-axis. Remembering back to our algebra roots, this translates to very large values shrinking down, and very small units will increase in relative size. In general, log-transformations are great for skewed data, which we have here with the population variable. Take a look at the difference:

#lets make a static ggplot, now with a log-transformation for population -- 
aqi_bubble <- ggplot(data = data_bubble, aes(x = log(population), y = aqi)) +
  geom_point(aes(size = density, color = density)) +
  scale_x_continuous(labels = scales::comma) +
  scale_color_viridis_c()+
  scale_size(range = c(1, 20), 
             trans = scales::trans_new(name = "cube", 
                                       transform = function(x) x**3, 
                                       inverse = function(x) x**3)) +
  theme_stata() + 
  guides(size = "none") +
  theme(legend.key.width = unit(1.80, "cm")) +
  labs(color = expression(paste("Density (person per ", km^{2}, ")")),
       x = "\nPopulation Size (natural log)",
       y = "AQI\n (population weighted average) \n")

aqi_bubble

The general plot looks a lot less crowded on the left-hand side! This transformation also now suggests a fairly linear relationship between the natural log of population size and AQI measures, which may or may not be true. This should be taken into consideration when using transformations in any analysis or visualization. Further, we want to emphasize the interpretability of log (or any) transformations – it can be confusing for many audiences and should be reserved for those who may be familiar with the practice (showing EDA to your biostatistics advisor, for example).

Next, we’ll convert the ggplot object we just made using ggplotly.

#convert into a plotly object using ggplotly
ggplotly(aqi_bubble)

You’ll notice that this conversion isn’t the most successful – most things carry over, but the font became serif from sans serif, the legend for city is still present, and the legend for density has an incorrect title. However, it is still convenient to explore the outliers (Riverside, Los Angeles, for example), using the plot.

Let’s try to make an alternative with native plotly.

Here’s the plotly page for bubble charts: https://plotly.com/r/bubble-charts/

#plotly object reference same dataset as before
plot_ly(data_bubble,
        x = ~log(population),
        y = ~aqi,
        color = ~density,
        #next two lines define scatterplot
        type = 'scatter',
        mode = 'markers',
        #inclusion of size argument makes it a bubble chart, here we scale by x^3 for clarity
        size = ~density**3,
        #range of sizes that bubbles can take on
        sizes = c(10, 2500),
        #give bubbles transparency/alpha of 0.75
        marker = list(opacity = 0.75),
        hoverinfo = 'text',
        #custom hover text, <b> text </b> means bolded text
        text = ~paste("<b>City/Region:</b>", city_ascii,
                      #<br> means a newline, or break, here we round aqi to 2 digits
                      "<br><b>AQI (pop-weighted mean):</b>", round(aqi, 2), 
                      "<br><b>Population:</b>", population, 
                      "<br><b>Density (people per sq km):</b>", density)) %>%
  #like ggplot, different parts of the layout get identical arguments
  layout(xaxis = list(showgrid = FALSE, title = "Population size (natural log)", 
                      #give border lines for x axis
                      linecolor = "black", linewidth = 1),
         yaxis = list(title = "AQI \n(population-weighted average)", 
                      linecolor = "black", linewidth = 1),
         #remove legend for size
         showlegend = FALSE,
         #background color
         paper_bgcolor = 'rgb(225, 240, 245)',
         #custom font
         font = list(family = "Arial", size = 12)) %>%
  #title of color legend
  colorbar(title = "Density \n(people per sq km)") 
## Warning: `line.width` does not currently support multiple values.

With these modifications annotated in the code, I’m reasonable happy with the parity between this plotly graph and ggplot original. I prefer the scaling of the plotly plot, which accentuates how much more dense San Francisco is compared to peer city/regions (and the sans serif font). You can achieve the same using scales::. Of course, as you can throughout any module, feel free to modify code as you go along to see how it affects your final visualization – it’s the best way to learn!

Alternative graphing library: ggigraph

ggigraph is a newer graphing library that, likely ggplotly, also attempts to bridge ggplot2 objects with interactivity. It offers some features that should be interesting to implement for interactive documents, like linked units across graphs (which will be showcased shortly). From their documentation, the necessary syntax adjustments we’d like to make aside from our standard ggplot2 syntax include:

Let’s extend the graphs we’ve already made by making a combined plot of AQI trends over time and population density bubble charts.

#let's copy our lineplot from before, with some additions
lineplot_igraph <- ggplot(data = aqi_lineplot, aes(x = date, y = aqi, color = county, 
                                                   #note we add these two arguments to indicate the hover text
                                                   #and the interactive grouping 
                                                   data_id = county, tooltip = county)) + 
  #add a line connecting observations NOTING that it is _interactive
  geom_line_interactive() +
  #can add geom_point if you'd like, may not be as legible
  #geom_point(alpha = 0.5) + 
  #add horizontal reference line at the weighted mean of aqi values
  geom_hline(yintercept = mean(aqi_lineplot$aqi), 
             color = "gray25", linetype = "dashed", linewidth = 0.75) +
  #start x axis at 0
  scale_x_date(expand = c(0,0)) +
  #give gridlines color grey at 98% transparency/alpha
  theme(panel.grid = element_line(color = "grey98"),
        #give custom x and y axis lines at 70% grey, size 0.5
        axis.line = element_line(color = "grey70", linewidth = 0.5),
        #give axis ticks that push into the plot instead of out
        axis.ticks.length = unit(-0.25, "cm"),
        #indicate x axis ticks
        axis.ticks.x = element_line(color = "grey40", linewidth = 0.5),
        #make white background
        panel.background = element_rect(fill = "grey100", color = "grey100", linewidth = 0.5),
        #give custom position for legend
        legend.position = c(0.70,0.95),
        #bold plot title
        plot.title = element_text(face="bold")) +
  #customize legend for color, all contained in 1 row, and actual lines in legend are size 6
  guides(color = guide_legend(nrow = 1, 
                              byrow = TRUE, 
                              override.aes = list(size = 6)))+
  #give legends. \n indicates newlines, which spaces out the titles
  labs(x = "\nYears", 
       y = "Air Quality Index (AQI) values \n",
       color = "",
       caption = "Horizontal reference line indicates average AQI value \n Data courtesy Environmental Protection Agency (EPA)",
       title = "Time trends of Air Quality Index (AQI) values for Bay Area counties\n")

#we call the igraph object with the following function:

girafe(ggobj = lineplot_igraph, 
       #these options simply indicate we'd like the colors to extend to the hover
       #by default, the hover has white text on a black background
       options = list(opts_tooltip(use_fill = TRUE)))
#next, let's do the same with the population density against aqi scatterplot (we're not able to make bubble charts with ggigraph, yet)
aqi_scatter_igraph <- ggplot(data = aqi_lineplot %>% 
                              group_by(county) %>%
                              summarise(aqi = weighted.mean(aqi, pop),
                                        population = mean(pop)), aes(x = log(population), y = aqi, 
                                                    #remember our additions
                                                    data_id = county, tooltip = county)) +
  #remember to make our geom interactive
  geom_point_interactive() +
  scale_x_continuous(labels = scales::comma) +
  scale_color_viridis_c()+
  scale_size(range = c(1, 20), 
             trans = scales::trans_new(name = "cube", 
                                       transform = function(x) x**3, 
                                       inverse = function(x) x**3)) +
  theme_stata() + 
  guides(size = "none") +
  #theme(legend.key.width = unit(1.80, "cm")) +
  labs(color = expression(paste("Density (person per ", km^{2}, ")")),
       x = "\nPopulation Size (natural log)",
       y = "AQI\n (population weighted average) \n")


#next, we call our ggigraph object

girafe(ggobj = aqi_scatter_igraph, 
       options = list(opts_tooltip(use_fill = TRUE)))

This was a brief introduction to ggigraph, but hopefully it illustrates the potential to combine graphs and cross-reference the same units. Since it is still a new package, however, visual customization is limited (note how we were not able to symbolize geom_point_interactive by a dynamic size).

Interactive Tables in R

While this course has not devoted much time to visualizing tables, based on the assumption that you have covered table construction in your stats courses, they are intrinsically tied into communicating data. In any public health practice that involves data, tables are just as important as visualizations (and in some cases, even more so!). This section aims to give a quick overview of interactive tables that R users can leverage in interactive reports.

A note on accessibility: A neat thing about both dt and reactable is that since both generate a HTML table, the resulting output(s) should be compatible with screen readers, making your interactive documents accessible to a wider audience!

We’ll be exploring these two packages:

DT / datatables: A venerated javascript library for interactive tables https://rstudio.github.io/DT/

reactable: A newer library in development, with simpler syntax: https://glin.github.io/reactable/index.html

Let’s start first with datatables! We’ll be using the same dataset that we made for the bubble chart, for ease.

#wrapping our dataset in the datatable function to start
datatable(data_bubble)

The default result is usable! As you can test for yourself, it’s an interactive paginated version of what our data viewer in R is, but it’ll stay interactive when knit to an HTML file! There’s a few options for customization and improvement we can explore.

Let’s first start with some visual changes:

datatable(data_bubble,
          #add CSS class for theming
          class = 'cell-border stripe',
          #remove numbered index
          rownames = F,
          #add custom column names instead of renaming columns
          colnames = c("City/Region", "AQI (population-weighted average)", 
                       "Population", "Density (people per sq. km)")) %>%
  #bold the `city_ascii` column
  formatStyle("city_ascii", fontWeight = "bold") %>%
  #round the `aqi` column to one decimal place
  formatRound("aqi", 1) %>%
  #format the `aqi` column to have red text over values of 75, black otherwise
  formatStyle(c("aqi"), 
              color = styleInterval(c(75), c("black", "red")), 
              #bold values over 75, normal weight otherwise
              fontWeight = styleInterval(c(75), c("normal", "bold")))

Next, let’s look at some functional customizations:

datatable(data_bubble,
          #add CSS class for theming
          class = 'cell-border stripe',
          #remove numbered index
          rownames = F,
          #add custom column names instead of renaming columns
          colnames = c("City/Region", "AQI (population-weighted average)", 
                       "Population", "Density (people per sq. km)"),
          #add buttons to export the data within the html (cool!)
          #note that the `dom` argument can accept many, check out documentation for more
          #by calling `Btp`, we keep buttons (B), the table itself (t), and pagination control (p) 
          extensions = 'Buttons', options = list(dom = 'Btpl',
                                                 buttons = c('copy', 
                                                             'csv', 
                                                             'excel', 
                                                             'pdf', 
                                                             'print'),
                                                 #how many entries per page
                                                 pageLength = 10,
                                                 #options for "Show how many entries"
                                                 lengthMenu = c(10, 20, 30))) %>%
  #bold the `city_ascii` column
  formatStyle("city_ascii", fontWeight = "bold") %>%
  #round the `aqi` column to one decimal place
  formatRound("aqi", 1) %>%
  #format the `aqi` column to have red text over values of 75, black otherwise
  formatStyle(c("aqi"), 
              color = styleInterval(c(75), c("black", "red")), 
              #bold values over 75, normal weight otherwise
              fontWeight = styleInterval(c(75), c("normal", "bold")))

This is only a survey of a very few functions available in datatable, so do check out the aforementioned documentation to learn more!

Now, let’s move onto reactable:

Again, reactable is a newer library and may offer simpler construction and more flexibility than dt. It still is in development however, so documentation and help may not be as available.

For this, we’ll use the dataset we used for the earlier lineplot to showcase convenient functionality to group table entries by a categorical variable (in this case, grouping observations each year by county).

#subset out an unneeded column, groupby county
reactable(aqi_lineplot %>% select(-date), groupBy = 'year',
          #rename first column
          columns = list(county = colDef(name = "County"),
                         #rename second column
                         year = colDef(name = "Year"),
                         #rename third column, rounding to 1 digit
                         aqi = colDef(name = "AQI", format = colFormat(digits = 1))))

We can already see a few stylistic changes, with reactable leaning towards a more minimalist design language, which might be preferable. I would argue that the use case I presented with this dataset isn’t actually very useful in a table format (it would make sense to be able to more easily compare across countries and years), but hopefully it illustrates the possibilities with reactable.

Another neat feature afforded by both dt and reactable is the inclusion of plots right within our tables! When distributions of a variable of interest vary by a grouping variable or geography, it can be extremely useful and succinct to include these. Here’s an example below using our main CA AQI dataset grouped by year, drawing directly from documentation found here: https://glin.github.io/reactable/articles/examples.html#custom-rendering

reactable_data <- ca_aqi %>%
  #create year variable from date variable
  mutate(year = lubridate::year(date)) %>%
  #group by year
  group_by(year) %>%
  #new aqi variable will be a list of all aqi values in that year
  summarise(aqi = list(aqi)) %>%
  #make empty variables for boxplot and sparkline
  mutate(boxplot = NA, sparkline = NA)

reactable(reactable_data, 
          columns = list(
            #barchart for aqi values in a year, max chart range of 500
                aqi = colDef(cell = function(values) {
                  sparkline(values, type = "bar", 
                            chartRangeMin = 0, chartRangeMax = 500)
                }),
                #boxplot for aqi values in a year
                boxplot = colDef(cell = function(value, index) {
                  sparkline(reactable_data$aqi[[index]], type = "box", 
                            chartRangeMin = 0, chartRangeMax = 200)
                }),
                #lineplot for aqi values in a year
                sparkline = colDef(cell = function(value, index) {
                  sparkline(reactable_data$aqi[[index]])
                })
))

Exercise

The best way to get a sense of the usages of different interactivity packages in R is to try it out yourself. In this exercise, we’d like you to take the dataset we’ve prepared (the 5 cities with highest average AQI in California) and create either a violin plot or a box plot to depict the spread of AQI measurements across the five cities. We’d like you to try constructing either of these plots through ggplot with a ggplotly conversion, and then by a native plotly object construction.

Of importance is to take note as to how the tooltip is being generated. Refer to earlier code examples to customize the tooltip if you deem it necessary.

#here we generate the names of cities we are interested in
top5cities <- ca_aqi %>%
  #group by city
  group_by(city_ascii) %>%
  #population-weighted average of AQI
  summarise(mean_aqi = weighted.mean(aqi, population, na.rm = T)) %>%
  #sort dataframe by mean aqi, in descending order
  dplyr::arrange(desc(mean_aqi)) %>%
  #take the top 5
  head(5) %>%
  #select just the city names
  select(city_ascii) %>%
  #convert the dataframe into a vector
  as_vector()

#create dataframe with all unaggregated AQI measurements
top5_df <- ca_aqi %>%
  #filter for cities within our list
  filter(city_ascii %in% top5cities,
         #filter out extreme aqi values
         aqi < 1000)

# -------------------------------------------------------------
library(plotly)

# Violin plot using ggplot2
gg_violin <- ggplot(top5_df, aes(x = city_ascii, y = aqi)) +
             geom_violin() +
             labs(title = "Violin Plot of AQI in Top 5 Cities", x = "City", y = "AQI")

# Convert to interactive plotly object
interactive_violin <- ggplotly(gg_violin)

# Render the plot
interactive_violin
# Box plot using plotly
plotly_box <- plot_ly(data = top5_df, x = ~city_ascii, y = ~aqi, type = "box",
                      hoverinfo = 'y+x') # Customizing tooltip information

# Render the plot
plotly_box

And there you have it! A brief survey of just some of several interactivity methods available to us in R. There are unlimited possibilities in visualization, and while some are more involved than others (and may require more patience with searching up documentation), the deep investment in development within the R community points to having a solution more often than not. We recommend you exploring the linked resources at the beginning of this module (especially plotly’s) for more examples. As always, feel free to ask and questions or start a discussion in our class EdStem or by emailing us!