Since starting work at the Reich Lab in January (part-time, 4 hrs/wk), the main focus of my efforts has been visualizing influenza forecasting results related to the FluSight Network challenge. My hope is that these visualizations can help us further reason about and understand our predictions and the performance of the models that contribute to them, and also assist in communicating these results to audiences outside the lab, whether collaborators, fellow infectious disease modeling researchers, or the interested layperson.
My general workflow has been to experiment with the codebase already in use for data processing and visualization, first to understand what visuals have been made so far and how they were created. I then use that both as inspiration for new visualization methods to pursue and as a reference for how to interact with the code for common tasks (e.g., loading and tidying prediction or score data). This has produced a single large R script of general ideas, which are then refined into polished products in the more modular form of scripts, Markdown documents, or Shiny apps. Often I experimented with plot ideas on my own and refined them with guidance from Nick during our weekly meetings; other times Nick had a specific idea for a plot, which I then implemented while consulting with him along the way for feedback.
This presentation will highlight different plots I have made so far in chronological order, showing how some evolved over time and detailing my thought process in creating them and how they can be used.
The FluSight Network is a collaborative group of scientists and researchers participating in the CDC’s annual “Forecast the Influenza Season Collaborative Challenge” (a.k.a. FluSight). The FluSight challenge focuses on forecasts of the weighted percentage of doctor’s office visits for influenza-like illness (wILI) in a particular region. These are divided into seven main targets of particular interest to the CDC, which can be broadly classified as either “seasonal” or “k-week-ahead”. Seasonal targets are fixed scalar values for a particular season: onset week, peak week, and peak intensity (i.e. the maximum observed wILI percentage). The k-week-ahead targets are the observed wILI percentages in each of the subsequent four weeks.
Each forecast is composed of a point estimate along with an associated prediction interval of potential values, representing our uncertainty around that estimate. Because national wILI data are received from the CDC at a two-week lag, 1 and 2 week-ahead forecasts are considered nowcasts (i.e. estimates at or before the current time), while 3 and 4 week-ahead forecasts are proper forecasts, or estimates about events in the future.
Past efforts have used a multitude of different techniques and models (spanning compartmental, ARIMA-based, non-parametric, Bayesian, and more) to predict each of these targets. However, model performance can vary considerably across seasons and regions, and some models perform better for certain targets than others. For that reason, more recent efforts have focused on developing a robust ensemble forecasting method that takes a weighted average across multiple models’ predictions. Various weighting schemes were tried at different levels of granularity, ranging from a constant weight for each model (22 weights) to a weight for each unique combination of region, target, and model (1,540 weights). The best-performing method ended up being target-type weights (TTW), a total of 44 weights: one for each combination of model and target type. In practice, this weighting scheme noticeably improved forecast accuracy.
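As a rough illustration of how such a weighted ensemble combines component forecasts, here is a minimal dplyr sketch; the data frames and column names are assumptions for illustration, not the actual FluSight Network code.

```r
# Minimal sketch of a target-type-weighted (TTW) ensemble. Assumes two
# hypothetical data frames:
#   component_probs: model, target_type, bin, prob
#   ttw_weights:     model, target_type, weight (summing to 1 within each target type)
library(dplyr)

ensemble <- component_probs %>%
  left_join(ttw_weights, by = c("model", "target_type")) %>%
  group_by(target_type, bin) %>%
  summarize(ens_prob = sum(weight * prob))  # weighted average of the component probabilities
```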
Influenza forecasts have been evaluated by the CDC primarily using the log score. While log scores are not on a particularly interpretable scale, exponentiating an average log score yields a forecast score equal to the geometric mean of the probabilities assigned to the eventually observed outcomes. This score has the intuitive interpretation of being the average probability assigned to the true outcome (where the average is a geometric one).
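Concretely, going from an average log score to this forecast score is just an exponentiation; a small illustrative example in R (the log scores here are made up):

```r
# Illustrative log scores, one per forecast; exponentiating their mean gives
# the geometric-mean probability assigned to the eventually observed outcomes.
log_scores <- c(-1.2, -0.8, -2.3, -0.5)
forecast_score <- exp(mean(log_scores))
# identical to prod(exp(log_scores))^(1 / length(log_scores))
forecast_score
```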
To start off, I thought visualizing weekly forecasts interactively could be useful: plots colored by more than a few levels quickly become difficult to read, and Plotly makes it easy to select which factor levels to show at a given time. This idea has proved extremely useful in my subsequent plotting, and is in my opinion one of the best features of the model comparison Shiny app I have most recently been working on.
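As a minimal sketch of the idea (not the actual plot), assuming a tidy data frame `forecasts` with hypothetical columns `bin`, `prob`, and `model`, wrapping a ggplot in `ggplotly()` turns the legend into a toggle for showing or hiding individual models:

```r
# Sketch of an interactive forecast plot: clicking a model name in the
# plotly legend shows or hides that model's trace.
library(ggplot2)
library(plotly)

p <- ggplot(forecasts, aes(x = bin, y = prob, color = model)) +
  geom_line() +
  labs(x = "Peak week", y = "Forecasted probability")

ggplotly(p)
```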
We can also view the same scatterplot of peak week percentage probabilities as above with earlier data from the season (week 43) to see more uncertain distributions.
After discussing with Nick that the forecasted probability that a given region was past its peak was of particular interest (especially at that point, still fairly early in the season), I thought a diverging chart centered at 0.5 would be an interesting way to visualize it. It can easily be interpreted: regions on the right are more likely than not to be past their peak, and the opposite holds for regions on the left.
The probability that we are past the peak for the current epiweek is calculated as the sum of the probabilities in each bin up to (and including) the current epiweek, which includes all past weeks as well as two in the future, since the data are received at a two-week lag. This means that, for example, the probability past peak plot for epiweek 5 comes from the forecast data from epiweek 3. I tried this a few different ways at first:
Ultimately, Nick enjoyed the Tufte-like elegance of the diverging dot plot most, and after working on it with him at one of our sessions we were able to refine it like so:
The diverging dot plot led to my first significant contribution of the semester - writing a script that, without assumptions about the given week, creates a plot representing the forecasted probability that each of the HHS regions is past its peak in that epiweek. We realized that, to most people, an HHS Region number is a more-or-less opaque designation, so we decided to supplement the plot with a choropleth map that colors each HHS region by its predicted probability of being past its peak.
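The past-peak probability behind both the dot plot and the map boils down to the cumulative sum described above, taken over the binned peak-week forecast. A minimal sketch of that calculation, with the data frame and column names assumed:

```r
# Sketch of the past-peak calculation. Assumes a hypothetical data frame
# `peak_week_probs` with columns region, bin_epiweek (ordered within the
# season), and prob.
library(dplyr)

current_epiweek <- 5  # forecasts made with data through epiweek 3 (two-week lag)

past_peak <- peak_week_probs %>%
  group_by(region) %>%
  summarize(p_past_peak = sum(prob[bin_epiweek <= current_epiweek]))
```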
Early in the season, this proved to be a useful way of gauging our models’ view of that particular target, one that easily lends itself to communication with a lay audience. At a certain point, however, the plot stops telling us anything meaningful, since every region is past its peak with probability 1. Observe this recent plot created with forecasts from epiweek 14:
Expanding on some ideas that had been floating around since my initial analyses, I also thought heatmaps for a given region, with epiweek on the x-axis and the forecasted probabilities for the given target on the y-axis, would be an interesting way to visualize how our predictions (and the uncertainty around each target) evolve as we get further into the season. My initial attempt looked something like this:
After consulting with Nick and implementing his suggested changes of binning the probabilities into discrete intervals to even out the color gradient and focusing only on the seasonal targets, I arrived at the following (a sketch of the binning step follows the plot):
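The binning step itself might look roughly like this in ggplot2 terms; the data frame, column names, and break points are assumptions for illustration, not the actual code behind the plot:

```r
# Rough sketch of binning forecast probabilities into discrete intervals so a
# few large values don't dominate the fill scale. Assumes a hypothetical data
# frame `onset_forecasts` with columns forecast_epiweek, bin, and prob.
library(dplyr)
library(ggplot2)

heat_dat <- onset_forecasts %>%
  mutate(prob_bin = cut(prob,
                        breaks = c(0, 0.01, 0.05, 0.1, 0.25, 0.5, 1),
                        include.lowest = TRUE))

ggplot(heat_dat, aes(x = forecast_epiweek, y = bin, fill = prob_bin)) +
  geom_tile() +
  scale_fill_brewer(palette = "Blues") +
  labs(x = "Epiweek of forecast", y = "Forecasted onset week", fill = "Probability")
```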
One task Nick gave me to help with the model comparison manuscript was a plot comparing average model performance against the “baseline” (in this case, the ReichLab KDE model, which uses historical incidence data to assign probabilities to future outcomes). To that end, he tasked me with creating a monster plot, separated by model type and target type, that included choropleth maps of performance for each subset as well as a scatterplot of baseline performance (x) against the average performance of all other models (y). The question we were trying to answer was essentially “Does the performance of the baseline model correlate with the performance of the rest of the models?”. The plot I eventually created can be seen below:
However, the results were surprising - the scatterplot in the bottom-left corner suggested that overall model performance scales almost linearly with baseline model performance; in other words, for the k-week-ahead targets the baseline and the other models performed roughly equally. For this reason, Nick decided to include an altered version of the plot in the final model comparison manuscript, keeping some of the maps but changing the plots on the bottom row. That plot can be seen here:
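For reference, the baseline-versus-others comparison can be sketched roughly as follows; the column names and the baseline model label are assumptions, not the manuscript's actual code:

```r
# Sketch: average score of the non-baseline models plotted against the
# baseline (KDE) score, with a y = x reference line. Assumes scores_adj has
# columns model, region, season, and score, and that the baseline is labeled
# "ReichLab_kde" (hypothetical label).
library(dplyr)
library(ggplot2)

cmp <- scores_adj %>%
  group_by(region, season) %>%
  summarize(baseline = mean(score[model == "ReichLab_kde"]),
            others   = mean(score[model != "ReichLab_kde"]))

ggplot(cmp, aes(x = baseline, y = others)) +
  geom_point() +
  geom_abline(slope = 1, intercept = 0, linetype = "dashed") +
  labs(x = "Baseline (KDE) score", y = "Average score, all other models")
```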
After finishing this task I was essentially let loose, and in reviewing previously created plots I realized that one avenue of visualization completely unexplored up to that point was model performance by epiweek. Inspired by this, I created a large variety of plots on this theme, using different colors, facets, and subsets of the data to highlight different elements.
My general process was to start from the entire cleaned scores_adj dataset, then create whatever subset was necessary for each plot using group_by and summarize from the dplyr package. As you might imagine, this led to a large number of subsets being created and stored in memory as the number of plots increased, an issue we eventually addressed through the Shiny app. A few of the plots I liked best can be seen below, after a sketch of this subsetting pattern:
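A typical instance of the subset-then-plot pattern, with the column names and target label assumed for illustration:

```r
# Aggregate scores_adj to one average score per model and epiweek for a single
# target, then draw a line per model. Column names are assumptions.
library(dplyr)
library(ggplot2)

by_week <- scores_adj %>%
  filter(target == "1 wk ahead") %>%
  group_by(model, epiweek) %>%
  summarize(avg_score = mean(score))

ggplot(by_week, aes(x = epiweek, y = avg_score, color = model)) +
  geom_line() +
  labs(x = "Epiweek", y = "Average log score")
```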
I also came across stacked density plots (also known as “joy plots”) and tried my hand at creating some for our weekly forecasts, using the ggridges library for the visualization and the viridis library for the coloring. I was pretty pleased with the results overall - I think they look really cool and tell an interesting story about how our uncertainty around certain targets changes over time, maybe better than the heatmaps I had created earlier with a similar idea in mind.
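A rough sketch of how such a plot can be put together with ggridges and viridis, again with the data frame and column names assumed:

```r
# Stacked density ("joy") plot of weekly peak-week forecasts. Assumes a
# hypothetical data frame `peak_week_probs` with columns bin_epiweek,
# forecast_epiweek, and prob (pre-computed heights, so stat = "identity").
library(ggplot2)
library(ggridges)
library(viridis)

ggplot(peak_week_probs,
       aes(x = bin_epiweek, y = factor(forecast_epiweek),
           height = prob, fill = factor(forecast_epiweek))) +
  geom_density_ridges(stat = "identity", show.legend = FALSE) +
  scale_fill_viridis(discrete = TRUE) +
  labs(x = "Peak week", y = "Epiweek of forecast")
```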
When I showed the resulting huge document of plots to Nick, he immediately suggested consolidating all of my efforts into a Shiny app which would allow users to specify graphical parameters of the plot on the fly, making it possible to create all of the document’s plots (and more) at their convenience. After writing a proposal for such an app and getting Nick’s OK, I began.
My initial idea was to have a single tab for all possible line plots, with an additional tab for creating density and joy plots. In the end we scrapped the density plots and, at Nick’s suggestion, included separate tabs for each region, season, and model, which we felt increased the usability of the app substantially. Since the app was potentially at risk of being too focused on single targets, Nick also suggested we include a tab at the front of the app displaying overall model performance through heatmap-style plots like those included in the paper. These also have specifiable parameters, namely the facet and the x-axis, though model always remains on the y-axis.
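To give a flavor of the parameter-driven plotting idea, here is a minimal sketch (this is not the app's actual code, and the column names are assumed):

```r
# Sketch of a Shiny app where the user picks the x-axis and faceting variable
# and the server aggregates scores_adj accordingly. Column names are assumptions.
library(shiny)
library(dplyr)
library(ggplot2)

ui <- fluidPage(
  selectInput("xvar", "X-axis", choices = c("epiweek", "season")),
  selectInput("facet", "Facet by", choices = c("target", "region")),
  plotOutput("perf_plot")
)

server <- function(input, output) {
  output$perf_plot <- renderPlot({
    scores_adj %>%
      group_by(model, .data[[input$xvar]], .data[[input$facet]]) %>%
      summarize(avg_score = mean(score)) %>%
      ggplot(aes(x = .data[[input$xvar]], y = avg_score, color = model)) +
      geom_line() +
      facet_wrap(vars(.data[[input$facet]]))
  })
}

shinyApp(ui, server)
```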
Finally, an about tab was appended to give users more context about the FluSight challenge as well as the app’s role within it. Importantly, we make the distinction that the app is fundamentally different in purpose from the FluSight Network site. Rather than focusing on the most accurate predictions, which come from the ensemble models, the app provides a “meta-analysis” of the component models, detailing how their performance changes across different facets. As you will see, their performance can vary wildly across seasons, targets, and regions, so helping interested parties understand how these models perform at a more granular level felt worthwhile. The app is intended to be released as a supplement to the upcoming model comparison paper, and we hope it will be useful not only to FluSight Network participants but also to other researchers and groups interested in infectious disease modeling.
Though the app is essentially feature complete, its UI is still being tinkered with and there could be bugs present, so your thoughts and feedback are greatly appreciated! Link to the app.
The most critical thing I’ve learned during my time at the ReichLab is that defining a use case for your work early is probably the most important step in any analytic approach, whether modeling, visualization, or building software. Before beginning something, you need a clear idea of what the expected benefit of your work will be, who will benefit from it, and how they will interact with it. In predictive analytics, this mostly boils down to how your predictions will be used and what process or group they are intended to benefit. For something as open-ended as data visualization and exploratory data analysis, the lines can be less defined, but the process is no less important. Two main questions guided my efforts in creating graphics and visualizations this semester: “Who is the target audience for this visualization, and what will they gain from it?” and “How does this approach compare to what has been done already?”
The first question makes sure you keep in mind what information your plot conveys and how it should be interpreted by your target audience. It keeps you from creating graphics that look pretty but are ultimately meaningless, and forces you to consider how the graph might be read by someone who should benefit from viewing it. For my work this semester, the audience has mostly been Nick (and other researchers within the FluSight Network), with the purpose of communicating our influenza forecasting results more easily, while also keeping open the potential to share these results with an outside audience. This has also pushed me to consider the interpretability of the graphs from the perspective of someone with no (or minimal) prior knowledge of the FluSight Network, which has been helpful as well.
The second question ensures you are not treading ground that has already been covered, and helps you consider what new things your approach brings to the table. As with writing an academic research paper, you have to be acutely aware of what has already been done for your task before you can attempt an original contribution to it. This principle has helped me in multiple ways this semester: in one instance, I had an idea for a Shiny dashboard that could be used to automate analysis of influenza forecasting results each week, before realizing it would be far too similar to the existing FluSight Network website. On the other hand, when looking through the visualizations made so far for the model comparison paper, I noticed that no plots focusing on performance by epiweek had been created. Eventually, this led to my most significant contribution to the lab yet, the model comparison app, which features, among other things, a strong focus on by-epiweek line plots showing model performance over time.
An applied introduction to GitHub and version control has been very helpful, as has the general experience of working on a team and collaborating with others (especially adjusting to an existing codebase and contributing novel additions to it as needed). Even though my work was fairly self-contained this semester, it was valuable to catch up with the other parts of the lab during weekly meetings and through discussions with Nick, to get a sense of how my work contributes to the whole - something important to keep in mind in any collaborative work environment.