Probability distribution plot

The goal of this plot is similar to the probability bar plot. We want to visualize the likelihood of a variable falling within a given range in the future. In this plot we will still split ranges by color but we won’t use a bar to represent likelihood, instead we will use an actual probability ditribution curve. This can make visualization a little easier.

Data

If you are plotting in a session different from the original analysis, ensure you have the data needed to complete the plots.

For a probability distribution plot we need the weighted metrics for each runs, the result after running metric_calc and merging with model likelihood weights. As a review, metric_calc uses a user defined metric definition to compute and extract metrics from the model results. For example, we can compute the median end of century global mean surface temperature anomaly relative to a past reference period.

After merging with the model weights and adding scenario name the data should look something like this:

##    run_number metric_result scenario      weights
## 1           1     0.7715469   ssp119 1.581314e-02
## 2           2     0.8429721   ssp119 1.785500e-02
## 3           3     1.2356367   ssp119 2.103415e-02
## 4           4     1.1842129   ssp119 1.923625e-02
## 5           5     0.7671621   ssp119 1.492278e-02
## 6           6     0.9800407   ssp119 5.276083e-03
## 7           7     0.9400672   ssp119 3.778613e-03
## 8           8     0.7499700   ssp119 6.111274e-03
## 9           9     0.9100293   ssp119 1.711479e-05
## 10         10     1.0911564   ssp119 6.053824e-07

Computing Kernal Density Values

A probability distribution is created by plotting continuous data on the x-axis (in our case warming values) and estimates the density at a specific point along the temperature range. The kernel density value is used to represent the estimated probability density of the data at each point along the warming (x-axis) range in our data. The kernel density is a useful tool for visualizing the shape of data distribution – this is what we are using it for!

We can use R functions to compute the kernel density values for our data, and will use this to plot the probability distribution.

First we will split our metric data frame into a list. The list will contain four data frames, one for each of our scenarios:

I do not include this here, but you may also want to check and make sure your scenario column is a set of factors.

# Split data into a list by scenario factor levels 
scenario_list <- split(plot_3_df, plot_3_df$scenario)

Now that our data is stored in a list we can take advantage of apply() functions to compute kernel density values in each data frame stored in scenario_list. The result of the density function is itself a list where the contents x and y are the coordinates we want to plot. Therefor the apply loop we write will include the extraction and storage of those variables:

# use lapply to loop across data frames in scenario_list to compute KDE values for each
kde_values <- lapply(names(scenario_list), function(df_name){
  
  # store data based on name of the current df 
  # this way we can add a scenario column to the final results
  df <- scenario_list[[df_name]]
  
  # use that df to compute density estimates weight by `weights`
  density_values <- density(df$metric_result, weights = df$weights)
  
  # build data frame with coordinates we want to plot
  density_estimate <- data.frame(scenario = df_name,
                                  value = density_values$x,
                                  density = density_values$y)
  
  return(density_estimate)
})

## Warning in density.default(df$metric_result, weights = df$weights): Selecting
## bandwidth *not* using 'weights'
## Warning in density.default(df$metric_result, weights = df$weights): Selecting
## bandwidth *not* using 'weights'
## Warning in density.default(df$metric_result, weights = df$weights): Selecting
## bandwidth *not* using 'weights'
## Warning in density.default(df$metric_result, weights = df$weights): Selecting
## bandwidth *not* using 'weights'

# bind the rows of kde_values to create a data frame
kde_df <- do.call(rbind, kde_values)

With our data prepped, we can move on to plotting.

Plotting

All plotting will be completed using ggplot2. If you are unfamiliar with ggplot2 there are several well-written tutorials online.

The final plot from this tutorial will be probability distribution where each curve band represents the temperature probabilities for each SSP scenario we used to run our model. There are a number of freedoms that can be taken with respect to design. Here, we use a palette that pairs with 9 temperature ranges from 0 - >4.5. In this figure, the probability density curve will be divided into temperature bands colored with this palette.

Color palette:

library(scales)
temp_cols <- c("navyblue","#2166AC","#4393C3","#D1E5f0","#FDDBC7","#F4A582","#D6604D","#B2182B","#67001F")
show_col(temp_cols)

As previously mentioned, the x-axis in these plots will represent the range of possible warming outcomes and the y-axis will be density values.

# Define breaks and colors to correspond
breaks <- c(0, 1, 1.5, 2, 2.5, 3.0, 3.5, 4.0, 4.5, Inf) # define breaks for values in kde_df (kde_df$value)

# what colors we will assign to each break
colors <- c(
  "navyblue",
  "#2166AC",
  "#4393C3",
  "#D1E5f0",
  "#FDDBC7",
  "#F4A582",
  "#D6604D",
  "#B2182B",
  "#67001F")

# names of the color breaks -- will be used to label the legend in the plot 
legend_labels <- c("0 to 1.0 C",
                   "1.0 to 1.5 C",
                   "1.5 to 2.0 C",
                   "2.0 to 2.5 C",
                   "2.5 to 3.0 C",
                   "3.0 to 3.5 C",
                   "3.5 to 4.0 C",
                   "4.0 to 4.5 C",
                   "4.5+ C")


# Create intervals and assign colors
kde_df$interval <- cut(kde_df$value, # which column to cut with the breaks
                       breaks = breaks, # the breaks object
                       labels = colors, # use the colors object here
                       include.lowest = TRUE) # should the lowest value in the data be included?

# Plot using ggplot
warming_probability <- # name the plot
  ggplot(kde_df, # specify the data frame containing the data 
         aes(x = value, # x-axis variable
             y = density, # y-axis variable
             fill = interval)) + # how to color fill the plot
  geom_area() + # specifying the plot type - will fill area with 'fill'
  scale_fill_manual(values = colors, # manually assign fill colors to match the palette we want
                    labels = legend_labels, # labels to use in the legend
                    name = "Warming") + # name of the legend
  scale_x_continuous(breaks = seq(0, 6, by = 0.5), # set the scale and labels of the x-axis
                     limits = c(0,6), # set x-axis limits
                     expand = c(0, 0.1)) + # limit white space
  labs(title = "Estimated probability of warming",
       x = "Temperature (C)",
       y = "Density") +
  theme_light() + # set theme - personal preference
  facet_wrap(~scenario) # facet by scenario name
warming_probability

## Warning: Removed 93 rows containing non-finite outside the scale range
## (`stat_align()`).

Matilda Visualization Part 3: Weighted Probability Distribution

Joe Brown

2024-04-17

Background

Probability distribution plot

Data

Computing Kernal Density Values

Plotting