Introduction

Once we have model output variables, we want to make sure that the data is comparable to other studies. One way that we can do this is by normalizing our data to a reference point that other climate change researchers use. One of these reference periods is 1850-1900 which is described as the pre-industrial reference period. In this guide, we will go through the steps of normalizing our output to this reference period. To do this we will write our own function and use it on our model result dataframe (df).

We will also use functions in Matilda to summarize our data to provide us with specific metrics that we can use to describe our results in a clear way for presentations, posters, or publications.

This guide picks up where the last Matilda guide finishes. It assumes you have a model result available to work from.

Here, I am loading an example result from Hector-Matilda, it is a copy I have saved on my computer:

mod_result <- read.csv("data/example_result.csv")

Normalizing Hector-Matilda Data

We will write our own function for this task.

First load your libraries. Asusming you are working in the same session as installing and running Matilda, you probably have Matilda installed already. But if not install matilda as well as tidyverse.

# load libraries
library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.6
## ✔ forcats   1.0.1     ✔ stringr   1.6.0
## ✔ ggplot2   4.0.1     ✔ tibble    3.3.1
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.2
## ✔ purrr     1.2.1     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(matilda)

## Loading required package: hector

When we write functions in R we are essentially writing the code that we want to perform on some part of the data (could be specific rows, columns, or the whole df) and wrap it into a function builder. This makes it a tool that can be more easily used in the future. In this type of work we write functions for things we might need to do a lot of different times or might be use ful future analyses.

In the following code chunk, I will write the full function with annotations and explain the construction after:

normalize_gmst <- function(data,              #1
                           ref_start = 1850, 
                           ref_end = 1900) {
  
split_by_run_number <- split(data, data$run_number) #2

normalized_list <- lapply(split_by_run_number, function(df) {
  
  gmst_subset <- subset(df, variable == "gmst") #3a 
  var_subset <- subset(df, variable != "gmst") #3b 
  
  ref_period <- subset(    #4
    gmst_subset,
    year >= ref_start &
      year <= ref_end
  )
  
  mean_ref_dat <- mean(ref_period$value, na.rm = TRUE) #5a
  
  gmst_subset$value <- gmst_subset$value - mean_ref_dat #5b
  
  df_normalized <- rbind(var_subset, gmst_subset) #6
  
  return(df_normalized)
})

mod_result_norm <- do.call(rbind, normalized_list) #7 

rownames(mod_result_norm) <- NULL

return(mod_result_norm)

}

This is a skill that requires practice. And even with practice it can take awhile to piece together what you want. I don’t expect this to all make sense just yet.

Here is what the above function does:

The use must supply a dataframe of the model results: data. They must also provide the year range to use as the baseline reference. The default in the function above is set to ref_start = 1850 and ref_end = 1900.
The function starts by splitting our results df by run_number. Behind the scenes, this creates a list of 50 dfs, one for each of model run and therefore one for each BETA value used to run the model.
a-b: We take each of these 50 dfs and, one at a time, we subset the data set further usning the subset() function; one data frame that includes only the GMST results, and one that contains the rest of the output variable. We do this because we only want the GMST data for this task but we will want to re-combine it with the rest of the data once it is normalized to out new reference years.
We isolate the reference years from the GMST data frames, again using the subset() function.
a-b: We calculate the mean GMST for the new reference period and use that to calculate new GMST values that = global mean surface temperature anomaly based on an 1850-1900 reference period. In other words, our new GMST values should now represent how much the temperature has change relative to the 1850-1900 average. That means the average GMST over 1850–1900 should be close to 0 for each run after we use this function.
Remember in #3 how we subsetted GMST so we could normalize the values? Well here, we are simply taking what we broke apart in #3a-b and putting it back together by binding the rows of the df back together.
At this point we still have 50 individual dataframes in a list, but we still want one big dataframe. In step 7 we bind the rows (using rbind()) of all the dfs in our list and return the combined object.

Currently, this is working for GMST. but the real utility of a function is when you can use the function to perform the task for any combination of variables we want. For now this GMST normalize function will work great.

Once this function is written we can use it to normalize the GMST values with a few lines of code. Lets give it a shot:

# normalize gmst values in mod_result
mod_result_norm <- normalize_gmst(mod_result, 
                                  ref_start = 1850, 
                                  ref_end = 1900)

You can see the advantages of functions in this example. Say we have 3 model results, one for each of the following climate scenarios: SSP1-2.6, SSP2-4.5, and SSP3-7.0. Before we wrote the normalize_gmst() function, it would have taken ~25 lines of code for each model result (~75 lines total) to get normalized GMST values for each climate scenario. But now, with our function, we would be able to complete this task with 3-9 lines of code.

Now that we have data that can be compared to our 1850-1900 reference period we can calculate summary metrics for GMST and our other output variables. It is even possible to edit this function to compute anomalies in all the output variables. But we are not yet interested in yet.

Computing Summary Metrics with Matilda

Once the model is run, Matilda has functions to help the use compute what we call metrics for each model run. For example, we are usually interested in determining the mean warming late in the century. This might be the average GMST from 2081-2100. This, of course, is an example and the following code could be applied to any of the model output components.

In Matilda we begin metric calculation by defining our metric of interest using the new_metric() function:

# define metric - e.g., late-century warming
lc_warming <- new_metric(var = "gmst", years = 2081:2100, op = mean)

in this code, var asks for which output variable you would like to use to define the metric, years indicates the year range you are interested in, providing a single year will simply look within that singe yer provided, and op which is short for operation and it ask for a mathematical function to define the metric. This could be mean, median, max, min, etc.

Once the metric is created, we can use it in the function metric_calc() to compute our metric. This will create a new data frame with the information summarized. In that data frame you will have access to the metrics you calculated for each run_number.

# computing metrics 
lc_warming_results <- metric_calc(x = mod_result_norm, 
                                  metric = lc_warming)

head(lc_warming_results)

##   run_number metric_result
## 1          1      2.334778
## 2          2      2.445911
## 3          3      2.460964
## 4          4      2.270894
## 5          5      2.513262
## 6          6      2.368425

This is valuable data because now we have a way of representing, in this example, how each BETA value affects late century warming. We can do a lot with this data:

We can compute what the median/mean late century warming is if we vary BETA. This can be useful fro comparisons of how BETA uncertainty affects warming projections across different climate scenarios (like comparing SSP2-4.5 to SSP3-7.0).
We can use it to compute the probability of late-century warming surpassing certain temperature thresholds such as 1.5, 2.0, or 3.0 C.
We can use this method to produce values of effect size. In statistics, effect size can tell us what the magnitude of change in the output is associated with changes in BETA.

These are the types of statistics we will use to report our experimental findings. In the next tutorial we will use these metrics and plots to develop results as evidence to answer the question:

How strongly does uncertainty in the CO2 fertilization factor propagate through Hector from terrestrial carbon-cycle responses to atmospheric CO2 and global temperature?

Normalizing Data and Calculating Summary Metrics

Joe Brown

2026-06-24

Introduction

Normalizing Hector-Matilda Data

Computing Summary Metrics with Matilda