Once we have model output variables, we want to make sure that the
data is comparable to other studies. One way that we can do this is by
normalizing our data to a reference point that other climate change
researchers use. One of these reference periods is
1850-1900 which is described as the pre-industrial
reference period. In this guide, we will go through the steps of
normalizing our output to this reference period. To do this we will
write our own function and use it on our model result dataframe
(df).
We will also use functions in Matilda to summarize our data to provide us with specific metrics that we can use to describe our results in a clear way for presentations, posters, or publications.
This guide picks up where the last Matilda guide finishes. It assumes you have a model result available to work from.
Here, I am loading an example result from Hector-Matilda, it is a copy I have saved on my computer:
mod_result <- read.csv("data/example_result.csv")
We will write our own function for this task.
First load your libraries. Asusming you are working in the same
session as installing and running Matilda, you probably have Matilda
installed already. But if not install matilda as well as
tidyverse.
# load libraries
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.6
## ✔ forcats 1.0.1 ✔ stringr 1.6.0
## ✔ ggplot2 4.0.1 ✔ tibble 3.3.1
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.2
## ✔ purrr 1.2.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(matilda)
## Loading required package: hector
When we write functions in R we are essentially writing the code that we want to perform on some part of the data (could be specific rows, columns, or the whole df) and wrap it into a function builder. This makes it a tool that can be more easily used in the future. In this type of work we write functions for things we might need to do a lot of different times or might be use ful future analyses.
In the following code chunk, I will write the full function with annotations and explain the construction after:
normalize_gmst <- function(data, #1
ref_start = 1850,
ref_end = 1900) {
split_by_run_number <- split(data, data$run_number) #2
normalized_list <- lapply(split_by_run_number, function(df) {
gmst_subset <- subset(df, variable == "gmst") #3a
var_subset <- subset(df, variable != "gmst") #3b
ref_period <- subset( #4
gmst_subset,
year >= ref_start &
year <= ref_end
)
mean_ref_dat <- mean(ref_period$value, na.rm = TRUE) #5a
gmst_subset$value <- gmst_subset$value - mean_ref_dat #5b
df_normalized <- rbind(var_subset, gmst_subset) #6
return(df_normalized)
})
mod_result_norm <- do.call(rbind, normalized_list) #7
rownames(mod_result_norm) <- NULL
return(mod_result_norm)
}
This is a skill that requires practice. And even with practice it can take awhile to piece together what you want. I don’t expect this to all make sense just yet.
Here is what the above function does:
The use must supply a dataframe of the model results:
data. They must also provide the year range to use as the
baseline reference. The default in the function above is set to
ref_start = 1850 and ref_end = 1900.
The function starts by splitting our results df by
run_number. Behind the scenes, this creates a list of 50
dfs, one for each of model run and therefore one for each
BETA value used to run the model.
a-b: We take each of these 50 dfs and, one at a time, we subset
the data set further usning the subset() function; one data
frame that includes only the GMST results, and one that
contains the rest of the output variable. We do this because we only
want the GMST data for this task but we will want to
re-combine it with the rest of the data once it is normalized to out new
reference years.
We isolate the reference years from the GMST data
frames, again using the subset() function.
a-b: We calculate the mean GMST for the new
reference period and use that to calculate new GMST values
that = global mean surface temperature anomaly based on an 1850-1900
reference period. In other words, our new GMST values
should now represent how much the temperature has change relative to the
1850-1900 average. That means the average GMST over
1850–1900 should be close to 0 for each run after we use this
function.
Remember in #3 how we subsetted GMST so we could
normalize the values? Well here, we are simply taking what we broke
apart in #3a-b and putting it back together by binding the rows of the
df back together.
At this point we still have 50 individual dataframes in a list,
but we still want one big dataframe. In step 7 we bind the rows (using
rbind()) of all the dfs in our list and return the combined
object.
Currently, this is working for GMST. but the real
utility of a function is when you can use the function to perform the
task for any combination of variables we want. For now this
GMST normalize function will work great.
Once this function is written we can use it to normalize the
GMST values with a few lines of code. Lets give it a
shot:
# normalize gmst values in mod_result
mod_result_norm <- normalize_gmst(mod_result,
ref_start = 1850,
ref_end = 1900)
You can see the advantages of functions in this example. Say we have
3 model results, one for each of the following climate scenarios:
SSP1-2.6, SSP2-4.5, and SSP3-7.0.
Before we wrote the normalize_gmst() function, it would
have taken ~25 lines of code for each model result (~75 lines total) to
get normalized GMST values for each climate scenario. But
now, with our function, we would be able to complete this task with 3-9
lines of code.
Now that we have data that can be compared to our 1850-1900 reference
period we can calculate summary metrics for GMST and our
other output variables. It is even possible to edit this function to
compute anomalies in all the output variables. But we are not yet
interested in yet.
Once the model is run, Matilda has functions to help the use compute
what we call metrics for each model run. For example, we
are usually interested in determining the mean warming late in the
century. This might be the average GMST from 2081-2100. This, of course,
is an example and the following code could be applied to any of the
model output components.
In Matilda we begin metric calculation by defining our metric of
interest using the new_metric() function:
# define metric - e.g., late-century warming
lc_warming <- new_metric(var = "gmst", years = 2081:2100, op = mean)
in this code, var asks for which output variable you
would like to use to define the metric, years indicates the
year range you are interested in, providing a single year will simply
look within that singe yer provided, and op which is short
for operation and it ask for a mathematical function to define
the metric. This could be mean, median, max, min, etc.
Once the metric is created, we can use it in the function
metric_calc() to compute our metric. This will create a new
data frame with the information summarized. In that data frame you will
have access to the metrics you calculated for each
run_number.
# computing metrics
lc_warming_results <- metric_calc(x = mod_result_norm,
metric = lc_warming)
head(lc_warming_results)
## run_number metric_result
## 1 1 2.334778
## 2 2 2.445911
## 3 3 2.460964
## 4 4 2.270894
## 5 5 2.513262
## 6 6 2.368425
This is valuable data because now we have a way of representing, in
this example, how each BETA value affects late century
warming. We can do a lot with this data:
We can compute what the median/mean late century warming is if we
vary BETA. This can be useful fro comparisons of how
BETA uncertainty affects warming projections across
different climate scenarios (like comparing SSP2-4.5 to
SSP3-7.0).
We can use it to compute the probability of late-century warming surpassing certain temperature thresholds such as 1.5, 2.0, or 3.0 C.
We can use this method to produce values of effect size. In
statistics, effect size can tell us what the magnitude of change in the
output is associated with changes in BETA.
These are the types of statistics we will use to report our experimental findings. In the next tutorial we will use these metrics and plots to develop results as evidence to answer the question:
How strongly does uncertainty in the CO2 fertilization factor propagate through Hector from terrestrial carbon-cycle responses to atmospheric CO2 and global temperature?