Reporting the performance of Zonation results based on the performance curves is typically done using the average (mean) over all features (possibly within a group). Sometimes a weighted average is used, with the same weights as in the actual prioritization. However, if the distribution of “proportion remaining” over all features is non-normal, typically skewed in one direction, the mean may not be the proper statistic. The mean is also sensitive to very large or very small values.
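To make the sensitivity of the mean concrete, here is a minimal Python sketch. The ten "proportion remaining" values, the feature weights, and the helper `weighted_mean` are all hypothetical and chosen only for illustration; they are not taken from any Zonation run.

```python
from statistics import mean, median

# Hypothetical "proportion remaining" values for ten features at a single
# point on the performance curve. Illustrative numbers, not Zonation output.
proportion_remaining = [0.02, 0.03, 0.03, 0.04, 0.05,
                        0.05, 0.06, 0.08, 0.60, 0.95]


def weighted_mean(values, weights):
    """Return the weighted average of values.

    The weights stand in for the feature weights of the prioritization.
    """
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    total_weight = sum(weights)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    return sum(v * w for v, w in zip(values, weights)) / total_weight


# Hypothetical weights: the two well-covered features count double.
feature_weights = [1, 1, 1, 1, 1, 1, 1, 1, 2, 2]

mean_value = mean(proportion_remaining)
median_value = median(proportion_remaining)
weighted_value = weighted_mean(proportion_remaining, feature_weights)

print(f"mean:          {mean_value:.3f}")
print(f"median:        {median_value:.3f}")
print(f"weighted mean: {weighted_value:.3f}")
```

In this made-up sample, two well-covered features pull the mean well above the median, and weighting them more heavily pulls the weighted mean up further.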
As an example, below are two versions of the same performance curves from an analysis including 6364 species in Japan. The analysis includes the following Zonation analysis options:
Version 2. Mean (solid black line), ± 1 standard deviation (grey shaded area, clipped to the range [0, 1]), minimum value (dashed black line), and maximum value (dot-dashed black line).
Here, the average value shows the mean over all features (species), and the standard deviation is supposed to give an idea of the spread of the values around the mean. Using the mean and SD is typically appropriate if the variable’s distribution is close to normal. Note that the minimum is 0 and the maximum is 1.0 for some species. Because of the condition layer used, not all values reach 1.0.
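The shaded band described above can be sketched as follows. This is an illustration only: the function name `mean_sd_band` and the sample values are hypothetical, and the use of the sample standard deviation (rather than the population one) is an assumption, since the text does not say which was used.

```python
from statistics import mean, stdev


def mean_sd_band(values):
    """Return (lower, mean, upper) for a mean +/- 1 SD band.

    The band is clipped to [0, 1] because a proportion cannot leave
    that range. Sample standard deviation is an assumption here.
    """
    m = mean(values)
    s = stdev(values)
    lower = max(0.0, m - s)
    upper = min(1.0, m + s)
    return lower, m, upper


# Hypothetical "proportion remaining" values at one point on the curve.
proportion_remaining = [0.02, 0.03, 0.03, 0.04, 0.05,
                        0.05, 0.06, 0.08, 0.60, 0.95]

lower, centre, upper = mean_sd_band(proportion_remaining)
unclipped_lower = centre - stdev(proportion_remaining)

print(f"band: [{lower:.3f}, {upper:.3f}] around mean {centre:.3f}")
print(f"lower bound before clipping: {unclipped_lower:.3f}")
```

For this skewed sample the unclipped lower bound is negative, which is impossible for a proportion. When the band has to be clipped like this, the mean and SD are describing the data poorly.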
Version 1. Median (solid black line), upper and lower quartiles (grey shaded area), minimum value (dashed black line), and maximum value (dot-dashed black line).
Here, the median over all features is more conservative than the average. The shaded area (between the lower and upper quartiles) already hints that the underlying distribution of the “proportion remaining” is skewed. Hence, for most species the increases suggested by the mean are overly optimistic.
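One way to put a number on the skew hinted at by the shaded area is the quartile (Bowley) skewness, which compares how far each quartile lies from the median. This statistic is not used in the analysis above; it is brought in here only as an illustration, and the sample values and function names are hypothetical.

```python
from statistics import quantiles


def quartile_summary(values):
    """Return (lower quartile, median, upper quartile)."""
    q1, q2, q3 = quantiles(values, n=4, method="inclusive")
    return q1, q2, q3


def quartile_skewness(values):
    """Bowley skewness: (Q3 + Q1 - 2 * Q2) / (Q3 - Q1).

    Zero for a symmetric spread around the median, positive when the
    upper quartile lies further from the median than the lower one.
    """
    q1, q2, q3 = quartile_summary(values)
    if q3 == q1:
        return 0.0  # no spread between quartiles, skewness undefined
    return (q3 + q1 - 2 * q2) / (q3 - q1)


# Hypothetical "proportion remaining" values at one point on the curve.
proportion_remaining = [0.02, 0.03, 0.03, 0.04, 0.05,
                        0.05, 0.06, 0.08, 0.60, 0.95]

q1, q2, q3 = quartile_summary(proportion_remaining)
skew = quartile_skewness(proportion_remaining)

print(f"Q1={q1:.4f}  median={q2:.4f}  Q3={q3:.4f}")
print(f"quartile skewness: {skew:.3f}")
```

Because it is built from quartiles only, this measure ignores the extreme values that dominate the mean, so it describes the asymmetry of the shaded area itself.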
Looking more closely at the values at certain points on the x-axis (2%, 9% and 17% of the landscape protected) confirms that the distribution is non-normal. Here are the values for an analysis variant not accounting for the current protected area network (PAN) in Japan:
If the PAN could be selected from scratch, selecting 17% of the landscape (the Aichi target) would cover 27% of all species occurrence levels according to the mean, but 19% according to the median.
Here are the same values for a variant accounting for the PAN:
If the PAN is expanded from the current PAN, selecting 17% of the landscape (the Aichi target) would cover ~26% of all species occurrence levels according to the mean, but ~17% according to the median.
The distribution is most skewed for the smallest fraction protected: 2% (corresponding to the high-rank PAs). In this case, the mean species occurrence level covered is 11.85% (without PAN) and 3.86% (with PAN). However, in both cases the mean is close to the upper quartile, the value below which 75% of all values fall (11.25% and 3.3%, respectively). Is this a fair use of the word “average”, which in this context is supposed to convey the idea of a “typical gain”?
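A simple check of how "typical" the mean is: compute the share of features whose value is at or below it. The sketch below uses hypothetical values and a hypothetical helper `share_at_or_below`; it does not reproduce the Japan results.

```python
from statistics import mean


def share_at_or_below(values, threshold):
    """Return the fraction of values that are <= threshold."""
    if not values:
        raise ValueError("values must not be empty")
    return sum(1 for v in values if v <= threshold) / len(values)


# Hypothetical "proportion remaining" values at one point on the curve.
proportion_remaining = [0.02, 0.03, 0.03, 0.04, 0.05,
                        0.05, 0.06, 0.08, 0.60, 0.95]

mean_value = mean(proportion_remaining)
share = share_at_or_below(proportion_remaining, mean_value)

print(f"mean: {mean_value:.3f}")
print(f"share of features at or below the mean: {share:.0%}")
```

For a symmetric distribution this share is close to 50%. A share near 75% or higher means that most features gain less than the mean suggests.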
Furthermore, the shape of the distribution of fractions depends on the fraction of the landscape selected:
Which one to use?