Purpose

In my last set of calibration results I looked at the effect of adding the volscl parameter. But we noticed that some of the calibration results were a little iffy (we suspect that optim went to crazy town or found a local minimum). Previously I used the default Hector values as my initial parameter guess; this time I used what I am calling the “best guess” for the parameters. I compared the ESM comparison data we are calibrating to with the large ensemble of Hector results we generated for the PC analysis, and the parameter combination whose results most closely resemble the ESM comparison data is now used as the initial parameter guess for optim (except for inmcm4, where I manually set the volscl guess to 0 since that model does not do volcanoes). I also increased the max number of iterations that optim can do.
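Roughly, the setup looks like the sketch below. Everything here is illustrative: ensemble_params, ensemble_msr, model_name, and hector_msr_fn are placeholder names, not the actual project code.

```r
# ensemble_params: parameter combinations used in the large Hector ensemble
#                  from the PC analysis (columns S, diff, alpha, volscl).
# ensemble_msr:    how far each ensemble run is from the ESM comparison data.
best_row   <- which.min(ensemble_msr)
best_guess <- unlist(ensemble_params[best_row, c("S", "diff", "alpha", "volscl")])

# inmcm4 does not include volcanic forcing, so its volscl guess is fixed at 0.
if (model_name == "inmcm4") best_guess[["volscl"]] <- 0

# hector_msr_fn(par) is assumed to run Hector with the candidate parameters and
# return the MSR against the ESM comparison data.
fit <- optim(par     = best_guess,
             fn      = hector_msr_fn,
             control = list(maxit = 800))  # was 500 in the previous calibration
```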

Then I looked at the following:

  1. Did we see a change in the number of calibration runs that converged?
  2. How did the MSR and parameter results change?
  3. Are we happier with the Hector emulation?

1. Convergence

How many runs converged once we increased the max number of iterations and used a better initial parameter guess? (Before, I used the default Hector parameters as the initial guess; now I use the parameter combination from the large Hector ensemble we generated for the PC analysis whose results most closely resemble the comparison data.)


Convergence from last time


Convergence of the best-guess calibration

With the best guesses and the higher max iterations we see an increase in the number of runs that converged. Now only 4 runs do not converge, whereas before 14 runs did not converge.


Can we determine why they are now passing?

The best_guess and old_rslts columns contain the convergence codes. If the value is 0 the run converged; if the value is 1 the run hit the maximum number of iterations (maxit was too low). The best_guess_fn_count column contains the function evaluation counts from the best-guess calibration. As part of the best-guess calibration I increased maxit to 800, whereas before it was set at 500.
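As a reminder of where those columns come from (a quick sketch; fit stands in for the list returned by a single optim call):

```r
fit$convergence          # 0 = converged, 1 = stopped because maxit was reached
fit$counts[["function"]] # number of function evaluations -> best_guess_fn_count
```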

This table only contains info for the models that had a change in the convergence code between the two different calibration experiments.

model best_guess old_rslts best_guess_fn_count
ACCESS1-3 0 1 615
CESM1-BGC 0 1 637
CESM1-FASTCHEM 0 1 691
CESM1-WACCM 0 1 603
CMCC-CESM 0 1 531
CNRM-CM5-2 1 0 801
EC-EARTH 0 1 781
FGOALS-g2 0 1 539
GISS-E2-R-CC 0 1 695
IPSL-CM5B-LR 0 1 663
MPI-ESM-MR 0 1 595
MRI-ESM1 0 1 433

Takeaways

  1. MRI-ESM1 was the only model with a best_guess_fn_count below 500, which to me means that providing different initial parameter values made a real difference for that run. For the other runs that now converge but needed more than 500 function evaluations, I know that increasing maxit made a difference, and I would like to believe that changing the initial parameters helped too, but I will have to look at some other results before I draw any conclusions.
  2. CNRM-CM5-2 used to converge but now does not. That is not really what I was expecting…


What about the runs that still do not converge?

model convergence fn_count S alpha volscl diff
CNRM-CM5-2 1 801 0.240 -0.031 3.568 0.000
GISS-E2-H-CC 1 801 21.175 -0.831 4.670 202.113
MPI-ESM-P 1 801 1446.024 0.206 4.821 2547.386
MRI-CGCM3 1 801 275237.716 1.638 2.005 49.588

Hmmmm it looks like all of these calibration attempts hit the maxit on their way to crazy town….


2. How did the MSR and parameter results change?

Did using a more informed initial parameter combination affect the results?

Results

MSR

How did the MSR change for the calibration fits?

If the best-guess method worked better, we would expect to see a decrease in MSR.
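The summary below comes from comparing the two MSR vectors, roughly like this (a sketch; msr_best_guess and msr_old are placeholder vectors holding each model's MSR under the two calibrations):

```r
# Negative values mean the best-guess calibration achieved a smaller (better)
# MSR than the old calibration for that model.
delta_msr <- msr_best_guess - msr_old
summary(delta_msr)
```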

Summary of the change in MSR

##       Min.    1st Qu.     Median       Mean    3rd Qu.       Max. 
## -2.784e-05 -1.145e-09 -2.000e-10 -8.592e-07  2.696e-09  1.691e-05

Green indicates that the best-guess method returns a better value (a smaller MSR), whereas red indicates that the best-guess method returns a worse value.

Bar Plot by Model

For a lot of the models there is little to no change in the MSR. However, for NorESM1-ME and inmcm4 we see a pretty large decrease in MSR, and for CSIRO-Mk3-6-0 the best-guess method does remarkably worse :(


Scatter Plot

I’ve included a 1:1 line to highlight where there is no change.
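Something along these lines (a sketch; msr_df is a placeholder data frame with one row per model and the MSR from each calibration):

```r
library(ggplot2)

ggplot(msr_df, aes(x = old_msr, y = best_guess_msr)) +
  geom_point() +
  # dashed 1:1 line: points on it had no change in MSR between the calibrations
  geom_abline(slope = 1, intercept = 0, linetype = "dashed") +
  labs(x = "MSR (old calibration)", y = "MSR (best-guess calibration)")
```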

Most of the calibration results are pretty close to the 1:1 line, so where the fit performance did change, the change is pretty small. Because the changes in MSR are so small, it is hard to tell whether the parameters actually changed; I can't really get a sense of whether the change in method changed anything.

Parameter Values

Did changing the initial parameter guess impact the parameters returned by optim?

Summary info about the absolute change in the parameters

param min mean max sd
alpha 0 0.00 0.02 0.01
diff 0 0.78 15.62 3.49
S 0 1.18 23.53 5.26
volscl 0 0.00 0.05 0.01
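The table above can be reproduced along these lines (a sketch; param_df is a placeholder long-format data frame with columns model, param, old_value, and best_guess_value):

```r
library(dplyr)

param_df %>%
  mutate(abs_change = abs(best_guess_value - old_value)) %>%
  group_by(param) %>%
  summarise(min  = min(abs_change),
            mean = mean(abs_change),
            max  = max(abs_change),
            sd   = sd(abs_change))
```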

It looks like for at least some of the models there was no change in the parameter values, which is not surprising given that some runs showed little to no change in MSR. However, for at least some runs there was a change in the diff and S values, which (fingers crossed) was in the right direction and not towards crazy town. Once again the points shifted a little bit relative to the 1:1 line between the calibrations; I've tried to label the more interesting points.

Change in S

Well it looks like our friend inmcm4 went further into crazy town for S :( Let's also take a look at the plot when we exclude inmcm4.

More of the models are clustered towards the lower end of the S range; only one is above 7, which is not surprising and reflects what we have assumed to be true about the S prior.


Change in diff

Once again our dear buddy inmcm4 is wonky, but this time the new calibration improved it slightly, I guess.


What happens when we exclude inmcm4?

What is the min value for the diff?

It looks like we still get some high diff values; CSIRO is above 20! And NorESM1-ME has a diff value that is essentially 0, which would mean that the ocean is not absorbing any heat. I think that is a yellow flag if not a red flag.

Change in alpha

Well it looks like inmcm4 finally has a reasonable parameter value, but now GISS-E2-H has a negative aerosol scalar, which counts as a red flag!

Change in volscl

CMCC-CMS and CMCC-CM have negative volscl parameter values, which is unlikely. Also inmcm4 has a high volscl even though we would expect it to have a value closer to 0.

Takeaways

  • It does not look like the best guess calibration method impacted the calibration fit results much, which is not surprising considering that there was little to no change in the MSR values either.
  • Several models have some suspicious parameter values…


3. Are we happier with the Hector emulation?

Here I compare the results from the new calibration, where we use the best guess, with the results from the old calibration. So far I have only plotted the values for the models we highlighted as being wonky yesterday, and I have included the quotes from the Slack channel about them.

Let’s look at the calibration results for the run that had the largest change in the MSR, which was NorESM1-ME.

Hmmm, they are not that different from one another, and it looks like both calibrations have diff values that are essentially 0.


Let’s look at inmcm4 because we know it is a troubled one.

There really has been no change in the MSR or the output, despite the two calibrations having very different parameter values, particularly S.


Models we talked about in the Slack channel yesterday

CSIRO-Mk3L-1-2 : adding the volscl parameter looks like it helped things, but it still looks way off. Also the diffusivity is over 20, which seems a little questionable.

It looks like nothing really changed here.


CMCC-CM : another volscl < 0, diffusivity around 0.8, plus the fit just doesn’t look that good.

No change :(


CESM1-CAM5 : Parameters look ok, but the fit just doesn’t look that good. Also, this is an important one to get right.

No change again. This is actually the plot that made me wonder if we need to use weights for the ensemble members. I followed up and tried to calibrate again, weighting the average by the number of ensemble members for each experiment, but it did not change the answer :|
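For reference, the weighting idea looks roughly like this (a sketch of the idea only; resid_df is a placeholder data frame of Hector-minus-ESM residuals with an n_ensemble column giving the number of ensemble members behind each experiment's mean):

```r
# Weight each squared residual by the number of ensemble members behind the
# experiment it came from, instead of treating every experiment equally.
weighted_msr <- with(resid_df, weighted.mean(residual^2, w = n_ensemble))
```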


CCSM4 : diffusivity is 0.1, and the fit doesn’t look very good.

I also think that something is wrong with the diagnostic plot because it looks like we are missing several new calibration runs.


ACCESS1-0 : Fit doesn’t look right. Surely we can get closer.

The calibration got pretty much the same answers.