Objective

WE FINALLY HAVE HEAT FLUX DATA (I am extremely excited by this because downloading data is no fun). But we need to make sure that the data is reasonable and confirm that we have enough heat flux data to continue with the hector calibration project(s).

Heat Flux Data

# Make a table of the cmip info wide by variable with indicator information about whether or not there is 
# data for that variable.
cmip_individual %>% 
    select(year, model, ensemble, variable, experiment) %>% 
    full_join(heat_flux %>%  select(year, model, ensemble, experiment, variable2=variable), 
              by = c("year", "model", "ensemble", "experiment")) %>% 
    gather(vari, name, variable, variable2) %>% 
    mutate(yes = TRUE) %>% 
    select(model, ensemble, experiment, name, yes) %>%  
    na.omit %>% 
    distinct %>% 
    spread(name, yes) -> 
    wide_variables

How many models do we have temp / heat flux data for?

length(unique(wide_variables$model))
[1] 35

So it looks like we have both tas and heat flux data for 35 models. However I am not sure if the historical data will constrain heat flux like we think is will. We are also concerned if the heat flux is too high during the historical period for Hector to be able to emulate.

What do the values look like?

The mean and sd of of the heat flux data for each scenario at 1850, 1900, 1950, and 2010, 2050, 2100.

heat_flux %>% 
    filter(year %in% c(1850, 1900, 1950, 2100, 2010, 2050, 2100)) %>% 
    group_by(experiment, year) %>% 
    summarise(mean = mean(value), 
              sd = sd(value))  %>% 
    na.omit %>% 
    arrange(experiment, year) %>%  
    kable(digits = 3)  %>% 
    kable_styling("striped", full_width = F) %>%
    collapse_rows(columns = 1)
experiment year mean sd
historical 1850 -1.875 23.387
1900 1.183 0.504
1950 1.304 0.608
rcp26 2010 1.945 0.761
2050 2.062 0.867
2100 1.417 0.714
rcp45 2010 1.975 1.307
2050 2.550 1.426
2100 2.334 1.383
rcp60 2010 2.159 0.581
2050 2.835 0.717
2100 3.204 0.734
rcp85 2010 1.631 1.397
2050 2.831 1.347
2100 3.997 1.678

Right off the bat I am surprised to see that the sd is so large in the historical period, I think that means that there is most likely some unrealistic values sneaking in there.

heat_flux %>% 
    ggplot(aes(year, value, color = model)) + 
    geom_point() + 
    facet_wrap('experiment', scales = 'free') + 
    theme(legend.position = 'none')

Right off the bat it looks like there are some questionable values for the historical period, rcp26, rcp45, and rcp85. I wonder if the points that are outlines in the historical and rcp26 are artifacts of numerical instability or if time series that underestimate the heat flux in rcp45 and rcp85 are because some variable is missing, but I thought that my prepossessing code would have taken care of that.

What models have outliers in the historical period?

Let’s define the outlines as heat flux values with a magnitude greater than 10.

heat_flux %>%
    filter(experiment == 'historical') %>%  
    filter(abs(value) > 10) %>%  
    pull(model) %>%
    unique() -> 
    hist_outliers

If we remove these models from do the historical runs look good?

heat_flux %>%  
    filter(!model %in% hist_outliers & experiment == 'historical') %>% 
    ggplot(aes(year, value, color = model, group = interaction(model, ensemble))) + 
    geom_line(alpha = 0.3) + 
    labs(y = 'w/m2')  -> 
    hist_plot
hist_plot

Yes I think so! I think that the models look pretty consistent in noise and trend.


What about the modes that have outlier values? Are they single years or for multiple years?

heat_flux %>% 
    filter(model %in% hist_outliers & experiment == 'historical') %>% 
    ggplot(aes(year, value, color = model, group = interaction(model, ensemble))) + 
    geom_point(alpha = 0.3) + 
    labs(y = 'w/m2') -> 
    hist_outlier_plot
hist_outlier_plot

How many years are outlines? for each ensemble / model?

hist_outlier_plot$data %>%
    filter(abs(value) >= 10) %>% 
    group_by(experiment, ensemble, model) %>% 
    summarise(n_outlier = n_distinct(year)) %>% 
    kable(digits = 3)  %>% 
    kable_styling("striped", full_width = F) %>%
    collapse_rows(columns = 1:2)
experiment ensemble model n_outlier
historical r2i1p1 CCSM4 3
MRI-CGCM3 3
r3i1p1 CCSM4 3
r5i1p1 CCSM4 6

If the outlines were removed what would the time series be more realistic?

hist_outlier_plot + 
    coord_cartesian(ylim = c(-10,10))

    
ggplot() + 
    geom_line(data = hist_plot$data, 
              aes(year, value, color = 'non outlier models', group = interaction(model, ensemble)), 
              alpha = 0.2) + 
    geom_point(data = hist_outlier_plot$data %>% 
                  filter(abs(value) < 10), 
              aes(year, value, color = model, group = interaction(model, ensemble))) + 
    labs(y = 'W/m2')

It looks like there are a few year for CCSM4 and MRI-CGCM3 that fall outside of the CMIP5 range but they don’t seem too drastic. Do we want to exclude the years with the heat Flux values greater than 10 and then keep the rest of the time series? Or do we exclude the heat flux data from our analysis? I thought that may be they had to do with the the models missing data at the end / start of runs or how they save data (some times modeling groups can write out multiple files for a single experiment every 5 years or so) but that was not the case few these models.


What models have outliers in the rcp26?

heat_flux %>%
    filter(experiment == 'rcp26' & abs(value) >= 10)  %>%  
    pull(model) %>%  
    unique() -> 
    outlier_rcp26
outlier_rcp26
[1] "CCSM4"     "MRI-CGCM3"

It looks like the same models have outlines in rcp26.


What models have outliers in rcp 85?

heat_flux %>%
    filter(experiment == 'rcp85' & value > 10) %>% 
    pull(model) %>% 
    unique() -> 
    outliers_rcp85; 
outliers_rcp85
[1] "CESM1-BGC"
heat_flux %>% 
    filter(experiment == 'rcp85' & model == outliers_rcp85) %>%  
    ggplot(aes(year, value, color  = model)) + 
    geom_line(data = heat_flux %>% 
                  filter(experiment == 'rcp85' & model != outliers_rcp85), 
              aes(year, value, color = 'other models', group = interaction(model, ensemble)), alpha = 0.3) + 
    geom_point() + 
    labs(y = 'w/m2') + 
    coord_cartesian(xlim = c(2006, 2100))

Well it loos like from 2990 to 2995 the CESM1-BGC has abnormally high heat flux values but reasonable values for the other years. Once again what do we do when there are only some of the years with suspicious data?

Below Expected Time Series

Both the rcp45 and rcp85 experiments have time series with values that are below.

heat_flux %>% 
    filter(experiment %in% c('rcp45', 'rcp85')) %>% 
    filter(value < -1) %>% 
    pull(model) %>%  
    unique()
[1] "inmcm4"

So it looks like the same model is the has unexpectedly low heat flux values. I wonder what could be causing it. I checked the errata (https://cmip.llnl.gov/cmip5/errata/cmip5errata.html) to see if inmcm4 had reported any problems with the data we used. While it did appear multiple times in the errata it wasn’t for any of the variables we used to calculate the heat flux.

Concluding Thoughts

Well its 5:00 on a Friday and I have brain mush so this look at our heat flux data has not been as thorough as I would like it to be. But it does raise some questions to consider.

  1. When there are cases where heat flux for a handful of year looks funky do we want to remove the outlier years or discard the entire time series?
  2. For inmcm4 what is going on? Is there a problem with one of the netcdf files we used because the trend and spread looks good to me, it is almost like the time series has been off set a bit.
---
title: "What's Up With the Heat Flux Data?"
output: html_notebook
---

## Objective 

WE FINALLY HAVE HEAT FLUX DATA (I am extremely excited by this because downloading data is no fun). But we need to make sure that the data is reasonable and confirm that we have enough heat flux data to continue with the hector calibration project(s).

```{r, echo = FALSE, message = FALSE, warning=FALSE}
library(hectorcal)
library(dplyr)
library(ggplot2)
library(tidyr)
library(knitr)
library(kableExtra)

BASE_DIR  <-  "/Users/dorh012/Documents/2019/hectorcal"
INPUT_DIR <- file.path(BASE_DIR, 'data-raw')

heat_flux <- read.csv(file.path(INPUT_DIR, 'CMIP5_heat_flux.csv'), stringsAsFactors = FALSE) %>%  
    filter(model %in% cmip_individual$model)
```

## Heat Flux Data 

```{r}
# Make a table of the cmip info wide by variable with indicator information about whether or not there is 
# data for that variable.
cmip_individual %>% 
    select(year, model, ensemble, variable, experiment) %>% 
    full_join(heat_flux %>%  select(year, model, ensemble, experiment, variable2=variable), 
              by = c("year", "model", "ensemble", "experiment")) %>% 
    gather(vari, name, variable, variable2) %>% 
    mutate(yes = TRUE) %>% 
    select(model, ensemble, experiment, name, yes) %>%  
    na.omit %>% 
    distinct %>% 
    spread(name, yes) -> 
    wide_variables
```

How many models do we have temp / heat flux data for? 

```{r}
length(unique(wide_variables$model))
```

So it looks like we have both tas and heat flux data for 35 models. However I am not sure if the historical data will constrain heat flux like we think is will. We are also concerned if the heat flux is too high during the historical period for Hector to be able to emulate. 



## What do the values look like? 

The mean and sd of of the heat flux data for each scenario at 1850, 1900, 1950, and 2010, 2050, 2100.

```{r}

heat_flux %>% 
    filter(year %in% c(1850, 1900, 1950, 2100, 2010, 2050, 2100)) %>% 
    group_by(experiment, year) %>% 
    summarise(mean = mean(value), 
              sd = sd(value))  %>% 
    na.omit %>% 
    arrange(experiment, year) %>%  
    kable(digits = 3)  %>% 
    kable_styling("striped", full_width = F) %>%
    collapse_rows(columns = 1)
```

Right off the bat I am surprised to see that the sd is so large in the historical period, I think that means that there is most likely some unrealistic values sneaking in there. 

```{r}
heat_flux %>% 
    ggplot(aes(year, value, color = model)) + 
    geom_point() + 
    facet_wrap('experiment', scales = 'free') + 
    theme(legend.position = 'none')
```

Right off the bat it looks like there are some questionable values for the historical period, rcp26, rcp45, and rcp85. I wonder if the points that are outlines in the historical and rcp26 are artifacts of numerical instability or if time series that underestimate the heat flux in rcp45 and rcp85 are because some variable is missing, but I thought that my prepossessing code would have taken care of that. 



### What models have outliers in the historical period? 

Let's define the outlines as heat flux values with a magnitude greater than 10. 

```{r}
heat_flux %>%
    filter(experiment == 'historical') %>%  
    filter(abs(value) > 10) %>%  
    pull(model) %>%
    unique() -> 
    hist_outliers
```

If we remove these models from do the historical runs look good? 

```{r}
heat_flux %>%  
    filter(!model %in% hist_outliers & experiment == 'historical') %>% 
    ggplot(aes(year, value, color = model, group = interaction(model, ensemble))) + 
    geom_line(alpha = 0.3) + 
    labs(y = 'w/m2')  -> 
    hist_plot
hist_plot
```

Yes I think so! I think that the models look pretty consistent in noise and trend. 

<br>

What about the modes that have outlier values? Are they single years or for multiple years? 

```{r}
heat_flux %>% 
    filter(model %in% hist_outliers & experiment == 'historical') %>% 
    ggplot(aes(year, value, color = model, group = interaction(model, ensemble))) + 
    geom_point(alpha = 0.3) + 
    labs(y = 'w/m2') -> 
    hist_outlier_plot
hist_outlier_plot
```

How many years are outlines? for each ensemble / model? 

```{r}
hist_outlier_plot$data %>%
    filter(abs(value) >= 10) %>% 
    group_by(experiment, ensemble, model) %>% 
    summarise(n_outlier = n_distinct(year)) %>% 
    kable(digits = 3)  %>% 
    kable_styling("striped", full_width = F) %>%
    collapse_rows(columns = 1:2)
```


If the outlines were removed what would the time series be more realistic? 


```{r}
hist_outlier_plot + 
    coord_cartesian(ylim = c(-10,10))
    
ggplot() + 
    geom_line(data = hist_plot$data, 
              aes(year, value, color = 'non outlier models', group = interaction(model, ensemble)), 
              alpha = 0.2) + 
    geom_point(data = hist_outlier_plot$data %>% 
                  filter(abs(value) < 10), 
              aes(year, value, color = model, group = interaction(model, ensemble))) + 
    labs(y = 'W/m2')
```

It looks like there are a few year for CCSM4 and MRI-CGCM3 that fall outside of the CMIP5 range but they don't seem too drastic. Do we want to exclude the years with the heat Flux values greater than 10 and then keep the rest of the time series? Or do we exclude the heat flux data from our analysis? I thought that may be they had to do with the the models missing data at the end / start of runs or how they save data (some times modeling groups can write out multiple files for a single experiment every 5 years or so) but that was not the case few these models. 

<br>

### What models have outliers in the rcp26? 


```{r}
heat_flux %>%
    filter(experiment == 'rcp26' & abs(value) >= 10)  %>%  
    pull(model) %>%  
    unique() -> 
    outlier_rcp26
outlier_rcp26
```

It looks like the same models have outlines in rcp26. 

<br>

### What models have outliers in rcp 85? 

```{r}
heat_flux %>%
    filter(experiment == 'rcp85' & value > 10) %>% 
    pull(model) %>% 
    unique() -> 
    outliers_rcp85; 
outliers_rcp85
```


```{r}

heat_flux %>% 
    filter(experiment == 'rcp85' & model == outliers_rcp85) %>%  
    ggplot(aes(year, value, color  = model)) + 
    geom_line(data = heat_flux %>% 
                  filter(experiment == 'rcp85' & model != outliers_rcp85), 
              aes(year, value, color = 'other models', group = interaction(model, ensemble)), alpha = 0.3) + 
    geom_point() + 
    labs(y = 'w/m2') + 
    coord_cartesian(xlim = c(2006, 2100))
```

Well it loos like from 2990 to 2995 the CESM1-BGC has abnormally high heat flux values but reasonable values for the other years. Once again what do we do when there are only some of the years with suspicious data?

### Below Expected Time Series 

Both the rcp45 and rcp85 experiments have time series with values that are below. 

```{r}
heat_flux %>% 
    filter(experiment %in% c('rcp45', 'rcp85')) %>% 
    filter(value < -1) %>% 
    pull(model) %>%  
    unique()
```

So it looks like the same model is the has unexpectedly low heat flux values. I wonder what could be causing it. I checked the errata (https://cmip.llnl.gov/cmip5/errata/cmip5errata.html) to see if inmcm4 had reported any problems with the data we used. While it did appear multiple times in the errata it wasn't for any of the variables we used to calculate the heat flux.

## Concluding Thoughts

Well its 5:00 on a Friday and I have brain mush so this look at our heat flux data has not been as thorough as I would like it to be. But it does raise some questions to consider. 

1. When there are cases where heat flux for a handful of year looks funky do we want to remove the outlier years or discard the entire time series? 
2. For inmcm4 what is going on? Is there a problem with one of the netcdf files we used because the trend and spread looks good to me, it is almost like the time series has been off set a bit. 


