---
title: "Intro to PIC & netcdf processing"
output: html_notebook
---

### Objectives

This is by no means an exhaustive guide to using PIC and CDO, but it may offer a starting point for new users.

### PIC

PIC is PNNL's Linux machine. It has 520 Intel Haswell-based nodes in quad form. Each node features a dual-socket Intel Haswell E5-2670v3 CPU (12 cores per socket, running at 2.3 GHz) with 64 GB of 2133 MHz ECC memory, an FDR InfiniBand network card, and 480 GB of local solid-state drive storage. For PIC help, check out this [confluence page](https://confluence.pnnl.gov/confluence/display/RC/Research+Computing+Knowledge). Tim Carlson at PIC support is a great resource when people at JGCRI are unable to answer your questions. When I have questions I usually talk to Robert, Caleb, or Pralit. For general command-line help, search online for Linux or Unix commands. PIC is a useful resource, but it is not a silver bullet for computing problems; some things are better run on your local machine, even if it is really slow.

I use PIC to

1. Run GCAM (single GCAM runs and parallel GCAM runs).
2. Run big R scripts (runs that would take several hours on my local machine).
3. Process netCDF files that are too large to download to my local machine or that I want to process with CDO.
4. Make maps.

<br>

What I do not use PIC for

1. Interactive analyses.
2. Intermediate plots. The only time I make plots on PIC is when I am confident they are not going to change, because you have to write them to a directory on PIC and transfer them to your local machine in order to look at them.


When you first log on to PIC you will land in the `/people/usrnameXXX` directory. I use this location for scratch work and for saving intermediate files, because storage there is limited and temporary; I think things get deleted every so often. For the files and projects you want to keep around, make a directory in `/pic/projects/GCAM/`. Note that if you want to make that directory public and you have a local Windows machine, do not use WinSCP to make directories or move files. It will cause problems that can only be resolved by contacting PIC support.

<br>

If you are running something big on PIC, like a script that processes a large set of netCDF files, you will have to submit an sbatch file ([see here](https://confluence.pnnl.gov/confluence/display/RC/Creating+a+Job+Script)), which submits the job to the PIC queue. Here is an example of an sbatch script called `run.zsh`.

```
#!/bin/zsh
#SBATCH -t 15:0:0
#SBATCH -n 1
#SBATCH -c 1
#SBATCH -N 1
#SBATCH -J ESM_processing
#SBATCH -A IHESD 

## This configuration gives us:
## a time limit of 15 hours
## 1 task
## 1 cpu per task
## on a single node
## names the job 'ESM_processing'
## charges the hours to the IHESD account

#First make sure the module commands are available.
source /etc/profile.d/modules.sh 

#Set up your environment you wish to run in with module commands.
module load R/3.4.3

#Next unlimit system resources, and set any other environment variables you need.
unlimit

Rscript ./B2.get-annual-ocean-flux.R
```

To run sbatch scripts you are going to need hours on a PIC account. Talk to PIC support and the people you are working with on the project to figure out hours. You can check your account allocation with:

```
[dorh012@constance03 ESM_processing_code]$ module load sbank
[dorh012@constance03 ESM_processing_code]$ sbank balance statement
```

You can also run small amounts of code on the login node. Do not abuse this, because you risk getting in trouble with PIC if you try to do too much on the login node.

Things that are okay to do on the login node

* Move files / make directories
* Test out R scripts (when I do this I process dummy data or a single file)
* Look at some netCDF files


Things that are not okay to do on the login node

* Process lots of files
* Run GCAM or Hector a bunch of times
* Open really large data files. I once tried to unzip and open a large file from a collaborator on the PIC login node; it flooded the node and started to use 90% of PIC's computing power. Needless to say, the PIC team was not happy with me.


#### How to load R on PIC

When developing R code to use on PIC I usually work in RStudio on my local machine, then copy and paste the code into an interactive R session on PIC.

To see what is available on PIC use `module avail`

```
module avail R
```

Launch R version 3.4.3 on pic. 

```
module load R/3.4.3 
R
``` 


### CDO 

CDO is a collection of command line data operators for processing netCDF files (see the [website](https://code.mpimet.mpg.de/projects/cdo/embedded/index.html) for information). CDO is great for processing CMIP output, where the data structures are inconsistent (different relative start dates, different spatial projections, etc.), because it performs operations based on the information stored in each file's metadata.

The CDO on pic lives at 
```
/share/apps/netcdf/4.3.2/gcc/4.4.7/bin/cdo
```

This is how you run CDO from the command line on PIC (operators can be chained together).

```
cdo operator infile outfile
```

For example, here is how to calculate the global mean of a single file using CDO. The results will be stored in `test.nc`.

```
[dorh012@constance03 ~]$ /share/apps/netcdf/4.3.2/gcc/4.4.7/bin/cdo fldmean /pic/projects/GCAM/CMIP5-CHartin/CMIP5_RCP45/tas/tas_Amon_CESM1-BGC_rcp45_r1i1p1_200601-210012.nc ./test.nc
Warning (cdfScanVarAttributes) : NetCDF: Variable not found - areacella
cdo fldmean: Processed 63037440 values from 1 variable over 1140 timesteps ( 1.02s )
```

So with one line of code I was able to calculate the weighted global mean temperature for a netCDF file. However, if you want to process more than about 10 files, the command line option gets clunky and CDO will not generate new file names for you, so pairing CDO and R together is very powerful.


#### CDO + R 

Examples of CDO + R processing code can be found on GitHub at https://github.com/kdorheim/CDOexamples. You are going to want to use the function `system2` to execute the CDO command; the call has the following format.

```
system2('path/to/cdo/exe', args = c('operator', 'path/to/infile.nc', './test.nc'), stdout = TRUE, stderr = TRUE)
```
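
Chained operators fit the same format. On the command line CDO accepts a chain such as `cdo -yearmean -fldmean infile outfile`, which applies the right-most operator to the input first. Here is a minimal sketch of building that argument vector in R; `build_cdo_args` is a hypothetical helper name of my own, not part of CDO or the examples repository, and `in.nc` / `out.nc` are placeholder file names.

```{r, eval=FALSE}
# Hypothetical helper: build the `args` vector for system2() from a chain of
# CDO operators. Operators are given in command-line order, so the LAST one
# in the vector is applied to the input file first.
build_cdo_args <- function(operators, infile, outfile) {
  stopifnot(is.character(operators), length(operators) >= 1)
  c(paste0("-", operators), infile, outfile)
}

# Annual means of the global-mean series: fldmean runs first, then yearmean.
args <- build_cdo_args(c("yearmean", "fldmean"), "in.nc", "out.nc")
print(args)
```

The resulting vector can be passed straight to the `args` argument of `system2`.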

Here is what you would run in an interactive R session to repeat our earlier example.

```{r, eval=FALSE}
CDO_EXE   <-   '/share/apps/netcdf/4.3.2/gcc/4.4.7/bin/cdo'
input_nc  <- "/pic/projects/GCAM/CMIP5-CHartin/CMIP5_RCP45/tas/tas_Amon_CESM1-BGC_rcp45_r1i1p1_200601-210012.nc"
output_nc <- "./test.nc" 
system2(CDO_EXE, args = c("fldmean", input_nc, output_nc), stdout = TRUE, stderr = TRUE)
```
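
Because `stdout = TRUE` makes `system2` return the captured messages as a character vector, and R attaches a `"status"` attribute to that vector when the command exits with a non-zero code, the return value can be used to catch failed CDO calls. Below is a minimal sketch of that check. The helper name `check_cdo_output` is hypothetical, and the two results are simulated so the code runs without CDO installed.

```{r, eval=FALSE}
# Hypothetical helper: stop with the captured messages if the call failed.
# system2(..., stdout = TRUE) returns a character vector and adds a "status"
# attribute only when the exit code is non-zero.
check_cdo_output <- function(out) {
  status <- attr(out, "status")
  if (!is.null(status) && status != 0) {
    stop("cdo failed with exit status ", status, ":\n",
         paste(out, collapse = "\n"), call. = FALSE)
  }
  invisible(out)
}

# Simulated return values (assumptions, not real cdo output).
ok_out  <- "simulated cdo message from a successful run"
bad_out <- structure("simulated cdo error message", status = 1L)

check_cdo_output(ok_out)                       # passes silently
msg <- tryCatch(check_cdo_output(bad_out), error = conditionMessage)
print(msg)
```

In practice you would wrap the `system2` call itself, for example `check_cdo_output(system2(CDO_EXE, args = ..., stdout = TRUE, stderr = TRUE))`.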


Then, to extract the results from the new netCDF file, you are going to use the package `ncdf4`. I recommend looking into that package's documentation because it is a pretty valuable tool.

How to extract the results from a netcdf in R.

```{r, eval=FALSE}
# For the input netcdf file.
library(ncdf4)
nc_in <- nc_open(input_nc)
tas   <- ncvar_get(nc_in, 'tas')
time  <- ncvar_get(nc_in, 'time')  # read from nc_in; `nc` is not defined yet
```


```{r, eval = FALSE}
# For the output netcdf. 
nc   <- nc_open(output_nc)
tas  <- ncvar_get(nc, 'tas')
time <- ncvar_get(nc, 'time')
```

How do these two netCDF files differ? (Hint: compare the dimensions of `tas`.)

<br>

**How I like to structure my CDO + R code**

I typically use two scripts. I want to stress that this is just my preference; there are lots of different ways of doing it.

My first script locates and sorts the netCDF files I want to process, which is necessary because of how the CMIP5 data was stored on PIC. It stores information about the CMIP files to process in a CSV file.

In my second script I import the CSV file listing the netCDF files to process, define my function that contains the `system2` CDO call, and then use an `apply` family command to execute my processing function on the list of netCDF files. I save the final results in a CSV file that is easy to download onto my local machine.
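
The two-script layout can be sketched in base R as follows. This is an illustration under assumptions, not the code from the examples repository: the helper names `parse_cmip5_name` and `make_cdo_args`, the column names, the output naming scheme, and the second file path are all hypothetical. It relies on the CMIP5 file naming convention `variable_table_model_experiment_ensemble_time.nc`, and it only builds the CDO arguments rather than calling CDO, so it runs anywhere.

```{r, eval=FALSE}
# --- Script 1 (sketch): build an inventory of the files to process ---
parse_cmip5_name <- function(path) {
  parts <- strsplit(sub("\\.nc$", "", basename(path)), "_")[[1]]
  stopifnot(length(parts) == 6)
  data.frame(path = path, variable = parts[1], mip_table = parts[2],
             model = parts[3], experiment = parts[4], ensemble = parts[5],
             time = parts[6], stringsAsFactors = FALSE)
}

# On PIC this vector would come from list.files(dir, pattern = "\\.nc$", ...).
files <- c(
  "/pic/projects/GCAM/CMIP5-CHartin/CMIP5_RCP45/tas/tas_Amon_CESM1-BGC_rcp45_r1i1p1_200601-210012.nc",
  "hypothetical/dir/tas_Amon_MODEL-X_rcp45_r1i1p1_200601-210012.nc"  # made up
)
inventory     <- do.call(rbind, lapply(files, parse_cmip5_name))
inventory_csv <- file.path(tempdir(), "to_process.csv")
write.csv(inventory, inventory_csv, row.names = FALSE)

# --- Script 2 (sketch): read the inventory, build one cdo call per file ---
to_process <- read.csv(inventory_csv, stringsAsFactors = FALSE)

make_cdo_args <- function(input_nc, operator = "fldmean", out_dir = ".") {
  # Derive a new output name from the operator and the input file name.
  output_nc <- file.path(out_dir, paste0(operator, "_", basename(input_nc)))
  c(operator, input_nc, output_nc)
}

cdo_calls <- lapply(to_process$path, make_cdo_args)
# Each element would then be passed on as system2(CDO_EXE, args = ..., ...).
print(cdo_calls[[1]])
```

Keeping the inventory in a CSV file means the slow file search only has to run once, and the list of files can be inspected or filtered by hand before any processing starts.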

**CDO & R tips**

* Write tests into your function and try to give informative error messages; otherwise you can spend a lot of time trying to debug your code.
* CDO workflows generate lots of intermediate netCDF files. At some point you should clean these up, but during development it can be useful to keep the intermediate files and check them.
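
Both tips can be sketched in a few lines of base R. The helper names `check_cdo_inputs` and `clean_intermediates` are hypothetical, and the demonstration uses empty stand-in files in a temporary directory rather than a real CDO executable or real netCDF files.

```{r, eval=FALSE}
# Hypothetical pre-flight check: fail early with a message that says what is
# wrong and where, before cdo is ever called.
check_cdo_inputs <- function(cdo_exe, input_nc) {
  if (!file.exists(cdo_exe)) {
    stop("CDO executable not found at: ", cdo_exe, call. = FALSE)
  }
  missing_nc <- input_nc[!file.exists(input_nc)]
  if (length(missing_nc) > 0) {
    stop("Missing input netCDF file(s):\n  ",
         paste(missing_nc, collapse = "\n  "), call. = FALSE)
  }
  invisible(TRUE)
}

# Hypothetical cleanup: keep intermediate files while developing
# (keep = TRUE), remove them once the results are trusted.
clean_intermediates <- function(dir, keep = FALSE) {
  files <- list.files(dir, pattern = "\\.nc$", full.names = TRUE)
  if (!keep) file.remove(files)
  invisible(files)
}

# Demonstration with empty stand-in files.
tmp <- file.path(tempdir(), "cdo_intermediate")
dir.create(tmp, showWarnings = FALSE)
fake_exe <- file.path(tmp, "cdo")
fake_nc  <- file.path(tmp, "step1.nc")
file.create(fake_exe, fake_nc)

check_cdo_inputs(fake_exe, fake_nc)            # passes silently
msg <- tryCatch(check_cdo_inputs(fake_exe, file.path(tmp, "nope.nc")),
                error = conditionMessage)
print(msg)
clean_intermediates(tmp)                       # removes step1.nc only
```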


## Practice Exercise 

What is the difference between the global average temperature for a single year calculated by cdo `fldmean` vs `mean` in R? Is this a surprise? Why or why not?

Hint - if you used `ncvar_get()` to import the gridded monthly data into R, the data is going to be an array with three dimensions, `[lon, lat, time]`.
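
To get started with the indexing, here is a small sketch on a synthetic array with the same `[lon, lat, time]` layout. The array size and values are made up; with real data the array would come from `ncvar_get()`. It only shows how to pull out one year and take a plain `mean`, leaving the comparison with `fldmean` to you.

```{r, eval=FALSE}
# Synthetic stand-in for gridded monthly data: 4 lon x 3 lat x 24 months.
tas <- array(seq_len(4 * 3 * 24), dim = c(4, 3, 24))

# Keep the first 12 time steps (one year of monthly data).
first_year <- tas[, , 1:12]

# Plain (unweighted) mean over lon and lat for each month, then over the year.
monthly_means <- apply(first_year, 3, mean)
annual_mean   <- mean(monthly_means)

print(dim(first_year))
print(annual_mean)
```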


