1introduction

NetCDF is widely used fromat for exchangeing or distributing climate data, has been also used in other fields. they contain metadata,containing latitude,longtitude,names and units of variables ,attributes ,missing values, offset… R could be used to analyze netCDF files,using ncdf4 packages provided by David Pierce, and through other packages like raster,Rnetcdf…

1.1Reading, restructuring, writing netCDF files in R

netcdf and “raster” have common “design pattern” 1.data input(using ,ncdf4,rhdf5,raster) 2.recasting/reshaping raster brick input data into a rectangular data frame 3.analysis and visualization 4.recasting/reshaping a result data frame back to a raster brick,and 5.data output,using the same packages as in step 1.

examples here include : reading netcdf file using ncdf4 package reshaping brick into data frame reshaping data frame into brick writing netcdf file using the ncdf4 package

2Reading a netCDF data set using the ncdf4

2.1open netcdf file

# load ncdf4 package

library(ncdf4)
## Warning: package 'ncdf4' was built under R version 4.0.3
# set path and filename

ncpath <- "E:/test/Netcdf/data subset/"
ncname <- "cru10min30_tmp"
ncfname <- paste(ncpath,ncname,".nc",sep="")
dname <- "tmp"

# open a netcdf file
ncin <- nc_open(ncfname)
print(ncin)
## File E:/test/Netcdf/data subset/cru10min30_tmp.nc (NC_FORMAT_CLASSIC):
## 
##      2 variables (excluding dimension variables):
##         float time_bounds[nv,time]   
##         float tmp[lon,lat,time]   
##             long_name: air_temperature
##             units: degC
##             _FillValue: -99
##             source: E:\Projects\cru\data\cru_cl_2.0\nc_files\cru10min_tmp.nc
## 
##      4 dimensions:
##         lon  Size:720
##             standard_name: longitude
##             long_name: longitude
##             units: degrees_east
##             axis: X
##         lat  Size:360
##             standard_name: latitude
##             long_name: latitude
##             units: degrees_north
##             axis: Y
##         time  Size:12
##             standard_name: time
##             long_name: time
##             units: days since 1900-01-01 00:00:00.0 -0:00
##             axis: T
##             calendar: standard
##             climatology: climatology_bounds
##         nv  Size:2
## [1] "vobjtovarid4: **** WARNING **** I was asked to get a varid for dimension named nv BUT this dimension HAS NO DIMVAR! Code will probably fail at this point"
## 
##     7 global attributes:
##         data: CRU CL 2.0 1961-1990 Monthly Averages
##         title: CRU CL 2.0 -- 10min grid sampled every 0.5 degree
##         institution: http://www.cru.uea.ac.uk/
##         source: http://www.cru.uea.ac.uk/~markn/cru05/cru05_intro.html
##         references: New et al. (2002) Climate Res 21:1-25
##         history: Wed Oct 29 11:27:35 2014: ncrename -v climatology_bounds,time_bounds cru10min30_tmp.nc
## P.J. Bartlein, 19 Jun 2005
##         Conventions: CF-1.0

2.2Get coordinate (including time) variables

# Get coordinate variables
lon <- ncvar_get(ncin,"lon")
nlon <- dim(lon)
head(nlon)
## [1] 720
lat <- ncvar_get(ncin,"lat")
nlat <- dim(lat)
head(nlat)
## [1] 360
print(c(nlon,nlat))
## [1] 720 360

Get the time variable and its attributes using the ncvar_get() and ncatt_get() functions, and list them, and also get the number of time steps using the dim() function.

time <- ncvar_get(ncin,"time")
time
##  [1] 27773.5 27803.5 27833.5 27864.0 27894.5 27925.0 27955.5 27986.5 28017.0
## [10] 28047.5 28078.0 28108.5
tunit <- ncatt_get(ncin,"time","units")
nt <- dim(time)
nt
## [1] 12

Print the time units string. Note the structure of the time units attribute. The object tunits has two components hasatt (a logical variable), and tunits$value, the actual “time since” string.

tunit
## $hasatt
## [1] TRUE
## 
## $value
## [1] "days since 1900-01-01 00:00:00.0 -0:00"

2.3Get a variable

Get the the variable (tmp) and its attributes, and verify the size of the array.

# get moisture
tmp_array <- ncvar_get(ncin,dname)
dlname <- ncatt_get(ncin,dname,"long_name")
dunits <- ncatt_get(ncin,dname,"units")
fillvalue <- ncatt_get(ncin,dname,"_FillValue")
dim(tmp_array)
## [1] 720 360  12
# get global attributes
title <- ncatt_get(ncin,0,"title")
institution <- ncatt_get(ncin,0,"institution")
datasource <- ncatt_get(ncin,0,"source")
references <- ncatt_get(ncin,0,"references")
history <- ncatt_get(ncin,0,"history")
Conventions <- ncatt_get(ncin,0,"Conventions")

Close the netCDF file using the nc_close() function.

Check what’s in the current workspace:

ls()
##  [1] "Conventions" "datasource"  "dlname"      "dname"       "dunits"     
##  [6] "fillvalue"   "history"     "institution" "lat"         "lon"        
## [11] "ncfname"     "ncin"        "ncname"      "ncpath"      "nlat"       
## [16] "nlon"        "nt"          "references"  "time"        "title"      
## [21] "tmp_array"   "tunit"

3 Reshaping from raster to rectangula

NetCDF files or data sets are naturally 2-D raster slabs (e.g. longitude by latitude “slices”), 3-D bricks (e.g. longitude by latitude by time), or 4-D arrays (e.g. longitude by latitude by height by time), while most data analysis routines in R expect 2-D variable-by-observation “tidy” data frames. (There is an exception to this expectation in some cases like principle components analysis (PCA) in which variables are locations and the observations are times.) Therefore, before analysis, a 3-D or 4-D array in the netCDF files must be** “flattened”** into a 2-D array. In addition, time is usually stored as the CF (Climate Forecast) “time since” format that is not usually human-readable. Here are some example conversions:

Load some necessary packages (install them if necessary)

library(chron)
## Warning: package 'chron' was built under R version 4.0.3
library(lattice)
library(RColorBrewer)

3.1 Convert the time variable

The time variable, in “time-since” units can be converted into “real” (or more easily readable) time values by splitting the time tunits$value string into its component parts, and then using the chron() function to determine the absolute value of each time value from the time origin.

# convert time -- split the time units string into fields
tustr <- strsplit(tunit$value, " ")   
tustr
## [[1]]
## [1] "days"       "since"      "1900-01-01" "00:00:00.0" "-0:00"
tdstr <- strsplit(unlist(tustr)[3], "-") # 
tdstr
## [[1]]
## [1] "1900" "01"   "01"
tdmstr <- strsplit(unlist(tustr)[4], ":") # 
tdmstr
## [[1]]
## [1] "00"   "00"   "00.0"
tmonth <- as.integer(unlist(tdstr)[2]) # 
tmonth
## [1] 1
tday <- as.integer(unlist(tdstr)[3])  #
tday
## [1] 1
tyear <- as.integer(unlist(tdstr)[1])
tyear
## [1] 1900
thour <-as.integer(unlist(tdmstr)[1])
thour
## [1] 0

convert to date-time of this netcdf

#date <- chron(time/60/24,origin=c(tmonth, tday, tyear))+thour/24
date <- chron(time,origin=c(tmonth, tday, tyear))+thour/24
date
##  [1] (01/16/76 12:00:00) (02/15/76 12:00:00) (03/16/76 12:00:00)
##  [4] (04/16/76 00:00:00) (05/16/76 12:00:00) (06/16/76 00:00:00)
##  [7] (07/16/76 12:00:00) (08/16/76 12:00:00) (09/16/76 00:00:00)
## [10] (10/16/76 12:00:00) (11/16/76 00:00:00) (12/16/76 12:00:00)

3.2Replace netCDF fillvalues with R NAs

In a netCDF file, values of a variable that are either missing or simply not available (i.e. ocean grid points in a terrestrial data set) are flagged using specific “fill values” **(_FillValue)** or missing values(missing_value), the particular values of which are set as attributes of a variable. In R, such unavailable data are indicated using the “NA” value. The following code fragment illustrates how to replace the netCDF variable’s fill values with R NA’s .

The head() function can be used before and after executing the “square bracket” selection and replacement to verify that the NA values have indeed replace the netCDF fill values(head(as.vector(temp_array[,,1])). The total number of non-missing (i.e. land, except for Antarctica, which is not present in the data set) grid cells can be gotten by determining the length of a vector of values representing one slice from the brick, omitting the NA values:

tmp_array[tmp_array==fillvalue$value] <- NA
length(na.omit(as.vector(tmp_array[,,1])))
## [1] 62961

3.3 Get a single time slice of the data, create an R data frame, and write a .csv file

NetCDF variables are read and written as one-dimensional vectors (e.g. longitudes), two-dimensional arrays or matrices (raster “slices”), or multi-dimensional arrays (raster “bricks”). In such data structures, the coordinate values for each grid point are implicit, inferred from the marginal values of, for example, longitude, latitude and time. In contrast, in R, the principal data structure for a variable is the data frame.This particular structure of this data set can be illustrated by selecting a single slice from the temperature “brick”, turning it into a data frame with three variables and 720 by 360 rows, and writing it out as a .csv file.

3.3.1 Get a single slice of data

# get a single layer,there is only on layer in this data
m<-1
tmp_array <- tmp_array[,,m]
#quickmap
image(lon,lat,tmp_array,col=rev(brewer.pal(10,"RdBu")))

A better map can be obtained using the levelplot() function from the lattice package. The expand.grid() function is used to create a set of 720 by 360 pairs of latitude and longitude values (with latitudes varying most rapidly), one for each element in the tmp_slice array. Specific values of the cutpoints of temperature categories are defined to cover the range of temperature values here.

# levelplot of the slice
grid <- expand.grid(lon=lon, lat=lat)
cutpts <- c(-50,-40,-30,-20,-10,0,10,20,30,40,50)
levelplot(tmp_array ~ lon * lat, data=grid, at=cutpts, cuts=11, pretty=T, 
col.regions=(rev(brewer.pal(10,"RdBu"))))

3.3.2 Create a data frame

To create a data frame, the expand.grid() and as.matrix() functions are used to create the 259200 pairs (i.e. rows) of values of longitude and latitude (the columns), and the as.vector() function is used to “unstack” the slice of data into a long vector. The size of the objects that are created can be verified using the dim() and length() functions.

# create dataframe -- reshape data
# matrix (nlon*nlat rows by 2 cols) of lons and lats
lonlat <- as.matrix(expand.grid(lon,lat))
head(lonlat,10)
##          Var1   Var2
##  [1,] -179.75 -89.75
##  [2,] -179.25 -89.75
##  [3,] -178.75 -89.75
##  [4,] -178.25 -89.75
##  [5,] -177.75 -89.75
##  [6,] -177.25 -89.75
##  [7,] -176.75 -89.75
##  [8,] -176.25 -89.75
##  [9,] -175.75 -89.75
## [10,] -175.25 -89.75
dim(lonlat)
## [1] 259200      2
# vector of `tmp` values
tmp_vec <- as.vector(tmp_array)
length(tmp_vec)
## [1] 259200
720*360
## [1] 259200

The data.frame() and cbind() functions are used to assemble the columns of the data frame, which are assigned appropriate names using the names() function (on the left-hand side of assignment operator). The head() function, applied on top of the na.omit() function lists the first rows of values without NAs:

# create dataframe and add names
tmp_df01 <- data.frame(cbind(lonlat,tmp_vec))
names(tmp_df01) <- c("lon","lat",paste(dname, sep="_"))
head(na.omit(tmp_df01), 10)
##          lon    lat tmp
## 49186 -67.25 -55.75 8.2
## 49901 -69.75 -55.25 7.9
## 49902 -69.25 -55.25 8.4
## 49903 -68.75 -55.25 7.8
## 49904 -68.25 -55.25 8.9
## 49905 -67.75 -55.25 9.1
## 49906 -67.25 -55.25 9.0
## 50617 -71.75 -54.75 8.8
## 50619 -70.75 -54.75 8.7
## 50620 -70.25 -54.75 7.9

3.3.3 Write out the data frame

Finally the data frame is written out to the working directory as a .csv file, using na.omit() again to drop the observations with missing data (i.e. ocean points and Antarctica).

# set path and filename
csvpath <- "E:/test/Netcdf/data subset/csvfiles"
csvname <- "cru_tmp_1.csv"
csvfile <- paste(csvpath, csvname, sep="")
write.table(na.omit(tmp_df01),csvfile, row.names=FALSE, sep=",")

3.4 Convert the whole array to a data frame, and calculate MTWA, MTCO and the annual mean

The idea here is to convert the nlon by nlat by nt 3-D array into a (nlon x nlat) by nt 2-D matrix. (This will work if the netCDF data set was written as a CF-compliant data set, with arrays dimensioned as in Fortran, i.e. as nlon x nlat x nt arrays).

3.4.1 Reshape the whole array

Convert the array to a vector. First, create a long vector tmp_vec_long using the as.vector() reshaping function, and verify its length, which should be 3110400. (This will work only if the netCDF file (and hence the data array) follow the “CF” conventions, i.e. that the variable tmp has been defined to have dimensions nlon by nlat by nt, in that order.)

# reshape the array into vector
tmp_vec_long <- as.vector(tmp_array)
head(na.omit(tmp_vec_long,10))
## [1] 8.2 7.9 8.4 7.8 8.9 9.1
length(tmp_vec_long)
## [1] 259200
259200*12
## [1] 3110400
# reshape the vector into a matrix
tmp_mat <- matrix(tmp_vec_long, nrow=nlon*nlat, ncol=nt) #nt=dim(time)
dim(tmp_mat)
## [1] 259200     12
head(na.omit(tmp_mat))
##      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12]
## [1,]  8.2  8.2  8.2  8.2  8.2  8.2  8.2  8.2  8.2   8.2   8.2   8.2
## [2,]  7.9  7.9  7.9  7.9  7.9  7.9  7.9  7.9  7.9   7.9   7.9   7.9
## [3,]  8.4  8.4  8.4  8.4  8.4  8.4  8.4  8.4  8.4   8.4   8.4   8.4
## [4,]  7.8  7.8  7.8  7.8  7.8  7.8  7.8  7.8  7.8   7.8   7.8   7.8
## [5,]  8.9  8.9  8.9  8.9  8.9  8.9  8.9  8.9  8.9   8.9   8.9   8.9
## [6,]  9.1  9.1  9.1  9.1  9.1  9.1  9.1  9.1  9.1   9.1   9.1   9.1

Create the second data frame from the tmp_mat matrix.

# create a dataframe
lonlat <- as.matrix(expand.grid(lon,lat))
tmp_df02 <- data.frame(cbind(lonlat,tmp_mat))
names(tmp_df02) <- c("lon","lat","tmpJan","tmpFeb","tmpMar","tmpApr","tmpMay","tmpJun",
  "tmpJul","tmpAug","tmpSep","tmpOct","tmpNov","tmpDec")
# options(width=96)
head(na.omit(tmp_df02, 20))
##          lon    lat tmpJan tmpFeb tmpMar tmpApr tmpMay tmpJun tmpJul tmpAug
## 49186 -67.25 -55.75    8.2    8.2    8.2    8.2    8.2    8.2    8.2    8.2
## 49901 -69.75 -55.25    7.9    7.9    7.9    7.9    7.9    7.9    7.9    7.9
## 49902 -69.25 -55.25    8.4    8.4    8.4    8.4    8.4    8.4    8.4    8.4
## 49903 -68.75 -55.25    7.8    7.8    7.8    7.8    7.8    7.8    7.8    7.8
## 49904 -68.25 -55.25    8.9    8.9    8.9    8.9    8.9    8.9    8.9    8.9
## 49905 -67.75 -55.25    9.1    9.1    9.1    9.1    9.1    9.1    9.1    9.1
##       tmpSep tmpOct tmpNov tmpDec
## 49186    8.2    8.2    8.2    8.2
## 49901    7.9    7.9    7.9    7.9
## 49902    8.4    8.4    8.4    8.4
## 49903    7.8    7.8    7.8    7.8
## 49904    8.9    8.9    8.9    8.9
## 49905    9.1    9.1    9.1    9.1

3.4.2 Get the annual mean

Get the annual mean, mtwa and mtco (mean temperatures of the warmest and coldest montth) values and add them to the second data frame.

# get the annual mean and MTWA and MTCO
tmp_df02$mtwa <- apply(tmp_df02[3:14],1,max) # mtwa
tmp_df02$mtco <- apply(tmp_df02[3:14],1,min) # mtco
tmp_df02$mat <- apply(tmp_df02[3:14],1,mean) # annual (i.e. row) means
head(na.omit(tmp_df02))
##          lon    lat tmpJan tmpFeb tmpMar tmpApr tmpMay tmpJun tmpJul tmpAug
## 49186 -67.25 -55.75    8.2    8.2    8.2    8.2    8.2    8.2    8.2    8.2
## 49901 -69.75 -55.25    7.9    7.9    7.9    7.9    7.9    7.9    7.9    7.9
## 49902 -69.25 -55.25    8.4    8.4    8.4    8.4    8.4    8.4    8.4    8.4
## 49903 -68.75 -55.25    7.8    7.8    7.8    7.8    7.8    7.8    7.8    7.8
## 49904 -68.25 -55.25    8.9    8.9    8.9    8.9    8.9    8.9    8.9    8.9
## 49905 -67.75 -55.25    9.1    9.1    9.1    9.1    9.1    9.1    9.1    9.1
##       tmpSep tmpOct tmpNov tmpDec mtwa mtco mat
## 49186    8.2    8.2    8.2    8.2  8.2  8.2 8.2
## 49901    7.9    7.9    7.9    7.9  7.9  7.9 7.9
## 49902    8.4    8.4    8.4    8.4  8.4  8.4 8.4
## 49903    7.8    7.8    7.8    7.8  7.8  7.8 7.8
## 49904    8.9    8.9    8.9    8.9  8.9  8.9 8.9
## 49905    9.1    9.1    9.1    9.1  9.1  9.1 9.1
dim(na.omit(tmp_df02))
## [1] 62961    17

3.4.3 Write out the second data frame

Write the second data frame out as a .csv file, dropping NAs.

# set path and filename
csvpath <- "E:/test/Netcdf/data subset/csvfiles"
csvname <- "cru_tmp_2.csv"
csvfile <- paste(csvpath, csvname, sep="")
write.table(na.omit(tmp_df02),csvfile, row.names=FALSE, sep=",")

Create a third data frame, with only non-missing values. This will be used later to demonstrate how to convert a “short” data frame into full matrix or array for writing out as a netCDF file.

# create a dataframe without missing values
tmp_df03 <- na.omit(tmp_df02)
head(tmp_df03)
##          lon    lat tmpJan tmpFeb tmpMar tmpApr tmpMay tmpJun tmpJul tmpAug
## 49186 -67.25 -55.75    8.2    8.2    8.2    8.2    8.2    8.2    8.2    8.2
## 49901 -69.75 -55.25    7.9    7.9    7.9    7.9    7.9    7.9    7.9    7.9
## 49902 -69.25 -55.25    8.4    8.4    8.4    8.4    8.4    8.4    8.4    8.4
## 49903 -68.75 -55.25    7.8    7.8    7.8    7.8    7.8    7.8    7.8    7.8
## 49904 -68.25 -55.25    8.9    8.9    8.9    8.9    8.9    8.9    8.9    8.9
## 49905 -67.75 -55.25    9.1    9.1    9.1    9.1    9.1    9.1    9.1    9.1
##       tmpSep tmpOct tmpNov tmpDec mtwa mtco mat
## 49186    8.2    8.2    8.2    8.2  8.2  8.2 8.2
## 49901    7.9    7.9    7.9    7.9  7.9  7.9 7.9
## 49902    8.4    8.4    8.4    8.4  8.4  8.4 8.4
## 49903    7.8    7.8    7.8    7.8  7.8  7.8 7.8
## 49904    8.9    8.9    8.9    8.9  8.9  8.9 8.9
## 49905    9.1    9.1    9.1    9.1  9.1  9.1 9.1