This is an R Markdown format used for publishing markdown documents to GitHub. When you click the Knit button all R code chunks are run and a markdown file (.md) suitable for publishing to GitHub is generated.
Today's Task:
The 35th character of each line is the flag, but it is more easily picked up with awk '{print $5}' filename; awk is the right tool for this. See code:
awk '($4 != -9999.9) && ($5 == "")' filename
awk '($4 == -9999.9) && ($5 == "")' filename
awk '($4 != -9999.9) && ($5 != "")' filename
awk '($4 == -9999.9) && ($5 != "")' filename
Now we write the bash script that iterates over the regions and parts and concatenates files, accounting for missing monthly values and flagged values.
#! /bin/bash
wd=/lustre1/rwork2/routwzn/smukeshk/INTERPOLATION/MASTER/MSTR
HME=ALL
for reg in NthAmerica SthAmerica Africa Europe Russia SthAsia EstAsia Oceania
do
for p in `seq 1 50`
do
echo Concatenating files in $HME/$reg/P"$p"/
for i in $wd/$HME/$reg/P"$p"/*
do
f=`basename $i`
awk '($4 != -9999.9) && ($5 == "")' $i >> $wd/$HME/$reg/CONCATENATED/$f
awk '($4 == -9999.9) && ($5 == "")' $i >> $wd/$HME/$reg/MISSING_MONTH/$f
awk '($4 != -9999.9) && ($5 != "")' $i >> $wd/$HME/$reg/FLAGGED/$f
awk '($4 == -9999.9) && ($5 != "")' $i >> $wd/$HME/$reg/BOTH/$f
done
done
done
Still cannot log in to MIRAKEL; JSAM keeps loading. So I decided to download and compare the CPC and GPCC global daily precipitation datasets, to start the documentation, and to start the literature review for the introduction.
Details of the downloaded CPC dataset can be found on the datawiki: CCRC data wiki: precipitation
Managed to log on to MIRAKEL. Firefox Java applets stopped working, so I ended up installing the Pale Moon browser, which natively supports Java applets. JSAM seems to run stably on Pale Moon.
Created a skeleton for the documentation paper. The next task is to fill in as much of this skeleton as possible. Queued jobs summarising observations by region and concatenating regional daily data files into global daily data files.
We now have summaries of number of stations (flagged, missing months, both, none) by regions.
Checked if all the FINALR4MAT jobs finished. They did! Two jobs, FINALR4MAT49 and 50, had some readGhcnd errors.
==> FINALR4MAT49.out <==
readGhcnd error for station N32383E111667 in record CHM00057265201608TAVG 309H S 296H S 256H S 260H S 254H S 271H S 261H S 251H S 271H S 293H S 286H S 306H S 309H S 312H S 307H S 286H S 287H S 304H S 315H S 291H S 278H S 262H S 281H S 303H S 279H S-9999 -9999 -9999 -9999 -9999 -9999
==> FINALR4MAT50.out <==
readGhcnd error for station S22217E30000 in record ZI000067991199002PRCP 0 Q 0 Q 0 Q 0 Q 0 Q 0 Q 2 Q 40 Q 50 Q 0 Q 0 Q 0 Q 0 Q 0 Q 0 Q 0 Q 0 Q 0 Q 37 Q 0 Q 184 Q 425 Q 0 Q 0 Q 0 Q 0 Q 0 Q 0 Q-9999 -9999 -9999
The error occurs in readGhcnd function while reading the QC'd GPCC data.
The error is due to an incorrect write-out of the raw QC'd GPCC data. The station id is incorrect (I think it is the same as the GHCND station id, 11 characters). This data could easily be fixed by finding the right station id and reading the data in again using rf_forKriging.f95. The FINALR4MAT scripts (rf_forKriging.f95 and gpccmod.f95) are fine.
For now we will not include these stations.
Worked on a functional-programming way of reading in a regional subset of a netCDF file and then creating regional timeseries.
Packages: RNetCDF, fields (for image.plot) and lubridate.
library(RNetCDF)
library(fields)
library(lubridate)
I have converted the original netCDF-4 files into "classic" format using nccopy and then used cdo remapcon2 to flip them about the equator.
nccopy -k "classic" precip.1979.nc precip.1979.classic.nc
# Get griddes and then edit it by
# yfirst = 89.75 -> yfirst = -89.75
# yinc = -0.5 -> yinc = 0.5
cdo griddes precip.1979.classic.nc > griddes_flipped
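# optionally, the two edits above can be scripted instead of done by hand, e.g.:
# sed -i 's/yfirst *=.*/yfirst = -89.75/; s/yinc *=.*/yinc = 0.5/' griddes_flipped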
cdo remapcon2,griddes_flipped precip.1979.classic.nc precip.1979.flipped.nc
The lon and lat values are based on the netCDF description:
nc.dir <- "/srv/ccrc/data45/z3289452/DailyPRCP_netCDFs/CPC_0.5X0.5_Daily_Unified_Global_Precipitation/"
nc.prefix <- "precip."
nc.suffix <- ".flipped.nc"
lon <- seq(0.25, by = 0.5, length.out = 720)
lat <- seq(-89.75, by = 0.5, length.out = 360)
Regional stuff.
#Now plot timeseries by region
# Region 1: NthAmerica
# Region 2: SthAmerica
# Region 3: Africa
# Region 4: Europe
# Region 5: Russia
# Region 6: SthAsia
# Region 7: EstAsia
# Region 8: Oceania
Regions <- list(NthAmerica = list(), SthAmerica = list(), Africa = list(),
Europe = list(), Russia = list(), SthAsia = list(), EstAsia = list(),
Oceania = list())
lon.adjusted = lon
lon.adjusted[lon.adjusted > 180] = lon[lon.adjusted > 180] - 360
Regions$NthAmerica$lon <- which(lon.adjusted <= -34)
Regions$NthAmerica$lat <- which(lat >= 13)
Regions$SthAmerica$lon <- which(lon.adjusted <= -34)
Regions$SthAmerica$lat <- which(lat < 13)
Regions$Africa$lon <- which(lon.adjusted > -34 & lon.adjusted < 60)
Regions$Africa$lat <- which(lat < 38)
Regions$Europe$lon <- which(lon.adjusted > -34 & lon.adjusted <= 46)
Regions$Europe$lat <- which(lat >= 38)
Regions$Russia$lon <- which(lon.adjusted > 46)
Regions$Russia$lat <- which(lat >= 38)
Regions$SthAsia$lon <- which(lon.adjusted >= 60 & lon.adjusted < 98)
Regions$SthAsia$lat <- which(lat >= 5 & lat < 38)
Regions$EstAsia$lon <- which(lon.adjusted >= 98)
Regions$EstAsia$lat <- which(lat >= 5 & lat < 38)
Regions$Oceania$lon <- which(lon.adjusted >= 60)
Regions$Oceania$lat <- which(lat < 5)
Functions to avoid loops. TSfromncdf takes a netCDF file and the indices of the grids to be read, as well as dates, and creates daily, monthly and annual timeseries. If you would like to read all grids, leave lon.index and lat.index out.
TSfromncdf <- function(nc.location, lon.index = NA, lat.index = NA, dates) {
require(RNetCDF)
nc <- open.nc(nc.location)
precip.grid <- var.get.nc(nc, "precip", start = c(ifelse(is.na(sum(lon.index)), NA, lon.index[1]),
ifelse(is.na(sum(lat.index)), NA, lat.index[1]), NA),
count = c(ifelse(is.na(sum(lon.index)), NA, length(lon.index)),
ifelse(is.na(sum(lat.index)), NA, length(lat.index)), NA))
close.nc(nc)
daily <- apply(precip.grid, 3, mean, na.rm = T)
rm(precip.grid); gc()
monthly <- aggregate(daily, by = list(month(dates)), FUN = mean, na.rm = T)$x
annual <- mean(daily, na.rm = T)
return(list(daily = daily, monthly = monthly, annual = annual))
}
give.location <- function(year) {
return(paste0(nc.dir, nc.prefix, year, nc.suffix))
}
combine.yearly.ts <- function(year, reg.num, timescale, Regions) {
writeLines(paste0("year: ", year))
dates <- seq(as.Date(paste0(year,"-01-01")), as.Date(paste0(year,"-12-31")), by = 1)
ts.subset <- TSfromncdf(give.location(year), Regions[[reg.num]]$lon,
Regions[[reg.num]]$lat, dates)[[timescale]]
return(unlist(ts.subset))
}
create.reg.ts <- function(reg.num, timescale, Regions) {
stopifnot(timescale == "daily" | timescale == "monthly")
timescale = ifelse(timescale == "daily", 1, 2)
writeLines(paste0("Creating TS for region: ", names(Regions)[reg.num]))
return(c(unlist(sapply(1979:2013, FUN = combine.yearly.ts, reg.num = reg.num,
timescale = timescale, Regions = Regions))))
}
Now we use an lapply to loop over all regional and yearly read-ins and create a daily or monthly time series. First the daily regional timeseries:
daily.reg.ts <- lapply(1:8, FUN = create.reg.ts, timescale = "daily", Regions)
names(daily.reg.ts) <- names(Regions)
save(daily.reg.ts, file = "/srv/ccrc/data45/z3289452/PhD_repository/RData/CPC_reg_1979-2013_Daily_TS.RData")
Now the daily global timeseries. Just change years to include the years you would like to calculate.
years <- seq(1979, 2013, 1)
glob.ts <- sapply(years, FUN = function(year) {
writeLines(paste0("Year: ", year))
dates <- seq(as.Date(paste0(year,"-01-01")), as.Date(paste0(year,"-12-31")), by = 1)
ts.subset <- TSfromncdf(give.location(year), dates = dates)
return(ts.subset)
})
daily.glob.ts <- unlist(sapply(1:length(years), function(y) {glob.ts[1,y][[1]]}))
monthly.glob.ts <- c(unlist(sapply(1:length(years), function(y) {glob.ts[2,y][[1]]})))
annual.glob.ts <- unlist(sapply(1:length(years), function(y) {glob.ts[3,y][[1]]}))
save(daily.glob.ts, monthly.glob.ts, annual.glob.ts,
file = "/srv/ccrc/data45/z3289452/PhD_repository/RData/CPC_glb_1979-2013_DMY_TS.RData")
Started interpolation for 1950 to 1969 in two parts for each year (e.g. 1950a and 1950b). All runs are in /lustre1/rwork2/routwzn/smukeshk/INTERPOLATION/KRIGING/RUNS.
Wrote setupscripts.sh, which modifies Datensatz-erzeugen.sh and kriging.job in each part directory, getting it ready for job submission.
#!/bin/bash
wd=/lustre1/rwork2/routwzn/smukeshk/INTERPOLATION/KRIGING/RUNS
echo Dirs: A
for y in `seq 1950 1969`
do
echo Year: $y
cd $wd/"$y"a
sed -i "s/Start=195001/Start="$y"01/g" Datensatz-erzeugen.sh
sed -i "s/Ende=195012/Ende="$y"06/g" Datensatz-erzeugen.sh
sed -i "s/Krig1950/Krig"$y"a/g" kriging.job
sed -i "s#P1950#RUNS/"$y"a#g" kriging.job
cd $wd
done
echo Dirs: B
for y in `seq 1950 1969`
do
echo Year: $y
cd $wd/"$y"b
sed -i "s/Start=195001/Start="$y"07/g" Datensatz-erzeugen.sh
sed -i "s/Ende=195012/Ende="$y"12/g" Datensatz-erzeugen.sh
sed -i "s/Krig1950/Krig"$y"b/g" kriging.job
sed -i "s#P1950#RUNS/"$y"b#g" kriging.job
cd $wd
done
Had a chat with Lisa, who suggested looking at the interpolation output to check that there are no artifacts (bull's eyes) before running off and doing the full interpolation.
Very important to email the CMS team with a data sample and publish a dataset abstract with a DOI advertising that the dataset is coming. Create a Data Management Plan for this.
Make sure I have some figures ready by the end of next week for writing the abstract (by the end of July I should have a draft abstract).
Kriging jobs for version 0 of the CCRC dataset finished without hiccups. As far as I can tell there are no memory allocation errors. I copied an output file (19600101.txt in 1960a dir) to /srv/ccrc/data45/z3289452/GPCC/INTERPOLATION/KRIGING/RUNS to make some plots to see if output was reasonable.
The code to check output files is in ownCloud/Working\ Directory/kriging_output_readandtest_24072017.R.
First we plot the output of the kriging output file:
Change out.file to change the output file to be read in. We will change the for loops into functional code later.
library(lubridate)
library(fields)
library(RNetCDF)
setwd("/srv/ccrc/data45/z3289452/GPCC/INTERPOLATION/KRIGING/RUNS")
out.dir <- "OUT/"
#####
run.dir <- "1960a/"
out.file <- "19600101.txt"
#####
if (substr(run.dir, 1, 4) != substr(out.file, 1, 4)) {
warning("Run directory not the same year as output file")
}
out.df <- read.table(paste0(run.dir, out.dir, out.file), header = F)
names(out.df) <- c("lon","lat","PREC")
out.df <- out.df[order(out.df[,1],out.df[,2]),]
lon <- unique(out.df$lon)
lat <- unique(out.df$lat)
out.raster <- array(dim=c(length(lon),length(lat)))
for (i in 1:length(lon)) {
for (j in 1:length(lat)) {
out.raster[i,j] <- out.df$PREC[(i-1)*180+j]
if (out.raster[i,j] == -99999.99) {
# print("blah")
out.raster[i,j] <- NA
}
}
}
Now we do the plotting.
First the worldmap
image.plot(lon, lat, out.raster)
map("world", add = T, interior = F)
title(ymd(substr(out.file, 1, 8)))
Now only USA.
# Let's plot only USA
lon.index <- which(lon <= -65 & lon >= -128)
lat.index <- which(lat <= 55 & lat >= 22)
image.plot(lon[lon.index], lat[lat.index], out.raster[lon.index, lat.index],
xlab = "lon", ylab = "lat")
title(ymd(substr(out.file, 1, 8)))
map("world", "usa", add = T) Now only the wet grids.
# Let's NA the 0 values to see the patterns more clearly
out.raster[which(out.raster < 1)] <- NA
image.plot(lon[lon.index], lat[lat.index], out.raster[lon.index, lat.index],
xlab = "lon", ylab = "lat")
title(ymd(substr(out.file, 1, 8)))
map("world", "usa", add = T)Now we compare the above plots with the same from CPC 0.25X0.25 degree CONUS gridded daily product. We would have compared with CPC global or GPCC_FDD but they do not go back far enough. When the remaining interpolation finishes we will do it.
# Let's read in CPC CONUS to see if the grids look similar
setwd("/srv/ccrc/data45/z3289452/DailyPRCP_netCDFs/CPC_.25x.25_Daily_US_Unified_Precipitation/")
timeunitssince <- function(day, timeunits, since){
require(lubridate)
numdays <- ymd(day) - ymd(since)
if (timeunits == "days") {
return(day(days(numdays)))
} else if (timeunits == "hours") {
return(hour(hours(numdays * 24)))
} else {
stop("timeunits must be \"days\" or \"hours\"")
}
}
day.index <- timeunitssince(ymd(19600101), "days", ymd(19480101)) + 1
nc <- open.nc("precip.V1.0.1948-2006.nc")
CPC.grid <- var.get.nc(nc, "precip", start = c(NA, NA, day.index), count = c(NA, NA, 1))
lon.cpc <- var.get.nc(nc, "lon") - 360 # we subtract 360 to have lon go from (-180, 180) instead of (0, 360)
lat.cpc <- var.get.nc(nc, "lat")
time.cpc <- var.get.nc(nc, "time")Now the plots. No world map this time.
First: all grids
image.plot(lon.cpc, lat.cpc, CPC.grid, xlab = "lon", ylab = "lat")
title(ymd(substr(out.file, 1, 8)))
map("world", "usa", add = T) Now only the wet days:
CPC.grid[which(CPC.grid < 1)] <- NA
image.plot(lon.cpc, lat.cpc, CPC.grid, xlab = "lon", ylab = "lat")
title(ymd(substr(out.file, 1, 8)))
map("world", "usa", add = T)The spatial patterns aren't that similar. Perhaps we need to compare with a 1 degree version by remapping. I am halfway through the process of regridding using CDO to a 1 degree version which I think we need to do anyway for the final interpolation.
Nevertheless, the interpolation for the remaining years is running now.
The interpolation of ALL station data (as opposed to a subset based on completeness criteria, i.e. 20yr, 30yr, 40yr) finished although some of the runs ran out of walltime.
I wrote a script to save the output of the previous run and "re-setup" the directory for another run. This resetupjobs.sh file is available in the RUNS directory: /lustre1/rwork2/routwzn/smukeshk/INTERPOLATION/KRIGING/RUNS/.
#!/bin/bash
wd=/lustre1/rwork2/routwzn/smukeshk/INTERPOLATION/KRIGING/RUNS
folder=$1
cd $wd/$folder
jobnum=`tail -n 1 kriging.jobnum | awk '{print $2}' | sed 's/.xcepbs00//g'`
sed '231s/.*/if [ ! -d "$Ordner_Erg" ]; then mkdir $Ordner_Erg; fi/' Datensatz-erzeugen.sh > tmp
sed -i '232s#.*#if [ ! -d "$Ordner_Erg"/Error ]; then mkdir "$Ordner_Erg"/Error; fi#' tmp
sed -i '233s#.*#if [ ! -d "$Ordner_Erg"/Err_K ]; then mkdir "$Ordner_Erg"/Err_K; fi#' tmp
sed -i '234s#.*#if [ ! -d "$Ordner_Erg"/Stat-verteilung ]; then mkdir "$Ordner_Erg"/Stat-verteilung; fi#' tmp
sed -n '230,235p' tmp
echo "Looks good? [Y/y]"
read response
if [ "$response" = "Y" ] || [ "$response" = "y" ]
then
mv tmp Datensatz-erzeugen.sh
chmod +x Datensatz-erzeugen.sh
fi
rm dates.txt
rm comp-kriging_f90.out
mv kriging.out kriging."$jobnum".out
cd $wd
Before I run resetupjobs.sh I change the Start and Ende parameters in the Datensatz-erzeugen.sh file, by checking in the OUT directory and the kriging.out file which days did not finish. E.g.:
sed -i '26s/Start=.*/Start=19861229/' Datensatz-erzeugen.sh
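The last completed day can be read straight off the OUT directory of the run, e.g. (a sketch):
ls OUT/*.txt | tail -n 1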
After this simply run submitJob.sh from within the folder.
Created comparative maps of CCRC, CPC_global and GPCC_FDD for 1988-01-01.
First we had to regrid the flipped CPC_global 1988 netCDF file onto the same grid as GPCC_FDD (i.e. 1x1 degree resolution starting at lon -179.5):
cdo remapcon2,griddes1X1deg_identicaltoGPCCFDD precip.1988.flipped.nc precip.1988.1X1deg.nc
griddes1X1deg_identicaltoGPCCFDD is printed below and is available in the CPC directory: /srv/ccrc/data45/z3289452/DailyPRCP_netCDFs/CPC_0.5X0.5_Daily_Unified_Global_Precipitation/
#
# gridID 1
#
gridtype = lonlat
gridsize = 64800
xname = lon
xlongname = longitude
xunits = degrees_east
yname = lat
ylongname = latitude
yunits = degrees_north
xsize = 360
ysize = 180
xfirst = -179.5
xinc = 1
yfirst = -89.5
yinc = 1
This procedure produced some negative rainfall values, which were masked out in the R code used to print the plots.
Now the R code to print the comparative plots. First modify and run the importKrigingOutput code chunk with
#####
run.dir <- "1960a/"
out.file <- "19600101.txt"
#####
Now import CPC and GPCC.
ncdf.dir <- '/srv/ccrc/data45/z3289452/DailyPRCP_netCDFs/'
CPC.dir <- 'CPC_0.5X0.5_Daily_Unified_Global_Precipitation/'
GPCC.dir <- 'GPCC_FDD/'
# CPC
nc <- open.nc(paste0(ncdf.dir, CPC.dir,"precip.1988.1X1deg.nc"))
CPC.grid <- var.get.nc(nc, "precip")
lon.cpc <- var.get.nc(nc, "lon")
lat.cpc <- var.get.nc(nc, "lat")
time.cpc <- var.get.nc(nc, "time")
# Remove negative values in CPC (no idea how they got there)
CPC.grid[which(CPC.grid < 0)] <- NA
# Now GPCC-FDD
nc <- open.nc(paste0(ncdf.dir, GPCC.dir,"full_data_daily_1988.nc"))
GPCC.grid <- var.get.nc(nc, "p")
lon.gpcc <- var.get.nc(nc, "lon")
lat.gpcc <- var.get.nc(nc, "lat")
time.gpcc <- var.get.nc(nc, "time")Now the plots. We plot on an exponential scale for clarity.
# Now let's change the breaks and colours
brks <- c(0, 1.7^(seq(1,10)))
require(graphics)
YlGnBu <- colorRampPalette(c("Yellow", "Green", "Blue"))
image.plot(lon, lat, out.raster, breaks = brks, col = YlGnBu(length(brks) - 1),
xlab = "Longitude", ylab = "Latitude")
map("world", add = T, interior = F)
title(paste("CCRC:", ymd(substr(out.file, 1, 8))))
image.plot(lon.cpc, lat.cpc, CPC.grid[,,1], breaks = brks, col = YlGnBu(length(brks) - 1),
xlab = "Longitude", ylab = "Latitude")
map("world", add = T, interior = F)
title(paste("CPC:", ymd(substr(out.file, 1, 8))))
image.plot(lon.gpcc, lat.gpcc, GPCC.grid[,,1], breaks = brks, col = YlGnBu(length(brks) - 1),
xlab = "Longitude", ylab = "Latitude")
map("world", add = T, interior = F)
title(paste("GPCC:", ymd(substr(out.file, 1, 8))))
Ran the remaining kriging jobs that did not finish. They have now finished and we have all grids for interpolation from 1950 to 2013.
Downloaded all the summaries of the number of stations (by region and completeness criteria) for analysis.
First I copied all the summaries into a structured directory (by completeness and region) called cpdir in /lustre1/rwork2/routwzn/smukeshk/INTERPOLATION/MASTER/MSTR/cpdir using the script:
#!/bin/bash
wd=/lustre1/rwork2/routwzn/smukeshk/INTERPOLATION/MASTER/MSTR
for i in ALL TWENTY THIRTY FORTY
do
for reg in NthAmerica SthAmerica Africa Europe Russia SthAsia EstAsia Oceania
do
cp $wd/$i/$reg/*_summary.txt $wd/cpdir/$i/$reg
done
done
cpdir was then compressed (tar -czvf cpdir.tar.gz cpdir) and copied (and uncompressed) using scp to /srv/ccrc/data45/z3289452/GPCC/INTERPOLATION/MASTER/MSTR/SUMMARIES.
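In full, the transfer was roughly as follows (a sketch; the remote host is the same one used for the rsync copies later in this log):
tar -czvf cpdir.tar.gz cpdir
scp cpdir.tar.gz z3289452@ccrc163.ccrc.unsw.edu.au:/srv/ccrc/data45/z3289452/GPCC/INTERPOLATION/MASTER/MSTR/SUMMARIES/
# then, on the receiving side:
tar -xzvf cpdir.tar.gz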
Now we make some line plots of the summaries.
During summary data import in R, some weirdness was discovered in the summary files. The exact details are described in the following script which was used to fix the files. The script location is /srv/ccrc/data45/z3289452/GPCC/INTERPOLATION/MASTER/MSTR/SUMMARIES/fixsummaries.sh
#!/bin/bash
# The summaries have a gibberish first line
# because for some reason there is a file called '*'
# in the folders
# ${f:0:8} returns nothing and
# cat $i | wc -l returns the sum of all line numbers
# of all files in directory
# Also we would like to remove the last line of the
# summaries which is 20161231
wd=/srv/ccrc/data45/z3289452/GPCC/INTERPOLATION/MASTER/MSTR/SUMMARIES
for i in ALL TWENTY THIRTY FORTY
do
for reg in NthAmerica SthAmerica Africa Europe Russia SthAsia EstAsia Oceania
do
for f in $wd/$i/$reg/*
do
ln=`grep -n 19500101 "$f" | cut -d : -f 1`
if [ "$ln" -gt 1 ]
then
sed -i '1d' "$f"
fi
sed -i '$d' "$f"
done
done
done
Now that the summaries are fixed we import the summary files into R and plot them.
First the data import. I would have liked to do this in a functional way, but it was difficult to have a function create the nested list and name it.
# summaries of stations
wd <- '/srv/ccrc/data45/z3289452/GPCC/INTERPOLATION/MASTER/MSTR/SUMMARIES'
setwd(wd)
len_criteria <- list('ALL', 'TWENTY', 'THIRTY', 'FORTY')
data_type <- list('CONCATENATED', 'FLAGGED', 'MISSING_MONTH', 'BOTH')
region <- list('NthAmerica', 'SthAmerica', 'Africa', 'Europe', 'Russia', 'SthAsia', 'EstAsia', 'Oceania')
summaries <- list()
for (l in 1:length(len_criteria)) {
summaries.tmp2 <- list()
for (d in 1:length(data_type)) {
summaries.tmp1 <- list()
for (r in 1:length(region)) {
summaries.tmp1[[r]] <- read.table(paste0(len_criteria[l], '/',
region[r], '/',
data_type[d], '_summary.txt'), header = F)
names(summaries.tmp1)[r] <- region[r]
summaries.tmp1[[r]][,1] <- ymd(summaries.tmp1[[r]][,1]) #Convert into dates
}
summaries.tmp2[[d]] <- summaries.tmp1
names(summaries.tmp2)[d] <- data_type[d]
}
summaries[[l]] <- summaries.tmp2
names(summaries)[l] <- len_criteria[l]
}
rm(summaries.tmp1, summaries.tmp2)
Now the plotting: This is functional! 😄
plot.regs <- function(datType, data_type, summaries.lenCrit, lenCriteria) {
data_num <- match(datType, data_type)
ylim = range(sapply(summaries.lenCrit[[data_num]], function(x){range(x[,2])}))
cols <- list("violet", "brown", "blue", "green", "gold", "orange", "red", "magenta")
plot(summaries.lenCrit[[data_num]][[1]][,1], summaries.lenCrit[[data_num]][[1]][,2],
type = 'n', xlab = "Dates", ylab = "Number of stations",
xlim = c(ymd('19500101'), ymd('20131231')), ylim = ylim)
plot.line <- function(list, col) {
lines(list[,1], list[,2], col = col)
}
mapply(FUN = plot.line, summaries.lenCrit[[data_num]], cols)
if (data_num != 1) {
legend("topleft", legend = names(summaries.lenCrit[[data_num]]), col = unlist(cols),
lty = rep(1, length(region)), cex = 0.85)
}
title(paste0(lenCriteria, ": ", datType, ": Number of stations over time"))
}
plot.summaries <- function(lenCriteria, len_criteria, data_type, summaries) {
lenCrit_num <- match(lenCriteria, len_criteria)
lapply(data_type, FUN = plot.regs, data_type, summaries[[lenCrit_num]], lenCriteria)
}
layout(mat = matrix(1:4, ncol = 2, byrow = T))
lapply(len_criteria, FUN = plot.summaries, len_criteria, data_type, summaries)
A new private git repo in /srv/ccrc/data45/z3289452/GIT_REPOS has been created.
Need to download raw station data for CONCATENATED, TWENTY, THIRTY and FORTY to see the distribution of stations by year and region. They are available in /srv/ccrc/data45/z3289452/GPCC/INTERPOLATION/MASTER/MSTR. I have downloaded the raw files for 1st Jan and 1st July from 1950 to 2013.
Tried making maps with plotly but I could not adjust the size of the markers. Back to making a gif.
###############################################
# Title: mapRawStations
# Author: Steefan Contractor
# Description: ShinyApp that maps raw stations
# used in interpolation on a map
###############################################
# Custom slider function
sliderValues <- function (inputId,
label,
values,
from,
to = NULL,
grid = TRUE,
width = NULL,
postfix = NULL,
prefix = NULL,
dragRange = TRUE,
disable = FALSE,
animate = FALSE) {
validate_fromto <-
function(fromto = NULL,
values = NULL,
default = 0) {
if (!is.null(fromto)) {
if (is.character(values) & is.numeric(fromto)) {
fromto <- fromto - 1
} else {
fromto <- which(values == fromto) - 1
}
} else {
fromto <- default
}
return(fromto)
}
sliderProps <- shiny:::dropNulls(
list(
class = "js-range-slider",
id = inputId,
`data-type` = if (!is.null(to))
"double"
else
"single",
`data-from` = validate_fromto(fromto = from, values = values),
`data-to` = validate_fromto(
fromto = to,
values = values,
default = length(values)
),
`data-grid` = grid,
`data-prefix` = if (is.null(prefix)) {
"null"
} else {
shQuote(prefix, "sh")
},
`data-postfix` = if (is.null(postfix)) {
"null"
} else {
shQuote(postfix, "sh")
},
`data-drag-interval` = dragRange,
`data-disable` = disable,
`data-values` = if (is.numeric(values)) {
paste(values, collapse = ", ")
} else {
paste(shQuote(values, type = "sh"), collapse = ", ")
}
)
)
sliderProps <- lapply(
X = sliderProps,
FUN = function(x) {
if (identical(x, TRUE))
"true"
else if (identical(x, FALSE))
"false"
else
x
}
)
sliderTag <- tags$div(
class = "form-group shiny-input-container",
style = if (!is.null(width))
paste0("width: ", htmltools::validateCssUnit(width), ";"),
if (!is.null(label))
shiny:::controlLabel(inputId, label),
do.call(
tags$input,
list(
type = if (is.numeric(values) &
is.null(to)) {
"number"
} else {
"text"
},
#class = "js-range-slider",
id = inputId,
name = inputId,
value = ""
)
),
tags$style(
whisker::whisker.render(
template =
"input[id='{{id}}'] {
-moz-appearance:textfield;
}
input[id='{{id}}']::-webkit-outer-spin-button,
input[id='{{id}}']::-webkit-inner-spin-button {
-webkit-appearance: none;
margin: 0;
}", data = list(id = inputId))
),
tags$script(
HTML(
whisker::whisker.render(
template = '$("#{{id}}").ionRangeSlider({
type: "{{data-type}}",
from: {{data-from}},
to: {{data-to}},
grid: {{data-grid}},
keyboard: true,
keyboard_step: 1,
postfix: {{data-postfix}},
prefix: {{data-prefix}},
drag_interval: {{data-drag-interval}},
values: [{{data-values}}],
disable: {{data-disable}}
});',
data = sliderProps
)
))
)
if (identical(animate, TRUE))
animate <- animationOptions()
if (!is.null(animate) && !identical(animate, FALSE)) {
if (is.null(animate$playButton))
animate$playButton <- icon("play", lib = "glyphicon")
if (is.null(animate$pauseButton))
animate$pauseButton <- icon("pause", lib = "glyphicon")
sliderTag <- htmltools::tagAppendChild(
sliderTag,
tags$div(class = "slider-animate-container",
tags$a(href = "#", class = "slider-animate-button",
`data-target-id` = inputId, `data-interval` = animate$interval,
`data-loop` = animate$loop, span(class = "play",
animate$playButton),
span(class = "pause",
animate$pauseButton)))
)
}
dep <- htmltools::htmlDependency(
"ionrangeslider",
"2.1.12",
c(href = "shared/ionrangeslider"),
script = "js/ion.rangeSlider.min.js",
stylesheet = c(
"css/ion.rangeSlider.css",
"css/ion.rangeSlider.skinShiny.css"
)
)
htmltools::attachDependencies(sliderTag, dep)
}
library(lubridate)
library(dplyr)
library(shiny)
library(maps)
library(leaflet)
data.dir <- "data/"
dates <- format(seq.Date(from = ymd("19500101"), to = ymd("20131231"), by = "6 months"), "%b %Y")
Regions <- list(Global = 1, NthAmerica = 2, SthAmerica = 3, Africa = 4, Europe = 5,
Russia = 6, SthAsia = 7, EstAsia = 8, Oceania = 9)
lonLim <- list(Global = c(-180, 180), NthAmerica = c(-180, -34), SthAmerica = c(-180, -34), Africa = c(-34, 60), Europe = c(-34, 46),
Russia = c(46, 180), SthAsia = c(60, 98), EstAsia = c(98, 180), Oceania = c(60, 180))
latLim <- list(Global = c(-90, 90), NthAmerica = c(13, 90), SthAmerica = c(-90, 13), Africa = c(-90, 38), Europe = c(38, 90),
Russia = c(28, 90), SthAsia = c(5, 38), EstAsia = c(5, 38), Oceania = c(-90, 5))
ui <- fluidPage(
titlePanel(""),
sidebarPanel(
selectInput(inputId = "regions", label = "Region: ",
choices = Regions),
selectInput(inputId = "compCrit", label = "Completeness Criteria (min # years): ",
choices = c("ALL", "TWENTY", "THIRTY", "FORTY")),
sliderValues(inputId = "Dates", label = "Date: ", values = dates,
from = dates[1], grid = F,
width = "100%", animate = animationOptions(interval = 500))
),
mainPanel(
plotOutput("stationMap")
)
)
server <- function(input, output, session) {
output$stationMap <- renderPlot({
Date <- format(as.Date(paste0("01 ",input$Dates), format = "%d %b %Y"), "%Y%m%d")
data <- read.table(paste0(data.dir, input$compCrit, "/", Date, ".txt"), header = F)
names(data) <- c("Latitude", "Longitude")
#browser()
par(mar=c(1,1,2,1)+0.1)
map("world", interior = T, xlim = lonLim[[as.numeric(input$regions)]],
ylim = latLim[[as.numeric(input$regions)]])
points(data$Longitude, data$Latitude, cex = 0.01, col = "steelblue3", pch = 19)
title("Map of stations interpolated for CCRC Global Gridded Daily Precipitation")
})
}
shinyApp(ui = ui, server = server)
Elke has not replied since Friday, 4th August about the conversion of the ASCII output into netCDF. So now I am trying to figure out how to convert to netCDF on MIRAKEL myself. This is problematic because MIRAKEL does not have R, Python or NCL, so I cannot use any of these.
I am now in the process of copying Kriging output into a folder which can be "tar"ed and copied using scp.
Rsync was the fastest way to copy files. I used sshpass to pass a password to each rsync command. Save the password in a file: echo ********** > rsync_pass; chmod 400 rsync_pass
The copy script is as follows:
#!/bin/bash
wd=/lustre1/rwork2/routwzn/smukeshk/INTERPOLATION/KRIGING/RUNS
for y in `seq 1982 2013`
do
echo $y
for p in a b
do
sshpass -p $(cat rsync_pass) rsync -avz $wd/$y$p/OUT/*.txt z3289452@ccrc163.ccrc.unsw.edu.au:/srv/ccrc/data45/z3289452/GPCC/INTERPOLATION/KRIGING/OUTDIR/
done
done
Modified Markus's ncl script to convert text files of raw data into netCDF.
.ncl file location: /srv/ccrc/data45/z3289452/GPCC/INTERPOLATION/KRIGING/read_KrigingOUT.ncl
All minimum TWENTY year completeness criteria calculations finished.
For some reason 19630630 and 19631224-31 had not run. Rerunning them now.
All actual precipitation has been converted into netCDF. Now copying metadata (Error, Err_K and Station-distribution) to the storm servers using cpfiles.sh and cpfiles.job.
Meeting with Markus and Lisa
Redo the QC on all TX and TW files you can find on the MIRAKEL servers. The issue with the lack of data density seems to be with the step where we merge the unique stations. As a result we will re-write the script that finds unique stations and the script that matches stations, and merge them into one. The way we do this is to find all the GPCC stations and GHCND stations inside each grid cell. Then we match the stations inside each grid cell and merge their time series. Then we find the ten stations with the longest record inside each grid cell. We save an extra four columns of information in the inventory files - the start year, end year, number of daily records since 1950, and number of flagged records - to help with the step above. In the end we compare our map of stations with GPCC by finding all grid cells where we have fewer than 10 stations and seeing if GPCC has more stations for that cell. While redoing the QC we also add extra source flags to let us know where the data comes from, i.e. GPCC, GHCN-Daily, Lisa's datasets etc.
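As a rough illustration of the per-grid-cell ranking step (not the actual implementation, which sits in the Fortran/bash QC scripts; the inventory file name and column order here are made up):
# assumed inventory columns: 1=id 2=lat 3=lon 4=elev 5=start 6=end 7=nrecs 8=nflagged
# sort by record length (descending), then keep the first 10 stations falling in each 1-degree cell
sort -k7,7nr inventory.txt | awk '{ cell = int($2) "_" int($3); if (++count[cell] <= 10) print }' > top10_per_cell.txt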
I am currently reviewing old code in /srv/ccrc/data45/z3289452/NCEI_Automated_QC/GPCC/Post-GPCC/smukeshk/. This folder has three subfolders, "AUSTRALIA", "BRAZIL" and "GHCN-Daily", each containing Fortran scripts.
I am aiming to test run all scripts on one folder (Australia) today. Then the aim is to write bash scripts that, given a list of paths to tx or tw files, set up and run the QC for each country.
Email from Udo regarding the locations of tx/tw files and their formats:
Hi Steefan,
Elke forwarded me your email.
The files with the ending „_nf“ are the loading files that are integrated into the data base; the other ones with “_umf” etc. are intermediate files or for information purposes.
Files starting with “tw” are without time shift, “tx” files are with a possible shift.
For Australia the files should be shifted to the previous day (so you should use the TX files), because originally the daily precipitation is assigned to the day when the observation is taken, but it should (according to our GPCC philosophy) be assigned to the previous day (since most of the observation interval is lying in the previous day).
I hope this will be helpful for you. For more detailed questions you unfortunately will have to wait for Anja’s return.
TO DO:
Looked up which source flags are already in use by the NCEI QC script. In the end we overwrite the source flags anyway, but in case the sflags are used somewhere in the QC scripts we wish to avoid the flags already in use. Below is the code that determines which flags are not in use. The sflags were obtained from here.
sflags = c("Z","R","0","6","C","X","W","K","7","F","B","M","r","E","z","u","b","s","a","G","Q","I","A","N","T","U","H","S")
nums = as.character(seq(0, 9))
letters[which(!(letters %in% sflags))]
## [1] "c" "d" "e" "f" "g" "h" "i" "j" "k" "l" "m" "n" "o" "p" "q" "t" "v"
## [18] "w" "x" "y"
LETTERS[which(!(LETTERS %in% sflags))]
## [1] "D" "J" "L" "O" "P" "V" "Y"
nums[which(!(nums %in% sflags))]
## [1] "1" "2" "3" "4" "5" "8" "9"
The lowercase letters are mostly free so we will primarily be using them.
| sFlag | Source |
|---|---|
| "c" | GPCC-TX files |
| "d" | GPCC-TW files |
| "e" | GPCC-MIRAKEL database |
| "f" | GHCN-Daily (If it does not already have one of the other GHCND sFlags) |
| "g" | Lisa's data - Russia |
| "h" | Lisa's data - Brazil |
| "i" | Lisa's data - Argentina |
| "j" | Lisa's data - South Africa |
| "k" | LACAD |
| "l" | SACAD |
| "m" | ECAD |
The way we set the sFlags (cF3) is to initialise the cF3 variable in the QC scripts with the appropriate flag; it is then overwritten where the GHCN-D data comes in. All GHCN-D data should have a source flag except when the value is missing, so the "f" flag is redundant anyway.
Move the Fortran script files to the GIT repo and develop code in there. Done.
Write a Fortran script that adds start year, end year, number of records since 1950 and number of flagged records since 1950 for each station to the station inventory file. Maybe this should happen in the qc2out file, where it reads all the information. Done.
Modify qc1_gpcc to allow the passing of a source flag when running the script. This flag would be saved alongside every non-missing record. Done.
Meeting with Markus
After the matching is done we will have a list of unique stations. We could add even more extra columns to this list that indicate the number of records that come from GPCC, the ones that come from GHCND data and the ones that come from Lisa's sources. For each grid cell, we then create a list of stations inside this grid and rank them by length of record and start/end attributes. Then we pick the top 10/15/20.
Email from Anja regarding tx/tw file locations
Hi Steefan,
as Udo wrote you, the files begin with tw or tx and end with nf. The files ending with mfk contain many months completely without data and then were converted to nf without these months.
All these data are reformatted but not all of them loaded into MIRAKEL.
The tw and tx files are - as you have seen - in the /e/rhome/routwzn/COUNTRIES directory. Data from GHCND are in the directory /e/rwork2/routwzn (it's too much for /e/rhome/routwzn/COUNTRIES). Some countries are shifted to /media/x21045/COUNTRIES - do you have access to this directory?
The countries with tx data in rhome are:
TURKEY CROATIA AUSTRALIA GREECE NORWAY BENIN_LD AFRIKA_FINK FRANCE HONG_KONG GEORGIA CHINA CANADA IRELAND SLOVENIA DENMARK_GRL ECUADOR SLOVAKIA SURINAME ESTONIA LATVIA LUXEMBURG INDONESIA SASSCAL
The countries with tw data in rhome are:
NEUSEELAND MEXIKO IRAN LITHUANIA PANAMA AUSTRIA OESTERLE FINLAND GREECE KASACHSTAN GHANA JORDAN ROMANIA SEYCHELLEN CHILE MALEDIVEN FRANCE BULGARIA GUINEA_BISSAU URUGUAY ECAD NETHERLANDS RUSSIA UK HONG_KONG SWITZERLAND KOREA_S GEORGIA SYNOP_TW EL_SALVADOR CHINA FRANZ_GUAYANA CANADA IRELAND SLOVENIA ECUADOR SLOVAKIA COSTA_RICA UNGARN NIGERIA ESTONIA MACAO TAIWAN LATVIA LUXEMBURG UKRAINE ETHIOPIA ZAMBIA SYRIA GAMBIA UZBEKISTAN SPAIN CYPRUS JAPAN INDONESIA COLUMBIA CZECH SACAD SASSCAL
Finished cloning the CCRCGlobalGriddedDailyPrecip GIT dir on lce27 (MIRAKEL) The location is /lustre1/rwork2/routwzn/smukeshk/NCEI_QC/Scripts/CCRCGlobalGriddedDailyPrecipitation
All QC scripts as well as useful bash scripts for setting up are available here.
Wrote a bash script mkefolders.sh that takes a list of full tx file paths and sets up the QC directories for each unique country in that list.
#!/bin/bash
invFile=$1
TXW=$2
baseDir='/lustre1/rwork2/routwzn/smukeshk/NCEI_QC/'
scriptDir='/lustre1/rwork2/routwzn/smukeshk/NCEI_QC/Scripts/CCRCGlobalGriddedDailyPrecipitation/QCscripts/'
if [ -z $TXW ]; then
echo "The second argument must be either TX or TW (Case sensitive)"
exit 1
elif [ $TXW != "TX" ] && [ $TXW != "TW" ]; then
echo "The second argument must be either TX or TW (Case sensitive)"
exit 1
fi
if [ ! -d $baseDir$TXW ]; then
echo Directory $baseDir$TXW does not exist
exit 1
fi
if [ ! -f $baseDir$invFile ]; then
echo Inventory file $baseDir$invFile does not exist
exit 1
fi
CTRS=`cat $baseDir$invFile | cut -d'/' -f6 | uniq`  # field 6 of the full path is the COUNTRY directory
for CTRY in $CTRS; do
echo $CTRY
mkdir $baseDir$TXW/$CTRY
mkdir $baseDir$TXW/$CTRY/rawgpcc $baseDir$TXW/$CTRY/rawghcnd $baseDir$TXW/$CTRY/gpcc_por
FLS=`grep $CTRY $baseDir$invFile`
for FL in $FLS; do
cp $FL $baseDir$TXW/$CTRY/rawgpcc/
echo `basename $FL` >> $baseDir$TXW/$CTRY/tx_files.txt
done
done
The setupQC.sh bash script then copies all the QC scripts from the git directory to the QC directory.
#!/bin/bash
invFile=$1
TXW=$2
baseDir='/lustre1/rwork2/routwzn/smukeshk/NCEI_QC/'
scriptDir='/lustre1/rwork2/routwzn/smukeshk/NCEI_QC/Scripts/CCRCGlobalGriddedDailyPrecipitation/QCscripts/'
if [ -z $TXW ]; then
echo "The second argument must be either TX or TW (Case sensitive)"
exit 1
elif [ $TXW != "TX" ] && [ $TXW != "TW" ]; then
echo "The second argument must be either TX or TW (Case sensitive)"
exit 1
fi
if [ ! -d $baseDir$TXW ]; then
echo Directory $baseDir$TXW does not exist
exit 1
fi
if [ ! -f $baseDir$invFile ]; then
echo Inventory file $baseDir$invFile does not exist
exit 1
fi
CTRS=`cat $baseDir$invFile | cut -d'/' -f6 | uniq`
for CTRY in $CTRS; do
echo $CTRY
cp "$scriptDir"*.f95 $baseDir$TXW/$CTRY/
cp "$scriptDir"*.sh $baseDir$TXW/$CTRY/
cp -r "$scriptDir"input4sys $baseDir$TXW/$CTRY/
cd $baseDir$TXW/$CTRY
./compile_QCprograms.sh
done
Running the QC-script.sh script for TX/LATVIA now as a test. If the QC'd data looks good I will run the remaining TX/CTRS tonight.
Output looks OK
All rhome_CTRS/TX QC jobs have finished. Check output to see if they ran OK
I have reorganised /lustre1/rwork2/routwzn/smukeshk/NCEI_QC/ so that it is clear where the raw data comes from. All data is now in a directory QCdata. Inside this directory are directories indicating the source location of the data, e.g. "rhome_CTRS", "rhome_CTRS_Steefan", "rwork_GHCN". More sources will be added later. "rhome_CTRS_Steefan" refers to files that Anja added to the /e/rhome/routwzn/COUNTRIES/Steefan_tx_tw/ dir from x21045/COUNTRIES.
Anja has emailed back with lots of information:
Hi Steefan, at first something about tw and tx files: These files are used to load the data into MIRAKEL. So in these files you have data that are partially corrected but most of QC is done while loading the data. So you don't get the data after the QC.
The difference between tw and tx files is a little bit difficult because it has a history. At first we had files with a format for precipitation up to 999mm (3 digits before the point). Because - very seldom - precipitation with more than 999 mm a day may occur we changed our format to 4 digits before the point. So we put the suffix "nf" to files with the new format. Most of the files have this suffix, some old files were converted to the new format.
At first we had only tw files. Then we detected some data shifted for a day. So we shifted it to the "right" day. These files had "versch" in their name (for shifted, in German verschoben). Then we also wanted to see in MIRAKEL that data were shifted and we created the tx format with an information for shifted data. Data shifted 1 day back have the flag 249.
Now we want to use only tx files because not shifted data are only a special form of shifted data with a shift of 0 days (flag 250 = not shifted). For this case the so-called "Lademodul" has to be modified. So at the moment I produce tx files and for not shifted data also the tw files. So for some countries you can find both, they contain the same data.
After loading the data the tw and tx files are no longer used, and so old files are not converted to the newest format. So you'll find a lot of different forms. I hope this is no problem.
The data of some countries were shifted to x21045/COUNTRIES but not all of these countries have delivered daily data. So I copied all tw and tx files from x21045/COUNTRIES to /e/rhome/routwzn/COUNTRIES/Steefan_tw_tx.
For Australia you should use the data in /e/rhome/routwzn/COUNTRIES/AUSTRALIA/DAILY_201612, the data in /e/rhome/routwzn/COUNTRIES/AUSTRALIA are older and only until 2008. In /e/rhome/routwzn/COUNTRIES/AUSTRALIA/DAILY_201612 the files are split because they are too large for loading. You only must use tx_aus_201612_New_South_Wales_A_nf tx_aus_201612_New_South_Wales_B_nf tx_aus_201612_Queensland_A_nf and so on.
I hope this is not too difficult. If you have any further questions please ask me.
Best regards
Running all TX COUNTRIES in rhome now
All QC jobs finished except for GHCN-Daily. Still have to check if the output was OK (spend a day going through all the folders and making sure the output is OK).
Running a job downloading and reformatting all Mirakel data between 1950 and 2017.
To do this weekend: Finish matchGhcndStns.f95 and use it to merge TX and TW files as a test
The TW and TX files are stored in various directories labelled by COUNTRY. Some directories, however, consisted of raw files from multiple countries, e.g. SYNOP_TW contained synoptic station data from tens of countries and ECAD contained data from many EU countries. So I had to spend a few days properly merging these raw files, because otherwise the last COUNTRY dir to be assimilated would overwrite the previous ones. The matchGPCCStns.f95 script was used to do this. I also merged dirs with the same name between the rhome_CTRS_TW and rhome_CTRS_TX directories. The RUSSIAN data from LISA was also merged with the data in TW format. Once finished there were around 72000 stations.
Once the merge jobs finished, the merged station lists were deduplicated, keeping the last occurrence of each station ID:
for i in `seq 55 94`; do echo Part$i; mv Part$i/merged-stations.txt Part$i/merged-stations_wdupes.txt; tac Part$i/merged-stations_wdupes.txt | awk '!a[$1]++' | tac > Part$i/merged-stations.txt; cat Part$i/merged-stations.txt | wc -l; done
for i in `seq 75 94`; do cd /lustre1/rwork2/routwzn/smukeshk/RawData/Mirakel/MergeGHCNParts/Part"$i"A; echo Part"$i"A; ./submitMergeJob.sh; cd /lustre1/rwork2/routwzn/smukeshk/RawData/Mirakel/MergeGHCNParts/Part"$i"B; echo Part"$i"B; ./submitMergeJob.sh; done; cd /lustre1/rwork2/routwzn/smukeshk/RawData/Mirakel/MergeGHCNParts
for i in `seq 102 108`; do sed -i 's/lc_lang/lc_big/g' Part"$i"A/merge.job; sed -i 's/lc_lang/lc_big/g' Part"$i"B/merge.job; done
for i in `seq 102 108`; do echo Part"$i"A; grep lc_big Part"$i"A/merge.job; echo Part"$i"B; grep lc_big Part"$i"B/merge.job; done
In RawData/Mirakel/MergeGHCNparts, Part109 onwards are merging the TW-TX-LISA_RU data that was unique wrt Mirakel data with GHCN data
ghcnd-stations.txt contained more stations than there are files in ghcnd_all. This resulted in a "file missing" Fortran runtime error in matchGhcndStns.f95 for 65 Parts. Additionally, I had forgotten to run matchGhcndStn.exe on 10 parts (5A/B to 9A/B). All the parts that need to be rerun are now queued and are listed in rerunParts.txt in RawData/Mirakel/MergeGHCNparts/.
Tomorrow I will write a script to investigate the potentially nonunique stations in RawData/Mirakel/MergeParts2/nonuniqueStns.txt. These stations have the same lat/lon to 3 decimal places in RawData/Mirakel/gpcc-stations_extra.txt and RawData/TW-TX-LISA_RU_AR/gpcc-stations_extra.txt, but their elevations differ and the correlation over the overlap is less than 0.90. The aim is to write a script that first gets the overlapping months from the data files using grep on the yearmonth columns, then sed-replaces the 'e' source flags with 'c', and then uses diff --side-by-side --suppress-common-lines to print 1) the number of month lines that have differences and 2) the differing lines themselves, one after the other (not side by side).
diff -W $(( $(tput cols) - 2 )) --side-by-side --suppress-common-lines file1.dly file2.dly | wc -l
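A rough sketch of the planned comparison (the file names, the yearmonth field position and the flag substitution are placeholders that depend on the actual record layout):
f1=station_mirakel.dly   # hypothetical pair of files for one station
f2=station_twtx.dly
months=$(awk '{print $2}' "$f1" "$f2" | sort | uniq -d)            # yearmonths present in both files (field 2 assumed)
grep -wFf <(printf '%s\n' $months) "$f1" | sed 's/e/c/g' > a.tmp   # swap 'e' source flags for 'c' (exact pattern TBD)
grep -wFf <(printf '%s\n' $months) "$f2" > b.tmp
diff --side-by-side --suppress-common-lines a.tmp b.tmp | wc -l    # 1) number of differing month lines
diff a.tmp b.tmp                                                   # 2) the differing lines, one after the other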
awk '{print $1}' unique-stations.txt > uniqueStnIds.txt
grep -Ff uniqueStnIds.txt ../gpcc-stations_extra.txt > nonuniqueMirakel-stations.txt
grep -Ff uniqueStnIds.txt ../../TW-TX-LISA_RU_AR/gpcc-stations_extra.txt > nonuniqueTWXRU-stations.txt
for i in `cat nonuniqueMirakelStnIds.txt`; do sed -i "/$i/d" unique-stations.txt; done ## this takes forever!!
for i in `cat nonuniqueTWXRUStnIds.txt`; do sed -i "/$i/d" unique-stations.txt; done
To do:
Need to finish writing shiftGHCN.f95. The script also creates a station list with extra columns and the new StnId. Also need to merge the duplicate GHCN stations found by grep -Ff dupes.txt uniqueGhcnd-stations.txt.
Then need to merge Lisa_Argentina and cat all the station lists and copy all files from all the parts.
Then need to run reformat Stns script.
Also need to write a script that reads all 145000 stations and creates a station list with extra columns indicating sources.
All this needs to be done by Monday 30th October. On Monday we run the interpolation!
Things I'd like to do in the next version:
Have a different station id because lat+lon is not unique enough
Rerun the QC after all the merging is done to make sure the data is still homogeneous.
Interpolate monthly values as well and use the grids from that to recalculate absolute values from anomalies.
If the stations do not have any overlap they cannot be correlated and hence come up as unique. This is apparent among GHCN stations with the same station ids (same lat/lon) but different elevation in RawData/GHCN-Daily/MergeDupes.
Going over the 111 stations in the unique GHCND stations list which have the same station Ids (same lat/lon to 3 decimal places):
Except in the case of Indian stations, where I merge everything with exactly the same station coords and station names even if they have overlapping time periods.
Do this tomorrow
for i in `cat length.txt`; do for j in `cat regions.txt`; do for k in `seq 1 301`; do mkdir -p $i/$j/Part$k; done; done; done
for i in `cat length.txt`; do for j in `cat regions.txt`; do mkdir $i/$j/trash; mv $i/$j/CONCATENATED $i/$j/trash/; mv $i/$j/MISSING_MONTH $i/$j/trash/; mv $i/$j/FLAGGED $i/$j/trash/; mv $i/$j/BOTH $i/$j/trash/; mkdir $i/$j/CONCATENATED $i/$j/MISSING_MONTH $i/$j/FLAGGED $i/$j/BOTH; done; done
for i in TWENTY THIRTY FORTY; do cp catfiles_ALL.sh catfiles_$i.sh; sed -i "s/ALL/$i/g" catfiles_$i.sh; for j in 2 3 4; do cp catfiles_ALL$j.sh catfiles_$i$j.sh; sed -i "s/ALL/$i/g" catfiles_$i$j.sh; cp catfiles_ALL$j.job catfiles_$i$j.job; sed -i "s/ALL/$i/g" catfiles_$i$j.job; done; done
setwd("/srv/ccrc/data11/z3289452/CCRCGlobalGriddedDailyPrecip/Metadata")
rawdata <- read.table(file = "/srv/ccrc/data11/z3289452/CCRCGlobalGriddedDailyPrecip/Metadata/VERSION1.1_data/station-yearly-summary.txt")
totRecs <- array(dim = c(10,7,68))
totFlgs <- array(dim = c(10,7,68))
for (Reg in 1:10) {
for (Src in 1:7) {
totRecs[Reg,Src,] <- as.numeric(rawdata[(10*2)*(Src - 1) + Reg*2 - 1,])
totFlgs[Reg,Src,] <- as.numeric(rawdata[(10*2)*(Src - 1) + Reg*2,])
}
}
rm(rawdata)
# Now the plotting
years = seq(1950,2017)
par(mar = c(1,1,0,0) + 0.1, oma = c(1,1,1,1)+0.1, bg = NA)
layout(mat = matrix(1:2, ncol = 2, byrow = 2), widths = c(1, 0.5))
cols <- c('black', '#e41a1c',"",'#377eb8','#4daf4a','#984ea3','#ff7f00','#e6ab02')
#c("black","purple", "hotpink", "blue", "darkgreen", "gold","red", "#a6761d")
plot(x = years, y = totRecs[1,1,], type = "n", xlab = "Year", ylab = "Number of stations used in interpolation", cex.axis = 1.5,
xlim = c(1950,2018),ylim = c(0,max(totRecs[1,1,])))
for (Reg in c(1:2,4:8)) {
lines(totRecs[Reg,1,] ~ years,col = cols[Reg], lwd = 4)
}
par(mar = c(0,0,0,0)+0.1)
plot.new()
legend("topleft", legend = c("All", "Africa", "Asia", "Australia", "Europe", "North America", "South America"),
lwd = 4, col = cols[c(1,2,4:8)], cex = 2, bty = "n")
setwd("/srv/ccrc/data11/z3289452/CCRCGlobalGriddedDailyPrecip/Metadata")
rawdata <- read.table(file = "/srv/ccrc/data11/z3289452/CCRCGlobalGriddedDailyPrecip/Metadata/VERSION1.1_data/station-yearly-summary.txt")
totRecs <- array(dim = c(10,7,68))
totFlgs <- array(dim = c(10,7,68))
for (Reg in 1:10) {
for (Src in 1:7) {
totRecs[Reg,Src,] <- as.numeric(rawdata[(10*2)*(Src - 1) + Reg*2 - 1,])
totFlgs[Reg,Src,] <- as.numeric(rawdata[(10*2)*(Src - 1) + Reg*2,])
}
}
rm(rawdata)
# Plotting
# by sources
years = seq(1950,2017)
totRecsGPCC <- totRecs[1,2,] + totRecs[1,3,] + totRecs[1,4,]
totRecsOther <- totRecs[1,5,] + totRecs[1,6,]
totRecsGHCN <- totRecs[1,7,]
par(mar= c(5,5,0.5,1) + 0.1, bg = NA)
plot(totRecsGPCC ~ years, type = "n", xlim = c(1950,2018), ylim = c(0,max(totRecsGPCC)),
xlab = "Year", ylab = "Number of stations used in interpolation", cex.lab = 2, cex.axis = 1.5)
lines(totRecsGPCC ~ years, lwd = 4, col = "black")
lines(totRecsGHCN ~ years, lwd = 4, col = "blue")
lines(totRecsOther ~ years, lwd = 4, col = "darkgreen")
legend("topleft",legend = c("GPCC", "GHCN", "Other"), lwd = 4, col = c("black", "blue", "darkgreen"),
bty = "n", cex = 2)Created 5 yearly summaries for years 1950,55,60,65,70,...,2010 cut -c 15-32,197-209,295-307,393-405,491-503,589-601,687-699,785-797,883-895,981-993,1079-1091,1177-1189,1275-1287,1373-1385,1471-1483 combined-stations_extra.txt > combined-stations_extra_5yearlysummaries.txt
Major New Problem
Turns out my station density is not as good as in Version 0.1.
Attempt to find the problem: Will check a station in China, Iran and Mali to make sure we don't have any station data in those periods in the first place
See Figures:
setwd("/srv/ccrc/data11/z3289452/CCRCGlobalGriddedDailyPrecip/Metadata")
# Now let's try and read the long station list
rawdata <- read.table(file = "combined-stations_extra_5yearlysummaries_CTRYCode.txt", na.strings = c("-9999","-99"))
rawdata.transform <- array(dim = c(135742,15))
for (i in 1:14) {
rawdata.transform[,i] <- rawdata[,2 + i*2 - 1] - rawdata[,2 + i*2]
}
rawdata.transform <- cbind(rawdata[,c(1:2,31)],rawdata.transform)
library(maps)
stationMap.plot <- function(year) {
require(maps)
years <- c(1950,1955,1960,1965,1970,1975,1980,1985,1990,1995,2000,2005,2010,2015)
index = which(years == year)
map("world", interior = F)
dir <- "~/ownCloud/CCRCGlobalGriddedDailyPrecipitation/mapRawStations/data/data0.1/ALL/"
data0.1 <- read.table(paste0(dir, year, "0101.txt"), header = F)
points(data0.1[,2:1], cex = 0.1, pch = 19, col = "red")
points(rawdata.transform[which(rawdata.transform[,index + 3] > 0),2:1], cex = 0.1, pch = 19,col = "darkblue")
title(years[index], cex.main = 2)
}
par(mar = c(1,1,3,1) + 0.1)
layout(mat = matrix(c(1,2,3,4), ncol = 2, byrow = T))
sapply(seq(1950,1965,5), FUN = stationMap.plot)
sapply(seq(1970,1985,5), FUN = stationMap.plot)
sapply(seq(1990,2005,5), FUN = stationMap.plot)
sapply(seq(2010,2015,5), FUN = stationMap.plot)
We plot the archive that we expect to have more stations first, and then the smaller archive on top. This way, if we see any colours of the larger archive, we know that those stations are not in the smaller archive.
Here red stations are the reformatted stations from the older version for the 1st of Jan that year, and blue stations are the un-reformatted stations from the newer version but for all stations with data for that year.
Another related problem
The total number of unflagged stations with data for a year seems to be more than the total number of reformatted stations for the 1st of Jan that year.
There are two possible reasons for this:
Attempt to resolve: Plot station maps of concatenated stations and stations with missing month and see if that accounts for most stations. Also check later if the missing month is calculated okay.
See Figures:
library(maps)
stationMap.plot <- function(year) {
require(maps)
years <- c(1950,1955,1960,1965,1970,1975,1980,1985,1990,1995,2000,2005,2010,2015)
index = which(years == year)
map("world", interior = F)
dir <- "~/ownCloud/CCRCGlobalGriddedDailyPrecipitation/mapRawStations/data/ALL/"
data1.0 <- read.table(paste0(dir, year, "0101.txt"), header = F)
points(data1.0[,2:1], cex = 0.1, pch = 19, col = "red")
points(rawdata.transform[which(rawdata.transform[,index + 3] > 0),2:1], cex = 0.1, pch = 19,col = "darkblue")
title(years[index], cex.main = 2)
}
par(mar = c(1,1,3,1) + 0.1)
layout(mat = matrix(c(1,2,3,4), ncol = 2, byrow = T))
sapply(seq(1950,1965,5), FUN = stationMap.plot)
sapply(seq(1970,1985,5), FUN = stationMap.plot)
sapply(seq(1990,2005,5), FUN = stationMap.plot)
sapply(seq(2010,2015,5), FUN = stationMap.plot)
Here blue stations (plotted first) are all the un-reformatted stations from the newer version with data for that year, and red stations are the reformatted stations from the newer version for the 1st of Jan that year.
Chinese station N31867E117233 starts in 1952 in RawData/COMBINED/combined. It looks like this data comes from Mirakel, because the station does not exist in RawData/TW-TX-LISA_RU_AR/qc2out and the Mirakel data also starts in 1952 and seems identical. I individually checked rhome_CTRS_TW and rhome_CTRS_TX. This station (HEFEI) is in rhome_CTRS_TW and not in rhome_CTRS_TX, and doesn't go back further than 1952.
The station ID from the old version is N31860E117230 and it has data going back to 1950. I checked if GHCN-Daily has this station too. Turns out it does and has station ID CHM00058321 but only has data going back to 1952.
Found data for this station in oflxs449/OESTERLE/
AFRIKA - MALI
Data for station N11350W005683 (ML000061297 GHCN ID) only goes back to 1957 and is not found in GPCC at all, only in GHCN-Daily. I found GPCC station data in the old directory oflxs449/AFRIKA/. Upon checking NCEI_QC/QCdata/rhome_CTRS_TX/AFRIKA_FINK/ I did not find any data for this station.
New Zealand
The raw data given had the wrong format so a lot of the stations were not read in. The QC'd NZ data is available in oflxs449/NEUSEELAND
CZECH
So, as expected, reformatted data for station N49444E012923 going further back was found in oflxs449/CZECH/N49444E12923.dly, but upon looking at the raw GPCC tw-format data, it does not look like the station goes back further! So I do not know where the data is coming from.
I also looked at the Chinese tw-format data in oflxs449/TW/OESTERLE/rawgpcc/tw_chn_oes_lr_nf, but the Chinese data doesn't seem to go back further here either. Clearly there was some mistake in the old code.
IRAN
By looking at the TW-format raw data in QCdata/rhome_CTRS_TW/IRAN/rawgpcc/ it is clear that there is no Iran data going back further than 1951. So although there is no TW-format data from the old version to confirm this, it looks like there was a mistake in this case as well, and the data does not in fact go back to 1950 for Iran.
RUNNING JOBS 1980 to 1986 now
Regarding Issue #2: a lot more stations in unreformatted unflagged data compared to reformatted concatenated data
As suspected, this discrepancy is due to missing monthly values. Furthermore, the unreformatted data from the station list represented all stations with data in a particular year, so plotting it together with reformatted data from the 1st of Jan that year makes it look like the reformatting script missed a lot of data. Only when I aggregated data from the 1st, 8th, 15th, 22nd and 29th of each month for 2010 for the concatenated stations and missing_month stations, and plotted them together with the unreformatted data from 2010, could I see that everything was okay with the reformatting script. So the red maps (stationDifference_beforeR4mat_data1.0_2010-2015 etc.) are misrepresentative, as they make it look like there are lots of stations unaccounted for.
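For the record, the 1st/8th/15th/22nd/29th aggregation was along these lines (a sketch only; the per-day station files and the output names are assumed from the surrounding notes):
# collect unique station coordinates from 5 days of every month of 2010
for m in $(seq -w 1 12); do
  for d in 01 08 15 22 29; do
    cat ALL/2010"$m$d".txt
  done
done | sort -u > 2010.txt
# and similarly for the missing_month station files into 2010_MM.txt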
# Now we will see if the total unreformatted stations in Version 1.0 are made up
# of total concatenated and total missing_month stations
setwd("/srv/ccrc/data11/z3289452/CCRCGlobalGriddedDailyPrecip/Metadata/VERSION1.0_data/")
stationMap.plot <- function(year, unr4matted_data) {
require(maps)
years <- c(1950,1955,1960,1965,1970,1975,1980,1985,1990,1995,2000,2005,2010,2015)
index = which(years == year)
cex = 0.2
# First read concatenated data in ALL
dataCONC <- read.table(paste0(year, ".txt"), header = F)
# Now the missing month data
dataMisMon <- read.table(paste0(year, "_MM.txt"), header = F)
map("world", interior = T)
points(unr4matted_data[which(rawdata.transform[,index + 3] > 0),2:1], cex = cex, pch = 19,col = "darkblue")
points(dataCONC[,2:1], cex = cex, pch = 19, col = "red")
points(dataMisMon[,2:1], cex = cex, pch = 19, col = "darkgreen")
title(years[index], cex.main = 2)
}
# load rawdata.transform from the above code chunk first
stationMap.plot(2010, rawdata.transform)
Running the above code provides the following, more accurate plot. We see that the blue points are almost completely masked by the red and green points. The remaining blue points could be accounted for if we aggregated all days in the year instead of just 5 days each month.
Out of loop interpolation being run right now
More on oflxs449 containing more data than the rhome TW/TX files
I decided to look into the supposed extra AFRIKA data that wasn't given to me in rhome this time around.
I first grepped the station names and WMO IDs in oflxs449/AFRIKA/gpcc-stations.txt against RawData/TW-TX-LISA_RU_AR/gpcc-stations_extra.txt and turned up 100 out of 142 "new" African stations.
Upon looking at the actual .dly data for one of the stations (Name: BOHICON, ID: AFRIKA/N7170E2050.dly), I confirmed my suspicions that the old qcscript somehow copied data 2 years back. I found the exact data string from 195301/02 in the new data, in 195001/02 in the old data!
However, when looking at reformatted data from another station (Name: MAKOKOU, ID: N570E12870/N00570E012870), the data does not look similar at all! So perhaps this IS a station we do not have...
Most of the common African station data in TW-TX-LISA_RU_AR only goes back to 2000. However, there is data going further back in GHCN-Daily. It does not look like there is more data coming from MIRAKEL, but it is possible. When we look at the formatted data for the same station in oflxs449/AFRIKA it looks like completely different data. It does not look like the data is simply shifted two years back, which was the case with a lot of other station data in oflxs449. I think it would be a good idea to contact Anja and ask her if she can dig up data for these 142 African stations.
Anja replied back: data for some of the stations I sent her is in Steefan_tx_tw/SENEGAL. It ends in 1983 and is not in MIRAKEL. She does not know anything about the time stamp for this data either.
She also mentioned that they have now acquired Netherlands data going back to 1951 and further in some cases. The old Netherlands data only went back to 2000.
So things to do today:
cd /e/rhome/routwzn/smukeshk/INTERPOLATION/KRIGING/VERSION1.0
ls ALL/*/kriging.out > ALL_finishedJobs.txt
sed -i 's#/kriging.out##g' ALL_finishedJobs.txt
for i in `cat ALL_finishedJobs.txt`; do ls $i/OUT/*.txt | tail -n 1; done > ALL_lastdays.txt   # last output day of each run
grep -v '.\{17\}0630' ALL_lastdays.txt > tmp             # drop runs whose last day is 30 June (first-half runs that completed)
grep -v '.\{17\}1231' tmp > ALL_restartJobs.txt          # drop runs whose last day is 31 Dec (second-half runs that completed)
ls ALL/????[cd]                                          # list any existing c/d restart directories
#!/bin/bash
wd=/lustre1/rwork2/routwzn/smukeshk/INTERPOLATION/KRIGING/VERSION1.0
for i in `cat $wd/ALL_restartJobs.txt`
do
year=${i:14:4}
dir=${i:4:5}
monthday=${i:18:4}
monthday=$((10#$monthday))
newdate=`printf "%04d%04d" $year $((monthday+1))`
echo $year $dir $monthday $newdate
if [ ! -d $wd/ALL/"$year"c ]; then
newdir="$year"c
mkdir $wd/ALL/$newdir
else
newdir="$year"d
mkdir $wd/ALL/$newdir
fi
cp $wd/ALL/$dir/*.sh $wd/ALL/$dir/*.f90 $wd/ALL/$dir/*.job $wd/ALL/$newdir/
cp -r $wd/ALL/$dir/LAM* $wd/ALL/$dir/PHI* $wd/ALL/$dir/SEA* $wd/ALL/$dir/Subroutinen* $wd/ALL/$newdir/
sed -i "26s/.*/\tStart=$newdate/" $wd/ALL/$newdir/Datensatz-erzeugen.sh
sed -i "s/$dir/$newdir/g" $wd/ALL/$newdir/kriging.job
done
Meeting with Markus and Lisa
Hurricane Andrew
Hurricane Camille
Annual Average Precipitation Timeseries
Annual average
RX1DAY average
Annual total trends
Annual maxima trends
Pacific Islands annual totals
Pacific Islands annual maxima
Now downloading PERSIANN-CDR data for comparison using
for i in `seq 1984 2013`; do wget -r -np -nH --cut-dirs=3 -R index.html* https://www.ncei.noaa.gov/data/precipitation-persiann/access/$i/; done
List of tasks:
Figure 1: Make map of in situ stations colour coded by sources including merged
Figure 2: replace figure (b) by a map of regions
Figure 3: Change label 'CCRC' to 'REGEN' and add min 40 yrs complete data
Figure 4:
Figure 5:
Figure 6:
Figure 9: Remove figure 9 and just have a map of some significant event in Africa and add caveat in text that although we mask most of central Africa, it is still useful to sometimes look at daily events
Once you have REGEN-stations_v1-1_extra.txt, run summariseMetaData.f95 to create station-yearly-summary.txt
To create REGEN-stations_v1-1_extra_5yearlySummaries.txt use the following code:
cut -c 15-32,197-209,295-307,393-405,491-503,589-601,687-699,785-797,883-895,981-993,1079-1091,1177-1189,1275-1287,1373-1385,1471-1483 REGEN-stations_v1-1_extra.txt > REGEN-stations_v1-1_extra_5yearlysummaries.txt
To get the country and region info use the following code:
library(rworldmap)
# Some maps using the metadata
# The metadata is in
setwd("/srv/ccrc/data11/z3289452/CCRCGlobalGriddedDailyPrecip/Metadata/VERSION1.1_data/")
widths <- c(13,8,9,6,2) #,2,30,7,5,4,4,6,6,6,6,6,6,6,6,6,6,6,6,6,5)
metaData <- read.fwf(file = "REGEN-stations_v1-1_coords.txt", widths = widths + 1, header = F, fileEncoding = "UTF-8")
colnames(metaData) <- c("StationID", "Lat", "Lon", "Elev","CTRY")
# The single argument to this function, points, is a data.frame in which:
# - column 1 contains the longitude in degrees
# - column 2 contains the latitude in degrees
# from https://stackoverflow.com/questions/14334970/convert-latitude-and-longitude-coordinates-to-country-name-in-r
coords2country = function(points)
{
countriesSP <- getMap(resolution='high')
#countriesSP <- getMap(resolution='high') #you could use high res map from rworldxtra if you were concerned about detail
# convert our list of points to a SpatialPoints object
# pointsSP = SpatialPoints(points, proj4string=CRS(" +proj=longlat +ellps=WGS84 +datum=WGS84 +no_defs +towgs84=0,0,0"))
#setting CRS directly to that from rworldmap
pointsSP = SpatialPoints(points, proj4string=CRS(proj4string(countriesSP)))
# use 'over' to get indices of the Polygons object containing each point
indices = over(pointsSP, countriesSP)
# return the ADMIN names of each country
list(countryCode = indices$ISO_A2, Region = indices$REGION)
#indices$ISO3 # returns the ISO3 code
#indices$continent # returns the continent (6 continent model)
#indices$REGION # returns the continent (7 continent model)
}
countryInfo <- coords2country(metaData[,c(3,2)])
# Now we write out the countryCodes and Regions
countryInfoDF <- cbind(sapply(countryInfo$countryCode, as.character), sapply(countryInfo$Region, as.character))
# There are 3518 stations which have no country code
sum(is.na(countryInfo$countryCode))
write.table(countryInfoDF,file = "REGEN-stations_v1-1_CountryAndRegion.txt", na = " ",
row.names = F, col.names = F, quote = F)
Then use paste to concatenate the CTRY code and Region information to get REGEN-stations_v1-1_extra_wRegInfo.txt and REGEN-stations_v1-1_extra_5yearlysummaries_wCTRYCode.txt.
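A sketch of the paste step (the exact column selection for each output may differ):
paste -d ' ' REGEN-stations_v1-1_extra.txt REGEN-stations_v1-1_CountryAndRegion.txt > REGEN-stations_v1-1_extra_wRegInfo.txt
paste -d ' ' REGEN-stations_v1-1_extra_5yearlysummaries.txt REGEN-stations_v1-1_CountryAndRegion.txt > REGEN-stations_v1-1_extra_5yearlysummaries_wCTRYCode.txt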
To create the station points every 5 years animation:
# Now let's try and read the long station list
rawdata <- read.table(file = "/srv/ccrc/data11/z3289452/CCRCGlobalGriddedDailyPrecip/Metadata/VERSION1.1_data/REGEN-stations_v1-1_extra_5yearlysummaries.txt",
na.strings = c("-9999","-99"))
rawdata.transform <- array(dim = c(nrow(rawdata),15))
for (i in 1:14) {
rawdata.transform[,i] <- rawdata[,2 + i*2 - 1] - rawdata[,2 + i*2]
}
rawdata.transform <- cbind(rawdata[,c(1:2)],rawdata.transform)
library(animation)
ani.options(interval = 1.0,ani.height = 700, ani.width = 1200, loop = 1)
year <- c(1950,1955,1960,1965,1970,1975,1980,1985,1990,1995,2000,2005,2010)
saveGIF({
for (i in 1:13) {
map("world", interior = F)
points(rawdata.transform[which(rawdata.transform[,i + 3] > 0),2:1], cex = 0.5, pch = 19, col = "darkblue")
title(year[i], cex.main = 3)
}
})