This is an R Markdown format used for publishing markdown documents to GitHub. When you click the Knit button all R code chunks are run and a markdown file (.md) suitable for publishing to GitHub is generated.
Today's Task:
The 35th character of each line is the flag, but it is more easily picked up with awk '{print $5}' filename; awk is the right tool for this. See code:
awk '($4 != -9999.9) && ($5 == "")' filename
awk '($4 == -9999.9) && ($5 == "")' filename
awk '($4 != -9999.9) && ($5 != "")' filename
awk '($4 == -9999.9) && ($5 != "")' filename
Now we write the bash script that iterates over the regions and parts and concatenates files, accounting for missing monthly values and flagged values.
#! /bin/bash
wd=/lustre1/rwork2/routwzn/smukeshk/INTERPOLATION/MASTER/MSTR
HME=ALL
for reg in NthAmerica SthAmerica Africa Europe Russia SthAsia EstAsia Oceania
do
for p in `seq 1 50`
do
echo Concatenating files in $HME/$reg/P"$p"/
for i in $wd/$HME/$reg/P"$p"/*
do
f=`basename $i`
awk '($4 != -9999.9) && ($5 == "")' $i >> $wd/$HME/$reg/CONCATENATED/$f
awk '($4 == -9999.9) && ($5 == "")' $i >> $wd/$HME/$reg/MISSING_MONTH/$f
awk '($4 != -9999.9) && ($5 != "")' $i >> $wd/$HME/$reg/FLAGGED/$f
awk '($4 == -9999.9) && ($5 != "")' $i >> $wd/$HME/$reg/BOTH/$f
done
done
done
Still cannot log in to MIRAKEL; JSAM keeps loading. So I decided to download and compare the CPC and GPCC global daily precipitation datasets, to start the documentation, and to start the literature review for the introduction.
Details of the downloaded CPC dataset can be found on the datawiki: CCRC data wiki: precipitation
Managed to log on to MIRAKEL. Firefox Java applets stopped working, so I ended up installing the Pale Moon browser, which natively supports Java applets. JSAM seems to run stably on Pale Moon.
Created a skeleton for the documentation paper. The next task is to fill in as much of this skeleton as possible. Queued jobs summarising observations by region and concatenating regional daily data files into global daily data files.
We now have summaries of number of stations (flagged, missing months, both, none) by regions.
Checked if all the FINALR4MAT jobs finished. They did! Two jobs, FINALR4MAT49 and 50, had some readGhcnd errors.
==> FINALR4MAT49.out <==
readGhcnd error for station N32383E111667 in record CHM00057265201608TAVG 309H S 296H S 256H S 260H S 254H S 271H S 261H S 251H S 271H S 293H S 286H S 306H S 309H S 312H S 307H S 286H S 287H S 304H S 315H S 291H S 278H S 262H S 281H S 303H S 279H S-9999 -9999 -9999 -9999 -9999 -9999
==> FINALR4MAT50.out <==
readGhcnd error for station S22217E30000 in record ZI000067991199002PRCP 0 Q 0 Q 0 Q 0 Q 0 Q 0 Q 2 Q 40 Q 50 Q 0 Q 0 Q 0 Q 0 Q 0 Q 0 Q 0 Q 0 Q 0 Q 37 Q 0 Q 184 Q 425 Q 0 Q 0 Q 0 Q 0 Q 0 Q 0 Q-9999 -9999 -9999
The error occurs in readGhcnd function while reading the QC'd GPCC data.
The error is due to an incorrect write-out of the raw QC'd GPCC data. The station id is incorrect (I think it is the same as the GHCND station id, 11 characters). This data could easily be fixed by finding the right station id and reading the data in again using rf_forKriging.f95. The FINALR4MAT scripts (rf_forKriging.f95 and gpccmod.f95) are fine.
For now we will not include these stations.
Worked on a functional-programming way of reading in a regional subset of a netCDF file and then creating regional timeseries.
Packages: RNetCDF, fields (for image.plot) and lubridate.
library(RNetCDF)
library(fields)
library(lubridate)
I have converted the original netCDF-4 files into "classic" format using nccopy and then used cdo remapcon2 to flip them about the equator.
nccopy -k "classic" precip.1979.nc precip.1979.classic.nc
# Get griddes and then edit it by
# yfirst = 89.75 -> yfirst = -89.75
# yinc = -0.5 -> yinc = 0.5
cdo griddes precip.1979.classic.nc > griddes_flipped
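# optionally, the two edits above can be scripted instead of done by hand, e.g.:
# sed -i 's/yfirst *=.*/yfirst = -89.75/; s/yinc *=.*/yinc = 0.5/' griddes_flipped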
cdo remapcon2,griddes_flipped precip.1979.classic.nc precip.1979.flipped.nc
The lon and lat values are based on the netCDF description:
nc.dir <- "/srv/ccrc/data45/z3289452/DailyPRCP_netCDFs/CPC_0.5X0.5_Daily_Unified_Global_Precipitation/"
nc.prefix <- "precip."
nc.suffix <- ".flipped.nc"
lon <- seq(0.25, by = 0.5, length.out = 720)
lat <- seq(-89.75, by = 0.5, length.out = 360)
Regional stuff.
#Now plot timeseries by region
# Region 1: NthAmerica
# Region 2: SthAmerica
# Region 3: Africa
# Region 4: Europe
# Region 5: Russia
# Region 6: SthAsia
# Region 7: EstAsia
# Region 8: Oceania
Regions <- list(NthAmerica = list(), SthAmerica = list(), Africa = list(),
Europe = list(), Russia = list(), SthAsia = list(), EstAsia = list(),
Oceania = list())
lon.adjusted = lon
lon.adjusted[lon.adjusted > 180] = lon[lon.adjusted > 180] - 360
Regions$NthAmerica$lon <- which(lon.adjusted <= -34)
Regions$NthAmerica$lat <- which(lat >= 13)
Regions$SthAmerica$lon <- which(lon.adjusted <= -34)
Regions$SthAmerica$lat <- which(lat < 13)
Regions$Africa$lon <- which(lon.adjusted > -34 & lon.adjusted < 60)
Regions$Africa$lat <- which(lat < 38)
Regions$Europe$lon <- which(lon.adjusted > -34 & lon.adjusted <= 46)
Regions$Europe$lat <- which(lat >= 38)
Regions$Russia$lon <- which(lon.adjusted > 46)
Regions$Russia$lat <- which(lat >= 38)
Regions$SthAsia$lon <- which(lon.adjusted >= 60 & lon.adjusted < 98)
Regions$SthAsia$lat <- which(lat >= 5 & lat < 38)
Regions$EstAsia$lon <- which(lon.adjusted >= 98)
Regions$EstAsia$lat <- which(lat >= 5 & lat < 38)
Regions$Oceania$lon <- which(lon.adjusted >= 60)
Regions$Oceania$lat <- which(lat < 5)
Functions to avoid loops. TSfromncdf takes a netCDF file and the indices of the grids to be read, as well as dates, and creates daily, monthly and annual timeseries. If you would like to read all grids, leave lon.index and lat.index out.
TSfromncdf <- function(nc.location, lon.index = NA, lat.index = NA, dates) {
require(RNetCDF)
nc <- open.nc(nc.location)
precip.grid <- var.get.nc(nc, "precip", start = c(ifelse(is.na(sum(lon.index)), NA, lon.index[1]),
ifelse(is.na(sum(lat.index)), NA, lat.index[1]), NA),
count = c(ifelse(is.na(sum(lon.index)), NA, length(lon.index)),
ifelse(is.na(sum(lat.index)), NA, length(lat.index)), NA))
close.nc(nc)
daily <- apply(precip.grid, 3, mean, na.rm = T)
rm(precip.grid); gc()
monthly <- aggregate(daily, by = list(month(dates)), FUN = mean, na.rm = T)$x
annual <- mean(daily, na.rm = T)
return(list(daily = daily, monthly = monthly, annual = annual))
}
give.location <- function(year) {
return(paste0(nc.dir, nc.prefix, year, nc.suffix))
}
combine.yearly.ts <- function(year, reg.num, timescale, Regions) {
writeLines(paste0("year: ", year))
dates <- seq(as.Date(paste0(year,"-01-01")), as.Date(paste0(year,"-12-31")), by = 1)
ts.subset <- TSfromncdf(give.location(year), Regions[[reg.num]]$lon,
Regions[[reg.num]]$lat, dates)[[timescale]]
return(unlist(ts.subset))
}
create.reg.ts <- function(reg.num, timescale, Regions) {
stopifnot(timescale == "daily" | timescale == "monthly")
timescale = ifelse(timescale == "daily", 1, 2)
writeLines(paste0("Creating TS for region: ", names(Regions)[reg.num]))
return(c(unlist(sapply(1979:2013, FUN = combine.yearly.ts, reg.num = reg.num,
timescale = timescale, Regions = Regions))))
}
Now we use an lapply to loop over all regional and yearly read-ins and create a daily or monthly time series. First the daily regional timeseries:
daily.reg.ts <- lapply(1:8, FUN = create.reg.ts, timescale = "daily", Regions)
names(daily.reg.ts) <- names(Regions)
save(daily.reg.ts, file = "/srv/ccrc/data45/z3289452/PhD_repository/RData/CPC_reg_1979-2013_Daily_TS.RData")
Now the daily global timeseries. Just change years to include the years you would like to calculate.
years <- seq(1979, 2013, 1)
glob.ts <- sapply(years, FUN = function(year) {
writeLines(paste0("Year: ", year))
dates <- seq(as.Date(paste0(year,"-01-01")), as.Date(paste0(year,"-12-31")), by = 1)
ts.subset <- TSfromncdf(give.location(year), dates = dates)
return(ts.subset)
})
daily.glob.ts <- unlist(sapply(1:length(years), function(y) {glob.ts[1,y][[1]]}))
monthly.glob.ts <- c(unlist(sapply(1:length(years), function(y) {glob.ts[2,y][[1]]})))
annual.glob.ts <- unlist(sapply(1:length(years), function(y) {glob.ts[3,y][[1]]}))
save(daily.glob.ts, monthly.glob.ts, annual.glob.ts,
file = "/srv/ccrc/data45/z3289452/PhD_repository/RData/CPC_glb_1979-2013_DMY_TS.RData")
Started interpolation for 1950 to 1969 in two parts for each year (e.g. 1950a and 1950b). All runs are in /lustre1/rwork2/routwzn/smukeshk/INTERPOLATION/KRIGING/RUNS.
Wrote setupscripts.sh, which modifies Datensatz-erzeugen.sh and kriging.job in each part directory, getting it ready for job submission.
#!/bin/bash
wd=/lustre1/rwork2/routwzn/smukeshk/INTERPOLATION/KRIGING/RUNS
echo Dirs: A
for y in `seq 1950 1969`
do
echo Year: $y
cd $wd/"$y"a
sed -i "s/Start=195001/Start="$y"01/g" Datensatz-erzeugen.sh
sed -i "s/Ende=195012/Ende="$y"06/g" Datensatz-erzeugen.sh
sed -i "s/Krig1950/Krig"$y"a/g" kriging.job
sed -i "s#P1950#RUNS/"$y"a#g" kriging.job
cd $wd
done
echo Dirs: B
for y in `seq 1950 1969`
do
echo Year: $y
cd $wd/"$y"b
sed -i "s/Start=195001/Start="$y"07/g" Datensatz-erzeugen.sh
sed -i "s/Ende=195012/Ende="$y"12/g" Datensatz-erzeugen.sh
sed -i "s/Krig1950/Krig"$y"b/g" kriging.job
sed -i "s#P1950#RUNS/"$y"b#g" kriging.job
cd $wd
done
Had a chat with Lisa, who suggested looking at the interpolation output to check that there are no artifacts (bull's eyes) before running off and doing the full interpolation.
Very important to email the CMS team with a data sample and publish a dataset abstract with a DOI advertising that the dataset is coming. Create a Data Management Plan for this.
Make sure I have some figures ready by the end of next week for writing the abstract (by the end of July I should have a draft abstract).
Kriging jobs for version 0 of the CCRC dataset finished without hiccups. As far as I can tell there are no memory allocation errors. I copied an output file (19600101.txt in 1960a dir) to /srv/ccrc/data45/z3289452/GPCC/INTERPOLATION/KRIGING/RUNS to make some plots to see if output was reasonable.
The code to check output files is in ownCloud/Working\ Directory/kriging_output_readandtest_24072017.R.
First we plot the output of the kriging output file:
Change out.file to change the output file to be read in. We will change the for loops into functional code later.
library(lubridate)
library(fields)
library(RNetCDF)
setwd("/srv/ccrc/data45/z3289452/GPCC/INTERPOLATION/KRIGING/RUNS")
out.dir <- "OUT/"
#####
run.dir <- "1960a/"
out.file <- "19600101.txt"
#####
if (substr(run.dir, 1, 4) != substr(out.file, 1, 4)) {
warning("Run directory not the same year as output file")
}
out.df <- read.table(paste0(run.dir, out.dir, out.file), header = F)
names(out.df) <- c("lon","lat","PREC")
out.df <- out.df[order(out.df[,1],out.df[,2]),]
lon <- unique(out.df$lon)
lat <- unique(out.df$lat)
out.raster <- array(dim=c(length(lon),length(lat)))
for (i in 1:length(lon)) {
for (j in 1:length(lat)) {
out.raster[i,j] <- out.df$PREC[(i-1)*180+j]
if (out.raster[i,j] == -99999.99) {
# print("blah")
out.raster[i,j] <- NA
}
}
}
Now we do the plotting.
First the worldmap
image.plot(lon, lat, out.raster)
map("world", add = T, interior = F)
title(ymd(substr(out.file, 1, 8)))
Now only USA.
# Let's plot only USA
lon.index <- which(lon <= -65 & lon >= -128)
lat.index <- which(lat <= 55 & lat >= 22)
image.plot(lon[lon.index], lat[lat.index], out.raster[lon.index, lat.index],
xlab = "lon", ylab = "lat")
title(ymd(substr(out.file, 1, 8)))
map("world", "usa", add = T) Now only the wet grids.
# Let's NA the 0 values to see the patterns more clearly
out.raster[which(out.raster < 1)] <- NA
image.plot(lon[lon.index], lat[lat.index], out.raster[lon.index, lat.index],
xlab = "lon", ylab = "lat")
title(ymd(substr(out.file, 1, 8)))
map("world", "usa", add = T)Now we compare the above plots with the same from CPC 0.25X0.25 degree CONUS gridded daily product. We would have compared with CPC global or GPCC_FDD but they do not go back far enough. When the remaining interpolation finishes we will do it.
# Let's read in CPC CONUS to see if the grids look similar
setwd("/srv/ccrc/data45/z3289452/DailyPRCP_netCDFs/CPC_.25x.25_Daily_US_Unified_Precipitation/")
timeunitssince <- function(day, timeunits, since){
require(lubridate)
numdays <- ymd(day) - ymd(since)
if (timeunits == "days") {
return(day(days(numdays)))
} else if (timeunits == "hours") {
return(hour(hours(numdays * 24)))
} else {
stop("timeunits must be \"days\" or \"hours\"")
}
}
day.index <- timeunitssince(ymd(19600101), "days", ymd(19480101)) + 1
nc <- open.nc("precip.V1.0.1948-2006.nc")
CPC.grid <- var.get.nc(nc, "precip", start = c(NA, NA, day.index), count = c(NA, NA, 1))
lon.cpc <- var.get.nc(nc, "lon") - 360 # we subtract 360 to have lon go from (-180, 180) instead of (0, 360)
lat.cpc <- var.get.nc(nc, "lat")
time.cpc <- var.get.nc(nc, "time")Now the plots. No world map this time.
First: all grids
image.plot(lon.cpc, lat.cpc, CPC.grid, xlab = "lon", ylab = "lat")
title(ymd(substr(out.file, 1, 8)))
map("world", "usa", add = T) Now only the wet days:
CPC.grid[which(CPC.grid < 1)] <- NA
image.plot(lon.cpc, lat.cpc, CPC.grid, xlab = "lon", ylab = "lat")
title(ymd(substr(out.file, 1, 8)))
map("world", "usa", add = T)The spatial patterns aren't that similar. Perhaps we need to compare with a 1 degree version by remapping. I am halfway through the process of regridding using CDO to a 1 degree version which I think we need to do anyway for the final interpolation.
Nevertheless, the interpolation for the remaining years is running now.
The interpolation of ALL station data (as opposed to a subset based on completeness criteria, i.e. 20yr, 30yr, 40yr) finished although some of the runs ran out of walltime.
I wrote a script to save the output of the previous run and "re-setup" the directory for another run. This resetupjobs.sh file is available in the RUNS directory: /lustre1/rwork2/routwzn/smukeshk/INTERPOLATION/KRIGING/RUNS/.
#!/bin/bash
wd=/lustre1/rwork2/routwzn/smukeshk/INTERPOLATION/KRIGING/RUNS
folder=$1
cd $wd/$folder
jobnum=`tail -n 1 kriging.jobnum | awk '{print $2}' | sed 's/.xcepbs00//g'`
sed '231s/.*/if [ ! -d "$Ordner_Erg" ]; then mkdir $Ordner_Erg; fi/' Datensatz-erzeugen.sh > tmp
sed -i '232s#.*#if [ ! -d "$Ordner_Erg"/Error ]; then mkdir "$Ordner_Erg"/Error; fi#' tmp
sed -i '233s#.*#if [ ! -d "$Ordner_Erg"/Err_K ]; then mkdir "$Ordner_Erg"/Err_K; fi#' tmp
sed -i '234s#.*#if [ ! -d "$Ordner_Erg"/Stat-verteilung ]; then mkdir "$Ordner_Erg"/Stat-verteilung; fi#' tmp
sed -n '230,235p' tmp
echo "Looks good? [Y/y]"
read response
if [ "$response" = "Y" ] || [ "$response" = "y" ]
then
mv tmp Datensatz-erzeugen.sh
chmod +x Datensatz-erzeugen.sh
fi
rm dates.txt
rm comp-kriging_f90.out
mv kriging.out kriging."$jobnum".out
cd $wd
Before I run resetupjobs.sh I change the Start and Ende parameters in the Datensatz-erzeugen.sh file, by checking in the OUT directory and the kriging.out file which days did not finish. E.g.:
sed -i '26s/Start=.*/Start=19861229/' Datensatz-erzeugen.sh
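The last completed day can be read straight off the OUT directory of the run, e.g. (a sketch):
ls OUT/*.txt | tail -n 1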
After this simply run submitJob.sh from within the folder.
Created comparative maps of CCRC, CPC_global and GPCC_FDD for 1988-01-01.
First we had to regrid the flipped CPC_global 1988 netCDF file onto the same grid as GPCC_FDD (i.e. 1x1 degree resolution starting at lon -179.5):
cdo remapcon2,griddes1X1deg_identicaltoGPCCFDD precip.1988.flipped.nc precip.1988.1X1deg.nc
griddes1X1deg_identicaltoGPCCFDD is printed below and is available in the CPC directory: /srv/ccrc/data45/z3289452/DailyPRCP_netCDFs/CPC_0.5X0.5_Daily_Unified_Global_Precipitation/
#
# gridID 1
#
gridtype = lonlat
gridsize = 64800
xname = lon
xlongname = longitude
xunits = degrees_east
yname = lat
ylongname = latitude
yunits = degrees_north
xsize = 360
ysize = 180
xfirst = -179.5
xinc = 1
yfirst = -89.5
yinc = 1
This procedure produced some negative rainfall values, which were masked out in the R code used to print the plots.
Now the R code to print the comparative plots. First modify and run the importKrigingOutput code chunk with
#####
run.dir <- "1960a/"
out.file <- "19600101.txt"
#####
Now import CPC and GPCC.
ncdf.dir <- '/srv/ccrc/data45/z3289452/DailyPRCP_netCDFs/'
CPC.dir <- 'CPC_0.5X0.5_Daily_Unified_Global_Precipitation/'
GPCC.dir <- 'GPCC_FDD/'
# CPC
nc <- open.nc(paste0(ncdf.dir, CPC.dir,"precip.1988.1X1deg.nc"))
CPC.grid <- var.get.nc(nc, "precip")
lon.cpc <- var.get.nc(nc, "lon")
lat.cpc <- var.get.nc(nc, "lat")
time.cpc <- var.get.nc(nc, "time")
# Remove negative values in CPC (no idea how they got there)
CPC.grid[which(CPC.grid < 0)] <- NA
# Now GPCC-FDD
nc <- open.nc(paste0(ncdf.dir, GPCC.dir,"full_data_daily_1988.nc"))
GPCC.grid <- var.get.nc(nc, "p")
lon.gpcc <- var.get.nc(nc, "lon")
lat.gpcc <- var.get.nc(nc, "lat")
time.gpcc <- var.get.nc(nc, "time")Now the plots. We plot on an exponential scale for clarity.
# Now let's change the breaks and colours
brks <- c(0, 1.7^(seq(1,10)))
require(graphics)
YlGnBu <- colorRampPalette(c("Yellow", "Green", "Blue"))
image.plot(lon, lat, out.raster, breaks = brks, col = YlGnBu(length(brks) - 1),
xlab = "Longitude", ylab = "Latitude")
map("world", add = T, interior = F)
title(paste("CCRC:", ymd(substr(out.file, 1, 8))))
image.plot(lon.cpc, lat.cpc, CPC.grid[,,1], breaks = brks, col = YlGnBu(length(brks) - 1),
xlab = "Longitude", ylab = "Latitude")
map("world", add = T, interior = F)
title(paste("CPC:", ymd(substr(out.file, 1, 8))))
image.plot(lon.gpcc, lat.gpcc, GPCC.grid[,,1], breaks = brks, col = YlGnBu(length(brks) - 1),
xlab = "Longitude", ylab = "Latitude")
map("world", add = T, interior = F)
title(paste("GPCC:", ymd(substr(out.file, 1, 8))))
Ran the remaining kriging jobs that did not finish. They have now finished and we have all grids for interpolation from 1950 to 2013.
Downloaded all the summaries of the number of stations (by region and completeness criteria) for analysis.
First I copied all the summaries into a structured directory (by completeness and region) called cpdir in /lustre1/rwork2/routwzn/smukeshk/INTERPOLATION/MASTER/MSTR/cpdir using the script:
#!/bin/bash
wd=/lustre1/rwork2/routwzn/smukeshk/INTERPOLATION/MASTER/MSTR
for i in ALL TWENTY THIRTY FORTY
do
for reg in NthAmerica SthAmerica Africa Europe Russia SthAsia EstAsia Oceania
do
cp $wd/$i/$reg/*_summary.txt $wd/cpdir/$i/$reg
done
done
cpdir was then compressed (tar -czvf cpdir.tar.gz cpdir) and copied (and uncompressed) using scp to /srv/ccrc/data45/z3289452/GPCC/INTERPOLATION/MASTER/MSTR/SUMMARIES.
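In full, the transfer was roughly as follows (a sketch; the remote host is the same one used for the rsync copies later in this log):
tar -czvf cpdir.tar.gz cpdir
scp cpdir.tar.gz z3289452@ccrc163.ccrc.unsw.edu.au:/srv/ccrc/data45/z3289452/GPCC/INTERPOLATION/MASTER/MSTR/SUMMARIES/
# then, on the receiving side:
tar -xzvf cpdir.tar.gz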
Now we make some line plots of the summaries.
During summary data import in R, some weirdness was discovered in the summary files. The exact details are described in the following script which was used to fix the files. The script location is /srv/ccrc/data45/z3289452/GPCC/INTERPOLATION/MASTER/MSTR/SUMMARIES/fixsummaries.sh
#!/bin/bash
# The summaries have a gibberish first line
# because for some reason there is a file called '*'
# in the folders
# ${f:0:8} returns nothing and
# cat $i | wc -l returns the sum of all line numbers
# of all files in directory
# Also we would like to remove the last line of the
# summaries which is 20161231
wd=/srv/ccrc/data45/z3289452/GPCC/INTERPOLATION/MASTER/MSTR/SUMMARIES
for i in ALL TWENTY THIRTY FORTY
do
for reg in NthAmerica SthAmerica Africa Europe Russia SthAsia EstAsia Oceania
do
for f in $wd/$i/$reg/*
do
ln=`grep -n 19500101 "$f" | cut -d : -f 1`
if [ "$ln" -gt 1 ]
then
sed -i '1d' "$f"
fi
sed -i '$d' "$f"
done
done
done
Now that the summaries are fixed we import the summary files into R and plot them.
First the data import. I would have liked to do this in a functional way, but it was difficult to have a function create the nested list and name it.
# summaries of stations
wd <- '/srv/ccrc/data45/z3289452/GPCC/INTERPOLATION/MASTER/MSTR/SUMMARIES'
setwd(wd)
len_criteria <- list('ALL', 'TWENTY', 'THIRTY', 'FORTY')
data_type <- list('CONCATENATED', 'FLAGGED', 'MISSING_MONTH', 'BOTH')
region <- list('NthAmerica', 'SthAmerica', 'Africa', 'Europe', 'Russia', 'SthAsia', 'EstAsia', 'Oceania')
summaries <- list()
for (l in 1:length(len_criteria)) {
summaries.tmp2 <- list()
for (d in 1:length(data_type)) {
summaries.tmp1 <- list()
for (r in 1:length(region)) {
summaries.tmp1[[r]] <- read.table(paste0(len_criteria[l], '/',
region[r], '/',
data_type[d], '_summary.txt'), header = F)
names(summaries.tmp1)[r] <- region[r]
summaries.tmp1[[r]][,1] <- ymd(summaries.tmp1[[r]][,1]) #Convert into dates
}
summaries.tmp2[[d]] <- summaries.tmp1
names(summaries.tmp2)[d] <- data_type[d]
}
summaries[[l]] <- summaries.tmp2
names(summaries)[l] <- len_criteria[l]
}
rm(summaries.tmp1, summaries.tmp2)
Now the plotting: This is functional! 😄
plot.regs <- function(datType, data_type, summaries.lenCrit, lenCriteria) {
data_num <- match(datType, data_type)
ylim = range(sapply(summaries.lenCrit[[data_num]], function(x){range(x[,2])}))
cols <- list("violet", "brown", "blue", "green", "gold", "orange", "red", "magenta")
plot(summaries.lenCrit[[data_num]][[1]][,1], summaries.lenCrit[[data_num]][[1]][,2],
type = 'n', xlab = "Dates", ylab = "Number of stations",
xlim = c(ymd('19500101'), ymd('20131231')), ylim = ylim)
plot.line <- function(list, col) {
lines(list[,1], list[,2], col = col)
}
mapply(FUN = plot.line, summaries.lenCrit[[data_num]], cols)
if (data_num != 1) {
legend("topleft", legend = names(summaries.lenCrit[[data_num]]), col = unlist(cols),
lty = rep(1, length(region)), cex = 0.85)
}
title(paste0(lenCriteria, ": ", datType, ": Number of stations over time"))
}
plot.summaries <- function(lenCriteria, len_criteria, data_type, summaries) {
lenCrit_num <- match(lenCriteria, len_criteria)
lapply(data_type, FUN = plot.regs, data_type, summaries[[lenCrit_num]], lenCriteria)
}
layout(mat = matrix(1:4, ncol = 2, byrow = T))
lapply(len_criteria, FUN = plot.summaries, len_criteria, data_type, summaries)
A new private git repo in /srv/ccrc/data45/z3289452/GIT_REPOS has been created.
Need to download raw station data for CONCATENATED, TWENTY, THIRTY and FORTY to see the distribution of stations by year and region. They are available in /srv/ccrc/data45/z3289452/GPCC/INTERPOLATION/MASTER/MSTR. I have downloaded the raw files for 1st Jan and 1st July from 1950 to 2013.
Tried making maps with plotly but I could not adjust the size of the markers. Back to making a gif.
###############################################
# Title: mapRawStations
# Author: Steefan Contractor
# Description: ShinyApp that maps raw stations
# used in interpolation on a map
###############################################
# Custom slider function
sliderValues <- function (inputId,
label,
values,
from,
to = NULL,
grid = TRUE,
width = NULL,
postfix = NULL,
prefix = NULL,
dragRange = TRUE,
disable = FALSE,
animate = FALSE) {
validate_fromto <-
function(fromto = NULL,
values = NULL,
default = 0) {
if (!is.null(fromto)) {
if (is.character(values) & is.numeric(fromto)) {
fromto <- fromto - 1
} else {
fromto <- which(values == fromto) - 1
}
} else {
fromto <- default
}
return(fromto)
}
sliderProps <- shiny:::dropNulls(
list(
class = "js-range-slider",
id = inputId,
`data-type` = if (!is.null(to))
"double"
else
"single",
`data-from` = validate_fromto(fromto = from, values = values),
`data-to` = validate_fromto(
fromto = to,
values = values,
default = length(values)
),
`data-grid` = grid,
`data-prefix` = if (is.null(prefix)) {
"null"
} else {
shQuote(prefix, "sh")
},
`data-postfix` = if (is.null(postfix)) {
"null"
} else {
shQuote(postfix, "sh")
},
`data-drag-interval` = dragRange,
`data-disable` = disable,
`data-values` = if (is.numeric(values)) {
paste(values, collapse = ", ")
} else {
paste(shQuote(values, type = "sh"), collapse = ", ")
}
)
)
sliderProps <- lapply(
X = sliderProps,
FUN = function(x) {
if (identical(x, TRUE))
"true"
else if (identical(x, FALSE))
"false"
else
x
}
)
sliderTag <- tags$div(
class = "form-group shiny-input-container",
style = if (!is.null(width))
paste0("width: ", htmltools::validateCssUnit(width), ";"),
if (!is.null(label))
shiny:::controlLabel(inputId, label),
do.call(
tags$input,
list(
type = if (is.numeric(values) &
is.null(to)) {
"number"
} else {
"text"
},
#class = "js-range-slider",
id = inputId,
name = inputId,
value = ""
)
),
tags$style(
whisker::whisker.render(
template =
"input[id='{{id}}'] {
-moz-appearance:textfield;
}
input[id='{{id}}']::-webkit-outer-spin-button,
input[id='{{id}}']::-webkit-inner-spin-button {
-webkit-appearance: none;
margin: 0;
}", data = list(id = inputId))
),
tags$script(
HTML(
whisker::whisker.render(
template = '$("#{{id}}").ionRangeSlider({
type: "{{data-type}}",
from: {{data-from}},
to: {{data-to}},
grid: {{data-grid}},
keyboard: true,
keyboard_step: 1,
postfix: {{data-postfix}},
prefix: {{data-prefix}},
drag_interval: {{data-drag-interval}},
values: [{{data-values}}],
disable: {{data-disable}}
});',
data = sliderProps
)
))
)
if (identical(animate, TRUE))
animate <- animationOptions()
if (!is.null(animate) && !identical(animate, FALSE)) {
if (is.null(animate$playButton))
animate$playButton <- icon("play", lib = "glyphicon")
if (is.null(animate$pauseButton))
animate$pauseButton <- icon("pause", lib = "glyphicon")
sliderTag <- htmltools::tagAppendChild(
sliderTag,
tags$div(class = "slider-animate-container",
tags$a(href = "#", class = "slider-animate-button",
`data-target-id` = inputId, `data-interval` = animate$interval,
`data-loop` = animate$loop, span(class = "play",
animate$playButton),
span(class = "pause",
animate$pauseButton)))
)
}
dep <- htmltools::htmlDependency(
"ionrangeslider",
"2.1.12",
c(href = "shared/ionrangeslider"),
script = "js/ion.rangeSlider.min.js",
stylesheet = c(
"css/ion.rangeSlider.css",
"css/ion.rangeSlider.skinShiny.css"
)
)
htmltools::attachDependencies(sliderTag, dep)
}
library(lubridate)
library(dplyr)
library(shiny)
library(maps)
library(leaflet)
data.dir <- "data/"
dates <- format(seq.Date(from = ymd("19500101"), to = ymd("20131231"), by = "6 months"), "%b %Y")
Regions <- list(Global = 1, NthAmerica = 2, SthAmerica = 3, Africa = 4, Europe = 5,
Russia = 6, SthAsia = 7, EstAsia = 8, Oceania = 9)
lonLim <- list(Global = c(-180, 180), NthAmerica = c(-180, -34), SthAmerica = c(-180, -34), Africa = c(-34, 60), Europe = c(-34, 46),
Russia = c(46, 180), SthAsia = c(60, 98), EstAsia = c(98, 180), Oceania = c(60, 180))
latLim <- list(Global = c(-90, 90), NthAmerica = c(13, 90), SthAmerica = c(-90, 13), Africa = c(-90, 38), Europe = c(38, 90),
Russia = c(28, 90), SthAsia = c(5, 38), EstAsia = c(5, 38), Oceania = c(-90, 5))
ui <- fluidPage(
titlePanel(""),
sidebarPanel(
selectInput(inputId = "regions", label = "Region: ",
choices = Regions),
selectInput(inputId = "compCrit", label = "Completeness Criteria (min # years): ",
choices = c("ALL", "TWENTY", "THIRTY", "FORTY")),
sliderValues(inputId = "Dates", label = "Date: ", values = dates,
from = dates[1], grid = F,
width = "100%", animate = animationOptions(interval = 500))
),
mainPanel(
plotOutput("stationMap")
)
)
server <- function(input, output, session) {
output$stationMap <- renderPlot({
Date <- format(as.Date(paste0("01 ",input$Dates), format = "%d %b %Y"), "%Y%m%d")
data <- read.table(paste0(data.dir, input$compCrit, "/", Date, ".txt"), header = F)
names(data) <- c("Latitude", "Longitude")
#browser()
par(mar=c(1,1,2,1)+0.1)
map("world", interior = T, xlim = lonLim[[as.numeric(input$regions)]],
ylim = latLim[[as.numeric(input$regions)]])
points(data$Longitude, data$Latitude, cex = 0.01, col = "steelblue3", pch = 19)
title("Map of stations interpolated for CCRC Global Gridded Daily Precipitation")
})
}
shinyApp(ui = ui, server = server)
Elke has not replied since Friday, 4th August about the conversion of the ASCII output into netCDF. So now I am trying to figure out how to convert to netCDF on MIRAKEL myself. This is problematic because MIRAKEL does not have R, Python or NCL, so I cannot use any of these.
I am now in the process of copying Kriging output into a folder which can be "tar"ed and copied using scp.
Rsync was the fastest way to copy files. I used sshpass to pass a password to each rsync command. Save the password in a file: echo ********** > rsync_pass; chmod 400 rsync_pass
The copy script is as follows:
#!/bin/bash
wd=/lustre1/rwork2/routwzn/smukeshk/INTERPOLATION/KRIGING/RUNS
for y in `seq 1982 2013`
do
echo $y
for p in a b
do
sshpass -p $(cat rsync_pass) rsync -avz $wd/$y$p/OUT/*.txt z3289452@ccrc163.ccrc.unsw.edu.au:/srv/ccrc/data45/z3289452/GPCC/INTERPOLATION/KRIGING/OUTDIR/
done
done
Modified Markus's ncl script to convert text files of raw data into netCDF.
.ncl file location: /srv/ccrc/data45/z3289452/GPCC/INTERPOLATION/KRIGING/read_KrigingOUT.ncl
All minimum TWENTY year completeness criteria calculations finished.
For some reason 19630630 and 19631224-31 had not run. Rerunning them now.
All actual precipitation has been converted into netCDF. Now copying metadata (Error, Err_K and Station-distribution) to the storm servers using cpfiles.sh and cpfiles.job.
Meeting with Markus and Lisa
Redo the QC on all TX and TW files you can find on the MIRAKEL servers. The issue with the lack of data density seems to be with the step where we merge the unique stations. As a result we will re-write the script that finds unique stations and the script that matches stations, and merge them into one. The way we do this is to find all the GPCC stations and GHCND stations inside each grid cell. Then we match the stations inside each grid cell and merge their time series. Then we find the ten stations with the longest record inside each grid cell. We save an extra four columns of information in the inventory files - the start year, end year, number of daily records since 1950, and number of flagged records - to help with the step above. In the end we compare our map of stations with GPCC by finding all grid cells where we have fewer than 10 stations and seeing if GPCC has more stations for that cell. While redoing the QC we also add extra source flags to let us know where the data comes from, i.e. GPCC, GHCN-Daily, Lisa's datasets etc.
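As a rough illustration of the per-grid-cell ranking step (not the actual implementation, which sits in the Fortran/bash QC scripts; the inventory file name and column order here are made up):
# assumed inventory columns: 1=id 2=lat 3=lon 4=elev 5=start 6=end 7=nrecs 8=nflagged
# sort by record length (descending), then keep the first 10 stations falling in each 1-degree cell
sort -k7,7nr inventory.txt | awk '{ cell = int($2) "_" int($3); if (++count[cell] <= 10) print }' > top10_per_cell.txt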
I am currently reviewing old code in /srv/ccrc/data45/z3289452/NCEI_Automated_QC/GPCC/Post-GPCC/smukeshk/. This folder has three subfolders, "AUSTRALIA", "BRAZIL" and "GHCN-Daily", each containing Fortran scripts.
I am aiming to test run all scripts on one folder (Australia) today. Then the aim is to write bash scripts that, given a list of paths to tx or tw files, set up and run the QC for each country.
Email from Udo regarding the locations of tx/tw files and their formats:
Hi Steefan,
Elke forwarded me your email.
The files with the ending „_nf“ are the loading files that are integrated into the data base; the other ones with “_umf” etc. are intermediate files or for information purposes.
Files starting with “tw” are without time shift, “tx” files are with a possible shift.
For Australia the files should be shifted to the previous day (so you should use the TX files), because originally the daily precipitation is assigned to the day when the observation is taken, but it should (according to our GPCC philosophy) be assigned to the previous day (since most of the observation interval is lying in the previous day).
I hope this will be helpful for you. For more detailed questions you unfortunately will have to wait for Anja’s return.
TO DO:
Looked up which source flags are already in use by the NCEI QC script. In the end we overwrite the source flags anyway, but in case the sflags are used somewhere in the QC scripts we wish to avoid the flags already in use. Below is the code that determines which flags are not in use. The sflags were obtained from here.
sflags = c("Z","R","0","6","C","X","W","K","7","F","B","M","r","E","z","u","b","s","a","G","Q","I","A","N","T","U","H","S")
nums = as.character(seq(0, 9))
letters[which(!(letters %in% sflags))]
## [1] "c" "d" "e" "f" "g" "h" "i" "j" "k" "l" "m" "n" "o" "p" "q" "t" "v"
## [18] "w" "x" "y"
LETTERS[which(!(LETTERS %in% sflags))]
## [1] "D" "J" "L" "O" "P" "V" "Y"
nums[which(!(nums %in% sflags))]
## [1] "1" "2" "3" "4" "5" "8" "9"
The lowercase letters are mostly free so we will primarily be using them.
| sFlag | Source |
|---|---|
| "c" | GPCC-TX files |
| "d" | GPCC-TW files |
| "e" | GPCC-MIRAKEL database |
| "f" | GHCN-Daily (If it does not already have one of the other GHCND sFlags) |
| "g" | Lisa's data - Russia |
| "h" | Lisa's data - Brazil |
| "i" | Lisa's data - Argentina |
| "j" | Lisa's data - South Africa |
| "k" | LACAD |
| "l" | SACAD |
| "m" | ECAD |
The way we set the sFlags (cF3) is to initialise the cF3 variable in the QC scripts with the appropriate flag; it is then overwritten where the GHCN-D data comes in. All GHCN-D data should have a source flag except when the value is missing, so the "f" flag is redundant anyway.
Move the Fortran script files to the GIT repo and develop code in there. Done.
Write a Fortran script that adds start year, end year, number of records since 1950 and number of flagged records since 1950 for each station to the station inventory file. Maybe this should happen in the qc2out file, where it reads all the information. Done.
Modify qc1_gpcc to allow the passing of a source flag when running the script. This flag would be saved alongside every non-missing record. Done.
Meeting with Markus
After the matching is done we will have a list of unique stations. We could add even more extra columns to this list that indicate the number of records that come from GPCC, the ones that come from GHCND data and the ones that come from Lisa's sources. For each grid cell, we then create a list of stations inside this grid and rank them by length of record and start/end attributes. Then we pick the top 10/15/20.
Email from Anja regarding tx/tw file locations
Hi Steefan,
as Udo wrote you, the files begin with tw or tx and end with nf. The files ending with mfk contain many months completely without data and then were converted to nf without these months.
All these data are reformatted but not all of them loaded into MIRAKEL.
The tw and tx files are - as you have seen - in the /e/rhome/routwzn/COUNTRIES directory. Data from GHCND are in the directory /e/rwork2/routwzn (it's too much for /e/rhome/routwzn/COUNTRIES). Some countries are shifted to /media/x21045/COUNTRIES - do you have access to this directory?
The countries with tx data in rhome are:
TURKEY CROATIA AUSTRALIA GREECE NORWAY BENIN_LD AFRIKA_FINK FRANCE HONG_KONG GEORGIA CHINA CANADA IRELAND SLOVENIA DENMARK_GRL ECUADOR SLOVAKIA SURINAME ESTONIA LATVIA LUXEMBURG INDONESIA SASSCAL
The countries with tw data in rhome are:
NEUSEELAND MEXIKO IRAN LITHUANIA PANAMA AUSTRIA OESTERLE FINLAND GREECE KASACHSTAN GHANA JORDAN ROMANIA SEYCHELLEN CHILE MALEDIVEN FRANCE BULGARIA GUINEA_BISSAU URUGUAY ECAD NETHERLANDS RUSSIA UK HONG_KONG SWITZERLAND KOREA_S GEORGIA SYNOP_TW EL_SALVADOR CHINA FRANZ_GUAYANA CANADA IRELAND SLOVENIA ECUADOR SLOVAKIA COSTA_RICA UNGARN NIGERIA ESTONIA MACAO TAIWAN LATVIA LUXEMBURG UKRAINE ETHIOPIA ZAMBIA SYRIA GAMBIA UZBEKISTAN SPAIN CYPRUS JAPAN INDONESIA COLUMBIA CZECH SACAD SASSCAL
Finished cloning the CCRCGlobalGriddedDailyPrecip GIT dir on lce27 (MIRAKEL) The location is /lustre1/rwork2/routwzn/smukeshk/NCEI_QC/Scripts/CCRCGlobalGriddedDailyPrecipitation
All QC scripts as well as useful bash scripts for setting up are available here.
Wrote a bash script mkefolders.sh that takes a list of full tx file paths and sets up the QC directories for each unique country in that list.
#!/bin/bash
invFile=$1
TXW=$2
baseDir='/lustre1/rwork2/routwzn/smukeshk/NCEI_QC/'
scriptDir='/lustre1/rwork2/routwzn/smukeshk/NCEI_QC/Scripts/CCRCGlobalGriddedDailyPrecipitation/QCscripts/'
if [ -z $TXW ]; then
echo "The second argument must be either TX or TW (Case sensitive)"
exit 1
elif [ $TXW != "TX" ] && [ $TXW != "TW" ]; then
echo "The second argument must be either TX or TW (Case sensitive)"
exit 1
fi
if [ ! -d $baseDir$TXW ]; then
echo Directory $baseDir$TXW does not exist
exit 1
fi
if [ ! -f $baseDir$invFile ]; then
echo Inventory file $baseDir$invFile does not exist
exit 1
fi
CTRS=`cat $baseDir$invFile | cut -d'/' -f6 | uniq`  # field 6 of the full path is the COUNTRY directory
for CTRY in $CTRS; do
echo $CTRY
mkdir $baseDir$TXW/$CTRY
mkdir $baseDir$TXW/$CTRY/rawgpcc $baseDir$TXW/$CTRY/rawghcnd $baseDir$TXW/$CTRY/gpcc_por
FLS=`grep $CTRY $baseDir$invFile`
for FL in $FLS; do
cp $FL $baseDir$TXW/$CTRY/rawgpcc/
echo `basename $FL` >> $baseDir$TXW/$CTRY/tx_files.txt
done
done
The setupQC.sh bash script then copies all the QC scripts from the git directory to the QC directory.
#!/bin/bash
invFile=$1
TXW=$2
baseDir='/lustre1/rwork2/routwzn/smukeshk/NCEI_QC/'
scriptDir='/lustre1/rwork2/routwzn/smukeshk/NCEI_QC/Scripts/CCRCGlobalGriddedDailyPrecipitation/QCscripts/'
if [ -z $TXW ]; then
echo "The second argument must be either TX or TW (Case sensitive)"
exit 1
elif [ $TXW != "TX" ] && [ $TXW != "TW" ]; then
echo "The second argument must be either TX or TW (Case sensitive)"
exit 1
fi
if [ ! -d $baseDir$TXW ]; then
echo Directory $baseDir$TXW does not exist
exit 1
fi
if [ ! -f $baseDir$invFile ]; then
echo Inventory file $baseDir$invFile does not exist
exit 1
fi
CTRS=`cat $baseDir$invFile | cut -d'/' -f6 | uniq`
for CTRY in $CTRS; do
echo $CTRY
cp "$scriptDir"*.f95 $baseDir$TXW/$CTRY/
cp "$scriptDir"*.sh $baseDir$TXW/$CTRY/
cp -r "$scriptDir"input4sys $baseDir$TXW/$CTRY/
cd $baseDir$TXW/$CTRY
./compile_QCprograms.sh
done
Running the QC-script.sh script for TX/LATVIA now as a test. If the QC'd data looks good I will run the remaining TX/CTRS tonight.
Output looks OK
All rhome_CTRS/TX QC jobs have finished. Check output to see if they ran OK
I have reorganised /lustre1/rwork2/routwzn/smukeshk/NCEI_QC/ so that it is clear where the raw data comes from. All data is now in a directory QCdata. Inside this directory are directories indicating the source location of the data, e.g. "rhome_CTRS", "rhome_CTRS_Steefan", "rwork_GHCN". More sources will be added later. "rhome_CTRS_Steefan" refers to files that Anja added to the /e/rhome/routwzn/COUNTRIES/Steefan_tx_tw/ dir from x21045/COUNTRIES.
Anja has emailed back with lots of information:
Hi Steefan, at first something about tw and tx files: These files are used to load the data into MIRAKEL. So in these files you have data that are partially corrected but most of QC is done while loading the data. So you don't get the data after the QC.
The difference between tw and tx files is a little bit difficult because it has a history. At first we had files with a format for precipitation up to 999mm (3 digits before the point). Because - very seldom - precipitation with more than 999 mm a day may occur we changed our format to 4 digits before the point. So we put the suffix "nf" to files with the new format. Most of the files have this suffix, some old files were converted to the new format.
At first we had only tw files. Then we detected some data shifted for a day. So we shifted it to the "right" day. These files had "versch" in their name (for shifted, in German verschoben). Then we also wanted to see in MIRAKEL that data were shifted and we created the tx format with an information for shifted data. Data shifted 1 day back have the flag 249.
Now we want to use only tx files because not shifted data are only a special form of shifted data with a shift of 0 days (flag 250 = not shifted). For this case the so-called "Lademodul" has to be modified. So at the moment I produce tx files and for not shifted data also the tw files. So for some countries you can find both, they contain the same data.
After loading the data the tw and tx files are no longer used, and so old files are not converted to the newest format. So you'll find a lot of different forms. I hope this is no problem.
The data of some countries were shifted to x21045/COUNTRIES but not all of these countries have delivered daily data. So I copied all tw and tx files from x21045/COUNTRIES to /e/rhome/routwzn/COUNTRIES/Steefan_tw_tx.
For Australia you should use the data in /e/rhome/routwzn/COUNTRIES/AUSTRALIA/DAILY_201612, the data in /e/rhome/routwzn/COUNTRIES/AUSTRALIA are older and only until 2008. In /e/rhome/routwzn/COUNTRIES/AUSTRALIA/DAILY_201612 the files are split because they are too large for loading. You only must use tx_aus_201612_New_South_Wales_A_nf tx_aus_201612_New_South_Wales_B_nf tx_aus_201612_Queensland_A_nf and so on.
I hope this is not too difficult. If you have any further questions please ask me.
Best regards
Running all TX COUNTRIES in rhome now
All QC jobs finished except for GHCN-Daily. Still have to check if the output was OK (spend a day going through all the folders and making sure the output is OK).
Running a job downloading and reformatting all Mirakel data between 1950 and 2017.
To do this weekend: Finish matchGhcndStns.f95 and use it to merge TX and TW files as a test
The TW and TX files are stored in various directories labelled by COUNTRY. Some directories, however, consisted of raw files from multiple countries, e.g. SYNOP_TW contained synoptic station data from tens of countries and ECAD contained data from many EU countries. So I had to spend a few days properly merging these raw files, because otherwise the last COUNTRY dir to be assimilated would overwrite the previous ones. The matchGPCCStns.f95 script was used to do this. I also merged dirs with the same name between the rhome_CTRS_TW and rhome_CTRS_TX directories. The RUSSIAN data from LISA was also merged with the data in TW format. Once finished there were around 72000 stations.
Once the merge jobs finished, the merged station lists were deduplicated, keeping the last occurrence of each station ID:
for i in `seq 55 94`; do echo Part$i; mv Part$i/merged-stations.txt Part$i/merged-stations_wdupes.txt; tac Part$i/merged-stations_wdupes.txt | awk '!a[$1]++' | tac > Part$i/merged-stations.txt; cat Part$i/merged-stations.txt | wc -l; done
for i in `seq 75 94`; do cd /lustre1/rwork2/routwzn/smukeshk/RawData/Mirakel/MergeGHCNParts/Part"$i"A; echo Part"$i"A; ./submitMergeJob.sh; cd /lustre1/rwork2/routwzn/smukeshk/RawData/Mirakel/MergeGHCNParts/Part"$i"B; echo Part"$i"B; ./submitMergeJob.sh; done; cd /lustre1/rwork2/routwzn/smukeshk/RawData/Mirakel/MergeGHCNParts
for i in `seq 102 108`; do sed -i 's/lc_lang/lc_big/g' Part"$i"A/merge.job; sed -i 's/lc_lang/lc_big/g' Part"$i"B/merge.job; done
for i in `seq 102 108`; do echo Part"$i"A; grep lc_big Part"$i"A/merge.job; echo Part"$i"B; grep lc_big Part"$i"B/merge.job; done
In RawData/Mirakel/MergeGHCNparts, Part109 onwards are merging the TW-TX-LISA_RU data that was unique wrt Mirakel data with GHCN data
ghcnd-stations.txt contained more stations than there are files in ghcnd_all. This resulted in a "file missing" Fortran runtime error in matchGhcndStns.f95 for 65 Parts. Additionally, I had forgotten to run matchGhcndStn.exe on 10 parts (5A/B to 9A/B). All the parts that need to be rerun are now queued and are listed in rerunParts.txt in RawData/Mirakel/MergeGHCNparts/.
Tomorrow I will write a script to investigate the potentially nonunique stations in RawData/Mirakel/MergeParts2/nonuniqueStns.txt. These stations have the same lat/lon to 3 decimal places in RawData/Mirakel/gpcc-stations_extra.txt and RawData/TW-TX-LISA_RU_AR/gpcc-stations_extra.txt, but their elevations differ and the correlation over the overlap is less than 0.90. The aim is to write a script that first gets the overlapping months from the data files using grep on the yearmonth columns, then sed-replaces the 'e' source flags with 'c', and then uses diff --side-by-side --suppress-common-lines to print 1) the number of month lines that have differences and 2) the differing lines themselves, one after the other (not side by side).
diff -W $(( $(tput cols) - 2 )) --side-by-side --suppress-common-lines file1.dly file2.dly | wc -l
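A rough sketch of the planned comparison (the file names, the yearmonth field position and the flag substitution are placeholders that depend on the actual record layout):
f1=station_mirakel.dly   # hypothetical pair of files for one station
f2=station_twtx.dly
months=$(awk '{print $2}' "$f1" "$f2" | sort | uniq -d)            # yearmonths present in both files (field 2 assumed)
grep -wFf <(printf '%s\n' $months) "$f1" | sed 's/e/c/g' > a.tmp   # swap 'e' source flags for 'c' (exact pattern TBD)
grep -wFf <(printf '%s\n' $months) "$f2" > b.tmp
diff --side-by-side --suppress-common-lines a.tmp b.tmp | wc -l    # 1) number of differing month lines
diff a.tmp b.tmp                                                   # 2) the differing lines, one after the other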
awk '{print $1}' unique-stations.txt > uniqueStnIds.txt
grep -Ff uniqueStnIds.txt ../gpcc-stations_extra.txt > nonuniqueMirakel-stations.txt
grep -Ff uniqueStnIds.txt ../../TW-TX-LISA_RU_AR/gpcc-stations_extra.txt > nonuniqueTWXRU-stations.txt
for i in `cat nonuniqueMirakelStnIds.txt`; do sed -i "/$i/d" unique-stations.txt; done ## this takes forever!!
for i in `cat nonuniqueTWXRUStnIds.txt`; do sed -i "/$i/d" unique-stations.txt; done
To do:
Need to finish writing shiftGHCN.f95. The script also creates a station list with extra columns and the new StnId. Also need to merge the duplicate GHCN stations found by grep -Ff dupes.txt uniqueGhcnd-stations.txt.
Then need to merge Lisa_Argentina and cat all the station lists and copy all files from all the parts.
Then need to run reformat Stns script.
Also need to write a script that reads all 145000 stations and creates a station list with extra columns indicating sources.
All this needs to be done by Monday 30th October. On Monday we run the interpolation!
Things I'd like to do in the next version:
Have a different station id because lat+lon is not unique enough
Rerun the QC after all the merging is done to make sure the data is still homogeneous.
Interpolate monthly values as well and use the grids from that to recalculate absolute values from anomalies.
If the stations do not have any overlap they cannot be correlated and hence come up as unique. This is apparent among GHCN stations with the same station ids (same lat/lon) but different elevation in RawData/GHCN-Daily/MergeDupes.
Going over the 111 stations in the unique GHCND stations list which have the same station Ids (same lat/lon to 3 decimal places):
Except in the case of Indian stations, where I merge everything with exactly the same station coords and station names even if they have overlapping time periods.
Do this tomorrow
for i in `cat length.txt`; do for j in `cat regions.txt`; do for k in `seq 1 301`; do mkdir -p $i/$j/Part$k; done; done; done
for i in `cat length.txt`; do for j in `cat regions.txt`; do mkdir $i/$j/trash; mv $i/$j/CONCATENATED $i/$j/trash/; mv $i/$j/MISSING_MONTH $i/$j/trash/; mv $i/$j/FLAGGED $i/$j/trash/; mv $i/$j/BOTH $i/$j/trash/; mkdir $i/$j/CONCATENATED $i/$j/MISSING_MONTH $i/$j/FLAGGED $i/$j/BOTH; done; done
for i in TWENTY THIRTY FORTY; do cp catfiles_ALL.sh catfiles_$i.sh; sed -i "s/ALL/$i/g" catfiles_$i.sh; for j in 2 3 4; do cp catfiles_ALL$j.sh catfiles_$i$j.sh; sed -i "s/ALL/$i/g" catfiles_$i$j.sh; cp catfiles_ALL$j.job catfiles_$i$j.job; sed -i "s/ALL/$i/g" catfiles_$i$j.job; done; done
setwd("/srv/ccrc/data11/z3289452/CCRCGlobalGriddedDailyPrecip/Metadata")
rawdata <- read.table(file = "/srv/ccrc/data11/z3289452/CCRCGlobalGriddedDailyPrecip/Metadata/VERSION1.1_data/station-yearly-summary.txt")
totRecs <- array(dim = c(10,7,68))
totFlgs <- array(dim = c(10,7,68))
for (Reg in 1:10) {
for (Src in 1:7) {
totRecs[Reg,Src,] <- as.numeric(rawdata[(10*2)*(Src - 1) + Reg*2 - 1,])
totFlgs[Reg,Src,] <- as.numeric(rawdata[(10*2)*(Src - 1) + Reg*2,])
}
}
rm(rawdata)
# Now the plotting
years = seq(1950,2017)
par(mar = c(1,1,0,0) + 0.1, oma = c(1,1,1,1)+0.1, bg = NA)
layout(mat = matrix(1:2, ncol = 2, byrow = 2), widths = c(1, 0.5))
cols <- c('black', '#e41a1c',"",'#377eb8','#4daf4a','#984ea3','#ff7f00','#e6ab02')
#c("black","purple", "hotpink", "blue", "darkgreen", "gold","red", "#a6761d")
plot(x = years, y = totRecs[1,1,], type = "n", xlab = "Year", ylab = "Number of stations used in interpolation", cex.axis = 1.5,
xlim = c(1950,2018),ylim = c(0,max(totRecs[1,1,])))
for (Reg in c(1:2,4:8)) {
lines(totRecs[Reg,1,] ~ years,col = cols[Reg], lwd = 4)
}
par(mar = c(0,0,0,0)+0.1)
plot.new()
legend("topleft", legend = c("All", "Africa", "Asia", "Australia", "Europe", "North America", "South America"),
lwd = 4, col = cols[c(1,2,4:8)], cex = 2, bty = "n")
setwd("/srv/ccrc/data11/z3289452/CCRCGlobalGriddedDailyPrecip/Metadata")
rawdata <- read.table(file = "/srv/ccrc/data11/z3289452/CCRCGlobalGriddedDailyPrecip/Metadata/VERSION1.1_data/station-yearly-summary.txt")
totRecs <- array(dim = c(10,7,68))
totFlgs <- array(dim = c(10,7,68))
for (Reg in 1:10) {
for (Src in 1:7) {
totRecs[Reg,Src,] <- as.numeric(rawdata[(10*2)*(Src - 1) + Reg*2 - 1,])
totFlgs[Reg,Src,] <- as.numeric(rawdata[(10*2)*(Src - 1) + Reg*2,])
}
}
rm(rawdata)
# Plotting
# by sources
years = seq(1950,2017)
totRecsGPCC <- totRecs[1,2,] + totRecs[1,3,] + totRecs[1,4,]
totRecsOther <- totRecs[1,5,] + totRecs[1,6,]
totRecsGHCN <- totRecs[1,7,]
par(mar= c(5,5,0.5,1) + 0.1, bg = NA)
plot(totRecsGPCC ~ years, type = "n", xlim = c(1950,2018), ylim = c(0,max(totRecsGPCC)),
xlab = "Year", ylab = "Number of stations used in interpolation", cex.lab = 2, cex.axis = 1.5)
lines(totRecsGPCC ~ years, lwd = 4, col = "black")
lines(totRecsGHCN ~ years, lwd = 4, col = "blue")
lines(totRecsOther ~ years, lwd = 4, col = "darkgreen")
legend("topleft",legend = c("GPCC", "GHCN", "Other"), lwd = 4, col = c("black", "blue", "darkgreen"),
bty = "n", cex = 2)Created 5 yearly summaries for years 1950,55,60,65,70,...,2010 cut -c 15-32,197-209,295-307,393-405,491-503,589-601,687-699,785-797,883-895,981-993,1079-1091,1177-1189,1275-1287,1373-1385,1471-1483 combined-stations_extra.txt > combined-stations_extra_5yearlysummaries.txt
Major New Problem
Turns out my station density is not as good as in Version 0.1.
Attempt to find the problem: Will check a station in China, Iran and Mali to make sure we don't have any station data in those periods in the first place
See Figures:
setwd("/srv/ccrc/data11/z3289452/CCRCGlobalGriddedDailyPrecip/Metadata")
# Now let's try and read the long station list
rawdata <- read.table(file = "combined-stations_extra_5yearlysummaries_CTRYCode.txt", na.strings = c("-9999","-99"))
rawdata.transform <- array(dim = c(135742,15))
for (i in 1:14) {
rawdata.transform[,i] <- rawdata[,2 + i*2 - 1] - rawdata[,2 + i*2]
}
rawdata.transform <- cbind(rawdata[,c(1:2,31)],rawdata.transform)
library(maps)
stationMap.plot <- function(year) {
require(maps)
years <- c(1950,1955,1960,1965,1970,1975,1980,1985,1990,1995,2000,2005,2010,2015)
index = which(years == year)
map("world", interior = F)
dir <- "~/ownCloud/CCRCGlobalGriddedDailyPrecipitation/mapRawStations/data/data0.1/ALL/"
data0.1 <- read.table(paste0(dir, year, "0101.txt"), header = F)
points(data0.1[,2:1], cex = 0.1, pch = 19, col = "red")
points(rawdata.transform[which(rawdata.transform[,index + 3] > 0),2:1], cex = 0.1, pch = 19,col = "darkblue")
title(years[index], cex.main = 2)
}
par(mar = c(1,1,3,1) + 0.1)
layout(mat = matrix(c(1,2,3,4), ncol = 2, byrow = T))
sapply(seq(1950,1965,5), FUN = stationMap.plot)
sapply(seq(1970,1985,5), FUN = stationMap.plot)
sapply(seq(1990,2005,5), FUN = stationMap.plot)
sapply(seq(2010,2015,5), FUN = stationMap.plot)
We plot the archive that we expect to have more stations first, and then the smaller archive on top. This way, if we see any colours of the larger archive, we know that those stations are not in the smaller archive.
Here red stations are the reformatted stations from the older version for the 1st of Jan that year, and blue stations are the un-reformatted stations from the newer version but for all stations with data for that year.
Another related problem
The total number of unflagged stations with data for a year seems to be more than the total number of reformatted stations for the 1st of Jan that year.
There are two possible reasons for this:
Attempt to resolve: Plot station maps of concatenated stations and stations with missing month and see if that accounts for most stations. Also check later if the missing month is calculated okay.
See Figures:
library(maps)
stationMap.plot <- function(year) {
require(maps)
years <- c(1950,1955,1960,1965,1970,1975,1980,1985,1990,1995,2000,2005,2010,2015)
index = which(years == year)
map("world", interior = F)
dir <- "~/ownCloud/CCRCGlobalGriddedDailyPrecipitation/mapRawStations/data/ALL/"
data1.0 <- read.table(paste0(dir, year, "0101.txt"), header = F)
points(data1.0[,2:1], cex = 0.1, pch = 19, col = "red")
points(rawdata.transform[which(rawdata.transform[,index + 3] > 0),2:1], cex = 0.1, pch = 19,col = "darkblue")
title(years[index], cex.main = 2)
}
par(mar = c(1,1,3,1) + 0.1)
layout(mat = matrix(c(1,2,3,4), ncol = 2, byrow = T))
sapply(seq(1950,1965,5), FUN = stationMap.plot)
sapply(seq(1970,1985,5), FUN = stationMap.plot)
sapply(seq(1990,2005,5), FUN = stationMap.plot)
sapply(seq(2010,2015,5), FUN = stationMap.plot)
Here blue stations (plotted first) are all the un-reformatted stations from the newer version with data for that year, and red stations are the reformatted stations from the newer version for the 1st of Jan that year.
Chinese station N31867E117233 starts in 1952 in RawData/COMBINED/combined. It looks like this data comes from Mirakel, because the station does not exist in RawData/TW-TX-LISA_RU_AR/qc2out and the Mirakel data also starts in 1952 and seems identical. I individually checked rhome_CTRS_TW and rhome_CTRS_TX. This station (HEFEI) is in rhome_CTRS_TW and not in rhome_CTRS_TX, and doesn't go back further than 1952.
The station ID from the old version is N31860E117230 and it has data going back to 1950. I checked if GHCN-Daily has this station too. Turns out it does and has station ID CHM00058321 but only has data going back to 1952.
Found data for this station in oflxs449/OESTERLE/
AFRIKA - MALI
Data for station N11350W005683 (ML000061297 GHCN ID) only goes back to 1957 and is not found in GPCC at all, only in GHCN-Daily. I found GPCC station data in the old directory oflxs449/AFRIKA/. Upon checking NCEI_QC/QCdata/rhome_CTRS_TX/AFRIKA_FINK/ I did not find any data for this station.
New Zealand
The raw data given had the wrong format so a lot of the stations were not read in. The QC'd NZ data is available in oflxs449/NEUSEELAND
CZECH
So, as expected, reformatted data for station N49444E012923 going further back was found in oflxs449/CZECH/N49444E12923.dly, but upon looking at the raw GPCC tw-format data, it does not look like the station goes back further! So I do not know where the data is coming from.
I also looked at the Chinese tw-format data in oflxs449/TW/OESTERLE/rawgpcc/tw_chn_oes_lr_nf, but the Chinese data doesn't seem to go back further here either. Clearly there was some mistake in the old code.
IRAN
By looking at the TW-format raw data in QCdata/rhome_CTRS_TW/IRAN/rawgpcc/ it is clear that there is no Iran data going back further than 1951. So although there is no TW-format data from the old version to confirm this, it looks like there was a mistake in this case as well, and the data does not in fact go back to 1950 for Iran.
RUNNING JOBS 1980 to 1986 now
Regarding Issue #2: a lot more stations in unreformatted unflagged data compared to reformatted concatenated data
As suspected, this discrepancy is due to missing monthly values. Furthermore, the unreformatted data from the station list represented all stations with data in a particular year, so plotting it together with reformatted data from the 1st of Jan that year makes it look like the reformatting script missed a lot of data. Only when I aggregated data from the 1st, 8th, 15th, 22nd and 29th of each month for 2010 for the concatenated stations and missing_month stations, and plotted them together with the unreformatted data from 2010, could I see that everything was okay with the reformatting script. So the red maps (stationDifference_beforeR4mat_data1.0_2010-2015 etc.) are misrepresentative, as they make it look like there are lots of stations unaccounted for.
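For the record, the 1st/8th/15th/22nd/29th aggregation was along these lines (a sketch only; the per-day station files and the output names are assumed from the surrounding notes):
# collect unique station coordinates from 5 days of every month of 2010
for m in $(seq -w 1 12); do
  for d in 01 08 15 22 29; do
    cat ALL/2010"$m$d".txt
  done
done | sort -u > 2010.txt
# and similarly for the missing_month station files into 2010_MM.txt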
# Now we will see if the total unreformatted stations in Version 1.0 are made up
# of total concatenated and total missing_month stations
setwd("/srv/ccrc/data11/z3289452/CCRCGlobalGriddedDailyPrecip/Metadata/VERSION1.0_data/")
stationMap.plot <- function(year, unr4matted_data) {
require(maps)
years <- c(1950,1955,1960,1965,1970,1975,1980,1985,1990,1995,2000,2005,2010,2015)
index = which(years == year)
cex = 0.2
# First read concatenated data in ALL
dataCONC <- read.table(paste0(year, ".txt"), header = F)
# Now the missing month data
dataMisMon <- read.table(paste0(year, "_MM.txt"), header = F)
map("world", interior = T)
points(unr4matted_data[which(rawdata.transform[,index + 3] > 0),2:1], cex = cex, pch = 19,col = "darkblue")
points(dataCONC[,2:1], cex = cex, pch = 19, col = "red")
points(dataMisMon[,2:1], cex = cex, pch = 19, col = "darkgreen")
title(years[index], cex.main = 2)
}
# load rawdata.transform from the above code chunk first
stationMap.plot(2010, rawdata.transform)
Running the above code provides the following, more accurate plot. We see that the blue points are almost completely masked by the red and green points. The remaining blue points could be accounted for if we aggregated all days in the year instead of just 5 days each month.
Out of loop interpolation being run right now
More on oflxs449 containing more data than the rhome TW/TX files
I decided to look into the supposed extra AFRIKA data that wasn't given to me in rhome this time around.
I first grepped the station names and WMO IDs in oflxs449/AFRIKA/gpcc-stations.txt against RawData/TW-TX-LISA_RU_AR/gpcc-stations_extra.txt and turned up 100 out of 142 "new" African stations.
Upon looking at the actual .dly data for one of the stations (Name: BOHICON, ID: AFRIKA/N7170E2050.dly), I confirmed my suspicions that the old qcscript somehow copied data 2 years back. I found the exact data string from 195301/02 in the new data, in 195001/02 in the old data!
However, when looking at reformatted data from another station (Name: MAKOKOU, ID: N570E12870/N00570E012870), the data does not look similar at all! So perhaps this IS a station we do not have...
Most of the common African station data in TW-TX-LISA_RU_AR only goes back to 2000. However, there is data going further back in GHCN-Daily. It does not look like there is more data coming from MIRAKEL, but it is possible. When we look at the formatted data for the same station in oflxs449/AFRIKA it looks like completely different data. It does not look like the data is simply shifted two years back, which was the case with a lot of other station data in oflxs449. I think it would be a good idea to contact Anja and ask her if she can dig up data for these 142 African stations.
Anja replied back: data for some of the stations I sent her is in Steefan_tx_tw/SENEGAL. It ends in 1983 and is not in MIRAKEL. She does not know anything about the time stamp for this data either.
She also mentioned that they have now acquired Netherlands data going back to 1951 and further in some cases. The old Netherlands data only went back to 2000.
So things to do today:
cd /e/rhome/routwzn/smukeshk/INTERPOLATION/KRIGING/VERSION1.0
ls ALL/*/kriging.out > ALL_finishedJobs.txt
sed -i 's#/kriging.out##g' ALL_finishedJobs.txt
for i in `cat ALL_finishedJobs.txt`; do ls $i/OUT/*.txt | tail -n 1; done > ALL_lastdays.txt   # last output day of each run
grep -v '.\{17\}0630' ALL_lastdays.txt > tmp             # drop runs whose last day is 30 June (first-half runs that completed)
grep -v '.\{17\}1231' tmp > ALL_restartJobs.txt          # drop runs whose last day is 31 Dec (second-half runs that completed)
ls ALL/????[cd]                                          # list any existing c/d restart directories
#!/bin/bash
wd=/lustre1/rwork2/routwzn/smukeshk/INTERPOLATION/KRIGING/VERSION1.0
for i in `cat $wd/ALL_restartJobs.txt`
do
year=${i:14:4}
dir=${i:4:5}
monthday=${i:18:4}
monthday=$((10#$monthday))
newdate=`printf "%04d%04d" $year $((monthday+1))`
echo $year $dir $monthday $newdate
if [ ! -d $wd/ALL/"$year"c ]; then
newdir="$year"c
mkdir $wd/ALL/$newdir
else
newdir="$year"d
mkdir $wd/ALL/$newdir
fi
cp $wd/ALL/$dir/*.sh $wd/ALL/$dir/*.f90 $wd/ALL/$dir/*.job $wd/ALL/$newdir/
cp -r $wd/ALL/$dir/LAM* $wd/ALL/$dir/PHI* $wd/ALL/$dir/SEA* $wd/ALL/$dir/Subroutinen* $wd/ALL/$newdir/
sed -i "26s/.*/\tStart=$newdate/" $wd/ALL/$newdir/Datensatz-erzeugen.sh
sed -i "s/$dir/$newdir/g" $wd/ALL/$newdir/kriging.job
done
Meeting with Markus and Lisa
Hurricane Andrew
Hurricane Camille
Annual Average Precipitation Timeseries
Annual average
RX1DAY average
Annual total trends
Annual maxima trends
Pacific Islands annual totals
Pacific Islands annual maxima
Now downloading PERSIANN-CDR data for comparison using
for i in `seq 1984 2013`; do wget -r -np -nH --cut-dirs=3 -R index.html* https://www.ncei.noaa.gov/data/precipitation-persiann/access/$i/; done
List of tasks:
Figure 1: Make map of in situ stations colour coded by sources including merged
Figure 2: replace figure (b) by a map of regions
Figure 3: Change label 'CCRC' to 'REGEN' and add min 40 yrs complete data
Figure 4:
Figure 5:
Figure 6:
Figure 9: Remove figure 9 and just have a map of some significant event in Africa and add caveat in text that although we mask most of central Africa, it is still useful to sometimes look at daily events
Once you have REGEN-stations_v1-1_extra.txt, run summariseMetaData.f95 to create station-yearly-summary.txt
To create REGEN-stations_v1-1_extra_5yearlySummaries.txt use the following code:
cut -c 15-32,197-209,295-307,393-405,491-503,589-601,687-699,785-797,883-895,981-993,1079-1091,1177-1189,1275-1287,1373-1385,1471-1483 REGEN-stations_v1-1_extra.txt > REGEN-stations_v1-1_extra_5yearlysummaries.txt
To get the country and region info use the following code:
library(rworldmap)
# Some maps using the metadata
# The metadata is in
setwd("/srv/ccrc/data11/z3289452/CCRCGlobalGriddedDailyPrecip/Metadata/VERSION1.1_data/")
widths <- c(13,8,9,6,2) #,2,30,7,5,4,4,6,6,6,6,6,6,6,6,6,6,6,6,6,5)
metaData <- read.fwf(file = "REGEN-stations_v1-1_coords.txt", widths = widths + 1, header = F, fileEncoding = "UTF-8")
colnames(metaData) <- c("StationID", "Lat", "Lon", "Elev","CTRY")
# The single argument to this function, points, is a data.frame in which:
# - column 1 contains the longitude in degrees
# - column 2 contains the latitude in degrees
# from https://stackoverflow.com/questions/14334970/convert-latitude-and-longitude-coordinates-to-country-name-in-r
coords2country = function(points)
{
countriesSP <- getMap(resolution='high')
#countriesSP <- getMap(resolution='high') #you could use high res map from rworldxtra if you were concerned about detail
# convert our list of points to a SpatialPoints object
# pointsSP = SpatialPoints(points, proj4string=CRS(" +proj=longlat +ellps=WGS84 +datum=WGS84 +no_defs +towgs84=0,0,0"))
#setting CRS directly to that from rworldmap
pointsSP = SpatialPoints(points, proj4string=CRS(proj4string(countriesSP)))
# use 'over' to get indices of the Polygons object containing each point
indices = over(pointsSP, countriesSP)
# return the ADMIN names of each country
list(countryCode = indices$ISO_A2, Region = indices$REGION)
#indices$ISO3 # returns the ISO3 code
#indices$continent # returns the continent (6 continent model)
#indices$REGION # returns the continent (7 continent model)
}
countryInfo <- coords2country(metaData[,c(3,2)])
# Now we write out the countryCodes and Regions
countryInfoDF <- cbind(sapply(countryInfo$countryCode, as.character), sapply(countryInfo$Region, as.character))
# There are 3518 stations which have no country code
sum(is.na(countryInfo$countryCode))
write.table(countryInfoDF,file = "REGEN-stations_v1-1_CountryAndRegion.txt", na = " ",
row.names = F, col.names = F, quote = F)
Then use paste to concatenate the CTRY code and Region information to get REGEN-stations_v1-1_extra_wRegInfo.txt and REGEN-stations_v1-1_extra_5yearlysummaries_wCTRYCode.txt.
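A sketch of the paste step (the exact column selection for each output may differ):
paste -d ' ' REGEN-stations_v1-1_extra.txt REGEN-stations_v1-1_CountryAndRegion.txt > REGEN-stations_v1-1_extra_wRegInfo.txt
paste -d ' ' REGEN-stations_v1-1_extra_5yearlysummaries.txt REGEN-stations_v1-1_CountryAndRegion.txt > REGEN-stations_v1-1_extra_5yearlysummaries_wCTRYCode.txt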
To create the station points every 5 years animation:
# Now let's try and read the long station list
rawdata <- read.table(file = "/srv/ccrc/data11/z3289452/CCRCGlobalGriddedDailyPrecip/Metadata/VERSION1.1_data/REGEN-stations_v1-1_extra_5yearlysummaries.txt",
na.strings = c("-9999","-99"))
rawdata.transform <- array(dim = c(nrow(rawdata),15))
for (i in 1:14) {
rawdata.transform[,i] <- rawdata[,2 + i*2 - 1] - rawdata[,2 + i*2]
}
rawdata.transform <- cbind(rawdata[,c(1:2)],rawdata.transform)
library(animation)
ani.options(interval = 1.0,ani.height = 700, ani.width = 1200, loop = 1)
year <- c(1950,1955,1960,1965,1970,1975,1980,1985,1990,1995,2000,2005,2010)
saveGIF({
for (i in 1:13) {
map("world", interior = F)
points(rawdata.transform[which(rawdata.transform[,i + 3] > 0),2:1], cex = 0.5, pch = 19, col = "darkblue")
title(year[i], cex.main = 3)
}
})