The following SQL query was run before moving to R.
create view worldclim_comp as
select st.altitud,anprec, maxyr,minyr,nyrs,st.id,st.geom, (bioclim(st.geom)).* from
cna.estclim st,
(select station, avg(sum) anprec, count(yr) nyrs, max(yr) maxyr,min(yr) minyr from
(select * from
(select station, extract(year from fecha) yr,sum(prec), count(fecha) nrecs from
cna.prec
group by
station,extract(year from fecha)) d
where nrecs >360) f
group by station) sq
where sq.station=st.id_estacio
The results are imported using RODBC.
library(RODBC)
con <- odbcConnect("gisdb")
d <- sqlQuery(con, "select * from worldclim_comp")
Now to look at at the percentage difference between Worldclim values and the yearly averages for precipitation in the CNA data we just need to divide one column by the other. Anprec is the CNA version and PrecAnnual comes from Worldclim.
boxplot(100 * d$anprec/d$precanual, col = "grey")
This does not look too bad for Worldclim. Most of the values are within 20% of Worldclim. Hijmans coverage seems to be reliable enough to suggest that the outliers are more likely to be the result of anomalies (which may be genuine and interestin or simply mistakes) in the CNA data.
We can look at a more robust subset by only using data from stations with many records.
d1 <- subset(d, nyrs > 30)
str(d1)
## 'data.frame': 60 obs. of 12 variables:
## $ altitud : num 1775 1775 1240 675 620 ...
## $ anprec : num 896 896 1684 1015 4230 ...
## $ maxyr : int 2005 2005 2005 2004 2005 2005 2003 1970 1999 1990 ...
## $ minyr : int 1945 1945 1943 1958 1956 1946 1962 1923 1962 1946 ...
## $ nyrs : int 32 32 57 42 49 51 36 31 35 36 ...
## $ id : int 4972 4971 4852 5099 5160 5175 5184 4891 5118 4786 ...
## $ geom : Factor w/ 230 levels "0101000020E610000000D312F0574557C0979B154884953140",..: 42 42 228 80 151 172 106 190 75 139 ...
## $ elev : int 1889 1889 1262 622 901 485 69 568 707 2111 ...
## $ tempannual : num 16.8 16.8 20.4 24.3 23.8 26.1 27.3 24.7 24.2 15.2 ...
## $ mesmasfrio : num 7.7 7.7 11.3 14.7 15.6 18.8 18.7 15.7 14.5 6.4 ...
## $ mesmascalido: num 25.7 25.7 29.6 33.8 31.7 33.6 35.3 33.3 33.8 23.3 ...
## $ precanual : int 995 995 1677 1264 4034 4513 1908 925 1635 1494 ...
boxplot(100 * d1$anprec/d1$precanual, col = "grey")
In this case the match is quite convincing (using 60 climate stations in Chiapas).
These wil usually be due to human error unless the points fall in an area with very abrupt topogrpahy where 500m laterally can reslut in a grest change in elevation.
elevdif <- d$elev - d$altitud
boxplot(elevdif, col = "grey")
This does need some further work to clarify problems with the data.
What is interesting is that in the raw data the two errors are only slightly correlated. Although a few mistakes lead to a statistically significant corelation, for most cases it appears that some other process has led to the discrpepancy.
precdif <- d$anprec/d$precanual
plot(precdif ~ elevdif, pch = 21, bg = 2)
cor.test(precdif, elevdif, method = "spearman")
##
## Spearman's rank correlation rho
##
## data: precdif and elevdif
## S = 2148848, p-value = 0.01474
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
## rho
## 0.1547