library(sp)
library(ggplot2)
library(gstat)
library(dplyr)
library(htmlwidgets)
library(leaflet)
library(maptools)
library(tidyr)
wind.points<-read.table("data/wind.txt",header = TRUE)
In geostatistical data there has to be a measurement at each location, but this data set does not contain any measurements beyond the coordinates themselves. It therefore looks like point-process data, which is concerned with the pattern of the locations rather than with values observed at them.
Basically, both latitude and longitude are presented and interpreted relative to a reference. The latitudes are positive because the points lie north of the equator, and the longitudes are negative because the points lie west of the Greenwich meridian.
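As a quick check (a minimal sketch; it assumes the coordinate columns are named long_DD and lat_DD, as they are used below), the summaries should show that every longitude is negative and every latitude is positive:
### sign check: longitudes should all be negative (west), latitudes positive (north)
summary(wind.points$long_DD)
summary(wind.points$lat_DD)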
leaflet(data=wind.points) %>%
addProviderTiles("Stamen.Toner")%>%
setView(-93.65, 42.0285, zoom = 4) %>%
addMarkers(~long_DD,~lat_DD,popup = ~as.character(FID))
North Dakota, Iowa, Nebraska, Kansas, Texas
Except for the points located in Iowa (which are in zone 15), all of the points are in zone 14. This creates a small problem if we decide to measure distances between points, so I prefer to use zone 14, which contains most of the points.
Actually, since the points fall into two different zones (14 and 15), there is a slight problem: measuring the distance between points located in different zones introduces some distortion and error into the calculation.
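As a rough confirmation (a sketch based on the standard rule that UTM zones are 6 degrees wide, with zone 1 starting at 180°W), the zone of each point can be computed directly from its longitude:
### UTM zone from longitude: floor((lon + 180)/6) + 1
utm.zone <- floor((wind.points$long_DD + 180)/6) + 1
table(utm.zone)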
Distances are in kilometers.
wind.sp<-wind.points
##making the spatial object
coordinates(wind.sp) <- c('long_DD','lat_DD')
## defining the coordinate reference system of the spatial object
proj4string(wind.sp) <- CRS('+proj=longlat +datum=NAD83')
FIDs 33011 and 48278
spDistsN1(wind.sp[wind.sp$FID==33011,],wind.sp[wind.sp$FID==48278,],longlat=TRUE)
## [1] 2331.725
FIDs 48278 and 47308
spDistsN1(wind.sp[wind.sp$FID==48278,],wind.sp[wind.sp$FID==47308,],longlat=TRUE)
## [1] 1716.525
FIDs 25090 and 25853
spDistsN1(wind.sp[wind.sp$FID==25090,],wind.sp[wind.sp$FID==25853,],longlat=TRUE)
## [1] 165.4021
Distances are in kilometers.
### converting the coordinates to UTM
wind.sp.utm <- spTransform(wind.sp, CRS('+proj=utm +zone=14'))
### Euclidean distances
FIDs 33011 and 48278
spDistsN1(wind.sp.utm[wind.sp.utm$FID==33011,],wind.sp.utm[wind.sp.utm$FID==48278,])*0.001
## [1] 2333.258
FIDs 48278 (zone 14, TX) and 47308 (zone 15, IA)
spDistsN1(wind.sp.utm[wind.sp.utm$FID==48278,],wind.sp.utm[wind.sp.utm$FID==47308,])*0.001
## [1] 1719.433
FIDs 25090 and 25853
spDistsN1(wind.sp.utm[wind.sp.utm$FID==25090,],wind.sp.utm[wind.sp.utm$FID==25853,])*0.001
## [1] 165.5145
Points that are close together and located in the same UTM zone suffer less projection distortion, so their great-circle and Euclidean distances are nearly identical.
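To put a number on the distortion, a small sketch (simply reusing the three pairs of distances printed above) expresses the disagreement between the great-circle and zone-14 Euclidean distances as a percentage:
### percent difference between Euclidean (zone 14) and great-circle distances
gc  <- c(2331.725, 1716.525, 165.4021)   # great-circle distances from above (km)
euc <- c(2333.258, 1719.433, 165.5145)   # zone-14 Euclidean distances from above (km)
round(100*(euc - gc)/gc, digits = 3)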
Not necessarily. One possible reason is a difference between the datum used by the GPS and the one used in the database. There is also always some degree of error associated with any device measurement, including GPS.
stand.crop<-read.csv("data/sample.csv",header = TRUE)
df<-stand.crop%>%gather(method,value)
### plotting the distribution of standing crop for the different sampling methods
df%>%ggplot()+
geom_histogram(aes(value),stat = "bin",binwidth =50)+
facet_grid(.~method)+
geom_vline(xintercept = 2470,size=1.5)
Note: because your evaluation will be based on 25 repetitions, ignore small deviations from unbiasedness, so if the sample is close to unbiased, call it unbiased.
Basically, what we need to check is whether the sample mean is an unbiased estimator of the population mean; it can be shown that the sum of the sample values divided by the sample size is an unbiased estimator of the population mean.
Since we have just one set of samples, it is not possible to calculate the variability of the mean directly, so the proximity of each sample mean to the true (population) mean is used as the criterion for judging whether a sampling method is unbiased.
Interestingly, the difference between the sample means and the true mean is less than about 1% for every sampling method, so we can claim that all of the methods are essentially unbiased.
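A quick calculation backs this up (a sketch, taking 2470 as the true standing crop, the value used for the vertical line above):
### percent difference between each sample mean and the true mean of 2470
df%>%group_by(method)%>%
  summarise(pct.diff = round(100*(mean(value) - 2470)/2470, digits = 2))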
df%>%group_by(method)%>%summarise(Average=mean(value), median=median(value), SD=sd(value))
## Source: local data frame [3 x 4]
##
## method Average median SD
## 1 basic 2495.28 2488 197.4290
## 2 super 2481.44 2522 140.1479
## 3 hard 2457.48 2492 106.0971
As stated before, there is just one set of samples, so we do not know the true variability of the mean, but the absolute differences from the true mean show that the super and hard sampling methods come closest, respectively.
erod.points<-read.table("data/erode.txt",header = TRUE)
erod.sp<-erod.points
coordinates(erod.sp) <- c('x','y')
### grid manipulation
erod.grid<-read.table("data/erodegrid.txt",header = TRUE)
coordinates(erod.grid) <- c('x','y')
gridded(erod.grid) <- T # and flag as a grid
erod.idw <- idw(erode~1, erod.sp, erod.grid,idp=1.5)
## [inverse distance weighted interpolation]
spplot(erod.idw, 'var1.pred')
The units are megatons per year.
round(sum(erod.idw@data$var1.pred)*0.25,digits = 1)
## [1] 139.7
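The 0.25 multiplier is presumably the area of one grid cell; as a sketch (assuming the grid is regular), the cell size can be read from the gridded object instead of being hard-coded:
### cell area taken from the grid topology rather than hard-coded
cell.size <- erod.grid@grid@cellsize
round(sum(erod.idw@data$var1.pred)*prod(cell.size), digits = 1)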
### linear model
erod.lm <- lm(erode ~ x + y, data=erod.sp)
### make prediction using linear trend surface
erod.trendl <- predict(erod.lm, newdata=erod.grid)
#SpatialPixelsDataFrame
erod.gridl <- SpatialPixelsDataFrame(erod.grid,
data.frame(linear=erod.trendl) )
### plotting
spplot(erod.gridl, 'linear')
I first need to see the measured values at their actual locations.
##########
erod.points%>%ggplot(aes(x=x,y=y))+
geom_text(aes(label=erode),size=4)+
coord_equal()
It seems that the linear model is too simple and not suitable, because it is unable to capture the non-linearity that appears to exist in the data set.
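One way to check this (a sketch, not part of the original analysis) is to fit a second-order trend surface on the same data and compare its fit with the linear one; a clearly higher R-squared would point to non-linearity:
### quadratic trend surface as a richer alternative to the linear one
erod.lm2 <- lm(erode ~ x + y + I(x^2) + I(y^2) + I(x*y), data = erod.points)
summary(erod.lm)$r.squared    # linear trend surface
summary(erod.lm2)$r.squared   # quadratic trend surface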
co.points<-read.table("data/juraCo.txt",header = TRUE)
co.points%>%ggplot(aes(x=Xloc,y=Yloc))+
annotate("text", x = 2.5, y = 3.3,label="P1",color="red",size=5)+
annotate("text", x = 3.5, y = 2.5,label="P2",color="red",size=5)+
geom_text(aes(label=Co))+
coord_equal()
co.VC<-as.matrix(read.table("data/juraVC20.txt",header = TRUE))
X <- matrix(1, ncol=1, nrow=length(co.VC[,1]))          # model with only an intercept - one column
vinv <- solve(co.VC)                                    # the inverse variance-covariance matrix
glswt <- solve(t(X) %*% vinv %*% X) %*% t(X) %*% vinv   # weights for the GLS estimate of the overall mean
muhat <- glswt %*% co.points$Co                         # muhat is the GLS estimate of the overall mean
print(muhat)
## [,1]
## [1,] 9.192567
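As a sanity check (a small sketch), the GLS weights should sum to one, so that muhat is a weighted average of the observed Co values:
### the GLS weights must sum to 1
sum(glswt)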
Ordinary kriging
S0.VC<-as.matrix(read.table("data/j20s0cov.txt",header = TRUE))
coef<-round((S0.VC[,1] %*% vinv),digits = 3)
cbind(co.points,x1w=t(coef))%>%ggplot(aes(x=Xloc,y=Yloc))+
geom_text(aes(label=x1w))+
annotate("point", x = 2.5, y = 3.3,color="red",size=4)+
coord_equal()
It is reasonable because higher weights are associated with the points closer to P1 than with the points farther away. This comes from the fact that P1 has a higher covariance with the closer points than with the farther ones.
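This can be checked directly (a sketch, assuming P1 sits at (2.5, 3.3) as annotated on the map above): the correlation between the distance from P1 and the weight should be negative if closer points really do get larger weights.
### distance from P1 to each measured point versus its weight
d.p1 <- sqrt((co.points$Xloc - 2.5)^2 + (co.points$Yloc - 3.3)^2)
cor(d.p1, as.vector(coef))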
### variance of prediction for P1
coefp1<-coef
(12.525)^2-(coefp1%*%S0.VC[,1])
## [,1]
## [1,] 148.6815
### variance of prediction for P2
coefp2<-round((S0.VC[,2] %*% vinv),digits = 3)
(12.525)^2-(coefp2%*%S0.VC[,2])
## [,1]
## [1,] 156.8461
It makes sense because P1 is closer to the measured points (points with a value) than P2 is, so there should be less variability in its prediction.
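A quick distance check supports this (a sketch, using the P1 and P2 locations annotated on the map above):
### distance from P1 (2.5, 3.3) and P2 (3.5, 2.5) to the nearest measured point
d.p1 <- sqrt((co.points$Xloc - 2.5)^2 + (co.points$Yloc - 3.3)^2)
d.p2 <- sqrt((co.points$Xloc - 3.5)^2 + (co.points$Yloc - 2.5)^2)
min(d.p1)
min(d.p2)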
### variance of prediction for P3
coefp3<-round((S0.VC[,3] %*% vinv),digits = 3)
(12.525)^2-(coefp3%*%S0.VC[,3])
## [,1]
## [1,] 144.3506
This variance is lower than the other two because the point is so close to the measured points (indeed, P3 essentially is one of them). “Honoring the data” refers to the fact that predictions at locations near the measured points borrow more from the data and are therefore less uncertain, with the prediction variance shrinking toward zero as a prediction location approaches a data point.
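One way to see this from the covariance vectors alone (a sketch; 12.525^2 is the sill value used in the variance calculations above): if P3 essentially coincides with a measured point, its largest covariance with the data should be close to the sill.
### largest covariance between P3 and the measured points, compared with the sill
max(S0.VC[,3])
(12.525)^2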