Question 1: The data file wind.txt contains information about some of the many wind farms in the Midwest. FID is an ID number for the wind farm. lat_DD and long_DD are latitude and longitude in decimal degrees. The datum isn’t specified, but the data are recent, so let’s assume it is NAD83. total_turb contains the number of installed turbines at each wind farm.

library(sp)
library(ggplot2)
library(gstat)
library(dplyr)
library(htmlwidgets)
library(leaflet)
library(maptools)
library(tidyr)
## read the wind farm locations
wind.points <- read.table("data/wind.txt", header = TRUE)

a) Are these data an example of geostatistical data? Briefly explain why or why not. If not, what sort of spatial data are they?

No. In geostatistical data there is a measurement of some quantity at each sampling location, but here the locations themselves are essentially the data. These data are better described as point process (point pattern) data, which are concerned with the pattern of the locations rather than with values measured at them.

b) Why are the latitude values positive and the longitude values negative?

Latitude and longitude are measured relative to reference lines. The latitudes are positive because all of these locations are north of the equator, and the longitudes are negative because they are west of the prime (Greenwich) meridian.
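A quick check (a small sketch using the wind.points data frame read in above) confirms this:

## all latitudes should be positive (north of the equator) and all longitudes negative (west of Greenwich)
range(wind.points$lat_DD)
range(wind.points$long_DD)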

c) Draw a map of the US, with state boundaries, and overlay the locations of these wind farms.

leaflet(data = wind.points) %>%
  addProviderTiles("Stamen.Toner") %>%
  setView(-93.65, 42.0285, zoom = 4) %>%
  addMarkers(~long_DD, ~lat_DD, popup = ~as.character(FID))

d) List the states that these data come from.

North Dakota, Iowa, Nebraska, Kansas, Texas
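One way to check this list programmatically (a small sketch, assuming the maps package is installed and using the wind.points object from above):

## look up the state polygon containing each wind farm location
library(maps)
unique(map.where("state", wind.points$long_DD, wind.points$lat_DD))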

e) You would like to express locations using a rectangular coordinate system. What UTM zone will be the most appropriate to use? Briefly explain your choice.

Except for the points located in Iowa, which fall in UTM zone 15, all of the points are in zone 14. Because the points span two zones, no single zone is ideal, but I would use zone 14 since most of the points fall within it.
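UTM zones are 6 degrees of longitude wide, so the zone can be computed directly from the longitude; a small sketch:

## UTM zone = floor((longitude + 180) / 6) + 1
table(floor((wind.points$long_DD + 180) / 6) + 1)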

f) Do you have any concerns using UTM coordinates for these data? Briefly describe your concerns or explain why you have no concerns.

Yes. Since the points fall into two different zones (14 and 15), projecting all of them into a single zone places some points well outside that zone, where the projection distortion grows. Distances calculated between points in different zones will therefore carry some extra error.

g) Calculate and report the great circle distance between FID’s 33011 and 48278, between FID’s 48278 and 47308, and between FID’s 25090 and 25853. Make sure to include units.

Distances are in kilometers.

wind.sp <- wind.points
## make a spatial (SpatialPointsDataFrame) object
coordinates(wind.sp) <- c('long_DD','lat_DD')
## define the coordinate reference system of the spatial object
proj4string(wind.sp) <- CRS('+proj=longlat +datum=NAD83')

FID’s 33011 and 48278

spDistsN1(wind.sp[wind.sp$FID==33011,],wind.sp[wind.sp$FID==48278,],longlat=TRUE)
## [1] 2331.725

FID’s 48278 and 47308

spDistsN1(wind.sp[wind.sp$FID==48278,],wind.sp[wind.sp$FID==47308,],longlat=TRUE)
## [1] 1716.525

FID’s 25090 and 25853

spDistsN1(wind.sp[wind.sp$FID==25090,],wind.sp[wind.sp$FID==25853,],longlat=TRUE)
## [1] 165.4021

h) Calculate and report the Euclidean distance based on UTM coordinates between the 3 pairs of locations in question 1g. Make sure to include units.

Distances are in kilometers.

### convert the coordinates to UTM zone 14
wind.sp.utm <- spTransform(wind.sp, CRS('+proj=utm +zone=14'))
### Euclidean distances (UTM coordinates are in meters; multiplying by 0.001 gives km)

FID’s 33011 and 48278

spDistsN1(wind.sp.utm[wind.sp.utm$FID==33011,],wind.sp.utm[wind.sp.utm$FID==48278,])*0.001
## [1] 2333.258

FID’s 48278 (zone 14, TX) and 47308 (zone 15, IA)

spDistsN1(wind.sp.utm[wind.sp.utm$FID==48278,],wind.sp.utm[wind.sp.utm$FID==47308,])*0.001
## [1] 1719.433

FID’s 25090 and 25853

spDistsN1(wind.sp.utm[wind.sp.utm$FID==25090,],wind.sp.utm[wind.sp.utm$FID==25853,])*0.001
## [1] 165.5145

i) Compare the great circle and Euclidean distances computed in questions g and h. Which pair of locations has the most similar distances? Which pair of locations has the least similar distances? Briefly explain why you would (or would not) expect what you found.

The pair 25090 and 25853 has the most similar distances: these points are close together and lie in the same UTM zone, and the great circle and Euclidean distances differ by only about 0.1 km. The pair 48278 and 47308, which spans two UTM zones, has the least similar distances (a difference of about 2.9 km). This is what I would expect: projection distortion grows with distance from the zone’s central meridian and accumulates over long distances, so pairs that are far apart, and especially pairs projected from outside the chosen zone, show the largest disagreement between the two distance measures.
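Using the distances reported above (a small sketch; the values are copied from parts g and h):

## great circle vs. UTM Euclidean distances (km) for the three pairs
gc  <- c(2331.725, 1716.525, 165.4021)
euc <- c(2333.258, 1719.433, 165.5145)
euc - gc                 # absolute difference in km
100 * (euc - gc) / gc    # difference as a percentage of the distance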

j) You go out with a GPS unit, which, like all GPS units, reports long-lat coordinates in the WGS84 datum, to locate a wind farm. You find the GPS unit location is not the same as that given in the database. Is this an error in the database? Briefly explain why or why not.

Not necessarily. One possible reason is a difference in datum: the database uses NAD83 while the GPS unit reports WGS84 coordinates, so the same physical location has slightly different coordinates in the two systems. In addition, every GPS measurement carries some positional error.

Question 2: You are evaluating different ways to sample a pasture to estimate the total standing crop of grass in that pasture. The total can be estimated by sampling locations, measuring the standing crop / m2 at each location, estimating the mean across all locations, then multiplying by the pasture area to get the total standing crop. The details of estimating the mean depend on the specific sampling design; these details are not relevant to this question. You consider three sampling methods, called basic, super, and hard. The details of how locations are chosen for each of these three methods are not relevant (and no, these aren’t the names of standard sampling methods). You repeat each method 25 times, so you have 25 estimates of the total standing crop from each method. You then mow the entire pasture to measure the true total standing crop. The values for each sampling repetition are in sample.csv. The true total standing crop for the pasture is 2470 kg.

stand.crop <- read.csv("data/sample.csv", header = TRUE)
df <- stand.crop %>% gather(method, value)
### plot the distribution of the estimated standing crop for each sampling method
df %>% ggplot() +
  geom_histogram(aes(value), binwidth = 50) +
  facet_grid(. ~ method) +
  geom_vline(xintercept = 2470, size = 1.5)

a) Which of the three sampling methods provides an unbiased estimate of the true total? Support your choices by appropriate numbers or plots.

Note: because your evaluation will be based on 25 repetitions, ignore small deviations from unbiasedness, so if the sample is close to unbiased, call it unbiased.

An estimator is unbiased if, on average, it equals the true value, so here we compare the average of the 25 repeated estimates from each method with the true total of 2470 kg.

Because each method was repeated only 25 times, the average of the estimates will not match the true total exactly even for an unbiased method, so closeness of the average to the true total is the practical criterion here.

The average of the 25 estimates from each method differs from the true total by roughly 1% or less (see the summary table in part b and the check below), so all three methods can be treated as approximately unbiased.
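A small sketch, using the df object created above, that compares each method’s average estimate with the true total:

## percent deviation of each method's mean estimate from the true total of 2470 kg
df %>% group_by(method) %>%
  summarise(mean.est = mean(value),
            pct.bias = 100 * (mean(value) - 2470) / 2470)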

b) Which of the three sampling methods gives the most precise estimate of the true total? Support your choices by appropriate numbers or plots.

df%>%group_by(method)%>%summarise(Average=mean(value), median=median(value), SD=sd(value))
## Source: local data frame [3 x 4]
## 
##   method Average median       SD
## 1  basic 2495.28   2488 197.4290
## 2  super 2481.44   2522 140.1479
## 3   hard 2457.48   2492 106.0971

Precision is about the spread of the 25 estimates, which is summarized by the standard deviations above. The hard method has the smallest SD (about 106 kg), followed by super (about 140 kg) and basic (about 197 kg), so hard gives the most precise estimate, with super a close second (see the boxplot sketch below).
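The spread of the estimates can also be compared graphically (a small sketch using df):

## boxplots of the 25 estimates per method, with the true total marked
df %>% ggplot(aes(x = method, y = value)) +
  geom_boxplot() +
  geom_hline(yintercept = 2470, linetype = "dashed")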

Question 3: The data in erode.txt are a small data set with soil erosion values from 7 locations in Illinois. The x and y coordinates are UTM coordinates in zone 16. The Z value is the amount of soil erosion, as tons / ac / yr. The points in erodegrid.txt are grid locations at which predictions of soil erosion are desired.

erod.points <- read.table("data/erode.txt", header = TRUE)
erod.sp <- erod.points
coordinates(erod.sp) <- c('x','y')
### read and set up the prediction grid
erod.grid <- read.table("data/erodegrid.txt", header = TRUE)
coordinates(erod.grid) <- c('x','y')
gridded(erod.grid) <- TRUE             # flag as a grid

a) Use inverse distance weighting with a power of 1.5 to estimate erosion at each of the grid points. Give me a plot showing the predicted soil erosion at each grid point.

erod.idw <- idw(erode~1, erod.sp, erod.grid,idp=1.5)
## [inverse distance weighted interpolation]
spplot(erod.idw, 'var1.pred')

b) Each grid location is the center of an approximately 250,000 acre area. Estimate the total annual soil erosion over the area represented by the grid. For simplicity, report your estimate with units of megatons (1 megaton = 1,000,000 tons).

Each grid prediction is in tons/ac/yr and represents about 250,000 acres, so the total is the sum of the predictions multiplied by 250,000 acres and divided by 1,000,000 tons per megaton, i.e. the sum multiplied by 0.25. The result below is in megatons per year.

round(sum(erod.idw@data$var1.pred)*0.25,digits = 1)
## [1] 139.7
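Equivalently, with the unit conversion written out step by step (a small sketch using the erod.idw object above):

## tons/ac/yr * 250,000 acres per grid cell = tons/yr; divide by 1e6 to get megatons/yr
total.tons <- sum(erod.idw@data$var1.pred) * 250000
total.tons / 1e6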

c) Fit a linear trend surface (i.e. linear in X and linear in Y) and predict the erosion at each of the grid points. Give me a plot showing the predicted soil erosion at each grid point.

### fit a linear trend surface (linear in x and y)
erod.lm <- lm(erode ~ x + y, data = erod.sp)
### predict the trend surface at the grid points
erod.trendl <- predict(erod.lm, newdata = erod.grid)
### wrap the predictions in a SpatialPixelsDataFrame
erod.gridl <- SpatialPixelsDataFrame(erod.grid,
  data.frame(linear = erod.trendl))
### plot the predicted surface
spplot(erod.gridl, 'linear')

d) Visually compare the plots from parts a and c. Does it seem that a linear trend model is reasonable? Briefly explain why or why not.

To judge this, it helps first to plot the measured erosion values at their actual locations.

### plot the observed erosion values at their locations
erod.points %>% ggplot(aes(x = x, y = y)) +
  geom_text(aes(label = erode), size = 4) +
  coord_equal()

The linear trend model seems too simple: the observed values and the IDW surface from part a show local, non-linear structure that a plane in x and y cannot capture, so a linear trend surface does not appear reasonable here.
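One way to check this numerically (a small sketch using the erod.lm fit from part c) is to look at the residuals of the linear trend surface at the 7 observed locations; residuals that are large relative to the range of the observed erosion values indicate structure the plane cannot capture:

## residuals of the linear trend surface at the observed locations
round(resid(erod.lm), 2)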

Question 4: The data in juraCo.txt are a very small part of a data set on heavy metal contamination in the Swiss part of the Jura Mountains. The data set includes 20 locations and the cobalt (Co) concentration in the soil at each location. The Xloc and Yloc coordinates are in km; the Co concentration is measured in mg/kg.

a) Plot the cobalt concentration in a way that illustrates how soil Co concentration varies over the study area.

co.points <- read.table("data/juraCo.txt", header = TRUE)
## plot the Co concentrations at their locations; P1 and P2 are the prediction
## locations used in later parts
co.points %>% ggplot(aes(x = Xloc, y = Yloc)) +
  annotate("text", x = 2.5, y = 3.3, label = "P1", color = "red", size = 5) +
  annotate("text", x = 3.5, y = 2.5, label = "P2", color = "red", size = 5) +
  geom_text(aes(label = Co)) +
  coord_equal()

b) The data file juraVC20.txt contains the 20x20 variance-covariance matrix for these 20 observations. Calculate the Generalized Least Squares (GLS) estimate of the mean Co concentration.
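With a design matrix $X$ consisting of a single column of ones and variance-covariance matrix $V$, the GLS estimate of the mean is $\hat{\mu}_{GLS} = (X^{\top}V^{-1}X)^{-1}X^{\top}V^{-1}y$, where $y$ is the vector of observed Co concentrations; the code below computes exactly this.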

co.VC <- as.matrix(read.table("data/juraVC20.txt", header = TRUE))

## design matrix: a single column of ones (intercept-only model)
X <- matrix(1, ncol = 1, nrow = nrow(co.VC))

## inverse of the variance-covariance matrix
vinv <- solve(co.VC)

## weights for the GLS estimate of the overall mean
glswt <- solve(t(X) %*% vinv %*% X) %*% t(X) %*% vinv

## muhat is the GLS estimate of the overall mean
muhat <- glswt %*% co.points$Co

print(muhat)
##          [,1]
## [1,] 9.192567

c) The mean Co concentration is unknown, but it is assumed to be constant over the area of interest. What kriging method is the most appropriate one to predict the Co concentration at (Xloc = 2.5, Yloc = 3.3)? (All I need is a name, no explanation needed)

Ordinary kriging

d) The data set j20s0cov.txt contains the covariances between each observed location and three prediction locations. The column labelled P1 is for the prediction at Xloc=2.5, Yloc=3.3. The column labelled P2 is for a prediction at Xloc=3.5, Yloc=2.5. The column labelled P3 is for a prediction at an unspecified location.

Calculate the coefficients used to make the prediction at Xloc=2.5, Yloc=3.3 (P1). There should be 20 coefficients, one for each observation. Plot the locations and label the points with the coefficients (probably helpful to round to 3 digits or so). Is the pattern you see reasonable? Briefly explain why or why not.

S0.VC <- as.matrix(read.table("data/j20s0cov.txt", header = TRUE))
## prediction weights for P1: covariances between the observations and P1,
## multiplied by the inverse variance-covariance matrix
coef <- round(S0.VC[,1] %*% vinv, digits = 3)
## plot the observation locations labelled with their weights; P1 is the red point
cbind(co.points, x1w = t(coef)) %>% ggplot(aes(x = Xloc, y = Yloc)) +
  geom_text(aes(label = x1w)) +
  annotate("point", x = 2.5, y = 3.3, color = "red", size = 4) +
  coord_equal()

The pattern is reasonable: observations close to P1 receive larger weights than observations farther away, because P1 has higher covariance with the nearby observations than with the distant ones.
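A quick way to see this (a small sketch using the coef weights and co.points objects from above) is to plot each observation’s weight against its distance to P1:

## distance from each observation to P1 (Xloc = 2.5, Yloc = 3.3) vs. its prediction weight
d.P1 <- sqrt((co.points$Xloc - 2.5)^2 + (co.points$Yloc - 3.3)^2)
plot(d.P1, as.vector(coef), xlab = "distance to P1 (km)", ylab = "weight for P1")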

e) Calculate the prediction variance for predictions at P1 and P2. Use 12.525 as the estimated variance of the Co concentration ignoring location (i.e., use 12.525 for the value of sigma^2 at the start of the prediction variance equation). Report the two prediction variances. Explain why it is reasonable that one is larger than the other.

### variance of prediction for the P1
coefp1<-coef
(12.525)^2-(coefp1%*%S0.VC[,1])
##          [,1]
## [1,] 148.6815
### variance of prediction for P2
coefp2<-round((S0.VC[,2] %*% vinv),digits = 3)
(12.525)^2-(coefp2%*%S0.VC[,2])
##          [,1]
## [1,] 156.8461

This makes sense: P1 is closer to the observed locations than P2, so P1 has larger covariances with the data and its prediction should be less variable; hence its prediction variance is smaller.
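This can be checked directly from the coordinates (a small sketch using co.points), by computing the distance from each prediction location to its nearest observation:

## distance from P1 and from P2 to the nearest observed location (km)
min(sqrt((co.points$Xloc - 2.5)^2 + (co.points$Yloc - 3.3)^2))
min(sqrt((co.points$Xloc - 3.5)^2 + (co.points$Yloc - 2.5)^2))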

f) Calculate the prediction variance at the P3 location. Again, use 12.525 as the estimated variance of the Co concentration ignoring location. The P3 location is one of 20 sample locations. Kriging is sometimes described as “honoring the data”. Use the prediction variance at P3 to explain why this term is appropriate.

### variance of prediction for P3
coefp3<-round((S0.VC[,3] %*% vinv),digits = 3)
(12.525)^2-(coefp3%*%S0.VC[,3])
##          [,1]
## [1,] 144.3506

The prediction variance at P3 is smaller than at P1 or P2 because P3 coincides with one of the 20 sample locations, so it sits directly on measured data. Kriging predictions at or very near observed locations are pulled strongly toward the observed values and have little prediction uncertainty, which is why kriging is described as “honoring the data.”