The following tutorial will walk us through the basics of spatial modeling in R. The primary objective is to see if we can create a model that captures the relationship between gridded population estimates and the nighttime lights data that we downloaded from Google Earth Engine. As you will see, most gridded population estimates rely heavily on nighttime lights to make their predictions.
Objectives: import raster and vector data into R, sample the rasters at a set of random points, fit and compare several models (OLS, GAM, and a simple neural network) that predict population from nighttime lights, and write the model predictions back out as a raster.
In the following section we are going to use the raster function to import our images. We just need to tell the computer where to look: ../ backs us out of the current working directory (which is the Tutorials folder) and ../Example_Data looks inside the Example_Data folder.
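This tutorial calls functions from several packages. The library() calls below are a minimal sketch of what it appears to assume (the exact packages, in particular the ones providing gam() and rmse(), are an assumption based on the printed output, so adjust them to match your own setup):
library(raster)   # raster(), stack(), extract(), rasterize(), writeRaster()
library(sf)       # read_sf(), st_sample()
library(ggplot2)  # ggplot(), geom_sf(), geom_point()
library(dplyr)    # recode() and the %>% pipe
library(tibble)   # tribble()
library(caret)    # createDataPartition(), trainControl(), train()
library(gam)      # gam() and s()  (assumed; mgcv provides an alternative gam())
library(Metrics)  # rmse()  (assumed; other packages offer equivalents)
# note: the nnet package is loaded automatically by caret when method = 'nnet'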
First we will import our population and nighttime lights (DNB) data. We will read each raster in, print out a description, and plot it.
pop = raster('../Example_Data/Belize_median_POP_2010.tif')
print(pop)
## class : RasterLayer
## dimensions : 637, 366, 233142 (nrow, ncol, ncell)
## resolution : 0.004491576, 0.004491576 (x, y)
## extent : -89.26559, -87.62167, 15.74298, 18.60411 (xmin, xmax, ymin, ymax)
## crs : +proj=longlat +datum=WGS84 +no_defs +ellps=WGS84 +towgs84=0,0,0
## source : /home/mmann/Documents/Github/Belize_GEE_R_Tutorial/Example_Data/Belize_median_POP_2010.tif
## names : Belize_median_POP_2010
## values : 0.03336188, 36.27045 (min, max)
plot(pop)
dnb = raster('../Example_Data/Belize_median_DNB_2014_2018.tif')
print(dnb)
## class : RasterLayer
## dimensions : 637, 366, 233142 (nrow, ncol, ncell)
## resolution : 0.004491576, 0.004491576 (x, y)
## extent : -89.26559, -87.62167, 15.74298, 18.60411 (xmin, xmax, ymin, ymax)
## crs : +proj=longlat +datum=WGS84 +no_defs +ellps=WGS84 +towgs84=0,0,0
## source : /home/mmann/Documents/Github/Belize_GEE_R_Tutorial/Example_Data/Belize_median_DNB_2014_2018.tif
## names : Belize_median_DNB_2014_2018
plot(dnb)
One important feature to note here is that both images have the same pixel resolution, the same number of rows and columns, the same extent, and the same projection. They both have: 637 rows and 366 columns, a pixel resolution of 0.004491576 degrees, and a WGS84 geographic (longitude/latitude) coordinate system.
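If you want to verify this programmatically, the raster package's compareRaster() function checks these properties for you (a small optional check, not part of the original workflow); it returns TRUE when the grids line up and throws an error when they do not:
# confirm the two rasters share the same extent, dimensions, projection and resolution
compareRaster(pop, dnb, extent = TRUE, rowcol = TRUE, crs = TRUE, res = TRUE)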
Since they have the same properties but hold different data, it often helps to create a ‘stack’ out of them. We are essentially creating a two-band raster, with the first band being the nighttime lights DNB data and the second being the population data.
The following code stacks the two images:
dnb_pop_stack = stack(dnb, pop)
print(dnb_pop_stack)
## class : RasterStack
## dimensions : 637, 366, 233142, 2 (nrow, ncol, ncell, nlayers)
## resolution : 0.004491576, 0.004491576 (x, y)
## extent : -89.26559, -87.62167, 15.74298, 18.60411 (xmin, xmax, ymin, ymax)
## crs : +proj=longlat +datum=WGS84 +no_defs +ellps=WGS84 +towgs84=0,0,0
## names : Belize_median_DNB_2014_2018, Belize_median_POP_2010
## min values : ?, 0.03336188
## max values : ?, 36.27045
# read in the Belize national boundary (GADM level 0 shapefile)
blz = read_sf('../Example_Data/gadm36_BLZ_0.shp')
#plot(blz) # simple plot
ggplot()+geom_sf(data=blz)
We are going to want to take a random sample of our population and nighttime lights data. To do this we need to create a set of random spatial points throughout the country. After that we will extract the pop and dnb values at those locations and use them to build a model of the form pop = fn(dnb).
# create 1000 random points
set.seed(1)
random_points = st_sample(x = blz,size=1000 ,type = 'random')
# visualize points
ggplot()+geom_sf(data=random_points)
Now let’s add the country boundary and overlay the points.
# visualize boundary and points # alpha controls transparency
ggplot()+geom_sf(data=blz)+geom_sf(data=random_points,alpha=0.1)
In this step we extract the raster stack data to those random points. Unfortunately the extract function can’t use an sf-based point file, so we will need to create a copy of sp type and then use that to do the extraction.
set.seed(3)
random_points_sp = as(random_points, "Spatial") # extract function needs an sp spatial object, not sf
# Extract raster data to points
pop_dnb_df = extract(dnb_pop_stack, random_points_sp, df=TRUE) ## df=TRUE i.e. return as a data.frame
# drop any rows where no satellite values were available (missing values are stored as NA)
pop_dnb_df = na.omit(pop_dnb_df)
# print top
print(head(pop_dnb_df))
## ID Belize_median_DNB_2014_2018 Belize_median_POP_2010
## 1 1 0.13918549 0.05955504
## 2 2 0.08886480 0.07712017
## 3 3 0.08162701 0.05563286
## 4 4 0.04958004 0.01328016
## 5 5 0.07530552 0.01698016
## 6 6 0.09782595 0.07531688
We can now see that each point (the point number is stored in ID) has two values: Belize_median_DNB_2014_2018 and Belize_median_POP_2010. We can use these to build a statistical model that predicts population as a function of nighttime lights.
It is often helpful to look at a simple scatter plot of your Y and x values. In this case, it looks like we might have some kind of non-linear response. Let’s find out.
ggplot()+geom_point(data=pop_dnb_df,aes(x=Belize_median_DNB_2014_2018,y=Belize_median_POP_2010))
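One quick, optional way to probe for non-linearity is to compare a Pearson (linear) correlation with a rank-based Spearman correlation; a noticeably higher Spearman value suggests the relationship is monotonic but not linear (this check is an addition to the original tutorial):
# Pearson measures linear association, Spearman measures monotonic association
cor(pop_dnb_df$Belize_median_DNB_2014_2018, pop_dnb_df$Belize_median_POP_2010, method = 'pearson')
cor(pop_dnb_df$Belize_median_DNB_2014_2018, pop_dnb_df$Belize_median_POP_2010, method = 'spearman')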
In this tutorial we will be comparing the performance of a few models, including: an ordinary least squares (OLS) regression, a generalized additive model (GAM), and a simple neural network.
Before we do this we need to create two independent datasets: a ‘training’ dataset holding 80% of the observations and a ‘testing’ dataset holding the remaining 20%, which we will use to assess out-of-sample model performance.
set.seed(2) # this allows us to get the same random sample repeatedly (reproducable results)
# get row numbers of the training dataset
trainIndex = createDataPartition(y = pop_dnb_df$Belize_median_DNB_2014_2018, # typically put your Y variable here
p = .8,
list = FALSE,
times = 1)
pop_dnb_train = pop_dnb_df[ trainIndex,] # keep training rows
pop_dnb_test = pop_dnb_df[-trainIndex,] # omit training rows (keep testing)
# show top example and shape of testing and training
print(head(pop_dnb_train))
## ID Belize_median_DNB_2014_2018 Belize_median_POP_2010
## 3 3 0.08162701 0.05563286
## 4 4 0.04958004 0.01328016
## 5 5 0.07530552 0.01698016
## 7 7 1.05196548 0.75056845
## 9 9 0.08552376 0.06077643
## 10 10 0.10520373 0.06756844
print(dim(pop_dnb_train))
## [1] 791 3
print(dim(pop_dnb_test))
## [1] 196 3
Now that we have our testing and training data we can start fitting models. Let’s start with a simple ordinary least squares regression.
# store the formula defining Y and x
lm_formula = Belize_median_POP_2010 ~ Belize_median_DNB_2014_2018
lm_fit = lm(lm_formula, data=pop_dnb_train)
summary(lm_fit)
##
## Call:
## lm(formula = lm_formula, data = pop_dnb_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -22.596 -0.050 -0.029 -0.009 90.991
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.07132 0.12208 -0.584 0.559
## Belize_median_DNB_2014_2018 1.64032 0.10484 15.646 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.387 on 789 degrees of freedom
## Multiple R-squared: 0.2368, Adjusted R-squared: 0.2358
## F-statistic: 244.8 on 1 and 789 DF, p-value: < 2.2e-16
Important: Keep in mind that these are POPULATION ESTIMATES, so don’t read too much into the results.
Since we suspect non-linear relationships between Y and x, we will fit a Generalized Additive Model (GAM). GAMs are a class of statistical models in which the usual linear relationship between the response (Y) and the predictors (x) is replaced by smooth non-linear functions that capture the non-linearities in the data. They are a flexible technique for fitting models that can depend either linearly or non-linearly on several predictors Xi.
Effectively, a GAM creates piecewise regression estimates with constraints that force the line segments to connect. Here is an example of fitting a smooth curve (a loess fit) to a set of non-linear data.
synth = data.frame(y=seq(0,100,5),x = seq(0,100,5)^2)
ggplot(data= synth,aes(x=x,y=y)) + geom_point() +geom_smooth(method = 'loess')
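For comparison, geom_smooth() can also draw a spline-based smooth of the kind a GAM uses; the short sketch below assumes the mgcv package is installed, since ggplot2 calls it internally when method = 'gam':
# same synthetic data, smoothed with a penalized regression spline instead of loess
ggplot(data = synth, aes(x = x, y = y)) + geom_point() + geom_smooth(method = 'gam', formula = y ~ s(x))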
Here we fit a GAM to help capture any non-linear relationships.
formula_gam = Belize_median_POP_2010 ~ s(Belize_median_DNB_2014_2018) #s() designates which variables to smooth
# fit the gams model
gam_fit = gam(formula_gam, data= pop_dnb_train)
## Warning in model.matrix.default(mt, mf, contrasts): non-list contrasts
## argument ignored
# see results
summary(gam_fit)
##
## Call: gam(formula = formula_gam, data = pop_dnb_train)
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -41.781150 -0.016777 -0.002971 0.012277 32.133195
##
## (Dispersion Parameter for gaussian family taken to be 3.7235)
##
## Null Deviance: 11860 on 790 degrees of freedom
## Residual Deviance: 2926.647 on 785.9999 degrees of freedom
## AIC: 3291.638
##
## Number of Local Scoring Iterations: 2
##
## Anova for Parametric Effects
## Df Sum Sq Mean Sq F value Pr(>F)
## s(Belize_median_DNB_2014_2018) 1 2808.4 2808.43 754.25 < 2.2e-16 ***
## Residuals 786 2926.7 3.72
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Anova for Nonparametric Effects
## Npar Df Npar F Pr(F)
## (Intercept)
## s(Belize_median_DNB_2014_2018) 3 548.31 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
With GAMs it is often informative to plot the estimated non-linear responses. Here we can see that the response increases up to a point and then declines.
# plot non-linear response
plot(gam_fit,se=TRUE)
Next we will fit a very simple neural network. This method searches for a set of weights that describe the relationship between the x (input) and Y (output) layers. Most of the calculation happens in the hidden layer: each unit takes the inputs from the input layer, multiplies them by its (initially random) weights and sums them, and then passes the result through an activation function. The output layer combines these activations to make the final prediction.
set.seed(825)
# set parameters for cross validation
fitControl <- trainControl(## 5-fold CV
method = "repeatedcv",
number = 5,
## repeated 5 times
repeats = 5)
formula_ml = Belize_median_POP_2010 ~ Belize_median_DNB_2014_2018
# fit the model
nnet_fit <- train(formula_ml,
data = pop_dnb_train,
trControl = fitControl,
method = "nnet",
preProc = c("center", "scale"), # subtract mean and divide by std dev
verbose = FALSE)
nnet_fit
## Neural Network
##
## 791 samples
## 1 predictor
##
## Pre-processing: centered (1), scaled (1)
## Resampling: Cross-Validated (5 fold, repeated 5 times)
## Summary of sample sizes: 632, 633, 632, 634, 633, 632, ...
## Resampling results across tuning parameters:
##
## size decay RMSE Rsquared MAE
## 1 0e+00 2.228563 0.3432170 0.2347637
## 1 1e-04 2.212628 0.3452640 0.2260535
## 1 1e-01 2.197500 0.4119595 0.2266696
## 3 0e+00 2.227304 0.3330964 0.2366021
## 3 1e-04 2.203022 0.4525949 0.2129799
## 3 1e-01 2.193545 0.4310922 0.2187953
## 5 0e+00 2.226056 0.4416630 0.2305579
## 5 1e-04 2.202381 0.4696927 0.2120839
## 5 1e-01 2.192608 0.4376314 0.2166679
##
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were size = 5 and decay = 0.1.
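caret stores the full tuning results in the returned object, so you can also inspect the selected hyper-parameters or plot RMSE across the tuning grid (an optional sketch):
# best size/decay combination chosen by cross-validation
nnet_fit$bestTune
# RMSE profile across the tuning grid
plot(nnet_fit)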
Now that we have fitted three models we need to compare their performance on the independent testing dataset. There are a few different metrics we could use; AIC and R^2 are not available for all of these models, so we will focus on the root mean squared error (RMSE) instead.
First we need to make a prediction to the testing dataset for each model. We will then compare these predictions to the actual value and calculate the RMSE for each model.
# store the actual observations of Y
Y = pop_dnb_test$Belize_median_POP_2010
# make predictions to the test dataset
pred_lm = predict(lm_fit, newdata=pop_dnb_test)
pred_gam = predict(gam_fit, newdata=pop_dnb_test)
pred_nnet = predict(nnet_fit, newdata=pop_dnb_test)
# calculate RMSE
rmse_lm = rmse(Y,pred_lm)
rmse_gam = rmse(Y,pred_gam)
rmse_nnet = rmse(Y,pred_nnet)
tribble(
~Model, ~RMSE,
"ols", rmse_lm,
"gams", rmse_gam,
"nnet",rmse_nnet
)
## # A tibble: 3 x 2
## Model RMSE
## <chr> <dbl>
## 1 ols 1.47
## 2 gams 18.0
## 3 nnet 0.758
In this case it looks like the simple neural net outperforms all the other models. This is likely due to our improved handling of model fit through cross-validation.
Luckily making predictions on spatial data (stored in a stack) is as easy as using the predict function.
# make a prediction using OLS back to the raster stack
pred_lm_raster = predict(dnb_pop_stack, model=lm_fit)
plot(pred_lm_raster)
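Since the neural net had the lowest test RMSE, we could in principle map its predictions the same way. The sketch below is untested here and assumes raster::predict can hand the stack's layer values to caret's predict method, which works when the layer names match the model's predictor names (as they do in this stack):
# make a prediction using the fitted neural network back to the raster stack
# na.rm = TRUE skips cells with missing layer values
pred_nnet_raster = predict(dnb_pop_stack, model = nnet_fit, na.rm = TRUE)
plot(pred_nnet_raster)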
Finally, let’s write our raster prediction out to a file so that it can be opened in ArcGIS or QGIS.
writeRaster(x = pred_lm_raster,
filename = '../Example_Data/pred_lm_raster.tif',
overwrite=TRUE)
One potential way to improve our model is to add additional covariates. In this case we might suspect that the relationship between nighttime lights and population differs across different parts of the country.
To control for this we will want to add a shapefile of administrative boundaries to our raster stack.
In order to do this we must follow these general steps: convert the administrative boundary polygons to a raster, add that raster to our stack, extract the stack values at our random points, and refit the model with the new covariate.
Let’s take a look at the shapefile
# read in data and plots
admin = read_sf('../Example_Data/gadm36_BLZ_1.shp')
ggplot()+geom_sf(data=admin)
Now we need to rasterize the data by telling R which column of values to use, and by giving it an example of what the final raster should look like (resolution, extent, etc.).
# create an sp version for rasterizing
admin_sp = as(admin,'Spatial')
## create a numeric value for each admin level
# Copy admin names to new column
admin_sp$NAME_1_numeric = admin_sp$NAME_1
# look at the number of unique values
table(admin_sp$NAME_1_numeric)
##
## Belize Cayo Corozal Orange Walk Stann Creek Toledo
## 1 1 1 1 1 1
# recode values to numeric
admin_sp$NAME_1_numeric = admin_sp$NAME_1_numeric %>%
recode(Belize = 1, Cayo = 2, Corozal = 3, `Orange Walk` = 4,
`Stann Creek` = 5, Toledo = 6)
head(admin_sp[,c("NAME_1", "NAME_1_numeric")])
## NAME_1 NAME_1_numeric
## 1 Belize 1
## 2 Cayo 2
## 3 Corozal 3
## 4 Orange Walk 4
## 5 Stann Creek 5
## 6 Toledo 6
admin_raster = rasterize(
x=admin_sp, # shape to rasterize
y= dnb, # example raster to copy properties from
field = 'NAME_1_numeric' # numeric field to rasterize
)
# name the layer that was rasterized
names(admin_raster) = 'Admin'
plot(admin_raster)
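As a quick, optional sanity check, raster::freq() tabulates how many cells were assigned to each district code (cells outside the country boundary show up as NA):
# count cells per admin code
freq(admin_raster)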
Now we need to add these values to the stack and extract data to points.
# create the new raster stack
dnb_pop_adm_stack = stack(dnb,pop,admin_raster)
# extract values to points
pop_dnb_adm_df = extract(dnb_pop_adm_stack, random_points_sp, df=TRUE) ## df=TRUE i.e. return as a data.frame
# remove nas
pop_dnb_adm_df = na.omit(pop_dnb_adm_df)
head(pop_dnb_adm_df)
## ID Belize_median_DNB_2014_2018 Belize_median_POP_2010 Admin
## 1 1 0.13918549 0.05955504 2
## 2 2 0.08886480 0.07712017 2
## 3 3 0.08162701 0.05563286 3
## 4 4 0.04958004 0.01328016 2
## 5 5 0.07530552 0.01698016 2
## 6 6 0.09782595 0.07531688 2
Let’s create an OLS model with slope dummies (by interacting the administrative boundaries with the dnb values). We are going to declare Admin to be a “factor”; this effectively treats the variable as a fixed-effects component and creates the appropriate dummy variables. We can also handle interactions easily by simply using * to indicate which variables should be interacted; R automatically includes the appropriate interaction terms in the model.
# define the specification and include interaction terms
formula_lm_admin = Belize_median_POP_2010 ~ Belize_median_DNB_2014_2018*factor(Admin)
# break observations into testing and training splits
pop_dnb_adm_train = pop_dnb_adm_df[ trainIndex,] # keep training rows
pop_dnb_adm_test = pop_dnb_adm_df[-trainIndex,] # omit training rows (keep testing)
# fit the model
lm_fit_admin = lm(formula_lm_admin, data = pop_dnb_adm_train)
summary(lm_fit_admin)
##
## Call:
## lm(formula = formula_lm_admin, data = pop_dnb_adm_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -28.9371 -0.0088 0.0176 0.0740 23.5374
##
## Coefficients:
## Estimate Std. Error t value
## (Intercept) 0.02173 0.21757 0.100
## Belize_median_DNB_2014_2018 0.19858 0.86053 0.231
## factor(Admin)2 -1.21909 0.25548 -4.772
## factor(Admin)3 -0.01593 0.31614 -0.050
## factor(Admin)4 -0.08304 0.26078 -0.318
## factor(Admin)5 0.03077 0.29527 0.104
## factor(Admin)6 -0.05432 0.25923 -0.210
## Belize_median_DNB_2014_2018:factor(Admin)2 9.18908 0.87913 10.453
## Belize_median_DNB_2014_2018:factor(Admin)3 0.33018 0.86535 0.382
## Belize_median_DNB_2014_2018:factor(Admin)4 0.66337 0.86261 0.769
## Belize_median_DNB_2014_2018:factor(Admin)5 -0.06557 1.14135 -0.057
## Belize_median_DNB_2014_2018:factor(Admin)6 0.61188 0.92119 0.664
## Pr(>|t|)
## (Intercept) 0.920
## Belize_median_DNB_2014_2018 0.818
## factor(Admin)2 2.18e-06 ***
## factor(Admin)3 0.960
## factor(Admin)4 0.750
## factor(Admin)5 0.917
## factor(Admin)6 0.834
## Belize_median_DNB_2014_2018:factor(Admin)2 < 2e-16 ***
## Belize_median_DNB_2014_2018:factor(Admin)3 0.703
## Belize_median_DNB_2014_2018:factor(Admin)4 0.442
## Belize_median_DNB_2014_2018:factor(Admin)5 0.954
## Belize_median_DNB_2014_2018:factor(Admin)6 0.507
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.778 on 777 degrees of freedom
## (2 observations deleted due to missingness)
## Multiple R-squared: 0.7933, Adjusted R-squared: 0.7904
## F-statistic: 271.2 on 11 and 777 DF, p-value: < 2.2e-16
pred_lm_admin = predict(lm_fit_admin, newdata=pop_dnb_adm_test)
# calculate RMSE
rmse_lm_admin = rmse(Y,pred_lm_admin)
tribble(
~Model, ~RMSE,
"ols", rmse_lm,
"ols admin",rmse_lm_admin,
"gams", rmse_gam,
"nnet",rmse_nnet
)
## # A tibble: 4 x 2
## Model RMSE
## <chr> <dbl>
## 1 ols 1.47
## 2 ols admin 1.24
## 3 gams 18.0
## 4 nnet 0.758
It looks like adding the administrative interactions does improve the model fit, but not enough to surpass the results from a basic neural net.
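A natural follow-up, left here only as an untested sketch, would be to give the neural network the same administrative information and see whether it improves further. This reuses the fitControl object defined earlier; wrapping Admin in factor() lets train() expand it into dummy variables:
# sketch: refit the neural net with the admin covariate included
formula_ml_admin = Belize_median_POP_2010 ~ Belize_median_DNB_2014_2018 + factor(Admin)
nnet_fit_admin <- train(formula_ml_admin,
                        data = na.omit(pop_dnb_adm_train), # drop any rows lost in the join/split
                        trControl = fitControl,
                        method = "nnet",
                        preProc = c("center", "scale"),
                        verbose = FALSE)
# out-of-sample RMSE on the complete cases of the admin test set
adm_test_cc = na.omit(pop_dnb_adm_test)
rmse(adm_test_cc$Belize_median_POP_2010, predict(nnet_fit_admin, newdata = adm_test_cc))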