The following tutorial will walk us through the basics of spatial modeling in R. The primary objective is to see if we can create a model that captures the relationship between gridded population estimates and the nighttime lights data that we downloaded from Google Earth Engine. As you will see, most gridded population estimates rely heavily on nighttime lights to make their predictions.
Objectives: import raster and vector data into R, sample the rasters at a set of random points, fit and compare several models (OLS, GAM, and a simple neural network) that predict population from nighttime lights, and write the model predictions back out as a raster.
In the following section we are going to use the raster function to import our images. We just need to tell the computer where to look: ../ backs us out of the current working directory (which is the Tutorials folder) and ../Example_Data looks inside the Example_Data folder.
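This tutorial calls functions from several packages. The library() calls below are a minimal sketch of what it appears to assume (the exact packages, in particular the ones providing gam() and rmse(), are an assumption based on the printed output, so adjust them to match your own setup):
library(raster)   # raster(), stack(), extract(), rasterize(), writeRaster()
library(sf)       # read_sf(), st_sample()
library(ggplot2)  # ggplot(), geom_sf(), geom_point()
library(dplyr)    # recode() and the %>% pipe
library(tibble)   # tribble()
library(caret)    # createDataPartition(), trainControl(), train()
library(gam)      # gam() and s()  (assumed; mgcv provides an alternative gam())
library(Metrics)  # rmse()  (assumed; other packages offer equivalents)
# note: the nnet package is loaded automatically by caret when method = 'nnet'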
First we will import our population and nighttime lights (DNB) data. We will read each raster in, print out a description, and plot it.
pop = raster('../Example_Data/Belize_median_POP_2010.tif')
print(pop)
## class : RasterLayer
## dimensions : 637, 366, 233142 (nrow, ncol, ncell)
## resolution : 0.004491576, 0.004491576 (x, y)
## extent : -89.26559, -87.62167, 15.74298, 18.60411 (xmin, xmax, ymin, ymax)
## crs : +proj=longlat +datum=WGS84 +no_defs +ellps=WGS84 +towgs84=0,0,0
## source : /home/mmann/Documents/Github/Belize_GEE_R_Tutorial/Example_Data/Belize_median_POP_2010.tif
## names : Belize_median_POP_2010
## values : 0.03336188, 36.27045 (min, max)
plot(pop)
dnb = raster('../Example_Data/Belize_median_DNB_2014_2018.tif')
print(dnb)
## class : RasterLayer
## dimensions : 637, 366, 233142 (nrow, ncol, ncell)
## resolution : 0.004491576, 0.004491576 (x, y)
## extent : -89.26559, -87.62167, 15.74298, 18.60411 (xmin, xmax, ymin, ymax)
## crs : +proj=longlat +datum=WGS84 +no_defs +ellps=WGS84 +towgs84=0,0,0
## source : /home/mmann/Documents/Github/Belize_GEE_R_Tutorial/Example_Data/Belize_median_DNB_2014_2018.tif
## names : Belize_median_DNB_2014_2018
plot(dnb)
One important feature to note here is that both images have the same pixel resolution, the same number of rows and columns, the same extent, and the same projection. They both have: 637 rows and 366 columns, a pixel resolution of 0.004491576 degrees, and a WGS84 geographic (longitude/latitude) coordinate system.
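If you want to verify this programmatically, the raster package's compareRaster() function checks these properties for you (a small optional check, not part of the original workflow); it returns TRUE when the grids line up and throws an error when they do not:
# confirm the two rasters share the same extent, dimensions, projection and resolution
compareRaster(pop, dnb, extent = TRUE, rowcol = TRUE, crs = TRUE, res = TRUE)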
Since they have the same properties but hold different data, it often helps to create a ‘stack’ out of them. We are essentially creating a two-band raster, with the first band being the nighttime lights DNB data and the second being the population data.
The following code stacks the two images:
dnb_pop_stack = stack(dnb, pop)
print(dnb_pop_stack)
## class : RasterStack
## dimensions : 637, 366, 233142, 2 (nrow, ncol, ncell, nlayers)
## resolution : 0.004491576, 0.004491576 (x, y)
## extent : -89.26559, -87.62167, 15.74298, 18.60411 (xmin, xmax, ymin, ymax)
## crs : +proj=longlat +datum=WGS84 +no_defs +ellps=WGS84 +towgs84=0,0,0
## names : Belize_median_DNB_2014_2018, Belize_median_POP_2010
## min values : ?, 0.03336188
## max values : ?, 36.27045
# read in the Belize national boundary (GADM level 0 shapefile)
blz = read_sf('../Example_Data/gadm36_BLZ_0.shp')
#plot(blz) # simple plot
ggplot()+geom_sf(data=blz)
We are going to want to take a random sample of our population and nighttime lights data. To do this we need to create a set of random spatial points throughout the country. After that we will extract the pop and dnb values at those locations and use them to build a model of the form pop = fn(dnb).
# create 1000 random points
set.seed(1)
random_points = st_sample(x = blz,size=1000 ,type = 'random')
# visualize points
ggplot()+geom_sf(data=random_points)
Now let’s add the country boundary and overlay the points.
# visualize boundary and points # alpha controls transparency
ggplot()+geom_sf(data=blz)+geom_sf(data=random_points,alpha=0.1)
In this step we extract the raster stack data to those random points. Unfortunately the extract function can’t use an sf-based point file, so we will need to create a copy of sp type and then use that to do the extraction.
set.seed(3)
random_points_sp = as(random_points, "Spatial") # extract function needs an sp spatial object, not sf
# Extract raster data to points
pop_dnb_df = extract(dnb_pop_stack, random_points_sp, df=TRUE) ## df=TRUE i.e. return as a data.frame
# drop any rows where no satellite values were available (missing values are stored as NA)
pop_dnb_df = na.omit(pop_dnb_df)
# print top
print(head(pop_dnb_df))
## ID Belize_median_DNB_2014_2018 Belize_median_POP_2010
## 1 1 0.13918549 0.05955504
## 2 2 0.08886480 0.07712017
## 3 3 0.08162701 0.05563286
## 4 4 0.04958004 0.01328016
## 5 5 0.07530552 0.01698016
## 6 6 0.09782595 0.07531688
We can now see that each point (the point number is stored in ID) has two values: Belize_median_DNB_2014_2018 and Belize_median_POP_2010. We can use these to build a statistical model that predicts population as a function of nighttime lights.
It is often helpful to look at a simple scatter plot of your Y and x values. In this case, it looks like we might have some kind of non-linear response. Let’s find out.
ggplot()+geom_point(data=pop_dnb_df,aes(x=Belize_median_DNB_2014_2018,y=Belize_median_POP_2010))
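One quick, optional way to probe for non-linearity is to compare a Pearson (linear) correlation with a rank-based Spearman correlation; a noticeably higher Spearman value suggests the relationship is monotonic but not linear (this check is an addition to the original tutorial):
# Pearson measures linear association, Spearman measures monotonic association
cor(pop_dnb_df$Belize_median_DNB_2014_2018, pop_dnb_df$Belize_median_POP_2010, method = 'pearson')
cor(pop_dnb_df$Belize_median_DNB_2014_2018, pop_dnb_df$Belize_median_POP_2010, method = 'spearman')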
In this tutorial we will be comparing the performance of a few models, including: an ordinary least squares (OLS) regression, a generalized additive model (GAM), and a simple neural network.
Before we do this we need to create two independent datasets: a ‘training’ dataset holding 80% of the observations and a ‘testing’ dataset holding the remaining 20%, which we will use to assess out-of-sample model performance.
set.seed(2) # this allows us to get the same random sample repeatedly (reproducable results)
# get row numbers of the training dataset
trainIndex = createDataPartition(y = pop_dnb_df$Belize_median_DNB_2014_2018, # typically put your Y variable here
p = .8,
list = FALSE,
times = 1)
pop_dnb_train = pop_dnb_df[ trainIndex,] # keep training rows
pop_dnb_test = pop_dnb_df[-trainIndex,] # omit training rows (keep testing)
# show top example and shape of testing and training
print(head(pop_dnb_train))
## ID Belize_median_DNB_2014_2018 Belize_median_POP_2010
## 3 3 0.08162701 0.05563286
## 4 4 0.04958004 0.01328016
## 5 5 0.07530552 0.01698016
## 7 7 1.05196548 0.75056845
## 9 9 0.08552376 0.06077643
## 10 10 0.10520373 0.06756844
print(dim(pop_dnb_train))
## [1] 791 3
print(dim(pop_dnb_test))
## [1] 196 3
Now that we have our testing and training data we can start fitting models. Let’s start with a simple ordinary least squares regression.
# store the formula defining Y and x
lm_formula = Belize_median_POP_2010 ~ Belize_median_DNB_2014_2018
lm_fit = lm(lm_formula, data=pop_dnb_train)
summary(lm_fit)
##
## Call:
## lm(formula = lm_formula, data = pop_dnb_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -22.596 -0.050 -0.029 -0.009 90.991
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.07132 0.12208 -0.584 0.559
## Belize_median_DNB_2014_2018 1.64032 0.10484 15.646 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.387 on 789 degrees of freedom
## Multiple R-squared: 0.2368, Adjusted R-squared: 0.2358
## F-statistic: 244.8 on 1 and 789 DF, p-value: < 2.2e-16
Important: Keep in mind that these are POPULATION ESTIMATES, so don’t read too much into the results.
Since we suspect non-linear relationships between Y and x, we will fit a Generalized Additive Model (GAM). GAMs are a class of statistical models in which the usual linear relationship between the response (Y) and the predictors (x) is replaced by smooth non-linear functions that capture the non-linearities in the data. They are a flexible technique for fitting models that can depend either linearly or non-linearly on several predictors Xi.
Effectively, a GAM creates piecewise regression estimates with constraints that force the line segments to connect. Here is an example of fitting a smooth curve (a loess fit) to a set of non-linear data.
synth = data.frame(y=seq(0,100,5),x = seq(0,100,5)^2)
ggplot(data= synth,aes(x=x,y=y)) + geom_point() +geom_smooth(method = 'loess')
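For comparison, geom_smooth() can also draw a spline-based smooth of the kind a GAM uses; the short sketch below assumes the mgcv package is installed, since ggplot2 calls it internally when method = 'gam':
# same synthetic data, smoothed with a penalized regression spline instead of loess
ggplot(data = synth, aes(x = x, y = y)) + geom_point() + geom_smooth(method = 'gam', formula = y ~ s(x))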
Here we fit a GAM to help capture any non-linear relationships.
formula_gam = Belize_median_POP_2010 ~ s(Belize_median_DNB_2014_2018) #s() designates which variables to smooth
# fit the gams model
gam_fit = gam(formula_gam, data= pop_dnb_train)
## Warning in model.matrix.default(mt, mf, contrasts): non-list contrasts
## argument ignored
# see results
summary(gam_fit)
##
## Call: gam(formula = formula_gam, data = pop_dnb_train)
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -41.781150 -0.016777 -0.002971 0.012277 32.133195
##
## (Dispersion Parameter for gaussian family taken to be 3.7235)
##
## Null Deviance: 11860 on 790 degrees of freedom
## Residual Deviance: 2926.647 on 785.9999 degrees of freedom
## AIC: 3291.638
##
## Number of Local Scoring Iterations: 2
##
## Anova for Parametric Effects
## Df Sum Sq Mean Sq F value Pr(>F)
## s(Belize_median_DNB_2014_2018) 1 2808.4 2808.43 754.25 < 2.2e-16 ***
## Residuals 786 2926.7 3.72
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Anova for Nonparametric Effects
## Npar Df Npar F Pr(F)
## (Intercept)
## s(Belize_median_DNB_2014_2018) 3 548.31 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
With GAMs it is often informative to plot the estimated non-linear responses. Here we can see that the response increases up to a point and then declines.
# plot non-linear response
plot(gam_fit,se=TRUE)
Next we will fit a very simple neural network. This method searches for a set of weights that describe the relationship between the x (input) and Y (output) layers. Most of the calculation happens in the hidden layer: each unit takes the inputs from the input layer, multiplies them by its (initially random) weights and sums them, and then passes the result through an activation function. The output layer combines these activations to make the final prediction.
set.seed(825)
# set parameters for cross validation
fitControl <- trainControl(## 5-fold CV
method = "repeatedcv",
number = 5,
## repeated 5 times
repeats = 5)
formula_ml = Belize_median_POP_2010 ~ Belize_median_DNB_2014_2018
# fit the model
nnet_fit <- train(formula_ml,
data = pop_dnb_train,
trControl = fitControl,
method = "nnet",
preProc = c("center", "scale"), # subtract mean and divide by std dev
verbose = FALSE)
nnet_fit
## Neural Network
##
## 791 samples
## 1 predictor
##
## Pre-processing: centered (1), scaled (1)
## Resampling: Cross-Validated (5 fold, repeated 5 times)
## Summary of sample sizes: 632, 633, 632, 634, 633, 632, ...
## Resampling results across tuning parameters:
##
## size decay RMSE Rsquared MAE
## 1 0e+00 2.228563 0.3432170 0.2347637
## 1 1e-04 2.212628 0.3452640 0.2260535
## 1 1e-01 2.197500 0.4119595 0.2266696
## 3 0e+00 2.227304 0.3330964 0.2366021
## 3 1e-04 2.203022 0.4525949 0.2129799
## 3 1e-01 2.193545 0.4310922 0.2187953
## 5 0e+00 2.226056 0.4416630 0.2305579
## 5 1e-04 2.202381 0.4696927 0.2120839
## 5 1e-01 2.192608 0.4376314 0.2166679
##
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were size = 5 and decay = 0.1.
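caret stores the full tuning results in the returned object, so you can also inspect the selected hyper-parameters or plot RMSE across the tuning grid (an optional sketch):
# best size/decay combination chosen by cross-validation
nnet_fit$bestTune
# RMSE profile across the tuning grid
plot(nnet_fit)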
Now that we have fitted three models we need to compare their performance on the independent testing dataset. There are a few different metrics we could use; AIC and R^2 are not available for all of these models, so we will focus on the root mean squared error (RMSE) instead.
First we need to make a prediction to the testing dataset for each model. We will then compare these predictions to the actual value and calculate the RMSE for each model.
# store the actual observations of Y
Y = pop_dnb_test$Belize_median_POP_2010
# make predictions to the test dataset
pred_lm = predict(lm_fit, newdata=pop_dnb_test)
pred_gam = predict(gam_fit, newdata=pop_dnb_test)
pred_nnet = predict(nnet_fit, newdata=pop_dnb_test)
# calculate RMSE
rmse_lm = rmse(Y,pred_lm)
rmse_gam = rmse(Y,pred_gam)
rmse_nnet = rmse(Y,pred_nnet)
tribble(
~Model, ~RMSE,
"ols", rmse_lm,
"gams", rmse_gam,
"nnet",rmse_nnet
)
## # A tibble: 3 x 2
## Model RMSE
## <chr> <dbl>
## 1 ols 1.47
## 2 gams 18.0
## 3 nnet 0.758
In this case it looks like the simple neural net outperforms all the other models. This is likely due to our improved handling of model fit through cross-validation.
Luckily making predictions on spatial data (stored in a stack) is as easy as using the predict function.
# make a prediction using OLS back to the raster stack
pred_lm_raster = predict(dnb_pop_stack, model=lm_fit)
plot(pred_lm_raster)
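Since the neural net had the lowest test RMSE, we could in principle map its predictions the same way. The sketch below is untested here and assumes raster::predict can hand the stack's layer values to caret's predict method, which works when the layer names match the model's predictor names (as they do in this stack):
# make a prediction using the fitted neural network back to the raster stack
# na.rm = TRUE skips cells with missing layer values
pred_nnet_raster = predict(dnb_pop_stack, model = nnet_fit, na.rm = TRUE)
plot(pred_nnet_raster)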
Finally, let’s write our raster prediction out to a file so that it can be opened in ArcGIS or QGIS.
writeRaster(x = pred_lm_raster,
filename = '../Example_Data/pred_lm_raster.tif',
overwrite=TRUE)
One potential way to improve our model is to add additional covariates. In this case we might suspect that the relationship between nighttime lights and population differs across different parts of the country.
To control for this we will want to add a shapefile of administrative boundaries to our raster stack.
In order to do this we must follow these general steps: convert the administrative boundary polygons to a raster, add that raster to our stack, extract the stack values at our random points, and refit the model with the new covariate.
Let’s take a look at the shapefile
# read in data and plots
admin = read_sf('../Example_Data/gadm36_BLZ_1.shp')
ggplot()+geom_sf(data=admin)
Now we need to rasterize the data by telling R which column of values to use, and by giving it an example of what the final raster should look like (resolution, extent, etc.).
# create an sp version for rasterizing
admin_sp = as(admin,'Spatial')
## create a numeric value for each admin level
# Copy admin names to new column
admin_sp$NAME_1_numeric = admin_sp$NAME_1
# look at the number of unique values
table(admin_sp$NAME_1_numeric)
##
## Belize Cayo Corozal Orange Walk Stann Creek Toledo
## 1 1 1 1 1 1
# recode values to numeric
admin_sp$NAME_1_numeric = admin_sp$NAME_1_numeric %>%
recode(Belize = 1, Cayo = 2, Corozal = 3, `Orange Walk` = 4,
`Stann Creek` = 5, Toledo = 6)
head(admin_sp[,c("NAME_1", "NAME_1_numeric")])
## NAME_1 NAME_1_numeric
## 1 Belize 1
## 2 Cayo 2
## 3 Corozal 3
## 4 Orange Walk 4
## 5 Stann Creek 5
## 6 Toledo 6
admin_raster = rasterize(
x=admin_sp, # shape to rasterize
y= dnb, # example raster to copy properties from
field = 'NAME_1_numeric' # numeric field to rasterize
)
# name the layer that was rasterized
names(admin_raster) = 'Admin'
plot(admin_raster)
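As a quick, optional sanity check, raster::freq() tabulates how many cells were assigned to each district code (cells outside the country boundary show up as NA):
# count cells per admin code
freq(admin_raster)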
Now we need to add these values to the stack and extract data to points.
# create the new raster stack
dnb_pop_adm_stack = stack(dnb,pop,admin_raster)
# extract values to points
pop_dnb_adm_df = extract(dnb_pop_adm_stack, random_points_sp, df=TRUE) ## df=TRUE i.e. return as a data.frame
# remove nas
pop_dnb_adm_df = na.omit(pop_dnb_adm_df)
head(pop_dnb_adm_df)
## ID Belize_median_DNB_2014_2018 Belize_median_POP_2010 Admin
## 1 1 0.13918549 0.05955504 2
## 2 2 0.08886480 0.07712017 2
## 3 3 0.08162701 0.05563286 3
## 4 4 0.04958004 0.01328016 2
## 5 5 0.07530552 0.01698016 2
## 6 6 0.09782595 0.07531688 2
Let’s create an OLS model with slope dummies (by interacting the administrative boundaries with the dnb values). We are going to declare Admin to be a “factor”; this effectively treats the variable as a fixed-effects component and creates the appropriate dummy variables. We can also handle interactions easily by simply using * to indicate which variables should be interacted; R automatically includes the appropriate interaction terms in the model.
# define the specification and include interaction terms
formula_lm_admin = Belize_median_POP_2010 ~ Belize_median_DNB_2014_2018*factor(Admin)
# break observations into testing and training splits
pop_dnb_adm_train = pop_dnb_adm_df[ trainIndex,] # keep training rows
pop_dnb_adm_test = pop_dnb_adm_df[-trainIndex,] # omit training rows (keep testing)
# fit the model
lm_fit_admin = lm(formula_lm_admin, data = pop_dnb_adm_train)
summary(lm_fit_admin)
##
## Call:
## lm(formula = formula_lm_admin, data = pop_dnb_adm_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -28.9371 -0.0088 0.0176 0.0740 23.5374
##
## Coefficients:
## Estimate Std. Error t value
## (Intercept) 0.02173 0.21757 0.100
## Belize_median_DNB_2014_2018 0.19858 0.86053 0.231
## factor(Admin)2 -1.21909 0.25548 -4.772
## factor(Admin)3 -0.01593 0.31614 -0.050
## factor(Admin)4 -0.08304 0.26078 -0.318
## factor(Admin)5 0.03077 0.29527 0.104
## factor(Admin)6 -0.05432 0.25923 -0.210
## Belize_median_DNB_2014_2018:factor(Admin)2 9.18908 0.87913 10.453
## Belize_median_DNB_2014_2018:factor(Admin)3 0.33018 0.86535 0.382
## Belize_median_DNB_2014_2018:factor(Admin)4 0.66337 0.86261 0.769
## Belize_median_DNB_2014_2018:factor(Admin)5 -0.06557 1.14135 -0.057
## Belize_median_DNB_2014_2018:factor(Admin)6 0.61188 0.92119 0.664
## Pr(>|t|)
## (Intercept) 0.920
## Belize_median_DNB_2014_2018 0.818
## factor(Admin)2 2.18e-06 ***
## factor(Admin)3 0.960
## factor(Admin)4 0.750
## factor(Admin)5 0.917
## factor(Admin)6 0.834
## Belize_median_DNB_2014_2018:factor(Admin)2 < 2e-16 ***
## Belize_median_DNB_2014_2018:factor(Admin)3 0.703
## Belize_median_DNB_2014_2018:factor(Admin)4 0.442
## Belize_median_DNB_2014_2018:factor(Admin)5 0.954
## Belize_median_DNB_2014_2018:factor(Admin)6 0.507
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.778 on 777 degrees of freedom
## (2 observations deleted due to missingness)
## Multiple R-squared: 0.7933, Adjusted R-squared: 0.7904
## F-statistic: 271.2 on 11 and 777 DF, p-value: < 2.2e-16
pred_lm_admin = predict(lm_fit_admin, newdata=pop_dnb_adm_test)
# calculate RMSE
rmse_lm_admin = rmse(Y,pred_lm_admin)
tribble(
~Model, ~RMSE,
"ols", rmse_lm,
"ols admin",rmse_lm_admin,
"gams", rmse_gam,
"nnet",rmse_nnet
)
## # A tibble: 4 x 2
## Model RMSE
## <chr> <dbl>
## 1 ols 1.47
## 2 ols admin 1.24
## 3 gams 18.0
## 4 nnet 0.758
It looks like adding the administrative interactions does improve the model fit, but not enough to surpass the results from a basic neural net.
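A natural follow-up, left here only as an untested sketch, would be to give the neural network the same administrative information and see whether it improves further. This reuses the fitControl object defined earlier; wrapping Admin in factor() lets train() expand it into dummy variables:
# sketch: refit the neural net with the admin covariate included
formula_ml_admin = Belize_median_POP_2010 ~ Belize_median_DNB_2014_2018 + factor(Admin)
nnet_fit_admin <- train(formula_ml_admin,
                        data = na.omit(pop_dnb_adm_train), # drop any rows lost in the join/split
                        trControl = fitControl,
                        method = "nnet",
                        preProc = c("center", "scale"),
                        verbose = FALSE)
# out-of-sample RMSE on the complete cases of the admin test set
adm_test_cc = na.omit(pop_dnb_adm_test)
rmse(adm_test_cc$Belize_median_POP_2010, predict(nnet_fit_admin, newdata = adm_test_cc))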