DEM 7093 - Using R for Point analysis Part 2

In this example I will use QGIS geoprocessing scripts through the RQGIS library in R. I will use data from the 2005 Peru Demographic and Health Survey, and data from the Peruvian government on the locations of secondary schools.

We use buffers from each DHS primary sampling unit location and point in polygon operations to measure whether a community had a secondary school within 5km.

Then, we use a hierarchical model to test whether a woman’s educational attainment is related to physical access to secondary schooling.

This is an example of a derived variable that cannot be obtained without the use of the GIS.

library(sf)

## Linking to GEOS 3.6.1, GDAL 2.2.3, proj.4 4.9.3

library(RQGIS) #you need Qgis 2.18 installed, not 3!

## Loading required package: reticulate

library(mapview)

Read in the DHS sampling unit locations

peru_dhs_points<-st_read("P:/LAPIVpaper/PEGE5BFL", "PEGE52FL")

## Reading layer `PEGE52FL' from data source `P:\LAPIVpaper\PEGE5BFL' using driver `ESRI Shapefile'
## Simple feature collection with 1414 features and 20 fields
## geometry type:  POINT
## dimension:      XY
## bbox:           xmin: -81.30633 ymin: -18.22461 xmax: 0 ymax: 0
## epsg (SRID):    4326
## proj4string:    +proj=longlat +datum=WGS84 +no_defs

peru_dhs_points<-st_transform(peru_dhs_points, crs=24892)
#project the data into a meter based system
mapview(peru_dhs_points["DHSCLUST"])

Set up QGIS environment

This lets R find where your QGIS binaries are located. set_env() should work without you specifying the path, but in my case I had to tell R where Qgis lives.

my_env<-set_env(root = "C:/Program Files/QGIS 2.18/")

Point in Polygon analysis

So, right now we only have points (locations of sampling units), so we need a buffer around each point in order to do our assessment of whether a school is within 5km of each community. To do this we will do a fixed distance buffer of 5km around each sampling location.

Buffer analysis

First we find the name of the algorithm for point buffering

find_algorithms(search_term = "buffer", qgis_env = my_env)

##  [1] "Fixed distance buffer-------------------------------->qgis:fixeddistancebuffer"                                                                                               
##  [2] "Variable distance buffer----------------------------->qgis:variabledistancebuffer"                                                                                            
##  [3] "r.buffer - Creates a raster map layer showing buffer zones surrounding cells that contain non-NULL category values.--->grass7:r.buffer"                                       
##  [4] "r.buffer.lowmem - Creates a raster map layer showing buffer zones surrounding cells that contain non-NULL category values (low-memory alternative).--->grass7:r.buffer.lowmem"
##  [5] "v.buffer.column - Creates a buffer around features of given type.--->grass7:v.buffer.column"                                                                                  
##  [6] "v.buffer.distance - Creates a buffer around features of given type.--->grass7:v.buffer.distance"                                                                              
##  [7] "Fixed distance buffer-------------------------------->saga:shapesbufferfixeddistance"                                                                                         
##  [8] "Raster buffer---------------------------------------->saga:gridbuffer"                                                                                                        
##  [9] "Raster proximity buffer------------------------------>saga:gridproximitybuffer"                                                                                               
## [10] "Threshold raster buffer------------------------------>saga:thresholdbuffer"                                                                                                   
## [11] "Variable distance buffer----------------------------->saga:shapesbufferattributedistance"                                                                                     
## [12] "Buffer vectors--------------------------------------->gdalogr:buffervectors"                                                                                                  
## [13] "Single sided buffers (and offset lines) for lines---->gdalogr:singlesidedbuffersandoffsetlinesforlines"                                                                       
## [14] "r.buffer - Creates a raster map layer showing buffer zones surrounding cells that contain non-NULL category values.--->grass:r.buffer"                                        
## [15] "v.buffer.column - Creates a buffer around features of given type.--->grass:v.buffer.column"                                                                                   
## [16] "v.buffer.distance - Creates a buffer around features of given type.--->grass:v.buffer.distance"

For this, we’ll use the qgis:fixeddistancebuffer function, but we need to see what the arguments to the function are:

get_usage(alg="qgis:fixeddistancebuffer", qgis_env = my_env, intern = F)

## ALGORITHM: Fixed distance buffer
##  INPUT <ParameterVector>
##  DISTANCE <ParameterNumber>
##  SEGMENTS <ParameterNumber>
##  DISSOLVE <ParameterBoolean>
##  OUTPUT <OutputVector>

so we have 5 arguments, when we use the function, we need to specify some of these:

params <- get_args_man(alg = "qgis:fixeddistancebuffer", qgis_env = my_env)
params

## $INPUT
## [1] "None"
## 
## $DISTANCE
## [1] "10.0"
## 
## $SEGMENTS
## [1] "5"
## 
## $DISSOLVE
## [1] "False"
## 
## $OUTPUT
## [1] "None"

Here I do a 5km buffer around each PSU location.

wd<-"C:/Users/ozd504/Google Drive/classes/dem7093/GIS_class_2018/data"
params$INPUT <-peru_dhs_points
params$DISTANCE<-5000 #5km around each point
params$OUTPUT<-file.path(wd, "peru_psu_buffer_5k.shp") # path to the output shapefile

now we have our parameters defined, we run the script:

peru_buff <- run_qgis(alg = "qgis:fixeddistancebuffer",
                    params = params,
                    load_output = TRUE,
                    qgis_env = my_env)

## $OUTPUT
## [1] "C:/Users/ozd504/Google Drive/classes/dem7093/GIS_class_2018/data/peru_psu_buffer_5k.shp"

mapview(peru_buff["DHSCLUST"], legend=T,map.types="OpenStreetMap")

5km buffers done! Now, we need to do our point in polygon operation. So I read in the Peruvian school data. These are a csv file, but have lat/long specified, so I can read in the text and make a simple feature layer from it. You can’t have missing information in the coordinate data to do this.

library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

schools<-read.csv("C:/Users/ozd504/Google Drive/classes/dem7093/GIS_class_2018/data/perusecondaryschools.csv", header=T)
schools<-schools%>%
  filter(complete.cases(Latitud, Longitud))

persch<-st_as_sf(schools, coords=c("Longitud", "Latitud"), crs=4326,agr="constant")
persch<-st_transform(persch,crs=24892 )

mapview(persch["Departamento"])

points in polygons

See what the right script is, and what arguments it needs:

find_algorithms(search_term = "in polygon")

## [1] "Count points in polygon------------------------------>qgis:countpointsinpolygon"        
## [2] "Count points in polygon(weighted)-------------------->qgis:countpointsinpolygonweighted"
## [3] "Count unique points in polygon----------------------->qgis:countuniquepointsinpolygon"  
## [4] "Fit n points in polygon------------------------------>saga:fitnpointstoshape"

get_usage(alg="qgis:countpointsinpolygon", qgis_env = my_env, intern = F)

## ALGORITHM: Count points in polygon
##  POLYGONS <ParameterVector>
##  POINTS <ParameterVector>
##  FIELD <ParameterString>
##  OUTPUT <OutputVector>

params <- get_args_man(alg = "qgis:countpointsinpolygon", qgis_env = my_env)
params

## $POLYGONS
## [1] "None"
## 
## $POINTS
## [1] "None"
## 
## $FIELD
## [1] "\"NUMPOINTS\""
## 
## $OUTPUT
## [1] "None"

make the parameter block

wd<-"C:/Users/ozd504/Google Drive/classes/dem7093/GIS_class_2018/data"
params$POLYGONS <-peru_buff
params$POINTS<-persch
params$OUTPUT<-file.path(wd, "peru_schoolsinbuffer.shp") # path to the output shapefile

now we have our parameters defined, we run the script:

peru_pip <- run_qgis(alg = "qgis:countpointsinpolygon",
                    params = params,
                    load_output = TRUE,
                    qgis_env = my_env)

## Warning in abbreviate(fld_names, minlength = 7): abbreviate used with non-
## ASCII chars

## Warning in abbreviate_shapefile_names(obj): Field names abbreviated for
## ESRI Shapefile driver

## Warning in CPL_write_ogr(obj, dsn, layer, driver,
## as.character(dataset_options), : GDAL Message 1: One or several characters
## couldn't be converted correctly from UTF-8 to ISO-8859-1. This warning will
## not be emitted anymore.

## $OUTPUT
## [1] "C:/Users/ozd504/Google Drive/classes/dem7093/GIS_class_2018/data/peru_schoolsinbuffer.shp"

mapview(peru_pip["NUMPOINTS"],map.types="OpenStreetMap", legend=T)

#recode the count of schools to be binary (0 or more than 0 schools)
peru_pip$closeschool<-ifelse(peru_pip$NUMPOINTS>0,1,0)
summary(peru_pip$closeschool)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  1.0000  1.0000  0.9279  1.0000  1.0000

So we see that 92.8% of communities have a secondary school within 5km, that’s great, but what about the communities that don’t?

Merge the new location data to our survey

Here we load the 2005 DHS data, and do some recodes for educational attainment, age, and get the survey design variables.

library(haven)
library(survey)

## Loading required package: grid

## Loading required package: Matrix

## Loading required package: survival

## 
## Attaching package: 'survey'

## The following object is masked from 'package:graphics':
## 
##     dotchart

library(dplyr)

peru_data<-read_dta("P:/LAPIVpaper/DHS_IPVdata/PEIR51FL.DTA")

peru_data<-peru_data%>%
  mutate(eduprim=ifelse(v106==1,1,0),
  edusecplus=ifelse(v106!=9&v106>=2,1,0),
  pwt=v005/1000000, 
  age=v012)%>%
  select(v000, v021, v022,v024,age, pwt, eduprim, edusecplus )

head(peru_data)

## # A tibble: 6 x 8
##   v000   v021  v022 v024        age   pwt eduprim edusecplus
##   <chr> <dbl> <dbl> <dbl+lbl> <dbl> <dbl>   <dbl>      <dbl>
## 1 PE5       5     1 1            38 0.437       0          1
## 2 PE5       5     1 1            20 0.437       0          1
## 3 PE5       5     1 1            34 0.437       0          1
## 4 PE5       5     1 1            30 0.437       0          1
## 5 PE5       5     1 1            35 0.437       0          1
## 6 PE5       5     1 1            26 0.437       0          1

Now we merge the survey to the spatial data:

peru_merge<-left_join(peru_data, peru_pip,by=c("v021"="DHSCLUST"))

## Warning: Column `v021`/`DHSCLUST` has different attributes on LHS and RHS
## of join

In order to test for the effect of school access on women’s educational attainment, we use a binomial generalized linear mixed model. We control for the survey design using a nested random effect specification (PSUs within strata).

library(lme4)
fit<-glmer(edusecplus~closeschool+scale(age)+(1|v022/v021),data=peru_merge, family = binomial, weights=pwt)

## Warning in eval(family$initialize, rho): non-integer #successes in a
## binomial glm!

summary(fit)

## Generalized linear mixed model fit by maximum likelihood (Laplace
##   Approximation) [glmerMod]
##  Family: binomial  ( logit )
## Formula: edusecplus ~ closeschool + scale(age) + (1 | v022/v021)
##    Data: peru_merge
## Weights: pwt
## 
##      AIC      BIC   logLik deviance df.resid 
##  22762.6  22804.4 -11376.3  22752.6    31790 
## 
## Scaled residuals: 
##      Min       1Q   Median       3Q      Max 
## -20.4475  -0.3160   0.1769   0.4045   7.4010 
## 
## Random effects:
##  Groups    Name        Variance Std.Dev.
##  v021:v022 (Intercept) 1.100    1.049   
##  v022      (Intercept) 1.636    1.279   
## Number of obs: 31795, groups:  v021:v022, 1423; v022, 49
## 
## Fixed effects:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  0.41221    0.24596   1.676  0.09376 .  
## closeschool  0.54418    0.17122   3.178  0.00148 ** 
## scale(age)  -0.96633    0.01979 -48.817  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Correlation of Fixed Effects:
##             (Intr) clssch
## closeschool -0.639       
## scale(age)  -0.015 -0.008

So, we see in this case that if women lived within 5km of a secondary school, they are 61.95 % more likely to have a secondary education or more, compared to woman living in a community without a school within 5km.

DEM 7093 - Using R for Point analysis Part 2 - Derived Variables

Corey S. Sparks, Ph.D. - University of Texas at San Antonio

June 13, 2018