In this example I will use QGIS geoprocessing scripts through the RQGIS library in R. I will use data from the 2005 Peru Demographic and Health Survey, and data from the Peruvian government on the locations of secondary schools.
We use buffers from each DHS primary sampling unit location and point in polygon operations to measure whether a community had a secondary school within 5km.
Then, we use a hierarchical model to test whether a woman’s educational attainment is related to physical access to secondary schooling.
This is an example of a derived variable that cannot be obtained without the use of the GIS.
library(sf)
## Linking to GEOS 3.6.1, GDAL 2.2.3, proj.4 4.9.3
library(RQGIS) #you need Qgis 2.18 installed, not 3!
## Loading required package: reticulate
library(mapview)
peru_dhs_points<-st_read("P:/LAPIVpaper/PEGE5BFL", "PEGE52FL")
## Reading layer `PEGE52FL' from data source `P:\LAPIVpaper\PEGE5BFL' using driver `ESRI Shapefile'
## Simple feature collection with 1414 features and 20 fields
## geometry type: POINT
## dimension: XY
## bbox: xmin: -81.30633 ymin: -18.22461 xmax: 0 ymax: 0
## epsg (SRID): 4326
## proj4string: +proj=longlat +datum=WGS84 +no_defs
peru_dhs_points<-st_transform(peru_dhs_points, crs=24892)
#project the data into a meter based system
mapview(peru_dhs_points["DHSCLUST"])
This lets R find where your QGIS binaries are located. set_env() should work without you specifying the path, but in my case I had to tell R where Qgis lives.
my_env<-set_env(root = "C:/Program Files/QGIS 2.18/")
So, right now we only have points (locations of sampling units), so we need a buffer around each point in order to do our assessment of whether a school is within 5km of each community. To do this we will do a fixed distance buffer of 5km around each sampling location.
First we find the name of the algorithm for point buffering
find_algorithms(search_term = "buffer", qgis_env = my_env)
## [1] "Fixed distance buffer-------------------------------->qgis:fixeddistancebuffer"
## [2] "Variable distance buffer----------------------------->qgis:variabledistancebuffer"
## [3] "r.buffer - Creates a raster map layer showing buffer zones surrounding cells that contain non-NULL category values.--->grass7:r.buffer"
## [4] "r.buffer.lowmem - Creates a raster map layer showing buffer zones surrounding cells that contain non-NULL category values (low-memory alternative).--->grass7:r.buffer.lowmem"
## [5] "v.buffer.column - Creates a buffer around features of given type.--->grass7:v.buffer.column"
## [6] "v.buffer.distance - Creates a buffer around features of given type.--->grass7:v.buffer.distance"
## [7] "Fixed distance buffer-------------------------------->saga:shapesbufferfixeddistance"
## [8] "Raster buffer---------------------------------------->saga:gridbuffer"
## [9] "Raster proximity buffer------------------------------>saga:gridproximitybuffer"
## [10] "Threshold raster buffer------------------------------>saga:thresholdbuffer"
## [11] "Variable distance buffer----------------------------->saga:shapesbufferattributedistance"
## [12] "Buffer vectors--------------------------------------->gdalogr:buffervectors"
## [13] "Single sided buffers (and offset lines) for lines---->gdalogr:singlesidedbuffersandoffsetlinesforlines"
## [14] "r.buffer - Creates a raster map layer showing buffer zones surrounding cells that contain non-NULL category values.--->grass:r.buffer"
## [15] "v.buffer.column - Creates a buffer around features of given type.--->grass:v.buffer.column"
## [16] "v.buffer.distance - Creates a buffer around features of given type.--->grass:v.buffer.distance"
For this, we’ll use the qgis:fixeddistancebuffer function, but we need to see what the arguments to the function are:
get_usage(alg="qgis:fixeddistancebuffer", qgis_env = my_env, intern = F)
## ALGORITHM: Fixed distance buffer
## INPUT <ParameterVector>
## DISTANCE <ParameterNumber>
## SEGMENTS <ParameterNumber>
## DISSOLVE <ParameterBoolean>
## OUTPUT <OutputVector>
so we have 5 arguments, when we use the function, we need to specify some of these:
params <- get_args_man(alg = "qgis:fixeddistancebuffer", qgis_env = my_env)
params
## $INPUT
## [1] "None"
##
## $DISTANCE
## [1] "10.0"
##
## $SEGMENTS
## [1] "5"
##
## $DISSOLVE
## [1] "False"
##
## $OUTPUT
## [1] "None"
Here I do a 5km buffer around each PSU location.
wd<-"C:/Users/ozd504/Google Drive/classes/dem7093/GIS_class_2018/data"
params$INPUT <-peru_dhs_points
params$DISTANCE<-5000 #5km around each point
params$OUTPUT<-file.path(wd, "peru_psu_buffer_5k.shp") # path to the output shapefile
now we have our parameters defined, we run the script:
peru_buff <- run_qgis(alg = "qgis:fixeddistancebuffer",
params = params,
load_output = TRUE,
qgis_env = my_env)
## $OUTPUT
## [1] "C:/Users/ozd504/Google Drive/classes/dem7093/GIS_class_2018/data/peru_psu_buffer_5k.shp"
mapview(peru_buff["DHSCLUST"], legend=T,map.types="OpenStreetMap")
5km buffers done! Now, we need to do our point in polygon operation. So I read in the Peruvian school data. These are a csv file, but have lat/long specified, so I can read in the text and make a simple feature layer from it. You can’t have missing information in the coordinate data to do this.
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
schools<-read.csv("C:/Users/ozd504/Google Drive/classes/dem7093/GIS_class_2018/data/perusecondaryschools.csv", header=T)
schools<-schools%>%
filter(complete.cases(Latitud, Longitud))
persch<-st_as_sf(schools, coords=c("Longitud", "Latitud"), crs=4326,agr="constant")
persch<-st_transform(persch,crs=24892 )
mapview(persch["Departamento"])
See what the right script is, and what arguments it needs:
find_algorithms(search_term = "in polygon")
## [1] "Count points in polygon------------------------------>qgis:countpointsinpolygon"
## [2] "Count points in polygon(weighted)-------------------->qgis:countpointsinpolygonweighted"
## [3] "Count unique points in polygon----------------------->qgis:countuniquepointsinpolygon"
## [4] "Fit n points in polygon------------------------------>saga:fitnpointstoshape"
get_usage(alg="qgis:countpointsinpolygon", qgis_env = my_env, intern = F)
## ALGORITHM: Count points in polygon
## POLYGONS <ParameterVector>
## POINTS <ParameterVector>
## FIELD <ParameterString>
## OUTPUT <OutputVector>
params <- get_args_man(alg = "qgis:countpointsinpolygon", qgis_env = my_env)
params
## $POLYGONS
## [1] "None"
##
## $POINTS
## [1] "None"
##
## $FIELD
## [1] "\"NUMPOINTS\""
##
## $OUTPUT
## [1] "None"
wd<-"C:/Users/ozd504/Google Drive/classes/dem7093/GIS_class_2018/data"
params$POLYGONS <-peru_buff
params$POINTS<-persch
params$OUTPUT<-file.path(wd, "peru_schoolsinbuffer.shp") # path to the output shapefile
now we have our parameters defined, we run the script:
peru_pip <- run_qgis(alg = "qgis:countpointsinpolygon",
params = params,
load_output = TRUE,
qgis_env = my_env)
## Warning in abbreviate(fld_names, minlength = 7): abbreviate used with non-
## ASCII chars
## Warning in abbreviate_shapefile_names(obj): Field names abbreviated for
## ESRI Shapefile driver
## Warning in CPL_write_ogr(obj, dsn, layer, driver,
## as.character(dataset_options), : GDAL Message 1: One or several characters
## couldn't be converted correctly from UTF-8 to ISO-8859-1. This warning will
## not be emitted anymore.
## $OUTPUT
## [1] "C:/Users/ozd504/Google Drive/classes/dem7093/GIS_class_2018/data/peru_schoolsinbuffer.shp"
mapview(peru_pip["NUMPOINTS"],map.types="OpenStreetMap", legend=T)
#recode the count of schools to be binary (0 or more than 0 schools)
peru_pip$closeschool<-ifelse(peru_pip$NUMPOINTS>0,1,0)
summary(peru_pip$closeschool)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 1.0000 1.0000 0.9279 1.0000 1.0000
So we see that 92.8% of communities have a secondary school within 5km, that’s great, but what about the communities that don’t?
Here we load the 2005 DHS data, and do some recodes for educational attainment, age, and get the survey design variables.
library(haven)
library(survey)
## Loading required package: grid
## Loading required package: Matrix
## Loading required package: survival
##
## Attaching package: 'survey'
## The following object is masked from 'package:graphics':
##
## dotchart
library(dplyr)
peru_data<-read_dta("P:/LAPIVpaper/DHS_IPVdata/PEIR51FL.DTA")
peru_data<-peru_data%>%
mutate(eduprim=ifelse(v106==1,1,0),
edusecplus=ifelse(v106!=9&v106>=2,1,0),
pwt=v005/1000000,
age=v012)%>%
select(v000, v021, v022,v024,age, pwt, eduprim, edusecplus )
head(peru_data)
## # A tibble: 6 x 8
## v000 v021 v022 v024 age pwt eduprim edusecplus
## <chr> <dbl> <dbl> <dbl+lbl> <dbl> <dbl> <dbl> <dbl>
## 1 PE5 5 1 1 38 0.437 0 1
## 2 PE5 5 1 1 20 0.437 0 1
## 3 PE5 5 1 1 34 0.437 0 1
## 4 PE5 5 1 1 30 0.437 0 1
## 5 PE5 5 1 1 35 0.437 0 1
## 6 PE5 5 1 1 26 0.437 0 1
Now we merge the survey to the spatial data:
peru_merge<-left_join(peru_data, peru_pip,by=c("v021"="DHSCLUST"))
## Warning: Column `v021`/`DHSCLUST` has different attributes on LHS and RHS
## of join
In order to test for the effect of school access on women’s educational attainment, we use a binomial generalized linear mixed model. We control for the survey design using a nested random effect specification (PSUs within strata).
library(lme4)
fit<-glmer(edusecplus~closeschool+scale(age)+(1|v022/v021),data=peru_merge, family = binomial, weights=pwt)
## Warning in eval(family$initialize, rho): non-integer #successes in a
## binomial glm!
summary(fit)
## Generalized linear mixed model fit by maximum likelihood (Laplace
## Approximation) [glmerMod]
## Family: binomial ( logit )
## Formula: edusecplus ~ closeschool + scale(age) + (1 | v022/v021)
## Data: peru_merge
## Weights: pwt
##
## AIC BIC logLik deviance df.resid
## 22762.6 22804.4 -11376.3 22752.6 31790
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -20.4475 -0.3160 0.1769 0.4045 7.4010
##
## Random effects:
## Groups Name Variance Std.Dev.
## v021:v022 (Intercept) 1.100 1.049
## v022 (Intercept) 1.636 1.279
## Number of obs: 31795, groups: v021:v022, 1423; v022, 49
##
## Fixed effects:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.41221 0.24596 1.676 0.09376 .
## closeschool 0.54418 0.17122 3.178 0.00148 **
## scale(age) -0.96633 0.01979 -48.817 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Correlation of Fixed Effects:
## (Intr) clssch
## closeschool -0.639
## scale(age) -0.015 -0.008
So, we see in this case that if women lived within 5km of a secondary school, they are 61.95 % more likely to have a secondary education or more, compared to woman living in a community without a school within 5km.