2018-04-11, Institut für Geographie - Friedrich-Schiller-Universität Jena

Contents

  • Introductory comments
  • An example: the Propensity to Cycle Tool
  • tibbles and dplyr
  • Discussion: can GDS save the world?

Introduction

What is saving the world?

Many ways of saying the same thing:

  • 'Policy-led research'
  • 'Impact'
  • 'Socially beneficial research'
  • Don't be evil (Google)

My definition: building an evidence-base for sustainable systems.

  • In the context of climate change that means:
  • Building an evidence-base to transition away from fossil fuels
  • But could also be interpreted in terms of other (quantifiable) social/economic/environmental indicators

Why climate change? I

Associated mega-caption

Why climate change? II

Why climate change? III

What is Geographic Data Science?

  • You tell me!
  • How does it differ from good old 'GIS'?
  • What does the science in the title mean?
  • Why the focus on data rather than information?

Code example:

library(tibble)
# tribble() replaces the deprecated frame_data()
d = tribble(
  ~Attribute, ~GIS, ~GDS,
  "Home disciplines", "Geography", "Geography, Computing, Statistics",
  "Software focus", "Graphical User Interface", "Code",
  "Reproducibility", "Minimal", "Maximal"
)

Comparing GDS with GIS

knitr::kable(d)
|Attribute        |GIS                      |GDS                              |
|:----------------|:------------------------|:--------------------------------|
|Home disciplines |Geography                |Geography, Computing, Statistics |
|Software focus   |Graphical User Interface |Code                             |
|Reproducibility  |Minimal                  |Maximal                          |

Geographic data science CAN 'save the world'

But only if it's open and scientific

Reasoning:

  • Evidence inevitably gets skewed by political aims
  • If the people doing the research are influenced by dominant political forces, findings will be biased for political gain (solved by independent well-funded public research).
  • People doing policy-relevant research should watch out (regarding politicians):

“Their very spirit undergoes a pervasive transformation,” and they finally end up as “experts at exchanging smiles, handshakes, and favors.” (Reclus 2013, original: 1898)

Importance of open data and methods

  • If the data underlying policy is hidden, it can be presented selectively to push certain aims (solved by open data)
  • If the data is 'open' but the tools are closed, results remain open to political influence
  • Which brings us onto our next topic…

Example question: Where will cycling uptake happen?

How to transition to active cities? From this…

To this?

With available resources

Context: 'evidence overload'?

  • Challenge: operationalise data
  • Challenge: make locally specific

Data for walking and cycling investment

  • Travel behaviour data
  • Route network data
  • Existing infrastructure (road widths, traffic, future possibilities)
  • Road safety data
  • Air pollution data
  • Crowdsourced data

The international dimension

~200 km cycle network in Seville, Spain. Source: WHO report at [ATFutures/who](https://github.com/ATFutures/who)

  • Not a UK-specific issue, but benefits of country-specific tools

The Propensity to Cycle Tool (PCT)

What can the PCT do? - see www.pct.bike

The front page of the open source, open access Propensity to Cycle Tool (PCT).

Context: from concept to implementation

  • 3 years in the making
  • Origins go back further
  • "An algorithm to decide where to build next"!
  • Internationalisation of methods (World Health Organisation funded project)

The research landscape (see Lovelace et al. 2017)

The PCT in context (source: Lovelace et al. 2017)

|Tool                     |Scale    |Coverage        |Public access |Format of output |Levels of analysis |Software licence |
|:------------------------|:--------|:---------------|:-------------|:----------------|:------------------|:----------------|
|Propensity to Cycle Tool |National |England         |Yes           |Online map       |A, OD, R, RN       |Open source      |
|Prioritization Index     |City     |Montreal        |No            |GIS-based        |P, A, R            |Proprietary      |
|PAT                      |Local    |Parts of Dublin |No            |GIS-based        |A, OD, R           |Proprietary      |
|Usage intensity index    |City     |Belo Horizonte  |No            |GIS-based        |A, OD, R, I        |Proprietary      |
|Cycling Potential Tool   |City     |London          |No            |Static           |A, I               |Unknown          |
|Santa Monica model       |City     |Santa Monica    |No            |Static           |P, OD, A           |Unknown          |

Policy feedback

"The PCT is a brilliant example of using Big Data to better plan infrastructure investment. It will allow us to have more confidence that new schemes are built in places and along travel corridors where there is high latent demand."

  • Shane Snow: Head of Seamless Travel Team, Sustainable and Accessible Travel Division

"The PCT shows the country’s great potential to get on their bikes, highlights the areas of highest possible growth and will be a useful innovation for local authorities to get the greatest bang for their buck from cycling investments and realise cycling potential."

  • Andrew Jones, Parliamentary Under Secretary of State for Transport

The PCT in CWIS

Included in Cycling and Walking Infrastructure Strategy (CWIS)

Scenario shift in desire lines

Source: Lovelace et al. (2017)

  • Origin-destination data shows 'desire lines'
  • How will these shift with cycling uptake?

Scenario shift in network load

The Cycling Infrastructure Prioritisation Toolkit (CyIPT)

Overview of the project

  • 12 month project funded by DfT's Innovation Challenge Fund (ICF)
  • Aim: tackle the challenge that cycling uptake is often limited by infrastructural barriers which could be remediated cost-effectively, yet investment is often spent on less cost-effective interventions, based on assessment of only a few options.

  • Project team:
    • Robin Lovelace (University of Leeds)
    • Malcolm Morgan (University of Leeds)
    • John Parkin (University of the West of England)
    • Martin Lucas-Smith (Cyclestreets.net)
    • Adrian Lord (Phil Jones Associates)

Modelling cycling uptake

  • We can use 'backcasting' to estimate long-term potential under ideal conditions (PCT)
  • But transport authorities need forecasts of future uptake from specific interventions
  • There is much existing work on this
  • But none that is 'operationalisable'
  • How to operationalise available data?

Data on infrastructure-uptake at a regional level

  • Clear link between infrastructure and uptake

New datasets:

  • DfT's Transport Direct data
  • 2001 OD data (manipulated and joined with 2011 data)

Operationalising the data

Wider context: Open source tools

  • Online interfaces reduce barriers
  • But there are benefits of running analysis locally
  • Various software options, including:
  • QGIS mapping software
  • sDNA QGIS plugin
  • R (see upcoming course 26th - 27th April)
  • Key feature of CyIPT and PCT:
  • Open source and provides open data downloads

Modelling cycling uptake

  • Hilliness and distance are (relatively) unchanging over time
  • Model based on polynomial logit model of both:

\[ logit(pcycle) = \alpha + \beta_1 d + \beta_2 d^{0.5} + \beta_3 d^2 + \gamma h + \delta_1 d h + \delta_2 d^{0.5} h \]

logit_pcycle = -3.9 + (-0.59 * distance) + (1.8 * sqrt(distance) ) + (0.008 * distance^2)
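
To turn the logit above into a proportion of trips cycled, the inverse logit can be applied. The following is a minimal sketch rather than the PCT's actual implementation: it reuses the coefficients from the line above, leaves out the hilliness terms (as that line does), and assumes distance is measured in km. The function wrapper and the example distances are introduced here for illustration only.

```r
# Coefficients copied from the slide; hilliness terms are omitted, as in the slide's code
logit_pcycle = function(distance) {
  -3.9 + (-0.59 * distance) + (1.8 * sqrt(distance)) + (0.008 * distance^2)
}

distance = c(1, 2, 5, 10)                # example trip distances (assumed to be in km)
pcycle = plogis(logit_pcycle(distance))  # inverse logit: 1 / (1 + exp(-x))
round(pcycle, 3)
```

`plogis()` is the base R (stats) inverse logit, so no extra packages are needed.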

A live demo

"Actions speak louder than words"

tibbles and dplyr: A detour for programmers

Why data carpentry?

  • If you 'hack' or 'munge' data, it won't scale
  • So ultimately it's about being able to handle Big Data
  • We'll cover the basics of data frames and tibbles
  • And the basics of dplyr, an excellent package for data carpentry
    • dplyr is also compatible with the sf package

The data frame

The humble data frame is at the heart of most analysis projects:

d = data.frame(x = 1:3, y = c("A", "B", "C"), stringsAsFactors = TRUE) # TRUE was the default before R 4.0.0
d
##   x y
## 1 1 A
## 2 2 B
## 3 3 C

In reality this is a list of columns, which makes functions work on each column:

summary(d)
##        x       y    
##  Min.   :1.0   A:1  
##  1st Qu.:1.5   B:1  
##  Median :2.0   C:1  
##  Mean   :2.0        
##  3rd Qu.:2.5        
##  Max.   :3.0
plot(d)
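
The 'data frame is a list' point can be checked directly. This is a small base R sketch; `d_demo` and `col_classes` are names introduced here for illustration, and `stringsAsFactors = TRUE` is set explicitly to mirror the factor column in the output above.

```r
# Hypothetical demo object mirroring `d` from the slide
d_demo = data.frame(x = 1:3, y = c("A", "B", "C"), stringsAsFactors = TRUE)

is.list(d_demo)   # a data frame is a list ...
length(d_demo)    # ... whose length is the number of columns

# Because it is a list, sapply() applies a function column by column
col_classes = sapply(d_demo, class)
col_classes
```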

Subsetting

In base R, there are many ways to subset:

d[1,] # the first row
##   x y
## 1 1 A
d[,1] # the first column
## [1] 1 2 3
d$x # the first column
## [1] 1 2 3
d[1] # the first column, as a data frame
##   x
## 1 1
## 2 2
## 3 3
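
One consequence of having many ways to subset is that the type of the result depends on the syntax used. A minimal base R sketch of this follows; the object names are introduced here for illustration.

```r
d_demo = data.frame(x = 1:3, y = c("A", "B", "C"))

v   = d_demo[, 1]                # a single column is simplified to a vector
df1 = d_demo[, 1, drop = FALSE]  # drop = FALSE keeps the data frame structure
df2 = d_demo["x"]                # list-style subsetting also keeps a data frame
v2  = d_demo[["x"]]              # double brackets extract the column itself

is.data.frame(v)    # FALSE
is.data.frame(df1)  # TRUE
```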

The tibble

Recently the data frame has been extended:

library("tibble")
dt = tibble(x = 1:3, y = c("A", "B", "C"))
dt
## # A tibble: 3 x 2
##       x y    
##   <int> <chr>
## 1     1 A    
## 2     2 B    
## 3     3 C

Advantages of the tibble

It comes down to efficiency and usability

  • When printed, the tibble reports the class of each column
  • Character vectors are not coerced into factors
  • When printing a tibble to screen, only the first ten rows are displayed
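
These differences can be seen side by side in a short sketch. It assumes the tibble package is installed, and sets `stringsAsFactors = TRUE` explicitly because that was only the data frame default before R 4.0.0. The names `df` and `tb` are introduced here for illustration.

```r
library(tibble)

df = data.frame(x = 1:3, y = c("A", "B", "C"), stringsAsFactors = TRUE)
tb = tibble(x = 1:3, y = c("A", "B", "C"))

class(df$y)  # "factor": the data frame coerced the strings
class(tb$y)  # "character": the tibble left them alone

# Single-column subsetting: a data frame drops to a vector, a tibble does not
is.data.frame(df[, "x"])  # FALSE
is.data.frame(tb[, "x"])  # TRUE
```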

dplyr

Like tibbles, dplyr has advantages over older ways of doing things:

  • Type stability (data frame in, data frame out)
  • Consistent functions - dedicated functions, not [, do everything
  • Piping makes complex operations easy
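library(dplyr) # dplyr must be loaded before its verbs and %>% are used
# ghg_ems: a data frame of greenhouse gas emissions by country, assumed to be loaded already
ghg_ems %>%
  filter(!grepl("World|Europe", Country)) %>% 
  group_by(Country) %>% 
  summarise(Mean = mean(Transportation),
            Growth = diff(range(Transportation))) %>%
  top_n(3, Growth) %>%
  arrange(desc(Growth))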

Why pipes?

wb_ineq %>% 
  filter(grepl("g", Country)) %>%
  group_by(Year) %>%
  summarise(gini = mean(gini, na.rm  = TRUE)) %>%
  arrange(desc(gini)) %>%
  top_n(n = 5)

vs

top_n(
  arrange(
    summarise(
      group_by(
        filter(wb_ineq, grepl("g", Country)),
        Year),
      gini = mean(gini, na.rm  = TRUE)),
    desc(gini)),
  n = 5)
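
That the two styles give the same result can be checked on a small example. The sketch below assumes dplyr is installed and uses a made-up data frame, `toy`, standing in for `wb_ineq` (the values are invented for illustration). The final `top_n()` step is left out to keep the example short.

```r
library(dplyr)

# Hypothetical toy data standing in for wb_ineq (values are made up)
toy = data.frame(
  Country = c("Uganda", "Uganda", "Spain", "Portugal", "Portugal"),
  Year = c(2000, 2001, 2000, 2000, 2001),
  gini = c(43, NA, 35, 36, 38)
)

# Piped version: read top to bottom
piped = toy %>%
  filter(grepl("g", Country)) %>%
  group_by(Year) %>%
  summarise(gini = mean(gini, na.rm = TRUE)) %>%
  arrange(desc(gini))

# Nested version: read inside out
nested = arrange(
  summarise(
    group_by(filter(toy, grepl("g", Country)), Year),
    gini = mean(gini, na.rm = TRUE)),
  desc(gini))

all.equal(piped, nested)
```

Note that `grepl("g", Country)` is case sensitive, so only countries containing a lower-case "g" are kept.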

Subsetting with dplyr

Only 1 way to do it, making life simpler:

select(dt, x) # select columns
## # A tibble: 3 x 1
##       x
##   <int>
## 1     1
## 2     2
## 3     3
slice(dt, 2) # 'slice' rows
## # A tibble: 1 x 2
##       x y    
##   <int> <chr>
## 1     2 B

How we've used this in the PCT

Worked example: pct data in West Yorkshire

  • We'll download and visualise some transport data
u_pct = "https://github.com/npct/pct-data/raw/master/west-yorkshire/l.Rds"
if(!file.exists("l.Rds"))
  download.file(u_pct, "l.Rds")
library(stplanr)
l = readRDS("l.Rds")
plot(l)

Analysing where people walk

sel_walk = l$foot > 9
l_walk = l[sel_walk,]
plot(l)
plot(l_walk, add = T, col = "red")

library(dplyr) # for next slide...

Doing it with sf (!)

l_walk1 = l %>% filter(All > 10) # fails: dplyr verbs have no method for sp (Spatial*) objects
library(sf)
## Linking to GEOS 3.5.1, GDAL 2.2.2, proj.4 4.9.2
l_sf = st_as_sf(l)
plot(l_sf[6])

Subsetting with sf

much easier

l_walk2 = l_sf %>% 
  filter(foot > 9)
plot(l_sf[6])
plot(l_walk2, add = T)

Subsetting with sf

results

## Warning in plot.sf(l_walk2, add = T): ignoring all but the first attribute
## Warning in classInt::classIntervals(na.omit(values), min(nbreaks, n.unq), :
## n same as number of different finite values\neach different finite value is
## a separate class

A more advanced example

l_sf$distsf = as.numeric(st_length(l_sf))
l_drive_short2 = l_sf %>% 
  filter(distsf < 1000) %>% 
  filter(car_driver > foot)

Result: where people drive short distances rather than walk

library(tmap)
tmap_mode("view")
## tmap mode set to interactive viewing
qtm(l_drive_short2)
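
The same filtering logic can be tried without downloading the PCT data. This is a minimal sketch assuming the sf and dplyr packages are installed; the two lines, their coordinates and the trip counts are made up, and a projected CRS (EPSG:27700, units in metres) is assumed so that `st_length()` returns metres.

```r
library(sf)
library(dplyr)

# Two hypothetical desire lines in a projected CRS (EPSG:27700, metres)
geom = st_sfc(
  st_linestring(rbind(c(0, 0), c(600, 0))),   # 600 m long
  st_linestring(rbind(c(0, 0), c(0, 5000))),  # 5 km long
  crs = 27700
)
toy_sf = st_sf(car_driver = c(20, 50), foot = c(10, 2), geometry = geom)

# Length in metres, stripped of its units class so it compares with a plain number
toy_sf$distsf = as.numeric(st_length(toy_sf))

# Keep short lines (under 1 km) where more people drive than walk
short_drives = toy_sf %>%
  filter(distsf < 1000, car_driver > foot)
nrow(short_drives)
```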

Discussion: ensuring research is used for the greater good

Points of discussion

It is clear that geographical research can have large policy impacts. It follows that:

  • Researchers can act to maximise the social benefit of their research
  • This involves getting the evidence out to as many people as possible
  • And using open source, accessible tools - the 'science' in GDS?

But many questions remain:

  • Where to draw the line between impartial research and campaigning?
  • To what extent should researchers open-sourcing their work defend against commercial exploitation?

Final question

  • What can you do to maximise the social benefits arising from your work?
  • Thanks for listening - get in touch via r.lovelace@leeds.ac.uk or @robinlovelace

References

Lovelace, Robin, Anna Goodman, Rachel Aldred, Nikolai Berkoff, Ali Abbas, and James Woodcock. 2017. “The Propensity to Cycle Tool: An Open Source Online System for Sustainable Transport Planning.” Journal of Transport and Land Use 10 (1). doi:10.5198/jtlu.2016.862.

Reclus, Elisée. 2013. Anarchy, Geography, Modernity: Selected Writings of Elisée Reclus. Edited by John Clark and Camille Martin. Oakland, CA: PM Press.