Can geographic data save the world?

2018-04-11, Institut für Geographie - Friedrich-Schiller-Universität Jena

Introductory comments
An example: the Propensity to Cycle Tool
tibbles and dplyr
Discussion: can GDS save the world?

Introduction

What is saving the world?

Many ways of saying the same thing:

'Policy-led research'
'Impact'
'Socially beneficial research'
Don't be evil (Google)

My definition: building an evidence-base for sustainable systems.

In the context of climate change that means:

Building an evidence-base to transition away from fossil fuels

But could also be interpreted in terms of other (quantifiable) social/economic/environmental indicators

Why climate change? I

Associated mega-caption

Why climate change? II

Why climate change? III

What is Geographic Data Science?

You tell me!

How does it differ from good old 'GIS'?
What does the science in the title mean?
Why the focus on data rather than information

Code example:

library(tibble)
d = tribble(
  ~Attribute, ~GIS, ~GDS,
  "Home disciplines", "Geography", "Geography, Computing, Statistics",
  "Software focus", "Graphical User Interface", "Code",
  "Reproducibility", "Minimal", "Maximal"
)

Comparing GDS with GIS

knitr::kable(d)

Attribute	GIS	GDS
Home disciplines	Geography	Geography, Computing, Statistics
Software focus	Graphical User Interface	Code
Reproducibility	Minimal	Maximal

Geographic data science CAN 'save the world'

But only if it's open and scientific

Reasoning:

Evidence inevitably gets skewed by political aims

If the people doing the research are influenced by dominant political forces, findings will be biased for political gain (solved by independent, well-funded public research).

People doing policy-relevant research, watch out (regarding politicians):

"Their very spirit undergoes a pervasive transformation," and they finally end up as "experts at exchanging smiles, handshakes, and favors." (Reclus 2013, original: 1898)

Importance of open data and methods

If the data underlying policy is hidden, it can be represented to push certain aims (solved by open data)

If the data is 'open' but the tools are closed, results are still open to political influence (solved by open methods)

Which brings us onto our next topic…

Example question: Where will cycling uptake happen?

How to transition to active cities? From this…

Source: Brent Toderian

To this?

Source: Brent Toderian

With available resources

Source: Brent Toderian

Context: 'evidence overload'?

Challenge: operationalise data
Challenge: make locally specific

Data for walking and cycling investment

Travel behaviour data
Route network data
Existing infrastructure (road widths, traffic, future possibilities)
Road safety data
Air pollution data
Crowdsourced data

The international dimension

~200 km cycle network in Seville, Spain. Source: WHO report at ATFutures/who

Not a UK-specific issue, but benefits of country-specific tools

The Propensity to Cycle Tool (PCT)

What can the PCT do? - see www.pct.bike

The front page of the open source, open access Propensity to Cycle Tool (PCT).

Context: from concept to implementation

3 years in the making
Origins go back further
"An algorithm to decide where to build next"!
Internationalisation of methods (World Health Organisation funded project)

The research landscape (see Lovelace et al. 2017)

The PCT in context (source: Lovelace et al. 2017)

Tool	Scale	Coverage	Public access	Format of output	Levels of analysis	Software licence
Propensity to Cycle Tool	National	England	Yes	Online map	A, OD, R, RN	Open source
Prioritization Index	City	Montreal	No	GIS-based	P, A, R	Proprietary
PAT	Local	Parts of Dublin	No	GIS-based	A, OD, R	Proprietary
Usage intensity index	City	Belo Horizonte	No	GIS-based	A, OD, R, I	Proprietary
Cycling Potential Tool	City	London	No	Static	A, I	Unknown
Santa Monica model	City	Santa Monica	No	Static	P, OD, A	Unknown

Policy feedback

"The PCT is a brilliant example of using Big Data to better plan infrastructure investment. It will allow us to have more confidence that new schemes are built in places and along travel corridors where there is high latent demand."

Shane Snow: Head of Seamless Travel Team, Sustainable and Accessible Travel Division

"The PCT shows the country’s great potential to get on their bikes, highlights the areas of highest possible growth and will be a useful innovation for local authorities to get the greatest bang for their buck from cycling investments and realise cycling potential."

Andrew Jones, Parliamentary Under Secretary of State for Transport

The PCT in CWIS

Included in Cycling and Walking Infrastructure Strategy (CWIS)

Scenario shift in desire lines

Source: Lovelace et al. (2017)

Origin-destination data shows 'desire lines'
How will these shift with cycling uptake?
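
As a rough sketch of what a 'desire line' is in data terms, the snippet below turns a made-up origin-destination table into straight lines between zone centroids using the sf package. The zone names, coordinates and trip counts are hypothetical, and the projected CRS (EPSG:27700) is an assumption chosen only so that lengths are in metres.

```r
library(sf)

# Hypothetical zone centroids (x, y in metres) and OD flows
centroids = rbind(A = c(0, 0), B = c(3000, 4000), C = c(1000, 0))
od = data.frame(
  origin = c("A", "A"),
  destination = c("B", "C"),
  all = c(100, 40),     # made-up total trips
  bicycle = c(5, 10),   # made-up cycle trips
  stringsAsFactors = FALSE
)

# One straight line per OD pair, from origin centroid to destination centroid
geoms = lapply(seq_len(nrow(od)), function(i) {
  st_linestring(rbind(centroids[od$origin[i], ], centroids[od$destination[i], ]))
})
desire_lines = st_sf(od, geometry = st_sfc(geoms, crs = 27700))

# Line width proportional to the share of trips made by bicycle
plot(st_geometry(desire_lines), lwd = 20 * desire_lines$bicycle / desire_lines$all)
```

A scenario of cycling uptake would then change the `bicycle` column while leaving the geometry untouched.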

Scenario shift in network load

The Cycling Infrastructure Prioritisation Toolkit (CyIPT)

Overview of the project

12 month project funded by DfT's Innovation Challenge Fund (ICF)
Aim: tackle the challenge that cycling uptake is often limited by infrastructural barriers which could be remediated cost-effectively, yet investment is often spent on less cost-effective interventions, based on assessment of only a few options.
Project team:
- Robin Lovelace (University of Leeds)
- Malcolm Morgan (University of Leeds)
- John Parkin (University of West of England)
- Martin Lucas-Smith (Cyclestreets.net)
- Adrian Lord (Phil Jones Associates)

Modelling cycling uptake

We can use 'backcasting' to estimate long-term potential under ideal conditions (PCT)
But to act on this, transport authorities need forecasts of future uptake from specific interventions
There is much existing work on this
But none that is 'operationalisable'

How to operationalise available data?

Data on infrastructure-uptake at a regional level

Clear link between infrastructure and uptake

New datasets:

DfT's Transport Direct data
2001 OD data (manipulated and joined with 2011 data)

Operationalising the data

See: https://www.cyipt.bike (password protected)

Wider context: Open source tools

Online interfaces reduce barriers
But there are benefits of running analysis locally
Various software options, including:

QGIS mapping software
sDNA QGIS plugin
R (see upcoming course 26th - 27th April)
Key feature of CyIPT and PCT:
Open source and provides open data downloads

Modelling cycling uptake

Hilliness and distance are (relatively) unchanging over time
Model based on polynomial logit model of both:

\[ logit(pcycle) = \alpha + \beta_1 d + \beta_2 d^{0.5} + \beta_3 d^2 + \gamma h + \delta_1 d h + \delta_2 d^{0.5} h \]

logit_pcycle = -3.9 + (-0.59 * distance) + (1.8 * sqrt(distance) ) + (0.008 * distance^2)
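
As a minimal sketch of how the coefficients above translate into a proportion of trips cycled, the logit can be passed through the inverse-logit function (`plogis()` in base R). The helper name `pcycle_from_distance` is hypothetical, the hilliness terms are left out as in the line above, and distance is assumed to be in kilometres.

```r
# Hypothetical helper: distance-only logit from the slide, converted to a probability
pcycle_from_distance = function(distance) {
  logit_pcycle = -3.9 + (-0.59 * distance) + (1.8 * sqrt(distance)) + (0.008 * distance^2)
  plogis(logit_pcycle) # inverse logit: 1 / (1 + exp(-x))
}

distance = c(1, 3, 5, 10) # assumed to be in km
round(pcycle_from_distance(distance), 3)
```

Working on the logit scale keeps the predicted proportion between 0 and 1 whatever the distance.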

A live demo

"Actions speak louder than words"

Test version of www.pct.bike

tibbles and dplyr: A detour for programmers

Why data carpentry?

Data analysts and 'scientists': don't wrangle, munge or 'hack' your valuable datasets. Use #datacarpentry: https://t.co/gXrlIJH91R pic.twitter.com/GSWS7O7zBz
— Robin Lovelace (robinlovelace) February 20, 2017

If you 'hack' or 'munge' data, it won't scale
So ultimately it's about being able to handle Big Data
We'll cover the basics of data frames and tibbles
And the basics of dplyr, an excellent package for data carpentry
- dplyr is also compatible with the sf package

The data frame

The humble data frame is at the heart of most analysis projects:

d = data.frame(x = 1:3, y = c("A", "B", "C"))
d

##   x y
## 1 1 A
## 2 2 B
## 3 3 C

Under the hood a data frame is a list of columns, so functions such as summary() work on each column:

summary(d)

##        x       y    
##  Min.   :1.0   A:1  
##  1st Qu.:1.5   B:1  
##  Median :2.0   C:1  
##  Mean   :2.0        
##  3rd Qu.:2.5        
##  Max.   :3.0

plot(d)
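
The list structure can be checked directly. This is a small sketch using only base R, recreating the same `d` as above:

```r
d = data.frame(x = 1:3, y = c("A", "B", "C"))

is.list(d)            # a data frame is a list of equal-length columns
length(d) == ncol(d)  # so its length is the number of columns
sapply(d, class)      # and a function can be applied to each column in turn
```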

Subsetting

In base R, there are many ways to subset:

d[1, ] # the first row

##   x y
## 1 1 A

d[,1] # the first column

## [1] 1 2 3

d$x # the first column

## [1] 1 2 3

d[1] # the first column, as a data frame

##   x
## 1 1
## 2 2
## 3 3

The tibble

Recently the data frame has been extended:

library("tibble")
dt = tibble(x = 1:3, y = c("A", "B", "C"))
dt

## # A tibble: 3 x 2
##       x y    
##   <int> <chr>
## 1     1 A    
## 2     2 B    
## 3     3 C

Advantages of the tibble

It comes down to efficiency and usability

When printed, the tibble reports the class of each column
Character vectors are not coerced into factors
When printing a tibble to screen, only the first ten rows are displayed
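
A short sketch of these differences, comparing the data frame and tibble defined earlier (both are recreated here so the snippet runs on its own):

```r
library(tibble)

d = data.frame(x = 1:3, y = c("A", "B", "C"))
dt = tibble(x = 1:3, y = c("A", "B", "C"))

class(d[, 1])       # base R drops a single column to a plain vector
class(dt[, 1])      # the tibble stays a tibble: type stability
is.character(dt$y)  # the character column has not been turned into a factor
```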

dplyr

Like tibbles, dplyr has advantages over historic ways of doing things

Type stability (data frame in, data frame out)
Consistent functions - named functions, not [, do everything
Piping makes complex operations easy

library(dplyr) # dplyr must be loaded before running the pipeline

# ghg_ems: greenhouse gas emissions by country, assumed to be loaded already
ghg_ems %>%
  filter(!grepl("World|Europe", Country)) %>% 
  group_by(Country) %>% 
  summarise(Mean = mean(Transportation),
            Growth = diff(range(Transportation))) %>%
  top_n(3, Growth) %>%
  arrange(desc(Growth))

Why pipes?

wb_ineq %>% 
  filter(grepl("g", Country)) %>%
  group_by(Year) %>%
  summarise(gini = mean(gini, na.rm  = TRUE)) %>%
  arrange(desc(gini)) %>%
  top_n(n = 5)

top_n(
  arrange(
    summarise(
      group_by(
        filter(wb_ineq, grepl("g", Country)),
        Year),
      gini = mean(gini, na.rm  = TRUE)),
    desc(gini)),
  n = 5)
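
To check that the piped and nested forms give the same result, here is a minimal sketch on a made-up dataset. The object `toy` and its values are hypothetical, standing in for `wb_ineq`, and the `top_n()` step is left out to keep the comparison short.

```r
library(dplyr)

# Hypothetical stand-in for wb_ineq
toy = tibble::tibble(
  Country = c("Germany", "Germany", "Ghana", "Ghana", "Peru", "Peru"),
  Year = c(2000, 2010, 2000, 2010, 2000, 2010),
  gini = c(29, 30, 41, 43, 50, 45)
)

# Piped: read top to bottom, one step per line
piped = toy %>%
  filter(grepl("G", Country)) %>%
  group_by(Year) %>%
  summarise(gini = mean(gini, na.rm = TRUE)) %>%
  arrange(desc(gini))

# Nested: the same steps, read from the inside out
nested = arrange(
  summarise(
    group_by(
      filter(toy, grepl("G", Country)),
      Year),
    gini = mean(gini, na.rm = TRUE)),
  desc(gini))

all.equal(piped, nested)
```

The two versions do the same work; the pipe only changes the order in which the steps are written.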

Subsetting with dplyr

Only 1 way to do it, making life simpler:

select(dt, x) # select columns

## # A tibble: 3 x 1
##       x
##   <int>
## 1     1
## 2     2
## 3     3

slice(dt, 2) # 'slice' rows

## # A tibble: 1 x 2
##       x y    
##   <int> <chr>
## 1     2 B

How we've used this in the PCT

Worked example: PCT data in West Yorkshire

We'll download and visualise some transport data

u_pct = "https://github.com/npct/pct-data/raw/master/west-yorkshire/l.Rds"
if(!file.exists("l.Rds"))
  download.file(u_pct, "l.Rds")
library(stplanr)
l = readRDS("l.Rds")
plot(l)

Analysing where people walk

sel_walk = l$foot > 9
l_walk = l[sel_walk,]
plot(l)
plot(l_walk, add = T, col = "red")

library(dplyr) # for next slide...

Doing it with sf (!)

l_walk1 = l %>% filter(All > 10) # fails: l is an sp (Spatial) object, which dplyr verbs do not support

library(sf)

## Linking to GEOS 3.5.1, GDAL 2.2.2, proj.4 4.9.2

l_sf = st_as_sf(l)
plot(l_sf[6])

Subsetting with sf

much easier

l_walk2 = l_sf %>% 
  filter(foot > 9)
plot(l_sf[6])
plot(l_walk2, add = T)

Subsetting with sf

results

## Warning in plot.sf(l_walk2, add = T): ignoring all but the first attribute

## Warning in classInt::classIntervals(na.omit(values), min(nbreaks, n.unq), :
## n same as number of different finite values\neach different finite value is
## a separate class

A more advanced example

l_sf$distsf = as.numeric(st_length(l_sf))
l_drive_short2 = l_sf %>% 
  filter(distsf < 1000) %>% 
  filter(car_driver > foot)

Result: where people drive short distances rather than walk

library(tmap)
tmap_mode("view")

## tmap mode set to interactive viewing

qtm(l_drive_short2)
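
The filtering logic can also be checked without downloading anything. The sketch below applies the same two filters to a made-up sf object; the object `toy_sf`, its coordinates and its trip counts are hypothetical, and EPSG:27700 is assumed only so that `st_length()` returns metres.

```r
library(sf)
library(dplyr)

# Two made-up straight lines in a projected CRS (units: metres)
geom = st_sfc(
  st_linestring(rbind(c(0, 0), c(300, 400))),    # 500 m long
  st_linestring(rbind(c(0, 0), c(3000, 4000))),  # 5000 m long
  crs = 27700
)
toy_sf = st_sf(car_driver = c(20, 50), foot = c(5, 80), geometry = geom)

# Same steps as above: line length in metres, then two filters
toy_sf$distsf = as.numeric(st_length(toy_sf))
toy_short_drive = toy_sf %>%
  filter(distsf < 1000) %>%
  filter(car_driver > foot)

nrow(toy_short_drive) # number of short lines where driving exceeds walking
```

Because sf objects are data frames with a geometry column, the dplyr verbs keep the geometry attached to each row that survives the filter.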

Discussion: ensuring research is used for the greater good

Points of discussion

It is clear that geographical research can have large policy impacts.

That researchers can act to maximise the social benefit of the research
That involves getting the evidence out to as many people as possible
And using open source, accessible tools - the 'science' in GDS?

But many questions remain:

Where to draw the line between impartial research and campaigning?
To what extent should researchers open-sourcing their work defend against commercial exploitation?

Final question

What can you do to maximise the social benefits arising from your work?

Thanks for listening - get in touch via r.lovelace@leeds.ac.uk or @robinlovelace

References

Lovelace, Robin, Anna Goodman, Rachel Aldred, Nikolai Berkoff, Ali Abbas, and James Woodcock. 2017. “The Propensity to Cycle Tool: An Open Source Online System for Sustainable Transport Planning.” Journal of Transport and Land Use 10 (1). doi:10.5198/jtlu.2016.862.

Reclus, Elisée. 2013. Anarchy, Geography, Modernity: Selected Writings of Elisée Reclus. Edited by John Clark and Camille Martin. Oakland, CA: PM Press.
