Data carpentry for sustainable transport planning: the Propensity to Cycle Tool

Why data carpentry?

Data analysts and 'scientists': don't wrangle, munge or 'hack' your valuable datasets. Use #datacarpentry: https://t.co/gXrlIJH91R pic.twitter.com/GSWS7O7zBz
— Robin Lovelace (@robinlovelace) February 20, 2017

If you 'hack' or 'munge' data, it won't scale
So ultimately it's about being able to handle Big Data
We'll cover the basics of data frames and tibbles
And the basics of dplyr, an excellent package for data carpentry
- dplyr is also compatible with the sf package

The data frame

The humble data frame is at the heart of most analysis projects:

d = data.frame(x = 1:3, y = c("A", "B", "C"))
d

##   x y
## 1 1 A
## 2 2 B
## 3 3 C

Internally a data frame is a list of equal-length columns, so many functions work on each column in turn:

summary(d)

##        x       y    
##  Min.   :1.0   A:1  
##  1st Qu.:1.5   B:1  
##  Median :2.0   C:1  
##  Mean   :2.0        
##  3rd Qu.:2.5        
##  Max.   :3.0

plot(d)
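The list structure can be checked directly. The following is a minimal sketch using only base R; it re-creates d so that it runs on its own, and the name col_classes is introduced here purely for illustration:

```r
d = data.frame(x = 1:3, y = c("A", "B", "C"))

is.list(d)   # a data frame is a list under the hood
length(d)    # the number of list elements, i.e. the number of columns

# Apply a function to each column, as summary() does internally.
# The class reported for y depends on the R version (factor before R 4.0.0).
col_classes = sapply(d, class)
col_classes
```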

Subsetting

In base R, there are many ways to subset:

d[1,] # the first row

##   x y
## 1 1 A

d[,1] # the first column

## [1] 1 2 3

d$x # the first column

## [1] 1 2 3

d[1] # the first column, as a data frame

##   x
## 1 1
## 2 2
## 3 3
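Two further base R idioms are worth knowing: subsetting rows by a logical condition, and using drop = FALSE to stop a single column being simplified to a vector. A small sketch, assuming the same d as above (the names d_big, x_vec and x_df are illustrative only):

```r
d = data.frame(x = 1:3, y = c("A", "B", "C"))

d_big = d[d$x > 1, ]            # rows where x is greater than 1
x_vec = d[, "x"]                # a single column is simplified to a vector
x_df = d[, "x", drop = FALSE]   # drop = FALSE keeps the data frame structure

class(x_vec)
class(x_df)
```

The fact that the same bracket can return either a vector or a data frame is part of what the later slides on tibbles and dplyr try to tidy up.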

The tibble

Recently the data frame has been extended:

library("tibble")
dt = tibble(x = 1:3, y = c("A", "B", "C"))
dt

## # A tibble: 3 × 2
##       x     y
##   <int> <chr>
## 1     1     A
## 2     2     B
## 3     3     C

Advantages of the tibble

It comes down to efficiency and usability

When printed, a tibble reports the class of each column
Character vectors are not coerced into factors
When a tibble is printed to screen, only the first ten rows are displayed
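A short comparison makes some of these differences concrete. This is a sketch, assuming the tibble package is installed; it also shows that subsetting one column of a tibble with [ still returns a tibble, which is not on the list above but is a documented difference in behaviour:

```r
library(tibble)

d = data.frame(x = 1:3, y = c("A", "B", "C"))
dt = tibble(x = 1:3, y = c("A", "B", "C"))

class(dt$y)      # the character column stays character
class(d[, 1])    # base data frame: simplified to a plain vector
class(dt[, 1])   # tibble: the result is still a tibble
```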

dplyr

Like tibbles, dplyr has advantages over older ways of doing things:

Type stability (data frame in, data frame out)
Consistent functions: dedicated functions, not [, do everything
Piping makes complex operations easy

ghg_ems %>%
  filter(!grepl("World|Europe", Country)) %>% 
  group_by(Country) %>% 
  summarise(Mean = mean(Transportation),
            Growth = diff(range(Transportation))) %>%
  top_n(3, Growth) %>%
  arrange(desc(Growth))

# dplyr must be loaded with
library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

Why pipes?

wb_ineq %>% 
  filter(grepl("g", Country)) %>%
  group_by(Year) %>%
  summarise(gini = mean(gini, na.rm  = TRUE)) %>%
  arrange(desc(gini)) %>%
  top_n(n = 5)

top_n(
  arrange(
    summarise(
      group_by(
        filter(wb_ineq, grepl("g", Country)),
        Year),
      gini = mean(gini, na.rm  = TRUE)),
    desc(gini)),
  n = 5)
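The two forms above are intended to give the same result; the piped version simply reads top to bottom. Because wb_ineq is not created in these slides, the sketch below uses a small made-up data frame (toy, with invented gini values) to check the equivalence, and leaves out the final top_n() step for brevity:

```r
library(dplyr)

# Hypothetical data standing in for wb_ineq; the values are invented
toy = data.frame(
  Country = c("Argentina", "Argentina", "Peru", "Belgium", "Belgium"),
  Year = c(2000, 2001, 2000, 2000, 2001),
  gini = c(50, 52, 49, 30, NA)
)

# Piped version: each step feeds the next
piped = toy %>%
  filter(grepl("g", Country)) %>%
  group_by(Year) %>%
  summarise(gini = mean(gini, na.rm = TRUE)) %>%
  arrange(desc(gini))

# Nested version: the same steps, read from the inside out
nested = arrange(
  summarise(
    group_by(filter(toy, grepl("g", Country)), Year),
    gini = mean(gini, na.rm = TRUE)),
  desc(gini))

all.equal(piped, nested)
```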

Subsetting with dplyr

Only one way to do it, making life simpler:

select(dt, x) # select columns

## # A tibble: 3 × 1
##       x
##   <int>
## 1     1
## 2     2
## 3     3

slice(dt, 2) # 'slice' rows

## # A tibble: 1 × 2
##       x     y
##   <int> <chr>
## 1     2     B
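select() and slice() work by name and by position respectively. To subset rows by a condition, dplyr provides filter(), which is used in the sf examples later on. A minimal sketch, assuming dplyr and tibble are installed (the name res is illustrative only):

```r
library(dplyr)
library(tibble)

dt = tibble(x = 1:3, y = c("A", "B", "C"))

res = dt %>%
  filter(x > 1) %>%   # keep rows where x is greater than 1
  select(y)           # then keep only column y
res
```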

How we've used this in the PCT

Worked example: pct data in West Yorkshire

We'll download and visualise some transport data

# u_pct = "https://github.com/npct/pct-data/raw/master/west-yorkshire/l.Rds"
# download.file(u_pct, "l.Rds")
library(stplanr)

## Loading required package: sp

l = readRDS("l.Rds")
plot(l)

Analysing where people walk

sel_walk = l$foot > 9
l_walk = l[sel_walk,]
plot(l)
plot(l_walk, add = TRUE, col = "red")

library(dplyr) # for next slide...

Doing it with sf (!)

l_walk1 = l %>% filter(All > 10) # fails: l is an sp object, which dplyr verbs do not support

library(sf)

## Linking to GEOS 3.5.1, GDAL 2.1.3, proj.4 4.9.2, lwgeom 2.3.2 r15302

l_sf = st_as_sf(l)
plot(l_sf[6])

Subsetting with sf

much easier

l_walk2 = l_sf %>% 
  filter(foot > 9)
plot(l_sf[6])
plot(l_walk2, add = TRUE)

Subsetting with sf

results

A more advanced example

l_sf$distsf = as.numeric(st_length(l_sf))
l_drive_short2 = l_sf %>% 
  filter(distsf < 1000) %>% 
  filter(car_driver > foot)
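The same pattern can be tried without downloading l.Rds. The sketch below builds three toy lines by hand; the geometries and the foot and car_driver counts are invented for illustration, and the British National Grid (EPSG:27700) is assumed so that st_length() returns metres:

```r
library(sf)
library(dplyr)

# Three hypothetical desire lines with lengths of 500 m, 5000 m and 800 m
geom = st_sfc(
  st_linestring(rbind(c(0, 0), c(500, 0))),
  st_linestring(rbind(c(0, 0), c(3000, 4000))),
  st_linestring(rbind(c(0, 0), c(0, 800))),
  crs = 27700
)
toy = st_sf(foot = c(20, 1, 5), car_driver = c(5, 50, 30), geometry = geom)

# Line length in metres, stored as a plain numeric column
toy$distsf = as.numeric(st_length(toy))

# Short lines on which more people drive than walk
short_drive = toy %>%
  filter(distsf < 1000) %>%
  filter(car_driver > foot)
short_drive
```

Only the 800 m line passes both filters: the 500 m line is short but has more walkers than drivers, and the 5000 m line is too long.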

Result: where people drive short distances rather than walk

library(tmap)
tmap_mode("view")

## tmap mode set to interactive viewing

qtm(l_drive_short2)

Overall results: the PCT