Slides: rpubs.com/RobinLovelace

Why data carpentry?

  • If you 'hack' or 'munge' data, it won't scale
  • So ultimately it's about being able to handle Big Data
  • We'll cover the basics of data frames and tibbles
  • And the basics of dplyr, an excellent package for data carpentry
    • dplyr is also compatible with the sf package

The data frame

The humble data frame is at the heart of most analysis projects:

d = data.frame(x = 1:3, y = c("A", "B", "C"))
d
##   x y
## 1 1 A
## 2 2 B
## 3 3 C

In reality a data frame is a list of equal-length columns, so many functions work on each column in turn:

summary(d)
##        x       y    
##  Min.   :1.0   A:1  
##  1st Qu.:1.5   B:1  
##  Median :2.0   C:1  
##  Mean   :2.0        
##  3rd Qu.:2.5        
##  Max.   :3.0
plot(d)
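
The list structure can be checked directly. The following is a minimal sketch in base R; the object name col_classes is introduced here purely for illustration:

```r
# A data frame is a list whose elements are the columns
d = data.frame(x = 1:3, y = c("A", "B", "C"))
is.list(d)   # is d a list under the hood?
length(d)    # the length of the list is the number of columns, not rows

# Because d is a list, sapply() visits one column at a time
col_classes = sapply(d, class)
col_classes
```

The class reported for y depends on the R version: before R 4.0.0, data.frame() converted strings to factors by default.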

Subsetting

In base R, there are many ways to subset:

d[1,] # the first row
##   x y
## 1 1 A
d[,1] # the first column
## [1] 1 2 3
d$x # the first column
## [1] 1 2 3
d[1] # the first column, as a data frame
##   x
## 1 1
## 2 2
## 3 3
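
One consequence of having many ways to subset is that the type of the result varies. A short base R sketch of this, with illustrative object names (col_vec, col_df, big) that are not part of the original slides:

```r
d = data.frame(x = 1:3, y = c("A", "B", "C"))

col_vec = d[, 1]               # a single column is simplified to a vector
col_df = d[, 1, drop = FALSE]  # drop = FALSE keeps the data frame structure
big = d[d$x > 1, ]             # rows can also be selected with a logical vector

class(col_vec)
class(col_df)
nrow(big)
```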

The tibble

Recently the data frame has been extended:

library("tibble")
dt = tibble(x = 1:3, y = c("A", "B", "C"))
dt
## # A tibble: 3 × 2
##       x     y
##   <int> <chr>
## 1     1     A
## 2     2     B
## 3     3     C

Advantages of the tibble

It comes down to efficiency and usability:

  • When printed, a tibble reports the class of each column
  • Character vectors are not coerced into factors
  • When printing a tibble to screen, only the first ten rows are displayed
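
These differences can be seen side by side. A minimal sketch, assuming the tibble package is installed; it also shows that subsetting a tibble with [ always returns a tibble, whereas a data frame may be simplified to a vector:

```r
library(tibble)

d = data.frame(x = 1:3, y = c("A", "B", "C"))
dt = tibble(x = 1:3, y = c("A", "B", "C"))

class(dt$y)     # tibble() leaves strings as character vectors
class(d[, 1])   # base R simplifies a single column to a vector
class(dt[, 1])  # the same operation on a tibble returns a tibble
```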

dplyr

Like tibbles, dplyr has advantages over older ways of doing things:

  • Type stability (data frame in, data frame out)
  • Consistent functions: functions, not [, do everything
  • Piping makes complex operations easy
ghg_ems %>%
  filter(!grepl("World|Europe", Country)) %>% 
  group_by(Country) %>% 
  summarise(Mean = mean(Transportation),
            Growth = diff(range(Transportation))) %>%
  top_n(3, Growth) %>%
  arrange(desc(Growth))
# the pipeline above needs dplyr, which must be loaded with
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

Why pipes?

wb_ineq %>% 
  filter(grepl("g", Country)) %>%
  group_by(Year) %>%
  summarise(gini = mean(gini, na.rm  = TRUE)) %>%
  arrange(desc(gini)) %>%
  top_n(n = 5)

vs

top_n(
  arrange(
    summarise(
      group_by(
        filter(wb_ineq, grepl("g", Country)),
        Year),
      gini = mean(gini, na.rm  = TRUE)),
    desc(gini)),
  n = 5)
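
The two styles give the same result, which can be verified on a dataset that ships with R. This sketch uses mtcars rather than wb_ineq (an assumption made so that the example is self-contained); the names piped and nested are illustrative:

```r
library(dplyr)

# Piped version: reads top to bottom, one step per line
piped = mtcars %>%
  filter(cyl > 4) %>%
  group_by(gear) %>%
  summarise(mpg = mean(mpg)) %>%
  arrange(desc(mpg))

# Nested version: the same steps, read from the inside out
nested = arrange(
  summarise(
    group_by(filter(mtcars, cyl > 4), gear),
    mpg = mean(mpg)),
  desc(mpg))

all.equal(piped, nested)
```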

Subsetting with dplyr

There is only one way to do each task, making life simpler:

select(dt, x) # select columns
## # A tibble: 3 × 1
##       x
##   <int>
## 1     1
## 2     2
## 3     3
slice(dt, 2) # 'slice' rows
## # A tibble: 1 × 2
##       x     y
##   <int> <chr>
## 1     2     B
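
Rows can also be chosen by condition rather than position with filter(), and the verbs chain together. A small sketch, assuming dplyr and tibble are installed; the object name res is illustrative:

```r
library(dplyr)
library(tibble)

dt = tibble(x = 1:3, y = c("A", "B", "C"))

# filter() keeps rows matching a condition; select() then keeps column y
res = dt %>%
  filter(x > 1) %>%
  select(y)
res
```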

How we've used this in the PCT

Worked example: PCT data in West Yorkshire

  • We'll download and visualise some transport data
# u_pct = "https://github.com/npct/pct-data/raw/master/west-yorkshire/l.Rds"
# download.file(u_pct, "l.Rds")
library(stplanr)
## Loading required package: sp
l = readRDS("l.Rds")
plot(l)

Analysing where people walk

sel_walk = l$foot > 9
l_walk = l[sel_walk,]
plot(l)
plot(l_walk, add = TRUE, col = "red")

library(dplyr) # for next slide...

Doing it with sf (!)

l_walk1 = l %>% filter(All > 10) # fails: dplyr verbs do not work on sp objects
library(sf)
## Linking to GEOS 3.5.1, GDAL 2.1.3, proj.4 4.9.2, lwgeom 2.3.2 r15302
l_sf = st_as_sf(l)
plot(l_sf[6])

Subsetting with sf

Much easier:

l_walk2 = l_sf %>% 
  filter(foot > 9)
plot(l_sf[6])
plot(l_walk2, add = TRUE)

Subsetting with sf

results

A more advanced example

l_sf$distsf = as.numeric(st_length(l_sf))
l_drive_short2 = l_sf %>% 
  filter(distsf < 1000) %>% 
  filter(car_driver > foot)

Result: where people drive short distances rather than walk

library(tmap)
tmap_mode("view")
## tmap mode set to interactive viewing
qtm(l_drive_short2)
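
The filtering logic of the advanced example can be tried without downloading the PCT data. The sketch below builds two made-up lines (the toy object, its values and the choice of EPSG:27700 are assumptions for illustration only, not PCT data) and applies the same two filters:

```r
library(sf)
library(dplyr)

# Two hypothetical desire lines in a projected CRS with metre units
l1 = st_linestring(rbind(c(0, 0), c(3, 4)))     # a very short line
l2 = st_linestring(rbind(c(0, 0), c(0, 2000)))  # a 2 km line
toy = st_sf(
  car_driver = c(10, 5),
  foot = c(2, 8),
  geometry = st_sfc(l1, l2, crs = 27700)
)

# Same steps as above: compute length, then keep short, car-dominated lines
toy$distsf = as.numeric(st_length(toy))
short_drive = toy %>%
  filter(distsf < 1000) %>%
  filter(car_driver > foot)
short_drive
```

Only the first line passes both filters: it is under 1000 m long and has more car drivers than walkers.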

Overall results: the PCT