2017-01-13

Reading and cleaning data

Nikée has written an R package, TravelAIR, for working with the TAI data.

This function, for example, reads in the custom data format:

readData <- function(inputPath, minObsDate="2014-01-01",
                   maxObsDate="2016-10-01",
                   saveOutput=TRUE, outputPath=inputPath,
                   saveSummary=TRUE, summaryName="Aug14-16_Summary",
                   summaryPath=inputPath)
# example usage: takes about 1 minute to process 180 MB
# system.time(readData(inputPath = "tai-private-data/")) 

I/O results

Original data

[
  {
    "distance": 5757.0, 
    "changed_method": null, 
    "d": "190111", 
    "deleted": null, 
    "ts": 1547224631587, 
    "method_desc": "Car", 
    "duration": 312.0, 
    "to_loc": [
      -1.790163, 
      55.30009
    ], 
    "from_loc": [
      -1.831256, 
      55.34619
    ], 
    "method": 15, 

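A minimal sketch of how records in this format could be flattened into one row per trip. It assumes the jsonlite package is installed and that the first element of each location pair is x (longitude); it is not necessarily how readData works. The sample record uses only fields shown in the excerpt above, closed off so that it is valid JSON.

```r
library(jsonlite)

# One record built from fields shown in the excerpt (other fields omitted)
txt <- '[{"distance": 5757.0, "method_desc": "Car", "duration": 312.0,
          "to_loc": [-1.790163, 55.30009],
          "from_loc": [-1.831256, 55.34619], "method": 15}]'

# Parse without simplification so every record is a plain named list
recs <- fromJSON(txt, simplifyVector = FALSE)

# Pull one coordinate (1 = x, 2 = y; an assumption) out of each record
coord <- function(recs, field, i) {
  vapply(recs, function(r) as.numeric(r[[field]][[i]]), numeric(1))
}

trips <- data.frame(
  distance    = vapply(recs, function(r) as.numeric(r$distance), numeric(1)),
  method_desc = vapply(recs, function(r) r$method_desc, character(1)),
  from_locx   = coord(recs, "from_loc", 1),
  from_locy   = coord(recs, "from_loc", 2),
  to_locx     = coord(recs, "to_loc", 1),
  to_locy     = coord(recs, "to_loc", 2)
)
```

The column names mirror those in the processed data (from_locx, to_locy and so on).
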
The processed data

f = paste0("tai-private-data/",
           "Base_Agent-1.csv")
d = readr::read_csv(f) # 22999 rows (trips)
head(d)
## # A tibble: 6 × 29
##   distance changed_method      d deleted           ts method_desc duration
##      <dbl>          <chr>  <int>   <chr>        <dbl>       <chr>    <dbl>
## 1       56           <NA> 151001    <NA> 1.443658e+12  Stationary   25.711
## 2      110           <NA> 151001    <NA> 1.443658e+12         Car   13.943
## 3       20           <NA> 151001    <NA> 1.443658e+12  Stationary   31.100
## 4       50           <NA> 151001    <NA> 1.443658e+12  Stationary    8.753
## 5        6           <NA> 151001    <NA> 1.443658e+12  Stationary   37.201
## 6       40           <NA> 151001    <NA> 1.443658e+12  Stationary    9.743
## # ... with 22 more variables: to_loc <chr>, from_loc <chr>, method <int>,
## #   device_id <chr>, agentID <int>, from_locx <dbl>, from_locy <dbl>,
## #   to_locx <dbl>, to_locy <dbl>, date <date>, dateTime <dttm>,
## #   wday <int>, weekDay <chr>, decimalTime <dbl>, nextTimeStamp <dbl>,
## #   calcDuration <dbl>, nextFromLocX <dbl>, nextFromLocY <dbl>,
## #   locJump <int>, distGeo <dbl>, distHaversine <dbl>, speed <dbl>
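
Derived columns such as distHaversine and speed could be computed along these lines. This is a sketch under stated assumptions, not the TravelAIR implementation: the helper name haversine_m, the Earth radius of 6371 km, and the reading of duration as seconds are all assumptions.

```r
# Great-circle distance in metres between lon/lat points (haversine formula)
haversine_m <- function(lon1, lat1, lon2, lat2, r = 6371000) {
  to_rad <- pi / 180
  dlat <- (lat2 - lat1) * to_rad
  dlon <- (lon2 - lon1) * to_rad
  a <- sin(dlat / 2)^2 +
    cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlon / 2)^2
  2 * r * asin(pmin(1, sqrt(a)))
}

# Coordinates taken from the JSON excerpt under "Original data"
d_m <- haversine_m(-1.831256, 55.34619, -1.790163, 55.30009)

# Speed in km/h, ASSUMING duration (312.0 in the excerpt) is in seconds
speed_kmh <- (d_m / 1000) / (312 / 3600)
```

The function is vectorised, so whole columns (from_locx, from_locy, to_locx, to_locy) can be passed in one call.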

Cleaning the data

Extensible function for cleaning the data:

cleanData <- function(inputPath, inputName="Base",
                      methodSpecific=c("Car", "Bicycle"),
                      thresholdSpeed=c(rep(240,11),1200,240),
                      unrealMerge=FALSE,
                      removeMethodUnknown=FALSE,
                      outputPath=inputPath, outputName="Cleaned"){ ... }
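
A rough sketch of the kind of rule that thresholdSpeed suggests: drop trips whose speed exceeds a cap for their travel mode. How cleanData maps its 13 threshold values onto modes is not shown here, so the named caps vector below is a hypothetical stand-in, and the toy trips are invented for illustration.

```r
# Toy trips table (values invented for illustration only)
trips <- data.frame(
  method_desc = c("Car", "Car", "Bicycle", "Aeroplane"),
  speed       = c(60, 266.4, 18, 800)
)

# Hypothetical per-mode caps; 240 and 1200 echo the thresholdSpeed defaults
caps <- c(Car = 240, Bicycle = 240, Aeroplane = 1200)

# Look up each trip's cap by mode name
trip_cap <- unname(caps[as.character(trips$method_desc)])

# Keep trips at or below their cap; modes without a cap are left untouched
keep <- is.na(trip_cap) | trips$speed <= trip_cap
cleaned <- trips[keep, ]
```

Here only the second car trip is removed, because it exceeds the cap of 240.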

Example of data cleaning

a = read.csv("travelAIData/BasicData/Base_Agent-101.csv")
ac = read.csv("travelAIData/BasicData/Cleaned_Agent-101.csv")
nrow(a)
## [1] 5579
nrow(ac)
## [1] 5578
summary(a$speed)
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##   0.0000   0.2847   1.2410  13.9200  18.9100 266.4000
summary(ac$speed)
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##   0.0000   0.2844   1.2410  13.8800  18.8800 225.1000

(Pseudo) anonymisation

# Brute force approach
ac[c("to_locx", "to_locy")] <- ac[c("to_locx", "to_locy")] +
  runif(n = nrow(ac) * 2, min = -0.01, max = 0.01)
  • Also see the stplanr function to 'top-and-tail' trips:
args(stplanr::toptail)
## function (l, toptail_dist, ...) 
## NULL
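
The brute force approach can be wrapped in a small helper that perturbs origins as well as destinations. This is a sketch only: jitter_coords is a hypothetical name, and the toy row reuses the two locations from the JSON excerpt under "Original data".

```r
# Add independent uniform noise in [-max_offset, max_offset] to each column
jitter_coords <- function(df, cols, max_offset = 0.01) {
  for (col in cols) {
    df[[col]] <- df[[col]] +
      runif(nrow(df), min = -max_offset, max = max_offset)
  }
  df
}

# Toy row using the two locations from the JSON excerpt
toy <- data.frame(
  from_locx = -1.831256, from_locy = 55.34619,
  to_locx   = -1.790163, to_locy   = 55.30009
)

set.seed(1)  # only so that the sketch is repeatable
anon <- jitter_coords(toy, c("from_locx", "from_locy", "to_locx", "to_locy"))

# Largest shift applied to any coordinate, in degrees
max(abs(as.matrix(anon) - as.matrix(toy)))
```

An offset of 0.01 degrees of latitude is roughly 1.1 km. Independent noise can be averaged out over repeated visits to the same place, which is why this is only pseudo-anonymisation.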

Issue: international forays

plot(ac$to_locx, ac$to_locy) # viz issues

Additional cleaning steps (prototype)

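# Bounds from the 10th and 90th percentiles of trip origins
a_bounds_x = quantile(ac$from_locx, probs = c(0.1, 0.9))
a_bounds_y = quantile(ac$from_locy, probs = c(0.1, 0.9))
# Keep trips whose destination lies inside the bounds in both x and y
sel_bb = ac$to_locx > a_bounds_x[1] & ac$to_locx < a_bounds_x[2] &
  ac$to_locy > a_bounds_y[1] & ac$to_locy < a_bounds_y[2]
ac = ac[sel_bb,]
plot(ac$to_locx, ac$to_locy)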

Visualisation

library(leaflet.extras)
leaflet() %>%
  addTiles() %>%
  addWebGLHeatmap(lng = ac$to_locx, lat = ac$to_locy, size = 10000,
                  units = "m", alphaRange = 0.00001)

Analysis of the data

  • Distance to workplace estimated based on first instance when an agent is at a workplace
  • Estimated purpose of trip using time-of-day (midnight to 4 AM)
  • Created trip level table on commutes
  • Agent-level table summarising commute behaviour
  • Explored ways to define a 'typical commute' per agent. This includes:
    • Mode: single or dual mode estimated
    • Start time, end time
    • Distance (route and Euclidean)
    • Total number of commutes identified per agent (we suggest treating fewer than 5 as unreliable)
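
One way the agent-level 'typical commute' summary could be assembled from a trip-level table, sketched in base R. The toy values are invented, the table layout is an assumption (column names are borrowed from the processed data), and modal_value is a hypothetical helper.

```r
# Toy commute table: one row per identified commute (values invented)
commutes <- data.frame(
  agentID     = c(1, 1, 1, 1, 1, 1, 2, 2),
  method_desc = c("Car", "Car", "Bicycle", "Car", "Car", "Car",
                  "Train", "Train"),
  decimalTime = c(8.5, 8.25, 8.75, 8.5, 8.0, 9.0, 7.5, 7.75),
  distance    = c(5200, 5100, 5300, 5200, 5250, 5150, 17800, 17500)
)

# Most frequent value of a vector (first one wins on ties)
modal_value <- function(x) names(which.max(table(x)))

# Summarise each agent's commutes into a single row
per_agent <- split(commutes, commutes$agentID)
typical <- do.call(rbind, lapply(per_agent, function(g) {
  data.frame(
    agentID    = g$agentID[1],
    main_mode  = modal_value(g$method_desc),
    start_time = median(g$decimalTime),
    distance   = median(g$distance),
    n_commutes = nrow(g),
    reliable   = nrow(g) >= 5  # fewer than 5 commutes flagged as unreliable
  )
}))
```

Medians are used here so that a single unusual trip does not dominate the summary; that choice is an assumption rather than something stated above.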

Identification of aggregate patterns in commute behaviour

  • Based on the input data, females commute on average 7.9 km to home and 5.1 km to work (intermediate stops on the way home?)
  • For males, the corresponding route distances are 17.8 and 17.5 km, respectively
  • Travel times were also estimated, but these do not seem reliable
  • Breakdown by age band
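
Averages like these can be produced with a grouped mean. The sketch below uses invented toy values (not TAI results), and the gender, direction and distance_km columns are hypothetical names.

```r
# Toy commute table (invented values, hypothetical column names)
commutes <- data.frame(
  gender      = c("female", "female", "female", "male", "male", "male"),
  direction   = c("to_work", "to_home", "to_home",
                  "to_work", "to_home", "to_work"),
  distance_km = c(5, 7, 9, 17, 18, 18)
)

# Mean route distance for each gender/direction combination
avg <- aggregate(distance_km ~ gender + direction, data = commutes, FUN = mean)
```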

Some results I

Main mode of commute by age band

knitr::kable(readr::read_csv("vignettes/results-age-mode.csv"))
X1       Walk  Bus  Bicycle  Car  Metro  Train  Aeroplane
<=15       NA   NA       NA   NA     NA      1         NA
16-24       1   NA        1    4     NA     NA         NA
25-34       4   NA       11    2      1      5         NA
35-44       2    1       11   13      2     12          1
45-54       4    2        8   10      3     11         NA
55-64       3    2        2    3     NA      4         NA
>=65        2   NA        3    1     NA     NA         NA
unknown     7    3       32   11      1     15          1
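
A cross-tabulation of this shape could be built as sketched below. The agent table is invented toy data, age is a hypothetical column, and ages are assumed to be whole numbers so that the right-closed breaks line up with the band labels.

```r
# Toy agent table (invented values; age is a hypothetical column)
agents <- data.frame(
  age       = c(15, 22, 30, 41, 50, 60, 70, NA),
  main_mode = c("Train", "Car", "Bicycle", "Car",
                "Train", "Walk", "Bicycle", "Bicycle")
)

# Cut ages into the bands used in the table above (intervals closed on the right)
breaks <- c(-Inf, 15, 24, 34, 44, 54, 64, Inf)
labels <- c("<=15", "16-24", "25-34", "35-44", "45-54", "55-64", ">=65")
band <- as.character(cut(agents$age, breaks = breaks, labels = labels))

# Agents with a missing age go into their own band
band[is.na(band)] <- "unknown"

# Count agents per age band and main mode
counts <- table(band = band, mode = agents$main_mode)
```

Note that table() reports empty cells as 0, whereas the table above shows NA for them.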

Some results II

Main mode of commute by gender

knitr::kable(readr::read_csv("vignettes/results-gender-mode.csv"))
X1       Walk  Bus  Bicycle  Car  Metro  Train  Aeroplane
females     9   NA        6    9      1      5         NA
males      10    7       50   22      4     28         NA