AT Bus: Data Cleaning

Goal
Data Source
Data Processing
Appendix: Full Code Script

Goal

The object is to use the real-time location and arrival time data provided by Auckland Transport(AT), in addition to other variables like weather and traffic congestion, to predict the arrival time of the bus.

Data Source

Using a VPS, I ran a python script which request API from AT Developer Portal (The Developer Portal updates realtime feed every 30 seconds) hourly and store it to MongoDB. The original data is stored as documents and can be queried into R directly, therefore, solving the problem of dealing with JSON and storage previously encountered. The data collection has been running since 2023.04.22 and has since have over 400,000 observations and 39 columns (You can find the python code here). I have also obtained transit lane and roadwork data from Auckland OpenGIS and hourly weather data from Visual Crossing, including measurements like precipitation, temperature and visibility. In addition, I used static feed from AT Developer Portal to fill out the latitude and longitude of each stop recorded.

The summary of the data used and form is as following:

Table 1: Data Used
Data	Form	Description
AT Real Time Feed	.JSON, BSON	See next section
AT Static	.csv	Stop names, latitude and longitude, stop codes
Weather	.csv	Description, precipitation, humidity
Geographical	.geojson	Shapefiles including transit lane and roadworks

Data Processing

My data wrangling can be dividied into 4 parts: 1. elementary data cleaning, 2. using data.table for attaching weather and stop information 3. adding weekday using lubridate 4. wrangle spatial data with sf. Since adding weekday is relatively easy, the article will only talk about other three.

Elementary Data Cleaning

As mentioned before, each request response is stored into a single document in MongoDB. Therefore, we need to combine all the documents and transformed it into a single dataframe before further analysis. Before that, some data features need to be noted. The raw JSON data has three main keys: an id recording the id of the specific request response, a header includes timestamp and gtfs version, and an entity that stores all the information we need. Under “entity, there are three major keys:”Trip_time_update”, “Stop_time_update” and “vehicle” (sometimes missing). Under three major keys, there are keys shared the same name, for example, “schedule_relationship” is found both under “Stop_time_update” and “Trip_time_update”. In addition, sometimes “vehicle” suggests the bus service is cancelled and there are no other two major keys in this case. Regarding these data features, our sub-tasks are:

Only query the data from the database with services that are not cancelled (i.e., there is no “alert” key).
Combine them into a single dataframe in R.
For columns with the same name under different keys, compare the value; if both are not missing, write the value to a new column; if either of them are missing, and another are not, use the information from another. Therefore, a new column should include all the information from the two columns. The Final dataframe should have less missing values and less columns.

Let’s implement the first sub-task:

# set up database request
library(mongolite)
library(tidyverse)

# connection_string = "mongodb+srv://<username>:<password>@cluster0.7gfdq86.mongodb.net/test"

bus = mongo(collection="AT_Bus", 
            db="AT_Bus", url=connection_string)

unwind merges the information under “entity” from all documents that are not cancelled. We also want to count the number of total data and the number of non-cancelled services

count <- bus$aggregate('[{"$unwind": "$entity"}, {"$match":
{"entity.alert":{"$exists":false}}},
                       {"$group": {"_id": null, "count": {"$sum": 1}}} ]')
# total number
count2 <- bus$aggregate('[{"$unwind": "$entity"},{"$group": {"_id": null, "count": {"$sum": 1}}} ]')

# This is the final combined dataframe
system.time(docs <- bus$aggregate('[{"$unwind": "$entity"},{"$match":
{"entity.alert":{"$exists":false}}},
  {"$project": {"_id": 0,"header":0}}]'))

Total data collected sofar:

data.frame(not_cancelled=count$count,total_data=count2$count)

##   not_cancelled total_data
## 1        361115     421242

A look at the nested structure:

library(tidyverse)
glimpse(docs$entity)

## Rows: 361,115
## Columns: 4
## $ id          <chr> "51330302751-20230418152005_v104.60", "51330199094-2023041…
## $ trip_update <df[,5]> <data.frame[26 x 5]>
## $ is_deleted  <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,…
## $ vehicle     <df[,5]> <data.frame[26 x 5]>

glimpse(docs$entity$trip_update)

## Rows: 361,115
## Columns: 5
## $ trip             <df[,6]> <data.frame[26 x 6]>
## $ timestamp        <int> 1682075324, 1682075323, 1682080225, 1682080225, 16…
## $ delay            <int> 0, 0, 0, 0, 0, 0, 128, 0, 9, 0, 145, 0, 0, -61, 100, …
## $ stop_time_update <df[,5]> <data.frame[26 x 5]>
## $ vehicle          <df[,3]> <data.frame[26 x 3]>

Our final task is to create new columns using the repeated columns. My strategy is to create dataframe for each keys, unlist them and give them distinguishing names, and finally, combine them into a single dataframe and use coalesce to merge the columns containing the same information.

# Data wrangling
docs<-select(docs,-"is_deleted") # delete schedules that didn't exist

# Separate main categories into seperate dataframes, clean and rename it, before coalescing them to have complete information
vehicle<-docs$vehicle 
trip_update<-docs$trip_update
vehicle<-vehicle%>%unnest(everything())
names(vehicle)<-paste0("vehicle","_",names(vehicle))

names(trip_update$trip)[2:6]<-paste0("trip","_",
                                     names(trip_update$trip)[2:6])
names(trip_update$vehicle)<-paste0("t",".","vehicle","_",
                                   names(trip_update$vehicle))
names(trip_update$stop_time_update$arrival)<-paste0(
  "arrival","_",names(trip_update$stop_time_update$arrival))
names(trip_update$stop_time_update$departure)<-paste0(
  "depature","_",names(trip_update$stop_time_update$departure))
trip_update$stop_time_update<-trip_update$stop_time_update%>%unnest(everything())
trip_update<-trip_update%>%unnest(everything())
bus.df<-cbind(docs$id,trip_update,vehicle)
names(bus.df)[1]<-"id"
#rm(vehicle,trip_update)

# elementary data wrangling (coalscing) ----
bus.df$vehicle_id <- coalesce(bus.df$t.vehicle_id,bus.df$vehicle_id)
# Many coalesce operations are omitted here 

# delete repeated columns, only keep columns that have the combined information
bus.df<-bus.df[,-c(3:7,19:21,27,30:36)]

# only obtained data that is not delayed. The main reason should refer to the NA analysis 
nbus.df<-bus.df[!is.na(bus.df$delay),]
nbus.df %>%
  summarise_all(~ sum(is.na(.)))

After ensuring NAs in delay is Missing At Random (MAR), I extract data with non-missing delays (refer to another article for details).

Matching Data Efficiently with `data.table`

The second part is to match the latitude and longitude, stop name data from the static transportation data with the stop id in the obtained data. I used data.table to do it efficiently. The same technique is applied to adding hourly weather data. The basic procedure of using data.table is as follows: 1. create data.table object 2. set the key(i.e., matching columns) and attach. I found using data.table is more efficient than tidyverse and loop for large datase.

For stop information, sometimes real feed only provides stop code with version number, sometimes only stop number. I removed all the version number and use both columns to match the data. For hourly weather data, a difficulty is to match the “start_hour” with the recorded hour in weather data (always like 12:00, 13:00). My strategy is to use the “start_minute”; if “start_minute”>30, “start_hour” is considered next hour.

Here is the code example:

# stop latitude and stop longitude ----
# main idea: the stops can be matched with either stop_id and stop code(needs to extract it from a long string)
library(data.table)

# Use fread() function to read the data efficiently
stop_data <- fread("E:/Bus_data/stops.txt", sep = ",")

# some stop_id are in different format, need to count them out:
nbus.df$y <- ifelse(grepl("v104.61", nbus.df$stop_id),nbus.df$stop_id, NA)
nbus.df$x <- ifelse(grepl("v104.61", nbus.df$stop_id), NA,nbus.df$stop_id)

nbus.df<-nbus.df%>%separate(y,c("stop_code",NA,NA),
                          sep = "-",)
setkey(stop_data, stop_code)

# Perform a left join to update matching rows in stop_align
stop_align[stop_data, on = "stop_id",`:=` (
  stop_lat = as.double(i.stop_lat),
  stop_lon = as.double(i.stop_lon),
  stop_name = as.character(i.stop_name)
), by = .EACHI]

Spatial Data Wrangling

The final part is adding transit data and roadworks as binary variables, i.e., if a stop is near (defined as less than 500 meters) a roadwork site and a transit lane, it is marked with 1. This is done using geojson data containing the shape pf transit lanes and roadworks, and using sf packages to calculate the distances.

The example codes are (roadworks are wrangled in a similar fashion):

# spatial data ----
library(sf)
rw.sf <- st_read("F:/Books/journals/Roadworks.geojson")
ts.sf<-st_read("F:/Books/journals/Transit_Lanes.geojson")

# Convert points data frame to an sf object
bus.cods<-nbus.df[complete.cases(nbus.df$stop_lat),]
stop.sf <- st_as_sf(bus.cods, coords = c("stop_lon", "stop_lat"), 
                    crs = st_crs(ts.sf))


#Transit
# Join the points to the shape file
stops_joined_tr <- st_join(stop.sf, ts.sf)

# Find the nearest features in the shape file for each point
nearest_features_tr <- st_nearest_feature(stop.sf, ts.sf)

# Check if points are neighboring (within 500 meters)
stops_neighbors_tr <- which(nearest_features <= 500)
bus.cods$order<-c(1:nrow(bus.cods))
bus.cods$is_transit<-ifelse(bus.cods$order%in%stops_neighbors_tr,1,0)

Final cleaned data for descriptive analysis:

## Rows: 177,041
## Columns: 34
## $ trip_id               <chr> "247-810104-89640-2-9568910-48e1fff6", "1281-025…
## $ timestamp             <int> 1682099570, 1682164974, 1682099450, 1682099450, …
## $ delay                 <int> 0, -3426, 0, 0, 0, -3426, 0, 0, 0, 0, 0, -38, -5…
## $ stop_sequence         <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 41, 13, 33, 30,…
## $ arrival_delay         <int> 0, NA, 0, 0, 0, NA, 0, 0, 0, 0, 0, NA, NA, -215,…
## $ arrival_time          <int> 1682168040, NA, 1682166720, 1682166720, 16821667…
## $ arrival_uncertainty   <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ depature_delay        <int> 0, -3426, 0, 0, 0, -3426, 0, 0, 0, 0, 0, -35, -5…
## $ depature_time         <int> 1682168040, 1682164974, 1682166720, 1682166720, …
## $ depature_uncertainty  <int> NA, 0, NA, NA, NA, 0, NA, NA, NA, NA, NA, 0, 0, …
## $ stop_id               <chr> "9004-c55cde9b", "7058-d68c50d8", "9001-3be05f0c…
## $ schedule_relationship <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ vehicle_id            <chr> "59238", "11467", "59849", "59849", "59849", "11…
## $ vehicle_label         <chr> "AMP        238", "NB1467", "AMP        849", "A…
## $ start_time            <chr> "24:54:00", "25:00:00", "24:32:00", "24:32:00", …
## $ start_hour            <dbl> 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 19, 21, 21, 21,…
## $ start_minute          <int> 54, 0, 32, 32, 32, 0, 54, 36, 36, 36, 54, 5, 28,…
## $ start_date            <date> 2023-04-22, 2023-04-22, 2023-04-22, 2023-04-22,…
## $ route_id              <chr> "WEST-201", "25L-202", "STH-201", "STH-201", "ST…
## $ direction_id          <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, …
## $ stop_lat              <dbl> -36.84429, -36.85186, -36.84429, -36.84429, -36.…
## $ stop_lon              <dbl> 174.7685, 174.7641, 174.7685, 174.7685, 174.7685…
## $ stop_name             <chr> "Britomart Train Station 4", "Aotea Square", "Br…
## $ weekday               <chr> "Saturday", "Saturday", "Saturday", "Saturday", …
## $ delay_min             <dbl> 0.0000000, -57.1000000, 0.0000000, 0.0000000, 0.…
## $ temp                  <dbl> 19.1, 19.1, 19.1, 19.1, 19.1, 19.1, 19.1, 19.1, …
## $ feelslike             <dbl> 19.1, 19.1, 19.1, 19.1, 19.1, 19.1, 19.1, 19.1, …
## $ humidity              <dbl> 92.71, 92.71, 92.71, 92.71, 92.71, 92.71, 92.71,…
## $ precip                <dbl> 5.767, 5.767, 5.767, 5.767, 5.767, 5.767, 5.767,…
## $ windspeedmean         <dbl> 10.0, 10.0, 10.0, 10.0, 10.0, 10.0, 10.0, 10.0, …
## $ severerisk            <chr> "Rain, Overcast", "Rain, Overcast", "Rain, Overc…
## $ is_transit            <dbl> 1, 0, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, …
## $ roadwork              <dbl> 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, …
## $ c.start_time          <dbl> 0.9000000, 1.0000000, 0.5333333, 0.5333333, 0.53…

Appendix: Full Code Script

gc()

# First Part: data acquisition
library(mongolite)
library(tidyverse)

connection_string = "mongodb+srv://<username>:<passwords>@cluster0.7gfdq86.mongodb.net/test"
# for confidentiality reason, the user name and passwords are hidden

bus = mongo(collection="AT_Bus", 
            db="AT_Bus", url=connection_string)
count <- bus$aggregate('[{"$unwind": "$entity"}, {"$match":
{"entity.alert":{"$exists":false}}},
                       {"$group": {"_id": null, "count": {"$sum": 1}}} ]')
# total number
count2 <- bus$aggregate('[{"$unwind": "$entity"},{"$group": {"_id": null, "count": {"$sum": 1}}} ]')

system.time(docs <- bus$aggregate('[{"$unwind": "$entity"},{"$match":
{"entity.alert":{"$exists":false}}},
  {"$project": {"_id": 0,"header":0}}]'))
docs<-docs$entity
names(docs)

# Data wrangling
docs<-select(docs,-"is_deleted") # delete schedules that didn't exist

# Separate main categories into seperate dataframes, clean and rename it, before coalescing them to have complete information
vehicle<-docs$vehicle 
trip_update<-docs$trip_update
vehicle<-vehicle%>%unnest(everything())
names(vehicle)<-paste0("vehicle","_",names(vehicle))

names(trip_update$trip)[2:6]<-paste0("trip","_",
                                     names(trip_update$trip)[2:6])
names(trip_update$vehicle)<-paste0("t",".","vehicle","_",
                                   names(trip_update$vehicle))
names(trip_update$stop_time_update$arrival)<-paste0(
  "arrival","_",names(trip_update$stop_time_update$arrival))
names(trip_update$stop_time_update$departure)<-paste0(
  "depature","_",names(trip_update$stop_time_update$departure))
trip_update$stop_time_update<-trip_update$stop_time_update%>%unnest(everything())
trip_update<-trip_update%>%unnest(everything())
bus.df<-cbind(docs$id,trip_update,vehicle)
names(bus.df)[1]<-"id"
#rm(vehicle,trip_update)

# elementary data wrangling (coalscing) ----
bus.df$vehicle_id <- coalesce(bus.df$t.vehicle_id,bus.df$vehicle_id)
bus.df$trip_id <- coalesce(bus.df$trip_id,bus.df$vehicle_trip_id)
bus.df$start_time <- coalesce(bus.df$trip_start_time,bus.df$vehicle_start_time)
bus.df$start_date <- coalesce(bus.df$trip_start_date,bus.df$vehicle_start_date)
bus.df$schedule_relationship <- coalesce(bus.df$trip_schedule_relationship, 
                                         bus.df$schedule_relationship,
                                         bus.df$vehicle_schedule_relationship)
bus.df$route_id <- coalesce(bus.df$trip_route_id,bus.df$vehicle_route_id)
bus.df$direction_id <- coalesce(bus.df$trip_direction_id,
                                bus.df$vehicle_direction_id)
bus.df$timestamp <- coalesce(bus.df$timestamp,bus.df$vehicle_timestamp)
bus.df$vehicle_label <- coalesce(bus.df$t.vehicle_label,
                                 bus.df$vehicle_label)
# delete repeated columns, only keep columns that have the combined information
bus.df<-bus.df[,-c(3:7,19:21,27,30:36)]

# only obtained data that is not delayed. The main reason should refer to the NA analysis in the third section
nbus.df<-bus.df[!is.na(bus.df$delay),]
nbus.df %>%
  summarise_all(~ sum(is.na(.)))
# route id and stop id contains different format
rm(vehicle,trip_update,docs)
gc()

# stop latitude and stop longitude ----
# main idea: the stops can be matched with either stop_id and stop code(needs to extract it from a long string)
library(data.table)

# Use fread() function to read the data efficiently
stop_data <- fread("E:/Bus_data/stops.txt", sep = ",")

# some stop_id are in different format, need to count them out:
nbus.df$y <- ifelse(grepl("v104.61", nbus.df$stop_id),nbus.df$stop_id, NA)
nbus.df$x <- ifelse(grepl("v104.61", nbus.df$stop_id), NA,nbus.df$stop_id)

nbus.df<-nbus.df%>%separate(y,c("stop_code",NA,NA),
                          sep = "-",)
stop_align <- data.table(
  stop_id = nbus.df$x,
  stop_code = as.numeric(nbus.df$stop_code),
  stop_lat = numeric(nrow(nbus.df)),
  stop_lon = numeric(nrow(nbus.df)),
  stop_name = character(nrow(nbus.df))
)

# Set the key for join operation
setkey(stop_data, stop_id)
setkey(stop_data, stop_code)

# Perform a left join to update matching rows in stop_align
stop_align[stop_data, on = "stop_id",`:=` (
  stop_lat = as.double(i.stop_lat),
  stop_lon = as.double(i.stop_lon),
  stop_name = as.character(i.stop_name)
), by = .EACHI]


# Perform a left join to update matching rows in stop_align
stop_align[stop_data, on = "stop_code",`:=` (
  stop_lat = as.double(i.stop_lat),
  stop_lon = as.double(i.stop_lon),
  stop_name = as.character(i.stop_name),
  stop_id = as.character(i.stop_id)
), by = .EACHI]

nbus.dfstop_id<-stop_align$stop_id
nbus.df<-nbus.df%>%select(-c("x","stop_code"))
nbus.df<-cbind(nbus.df,stop_align[,3:5])
nbus.df[,26:27][nbus.df[,26:27] == 0]<-NA

rm(stop_data, stop_align)
gc()

# Time and date ----
library(lubridate)
nbus.df$x <- ifelse(grepl("v104.61", nbus.df$route_id),
                    nbus.df$route_id,NA)
nbus.df<-nbus.df%>%separate(x,c("x",NA,NA),
                          sep = "-",)
nbus.df$x <- paste0(substring(nbus.df$x, 1, 2), "-", 
                   substring(nbus.df$x, 3))
nbus.df$route_id<-ifelse(grepl("v104.61", nbus.df$route_id),nbus.df$x,
                         nbus.df$route_id)
nbus.df<-nbus.df%>%select(-x)
nbus.df<-nbus.df%>%
  mutate(start_date = as.Date(as.character(start_date), 
                              format = "%Y%m%d"))%>%
  mutate(weekday=weekdays(start_date))%>%
  mutate(delay_min)= delay/60)%>%
  separate(start_time,paste0("start","_",c("hour","minute",NA)),sep = ":",
           convert = T,remove = F)%>%
  select(-start_NA)%>%
  mutate(public_holiday=ifelse(nbus.df$start_date=="2023-04-25",1,0))
nbus.df$public_holiday=ifelse(nbus.df$weekday=="Sunday",1,
                              nbus.df$public_holiday)
nbus.df<-nbus.df[nbus.df$start_data!=any(c("2023-05-07","2023-05-08")),]

# weather ----
# main idea: compare if the date is the same; if so, compare hour, and paste corresponding weather conditions
weather <- read.csv("F:/Books/journals/Auckland,New Zealand 2023-04-20 to 2023-05-06.csv") 
# extract the date component
weather$date <- substr(weather$datetime, 1, 10)
# extract the time component and remove the "T" character
weather$time <- gsub("T", "", substr(weather$datetime, 12, 19))
weather<-weather[,-1]
weather<-weather%>%
  separate(time,c("hour",NA,NA),sep = ":",convert = T,remove = F)
weather.dt<-data.table(weather)
weather.dt$hour<-as.double(weather.dt$hour)
nbus.df$hour<-nbus.df$start_hour
nbus.df$date<-as.character(nbus.df$start_date)
nbus.df$hour<-ifelse(nbus.df$start_minute>30,nbus.df$hour+1,nbus.df$hour)
nbus.df$hour<-ifelse(nbus.df$hour>24, 0,nbus.df$hour)
nbus.df <- merge(nbus.df, weather.dt, 
                   by = c("date", "hour"), all.x = TRUE)
rm(weather.dt)
nbus.df<-nbus.df%>%select(-c(time,hour,date))
rm(weather)
nbus.df<-nbus.df[,-c(14:18,21,39,41)]
gc()

# spatial data ----
library(sf)
rw.sf <- st_read("F:/Books/journals/Roadworks.geojson")
ts.sf<-st_read("F:/Books/journals/Transit_Lanes.geojson")

# Convert points data frame to an sf object
bus.cods<-nbus.df[complete.cases(nbus.df$stop_lat),]
stop.sf <- st_as_sf(bus.cods, coords = c("stop_lon", "stop_lat"), 
                    crs = st_crs(ts.sf))


#Transit
# Join the points to the shape file
stops_joined_tr <- st_join(stop.sf, ts.sf)

# Find the nearest features in the shape file for each point
nearest_features_tr <- st_nearest_feature(stop.sf, ts.sf)

# Check if points are neighboring (within 500 meters)
stops_neighbors_tr <- which(nearest_features <= 500)

#Roadworks
# Join the points to the shape file
stops_joined_rw <- st_join(stop.sf, rw.sf)

# Find the nearest features in the shape file for each point
nearest_features_rw <- st_nearest_feature(stop.sf, rw.sf)

# Check if points are neighboring (within 500 meters)
stops_neighbors_rw <- which(nearest_features <= 500)

bus.cods$order<-c(1:nrow(bus.cods))
bus.cods$roadwork<-ifelse(bus.cods$order%in%stops_neighbors_rw,1,0)
bus.cods$is_transit<-ifelse(bus.cods$order%in%stops_neighbors_tr,1,0)


# final modelling data ----
df<-select(bus.cods,-c("trip_id","timestamp","arrival_delay","arrival_time","arrival_uncertainty","depature_delay","depature_time","depature_uncertainty","stop_id","vehicle_id","vehicle_label","start_time","start_date","route_id","stop_name","delay_min","feelslike"))
df$start_hour<-ifelse(df$start_hour>23,df$start_hour-24,df$start_hour)
df$start_time<-df$start_hour+df$start_minute/60
df<-select(df,-c(start_hour,start_minute))
df<-df[!is.na(df$temp),]

AT Bus: Data Cleaning

Anita Koh (Anita Wu)

2023-06-08

Goal

Data Source

Data Processing

Elementary Data Cleaning

Matching Data Efficiently with `data.table`

Spatial Data Wrangling

Appendix: Full Code Script

AT Bus: Data Cleaning

Anita Koh (Anita Wu)

2023-06-08

Goal

Data Source

Data Processing

Elementary Data Cleaning

Matching Data Efficiently with data.table

Spatial Data Wrangling

Appendix: Full Code Script

Matching Data Efficiently with `data.table`