Web-scraping London bikeshare bike availability

This code is an example of how to use the R programming language to access and process data from a bikeshare operator and obtain docking station bike availability data for that moment in time. This is an example for one city, but in principle the same could be done for many cities and then stored in a combined database.

First we need to load the necessary libraries

#——– Get docking station data for current time ———-#

library(tidyverse)

## Warning: package 'tidyverse' was built under R version 4.1.3

## Warning: package 'tibble' was built under R version 4.1.3

## Warning: package 'tidyr' was built under R version 4.1.3

## Warning: package 'readr' was built under R version 4.1.3

## Warning: package 'stringr' was built under R version 4.1.3

## Warning: package 'forcats' was built under R version 4.1.3

library(xml2)
library(XML)
library(knitr)
library(RSQL) #Generate and Process 'SQL' Queries in R
library(RSQLite) #Can create an in-memory SQL database
library(odbc) #Contains drivers to connect to a database
library(DBI) #Contains functions for interacting with the database

Get the data

To find the url needed to obtain the data via an API, a bit of exploration is needed on the bikeshare operator site.

Below describes the process for obtaining data from the London scheme.

The first line uses the httr::GET() function to send a GET request to the URL “https://tfl.gov.uk/tfl/syndication/feeds/cycle-hire/livecyclehireupdates.xml”, which is an XML file that contains information about the status of the cycle hire stations in London. The result of the request is stored in the object status_api_call.

status_api_call <- httr::GET("https://tfl.gov.uk/tfl/syndication/feeds/cycle-hire/livecyclehireupdates.xml")

The second line uses the xmlInternalTreeParse() function to parse the XML content of status_api_call into an internal tree representation, which is easier to manipulate and query. The result is stored in the object api_xml2.

api_xml2 = xmlInternalTreeParse(status_api_call)

The printed output is in the wrong format, so we need convert into a df.

#cat(saveXML(api_xml2, nchars = 10))

The fourth line uses the Sys.time() function to get the current system time and store it in the object current_time. This is needed to ensure that the date and time is correct in the final df, as sometime bikeshare operators have errors with the datetime provided.

current_time = Sys.time()

The fifth line uses the xmlToDataFrame() function to convert the XML tree api_xml2 into a data frame, which is a tabular format that can be used for further analysis. The data frame is then piped (|>) to the mutate() function, which adds a new column called datetime with the value of current_time. This can help to keep track of when the data was obtained. The result is stored in the object xmldf.

xmldf = xmlToDataFrame(api_xml2) |>
  dplyr::mutate(datetime = current_time)

# print the data frame as a table
knitr::kable(xmldf[1:5, ], caption = "Data frame from XML file")

Data frame from XML file
id	name	terminalName	lat	long	installed	locked	installDate	temporary	nbBikes	nbStandardBikes	nbEBikes	nbEmptyDocks	nbDocks	datetime
1	River Street , Clerkenwell	001023	51.52916347	-0.109970527	true	false	1278947280000	false	14	14	0	5	19	2024-03-28 15:19:22
2	Phillimore Gardens, Kensington	001018	51.49960695	-0.197574246	true	false	1278585780000	false	11	11	0	26	37	2024-03-28 15:19:22
3	Christopher Street, Liverpool Street	001012	51.52128377	-0.084605692	true	false	1278240360000	false	9	9	0	21	32	2024-03-28 15:19:22
4	St. Chad’s Street, King’s Cross	001013	51.53005939	-0.120973687	true	false	1278241080000	false	19	19	0	4	23	2024-03-28 15:19:22
5	Sedding Street, Sloane Square	003420	51.49313	-0.156876	true	false	1278241440000	false	25	24	1	2	27	2024-03-28 15:19:22

We then need to save this data to a database. This code is an example of how to use R to interact with a SQLite database.

CREATE A DATABASE

The first line uses the dbConnect() function to establish a connection to a SQLite database file located at “C:/Users/user/Documents/name/db_name”. The result of the connection is stored in the object con

con <- dbConnect(RSQLite::SQLite(), dbname="C:/Users/patri/Documents/R projs/city_comparison/cities_database.db")

WRITE TO DATABASE

The second line uses the dbWriteTable() function to write the data frame xmldf to a table named ‘cities_database_table’ in the database. The append=TRUE argument indicates that the data will be added to the existing table, rather than replacing it

# RSQLite::dbWriteTable(con, name='cities_database_table', xmldf,append=TRUE)

Check what tables exist in your database

The third line uses the dbListTables() function to list the names of all the tables in the database. This can help to check if the table was created successfully.

# RSQLite::dbListTables(con)

create dataframe from DB

The fourth line uses the dbGetQuery() function to execute a SQL query that selects all the rows and columns from the table ‘cities_database_table’ and returns them as a data frame. The result is stored in the object cities_db_df.

# cities_db_df <- dbGetQuery(con, "SELECT * FROM cities_database_table")

Setup a script schedule

Separate to this script, we need to ensure that we collect this data every 10 minutes by using a timer feature to re-run the script.

# library(taskscheduleR)
# taskscheduler_create(taskname = "my_script_task", 
                     #rscript = "C:/Users/username/Documents/my_script.R", 
                     #schedule = "MINUTE", 
                     #starttime = format(Sys.time() + 50, "%H:%M"), 
                     #modifier = 10)

This will create a task named my_script_task that will run your script every ten minutes, starting from 50 seconds after the current time. You can modify the arguments of the function to suit your needs, such as changing the task name, the script path, the schedule frequency, or the start time.

Web-scraping London bikeshare bike availability

2023-10-03