Before starting, it is important to mention that the present document was developed using R version 4.5.0 Patched (2025-05-01 r88189 ucrt) (R Core Team 2019), in the RStudio desktop IDE, version 2025.05.0+496 “Mariposa Orchid” (Posit team 2025), running under Windows 10 x64 (build 19045).
To make this project possible, we need to load the different libraries that will be used to process the information in this document. For that, I set up the following function; it is important to note that in this case I am loading them silently.
# I am setting the chunk option results='asis' and using htmltools tags, to render
# the output text of this chunk as a regular list inside the document
load_libs <- function() {
  # Function to load the libraries
  # Declare a char vector with the names of the libraries to load
  librerias <- c(
    "data.table", "fuzzyjoin", "glue", "hms", "htmltools", "htmlwidgets",
    "geosphere", "leaflet", "leaflet.extras2", "magrittr", "naniar", "rio",
    "rlang", "scales", "simputation", "sp", "stringdist", "tidyverse", "visdat"
  )
  # Set up a variable to check the number of libraries loaded
  ctrl <- length(librerias)
  # Load "librerias" silently
  librerias <- suppressMessages(suppressWarnings(lapply(librerias, library,
                                                        character.only = TRUE)))
  # Check the number of loaded libs, organize the list and write a paragraph with the result
  if (ctrl == length(librerias)) {
    base_libs <- sort(sessionInfo()$base)
    base_toP <- paste(base_libs[1:(length(base_libs) - 1)], collapse = ", ")
    libs <- sort(unlist(librerias[[length(librerias)]]))
    libs <- libs[!(libs %in% base_libs)]
    libs_toP <- paste0(libs[1:(length(libs) - 1)], collapse = ", ")
    listado <- tags$ul(
      tags$li(paste0(length(base_libs), " base R libraries: ", base_toP,
                     " and ", base_libs[length(base_libs)], ".")),
      tags$li(paste0(length(librerias[[length(librerias)]]) - length(base_libs),
                     " additional installed packages: ",
                     libs_toP, " and ", libs[length(libs)], "."))
    ) %>%
      tagAppendAttributes(class = "disc")
    salida <- paste0(
      "The following ", length(librerias[[length(librerias)]]),
      " libraries were installed:\n\n", listado
    )
  } else {
    salida <- "Something happened, let's check!\n"
  }
  # The function returns a paragraph stating the result
  writeLines(salida)
  invisible(libs)
}
# libs <- load_libs()
load_libs()
The following 35 libraries were installed:
I mentioned the silent option before because, with so many packages, there may be cases in which conflicts arise between the functions in different packages; if so, you can check them with conflicts(), reload the libraries with all the messages to review them, or use the notation package_name::function().
At the end of the chunk I commented out the line # libs <- load_libs(), since I just needed this code to get the libs vector, returned invisibly by the function, in order to add the package references to the references.bib document, following the R Markdown Cookbook’s recommendation (Xie, Dervieux, and Riederer 2020).
We have now loaded the libraries needed for this project; since we are not going to use this function again, we can remove it to keep the environment neat:
rm(load_libs)
Right now I’m showing the chunk that deletes the function used in this section. I am going to keep repeating this cleaning operation several times throughout the document to avoid cluttering the environment; but from now on, I won’t be including the respective chunks, since all of them just use the same rm() function.
First, I need to define the dates for which I want to download the zip files. I started this project in the first days of October 2022 and I needed to collect at least a year’s worth of data. To accomplish this, I established my first month as October 2021 (2021-10-01).
After quite some time, I noticed that learning this language was going to take me a lot more time than I had expected, and the months kept on passing. For this reason, I decided to leave the initial month as it was and keep adding the new tables (dfs) as they became available online month after month. That meant that the code had to be able to update the dates every time I ran it. The Sys.Date() function helped me accomplish that. As you can see in the results, it keeps retrieving the dates up to this last month.
set_init_dates <- function() {
  # Function to create a date vector with the initial date for each period
  # Set the variables
  # One with the last month that may be available
  last_month <- floor_date(Sys.Date() - 30, "month")
  # Another one with the sequence since the first date by month
  # We'll need this vector later on, so I'll save it in the Global environment
  yearMonths <<- seq(as.Date("2021-10-01"), last_month, by = "months")
  # State the result for these variables as a paragraph
  salida <- paste0(
    "The first month starts with the date: ", as.Date("2021-10-01"),
    "; and, most probably, the last month's start date will be: ", last_month,
    ". The complete sequence of all the initial dates are in ",
    'an atomic vector named "yearMonths"; vector with class "',
    class(yearMonths), '" and of type "', typeof(yearMonths),
    '".\n\nThe dates in the vector increment by 1 month, i.e., ',
    paste0(yearMonths[1:6], collapse = ", "), ", etc.\n"
  )
  # The function returns a paragraph explaining the result
  writeLines(salida)
}
set_init_dates()
The first month starts with the date: 2021-10-01; and, most probably, the last month’s start date will be: 2025-06-01. The complete sequence of all the initial dates are in an atomic vector named “yearMonths”; vector with class “Date” and of type “double”.
The dates in the vector increment by 1 month, i.e., 2021-10-01, 2021-11-01, 2021-12-01, 2022-01-01, 2022-02-01, 2022-03-01, etc.
I have to keep in mind that during the first days of every month the data for the previous month may not be available yet.
# Define a variable with the number of items
a <- length(yearMonths)
# Declare a vector to contain the names of the dfs
periods <- sapply(yearMonths, function(var) {
paste0(
lubridate::month(var, label = TRUE), "_",
format(var, "%Y"), "_rides"
)
})
# State the info:
writeLines(paste0("At this moment we may have ", a, " months of data available, there may be one less during the first days of every month (", a - 1, " in this case).\n\nThe following are some of the names that will be used for the tables, just to give you an idea of the scheme for those names:\n", paste0(periods[1:5], collapse = ", "), ", and so on. Up to, if already available, ", periods[a], ".\n"))
At this moment we may have 45 months of data available, there may be one less during the first days of every month (44 in this case).
The following are some of the names that will be used for the tables, just to give you an idea of the scheme for those names: oct_2021_rides, nov_2021_rides, dic_2021_rides, ene_2022_rides, feb_2022_rides, and so on. Up to, if already available, jun_2025_rides.
The data comes from the following index page: https://divvy-tripdata.s3.amazonaws.com/index.html (“Bucket Loading... — Divvy-Tripdata.s3.amazonaws.com”). This is a public data set by https://ride.divvybikes.com/, which operates a bike-sharing model in the city of Chicago (“Home | Divvy Bikes — Ride.divvybikes.com”).
For this exercise we can create the R data tables directly from that site by using the import() function from the rio package (Chan et al. 2023). The files are comma-separated values (.csv) files compressed in .zip format. In the previous section we already created a vector, periods, with the names that we are going to give to those tables.
Before proceeding with the download, I am going to reduce the value of a so we only get 12 months of data. If we don’t change it, this chunk will download the complete set of 45 csv files available up to now, and that will take a long time. So, for the time being, let's just work with 12 months. Since we are changing the value of a, we have to adjust the periods vector as well:
# New value for 'a'
a <- 12
# Adjust the periods vector
periods <- periods[1:a]
Now we are ready to download the information; the following chunk will help us with this task:
# Define a char vector with each file name to download
files_to_dwnld <- paste0(
  "https://divvy-tripdata.s3.amazonaws.com/", format(yearMonths[1:a], "%Y"),
  format(yearMonths[1:a], "%m"), "-divvy-tripdata.zip"
)
download_data <- function(i) {
  # Function to download the .zip files from the internet site; it takes an index as input
  # Display a message to show the progress
  cat("Downloading table ", periods[[i]], ": ", i, " out of ", a, sep = "")
  df <- import(files_to_dwnld[[i]])
  # Display a "Done" message
  cat(" - Done!\n")
  df
}
br_data <- lapply(seq_along(files_to_dwnld), \(i) download_data(i))
Downloading table oct_2021_rides: 1 out of 12 - Done!
Downloading table nov_2021_rides: 2 out of 12 - Done!
Downloading table dic_2021_rides: 3 out of 12 - Done!
Downloading table ene_2022_rides: 4 out of 12 - Done!
Downloading table feb_2022_rides: 5 out of 12 - Done!
Downloading table mar_2022_rides: 6 out of 12 - Done!
Downloading table abr_2022_rides: 7 out of 12 - Done!
Downloading table may_2022_rides: 8 out of 12 - Done!
Downloading table jun_2022_rides: 9 out of 12 - Done!
Downloading table jul_2022_rides: 10 out of 12 - Done!
Downloading table ago_2022_rides: 11 out of 12 - Done!
Downloading table sept_2022_rides: 12 out of 12 - Done!
# Assign the names from the periods vector
names(br_data) <- periods
If we were to download all the sets programmatically, it would be important to do it in two parts: first the files up to files_to_dwnld[length(files_to_dwnld) - 1], which we know are already available, and then, separately, the last file, which may not have been published yet.
The code for downloading that last file could be something like:
tryCatch(
  {
    br_data[[length(files_to_dwnld)]] <- import(files_to_dwnld[length(files_to_dwnld)])
  },
  error = function(e) {
    message(paste0("Data for ", periods[length(files_to_dwnld)], " is not yet available."))
    print(e)
  },
  warning = function(w) {
    message(paste0("The data for ", periods[length(files_to_dwnld)], " is not available yet."))
    print(w)
  }
)
Let's check the names and classes for the objects that we just got from the Internet:
sapply(br_data, class)
oct_2021_rides nov_2021_rides dic_2021_rides ene_2022_rides feb_2022_rides
"data.frame" "data.frame" "data.frame" "data.frame" "data.frame"
mar_2022_rides abr_2022_rides may_2022_rides jun_2022_rides jul_2022_rides
"data.frame" "data.frame" "data.frame" "data.frame" "data.frame"
ago_2022_rides sept_2022_rides
"data.frame" "data.frame"
We have correctly downloaded the twelve data frames we needed; now we are also going to add the data.table class to them in order to use that package (Barrett et al. 2025).
# Add the data.table class to the data frames
br_data <- lapply(br_data, as.data.table)
lapply(br_data, class)
$oct_2021_rides
[1] "data.table" "data.frame"
$nov_2021_rides
[1] "data.table" "data.frame"
$dic_2021_rides
[1] "data.table" "data.frame"
$ene_2022_rides
[1] "data.table" "data.frame"
$feb_2022_rides
[1] "data.table" "data.frame"
$mar_2022_rides
[1] "data.table" "data.frame"
$abr_2022_rides
[1] "data.table" "data.frame"
$may_2022_rides
[1] "data.table" "data.frame"
$jun_2022_rides
[1] "data.table" "data.frame"
$jul_2022_rides
[1] "data.table" "data.frame"
$ago_2022_rides
[1] "data.table" "data.frame"
$sept_2022_rides
[1] "data.table" "data.frame"
Now that the tables have both classes, we are ready to check the information in these 12 data tables!
Now let’s start checking the tables from the files we just downloaded, comparing the column names for the different periods with the following function:
check_colnames <- function(a_list) {
# Function to check if the colnames are the same in all dfs,
# It takes a list with the dfs as input (br_data)
# Set up a vector with the column names of the first df
clmn_names <<- colnames(a_list[[1]])
# Get the column names of the rest of the dfs (from the 2nd on):
clnms_rest <- lapply(a_list[-1], colnames)
samecols <- unlist(lapply(clnms_rest, \(cl_nms) identical(clmn_names, cl_nms)))
# List the column names of the first df:
sal1 <- paste0(
"In the table ", names(a_list)[1], " the names of the columns are: '",
paste0(clmn_names, collapse = "' - '"), "'.\n"
)
# Compare the previous info with the rest of the dfs:
sal2 <- paste0(if (sum(samecols) == 0) {
"\nNo other table has the same column names.\n\tLet's check!"
} else if (sum(samecols) == length(a_list)-1) {
paste0(
"Good news: All the other ", length(samecols),
" tables have the same column names!"
)
} else {
paste0("This tables have also the same colnames: ",
paste(names(samecols)[samecols], collapse = ", "),
".\nBut the following tables have DIFFERENT names: ",
paste(names(samecols)[!samecols], collapse = ", "),
".\nLet's fix that!",
sep = ""
)
})
# Combine the information
salida <- paste0(sal1, "\n", sal2, "\n")
# The function returns a paragraph stating the results
writeLines(salida)
}
check_colnames(br_data)
In the table oct_2021_rides the names of the columns are: ‘ride_id’ - ‘rideable_type’ - ‘started_at’ - ‘ended_at’ - ‘start_station_name’ - ‘start_station_id’ - ‘end_station_name’ - ‘end_station_id’ - ‘start_lat’ - ‘start_lng’ - ‘end_lat’ - ‘end_lng’ - ‘member_casual’.
Good news: All the other 11 tables have the same column names!
We are now certain that all the dfs have the same column names, which will make it easier for us to work with them.
Let’s now check the structure of the first table (first according to date), and whether those columns are of the same classes across the various tables. The following info is the structure (str()) of the first data table, oct_2021_rides:
str(br_data[[1]])
Classes 'data.table' and 'data.frame': 631226 obs. of 13 variables:
$ ride_id : chr "620BC6107255BF4C" "4471C70731AB2E45" "26CA69D43D15EE14" "362947F0437E1514" ...
$ rideable_type : chr "electric_bike" "electric_bike" "electric_bike" "electric_bike" ...
$ started_at : POSIXct, format: "2021-10-22 12:46:42" "2021-10-21 09:12:37" ...
$ ended_at : POSIXct, format: "2021-10-22 12:49:50" "2021-10-21 09:14:14" ...
$ start_station_name: chr "Kingsbury St & Kinzie St" "" "" "" ...
$ start_station_id : chr "KA1503000043" "" "" "" ...
$ end_station_name : chr "" "" "" "" ...
$ end_station_id : chr "" "" "" "" ...
$ start_lat : num 41.9 41.9 41.9 41.9 41.9 ...
$ start_lng : num -87.6 -87.7 -87.7 -87.7 -87.7 ...
$ end_lat : num 41.9 41.9 41.9 41.9 41.9 ...
$ end_lng : num -87.6 -87.7 -87.7 -87.7 -87.7 ...
$ member_casual : chr "member" "member" "member" "member" ...
- attr(*, ".internal.selfref")=<externalptr>
As we can observe, we have 7 columns with class chr
(character), 2 with POSIXct (date-time) and 4 with
num (numeric). By checking the few elements shown from each
column, we can infer that the classes are consistent with the
information contained in them.
Let’s run a function so we can check if the rest of the dfs have the same structure. I’ll just show the result:
All the other 11 tables have also the same structure!
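That chunk is not echoed; a minimal sketch of such a check, assuming we simply compare every table’s column classes against those of the first one, could be:
# Sketch (assumed implementation): compare the column classes of each table
# against the first one and report the result
check_str <- function(a_list) {
  ref <- sapply(a_list[[1]], \(col) class(col)[1])
  same <- sapply(a_list[-1], \(DT) identical(ref, sapply(DT, \(col) class(col)[1])))
  if (all(same)) {
    writeLines(paste0("All the other ", length(same), " tables have also the same structure!"))
  } else {
    writeLines(paste0("Check the structure of: ", paste(names(same)[!same], collapse = ", ")))
  }
}
# check_str(br_data)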
After finding out that the columns in all the dfs coincide and have the correct classes, now we can start checking the tables information.
Before starting to check the information in the tables, I am going to
set up an empty list docWise to save information needed for
rendering this document correctly, and a variable that will help us
check the info of a different period by altering its value.
The list 'docWise' has been declared right now!
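The chunk itself is hidden; it essentially just declares an empty list and announces it (sketch):
# Sketch of the hidden chunk: declare the empty list used to store values
# needed for rendering this document
docWise <- list()
writeLines("The list 'docWise' has been declared right now!")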
Now we can set the variable. The idea behind declaring this variable is to be able to check the same information for a different period just by changing its value. I am avoiding using a or i, since the value of those variables is related to processes we have already run and is very likely to be changed again by those or similar processes. In this case, the name for the new variable is going to be k.
# Defining a variable with a value to check a single period and adding a
# couple of 'if-statements' to avoid errors
k <- 9
check_k <- function(k) {
if (k < 1) {
k <- 1
}
if (k > a) {
k <- a
}
# Save the value in the document list
docWise$k <<- k
# Check the information:
sal1 <- paste0("Right now `k` = ", k, ", this means that the functions following ",
"this section are going to evaluate the data in table `", periods[k],
"`.\n")
writeLines(sal1)
}
check_k(k)
Right now k = 9, this means that the functions following
this section are going to evaluate the data in table
jun_2022_rides.
Note: The idea behind this variable k
is that we can run the functions in the following sections with the
information from other periods by changing its value. According to the
previous code chunk, the value for variable k needs to be
between 1 and a. In addition to this first caveat, and as
you can observe, in the function check_k the value of
k is added to the docWise list that we
declared before. The idea of that list is to keep there values that are
needed for rendering this document correctly; meaning that if
k is changed elsewhere, and the value of k in
the list is not, the text throughout the document is going to keep
referring to items related to the value of k before the
change and it may become confusing. To avoid this, please change the
value of k in the previous chunk, in the line right before
the declaration of the function check_k, and run the chunk
again.
Let’s check the data frame (jun_2022_rides) with the
head() and tail() to gather some insights
about the contents in the observations of the tables and see what else
we can notice:
ride_id rideable_type started_at ended_at
<char> <char> <POSc> <POSc>
1: 600CFD130D0FD2A4 electric_bike 2022-06-30 17:27:53 2022-06-30 17:35:15
2: F5E6B5C1682C6464 electric_bike 2022-06-30 18:39:52 2022-06-30 18:47:28
3: B6EB6D27BAD771D2 electric_bike 2022-06-30 11:49:25 2022-06-30 12:02:54
4: C9C320375DE1D5C6 electric_bike 2022-06-30 11:15:25 2022-06-30 11:19:43
5: 56C055851023BE98 electric_bike 2022-06-29 23:36:50 2022-06-29 23:45:17
6: B664188E8163D045 electric_bike 2022-06-30 16:42:10 2022-06-30 16:58:22
7: 338C05A3E90D619B electric_bike 2022-06-30 18:39:07 2022-06-30 19:05:02
8: C037F5F4107788DE electric_bike 2022-06-30 12:46:14 2022-06-30 14:12:48
9: C19B08D794D1C89E electric_bike 2022-06-30 11:09:38 2022-06-30 11:10:25
10: 6E9E3A041C14E960 electric_bike 2022-06-30 11:05:46 2022-06-30 11:09:11
start_station_name start_station_id end_station_name end_station_id
<char> <char> <char> <char>
1:
2:
3:
4:
5:
6:
7:
8:
9:
10:
start_lat start_lng end_lat end_lng member_casual
<num> <num> <num> <num> <char>
1: 41.89 -87.62 41.91 -87.62 casual
2: 41.91 -87.62 41.93 -87.63 casual
3: 41.91 -87.65 41.89 -87.61 casual
4: 41.80 -87.66 41.80 -87.65 casual
5: 41.91 -87.63 41.93 -87.64 casual
6: 42.03 -87.71 42.06 -87.73 casual
7: 41.91 -87.63 41.92 -87.72 casual
8: 41.89 -87.62 41.98 -87.67 casual
9: 41.89 -87.61 41.89 -87.61 casual
10: 41.89 -87.61 41.89 -87.61 casual
ride_id rideable_type started_at ended_at
<char> <char> <POSc> <POSc>
1: 82EF1A9AC3E768C4 classic_bike 2022-06-26 20:34:59 2022-06-27 21:34:52
2: 3EA43F311B85EB40 classic_bike 2022-06-20 17:51:55 2022-06-21 18:51:48
3: E44E6CD038DBB502 classic_bike 2022-06-17 14:35:01 2022-06-18 15:34:46
4: 1AEC816C5F684C5F classic_bike 2022-06-26 20:02:29 2022-06-27 21:02:24
5: 0E03A60ABBED6676 classic_bike 2022-06-27 14:22:02 2022-06-28 15:21:57
6: F4EAB98F4DD26581 docked_bike 2022-06-28 04:49:29 2022-07-01 04:54:03
7: A8120081E3A91F03 classic_bike 2022-06-02 17:36:30 2022-06-03 18:36:24
8: D5F0401AB215354D docked_bike 2022-06-17 14:36:05 2022-06-18 15:36:08
9: E3A85FBE4884FAC8 classic_bike 2022-06-22 17:10:17 2022-06-23 18:09:54
10: 020942C78999F631 classic_bike 2022-06-16 17:40:35 2022-06-17 18:40:30
start_station_name start_station_id end_station_name
<char> <char> <char>
1: Clarendon Ave & Junior Ter 13389
2: Clarendon Ave & Junior Ter 13389
3: Blue Island Ave & 18th St 13135
4: Cicero Ave & Quincy St 16913
5: Lincoln Ave & Waveland Ave 13253
6: Blue Island Ave & 18th St 13135
7: Clarendon Ave & Junior Ter 13389
8: Blue Island Ave & 18th St 13135
9: Clark St & Randolph St TA1305000030
10: Sheffield Ave & Kingsbury St 13154
end_station_id start_lat start_lng end_lat end_lng member_casual
<char> <num> <num> <num> <num> <char>
1: 41.96100 -87.64960 NA NA casual
2: 41.96100 -87.64960 NA NA casual
3: 41.85756 -87.66154 NA NA casual
4: 41.87761 -87.74541 NA NA casual
5: 41.94880 -87.67528 NA NA member
6: 41.85756 -87.66154 NA NA casual
7: 41.96100 -87.64960 NA NA member
8: 41.85756 -87.66154 NA NA casual
9: 41.88458 -87.63189 NA NA member
10: 41.91052 -87.65311 NA NA casual
Checking this info, we can notice that there are some rows where we don’t have the name of the starting or ending stations, nor their respective ids. In some cases we don’t even have the reported coordinates in the end_lat and end_lng columns. If we change the value of k above, we can observe the same situation for other months.
The first column in the tables is called ride_id; I assume that this is the key of the information contained in each table, meaning that it’s a unique identifier for each observation (row) in the data frame. Let's check the tables through sapply to see if there are any duplicates:
sapply(br_data, \(DT) paste(DT[duplicated(ride_id), .(ride_id)], collapse = ", "))
oct_2021_rides nov_2021_rides dic_2021_rides ene_2022_rides feb_2022_rides
"character(0)" "character(0)" "character(0)" "character(0)" "character(0)"
mar_2022_rides abr_2022_rides may_2022_rides jun_2022_rides jul_2022_rides
"character(0)" "character(0)" "character(0)" "character(0)" "character(0)"
ago_2022_rides sept_2022_rides
"character(0)" "character(0)"
Getting character(0) for every table is a good sign that there are no repeated values in the ride_id column for any of them.
Let’s check, for the same period (jun_2022_rides), what the situation is in terms of missing information. For checking that relevant info, we’ll first use a function from the naniar library (Tierney and Cook 2023), which shows a summary with the number of missing observations per group of variables (n_miss_in_var) and how many variables have that number of NAs (n_vars):
miss_table <- miss_var_table(br_data[[k]])
miss_table
# A tibble: 2 × 3
n_miss_in_var n_vars pct_vars
<int> <int> <dbl>
1 0 11 84.6
2 1055 2 15.4
miss_summa <- miss_var_summary(br_data[[k]])
miss_summa
# A tibble: 13 × 3
variable n_miss pct_miss
<chr> <int> <num>
1 end_lat 1055 0.137
2 end_lng 1055 0.137
3 ride_id 0 0
4 rideable_type 0 0
5 started_at 0 0
6 ended_at 0 0
7 start_station_name 0 0
8 start_station_id 0 0
9 end_station_name 0 0
10 end_station_id 0 0
11 start_lat 0 0
12 start_lng 0 0
13 member_casual 0 0
In this case, jun_2022_rides, the first function lets us see that we have 2 columns (variables) with 1055 rows (observations) missing information (NA); and in the second table we can see that those columns are end_lat and end_lng.
Previously, when we checked the head and tail of this jun_2022_rides table, we observed that 4 columns of class character were also missing information. Therefore, in this df there are at least 2 different ways of representing the missing information: NA and empty strings (""). We can check that with the following function:
cuenta_miss_dt <- function(DT) {
# Function to count NAs and empty strings ("") per variable (column)
# This function takes a data table as input
# Get the variable class
classes <- t(DT[, sapply(.SD, class), .SDcols = clmn_names])
# Count NAs
n_NA <- data.matrix(DT[, sapply(.SD, \(col) sum(is.na(col))), .SDcols = clmn_names])
# Count empty strings
n_empty <- data.matrix(DT[, sapply(.SD, \(col) sum(col %in% "")), .SDcols = clmn_names])
# Format the output
salida <- as.data.table(cbind(
as_tibble(clmn_names),
classes, n_NA, n_empty
))[n_empty > 0 | n_NA > 0, -3]
setnames(salida, c(1:2), c("variable", "class"))
# The function returns a data.table
salida
}
missing_obs <- cuenta_miss_dt(br_data[[k]])
missing_obs
variable class n_NA n_empty
<char> <char> <int> <int>
1: start_station_name character 0 92944
2: start_station_id character 0 92944
3: end_station_name character 0 100152
4: end_station_id character 0 100152
5: end_lat numeric 1055 0
6: end_lng numeric 1055 0
Now we can clearly see that for the numeric variables (end_lat, end_lng) they used NA to represent the missing values, and for the character variables they used empty strings (""). To facilitate working with the data, we are going to transform all the empty strings into NAs and create some columns to store the information about the missing values.
Reminder: By changing the value of the
variable k in section 3.3 you can
check this same info for other periods, instead of running them all at
once, since that will take a long time to finish.
In the last section we found that there are two different values to represent the missing information in these tables: NA for the numeric variables and empty strings ("") for the character variables. In this part we need to unify the representation of those missing values; for that purpose, and in order to use some functions that deal with NAs, we are going to represent the missing values only with NAs.
Before proceeding to make any changes in the tables, it is a good idea to save a copy, just in case we need to go back to the originals in the future; that way we will avoid having to download them again. To accomplish this, we can take advantage of the native R format .rds to save a list with a copy of each table:
save_tables <- function(a_list, name) {
# Function to save a list with a copy of each period data table
# It takes a list with the DTs and a name for the .rds file
# Save the list in the input folder
if (dir(".")[1] == "1_Load_libs") {
write_rds(a_list, paste0("../input/", name, ".rds"))
} else {
write_rds(a_list, paste0("../../input/", name, ".rds"))
}
}
save_tables(br_data, "bikeRidesOriginalTables")
From the last function in the previous section, we can take the variable names that have empty strings (the names of the columns) and the number of observations missing that information for the data in jun_2022_rides:
variable n_empty
<char> <int>
1: start_station_name 92944
2: start_station_id 92944
3: end_station_name 100152
4: end_station_id 100152
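The chunk that produced the table above is not shown; it is most likely just a subset of the missing_obs table from the previous section, along these lines:
# Sketch (assumed): keep only the variables that contain empty strings
missing_obs[n_empty > 0, .(variable, n_empty)]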
To change those empty strings into NAs, we can use the
na_if function from the dplyr package (Wickham et al. 2023). Let’s check how it works
for the tail of the jun_2022_rides table that we checked before:
br_data[[k]] %>%
arrange(end_lat) %>%
tail(10) %>%
apply(2,na_if,y="") %>%
as.data.frame
ride_id rideable_type started_at ended_at
1 82EF1A9AC3E768C4 classic_bike 2022-06-26 20:34:59 2022-06-27 21:34:52
2 3EA43F311B85EB40 classic_bike 2022-06-20 17:51:55 2022-06-21 18:51:48
3 E44E6CD038DBB502 classic_bike 2022-06-17 14:35:01 2022-06-18 15:34:46
4 1AEC816C5F684C5F classic_bike 2022-06-26 20:02:29 2022-06-27 21:02:24
5 0E03A60ABBED6676 classic_bike 2022-06-27 14:22:02 2022-06-28 15:21:57
6 F4EAB98F4DD26581 docked_bike 2022-06-28 04:49:29 2022-07-01 04:54:03
7 A8120081E3A91F03 classic_bike 2022-06-02 17:36:30 2022-06-03 18:36:24
8 D5F0401AB215354D docked_bike 2022-06-17 14:36:05 2022-06-18 15:36:08
9 E3A85FBE4884FAC8 classic_bike 2022-06-22 17:10:17 2022-06-23 18:09:54
10 020942C78999F631 classic_bike 2022-06-16 17:40:35 2022-06-17 18:40:30
start_station_name start_station_id end_station_name
1 Clarendon Ave & Junior Ter 13389 <NA>
2 Clarendon Ave & Junior Ter 13389 <NA>
3 Blue Island Ave & 18th St 13135 <NA>
4 Cicero Ave & Quincy St 16913 <NA>
5 Lincoln Ave & Waveland Ave 13253 <NA>
6 Blue Island Ave & 18th St 13135 <NA>
7 Clarendon Ave & Junior Ter 13389 <NA>
8 Blue Island Ave & 18th St 13135 <NA>
9 Clark St & Randolph St TA1305000030 <NA>
10 Sheffield Ave & Kingsbury St 13154 <NA>
end_station_id start_lat start_lng end_lat end_lng member_casual
1 <NA> 41.96100 -87.64960 <NA> <NA> casual
2 <NA> 41.96100 -87.64960 <NA> <NA> casual
3 <NA> 41.85756 -87.66154 <NA> <NA> casual
4 <NA> 41.87761 -87.74541 <NA> <NA> casual
5 <NA> 41.94880 -87.67528 <NA> <NA> member
6 <NA> 41.85756 -87.66154 <NA> <NA> casual
7 <NA> 41.96100 -87.64960 <NA> <NA> member
8 <NA> 41.85756 -87.66154 <NA> <NA> casual
9 <NA> 41.88458 -87.63189 <NA> <NA> member
10 <NA> 41.91052 -87.65311 <NA> <NA> casual
Now all the missing information is represented by NAs across all the
columns in this example. However, it takes a long time for the function
to change the whole data frame and it changes all the columns to type:
char. Since we only need to change char
columns, we can subset those columns to start with, using the package
data.table (Barrett et al.
2025), and that will also help us speed up the process.
By using the vector periods with the names of the tables, we can create the names for the new tables; to do so, I’ll just add '_sm' to the end of the same names, e.g., oct_2021_rides will become oct_2021_rides_sm. Then we will define a function to perform the changes and display messages to show the progress, and finally we’ll apply the function and assign the new names to the new list, br_dt_sm (Bike Rides DaTa _SM):
# Declare a vector with the new names
new_per <- paste0(periods, "_sm")
change_empty <- function(a_list, i) {
# Function to change the empty strings into NA, it takes as input a list and the index
# Display a message to show the progress
cat(paste0("Working on table ", periods[[i]], ", changing empty-strings into NAs.\n"))
# Change only the character columns and,
# join the non-char cols to the subset by the 'ride_id' col
DT <- a_list[[i]][, lapply(.SD, na_if, ""),
.SDcols = is.character
][a_list[[i]][,
.(ride_id, .SD),
.SDcols = !is.character
], on = .(ride_id)]
# Correct the col names that were changed by the join
setnames(DT, 8:13, clmn_names[c(3, 4, 9:12)])
# Display another message to show the progress
cat(paste0("The table ", new_per[i], " has been completed!\n"))
DT
}
# Apply the function
br_dt_sm <- lapply(seq_along(br_data), \(i) change_empty(br_data, i))
Working on table oct_2021_rides, changing empty-strings into NAs.
The table oct_2021_rides_sm has been completed!
Working on table nov_2021_rides, changing empty-strings into NAs.
The table nov_2021_rides_sm has been completed!
Working on table dic_2021_rides, changing empty-strings into NAs.
The table dic_2021_rides_sm has been completed!
Working on table ene_2022_rides, changing empty-strings into NAs.
The table ene_2022_rides_sm has been completed!
Working on table feb_2022_rides, changing empty-strings into NAs.
The table feb_2022_rides_sm has been completed!
Working on table mar_2022_rides, changing empty-strings into NAs.
The table mar_2022_rides_sm has been completed!
Working on table abr_2022_rides, changing empty-strings into NAs.
The table abr_2022_rides_sm has been completed!
Working on table may_2022_rides, changing empty-strings into NAs.
The table may_2022_rides_sm has been completed!
Working on table jun_2022_rides, changing empty-strings into NAs.
The table jun_2022_rides_sm has been completed!
Working on table jul_2022_rides, changing empty-strings into NAs.
The table jul_2022_rides_sm has been completed!
Working on table ago_2022_rides, changing empty-strings into NAs.
The table ago_2022_rides_sm has been completed!
Working on table sept_2022_rides, changing empty-strings into NAs.
The table sept_2022_rides_sm has been completed!
# Assign the new names for the DTs
names(br_dt_sm) <- new_per
It’s important to note that this operation puts all the char columns together at the beginning of the table (on its left side), and the non-character ones at the end.
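If we ever wanted to restore the original column order, a quick option (sketch) would be data.table::setcolorder() together with the clmn_names vector we saved earlier:
# Optional (sketch): the original column order could be restored by reference
# with setcolorder() and 'clmn_names'; not run here, since the column order
# does not matter for the next steps
# for (DT in br_dt_sm) setcolorder(DT, clmn_names)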
Let's check the missing-variables summary again for the new jun_2022_rides_sm table:
miss_var_summary(br_dt_sm[[k]]) %>% filter(n_miss > 0)
# A tibble: 6 × 3
variable n_miss pct_miss
<chr> <int> <num>
1 end_station_name 100152 13.0
2 end_station_id 100152 13.0
3 start_station_name 92944 12.1
4 start_station_id 92944 12.1
5 end_lat 1055 0.137
6 end_lng 1055 0.137
As we can see, now the summary function is able to get all the missing observations from all columns, and the values reported are the exact same numbers we had before making these changes.
Later on we are going to make some changes to see if we can fill some
of the missing information. To be able to keep track of the changes it
is a good idea to add some columns that let us know where the missing
values are. Those new columns, as a set, constitute what is called a
‘shadow matrix’ (that’s the reason for the _sm appendage on
the table names). To do this, we can use the functions
nabular() and add_label_shadow() from the
naniar package (Tierney and Cook
2023).
add_shadow <- function(i) {
# Function to add a shadow matrix, it takes an index as input
# Display a message to show the progress
cat("Working on table ", names(br_dt_sm)[[i]], ". ")
# Add the shadow matrix and convert the df as a data.table
DT <- as.data.table(nabular(br_dt_sm[[i]], only_miss = T) %>% add_label_shadow())
# Display the finished message for the DT to show the progress
cat("Shadow matrix attached!\n")
DT
}
br_dt_sm <- lapply(seq_along(br_dt_sm), \(i) add_shadow(i))
# Keep the '_sm' names on the new list (lapply over the indexes drops them)
names(br_dt_sm) <- new_per
Working on table oct_2021_rides_sm . Shadow matrix attached!
Working on table nov_2021_rides_sm . Shadow matrix attached!
Working on table dic_2021_rides_sm . Shadow matrix attached!
Working on table ene_2022_rides_sm . Shadow matrix attached!
Working on table feb_2022_rides_sm . Shadow matrix attached!
Working on table mar_2022_rides_sm . Shadow matrix attached!
Working on table abr_2022_rides_sm . Shadow matrix attached!
Working on table may_2022_rides_sm . Shadow matrix attached!
Working on table jun_2022_rides_sm . Shadow matrix attached!
Working on table jul_2022_rides_sm . Shadow matrix attached!
Working on table ago_2022_rides_sm . Shadow matrix attached!
Working on table sept_2022_rides_sm . Shadow matrix attached!
Let’s check the dimensions of the jun_2022_rides_sm table:
[1] 769204 20
Now we have 20 variables instead of the original 13. Checking the new table, one can see that the columns that were missing information now have a ‘shadow’ column with the same name ending in “_NA”; so end_lat, for example, has a shadow column named end_lat_NA. Within those new columns we see NA for the observations that were missing information in the original column, and !NA otherwise. Lastly, we have a column named any_missing in which we have the value ‘Missing’ for the rows with at least one NA, and ‘Not Missing’ if there are observations for all the variables in that row (no NAs in that row).
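A quick way to see those new columns (sketch):
# Sketch: list the shadow columns that were appended to the period table
grep("_NA$|^any_missing$", names(br_dt_sm[[k]]), value = TRUE)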
From now on we are going to be working with these
new_per tables. Keeping in mind that we already saved a
copy with the original tables, we can remove all the
periods tables, and the vector itself, from the Global
environment.
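That cleanup chunk is not shown; it boils down to something like this (sketch, assuming the original tables are still in br_data and the original names in periods):
# Sketch of the cleanup: the original tables were saved to .rds, so we can
# drop them, together with the vector of original names
rm(br_data, periods)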
Since we started working with these data tables, we have been producing new information about them while doing the different processes. So we can start collecting some general info about the different periods: like the total number of observations, and missing information for the different variables in each period. In the first part of this section we’ll do this. Then we will correct some information that is wrong in the data frames. After that, we will start the process to build the lookup tables, and check formulas; to, finally, complete the missing data that might be completed in each period.
It is important to keep the information produced organized in a single table. For that purpose, we need to create a summary table to collect those observations about the data in the different periods. Since we have the respective dates in the yearMonths dates vector, we can use those dates in the first column to reference each month. It is also essential to know how many observations are available for each one, and the total number of rows represents exactly that, so let’s get those values. I’ll call the column for the total number of observations total_count:
period total_count
<char> <int>
1: oct. 2021 631226
2: nov. 2021 359978
3: dic. 2021 247540
4: ene. 2022 103770
5: feb. 2022 115609
6: mar. 2022 284042
7: abr. 2022 371249
8: may. 2022 634858
9: jun. 2022 769204
10: jul. 2022 823488
11: ago. 2022 785932
12: sept. 2022 701339
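The chunk that builds this summary table is not echoed; a minimal sketch, assuming it is called summ_table (the name used by the graph chunk below) with a Date column month, a formatted period label and the row counts:
# Sketch (assumed): one row per period, with its start date, a label and the
# total number of observations in that period's table
summ_table <- data.table(
  month = yearMonths[1:a],
  period = format(yearMonths[1:a], "%b %Y"),
  total_count = sapply(br_dt_sm, nrow)
)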
Just out of curiosity, let’s check how that looks in a graph:
# First, let's create a theme we can use in the graphs
tema <- theme(
plot.title = element_text(size = 18), plot.subtitle = element_text(size = 14),
axis.title = element_text(size = 14), axis.text = element_text(size = 12),
legend.position = "bottom", legend.title = element_text(size = 14),
legend.text = element_text(size = 12), legend.key.size = unit(0.8, "cm")
)
# Now we can produce the graph
ggplot(summ_table) +
geom_point(
mapping = aes(x = month, y = total_count / 1000),
color = "olivedrab", size = 3
) +
geom_smooth(
method = loess, formula = "y ~ x",
mapping = aes(x = month, y = total_count / 1000), color = "navy", se = FALSE
) +
labs(x = "Period", y = "Number of rides (in thousands)") +
scale_x_date(date_breaks = "2 months", date_labels = "%b %Y") +
scale_y_continuous(breaks = seq(0, 800, by = 100)) +
tema +
ggtitle("Total number of rides per month")
We can clearly see that there must be a yearly seasonality in this data; an important point to keep in mind for the analysis phase.
We can also include some information about the missing values. When we checked that information before, we noticed that the missing values came in pairs, in the sense that if we were missing the station name, start or end, we were also missing the respective station id, and if we were missing one coordinate we were also missing the other. That will help us synthesize the names of the new columns for the summary table. First we’ll set up a function to get the missing values for the columns of one table, and then we can run it in a loop to get the values for all the periods.
set_miss_cols <- function(i) {
# Function to get the missing values in a period
miss_vals <- br_dt_sm[[i]][, lapply(.SD, n_miss),
.SDcols = c("start_station_id", "end_station_id", "start_lat", "end_lat")]
names(miss_vals) <- c("miss_start_stn", "miss_end_stn", "miss_start_coord", "miss_end_coord")
miss_vals
}
set_miss_cols(k)
miss_start_stn miss_end_stn miss_start_coord miss_end_coord
<int> <int> <int> <int>
1: 92944 100152 0 1055
The result is a table with just one row for that period. Now we can run the function for all the periods, get the same observations for every month, and bind them in a single table. Finally, once we get that info, we can bind it to the summary table:
period total_count miss_start_stn miss_end_stn miss_start_coord
<char> <int> <int> <int> <int>
1: 2021-oct. 631226 108210 114834 0
2: 2021-nov. 359978 75290 79187 0
3: 2021-dic. 247540 51063 53498 0
4: 2022-ene. 103770 16260 17927 0
5: 2022-feb. 115609 18580 20355 0
6: 2022-mar. 284042 47246 51157 0
7: 2022-abr. 371249 70887 75288 0
8: 2022-may. 634858 86704 93171 0
9: 2022-jun. 769204 92944 100152 0
10: 2022-jul. 823488 112031 120951 0
11: 2022-ago. 785932 112037 120522 0
12: 2022-sept. 701339 103780 111185 0
miss_end_coord
<int>
1: 484
2: 191
3: 144
4: 86
5: 77
6: 266
7: 317
8: 722
9: 1055
10: 947
11: 843
12: 712
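That chunk is not echoed either; it would be along these lines (sketch, reusing the names introduced above):
# Sketch (assumed): one row of missing counts per period, stacked with
# rbindlist() and bound to the summary table
miss_cols <- rbindlist(lapply(seq_along(br_dt_sm), set_miss_cols))
summ_table <- cbind(summ_table, miss_cols)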
In the case of the starting stations, I would think it is possible to fill in all the missing info by finding the closest station to the reported coordinates. In the other case, for the ending stations, we first need to check whether the observations that are missing the station info are also missing the coordinates or not; let’s see:
The following list shows the number of observations per period that miss end-station information and end coordinates at the same time:
oct_2021 - 484
nov_2021 - 191
dic_2021 - 144
ene_2022 - 86
feb_2022 - 77
mar_2022 - 266
abr_2022 - 317
may_2022 - 722
jun_2022 - 1055
jul_2022 - 947
ago_2022 - 843
sept_202 - 712
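The counting chunk is hidden; counting those rows could look like this (sketch):
# Sketch (assumed): rows per period missing both the end-station info and the
# end coordinates
sapply(br_dt_sm, \(DT) DT[is.na(end_station_name) & is.na(end_lat), .N])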
We can clearly see that those are the same numbers of observations we have in the last column (miss_end_coord) of the previous table. This means that we won’t be able to complete any values for those observations. Luckily, those are really few compared with the size of each set:
period total_count miss_end_coord incompletable %
<char> <int> <int> <num>
1: 2021-oct. 631226 484 0.0767
2: 2021-nov. 359978 191 0.0531
3: 2021-dic. 247540 144 0.0582
4: 2022-ene. 103770 86 0.0829
5: 2022-feb. 115609 77 0.0666
6: 2022-mar. 284042 266 0.0936
7: 2022-abr. 371249 317 0.0854
8: 2022-may. 634858 722 0.1137
9: 2022-jun. 769204 1055 0.1372
10: 2022-jul. 823488 947 0.1150
11: 2022-ago. 785932 843 0.1073
12: 2022-sept. 701339 712 0.1015
As we can see, in the worst case (jun. 2022) we will still have 99.86% of the observations, which is awesome! That is, of course, if we are able to complete some of the missing observations based on the information reported.
For the next parts we will continue producing new information about the data, but now we have the summary table we just created to keep storing that new info.
In section 3.4 we checked that the “key” column ride_id contained unique values. Apart from that column, we now need to make sure that the station_name and station_id columns are consistent across all tables and have a univocal correspondence (a one-to-one relationship). While doing the process to complete the missing information for each period, I found out that there are some inconsistencies in the data contained in those two columns of the tables.
To start, there are ids that end in “.0”, so we need to take those off; but there is the exceptional case of station “Pulaski Rd & 21st St”, which has id “331.0”. Since the station “Halsted St & Clybourn Ave” has id “331”, if I just remove the “.0” both will end up with the same id. In this particular case, instead of removing the “.0”, I’m going to set the id to “21331”, because this is the id that appears in the stations information set that I’ll download in section 5.3.1, as we will see.
The next two particular cases have to do with codes “TA1306000029” and “13099”, since these correspond to two different stations in some sets. Let’s check the dic_2021_rides_sm set:
station_name station_id
<char> <char>
1: Halsted St & 18th St 13099
2: Halsted St & 18th St (Temp) 13099
3: DuSable Lake Shore Dr & Ohio St TA1306000029
4: McClurg Ct & Ohio St TA1306000029
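The chunk for that table is hidden; it can be reproduced with a subset like the following (sketch), looking at both the start and end columns of the December 2021 table:
# Sketch (assumed): unique name/id pairs for the two conflicting codes
codes <- c("13099", "TA1306000029")
unique(rbind(
  br_dt_sm[[3]][start_station_id %in% codes,
                .(station_name = start_station_name, station_id = start_station_id)],
  br_dt_sm[[3]][end_station_id %in% codes,
                .(station_name = end_station_name, station_id = end_station_id)]
))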
By checking the names related to code “13099”, one can see that it refers to the same station; in this case I think it is safe to leave the code as it is, because in both cases we can assign the same coordinates.
Checking the information about code “TA1306000029”, I found that in
the first two periods (October and November, 2021), this code
corresponds only to station “DuSable Lake Shore Dr & Ohio St”.
Reviewing the Info_Station and some 2022 periods sets, that
code corresponds only to station “McClurg Ct & Ohio St”. Given that
this last station_name is shared in all the sets except for
the first two, I think the best option is to remove the code, assigning
NA instead, to station “DuSable Lake Shore Dr & Ohio St” in the
periods in which it appears.
Another two particular cases are the stations with station_id “1033” and “1038”, whose names contain obvious typos in some sets; let's check the sept_2022_rides_sm table, which has both versions of each name (the correct and the incorrect):
start_station_name start_station_id
<char> <char>
1: Public Rack - Pulaski Rd & 65th St 1033
2: Public Rack - Pulaski Rd & 65th St 1033
3: Public Rack - Kedzie Ave & 62nd Pl 1038
4: Public Rack - Kedzie Ave & 62nd Pl 1038
Now we need to check for the opposite case to the one in the previous section: stations that have the same name but different station_id. To be able to do this, we first need to get all the unique combinations of names and ids contained in each table; this function will help us with that:
list_stations <- function(i) {
# The function will take the index of the df as input to get the data table (DT)
DT <- as.data.table(br_dt_sm[[i]])
# Get the names and ids from all the reported stations in that period
stations <- rbind(
na.omit(DT[,
.(
station_name = start_station_name,
station_id = start_station_id
)
], cols = "station_name"),
na.omit(DT[,
.(
station_name = end_station_name,
station_id = end_station_id
)
], cols = "station_name")
)
# Remove repeated rows
stations <- unique(stations, by = c("station_name", "station_id"))
# The function will return a df with the stations reported each period
stations
}
Now we have a function that creates a table with all the station names and ids reported each month. Based on the result of this function, we can get the names that correspond to more than one id.
We have 16 pairs of station names with two different station_id each. To decide which id we should keep, we can check which one is repeated the most across all the periods' sets:
get_rep_sttns <- function(a_list) {
# Takes a list as input
# Create a list of repeated stations with different ids per period
obtain_duplis <- function(i) {
stations <- list_stations(i)
reps <- stations[duplicated(station_name), station_name]
duplis <- stations[station_name %in% reps, ]
duplis[station_name %in% reps, "count" := rep(0)]
}
duplis <- lapply(seq_along(a_list), \(i) obtain_duplis(i))
# Combine all repeated data and count which ids appear in which periods
repeated <- as.data.table(bind_rows(duplis))
repeated <- unique(repeated)
for (i in 1:length(a_list)) {
repeated[, count := count + as.numeric(repeated$station_id %in% list_stations(i)$station_id)]
}
repeated[order(station_name), ]
}
namids_reps <- get_rep_sttns(br_dt_sm)
namids_reps
station_name station_id count
<char> <char> <num>
1: Bradley Park 20227 9
2: Bradley Park 633 1
3: California Ave & Cortez St 17660 12
4: California Ave & Cortez St 512 5
5: Calumet Ave & 51st St 813 5
6: Calumet Ave & 51st St 15470 12
7: Calumet Ave & 71st St 728 2
8: Calumet Ave & 71st St 15599 10
9: Central Park Ave & Ogden Ave 15685 12
10: Central Park Ave & Ogden Ave 532 5
11: Christiana Ave & Lawrence Ave 15615 12
12: Christiana Ave & Lawrence Ave 860 5
13: East End Ave & 87th St 20231 12
14: East End Ave & 87th St 883 3
15: Eggleston Ave & 92nd St 20118 12
16: Eggleston Ave & 92nd St 707 2
17: Halsted St & 111th St 20127 12
18: Halsted St & 111th St 875 5
19: Jeffery Blvd & 67th St KA1503000030 12
20: Jeffery Blvd & 67th St 753 5
21: Kostner Ave & Lake St 536 12
22: Kostner Ave & Lake St 851 4
23: Lake Park Ave & 47th St TA1308000035 12
24: Lake Park Ave & 47th St 812 5
25: Lakefront Trail & Bryn Mawr Ave KA1504000152 5
26: Lakefront Trail & Bryn Mawr Ave 15576 8
27: Lawndale Ave & 111th St 893 3
28: Lawndale Ave & 111th St 20203 12
29: Prairie Ave & Garfield Blvd TA1307000160 12
30: Prairie Ave & Garfield Blvd 906 4
31: Wabash Ave & 87th St 595 12
32: Wabash Ave & 87th St 644 12
station_name station_id count
The column count gives us the total number of periods in which each id shows up. With this information it is easy to decide which id to keep: the one that is repeated the most; except for the last case, since both ids show up in all the periods. In that case, again checking the stations information set that I’ll download in section 5.3.1, the id “595” is the one that appears, so that’s the one we’ll keep. It’s worth mentioning that all the ids that are repeated the most across the periods are also the ones contained in that same set that we will download, as we'll see in that section.
Having decided what we need to do, let's rearrange the information so we can use it in the next part:
station_name keep_id discard_id
<char> <char> <char>
1: Bradley Park 20227 633
2: California Ave & Cortez St 17660 512
3: Calumet Ave & 51st St 15470 813
4: Calumet Ave & 71st St 15599 728
5: Central Park Ave & Ogden Ave 15685 532
6: Christiana Ave & Lawrence Ave 15615 860
7: East End Ave & 87th St 20231 883
8: Eggleston Ave & 92nd St 20118 707
9: Halsted St & 111th St 20127 875
10: Jeffery Blvd & 67th St KA1503000030 753
11: Kostner Ave & Lake St 536 851
12: Lake Park Ave & 47th St TA1308000035 812
13: Lakefront Trail & Bryn Mawr Ave 15576 KA1504000152
14: Lawndale Ave & 111th St 20203 893
15: Prairie Ave & Garfield Blvd TA1307000160 906
16: Wabash Ave & 87th St 595 644
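The rearranging chunk itself is not echoed; a minimal sketch that would produce a table like the one above, assuming it is stored as order_reps (the name used later by fix_tables()):
# Sketch (assumed): for each duplicated name keep the most frequent id and
# record the other one as the id to discard
order_reps <- namids_reps[order(station_name, -count),
  .(keep_id = station_id[1], discard_id = station_id[2]), by = station_name]
# Manual exception discussed above: both ids of "Wabash Ave & 87th St" appear
# in all periods, so we keep "595" (the one in the stations information set)
order_reps[station_name == "Wabash Ave & 87th St", `:=`(keep_id = "595", discard_id = "644")]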
One last error that’s in the sept_2022_rides_sm names set is a laughable typo in the “Public Racks”:
station_name station_id
<char> <char>
1: Pubic Rack - Pulaski Rd & 41st 835
Apart from the errors mentioned, there are some very fancy descriptive names that are not very practical for the analysis, e.g., taken from the jun_2022_rides_sm set:
start_station_name start_station_id
<char> <char>
1: 63rd & Western Ave - north corner 739
2: Ashland Ave & 45th St - midblock south 845
3: 63rd & Western Ave - south corner 738
In these cases, I’m just going to replace the excessive wording with the initial of the corresponding cardinal point, i.e., “- north corner” will become “N”. Apart from these 3 cases, there are some more station names that include descriptors like “- midblock” or the complete name of the cardinal point; in the first case I decided to erase the string, and in the other cases, which are consistent with the information that we will download from the site, I decided to leave them as they are.
Finally, there are some cases that will need special treatment. I am referring to the warehouses that show up in the list of stations; let's check the information for jun_2022_rides_sm:
# A tibble: 6 × 4
# Groups: start_station_name [6]
start_station_name start_station_id mean_lng mean_lat
<chr> <chr> <dbl> <dbl>
1 Base - 2132 W Hubbard Hubbard Bike-checking (LBS-… -87.7 41.9
2 Base - 2132 W Hubbard Warehouse Hubbard Bike-checking (LBS-… -87.7 41.9
3 MTV Hubbard St 021320 -87.7 41.9
4 NewHastings 2059 Hastings Warehouse Sta… -87.7 41.9
5 WEST CHI-WATSON DIVVY 001 - Warehouse test … -87.7 41.9
6 WestChi DIVVY 001 - Warehouse test … -87.8 41.9
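The chunk that produced this table is hidden; it likely boils down to picking out the warehouse-like names by hand and summarising their mean reported coordinates, roughly like this (sketch; the object name wh_dt_mean is the one used for the map below):
# Sketch (assumed): mean reported coordinates of the warehouse-like "stations"
wh_names <- c("Base - 2132 W Hubbard", "Base - 2132 W Hubbard Warehouse",
              "MTV Hubbard St", "NewHastings", "WEST CHI-WATSON", "WestChi")
wh_dt_mean <- br_dt_sm[[k]] %>%
  filter(start_station_name %in% wh_names) %>%
  group_by(start_station_name, start_station_id) %>%
  summarise(mean_lng = mean(start_lng), mean_lat = mean(start_lat)) %>%
  as.data.table()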
We can see that in almost all these cases the name or the id makes reference to the fact that these are warehouses or repair stations. The only one that may seem confusing is “MTV Hubbard St”, which has a 6-digit number as its id; but if we read those digits without the zeros and put them together with the name, we get the same address that appears for the “Base” cases (2132 W Hubbard St). Also, if we search for “MTV Hubbard St” on the internet, we’ll find the Divvy page with that same address (“Divvy in Chicago, IL 60612 - — Chamberofcommerce.com”).
Let's check on the map the case of the warehouse identified with the name “WestChi” in the jun_2022_rides_sm table. We’ll need a function to generate the map; we can write a very simple one for now, using the leaflet package (Cheng, Schloerke, et al. 2024):
make_map2 <- function(label1,lng1,lat1,label2,lng2,lat2){
mapa <- leaflet() %>% addProviderTiles(providers$OpenStreetMap) %>%
addCircleMarkers(lng = lng1, lat = lat1,
color = "darkgreen", radius = 5,
label = as.character(label1)) %>%
addCircleMarkers(lng = lng2, lat = lat2,
color = "darkred", radius = 5,
label = as.character(label2))
mapa
}
Now we are able to check the points that show up using “WestChi” as
the start_station_name:
westchi_map <-
make_map2(br_dt_sm[[9]][str_detect(start_station_name,"WestChi"), start_station_name],
br_dt_sm[[9]][str_detect(start_station_name,"WestChi"), start_lng],
br_dt_sm[[9]][str_detect(start_station_name,"WestChi"), start_lat],
"Wrong point (mean coordinates of the 3 WestChi points)",
wh_dt_mean[start_station_name == "WestChi", mean_lng],
wh_dt_mean[start_station_name == "WestChi", mean_lat]
)
westchi_map
If you zoom in on any point, you’ll discover a different station for every case. I suppose that maybe the bikes were taken from those stations to the warehouse for repairs (or some similar situation). That being the case, and since that “station” name doesn’t appear in the names set we will download, we won’t have a way to find its coordinates. Plus, taking the mean or any other measure based on that info (as in the table presented before) will give us an erroneous point (the red circle in the map).
Apart from the “Base” address, we also have “2059 Hastings St” for the “NewHastings” warehouse. With the help of online maps we can easily find the coordinates for those two points, shown in the next map:
Assuming that, for the analysis, the rides related to those not-quite “stations” will need special treatment, I think it’s safe to use the same coordinates for most of them (except “NewHastings”), and even set a common id for all of them: “warehouse”. That way we know that those rides are going to constitute a particular group.
Having resolved that case, we can set the information for those “stations” as follows:
station_name station_id longitude latitude
<char> <char> <num> <num>
1: Base - 2132 W Hubbard warehouse -87.68028 41.88986
2: Base - 2132 W Hubbard Warehouse warehouse -87.68028 41.88986
3: DIVVY CASSETTE REPAIR MOBILE STATION warehouse -87.68028 41.88986
4: MTV Hubbard St warehouse -87.68028 41.88986
5: NewHastings warehouse -87.67984 41.86344
6: WEST CHI-WATSON warehouse -87.68028 41.88986
7: WestChi warehouse -87.68028 41.88986
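The chunk that builds this lookup is hidden; it is essentially a small hand-written table (sketch, using the name warehouses, since that is the object fix_tables() relies on later):
# Sketch (assumed): lookup table for the warehouse "stations", with a common
# id and the coordinates discussed above
warehouses <- data.table(
  station_name = c("Base - 2132 W Hubbard", "Base - 2132 W Hubbard Warehouse",
                   "DIVVY CASSETTE REPAIR MOBILE STATION", "MTV Hubbard St",
                   "NewHastings", "WEST CHI-WATSON", "WestChi"),
  station_id = "warehouse",
  longitude = c(-87.68028, -87.68028, -87.68028, -87.68028, -87.67984, -87.68028, -87.68028),
  latitude = c(41.88986, 41.88986, 41.88986, 41.88986, 41.86344, 41.88986, 41.88986)
)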
The next step would be to clean up all the inconsistencies discovered and take care of the special cases in the tables for each period. To accomplish this, we need to define a function that helps us with that and then run it for all the periods. Let’s define the function:
fix_tables <- function(i) {
# The function will take the DT index as input
DT <- as.data.table(br_dt_sm[[i]])
# Assign the 'station_id' for the special case
if (nrow(DT[
start_station_id == "331.0" | end_station_id == "331.0",
.(start_station_id, end_station_id)
]) > 0) {
DT[start_station_id == "331.0", start_station_id := "21331"]
DT[end_station_id == "331.0", end_station_id := "21331"]
}
# Remove the ending decimal zero in the ids that have it
DT[, start_station_id := str_remove_all(DT$start_station_id, "\\.0$")][, end_station_id :=
str_remove_all(DT$end_station_id, "\\.0$")]
# Set the conflicting station id to 'NA'
if (nrow(DT[start_station_name == "DuSable Lake Shore Dr & Ohio St" |
end_station_name == "DuSable Lake Shore Dr & Ohio St",
.(start_station_name, end_station_name)
]) > 0) {
DT[start_station_name == "DuSable Lake Shore Dr & Ohio St", start_station_id := NA]
DT[end_station_name == "DuSable Lake Shore Dr & Ohio St", end_station_id := NA]
}
# Correct the typos in the names
if (nrow(DT[
start_station_name == "Public Rack - Kedzie Ave & 62nd Pl" |
end_station_name == "Public Rack - Kedzie Ave & 62nd Pl",
.(start_station_name, end_station_name)
]) > 0) {
DT[
start_station_name == "Public Rack - Kedzie Ave & 62nd Pl",
start_station_name := "Public Rack - Kedzie Ave & 62nd Pl"
]
DT[
end_station_name == "Public Rack - Kedzie Ave & 62nd Pl",
end_station_name := "Public Rack - Kedzie Ave & 62nd Pl"
]
}
if (sum(str_detect(na.omit(DT$start_station_name), "Pubic"),
str_detect(na.omit(DT$end_station_name), "Pubic")) > 0){
DT[, start_station_name := str_replace_all(DT$start_station_name, "Pubic", "Public")]
DT[, end_station_name := str_replace_all(DT$end_station_name, "Pubic", "Public")]
}
# Unify the station_id for the stations with more than one id
if (sum((DT$start_station_id %in% order_reps$discard_id),
(DT$end_station_id %in% order_reps$discard_id)) > 0) {
for (stn_id in order_reps$discard_id) {
DT[start_station_id == stn_id,
start_station_id := order_reps[discard_id == stn_id, keep_id]
][
end_station_id == stn_id,
end_station_id := order_reps[discard_id == stn_id, keep_id]
]
}
}
# Change the fancy names
if (sum(
str_detect(na.omit(DT$start_station_name), "-\\snorth\\scorner"),
str_detect(na.omit(DT$end_station_name), "-\\snorth\\scorner")
) > 0) {
DT[,
start_station_name := str_replace_all(DT$start_station_name, "-\\snorth\\scorner", "N")
][,
start_station_name := str_replace_all(DT$start_station_name, "-\\ssouth\\scorner", "S")
][,
end_station_name := str_replace_all(DT$end_station_name, "-\\ssouth\\scorner", "S")
][,
end_station_name := str_replace_all(DT$end_station_name, "-\\snorth\\scorner", "N")
]
}
if (sum(
str_detect(na.omit(DT$start_station_name), "-\\smidblock\\ssouth"),
str_detect(na.omit(DT$end_station_name), "-\\smidblock\\ssouth")
) > 0) {
DT[,
start_station_name := str_replace_all(DT$start_station_name, "-\\smidblock\\ssouth", "S")
][,
end_station_name := str_replace_all(DT$end_station_name, "-\\smidblock\\ssouth", "S")
]
}
if (sum(
str_detect(na.omit(DT$start_station_name), "^City\\sRack"),
str_detect(na.omit(DT$end_station_name), "^City\\sRack")
) > 0) {
DT[,
start_station_name := str_replace_all(DT$start_station_name, "^City\\sRack", "Public Rack")
][,
end_station_name := str_replace_all(DT$end_station_name, "^City\\sRack", "Public Rack")
]
}
if (sum(
str_detect(na.omit(DT$start_station_name), "\\s-\\smidblock$"),
str_detect(na.omit(DT$end_station_name), "\\s-\\smidblock$")
) > 0) {
DT[,
start_station_name := str_remove_all(DT$start_station_name, "\\s-\\smidblock$")
][,
end_station_name := str_remove_all(DT$end_station_name, "\\s-\\smidblock$")
]
}
# Assign the id and coordinates for the warehouses
if (sum((DT$start_station_name %in% warehouses$station_name),
(DT$end_station_name %in% warehouses$station_name)) > 0) {
for (wh_name in warehouses$station_name) {
id <- warehouses[station_name == wh_name, station_id]
lat <- warehouses[station_name == wh_name, latitude]
lng <- warehouses[station_name == wh_name, longitude]
DT[start_station_name == wh_name, `:=`(
start_station_id = id,
start_lat = lat,
start_lng = lng
)][end_station_name == wh_name, `:=`(
end_station_id = id,
end_lat = lat,
end_lng = lng
)]
}
}
# The function will correct the original table and return it
writeLines(paste("Table", names(br_dt_sm)[[i]], "has been corrected"))
DT
}
Now we can loop over all the tables to apply the function and make the corrections. This time I'll use the map function from the purrr package (Wickham and Henry 2025):
br_dt_sm <- map(br_dt_sm, fix_tables)
As happened with some of the warehouses we fixed in the previous section, not all the stations have the correct coordinates reported in the tables.
The latitude and longitude values repeat throughout, and they are truncated or rounded to 2 decimal digits; you can clearly see it in the next table, taken from the December 2021 set. Let's check the case of the station with id "379":
st_Rockwell_Archer <- rbind(
na.omit(br_dt_sm[[3]][start_station_id == "379", .(station_name = start_station_name,
station_id = start_station_id, lng = start_lng, lat = start_lat),
], cols="station_id"),
na.omit(br_dt_sm[[3]][end_station_id == "379", .(station_name = end_station_name,
station_id = end_station_id, lng = end_lng, lat = end_lat)
], cols="station_id")
)
st_Rockwell_Archer
station_name station_id lng lat
<char> <char> <num> <num>
1: Rockwell St & Archer Ave 379 -87.69 41.82
2: Rockwell St & Archer Ave 379 -87.69 41.82
3: Rockwell St & Archer Ave 379 -87.69 41.82
4: Rockwell St & Archer Ave 379 -87.69 41.82
5: Rockwell St & Archer Ave 379 -87.69 41.82
6: Rockwell St & Archer Ave 379 -87.69 41.82
7: Rockwell St & Archer Ave 379 -87.69 41.82
8: Rockwell St & Archer Ave 379 -87.69 41.82
9: Rockwell St & Archer Ave 379 -87.69 41.82
10: Rockwell St & Archer Ave 379 -87.69 41.82
11: Rockwell St & Archer Ave 379 -87.69 41.82
12: Rockwell St & Archer Ave 379 -87.69 41.82
13: Rockwell St & Archer Ave 379 -87.69 41.82
14: Rockwell St & Archer Ave 379 -87.69 41.82
15: Rockwell St & Archer Ave 379 -87.69 41.82
16: Rockwell St & Archer Ave 379 -87.69 41.82
17: Rockwell St & Archer Ave 379 -87.69 41.82
18: Rockwell St & Archer Ave 379 -87.69 41.82
Checking the area with the help of Google maps (“S Archer Ave & S Rockwell St · Chicago, IL 60632 — 41.8229955,-87.6922987,17z”) and leaflet maps (Cheng, Schloerke, et al. 2024), I was able to find the correct coordinates for the location of the station. If we put the information on the map, this is what we’ll find:
The green circle is the actual station site, the red circle is the location based on the reported coordinates. It is not very far, but it doesn't work for what we need, and this is just one of several stations with the same problem in their reported coordinates. Therefore, we need another way of finding the correct coordinates for the stations. Furthermore, we will find that this same issue repeats in the stations that are missing the name and id.
Luckily, the same website from which we got the tables, https://data.cityofchicago.org ("City of Chicago | Data Portal — Data.cityofchicago.org"), offers the stations' information in JSON format. Let's get that information:
# Import the stations information in .json format:
# sttns_inf_json <- import("https://gbfs.lyft.com/gbfs/1.1/chi/en/station_information.json")
# The info in the website has changed, I am going to use an old copy I had downloaded before
sttns_inf_json <- import("../input/Info_Stations_list_1833.rds", trust = TRUE)
# Extract the columns we need from there:
Info_Station <- as.data.table(sttns_inf_json[[1]])[, .(
station_name = stations.name,
station_id = stations.short_name,
longitude = stations.lon,
latitude = stations.lat
)]
head(Info_Station, 10)
station_name station_id longitude latitude
<char> <char> <num> <num>
1: Damen Ave & 51st St 554 -87.67468 41.80091
2: Damen Ave & Thomas St (Augusta Blvd) TA1307000070 -87.67741 41.90131
3: Lowell Ave & Armitage Ave 1524189 -87.73535 41.91708
4: Kilbourn Ave & Milwaukee Ave 16994 -87.74001 41.94707
5: Christiana Ave & Lawrence Ave 15615 -87.71183 41.96835
6: Cicero Ave & Wrightwood Ave 23180 -87.74673 41.92791
7: Wood St & Augusta Blvd 657 -87.67220 41.89918
8: University Ave & 59th St 22003 -87.59891 41.78788
9: Hale Ave & 107th St 20108 -87.66892 41.69921
10: Malcolm X College 631 -87.67390 41.87762
I took the value for the id for the “Pulaski Rd & 21st St” station from this set, as we can see:
Info_Station[str_detect(Info_Station$station_name, "Pulaski") &
str_detect(Info_Station$station_name, "21st"), .(station_name, station_id)]
station_name station_id
<char> <char>
1: 21st St & Pulaski Rd 21331
We can also see the station_id for the stations that had
more than one id before we corrected it:
Key: <discard_id>
station_name keep_id discard_id Info_Station_id
<char> <char> <char> <char>
1: Bradley Park 20227 633 20227
2: California Ave & Cortez St 17660 512 17660
3: Calumet Ave & 51st St 15470 813 15470
4: Calumet Ave & 71st St 15599 728 15599
5: Central Park Ave & Ogden Ave 15685 532 15685
6: Christiana Ave & Lawrence Ave 15615 860 15615
7: East End Ave & 87th St 20231 883 20231
8: Eggleston Ave & 92nd St 20118 707 20118
9: Halsted St & 111th St 20127 875 20127
10: Jeffery Blvd & 67th St KA1503000030 753 KA1503000030
11: Kostner Ave & Lake St 536 851 536
12: Lake Park Ave & 47th St TA1308000035 812 TA1308000035
13: Lakefront Trail & Bryn Mawr Ave 15576 KA1504000152 15576
14: Lawndale Ave & 111th St 20203 893 20203
15: Prairie Ave & Garfield Blvd TA1307000160 906 TA1307000160
16: Wabash Ave & 87th St 595 644 595
Coincidentally, all the ids that repeated the most across the periods are the ones contained in the Info_Station ids set; and, since id "644" doesn't appear in this set, we chose id "595" for the last case (both of these two ids showed up in all 12 period sets).
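For reference, the comparison above can be reproduced with a join along these lines; this is only a sketch, assuming order_reps carries the station_name, keep_id and discard_id columns shown in the table:
# Sketch: attach the id reported in Info_Station to each keep/discard pair,
# joining on the station name
id_check <- Info_Station[, .(station_name, Info_Station_id = station_id)][
order_reps, on = .(station_name)]
id_check[order(station_name), .(station_name, keep_id, discard_id, Info_Station_id)]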
Having this new set of information, what we can do is create a
function that gets the names and ids from the stations reported each
month using the list_stations function, and, if possible,
get the coordinates reported in this new Info_Station set
to build the lookup tables for the observations of each period. However, this is not going to be easy, precisely because some names and ids have changed over time or are spelled differently in the sets and on the webpage, e.g.: "Damen Ave & Coulter St" vs. "Damen Ave/Coulter St".
For this reason I think it is better to keep, as much as possible, the
names and ids from the original sets of information from each period;
although we have already changed one: “21331”.
Checking the info in the Info_Station set, there are two station_name values that are repeated, once with and once without the station_id (NA instead), and there is the case of the station "Burling St & Diversey Pkwy", which has two different station_id values:
station_name station_id longitude latitude
<char> <char> <num> <num>
1: Burling St & Diversey Pkwy 13208 -87.64776 41.93314
2: Burling St & Diversey Pkwy TA1309000036 -87.64776 41.93314
3: Indiana Ave & 133rd St 24409 -87.61719 41.65380
4: Indiana Ave & 133rd St <NA> -87.61705 41.65356
5: Western Ave & Lake St 24211 -87.68668 41.88481
6: Western Ave & Lake St <NA> -87.68585 41.88461
Since the coordinates don’t seem too far, I decided to keep only the stations with complete information. Then, for the station “Burling St & Diversey Pkwy”, I decided to keep id “TA1309000036”, since it shows across all the periods and id “13208” does not.
clean_Info_Statn <- function(DT){
# Function to clean the repeated stations names and non corresponding ids
# Check for repeated names
rep_name <- DT[duplicated(DT$station_name), station_name]
# Remove the duplicated names that don't have station ids, and the id "13208"
DT <- DT[!(station_name %in% rep_name & is.na(station_id)), ][!(station_id %in% "13208"), ]
DT
}
nrow(Info_Station)
[1] 1833
Just to have an idea of the cleaning process, from the previous code
chunk we can see the number of rows that the Info_Station
set has before cleaning it. Let’s clean it and check again:
Info_Station <- clean_Info_Statn(Info_Station)
nrow(Info_Station)
[1] 1830
After cleaning, the two duplicated stations and the row with the inconsistent id are gone: 3 rows in total were removed.
Another noticeable difference between the periods and the Info_Station sets is that a lot of stations have had racks added close by since June 2022, and those are prefixed with "Public Rack -" both in the periods and in the Info_Station names. Since this prefix doesn't exist in the periods' names before that month, I'll remove it from the Info_Station names when comparing those initial sets. In order to do this, we need to prevent duplicates from appearing; for example, this would happen with the following two cases:
station_name station_id longitude latitude
<char> <char> <num> <num>
1: Western Ave & 62nd St <NA> -87.68342 41.78099
2: Public Rack - Western Ave & 62nd St <NA> -87.68376 41.78096
3: Hamlin Ave & Grand Ave <NA> -87.72156 41.90404
4: Public Rack - Hamlin Ave & Grand Ave <NA> -87.72029 41.90374
Therefore, to make sure that we keep the station's coordinates, we need to remove the rack entries that would duplicate names before removing the prefix. Let's define the function to do this and try it, to see the number of rows after this change:
clean_racks_prefx <- function(DT) {
# The function will work with a data table as input (Info_Station)
# Collect the names that have the prefix
racks <- DT[str_detect(DT$station_name, "Public Rack - "), station_name]
# Get all the pairs (with and without the prefix)
repeated <- DT[station_name %in% str_remove_all(racks, "Public Rack - "), station_name]
# Keep the repeated ones without the prefix
DT <- DT[!(station_name %in% paste0("Public Rack - ", repeated)), ]
# Erase the prefix for all the rest that still have it, but didn't repeat
DT[, station_name := str_remove_all(DT$station_name, "Public Rack - ")]
DT <- unique(DT, by = c("station_name", "station_id"))
DT
}
Info_Station2 <- clean_racks_prefx(Info_Station)
nrow(Info_Station2)
[1] 1785
There were 45 racks with the same name as a station, leaving, after deleting the prefix, a list of 1785 station names in the Info_Station set. So, for the first periods, up to May 2022, this shorter version is the Info_Station set that will be used for the comparisons and for getting the coordinates.
Now we can proceed to construct a function to create the lookup tables for the stations in every period. To make it easier, we are going to do it by parts.
To start, we’ll need to collect all the different stations reported
in each period, we already have a function to reunite the starting and
the ending stations in just one set: the list_stations
function. Another idea for this part would be to try to keep the names
and the ids from each set as possible. This is in case that the
information contained in the Info_Station set differs from
the one that appears in the periods sets.
Since the list_stations function helps us collect the name and id of all the stations reported in each period, we can use it to check the total number of stations per period. Let's check the total number of stations in the jun_2022_rides_sm table:
stations <- list_stations(k)
nrow(stations)
[1] 1265
We can create a new column for the summ_table showing
how many stations are available each month. The result is quite
intriguing: it looks like some stations do not work during the winter
time. Let’s check:
period total_count n_stations
<char> <int> <int>
1: 2021 Oct. 631226 793
2: 2021 Nov. 359978 820
3: 2021 Dec. 247540 818
4: 2022 Jan. 103770 777
5: 2022 Feb. 115609 799
6: 2022 Mar. 284042 844
7: 2022 Apr. 371249 843
8: 2022 May. 634858 1103
9: 2022 Jun. 769204 1265
10: 2022 Jul. 823488 1226
11: 2022 Aug. 785932 1271
12: 2022 Sep. 701339 1288
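For reference, the new column could have been computed with something along these lines; this is a sketch, assuming summ_table is a data.table (as the printed output suggests) and that list_stations() accepts the table's position in br_dt_sm, as it does elsewhere in the document:
# Sketch: count the distinct stations reported in each period and add the column
summ_table[, n_stations := map_int(seq_along(br_dt_sm), \(i) nrow(list_stations(i)))]
summ_table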
We can put the numbers in a graph to have a better view:
It's not hard to imagine that the COVID-19 pandemic affected this business as it did many others, and that maybe some of the long-term effects are still showing at the beginning of the graph; that, added to the season of the year, may help explain the almost flat, slowly declining curve. But from May 2022 we can see a recovery in the number of stations functioning; most probably this is also related to the seasonality of the year, as the August-September decline may also suggest.
Having defined a way to compile all the stations for the periods, for
the next part, we will collect all the stations’ coordinates that share
the same name or id in both sets: the list of stations given by the
list_stations function and the Info_Station df
we got from the site.
Since some of the ids from the stations set differ from the ones in the Info_Station set by just 2 digits (added at the beginning of the latter), I'll use some regex matching to try to find matches. For this purpose I'll use the str_detect() function from the package stringr (Wickham 2023b) and agrep() from base (R Core Team 2019).
To effectively use the str_detect() function, I'm going to create a list of strings to condense the station names to their main components; that is, we are trying to get rid of cardinal point indicators ("E", "N", "S", etc.) and street types (St, Rd, Dr, etc.), leaving only the main parts of the names. For that purpose, and based on the info collected from the station_name vectors of the periods, I built the following set of strings (shrtn_addr) and a function to shorten the station_name strings in each period set:
# Vector to shorten the stations names in each period's set
shrtn_addr <- c(
"Ave\\s", "Ave$", "Blvd\\s", "Blvd$", "Ct\\s", "Ct$", "Dr\\s", "Dr$", "^E\\s", "E\\s", "\\sE$",
"\\s?-\\sEast$", "Hwy\\s", "Hwy$", "^N\\s", "N\\s", "\\sN$", "\\s?-\\sNE$", "\\s?-\\sNW$",
"\\sNW$", "Pkwy\\s", "Pkwy$", "Pl\\s", "Pl$", "^Public\\sRack\\s-\\s?", "Rd\\s", "Rd$", "^S\\s",
"S\\s", "\\sS$", "\\s?-\\sSE$", "Sq\\s", "Sq$", "St\\s", "St$", "\\s?-\\sSW$", "Ter\\s", "Ter$",
"^W\\s", "W\\s", "\\sW$", "Way\\s", "Way$", "\\s?-\\sWest$", "\\s?\\(.+\\)$",
"\\s-\\s.+Vaccination\\sSite"
)
shorten_n_split <- function(name, vector = shrtn_addr) {
# Function to shorten the names (name) and split them if possible
# This function takes 2 character vectors:
# the 1st: the names to shorten; the 2nd: the strings to eliminate from the 1st.
# It uses the 'shrtn_addr' vector as the 2nd by default
name <- str_replace_all(name, "-\\sNorth$", "N")
name <- str_remove_all(name, "\\(Murray\\)\\sPark$")
for (i in 1:length(name)) {
if (sum(str_detect(name[i], "\\s&\\s")) > 0) {
# Loop through the strings to remove from each name
for (str in vector) {
name[i] <- str_remove_all(name[i], str)
}
}
}
name <- str_replace_all(name,'""','"')
# Split the names if possible
spl_names <- str_split(name, "\\s&\\s")
# It returns a list with as many items as the length of the original 'name' vector,
# with one string or two (if the 'name' can be split) in each position.
spl_names
}
Let's check some examples. In the first column you'll see the original name; in the second column, the complete name (if it's just one whole name) or the first part of it (if it was split); and in the third column, NA (if it was just one name) or the second part of the name (if split):
Original_name Name_1 Name_2
<char> <char> <char>
1: 900 W Harrison St 900 W Harrison St <NA>
2: Halsted St & Willow St Halsted Willow
3: Sedgwick St & Webster Ave Sedgwick Webster
4: Broadway & Granville Ave Broadway Granville
5: Halsted St & 35th St Halsted 35th
6: Public Rack - Monticello Ave & Belmont Ave Monticello Belmont
7: Base - 2132 W Hubbard Base - 2132 W Hubbard <NA>
8: Troy St & North Ave Troy North
9: MLK Jr Dr & 29th St MLK Jr 29th
10: Lincoln Ave & Sunnyside Ave Lincoln Sunnyside
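The table above can be reproduced with a call along these lines; a minimal sketch using a few of the names from its first column:
# Sketch: apply shorten_n_split to a handful of names and lay out the pieces side by side
nms <- c("900 W Harrison St", "Halsted St & Willow St",
"Public Rack - Monticello Ave & Belmont Ave")
parts <- shorten_n_split(nms)
data.table(Original_name = nms,
Name_1 = sapply(parts, `[`, 1),
Name_2 = sapply(parts, `[`, 2))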
Let's write a function to find the coordinates for the stations that share names or ids in the sets:
find_coord_by_name_id <- function(DT) {
# The function takes the DT with the stations names and ids ("stations")
# If the prefix "Public Rack" doesn't appear in the "stations" names
# clean it from the "Info_Station" set
if (sum(str_detect(DT$station_name, "Public Rack")) == 0) {
Info_Station <- clean_racks_prefx(Info_Station)
}
# Get the stations info for the ones that have the same names in both sets
statn_by_name <- Info_Station[station_name %in% DT$station_name]
# Remove the ids that come from Info_Station
statn_by_name$station_id <- NULL
# Join the sets by the station_name column
statn_by_name <- DT[statn_by_name, on = .(station_name)]
# In case there are any, delete the repeated ones
statn_by_name <- unique(statn_by_name, by = c("station_name", "station_id"))
# For the missing stations, get info for the ones that have the same ids in both sets
diff_name <- DT[!(station_name %in% statn_by_name$station_name)]
# Remove the NAs before getting the coords
diff_name <- diff_name[complete.cases(diff_name), ]
statn_by_id <- Info_Station[station_id %in% diff_name$station_id]
# Remove the names from 'Info_Station' and the code: "TA1307000138"
# This code correspond to different stations in each set:
# "Lincoln Ave & Roscoe St" in 'DT'; "Wood St & Webster Ave" in 'Info_Station'
statn_by_id <- statn_by_id[, station_name := NULL][station_id != "TA1307000138", ]
statn_by_id <- DT[statn_by_id, on = .(station_id)]
# The 'station_id' "13099" is repeated with different station names:
# "Halsted St & 18th St" and "Halsted St & 18th St (Temp)", both names with
# the same id are included in the periods names data sets, leaving them untouched.
# Bind the two pieces with complete info
lookUp_st <- rbind(statn_by_name, statn_by_id)
lookUp_st <- unique(lookUp_st, by = c("station_name", "station_id"))
# Get again the ones still missing info
missInfo <- DT[!(station_name %in% lookUp_st$station_name)]
# Some station ids that have short codes were added two digits before the code in the
# 'Info_Station' ids set (various of those ids had the ".0" ending), let's use regex
# matching to check
short_id <- missInfo[nchar(station_id) == 3, ]
chck_regex <- sapply(short_id$station_id, \(sid) regex(paste0("^\\d{2}\\Q", sid, "\\E$")))
index <- 1:length(chck_regex)
for (i in 1:length(chck_regex)) {
# Check if there are coincidences, and use the length to build the flow control
indx <- agrep(chck_regex[i], Info_Station$station_id, fixed = FALSE, max.distance = 0)
len <- length(indx)
if (len == 0) { # No match
index[i] <- NA
} else { # There is at least one match
# Use the 'shorten_n_split' function to get the main components of the name
check4names <- unlist(shorten_n_split(short_id[i, station_name]))
# Get the name from the 'Info_Station' set
name_strs <- Info_Station[indx, station_name]
# Exactly one match: check if the street names coincide or add 'NA' to the index vector
if (len == 1) {
if (length(check4names) == 1) {
if (sum(str_detect(name_strs, check4names[1])) != 0) {
index[i] <- indx
} else {
index[i] <- NA
}
} else if (length(check4names) > 1) {
if (sum(
str_detect(name_strs, check4names[1]),
str_detect(name_strs, check4names[2])
) != 0) {
index[i] <- indx
} else {
index[i] <- NA
}
}
# More than one match, look for the one where the names coincide
} else {
# Logical vector for the matches and the 1st part of the name
match_1 <- str_detect(name_strs, check4names[1])
if (sum(match_1) == 0) { # If there are no matches for that part add NA to the result
index[i] <- NA
} else if (sum(match_1) == 1) { # If there's one coincidence for that part
index[i] <- indx[match_1] # add the index to the result
} else if (sum(match_1) > 1) {
# If there are more than 1 coincidence check the second part of the name
match_2 <- str_detect(name_strs, check4names[2])
if (sum(match_1 & match_2) == 1) {
index[i] <- indx[match_1 & match_2] # Add the one where the 2 parts coincide
} else {
index[i] <- NA # or add 'NA' to the index vector
}
}
}
}
}
short_id <- cbind(short_id, Info_Station[index, .(longitude, latitude)])
lookUp_st <- rbind(lookUp_st, short_id[complete.cases(short_id), ])
# The function returns a DT with the coordinates added to the stations that were found
lookUp_st
}
namids_found <- find_coord_by_name_id(stations)
head(namids_found, 10)
station_name station_id longitude latitude
<char> <char> <num> <num>
1: Damen Ave & 51st St 554 -87.67468 41.80091
2: Damen Ave & Thomas St (Augusta Blvd) TA1307000070 -87.67741 41.90131
3: Kilbourn Ave & Milwaukee Ave 16994 -87.74001 41.94707
4: Christiana Ave & Lawrence Ave 15615 -87.71183 41.96835
5: Cicero Ave & Wrightwood Ave 547 -87.74673 41.92791
6: Wood St & Augusta Blvd 657 -87.67220 41.89918
7: Hale Ave & 107th St 20108 -87.66892 41.69921
8: Dearborn St & Erie St 13045 -87.62932 41.89399
9: Calumet Ave & 71st St 15599 -87.61692 41.76551
10: Wood St & Chicago Ave 637 -87.67207 41.89563
By collecting the stations that are still missing the coordinates, as we will see next, we can confirm that we have found the coordinates for most of the stations in this jun_2022_rides_sm table, based on the names and ids reported in the data set:
station_name station_id
<char> <char>
1: Kimbark Ave & 67th St 751
2: Marquette Ave & 79th St 668
3: Base - 2132 W Hubbard warehouse
4: Prairie Ave & 47th St 814
5: Monticello Ave & Belmont Ave 500
---
264: Torrence Ave & 98th St 632
265: May St & 78th St 723
266: Park Manor Elementary School 748
267: Central Ave & Corcoran Pl 793
268: Halsted St & 64th St 744
We are just missing 268 stations out of the 1265 that appear in this period.
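For reference, the table of still-missing stations shown above can be collected by filtering out the names we already matched; this is the same filter used later inside the build_lookup_dt function:
# Keep only the stations whose names were not matched by name or id
stillMiss <- stations[!(station_name %in% namids_found$station_name)]
nrow(stillMiss)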
Now we can use str_detect() again and try some "fuzzy" matching to find other matches. As we saw, we are missing 268 stations in the case of the jun_2022_rides_sm table; let's try to find some more correspondences by using the str_detect function again, but this time in conjunction with stringdist() and amatch() from the stringdist package (M. P. J. van der Loo 2014).
In this part we are going to use the str_detect() function to find where the two main parts of a period's name match the Info_Station names set at the same time. There may be more than one match; in those cases, we'll use the stringdist function to get the closest one. In case there is only one whole name from the period set that cannot be split, we can also use str_detect and complement it with the amatch and stringdist functions to try to find similar names.
For the amatch and stringdist functions I decided to use the Damerau-Levenshtein method, reducing the value of the deletion and insertion penalties. The reason for reducing the deletion penalty is that we are already shortening the station names in each period to their main components, subtracting street types and cardinal points; so, in case such strings appear in the Info_Station names set, lowering the deletion penalty simply compensates for that previous removal. As for the insertions, the reason for reducing their penalty is based on observation, since most of the differences in the names are due to symbols or special characters in the Info_Station names, e.g., "-", "*" or "/", just to list some. Keeping this in mind, let's define the function to try to find more matches:
find_comb <- function(DT) {
# The function takes the df with the stations that need the coordinates (stillMiss)
vctr_names <- shorten_n_split(DT$station_name)
# Create a num vector for the results
found_comb <- 1:length(vctr_names)
# Loop through the list with the names
for (i in found_comb) {
# If the list item has more than one string
if (length(vctr_names[[i]]) > 1) {
# Detect where both strings coincide in the 'Info_Station' names vector and
# get the index, there may be more that one
numb <- which(str_detect(Info_Station$station_name, vctr_names[[i]][1]) &
str_detect(Info_Station$station_name, vctr_names[[i]][2]))
if (length(numb) > 1) {
# If there are more than one matches, check which string is the closest
reslts <- 1:length(numb)
for (j in reslts) {
reslts[j] <- stringdist(DT$station_name[i], Info_Station$station_name[numb[j]],
method = "dl", weight = c(d = 0.1, i = 0.5, s = 1, t = 1))
}
found_comb[i] <- numb[which.min(reslts)]
} else if (length(numb) == 1) {
# If there is just one match put it in the results
found_comb[i] <- numb
} else {
# If there is not a match add an NA to the results
found_comb[i] <- NA
}
# If there is only one item check if it is in the string
} else if (length(vctr_names[[i]]) == 1) {
if (sum(str_detect(Info_Station$station_name, vctr_names[[i]])) > 0){
indxs<- which(str_detect(Info_Station$station_name, vctr_names[[i]]))
if (length(indxs) > 1) {
d_indx <- 1:length(indxs)
for (j in d_indx) {
d_indx[j] <- stringdist(Info_Station$station_name[indxs[j]], DT$station_name[i],
method = "dl", weight = c(d = 0.1, i = 0.5, s = 1, t = 1))
}
found_comb[i] <- indxs[which.min(d_indx)]
} else {
found_comb[i] <- indxs
}
# If the item is not in the string use the 'amatch' function
} else {
# The penalty for deletions and insertions has been reduced based on what has
# been observed comparing the 'Info_Station' and the periods names sets
fuzzy_name <- amatch(vctr_names[[i]], Info_Station$station_name,
method = "dl", weight = c(d = 0.1, i = 0.5, s = 1, t = 1),
maxDist = 1.5
)
# If there is a match add the index to the results vector
if (!is.na(fuzzy_name)) {
found_comb[i] <- fuzzy_name
} else {
# If not add an NA to the results
found_comb[i] <- NA
}
}
}
}
# The output is a num vector with the indexes for the matches or 'NA' otherwise
found_comb
}
Now we can try to get more matches with the function we just defined, going through the station names in the set that is still missing the coordinates: stillMiss. Let's see:
station_name Info_Stn_name
<char> <char>
1: Kimbark Ave & 67th St Public Rack - Kimbark Ave & 67th St
2: Marquette Ave & 79th St Public Rack - Marquette Ave & 79th St
3: Base - 2132 W Hubbard <NA>
4: Prairie Ave & 47th St Public Rack - Prairie Ave & 47th St
5: Monticello Ave & Belmont Ave Public Rack - Monticello Ave & Belmont Ave
---
264: Torrence Ave & 98th St Public Rack - Torrence Ave & 98th St
265: May St & 78th St Public Rack - May St & 78th St
266: Park Manor Elementary School Public Rack - Park Manor Elementary School
267: Central Ave & Corcoran Pl Public Rack - Central Ave & Corcoran Pl
268: Halsted St & 64th St Public Rack - Halsted St & 64th St
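A sketch of how a comparison table like the one above could be produced with the find_comb function, pairing each missing name with the matched Info_Station name (or NA when there is no match):
# Sketch: index the Info_Station names with the result of find_comb
idx_found <- find_comb(stillMiss)
data.table(station_name = stillMiss$station_name,
Info_Stn_name = Info_Station$station_name[idx_found])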
So far, it works with the cases we checked, so we can write a function to find coordinates this way:
find_match <- function(DT) {
# The function takes a data.table as argument (stillMiss)
# Get the matches' indexes using the 'find_comb' function
indxs <- find_comb(DT)
# Add the respective longitude and latitude to the DT
found_mtchs <- cbind(DT, Info_Station[indxs, .(longitude, latitude)])
# The function returns a data table with the coordinates info for the matches (if any)
found_mtchs[complete.cases(found_mtchs), ]
}
We can add this new table to the one we had before with the coordinates we found with the names and ids:
naidma_found <- rbind(namids_found, find_match(stillMiss))
nrow(naidma_found)
[1] 1249
Let’s collect again the stations that still don’t have the coordinates, after matching the names, in a table:
station_name station_id
<char> <char>
1: Base - 2132 W Hubbard warehouse
2: Base - 2132 W Hubbard Warehouse warehouse
3: W Oakdale Ave & N Broadway 20252
4: Maryland Ave & 104th St 20134
5: MTV Hubbard St warehouse
6: WestChi warehouse
7: Washtenaw Ave & Madison St 416
8: NewHastings warehouse
9: Menard Ave & Division St 305
10: Cornell Dr & Hayes Dr 653
11: Kimbark Ave & 63rd St 752
12: Sacramento Blvd & Division St 511
13: Narragansett & Irving Park 323
14: Public Rack - Sacramento Blvd & Division St 511
15: WEST CHI-WATSON warehouse
16: DIVVY CASSETTE REPAIR MOBILE STATION warehouse
According to that result, none of those stations exists in the Info_Station set, or the names are so different from one set to the other that it is as if they didn't exist. Under these conditions, we have exhausted the information that we can extract from the Info_Station set we downloaded, and now we'll have to resort to other ways to complete the information.
One really easy way to find some more information is to use the coordinates registered in each period. Unfortunately, as we mentioned before, not all the stations have the correct coordinates reported in the tables. Therefore, we'll need to be careful to use only coordinates that are not rounded or truncated, and we will need to define which measure of central tendency we should use for this.
For the cases in which we do have valid coordinates, there are several observations for each station, and the coordinates reported in each observation are slightly different from one another; so, to complete the missing stations' information, I could use one of the measures of central tendency: mean, median or mode. I'll have to compare those measures for observations in different periods, based on different stations. Therefore, in this section we are going to check which of those three values is the best option for this task, keeping in mind some considerations:
Having these considerations in mind, we can create functions to check the median and mode points. First, I'll copy the function to get the mode(s):
# I found this function in stack overflow
# NOTE: I wrote the comments while checking the function
mode_set <- function(x) {
ux <- unique(x) # Select all the unique values in the vector
tab <- tabulate(match(x, ux)) # Count how many of each there are
ux[tab == max(tab)] # Get the value(s) that repeat the most
}
# The result is a vector with the mode value(s) for that coordinate
The following is a function to get the median point by just one of the coordinates. We have to account for the fact that the median point by latitude may be different from the median point by longitude:
median_by_col <- function(dt, var) {
# The function takes the data frame (dt) and the column name (var)
# Arrange the dt by the column we are checking
ordered <- arrange(dt, {{ var }})
# Get the column values in a numeric vector
vectr <- unlist(select(dt, {{ var }}))
# Get the median of that vector
medn <- median(vectr)
# Get the index of the point in the middle or the one on top
p_med <- as.integer(ceiling(nrow(dt) / 2))
# Get both coordinates if the value is reported in the data
if (medn %in% vectr) {
medin <- ordered[p_med, ]
# If it is not reported, get the closest reported values: one on top, one below
} else {
medins <- rbind(ordered[p_med, ], ordered[p_med - 1, ])
clnm <- colnames(medins)
# Create an empty tibble to gather the new information
medin <- tibble()
for (clmn in clnm) {
# Get the median for each coordinate in that pair of points:
new_t <- dplyr::summarise(medins, "{clmn}" := median(.data[[clmn]]))
# Put the values on the tibble
medin <- cbind(medin, new_t)
}
# Get only the values of the coordinates
medin <- medin %>%
select(all_of(clnm))
}
# The result is a tibble with the median 'lat' and 'lng' for the selected 'var'
medin
}
Let’s now create a function to collect all the reported coordinates for just one station. That way we will be able to visualize the points in a map:
get_station_pts_1 <- function(df, st_id) {
# The arguments for this function are the data from any period and a station id
# Filter and select all the observations for the particular station id
selStId_start <- df %>%
filter(start_station_id == st_id) %>%
select(longitude = start_lng, latitude = start_lat)
# Add a column to identify the points
selStId_start %<>%
mutate(station = rep("start"), idx = 1:nrow(selStId_start))
selStId_end <- df %>%
filter(end_station_id == st_id) %>%
select(longitude = end_lng, latitude = end_lat)
selStId_end %<>%
mutate(station = rep("end"), idx = 1:nrow(selStId_end))
# It will return a "data.table" "data.frame"
selStId <- rbind(selStId_start, selStId_end)
selStId
}
Now we can generate maps to visualize the reported points for the stations; in this case we will be referring just to the station with station_id "13307" in the dic_2021_rides_sm set. Let's see how those coordinates look on a map:
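The map itself comes from a hidden chunk; a minimal sketch of how it could be drawn, assuming dic_2021_rides_sm is the third element of br_dt_sm, as used earlier:
# Sketch: collect the reported points for station "13307" and plot them with leaflet
st_13307_1 <- get_station_pts_1(br_dt_sm[[3]], "13307")
leaflet(st_13307_1) %>%
addProviderTiles(providers$OpenStreetMap) %>%
addCircleMarkers(lng = ~longitude, lat = ~latitude, radius = 2,
label = ~paste(station, idx), color = "darkred")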
Most reported coordinates correspond to points that are really close to the station; but there are some 'end' points that are a little far, like a block ("end 75") or two ("end 50"); and then we have the point labeled "end 2", which is completely foreign to the sample (it is by the North-East viaduct at the crossing of the expressways!).
Clearly, that point is an outlier, so we will need to adjust the function to avoid those irregularities. Since the function will give us the central tendency value that we are going to use to compare the rest of the observations, it's better to be a little stringent, so I'll only include in the sample values between one IQR below the first quartile and one IQR above the third quartile, for both the longitude and the latitude.
While adjusting the coordinates in other examples, I noticed that it's not rare for the IQR to be equal to zero. Based on this observation, I decided to write a conditional statement that, in those cases, enlarges the range a little bit, from the first to the seventh octile.
limit_coord_rng <- function(coord) {
# The function takes the numeric vector with the coordinate to adjust
coord_iqr <- IQR(coord)
# If IQR is not zero it limits the range one IQR above the 3rd quartile and one below the 1st
if (coord_iqr != 0) {
coord_range <- range(
quantile(coord, 1 / 4) - coord_iqr,
quantile(coord, 3 / 4) + coord_iqr
)
# If IQR is zero it limits the range from the 1st to the 7th octiles
} else {
coord_range <- range(quantile(coord, 1 / 8), quantile(coord, 7 / 8))
}
# The function returns the lower and upper limits of the range
coord_range
}
So the adjusted version of the function will be:
get_station_pts <- function(df, st_id) {
# The arguments for this function are the data frame from a period and a station id
selStId_start <- df %>%
filter(start_station_id == st_id) %>%
select(longitude = start_lng, latitude = start_lat)
selStId_start %<>%
mutate(station = rep("start"), idx = 1:nrow(selStId_start))
selStId_end <- df %>%
filter(end_station_id == st_id) %>%
select(longitude = end_lng, latitude = end_lat)
selStId_end %<>%
mutate(station = rep("end"), idx = 1:nrow(selStId_end))
selStId <- rbind(selStId_start, selStId_end)
selStId <- selStId[longitude %between% limit_coord_rng(selStId$longitude) &
latitude %between% limit_coord_rng(selStId$latitude), ]
# It will return a "data.table" "data.frame"
selStId
}
As we can see in the next map and table, the points no longer include the obvious outlier ("end 2"), or the other points that were about a block or two from the station ("end 75", "end 50"):
Now all the points are grouped together really close to the station. It's important to note that in this process we went from 200 to 182 points; the following are the 18 points not included in the new table:
anti_join(st_13307_1, st_13307, by = join_by(longitude, latitude, station, idx)) %>%
arrange(station)
longitude latitude station idx
<num> <num> <char> <int>
1: -87.66363 41.85479 end 1
2: -87.64725 41.87594 end 2
3: -87.66369 41.85480 end 19
4: -87.66227 41.85426 end 50
5: -87.66357 41.85503 end 65
6: -87.66368 41.85437 end 75
7: -87.66370 41.85477 end 81
8: -87.66364 41.85480 start 27
9: -87.66367 41.85479 start 34
10: -87.66375 41.85483 start 46
11: -87.66369 41.85478 start 50
12: -87.66368 41.85478 start 56
13: -87.66370 41.85477 start 61
14: -87.66368 41.85477 start 82
15: -87.66374 41.85480 start 94
16: -87.66369 41.85478 start 95
17: -87.66367 41.85476 start 98
18: -87.66372 41.85480 start 100
Having all the pieces, now we can define a function that creates a df
with the values that we want to check:
get_centr_measrs_pts <- function(df, st_id) {
# Function to get a df with the central tendency measures points
# As input it takes the df to check and a "station_id" (st_id)
# Use the function to get the reported coordinates for the station id
selStId <- get_station_pts(df, st_id)
selStId %<>%
select(longitude, latitude)
# Get all the unique extreme points (some may repeat)
xtrmPnts <- unique(selStId[c(
which.min(selStId$latitude),
which.max(selStId$latitude),
which.min(selStId$longitude),
which.max(selStId$longitude)
), 1:2]) %>%
mutate(group = rep("xtrmPnts")) # Add a column to categorize
# Create the mean point
mean_pnt <- tibble(
longitude = mean(selStId$longitude),
latitude = mean(selStId$latitude)
) %>%
mutate(group = rep("mean_pnt")) # Add the respective category
# Find or compute the median value
medn_lat <- median_by_col(selStId, latitude)
medn_lng <- median_by_col(selStId, longitude)
# If both are the same, just keep one:
medn_pnt <- unique(rbind(medn_lat, medn_lng)) %>%
mutate(group = rep("medn_pnt")) # Again, add the category
# If there are 2 points, compute the values
if (nrow(medn_pnt) > 1) {
medn_pnt <- tibble(
longitude = median(medn_pnt$longitude),
latitude = median(medn_pnt$latitude),
group = "medn_pnt_c"
)
}
# Use the mode_set function to get the most repeated value(s)
modes_lat <- mode_set(selStId$latitude)
modes_lng <- mode_set(selStId$longitude)
# Create a logic vector for that(those) value(s)
idx_md_lat <- selStId$latitude %in% modes_lat
idx_md_lng <- selStId$longitude %in% modes_lng
# If the logic vector is the same pick any (I used lat)
if (identical(idx_md_lat, idx_md_lng)) {
mode_pnt <- unique(selStId[idx_md_lat, 1:2]) %>%
mutate(group = rep("mode_pnt"))
# If the logic vectors are different, bind all the info and get the unique value(s)
} else {
mode_pnt <- unique(rbind(selStId[idx_md_lat, 1:2], selStId[idx_md_lng, 1:2])) %>%
mutate(group = rep("mode_pnt")) # add the category.
}
# Put all the info in just one table
new_df <- as.data.table(rbind(xtrmPnts, mean_pnt, medn_pnt, mode_pnt))
# This function will return a data.frame/data.table
new_df
}
Let’s try the function we just built to check the points for
station_id “13307” within the table dic_2021_rides_sm:
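The chunk is hidden, but the call is presumably something along these lines (ctm_13307 is a hypothetical name for the result):
# Hypothetical call producing the table below
ctm_13307 <- get_centr_measrs_pts(br_dt_sm[[3]], "13307")
ctm_13307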
longitude latitude group
<num> <num> <char>
1: -87.66368 41.85480 xtrmPnts
2: -87.66358 41.85497 xtrmPnts
3: -87.66374 41.85481 xtrmPnts
4: -87.66356 41.85491 xtrmPnts
5: -87.66359 41.85489 mean_pnt
6: -87.66356 41.85491 medn_pnt
7: -87.66356 41.85491 mode_pnt
In this case, for the station "Laflin St & Cullerton St" (id "13307") in the table dic_2021_rides_sm, we have the 4 extreme points (min and max for each coordinate), the mean, one median and just one mode; the mode is the same for both coordinates and coincides with the median point.
We can now write a more elaborate function to make the maps to visualize those points:
st_map <- function(df) {
# Set options for the points markers
xtrmOpts <- markerOptions(opacity = 0.75, color = "black")
meanOpts <- markerOptions(opacity = 0.6, color = "darkgreen")
mednOpts <- markerOptions(opacity = 0.6, color = "blue")
modeOpts <- markerOptions(opacity = 0.6, color = "darkred")
# Generate the map using the leaflet library
map <- leaflet() %>%
addProviderTiles(providers$OpenStreetMap) %>%
addCircleMarkers(
lat = df[group == "xtrmPnts", latitude],
lng = df[group == "xtrmPnts", longitude], radius = 2, label = "extreme pnt",
group = "Extremes", options = xtrmOpts
) %>%
addCircleMarkers(
lat = df[group == "mean_pnt", latitude],
lng = df[group == "mean_pnt", longitude], radius = 5, label = "mean pnt",
group = "Mean", options = meanOpts
) %>%
addCircleMarkers(
lat = df[str_starts(df$group, "medn_pnt"), latitude],
lng = df[str_starts(df$group, "medn_pnt"), longitude], radius = 5,
label = df$group[str_starts(df$group, "medn_pnt")], group = "Median",
options = mednOpts
) %>%
addCircleMarkers(
lat = df[group == "mode_pnt", latitude],
lng = df[group == "mode_pnt", longitude], radius = 5, label = "mode pnt",
group = "Mode", options = modeOpts
) %>%
addLayersControl(overlayGroups = c("Extremes", "Mean", "Median", "Mode"))
map
}
For the next step, we just need to select a df and a station_id to generate the maps. Since we already have the df of the previous example, let's visualize those points on a map:
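A sketch of the call that would draw that map, reusing the hypothetical ctm_13307 object from the sketch above:
st_map(ctm_13307)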
In the markerOptions the colors were established as follows: black for the extreme points, dark green for the mean, blue for the median, and dark red for the mode.
To facilitate identifying the measures, you can check/uncheck the labels in the layers icon.
Observing the points, we can see that in this case the median coincides with the mode and with the extreme point of the min longitude; we can compare this with the info in the table.
Let's check another map, in this case for station_id "605" from the table nov_2021_rides_sm:
In the case of this last map, for the station with id "605", we have the 4 extreme points, and the mean, the median and the mode all coincide at the same point.
We are now going to use the addSidebyside function from the leaflet.extras2 package (Sebastian 2025) to produce a new map comparing the points of the same station (station_id == KA1504000117) in two different months (October and December, 2021). It is striking how far apart those points look from one month to the other. Here you can compare the two sets of points by moving the slider left or right:
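The chunk producing the side-by-side map is hidden; the sketch below follows the documented usage pattern of addSidebyside, where oct_pts and dec_pts are hypothetical tables holding the October and December points for that station:
# Sketch: two tile layers assigned to the "left" and "right" panes, the points of each
# month drawn on their own pane, and the slider control linking the two layer ids
leaflet() %>%
addMapPane("left", zIndex = 0) %>%
addMapPane("right", zIndex = 0) %>%
addTiles(layerId = "octBase", options = pathOptions(pane = "left")) %>%
addTiles(layerId = "decBase", options = pathOptions(pane = "right")) %>%
addCircleMarkers(data = oct_pts, lng = ~longitude, lat = ~latitude, radius = 3,
options = pathOptions(pane = "left")) %>%
addCircleMarkers(data = dec_pts, lng = ~longitude, lat = ~latitude, radius = 3,
color = "darkred", options = pathOptions(pane = "right")) %>%
addSidebyside(layerId = "sidebyside", leftId = "octBase", rightId = "decBase")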
Station: Halsted St & North Branch St, station id: KA1504000117
In this map, if you move the slider towards the "October points" title, those are the points that you will see on the map (you can zoom in to have a better view); likewise, if you move the slider towards the "December points" title, you'll see the December points. In the case of the October points, those are really close together, and again the median and the mode are the same, both really close to the mean. In contrast, the 4 extreme December points are really spread out; the median and the mode are the same and close to the extreme point in the south-east, leaving the mean alone in between. Based on these observations, I think we can make some decisions.
Since it is quite noticeable how the sets of points change from the observations of one month to those of another, with both sets belonging to the same station, I think it is better to obtain the measure of central tendency that we are going to use from each period's own set of observations, and not to use just one value for all the sets.
Most of the maps that we have checked show that the measures of central tendency tend to be really close together; that being the case, any of them would work as a reference point for the station. However, for the December points in this last case, the mean would be the better choice for the purpose of finding the closest station from any given point, since the median or the mode might add or subtract distance depending on the location of the other point.
Having decided on the course of action, we can now set up a function to collect the valid reported coordinates for each period. The idea is to have the name and id of the station, and to find the mean of the reported coordinates, in order to complete the reported values that don't have the station's data.
Let’s review again the 16 stations that still don’t have the coordinates:
station_name station_id
<char> <char>
1: Base - 2132 W Hubbard warehouse
2: Base - 2132 W Hubbard Warehouse warehouse
3: W Oakdale Ave & N Broadway 20252
4: Maryland Ave & 104th St 20134
5: MTV Hubbard St warehouse
6: WestChi warehouse
7: Washtenaw Ave & Madison St 416
8: NewHastings warehouse
9: Menard Ave & Division St 305
10: Cornell Dr & Hayes Dr 653
11: Kimbark Ave & 63rd St 752
12: Sacramento Blvd & Division St 511
13: Narragansett & Irving Park 323
14: Public Rack - Sacramento Blvd & Division St 511
15: WEST CHI-WATSON warehouse
16: DIVVY CASSETTE REPAIR MOBILE STATION warehouse
According to that result, none of those stations exists in the Info_Station set, and that is the reason why we need to use the reported coordinates from the original sets. Unfortunately, as we mentioned before, not all the stations have the correct coordinates reported in the tables. Therefore, we'll need to be careful to select coordinates that are not rounded or truncated.
get_reprtd_coords <- function(j, DT) {
# For this function we need to input an index (from "br_dt_sm" e.g.) and
# a data table DT ("missngStill")
# Collect the coordinates reported for every station missing them
for (i in 1:nrow(DT)) {
coords <- rbind(br_dt_sm[[j]][
start_station_name == DT[i, station_name],
.("lng" = start_lng, "lat" = start_lat)
], br_dt_sm[[j]][
end_station_name == DT[i, station_name],
.("lng" = end_lng, "lat" = end_lat)
])
# Check that the coords are not truncated or rounded
if (nrow(coords) > 0) {
if ((mean(coords[, lng]) * 10^5) %% 1000 != 0 &&
(mean(coords[, lat]) * 10^5) %% 1000 != 0) {
# Adjust the range for the values
coords <- coords[lng %between% limit_coord_rng(coords$lng) &
lat %between% limit_coord_rng(coords$lat), ]
# Add the mean of the coordinates to, or create, the table
if (exists("mean_coords")) {
mean_coords <- rbind(mean_coords, data.table(DT[i],
longitude = mean(coords[, lng]),
latitude = mean(coords[, lat])
))
} else {
mean_coords <- data.table(DT[i, ],
longitude = mean(coords[, lng]),
latitude = mean(coords[, lat])
)
}
}
}
}
# The output is a data table
if (exists("mean_coords")) {
return(mean_coords)
} else {
return(data.table(DT,
longitude = rep(NA, length.out = nrow(DT)),
latitude = rep(NA, length.out = nrow(DT))
))
}
}
Let’s check the function:
station_name station_id longitude latitude
<char> <char> <num> <num>
1: Base - 2132 W Hubbard warehouse -87.68028 41.88986
2: Base - 2132 W Hubbard Warehouse warehouse -87.68028 41.88986
3: MTV Hubbard St warehouse -87.68028 41.88986
4: WestChi warehouse -87.68028 41.88986
5: NewHastings warehouse -87.67984 41.86344
6: Cornell Dr & Hayes Dr 653 -87.58488 41.78059
7: WEST CHI-WATSON warehouse -87.68028 41.88986
8: DIVVY CASSETTE REPAIR MOBILE STATION warehouse -87.68028 41.88986
Let’s check those coordinates in the map:
If we zoom in enough, we will find the names of the respective stations on the map itself for the actual stations; and, for the warehouses, we'll see the addresses we assigned before.
We can now add these stations to the ones we already completed, and keep a record of the ones still missing, to see how many different incomplete stations we end up with after processing all the periods. This will help us decide whether it's worth searching for the information some other way.
coords_not_found <- missngStill[!(station_name %in% found_coords$station_name)]
coords_not_found[order(station_name), ]
station_name station_id
<char> <char>
1: Kimbark Ave & 63rd St 752
2: Maryland Ave & 104th St 20134
3: Menard Ave & Division St 305
4: Narragansett & Irving Park 323
5: Public Rack - Sacramento Blvd & Division St 511
6: Sacramento Blvd & Division St 511
7: W Oakdale Ave & N Broadway 20252
8: Washtenaw Ave & Madison St 416
Let’s build the lookup table to see how many stations have we completed so far:
lookup_dt <- rbind(naidma_found, found_coords)
nrow(lookup_dt)
[1] 1257
We are very close to getting the coordinates for all the stations reported in the jun_2022_rides_sm table; we are just missing the 8 stations that appeared in the previous table for that period.
Before getting to the final version of the lookup-table constructor function, it's worth mentioning that I had already run all the periods through a previous version of that function. While doing that, I collected the names of the stations missing in each period into an empty data table called "coords_not_found", using the following lines of code inside the function (these lines are commented out in the final version):
# Add the stations still missing info to the "coords_not_found" table
stillLost <- data.table(stations[!(station_name %in% lookup_dt$station_name)],
table = rep(name))
coords_not_found <<- rbind(coords_not_found, stillLost)
That action helped me identify the different strings included in the
shrtn_addr char vector, adapt the values in the
amatch and stringdist functions for the
defined find_comb function, and find the special cases
described above in the different data sets.
At the end of the final loop, the list of stations missing coordinates had a total of 69 observations, counting all the missing stations from all the periods; but those corresponded to only 12 unique names. Knowing that information, it was really easy, with the help of Google maps (“Google Maps — Maps.google.com”), Leaflet maps (Volodymyr 2023), and the coordinates obtained from other periods, to complete the missing data:
station_name longitude latitude
<char> <num> <num>
1: Bennett Ave & 96th St -87.57897 41.72214
2: Kimbark Ave & 63rd St -87.59603 41.78064
3: Maryland Ave & 104th St -87.60293 41.70566
4: Menard Ave & Division St -87.77052 41.90234
5: Narragansett & Irving Park -87.78540 41.95265
6: Public Rack - Albany Ave & 63rd St -87.70084 41.77891
7: Public Rack - Kimbark Ave & 63rd St -87.59603 41.78064
8: Public Rack - Sacramento Blvd & Division St -87.69750 41.90299
9: Sacramento Blvd & Division St -87.69750 41.90299
10: W Oakdale Ave & N Broadway -87.64405 41.93554
11: Washtenaw Ave & Madison St -87.69392 41.88140
12: Whipple St & Irving Park Rd -87.70437 41.95412
As you can see in the table, in the case of the stations and racks that correspond to the same address, I left the same coordinates for both. Checking the map, those stations have a darker tone of green compared to the rest, since both circles (the station's and the rack's) overlap. Let's check:
To find the coordinates, I searched for the street crossings in Google Maps ("Google Maps — Maps.google.com"), and from there, checking the "street view" of the same map and using Leaflet maps (Volodymyr 2023), I was able to locate most of the stations. However, in five cases I had to select the coordinates of the closest station that I could find. In the map, the points for those cases are in red, and a straight blue line connects each red circle with the closest station to that point (the green circle). Among those cases are the two pairs of racks and stations that coincide; that's why those two sets appear in darker hues of the respective colors.
Hovering over the lines will display the distance in meters, rounded to tenths, from the street crossing to the nearest station. I used the Spherical Law of Cosines formula to get those distances; we will discuss distance formulas further in the next section of this document.
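The build_lookup_dt function defined next refers to a searched table holding these coordinates; presumably it was built directly from the values listed above, along these lines (an abridged sketch, the remaining rows follow the same pattern):
# Sketch: manual lookup of the 12 stations that could not be matched automatically
searched <- data.table(
station_name = c("Bennett Ave & 96th St", "Kimbark Ave & 63rd St",
"Maryland Ave & 104th St"), # ...plus the other 9 names from the table above
longitude = c(-87.57897, -87.59603, -87.60293),
latitude = c(41.72214, 41.78064, 41.70566)
)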
Now we can put all the pieces together in just one function to get the lookup tables for each period:
build_lookup_dt <- function(i) {
# The function needs the name of the DT to check (each period DT)
# Get the list of all the stations for that period
stations <- list_stations(i)
# Look for the coincidences in names or ids to get the coordinates
namids_found <- find_coord_by_name_id(stations)
# Get the stations still missing the coordinates
stillMiss <- stations[!(station_name %in% namids_found$station_name)]
# Use 'str_detect' and fuzzy matching to find more coordinates
naidma_found <- rbind(namids_found, find_match(stillMiss))
# Get the stations still missing the coordinates, again
missngStill <- stillMiss[!(station_name %in% naidma_found$station_name)]
# Get coordinates based on the reported data from the period
found_coords <- get_reprtd_coords(i, missngStill)
# Add the reported coordinates, if any, to the lookup table
lookup_df <- rbind(naidma_found, na.omit(found_coords, cols = c("longitude","latitude")))
# Add the searched ones
missing_coords <- missngStill[!(station_name %in% lookup_df$station_name)]
if (sum(missing_coords$station_name %in% searched$station_name) > 0) {
adding_info <- searched[station_name %in% missing_coords$station_name, ]
adding_info <- missing_coords[adding_info, on = .(station_name)]
adding_info <- adding_info[complete.cases(adding_info), ]
}
lookup_dt <- unique(rbind(lookup_df, adding_info))
# Add the stations still missing the info to the "coords_not_found" table
# stillLost <- data.table(stations[!(station_name %in% lookup_df$station_name)],
# table = rep(name))
# coords_not_found <<- rbind(coords_not_found, stillLost)
# The function returns a data.table
lookup_dt
}
All of the reported coordinates I was able to check work and are consistent with the rest of the info when displayed on a map; let's see:
For the next step, I will need a formula to find which is the closest station and fill the missing values with that information, being careful to exclude the observations for which there is no info at all, i.e., no station name or id and no coordinates, since the formula might otherwise assign the closest name and id to coordinates (0,0), skewing the data.
By using the station look-up tables created before, I believe we can complete most of the data in each month's table. We just need to define a function to do that easily. There is a myriad of options to select from; just to illustrate, in the next chunk I've included some of those (Chris Veness 2022; Le 2021; Federal Regulations (Annual Edition). Title 47: Telecommunication. 2017). In the "Getting the measures" section I also included the distGeo function from the geosphere package (Hijmans 2024):
# Global-average radius of earth in meters:
R <- 6371008.8
# Constant to convert degrees into radians
to_rad <- pi / 180
# Pythagoras’ theorem on an equi-rectangular projection
pyth <- function(lon1, lat1, lon2, lat2) {
# Mean latitude in radians
mean_lat <- mean(c(lat1, lat2)) * to_rad
cat1 <- (lon2 - lon1) * to_rad * cos(mean_lat)
cat2 <- (lat2 - lat1) * to_rad
d <- sqrt(cat1^2 + cat2^2)
d * R # meters
}
# Spherical Law of Cosines
law_cos <- function(lon1, lat1, lon2, lat2) {
d <- acos(sin(lat1 * to_rad) * sin(lat2 * to_rad) +
cos(lat1 * to_rad) * cos(lat2 * to_rad) * cos((lon2 - lon1) * to_rad)) * R
d # meters
}
# Haversine formula to measure distance in a sphere
haversine <- function(lon1, lat1, lon2, lat2) {
# generally used geo measurement function
dLat <- (lat2 - lat1) * to_rad
dLon <- (lon2 - lon1) * to_rad
a <- (sin(dLat / 2))^2 + cos(lat1 * to_rad) *
cos(lat2 * to_rad) * (sin(dLon / 2))^2
c <- 2 * atan2(sqrt(a), sqrt(1 - a))
d <- R * c
d # in meters
}
# In a Medium page I found this one
medium <- function(lon1, lat1, lon2, lat2) {
a <- 0.5 - cos((lat2 - lat1) * to_rad) / 2 +
cos(lat1 * to_rad) * cos(lat2 * to_rad) *
(1 - cos((lon2 - lon1) * to_rad)) / 2
2 * R * asin(sqrt(a)) # meters
}
# FCC recommended formulae for distances not exceeding 475 kilometers:
fcc_dist <- function(lon1, lat1, lon2, lat2) {
# Mean latitude in radians
mean_lat <- mean(c(lat1, lat2)) * to_rad
K1 <- 111.13209 - 0.56605 * cos(2 * mean_lat) + 0.00120 * cos(4 * mean_lat)
K2 <- 111.41513 * cos(mean_lat) - 0.09455 * cos(3 * mean_lat) + 0.00012 * cos(5 * mean_lat)
D <- K1 * (lat2 - lat1) + K2 * (lon2 - lon1) # in Km
D * 1000 # in meters
}
# Defining the points (Taken from the table of the example)
lat1 <- 41.85476
lon1 <- -87.66367
lat2 <- 41.85492
lon2 <- -87.66366
# Getting the measures
pyt <- pyth(lon1, lat1, lon2, lat2)
l_c <- law_cos(lon1, lat1, lon2, lat2)
hav <- haversine(lon1, lat1, lon2, lat2)
med <- medium(lon1, lat1, lon2, lat2)
fcc <- fcc_dist(lon1, lat1, lon2, lat2)
diG <- distGeo(c(lon1,lat1),c(lon2,lat2))
# Organizing the info
comprsn <- tibble(
formula = c("Pythagorean", "Law of cosines", "Haversine", "Medium page", "FCC", "distGeo"),
`Distance in meters` = c(pyt, l_c, hav, med, fcc, diG))
comprsn %>% arrange(-`Distance in meters`)
# A tibble: 6 × 2
formula `Distance in meters`
<chr> <dbl>
1 FCC 18.6
2 Pythagorean 17.8
3 Haversine 17.8
4 Law of cosines 17.8
5 Medium page 17.8
6 distGeo 17.8
As we can see, all the formulas give a very close distance between the two selected points (81 cm, ~31.9 in, is the maximum difference). So, since the same formula is going to be used in all the calculations, any of them would work; the selection can then be based on the performance of the functions. Let's run a microbenchmark from the microbenchmark package (Mersmann 2019) to compare:
microbenchmark::microbenchmark(
times = 1000, unit = "us",
medium(lon1, lat1, lon2, lat2),
law_cos(lon1, lat1, lon2, lat2),
haversine(lon1, lat1, lon2, lat2),
pyth(lon1, lat1, lon2, lat2),
fcc_dist(lon1, lat1, lon2, lat2),
distGeo(c(lon1, lat1), c(lon2, lat2))
)
Unit: microseconds
expr min lq mean median uq max
medium(lon1, lat1, lon2, lat2) 1.5 1.7 12.1824 1.8 2.0 10078.3
law_cos(lon1, lat1, lon2, lat2) 1.5 1.7 9.0175 1.8 2.0 7029.6
haversine(lon1, lat1, lon2, lat2) 1.8 2.0 2.4166 2.2 2.4 38.3
pyth(lon1, lat1, lon2, lat2) 5.3 5.6 14.2884 5.9 6.3 7653.3
fcc_dist(lon1, lat1, lon2, lat2) 5.4 5.8 6.7931 6.0 6.5 59.9
distGeo(c(lon1, lat1), c(lon2, lat2)) 41.7 43.2 49.0610 43.8 45.6 194.8
neval
1000
1000
1000
1000
1000
1000
The formulas with the best timings are the one from the "Medium" page, the law of cosines and the haversine. Being the most popular on the pages I reviewed, and the one with the best mean (at least in most rounds), the winner is the haversine formula! Now we can proceed to build the function to complete the information.
Let’s use the formula with the jun_2022_rides_sm table to find the name and id of the stations missing that information. The following function will help us with that:
complete_stations <- function(j) {
  # The function takes the index of a period table as input
  DT <- as.data.table(br_dt_sm[[j]])
  # Build the stations lookup table for the same period
  stations <- build_lookup_dt(j)
  # Unique reported coordinates of rides that are missing the station name
  no_names <- unique(rbind(na.omit(DT[
    !is.na(start_lng),
    .("station_name" = start_station_name, "longitude" = start_lng, "latitude" = start_lat)
  ], cols = "station_name", invert = TRUE), na.omit(DT[
    !(is.na(end_lng)),
    .("station_name" = end_station_name, "longitude" = end_lng, "latitude" = end_lat)
  ], cols = "station_name", invert = TRUE)))
  distance <- idx <- 1:nrow(no_names)
  for (i in 1:nrow(no_names)) {
    # Calculate the distance against all the stations in the lookup table
    distances <- haversine(no_names[i, longitude], no_names[i, latitude],
                           stations[, longitude], stations[, latitude])
    # Find the shortest distance and store the index of the nearest station
    idx[i] <- which.min(distances)
    distance[i] <- distances[idx[i]]
  }
  # Fill in the nearest station's name, id, distance and coordinates
  no_names[, `:=`("station_name" = stations$station_name[idx],
                  "station_id" = stations$station_id[idx],
                  `distance (m.)` = distance,
                  "st_lng" = stations$longitude[idx],
                  "st_lat" = stations$latitude[idx])]
}
completed <- complete_stations(k)
completed <- completed[order(`distance (m.)`, decreasing = TRUE),
.(station_name, station_id, `distance (m.)`, longitude, latitude,
st_lng, st_lat)]
head(completed, 10)
                     station_name station_id distance (m.) longitude latitude    st_lng   st_lat
                           <char>     <char>         <num>     <num>    <num>     <num>    <num>
 1:      Plainfield & Irving Park        327      9189.133    -87.92    41.90 -87.83382 41.95213
 2:      W 103rd St & S Avers Ave      20202      8569.475    -87.82    41.70 -87.71707 41.70585
 3: Bloomingdale Ave & Harlem Ave        377      8439.880    -87.88    41.86 -87.80577 41.91203
 4:   Lincolnwood Dr & Central St       E014      8301.243    -87.81    42.09 -87.71530 42.06485
 5:      W 103rd St & S Avers Ave      20202      7388.270    -87.80    41.73 -87.71707 41.70585
 6: Bloomingdale Ave & Harlem Ave        377      7097.729    -87.89    41.90 -87.80577 41.91203
 7:    Narragansett & Irving Park        323      6388.508    -87.79    42.01 -87.78540 41.95265
 8:        Western Ave & 118th St        900      6280.578    -87.72    41.63 -87.68092 41.67835
 9:   Lincolnwood Dr & Central St       E014      5418.120    -87.74    42.11 -87.71530 42.06485
10:      Plainfield & Irving Park        327      5331.919    -87.83    42.00 -87.83382 41.95213
Checking the table, we can see that some of the reported points (given by the longitude and latitude columns) are really far from the nearest station (st_lng, st_lat); e.g., the distance between those two points in the first observation is about 9.2 km (5.7 miles).
Looking for information on the internet I found the following in Wikipedia (Wikipedia contributors 2025): “In Chicago, a typical city block is 330 by 660 feet (100 m × 200 m), meaning that 16 east-west blocks or 8 north-south blocks measure one mile, which has been adopted by other US cities.”
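Just as a quick back-of-the-envelope check (a small sketch added for illustration, using the completed table built above), we can convert the longest distance in the table into miles and into the "100 m" blocks mentioned in the quote:
# Quick conversions of the largest nearest-station distance
longest <- max(completed$`distance (m.)`)
longest / 1609.344 # meters to miles: ~5.7 miles
longest / 100      # "100 m" east-west blocks: ~92 blocks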
I cannot be sure what happened with those bikes or why they were reported at points so far away from the stations; it's worth mentioning that all the cases in which the bike rides miss the station info have coordinates rounded or truncated to the hundredths place. I believe it would be interesting for the analysis phase to check the relation between the distance from the station and whether it is a "start" or "end" station. For now, we can get a little view on the map of the bikes that were left too far from the stations, just to pique our interest for a future development of this topic.
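The claim about the rounded or truncated coordinates can be verified with a short check (a sketch, assuming the completed table is still available): if a coordinate only carries two decimals, rounding it to the hundredths place leaves it unchanged.
# Sketch: how many of the completed points carry at most two decimal places?
completed[, .N, by = .(two_decimals = longitude == round(longitude, 2) &
                         latitude == round(latitude, 2))]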
I mentioned "too far" above; now we need to define what would be considered "too far". Keeping in mind the Wikipedia information cited before, I would think that any distance greater than 800 m (eight "100 m" blocks, or 1/2 mile) may be considered "too far". Based on this threshold, we can build the following map (the distances displayed on top of the lines are measured in "100 m" blocks):
# Rides whose reported point is more than 800 m from the nearest station
too_far <- completed[`distance (m.)` > 800, ][, label := seq_len(.N)]
stations_lu <- build_lookup_dt(k)
sttns_tf <- stations_lu[station_name %in% too_far$station_name, ]
too_far_map <- leaflet() %>% addProviderTiles(providers$OpenStreetMap) %>%
addCircleMarkers(lng = too_far$longitude, lat = too_far$latitude,
label = paste0("Too_far_",too_far$label), fillColor = "darkred",
fillOpacity = 0.4, color = "darkred", radius = 6) %>%
addCircleMarkers(lng = sttns_tf$longitude, lat = sttns_tf$latitude, fillColor = "navy",
fillOpacity = 0.4, label = sttns_tf$station_name, color = "green",
radius = 6) %>%
addCircleMarkers(lng = completed$st_lng, lat = completed$st_lat, color = "navy",
label = completed$station_name, radius = 2, opacity = 0.7)
for (i in 1:nrow(too_far)){
  too_far_map <- too_far_map %>% addPolylines(lng = c(too_far[i,]$longitude, too_far[i,]$st_lng),
    lat = c(too_far[i,]$latitude, too_far[i,]$st_lat),
    label = paste(round(too_far[i,]$`distance (m.)` / 100, 1), "blocks"))
}
too_far_map
If reviewing the table with the really long distances was mind-blowing, looking at the map is even more intriguing: bikes in the middle of the lake, what is that about?
On the bright side, one idea that comes to mind is that if the company is considering expanding the coverage area for the bike program, maybe they should look at the areas with the most red circles!
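As a rough first pass at that idea (a small exploratory sketch, not a formal analysis), we could count how many of the "too far" points fall closest to each station; stations accumulating many of them hint at potentially under-covered areas:
# Sketch: number of "too far" points per nearest station, sorted descending
too_far[, .N, by = station_name][order(-N)]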
After checking the completed information, I think it is really relevant to include a column with the distance to the nearest station, keeping in mind that we will need to fill that column with 0's (zeros) for the rides that already had all the station information.
In the completed table we have all the unique coordinate pairs for which we need the stations' data. In order to input the stations' data into the period tables we need a function to add that information. But first, we need to add some new columns to register the new data: the distance to the nearest station, for both the starting and the ending side, and the coordinates used to find that distance, also for both cases.
br_dt_sm <- lapply(br_dt_sm, \(DT) DT[, `:=`(dist_start_stn = rep(0), dist_end_stn = rep(0),
mean_start_lng = rep(0), mean_start_lat = rep(0),
mean_end_lng = rep(0), mean_end_lat = rep(0))])
dim(br_dt_sm[[k]])
[1] 769204 26
We are ready now to put some information in those columns!
We have the tables ready to receive the needed information about the missing stations, and we also defined some extra columns to record how we obtained it. Now we can define a function to input those values. To do that, I'll split each table into two subsets: the rows that are already complete and the rows that need information. Since we are not going to do anything with the first subset, all the values for its new columns will remain zero (0).
input_stn_data <- function(i) {
  # The function takes the DT index as input
  # Display a message to show the progress
  cat("Working on table ", new_per[i], ". ")
  DT <- br_dt_sm[[i]]
  # Get the complete info for the stations missing
  to_input <- complete_stations(i)
  # Subset the DT: rows with full station info vs. rows missing it
  notMiss <- na.omit(DT, cols = c("start_station_name", "end_station_name"))
  missInf <- na.omit(DT, cols = c("start_station_name", "end_station_name"), invert = TRUE)
  # Loop through the completed info filling in the missing one in the tables
  # (the loop uses "r" as index to avoid shadowing the function argument "i")
  for (r in 1:nrow(to_input)) {
    missInf[
      is.na(missInf$start_station_name) &
        missInf$start_lng == to_input$longitude[r] &
        missInf$start_lat == to_input$latitude[r],
      c(
        "start_station_name", "start_station_id",
        "mean_start_lng", "mean_start_lat", "dist_start_stn"
      ) :=
        .(
          to_input$station_name[r], to_input$station_id[r],
          to_input$st_lng[r], to_input$st_lat[r], to_input$`distance (m.)`[r]
        )
    ]
    missInf[
      is.na(missInf$end_station_name) &
        missInf$end_lng == to_input$longitude[r] &
        missInf$end_lat == to_input$latitude[r],
      c(
        "end_station_name", "end_station_id",
        "mean_end_lng", "mean_end_lat", "dist_end_stn"
      ) :=
        .(
          to_input$station_name[r], to_input$station_id[r],
          to_input$st_lng[r], to_input$st_lat[r], to_input$`distance (m.)`[r]
        )
    ]
  }
  # Bind again the two subsets of the DT
  DT <- rbind(notMiss, missInf)
  # Display another progress message
  cat("Available information completed!\n")
  # The function returns a data.table
  DT
}
Let’s run the function to complete the tables:
br_dt_sm <- lapply(seq_along(br_dt_sm), \(i) input_stn_data(i))
Working on table oct_2021_rides_sm . Available information completed!
Working on table nov_2021_rides_sm . Available information completed!
Working on table dic_2021_rides_sm . Available information completed!
Working on table ene_2022_rides_sm . Available information completed!
Working on table feb_2022_rides_sm . Available information completed!
Working on table mar_2022_rides_sm . Available information completed!
Working on table abr_2022_rides_sm . Available information completed!
Working on table may_2022_rides_sm . Available information completed!
Working on table jun_2022_rides_sm . Available information completed!
Working on table jul_2022_rides_sm . Available information completed!
Working on table ago_2022_rides_sm . Available information completed!
Working on table sept_2022_rides_sm . Available information completed!
Let’s check the missing information again:
n_nas <- unlist(lapply(br_dt_sm, \(DT) sum(!complete.cases(DT))))
data.table(period= format(summ_table$month, "%b %Y"), miss_end_coords = summ_table$miss_end_coord, `has_NAs` = n_nas)
period miss_end_coords has_NAs
<char> <int> <int>
1: oct. 2021 484 4761
2: nov. 2021 191 1630
3: dic. 2021 144 675
4: ene. 2022 86 86
5: feb. 2022 77 77
6: mar. 2022 266 266
7: abr. 2022 317 317
8: may. 2022 722 722
9: jun. 2022 1055 1055
10: jul. 2022 947 947
11: ago. 2022 843 843
12: sept. 2022 712 712
Comparing the number of rows with NAs in the tables against the miss_end_coords column from the summary table, one can see that the results from January to September 2022 coincide, which is a good sign. The difference in the first three sets may be due to the station ids, let's check:
stn_id_NA <- unlist(lapply(br_dt_sm[1:3], \(DT) nrow(DT[is.na(start_station_id) |
is.na(end_station_id), ])))
stn_name_NA <- unlist(lapply(br_dt_sm[1:3], \(DT) nrow(DT[is.na(start_station_name) |
is.na(end_station_name), ])))
data.table(period = format(summ_table$month[1:3], "%b %Y"),
miss_end_coords = summ_table$miss_end_coord[1:3],
stn_name_NA = stn_name_NA,
stn_id_NA = stn_id_NA, `have_NAs` = n_nas[1:3])
period miss_end_coords stn_name_NA stn_id_NA have_NAs
<char> <int> <int> <int> <int>
1: oct. 2021 484 484 4761 4761
2: nov. 2021 191 191 1630 1630
3: dic. 2021 144 144 675 675
As we can observe in this last table, that is exactly the case: there are some stations for which we have the station name but not the station id. Let's check which:
for (i in 1:3) {
miss_id_start <- unlist(unique(br_dt_sm[[i]][
!is.na(start_station_name) &
is.na(start_station_id),
.(start_station_name, start_station_id)
]))
miss_id_end <- unlist(unique(br_dt_sm[[i]][
!is.na(end_station_name) & is.na(end_station_id),
.(end_station_name, end_station_id)
]))
cat(new_per[i], ":\nMissing start station - id:\t", paste(miss_id_start, collapse = " - "),
"\nMissing end station - id:\t", paste(miss_id_end, collapse = " - "),"\n\n")
}
oct_2021_rides_sm :
Missing start station - id: DuSable Lake Shore Dr & Ohio St - NA
Missing end station - id: DuSable Lake Shore Dr & Ohio St - NA
nov_2021_rides_sm :
Missing start station - id: DuSable Lake Shore Dr & Ohio St - NA
Missing end station - id: DuSable Lake Shore Dr & Ohio St - NA
dic_2021_rides_sm :
Missing start station - id: DuSable Lake Shore Dr & Ohio St - NA
Missing end station - id: DuSable Lake Shore Dr & Ohio St - NA
As we can see, in all the cases it is the station for which we removed the id when correcting the information in the tables, and this station doesn't appear in the Info_Stations set. If needed, we could assign a made-up id for this station (a short sketch of how follows below), but for now we are going to leave it as it is. Therefore, after a long and iterative process, we are done filling in the information in the tables!
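For reference, this is roughly how such a made-up id could be assigned if it ever becomes necessary. This is a minimal sketch, not run in this document, and "CHK-DLSD-OHIO" is a hypothetical value chosen only for illustration:
# Sketch (not run): assign a hypothetical placeholder id to the station
# "DuSable Lake Shore Dr & Ohio St", which has a name but no id
fake_id <- "CHK-DLSD-OHIO" # made-up value, for illustration only
br_dt_sm <- lapply(br_dt_sm, \(DT) {
  DT[start_station_name == "DuSable Lake Shore Dr & Ohio St" & is.na(start_station_id),
     start_station_id := fake_id]
  DT[end_station_name == "DuSable Lake Shore Dr & Ohio St" & is.na(end_station_id),
     end_station_id := fake_id]
})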
So far we have produced: one file with the processed period tables (br_dt_sm), one file with the originally downloaded data sets (“bikeRidesOriginalTables.rds”), this document with all the information about the process, and one file with the information to produce the Info_Stations set used in this document (“Info_Stations_list_1833.rds”). All that is left to do is to “pack” the additional information produced here in another list, to facilitate reproducing this work or parts of it.
Let’s create that list:
bike_rides_addtnl_info <- list("summary_table" = summ_table, "warehouses" = warehouses,
"searched_stns" = searched, "ggplot_theme" = tema)
save_tables <- function(a_list, name) {
# Function to save a list with a copy of each period data table
# It takes a list with the DTs and a name for the .rds file
# Save the list in the output folder
if (dir(".")[1] == "1_Load_libs") {
write_rds(a_list, paste0("../output/", name, ".rds"))
} else {
write_rds(a_list, paste0("../../output/", name, ".rds"))
}
}
save_tables(bike_rides_addtnl_info, "bikeRidesAddedInfo")
Now we have all the files needed to review and reproduce this document.
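For whoever reproduces this work later, the saved object can be read back with readr's read_rds(); a minimal sketch, assuming the relative path used by save_tables() above (adjust it to your own folder structure):
# Sketch: reload the additional information saved above
bike_rides_addtnl_info <- read_rds("../output/bikeRidesAddedInfo.rds")
str(bike_rides_addtnl_info, max.level = 1)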
In the previous chapters we collected some intriguing findings about the data; I put some recommendations for the next steps in the Appendix - Suggestions for the analysis phase.
We are finally done getting everything ready to be sent for the analysis phase!
One example is the too_far_map in section 5.5.1, which may be worth revisiting in case the company is thinking about expanding the coverage area for the bike program.