Table of contents:

  1. Loading the libraries
  2. Setting the periods and downloading the data
    1. Specifying the periods for the analysis
    2. Downloading the CSV files
  3. Initial observations of the files
    1. Checking the column names
    2. Comparing the structure of the tables
    3. Analysing the first and last observations of a table
    4. Checking the key column
    5. Looking for NAs and empty strings
  4. Dealing with missing values
    1. Unifying missing values
    2. Adding a ‘shadow matrix’ to the tables
  5. Collecting, correcting and completing
    1. Creating a summary table
    2. Correcting tables inconsistencies
      1. Checking the ids
      2. Checking the names
      3. Warehouses
      4. Cleaning the tables
    3. Process to complete the lookup tables
      1. Importing a file with the stations information
      2. Constructing a function to build the lookup tables
        1. Listing all the stations
        2. Collecting the coordinates for shared names or ids
        3. Using string detection and fuzzy matching
        4. Checking the mean, median and mode values
        5. Getting the reported coordinates
      3. Building the lookup-table constructor
    4. Comparing distance formulas
    5. Completing the missing names and ids
      1. Adding more columns to the tables
      2. Input the missing information in the tables
  6. Packing and closing

1. Loading the libraries

Before starting, it is important to mention that this document was developed using R version 4.5.0 Patched (2025-05-01 r88189 ucrt) (R Core Team 2019), in the RStudio desktop IDE “Mariposa Orchid” (2025.05.0+496) (Posit team 2025), running under Windows 10 x64 (build 19045).

To make this project possible, we need to load the different libraries that will be used to process the information in this document. For that, I set up the following function; it is important to note that in this case I am loading them silently.

# I am setting the chunk option results='asis' and using the htmltools tags() helpers to render the output of this chunk as a regular text list inside the document

load_libs <- function() {
  # Function to load the libraries

  # Declare a char vector with the names of the libraries to load
  librerias <- c(
    "data.table", "fuzzyjoin", "glue", "hms", "htmltools", "htmlwidgets",
    "geosphere", "leaflet", "leaflet.extras2", "magrittr", "naniar", "rio",
    "rlang", "scales", "simputation", "sp", "stringdist", "tidyverse", "visdat"
  )

  # Set up a variable to check the number of libraries loaded
  ctrl <- length(librerias)

  # Load "librerias" silently
  librerias <- suppressMessages(suppressWarnings(lapply(librerias, library, 
                                                        character.only = TRUE)))

  # Check the number of loaded libs, organize the list and write a paragraph with the result
  if (ctrl == length(librerias)) {
    base_libs <- sort(sessionInfo()$base)
    base_toP <- paste(head(base_libs, -1), collapse = ", ")
    libs <- sort(unlist(librerias[[length(librerias)]]))
    libs <- libs[!(libs %in% base_libs)]
    libs_toP <- paste0(head(libs, -1), collapse = ", ")
    listado <- tags$ul(tags$li(paste0(length(base_libs), " base R libraries: ", base_toP,
                                      " and ", base_libs[length(base_libs)], ".")),
                      tags$li(paste0(length(librerias[[length(librerias)]]) -
                                       length(base_libs), " additional installed packages: ",
                                     libs_toP, " and ", libs[length(libs)], "."))) %>%
      tagAppendAttributes(class = "disc")
    salida <- paste0(
      "The following ", length(librerias[[length(librerias)]]),
      " libraries were installed:\n\n", listado
    )
  } else {
    salida <- "Something happened, let's check!\n"
  }

  # The function returns a paragraph stating the result
  writeLines(salida)
  invisible(libs)
}
# libs <- load_libs()
load_libs()

The following 35 libraries were loaded:

I mentioned the silent option before because, with so many packages, conflicts may arise between functions in different packages; if so, you can check them with conflicts(), reload the libraries with all the messages enabled, or use the notation package_name::function().
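
For instance, a quick sketch of both options (dplyr::filter() is just an example of a commonly masked function, not something used later in this document):

# List the masked objects for each attached package (a sketch)
conflicts(detail = TRUE)

# Being explicit about which package a function comes from avoids any ambiguity, e.g.:
dplyr::filter(mtcars, cyl == 6)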

At the end of the chunk I commented out the line # libs <- load_libs(), since I just needed this code to get the libs vector, returned invisibly by the function, to add the package references to the references.bib document, following the R Markdown Cookbook’s recommendation (Xie, Dervieux, and Riederer 2020).
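
That recommendation boils down to a single call; a sketch of it, assuming the vector was captured with libs <- load_libs():

# Write one BibTeX entry per loaded package into the bibliography file (a sketch)
knitr::write_bib(libs, file = "references.bib")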

We have now loaded the libraries needed for this project; since we are not going to use this function again, we can remove it to keep the environment neat:

rm(load_libs)

Above is the chunk that deletes the function used in this section. I will keep repeating this cleaning operation throughout the document to avoid cluttering the environment; but from now on I won’t include the respective chunks, since all of them just use the same rm() function.

Back to the Table of Contents

2. Setting the periods and downloading the data

2.1 Specifying the periods for the analysis

First, I need to define the dates for which I want to download the zip files. I started this project in the first days of October 2022 and needed to collect at least a year’s worth of data. To accomplish this, I established my first month as October 2021 (2021-10-01).

After quite some time, I noticed that learning this language was going to take me a lot more time than I’d expected, and the months kept passing. For this reason, I decided to leave the initial month as it was and keep adding the new tables (dfs) as they became available online month after month. That meant that the code had to be able to update the dates every time I ran it; Sys.Date() helped me accomplish that. As you can see in the results, it keeps retrieving the dates up to this last month.

set_init_dates <- function() {
  # Function to create a date vector with the initial date for each period

  # Set the variables
  # One with the last month that may be available
  last_month <- floor_date(Sys.Date() - 30, "month")

  # Another one with the sequence since the first date by month
  # We'll need this vector later on, so I'll save it in the Global environment
  yearMonths <<- seq(as.Date("2021-10-01"), last_month, by = "months")

  # State the result for these variables as a paragraph
  salida <- paste0(
    "The first month starts with the date: ", as.Date("2021-10-01"),
    "; and, most probably, the last month's start date will be: ", last_month,
    ". The complete sequence of all the initial dates are in ",
    'an atomic vector named "yearMonths"; vector with class "',
    class(yearMonths), '" and of type "', typeof(yearMonths),
    '".\n\nThe dates in the vector increment by 1 month, i.e., ',
    paste0(yearMonths[1:6], collapse = ", "), ", etc.\n"
  )

  # The function returns a paragraph explaining the result
  writeLines(salida)
}
set_init_dates()

The first month starts with the date: 2021-10-01; and, most probably, the last month’s start date will be: 2025-06-01. The complete sequence of all the initial dates is in an atomic vector named “yearMonths”; vector with class “Date” and of type “double”.

The dates in the vector increment by 1 month, i.e., 2021-10-01, 2021-11-01, 2021-12-01, 2022-01-01, 2022-02-01, 2022-03-01, etc.

I have to keep in mind that during the first days of every month the data for the previous month may not be available yet.

# Define a variable with the number of items
a <- length(yearMonths)

# Declare a vector to contain the names of the dfs
periods <- sapply(yearMonths, function(var) {
  paste0(lubridate::month(var, label = TRUE), "_", format(var, "%Y"), "_rides")
})

# State the info:
writeLines(paste0("At this moment we may have ", a, " months of data available, there may be one less during the first days of every month (", a - 1, " in this case).\n\nThe following are some of the names that will be used for the tables, just to give you an idea of the scheme for those names:\n", paste0(periods[1:5], collapse = ", "), ", and so on. Up to, if already available, ", periods[a], ".\n"))

At this moment we may have 45 months of data available, there may be one less during the first days of every month (44 in this case).

The following are some of the names that will be used for the tables, just to give you an idea of the scheme for those names: oct_2021_rides, nov_2021_rides, dic_2021_rides, ene_2022_rides, feb_2022_rides, and so on. Up to, if already available, jun_2025_rides.

Back to the Table of Contents

2.2 Downloading the CSV files

The data comes from the following index page: https://divvy-tripdata.s3.amazonaws.com/index.html (“Bucket Loading... — Divvy-Tripdata.s3.amazonaws.com”). This is a public data set from https://ride.divvybikes.com/, which operates a bike-sharing system in the city of Chicago (“Home | Divvy Bikes — Ride.divvybikes.com”).

For this exercise we can create the R data tables directly from their information by using the import() function from the rio package (Chan et al. 2023). The files are comma-separated-values files (.csv) compressed in .zip format. In the previous section, we already created a vector, periods, with the names that we are going to give to those tables.

Before proceeding with the download, I am going to reduce the value of a so we only get 12 months of data. If we don’t change it, this chunk will download the complete set of 45 csv files available up to now, and that will take a long time. So, for the time being, let’s just work with 12 months. Since we are changing the value of a, we have to adjust the periods vector as well:

# New value for 'a'
a <- 12
# Adjust the periods vector
periods <- periods[1:a]

Now we are ready to download the information, the following chunk will help us with this task:

# Define a char vector with each file name to download
files_to_dwnld <- paste0(
  "https://divvy-tripdata.s3.amazonaws.com/", format(yearMonths[1:12], "%Y"),
  format(yearMonths[1:12], "%m"), "-divvy-tripdata.zip"
)

download_data <- function(i) {
  # Function to download the .zip files from the internet site, it takes an index as input
  
  # Display a message to show the progress
  cat("Downloading table ", periods[[i]], ": ", i, " out of ", a, sep = "")
  
  df <- import(files_to_dwnld[[i]])

  # Display a "Done" message
  cat(" - Done!\n")
  
  df
}
br_data <- lapply(seq_along(files_to_dwnld), \(i) download_data(i))
  Downloading table oct_2021_rides: 1 out of 12 - Done!
  Downloading table nov_2021_rides: 2 out of 12 - Done!
  Downloading table dic_2021_rides: 3 out of 12 - Done!
  Downloading table ene_2022_rides: 4 out of 12 - Done!
  Downloading table feb_2022_rides: 5 out of 12 - Done!
  Downloading table mar_2022_rides: 6 out of 12 - Done!
  Downloading table abr_2022_rides: 7 out of 12 - Done!
  Downloading table may_2022_rides: 8 out of 12 - Done!
  Downloading table jun_2022_rides: 9 out of 12 - Done!
  Downloading table jul_2022_rides: 10 out of 12 - Done!
  Downloading table ago_2022_rides: 11 out of 12 - Done!
  Downloading table sept_2022_rides: 12 out of 12 - Done!
# Assign the names from the periods vector
names(br_data) <- periods

If we were to download all the sets programmatically, it would be important to do it in two parts: first, download every file except the one for the most recent month, which we know is already published; and second, try to download that last file separately inside a tryCatch(), since during the first days of a month it may not be available yet.

The code for downloading that last file could be something like:

tryCatch(
  {
    br_data[[length(files_to_dwnld)]] <- import(files_to_dwnld[length(files_to_dwnld)])
  },
  error = function(e) {
    message(paste0("Data for ", periods[length(files_to_dwnld)], " is not yet available."))
    print(e)
  },
  warning = function(w) {
    message(paste0("The data for the ", periods[length(files_to_dwnld)], " is not available yet."))
    print(w)
  }
)

Let’s check the names and classes for the objects that we just got from the Internet:

sapply(br_data, class)
   oct_2021_rides  nov_2021_rides  dic_2021_rides  ene_2022_rides  feb_2022_rides 
     "data.frame"    "data.frame"    "data.frame"    "data.frame"    "data.frame" 
   mar_2022_rides  abr_2022_rides  may_2022_rides  jun_2022_rides  jul_2022_rides 
     "data.frame"    "data.frame"    "data.frame"    "data.frame"    "data.frame" 
   ago_2022_rides sept_2022_rides 
     "data.frame"    "data.frame"

We have correctly downloaded the twelve data frames we needed; now we are also going to add the data.table class to them in order to use that package (Barrett et al. 2025).

# Add the data.table class to the data frames
br_data <- lapply(br_data, as.data.table)
lapply(br_data, class)
  $oct_2021_rides
  [1] "data.table" "data.frame"
  
  $nov_2021_rides
  [1] "data.table" "data.frame"
  
  $dic_2021_rides
  [1] "data.table" "data.frame"
  
  $ene_2022_rides
  [1] "data.table" "data.frame"
  
  $feb_2022_rides
  [1] "data.table" "data.frame"
  
  $mar_2022_rides
  [1] "data.table" "data.frame"
  
  $abr_2022_rides
  [1] "data.table" "data.frame"
  
  $may_2022_rides
  [1] "data.table" "data.frame"
  
  $jun_2022_rides
  [1] "data.table" "data.frame"
  
  $jul_2022_rides
  [1] "data.table" "data.frame"
  
  $ago_2022_rides
  [1] "data.table" "data.frame"
  
  $sept_2022_rides
  [1] "data.table" "data.frame"

Now that the tables have both classes, we are ready to check the information in these 12 data tables!

Back to the Table of Contents

3. Initial observations of the files

3.1 Checking the column names

Now let’s start checking the tables from the files we just downloaded, beginning by comparing the column names across the different periods with the following function:

check_colnames <- function(a_list) {
  # Function to check if the colnames are the same in all dfs,
  # It takes a list with the dfs as input (br_data)

  # Set up a vector with the column names of the first df
  clmn_names <<- colnames(a_list[[1]])

  # Get the column names of the other dfs (from the 2nd on) and check whether they match the first:
  clnms_rest <- lapply(a_list[-1], colnames)
  samecols <- unlist(lapply(clnms_rest, \(cl_nms) identical(clmn_names, cl_nms)))
  
  # List the column names of the first df:
  sal1 <- paste0(
    "In the table ", names(a_list)[1], " the names of the columns are: '",
    paste0(clmn_names, collapse = "' - '"), "'.\n"
  )

  # Compare the previous info with the rest of the dfs:
  sal2 <- paste0(if (sum(samecols) == 0) {
    "\nNo other table has the same column names.\n\tLet's check!"
  } else if (sum(samecols) == length(a_list)-1) {
    paste0(
      "Good news: All the other ", length(samecols),
      " tables have the same column names!"
    )
  } else {
    paste0("This tables have also the same colnames: ",
      paste(names(samecols)[samecols], collapse = ", "),
      ".\nBut the following tables have DIFFERENT names: ",
      paste(names(samecols)[!samecols], collapse = ", "),
      ".\nLet's fix that!",
      sep = ""
    )
  })

  # Combine the information
  salida <- paste0(sal1, "\n", sal2, "\n")

  # The function returns a paragraph stating the results
  writeLines(salida)
}
check_colnames(br_data)

In the table oct_2021_rides the names of the columns are: ‘ride_id’ - ‘rideable_type’ - ‘started_at’ - ‘ended_at’ - ‘start_station_name’ - ‘start_station_id’ - ‘end_station_name’ - ‘end_station_id’ - ‘start_lat’ - ‘start_lng’ - ‘end_lat’ - ‘end_lng’ - ‘member_casual’.

Good news: All the other 11 tables have the same column names!

We are now certain that all the dfs have the same column names, which will make it easier for us to work with them.

Back to the Table of Contents

3.2 Comparing the structure of the tables

Let’s now check the structure of the first table (first according to the date), and whether the corresponding columns in the various tables all have the same classes. The following is the structure (str()) of the first data table, oct_2021_rides:

str(br_data[[1]])
  Classes 'data.table' and 'data.frame':    631226 obs. of  13 variables:
   $ ride_id           : chr  "620BC6107255BF4C" "4471C70731AB2E45" "26CA69D43D15EE14" "362947F0437E1514" ...
   $ rideable_type     : chr  "electric_bike" "electric_bike" "electric_bike" "electric_bike" ...
   $ started_at        : POSIXct, format: "2021-10-22 12:46:42" "2021-10-21 09:12:37" ...
   $ ended_at          : POSIXct, format: "2021-10-22 12:49:50" "2021-10-21 09:14:14" ...
   $ start_station_name: chr  "Kingsbury St & Kinzie St" "" "" "" ...
   $ start_station_id  : chr  "KA1503000043" "" "" "" ...
   $ end_station_name  : chr  "" "" "" "" ...
   $ end_station_id    : chr  "" "" "" "" ...
   $ start_lat         : num  41.9 41.9 41.9 41.9 41.9 ...
   $ start_lng         : num  -87.6 -87.7 -87.7 -87.7 -87.7 ...
   $ end_lat           : num  41.9 41.9 41.9 41.9 41.9 ...
   $ end_lng           : num  -87.6 -87.7 -87.7 -87.7 -87.7 ...
   $ member_casual     : chr  "member" "member" "member" "member" ...
   - attr(*, ".internal.selfref")=<externalptr>

As we can observe, we have 7 columns with class chr (character), 2 with POSIXct (date-time) and 4 with num (numeric). By checking the few elements shown from each column, we can infer that the classes are consistent with the information contained in them.

Let’s run a function to check whether the rest of the dfs have the same structure. I’ll just show the result:

  All the other 11 tables have also the same structure!
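
The function itself is not shown here; a minimal sketch of an equivalent check, comparing each table's column classes against those of the first one, could be:

# Compare every table's column classes with the first table's (a sketch,
# not the original function used in the document)
same_str <- sapply(br_data[-1], \(DT) identical(sapply(DT, class), sapply(br_data[[1]], class)))
if (all(same_str)) {
  cat("All the other", length(same_str), "tables have also the same structure!\n")
}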

Having found that the columns in all the dfs coincide and have the correct classes, we can now start checking the information in the tables.

Back to the Table of Contents

3.3 Analysing the first and last observations of a table

Before starting to check the information in the tables, I am going to set up an empty list docWise to save information needed for rendering this document correctly, and a variable that will help us check the info of a different period by altering its value.

  The list 'docWise' has been declared right now!

Now we can set the variable. The idea behind declaring it is to be able to check the same information for a different period just by changing its value. I am avoiding a or i since those variables are tied to processes we have already run, and they are very likely to be changed again by those or similar processes. In this case, the name for the new variable is going to be k.

# Defining a variable with a value to check a single period and adding a
# couple of 'if-statements' to avoid errors
k <- 9
check_k <- function(k) {
  if (k < 1) {
    k <- 1
  }
  if (k > a) {
    k <- a
  }
  
  # Save the value in the document list
  docWise$k <<- k
  
  # Check the information:
  sal1 <- paste0("Right now `k` = ", k, ", this means that the functions following ",
                 "this section are going to evaluate the data in table `", periods[k],
                 "`.\n")
  
  writeLines(sal1)
}
check_k(k)

Right now k = 9, this means that the functions following this section are going to evaluate the data in table jun_2022_rides.

Note: The idea behind this variable k is that the functions in the following sections can be run with the information from other periods just by changing its value. According to the previous code chunk, the value of k needs to be between 1 and a. In addition, as you can observe, the function check_k stores the value of k in the docWise list we declared before; that list keeps the values needed to render this document correctly. If k is changed elsewhere but the value stored in the list is not, the text throughout the document will keep referring to items related to the value of k before the change, which may become confusing. To avoid this, please change the value of k in the previous chunk, in the line right before the declaration of the function check_k, and run the chunk again.

Let’s check the data frame (jun_2022_rides) with head() and tail() to gather some insights about the contents of its observations and see what else we can notice:

               ride_id rideable_type          started_at            ended_at
                <char>        <char>              <POSc>              <POSc>
   1: 600CFD130D0FD2A4 electric_bike 2022-06-30 17:27:53 2022-06-30 17:35:15
   2: F5E6B5C1682C6464 electric_bike 2022-06-30 18:39:52 2022-06-30 18:47:28
   3: B6EB6D27BAD771D2 electric_bike 2022-06-30 11:49:25 2022-06-30 12:02:54
   4: C9C320375DE1D5C6 electric_bike 2022-06-30 11:15:25 2022-06-30 11:19:43
   5: 56C055851023BE98 electric_bike 2022-06-29 23:36:50 2022-06-29 23:45:17
   6: B664188E8163D045 electric_bike 2022-06-30 16:42:10 2022-06-30 16:58:22
   7: 338C05A3E90D619B electric_bike 2022-06-30 18:39:07 2022-06-30 19:05:02
   8: C037F5F4107788DE electric_bike 2022-06-30 12:46:14 2022-06-30 14:12:48
   9: C19B08D794D1C89E electric_bike 2022-06-30 11:09:38 2022-06-30 11:10:25
  10: 6E9E3A041C14E960 electric_bike 2022-06-30 11:05:46 2022-06-30 11:09:11
      start_station_name start_station_id end_station_name end_station_id
                  <char>           <char>           <char>         <char>
   1:                                                                    
   2:                                                                    
   3:                                                                    
   4:                                                                    
   5:                                                                    
   6:                                                                    
   7:                                                                    
   8:                                                                    
   9:                                                                    
  10:                                                                    
      start_lat start_lng end_lat end_lng member_casual
          <num>     <num>   <num>   <num>        <char>
   1:     41.89    -87.62   41.91  -87.62        casual
   2:     41.91    -87.62   41.93  -87.63        casual
   3:     41.91    -87.65   41.89  -87.61        casual
   4:     41.80    -87.66   41.80  -87.65        casual
   5:     41.91    -87.63   41.93  -87.64        casual
   6:     42.03    -87.71   42.06  -87.73        casual
   7:     41.91    -87.63   41.92  -87.72        casual
   8:     41.89    -87.62   41.98  -87.67        casual
   9:     41.89    -87.61   41.89  -87.61        casual
  10:     41.89    -87.61   41.89  -87.61        casual
               ride_id rideable_type          started_at            ended_at
                <char>        <char>              <POSc>              <POSc>
   1: 82EF1A9AC3E768C4  classic_bike 2022-06-26 20:34:59 2022-06-27 21:34:52
   2: 3EA43F311B85EB40  classic_bike 2022-06-20 17:51:55 2022-06-21 18:51:48
   3: E44E6CD038DBB502  classic_bike 2022-06-17 14:35:01 2022-06-18 15:34:46
   4: 1AEC816C5F684C5F  classic_bike 2022-06-26 20:02:29 2022-06-27 21:02:24
   5: 0E03A60ABBED6676  classic_bike 2022-06-27 14:22:02 2022-06-28 15:21:57
   6: F4EAB98F4DD26581   docked_bike 2022-06-28 04:49:29 2022-07-01 04:54:03
   7: A8120081E3A91F03  classic_bike 2022-06-02 17:36:30 2022-06-03 18:36:24
   8: D5F0401AB215354D   docked_bike 2022-06-17 14:36:05 2022-06-18 15:36:08
   9: E3A85FBE4884FAC8  classic_bike 2022-06-22 17:10:17 2022-06-23 18:09:54
  10: 020942C78999F631  classic_bike 2022-06-16 17:40:35 2022-06-17 18:40:30
                start_station_name start_station_id end_station_name
                            <char>           <char>           <char>
   1:   Clarendon Ave & Junior Ter            13389                 
   2:   Clarendon Ave & Junior Ter            13389                 
   3:    Blue Island Ave & 18th St            13135                 
   4:       Cicero Ave & Quincy St            16913                 
   5:   Lincoln Ave & Waveland Ave            13253                 
   6:    Blue Island Ave & 18th St            13135                 
   7:   Clarendon Ave & Junior Ter            13389                 
   8:    Blue Island Ave & 18th St            13135                 
   9:       Clark St & Randolph St     TA1305000030                 
  10: Sheffield Ave & Kingsbury St            13154                 
      end_station_id start_lat start_lng end_lat end_lng member_casual
              <char>     <num>     <num>   <num>   <num>        <char>
   1:                 41.96100 -87.64960      NA      NA        casual
   2:                 41.96100 -87.64960      NA      NA        casual
   3:                 41.85756 -87.66154      NA      NA        casual
   4:                 41.87761 -87.74541      NA      NA        casual
   5:                 41.94880 -87.67528      NA      NA        member
   6:                 41.85756 -87.66154      NA      NA        casual
   7:                 41.96100 -87.64960      NA      NA        member
   8:                 41.85756 -87.66154      NA      NA        casual
   9:                 41.88458 -87.63189      NA      NA        member
  10:                 41.91052 -87.65311      NA      NA        casual

Checking this info, we can notice that there are some rows where we don’t have the name of the starting or ending stations, nor their respective ids. In some cases we don’t even have the reported coordinates in the end_lat and end_lng columns. If we change the value of k above, we can observe the same situation for other months.

Back to the Table of Contents

3.4 Checking the key column

The first column in the tables is called ride_id; I assume that this is the key of the information contained in each table, meaning that it’s a unique identifier for each observation (row) in the data frame. Let’s check the tables through sapply to see if there are any duplicates:

sapply(br_data, \(DT) paste(DT[duplicated(ride_id), .(ride_id)], collapse = ", "))
   oct_2021_rides  nov_2021_rides  dic_2021_rides  ene_2022_rides  feb_2022_rides 
   "character(0)"  "character(0)"  "character(0)"  "character(0)"  "character(0)" 
   mar_2022_rides  abr_2022_rides  may_2022_rides  jun_2022_rides  jul_2022_rides 
   "character(0)"  "character(0)"  "character(0)"  "character(0)"  "character(0)" 
   ago_2022_rides sept_2022_rides 
   "character(0)"  "character(0)"

Getting character(0) for every table is a good sign that there are no repeated values in the ride_id column in any of them.

Back to the Table of Contents

3.5 Looking for NAs and empty strings ("")

Let’s check, for the same period (jun_2022_rides), what the situation is in terms of missing information. For that, we’ll first use a couple of functions from the naniar library (Tierney and Cook 2023): miss_var_table(), which summarises how many variables (n_vars) are missing a given number of observations (n_miss_in_var), and miss_var_summary(), which lists the missing values per variable:

miss_table <- miss_var_table(br_data[[k]])
miss_table
  # A tibble: 2 × 3
    n_miss_in_var n_vars pct_vars
            <int>  <int>    <dbl>
  1             0     11     84.6
  2          1055      2     15.4
miss_summa <- miss_var_summary(br_data[[k]])
miss_summa
  # A tibble: 13 × 3
     variable           n_miss pct_miss
     <chr>               <int>    <num>
   1 end_lat              1055    0.137
   2 end_lng              1055    0.137
   3 ride_id                 0    0    
   4 rideable_type           0    0    
   5 started_at              0    0    
   6 ended_at                0    0    
   7 start_station_name      0    0    
   8 start_station_id        0    0    
   9 end_station_name        0    0    
  10 end_station_id          0    0    
  11 start_lat               0    0    
  12 start_lng               0    0    
  13 member_casual           0    0

In this case, jun_2022_rides, the first function lets us see that we have 2 columns (variables) with 1055 rows (observations) missing information (NA); and in the second table we can see that those columns are end_lat and end_lng.

Previously, when we checked the head and tail of jun_2022_rides, we observed that 4 columns of class character were also missing information. Therefore, in this df there are at least 2 different ways of representing the missing information: NA and empty strings (""). We can check that with the following function:

cuenta_miss_dt <- function(DT) {
  # Function to count NAs and empty strings ("") per variable (column)
  # This function takes a data table as input

  # Get the variable class
  classes <- t(DT[, sapply(.SD, class), .SDcols = clmn_names])

  # Count NAs
  n_NA <- data.matrix(DT[, sapply(.SD, \(col) sum(is.na(col))), .SDcols = clmn_names])

  # Count empty strings
  n_empty <- data.matrix(DT[, sapply(.SD, \(col) sum(col %in% "")), .SDcols = clmn_names])

  # Format the output
  salida <- as.data.table(cbind(
    as_tibble(clmn_names),
    classes, n_NA, n_empty
  ))[n_empty > 0 | n_NA > 0, -3]
  setnames(salida, c(1:2), c("variable", "class"))

  # The function returns a data.table
  salida
}
missing_obs <- cuenta_miss_dt(br_data[[k]])
missing_obs
               variable     class  n_NA n_empty
                 <char>    <char> <int>   <int>
  1: start_station_name character     0   92944
  2:   start_station_id character     0   92944
  3:   end_station_name character     0  100152
  4:     end_station_id character     0  100152
  5:            end_lat   numeric  1055       0
  6:            end_lng   numeric  1055       0

Now we can clearly see that for the numeric variables (end_lat, end_lng) they used NA to represent the missing values, and for the character variables they used empty strings (""). To facilitate working with the data, we are going to transform all the empty strings into NAs and create some columns to store the information about the missing values.

Reminder: By changing the value of the variable k in section 3.3 you can check this same info for other periods, instead of running them all at once, since that will take a long time to finish.

Back to the Table of Contents

4. Dealing with missing values

4.1 Unifying missing values

In the last section we found that there are two different values representing the missing information in these tables: NA for the numeric variables and empty strings ("") for the character variables. In this part we need to unify the representation of those missing values; for that purpose, and in order to use some functions that deal with NAs, we are going to represent the missing values only with NAs.

Before making any changes to the tables, it is a good idea to save a copy, in case we need to go back to the originals in the future; that way we will avoid having to download them again. To accomplish this, we can take advantage of the native R format .rds to save a list with a copy of each table:

save_tables <- function(a_list, name) {
  # Function to save a list with a copy of each period data table
  # It takes a list with the DTs and a name for the .rds file

  # Save the list in the input folder
  if (dir(".")[1] == "1_Load_libs") {
    write_rds(a_list, paste0("../input/", name, ".rds"))
  } else {
    write_rds(a_list, paste0("../../input/", name, ".rds"))
  }
}
save_tables(br_data, "bikeRidesOriginalTables")

From the last function in the previous section, we obtained the variables (column names) that contain empty strings and the number of observations missing that information in jun_2022_rides:

               variable n_empty
                 <char>   <int>
  1: start_station_name   92944
  2:   start_station_id   92944
  3:   end_station_name  100152
  4:     end_station_id  100152

To change those empty strings into NAs, we can use the na_if function from the dplyr package (Wickham et al. 2023). Let’s check how it works for the tail of the jun_2022_rides table that we checked before:

br_data[[k]] %>%
  arrange(end_lat) %>%
  tail(10) %>%
  apply(2, na_if, y = "") %>%
  as.data.frame()
              ride_id rideable_type          started_at            ended_at
  1  82EF1A9AC3E768C4  classic_bike 2022-06-26 20:34:59 2022-06-27 21:34:52
  2  3EA43F311B85EB40  classic_bike 2022-06-20 17:51:55 2022-06-21 18:51:48
  3  E44E6CD038DBB502  classic_bike 2022-06-17 14:35:01 2022-06-18 15:34:46
  4  1AEC816C5F684C5F  classic_bike 2022-06-26 20:02:29 2022-06-27 21:02:24
  5  0E03A60ABBED6676  classic_bike 2022-06-27 14:22:02 2022-06-28 15:21:57
  6  F4EAB98F4DD26581   docked_bike 2022-06-28 04:49:29 2022-07-01 04:54:03
  7  A8120081E3A91F03  classic_bike 2022-06-02 17:36:30 2022-06-03 18:36:24
  8  D5F0401AB215354D   docked_bike 2022-06-17 14:36:05 2022-06-18 15:36:08
  9  E3A85FBE4884FAC8  classic_bike 2022-06-22 17:10:17 2022-06-23 18:09:54
  10 020942C78999F631  classic_bike 2022-06-16 17:40:35 2022-06-17 18:40:30
               start_station_name start_station_id end_station_name
  1    Clarendon Ave & Junior Ter            13389             <NA>
  2    Clarendon Ave & Junior Ter            13389             <NA>
  3     Blue Island Ave & 18th St            13135             <NA>
  4        Cicero Ave & Quincy St            16913             <NA>
  5    Lincoln Ave & Waveland Ave            13253             <NA>
  6     Blue Island Ave & 18th St            13135             <NA>
  7    Clarendon Ave & Junior Ter            13389             <NA>
  8     Blue Island Ave & 18th St            13135             <NA>
  9        Clark St & Randolph St     TA1305000030             <NA>
  10 Sheffield Ave & Kingsbury St            13154             <NA>
     end_station_id start_lat start_lng end_lat end_lng member_casual
  1            <NA>  41.96100 -87.64960    <NA>    <NA>        casual
  2            <NA>  41.96100 -87.64960    <NA>    <NA>        casual
  3            <NA>  41.85756 -87.66154    <NA>    <NA>        casual
  4            <NA>  41.87761 -87.74541    <NA>    <NA>        casual
  5            <NA>  41.94880 -87.67528    <NA>    <NA>        member
  6            <NA>  41.85756 -87.66154    <NA>    <NA>        casual
  7            <NA>  41.96100 -87.64960    <NA>    <NA>        member
  8            <NA>  41.85756 -87.66154    <NA>    <NA>        casual
  9            <NA>  41.88458 -87.63189    <NA>    <NA>        member
  10           <NA>  41.91052 -87.65311    <NA>    <NA>        casual

Now all the missing information is represented by NAs across all the columns in this example. However, it takes a long time for this approach to change the whole data frame, and it coerces all the columns to type character (apply() converts the data to a matrix). Since we only need to change the character columns, we can subset those columns to start with, using the data.table package (Barrett et al. 2025), which will also help us speed up the process.

By using the vector periods with the names of the tables, we can create the names for the new tables; to do so, I’ll just add ‘_sm’ to the end of the same names, e.g.: oct_2021_rides will become oct_2021_rides_sm. Then we will define a function to perform the changes and display messages to show the progress, and finally we’ll apply the function and assign the new names to the new list, br_dt_sm (Bike Rides DaTa_SM):

# Declare a vector with the new names
new_per <- paste0(periods, "_sm")

change_empty <- function(a_list, i) {
  # Function to change the empty strings into NA, it takes as input a list and the index

  # Display a message to show the progress
  cat(paste0("Working on table ", periods[[i]], ", changing empty-strings into NAs.\n"))

  # Change only the character columns and,
  # join the non-char cols to the subset by the 'ride_id' col
  DT <- a_list[[i]][, lapply(.SD, na_if, ""),
    .SDcols = is.character
  ][a_list[[i]][,
    .(ride_id, .SD),
    .SDcols = !is.character
  ], on = .(ride_id)]

  # Correct the col names that were changed by the join
  setnames(DT, 8:13, clmn_names[c(3, 4, 9:12)])

  # Display another message to show the progress
  cat(paste0("The table ", new_per[i], " has been completed!\n"))

  DT
}
# Apply the function
br_dt_sm <- lapply(seq_along(br_data), \(i) change_empty(br_data, i))
  Working on table oct_2021_rides, changing empty-strings into NAs.
  The table oct_2021_rides_sm has been completed!
  Working on table nov_2021_rides, changing empty-strings into NAs.
  The table nov_2021_rides_sm has been completed!
  Working on table dic_2021_rides, changing empty-strings into NAs.
  The table dic_2021_rides_sm has been completed!
  Working on table ene_2022_rides, changing empty-strings into NAs.
  The table ene_2022_rides_sm has been completed!
  Working on table feb_2022_rides, changing empty-strings into NAs.
  The table feb_2022_rides_sm has been completed!
  Working on table mar_2022_rides, changing empty-strings into NAs.
  The table mar_2022_rides_sm has been completed!
  Working on table abr_2022_rides, changing empty-strings into NAs.
  The table abr_2022_rides_sm has been completed!
  Working on table may_2022_rides, changing empty-strings into NAs.
  The table may_2022_rides_sm has been completed!
  Working on table jun_2022_rides, changing empty-strings into NAs.
  The table jun_2022_rides_sm has been completed!
  Working on table jul_2022_rides, changing empty-strings into NAs.
  The table jul_2022_rides_sm has been completed!
  Working on table ago_2022_rides, changing empty-strings into NAs.
  The table ago_2022_rides_sm has been completed!
  Working on table sept_2022_rides, changing empty-strings into NAs.
  The table sept_2022_rides_sm has been completed!
# Assign the new names for the DTs
names(br_dt_sm) <- new_per

It’s important to note that this operation put all the char columns together at the beginning of the table (on the left side of it), and the non-character ones at the end.
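
Should the original column order be needed again, data.table's setcolorder() can restore it by reference; a sketch using the clmn_names vector saved earlier (the document itself keeps the rearranged order):

# Restore the original column order in every table, modifying them by reference (a sketch)
invisible(lapply(br_dt_sm, setcolorder, neworder = clmn_names))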

Let’s check the missing-variables summary again for the new jun_2022_rides_sm table:

miss_var_summary(br_dt_sm[[k]]) %>% filter(n_miss > 0)
  # A tibble: 6 × 3
    variable           n_miss pct_miss
    <chr>               <int>    <num>
  1 end_station_name   100152   13.0  
  2 end_station_id     100152   13.0  
  3 start_station_name  92944   12.1  
  4 start_station_id    92944   12.1  
  5 end_lat              1055    0.137
  6 end_lng              1055    0.137

As we can see, now the summary function is able to get all the missing observations from all columns, and the values reported are the exact same numbers we had before making these changes.

Back to the Table of Contents

4.2 Adding a ‘shadow matrix’ to the tables

Later on we are going to make some changes to see if we can fill some of the missing information. To be able to keep track of the changes it is a good idea to add some columns that let us know where the missing values are. Those new columns, as a set, constitute what is called a ‘shadow matrix’ (that’s the reason for the _sm appendage on the table names). To do this, we can use the functions nabular() and add_label_shadow() from the naniar package (Tierney and Cook 2023).

add_shadow <- function(i) {
  # Function to add a shadow matrix, it takes an index as input
  
  # Display a message to show the progress
  cat("Working on table ", names(br_dt_sm)[[i]], ". ")
  
  # Add the shadow matrix and convert the df as a data.table
  DT <- as.data.table(nabular(br_dt_sm[[i]], only_miss = T) %>% add_label_shadow())
  
  # Display the finished message for the DT to show the progress
  cat("Shadow matrix attached!\n")
  
  DT
}
br_dt_sm <- lapply(seq_along(br_dt_sm), \(i) add_shadow(i))
  Working on table  oct_2021_rides_sm . Shadow matrix attached!
  Working on table  nov_2021_rides_sm . Shadow matrix attached!
  Working on table  dic_2021_rides_sm . Shadow matrix attached!
  Working on table  ene_2022_rides_sm . Shadow matrix attached!
  Working on table  feb_2022_rides_sm . Shadow matrix attached!
  Working on table  mar_2022_rides_sm . Shadow matrix attached!
  Working on table  abr_2022_rides_sm . Shadow matrix attached!
  Working on table  may_2022_rides_sm . Shadow matrix attached!
  Working on table  jun_2022_rides_sm . Shadow matrix attached!
  Working on table  jul_2022_rides_sm . Shadow matrix attached!
  Working on table  ago_2022_rides_sm . Shadow matrix attached!
  Working on table  sept_2022_rides_sm . Shadow matrix attached!

Let’s check the dimensions of the jun_2022_rides_sm table:

  [1] 769204     20

Now we have 20 variables instead of the original 13. Checking the new table, one can see that the columns that were missing information now have a ‘shadow’ column with the same name ending in “_NA”; for example, end_lat has a shadow column named end_lat_NA. Within those new columns we see NA for the observations that were missing in the original column, and !NA otherwise. Lastly, we have a column named any_missing with the value ‘Missing’ for the rows with at least one NA, and ‘Not Missing’ when all the variables in that row have values (no NAs in that row).
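
A quick way to see those shadow columns in action, with k still set as in section 3.3 (a sketch):

# Count the rows flagged by the shadow matrix (a sketch)
br_dt_sm[[k]][, .N, by = any_missing]   # 'Missing' vs 'Not Missing' rows
br_dt_sm[[k]][, .N, by = end_lat_NA]    # NA vs !NA in the end_lat shadow column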

From now on we are going to be working with these new _sm tables. Keeping in mind that we already saved a copy of the original tables, we can remove all the period tables, and the periods vector itself, from the Global environment.

Back to the Table of Contents

5. Collecting, correcting and completing

Since we started working with these data tables, we have been producing new information about them while running the different processes. So we can start collecting some general info about the different periods, like the total number of observations and the missing information for the different variables in each period; in the first part of this section we’ll do this. Then we will correct some information that is wrong in the data frames. After that, we will start the process of building the lookup tables and compare distance formulas, to finally complete the missing data that can be completed in each period.


5.1 Creating a summary table

It is important to keep the information produced organized in a single table. For that purpose, we need to create a summary table to collect those observations about the data in the different periods. Since we have the respective dates in the yearMonths dates vector, we can use those dates in the first column to reference each month. It is also essential to know how many observations are available for each one; the total number of rows represents exactly that, so let’s get those values. I’ll call the column for the total number of observations total_count:

          period total_count
          <char>       <int>
   1:  oct. 2021      631226
   2:  nov. 2021      359978
   3:  dic. 2021      247540
   4:  ene. 2022      103770
   5:  feb. 2022      115609
   6:  mar. 2022      284042
   7:  abr. 2022      371249
   8:  may. 2022      634858
   9:  jun. 2022      769204
  10:  jul. 2022      823488
  11:  ago. 2022      785932
  12: sept. 2022      701339
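
The chunk that builds this table is not shown; a minimal sketch of how summ_table could be assembled from yearMonths and the row counts (keeping a Date column, month, that the plot below uses) might be:

# A sketch of the summary-table construction, not necessarily the exact code used here
summ_table <- data.table(
  month       = yearMonths[1:a],                  # Date column, used for plotting
  period      = format(yearMonths[1:a], "%b %Y"), # readable label
  total_count = sapply(br_dt_sm, nrow)            # number of rides per period
)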

Just out of curiosity, let’s check how that looks in a graph:

# First, let's create a theme we can use in the graphs
tema <- theme(
  plot.title = element_text(size = 18), plot.subtitle = element_text(size = 14),
  axis.title = element_text(size = 14), axis.text = element_text(size = 12),
  legend.position = "bottom", legend.title = element_text(size = 14),
  legend.text = element_text(size = 12), legend.key.size = unit(0.8, "cm")
)

# Now we can produce the graph
ggplot(summ_table) +
  geom_point(
    mapping = aes(x = month, y = total_count / 1000),
    color = "olivedrab", size = 3
  ) +
  geom_smooth(
    method = loess, formula = "y ~ x",
    mapping = aes(x = month, y = total_count / 1000), color = "navy", se = FALSE
  ) +
  labs(x = "Period", y = "Number of rides (in thousands)") +
  scale_x_date(date_breaks = "2 months", date_labels = "%b %Y") +
  scale_y_continuous(breaks = seq(0, 800, by = 100)) +
  tema +
  ggtitle("Total number of rides per month")

We can clearly see that there must be a yearly seasonality in this data; an important point to keep in mind for the analysis phase.

We can also include some information about the missing values. When we checked that information before, we noticed that the missing values always came in pairs, in the sense that if the station name was missing, start or end, the respective station id was missing too, and if one coordinate was missing, so was the other. That will help us condense the names of the new columns for the summary table. First we’ll set up a function to get the missing values for the columns of one table, and then we can run it through a loop to get the values for all the periods.

set_miss_cols <- function(i) {
  # Function to get the missing values in a period
  
  miss_vals <- br_dt_sm[[i]][, lapply(.SD, n_miss), 
                .SDcols = c("start_station_id", "end_station_id", "start_lat", "end_lat")]
  names(miss_vals) <- c("miss_start_stn", "miss_end_stn", "miss_start_coord", "miss_end_coord")
  miss_vals
}
set_miss_cols(k)
     miss_start_stn miss_end_stn miss_start_coord miss_end_coord
              <int>        <int>            <int>          <int>
  1:          92944       100152                0           1055

The result is a table with just one row for that period. Now we can run the function for all the periods to get the same observations for every month and bind them into a single table. Finally, once we have that info, we can bind it to the summary table:

          period total_count miss_start_stn miss_end_stn miss_start_coord
          <char>       <int>          <int>        <int>            <int>
   1:  2021-oct.      631226         108210       114834                0
   2:  2021-nov.      359978          75290        79187                0
   3:  2021-dic.      247540          51063        53498                0
   4:  2022-ene.      103770          16260        17927                0
   5:  2022-feb.      115609          18580        20355                0
   6:  2022-mar.      284042          47246        51157                0
   7:  2022-abr.      371249          70887        75288                0
   8:  2022-may.      634858          86704        93171                0
   9:  2022-jun.      769204          92944       100152                0
  10:  2022-jul.      823488         112031       120951                0
  11:  2022-ago.      785932         112037       120522                0
  12: 2022-sept.      701339         103780       111185                0
      miss_end_coord
               <int>
   1:            484
   2:            191
   3:            144
   4:             86
   5:             77
   6:            266
   7:            317
   8:            722
   9:           1055
  10:            947
  11:            843
  12:            712
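
The loop and the binding are not shown; a sketch of how those one-row results could be stacked and attached to the summary table:

# Run set_miss_cols() for every period, stack the one-row tables and
# bind the result to the summary table (a sketch)
miss_cols <- rbindlist(lapply(seq_along(br_dt_sm), set_miss_cols))
summ_table <- cbind(summ_table, miss_cols)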

In the case of the starting stations, I think it should be possible to fill in all the missing info by finding the closest station to the reported coordinates. For the ending stations, on the other hand, we first need to check whether the observations that are missing the station info are also missing the coordinates; let’s see:

The following list shows the number of observations per period that miss end-station information and end coordinates at the same time:

oct_2021 - 484
nov_2021 - 191
dic_2021 - 144
ene_2022 - 86
feb_2022 - 77
mar_2022 - 266
abr_2022 - 317
may_2022 - 722
jun_2022 - 1055
jul_2022 - 947
ago_2022 - 843
sept_2022 - 712
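
The chunk that produced this list is not shown; the count itself can be sketched in one line:

# Rows per period missing both the end-station info and the end coordinates (a sketch)
sapply(br_dt_sm, \(DT) DT[is.na(end_station_name) & is.na(end_lat), .N])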

We can clearly see that these are the same numbers of observations we have in the last column (miss_end_coord) of the previous table. This means that we won’t be able to complete any values for those observations. Luckily, they are very few compared with the size of each set:

          period total_count miss_end_coord incompletable %
          <char>       <int>          <int>           <num>
   1:  2021-oct.      631226            484          0.0767
   2:  2021-nov.      359978            191          0.0531
   3:  2021-dic.      247540            144          0.0582
   4:  2022-ene.      103770             86          0.0829
   5:  2022-feb.      115609             77          0.0666
   6:  2022-mar.      284042            266          0.0936
   7:  2022-abr.      371249            317          0.0854
   8:  2022-may.      634858            722          0.1137
   9:  2022-jun.      769204           1055          0.1372
  10:  2022-jul.      823488            947          0.1150
  11:  2022-ago.      785932            843          0.1073
  12: 2022-sept.      701339            712          0.1015
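
That last column is straightforward to derive from the summary table; a sketch of the computation:

# Share of rides per period whose end station cannot be completed (a sketch)
summ_table[, .(period, total_count, miss_end_coord,
               `incompletable %` = round(100 * miss_end_coord / total_count, 4))]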

As we can see, even in the worst case (jun. 2022) we will still have 99.86% of the observations, which is awesome! That is, of course, if we are able to complete some of the missing observations based on the information reported.

In the next parts we will continue producing new information about the data, but now we have the summary table we just created to keep storing that new info.

Back to the Table of Contents

5.2 Correcting tables inconsistencies

In section 3.4 we checked that the “key” column ride_id contained unique values. Apart from that column, we now need to make sure that the station_name and station_id columns are consistent across all tables and have a one-to-one correspondence. While working on the process to complete the missing information for each period, I found some inconsistencies in the data contained in those two columns.

Back to the Table of Contents

5.2.1 Checking by the ids

To start, there are ids that end in “.0”, so we need to strip that suffix; but there is the exceptional case of station “Pulaski Rd & 21st St”, which has id “331.0”. Since the station “Halsted St & Clybourn Ave” already has id “331”, if I just remove the “.0” both will end up with the same id. In this particular case, instead of removing the “.0”, I’m going to set the id to “21331”, because that is the id that appears in the stations information set that I’ll download in section 5.3.1, as we will see.
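
A sketch of that id cleanup on one column, handling the “331.0” exception before stripping the trailing “.0” (the helper name fix_ids is just illustrative):

# Fix the "331.0" exception first, then strip the trailing ".0" from the other ids (a sketch)
fix_ids <- function(id) {
  id <- replace(id, id == "331.0", "21331")
  sub("\\.0$", "", id)
}

# e.g. applied to one period's table:
# br_dt_sm[[k]][, start_station_id := fix_ids(start_station_id)]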

The next two particular cases have to do with codes “TA1306000029” and “13099”, since each of them corresponds to two different stations in some sets. Let’s check the dic_2021_rides_sm set:

                        station_name   station_id
                              <char>       <char>
  1:            Halsted St & 18th St        13099
  2:     Halsted St & 18th St (Temp)        13099
  3: DuSable Lake Shore Dr & Ohio St TA1306000029
  4:            McClurg Ct & Ohio St TA1306000029

By checking the names related to code “13099”, one can see that both refer to the same station; in this case I think it is safe to leave the code as it is, because in both cases we can assign the same coordinates.

Checking the information about code “TA1306000029”, I found that in the first two periods (October and November 2021) this code corresponds only to station “DuSable Lake Shore Dr & Ohio St”. Reviewing the Info_Station set and some 2022 period sets, that code corresponds only to station “McClurg Ct & Ohio St”. Given that this last station_name is shared in all the sets except the first two, I think the best option is to remove the code from station “DuSable Lake Shore Dr & Ohio St” (assigning NA instead) in the periods in which it appears.

Another two particular cases are the stations with station_id “1033” and “1038”, whose names contain obvious encoding typos in some sets; let’s check the sept_2022_rides_sm table, which has both versions of each name (the correct and the incorrect):

                         start_station_name start_station_id
                                     <char>           <char>
  1:     Public Rack - Pulaski Rd & 65th St             1033
  2: Public Rack - Pulaski Rd &amp; 65th St             1033
  3:     Public Rack - Kedzie Ave & 62nd Pl             1038
  4: Public Rack - Kedzie Ave &amp; 62nd Pl             1038
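
Those HTML-escaped ampersands can be cleaned with a simple substitution; a sketch applied to one period's name columns:

# Undo the HTML-escaped ampersands in both name columns of one table (a sketch,
# here applied to the sept_2022 table, index 12)
for (col in c("start_station_name", "end_station_name")) {
  br_dt_sm[[12]][, (col) := gsub("&amp;", "&", get(col), fixed = TRUE)]
}
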
Back to the Table of Contents

5.2.2 Checking by the names

Now we need to check the opposite case to the one in the previous section: stations that have the same name but different station_id. To do this, we first need to get all the unique combinations of names and ids contained in each table; this function will help us with that:

list_stations <- function(i) {
  # The function will take the index of the df as input to get the data table (DT)
  DT <- as.data.table(br_dt_sm[[i]])

  # Get the names and ids from all the reported stations in that period
  stations <- rbind(
    na.omit(DT[,
      .(
        station_name = start_station_name,
        station_id = start_station_id
      )
    ], cols = "station_name"),
    na.omit(DT[,
      .(
        station_name = end_station_name,
        station_id = end_station_id
      )
    ], cols = "station_name")
  )
  
  # Remove repeated rows
  stations <- unique(stations, by = c("station_name", "station_id"))

  # The function will return a df with the stations reported each period
  stations
}

Now we have a function that creates a table with all the station names and ids reported each month. Based on the result of this function, we can get the names that correspond to more than one id.

We have 16 pairs of station names, each with two different station_id values. To decide which id we should keep, we can check which one appears most often across all the period sets:

get_rep_sttns <- function(a_list) {
  # Takes a list as input
  
  # Create a list of repeated stations with different ids per period
  obtain_duplis <- function(i) {
    stations <- list_stations(i)
    reps <- stations[duplicated(station_name), station_name]
    duplis <- stations[station_name %in% reps, ]
    duplis[station_name %in% reps, "count" := rep(0)]
  }
  duplis <- lapply(seq_along(a_list), \(i) obtain_duplis(i))
  
  # Combine all repeated data and count which ids appear in which periods
  repeated <- as.data.table(bind_rows(duplis))
  repeated <- unique(repeated)
  for (i in 1:length(a_list)) {
    repeated[, count := count + as.numeric(repeated$station_id %in% list_stations(i)$station_id)]
  }
  repeated[order(station_name), ]
}
namids_reps <- get_rep_sttns(br_dt_sm)
namids_reps
                         station_name   station_id count
                               <char>       <char> <num>
   1:                    Bradley Park        20227     9
   2:                    Bradley Park          633     1
   3:      California Ave & Cortez St        17660    12
   4:      California Ave & Cortez St          512     5
   5:           Calumet Ave & 51st St          813     5
   6:           Calumet Ave & 51st St        15470    12
   7:           Calumet Ave & 71st St          728     2
   8:           Calumet Ave & 71st St        15599    10
   9:    Central Park Ave & Ogden Ave        15685    12
  10:    Central Park Ave & Ogden Ave          532     5
  11:   Christiana Ave & Lawrence Ave        15615    12
  12:   Christiana Ave & Lawrence Ave          860     5
  13:          East End Ave & 87th St        20231    12
  14:          East End Ave & 87th St          883     3
  15:         Eggleston Ave & 92nd St        20118    12
  16:         Eggleston Ave & 92nd St          707     2
  17:           Halsted St & 111th St        20127    12
  18:           Halsted St & 111th St          875     5
  19:          Jeffery Blvd & 67th St KA1503000030    12
  20:          Jeffery Blvd & 67th St          753     5
  21:           Kostner Ave & Lake St          536    12
  22:           Kostner Ave & Lake St          851     4
  23:         Lake Park Ave & 47th St TA1308000035    12
  24:         Lake Park Ave & 47th St          812     5
  25: Lakefront Trail & Bryn Mawr Ave KA1504000152     5
  26: Lakefront Trail & Bryn Mawr Ave        15576     8
  27:         Lawndale Ave & 111th St          893     3
  28:         Lawndale Ave & 111th St        20203    12
  29:     Prairie Ave & Garfield Blvd TA1307000160    12
  30:     Prairie Ave & Garfield Blvd          906     4
  31:            Wabash Ave & 87th St          595    12
  32:            Wabash Ave & 87th St          644    12
                         station_name   station_id count

The column count gives us the total number of periods in which each id shows up. With this information it is easy to decide which id to keep: the one that repeats the most; except for the last case, since both ids show up in all the periods. In this case, again checking the stations information set that I’ll download in section 5.3.1, the id “595” is the one that appears, so that’s the one we’ll keep. It’s worth mentioning that all the ids that repeat the most across the periods are also the ones contained in that same set we will download, as we’ll see in that section.

Having decided what we need to do, let’s rearrange the information so we can use it in the next part:
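
The chunk that produced the following table is not shown in this output; as a minimal sketch, assuming the result is stored in order_reps (the name used later by fix_tables), it could be built from namids_reps along these lines:

# Sketch (assumption): for each name keep the id with the highest count as 'keep_id'
# and the other one as 'discard_id'; the tie for "Wabash Ave & 87th St" is resolved
# explicitly in favour of "595", the id present in the Info_Station set (section 5.3.1)
order_reps <- namids_reps[order(station_name, -count)][
  , .(keep_id = station_id[1], discard_id = station_id[2]), by = station_name]
order_reps[station_name == "Wabash Ave & 87th St", `:=`(keep_id = "595", discard_id = "644")]
order_reps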

                         station_name      keep_id   discard_id
                               <char>       <char>       <char>
   1:                    Bradley Park        20227          633
   2:      California Ave & Cortez St        17660          512
   3:           Calumet Ave & 51st St        15470          813
   4:           Calumet Ave & 71st St        15599          728
   5:    Central Park Ave & Ogden Ave        15685          532
   6:   Christiana Ave & Lawrence Ave        15615          860
   7:          East End Ave & 87th St        20231          883
   8:         Eggleston Ave & 92nd St        20118          707
   9:           Halsted St & 111th St        20127          875
  10:          Jeffery Blvd & 67th St KA1503000030          753
  11:           Kostner Ave & Lake St          536          851
  12:         Lake Park Ave & 47th St TA1308000035          812
  13: Lakefront Trail & Bryn Mawr Ave        15576 KA1504000152
  14:         Lawndale Ave & 111th St        20203          893
  15:     Prairie Ave & Garfield Blvd TA1307000160          906
  16:            Wabash Ave & 87th St          595          644

One last error in the sept_2022_rides_sm names set is a rather unfortunate typo in one of the “Public Rack” names:

                       station_name station_id
                             <char>     <char>
  1: Pubic Rack - Pulaski Rd & 41st        835

Apart from the errors mentioned, there are some rather fancy descriptive names that are not very practical for the analysis, e.g., these taken from the jun_2022_rides_sm set:

                         start_station_name start_station_id
                                     <char>           <char>
  1:      63rd & Western Ave - north corner              739
  2: Ashland Ave & 45th St - midblock south              845
  3:      63rd & Western Ave - south corner              738

In these cases, I’m just going to replace the excess wording with the initial of the corresponding cardinal point, i.e., “- north corner” will become “N”. Apart from these 3 cases, there are some more station names that include descriptors like “- midblock” or the full name of the cardinal point; in the first case I decided to erase the string, and in the other cases, which are consistent with the information that we will download from the site, I decided to leave the names as they are.

Back to the Table of Contents

5.2.3 Warehouses

Finally, there are some cases that will need special treatment: the warehouses that show up in the list of stations. Let’s check the information for the jun_2022_rides_sm:

  # A tibble: 6 × 4
  # Groups:   start_station_name [6]
    start_station_name              start_station_id             mean_lng mean_lat
    <chr>                           <chr>                           <dbl>    <dbl>
  1 Base - 2132 W Hubbard           Hubbard Bike-checking (LBS-…    -87.7     41.9
  2 Base - 2132 W Hubbard Warehouse Hubbard Bike-checking (LBS-…    -87.7     41.9
  3 MTV Hubbard St                  021320                          -87.7     41.9
  4 NewHastings                     2059 Hastings Warehouse Sta…    -87.7     41.9
  5 WEST CHI-WATSON                 DIVVY 001 - Warehouse test …    -87.7     41.9
  6 WestChi                         DIVVY 001 - Warehouse test …    -87.8     41.9

We can see that in almost all these cases the name or the id makes it clear that these are warehouses or repair stations. The only one that may seem confusing is “MTV Hubbard St”, which has a 6-digit number as the id; but if we strip the zeros from that number (leaving 2132) and add it to the name, we get the same address reported for the “Base” cases (2132 Hubbard St), and if we search for “MTV Hubbard St” on the internet we find the Divvy page with that same address (Divvy in Chicago, IL 60612 — Chamberofcommerce.com).

Let’s check on the map the case of the warehouse identified with the name “WestChi” in the jun_2022_rides_sm table. We’ll need a function to generate the map; we can write a very simple one for now, using the leaflet package (Cheng, Schloerke, et al. 2024):

make_map2 <- function(label1, lng1, lat1, label2, lng2, lat2) {
  mapa <- leaflet() %>% addProviderTiles(providers$OpenStreetMap) %>%
    addCircleMarkers(lng = lng1, lat = lat1, 
                     color = "darkgreen", radius = 5, 
                     label = as.character(label1)) %>%
    addCircleMarkers(lng = lng2, lat = lat2, 
                     color = "darkred", radius = 5, 
                     label = as.character(label2))
  mapa
}

Now we are able to check the points that show up using “WestChi” as the start_station_name:

westchi_map <-
make_map2(br_dt_sm[[9]][str_detect(start_station_name,"WestChi"), start_station_name],
         br_dt_sm[[9]][str_detect(start_station_name,"WestChi"), start_lng],
         br_dt_sm[[9]][str_detect(start_station_name,"WestChi"), start_lat],
         "Wrong point (mean coordinates of the 3 WestChi points)",
         wh_dt_mean[start_station_name == "WestChi", mean_lng],
         wh_dt_mean[start_station_name == "WestChi", mean_lat]
         )
westchi_map

If you zoom in on any point, you’ll discover a different station in every case. I suppose the bikes were taken from those stations to the warehouse for repairs (or some similar situation). That being the case, and since that “station” name doesn’t appear in the names set we will download, we won’t have a way to find its coordinates. Plus, the mean or any other measure based on that info (as in the table presented before) will give us an erroneous point (the red circle on the map).

Apart from the “Base” address, we also have “2059 Hastings St” for the “NewHastings” warehouse. With the help of online maps we can easily find the coordinates for those two points, shown in the next map:

Assuming that, for the analysis, the rides related to those not-quite “stations” will need special treatment, I think it’s safe to use the same coordinates for most of them (except “NewHastings”), and even set a common id for all of them: “warehouse”. That way we know that those rides are going to constitute a particular group.

Having resolved that case, we can set the information for those “stations” as follows:
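
The chunk that builds this small lookup is not shown either; a minimal sketch, assuming the object is called warehouses (the name used later by fix_tables), would be:

# Sketch (assumption): lookup for the warehouse "stations", using the Hubbard address
# coordinates for all of them except "NewHastings" (2059 Hastings St)
warehouses <- data.table(
  station_name = c("Base - 2132 W Hubbard", "Base - 2132 W Hubbard Warehouse",
                   "DIVVY CASSETTE REPAIR MOBILE STATION", "MTV Hubbard St",
                   "NewHastings", "WEST CHI-WATSON", "WestChi"),
  station_id = "warehouse",
  longitude = c(rep(-87.68028, 4), -87.67984, -87.68028, -87.68028),
  latitude = c(rep(41.88986, 4), 41.86344, 41.88986, 41.88986)
)
warehouses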

                             station_name station_id longitude latitude
                                   <char>     <char>     <num>    <num>
  1:                Base - 2132 W Hubbard  warehouse -87.68028 41.88986
  2:      Base - 2132 W Hubbard Warehouse  warehouse -87.68028 41.88986
  3: DIVVY CASSETTE REPAIR MOBILE STATION  warehouse -87.68028 41.88986
  4:                       MTV Hubbard St  warehouse -87.68028 41.88986
  5:                          NewHastings  warehouse -87.67984 41.86344
  6:                      WEST CHI-WATSON  warehouse -87.68028 41.88986
  7:                              WestChi  warehouse -87.68028 41.88986

Back to the Table of Contents

5.2.4 Cleaning the tables

The next step is to clean all the inconsistencies discovered and take care of the special cases in the tables for each period. To accomplish this we need to define a function that helps us with that and then run it for all the periods. Let’s define the function:

fix_tables <- function(i) {
  # The function will take the DT index as input
  
  DT <- as.data.table(br_dt_sm[[i]])

  # Assign the 'station_id' for the special case
  if (nrow(DT[
    start_station_id == "331.0" | end_station_id == "331.0",
    .(start_station_id, end_station_id)
  ]) > 0) {
    DT[start_station_id == "331.0", start_station_id := "21331"]
    DT[end_station_id == "331.0", end_station_id := "21331"]
  }

  # Remove the ending decimal zero in the ids that have it
  DT[, start_station_id := str_remove_all(DT$start_station_id, "\\.0$")][, end_station_id :=
    str_remove_all(DT$end_station_id, "\\.0$")]
  
  # Set the conflicting station id to 'NA'
  if (nrow(DT[start_station_name == "DuSable Lake Shore Dr & Ohio St" |
    end_station_name == "DuSable Lake Shore Dr & Ohio St",
    .(start_station_name, end_station_name)
  ]) > 0) {
    DT[start_station_name == "DuSable Lake Shore Dr & Ohio St", start_station_id := NA]
    DT[end_station_name == "DuSable Lake Shore Dr & Ohio St", end_station_id := NA]
  }

  # Correct the typos in the names
  if (nrow(DT[
    start_station_name == "Public Rack - Kedzie Ave &amp; 62nd Pl" |
      end_station_name == "Public Rack - Kedzie Ave &amp; 62nd Pl",
    .(start_station_name, end_station_name)
  ]) > 0) {
    DT[
      start_station_name == "Public Rack - Kedzie Ave &amp; 62nd Pl",
      start_station_name := "Public Rack - Kedzie Ave & 62nd Pl"
    ]
    DT[
      end_station_name == "Public Rack - Kedzie Ave &amp; 62nd Pl",
      end_station_name := "Public Rack - Kedzie Ave & 62nd Pl"
    ]
  }
  if (sum(str_detect(na.omit(DT$start_station_name), "Pubic"),
          str_detect(na.omit(DT$end_station_name), "Pubic")) > 0){
    DT[, start_station_name := str_replace_all(DT$start_station_name, "Pubic", "Public")]
    DT[, end_station_name := str_replace_all(DT$end_station_name, "Pubic", "Public")]
  }
  
  # Unify the station_id for the stations with more than one id
  if (sum((DT$start_station_id %in% order_reps$discard_id),
          (DT$end_station_id %in% order_reps$discard_id)) > 0) {
    for (stn_id in order_reps$discard_id) {
      DT[start_station_id == stn_id,
              start_station_id := order_reps[discard_id == stn_id, keep_id]
             ][
              end_station_id == stn_id,
              end_station_id := order_reps[discard_id == stn_id, keep_id]
             ]
    }
  }
  
  # Change the fancy names
  if (sum(
    str_detect(na.omit(DT$start_station_name), "-\\snorth\\scorner"),
    str_detect(na.omit(DT$end_station_name), "-\\snorth\\scorner")
  ) > 0) {
    DT[, 
      start_station_name := str_replace_all(DT$start_station_name, "-\\snorth\\scorner", "N")
    ][, 
      start_station_name := str_replace_all(DT$start_station_name, "-\\ssouth\\scorner", "S")
    ][, 
      end_station_name := str_replace_all(DT$end_station_name, "-\\ssouth\\scorner", "S")
    ][,
      end_station_name := str_replace_all(DT$end_station_name, "-\\snorth\\scorner", "N")
    ]
  }
  if (sum(
    str_detect(na.omit(DT$start_station_name), "-\\smidblock\\ssouth"),
    str_detect(na.omit(DT$end_station_name), "-\\smidblock\\ssouth")
  ) > 0) {
    DT[, 
      start_station_name := str_replace_all(DT$start_station_name, "-\\smidblock\\ssouth", "S")
    ][, 
      end_station_name := str_replace_all(DT$end_station_name, "-\\smidblock\\ssouth", "S")
    ]
  }
  if (sum(
    str_detect(na.omit(DT$start_station_name), "^City\\sRack"),
    str_detect(na.omit(DT$end_station_name), "^City\\sRack")
  ) > 0) {
    DT[, 
      start_station_name := str_replace_all(DT$start_station_name, "^City\\sRack", "Public Rack")
    ][, 
      end_station_name := str_replace_all(DT$end_station_name, "^City\\sRack", "Public Rack")
    ]
  }
  if (sum(
    str_detect(na.omit(DT$start_station_name), "\\s-\\smidblock$"),
    str_detect(na.omit(DT$end_station_name), "\\s-\\smidblock$")
  ) > 0) {
    DT[, 
      start_station_name := str_remove_all(DT$start_station_name, "\\s-\\smidblock$")
    ][, 
      end_station_name := str_remove_all(DT$end_station_name, "\\s-\\smidblock$")
    ]
  }
  
  # Assign the id and coordinates for the warehouses
  if (sum((DT$start_station_name %in% warehouses$station_name),
          (DT$end_station_name %in% warehouses$station_name)) > 0) {
    for (wh_name in warehouses$station_name) {
      id <- warehouses[station_name == wh_name, station_id]
      lat <- warehouses[station_name == wh_name, latitude]
      lng <- warehouses[station_name == wh_name, longitude]
      DT[start_station_name == wh_name, `:=`(
        start_station_id = id,
        start_lat = lat,
        start_lng = lng
      )][end_station_name == wh_name, `:=`(
        end_station_id = id,
        end_lat = lat,
        end_lng = lng
      )]
    }
  }

  # The function will correct the original table and return it
  writeLines(paste("Table", names(br_dt_sm)[[i]], "has been corrected"))
  DT
}

Now we can loop over all the tables to apply the function and make the corrections. Since fix_tables takes the list index as input, this time I’ll map over the indexes with the map function from the purrr package (Wickham and Henry 2025), keeping the list names: br_dt_sm <- map(seq_along(br_dt_sm), fix_tables) %>% set_names(names(br_dt_sm))

Back to the Table of Contents

5.3 Process to complete the lookup tables

As happened with some of the warehouses we fixed in the previous section, not all the stations have the correct coordinates reported in the tables.

The latitude and longitude values repeat all the time and are truncated or rounded to 2 decimal places; you can clearly see it in the next table, taken from the December 2021 set. Let’s check the case of the station with id “379”:

st_Rockwell_Archer <- rbind(
  na.omit(br_dt_sm[[3]][start_station_id == "379", .(station_name = start_station_name,
                              station_id = start_station_id, lng = start_lng, lat = start_lat),
                            ], cols="station_id"),
  na.omit(br_dt_sm[[3]][end_station_id == "379", .(station_name = end_station_name,
                              station_id = end_station_id, lng = end_lng, lat = end_lat)
                           ], cols="station_id")
)
st_Rockwell_Archer
                  station_name station_id    lng   lat
                        <char>     <char>  <num> <num>
   1: Rockwell St & Archer Ave        379 -87.69 41.82
   2: Rockwell St & Archer Ave        379 -87.69 41.82
   3: Rockwell St & Archer Ave        379 -87.69 41.82
   4: Rockwell St & Archer Ave        379 -87.69 41.82
   5: Rockwell St & Archer Ave        379 -87.69 41.82
   6: Rockwell St & Archer Ave        379 -87.69 41.82
   7: Rockwell St & Archer Ave        379 -87.69 41.82
   8: Rockwell St & Archer Ave        379 -87.69 41.82
   9: Rockwell St & Archer Ave        379 -87.69 41.82
  10: Rockwell St & Archer Ave        379 -87.69 41.82
  11: Rockwell St & Archer Ave        379 -87.69 41.82
  12: Rockwell St & Archer Ave        379 -87.69 41.82
  13: Rockwell St & Archer Ave        379 -87.69 41.82
  14: Rockwell St & Archer Ave        379 -87.69 41.82
  15: Rockwell St & Archer Ave        379 -87.69 41.82
  16: Rockwell St & Archer Ave        379 -87.69 41.82
  17: Rockwell St & Archer Ave        379 -87.69 41.82
  18: Rockwell St & Archer Ave        379 -87.69 41.82

Checking the area with the help of Google Maps (S Archer Ave & S Rockwell St · Chicago, IL 60632 — 41.8229955,-87.6922987,17z) and leaflet maps (Cheng, Schloerke, et al. 2024), I was able to find the correct coordinates for the location of the station. If we put the information on the map, this is what we’ll find:

The green circle is the actual station site; the red circle is the location based on the reported coordinates. It is not very far, but it doesn’t work for what we need, and this is just one of several stations with the same problem in the reported coordinates. Therefore, we need another way of finding the correct coordinates for the stations. Furthermore, we will find that this same issue repeats in the stations that are missing the name and id.

Back to the Table of Contents

5.3.1 Importing a file with the stations information

Luckily, the same website from which we got the tables, https://data.cityofchicago.org (City of Chicago | Data Portal — Data.cityofchicago.org), offers the station information in JSON format. Let’s get that station information:

# Import the stations information in .json format:
# sttns_inf_json <- import("https://gbfs.lyft.com/gbfs/1.1/chi/en/station_information.json")
# The info in the website has changed, I am going to use an old copy I had downloaded before
sttns_inf_json <- import("../input/Info_Stations_list_1833.rds", trust = TRUE)
# Extract the columns we need from there:
Info_Station <- as.data.table(sttns_inf_json[[1]])[, .(
  station_name = stations.name,
  station_id = stations.short_name,
  longitude = stations.lon,
  latitude = stations.lat
)]
head(Info_Station, 10)
                              station_name   station_id longitude latitude
                                    <char>       <char>     <num>    <num>
   1:                  Damen Ave & 51st St          554 -87.67468 41.80091
   2: Damen Ave & Thomas St (Augusta Blvd) TA1307000070 -87.67741 41.90131
   3:            Lowell Ave & Armitage Ave      1524189 -87.73535 41.91708
   4:         Kilbourn Ave & Milwaukee Ave        16994 -87.74001 41.94707
   5:        Christiana Ave & Lawrence Ave        15615 -87.71183 41.96835
   6:          Cicero Ave & Wrightwood Ave        23180 -87.74673 41.92791
   7:               Wood St & Augusta Blvd          657 -87.67220 41.89918
   8:             University Ave & 59th St        22003 -87.59891 41.78788
   9:                  Hale Ave & 107th St        20108 -87.66892 41.69921
  10:                    Malcolm X College          631 -87.67390 41.87762

I took the id for the “Pulaski Rd & 21st St” station from this set, as we can see:

Info_Station[str_detect(Info_Station$station_name, "Pulaski") &
  str_detect(Info_Station$station_name, "21st"), .(station_name, station_id)]
             station_name station_id
                   <char>     <char>
  1: 21st St & Pulaski Rd      21331

We can also see the station_id for the stations that had more than one id before we corrected it:

  Index: <discard_id>
                         station_name      keep_id   discard_id Info_Station_id
                               <char>       <char>       <char>          <char>
   1:                    Bradley Park        20227          633           20227
   2:      California Ave & Cortez St        17660          512           17660
   3:           Calumet Ave & 51st St        15470          813           15470
   4:           Calumet Ave & 71st St        15599          728           15599
   5:    Central Park Ave & Ogden Ave        15685          532           15685
   6:   Christiana Ave & Lawrence Ave        15615          860           15615
   7:          East End Ave & 87th St        20231          883           20231
   8:         Eggleston Ave & 92nd St        20118          707           20118
   9:           Halsted St & 111th St        20127          875           20127
  10:          Jeffery Blvd & 67th St KA1503000030          753    KA1503000030
  11:           Kostner Ave & Lake St          536          851             536
  12:         Lake Park Ave & 47th St TA1308000035          812    TA1308000035
  13: Lakefront Trail & Bryn Mawr Ave        15576 KA1504000152           15576
  14:         Lawndale Ave & 111th St        20203          893           20203
  15:     Prairie Ave & Garfield Blvd TA1307000160          906    TA1307000160
  16:            Wabash Ave & 87th St          595          644             595

Coincidentally, all the ids that repeated the most across the periods are the ones contained in the Info_Station ids set; and, since id “644” doesn’t appear in this set, we chose id “595” for the last case (both of those two ids showed up in all 12 period sets).

Having this new set of information, what we can do is create a function that gets the names and ids of the stations reported each month using the list_stations function and, if possible, gets the coordinates reported in this new Info_Station set to build the lookup tables for the observations of each period. However, this is not going to be easy, precisely because some names and ids have changed over time or have a different spelling in the sets and on the webpage, e.g.: “Damen Ave & Coulter St” vs. “Damen Ave/Coulter St”. For this reason I think it is better to keep, as much as possible, the names and ids from the original sets of information for each period; although we have already changed one: “21331”.

Checking the info in the Info_Station set, there are two station_name values repeated, once with and once without the station_id (NA instead), and there is the case of the station “Burling St & Diversey Pkwy”, which has two different station_id values:

                   station_name   station_id longitude latitude
                         <char>       <char>     <num>    <num>
  1: Burling St & Diversey Pkwy        13208 -87.64776 41.93314
  2: Burling St & Diversey Pkwy TA1309000036 -87.64776 41.93314
  3:     Indiana Ave & 133rd St        24409 -87.61719 41.65380
  4:     Indiana Ave & 133rd St         <NA> -87.61705 41.65356
  5:      Western Ave & Lake St        24211 -87.68668 41.88481
  6:      Western Ave & Lake St         <NA> -87.68585 41.88461

Since the coordinates don’t seem too far apart, I decided to keep only the entries with complete information. Then, for the station “Burling St & Diversey Pkwy”, I decided to keep id “TA1309000036”, since it shows up across all the periods and id “13208” does not.

clean_Info_Statn <- function(DT){
  # Function to clean the repeated stations names and non corresponding ids
  
  # Check for repeated names
  rep_name <- DT[duplicated(DT$station_name), station_name]
  # Remove the duplicated names that don't have station ids, and the id "13208"
  DT <- DT[!(station_name %in% rep_name & is.na(station_id)), ][!(station_id %in% "13208"), ]
  DT
}
nrow(Info_Station)
  [1] 1833

Just to have an idea of the cleaning process, from the previous code chunk we can see the number of rows that the Info_Station set has before cleaning it. Let’s clean it and check again:

Info_Station <- clean_Info_Statn(Info_Station)
nrow(Info_Station)
  [1] 1830

After cleaning, we removed the two duplicated station names and the inconsistent id: 3 rows in total.

Another noticeable difference between the period sets and the Info_Station set is that many stations have had racks added close by since June 2022, and those racks are prefixed with “Public Rack -” in both the period and the Info_Station names sets. Since this prefix doesn’t exist in the period names before that month, I’ll remove it from the Info_Station names when comparing those initial sets. In order to do this, we need to prevent duplicates from appearing; for example, this would happen with the following two cases:

                             station_name station_id longitude latitude
                                   <char>     <char>     <num>    <num>
  1:                Western Ave & 62nd St       <NA> -87.68342 41.78099
  2:  Public Rack - Western Ave & 62nd St       <NA> -87.68376 41.78096
  3:               Hamlin Ave & Grand Ave       <NA> -87.72156 41.90404
  4: Public Rack - Hamlin Ave & Grand Ave       <NA> -87.72029 41.90374

Therefore, to make sure that we keep the station’s coordinates, we need to remove the rack entries that would duplicate names before removing the prefix. Let’s define the function to do this and try it, to see the number of rows after this change:

clean_racks_prefx <- function(DT) {
  # The function will work with a data table as input (Info_Station)
  # Collect the names that have the prefix
  racks <- DT[str_detect(DT$station_name, "Public Rack - "), station_name]
  # Get all the pairs (with and without the prefix)
  repeated <- DT[station_name %in% str_remove_all(racks, "Public Rack - "), station_name]
  # Keep the repeated ones without the prefix
  DT <- DT[!(station_name %in% paste0("Public Rack - ", repeated)), ]
  # Erase the prefix for all the rest that still have it, but didn't repeat
  DT[, station_name := str_remove_all(DT$station_name, "Public Rack - ")]
  DT <- unique(DT, by = c("station_name", "station_id"))
  DT
}
Info_Station2 <- clean_racks_prefx(Info_Station)
nrow(Info_Station2)
  [1] 1785

There were 45 racks with the same name as their station, leaving, after deleting the prefix, a list of 1785 station names in the Info_Station set. So, for the first periods, up to May 2022, this shorter version is the Info_Station set that will be used for the comparison and to get the coordinates from.

Back to the Table of Contents

5.3.2 Constructing a function to build the lookup tables

Now we can proceed to construct a function to create the lookup tables for the stations in every period. To make it easier, we are going to do it in parts.

5.3.2.1 Listing all the stations

To start, we’ll need to collect all the different stations reported in each period; we already have a function that unites the starting and the ending stations in just one set: the list_stations function. Another idea for this part is to keep the names and the ids from each period set as much as possible, in case the information contained in the Info_Station set differs from the one that appears in the period sets.

Since the list_stations function helps us collect the name and id of all the stations reported in each period, we can use it to check the total number of stations per period. Let’s check the total number of stations in the jun_2022_rides_sm table:

stations <- list_stations(k)
nrow(stations)
  [1] 1265

We can create a new column in the summ_table showing how many stations are available each month. The result is quite intriguing: it looks like some stations do not operate during the winter. Let’s check:
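
The exact chunk is not shown here; as a sketch, assuming summ_table is the summary data.table built in section 5.1, the new column could be added like this:

# Sketch (assumption): count the stations listed by 'list_stations' for each period
summ_table[, n_stations := sapply(seq_along(br_dt_sm), \(i) nrow(list_stations(i)))]
summ_table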

          period total_count n_stations
          <char>       <int>      <int>
   1:  2021 Oct.      631226        793
   2:  2021 Nov.      359978        820
   3:  2021 Dec.      247540        818
   4:  2022 Jan.      103770        777
   5:  2022 Feb.      115609        799
   6:  2022 Mar.      284042        844
   7:  2022 Apr.      371249        843
   8:  2022 May.      634858       1103
   9:  2022 Jun.      769204       1265
  10:  2022 Jul.      823488       1226
  11:  2022 Aug.      785932       1271
  12: 2022 Sept.      701339       1288

We can put the numbers in a graph to have a better view:

It’s not hard to imagine that the COVID-19 pandemic affected this business as it did many others, and that some of the long-term effects may still be showing at the beginning of the graph; that, added to the season of the year, may help explain the almost flat, slowly declining curve. But from May 2022 we can see a recovery in the number of stations functioning; most probably this is also related to seasonality, as the August-September decline may also suggest.

Back to the Table of Contents

5.3.2.2 Collecting the coordinates for shared names or ids

Having defined a way to compile all the stations for the periods, the next part is to collect the coordinates for all the stations that share the same name or id in both sets: the list of stations given by the list_stations function and the Info_Station df we got from the site.

Since some of the ids in the stations set differ from the ones in the Info_Station set by just 2 digits (added at the beginning of the Info_Station ids), I’ll use some regex matching to try to find coincidences. For this purpose I’ll use the str_detect() function from the package stringr (Wickham 2023b) and agrep() from base (R Core Team 2019).

To use the str_detect() function effectively, I’m going to create a list of strings to condense the station names to their main components, i.e., to get rid of cardinal point indicators (“E”, “N”, “S”, etc.) and street types (St, Rd, Dr, etc.), leaving only the main parts of the names. For that purpose, and based on the info collected from the station_name vectors of the periods themselves, I built the following set of strings (shrtn_addr) and a function to shorten the station_name strings in each period set:

# Vector to shorten the stations names in each period's set
shrtn_addr <- c(
  "Ave\\s", "Ave$", "Blvd\\s", "Blvd$", "Ct\\s", "Ct$", "Dr\\s", "Dr$", "^E\\s", "E\\s", "\\sE$",
  "\\s?-\\sEast$", "Hwy\\s", "Hwy$", "^N\\s", "N\\s", "\\sN$", "\\s?-\\sNE$", "\\s?-\\sNW$",
  "\\sNW$", "Pkwy\\s", "Pkwy$", "Pl\\s", "Pl$", "^Public\\sRack\\s-\\s?", "Rd\\s", "Rd$", "^S\\s",
  "S\\s", "\\sS$", "\\s?-\\sSE$", "Sq\\s", "Sq$", "St\\s", "St$", "\\s?-\\sSW$", "Ter\\s", "Ter$",
  "^W\\s", "W\\s", "\\sW$", "Way\\s", "Way$", "\\s?-\\sWest$", "\\s?\\(.+\\)$",
  "\\s-\\s.+Vaccination\\sSite"
)

shorten_n_split <- function(name, vector = shrtn_addr) {
  # Function to shorten the names (name) and split them if possible
  # This function takes 2 character vectors:
  # the 1st: the names to shorten; the 2nd: the strings to eliminate from the 1st.
  # It uses the 'shrtn_addr' vector as the 2nd by default
  
  name <- str_replace_all(name, "-\\sNorth$", "N")
  name <- str_remove_all(name, "\\(Murray\\)\\sPark$")
  for (i in 1:length(name)) {
    if (sum(str_detect(name[i], "\\s&\\s")) > 0) {
      # Loop through the strings to remove from each name
      for (str in vector) {
        name[i] <- str_remove_all(name[i], str)
      }
    }
  }
  name <- str_replace_all(name,'""','"')
  # Split the names if possible
  spl_names <- str_split(name, "\\s&\\s")
  
  # It returns a list with as many items as the length of the original 'name' vector,
  # with one string or two (if the 'name' can be split), in each position.
  spl_names
}

Let’s check some examples. In the first column you’ll see the original name; in the second column, the complete name (if it’s just one whole name) or the first part of it (if it was split); and in the third column, NA (if it was just one name) or the second part of the name (if it was split):
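
The table below was produced with a chunk that is not echoed; a rough equivalent, using a hypothetical helper show_split() and a random sample of names (so the exact rows may differ), could look like this:

# Hypothetical helper: put the original names next to the pieces returned by shorten_n_split
show_split <- function(names_vec) {
  pieces <- shorten_n_split(names_vec)
  data.table(
    Original_name = names_vec,
    Name_1 = sapply(pieces, `[`, 1),
    Name_2 = sapply(pieces, `[`, 2) # NA when the name could not be split
  )
}
show_split(sample(stations$station_name, 10))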

                                   Original_name                Name_1     Name_2
                                          <char>                <char>     <char>
   1:                          900 W Harrison St     900 W Harrison St       <NA>
   2:                     Halsted St & Willow St               Halsted    Willow 
   3:                  Sedgwick St & Webster Ave              Sedgwick   Webster 
   4:                   Broadway & Granville Ave              Broadway Granville 
   5:                       Halsted St & 35th St               Halsted      35th 
   6: Public Rack - Monticello Ave & Belmont Ave            Monticello   Belmont 
   7:                      Base - 2132 W Hubbard Base - 2132 W Hubbard       <NA>
   8:                        Troy St & North Ave                  Troy     North 
   9:                        MLK Jr Dr & 29th St                MLK Jr      29th 
  10:                Lincoln Ave & Sunnyside Ave               Lincoln Sunnyside

Let’s write a function to find the coordinates for the stations that share names or ids in both sets:

find_coord_by_name_id <- function(DT) {
  # The function takes the DT with the stations names and ids ("stations")

  # If the prefix "Public Rack" doesn't appear in the "stations" names
  # clean it from the "Info_Station" set
  if (sum(str_detect(DT$station_name, "Public Rack")) == 0) {
    Info_Station <- clean_racks_prefx(Info_Station)
  }

  # Get the stations info for the ones that have the same names in both sets
  statn_by_name <- Info_Station[station_name %in% DT$station_name]
  # Remove the ids that come from Info_Station
  statn_by_name$station_id <- NULL
  # Join the sets by the station_name column
  statn_by_name <- DT[statn_by_name, on = .(station_name)]
  # In case there are any, delete the repeated ones
  statn_by_name <- unique(statn_by_name, by = c("station_name", "station_id"))

  # For the missing stations, get info for the ones that have the same ids in both sets
  diff_name <- DT[!(station_name %in% statn_by_name$station_name)]
  # Remove the NAs before getting the coords
  diff_name <- diff_name[complete.cases(diff_name), ]
  statn_by_id <- Info_Station[station_id %in% diff_name$station_id]
  # Remove the names from 'Info_Station' and the code: "TA1307000138"
  # This code correspond to different stations in each set:
  # "Lincoln Ave & Roscoe St" in 'DT'; "Wood St & Webster Ave" in 'Info_Station'
  statn_by_id <- statn_by_id[, station_name := NULL][station_id != "TA1307000138", ]
  statn_by_id <- DT[statn_by_id, on = .(station_id)]
  # The 'station_id' "13099" is repeated with different station names:
  # "Halsted St & 18th St" and "Halsted St & 18th St (Temp)", both names with
  # the same id are included in the periods names data sets, leaving them untouched.

  # Bind the two pieces with complete info
  lookUp_st <- rbind(statn_by_name, statn_by_id)
  lookUp_st <- unique(lookUp_st, by = c("station_name", "station_id"))

  # Get again the ones still missing info
  missInfo <- DT[!(station_name %in% lookUp_st$station_name)]
  # Some station ids that have short codes were added two digits before the code in the
  # 'Info_Station' ids set (various of those ids had the ".0" ending), let's use regex
  # matching to check
  short_id <- missInfo[nchar(station_id) == 3, ]
  chck_regex <- sapply(short_id$station_id, \(sid) sid <- regex(paste0("^\\d{2}\\Q", sid, "\\E$")))
  index <- 1:length(chck_regex)
  for (i in 1:length(chck_regex)) {
    # Check if there are coincidences, and use the length to build the flow control
    indx <- agrep(chck_regex[i], Info_Station$station_id, fixed = FALSE, max.distance = 0)
    len <- length(indx)
    if (len == 0) { # No match
      index[i] <- NA
    } else { # There is at least one match
      # Use the 'shorten_n_split' function to get the main components of the name
      check4names <- unlist(shorten_n_split(short_id[i, station_name]))
      # Get the name from the 'Info_Station' set
      name_strs <- Info_Station[indx, station_name]
      # Exactly one match: check if the street names coincide or add 'NA' to the index vector
      if (len == 1) {
        if (length(check4names) == 1) {
          if (sum(str_detect(name_strs, check4names[1])) != 0) {
            index[i] <- indx
          } else {
            index[i] <- NA
          }
        } else if (length(check4names) > 1) {
          if (sum(
            str_detect(name_strs, check4names[1]),
            str_detect(name_strs, check4names[2])
          ) != 0) {
            index[i] <- indx
          } else {
            index[i] <- NA
          }
        }
        # More than one match, look for the one where the names coincide
      } else {
        # Logical vector for the matches and the 1st part of the name
        match_1 <- str_detect(name_strs, check4names[1])
        if (sum(match_1) == 0) { # If there are no matches for that part add NA to the result
          index[i] <- NA
        } else if (sum(match_1) == 1) { # If there's one coincidence for that part
          index[i] <- indx[match_1] # add the index to the result
        } else if (sum(match_1) > 1) {
          # If there are more than 1 coincidence check the second part of the name
          match_2 <- str_detect(name_strs, check4names[2])
          # Keep the index only when exactly one station matches both parts of the name
          if (sum(match_1 & match_2) == 1) {
            index[i] <- indx[match_1 & match_2] # Add the one where the 2 parts coincide
          } else {
            index[i] <- NA # or add 'NA' to the index vector
          }
        }
      }
    }
  }
  short_id <- cbind(short_id, Info_Station[index, .(longitude, latitude)])
  lookUp_st <- rbind(lookUp_st, short_id[complete.cases(short_id), ])

  # The function returns a DT with the coordinates added to the stations that were found
  lookUp_st
}
namids_found <- find_coord_by_name_id(stations)
head(namids_found, 10)
                              station_name   station_id longitude latitude
                                    <char>       <char>     <num>    <num>
   1:                  Damen Ave & 51st St          554 -87.67468 41.80091
   2: Damen Ave & Thomas St (Augusta Blvd) TA1307000070 -87.67741 41.90131
   3:         Kilbourn Ave & Milwaukee Ave        16994 -87.74001 41.94707
   4:        Christiana Ave & Lawrence Ave        15615 -87.71183 41.96835
   5:          Cicero Ave & Wrightwood Ave          547 -87.74673 41.92791
   6:               Wood St & Augusta Blvd          657 -87.67220 41.89918
   7:                  Hale Ave & 107th St        20108 -87.66892 41.69921
   8:                Dearborn St & Erie St        13045 -87.62932 41.89399
   9:                Calumet Ave & 71st St        15599 -87.61692 41.76551
  10:                Wood St & Chicago Ave          637 -87.67207 41.89563

Collecting the stations that are still missing the coordinates, as we will see next, shows that we have found coordinates for most of the stations in this jun_2022_rides_sm table, based on the names and ids reported in the data set:

                       station_name station_id
                             <char>     <char>
    1:        Kimbark Ave & 67th St        751
    2:      Marquette Ave & 79th St        668
    3:        Base - 2132 W Hubbard  warehouse
    4:        Prairie Ave & 47th St        814
    5: Monticello Ave & Belmont Ave        500
   ---                                        
  264:       Torrence Ave & 98th St        632
  265:             May St & 78th St        723
  266: Park Manor Elementary School        748
  267:    Central Ave & Corcoran Pl        793
  268:         Halsted St & 64th St        744

We are just missing 268 stations out of the 1265 that appear in this period.
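
For reference, the table above (the stations still missing coordinates) can be obtained with a simple anti-filter; a minimal sketch, assuming the object keeps the name stillMiss used in the next section:

# Sketch (assumption): the stations from this period that are not in 'namids_found' yet
stillMiss <- stations[!(station_name %in% namids_found$station_name)]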

Back to the Table of Contents

5.3.2.3 Using string detection and fuzzy matching

Now we can use str_detect() again and try some “fuzzy” matching to find more coincidences. As we saw, we are missing 268 stations in the case of the jun_2022_rides_sm table; let’s try to find some more correspondences by using the str_detect function again, but this time in conjunction with stringdist() and amatch() from the stringdist package (M. P. J. van der Loo 2014).

In this part we are going to use the str_detect() function again, to find where the two main parts of a period name have a coincidence in the Info_Station names set at the same time. There may be more than one coincidence; in those cases, we’ll use the stringdist function to get the closest match. In case there is only one whole name from the period set that cannot be split, we can also use str_detect and complement it with the amatch and stringdist functions to try to find similar names.

For the amatch and stringdist functions I decided to use the Damerau-Levenshtein method, reducing the penalties for deletions and insertions. The reason for reducing the deletion penalty is that we are already shortening the station names of each period to their main components, subtracting street types and cardinal points; so, in case such strings appear in the Info_Station names set, lowering the deletion penalty just compensates for that previous removal. The reason for reducing the insertion penalty is based on observation: most of the differences in the names are due to symbols or special characters in the Info_Station names, e.g., “-”, “*” or “/”, just to list some. Keeping this in mind, let’s define the function to try to find more matches:

find_comb <- function(DT) {
  # The function takes the df with the stations that need the coordinates (stillMiss)
  
  vctr_names <- shorten_n_split(DT$station_name)
  # Create a num vector for the results
  found_comb <- 1:length(vctr_names)
  # Loop through the list with the names
  for (i in found_comb) {
    # If the list item has more than one string
    if (length(vctr_names[[i]]) > 1) {
      # Detect where both strings coincide in the 'Info_Station' names vector and
      # get the index, there may be more that one
      numb <- which(str_detect(Info_Station$station_name, vctr_names[[i]][1]) &
                      str_detect(Info_Station$station_name, vctr_names[[i]][2]))
      if (length(numb) > 1) { 
        # If there are more than one matches, check which string is the closest
        reslts <- 1:length(numb)
        for (j in reslts) {
          reslts[j] <- stringdist(DT$station_name[i], Info_Station$station_name[numb[j]], 
                                  method = "dl", weight = c(d = 0.1, i = 0.5, s = 1, t = 1))
        }
        found_comb[i] <- numb[which.min(reslts)]
      } else if (length(numb) == 1) {
        # If there is just one match put it in the results
        found_comb[i] <- numb
      } else {
        # If there is not a match add an NA to the results
        found_comb[i] <- NA
      }
    # If there is only one item check if it is in the string
    } else if (length(vctr_names[[i]]) == 1) {
      if (sum(str_detect(Info_Station$station_name, vctr_names[[i]])) > 0){
        indxs<- which(str_detect(Info_Station$station_name, vctr_names[[i]]))
        if (length(indxs) > 1) {
          d_indx <- 1:length(indxs)
          for (j in d_indx) {
            d_indx[j] <- stringdist(Info_Station$station_name[indxs[j]], DT$station_name[i], 
                                    method = "dl", weight = c(d = 0.1, i = 0.5, s = 1, t = 1))
          }
          found_comb[i] <- indxs[which.min(d_indx)]
        } else {
          found_comb[i] <- indxs
        }
        # If the item is not in the string use the 'amatch' function
      } else {
        # The penalty for deletions and insertions has been reduced based on what has
        # been observed comparing the 'Info_Station' and the periods names sets
        fuzzy_name <- amatch(vctr_names[[i]], Info_Station$station_name,
                             method = "dl", weight = c(d = 0.1, i = 0.5, s = 1, t = 1),
                             maxDist = 1.5
        )
        # 'amatch' returns the matching index, or NA when there is no match within 'maxDist'
        if (!is.na(fuzzy_name)) {
          found_comb[i] <- fuzzy_name
        } else {
          # If not add an NA to the results
          found_comb[i] <- NA
        }
      }
    }
  }
  # The output is a num vector with the indexes for the matches or 'NA' otherwise
  found_comb
}
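
Before applying it, here is a small illustration, with made-up example strings, of how the reduced deletion and insertion penalties shrink the distance compared with the defaults:

# Illustration only: a shortened period-style name vs. a fuller 'Info_Station'-style spelling
stringdist("Damen & Coulter", "Public Rack - Damen Ave/Coulter St", method = "dl")
stringdist("Damen & Coulter", "Public Rack - Damen Ave/Coulter St",
           method = "dl", weight = c(d = 0.1, i = 0.5, s = 1, t = 1))
# The second call returns a smaller value, since every weight is at most 1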

Now we can try to get more coincidences with the function we just defined, going through the station names in the set that is still missing the coordinates: stillMiss. Let’s see:

                       station_name                              Info_Stn_name
                             <char>                                     <char>
    1:        Kimbark Ave & 67th St        Public Rack - Kimbark Ave & 67th St
    2:      Marquette Ave & 79th St      Public Rack - Marquette Ave & 79th St
    3:        Base - 2132 W Hubbard                                       <NA>
    4:        Prairie Ave & 47th St       Public Rack - Prairie Ave & 47th St 
    5: Monticello Ave & Belmont Ave Public Rack - Monticello Ave & Belmont Ave
   ---                                                                        
  264:       Torrence Ave & 98th St       Public Rack - Torrence Ave & 98th St
  265:             May St & 78th St             Public Rack - May St & 78th St
  266: Park Manor Elementary School Public Rack - Park Manor Elementary School
  267:    Central Ave & Corcoran Pl    Public Rack - Central Ave & Corcoran Pl
  268:         Halsted St & 64th St         Public Rack - Halsted St & 64th St

So far, it works with the cases we checked, so we can write a function to find coordinates this way:

find_match <- function(DT) {
  # The function takes a data.table as argument (stillMiss)

  # Get the matches' indexes using the 'find_comb' function
  indxs <- find_comb(DT)

  # Add the respective longitude and latitude to the DT
  found_mtchs <- cbind(DT, Info_Station[indxs, .(longitude, latitude)])

  # The function returns a data table with the coordinates info for the matches (if any)
  found_mtchs[complete.cases(found_mtchs), ]
}

We can add this new table to the one we had before with the coordinates we found with the names and ids:

naidma_found <- rbind(namids_found, find_match(stillMiss))
nrow(naidma_found)
  [1] 1249

Let’s collect again the stations that still don’t have the coordinates, after matching the names, in a table:

                                     station_name station_id
                                           <char>     <char>
   1:                       Base - 2132 W Hubbard  warehouse
   2:             Base - 2132 W Hubbard Warehouse  warehouse
   3:                  W Oakdale Ave & N Broadway      20252
   4:                     Maryland Ave & 104th St      20134
   5:                              MTV Hubbard St  warehouse
   6:                                     WestChi  warehouse
   7:                  Washtenaw Ave & Madison St        416
   8:                                 NewHastings  warehouse
   9:                    Menard Ave & Division St        305
  10:                       Cornell Dr & Hayes Dr        653
  11:                       Kimbark Ave & 63rd St        752
  12:               Sacramento Blvd & Division St        511
  13:                  Narragansett & Irving Park        323
  14: Public Rack - Sacramento Blvd & Division St        511
  15:                             WEST CHI-WATSON  warehouse
  16:        DIVVY CASSETTE REPAIR MOBILE STATION  warehouse

According to that result, none of those stations exists in the Info_Station set, or their names are so different from one set to the other that it is as if they didn’t exist. Under these conditions, we have exhausted the information that we can extract from the Info_Station set we downloaded, and now we’ll have to resort to other ways to complete the information.

Back to the Table of Contents

5.3.2.4 Checking the mean, median and mode values

One really easy way to find some more information is to use the coordinates registered in each period. Unfortunately, as we mentioned before, not all the stations have the correct coordinates reported in the tables. Therefore, we’ll need to be careful to use only coordinates that are not rounded or truncated, and we will need to define which measure of central tendency we should use for this.

Since, for the cases in which we have valid coordinates, there are several different observations for each station, and the coordinates reported in each observation are slightly different from one another, I assume that to complete the missing station information I could use one of the measures of central tendency: mean, median or mode. So, I’ll have to compare those measures for observations in different periods and different stations. Therefore, in this section we are going to check which of the three values is the best option for this task, keeping some considerations in mind.

With these considerations in mind we can create functions to check the median and mode points. First I’ll copy the function to get the mode(s):

# I found this function in stack overflow
# NOTE: I wrote the comments while checking the function
mode_set <- function(x) {
  ux <- unique(x)               # Select all the unique values in the vector
  tab <- tabulate(match(x, ux)) # Count how many of each there are
  ux[tab == max(tab)]           # Get the value(s) that repeat the most
}
# The result is a vector with the mode value(s) for that coordinate
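
A quick check of the helper with hypothetical values:

mode_set(c(41.8548, 41.8548, 41.8549)) # returns 41.8548
mode_set(c(1, 1, 2, 2, 3))             # ties return every mode: 1 2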

The following is a function to get the median point for just one of the coordinates. We have to allow for the possibility that the median point of the latitude is different from the median point of the longitude:

median_by_col <- function(dt, var) { 
  # The function takes the data frame (dt) and the column name (var)
  
  # Arrange the dt by the column we are checking
  ordered <- arrange(dt, {{ var }})
  
  # Get the column values in a numeric vector
  vectr <- unlist(select(dt, {{ var }}))
  # Get the median of that vector
  medn <- median(vectr)
  
  # Get the index of the point in the middle or the one on top
  p_med <- as.integer(ceiling(nrow(dt) / 2))
  # Get both coordinates if the value is reported in the data
  if (medn %in% vectr) {
    medin <- ordered[p_med, ]
  # If it is not reported, get the closest reported values: one on top, one below
  } else {
    medins <- rbind(ordered[p_med, ], ordered[p_med - 1, ])
    clnm <- colnames(medins)
    # Create an empty tibble to gather the new information
    medin <- tibble()
    for (clmn in clnm) {
      # Get the median for each coordinate in that pair of points:
      new_t <- dplyr::summarise(medins, "{clmn}" := median(.data[[clmn]]))
      # Put the values on the tibble
      medin <- cbind(medin, new_t)
    }
    # Get only the values of the coordinates
    medin <- medin %>% 
      select(all_of(clnm))
  }
  # The result is a tibble with the median 'lat' and 'lng' for the selected 'var'
  medin
}
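
As a quick hypothetical check (three made-up points, where the median latitude is one of the reported values):

pts <- tibble(longitude = c(-87.663, -87.664, -87.662),
              latitude = c(41.8548, 41.8549, 41.8550))
median_by_col(pts, latitude) # returns the row with latitude 41.8549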

Let’s now create a function to collect all the reported coordinates for just one station. That way we will be able to visualize the points on a map:

get_station_pts_1 <- function(df, st_id) {
  # The arguments for this function are the data from any period and a station id

  # Filter and select all the observations for the particular station id
  selStId_start <- df %>%
    filter(start_station_id == st_id) %>%
    select(longitude = start_lng, latitude = start_lat)
  # Add a column to identify the points
  selStId_start %<>%
    mutate(station = rep("start"), idx = 1:nrow(selStId_start))

  selStId_end <- df %>%
    filter(end_station_id == st_id) %>%
    select(longitude = end_lng, latitude = end_lat)
  selStId_end %<>%
    mutate(station = rep("end"), idx = 1:nrow(selStId_end))

  # It will return a "data.table" "data.frame"
  selStId <- rbind(selStId_start, selStId_end)
}

Now we can generate maps to visualize the reported points for the stations; in this case we will look just at the station with station_id “13307” in the dic_2021_rides_sm set. Let’s see what those coordinates look like on a map:
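
The chunk that collects those points is not echoed; a sketch, assuming (as in the code for section 5.3 above) that the December 2021 table is br_dt_sm[[3]], and using the object name st_13307_1 that appears in the anti_join further below:

# Sketch: all reported start/end points for station id "13307" in the December 2021 table
st_13307_1 <- get_station_pts_1(br_dt_sm[[3]], "13307")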

Most reported coordinates correspond to points that are really close to the station; but there are some ‘end’ points that are a little far, like a block (“end 75”) or two (“end 50”); and then we have the point labeled “end 2”, which is completely extraneous to the sample (it is by the north-east viaduct at the crossing of the expressways!).

Clearly, that point is an outlier, so we will need to adjust the function to avoid those irregularities. Since the function will give us the central tendency value that we are going to compare the rest of the observations against, it’s better to be a little stringent, so I’ll only include in the sample values between one IQR below the first quartile and one IQR above the third quartile, for both the longitude and the latitude.

While adjusting the coordinates in other examples, I noticed that it’s not rare for the IQR to be equal to zero. Based on this observation, I decided to write a conditional statement to enlarge the range a little, from the first to the seventh octile, in those cases.

limit_coord_rng <- function(coord) {
  # The function takes the numeric vector with the coordinate to adjust
  
  coord_iqr <- IQR(coord)
  # If IQR is not zero it limits the range one IQR above the 3rd quartile and one below the 1st
  if (coord_iqr != 0) {
    coord_range <- range(
      quantile(coord, 1 / 4) - coord_iqr,
      quantile(coord, 3 / 4) + coord_iqr
    )
  # If IQR is zero it limits the range from the 1st to the 7th octiles
  } else {
    coord_range <- range(quantile(coord, 1 / 8), quantile(coord, 7 / 8))
  }
  
  # The function returns the lower and upper limits of the range
  coord_range
}
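
A quick hypothetical check (made-up longitudes, one of them far from the rest):

lngs <- c(-87.6637, -87.6638, -87.6636, -87.6639, -87.6472)
limit_coord_rng(lngs)                       # range around the cluster of points
lngs[lngs %between% limit_coord_rng(lngs)]  # the far-away value is excluded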

So the adjusted version of the function will be:

get_station_pts <- function(df, st_id) {
  # The arguments for this function are the data frame from a period and a station id
  selStId_start <- df %>%
    filter(start_station_id == st_id) %>%
    select(longitude = start_lng, latitude = start_lat)
  selStId_start %<>%
    mutate(station = rep("start"), idx = 1:nrow(selStId_start))
  selStId_end <- df %>%
    filter(end_station_id == st_id) %>%
    select(longitude = end_lng, latitude = end_lat)
  selStId_end %<>%
    mutate(station = rep("end"), idx = 1:nrow(selStId_end))
  selStId <- rbind(selStId_start, selStId_end)
  selStId <- selStId[longitude %between% limit_coord_rng(selStId$longitude) &
                     latitude %between% limit_coord_rng(selStId$latitude), ]

  # It will return a "data.table" "data.frame"
  selStId
}
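
Applying the adjusted function to the same station, and keeping the object name st_13307 used in the anti_join below:

# Sketch: the same December 2021 points, now restricted to the IQR/octile-based ranges
st_13307 <- get_station_pts(br_dt_sm[[3]], "13307")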

As we can see in the next map and table, the points no longer include the obvious outlier (“end 2”), nor the other points that were about a block or two from the station (“end 75”, “end 50”):

Now all the points are grouped together really close to the station. It’s important to note that in this process we went from 200 to 182 points; the following are the 18 points not included in the new table:

anti_join(st_13307_1, st_13307, by = join_by(longitude, latitude, station, idx)) %>%
  arrange(station)
      longitude latitude station   idx
          <num>    <num>  <char> <int>
   1: -87.66363 41.85479     end     1
   2: -87.64725 41.87594     end     2
   3: -87.66369 41.85480     end    19
   4: -87.66227 41.85426     end    50
   5: -87.66357 41.85503     end    65
   6: -87.66368 41.85437     end    75
   7: -87.66370 41.85477     end    81
   8: -87.66364 41.85480   start    27
   9: -87.66367 41.85479   start    34
  10: -87.66375 41.85483   start    46
  11: -87.66369 41.85478   start    50
  12: -87.66368 41.85478   start    56
  13: -87.66370 41.85477   start    61
  14: -87.66368 41.85477   start    82
  15: -87.66374 41.85480   start    94
  16: -87.66369 41.85478   start    95
  17: -87.66367 41.85476   start    98
  18: -87.66372 41.85480   start   100

Having all the pieces, we can now define a function that creates a df with the values that we want to check:

get_centr_measrs_pts <- function(df, st_id) {
  # Function to get a df with the central tendency measures points
  # As input it takes the df to check and a "station_id" (st_id)

  # Use the function to get the reported coordinates for the station id
  selStId <- get_station_pts(df, st_id)
  selStId %<>%
    select(longitude, latitude)

  # Get all the unique extreme points (some may repeat)
  xtrmPnts <- unique(selStId[c(
    which.min(selStId$latitude),
    which.max(selStId$latitude),
    which.min(selStId$longitude),
    which.max(selStId$longitude)
  ), 1:2]) %>%
    mutate(group = rep("xtrmPnts")) # Add a column to categorize

  # Create the mean point
  mean_pnt <- tibble(
    longitude = mean(selStId$longitude),
    latitude = mean(selStId$latitude)
  ) %>%
    mutate(group = rep("mean_pnt")) # Add the respective category

  # Find or compute the median value
  medn_lat <- median_by_col(selStId, latitude)
  medn_lng <- median_by_col(selStId, longitude)
  # If both are the same, just keep one:
  medn_pnt <- unique(rbind(medn_lat, medn_lng)) %>%
    mutate(group = rep("medn_pnt")) # Again, add the category
  # If there are 2 points, compute the values
  if (nrow(medn_pnt) > 1) {
    medn_pnt <- tibble(
      longitude = median(medn_pnt$longitude),
      latitude = median(medn_pnt$latitude),
      group = "medn_pnt_c"
    )
  }

  # Use the mode_set function to get the most repeated value(s)
  modes_lat <- mode_set(selStId$latitude)
  modes_lng <- mode_set(selStId$longitude)
  # Create a logic vector for that(those) value(s)
  idx_md_lat <- selStId$latitude %in% modes_lat
  idx_md_lng <- selStId$longitude %in% modes_lng

  # If the logic vector is the same pick any (I used lat)
  if (identical(idx_md_lat, idx_md_lng)) {
    mode_pnt <- unique(selStId[idx_md_lat, 1:2]) %>%
      mutate(group = rep("mode_pnt"))
    # If the logic vectors are different, bind all the info and get the unique value(s)
  } else {
    mode_pnt <- unique(rbind(selStId[idx_md_lat, 1:2], selStId[idx_md_lng, 1:2])) %>%
      mutate(group = rep("mode_pnt")) # add the category.
  }

  # Put all the info in just one table
  new_df <- as.data.table(rbind(xtrmPnts, mean_pnt, medn_pnt, mode_pnt))

  # This function will return a data.frame/data.table
  new_df
}

Let’s try the function we just built to check the points for station_id “13307” within the table dic_2021_rides_sm:
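
The call itself is not echoed; assuming, as before, that the December 2021 table is br_dt_sm[[3]] (and with centr_13307 just an assumed object name), it would be something like:

centr_13307 <- get_centr_measrs_pts(br_dt_sm[[3]], "13307")
centr_13307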

     longitude latitude    group
         <num>    <num>   <char>
  1: -87.66368 41.85480 xtrmPnts
  2: -87.66358 41.85497 xtrmPnts
  3: -87.66374 41.85481 xtrmPnts
  4: -87.66356 41.85491 xtrmPnts
  5: -87.66359 41.85489 mean_pnt
  6: -87.66356 41.85491 medn_pnt
  7: -87.66356 41.85491 mode_pnt

In this case, for the table dic_2021_rides_sm and the station “Laflin St & Cullerton St” (id “13307”), we have the 4 extreme points (min and max for each coordinate), the mean, one median and just one mode; the mode is the same for both coordinates and coincides with the median point.

We can now write a more elaborate function to make the maps and visualize those points:

st_map <- function(df) {
  # Set options for the points markers
  xtrmOpts <- markerOptions(opacity = 0.75, color = "black")
  meanOpts <- markerOptions(opacity = 0.6, color = "darkgreen")
  mednOpts <- markerOptions(opacity = 0.6, color = "blue")
  modeOpts <- markerOptions(opacity = 0.6, color = "darkred")

  # Generate the map using the leaflet library
  map <- leaflet() %>%
    addProviderTiles(providers$OpenStreetMap) %>%
    addCircleMarkers(
      lat = df[group == "xtrmPnts", latitude],
      lng = df[group == "xtrmPnts", longitude], radius = 2, label = "extreme pnt",
      group = "Extremes", options = xtrmOpts
    ) %>%
    addCircleMarkers(
      lat = df[group == "mean_pnt", latitude],
      lng = df[group == "mean_pnt", longitude], radius = 5, label = "mean pnt",
      group = "Mean", options = meanOpts
    ) %>%
    addCircleMarkers(
      lat = df[str_starts(df$group, "medn_pnt"), latitude],
      lng = df[str_starts(df$group, "medn_pnt"), longitude], radius = 5,
      label = df$group[str_starts(df$group, "medn_pnt")], group = "Median",
      options = mednOpts
    ) %>%
    addCircleMarkers(
      lat = df[group == "mode_pnt", latitude],
      lng = df[group == "mode_pnt", longitude], radius = 5, label = "mode pnt",
      group = "Mode", options = modeOpts
    ) %>%
    addLayersControl(overlayGroups = c("Extremes", "Mean", "Median", "Mode"))
  map
}

For the next step, we just need to select a df and a station_id to generate the maps. Since we already have the df from the previous example, let’s visualize those points on a map:
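For reference, a minimal sketch of the call that produces this map, assuming the December table can be reached by name inside the br_dt_sm list (its numeric position works just as well), would be:

# Central-tendency points for station "13307" in the December 2021 table
pts_13307 <- get_centr_measrs_pts(br_dt_sm[["dic_2021_rides_sm"]], "13307")
st_map(pts_13307)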

In the markerOptions the colors were set as follows: black for the extreme points, dark green for the mean, blue for the median and dark red for the mode.

To make it easier to identify the measures, you can check/uncheck the different groups using the layers-control icon.

Observing the points, we can see that in this case the median coincides with the mode and with one of the extreme points (the one with the maximum longitude), which we can confirm against the information in the table.

Let’s check another map, in this case for station_id “605” from the table nov_2021_rides_sm:

In the case of this last map, for the station with id “605”, we have the 4 extreme points, while the mean, the median and the mode all coincide at the same point.

We are now going to use the addSidebyside function from the leaflet.extras2 package (Sebastian 2025) to produce a new map that compares the points of the same station (station_id == KA1504000117) in two different months (October and December 2021). It is striking how far apart those points are from one month to the other. Here you can compare the two sets of points by moving the slider left or right:

Station: Halsted St & North Branch St, station id: KA1504000117

In this map, if you move the slider towards the title “October points”, those are the points that you will see on the map (you can zoom in for a better view); likewise, if you move the slider towards the “December points” title, you’ll see the December points. The October points are really close together, and again the median and the mode are the same, both very close to the mean. In contrast, the 4 extreme December points are widely spread out; the median and the mode are the same and close to the extreme point in the south-east, leaving the mean alone in between. Based on these observations I think we can make some decisions.
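For completeness, here is a minimal sketch of how a comparison like this one can be assembled with addSidebyside; octPts and decPts stand for the outputs of get_centr_measrs_pts for the October and December tables, and the pane and layer ids are arbitrary choices of mine:

# Hypothetical inputs: the central-tendency points of the same station in two months
octPts <- get_centr_measrs_pts(br_dt_sm[["oct_2021_rides_sm"]], "KA1504000117")
decPts <- get_centr_measrs_pts(br_dt_sm[["dic_2021_rides_sm"]], "KA1504000117")

leaflet() %>%
  # One map pane per side of the slider
  addMapPane("left", zIndex = 0) %>%
  addMapPane("right", zIndex = 0) %>%
  # A base tile layer for each pane; addSidebyside references their layerIds
  addProviderTiles(providers$OpenStreetMap, layerId = "octBase",
                   options = pathOptions(pane = "left")) %>%
  addProviderTiles(providers$OpenStreetMap, layerId = "decBase",
                   options = pathOptions(pane = "right")) %>%
  # October points on the left side, December points on the right side
  addCircleMarkers(lng = octPts$longitude, lat = octPts$latitude, radius = 4,
                   color = "blue", options = pathOptions(pane = "left")) %>%
  addCircleMarkers(lng = decPts$longitude, lat = decPts$latitude, radius = 4,
                   color = "darkred", options = pathOptions(pane = "right")) %>%
  addSidebyside(layerId = "slider", leftId = "octBase", rightId = "decBase")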

Given how noticeably the sets of points change from one month’s observations to another’s, even though both sets belong to the same station, I think it is better to obtain the measure of central tendency from each set of observations separately, instead of using a single value for all the sets.

Most of the maps that we have checked show that the measures of central tendency tend to be very close together; when that is the case, any of them would work as a reference point for the station. However, for the December points in this last case the mean would be the better choice for the purpose of finding the closest station from any given point, since the median or the mode might add or subtract distance depending on the location of the other point.

Back to the Table of Contents

5.3.2.5 Getting the reported coordinates

Having decided on the course of action, we can now set up a function to collect the valid reported coordinates for each period. The idea is to keep the name and id of the station and use the mean of the reported coordinates to complete the observations that are missing the station’s data.

Let’s review again the 16 stations that still don’t have the coordinates:

                                     station_name station_id
                                           <char>     <char>
   1:                       Base - 2132 W Hubbard  warehouse
   2:             Base - 2132 W Hubbard Warehouse  warehouse
   3:                  W Oakdale Ave & N Broadway      20252
   4:                     Maryland Ave & 104th St      20134
   5:                              MTV Hubbard St  warehouse
   6:                                     WestChi  warehouse
   7:                  Washtenaw Ave & Madison St        416
   8:                                 NewHastings  warehouse
   9:                    Menard Ave & Division St        305
  10:                       Cornell Dr & Hayes Dr        653
  11:                       Kimbark Ave & 63rd St        752
  12:               Sacramento Blvd & Division St        511
  13:                  Narragansett & Irving Park        323
  14: Public Rack - Sacramento Blvd & Division St        511
  15:                             WEST CHI-WATSON  warehouse
  16:        DIVVY CASSETTE REPAIR MOBILE STATION  warehouse

According to that result, none of those stations exists in the Info_Station set, and that is why we need to use the coordinates reported in the original sets. Unfortunately, as we mentioned before, not all the stations have correct coordinates reported in the tables; therefore, we’ll need to be careful to select coordinates that are not rounded or truncated.

get_reprtd_coords <- function(j, DT) {
  # For this function we need to input an index (from "br_dt_sm" e.g.) and
  # a data table DT ("missngStill")

  # Collect the coordinates reported for every station missing them
  for (i in 1:nrow(DT)) {
    coords <- rbind(br_dt_sm[[j]][
      start_station_name == DT[i, station_name],
      .("lng" = start_lng, "lat" = start_lat)
    ], br_dt_sm[[j]][
      end_station_name == DT[i, station_name],
      .("lng" = end_lng, "lat" = end_lat)
    ])
    # Check that the coords are not truncated or rounded
    if (nrow(coords) > 0) {
      if ((mean(coords[, lng]) * 10^5) %% 1000 != 0 &&
        (mean(coords[, lat]) * 10^5) %% 1000 != 0) {
        # Adjust the range for the values
        coords <- coords[lng %between% limit_coord_rng(coords$lng) &
          lat %between% limit_coord_rng(coords$lat), ]
        # Add the mean of the coordinates to, or create, the table
        if (exists("mean_coords")) {
          mean_coords <- rbind(mean_coords, data.table(DT[i],
            longitude = mean(coords[, lng]),
            latitude = mean(coords[, lat])
          ))
        } else {
          mean_coords <- data.table(DT[i, ],
            longitude = mean(coords[, lng]),
            latitude = mean(coords[, lat])
          )
        }
      }
    }
  }
  # The output is a data table
  if (exists("mean_coords")) {
    return(mean_coords)
  } else {
    return(data.table(DT,
      longitude = rep(NA, length.out = nrow(DT)),
      latitude = rep(NA, length.out = nrow(DT))
    ))
  }
}

Let’s check the function:
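The check itself is a single call; a sketch of it, assuming (as in the rest of this section) that k holds the position of the jun_2022_rides_sm table inside br_dt_sm and that missngStill is the table with the 16 stations listed above, would be:

found_coords <- get_reprtd_coords(k, missngStill)
found_coords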

                             station_name station_id longitude latitude
                                   <char>     <char>     <num>    <num>
  1:                Base - 2132 W Hubbard  warehouse -87.68028 41.88986
  2:      Base - 2132 W Hubbard Warehouse  warehouse -87.68028 41.88986
  3:                       MTV Hubbard St  warehouse -87.68028 41.88986
  4:                              WestChi  warehouse -87.68028 41.88986
  5:                          NewHastings  warehouse -87.67984 41.86344
  6:                Cornell Dr & Hayes Dr        653 -87.58488 41.78059
  7:                      WEST CHI-WATSON  warehouse -87.68028 41.88986
  8: DIVVY CASSETTE REPAIR MOBILE STATION  warehouse -87.68028 41.88986

Let’s check those coordinates in the map:
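A quick way to put those eight points on a map is a sketch along these lines (the marker style here is my own choice, just for illustration):

leaflet(found_coords) %>%
  addProviderTiles(providers$OpenStreetMap) %>%
  addCircleMarkers(lng = ~longitude, lat = ~latitude, label = ~station_name,
                   radius = 6, color = "navy", fillOpacity = 0.4)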

If we zoom in enough, we will find the names of the respective stations on the map itself for the actual stations; for the warehouses, we’ll get the addresses we assigned before.

We can now add these stations to the ones we already completed, and keep a record of the ones still missing to see how many different incomplete stations we collect after processing all the periods; that will let us ponder whether it’s worth searching for the information some other way.

coords_not_found <- missngStill[!(station_name %in% found_coords$station_name)]
coords_not_found[order(station_name), ]
                                    station_name station_id
                                          <char>     <char>
  1:                       Kimbark Ave & 63rd St        752
  2:                     Maryland Ave & 104th St      20134
  3:                    Menard Ave & Division St        305
  4:                  Narragansett & Irving Park        323
  5: Public Rack - Sacramento Blvd & Division St        511
  6:               Sacramento Blvd & Division St        511
  7:                  W Oakdale Ave & N Broadway      20252
  8:                  Washtenaw Ave & Madison St        416

Let’s build the lookup table to see how many stations we have completed so far:

lookup_dt <- rbind(naidma_found, found_coords)
nrow(lookup_dt)
  [1] 1257

We are very close to getting all the coordinates for all the stations reported in the jun_2022_rides_sm table; we are just missing the 8 stations that appear in the previous table for that period.

Back to the Table of Contents

5.3.3 Building the lookup-table constructor

Before getting to the final version of the constructor function, it’s worth mentioning that I had run all the periods through a previous version of that function. While doing that, I collected the names of the stations missing in each period in an initially empty data table called “coords_not_found”, using the following lines of code inside the function (these lines are commented out in this final version):

# Add the stations still missing info to the "coords_not_found" table
stillLost <- data.table(stations[!(station_name %in% lookup_dt$station_name)],
                        table = rep(name))
coords_not_found <<- rbind(coords_not_found, stillLost)

That action helped me identify the different strings included in the shrtn_addr char vector, adapt the values in the amatch and stringdist functions for the defined find_comb function, and find the special cases described above in the different data sets.

At the end of the final loop, the list of stations missing coordinates had a total of 69 observations, counting all the missing stations from all the periods; but those corresponded to only 12 unique names. Knowing that information, it was really easy, with the help of Google maps (Google Maps — Maps.google.com”), Leaflet maps (Volodymyr 2023), and the coordinates obtained from other periods, to complete the missing data:

                                     station_name longitude latitude
                                           <char>     <num>    <num>
   1:                       Bennett Ave & 96th St -87.57897 41.72214
   2:                       Kimbark Ave & 63rd St -87.59603 41.78064
   3:                     Maryland Ave & 104th St -87.60293 41.70566
   4:                    Menard Ave & Division St -87.77052 41.90234
   5:                  Narragansett & Irving Park -87.78540 41.95265
   6:         Public  Rack - Albany Ave & 63rd St -87.70084 41.77891
   7:         Public Rack - Kimbark Ave & 63rd St -87.59603 41.78064
   8: Public Rack - Sacramento Blvd & Division St -87.69750 41.90299
   9:               Sacramento Blvd & Division St -87.69750 41.90299
  10:                  W Oakdale Ave & N Broadway -87.64405 41.93554
  11:                  Washtenaw Ave & Madison St -87.69392 41.88140
  12:                 Whipple St & Irving Park Rd -87.70437 41.95412

As you can notice in the table, for the stations and racks that correspond to the same address, I left the same coordinates for both. Checking the map, those stations have a darker tone of green compared to the rest, since the two circles (the station’s and the rack’s) overlap. Let’s check:

To find the coordinates, I looked up the street crossings in Google Maps (Google Maps — Maps.google.com”) and, from there, checking the “street view” of the same map and using the Leaflet maps (Volodymyr 2023), I was able to locate most of the stations. However, in five cases I had to select the coordinates of the closest station that I could find:

In the map, the points for those cases are in red, and a straight blue line connects each red circle with the closest station to that point (the green circle). Among those cases are the two pairs of racks and stations that coincide; that’s why those two sets appear in darker hues of the respective colors.

Hovering over the lines displays the distance in meters, rounded to tenths, from the street crossing to the nearest station. I used the Spherical Law of Cosines formula to get those distances; we will discuss distance formulas further in the next section of this document.
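Those manually collected coordinates are kept in a small table that the constructor below refers to as searched; a minimal sketch of how such a table can be declared, showing only the first three stations of the list above, would be:

searched <- data.table(
  station_name = c("Bennett Ave & 96th St", "Kimbark Ave & 63rd St",
                   "Maryland Ave & 104th St"), # ... and so on for the 12 stations
  longitude = c(-87.57897, -87.59603, -87.60293),
  latitude = c(41.72214, 41.78064, 41.70566)
)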

Now we can put all the pieces together in just one function to get the lookup tables for each period:

build_lookup_dt <- function(i) {
  # The function takes the index (or name) of the period DT to check within br_dt_sm

  # Get the list of all the stations for that period
  stations <- list_stations(i)

  # Look for the coincidences in names or ids to get the coordinates
  namids_found <- find_coord_by_name_id(stations)

  # Get the stations still missing the coordinates
  stillMiss <- stations[!(station_name %in% namids_found$station_name)]

  # Use 'str_detect' and fuzzy matching to find more coordinates
  naidma_found <- rbind(namids_found, find_match(stillMiss))

  # Get the stations still missing the coordinates, again
  missngStill <- stillMiss[!(station_name %in% naidma_found$station_name)]
  
  # Get coordinates based on the reported data from the period
  found_coords <- get_reprtd_coords(i, missngStill)
  
  # Add the reported coordinates, if any, to the lookup table
  lookup_df <- rbind(naidma_found, na.omit(found_coords, cols = c("longitude","latitude")))
  
  # Add the searched ones
  missing_coords <- missngStill[!(station_name %in% lookup_df$station_name)]
  adding_info <- NULL # so the rbind below also works when there is nothing to add
  if (sum(missing_coords$station_name %in% searched$station_name) > 0) {
    adding_info <- searched[station_name %in% missing_coords$station_name, ]
    adding_info <- missing_coords[adding_info, on = .(station_name)]
    adding_info <- adding_info[complete.cases(adding_info), ]
  }

  lookup_dt <- unique(rbind(lookup_df, adding_info))

  # Add the stations still missing the info to the "coords_not_found" table
  # stillLost <- data.table(stations[!(station_name %in% lookup_df$station_name)],
  #                         table = rep(name))
  # coords_not_found <<- rbind(coords_not_found, stillLost)
  
  # The function returns a data.table
  lookup_dt
}

All of the reported coordinates that I was able to check are consistent with the information when displayed on a map; let’s see:
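As a usage sketch, the lookup table for one period and a quick map of its points can be obtained along these lines (again assuming k indexes the jun_2022_rides_sm table):

lookup_jun <- build_lookup_dt(k)
leaflet(lookup_jun) %>%
  addProviderTiles(providers$OpenStreetMap) %>%
  addCircleMarkers(lng = ~longitude, lat = ~latitude, label = ~station_name,
                   radius = 3, color = "darkgreen", opacity = 0.7)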

Back to the Table of Contents

5.4 Comparing distance formulas

For the next step I will need a formula to determine the closest station and fill the missing values with that information, being careful to exclude the stations for which there is no info at all (no station name or id and no coordinates), since otherwise the formula may assign the name and id closest to the coordinates (0, 0), skewing the data.

By using the station look-up tables created before, I believe we can complete most of the data in each month’s table; we just need to define a function to do that easily. There is a myriad of options to select from; just to illustrate, in the next chunk I’ve included some of them (Chris Veness 2022; Le 2021; Federal Regulations (Annual Edition). Title 47: Telecommunication. 2017). Under “Getting the measures” I also included the distGeo function from the geosphere package (Hijmans 2024):

# Global-average radius of earth in meters:
R <- 6371008.8
# Constant to convert degrees into radians
to_rad <- pi / 180

#  Pythagoras’ theorem on an equi-rectangular projection
pyth <- function(lon1, lat1, lon2, lat2) {
  # Mean latitude in radians
  mean_lat <- mean(c(lat1, lat2)) * to_rad
  cat1 <- (lon2 - lon1) * to_rad * cos(mean_lat)
  cat2 <- (lat2 - lat1) * to_rad
  d <- sqrt(cat1^2 + cat2^2)
  d * R # meters
}

# Spherical Law of Cosines
law_cos <- function(lon1, lat1, lon2, lat2) {
  d <- acos(sin(lat1 * to_rad) * sin(lat2 * to_rad) + 
              cos(lat1 * to_rad) * cos(lat2 * to_rad) * cos((lon2 - lon1) * to_rad)) * R
  d # meters
}

# Haversine formula to measure distance in a sphere
haversine <- function(lon1, lat1, lon2, lat2) {
  # generally used geo measurement function
  dLat <- (lat2 - lat1) * to_rad
  dLon <- (lon2 - lon1) * to_rad
  a <- (sin(dLat / 2))^2 + cos(lat1 * to_rad) *
    cos(lat2 * to_rad) * (sin(dLon / 2))^2
  c <- 2 * atan2(sqrt(a), sqrt(1 - a))
  d <- R * c
  d # in meters
}

# In a Medium page I found this one
medium <- function(lon1, lat1, lon2, lat2) {
  a <- 0.5 - cos((lat2 - lat1) * to_rad) / 2 +
    cos(lat1 * to_rad) * cos(lat2 * to_rad) *
      (1 - cos((lon2 - lon1) * to_rad)) / 2
  2 * R * asin(sqrt(a)) # meters
}

# FCC recommended formulae for distances not exceeding 475 kilometers:
fcc_dist <- function(lon1, lat1, lon2, lat2) {
  # Mean latitude in radians
  mean_lat <- mean(c(lat1, lat2)) * to_rad
  K1 <- 111.13209 - 0.56605 * cos(2 * mean_lat) + 0.00120 * cos(4 * mean_lat)
  K2 <- 111.41513 * cos(mean_lat) - 0.09455 * cos(3 * mean_lat) + 0.00012 * cos(5 * mean_lat)
  D <- K1 * (lat2 - lat1) + K2 * (lon2 - lon1) # in Km
  D * 1000 # in meters
}

# Defining the points (Taken from the table of the example)
lat1 <- 41.85476
lon1 <- -87.66367
lat2 <- 41.85492
lon2 <- -87.66366

# Getting the measures
pyt <- pyth(lon1, lat1, lon2, lat2)
l_c <- law_cos(lon1, lat1, lon2, lat2)
hav <- haversine(lon1, lat1, lon2, lat2)
med <- medium(lon1, lat1, lon2, lat2)
fcc <- fcc_dist(lon1, lat1, lon2, lat2)
diG <- distGeo(c(lon1,lat1),c(lon2,lat2))

# Organizing the info
comprsn <- tibble(
  formula = c("Pythagorean", "Law of cosines", "Haversine", "Medium page", "FCC", "distGeo"),
  `Distance in meters` = c(pyt, l_c, hav, med, fcc, diG))
comprsn %>% arrange(-`Distance in meters`)
  # A tibble: 6 × 2
    formula        `Distance in meters`
    <chr>                         <dbl>
  1 FCC                            18.6
  2 Pythagorean                    17.8
  3 Haversine                      17.8
  4 Law of cosines                 17.8
  5 Medium page                    17.8
  6 distGeo                        17.8

As we can see, all the formulas give very similar distances between the two selected points (81 cm, about 31.9 in, is the maximum difference). Since the same formula is going to be used in all the calculations, I would think any of them would work; the selection should then be based on the performance of the function. Let’s run a microbenchmark from the microbenchmark package (Mersmann 2019) to compare:

microbenchmark::microbenchmark(
  times = 1000, unit = "us",
  medium(lon1, lat1, lon2, lat2),
  law_cos(lon1, lat1, lon2, lat2),
  haversine(lon1, lat1, lon2, lat2),
  pyth(lon1, lat1, lon2, lat2),
  fcc_dist(lon1, lat1, lon2, lat2),
  distGeo(c(lon1, lat1), c(lon2, lat2))
)
  Unit: microseconds
                                    expr  min   lq    mean median   uq     max
          medium(lon1, lat1, lon2, lat2)  1.5  1.7 12.1824    1.8  2.0 10078.3
         law_cos(lon1, lat1, lon2, lat2)  1.5  1.7  9.0175    1.8  2.0  7029.6
       haversine(lon1, lat1, lon2, lat2)  1.8  2.0  2.4166    2.2  2.4    38.3
            pyth(lon1, lat1, lon2, lat2)  5.3  5.6 14.2884    5.9  6.3  7653.3
        fcc_dist(lon1, lat1, lon2, lat2)  5.4  5.8  6.7931    6.0  6.5    59.9
   distGeo(c(lon1, lat1), c(lon2, lat2)) 41.7 43.2 49.0610   43.8 45.6   194.8
   neval
    1000
    1000
    1000
    1000
    1000
    1000

The formulas with the best results are the one from the “Medium” page, the “law of cosines” and the “haversine”. Being the most popular on the pages I reviewed and the one with the best mean (at least in most runs), the winner is the “haversine” formula! Now we can proceed to build the function to complete the information.

Back to the Table of Contents

5.5 Completing the missing names and ids

5.5.1 Adding more columns to the tables

Let’s use the formula with the jun_2022_rides_sm table to find the name and id of the stations missing that information. The following function will help us with that:

complete_stations <- function(j) {
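  # Function to find, for every reported coordinate pair that is missing the
  # station info, the nearest station in the period's lookup table; it takes
  # the index (j) of the period DT within br_dt_sm and returns those coordinates
  # together with the matched station_name, station_id and the distance in meters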
  
  DT <- as.data.table(br_dt_sm[[j]])
  stations <- build_lookup_dt(j)
  
  no_names <- unique(rbind(na.omit(DT[
    !is.na(start_lng),
    .("station_name" = start_station_name, "longitude" = start_lng, "latitude" = start_lat)
  ], cols = "station_name", invert = TRUE), na.omit(DT[
    !(is.na(end_lng)),
    .("station_name" = end_station_name, "longitude" = end_lng,  "latitude" = end_lat)
  ], cols = "station_name", invert=TRUE)))

  distance <- idx <- 1:nrow(no_names)
  for (i in 1:nrow(no_names)) {
    # calculate distance against all of stations
    distances <- haversine(no_names[i, longitude], no_names[i, latitude], 
                         stations[, longitude], stations[, latitude])
    # find the shortest and store the index of stations
    idx[i] <- which.min(distances)
    distance[i] <- distances[idx[i]]
  }
  
  no_names[, `:=`("station_name" = stations$station_name[idx], 
                   "station_id"   = stations$station_id[idx],
                   `distance (m.)`     = distance,
                   "st_lng"       = stations$longitude[idx],
                   "st_lat"       = stations$latitude[idx])]
}
completed <- complete_stations(k)
completed <- completed[order(`distance (m.)`, decreasing = TRUE),
                       .(station_name, station_id, `distance (m.)`, longitude, latitude,
                         st_lng, st_lat)]
head(completed, 10)
                       station_name station_id distance (m.) longitude latitude
                             <char>     <char>         <num>     <num>    <num>
   1:      Plainfield & Irving Park        327      9189.133    -87.92    41.90
   2:      W 103rd St & S Avers Ave      20202      8569.475    -87.82    41.70
   3: Bloomingdale Ave & Harlem Ave        377      8439.880    -87.88    41.86
   4:   Lincolnwood Dr & Central St       E014      8301.243    -87.81    42.09
   5:      W 103rd St & S Avers Ave      20202      7388.270    -87.80    41.73
   6: Bloomingdale Ave & Harlem Ave        377      7097.729    -87.89    41.90
   7:    Narragansett & Irving Park        323      6388.508    -87.79    42.01
   8:        Western Ave & 118th St        900      6280.578    -87.72    41.63
   9:   Lincolnwood Dr & Central St       E014      5418.120    -87.74    42.11
  10:      Plainfield & Irving Park        327      5331.919    -87.83    42.00
         st_lng   st_lat
          <num>    <num>
   1: -87.83382 41.95213
   2: -87.71707 41.70585
   3: -87.80577 41.91203
   4: -87.71530 42.06485
   5: -87.71707 41.70585
   6: -87.80577 41.91203
   7: -87.78540 41.95265
   8: -87.68092 41.67835
   9: -87.71530 42.06485
  10: -87.83382 41.95213

Checking the table, we can see that some reported points (given by the longitude and latitude columns) are really far from the nearest station (st_lng, st_lat); e.g., the distance between the two points in the first observation is 9.2 km (5.7 miles).

Looking for information on the internet I found the following in Wikipedia (Wikipedia contributors 2025): “In Chicago, a typical city block is 330 by 660 feet (100 m × 200 m), meaning that 16 east-west blocks or 8 north-south blocks measure one mile, which has been adopted by other US cities.”

I cannot be sure what happened with those bikes or why they were reported at points so far away from the stations, but it’s worth mentioning that all the cases in which the bike rides are missing the station info have coordinates rounded or truncated to the hundredths place. I believe it would be interesting for the analysis phase to check the relation between the distance from the station and the type of station it is (“start” - “end”). For now, we can take a quick look on the map at the bikes that were left too far from the stations, just to pique our interest for a future development of this topic.

I mentioned “too far” in the previous paragraph; now we need to define what would be considered “too far”. Keeping in mind the Wikipedia information cited before, I would think that any distance greater than 800 m (8 “100 m” blocks, or about 1/2 mile) may be considered “too far”. Based on that threshold, we can build the following map (the distances displayed on top of the lines are measured in “100 m” blocks):

too_far <- completed[`distance (m.)` > 800, ][, label := .I]
stations_lu <- build_lookup_dt(k)
sttns_tf <- stations_lu[station_name %in% too_far$station_name, ]
too_far_map <- leaflet() %>% addProviderTiles(providers$OpenStreetMap) %>%
  addCircleMarkers(lng = too_far$longitude, lat = too_far$latitude, 
                   label = paste0("Too_far_", too_far$label), fillColor = "darkred",
                   fillOpacity = 0.4, color = "darkred", radius = 6) %>%
  addCircleMarkers(lng = sttns_tf$longitude, lat = sttns_tf$latitude, fillColor = "navy",
                   fillOpacity = 0.4, label = sttns_tf$station_name, color = "green",
                   radius = 6) %>%
  addCircleMarkers(lng = completed$st_lng, lat = completed$st_lat, color = "navy",
                   label = completed$station_name, radius = 2, opacity = 0.7)
for (i in 1:nrow(too_far)){
  too_far_map <- too_far_map %>%
    addPolylines(lng = c(too_far[i, ]$longitude, too_far[i, ]$st_lng),
                 lat = c(too_far[i, ]$latitude, too_far[i, ]$st_lat),
                 label = paste(round(too_far[i, ]$`distance (m.)` / 100, 1), "blocks"))
}
too_far_map

If reviewing the table with the really long distances was mind-blowing, observing the map is even more intriguing: bikes in the middle of the lake, what is that about?

On the bright side, one idea that comes to mind is that if the company is considering expanding the coverage area of the bike program, maybe they should think about the areas with the most red circles!

After checking the completed information, I think it is really relevant to include a column with the distance to the nearest station, keeping in mind that we will need to fill that column with 0’s (zeros) for the stations that already had all the information.

The completed table has all the unique values for which we need the stations’ data. In order to input the stations’ data into the period tables, we need a function to add that information. But first, we need to add some new columns to register the new data: the distance to the nearest station, for both the start and the end of the ride, and the coordinates used to find that distance, also for both cases.

br_dt_sm <- lapply(br_dt_sm, \(DT) DT[, `:=`(dist_start_stn = rep(0), dist_end_stn = rep(0),
                                             mean_start_lng = rep(0), mean_start_lat = rep(0),
                                             mean_end_lng = rep(0), mean_end_lat = rep(0))])
dim(br_dt_sm[[k]])
  [1] 769204     26

We are ready now to put some information in those columns!

Back to the Table of Contents

5.5.2 Input the missing information in the tables

We have the tables ready to receive the needed information about the missing stations, and we also defined some extra columns to record how we obtained it. Now we can define a function to input those values. To do that, I’ll split each table into two subsets: the one that is already complete and the one that needs information. Since we are not going to do anything with the first subset, all the values of the new columns will remain zero (0) there.

input_stn_data <- function(i) {
  # The function takes the DT index as input
  
  # Display a message to show the progress
  cat("Working on table ",new_per[i],". ")
  
  DT <- br_dt_sm[[i]]
  
  # Get the complete info for the stations missing
  to_input <- complete_stations(i)
  
  # Subset the DT
  notMiss <- na.omit(DT, cols = c("start_station_name", "end_station_name"))
  missInf <- na.omit(DT, cols = c("start_station_name", "end_station_name"), invert=TRUE)
  
  # Loop through the completed info filling in the missing one in the tables
  for (r in 1:nrow(to_input)) {
    missInf[
      is.na(missInf$start_station_name) &
      missInf$start_lng == to_input$longitude[r] &
      missInf$start_lat == to_input$latitude[r],
      c(
        "start_station_name", "start_station_id",
        "mean_start_lng", "mean_start_lat", "dist_start_stn"
      ) :=
        .(
          to_input$station_name[r], to_input$station_id[r],
          to_input$st_lng[r], to_input$st_lat[r], to_input$`distance (m.)`[r]
        )
    ]
    missInf[
      is.na(missInf$end_station_name) &
      missInf$end_lng == to_input$longitude[r] &
      missInf$end_lat == to_input$latitude[r],
      c(
        "end_station_name", "end_station_id",
        "mean_end_lng", "mean_end_lat", "dist_end_stn"
      ) :=
        .(
          to_input$station_name[r], to_input$station_id[r],
          to_input$st_lng[r], to_input$st_lat[r], to_input$`distance (m.)`[r]
        )
    ]
  }
  # Bind again the two subsets of the DT
  DT <- rbind(notMiss, missInf)
  
  # Display another progress message
  cat("Available information completed!\n")
  
  # The function returns a data.table
  DT
}

Let’s run the function to complete the tables:

br_dt_sm <- lapply(seq_along(br_dt_sm), \(i) input_stn_data(i))
  Working on table  oct_2021_rides_sm . Available information completed!
  Working on table  nov_2021_rides_sm . Available information completed!
  Working on table  dic_2021_rides_sm . Available information completed!
  Working on table  ene_2022_rides_sm . Available information completed!
  Working on table  feb_2022_rides_sm . Available information completed!
  Working on table  mar_2022_rides_sm . Available information completed!
  Working on table  abr_2022_rides_sm . Available information completed!
  Working on table  may_2022_rides_sm . Available information completed!
  Working on table  jun_2022_rides_sm . Available information completed!
  Working on table  jul_2022_rides_sm . Available information completed!
  Working on table  ago_2022_rides_sm . Available information completed!
  Working on table  sept_2022_rides_sm . Available information completed!

Let’s check the missing information again:

n_nas <- unlist(lapply(br_dt_sm, \(DT) sum(!complete.cases(DT))))
data.table(period= format(summ_table$month, "%b %Y"), miss_end_coords = summ_table$miss_end_coord, `has_NAs` = n_nas)
          period miss_end_coords has_NAs
          <char>           <int>   <int>
   1:  oct. 2021             484    4761
   2:  nov. 2021             191    1630
   3:  dic. 2021             144     675
   4:  ene. 2022              86      86
   5:  feb. 2022              77      77
   6:  mar. 2022             266     266
   7:  abr. 2022             317     317
   8:  may. 2022             722     722
   9:  jun. 2022            1055    1055
  10:  jul. 2022             947     947
  11:  ago. 2022             843     843
  12: sept. 2022             712     712

Comparing the number of rows with NAs in each table against the miss_end_coords column from the summary table, one can see that the results from January to September 2022 coincide, which is a good sign. The difference in the first three sets may be due to the station ids; let’s check:

stn_id_NA <- unlist(lapply(br_dt_sm[1:3], \(DT) nrow(DT[is.na(start_station_id) |
                                                   is.na(end_station_id), ])))
stn_name_NA <- unlist(lapply(br_dt_sm[1:3], \(DT) nrow(DT[is.na(start_station_name) |
                                                   is.na(end_station_name), ])))
data.table(period = format(summ_table$month[1:3], "%b %Y"),
           miss_end_coords = summ_table$miss_end_coord[1:3],
           stn_name_NA = stn_name_NA,
           stn_id_NA = stn_id_NA, `have_NAs` = n_nas[1:3])
        period miss_end_coords stn_name_NA stn_id_NA have_NAs
        <char>           <int>       <int>     <int>    <int>
  1: oct. 2021             484         484      4761     4761
  2: nov. 2021             191         191      1630     1630
  3: dic. 2021             144         144       675      675

As we can observe in this last table, that is exactly the case: there are some stations for which we have the station name but not the station id. Let’s check which:

for (i in 1:3) {
  miss_id_start <- unlist(unique(br_dt_sm[[i]][
    !is.na(start_station_name) &
      is.na(start_station_id),
    .(start_station_name, start_station_id)
  ]))
  miss_id_end <- unlist(unique(br_dt_sm[[i]][
    !is.na(end_station_name) & is.na(end_station_id),
    .(end_station_name, end_station_id)
  ]))

  cat(new_per[i], ":\nMissing start station - id:\t", paste(miss_id_start, collapse = " - "),
      "\nMissing end station - id:\t", paste(miss_id_end, collapse = " - "),"\n\n")
}
  oct_2021_rides_sm :
  Missing start station - id:    DuSable Lake Shore Dr & Ohio St - NA 
  Missing end station - id:  DuSable Lake Shore Dr & Ohio St - NA 
  
  nov_2021_rides_sm :
  Missing start station - id:    DuSable Lake Shore Dr & Ohio St - NA 
  Missing end station - id:  DuSable Lake Shore Dr & Ohio St - NA 
  
  dic_2021_rides_sm :
  Missing start station - id:    DuSable Lake Shore Dr & Ohio St - NA 
  Missing end station - id:  DuSable Lake Shore Dr & Ohio St - NA

As we can see, in all the cases it is the station for which we removed the id when correcting the information in the tables, and this station doesn’t appear in the Info_Station set. If needed, we could assign a made-up id for this station, but for now we are going to leave it as it is. Therefore, after a long and iterative process, we are done filling in the information in the tables!
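Should we ever decide to assign such an id, a sketch like the following would do it; the value "TEMP_001" is purely an assumption for illustration, not an official Divvy id:

fake_id <- "TEMP_001" # hypothetical placeholder id
br_dt_sm <- lapply(br_dt_sm, \(DT) {
  DT[start_station_name == "DuSable Lake Shore Dr & Ohio St" & is.na(start_station_id),
     start_station_id := fake_id]
  DT[end_station_name == "DuSable Lake Shore Dr & Ohio St" & is.na(end_station_id),
     end_station_id := fake_id]
  DT
})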

Back to the Table of Contents

6. Packing and closing

We already have one list with the corrected data sets (br_dt_sm), one file with the originally downloaded data sets (“bikeRidesOriginalTables.rds”), this document with all the information about the process, and one file with the information to produce the Info_Stations set used in this document (“Info_Stations_list_1833.rds”). All that is left to do is to “pack” the produced information in another list, to facilitate reproducing this work or parts of it.

Let’s create that list:

bike_rides_addtnl_info <- list("summary_table" = summ_table, "warehouses" = warehouses,
                               "searched_stns" = searched, "ggplot_theme" = tema)
save_tables <- function(a_list, name) {
  # Function to save a list with a copy of each period data table
  # It takes a list with the DTs and a name for the .rds file

  # Save the list in the output folder
  if (dir(".")[1] == "1_Load_libs") {
    write_rds(a_list, paste0("../output/", name, ".rds"))
  } else {
    write_rds(a_list, paste0("../../output/", name, ".rds"))
  }
}
save_tables(bike_rides_addtnl_info, "bikeRidesAddedInfo")

Now we have all the files needed to review and reproduce this document.

In previous chapters we collected some intriguing findings about the data; I put some recommendations for the next steps in the Appendix - Suggestions for the analysis phase.

We are finally done getting everything ready to be sent for the analysis phase!

Back to the Table of Contents

References

Allaire, JJ, Yihui Xie, Christophe Dervieux, Jonathan McPherson, Javier Luraschi, Kevin Ushey, Aron Atkins, et al. 2024. Rmarkdown: Dynamic Documents for r. https://github.com/rstudio/rmarkdown.
Bache, Stefan Milton, and Hadley Wickham. 2022. Magrittr: A Forward-Pipe Operator for r. https://doi.org/10.32614/CRAN.package.magrittr.
Barrett, Tyson, Matt Dowle, Arun Srinivasan, Jan Gorecki, Michael Chirico, Toby Hocking, Benjamin Schwendinger, and Ivan Krylov. 2025. Data.table: Extension of ‘Data.frame‘. https://doi.org/10.32614/CRAN.package.data.table.
Ben, and Gregor Thomas. 2014. “R - How Do i Make a List of Data Frames? - Stack Overflow.” Edited by moodymudskipperEditor. stackoverflow.com. https://stackoverflow.com/questions/17499013/how-do-i-make-a-list-of-data-frames.
Bivand, Roger S., Edzer Pebesma, and Virgilio Gomez-Rubio. 2013. Applied Spatial Data Analysis with R, Second Edition. Springer, NY. https://asdar-book.org/.
“Bucket Loading... — Divvy-Tripdata.s3.amazonaws.com.” https://divvy-tripdata.s3.amazonaws.com/index.html.
Chan, Chung-hong, Thomas J. Leeper, Jason Becker, and David Schoch. 2023. Rio: A Swiss-Army Knife for Data File i/o. https://cran.r-project.org/package=rio.
Cheng, Joe, Barret Schloerke, Bhaskar Karambelkar, and Yihui Xie. 2024. Leaflet: Create Interactive Web Maps with the JavaScript ’Leaflet’ Library. https://doi.org/10.32614/CRAN.package.leaflet.
Cheng, Joe, Carson Sievert, Barret Schloerke, Winston Chang, Yihui Xie, and Jeff Allen. 2024. Htmltools: Tools for HTML. https://doi.org/10.32614/CRAN.package.htmltools.
Chris Veness, www.movable-type.co.uk. 2022. “Calculate Distance and Bearing Between Two Latitude/Longitude Points Using Haversine Formula in JavaScript — Movable-Type.co.uk.” https://www.movable-type.co.uk/scripts/latlong.html.
“City of Chicago | Data Portal — Data.cityofchicago.org.” https://data.cityofchicago.org.
“Divvy in Chicago, IL 60612 — Chamberofcommerce.com.” https://www.chamberofcommerce.com/business-directory/illinois/chicago/bike-sharing-station/2030939263-divvy.
“Earth Radius - Wikipedia — En.wikipedia.org.” 2021. https://en.wikipedia.org/wiki/Earth_radius.
Federal Regulations (Annual Edition). Title 47: Telecommunication., Code of. 2017. “Geographical Distance - Wikipedia — En.wikipedia.org.” https://en.wikipedia.org/wiki/Geographical_distance.
Gillespie, Colin, and Robin Lovelace. 2020. Efficient: Becoming an Efficient r Programmer.
———. 2021. “GitHub - Csgillespie/EfficientR: Efficient R Programming: A Book — Github.com.” https://github.com/csgillespie/efficientR/.
“Google Maps — Maps.google.com.” https://maps.google.com/.
Grolemund, Garrett, and Hadley Wickham. 2011. “Dates and Times Made Easy with lubridate.” Journal of Statistical Software 40 (3): 1–25. https://www.jstatsoft.org/v40/i03/.
Henry, Lionel, and Hadley Wickham. 2025. Rlang: Functions for Base Types and Core r and ’Tidyverse’ Features. https://doi.org/10.32614/CRAN.package.rlang.
Hester, Jim, and Jennifer Bryan. 2024. Glue: Interpreted String Literals. https://doi.org/10.32614/CRAN.package.glue.
Hijmans, Robert J. 2024. Geosphere: Spherical Trigonometry. https://doi.org/10.32614/CRAN.package.geosphere.
“Home | Divvy Bikes — Ride.divvybikes.com.” https://ride.divvybikes.com/.
Knell, Robert J. 2013. Introductory R: A Beginner’s Guide to Data Visualisation and Analysis Using R. (See web site). http://www.introductoryr.co.uk.
Kohl, Matthias. 2015. Introduction to Statistical Data Analysis with R. London: bookboon.com.
Le, Trancy. 2021. Calculate the Distance Between Locations (with Longitude & Latitude) in Power BI — Trancyle.” https://medium.com/@trancyle/calculate-the-distance-between-locations-with-longitude-latitude-in-power-bi-d35a18056760.
Marulanda_Pabón, Andrés. 2023. “Bike-Share Data Wrangling.” https://www.kaggle.com/code/amarulo/bike-share-data-wrangling.
Mersmann, Olaf. 2019. Microbenchmark: Accurate Timing Functions. https://CRAN.R-project.org/package=microbenchmark.
Müller, Kirill. 2023. Hms: Pretty Time of Day. https://doi.org/10.32614/CRAN.package.hms.
Müller, Kirill, and Lorenz Walthert. 2024. Styler: Non-Invasive Pretty Printing of r Code. https://github.com/r-lib/styler.
Müller, Kirill, and Hadley Wickham. 2023. Tibble: Simple Data Frames. https://doi.org/10.32614/CRAN.package.tibble.
Pebesma, Edzer J., and Roger Bivand. 2005. “Classes and Methods for Spatial Data in R.” R News 5 (2): 9–13. https://CRAN.R-project.org/doc/Rnews/.
Posit team. 2025. RStudio: Integrated Development Environment for r. Boston, MA: Posit Software, PBC. http://www.posit.co/.
R Core Team. 2019. R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing. https://www.R-project.org.
Robinson, David. 2020. Fuzzyjoin: Join Tables Together on Inexact Matching. https://doi.org/10.32614/CRAN.package.fuzzyjoin.
S Archer Ave & S Rockwell St · Chicago, IL 60632 — 41.8229955,-87.6922987,17z.” https://www.google.com/maps/place/S+Archer+Ave+%26+S+Rockwell+St,+Chicago,+IL+60632,+USA/@41.8229955,-87.6922987,17z/data=!3m1!4b1!4m6!3m5!1s0x880e2dfb73e23ef5:0x91c207e8125a186d!8m2!3d41.8229915!4d-87.6897184!16s%2Fg%2F11gf1g_8xw?entry=ttu&g_ep=EgoyMDI1MDYxMS4wIKXMDSoASAFQAw%3D%3D.
Sebastian, Gatscha. 2025. Leaflet.extras2: Extra Functionality for ’Leaflet’ Package. https://doi.org/10.32614/CRAN.package.leaflet.extras2.
Spinu, Vitalie, Garrett Grolemund, and Hadley Wickham. 2020. Lubridate: Make Dealing with Dates a Little Easier.
Tierney, Nicholas. 2017. “Visdat: Visualising Whole Data Frames.” JOSS 2 (16): 355. https://doi.org/10.21105/joss.00355.
Tierney, Nicholas, and Dianne Cook. 2023. “Expanding Tidy Data Principles to Facilitate Missing Data Exploration, Visualization and Assessment of Imputations.” Journal of Statistical Software 105 (7): 1–31. https://doi.org/10.18637/jss.v105.i07.
Vaidyanathan, Ramnath, Yihui Xie, JJ Allaire, Joe Cheng, Carson Sievert, and Kenton Russell. 2023. Htmlwidgets: HTML Widgets for r. https://doi.org/10.32614/CRAN.package.htmlwidgets.
van der Loo, M. P. J. 2014. “The Stringdist Package for Approximate String Matching.” The R Journal 6: 111–22. https://CRAN.R-project.org/package=stringdist.
van der Loo, Mark. 2024. Simputation: Simple Imputation. https://doi.org/10.32614/CRAN.package.simputation.
Volodymyr, Agafonkin. 2023. “Leaflet, a JavaScript Library for Interactive Maps.” Computer software. https://leafletjs.com.
Wickham, Hadley. 2016. Ggplot2: Elegant Graphics for Data Analysis. Springer-Verlag New York. https://ggplot2.tidyverse.org.
———. 2023a. Forcats: Tools for Working with Categorical Variables (Factors). https://doi.org/10.32614/CRAN.package.forcats.
———. 2023b. Stringr: Simple, Consistent Wrappers for Common String Operations. https://doi.org/10.32614/CRAN.package.stringr.
Wickham, Hadley, Mara Averick, Jennifer Bryan, Winston Chang, Lucy D’Agostino McGowan, Romain François, Garrett Grolemund, et al. 2019. “Welcome to the tidyverse.” Journal of Open Source Software 4 (43): 1686. https://doi.org/10.21105/joss.01686.
Wickham, Hadley, Winston Chang, Lionel Henry, Thomas Lin Pedersen, Kohske Takahashi, Claus Wilke, Kara Woo, Hiroaki Yutani, and Dewey Dunnington. 2020. Ggplot2: Create Elegant Data Visualisations Using the Grammar of Graphics. https://CRAN.R-project.org/package=ggplot2.
Wickham, Hadley, Romain François, Lionel Henry, Kirill Müller, and Davis Vaughan. 2023. Dplyr: A Grammar of Data Manipulation. https://doi.org/10.32614/CRAN.package.dplyr.
Wickham, Hadley, and Lionel Henry. 2025. Purrr: Functional Programming Tools. https://doi.org/10.32614/CRAN.package.purrr.
Wickham, Hadley, Jim Hester, and Jennifer Bryan. 2024. Readr: Read Rectangular Text Data. https://doi.org/10.32614/CRAN.package.readr.
Wickham, Hadley, Thomas Lin Pedersen, and Dana Seidel. 2025. Scales: Scale Functions for Visualization. https://doi.org/10.32614/CRAN.package.scales.
Wickham, Hadley, Davis Vaughan, and Maximilian Girlich. 2024. Tidyr: Tidy Messy Data. https://doi.org/10.32614/CRAN.package.tidyr.
Wikipedia contributors. 2025. “City Block — Wikipedia, the Free Encyclopedia.” https://en.wikipedia.org/w/index.php?title=City_block&oldid=1287932484.
Williams, Ken. “How to Find the Statistical Mode?” https://stackoverflow.com/questions/2547402/how-to-find-the-statistical-mode#8189441.
Xie, Yihui. 2013. Dynamic Documents with R and Knitr. Chapman & Hall/CRC. https://github.com/yihui/knitr-book/.
———. 2020. Knitr: A General-Purpose Package for Dynamic Report Generation in r. https://yihui.org/knitr/.
Xie, Yihui, J. J. Allaire, and Garrett Grolemund. 2018. R Markdown: The Definitive Guide. Boca Raton, Florida: Chapman; Hall/CRC. https://bookdown.org/yihui/rmarkdown.
Xie, Yihui, Christophe Dervieux, and Emily Riederer. 2020. R Markdown Cookbook. Boca Raton, Florida: Chapman; Hall/CRC. https://bookdown.org/yihui/rmarkdown-cookbook.
Back to the Table of Contents

Appendix - Suggestions for the analysis phase

Back to the Table of Contents