MiBici public bike share system user analysis

This analysis aims to delve deep into bike mobility patterns from the perspective of its users.

Initial setup and libraries

Loading the libraries readr, dplyr, ggplot2, lubridate, tidyr and DescTools.

library(readr)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(ggplot2)
library(lubridate)
## 
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
## 
##     date, intersect, setdiff, union
library(tidyr)
library(DescTools)
## Warning: package 'DescTools' was built under R version 4.4.1

Collecting and summarizing data

Data acquisition

We start by downloading the list of stations (service points) from the open data page of the MiBici operator. This file is updated regularly, so the file name and URL in the code may only be valid for a short period of time.

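# Download the station list only if it is not already on disk
if (!file.exists("nomenclatura_2024_08.csv")) {
     download.file("https://www.mibici.net/site/assets/files/1118/nomenclatura_2024_08.csv",
                   "nomenclatura_2024_08.csv")
}
# Read the file in both cases (the original only read it when it already existed)
mbpoints <- read.csv("nomenclatura_2024_08.csv", encoding = "latin1")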

This analysis builds on operational data for the years 2014 to 2024 that has already been collected, cleaned and processed. Details about that earlier process can be found here

User summaries

Our first point of interest is the information we can extract by summarizing travel patterns per user. These summaries have already been computed, so we read them directly from disk.

dfUsers <- read_csv("dfUsers.csv")
## New names:
## Rows: 84170 Columns: 8
## ── Column specification
## ──────────────────────────────────────────────────────── Delimiter: "," chr
## (1): Sex dbl (7): ...1, Usuario_Id, BYear, Age_at_Tr, TTrips, Fav_origin,
## Fav_destin
## ℹ Use `spec()` to retrieve the full column specification for this data. ℹ
## Specify the column types or set `show_col_types = FALSE` to quiet this message.
## • `` -> `...1`
dfUsers_Y <- read_csv("dfUsers_Y.csv")
## New names:
## Rows: 212881 Columns: 8
## ── Column specification
## ──────────────────────────────────────────────────────── Delimiter: "," chr
## (1): Sex dbl (7): ...1, Usuario_Id, Year_trip, BYear, TTrips, Fav_origin,
## Fav_destin
## ℹ Use `spec()` to retrieve the full column specification for this data. ℹ
## Specify the column types or set `show_col_types = FALSE` to quiet this message.
## • `` -> `...1`
head(dfUsers, 10)
## # A tibble: 10 × 8
##     ...1 Usuario_Id BYear Age_at_Tr Sex   TTrips Fav_origin Fav_destin
##    <dbl>      <dbl> <dbl>     <dbl> <chr>  <dbl>      <dbl>      <dbl>
##  1     1          3  1990        24 F        178         75         75
##  2     2          6  1996        20 M         80         33         33
##  3     3        102  1982        32 M       4379         62         62
##  4     4        116  1977        38 M        517         12         12
##  5     5        127  1992        22 M       1381         62         62
##  6     6        140    NA        NA M          1         NA         NA
##  7     7        143    NA        NA M         10         75         75
##  8     8        157    NA        NA M          1         NA         NA
##  9     9        173  1977        38 M          3         NA         NA
## 10    10        176  1990        24 M        278         51        197
head(dfUsers_Y, 10)
## # A tibble: 10 × 8
##     ...1 Usuario_Id Year_trip BYear Sex   TTrips Fav_origin Fav_destin
##    <dbl>      <dbl>     <dbl> <dbl> <chr>  <dbl>      <dbl>      <dbl>
##  1     1          3      2014  1990 F         76         83         83
##  2     2          3      2015  1990 F        102         75         75
##  3     3          6      2016  1996 M         80         33         33
##  4     4        102      2014  1982 M         46         37         37
##  5     5        102      2015  1982 M        320         62         75
##  6     6        102      2016  1982 M        421         62         62
##  7     7        102      2017  1982 M        344         62         62
##  8     8        102      2018  1982 M       2789        210        210
##  9     9        102      2019  1982 M        295          3         49
## 10    10        102      2021  1982 M         77         72         72

Let’s try a few visualizations from our first data frame.

ggplot(dfUsers, aes(x=TTrips)) + 
        geom_histogram(color = "darkgrey",fill="darkred", alpha = 0.6, binwidth = 250)+
     labs(title = "MiBici total trips 2014-2024 by user Histogram", 
          x = "Overall trips for 2014 to 2024")

The histogram is strongly right-skewed: most users made few trips while a small number made very many, so the level of use is quite unequal across users. Part of this could be attributed to growth in the user base over the years, since users who joined recently have had less time to accumulate trips, but that is a hypothesis worth examining further.
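
One way to put numbers on that skew is to compare the median with the mean and to check what share of all trips comes from the most active users. The sketch below is only illustrative: it uses the ten TTrips values printed by head(dfUsers, 10) above, not the full data frame, and the 10% cut-off is an arbitrary assumption.

```r
# TTrips for the ten users shown in head(dfUsers, 10); a tiny sample, not the full data
ttrips <- c(178, 80, 4379, 517, 1381, 1, 10, 1, 3, 278)

# In a right-skewed distribution the mean sits well above the median
median_trips <- median(ttrips)
mean_trips <- mean(ttrips)

# Share of all trips made by the top 10% of users (cut-off chosen for illustration)
sorted <- sort(ttrips, decreasing = TRUE)
n_top <- ceiling(0.10 * length(sorted))
top_share <- sum(sorted[seq_len(n_top)]) / sum(sorted)

print(c(median = median_trips, mean = mean_trips, top_share = top_share))
```

On the real data the same lines would run with ttrips <- dfUsers$TTrips. A mean far above the median, together with a large top_share, would support the reading of the histogram.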

Next we look at basic user profile variables, starting with year of birth, age at the time of travel, and gender.

ggplot(dfUsers, aes(x=BYear)) + geom_histogram(color = "darkgrey", fill="yellow3", alpha = 0.6)+
     labs(title = "MiBici users by year of birth Histogram", x = "Year of Birth")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 10538 rows containing non-finite outside the scale range
## (`stat_bin()`).

ggplot(dfUsers, aes(x=Age_at_Tr)) + geom_histogram(color = "darkgrey", fill="orange3", alpha = 0.6)+
     labs(title = "MiBici users age at travel Histogram", x = "Age at the moment of travel")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 10538 rows containing non-finite outside the scale range
## (`stat_bin()`).

ggplot(dfUsers, aes(x=Sex, fill = Sex)) + geom_bar(alpha=.6) + labs(title="Users by Gender (Male-Female)", x = "Gender / Sex")

It is interesting that, in the per-user summary, male users are more than twice as numerous as female users.
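
The bar chart can be backed by explicit counts and a male-to-female ratio. The following is a minimal sketch in base R that uses only the Sex values of the ten users printed earlier, so the resulting ratio is not representative of the full user base. The useNA argument keeps users with no recorded sex visible instead of silently dropping them.

```r
# Sex for the ten users shown in head(dfUsers, 10); illustrative sample only
sex <- c("F", "M", "M", "M", "M", "M", "M", "M", "M", "M")

# Counts per category, keeping NA as its own category if any are present
counts <- table(sex, useNA = "ifany")

# Proportion of users in each category
shares <- prop.table(counts)

# Male-to-female ratio among registered users
ratio_m_f <- as.numeric(counts["M"]) / as.numeric(counts["F"])

print(counts)
print(round(shares, 2))
print(ratio_m_f)
```

With the full data, sex <- dfUsers$Sex would replace the sample vector.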

Next we look at basic mobility patterns. First, we would like to see how trip origins and destinations are distributed across MiBici stations. To do this, we plot a histogram of the station that each user most frequently starts from or ends at.

Note that this is not an intensity or level of use analysis, but a visual of how individual users favor certain stations to start and end their travel.

ggplot(dfUsers, aes(x=Fav_origin)) + geom_histogram(color = "darkgrey", fill="gold3", alpha = 0.6)+
     labs(title = "Most popular stations as origins", x = "Stations code")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 7159 rows containing non-finite outside the scale range
## (`stat_bin()`).

ggplot(dfUsers, aes(x=Fav_destin)) + geom_histogram(color = "darkgrey", fill="steelblue3", alpha = 0.6)+
     labs(title = "Most popular stations as destinations", x = "Stations code")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 7841 rows containing non-finite outside the scale range
## (`stat_bin()`).

Not surprisingly, the pattern of user preference looks very similar for origins and destinations. One possible explanation is that users rely on the system mainly for daily commutes, where the origin at the beginning of the day becomes the destination at the end of the day, but confirming this would need additional analysis.
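
A simple first check on this similarity is to measure how often a user's favorite origin and favorite destination are the same station. The sketch below rebuilds the ten rows printed by head(dfUsers, 10) as a small data frame, so it is an illustration rather than a result. It also does not test the commute hypothesis itself, which would need trip-level timestamps.

```r
# The ten users shown in head(dfUsers, 10), with their favorite stations
users <- data.frame(
  Usuario_Id = c(3, 6, 102, 116, 127, 140, 143, 157, 173, 176),
  Fav_origin = c(75, 33, 62, 12, 62, NA, 75, NA, NA, 51),
  Fav_destin = c(75, 33, 62, 12, 62, NA, 75, NA, NA, 197)
)

# Keep only users with both a favorite origin and a favorite destination
complete <- users[!is.na(users$Fav_origin) & !is.na(users$Fav_destin), ]

# Share of those users whose favorite origin is also their favorite destination
same_station <- complete$Fav_origin == complete$Fav_destin
share_same <- mean(same_station)

print(nrow(complete))
print(share_same)
```

Running the same steps on dfUsers would show whether the two histograms look alike because individual users favor a single station, or only because the same stations are popular overall.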

Now let’s take a look at the distribution of total and average daily trips by user per year.

ggplot(dfUsers_Y, aes(x=as.factor(Year_trip), y=TTrips, colour=as.factor(Year_trip))) + geom_boxplot( outlier.size = 1) + labs(title = "Total trips by user per year Box plot", x = "Year", y = "Total trips by user")

ggplot(filter(dfUsers_Y, Year_trip > 2014 & Year_trip < 2024), aes(x=as.factor(Year_trip), y=TTrips/365, colour=as.factor(Year_trip))) + geom_boxplot(outlier.size = 1) + 
        labs(title = "Average daily trips by user per year Box plot", 
             x = "Year", y = "Average daily trips by user")

res_YDT <- dfUsers_Y %>%
        group_by(Year_trip) %>%
        summarise(Avg_YT_user = mean(TTrips))
res_YDT <- mutate(res_YDT, Avg_WDT_user = Avg_YT_user/(365-104))
res_YDT
## # A tibble: 11 × 3
##    Year_trip Avg_YT_user Avg_WDT_user
##        <dbl>       <dbl>        <dbl>
##  1      2014        15.5       0.0594
##  2      2015        92.5       0.354 
##  3      2016        95.5       0.366 
##  4      2017       129.        0.496 
##  5      2018       133.        0.511 
##  6      2019       151.        0.579 
##  7      2020       118.        0.452 
##  8      2021       157.        0.600 
##  9      2022       169.        0.646 
## 10      2023       159.        0.611 
## 11      2024        96.1       0.368
res_TSx <- dfUsers_Y %>%
        group_by(Year_trip, Sex) %>%
        summarise(Sum_YT_user = sum(TTrips))
## `summarise()` has grouped output by 'Year_trip'. You can override using the
## `.groups` argument.
res_TSx <- res_TSx %>%
        pivot_wider(names_from = Sex, values_from = Sum_YT_user) 


colnames(res_TSx) <- c("Year_trip", "Fem", "Male", "NAs")

res_TSx <- res_TSx %>%
        mutate(M_times_F = Male/Fem)
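
The colnames() assignment above depends on the order in which pivot_wider() happens to create the columns. As a cross-check, the same male-to-female trip ratio can be computed in base R by selecting columns by name. This is a sketch on made-up numbers: the trips data frame below is hypothetical and only mimics the columns of dfUsers_Y.

```r
# Hypothetical yearly summaries with the same column names as dfUsers_Y
trips <- data.frame(
  Year_trip = c(2015, 2015, 2015, 2016, 2016, 2016),
  Sex = c("F", "M", "M", "F", "M", NA),
  TTrips = c(100, 250, 50, 200, 300, 40)
)

# Total trips per year and sex; rows with a missing Sex are dropped by tapply()
totals <- tapply(trips$TTrips, list(trips$Year_trip, trips$Sex), sum)

# Ratio of male to female trips per year, selecting columns by name
m_times_f <- totals[, "M"] / totals[, "F"]

print(totals)
print(m_times_f)
```

Because the columns are addressed as "M" and "F", the result does not change if the categories appear in a different order in the data.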

Preliminary final remarks

With this initial analysis we have identified a few trends that are interesting to discuss.

The system has more than 84 thousand registered users, but most of them seem to be occasional users. In 2023, average use per user was only about 0.61 trips per weekday (yearly trips divided by the 365 - 104 = 261 Monday-to-Friday days in a year). This suggests that MiBici is probably not a backbone solution for its users' daily transportation needs, but a complementary mode.
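
The weekday figure rests on the approximation of 365 - 104 weekdays per year used in the code earlier. The sketch below counts the actual Monday-to-Friday days in a given year with base R and repeats the division using the rounded 2023 average from the table (159). The helper count_weekdays is a name introduced here for illustration, and public holidays are not excluded.

```r
# Count Monday-to-Friday days in a calendar year (holidays are not excluded)
count_weekdays <- function(year) {
  days <- seq(as.Date(paste0(year, "-01-01")),
              as.Date(paste0(year, "-12-31")),
              by = "day")
  # "%u" gives the ISO weekday number: 1 = Monday, ..., 7 = Sunday
  sum(as.integer(format(days, "%u")) <= 5)
}

weekdays_2023 <- count_weekdays(2023)

# Approximation used in the analysis: 52 weekends of 2 days each
approx_weekdays <- 365 - 104

# Rounded 2023 yearly average per user, taken from the table above
avg_2023 <- 159
per_weekday_approx <- avg_2023 / approx_weekdays
per_weekday_exact <- avg_2023 / weekdays_2023

print(c(weekdays_2023 = weekdays_2023, approx = approx_weekdays))
print(round(c(per_weekday_approx, per_weekday_exact), 3))
```

The exact count differs from the approximation by at most a day or two per year, so the conclusion of well under one trip per weekday does not depend on this choice.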

There is a significant gender gap between male and female users. Registered male users are about double the number of female users, and the gap is wider still when we consider the trips actually made by each group.

Finally, most users seem to favor one station or service point in particular to start or finish their trips. In line with the hypothesis that MiBici provides a complementary mobility solution, we might also expect the most used MiBici stations to be those close to, or within the area of, mass transit infrastructure.