This analysis takes a close look at bike mobility patterns in the MiBici system from the perspective of its users.
We start by loading the libraries readr, dplyr, ggplot2, lubridate, tidyr and DescTools.
library(readr)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggplot2)
library(lubridate)
##
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
##
## date, intersect, setdiff, union
library(tidyr)
library(DescTools)
## Warning: package 'DescTools' was built under R version 4.4.1
We start off by downloading the list of stations (service points) from the open data page of the MiBici operator. This file is updated frequently, so the file name and URL used here may only be valid for a short period of time.
# Download the station list only if it is not already on disk
if (!file.exists("nomenclatura_2024_08.csv")) {
  download.file("https://www.mibici.net/site/assets/files/1118/nomenclatura_2024_08.csv",
                "nomenclatura_2024_08.csv")
}
# Read the file in both cases (the original else branch skipped this after a fresh download)
mbpoints <- read.csv("nomenclatura_2024_08.csv", encoding = "latin1")
This analysis builds on operational data for the years 2014 to 2024 that have already been collected, cleaned and processed. Details about the previous process can be found here.
Our first point of interest is the information we can extract by summarizing travel patterns by user. These summaries have already been computed, so we read them directly from disk.
dfUsers <- read_csv("dfUsers.csv")
## New names:
## Rows: 84170 Columns: 8
## ── Column specification
## ──────────────────────────────────────────────────────── Delimiter: "," chr
## (1): Sex dbl (7): ...1, Usuario_Id, BYear, Age_at_Tr, TTrips, Fav_origin,
## Fav_destin
## ℹ Use `spec()` to retrieve the full column specification for this data. ℹ
## Specify the column types or set `show_col_types = FALSE` to quiet this message.
## • `` -> `...1`
dfUsers_Y <- read_csv("dfUsers_Y.csv")
## New names:
## Rows: 212881 Columns: 8
## ── Column specification
## ──────────────────────────────────────────────────────── Delimiter: "," chr
## (1): Sex dbl (7): ...1, Usuario_Id, Year_trip, BYear, TTrips, Fav_origin,
## Fav_destin
## ℹ Use `spec()` to retrieve the full column specification for this data. ℹ
## Specify the column types or set `show_col_types = FALSE` to quiet this message.
## • `` -> `...1`
head(dfUsers, 10)
## # A tibble: 10 × 8
## ...1 Usuario_Id BYear Age_at_Tr Sex TTrips Fav_origin Fav_destin
## <dbl> <dbl> <dbl> <dbl> <chr> <dbl> <dbl> <dbl>
## 1 1 3 1990 24 F 178 75 75
## 2 2 6 1996 20 M 80 33 33
## 3 3 102 1982 32 M 4379 62 62
## 4 4 116 1977 38 M 517 12 12
## 5 5 127 1992 22 M 1381 62 62
## 6 6 140 NA NA M 1 NA NA
## 7 7 143 NA NA M 10 75 75
## 8 8 157 NA NA M 1 NA NA
## 9 9 173 1977 38 M 3 NA NA
## 10 10 176 1990 24 M 278 51 197
head(dfUsers_Y, 10)
## # A tibble: 10 × 8
## ...1 Usuario_Id Year_trip BYear Sex TTrips Fav_origin Fav_destin
## <dbl> <dbl> <dbl> <dbl> <chr> <dbl> <dbl> <dbl>
## 1 1 3 2014 1990 F 76 83 83
## 2 2 3 2015 1990 F 102 75 75
## 3 3 6 2016 1996 M 80 33 33
## 4 4 102 2014 1982 M 46 37 37
## 5 5 102 2015 1982 M 320 62 75
## 6 6 102 2016 1982 M 421 62 62
## 7 7 102 2017 1982 M 344 62 62
## 8 8 102 2018 1982 M 2789 210 210
## 9 9 102 2019 1982 M 295 3 49
## 10 10 102 2021 1982 M 77 72 72
Let’s try a few visualizations from our first data frame.
ggplot(dfUsers, aes(x=TTrips)) +
geom_histogram(color = "darkgrey",fill="darkred", alpha = 0.6, binwidth = 250)+
labs(title = "MiBici total trips 2014-2024 by user Histogram",
x = "Overall trips for 2014 to 2024")
The histogram is strongly skewed to the right: most users made few trips, while a small number made very many, so the level of use is quite unequal among users. Part of this could be explained by growth of the user base over the years, since users who joined recently have had less time to accumulate trips, but it is worth examining further.
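One way to put a number on this inequality is sketched below. It is an illustration only, not part of the original workflow: the vector holds the TTrips values of the ten users printed earlier, and the helper names `top_share` and `gini` are hypothetical. On the full data one would pass `dfUsers$TTrips` instead.

```r
# TTrips of the ten users shown in head(dfUsers, 10); a tiny sample, not the full data.
ttrips <- c(178, 80, 4379, 517, 1381, 1, 10, 1, 3, 278)

# Share of all trips made by the most active fraction of users (default: top 10%).
top_share <- function(x, top = 0.10) {
  x <- sort(x[!is.na(x)], decreasing = TRUE)
  n_top <- max(1, ceiling(length(x) * top))
  sum(x[seq_len(n_top)]) / sum(x)
}

# Gini coefficient: 0 means every user rides equally often,
# values close to 1 mean a few users account for most trips.
gini <- function(x) {
  x <- sort(x[!is.na(x)])
  n <- length(x)
  sum((2 * seq_len(n) - n - 1) * x) / (n * sum(x))
}

summary(ttrips)     # the median sits far below the mean when the skew is strong
top_share(ttrips)
gini(ttrips)
```

Comparing these figures per year of registration, if that information is available, would help separate genuine heavy use from the effect of simply having been a user for longer.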
Next we look at the basic variables of the user profile: year of birth, age at the time of travel, and gender.
ggplot(dfUsers, aes(x=BYear)) + geom_histogram(color = "darkgrey", fill="yellow3", alpha = 0.6)+
labs(title = "MiBici users by year of birth Histogram", x = "Year of Birth")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 10538 rows containing non-finite outside the scale range
## (`stat_bin()`).
ggplot(dfUsers, aes(x=Age_at_Tr)) + geom_histogram(color = "darkgrey", fill="orange3", alpha = 0.6)+
labs(title = "MiBici users age at travel Histogram", x = "Age at the moment of travel")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 10538 rows containing non-finite outside the scale range
## (`stat_bin()`).
ggplot(dfUsers, aes(x=Sex, fill = Sex)) + geom_bar(alpha=.6) + labs(title="Users by Gender (Male-Female)", x = "Gender / Sex")
It is interesting that, in the user-level summary, male users outnumber female users by more than two to one.
Now we turn to basic information on mobility patterns. First, we would like to see how trip origins and destinations are distributed across MiBici stations. For this we simply plot a histogram of the station each user uses most frequently.
Note that this is not an intensity or level of use analysis, but a visual of how individual users favor certain stations to start and end their travel.
ggplot(dfUsers, aes(x=Fav_origin)) + geom_histogram(color = "darkgrey", fill="gold3", alpha = 0.6)+
labs(title = "Most popular stations as origins", x = "Station code")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 7159 rows containing non-finite outside the scale range
## (`stat_bin()`).
ggplot(dfUsers, aes(x=Fav_destin)) + geom_histogram(color = "darkgrey", fill="steelblue3", alpha = 0.6)+
labs(title = "Most popular stations as destinations", x = "Station code")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 7841 rows containing non-finite outside the scale range
## (`stat_bin()`).
Not surprisingly, the pattern of user preferences looks very similar for trip origins and trip destinations. One possible explanation is that users rely on the system mainly for daily commuting, where the origin at the beginning of the day becomes the destination at the end of the day, but confirming this would require additional analysis.
Now let’s take a look at the distribution of total and average daily trips by user per year.
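The similarity between the two histograms can also be checked user by user. The sketch below is an assumption about how one might do this, not the author's method: it uses only the `Fav_origin` and `Fav_destin` values of the ten users printed earlier, and the names `users`, `known`, `same_share` and `origin_counts` are introduced for illustration.

```r
# Fav_origin / Fav_destin of the ten users shown in head(dfUsers, 10).
users <- data.frame(
  Fav_origin = c(75, 33, 62, 12, 62, NA, 75, NA, NA, 51),
  Fav_destin = c(75, 33, 62, 12, 62, NA, 75, NA, NA, 197)
)

# Keep only users for whom both favourite stations are known.
known <- users[!is.na(users$Fav_origin) & !is.na(users$Fav_destin), ]

# Share of users whose favourite origin is also their favourite destination.
same_share <- mean(known$Fav_origin == known$Fav_destin)

# Station codes are labels rather than quantities, so counting users per code
# is easier to interpret than binning the codes in a histogram.
origin_counts <- sort(table(known$Fav_origin), decreasing = TRUE)

same_share
head(origin_counts, 3)
```

A high share on the full data would be consistent with the commuting explanation, although it would not prove it, since trip times and directions are not part of these summaries.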
ggplot(dfUsers_Y, aes(x=as.factor(Year_trip), y=TTrips, colour=as.factor(Year_trip))) + geom_boxplot( outlier.size = 1) + labs(title = "Total trips by user per year Box plot", x = "Year", y = "Total trips by user")
ggplot(filter(dfUsers_Y, Year_trip > 2014 & Year_trip < 2024), aes(x=as.factor(Year_trip), y=TTrips/365, colour=as.factor(Year_trip))) + geom_boxplot(outlier.size = 1) +
labs(title = "Average daily trips by user per year Box plot",
x = "Year", y = "Average daily trips by user")
res_YDT <- dfUsers_Y %>%
group_by(Year_trip) %>%
summarise(Avg_YT_user = mean(TTrips))
res_YDT <- mutate(res_YDT, Avg_WDT_user = Avg_YT_user/(365-104))
res_YDT
## # A tibble: 11 × 3
## Year_trip Avg_YT_user Avg_WDT_user
## <dbl> <dbl> <dbl>
## 1 2014 15.5 0.0594
## 2 2015 92.5 0.354
## 3 2016 95.5 0.366
## 4 2017 129. 0.496
## 5 2018 133. 0.511
## 6 2019 151. 0.579
## 7 2020 118. 0.452
## 8 2021 157. 0.600
## 9 2022 169. 0.646
## 10 2023 159. 0.611
## 11 2024 96.1 0.368
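The divisor `365-104` used above assumes 52 full weekends in every year. As a rough check, the sketch below counts the Monday-to-Friday days of a given year with base R dates; the function name `weekdays_in_year` is hypothetical, and the 2023 average is the rounded value from the table, so the resulting ratio is only approximate.

```r
# Count the Monday-to-Friday days in a calendar year.
weekdays_in_year <- function(year) {
  days <- seq(as.Date(sprintf("%d-01-01", year)),
              as.Date(sprintf("%d-12-31", year)), by = "day")
  wday <- as.POSIXlt(days)$wday   # 0 = Sunday, ..., 6 = Saturday
  sum(wday >= 1 & wday <= 5)
}

n_2023 <- weekdays_in_year(2023)

# Average yearly trips per user in 2023, rounded as printed in the table above.
avg_trips_2023 <- 159

n_2023
avg_trips_2023 / n_2023   # approximate trips per user per weekday
```

The exact weekday count differs from 261 by at most a day or so in any year, so the fixed divisor is a reasonable approximation for full years. It is less suitable for years in which the data cover only some months.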
res_TSx <- dfUsers_Y %>%
group_by(Year_trip, Sex) %>%
summarise(Sum_YT_user = sum(TTrips))
## `summarise()` has grouped output by 'Year_trip'. You can override using the
## `.groups` argument.
res_TSx <- res_TSx %>%
pivot_wider(names_from = Sex, values_from = Sum_YT_user)
colnames(res_TSx) <- c("Year_trip", "Fem", "Male", "NAs")
res_TSx <- res_TSx %>%
mutate(M_times_F = Male/Fem)
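Renaming the pivoted columns by position works only while the categories appear in the order F, M, NA. A base R alternative that keeps the category names is sketched below as an illustration; it uses the ten rows of `dfUsers_Y` printed earlier, and the names `trips`, `by_sex` and `m_times_f` are assumptions introduced here.

```r
# Year_trip, Sex and TTrips of the ten rows shown in head(dfUsers_Y, 10).
trips <- data.frame(
  Year_trip = c(2014, 2015, 2016, 2014, 2015, 2016, 2017, 2018, 2019, 2021),
  Sex       = c("F", "F", "M", "M", "M", "M", "M", "M", "M", "M"),
  TTrips    = c(76, 102, 80, 46, 320, 421, 344, 2789, 295, 77)
)

# Sum trips per year and sex. Columns are named after the Sex values themselves,
# so nothing depends on the order in which the categories appear.
by_sex <- tapply(trips$TTrips, list(trips$Year_trip, trips$Sex), sum)

# Male-to-female trip ratio per year (NA where one sex has no trips that year).
m_times_f <- by_sex[, "M"] / by_sex[, "F"]

by_sex
m_times_f
```

With only ten rows the ratios are not meaningful; the point is the shape of the result, one row per year and one named column per sex.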
With this initial analysis we have identified a few trends that are worth discussing.
The system has more than 84 thousand registered users, but most of them seem to be occasional users. In 2023, the average user made only 0.61 trips per weekday (Monday to Friday). This suggests that MiBici is probably not the backbone of its users' daily transportation, but a complementary mode.
There is a significant gender gap between male and female users. Registered male users number more than double the female users, and the gap is wider still when we consider the trips actually made by each group.
Finally, most users seem to favor one particular station or service point to start or finish their trips. In line with the hypothesis that MiBici provides a complementary mobility solution, we can also hypothesize that the most used MiBici stations are those close to, or within the area of, mass transit infrastructure.