Since its emergence at the end of the 60’s, the public-sharing bicycles systems have been adapting to new technologies with the aim to ease the use, user identification and payment, and the demand for the use of these services has been increasing due to health, economic and environmental benefits, and encouraged by the increase of cycle paths.(1)
It is estimated that the bicycle-sharing market will grow at rate of 9.65% by year from 2024 to 2029(2).
The public bicycle-sharing enterprise “Cyclistic”, that operates in Chicago offers three types of rideables: classic bike, electric bike and docked bike; and has two types of users:those with one-trip or day-trip passes, named “Casual” and those with annual membership, and its Marketing team wants to know the differences between both types of users to make a Marketing plan to make casual users become member users.
With the aim to answer the question raised in the previous paragraph,the data of trips made each month in the last year 2023 were provided by the enterprise in 12 csv files for analysis. These archives had the following columns: ride_id, rideable_type, started_at, ended_at, start_station_name,start_station_id, end_station_name, end_station_id,start_lat, start_lng, end_lat,end_lng, member_casual.
Firstly, I made a general query to know how many rows the Sheets had. Then I searched for duplicate and null values using the SQL tool from Google Cloud®. Finally I changed the name of the column “member_casual” to “type_of_user”.
With the aim to determine the trend of rides during the year, the total number of trips by year, and the type of rideable by type of user, I grouped the total amount of rides by type of user and type of rideable by type of user respectively, using SQL tool by Google Cloud®. To calculate the hours of trips by type of user I selected the total of trips by each type of user in SQL tool by Google Cloud®, getting twelve spreadsheets for rides and rideable type and csv files for time of use. I consolidated in one spreadsheet with a total of trips and rideable types and obtained the following files: Rides_by_tipe_of_user, Bikes_by_user, time_of_trips_by_user_casual and time_of_trips_by_user_member. For the hours of trips I charged each csv archive for each month for each type of user in RStudio® desktop and transformed the columns with the dates of starting and ending trips from “Character” to “PositX” to make a calculated column with values in seconds of the first 10 trips for each month,I consolidated the data in a spreadsheet and transformed to hours dividing the seconds by 3600. Finally I obtained a file called time_of_trips_by_user that was used as sampling. Finally, I analyzed and graphed all the data using RStudio® desktop.
Both types of users have the same trend of use of the service
The first analysis I made was to know the amount of trips that each user made during the year 2023. As is shown in Fig.1, both types of user tend to use the service mostly between April and October, which match with the seasons of spring and summer which are the warmest.
library(readxl)
Rides_by_user <- read_excel("C:/Users/katve/Desktop/documentos/Proyecto final Ciclystic/Data for the Project/Rides_by_user.xlsx")
View(Rides_by_user)
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(reshape2)
##
## Attaching package: 'reshape2'
##
## The following object is masked from 'package:tidyr':
##
## smiths
long_Rides_by_user<-melt(Rides_by_user, id="Month")
View(long_Rides_by_user)
library(ggplot2)
library(scales)
##
## Attaching package: 'scales'
##
## The following object is masked from 'package:purrr':
##
## discard
##
## The following object is masked from 'package:readr':
##
## col_factor
cols<-c("#104E8B","#008B00")
ggplot(long_Rides_by_user, aes(x=Month,y=value, color=variable))+geom_line(lwd=1)+scale_color_manual(values=cols)+scale_y_continuous(labels=label_comma())+guides(color=guide_legend(title="User"))+labs(title= "Trips made in 2023 by Month",subtitle= "Trips by Type of User", x="Month", y="Trips",tag="Fig.1")
Member users do more trips than casual users
Apparently, casual users have less trips than members, so I wanted to verify by the average of annual trips by user. As is shown in Fig.2, apparently Member users make more trips on average than Casual, with an average of trips of 305058.2 and 171598.2 respectively. To assess if these differences were statistically relevant I made a Shapiro and Levene’s tests to determine if the data were normally distributed and if variances were homogeneous, and the p-values were 0.1068 and 0.7834 respectively, being both values greater than 𝞪=0.05, so the data distributed normally and variances were homogeneous. Next I made a T-Student analysis to check if the average of trips made by Member users were greater than of Casual users, and as shown in Fig.3, the T-Student confirms this hypothesis with a p-value of 0.004797 with 95% of confidence.
library(ggplot2)
ggplot(data = long_Rides_by_user, mapping = aes(x=variable, y=value))+stat_summary(fun = mean,geom= "bar",fill= "blue",color="black")+stat_summary(fun.data = mean_se, geom = "errorbar",width= .2, position="dodge")+scale_y_continuous(labels=label_comma())+labs(title= "Average of Trips in 2023",subtitle= "By type of user",x= "User", y="Trips", tag= "Fig.2")
library(tidyverse)
member<-long_Rides_by_user%>%filter(variable=="Member")%>%pull(value)
casual<-long_Rides_by_user%>%filter(variable=="Casual")%>%pull(value)
shapiro.test(member)
##
## Shapiro-Wilk normality test
##
## data: member
## W = 0.89856, p-value = 0.152
shapiro.test(casual)
##
## Shapiro-Wilk normality test
##
## data: casual
## W = 0.88669, p-value = 0.1068
library(car)
## Loading required package: carData
##
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
##
## recode
## The following object is masked from 'package:purrr':
##
## some
leveneTest(value~variable, data=long_Rides_by_user, center="median")
## Levene's Test for Homogeneity of Variance (center = "median")
## Df F value Pr(>F)
## group 1 0.0774 0.7834
## 22
t.test(member,casual, alternative= "greater")
##
## Welch Two Sample t-test
##
## data: member and casual
## t = 2.8374, df = 21.963, p-value = 0.004797
## alternative hypothesis: true difference in means is greater than 0
## 95 percent confidence interval:
## 52686.46 Inf
## sample estimates:
## mean of x mean of y
## 305058.2 171598.2
library(ggpubr)
p <- ggboxplot(long_Rides_by_user, x = 'variable', y = 'value',
color = 'variable', add="jitter")+stat_boxplot(geom="errorbar", width=0.25)+scale_y_continuous(labels=label_comma())+theme(legend.position="none")+labs(title= "T-test for Annual Trips by User",subtitle= "p-value=0.004797 ", x="User", y="Trips", tags="Fig.3")
p + stat_compare_means(aes(group = variable),method = 't.test', label = "p.signif", size = 3)
Both types of users have similar counts of hours
To determine if there were differences between both types of users, I transformed the type of date of columns started_at and ended_at to “PositX” format and created a column named “Time” by subtracting the column started_at to ended_at column, obtaining a Tibble with 10 first values from each month from each type of user that were used as sampling. Fig.4 shows that apparently casual users have more hours of trips by year than member users, however, when this data were analyzed by t-test, the p-value was 0.1013, being this value greater than 𝞪=0.05, so these differences are not statistically significant, as is shown in Fig.5. As for the trips by user, I assessed if the data were normally distributed and if variances were homogeneous using shapiro and leveneTest respectively, with p-values being less than 𝞪=0.05, so the data were not normally distributed and their variances were not homogeneous, however, since the t-student function of RStudio® for assumes that distribution is not normal and variance are not homogeneous, the test could be done anyway.
library(readr)
time_of_ride_member_jul <- read_csv("C:/Users/katve/Desktop/documentos/Proyecto final Ciclystic/Data for the Project/time_of_ride_member_jul.csv")
## Rows: 436292 Columns: 4
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (4): ride_id, started_at, ended_at, user
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
View(time_of_ride_member_jul)
library (lubridate)
time_of_ride_member_jul$started_at<-ymd_hms(time_of_ride_member_jul$started_at)
time_of_ride_member_jul$ended_at<-ymd_hms(time_of_ride_member_jul$ended_at)
library(dplyr)
mutate(time_of_ride_member_jul,Time=ended_at-started_at)
## # A tibble: 436,292 × 5
## ride_id started_at ended_at user Time
## <chr> <dttm> <dttm> <chr> <drtn>
## 1 7E0BF7DD6A7B7A53 2023-07-29 22:49:24 2023-07-29 23:11:21 member 1317 secs
## 2 E15227B5397696C0 2023-07-30 19:53:33 2023-07-30 19:58:03 member 270 secs
## 3 D06B7779352515F5 2023-07-03 21:08:56 2023-07-03 21:29:01 member 1205 secs
## 4 6C0DCD4F0EE78A2D 2023-07-06 08:36:12 2023-07-06 08:43:46 member 454 secs
## 5 46739470B25313C9 2023-07-19 12:45:54 2023-07-19 12:51:51 member 357 secs
## 6 1A46D5700466FB0B 2023-07-12 10:13:34 2023-07-12 10:22:51 member 557 secs
## 7 1FAE7D659D531417 2023-07-24 17:12:56 2023-07-24 17:19:20 member 384 secs
## 8 AC10BD021529B810 2023-07-05 16:47:11 2023-07-05 16:52:59 member 348 secs
## 9 B5D8E42062B31CC9 2023-07-20 17:14:46 2023-07-20 17:21:37 member 411 secs
## 10 25AACB9DF7716859 2023-07-07 18:27:27 2023-07-07 18:31:33 member 246 secs
## # ℹ 436,282 more rows
library(readxl)
time_of_trips_by_user <- read_excel("C:/Users/katve/Desktop/documentos/Proyecto final Ciclystic/Data for the Project/time_of_trips_by_user.xlsx")
View(time_of_trips_by_user)
library(ggplot2)
ggplot(data = time_of_trips_by_user, mapping = aes(x=User, y=Time))+stat_summary(fun = mean,geom= "bar",fill= "blue",color="black")+stat_summary(fun.data = mean_se, geom = "errorbar",width= .2, position="dodge")+scale_y_continuous(labels=label_comma())+labs(title= "Average Time of Trips in 2023",subtitle= "By type of user",x= "User", y="Hours", tag= "Fig.4")
library(readxl)
time_of_trips_by_user <- read_excel("C:/Users/katve/Desktop/documentos/Proyecto final Ciclystic/Data for the Project/time_of_trips_by_user.xlsx")
View(time_of_trips_by_user)
library(tidyverse)
member<-time_of_trips_by_user%>%filter(User=="Member")%>%pull(Time)
casual<-time_of_trips_by_user%>%filter(User=="Casual")%>%pull(Time)
shapiro.test(member)
##
## Shapiro-Wilk normality test
##
## data: member
## W = 0.65141, p-value = 0.0002982
shapiro.test(casual)
##
## Shapiro-Wilk normality test
##
## data: casual
## W = 0.76828, p-value = 0.004176
library(car)
leveneTest(Time~User, data= time_of_trips_by_user, center="median")
## Levene's Test for Homogeneity of Variance (center = "median")
## Df F value Pr(>F)
## group 1 3.228 0.08613 .
## 22
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
t.test(member, casual, alternative = "less")
##
## Welch Two Sample t-test
##
## data: member and casual
## t = -1.3397, df = 13.476, p-value = 0.1013
## alternative hypothesis: true difference in means is less than 0
## 95 percent confidence interval:
## -Inf 49.95225
## sample estimates:
## mean of x mean of y
## 125.7153 282.6253
library(readxl)
time_of_trips_by_user <- read_excel("C:/Users/katve/Desktop/documentos/Proyecto final Ciclystic/Data for the Project/time_of_trips_by_user.xlsx")
View(time_of_trips_by_user)
library(ggplot2)
library(ggpubr)
p <- ggboxplot(time_of_trips_by_user, x = 'User', y = 'Time',
color = 'User', add="jitter")+stat_boxplot(geom="errorbar", width=0.25)+scale_y_continuous(labels=label_comma())+theme(legend.position="none")+labs(title= "T-test Average Time of Trips in 2023",subtitle= "p-value=0.1013 ", x="User", y="Hours", tags="Fig.5")
p + stat_compare_means(aes(group = User),method = 't.test', label = "p.signif", size = 3)
Only casual users use docked bikes
Next, I assessed the kind of rideable that each type of user uses for their trips to find if there were differences between both types of users. Fig.6 shows that only casual users use docked bikes, being this type of rideable less used than electric and classic bikes, that are mostly preferred by both types of users.
library(readxl)
Bikes_by_user <- read_excel("C:/Users/katve/Desktop/documentos/Proyecto final Ciclystic/Data for the Project/Bikes_by_user.xlsx")
View(Bikes_by_user)
library(reshape2)
long_Bikes_by_user<-melt(Bikes_by_user, id="User")
View(long_Bikes_by_user)
library(ggplot2)
ggplot(long_Bikes_by_user, mapping = aes(x=User,y= value,fill=variable))+geom_bar(stat = "identity")+theme(panel.background = element_rect(fill = "#8B8878"))+scale_fill_brewer()+scale_y_continuous(labels=label_comma())+guides(fill=guide_legend(title="Rideable type"))+labs(title= "Type of Bicycle used for Trips",subtitle= "Rideable type by user", x="User", y="Trips",tag="Fig.6")
Both types of users make more trips in spring and summer Member users make more trips than casual members Both types of users have similar average of hours of use of service Only casual users uses docked-bikes
Based on these findings I recommend to offer an upgrade for casual users, with monthly, quarterly or semestral subscription with the condition to only use classic and electric bikes.
A, S. (2023). An analysis of Bike-sharing systems, its influencing factors and impacts Note: Sub-titles are not captured in Xplore and should not be used. www.academia.edu. https://www.academia.edu/111400583/An_analysis_of_Bike_sharing_systems_its_influencing_factors_and_impacts_Note_Sub_titles_are_not_captured_in_Xplore_and_should_not_be_used
Bike sharing Market Insights. (s. f.). https://www.mordorintelligence.com/industry-reports/bike-sharing-market
Thanks to Talento Digital Para Chile for giving me the opportunity to do this scholarship program, and to Mai, for helping and guiding me with data grouping in SQL tool.