My First Data Analysis Project

Finding Differences Between Types of Users of a Public Bicycle-Sharing Enterprise

Final Project of Google Data Analytics Certificate

Background

Since its emergence at the end of the 60’s, the public-sharing bicycles systems have been adapting to new technologies with the aim to ease the use, user identification and payment, and the demand for the use of these services has been increasing due to health, economic and environmental benefits, and encouraged by the increase of cycle paths.(1)

It is estimated that the bicycle-sharing market will grow at rate of 9.65% by year from 2024 to 2029(2).

The public bicycle-sharing enterprise “Cyclistic”, that operates in Chicago offers three types of rideables: classic bike, electric bike and docked bike; and has two types of users:those with one-trip or day-trip passes, named “Casual” and those with annual membership, and its Marketing team wants to know the differences between both types of users to make a Marketing plan to make casual users become member users.

Data origin

With the aim to answer the question raised in the previous paragraph,the data of trips made each month in the last year 2023 were provided by the enterprise in 12 csv files for analysis. These archives had the following columns: ride_id, rideable_type, started_at, ended_at, start_station_name,start_station_id, end_station_name, end_station_id,start_lat, start_lng, end_lat,end_lng, member_casual.

Cleaning, filtering and analyzing data

Firstly, I made a general query to know how many rows the Sheets had. Then I searched for duplicate and null values using the SQL tool from Google Cloud®. Finally I changed the name of the column “member_casual” to “type_of_user”.

With the aim to determine the trend of rides during the year, the total number of trips by year, and the type of rideable by type of user, I grouped the total amount of rides by type of user and type of rideable by type of user respectively, using SQL tool by Google Cloud®. To calculate the hours of trips by type of user I selected the total of trips by each type of user in SQL tool by Google Cloud®, getting twelve spreadsheets for rides and rideable type and csv files for time of use. I consolidated in one spreadsheet with a total of trips and rideable types and obtained the following files: Rides_by_tipe_of_user, Bikes_by_user, time_of_trips_by_user_casual and time_of_trips_by_user_member. For the hours of trips I charged each csv archive for each month for each type of user in RStudio® desktop and transformed the columns with the dates of starting and ending trips from “Character” to “PositX” to make a calculated column with values in seconds of the first 10 trips for each month,I consolidated the data in a spreadsheet and transformed to hours dividing the seconds by 3600. Finally I obtained a file called time_of_trips_by_user that was used as sampling. Finally, I analyzed and graphed all the data using RStudio® desktop.

Results

Both types of users have the same trend of use of the service

The first analysis I made was to know the amount of trips that each user made during the year 2023. As is shown in Fig.1, both types of user tend to use the service mostly between April and October, which match with the seasons of spring and summer which are the warmest.

library(readxl)
Rides_by_user <- read_excel("C:/Users/katve/Desktop/documentos/Proyecto final Ciclystic/Data for the Project/Rides_by_user.xlsx")
View(Rides_by_user)
library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(reshape2)

## 
## Attaching package: 'reshape2'
## 
## The following object is masked from 'package:tidyr':
## 
##     smiths

long_Rides_by_user<-melt(Rides_by_user, id="Month")
View(long_Rides_by_user)
library(ggplot2)
library(scales)

## 
## Attaching package: 'scales'
## 
## The following object is masked from 'package:purrr':
## 
##     discard
## 
## The following object is masked from 'package:readr':
## 
##     col_factor

cols<-c("#104E8B","#008B00") 
ggplot(long_Rides_by_user, aes(x=Month,y=value, color=variable))+geom_line(lwd=1)+scale_color_manual(values=cols)+scale_y_continuous(labels=label_comma())+guides(color=guide_legend(title="User"))+labs(title= "Trips made in 2023 by Month",subtitle= "Trips by Type of User", x="Month", y="Trips",tag="Fig.1")

Member users do more trips than casual users

Apparently, casual users have less trips than members, so I wanted to verify by the average of annual trips by user. As is shown in Fig.2, apparently Member users make more trips on average than Casual, with an average of trips of 305058.2 and 171598.2 respectively. To assess if these differences were statistically relevant I made a Shapiro and Levene’s tests to determine if the data were normally distributed and if variances were homogeneous, and the p-values were 0.1068 and 0.7834 respectively, being both values greater than 𝞪=0.05, so the data distributed normally and variances were homogeneous. Next I made a T-Student analysis to check if the average of trips made by Member users were greater than of Casual users, and as shown in Fig.3, the T-Student confirms this hypothesis with a p-value of 0.004797 with 95% of confidence.

library(ggplot2)
ggplot(data = long_Rides_by_user, mapping = aes(x=variable, y=value))+stat_summary(fun = mean,geom= "bar",fill= "blue",color="black")+stat_summary(fun.data = mean_se, geom = "errorbar",width= .2, position="dodge")+scale_y_continuous(labels=label_comma())+labs(title= "Average of Trips in 2023",subtitle= "By type of user",x= "User", y="Trips", tag= "Fig.2")

library(tidyverse)
member<-long_Rides_by_user%>%filter(variable=="Member")%>%pull(value)
casual<-long_Rides_by_user%>%filter(variable=="Casual")%>%pull(value)

shapiro.test(member)

## 
##  Shapiro-Wilk normality test
## 
## data:  member
## W = 0.89856, p-value = 0.152

shapiro.test(casual)

## 
##  Shapiro-Wilk normality test
## 
## data:  casual
## W = 0.88669, p-value = 0.1068

library(car)

## Loading required package: carData

## 
## Attaching package: 'car'

## The following object is masked from 'package:dplyr':
## 
##     recode

## The following object is masked from 'package:purrr':
## 
##     some

leveneTest(value~variable, data=long_Rides_by_user, center="median")

## Levene's Test for Homogeneity of Variance (center = "median")
##       Df F value Pr(>F)
## group  1  0.0774 0.7834
##       22

t.test(member,casual, alternative= "greater")

## 
##  Welch Two Sample t-test
## 
## data:  member and casual
## t = 2.8374, df = 21.963, p-value = 0.004797
## alternative hypothesis: true difference in means is greater than 0
## 95 percent confidence interval:
##  52686.46      Inf
## sample estimates:
## mean of x mean of y 
##  305058.2  171598.2

library(ggpubr)
p <- ggboxplot(long_Rides_by_user, x = 'variable', y = 'value',
               color = 'variable', add="jitter")+stat_boxplot(geom="errorbar", width=0.25)+scale_y_continuous(labels=label_comma())+theme(legend.position="none")+labs(title= "T-test for Annual Trips by User",subtitle= "p-value=0.004797 ", x="User", y="Trips", tags="Fig.3")
p + stat_compare_means(aes(group = variable),method = 't.test', label = "p.signif", size = 3)

Both types of users have similar counts of hours

To determine if there were differences between both types of users, I transformed the type of date of columns started_at and ended_at to “PositX” format and created a column named “Time” by subtracting the column started_at to ended_at column, obtaining a Tibble with 10 first values from each month from each type of user that were used as sampling. Fig.4 shows that apparently casual users have more hours of trips by year than member users, however, when this data were analyzed by t-test, the p-value was 0.1013, being this value greater than 𝞪=0.05, so these differences are not statistically significant, as is shown in Fig.5. As for the trips by user, I assessed if the data were normally distributed and if variances were homogeneous using shapiro and leveneTest respectively, with p-values being less than 𝞪=0.05, so the data were not normally distributed and their variances were not homogeneous, however, since the t-student function of RStudio® for assumes that distribution is not normal and variance are not homogeneous, the test could be done anyway.

library(readr)
time_of_ride_member_jul <- read_csv("C:/Users/katve/Desktop/documentos/Proyecto final Ciclystic/Data for the Project/time_of_ride_member_jul.csv")

## Rows: 436292 Columns: 4
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (4): ride_id, started_at, ended_at, user
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

View(time_of_ride_member_jul)
library (lubridate)
time_of_ride_member_jul$started_at<-ymd_hms(time_of_ride_member_jul$started_at)
time_of_ride_member_jul$ended_at<-ymd_hms(time_of_ride_member_jul$ended_at)
library(dplyr)
mutate(time_of_ride_member_jul,Time=ended_at-started_at)

## # A tibble: 436,292 × 5
##    ride_id          started_at          ended_at            user   Time     
##    <chr>            <dttm>              <dttm>              <chr>  <drtn>   
##  1 7E0BF7DD6A7B7A53 2023-07-29 22:49:24 2023-07-29 23:11:21 member 1317 secs
##  2 E15227B5397696C0 2023-07-30 19:53:33 2023-07-30 19:58:03 member  270 secs
##  3 D06B7779352515F5 2023-07-03 21:08:56 2023-07-03 21:29:01 member 1205 secs
##  4 6C0DCD4F0EE78A2D 2023-07-06 08:36:12 2023-07-06 08:43:46 member  454 secs
##  5 46739470B25313C9 2023-07-19 12:45:54 2023-07-19 12:51:51 member  357 secs
##  6 1A46D5700466FB0B 2023-07-12 10:13:34 2023-07-12 10:22:51 member  557 secs
##  7 1FAE7D659D531417 2023-07-24 17:12:56 2023-07-24 17:19:20 member  384 secs
##  8 AC10BD021529B810 2023-07-05 16:47:11 2023-07-05 16:52:59 member  348 secs
##  9 B5D8E42062B31CC9 2023-07-20 17:14:46 2023-07-20 17:21:37 member  411 secs
## 10 25AACB9DF7716859 2023-07-07 18:27:27 2023-07-07 18:31:33 member  246 secs
## # ℹ 436,282 more rows

library(readxl)
time_of_trips_by_user <- read_excel("C:/Users/katve/Desktop/documentos/Proyecto final Ciclystic/Data for the Project/time_of_trips_by_user.xlsx")
View(time_of_trips_by_user)
library(ggplot2)
ggplot(data = time_of_trips_by_user, mapping = aes(x=User, y=Time))+stat_summary(fun = mean,geom= "bar",fill= "blue",color="black")+stat_summary(fun.data = mean_se, geom = "errorbar",width= .2, position="dodge")+scale_y_continuous(labels=label_comma())+labs(title= "Average Time of Trips in 2023",subtitle= "By type of user",x= "User", y="Hours", tag= "Fig.4")

library(readxl)
time_of_trips_by_user <- read_excel("C:/Users/katve/Desktop/documentos/Proyecto final Ciclystic/Data for the Project/time_of_trips_by_user.xlsx")
View(time_of_trips_by_user)
library(tidyverse)
member<-time_of_trips_by_user%>%filter(User=="Member")%>%pull(Time)
casual<-time_of_trips_by_user%>%filter(User=="Casual")%>%pull(Time)

shapiro.test(member)

## 
##  Shapiro-Wilk normality test
## 
## data:  member
## W = 0.65141, p-value = 0.0002982

shapiro.test(casual)

## 
##  Shapiro-Wilk normality test
## 
## data:  casual
## W = 0.76828, p-value = 0.004176

library(car)
leveneTest(Time~User, data= time_of_trips_by_user, center="median")

## Levene's Test for Homogeneity of Variance (center = "median")
##       Df F value  Pr(>F)  
## group  1   3.228 0.08613 .
##       22                  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

t.test(member, casual, alternative = "less")

## 
##  Welch Two Sample t-test
## 
## data:  member and casual
## t = -1.3397, df = 13.476, p-value = 0.1013
## alternative hypothesis: true difference in means is less than 0
## 95 percent confidence interval:
##      -Inf 49.95225
## sample estimates:
## mean of x mean of y 
##  125.7153  282.6253

library(readxl)
time_of_trips_by_user <- read_excel("C:/Users/katve/Desktop/documentos/Proyecto final Ciclystic/Data for the Project/time_of_trips_by_user.xlsx")
View(time_of_trips_by_user)
library(ggplot2)
library(ggpubr)
p <- ggboxplot(time_of_trips_by_user, x = 'User', y = 'Time',
               color = 'User', add="jitter")+stat_boxplot(geom="errorbar", width=0.25)+scale_y_continuous(labels=label_comma())+theme(legend.position="none")+labs(title= "T-test Average Time of Trips in 2023",subtitle= "p-value=0.1013 ", x="User", y="Hours", tags="Fig.5")
p + stat_compare_means(aes(group = User),method = 't.test', label = "p.signif", size = 3)

Only casual users use docked bikes

Next, I assessed the kind of rideable that each type of user uses for their trips to find if there were differences between both types of users. Fig.6 shows that only casual users use docked bikes, being this type of rideable less used than electric and classic bikes, that are mostly preferred by both types of users.

library(readxl)
Bikes_by_user <- read_excel("C:/Users/katve/Desktop/documentos/Proyecto final Ciclystic/Data for the Project/Bikes_by_user.xlsx")
View(Bikes_by_user)
library(reshape2)
long_Bikes_by_user<-melt(Bikes_by_user, id="User")
View(long_Bikes_by_user)
library(ggplot2)
ggplot(long_Bikes_by_user, mapping = aes(x=User,y= value,fill=variable))+geom_bar(stat = "identity")+theme(panel.background = element_rect(fill = "#8B8878"))+scale_fill_brewer()+scale_y_continuous(labels=label_comma())+guides(fill=guide_legend(title="Rideable type"))+labs(title= "Type of Bicycle used for Trips",subtitle= "Rideable type by user", x="User", y="Trips",tag="Fig.6")

Conclusions

Both types of users make more trips in spring and summer Member users make more trips than casual members Both types of users have similar average of hours of use of service Only casual users uses docked-bikes

Recommendations

Based on these findings I recommend to offer an upgrade for casual users, with monthly, quarterly or semestral subscription with the condition to only use classic and electric bikes.

References

A, S. (2023). An analysis of Bike-sharing systems, its influencing factors and impacts Note: Sub-titles are not captured in Xplore and should not be used. www.academia.edu. https://www.academia.edu/111400583/An_analysis_of_Bike_sharing_systems_its_influencing_factors_and_impacts_Note_Sub_titles_are_not_captured_in_Xplore_and_should_not_be_used
Bike sharing Market Insights. (s. f.). https://www.mordorintelligence.com/industry-reports/bike-sharing-market

Acknowledgements

Thanks to Talento Digital Para Chile for giving me the opportunity to do this scholarship program, and to Mai, for helping and guiding me with data grouping in SQL tool.

My First Data Analysis Project

Katiana Vergara

2024-07-27