Scenario :


This is my version of the Google Data Analytics : Case Study 1 , A Cyclistic bike-share Analysis case study. The “Cyclistic” is a fictional company for this case study. They said this case study will be a ‘Tangible’ way to demonstrate my knowledge and skills so here we are 😊 : I’m a junior ‘Data Analyst’ working in the marketing analytics team at company named “Cyclistic”, a bike-share company in Chicago. I’ve joined this team six months ago and have been busy learning about Cyclistic’s mission and buisness goals as well as how I, as a junior data analyst, can help Cyclistic achieve them. The director of marketing believes the company’s future success depends on maximizing the number of annual memberships.


Characters and teams :


  • Cyclistic: A bike-share fictional company.


  • Lily Moreno: The director of marketing and my manager.


  • Cyclistic Marketing Analytics team: My team of Data Analysts.


  • Cyclistic Executive team: The detail-oriented executive team.



Before we begin the Process for this Project, there are a few key points that are wrapped below as these are the steps I’ll be following to ensure its completion:

 I'll be following these vital steps for the Data Analysis Process:
         Ask, Prepare, Process, Analyze, Share, and Act.



Data Analysis Process



PHASE 1 : Ask


The director of marketing believes the company’s future success depends on maximizing the number of annual memberships.



In this phase:


Key taks
  • Identified the business task
  • Considered key stakeholders


Deliverable

  • To gain insights from Data to solve buisness problem



PHASE 2 : Prepare


I’ll be using Cyclistic’s historical trip data to analyze and identify trends, Downloaded last 12 months of data and stored it on my Google Drive


The data is located on divvy-tripdata and data has been made available by Motivate International Inc. under this license.


The company has their own licence over the dataset. Besides that, the 
dataset doesn't have any personal information about the riders. So, I'm 
sure that Data isn't Biased And have full credibility for the same reason. 
Finally, it's ROCCC because it's reliable, original, comprehensive, 
current and cited.
Verified the data's integrity by checking all the files, have consistent 
columns and the data types.


Key taks
  • Downloaded data and stored it appropriately.
  • Identified how it’s organized.
  • Sorted and filtered the data.
  • Determined the credibility of the data.


Deliverable

  • The description of all data sources used between May 2022 - April 2023



PHASE 3 : Process


I’ve also done analysis using excel but due to “Big Data” MS Excel won’t be able to load it.


R is primarily used for statistical analysis and data visualization. So, I chose ‘RStudio’ to merge all the months based Data into One Single file for further Analysis.


Setting up the Environment


Dependencies
library(tidyverse)
library(lubridate)
library(ggplot2)
library(readxl)


Working Directory
setwd("D:/Case_Study/Data/CSV")


Data Collection


t22_05 <- read.csv("202205-divvy-tripdata.csv")

t22_06 <- read.csv("202206-divvy-tripdata.csv")

t22_07 <- read.csv("202207-divvy-tripdata.csv")

t22_08 <- read.csv("202208-divvy-tripdata.csv")

t22_09 <- read.csv("202209-divvy-publictripdata.csv")

t22_10 <- read.csv("202210-divvy-tripdata.csv")

t22_11 <- read.csv("202211-divvy-tripdata.csv")

t22_12 <- read.csv("202212-divvy-tripdata.csv")

t23_01 <- read.csv("202301-divvy-tripdata.csv")

t23_02 <- read.csv("202302-divvy-tripdata.csv")

t23_03 <- read.csv("202303-divvy-tripdata.csv")

t23_04 <- read.csv("202304_divvy_tripdata.csv")


Data Wrangling

  • Ensured Data’s integrity one by one before compiling onto one single as Big Data.
  • Matching column(s) name consistent with “t23_03” as this is the latest data produced.


str(t23_04) # checked all the data one by one with " str() ".
## 'data.frame':    426590 obs. of  13 variables:
##  $ ride_id           : chr  "8FE8F7D9C10E88C7" "34E4ED3ADF1D821B" "5296BF07A2F77CB5" "40759916B76D5D52" ...
##  $ rideable_type     : chr  "electric_bike" "electric_bike" "electric_bike" "electric_bike" ...
##  $ started_at        : chr  "2023-04-02 08:37:28" "2023-04-19 11:29:02" "2023-04-19 08:41:22" "2023-04-19 13:31:30" ...
##  $ ended_at          : chr  "2023-04-02 08:41:37" "2023-04-19 11:52:12" "2023-04-19 08:43:22" "2023-04-19 13:35:09" ...
##  $ start_station_name: chr  "" "" "" "" ...
##  $ start_station_id  : chr  "" "" "" "" ...
##  $ end_station_name  : chr  "" "" "" "" ...
##  $ end_station_id    : chr  "" "" "" "" ...
##  $ start_lat         : num  41.8 41.9 41.9 41.9 41.9 ...
##  $ start_lng         : num  -87.6 -87.7 -87.7 -87.7 -87.7 ...
##  $ end_lat           : num  41.8 41.9 41.9 41.9 41.9 ...
##  $ end_lng           : num  -87.6 -87.7 -87.7 -87.7 -87.6 ...
##  $ member_casual     : chr  "member" "member" "member" "member" ...
  • Everything looks great


Merging all the Data into Single One for Analysis.

tripdata <- bind_rows(t22_05,t22_06,t22_07,t22_08,t22_09,t22_10,
                  t22_11,t22_12,t23_01,t23_02,t23_03,t23_04)


cross-checked data’s integrity :

str(tripdata)
## 'data.frame':    5859061 obs. of  13 variables:
##  $ ride_id           : chr  "EC2DE40644C6B0F4" "1C31AD03897EE385" "1542FBEC830415CF" "6FF59852924528F8" ...
##  $ rideable_type     : chr  "classic_bike" "classic_bike" "classic_bike" "classic_bike" ...
##  $ started_at        : chr  "2022-05-23 23:06:58" "2022-05-11 08:53:28" "2022-05-26 18:36:28" "2022-05-10 07:30:07" ...
##  $ ended_at          : chr  "2022-05-23 23:40:19" "2022-05-11 09:31:22" "2022-05-26 18:58:18" "2022-05-10 07:38:49" ...
##  $ start_station_name: chr  "Wabash Ave & Grand Ave" "DuSable Lake Shore Dr & Monroe St" "Clinton St & Madison St" "Clinton St & Madison St" ...
##  $ start_station_id  : chr  "TA1307000117" "13300" "TA1305000032" "TA1305000032" ...
##  $ end_station_name  : chr  "Halsted St & Roscoe St" "Field Blvd & South Water St" "Wood St & Milwaukee Ave" "Clark St & Randolph St" ...
##  $ end_station_id    : chr  "TA1309000025" "15534" "13221" "TA1305000030" ...
##  $ start_lat         : num  41.9 41.9 41.9 41.9 41.9 ...
##  $ start_lng         : num  -87.6 -87.6 -87.6 -87.6 -87.6 ...
##  $ end_lat           : num  41.9 41.9 41.9 41.9 41.9 ...
##  $ end_lng           : num  -87.6 -87.6 -87.7 -87.6 -87.7 ...
##  $ member_casual     : chr  "member" "member" "member" "member" ...


Saved merged Data :

write.csv(tripdata, file = "tripdata.csv")


Removed ‘lat’ & ‘lng’ columns.

will be talking about ‘lat’ & ‘lng’ in the end of the project.


  • Removed non-necessary column :
tripdata <- tripdata %>% 
  select(-c(start_lat, start_lng, end_lat, end_lng))


Checked Data Structure.

str(tripdata)
## 'data.frame':    5859061 obs. of  9 variables:
##  $ ride_id           : chr  "EC2DE40644C6B0F4" "1C31AD03897EE385" "1542FBEC830415CF" "6FF59852924528F8" ...
##  $ rideable_type     : chr  "classic_bike" "classic_bike" "classic_bike" "classic_bike" ...
##  $ started_at        : chr  "2022-05-23 23:06:58" "2022-05-11 08:53:28" "2022-05-26 18:36:28" "2022-05-10 07:30:07" ...
##  $ ended_at          : chr  "2022-05-23 23:40:19" "2022-05-11 09:31:22" "2022-05-26 18:58:18" "2022-05-10 07:38:49" ...
##  $ start_station_name: chr  "Wabash Ave & Grand Ave" "DuSable Lake Shore Dr & Monroe St" "Clinton St & Madison St" "Clinton St & Madison St" ...
##  $ start_station_id  : chr  "TA1307000117" "13300" "TA1305000032" "TA1305000032" ...
##  $ end_station_name  : chr  "Halsted St & Roscoe St" "Field Blvd & South Water St" "Wood St & Milwaukee Ave" "Clark St & Randolph St" ...
##  $ end_station_id    : chr  "TA1309000025" "15534" "13221" "TA1305000030" ...
##  $ member_casual     : chr  "member" "member" "member" "member" ...


Statistical Summary of Data.

summary(tripdata) 
##    ride_id          rideable_type       started_at          ended_at        
##  Length:5859061     Length:5859061     Length:5859061     Length:5859061    
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##  start_station_name start_station_id   end_station_name   end_station_id    
##  Length:5859061     Length:5859061     Length:5859061     Length:5859061    
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##  member_casual     
##  Length:5859061    
##  Class :character  
##  Mode  :character


Checking number of riders as casual and member :

table(tripdata$member_casual) 
## 
##  casual  member 
## 2358307 3500754


Data Cleaning :


Added columns as date, month, day, year and day_of_week for analysis
  and formatted as date :
tripdata$date <- as.Date(tripdata$started_at)
tripdata$day <- format(as.Date(tripdata$date), "%d")
tripdata$month <- format(as.Date(tripdata$date), "%B")
tripdata$year <- format(as.Date(tripdata$date), "%Y")
tripdata$day_of_week <- format(as.Date(tripdata$date), "%A")


Adding columns as pick_time & drop_time and formatted as datetime :
tripdata$pick <- as_datetime(tripdata$started_at)
tripdata$pick_time <- format(as_datetime(tripdata$pick), "%H:%M:")

tripdata$drop <- as_datetime(tripdata$ended_at)
tripdata$drop_time <- format(as_datetime(tripdata$drop), "%H:%M")


  • Took a look on Data
glimpse(tripdata)
## Rows: 5,859,061
## Columns: 18
## $ ride_id            <chr> "EC2DE40644C6B0F4", "1C31AD03897EE385", "1542FBEC83…
## $ rideable_type      <chr> "classic_bike", "classic_bike", "classic_bike", "cl…
## $ started_at         <chr> "2022-05-23 23:06:58", "2022-05-11 08:53:28", "2022…
## $ ended_at           <chr> "2022-05-23 23:40:19", "2022-05-11 09:31:22", "2022…
## $ start_station_name <chr> "Wabash Ave & Grand Ave", "DuSable Lake Shore Dr & …
## $ start_station_id   <chr> "TA1307000117", "13300", "TA1305000032", "TA1305000…
## $ end_station_name   <chr> "Halsted St & Roscoe St", "Field Blvd & South Water…
## $ end_station_id     <chr> "TA1309000025", "15534", "13221", "TA1305000030", "…
## $ member_casual      <chr> "member", "member", "member", "member", "member", "…
## $ date               <date> 2022-05-23, 2022-05-11, 2022-05-26, 2022-05-10, 20…
## $ day                <chr> "23", "11", "26", "10", "10", "04", "27", "29", "16…
## $ month              <chr> "May", "May", "May", "May", "May", "May", "May", "M…
## $ year               <chr> "2022", "2022", "2022", "2022", "2022", "2022", "20…
## $ day_of_week        <chr> "Monday", "Wednesday", "Thursday", "Tuesday", "Tues…
## $ pick               <dttm> 2022-05-23 23:06:58, 2022-05-11 08:53:28, 2022-05-…
## $ pick_time          <chr> "23:06:", "08:53:", "18:36:", "07:30:", "17:31:", "…
## $ drop               <dttm> 2022-05-23 23:40:19, 2022-05-11 09:31:22, 2022-05-…
## $ drop_time          <chr> "23:40", "09:31", "18:58", "07:38", "17:36", "14:56…


Removed ‘pick’ & ‘drop’ columns :
tripdata <- tripdata %>% 
  select(-c(pick,drop))


Checked for Data duplication :
tripdata_duplicate <- tripdata[!duplicated(tripdata$ride_id),]


  • Took a look on Data
glimpse(tripdata)
## Rows: 5,859,061
## Columns: 16
## $ ride_id            <chr> "EC2DE40644C6B0F4", "1C31AD03897EE385", "1542FBEC83…
## $ rideable_type      <chr> "classic_bike", "classic_bike", "classic_bike", "cl…
## $ started_at         <chr> "2022-05-23 23:06:58", "2022-05-11 08:53:28", "2022…
## $ ended_at           <chr> "2022-05-23 23:40:19", "2022-05-11 09:31:22", "2022…
## $ start_station_name <chr> "Wabash Ave & Grand Ave", "DuSable Lake Shore Dr & …
## $ start_station_id   <chr> "TA1307000117", "13300", "TA1305000032", "TA1305000…
## $ end_station_name   <chr> "Halsted St & Roscoe St", "Field Blvd & South Water…
## $ end_station_id     <chr> "TA1309000025", "15534", "13221", "TA1305000030", "…
## $ member_casual      <chr> "member", "member", "member", "member", "member", "…
## $ date               <date> 2022-05-23, 2022-05-11, 2022-05-26, 2022-05-10, 20…
## $ day                <chr> "23", "11", "26", "10", "10", "04", "27", "29", "16…
## $ month              <chr> "May", "May", "May", "May", "May", "May", "May", "M…
## $ year               <chr> "2022", "2022", "2022", "2022", "2022", "2022", "20…
## $ day_of_week        <chr> "Monday", "Wednesday", "Thursday", "Tuesday", "Tues…
## $ pick_time          <chr> "23:06:", "08:53:", "18:36:", "07:30:", "17:31:", "…
## $ drop_time          <chr> "23:40", "09:31", "18:58", "07:38", "17:36", "14:56…


Adding a column ‘ride_length’
tripdata$ride_length <- difftime(tripdata$ended_at,tripdata$started_at)
is.factor(tripdata$ride_length) # Confirming data not to be factor
## [1] FALSE
is.numeric(tripdata$ride_length) # Confirming data is to be numeric
## [1] FALSE
tripdata$ride_length <- as.numeric(as.duration(tripdata$ride_length))
is.numeric(tripdata$ride_length)
## [1] TRUE


  • Looks Great


  • Documentation of cleaning or manipulation of data clarified too



PHASE 4 : Analyze


CONDUCTING DESCRIPTIVE ANALYSIS


  • Making copy of original data :
tripdata_new <- tripdata


Statistical Summary of Data
mean(tripdata_new$ride_length)      # Average ride duration
## [1] 1136.131
median(tripdata_new$ride_length)    # Median ride duration
## [1] 599
max(tripdata_new$ride_length)       # Maximum ride duration
## [1] 2483235
min(tripdata_new$ride_length)       # Minimum ride duration
## [1] -621201
summary(tripdata_new$ride_length)   # Statistical summary of ride duration
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -621201     339     599    1136    1075 2483235


  • minimum ride duration by riders are negative.


Removing Bad Data :
tripdata_new <- tripdata_new[!tripdata_new$ride_length <0, ]
str(tripdata_new)  # Took a look on data
## 'data.frame':    5858958 obs. of  17 variables:
##  $ ride_id           : chr  "EC2DE40644C6B0F4" "1C31AD03897EE385" "1542FBEC830415CF" "6FF59852924528F8" ...
##  $ rideable_type     : chr  "classic_bike" "classic_bike" "classic_bike" "classic_bike" ...
##  $ started_at        : chr  "2022-05-23 23:06:58" "2022-05-11 08:53:28" "2022-05-26 18:36:28" "2022-05-10 07:30:07" ...
##  $ ended_at          : chr  "2022-05-23 23:40:19" "2022-05-11 09:31:22" "2022-05-26 18:58:18" "2022-05-10 07:38:49" ...
##  $ start_station_name: chr  "Wabash Ave & Grand Ave" "DuSable Lake Shore Dr & Monroe St" "Clinton St & Madison St" "Clinton St & Madison St" ...
##  $ start_station_id  : chr  "TA1307000117" "13300" "TA1305000032" "TA1305000032" ...
##  $ end_station_name  : chr  "Halsted St & Roscoe St" "Field Blvd & South Water St" "Wood St & Milwaukee Ave" "Clark St & Randolph St" ...
##  $ end_station_id    : chr  "TA1309000025" "15534" "13221" "TA1305000030" ...
##  $ member_casual     : chr  "member" "member" "member" "member" ...
##  $ date              : Date, format: "2022-05-23" "2022-05-11" ...
##  $ day               : chr  "23" "11" "26" "10" ...
##  $ month             : chr  "May" "May" "May" "May" ...
##  $ year              : chr  "2022" "2022" "2022" "2022" ...
##  $ day_of_week       : chr  "Monday" "Wednesday" "Thursday" "Tuesday" ...
##  $ pick_time         : chr  "23:06:" "08:53:" "18:36:" "07:30:" ...
##  $ drop_time         : chr  "23:40" "09:31" "18:58" "07:38" ...
##  $ ride_length       : num  2001 2274 1310 522 301 ...


  • New rows 5858958 , columns 17


Statistical Summary of ride duration :
summary(tripdata_new$ride_length)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       0     339     599    1136    1075 2483235


Compared members and casual riders by ride duration :
aggregate(tripdata_new$ride_length~tripdata_new$member_casual, FUN= mean) #average
##   tripdata_new$member_casual tripdata_new$ride_length
## 1                     casual                1709.6796
## 2                     member                 750.0187
aggregate(tripdata_new$ride_length~tripdata_new$member_casual, FUN= median) #median
##   tripdata_new$member_casual tripdata_new$ride_length
## 1                     casual                      751
## 2                     member                      520
aggregate(tripdata_new$ride_length~tripdata_new$member_casual, FUN= max)  #maximun
##   tripdata_new$member_casual tripdata_new$ride_length
## 1                     casual                  2483235
## 2                     member                    93580
aggregate(tripdata_new$ride_length~tripdata_new$member_casual, FUN= min) #minimum
##   tripdata_new$member_casual tripdata_new$ride_length
## 1                     casual                        0
## 2                     member                        0


Average ride duration for each day by members vs casual users :
aggregate(tripdata_new$ride_length~tripdata_new$member_casual+
            tripdata_new$day_of_week, FUN= mean)
##    tripdata_new$member_casual tripdata_new$day_of_week tripdata_new$ride_length
## 1                      casual                   Friday                1651.1089
## 2                      member                   Friday                 741.8887
## 3                      casual                   Monday                1702.1047
## 4                      member                   Monday                 718.6881
## 5                      casual                 Saturday                1934.6068
## 6                      member                 Saturday                 835.9089
## 7                      casual                   Sunday                2006.2732
## 8                      member                   Sunday                 829.4256
## 9                      casual                 Thursday                1481.9947
## 10                     member                 Thursday                 726.0527
## 11                     casual                  Tuesday                1520.1513
## 12                     member                  Tuesday                 717.0647
## 13                     casual                Wednesday                1450.3821
## 14                     member                Wednesday                 713.4792
  • Day of Week are Not in order


“Days of the Week” are out of order. Fixed it :
tripdata_new$day_of_week <- ordered(tripdata_new$day_of_week, 
                                    levels = c("Sunday","Monday",
                                               "Tuesday","Wednesday",
                                               "Thursday","Friday","Saturday"))


Average ride duration for each day by members vs casual users :
aggregate(tripdata_new$ride_length~tripdata_new$member_casual+
            tripdata_new$day_of_week, FUN= mean)
##    tripdata_new$member_casual tripdata_new$day_of_week tripdata_new$ride_length
## 1                      casual                   Sunday                2006.2732
## 2                      member                   Sunday                 829.4256
## 3                      casual                   Monday                1702.1047
## 4                      member                   Monday                 718.6881
## 5                      casual                  Tuesday                1520.1513
## 6                      member                  Tuesday                 717.0647
## 7                      casual                Wednesday                1450.3821
## 8                      member                Wednesday                 713.4792
## 9                      casual                 Thursday                1481.9947
## 10                     member                 Thursday                 726.0527
## 11                     casual                   Friday                1651.1089
## 12                     member                   Friday                 741.8887
## 13                     casual                 Saturday                1934.6068
## 14                     member                 Saturday                 835.9089
glimpse(tripdata_new)  # Took a look on data           
## Rows: 5,858,958
## Columns: 17
## $ ride_id            <chr> "EC2DE40644C6B0F4", "1C31AD03897EE385", "1542FBEC83…
## $ rideable_type      <chr> "classic_bike", "classic_bike", "classic_bike", "cl…
## $ started_at         <chr> "2022-05-23 23:06:58", "2022-05-11 08:53:28", "2022…
## $ ended_at           <chr> "2022-05-23 23:40:19", "2022-05-11 09:31:22", "2022…
## $ start_station_name <chr> "Wabash Ave & Grand Ave", "DuSable Lake Shore Dr & …
## $ start_station_id   <chr> "TA1307000117", "13300", "TA1305000032", "TA1305000…
## $ end_station_name   <chr> "Halsted St & Roscoe St", "Field Blvd & South Water…
## $ end_station_id     <chr> "TA1309000025", "15534", "13221", "TA1305000030", "…
## $ member_casual      <chr> "member", "member", "member", "member", "member", "…
## $ date               <date> 2022-05-23, 2022-05-11, 2022-05-26, 2022-05-10, 20…
## $ day                <chr> "23", "11", "26", "10", "10", "04", "27", "29", "16…
## $ month              <chr> "May", "May", "May", "May", "May", "May", "May", "M…
## $ year               <chr> "2022", "2022", "2022", "2022", "2022", "2022", "20…
## $ day_of_week        <ord> Monday, Wednesday, Thursday, Tuesday, Tuesday, Wedn…
## $ pick_time          <chr> "23:06:", "08:53:", "18:36:", "07:30:", "17:31:", "…
## $ drop_time          <chr> "23:40", "09:31", "18:58", "07:38", "17:36", "14:56…
## $ ride_length        <dbl> 2001, 2274, 1310, 522, 301, 429, 533, 730, 1002, 88…



PHASE 5 : Data Visualization


Created a seperate dataframe for “Data Viz 1” :

rides_per_day <- tripdata_new %>% 
  group_by(member_casual, day_of_week, rideable_type) %>% 
  summarise(number_of_rides = n(),
            median_duration = median(ride_length),
            maximum_duration = max(ride_length),
            average_duration = mean(ride_length)) %>% 
  arrange(member_casual, day_of_week, rideable_type)


Converted to dataframe :
rides_per_day <- as.data.frame(rides_per_day)
glimpse(rides_per_day) # Took a look on it
## Rows: 35
## Columns: 7
## $ member_casual    <chr> "casual", "casual", "casual", "casual", "casual", "ca…
## $ day_of_week      <ord> Sunday, Sunday, Sunday, Monday, Monday, Monday, Tuesd…
## $ rideable_type    <chr> "classic_bike", "docked_bike", "electric_bike", "clas…
## $ number_of_rides  <int> 156683, 33968, 198160, 101921, 21193, 152634, 98284, …
## $ median_duration  <dbl> 973.0, 1763.5, 710.0, 842.0, 1730.0, 604.0, 762.0, 16…
## $ maximum_duration <dbl> 90055, 2175468, 28826, 89997, 1922127, 28800, 89997, …
## $ average_duration <dbl> 1880.3107, 8064.8280, 1067.3310, 1703.8064, 7270.6341…


Working on newly converted dataframe (rides_per_day) :


Data viz 1A : ‘Number of rides’ across the week for all users.

ggplot(data = rides_per_day, 
       mapping=  aes(x = day_of_week, y = number_of_rides, fill= member_casual)) +
  geom_bar(position = "dodge", stat = "identity") +
  scale_y_continuous(labels = scales::comma)+
  labs( x = "Day of Week", y = "Number of Rides", fill = "Member/Casual" , 
        title = "Average Number of Rides Per Day : Memeber vs Causal",
        subtitle = "High peek on Weekend by Casual riders",
        caption = "Data Analyst 'JP'")


Total number of rides taken by riders on weekend is almost the same, where on weekdays, member riders engagement are high compared to casual riders.



Data viz 1B : ‘Average duration’ across the week for all users.

ggplot(data= rides_per_day, 
       mapping = aes(x = day_of_week , y = average_duration, fill = member_casual)) +
  geom_bar(position = "dodge", stat = "identity") +
  labs(x = "Day of Week", y = "Average Duration", fill = "Memeber/Casual",
       title = " Average Ride Duration Per Day : Member vs Casual", 
       subtitle = "Overall Average Ride Duration by Casual Riders are High",
       caption = "Data Analyst 'JP'")


Casual riders’ average duration for using bikes are much high than Memebers. I’ll be suggesting to advertise about being ‘Members’ is way not too costly.



Data viz 1C : ‘Max duration’ across the week for all users :

ggplot(data= rides_per_day, 
       mapping = aes(x = day_of_week , y = maximum_duration, fill = member_casual)) +
  geom_bar(position = "dodge", stat = "identity") +
  labs(x = "Day of Week", y = "Average Duration", fill = "Memeber/Casual",
       title = " Maximum Ride Duration Per Day : Member vs Casual", 
       subtitle = "Overall Maximum Ride Duration by Casual Riders are High",
       caption = "Data Analyst 'JP'")


Maximum Rides duration taken by Casual riders is way high than Memebers. Casuals Riders riding bikes for maximum ‘ride duration’ than Memebers.



Data viz 1D : ‘Median duration’ across the week for all users :

ggplot(data= rides_per_day, 
       mapping = aes(x = day_of_week , y = median_duration, fill = member_casual)) +
  geom_bar(position = "dodge", stat = "identity") +
  labs(x = "Day of Week", y = "Average Duration", fill = "Memeber/Casual",
       title = " Median Ride Duration Per Day : Member vs Casual", 
       subtitle = "Overall Average Ride Duration by Casual Riders are High",
       caption = "Data Analyst 'JP'")


Overall Casual Riders riding bikes for longer period of time.



Data viz 1E : Riders peek rate on the daily basis by bike’s type:

ggplot(data = rides_per_day) +
  geom_point(mapping = aes (x = day_of_week , y= number_of_rides, color= rideable_type, shape = member_casual)) +
  scale_y_continuous(labels = scales::comma)


Docked bike is being used only by casual riders.



Data viz 1F : Average Ride Duration on the daily basis by bike’s type:

ggplot(data= rides_per_day, 
       mapping = aes(x = day_of_week , y = average_duration, 
                     fill = member_casual, shape = rideable_type)) +
  geom_point() +
  facet_wrap(~member_casual) +
  labs(x = "Day of Week", y = "Average Duration", shape = "Rideable Types ", fill = "Member/Casual",
       title = " Average Ride Duration by bike-types : Member vs Casual", 
       subtitle = "Average ride duration for Docked bike is high by casuals",
       caption = "Data Analyst 'JP'")


Casual Riders prefer to ride “Docked Bike” for long amount of time.



Data viz 1G : Peek rate on the daily basis by bike’s type:

ggplot(data= rides_per_day, 
       mapping = aes(x = rideable_type , y = day_of_week, fill = member_casual)) +
  geom_bar(position = "dodge", stat = "identity") +facet_wrap(~member_casual)+
  labs(x = "Rideable Type", y = "Day of Week", fill = "Memeber/Casual",
       title = " Ridable Types used per Day : Member vs Casual", 
       subtitle = "members never used 'docked bike' as of now",
       caption = "Data Analyst 'JP'")


Members never picked ‘docked bikes’ for last 12 months



Data viz 1H : Peek rate on the daily basis by Riders :

ggplot(data= rides_per_day, 
       mapping = aes(x = member_casual , y = number_of_rides, fill = day_of_week)) +
  geom_bar(position = "dodge", stat = "identity") +
  scale_y_continuous(labels = scales::comma) +
  labs(x = "Member vs Casual", y = "Number of Rides", fill = "Memeber/Casual",
       title = " Rides used by Member & Casual", 
       subtitle = "Members' ride are usually on peek during weekdays",
       caption = "Data Analyst 'JP'")


The number of rides taken by Casual riders is high than Member riders. advertising them to get membership since they are paying more money than members.



To get more detailed Viz, Manipulation in Original Data was needed :

tripdata$drop <- as_datetime(tripdata$ended_at)
tripdata$drop_time <- format(as_datetime(tripdata$drop), "%H:%M")
glimpse(tripdata)
## Rows: 5,859,061
## Columns: 18
## $ ride_id            <chr> "EC2DE40644C6B0F4", "1C31AD03897EE385", "1542FBEC83…
## $ rideable_type      <chr> "classic_bike", "classic_bike", "classic_bike", "cl…
## $ started_at         <chr> "2022-05-23 23:06:58", "2022-05-11 08:53:28", "2022…
## $ ended_at           <chr> "2022-05-23 23:40:19", "2022-05-11 09:31:22", "2022…
## $ start_station_name <chr> "Wabash Ave & Grand Ave", "DuSable Lake Shore Dr & …
## $ start_station_id   <chr> "TA1307000117", "13300", "TA1305000032", "TA1305000…
## $ end_station_name   <chr> "Halsted St & Roscoe St", "Field Blvd & South Water…
## $ end_station_id     <chr> "TA1309000025", "15534", "13221", "TA1305000030", "…
## $ member_casual      <chr> "member", "member", "member", "member", "member", "…
## $ date               <date> 2022-05-23, 2022-05-11, 2022-05-26, 2022-05-10, 20…
## $ day                <chr> "23", "11", "26", "10", "10", "04", "27", "29", "16…
## $ month              <chr> "May", "May", "May", "May", "May", "May", "May", "M…
## $ year               <chr> "2022", "2022", "2022", "2022", "2022", "2022", "20…
## $ day_of_week        <chr> "Monday", "Wednesday", "Thursday", "Tuesday", "Tues…
## $ pick_time          <chr> "23:06:", "08:53:", "18:36:", "07:30:", "17:31:", "…
## $ drop_time          <chr> "23:40", "09:31", "18:58", "07:38", "17:36", "14:56…
## $ ride_length        <dbl> 2001, 2274, 1310, 522, 301, 429, 533, 730, 1002, 88…
## $ drop               <dttm> 2022-05-23 23:40:19, 2022-05-11 09:31:22, 2022-05-…


  • Removed non-necessary column(s)
tripdata <- tripdata %>% 
  select(-c(drop))


  • Now original data is intact
glimpse(tripdata) # look on data
## Rows: 5,859,061
## Columns: 17
## $ ride_id            <chr> "EC2DE40644C6B0F4", "1C31AD03897EE385", "1542FBEC83…
## $ rideable_type      <chr> "classic_bike", "classic_bike", "classic_bike", "cl…
## $ started_at         <chr> "2022-05-23 23:06:58", "2022-05-11 08:53:28", "2022…
## $ ended_at           <chr> "2022-05-23 23:40:19", "2022-05-11 09:31:22", "2022…
## $ start_station_name <chr> "Wabash Ave & Grand Ave", "DuSable Lake Shore Dr & …
## $ start_station_id   <chr> "TA1307000117", "13300", "TA1305000032", "TA1305000…
## $ end_station_name   <chr> "Halsted St & Roscoe St", "Field Blvd & South Water…
## $ end_station_id     <chr> "TA1309000025", "15534", "13221", "TA1305000030", "…
## $ member_casual      <chr> "member", "member", "member", "member", "member", "…
## $ date               <date> 2022-05-23, 2022-05-11, 2022-05-26, 2022-05-10, 20…
## $ day                <chr> "23", "11", "26", "10", "10", "04", "27", "29", "16…
## $ month              <chr> "May", "May", "May", "May", "May", "May", "May", "M…
## $ year               <chr> "2022", "2022", "2022", "2022", "2022", "2022", "20…
## $ day_of_week        <chr> "Monday", "Wednesday", "Thursday", "Tuesday", "Tues…
## $ pick_time          <chr> "23:06:", "08:53:", "18:36:", "07:30:", "17:31:", "…
## $ drop_time          <chr> "23:40", "09:31", "18:58", "07:38", "17:36", "14:56…
## $ ride_length        <dbl> 2001, 2274, 1310, 522, 301, 429, 533, 730, 1002, 88…


ceated another dataframe for further data visualization:

tripdata_new2 <- tripdata_new


Manipulation in newly created dataframe for Data viz 2:

tripdata_new2$pick <- as_datetime(tripdata_new2$started_at)
tripdata_new2$pick_hour <- format(as_datetime(tripdata_new2$pick), "%H")

tripdata_new2$drop <- as_datetime(tripdata_new2$ended_at)
tripdata_new2$drop_hour <- format(as_datetime(tripdata_new2$drop), "%H")


  • Removed non-necessary column :
tripdata_new2 <- tripdata_new2 %>% 
  select(-c(pick, drop))
glimpse(tripdata_new2) # Took a look on it
## Rows: 5,858,958
## Columns: 19
## $ ride_id            <chr> "EC2DE40644C6B0F4", "1C31AD03897EE385", "1542FBEC83…
## $ rideable_type      <chr> "classic_bike", "classic_bike", "classic_bike", "cl…
## $ started_at         <chr> "2022-05-23 23:06:58", "2022-05-11 08:53:28", "2022…
## $ ended_at           <chr> "2022-05-23 23:40:19", "2022-05-11 09:31:22", "2022…
## $ start_station_name <chr> "Wabash Ave & Grand Ave", "DuSable Lake Shore Dr & …
## $ start_station_id   <chr> "TA1307000117", "13300", "TA1305000032", "TA1305000…
## $ end_station_name   <chr> "Halsted St & Roscoe St", "Field Blvd & South Water…
## $ end_station_id     <chr> "TA1309000025", "15534", "13221", "TA1305000030", "…
## $ member_casual      <chr> "member", "member", "member", "member", "member", "…
## $ date               <date> 2022-05-23, 2022-05-11, 2022-05-26, 2022-05-10, 20…
## $ day                <chr> "23", "11", "26", "10", "10", "04", "27", "29", "16…
## $ month              <chr> "May", "May", "May", "May", "May", "May", "May", "M…
## $ year               <chr> "2022", "2022", "2022", "2022", "2022", "2022", "20…
## $ day_of_week        <ord> Monday, Wednesday, Thursday, Tuesday, Tuesday, Wedn…
## $ pick_time          <chr> "23:06:", "08:53:", "18:36:", "07:30:", "17:31:", "…
## $ drop_time          <chr> "23:40", "09:31", "18:58", "07:38", "17:36", "14:56…
## $ ride_length        <dbl> 2001, 2274, 1310, 522, 301, 429, 533, 730, 1002, 88…
## $ pick_hour          <chr> "23", "08", "18", "07", "17", "14", "12", "19", "17…
## $ drop_hour          <chr> "23", "09", "18", "07", "17", "14", "12", "19", "18…



Data viz 2A : Riders Peek hour on time basis :

ggplot(data= tripdata_new2,
       mapping = aes(x = pick_hour, fill = member_casual)) +
  geom_bar(position = "dodge") +
  scale_y_continuous(labels = scales::comma) +
  labs(x = "Bikes' Pick Hour 00:00 To 23:00", y = "Number of Rides Taken", fill = "Memeber/Casual",
       title = " Rush Hours by Member & Casual", 
       subtitle = "High Peek Rate During Afternoon",
       caption = "Data Analyst 'JP'")


Members and Casual Riders using rides during work-hour are high.



Data viz 2B : Riders Drop hour on time basis :

ggplot(data= tripdata_new2,
       mapping = aes(x = drop_hour, fill = member_casual)) +
  geom_bar(position = "dodge") +
  scale_y_continuous(labels = scales::comma) +
  labs(x = "Bikes' Drop Hour 00:00 To 23:00", y = "Number of Rides Taken", fill = "Memeber/Casual",
       title = " Bikes' Drop Hours by Member & Casual", 
       subtitle = "High Drop Rate too During Afternoon",
       caption = "Data Analyst 'JP'")

Experimented the idea of Riders picking bikes and dropping it, on time basis.



Manipulation in newly created dataframe for Data viz 3:

tripdata_new2$month1 <- as.Date(tripdata_new2$started_at)
tripdata_new2$month_year <- format(as.Date(tripdata_new2$month1), "%m %y")


  • Removed non-necessary column :
tripdata_new2 <- tripdata_new2 %>% 
  select(-c(month1))

glimpse(tripdata_new2) # Checked Data
## Rows: 5,858,958
## Columns: 20
## $ ride_id            <chr> "EC2DE40644C6B0F4", "1C31AD03897EE385", "1542FBEC83…
## $ rideable_type      <chr> "classic_bike", "classic_bike", "classic_bike", "cl…
## $ started_at         <chr> "2022-05-23 23:06:58", "2022-05-11 08:53:28", "2022…
## $ ended_at           <chr> "2022-05-23 23:40:19", "2022-05-11 09:31:22", "2022…
## $ start_station_name <chr> "Wabash Ave & Grand Ave", "DuSable Lake Shore Dr & …
## $ start_station_id   <chr> "TA1307000117", "13300", "TA1305000032", "TA1305000…
## $ end_station_name   <chr> "Halsted St & Roscoe St", "Field Blvd & South Water…
## $ end_station_id     <chr> "TA1309000025", "15534", "13221", "TA1305000030", "…
## $ member_casual      <chr> "member", "member", "member", "member", "member", "…
## $ date               <date> 2022-05-23, 2022-05-11, 2022-05-26, 2022-05-10, 20…
## $ day                <chr> "23", "11", "26", "10", "10", "04", "27", "29", "16…
## $ month              <chr> "May", "May", "May", "May", "May", "May", "May", "M…
## $ year               <chr> "2022", "2022", "2022", "2022", "2022", "2022", "20…
## $ day_of_week        <ord> Monday, Wednesday, Thursday, Tuesday, Tuesday, Wedn…
## $ pick_time          <chr> "23:06:", "08:53:", "18:36:", "07:30:", "17:31:", "…
## $ drop_time          <chr> "23:40", "09:31", "18:58", "07:38", "17:36", "14:56…
## $ ride_length        <dbl> 2001, 2274, 1310, 522, 301, 429, 533, 730, 1002, 88…
## $ pick_hour          <chr> "23", "08", "18", "07", "17", "14", "12", "19", "17…
## $ drop_hour          <chr> "23", "09", "18", "07", "17", "14", "12", "19", "18…
## $ month_year         <chr> "05 22", "05 22", "05 22", "05 22", "05 22", "05 22…



Data viz 3A : Riders engagement on month-wise :

ggplot(data= tripdata_new2,
       mapping = aes(x = month_year, fill = member_casual)) +
  geom_bar(position = "dodge") +
  scale_y_continuous(labels = scales::comma) +
  labs(x = "Month-Year", y = "Number of Rides Taken", fill = "Memeber/Casual",
       title = " Month Wise Rides Taken by Members vs Casuals ", 
       subtitle = "Causal Riders Taking Less Rides in Winter Season than Members",
       caption = "Data Analyst 'JP'")


Casual Riders engagement in Winter Season is very low as compared to Members. Advertising before and during the Winter will have good impact on company too. While offering ‘becoming new member’ gets discount on “electric bike” & “classic bike” too 😊



PHASE 6 : Act


The act phase would be done by the Executive team of the company. So,Passing the documented report to Lily Moreno and the team.


Data-driven decision-making :

  • Cyclistic’s bike-share has good amount of engagements whether riders are Casuals or Members.

  • Advertising about being ‘Members’ is not too costly at all.

  • Casual Riders can easily save money when they subscribe annual Membership.

  • Casual Riders engagement in Winter Season is low as compared to Member Riders, Advertising about getting Membership too on Social Media like Instagram, Facebook etc. will help them get to know about membership plans.

  • Before and During Winter Season, Advertising about getting membership will be beneficial too.



Suggestion(s) :


  • If data has “lat, lng” to “address” could have help analysis much deeper.like area-wise bike-share distribution.

  • we can increase rides cost during time but can’t say its effect on it anyhow well riders engagement are high during work-hour.

  • If more about updated info like start station id , lat, lng, more info about rider, address would be useful


  • Saved


  • Exported data


  • write.csv(tripdata, file =“tripdata.csv”)
  • write.csv(tripdata_new, file =“tripdata_new.csv”)
  • write.csv(tripdata_new2, file =“tripdata_new2.csv”)