Scenario :

This is my version of the Google Data Analytics : Case Study 1 , A Cyclistic bike-share Analysis case study. The “Cyclistic” is a fictional company for this case study. They said this case study will be a ‘Tangible’ way to demonstrate my knowledge and skills so here we are 😊 : I’m a junior ‘Data Analyst’ working in the marketing analytics team at company named “Cyclistic”, a bike-share company in Chicago. I’ve joined this team six months ago and have been busy learning about Cyclistic’s mission and buisness goals as well as how I, as a junior data analyst, can help Cyclistic achieve them. The director of marketing believes the company’s future success depends on maximizing the number of annual memberships.

Characters and teams :

Cyclistic: A bike-share fictional company.

Lily Moreno: The director of marketing and my manager.

Cyclistic Marketing Analytics team: My team of Data Analysts.

Cyclistic Executive team: The detail-oriented executive team.

Before we begin the Process for this Project, there are a few key points that are wrapped below as these are the steps I’ll be following to ensure its completion:

 I'll be following these vital steps for the Data Analysis Process:
         Ask, Prepare, Process, Analyze, Share, and Act.

Data Analysis Process

PHASE 1 : Ask

The director of marketing believes the company’s future success depends on maximizing the number of annual memberships.

About the company:

In 2016, Cyclistic launched a successful bike-share offering. Since then, the program has grown to a fleet of 5,824 bicycles that are geotracked and locked into a network of 692 stations across Chicago. The bikes can be unlocked from one station and returned to any other station in the system anytime. Until now, Cyclistic’s marketing strategy relied on building general awareness and appealing to broad consumer segments.One approach that helped make these things possible was the flexibility of its pricing plans: single-ride passes, full-day passes, and annual memberships. Customers who purchase single-ride or full-day passes are referred to as casual riders. Customers who purchase annual memberships are Cyclistic members. *Cyclistic’s finance analysts have concluded that annual members are much more profitable than casual riders. Although the pricing flexibility helps Cyclistic attract more customers, Moreno believes that maximizing the number of annual members will be key to future growth. Rather than creating a marketing campaign that targets all-new customers, Moreno believes there is a very good chance to convert casual riders into members. She notes that casual riders are already aware of the Cyclistic program and have chosen Cyclistic for their mobility needs. Moreno has set a clear goal: Design marketing strategies aimed at converting casual riders into annual members. In order to do that, however, the marketing analyst team needs to better understand how annual members and casual riders differ, why casual riders would buy a membership, and how digital media could affect their marketing tactics. Moreno and her team are interested in analyzing the Cyclistic historical bike trip data to identify trends.

In this phase:

Moreno has assigned me the first question to answer: How do annual members and casual riders using Cyclistic bikes differently?

Key taks

Identified the business task
Considered key stakeholders

Deliverable

To gain insights from Data to solve buisness problem

PHASE 2 : Prepare

I’ll be using Cyclistic’s historical trip data to analyze and identify trends, Downloaded last 12 months of data and stored it on my Google Drive

The data is located on divvy-tripdata and data has been made available by Motivate International Inc. under this license.

The company has their own licence over the dataset. Besides that, the 
dataset doesn't have any personal information about the riders. So, I'm 
sure that Data isn't Biased And have full credibility for the same reason. 
Finally, it's ROCCC because it's reliable, original, comprehensive, 
current and cited.
Verified the data's integrity by checking all the files, have consistent 
columns and the data types.

Key taks

Downloaded data and stored it appropriately.
Identified how it’s organized.
Sorted and filtered the data.
Determined the credibility of the data.

Deliverable

The description of all data sources used between May 2022 - April 2023

PHASE 3 : Process

I’ve also done analysis using excel but due to “Big Data” MS Excel won’t be able to load it.

R is primarily used for statistical analysis and data visualization. So, I chose ‘RStudio’ to merge all the months based Data into One Single file for further Analysis.

Setting up the Environment

Dependencies

library(tidyverse)
library(lubridate)
library(ggplot2)
library(readxl)

Working Directory

setwd("D:/Case_Study/Data/CSV")

Data Collection

t22_05 <- read.csv("202205-divvy-tripdata.csv")

t22_06 <- read.csv("202206-divvy-tripdata.csv")

t22_07 <- read.csv("202207-divvy-tripdata.csv")

t22_08 <- read.csv("202208-divvy-tripdata.csv")

t22_09 <- read.csv("202209-divvy-publictripdata.csv")

t22_10 <- read.csv("202210-divvy-tripdata.csv")

t22_11 <- read.csv("202211-divvy-tripdata.csv")

t22_12 <- read.csv("202212-divvy-tripdata.csv")

t23_01 <- read.csv("202301-divvy-tripdata.csv")

t23_02 <- read.csv("202302-divvy-tripdata.csv")

t23_03 <- read.csv("202303-divvy-tripdata.csv")

t23_04 <- read.csv("202304_divvy_tripdata.csv")

Data Wrangling

Ensured Data’s integrity one by one before compiling onto one single as Big Data.
Matching column(s) name consistent with “t23_03” as this is the latest data produced.

str(t23_04) # checked all the data one by one with " str() ".

## 'data.frame':    426590 obs. of  13 variables:
##  $ ride_id           : chr  "8FE8F7D9C10E88C7" "34E4ED3ADF1D821B" "5296BF07A2F77CB5" "40759916B76D5D52" ...
##  $ rideable_type     : chr  "electric_bike" "electric_bike" "electric_bike" "electric_bike" ...
##  $ started_at        : chr  "2023-04-02 08:37:28" "2023-04-19 11:29:02" "2023-04-19 08:41:22" "2023-04-19 13:31:30" ...
##  $ ended_at          : chr  "2023-04-02 08:41:37" "2023-04-19 11:52:12" "2023-04-19 08:43:22" "2023-04-19 13:35:09" ...
##  $ start_station_name: chr  "" "" "" "" ...
##  $ start_station_id  : chr  "" "" "" "" ...
##  $ end_station_name  : chr  "" "" "" "" ...
##  $ end_station_id    : chr  "" "" "" "" ...
##  $ start_lat         : num  41.8 41.9 41.9 41.9 41.9 ...
##  $ start_lng         : num  -87.6 -87.7 -87.7 -87.7 -87.7 ...
##  $ end_lat           : num  41.8 41.9 41.9 41.9 41.9 ...
##  $ end_lng           : num  -87.6 -87.7 -87.7 -87.7 -87.6 ...
##  $ member_casual     : chr  "member" "member" "member" "member" ...

Everything looks great

Merging all the Data into Single One for Analysis.

tripdata <- bind_rows(t22_05,t22_06,t22_07,t22_08,t22_09,t22_10,
                  t22_11,t22_12,t23_01,t23_02,t23_03,t23_04)

cross-checked data’s integrity :

str(tripdata)

## 'data.frame':    5859061 obs. of  13 variables:
##  $ ride_id           : chr  "EC2DE40644C6B0F4" "1C31AD03897EE385" "1542FBEC830415CF" "6FF59852924528F8" ...
##  $ rideable_type     : chr  "classic_bike" "classic_bike" "classic_bike" "classic_bike" ...
##  $ started_at        : chr  "2022-05-23 23:06:58" "2022-05-11 08:53:28" "2022-05-26 18:36:28" "2022-05-10 07:30:07" ...
##  $ ended_at          : chr  "2022-05-23 23:40:19" "2022-05-11 09:31:22" "2022-05-26 18:58:18" "2022-05-10 07:38:49" ...
##  $ start_station_name: chr  "Wabash Ave & Grand Ave" "DuSable Lake Shore Dr & Monroe St" "Clinton St & Madison St" "Clinton St & Madison St" ...
##  $ start_station_id  : chr  "TA1307000117" "13300" "TA1305000032" "TA1305000032" ...
##  $ end_station_name  : chr  "Halsted St & Roscoe St" "Field Blvd & South Water St" "Wood St & Milwaukee Ave" "Clark St & Randolph St" ...
##  $ end_station_id    : chr  "TA1309000025" "15534" "13221" "TA1305000030" ...
##  $ start_lat         : num  41.9 41.9 41.9 41.9 41.9 ...
##  $ start_lng         : num  -87.6 -87.6 -87.6 -87.6 -87.6 ...
##  $ end_lat           : num  41.9 41.9 41.9 41.9 41.9 ...
##  $ end_lng           : num  -87.6 -87.6 -87.7 -87.6 -87.7 ...
##  $ member_casual     : chr  "member" "member" "member" "member" ...

Saved merged Data :

write.csv(tripdata, file = "tripdata.csv")

Removed ‘lat’ & ‘lng’ columns.

will be talking about ‘lat’ & ‘lng’ in the end of the project.

Removed non-necessary column :

tripdata <- tripdata %>% 
  select(-c(start_lat, start_lng, end_lat, end_lng))

Checked Data Structure.

str(tripdata)

## 'data.frame':    5859061 obs. of  9 variables:
##  $ ride_id           : chr  "EC2DE40644C6B0F4" "1C31AD03897EE385" "1542FBEC830415CF" "6FF59852924528F8" ...
##  $ rideable_type     : chr  "classic_bike" "classic_bike" "classic_bike" "classic_bike" ...
##  $ started_at        : chr  "2022-05-23 23:06:58" "2022-05-11 08:53:28" "2022-05-26 18:36:28" "2022-05-10 07:30:07" ...
##  $ ended_at          : chr  "2022-05-23 23:40:19" "2022-05-11 09:31:22" "2022-05-26 18:58:18" "2022-05-10 07:38:49" ...
##  $ start_station_name: chr  "Wabash Ave & Grand Ave" "DuSable Lake Shore Dr & Monroe St" "Clinton St & Madison St" "Clinton St & Madison St" ...
##  $ start_station_id  : chr  "TA1307000117" "13300" "TA1305000032" "TA1305000032" ...
##  $ end_station_name  : chr  "Halsted St & Roscoe St" "Field Blvd & South Water St" "Wood St & Milwaukee Ave" "Clark St & Randolph St" ...
##  $ end_station_id    : chr  "TA1309000025" "15534" "13221" "TA1305000030" ...
##  $ member_casual     : chr  "member" "member" "member" "member" ...

Statistical Summary of Data.

summary(tripdata)

##    ride_id          rideable_type       started_at          ended_at        
##  Length:5859061     Length:5859061     Length:5859061     Length:5859061    
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##  start_station_name start_station_id   end_station_name   end_station_id    
##  Length:5859061     Length:5859061     Length:5859061     Length:5859061    
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##  member_casual     
##  Length:5859061    
##  Class :character  
##  Mode  :character

Checking number of riders as casual and member :

table(tripdata$member_casual)

## 
##  casual  member 
## 2358307 3500754

Data Cleaning :

Added columns as date, month, day, year and day_of_week for analysis

  and formatted as date :

tripdata$date <- as.Date(tripdata$started_at)
tripdata$day <- format(as.Date(tripdata$date), "%d")
tripdata$month <- format(as.Date(tripdata$date), "%B")
tripdata$year <- format(as.Date(tripdata$date), "%Y")
tripdata$day_of_week <- format(as.Date(tripdata$date), "%A")

Adding columns as pick_time & drop_time and formatted as datetime :

tripdata$pick <- as_datetime(tripdata$started_at)
tripdata$pick_time <- format(as_datetime(tripdata$pick), "%H:%M:")

tripdata$drop <- as_datetime(tripdata$ended_at)
tripdata$drop_time <- format(as_datetime(tripdata$drop), "%H:%M")

Took a look on Data

glimpse(tripdata)

## Rows: 5,859,061
## Columns: 18
## $ ride_id            <chr> "EC2DE40644C6B0F4", "1C31AD03897EE385", "1542FBEC83…
## $ rideable_type      <chr> "classic_bike", "classic_bike", "classic_bike", "cl…
## $ started_at         <chr> "2022-05-23 23:06:58", "2022-05-11 08:53:28", "2022…
## $ ended_at           <chr> "2022-05-23 23:40:19", "2022-05-11 09:31:22", "2022…
## $ start_station_name <chr> "Wabash Ave & Grand Ave", "DuSable Lake Shore Dr & …
## $ start_station_id   <chr> "TA1307000117", "13300", "TA1305000032", "TA1305000…
## $ end_station_name   <chr> "Halsted St & Roscoe St", "Field Blvd & South Water…
## $ end_station_id     <chr> "TA1309000025", "15534", "13221", "TA1305000030", "…
## $ member_casual      <chr> "member", "member", "member", "member", "member", "…
## $ date               <date> 2022-05-23, 2022-05-11, 2022-05-26, 2022-05-10, 20…
## $ day                <chr> "23", "11", "26", "10", "10", "04", "27", "29", "16…
## $ month              <chr> "May", "May", "May", "May", "May", "May", "May", "M…
## $ year               <chr> "2022", "2022", "2022", "2022", "2022", "2022", "20…
## $ day_of_week        <chr> "Monday", "Wednesday", "Thursday", "Tuesday", "Tues…
## $ pick               <dttm> 2022-05-23 23:06:58, 2022-05-11 08:53:28, 2022-05-…
## $ pick_time          <chr> "23:06:", "08:53:", "18:36:", "07:30:", "17:31:", "…
## $ drop               <dttm> 2022-05-23 23:40:19, 2022-05-11 09:31:22, 2022-05-…
## $ drop_time          <chr> "23:40", "09:31", "18:58", "07:38", "17:36", "14:56…

Removed ‘pick’ & ‘drop’ columns :

tripdata <- tripdata %>% 
  select(-c(pick,drop))

Checked for Data duplication :

tripdata_duplicate <- tripdata[!duplicated(tripdata$ride_id),]

Took a look on Data

glimpse(tripdata)

## Rows: 5,859,061
## Columns: 16
## $ ride_id            <chr> "EC2DE40644C6B0F4", "1C31AD03897EE385", "1542FBEC83…
## $ rideable_type      <chr> "classic_bike", "classic_bike", "classic_bike", "cl…
## $ started_at         <chr> "2022-05-23 23:06:58", "2022-05-11 08:53:28", "2022…
## $ ended_at           <chr> "2022-05-23 23:40:19", "2022-05-11 09:31:22", "2022…
## $ start_station_name <chr> "Wabash Ave & Grand Ave", "DuSable Lake Shore Dr & …
## $ start_station_id   <chr> "TA1307000117", "13300", "TA1305000032", "TA1305000…
## $ end_station_name   <chr> "Halsted St & Roscoe St", "Field Blvd & South Water…
## $ end_station_id     <chr> "TA1309000025", "15534", "13221", "TA1305000030", "…
## $ member_casual      <chr> "member", "member", "member", "member", "member", "…
## $ date               <date> 2022-05-23, 2022-05-11, 2022-05-26, 2022-05-10, 20…
## $ day                <chr> "23", "11", "26", "10", "10", "04", "27", "29", "16…
## $ month              <chr> "May", "May", "May", "May", "May", "May", "May", "M…
## $ year               <chr> "2022", "2022", "2022", "2022", "2022", "2022", "20…
## $ day_of_week        <chr> "Monday", "Wednesday", "Thursday", "Tuesday", "Tues…
## $ pick_time          <chr> "23:06:", "08:53:", "18:36:", "07:30:", "17:31:", "…
## $ drop_time          <chr> "23:40", "09:31", "18:58", "07:38", "17:36", "14:56…

Adding a column ‘ride_length’

tripdata$ride_length <- difftime(tripdata$ended_at,tripdata$started_at)
is.factor(tripdata$ride_length) # Confirming data not to be factor

## [1] FALSE

is.numeric(tripdata$ride_length) # Confirming data is to be numeric

## [1] FALSE

tripdata$ride_length <- as.numeric(as.duration(tripdata$ride_length))
is.numeric(tripdata$ride_length)

## [1] TRUE

Looks Great

Documentation of cleaning or manipulation of data clarified too

PHASE 4 : Analyze

CONDUCTING DESCRIPTIVE ANALYSIS

Making copy of original data :

tripdata_new <- tripdata

Statistical Summary of Data

mean(tripdata_new$ride_length)      # Average ride duration

## [1] 1136.131

median(tripdata_new$ride_length)    # Median ride duration

## [1] 599

max(tripdata_new$ride_length)       # Maximum ride duration

## [1] 2483235

min(tripdata_new$ride_length)       # Minimum ride duration

## [1] -621201

summary(tripdata_new$ride_length)   # Statistical summary of ride duration

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -621201     339     599    1136    1075 2483235

minimum ride duration by riders are negative.

Removing Bad Data :

tripdata_new <- tripdata_new[!tripdata_new$ride_length <0, ]
str(tripdata_new)  # Took a look on data

## 'data.frame':    5858958 obs. of  17 variables:
##  $ ride_id           : chr  "EC2DE40644C6B0F4" "1C31AD03897EE385" "1542FBEC830415CF" "6FF59852924528F8" ...
##  $ rideable_type     : chr  "classic_bike" "classic_bike" "classic_bike" "classic_bike" ...
##  $ started_at        : chr  "2022-05-23 23:06:58" "2022-05-11 08:53:28" "2022-05-26 18:36:28" "2022-05-10 07:30:07" ...
##  $ ended_at          : chr  "2022-05-23 23:40:19" "2022-05-11 09:31:22" "2022-05-26 18:58:18" "2022-05-10 07:38:49" ...
##  $ start_station_name: chr  "Wabash Ave & Grand Ave" "DuSable Lake Shore Dr & Monroe St" "Clinton St & Madison St" "Clinton St & Madison St" ...
##  $ start_station_id  : chr  "TA1307000117" "13300" "TA1305000032" "TA1305000032" ...
##  $ end_station_name  : chr  "Halsted St & Roscoe St" "Field Blvd & South Water St" "Wood St & Milwaukee Ave" "Clark St & Randolph St" ...
##  $ end_station_id    : chr  "TA1309000025" "15534" "13221" "TA1305000030" ...
##  $ member_casual     : chr  "member" "member" "member" "member" ...
##  $ date              : Date, format: "2022-05-23" "2022-05-11" ...
##  $ day               : chr  "23" "11" "26" "10" ...
##  $ month             : chr  "May" "May" "May" "May" ...
##  $ year              : chr  "2022" "2022" "2022" "2022" ...
##  $ day_of_week       : chr  "Monday" "Wednesday" "Thursday" "Tuesday" ...
##  $ pick_time         : chr  "23:06:" "08:53:" "18:36:" "07:30:" ...
##  $ drop_time         : chr  "23:40" "09:31" "18:58" "07:38" ...
##  $ ride_length       : num  2001 2274 1310 522 301 ...

New rows 5858958 , columns 17

Statistical Summary of ride duration :

summary(tripdata_new$ride_length)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       0     339     599    1136    1075 2483235

Compared members and casual riders by ride duration :

aggregate(tripdata_new$ride_length~tripdata_new$member_casual, FUN= mean) #average

##   tripdata_new$member_casual tripdata_new$ride_length
## 1                     casual                1709.6796
## 2                     member                 750.0187

aggregate(tripdata_new$ride_length~tripdata_new$member_casual, FUN= median) #median

##   tripdata_new$member_casual tripdata_new$ride_length
## 1                     casual                      751
## 2                     member                      520

aggregate(tripdata_new$ride_length~tripdata_new$member_casual, FUN= max)  #maximun

##   tripdata_new$member_casual tripdata_new$ride_length
## 1                     casual                  2483235
## 2                     member                    93580

aggregate(tripdata_new$ride_length~tripdata_new$member_casual, FUN= min) #minimum

##   tripdata_new$member_casual tripdata_new$ride_length
## 1                     casual                        0
## 2                     member                        0

Average ride duration for each day by members vs casual users :

aggregate(tripdata_new$ride_length~tripdata_new$member_casual+
            tripdata_new$day_of_week, FUN= mean)

##    tripdata_new$member_casual tripdata_new$day_of_week tripdata_new$ride_length
## 1                      casual                   Friday                1651.1089
## 2                      member                   Friday                 741.8887
## 3                      casual                   Monday                1702.1047
## 4                      member                   Monday                 718.6881
## 5                      casual                 Saturday                1934.6068
## 6                      member                 Saturday                 835.9089
## 7                      casual                   Sunday                2006.2732
## 8                      member                   Sunday                 829.4256
## 9                      casual                 Thursday                1481.9947
## 10                     member                 Thursday                 726.0527
## 11                     casual                  Tuesday                1520.1513
## 12                     member                  Tuesday                 717.0647
## 13                     casual                Wednesday                1450.3821
## 14                     member                Wednesday                 713.4792

Day of Week are Not in order

“Days of the Week” are out of order. Fixed it :

tripdata_new$day_of_week <- ordered(tripdata_new$day_of_week, 
                                    levels = c("Sunday","Monday",
                                               "Tuesday","Wednesday",
                                               "Thursday","Friday","Saturday"))

Average ride duration for each day by members vs casual users :

aggregate(tripdata_new$ride_length~tripdata_new$member_casual+
            tripdata_new$day_of_week, FUN= mean)

##    tripdata_new$member_casual tripdata_new$day_of_week tripdata_new$ride_length
## 1                      casual                   Sunday                2006.2732
## 2                      member                   Sunday                 829.4256
## 3                      casual                   Monday                1702.1047
## 4                      member                   Monday                 718.6881
## 5                      casual                  Tuesday                1520.1513
## 6                      member                  Tuesday                 717.0647
## 7                      casual                Wednesday                1450.3821
## 8                      member                Wednesday                 713.4792
## 9                      casual                 Thursday                1481.9947
## 10                     member                 Thursday                 726.0527
## 11                     casual                   Friday                1651.1089
## 12                     member                   Friday                 741.8887
## 13                     casual                 Saturday                1934.6068
## 14                     member                 Saturday                 835.9089

glimpse(tripdata_new)  # Took a look on data

## Rows: 5,858,958
## Columns: 17
## $ ride_id            <chr> "EC2DE40644C6B0F4", "1C31AD03897EE385", "1542FBEC83…
## $ rideable_type      <chr> "classic_bike", "classic_bike", "classic_bike", "cl…
## $ started_at         <chr> "2022-05-23 23:06:58", "2022-05-11 08:53:28", "2022…
## $ ended_at           <chr> "2022-05-23 23:40:19", "2022-05-11 09:31:22", "2022…
## $ start_station_name <chr> "Wabash Ave & Grand Ave", "DuSable Lake Shore Dr & …
## $ start_station_id   <chr> "TA1307000117", "13300", "TA1305000032", "TA1305000…
## $ end_station_name   <chr> "Halsted St & Roscoe St", "Field Blvd & South Water…
## $ end_station_id     <chr> "TA1309000025", "15534", "13221", "TA1305000030", "…
## $ member_casual      <chr> "member", "member", "member", "member", "member", "…
## $ date               <date> 2022-05-23, 2022-05-11, 2022-05-26, 2022-05-10, 20…
## $ day                <chr> "23", "11", "26", "10", "10", "04", "27", "29", "16…
## $ month              <chr> "May", "May", "May", "May", "May", "May", "May", "M…
## $ year               <chr> "2022", "2022", "2022", "2022", "2022", "2022", "20…
## $ day_of_week        <ord> Monday, Wednesday, Thursday, Tuesday, Tuesday, Wedn…
## $ pick_time          <chr> "23:06:", "08:53:", "18:36:", "07:30:", "17:31:", "…
## $ drop_time          <chr> "23:40", "09:31", "18:58", "07:38", "17:36", "14:56…
## $ ride_length        <dbl> 2001, 2274, 1310, 522, 301, 429, 533, 730, 1002, 88…

PHASE 5 : Data Visualization

Created a seperate dataframe for “Data Viz 1” :

rides_per_day <- tripdata_new %>% 
  group_by(member_casual, day_of_week, rideable_type) %>% 
  summarise(number_of_rides = n(),
            median_duration = median(ride_length),
            maximum_duration = max(ride_length),
            average_duration = mean(ride_length)) %>% 
  arrange(member_casual, day_of_week, rideable_type)

Converted to dataframe :

rides_per_day <- as.data.frame(rides_per_day)
glimpse(rides_per_day) # Took a look on it

## Rows: 35
## Columns: 7
## $ member_casual    <chr> "casual", "casual", "casual", "casual", "casual", "ca…
## $ day_of_week      <ord> Sunday, Sunday, Sunday, Monday, Monday, Monday, Tuesd…
## $ rideable_type    <chr> "classic_bike", "docked_bike", "electric_bike", "clas…
## $ number_of_rides  <int> 156683, 33968, 198160, 101921, 21193, 152634, 98284, …
## $ median_duration  <dbl> 973.0, 1763.5, 710.0, 842.0, 1730.0, 604.0, 762.0, 16…
## $ maximum_duration <dbl> 90055, 2175468, 28826, 89997, 1922127, 28800, 89997, …
## $ average_duration <dbl> 1880.3107, 8064.8280, 1067.3310, 1703.8064, 7270.6341…

Working on newly converted dataframe (rides_per_day) :

Data viz 1A : ‘Number of rides’ across the week for all users.

ggplot(data = rides_per_day, 
       mapping=  aes(x = day_of_week, y = number_of_rides, fill= member_casual)) +
  geom_bar(position = "dodge", stat = "identity") +
  scale_y_continuous(labels = scales::comma)+
  labs( x = "Day of Week", y = "Number of Rides", fill = "Member/Casual" , 
        title = "Average Number of Rides Per Day : Memeber vs Causal",
        subtitle = "High peek on Weekend by Casual riders",
        caption = "Data Analyst 'JP'")

Total number of rides taken by riders on weekend is almost the same, where on weekdays, member riders engagement are high compared to casual riders.

Data viz 1B : ‘Average duration’ across the week for all users.

ggplot(data= rides_per_day, 
       mapping = aes(x = day_of_week , y = average_duration, fill = member_casual)) +
  geom_bar(position = "dodge", stat = "identity") +
  labs(x = "Day of Week", y = "Average Duration", fill = "Memeber/Casual",
       title = " Average Ride Duration Per Day : Member vs Casual", 
       subtitle = "Overall Average Ride Duration by Casual Riders are High",
       caption = "Data Analyst 'JP'")

Casual riders’ average duration for using bikes are much high than Memebers. I’ll be suggesting to advertise about being ‘Members’ is way not too costly.

Data viz 1C : ‘Max duration’ across the week for all users :

ggplot(data= rides_per_day, 
       mapping = aes(x = day_of_week , y = maximum_duration, fill = member_casual)) +
  geom_bar(position = "dodge", stat = "identity") +
  labs(x = "Day of Week", y = "Average Duration", fill = "Memeber/Casual",
       title = " Maximum Ride Duration Per Day : Member vs Casual", 
       subtitle = "Overall Maximum Ride Duration by Casual Riders are High",
       caption = "Data Analyst 'JP'")

Maximum Rides duration taken by Casual riders is way high than Memebers. Casuals Riders riding bikes for maximum ‘ride duration’ than Memebers.

Data viz 1D : ‘Median duration’ across the week for all users :

ggplot(data= rides_per_day, 
       mapping = aes(x = day_of_week , y = median_duration, fill = member_casual)) +
  geom_bar(position = "dodge", stat = "identity") +
  labs(x = "Day of Week", y = "Average Duration", fill = "Memeber/Casual",
       title = " Median Ride Duration Per Day : Member vs Casual", 
       subtitle = "Overall Average Ride Duration by Casual Riders are High",
       caption = "Data Analyst 'JP'")

Overall Casual Riders riding bikes for longer period of time.

Data viz 1E : Riders peek rate on the daily basis by bike’s type:

ggplot(data = rides_per_day) +
  geom_point(mapping = aes (x = day_of_week , y= number_of_rides, color= rideable_type, shape = member_casual)) +
  scale_y_continuous(labels = scales::comma)

Docked bike is being used only by casual riders.

Data viz 1F : Average Ride Duration on the daily basis by bike’s type:

ggplot(data= rides_per_day, 
       mapping = aes(x = day_of_week , y = average_duration, 
                     fill = member_casual, shape = rideable_type)) +
  geom_point() +
  facet_wrap(~member_casual) +
  labs(x = "Day of Week", y = "Average Duration", shape = "Rideable Types ", fill = "Member/Casual",
       title = " Average Ride Duration by bike-types : Member vs Casual", 
       subtitle = "Average ride duration for Docked bike is high by casuals",
       caption = "Data Analyst 'JP'")

Casual Riders prefer to ride “Docked Bike” for long amount of time.

Data viz 1G : Peek rate on the daily basis by bike’s type:

ggplot(data= rides_per_day, 
       mapping = aes(x = rideable_type , y = day_of_week, fill = member_casual)) +
  geom_bar(position = "dodge", stat = "identity") +facet_wrap(~member_casual)+
  labs(x = "Rideable Type", y = "Day of Week", fill = "Memeber/Casual",
       title = " Ridable Types used per Day : Member vs Casual", 
       subtitle = "members never used 'docked bike' as of now",
       caption = "Data Analyst 'JP'")

Members never picked ‘docked bikes’ for last 12 months

Data viz 1H : Peek rate on the daily basis by Riders :

ggplot(data= rides_per_day, 
       mapping = aes(x = member_casual , y = number_of_rides, fill = day_of_week)) +
  geom_bar(position = "dodge", stat = "identity") +
  scale_y_continuous(labels = scales::comma) +
  labs(x = "Member vs Casual", y = "Number of Rides", fill = "Memeber/Casual",
       title = " Rides used by Member & Casual", 
       subtitle = "Members' ride are usually on peek during weekdays",
       caption = "Data Analyst 'JP'")

The number of rides taken by Casual riders is high than Member riders. advertising them to get membership since they are paying more money than members.

To get more detailed Viz, Manipulation in Original Data was needed :

tripdata$drop <- as_datetime(tripdata$ended_at)
tripdata$drop_time <- format(as_datetime(tripdata$drop), "%H:%M")
glimpse(tripdata)

## Rows: 5,859,061
## Columns: 18
## $ ride_id            <chr> "EC2DE40644C6B0F4", "1C31AD03897EE385", "1542FBEC83…
## $ rideable_type      <chr> "classic_bike", "classic_bike", "classic_bike", "cl…
## $ started_at         <chr> "2022-05-23 23:06:58", "2022-05-11 08:53:28", "2022…
## $ ended_at           <chr> "2022-05-23 23:40:19", "2022-05-11 09:31:22", "2022…
## $ start_station_name <chr> "Wabash Ave & Grand Ave", "DuSable Lake Shore Dr & …
## $ start_station_id   <chr> "TA1307000117", "13300", "TA1305000032", "TA1305000…
## $ end_station_name   <chr> "Halsted St & Roscoe St", "Field Blvd & South Water…
## $ end_station_id     <chr> "TA1309000025", "15534", "13221", "TA1305000030", "…
## $ member_casual      <chr> "member", "member", "member", "member", "member", "…
## $ date               <date> 2022-05-23, 2022-05-11, 2022-05-26, 2022-05-10, 20…
## $ day                <chr> "23", "11", "26", "10", "10", "04", "27", "29", "16…
## $ month              <chr> "May", "May", "May", "May", "May", "May", "May", "M…
## $ year               <chr> "2022", "2022", "2022", "2022", "2022", "2022", "20…
## $ day_of_week        <chr> "Monday", "Wednesday", "Thursday", "Tuesday", "Tues…
## $ pick_time          <chr> "23:06:", "08:53:", "18:36:", "07:30:", "17:31:", "…
## $ drop_time          <chr> "23:40", "09:31", "18:58", "07:38", "17:36", "14:56…
## $ ride_length        <dbl> 2001, 2274, 1310, 522, 301, 429, 533, 730, 1002, 88…
## $ drop               <dttm> 2022-05-23 23:40:19, 2022-05-11 09:31:22, 2022-05-…

Removed non-necessary column(s)

tripdata <- tripdata %>% 
  select(-c(drop))

Now original data is intact

glimpse(tripdata) # look on data

## Rows: 5,859,061
## Columns: 17
## $ ride_id            <chr> "EC2DE40644C6B0F4", "1C31AD03897EE385", "1542FBEC83…
## $ rideable_type      <chr> "classic_bike", "classic_bike", "classic_bike", "cl…
## $ started_at         <chr> "2022-05-23 23:06:58", "2022-05-11 08:53:28", "2022…
## $ ended_at           <chr> "2022-05-23 23:40:19", "2022-05-11 09:31:22", "2022…
## $ start_station_name <chr> "Wabash Ave & Grand Ave", "DuSable Lake Shore Dr & …
## $ start_station_id   <chr> "TA1307000117", "13300", "TA1305000032", "TA1305000…
## $ end_station_name   <chr> "Halsted St & Roscoe St", "Field Blvd & South Water…
## $ end_station_id     <chr> "TA1309000025", "15534", "13221", "TA1305000030", "…
## $ member_casual      <chr> "member", "member", "member", "member", "member", "…
## $ date               <date> 2022-05-23, 2022-05-11, 2022-05-26, 2022-05-10, 20…
## $ day                <chr> "23", "11", "26", "10", "10", "04", "27", "29", "16…
## $ month              <chr> "May", "May", "May", "May", "May", "May", "May", "M…
## $ year               <chr> "2022", "2022", "2022", "2022", "2022", "2022", "20…
## $ day_of_week        <chr> "Monday", "Wednesday", "Thursday", "Tuesday", "Tues…
## $ pick_time          <chr> "23:06:", "08:53:", "18:36:", "07:30:", "17:31:", "…
## $ drop_time          <chr> "23:40", "09:31", "18:58", "07:38", "17:36", "14:56…
## $ ride_length        <dbl> 2001, 2274, 1310, 522, 301, 429, 533, 730, 1002, 88…

ceated another dataframe for further data visualization:

tripdata_new2 <- tripdata_new

Manipulation in newly created dataframe for Data viz 2:

tripdata_new2$pick <- as_datetime(tripdata_new2$started_at)
tripdata_new2$pick_hour <- format(as_datetime(tripdata_new2$pick), "%H")

tripdata_new2$drop <- as_datetime(tripdata_new2$ended_at)
tripdata_new2$drop_hour <- format(as_datetime(tripdata_new2$drop), "%H")

Removed non-necessary column :

tripdata_new2 <- tripdata_new2 %>% 
  select(-c(pick, drop))
glimpse(tripdata_new2) # Took a look on it

## Rows: 5,858,958
## Columns: 19
## $ ride_id            <chr> "EC2DE40644C6B0F4", "1C31AD03897EE385", "1542FBEC83…
## $ rideable_type      <chr> "classic_bike", "classic_bike", "classic_bike", "cl…
## $ started_at         <chr> "2022-05-23 23:06:58", "2022-05-11 08:53:28", "2022…
## $ ended_at           <chr> "2022-05-23 23:40:19", "2022-05-11 09:31:22", "2022…
## $ start_station_name <chr> "Wabash Ave & Grand Ave", "DuSable Lake Shore Dr & …
## $ start_station_id   <chr> "TA1307000117", "13300", "TA1305000032", "TA1305000…
## $ end_station_name   <chr> "Halsted St & Roscoe St", "Field Blvd & South Water…
## $ end_station_id     <chr> "TA1309000025", "15534", "13221", "TA1305000030", "…
## $ member_casual      <chr> "member", "member", "member", "member", "member", "…
## $ date               <date> 2022-05-23, 2022-05-11, 2022-05-26, 2022-05-10, 20…
## $ day                <chr> "23", "11", "26", "10", "10", "04", "27", "29", "16…
## $ month              <chr> "May", "May", "May", "May", "May", "May", "May", "M…
## $ year               <chr> "2022", "2022", "2022", "2022", "2022", "2022", "20…
## $ day_of_week        <ord> Monday, Wednesday, Thursday, Tuesday, Tuesday, Wedn…
## $ pick_time          <chr> "23:06:", "08:53:", "18:36:", "07:30:", "17:31:", "…
## $ drop_time          <chr> "23:40", "09:31", "18:58", "07:38", "17:36", "14:56…
## $ ride_length        <dbl> 2001, 2274, 1310, 522, 301, 429, 533, 730, 1002, 88…
## $ pick_hour          <chr> "23", "08", "18", "07", "17", "14", "12", "19", "17…
## $ drop_hour          <chr> "23", "09", "18", "07", "17", "14", "12", "19", "18…

Data viz 2A : Riders Peek hour on time basis :

ggplot(data= tripdata_new2,
       mapping = aes(x = pick_hour, fill = member_casual)) +
  geom_bar(position = "dodge") +
  scale_y_continuous(labels = scales::comma) +
  labs(x = "Bikes' Pick Hour 00:00 To 23:00", y = "Number of Rides Taken", fill = "Memeber/Casual",
       title = " Rush Hours by Member & Casual", 
       subtitle = "High Peek Rate During Afternoon",
       caption = "Data Analyst 'JP'")

Members and Casual Riders using rides during work-hour are high.

Data viz 2B : Riders Drop hour on time basis :

ggplot(data= tripdata_new2,
       mapping = aes(x = drop_hour, fill = member_casual)) +
  geom_bar(position = "dodge") +
  scale_y_continuous(labels = scales::comma) +
  labs(x = "Bikes' Drop Hour 00:00 To 23:00", y = "Number of Rides Taken", fill = "Memeber/Casual",
       title = " Bikes' Drop Hours by Member & Casual", 
       subtitle = "High Drop Rate too During Afternoon",
       caption = "Data Analyst 'JP'")

Experimented the idea of Riders picking bikes and dropping it, on time basis.

Manipulation in newly created dataframe for Data viz 3:

tripdata_new2$month1 <- as.Date(tripdata_new2$started_at)
tripdata_new2$month_year <- format(as.Date(tripdata_new2$month1), "%m %y")

Removed non-necessary column :

tripdata_new2 <- tripdata_new2 %>% 
  select(-c(month1))

glimpse(tripdata_new2) # Checked Data

## Rows: 5,858,958
## Columns: 20
## $ ride_id            <chr> "EC2DE40644C6B0F4", "1C31AD03897EE385", "1542FBEC83…
## $ rideable_type      <chr> "classic_bike", "classic_bike", "classic_bike", "cl…
## $ started_at         <chr> "2022-05-23 23:06:58", "2022-05-11 08:53:28", "2022…
## $ ended_at           <chr> "2022-05-23 23:40:19", "2022-05-11 09:31:22", "2022…
## $ start_station_name <chr> "Wabash Ave & Grand Ave", "DuSable Lake Shore Dr & …
## $ start_station_id   <chr> "TA1307000117", "13300", "TA1305000032", "TA1305000…
## $ end_station_name   <chr> "Halsted St & Roscoe St", "Field Blvd & South Water…
## $ end_station_id     <chr> "TA1309000025", "15534", "13221", "TA1305000030", "…
## $ member_casual      <chr> "member", "member", "member", "member", "member", "…
## $ date               <date> 2022-05-23, 2022-05-11, 2022-05-26, 2022-05-10, 20…
## $ day                <chr> "23", "11", "26", "10", "10", "04", "27", "29", "16…
## $ month              <chr> "May", "May", "May", "May", "May", "May", "May", "M…
## $ year               <chr> "2022", "2022", "2022", "2022", "2022", "2022", "20…
## $ day_of_week        <ord> Monday, Wednesday, Thursday, Tuesday, Tuesday, Wedn…
## $ pick_time          <chr> "23:06:", "08:53:", "18:36:", "07:30:", "17:31:", "…
## $ drop_time          <chr> "23:40", "09:31", "18:58", "07:38", "17:36", "14:56…
## $ ride_length        <dbl> 2001, 2274, 1310, 522, 301, 429, 533, 730, 1002, 88…
## $ pick_hour          <chr> "23", "08", "18", "07", "17", "14", "12", "19", "17…
## $ drop_hour          <chr> "23", "09", "18", "07", "17", "14", "12", "19", "18…
## $ month_year         <chr> "05 22", "05 22", "05 22", "05 22", "05 22", "05 22…

Data viz 3A : Riders engagement on month-wise :

ggplot(data= tripdata_new2,
       mapping = aes(x = month_year, fill = member_casual)) +
  geom_bar(position = "dodge") +
  scale_y_continuous(labels = scales::comma) +
  labs(x = "Month-Year", y = "Number of Rides Taken", fill = "Memeber/Casual",
       title = " Month Wise Rides Taken by Members vs Casuals ", 
       subtitle = "Causal Riders Taking Less Rides in Winter Season than Members",
       caption = "Data Analyst 'JP'")

Casual Riders engagement in Winter Season is very low as compared to Members. Advertising before and during the Winter will have good impact on company too. While offering ‘becoming new member’ gets discount on “electric bike” & “classic bike” too 😊

PHASE 6 : Act

The act phase would be done by the Executive team of the company. So,Passing the documented report to Lily Moreno and the team.

Data-driven decision-making :

Cyclistic’s bike-share has good amount of engagements whether riders are Casuals or Members.
Advertising about being ‘Members’ is not too costly at all.
Casual Riders can easily save money when they subscribe annual Membership.
Casual Riders engagement in Winter Season is low as compared to Member Riders, Advertising about getting Membership too on Social Media like Instagram, Facebook etc. will help them get to know about membership plans.
Before and During Winter Season, Advertising about getting membership will be beneficial too.

Suggestion(s) :

If data has “lat, lng” to “address” could have help analysis much deeper.like area-wise bike-share distribution.
we can increase rides cost during time but can’t say its effect on it anyhow well riders engagement are high during work-hour.
If more about updated info like start station id , lat, lng, more info about rider, address would be useful

Saved

Exported data

write.csv(tripdata, file =“tripdata.csv”)
write.csv(tripdata_new, file =“tripdata_new.csv”)
write.csv(tripdata_new2, file =“tripdata_new2.csv”)

Google Data Analytics : Case Study 1

How Does a Bike-Share Navigate Speedy Success?

Jayprakash Kumar

2023-05-24

Scenario :

Characters and teams :

PHASE 1 : Ask

Key taks

Deliverable

PHASE 2 : Prepare

Key taks

Deliverable

PHASE 3 : Process

Setting up the Environment

Dependencies

Working Directory

Data Collection

Data Wrangling

Merging all the Data into Single One for Analysis.

cross-checked data’s integrity :

Saved merged Data :

Removed ‘lat’ & ‘lng’ columns.

will be talking about ‘lat’ & ‘lng’ in the end of the project.

Checked Data Structure.

Statistical Summary of Data.

Checking number of riders as casual and member :

Data Cleaning :

Added columns as date, month, day, year and day_of_week for analysis

Adding columns as pick_time & drop_time and formatted as datetime :

Removed ‘pick’ & ‘drop’ columns :

Checked for Data duplication :

Adding a column ‘ride_length’

PHASE 4 : Analyze

CONDUCTING DESCRIPTIVE ANALYSIS

Statistical Summary of Data

Removing Bad Data :

Statistical Summary of ride duration :

Compared members and casual riders by ride duration :

Average ride duration for each day by members vs casual users :

“Days of the Week” are out of order. Fixed it :

Average ride duration for each day by members vs casual users :

PHASE 5 : Data Visualization

Created a seperate dataframe for “Data Viz 1” :

Converted to dataframe :

Working on newly converted dataframe (rides_per_day) :

Data viz 1A : ‘Number of rides’ across the week for all users.

Data viz 1B : ‘Average duration’ across the week for all users.

Data viz 1C : ‘Max duration’ across the week for all users :

Data viz 1D : ‘Median duration’ across the week for all users :

Data viz 1E : Riders peek rate on the daily basis by bike’s type:

Data viz 1F : Average Ride Duration on the daily basis by bike’s type:

Data viz 1G : Peek rate on the daily basis by bike’s type:

Data viz 1H : Peek rate on the daily basis by Riders :

To get more detailed Viz, Manipulation in Original Data was needed :

ceated another dataframe for further data visualization:

Manipulation in newly created dataframe for Data viz 2:

Data viz 2A : Riders Peek hour on time basis :

Data viz 2B : Riders Drop hour on time basis :

Manipulation in newly created dataframe for Data viz 3:

Data viz 3A : Riders engagement on month-wise :

PHASE 6 : Act

The act phase would be done by the Executive team of the company. So,Passing the documented report to Lily Moreno and the team.

Data-driven decision-making :

Suggestion(s) :