Cyclistic bike-share program

Role

As a part of the Google Capstone Project, I played the role of a junior data analyst. I used Ask, Prepare, Process, Analyze, Share and Act data analysis process.

Scenario

Cyclistic is a bike-share company with more than 5800 bicycles and 600 docking stations. The company offers:

3 bike types: electric, classic, and docked bikes
3 pricing plans: single-ride passes, full-day passes, and annual memberships

Cyclistic bike riders include:

Casual Riders: who purchase signle ride passes or full ride passes
Cyclistic members: who opt for annual memberships

Since annual members are more than casual riders, Lily Moreno, the company’s marketing director, instead of creating a marketing campaign that targets all-new customers, she believes there is a solid opportunity to convert casual riders into annual members.

The team needs to better understand how annual members and casual riders differ, why casual riders would buy a membership, and how digital media could affect their marketing tactics.

Ask

Based on the understanding the primary stakeholders are:

Lily Moreno – Manager/Director of marketing
Cyclistic marketing analytics team
Cyclistic executive team - Cyclists/riders/clients

My business task is to analyze and compare the different riding patterns of casual riders and annual members in last 12 months.

Prepare

Collected different data sets from divvybikes provided by Google Data Analysis Certification program resources. I have downloaded data from August 2023 to August 2024.

It is secure because no one else is manipulating the data from the amazon bucket, and the accessibility for any type of person to feel free to downloaded, analyze it, copy, modify and distribute the product.

Process

I used R to investigate the data sets and Excel for some minimum error checks. ### Installing packages

Importing the tidyverse, readr, janitor, DescTools, lubridate libraries

library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(readr)
library(janitor)

## 
## Attaching package: 'janitor'
## 
## The following objects are masked from 'package:stats':
## 
##     chisq.test, fisher.test

library(DescTools)
library(lubridate)
library(ggplot2)
library(dplyr)

Loading the datasets

Load divvybikes datasets for past 12 months

divvy_bikes <- list.files(path = setwd("C:/Users/91944/Documents/Google-Data-Analytics-Capstone-Project/datasets"), 
                       pattern = "*.csv", full.names = TRUE) %>% 
  lapply(read_csv) %>% 
  bind_rows

Data profiling

Identifying anomalies
Inspecting data

glimpse(divvy_bikes)

## Rows: 5,715,693
## Columns: 13
## $ ride_id            <chr> "903C30C2D810A53B", "F2FB18A98E110A2B", "D0DEC7C94E…
## $ rideable_type      <chr> "electric_bike", "electric_bike", "electric_bike", …
## $ started_at         <dttm> 2023-08-19 15:41:53, 2023-08-18 15:30:18, 2023-08-…
## $ ended_at           <dttm> 2023-08-19 15:53:36, 2023-08-18 15:45:25, 2023-08-…
## $ start_station_name <chr> "LaSalle St & Illinois St", "Clark St & Randolph St…
## $ start_station_id   <chr> "13430", "TA1305000030", "TA1305000030", "KA1504000…
## $ end_station_name   <chr> "Clark St & Elm St", NA, NA, NA, NA, NA, NA, NA, NA…
## $ end_station_id     <chr> "TA1307000039", NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ start_lat          <dbl> 41.89072, 41.88451, 41.88498, 41.90310, 41.88555, 4…
## $ start_lng          <dbl> -87.63148, -87.63155, -87.63079, -87.63467, -87.632…
## $ end_lat            <dbl> 41.90297, 41.93000, 41.91000, 41.90000, 41.89000, 4…
## $ end_lng            <dbl> -87.63128, -87.64000, -87.63000, -87.62000, -87.680…
## $ member_casual      <chr> "member", "member", "member", "member", "member", "…

After inspecting the dataset, we found many columns has NA Values Finding the number of NA values in each column

na_values <- divvy_bikes %>%
  summarise_all(~ sum(is.na(.)))
glimpse(na_values)

## Rows: 1
## Columns: 13
## $ ride_id            <int> 0
## $ rideable_type      <int> 0
## $ started_at         <int> 0
## $ ended_at           <int> 0
## $ start_station_name <int> 947025
## $ start_station_id   <int> 947025
## $ end_station_name   <int> 989476
## $ end_station_id     <int> 989476
## $ start_lat          <int> 0
## $ start_lng          <int> 0
## $ end_lat            <int> 7756
## $ end_lng            <int> 7756
## $ member_casual      <int> 0

Cleaning the data

To check the column names of the dataset are correct or not we use funcion from “Janitor” library

clean_names(divvy_bikes)

## # A tibble: 5,715,693 × 13
##    ride_id          rideable_type started_at          ended_at           
##    <chr>            <chr>         <dttm>              <dttm>             
##  1 903C30C2D810A53B electric_bike 2023-08-19 15:41:53 2023-08-19 15:53:36
##  2 F2FB18A98E110A2B electric_bike 2023-08-18 15:30:18 2023-08-18 15:45:25
##  3 D0DEC7C94E4663DA electric_bike 2023-08-30 16:15:08 2023-08-30 16:27:37
##  4 E0DDDC5F84747ED9 electric_bike 2023-08-30 16:24:07 2023-08-30 16:33:34
##  5 7797A4874BA260CA electric_bike 2023-08-22 15:59:44 2023-08-22 16:20:38
##  6 DF4DE734EBC4DF66 electric_bike 2023-08-24 12:27:24 2023-08-24 12:54:59
##  7 EE60FB066E69AFAC electric_bike 2023-08-31 20:42:14 2023-08-31 20:54:38
##  8 A115DA6AA13DE5EF electric_bike 2023-08-17 15:15:51 2023-08-17 15:22:27
##  9 86DBB19374245893 electric_bike 2023-08-24 21:37:19 2023-08-24 21:47:22
## 10 2905CBC8B8EE392C electric_bike 2023-08-28 14:53:38 2023-08-28 14:59:35
## # ℹ 5,715,683 more rows
## # ℹ 9 more variables: start_station_name <chr>, start_station_id <chr>,
## #   end_station_name <chr>, end_station_id <chr>, start_lat <dbl>,
## #   start_lng <dbl>, end_lat <dbl>, end_lng <dbl>, member_casual <chr>

Checking the whether each column has the correct data type

sapply(divvy_bikes,typeof)

##            ride_id      rideable_type         started_at           ended_at 
##        "character"        "character"           "double"           "double" 
## start_station_name   start_station_id   end_station_name     end_station_id 
##        "character"        "character"        "character"        "character" 
##          start_lat          start_lng            end_lat            end_lng 
##           "double"           "double"           "double"           "double" 
##      member_casual 
##        "character"

We need to check if the dataset contains duplicate values or not by using ride_id which is primary key

duplicate_values <- divvy_bikes %>%
  get_dupes(ride_id)
duplicate_values

## # A tibble: 422 × 14
##    ride_id      dupe_count rideable_type started_at          ended_at           
##    <chr>             <int> <chr>         <dttm>              <dttm>             
##  1 011C8EF97AB…          2 classic_bike  2024-05-31 19:45:38 2024-06-01 20:45:33
##  2 011C8EF97AB…          2 classic_bike  2024-05-31 19:45:38 2024-06-01 20:45:33
##  3 01406457A85…          2 electric_bike 2024-05-31 23:54:59 2024-06-01 00:01:47
##  4 01406457A85…          2 electric_bike 2024-05-31 23:54:59 2024-06-01 00:01:47
##  5 02606FBC7F8…          2 classic_bike  2024-05-31 17:55:01 2024-06-01 18:54:53
##  6 02606FBC7F8…          2 classic_bike  2024-05-31 17:55:01 2024-06-01 18:54:53
##  7 0354FD07563…          2 electric_bike 2024-05-31 23:34:36 2024-06-01 00:14:29
##  8 0354FD07563…          2 electric_bike 2024-05-31 23:34:36 2024-06-01 00:14:29
##  9 048C715F1DE…          2 electric_bike 2024-05-31 23:53:44 2024-06-01 00:12:26
## 10 048C715F1DE…          2 electric_bike 2024-05-31 23:53:44 2024-06-01 00:12:26
## # ℹ 412 more rows
## # ℹ 9 more variables: start_station_name <chr>, start_station_id <chr>,
## #   end_station_name <chr>, end_station_id <chr>, start_lat <dbl>,
## #   start_lng <dbl>, end_lat <dbl>, end_lng <dbl>, member_casual <chr>

The result shows 422 values which is combination of original and duplicated values. So the dataset contain 211 duplicated values

Handling duplicated values

Now, we need to extract the distinct values from each column which means 211 duplicate rows will be removed from dataset.

To remove duplicates we use “distinct” function using ride_id field from dataset.

divvy_bikes_cleaned <-divvy_bikes %>%
  distinct(ride_id,.keep_all = TRUE)
glimpse(divvy_bikes_cleaned)

## Rows: 5,715,482
## Columns: 13
## $ ride_id            <chr> "903C30C2D810A53B", "F2FB18A98E110A2B", "D0DEC7C94E…
## $ rideable_type      <chr> "electric_bike", "electric_bike", "electric_bike", …
## $ started_at         <dttm> 2023-08-19 15:41:53, 2023-08-18 15:30:18, 2023-08-…
## $ ended_at           <dttm> 2023-08-19 15:53:36, 2023-08-18 15:45:25, 2023-08-…
## $ start_station_name <chr> "LaSalle St & Illinois St", "Clark St & Randolph St…
## $ start_station_id   <chr> "13430", "TA1305000030", "TA1305000030", "KA1504000…
## $ end_station_name   <chr> "Clark St & Elm St", NA, NA, NA, NA, NA, NA, NA, NA…
## $ end_station_id     <chr> "TA1307000039", NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ start_lat          <dbl> 41.89072, 41.88451, 41.88498, 41.90310, 41.88555, 4…
## $ start_lng          <dbl> -87.63148, -87.63155, -87.63079, -87.63467, -87.632…
## $ end_lat            <dbl> 41.90297, 41.93000, 41.91000, 41.90000, 41.89000, 4…
## $ end_lng            <dbl> -87.63128, -87.64000, -87.63000, -87.62000, -87.680…
## $ member_casual      <chr> "member", "member", "member", "member", "member", "…

Now all the duplicate rows are removed.

Total number of records in new data set is 5,715,482.

5715693 (original dataset) - 211 (duplicate records) = 5715482 (cleaned dataset)

Getting info about the dataset

For calculating the ride length of each ride, we need to add new column ride_length which is difference between the started_at and ended_at in minutes.

divvy_bikes_processed <- divvy_bikes_cleaned %>%
  mutate(ride_length = round(difftime(ended_at,started_at,units = "mins"),digits = 2))

we need to check if the ride_length has any negatives values or not which means ride_length <0

negative_ride_length <- divvy_bikes_processed %>%
  filter(ride_length <0)
negative_ride_length

## # A tibble: 404 × 14
##    ride_id          rideable_type started_at          ended_at           
##    <chr>            <chr>         <dttm>              <dttm>             
##  1 BEA2E3C59F6C9064 electric_bike 2023-08-16 14:42:56 2023-08-16 14:42:55
##  2 68203D8EB3D3F2AC electric_bike 2023-08-27 18:37:38 2023-08-27 18:37:35
##  3 62F4F440D6560E70 electric_bike 2023-08-21 15:34:39 2023-08-21 15:31:56
##  4 2BFB23CDC9A75AB0 electric_bike 2023-08-26 10:19:36 2023-08-26 10:16:52
##  5 8FD41344538C6DAD electric_bike 2023-08-04 21:28:08 2023-08-04 21:28:06
##  6 88C6795EB8654AF5 electric_bike 2023-08-19 03:01:47 2023-08-19 03:00:12
##  7 197D33FAF6396DDC electric_bike 2023-08-18 16:01:29 2023-08-18 16:01:28
##  8 7FC22D5B6730A327 electric_bike 2023-08-03 08:16:04 2023-08-03 08:13:34
##  9 82B071A054A869E0 electric_bike 2023-08-19 16:19:27 2023-08-19 16:19:25
## 10 41884873E3E524D3 electric_bike 2023-08-03 17:19:37 2023-08-03 17:19:14
## # ℹ 394 more rows
## # ℹ 10 more variables: start_station_name <chr>, start_station_id <chr>,
## #   end_station_name <chr>, end_station_id <chr>, start_lat <dbl>,
## #   start_lng <dbl>, end_lat <dbl>, end_lng <dbl>, member_casual <chr>,
## #   ride_length <drtn>

Clearly we can see that 404 records have negative values which means that end time is before than start time.

We need to correct it in 2 approaches:

Removing the records whose ride_length less than zero
By changing the values to correct columns by using condition statement.

I followed second approach becuase if we remove the records we may loose some data.

divvy_bikes_processed <- divvy_bikes_processed %>%
  mutate(new_started_at = ifelse(ride_length<0,as_datetime(ended_at),as_datetime(started_at)),
         new_ended_at = ifelse(ride_length<0,as_datetime(started_at),as_datetime(ended_at)),
         started_at=as_datetime(new_started_at), ended_at=as_datetime(new_ended_at)) %>%
  select(-new_started_at,-new_ended_at)

We changed the values of start time and end time by using ifelse condition and replaced start and end time columns with correct values and dropped newly created columns by using select function.

Now I calculated and checked if the ride_length column has any negative values or not.

divvy_bikes_processed <- divvy_bikes_processed %>%
  mutate(ride_length = round(difftime(ended_at,started_at,units = "mins"),digits = 2))

negative_ride_length <- divvy_bikes_processed %>%
  filter(ride_length<0)
negative_ride_length

## # A tibble: 0 × 14
## # ℹ 14 variables: ride_id <chr>, rideable_type <chr>, started_at <dttm>,
## #   ended_at <dttm>, start_station_name <chr>, start_station_id <chr>,
## #   end_station_name <chr>, end_station_id <chr>, start_lat <dbl>,
## #   start_lng <dbl>, end_lat <dbl>, end_lng <dbl>, member_casual <chr>,
## #   ride_length <drtn>

Now, the dataset does not contain any negative values.

Analyze the dataset and Visualization

We need to perform some calculations for analysis.

Calculations on datasets

Per day of week

Creating a new column day_of_week to calculate which day the ride is starting.

divvy_bikes_processed <- divvy_bikes_processed %>%
  mutate(day_of_week = weekdays(as.Date(started_at)))
head(divvy_bikes_processed)

## # A tibble: 6 × 15
##   ride_id          rideable_type started_at          ended_at           
##   <chr>            <chr>         <dttm>              <dttm>             
## 1 903C30C2D810A53B electric_bike 2023-08-19 15:41:53 2023-08-19 15:53:36
## 2 F2FB18A98E110A2B electric_bike 2023-08-18 15:30:18 2023-08-18 15:45:25
## 3 D0DEC7C94E4663DA electric_bike 2023-08-30 16:15:08 2023-08-30 16:27:37
## 4 E0DDDC5F84747ED9 electric_bike 2023-08-30 16:24:07 2023-08-30 16:33:34
## 5 7797A4874BA260CA electric_bike 2023-08-22 15:59:44 2023-08-22 16:20:38
## 6 DF4DE734EBC4DF66 electric_bike 2023-08-24 12:27:24 2023-08-24 12:54:59
## # ℹ 11 more variables: start_station_name <chr>, start_station_id <chr>,
## #   end_station_name <chr>, end_station_id <chr>, start_lat <dbl>,
## #   start_lng <dbl>, end_lat <dbl>, end_lng <dbl>, member_casual <chr>,
## #   ride_length <drtn>, day_of_week <chr>

To analyze the data we need to have mean, max, and mode of dataset by grouping member_casual column to find ride length and day of the week for two memberships.

member_casuals <- divvy_bikes_processed %>%
  group_by(member_casual) %>%
  summarise(avg_ride_length=mean(ride_length),max_ride_length = max(ride_length), mode_day_of_week = Mode(day_of_week), no_of_rides=n_distinct(ride_id))
member_casuals

## # A tibble: 2 × 5
##   member_casual avg_ride_length max_ride_length mode_day_of_week no_of_rides
##   <chr>         <drtn>          <drtn>          <chr>                  <int>
## 1 casual        27.15739 mins   98489.07 mins   Saturday             2038442
## 2 member        13.00704 mins   16635.22 mins   Wednesday            3677040

From the analysis we can know that casual members use cycles on Saturday and membership users on Wednesday.

Now we calculate average ride length and total no of rides grouping by type of member and day of the week.

divvy_bikes_week <- divvy_bikes_processed %>%
  group_by(member_casual,day_of_week) %>%
  summarise(avg_ride_length=mean(ride_length),no_of_rides=n_distinct(ride_id))

## `summarise()` has grouped output by 'member_casual'. You can override using the
## `.groups` argument.

divvy_bikes_week

## # A tibble: 14 × 4
## # Groups:   member_casual [2]
##    member_casual day_of_week avg_ride_length no_of_rides
##    <chr>         <chr>       <drtn>                <int>
##  1 casual        Friday      26.77450 mins        299983
##  2 casual        Monday      26.26931 mins        225902
##  3 casual        Saturday    30.36386 mins        415227
##  4 casual        Sunday      31.92722 mins        340723
##  5 casual        Thursday    24.10050 mins        255637
##  6 casual        Tuesday     23.44512 mins        234820
##  7 casual        Wednesday   23.44535 mins        266150
##  8 member        Friday      12.75488 mins        517553
##  9 member        Monday      12.37517 mins        503914
## 10 member        Saturday    14.32431 mins        471116
## 11 member        Sunday      14.51341 mins        408653
## 12 member        Thursday    12.46100 mins        576543
## 13 member        Tuesday     12.48452 mins        583225
## 14 member        Wednesday   12.73482 mins        616036

Finding maximum no of rides for both members and casual riders on week

max_no_of_rides <- divvy_bikes_week %>%
  group_by(member_casual)%>%
  slice_max(no_of_rides)
max_no_of_rides

## # A tibble: 2 × 4
## # Groups:   member_casual [2]
##   member_casual day_of_week avg_ride_length no_of_rides
##   <chr>         <chr>       <drtn>                <int>
## 1 casual        Saturday    30.36386 mins        415227
## 2 member        Wednesday   12.73482 mins        616036

Most of the casual riders ride on Saturday with 41522 rides.

Most of the Member riders ride on Wednesday with 616033.

Percentage of rides for each day of the week

rides_percentage <- divvy_bikes_processed %>%
  group_by(day_of_week)%>%
  summarise(no_of_rides=n_distinct(ride_id)) %>%
  mutate(percentage = (no_of_rides/sum(no_of_rides))*100)
rides_percentage

## # A tibble: 7 × 3
##   day_of_week no_of_rides percentage
##   <chr>             <int>      <dbl>
## 1 Friday           817536       14.3
## 2 Monday           729816       12.8
## 3 Saturday         886343       15.5
## 4 Sunday           749376       13.1
## 5 Thursday         832180       14.6
## 6 Tuesday          818045       14.3
## 7 Wednesday        882186       15.4

Saturday has higher percentage of rides during the year.So we need to analyze the behaviour of each type on Saturday.

saturday_rides <- divvy_bikes_week %>%
  filter(day_of_week == "Saturday") %>%
  group_by(member_casual,day_of_week) %>%
  summarise(no_of_rides=sum(no_of_rides))

## `summarise()` has grouped output by 'member_casual'. You can override using the
## `.groups` argument.

saturday_rides <- mutate(saturday_rides,percentage=(no_of_rides/sum(saturday_rides$no_of_rides))*100)
saturday_rides

## # A tibble: 2 × 4
## # Groups:   member_casual [2]
##   member_casual day_of_week no_of_rides percentage
##   <chr>         <chr>             <int>      <dbl>
## 1 casual        Saturday         415227       46.8
## 2 member        Saturday         471116       53.2

The member riders represent 53.2% of the rides on busiest day(Saturday) and 46.8% of rides by casual riders.

Per season of year

To calculate we need to create a new column Season_of_year to identify the seasons of the ride taken.

divvy_bikes_processed <- divvy_bikes_processed %>%
  mutate(season_of_year = ifelse(month(started_at)==12 | month(started_at)==1 | month(started_at)==2, "Winter",ifelse(month(started_at)==3 | month(started_at)==4 | month(started_at)==5, "Spring",ifelse(month(started_at)==6 | month(started_at)==7| month(started_at)==8, "Summer", "Autumn"))))

head(divvy_bikes_processed)

## # A tibble: 6 × 16
##   ride_id          rideable_type started_at          ended_at           
##   <chr>            <chr>         <dttm>              <dttm>             
## 1 903C30C2D810A53B electric_bike 2023-08-19 15:41:53 2023-08-19 15:53:36
## 2 F2FB18A98E110A2B electric_bike 2023-08-18 15:30:18 2023-08-18 15:45:25
## 3 D0DEC7C94E4663DA electric_bike 2023-08-30 16:15:08 2023-08-30 16:27:37
## 4 E0DDDC5F84747ED9 electric_bike 2023-08-30 16:24:07 2023-08-30 16:33:34
## 5 7797A4874BA260CA electric_bike 2023-08-22 15:59:44 2023-08-22 16:20:38
## 6 DF4DE734EBC4DF66 electric_bike 2023-08-24 12:27:24 2023-08-24 12:54:59
## # ℹ 12 more variables: start_station_name <chr>, start_station_id <chr>,
## #   end_station_name <chr>, end_station_id <chr>, start_lat <dbl>,
## #   start_lng <dbl>, end_lat <dbl>, end_lng <dbl>, member_casual <chr>,
## #   ride_length <drtn>, day_of_week <chr>, season_of_year <chr>

Now we calculate average ride length and total number of rides grouping by type of member and season of year and percentage of rides on each season

divvy_bikes_season <- divvy_bikes_processed %>%
  group_by(member_casual,season_of_year) %>%
  summarise(avg_ride_length=mean(ride_length),no_of_rides=n_distinct(ride_id)) %>%
  mutate(percentage=(no_of_rides/sum(no_of_rides))*100)

## `summarise()` has grouped output by 'member_casual'. You can override using the
## `.groups` argument.

divvy_bikes_season

## # A tibble: 8 × 5
## # Groups:   member_casual [2]
##   member_casual season_of_year avg_ride_length no_of_rides percentage
##   <chr>         <chr>          <drtn>                <int>      <dbl>
## 1 casual        Autumn         23.53489 mins        537098      26.3 
## 2 casual        Spring         26.71057 mins        445378      21.8 
## 3 casual        Summer         30.10985 mins        932671      45.8 
## 4 casual        Winter         22.21774 mins        123295       6.05
## 5 member        Autumn         12.43640 mins       1028904      28.0 
## 6 member        Spring         12.75438 mins        880827      24.0 
## 7 member        Summer         13.77731 mins       1298494      35.3 
## 8 member        Winter         12.60064 mins        468815      12.7

Finding maximum number of rides for both members and casual riders on season

max_no_of_rides_season<- divvy_bikes_season %>%
  group_by(member_casual) %>%
  slice_max(no_of_rides)
max_no_of_rides_season

## # A tibble: 2 × 5
## # Groups:   member_casual [2]
##   member_casual season_of_year avg_ride_length no_of_rides percentage
##   <chr>         <chr>          <drtn>                <int>      <dbl>
## 1 casual        Summer         30.10985 mins        932671       45.8
## 2 member        Summer         13.77731 mins       1298494       35.3

Season with most number of casual riders is Summer with 93267 rides.

Season with most number of member riders is Summer with 1298494 rides.

Visualizing percentage of rides on each season of the year

season_rides_percentage <- divvy_bikes_processed %>%
  group_by(season_of_year) %>%
  summarise(no_of_rides=n_distinct(ride_id)) %>%
  mutate(percentage=(no_of_rides/sum(no_of_rides))*100)
season_rides_percentage

## # A tibble: 4 × 3
##   season_of_year no_of_rides percentage
##   <chr>                <int>      <dbl>
## 1 Autumn             1566002       27.4
## 2 Spring             1326205       23.2
## 3 Summer             2231165       39.0
## 4 Winter              592110       10.4

Summer has higher percentage of rides during the year.So we need to analyze the behaviour of each type on Summer.

summer_rides <- divvy_bikes_season %>%
  filter(season_of_year=="Summer") %>%
  group_by(member_casual,season_of_year) %>%
  summarise(no_of_rides=sum(no_of_rides))

## `summarise()` has grouped output by 'member_casual'. You can override using the
## `.groups` argument.

summer_rides <- mutate(summer_rides,percentage=(no_of_rides/sum(summer_rides$no_of_rides))*100)
summer_rides

## # A tibble: 2 × 4
## # Groups:   member_casual [2]
##   member_casual season_of_year no_of_rides percentage
##   <chr>         <chr>                <int>      <dbl>
## 1 casual        Summer              932671       41.8
## 2 member        Summer             1298494       58.2

The member riders represent 58.2% of the rides on busiest season (Summer) and 41.8% of rides by casual riders.

Per month of year

First we create a new column month_of_year to identify the month of the year

divvy_bikes_processed <- divvy_bikes_processed %>%
  mutate(month_of_year = month(started_at))
head(divvy_bikes_processed)

## # A tibble: 6 × 17
##   ride_id          rideable_type started_at          ended_at           
##   <chr>            <chr>         <dttm>              <dttm>             
## 1 903C30C2D810A53B electric_bike 2023-08-19 15:41:53 2023-08-19 15:53:36
## 2 F2FB18A98E110A2B electric_bike 2023-08-18 15:30:18 2023-08-18 15:45:25
## 3 D0DEC7C94E4663DA electric_bike 2023-08-30 16:15:08 2023-08-30 16:27:37
## 4 E0DDDC5F84747ED9 electric_bike 2023-08-30 16:24:07 2023-08-30 16:33:34
## 5 7797A4874BA260CA electric_bike 2023-08-22 15:59:44 2023-08-22 16:20:38
## 6 DF4DE734EBC4DF66 electric_bike 2023-08-24 12:27:24 2023-08-24 12:54:59
## # ℹ 13 more variables: start_station_name <chr>, start_station_id <chr>,
## #   end_station_name <chr>, end_station_id <chr>, start_lat <dbl>,
## #   start_lng <dbl>, end_lat <dbl>, end_lng <dbl>, member_casual <chr>,
## #   ride_length <drtn>, day_of_week <chr>, season_of_year <chr>,
## #   month_of_year <dbl>

Calculate the average ride length, total number of rides and percentage of rides by grouping type of riders and month of year.

divvy_bikes_month <- divvy_bikes_processed %>%
  group_by(member_casual,month_of_year) %>%
  summarise(avg_ride_length=mean(ride_length),no_of_rides=n_distinct(ride_id)) %>%
  mutate(percentage=(no_of_rides/sum(no_of_rides))*100)

## `summarise()` has grouped output by 'member_casual'. You can override using the
## `.groups` argument.

divvy_bikes_month

## # A tibble: 24 × 5
## # Groups:   member_casual [2]
##    member_casual month_of_year avg_ride_length no_of_rides percentage
##    <chr>                 <dbl> <drtn>                <int>      <dbl>
##  1 casual                    1 21.30664 mins         24460       1.20
##  2 casual                    2 25.18987 mins         47163       2.31
##  3 casual                    3 24.95505 mins         82550       4.05
##  4 casual                    4 26.01160 mins        131810       6.47
##  5 casual                    5 27.73667 mins        231018      11.3 
##  6 casual                    6 27.70557 mins        301169      14.8 
##  7 casual                    7 27.38433 mins        320372      15.7 
##  8 casual                    8 35.24365 mins        311130      15.3 
##  9 casual                    9 25.18416 mins        261635      12.8 
## 10 casual                   10 23.11470 mins        177074       8.69
## # ℹ 14 more rows

Calculating maximum of rides for both members and casual riders for each month of the year

max_no_of_rides_month <- divvy_bikes_month %>%
  group_by(member_casual) %>%
  slice_max(no_of_rides)
max_no_of_rides_month

## # A tibble: 2 × 5
## # Groups:   member_casual [2]
##   member_casual month_of_year avg_ride_length no_of_rides percentage
##   <chr>                 <dbl> <drtn>                <int>      <dbl>
## 1 casual                    7 27.38433 mins        320372       15.7
## 2 member                    8 13.77204 mins        460563       12.5

Month with most number of casual riders is July with 320372 rides.

Month with most number of member riders is August with 460563 rides.

Calculating percentage of rides on each month

monthly_rides_percentage <- divvy_bikes_processed %>%
  group_by(month_of_year) %>%
  summarise(no_of_rides=n_distinct(ride_id)) %>%
  mutate(percentage = (no_of_rides/sum(no_of_rides))*100)
monthly_rides_percentage

## # A tibble: 12 × 3
##    month_of_year no_of_rides percentage
##            <dbl>       <int>      <dbl>
##  1             1      144873       2.53
##  2             2      223164       3.90
##  3             3      301687       5.28
##  4             4      415025       7.26
##  5             5      609493      10.7 
##  6             6      710747      12.4 
##  7             7      748725      13.1 
##  8             8      771693      13.5 
##  9             9      666371      11.7 
## 10            10      537119       9.40
## 11            11      362512       6.34
## 12            12      224073       3.92

August has higher percentage of rides during the year.So we need to analyze the behaviour of each type on members.

august_rides <- divvy_bikes_month %>%
  filter(month_of_year==8) %>%
  group_by(member_casual,month_of_year) %>%
  summarise(no_of_rides=sum(no_of_rides))

## `summarise()` has grouped output by 'member_casual'. You can override using the
## `.groups` argument.

august_rides <- mutate(august_rides,percentage =(no_of_rides/sum(august_rides$no_of_rides))*100)
august_rides

## # A tibble: 2 × 4
## # Groups:   member_casual [2]
##   member_casual month_of_year no_of_rides percentage
##   <chr>                 <dbl>       <int>      <dbl>
## 1 casual                    8      311130       40.3
## 2 member                    8      460563       59.7

The member riders represent 59.7% of the rides on busiest month(August) and 40.3% of rides by casual riders.

Per hour of the day

divvy_bikes_hour <- divvy_bikes_processed %>%
  mutate(hour_started_at = hour(started_at))
divvy_bikes_hour <- divvy_bikes_hour %>%
  group_by(member_casual,hour_started_at) %>%
  summarize(no_of_rides=n_distinct(ride_id))

## `summarise()` has grouped output by 'member_casual'. You can override using the
## `.groups` argument.

divvy_bikes_hour

## # A tibble: 48 × 3
## # Groups:   member_casual [2]
##    member_casual hour_started_at no_of_rides
##    <chr>                   <int>       <int>
##  1 casual                      0       34361
##  2 casual                      1       22890
##  3 casual                      2       14232
##  4 casual                      3        8007
##  5 casual                      4        6043
##  6 casual                      5       11385
##  7 casual                      6       27136
##  8 casual                      7       50196
##  9 casual                      8       69513
## 10 casual                      9       70244
## # ℹ 38 more rows

Ride type

Next, we will analyse the ride types that riders prefer to use.

ride_types <- divvy_bikes_processed %>%
  group_by(member_casual,rideable_type) %>%
  summarise(no_of_rides = n_distinct(ride_id))

## `summarise()` has grouped output by 'member_casual'. You can override using the
## `.groups` argument.

ride_types

## # A tibble: 5 × 3
## # Groups:   member_casual [2]
##   member_casual rideable_type no_of_rides
##   <chr>         <chr>               <int>
## 1 casual        classic_bike       981459
## 2 casual        docked_bike         15957
## 3 casual        electric_bike     1041026
## 4 member        classic_bike      1880675
## 5 member        electric_bike     1796365

From the analysis we can say that member riders prefer classic bikes than electric and docked bikes where as casual riders prefer electric bikes.

Consolidated analysis

Now we are calcuating the highest no of rides according to day, month and season

divvy_bikes_consolidated <- divvy_bikes_processed %>%
  group_by(member_casual,day_of_week,month_of_year,season_of_year) %>%
  summarise(avg_ride_length = mean(ride_length), no_of_rides=n_distinct(ride_id))

## `summarise()` has grouped output by 'member_casual', 'day_of_week',
## 'month_of_year'. You can override using the `.groups` argument.

divvy_bikes_consolidated

## # A tibble: 168 × 6
## # Groups:   member_casual, day_of_week, month_of_year [168]
##    member_casual day_of_week month_of_year season_of_year avg_ride_length
##    <chr>         <chr>               <dbl> <chr>          <drtn>         
##  1 casual        Friday                  1 Winter         18.39389 mins  
##  2 casual        Friday                  2 Winter         24.85875 mins  
##  3 casual        Friday                  3 Spring         22.73627 mins  
##  4 casual        Friday                  4 Spring         22.24533 mins  
##  5 casual        Friday                  5 Spring         26.87207 mins  
##  6 casual        Friday                  6 Summer         28.07744 mins  
##  7 casual        Friday                  7 Summer         27.84953 mins  
##  8 casual        Friday                  8 Summer         36.09690 mins  
##  9 casual        Friday                  9 Autumn         24.11483 mins  
## 10 casual        Friday                 10 Autumn         20.73668 mins  
## # ℹ 158 more rows
## # ℹ 1 more variable: no_of_rides <int>

Also, we can visualize each type of member that had the most number of rides according to the day, month and season.

max_no_of_rides_consolidated <- divvy_bikes_consolidated %>%
  group_by(member_casual) %>%
  slice_max(no_of_rides)
max_no_of_rides_consolidated

## # A tibble: 2 × 6
## # Groups:   member_casual [2]
##   member_casual day_of_week month_of_year season_of_year avg_ride_length
##   <chr>         <chr>               <dbl> <chr>          <drtn>         
## 1 casual        Saturday                9 Autumn         28.16613 mins  
## 2 member        Tuesday                 8 Summer         13.15050 mins  
## # ℹ 1 more variable: no_of_rides <int>

The most rides for the casual riders occur in Autumn, on September, on a Saturday with 68408 rides.

The most rides for the member riders occur in Summer, on August, on a Tuesday with 82787 rides.

Visualizations

We need to present the visualization of the analysis we made so far.

Total number of rides

Per day of week

divvy_bikes_week %>% 
  mutate(day_of_week = factor(day_of_week,levels=c('Sunday','Monday','Tuesday','Wednesday','Thursday','Friday','Saturday'))) %>%
  ggplot(aes(x=day_of_week,y=no_of_rides, fill=member_casual)) +
  geom_col(width=0.7,position = position_dodge(width=0.5)) +
  ggtitle("Total trips by type of riders vs day of the week (Saturday)") +
  theme(plot.title = element_text(hjust=0.5)) +
  scale_y_continuous(labels=function(x) format(x,scientific=FALSE)) +
  labs(fill= "Type of rider")

From the visualization we can say Wednesday is the day with more number of member riders and Saturday is the day with more number of casual riders.

So we understood than casual drivers make use of bikes during weekends, while member riders are busy during the week.

The busiest day of week for all riders is Saturday based on our analysis.

saturday_rides %>% 
  ggplot(aes(x=day_of_week,y=no_of_rides,fill=member_casual)) +
  geom_col(width=0.5,position="dodge") +
  ggtitle("Total trips by type of riders vs Busiest day of week") +
  theme(plot.title = element_text(hjust = 0.5)) +
  scale_y_continuous(labels=function(x) format(x,scientific=FALSE)) +
  labs(fill="Type of riders") +
  geom_text(aes(label=no_of_rides),position=position_dodge(width = 0.5),size=4,vjust=-0.3)

Per month and season

divvy_bikes_month %>%
  mutate(month_of_year = month.abb[month_of_year]) %>%
  mutate(month_of_year = factor(month_of_year,levels=c('Aug','Sep','Oct','Nov','Dec','Jan','Feb','Mar','Apr','May','Jun','Jul'))) %>%
  ggplot(aes(x=month_of_year,y=no_of_rides,fill=member_casual)) +
  geom_col(width=0.5,position=position_dodge(width=0.5)) +
  ggtitle("Total trips by type of riders vs Month of the year (Aug 23 to July 24)") +
  theme(plot.title = element_text(hjust=0.5)) +
  scale_y_continuous(labels = function(x) format(x,scientific=FALSE)) +
  labs(fill="Type of riders")

From visualization, August is the month with most member riders and July is the month with most casual member riders.

Per season of the year

divvy_bikes_season %>%
  ggplot(aes(x=season_of_year, y=no_of_rides,fill=member_casual)) +
  geom_col(width=0.5,position=position_dodge(width = 0.5)) +
  ggtitle("Total trips by type of riders vs Season of the year") +
  theme(plot.title = element_text(hjust = 0.5)) +
  scale_y_continuous(labels=function(x) format(x,scientific=FALSE)) +
  labs(fill= "Type of riders")

We can see that Summer is the busiest season for both type of riders and Winter is the lowest season.

As Summer is most busiest season, we need to analyze in more detail:

summer_rides %>%
  ggplot(aes(x=season_of_year,y=no_of_rides,fill=member_casual)) +
  geom_col(position="dodge",width=0.5) +
  ggtitle("Total trip by type of riders vs Busiest season(Summer)") +
  theme(plot.title = element_text(hjust=0.5)) +
  scale_y_continuous(labels = function(x) format(x,scientific=FALSE)) +
  labs(fill="Type of rider") +
  geom_text(aes(label=str_c(round(percentage,digits=2),"%",sep="")),position = position_dodge(width=0.5),size=4,vjust=-0.3)

The member riders represent 58.2% of the rides that are taken in the busiest season (Summer), while casual members represent the 41.8% of the rides.

Average trip duration

Per day of week

divvy_bikes_week %>%
  mutate(day_of_week= factor(day_of_week,levels=c('Sunday','Monday','Tuesday','Wednesday','Thursday','Friday','Saturday'))) %>%
  ggplot(aes(x=day_of_week,y=avg_ride_length,fill=member_casual)) +
  geom_col(width=0.5,position=position_dodge(width = 0.5)) +
  ggtitle("Avg trip duration by type of riders vs Day of week") +
  theme(plot.title = element_text(hjust=0.5)) +
  scale_y_continuous(labels=function(x) format(x,scientific=FALSE)) +
  labs(fill="Type of rider")

Highest average for trip duration for casual and member riders is on Sunday, followed by Saturday.

This means that the longest rides are happening on the weekends.

Per month of year

divvy_bikes_month %>%
  mutate(month_of_year = month.abb[month_of_year]) %>%
  mutate(month_of_year = factor(month_of_year,levels=c('Aug','Sep','Oct','Nov','Dec','Jan','Feb','Mar','Apr','May','Jun','Jul'))) %>%
  ggplot(aes(x=month_of_year,y=avg_ride_length,fill=member_casual)) +
  geom_col(width=0.5,position=position_dodge(width = 0.5)) +
  ggtitle("Avg trip duration by type of rider vs Month of year (Aug 23 to Jul 24)") +
  theme(plot.title = element_text(hjust = 0.5)) +
  scale_y_continuous(labels = function(x) format(x,scientific=FALSE)) +
  labs(fill="Type of riders")

The visualization above determine that the month of August is the highest average of trip duration for the casual riders.June is the highest average of trip duration for member riders.

Per season of year

divvy_bikes_season %>%
  ggplot(aes(x=season_of_year,y=avg_ride_length,fill=member_casual)) +
  geom_col(width=0.5,position=position_dodge(width=0.5)) +
  ggtitle("Avg trip duration by type of riders vs Season of the year") +
  theme(plot.title=element_text(hjust=0.5)) +
  scale_y_continuous(labels = function(x) format(x,scientific=FALSE)) +
  labs(fill="Type of riders")

Summer season is the one who has the highest average of trip duration for both casual and member riders.

Per hour of the day

divvy_bikes_hour %>%
  ggplot(aes(x=hour_started_at,y=no_of_rides,color=member_casual,group = member_casual))+
  geom_line(size=1)+
  geom_point(size=2) +
  theme(plot.title = element_text(hjust=0.5)) +
  labs(title="Use of bikes for 24 hours",x="Time of day") +
  labs(color="Type of riders") +
  scale_x_continuous(breaks = scales::breaks_width(1)) +
  scale_y_continuous(labels = function(x) format(x, scientific = FALSE))+
  geom_hline(aes(yintercept = max(no_of_rides)))

## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

The busiest hour of the day for the use of bikes for members and casual riders is both at hour 17.

Ride types

Next, we will visualize the ride types that riders prefer to use.

ride_types %>%
  ggplot(aes(x=rideable_type,y=no_of_rides,fill=member_casual)) +
  geom_col(width=0.5) +
  ggtitle("Ride type vs No of rides") +
  theme(plot.title=element_text(hjust=0.5)) +
  scale_y_continuous(labels = function(x) format(x,scientific=FALSE)) +
  labs(fill="Type of riders") +
  geom_text(aes(label = no_of_rides),size=4,vjust=-0.8)

We are using pie chart to visualize the percentage of the ride types used.

ride_types %>%
  group_by(rideable_type) %>%
  summarise(no_of_rides=sum(no_of_rides)) %>%
  ggplot(aes(x="",y=no_of_rides,fill=rideable_type)) +
  geom_col(color="black") +
  ggtitle("Amount of rides for each type") +
  theme(plot.title = element_text(hjust=0.5)) +
  labs(fill="Ride type") +
  geom_text(aes(label=str_c(round((no_of_rides/sum(no_of_rides)*100),digits=2),"%",sep="")),position=position_stack(vjust=0.5)) +
  coord_polar("y",start=0) +
  theme_void()

The visualizations above determine that the classic and electric bikes are the most favorite for both types of riders: casual and member. Almost the 50 and 50 percent are distributed among those two ride types. The docked bike is not common used with 0.28% of rides.

Average trip duration by rider type

member_casuals %>%
  ggplot(aes(x="",y=avg_ride_length,fill=member_casual)) +
  geom_col(color="black") +
  ggtitle("Avg trip duration for each rider type") +
  theme(plot.title = element_text(hjust=0.5)) +
  labs(fill="Type of rider") +
  geom_text(aes(label=str_c(round(avg_ride_length,digit=2),"mins",sep=" ")),position = position_stack(vjust=0.5)) +
  coord_polar("y",start=0) +
  theme_void()

## Don't know how to automatically pick scale for object of type <difftime>.
## Defaulting to continuous.

From the visualization above we can determine that the average of the trip duration from the casual riders is more than twice than the member riders.

Total amount of rides

member_casuals %>%
  ggplot(aes(x="",y=no_of_rides,fill=member_casual)) +
  geom_col(color="black") +
  ggtitle("Total amount of rides for each rider type") +
  theme(plot.title = element_text(hjust=0.5)) +
  labs(fill="Type of rider") +
  geom_text(aes(label = str_c(round((no_of_rides/sum(no_of_rides))*100,digits=2),"%",sep="")),
position=position_stack(vjust=0.5)) +
  coord_polar("y",start=0) +
  theme_void()

We can determine that the total % amount of rides from the member riders is almost twice (with 64.33%) than the casual riders (with 35.67%).

Summary of findings

From the data collected (August 2023 to July 2024) of Cyclist collection, following conclusions have been drawn:

Casual rides prefer longer trips than member riders, but member riders have more number of rides for days, months and season.
The average of trip duration of casual riders is twice more than member riders.
Total percentage amount of rides of member riders is almost twice than casual riders.
Almost 50 and 50 percent of classic and electric bikes are distributed between casual and member riders. And 0.28% of docked bike is used by both the riders.
The busiest hour of the day is 17:00 for both type of riders.
Casual riders use cyclist services during the weekend while annual members use on weekdays.
August is preferable month for member rides and July for casual riders.
Both annual members and casual riders prefer Summer season with usage of 35% of bikes during year.

Recommendations

Offers montly and quaterly memberships. This flexibility in prices helps in Cyclistic’s profit
Offer discounted membership for new members in first 3 months and discounted membership renewals for after every year.
Make campaigns or promotion in Summer season where most of the riders use Cyclist’s services.

Resources

Stackoverflow
RStudio Posit Community
RDocumentation

Google Data Analytics Captsone Project

Lahari Sanku

2024-08-20