As a part of the Google Capstone Project, I played the role of a junior data analyst. I used Ask, Prepare, Process, Analyze, Share and Act data analysis process.
Cyclistic is a bike-share company with more than 5800 bicycles and 600 docking stations. The company offers:
Cyclistic bike riders include:
Since annual members are more than casual riders, Lily Moreno, the company’s marketing director, instead of creating a marketing campaign that targets all-new customers, she believes there is a solid opportunity to convert casual riders into annual members.
The team needs to better understand how annual members and casual riders differ, why casual riders would buy a membership, and how digital media could affect their marketing tactics.
Based on the understanding the primary stakeholders are:
My business task is to analyze and compare the different riding patterns of casual riders and annual members in last 12 months.
Collected different data sets from divvybikes provided by Google Data Analysis Certification program resources. I have downloaded data from August 2023 to August 2024.
It is secure because no one else is manipulating the data from the amazon bucket, and the accessibility for any type of person to feel free to downloaded, analyze it, copy, modify and distribute the product.
I used R to investigate the data sets and Excel for some minimum error checks. ### Installing packages
Importing the tidyverse, readr,
janitor, DescTools, lubridate
libraries
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(readr)
library(janitor)
##
## Attaching package: 'janitor'
##
## The following objects are masked from 'package:stats':
##
## chisq.test, fisher.test
library(DescTools)
library(lubridate)
library(ggplot2)
library(dplyr)
Load divvybikes datasets for past 12 months
divvy_bikes <- list.files(path = setwd("C:/Users/91944/Documents/Google-Data-Analytics-Capstone-Project/datasets"),
pattern = "*.csv", full.names = TRUE) %>%
lapply(read_csv) %>%
bind_rows
glimpse(divvy_bikes)
## Rows: 5,715,693
## Columns: 13
## $ ride_id <chr> "903C30C2D810A53B", "F2FB18A98E110A2B", "D0DEC7C94E…
## $ rideable_type <chr> "electric_bike", "electric_bike", "electric_bike", …
## $ started_at <dttm> 2023-08-19 15:41:53, 2023-08-18 15:30:18, 2023-08-…
## $ ended_at <dttm> 2023-08-19 15:53:36, 2023-08-18 15:45:25, 2023-08-…
## $ start_station_name <chr> "LaSalle St & Illinois St", "Clark St & Randolph St…
## $ start_station_id <chr> "13430", "TA1305000030", "TA1305000030", "KA1504000…
## $ end_station_name <chr> "Clark St & Elm St", NA, NA, NA, NA, NA, NA, NA, NA…
## $ end_station_id <chr> "TA1307000039", NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ start_lat <dbl> 41.89072, 41.88451, 41.88498, 41.90310, 41.88555, 4…
## $ start_lng <dbl> -87.63148, -87.63155, -87.63079, -87.63467, -87.632…
## $ end_lat <dbl> 41.90297, 41.93000, 41.91000, 41.90000, 41.89000, 4…
## $ end_lng <dbl> -87.63128, -87.64000, -87.63000, -87.62000, -87.680…
## $ member_casual <chr> "member", "member", "member", "member", "member", "…
After inspecting the dataset, we found many columns has NA Values Finding the number of NA values in each column
na_values <- divvy_bikes %>%
summarise_all(~ sum(is.na(.)))
glimpse(na_values)
## Rows: 1
## Columns: 13
## $ ride_id <int> 0
## $ rideable_type <int> 0
## $ started_at <int> 0
## $ ended_at <int> 0
## $ start_station_name <int> 947025
## $ start_station_id <int> 947025
## $ end_station_name <int> 989476
## $ end_station_id <int> 989476
## $ start_lat <int> 0
## $ start_lng <int> 0
## $ end_lat <int> 7756
## $ end_lng <int> 7756
## $ member_casual <int> 0
To check the column names of the dataset are correct or not we use funcion from “Janitor” library
clean_names(divvy_bikes)
## # A tibble: 5,715,693 × 13
## ride_id rideable_type started_at ended_at
## <chr> <chr> <dttm> <dttm>
## 1 903C30C2D810A53B electric_bike 2023-08-19 15:41:53 2023-08-19 15:53:36
## 2 F2FB18A98E110A2B electric_bike 2023-08-18 15:30:18 2023-08-18 15:45:25
## 3 D0DEC7C94E4663DA electric_bike 2023-08-30 16:15:08 2023-08-30 16:27:37
## 4 E0DDDC5F84747ED9 electric_bike 2023-08-30 16:24:07 2023-08-30 16:33:34
## 5 7797A4874BA260CA electric_bike 2023-08-22 15:59:44 2023-08-22 16:20:38
## 6 DF4DE734EBC4DF66 electric_bike 2023-08-24 12:27:24 2023-08-24 12:54:59
## 7 EE60FB066E69AFAC electric_bike 2023-08-31 20:42:14 2023-08-31 20:54:38
## 8 A115DA6AA13DE5EF electric_bike 2023-08-17 15:15:51 2023-08-17 15:22:27
## 9 86DBB19374245893 electric_bike 2023-08-24 21:37:19 2023-08-24 21:47:22
## 10 2905CBC8B8EE392C electric_bike 2023-08-28 14:53:38 2023-08-28 14:59:35
## # ℹ 5,715,683 more rows
## # ℹ 9 more variables: start_station_name <chr>, start_station_id <chr>,
## # end_station_name <chr>, end_station_id <chr>, start_lat <dbl>,
## # start_lng <dbl>, end_lat <dbl>, end_lng <dbl>, member_casual <chr>
Checking the whether each column has the correct data type
sapply(divvy_bikes,typeof)
## ride_id rideable_type started_at ended_at
## "character" "character" "double" "double"
## start_station_name start_station_id end_station_name end_station_id
## "character" "character" "character" "character"
## start_lat start_lng end_lat end_lng
## "double" "double" "double" "double"
## member_casual
## "character"
We need to check if the dataset contains duplicate values or not by using ride_id which is primary key
duplicate_values <- divvy_bikes %>%
get_dupes(ride_id)
duplicate_values
## # A tibble: 422 × 14
## ride_id dupe_count rideable_type started_at ended_at
## <chr> <int> <chr> <dttm> <dttm>
## 1 011C8EF97AB… 2 classic_bike 2024-05-31 19:45:38 2024-06-01 20:45:33
## 2 011C8EF97AB… 2 classic_bike 2024-05-31 19:45:38 2024-06-01 20:45:33
## 3 01406457A85… 2 electric_bike 2024-05-31 23:54:59 2024-06-01 00:01:47
## 4 01406457A85… 2 electric_bike 2024-05-31 23:54:59 2024-06-01 00:01:47
## 5 02606FBC7F8… 2 classic_bike 2024-05-31 17:55:01 2024-06-01 18:54:53
## 6 02606FBC7F8… 2 classic_bike 2024-05-31 17:55:01 2024-06-01 18:54:53
## 7 0354FD07563… 2 electric_bike 2024-05-31 23:34:36 2024-06-01 00:14:29
## 8 0354FD07563… 2 electric_bike 2024-05-31 23:34:36 2024-06-01 00:14:29
## 9 048C715F1DE… 2 electric_bike 2024-05-31 23:53:44 2024-06-01 00:12:26
## 10 048C715F1DE… 2 electric_bike 2024-05-31 23:53:44 2024-06-01 00:12:26
## # ℹ 412 more rows
## # ℹ 9 more variables: start_station_name <chr>, start_station_id <chr>,
## # end_station_name <chr>, end_station_id <chr>, start_lat <dbl>,
## # start_lng <dbl>, end_lat <dbl>, end_lng <dbl>, member_casual <chr>
The result shows 422 values which is combination of original and duplicated values. So the dataset contain 211 duplicated values
Now, we need to extract the distinct values from each column which means 211 duplicate rows will be removed from dataset.
To remove duplicates we use “distinct” function using ride_id field from dataset.
divvy_bikes_cleaned <-divvy_bikes %>%
distinct(ride_id,.keep_all = TRUE)
glimpse(divvy_bikes_cleaned)
## Rows: 5,715,482
## Columns: 13
## $ ride_id <chr> "903C30C2D810A53B", "F2FB18A98E110A2B", "D0DEC7C94E…
## $ rideable_type <chr> "electric_bike", "electric_bike", "electric_bike", …
## $ started_at <dttm> 2023-08-19 15:41:53, 2023-08-18 15:30:18, 2023-08-…
## $ ended_at <dttm> 2023-08-19 15:53:36, 2023-08-18 15:45:25, 2023-08-…
## $ start_station_name <chr> "LaSalle St & Illinois St", "Clark St & Randolph St…
## $ start_station_id <chr> "13430", "TA1305000030", "TA1305000030", "KA1504000…
## $ end_station_name <chr> "Clark St & Elm St", NA, NA, NA, NA, NA, NA, NA, NA…
## $ end_station_id <chr> "TA1307000039", NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ start_lat <dbl> 41.89072, 41.88451, 41.88498, 41.90310, 41.88555, 4…
## $ start_lng <dbl> -87.63148, -87.63155, -87.63079, -87.63467, -87.632…
## $ end_lat <dbl> 41.90297, 41.93000, 41.91000, 41.90000, 41.89000, 4…
## $ end_lng <dbl> -87.63128, -87.64000, -87.63000, -87.62000, -87.680…
## $ member_casual <chr> "member", "member", "member", "member", "member", "…
Now all the duplicate rows are removed.
Total number of records in new data set is 5,715,482.
5715693 (original dataset) - 211 (duplicate records) = 5715482 (cleaned dataset)
For calculating the ride length of each ride, we need to add new column ride_length which is difference between the started_at and ended_at in minutes.
divvy_bikes_processed <- divvy_bikes_cleaned %>%
mutate(ride_length = round(difftime(ended_at,started_at,units = "mins"),digits = 2))
we need to check if the ride_length has any negatives values or not which means ride_length <0
negative_ride_length <- divvy_bikes_processed %>%
filter(ride_length <0)
negative_ride_length
## # A tibble: 404 × 14
## ride_id rideable_type started_at ended_at
## <chr> <chr> <dttm> <dttm>
## 1 BEA2E3C59F6C9064 electric_bike 2023-08-16 14:42:56 2023-08-16 14:42:55
## 2 68203D8EB3D3F2AC electric_bike 2023-08-27 18:37:38 2023-08-27 18:37:35
## 3 62F4F440D6560E70 electric_bike 2023-08-21 15:34:39 2023-08-21 15:31:56
## 4 2BFB23CDC9A75AB0 electric_bike 2023-08-26 10:19:36 2023-08-26 10:16:52
## 5 8FD41344538C6DAD electric_bike 2023-08-04 21:28:08 2023-08-04 21:28:06
## 6 88C6795EB8654AF5 electric_bike 2023-08-19 03:01:47 2023-08-19 03:00:12
## 7 197D33FAF6396DDC electric_bike 2023-08-18 16:01:29 2023-08-18 16:01:28
## 8 7FC22D5B6730A327 electric_bike 2023-08-03 08:16:04 2023-08-03 08:13:34
## 9 82B071A054A869E0 electric_bike 2023-08-19 16:19:27 2023-08-19 16:19:25
## 10 41884873E3E524D3 electric_bike 2023-08-03 17:19:37 2023-08-03 17:19:14
## # ℹ 394 more rows
## # ℹ 10 more variables: start_station_name <chr>, start_station_id <chr>,
## # end_station_name <chr>, end_station_id <chr>, start_lat <dbl>,
## # start_lng <dbl>, end_lat <dbl>, end_lng <dbl>, member_casual <chr>,
## # ride_length <drtn>
Clearly we can see that 404 records have negative values which means that end time is before than start time.
We need to correct it in 2 approaches:
I followed second approach becuase if we remove the records we may loose some data.
divvy_bikes_processed <- divvy_bikes_processed %>%
mutate(new_started_at = ifelse(ride_length<0,as_datetime(ended_at),as_datetime(started_at)),
new_ended_at = ifelse(ride_length<0,as_datetime(started_at),as_datetime(ended_at)),
started_at=as_datetime(new_started_at), ended_at=as_datetime(new_ended_at)) %>%
select(-new_started_at,-new_ended_at)
We changed the values of start time and end time by using ifelse condition and replaced start and end time columns with correct values and dropped newly created columns by using select function.
Now I calculated and checked if the ride_length column has any negative values or not.
divvy_bikes_processed <- divvy_bikes_processed %>%
mutate(ride_length = round(difftime(ended_at,started_at,units = "mins"),digits = 2))
negative_ride_length <- divvy_bikes_processed %>%
filter(ride_length<0)
negative_ride_length
## # A tibble: 0 × 14
## # ℹ 14 variables: ride_id <chr>, rideable_type <chr>, started_at <dttm>,
## # ended_at <dttm>, start_station_name <chr>, start_station_id <chr>,
## # end_station_name <chr>, end_station_id <chr>, start_lat <dbl>,
## # start_lng <dbl>, end_lat <dbl>, end_lng <dbl>, member_casual <chr>,
## # ride_length <drtn>
Now, the dataset does not contain any negative values.
We need to perform some calculations for analysis.
Creating a new column day_of_week to calculate which day the ride is starting.
divvy_bikes_processed <- divvy_bikes_processed %>%
mutate(day_of_week = weekdays(as.Date(started_at)))
head(divvy_bikes_processed)
## # A tibble: 6 × 15
## ride_id rideable_type started_at ended_at
## <chr> <chr> <dttm> <dttm>
## 1 903C30C2D810A53B electric_bike 2023-08-19 15:41:53 2023-08-19 15:53:36
## 2 F2FB18A98E110A2B electric_bike 2023-08-18 15:30:18 2023-08-18 15:45:25
## 3 D0DEC7C94E4663DA electric_bike 2023-08-30 16:15:08 2023-08-30 16:27:37
## 4 E0DDDC5F84747ED9 electric_bike 2023-08-30 16:24:07 2023-08-30 16:33:34
## 5 7797A4874BA260CA electric_bike 2023-08-22 15:59:44 2023-08-22 16:20:38
## 6 DF4DE734EBC4DF66 electric_bike 2023-08-24 12:27:24 2023-08-24 12:54:59
## # ℹ 11 more variables: start_station_name <chr>, start_station_id <chr>,
## # end_station_name <chr>, end_station_id <chr>, start_lat <dbl>,
## # start_lng <dbl>, end_lat <dbl>, end_lng <dbl>, member_casual <chr>,
## # ride_length <drtn>, day_of_week <chr>
To analyze the data we need to have mean, max, and mode of dataset by grouping member_casual column to find ride length and day of the week for two memberships.
member_casuals <- divvy_bikes_processed %>%
group_by(member_casual) %>%
summarise(avg_ride_length=mean(ride_length),max_ride_length = max(ride_length), mode_day_of_week = Mode(day_of_week), no_of_rides=n_distinct(ride_id))
member_casuals
## # A tibble: 2 × 5
## member_casual avg_ride_length max_ride_length mode_day_of_week no_of_rides
## <chr> <drtn> <drtn> <chr> <int>
## 1 casual 27.15739 mins 98489.07 mins Saturday 2038442
## 2 member 13.00704 mins 16635.22 mins Wednesday 3677040
From the analysis we can know that casual members use cycles on Saturday and membership users on Wednesday.
Now we calculate average ride length and total no of rides grouping by type of member and day of the week.
divvy_bikes_week <- divvy_bikes_processed %>%
group_by(member_casual,day_of_week) %>%
summarise(avg_ride_length=mean(ride_length),no_of_rides=n_distinct(ride_id))
## `summarise()` has grouped output by 'member_casual'. You can override using the
## `.groups` argument.
divvy_bikes_week
## # A tibble: 14 × 4
## # Groups: member_casual [2]
## member_casual day_of_week avg_ride_length no_of_rides
## <chr> <chr> <drtn> <int>
## 1 casual Friday 26.77450 mins 299983
## 2 casual Monday 26.26931 mins 225902
## 3 casual Saturday 30.36386 mins 415227
## 4 casual Sunday 31.92722 mins 340723
## 5 casual Thursday 24.10050 mins 255637
## 6 casual Tuesday 23.44512 mins 234820
## 7 casual Wednesday 23.44535 mins 266150
## 8 member Friday 12.75488 mins 517553
## 9 member Monday 12.37517 mins 503914
## 10 member Saturday 14.32431 mins 471116
## 11 member Sunday 14.51341 mins 408653
## 12 member Thursday 12.46100 mins 576543
## 13 member Tuesday 12.48452 mins 583225
## 14 member Wednesday 12.73482 mins 616036
Finding maximum no of rides for both members and casual riders on week
max_no_of_rides <- divvy_bikes_week %>%
group_by(member_casual)%>%
slice_max(no_of_rides)
max_no_of_rides
## # A tibble: 2 × 4
## # Groups: member_casual [2]
## member_casual day_of_week avg_ride_length no_of_rides
## <chr> <chr> <drtn> <int>
## 1 casual Saturday 30.36386 mins 415227
## 2 member Wednesday 12.73482 mins 616036
Most of the casual riders ride on Saturday with 41522 rides.
Most of the Member riders ride on Wednesday with 616033.
Percentage of rides for each day of the week
rides_percentage <- divvy_bikes_processed %>%
group_by(day_of_week)%>%
summarise(no_of_rides=n_distinct(ride_id)) %>%
mutate(percentage = (no_of_rides/sum(no_of_rides))*100)
rides_percentage
## # A tibble: 7 × 3
## day_of_week no_of_rides percentage
## <chr> <int> <dbl>
## 1 Friday 817536 14.3
## 2 Monday 729816 12.8
## 3 Saturday 886343 15.5
## 4 Sunday 749376 13.1
## 5 Thursday 832180 14.6
## 6 Tuesday 818045 14.3
## 7 Wednesday 882186 15.4
Saturday has higher percentage of rides during the year.So we need to analyze the behaviour of each type on Saturday.
saturday_rides <- divvy_bikes_week %>%
filter(day_of_week == "Saturday") %>%
group_by(member_casual,day_of_week) %>%
summarise(no_of_rides=sum(no_of_rides))
## `summarise()` has grouped output by 'member_casual'. You can override using the
## `.groups` argument.
saturday_rides <- mutate(saturday_rides,percentage=(no_of_rides/sum(saturday_rides$no_of_rides))*100)
saturday_rides
## # A tibble: 2 × 4
## # Groups: member_casual [2]
## member_casual day_of_week no_of_rides percentage
## <chr> <chr> <int> <dbl>
## 1 casual Saturday 415227 46.8
## 2 member Saturday 471116 53.2
The member riders represent 53.2% of the rides on busiest day(Saturday) and 46.8% of rides by casual riders.
To calculate we need to create a new column Season_of_year to identify the seasons of the ride taken.
divvy_bikes_processed <- divvy_bikes_processed %>%
mutate(season_of_year = ifelse(month(started_at)==12 | month(started_at)==1 | month(started_at)==2, "Winter",ifelse(month(started_at)==3 | month(started_at)==4 | month(started_at)==5, "Spring",ifelse(month(started_at)==6 | month(started_at)==7| month(started_at)==8, "Summer", "Autumn"))))
head(divvy_bikes_processed)
## # A tibble: 6 × 16
## ride_id rideable_type started_at ended_at
## <chr> <chr> <dttm> <dttm>
## 1 903C30C2D810A53B electric_bike 2023-08-19 15:41:53 2023-08-19 15:53:36
## 2 F2FB18A98E110A2B electric_bike 2023-08-18 15:30:18 2023-08-18 15:45:25
## 3 D0DEC7C94E4663DA electric_bike 2023-08-30 16:15:08 2023-08-30 16:27:37
## 4 E0DDDC5F84747ED9 electric_bike 2023-08-30 16:24:07 2023-08-30 16:33:34
## 5 7797A4874BA260CA electric_bike 2023-08-22 15:59:44 2023-08-22 16:20:38
## 6 DF4DE734EBC4DF66 electric_bike 2023-08-24 12:27:24 2023-08-24 12:54:59
## # ℹ 12 more variables: start_station_name <chr>, start_station_id <chr>,
## # end_station_name <chr>, end_station_id <chr>, start_lat <dbl>,
## # start_lng <dbl>, end_lat <dbl>, end_lng <dbl>, member_casual <chr>,
## # ride_length <drtn>, day_of_week <chr>, season_of_year <chr>
Now we calculate average ride length and total number of rides grouping by type of member and season of year and percentage of rides on each season
divvy_bikes_season <- divvy_bikes_processed %>%
group_by(member_casual,season_of_year) %>%
summarise(avg_ride_length=mean(ride_length),no_of_rides=n_distinct(ride_id)) %>%
mutate(percentage=(no_of_rides/sum(no_of_rides))*100)
## `summarise()` has grouped output by 'member_casual'. You can override using the
## `.groups` argument.
divvy_bikes_season
## # A tibble: 8 × 5
## # Groups: member_casual [2]
## member_casual season_of_year avg_ride_length no_of_rides percentage
## <chr> <chr> <drtn> <int> <dbl>
## 1 casual Autumn 23.53489 mins 537098 26.3
## 2 casual Spring 26.71057 mins 445378 21.8
## 3 casual Summer 30.10985 mins 932671 45.8
## 4 casual Winter 22.21774 mins 123295 6.05
## 5 member Autumn 12.43640 mins 1028904 28.0
## 6 member Spring 12.75438 mins 880827 24.0
## 7 member Summer 13.77731 mins 1298494 35.3
## 8 member Winter 12.60064 mins 468815 12.7
Finding maximum number of rides for both members and casual riders on season
max_no_of_rides_season<- divvy_bikes_season %>%
group_by(member_casual) %>%
slice_max(no_of_rides)
max_no_of_rides_season
## # A tibble: 2 × 5
## # Groups: member_casual [2]
## member_casual season_of_year avg_ride_length no_of_rides percentage
## <chr> <chr> <drtn> <int> <dbl>
## 1 casual Summer 30.10985 mins 932671 45.8
## 2 member Summer 13.77731 mins 1298494 35.3
Season with most number of casual riders is Summer with 93267 rides.
Season with most number of member riders is Summer with 1298494 rides.
Visualizing percentage of rides on each season of the year
season_rides_percentage <- divvy_bikes_processed %>%
group_by(season_of_year) %>%
summarise(no_of_rides=n_distinct(ride_id)) %>%
mutate(percentage=(no_of_rides/sum(no_of_rides))*100)
season_rides_percentage
## # A tibble: 4 × 3
## season_of_year no_of_rides percentage
## <chr> <int> <dbl>
## 1 Autumn 1566002 27.4
## 2 Spring 1326205 23.2
## 3 Summer 2231165 39.0
## 4 Winter 592110 10.4
Summer has higher percentage of rides during the year.So we need to analyze the behaviour of each type on Summer.
summer_rides <- divvy_bikes_season %>%
filter(season_of_year=="Summer") %>%
group_by(member_casual,season_of_year) %>%
summarise(no_of_rides=sum(no_of_rides))
## `summarise()` has grouped output by 'member_casual'. You can override using the
## `.groups` argument.
summer_rides <- mutate(summer_rides,percentage=(no_of_rides/sum(summer_rides$no_of_rides))*100)
summer_rides
## # A tibble: 2 × 4
## # Groups: member_casual [2]
## member_casual season_of_year no_of_rides percentage
## <chr> <chr> <int> <dbl>
## 1 casual Summer 932671 41.8
## 2 member Summer 1298494 58.2
The member riders represent 58.2% of the rides on busiest season (Summer) and 41.8% of rides by casual riders.
First we create a new column month_of_year to identify the month of the year
divvy_bikes_processed <- divvy_bikes_processed %>%
mutate(month_of_year = month(started_at))
head(divvy_bikes_processed)
## # A tibble: 6 × 17
## ride_id rideable_type started_at ended_at
## <chr> <chr> <dttm> <dttm>
## 1 903C30C2D810A53B electric_bike 2023-08-19 15:41:53 2023-08-19 15:53:36
## 2 F2FB18A98E110A2B electric_bike 2023-08-18 15:30:18 2023-08-18 15:45:25
## 3 D0DEC7C94E4663DA electric_bike 2023-08-30 16:15:08 2023-08-30 16:27:37
## 4 E0DDDC5F84747ED9 electric_bike 2023-08-30 16:24:07 2023-08-30 16:33:34
## 5 7797A4874BA260CA electric_bike 2023-08-22 15:59:44 2023-08-22 16:20:38
## 6 DF4DE734EBC4DF66 electric_bike 2023-08-24 12:27:24 2023-08-24 12:54:59
## # ℹ 13 more variables: start_station_name <chr>, start_station_id <chr>,
## # end_station_name <chr>, end_station_id <chr>, start_lat <dbl>,
## # start_lng <dbl>, end_lat <dbl>, end_lng <dbl>, member_casual <chr>,
## # ride_length <drtn>, day_of_week <chr>, season_of_year <chr>,
## # month_of_year <dbl>
Calculate the average ride length, total number of rides and percentage of rides by grouping type of riders and month of year.
divvy_bikes_month <- divvy_bikes_processed %>%
group_by(member_casual,month_of_year) %>%
summarise(avg_ride_length=mean(ride_length),no_of_rides=n_distinct(ride_id)) %>%
mutate(percentage=(no_of_rides/sum(no_of_rides))*100)
## `summarise()` has grouped output by 'member_casual'. You can override using the
## `.groups` argument.
divvy_bikes_month
## # A tibble: 24 × 5
## # Groups: member_casual [2]
## member_casual month_of_year avg_ride_length no_of_rides percentage
## <chr> <dbl> <drtn> <int> <dbl>
## 1 casual 1 21.30664 mins 24460 1.20
## 2 casual 2 25.18987 mins 47163 2.31
## 3 casual 3 24.95505 mins 82550 4.05
## 4 casual 4 26.01160 mins 131810 6.47
## 5 casual 5 27.73667 mins 231018 11.3
## 6 casual 6 27.70557 mins 301169 14.8
## 7 casual 7 27.38433 mins 320372 15.7
## 8 casual 8 35.24365 mins 311130 15.3
## 9 casual 9 25.18416 mins 261635 12.8
## 10 casual 10 23.11470 mins 177074 8.69
## # ℹ 14 more rows
Calculating maximum of rides for both members and casual riders for each month of the year
max_no_of_rides_month <- divvy_bikes_month %>%
group_by(member_casual) %>%
slice_max(no_of_rides)
max_no_of_rides_month
## # A tibble: 2 × 5
## # Groups: member_casual [2]
## member_casual month_of_year avg_ride_length no_of_rides percentage
## <chr> <dbl> <drtn> <int> <dbl>
## 1 casual 7 27.38433 mins 320372 15.7
## 2 member 8 13.77204 mins 460563 12.5
Month with most number of casual riders is July with 320372 rides.
Month with most number of member riders is August with 460563 rides.
Calculating percentage of rides on each month
monthly_rides_percentage <- divvy_bikes_processed %>%
group_by(month_of_year) %>%
summarise(no_of_rides=n_distinct(ride_id)) %>%
mutate(percentage = (no_of_rides/sum(no_of_rides))*100)
monthly_rides_percentage
## # A tibble: 12 × 3
## month_of_year no_of_rides percentage
## <dbl> <int> <dbl>
## 1 1 144873 2.53
## 2 2 223164 3.90
## 3 3 301687 5.28
## 4 4 415025 7.26
## 5 5 609493 10.7
## 6 6 710747 12.4
## 7 7 748725 13.1
## 8 8 771693 13.5
## 9 9 666371 11.7
## 10 10 537119 9.40
## 11 11 362512 6.34
## 12 12 224073 3.92
August has higher percentage of rides during the year.So we need to analyze the behaviour of each type on members.
august_rides <- divvy_bikes_month %>%
filter(month_of_year==8) %>%
group_by(member_casual,month_of_year) %>%
summarise(no_of_rides=sum(no_of_rides))
## `summarise()` has grouped output by 'member_casual'. You can override using the
## `.groups` argument.
august_rides <- mutate(august_rides,percentage =(no_of_rides/sum(august_rides$no_of_rides))*100)
august_rides
## # A tibble: 2 × 4
## # Groups: member_casual [2]
## member_casual month_of_year no_of_rides percentage
## <chr> <dbl> <int> <dbl>
## 1 casual 8 311130 40.3
## 2 member 8 460563 59.7
The member riders represent 59.7% of the rides on busiest month(August) and 40.3% of rides by casual riders.
divvy_bikes_hour <- divvy_bikes_processed %>%
mutate(hour_started_at = hour(started_at))
divvy_bikes_hour <- divvy_bikes_hour %>%
group_by(member_casual,hour_started_at) %>%
summarize(no_of_rides=n_distinct(ride_id))
## `summarise()` has grouped output by 'member_casual'. You can override using the
## `.groups` argument.
divvy_bikes_hour
## # A tibble: 48 × 3
## # Groups: member_casual [2]
## member_casual hour_started_at no_of_rides
## <chr> <int> <int>
## 1 casual 0 34361
## 2 casual 1 22890
## 3 casual 2 14232
## 4 casual 3 8007
## 5 casual 4 6043
## 6 casual 5 11385
## 7 casual 6 27136
## 8 casual 7 50196
## 9 casual 8 69513
## 10 casual 9 70244
## # ℹ 38 more rows
Next, we will analyse the ride types that riders prefer to use.
ride_types <- divvy_bikes_processed %>%
group_by(member_casual,rideable_type) %>%
summarise(no_of_rides = n_distinct(ride_id))
## `summarise()` has grouped output by 'member_casual'. You can override using the
## `.groups` argument.
ride_types
## # A tibble: 5 × 3
## # Groups: member_casual [2]
## member_casual rideable_type no_of_rides
## <chr> <chr> <int>
## 1 casual classic_bike 981459
## 2 casual docked_bike 15957
## 3 casual electric_bike 1041026
## 4 member classic_bike 1880675
## 5 member electric_bike 1796365
From the analysis we can say that member riders prefer classic bikes than electric and docked bikes where as casual riders prefer electric bikes.
Now we are calcuating the highest no of rides according to day, month and season
divvy_bikes_consolidated <- divvy_bikes_processed %>%
group_by(member_casual,day_of_week,month_of_year,season_of_year) %>%
summarise(avg_ride_length = mean(ride_length), no_of_rides=n_distinct(ride_id))
## `summarise()` has grouped output by 'member_casual', 'day_of_week',
## 'month_of_year'. You can override using the `.groups` argument.
divvy_bikes_consolidated
## # A tibble: 168 × 6
## # Groups: member_casual, day_of_week, month_of_year [168]
## member_casual day_of_week month_of_year season_of_year avg_ride_length
## <chr> <chr> <dbl> <chr> <drtn>
## 1 casual Friday 1 Winter 18.39389 mins
## 2 casual Friday 2 Winter 24.85875 mins
## 3 casual Friday 3 Spring 22.73627 mins
## 4 casual Friday 4 Spring 22.24533 mins
## 5 casual Friday 5 Spring 26.87207 mins
## 6 casual Friday 6 Summer 28.07744 mins
## 7 casual Friday 7 Summer 27.84953 mins
## 8 casual Friday 8 Summer 36.09690 mins
## 9 casual Friday 9 Autumn 24.11483 mins
## 10 casual Friday 10 Autumn 20.73668 mins
## # ℹ 158 more rows
## # ℹ 1 more variable: no_of_rides <int>
Also, we can visualize each type of member that had the most number of rides according to the day, month and season.
max_no_of_rides_consolidated <- divvy_bikes_consolidated %>%
group_by(member_casual) %>%
slice_max(no_of_rides)
max_no_of_rides_consolidated
## # A tibble: 2 × 6
## # Groups: member_casual [2]
## member_casual day_of_week month_of_year season_of_year avg_ride_length
## <chr> <chr> <dbl> <chr> <drtn>
## 1 casual Saturday 9 Autumn 28.16613 mins
## 2 member Tuesday 8 Summer 13.15050 mins
## # ℹ 1 more variable: no_of_rides <int>
The most rides for the casual riders occur in Autumn, on September, on a Saturday with 68408 rides.
The most rides for the member riders occur in Summer, on August, on a Tuesday with 82787 rides.
We need to present the visualization of the analysis we made so far.
divvy_bikes_week %>%
mutate(day_of_week = factor(day_of_week,levels=c('Sunday','Monday','Tuesday','Wednesday','Thursday','Friday','Saturday'))) %>%
ggplot(aes(x=day_of_week,y=no_of_rides, fill=member_casual)) +
geom_col(width=0.7,position = position_dodge(width=0.5)) +
ggtitle("Total trips by type of riders vs day of the week (Saturday)") +
theme(plot.title = element_text(hjust=0.5)) +
scale_y_continuous(labels=function(x) format(x,scientific=FALSE)) +
labs(fill= "Type of rider")
From the visualization we can say Wednesday is the day with more number of member riders and Saturday is the day with more number of casual riders.
So we understood than casual drivers make use of bikes during weekends, while member riders are busy during the week.
The busiest day of week for all riders is Saturday based on our analysis.
saturday_rides %>%
ggplot(aes(x=day_of_week,y=no_of_rides,fill=member_casual)) +
geom_col(width=0.5,position="dodge") +
ggtitle("Total trips by type of riders vs Busiest day of week") +
theme(plot.title = element_text(hjust = 0.5)) +
scale_y_continuous(labels=function(x) format(x,scientific=FALSE)) +
labs(fill="Type of riders") +
geom_text(aes(label=no_of_rides),position=position_dodge(width = 0.5),size=4,vjust=-0.3)
divvy_bikes_month %>%
mutate(month_of_year = month.abb[month_of_year]) %>%
mutate(month_of_year = factor(month_of_year,levels=c('Aug','Sep','Oct','Nov','Dec','Jan','Feb','Mar','Apr','May','Jun','Jul'))) %>%
ggplot(aes(x=month_of_year,y=no_of_rides,fill=member_casual)) +
geom_col(width=0.5,position=position_dodge(width=0.5)) +
ggtitle("Total trips by type of riders vs Month of the year (Aug 23 to July 24)") +
theme(plot.title = element_text(hjust=0.5)) +
scale_y_continuous(labels = function(x) format(x,scientific=FALSE)) +
labs(fill="Type of riders")
From visualization, August is the month with most member riders and July is the month with most casual member riders.
divvy_bikes_season %>%
ggplot(aes(x=season_of_year, y=no_of_rides,fill=member_casual)) +
geom_col(width=0.5,position=position_dodge(width = 0.5)) +
ggtitle("Total trips by type of riders vs Season of the year") +
theme(plot.title = element_text(hjust = 0.5)) +
scale_y_continuous(labels=function(x) format(x,scientific=FALSE)) +
labs(fill= "Type of riders")
We can see that Summer is the busiest season for both type of riders and Winter is the lowest season.
As Summer is most busiest season, we need to analyze in more detail:
summer_rides %>%
ggplot(aes(x=season_of_year,y=no_of_rides,fill=member_casual)) +
geom_col(position="dodge",width=0.5) +
ggtitle("Total trip by type of riders vs Busiest season(Summer)") +
theme(plot.title = element_text(hjust=0.5)) +
scale_y_continuous(labels = function(x) format(x,scientific=FALSE)) +
labs(fill="Type of rider") +
geom_text(aes(label=str_c(round(percentage,digits=2),"%",sep="")),position = position_dodge(width=0.5),size=4,vjust=-0.3)
The member riders represent 58.2% of the rides that are taken in the busiest season (Summer), while casual members represent the 41.8% of the rides.
divvy_bikes_week %>%
mutate(day_of_week= factor(day_of_week,levels=c('Sunday','Monday','Tuesday','Wednesday','Thursday','Friday','Saturday'))) %>%
ggplot(aes(x=day_of_week,y=avg_ride_length,fill=member_casual)) +
geom_col(width=0.5,position=position_dodge(width = 0.5)) +
ggtitle("Avg trip duration by type of riders vs Day of week") +
theme(plot.title = element_text(hjust=0.5)) +
scale_y_continuous(labels=function(x) format(x,scientific=FALSE)) +
labs(fill="Type of rider")
Highest average for trip duration for casual and member riders is on Sunday, followed by Saturday.
This means that the longest rides are happening on the weekends.
divvy_bikes_month %>%
mutate(month_of_year = month.abb[month_of_year]) %>%
mutate(month_of_year = factor(month_of_year,levels=c('Aug','Sep','Oct','Nov','Dec','Jan','Feb','Mar','Apr','May','Jun','Jul'))) %>%
ggplot(aes(x=month_of_year,y=avg_ride_length,fill=member_casual)) +
geom_col(width=0.5,position=position_dodge(width = 0.5)) +
ggtitle("Avg trip duration by type of rider vs Month of year (Aug 23 to Jul 24)") +
theme(plot.title = element_text(hjust = 0.5)) +
scale_y_continuous(labels = function(x) format(x,scientific=FALSE)) +
labs(fill="Type of riders")
The visualization above determine that the month of August is the highest average of trip duration for the casual riders.June is the highest average of trip duration for member riders.
divvy_bikes_season %>%
ggplot(aes(x=season_of_year,y=avg_ride_length,fill=member_casual)) +
geom_col(width=0.5,position=position_dodge(width=0.5)) +
ggtitle("Avg trip duration by type of riders vs Season of the year") +
theme(plot.title=element_text(hjust=0.5)) +
scale_y_continuous(labels = function(x) format(x,scientific=FALSE)) +
labs(fill="Type of riders")
Summer season is the one who has the highest average of trip duration for both casual and member riders.
divvy_bikes_hour %>%
ggplot(aes(x=hour_started_at,y=no_of_rides,color=member_casual,group = member_casual))+
geom_line(size=1)+
geom_point(size=2) +
theme(plot.title = element_text(hjust=0.5)) +
labs(title="Use of bikes for 24 hours",x="Time of day") +
labs(color="Type of riders") +
scale_x_continuous(breaks = scales::breaks_width(1)) +
scale_y_continuous(labels = function(x) format(x, scientific = FALSE))+
geom_hline(aes(yintercept = max(no_of_rides)))
## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
The busiest hour of the day for the use of bikes for members and casual riders is both at hour 17.
Next, we will visualize the ride types that riders prefer to use.
ride_types %>%
ggplot(aes(x=rideable_type,y=no_of_rides,fill=member_casual)) +
geom_col(width=0.5) +
ggtitle("Ride type vs No of rides") +
theme(plot.title=element_text(hjust=0.5)) +
scale_y_continuous(labels = function(x) format(x,scientific=FALSE)) +
labs(fill="Type of riders") +
geom_text(aes(label = no_of_rides),size=4,vjust=-0.8)
We are using pie chart to visualize the percentage of the ride types used.
ride_types %>%
group_by(rideable_type) %>%
summarise(no_of_rides=sum(no_of_rides)) %>%
ggplot(aes(x="",y=no_of_rides,fill=rideable_type)) +
geom_col(color="black") +
ggtitle("Amount of rides for each type") +
theme(plot.title = element_text(hjust=0.5)) +
labs(fill="Ride type") +
geom_text(aes(label=str_c(round((no_of_rides/sum(no_of_rides)*100),digits=2),"%",sep="")),position=position_stack(vjust=0.5)) +
coord_polar("y",start=0) +
theme_void()
The visualizations above determine that the classic and electric bikes are the most favorite for both types of riders: casual and member. Almost the 50 and 50 percent are distributed among those two ride types. The docked bike is not common used with 0.28% of rides.
member_casuals %>%
ggplot(aes(x="",y=avg_ride_length,fill=member_casual)) +
geom_col(color="black") +
ggtitle("Avg trip duration for each rider type") +
theme(plot.title = element_text(hjust=0.5)) +
labs(fill="Type of rider") +
geom_text(aes(label=str_c(round(avg_ride_length,digit=2),"mins",sep=" ")),position = position_stack(vjust=0.5)) +
coord_polar("y",start=0) +
theme_void()
## Don't know how to automatically pick scale for object of type <difftime>.
## Defaulting to continuous.
From the visualization above we can determine that the average of the trip duration from the casual riders is more than twice than the member riders.
member_casuals %>%
ggplot(aes(x="",y=no_of_rides,fill=member_casual)) +
geom_col(color="black") +
ggtitle("Total amount of rides for each rider type") +
theme(plot.title = element_text(hjust=0.5)) +
labs(fill="Type of rider") +
geom_text(aes(label = str_c(round((no_of_rides/sum(no_of_rides))*100,digits=2),"%",sep="")),
position=position_stack(vjust=0.5)) +
coord_polar("y",start=0) +
theme_void()
We can determine that the total % amount of rides from the member riders is almost twice (with 64.33%) than the casual riders (with 35.67%).
From the data collected (August 2023 to July 2024) of Cyclist collection, following conclusions have been drawn: