In 2016, Cyclistic launched a successful bike-share offering. Since then, the program has grown to a fleet of 5,824 bicycles that are geotracked and locked into a network of 692 stations across Chicago. The bikes can be unlocked from one station and returned to any other station in the system anytime.
Cyclistic’s marketing strategy relied on building general awareness and appealing to broad consumer segments. One approach that helped make these things possible was the flexibility of its pricing plans: single-ride passes, full-day passes, and annual memberships. Customers who purchase single-ride or full-day passes are referred to as casual riders. Customers who purchase annual memberships are Cyclistic members.
Stakeholders; Director of Marketing: Lily Moreno Cyclists Cyclist executives
Business Task; Design marketing strategies aimed at converting casual riders into annual members. In order to do that, however, the team needs to better understand how annual members and casual riders differ, why casual riders would buy a membership, and how digital media could affect their marketing tactics.
I will be using Cyclistic’s public historical trip data to analyze and identify trends. I will be working with the latest 12 months datasets for 2024 from Divvy. The data has been made available by Motivate International Inc. under this licence
Install these packages, if not yet install
install.packages(“skimr”) install.packages(“janitor”) install.packages(“hms”) install.packages(“lubridate”)
Set working directory
setwd("G:/My Drive/Learning/Google Data Analytics/Course 8/Case study/Dataset/Excel/CSV")
Import data sets into R studio
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(readxl)
Jan_24 <- read_csv("202401-divvy-tripdata.csv")
## Warning: One or more parsing issues, call `problems()` on your data frame for details,
## e.g.:
## dat <- vroom(...)
## problems(dat)
## Rows: 144873 Columns: 15
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (9): ride_id, rideable_type, started_at, ended_at, start_station_name, ...
## dbl (5): start_lat, start_lng, end_lat, end_lng, day_of_week
## time (1): ride_length
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Feb_24 <- read_csv("202402-divvy-tripdata.csv")
## Warning: One or more parsing issues, call `problems()` on your data frame for details,
## e.g.:
## dat <- vroom(...)
## problems(dat)
## Rows: 223164 Columns: 15
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (9): ride_id, rideable_type, started_at, ended_at, start_station_name, ...
## dbl (5): start_lat, start_lng, end_lat, end_lng, day_of_week
## time (1): ride_length
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Mar_24 <- read_csv("202403-divvy-tripdata.csv")
## Warning: One or more parsing issues, call `problems()` on your data frame for details,
## e.g.:
## dat <- vroom(...)
## problems(dat)
## Rows: 301687 Columns: 15
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (9): ride_id, rideable_type, started_at, ended_at, start_station_name, ...
## dbl (5): start_lat, start_lng, end_lat, end_lng, day_of_week
## time (1): ride_length
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Apr_24 <- read_csv("202404-divvy-tripdata.csv")
## Warning: One or more parsing issues, call `problems()` on your data frame for details,
## e.g.:
## dat <- vroom(...)
## problems(dat)
## Rows: 415025 Columns: 15
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (9): ride_id, rideable_type, started_at, ended_at, start_station_name, ...
## dbl (5): start_lat, start_lng, end_lat, end_lng, day_of_week
## time (1): ride_length
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
May_24 <- read_csv("202405-divvy-tripdata.csv")
## Warning: One or more parsing issues, call `problems()` on your data frame for details,
## e.g.:
## dat <- vroom(...)
## problems(dat)
## Rows: 609493 Columns: 15
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (9): ride_id, rideable_type, started_at, ended_at, start_station_name, ...
## dbl (5): start_lat, start_lng, end_lat, end_lng, day_of_week
## time (1): ride_length
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Jun_24 <- read_csv("202406-divvy-tripdata.csv")
## Rows: 710721 Columns: 15
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl (4): start_lat, start_lng, end_lat, end_lng
## lgl (2): ride_length, day_of_week
## time (2): started_at, ended_at
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Jul_24 <- read_csv("202407-divvy-tripdata.csv")
## Rows: 748962 Columns: 15
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (9): ride_id, rideable_type, started_at, ended_at, start_station_name, ...
## dbl (5): start_lat, start_lng, end_lat, end_lng, day_of_week
## time (1): ride_length
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Aug_24 <- read_csv("202408-divvy-tripdata.csv")
## Rows: 755639 Columns: 15
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (9): ride_id, rideable_type, started_at, ended_at, start_station_name, ...
## dbl (5): start_lat, start_lng, end_lat, end_lng, day_of_week
## time (1): ride_length
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Sep_24 <- read_csv("202409-divvy-tripdata.csv")
## Rows: 821276 Columns: 15
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (9): ride_id, rideable_type, started_at, ended_at, start_station_name, ...
## dbl (5): start_lat, start_lng, end_lat, end_lng, day_of_week
## time (1): ride_length
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Oct_24 <- read_csv("202410-divvy-tripdata.csv")
## Rows: 616281 Columns: 15
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (9): ride_id, rideable_type, started_at, ended_at, start_station_name, ...
## dbl (5): start_lat, start_lng, end_lat, end_lng, day_of_week
## time (1): ride_length
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Nov_24 <- read_csv("202411-divvy-tripdata.csv")
## Warning: One or more parsing issues, call `problems()` on your data frame for details,
## e.g.:
## dat <- vroom(...)
## problems(dat)
## Rows: 335075 Columns: 15
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (9): ride_id, rideable_type, started_at, ended_at, start_station_name, ...
## dbl (5): start_lat, start_lng, end_lat, end_lng, day_of_week
## time (1): ride_length
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Dec_24 <- read_csv("202412-divvy-tripdata.csv")
## Rows: 178372 Columns: 15
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (9): ride_id, rideable_type, started_at, ended_at, start_station_name, ...
## dbl (5): start_lat, start_lng, end_lat, end_lng, day_of_week
## time (1): ride_length
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Review all column names
colnames(Jan_24)
## [1] "ride_id" "rideable_type" "started_at"
## [4] "ended_at" "start_station_name" "start_station_id"
## [7] "end_station_name" "end_station_id" "start_lat"
## [10] "start_lng" "end_lat" "end_lng"
## [13] "member_casual" "ride_length" "day_of_week"
colnames(Feb_24)
## [1] "ride_id" "rideable_type" "started_at"
## [4] "ended_at" "start_station_name" "start_station_id"
## [7] "end_station_name" "end_station_id" "start_lat"
## [10] "start_lng" "end_lat" "end_lng"
## [13] "member_casual" "ride_length" "day_of_week"
colnames(Mar_24)
## [1] "ride_id" "rideable_type" "started_at"
## [4] "ended_at" "start_station_name" "start_station_id"
## [7] "end_station_name" "end_station_id" "start_lat"
## [10] "start_lng" "end_lat" "end_lng"
## [13] "member_casual" "ride_length" "day_of_week"
colnames(Apr_24)
## [1] "ride_id" "rideable_type" "started_at"
## [4] "ended_at" "start_station_name" "start_station_id"
## [7] "end_station_name" "end_station_id" "start_lat"
## [10] "start_lng" "end_lat" "end_lng"
## [13] "member_casual" "ride_length" "day_of_week"
colnames(May_24)
## [1] "ride_id" "rideable_type" "started_at"
## [4] "ended_at" "start_station_name" "start_station_id"
## [7] "end_station_name" "end_station_id" "start_lat"
## [10] "start_lng" "end_lat" "end_lng"
## [13] "member_casual" "ride_length" "day_of_week"
colnames(Jun_24)
## [1] "ride_id" "rideable_type" "started_at"
## [4] "ended_at" "start_station_name" "start_station_id"
## [7] "end_station_name" "end_station_id" "start_lat"
## [10] "start_lng" "end_lat" "end_lng"
## [13] "member_casual" "ride_length" "day_of_week"
colnames(Jul_24)
## [1] "ride_id" "rideable_type" "started_at"
## [4] "ended_at" "start_station_name" "start_station_id"
## [7] "end_station_name" "end_station_id" "start_lat"
## [10] "start_lng" "end_lat" "end_lng"
## [13] "member_casual" "ride_length" "day_of_week"
colnames(Aug_24)
## [1] "ride_id" "rideable_type" "started_at"
## [4] "ended_at" "start_station_name" "start_station_id"
## [7] "end_station_name" "end_station_id" "start_lat"
## [10] "start_lng" "end_lat" "end_lng"
## [13] "member_casual" "ride_length" "day_of_week"
colnames(Sep_24)
## [1] "ride_id" "rideable_type" "started_at"
## [4] "ended_at" "start_station_name" "start_station_id"
## [7] "end_station_name" "end_station_id" "start_lat"
## [10] "start_lng" "end_lat" "end_lng"
## [13] "member_casual" "ride_length" "day_of_week"
colnames(Oct_24)
## [1] "ride_id" "rideable_type" "started_at"
## [4] "ended_at" "start_station_name" "start_station_id"
## [7] "end_station_name" "end_station_id" "start_lat"
## [10] "start_lng" "end_lat" "end_lng"
## [13] "member_casual" "ride_length" "day_of_week"
colnames(Nov_24)
## [1] "ride_id" "rideable_type" "started_at"
## [4] "ended_at" "start_station_name" "start_station_id"
## [7] "end_station_name" "end_station_id" "start_lat"
## [10] "start_lng" "end_lat" "end_lng"
## [13] "member_casual" "ride_length" "day_of_week"
colnames(Dec_24)
## [1] "ride_id" "rideable_type" "started_at"
## [4] "ended_at" "start_station_name" "start_station_id"
## [7] "end_station_name" "end_station_id" "start_lat"
## [10] "start_lng" "end_lat" "end_lng"
## [13] "member_casual" "ride_length" "day_of_week"
Dataset for June was left out because the Start_at variabe only has time recorded intead of date and time hence I couldn’t calculate the ride_length and day_of_week variables
library(dplyr)
Merged_2024 <- bind_rows(Jan_24, Feb_24, Mar_24, Apr_24, May_24, Jul_24, Aug_24,
Sep_24, Oct_24, Nov_24, Dec_24)
Rename variable member_casual to new name ‘user type’ Memeber_casual is abit confusing
library(tidyverse)
Merged_2024 <- Merged_2024 %>%
rename(user_type = member_casual)
Rename rideable_type to new name ‘ride_type’
library(tidyverse)
Merged_2024 <- Merged_2024 %>%
rename(ride_type = rideable_type)
Remove duplicate data using distinct function
library(tidyverse)
Merged_2024 %>% distinct()
## # A tibble: 5,149,847 × 15
## ride_id ride_type started_at ended_at start_station_name start_station_id
## <chr> <chr> <chr> <chr> <chr> <chr>
## 1 C1D650626C… electric… 12/01/202… 12/01/2… Wells St & Elm St KA1504000135
## 2 EECD38BDB2… electric… 08/01/202… 08/01/2… Wells St & Elm St KA1504000135
## 3 F4A9CE7806… electric… 27/01/202… 27/01/2… Wells St & Elm St KA1504000135
## 4 0A0D9E15EE… classic_… 29/01/202… 29/01/2… Wells St & Randol… TA1305000030
## 5 33FFC9805E… classic_… 31/01/202… 31/01/2… Lincoln Ave & Wav… 13253
## 6 C96080812C… classic_… 07/01/202… 07/01/2… Wells St & Elm St KA1504000135
## 7 0EA7CB313D… classic_… 05/01/202… 05/01/2… Wells St & Elm St KA1504000135
## 8 EE11F3A3B3… electric… 04/01/202… 04/01/2… Wells St & Elm St KA1504000135
## 9 63E83DE8E3… classic_… 01/01/202… 01/01/2… Wells St & Elm St KA1504000135
## 10 8005682869… electric… 03/01/202… 03/01/2… Clark St & Ida B … TA1305000009
## # ℹ 5,149,837 more rows
## # ℹ 9 more variables: end_station_name <chr>, end_station_id <chr>,
## # start_lat <dbl>, start_lng <dbl>, end_lat <dbl>, end_lng <dbl>,
## # user_type <chr>, ride_length <time>, day_of_week <dbl>
Remove missing data
Merged_2024 <- Merged_2024 %>%
drop_na(ride_id:ride_type)
colSums(is.na(Merged_2024))
## ride_id ride_type started_at ended_at
## 0 0 0 0
## start_station_name start_station_id end_station_name end_station_id
## 929926 929926 956626 956626
## start_lat start_lng end_lat end_lng
## 0 0 6108 6108
## user_type ride_length day_of_week
## 0 105 0
Realized there are 929926 null values for the variables start_station_name $ start_station_id. end_station_name $ end_station_id are missing 956626 values as well while end_lat $ end_lng are missing 6108 values. For now I’ll remove the the null values for end_lat $ end_lng and ignore start_station_name $ start_station_id. end_station_name $ end_station_id
Merged_2024 <- Merged_2024 %>%
drop_na(end_lat:end_lng)
glimpse(Merged_2024)
## Rows: 5,143,739
## Columns: 15
## $ ride_id <chr> "C1D650626C8C899A", "EECD38BDB25BFCB0", "F4A9CE7806…
## $ ride_type <chr> "electric_bike", "electric_bike", "electric_bike", …
## $ started_at <chr> "12/01/2024 15:30", "08/01/2024 15:45", "27/01/2024…
## $ ended_at <chr> "12/01/2024 15:37", "08/01/2024 15:52", "27/01/2024…
## $ start_station_name <chr> "Wells St & Elm St", "Wells St & Elm St", "Wells St…
## $ start_station_id <chr> "KA1504000135", "KA1504000135", "KA1504000135", "TA…
## $ end_station_name <chr> "Kingsbury St & Kinzie St", "Kingsbury St & Kinzie …
## $ end_station_id <chr> "KA1503000043", "KA1503000043", "KA1503000043", "13…
## $ start_lat <dbl> 41.90327, 41.90294, 41.90295, 41.88430, 41.94880, 4…
## $ start_lng <dbl> -87.63474, -87.63444, -87.63447, -87.63396, -87.675…
## $ end_lat <dbl> 41.88918, 41.88918, 41.88918, 41.92182, 41.88918, 4…
## $ end_lng <dbl> -87.63851, -87.63851, -87.63851, -87.64414, -87.638…
## $ user_type <chr> "member", "member", "member", "member", "member", "…
## $ ride_length <time> 00:07:00, 00:07:00, 00:08:00, 00:30:00, 00:26:00, …
## $ day_of_week <dbl> 6, 2, 7, 2, 4, 1, 6, 5, 2, 4, 4, 4, 4, 6, 1, 4, 7, …
head(Merged_2024)
## # A tibble: 6 × 15
## ride_id ride_type started_at ended_at start_station_name start_station_id
## <chr> <chr> <chr> <chr> <chr> <chr>
## 1 C1D650626C8… electric… 12/01/202… 12/01/2… Wells St & Elm St KA1504000135
## 2 EECD38BDB25… electric… 08/01/202… 08/01/2… Wells St & Elm St KA1504000135
## 3 F4A9CE78061… electric… 27/01/202… 27/01/2… Wells St & Elm St KA1504000135
## 4 0A0D9E15EE5… classic_… 29/01/202… 29/01/2… Wells St & Randol… TA1305000030
## 5 33FFC9805E3… classic_… 31/01/202… 31/01/2… Lincoln Ave & Wav… 13253
## 6 C96080812CD… classic_… 07/01/202… 07/01/2… Wells St & Elm St KA1504000135
## # ℹ 9 more variables: end_station_name <chr>, end_station_id <chr>,
## # start_lat <dbl>, start_lng <dbl>, end_lat <dbl>, end_lng <dbl>,
## # user_type <chr>, ride_length <time>, day_of_week <dbl>
Change time format for variable ride_length from seconds to minutes
Install these packages, if not yet installed install.packages(“hms”) install.packages(“lubridate”) library(lubridate) library(hms)
Merged_2024$day_of_week <- format(as.Date(Merged_2024$started_at), "%A")
Min, Max & mean
Removed missing values and ensured ride_length was greater than 0
library(dplyr)
min_max_avg <- Merged_2024 %>%
group_by(user_type) %>%
summarise(average_ride_length = mean(ride_length[ride_length>0],na.rm = T), max_ride_length = max(ride_length[ride_length>0],na.rm = T), min_ride_length = min(ride_length[ride_length>0],na.rm = T))
Calculate mode of the day of the week
library(DescTools)
Mode(Merged_2024$day_of_week)
## [1] "Sunday"
## attr(,"freq")
## [1] 771361
Calculating total rides by user type
ride_counts <- as.data.frame(table(Merged_2024$user_type))
colnames(ride_counts) <- c("user_type", "total_rides")
Average rides
library(tidyverse)
Average_rides <- Merged_2024 %>%
group_by(user_type, day_of_week) %>%
summarise(rides_per_day = n(), average_ride_length = mean(ride_length[ride_length>0],na.rm = T)) %>%
arrange(-rides_per_day, day_of_week)
## `summarise()` has grouped output by 'user_type'. You can override using the
## `.groups` argument.
Count of start stations
library(tidyverse)
Stations <- Merged_2024 %>%
drop_na(start_station_name) %>%
group_by(user_type, start_station_name) %>%
summarise(station_count = n()) %>%
arrange(-station_count, start_station_name)
## `summarise()` has grouped output by 'user_type'. You can override using the
## `.groups` argument.
Convert average time from seconds to minutes under new dataframe average_rides_new
Average_rides_new <- Average_rides %>%
mutate(average_time = average_ride_length/360)
library(ggplot2)
ggplot(ride_counts, aes(x = "", y = total_rides, fill = user_type)) +
geom_bar(stat = "identity", width = 1) +
coord_polar(theta = "y") +
theme_void() +
ggtitle("Total Rides by User Type") +
scale_fill_manual(values = c("casual" = "#F8766D", "member" = "#00BFC4")) +
geom_text(aes(label = paste0(round(total_rides / sum(total_rides) * 100, 1), "%")),
position = position_stack(vjust = 0.5), color = "white", size = 5)
## User type against type of bike
ggplot(data =Merged_2024) +
geom_bar(mapping = aes(x = ride_type, fill=ride_type)) +
facet_wrap(~user_type) +
theme(axis.text.x = element_text(angle = 45)) +
guides(fill="none") +
labs(title = "Biker types Vs User subscription types ", x="Type of bike",
y="Number of riders")
## Total number of rides per day
ggplot(Merged_2024) +
geom_bar(mapping=aes(x = day_of_week, fill=day_of_week)) +
facet_wrap(~user_type) +
theme(axis.text.x = element_text(angle = 45)) +
guides(fill="none") +
labs(title = "Total Rides per day ", x="Day of the week",y="Total rides")
## Average ride duration per day
ggplot(data = Average_rides_new) +
geom_col(mapping = aes(x = day_of_week, y = average_time, fill=day_of_week))+
facet_wrap(~user_type) +
theme(axis.text.x = element_text(angle = 45)) +
guides(fill="none") +
labs(title = "Average ride duration per day", x="Day of the week",y="Average duration in minutes")
## Don't know how to automatically pick scale for object of type <difftime>.
## Defaulting to continuous.
## Top 10 start stations by user type
library(ggplot2)
library(dplyr)
Top_ten <- Stations %>%
slice_max(station_count, n = 10)
ggplot(data = Top_ten) +
geom_col(mapping = aes(x = start_station_name, y = station_count, fill=start_station_name)) +
facet_wrap(~user_type) +
coord_flip() +
guides(fill="none") +
labs(title = "Top 10 start stations by user type", x = "Start station name", y = "station count")
## Top ten end stations
library(tidyverse)
End_stations <- Merged_2024 %>%
drop_na(end_station_name) %>%
group_by(user_type, end_station_name) %>%
summarise(station_count = n()) %>%
arrange(-station_count, end_station_name)
## `summarise()` has grouped output by 'user_type'. You can override using the
## `.groups` argument.
library(ggplot2)
library(dplyr)
End_top_ten <- End_stations %>%
slice_max(station_count, n = 10)
ggplot(data = End_top_ten) +
geom_col(mapping = aes(x = end_station_name, y = station_count, fill= end_station_name)) +
facet_wrap(~user_type) +
coord_flip() +
guides(fill="none") +
labs(title = "Top 10 end stations by user type", x = "End station name", y = "station count")
## Summary of Anlysis
Mode of day of the week: Sunday (771361)
Ride total counts by user type
Type of bike by user type
Total number of rides per day
Most frequent start and end stations
These are my top three recommendations based on the results of my analysis
One: From the results, we saw that both Casual and member users mostly preferred to start and end their rides at Streeter dr & grand ave and Dusable lake shore dr & monroe st. The company can use this station to advertise their memebership packages through billboards, fliers to attract more causal riders into membership.
Two:Casual users preffered to use electric bikes (so did the members). Use promotions like discounted costs for electronic bikes for casual users who enroll into membership category.
Three: For both casual and member users, we noticed the highest number of rides were recorded on Saturday and Sunday. The company could come up with attractive membership packages like weekend package to attract more casual riders into membership
References: https://www.kaggle.com/code/michellemariesanna/cyclistic-case-study. This helped to draw a pie-chart