In 2016, Cyclistic launched a successful bike-share offering. The program comprises 5,824 bicycles and 692 stations. Cyclistic’s marketing strategy relied on building general awareness and appealing to broad consumer segments, helped by the flexibility of its pricing plans: single-ride passes, full-day passes, and annual memberships. Customers who purchase single-ride or full-day passes are referred to as casual riders. Customers who purchase annual memberships are Cyclistic members.
Cyclistic’s finance analysts have concluded that annual members are much more profitable than casual riders. Although the pricing flexibility helps Cyclistic attract more customers, Moreno (the director of marketing) believes that maximizing the number of annual members will be key to future growth and that there is an opportunity to convert casual riders into members. She notes that casual riders are already aware of the Cyclistic program and have chosen Cyclistic for their mobility needs.
Moreno has set a clear goal: design marketing strategies aimed at converting casual riders into annual members. To do that, however, the team needs to better understand how annual members and casual riders differ, why casual riders would buy a membership, and how digital media could affect their marketing tactics. Moreno and her team are interested in analysing the Cyclistic historical bike trip data to identify trends.
Moreno has assigned you the first question to answer: How do annual members and casual riders use Cyclistic bikes differently?
options(repos = c(CRAN = "https://cran.rstudio.com/"))
# Install the packages used in this analysis (only needed once per machine)
install.packages(c("dplyr", "readr", "janitor", "lubridate", "ggplot2", "tidyr", "fmsb"))
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(readr)
library(janitor)
##
## Attaching package: 'janitor'
## The following objects are masked from 'package:stats':
##
## chisq.test, fisher.test
library(lubridate)
##
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
##
## date, intersect, setdiff, union
library(ggplot2)
library(tidyr)
library(fmsb)
# Merge the monthly CSV files into a single data frame
cyclistic_data_2023_origional <- list.files(path = "/Users/billybennett/google_data_analytics_capstone/cyclistic_data_2023",
                                            pattern = "\\.csv$", full.names = TRUE) %>%
  lapply(read_csv) %>%
  bind_rows()
## Rows: 190301 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
## (The same column specification is reported for each of the remaining 11 files.
## Their row counts, in the order read: 190445, 258678, 426590, 604827, 719618,
## 767650, 771693, 666371, 537113, 362518, 224073.)
The above code merges the CSV files directly into one table.
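As a quick sanity check on the merge, the row counts that read_csv() reported for the twelve files can be added up and compared with the 5,719,877 rows shown when the merged tibble is printed. This is only a sketch; the name monthly_rows is mine and is not part of the analysis.

```r
# Row counts reported by read_csv() for each monthly file, in the order read
monthly_rows <- c(190301, 190445, 258678, 426590, 604827, 719618,
                  767650, 771693, 666371, 537113, 362518, 224073)

# The merged table should contain exactly this many rows
total_rows <- sum(monthly_rows)
total_rows
# 5719877
```

The total matches the printed tibble, so no rows were lost or duplicated when the files were bound together.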
cyclistic_data_2023_unclean <- cyclistic_data_2023_origional
I then created an alternate version of the data set to be cleaned.
print(cyclistic_data_2023_unclean)
## # A tibble: 5,719,877 × 13
## ride_id rideable_type started_at ended_at
## <chr> <chr> <dttm> <dttm>
## 1 F96D5A74A3E41399 electric_bike 2023-01-21 20:05:42 2023-01-21 20:16:33
## 2 13CB7EB698CEDB88 classic_bike 2023-01-10 15:37:36 2023-01-10 15:46:05
## 3 BD88A2E670661CE5 electric_bike 2023-01-02 07:51:57 2023-01-02 08:05:11
## 4 C90792D034FED968 classic_bike 2023-01-22 10:52:58 2023-01-22 11:01:44
## 5 3397017529188E8A classic_bike 2023-01-12 13:58:01 2023-01-12 14:13:20
## 6 58E68156DAE3E311 electric_bike 2023-01-31 07:18:03 2023-01-31 07:21:16
## 7 2F7194B6012A98D4 electric_bike 2023-01-15 21:18:36 2023-01-15 21:32:36
## 8 DB1CF84154D6A049 classic_bike 2023-01-25 10:49:01 2023-01-25 10:58:22
## 9 34EAB943F88C4C5D electric_bike 2023-01-25 20:49:47 2023-01-25 21:02:14
## 10 BC8AB1AA51DA9115 classic_bike 2023-01-06 16:37:19 2023-01-06 16:49:52
## # ℹ 5,719,867 more rows
## # ℹ 9 more variables: start_station_name <chr>, start_station_id <chr>,
## # end_station_name <chr>, end_station_id <chr>, start_lat <dbl>,
## # start_lng <dbl>, end_lat <dbl>, end_lng <dbl>, member_casual <chr>
Viewing the data set.
cyclistic_data_2023_unclean <- cyclistic_data_2023_unclean %>%
  select(-c(start_lat, start_lng, end_lat, end_lng, start_station_id, end_station_id))
Remove the columns that aren’t needed for my analysis.
get_dupes(cyclistic_data_2023_unclean)
## No variable names specified - using all columns.
## No duplicate combinations found of: ride_id, rideable_type, started_at, ended_at, start_station_name, end_station_name, member_casual
## # A tibble: 0 × 8
## # ℹ 8 variables: ride_id <chr>, rideable_type <chr>, started_at <dttm>,
## # ended_at <dttm>, start_station_name <chr>, end_station_name <chr>,
## # member_casual <chr>, dupe_count <int>
# remove duplicates
cyclistic_data_2023_unclean <- distinct(cyclistic_data_2023_unclean)
Check for duplicate rows (none were found), then apply distinct() as a safeguard.
sum(rowSums(is.na(cyclistic_data_2023_unclean)) > 0)
## [1] 1387808
# Check number of missing values per column
sapply(cyclistic_data_2023_unclean, function(x) sum(is.na(x)))
## ride_id rideable_type started_at ended_at
## 0 0 0 0
## start_station_name end_station_name member_casual
## 875716 929202 0
Check for missing values. I decided not to remove the rows containing missing values yet, as they only appear in the station name columns. I will first finish cleaning the data and then split the clean data into two versions: the first is the fully cleaned data set, used for analysis that does not depend on station names, and the second is the cleaned data set with the rows that have missing station names removed.
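The two checks above count different things: one counts rows with at least one missing value, the other counts missing values per column. A minimal base R sketch on made-up rides (the data frame toy_rides is hypothetical, with station names borrowed from later tables) shows why the row count can be smaller than the sum of the column counts.

```r
# Toy data (made up for illustration): some rides are missing a station name
toy_rides <- data.frame(
  ride_id            = c("A1", "B2", "C3", "D4"),
  start_station_name = c("Clark St & Elm St", NA, "Canal St & Adams St", NA),
  end_station_name   = c("Canal St & Adams St", NA, NA, "Clark St & Elm St"),
  stringsAsFactors   = FALSE
)

# Missing values per column
na_per_column <- colSums(is.na(toy_rides))

# Rows with at least one missing value
rows_with_na <- sum(rowSums(is.na(toy_rides)) > 0)

# Rows that would survive dropping every incomplete row
complete_rides <- toy_rides[complete.cases(toy_rides), ]
```

Here ride B2 is missing both names, so it adds to both column counts but only once to the row count. The same effect explains why 875716 + 929202 is larger than the 1387808 incomplete rows in the real data.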
cyclistic_data_2023_unclean %>%
distinct(rideable_type)
## # A tibble: 3 × 1
## rideable_type
## <chr>
## 1 electric_bike
## 2 classic_bike
## 3 docked_bike
Check that electric, classic and docked are the only bike types.
cyclistic_data_2023_unclean %>%
distinct(member_casual)
## # A tibble: 2 × 1
## member_casual
## <chr>
## 1 member
## 2 casual
Check that member and casual are the only rider types.
cyclistic_data_2023_valid_dates <- cyclistic_data_2023_unclean %>%
  mutate(valid_dates = started_at < ended_at)
cyclistic_data_2023_valid_dates <- cyclistic_data_2023_valid_dates %>%
  filter(valid_dates)
cyclistic_data_2023_clean <- cyclistic_data_2023_valid_dates %>%
  select(-valid_dates)
Create a valid_dates column to flag rides that start before they end, then keep only those rides.
cyclistic_data_2023_clean$started_at <- as.POSIXct(cyclistic_data_2023_clean$started_at, format="%Y-%m-%d %H:%M:%S")
cyclistic_data_2023_clean$ended_at <- as.POSIXct(cyclistic_data_2023_clean$ended_at, format="%Y-%m-%d %H:%M:%S")
Ensure the started_at and ended_at columns are stored as date-times. read_csv() has already parsed them as dttm, so this step is a safeguard.
cyclistic_data_2023_clean <- cyclistic_data_2023_clean %>%
mutate(ride_length = as.numeric(difftime(ended_at, started_at, units = "mins")))
Create a ride_length column, measured in minutes.
cyclistic_data_2023_clean <- cyclistic_data_2023_clean %>%
mutate(day_of_week = wday(started_at, label = TRUE, abbr = FALSE))
head(cyclistic_data_2023_clean)
## # A tibble: 6 × 9
## ride_id rideable_type started_at ended_at
## <chr> <chr> <dttm> <dttm>
## 1 F96D5A74A3E41399 electric_bike 2023-01-21 20:05:42 2023-01-21 20:16:33
## 2 13CB7EB698CEDB88 classic_bike 2023-01-10 15:37:36 2023-01-10 15:46:05
## 3 BD88A2E670661CE5 electric_bike 2023-01-02 07:51:57 2023-01-02 08:05:11
## 4 C90792D034FED968 classic_bike 2023-01-22 10:52:58 2023-01-22 11:01:44
## 5 3397017529188E8A classic_bike 2023-01-12 13:58:01 2023-01-12 14:13:20
## 6 58E68156DAE3E311 electric_bike 2023-01-31 07:18:03 2023-01-31 07:21:16
## # ℹ 5 more variables: start_station_name <chr>, end_station_name <chr>,
## # member_casual <chr>, ride_length <dbl>, day_of_week <ord>
Create a day_of_week column.
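The two derived columns can be checked by hand on the first ride in the printed data. The sketch below uses base R only, so as.POSIXlt()$wday stands in for lubridate's wday(), and the UTC time zone is my assumption, since the output does not show one.

```r
# First ride shown in the printed data (time zone assumed to be UTC here)
started_at <- as.POSIXct("2023-01-21 20:05:42", tz = "UTC")
ended_at   <- as.POSIXct("2023-01-21 20:16:33", tz = "UTC")

# Ride length in minutes: 10 minutes 51 seconds
ride_length <- as.numeric(difftime(ended_at, started_at, units = "mins"))

# Day of the week as a number, 0 = Sunday ... 6 = Saturday
day_number <- as.POSIXlt(started_at)$wday
```

This gives a ride length of 10.85 minutes and a day number of 6, a Saturday.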
cyclistic_data_2023_stations <- cyclistic_data_2023_clean %>% drop_na()
Create a new table containing only the rows where both station names are present.
The first piece of analysis I have decided to do is to look at how members and casuals differ in their choice of bike type.
cyclistic_data_2023_bike_type_table <- cyclistic_data_2023_clean %>%
group_by(rideable_type, member_casual) %>%
summarize(count = n())
## `summarise()` has grouped output by 'rideable_type'. You can override using the
## `.groups` argument.
head(cyclistic_data_2023_bike_type_table)
## # A tibble: 5 × 3
## # Groups: rideable_type [3]
## rideable_type member_casual count
## <chr> <chr> <int>
## 1 classic_bike casual 876805
## 2 classic_bike member 1819026
## 3 docked_bike casual 78287
## 4 electric_bike casual 1103529
## 5 electric_bike member 1840961
ggplot(data = cyclistic_data_2023_bike_type_table) + geom_col(mapping = aes(x = rideable_type, y = count, fill = member_casual), position = "dodge")+
labs(title = "Counts of Members and Casual Riders by Bike Type",
x = "Bike Type",
y = "Number of Journeys",
fill = "Member/Casual") +
theme_minimal()
As we can see here, there are no instances of members using docked bikes. We can also see that casual riders are more likely to use an electric bike than a classic bike, whereas members, whilst still leaning slightly towards electric, are split more evenly between the two. This may suggest that casuals have a slightly stronger preference for leisurely rides or perhaps cover longer distances.
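To put numbers on the bike type comparison, the counts from the table above can be turned into shares within each rider group. This is a small base R sketch; the column name share_pct and the helper group_totals are mine.

```r
# Counts copied from the bike type table above
bike_type_counts <- data.frame(
  rideable_type = c("classic_bike", "classic_bike", "docked_bike",
                    "electric_bike", "electric_bike"),
  member_casual = c("casual", "member", "casual", "casual", "member"),
  count         = c(876805, 1819026, 78287, 1103529, 1840961)
)

# Total journeys per rider group, repeated on each row of that group
group_totals <- ave(bike_type_counts$count, bike_type_counts$member_casual, FUN = sum)

# Share of each rider group's journeys taken on each bike type, in percent
bike_type_counts$share_pct <- round(100 * bike_type_counts$count / group_totals, 1)

bike_type_counts
```

On these counts, about 53.6% of casual journeys and 50.3% of member journeys were on electric bikes, so the gap is real but modest.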
I am now going to analyse how members and casuals use Cyclistic differently depending on factors such as day of the week, month and time of day.
cyclistic_data_2023_day_of_week_table <- cyclistic_data_2023_clean %>% group_by(day_of_week, member_casual) %>% summarize(count = n())
## `summarise()` has grouped output by 'day_of_week'. You can override using the
## `.groups` argument.
ggplot(data = cyclistic_data_2023_day_of_week_table) + geom_col(mapping = aes(x = day_of_week, y = count, fill = member_casual), position = "dodge")+
labs(title = "Counts of Members and Casual Riders by Day of the Week",
x = "Day of the Week",
y = "Number of Journeys",
fill = "Member/Casual") +
theme_minimal()
As shown above, members are more likely to cycle on weekdays, whereas casuals are more likely to cycle at weekends. This would suggest that casual riders are more likely to use the bikes for recreational purposes.
cyclistic_data_2023_hour_of_day_column <- cyclistic_data_2023_clean %>%
  mutate(hour_of_day = hour(started_at))
cyclistic_data_2023_hour_of_day_table <- cyclistic_data_2023_hour_of_day_column %>%
  group_by(hour_of_day, member_casual) %>%
  summarize(count = n())
## `summarise()` has grouped output by 'hour_of_day'. You can override using the
## `.groups` argument.
ggplot(data = cyclistic_data_2023_hour_of_day_table) + geom_col(mapping = aes(x = hour_of_day, y = count, fill = member_casual), position = position_dodge()) +
labs(title = "Counts of Members and Casual Riders by Hour of Day",
x = "Hour of Day",
y = "Number of Journeys",
fill = "Member/Casual") +
scale_x_continuous(breaks = seq(0, 23, by = 1)) + theme_minimal() + theme (axis.text.x = element_text(angle = 0, hjust = 0, vjust = 0.5, size = 5))
As we can see here, both member and casual usage peaks during the late afternoon; however, members are far more likely to cycle during the morning. Paired with the information from the previous graph, we could infer that members are using Cyclistic to cycle to and from work, as their most likely journeys would be at 7-8am and 4-6pm on Tuesdays, Wednesdays and Thursdays. Casuals, on the other hand, are most likely to be cycling at 4-6pm and at weekends, and are very unlikely to be cycling on weekday mornings.
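The hour-of-day grouping can be illustrated on the first five start times from the printed data. This sketch uses base R, so as.POSIXlt()$hour stands in for lubridate's hour(). The UTC time zone and the two commute windows are my assumptions for illustration, not definitions used in the analysis.

```r
# Start times of the first five rides in the printed data (time zone assumed UTC)
started_at <- as.POSIXct(c("2023-01-21 20:05:42", "2023-01-10 15:37:36",
                           "2023-01-02 07:51:57", "2023-01-22 10:52:58",
                           "2023-01-12 13:58:01"), tz = "UTC")

# Hour of day (0-23), the base R counterpart of lubridate::hour()
hour_of_day <- as.POSIXlt(started_at)$hour

# Hypothetical commute windows, used only for this illustration
morning_commute <- hour_of_day >= 7 & hour_of_day < 9
evening_commute <- hour_of_day >= 16 & hour_of_day < 19
```

Only the third ride, which started at 07:51, falls inside the morning window, and none of the five falls inside the evening window.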
cyclistic_data_2023_dates_column <- cyclistic_data_2023_clean %>%
mutate(dates = as.Date(started_at))
cyclistic_data_2023_journey_date_table <- cyclistic_data_2023_dates_column %>%
group_by(dates, member_casual) %>%
summarize(count = n(), .groups = 'drop')
ggplot(data = cyclistic_data_2023_journey_date_table) +
geom_line(mapping = aes(x = dates, y = count, colour = member_casual, group = member_casual)) +
geom_smooth(mapping = aes(x = dates, y = count, colour = member_casual, group = member_casual),
method = "loess", se = FALSE) +
labs(title = "Counts of Members and Casual Riders by Day of the Year",
x = "Date",
y = "Number of Journeys",
colour = "Member/Casual") +
scale_x_date(date_breaks = "1 month", date_labels = "%b %Y") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1, vjust = 1, size = 10))
## `geom_smooth()` using formula = 'y ~ x'
head(cyclistic_data_2023_journey_date_table)
## # A tibble: 6 × 3
## dates member_casual count
## <date> <chr> <int>
## 1 2023-01-01 casual 1991
## 2 2023-01-01 member 3144
## 3 2023-01-02 casual 1593
## 4 2023-01-02 member 3914
## 5 2023-01-03 casual 1145
## 6 2023-01-03 member 3982
As we can see in the above graph, members and casuals use Cyclistic similarly across the year, with both groups peaking during the hotter months. Interestingly, we can see that the highest peaks come from casual riders, which could suggest that casuals are more strongly encouraged to ride by external factors, whereas members are somewhat more consistent.
Here I am going to assess whether there are any key differences between the stations and routes that members and casual riders use.
cyclistic_data_2023_station_table <- cyclistic_data_2023_stations %>%
group_by(start_station_name, member_casual) %>%
summarize(count = n())
## `summarise()` has grouped output by 'start_station_name'. You can override
## using the `.groups` argument.
cyclistic_data_2023_station_table <- cyclistic_data_2023_station_table %>%
filter(count > 15000)
head(cyclistic_data_2023_station_table)
## # A tibble: 6 × 3
## # Groups: start_station_name [6]
## start_station_name member_casual count
## <chr> <chr> <int>
## 1 Broadway & Barry Ave member 16807
## 2 Canal St & Adams St member 16441
## 3 Clark St & Elm St member 23077
## 4 Clinton St & Jackson Blvd member 15975
## 5 Clinton St & Madison St member 19110
## 6 Clinton St & Washington Blvd member 24313
ggplot(data = cyclistic_data_2023_station_table) +
geom_point(mapping = aes(x = start_station_name, y = count, colour = member_casual)) +
labs(
title = "Counts of Members and Casual Riders by Station",
x = "Station Name",
y = "Number of Journeys",
colour = "Member/Casual"
) +
theme_minimal() +
theme(
axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5, size = 10)
)
As we can see above, there are far more stations from which members take over 15,000 journeys. This would suggest that members are more likely to cycle from the same station frequently, which in turn may suggest that members use Cyclistic for journeys they make often, such as to a place of work. Also, the stations that are popular with casual riders include Theater on the Lake, Shedd Aquarium and Millennium Park, which again supports the idea that casuals use Cyclistic for recreational purposes.
cyclistic_data_2023_routes <- cyclistic_data_2023_stations %>%
  mutate(route = paste(start_station_name, end_station_name, sep = " to "))
cyclistic_data_2023_routes_table <- cyclistic_data_2023_routes %>%
group_by(route, member_casual) %>%
summarize(count = n())
## `summarise()` has grouped output by 'route'. You can override using the
## `.groups` argument.
cyclistic_data_2023_routes_table <- cyclistic_data_2023_routes_table %>%
filter(count > 3000)
ggplot(data = cyclistic_data_2023_routes_table) +
geom_point(mapping = aes(x = route, y = count, colour = member_casual)) +
labs(
title = "Counts of Members and Casual Riders by Route",
x = "Route taken",
y = "Number of Journeys",
colour = "Member/Casual"
) +
theme_minimal() +
theme(
axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5, size = 4))
Looking at this graph, we can see that among the routes frequently taken by casual riders, many journeys begin and end at the same station. This tells us that on these routes casual users are not using the bikes to get from one place to another, which suggests that casuals are using the bikes for enjoyment rather than practicality. Furthermore, there is an outlier among the casual users: despite there being fewer casual rides overall, the most used route by either group was taken by casual users. This would suggest that casual riders are more likely to target specific routes, whereas members use the bikes as and when needed.
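Round trips like the ones described above could also be flagged directly rather than read off the route labels. The sketch below is base R on made-up rides: toy_rides and the round_trip column are hypothetical, and only the station names are borrowed from the tables above.

```r
# Toy rides (made up for illustration; station names borrowed from the tables above)
toy_rides <- data.frame(
  start_station_name = c("Clark St & Elm St", "Clark St & Elm St", "Canal St & Adams St"),
  end_station_name   = c("Clark St & Elm St", "Canal St & Adams St", "Canal St & Adams St"),
  member_casual      = c("casual", "member", "casual"),
  stringsAsFactors   = FALSE
)

# Route label built the same way as in the analysis
toy_rides$route <- paste(toy_rides$start_station_name,
                         toy_rides$end_station_name, sep = " to ")

# A round trip starts and ends at the same station
toy_rides$round_trip <- toy_rides$start_station_name == toy_rides$end_station_name

# Share of round trips per rider type, in percent
round_trip_share <- tapply(toy_rides$round_trip, toy_rides$member_casual, mean) * 100
```

Applied to the real stations table, the same comparison would give the round-trip share for each group without needing to read the plot.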
round_to_nearest_10 <- function(x) {
round(x / 10) * 10
}
cyclistic_data_2023_ride_length_rounded <- cyclistic_data_2023_clean %>%
mutate(ride_length_rounded = round_to_nearest_10(ride_length))
cyclistic_data_2023_ride_length_table <- cyclistic_data_2023_ride_length_rounded %>%
group_by(ride_length_rounded, member_casual) %>%
summarize(count = n())
## `summarise()` has grouped output by 'ride_length_rounded'. You can override
## using the `.groups` argument.
head(cyclistic_data_2023_ride_length_table)
## # A tibble: 6 × 3
## # Groups: ride_length_rounded [3]
## ride_length_rounded member_casual count
## <dbl> <chr> <int>
## 1 0 casual 329506
## 2 0 member 934209
## 3 10 casual 912390
## 4 10 member 1848371
## 5 20 casual 379774
## 6 20 member 555337
cyclistic_data_2023_ride_length_table <- cyclistic_data_2023_ride_length_table %>%
filter(ride_length_rounded < 100)
ggplot(data = cyclistic_data_2023_ride_length_table) +
geom_col(mapping = aes(x = ride_length_rounded, y = count, fill = member_casual)) +
labs(
title = "Journeys Under 100 Minutes",
x = "Journey Time Minutes",
y = "Number of Journeys",
fill = "Member/Casual"
) +
theme_minimal() +
theme(
axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5, size = 10)
)
As we can see here, members take a far larger share of their journeys in under 50 minutes. However, this graph doesn’t show the longer journeys well, as the scale makes the small counts hard to read. Because of this, I have created a second graph below which only includes journeys over an hour.
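It helps to see which ride lengths land in each bar. The sketch below repeats the rounding helper from the analysis and applies it to a few example lengths that I chose for illustration; the 10.85 is the first ride in the data.

```r
round_to_nearest_10 <- function(x) {
  round(x / 10) * 10
}

# Example ride lengths in minutes (chosen for illustration)
example_lengths <- c(3, 10.85, 14, 16, 64, 66)

buckets <- round_to_nearest_10(example_lengths)
buckets
# 0 10 10 20 60 70
```

So the bar at 0 holds rides shorter than roughly 5 minutes, the bar at 10 holds rides of roughly 5 to 15 minutes, and a filter of ride_length_rounded > 60 keeps rides from roughly 65 minutes upwards.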
cyclistic_data_2023_ride_length_rounded <- cyclistic_data_2023_clean %>%
mutate(ride_length_rounded = round_to_nearest_10(ride_length))
cyclistic_data_2023_ride_length_table <- cyclistic_data_2023_ride_length_rounded %>%
group_by(ride_length_rounded, member_casual) %>%
summarize(count = n())
## `summarise()` has grouped output by 'ride_length_rounded'. You can override
## using the `.groups` argument.
cyclistic_data_2023_ride_length_table <- cyclistic_data_2023_ride_length_table %>%
  filter(ride_length_rounded > 60, ride_length_rounded < 300)
ggplot(data = cyclistic_data_2023_ride_length_table) +
geom_col(mapping = aes(x = ride_length_rounded, y = count, fill = member_casual)) +
labs(
title = "Journeys Over 60 Minutes",
x = "Journey Time Minutes",
y = "Number of Journeys",
fill = "Member/Casual"
) +
theme_minimal() +
theme(
axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5, size = 10)
)
Looking at these two graphs together, we get a clear picture that casual riders are more likely to take a longer journey, which again suggests that casual riders aren’t using the bikes primarily as a mode of transport. This may also explain why casual riders were more likely than members to use electric bikes, as electric bikes may feel more necessary for these longer journeys.
In conclusion, I feel the data is telling us that casual riders prefer to use the bikes primarily for recreational purposes, whilst members are more likely to use the bikes for transportation, with evidence to suggest that they are using them to commute to work.
Here is a final visualisation which I feel displays this particularly well.
table_for_cyclystic_radar <- data.frame(
likelihood_to_ride_at_weekend = c(40, 20, 24.08541, 36.24878),
likelihood_to_ride_on_weekday = c(80, 60, 75.91459, 63.75122),
likelihood_to_ride_during_summer = c(50, 30, 35.92805, 45.82951),
likelihood_to_ride_6am_9am = c(30, 10, 19.3699, 10.87111),
journey_under_30_minutes = c(100, 80, 96.39075, 79.58624),
journey_over_1_hour = c(20, 0, 0.604647, 4.847663)
)
rownames(table_for_cyclystic_radar) <- c("max", "min", "member", "casual")
ggplot2_colors <- scales::hue_pal()(2)
colors_border <- rev(ggplot2_colors)
colors_in <- rev(sapply(ggplot2_colors, scales::alpha, 0.25))
radarchart(table_for_cyclystic_radar,
axistype = 1,
pcol = colors_border,
pfcol = colors_in,
plwd = 2,
plty = 1,
cglcol = "grey",
cglty = 1,
axislabcol = "grey",
caxislabels = rep("", 5),
vlcex = 0.8)
legend(x = "topright", legend = c("Casual", "Member"),
col = rev(colors_border),
lty = 1,
lwd = 2,
bty = "n",
pt.cex = 2,
cex = 0.8,
text.col = "black",
horiz = FALSE,
inset = c(0.1, 0.1))
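The scale of the radar chart can be worked out from the "max" and "min" rows of the data frame. The sketch below assumes that radarchart() is left at what I understand to be its default of four axis segments, as the call above does not set seg. The shortened axis names are mine.

```r
# "max" and "min" rows copied from the radar chart data above
axis_max <- c(weekend = 40, weekday = 80, summer = 50,
              morning_6_9 = 30, under_30_min = 100, over_1_hour = 20)
axis_min <- c(weekend = 20, weekday = 60, summer = 30,
              morning_6_9 = 10, under_30_min = 80, over_1_hour = 0)

# Assumption: radarchart() uses 4 axis segments (seg = 4)
segments <- 4

# Percentage points covered by each rung on every axis
points_per_rung <- (axis_max - axis_min) / segments
points_per_rung
```

Every axis spans 20 percentage points, so under that assumption each rung covers 5 percentage points on all six axes, which makes the axes directly comparable.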
Looking at the above visualisation, I believe we get the clearest picture of the key differences between members and casuals. With each rung representing only 5 percentage points, we can see that these differences are significant but not absolute, and there are some clear overlaps between how members and casuals use Cyclistic. From what the data is telling me, here are my recommendations. I believe there are two main routes to increasing the number of Cyclistic members. Firstly, I would target casual riders whose usage is similar to that of current members and try to convince them to sign up as members. Secondly, I would create a new membership plan structured around the more recreational tendencies of casual riders; I believe this could convince many more people to join as members by tapping into a new market and demographic.