Context

In 2016, Cyclistic launched a successful bike-share offering. The program comprises 5,824 bicycles and 692 stations. Cyclistic’s marketing strategy has relied on building general awareness and appealing to broad consumer segments. One element of that approach is the flexibility of its pricing plans: single-ride passes, full-day passes, and annual memberships. Customers who purchase single-ride or full-day passes are referred to as casual riders. Customers who purchase annual memberships are Cyclistic members.

Cyclistic’s finance analysts have concluded that annual members are much more profitable than casual riders. Although the pricing flexibility helps Cyclistic attract more customers, Moreno (the director of marketing) believes that maximizing the number of annual members will be key to future growth and that there is an opportunity to convert casual riders into members. She notes that casual riders are already aware of the Cyclistic program and have chosen Cyclistic for their mobility needs.

Moreno has set a clear goal: design marketing strategies aimed at converting casual riders into annual members. To do that, however, the team needs to better understand how annual members and casual riders differ, why casual riders would buy a membership, and how digital media could affect their marketing tactics. Moreno and her team are interested in analysing the Cyclistic historical bike trip data to identify trends. Three questions guide the work:

How do annual members and casual riders use Cyclistic bikes differently?
Why would casual riders buy Cyclistic annual memberships?
How can Cyclistic use digital media to influence casual riders to become members?

Moreno has assigned you the first question to answer: How do annual members and casual riders use Cyclistic bikes differently?

options(repos = c(CRAN = "https://cran.rstudio.com/"))

Packages

install.packages(c("dplyr", "readr", "janitor", "lubridate", "ggplot2", "tidyr", "fmsb"))

## 
## The downloaded binary packages are in
##  /var/folders/v0/hqlgjd_51rx6x61mrz2fllxw0000gn/T//Rtmpm7n64G/downloaded_packages

library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(readr)
library(janitor)

## 
## Attaching package: 'janitor'

## The following objects are masked from 'package:stats':
## 
##     chisq.test, fisher.test

library(lubridate)

## 
## Attaching package: 'lubridate'

## The following objects are masked from 'package:base':
## 
##     date, intersect, setdiff, union

library(ggplot2)
library(tidyr)
library(fmsb)

Data Cleaning

# merge all CSV files in the folder into one data frame
# ("\\.csv$" is a regular expression matching file names that end in .csv)
cyclistic_data_2023_origional <- list.files(path = "/Users/billybennett/google_data_analytics_capstone/cyclistic_data_2023",
                                          pattern = "\\.csv$", full.names = TRUE) %>%
  lapply(read_csv) %>%
  bind_rows()

## Rows: 190301 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl  (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
## Rows: 190445 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl  (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
## Rows: 258678 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl  (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
## Rows: 426590 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl  (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
## Rows: 604827 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl  (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
## Rows: 719618 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl  (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
## Rows: 767650 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl  (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
## Rows: 771693 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl  (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
## Rows: 666371 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl  (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
## Rows: 537113 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl  (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
## Rows: 362518 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl  (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
## Rows: 224073 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl  (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

The above code merges the CSV files directly into one table.
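As a minimal sketch of the same merge pattern on throwaway data: the two tiny files, the temporary folder and the use of base R's read.csv() in place of read_csv() are all assumptions, made only so the example runs on its own.

```r
# Hypothetical stand-in for the real files: two tiny CSVs in a temporary folder.
demo_dir <- file.path(tempdir(), "cyclistic_demo")
dir.create(demo_dir, showWarnings = FALSE)
write.csv(data.frame(ride_id = c("A1", "A2"), member_casual = c("member", "casual")),
          file.path(demo_dir, "2023_01.csv"), row.names = FALSE)
write.csv(data.frame(ride_id = c("B1", "B2"), member_casual = c("casual", "member")),
          file.path(demo_dir, "2023_02.csv"), row.names = FALSE)

# List every file whose name ends in ".csv" (the pattern is a regular expression).
demo_files <- list.files(demo_dir, pattern = "\\.csv$", full.names = TRUE)

# Read each file, then stack the resulting data frames row-wise.
demo_merged <- do.call(rbind, lapply(demo_files, read.csv))

nrow(demo_merged)
```

Stacking only works cleanly when every file has the same columns, which the column specifications printed above suggest is the case here.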

cyclistic_data_2023_unclean <- cyclistic_data_2023_origional

I then created an alternate version of the data set to be cleaned.

print(cyclistic_data_2023_unclean)

## # A tibble: 5,719,877 × 13
##    ride_id          rideable_type started_at          ended_at           
##    <chr>            <chr>         <dttm>              <dttm>             
##  1 F96D5A74A3E41399 electric_bike 2023-01-21 20:05:42 2023-01-21 20:16:33
##  2 13CB7EB698CEDB88 classic_bike  2023-01-10 15:37:36 2023-01-10 15:46:05
##  3 BD88A2E670661CE5 electric_bike 2023-01-02 07:51:57 2023-01-02 08:05:11
##  4 C90792D034FED968 classic_bike  2023-01-22 10:52:58 2023-01-22 11:01:44
##  5 3397017529188E8A classic_bike  2023-01-12 13:58:01 2023-01-12 14:13:20
##  6 58E68156DAE3E311 electric_bike 2023-01-31 07:18:03 2023-01-31 07:21:16
##  7 2F7194B6012A98D4 electric_bike 2023-01-15 21:18:36 2023-01-15 21:32:36
##  8 DB1CF84154D6A049 classic_bike  2023-01-25 10:49:01 2023-01-25 10:58:22
##  9 34EAB943F88C4C5D electric_bike 2023-01-25 20:49:47 2023-01-25 21:02:14
## 10 BC8AB1AA51DA9115 classic_bike  2023-01-06 16:37:19 2023-01-06 16:49:52
## # ℹ 5,719,867 more rows
## # ℹ 9 more variables: start_station_name <chr>, start_station_id <chr>,
## #   end_station_name <chr>, end_station_id <chr>, start_lat <dbl>,
## #   start_lng <dbl>, end_lat <dbl>, end_lng <dbl>, member_casual <chr>

View the data set.

cyclistic_data_2023_unclean <- cyclistic_data_2023_unclean %>%
  select(-c(start_lat, start_lng, end_lat, end_lng, start_station_id, end_station_id))

Remove the coordinate and station ID columns, which aren’t needed for my analysis.

get_dupes(cyclistic_data_2023_unclean)

## No variable names specified - using all columns.

## No duplicate combinations found of: ride_id, rideable_type, started_at, ended_at, start_station_name, end_station_name, member_casual

## # A tibble: 0 × 8
## # ℹ 8 variables: ride_id <chr>, rideable_type <chr>, started_at <dttm>,
## #   ended_at <dttm>, start_station_name <chr>, end_station_name <chr>,
## #   member_casual <chr>, dupe_count <int>

# remove duplicates
cyclistic_data_2023_unclean <- distinct(cyclistic_data_2023_unclean)

Check for duplicate rows, then remove any that exist. get_dupes() found none, so distinct() leaves the data unchanged here.
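Called without column names, get_dupes() compares whole rows, so a ride ID that is repeated with different values elsewhere in the row would not be flagged. A minimal sketch of an additional check on the key column alone, using base R's duplicated() on made-up toy data (the rides data frame below is hypothetical):

```r
# Hypothetical toy rides: "R2" appears twice with different end times,
# so the rows are not identical but the ID is repeated.
rides <- data.frame(
  ride_id  = c("R1", "R2", "R2", "R3"),
  ended_at = c("08:10", "09:00", "09:05", "10:30"),
  stringsAsFactors = FALSE
)

# Whole-row duplicates: none in this toy data.
n_row_dupes <- sum(duplicated(rides))

# Duplicates on the key column only: one repeated ID.
n_id_dupes <- sum(duplicated(rides$ride_id))

c(row_dupes = n_row_dupes, id_dupes = n_id_dupes)
```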

sum(rowSums(is.na(cyclistic_data_2023_unclean)) > 0)

## [1] 1387808

# Check number of missing values per column
sapply(cyclistic_data_2023_unclean, function(x) sum(is.na(x)))

##            ride_id      rideable_type         started_at           ended_at 
##                  0                  0                  0                  0 
## start_station_name   end_station_name      member_casual 
##             875716             929202                  0

Check for missing values. They appear only in the two station name columns, so I decided not to remove the affected rows yet. I will first finish cleaning the data and then keep two versions: the full cleaned data set, used for the analyses that don’t need station names, and a second data set containing only the rows where both station names are present.
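The same counts can be sketched in base R with colSums() and complete.cases(). The toy data frame below is hypothetical and only mirrors the situation described here, with gaps confined to the station name columns.

```r
# Hypothetical toy rides with gaps only in the station name columns.
rides <- data.frame(
  ride_id            = c("R1", "R2", "R3", "R4"),
  start_station_name = c("Clark St & Elm St", NA, "Canal St & Adams St", NA),
  end_station_name   = c(NA, NA, "Clark St & Elm St", "Canal St & Adams St"),
  stringsAsFactors   = FALSE
)

# Missing values per column (same result as the sapply() call above).
na_per_column <- colSums(is.na(rides))

# Number of rows with at least one missing value.
n_incomplete <- sum(!complete.cases(rides))

# Station-level version of the data: only rows where every column is present.
rides_with_stations <- rides[complete.cases(rides), ]

na_per_column
```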

cyclistic_data_2023_unclean %>%
  distinct(rideable_type)

## # A tibble: 3 × 1
##   rideable_type
##   <chr>        
## 1 electric_bike
## 2 classic_bike 
## 3 docked_bike

Check that electric, classic and docked are the only bike types.

cyclistic_data_2023_unclean %>%
  distinct(member_casual)

## # A tibble: 2 × 1
##   member_casual
##   <chr>        
## 1 member       
## 2 casual

Check that member and casual are the only rider types.

Data Manipulation

# keep only rides that start before they end
cyclistic_data_2023_clean <- cyclistic_data_2023_unclean %>%
  filter(started_at < ended_at)

Keep only the rides whose start time is earlier than their end time. This removes records with invalid timestamps, including zero-length rides.

cyclistic_data_2023_clean$started_at <- as.POSIXct(cyclistic_data_2023_clean$started_at, format="%Y-%m-%d %H:%M:%S")
cyclistic_data_2023_clean$ended_at <- as.POSIXct(cyclistic_data_2023_clean$ended_at, format="%Y-%m-%d %H:%M:%S")

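Ensure the started_at and ended_at columns are stored as date-times. read_csv() already parsed both columns as date-times (dttm), so this step is a safeguard and leaves the values unchanged.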

cyclistic_data_2023_clean <- cyclistic_data_2023_clean %>%
  mutate(ride_length = as.numeric(difftime(ended_at, started_at, units = "mins")))

Create the ride_length column, measured in minutes.
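As a small worked example of the calculation, using the timestamps of the first ride printed earlier (the "UTC" time zone is an assumption made only so the example is reproducible):

```r
# Timestamps taken from the first row printed earlier; the time zone is assumed.
started_at <- as.POSIXct("2023-01-21 20:05:42", tz = "UTC")
ended_at   <- as.POSIXct("2023-01-21 20:16:33", tz = "UTC")

# difftime() returns a duration object; as.numeric() drops the unit label.
ride_length <- as.numeric(difftime(ended_at, started_at, units = "mins"))

ride_length  # 10 minutes 51 seconds = 10.85 minutes
```

Fixing units = "mins" matters: without it, difftime() picks a unit based on the size of the differences, so the numbers would not be guaranteed to be in minutes.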

cyclistic_data_2023_clean <- cyclistic_data_2023_clean %>%
  mutate(day_of_week = wday(started_at, label = TRUE, abbr = FALSE))
head(cyclistic_data_2023_clean)

## # A tibble: 6 × 9
##   ride_id          rideable_type started_at          ended_at           
##   <chr>            <chr>         <dttm>              <dttm>             
## 1 F96D5A74A3E41399 electric_bike 2023-01-21 20:05:42 2023-01-21 20:16:33
## 2 13CB7EB698CEDB88 classic_bike  2023-01-10 15:37:36 2023-01-10 15:46:05
## 3 BD88A2E670661CE5 electric_bike 2023-01-02 07:51:57 2023-01-02 08:05:11
## 4 C90792D034FED968 classic_bike  2023-01-22 10:52:58 2023-01-22 11:01:44
## 5 3397017529188E8A classic_bike  2023-01-12 13:58:01 2023-01-12 14:13:20
## 6 58E68156DAE3E311 electric_bike 2023-01-31 07:18:03 2023-01-31 07:21:16
## # ℹ 5 more variables: start_station_name <chr>, end_station_name <chr>,
## #   member_casual <chr>, ride_length <dbl>, day_of_week <ord>

Create the day_of_week column.
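For comparison, a rough base R sketch of the same idea without lubridate. It reuses the first timestamp printed above; the "UTC" time zone and the English day names are assumptions.

```r
# First-row timestamp as printed above; "UTC" is an assumed time zone.
started_at <- as.POSIXct("2023-01-21 20:05:42", tz = "UTC")

# POSIXlt stores the weekday as an integer: 0 = Sunday ... 6 = Saturday.
day_index <- as.POSIXlt(started_at)$wday

# Map the index to an ordered factor so that plots keep calendar order
# instead of sorting the day names alphabetically.
day_names <- c("Sunday", "Monday", "Tuesday", "Wednesday",
               "Thursday", "Friday", "Saturday")
day_of_week <- factor(day_names[day_index + 1], levels = day_names, ordered = TRUE)

as.character(day_of_week)  # "Saturday"
```

The ordered factor plays the same role as the `<ord>` column type shown in the output above.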

cyclistic_data_2023_stations <- cyclistic_data_2023_clean %>% drop_na()

Create a second table containing only the rows where both station names are present.

Data Analysis

member and casual usage per bike type

The first piece of analysis I have decided to do looks at how members and casuals differ in their choice of bike type.

head(cyclistic_data_2023_clean)

## # A tibble: 6 × 9
##   ride_id          rideable_type started_at          ended_at           
##   <chr>            <chr>         <dttm>              <dttm>             
## 1 F96D5A74A3E41399 electric_bike 2023-01-21 20:05:42 2023-01-21 20:16:33
## 2 13CB7EB698CEDB88 classic_bike  2023-01-10 15:37:36 2023-01-10 15:46:05
## 3 BD88A2E670661CE5 electric_bike 2023-01-02 07:51:57 2023-01-02 08:05:11
## 4 C90792D034FED968 classic_bike  2023-01-22 10:52:58 2023-01-22 11:01:44
## 5 3397017529188E8A classic_bike  2023-01-12 13:58:01 2023-01-12 14:13:20
## 6 58E68156DAE3E311 electric_bike 2023-01-31 07:18:03 2023-01-31 07:21:16
## # ℹ 5 more variables: start_station_name <chr>, end_station_name <chr>,
## #   member_casual <chr>, ride_length <dbl>, day_of_week <ord>

cyclistic_data_2023_bike_type_table <- cyclistic_data_2023_clean %>%
  group_by(rideable_type, member_casual) %>%
  summarize(count = n())

## `summarise()` has grouped output by 'rideable_type'. You can override using the
## `.groups` argument.

head(cyclistic_data_2023_bike_type_table)

## # A tibble: 5 × 3
## # Groups:   rideable_type [3]
##   rideable_type member_casual   count
##   <chr>         <chr>           <int>
## 1 classic_bike  casual         876805
## 2 classic_bike  member        1819026
## 3 docked_bike   casual          78287
## 4 electric_bike casual        1103529
## 5 electric_bike member        1840961

ggplot(data = cyclistic_data_2023_bike_type_table) + geom_col(mapping = aes(x = rideable_type, y = count, fill = member_casual), position = "dodge")+
  labs(title = "Counts of Members and Casual Riders by Bike Type",
       x = "Bike Type",
       y = "Number of Journeys",
       fill = "Member/Casual") +
  theme_minimal()

As we can see here, there are no instances of members using docked bikes. We can also see that casual riders are notably more likely to use an electric bike than a classic bike, whereas members, whilst still leaning towards electric, are split more evenly between the two. This may suggest that casuals have a slightly stronger preference for leisurely rides, or perhaps that they cover longer distances.
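Because members take more journeys overall, raw counts are hard to compare between the two groups. A minimal sketch that converts the counts from the summary table above into within-group percentages (the share_pct column name is my own):

```r
# Counts copied from the summary table above.
bike_counts <- data.frame(
  rideable_type = c("classic_bike", "classic_bike", "docked_bike",
                    "electric_bike", "electric_bike"),
  member_casual = c("casual", "member", "casual", "casual", "member"),
  count         = c(876805, 1819026, 78287, 1103529, 1840961),
  stringsAsFactors = FALSE
)

# Total journeys per rider group, repeated on every row of that group.
group_totals <- ave(bike_counts$count, bike_counts$member_casual, FUN = sum)

# Share of each group's journeys taken on each bike type, in percent.
bike_counts$share_pct <- round(100 * bike_counts$count / group_totals, 1)

bike_counts[bike_counts$rideable_type == "electric_bike",
            c("member_casual", "share_pct")]
# casual 53.6, member 50.3
```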

member and casual usage by date and time

I am now going to analyse how members and casuals use Cyclistic differently depending on the day of the week, the time of day and the time of year.

member and casual usage by day of week

cyclistic_data_2023_day_of_week_table <- cyclistic_data_2023_clean %>%  group_by(day_of_week, member_casual) %>% summarize(count = n())

## `summarise()` has grouped output by 'day_of_week'. You can override using the
## `.groups` argument.

ggplot(data = cyclistic_data_2023_day_of_week_table) + geom_col(mapping = aes(x = day_of_week, y = count, fill = member_casual), position = "dodge")+
  labs(title = "Counts of Members and Casual Riders by Day of the Week",
       x = "Day of the Week",
       y = "Number of Journeys",
       fill = "Member/Casual") +
  theme_minimal()

As shown above, members are more likely to cycle on weekdays, whereas casuals are more likely to cycle at weekends. This would suggest that casual riders are more likely to use the bikes for recreational purposes.
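The weekday and weekend split can also be expressed as a single percentage per rider group. This is a rough sketch on made-up rides (the dates and rider types below are hypothetical), not the calculation used elsewhere in this report:

```r
# Hypothetical toy rides; dates and rider types are invented for illustration.
rides <- data.frame(
  started_at = as.POSIXct(c("2023-06-03 10:00:00",   # Saturday
                            "2023-06-04 15:30:00",   # Sunday
                            "2023-06-05 08:00:00",   # Monday
                            "2023-06-06 17:15:00",   # Tuesday
                            "2023-06-07 08:05:00",   # Wednesday
                            "2023-06-10 12:00:00"),  # Saturday
                          tz = "UTC"),
  member_casual = c("casual", "casual", "casual", "member", "member", "member"),
  stringsAsFactors = FALSE
)

# wday is 0 for Sunday and 6 for Saturday.
rides$is_weekend <- as.POSIXlt(rides$started_at)$wday %in% c(0, 6)

# The mean of a logical vector is the proportion of TRUE values.
weekend_share_pct <- 100 * tapply(rides$is_weekend, rides$member_casual, mean)
weekend_share_pct
```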

member and casual usage by hour of day

cyclistic_data_2023_hour_of_day_column <- cyclistic_data_2023_clean %>%
  mutate(hour_of_day = hour(started_at))
cyclistic_data_2023_hour_of_day_table <- cyclistic_data_2023_hour_of_day_column %>%
  group_by(hour_of_day, member_casual) %>%
  summarize(count = n())

## `summarise()` has grouped output by 'hour_of_day'. You can override using the
## `.groups` argument.

ggplot(data = cyclistic_data_2023_hour_of_day_table) + geom_col(mapping = aes(x = hour_of_day, y = count, fill = member_casual), position = position_dodge()) + 
  labs(title = "Counts of Members and Casual Riders by Hour of Day",
       x = "Hour of Day",
       y = "Number of Journeys",
       fill = "Member/Casual") + 
  scale_x_continuous(breaks = seq(0, 23, by = 1)) + theme_minimal() + theme (axis.text.x = element_text(angle = 0, hjust = 0, vjust = 0.5, size = 5))

As we can see here, both member and casual usage peaks during the late afternoon; however, members are far more likely to cycle during the morning. Paired with the information from the previous graph, we could infer that members are using Cyclistic to cycle to and from work, as their most likely journeys would be at 7-8 am and 4-6 pm on Tuesdays, Wednesdays and Thursdays. Casuals, on the other hand, are most likely to be cycling at 4-6 pm at weekends and very unlikely to be cycling on weekday mornings.
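A short base R sketch of the hour extraction, using four start times from the rows printed earlier (the "UTC" time zone is an assumption). Fixing the factor levels to 0:23 keeps hours with no rides in the table as zeros, which a plain grouped count would drop.

```r
# Start times copied from rows printed earlier; the time zone is assumed.
started_at <- as.POSIXct(c("2023-01-02 07:51:57", "2023-01-10 15:37:36",
                           "2023-01-31 07:18:03", "2023-01-25 20:49:47"),
                         tz = "UTC")

# Hour of day as an integer from 0 to 23.
hour_of_day <- as.POSIXlt(started_at)$hour

# Count rides per hour, keeping all 24 hours even when the count is zero.
rides_per_hour <- table(factor(hour_of_day, levels = 0:23))

rides_per_hour[["7"]]  # two of the four rides start in the 7 am hour
```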

member and casual usage across 2023

cyclistic_data_2023_dates_column <- cyclistic_data_2023_clean %>%
  mutate(dates = as.Date(started_at))

cyclistic_data_2023_journey_date_table <- cyclistic_data_2023_dates_column %>%
  group_by(dates, member_casual) %>%
  summarize(count = n(), .groups = 'drop')

ggplot(data = cyclistic_data_2023_journey_date_table) + 
  geom_line(mapping = aes(x = dates, y = count, colour = member_casual, group = member_casual)) +
  geom_smooth(mapping = aes(x = dates, y = count, colour = member_casual, group = member_casual), 
              method = "loess", se = FALSE) +
  labs(title = "Counts of Members and Casual Riders by Day of the Year",
       x = "Date",
       y = "Number of Journeys",
       colour = "Member/Casual") +
  scale_x_date(date_breaks = "1 month", date_labels = "%b %Y") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1, vjust = 1, size = 10))

## `geom_smooth()` using formula = 'y ~ x'

head(cyclistic_data_2023_journey_date_table)

## # A tibble: 6 × 3
##   dates      member_casual count
##   <date>     <chr>         <int>
## 1 2023-01-01 casual         1991
## 2 2023-01-01 member         3144
## 3 2023-01-02 casual         1593
## 4 2023-01-02 member         3914
## 5 2023-01-03 casual         1145
## 6 2023-01-03 member         3982

As we can see in the above graph, members and casuals use Cyclistic similarly across the year, with both groups peaking during the hotter months. Interestingly, the highest peaks come from casual riders. We could use this to suggest that casuals are more strongly influenced by external factors, whereas members are somewhat more consistent.
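One way to put a number on "more consistent" is the coefficient of variation of the daily counts, that is, the standard deviation divided by the mean. The sketch below uses invented counts purely to show the calculation; it is not a result from the Cyclistic data.

```r
# Hypothetical daily journey counts, invented only to illustrate the calculation.
daily <- data.frame(
  member_casual = rep(c("casual", "member"), each = 4),
  count         = c(1000, 3000, 9000, 3000,   # casual: large swings
                    3000, 4000, 5000, 4000),  # member: steadier
  stringsAsFactors = FALSE
)

# Coefficient of variation per group; a higher value means less consistent usage.
cv <- tapply(daily$count, daily$member_casual, function(x) sd(x) / mean(x))
round(cv, 2)
```

Applied to the real daily table, a clearly higher value for casual riders would support the reading given above.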

member and casual usage by stations and routes

Here I am going to assess whether there are any key differences between the stations and routes that members and casual riders use.

member and casual journeys by station

cyclistic_data_2023_station_table <- cyclistic_data_2023_stations %>%
  group_by(start_station_name, member_casual) %>%
  summarize(count = n())

## `summarise()` has grouped output by 'start_station_name'. You can override
## using the `.groups` argument.

cyclistic_data_2023_station_table <- cyclistic_data_2023_station_table %>%
  filter(count > 15000)

head(cyclistic_data_2023_station_table)

## # A tibble: 6 × 3
## # Groups:   start_station_name [6]
##   start_station_name           member_casual count
##   <chr>                        <chr>         <int>
## 1 Broadway & Barry Ave         member        16807
## 2 Canal St & Adams St          member        16441
## 3 Clark St & Elm St            member        23077
## 4 Clinton St & Jackson Blvd    member        15975
## 5 Clinton St & Madison St      member        19110
## 6 Clinton St & Washington Blvd member        24313

ggplot(data = cyclistic_data_2023_station_table) + 
  geom_point(mapping = aes(x = start_station_name, y = count, colour = member_casual)) +
  labs(
    title = "Counts of Members and Casual Riders by Station",
    x = "Station Name",
    y = "Number of Journeys",
    colour = "Member/Casual"
  ) +
  theme_minimal() +
  theme(
    axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5, size = 10)
  )

As we can see above, there are far more stations from which members take over 15,000 journeys. This would suggest that members are more likely to cycle from the same station frequently, and so may use Cyclistic for journeys they make often, such as to a place of work. Looking at the names of the stations popular with casual riders, we can also see Theater on the Lake, Shedd Aquarium and Millennium Park, which again supports the idea that casuals use Cyclistic for recreational purposes.
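A fixed cut-off of 15,000 journeys naturally favours members, since they take more journeys overall. As a possible complement, here is a sketch that ranks stations within each rider group and keeps the top two. The station labels and counts are hypothetical.

```r
# Hypothetical station counts, invented for illustration.
station_counts <- data.frame(
  start_station_name = c("Station A", "Station A", "Station B",
                         "Station B", "Station C", "Station C"),
  member_casual      = rep(c("casual", "member"), times = 3),
  count              = c(500, 900, 1200, 300, 100, 1500),
  stringsAsFactors   = FALSE
)

# Sort by rider group, then by descending count within each group.
ranked <- station_counts[order(station_counts$member_casual,
                               -station_counts$count), ]

# Number the rows 1, 2, 3, ... within each rider group.
ranked$rank <- ave(ranked$count, ranked$member_casual, FUN = seq_along)

# Keep the two busiest stations for each group.
top_stations <- ranked[ranked$rank <= 2, ]
top_stations
```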

member and casual journeys based on route

cyclistic_data_2023_routes <- cyclistic_data_2023_stations %>%
  mutate(route = paste(start_station_name, end_station_name, sep = " to "))

cyclistic_data_2023_routes_table <- cyclistic_data_2023_routes %>%
  group_by(route, member_casual) %>%
  summarize(count = n())

## `summarise()` has grouped output by 'route'. You can override using the
## `.groups` argument.

cyclistic_data_2023_routes_table <- cyclistic_data_2023_routes_table %>%
  filter(count > 3000)

ggplot(data = cyclistic_data_2023_routes_table) + 
  geom_point(mapping = aes(x = route, y = count, colour = member_casual)) +
  labs(
    title = "Counts of Members and Casual Riders by Route",
    x = "Route taken",
    y = "Number of Journeys",
    colour = "Member/Casual"
  ) +
  theme_minimal() +
  theme(
    axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5, size = 4))

Looking at this graph, we can see that many of the routes frequently taken by casual riders begin and end at the same station. This tells us that these casual riders are not using the bikes to get from one place to another, which suggests that casuals are using the bikes for enjoyment rather than practicality. Furthermore, there is an outlier among the casual riders: despite there being fewer casual rides overall, the most used route for either group came from casual riders. This would suggest that casual riders are more likely to target specific routes, whereas members use the bikes as and when needed.
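The round-trip observation can be checked directly by flagging rides that start and end at the same station. A minimal sketch follows; the station names come from this report, but the individual trips are made up.

```r
# Hypothetical trips between stations named in this report.
rides <- data.frame(
  start_station_name = c("Millennium Park", "Shedd Aquarium", "Clark St & Elm St",
                         "Canal St & Adams St", "Millennium Park"),
  end_station_name   = c("Millennium Park", "Shedd Aquarium", "Canal St & Adams St",
                         "Clark St & Elm St", "Shedd Aquarium"),
  member_casual      = c("casual", "casual", "member", "member", "casual"),
  stringsAsFactors   = FALSE
)

# A round trip starts and ends at the same station.
rides$round_trip <- rides$start_station_name == rides$end_station_name

# Percentage of each group's rides that are round trips.
round_trip_pct <- 100 * tapply(rides$round_trip, rides$member_casual, mean)
round_trip_pct
```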

member and casual ride length

round_to_nearest_10 <- function(x) {
  round(x / 10) * 10
}
cyclistic_data_2023_ride_length_rounded <- cyclistic_data_2023_clean %>%
  mutate(ride_length_rounded = round_to_nearest_10(ride_length))

cyclistic_data_2023_ride_length_table <- cyclistic_data_2023_ride_length_rounded %>%
  group_by(ride_length_rounded, member_casual) %>%
  summarize(count = n())

## `summarise()` has grouped output by 'ride_length_rounded'. You can override
## using the `.groups` argument.

head(cyclistic_data_2023_ride_length_table)

## # A tibble: 6 × 3
## # Groups:   ride_length_rounded [3]
##   ride_length_rounded member_casual   count
##                 <dbl> <chr>           <int>
## 1                   0 casual         329506
## 2                   0 member         934209
## 3                  10 casual         912390
## 4                  10 member        1848371
## 5                  20 casual         379774
## 6                  20 member         555337

cyclistic_data_2023_ride_length_table <- cyclistic_data_2023_ride_length_table %>%
  filter(ride_length_rounded < 100)

ggplot(data = cyclistic_data_2023_ride_length_table) + 
  geom_col(mapping = aes(x = ride_length_rounded, y = count, fill = member_casual)) +
  labs(
    title = "Journeys Under 100 Minutes",
    x = "Journey Time Minutes",
    y = "Number of Journeys",
    fill = "Member/Casual"
  ) +
  theme_minimal() +
  theme(
    axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5, size = 10)
  )

As we can see here, members take a far larger percentage of their journeys in under 50 minutes. However, the scale of this graph makes the number of longer journeys hard to read. Because of this, I have created a second graph below which only includes journeys of over an hour.
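A note on reading the bars: round_to_nearest_10() labels each ride with the nearest multiple of ten, so the bar labelled 10 covers rides of roughly 5 to 15 minutes rather than 10 to 20. R's round() also follows the IEC 60559 convention of sending exact halves to the even digit. The sketch below compares this with floor-based bins on made-up durations; floor_bins is an alternative shown for contrast, not what the analysis above uses.

```r
# Hypothetical ride lengths in minutes.
ride_length <- c(4.9, 5, 15, 25, 35, 61.2)

# Nearest-ten labels, as produced by round_to_nearest_10() above.
# Exact halves (0.5, 1.5, 2.5, 3.5) go to the even digit.
rounded_bins <- round(ride_length / 10) * 10

# Alternative: left-closed ten-minute bins [0,10), [10,20), ...
floor_bins <- floor(ride_length / 10) * 10

data.frame(ride_length, rounded_bins, floor_bins)
```

With continuous ride lengths exact halves are rare, so the main practical difference is where each bar's range starts and ends.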

# rebuild the table from the rounded data, this time keeping the 70 to 290 minute bins
cyclistic_data_2023_ride_length_table <- cyclistic_data_2023_ride_length_rounded %>%
  group_by(ride_length_rounded, member_casual) %>%
  summarize(count = n(), .groups = "drop") %>%
  filter(ride_length_rounded > 60, ride_length_rounded < 300)

ggplot(data = cyclistic_data_2023_ride_length_table) + 
  geom_col(mapping = aes(x = ride_length_rounded, y = count, fill = member_casual)) +
  labs(
    title = "Journeys Over 60 Minutes",
    x = "Journey Time Minutes",
    y = "Number of Journeys",
    fill = "Member/Casual"
  ) +
  theme_minimal() +
  theme(
    axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5, size = 10)
  )

When looking at these two graphs together, we get a clear picture that casual riders are more likely to take longer journeys, which would again suggest that casual riders aren’t using the bikes primarily as a mode of transport. This may also explain why casual riders were more likely than members to use electric bikes, as electric bikes may feel more necessary for these longer journeys.

Conclusion

In conclusion, I feel that the data is telling us that casual riders prefer to use the bikes for recreational purposes, whilst members are more likely to use the bikes for transportation, with evidence suggesting that they use them to commute to work.

Here is a final visualisation which I feel displays this particularly well.

table_for_cyclystic_radar <- data.frame(
  likelihood_to_ride_at_weekend = c(40, 20, 24.08541, 36.24878),
  likelihood_to_ride_on_weekday = c(80, 60, 75.91459, 63.75122),
  likelihood_to_ride_during_summer = c(50, 30, 35.92805, 45.82951),
  likelihood_to_ride_6am_9am = c(30, 10, 19.3699, 10.87111),
  journey_under_30_minutes = c(100, 80, 96.39075, 79.58624),
  journey_over_1_hour = c(20, 0, 0.604647, 4.847663)
)

rownames(table_for_cyclystic_radar) <- c("max", "min", "member", "casual")

ggplot2_colors <- scales::hue_pal()(2)
colors_border <- rev(ggplot2_colors)
colors_in <- rev(sapply(ggplot2_colors, scales::alpha, 0.25))

radarchart(table_for_cyclystic_radar,
           axistype = 1,
           pcol = colors_border, 
           pfcol = colors_in, 
           plwd = 2, 
           plty = 1, 
           cglcol = "grey", 
           cglty = 1, 
           axislabcol = "grey",
           caxislabels = rep("", 5),
           vlcex = 0.8)

legend(x = "topright", legend = c("Casual", "Member"), 
       col = rev(colors_border), 
       lty = 1, 
       lwd = 2, 
       bty = "n", 
       pt.cex = 2, 
       cex = 0.8, 
       text.col = "black", 
       horiz = FALSE, 
       inset = c(0.1, 0.1))

The above visualisation gives, I believe, the clearest picture of the key differences between members and casuals. With each rung representing only 5 percentage points, we can see that these differences are significant but not absolute, and that there are some clear overlaps in how members and casuals use Cyclistic.

From what the data is telling me, here are my recommendations. I believe there are two main routes to increasing the number of Cyclistic members. Firstly, I would target casual riders whose usage is similar to that of current members and try to convince them to sign up as members. Secondly, I would create a new membership plan structured around the more recreational tendencies of casual riders. I believe this could convince many more people to join as members by tapping into a new market and demographic.

google_data_analytics_case_study_cyclistic

William Bennett

2024-08-05