Capstone Project for the Google Data Analytics

Background

Cyclistic is a bike-share company based in Chicago that launched a successful bike-sharing program in 2016. Throughout the years, the program has expanded significantly to a fleet of 5,824 bicycles and a network of 692 geotracked stations sprawled across the city. With the large number of bicycles across numerous stations, customers can rent bikes from one station and return them to any other station within the network at their convenience. This encourages people to opt for cycling as a mode of transportation, therefore contributing to the success of Cyclistic’s bike-sharing program.

Cyclistic’s marketing strategy has so far focused on building general awareness and appealing to broad consumer segments. The company offers flexibile pricing plans that cater to diverse needs of users including single-ride passes, full-day passes, and annual memberships. Besides, it provides reclining bikes, hand tricycles, and cargo bikes, effectively welcoming individuals with disabilities and those who can’t ride on the standard two-wheeled bicycles. Based on the company database, Cyclistic users are more likely to ride for leisure, but about 30% use them to commute to work each day. While traditional bikes remain as the popular option, around 8% of users opt for the assistive alternatives.

The company’s marketing director believes that the company’s future success depends on maximizing the number of annual memberships. Therefore, as a junior data analyst, my team and I have to understand how casual riders and annual members use Cyclistic bikes differently. From these insights, we will design a new marketing strategy to convert casual riders into annual members.


Ask

Understand the differences between casual and member usage by answering the following questions:

  • How do annual members and casual riders use Cyclistic bikes differently?
  • Why would casual riders buy Cyclistic annual memberships?
  • How can Cyclistic use digital media to influence casual riders to become members?

Prepare

Installing & Loading tidyverse and dplyr for data manipulation

Dataset to be used for this analysis can be found at divvy-tripdata. The data has been made available by Motivate International Inc. under this license. The Q1 2020 and Q1 2021 (Q1 2019 data was unavailable) data were used in this analysis.


Process

R was used to combine, clean, analyze and visualize. This includes the following files that were added:

  • Divvy_Trips_2020_Q1.csv
  • Divvy_Trips_2021_Q1.csv

Analyze

Install packages necessary for manipulation, analysis and visualization

install.packages("tidyverse")
## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.4'
## (as 'lib' is unspecified)
install.packages("dplyr")
## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.4'
## (as 'lib' is unspecified)
install.packages("janitor")
## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.4'
## (as 'lib' is unspecified)
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(dplyr)
library(janitor)
## 
## Attaching package: 'janitor'
## 
## The following objects are masked from 'package:stats':
## 
##     chisq.test, fisher.test

Set working directory, read in csv files for analysis and assign to dataframes df1-df2

setwd('/cloud/project/')
df1 <- read_csv('/cloud/project/Divvy_Trips_2020_Q1.csv')
## Rows: 426887 Columns: 15
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (7): ride_id, rideable_type, started_at, ended_at, start_station_name, ...
## dbl  (7): start_station_id, end_station_id, start_lat, start_lng, end_lat, e...
## time (1): ride_length
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
df2 <- read_csv('/cloud/project/Divvy_Trips_2021_Q1.csv')
## Rows: 96834 Columns: 15
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (9): ride_id, rideable_type, started_at, ended_at, start_station_name, ...
## dbl  (5): start_lat, start_lng, end_lat, end_lng, day_of_week
## time (1): ride_length
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Change start and end station IDs to numerical data type from character in df2 for consistency and union of dataframes

df2$start_station_id =as.numeric(as.character(df2$start_station_id))
## Warning: NAs introduced by coercion
df2$end_station_id = as.numeric(as.character(df2$end_station_id))
## Warning: NAs introduced by coercion

Combine dataframes via bind_rows and assign to the bikeshare_data dataframe

bikeshare_data <- bind_rows(df1, df2)

Find and remove records with null values

bikeshare_data <- na.omit(bikeshare_data)

Change data type of ride_length field to numeric and check summary statistics

bikeshare_data$ride_length <- as.numeric(bikeshare_data$ride_length)
summary(bikeshare_data$ride_length)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     0.0   318.0   518.0   791.6   861.0 86399.0

Create histogram for breadth of values of ride_length field

hist(bikeshare_data$ride_length)

Find IQR and outliers of ride_length

Q1 <- quantile(bikeshare_data$ride_length, .25)
Q3 <- quantile(bikeshare_data$ride_length, .75)
IQR <- IQR(bikeshare_data$ride_length)
outliers <- subset(bikeshare_data, bikeshare_data$ride_length<(Q1 - 1.5*IQR) | bikeshare_data$ride_length>(Q3 + 1.5*IQR))
dim(outliers)
## [1] 14589    15

It is reasonable to assume that bikes will be returned daily, so we’re limiting ride length to 8 hours max and and 5 minutes minimum

bikeshare_data <- bikeshare_data[!(bikeshare_data$ride_length > 480),]
bikeshare_data <- bikeshare_data[!(bikeshare_data$ride_length < 5),]

Looking at boxplot after removal of bad records to see distribution of values

ggplot(bikeshare_data, aes(x=member_casual, y=ride_length, fill=member_casual)) +
  geom_boxplot()

Ensure there are no records after clean up that contain NA

rows_with_na <- bikeshare_data[apply(bikeshare_data, 1, function(x) any(is.na(x))
  ), ]

Install and load lubridate so we can create date, hour, weekday, month, year and time_start fields and convert to appropriate data type for aggregate analysis

install.packages("lubridate")
## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.4'
## (as 'lib' is unspecified)
library(lubridate)
bikeshare_data$date <- ymd_hms(bikeshare_data$started_at)
bikeshare_data$weekday <- weekdays(bikeshare_data$date)
bikeshare_data$month_number <- format(bikeshare_data$date, "%m")
bikeshare_data$month_number <- as.numeric(bikeshare_data$month_number)
bikeshare_data$month <- month.name[bikeshare_data$month_number]
bikeshare_data$year <- year(bikeshare_data$date)
bikeshare_data$year <- as.character(bikeshare_data$year)
bikeshare_data$start_hour <- hour(bikeshare_data$date)

Aggregate rideable_type by month for Q1

bikeshare_data %>% 
  group_by(rideable_type, month) %>% 
  summarize(count = n()) %>% 
  arrange(desc(count))
## `summarise()` has grouped output by 'rideable_type'. You can override using the
## `.groups` argument.
## # A tibble: 4 × 3
## # Groups:   rideable_type [3]
##   rideable_type month    count
##   <chr>         <chr>    <int>
## 1 docked_bike   January  67330
## 2 docked_bike   February 28497
## 3 classic_bike  January   4620
## 4 electric_bike January   1824

Aggregate rideable_type by weekdays for Q1

bikeshare_data %>%
  group_by(rideable_type, weekday) %>%
  summarize(count = n()) %>%
  arrange(desc(count))
## `summarise()` has grouped output by 'rideable_type'. You can override using the
## `.groups` argument.
## # A tibble: 21 × 3
## # Groups:   rideable_type [3]
##    rideable_type weekday   count
##    <chr>         <chr>     <int>
##  1 docked_bike   Tuesday   17321
##  2 docked_bike   Thursday  17187
##  3 docked_bike   Wednesday 16697
##  4 docked_bike   Friday    15826
##  5 docked_bike   Monday    15640
##  6 docked_bike   Sunday     6579
##  7 docked_bike   Saturday   6577
##  8 classic_bike  Friday      736
##  9 classic_bike  Thursday    708
## 10 classic_bike  Saturday    705
## # ℹ 11 more rows

Aggregate rideable_type by year for Q1

bikeshare_data %>%
  group_by(rideable_type, year) %>%
  summarize(count = n()) %>%
  arrange(desc(count))
## `summarise()` has grouped output by 'rideable_type'. You can override using the
## `.groups` argument.
## # A tibble: 4 × 3
## # Groups:   rideable_type [3]
##   rideable_type year  count
##   <chr>         <chr> <int>
## 1 docked_bike   2020  95776
## 2 classic_bike  2021   4620
## 3 electric_bike 2021   1824
## 4 docked_bike   2021     51

Aggregate member_casual riders by weekdays for Q1

bikeshare_data %>%
  group_by(member_casual, weekday) %>%
  summarize(count = n()) %>%
  arrange(desc(count))
## `summarise()` has grouped output by 'member_casual'. You can override using the
## `.groups` argument.
## # A tibble: 14 × 3
## # Groups:   member_casual [2]
##    member_casual weekday   count
##    <chr>         <chr>     <int>
##  1 member        Tuesday   17765
##  2 member        Thursday  17704
##  3 member        Wednesday 17168
##  4 member        Friday    16402
##  5 member        Monday    16203
##  6 member        Saturday   7139
##  7 member        Sunday     6927
##  8 casual        Friday      469
##  9 casual        Wednesday   459
## 10 casual        Thursday    456
## 11 casual        Tuesday     425
## 12 casual        Saturday    407
## 13 casual        Sunday      376
## 14 casual        Monday      371

Summarize member_casual riders by weekday for Q1

bikeshare_data %>%
  group_by(member_casual, weekday) %>%
  summarise(number_of_rides = n(),
            average_ride_duration = mean(ride_length),
            median = median(ride_length))%>%
            arrange(desc(number_of_rides))
## `summarise()` has grouped output by 'member_casual'. You can override using the
## `.groups` argument.
## # A tibble: 14 × 5
## # Groups:   member_casual [2]
##    member_casual weekday   number_of_rides average_ride_duration median
##    <chr>         <chr>               <int>                 <dbl>  <dbl>
##  1 member        Tuesday             17765                  303.    308
##  2 member        Thursday            17704                  302.    306
##  3 member        Wednesday           17168                  300.    303
##  4 member        Friday              16402                  302.    306
##  5 member        Monday              16203                  301.    305
##  6 member        Saturday             7139                  294.    299
##  7 member        Sunday               6927                  294.    296
##  8 casual        Friday                469                  313.    327
##  9 casual        Wednesday             459                  297.    328
## 10 casual        Thursday              456                  321.    337
## 11 casual        Tuesday               425                  331.    349
## 12 casual        Saturday              407                  325.    352
## 13 casual        Sunday                376                  321.    349
## 14 casual        Monday                371                  324.    334

Summary stats for member_casual riders for Q1

bikeshare_data %>%
  group_by(member_casual) %>%
     summarise(total_rides = length(member_casual), 
     percentage = total_rides/nrow(bikeshare_data) * 100,
     avg_ride_length = mean(ride_length), 
     median = median(ride_length))
## # A tibble: 2 × 5
##   member_casual total_rides percentage avg_ride_length median
##   <chr>               <int>      <dbl>           <dbl>  <dbl>
## 1 casual               2963       2.90            318.    339
## 2 member              99308      97.1             301.    304

Summary stats for rideable_type for Q1

bikeshare_data %>%
  group_by(rideable_type) %>%
     summarise(total_rides = length(member_casual), 
     percentage = total_rides/nrow(bikeshare_data) * 100,
     avg_ride_length = mean(ride_length), 
     median = median(ride_length))
## # A tibble: 3 × 5
##   rideable_type total_rides percentage avg_ride_length median
##   <chr>               <int>      <dbl>           <dbl>  <dbl>
## 1 classic_bike         4620       4.52            287.    294
## 2 docked_bike         95827      93.7             302.    306
## 3 electric_bike        1824       1.78            292.    300

Summarizing rideable_type by weekday for Q1

bikeshare_data %>%
  group_by(rideable_type, weekday) %>%
  summarise(number_of_rides = n(),
            average_ride_duration = mean(ride_length),
            median = median(ride_length),
            max = max(ride_length),
            min = min(ride_length)) %>%
  arrange(desc(number_of_rides))
## `summarise()` has grouped output by 'rideable_type'. You can override using the
## `.groups` argument.
## # A tibble: 21 × 7
## # Groups:   rideable_type [3]
##    rideable_type weekday   number_of_rides average_ride_duration median   max
##    <chr>         <chr>               <int>                 <dbl>  <dbl> <dbl>
##  1 docked_bike   Tuesday             17321                  304.   308    480
##  2 docked_bike   Thursday            17187                  303.   307    480
##  3 docked_bike   Wednesday           16697                  301.   304    480
##  4 docked_bike   Friday              15826                  304.   308    480
##  5 docked_bike   Monday              15640                  302.   306    480
##  6 docked_bike   Sunday               6579                  296.   299    480
##  7 docked_bike   Saturday             6577                  298.   303    480
##  8 classic_bike  Friday                736                  284.   285    480
##  9 classic_bike  Thursday              708                  290.   294.   480
## 10 classic_bike  Saturday              705                  277.   284    480
## # ℹ 11 more rows
## # ℹ 1 more variable: min <dbl>

Aggregate rideable_type by member_casual by day of week for Q1

bikeshare_data %>% 
  group_by(rideable_type,member_casual,weekday) %>% 
  summarise(total_count=n()) %>%
  arrange(desc(total_count))
## `summarise()` has grouped output by 'rideable_type', 'member_casual'. You can
## override using the `.groups` argument.
## # A tibble: 42 × 4
## # Groups:   rideable_type, member_casual [6]
##    rideable_type member_casual weekday   total_count
##    <chr>         <chr>         <chr>           <int>
##  1 docked_bike   member        Tuesday         16975
##  2 docked_bike   member        Thursday        16841
##  3 docked_bike   member        Wednesday       16331
##  4 docked_bike   member        Friday          15458
##  5 docked_bike   member        Monday          15366
##  6 docked_bike   member        Saturday         6303
##  7 docked_bike   member        Sunday           6293
##  8 classic_bike  member        Friday            684
##  9 classic_bike  member        Thursday          644
## 10 classic_bike  member        Saturday          617
## # ℹ 32 more rows

Aggregate member_casual by month for Q1

bikeshare_data %>%
  group_by(member_casual, month) %>%
  summarize(count = n()) %>%
  arrange(desc(count))
## `summarise()` has grouped output by 'member_casual'. You can override using the
## `.groups` argument.
## # A tibble: 4 × 3
## # Groups:   member_casual [2]
##   member_casual month    count
##   <chr>         <chr>    <int>
## 1 member        January  71596
## 2 member        February 27712
## 3 casual        January   2178
## 4 casual        February   785

Aggregate member_casual by rideable_type by weekday for Q1

bikeshare_data %>%
  group_by(member_casual, rideable_type, weekday) %>%
  summarize(count = n()) %>%
  arrange(desc(count))
## `summarise()` has grouped output by 'member_casual', 'rideable_type'. You can
## override using the `.groups` argument.
## # A tibble: 42 × 4
## # Groups:   member_casual, rideable_type [6]
##    member_casual rideable_type weekday   count
##    <chr>         <chr>         <chr>     <int>
##  1 member        docked_bike   Tuesday   16975
##  2 member        docked_bike   Thursday  16841
##  3 member        docked_bike   Wednesday 16331
##  4 member        docked_bike   Friday    15458
##  5 member        docked_bike   Monday    15366
##  6 member        docked_bike   Saturday   6303
##  7 member        docked_bike   Sunday     6293
##  8 member        classic_bike  Friday      684
##  9 member        classic_bike  Thursday    644
## 10 member        classic_bike  Saturday    617
## # ℹ 32 more rows

Looking at top stations summary for Q1

bikeshare_data %>%
  group_by(start_station_name) %>%
  summarise(total_count = length((start_station_name))) %>%
  arrange(desc(total_count)) %>% top_n(10)
## Selecting by total_count
## # A tibble: 10 × 2
##    start_station_name           total_count
##    <chr>                              <int>
##  1 Clinton St & Madison St             1852
##  2 Kingsbury St & Kinzie St            1658
##  3 Canal St & Adams St                 1657
##  4 Clinton St & Washington Blvd        1392
##  5 Clinton St & Lake St                1117
##  6 Columbus Dr & Randolph St           1044
##  7 Canal St & Madison St                984
##  8 Dearborn St & Erie St                951
##  9 LaSalle St & Jackson Blvd            918
## 10 Desplaines St & Kinzie St            914

Looking at top stations use by member type for Q1

bikeshare_data %>%
  group_by(start_station_name, member_casual) %>%
  summarise(total_count = length((start_station_name))) %>%
  arrange(desc(total_count)) %>% top_n(10)
## `summarise()` has grouped output by 'start_station_name'. You can override
## using the `.groups` argument.
## Selecting by total_count
## # A tibble: 1,014 × 3
## # Groups:   start_station_name [593]
##    start_station_name           member_casual total_count
##    <chr>                        <chr>               <int>
##  1 Clinton St & Madison St      member               1837
##  2 Kingsbury St & Kinzie St     member               1637
##  3 Canal St & Adams St          member               1628
##  4 Clinton St & Washington Blvd member               1385
##  5 Clinton St & Lake St         member               1085
##  6 Columbus Dr & Randolph St    member               1030
##  7 Canal St & Madison St        member                963
##  8 Dearborn St & Erie St        member                929
##  9 LaSalle St & Jackson Blvd    member                913
## 10 Desplaines St & Kinzie St    member                885
## # ℹ 1,004 more rows

Looking at top stations by rideable type for Q1

bikeshare_data %>%
  group_by(start_station_name, rideable_type) %>%
  summarise(total_count = length((start_station_name))) %>%
  arrange(desc(total_count)) %>% top_n(10)
## `summarise()` has grouped output by 'start_station_name'. You can override
## using the `.groups` argument.
## Selecting by total_count
## # A tibble: 1,002 × 3
## # Groups:   start_station_name [593]
##    start_station_name           rideable_type total_count
##    <chr>                        <chr>               <int>
##  1 Clinton St & Madison St      docked_bike          1852
##  2 Kingsbury St & Kinzie St     docked_bike          1658
##  3 Canal St & Adams St          docked_bike          1579
##  4 Clinton St & Washington Blvd docked_bike          1392
##  5 Clinton St & Lake St         docked_bike          1035
##  6 Canal St & Madison St        docked_bike           959
##  7 Columbus Dr & Randolph St    docked_bike           929
##  8 LaSalle St & Jackson Blvd    docked_bike           918
##  9 Desplaines St & Kinzie St    docked_bike           914
## 10 University Ave & 57th St     docked_bike           884
## # ℹ 992 more rows

Aggregating top stations by member_type by rideable_type for Q1

bikeshare_data %>%
  group_by(start_station_name, member_casual, rideable_type) %>%
  summarise(total_count = length((start_station_name))) %>%
  arrange(desc(total_count)) %>% top_n(10)
## `summarise()` has grouped output by 'start_station_name', 'member_casual'. You
## can override using the `.groups` argument.
## Selecting by total_count
## # A tibble: 1,624 × 4
## # Groups:   start_station_name, member_casual [1,014]
##    start_station_name           member_casual rideable_type total_count
##    <chr>                        <chr>         <chr>               <int>
##  1 Clinton St & Madison St      member        docked_bike          1837
##  2 Kingsbury St & Kinzie St     member        docked_bike          1637
##  3 Canal St & Adams St          member        docked_bike          1559
##  4 Clinton St & Washington Blvd member        docked_bike          1385
##  5 Clinton St & Lake St         member        docked_bike          1015
##  6 Canal St & Madison St        member        docked_bike           940
##  7 Columbus Dr & Randolph St    member        docked_bike           921
##  8 LaSalle St & Jackson Blvd    member        docked_bike           913
##  9 Desplaines St & Kinzie St    member        docked_bike           885
## 10 University Ave & 57th St     member        docked_bike           849
## # ℹ 1,614 more rows

Creating top stations dataframe

top_station <- bikeshare_data %>%
  group_by(start_station_name) %>%
  summarise(total_count = length(na.omit(start_station_name))) %>%
  arrange(desc(total_count)) %>% 
  top_n(10)
## Selecting by total_count

Share

Looking at various visualizations of the analysis to help better understand behavior

Total rides by rideable_type for Q1

ggplot(bikeshare_data, aes(x=rideable_type)) + geom_bar() + geom_text(aes(label = ..count..), stat = "count", colour = "black", hjust = 0.25, vjust = 0)
## Warning: The dot-dot notation (`..count..`) was deprecated in ggplot2 3.4.0.
## ℹ Please use `after_stat(count)` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

Total rides by member_casual for Q1

ggplot(bikeshare_data, aes(x=member_casual)) + geom_bar() + geom_text(aes(label = ..count..), stat = "count", colour = "black", hjust = 0.25, vjust = 0)

Top start_stations by total rides for Q1

 ggplot(data  = top_station) +
  geom_col(mapping = aes(x = start_station_name, y = total_count, 
                         fill = start_station_name), position = "dodge") +
  labs(title = "Top Ten Start Station") + 
  theme(axis.text.x = element_text(angle = 90))

Ride length by day of week for Q1

 bikeshare_data %>%
  group_by(member_casual, day_of_week) %>% 
  summarise(average_ride_length = mean(ride_length)) %>% 
  ggplot(aes(x = day_of_week, y = average_ride_length, fill = member_casual)) +
  geom_col(position = "dodge") +
  geom_text(aes(label = format(round(average_ride_length, 2)), vjust = 0.5)) +
  labs(title = "Avg Ride Length by Member Type by Day of Week")
## `summarise()` has grouped output by 'member_casual'. You can override using the
## `.groups` argument.

Start time by weekdays for Q1

bikeshare_data %>%
  group_by(start_hour, weekday, member_casual) %>% 
  summarise(number_of_rides = n()) %>%
  ggplot(aes(x = start_hour, y = number_of_rides, fill = weekday, color = member_casual)) + 
  geom_point(size = 3, alpha = 0.4) +
  labs(title = "Start Hour for Weekdays") +
  geom_line(size = 0.5)
## `summarise()` has grouped output by 'start_hour', 'weekday'. You can override
## using the `.groups` argument.
## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.


Summary

  • Both members and casual riders prefer docked bikes over classic and electric bikes.
  • The average ride time for members is 300 minutes and for casual riders is 318 minutes.
  • The most active days for members are Tuesdays, Thursdays and Wednesdays and for casual riders Fridays, Wednesdays and Thursdays, respectively.
  • January is the more active month for Q1 for both member and casual riders.
  • Members make up 97.1% of the rides and casual riders compose the remaining 2.9%.
  • For members, we see a precipitous increase in usage starting at 8AM in the morning and again at 5PM in the evening. Casual riders’ start hours appear to be random for the most part, with peaks also around 8AM and 5PM.
  • As for station behavior, most of the volume occurs at Clinton St & Madison St. For casual riders, Ellis Ave & 60th St. gets the most traffic.

Act

  • Create Google Business Profiles for all stations to help with visibility in Google Search, Maps and local pack
  • Since electric bikes are used only 1.78% of the time, promoting them through social media, email and creating and optimizing content specifically for them may help bolster usage.
  • Casual riders only account for 2.89% of total rides, so targeting them through social media, email and creating and optimizing content specifically for them may help increase usage rates.
  • Incentivize casual riders to buy memberships through discounted rates or promotions.
  • Offer a rewards program to help with churn rate.