Cyclistic Portfolio Project

Olivia S. Engl

About The Company

In 2016, Cyclistic launched a successful bike-share offering. Since then, the program has grown to a fleet of 5,824 bicycles that are geotracked and locked into a network of 692 stations across Chicago. The bikes can be unlocked from one station and returned to any other station in the system anytime. Customers who purchase single-ride or full-day passes are referred to as casual riders. Customers who purchase annual memberships are Cyclistic members.

Cyclistic’s finance analysts have concluded that annual members are much more profitable than casual riders. Cyclistic wants to design marketing strategies aimed at converting casual riders into annual members. In order to do that, however, the marketing analyst team needs to better understand how annual members and casual riders differ, why casual riders would buy a membership, and how digital media could affect their marketing tactics.

Business Task

  • Define differences between the bike usage of annual members and casual riders.

  • Why would casual riders buy Cyclistic annual memberships?

  • How can Cyclistic use digital media to influence casual riders to become members?

Data Source And Modifications

Source

The data has been made available by Motivate international Inc. under this this agreement LINK.

Limitations

  • For the purpose of this portfolio project and due to Rstudio memory constraints, a random, unbiased sample has been extracted from the original data frames. The sample size is 1/10th of the original data.

    Following code was used to create the sample for each quartale:

    set.seed(2022286) Cyclistic_Q1_2019<- Divvy_Trips_2019_Q1 %>% sample_n(36506)

  • Demographic information (age and gender) is only available for subscribers and not for casual riders, also referred to as customers.

  • Every ride has a unique code (primary key) and we do not have a customer identification code; therefore we can only analyze the age and gender distribution according to the number of rides, rather than for the number of individuals.

Modifications

Following modifications of the data have been performed during the data cleaning process:

  • Subscriber’s age had several outliers, with some people aged well over 100 years old. After looking at the age distribution a cutoff age of 85 was defined. All ages over 85 were replaced with NA.

  • Trip duration varied from 1 minute to 3 months. The Interquartal Range method was used to remove outliers.

  • Where trip duration was under 3 minutes and start and end locations were the same, it was assumed that no real trip took place and the values were removed from the dataset. This could for instance be due to technical issues with the bike.

Data Cleaning

Install Packages.

library("tidyverse")
library("janitor")
library("skimr")
library("lubridate")
library("naniar")
library("waffle")
library("formattable")

Cyclistic_Q1_2019 <- read.csv("Cyclistic_Q1_2019.csv")
Cyclistic_Q2_2019 <- read.csv("Cyclistic_Q2_2019.csv")
Cyclistic_Q3_2019 <- read.csv("Cyclistic_Q3_2019.csv")
Cyclistic_Q4_2019 <- read.csv("Cyclistic_Q4_2019.csv")

Data Overview

Check that all variables are in the right order and have the same format.

glimpse(Cyclistic_Q1_2019)
## Rows: 36,506
## Columns: 12
## $ trip_id           <int> 21814521, 22132588, 22077402, 22128279, 22155279, 21…
## $ start_time        <chr> "1/14/2019 17:32", "3/26/2019 6:46", "3/18/2019 15:2…
## $ end_time          <chr> "1/14/2019 17:41", "3/26/2019 7:00", "3/18/2019 15:3…
## $ bikeid            <int> 3780, 786, 5417, 795, 223, 1184, 4157, 5753, 4527, 3…
## $ tripduration      <chr> "575", "844", "253", "369", "327", "84", "501", "581…
## $ from_station_id   <int> 174, 637, 174, 192, 40, 91, 172, 199, 5, 466, 161, 3…
## $ from_station_name <chr> "Canal St & Madison St", "Wood St & Chicago Ave (*)"…
## $ to_station_id     <int> 620, 107, 283, 108, 164, 66, 181, 174, 624, 452, 211…
## $ to_station_name   <chr> "Orleans St & Chestnut St (NEXT Apts)", "Desplaines …
## $ usertype          <chr> "Subscriber", "Subscriber", "Subscriber", "Subscribe…
## $ gender            <chr> "Male", "Male", "Male", "Male", "Male", "Male", "Mal…
## $ birthyear         <int> 1981, 1990, 1990, 1983, 1984, 1986, 1979, 1966, 1985…
glimpse(Cyclistic_Q2_2019)
## Rows: 110,816
## Columns: 12
## $ X01...Rental.Details.Rental.ID                    <int> 22561350, 22556568, …
## $ X01...Rental.Details.Local.Start.Time             <chr> "5/6/2019 16:59", "5…
## $ X01...Rental.Details.Local.End.Time               <chr> "5/6/2019 17:18", "5…
## $ X01...Rental.Details.Bike.ID                      <int> 1553, 5917, 3449, 63…
## $ X01...Rental.Details.Duration.In.Seconds.Uncapped <chr> "1,130.00", "486", "…
## $ X03...Rental.Start.Station.ID                     <int> 49, 277, 47, 50, 602…
## $ X03...Rental.Start.Station.Name                   <chr> "Dearborn St & Monro…
## $ X02...Rental.End.Station.ID                       <int> 118, 84, 26, 39, 663…
## $ X02...Rental.End.Station.Name                     <chr> "Sedgwick St & North…
## $ User.Type                                         <chr> "Subscriber", "Subsc…
## $ Member.Gender                                     <chr> "Male", "Male", "Fem…
## $ X05...Member.Details.Member.Birthday.Year         <int> 1967, 1974, 1971, NA…
glimpse(Cyclistic_Q3_2019)
## Rows: 164,071
## Columns: 12
## $ trip_id           <int> 24672286, 23846067, 23842155, 24002175, 24967257, 24…
## $ start_time        <chr> "8/29/2019 22:44", "7/19/2019 16:29", "7/19/2019 12:…
## $ end_time          <chr> "8/29/2019 23:10", "7/19/2019 16:36", "7/19/2019 14:…
## $ bikeid            <int> 3033, 3280, 6367, 4329, 5886, 3701, 577, 5414, 688, …
## $ tripduration      <int> 1519, 405, 7120, 603, 21561, 338, 575, 354, 2457, 70…
## $ from_station_id   <int> 52, 38, 45, 331, 76, 301, 418, 107, 93, 13, 24, 2, 3…
## $ from_station_name <chr> "Michigan Ave & Lake St", "Clark St & Lake St", "Mic…
## $ to_station_id     <int> 25, 624, 35, 268, 35, 140, 345, 56, 504, 289, 51, 35…
## $ to_station_name   <chr> "Michigan Ave & Pearson St", "Dearborn St & Van Bure…
## $ usertype          <chr> "Customer", "Subscriber", "Customer", "Customer", "C…
## $ gender            <chr> NA, "Male", NA, "Female", "Male", "Male", NA, "Male"…
## $ birthyear         <int> NA, 1991, NA, 1995, 1961, 1988, 1992, 1969, NA, 1993…
glimpse(Cyclistic_Q4_2019)
## Rows: 70,405
## Columns: 12
## $ trip_id           <int> 25288982, 25562503, 25558591, 25719662, 25583344, 25…
## $ start_time        <chr> "10/5/2019 12:03", "10/25/2019 16:32", "10/25/2019 1…
## $ end_time          <chr> "10/5/2019 12:10", "10/25/2019 16:41", "10/25/2019 1…
## $ bikeid            <int> 999, 2983, 1655, 130, 5722, 313, 5610, 892, 420, 179…
## $ tripduration      <int> 413, 536, 499, 217, 740, 716, 369, 329, 3384, 253, 5…
## $ from_station_id   <int> 94, 37, 201, 337, 349, 160, 117, 58, 636, 154, 133, …
## $ from_station_name <chr> "Clark St & Armitage Ave", "Dearborn St & Adams St",…
## $ to_station_id     <int> 220, 47, 148, 211, 304, 20, 256, 69, 304, 319, 77, 7…
## $ to_station_name   <chr> "Clark St & Drummond Pl", "State St & Kinzie St", "S…
## $ usertype          <chr> "Subscriber", "Subscriber", "Subscriber", "Subscribe…
## $ gender            <chr> "Male", "Male", "Female", "Male", "Female", "Male", …
## $ birthyear         <int> 1993, 1979, 1993, 1984, 1996, 1986, 1990, 1984, NA, …

Prepare Dataframes for Binding

The column titles in Q2 are inconsistent with the other files. Rename them in order to be able to merge the data frames.

Cyclistic_Q2_2019 <- Cyclistic_Q2_2019 %>%
  clean_names()%>% 
  rename(trip_id = x01_rental_details_rental_id,
         start_time = x01_rental_details_local_start_time,
         end_time = x01_rental_details_local_end_time,
         bikeid = x01_rental_details_bike_id,
         tripduration = x01_rental_details_duration_in_seconds_uncapped,
         from_station_id = x03_rental_start_station_id,
         from_station_name = x03_rental_start_station_name,
         to_station_id = x02_rental_end_station_id,
         to_station_name = x02_rental_end_station_name,
         usertype = user_type,
         gender = member_gender,
         birthyear = x05_member_details_member_birthday_year)

Remove commas from tripduration values and convert to integer

Cyclistic_Q1_2019 <- Cyclistic_Q1_2019 %>%
  mutate(tripduration=gsub("\\,", "", tripduration))
Cyclistic_Q2_2019 <- Cyclistic_Q2_2019 %>%
  mutate(tripduration=gsub("\\,", "", tripduration))
Cyclistic_Q3_2019 <- Cyclistic_Q3_2019 %>%
  mutate(tripduration=gsub("\\,", "", tripduration))
Cyclistic_Q4_2019 <- Cyclistic_Q4_2019 %>%
  mutate(tripduration=gsub("\\,", "", tripduration))

Cyclistic_Q1_2019$tripduration <- as.integer(Cyclistic_Q1_2019$tripduration)
Cyclistic_Q2_2019$tripduration <- as.integer(Cyclistic_Q2_2019$tripduration)
Cyclistic_Q3_2019$tripduration <- as.integer(Cyclistic_Q3_2019$tripduration)
Cyclistic_Q4_2019$tripduration <- as.integer(Cyclistic_Q4_2019$tripduration)

Create a single dataset using rbind().

Cyclistic_2019 <- rbind(Cyclistic_Q1_2019,
                        Cyclistic_Q2_2019,
                        Cyclistic_Q3_2019,
                        Cyclistic_Q4_2019)

Clean New Dataset

Rename the column names and make them consistent.

Cyclistic_2019 <- Cyclistic_2019 %>%
  clean_names()%>%
  rename( bike_id = bikeid,
          trip_duration = tripduration,
          user_type = usertype,
          birth_year = birthyear)

Check for duplicates in trip_id (should be unique values).

Cyclistic_2019 %>%
  distinct(trip_id) # There are no duplicates in trip_id

Convert start and end time from character to datetime.

Cyclistic_2019$start_time <- as.POSIXlt(Cyclistic_2019$start_time,format="%m/%d/%Y %H:%M")
Cyclistic_2019$end_time <- as.POSIXlt(Cyclistic_2019$end_time,format="%m/%d/%Y %H:%M")

Recalculate trip duration to check the time unit used in the data set.

difftime(Cyclistic_2019$end_time, Cyclistic_2019$start_time, 
         units = "secs" )

Trip duration is currently in seconds. Convert to minutes.

Cyclistic_2019 <- Cyclistic_2019 %>%
  mutate(trip_duration_minutes= round(trip_duration/60 , 1))

Identify NA Values.

sapply(Cyclistic_2019, function(x) sum(is.na(x)))
##               trip_id            start_time              end_time 
##                     0                     0                     0 
##               bike_id         trip_duration       from_station_id 
##                     0                     0                     0 
##     from_station_name         to_station_id       to_station_name 
##                     0                     0                     0 
##             user_type                gender            birth_year 
##                     0                 35518                 54222 
## trip_duration_minutes 
##                     0

The only NAs are in “gender” and “birth year”.

We only know the gender for Subscribers. Therefore, replace NA with “Unknown” for Customers.

Cyclistic_2019 <- Cyclistic_2019 %>%
  mutate(gender= as.character(gender))%>%
  mutate(gender= if_else(user_type== "Customer", "Unknown", gender))

Assign the right class to each variable.

Cyclistic_2019 <- Cyclistic_2019 %>%
  mutate(trip_id = as.integer(trip_id), 
         bike_id = as.integer(bike_id),
         from_station_id = as.integer(from_station_id),
         to_station_id = as.integer(to_station_id),
         user_type = as.factor(user_type),
         gender = as.factor(gender),
         birth_year = as.integer(birth_year))

Add variables weekday and month.

Cyclistic_2019 <- Cyclistic_2019 %>%
  mutate(weekday = wday(start_time, label = TRUE, abbr = FALSE),
         month = month(start_time, label = TRUE, abbr = FALSE)) 

Create an age variable.

Cyclistic_2019 <- Cyclistic_2019 %>%
  mutate(age = 2019- birth_year)

Check trip duration.

Cyclistic_2019 <- Cyclistic_2019 %>%
  arrange(trip_duration_minutes)

The duration of the trip varies significantly, from 1 minute to three months. Identify outliers.

options(scipen = 999) #turns off scientific mode
hist(Cyclistic_2019$trip_duration_minutes)

Cyclistic_2019%>%
  ggplot(aes(month , trip_duration_minutes))+
  geom_boxplot(varwidth=T, fill="plum", outlier.color = "red")+
  ylim(0,300)

Both the histogram and the boxplot confirm that there are several outliers that do not represent the typical use of customers and members.

Use the IQR method to find and remove outliers.

Q <-  quantile(Cyclistic_2019$trip_duration_minutes, 
               probs = c(0.25, 0.75), na.rm = FALSE)
IQR <- IQR(Cyclistic_2019$trip_duration_minutes)

upper_inner_fence <- (21.4 + 1.5 * 14.6) # Q2 + 1.5*IQR
lower_inner_fence <- (6.8 - 1.5 * 14.6) # Q2 + 1.5*IQR

Cyclistic_2019_Outliers_Excluded <- Cyclistic_2019 %>%
  filter(trip_duration_minutes > 0 & trip_duration_minutes < 43.3)

Also, an assumption is made that trips under 3 minutes, where start station equals the end station, are likely due to technical issues with the bike and should not be included in the analysis.

Cyclistic_2019_Outliers_Excluded <- 
  Cyclistic_2019_Outliers_Excluded %>%
  mutate(tech_issue = 
           if_else(trip_duration_minutes <3 & 
                     from_station_name== to_station_name, TRUE, FALSE))%>%
  filter(tech_issue != TRUE)

Summarize data to check for additional outliers.

Cyclistic_2019_Outliers_Excluded %>%
  select(user_type, gender, trip_duration_minutes, weekday, month, age) %>%
  summary()
##       user_type          gender       trip_duration_minutes      weekday     
##  Customer  : 64728          :   761   Min.   : 1.00         Sunday   :37215  
##  Subscriber:289186   Female : 71207   1st Qu.: 6.60         Monday   :52609  
##                      Male   :215770   Median :11.00         Tuesday  :55812  
##                      Unknown: 64728   Mean   :13.64         Wednesday:55417  
##                      NA's   :  1448   3rd Qu.:18.60         Thursday :56221  
##                                       Max.   :43.20         Friday   :54023  
##                                                             Saturday :42617  
##        month            age        
##  August   :53146   Min.   : 16.00  
##  July     :49967   1st Qu.: 27.00  
##  September:45601   Median : 32.00  
##  June     :42736   Mean   : 34.95  
##  October  :35477   3rd Qu.: 40.00  
##  May      :34110   Max.   :131.00  
##  (Other)  :92877   NA's   :38637

There is an issue with the age. Max age is 131.

Cyclistic_2019_Outliers_Excluded <- 
  Cyclistic_2019_Outliers_Excluded %>% 
  arrange(desc(age))

Cyclistic_2019_Outliers_Excluded %>%
  ggplot(aes(user_type, age)) +
  geom_boxplot()

Upon closer inspection there are several ages over 100, which is highly unlikely for a bike rental. We assume that the age input is wrong. Replace all ages over 85 with NA.

Cyclistic_2019_Outliers_Excluded <- 
  replace_with_na_at(data= Cyclistic_2019_Outliers_Excluded,
                     .vars = "age", 
                     condition = ~.x > 85)

Cyclistic_2019_Outliers_Excluded <- 
  replace_with_na_at(data= Cyclistic_2019_Outliers_Excluded,
                     .vars = "birth_year", 
                     condition = ~.x < 1934)

Create age ranges.

Cyclistic_2019_Outliers_Excluded <- Cyclistic_2019_Outliers_Excluded %>%
  mutate(age_range= case_when(
    age >= 16 & age < 20 ~ "between 16 and 19",
    age >= 20 & age < 30 ~ "between 20 and 29",
    age >= 30 & age < 40 ~ "between 30 and 39",
    age >= 40 & age < 50 ~ "between 40 and 49",
    age >= 50 ~ "50+"))

Analysis

Bike Rental Frequency Over The Year

Prepare Data for Calendar Heatmap.

Heatmap <- Cyclistic_2019_Outliers_Excluded %>%
  select(start_time, month, weekday)%>%
  mutate(date= as.Date(start_time)) %>%
  mutate(week = isoweek(start_time)) %>%
  group_by(date,weekday, week, month)%>%
  summarise(number_of_trips= n())

# Change levels for weekdays

Heatmap$weekday <- factor(Heatmap$weekday, levels = c("Monday", "Tuesday",
                                                      "Wednesday", "Thursday",
                                                      "Friday", "Saturday", 
                                                      "Sunday"))

# Create Vector for week of the month.
week_of_month <- (4 + day(Heatmap$date) + 
                    wday(floor_date(Heatmap$date, "month")))%/% 7
Heatmap$week_of_month <- week_of_month

# Create Vector for day.

day <-format(as.Date(Heatmap$date,format="%Y-%m-%d"), format = "%d")

# Create factor for week_of_month

Heatmap <- Heatmap %>% 
  mutate(week_of_month_factor = case_when(week_of_month==0 ~ "null",
                                          week_of_month==1 ~ "first", 
                                          week_of_month==2 ~ "second",
                                          week_of_month==3 ~ "third",
                                          week_of_month==4 ~ "fourth",
                                          week_of_month==5 ~ "fifth")) %>%
  mutate(weekday= substring(weekday, 1, 2),
         week_of_month_factor = as.factor(week_of_month_factor))%>%
  mutate(week_of_month_factor= recode(week_of_month_factor,"fifth"=1,
                                      "fourth"=2,"third"=3, "second"=4,
                                      "first"= 5, "null"=6))

Create Calendar Heatmap

Heatmap %>%
  ggplot(aes(reorder(weekday, week_of_month_factor, decreasing=FALSE), 
             week_of_month_factor, fill = number_of_trips)) + 
  geom_tile(colour = "white") + 
  geom_text(aes(label= day))+
  facet_wrap(~month)+
  scale_fill_gradient(low="red", high="green")+
  labs(x="Day", 
       y="",
       fill= "Daily Trips",
       title = "Bike Rental Frequency Over The Year", 
       subtitle="Trips per Day in 2019",
       caption = "source: Motivation International Inc")+
  theme_minimal()+
  theme(strip.text.x = element_text(size=10, face="bold"),
        strip.background = element_rect(colour="black", fill="#CCCCFF"))+
  scale_x_discrete(expand = c(0,0))+
  scale_y_discrete(expand = c(0,0)) +
  coord_fixed(ratio=1)

high_season<- Heatmap %>%
  drop_na %>%
  mutate(months_grouped= case_when(
         month %in% c("June", "July", "August", "September") ~ "high season",
         month %in% c("April", "May", "October") ~ "mid season",
         month %in% c("November", "December", "January", "February", "March") ~ "low season",
         TRUE ~ NA_character_)) %>%
  group_by(months_grouped)%>%
  summarise(sum_trips=round(sum(number_of_trips)))%>%
mutate(perc_season= formattable::percent(sum_trips/sum(sum_trips)) )%>%
  filter(months_grouped == "high season")

high_season1 <- high_season$perc_season
mid_season<- Heatmap %>%
  drop_na %>%
  mutate(months_grouped= case_when(
         month %in% c("June", "July", "August", "September") ~ "high season",
         month %in% c("April", "May", "October") ~ "mid season",
         month %in% c("November", "December", "January", "February", "March") ~ "low season",
         TRUE ~ NA_character_)) %>%
  group_by(months_grouped)%>%
  summarise(sum_trips=round(sum(number_of_trips)))%>%
mutate(perc_season= formattable::percent(sum_trips/sum(sum_trips)) )%>%
  filter(months_grouped == "mid season")

mid_season1 <- mid_season$perc_season
low_season<- Heatmap %>%
  drop_na %>%
  mutate(months_grouped= case_when(
         month %in% c("June", "July", "August", "September") ~ "high season",
         month %in% c("April", "May", "October") ~ "mid season",
         month %in% c("November", "December", "January", "February", "March") ~ "low season",
         TRUE ~ NA_character_)) %>%
  group_by(months_grouped)%>%
  summarise(sum_trips=round(sum(number_of_trips)))%>%
mutate(perc_season= formattable::percent(sum_trips/sum(sum_trips)) )%>%
  filter(months_grouped == "low season")
  
low_season1 <- low_season$perc_season

Key Takeaway

The busiest months are June to September, accounting for 54.10% of yearly rentals. In April, May and October bikes are still fairly well booked, with 26.70% of rides, while low season goes from November to March, with 19.20%.

Prepare data for stacked area chart.

Stacked_Area_Chart <- 
  Cyclistic_2019_Outliers_Excluded %>%
  select (start_time, trip_duration_minutes, user_type, month) %>%
  mutate(week = isoweek(start_time)) %>%
  group_by(user_type, week) %>%
  summarise(mean_duration = mean(trip_duration_minutes))%>%
  mutate(mean_duration= round(mean_duration, digits= 1))

Create stacked area chart.

Stacked_Area_Chart %>%
  drop_na(mean_duration)%>%
  ggplot(aes(x= week, y= mean_duration, fill= user_type )) +
  geom_area() +
  theme_bw()+
  labs(title = "Customer vs Subscriber Bike Usage over the Year",
       subtitle = "Usage as distance in minutes for the year 2019",
       caption = "source: Motivation International Inc", 
       x= "weeks", y= "bike usage", fill= "User Type")+
  scale_fill_manual(values = c("Customer" = "#006d77", "Subscriber"= "#83c5be"))

Key Takeaway

Subscribers use the bike share service fairly steadily throughout the year, while customers’ usage fluctuates week by week and is the highest around summer time.

Bike Rental Frequency By Time Of The Day

Prepare data for stacked bar chart

# create breaks
time_breaks <- hour(hm("00:00", "6:00", "9:00", "12:00", "3:00", "18:00", "23:59"))
# labels for the breaks
labels <- c("Night", "Early Morning", "Late Morning", 
            "Early Afternoon", "Late Afternoon", "Evening")

Cyclistic_2019_Outliers_Excluded$Time_of_day <- 
  cut(x=hour(Cyclistic_2019_Outliers_Excluded$start_time), 
      breaks = time_breaks, labels = labels, include.lowest=TRUE)


Stacked_Bar<- Cyclistic_2019_Outliers_Excluded %>%
  select(Time_of_day, user_type)%>%
  group_by(user_type, Time_of_day)%>%
  summarise(time_of_day_n =  n())%>%
  mutate(time_of_day_perc= round(case_when(user_type== "Customer" ~ time_of_day_n/ sum(time_of_day_n)*100,
                                           user_type== "Subscriber" ~ time_of_day_n/ sum(time_of_day_n)*100),1))

Create stacked bar chart.

Stacked_Bar %>%  
  ggplot(aes(user_type, time_of_day_perc, fill= Time_of_day)) +
  geom_bar(position = "fill", stat = "identity")+
  theme_bw()+
  scale_fill_manual(values = c("Night" = "#006d77", 
                               "Early Morning"= "#83c5be", 
                               "Late Morning"="#edf6f9",
                               "Early Afternoon"= "#ffddd2" ,
                               "Late Afternoon"= "#e29578",
                               "Evening"= "#723d46")) +
  labs(x="User Type", 
       y="Frequency",
       fill= "Time of Day",
       title = "Bike Rental Frequency per Time of Day", 
       subtitle="2019 customer vs subscribers bike usage",
       caption = "source: Motivation International Inc")

Key Takeaway

Customers utilize the bike rental service mainly in the late afternoon, accounting for half of the daily rides. Other popular times are early afternoon and evening, with about 15% each. Subscribers also predominantly rent in the late afternoon, but unlike customers there is a good share of late morning rentals (~25%).

Bike Rental Frequency By Day Of The Week

Create Bar Chart

Cyclistic_2019_Outliers_Excluded %>%
  ggplot(aes(weekday, fill= user_type)) +
  geom_bar(position= "dodge")+
  theme_bw()+
  theme(axis.text.x = element_text(angle = 90))+
  scale_fill_manual(values = c("Customer" = "#006d77", 
                               "Subscriber"= "#83c5be")) +
  labs(x="", 
       y="Number of Trips",
       fill= "User Type",
       title = "Customer vs Subscriber Rentals", 
       subtitle="2019 Bike Rental Frequency per Weekday",
       caption = "source: Motivation International Inc")

weekend_vs_weekday <- Cyclistic_2019_Outliers_Excluded %>%
  mutate(weekend_weekday= if_else(weekday == "Saturday" | weekday == "Sunday", "Weekend", "Weekday"))%>%
  group_by(user_type, weekend_weekday) %>%
  summarise(n= n())%>%
  mutate(n_per_day = ifelse(weekend_weekday== "Weekend", (n/52)/2, (n/52)/5))
perc_weekend_c <- weekend_vs_weekday %>%
  filter(user_type== "Customer") %>%
  mutate(perc_weekend_c= formattable::percent(n_per_day/sum(n_per_day))) %>%
filter(weekend_weekday == "Weekend") 

perc_weekend_c1 <- perc_weekend_c$perc_weekend_c
perc_weekday_s <- weekend_vs_weekday %>%
  filter(user_type== "Subscriber") %>%
  mutate(perc_weekday_s= formattable::percent (n_per_day/sum(n_per_day))) %>%
filter(weekend_weekday == "Weekday")   

perc_weekday_s1 <- perc_weekday_s$perc_weekday_s

Key Takeaway

Customers predominantly rent over the weekend, with 63.85% of their rides performed on a Saturday or Sunday. In contrast, subscribers mostly rent Monday to Friday. 64.04% of subscribers’ rentals are on a weekday.

Bike Rental Frequency Vs Trip Duration

Prepare data for Scatterplot.

Scatterplot <- Cyclistic_2019_Outliers_Excluded %>%
  select (trip_id, start_time, trip_duration_minutes, user_type, month) %>%
  group_by(month, user_type) %>%
  summarise(mean_duration = mean(trip_duration_minutes, na.rm= TRUE),
            total_rides = n())

Create Scatterplot.

Scatterplot %>%
  ggplot(aes(month, total_rides))+
  geom_point(aes(col= user_type, size= mean_duration))+
  scale_color_manual(values = c("Customer" = "#006d77", 
                               "Subscriber"= "#83c5be")) +
  theme_bw()+
  theme(axis.text.x = element_text(angle = 90))+
  labs(title = "Customers vs Subscribers", 
       subtitle = "2019 Number of trips and duration",
       x="month", y= "Number of rides", 
       fill= "user type", size= "duration (minutes)",
       caption = "source: Motivation International Inc")

freq_s<-Scatterplot %>%
  group_by(user_type)%>%
 summarise(freq_s= round(mean(total_rides)))%>%
filter(user_type=="Subscriber")

freq_s1 <- freq_s$freq_s
freq_c<-Scatterplot %>%
  group_by(user_type)%>%
 summarise(freq_c= round(mean(total_rides)))%>%
filter(user_type=="Customer")

freq_c1 <- freq_c$freq_c
dur_s<-Scatterplot %>%
  group_by(user_type)%>%
 summarise(dur_s= round(mean(mean_duration),1))%>%
filter(user_type=="Subscriber")

dur_s1 <- dur_s$dur_s
dur_c<-Scatterplot %>%
  group_by(user_type)%>%
 summarise(dur_c= round(mean(mean_duration),1))%>%
filter(user_type=="Customer")

dur_c1 <- dur_c$dur_c

Key Takeaway

Subscribers use the bike service over four times more often than customers, with an average monthly frequency of 24099 versus 5394. However, on average customers rent the bike for twice the duration of subscribers. The mean monthly durations are 20.1 and 11.6 minutes respectively.

Age And Gender Distribution

Create Boxplot.

Cyclistic_2019_Outliers_Excluded %>%
  filter(gender == "Male" | gender == "Female")%>%
  drop_na(gender, age)%>%
  ggplot(aes(gender, age))+
  geom_boxplot(varwidth = T, fill = "#006d77")+
  labs(title = "Gender and Age Disctribution", 
       subtitle = "Data only available for subscribers",
       size= "average duration (minutes)",
       caption = "source: Motivation International Inc")+
  theme_bw()

age_f<-Cyclistic_2019_Outliers_Excluded %>%
  drop_na()%>%
  group_by(gender)%>%
 summarise(avg_age= round(mean(age),1))%>%
  filter(gender=="Female")

age_f1 <- age_f$avg_age
age_m<-Cyclistic_2019_Outliers_Excluded %>%
  drop_na()%>%
  group_by(gender)%>%
 summarise(avg_age= round(mean(age),1))%>%
  filter(gender=="Male")

age_m1 <- age_m$avg_age

Key Takeaway

Most rides are made by male subscribers in their late 20s to early 40s. Female subscribers tend to be slightly younger than the male counterparts, with respective ages of 34.1 and 35.8 .

Create Pie Chart.

Cyclistic_2019_Outliers_Excluded %>%
  drop_na(gender)%>%
  ggplot(aes(x = "", fill = factor(gender))) + 
  geom_bar(width = 1) +
  theme(axis.line = element_blank(), 
        plot.title = element_text(hjust=0.5)) + 
  labs(fill="", 
       x=NULL, 
       y=NULL, 
       title="Gender of Subscribers", 
       caption="source: Motivation International Inc")+
  coord_polar(theta = "y", start=0)+
  theme_void()+
  scale_fill_manual(values = c("Male" = "#006d77", 
                               "Female"= "#83c5be", 
                               "Unknown"="#edf6f9"))

perc_m <- Cyclistic_2019_Outliers_Excluded %>%
  drop_na()%>%
  filter(gender %in% c("Male", "Female", "Unknown")) %>%
  group_by(gender)%>%
  summarise(gender_n= n())%>%
  mutate(gender_perc= formattable::percent(gender_n/sum(gender_n)))%>%
  filter(gender== "Male")

perc_m1 <- perc_m$gender_perc
perc_f <- Cyclistic_2019_Outliers_Excluded %>%
  drop_na()%>%
  filter(gender %in% c("Male", "Female", "Unknown")) %>%
  group_by(gender)%>%
  summarise(gender_n= n())%>%
  mutate(gender_perc= formattable::percent(gender_n/sum(gender_n)))%>%
  filter(gender== "Female")

perc_f1 <- perc_f$gender_perc
perc_u <- Cyclistic_2019_Outliers_Excluded %>%
  drop_na()%>%
  filter(gender %in% c("Male", "Female", "Unknown")) %>%
  group_by(gender)%>%
  summarise(gender_n= n())%>%
  mutate(gender_perc= formattable::percent(gender_n/sum(gender_n)))%>%
  filter(gender== "Unknown")

perc_u1 <- perc_u$gender_perc

Key Takeaway

Gender is only available for subscribers: however, because the number of rides performed by suscribers significantly exceeds the amount of casual rides, we can confidently conclude that at least 70% of rides are made by males. The bike rental distribution for 2019 is as follows: 68.78% males, 22.71% females, 8.51% unknown (customers).

Prepare Data for Waffle Chart.

Waffle <- Cyclistic_2019_Outliers_Excluded %>%
  drop_na(age_range) %>%
  group_by(age_range)%>%
  summarise(age_frequency= n())

# Create Vector

age<-c('between 16 and 19'= 2666, 
       'between 20 and 29'= 119735, 'between 30 and 39'= 111172,
       'between 40 and 49'= 41161, '50+'= 40380)

Create Waffle Chart.

waffle(age/10000, rows=5, size=0.6, 
       colors=c("#006d77", "#83c5be", "#edf6f9", 
                "#ffddd2", "#e29578"), 
       title="Age Distribution of Subscribers", 
       xlab="1 square = 10000 persons")

avg_age<-Cyclistic_2019_Outliers_Excluded %>%
  drop_na()%>%
 summarise(avg_age= round(mean(age),1))

avg_age1 <- avg_age$avg_age

Key Takeaway

The target age for Cyclistic subscribers is between 20 and 40, with an average age of 34.9.

Conclusion

Differences between the bike usage of annual members and casual riders:

  • Subscribers rent fairly consistently throughout the year, while the frequency of bike rentals for customers fluctuates strongly.

  • Casual riders rent less frequently, but for longer duration.

  • Casual riders prefer weekends, while subscribers mainly rent during the workweek.

  • Customers like to rent later in the day than subscribers.

Why would casual riders buy Cyclistic annual memberships?

  • The predominance of weekday bike rentals, the earlier rides shorter rides and lesser seasonality suggests that subscribers use the bike service for their commute, while customers rent the bikes for leasure.

  • In order for the membership to be attractive for casual riders, they would need to have an incentive to use the bikes more frequently in their day-to-day lives.

How can Cyclistic use digital media to influence casual riders to become members?

  • Social media could be used to promote a healthy lifestyle, showcasing influencers commuting to work in any season and weather.

  • Cyclistic could cooperate with sports gear companies and create a campaign to promote all-year bike commuting.

  • Cyclistic could run challenges, competitions and promotions on social media, rewarding bike commuters.

Other recomendations

  • Cyclistic could partner with local companies and offer cheaper yearly memberships to their employees.

  • Cyclistic could also partner with local health and fitness clubs and offer attractive memberships for their members. This is particularly true for female clubs, since the majority of subscribers is male and this could be an effective way to make bike rentals attractive for women as well.

  • Cyclistic could also work with high schools and colleges to try to attract younger subscribers, since currently most rides are performed by subscribers aged between 20 and 40.