Olivia S. Engl
In 2016, Cyclistic launched a successful bike-share offering. Since then, the program has grown to a fleet of 5,824 bicycles that are geotracked and locked into a network of 692 stations across Chicago. The bikes can be unlocked from one station and returned to any other station in the system anytime. Customers who purchase single-ride or full-day passes are referred to as casual riders. Customers who purchase annual memberships are Cyclistic members.
Cyclistic’s finance analysts have concluded that annual members are much more profitable than casual riders. Cyclistic wants to design marketing strategies aimed at converting casual riders into annual members. In order to do that, however, the marketing analyst team needs to better understand how annual members and casual riders differ, why casual riders would buy a membership, and how digital media could affect their marketing tactics.
Define differences between the bike usage of annual members and casual riders.
Why would casual riders buy Cyclistic annual memberships?
How can Cyclistic use digital media to influence casual riders to become members?
The data has been made available by Motivate international Inc. under this this agreement LINK.
For the purpose of this portfolio project and due to Rstudio memory constraints, a random, unbiased sample has been extracted from the original data frames. The sample size is 1/10th of the original data.
Following code was used to create the sample for each quartale:
set.seed(2022286) Cyclistic_Q1_2019<- Divvy_Trips_2019_Q1 %>% sample_n(36506)
Demographic information (age and gender) is only available for subscribers and not for casual riders, also referred to as customers.
Every ride has a unique code (primary key) and we do not have a customer identification code; therefore we can only analyze the age and gender distribution according to the number of rides, rather than for the number of individuals.
Following modifications of the data have been performed during the data cleaning process:
Subscriber’s age had several outliers, with some people aged well over 100 years old. After looking at the age distribution a cutoff age of 85 was defined. All ages over 85 were replaced with NA.
Trip duration varied from 1 minute to 3 months. The Interquartal Range method was used to remove outliers.
Where trip duration was under 3 minutes and start and end locations were the same, it was assumed that no real trip took place and the values were removed from the dataset. This could for instance be due to technical issues with the bike.
library("tidyverse")
library("janitor")
library("skimr")
library("lubridate")
library("naniar")
library("waffle")
library("formattable")
Cyclistic_Q1_2019 <- read.csv("Cyclistic_Q1_2019.csv")
Cyclistic_Q2_2019 <- read.csv("Cyclistic_Q2_2019.csv")
Cyclistic_Q3_2019 <- read.csv("Cyclistic_Q3_2019.csv")
Cyclistic_Q4_2019 <- read.csv("Cyclistic_Q4_2019.csv")
Check that all variables are in the right order and have the same format.
glimpse(Cyclistic_Q1_2019)
## Rows: 36,506
## Columns: 12
## $ trip_id <int> 21814521, 22132588, 22077402, 22128279, 22155279, 21…
## $ start_time <chr> "1/14/2019 17:32", "3/26/2019 6:46", "3/18/2019 15:2…
## $ end_time <chr> "1/14/2019 17:41", "3/26/2019 7:00", "3/18/2019 15:3…
## $ bikeid <int> 3780, 786, 5417, 795, 223, 1184, 4157, 5753, 4527, 3…
## $ tripduration <chr> "575", "844", "253", "369", "327", "84", "501", "581…
## $ from_station_id <int> 174, 637, 174, 192, 40, 91, 172, 199, 5, 466, 161, 3…
## $ from_station_name <chr> "Canal St & Madison St", "Wood St & Chicago Ave (*)"…
## $ to_station_id <int> 620, 107, 283, 108, 164, 66, 181, 174, 624, 452, 211…
## $ to_station_name <chr> "Orleans St & Chestnut St (NEXT Apts)", "Desplaines …
## $ usertype <chr> "Subscriber", "Subscriber", "Subscriber", "Subscribe…
## $ gender <chr> "Male", "Male", "Male", "Male", "Male", "Male", "Mal…
## $ birthyear <int> 1981, 1990, 1990, 1983, 1984, 1986, 1979, 1966, 1985…
glimpse(Cyclistic_Q2_2019)
## Rows: 110,816
## Columns: 12
## $ X01...Rental.Details.Rental.ID <int> 22561350, 22556568, …
## $ X01...Rental.Details.Local.Start.Time <chr> "5/6/2019 16:59", "5…
## $ X01...Rental.Details.Local.End.Time <chr> "5/6/2019 17:18", "5…
## $ X01...Rental.Details.Bike.ID <int> 1553, 5917, 3449, 63…
## $ X01...Rental.Details.Duration.In.Seconds.Uncapped <chr> "1,130.00", "486", "…
## $ X03...Rental.Start.Station.ID <int> 49, 277, 47, 50, 602…
## $ X03...Rental.Start.Station.Name <chr> "Dearborn St & Monro…
## $ X02...Rental.End.Station.ID <int> 118, 84, 26, 39, 663…
## $ X02...Rental.End.Station.Name <chr> "Sedgwick St & North…
## $ User.Type <chr> "Subscriber", "Subsc…
## $ Member.Gender <chr> "Male", "Male", "Fem…
## $ X05...Member.Details.Member.Birthday.Year <int> 1967, 1974, 1971, NA…
glimpse(Cyclistic_Q3_2019)
## Rows: 164,071
## Columns: 12
## $ trip_id <int> 24672286, 23846067, 23842155, 24002175, 24967257, 24…
## $ start_time <chr> "8/29/2019 22:44", "7/19/2019 16:29", "7/19/2019 12:…
## $ end_time <chr> "8/29/2019 23:10", "7/19/2019 16:36", "7/19/2019 14:…
## $ bikeid <int> 3033, 3280, 6367, 4329, 5886, 3701, 577, 5414, 688, …
## $ tripduration <int> 1519, 405, 7120, 603, 21561, 338, 575, 354, 2457, 70…
## $ from_station_id <int> 52, 38, 45, 331, 76, 301, 418, 107, 93, 13, 24, 2, 3…
## $ from_station_name <chr> "Michigan Ave & Lake St", "Clark St & Lake St", "Mic…
## $ to_station_id <int> 25, 624, 35, 268, 35, 140, 345, 56, 504, 289, 51, 35…
## $ to_station_name <chr> "Michigan Ave & Pearson St", "Dearborn St & Van Bure…
## $ usertype <chr> "Customer", "Subscriber", "Customer", "Customer", "C…
## $ gender <chr> NA, "Male", NA, "Female", "Male", "Male", NA, "Male"…
## $ birthyear <int> NA, 1991, NA, 1995, 1961, 1988, 1992, 1969, NA, 1993…
glimpse(Cyclistic_Q4_2019)
## Rows: 70,405
## Columns: 12
## $ trip_id <int> 25288982, 25562503, 25558591, 25719662, 25583344, 25…
## $ start_time <chr> "10/5/2019 12:03", "10/25/2019 16:32", "10/25/2019 1…
## $ end_time <chr> "10/5/2019 12:10", "10/25/2019 16:41", "10/25/2019 1…
## $ bikeid <int> 999, 2983, 1655, 130, 5722, 313, 5610, 892, 420, 179…
## $ tripduration <int> 413, 536, 499, 217, 740, 716, 369, 329, 3384, 253, 5…
## $ from_station_id <int> 94, 37, 201, 337, 349, 160, 117, 58, 636, 154, 133, …
## $ from_station_name <chr> "Clark St & Armitage Ave", "Dearborn St & Adams St",…
## $ to_station_id <int> 220, 47, 148, 211, 304, 20, 256, 69, 304, 319, 77, 7…
## $ to_station_name <chr> "Clark St & Drummond Pl", "State St & Kinzie St", "S…
## $ usertype <chr> "Subscriber", "Subscriber", "Subscriber", "Subscribe…
## $ gender <chr> "Male", "Male", "Female", "Male", "Female", "Male", …
## $ birthyear <int> 1993, 1979, 1993, 1984, 1996, 1986, 1990, 1984, NA, …
The column titles in Q2 are inconsistent with the other files. Rename them in order to be able to merge the data frames.
Cyclistic_Q2_2019 <- Cyclistic_Q2_2019 %>%
clean_names()%>%
rename(trip_id = x01_rental_details_rental_id,
start_time = x01_rental_details_local_start_time,
end_time = x01_rental_details_local_end_time,
bikeid = x01_rental_details_bike_id,
tripduration = x01_rental_details_duration_in_seconds_uncapped,
from_station_id = x03_rental_start_station_id,
from_station_name = x03_rental_start_station_name,
to_station_id = x02_rental_end_station_id,
to_station_name = x02_rental_end_station_name,
usertype = user_type,
gender = member_gender,
birthyear = x05_member_details_member_birthday_year)
Remove commas from tripduration values and convert to integer
Cyclistic_Q1_2019 <- Cyclistic_Q1_2019 %>%
mutate(tripduration=gsub("\\,", "", tripduration))
Cyclistic_Q2_2019 <- Cyclistic_Q2_2019 %>%
mutate(tripduration=gsub("\\,", "", tripduration))
Cyclistic_Q3_2019 <- Cyclistic_Q3_2019 %>%
mutate(tripduration=gsub("\\,", "", tripduration))
Cyclistic_Q4_2019 <- Cyclistic_Q4_2019 %>%
mutate(tripduration=gsub("\\,", "", tripduration))
Cyclistic_Q1_2019$tripduration <- as.integer(Cyclistic_Q1_2019$tripduration)
Cyclistic_Q2_2019$tripduration <- as.integer(Cyclistic_Q2_2019$tripduration)
Cyclistic_Q3_2019$tripduration <- as.integer(Cyclistic_Q3_2019$tripduration)
Cyclistic_Q4_2019$tripduration <- as.integer(Cyclistic_Q4_2019$tripduration)
Create a single dataset using rbind().
Cyclistic_2019 <- rbind(Cyclistic_Q1_2019,
Cyclistic_Q2_2019,
Cyclistic_Q3_2019,
Cyclistic_Q4_2019)
Rename the column names and make them consistent.
Cyclistic_2019 <- Cyclistic_2019 %>%
clean_names()%>%
rename( bike_id = bikeid,
trip_duration = tripduration,
user_type = usertype,
birth_year = birthyear)
Check for duplicates in trip_id (should be unique values).
Cyclistic_2019 %>%
distinct(trip_id) # There are no duplicates in trip_id
Convert start and end time from character to datetime.
Cyclistic_2019$start_time <- as.POSIXlt(Cyclistic_2019$start_time,format="%m/%d/%Y %H:%M")
Cyclistic_2019$end_time <- as.POSIXlt(Cyclistic_2019$end_time,format="%m/%d/%Y %H:%M")
Recalculate trip duration to check the time unit used in the data set.
difftime(Cyclistic_2019$end_time, Cyclistic_2019$start_time,
units = "secs" )
Trip duration is currently in seconds. Convert to minutes.
Cyclistic_2019 <- Cyclistic_2019 %>%
mutate(trip_duration_minutes= round(trip_duration/60 , 1))
Identify NA Values.
sapply(Cyclistic_2019, function(x) sum(is.na(x)))
## trip_id start_time end_time
## 0 0 0
## bike_id trip_duration from_station_id
## 0 0 0
## from_station_name to_station_id to_station_name
## 0 0 0
## user_type gender birth_year
## 0 35518 54222
## trip_duration_minutes
## 0
The only NAs are in “gender” and “birth year”.
We only know the gender for Subscribers. Therefore, replace NA with “Unknown” for Customers.
Cyclistic_2019 <- Cyclistic_2019 %>%
mutate(gender= as.character(gender))%>%
mutate(gender= if_else(user_type== "Customer", "Unknown", gender))
Assign the right class to each variable.
Cyclistic_2019 <- Cyclistic_2019 %>%
mutate(trip_id = as.integer(trip_id),
bike_id = as.integer(bike_id),
from_station_id = as.integer(from_station_id),
to_station_id = as.integer(to_station_id),
user_type = as.factor(user_type),
gender = as.factor(gender),
birth_year = as.integer(birth_year))
Add variables weekday and month.
Cyclistic_2019 <- Cyclistic_2019 %>%
mutate(weekday = wday(start_time, label = TRUE, abbr = FALSE),
month = month(start_time, label = TRUE, abbr = FALSE))
Create an age variable.
Cyclistic_2019 <- Cyclistic_2019 %>%
mutate(age = 2019- birth_year)
Check trip duration.
Cyclistic_2019 <- Cyclistic_2019 %>%
arrange(trip_duration_minutes)
The duration of the trip varies significantly, from 1 minute to three months. Identify outliers.
options(scipen = 999) #turns off scientific mode
hist(Cyclistic_2019$trip_duration_minutes)
Cyclistic_2019%>%
ggplot(aes(month , trip_duration_minutes))+
geom_boxplot(varwidth=T, fill="plum", outlier.color = "red")+
ylim(0,300)
Both the histogram and the boxplot confirm that there are several outliers that do not represent the typical use of customers and members.
Use the IQR method to find and remove outliers.
Q <- quantile(Cyclistic_2019$trip_duration_minutes,
probs = c(0.25, 0.75), na.rm = FALSE)
IQR <- IQR(Cyclistic_2019$trip_duration_minutes)
upper_inner_fence <- (21.4 + 1.5 * 14.6) # Q2 + 1.5*IQR
lower_inner_fence <- (6.8 - 1.5 * 14.6) # Q2 + 1.5*IQR
Cyclistic_2019_Outliers_Excluded <- Cyclistic_2019 %>%
filter(trip_duration_minutes > 0 & trip_duration_minutes < 43.3)
Also, an assumption is made that trips under 3 minutes, where start station equals the end station, are likely due to technical issues with the bike and should not be included in the analysis.
Cyclistic_2019_Outliers_Excluded <-
Cyclistic_2019_Outliers_Excluded %>%
mutate(tech_issue =
if_else(trip_duration_minutes <3 &
from_station_name== to_station_name, TRUE, FALSE))%>%
filter(tech_issue != TRUE)
Summarize data to check for additional outliers.
Cyclistic_2019_Outliers_Excluded %>%
select(user_type, gender, trip_duration_minutes, weekday, month, age) %>%
summary()
## user_type gender trip_duration_minutes weekday
## Customer : 64728 : 761 Min. : 1.00 Sunday :37215
## Subscriber:289186 Female : 71207 1st Qu.: 6.60 Monday :52609
## Male :215770 Median :11.00 Tuesday :55812
## Unknown: 64728 Mean :13.64 Wednesday:55417
## NA's : 1448 3rd Qu.:18.60 Thursday :56221
## Max. :43.20 Friday :54023
## Saturday :42617
## month age
## August :53146 Min. : 16.00
## July :49967 1st Qu.: 27.00
## September:45601 Median : 32.00
## June :42736 Mean : 34.95
## October :35477 3rd Qu.: 40.00
## May :34110 Max. :131.00
## (Other) :92877 NA's :38637
There is an issue with the age. Max age is 131.
Cyclistic_2019_Outliers_Excluded <-
Cyclistic_2019_Outliers_Excluded %>%
arrange(desc(age))
Cyclistic_2019_Outliers_Excluded %>%
ggplot(aes(user_type, age)) +
geom_boxplot()
Upon closer inspection there are several ages over 100, which is highly unlikely for a bike rental. We assume that the age input is wrong. Replace all ages over 85 with NA.
Cyclistic_2019_Outliers_Excluded <-
replace_with_na_at(data= Cyclistic_2019_Outliers_Excluded,
.vars = "age",
condition = ~.x > 85)
Cyclistic_2019_Outliers_Excluded <-
replace_with_na_at(data= Cyclistic_2019_Outliers_Excluded,
.vars = "birth_year",
condition = ~.x < 1934)
Create age ranges.
Cyclistic_2019_Outliers_Excluded <- Cyclistic_2019_Outliers_Excluded %>%
mutate(age_range= case_when(
age >= 16 & age < 20 ~ "between 16 and 19",
age >= 20 & age < 30 ~ "between 20 and 29",
age >= 30 & age < 40 ~ "between 30 and 39",
age >= 40 & age < 50 ~ "between 40 and 49",
age >= 50 ~ "50+"))
Prepare Data for Calendar Heatmap.
Heatmap <- Cyclistic_2019_Outliers_Excluded %>%
select(start_time, month, weekday)%>%
mutate(date= as.Date(start_time)) %>%
mutate(week = isoweek(start_time)) %>%
group_by(date,weekday, week, month)%>%
summarise(number_of_trips= n())
# Change levels for weekdays
Heatmap$weekday <- factor(Heatmap$weekday, levels = c("Monday", "Tuesday",
"Wednesday", "Thursday",
"Friday", "Saturday",
"Sunday"))
# Create Vector for week of the month.
week_of_month <- (4 + day(Heatmap$date) +
wday(floor_date(Heatmap$date, "month")))%/% 7
Heatmap$week_of_month <- week_of_month
# Create Vector for day.
day <-format(as.Date(Heatmap$date,format="%Y-%m-%d"), format = "%d")
# Create factor for week_of_month
Heatmap <- Heatmap %>%
mutate(week_of_month_factor = case_when(week_of_month==0 ~ "null",
week_of_month==1 ~ "first",
week_of_month==2 ~ "second",
week_of_month==3 ~ "third",
week_of_month==4 ~ "fourth",
week_of_month==5 ~ "fifth")) %>%
mutate(weekday= substring(weekday, 1, 2),
week_of_month_factor = as.factor(week_of_month_factor))%>%
mutate(week_of_month_factor= recode(week_of_month_factor,"fifth"=1,
"fourth"=2,"third"=3, "second"=4,
"first"= 5, "null"=6))
Create Calendar Heatmap
Heatmap %>%
ggplot(aes(reorder(weekday, week_of_month_factor, decreasing=FALSE),
week_of_month_factor, fill = number_of_trips)) +
geom_tile(colour = "white") +
geom_text(aes(label= day))+
facet_wrap(~month)+
scale_fill_gradient(low="red", high="green")+
labs(x="Day",
y="",
fill= "Daily Trips",
title = "Bike Rental Frequency Over The Year",
subtitle="Trips per Day in 2019",
caption = "source: Motivation International Inc")+
theme_minimal()+
theme(strip.text.x = element_text(size=10, face="bold"),
strip.background = element_rect(colour="black", fill="#CCCCFF"))+
scale_x_discrete(expand = c(0,0))+
scale_y_discrete(expand = c(0,0)) +
coord_fixed(ratio=1)
high_season<- Heatmap %>%
drop_na %>%
mutate(months_grouped= case_when(
month %in% c("June", "July", "August", "September") ~ "high season",
month %in% c("April", "May", "October") ~ "mid season",
month %in% c("November", "December", "January", "February", "March") ~ "low season",
TRUE ~ NA_character_)) %>%
group_by(months_grouped)%>%
summarise(sum_trips=round(sum(number_of_trips)))%>%
mutate(perc_season= formattable::percent(sum_trips/sum(sum_trips)) )%>%
filter(months_grouped == "high season")
high_season1 <- high_season$perc_season
mid_season<- Heatmap %>%
drop_na %>%
mutate(months_grouped= case_when(
month %in% c("June", "July", "August", "September") ~ "high season",
month %in% c("April", "May", "October") ~ "mid season",
month %in% c("November", "December", "January", "February", "March") ~ "low season",
TRUE ~ NA_character_)) %>%
group_by(months_grouped)%>%
summarise(sum_trips=round(sum(number_of_trips)))%>%
mutate(perc_season= formattable::percent(sum_trips/sum(sum_trips)) )%>%
filter(months_grouped == "mid season")
mid_season1 <- mid_season$perc_season
low_season<- Heatmap %>%
drop_na %>%
mutate(months_grouped= case_when(
month %in% c("June", "July", "August", "September") ~ "high season",
month %in% c("April", "May", "October") ~ "mid season",
month %in% c("November", "December", "January", "February", "March") ~ "low season",
TRUE ~ NA_character_)) %>%
group_by(months_grouped)%>%
summarise(sum_trips=round(sum(number_of_trips)))%>%
mutate(perc_season= formattable::percent(sum_trips/sum(sum_trips)) )%>%
filter(months_grouped == "low season")
low_season1 <- low_season$perc_season
Key Takeaway
The busiest months are June to September, accounting for 54.10% of yearly rentals. In April, May and October bikes are still fairly well booked, with 26.70% of rides, while low season goes from November to March, with 19.20%.
Prepare data for stacked area chart.
Stacked_Area_Chart <-
Cyclistic_2019_Outliers_Excluded %>%
select (start_time, trip_duration_minutes, user_type, month) %>%
mutate(week = isoweek(start_time)) %>%
group_by(user_type, week) %>%
summarise(mean_duration = mean(trip_duration_minutes))%>%
mutate(mean_duration= round(mean_duration, digits= 1))
Create stacked area chart.
Stacked_Area_Chart %>%
drop_na(mean_duration)%>%
ggplot(aes(x= week, y= mean_duration, fill= user_type )) +
geom_area() +
theme_bw()+
labs(title = "Customer vs Subscriber Bike Usage over the Year",
subtitle = "Usage as distance in minutes for the year 2019",
caption = "source: Motivation International Inc",
x= "weeks", y= "bike usage", fill= "User Type")+
scale_fill_manual(values = c("Customer" = "#006d77", "Subscriber"= "#83c5be"))
Key Takeaway
Subscribers use the bike share service fairly steadily throughout the year, while customers’ usage fluctuates week by week and is the highest around summer time.
Prepare data for stacked bar chart
# create breaks
time_breaks <- hour(hm("00:00", "6:00", "9:00", "12:00", "3:00", "18:00", "23:59"))
# labels for the breaks
labels <- c("Night", "Early Morning", "Late Morning",
"Early Afternoon", "Late Afternoon", "Evening")
Cyclistic_2019_Outliers_Excluded$Time_of_day <-
cut(x=hour(Cyclistic_2019_Outliers_Excluded$start_time),
breaks = time_breaks, labels = labels, include.lowest=TRUE)
Stacked_Bar<- Cyclistic_2019_Outliers_Excluded %>%
select(Time_of_day, user_type)%>%
group_by(user_type, Time_of_day)%>%
summarise(time_of_day_n = n())%>%
mutate(time_of_day_perc= round(case_when(user_type== "Customer" ~ time_of_day_n/ sum(time_of_day_n)*100,
user_type== "Subscriber" ~ time_of_day_n/ sum(time_of_day_n)*100),1))
Create stacked bar chart.
Stacked_Bar %>%
ggplot(aes(user_type, time_of_day_perc, fill= Time_of_day)) +
geom_bar(position = "fill", stat = "identity")+
theme_bw()+
scale_fill_manual(values = c("Night" = "#006d77",
"Early Morning"= "#83c5be",
"Late Morning"="#edf6f9",
"Early Afternoon"= "#ffddd2" ,
"Late Afternoon"= "#e29578",
"Evening"= "#723d46")) +
labs(x="User Type",
y="Frequency",
fill= "Time of Day",
title = "Bike Rental Frequency per Time of Day",
subtitle="2019 customer vs subscribers bike usage",
caption = "source: Motivation International Inc")
Key Takeaway
Customers utilize the bike rental service mainly in the late afternoon, accounting for half of the daily rides. Other popular times are early afternoon and evening, with about 15% each. Subscribers also predominantly rent in the late afternoon, but unlike customers there is a good share of late morning rentals (~25%).
Create Bar Chart
Cyclistic_2019_Outliers_Excluded %>%
ggplot(aes(weekday, fill= user_type)) +
geom_bar(position= "dodge")+
theme_bw()+
theme(axis.text.x = element_text(angle = 90))+
scale_fill_manual(values = c("Customer" = "#006d77",
"Subscriber"= "#83c5be")) +
labs(x="",
y="Number of Trips",
fill= "User Type",
title = "Customer vs Subscriber Rentals",
subtitle="2019 Bike Rental Frequency per Weekday",
caption = "source: Motivation International Inc")
weekend_vs_weekday <- Cyclistic_2019_Outliers_Excluded %>%
mutate(weekend_weekday= if_else(weekday == "Saturday" | weekday == "Sunday", "Weekend", "Weekday"))%>%
group_by(user_type, weekend_weekday) %>%
summarise(n= n())%>%
mutate(n_per_day = ifelse(weekend_weekday== "Weekend", (n/52)/2, (n/52)/5))
perc_weekend_c <- weekend_vs_weekday %>%
filter(user_type== "Customer") %>%
mutate(perc_weekend_c= formattable::percent(n_per_day/sum(n_per_day))) %>%
filter(weekend_weekday == "Weekend")
perc_weekend_c1 <- perc_weekend_c$perc_weekend_c
perc_weekday_s <- weekend_vs_weekday %>%
filter(user_type== "Subscriber") %>%
mutate(perc_weekday_s= formattable::percent (n_per_day/sum(n_per_day))) %>%
filter(weekend_weekday == "Weekday")
perc_weekday_s1 <- perc_weekday_s$perc_weekday_s
Key Takeaway
Customers predominantly rent over the weekend, with 63.85% of their rides performed on a Saturday or Sunday. In contrast, subscribers mostly rent Monday to Friday. 64.04% of subscribers’ rentals are on a weekday.
Prepare data for Scatterplot.
Scatterplot <- Cyclistic_2019_Outliers_Excluded %>%
select (trip_id, start_time, trip_duration_minutes, user_type, month) %>%
group_by(month, user_type) %>%
summarise(mean_duration = mean(trip_duration_minutes, na.rm= TRUE),
total_rides = n())
Create Scatterplot.
Scatterplot %>%
ggplot(aes(month, total_rides))+
geom_point(aes(col= user_type, size= mean_duration))+
scale_color_manual(values = c("Customer" = "#006d77",
"Subscriber"= "#83c5be")) +
theme_bw()+
theme(axis.text.x = element_text(angle = 90))+
labs(title = "Customers vs Subscribers",
subtitle = "2019 Number of trips and duration",
x="month", y= "Number of rides",
fill= "user type", size= "duration (minutes)",
caption = "source: Motivation International Inc")
freq_s<-Scatterplot %>%
group_by(user_type)%>%
summarise(freq_s= round(mean(total_rides)))%>%
filter(user_type=="Subscriber")
freq_s1 <- freq_s$freq_s
freq_c<-Scatterplot %>%
group_by(user_type)%>%
summarise(freq_c= round(mean(total_rides)))%>%
filter(user_type=="Customer")
freq_c1 <- freq_c$freq_c
dur_s<-Scatterplot %>%
group_by(user_type)%>%
summarise(dur_s= round(mean(mean_duration),1))%>%
filter(user_type=="Subscriber")
dur_s1 <- dur_s$dur_s
dur_c<-Scatterplot %>%
group_by(user_type)%>%
summarise(dur_c= round(mean(mean_duration),1))%>%
filter(user_type=="Customer")
dur_c1 <- dur_c$dur_c
Key Takeaway
Subscribers use the bike service over four times more often than customers, with an average monthly frequency of 24099 versus 5394. However, on average customers rent the bike for twice the duration of subscribers. The mean monthly durations are 20.1 and 11.6 minutes respectively.
Create Boxplot.
Cyclistic_2019_Outliers_Excluded %>%
filter(gender == "Male" | gender == "Female")%>%
drop_na(gender, age)%>%
ggplot(aes(gender, age))+
geom_boxplot(varwidth = T, fill = "#006d77")+
labs(title = "Gender and Age Disctribution",
subtitle = "Data only available for subscribers",
size= "average duration (minutes)",
caption = "source: Motivation International Inc")+
theme_bw()
age_f<-Cyclistic_2019_Outliers_Excluded %>%
drop_na()%>%
group_by(gender)%>%
summarise(avg_age= round(mean(age),1))%>%
filter(gender=="Female")
age_f1 <- age_f$avg_age
age_m<-Cyclistic_2019_Outliers_Excluded %>%
drop_na()%>%
group_by(gender)%>%
summarise(avg_age= round(mean(age),1))%>%
filter(gender=="Male")
age_m1 <- age_m$avg_age
Key Takeaway
Most rides are made by male subscribers in their late 20s to early 40s. Female subscribers tend to be slightly younger than the male counterparts, with respective ages of 34.1 and 35.8 .
Create Pie Chart.
Cyclistic_2019_Outliers_Excluded %>%
drop_na(gender)%>%
ggplot(aes(x = "", fill = factor(gender))) +
geom_bar(width = 1) +
theme(axis.line = element_blank(),
plot.title = element_text(hjust=0.5)) +
labs(fill="",
x=NULL,
y=NULL,
title="Gender of Subscribers",
caption="source: Motivation International Inc")+
coord_polar(theta = "y", start=0)+
theme_void()+
scale_fill_manual(values = c("Male" = "#006d77",
"Female"= "#83c5be",
"Unknown"="#edf6f9"))
perc_m <- Cyclistic_2019_Outliers_Excluded %>%
drop_na()%>%
filter(gender %in% c("Male", "Female", "Unknown")) %>%
group_by(gender)%>%
summarise(gender_n= n())%>%
mutate(gender_perc= formattable::percent(gender_n/sum(gender_n)))%>%
filter(gender== "Male")
perc_m1 <- perc_m$gender_perc
perc_f <- Cyclistic_2019_Outliers_Excluded %>%
drop_na()%>%
filter(gender %in% c("Male", "Female", "Unknown")) %>%
group_by(gender)%>%
summarise(gender_n= n())%>%
mutate(gender_perc= formattable::percent(gender_n/sum(gender_n)))%>%
filter(gender== "Female")
perc_f1 <- perc_f$gender_perc
perc_u <- Cyclistic_2019_Outliers_Excluded %>%
drop_na()%>%
filter(gender %in% c("Male", "Female", "Unknown")) %>%
group_by(gender)%>%
summarise(gender_n= n())%>%
mutate(gender_perc= formattable::percent(gender_n/sum(gender_n)))%>%
filter(gender== "Unknown")
perc_u1 <- perc_u$gender_perc
Key Takeaway
Gender is only available for subscribers: however, because the number of rides performed by suscribers significantly exceeds the amount of casual rides, we can confidently conclude that at least 70% of rides are made by males. The bike rental distribution for 2019 is as follows: 68.78% males, 22.71% females, 8.51% unknown (customers).
Prepare Data for Waffle Chart.
Waffle <- Cyclistic_2019_Outliers_Excluded %>%
drop_na(age_range) %>%
group_by(age_range)%>%
summarise(age_frequency= n())
# Create Vector
age<-c('between 16 and 19'= 2666,
'between 20 and 29'= 119735, 'between 30 and 39'= 111172,
'between 40 and 49'= 41161, '50+'= 40380)
Create Waffle Chart.
waffle(age/10000, rows=5, size=0.6,
colors=c("#006d77", "#83c5be", "#edf6f9",
"#ffddd2", "#e29578"),
title="Age Distribution of Subscribers",
xlab="1 square = 10000 persons")
avg_age<-Cyclistic_2019_Outliers_Excluded %>%
drop_na()%>%
summarise(avg_age= round(mean(age),1))
avg_age1 <- avg_age$avg_age
Key Takeaway
The target age for Cyclistic subscribers is between 20 and 40, with an average age of 34.9.
Subscribers rent fairly consistently throughout the year, while the frequency of bike rentals for customers fluctuates strongly.
Casual riders rent less frequently, but for longer duration.
Casual riders prefer weekends, while subscribers mainly rent during the workweek.
Customers like to rent later in the day than subscribers.
The predominance of weekday bike rentals, the earlier rides shorter rides and lesser seasonality suggests that subscribers use the bike service for their commute, while customers rent the bikes for leasure.
In order for the membership to be attractive for casual riders, they would need to have an incentive to use the bikes more frequently in their day-to-day lives.
Social media could be used to promote a healthy lifestyle, showcasing influencers commuting to work in any season and weather.
Cyclistic could cooperate with sports gear companies and create a campaign to promote all-year bike commuting.
Cyclistic could run challenges, competitions and promotions on social media, rewarding bike commuters.
Cyclistic could partner with local companies and offer cheaper yearly memberships to their employees.
Cyclistic could also partner with local health and fitness clubs and offer attractive memberships for their members. This is particularly true for female clubs, since the majority of subscribers is male and this could be an effective way to make bike rentals attractive for women as well.
Cyclistic could also work with high schools and colleges to try to attract younger subscribers, since currently most rides are performed by subscribers aged between 20 and 40.