How does a bike-share navigate speedy success

Upon completion of the lessons in Coursera for the course: Google Data Analytics, I have worked on a case study to put all the new knowledge to practice.

This report is written on RStudio R Markdown document and rendered to a HTML page. As learnt in the course, we will be using the following as the outline to our report. This ensures that we will have a comprehensive analysis of the case study.

Ask
Prepare
Process
Analyse
Share
Act

Ask

Problem Statement: How do annual members and casual riders use Cyclistic bikes differently?

Stakeholders

The stakeholders for this analysis are mainly the following individual/teams:

Lily Moreno: Director of marketing
Cyclistic marketing analytics team
Cyclistic executive team

Insights

Identify the relationship between annual members and casual members
Identify business offers to specific user type
Identify user-specific offerings or experience

Prepare

Datasets

The data has been made available by Motivate International Inc. under this license.

Data download link here

Complete list below for the datasets that is being used in this analysis.

Divvy_Trips_2019_Q1
Divvy_Trips_2019_Q2
Divvy_Trips_2019_Q3
Divvy_Trips_2019_Q4

Details on the total dataset
Format: CSV
Size: 525.6 MB
No. of rows: 3,818,004

Issues encountered

Large data set: Initial plans to clean data throught Microsoft Excel. But due to the size of the datasets, I was unable to proceed with that. There are alternatives to using Excel. In this case, I will be using R, to proceed with data cleaning and analysis.

Process

We will be performing the data cleaning and analysis using R. The scripts run for this will be shown as we go. There are also visualisation on Tableau together with R visualisations in this case study. Link to the Tableau dashboard will be provided below.

Reasons to pick R:

Convenient for data cleaning and visualisation in one place
Ease of writing report and generating a HTML page/PDF/slides at completion

Why practice Tableau

Convenient visualisation tool
Commonly used tool for analysts

Loading packages for data manipulation and visualisation

install.packages("tidyverse", repos = "https://cloud.r-project.org")
install.packages("scales", repos = "https://cloud.r-project.org")

library(tidyverse)
library(scales)

In this case study, we will be using tidyverse and scales packages.

Download the data

divvytripdata2019Q1 <- read_csv("Data/csv/Divvy_Trips_2019_Q1.csv")

divvytripdata2019Q2 <- read_csv("Data/csv/Divvy_Trips_2019_Q2.csv")

divvytripdata2019Q3 <- read_csv("Data/csv/Divvy_Trips_2019_Q3.csv")

divvytripdata2019Q4 <- read_csv("Data/csv/Divvy_Trips_2019_Q4.csv")

Read the downloaded data using read_csv from readr package which is part of the core tidyverse.

We are able to successfully load the data onto the environment as no errors have been encountered. We will be able to examine and clean the data before proceeding with the analysis.

View data

We will firstly view the column names to start off the data cleaning.

Verify the data

colnames(divvytripdata2019Q1)

 [1] "trip_id"           "start_time"        "end_time"         
 [4] "bikeid"            "tripduration"      "from_station_id"  
 [7] "from_station_name" "to_station_id"     "to_station_name"  
[10] "usertype"          "gender"            "birthyear"

colnames(divvytripdata2019Q2)

 [1] "01 - Rental Details Rental ID"                   
 [2] "01 - Rental Details Local Start Time"            
 [3] "01 - Rental Details Local End Time"              
 [4] "01 - Rental Details Bike ID"                     
 [5] "01 - Rental Details Duration In Seconds Uncapped"
 [6] "03 - Rental Start Station ID"                    
 [7] "03 - Rental Start Station Name"                  
 [8] "02 - Rental End Station ID"                      
 [9] "02 - Rental End Station Name"                    
[10] "User Type"                                       
[11] "Member Gender"                                   
[12] "05 - Member Details Member Birthday Year"

colnames(divvytripdata2019Q3)

 [1] "trip_id"           "start_time"        "end_time"         
 [4] "bikeid"            "tripduration"      "from_station_id"  
 [7] "from_station_name" "to_station_id"     "to_station_name"  
[10] "usertype"          "gender"            "birthyear"

colnames(divvytripdata2019Q4)

 [1] "trip_id"           "start_time"        "end_time"         
 [4] "bikeid"            "tripduration"      "from_station_id"  
 [7] "from_station_name" "to_station_id"     "to_station_name"  
[10] "usertype"          "gender"            "birthyear"

With the exception of the divvytripdata2019Q2 dataset (referred as Q2), we can see that the column names are the same across the datasets. As such, we will rename the Q2 column names to be the same as the other three data sets for ease of data manipulation.

Rename the columns

divvytripdata2019Q2 <- 
  rename(divvytripdata2019Q2, 
         "trip_id" = "01 - Rental Details Rental ID",
         "start_time" = "01 - Rental Details Local Start Time",
         "end_time" = "01 - Rental Details Local End Time",
         "bikeid" = "01 - Rental Details Bike ID" ,
         "tripduration" = "01 - Rental Details Duration In Seconds Uncapped",
         "from_station_id" = "03 - Rental Start Station ID",
         "from_station_name" = "03 - Rental Start Station Name" ,
         "to_station_id"  = "02 - Rental End Station ID",
         "to_station_name" = "02 - Rental End Station Name",
         "usertype" = "User Type",
         "gender" = "Member Gender",     
         "birthyear" = "05 - Member Details Member Birthday Year"
         )

We have renamed the column names using rename() from the dplyr package.

Verify the data again

As we have renamed all column for divvytripdata2019Q2, we will verify that the change was successful.

colnames(divvytripdata2019Q1)

 [1] "trip_id"           "start_time"        "end_time"         
 [4] "bikeid"            "tripduration"      "from_station_id"  
 [7] "from_station_name" "to_station_id"     "to_station_name"  
[10] "usertype"          "gender"            "birthyear"

colnames(divvytripdata2019Q2)

 [1] "trip_id"           "start_time"        "end_time"         
 [4] "bikeid"            "tripduration"      "from_station_id"  
 [7] "from_station_name" "to_station_id"     "to_station_name"  
[10] "usertype"          "gender"            "birthyear"

colnames(divvytripdata2019Q3)

 [1] "trip_id"           "start_time"        "end_time"         
 [4] "bikeid"            "tripduration"      "from_station_id"  
 [7] "from_station_name" "to_station_id"     "to_station_name"  
[10] "usertype"          "gender"            "birthyear"

colnames(divvytripdata2019Q4)

 [1] "trip_id"           "start_time"        "end_time"         
 [4] "bikeid"            "tripduration"      "from_station_id"  
 [7] "from_station_name" "to_station_id"     "to_station_name"  
[10] "usertype"          "gender"            "birthyear"

Combining data to process

We have confirmed that the column names are the same across the 4 dataframes. As such, we will combine the rows into a single dataframe - divvydata using rbind(). When using rbind(), we will need to ensure that all dataframes have the same variables. We have verified the variables are the same in the code chunk above where all column names are the same.

divvydata <- rbind(divvytripdata2019Q1, divvytripdata2019Q2, divvytripdata2019Q3, divvytripdata2019Q4)

colnames(divvydata)

 [1] "trip_id"           "start_time"        "end_time"         
 [4] "bikeid"            "tripduration"      "from_station_id"  
 [7] "from_station_name" "to_station_id"     "to_station_name"  
[10] "usertype"          "gender"            "birthyear"

Once all the rows are combined into one dataframe, we will proceed with the data cleaning.

Data cleaning

divvydata_cleaned <- divvydata %>% 
  select("trip_id", "start_time", "end_time", "bikeid", 
         "tripduration", "from_station_id", "from_station_name",
         "to_station_id", "to_station_name", "usertype",
         "gender", "birthyear") %>% 
  na.omit() %>% 
  mutate(trip_length = as.numeric(difftime(end_time, start_time, units = "mins")),
         weekday = weekdays(as.Date(start_time)), 
         month = months(as.Date(start_time))) %>% 
  filter(birthyear > 1900, birthyear < 2024, 
         trip_length > 0, trip_length <= (24*60),
         gender %in% c("Male", "Female"),
         to_station_id > 0, from_station_id > 0, bikeid > 0, trip_id > 0,
         usertype %in% c("Subscriber", "Customer"))  %>% 
  distinct()

This will be the total code for data cleaning assigned to the dataframe divvydata_cleaned. Let’s go through the items line by line.

Breakdown of data cleaning

We will go through the lines. Starting with the first line!

divvydata_cleaned <- divvydata %>%

We will be creating a new dataframe named divvydata_cleaned. We will be passing divvydata as a parameter to the subsequent functions that we will calling to process the data.

  select("trip_id", "start_time", "end_time", "bikeid", 
           "tripduration", "from_station_id", "from_station_name",
           "to_station_id", "to_station_name", "usertype",
           "gender", "birthyear") %>%

We will use select() to select the column names that we would like to keep. In this case, I have listed all the column names from divvydata as we will be working with it.

  na.omit() %>%

All rows with empty values will be removed.

  mutate(trip_length = as.numeric(difftime(end_time, start_time, units = "mins")),
           weekday = weekdays(as.Date(start_time)), 
           month = months(as.Date(start_time))) %>%

Next, we will be using mutate() to add new columns using the existing values. We will be creating a total of 3 columns: trip_length, weekday, month.

trip_length is the number of minutes derived from the time difference between trip start time and end time, where the variables are start_time and end_time respectively. We will use difftime() to get the time difference in minutes.

weekday is created to identify which day of the week the trip has been made. To get the day of the week, we will be passing the start_time variable to the base function as.Date() to get the date. That value is passed to the weekdays() base function to get the day of the week.

month is similar to weekday. Instead of using weekdays() to get the value that we want, we will be using months() to get the value of the month of the date that we have passed to it.

The above three variables are created to ease our analysis later.

  filter(birthyear > 1900, birthyear < 2024, 
         trip_length > 0, trip_length <= (24*60),
         gender %in% c("Male", "Female"),
         to_station_id > 0, from_station_id > 0, bikeid > 0, trip_id > 0,
         usertype %in% c("Subscriber", "Customer"))  %>%

We will be using filter(), to ensure that only valid values are assigned to our variables.

birthyear: Year of birth should be at least 1900, and not beyond the current year 2024.

trip_length: Should be more than zero and less than one day which has 24 hours. Value for this variable is in minutes.

gender: Values should be either Male or Female.

to_station_id / from_station_id / bikeid / trip_id: Values are numeric of at least 1.

usertype: Values should be either Subscriber or Customer.

  distinct()

We will only take unique rows and remove any duplicates in our data.

Verify data after cleaning

This is to ensure only expected values are in our data.

min(divvydata_cleaned$birthyear)
max(divvydata_cleaned$birthyear)

[1] 1901
[1] 2014

Year of birth are between 1900 and 2024.

min(divvydata_cleaned$trip_length)
max(divvydata_cleaned$trip_length)

[1] 1.016667
[1] 1439.75

Trip length is at least 1 minute.

min(divvydata_cleaned$to_station_id)
min(divvydata_cleaned$from_station_id)
min(divvydata_cleaned$bikeid)
min(divvydata_cleaned$trip_id)

[1] 2
[1] 2
[1] 1
[1] 21742443

IDs are minimally 1.

unique(divvydata_cleaned$usertype)

## [1] "Subscriber" "Customer"

unique(divvydata_cleaned$gender)

## [1] "Male"   "Female"

unique(divvydata_cleaned$weekday)

## [1] "Tuesday"   "Wednesday" "Thursday"  "Friday"    "Saturday"  "Sunday"   
## [7] "Monday"

unique(divvydata_cleaned$month)

##  [1] "January"   "February"  "March"     "April"     "May"       "June"     
##  [7] "July"      "August"    "September" "October"   "November"  "December"

With this we have confirmed that all the variables are what it should be. We shall proceed with the analysis.

Analyse & Share

Before proceeding with analysis, we will save the cleaned data into a csv for any other analysis not done on R. This data will be used in Tableau later for any additional graphs.

Write to CSV

write_csv(divvydata_cleaned, "Data/csv/divvydata_cleanedv1.csv")

Plot and Analyse Graphs

We will plot and analyse its trends and relationships of mainly 4 graphs.

Total rides across the year 2019 (months)
Total rides across the week in 2019
Number of users in each age group
Rides across the day [Tableau]

Total rides across the year 2019 (months)

months_in_a_year <- c("January", "February", "March", "April", "May", "June", "July", "August", "September", "October", "November", "December")

# New facet label names for usertype variable
usertype.labs <- c("Casual Rider", "Annual Member")
names(usertype.labs) <- c("Customer", "Subscriber")

ggplot(data = divvydata_cleaned) +
  geom_bar(position=position_dodge(), mapping = aes(x = factor(month, months_in_a_year), fill = usertype, group=usertype)) + 
  scale_y_continuous(breaks = seq(0,500000, by=50000), labels=label_number(scale = 1e-3, suffix = "K")) + 
  scale_fill_hue(labels = c("Casual Rider", "Annual Member")) + 
  geom_text(stat='count', aes(x=month, label=comma(after_stat(count)), group=usertype), vjust=-0.5, size=2.5, position = position_dodge(width=0.9)) +
  labs(title = "Total rides across the year 2019", fill = "User Type") +
  xlab("Months") + 
  ylab("Number of rides") +
  theme(legend.position = "bottom")

Trends & Relationships

Ride trends for both user types are similar
High number of rides across June to September 2019
Peak in August 2019 for both casual rider and annual members
- Annual Member: 399,853 rides
- Casual Rider: 83,123 rides
Least number of rides in February for both casual rider and annual members
- Annual Member: 92,725 Rides
- Casual Rider: 720 Rides
This trend can be due the changes in weather. During the warmer months, there are more active rides as compare to the months where the weather is cooler.

Total rides across the week in 2019

days_in_a_week <- c("Sunday", "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday")

# New facet label names for usertype variable
usertype.labs <- c("Casual Rider", "Annual Member")
names(usertype.labs) <- c("Customer", "Subscriber")

ggplot(data = divvydata_cleaned) +
  geom_bar(position=position_dodge(), mapping = aes(x = factor(weekday, days_in_a_week), fill=usertype)) +
  scale_y_continuous(breaks = seq(0,500000, by=100000), labels=label_number(scale = 1e-3, suffix = "K")) +
  labs(title = "Total rides across the week in 2019", fill = "User Type") +
  scale_fill_hue(labels = c("Casual Rider", "Annual Member")) +
  xlab("Day of the week") + 
  ylab("Number of rides") + 
  geom_text(stat='count',size=3, vjust=-0.5, aes(x=weekday, label=comma(after_stat(count)), group=usertype),  position = position_dodge(width=0.9)) +
  theme(legend.position = "bottom")

Trends & Relationships

Annual members more active on weekdays
- Members are consistent users as there is not much differences across the week
- Members are likely using it as their mode of transport to their daily destination such as work or school
Casual members more active on weekends
- This is likely because the users are using it for leisure instead of as a mode of transport to their destination such as work and school daily.

Number of users in each age group

agegroup <- c(1900, 1910, 1920, 1930, 1940, 1950, 1960, 1970, 1980, 1990, 2000, 2010)

# Group user age groups
divvydata_cleaned_age <- divvydata_cleaned %>% 
  select("trip_id", "usertype", "birthyear") %>% 
  mutate(age_group = cut(birthyear, breaks=c(1900, 1909, 1919, 1929, 1939, 1949, 1959, 1969, 1979, 1989, 1999, 2009, 2024), labels=agegroup))

ggplot(data = divvydata_cleaned_age) +
  geom_bar(position=position_dodge(), mapping = aes(x = age_group, fill=usertype)) +
  scale_y_continuous(breaks = seq(0,1000000, by=100000), labels=label_number(scale = 1e-3, suffix = "K")) +
  labs(title = "Users per age group", fill = "User Type") +
  scale_fill_hue(labels = c("Casual Rider", "Annual Member")) +
  xlab("Age Group") + 
  ylab("Number of users") + 
  geom_text(stat='count',size=2.5, vjust=-0.5, aes(x=age_group, label=comma(after_stat(count)), group=usertype),  position = position_dodge(width=0.9)) +
  theme(legend.position = "bottom")

Trends & Relationships

Majority the users are from the age group that has the most number of rides are from 1980 and 1990 for annual members and the 1990 for casual members
Annual member: Majority users are in the age range of 20 to 39 year old (1980s to 1990s) - 2,110,809
Casual rider: Majority are in age range 20-29 year old (1990s) - 182,733

Rides across the day [Tableau]

Trends & Relationships

On weekdays, we see that there are two peaks across the day: 0800H and 1700H. Whereas on weekends, there is a gradual increase as the day where it peaks in the late afternoon before it decreases as the day ends. This is likely due to the annual members cycling to and back from work/school. As weekend users are more to leisure, the trend is that there are more active users during the daytime as compared to night time.

Tableau Dashboard

Here is the link to the Tableau Public Dashboard

Act

From this analysis, we have gained insights and understand users activities on Cyclistic. Through the insights gained, we can address the points below in order to provide a more customised offerings to the users and encourage more casual users to join the annual membership.

Peak hours vs Non-peak hours
Weekday usage vs Weekends usage only
Increased rides in warmer months

Taking the points above into consideration, we can create the following packages for user’s subscription in order to retain and obtain more annual membership users

Time specific packages

From this analysis, we have identified the peak and off-peak timings of the rides on weekdays and weekends. As casual users are using Cyclistic simply for leisure, we can provide packages that is for off-peak only and peak only packages that costs slightly lower than paying full package where you can use it at any time of day.

This packages can encourage more users to join in as some users could be price sensitive. Offering a lower cost package that could meet their needs can encourage more users to purchase the annual membership.

Weekend packages

As casual users are using Cyclistic simply for leisure on the weekends, we can provide packages that is only for weekends that costs lower in total than paying for single rides each weekend as a casual rider.

This packages can encourage users to join in as limited annual user where they purchase a package that is tailored to their needs.

Season specific packages

In months where the weather is cold, there is a drop in overall ridership for Cyclistic. To promote users to use Cyclistic in the colder seasons, there could be special discounts for one-time rides and also annual membership.

Conclusion

This concludes the write-up on the case study: How does a bike-share navigate speedy success. Thank you for reading and hope you gained some insights from it.

Case Study 1

Google Data Analytics Capstone

Aisyah Jefri

2024-09-25