Introduction

This case study is my capstone project for the Google Data Analytics Professional Certificate (via Coursera).

For this project I assume the role of junior data analyst. I have been asked to look at historical data for a fictional company, Cyclistic. My deliverable: to make recommendations for an upcoming marketing campaign. For this project I will use real-world data collected between January 2021 – December 2022 from a bike share program in Chicago.

Cyclistic operates a fleet of more than 5,800 bicycles which can be accessed from over 600 docking stations across the city. Bikes can be borrowed from one docking station, ridden, then returned to any docking station in the system Over the years marketing campaigns have been broad and targeted a cross-section of potential users. Data analysis has shown that riders with an annual membership are more profitable than casual riders. The marketing team are interested in creating a campaign to encourage casual riders to convert to members.

The marketing analyst team would like to know how annual members and casual riders differ, why casual riders would buy a membership, and how Cyclistic can use digital media to influence casual riders to become members. The team is interested in analyzing the Cyclistic historical bike trip data to identify trends in the usage of bikes by casual and member riders.

1. ASK

How can the stakeholders increase profitability by convincing casual riders to become annual members through the use of a targeted marketing campaign?

How do annual members and casual riders use Cyclistic bikes differently?

The stakeholders in this project are:

Lily Moreno, Director of Marketing at Cyclistic, responsible for the marketing campaigns at Cyclistic.

The Cyclistic marketing analytics team. This team responsible for collecting, analyzing and reporting data to be used in marketing campaigns. I assume the role junior analyst on this team

The Cyclistic executive team. This team makes the final decision on the recommended marketing plan.

2. PREPARE

The raw data can be found here: Cyclistic Bike Share The data has been made available by Motivate International Inc.

Once downloaded, the data was imported into RStudio for preparation and cleaning. Some summary statistics were also calculated within RStudio. Cleaned data was then exported in CSV format to be used in Power BI for visualization and analysis. Final images were then exported from Power BI and brought back into RStudio to be placed in the final report using RMarkdown.

Load necessary libraries

  • tidyverse
  • dplyr
  • purrr
  • lubridate
  • knitr
  • janitor
  • data.table
  • gridExtra

Import and merge data

12 individual CSV files were imported, one for each month from January 2021 – December 2022. These 12 files were then merged into one large dataset.

3. PROCESS AND CLEAN

Remove duplicate rows based on all columns:

data %>% distinct()

Records for trips less than 60 seconds (false starts) or longer than 24 hours were removed. Bikes out longer than 24 hours are considered stolen and the rider is charged for a replacement. Create duration column using start and end times. (91288 records removed)

data$duration <- difftime(data$ended_at,data$started_at)
is.factor(data$duration)
data$duration <- as.numeric(as.character(data$duration))
is.numeric(data$duration)
data <- data[!(data$duration<60 | data$duration>86400),]
data$start_time <- as.ITime(data$started_at)
format(round(data$start_time, units="hours"), format="%H")
data$end_time <- as.ITime(data$ended_at)
format(round(data$end_time, units="hours"), format="%H")

Records with missing fields start_station, end_station, start/end lat/long fields were removed. (1,098,532 records removed)

data <- na.omit(data)

Extract day, month, year, day_of_week from original raw date column for statistical analysis

data$date <- as.Date(data$started_at)
data$month <- format(as.Date(data$date), "%m")
data$day <- format(as.Date(data$date), "%d")
data$year <- format(as.Date(data$date), "%Y")
data$day_of_week <- format(as.Date(data$date), "%A")

Records for trips that started or ended at DIVVY CASSETTE REPAIR MOBILE STATION or HUBBARD ST BIKE CHECKING (LBS-WH-TEST) were removed as these are administrative stations. (8 records removed)

data<- data[!(data$start_station_name == "DIVVY CASSETTE REPAIR MOBILE STATION" | data$start_station_name == "HUBBARD ST BIKE CHECKING (LBS-WH-TEST)" | data$start_station_name == "WATSON TESTING DIVVY" | data$end_station_name == "DIVVY CASSETTE REPAIR MOBILE STATION" | data$end_station_name == "HUBBARD ST BIKE CHECKING (LBS-WH-TEST)" | data$end_station_name == "WATSON TESTING DIVVY"),]

Initially the data set contained 5,667,986 records. Once data was cleaned, 4,569,446 records remained. 19.4% of the records were removed.

4. ANALYZE

Once the data was cleaned, analysis of the tidy data was undertaken in RStudio to determine the following:

Mean, median, maximum and minimum ride duration (by rider type) Average ride length by day and by rider type Count of trips by rider type Count of trips by bicycle type Count of the number of start stations

## `summarise()` has grouped output by 'data$start_station_name'. You can override
## using the `.groups` argument.
## `summarise()` has grouped output by 'data$end_station_name'. You can override
## using the `.groups` argument.

The cleaned data set was used to create a csv file that was exported from RStudio and imported into Power BI for further analysis and creation of visualizations.

Power BI was used to further analyze the data and determine:

Popularity of bike types Number of trips by rider type Most popular days of the week for rides Duration of rides by rider type Most popular time of day to ride Popularity of start and end location for rides Top 10 start stations by user type Top 10 end stations by user type Months of the year of the rides

Summary of analysis

Analyzing the cleaned dataset in Power BI we can see that there are several key differences between casual and member riders:

Bike Preference

Classic bikes are the dominant category for rider choice. Whether this is due to availability or rider preference is uncertain at this point. Given the rise in popularity in electric bikes over the past decade - it is possible that more riders would choose to become members if more electric bikes were available.

Casual riders represent more than 44% of total riders using the bike sharing service. Being able to reach these riders with a targeted marketing campaign can have a significant impact on convincing them to become members. Knowing where to find them and when is critical to a successful campaign.

Weekends are the most popular time for casual riders to use the service. Friday is the most popular weekday. Members usage is relatively consistent throughout the week - slightly higher on week days, suggesting members are using their bikes for commuting to and from work.

Casual riders overall ride their bikes longer than members - one possible explanation is that members are using the bikes consistently for commuting to and from work while casual riders are taking longer rides for sightseeing/leisure activities.

Member and casual riders also differ in the stations that are popular for starting and ending their rides. Casual riders appear to cluster around locations closer to the water/parks - again suggesting that casual riders are using their bikes for leisure activities while members are using their bikes more consistently for commuting to and from work.

The number of trips made by casual riders increases over the day, peaking at 5-6 pm. Member trips also peak at 5-6 pm but there are two smaller peaks earlier in the day at 8 am and noon, which corresponds with the work day.

Identifying the most popular start and end locations for casual riders is essential to ensure a successful targeted marketing campaign. The top ten start and end locations are shown for casual riders.

June to September is the most popular time for casual riders.

5. SHARE

This report can be found at RPubs in R Markdown format.

6. ACT

Based on an analysis of the data, the following recommendations are made to the Cyclistic stakeholders:

Knowing when and where to reach the most casual riders is critical for a successful marketing campaign. The marketing campaign should be targeted at the popular start and end stations for casual riders.

To reach the most riders, marketing should be targeted for the busiest casual rider days (Friday, Saturday and Sunday), busiest hours (afternoon) and the most popular months (June, July, August, and September).

To reach the most riders, marketing should target stations where the maximum number of casual riders frequent.

Beyond the data collected by Cyclistic I recommend future studies collect additional data (via random stoppage of riders to collect survey data - offering incentives to compensate them for their time such as coupons for free rides) * Ask casual riders if they are riding for leisure or commuting to work * Ask casual riders what is preventing them from becoming members. Price? Type of bikes available? Lack of bikes available in their neighbourhood?