Introduction

In this case study, I’ll be acting as a junior data analyst working for a fictional company called Cyclistic. To answer the key business questions, I will follow the following data analysis process: ask, prepare, process, analyse, share and act. These steps give me a clear path to succeed in this case study and to answer the business related questions. Below you’ll find the scenario as well as the questions I’ll try to answer in this project.

Scenario

Cyclistic is a bike-share company based in Chicago and I will be acting as a junior data analyst. The director of marketing believes the company’s future success depends on maximizing the number of annual memberships. That being said, I need to understand how casual riders and annual members use Cyclistic bikes differently. From these insights, I will design a new marketing strategy to convert casual riders into annual members. But first, Cyclistic executives must approve my recommendations, so they must be backed up with compelling data insights and professional data visualizations.

Background

Cyclistic launched a successful bike-share offering in 2016 and since then, the program has grown to a fleet of 5824 bicycles. These bicycles are geotracked and locked into a network of 692 stations across Chicago. These bike can be unlocked from one station and returned to any other station in the system at any point. Until now, Cyclistic’s marketing strategy relied on building general awareness and appealing to broad consumer segments. One approach that helped make these things possible was the flexibility of its pricing plans: single-ride passes, full-day passes, and annual memberships. Customers who have purchased the single-ride or full-day passes are referred to as casual riders; Customers who purchase annual memberships are Cyclisitc members.

From previous analysis, Cyclistic’s finance analysts have concluded that annual members are much more profitable than casual riders. Although the pricing flexibility helps Cyclistic attract more customers, the director of marketing Lily Moreno believes that maximizing the number of annual members will be key to future growth. Rather than creating a marketing campaign that targets all-new customers, Moreno believes there is a very good chance to convert casual riders into members. She notes that casual riders are already aware of the Cyclistic program and have chosen Cyclistic for their mobility needs.

Now that we have the background information we can move onto the ask phase.

Ask

Our goal is to design marketing strategies to help convert casual riders into annual members. By investigating the data provided, we can find good insights that will help provide solutions to the questions below. It is important that the data will be used to help us answer the business task and for us to not lose focus of the business task. The following questions will help guide us to solve this business task:

Moreno has decided that is best we focus on the first question: How do annual members and casual riders use Cyclistic bikes differently? By the end of this study I will have produced a variety of deliverable that would be presented to Lily Moreno and then the Cyclistic Executive team.

Prepare

We then move on to the prepare stage, in which we prepare the data we will use for the analysis. We will use Cyclistic’s historical trip data that has been provided and can be downloaded here.

This is a public data set that will be used to explore how different customer types are using Cyclistic bikes. We won’t be able to connect pass purchase to credit card numbers to determine if casual riders live in the Cyclistic service area or if they have purchased multiple single passes. This is because of data-privacy issues which prohibits access from riders’ personally identifiable information (PII).

We will focus on the data from the year 2022. This will allow us to have access to a full year of data and at the time of making this, it is the most recent full year. We need to ensure the data is stored appropriately and so I have created a folder on my desktop to store all the data that will be used. The data is separated per month and this is clearly shown by demonstrating appropriate file-naming conventions (e.g “2022_02_trip_data”). The data we will be using has been found directly from Cyclistic’s own collection of metrics from stations they own. The data is also very up to date and so it is very relevant to the problem. Each table contained 13 columns: ride IDs, bike type, start and end times of rides, start and end station identifiers such as name and location, and member type of the rider.

Note: The data has been made available by Motivate International Inc. under this licence

Process

I began the process step by trying to import all of the data into Google Sheets as this is what I used during the Certification. Unfortunately, there were issues with this as there were too much data. So, I decided to use R as this is able to work with larger amounts of data.

Then, I focused on finding any missing data and used the following functions to do this. From this, I found that there’s a lot of missing data for the columns that related to the locations of the stations e.g start_station_name. I decided to not remove these rows straight away as I may choose to use the other data from these rows that aren’t missing. However, if I am to use this data then I will remove the null values. I then merged all of the months together to have one data frame with all of the data. Then, I quickly checked to see if there were any duplicate ID’s as this should be unique and in this case there were no duplicates, so no further action was required. It is worth mentioning again that we are not provided any PII so having the ride_id to be unique for every row makes sense.

I then created a new column called ride_length which calculated the total ride length for each trip. Also, I created another column, day_of_week which would indicate which day of the week the bike was used. The last columns I added were date columns that shown the date, the month and the day of the month in separate columns. These new columns will be useful to analyse as it could help identify trends among the data. Finally, we removed any bad data and stored this in a new data frame as we are removing data. The data we removed included trips where the ride length were negative (around 100 rows) and where the start station was ‘HQ QR’. This is because these data points were taken out of their docks and checked for quality by Divvy. This now puts us at a point where we can begin to analyse the data.

Analyse

As we have now cleaned the data we can now analyse the data and see what results we can get to help answer our question. To analyse my data I used a variety of tools in R which included the functions max, min, mean, aggregate, ggplot and many more (See Appendix). To begin with I found a summary of the ride length which told us the median, mean, min and max. The max was 41387 which is close to a whole month. This definitely seems to be an outlier but as it doesn’t greatly impact the results there is no need for it to be removed. I then investigated into the differences between casual customers and members. The general trend managed to show that casual customers have a greater average ride length than members. This was the case regardless of the day of the week. This is seen in the diagram below.

#visualize average duration 
all_trips_v2 %>% 
  mutate(weekday = wday(started_at, label = TRUE)) %>% 
  group_by(member_casual, weekday) %>% 
  summarise(number_of_rides = n()
            ,average_duration = mean(ride_length)) %>% 
  arrange(member_casual, weekday)  %>% 
  ggplot(aes(x = weekday, y = average_duration, fill = member_casual)) +
  geom_col(position = "dodge") +
  labs(y = "Average duration", title = "Comparing average duration of casual vs member")
## `summarise()` has grouped output by 'member_casual'. You can override using the
## `.groups` argument.

Next, we looked at how the number of rides vary over the week and again compared members and casuals. This is shown in the graph below. The weekdays clearly shows a great difference in the number of rides with the members having a lot more rides than the casuals. Whilst in the weekend, the amount of rides are more similar. This could be because the members use the bike to get to work, but to confirm this lets look at how often the bikes are used throughout the day.

#visualize the number of rides by rider type
all_trips_v2 %>% 
  mutate(weekday = wday(started_at, label = TRUE)) %>% 
  group_by(member_casual, weekday) %>% 
  summarise(number_of_rides = n()
            ,average_duration = mean(ride_length)) %>% 
  arrange(member_casual, weekday)  %>% 
  ggplot(aes(x = weekday, y = number_of_rides, fill = member_casual)) +
  geom_col(position = "dodge") +
  labs(y = "no. of rides", title = "Number of rides each weekday")
## `summarise()` has grouped output by 'member_casual'. You can override using the
## `.groups` argument.

The following graph shows how the number of riders vary throughout the day and again, compares the members vs the casual riders. The members doesn’t follow a complete pattern, however we see that it does spike at around 08:00 and 17:00 which is when people would be travelling to and from work. This could suggest that members generally use the bikes for this purpose. Casual members seems to increase throughout the day and peaks at around 16:00.

#visualize how bikes are used at each hour of day
all_trips_v2 %>% 
  mutate(hour_start = hour(started_at)) %>% 
  group_by(member_casual, hour_start) %>% 
  summarise(number_of_rides = n()
            ,average_duration = mean(ride_length)) %>% 
  arrange(member_casual, hour_start)  %>% 
  ggplot(aes(x = hour_start, y = number_of_rides, fill = member_casual)) +
  geom_col(position = "dodge") +
  labs(x = "hour of day",
       y= "no. of rides",
       title = "Number of rides each hour of day")
## `summarise()` has grouped output by 'member_casual'. You can override using the
## `.groups` argument.

Then, we seen how the number of rides is impacted over a year. Below is a plot that shows how the number of rides changes over a year. As expected, during the winter season, the amount of rides that occur is much smaller than that of the summer time. Also, during the winter season there were more membership riders than casual whilst during June to August the amount of rides are much more similar.

#visualize no of bikes over year
all_trips_v2 %>% 
  mutate(month = month) %>% 
  group_by(member_casual, month) %>% 
  summarise(number_of_rides = n()
            ,average_duration = mean(ride_length)) %>% 
  arrange(member_casual, month)  %>% 
  ggplot(aes(x = month, y = number_of_rides, fill = member_casual)) +
  geom_col(position = "dodge") +
  labs(x = "month",
       y= "no. of rides",
       title = "Number of rides for every month")
## `summarise()` has grouped output by 'member_casual'. You can override using the
## `.groups` argument.

Finally, I wanted to focus on where the most popular locations are. The graph below shows the 10 most popular start stations and how many rides take place at each respective station. Unfortunately, we wanted to segment this analysis by the member type but this didn’t turn out this way. However, it still tells us the most popular station has the most casual riders which could be useful in the recommendations part

#visualize most popular start stations 
top_start_station = all_trips_v2 %>%  #slice and group data frame as follows
  group_by(
    member_casual, start_station_name, start_lat, start_lng
  ) %>%
  summarise(
    nr_rides_start = n()
  ) %>%
  arrange(-nr_rides_start)
## `summarise()` has grouped output by 'member_casual', 'start_station_name',
## 'start_lat'. You can override using the `.groups` argument.
top_start_station[1:10,] %>% 
  ggplot(aes(x = reorder(start_station_name, nr_rides_start), 
             y = nr_rides_start,
             fill = member_casual)) +
  geom_col(position = "dodge") +
  coord_flip() +
  labs(x = "station name",
       y= "no. of rides",
       title = "Most Popular Start Stations") + 
  theme(axis.text.y = element_text(angle=45, size = 8))

Share

I used Tableau to create a dashboard that would showcase graphs and key figures. The link to this can be found here. I felt that through this visualization and the analysis in this report I was able to find some key findings as to how annual and casual riders use Cyclistic bikes. Here are some of the key differences I found during this process:

Act

My recommendations for the marketing strategy using my analysis and visualizations are as follows:

Appendix

setwd("C:/Users/cwill/Documents/Google Course/Google Final Project/Data/All CSV Datasets")
#load libraries
library(ggplot2)
library(tidyverse)
library(skimr)
library(janitor)

#import .csv files
tripdata01 <- read.csv("202201-divvy-tripdata.csv")
tripdata02 <- read.csv("202202-divvy-tripdata.csv")
tripdata03 <- read.csv("202203-divvy-tripdata.csv")
tripdata04 <- read.csv("202204-divvy-tripdata.csv")
tripdata05 <- read.csv("202205-divvy-tripdata.csv")
tripdata06 <- read.csv("202206-divvy-tripdata.csv")
tripdata07 <- read.csv("202207-divvy-tripdata.csv")
tripdata08 <- read.csv("202208-divvy-tripdata.csv")
tripdata09 <- read.csv("202209-divvy-publictripdata.csv")
tripdata10 <- read.csv("202210-divvy-tripdata.csv")
tripdata11 <- read.csv("202211-divvy-tripdata.csv")
tripdata12 <- read.csv("202212-divvy-tripdata.csv")

#merge all data together
alltripdata <- rbind(tripdata01,tripdata02,tripdata03,tripdata04,tripdata05,tripdata06,tripdata07,tripdata08,
    tripdata09,tripdata10,tripdata11,tripdata12)

#to see if all data for ride_id is unique - it is
alltripdata <- distinct(alltripdata, ride_id,.keep_all = TRUE)

#create new col for ride length in mins
alltripdata <- alltripdata %>%
  mutate(ride_length = difftime(ymd_hms(alltripdata$ended_at),                            ymd_hms(alltripdata$started_at), units="mins"))

#create new columns for day of week, month and day
alltripdata <- alltripdata %>%
  mutate(day_of_week = weekdays(as.Date(alltripdata$started_at)))                        

alltripdata$date <- as.Date(alltripdata$started_at)
alltripdata$month <- format(as.Date(alltripdata$date), "%m")
alltripdata$day <- format(as.Date(alltripdata$date), "%d")

#Change data type to numeric and check this is true
is.factor(alltripdata$ride_length)
alltripdata$ride_length <- as.numeric(as.character(
  alltripdata$ride_length))
is.numeric(alltripdata$ride_length)

#create new df as we are removing 'bad' data
all_trips_v2 <- alltripdata[!(alltripdata$start_station_name == 
                              "HQ QR" | alltripdata$ride_length<0),]

#descriptive analysis on ride length
summary(all_trips_v2$ride_length)

#compare members vs causal users
aggregate(all_trips_v2$ride_length ~ all_trips_v2$member_casual
          , FUN = mean)
aggregate(all_trips_v2$ride_length ~ all_trips_v2$member_casual
          , FUN = median)
aggregate(all_trips_v2$ride_length ~ all_trips_v2$member_casual
          , FUN = max)


#average ride length by each day for members vs casual
all_trips_v2$day_of_week <- ordered(all_trips_v2$day_of_week
                                    , levels=c("Sunday", "Monday",
                                               "Tuesday", "Wednesday",
                                               "Thursday", "Friday", "Saturday"))

aggregate(all_trips_v2$ride_length ~ 
            all_trips_v2$member_casual + 
            all_trips_v2$day_of_week, FUN = mean)

#analyse ridership data by weekday and member/casual
all_trips_v2 %>% 
  mutate(weekday = wday(started_at, label = TRUE)) %>%  #creates weekday field using wday()
  group_by(member_casual, weekday) %>%  #groups by usertype and weekday
  summarise(number_of_rides = n()                           #calculates the number of rides and average duration 
            ,average_duration = mean(ride_length)) %>%      # calculates the average duration
  arrange(member_casual, weekday)
#visualize average duration 
all_trips_v2 %>% 
  mutate(weekday = wday(started_at, label = TRUE)) %>% 
  group_by(member_casual, weekday) %>% 
  summarise(number_of_rides = n()
            ,average_duration = mean(ride_length)) %>% 
  arrange(member_casual, weekday)  %>% 
  ggplot(aes(x = weekday, y = average_duration, fill = member_casual)) +
  geom_col(position = "dodge") +
  labs(y = "Average duration", title = "Comparing average duration of casual vs member")
#visualize the number of rides by rider type
all_trips_v2 %>% 
  mutate(weekday = wday(started_at, label = TRUE)) %>% 
  group_by(member_casual, weekday) %>% 
  summarise(number_of_rides = n()
            ,average_duration = mean(ride_length)) %>% 
  arrange(member_casual, weekday)  %>% 
  ggplot(aes(x = weekday, y = number_of_rides, fill = member_casual)) +
  geom_col(position = "dodge") +
  labs(y = "no. of rides", title = "Number of rides each weekday")
#visualize how bikes are used at each hour of day
all_trips_v2 %>% 
  mutate(hour_start = hour(started_at)) %>% 
  group_by(member_casual, hour_start) %>% 
  summarise(number_of_rides = n()
            ,average_duration = mean(ride_length)) %>% 
  arrange(member_casual, hour_start)  %>% 
  ggplot(aes(x = hour_start, y = number_of_rides, fill = member_casual)) +
  geom_col(position = "dodge") +
  labs(x = "hour of day",
       y= "no. of rides",
       title = "Number of rides each hour of day")
#visualize no of bikes over year
all_trips_v2 %>% 
  mutate(month = month) %>% 
  group_by(member_casual, month) %>% 
  summarise(number_of_rides = n()
            ,average_duration = mean(ride_length)) %>% 
  arrange(member_casual, month)  %>% 
  ggplot(aes(x = month, y = number_of_rides, fill = member_casual)) +
  geom_col(position = "dodge") +
  labs(x = "month",
       y= "no. of rides",
       title = "Number of rides for every month")
#visualize most popular start stations 
top_start_station = all_trips_v2 %>%  #slice and group data frame as follows
  group_by(
    member_casual, start_station_name, start_lat, start_lng
  ) %>%
  summarise(
    nr_rides_start = n()
  ) %>%
  arrange(-nr_rides_start)

top_start_station[1:10,] %>% 
  ggplot(aes(x = reorder(start_station_name, nr_rides_start), 
             y = nr_rides_start,
             fill = member_casual)) +
  geom_col(position = "dodge") +
  coord_flip() +
  labs(x = "station name",
       y= "no. of rides",
       title = "Most Popular Start Stations") + 
  theme(axis.text.y = element_text(angle=45, size = 8))