Introduction

The Cyclistic Bike Share Capstone Case Study is a project that forms part of the Google Data Analytics Professional Certificate. Here’s a brief overview:

In this case study, I placed myself in the role of a Junior Data Analyst working as a member of the Marketing Analystic Team at Cyclistic Bike Share Company tasked to carry Out a comprehensive analysis of the behaviour of bike users categorised into two: Members and Casual Users. In the project, I am applying data analytics techniques learned in the Cousera Google Data Analytics Professional Certificate Course to a real-world business scenario to providing valuable insights into customer behaviour and strategic decision-making.

Background:

• The case study is based on a fictional company named Cyclistic, a bike-share company with over 5,800 bicycles and 600 docking stations. • Cyclistic offers a variety of bikes, including reclining bikes, hand tricycles, and cargo bikes, catering to a diverse range of users. • The company has a flexible strategy allowing riders to unlock a bike from one station and return it to any other station.

Data Analysis Phases: The case study follows a structured analysis process divided into six phases:

Ask: Define the business task and consider the demands of key stakeholders.
Prepare: Use Cyclistic’s historical trip data for analysis.
Process: Organize and clean the data for analysis.
Analyze: Examine the data to find trends and insights.
Share: Present the findings in an accessible format.
Act: Make recommendations based on the analysis.

Business Task:

Although the pricing flexibility helps Cyclistic attract more customers than usual, Lily Moreno, the marketing director, believes that expanding the number of annual members will be critical to future growth. Cyclistic’s finance analysts have also concluded that annual members are much more profitable than casual riders. Rather than creating a marketing campaign targeting all-new customers, Moreno believes there is a good chance to convert casual riders into members. She notes that casual riders already know the Cyclistic program and have chosen Cyclistic for their mobility needs.

Based on this three hypothesis:

• The primary aim is to understand how casual riders and annual members use Cyclistic bikes differently. • Insights from this analysis are intended to help the marketing analytics team develop strategies to convert casual riders into annual members.

Tools Used: • The analysis utilizes tools such as RStudio,Google Sheets, Tableau, and RPubs.

Load Packages

{r} library(tidyverse) {r} library(lubridate)

{r} library(dplyr)

{r} library(readr)

{r} library(ggplot2)

STEP 1: COLLECT DATA

```{r} file_location <- file.path(“C:\User\Okoye Benjamin E\Documents\divvy_Trips_2020_Q1.csv”)


```{r}
file_location <- file.path("C:\\User\\Okoye Benjamin E\\Documents\\divvy_Trips_2019_Q1.csv")

# Upload Divvy datasets (csv files) here

{r} q1_2019 <- read_csv("Divvy_Trips_2019_Q1.csv") ```{r} q1_2020 <- read_csv(“Divvy_Trips_2020_Q1.csv”)


# STEP 2: WRANGLE DATA AND COMBINE INTO A SINGLE FILE

# Compare column names each of the files

# While the names don't have to be in the same order, they DO need to match perfectly before we can use a command to join them into one file

```{r}
colnames(q1_2019)

{r} colnames(q1_2020)

Rename columns to make them consistent with q1_2020 (as this will be the supposed going-forward table design for Divvy)

{r} (q1_2019 <- rename(q1_2019 ,ride_id = trip_id ,rideable_type = bikeid ,started_at = start_time ,ended_at = end_time ,start_station_name = from_station_name ,start_station_id = from_station_id ,end_station_name = to_station_name ,end_station_id = to_station_id ,member_casual = usertype ))

Inspect the dataframes and look for incongruencies

{r} str(q1_2019)

{r} str(q1_2020)

Convert ride_id and rideable_type to character so that they can stack correctly

```{r} q1_2019 <- mutate(q1_2019, ride_id = as.character(ride_id) ,rideable_type = as.character(rideable_type))


# Stack individual quarter's data frames into one big data frame

```{r}
all_trips <- bind_rows(q1_2019, q1_2020)#, q3_2019)#, q4_2019, q1_2020)

Remove lat, long, birthyear, and gender fields as this data was dropped beginning in 2020

```{r} all_trips <- all_trips %>%
select(-c(start_lat, start_lng, end_lat, end_lng, birthyear, gender, “tripduration”))


# STEP 3: CLEAN UP AND ADD DATA TO PREPARE FOR ANALYSIS

# Inspect the new table that has been created

#List of column names

```{r}
colnames(all_trips)

#How many rows are in data frame?

{r} nrow(all_trips)

#Dimensions of the data frame?

{r} dim(all_trips) #See the first 6 rows of data frame.

{r} head(all_trips)

Also six rows of the data frame

{r} tail(all_trips)

#See list of columns and data types (numeric, character, etc)

{r} str(all_trips)

#Statistical summary of data. Mainly for numerics

{r} summary(all_trips)

There are a few problems we will need to fix:

(1) In the “member_casual” column, there are two names for members (“member” and “Subscriber”) and two names for casual riders (“Customer” and “casual”). We will need to consolidate that from four to two labels.

(2) The data can only be aggregated at the ride-level, which is too granular. We will want to add some additional columns of data – such as day, month, year – that provide additional opportunities to aggregate the data.

(3) We will want to add a calculated field for length of ride since the 2020Q1 data did not have the “tripduration” column. We will add “ride_length” to the entire dataframe for consistency.

(4) There are some rides where tripduration shows up as negative, including several hundred rides where Divvy took bikes out of circulation for Quality Control reasons. We will want to delete these rides.

In the “member_casual” column, replace “Subscriber” with “member” and “Customer” with “casual”

Before 2020, Divvy used different labels for these two types of riders … we will want to make our dataframe consistent with their current nomenclature

N.B.: “Level” is a special property of a column that is retained even if a subset does not contain any values from a specific level

Begin by seeing how many observations fall under each usertype

{r} table(all_trips$member_casual)

Reassign to the desired values (we will go with the current 2020 labels)

{r} all_trips <- all_trips %>% mutate(member_casual = recode(member_casual ,"Subscriber" = "member" ,"Customer" = "casual"))

Check to make sure the proper number of observations were reassigned

```{r} table(all_trips$member_casual)

# Add columns that list the date, month, day, and year of each ride

# This will allow us to aggregate ride data for each month, day, or year ... before completing these operations we could only aggregate at the ride level

# https://www.statmethods.net/input/dates.html more on date formats in R found at that link

```{r}
all_trips$date <- as.Date(all_trips$started_at) #The default format is yyyy-mm-dd

{r} all_trips$month <- format(as.Date(all_trips$date), "%m")

{r} all_trips$day <- format(as.Date(all_trips$date), "%d")

{r} all_trips$year <- format(as.Date(all_trips$date), "%Y")

{r} all_trips$day_of_week <- format(as.Date(all_trips$date), "%A")

Add a “ride_length” calculation to all_trips (in seconds)

https://stat.ethz.ch/R-manual/R-devel/library/base/html/difftime.html

{r} all_trips$ride_length <- difftime(all_trips$ended_at,all_trips$started_at)

Inspect the structure of the columns

{r} str(all_trips)

Convert “ride_length” from Factor to numeric so we can run calculations on the data

{r} is.factor(all_trips$ride_length)

{r} all_trips$ride_length <- as.numeric(as.character(all_trips$ride_length))

{r} is.numeric(all_trips$ride_length)

Remove “bad” data

The dataframe includes a few hundred entries when bikes were taken out of docks and checked for quality by Divvy or ride_length was negative

We will create a new version of the dataframe (v2) since data is being removed

https://www.datasciencemadesimple.com/delete-or-drop-rows-in-r-with-conditions-2/

{r} all_trips_v2 <- all_trips[!(all_trips$start_station_name == "HQ QR" |all_trips$ride_length<0),]

STEP 4: CONDUCT DESCRIPTIVE ANALYSIS

Descriptive analysis on ride_length (all figures in seconds)

{r} mean(all_trips_v2$ride_length) #straight average (total ride length / rides)

{r} median(all_trips_v2$ride_length) #midpoint number in the ascending array of ride lengths

{r} max(all_trips_v2$ride_length) #longest ride

{r} min(all_trips_v2$ride_length) #shortest ride

You can condense the four lines above to one line using summary() on the specific attribute

{r} summary(all_trips_v2$ride_length)

Compare members and casual users

{r} aggregate(all_trips_v2$ride_length ~ all_trips_v2$member_casual, FUN = mean)

{r} aggregate(all_trips_v2$ride_length ~ all_trips_v2$member_casual, FUN = median)

{r} aggregate(all_trips_v2$ride_length ~ all_trips_v2$member_casual, FUN = max)

```{r} aggregate(all_trips_v2$ride_length ~ all_trips_v2$member_casual, FUN = min)


# See the average ride time by each day for members vs casual users

```{r}
aggregate(all_trips_v2$ride_length ~ all_trips_v2$member_casual + all_trips_v2$day_of_week, FUN = mean)

{r} aggregate(all_trips_v2$ride_length ~ all_trips_v2$member_casual, FUN = median)

{r} aggregate(all_trips_v2$ride_length ~ all_trips_v2$member_casual, FUN = max)

```{r} aggregate(all_trips_v2$ride_length ~ all_trips_v2$member_casual, FUN = min)


# See the average ride time by each day for members vs casual users

```{r}
aggregate(all_trips_v2$ride_length ~ all_trips_v2$member_casual + all_trips_v2$day_of_week, FUN = mean)

Notice that the days of the week are out of order. Let’s fix that.

{r} all_trips_v2$day_of_week <- ordered(all_trips_v2$day_of_week, levels=c("Sunday", "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday"))

Now, let’s run the average ride time by each day for members vs casual users

{r} aggregate(all_trips_v2$ride_length ~ all_trips_v2$member_casual + all_trips_v2$day_of_week, FUN = mean)

analyze ridership data by type and weekday


# Let's visualize the number of rides by rider type

```{r}
all_trips_v2 %>% 
  mutate(weekday = wday(started_at, label = TRUE)) %>% 
  group_by(member_casual, weekday) %>% 
  summarise(number_of_rides = n()
            ,average_duration = mean(ride_length)) %>% 
  arrange(member_casual, weekday)  %>% 
  ggplot(aes(x = weekday, y = number_of_rides, fill = member_casual)) +
  geom_col(position = "dodge")

Let’s create a visualization for average duration

```{r} all_trips_v2 %>% mutate(weekday = wday(started_at, label = TRUE)) %>% group_by(member_casual, weekday) %>% summarise(number_of_rides = n() ,average_duration = mean(ride_length)) %>% arrange(member_casual, weekday) %>% ggplot(aes(x = weekday, y = average_duration, fill = member_casual)) + geom_col(position = “dodge”)


## My Presentation: 

In the Cyclistic Bike-Share Analysis Case Study, I analyzed historical data from a fictional bike-sharing company called Cyclistic. Here are the key insights and recommendations:

## Business Task:

The aim of the case study was to understand how casual riders and annual members use Cyclistic bikes differently.

## Data Analysis Phases:

The analysis followed several phases:

1. Ask: Identifying the business task and considering stakeholder demands.

2. Prepare: Using Cyclistic’s historical trip data (12 months) for analysis.

3. Process: Cleaning and organizing the data.

4. Analyze: Identifying trends and patterns.

5. Share: Communicating findings.

6. Actions: Implementing recommendations.

## Tools Used:

The main tools included RStudio, Google Sheets,Tableau and RPubs.

## Details of my Observation:

1. It was revealed that in all_trips_V2$ride_length:
   Casual-Customer riders       696019.11 secs
   Members                      59395.69 secs
2. Maximum ride time/Average ride time:            
   Casual-Customer riders       946684800 secs
   Members                      820454400 secs
3. Ridership data by type and weekday revealed that:
   Casaul-Customer riders use 7 days (Sunday-Saturday)
   Members use only 3 days (Sunday-Tuesday)
4. The visualizations of ride by type revealed that though the members ride only for 3 days,    (Sunday-Tuesday),they have greater number of rides.
5. Visualization for Average ride duration indicates that Casual Customers have greater        Average ride duration
   
## Insights:

1. The analysis revealed distinct patterns between casual riders and annual members.

2. Casual Riders: These customers purchase single-ride or full-day passes.

3. Cyclistic Members: These customers have annual memberships.

These insights would be used by the Cyclistic marketing analytics team to design a new marketing strategy aimed at converting casual riders into annual members.

Supporting visualizations and key findings

## ACT

This phase will be carried out by the executive team, Director of Marketing (Lily Moreno) and the Marketing Analytics team based on my analysis.

Conclusion:

The Visualization revealed that:

1.  Members have more bikes compared to casual riders.
2.  We have more members riding in all months compared to casual riders.
3.  Casual riders travel for a longer period.
4.  Members ride more throughout the entire weekday while the casual riders also have a high     ride record during the weekends (Saturday and Sunday) compared to the other days of the     week.
5.  Casual riders go farther in terms of distance.

## Deliverable

These are the compelling data insights and my design marketing strategies aimed at converting casual riders into annual members.

1. Have a slash sale or promo for casual riders so they can acquire more bikes and indulge     them in the benefits of being a member.
2.  Host fun biking competitions with prizes at intervals for casual riders on the weekends.     Since there are lot of members on weekends, this will also attract them to get a            membership.
3.  Encourage casual riders to ride more in the entire year through advertisement, hand         flyers, by giving them various coupons so as to convince them into being a member.

## Acknowledegment:

My data was downloaded from Global web icon
Divvy Bikes: https://divvybikes.com/system-dataGlobal , a web Historical trip data available to the public. Here you'll find Divvy's trip data for public use. 

Note: This data is provided according to the Divvy Data License Agreement and released on a monthly schedule.


I also used the Divvy Execise R Script.

Thank you for making out time to read my Presentation. I look forward to your valuable feedback.  

```{r}
write.csv(all_trips_v2, "C:\\Users\\Okoye Benjamin E\\Documents\\divvy_Trips_2019_Q1_csv", row.names=FALSE)

{r} file_location

STEP 5: EXPORT SUMMARY FILE FOR FURTHER ANALYSIS

Create a csv file that we will visualize in Excel, Tableau, or my presentation software

N.B.: This file location is for a Mac. If you are working on a PC, change the file location accordingly (most likely “C:_USERNAME...”) to export the data. You can read more here: https://datatofish.com/export-dataframe-to-csv-in-r/

```{r} counts <- aggregate(all_trips_v2$ride_length ~ all_trips_v2$member_casual + all_trips_v2$day_of_week, FUN = mean) write.csv(counts, file = ‘avg_ride_length.csv’)


## Create a DataFrame

## Use write.csv to Export the DataFrame

## Next, you’ll need to add the syntax to export the DataFrame to a CSV file in R.

## To create a DataFrame in R,using this code,

```{r}
rm(df)
df <- data.frame(a = -(1:5), b = 1:5)
df$c[df$a > 0] <- 7
df
#   a b  c
# 1 -1 1 NA
# 2 -2 2 NA
# 3 -3 3 NA
# 4 -4 4 NA
# 5 -5 5 NA

{r} print(df)

Case Study 1 : Cyclistic Bike Share User Analysis

Okoye Benjamin

2024-05-30

Introduction

Background:

Business Task:

Based on this three hypothesis:

Load Packages

STEP 1: COLLECT DATA

# Upload Divvy datasets (csv files) here

Rename columns to make them consistent with q1_2020 (as this will be the supposed going-forward table design for Divvy)

Inspect the dataframes and look for incongruencies

Convert ride_id and rideable_type to character so that they can stack correctly

Remove lat, long, birthyear, and gender fields as this data was dropped beginning in 2020

Also six rows of the data frame

There are a few problems we will need to fix:

(1) In the “member_casual” column, there are two names for members (“member” and “Subscriber”) and two names for casual riders (“Customer” and “casual”). We will need to consolidate that from four to two labels.

(2) The data can only be aggregated at the ride-level, which is too granular. We will want to add some additional columns of data – such as day, month, year – that provide additional opportunities to aggregate the data.

(3) We will want to add a calculated field for length of ride since the 2020Q1 data did not have the “tripduration” column. We will add “ride_length” to the entire dataframe for consistency.

(4) There are some rides where tripduration shows up as negative, including several hundred rides where Divvy took bikes out of circulation for Quality Control reasons. We will want to delete these rides.

In the “member_casual” column, replace “Subscriber” with “member” and “Customer” with “casual”

Before 2020, Divvy used different labels for these two types of riders … we will want to make our dataframe consistent with their current nomenclature

N.B.: “Level” is a special property of a column that is retained even if a subset does not contain any values from a specific level

Begin by seeing how many observations fall under each usertype

Reassign to the desired values (we will go with the current 2020 labels)

Check to make sure the proper number of observations were reassigned

Add a “ride_length” calculation to all_trips (in seconds)

https://stat.ethz.ch/R-manual/R-devel/library/base/html/difftime.html

Inspect the structure of the columns

Convert “ride_length” from Factor to numeric so we can run calculations on the data

Remove “bad” data

The dataframe includes a few hundred entries when bikes were taken out of docks and checked for quality by Divvy or ride_length was negative

We will create a new version of the dataframe (v2) since data is being removed

https://www.datasciencemadesimple.com/delete-or-drop-rows-in-r-with-conditions-2/

STEP 4: CONDUCT DESCRIPTIVE ANALYSIS

Descriptive analysis on ride_length (all figures in seconds)

You can condense the four lines above to one line using summary() on the specific attribute

Compare members and casual users

Notice that the days of the week are out of order. Let’s fix that.

Now, let’s run the average ride time by each day for members vs casual users

analyze ridership data by type and weekday

Let’s create a visualization for average duration

STEP 5: EXPORT SUMMARY FILE FOR FURTHER ANALYSIS

Create a csv file that we will visualize in Excel, Tableau, or my presentation software

N.B.: This file location is for a Mac. If you are working on a PC, change the file location accordingly (most likely “C:_USERNAME...”) to export the data. You can read more here: https://datatofish.com/export-dataframe-to-csv-in-r/