BACKGROUND

Introduction

This is my capstone project to finish the Google Data Analytics Certification. I worked with the data of the DIVVY Bicycle Sharing Service.

Scenario

I am a junior data analyst working in the marketing analyst team at DIVVY Cyclists, a company in Chicago. According to the General Manager of this company, Ms. Lily Moreno, the company must launch marketing strategies to convert casual riders into annual members. She thinks that with this approach the profits for the company will increase. Nevertheless, she believes that understanding better how casual riders and members differ is a key step towards achieving this goal.

My responsibility as part of the marketing analyst team is to understand What casual riders and annual members use Cyclistic bikes differently and to provide insights to my manager regarding this goal.

About the DIVVY Bicycle Sharing Service

In 2016, the DIVVY company, through a program named Cyclistic, launched a bike-share offering, which currently represents a fleet of 5,824 bicycles that are tracked and locked into a network of 692 stations across the city of Chicago. The riders of these bicycles can unlock them in aty station and return them to other stations of the system at any time.

Thus far, there has been only one marketing strategy that relies on building general awareness and appealing through single-ride passes, full-day passes,a nd annual memberships. Also, derived from these plans the company has two basic segments, i.e., customers who purchase single-ride or full-day passes or Casual riders, and customers who purchase an annual membership. These last ones are called Cyclist members.

Ms. Lily Moreno, the President of the company has suggested a more innovative marketing plan. She thinks that maximizing the number of annual members will be key to future growth by convincing casual riders to buy a membership. She also considers that already the casual riders are aware of the Cyclistic Program and its benefits, so the next step will be to convince this segment to upgrade to the next level. But for this Ms. Moreno would like to gather information regarding the differences between both segments.

OBJECTIVES

Mission Report

To determine how annual members and casual riders use DIVVY Cyclist bikes differently.

Operational Objectives

To complete this objective, the two following operational objectives had to be accomplished.

  1. Create and describe beyond the basic division between annual members and casual riders, the behavior of a series of subsets of bike riders depending on the type of bicycle they usually ride.

  2. Describe the riding habits of the different subsets regarding two variables: the average distance in kilometers ridden on every occasion they used this bike service and the average hours per trip on the same occasion.

  3. Illustrate the differences among the subsets of bike riders regarding the most relevant stations of the city of Chicago from which they depart on each trip.

All of the analysis regarding the former should be presented to the main stakeholders of the DIVVY Bicycle Sharing Service, Ms. Lily Moreno (President) and other members of the board of directors.

CODING

Data Source

The data was imported from this web page: https://divvy-tripdata.s3.amazonaws.com/index.html

A total of twelve data sets, one for each month, from August 2021 to July were downloaded from this web page. Each data set contains information on every ride logged by the customers of this company. Only data referred to each trip and any personal information was scrubbed.

All files were large to be managed by Google Sheets or Excel. I used RStudio Desktop to process and analysis and Tableau Online Desktop to visualize results.

Preparing and Cleaning

Libraries used

library(tidyverse)
library(lubridate)
library(ggplot2)
library(here)
library(dplyr)
library(class)
library(readr)
library(tibble)
library(tidyr)
library(janitor)
library(skimr)

This is how we imported all the csv files from Aug. the 21st to Jul. the 22nd

library(readr)
Aug_2021_tripdata <- read_csv("202108-divvy-tripdata.zip")
View(Aug_2021_tripdata)
Sep_2021_tripdata <- read_csv("202109-divvy-tripdata.zip")
View(Sep_2021_tripdata)
Oct_2021_tripdata <- read_csv("202110-divvy-tripdata.zip")
View(Oct_2021_tripdata)
Nov_2021_tripdata <- read_csv("202111-divvy-tripdata.zip")
View(Nov_2021_tripdata)
Dec_2021_tripdata <- read_csv("202112-divvy-tripdata.zip")
View(Dec_2021_tripdata)                     
Jan_2022_tripdata <- read_csv("202201-divvy-tripdata.zip")
View(Jan_2022_tripdata)
Feb_2022_tripdata <- read_csv("202202-divvy-tripdata.zip")
View(Feb_2022_tripdata)
Mar_2022_tripdata <- read_csv("202203-divvy-tripdata.zip")
View(Mar_2022_tripdata)
Apr_2022_tripdata <- read_csv("202204-divvy-tripdata.zip")
View(Apr_2022_tripdata)
May_2022_tripdata <- read_csv("202205-divvy-tripdata.zip")
View(May_2022_tripdata)
Jun_2022_tripdata <- read_csv("202206-divvy-tripdata.zip")
View(Jun_2022_tripdata)
Jul_2022_tripdata <- read_csv("202207-divvy-tripdata.zip")
View(Jul_2022_tripdata)

We wanted to be sure that the uploaded files had the same columns and names.

compare_df_cols(Aug_2021_tripdata,Sep_2021_tripdata,Oct_2021_tripdata,
                Nov_2021_tripdata,Dec_2021_tripdata,Jan_2022_tripdata,
                Feb_2022_tripdata,Mar_2022_tripdata,Apr_2022_tripdata,
                May_2022_tripdata,Jun_2022_tripdata,Jul_2022_tripdata)

Then, created one only file with rbind.

Mydata <- rbind(Aug_2021_tripdata,Sep_2021_tripdata,Oct_2021_tripdata,
                Nov_2021_tripdata,Dec_2021_tripdata,Jan_2022_tripdata,
                Feb_2022_tripdata,Mar_2022_tripdata,Apr_2022_tripdata,
                May_2022_tripdata,Jun_2022_tripdata,Jul_2022_tripdata)

Calculated a new column (ride_length) which represents the number of seconds that each trip takes.

mydata_v1$ride_length <- difftime(mydata_v1$ended_at,mydata_v1$started_at)

Then we converted ride_length to numeric and removed cases where ride_length < 0.

class(mydata_v1$ride_length) = "Numeric"
mydata_v1 <- subset(mydata_v1, ride_length >0)

Created new variables (as suggested by Kevin Hartman).

mydata_v1$date <- as.Date(mydata_v1$started_at) #The default format is yyyy-mm-dd
mydata_v1$month <- format(as.Date(mydata_v1$date), "%m")
mydata_v1$day <- format(as.Date(mydata_v1$date), "%d")
mydata_v1$year <- format(as.Date(mydata_v1$date), "%Y")
mydata_v1$day_of_week <- format(as.Date(mydata_v1$date), "%A")

Created new groups.

mydata_v1 <-mydata_v1 %>% mutate(rider_class =
case_when(rideable_type =="docked_bike" mydata_v1$member_casual == "member" ~ "memb_docked_bike", 
rideable_type =="electric_bike" & mydata_v1$member_casual == "member" ~ "memb_electric_bike",
rideable_type =="classic_bike" & mydata_v1$member_casual == "member" ~ "memb_classic_bike",
rideable_type =="docked_bike" & mydata_v1$member_casual == "casual" ~
"cas_docked_bike",
rideable_type =="electric_bike" & mydata_v1$member_casual == "casual" ~ "cas_electric_bike",
rideable_type =="classic_bike" & mydata_v1$member_casual == "casual" ~ "cas_classic_bike"),      
)

Renamed my_data_v1 to all_trips_def and eliminated rows with NA and create all_trips.

all_trips_def <- mydata_v1
all_trips <- na.omit(all_trips)

Also, after arranging all_trips_cleaned by dist_km we eliminated an outlier whose value was 1,189.52165 using

all_trips <- subset(all_trips, dist_km < 1189)

We also created a new variable called “hours_per_trip” using

all_trips <- all_trips %>% 
  mutate(hours_per_trip = (all_trips$ride_length)/3600)

According to Isabella Peel in https://isabellapeel.github.io/Cyclistic_Case_Study-Divvy_Bikes/Final_report.html:

“The original data sets of the company excluded trips that were below 60 seconds because” (potentially false starts or users trying to re-dock a bike to ensure it was secure)“. So, we also excluded those cases.

all_trips <- subset(all_trips, ride_length > 60)

We tried to calculate another new variable which was the distance traveled on each trip using the following formula, the geosphere library, and the Haversine formula.

all_trips <- all_trips %>%
   all_trips$dist <- distm(x = all_trips[, c("start_lng", "start_lat")], 
                          y = all_trips[, c("end_lng","end_lat")],
                          fun = distHaversine
  )

But we got this error in return: Cannot allocate vector of size 114449.9 Gb. So, we created a subset of all_trips data frame named all_strips_distance:

all_trips_distance <- all_trips %>% 
     select(c("ride_length", "start_station_name","start_lat",
              "start_lng","end_lat","end_lng","member_casual","rider_class")) 

and wrote a csv file named:

write_csv(all_trips_distance, file = "all_trips_distance.csv")

We imported this csv file as a query using Power BI and calculated new Columns, in order to calculate the distance per trip using the following formula:

dist_km = ACOS( SIN(lat1PI()/180)SIN(lat2PI()/180) + COS(lat1PI()/180)* COS(lat2PI()/180)COS(lon2PI()/180-lon1PI()/180))*6371

Which required a previous transformation of the variables start_lat,start_lng, end_lat, and end_lng.

      lat1 = [start_lat]/180*Number.PI
      lat2 = [end_lat]/180*Number.PI
      lon1 = [start_lng]/180*Number.PI
      lon2 = [end_lng]/180*Number.PI

Considering that it was a large file to convert to .csv we used Dax Studio to do this. So, back to csv and importing into my RStudio project as all_trips_distance, Then we arranged the all_trips data frame and the all_trips_distance data frame by the “ride_id” columns, create a new and transitory data frame called distance (whose only variable was “dist_km” and use the function cbind to merge all_trips with the distance data frame.

Finally, we had our final data frame: “all_trips_cleaned”.

    all_trips_cleaned <- all_trips

We double-checked the final data frame to see if they were cleaned.

colnames(all_trips_cleaned)  #List of column names
nrow(all_trips_cleaned)  #How many rows are in data frame?
dim(all_trips_cleaned)  #Dimensions of the data frame?
head(all_trips_cleaned)  #See the first six rows of the data frame. Also tail(all_trips)
str(all_trips_cleaned)  #See list of columns and data types (numeric, character, etc)
summary(all_trips_cleaned)  #Statistical summary of data for numerics

DATA PROCESS

We first created the first two data frames summarizing the main statistical values of the variables dist_km and hours_per_trip in the main data frame all_trips_cleaned


    all_trips_d1<- all_trips_cleaned %>% 
      group_by(rider_class) %>% 
      summarize(min_dist_km = min(dist_km),
                max_dist_km = max(dist_km),
                median_dist_km = median(dist_km),
                mean_dist_km = mean(dist_km),
                ng= n())

    all_trips_h1<- all_trips_cleaned %>% 
      group_by(rider_class) %>% 
      summarize(min_hours_per_trip = min(hours_per_trip),
                max_hours_per_trip = max(hours_per_trip),
                median_hours_per_trip = median(hours_per_trip),
                mean_hours_per_trip = mean(hours_per_trip),
                ng= n())

Here is the first tibble

Here is the second tibble

Then we created all_trips_d4 and all_trips_export to be transformed into two csv files and processed with Tableau. The data frame all_trips_export is a subset of all_trips_cleaned

Here is a tibble of all_trips_d4

This data frame was a table of 1825 observations, each subset or type of riders contained 365 cases and was one to be used to make our charts. Use the following link to see them.

https://public.tableau.com/app/profile/eduardo.gonzalez.pinedo/viz/DIVVYCyclistsProject/KmandHoursCasual?publish=yes

FINAL REPORT

Analysis

As stated in our objectives we divided the data set of casual riders and members in a series of subsets depending on the type of bikes used. The total amount of cases included 4,372,979 bikers as shown in the following graph:

This number of cases represents 41.02% of the three groups that are considered casual riders and 58.98% represent the bicycle riders that have a membership.

A first analysis of these five groups via the two Box plots graphs, reveals that the Casual Docked Bikers is the most disperse sub set, considering the hours per trip. It seems also that this the group dedicates more time per trip than the rest independently of the distance traveled.

In the following chart, it can be seen that the dispersion of the data is a characteristic of the Casual Docked Bikes group. The correlation between the distance and the hour per trip is negative and near zero, which means there is no correlation between these two measures within the Casual Docked Group. An interpretation of this fact is that this group uses the DIVVY service to satisfy more Leisure needs than those activities related to work or similar requirements.

On the contrary, among the clients of Casual Classic Bikes the correlation is higher (r = .526) and even higher in the Casual Electric Bikes group, (r = .799). Thus, this is an important fact because any probable marketing or communication program addressed to the Casual riders must not be general; they must consider this type of difference.

As expected the two subsets of the Members group exhibit higher correlation between the distance traveled and the number of hours per trip, which means they are more aware of the type of service they are using and have specific needs other than using the service just for the fun of a ride.

Another factor to be considered is related to differences among the five subsets when the distance traveled is seen separately from the hours per trip. In this case, we can see why they do not correlate especially within the Casual Docked Bikes group. The following graph shows a sudden rise in this new group from January to February in the average distance traveled. Later during March, such averages tend to fall. Considering that during the such period it is still winter in a city like Chicago, such a sudden rise is hard to explain, unless we ponder the Covid-19 pandemic factor which affected everybody worldwide.

In the other cases, the trend is that the average distance decreases from October to January (Fall and part of the Winter) and then it increases from March to July (Spring and part of Summer). The same trend is seen when the average hours per trip are considered. Furthermore, in the case of the Casual Docked Bikes group a sudden increment appears between December and January, which again is hard to explain due to the particularly chilly winter in cities like Chicago.

Now, when the averages are evaluated by the day of the week, the only relevant difference appears in the average of hours per trip. Again, the Casual Docked Group shows a very odd increment from Wednesday to Thursday. The other groups show a set of very stable averages per day.

Regarding the most popular stations and the differences among the subsets, it can be seen from the next two charts that at least for the three subsets of casual riders, the tendency is to consider the same list of names among the thirty most selected stations. In particular, the names of Streeter Dr & Grand Ave., Dusable Lake Shore & Monroe St., Michigan Ave & Oak St., and Millennium Park stand out in those lists.

In the former two charts, the sum of times the riders used those stations as the starting point, and the correspondent percentages are the measures considered. As we said a concise list of the stations named can set differences among the five subsets, but not after that.

More differences show up when the heat maps are related to the averages of the distances traveled and the averages of the hours per trip as shown in the next two charts.

Finally, the differences in the average distance traveled from the thirty stations most frequently used as a starting point and the average hours per trip, but considering the total number of cases, independently of the type of rider can be seen in the following graph.

Conclusions

Reviewing our data, trends, and values of this analysis, we conclude the following:

  1. Working with a new variable such as the distance traveled from two geographical points (starting station and end station) and trying to correlate it with the number of hours that each ride took, and also operating with a new division of the portfolio of clients (beyond Casual vs. Member) has driven a new set of insights.

  2. The correlation analysis demonstrates that the Casual Docked Bikers simply use this service on a basis that has nothing to do with honoring an appointment, going to a workplace, or any other similar situation. We can suppose then that it is due to satisfying leisure needs.

  3. A better position regarding the correlation analysis is achieved by the Casual Classic Bikers and the Casual Electric Bikers. Especially in this last case, these are people that are using the bicycle service for a specific service (attending an appointment, going to work, etc.) In this last group, it can be better predicted the hours that a ride takes from the distance covered and vice versa.

  4. Likewise, when the two subsets of the Members Group, are being considered, the correlation improves which means that either the Member Classic Bikers or the Member Electric Bikers are more aware and have a clearer and a more specific purpose of using this kind of service than the Casual Bikers.

  5. On the other hand, when we independently analyzed monthly the trajectory of both measures, the average distance traveled and average hours per trip, the data showed us that the main trend in any subset was to diminish the average the distance per month, and the average hours per trip. This happened starting from August to December, and from there the tendency was to increase from February to July.

  6. This is a logical fact considering that the first period refers to the Fall and the beginning of the Winter (August all through December) and the second period represents the beginning of Spring until and part of Summer (March to July). Nevertheless, again the group of the Casual Docked Bikers was consistent with its unpredictable behavior.

  7. For this subset the average distance traveled during the months of December 2021; January 2022; and February 2022 was 2.258 Km, 2.215 Km, and 2.771 Km, respectively. With the average hours per trip (h/trip) there was a comparable situation: 1.405 h/trip, during December 2021; 2.297 in January and it starts to decrease in February 2022 with 1.555 h/trip. Thinking that these are winter months and that the trips are in the city of Chicago, this kind of facts could be explained due to the impossibility of returning the bikes on time, some difficulty in correctly registering the time when this happened, and of course, the Covid-19 pandemic whose major peak was during 2020 and part of 2021.

  8. Beyond being the highest value in h/trip and considering the day of the week, the Casual Docked Bikers show the only inconsistency in comparison to the rest of the subsets. In this case, the average increases from 1.023 during Wednesday to 1.764 during Thursday and falls back to .841 during Fridays. This kind of behavior is more related to the personal traits and characteristics of its members than to any other external factor.

  9. Finally, we listed a set of the most relevant starting stations from where the riders depart for their rides. In general, the different names appear with the same relative values whenever the total sum of cases and/or their proportions are involved. Again, and as a general trend, the subset of the Casual Docked Bikers are the ones that have the highest influence in such discrepancy.

Recommendations

Bearing in mind that the company’s final goal is to design a fresh marketing and communication campaign we recommend that these procedures should be addressed to the Casual Classic Bikers and the Casual Electric Bikers which represent the largest proportion of the target. Nevertheless, this general strategy should consider one or several tactics adapted especially to the Casual Docked Bikers.

  1. A general digital campaign communicating the benefits of having a membership for this kind of service. This one should consider the fact that Chicago is a complicated city and a way to simplify the life of its locals could be the usage of a bicycle to transport themselves on time and safer to various places. You could earn time and money. Being a member of this special club will help you with this purpose.

  2. In the case of the Casual Docked Bikers specific messages within this digital campaign, convincing them that even though leisure needs are what motivates them, why not become a member? Having fun riding a bicycle does not mean that they should pay more. Consider the hardest moment of the year like winter. If I am a member, no worries if I could not return my bike on time. Being a member should give me certain advantages. Besides, thinking that the Covid-19 pandemic is hopefully ending, any factor that helps me return to normality should be considered.

  3. Ponder the fact that the Covid-19 pandemic is close to an end. Use this as a special drive within the general digital campaign.

  4. Use flyers, posters, and similar devices to be delivered to the localities close to the most popular starting stations of the city of Chicago communicating the benefits of being a member biker. They should be addressed to any kind of biker because it will reinforce the current member’s behavior and simply generate the casual biker’s attention.

  5. Send text messages, phone calls, and/or email to previously identified casual riders communicating the benefits of becoming members, especially to the ones that are Docked Bikers.