Scenario
Cyclistic is a fictional bike-sharing company in Chicago. The service was launched in 2016 and has grown to a fleet of 5,824 bikes that are geotracked and locked into a network of 692 stations around the city.
The team wants to understand how casual riders and annual members, the two different rider types, use their service differently and believe that the company’s future success depends on maximizing the number of annual memberships as they are the most profitable. From these insights, the team will design a new marketing campaign to convert casual riders into annual members.
The company wants to analyze data of the last twelve months, from February 2021 to January 2022, and provides the data under the following link.
* https://divvy-tripdata.s3.amazonaws.com/index.html
For my analysis, I’ll use R for data exploration, manipulation and cleaning and then visualize the data in Tableau.
Data prearation
Setting up my R environment
library(tidyverse)## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --
## v ggplot2 3.3.5 v purrr 0.3.4
## v tibble 3.1.6 v dplyr 1.0.7
## v tidyr 1.1.4 v stringr 1.4.0
## v readr 2.1.1 v forcats 0.5.1
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(geosphere)
knitr::opts_knit$set(root.dir = "~/R/Practice/Case Study_Cyclists/Data")Data upload & exploration
First, I’ll upload the monthly data sets into R and explored them using the View() and head() function.
data_2021_02 <- read.csv("202102-divvy-tripdata.csv")
data_2021_03 <- read.csv("202103-divvy-tripdata.csv")
data_2021_04 <- read.csv("202104-divvy-tripdata.csv")
data_2021_05 <- read.csv("202105-divvy-tripdata.csv")
data_2021_06 <- read.csv("202106-divvy-tripdata.csv")
data_2021_07 <- read.csv("202107-divvy-tripdata.csv")
data_2021_08 <- read.csv("202108-divvy-tripdata.csv")
data_2021_09 <- read.csv("202109-divvy-tripdata.csv")
data_2021_10 <- read.csv("202110-divvy-tripdata.csv")
data_2021_11 <- read.csv("202111-divvy-tripdata.csv")
data_2021_12 <- read.csv("202112-divvy-tripdata.csv")
data_2022_01 <- read.csv("202201-divvy-tripdata.csv")colnames(data_2021_02)## [1] "ride_id" "rideable_type" "started_at"
## [4] "ended_at" "start_station_name" "start_station_id"
## [7] "end_station_name" "end_station_id" "start_lat"
## [10] "start_lng" "end_lat" "end_lng"
## [13] "member_casual"
Data consolidation
After I made sure all files have the same data structure, I’ll consolidate them into one data frame for data manipulation and cleaning.
full_data <- rbind(data_2021_02, data_2021_03, data_2021_04, data_2021_05, data_2021_06,
data_2021_07, data_2021_08, data_2021_09, data_2021_10, data_2021_11,
data_2021_12,data_2022_01)Data manipulation
Convert start and end time into date format
To be able to make calculations with the start and end time of the trips, we need to convert these two columns into date format.
full_data$started_at <- strptime(full_data$started_at, format = "%Y-%m-%d %H:%M:%S")
full_data$ended_at <- strptime(full_data$ended_at, format = "%Y-%m-%d %H:%M:%S")Calculate trip length in minutes and convert into number
In the next step, we’ll calculate the trip length of each ride.
full_data$ride_length_min <- round(difftime(full_data$ended_at, full_data$started_at, units = 'mins'), 2) %>%
as.numeric()Calculate trip distance in meters
I’ll also calculate the trip distance based on the start and end coordinates using the ‘geosphere’ package. To do so, I’ll use the ‘haversine’ method, which gives the shortest distance between two coordinates over the earth’s surface.
full_data <- full_data %>%
mutate(ride_distance_m = round(distHaversine(cbind(full_data$start_lng, full_data$start_lat),
cbind(full_data$end_lng, full_data$end_lat)), 2))To make sure the results are accurate, I’ve spot checked some coordinates manually on this website.
Let’s take a look at the data set.
knitr::kable(head(full_data, n = 5))| ride_id | rideable_type | started_at | ended_at | start_station_name | start_station_id | end_station_name | end_station_id | start_lat | start_lng | end_lat | end_lng | member_casual | ride_length_min | ride_distance_m |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 89E7AA6C29227EFF | classic_bike | 2021-02-12 16:14:56 | 2021-02-12 16:21:43 | Glenwood Ave & Touhy Ave | 525 | Sheridan Rd & Columbia Ave | 660 | 42.01270 | -87.66606 | 42.00458 | -87.66141 | member | 6.78 | 982.20 |
| 0FEFDE2603568365 | classic_bike | 2021-02-14 17:52:38 | 2021-02-14 18:12:09 | Glenwood Ave & Touhy Ave | 525 | Bosworth Ave & Howard St | 16806 | 42.01270 | -87.66606 | 42.01954 | -87.66956 | casual | 19.52 | 814.32 |
| E6159D746B2DBB91 | electric_bike | 2021-02-09 19:10:18 | 2021-02-09 19:19:10 | Clark St & Lake St | KA1503000012 | State St & Randolph St | TA1305000029 | 41.88579 | -87.63110 | 41.88487 | -87.62750 | member | 8.87 | 315.99 |
| B32D3199F1C2E75B | classic_bike | 2021-02-02 17:49:41 | 2021-02-02 17:54:06 | Wood St & Chicago Ave | 637 | Honore St & Division St | TA1305000034 | 41.89563 | -87.67207 | 41.90312 | -87.67394 | member | 4.42 | 847.45 |
| 83E463F23575F4BF | electric_bike | 2021-02-23 15:07:23 | 2021-02-23 15:22:37 | State St & 33rd St | 13216 | Emerald Ave & 31st St | TA1309000055 | 41.83473 | -87.62583 | 41.83816 | -87.64512 | member | 15.23 | 1645.30 |
Data cleaning
I’ll remove lines where the start and end station is identical and the trip length is under 2 minutes assuming the customers unlocked a bike and then had to put it right back. Also, I’ll remove any lines with no start or end station.
full_data_clean <- full_data[full_data$ride_length_min > 2 &
full_data$start_station_id != full_data$end_station_id, ] %>%
filter(start_station_id != "") %>%
filter(end_station_id != "")This removes 1,391,019 lines from the data set.
nrow(full_data)## [1] 5601999
nrow(full_data_clean)## [1] 4210980
Data analysis
First, let’s take a look at the number of rides per month and compare casual riders (in blue) to annual members (in orange).
We can see that the number of rides peaks in the summer months, from June until September. In most of the months, the number of rides from annual members is higher than the ones from casual riders (except of July). In the winter months, the number of rides drops sharply with some annual members but almost none casual riders.
This could mean that annual members do rely on the service as their main mean of transport, meaning they also have to use it during cold and moist weather conditions. On the other side, casual riders use it almost excludively during summer which could mean they tend to use it more for leisure and enjoyment.
This hypothesis is being supported by looking at the number of rides per weekday and hour.
The graph above shows the number of rides per weekday. You can see, that casual riders are using the service mostly on the weekend, whereas the number of rides from annual member is pretty steady throughout the week, with the lowest number of rides on Sundays.
This graph shows the number of rides per hour. Both types of members, casual and annual, have the most number rides in the late afternoon/early evening hours from 4 to 6 PM. However, for annual members, it also shows an increased number of rides in the morning hours between 6 to 9 AM so times where most of the employees begin and end their workday.
Another noticeable difference can be found in the monthly average trip duration of the two member types. The following graph shows that trips of casual riders tend to be much longer than the ones from annual members. The trip duration of annual members is pretty steady throughout the year with an average of 10 to 15 minutes per ride. In comparison, the trip duration of casual riders tend to be much higher on average and fluctuates much more throughout the year with a maximum average of almost 45 minutes in February, and a minimum average of 20 minutes in November.
Conclusion
It seems like annual members use the bike-sharing for purposes such as transportation and commuting. Even though the number of rides peaks in summer, there is also some usage in the winter months by that member group. The number of rides is pretty steady throughout the week with it’s low on Sunday. By looking at the number of rides per hour we can see that most number of rides are being started in the hours between 4 - 6 PM, but also a noticeable peak in the morning hours between 6 - 8 AM.
In contrast, casual riders tend to use the service mostly on weekends and in the summer months for leisure activities and enjoyment.
Furthermore, the average trip duration of casual riders tend to be much higher and fluctuates a lot with averages of almost 45 minutes in February and to approximately 20 minutes in November. The average trip duration of annual members is pretty steady throughout the year with an average between 10-15 minutes.
Recommendations
Placement of advertisements at most popular stations
Place advertisements in the areas of most popular stations. The shiny areas in below heat map are the most popular pick-up spots. Note, that this heat map only considers data from the month with the highest number of rides, July (since Tableau can’t handle the entire data set)! In the second visualization, you can see the top-15 pick-up stations by name and how many rides were started respectively.
These advertisements can be digital on social media apps and restricted to certain areas, but also physical board advertisements in pedestrian areas.
Since it seems like casual riders are using the service as a way of staying active during their leisure time, the ads could highlight what benefits come with riding the bike such as physical health, mental health and even living a more sustainable life.
Pick-Up Spots in July
Most Popular Start Stations
Flexibility of memberships
To decrease the border of entry, more flexible memberships could be offered by giving the option to subscribe on a monthly base. Alternatively, quarterly or half-year memberships could be offered with gradual discounts off the monthly membership to attract more customers. Same here, advertisements should be placed at and around most popular stations (see above).