Background
Scenario
You are a junior data analyst working on the marketing analyst team at Cyclistic, a bike-share company in Chicago. The director of marketing believes the company’s future success depends on maximizing the number of annual memberships. Therefore, your team wants to understand how casual riders and annual members use Cyclistic bikes differently. From these insights, your team will design a new marketing strategy to convert casual riders into annual members. But first, Cyclistic executives must approve your recommendations, so they must be backed up with compelling data insights and professional data visualizations.
About the company
In 2016, Cyclistic launched a successful bike-share offering. Since then, the program has grown to a fleet of 5,824 bicycles that are geotracked and locked into a network of 692 stations across Chicago. The bikes can be unlocked from one station and returned to any other station in the system anytime. Until now, Cyclistic’s marketing strategy relied on building general awareness and appealing to broad consumer segments. One approach that helped make these things possible was the flexibility of its pricing plans: single-ride passes, full-day passes, and annual memberships. Customers who purchase single-ride or full-day passes are referred to as casual riders. Customers who purchase annual memberships are Cyclistic members.
Ask
Business Task
To help maximize the number of annual memberships, I, as the data analyst, will identify trends and patterns among casual riders and annual members, and identify casual riders who could benefit from an annual membership. I do not need to raise awareness of the annual membership among casual riders, as they are already aware of the program.
Stakeholders
*The director of marketing
*The marketing analyst team
*Cyclistic’s executive team
Stakeholders’ expectations
Design marketing strategies aimed at converting casual riders into annual members. To do that, however, the marketing analyst team needs to better understand how annual members and casual riders differ, why casual riders would buy a membership, and how digital media could affect their marketing tactics. The marketing team is interested in analyzing the Cyclistic historical bike trip data to identify trends.
Prepare
DATA TYPE check
1. ROCCC check
2. Sampling bias
3. Observer bias
4. Interpretation bias
5. Confirmation bias
About the data set:
Since Cyclistic is a fictional company, I will use data from Divvy, a bike-share program based in Chicago, to complete this case study. The files loaded below cover January 2024 to June 2024. To download the data, please use this link. This data was made public by Motivate International Inc, under this license. Due to data privacy issues, personal information has been removed or encrypted.
Install relevant packages
install.packages("tidyverse")
install.packages("dplyr")
install.packages("lubridate")
install.packages("geosphere")
Library
library(tidyverse)
── Attaching core tidyverse packages ─────────────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.5
✔ forcats 1.0.0 ✔ stringr 1.5.1
✔ ggplot2 3.5.1 ✔ tibble 3.2.1
✔ lubridate 1.9.3 ✔ tidyr 1.3.1
✔ purrr 1.0.2
── Conflicts ───────────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggplot2)
library(dplyr)
library(tidyr)
library(lubridate)
library(geosphere)  # provides distGeo(), used below for ride distance
Trip Data
tripdata_202401 <- read.csv ("202401-divvy-tripdata.csv")
tripdata_202402 <- read.csv ("202402-divvy-tripdata.csv")
tripdata_202403 <- read.csv ("202403-divvy-tripdata.csv")
tripdata_202404 <- read.csv ("202404-divvy-tripdata.csv")
tripdata_202405 <- read.csv ("202405-divvy-tripdata.csv")
tripdata_202406 <- read.csv ("202406-divvy-tripdata.csv")
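The six read.csv() calls above follow one naming pattern, so the same step could also be written as a loop. The following is a minimal sketch, assuming the monthly files sit in one folder and keep the YYYYMM-divvy-tripdata.csv naming used above. The helper name read_trip_files and its data_dir argument are hypothetical and not part of the original analysis.

```r
# Hypothetical helper: read every monthly trip file in a folder and stack them.
# Assumes all files share the same columns (rbind() fails otherwise).
read_trip_files <- function(data_dir = ".") {
  paths <- list.files(data_dir,
                      pattern = "^[0-9]{6}-divvy-tripdata\\.csv$",
                      full.names = TRUE)
  if (length(paths) == 0) {
    stop("No trip files found in ", data_dir)
  }
  # File names start with YYYYMM, so sorting keeps the months in order
  monthly <- lapply(sort(paths), read.csv)
  do.call(rbind, monthly)
}
```

Calling read_trip_files(".") from the folder that holds the downloads would return one stacked data frame, so adding another month only requires dropping its file into the folder.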
Data Check
colnames(tripdata_202401)
[1] "ride_id" "rideable_type" "started_at"
[4] "ended_at" "start_station_name" "start_station_id"
[7] "end_station_name" "end_station_id" "start_lat"
[10] "start_lng" "end_lat" "end_lng"
[13] "member_casual"
colnames(tripdata_202402)
[1] "ride_id" "rideable_type" "started_at"
[4] "ended_at" "start_station_name" "start_station_id"
[7] "end_station_name" "end_station_id" "start_lat"
[10] "start_lng" "end_lat" "end_lng"
[13] "member_casual"
colnames(tripdata_202403)
[1] "ride_id" "rideable_type" "started_at"
[4] "ended_at" "start_station_name" "start_station_id"
[7] "end_station_name" "end_station_id" "start_lat"
[10] "start_lng" "end_lat" "end_lng"
[13] "member_casual"
colnames(tripdata_202404)
[1] "ride_id" "rideable_type" "started_at"
[4] "ended_at" "start_station_name" "start_station_id"
[7] "end_station_name" "end_station_id" "start_lat"
[10] "start_lng" "end_lat" "end_lng"
[13] "member_casual"
colnames(tripdata_202405)
[1] "ride_id" "rideable_type" "started_at"
[4] "ended_at" "start_station_name" "start_station_id"
[7] "end_station_name" "end_station_id" "start_lat"
[10] "start_lng" "end_lat" "end_lng"
[13] "member_casual"
colnames(tripdata_202406)
[1] "ride_id" "rideable_type" "started_at"
[4] "ended_at" "start_station_name" "start_station_id"
[7] "end_station_name" "end_station_id" "start_lat"
[10] "start_lng" "end_lat" "end_lng"
[13] "member_casual"
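Comparing six colnames() printouts by eye is error-prone. As a sketch, the same check could be done programmatically with base R. The helper name same_columns and the toy data frames are hypothetical, used only for illustration.

```r
# Hypothetical helper: TRUE when every data frame in the list has the same
# column names, in the same order, as the first one.
same_columns <- function(dfs) {
  reference <- names(dfs[[1]])
  all(vapply(dfs, function(df) identical(names(df), reference), logical(1)))
}

# Toy data frames standing in for the monthly tables
jan <- data.frame(ride_id = "A", member_casual = "member")
feb <- data.frame(ride_id = "B", member_casual = "casual")
mar <- data.frame(ride_id = "C", rider_type = "casual")  # renamed column

same_columns(list(jan, feb))       # TRUE
same_columns(list(jan, feb, mar))  # FALSE
```

With the real data, the six monthly data frames would be passed in a single list; a TRUE result means they can be stacked without creating mismatched columns.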
Combine data
All_trip <- bind_rows(tripdata_202401,tripdata_202402,tripdata_202403,tripdata_202404,tripdata_202405,tripdata_202406)
colnames(All_trip)
[1] "ride_id" "rideable_type" "started_at"
[4] "ended_at" "start_station_name" "start_station_id"
[7] "end_station_name" "end_station_id" "start_lat"
[10] "start_lng" "end_lat" "end_lng"
[13] "member_casual"
str(All_trip)
'data.frame': 2404963 obs. of 13 variables:
$ ride_id : chr "C1D650626C8C899A" "EECD38BDB25BFCB0" "F4A9CE78061F17F7" "0A0D9E15EE50B171" ...
$ rideable_type : chr "electric_bike" "electric_bike" "electric_bike" "classic_bike" ...
$ started_at : chr "2024-01-12 15:30:27" "2024-01-08 15:45:46" "2024-01-27 12:27:19" "2024-01-29 16:26:17" ...
$ ended_at : chr "2024-01-12 15:37:59" "2024-01-08 15:52:59" "2024-01-27 12:35:19" "2024-01-29 16:56:06" ...
$ start_station_name: chr "Wells St & Elm St" "Wells St & Elm St" "Wells St & Elm St" "Wells St & Randolph St" ...
$ start_station_id : chr "KA1504000135" "KA1504000135" "KA1504000135" "TA1305000030" ...
$ end_station_name : chr "Kingsbury St & Kinzie St" "Kingsbury St & Kinzie St" "Kingsbury St & Kinzie St" "Larrabee St & Webster Ave" ...
$ end_station_id : chr "KA1503000043" "KA1503000043" "KA1503000043" "13193" ...
$ start_lat : num 41.9 41.9 41.9 41.9 41.9 ...
$ start_lng : num -87.6 -87.6 -87.6 -87.6 -87.7 ...
$ end_lat : num 41.9 41.9 41.9 41.9 41.9 ...
$ end_lng : num -87.6 -87.6 -87.6 -87.6 -87.6 ...
$ member_casual : chr "member" "member" "member" "member" ...
Process
New column
All_trip$start_date <- as.Date(All_trip$started_at)
All_trip$start_datetime <- ymd_hms(All_trip$started_at, quiet = TRUE)
All_trip$month <- format(as.Date(All_trip$start_date), "%m")
All_trip$day <- format(as.Date(All_trip$start_date), "%d")
All_trip$year <- format(as.Date(All_trip$start_date), "%Y")
All_trip$day_of_week <- format(as.Date(All_trip$start_date), "%A")
All_trip$end_datetime <- ymd_hms(All_trip$ended_at, quiet = TRUE)
All_trip$start_Time <- format(All_trip$start_datetime, "%H:%M:%S")
All_trip$end_Time <- format(All_trip$end_datetime, "%H:%M:%S")
All_trip$duration <- difftime(All_trip$end_datetime, All_trip$start_datetime, units = "secs")  # ride length in seconds
All_trip$ride_distance <- distGeo(matrix(c(All_trip$start_lng, All_trip$start_lat), ncol = 2), matrix(c(All_trip$end_lng, All_trip$end_lat), ncol = 2))
colnames(All_trip)
[1] "ride_id" "rideable_type" "started_at"
[4] "ended_at" "start_station_name" "start_station_id"
[7] "end_station_name" "end_station_id" "start_lat"
[10] "start_lng" "end_lat" "end_lng"
[13] "member_casual" "start_date" "start_datetime"
[16] "month" "day" "year"
[19] "day_of_week" "end_datetime" "start_Time"
[22] "end_Time" "duration" "ride_distance"
str(All_trip)
'data.frame': 2404963 obs. of 24 variables:
$ ride_id : chr "C1D650626C8C899A" "EECD38BDB25BFCB0" "F4A9CE78061F17F7" "0A0D9E15EE50B171" ...
$ rideable_type : chr "electric_bike" "electric_bike" "electric_bike" "classic_bike" ...
$ started_at : chr "2024-01-12 15:30:27" "2024-01-08 15:45:46" "2024-01-27 12:27:19" "2024-01-29 16:26:17" ...
$ ended_at : chr "2024-01-12 15:37:59" "2024-01-08 15:52:59" "2024-01-27 12:35:19" "2024-01-29 16:56:06" ...
$ start_station_name: chr "Wells St & Elm St" "Wells St & Elm St" "Wells St & Elm St" "Wells St & Randolph St" ...
$ start_station_id : chr "KA1504000135" "KA1504000135" "KA1504000135" "TA1305000030" ...
$ end_station_name : chr "Kingsbury St & Kinzie St" "Kingsbury St & Kinzie St" "Kingsbury St & Kinzie St" "Larrabee St & Webster Ave" ...
$ end_station_id : chr "KA1503000043" "KA1503000043" "KA1503000043" "13193" ...
$ start_lat : num 41.9 41.9 41.9 41.9 41.9 ...
$ start_lng : num -87.6 -87.6 -87.6 -87.6 -87.7 ...
$ end_lat : num 41.9 41.9 41.9 41.9 41.9 ...
$ end_lng : num -87.6 -87.6 -87.6 -87.6 -87.6 ...
$ member_casual : chr "member" "member" "member" "member" ...
$ start_date : Date, format: "2024-01-12" "2024-01-08" ...
$ start_datetime : POSIXct, format: "2024-01-12 15:30:27" "2024-01-08 15:45:46" ...
$ month : chr "01" "01" "01" "01" ...
$ day : chr "12" "08" "27" "29" ...
$ year : chr "2024" "2024" "2024" "2024" ...
$ day_of_week : chr "Friday" "Monday" "Saturday" "Monday" ...
$ end_datetime : POSIXct, format: "2024-01-12 15:37:59" "2024-01-08 15:52:59" ...
$ start_Time : chr "15:30:27" "15:45:46" "12:27:19" "16:26:17" ...
$ end_Time : chr "15:37:59" "15:52:59" "12:35:19" "16:56:06" ...
$ duration : 'difftime' num 452 433 480 1789 ...
..- attr(*, "units")= chr "secs"
$ ride_distance : num 1596 1565 1566 4253 7291 ...
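The duration column is the difference between the two parsed timestamps. As a worked check, the sketch below recomputes it in base R for the first four rides shown in the str() output above. It assumes the timestamps can be treated as UTC, which is also the default for ymd_hms(); the variable names are illustrative.

```r
# First four start/end timestamps from the str() output above
started <- as.POSIXct(c("2024-01-12 15:30:27", "2024-01-08 15:45:46",
                        "2024-01-27 12:27:19", "2024-01-29 16:26:17"),
                      tz = "UTC")
ended   <- as.POSIXct(c("2024-01-12 15:37:59", "2024-01-08 15:52:59",
                        "2024-01-27 12:35:19", "2024-01-29 16:56:06"),
                      tz = "UTC")

# Asking for seconds explicitly stops difftime from choosing minutes or hours
duration_secs <- as.numeric(difftime(ended, started, units = "secs"))
duration_secs
# [1]  452  433  480 1789
```

These values match the first four entries of the duration column above, which confirms that the column is measured in seconds.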
Calculation
All_trip$duration <- as.numeric(All_trip$duration, units = "secs")  # plain numeric seconds
is.numeric(All_trip$duration)
[1] TRUE
Remove bad data
# Drop rides whose recorded duration is exactly zero seconds
all_trips <- All_trip[All_trip$duration != 0, ]
Check new data
dim(all_trips)
[1] 2404467 24
View(all_trips)
summary(all_trips)
ride_id rideable_type started_at ended_at
Length:2404467 Length:2404467 Length:2404467 Length:2404467
Class :character Class :character Class :character Class :character
Mode :character Mode :character Mode :character Mode :character
start_station_name start_station_id end_station_name end_station_id
Length:2404467 Length:2404467 Length:2404467 Length:2404467
Class :character Class :character Class :character Class :character
Mode :character Mode :character Mode :character Mode :character
start_lat start_lng end_lat end_lng member_casual
Min. :41.65 Min. :-87.89 Min. :41.60 Min. :-88.12 Length:2404467
1st Qu.:41.88 1st Qu.:-87.66 1st Qu.:41.88 1st Qu.:-87.66 Class :character
Median :41.90 Median :-87.64 Median :41.90 Median :-87.64 Mode :character
Mean :41.90 Mean :-87.65 Mean :41.90 Mean :-87.65
3rd Qu.:41.93 3rd Qu.:-87.63 3rd Qu.:41.93 3rd Qu.:-87.63
Max. :42.07 Max. :-87.53 Max. :42.19 Max. :-87.46
NA's :3389 NA's :3389
start_date start_datetime month
Min. :2024-01-01 Min. :2024-01-01 00:00:39.00 Length:2404467
1st Qu.:2024-03-24 1st Qu.:2024-03-24 20:17:05.00 Class :character
Median :2024-05-07 Median :2024-05-07 15:08:08.00 Mode :character
Mean :2024-04-26 Mean :2024-04-27 10:59:45.57
3rd Qu.:2024-06-06 3rd Qu.:2024-06-06 13:09:25.78
Max. :2024-06-30 Max. :2024-06-30 23:55:17.06
day year day_of_week
Length:2404467 Length:2404467 Length:2404467
Class :character Class :character Class :character
Mode :character Mode :character Mode :character
end_datetime start_Time end_Time
Min. :2024-01-01 00:04:20.00 Length:2404467 Length:2404467
1st Qu.:2024-03-24 20:31:39.00 Class :character Class :character
Median :2024-05-07 15:22:15.00 Mode :character Mode :character
Mean :2024-04-27 11:17:32.56
3rd Qu.:2024-06-06 13:31:27.61
Max. :2024-06-30 23:59:57.93
duration ride_distance
Min. :-164899 Min. : 0.0
1st Qu.: 330 1st Qu.: 868.1
Median : 582 Median : 1559.1
Mean : 1067 Mean : 2132.1
3rd Qu.: 1042 3rd Qu.: 2775.5
Max. : 93596 Max. :33514.6
NA's :3389
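The summary above still shows a negative minimum duration and 3,389 missing end coordinates, because the filter only drops rides whose duration is exactly zero. A stricter filter could look like the sketch below, shown on a toy data frame. The names toy_trips and clean_trips and the choice to drop rows with missing end coordinates are assumptions, not steps the original analysis performed.

```r
# Toy data standing in for All_trip (duration in seconds)
toy_trips <- data.frame(
  ride_id  = c("A", "B", "C", "D", "E"),
  duration = c(452, 0, -164899, 1789, 600),
  end_lat  = c(41.9, 41.9, 41.9, NA, 41.9),
  end_lng  = c(-87.6, -87.6, -87.6, NA, -87.6),
  stringsAsFactors = FALSE
)

# Keep rides with a positive duration and known end coordinates
clean_trips <- subset(toy_trips,
                      duration > 0 & !is.na(end_lat) & !is.na(end_lng))
clean_trips$ride_id
# [1] "A" "E"
```

Applying such a filter to the real data would change the row counts and summaries reported in this section.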
Remove duplicate ID
all_trips_v2 <- all_trips[!duplicated(all_trips$ride_id),]
dim(all_trips_v2)
[1] 2404256 24
summary(all_trips_v2)
ride_id rideable_type started_at ended_at
Length:2404256 Length:2404256 Length:2404256 Length:2404256
Class :character Class :character Class :character Class :character
Mode :character Mode :character Mode :character Mode :character
start_station_name start_station_id end_station_name end_station_id
Length:2404256 Length:2404256 Length:2404256 Length:2404256
Class :character Class :character Class :character Class :character
Mode :character Mode :character Mode :character Mode :character
start_lat start_lng end_lat end_lng member_casual
Min. :41.65 Min. :-87.89 Min. :41.60 Min. :-88.12 Length:2404256
1st Qu.:41.88 1st Qu.:-87.66 1st Qu.:41.88 1st Qu.:-87.66 Class :character
Median :41.90 Median :-87.64 Median :41.90 Median :-87.64 Mode :character
Mean :41.90 Mean :-87.65 Mean :41.90 Mean :-87.65
3rd Qu.:41.93 3rd Qu.:-87.63 3rd Qu.:41.93 3rd Qu.:-87.63
Max. :42.07 Max. :-87.53 Max. :42.19 Max. :-87.46
NA's :3349 NA's :3349
start_date start_datetime month
Min. :2024-01-01 Min. :2024-01-01 00:00:39.00 Length:2404256
1st Qu.:2024-03-24 1st Qu.:2024-03-24 20:02:52.00 Class :character
Median :2024-05-07 Median :2024-05-07 14:46:39.00 Mode :character
Mean :2024-04-26 Mean :2024-04-27 10:55:24.35
3rd Qu.:2024-06-06 3rd Qu.:2024-06-06 13:12:13.01
Max. :2024-06-30 Max. :2024-06-30 23:55:17.06
day year day_of_week
Length:2404256 Length:2404256 Length:2404256
Class :character Class :character Class :character
Mode :character Mode :character Mode :character
end_datetime start_Time end_Time
Min. :2024-01-01 00:04:20.00 Length:2404256 Length:2404256
1st Qu.:2024-03-24 20:20:09.00 Class :character Class :character
Median :2024-05-07 15:04:06.00 Mode :character Mode :character
Mean :2024-04-27 11:13:09.35
3rd Qu.:2024-06-06 13:34:22.65
Max. :2024-06-30 23:59:57.93
duration ride_distance
Min. :-164899 Min. : 0.0
1st Qu.: 330 1st Qu.: 868.1
Median : 582 Median : 1559.0
Mean : 1065 Mean : 2132.0
3rd Qu.: 1042 3rd Qu.: 2775.3
Max. : 93596 Max. :33514.6
NA's :3349
Analysis
# Order weekdays Monday to Sunday so tables and plots follow the calendar week
all_trips_v2$day_of_week <-
  ordered(all_trips_v2$day_of_week, levels = c('Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday'))
# Number of rides per rider type and weekday
all_trips_v2 %>%
  group_by(member_casual, day_of_week) %>%
  summarise(number_of_rides = n(), .groups = "drop") %>%
  arrange(day_of_week)
summarized_data <- all_trips_v2 %>%
  group_by(member_casual, day_of_week) %>%
  summarise(number_of_rides = n(), .groups = "drop") %>%
  arrange(day_of_week)
ggplot(data = summarized_data, aes(x = day_of_week, y = number_of_rides, fill = member_casual)) +
  geom_bar(position = "dodge", stat = "identity")
# Total ride duration (seconds) per rider type and weekday
all_trips_v2 %>%
  group_by(member_casual, day_of_week) %>%
  summarise(total_duration = sum(duration), .groups = "drop") %>%
  arrange(day_of_week)
summarized_data <- all_trips_v2 %>%
  group_by(member_casual, day_of_week) %>%
  summarise(total_duration = sum(duration), .groups = "drop") %>%
  arrange(day_of_week)
ggplot(data = summarized_data, aes(x = day_of_week, y = total_duration, fill = member_casual)) +
  geom_bar(position = "dodge", stat = "identity")
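Total duration depends heavily on how many rides each group takes, so an average per ride can be easier to compare across rider types. The following is a minimal dplyr sketch on toy data; the data frame toy_trips and the column name mean_duration are illustrative assumptions, not part of the original analysis.

```r
library(dplyr)

# Toy data standing in for all_trips_v2 (duration in seconds)
toy_trips <- data.frame(
  member_casual = c("member", "member", "casual", "casual", "casual"),
  day_of_week   = c("Monday", "Monday", "Monday", "Saturday", "Saturday"),
  duration      = c(400, 600, 900, 1500, 2100)
)

# Ride count and mean duration per rider type and weekday
mean_by_day <- toy_trips %>%
  group_by(member_casual, day_of_week) %>%
  summarise(number_of_rides = n(),
            mean_duration   = mean(duration),
            .groups = "drop")
mean_by_day
```

Plotting mean_duration instead of total_duration in the bar chart would show how long a typical ride lasts, independent of how many rides each group takes.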
# Number of rides per month and rider type
all_trips_v2 %>%
  group_by(month, member_casual) %>%
  summarise(number_of_rides = n(), .groups = "drop") %>%
  arrange(month)
all_trips_v2 %>%
  group_by(month, member_casual) %>%
  summarise(number_of_rides = n(), .groups = "drop") %>%
  arrange(month) %>%
  ggplot(aes(x = month, y = number_of_rides, fill = member_casual)) +
  geom_bar(position = "dodge", stat = "identity")
# Distribution of ride distance (metres) for rides under 10 km
all_trips_v2 %>%
  filter(ride_distance < 10000) %>%
  ggplot(aes(x = ride_distance, fill = member_casual)) +
  geom_histogram()
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
In the charts above, casual riders appear to take longer rides than annual members. This suggests that casual riders may use the bikes more for leisure, while annual members may use them more for practical purposes such as commuting.
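A stacked histogram of counts makes it hard to compare typical ride lengths, so one way to check the interpretation above is to compare medians by rider type. The sketch below uses dplyr on toy data; the data frame toy_trips and the output column names are hypothetical. Medians are used because the duration summary contains extreme values that would distort a mean.

```r
library(dplyr)

# Toy data standing in for all_trips_v2
toy_trips <- data.frame(
  member_casual = c("member", "member", "member", "casual", "casual", "casual"),
  duration      = c(330, 582, 700, 600, 1042, 5000),    # seconds
  ride_distance = c(800, 1500, 2000, 1200, 2500, 3000)  # metres
)

# Median duration and distance for each rider type
typical_ride <- toy_trips %>%
  group_by(member_casual) %>%
  summarise(median_duration = median(duration),
            median_distance = median(ride_distance, na.rm = TRUE),
            .groups = "drop")
typical_ride
```

If the medians for casual riders on the real data are clearly higher than those for members, that would support the reading that casual riders take longer rides.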
Share
Conclusion:
*Casual riders appear to use the bikes for leisure and tourism, especially on weekends, which is consistent with longer, recreational rides.
*Annual members appear to use the bikes mainly for commuting or other practical purposes on weekdays, indicating routine, functional use of the service.
*Casual riders may prefer biking for social outings or exploring new places, as reflected in their longer ride distances and weekend activity.
*Annual members are more likely to build biking into their daily routines, possibly for work or other regular activities, resulting in shorter, more frequent rides during the workweek.
*These differences in ride patterns show that casual riders focus on enjoyment, while annual members focus on practicality.