bike station
This project is the final project in my Google Data Analytics Professional Certificate Course on coursera. In this case study, I will be analyzing a public dataset for a fictional company called Cyclistic, provided by the course. Here, I will be using R programming language for this analysis because of its potential benefits to reproducibility, transparency, easy statistical analysis tools and data visualizations
This project will be based on the data analysis process
This project will follow the following road map steps
scenario
You are a junior data analyst working in the marketing analyst team at Cyclistic, a bike-share company in Chicago. The director of marketing believes the company’s future success depends on maximizing the number of annual memberships,in essence, converting casual users to member. Therefore, your team wants to understand how casual riders and annual members use Cyclistic bikes differently. From these insights, your team will design a new marketing strategy to convert casual riders into annual members.But first, Cyclistic executives must approve your recommendations, so they must be backed up with compelling data insights and professional data
Characters and Teams
Cyclistic: A bike-share program that features more than 5,800 bicycles and 600 docking stations. Cyclistic sets itself apart by also offering reclining bikes, hand tricycles, and cargo bikes, making bike-share more inclusive to people with disabilities and riders who can’t use a standard two-wheeled bike. Most riders opt for traditional bikes; about 8% of riders use the assistive options. Cyclist users are more likely to ride for leisure, but about 30% use them to commute to work each day
Lily Moreno: The director of marketing and your manager. Moreno is responsible for the development of campaigns and initiatives to promote the bike-share program. These may include email, social media, and other channels
Cyclistic marketing analytics team: A team of data analysts who are responsible for collecting, analyzing, and reporting data that helps guide Cyclistic marketing strategy. You joined this team six months ago and have been busy learning about Cyclistic mission and business goals as well as how you, as a junior data analyst, can help Cyclistic achieve them
Cyclistic executive team: The notoriously detail-oriented executive team will decide whether to approve the recommended marketing program
Three questions will guide the future marketing program:
Lily Moreno (director of marketing and my manager) has assigned me the first question to answer: How do annual members and casual riders use Cyclistic bikes differently?
Key tasks
Deliverable
I will use Cyclistic’s historical trip data to analyze and identify trends.The data has been made available by Motivate International Inc. under this license Datasets are available here for download
Key tasks
Deliverable
A description of all data sources used
* The main source of all the data used was provided by the Cyclistic
Company
Install and load the needed R studio library
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.3.6 ✔ purrr 0.3.4
## ✔ tibble 3.1.8 ✔ dplyr 1.0.10
## ✔ tidyr 1.2.1 ✔ stringr 1.4.1
## ✔ readr 2.1.3 ✔ forcats 0.5.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
library(lubridate)
##
## Attaching package: 'lubridate'
##
## The following objects are masked from 'package:base':
##
## date, intersect, setdiff, union
library(ggplot2)
import data to R studio
q2_2019 <- read.csv("Divvy_Trips_2019_Q2.csv")
q3_2019 <- read.csv("Divvy_Trips_2019_Q3.csv")
q4_2019 <- read.csv("Divvy_Trips_2019_q4.csv")
q1_2020 <- read.csv("Divvy_Trips_2020_Q1.csv")
viewing and comparing each columns of the dataset
colnames(q2_2019)
## [1] "X01...Rental.Details.Rental.ID"
## [2] "X01...Rental.Details.Local.Start.Time"
## [3] "X01...Rental.Details.Local.End.Time"
## [4] "X01...Rental.Details.Bike.ID"
## [5] "X01...Rental.Details.Duration.In.Seconds.Uncapped"
## [6] "X03...Rental.Start.Station.ID"
## [7] "X03...Rental.Start.Station.Name"
## [8] "X02...Rental.End.Station.ID"
## [9] "X02...Rental.End.Station.Name"
## [10] "User.Type"
## [11] "Member.Gender"
## [12] "X05...Member.Details.Member.Birthday.Year"
colnames(q3_2019)
## [1] "trip_id" "start_time" "end_time"
## [4] "bikeid" "tripduration" "from_station_id"
## [7] "from_station_name" "to_station_id" "to_station_name"
## [10] "usertype" "gender" "birthyear"
colnames(q4_2019)
## [1] "trip_id" "start_time" "end_time"
## [4] "bikeid" "tripduration" "from_station_id"
## [7] "from_station_name" "to_station_id" "to_station_name"
## [10] "usertype" "gender" "birthyear"
colnames(q1_2020)
## [1] "ride_id" "rideable_type" "started_at"
## [4] "ended_at" "start_station_name" "start_station_id"
## [7] "end_station_name" "end_station_id" "start_lat"
## [10] "start_lng" "end_lat" "end_lng"
## [13] "member_casual"
Rename columns to make them consistent with q1_2020,this is due to the new naming convention of columns by the public dataset
Rename the q2_2019 dataset columns to q1_2020 dataset column
(q2_2019 <- rename(q2_2019
,ride_id = "X01...Rental.Details.Rental.ID"
,rideable_type = "X01...Rental.Details.Bike.ID"
,started_at = "X01...Rental.Details.Local.Start.Time"
,ended_at = "X01...Rental.Details.Local.End.Time"
,start_station_name = "X03...Rental.Start.Station.Name"
,start_station_id = "X03...Rental.Start.Station.ID"
,end_station_name = "X02...Rental.End.Station.Name"
,end_station_id = "X02...Rental.End.Station.ID"
,member_casual = "User.Type"))
(q3_2019 <- rename(q3_2019
,ride_id = trip_id
,rideable_type = bikeid
,started_at = start_time
,ended_at = end_time
,start_station_name = from_station_name
,start_station_id = from_station_id
,end_station_name = to_station_name
,end_station_id = to_station_id
,member_casual = usertype))
(q4_2019 <- rename(q4_2019
,ride_id = trip_id
,rideable_type = bikeid
,started_at = start_time
,ended_at = end_time
,start_station_name = from_station_name
,start_station_id = from_station_id
,end_station_name = to_station_name
,end_station_id = to_station_id
,member_casual = usertype))
Confirmation of the columns name
colnames(q2_2019)
## [1] "ride_id"
## [2] "started_at"
## [3] "ended_at"
## [4] "rideable_type"
## [5] "X01...Rental.Details.Duration.In.Seconds.Uncapped"
## [6] "start_station_id"
## [7] "start_station_name"
## [8] "end_station_id"
## [9] "end_station_name"
## [10] "member_casual"
## [11] "Member.Gender"
## [12] "X05...Member.Details.Member.Birthday.Year"
colnames(q3_2019)
## [1] "ride_id" "started_at" "ended_at"
## [4] "rideable_type" "tripduration" "start_station_id"
## [7] "start_station_name" "end_station_id" "end_station_name"
## [10] "member_casual" "gender" "birthyear"
colnames(q4_2019)
## [1] "ride_id" "started_at" "ended_at"
## [4] "rideable_type" "tripduration" "start_station_id"
## [7] "start_station_name" "end_station_id" "end_station_name"
## [10] "member_casual" "gender" "birthyear"
colnames(q1_2020)
## [1] "ride_id" "rideable_type" "started_at"
## [4] "ended_at" "start_station_name" "start_station_id"
## [7] "end_station_name" "end_station_id" "start_lat"
## [10] "start_lng" "end_lat" "end_lng"
## [13] "member_casual"
Confirmation of data type of all the columns
str(q2_2019)
str(q3_2019)
str(q4_2019)
str(q1_2020)
Rideable_type and ride_id columns from q2_1019,q3_2019 and q4_2019 needs to be converted to character data type as its in q1-2020 dataset
q2_2019 <- dplyr::mutate(q2_2019, ride_id = as.character(q2_2019$ride_id),
rideable_type = as.character(q2_2019$rideable_type))
q3_2019 <- dplyr::mutate(q3_2019,ride_id = as.character(q3_2019$ride_id),
rideable_type = as.character(q3_2019$rideable_type))
q4_2019 <- dplyr::mutate(q4_2019,ride_id = as.character(q4_2019$ride_id),
rideable_type = as.character(q4_2019$rideable_type))
is.character(q2_2019$ride_id)
## [1] TRUE
is.character(q3_2019$ride_id)
## [1] TRUE
is.character(q4_2019$ride_id)
## [1] TRUE
is.character(q2_2019$rideable_type)
## [1] TRUE
is.character(q3_2019$rideable_type)
## [1] TRUE
is.character(q4_2019$rideable_type)
## [1] TRUE
Aggregate the four datasets into a single dataset
all_trip <- bind_rows(q2_2019,q3_2019,q4_2019,q1_2020)
Drop some columns that are not unique
all_trip <- all_trip %>%
select(-c( "X01...Rental.Details.Duration.In.Seconds.Uncapped","X05...Member.Details.Member.Birthday.Year",Member.Gender,"X05...Member.Details.Member.Birthday.Year",tripduration,birthyear,start_lat,start_lng,end_lat,end_lng,gender))
colnames(all_trip)
## [1] "ride_id" "started_at" "ended_at"
## [4] "rideable_type" "start_station_id" "start_station_name"
## [7] "end_station_id" "end_station_name" "member_casual"
Data processing activities on the dataset
all_trip <- all_trip %>%
mutate(member_casual = recode(member_casual
,"Subscriber" = "member"
,"Customer" = "casual"))
Confirm by running this code chunk
n_distinct(all_trip$member_casual)
## [1] 2
all_trip$ride_length <- difftime(all_trip$ended_at,all_trip$started_at)
typeof(all_trip$ride_length)
## [1] "double"
Convert to numeric data type
all_trip$ride_length <- as.numeric(all_trip$ride_length)
typeof(all_trip$ride_length)
## [1] "double"
all_trip$date <- as.Date(all_trip$started_at) #The default format is yyyy-mm-dd
all_trip$year <- format(as.Date(all_trip$date), "%Y")
all_trip$month <- format(as.Date(all_trip$date), "%m")
all_trip$day <- format(as.Date(all_trip$date), "%d")
all_trip$day_of_the_week <- format(as.Date(all_trip$date),"%A")
To confirm the new columns
colnames(all_trip)
## [1] "ride_id" "started_at" "ended_at"
## [4] "rideable_type" "start_station_id" "start_station_name"
## [7] "end_station_id" "end_station_name" "member_casual"
## [10] "ride_length" "date" "year"
## [13] "month" "day" "day_of_the_week"
Filter out all the data where the ride_length is less than one seconds and also where the start station name is “HQ QR”
all_trip_v2 <- all_trip[!(all_trip$ride_length < 1 | all_trip$start_station_name == "HQ QR"),]
.
All the data have been stored appropriately and has been prepared for analysis
mean(all_trip_v2$ride_length) #straight average (total ride length / rides)
median(all_trip_v2$ride_length) #midpoint number in the ascending array of ride lengths
max(all_trip_v2$ride_length) #longest ride
min(all_trip_v2$ride_length) #shortest ride
aggregating members_casual and ride_length(in seconds)
aggregate(all_trip_v2$ride_length ~ all_trip_v2$member_casual, FUN = mean)
## all_trip_v2$member_casual all_trip_v2$ride_length
## 1 casual 3552.7502
## 2 member 850.0662
aggregate(all_trip_v2$ride_length ~ all_trip_v2$member_casual, FUN = median)
## all_trip_v2$member_casual all_trip_v2$ride_length
## 1 casual 1546
## 2 member 589
aggregate(all_trip_v2$ride_length ~ all_trip_v2$member_casual, FUN = max)
## all_trip_v2$member_casual all_trip_v2$ride_length
## 1 casual 9387024
## 2 member 9056634
aggregate(all_trip_v2$ride_length ~ all_trip_v2$member_casual, FUN = min)
## all_trip_v2$member_casual all_trip_v2$ride_length
## 1 casual 2
## 2 member 1
aggregating ride_length and the days_of_the_week
all_trip_v2 %>%
mutate(weekday = wday(started_at, label = TRUE)) %>%
aggregate(all_trip_v2$ride_length ~ all_trip_v2$member_casual+weekday, FUN = mean)
## all_trip_v2$member_casual weekday all_trip_v2$ride_length
## 1 casual Sun 3581.4054
## 2 member Sun 919.9746
## 3 casual Mon 3372.2869
## 4 member Mon 842.5726
## 5 casual Tue 3596.3599
## 6 member Tue 826.1427
## 7 casual Wed 3718.6619
## 8 member Wed 823.9996
## 9 casual Thu 3682.9847
## 10 member Thu 823.9278
## 11 casual Fri 3773.8351
## 12 member Fri 824.5305
## 13 casual Sat 3331.9138
## 14 member Sat 968.9337
all_trip_v2 %>%
mutate(weekday = wday(started_at, label = TRUE)) %>%
aggregate(all_trip_v2$ride_length ~ all_trip_v2$member_casual+weekday, FUN = max)
## all_trip_v2$member_casual weekday all_trip_v2$ride_length
## 1 casual Sun 8636205
## 2 member Sun 2910776
## 3 casual Mon 7247750
## 4 member Mon 8203638
## 5 casual Tue 7522062
## 6 member Tue 9056634
## 7 casual Wed 7606872
## 8 member Wed 5628779
## 9 casual Thu 9387024
## 10 member Thu 6028602
## 11 casual Fri 7939448
## 12 member Fri 4840301
## 13 casual Sat 8116785
## 14 member Sat 4805491
all_trip_v2 %>%
mutate(weekday = wday(started_at, label = TRUE)) %>%
aggregate(all_trip_v2$ride_length ~ all_trip_v2$member_casual+weekday, FUN = min)
## all_trip_v2$member_casual weekday all_trip_v2$ride_length
## 1 casual Sun 3
## 2 member Sun 2
## 3 casual Mon 8
## 4 member Mon 2
## 5 casual Tue 11
## 6 member Tue 2
## 7 casual Wed 6
## 8 member Wed 2
## 9 casual Thu 11
## 10 member Thu 1
## 11 casual Fri 2
## 12 member Fri 1
## 13 casual Sat 3
## 14 member Sat 1
all_trip_v2 %>%
mutate(weekday = wday(started_at, label = TRUE)) %>%
aggregate(all_trip_v2$ride_length ~ all_trip_v2$member_casual+weekday, FUN = median)
## all_trip_v2$member_casual weekday all_trip_v2$ride_length
## 1 casual Sun 1626
## 2 member Sun 641
## 3 casual Mon 1547
## 4 member Mon 579
## 5 casual Tue 1428
## 6 member Tue 580
## 7 casual Wed 1419
## 8 member Wed 583
## 9 casual Thu 1430
## 10 member Thu 581
## 11 casual Fri 1499
## 12 member Fri 573
## 13 casual Sat 1658
## 14 member Sat 640
Analyze number of rides by riders type during the weekday
all_trip_v2 %>%
mutate(weekdays = wday(started_at, label = TRUE)) %>%
group_by(weekdays,member_casual) %>%
summarise(number_of_ride = n()) %>%
arrange(weekdays)
## `summarise()` has grouped output by 'weekdays'. You can override using the
## `.groups` argument.
## # A tibble: 14 × 3
## # Groups: weekdays [7]
## weekdays member_casual number_of_ride
## <ord> <chr> <int>
## 1 Sun casual 181293
## 2 Sun member 267965
## 3 Mon casual 103296
## 4 Mon member 472196
## 5 Tue casual 90510
## 6 Tue member 508445
## 7 Wed casual 92457
## 8 Wed member 500329
## 9 Thu casual 102679
## 10 Thu member 484177
## 11 Fri casual 122404
## 12 Fri member 452790
## 13 Sat casual 209543
## 14 Sat member 287958
Analyze total number of ride by member_casual(rider type) and months of ride
all_trip_v2 %>%
group_by(member_casual,month) %>%
summarise(number_of_rides = n()) %>%
arrange(month)
## `summarise()` has grouped output by 'member_casual'. You can override using the
## `.groups` argument.
## # A tibble: 24 × 3
## # Groups: member_casual [2]
## member_casual month number_of_rides
## <chr> <chr> <int>
## 1 casual 01 7785
## 2 member 01 136099
## 3 casual 02 12314
## 4 member 02 126714
## 5 casual 03 24615
## 6 member 03 115593
## 7 casual 04 47744
## 8 member 04 217566
## 9 casual 05 81624
## 10 member 05 285834
## # … with 14 more rows
Analyze the ride_length(total distance) and month of the ride
all_trip_v2 %>%
group_by(month) %>%
summarise(ride_length = sum(ride_length)) %>%
arrange(month)
## # A tibble: 12 × 2
## month ride_length
## <chr> <dbl>
## 1 01 166548847
## 2 02 195843763
## 3 03 204036659
## 4 04 322327235
## 5 05 488463093
## 6 06 660063474
## 7 07 1006336596
## 8 08 1143030984
## 9 09 709088948
## 10 10 486573591
## 11 11 193487505
## 12 12 157404301
Analyze the number of ride by types of riders(member_casual)
all_trip_v2 %>%
group_by(member_casual) %>%
summarise(count_of_rides = length(ride_length))
## # A tibble: 2 × 2
## member_casual count_of_rides
## <chr> <int>
## 1 casual 902182
## 2 member 2973860
Visualize average_duration of riders type and weekday
all_trip_v2 %>%
mutate(weekday = wday(started_at , label = TRUE)) %>%
group_by(member_casual, weekday) %>%
summarise(average_duration = mean(ride_length)) %>%
arrange(member_casual, weekday) %>%
ggplot(aes( x = weekday, y = average_duration, fill = member_casual)) +
geom_col(position = "dodge") +
labs(title = "Average duration of ride by day of the week", subtitle = "ride_length and weekday")
## `summarise()` has grouped output by 'member_casual'. You can override using the
## `.groups` argument.
from the graph above, casual riders tends to rides for more distance than membership riders during the week days, Membership riders rides a little bit more on saturdays while for casual riders, they rides more on Friday. Generally, both riders type makes more use of the bikes on weekends.
Visualize total number of ride by member_casual(rider type) and months of ride
all_trip_v2 %>%
group_by(member_casual,month) %>%
summarise(number_of_rides = n()) %>%
arrange(member_casual,month) %>%
ggplot(aes(x = month, y = number_of_rides, fill = member_casual )) +
theme(axis.text.x = element_text(angle = 45)) +
geom_col(width=0.5, position = position_dodge(width=0.5)) +
scale_y_continuous(labels = function(x) format(x, scientific = FALSE))
## `summarise()` has grouped output by 'member_casual'. You can override using the
## `.groups` argument.
From the graph above, it can be deduced that members have the most numbers of rides than casual riders, it can also be confirmed that August has the highest number of rides by both members and casual riders, month july, August and september recorded the most number of riders for both riders which stood above 350000 rides for casual and partly over 100000 rides for casual riders,january to march recorded the lowest number of rides for both parties
Visualize the ride_length(total distance) and month of the ride
all_trip_v2 %>%
group_by(member_casual,month) %>%
summarise(average_ride_length = mean(ride_length)) %>%
arrange(month) %>%
ggplot(aes(x = month, y = average_ride_length, fill = member_casual)) +
labs(title = "Average ride distance of members and casual riders by month")+
theme(axis.text.x = element_text(angle = 45)) +
geom_col(width=0.5, position = position_dodge(width=0.5)) +
scale_y_continuous(labels = function(x) format(x, scientific = FALSE))
## `summarise()` has grouped output by 'member_casual'. You can override using the
## `.groups` argument.
From the graph above, casual riders ride for a longer distance than the member users, likewise, the month of January and February has the longest distance of rides from casual riders whereas the membership riders maintained a uniform riding distance ranging from over 2500 to 5000
Visualize members and casuals by the number of ride taken (ride count)
all_trip_v2 %>%
group_by(member_casual) %>%
summarise(ride_count = length(ride_id)) %>%
arrange(member_casual) %>%
ggplot(aes(x = member_casual,y = ride_count,fill = member_casual)) +
geom_col(position = "dodge")+
labs(title = "Total rides taken (ride_count) of Members and Casual riders")+
scale_y_continuous(labels = function(x) format(x, scientific = FALSE))
From the above graph, we can observe that there are more member rides(2973860) compared to casual rides(902182) based on the ride count
compare Members and Casual riders depending on ride distance
all_trip_v2 %>%
group_by(member_casual) %>%
summarise(average_ride_length = mean(ride_length)) %>%
ggplot() +
geom_col(mapping= aes(x= member_casual,y= average_ride_length,fill=member_casual), show.legend = FALSE)+
labs(title = "Mean distance traveled by Members and Casual riders")
From the graph, casual riders tends to travel for more distance than the
memberships riders
Conclusion
Members have more rides compared to casual riders.
We have more members riding in all months compared to casual riders.
Casual riders rides for a longer time period.
casual riders ride more throughout the entire week days than the member riders
Casual riders go farther in terms of distance both by week days and months than the memberhip riders
top three recommendations based on this analysis
Have a slash sale or promo for casual riders so they can acquire more bikes and indulge them in the benefits of being a member.
Host fun biking competitions with prizes at intervals for casual riders during August due to their high number of rides during that time, this will also attract them to get a membership.
Encourage casual riders to ride more in the entire year through advertisement, hand flyers, by giving them various coupons so as to convince them into being a member.