This project is a case study from the Google Data Analytics Professional Certificate course. For the analysis i have chosen to use R programming language and Rstudio IDE for its easy statisical analysis and data visualisations with large datasets.
You are a junior data analyst working in the marketing analyst team at Cyclistic, a bike-share company in Chicago. The director of marketing believes the company’s future success depends on maximizing the number of annual memberships. Therefore, your team wants to understand how casual riders and annual members use Cyclistic bikes differently. From these insights, your team will design a new marketing strategy to convert casual riders into annual members. But first, Cyclistic executives must approve your recommendations, so they must be backed up with compelling data insights and professional data visualisations.
Three questions will guide the future marketing program:
1. How do annual members and casual riders use Cyclistic bikes
differently?
2. Why would casual riders buy Cyclistic annual memberships?
3. How can Cyclistic use digital media to influence casual riders to
become members?
The director of marketing and your manager Lily Moreno has assigned you the first question to answer: How do annual members and casual riders use Cyclistic bikes differently?
The main objective is build successful marketing strategies that will aim to turn casual bike riders into members by providing data insights into the differences in how the two groups use Cyclistic bike share differently.
Director of marketing (Lily Moreno), Cyclistic executive team and marketing analytics team.
Find and explore the differences in between casual riders and member riders.
Cyclistic’s historical trip data will be used to analyse and identify trends This data has been made avialble as seen in these licenses Datasets are avialable here
For this analysis i have chosen the most recent data from 2022-03 to 2023-02, as its more current for the business task. All trips data is a comma-delimited (.CSV) format. Total of 13 columns including “ride_id”, “rideable_type”, “started_at”, “ended_at”, “start_station_name”, “start_station_id”, “end_station_id”, “start_lat”, “start_lng”, “end_lat”, “end_lng” and “member_casual” For the purpose of this case study the datasets are appropriate and will enable me to answer the business questions.
# install and load the necessary packages
library(tidyverse)
library(ggplot2)
library(lubridate)
library(readr)
library(dplyr)
library(geosphere)
library(tidyr)
library(rmarkdown)
#import data sets
trip_2022_03 <- read.csv("~/Rstudio/Cyclistic_Case_Study/202203-tripdata.csv", stringsAsFactors = FALSE)
trip_2022_04 <- read.csv("~/Rstudio/Cyclistic_Case_Study/202204-tripdata.csv", stringsAsFactors = FALSE)
trip_2022_05 <- read.csv("~/Rstudio/Cyclistic_Case_Study/202205-tripdata.csv", stringsAsFactors = FALSE)
trip_2022_06 <- read.csv("~/Rstudio/Cyclistic_Case_Study/202206-tripdata.csv", stringsAsFactors = FALSE)
trip_2022_07 <- read.csv("~/Rstudio/Cyclistic_Case_Study/202207-tripdata.csv", stringsAsFactors = FALSE)
trip_2022_08 <- read.csv("~/Rstudio/Cyclistic_Case_Study/202208-tripdata.csv", stringsAsFactors = FALSE)
trip_2022_09 <- read.csv("~/Rstudio/Cyclistic_Case_Study/202209-tripdata.csv", stringsAsFactors = FALSE)
trip_2022_10 <- read.csv("~/Rstudio/Cyclistic_Case_Study/202210-tripdata.csv", stringsAsFactors = FALSE)
trip_2022_11 <- read.csv("~/Rstudio/Cyclistic_Case_Study/202211-tripdata.csv", stringsAsFactors = FALSE)
trip_2022_12 <- read.csv("~/Rstudio/Cyclistic_Case_Study/202212-tripdata.csv", stringsAsFactors = FALSE)
trip_2023_01 <- read.csv("~/Rstudio/Cyclistic_Case_Study/202301-tripdata.csv", stringsAsFactors = FALSE)
trip_2023_02 <- read.csv("~/Rstudio/Cyclistic_Case_Study/202302-tripdata.csv", stringsAsFactors = FALSE)
First we need to examine the datasets for consistency including column names and datetypes. It is important for all column names to be in the correct order before the merge the data frames successfully.
#check for consistency between data sets - matching column names
colnames(trip_2022_03)
colnames(trip_2022_04)
colnames(trip_2022_05)
colnames(trip_2022_06)
colnames(trip_2022_07)
colnames(trip_2022_08)
colnames(trip_2022_09)
colnames(trip_2022_10)
colnames(trip_2022_11)
colnames(trip_2022_12)
colnames(trip_2023_01)
colnames(trip_2023_02)
## [1] "ride_id" "rideable_type" "started_at"
## [4] "ended_at" "start_station_name" "start_station_id"
## [7] "end_station_name" "end_station_id" "start_lat"
## [10] "start_lng" "end_lat" "end_lng"
## [13] "member_casual"
#check data types - chr, date etc
str(trip_2022_03)
str(trip_2022_04)
str(trip_2022_05)
str(trip_2022_06)
str(trip_2022_07)
str(trip_2022_08)
str(trip_2022_09)
str(trip_2022_10)
str(trip_2022_11)
str(trip_2022_12)
str(trip_2023_01)
str(trip_2023_02)
## 'data.frame': 284042 obs. of 13 variables:
## $ ride_id : chr "47EC0A7F82E65D52" "8494861979B0F477" "EFE527AF80B66109" "9F446FD9DEE3F389" ...
## $ rideable_type : chr "classic_bike" "electric_bike" "classic_bike" "classic_bike" ...
## $ started_at : chr "2022-03-21 13:45:01" "2022-03-16 09:37:16" "2022-03-23 19:52:02" "2022-03-01 19:12:26" ...
## $ ended_at : chr "2022-03-21 13:51:18" "2022-03-16 09:43:34" "2022-03-23 19:54:48" "2022-03-01 19:22:14" ...
## $ start_station_name: chr "Wabash Ave & Wacker Pl" "Michigan Ave & Oak St" "Broadway & Berwyn Ave" "Wabash Ave & Wacker Pl" ...
## $ start_station_id : chr "TA1307000131" "13042" "13109" "TA1307000131" ...
## $ end_station_name : chr "Kingsbury St & Kinzie St" "Orleans St & Chestnut St (NEXT Apts)" "Broadway & Ridge Ave" "Franklin St & Jackson Blvd" ...
## $ end_station_id : chr "KA1503000043" "620" "15578" "TA1305000025" ...
## $ start_lat : num 41.9 41.9 42 41.9 41.9 ...
## $ start_lng : num -87.6 -87.6 -87.7 -87.6 -87.6 ...
## $ end_lat : num 41.9 41.9 42 41.9 41.9 ...
## $ end_lng : num -87.6 -87.6 -87.7 -87.6 -87.7 ...
## $ member_casual : chr "member" "member" "member" "member" ...
#merge the individual monthly data frames into one large data frame
yearly_trips <- bind_rows(trip_2022_03, trip_2022_04, trip_2022_05, trip_2022_06, trip_2022_07, trip_2022_08,
trip_2022_09, trip_2022_10, trip_2022_11, trip_2022_12, trip_2023_01, trip_2023_02)
Cleaning and Preparation of data for analysis
# checking merged data frame
colnames(yearly_trips) #List of column names
## [1] "ride_id" "rideable_type" "started_at"
## [4] "ended_at" "start_station_name" "start_station_id"
## [7] "end_station_name" "end_station_id" "start_lat"
## [10] "start_lng" "end_lat" "end_lng"
## [13] "member_casual"
head(yearly_trips) #See the first six row of data
## ride_id rideable_type started_at ended_at
## 1 47EC0A7F82E65D52 classic_bike 2022-03-21 13:45:01 2022-03-21 13:51:18
## 2 8494861979B0F477 electric_bike 2022-03-16 09:37:16 2022-03-16 09:43:34
## 3 EFE527AF80B66109 classic_bike 2022-03-23 19:52:02 2022-03-23 19:54:48
## 4 9F446FD9DEE3F389 classic_bike 2022-03-01 19:12:26 2022-03-01 19:22:14
## 5 431128AD9AFFEDC0 classic_bike 2022-03-21 18:37:01 2022-03-21 19:19:11
## 6 9AA8A13AF7A85325 classic_bike 2022-03-07 17:10:22 2022-03-07 17:15:04
## start_station_name start_station_id
## 1 Wabash Ave & Wacker Pl TA1307000131
## 2 Michigan Ave & Oak St 13042
## 3 Broadway & Berwyn Ave 13109
## 4 Wabash Ave & Wacker Pl TA1307000131
## 5 DuSable Lake Shore Dr & North Blvd LF-005
## 6 Bissell St & Armitage Ave 13059
## end_station_name end_station_id start_lat start_lng
## 1 Kingsbury St & Kinzie St KA1503000043 41.88688 -87.62603
## 2 Orleans St & Chestnut St (NEXT Apts) 620 41.90100 -87.62375
## 3 Broadway & Ridge Ave 15578 41.97835 -87.65975
## 4 Franklin St & Jackson Blvd TA1305000025 41.88688 -87.62603
## 5 Loomis St & Jackson Blvd 13206 41.91172 -87.62680
## 6 Southport Ave & Clybourn Ave TA1309000030 41.91802 -87.65218
## end_lat end_lng member_casual
## 1 41.88918 -87.63851 member
## 2 41.89820 -87.63754 member
## 3 41.98404 -87.66027 member
## 4 41.87771 -87.63532 member
## 5 41.87794 -87.66201 member
## 6 41.92077 -87.66371 member
str(yearly_trips) #List of column names and data types
## 'data.frame': 5829084 obs. of 13 variables:
## $ ride_id : chr "47EC0A7F82E65D52" "8494861979B0F477" "EFE527AF80B66109" "9F446FD9DEE3F389" ...
## $ rideable_type : chr "classic_bike" "electric_bike" "classic_bike" "classic_bike" ...
## $ started_at : chr "2022-03-21 13:45:01" "2022-03-16 09:37:16" "2022-03-23 19:52:02" "2022-03-01 19:12:26" ...
## $ ended_at : chr "2022-03-21 13:51:18" "2022-03-16 09:43:34" "2022-03-23 19:54:48" "2022-03-01 19:22:14" ...
## $ start_station_name: chr "Wabash Ave & Wacker Pl" "Michigan Ave & Oak St" "Broadway & Berwyn Ave" "Wabash Ave & Wacker Pl" ...
## $ start_station_id : chr "TA1307000131" "13042" "13109" "TA1307000131" ...
## $ end_station_name : chr "Kingsbury St & Kinzie St" "Orleans St & Chestnut St (NEXT Apts)" "Broadway & Ridge Ave" "Franklin St & Jackson Blvd" ...
## $ end_station_id : chr "KA1503000043" "620" "15578" "TA1305000025" ...
## $ start_lat : num 41.9 41.9 42 41.9 41.9 ...
## $ start_lng : num -87.6 -87.6 -87.7 -87.6 -87.6 ...
## $ end_lat : num 41.9 41.9 42 41.9 41.9 ...
## $ end_lng : num -87.6 -87.6 -87.7 -87.6 -87.7 ...
## $ member_casual : chr "member" "member" "member" "member" ...
#create separate columns for date, year, month and day
yearly_trips$date <- as.Date(yearly_trips$started_at)
yearly_trips$year <- format(as.Date(yearly_trips$date), "%y") #extract year
yearly_trips$month <- format(as.Date(yearly_trips$date), "%m") # extract month
yearly_trips$day <- format(as.Date(yearly_trips$date), "%d") #extract day of month
yearly_trips$day_of_week <- wday(yearly_trips$date, label = TRUE) #extract day of the week column
yearly_trips$start_time <- format(strptime(yearly_trips$started_at,"%Y-%m-%d %H:%M:%S"), "%H") #extract hour of the day
# calculate the length of ride by finding the difference in end and start time (needed to convert start and ended at columns)
yearly_trips$started_at2 <- strptime(yearly_trips$started_at, "%Y-%m-%d %H:%M:%S") #convert chr date time
yearly_trips$ended_at2 <- strptime (yearly_trips$ended_at, "%Y-%m-%d %H:%M:%S") #convert chr date time
yearly_trips$ride_length <- difftime(yearly_trips$ended_at2, yearly_trips$started_at2, units ="mins") #calculate difference in times
yearly_trips$ride_length2 <- as.numeric(yearly_trips$ride_length) # convert to number for calculations
# calculate the ride_distance
yearly_trips$ride_distance <- distGeo(matrix(c(yearly_trips$start_lng, yearly_trips$start_lat), ncol = 2),
matrix(c(yearly_trips$end_lng, yearly_trips$end_lat), ncol = 2))
yearly_trips$ride_distance <- yearly_trips$ride_distance/1000 #distance in km
#clean data set
yearly_trips2 <- yearly_trips [!(yearly_trips$ride_length2 <=0),] # remove values less than or equal to 0
yearly_trips2 <- na.omit(yearly_trips2) # remove rows where with NA values
Now all the required data is in one place. We are ready to perform descriptive analysis of the data to find patterns and explore the differences between casual customers and members.
#first lets looks at our cleaned data frame
str(yearly_trips2)
## 'data.frame': 5822489 obs. of 24 variables:
## $ ride_id : chr "47EC0A7F82E65D52" "8494861979B0F477" "EFE527AF80B66109" "9F446FD9DEE3F389" ...
## $ rideable_type : chr "classic_bike" "electric_bike" "classic_bike" "classic_bike" ...
## $ started_at : chr "2022-03-21 13:45:01" "2022-03-16 09:37:16" "2022-03-23 19:52:02" "2022-03-01 19:12:26" ...
## $ ended_at : chr "2022-03-21 13:51:18" "2022-03-16 09:43:34" "2022-03-23 19:54:48" "2022-03-01 19:22:14" ...
## $ start_station_name: chr "Wabash Ave & Wacker Pl" "Michigan Ave & Oak St" "Broadway & Berwyn Ave" "Wabash Ave & Wacker Pl" ...
## $ start_station_id : chr "TA1307000131" "13042" "13109" "TA1307000131" ...
## $ end_station_name : chr "Kingsbury St & Kinzie St" "Orleans St & Chestnut St (NEXT Apts)" "Broadway & Ridge Ave" "Franklin St & Jackson Blvd" ...
## $ end_station_id : chr "KA1503000043" "620" "15578" "TA1305000025" ...
## $ start_lat : num 41.9 41.9 42 41.9 41.9 ...
## $ start_lng : num -87.6 -87.6 -87.7 -87.6 -87.6 ...
## $ end_lat : num 41.9 41.9 42 41.9 41.9 ...
## $ end_lng : num -87.6 -87.6 -87.7 -87.6 -87.7 ...
## $ member_casual : chr "member" "member" "member" "member" ...
## $ date : Date, format: "2022-03-21" "2022-03-16" ...
## $ year : chr "22" "22" "22" "22" ...
## $ month : chr "03" "03" "03" "03" ...
## $ day : chr "21" "16" "23" "01" ...
## $ day_of_week : Ord.factor w/ 7 levels "Sun"<"Mon"<"Tue"<..: 2 4 4 3 2 2 5 7 5 6 ...
## $ start_time : chr "13" "09" "19" "19" ...
## $ started_at2 : POSIXlt, format: "2022-03-21 13:45:01" "2022-03-16 09:37:16" ...
## $ ended_at2 : POSIXlt, format: "2022-03-21 13:51:18" "2022-03-16 09:43:34" ...
## $ ride_length : 'difftime' num 6.28333333333333 6.3 2.76666666666667 9.8 ...
## ..- attr(*, "units")= chr "mins"
## $ ride_length2 : num 6.28 6.3 2.77 9.8 42.17 ...
## $ ride_distance : num 1.067 1.185 0.634 1.277 4.755 ...
## - attr(*, "na.action")= 'omit' Named int [1:6057] 309 461 2686 2771 3160 5694 7114 8195 13771 13951 ...
## ..- attr(*, "names")= chr [1:6057] "NA" "NA.1" "NA.2" "NA.3" ...
# good practice to look at basic descriptive statisitcs of our data
yearly_trips2 %>%
summarise(average_ride_length = mean(ride_length2), median_ride_length = median(ride_length2),
max_ride_length = max(ride_length2), min_ride_length = min(ride_length2))
## average_ride_length median_ride_length max_ride_length min_ride_length
## 1 16.11311 10.13333 34294.07 0.01666667
yearly_trips2 %>%
group_by(member_casual) %>%
summarise(ride_count = length(ride_id), ride_percentage = (length(ride_id) /
nrow(yearly_trips2)) *100 )
## # A tibble: 2 × 3
## member_casual ride_count ride_percentage
## <chr> <int> <dbl>
## 1 casual 2359547 40.5
## 2 member 3462942 59.5
ggplot(yearly_trips2, aes(x = member_casual, fill = member_casual)) + geom_bar() +
labs (title = "casual riders vs member riders", x="Casuals VS Members", y = "number of riders", x ="Casual vs Members distribution") +
scale_y_continuous(labels = function(x) format(x, scientific = FALSE))
#total and average ride time for each day of the week
yearly_trips2 %>%
group_by(member_casual, day_of_week) %>%
summarise(number_of_rides = n(), average_ride_length = mean(ride_length2)) %>%
arrange(member_casual, day_of_week)
## # A tibble: 14 × 4
## # Groups: member_casual [2]
## member_casual day_of_week number_of_rides average_ride_length
## <chr> <ord> <int> <dbl>
## 1 casual Sun 397503 24.9
## 2 casual Mon 282669 22.3
## 3 casual Tue 271722 19.3
## 4 casual Wed 279352 18.7
## 5 casual Thu 312994 19.4
## 6 casual Fri 338015 20.4
## 7 casual Sat 477292 24.5
## 8 member Sun 402676 13.6
## 9 member Mon 488499 11.9
## 10 member Tue 546468 11.6
## 11 member Wed 542308 11.7
## 12 member Thu 547254 11.9
## 13 member Fri 481057 12.1
## 14 member Sat 454680 13.7
#total casual vs member - day of the week
yearly_trips2 %>%
group_by(member_casual, day_of_week) %>%
summarise(number_of_rides = n()) %>%
arrange(member_casual, day_of_week) %>%
ggplot(aes(x = day_of_week, y=number_of_rides, fill = member_casual)) +
labs(title = "Total rides by Memeber vs Casuals for each day of the week") +
geom_col(width=0.5, position = position_dodge(width=0.5))+
scale_y_continuous(labels = function(x) format(x, scientific = FALSE))
#average ride length - casual vs member - day of the week
yearly_trips2 %>%
group_by(member_casual, day_of_week) %>%
summarise(average_ride_length = mean(ride_length2)) %>%
arrange(member_casual, day_of_week) %>%
ggplot(aes(x= day_of_week, y = average_ride_length, fill = member_casual)) +
geom_col(width=0.5, position = position_dodge(width=0.5)) +
labs(title ="Average ride time by Members and Casual riders Vs. Day of the week")
#first lets convert into a month abbreviation for ease of viewing
yearly_trips2$month <- as.numeric(yearly_trips2$month) # convert character to number
yearly_trips2$month <- month.abb[yearly_trips2$month] # convert numeric month to month abbreviation
yearly_trips2 %>%
group_by(member_casual, month) %>%
summarise(numbner_of_rides = n()) %>%
arrange(member_casual, month) %>%
print(n = Inf) %>% # print all rows
ggplot(aes(x=month, y=numbner_of_rides, fill = member_casual)) +
labs (title = "Total monthly comparison - members vs casuals", x = "month", y = "Number of rides") +
geom_col(width=0.5, position = position_dodge(width=0.5)) +
scale_y_continuous(labels = function(x) format(x, scientific = FALSE)) +
scale_x_discrete(limits = month.abb) # order the months
## # A tibble: 24 × 3
## # Groups: member_casual [2]
## member_casual month numbner_of_rides
## <chr> <chr> <int>
## 1 casual Apr 126102
## 2 casual Aug 358130
## 3 casual Dec 44788
## 4 casual Feb 42920
## 5 casual Jan 39902
## 6 casual Jul 405146
## 7 casual Jun 368060
## 8 casual Mar 89580
## 9 casual May 279737
## 10 casual Nov 100554
## 11 casual Oct 208584
## 12 casual Sep 296044
## 13 member Apr 244799
## 14 member Aug 426882
## 15 member Dec 136876
## 16 member Feb 147400
## 17 member Jan 150264
## 18 member Jul 417323
## 19 member Jun 400023
## 20 member Mar 194059
## 21 member May 354351
## 22 member Nov 236893
## 23 member Oct 349561
## 24 member Sep 404511
#average rides vs month
yearly_trips2 %>%
group_by(member_casual, month) %>%
summarise(average_ride_length = mean(ride_length2)) %>%
arrange(member_casual, month) %>%
print(n = Inf) %>% # print all rows
ggplot(aes(x=month, y=average_ride_length, fill = member_casual)) +
labs (title = "Average monthly comparison - members vs casuals", x = "month", y = "Average ride length (mins)") +
geom_col(width=0.5, position = position_dodge(width=0.5)) +
scale_y_continuous(labels = function(x) format(x, scientific = FALSE)) +
scale_x_discrete(limits = month.abb) # order the months
## # A tibble: 24 × 3
## # Groups: member_casual [2]
## member_casual month average_ride_length
## <chr> <chr> <dbl>
## 1 casual Apr 23.4
## 2 casual Aug 21.4
## 3 casual Dec 13.4
## 4 casual Feb 15.9
## 5 casual Jan 13.6
## 6 casual Jul 23.3
## 7 casual Jun 23.4
## 8 casual Mar 25.7
## 9 casual May 25.5
## 10 casual Nov 15.5
## 11 casual Oct 18.5
## 12 casual Sep 20.0
## 13 member Apr 11.4
## 14 member Aug 13.1
## 15 member Dec 10.4
## 16 member Feb 10.5
## 17 member Jan 10.1
## 18 member Jul 13.4
## 19 member Jun 13.7
## 20 member Mar 11.7
## 21 member May 13.1
## 22 member Nov 10.9
## 23 member Oct 11.6
## 24 member Sep 12.6
#comparison on ride distance
yearly_trips2 %>%
group_by(member_casual) %>%
summarise(average_ride_distance = mean(ride_distance)) %>%
ggplot(aes(x=member_casual, y=average_ride_distance, fill = member_casual)) +
geom_col() +
labs(title = "Mean travel distance members vs casuals", y="Average distance in KM" )
#hourly comparison
yearly_trips2 %>%
ggplot(aes(x = start_time, fill = member_casual)) +
geom_bar()+
labs(title = "Hourly demand comparison", x = "hour of the day")+
scale_y_continuous(labels = function(x) format(x, scientific = FALSE))+
theme(axis.text.x = element_text(angle = 30))+
facet_wrap(~member_casual)
#hourly demand by day of the week
yearly_trips2 %>%
ggplot(aes(x = start_time, fill = member_casual)) +
geom_bar() +
labs(title = "Hourly demand comparison", x = "hour of the day")+
scale_y_continuous(labels = function(x) format(x, scientific = FALSE))+
theme(axis.text.x = element_text(angle = 90))+
facet_wrap(~day_of_week)
#rideable type
yearly_trips2 %>%
group_by(rideable_type) %>%
summarise(count = length(ride_id))
## # A tibble: 3 × 2
## rideable_type count
## <chr> <int>
## 1 classic_bike 2663497
## 2 docked_bike 176393
## 3 electric_bike 2982599
yearly_trips2 %>%
ggplot(aes(x=rideable_type, fill = member_casual)) +
labs(title = "Rideable type comparison - members vs casuals", x= "Rideable type") +
geom_bar(width=0.5, position = position_dodge(width=0.5))+
scale_y_continuous(labels = function(x) format(x, scientific = FALSE))
For the last stage of the data analysis process, we will make three recommendations from our key analysis aimed at increasing our annual members.