knitr::include_graphics(here::here("Bike_station.jpg"))
Bike Station
INTRODUCTION
This capstone project is the final project in my Google Data Analytics Professional Certificate course. In this case study, I analyze a public dataset, provided by the course, for a fictional company called Cyclistic. I use the R programming language for this analysis because it supports reproducibility and transparency and offers accessible tools for statistical analysis and data visualization.
The analysis follows these six phases of the data analysis process:
Ask,
Prepare,
Process,
Analyze,
Share,
Act.
Each step follows the case study road map listed below:
Code, when needed.
Key tasks.
Deliverables.
Scenario
You are a junior data analyst working in the marketing analyst team at Cyclistic, a bike-share company in Chicago. The director of marketing believes the company’s future success depends on maximizing the number of annual memberships. Therefore, your team wants to understand how casual riders and annual members use Cyclistic bikes differently. From these insights, your team will design a new marketing strategy to convert casual riders into annual members. But first, Cyclistic executives must approve your recommendations, so they must be backed up with compelling data insights and professional data visualizations.
ASK
How do annual members and casual riders use Cyclistic bikes differently?
Why would casual riders buy Cyclistic annual memberships?
How can Cyclistic use digital media to influence casual riders to become members?
Lily Moreno (director of marketing and my manager) has assigned me the first question to answer: How do annual members and casual riders use Cyclistic bikes differently?
Key tasks
Deliverable
PREPARE
Key tasks
Deliverable
Install and load required packages
install.packages("tidyverse")
library(tidyverse)
install.packages("lubridate")
library(lubridate)
install.packages("ggplot2")
library(ggplot2)
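Calling install.packages() on every run downloads packages again even when they are already present. A minimal sketch of one way to install only what is missing is shown here; the helper name missing_packages is hypothetical and not part of the original script.

```r
# Hypothetical helper: return the entries of `pkgs` that are not installed.
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

# Usage (left commented out so the sketch has no side effects):
# to_install <- missing_packages(c("tidyverse", "lubridate", "ggplot2"))
# if (length(to_install) > 0) install.packages(to_install)
```

The library() calls stay as they are; only the installation step becomes conditional.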
Import data into RStudio
# Read each quarterly file once and keep the result
q1_2019 <- read_csv("Divvy_Trips_2019_Q1.csv")
q2_2019 <- read_csv("Divvy_Trips_2019_Q2.csv")
q3_2019 <- read_csv("Divvy_Trips_2019_Q3.csv")
q4_2019 <- read_csv("Divvy_Trips_2019_Q4.csv")
q1_2020 <- read_csv("Divvy_Trips_2020_Q1.csv")
# April to June 2020 come as monthly files; combine them into Q2 2020
q2_04 <- read_csv("202004-divvy-tripdata.csv")
q2_05 <- read_csv("202005-divvy-tripdata.csv")
q2_06 <- read_csv("202006-divvy-tripdata.csv")
q2_2020 <- bind_rows(q2_04, q2_05, q2_06)
q3_07 <- read_csv("202007-divvy-tripdata.csv")
q3_08 <- read_csv("202008-divvy-tripdata.csv")
q3_09 <- read_csv("202009-divvy-tripdata.csv")
q3_2020 <- bind_rows(q3_07, q3_08, q3_09)
# October to December 2020 monthly files, combined into Q4 2020
q4_10 <- read_csv("202010-divvy-tripdata.csv")
q4_11 <- read_csv("202011-divvy-tripdata.csv")
q4_12 <- read_csv("202012-divvy-tripdata.csv")
q4_2020 <- rbind(q4_10, q4_11, q4_12)
Wrangle and merge all data into a single file
# Stack all eight quarters into one data frame
all_trips <- bind_rows(q1_2019, q2_2019, q3_2019, q4_2019
                       , q1_2020, q2_2020, q3_2020, q4_2020)
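bind_rows() matches columns by name, so quarters that use different column names end up in separate columns padded with NA. The sketch below shows one way to harmonize names before stacking. It uses toy data, and the older column names (trip_id, start_time, end_time, usertype) and their values are assumptions that should be checked against colnames() for each file.

```r
library(dplyr)

# Toy stand-in for an older quarter (column names and values are assumed).
old_q <- tibble(
  trip_id    = c(1, 2),
  start_time = as.POSIXct(c("2019-01-01 08:00:00", "2019-01-01 09:00:00"), tz = "UTC"),
  end_time   = as.POSIXct(c("2019-01-01 08:10:00", "2019-01-01 09:30:00"), tz = "UTC"),
  usertype   = c("Subscriber", "Customer")
)

# Toy stand-in for a newer quarter that already uses the target names.
new_q <- tibble(
  ride_id       = c("A1", "B2"),
  started_at    = as.POSIXct(c("2020-04-01 08:00:00", "2020-04-01 09:00:00"), tz = "UTC"),
  ended_at      = as.POSIXct(c("2020-04-01 08:20:00", "2020-04-01 09:05:00"), tz = "UTC"),
  member_casual = c("member", "casual")
)

# Rename to the newer names, align types, and recode the rider labels.
old_q_renamed <- old_q %>%
  rename(ride_id = trip_id, started_at = start_time,
         ended_at = end_time, member_casual = usertype) %>%
  mutate(
    ride_id       = as.character(ride_id),
    member_casual = recode(member_casual, "Subscriber" = "member", "Customer" = "casual")
  )

combined <- bind_rows(old_q_renamed, new_q)
```

After this step every row has a value in ride_id, started_at, ended_at, and member_casual, so later grouping by member_casual covers both years.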
PROCESS
Cleaning up data and adding data to prepare for analysis
Key tasks
Check the data for errors.
Choose your tools.
Transform the data so you can work with it effectively.
Document the cleaning process.
Deliverables
colnames(all_trips) #List of column names
nrow(all_trips) #How many rows are in data frame
dim(all_trips) #Dimensions of the data frame
head(all_trips) #See the first 6 rows of data frame
tail(all_trips) #see the last 6 rows of data frame
str(all_trips) #See list of columns and data types (numeric, character, etc)
summary(all_trips) #Statistical summary of data. Mainly for numeric
Adding columns that list the date, month, day, and year of each ride.
all_trips$date <- as.Date(all_trips$started_at) #The default format is yyyy-mm-dd
all_trips$month <- format(as.Date(all_trips$date), "%m")
all_trips$day <- format(as.Date(all_trips$date), "%d")
all_trips$year <- format(as.Date(all_trips$date), "%Y")
all_trips$day_of_week <- format(as.Date(all_trips$date), "%A")
Adding a “ride_length” calculation to all_trips (in seconds)
all_trips$ride_length <- difftime(all_trips$ended_at, all_trips$started_at, units = "secs") #request seconds explicitly
str(all_trips) #to inspect the structure of the columns
Convert “ride_length” from difftime to numeric so we can run calculations on the data
all_trips$ride_length <- as.numeric(all_trips$ride_length)
is.numeric(all_trips$ride_length)
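When no units are given, difftime() picks a unit based on the data, so the result is not guaranteed to be in seconds. A small sketch with made-up timestamps shows the difference between the automatic choice and an explicit request for seconds.

```r
# Made-up start and end times for two rides (10 minutes and 2 hours).
started_at <- as.POSIXct(c("2020-04-01 08:00:00", "2020-04-01 09:00:00"), tz = "UTC")
ended_at   <- as.POSIXct(c("2020-04-01 08:10:00", "2020-04-01 11:00:00"), tz = "UTC")

# Without `units`, the unit is chosen from the smallest difference (minutes here).
auto_length <- difftime(ended_at, started_at)
units(auto_length)

# Asking for seconds makes the numeric values unambiguous.
ride_length <- as.numeric(difftime(ended_at, started_at, units = "secs"))
ride_length
```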
Remove “bad” data
# Drop rides that start at "HQ QR" or have a negative ride_length, and keep the result as a new data frame
all_trips_v2 <- all_trips[!(all_trips$start_station_name == "HQ QR" | all_trips$ride_length<0),]
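One thing to watch with this kind of subsetting: if start_station_name is missing, the comparison returns NA, and base subsetting then produces an all-NA row. The sketch below uses toy data to show the effect and one possible workaround with %in%, which treats a missing name as "not HQ QR". Whether rows with a missing station name should be kept is an analysis decision; this is an illustration, not the method used in this report.

```r
trips <- data.frame(
  start_station_name = c("Clark St", "HQ QR", NA, "Canal St"),
  ride_length        = c(300, 120, 450, -20)
)

# `==` returns NA for the missing station name, and indexing with NA
# yields a row in which every column is NA.
bad_with_na <- trips$start_station_name == "HQ QR" | trips$ride_length < 0
base_result <- trips[!bad_with_na, ]

# `%in%` returns FALSE for NA, so the row with the missing name is kept intact.
is_bad <- trips$start_station_name %in% "HQ QR" | trips$ride_length < 0
clean  <- trips[!is_bad, ]
```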
ANALYZE
Key tasks
Deliverables
Conduct descriptive analysis - Descriptive analysis on ride_length (all figures in seconds)
mean(all_trips_v2$ride_length, na.rm = TRUE)   # straight average (total ride length / rides)
median(all_trips_v2$ride_length, na.rm = TRUE) # midpoint number in the ascending array of ride lengths
max(all_trips_v2$ride_length, na.rm = TRUE)    # longest ride
min(all_trips_v2$ride_length, na.rm = TRUE)    # shortest ride
knitr::include_graphics(here::here("mean median max min of all_trips.png"))
mean median max min of all_trips
Let’s visualize members and casual riders by total rides taken (ride count)
all_trips_v2 %>%
  group_by(member_casual) %>%
  summarise(ride_count = n()) %>% # count rides per rider type
  ggplot(aes(x = member_casual, y = ride_count, fill = member_casual)) +
  geom_col(width = 0.5, position = position_dodge(width = 0.5)) + # one bar layer is enough
  labs(title = "Total rides taken (ride_count) of Members and Casual riders") +
  scale_y_continuous(labels = function(x) format(x, scientific = FALSE))
knitr::include_graphics(here::here("member_casual ride_count.png"))
member_casual ride_count
knitr::include_graphics(here::here("Total rides taken (ride_count) of Members and Casual riders.png"))
Let’s see the average ride time by day of the week for members vs casual users
aggregate(all_trips_v2$ride_length ~ all_trips_v2$member_casual
+ all_trips_v2$day_of_week
, FUN = mean)
# Notice that the result above does not list the days of the week in calendar order,
# so let's order them from Sunday to Saturday
all_trips_v2$day_of_week <- ordered(all_trips_v2$day_of_week, levels=c("Sunday"
, "Monday"
, "Tuesday"
, "Wednesday"
, "Thursday"
, "Friday"
, "Saturday"))
# Now rerun the average ride time by day for members vs casual users
# to confirm that the ordering took effect
aggregate(all_trips_v2$ride_length ~ all_trips_v2$member_casual
+ all_trips_v2$day_of_week
, FUN = mean)
knitr::include_graphics(here::here("average ride time by each day for member vs casual users.png"))
average ride time by each day for member vs casual users
all_trips_v2 %>%
mutate(weekday = wday(started_at, label = TRUE)) %>%
group_by(member_casual, weekday) %>%
summarise(number_of_rides = n(), average_duration = mean(ride_length)) %>%
arrange(member_casual, weekday)
knitr::include_graphics(here::here("total rides and average ride time by each day for members vs casual riders.png"))
total rides and average ride time(duration) by each day for members vs casual riders
Let’s visualize the above table by days of the week and number of rides taken by member and casual riders.
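The code for this chart is not shown in the text. A minimal sketch of how such a chart could be produced follows the same pattern as the other charts in this report, with number_of_rides on the y axis. It uses a small toy data frame (toy_trips, with made-up rides) so that it runs on its own; in the report the input would be all_trips_v2.

```r
library(dplyr)
library(ggplot2)
library(lubridate)

# Made-up rides: two on a Saturday, one on a Sunday, two on a Monday.
toy_trips <- tibble(
  started_at = as.POSIXct(c("2020-06-06 10:00:00", "2020-06-06 12:00:00",
                            "2020-06-07 15:00:00", "2020-06-08 08:00:00",
                            "2020-06-08 17:30:00"), tz = "UTC"),
  member_casual = c("casual", "casual", "casual", "member", "member")
)

# Count rides per rider type and weekday.
rides_by_weekday <- toy_trips %>%
  mutate(weekday = wday(started_at, label = TRUE)) %>%
  group_by(member_casual, weekday) %>%
  summarise(number_of_rides = n(), .groups = "drop")

# Dodged bars: one bar per rider type for each weekday.
rides_plot <- ggplot(rides_by_weekday,
                     aes(x = weekday, y = number_of_rides, fill = member_casual)) +
  geom_col(width = 0.5, position = position_dodge(width = 0.5)) +
  labs(title = "Total rides of Members and Casual riders Vs. Day of the week") +
  scale_y_continuous(labels = function(x) format(x, scientific = FALSE))

rides_plot
```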
knitr::include_graphics(here::here("Total rides of Members and Casual riders Vs. Day of the week.png"))
Let’s visualize the average duration of Members and Casual riders Vs. Day of the week
all_trips_v2 %>%
  mutate(weekday = wday(started_at, label = TRUE)) %>%
  group_by(member_casual, weekday) %>%
  summarise(number_of_rides = n()
            , average_duration = mean(ride_length)
            , .groups = "drop") %>%
  arrange(member_casual, weekday) %>%
  ggplot(aes(x = weekday, y = average_duration, fill = member_casual)) +
  geom_col(width = 0.5, position = position_dodge(width = 0.5)) + # one bar layer is enough
  labs(title = "Average duration of Members and Casual riders Vs. Day of the week") +
  scale_y_continuous(labels = function(x) format(x, scientific = FALSE))
knitr::include_graphics(here::here("Average duration of Members and Casual riders Vs. Day of the week.png"))
Let’s create a visualization for Total rides by members and casual riders by month
all_trips_v2 %>%
group_by(member_casual, month) %>%
summarise(number_of_rides = n(),.groups="drop") %>%
arrange(member_casual, month) %>%
ggplot(aes(x = month, y = number_of_rides, fill = member_casual)) +
labs(title ="Total rides by Members and Casual riders by Month") +
theme(axis.text.x = element_text(angle = 45)) +
geom_col(width=0.5, position = position_dodge(width=0.5)) +
scale_y_continuous(labels = function(x) format(x, scientific = FALSE))
knitr::include_graphics(here::here("Total rides by Members and Casual riders by Month.png"))
Let’s compare Members and Casual riders by average ride length. Note that ride_length is the ride duration in seconds, not a distance.
all_trips_v2 %>%
  group_by(member_casual) %>% drop_na() %>% # drop_na() removes rows with an NA in any column
  summarise(average_ride_length = mean(ride_length)) %>%
  ggplot() +
  geom_col(mapping = aes(x = member_casual, y = average_ride_length, fill = member_casual), show.legend = FALSE) +
  labs(title = "Mean ride length of Members and Casual riders")
knitr::include_graphics(here::here("Mean distance traveled by Members and Casual riders.png"))
SHARE
This phase involves using visualizations to share my findings, for example in a presentation.
Key tasks
Determine the best way to share your findings.
Create effective data visualizations.
Present your findings.
Ensure your work is accessible.
Deliverables
ACT
This phase will be carried out by the executive team, Director of Marketing (Lily Moreno) and the Marketing Analytics team based on my analysis.
Conclusion
Members take more rides than casual riders.
Members take more rides than casual riders in every month.
Casual riders ride for a longer time on average.
Members ride more throughout the week, while casual riders also record high ride counts on weekends (Saturday and Sunday) compared with the other days of the week.
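Casual riders have the higher mean ride_length. Because ride_length measures duration, this reflects longer trips in time rather than a measured distance.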
Deliverable
Run a discount sale or promotion for casual riders so they can ride more and experience the benefits of being a member.
Host fun biking competitions with prizes at intervals for casual riders on the weekends. Since casual riders are especially active on weekends, this can also draw them toward a membership.
Encourage casual riders to ride more throughout the year through advertisements, flyers, and coupons that persuade them to become members.
THANK YOU FOR READING, PLEASE PROVIDE YOUR VALUABLE FEEDBACK.