knitr::include_graphics(here::here("Bike_station.jpg"))
Bike Station

INTRODUCTION

This capstone project is the final project in my Google Data Analytics Professional Certificate course. In this case study, I analyze a public dataset, provided by the course, for a fictional company called Cyclistic. I use the R programming language for this analysis because it supports reproducibility and transparency and offers convenient tools for statistical analysis and data visualization.

This analysis follows the six phases of the data analysis process, which also form the road map for this case study: Ask, Prepare, Process, Analyze, Share, and Act.

Scenario

You are a junior data analyst working in the marketing analyst team at Cyclistic, a bike-share company in Chicago. The director of marketing believes the company’s future success depends on maximizing the number of annual memberships. Therefore, your team wants to understand how casual riders and annual members use Cyclistic bikes differently. From these insights, your team will design a new marketing strategy to convert casual riders into annual members. But first, Cyclistic executives must approve your recommendations, so they must be backed up with compelling data insights and professional data visualizations.

ASK

  1. How do annual members and casual riders use Cyclistic bikes differently?

  2. Why would casual riders buy Cyclistic annual memberships?

  3. How can Cyclistic use digital media to influence casual riders to become members?

Lily Moreno (director of marketing and my manager) has assigned me the first question to answer: How do annual members and casual riders use Cyclistic bikes differently?

Key tasks

  1. Identify the business task
  2. Consider key stakeholders

Deliverable

  1. A clear statement of the business task

PREPARE

Key tasks

  1. Download data and store it appropriately.
  2. Identify how it’s organized.
  3. Sort and filter the data.
  4. Determine the credibility of the data.

Deliverable

  1. A description of all data sources used

Install and load required packages

# Install once, then load the packages in each session
install.packages(c("tidyverse", "lubridate"))
library(tidyverse)  # data wrangling (dplyr, readr) and plotting (ggplot2)
library(lubridate)  # date-time helpers such as wday()

Import data to R Studio

# Quarterly files: 2019 Q1 to 2020 Q1 (each file is read once and stored)
q1_2019 <- read_csv("Divvy_Trips_2019_Q1.csv")
q2_2019 <- read_csv("Divvy_Trips_2019_Q2.csv")
q3_2019 <- read_csv("Divvy_Trips_2019_Q3.csv")
q4_2019 <- read_csv("Divvy_Trips_2019_Q4.csv")
q1_2020 <- read_csv("Divvy_Trips_2020_Q1.csv")

# April to June 2020 (monthly files), combined into one quarter
q2_04 <- read_csv("202004-divvy-tripdata.csv")
q2_05 <- read_csv("202005-divvy-tripdata.csv")
q2_06 <- read_csv("202006-divvy-tripdata.csv")
q2_2020 <- bind_rows(q2_04, q2_05, q2_06)

# July to September 2020 (monthly files), combined into one quarter
q3_07 <- read_csv("202007-divvy-tripdata.csv")
q3_08 <- read_csv("202008-divvy-tripdata.csv")
q3_09 <- read_csv("202009-divvy-tripdata.csv")
q3_2020 <- bind_rows(q3_07, q3_08, q3_09)

# October to December 2020 (monthly files), combined into one quarter
q4_10 <- read_csv("202010-divvy-tripdata.csv")
q4_11 <- read_csv("202011-divvy-tripdata.csv")
q4_12 <- read_csv("202012-divvy-tripdata.csv")
q4_2020 <- bind_rows(q4_10, q4_11, q4_12)
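
The monthly imports above repeat the same read-then-combine pattern. As a minimal sketch, assuming all files in a quarter share the same columns, the pattern could be wrapped in a small helper. The function name `read_quarter` and the temporary demo files are hypothetical and are not part of the original script.

```r
library(readr)
library(dplyr)
library(purrr)

# Hypothetical helper (not in the original script): read several monthly
# CSV files and stack them into one quarterly data frame.
read_quarter <- function(paths) {
  paths %>%
    map(~ read_csv(.x, show_col_types = FALSE)) %>%  # one tibble per file
    bind_rows()                                      # stack rows, matching columns by name
}

# Toy demonstration: temporary files stand in for the Divvy downloads
demo_paths <- file.path(tempdir(), c("demo-04.csv", "demo-05.csv"))
write_csv(tibble(ride_id = c("a", "b"), member_casual = c("member", "casual")), demo_paths[1])
write_csv(tibble(ride_id = "c", member_casual = "member"), demo_paths[2])

demo_quarter <- read_quarter(demo_paths)
nrow(demo_quarter)
```

With the real files, a call might look like `read_quarter(c("202004-divvy-tripdata.csv", "202005-divvy-tripdata.csv", "202006-divvy-tripdata.csv"))`.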

Wrangle and merge all data into a single data frame

# Stack all eight quarters into one data frame
all_trips <- bind_rows(q1_2019, q2_2019, q3_2019, q4_2019
                       , q1_2020, q2_2020, q3_2020, q4_2020)
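
bind_rows() matches columns by name, so files with different column names end up with NA in each other's columns. The sketch below assumes the 2019 files use an older schema (trip_id, start_time, end_time, from_station_name, usertype) while the 2020 files use ride_id, started_at, ended_at, start_station_name and member_casual. Those 2019 names and the "Subscriber"/"Customer" labels are assumptions to check against colnames() of the real files; the toy rows are made up.

```r
library(dplyr)

# Toy rows standing in for one 2019 file (older schema, assumed) and one 2020 file
trips_2019 <- tibble(
  trip_id           = 1L,
  start_time        = as.POSIXct("2019-01-01 08:00:00", tz = "UTC"),
  end_time          = as.POSIXct("2019-01-01 08:10:00", tz = "UTC"),
  from_station_name = "Station A",
  usertype          = "Subscriber"
)
trips_2020 <- tibble(
  ride_id            = "X1",
  started_at         = as.POSIXct("2020-04-01 09:00:00", tz = "UTC"),
  ended_at           = as.POSIXct("2020-04-01 09:20:00", tz = "UTC"),
  start_station_name = "Station B",
  member_casual      = "casual"
)

# Rename the older columns to the newer names and align types and labels
trips_2019_renamed <- trips_2019 %>%
  rename(ride_id = trip_id, started_at = start_time, ended_at = end_time,
         start_station_name = from_station_name, member_casual = usertype) %>%
  mutate(
    ride_id       = as.character(ride_id),  # so it stacks with character IDs
    member_casual = recode(member_casual, "Subscriber" = "member", "Customer" = "casual")
  )

demo_trips <- bind_rows(trips_2019_renamed, trips_2020)
demo_trips
```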

PROCESS

Cleaning up data and adding data to prepare for analysis

Key tasks

  1. Check the data for errors.

  2. Choose your tools.

  3. Transform the data so you can work with it effectively.

  4. Document the cleaning process.

Deliverables

  1. Documentation of any cleaning or manipulation of data.

colnames(all_trips)  # List of column names
nrow(all_trips)      # Number of rows in the data frame
dim(all_trips)       # Dimensions of the data frame
head(all_trips)      # First 6 rows of the data frame
tail(all_trips)      # Last 6 rows of the data frame
str(all_trips)       # Columns and their data types (numeric, character, etc.)
summary(all_trips)   # Statistical summary of the data, mainly for numeric columns
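
Beyond the structural checks above, the "check the data for errors" task can also cover missing values and duplicates. The following is a small sketch on made-up rows; the column names follow the 2020 schema used elsewhere in this analysis, and the toy data frame `demo` is hypothetical.

```r
library(dplyr)

# Toy data with one duplicated ride ID and one missing rider type
demo <- tibble(
  ride_id       = c("a", "b", "b", "c"),
  member_casual = c("member", "casual", "casual", NA)
)

colSums(is.na(demo))           # missing values per column
sum(duplicated(demo$ride_id))  # number of repeated ride IDs
count(demo, member_casual)     # rider types present, including NA
```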

Adding columns that list the date, month, day, and year of each ride.

all_trips$date <- as.Date(all_trips$started_at) #The default format is yyyy-mm-dd
all_trips$month <- format(as.Date(all_trips$date), "%m")
all_trips$day <- format(as.Date(all_trips$date), "%d")
all_trips$year <- format(as.Date(all_trips$date), "%Y")
all_trips$day_of_week <- format(as.Date(all_trips$date), "%A")

Adding a “ride_length” calculation to all_trips (in seconds)

all_trips$ride_length <- difftime(all_trips$ended_at, all_trips$started_at, units = "secs")  # fix the unit to seconds

str(all_trips)  #to inspect the structure of the columns

Convert “ride_length” from difftime to numeric (seconds) so we can run calculations on the data

all_trips$ride_length <- as.numeric(all_trips$ride_length, units = "secs")

is.numeric(all_trips$ride_length)
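
The unit of a difftime matters for the "in seconds" claim: when no unit is given, difftime() picks one automatically based on the size of the differences. A short illustration with made-up timestamps (the variable names here are for the demo only):

```r
# Two toy rides: 30 minutes and 2 hours
start <- as.POSIXct(c("2020-01-01 10:00:00", "2020-01-01 12:00:00"), tz = "UTC")
end   <- as.POSIXct(c("2020-01-01 10:30:00", "2020-01-01 14:00:00"), tz = "UTC")

auto_len <- difftime(end, start)                  # unit chosen automatically
secs_len <- difftime(end, start, units = "secs")  # unit fixed to seconds

units(auto_len)       # "mins"
as.numeric(secs_len)  # 1800 7200
```

Fixing the unit explicitly keeps every later summary in seconds regardless of the data.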

Remove “bad” data: rides that start at the "HQ QR" station or have a negative ride_length

# Store the cleaned rows in a new data frame (all_trips_v2) so all_trips stays intact
all_trips_v2 <- all_trips[!(all_trips$start_station_name == "HQ QR" | all_trips$ride_length<0),]
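
One caveat with logical indexing: when the condition evaluates to NA (for example, a missing station name), base R returns a row of NAs instead of dropping or keeping the ride. The sketch below shows this on a plain data.frame with made-up rows and offers a dplyr filter() alternative. Whether rides with an unknown start station should be kept is an assumption made here for illustration.

```r
library(dplyr)

demo <- data.frame(
  start_station_name = c("A", "HQ QR", NA, "B"),
  ride_length        = c(100, 50, 200, -5)
)

# Base R indexing: the NA station name yields an all-NA row in the result
base_result <- demo[!(demo$start_station_name == "HQ QR" | demo$ride_length < 0), ]

# dplyr alternative: explicitly keep rides with an unknown start station
clean_result <- demo %>%
  filter(is.na(start_station_name) | start_station_name != "HQ QR",
         ride_length >= 0)

base_result
clean_result
```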

ANALYZE

Key tasks

  1. Aggregate your data so it’s useful and accessible.
  2. Organize and format your data.
  3. Perform calculations.
  4. Identify trends and relationships.

Deliverables

  1. A summary of your analysis

Descriptive analysis of ride_length (all figures in seconds)

mean(all_trips_v2$ride_length, na.rm = TRUE)  # straight average (total ride length / rides)

median(all_trips_v2$ride_length, na.rm = TRUE)  # midpoint number in the ascending array of ride lengths

max(all_trips_v2$ride_length, na.rm = TRUE)  # longest ride

min(all_trips_v2$ride_length, na.rm = TRUE)  # shortest ride
knitr::include_graphics(here::here("mean median max min of all_trips.png"))
mean median max min of all_trips
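
The same four statistics can also be split by rider type, which speaks more directly to the business question. This is a minimal sketch on made-up ride lengths; the toy data frame `demo` is hypothetical.

```r
library(dplyr)

demo <- tibble(
  member_casual = c("member", "member", "casual", "casual"),
  ride_length   = c(300, 500, 1200, 1800)  # seconds
)

demo_summary <- demo %>%
  group_by(member_casual) %>%
  summarise(
    mean_length   = mean(ride_length),
    median_length = median(ride_length),
    max_length    = max(ride_length),
    min_length    = min(ride_length),
    .groups = "drop"
  )

demo_summary
```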

Let’s visualize members and casual riders by the total number of rides taken (ride count)

all_trips_v2 %>% 
  group_by(member_casual) %>% 
  summarise(ride_count = n()) %>%  # number of rides per rider type
  ggplot(aes(x = member_casual, y = ride_count, fill = member_casual)) +
  geom_col(width=0.5, position = position_dodge(width=0.5)) +
  labs(title ="Total rides taken (ride_count) of Members and Casual riders") +
  scale_y_continuous(labels = function(x) format(x, scientific = FALSE))  # plain numbers on the y axis
knitr::include_graphics(here::here("member_casual ride_count.png"))
member_casual ride_count

knitr::include_graphics(here::here("Total rides taken (ride_count) of Members and Casual riders.png"))

Let’s see the average ride time by day of the week for members vs casual riders

aggregate(all_trips_v2$ride_length ~ all_trips_v2$member_casual 
          + all_trips_v2$day_of_week
          , FUN = mean)

# Notice that the result above does not list the days of the week in calendar order.
# Let's put them in order, starting from Sunday.

all_trips_v2$day_of_week <- ordered(all_trips_v2$day_of_week, levels=c("Sunday"
                                                                       , "Monday"
                                                                       , "Tuesday"
                                                                       , "Wednesday"
                                                                       , "Thursday"
                                                                       , "Friday"
                                                                       , "Saturday"))

# Now rerun the average ride time by day for members vs casual riders
# to confirm that the days appear in the intended order
aggregate(all_trips_v2$ride_length ~ all_trips_v2$member_casual 
          + all_trips_v2$day_of_week
          , FUN = mean)
knitr::include_graphics(here::here("average ride time by each day for member vs casual users.png"))
average ride time by each day for member vs casual users

all_trips_v2 %>% 
  mutate(weekday = wday(started_at, label = TRUE)) %>%
  group_by(member_casual, weekday) %>%  
  summarise(number_of_rides = n(), average_duration = mean(ride_length)) %>%
  arrange(member_casual, weekday)   
knitr::include_graphics(here::here("total rides and average ride time by each day for members vs casual riders.png"))
total rides and average ride time(duration) by each day for members vs casual riders

Let’s visualize the above table by days of the week and number of rides taken by member and casual riders.
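
A plot like this can be sketched by reusing the weekday summary and mapping number_of_rides to the y axis. The example below runs on made-up rides; the toy data frame `demo` is hypothetical, and with the real data `all_trips_v2` would take its place.

```r
library(dplyr)
library(lubridate)
library(ggplot2)

# Toy rides: two casual rides on a Saturday, one member ride on a Monday
demo <- tibble(
  started_at    = as.POSIXct(c("2020-06-06 09:00:00", "2020-06-06 10:00:00",
                               "2020-06-08 08:00:00"), tz = "UTC"),
  member_casual = c("casual", "casual", "member")
)

rides_by_weekday <- demo %>%
  mutate(weekday = wday(started_at, label = TRUE)) %>%  # ordered weekday label
  group_by(member_casual, weekday) %>%
  summarise(number_of_rides = n(), .groups = "drop")

ggplot(rides_by_weekday, aes(x = weekday, y = number_of_rides, fill = member_casual)) +
  geom_col(width = 0.5, position = position_dodge(width = 0.5)) +
  labs(title = "Total rides of Members and Casual riders Vs. Day of the week") +
  scale_y_continuous(labels = function(x) format(x, scientific = FALSE))
```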

knitr::include_graphics(here::here("Total rides of Members and Casual riders Vs. Day of the week.png"))

Let’s visualize the average duration of Members and Casual riders Vs. Day of the week

all_trips_v2 %>% 
  mutate(weekday = wday(started_at, label = TRUE)) %>% 
  group_by(member_casual, weekday) %>% 
  summarise(number_of_rides = n()
            ,average_duration = mean(ride_length)) %>% 
  arrange(member_casual, weekday)  %>% 
  ggplot(aes(x = weekday, y = average_duration, fill = member_casual)) +
  labs(title ="Average duration of Members and Casual riders Vs. Day of the week") +
  geom_col(width=0.5, position = position_dodge(width=0.5)) +
  scale_y_continuous(labels = function(x) format(x, scientific = FALSE))
knitr::include_graphics(here::here("Average duration of Members and Casual riders Vs. Day of the week.png"))

Let’s create a visualization for Total rides by members and casual riders by month

all_trips_v2 %>%  
  group_by(member_casual, month) %>% 
  summarise(number_of_rides = n(),.groups="drop") %>% 
  arrange(member_casual, month)  %>% 
  ggplot(aes(x = month, y = number_of_rides, fill = member_casual)) +
  labs(title ="Total rides by Members and Casual riders by Month") +
  theme(axis.text.x = element_text(angle = 45)) +
  geom_col(width=0.5, position = position_dodge(width=0.5)) +
  scale_y_continuous(labels = function(x) format(x, scientific = FALSE))
knitr::include_graphics(here::here("Total rides by Members and Casual riders by Month.png"))
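
Because the data spans 2019 and 2020, grouping by month alone adds the same calendar month from both years together. If the two years should stay separate, one option is to group by the first day of each month instead. This is a sketch on made-up rides, and the column name `year_month` is hypothetical.

```r
library(dplyr)
library(lubridate)

demo <- tibble(
  started_at    = as.POSIXct(c("2019-01-15 08:00:00", "2020-01-20 09:00:00",
                               "2020-01-25 10:00:00"), tz = "UTC"),
  member_casual = c("member", "member", "member")
)

rides_by_year_month <- demo %>%
  mutate(year_month = floor_date(as_date(started_at), unit = "month")) %>%  # e.g. 2020-01-01
  count(member_casual, year_month, name = "number_of_rides")

rides_by_year_month
```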

Let’s compare Members and Casual riders by average ride duration (ride_length, in seconds).

all_trips_v2 %>% 
  drop_na(ride_length) %>%  # drop only rides with a missing ride_length
  group_by(member_casual) %>%
  summarise(average_ride_length = mean(ride_length)) %>%
  ggplot() + 
  geom_col(mapping= aes(x= member_casual,y= average_ride_length,fill=member_casual), show.legend = FALSE)+
  labs(title = "Mean ride duration of Members and Casual riders")
knitr::include_graphics(here::here("Mean distance traveled by Members and Casual riders.png"))
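
ride_length measures time, not distance. If an actual distance comparison is wanted, one possible approach is the haversine (great-circle) formula applied to start and end coordinates. The sketch below assumes columns named start_lat, start_lng, end_lat and end_lng, which may exist only in the newer monthly files. The helper `haversine_km` and the toy coordinates are hypothetical, and the result is a straight-line distance, not the route actually ridden.

```r
# Great-circle distance in kilometres (haversine formula).
# Hypothetical helper; radius_km is the mean Earth radius.
haversine_km <- function(lat1, lng1, lat2, lng2, radius_km = 6371) {
  to_rad <- pi / 180
  dlat <- (lat2 - lat1) * to_rad
  dlng <- (lng2 - lng1) * to_rad
  a <- sin(dlat / 2)^2 +
    cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlng / 2)^2
  2 * radius_km * asin(pmin(1, sqrt(a)))  # pmin guards against rounding above 1
}

# Toy rides with made-up coordinates
demo <- data.frame(
  member_casual = c("member", "casual"),
  start_lat = c(41.88, 41.88), start_lng = c(-87.63, -87.63),
  end_lat   = c(41.89, 41.92), end_lng   = c(-87.63, -87.65)
)

demo$ride_distance_km <- haversine_km(demo$start_lat, demo$start_lng,
                                      demo$end_lat, demo$end_lng)

# Mean straight-line distance per rider type
aggregate(ride_distance_km ~ member_casual, data = demo, FUN = mean)
```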

SHARE

This phase involves using visualization to share my findings and can be done by presentation.

Key tasks

  1. Determine the best way to share your findings.

  2. Create effective data visualizations.

  3. Present your findings.

  4. Ensure your work is accessible.

Deliverables

  1. Supporting visualizations and key findings

ACT

This phase will be carried out by the executive team, Director of Marketing (Lily Moreno) and the Marketing Analytics team based on my analysis.

Conclusion

  1. Members take more rides overall than casual riders.

  2. Members take more rides than casual riders in every month.

  3. Casual riders ride for longer per trip on average.

  4. Members ride steadily throughout the week, while casual riders ride most on weekends (Saturday and Sunday) compared to the other days of the week.

  5. Casual riders likely cover more distance per ride, judging by their longer average ride time; distance itself was not measured directly.

Deliverable

  1. Offer a discounted sale or promotion on annual memberships for casual riders so they can ride more and experience the benefits of being a member.

  2. Host fun biking competitions with prizes at intervals for casual riders on the weekends. Since casual riders are most active on weekends, this may also encourage them to get a membership.

  3. Encourage casual riders to ride more throughout the year through advertisements, flyers, and coupons that persuade them to become members.

THANK YOU FOR READING, PLEASE PROVIDE YOUR VALUABLE FEEDBACK.