Full Year Bike Rider Analysis

This analysis is based on the Divvy case study “‘Sophisticated, Clear, and Polished’: Divvy and Data Visualization” written by Kevin Hartman (found here: https://artscience.blog/home/divvy-dataviz-case-study). The purpose of this analysis is to consolidate downloaded Divvy data (found here : https://divvy-tripdata.s3.amazonaws.com/index.html), wrangle the data, and then conduct simple analysis to help answer the key question: “In what ways do members and casual riders use Divvy bikes differently?”

STEP 1: DATA COLLECTION
STEP 2: DATA CLEANING AND VALIDATION
  • Converted the files to excel worksheet.
  • Checked any blank and dublicate files of rider_id.
  • Created column for ride_length using the formula =IF(D2>C2, D2-C2, “ERROR”). To return ERROR if ended_at < started_at.
  • Cleaned data by removing all columns with ERROR using filter function.
STEP 3: DATA PREPARATION AND DESCRIPTIVE ANALYSIS
  • Used formulas to determine: Columns-“ride_length”using the formula=IF(D2>C2, D2-C2, “ERROR”), and “day_of_week” using =WEEKDAY(C2,1).
  • Calculated the following data summary using the excel formulas given;
    • Min ride_length=MIN(N$2:N$84718)
    • Max ride_length=MAX(N2:N84718)
    • Mean of ride_length=AVERAGE(N2:N84718)
    • Mean of ride_length=AVERAGE(N2:N84718)
    • Median of day_of_week=MEDIAN(O2:O84718)
    • Mean of ride_length (Member)=AVERAGEIF(M$2:M$84719, “member”,N2:N$84719)
    • Mean of ride_length(Casual)=AVERAGEIF(M2:M84720, “casual”,N2:N84720)
    • Number of cyclists (Member)=COUNTIF(M$2:M$84718,“member”)
    • Number of cyclists (Casual)=COUNTIF(M2:M84720,“casual”).
  • Used pivot charts to calculate the number of rides and average duration for each day of the week for each month and converted the duration from hhmmss to seconds.
knitr::include_graphics("C:/Users/Sam/Desktop/Capstone/Pivot_202004.PNG")

  • Combined the above summary for each three consecutive months to form four quaters (2020Q2,2020Q3,2020Q4, 2021Q1).
STEP 4: ANNUAL DATA ANALYSIS AND VISUALIZATION

Called the above summary into R for further analysis and visualization.

Setting up the working environment

  • loaded tidyverse and readxl libraries

    library(“tidyverse”)
    library(readxl)

Set the working directory

setwd("C:/Users/Sam/Desktop/Capstone")
getwd()
## [1] "C:/Users/Sam/Desktop/Capstone"

Imported data from the summary_data excel file into R

Q2_2020 <- read_excel("Summary_Data.xlsx", sheet = "Annual", range = "A2:E9")
Q3_2020 <- read_excel("Summary_Data.xlsx", sheet = "Annual", range = "G2:K9")
Q4_2020 <- read_excel("Summary_Data.xlsx", sheet = "Annual", range = "A14:E21")
Q1_2021 <- read_excel("Summary_Data.xlsx", sheet = "Annual", range = "G14:K21")

Combined Q2_2020, Q3_2020, Q4_2020,and Q1_2021 to form consolidated annual data

annual_data <- rbind(Q2_2020, Q3_2020, Q4_2020,Q1_2021)

Converted weekday to an ordered factor

annual_data$day_of_week <- factor(annual_data$day_of_week, levels = c("Sunday", "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday"))

Grouped by weekday and summarized columns number_of_rides_casual, average_duration_casual, number_of_rides_member, average_duration_member by sum.

all_trips <- annual_data %>%
  group_by(day_of_week) %>%
  summarize(Annual_number_of_rides_casual = sum(number_of_rides_casual),
            Annual_number_of_rides_member = sum(number_of_rides_member),
            Annual_average_duration_casual = sum(average_duration_casual),
            Annual_average_duration_member = sum(average_duration_member))

Created the clustered column char using the ggplot2 library to visualize day_of_week vs number of rides.

Reshaped the data using the gather() function from the tidyr library to convert the wide format to a long format.

all_trips_v2 <- gather(all_trips, key = "Column", value = "Number_of_rides", Annual_number_of_rides_casual, Annual_number_of_rides_member)

Created clustered column chart using the ggplot() function and day_of_week as x-axis, Number_of_rides as y-axis, fill color, and other chart labels.

ggplot(all_trips_v2, aes(x = day_of_week, y = Number_of_rides, fill = Column)) +
  geom_col(position = "dodge") +
  labs(x = "day_of_week", y = "Number_of_rides", fill = "") +
  scale_fill_manual(values = c("#C40000", "#29C6D7")) +
  theme_classic()+
  scale_y_continuous(labels = scales::comma)

Created the clustered column char using the ggplot2 library to visualize day_of_week vs annual average duration of rides.

Reshaped the data using the gather() function from the tidyr library to convert the wide format to a long format.

all_trips_v3 <- gather(all_trips, key = "Column", value = "Average_duration", Annual_average_duration_casual, Annual_average_duration_member)

Created clustered column chart using the ggplot() function and day_of_week as x-axis, Average_duration(in sec) as y-axis, fill color, and other chart labels.

ggplot(all_trips_v3, aes(x = day_of_week, y = Average_duration, fill = Column)) +
  geom_col(position = "dodge") +
  labs(x = "day_of_week", y = "Average_duration(sec)", fill = "") +
  scale_fill_manual(values = c("#ED7D31", "#799AD5")) +
  theme_classic()+
  scale_y_continuous(labels = scales::comma)

FINDINGS
  1. The number of casual riders are comparatively higher as compared to member riders in a given day of the week.
    However, on weekends, the number of casual riders is high with Saturday having more casual riders than member riders.
  2. The average duration for casual riders is higher compared to that of member riders through out the week.
RECOMENDATIONS
  1. The marketing team should campaign more over the weekends with more emphasis on Saturday for change of membership from casual to member since the number of casual riders is high.
  2. The team should also offer discounted rates for to the casual riders for conversion to member riders with attractive deal such as half the monthly charge for the first month of subscription.
  3. The members should be encouraged to increase their ride duration on daily basis to improve their average duration per day.