Business Task

Analyze bike-share trip data to generate ideas for increasing membership in Cyclistic’s program. The task involves compiling the collected data and identifying how riders without a membership differ from riders who hold one. Those differences can help Cyclistic make data-driven decisions about how to increase membership in its program.

Data and Tools Used

Description of Data Used: Data from Chicago’s Divvy bike-share company was downloaded and used as an example for this project. Although data is available for multiple years, only the 12 most recent months as of the start of the study (June 2022 through May 2023) were used. The data was loaded into RStudio, which was used for data manipulation and visualization.

Processing Data

Setting up my environment

Loading necessary packages

library("flexdashboard")
library("tinytex")
library("rmarkdown")
library("tidyverse")
library("skimr")
library("janitor")
library("here")
library("formatR")
library("data.table")
library("ggplot2")

Load the dataframes

may_data <- read_csv(here("202305-divvy-tripdata.csv"))
april_data <- read_csv(here("202304-divvy-tripdata.csv"))
march_data <- read_csv(here("202303-divvy-tripdata.csv"))
february_data <- read_csv(here("202302-divvy-tripdata.csv"))
january_data <- read_csv(here("202301-divvy-tripdata.csv"))
december_data <- read_csv(here("202212-divvy-tripdata.csv"))
november_data <- read_csv(here("202211-divvy-tripdata.csv"))
october_data <- read_csv(here("202210-divvy-tripdata.csv"))
september_data <- read_csv(here("202209-divvy-publictripdata.csv"))
august_data <- read_csv(here("202208-divvy-tripdata.csv"))
july_data <- read_csv(here("202207-divvy-tripdata.csv"))
june_data <- read_csv(here("202206-divvy-tripdata.csv"))

Combine the dataframes

total_rides <-
  rbindlist(
    list(
      may_data,
      april_data,
      march_data,
      february_data,
      january_data,
      december_data,
      november_data,
      october_data,
      september_data,
      august_data,
      july_data,
      june_data
    )
  )
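The twelve read_csv() calls and the rbindlist() step above could also be written as a loop over generated file names. This is a minimal base-R sketch, not the method used in the study: `read_all_months` is a hypothetical helper, base `read.csv()` stands in for `read_csv()`, and the files are assumed to sit in the working directory. The September 2022 file keeps the different name shown in the listing above.

```r
# Build the 12 monthly file names, June 2022 through May 2023.
months <- seq(as.Date("2022-06-01"), as.Date("2023-05-01"), by = "month")
month_tags <- format(months, "%Y%m")
file_names <- paste0(month_tags, "-divvy-tripdata.csv")

# The September 2022 file uses a different name in the listing above.
file_names[month_tags == "202209"] <- "202209-divvy-publictripdata.csv"

# Hypothetical helper: read every file that exists and stack the rows.
# Returns NULL when none of the files are present.
read_all_months <- function(paths) {
  existing <- paths[file.exists(paths)]
  do.call(rbind, lapply(existing, read.csv))
}

total_rides_sketch <- read_all_months(file_names)
```

Generating the names keeps the month range in one place, so shifting the study window means changing two dates rather than twelve lines.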

Then, check for invalid data, missing data, and irregular data

# clean_names() returns a new data frame, so the result must be assigned
total_rides <- clean_names(total_rides)
skim_without_charts(total_rides)

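Make a column that shows the duration of each ride. Note: the units are set to seconds explicitly, because subtracting two timestamps lets R choose the units automatically. Descriptors later in the report will be in minute or hour intervals.

total_rides$ride_length <- difftime(total_rides$ended_at, total_rides$started_at, units = "secs")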
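Whether a ride-duration column is really in seconds depends on how the subtraction is done. The sketch below uses two made-up rides (the timestamps are assumptions for illustration only) to show that plain subtraction may pick minutes, while `difftime()` with `units = "secs"` pins the unit.

```r
# Two illustrative rides: 12.5 minutes and 2 hours long.
started_at <- as.POSIXct(c("2023-05-01 08:00:00", "2023-05-01 09:15:00"), tz = "UTC")
ended_at   <- as.POSIXct(c("2023-05-01 08:12:30", "2023-05-01 11:15:00"), tz = "UTC")

# Plain subtraction lets R pick the unit from the smallest difference.
auto_units <- units(ended_at - started_at)

# Asking for seconds explicitly removes the ambiguity.
ride_length <- as.numeric(difftime(ended_at, started_at, units = "secs"))
```

Here `auto_units` comes out as minutes because the shortest ride is longer than 60 seconds, which is why the explicit form is the safer choice before comparing against a threshold such as 86,400.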

Prepare the rides that have been categorized as “casual” for analysis

casual_riders <- total_rides[total_rides$member_casual == "casual", ]
casual_riders <- casual_riders[casual_riders$ride_length > 0, ]

Find both the average and the median ride lengths for non-member riders

avg_ride_length_casual <- mean(casual_riders$ride_length, na.rm = TRUE)
avg_ride_length_casual_numeric <- as.numeric(avg_ride_length_casual)

median_ride_length_casual <- median(casual_riders$ride_length, na.rm = TRUE)
median_ride_length_casual_numerical <- as.numeric(median_ride_length_casual)

Then, make sure that ride_length is in a proper format for analyzing

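# Convert the difftime column to plain numeric seconds
casual_riders$ride_length <- as.numeric(casual_riders$ride_length, units = "secs")
# Flag values that are missing or infinite (each test needs its own call)
non_numeric <- is.na(casual_riders$ride_length) | is.infinite(casual_riders$ride_length)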

casual_riders$ride_length[non_numeric] <- NA
mean_value_casual <- mean(casual_riders$ride_length, na.rm = TRUE)
casual_riders$ride_length[is.na(casual_riders$ride_length)] <- mean_value_casual
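The flag-then-impute step above can be checked on a small vector. This is a sketch on made-up values, not output from the study; it uses base R's `is.finite()`, which is FALSE for NA, NaN, and infinite values alike, as a compact alternative to combining `is.na()` and `is.infinite()`.

```r
# Toy durations in seconds, with one missing and one infinite value.
ride_length <- c(300, NA, 900, Inf, 1200)

# Mark anything that is not a usable number.
bad <- !is.finite(ride_length)
ride_length[bad] <- NA

# Replace the flagged entries with the mean of the valid ones.
ride_length[is.na(ride_length)] <- mean(ride_length, na.rm = TRUE)
```

The valid values average to 800, so both flagged entries become 800. Mean imputation leaves the mean unchanged but pulls the median and the histogram toward the mean, which is worth keeping in mind if many values are missing.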

Part of the context given in this study is that the bikes are used mostly for leisure and for commuting to work. Keeping that in mind, it is sensible that a time boundary of 24 hours should be used for the analysis. There are 86,400 seconds in one day, and since the ride duration column is quantified in seconds, use 86,400 instead of 24.

filtered_casual_riders <- casual_riders %>% 
  filter(ride_length >= 0 & ride_length <= 86400)

Prepare the rides that have been categorized as “members” for analysis

member_riders <- total_rides[total_rides$member_casual == "member", ]
member_riders <- member_riders[member_riders$ride_length > 0, ]

Find both the average and median ride durations for riders who hold a membership.

avg_ride_length_member <- mean(member_riders$ride_length, na.rm = TRUE)
avg_ride_length_member_numeric <- as.numeric(avg_ride_length_member)

median_ride_length_member <- median(member_riders$ride_length, na.rm = TRUE)
median_ride_length_member_numerical <- as.numeric(median_ride_length_member)

Then, make sure that the ride duration column is in the proper format for analyzing.

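# Convert the difftime column to plain numeric seconds
member_riders$ride_length <- as.numeric(member_riders$ride_length, units = "secs")
# Flag values that are missing or infinite (each test needs its own call)
non_numeric <- is.na(member_riders$ride_length) | is.infinite(member_riders$ride_length)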
member_riders$ride_length[non_numeric] <- NA
mean_value_member <- mean(member_riders$ride_length, na.rm = TRUE)
member_riders$ride_length[is.na(member_riders$ride_length)] <- mean_value_member

Once again, if the bikes are used primarily for leisure and commuting, it is sensible to limit the information we look at to have a length of 24 hours or less.

filtered_member_riders <- member_riders %>% 
  filter(ride_length >= 0 & ride_length <= 86400)
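Because the casual and member preparation steps are the same apart from the rider type, they could be folded into one function. The sketch below is an assumption-laden illustration: `prepare_rides` is a hypothetical helper, the data is made up, and it drops missing durations instead of imputing them as the code above does.

```r
seconds_per_day <- 24 * 60 * 60  # 86400

# Hypothetical helper: keep one rider type, drop non-positive or missing
# durations, and cap rides at one day.
prepare_rides <- function(rides, rider_type, max_seconds = seconds_per_day) {
  keep <- which(
    rides$member_casual == rider_type &
      rides$ride_length > 0 &
      rides$ride_length <= max_seconds
  )
  rides[keep, , drop = FALSE]
}

# Made-up rides for illustration.
toy_rides <- data.frame(
  member_casual = c("casual", "casual", "member", "member", "casual"),
  ride_length   = c(600, -5, 480, 90000, 86400)
)
filtered_casual <- prepare_rides(toy_rides, "casual")
filtered_member <- prepare_rides(toy_rides, "member")
```

Using `which()` means rows where the condition is NA are dropped rather than returned as empty rows.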

Analysis of Data Collected

The initial calculations produced the mean and median ride durations. The table below shows those values for both non-members and members.

##      [,1]          [,2]                     [,3]                    
## [1,] "Member Type" "Mean"                   "Median"                
## [2,] "Casual"      "28 minutes, 12 seconds" "12 minutes, 15 seconds"
## [3,] "Member"      "12 minutes, 28 seconds" "8 minutes, 36 seconds"
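The code that built this table is not shown, so the following is only a sketch of one way to turn durations in seconds into the "minutes, seconds" labels used above. `format_duration` is a hypothetical helper, and the input values are simply the table's entries converted back to seconds, not raw output from the analysis.

```r
# Hypothetical helper: format a duration in seconds as "M minutes, S seconds".
format_duration <- function(seconds) {
  total <- round(seconds)
  paste0(total %/% 60, " minutes, ", total %% 60, " seconds")
}

# Values back-computed from the table above (e.g. 28 * 60 + 12 = 1692).
# In the analysis they would come from mean() and median() of ride_length.
summary_table <- data.frame(
  member_type = c("Casual", "Member"),
  mean        = format_duration(c(1692, 748)),
  median      = format_duration(c(735, 516))
)
```

A data frame with named columns prints with a proper header row, which avoids carrying the column titles as the first row of a character matrix.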

I created histograms to get a better look at how non-members and members used the bikes in terms of ride duration. The first graph is the ride duration for non-members. Each column represents a 10-minute interval, i.e. the first column is the count of rides that were 10 minutes or less, the second column is the count of rides between 10 minutes (exclusive) and 20 minutes (inclusive), etc.

The second histogram represents ride duration for riders who hold a membership. Again, each column represents 10-minute intervals.
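The plotting code is not shown, so the bin rule described above is sketched here with base R's `cut()` on made-up durations. With its default `right = TRUE`, each interval excludes its lower edge and includes its upper edge, which matches the 10-minute columns as described.

```r
# Toy durations in seconds.
ride_length <- c(60, 600, 601, 1200, 1500, 3600)

# Bin edges every 600 seconds (10 minutes) up to one hour.
breaks <- seq(0, 3600, by = 600)

# Intervals are (0,600], (600,1200], ... because right = TRUE by default.
bins <- cut(ride_length, breaks = breaks)
bin_counts <- table(bins)
```

The 600-second ride lands in the first column and the 601-second ride in the second. In ggplot2, `geom_histogram(binwidth = 600, boundary = 0)` should produce the same edges, assuming the duration column is in seconds.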

It was clear that 24 hours was too wide a time period for non-members and members alike. For non-members, most rides were under 1 hour, so I made a second histogram limited to that range. While the histogram’s overall time period is shortened, each column still represents 10 minutes.

I also needed to reflect the time period adjustment for riders who hold a membership. It is worth noting that the histogram shows rides that are less than 4-and-a-half hours.

The distribution of ride durations will be important when discussing my recommendations for increasing membership at Cyclistic.

A final part of my analysis is to find the stations most frequented by users who do not hold a membership. It is important to note that in the original dataset, some starting and ending station names were not given. This query has been adjusted to remove the blanks and only include the data which contains values that correspond to bike stations. The numbers that are returned in the table represent how many times those stations were frequented in the past year.

## 
##            Streeter Dr & Grand Ave  DuSable Lake Shore Dr & Monroe St 
##                              55765                              31611 
##              Michigan Ave & Oak St                    Millennium Park 
##                              24726                              24550 
## DuSable Lake Shore Dr & North Blvd 
##                              23062
## 
##            Streeter Dr & Grand Ave  DuSable Lake Shore Dr & Monroe St 
##                              58031                              28964 
##                    Millennium Park              Michigan Ave & Oak St 
##                              26139                              25855 
## DuSable Lake Shore Dr & North Blvd 
##                              25359
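The query behind these counts is not shown. A minimal sketch of the same idea in base R follows, assuming a station column named `start_station_name` and using made-up values: blank and missing names are removed, the rest are counted, and the five largest counts are kept.

```r
# Made-up station names, including a blank and a missing value.
start_station_name <- c(
  "Streeter Dr & Grand Ave", "", "Millennium Park", NA,
  "Streeter Dr & Grand Ave", "Michigan Ave & Oak St",
  "Millennium Park", "Streeter Dr & Grand Ave"
)

# Keep only rows that name a station.
named <- start_station_name[!is.na(start_station_name) & start_station_name != ""]

# Count visits per station and keep the five most frequent.
top_stations <- head(sort(table(named), decreasing = TRUE), 5)
```

Running the same steps on the end-station column would give the second table, and raising the `5` in `head()` returns a longer list of candidate locations.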

There is a key difference between users who hold a membership and users who do not. As the table of means and medians shows, rides by non-members are considerably longer than rides by members: a mean of 28 minutes, 12 seconds versus 12 minutes, 28 seconds, and a median of 12 minutes, 15 seconds versus 8 minutes, 36 seconds. Since non-member users are charged by the type of pass they use, either a single-ride or a day-long pass, they may still be more likely to feel a time constraint than those who have the year-long membership.

Recommendations for Implementation

Based on the insights gained from the data, I believe membership would increase if Cyclistic ran an advertisement along the lines of “first 5 rides under 20 minutes with a membership are half off”. Expanding on that, I believe those new members would feel freer to use the bikes longer and more often.

One recommendation for implementation would be to place ads around the stations that are frequented most by users who are not members. The stations that are most popular for both starting and finishing rides are as follows: Streeter Dr & Grand Ave, DuSable Lake Shore Dr & Monroe St, Michigan Ave & Oak St, Millennium Park, and DuSable Lake Shore Dr & North Blvd. If Cyclistic would like to post advertisements at more locations, it would be easy for the analyst to find more than just the 5 stations listed. It is also worth noting that while the top 5 most frequented starting stations are the same as the top 5 most frequented ending stations, that may change if the number of stations is increased beyond 5.

Another recommendation would be to present the same or a similar offer directly in Cyclistic’s mobile app. That way, current users who do not hold a membership and do not frequent the 5 most used stations will still be notified about the special offer no matter what station they use. By using this blanket approach, Cyclistic has an even higher chance of getting non-member users to sign up for a membership.

Citations of sources used

Certification company: Google Data Analytics Certification. (n.d.). Coursera. Retrieved July 9, 2023, from https://www.coursera.org/professional-certificates/google-data-analytics

Data retrieved: Divvy Trip Data. (n.d.). Retrieved July 9, 2023, from https://divvy-tripdata.s3.amazonaws.com/index.html

Analysis tool used: RStudio Team (2020). RStudio: Integrated Development for R. RStudio, PBC, Boston, MA. URL http://www.rstudio.com/