Bike-Share Analysis

Introduction

A bike-share company in Chicago features more than 5,800 bicycles and 600 docking stations. I will analyze all the data gathered in 2019 to develop a new marketing strategy to convert customers into subscribers.

Prepare

I will analyze 4 csv files from 2019 that total over 3 million rows. Given this data set is so large, I will use my knowledge of R to combine these csv files together and conduct my analysis.

Dataset downloaded from: https://divvy-tripdata.s3.amazonaws.com/index.html

Loading R packages

library('tidyverse')
library('ggplot2')
library('lubridate')

Read csv files

q1_2019 <- read_csv("Divvy_Trips_2019_Q1.csv")
q2_2019 <- read_csv("Divvy_Trips_2019_Q2.csv")
q3_2019 <- read_csv("Divvy_Trips_2019_Q3.csv")
q4_2019 <- read_csv("Divvy_Trips_2019_Q4.csv")

Combine Data into 1 Data Frame

To keep the data consistent I am Renaming Q2_2019 Columns to match the other data sets.

(q2_2019 <- rename(q2_2019
                   ,ride_id = "01 - Rental Details Rental ID"
                   ,rideable_type = "01 - Rental Details Bike ID" 
                   ,started_at = "01 - Rental Details Local Start Time"  
                   ,ended_at = "01 - Rental Details Local End Time"  
                   ,start_station_name = "03 - Rental Start Station Name" 
                   ,start_station_id = "03 - Rental Start Station ID"
                   ,end_station_name = "02 - Rental End Station Name" 
                   ,end_station_id = "02 - Rental End Station ID"
                   ,member_casual = "User Type"))

## # A tibble: 1,108,163 × 12
##     ride_id started_at          ended_at            rideable_type
##       <dbl> <dttm>              <dttm>                      <dbl>
##  1 22178529 2019-04-01 00:02:22 2019-04-01 00:09:48          6251
##  2 22178530 2019-04-01 00:03:02 2019-04-01 00:20:30          6226
##  3 22178531 2019-04-01 00:11:07 2019-04-01 00:15:19          5649
##  4 22178532 2019-04-01 00:13:01 2019-04-01 00:18:58          4151
##  5 22178533 2019-04-01 00:19:26 2019-04-01 00:36:13          3270
##  6 22178534 2019-04-01 00:19:39 2019-04-01 00:23:56          3123
##  7 22178535 2019-04-01 00:26:33 2019-04-01 00:35:41          6418
##  8 22178536 2019-04-01 00:29:48 2019-04-01 00:36:11          4513
##  9 22178537 2019-04-01 00:32:07 2019-04-01 01:07:44          3280
## 10 22178538 2019-04-01 00:32:19 2019-04-01 01:07:39          5534
## # ℹ 1,108,153 more rows
## # ℹ 8 more variables: `01 - Rental Details Duration In Seconds Uncapped` <dbl>,
## #   start_station_id <dbl>, start_station_name <chr>, end_station_id <dbl>,
## #   end_station_name <chr>, member_casual <chr>, `Member Gender` <chr>,
## #   `05 - Member Details Member Birthday Year` <dbl>

Combine all the csv’s into 1 dataset

all_trips <- bind_rows(q1_2019,q2_2019, q3_2019, q4_2019)

Clean Up Data

After merging my data I found columns where the Data was not collected in every quarter. I used R to remove these fields

all_trips <- all_trips %>%  
  select(-c(birthyear, gender, "01 - Rental Details Duration In Seconds Uncapped", "05 - Member Details Member Birthday Year", "Member Gender", "tripduration"))
View(all_trips)

Calculate Ride length

Subtract the Start time from the end time and put the time difference in a new column called ride length

all_trips$ride_length <- difftime(all_trips$ended_at,all_trips$started_at)

I found 13 records that had a negative ride length. Removed rows with Negative ride length and copied results into a new data frame

all_trips_v2 <- all_trips[!(all_trips$ride_length<0),]

Calculate using R

Calculating total rows and average by type of user.

table(all_trips$member_casual)

## 
##   Customer Subscriber 
##     259586     848577

avg_by_member <- aggregate(ride_length ~ member_casual, data = all_trips_v2, mean)
avg_by_member

##   member_casual   ride_length
## 1      Customer 48.51607 mins
## 2    Subscriber 14.04721 mins

Display Average Duration by Day

all_trips_v2 %>% 
  mutate(weekday = wday(started_at, label = TRUE)) %>% 
  group_by(member_casual, weekday) %>% 
  summarise(number_of_rides = n()
            ,average_duration = mean(ride_length)) %>% 
  arrange(member_casual, weekday)  %>% 
  ggplot(aes(x = weekday, y = average_duration, fill = member_casual)) +
  geom_col(position = "dodge")

Tableau

I combined the 4 csv files together and merged it into one data set in R. I found that Tableau can only create visualizations for up to 1 million rows. I exported a sample of the data set using R into one csv file containing 10% of my original data set and uploaded the data into Tableau.

Conclusions

Memberships make up a majority of the customer base for this company, however non-subscribers ride for a longer time than members. Casual riders make up the majority spent riding.
The peak time for this company is during the Summer, and the service is most used from 3 to 5 PM.
Riders that use this service use it for a consistent length of time throughout the week. There isn’t much fluctuation in the amount of time spent riding between Monday and Friday.

Recommendation

Offer promotional deals during off hours and months to help boost the amount of customers during those periods. Lowering prices during the Winter season and after 8 PM could result in an increase in customers.
Run campaigns to encourage paid subscribers to use the service more often. Fun loyalty programs centered around fitness and exercise can boost the utilization of this service.
Offer membership trials to new customers for a week to give them a chance to explore how often they would use the service.