Introduction

A bike-share company in Chicago features more than 5,800 bicycles and 600 docking stations. I will analyze all the data gathered in 2019 to develop a new marketing strategy to convert customers into subscribers.

Prepare

I will analyze 4 csv files from 2019 that total over 3 million rows. Given this data set is so large, I will use my knowledge of R to combine these csv files together and conduct my analysis.

Dataset downloaded from: https://divvy-tripdata.s3.amazonaws.com/index.html

Loading R packages

library('tidyverse')
library('ggplot2')
library('lubridate')

Read csv files

q1_2019 <- read_csv("Divvy_Trips_2019_Q1.csv")
q2_2019 <- read_csv("Divvy_Trips_2019_Q2.csv")
q3_2019 <- read_csv("Divvy_Trips_2019_Q3.csv")
q4_2019 <- read_csv("Divvy_Trips_2019_Q4.csv")

Combine Data into 1 Data Frame

To keep the data consistent I am Renaming Q2_2019 Columns to match the other data sets.

(q2_2019 <- rename(q2_2019
                   ,ride_id = "01 - Rental Details Rental ID"
                   ,rideable_type = "01 - Rental Details Bike ID" 
                   ,started_at = "01 - Rental Details Local Start Time"  
                   ,ended_at = "01 - Rental Details Local End Time"  
                   ,start_station_name = "03 - Rental Start Station Name" 
                   ,start_station_id = "03 - Rental Start Station ID"
                   ,end_station_name = "02 - Rental End Station Name" 
                   ,end_station_id = "02 - Rental End Station ID"
                   ,member_casual = "User Type"))
## # A tibble: 1,108,163 Ă— 12
##     ride_id started_at          ended_at            rideable_type
##       <dbl> <dttm>              <dttm>                      <dbl>
##  1 22178529 2019-04-01 00:02:22 2019-04-01 00:09:48          6251
##  2 22178530 2019-04-01 00:03:02 2019-04-01 00:20:30          6226
##  3 22178531 2019-04-01 00:11:07 2019-04-01 00:15:19          5649
##  4 22178532 2019-04-01 00:13:01 2019-04-01 00:18:58          4151
##  5 22178533 2019-04-01 00:19:26 2019-04-01 00:36:13          3270
##  6 22178534 2019-04-01 00:19:39 2019-04-01 00:23:56          3123
##  7 22178535 2019-04-01 00:26:33 2019-04-01 00:35:41          6418
##  8 22178536 2019-04-01 00:29:48 2019-04-01 00:36:11          4513
##  9 22178537 2019-04-01 00:32:07 2019-04-01 01:07:44          3280
## 10 22178538 2019-04-01 00:32:19 2019-04-01 01:07:39          5534
## # ℹ 1,108,153 more rows
## # ℹ 8 more variables: `01 - Rental Details Duration In Seconds Uncapped` <dbl>,
## #   start_station_id <dbl>, start_station_name <chr>, end_station_id <dbl>,
## #   end_station_name <chr>, member_casual <chr>, `Member Gender` <chr>,
## #   `05 - Member Details Member Birthday Year` <dbl>

Combine all the csv’s into 1 dataset

all_trips <- bind_rows(q1_2019,q2_2019, q3_2019, q4_2019)

Clean Up Data

After merging my data I found columns where the Data was not collected in every quarter. I used R to remove these fields

all_trips <- all_trips %>%  
  select(-c(birthyear, gender, "01 - Rental Details Duration In Seconds Uncapped", "05 - Member Details Member Birthday Year", "Member Gender", "tripduration"))
View(all_trips)

Calculate Ride length

Subtract the Start time from the end time and put the time difference in a new column called ride length

all_trips$ride_length <- difftime(all_trips$ended_at,all_trips$started_at)

I found 13 records that had a negative ride length. Removed rows with Negative ride length and copied results into a new data frame

all_trips_v2 <- all_trips[!(all_trips$ride_length<0),]

Calculate using R

Calculating total rows and average by type of user.

table(all_trips$member_casual)
## 
##   Customer Subscriber 
##     259586     848577
avg_by_member <- aggregate(ride_length ~ member_casual, data = all_trips_v2, mean)
avg_by_member
##   member_casual   ride_length
## 1      Customer 48.51607 mins
## 2    Subscriber 14.04721 mins

Display Average Duration by Day

all_trips_v2 %>% 
  mutate(weekday = wday(started_at, label = TRUE)) %>% 
  group_by(member_casual, weekday) %>% 
  summarise(number_of_rides = n()
            ,average_duration = mean(ride_length)) %>% 
  arrange(member_casual, weekday)  %>% 
  ggplot(aes(x = weekday, y = average_duration, fill = member_casual)) +
  geom_col(position = "dodge")

Tableau

I combined the 4 csv files together and merged it into one data set in R. I found that Tableau can only create visualizations for up to 1 million rows. I exported a sample of the data set using R into one csv file containing 10% of my original data set and uploaded the data into Tableau.

Conclusions

Recommendation