Cyclistic Bike-Share Analysis

Introduction

The Cyclistic Bike Share Case Study is my capstone project for the Google Data Analytics Certificate. For the analysis I use the R programming language and RStudio. In this project, I follow the data analysis process I learned in the course (ask, prepare, process, analyze, share, and act) to analyze the data.

Scenario

You are a junior data analyst working on the marketing analyst team at Cyclistic, a bike-share company in Chicago. The director of marketing believes the company’s future success depends on maximizing the number of annual memberships. Therefore, your team wants to understand how casual riders and annual members use Cyclistic bikes differently. From these insights, your team will design a new marketing strategy to convert casual riders into annual members. But first, Cyclistic executives must approve your recommendations, so they must be backed up with compelling data insights and professional data visualizations.

Ask

Lily Moreno, the director of marketing and your manager, has assigned you the following question to answer: How do annual members and casual riders use Cyclistic bikes differently?

Business Objective

The business objective of the case study is to identify opportunities for marketing campaigns that convert casual riders into annual members. This will be done by analyzing bike trip data to understand user behavior and preferences. The ultimate goal is to increase profit.

Business Task

- Analyze Cyclistic’s historical bike trip data.
- Identify trends and patterns among casual riders and members.
- Understand the behavior and preferences of these two user groups.
- Identify ways to convert casual riders into annual members.
- Use insights from the analysis to inform the development of a new marketing strategy for the company.
- Goal: increase profitability and drive future growth.

Stakeholders

Lily Moreno: Director of Marketing at Cyclistic, who is responsible for implementing the marketing campaigns at Cyclistic.

Cyclistic’s marketing team: They will be responsible for conducting the analysis and developing the marketing strategy based on the insights gained.

Cyclistic’s investors and shareholders: They have a financial interest in the company’s success and may be interested in the results of the analysis and any changes to the marketing strategy.

Prepare

Data Source

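The data for this analysis can be accessed through the provided link. It covers historical trip data from 2019 and 2020 for Cyclistic, a fictional bike-share company based in Chicago. The analysis below uses the first-quarter files for each year, Divvy_Trips_2019_Q1.csv and Divvy_Trips_2020_Q1.csv.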

Credibility of Data

Motivate, Inc. collected the data for this analysis directly through its management of the Cyclistic Bike Share program for the City of Chicago. The data is comprehensive and consistent because it includes information on all rides taken by users, not just a sample. It is released on a monthly basis and made available to the public by the City of Chicago.

Process

Here, we clean the data and check its integrity, making sure it is complete, correct, and relevant.

Install Required Packages

tidyverse
conflicted
ggplot2

library(tidyverse)
library(conflicted)
library(ggplot2)

Download and store data

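# Store the 2019 Q1 trips in a data frame; the outer parentheses also print the result
(Divvy_Trips_2019_Q1 <- read_csv("Divvy_Trips_2019_Q1.csv"))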

## Rows: 365069 Columns: 12
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (6): start_time, end_time, from_station_name, to_station_name, usertype,...
## dbl (5): trip_id, bikeid, from_station_id, to_station_id, birthyear
## num (1): tripduration
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

## # A tibble: 365,069 × 12
##     trip_id start_time         end_time      bikeid tripduration from_station_id
##       <dbl> <chr>              <chr>          <dbl>        <dbl>           <dbl>
##  1 21742443 2019-01-01 0:04:37 2019-01-01 0…   2167          390             199
##  2 21742444 2019-01-01 0:08:13 2019-01-01 0…   4386          441              44
##  3 21742445 2019-01-01 0:13:23 2019-01-01 0…   1524          829              15
##  4 21742446 2019-01-01 0:13:45 2019-01-01 0…    252         1783             123
##  5 21742447 2019-01-01 0:14:52 2019-01-01 0…   1170          364             173
##  6 21742448 2019-01-01 0:15:33 2019-01-01 0…   2437          216              98
##  7 21742449 2019-01-01 0:16:06 2019-01-01 0…   2708          177              98
##  8 21742450 2019-01-01 0:18:41 2019-01-01 0…   2796          100             211
##  9 21742451 2019-01-01 0:18:43 2019-01-01 0…   6205         1727             150
## 10 21742452 2019-01-01 0:19:18 2019-01-01 0…   3939          336             268
## # ℹ 365,059 more rows
## # ℹ 6 more variables: from_station_name <chr>, to_station_id <dbl>,
## #   to_station_name <chr>, usertype <chr>, gender <chr>, birthyear <dbl>

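# Store the 2020 Q1 trips in a data frame; the outer parentheses also print the result
(Divvy_Trips_2020_Q1 <- read_csv("Divvy_Trips_2020_Q1.csv"))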

## Rows: 426887 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (7): ride_id, rideable_type, started_at, ended_at, start_station_name, e...
## dbl (6): start_station_id, end_station_id, start_lat, start_lng, end_lat, en...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

## # A tibble: 426,887 × 13
##    ride_id rideable_type started_at ended_at start_station_name start_station_id
##    <chr>   <chr>         <chr>      <chr>    <chr>                         <dbl>
##  1 EACB19… docked_bike   2020-01-2… 2020-01… Western Ave & Lel…              239
##  2 8FED87… docked_bike   2020-01-3… 2020-01… Clark St & Montro…              234
##  3 789F3C… docked_bike   2020-01-0… 2020-01… Broadway & Belmon…              296
##  4 C9A388… docked_bike   2020-01-0… 2020-01… Clark St & Randol…               51
##  5 943BC3… docked_bike   2020-01-3… 2020-01… Clinton St & Lake…               66
##  6 6D9C8A… docked_bike   2020-01-1… 2020-01… Wells St & Hubbar…              212
##  7 31EB9B… docked_bike   2020-01-1… 2020-01… Desplaines St & R…               96
##  8 A2B24E… docked_bike   2020-01-1… 2020-01… Desplaines St & R…               96
##  9 5E3F01… docked_bike   2020-01-3… 2020-01… Wells St & Hubbar…              212
## 10 19DC57… docked_bike   2020-01-3… 2020-01… Clark St & Lake St               38
## # ℹ 426,877 more rows
## # ℹ 7 more variables: end_station_name <chr>, end_station_id <dbl>,
## #   start_lat <dbl>, start_lng <dbl>, end_lat <dbl>, end_lng <dbl>,
## #   member_casual <chr>

Data Wrangling and combining

Rename the 2019 columns to make them consistent with the 2020 data

 (Divvy_Trips_2019_Q1 <- rename(Divvy_Trips_2019_Q1
                    ,ride_id = trip_id
                    ,rideable_type = bikeid
                    ,started_at = start_time
                    ,ended_at = end_time
                    ,start_station_name = from_station_name
                    ,start_station_id = from_station_id
                    ,end_station_name = to_station_name
                    ,end_station_id = to_station_id
                    ,member_casual = usertype
 ))

## # A tibble: 365,069 × 12
##     ride_id started_at      ended_at rideable_type tripduration start_station_id
##       <dbl> <chr>           <chr>            <dbl>        <dbl>            <dbl>
##  1 21742443 2019-01-01 0:0… 2019-01…          2167          390              199
##  2 21742444 2019-01-01 0:0… 2019-01…          4386          441               44
##  3 21742445 2019-01-01 0:1… 2019-01…          1524          829               15
##  4 21742446 2019-01-01 0:1… 2019-01…           252         1783              123
##  5 21742447 2019-01-01 0:1… 2019-01…          1170          364              173
##  6 21742448 2019-01-01 0:1… 2019-01…          2437          216               98
##  7 21742449 2019-01-01 0:1… 2019-01…          2708          177               98
##  8 21742450 2019-01-01 0:1… 2019-01…          2796          100              211
##  9 21742451 2019-01-01 0:1… 2019-01…          6205         1727              150
## 10 21742452 2019-01-01 0:1… 2019-01…          3939          336              268
## # ℹ 365,059 more rows
## # ℹ 6 more variables: start_station_name <chr>, end_station_id <dbl>,
## #   end_station_name <chr>, member_casual <chr>, gender <chr>, birthyear <dbl>
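A quick way to confirm that a rename has lined the two files up is to compare their column names. The following is a minimal base R sketch on toy data frames; `trips_2019`, `trips_2020`, and `name_map` are illustrative names with made-up values, and only three of the renamed columns are shown.

```r
# Toy stand-ins for the two quarterly files (values are made up).
trips_2019 <- data.frame(trip_id = 1, start_time = "2019-01-01 0:04:37",
                         usertype = "Subscriber")
trips_2020 <- data.frame(ride_id = "A1", started_at = "2020-01-01 08:00:00",
                         member_casual = "member")

# Mapping from the new (2020) names to the old (2019) names.
name_map <- c(ride_id = "trip_id", started_at = "start_time",
              member_casual = "usertype")

# Rename in base R: replace each old name with its new counterpart.
names(trips_2019)[match(name_map, names(trips_2019))] <- names(name_map)

# Columns present in one frame but not the other; both should be empty here.
only_2019 <- setdiff(names(trips_2019), names(trips_2020))
only_2020 <- setdiff(names(trips_2020), names(trips_2019))
print(only_2019)
print(only_2020)
```

On the real files, `only_2019` and `only_2020` would list the columns that exist in a single year, which are the candidates for removal before analysis.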

Convert ride_id and rideable_type to character so the two data frames stack correctly

Divvy_Trips_2019_Q1 <-  mutate(Divvy_Trips_2019_Q1, ride_id = as.character(ride_id)
                    ,rideable_type = as.character(rideable_type))

Make one data frame with all the data and remove columns that aren't consistent across the years

# Stack the 2019 and 2020 quarters into a single data frame
all_trips <- bind_rows(Divvy_Trips_2019_Q1, Divvy_Trips_2020_Q1)

# Drop the columns that exist in only one of the two years
all_trips <- all_trips %>%  
     select(-c(start_lat, start_lng, end_lat, end_lng, birthyear, gender, tripduration))
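The type conversion matters because `bind_rows()` refuses to combine a double column with a character column of the same name. A small sketch on toy tibbles follows, assuming dplyr is installed; the values are made up, and using `any_of()` to ignore absent column names is a choice made here rather than something taken from the script.

```r
library(dplyr)

# Toy data: ride_id is numeric in 2019 and character in 2020 (values are made up).
trips_2019 <- tibble(ride_id = c(101, 102), birthyear = c(1989, 1990))
trips_2020 <- tibble(ride_id = c("A1", "B2"), start_lat = c(41.9, 41.8))

# Stacking without conversion fails: double and character cannot be combined.
stack_error <- tryCatch(bind_rows(trips_2019, trips_2020),
                        error = function(e) conditionMessage(e))

# Convert first, then stack; columns missing from one frame are filled with NA.
trips_2019 <- mutate(trips_2019, ride_id = as.character(ride_id))
all_trips <- bind_rows(trips_2019, trips_2020)

# Drop columns that exist in only one year; any_of() skips names that are absent.
all_trips <- select(all_trips, -any_of(c("birthyear", "start_lat", "gender")))
print(all_trips)
```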

Clean and add data to prepare for analysis

What needs to be cleaned?

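The member_casual column uses two labels for each rider type: "Subscriber" and "member" for annual members, and "Customer" and "casual" for casual riders. To keep a single label per type, I recoded "Subscriber" to "member" and "Customer" to "casual".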

all_trips <-  all_trips %>% 
  mutate(member_casual = recode(member_casual
                                ,"Subscriber" = "member"
                                ,"Customer" = "casual"))
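Tabulating the labels before and after recoding is a cheap way to confirm that only two rider types remain. This is a base R sketch of the same idea on a made-up vector; `lookup` and `recoded` are illustrative names, and the lookup-table approach stands in for `recode()` so the example runs without extra packages.

```r
# Toy labels mixing the 2019 and 2020 naming schemes (values are made up).
member_casual <- c("Subscriber", "Customer", "member", "casual", "Subscriber")

# Count each label before recoding: four distinct values.
print(table(member_casual))

# Lookup from old labels to new ones; labels not in the lookup are kept as-is.
lookup <- c(Subscriber = "member", Customer = "casual")
recoded <- ifelse(member_casual %in% names(lookup),
                  lookup[member_casual], member_casual)
recoded <- unname(recoded)

# After recoding, only "casual" and "member" should remain.
print(table(recoded))
```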

Remove rides that started at the "HQ QR" station, where Divvy took bikes out of circulation for quality checks, and rides with a negative ride_length, since neither represents a trip by a rider.

# Ride length in seconds, used here to find rides that end before they start
all_trips$ride_length <- as.numeric(difftime(all_trips$ended_at, all_trips$started_at, units = "secs"))
all_trips <- all_trips %>% dplyr::filter(start_station_name != "HQ QR", ride_length >= 0)
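To see what this filter removes, here is a minimal base R sketch on three made-up trips: one ordinary ride, one from the "HQ QR" station, and one whose end time precedes its start. The `trips`, `keep`, and `cleaned` names are illustrative. One caveat for the real data: `dplyr::filter()` also drops rows where the condition evaluates to `NA`, for example when a station name is missing.

```r
# Toy trips (values are made up).
trips <- data.frame(
  start_station_name = c("Clark St & Lake St", "HQ QR", "Clark St & Lake St"),
  started_at = c("2020-01-01 08:00:00", "2020-01-01 09:00:00", "2020-01-01 10:00:00"),
  ended_at   = c("2020-01-01 08:10:00", "2020-01-01 09:05:00", "2020-01-01 09:59:00")
)

# Parse the timestamps explicitly and compute the ride length in seconds.
start <- as.POSIXct(trips$started_at, tz = "UTC")
end <- as.POSIXct(trips$ended_at, tz = "UTC")
trips$ride_length <- as.numeric(difftime(end, start, units = "secs"))

# Keep rides that did not start at "HQ QR" and have a non-negative length.
keep <- trips$start_station_name != "HQ QR" & trips$ride_length >= 0
cleaned <- trips[keep, ]

# Report how many rows the cleaning step removed.
removed <- nrow(trips) - nrow(cleaned)
print(removed)
```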

Add columns that list the date, month, day, and year of each ride

all_trips$date <- as.Date(all_trips$started_at) 
all_trips$month <- format(as.Date(all_trips$date), "%m")
all_trips$day <- format(as.Date(all_trips$date), "%d")
all_trips$year <- format(as.Date(all_trips$date), "%Y")
all_trips$day_of_week <- format(as.Date(all_trips$date), "%A")
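The same `format()` calls can be tried on a couple of timestamps to see what each new column holds. This sketch uses two example strings (the second one is made up); note that the weekday name produced by `"%A"` depends on the session locale, so it is printed rather than assumed.

```r
# Two example timestamps in the same text format as the trip data.
started_at <- c("2019-01-01 0:04:37", "2020-01-21 20:06:59")

# as.Date() keeps the calendar date and drops the time of day.
ride_date <- as.Date(started_at)

# Each part is returned as a character string.
ride_month <- format(ride_date, "%m")
ride_day <- format(ride_date, "%d")
ride_year <- format(ride_date, "%Y")
ride_weekday <- format(ride_date, "%A")  # locale-dependent weekday name

print(data.frame(ride_date, ride_month, ride_day, ride_year, ride_weekday))
```

Because the parts are strings, `ride_month` sorts as "01", "02", and so on, which is convenient for grouping but needs `as.integer()` before any arithmetic.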

Add a ride_length column in seconds and convert it from difftime to numeric

all_trips$ride_length <- difftime(all_trips$ended_at, all_trips$started_at, units = "secs")

all_trips$ride_length <- as.numeric(all_trips$ride_length)
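When `difftime()` is called without a `units` argument, it picks the unit from the data, so the numbers that come out of `as.numeric()` can change meaning between data sets. A short sketch with two made-up timestamps 90 seconds apart illustrates the difference:

```r
# Two made-up timestamps, 90 seconds apart.
start <- as.POSIXct("2020-01-01 08:00:00", tz = "UTC")
end <- as.POSIXct("2020-01-01 08:01:30", tz = "UTC")

# Automatic units: chosen from the size of the smallest difference.
auto_diff <- difftime(end, start)

# Explicit units: always seconds, regardless of the data.
secs_diff <- difftime(end, start, units = "secs")

print(units(auto_diff))        # "mins"
print(as.numeric(auto_diff))   # 1.5
print(as.numeric(secs_diff))   # 90
```

Stating the unit explicitly keeps the summary statistics interpretable even if the shortest ride in a future data set is longer than a minute.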

Analysis

I used summary() to find the minimum, maximum, median, and mean of the ride lengths (in seconds)

summary(all_trips$ride_length)

##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##        1      331      539     1189      912 10628422

Order the days of the week and compare the casual riders and members

all_trips$day_of_week <- ordered(all_trips$day_of_week, levels=c("Sunday", "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday"))
aggregate(all_trips$ride_length ~ all_trips$member_casual, FUN = mean)

##   all_trips$member_casual all_trips$ride_length
## 1                  casual             5370.7685
## 2                  member              795.1324

aggregate(all_trips$ride_length ~ all_trips$member_casual, FUN = median)

##   all_trips$member_casual all_trips$ride_length
## 1                  casual                  1393
## 2                  member                   508

aggregate(all_trips$ride_length ~ all_trips$member_casual, FUN = max)

##   all_trips$member_casual all_trips$ride_length
## 1                  casual              10628422
## 2                  member               6096428

aggregate(all_trips$ride_length ~ all_trips$member_casual, FUN = min)

##   all_trips$member_casual all_trips$ride_length
## 1                  casual                     2
## 2                  member                     1

The average ride time by each day for members vs casual users

aggregate(all_trips$ride_length ~ all_trips$member_casual + all_trips$day_of_week, FUN = mean)

##    all_trips$member_casual all_trips$day_of_week all_trips$ride_length
## 1                   casual                Sunday             5059.1813
## 2                   member                Sunday              972.1609
## 3                   casual                Monday             4750.7627
## 4                   member                Monday              822.2134
## 5                   casual               Tuesday             4560.8190
## 6                   member               Tuesday              769.4416
## 7                   casual             Wednesday             4480.3724
## 8                   member             Wednesday              711.9838
## 9                   casual              Thursday             8450.1557
## 10                  member              Thursday              707.2093
## 11                  casual                Friday             6088.9402
## 12                  member                Friday              796.6713
## 13                  casual              Saturday             4946.4956
## 14                  member              Saturday              973.7094

The relationship between rides and days of the week

all_trips %>% 
  mutate(weekday = wday(started_at, label = TRUE)) %>%  
  group_by(member_casual, weekday) %>%  
  summarise(number_of_rides = n()                            
            ,average_duration = mean(ride_length)) %>%      
  arrange(member_casual, weekday)

## `summarise()` has grouped output by 'member_casual'. You can override using the
## `.groups` argument.

## # A tibble: 14 × 4
## # Groups:   member_casual [2]
##    member_casual weekday number_of_rides average_duration
##    <chr>         <ord>             <int>            <dbl>
##  1 casual        Sun               18652            5059.
##  2 casual        Mon                5591            4751.
##  3 casual        Tue                7311            4561.
##  4 casual        Wed                7690            4480.
##  5 casual        Thu                7147            8450.
##  6 casual        Fri                8013            6089.
##  7 casual        Sat               13473            4946.
##  8 member        Sun               60197             972.
##  9 member        Mon              110430             822.
## 10 member        Tue              127974             769.
## 11 member        Wed              121902             712.
## 12 member        Thu              125228             707.
## 13 member        Fri              115168             797.
## 14 member        Sat               59413             974.

Visualizations

## `summarise()` has grouped output by 'member_casual'. You can override using the
## `.groups` argument.

## `summarise()` has grouped output by 'member_casual'. You can override using the
## `.groups` argument.

Act

Key Takeaways

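- Annual members primarily use the bike-sharing service for commuting, while casual riders tend to use it for leisure, especially on weekends.
- Annual members use the service more consistently across the week: their ride counts stay between about 110,000 and 128,000 on each weekday, while casual rides are concentrated on Saturday and Sunday.
- Casual riders take much longer rides. Their mean ride length is about 6.8 times the members' mean (5,371 vs. 795 seconds), and their median is about 2.7 times as long (1,393 vs. 508 seconds).
- Casual ridership is highest on weekends (18,652 rides on Sunday and 13,473 on Saturday), whereas member ridership peaks midweek (127,974 rides on Tuesday).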

Recommendations

Offer discounts or incentives for casual riders during weekdays to encourage more usage during the week.
Introduce a rewards system for casual riders, making it easier to convert to a membership.
Provide an eco-friendly option for riders by offering discounts to businesses. This could motivate more casual riders to bike to work and save money on gas and parking.

Conclusion

This analysis provides valuable insights into the preferences and behaviors of members and casual riders. By tailoring its strategies to the differences identified here, Cyclistic can more effectively convert casual riders into annual members.