Introduction

The success of Chicago-based bike-sharing start-up Cyclistic, according to its marketing director Lily Moreno, rests on increasing the number of annual subscriptions. The team is interested in seeing how annual members and casual riders utilize Cyclistic bikes. The company plan a new marketing drive to turn casual riders into annual members based on these insights. This report will give the findings in the following format to help readers understand user trends: * Trips completed by user type, * Total distance travelled (kilometres) by user type * Hours cycled by type of user * User preference for bikes * completed rides by month and user type * Number of rides taken each day, broken down per user type * Top 5 user types for start station and end station

Packages installation

We need to install some packages to be able to use their functions

Load the Libraries

This project will use the following libraries

library(readr)
library(geosphere)
library(ggplot2)
library(scales)
## 
## Attaching package: 'scales'
## The following object is masked from 'package:readr':
## 
##     col_factor
library(lubridate)
## 
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
## 
##     date, intersect, setdiff, union
library(tidyverse)
## ── Attaching packages
## ───────────────────────────────────────
## tidyverse 1.3.2 ──
## ✔ tibble  3.1.8      ✔ dplyr   1.0.10
## ✔ tidyr   1.2.1      ✔ stringr 1.4.1 
## ✔ purrr   0.3.4      ✔ forcats 0.5.2 
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ lubridate::as.difftime() masks base::as.difftime()
## ✖ scales::col_factor()     masks readr::col_factor()
## ✖ lubridate::date()        masks base::date()
## ✖ purrr::discard()         masks scales::discard()
## ✖ dplyr::filter()          masks stats::filter()
## ✖ lubridate::intersect()   masks base::intersect()
## ✖ dplyr::lag()             masks stats::lag()
## ✖ lubridate::setdiff()     masks base::setdiff()
## ✖ lubridate::union()       masks base::union()
library(dplyr)
library(DT)
library(shiny)
## 
## Attaching package: 'shiny'
## 
## The following objects are masked from 'package:DT':
## 
##     dataTableOutput, renderDataTable
## 
## The following object is masked from 'package:geosphere':
## 
##     span
library(dygraphs)
library(ggiraph)
library(tmaptools)

Set Language

Sys.setenv(LANG = "en")

Dataset

To evaluate and spot trends, the team examined Cyclistic’s 12-month historical trip data (from October 2021 to September 2022). The dataset can be downloaded from here, the datasets were taken.

Motivate International Inc. has made this public data accessible under the terms of this license.

Data Import

The 12-month historical trip data (in CSV format) are downloaded, extracted and stored in one folder titled trip_data. The next task is to import the files in to R environment and merge all the files into a dataframe that will be use. The dataframe contain all the 12 months data and we can compare them.

merged <- 
  list.files(path="trip_data", full.names = TRUE) %>% 
  lapply(read_csv) %>% 
  bind_rows
## Rows: 631226 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl  (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
## Rows: 359978 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl  (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
## Rows: 247540 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl  (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
## Rows: 103770 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl  (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
## Rows: 115609 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl  (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
## Rows: 284042 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl  (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
## Rows: 371249 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl  (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
## Rows: 634858 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl  (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
## Rows: 769204 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl  (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
## Rows: 823488 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl  (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
## Rows: 785932 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl  (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
## Rows: 701339 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl  (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

To view the columns of the imported data:

head(merged)
## # A tibble: 6 × 13
##   ride_id        ridea…¹ started_at          ended_at            start…² start…³
##   <chr>          <chr>   <dttm>              <dttm>              <chr>   <chr>  
## 1 620BC6107255B… electr… 2021-10-22 12:46:42 2021-10-22 12:49:50 Kingsb… KA1503…
## 2 4471C70731AB2… electr… 2021-10-21 09:12:37 2021-10-21 09:14:14 <NA>    <NA>   
## 3 26CA69D43D15E… electr… 2021-10-16 16:28:39 2021-10-16 16:36:26 <NA>    <NA>   
## 4 362947F0437E1… electr… 2021-10-16 16:17:48 2021-10-16 16:19:03 <NA>    <NA>   
## 5 BB731DE2F2EC5… electr… 2021-10-20 23:17:54 2021-10-20 23:26:10 <NA>    <NA>   
## 6 7176307BBC097… electr… 2021-10-21 16:57:37 2021-10-21 17:11:58 <NA>    <NA>   
## # … with 7 more variables: end_station_name <chr>, end_station_id <chr>,
## #   start_lat <dbl>, start_lng <dbl>, end_lat <dbl>, end_lng <dbl>,
## #   member_casual <chr>, and abbreviated variable names ¹​rideable_type,
## #   ²​start_station_name, ³​start_station_id

To inspect the data

str(merged)
## spec_tbl_df [5,828,235 × 13] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ ride_id           : chr [1:5828235] "620BC6107255BF4C" "4471C70731AB2E45" "26CA69D43D15EE14" "362947F0437E1514" ...
##  $ rideable_type     : chr [1:5828235] "electric_bike" "electric_bike" "electric_bike" "electric_bike" ...
##  $ started_at        : POSIXct[1:5828235], format: "2021-10-22 12:46:42" "2021-10-21 09:12:37" ...
##  $ ended_at          : POSIXct[1:5828235], format: "2021-10-22 12:49:50" "2021-10-21 09:14:14" ...
##  $ start_station_name: chr [1:5828235] "Kingsbury St & Kinzie St" NA NA NA ...
##  $ start_station_id  : chr [1:5828235] "KA1503000043" NA NA NA ...
##  $ end_station_name  : chr [1:5828235] NA NA NA NA ...
##  $ end_station_id    : chr [1:5828235] NA NA NA NA ...
##  $ start_lat         : num [1:5828235] 41.9 41.9 41.9 41.9 41.9 ...
##  $ start_lng         : num [1:5828235] -87.6 -87.7 -87.7 -87.7 -87.7 ...
##  $ end_lat           : num [1:5828235] 41.9 41.9 41.9 41.9 41.9 ...
##  $ end_lng           : num [1:5828235] -87.6 -87.7 -87.7 -87.7 -87.7 ...
##  $ member_casual     : chr [1:5828235] "member" "member" "member" "member" ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   ride_id = col_character(),
##   ..   rideable_type = col_character(),
##   ..   started_at = col_datetime(format = ""),
##   ..   ended_at = col_datetime(format = ""),
##   ..   start_station_name = col_character(),
##   ..   start_station_id = col_character(),
##   ..   end_station_name = col_character(),
##   ..   end_station_id = col_character(),
##   ..   start_lat = col_double(),
##   ..   start_lng = col_double(),
##   ..   end_lat = col_double(),
##   ..   end_lng = col_double(),
##   ..   member_casual = col_character()
##   .. )
##  - attr(*, "problems")=<externalptr>
glimpse(merged)
## Rows: 5,828,235
## Columns: 13
## $ ride_id            <chr> "620BC6107255BF4C", "4471C70731AB2E45", "26CA69D43D…
## $ rideable_type      <chr> "electric_bike", "electric_bike", "electric_bike", …
## $ started_at         <dttm> 2021-10-22 12:46:42, 2021-10-21 09:12:37, 2021-10-…
## $ ended_at           <dttm> 2021-10-22 12:49:50, 2021-10-21 09:14:14, 2021-10-…
## $ start_station_name <chr> "Kingsbury St & Kinzie St", NA, NA, NA, NA, NA, NA,…
## $ start_station_id   <chr> "KA1503000043", NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ end_station_name   <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ end_station_id     <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ start_lat          <dbl> 41.88919, 41.93000, 41.92000, 41.92000, 41.89000, 4…
## $ start_lng          <dbl> -87.63850, -87.70000, -87.70000, -87.69000, -87.710…
## $ end_lat            <dbl> 41.89000, 41.93000, 41.94000, 41.92000, 41.89000, 4…
## $ end_lng            <dbl> -87.63000, -87.71000, -87.72000, -87.69000, -87.690…
## $ member_casual      <chr> "member", "member", "member", "member", "member", "…

Data Manipulation

Four new columns were added to compute the following: Ride length (in minutes) of each trip (the new column is labelled as ride_length) Distance travelled in kilometers, The day the trip was taken, this is the day of the week of the trip. The month of the trip The new dataframe is labelled as merged_new. This new dataframe is used for subsequent data analysis.

merged_new <- mutate(merged, ride_length = difftime(ended_at, started_at, units = "mins")) %>%
  mutate(merged, distance_km = distHaversine(cbind(start_lng, start_lat), cbind(end_lng, end_lat))*0.001)

To add day of the week to the table

merged_new <- mutate(merged_new, day_of_trip = wday(started_at, label = TRUE))

Extract Month and add it to the columns

merged_new <- mutate(merged_new, month_of_trip = month.abb[month(started_at)])

Rename the month of the trip

This will let the month truly represent the month and year of the trip

Number of trips completed by user type

The annual members (referred to as members in the chart) made 615,960 more trips than the casual riders.

ggplot(merged_new, aes(x=member_casual)) +
  geom_bar(fill = "#008000") +
  labs(
    title = "Number of rides completed by user type",
    subtitle = "For the period between October 2021 and September ending 2022",
    x = "User type",
    y = "Number of rides (in millions)") +
  scale_y_continuous(labels = label_number(suffix = " M", scale = 1e-6)) +
  geom_text(stat='count', aes(label=..count..), vjust=+2, color="white")

## Checking number of trips against against days of the trip

ggplot(merged_new, aes(x=day_of_trip)) +
  geom_bar(fill = "#008000") +
  labs(
    title = "Number of rides completed by week day",
    subtitle = "For the period between October 2021 and September ending 2022",
    x = "day_of trip",
    y = "Number of rides") +
  scale_y_continuous(labels = label_number(suffix = " M", scale = 1e-6)) +
  geom_text(stat='count', aes(label=..count..), vjust=+2, color="white")

## Checking number of trips against against Month of the trip

ggplot(merged_new, aes(x=month_of_trip)) +
  geom_bar(fill = "#008000") +
  labs(
    title = "Number of rides completed by week day",
    subtitle = "For the period between October 2021 and September ending 2022",
    x = "month_of trip",
    y = "Number of rides") +
  scale_y_continuous(labels = label_number(suffix = " M", scale = 1e-6)) +
  geom_text(stat='count', aes(label=..count..), vjust=+2, color="white")

Total distance (in kilometers) traveled by user type

data_bar2 <- merged_new %>% 
  group_by(member_casual) %>%
  summarise(distance_km=sum(distance_km, na.rm=TRUE)) 

ggplot(data_bar2, aes(x=member_casual, y=distance_km)) +
  geom_bar(stat = "identity", fill= "blue") +
  labs(
    title = "Distance travelled by user type",
    subtitle = "For the period between October 2021 and September ending 2022",
    x = "User type",
    y = "Distance travelled (kilometers)") +
  scale_y_continuous(labels = label_number(suffix = " M", scale = 2e-6)) +
  geom_text(aes(label=round(stat(y),2)), vjust=+2, color="white")

## Total distance (in kilometers) traveled by day of the week Saturday and Sunday are the most busy days as they riders and more distance travelled than the rest of the day in the week.

data_bar2 <- merged_new %>% 
  group_by(day_of_trip) %>%
  summarise(distance_km=sum(distance_km, na.rm=TRUE)) 

ggplot(data_bar2, aes(x=day_of_trip, y=distance_km)) +
  geom_bar(stat = "identity", fill= "blue") +
  labs(
    title = "Distance travelled by day of the week",
    subtitle = "For the period between October 2021 and September ending 2022",
    x = "Day of the week",
    y = "Distance travelled (kilometers)") +
  scale_y_continuous(labels = label_number(suffix = " M", scale = 1e-6)) +
  geom_text(aes(label=round(stat(y),2)), vjust=+2, color="white")

Hours cycled by user type

This show how long the riders use the services.

data_bar3 <- merged_new %>%
  group_by(member_casual) %>%
  summarise(time=sum(ride_length, na.rm=TRUE)/60)

ggplot(data_bar3, aes(x=member_casual, y=time)) +
  geom_bar(stat = "identity", fill = "yellow") +
  labs(
    title = "Hours cycled by user type",
    subtitle = "For the period between February 2021 and February 2022",
    x = "User type",
    y = "Hours cycled") +
  scale_y_continuous(labels = label_number(suffix = " M", scale = 1e-6)) +
  geom_text(aes(label=round(time,2)), vjust=+2, color="black")
## Don't know how to automatically pick scale for object of type difftime. Defaulting to continuous.

Hours cycled by weekday

This show how long the riders use the services.

data_bar3 <- merged_new %>%
  group_by(day_of_trip) %>%
  summarise(time=sum(ride_length, na.rm=TRUE)/60)

ggplot(data_bar3, aes(x=day_of_trip, y=time)) +
  geom_bar(stat = "identity", fill = "008000") +
  labs(
    title = "Hours cycled by day of the week",
    subtitle = "For the period between February 2021 and February 2022",
    x = "day of trip",
    y = "Hours cycled") +
  scale_y_continuous(labels = label_number(suffix = " M", scale = 1e-6)) +
  geom_text(aes(label=round(time,2)), vjust=+2, color="black")
## Don't know how to automatically pick scale for object of type difftime. Defaulting to continuous.

On average, annual casual members rides more in the period examined.

data_bar3.1 <- merged_new %>%
  group_by(member_casual) %>%
  summarise(average_time=mean(ride_length, na.rm=TRUE))
print(data_bar3.1)
## # A tibble: 2 × 2
##   member_casual average_time 
##   <chr>         <drtn>       
## 1 casual        29.35914 mins
## 2 member        12.76507 mins

Bike preference disaggregated by member

Classic bike is the most preferred bike type among annual members. Members don’t use docked bike at all.

data_bar4 <- merged_new %>%
  group_by(member_casual, rideable_type) %>%
  summarise(count_of = n())
## `summarise()` has grouped output by 'member_casual'. You can override using the
## `.groups` argument.
ggplot(data_bar4, aes(x=member_casual, y= count_of, fill=rideable_type)) +
  geom_bar(stat="identity") +
  labs(
    title = "Bike preference by user type",
    subtitle = "For the period between October 2021 and September ending 2022",
    fill = "Bike type",
    x = "User type",
    y = "Number of users") +
  geom_text(aes(label=count_of), position = position_stack(vjust = .5), color="white") +
  scale_y_continuous(labels = label_number(suffix = " M", scale = 1e-6))

Disaggregation of rides completed by month and member type

The summer period (around June to September) saw an increase in rides completed, while the winter period (around Nov to February) saw a marked reduction in rides completed. This trend is similar in both user type, likely because people are less likely to go out in the winter.

data_bar5 <- mutate(merged_new, start_month_year = floor_date(as_date(started_at), "month")) %>%
  group_by(start_month_year, member_casual) %>%
  summarise(count_of = n())
## `summarise()` has grouped output by 'start_month_year'. You can override using
## the `.groups` argument.
ggplot(data_bar5, aes(x=start_month_year, y=count_of, fill="orange"))+
  geom_bar(stat="identity") +
  facet_wrap(~member_casual)+
  labs(
    title = "Number of rides completed by month by user type",
    subtitle = "For the period between October 2021 to September ending 2022",
    x = "Month",
    y = "Number of rides completed") +
  geom_text(aes(label=count_of), position = position_stack(vjust = .5), color="black", angle = 90) +
  scale_y_continuous(labels = label_number(suffix = " M", scale = 1e-6)) +
  scale_x_date(date_labels = "%b %y", date_breaks = "2 month") +
  theme(legend.position = "none")

merged_new %>%
  group_by(member_casual, day_of_trip) %>%  
  summarise(number_of_rides = n(),average_duration_mins = mean(ride_length)) %>% 
  arrange(member_casual, desc(number_of_rides))
## `summarise()` has grouped output by 'member_casual'. You can override using the
## `.groups` argument.
## # A tibble: 14 × 4
## # Groups:   member_casual [2]
##    member_casual day_of_trip number_of_rides average_duration_mins
##    <chr>         <ord>                 <int> <drtn>               
##  1 casual        Sat                  499806 32.70851 mins        
##  2 casual        Sun                  405040 34.35873 mins        
##  3 casual        Fri                  352510 28.01084 mins        
##  4 casual        Thu                  306694 25.67940 mins        
##  5 casual        Wed                  281660 25.03412 mins        
##  6 casual        Mon                  279789 29.72791 mins        
##  7 casual        Tue                  275787 25.80486 mins        
##  8 member        Tue                  541527 12.16232 mins        
##  9 member        Wed                  538490 12.10353 mins        
## 10 member        Thu                  530552 12.29393 mins        
## 11 member        Fri                  491467 12.52837 mins        
## 12 member        Mon                  473060 12.32758 mins        
## 13 member        Sat                  458229 14.26343 mins        
## 14 member        Sun                  393624 14.21138 mins

Key Findings

The data analysis revealed the following:

  • The total of 5,828,235 trips were made using the services between October 2021 and September ending 2022.
  • Causal member made a total of 2,401,286 trips while registered members made a total of 3,426,949 trips in the period in review.
  • Registered members made 1,025,663 more trips than casual users in the 12 months.
  • Saturday is the most busy day.
  • The demand for renting bikes rises throughout the summer and falls during the winter. Both annual members and casual cyclists share this observation. People are more likely to go out in the summer and less likely to ride bicycles in the cold.
  • A total of 12,463,691Km of rides was taken during this period. While Saturday is the most busiest day of the week in terms of ride lenght, kilometer travelled and number of rides.