The success of Chicago-based bike-sharing start-up Cyclistic, according to its marketing director Lily Moreno, rests on increasing the number of annual subscriptions. The team is interested in seeing how annual members and casual riders utilize Cyclistic bikes. The company plan a new marketing drive to turn casual riders into annual members based on these insights. This report will give the findings in the following format to help readers understand user trends: * Trips completed by user type, * Total distance travelled (kilometres) by user type * Hours cycled by type of user * User preference for bikes * completed rides by month and user type * Number of rides taken each day, broken down per user type * Top 5 user types for start station and end station
We need to install some packages to be able to use their functions
This project will use the following libraries
library(readr)
library(geosphere)
library(ggplot2)
library(scales)
##
## Attaching package: 'scales'
## The following object is masked from 'package:readr':
##
## col_factor
library(lubridate)
##
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
##
## date, intersect, setdiff, union
library(tidyverse)
## ── Attaching packages
## ───────────────────────────────────────
## tidyverse 1.3.2 ──
## ✔ tibble 3.1.8 ✔ dplyr 1.0.10
## ✔ tidyr 1.2.1 ✔ stringr 1.4.1
## ✔ purrr 0.3.4 ✔ forcats 0.5.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ lubridate::as.difftime() masks base::as.difftime()
## ✖ scales::col_factor() masks readr::col_factor()
## ✖ lubridate::date() masks base::date()
## ✖ purrr::discard() masks scales::discard()
## ✖ dplyr::filter() masks stats::filter()
## ✖ lubridate::intersect() masks base::intersect()
## ✖ dplyr::lag() masks stats::lag()
## ✖ lubridate::setdiff() masks base::setdiff()
## ✖ lubridate::union() masks base::union()
library(dplyr)
library(DT)
library(shiny)
##
## Attaching package: 'shiny'
##
## The following objects are masked from 'package:DT':
##
## dataTableOutput, renderDataTable
##
## The following object is masked from 'package:geosphere':
##
## span
library(dygraphs)
library(ggiraph)
library(tmaptools)
Sys.setenv(LANG = "en")
To evaluate and spot trends, the team examined Cyclistic’s 12-month historical trip data (from October 2021 to September 2022). The dataset can be downloaded from here, the datasets were taken.
Motivate International Inc. has made this public data accessible under the terms of this license.
The 12-month historical trip data (in CSV format) are downloaded, extracted and stored in one folder titled trip_data. The next task is to import the files in to R environment and merge all the files into a dataframe that will be use. The dataframe contain all the 12 months data and we can compare them.
merged <-
list.files(path="trip_data", full.names = TRUE) %>%
lapply(read_csv) %>%
bind_rows
## Rows: 631226 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
## Rows: 359978 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
## Rows: 247540 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
## Rows: 103770 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
## Rows: 115609 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
## Rows: 284042 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
## Rows: 371249 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
## Rows: 634858 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
## Rows: 769204 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
## Rows: 823488 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
## Rows: 785932 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
## Rows: 701339 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
To view the columns of the imported data:
head(merged)
## # A tibble: 6 × 13
## ride_id ridea…¹ started_at ended_at start…² start…³
## <chr> <chr> <dttm> <dttm> <chr> <chr>
## 1 620BC6107255B… electr… 2021-10-22 12:46:42 2021-10-22 12:49:50 Kingsb… KA1503…
## 2 4471C70731AB2… electr… 2021-10-21 09:12:37 2021-10-21 09:14:14 <NA> <NA>
## 3 26CA69D43D15E… electr… 2021-10-16 16:28:39 2021-10-16 16:36:26 <NA> <NA>
## 4 362947F0437E1… electr… 2021-10-16 16:17:48 2021-10-16 16:19:03 <NA> <NA>
## 5 BB731DE2F2EC5… electr… 2021-10-20 23:17:54 2021-10-20 23:26:10 <NA> <NA>
## 6 7176307BBC097… electr… 2021-10-21 16:57:37 2021-10-21 17:11:58 <NA> <NA>
## # … with 7 more variables: end_station_name <chr>, end_station_id <chr>,
## # start_lat <dbl>, start_lng <dbl>, end_lat <dbl>, end_lng <dbl>,
## # member_casual <chr>, and abbreviated variable names ¹rideable_type,
## # ²start_station_name, ³start_station_id
To inspect the data
str(merged)
## spec_tbl_df [5,828,235 × 13] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ ride_id : chr [1:5828235] "620BC6107255BF4C" "4471C70731AB2E45" "26CA69D43D15EE14" "362947F0437E1514" ...
## $ rideable_type : chr [1:5828235] "electric_bike" "electric_bike" "electric_bike" "electric_bike" ...
## $ started_at : POSIXct[1:5828235], format: "2021-10-22 12:46:42" "2021-10-21 09:12:37" ...
## $ ended_at : POSIXct[1:5828235], format: "2021-10-22 12:49:50" "2021-10-21 09:14:14" ...
## $ start_station_name: chr [1:5828235] "Kingsbury St & Kinzie St" NA NA NA ...
## $ start_station_id : chr [1:5828235] "KA1503000043" NA NA NA ...
## $ end_station_name : chr [1:5828235] NA NA NA NA ...
## $ end_station_id : chr [1:5828235] NA NA NA NA ...
## $ start_lat : num [1:5828235] 41.9 41.9 41.9 41.9 41.9 ...
## $ start_lng : num [1:5828235] -87.6 -87.7 -87.7 -87.7 -87.7 ...
## $ end_lat : num [1:5828235] 41.9 41.9 41.9 41.9 41.9 ...
## $ end_lng : num [1:5828235] -87.6 -87.7 -87.7 -87.7 -87.7 ...
## $ member_casual : chr [1:5828235] "member" "member" "member" "member" ...
## - attr(*, "spec")=
## .. cols(
## .. ride_id = col_character(),
## .. rideable_type = col_character(),
## .. started_at = col_datetime(format = ""),
## .. ended_at = col_datetime(format = ""),
## .. start_station_name = col_character(),
## .. start_station_id = col_character(),
## .. end_station_name = col_character(),
## .. end_station_id = col_character(),
## .. start_lat = col_double(),
## .. start_lng = col_double(),
## .. end_lat = col_double(),
## .. end_lng = col_double(),
## .. member_casual = col_character()
## .. )
## - attr(*, "problems")=<externalptr>
glimpse(merged)
## Rows: 5,828,235
## Columns: 13
## $ ride_id <chr> "620BC6107255BF4C", "4471C70731AB2E45", "26CA69D43D…
## $ rideable_type <chr> "electric_bike", "electric_bike", "electric_bike", …
## $ started_at <dttm> 2021-10-22 12:46:42, 2021-10-21 09:12:37, 2021-10-…
## $ ended_at <dttm> 2021-10-22 12:49:50, 2021-10-21 09:14:14, 2021-10-…
## $ start_station_name <chr> "Kingsbury St & Kinzie St", NA, NA, NA, NA, NA, NA,…
## $ start_station_id <chr> "KA1503000043", NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ end_station_name <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ end_station_id <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ start_lat <dbl> 41.88919, 41.93000, 41.92000, 41.92000, 41.89000, 4…
## $ start_lng <dbl> -87.63850, -87.70000, -87.70000, -87.69000, -87.710…
## $ end_lat <dbl> 41.89000, 41.93000, 41.94000, 41.92000, 41.89000, 4…
## $ end_lng <dbl> -87.63000, -87.71000, -87.72000, -87.69000, -87.690…
## $ member_casual <chr> "member", "member", "member", "member", "member", "…
Four new columns were added to compute the following: Ride length (in minutes) of each trip (the new column is labelled as ride_length) Distance travelled in kilometers, The day the trip was taken, this is the day of the week of the trip. The month of the trip The new dataframe is labelled as merged_new. This new dataframe is used for subsequent data analysis.
merged_new <- mutate(merged, ride_length = difftime(ended_at, started_at, units = "mins")) %>%
mutate(merged, distance_km = distHaversine(cbind(start_lng, start_lat), cbind(end_lng, end_lat))*0.001)
To add day of the week to the table
merged_new <- mutate(merged_new, day_of_trip = wday(started_at, label = TRUE))
Extract Month and add it to the columns
merged_new <- mutate(merged_new, month_of_trip = month.abb[month(started_at)])
This will let the month truly represent the month and year of the trip
The annual members (referred to as members in the chart) made 615,960 more trips than the casual riders.
ggplot(merged_new, aes(x=member_casual)) +
geom_bar(fill = "#008000") +
labs(
title = "Number of rides completed by user type",
subtitle = "For the period between October 2021 and September ending 2022",
x = "User type",
y = "Number of rides (in millions)") +
scale_y_continuous(labels = label_number(suffix = " M", scale = 1e-6)) +
geom_text(stat='count', aes(label=..count..), vjust=+2, color="white")
## Checking number of trips against against days of the trip
ggplot(merged_new, aes(x=day_of_trip)) +
geom_bar(fill = "#008000") +
labs(
title = "Number of rides completed by week day",
subtitle = "For the period between October 2021 and September ending 2022",
x = "day_of trip",
y = "Number of rides") +
scale_y_continuous(labels = label_number(suffix = " M", scale = 1e-6)) +
geom_text(stat='count', aes(label=..count..), vjust=+2, color="white")
## Checking number of trips against against Month of the trip
ggplot(merged_new, aes(x=month_of_trip)) +
geom_bar(fill = "#008000") +
labs(
title = "Number of rides completed by week day",
subtitle = "For the period between October 2021 and September ending 2022",
x = "month_of trip",
y = "Number of rides") +
scale_y_continuous(labels = label_number(suffix = " M", scale = 1e-6)) +
geom_text(stat='count', aes(label=..count..), vjust=+2, color="white")
data_bar2 <- merged_new %>%
group_by(member_casual) %>%
summarise(distance_km=sum(distance_km, na.rm=TRUE))
ggplot(data_bar2, aes(x=member_casual, y=distance_km)) +
geom_bar(stat = "identity", fill= "blue") +
labs(
title = "Distance travelled by user type",
subtitle = "For the period between October 2021 and September ending 2022",
x = "User type",
y = "Distance travelled (kilometers)") +
scale_y_continuous(labels = label_number(suffix = " M", scale = 2e-6)) +
geom_text(aes(label=round(stat(y),2)), vjust=+2, color="white")
## Total distance (in kilometers) traveled by day of the week Saturday and Sunday are the most busy days as they riders and more distance travelled than the rest of the day in the week.
data_bar2 <- merged_new %>%
group_by(day_of_trip) %>%
summarise(distance_km=sum(distance_km, na.rm=TRUE))
ggplot(data_bar2, aes(x=day_of_trip, y=distance_km)) +
geom_bar(stat = "identity", fill= "blue") +
labs(
title = "Distance travelled by day of the week",
subtitle = "For the period between October 2021 and September ending 2022",
x = "Day of the week",
y = "Distance travelled (kilometers)") +
scale_y_continuous(labels = label_number(suffix = " M", scale = 1e-6)) +
geom_text(aes(label=round(stat(y),2)), vjust=+2, color="white")
This show how long the riders use the services.
data_bar3 <- merged_new %>%
group_by(member_casual) %>%
summarise(time=sum(ride_length, na.rm=TRUE)/60)
ggplot(data_bar3, aes(x=member_casual, y=time)) +
geom_bar(stat = "identity", fill = "yellow") +
labs(
title = "Hours cycled by user type",
subtitle = "For the period between February 2021 and February 2022",
x = "User type",
y = "Hours cycled") +
scale_y_continuous(labels = label_number(suffix = " M", scale = 1e-6)) +
geom_text(aes(label=round(time,2)), vjust=+2, color="black")
## Don't know how to automatically pick scale for object of type difftime. Defaulting to continuous.
This show how long the riders use the services.
data_bar3 <- merged_new %>%
group_by(day_of_trip) %>%
summarise(time=sum(ride_length, na.rm=TRUE)/60)
ggplot(data_bar3, aes(x=day_of_trip, y=time)) +
geom_bar(stat = "identity", fill = "008000") +
labs(
title = "Hours cycled by day of the week",
subtitle = "For the period between February 2021 and February 2022",
x = "day of trip",
y = "Hours cycled") +
scale_y_continuous(labels = label_number(suffix = " M", scale = 1e-6)) +
geom_text(aes(label=round(time,2)), vjust=+2, color="black")
## Don't know how to automatically pick scale for object of type difftime. Defaulting to continuous.
On average, annual casual members rides more in the period examined.
data_bar3.1 <- merged_new %>%
group_by(member_casual) %>%
summarise(average_time=mean(ride_length, na.rm=TRUE))
print(data_bar3.1)
## # A tibble: 2 × 2
## member_casual average_time
## <chr> <drtn>
## 1 casual 29.35914 mins
## 2 member 12.76507 mins
Classic bike is the most preferred bike type among annual members. Members don’t use docked bike at all.
data_bar4 <- merged_new %>%
group_by(member_casual, rideable_type) %>%
summarise(count_of = n())
## `summarise()` has grouped output by 'member_casual'. You can override using the
## `.groups` argument.
ggplot(data_bar4, aes(x=member_casual, y= count_of, fill=rideable_type)) +
geom_bar(stat="identity") +
labs(
title = "Bike preference by user type",
subtitle = "For the period between October 2021 and September ending 2022",
fill = "Bike type",
x = "User type",
y = "Number of users") +
geom_text(aes(label=count_of), position = position_stack(vjust = .5), color="white") +
scale_y_continuous(labels = label_number(suffix = " M", scale = 1e-6))
The summer period (around June to September) saw an increase in rides completed, while the winter period (around Nov to February) saw a marked reduction in rides completed. This trend is similar in both user type, likely because people are less likely to go out in the winter.
data_bar5 <- mutate(merged_new, start_month_year = floor_date(as_date(started_at), "month")) %>%
group_by(start_month_year, member_casual) %>%
summarise(count_of = n())
## `summarise()` has grouped output by 'start_month_year'. You can override using
## the `.groups` argument.
ggplot(data_bar5, aes(x=start_month_year, y=count_of, fill="orange"))+
geom_bar(stat="identity") +
facet_wrap(~member_casual)+
labs(
title = "Number of rides completed by month by user type",
subtitle = "For the period between October 2021 to September ending 2022",
x = "Month",
y = "Number of rides completed") +
geom_text(aes(label=count_of), position = position_stack(vjust = .5), color="black", angle = 90) +
scale_y_continuous(labels = label_number(suffix = " M", scale = 1e-6)) +
scale_x_date(date_labels = "%b %y", date_breaks = "2 month") +
theme(legend.position = "none")
merged_new %>%
group_by(member_casual, day_of_trip) %>%
summarise(number_of_rides = n(),average_duration_mins = mean(ride_length)) %>%
arrange(member_casual, desc(number_of_rides))
## `summarise()` has grouped output by 'member_casual'. You can override using the
## `.groups` argument.
## # A tibble: 14 × 4
## # Groups: member_casual [2]
## member_casual day_of_trip number_of_rides average_duration_mins
## <chr> <ord> <int> <drtn>
## 1 casual Sat 499806 32.70851 mins
## 2 casual Sun 405040 34.35873 mins
## 3 casual Fri 352510 28.01084 mins
## 4 casual Thu 306694 25.67940 mins
## 5 casual Wed 281660 25.03412 mins
## 6 casual Mon 279789 29.72791 mins
## 7 casual Tue 275787 25.80486 mins
## 8 member Tue 541527 12.16232 mins
## 9 member Wed 538490 12.10353 mins
## 10 member Thu 530552 12.29393 mins
## 11 member Fri 491467 12.52837 mins
## 12 member Mon 473060 12.32758 mins
## 13 member Sat 458229 14.26343 mins
## 14 member Sun 393624 14.21138 mins
The data analysis revealed the following: