Introduction

A bicycle-sharing system, or bike share program, is a shared transport service in which bicycles are made available to individuals on a short-term basis, either for a price or for free. Many bike share systems allow people to borrow a bike from a "dock" and return it at another dock belonging to the same system.

The Company

Cyclistic is a bike-share company in Chicago. Its team members are highly interested in answering the key business questions and keen to follow the steps of the data analysis process: ask, prepare, process, analyze, share, and act.

In 2016, Cyclistic launched a successful bike-share offering. Since then, the program has grown to a fleet of 5,824 bicycles that are geotracked and locked into a network of 692 stations across Chicago. The bikes can be unlocked from one station and returned to any other station in the system at any time.

Until now, Cyclistic’s marketing strategy relied on building general awareness and appealing to broad consumer segments. One approach that helped make this possible was the flexibility of its pricing plans: single-ride passes, full-day passes, and annual memberships. Customers who purchase single-ride or full-day passes are referred to as casual riders. Customers who purchase annual memberships are Cyclistic members.

Stakeholders and teams

Ask Phase

Business Task

  • How do annual members and casual riders use Cyclistic bikes differently?
  • Design marketing strategies aimed at converting casual riders into annual members.

Preparing the data

The dataset was downloaded from Divvy trip data.

The data was collected by Cyclistic. This analysis uses six months of data, from January 2020 to June 2020. The data was available as zip archives, which were downloaded to a personal computer. The files for the selected months were then merged into a single dataset in R.

Process

Loading the appropriate packages
library(readr)
library(skimr)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(lubridate)
## 
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
## 
##     date, intersect, setdiff, union
library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --
## v ggplot2 3.3.5     v purrr   0.3.4
## v tibble  3.1.6     v stringr 1.4.0
## v tidyr   1.2.0     v forcats 0.5.1
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x lubridate::as.difftime() masks base::as.difftime()
## x lubridate::date()        masks base::date()
## x dplyr::filter()          masks stats::filter()
## x lubridate::intersect()   masks base::intersect()
## x dplyr::lag()             masks stats::lag()
## x lubridate::setdiff()     masks base::setdiff()
## x lubridate::union()       masks base::union()
library(ggplot2)
library(corrplot)
## corrplot 0.92 loaded
Loading the datasets for 2020
Divvy_Trips_Q1 <- read_csv("Divvy_Trips_2020_Q1.csv")
## Rows: 426887 Columns: 13
## -- Column specification --------------------------------------------------------
## Delimiter: ","
## chr  (5): ride_id, rideable_type, start_station_name, end_station_name, memb...
## dbl  (6): start_station_id, end_station_id, start_lat, start_lng, end_lat, e...
## dttm (2): started_at, ended_at
## 
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.
Divvy_Trips2 <- read_csv("202004-divvy-tripdata.csv")
## Rows: 84776 Columns: 13
## -- Column specification --------------------------------------------------------
## Delimiter: ","
## chr  (5): ride_id, rideable_type, start_station_name, end_station_name, memb...
## dbl  (6): start_station_id, end_station_id, start_lat, start_lng, end_lat, e...
## dttm (2): started_at, ended_at
## 
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.
Divvy_Trips3 <- read_csv("202005-divvy-tripdata.csv")
## Rows: 200274 Columns: 13
## -- Column specification --------------------------------------------------------
## Delimiter: ","
## chr  (5): ride_id, rideable_type, start_station_name, end_station_name, memb...
## dbl  (6): start_station_id, end_station_id, start_lat, start_lng, end_lat, e...
## dttm (2): started_at, ended_at
## 
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.
Divvy_Trips4 <- read_csv("202006-divvy-tripdata.csv")
## Rows: 343005 Columns: 13
## -- Column specification --------------------------------------------------------
## Delimiter: ","
## chr  (5): ride_id, rideable_type, start_station_name, end_station_name, memb...
## dbl  (6): start_station_id, end_station_id, start_lat, start_lng, end_lat, e...
## dttm (2): started_at, ended_at
## 
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.
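
The four `read_csv()` calls above can also be written as one step that reads every CSV file in a folder and stacks the results. The following is a minimal sketch, not the code used in this analysis: the temporary folder, the file names `trips_part1.csv` and `trips_part2.csv`, and the toy rows are hypothetical stand-ins for the downloaded files.

```r
library(readr)
library(dplyr)

# Hypothetical stand-ins for the downloaded CSV files, written to a temp folder
data_dir <- tempfile("divvy_")
dir.create(data_dir)
write_csv(tibble(ride_id = c("A1", "A2"), member_casual = c("member", "casual")),
          file.path(data_dir, "trips_part1.csv"))
write_csv(tibble(ride_id = "B1", member_casual = "member"),
          file.path(data_dir, "trips_part2.csv"))

# List every CSV in the folder, read each one, then stack them by column name
csv_files <- list.files(data_dir, pattern = "\\.csv$", full.names = TRUE)
all_trips <- bind_rows(lapply(csv_files, read_csv, show_col_types = FALSE))
```

`bind_rows()` matches columns by name, so it is a little more forgiving than `rbind()` if one file has its columns in a different order.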

Cleaning the dataset

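#Checking that each dataset was loaded (is.null() tests whether the object itself is NULL; missing values inside the columns are counted later with is.na())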
is.null(Divvy_Trips_Q1)
## [1] FALSE
is.null(Divvy_Trips2)
## [1] FALSE
is.null(Divvy_Trips3)
## [1] FALSE
is.null(Divvy_Trips4)
## [1] FALSE
#checking for duplicate data
sum(duplicated(Divvy_Trips_Q1))
## [1] 0
sum(duplicated(Divvy_Trips2))
## [1] 0
sum(duplicated(Divvy_Trips3))
## [1] 0
sum(duplicated(Divvy_Trips4))
## [1] 0
#start_station_id and end_station_id were read as doubles (decimal numbers) rather than integers
Merging all the datasets into one
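
Note that `is.null()` only reports whether an object is `NULL`; it says nothing about missing values inside a data frame. The short sketch below shows the difference on a toy data frame whose values are hypothetical.

```r
# Toy data frame (hypothetical values) with one missing station name
toy_trips <- data.frame(
  ride_id = c("A1", "A2", "A3"),
  end_station_name = c("Clark St & Leland Ave", NA, "Wilton Ave & Belmont Ave")
)

object_is_null <- is.null(toy_trips)         # FALSE: the data frame itself exists
missing_total  <- sum(is.na(toy_trips))      # total number of NA cells
missing_by_col <- colSums(is.na(toy_trips))  # NA count per column
```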
Divvy_Trips <- rbind(Divvy_Trips_Q1, Divvy_Trips2, Divvy_Trips3, Divvy_Trips4)

View(Divvy_Trips)
str(Divvy_Trips)
## spec_tbl_df [1,054,942 x 13] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ ride_id           : chr [1:1054942] "EACB19130B0CDA4A" "8FED874C809DC021" "789F3C21E472CA96" "C9A388DAC6ABF313" ...
##  $ rideable_type     : chr [1:1054942] "docked_bike" "docked_bike" "docked_bike" "docked_bike" ...
##  $ started_at        : POSIXct[1:1054942], format: "2020-01-21 20:06:59" "2020-01-30 14:22:39" ...
##  $ ended_at          : POSIXct[1:1054942], format: "2020-01-21 20:14:30" "2020-01-30 14:26:22" ...
##  $ start_station_name: chr [1:1054942] "Western Ave & Leland Ave" "Clark St & Montrose Ave" "Broadway & Belmont Ave" "Clark St & Randolph St" ...
##  $ start_station_id  : num [1:1054942] 239 234 296 51 66 212 96 96 212 38 ...
##  $ end_station_name  : chr [1:1054942] "Clark St & Leland Ave" "Southport Ave & Irving Park Rd" "Wilton Ave & Belmont Ave" "Fairbanks Ct & Grand Ave" ...
##  $ end_station_id    : num [1:1054942] 326 318 117 24 212 96 212 212 96 100 ...
##  $ start_lat         : num [1:1054942] 42 42 41.9 41.9 41.9 ...
##  $ start_lng         : num [1:1054942] -87.7 -87.7 -87.6 -87.6 -87.6 ...
##  $ end_lat           : num [1:1054942] 42 42 41.9 41.9 41.9 ...
##  $ end_lng           : num [1:1054942] -87.7 -87.7 -87.7 -87.6 -87.6 ...
##  $ member_casual     : chr [1:1054942] "member" "member" "member" "member" ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   ride_id = col_character(),
##   ..   rideable_type = col_character(),
##   ..   started_at = col_datetime(format = ""),
##   ..   ended_at = col_datetime(format = ""),
##   ..   start_station_name = col_character(),
##   ..   start_station_id = col_double(),
##   ..   end_station_name = col_character(),
##   ..   end_station_id = col_double(),
##   ..   start_lat = col_double(),
##   ..   start_lng = col_double(),
##   ..   end_lat = col_double(),
##   ..   end_lng = col_double(),
##   ..   member_casual = col_character()
##   .. )
##  - attr(*, "problems")=<externalptr>
#Checking for missing (NA) values in each column
colSums(is.na(Divvy_Trips))
##            ride_id      rideable_type         started_at           ended_at 
##                  0                  0                  0                  0 
## start_station_name   start_station_id   end_station_name     end_station_id 
##                  0                  0                889                889 
##          start_lat          start_lng            end_lat            end_lng 
##                  0                  0                889                889 
##      member_casual 
##                  0
Replacing empty station names with "None" (the 889 NA values in end_station_name are not affected by this step)
Divvy_Trips$start_station_name[Divvy_Trips$start_station_name ==""]<- "None"

Divvy_Trips$end_station_name[Divvy_Trips$end_station_name ==""]<- "None"
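
A comparison such as `x == ""` matches empty strings only, so `NA` entries stay as they are. If the intent is to label both cases, one possible approach is sketched below on a toy tibble; the station values are hypothetical and this is not the step used in the analysis.

```r
library(dplyr)

# Toy column with a real name, an empty string, and a missing value
toy_trips <- tibble(
  end_station_name = c("Clark St & Leland Ave", "", NA)
)

# Replace both empty strings and NA with the label "None"
toy_clean <- toy_trips %>%
  mutate(end_station_name = if_else(
    is.na(end_station_name) | end_station_name == "",
    "None",
    end_station_name
  ))
```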
Dropping the columns: latitude, longitude, start station ID, end station ID.
Divvy_Trips = subset(Divvy_Trips, select = -c(start_lat, start_lng, end_lat, end_lng, start_station_id , end_station_id))
Checking how many distinct values are in the column member_casual
n_distinct(Divvy_Trips$member_casual) #..this means that there are two distinct values, member and casual
## [1] 2
n_distinct(Divvy_Trips$ride_id) 
## [1] 1054942
n_distinct(Divvy_Trips$rideable_type) 
## [1] 1
Checking how many rides were taken by casual riders and by annual members in the dataset
table(Divvy_Trips['member_casual'])
## 
## casual member 
## 313735 741207
Divvy_Trips%>%
  count(member_casual)
## # A tibble: 2 x 2
##   member_casual      n
##   <chr>          <int>
## 1 casual        313735
## 2 member        741207
Checking for how many rideable types on the dataset and how many members use rideable types on the dataset
#Checking for how many rideable types on the dataset
table(Divvy_Trips['rideable_type'])
## 
## docked_bike 
##     1054942
Divvy_Trips%>%
  count(rideable_type)
## # A tibble: 1 x 2
##   rideable_type       n
##   <chr>           <int>
## 1 docked_bike   1054942
#Checking how many rides of each rideable type were taken by each rider type
table(Divvy_Trips$rideable_type, Divvy_Trips$member_casual)
##              
##               casual member
##   docked_bike 313735 741207
Divvy_Trips%>%
  count(rideable_type, member_casual)
## # A tibble: 2 x 3
##   rideable_type member_casual      n
##   <chr>         <chr>          <int>
## 1 docked_bike   casual        313735
## 2 docked_bike   member        741207
skim(Divvy_Trips)
Data summary

Name                     Divvy_Trips
Number of rows           1054942
Number of columns        7
Column type frequency:
  character              5
  POSIXct                2
Group variables          None

Variable type: character

skim_variable       n_missing  complete_rate  min  max  empty  n_unique  whitespace
ride_id                     0              1   16   16      0   1054942           0
rideable_type               0              1   11   11      0         1           0
start_station_name          0              1    5   43      0       621           0
end_station_name          889              1    5   43      0       622           0
member_casual               0              1    6    6      0         2           0

Variable type: POSIXct

skim_variable  n_missing  complete_rate  min                  max                  median               n_unique
started_at             0              1  2020-01-01 00:04:44  2020-06-30 23:59:54  2020-05-02 19:56:55    961848
ended_at               0              1  2020-01-01 00:10:54  2020-07-03 20:26:15  2020-05-02 20:33:56    960982
#this shows that docked_bike is the only rideable type in the dataset, and that members took more rides than casual riders.
table(Divvy_Trips$member_casual, useNA = "ifany")   #this shows that member_casual has no missing values.
## 
## casual member 
## 313735 741207
Transforming the dataset
Divvy_Trips$rideDate<-as.Date(Divvy_Trips$started_at)

Divvy_Trips$started_at<-as_datetime(Divvy_Trips$started_at)
Divvy_Trips$ended_at<-as_datetime(Divvy_Trips$ended_at)

# Adding new columns (month, day, year, day of week) derived from rideDate.
Divvy_Trips$month<-format(as.Date(Divvy_Trips$rideDate),"%B")
Divvy_Trips$day <-format(as.Date(Divvy_Trips$rideDate),"%d")
Divvy_Trips$year<-format(as.Date(Divvy_Trips$rideDate),"%Y")
Divvy_Trips$day_of_week<-format(as.Date(Divvy_Trips$rideDate),"%A")
# View the column names
colnames(Divvy_Trips)
##  [1] "ride_id"            "rideable_type"      "started_at"        
##  [4] "ended_at"           "start_station_name" "end_station_name"  
##  [7] "member_casual"      "rideDate"           "month"             
## [10] "day"                "year"               "day_of_week"
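
The `format()` calls above return character strings, which is why the weekday and month columns are converted to ordered factors later on. As an alternative sketch, the lubridate accessor functions return numbers that already sort correctly. The two dates below are illustrative only.

```r
library(lubridate)

ride_dates <- as.Date(c("2020-01-21", "2020-06-27"))

ride_month   <- month(ride_dates)                 # month number, 1 to 12
ride_day     <- day(ride_dates)                   # day of the month
ride_year    <- year(ride_dates)                  # four-digit year
ride_weekday <- wday(ride_dates, week_start = 1)  # 1 = Monday ... 7 = Sunday
```

With `week_start = 1`, a Tuesday is coded as 2 and a Saturday as 6, independent of the system locale.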
Length of the ride
Divvy_Trips <- Divvy_Trips%>%
  mutate(length_of_ride=ended_at - started_at)
head(Divvy_Trips)
## # A tibble: 6 x 13
##   ride_id rideable_type started_at          ended_at            start_station_n~
##   <chr>   <chr>         <dttm>              <dttm>              <chr>           
## 1 EACB19~ docked_bike   2020-01-21 20:06:59 2020-01-21 20:14:30 Western Ave & L~
## 2 8FED87~ docked_bike   2020-01-30 14:22:39 2020-01-30 14:26:22 Clark St & Mont~
## 3 789F3C~ docked_bike   2020-01-09 19:29:26 2020-01-09 19:32:17 Broadway & Belm~
## 4 C9A388~ docked_bike   2020-01-06 16:17:07 2020-01-06 16:25:56 Clark St & Rand~
## 5 943BC3~ docked_bike   2020-01-30 08:37:16 2020-01-30 08:42:48 Clinton St & La~
## 6 6D9C8A~ docked_bike   2020-01-10 12:33:05 2020-01-10 12:37:54 Wells St & Hubb~
## # ... with 8 more variables: end_station_name <chr>, member_casual <chr>,
## #   rideDate <date>, month <chr>, day <chr>, year <chr>, day_of_week <chr>,
## #   length_of_ride <drtn>
Divvy_Trips$length_of_ride <- as.numeric(Divvy_Trips$length_of_ride)
str(Divvy_Trips$length_of_ride)
##  num [1:1054942] 451 223 171 529 332 289 289 297 295 203 ...
#checking for hours and minutes used to complete the ride
Divvy_Trips$hour_minutes_of_ride <- hms::as_hms(Divvy_Trips$length_of_ride)
View(Divvy_Trips)
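
One detail worth keeping in mind: subtracting two date-times lets R choose the unit of the result automatically (seconds, minutes, hours, or days), based on the smallest difference in the vector. Here the result was in seconds, but a sketch that requests the unit explicitly with `difftime()` avoids relying on that. The timestamps below are the first two rides from the dataset.

```r
started <- as.POSIXct(c("2020-01-21 20:06:59", "2020-01-30 14:22:39"), tz = "UTC")
ended   <- as.POSIXct(c("2020-01-21 20:14:30", "2020-01-30 14:26:22"), tz = "UTC")

# Plain subtraction: R picks the unit itself (minutes here, since every ride is at least 60 s)
auto_diff <- ended - started

# Explicit unit: always seconds, whatever the data looks like
secs <- as.numeric(difftime(ended, started, units = "secs"))
```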
Filtering out rides with a length of 0 seconds or less
biketrip <- filter(Divvy_Trips,length_of_ride>0)

Average, minimum and maximum length of ride

##average, minimum and maximum length of ride

biketrip%>%
  summarise(min_length=min(length_of_ride),max_length=max(length_of_ride),average_length=mean(length_of_ride))
## # A tibble: 1 x 3
##   min_length max_length average_length
##        <dbl>      <dbl>          <dbl>
## 1          1    9387024          1746.
View(biketrip)
Length of Ride by member_type
aggregate(length_of_ride~member_casual, data= biketrip,mean)
##   member_casual length_of_ride
## 1        casual      3601.5713
## 2        member       961.5074
aggregate(length_of_ride~member_casual, data= biketrip,median)
##   member_casual length_of_ride
## 1        casual           1486
## 2        member            652
aggregate(length_of_ride~member_casual, data= biketrip,max)
##   member_casual length_of_ride
## 1        casual        9387024
## 2        member        5627611
#this means that casual riders take longer rides than members: both the mean and the median ride length are higher for casual riders.
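
The three `aggregate()` calls can be combined into a single grouped summary. The sketch below uses a small toy tibble with hypothetical ride lengths (in seconds), including one very long ride, to show why the mean and the median can differ so much.

```r
library(dplyr)

# Hypothetical ride lengths in seconds; one casual ride is an extreme outlier
toy_rides <- tibble(
  member_casual  = c("casual", "casual", "casual", "member", "member", "member"),
  length_of_ride = c(600, 1500, 90000, 300, 650, 1000)
)

ride_summary <- toy_rides %>%
  group_by(member_casual) %>%
  summarise(
    mean_length   = mean(length_of_ride),
    median_length = median(length_of_ride),
    max_length    = max(length_of_ride),
    .groups = "drop"
  )
```

In the toy data the single 90,000-second ride pulls the casual mean far above the median, which is the same pattern seen in the real summary above.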
Ordering the data
#ordering the weekdays from Monday to Sunday, then counting rides per weekday and rider type
biketrip$day_of_week<-ordered(biketrip$day_of_week,levels=c('Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday'))
biketrip%>%count(day_of_week,member_casual)
## # A tibble: 14 x 3
##    day_of_week member_casual      n
##    <ord>       <chr>          <int>
##  1 Monday      casual         32362
##  2 Monday      member        105619
##  3 Tuesday     casual         33618
##  4 Tuesday     member        117270
##  5 Wednesday   casual         32769
##  6 Wednesday   member        110409
##  7 Thursday    casual         35580
##  8 Thursday    member        111767
##  9 Friday      casual         39153
## 10 Friday      member        107493
## 11 Saturday    casual         72389
## 12 Saturday    member         97352
## 13 Sunday      casual         67378
## 14 Sunday      member         90829
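
To make the weekday pattern easier to compare between the two groups, the counts can be turned into a weekend share per rider type. The sketch below re-enters the counts from the output above by hand; the object and column names are illustrative.

```r
library(dplyr)

# Ride counts copied from the count(day_of_week, member_casual) output above
weekday_counts <- tibble(
  day_of_week = rep(c("Monday", "Tuesday", "Wednesday", "Thursday",
                      "Friday", "Saturday", "Sunday"), each = 2),
  member_casual = rep(c("casual", "member"), times = 7),
  n = c(32362, 105619, 33618, 117270, 32769, 110409, 35580, 111767,
        39153, 107493, 72389, 97352, 67378, 90829)
)

# Share of each group's rides that fall on Saturday or Sunday
weekend_share <- weekday_counts %>%
  group_by(member_casual) %>%
  summarise(
    total_rides   = sum(n),
    weekend_rides = sum(n[day_of_week %in% c("Saturday", "Sunday")]),
    .groups = "drop"
  ) %>%
  mutate(weekend_share = weekend_rides / total_rides)
```

On these counts, roughly 45% of casual rides and 25% of member rides take place on a Saturday or Sunday.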
#ordering the months from January to June, then counting rides per month and rider type
biketrip$month<-ordered(biketrip$month,levels=c('January', 'February', 'March', 'April', 'May', 'June'))
biketrip%>%count(month,member_casual)
## # A tibble: 12 x 3
##    month    member_casual      n
##    <ord>    <chr>          <int>
##  1 January  casual          7785
##  2 January  member        136099
##  3 February casual         12860
##  4 February member        126715
##  5 March    casual         27625
##  6 March    member        115593
##  7 April    casual         23605
##  8 April    member         61112
##  9 May      casual         86838
## 10 May      member        113252
## 11 June     casual        154536
## 12 June     member        187968
biketrip%>%count(member_casual, rideable_type)
## # A tibble: 2 x 3
##   member_casual rideable_type      n
##   <chr>         <chr>          <int>
## 1 casual        docked_bike   313249
## 2 member        docked_bike   740739
##average, min and max length of ride for 6 months
mean_r_length <-as.numeric(mean(biketrip$length_of_ride))/60
cat("The average ride length over 6 months is", mean_r_length, "minutes")
## The average ride length over 6 months is 29.1024 minutes
min_r_length <-as.numeric(min(biketrip$length_of_ride))/60
cat("The minimum ride length over 6 months is", min_r_length, "minutes")
## The minimum ride length over 6 months is 0.01666667 minutes
max_r_length <- as.numeric(max(biketrip$length_of_ride))/3600
cat("The maximum ride length over 6 months is", max_r_length, "hours")
## The maximum ride length over 6 months is 2607.507 hours

Visualization

For the visualization, ggplot2 and corrplot were installed and loaded. The charts below use ggplot2 and base R graphics.

# Average ride length (in seconds) per rider type, one panel per weekday
biketrip%>%
  group_by(member_casual,day_of_week)%>%
  summarise(avg_ride_length=mean(length_of_ride))%>%
  ggplot(mapping=aes(x=member_casual,y=avg_ride_length,fill=day_of_week)) +
  geom_bar(position="dodge",stat = "identity") +
  facet_wrap(~day_of_week) +
  labs(title="Average ride length by day of week")
## `summarise()` has grouped output by 'member_casual'. You can override using the
## `.groups` argument.
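
The message above comes from `summarise()` keeping one level of grouping. A possible variation, sketched here on hypothetical toy data, sets `.groups = "drop"` to silence it, converts the average to minutes, and uses `geom_col()`, which is the shorthand for `geom_bar(stat = "identity")`.

```r
library(dplyr)
library(ggplot2)

# Hypothetical ride lengths in seconds
toy_rides <- tibble(
  member_casual  = c("casual", "casual", "member", "member"),
  day_of_week    = c("Saturday", "Sunday", "Saturday", "Sunday"),
  length_of_ride = c(3000, 3600, 900, 1000)
)

avg_by_day <- toy_rides %>%
  group_by(member_casual, day_of_week) %>%
  summarise(avg_ride_minutes = mean(length_of_ride) / 60, .groups = "drop")

p_avg <- ggplot(avg_by_day,
                aes(x = day_of_week, y = avg_ride_minutes, fill = member_casual)) +
  geom_col(position = "dodge") +
  labs(title = "Average ride length by day of week",
       x = "Day of week", y = "Average ride length (minutes)",
       fill = "Rider type")

# print(p_avg) draws the chart
```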

# Number of rides per rider type for the year
biketrip%>%
  group_by(member_casual,year)%>%
  summarise(Ridenumbers=n())%>%
  ggplot(mapping=aes(x=year,y=Ridenumbers,fill=member_casual)) +
  geom_bar(position="dodge",stat = "identity") +
  labs(title="Number of rides by year")
## `summarise()` has grouped output by 'member_casual'. You can override using the
## `.groups` argument.

str(biketrip$length_of_ride)
##  num [1:1053988] 451 223 171 529 332 289 289 297 295 203 ...
par(mfrow=c(1,1))


# Distribution of ride length (in seconds) for each rider type
boxplot(length_of_ride ~ member_casual,
        data = biketrip,
        main = "Distribution of ride length by rider type",
        xlab = "member_casual",
        ylab = "length of ride (seconds)",
        col = c("orange", "yellow"))

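# Distribution of ride months for each rider type (month as a number, 1 = January)
boxplot(as.integer(month) ~ member_casual,
        data = biketrip,
        main = "Month Vs Riders",
        xlab = "Member Riders and Casual Riders",
        ylab = "Month (1 = January, 6 = June)",
        col = c("pink", "pink1")) 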
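
Because a few rides last for days, a boxplot of ride length on a linear axis tends to be squeezed flat by the outliers. One option, sketched below on hypothetical toy data, is to plot the ride length in minutes on a logarithmic axis. This is an illustration of the idea rather than a chart from the analysis.

```r
library(ggplot2)

# Hypothetical ride lengths in seconds, with one extreme ride per group
toy_rides <- data.frame(
  member_casual  = rep(c("casual", "member"), each = 5),
  length_of_ride = c(300, 900, 1500, 4000, 90000, 200, 500, 650, 900, 20000)
)

p_box <- ggplot(toy_rides,
                aes(x = member_casual, y = length_of_ride / 60, fill = member_casual)) +
  geom_boxplot() +
  scale_y_log10() +
  labs(title = "Ride length by rider type (log scale)",
       x = "Rider type", y = "Ride length (minutes)")

# print(p_box) draws the chart
```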

Recommendations

Based on the analysis, the following recommendations are made:

  • Cyclistic can promote shorter rides among casual riders by offering incentives to those who complete more short rides and subscribe to a membership.

  • Cyclistic could run friendlier advertisements that encourage members to take longer rides.

  • Cyclistic can send members motivational quotes or messages that remind them to keep riding, as this might encourage members to take longer rides.