Introduction

A bicycle-sharing system, or bike-share program, is a shared transport service in which bicycles are made available to individuals on a short-term basis, either for a fee or for free. Many bike-share systems allow people to borrow a bike from a "dock" and return it at another dock belonging to the same system.

The Company

This case study follows Cyclistic, a bike-share company in Chicago, and the characters and team members who work there. To answer the key business questions, the analysis follows the steps of the data analysis process: ask, prepare, process, analyze, share, and act.

In 2016, Cyclistic launched a successful bike-share offering. Since then, the program has grown to a fleet of 5,824 bicycles that are geotracked and locked into a network of 692 stations across Chicago. The bikes can be unlocked from one station and returned to any other station in the system anytime.

Until now, Cyclistic’s marketing strategy relied on building general awareness and appealing to broad consumer segments. One approach that helped make these things possible was the flexibility of its pricing plans: single-ride passes, full-day passes, and annual memberships. Customers who purchase single-ride or full-day passes are referred to as casual riders. Customers who purchase annual memberships are Cyclistic members.

Stakeholders and teams

Cyclistic: A bike-share program that features more than 5,800 bicycles and 600 docking stations. Cyclistic sets itself apart by also offering reclining bikes, hand tricycles, and cargo bikes, making bike-share more inclusive to people with disabilities and riders who can’t use a standard two-wheeled bike. The majority of riders opt for traditional bikes; about 8% of riders use the assistive options. Cyclistic users are more likely to ride for leisure, but about 30% use them to commute to work each day.
Lily Moreno: The director of marketing and manager. Moreno is responsible for the development of campaigns and initiatives to promote the bike-share program. These may include email, social media, and other channels.
Cyclistic marketing analytics team: A team of data analysts who are responsible for collecting, analyzing, and reporting data that helps guide Cyclistic marketing strategy.

Ask Phase

Business Task

How do annual members and casual riders use Cyclistic bikes differently?
Design marketing strategies aimed at converting casual riders into annual members.

Preparing the data

The dataset was downloaded from Divvy trip data.

The data was collected by Cyclistic. This analysis uses six months of data, from January 2020 to June 2020. The data was available in a zip folder and was downloaded to a personal computer. The files for the selected months were then merged into one dataset in R.

Process

Loading the appropriate packages

library(readr)
library(skimr)
library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(lubridate)

## 
## Attaching package: 'lubridate'

## The following objects are masked from 'package:base':
## 
##     date, intersect, setdiff, union

library(tidyverse)

## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --

## v ggplot2 3.3.5     v purrr   0.3.4
## v tibble  3.1.6     v stringr 1.4.0
## v tidyr   1.2.0     v forcats 0.5.1

## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x lubridate::as.difftime() masks base::as.difftime()
## x lubridate::date()        masks base::date()
## x dplyr::filter()          masks stats::filter()
## x lubridate::intersect()   masks base::intersect()
## x dplyr::lag()             masks stats::lag()
## x lubridate::setdiff()     masks base::setdiff()
## x lubridate::union()       masks base::union()

library(ggplot2)
library(corrplot)

## corrplot 0.92 loaded

Loading the datasets for January to June 2020

Divvy_Trips_Q1 <- read_csv("Divvy_Trips_2020_Q1.csv")

## Rows: 426887 Columns: 13
## -- Column specification --------------------------------------------------------
## Delimiter: ","
## chr  (5): ride_id, rideable_type, start_station_name, end_station_name, memb...
## dbl  (6): start_station_id, end_station_id, start_lat, start_lng, end_lat, e...
## dttm (2): started_at, ended_at
## 
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.

Divvy_Trips2 <- read_csv("202004-divvy-tripdata.csv")

## Rows: 84776 Columns: 13
## -- Column specification --------------------------------------------------------
## Delimiter: ","
## chr  (5): ride_id, rideable_type, start_station_name, end_station_name, memb...
## dbl  (6): start_station_id, end_station_id, start_lat, start_lng, end_lat, e...
## dttm (2): started_at, ended_at
## 
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.

Divvy_Trips3 <- read_csv("202005-divvy-tripdata.csv")

## Rows: 200274 Columns: 13
## -- Column specification --------------------------------------------------------
## Delimiter: ","
## chr  (5): ride_id, rideable_type, start_station_name, end_station_name, memb...
## dbl  (6): start_station_id, end_station_id, start_lat, start_lng, end_lat, e...
## dttm (2): started_at, ended_at
## 
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.

Divvy_Trips4 <- read_csv("202006-divvy-tripdata.csv")

## Rows: 343005 Columns: 13
## -- Column specification --------------------------------------------------------
## Delimiter: ","
## chr  (5): ride_id, rideable_type, start_station_name, end_station_name, memb...
## dbl  (6): start_station_id, end_station_id, start_lat, start_lng, end_lat, e...
## dttm (2): started_at, ended_at
## 
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.

Cleaning the dataset

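#Checking that each dataset loaded as a non-NULL object
#(is.null() tests the object itself, not its cells; missing values are counted with colSums(is.na()) after the merge)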
is.null(Divvy_Trips_Q1)

## [1] FALSE

is.null(Divvy_Trips2)

## [1] FALSE

is.null(Divvy_Trips3)

## [1] FALSE

is.null(Divvy_Trips4)

## [1] FALSE
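
Note that `is.null()` only reports whether an object is itself `NULL`; it says nothing about missing cells inside a data frame. A minimal sketch on a made-up data frame (the `toy` name and its values are illustrative, not taken from the Divvy data) shows the difference:

```r
# Hypothetical toy data frame with one missing station name
toy <- data.frame(
  ride_id = c("A1", "B2", "C3"),
  end_station_name = c("Clark St", NA, "Wells St"),
  stringsAsFactors = FALSE
)

is.null(toy)   # FALSE: the object exists, even though a cell is missing
anyNA(toy)     # TRUE: at least one cell is NA

# Count of NA cells in each column
na_per_column <- colSums(is.na(toy))
na_per_column
```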

#checking for duplicate rows
sum(duplicated(Divvy_Trips_Q1))

## [1] 0

sum(duplicated(Divvy_Trips2))

## [1] 0

sum(duplicated(Divvy_Trips3))

## [1] 0

sum(duplicated(Divvy_Trips4))

## [1] 0
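
`duplicated()` applied to a whole data frame flags only rows that are identical in every column. Since `ride_id` is meant to identify a trip, checking that column alone is a stricter test. A sketch on illustrative data (the `toy_rides` frame is an assumption, not part of the analysis):

```r
# Hypothetical rides: the first and third share a ride_id but differ in length
toy_rides <- data.frame(
  ride_id = c("A1", "B2", "A1"),
  rideable_type = c("docked_bike", "docked_bike", "docked_bike"),
  length_of_ride = c(451, 223, 460),
  stringsAsFactors = FALSE
)

# Rows identical in every column
full_row_dups <- sum(duplicated(toy_rides))

# Repeated ride_id values, regardless of the other columns
id_dups <- sum(duplicated(toy_rides$ride_id))

full_row_dups
id_dups
```

Here the whole-row check finds nothing while the `ride_id` check finds one repeat.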

#the start_station_id and end_station_id columns were read as doubles instead of integers

Merging all the datasets into one

Divvy_Trips <- rbind(Divvy_Trips_Q1, Divvy_Trips2, Divvy_Trips3, Divvy_Trips4)

View(Divvy_Trips)
str(Divvy_Trips)

## spec_tbl_df [1,054,942 x 13] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ ride_id           : chr [1:1054942] "EACB19130B0CDA4A" "8FED874C809DC021" "789F3C21E472CA96" "C9A388DAC6ABF313" ...
##  $ rideable_type     : chr [1:1054942] "docked_bike" "docked_bike" "docked_bike" "docked_bike" ...
##  $ started_at        : POSIXct[1:1054942], format: "2020-01-21 20:06:59" "2020-01-30 14:22:39" ...
##  $ ended_at          : POSIXct[1:1054942], format: "2020-01-21 20:14:30" "2020-01-30 14:26:22" ...
##  $ start_station_name: chr [1:1054942] "Western Ave & Leland Ave" "Clark St & Montrose Ave" "Broadway & Belmont Ave" "Clark St & Randolph St" ...
##  $ start_station_id  : num [1:1054942] 239 234 296 51 66 212 96 96 212 38 ...
##  $ end_station_name  : chr [1:1054942] "Clark St & Leland Ave" "Southport Ave & Irving Park Rd" "Wilton Ave & Belmont Ave" "Fairbanks Ct & Grand Ave" ...
##  $ end_station_id    : num [1:1054942] 326 318 117 24 212 96 212 212 96 100 ...
##  $ start_lat         : num [1:1054942] 42 42 41.9 41.9 41.9 ...
##  $ start_lng         : num [1:1054942] -87.7 -87.7 -87.6 -87.6 -87.6 ...
##  $ end_lat           : num [1:1054942] 42 42 41.9 41.9 41.9 ...
##  $ end_lng           : num [1:1054942] -87.7 -87.7 -87.7 -87.6 -87.6 ...
##  $ member_casual     : chr [1:1054942] "member" "member" "member" "member" ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   ride_id = col_character(),
##   ..   rideable_type = col_character(),
##   ..   started_at = col_datetime(format = ""),
##   ..   ended_at = col_datetime(format = ""),
##   ..   start_station_name = col_character(),
##   ..   start_station_id = col_double(),
##   ..   end_station_name = col_character(),
##   ..   end_station_id = col_double(),
##   ..   start_lat = col_double(),
##   ..   start_lng = col_double(),
##   ..   end_lat = col_double(),
##   ..   end_lng = col_double(),
##   ..   member_casual = col_character()
##   .. )
##  - attr(*, "problems")=<externalptr>

#Counting missing (NA) values in each column of the merged dataset
colSums(is.na(Divvy_Trips))

##            ride_id      rideable_type         started_at           ended_at 
##                  0                  0                  0                  0 
## start_station_name   start_station_id   end_station_name     end_station_id 
##                  0                  0                889                889 
##          start_lat          start_lng            end_lat            end_lng 
##                  0                  0                889                889 
##      member_casual 
##                  0

Replacing empty station names with "None" (NA values are not affected)

Divvy_Trips$start_station_name[Divvy_Trips$start_station_name ==""]<- "None"

Divvy_Trips$end_station_name[Divvy_Trips$end_station_name ==""]<- "None"
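
A comparison with `""` evaluates to `NA` for missing entries, and a single-value assignment through a logical index containing `NA` leaves those entries untouched. A sketch of handling both empty and missing names on an illustrative vector (`stations` is hypothetical):

```r
stations <- c("Clark St & Leland Ave", "", NA)

# Comparing with "" yields NA for the missing entry,
# so only the empty string is replaced
only_empty <- stations
only_empty[only_empty == ""] <- "None"

# Testing is.na() as well covers both empty and missing entries
both <- stations
both[is.na(both) | both == ""] <- "None"

only_empty
both
```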

Dropping the columns: latitude, longitude, start station ID, end station ID.

Divvy_Trips <- subset(Divvy_Trips, select = -c(start_lat, start_lng, end_lat, end_lng, start_station_id, end_station_id))

Checking how many distinct values the member_casual column contains

n_distinct(Divvy_Trips$member_casual) #..this means that there are two distinct values, member and casual

## [1] 2

n_distinct(Divvy_Trips$ride_id)

## [1] 1054942

n_distinct(Divvy_Trips$rideable_type)

## [1] 1

Checking how many rides were taken by casual riders and annual members in the dataset

table(Divvy_Trips['member_casual'])

## 
## casual member 
## 313735 741207

Divvy_Trips%>%
  count(member_casual)

## # A tibble: 2 x 2
##   member_casual      n
##   <chr>          <int>
## 1 casual        313735
## 2 member        741207

Checking how many rideable types are in the dataset and how many rides of each type were taken by members and casual riders

#Checking how many rideable types are in the dataset
table(Divvy_Trips['rideable_type'])

## 
## docked_bike 
##     1054942

Divvy_Trips%>%
  count(rideable_type)

## # A tibble: 1 x 2
##   rideable_type       n
##   <chr>           <int>
## 1 docked_bike   1054942

#Checking how many rides of each rideable type were taken by members and casual riders
table(Divvy_Trips$rideable_type, Divvy_Trips$member_casual)

##              
##               casual member
##   docked_bike 313735 741207

Divvy_Trips%>%
  count(rideable_type, member_casual)

## # A tibble: 2 x 3
##   rideable_type member_casual      n
##   <chr>         <chr>          <int>
## 1 docked_bike   casual        313735
## 2 docked_bike   member        741207

skim(Divvy_Trips)

Data summary
Name	Divvy_Trips
Number of rows	1054942
Number of columns	7
_______________________
Column type frequency:
character	5
POSIXct	2
________________________
Group variables	None

Variable type: character

skim_variable	n_missing	complete_rate	min	max	n_unique
ride_id	0	1	16	16	1054942
rideable_type	0	1	11	11	1
start_station_name	0	1	5	43	621
end_station_name	889	1	5	43	622
member_casual	0	1	6	6	2

Variable type: POSIXct

skim_variable	n_missing	complete_rate	min	max	median	n_unique
started_at	0	1	2020-01-01 00:04:44	2020-06-30 23:59:54	2020-05-02 19:56:55	961848
ended_at	0	1	2020-01-01 00:10:54	2020-07-03 20:26:15	2020-05-02 20:33:56	960982

#this shows that all rides used a single rideable type (docked_bike), and that members took more of those rides than casual riders.

table(Divvy_Trips$member_casual, useNA = "ifany")   #this shows that member_casual has no missing (NA) values.

## 
## casual member 
## 313735 741207

Transforming the dataset

Divvy_Trips$rideDate<-as.Date(Divvy_Trips$started_at)

Divvy_Trips$started_at<-as_datetime(Divvy_Trips$started_at)
Divvy_Trips$ended_at<-as_datetime(Divvy_Trips$ended_at)

# Adding month, day, year and day_of_week columns derived from rideDate.
Divvy_Trips$month<-format(as.Date(Divvy_Trips$rideDate),"%B")
Divvy_Trips$day <-format(as.Date(Divvy_Trips$rideDate),"%d")
Divvy_Trips$year<-format(as.Date(Divvy_Trips$rideDate),"%Y")
Divvy_Trips$day_of_week<-format(as.Date(Divvy_Trips$rideDate),"%A")

# View the column names
colnames(Divvy_Trips)

##  [1] "ride_id"            "rideable_type"      "started_at"        
##  [4] "ended_at"           "start_station_name" "end_station_name"  
##  [7] "member_casual"      "rideDate"           "month"             
## [10] "day"                "year"               "day_of_week"
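
The `%B` and `%A` formats return month and weekday names in the session's locale, so the English names used later when ordering `day_of_week` and `month` assume an English locale. A sketch of locale-independent components from base R's `as.POSIXlt()`, using the start date of the first trip in the data (the variable names are illustrative):

```r
d <- as.Date("2020-01-21")

format(d, "%Y")   # "2020"
format(d, "%d")   # "21"
format(d, "%A")   # weekday name in the current locale

# Locale-independent components:
# wday counts from 0 = Sunday, mon counts from 0 = January
parts <- as.POSIXlt(d)
weekday_index <- parts$wday      # 2, i.e. Tuesday
month_index <- parts$mon + 1     # 1, i.e. January
```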

Length of the ride

Divvy_Trips <- Divvy_Trips%>%
  mutate(length_of_ride=ended_at - started_at)
head(Divvy_Trips)

## # A tibble: 6 x 13
##   ride_id rideable_type started_at          ended_at            start_station_n~
##   <chr>   <chr>         <dttm>              <dttm>              <chr>           
## 1 EACB19~ docked_bike   2020-01-21 20:06:59 2020-01-21 20:14:30 Western Ave & L~
## 2 8FED87~ docked_bike   2020-01-30 14:22:39 2020-01-30 14:26:22 Clark St & Mont~
## 3 789F3C~ docked_bike   2020-01-09 19:29:26 2020-01-09 19:32:17 Broadway & Belm~
## 4 C9A388~ docked_bike   2020-01-06 16:17:07 2020-01-06 16:25:56 Clark St & Rand~
## 5 943BC3~ docked_bike   2020-01-30 08:37:16 2020-01-30 08:42:48 Clinton St & La~
## 6 6D9C8A~ docked_bike   2020-01-10 12:33:05 2020-01-10 12:37:54 Wells St & Hubb~
## # ... with 8 more variables: end_station_name <chr>, member_casual <chr>,
## #   rideDate <date>, month <chr>, day <chr>, year <chr>, day_of_week <chr>,
## #   length_of_ride <drtn>

Divvy_Trips$length_of_ride <- as.numeric(Divvy_Trips$length_of_ride)
str(Divvy_Trips$length_of_ride)

##  num [1:1054942] 451 223 171 529 332 289 289 297 295 203 ...
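
Subtracting two date-times returns a `difftime` whose units R chooses automatically, so `as.numeric()` on the result is only guaranteed to be in seconds when the units are stated. A sketch with explicit units, using the first trip shown above (the variable names are illustrative):

```r
start <- as.POSIXct("2020-01-21 20:06:59", tz = "UTC")
end   <- as.POSIXct("2020-01-21 20:14:30", tz = "UTC")

# Units are picked automatically; for this single trip R chooses minutes
auto_diff <- end - start
units(auto_diff)

# Stating the units makes the numeric value unambiguous
secs <- as.numeric(difftime(end, start, units = "secs"))
secs
```

In the merged dataset the automatic choice happened to be seconds, as the `str()` output shows, but the explicit form does not depend on the data.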

#expressing the length of each ride as hours, minutes and seconds
Divvy_Trips$hour_minutes_of_ride <- hms::as_hms(Divvy_Trips$length_of_ride)
View(Divvy_Trips)

Removing rides with a length of 0 seconds or less

biketrip <- filter(Divvy_Trips,length_of_ride>0)

Average, minimum and maximum length of ride

#average, minimum and maximum length of ride, in seconds

biketrip%>%
  summarise(min_length=min(length_of_ride),max_length=max(length_of_ride),average_length=mean(length_of_ride))

## # A tibble: 1 x 3
##   min_length max_length average_length
##        <dbl>      <dbl>          <dbl>
## 1          1    9387024          1746.

View(biketrip)

Length of Ride by member_type

aggregate(length_of_ride~member_casual, data= biketrip,mean)

##   member_casual length_of_ride
## 1        casual      3601.5713
## 2        member       961.5074

aggregate(length_of_ride~member_casual, data= biketrip,median)

##   member_casual length_of_ride
## 1        casual           1486
## 2        member            652

aggregate(length_of_ride~member_casual, data= biketrip,max)

##   member_casual length_of_ride
## 1        casual        9387024
## 2        member        5627611

#this means that casual riders take longer rides than members: the mean is about 3602 seconds versus 962, and the median is 1486 seconds versus 652.
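
The gap between each group's mean and median suggests that the means are pulled up by a few extremely long rides (the maximum is over 9 million seconds). A sketch on made-up values (the `toy` data is illustrative only) shows how a single outlier moves the mean but not the median:

```r
toy <- data.frame(
  member_casual = c("casual", "casual", "casual", "member", "member", "member"),
  length_of_ride = c(1200, 1500, 900000, 600, 650, 700)
)

# One 900000-second ride dominates the casual mean
means <- aggregate(length_of_ride ~ member_casual, data = toy, FUN = mean)

# The median ignores the size of the outlier
medians <- aggregate(length_of_ride ~ member_casual, data = toy, FUN = median)

means
medians
```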

Sorting the data

#ordering the weekdays from Monday to Sunday and counting rides per day and rider type
biketrip$day_of_week<-ordered(biketrip$day_of_week,levels=c('Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday'))
biketrip%>%count(day_of_week,member_casual)

## # A tibble: 14 x 3
##    day_of_week member_casual      n
##    <ord>       <chr>          <int>
##  1 Monday      casual         32362
##  2 Monday      member        105619
##  3 Tuesday     casual         33618
##  4 Tuesday     member        117270
##  5 Wednesday   casual         32769
##  6 Wednesday   member        110409
##  7 Thursday    casual         35580
##  8 Thursday    member        111767
##  9 Friday      casual         39153
## 10 Friday      member        107493
## 11 Saturday    casual         72389
## 12 Saturday    member         97352
## 13 Sunday      casual         67378
## 14 Sunday      member         90829

#ordering the months from January to June and counting rides per month and rider type
biketrip$month<-ordered(biketrip$month,levels=c('January', 'February', 'March', 'April', 'May', 'June'))
biketrip%>%count(month,member_casual)

## # A tibble: 12 x 3
##    month    member_casual      n
##    <ord>    <chr>          <int>
##  1 January  casual          7785
##  2 January  member        136099
##  3 February casual         12860
##  4 February member        126715
##  5 March    casual         27625
##  6 March    member        115593
##  7 April    casual         23605
##  8 April    member         61112
##  9 May      casual         86838
## 10 May      member        113252
## 11 June     casual        154536
## 12 June     member        187968
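
`ordered()` does not sort rows; it fixes the order of the factor levels, which `count()` and plots then follow instead of alphabetical order. A sketch on a few illustrative values:

```r
days <- c("Sunday", "Monday", "Friday", "Monday")

# As plain characters, tabulation is alphabetical: Friday, Monday, Sunday
alphabetical <- names(table(days))

# With explicit levels, tabulation follows the calendar week
week_levels <- c("Monday", "Tuesday", "Wednesday", "Thursday",
                 "Friday", "Saturday", "Sunday")
days_ordered <- ordered(days, levels = week_levels)
calendar <- names(table(days_ordered))

# Ordered factors also support comparisons: Monday comes before Sunday
days_ordered[2] < days_ordered[1]
```

Any value not listed in `levels` becomes `NA`, which is worth checking when the names were generated in a non-English locale.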

biketrip%>%count(member_casual, rideable_type)

## # A tibble: 2 x 3
##   member_casual rideable_type      n
##   <chr>         <chr>          <int>
## 1 casual        docked_bike   313249
## 2 member        docked_bike   740739

#average, minimum and maximum length of ride over the 6 months
mean_r_length <-as.numeric(mean(biketrip$length_of_ride))/60
cat("The average ride length over 6 months is:", mean_r_length, "minutes")

## The average ride length over 6 months is: 29.1024 minutes

min_r_length <-as.numeric(min(biketrip$length_of_ride))/60
cat("The minimum ride length over 6 months is:", min_r_length, "minutes")

## The minimum ride length over 6 months is: 0.01666667 minutes

max_r_length <- as.numeric(max(biketrip$length_of_ride))/3600
cat("The maximum ride length over 6 months is:", max_r_length, "hours")

## The maximum ride length over 6 months is: 2607.507 hours
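
The unit conversions can be checked by hand from the summary figures reported earlier (1,746 seconds on average, 9,387,024 seconds at most). The final conversion to days is an added step, included only to show the scale of the longest ride:

```r
avg_seconds <- 1746       # rounded mean from the summary table
max_seconds <- 9387024    # maximum from the summary table

avg_minutes <- avg_seconds / 60      # about 29.1 minutes
max_hours   <- max_seconds / 3600    # about 2607.5 hours
max_days    <- max_hours / 24        # about 108.6 days
```

A ride lasting more than 100 days more plausibly reflects a bike that was not docked than a real trip, so such records may deserve separate handling.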

Visualization

For the visualization, two packages were installed and loaded: ggplot2 and corrplot.

biketrip%>%
  group_by(member_casual,day_of_week)%>%
  summarise(avg_ride_duration=mean(length_of_ride))%>%
  ggplot(mapping=aes(x=member_casual,y=avg_ride_duration,fill=day_of_week)) +
  geom_bar(position="dodge",stat = "identity") +
  facet_wrap(~day_of_week) +
  labs(title="Average ride length by day of week",y="Average ride length (seconds)")

## `summarise()` has grouped output by 'member_casual'. You can override using the
## `.groups` argument.
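
The message printed after `summarise()` is informational: the result stays grouped by `member_casual`. Passing `.groups = "drop"` returns an ungrouped tibble and silences it. A sketch on illustrative data (the `toy` tibble is an assumption; `.groups` requires dplyr 1.0.0 or later):

```r
library(dplyr)

toy <- tibble(
  member_casual = c("casual", "casual", "member", "member"),
  day_of_week = c("Monday", "Monday", "Monday", "Tuesday"),
  length_of_ride = c(1000, 2000, 600, 800)
)

# One row per rider type and day, with the grouping removed afterwards
avg_by_day <- toy %>%
  group_by(member_casual, day_of_week) %>%
  summarise(avg_ride_duration = mean(length_of_ride), .groups = "drop")

avg_by_day
```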

biketrip%>%
  group_by(member_casual,year)%>%
  summarise(Ridenumbers=n())%>%
  ggplot(mapping=aes(x=year,y=Ridenumbers,fill=member_casual)) +
  geom_bar(position="dodge",stat = "identity") +
  labs(title="Number of rides by year")

## `summarise()` has grouped output by 'member_casual'. You can override using the
## `.groups` argument.

str(biketrip$length_of_ride)

##  num [1:1053988] 451 223 171 529 332 289 289 297 295 203 ...

par(mfrow=c(1,1))


boxplot(length_of_ride ~ member_casual,
        data = biketrip,
        main = "Distribution of ride length by rider type",
        xlab = "Rider type",
        ylab = "Length of ride (seconds)",
        col = c("orange", "yellow"))

boxplot(as.integer(month) ~ member_casual,
        data = biketrip,
        main = "Month Vs Riders",
        xlab = "Casual Riders and Member Riders",
        ylab = "Month (1 = January, 6 = June)",
        col = c("pink", "pink1"))
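
With ride lengths ranging from 1 second to more than 9 million seconds, the boxes in a plot on the raw scale are likely to be squashed against the axis. A sketch of two base R options, hiding outliers with `outline = FALSE` or using a logarithmic axis, on made-up data (the `toy` frame is illustrative):

```r
toy <- data.frame(
  member_casual = rep(c("casual", "member"), each = 5),
  length_of_ride = c(900, 1200, 1500, 2400, 900000,
                     500, 600, 650, 700, 500000)
)

# Option 1: draw the boxes without the outlier points
boxplot(length_of_ride ~ member_casual, data = toy,
        outline = FALSE,
        xlab = "Rider type", ylab = "Length of ride (seconds)")

# Option 2: keep every point but use a logarithmic y axis
bp <- boxplot(length_of_ride ~ member_casual, data = toy,
              log = "y",
              xlab = "Rider type", ylab = "Length of ride (seconds, log scale)")

# Values that boxplot() flagged as outliers
bp$out
```

The first option keeps the scale in plain seconds but hides the extreme rides; the second keeps them visible at the cost of a less intuitive axis.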

Observations and trends

From the six months of Divvy trip data, the analysis found the following trends:

Members account for more rides than casual riders (741,207 versus 313,735), but casual riders take longer rides on average.
Members' rides are longer on Saturdays and Sundays than on other days.
Casual riders' rides are longer on Thursdays and Sundays than on any other day of the week.

Recommendations

Based on the analysis, the following recommendations are made:

Cyclistic can promote shorter rides among casual riders by offering incentives to those who complete more short rides and subscribe to a membership.
Cyclistic could run friendlier advertisements that encourage members to take longer rides.
Cyclistic can send members motivational quotes or messages to encourage them to keep riding, which might help members take longer rides.

How Does a Bike-Share Navigate Speedy Success?

Praise Chikunie

March 11, 2022