A bicycle-sharing system, or bike share program, is a shared transport service in which bicycles are made available to individuals on a short-term basis, either for a price or for free. Many bike share systems allow people to borrow a bike from a "dock" and return it at another dock belonging to the same system.
This case study looks at Cyclistic, a bike-share company in Chicago. It introduces the characters and team members involved and answers the key business questions by following the steps of the data analysis process: ask, prepare, process, analyze, share, and act.
In 2016, Cyclistic launched a successful bike-share offering. Since then, the program has grown to a fleet of 5,824 bicycles that are geotracked and locked into a network of 692 stations across Chicago. The bikes can be unlocked from one station and returned to any other station in the system anytime.
Until now, Cyclistic’s marketing strategy relied on building general awareness and appealing to broad consumer segments. One approach that helped make these things possible was the flexibility of its pricing plans: single-ride passes, full-day passes, and annual memberships. Customers who purchase single-ride or full-day passes are referred to as casual riders. Customers who purchase annual memberships are Cyclistic members.
Cyclistic: A bike-share program that features more than 5,800 bicycles and 600 docking stations. Cyclistic sets itself apart by also offering reclining bikes, hand tricycles, and cargo bikes, making bike-share more inclusive to people with disabilities and riders who can’t use a standard two-wheeled bike. The majority of riders opt for traditional bikes; about 8% of riders use the assistive options. Cyclistic users are more likely to ride for leisure, but about 30% use them to commute to work each day.
Lily Moreno: The director of marketing and manager. Moreno is responsible for the development of campaigns and initiatives to promote the bike-share program. These may include email, social media, and other channels.
Cyclistic marketing analytics team: A team of data analysts who are responsible for collecting, analyzing, and reporting data that helps guide Cyclistic marketing strategy.
The dataset was downloaded from the Divvy trip data.
The data was collected by Cyclistic. This analysis uses six months of data, from January 2020 to June 2020. The data was available in a zip folder and was downloaded to a personal computer. The files for the selected months were then merged into one dataset in R.
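The merge step described above can be sketched roughly as follows. This is only an illustration, assuming all files share the same columns; the two small CSV files and the names `tmp_dir`, `csv_files`, and `all_trips` are hypothetical stand-ins, not the real downloads.

```r
library(readr)
library(dplyr)

# Hypothetical stand-ins for the downloaded files, written to a temp folder
tmp_dir <- tempfile("divvy_demo_")
dir.create(tmp_dir)
write_csv(data.frame(ride_id = c("A1", "A2"),
                     member_casual = c("member", "casual")),
          file.path(tmp_dir, "trips_part1.csv"))
write_csv(data.frame(ride_id = "B1", member_casual = "member"),
          file.path(tmp_dir, "trips_part2.csv"))

# Read every CSV in the folder and stack the rows into one data frame
csv_files <- list.files(tmp_dir, pattern = "\\.csv$", full.names = TRUE)
trips_list <- lapply(csv_files, read_csv, show_col_types = FALSE)
all_trips <- bind_rows(trips_list)

nrow(all_trips)
```

`bind_rows()` matches columns by name, which may be more forgiving than `rbind()` if the column order differs between files.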
library(readr)
library(skimr)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(lubridate)
##
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
##
## date, intersect, setdiff, union
library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --
## v ggplot2 3.3.5 v purrr 0.3.4
## v tibble 3.1.6 v stringr 1.4.0
## v tidyr 1.2.0 v forcats 0.5.1
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x lubridate::as.difftime() masks base::as.difftime()
## x lubridate::date() masks base::date()
## x dplyr::filter() masks stats::filter()
## x lubridate::intersect() masks base::intersect()
## x dplyr::lag() masks stats::lag()
## x lubridate::setdiff() masks base::setdiff()
## x lubridate::union() masks base::union()
library(ggplot2)
library(corrplot)
## corrplot 0.92 loaded
Divvy_Trips_Q1 <- read_csv("Divvy_Trips_2020_Q1.csv")
## Rows: 426887 Columns: 13
## -- Column specification --------------------------------------------------------
## Delimiter: ","
## chr (5): ride_id, rideable_type, start_station_name, end_station_name, memb...
## dbl (6): start_station_id, end_station_id, start_lat, start_lng, end_lat, e...
## dttm (2): started_at, ended_at
##
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.
Divvy_Trips2 <- read_csv("202004-divvy-tripdata.csv")
## Rows: 84776 Columns: 13
## -- Column specification --------------------------------------------------------
## Delimiter: ","
## chr (5): ride_id, rideable_type, start_station_name, end_station_name, memb...
## dbl (6): start_station_id, end_station_id, start_lat, start_lng, end_lat, e...
## dttm (2): started_at, ended_at
##
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.
Divvy_Trips3 <- read_csv("202005-divvy-tripdata.csv")
## Rows: 200274 Columns: 13
## -- Column specification --------------------------------------------------------
## Delimiter: ","
## chr (5): ride_id, rideable_type, start_station_name, end_station_name, memb...
## dbl (6): start_station_id, end_station_id, start_lat, start_lng, end_lat, e...
## dttm (2): started_at, ended_at
##
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.
Divvy_Trips4 <- read_csv("202006-divvy-tripdata.csv")
## Rows: 343005 Columns: 13
## -- Column specification --------------------------------------------------------
## Delimiter: ","
## chr (5): ride_id, rideable_type, start_station_name, end_station_name, memb...
## dbl (6): start_station_id, end_station_id, start_lat, start_lng, end_lat, e...
## dttm (2): started_at, ended_at
##
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.
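#Checking whether each dataset object is NULL (is.null() does not detect missing values inside a data frame; those are counted with colSums(is.na()) after the merge)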
is.null(Divvy_Trips_Q1)
## [1] FALSE
is.null(Divvy_Trips2)
## [1] FALSE
is.null(Divvy_Trips3)
## [1] FALSE
is.null(Divvy_Trips4)
## [1] FALSE
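Since `is.null()` only tests whether an object is `NULL`, a per-column count of `NA` values is usually the more informative check. A minimal sketch on a hypothetical toy data frame (`toy_trips` is made up for illustration and is not part of the Divvy data):

```r
# Hypothetical toy data with one missing end station
toy_trips <- data.frame(
  ride_id = c("A1", "A2", "A3"),
  end_station_name = c("Clark St & Leland Ave", NA, "Wilton Ave & Belmont Ave"),
  stringsAsFactors = FALSE
)

is.null(toy_trips)   # the object exists, so this says nothing about NA values
anyNA(toy_trips)     # checks whether any value in the data frame is missing

# Number of missing values in each column
na_counts <- colSums(is.na(toy_trips))
na_counts
```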
#checking for duplicate data
sum(duplicated(Divvy_Trips_Q1))
## [1] 0
sum(duplicated(Divvy_Trips2))
## [1] 0
sum(duplicated(Divvy_Trips3))
## [1] 0
sum(duplicated(Divvy_Trips4))
## [1] 0
#start_station_id and end_station_id were read as doubles rather than integers
#merging the four datasets into one
Divvy_Trips <- rbind(Divvy_Trips_Q1, Divvy_Trips2, Divvy_Trips3, Divvy_Trips4)
View(Divvy_Trips)
str(Divvy_Trips)
## spec_tbl_df [1,054,942 x 13] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ ride_id : chr [1:1054942] "EACB19130B0CDA4A" "8FED874C809DC021" "789F3C21E472CA96" "C9A388DAC6ABF313" ...
## $ rideable_type : chr [1:1054942] "docked_bike" "docked_bike" "docked_bike" "docked_bike" ...
## $ started_at : POSIXct[1:1054942], format: "2020-01-21 20:06:59" "2020-01-30 14:22:39" ...
## $ ended_at : POSIXct[1:1054942], format: "2020-01-21 20:14:30" "2020-01-30 14:26:22" ...
## $ start_station_name: chr [1:1054942] "Western Ave & Leland Ave" "Clark St & Montrose Ave" "Broadway & Belmont Ave" "Clark St & Randolph St" ...
## $ start_station_id : num [1:1054942] 239 234 296 51 66 212 96 96 212 38 ...
## $ end_station_name : chr [1:1054942] "Clark St & Leland Ave" "Southport Ave & Irving Park Rd" "Wilton Ave & Belmont Ave" "Fairbanks Ct & Grand Ave" ...
## $ end_station_id : num [1:1054942] 326 318 117 24 212 96 212 212 96 100 ...
## $ start_lat : num [1:1054942] 42 42 41.9 41.9 41.9 ...
## $ start_lng : num [1:1054942] -87.7 -87.7 -87.6 -87.6 -87.6 ...
## $ end_lat : num [1:1054942] 42 42 41.9 41.9 41.9 ...
## $ end_lng : num [1:1054942] -87.7 -87.7 -87.7 -87.6 -87.6 ...
## $ member_casual : chr [1:1054942] "member" "member" "member" "member" ...
## - attr(*, "spec")=
## .. cols(
## .. ride_id = col_character(),
## .. rideable_type = col_character(),
## .. started_at = col_datetime(format = ""),
## .. ended_at = col_datetime(format = ""),
## .. start_station_name = col_character(),
## .. start_station_id = col_double(),
## .. end_station_name = col_character(),
## .. end_station_id = col_double(),
## .. start_lat = col_double(),
## .. start_lng = col_double(),
## .. end_lat = col_double(),
## .. end_lng = col_double(),
## .. member_casual = col_character()
## .. )
## - attr(*, "problems")=<externalptr>
#Checking for missing (NA) values in each column of the merged dataset
colSums(is.na(Divvy_Trips))
## ride_id rideable_type started_at ended_at
## 0 0 0 0
## start_station_name start_station_id end_station_name end_station_id
## 0 0 889 889
## start_lat start_lng end_lat end_lng
## 0 0 889 889
## member_casual
## 0
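#Replacing empty station names with "None" (this matches empty strings only; NA values are left unchanged, as the skim() summary further down shows)
Divvy_Trips$start_station_name[Divvy_Trips$start_station_name ==""]<- "None"
Divvy_Trips$end_station_name[Divvy_Trips$end_station_name ==""]<- "None"
#Dropping the coordinate and station id columns, which are not used in the rest of the analysis
Divvy_Trips <- subset(Divvy_Trips, select = -c(start_lat, start_lng, end_lat, end_lng, start_station_id, end_station_id))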
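The station-name replacement above compares against the empty string, which does not match `NA`. If the intent is to label missing names as well, something like the following sketch might be used; `station_names` is a hypothetical vector, not the real column.

```r
# Hypothetical station names containing one NA and one empty string
station_names <- c("Clark St & Leland Ave", NA, "", "Wilton Ave & Belmont Ave")

# Comparing with "" alone leaves the NA entry untouched
only_empty <- station_names
only_empty[!is.na(only_empty) & only_empty == ""] <- "None"

# Testing for NA as well relabels both kinds of missing name
filled <- station_names
filled[is.na(filled) | filled == ""] <- "None"

filled
```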
n_distinct(Divvy_Trips$member_casual) #this means that member_casual has two distinct values, member and casual
## [1] 2
n_distinct(Divvy_Trips$ride_id)
## [1] 1054942
n_distinct(Divvy_Trips$rideable_type)
## [1] 1
table(Divvy_Trips['member_casual'])
##
## casual member
## 313735 741207
Divvy_Trips%>%
count(member_casual)
## # A tibble: 2 x 2
## member_casual n
## <chr> <int>
## 1 casual 313735
## 2 member 741207
#Checking how many rideable types are in the dataset
table(Divvy_Trips['rideable_type'])
##
## docked_bike
## 1054942
Divvy_Trips%>%
count(rideable_type)
## # A tibble: 1 x 2
## rideable_type n
## <chr> <int>
## 1 docked_bike 1054942
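#Checking how many rides of each rideable type were taken by members and casual riders
#(indexing with ['rideable_type', 'member_casual'] asks for a row and a column, so the two columns are passed to table() separately)
table(Divvy_Trips$rideable_type, Divvy_Trips$member_casual)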
Divvy_Trips%>%
count(rideable_type, member_casual)
## # A tibble: 2 x 3
## rideable_type member_casual n
## <chr> <chr> <int>
## 1 docked_bike casual 313735
## 2 docked_bike member 741207
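A two-way count like the one above can also be produced in base R by passing the two columns to `table()` as separate vectors. A small sketch on a hypothetical toy data frame (`toy_trips` and its values are invented for illustration):

```r
# Hypothetical toy data: three rides, all on docked bikes
toy_trips <- data.frame(
  rideable_type = c("docked_bike", "docked_bike", "docked_bike"),
  member_casual = c("member", "casual", "member")
)

# Rows are rideable types, columns are rider types
type_by_rider <- table(toy_trips$rideable_type, toy_trips$member_casual)
type_by_rider
```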
skim(Divvy_Trips)
| Name | Divvy_Trips |
| Number of rows | 1054942 |
| Number of columns | 7 |
| _______________________ | |
| Column type frequency: | |
| character | 5 |
| POSIXct | 2 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| ride_id | 0 | 1 | 16 | 16 | 0 | 1054942 | 0 |
| rideable_type | 0 | 1 | 11 | 11 | 0 | 1 | 0 |
| start_station_name | 0 | 1 | 5 | 43 | 0 | 621 | 0 |
| end_station_name | 889 | 1 | 5 | 43 | 0 | 622 | 0 |
| member_casual | 0 | 1 | 6 | 6 | 0 | 2 | 0 |
Variable type: POSIXct
| skim_variable | n_missing | complete_rate | min | max | median | n_unique |
|---|---|---|---|---|---|---|
| started_at | 0 | 1 | 2020-01-01 00:04:44 | 2020-06-30 23:59:54 | 2020-05-02 19:56:55 | 961848 |
| ended_at | 0 | 1 | 2020-01-01 00:10:54 | 2020-07-03 20:26:15 | 2020-05-02 20:33:56 | 960982 |
#this shows that every ride in this period used a docked bike, and that members took more rides than casual riders.
table(Divvy_Trips$member_casual, useNA = "ifany") #this shows that member_casual has no missing values.
##
## casual member
## 313735 741207
Divvy_Trips$rideDate<-as.Date(Divvy_Trips$started_at)
Divvy_Trips$started_at<-as_datetime(Divvy_Trips$started_at)
Divvy_Trips$ended_at<-as_datetime(Divvy_Trips$ended_at)
# Adding new columns (month, day, year and day of week) derived from rideDate.
Divvy_Trips$month<-format(as.Date(Divvy_Trips$rideDate),"%B")
Divvy_Trips$day <-format(as.Date(Divvy_Trips$rideDate),"%d")
Divvy_Trips$year<-format(as.Date(Divvy_Trips$rideDate),"%Y")
Divvy_Trips$day_of_week<-format(as.Date(Divvy_Trips$rideDate),"%A")
# View the column names
colnames(Divvy_Trips)
## [1] "ride_id" "rideable_type" "started_at"
## [4] "ended_at" "start_station_name" "end_station_name"
## [7] "member_casual" "rideDate" "month"
## [10] "day" "year" "day_of_week"
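# Ride length in seconds; difftime() with units="secs" fixes the unit explicitly
Divvy_Trips <- Divvy_Trips%>%
  mutate(length_of_ride=difftime(ended_at, started_at, units="secs"))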
head(Divvy_Trips)
## # A tibble: 6 x 13
## ride_id rideable_type started_at ended_at start_station_n~
## <chr> <chr> <dttm> <dttm> <chr>
## 1 EACB19~ docked_bike 2020-01-21 20:06:59 2020-01-21 20:14:30 Western Ave & L~
## 2 8FED87~ docked_bike 2020-01-30 14:22:39 2020-01-30 14:26:22 Clark St & Mont~
## 3 789F3C~ docked_bike 2020-01-09 19:29:26 2020-01-09 19:32:17 Broadway & Belm~
## 4 C9A388~ docked_bike 2020-01-06 16:17:07 2020-01-06 16:25:56 Clark St & Rand~
## 5 943BC3~ docked_bike 2020-01-30 08:37:16 2020-01-30 08:42:48 Clinton St & La~
## 6 6D9C8A~ docked_bike 2020-01-10 12:33:05 2020-01-10 12:37:54 Wells St & Hubb~
## # ... with 8 more variables: end_station_name <chr>, member_casual <chr>,
## # rideDate <date>, month <chr>, day <chr>, year <chr>, day_of_week <chr>,
## # length_of_ride <drtn>
Divvy_Trips$length_of_ride <- as.numeric(Divvy_Trips$length_of_ride)
str(Divvy_Trips$length_of_ride)
## num [1:1054942] 451 223 171 529 332 289 289 297 295 203 ...
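The ride lengths printed above are in seconds. When two date-times are subtracted directly, R chooses the unit automatically, whereas `difftime()` with an explicit `units` argument fixes it. A small sketch using the timestamps of the first ride in the data (the `tz = "UTC"` setting and the variable names are assumptions for illustration):

```r
started <- as.POSIXct("2020-01-21 20:06:59", tz = "UTC")
ended   <- as.POSIXct("2020-01-21 20:14:30", tz = "UTC")

# Plain subtraction picks a unit based on the size of the difference
auto_diff <- ended - started
units(auto_diff)

# An explicit unit gives seconds regardless of the size of the difference
ride_secs <- as.numeric(difftime(ended, started, units = "secs"))
ride_secs
```

Fixing the unit matters most when the result is later converted with `as.numeric()`, because the number alone no longer says whether it counts seconds, minutes, or hours.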
#checking for hours and minutes used to complete the ride
Divvy_Trips$hour_minutes_of_ride <- hms::as_hms(Divvy_Trips$length_of_ride)
View(Divvy_Trips)
biketrip <- filter(Divvy_Trips,length_of_ride>0)
##average, minimum and maximum length of ride
biketrip%>%
summarise(min_length=min(length_of_ride),max_length=max(length_of_ride),average_length=mean(length_of_ride))
## # A tibble: 1 x 3
## min_length max_length average_length
## <dbl> <dbl> <dbl>
## 1 1 9387024 1746.
View(biketrip)
aggregate(length_of_ride~member_casual, data= biketrip,mean)
## member_casual length_of_ride
## 1 casual 3601.5713
## 2 member 961.5074
aggregate(length_of_ride~member_casual, data= biketrip,median)
## member_casual length_of_ride
## 1 casual 1486
## 2 member 652
aggregate(length_of_ride~member_casual, data= biketrip,max)
## member_casual length_of_ride
## 1 casual 9387024
## 2 member 5627611
#this means that casual riders take longer rides than member riders, on average and at the median.
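The three `aggregate()` calls above could also be combined into a single grouped summary with dplyr. A sketch on hypothetical toy values (`toy_rides` and its numbers are made up and are not taken from the Divvy data):

```r
library(dplyr)

# Hypothetical ride lengths in seconds
toy_rides <- data.frame(
  member_casual = c("casual", "casual", "casual", "member", "member", "member"),
  length_of_ride = c(600, 1500, 9000, 300, 650, 1000)
)

# One row per rider type, with mean, median and maximum side by side
ride_summary <- toy_rides %>%
  group_by(member_casual) %>%
  summarise(
    mean_length = mean(length_of_ride),
    median_length = median(length_of_ride),
    max_length = max(length_of_ride),
    .groups = "drop"
  )

ride_summary
```

Seeing the mean next to the median makes it easier to notice when a few very long rides pull the average up, which appears to be the case for casual riders in the results above.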
#ordering the days of the week from Monday to Sunday
biketrip$day_of_week<-ordered(biketrip$day_of_week,levels=c('Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday'))
biketrip%>%count(day_of_week,member_casual)
## # A tibble: 14 x 3
## day_of_week member_casual n
## <ord> <chr> <int>
## 1 Monday casual 32362
## 2 Monday member 105619
## 3 Tuesday casual 33618
## 4 Tuesday member 117270
## 5 Wednesday casual 32769
## 6 Wednesday member 110409
## 7 Thursday casual 35580
## 8 Thursday member 111767
## 9 Friday casual 39153
## 10 Friday member 107493
## 11 Saturday casual 72389
## 12 Saturday member 97352
## 13 Sunday casual 67378
## 14 Sunday member 90829
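The weekday ordering above is set by typing the level names by hand. As an alternative sketch, lubridate's `wday()` can return an ordered factor directly; the dates below are arbitrary examples, and the printed labels depend on the system locale.

```r
library(lubridate)

ride_dates <- as.Date(c("2020-01-21", "2020-01-25", "2020-01-26"))

# label = TRUE returns an ordered factor, abbr = FALSE uses full day names,
# and week_start = 1 makes Monday the first level
weekday <- wday(ride_dates, label = TRUE, abbr = FALSE, week_start = 1)

levels(weekday)
as.integer(weekday)
```

Because the levels come from the function rather than from hard-coded strings, this approach may avoid mismatches when day names are spelled differently in another locale.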
#ordering the months from January to June
biketrip$month<-ordered(biketrip$month,levels=c('January', 'February', 'March', 'April', 'May', 'June'))
biketrip%>%count(month,member_casual)
## # A tibble: 12 x 3
## month member_casual n
## <ord> <chr> <int>
## 1 January casual 7785
## 2 January member 136099
## 3 February casual 12860
## 4 February member 126715
## 5 March casual 27625
## 6 March member 115593
## 7 April casual 23605
## 8 April member 61112
## 9 May casual 86838
## 10 May member 113252
## 11 June casual 154536
## 12 June member 187968
biketrip%>%count(member_casual, rideable_type)
## # A tibble: 2 x 3
## member_casual rideable_type n
## <chr> <chr> <int>
## 1 casual docked_bike 313249
## 2 member docked_bike 740739
##average, max and min length of ride for 6 months
mean_r_length <-as.numeric(mean(biketrip$length_of_ride))/60
cat("The average ride length over 6 months is;", mean_r_length, "minutes")
## The average ride length over 6 months is; 29.1024 minutes
min_r_length <-as.numeric(min(biketrip$length_of_ride))/60
cat("The minimum ride length over 6 months is;", min_r_length, "minutes")
## The minimum ride length over 6 months is; 0.01666667 minutes
max_r_length <- as.numeric(max(biketrip$length_of_ride))/3600
cat("The maximum ride length over 6 months is;", max_r_length, "hours")
## The maximum ride length over 6 months is; 2607.507 hours
For the visualization, two packages were installed and loaded: ggplot2 and corrplot.
biketrip%>%
  group_by(member_casual,day_of_week)%>%
  summarise(mean_ride_duration=mean(length_of_ride))%>%
  ggplot(mapping=aes(x=member_casual,y=mean_ride_duration,fill=day_of_week)) +
  geom_bar(position="dodge",stat = "identity") +
  facet_wrap(~day_of_week) +
  labs(title="Average ride length by day of week", y="Average ride length (seconds)")
## `summarise()` has grouped output by 'member_casual'. You can override using the
## `.groups` argument.
biketrip%>%
  group_by(member_casual,year)%>%
  summarise(Ridenumbers=n())%>%
  ggplot(mapping=aes(x=year,y=Ridenumbers,fill=member_casual)) +
  geom_bar(position="dodge",stat = "identity") +
  labs(title="Number of rides by year and rider type")
## `summarise()` has grouped output by 'member_casual'. You can override using the
## `.groups` argument.
str(biketrip$length_of_ride)
## num [1:1053988] 451 223 171 529 332 289 289 297 295 203 ...
par(mfrow=c(1,1))
boxplot(length_of_ride ~ member_casual,
        data = biketrip,
        main = "Distribution of ride length by rider type",
        xlab = "member_casual",
        ylab = "length of ride (seconds)",
        col = c("orange", "yellow"))
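# month is an ordered factor, so it is converted to its integer code before plotting
boxplot(as.integer(month) ~ member_casual,
        data = biketrip,
        main = "Month Vs Riders",
        xlab = "Member Riders and Casual Riders",
        ylab = "Month (1 = January, 6 = June)",
        col = c("pink", "pink1"))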
From the six months of Divvy trip data, the analysis identified the following trends:
Members took more rides than casual riders (741,207 versus 313,735), but casual riders' rides are longer on average (about 3,602 seconds versus 962 seconds).
Members have longer ride lengths on Saturdays and Sundays than on other days.
Casual riders have longer ride lengths on Thursdays and Sundays than on any other day of the week.
Based on the analysis, the following recommendations are made:
Cyclistic can promote shorter rides among casual riders by offering incentives to those who complete more short rides and subscribe to a membership.
Cyclistic could use friendlier advertisements to encourage member riders to take longer rides.
Cyclistic can send members motivational quotes or messages that encourage them to keep riding, as this might help members take longer rides.