Welcome to the Cyclistic bike-share analysis case study! In this case study, I performed many real-world tasks of a junior data analyst for a fictional company, Cyclistic. In order to answer the key business questions, I followed the steps of the data analysis process: Ask, Prepare, Process, Analyze, Share and Act.
Cyclistic is a bike-share company in Chicago. The director of marketing believes the company’s future success depends on maximizing the number of annual memberships. Therefore, our team wants to understand how casual riders and annual members use Cyclistic bikes differently. From these insights, our team will design a new marketing strategy to convert casual riders into annual members. But first, Cyclistic executives must approve the recommendations, so they must be backed up with compelling data insights and professional data visualizations.
● Cyclistic: A bike-share program that features more than 5,800 bicycles and 600 docking stations. Cyclistic sets itself apart by also offering reclining bikes, hand tricycles, and cargo bikes, making bike-share more inclusive to people with disabilities and riders who can’t use a standard two-wheeled bike. The majority of riders opt for traditional bikes; about 8% of riders use the assistive options. Cyclistic users are more likely to ride for leisure, but about 30% use them to commute to work each day.
● Lily Moreno: The director of marketing and your manager. Moreno is responsible for the development of campaigns and initiatives to promote the bike-share program. These may include email, social media, and other channels.
● Cyclistic marketing analytics team: A team of data analysts who are responsible for collecting, analyzing, and reporting data that helps guide Cyclistic marketing strategy. You joined this team six months ago and have been busy learning about Cyclistic’s mission and business goals — as well as how you, as a junior data analyst, can help Cyclistic achieve them.
● Cyclistic executive team: The notoriously detail-oriented executive team will decide whether to approve the recommended marketing program.
Key tasks
The primary question, which will guide the future marketing program, is: how do annual members and casual riders use Cyclistic bikes differently? To answer it, we need to study the bike usage behavior of both casual riders and annual members. We then use the insights gained to brainstorm recommendations that will lead to the conversion of casual riders into annual members.
This phase addresses where the data is located, how it is organized, and whether it has bias or credibility issues, along with its licensing, privacy, security and accessibility. Does the data ROCCC?
For this case study, Cyclistic’s trip data for the most recent 12 months, February 2021 to January 2022, is used to analyze and identify trends. The data has been made available by Motivate International Inc. under this license. This is a public dataset which can be downloaded from here. The dataset was provided as zipped CSV files, each containing one month of trip data. ROCCC stands for Reliable, Original, Comprehensive, Current and Cited. Because Cyclistic is a fictional company, we cannot vouch for the data’s reliability or for whether it is cited. As for originality, the data was collected by the company itself. Sensitive information, such as the names and addresses of users, was not provided. The data is licensed, private, secure and accessible. Each data file consists of 13 columns and between roughly 50,000 and 820,000 rows.
Loading required R Packages for data preparation:
library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --
## v ggplot2 3.3.5 v purrr 0.3.4
## v tibble 3.1.6 v dplyr 1.0.8
## v tidyr 1.2.0 v stringr 1.4.0
## v readr 2.1.2 v forcats 0.5.1
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(lubridate)
##
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
##
## date, intersect, setdiff, union
library(dplyr)
library(ggplot2)
library(ggmap)
## Google's Terms of Service: https://cloud.google.com/maps-platform/terms/.
## Please cite ggmap if you use it! See citation("ggmap") for details.
library(geosphere)
Reading all the data files.
Feb <- read.csv("feb_2021.csv")
Mar <- read.csv('mar_2021.csv')
Apr <- read.csv('apr_2021.csv')
May <- read.csv('may_2021.csv')
Jun <- read.csv('jun_2021.csv')
Jul <- read.csv('jul_2021.csv')
Aug <- read.csv('aug_2021.csv')
Sep <- read.csv('sep_2021.csv')
Oct <- read.csv('oct_2021.csv')
Nov <- read.csv('nov_2021.csv')
Dec <- read.csv('dec_2021.csv')
Jan <- read.csv('jan_2022.csv')
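The twelve read.csv() calls above can also be written as a loop over the file names. The sketch below is one possible way to do that, not the method used in this analysis. It writes two tiny placeholder CSV files to a temporary folder (their contents are made up for illustration), reads every file with lapply(), and stacks the results with rbind(). The names `data_dir`, `file_names`, `monthly` and `all_trips` are hypothetical.

```r
# Hypothetical stand-in data: two small monthly files in a temporary folder.
data_dir <- tempfile("trips_")
dir.create(data_dir)
write.csv(data.frame(ride_id = c("A1", "A2"), member_casual = c("member", "casual")),
          file.path(data_dir, "feb_2021.csv"), row.names = FALSE)
write.csv(data.frame(ride_id = "B1", member_casual = "casual"),
          file.path(data_dir, "mar_2021.csv"), row.names = FALSE)

# File names listed in calendar order, following the naming used above.
file_names <- c("feb_2021.csv", "mar_2021.csv")

# Read each file into a list of data frames, named after the file.
monthly <- lapply(file.path(data_dir, file_names), read.csv)
names(monthly) <- sub("\\.csv$", "", file_names)

# Stack the monthly data frames into one.
all_trips <- do.call(rbind, monthly)
nrow(all_trips)
```

Keeping the monthly data frames in a named list also makes it easier to run the same check on every month.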
Inspecting the structure of all the data files to ensure equal columns and appropriate datatypes.
colnames(Jan)
## [1] "ride_id" "rideable_type" "started_at"
## [4] "ended_at" "start_station_name" "start_station_id"
## [7] "end_station_name" "end_station_id" "start_lat"
## [10] "start_lng" "end_lat" "end_lng"
## [13] "member_casual"
str(Jan)
## 'data.frame': 103770 obs. of 13 variables:
## $ ride_id : chr "C2F7DD78E82EC875" "A6CF8980A652D272" "BD0F91DFF741C66D" "CBB80ED419105406" ...
## $ rideable_type : chr "electric_bike" "electric_bike" "classic_bike" "classic_bike" ...
## $ started_at : chr "2022-01-13 11:59:47" "2022-01-10 08:41:56" "2022-01-25 04:53:40" "2022-01-04 00:18:04" ...
## $ ended_at : chr "2022-01-13 12:02:44" "2022-01-10 08:46:17" "2022-01-25 04:58:01" "2022-01-04 00:33:00" ...
## $ start_station_name: chr "Glenwood Ave & Touhy Ave" "Glenwood Ave & Touhy Ave" "Sheffield Ave & Fullerton Ave" "Clark St & Bryn Mawr Ave" ...
## $ start_station_id : chr "525" "525" "TA1306000016" "KA1504000151" ...
## $ end_station_name : chr "Clark St & Touhy Ave" "Clark St & Touhy Ave" "Greenview Ave & Fullerton Ave" "Paulina St & Montrose Ave" ...
## $ end_station_id : chr "RP-007" "RP-007" "TA1307000001" "TA1309000021" ...
## $ start_lat : num 42 42 41.9 42 41.9 ...
## $ start_lng : num -87.7 -87.7 -87.7 -87.7 -87.6 ...
## $ end_lat : num 42 42 41.9 42 41.9 ...
## $ end_lng : num -87.7 -87.7 -87.7 -87.7 -87.6 ...
## $ member_casual : chr "casual" "casual" "member" "casual" ...
colnames(Feb)
## [1] "ride_id" "rideable_type" "started_at"
## [4] "ended_at" "start_station_name" "start_station_id"
## [7] "end_station_name" "end_station_id" "start_lat"
## [10] "start_lng" "end_lat" "end_lng"
## [13] "member_casual"
str(Feb)
## 'data.frame': 49622 obs. of 13 variables:
## $ ride_id : chr "89E7AA6C29227EFF" "0FEFDE2603568365" "E6159D746B2DBB91" "B32D3199F1C2E75B" ...
## $ rideable_type : chr "classic_bike" "classic_bike" "electric_bike" "classic_bike" ...
## $ started_at : chr "2021-02-12 16:14:56" "2021-02-14 17:52:38" "2021-02-09 19:10:18" "2021-02-02 17:49:41" ...
## $ ended_at : chr "2021-02-12 16:21:43" "2021-02-14 18:12:09" "2021-02-09 19:19:10" "2021-02-02 17:54:06" ...
## $ start_station_name: chr "Glenwood Ave & Touhy Ave" "Glenwood Ave & Touhy Ave" "Clark St & Lake St" "Wood St & Chicago Ave" ...
## $ start_station_id : chr "525" "525" "KA1503000012" "637" ...
## $ end_station_name : chr "Sheridan Rd & Columbia Ave" "Bosworth Ave & Howard St" "State St & Randolph St" "Honore St & Division St" ...
## $ end_station_id : chr "660" "16806" "TA1305000029" "TA1305000034" ...
## $ start_lat : num 42 42 41.9 41.9 41.8 ...
## $ start_lng : num -87.7 -87.7 -87.6 -87.7 -87.6 ...
## $ end_lat : num 42 42 41.9 41.9 41.8 ...
## $ end_lng : num -87.7 -87.7 -87.6 -87.7 -87.6 ...
## $ member_casual : chr "member" "casual" "member" "member" ...
colnames(Mar)
## [1] "ride_id" "rideable_type" "started_at"
## [4] "ended_at" "start_station_name" "start_station_id"
## [7] "end_station_name" "end_station_id" "start_lat"
## [10] "start_lng" "end_lat" "end_lng"
## [13] "member_casual"
str(Mar)
## 'data.frame': 228496 obs. of 13 variables:
## $ ride_id : chr "CFA86D4455AA1030" "30D9DC61227D1AF3" "846D87A15682A284" "994D05AA75A168F2" ...
## $ rideable_type : chr "classic_bike" "classic_bike" "classic_bike" "classic_bike" ...
## $ started_at : chr "2021-03-16 08:32:30" "2021-03-28 01:26:28" "2021-03-11 21:17:29" "2021-03-11 13:26:42" ...
## $ ended_at : chr "2021-03-16 08:36:34" "2021-03-28 01:36:55" "2021-03-11 21:33:53" "2021-03-11 13:55:41" ...
## $ start_station_name: chr "Humboldt Blvd & Armitage Ave" "Humboldt Blvd & Armitage Ave" "Shields Ave & 28th Pl" "Winthrop Ave & Lawrence Ave" ...
## $ start_station_id : chr "15651" "15651" "15443" "TA1308000021" ...
## $ end_station_name : chr "Stave St & Armitage Ave" "Central Park Ave & Bloomingdale Ave" "Halsted St & 35th St" "Broadway & Sheridan Rd" ...
## $ end_station_id : chr "13266" "18017" "TA1308000043" "13323" ...
## $ start_lat : num 41.9 41.9 41.8 42 42 ...
## $ start_lng : num -87.7 -87.7 -87.6 -87.7 -87.7 ...
## $ end_lat : num 41.9 41.9 41.8 42 42.1 ...
## $ end_lng : num -87.7 -87.7 -87.6 -87.6 -87.7 ...
## $ member_casual : chr "casual" "casual" "casual" "casual" ...
colnames(Apr)
## [1] "ride_id" "rideable_type" "started_at"
## [4] "ended_at" "start_station_name" "start_station_id"
## [7] "end_station_name" "end_station_id" "start_lat"
## [10] "start_lng" "end_lat" "end_lng"
## [13] "member_casual"
str(Apr)
## 'data.frame': 337230 obs. of 13 variables:
## $ ride_id : chr "6C992BD37A98A63F" "1E0145613A209000" "E498E15508A80BAD" "1887262AD101C604" ...
## $ rideable_type : chr "classic_bike" "docked_bike" "docked_bike" "classic_bike" ...
## $ started_at : chr "2021-04-12 18:25:36" "2021-04-27 17:27:11" "2021-04-03 12:42:45" "2021-04-17 09:17:42" ...
## $ ended_at : chr "2021-04-12 18:56:55" "2021-04-27 18:31:29" "2021-04-07 11:40:24" "2021-04-17 09:42:48" ...
## $ start_station_name: chr "State St & Pearson St" "Dorchester Ave & 49th St" "Loomis Blvd & 84th St" "Honore St & Division St" ...
## $ start_station_id : chr "TA1307000061" "KA1503000069" "20121" "TA1305000034" ...
## $ end_station_name : chr "Southport Ave & Waveland Ave" "Dorchester Ave & 49th St" "Loomis Blvd & 84th St" "Southport Ave & Waveland Ave" ...
## $ end_station_id : chr "13235" "KA1503000069" "20121" "13235" ...
## $ start_lat : num 41.9 41.8 41.7 41.9 41.7 ...
## $ start_lng : num -87.6 -87.6 -87.7 -87.7 -87.7 ...
## $ end_lat : num 41.9 41.8 41.7 41.9 41.7 ...
## $ end_lng : num -87.7 -87.6 -87.7 -87.7 -87.7 ...
## $ member_casual : chr "member" "casual" "casual" "member" ...
colnames(May)
## [1] "ride_id" "rideable_type" "started_at"
## [4] "ended_at" "start_station_name" "start_station_id"
## [7] "end_station_name" "end_station_id" "start_lat"
## [10] "start_lng" "end_lat" "end_lng"
## [13] "member_casual"
str(May)
## 'data.frame': 531633 obs. of 13 variables:
## $ ride_id : chr "C809ED75D6160B2A" "DD59FDCE0ACACAF3" "0AB83CB88C43EFC2" "7881AC6D39110C60" ...
## $ rideable_type : chr "electric_bike" "electric_bike" "electric_bike" "electric_bike" ...
## $ started_at : chr "2021-05-30 11:58:15" "2021-05-30 11:29:14" "2021-05-30 14:24:01" "2021-05-30 14:25:51" ...
## $ ended_at : chr "2021-05-30 12:10:39" "2021-05-30 12:14:09" "2021-05-30 14:25:13" "2021-05-30 14:41:04" ...
## $ start_station_name: chr "" "" "" "" ...
## $ start_station_id : chr "" "" "" "" ...
## $ end_station_name : chr "" "" "" "" ...
## $ end_station_id : chr "" "" "" "" ...
## $ start_lat : num 41.9 41.9 41.9 41.9 41.9 ...
## $ start_lng : num -87.6 -87.6 -87.7 -87.7 -87.7 ...
## $ end_lat : num 41.9 41.8 41.9 41.9 41.9 ...
## $ end_lng : num -87.6 -87.6 -87.7 -87.7 -87.7 ...
## $ member_casual : chr "casual" "casual" "casual" "casual" ...
colnames(Jun)
## [1] "ride_id" "rideable_type" "started_at"
## [4] "ended_at" "start_station_name" "start_station_id"
## [7] "end_station_name" "end_station_id" "start_lat"
## [10] "start_lng" "end_lat" "end_lng"
## [13] "member_casual"
str(Jun)
## 'data.frame': 729595 obs. of 13 variables:
## $ ride_id : chr "99FEC93BA843FB20" "06048DCFC8520CAF" "9598066F68045DF2" "B03C0FE48C412214" ...
## $ rideable_type : chr "electric_bike" "electric_bike" "electric_bike" "electric_bike" ...
## $ started_at : chr "2021-06-13 14:31:28" "2021-06-04 11:18:02" "2021-06-04 09:49:35" "2021-06-03 19:56:05" ...
## $ ended_at : chr "2021-06-13 14:34:11" "2021-06-04 11:24:19" "2021-06-04 09:55:34" "2021-06-03 20:21:55" ...
## $ start_station_name: chr "" "" "" "" ...
## $ start_station_id : chr "" "" "" "" ...
## $ end_station_name : chr "" "" "" "" ...
## $ end_station_id : chr "" "" "" "" ...
## $ start_lat : num 41.8 41.8 41.8 41.8 41.8 ...
## $ start_lng : num -87.6 -87.6 -87.6 -87.6 -87.6 ...
## $ end_lat : num 41.8 41.8 41.8 41.8 41.8 ...
## $ end_lng : num -87.6 -87.6 -87.6 -87.6 -87.6 ...
## $ member_casual : chr "member" "member" "member" "member" ...
colnames(Jul)
## [1] "ride_id" "rideable_type" "started_at"
## [4] "ended_at" "start_station_name" "start_station_id"
## [7] "end_station_name" "end_station_id" "start_lat"
## [10] "start_lng" "end_lat" "end_lng"
## [13] "member_casual"
str(Jul)
## 'data.frame': 822410 obs. of 13 variables:
## $ ride_id : chr "0A1B623926EF4E16" "B2D5583A5A5E76EE" "6F264597DDBF427A" "379B58EAB20E8AA5" ...
## $ rideable_type : chr "docked_bike" "classic_bike" "classic_bike" "classic_bike" ...
## $ started_at : chr "2021-07-02 14:44:36" "2021-07-07 16:57:42" "2021-07-25 11:30:55" "2021-07-08 22:08:30" ...
## $ ended_at : chr "2021-07-02 15:19:58" "2021-07-07 17:16:09" "2021-07-25 11:48:45" "2021-07-08 22:23:32" ...
## $ start_station_name: chr "Michigan Ave & Washington St" "California Ave & Cortez St" "Wabash Ave & 16th St" "California Ave & Cortez St" ...
## $ start_station_id : chr "13001" "17660" "SL-012" "17660" ...
## $ end_station_name : chr "Halsted St & North Branch St" "Wood St & Hubbard St" "Rush St & Hubbard St" "Carpenter St & Huron St" ...
## $ end_station_id : chr "KA1504000117" "13432" "KA1503000044" "13196" ...
## $ start_lat : num 41.9 41.9 41.9 41.9 41.9 ...
## $ start_lng : num -87.6 -87.7 -87.6 -87.7 -87.7 ...
## $ end_lat : num 41.9 41.9 41.9 41.9 41.9 ...
## $ end_lng : num -87.6 -87.7 -87.6 -87.7 -87.7 ...
## $ member_casual : chr "casual" "casual" "member" "member" ...
colnames(Aug)
## [1] "ride_id" "rideable_type" "started_at"
## [4] "ended_at" "start_station_name" "start_station_id"
## [7] "end_station_name" "end_station_id" "start_lat"
## [10] "start_lng" "end_lat" "end_lng"
## [13] "member_casual"
str(Aug)
## 'data.frame': 804352 obs. of 13 variables:
## $ ride_id : chr "99103BB87CC6C1BB" "EAFCCCFB0A3FC5A1" "9EF4F46C57AD234D" "5834D3208BFAF1DA" ...
## $ rideable_type : chr "electric_bike" "electric_bike" "electric_bike" "electric_bike" ...
## $ started_at : chr "2021-08-10 17:15:49" "2021-08-10 17:23:14" "2021-08-21 02:34:23" "2021-08-21 06:52:55" ...
## $ ended_at : chr "2021-08-10 17:22:44" "2021-08-10 17:39:24" "2021-08-21 02:50:36" "2021-08-21 07:08:13" ...
## $ start_station_name: chr "" "" "" "" ...
## $ start_station_id : chr "" "" "" "" ...
## $ end_station_name : chr "" "" "" "" ...
## $ end_station_id : chr "" "" "" "" ...
## $ start_lat : num 41.8 41.8 42 42 41.8 ...
## $ start_lng : num -87.7 -87.7 -87.7 -87.7 -87.6 ...
## $ end_lat : num 41.8 41.8 42 42 41.8 ...
## $ end_lng : num -87.7 -87.6 -87.7 -87.7 -87.6 ...
## $ member_casual : chr "member" "member" "member" "member" ...
colnames(Sep)
## [1] "ride_id" "rideable_type" "started_at"
## [4] "ended_at" "start_station_name" "start_station_id"
## [7] "end_station_name" "end_station_id" "start_lat"
## [10] "start_lng" "end_lat" "end_lng"
## [13] "member_casual"
str(Sep)
## 'data.frame': 756147 obs. of 13 variables:
## $ ride_id : chr "9DC7B962304CBFD8" "F930E2C6872D6B32" "6EF72137900BB910" "78D1DE133B3DBF55" ...
## $ rideable_type : chr "electric_bike" "electric_bike" "electric_bike" "electric_bike" ...
## $ started_at : chr "2021-09-28 16:07:10" "2021-09-28 14:24:51" "2021-09-28 00:20:16" "2021-09-28 14:51:17" ...
## $ ended_at : chr "2021-09-28 16:09:54" "2021-09-28 14:40:05" "2021-09-28 00:23:57" "2021-09-28 15:00:06" ...
## $ start_station_name: chr "" "" "" "" ...
## $ start_station_id : chr "" "" "" "" ...
## $ end_station_name : chr "" "" "" "" ...
## $ end_station_id : chr "" "" "" "" ...
## $ start_lat : num 41.9 41.9 41.8 41.8 41.9 ...
## $ start_lng : num -87.7 -87.6 -87.7 -87.7 -87.7 ...
## $ end_lat : num 41.9 42 41.8 41.8 41.9 ...
## $ end_lng : num -87.7 -87.7 -87.7 -87.7 -87.7 ...
## $ member_casual : chr "casual" "casual" "casual" "casual" ...
colnames(Oct)
## [1] "ride_id" "rideable_type" "started_at"
## [4] "ended_at" "start_station_name" "start_station_id"
## [7] "end_station_name" "end_station_id" "start_lat"
## [10] "start_lng" "end_lat" "end_lng"
## [13] "member_casual"
str(Oct)
## 'data.frame': 631226 obs. of 13 variables:
## $ ride_id : chr "620BC6107255BF4C" "4471C70731AB2E45" "26CA69D43D15EE14" "362947F0437E1514" ...
## $ rideable_type : chr "electric_bike" "electric_bike" "electric_bike" "electric_bike" ...
## $ started_at : chr "2021-10-22 12:46:42" "2021-10-21 09:12:37" "2021-10-16 16:28:39" "2021-10-16 16:17:48" ...
## $ ended_at : chr "2021-10-22 12:49:50" "2021-10-21 09:14:14" "2021-10-16 16:36:26" "2021-10-16 16:19:03" ...
## $ start_station_name: chr "Kingsbury St & Kinzie St" "" "" "" ...
## $ start_station_id : chr "KA1503000043" "" "" "" ...
## $ end_station_name : chr "" "" "" "" ...
## $ end_station_id : chr "" "" "" "" ...
## $ start_lat : num 41.9 41.9 41.9 41.9 41.9 ...
## $ start_lng : num -87.6 -87.7 -87.7 -87.7 -87.7 ...
## $ end_lat : num 41.9 41.9 41.9 41.9 41.9 ...
## $ end_lng : num -87.6 -87.7 -87.7 -87.7 -87.7 ...
## $ member_casual : chr "member" "member" "member" "member" ...
colnames(Nov)
## [1] "ride_id" "rideable_type" "started_at"
## [4] "ended_at" "start_station_name" "start_station_id"
## [7] "end_station_name" "end_station_id" "start_lat"
## [10] "start_lng" "end_lat" "end_lng"
## [13] "member_casual"
str(Nov)
## 'data.frame': 359978 obs. of 13 variables:
## $ ride_id : chr "7C00A93E10556E47" "90854840DFD508BA" "0A7D10CDD144061C" "2F3BE33085BCFF02" ...
## $ rideable_type : chr "electric_bike" "electric_bike" "electric_bike" "electric_bike" ...
## $ started_at : chr "2021-11-27 13:27:38" "2021-11-27 13:38:25" "2021-11-26 22:03:34" "2021-11-27 09:56:49" ...
## $ ended_at : chr "2021-11-27 13:46:38" "2021-11-27 13:56:10" "2021-11-26 22:05:56" "2021-11-27 10:01:50" ...
## $ start_station_name: chr "" "" "" "" ...
## $ start_station_id : chr "" "" "" "" ...
## $ end_station_name : chr "" "" "" "" ...
## $ end_station_id : chr "" "" "" "" ...
## $ start_lat : num 41.9 42 42 41.9 41.9 ...
## $ start_lng : num -87.7 -87.7 -87.7 -87.8 -87.6 ...
## $ end_lat : num 42 41.9 42 41.9 41.9 ...
## $ end_lng : num -87.7 -87.7 -87.7 -87.8 -87.6 ...
## $ member_casual : chr "casual" "casual" "casual" "casual" ...
colnames(Dec)
## [1] "ride_id" "rideable_type" "started_at"
## [4] "ended_at" "start_station_name" "start_station_id"
## [7] "end_station_name" "end_station_id" "start_lat"
## [10] "start_lng" "end_lat" "end_lng"
## [13] "member_casual"
str(Dec)
## 'data.frame': 247540 obs. of 13 variables:
## $ ride_id : chr "46F8167220E4431F" "73A77762838B32FD" "4CF42452054F59C5" "3278BA87BF698339" ...
## $ rideable_type : chr "electric_bike" "electric_bike" "electric_bike" "classic_bike" ...
## $ started_at : chr "2021-12-07 15:06:07" "2021-12-11 03:43:29" "2021-12-15 23:10:28" "2021-12-26 16:16:10" ...
## $ ended_at : chr "2021-12-07 15:13:42" "2021-12-11 04:10:23" "2021-12-15 23:23:14" "2021-12-26 16:30:53" ...
## $ start_station_name: chr "Laflin St & Cullerton St" "LaSalle Dr & Huron St" "Halsted St & North Branch St" "Halsted St & North Branch St" ...
## $ start_station_id : chr "13307" "KP1705001026" "KA1504000117" "KA1504000117" ...
## $ end_station_name : chr "Morgan St & Polk St" "Clarendon Ave & Leland Ave" "Broadway & Barry Ave" "LaSalle Dr & Huron St" ...
## $ end_station_id : chr "TA1307000130" "TA1307000119" "13137" "KP1705001026" ...
## $ start_lat : num 41.9 41.9 41.9 41.9 41.9 ...
## $ start_lng : num -87.7 -87.6 -87.6 -87.6 -87.7 ...
## $ end_lat : num 41.9 42 41.9 41.9 41.9 ...
## $ end_lng : num -87.7 -87.7 -87.6 -87.6 -87.6 ...
## $ member_casual : chr "member" "casual" "member" "member" ...
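Comparing twelve colnames() and str() printouts by eye is error-prone. A possible programmatic check is sketched here on two small made-up data frames. It compares every data frame's column names and column classes against the first one. The names `frames`, `same_columns`, `col_classes` and `same_types` are illustrative and do not appear elsewhere in this analysis.

```r
# Made-up stand-ins for two monthly data frames.
frames <- list(
  Feb = data.frame(ride_id = "89E7AA6C29227EFF", start_lat = 42.0, member_casual = "member"),
  Mar = data.frame(ride_id = "CFA86D4455AA1030", start_lat = 41.9, member_casual = "casual")
)

# TRUE for each data frame whose column names match those of the first one, in order.
same_columns <- vapply(frames, function(df) identical(names(df), names(frames[[1]])), logical(1))

# First class of every column in a data frame, e.g. "character" or "numeric".
col_classes <- function(df) vapply(df, function(col) class(col)[1], character(1))

# TRUE for each data frame whose column classes match those of the first one.
same_types <- vapply(frames, function(df) identical(col_classes(df), col_classes(frames[[1]])), logical(1))

all(same_columns) && all(same_types)
```

If either check returned FALSE for a month, that file would need renaming or type conversion before merging.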
Merging all the individual data frames of monthly data into a single data frame.
trip_data <- bind_rows(Feb, Mar, Apr, May, Jun, Jul, Aug, Sep, Oct, Nov, Dec, Jan)
str(trip_data)
## 'data.frame': 5601999 obs. of 13 variables:
## $ ride_id : chr "89E7AA6C29227EFF" "0FEFDE2603568365" "E6159D746B2DBB91" "B32D3199F1C2E75B" ...
## $ rideable_type : chr "classic_bike" "classic_bike" "electric_bike" "classic_bike" ...
## $ started_at : chr "2021-02-12 16:14:56" "2021-02-14 17:52:38" "2021-02-09 19:10:18" "2021-02-02 17:49:41" ...
## $ ended_at : chr "2021-02-12 16:21:43" "2021-02-14 18:12:09" "2021-02-09 19:19:10" "2021-02-02 17:54:06" ...
## $ start_station_name: chr "Glenwood Ave & Touhy Ave" "Glenwood Ave & Touhy Ave" "Clark St & Lake St" "Wood St & Chicago Ave" ...
## $ start_station_id : chr "525" "525" "KA1503000012" "637" ...
## $ end_station_name : chr "Sheridan Rd & Columbia Ave" "Bosworth Ave & Howard St" "State St & Randolph St" "Honore St & Division St" ...
## $ end_station_id : chr "660" "16806" "TA1305000029" "TA1305000034" ...
## $ start_lat : num 42 42 41.9 41.9 41.8 ...
## $ start_lng : num -87.7 -87.7 -87.6 -87.7 -87.6 ...
## $ end_lat : num 42 42 41.9 41.9 41.8 ...
## $ end_lng : num -87.7 -87.7 -87.6 -87.7 -87.6 ...
## $ member_casual : chr "member" "casual" "member" "member" ...
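A quick sanity check on the merge is to confirm that no rows were lost or duplicated. The sketch below copies the row counts from the str() output of each monthly file and compares their total with the row count of the merged data frame. The names `monthly_rows` and `merged_rows` are illustrative.

```r
# Row counts per monthly file, copied from the str() output above (Feb 2021 to Jan 2022).
monthly_rows <- c(Feb = 49622, Mar = 228496, Apr = 337230, May = 531633,
                  Jun = 729595, Jul = 822410, Aug = 804352, Sep = 756147,
                  Oct = 631226, Nov = 359978, Dec = 247540, Jan = 103770)

# Row count of the merged data frame, as reported by str(trip_data).
merged_rows <- 5601999

# The merge kept every row if the two totals agree.
sum(monthly_rows) == merged_rows
```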
In this phase, the data is processed for analysis: checking for errors, documenting the cleaning process, and transforming the data so that it can be worked with effectively. Because the dataset is very large (about 5.6 million rows), R is used as the data processing and analysis tool.
head(trip_data) #first 6 rows of data frame
## ride_id rideable_type started_at ended_at
## 1 89E7AA6C29227EFF classic_bike 2021-02-12 16:14:56 2021-02-12 16:21:43
## 2 0FEFDE2603568365 classic_bike 2021-02-14 17:52:38 2021-02-14 18:12:09
## 3 E6159D746B2DBB91 electric_bike 2021-02-09 19:10:18 2021-02-09 19:19:10
## 4 B32D3199F1C2E75B classic_bike 2021-02-02 17:49:41 2021-02-02 17:54:06
## 5 83E463F23575F4BF electric_bike 2021-02-23 15:07:23 2021-02-23 15:22:37
## 6 BDAA7E3494E8D545 electric_bike 2021-02-24 15:43:33 2021-02-24 15:49:05
## start_station_name start_station_id end_station_name
## 1 Glenwood Ave & Touhy Ave 525 Sheridan Rd & Columbia Ave
## 2 Glenwood Ave & Touhy Ave 525 Bosworth Ave & Howard St
## 3 Clark St & Lake St KA1503000012 State St & Randolph St
## 4 Wood St & Chicago Ave 637 Honore St & Division St
## 5 State St & 33rd St 13216 Emerald Ave & 31st St
## 6 Fairbanks St & Superior St 18003 LaSalle Dr & Huron St
## end_station_id start_lat start_lng end_lat end_lng member_casual
## 1 660 42.01270 -87.66606 42.00458 -87.66141 member
## 2 16806 42.01270 -87.66606 42.01954 -87.66956 casual
## 3 TA1305000029 41.88579 -87.63110 41.88487 -87.62750 member
## 4 TA1305000034 41.89563 -87.67207 41.90312 -87.67394 member
## 5 TA1309000055 41.83473 -87.62583 41.83816 -87.64512 member
## 6 KP1705001026 41.89581 -87.62025 41.89489 -87.63198 casual
colnames(trip_data)
## [1] "ride_id" "rideable_type" "started_at"
## [4] "ended_at" "start_station_name" "start_station_id"
## [7] "end_station_name" "end_station_id" "start_lat"
## [10] "start_lng" "end_lat" "end_lng"
## [13] "member_casual"
nrow(trip_data) #no. of rows in data frame
## [1] 5601999
dim(trip_data) #dimensions of data frame
## [1] 5601999 13
summary(trip_data) #statistical summary of data mainly for numerics
## ride_id rideable_type started_at ended_at
## Length:5601999 Length:5601999 Length:5601999 Length:5601999
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
##
## start_station_name start_station_id end_station_name end_station_id
## Length:5601999 Length:5601999 Length:5601999 Length:5601999
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
##
## start_lat start_lng end_lat end_lng
## Min. :41.64 Min. :-87.84 Min. :41.39 Min. :-88.97
## 1st Qu.:41.88 1st Qu.:-87.66 1st Qu.:41.88 1st Qu.:-87.66
## Median :41.90 Median :-87.64 Median :41.90 Median :-87.64
## Mean :41.90 Mean :-87.65 Mean :41.90 Mean :-87.65
## 3rd Qu.:41.93 3rd Qu.:-87.63 3rd Qu.:41.93 3rd Qu.:-87.63
## Max. :45.64 Max. :-73.80 Max. :42.17 Max. :-87.49
## NA's :4754 NA's :4754
## member_casual
## Length:5601999
## Class :character
## Mode :character
##
##
##
##
str(trip_data) #list of columns and datatypes
## 'data.frame': 5601999 obs. of 13 variables:
## $ ride_id : chr "89E7AA6C29227EFF" "0FEFDE2603568365" "E6159D746B2DBB91" "B32D3199F1C2E75B" ...
## $ rideable_type : chr "classic_bike" "classic_bike" "electric_bike" "classic_bike" ...
## $ started_at : chr "2021-02-12 16:14:56" "2021-02-14 17:52:38" "2021-02-09 19:10:18" "2021-02-02 17:49:41" ...
## $ ended_at : chr "2021-02-12 16:21:43" "2021-02-14 18:12:09" "2021-02-09 19:19:10" "2021-02-02 17:54:06" ...
## $ start_station_name: chr "Glenwood Ave & Touhy Ave" "Glenwood Ave & Touhy Ave" "Clark St & Lake St" "Wood St & Chicago Ave" ...
## $ start_station_id : chr "525" "525" "KA1503000012" "637" ...
## $ end_station_name : chr "Sheridan Rd & Columbia Ave" "Bosworth Ave & Howard St" "State St & Randolph St" "Honore St & Division St" ...
## $ end_station_id : chr "660" "16806" "TA1305000029" "TA1305000034" ...
## $ start_lat : num 42 42 41.9 41.9 41.8 ...
## $ start_lng : num -87.7 -87.7 -87.6 -87.7 -87.6 ...
## $ end_lat : num 42 42 41.9 41.9 41.8 ...
## $ end_lng : num -87.7 -87.7 -87.6 -87.7 -87.6 ...
## $ member_casual : chr "member" "casual" "member" "member" ...
Adding individual columns for date, month, day, year and day of the week to make the in-depth analysis easier.
# Adding columns for date, month, day, year and day of the week to the data frame
trip_data$date <- as.Date(trip_data$started_at) # calendar date on which the ride started
trip_data$month <- format(trip_data$date, "%m") # two-digit month, e.g. "02"
trip_data$day <- format(trip_data$date, "%d") # two-digit day of the month
trip_data$year <- format(trip_data$date, "%Y") # four-digit year
trip_data$day_of_week <- format(trip_data$date, "%A") # full weekday name (depends on the locale)
colnames(trip_data)
## [1] "ride_id" "rideable_type" "started_at"
## [4] "ended_at" "start_station_name" "start_station_id"
## [7] "end_station_name" "end_station_id" "start_lat"
## [10] "start_lng" "end_lat" "end_lng"
## [13] "member_casual" "date" "month"
## [16] "day" "year" "day_of_week"
head(trip_data)
## ride_id rideable_type started_at ended_at
## 1 89E7AA6C29227EFF classic_bike 2021-02-12 16:14:56 2021-02-12 16:21:43
## 2 0FEFDE2603568365 classic_bike 2021-02-14 17:52:38 2021-02-14 18:12:09
## 3 E6159D746B2DBB91 electric_bike 2021-02-09 19:10:18 2021-02-09 19:19:10
## 4 B32D3199F1C2E75B classic_bike 2021-02-02 17:49:41 2021-02-02 17:54:06
## 5 83E463F23575F4BF electric_bike 2021-02-23 15:07:23 2021-02-23 15:22:37
## 6 BDAA7E3494E8D545 electric_bike 2021-02-24 15:43:33 2021-02-24 15:49:05
## start_station_name start_station_id end_station_name
## 1 Glenwood Ave & Touhy Ave 525 Sheridan Rd & Columbia Ave
## 2 Glenwood Ave & Touhy Ave 525 Bosworth Ave & Howard St
## 3 Clark St & Lake St KA1503000012 State St & Randolph St
## 4 Wood St & Chicago Ave 637 Honore St & Division St
## 5 State St & 33rd St 13216 Emerald Ave & 31st St
## 6 Fairbanks St & Superior St 18003 LaSalle Dr & Huron St
## end_station_id start_lat start_lng end_lat end_lng member_casual
## 1 660 42.01270 -87.66606 42.00458 -87.66141 member
## 2 16806 42.01270 -87.66606 42.01954 -87.66956 casual
## 3 TA1305000029 41.88579 -87.63110 41.88487 -87.62750 member
## 4 TA1305000034 41.89563 -87.67207 41.90312 -87.67394 member
## 5 TA1309000055 41.83473 -87.62583 41.83816 -87.64512 member
## 6 KP1705001026 41.89581 -87.62025 41.89489 -87.63198 casual
## date month day year day_of_week
## 1 2021-02-12 02 12 2021 Friday
## 2 2021-02-14 02 14 2021 Sunday
## 3 2021-02-09 02 09 2021 Tuesday
## 4 2021-02-02 02 02 2021 Tuesday
## 5 2021-02-23 02 23 2021 Tuesday
## 6 2021-02-24 02 24 2021 Wednesday
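To see what each format code returns, the derivation can be tried on a few values. This minimal sketch uses the first three start times shown above. The `iso_weekday` column is an addition for illustration only: "%u" returns the weekday as a number (1 = Monday, 7 = Sunday), which does not depend on the language settings of the machine the way "%A" does.

```r
# First three start times from the data shown above.
started_at <- c("2021-02-12 16:14:56", "2021-02-14 17:52:38", "2021-02-09 19:10:18")

# as.Date() keeps the calendar date and drops the time of day.
ride_date <- as.Date(started_at)

parts <- data.frame(
  date = ride_date,
  month = format(ride_date, "%m"),
  day = format(ride_date, "%d"),
  year = format(ride_date, "%Y"),
  iso_weekday = format(ride_date, "%u")  # 1 = Monday ... 7 = Sunday
)
parts
```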
Adding a column with the duration of each ride.
# Adding ride_length column into the data frame (explicit units so the result is always in seconds)
trip_data$ride_length <- difftime(trip_data$ended_at, trip_data$started_at, units = "secs")
str(trip_data)
## 'data.frame': 5601999 obs. of 19 variables:
## $ ride_id : chr "89E7AA6C29227EFF" "0FEFDE2603568365" "E6159D746B2DBB91" "B32D3199F1C2E75B" ...
## $ rideable_type : chr "classic_bike" "classic_bike" "electric_bike" "classic_bike" ...
## $ started_at : chr "2021-02-12 16:14:56" "2021-02-14 17:52:38" "2021-02-09 19:10:18" "2021-02-02 17:49:41" ...
## $ ended_at : chr "2021-02-12 16:21:43" "2021-02-14 18:12:09" "2021-02-09 19:19:10" "2021-02-02 17:54:06" ...
## $ start_station_name: chr "Glenwood Ave & Touhy Ave" "Glenwood Ave & Touhy Ave" "Clark St & Lake St" "Wood St & Chicago Ave" ...
## $ start_station_id : chr "525" "525" "KA1503000012" "637" ...
## $ end_station_name : chr "Sheridan Rd & Columbia Ave" "Bosworth Ave & Howard St" "State St & Randolph St" "Honore St & Division St" ...
## $ end_station_id : chr "660" "16806" "TA1305000029" "TA1305000034" ...
## $ start_lat : num 42 42 41.9 41.9 41.8 ...
## $ start_lng : num -87.7 -87.7 -87.6 -87.7 -87.6 ...
## $ end_lat : num 42 42 41.9 41.9 41.8 ...
## $ end_lng : num -87.7 -87.7 -87.6 -87.7 -87.6 ...
## $ member_casual : chr "member" "casual" "member" "member" ...
## $ date : Date, format: "2021-02-12" "2021-02-14" ...
## $ month : chr "02" "02" "02" "02" ...
## $ day : chr "12" "14" "09" "02" ...
## $ year : chr "2021" "2021" "2021" "2021" ...
## $ day_of_week : chr "Friday" "Sunday" "Tuesday" "Tuesday" ...
## $ ride_length : 'difftime' num 407 1171 532 265 ...
## ..- attr(*, "units")= chr "secs"
glimpse(trip_data)
## Rows: 5,601,999
## Columns: 19
## $ ride_id <chr> "89E7AA6C29227EFF", "0FEFDE2603568365", "E6159D746B~
## $ rideable_type <chr> "classic_bike", "classic_bike", "electric_bike", "c~
## $ started_at <chr> "2021-02-12 16:14:56", "2021-02-14 17:52:38", "2021~
## $ ended_at <chr> "2021-02-12 16:21:43", "2021-02-14 18:12:09", "2021~
## $ start_station_name <chr> "Glenwood Ave & Touhy Ave", "Glenwood Ave & Touhy A~
## $ start_station_id <chr> "525", "525", "KA1503000012", "637", "13216", "1800~
## $ end_station_name <chr> "Sheridan Rd & Columbia Ave", "Bosworth Ave & Howar~
## $ end_station_id <chr> "660", "16806", "TA1305000029", "TA1305000034", "TA~
## $ start_lat <dbl> 42.01270, 42.01270, 41.88579, 41.89563, 41.83473, 4~
## $ start_lng <dbl> -87.66606, -87.66606, -87.63110, -87.67207, -87.625~
## $ end_lat <dbl> 42.00458, 42.01954, 41.88487, 41.90312, 41.83816, 4~
## $ end_lng <dbl> -87.66141, -87.66956, -87.62750, -87.67394, -87.645~
## $ member_casual <chr> "member", "casual", "member", "member", "member", "~
## $ date <date> 2021-02-12, 2021-02-14, 2021-02-09, 2021-02-02, 20~
## $ month <chr> "02", "02", "02", "02", "02", "02", "02", "02", "02~
## $ day <chr> "12", "14", "09", "02", "23", "24", "01", "11", "27~
## $ year <chr> "2021", "2021", "2021", "2021", "2021", "2021", "20~
## $ day_of_week <chr> "Friday", "Sunday", "Tuesday", "Tuesday", "Tuesday"~
## $ ride_length <drtn> 407 secs, 1171 secs, 532 secs, 265 secs, 914 secs,~
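When no unit is given, difftime() chooses one from the smallest time difference in the data, so the unit of the result can change from one data set to another. In the output above it came out as seconds. The sketch below shows the effect on two rides: the first is the 407-second ride from the first row above, and the second is a hypothetical two-hour ride. The time zone "UTC" is an assumption made only to keep the example reproducible.

```r
# Two rides: 407 seconds (first row above) and a hypothetical two-hour ride.
start <- as.POSIXct(c("2021-02-12 16:14:56", "2021-02-12 16:14:56"), tz = "UTC")
end <- as.POSIXct(c("2021-02-12 16:21:43", "2021-02-12 18:14:56"), tz = "UTC")

# Default: the unit is picked automatically from the data.
auto_units <- difftime(end, start)
units(auto_units)

# Explicit unit: the result is always in seconds.
in_secs <- difftime(end, start, units = "secs")
units(in_secs)
as.numeric(in_secs)
```

Stating the unit explicitly keeps later steps, such as filtering or averaging ride lengths, independent of what the data happens to contain.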
Changing the data type of the ride length column to numeric.
# converting ride_length from difftime to a plain number of seconds
trip_data$ride_length <- as.numeric(trip_data$ride_length, units = "secs")
is.numeric(trip_data$ride_length)
## [1] TRUE
glimpse(trip_data)
## Rows: 5,601,999
## Columns: 19
## $ ride_id <chr> "89E7AA6C29227EFF", "0FEFDE2603568365", "E6159D746B~
## $ rideable_type <chr> "classic_bike", "classic_bike", "electric_bike", "c~
## $ started_at <chr> "2021-02-12 16:14:56", "2021-02-14 17:52:38", "2021~
## $ ended_at <chr> "2021-02-12 16:21:43", "2021-02-14 18:12:09", "2021~
## $ start_station_name <chr> "Glenwood Ave & Touhy Ave", "Glenwood Ave & Touhy A~
## $ start_station_id <chr> "525", "525", "KA1503000012", "637", "13216", "1800~
## $ end_station_name <chr> "Sheridan Rd & Columbia Ave", "Bosworth Ave & Howar~
## $ end_station_id <chr> "660", "16806", "TA1305000029", "TA1305000034", "TA~
## $ start_lat <dbl> 42.01270, 42.01270, 41.88579, 41.89563, 41.83473, 4~
## $ start_lng <dbl> -87.66606, -87.66606, -87.63110, -87.67207, -87.625~
## $ end_lat <dbl> 42.00458, 42.01954, 41.88487, 41.90312, 41.83816, 4~
## $ end_lng <dbl> -87.66141, -87.66956, -87.62750, -87.67394, -87.645~
## $ member_casual <chr> "member", "casual", "member", "member", "member", "~
## $ date <date> 2021-02-12, 2021-02-14, 2021-02-09, 2021-02-02, 20~
## $ month <chr> "02", "02", "02", "02", "02", "02", "02", "02", "02~
## $ day <chr> "12", "14", "09", "02", "23", "24", "01", "11", "27~
## $ year <chr> "2021", "2021", "2021", "2021", "2021", "2021", "20~
## $ day_of_week <chr> "Friday", "Sunday", "Tuesday", "Tuesday", "Tuesday"~
## $ ride_length <dbl> 407, 1171, 532, 265, 914, 332, 51, 76, 1377, 1042, ~
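The conversion can be checked on a couple of rides before applying it to the whole table. The sketch below is only an illustration: it rebuilds the first two rides shown above as a small hypothetical example (the variable names `started`, `ended`, `ride_secs` and `ride_mins` are not part of the original script) and converts the duration with an explicit unit, which guards against a difftime that happens to be stored in minutes or hours.

```r
# Hypothetical two-ride example; timestamps copied from the first two rows above
started <- as.POSIXct(c("2021-02-12 16:14:56", "2021-02-14 17:52:38"), tz = "UTC")
ended   <- as.POSIXct(c("2021-02-12 16:21:43", "2021-02-14 18:12:09"), tz = "UTC")

# difftime() with explicit units gives a duration in seconds
ride_length <- difftime(ended, started, units = "secs")

# as.numeric() on a difftime accepts a units argument
ride_secs <- as.numeric(ride_length, units = "secs")
ride_mins <- as.numeric(ride_length, units = "mins")

print(ride_secs)
```

The two values match the 407 and 1171 seconds reported for these rides in the glimpse output.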
Inspecting invalid rides, i.e. rides with a ride length of zero seconds or less.
# checking bad ride length
sum(trip_data$ride_length <= 0)
## [1] 652
nrow(trip_data)
## [1] 5601999
Removing the 652 rows with a non-positive ride length.
# Removing bad ride length data
trip_data <- trip_data[!(trip_data$ride_length <= 0),]
sum(trip_data$ride_length <= 0)
## [1] 0
nrow(trip_data)
## [1] 5601347
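Logical indexing with `<=` returns rows full of `NA` when the compared column contains missing values. The data above has none in `ride_length`, but a slightly more defensive filter can be sketched as follows. The `rides` data frame and its values are made up for illustration.

```r
# Made-up rides: one valid, one zero-length, one negative, one missing
rides <- data.frame(
  ride_id = c("a", "b", "c", "d"),
  ride_length = c(407, 0, -15, NA)
)

# Count rows that are known to be invalid (NA is not counted as invalid here)
bad <- !is.na(rides$ride_length) & rides$ride_length <= 0
n_bad <- sum(bad)

# Keep only rows with a known, strictly positive ride length
clean <- rides[!is.na(rides$ride_length) & rides$ride_length > 0, ]

print(n_bad)
print(clean)
```

Whether rows with a missing ride length should be dropped or kept is a judgment call; this sketch drops them.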
Adding a column for the period of the day in which each trip started: Night (hours 0 to 6), Morning (7 to 12), Afternoon (13 to 18) or Evening (19 to 23).
# Creating breaks at hours 0, 6, 12, 18 and 23
breaks <- hour(hm("00:00", "6:00", "12:00", "18:00", "23:59"))
# labels for the intervals between the breaks
labels <- c("Night", "Morning", "Afternoon", "Evening")
# Assigning each trip a time of day from its start hour (intervals are closed on the right)
trip_data$time_of_the_trip <- cut(x=hour(trip_data$started_at), breaks = breaks, labels = labels, include.lowest=TRUE)
colnames(trip_data)
## [1] "ride_id" "rideable_type" "started_at"
## [4] "ended_at" "start_station_name" "start_station_id"
## [7] "end_station_name" "end_station_id" "start_lat"
## [10] "start_lng" "end_lat" "end_lng"
## [13] "member_casual" "date" "month"
## [16] "day" "year" "day_of_week"
## [19] "ride_length" "time_of_the_trip"
head(trip_data)
## ride_id rideable_type started_at ended_at
## 1 89E7AA6C29227EFF classic_bike 2021-02-12 16:14:56 2021-02-12 16:21:43
## 2 0FEFDE2603568365 classic_bike 2021-02-14 17:52:38 2021-02-14 18:12:09
## 3 E6159D746B2DBB91 electric_bike 2021-02-09 19:10:18 2021-02-09 19:19:10
## 4 B32D3199F1C2E75B classic_bike 2021-02-02 17:49:41 2021-02-02 17:54:06
## 5 83E463F23575F4BF electric_bike 2021-02-23 15:07:23 2021-02-23 15:22:37
## 6 BDAA7E3494E8D545 electric_bike 2021-02-24 15:43:33 2021-02-24 15:49:05
## start_station_name start_station_id end_station_name
## 1 Glenwood Ave & Touhy Ave 525 Sheridan Rd & Columbia Ave
## 2 Glenwood Ave & Touhy Ave 525 Bosworth Ave & Howard St
## 3 Clark St & Lake St KA1503000012 State St & Randolph St
## 4 Wood St & Chicago Ave 637 Honore St & Division St
## 5 State St & 33rd St 13216 Emerald Ave & 31st St
## 6 Fairbanks St & Superior St 18003 LaSalle Dr & Huron St
## end_station_id start_lat start_lng end_lat end_lng member_casual
## 1 660 42.01270 -87.66606 42.00458 -87.66141 member
## 2 16806 42.01270 -87.66606 42.01954 -87.66956 casual
## 3 TA1305000029 41.88579 -87.63110 41.88487 -87.62750 member
## 4 TA1305000034 41.89563 -87.67207 41.90312 -87.67394 member
## 5 TA1309000055 41.83473 -87.62583 41.83816 -87.64512 member
## 6 KP1705001026 41.89581 -87.62025 41.89489 -87.63198 casual
## date month day year day_of_week ride_length time_of_the_trip
## 1 2021-02-12 02 12 2021 Friday 407 Afternoon
## 2 2021-02-14 02 14 2021 Sunday 1171 Afternoon
## 3 2021-02-09 02 09 2021 Tuesday 532 Evening
## 4 2021-02-02 02 02 2021 Tuesday 265 Afternoon
## 5 2021-02-23 02 23 2021 Tuesday 914 Afternoon
## 6 2021-02-24 02 24 2021 Wednesday 332 Afternoon
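How `cut()` treats the boundary hours is easy to overlook, so a small check on a handful of hours can help. The sketch below uses the same break values as above (0, 6, 12, 18, 23) and compares them with a left-closed alternative. The alternative is an assumption shown for contrast, not the binning used in this analysis.

```r
hours  <- c(0, 6, 7, 12, 13, 18, 19, 23)
labels <- c("Night", "Morning", "Afternoon", "Evening")

# Same breaks as in the analysis: intervals are closed on the right,
# so hours 6, 12 and 18 fall into the earlier period
as_used <- cut(hours, breaks = c(0, 6, 12, 18, 23),
               labels = labels, include.lowest = TRUE)

# Alternative (assumption): left-closed intervals [0,6), [6,12), [12,18), [18,24]
left_closed <- cut(hours, breaks = c(0, 6, 12, 18, 24),
                   labels = labels, right = FALSE, include.lowest = TRUE)

print(data.frame(hour = hours, as_used, left_closed))
```

With the breaks used in the analysis, a ride starting at 18:30 is counted as Afternoon. With the left-closed alternative it would be counted as Evening.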
The data has now been prepared and processed and is ready for descriptive analysis. The analysis consists of performing calculations on the cleaned, consistent data and identifying trends, patterns and relationships.
Calculating the mean, median, maximum and minimum ride length (in seconds) for both casual riders and members.
# finding mean(total ride length/total rides), median(midpoint), max(longest), min(shortest) for ride_length
trip_data %>%
  group_by(member_casual) %>%
  summarise(average_ride_length = mean(ride_length),
            median_length = median(ride_length),
            max_ride_length = max(ride_length),
            min_ride_length = min(ride_length))
## # A tibble: 2 x 5
## member_casual average_ride_leng~ median_length max_ride_length min_ride_length
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 casual 1922. 957 3356649 1
## 2 member 816. 574 93596 1
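Seconds are hard to read at this scale, so the figures can be converted to minutes and days. The sketch below copies the rounded values from the printed tibble into a small data frame (`summary_secs` and the derived column names are introduced here for illustration only).

```r
# Values copied from the summary printed above (means are rounded as displayed)
summary_secs <- data.frame(
  member_casual       = c("casual", "member"),
  average_ride_length = c(1922, 816),
  median_length       = c(957, 574),
  max_ride_length     = c(3356649, 93596)
)

# Convert seconds to minutes (mean, median) and days (maximum)
summary_secs$average_mins <- summary_secs$average_ride_length / 60
summary_secs$median_mins  <- summary_secs$median_length / 60
summary_secs$max_days     <- summary_secs$max_ride_length / 86400

print(summary_secs)
```

Casual riders average roughly 32 minutes per ride against about 14 minutes for members. The longest casual ride lasts almost 39 days, which suggests that a few extreme rides pull the casual mean well above the median.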
Calculating the total number of rides for each rider type.
# total ride taken(ride count) by members and casual riders
trip_data %>%
group_by(member_casual) %>%
summarise(ride_count = length(ride_id))
## # A tibble: 2 x 2
## member_casual ride_count
## <chr> <int>
## 1 casual 2529064
## 2 member 3072283
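The counts are easier to compare as shares of all rides. A minimal sketch, using the two counts printed above (the names `ride_count`, `total_rides` and `share_pct` are illustrative):

```r
# Ride counts copied from the output above
ride_count <- c(casual = 2529064, member = 3072283)

# Total rides and each group's share in percent
total_rides <- sum(ride_count)
share_pct   <- round(100 * prop.table(ride_count), 1)

print(total_rides)
print(share_pct)
```

The total agrees with the 5,601,347 rows left after cleaning. Casual riders account for roughly 45% of rides and members for roughly 55%.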
Calculating the number of rides and the average ride length for each day of the week.
# calculating total rides and average ride time by each day for members and casual riders
trip_data %>%
group_by(member_casual, day_of_week) %>%
summarise(number_of_rides = n(),
average_ride_length = mean(ride_length),.groups = "drop")
## # A tibble: 14 x 4
## member_casual day_of_week number_of_rides average_ride_length
## <chr> <chr> <int> <dbl>
## 1 casual Friday 363656 1822.
## 2 casual Monday 286681 1916.
## 3 casual Saturday 557722 2085.
## 4 casual Sunday 480699 2254.
## 5 casual Thursday 286233 1669.
## 6 casual Tuesday 274868 1676.
## 7 casual Wednesday 279205 1665.
## 8 member Friday 445093 799.
## 9 member Monday 418420 792.
## 10 member Saturday 431674 914.
## 11 member Sunday 376207 939.
## 12 member Thursday 453535 765.
## 13 member Tuesday 468659 767.
## 14 member Wednesday 478695 766.
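Because `day_of_week` is stored as text, the days are sorted alphabetically. One possible way to get calendar order, sketched here on the casual-rider counts copied from the table above, is to turn the column into a factor with explicit levels. Starting the week on Monday is an assumption.

```r
# Calendar order for the weekdays (Monday first is an assumption)
day_levels <- c("Monday", "Tuesday", "Wednesday", "Thursday",
                "Friday", "Saturday", "Sunday")

# Casual-rider counts copied from the table above
casual_by_day <- data.frame(
  day_of_week = c("Friday", "Monday", "Saturday", "Sunday",
                  "Thursday", "Tuesday", "Wednesday"),
  number_of_rides = c(363656, 286681, 557722, 480699, 286233, 274868, 279205)
)

# A factor with explicit levels sorts (and plots) in calendar order
casual_by_day$day_of_week <- factor(casual_by_day$day_of_week, levels = day_levels)
casual_by_day <- casual_by_day[order(casual_by_day$day_of_week), ]
print(casual_by_day)

# Share of casual rides taken on Saturday and Sunday
weekend <- casual_by_day$day_of_week %in% c("Saturday", "Sunday")
weekend_share <- sum(casual_by_day$number_of_rides[weekend]) /
  sum(casual_by_day$number_of_rides)
print(round(100 * weekend_share, 1))
```

Applying the same `factor()` call to `trip_data$day_of_week` before grouping would put the tables and bar charts in calendar order as well. Saturday and Sunday together account for about 41% of casual rides.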
Comparing the number of rides and the average ride length across the different times of the day.
# Comparing time period(night, morning, afternoon, evening) of ride with ride length for both riders
trip_data %>%
group_by(member_casual, time_of_the_trip) %>%
summarise(number_of_rides = n(),
average_ride_length = mean(ride_length),.groups = "drop")
## # A tibble: 8 x 4
## member_casual time_of_the_trip number_of_rides average_ride_length
## <chr> <fct> <int> <dbl>
## 1 casual Night 181708 2121.
## 2 casual Morning 588852 1852.
## 3 casual Afternoon 1195470 1933.
## 4 casual Evening 563034 1909.
## 5 member Night 195833 785.
## 6 member Morning 920021 776.
## 7 member Afternoon 1404941 843.
## 8 member Evening 551488 827.
In this phase, the insights and findings are shared through data visualizations. Bar charts are used to present the analysis above.
# Visualizing total rides taken by members and casual riders
trip_data %>%
group_by(member_casual) %>%
summarise(ride_count = length(ride_id)) %>%
ggplot() + geom_col(mapping = aes(x = member_casual, y = ride_count, fill = member_casual), show.legend = FALSE) +
labs(title = "Total no. of rides")
# Visualizing the days of the week with no. of rides taken by riders
trip_data %>%
group_by(member_casual, day_of_week) %>%
summarise(number_of_rides = n(), .groups = "drop") %>%
arrange(member_casual, day_of_week) %>%
ggplot(aes(x = day_of_week, y = number_of_rides, fill = member_casual)) +
labs(title = "Total rides vs. day of the week") +
geom_col(width = 0.5, position = position_dodge(width = 0.5)) +
scale_y_continuous(labels = function(x) format(x,scientific = FALSE))
# Visualizing average ride by day of the week
trip_data %>%
group_by(member_casual, day_of_week) %>%
summarise(average_ride_length = mean(ride_length), .groups = "drop") %>%
ggplot(aes(x = day_of_week, y = average_ride_length, fill = member_casual)) +
geom_col(width = 0.5, position = position_dodge(width = 0.5)) +
labs(title = "Average ride length vs. day of the week")
# visualizing total rides taken by members and casuals by month
trip_data %>%
group_by(member_casual, month) %>%
summarise(number_of_rides = n(), .groups = "drop") %>%
arrange(member_casual, month) %>%
ggplot(aes(x = month, y = number_of_rides, fill = member_casual)) +
labs(title = "Total rides vs. month") +
geom_col(width = 0.5, position = position_dodge(width = 0.5)) +
scale_y_continuous(labels = function(x) format(x,scientific = FALSE))
# visualizing average ride length by month
trip_data %>%
group_by(member_casual, month) %>%
summarise(average_ride_length = mean(ride_length), .groups = "drop") %>%
ggplot(aes(x = month, y = average_ride_length, fill = member_casual)) +
geom_col(width = 0.5, position = position_dodge(width = 0.5)) +
labs(title = "Average ride length vs. month")
# visualizing and comparing the mean ride length (a duration in seconds, not a distance) of casual riders and members
trip_data %>%
  group_by(member_casual) %>%
  summarise(average_ride_length = mean(ride_length)) %>%
  ggplot() +
  geom_col(mapping = aes(x = member_casual, y = average_ride_length, fill = member_casual), show.legend = FALSE) +
  labs(title = "Mean ride length (seconds)")
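Since `ride_length` is a duration in seconds, a true distance would have to be derived from the start and end coordinates. One common approximation is the haversine (great-circle) formula, sketched below. The function name `haversine_km` and the Earth radius of 6371 km are choices made for this illustration, and the result is a straight-line distance between stations, not the route actually ridden.

```r
# Great-circle distance in kilometres between two points given in degrees.
# radius_km = 6371 is the usual mean Earth radius (an assumption of this sketch).
haversine_km <- function(lat1, lng1, lat2, lng2, radius_km = 6371) {
  to_rad <- pi / 180
  dlat <- (lat2 - lat1) * to_rad
  dlng <- (lng2 - lng1) * to_rad
  a <- sin(dlat / 2)^2 +
    cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlng / 2)^2
  2 * radius_km * asin(pmin(1, sqrt(a)))
}

# First ride shown earlier: Glenwood Ave & Touhy Ave to Sheridan Rd & Columbia Ave
d <- haversine_km(42.01270, -87.66606, 42.00458, -87.66141)
print(d)
```

The function is vectorized, so it could be applied to the `start_lat`, `start_lng`, `end_lat` and `end_lng` columns of `trip_data` to build a distance column, which could then be averaged per rider type.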
# Visualizing time period(night, morning, afternoon, evening) of rides with total no. of rides
trip_data %>%
group_by(member_casual, time_of_the_trip) %>%
summarise(number_of_rides = n(), .groups = "drop") %>%
ggplot() + geom_col(mapping = aes(x = time_of_the_trip, y = number_of_rides, fill = member_casual), show.legend = TRUE) +
labs(title = "Total no. of rides vs. the time of the trip")
# Visualizing comparison of total rides with the type of ride
trip_data %>%
group_by(member_casual, rideable_type) %>%
summarise(number_of_rides = n(), .groups = "drop") %>%
ggplot() + geom_col(mapping = aes(x = rideable_type, y = number_of_rides, fill = member_casual), show.legend = TRUE) +
labs(title = "Total no. of rides vs. ride type")
Visualizing the most popular routes on a map using their latitude and longitude coordinates.
# Visualizing and analyzing on map via latitudes and longitudes
# Adding a new data frame containing only the most popular routes (more than 200 rides)
# keeping only rides whose start and end coordinates differ
coordinates_df <- trip_data %>%
  filter(start_lng != end_lng & start_lat != end_lat) %>%
group_by(start_lng, start_lat, end_lng, end_lat, member_casual, rideable_type) %>%
summarise(total_rides = n(), .groups = "drop") %>%
filter(total_rides > 200)
casual_riders <- coordinates_df %>%
filter(member_casual == "casual")
member_riders <- coordinates_df %>%
filter(member_casual =="member")
# Storing map of Chicago
chicago <- c(left = -87.700424, bottom = 41.790769, right = -87.554855, top = 41.990119)
chicago_map <- get_stamenmap(bbox = chicago, zoom = 12, maptype = "terrain" )
## Source : http://tile.stamen.com/terrain/12/1050/1520.png
## Source : http://tile.stamen.com/terrain/12/1051/1520.png
## Source : http://tile.stamen.com/terrain/12/1050/1521.png
## Source : http://tile.stamen.com/terrain/12/1051/1521.png
## Source : http://tile.stamen.com/terrain/12/1050/1522.png
## Source : http://tile.stamen.com/terrain/12/1051/1522.png
## Source : http://tile.stamen.com/terrain/12/1050/1523.png
## Source : http://tile.stamen.com/terrain/12/1051/1523.png
# maps for casual and member riders
ggmap(chicago_map, darken = c(0.1, "white")) +
  geom_point(data = casual_riders, mapping = aes(x = start_lng, y = start_lat, color = rideable_type), size = 2) +
  coord_fixed(0.8) +
  labs(title = "Hotspots of casual riders", x = NULL, y = NULL) +
  theme(legend.position = "right")
## Coordinate system already present. Adding new coordinate system, which will replace the existing one.
## Warning: Removed 49 rows containing missing values (geom_point).
ggmap(chicago_map, darken = c(0.1, "white")) +
  geom_point(data = member_riders, mapping = aes(x = start_lng, y = start_lat, color = rideable_type), size = 2) +
  coord_fixed(0.8) +
  labs(title = "Hotspots of member riders", x = NULL, y = NULL) +
  theme(legend.position = "right")
## Coordinate system already present. Adding new coordinate system, which will replace the existing one.
## Warning: Removed 109 rows containing missing values (geom_point).
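The "Removed ... rows" warnings most likely refer to points that lie outside the map's bounding box, since the plot is limited to the extent of the downloaded tiles. A rough way to check this, sketched with the bounding box from above and three start positions copied from the first rows of the data (the `pts` data frame and the `inside` vector are illustrative names):

```r
# Bounding box used for the Chicago map above
chicago <- c(left = -87.700424, bottom = 41.790769,
             right = -87.554855, top = 41.990119)

# Three start positions copied from the first rows of the data
pts <- data.frame(
  start_lng = c(-87.63110, -87.66606, -87.62025),
  start_lat = c(41.88579, 42.01270, 41.89581)
)

# TRUE when a point falls inside the bounding box
inside <- pts$start_lng >= chicago[["left"]] &
  pts$start_lng <= chicago[["right"]] &
  pts$start_lat >= chicago[["bottom"]] &
  pts$start_lat <= chicago[["top"]]

print(cbind(pts, inside))
print(sum(!inside))
```

The second point (Glenwood Ave & Touhy Ave, latitude 42.01270) lies north of the box's top edge, so a point like it would not be drawn. Running the same check on `casual_riders` and `member_riders` would show whether the counts match the warnings.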
Now that the visualizations are complete, it is time to act on the findings. The top 3 recommendations based on the analysis are:
Weekend membership: Casual riders ride most on Saturdays and Sundays, so a weekend membership could attract new casual riders as well as existing ones, and its benefits could be used to nudge them toward a full annual membership.
Marketing and promotional campaigns: Rides peak for both rider types in the 3rd quarter of the year, which makes it the best time for promotional activities and campaigns. These can be run near the riding hotspots. Classic bikes are used the most, so offers can be built around them.
Discounts and riding competitions: Cyclistic can organize bike riding competitions with exciting prizes and offer discounted yearly memberships to the participants.
Additional data, such as pricing details, could expand the findings and the scope of the analysis, but the provided data is sufficient to support the conclusions and accomplish the business task.
Resources:
RStudio, Medium, LinkedIn and the Kaggle community.
For ggmap: http://journal.r-project.org/archive/2013-1/kahle-wickham.pdf https://cran.r-project.org/web/packages/ggmap/citation.html