Data-driven decision-making

Data-driven decision-making (DDDM) is defined as using facts, metrics, and data to guide strategic business decisions that align with your goals, objectives, and initiatives. Various of Data-driven decision-making process has been defined by Industry experts. There is no universal DDDM process but most of the DDDM follows the almost same process. The method used in this project is Ask - Prepare - Process - Analyze - Share - Act ( APPASA )

ASK

Introduction

In 2016, Cyclistic launched a successful bike-share offering. Since then, the program has grown to a fleet of 5,824 bicycles that are geotracked and locked into a network of 692 stations across Chicago. The bikes can be unlocked from one station and returned to any other station in the system anytime. Until now, Cyclistic’s marketing strategy relied on building general awareness and appealing to broad consumer segments.One approach that helped make these things possible was the flexibility of its pricing plans: single-ride passes, full-day passes,and annual memberships. Customers who purchase single-ride or full-day passes are referred to as casual riders. Customers who purchase annual memberships are Cyclistic members.

Cyclistic’s finance analysts have concluded that annual members are much more profitable than casual riders. Although the pricing flexibility helps Cyclistic attract more customers, Moreno believes that maximizing the number of annual members will be key to future growth. Rather than creating a marketing campaign that targets all-new customers, Moreno believes there is a very good chance to convert casual riders into members. She notes that casual riders are already aware of the Cyclistic program and have chosen Cyclistic for their mobility needs. This data analysis has been done to obtain following :

  1. To know how annual members and casual riders use Cyclistic bikes differently?

  2. Why would casual riders buy Cyclistic annual memberships?

  3. How can Cyclistic use digital media to influence casual riders to become members?

Stakeholder

Primary Stakeholder : Lily Moreno, Cyclistic executive team

Secondary Stakeholder : Cyclistic marketing analytics team

PREPARE

Data Set & Metadata

“divvy-tripdata” is historical trip data of company Divvy and data has been made available to public. Data set comes under Divvy Data License Agreement and it released monthly in schedule. The Data of Each trip is anonymized and includes: Trip start day and time Trip end day and time Trip start station Trip end station Rider type (Member, Single Ride, and Day Pass) Bike type (Classic bike, Docked bike, Electric bike)

ROCCC analysis

  1. Reliable

  2. Original

  3. Comprehensive

  4. Current

  5. Cited

Data set checks yes in all criteria of ROCCC analysis.

All the data used in this analysis can be downlaoded from here.

Installing Required Packages

The packages used in this data analysis are : tidyverse, dplyr, skimr, janitor, lubridate, readr

and installed using : install.packages(“package_name”)

Loading Required Packages

library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --
## v ggplot2 3.3.5     v purrr   0.3.4
## v tibble  3.1.6     v dplyr   1.0.8
## v tidyr   1.2.0     v stringr 1.4.0
## v readr   2.1.2     v forcats 0.5.1
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
library(here)
## here() starts at C:/Users/mayank/Documents
library(dplyr)
library(skimr)
library(janitor)
## 
## Attaching package: 'janitor'
## The following objects are masked from 'package:stats':
## 
##     chisq.test, fisher.test
library(lubridate)
## 
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
## 
##     date, intersect, setdiff, union
library(readr)

PROCESS

Uploading 12 Months “divvy-tripdata”

library(readr)
Feb_2021_trip <- read_csv("Bluetooth Folder/excel_data/Feb_2021_trip.csv")
## Rows: 49622 Columns: 13
## -- Column specification --------------------------------------------------------
## Delimiter: ","
## chr (9): ride_id, rideable_type, started_at, ended_at, start_station_name, s...
## dbl (4): start_lat, start_lng, end_lat, end_lng
## 
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.
library(readr)
March_2021_trip <- read_csv("Bluetooth Folder/excel_data/March_2021_trip.csv")
## Rows: 228496 Columns: 13
## -- Column specification --------------------------------------------------------
## Delimiter: ","
## chr  (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl  (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
## 
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.
library(readr)
April_2021_trip <- read_csv("Bluetooth Folder/excel_data/April_2021_trip.csv")
## Rows: 337230 Columns: 13
## -- Column specification --------------------------------------------------------
## Delimiter: ","
## chr  (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl  (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
## 
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.
library(readr)
May_2021_trip <- read_csv("Bluetooth Folder/excel_data/May_2021_trip.csv")
## Rows: 531633 Columns: 13
## -- Column specification --------------------------------------------------------
## Delimiter: ","
## chr  (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl  (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
## 
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.
library(readr)
June_2021_trip <- read_csv("Bluetooth Folder/excel_data/June_2021_trip.csv")
## Rows: 729595 Columns: 13
## -- Column specification --------------------------------------------------------
## Delimiter: ","
## chr  (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl  (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
## 
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.
library(readr)
July_2021_trip <- read_csv("Bluetooth Folder/excel_data/July_2021_trip.csv")
## Rows: 822410 Columns: 13
## -- Column specification --------------------------------------------------------
## Delimiter: ","
## chr  (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl  (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
## 
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.
library(readr)
August_2021_trip <- read_csv("Bluetooth Folder/excel_data/August_2021_trip.csv")
## Rows: 804352 Columns: 13
## -- Column specification --------------------------------------------------------
## Delimiter: ","
## chr  (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl  (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
## 
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.
library(readr)
Sept_2021_trip <- read_csv("Bluetooth Folder/excel_data/Sept_2021_trip.csv")
## Rows: 756147 Columns: 13
## -- Column specification --------------------------------------------------------
## Delimiter: ","
## chr  (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl  (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
## 
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.
library(readr)
Oct_2021_trip <- read_csv("Bluetooth Folder/excel_data/Oct_2021_trip.csv")
## Rows: 631226 Columns: 13
## -- Column specification --------------------------------------------------------
## Delimiter: ","
## chr  (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl  (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
## 
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.
library(readr)
Nov_2021_trip <- read_csv("Bluetooth Folder/excel_data/Nov_2021_trip.csv")
## Rows: 359978 Columns: 13
## -- Column specification --------------------------------------------------------
## Delimiter: ","
## chr  (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl  (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
## 
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.
library(readr)
Dec_2021_trip <- read_csv("Bluetooth Folder/excel_data/Dec_2021_trip.csv")
## Rows: 247540 Columns: 13
## -- Column specification --------------------------------------------------------
## Delimiter: ","
## chr  (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl  (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
## 
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.
library(readr)
Jan_2022_trip <- read_csv("Bluetooth Folder/excel_data/Jan_2022_trip.csv")
## Rows: 103770 Columns: 13
## -- Column specification --------------------------------------------------------
## Delimiter: ","
## chr  (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl  (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
## 
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.

Checking Column name and looking for Inconsistency

str(Feb_2021_trip)
## spec_tbl_df [49,622 x 13] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ ride_id           : chr [1:49622] "89E7AA6C29227EFF" "0FEFDE2603568365" "E6159D746B2DBB91" "B32D3199F1C2E75B" ...
##  $ rideable_type     : chr [1:49622] "classic_bike" "classic_bike" "electric_bike" "classic_bike" ...
##  $ started_at        : chr [1:49622] "2/12/2021 16:14" "2/14/2021 17:52" "2/9/2021 19:10" "2/2/2021 17:49" ...
##  $ ended_at          : chr [1:49622] "2/12/2021 16:21" "2/14/2021 18:12" "2/9/2021 19:19" "2/2/2021 17:54" ...
##  $ start_station_name: chr [1:49622] "Glenwood Ave & Touhy Ave" "Glenwood Ave & Touhy Ave" "Clark St & Lake St" "Wood St & Chicago Ave" ...
##  $ start_station_id  : chr [1:49622] "525" "525" "KA1503000012" "637" ...
##  $ end_station_name  : chr [1:49622] "Sheridan Rd & Columbia Ave" "Bosworth Ave & Howard St" "State St & Randolph St" "Honore St & Division St" ...
##  $ end_station_id    : chr [1:49622] "660" "16806" "TA1305000029" "TA1305000034" ...
##  $ start_lat         : num [1:49622] 42 42 41.9 41.9 41.8 ...
##  $ start_lng         : num [1:49622] -87.7 -87.7 -87.6 -87.7 -87.6 ...
##  $ end_lat           : num [1:49622] 42 42 41.9 41.9 41.8 ...
##  $ end_lng           : num [1:49622] -87.7 -87.7 -87.6 -87.7 -87.6 ...
##  $ member_casual     : chr [1:49622] "member" "casual" "member" "member" ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   ride_id = col_character(),
##   ..   rideable_type = col_character(),
##   ..   started_at = col_character(),
##   ..   ended_at = col_character(),
##   ..   start_station_name = col_character(),
##   ..   start_station_id = col_character(),
##   ..   end_station_name = col_character(),
##   ..   end_station_id = col_character(),
##   ..   start_lat = col_double(),
##   ..   start_lng = col_double(),
##   ..   end_lat = col_double(),
##   ..   end_lng = col_double(),
##   ..   member_casual = col_character()
##   .. )
##  - attr(*, "problems")=<externalptr>
str(March_2021_trip)
## spec_tbl_df [228,496 x 13] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ ride_id           : chr [1:228496] "CFA86D4455AA1030" "30D9DC61227D1AF3" "846D87A15682A284" "994D05AA75A168F2" ...
##  $ rideable_type     : chr [1:228496] "classic_bike" "classic_bike" "classic_bike" "classic_bike" ...
##  $ started_at        : POSIXct[1:228496], format: "2021-03-16 08:32:30" "2021-03-28 01:26:28" ...
##  $ ended_at          : POSIXct[1:228496], format: "2021-03-16 08:36:34" "2021-03-28 01:36:55" ...
##  $ start_station_name: chr [1:228496] "Humboldt Blvd & Armitage Ave" "Humboldt Blvd & Armitage Ave" "Shields Ave & 28th Pl" "Winthrop Ave & Lawrence Ave" ...
##  $ start_station_id  : chr [1:228496] "15651" "15651" "15443" "TA1308000021" ...
##  $ end_station_name  : chr [1:228496] "Stave St & Armitage Ave" "Central Park Ave & Bloomingdale Ave" "Halsted St & 35th St" "Broadway & Sheridan Rd" ...
##  $ end_station_id    : chr [1:228496] "13266" "18017" "TA1308000043" "13323" ...
##  $ start_lat         : num [1:228496] 41.9 41.9 41.8 42 42 ...
##  $ start_lng         : num [1:228496] -87.7 -87.7 -87.6 -87.7 -87.7 ...
##  $ end_lat           : num [1:228496] 41.9 41.9 41.8 42 42.1 ...
##  $ end_lng           : num [1:228496] -87.7 -87.7 -87.6 -87.6 -87.7 ...
##  $ member_casual     : chr [1:228496] "casual" "casual" "casual" "casual" ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   ride_id = col_character(),
##   ..   rideable_type = col_character(),
##   ..   started_at = col_datetime(format = ""),
##   ..   ended_at = col_datetime(format = ""),
##   ..   start_station_name = col_character(),
##   ..   start_station_id = col_character(),
##   ..   end_station_name = col_character(),
##   ..   end_station_id = col_character(),
##   ..   start_lat = col_double(),
##   ..   start_lng = col_double(),
##   ..   end_lat = col_double(),
##   ..   end_lng = col_double(),
##   ..   member_casual = col_character()
##   .. )
##  - attr(*, "problems")=<externalptr>
str(April_2021_trip)
## spec_tbl_df [337,230 x 13] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ ride_id           : chr [1:337230] "6C992BD37A98A63F" "1E0145613A209000" "E498E15508A80BAD" "1887262AD101C604" ...
##  $ rideable_type     : chr [1:337230] "classic_bike" "docked_bike" "docked_bike" "classic_bike" ...
##  $ started_at        : POSIXct[1:337230], format: "2021-04-12 18:25:36" "2021-04-27 17:27:11" ...
##  $ ended_at          : POSIXct[1:337230], format: "2021-04-12 18:56:55" "2021-04-27 18:31:29" ...
##  $ start_station_name: chr [1:337230] "State St & Pearson St" "Dorchester Ave & 49th St" "Loomis Blvd & 84th St" "Honore St & Division St" ...
##  $ start_station_id  : chr [1:337230] "TA1307000061" "KA1503000069" "20121" "TA1305000034" ...
##  $ end_station_name  : chr [1:337230] "Southport Ave & Waveland Ave" "Dorchester Ave & 49th St" "Loomis Blvd & 84th St" "Southport Ave & Waveland Ave" ...
##  $ end_station_id    : chr [1:337230] "13235" "KA1503000069" "20121" "13235" ...
##  $ start_lat         : num [1:337230] 41.9 41.8 41.7 41.9 41.7 ...
##  $ start_lng         : num [1:337230] -87.6 -87.6 -87.7 -87.7 -87.7 ...
##  $ end_lat           : num [1:337230] 41.9 41.8 41.7 41.9 41.7 ...
##  $ end_lng           : num [1:337230] -87.7 -87.6 -87.7 -87.7 -87.7 ...
##  $ member_casual     : chr [1:337230] "member" "casual" "casual" "member" ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   ride_id = col_character(),
##   ..   rideable_type = col_character(),
##   ..   started_at = col_datetime(format = ""),
##   ..   ended_at = col_datetime(format = ""),
##   ..   start_station_name = col_character(),
##   ..   start_station_id = col_character(),
##   ..   end_station_name = col_character(),
##   ..   end_station_id = col_character(),
##   ..   start_lat = col_double(),
##   ..   start_lng = col_double(),
##   ..   end_lat = col_double(),
##   ..   end_lng = col_double(),
##   ..   member_casual = col_character()
##   .. )
##  - attr(*, "problems")=<externalptr>
str(May_2021_trip)
## spec_tbl_df [531,633 x 13] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ ride_id           : chr [1:531633] "C809ED75D6160B2A" "DD59FDCE0ACACAF3" "0AB83CB88C43EFC2" "7881AC6D39110C60" ...
##  $ rideable_type     : chr [1:531633] "electric_bike" "electric_bike" "electric_bike" "electric_bike" ...
##  $ started_at        : POSIXct[1:531633], format: "2021-05-30 11:58:15" "2021-05-30 11:29:14" ...
##  $ ended_at          : POSIXct[1:531633], format: "2021-05-30 12:10:39" "2021-05-30 12:14:09" ...
##  $ start_station_name: chr [1:531633] NA NA NA NA ...
##  $ start_station_id  : chr [1:531633] NA NA NA NA ...
##  $ end_station_name  : chr [1:531633] NA NA NA NA ...
##  $ end_station_id    : chr [1:531633] NA NA NA NA ...
##  $ start_lat         : num [1:531633] 41.9 41.9 41.9 41.9 41.9 ...
##  $ start_lng         : num [1:531633] -87.6 -87.6 -87.7 -87.7 -87.7 ...
##  $ end_lat           : num [1:531633] 41.9 41.8 41.9 41.9 41.9 ...
##  $ end_lng           : num [1:531633] -87.6 -87.6 -87.7 -87.7 -87.7 ...
##  $ member_casual     : chr [1:531633] "casual" "casual" "casual" "casual" ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   ride_id = col_character(),
##   ..   rideable_type = col_character(),
##   ..   started_at = col_datetime(format = ""),
##   ..   ended_at = col_datetime(format = ""),
##   ..   start_station_name = col_character(),
##   ..   start_station_id = col_character(),
##   ..   end_station_name = col_character(),
##   ..   end_station_id = col_character(),
##   ..   start_lat = col_double(),
##   ..   start_lng = col_double(),
##   ..   end_lat = col_double(),
##   ..   end_lng = col_double(),
##   ..   member_casual = col_character()
##   .. )
##  - attr(*, "problems")=<externalptr>
str(June_2021_trip)
## spec_tbl_df [729,595 x 13] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ ride_id           : chr [1:729595] "99FEC93BA843FB20" "06048DCFC8520CAF" "9598066F68045DF2" "B03C0FE48C412214" ...
##  $ rideable_type     : chr [1:729595] "electric_bike" "electric_bike" "electric_bike" "electric_bike" ...
##  $ started_at        : POSIXct[1:729595], format: "2021-06-13 14:31:28" "2021-06-04 11:18:02" ...
##  $ ended_at          : POSIXct[1:729595], format: "2021-06-13 14:34:11" "2021-06-04 11:24:19" ...
##  $ start_station_name: chr [1:729595] NA NA NA NA ...
##  $ start_station_id  : chr [1:729595] NA NA NA NA ...
##  $ end_station_name  : chr [1:729595] NA NA NA NA ...
##  $ end_station_id    : chr [1:729595] NA NA NA NA ...
##  $ start_lat         : num [1:729595] 41.8 41.8 41.8 41.8 41.8 ...
##  $ start_lng         : num [1:729595] -87.6 -87.6 -87.6 -87.6 -87.6 ...
##  $ end_lat           : num [1:729595] 41.8 41.8 41.8 41.8 41.8 ...
##  $ end_lng           : num [1:729595] -87.6 -87.6 -87.6 -87.6 -87.6 ...
##  $ member_casual     : chr [1:729595] "member" "member" "member" "member" ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   ride_id = col_character(),
##   ..   rideable_type = col_character(),
##   ..   started_at = col_datetime(format = ""),
##   ..   ended_at = col_datetime(format = ""),
##   ..   start_station_name = col_character(),
##   ..   start_station_id = col_character(),
##   ..   end_station_name = col_character(),
##   ..   end_station_id = col_character(),
##   ..   start_lat = col_double(),
##   ..   start_lng = col_double(),
##   ..   end_lat = col_double(),
##   ..   end_lng = col_double(),
##   ..   member_casual = col_character()
##   .. )
##  - attr(*, "problems")=<externalptr>
str(July_2021_trip)
## spec_tbl_df [822,410 x 13] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ ride_id           : chr [1:822410] "0A1B623926EF4E16" "B2D5583A5A5E76EE" "6F264597DDBF427A" "379B58EAB20E8AA5" ...
##  $ rideable_type     : chr [1:822410] "docked_bike" "classic_bike" "classic_bike" "classic_bike" ...
##  $ started_at        : POSIXct[1:822410], format: "2021-07-02 14:44:36" "2021-07-07 16:57:42" ...
##  $ ended_at          : POSIXct[1:822410], format: "2021-07-02 15:19:58" "2021-07-07 17:16:09" ...
##  $ start_station_name: chr [1:822410] "Michigan Ave & Washington St" "California Ave & Cortez St" "Wabash Ave & 16th St" "California Ave & Cortez St" ...
##  $ start_station_id  : chr [1:822410] "13001" "17660" "SL-012" "17660" ...
##  $ end_station_name  : chr [1:822410] "Halsted St & North Branch St" "Wood St & Hubbard St" "Rush St & Hubbard St" "Carpenter St & Huron St" ...
##  $ end_station_id    : chr [1:822410] "KA1504000117" "13432" "KA1503000044" "13196" ...
##  $ start_lat         : num [1:822410] 41.9 41.9 41.9 41.9 41.9 ...
##  $ start_lng         : num [1:822410] -87.6 -87.7 -87.6 -87.7 -87.7 ...
##  $ end_lat           : num [1:822410] 41.9 41.9 41.9 41.9 41.9 ...
##  $ end_lng           : num [1:822410] -87.6 -87.7 -87.6 -87.7 -87.7 ...
##  $ member_casual     : chr [1:822410] "casual" "casual" "member" "member" ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   ride_id = col_character(),
##   ..   rideable_type = col_character(),
##   ..   started_at = col_datetime(format = ""),
##   ..   ended_at = col_datetime(format = ""),
##   ..   start_station_name = col_character(),
##   ..   start_station_id = col_character(),
##   ..   end_station_name = col_character(),
##   ..   end_station_id = col_character(),
##   ..   start_lat = col_double(),
##   ..   start_lng = col_double(),
##   ..   end_lat = col_double(),
##   ..   end_lng = col_double(),
##   ..   member_casual = col_character()
##   .. )
##  - attr(*, "problems")=<externalptr>
str(August_2021_trip)
## spec_tbl_df [804,352 x 13] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ ride_id           : chr [1:804352] "99103BB87CC6C1BB" "EAFCCCFB0A3FC5A1" "9EF4F46C57AD234D" "5834D3208BFAF1DA" ...
##  $ rideable_type     : chr [1:804352] "electric_bike" "electric_bike" "electric_bike" "electric_bike" ...
##  $ started_at        : POSIXct[1:804352], format: "2021-08-10 17:15:49" "2021-08-10 17:23:14" ...
##  $ ended_at          : POSIXct[1:804352], format: "2021-08-10 17:22:44" "2021-08-10 17:39:24" ...
##  $ start_station_name: chr [1:804352] NA NA NA NA ...
##  $ start_station_id  : chr [1:804352] NA NA NA NA ...
##  $ end_station_name  : chr [1:804352] NA NA NA NA ...
##  $ end_station_id    : chr [1:804352] NA NA NA NA ...
##  $ start_lat         : num [1:804352] 41.8 41.8 42 42 41.8 ...
##  $ start_lng         : num [1:804352] -87.7 -87.7 -87.7 -87.7 -87.6 ...
##  $ end_lat           : num [1:804352] 41.8 41.8 42 42 41.8 ...
##  $ end_lng           : num [1:804352] -87.7 -87.6 -87.7 -87.7 -87.6 ...
##  $ member_casual     : chr [1:804352] "member" "member" "member" "member" ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   ride_id = col_character(),
##   ..   rideable_type = col_character(),
##   ..   started_at = col_datetime(format = ""),
##   ..   ended_at = col_datetime(format = ""),
##   ..   start_station_name = col_character(),
##   ..   start_station_id = col_character(),
##   ..   end_station_name = col_character(),
##   ..   end_station_id = col_character(),
##   ..   start_lat = col_double(),
##   ..   start_lng = col_double(),
##   ..   end_lat = col_double(),
##   ..   end_lng = col_double(),
##   ..   member_casual = col_character()
##   .. )
##  - attr(*, "problems")=<externalptr>
str(Sept_2021_trip)
## spec_tbl_df [756,147 x 13] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ ride_id           : chr [1:756147] "9DC7B962304CBFD8" "F930E2C6872D6B32" "6EF72137900BB910" "78D1DE133B3DBF55" ...
##  $ rideable_type     : chr [1:756147] "electric_bike" "electric_bike" "electric_bike" "electric_bike" ...
##  $ started_at        : POSIXct[1:756147], format: "2021-09-28 16:07:10" "2021-09-28 14:24:51" ...
##  $ ended_at          : POSIXct[1:756147], format: "2021-09-28 16:09:54" "2021-09-28 14:40:05" ...
##  $ start_station_name: chr [1:756147] NA NA NA NA ...
##  $ start_station_id  : chr [1:756147] NA NA NA NA ...
##  $ end_station_name  : chr [1:756147] NA NA NA NA ...
##  $ end_station_id    : chr [1:756147] NA NA NA NA ...
##  $ start_lat         : num [1:756147] 41.9 41.9 41.8 41.8 41.9 ...
##  $ start_lng         : num [1:756147] -87.7 -87.6 -87.7 -87.7 -87.7 ...
##  $ end_lat           : num [1:756147] 41.9 42 41.8 41.8 41.9 ...
##  $ end_lng           : num [1:756147] -87.7 -87.7 -87.7 -87.7 -87.7 ...
##  $ member_casual     : chr [1:756147] "casual" "casual" "casual" "casual" ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   ride_id = col_character(),
##   ..   rideable_type = col_character(),
##   ..   started_at = col_datetime(format = ""),
##   ..   ended_at = col_datetime(format = ""),
##   ..   start_station_name = col_character(),
##   ..   start_station_id = col_character(),
##   ..   end_station_name = col_character(),
##   ..   end_station_id = col_character(),
##   ..   start_lat = col_double(),
##   ..   start_lng = col_double(),
##   ..   end_lat = col_double(),
##   ..   end_lng = col_double(),
##   ..   member_casual = col_character()
##   .. )
##  - attr(*, "problems")=<externalptr>
str(Oct_2021_trip)
## spec_tbl_df [631,226 x 13] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ ride_id           : chr [1:631226] "620BC6107255BF4C" "4471C70731AB2E45" "26CA69D43D15EE14" "362947F0437E1514" ...
##  $ rideable_type     : chr [1:631226] "electric_bike" "electric_bike" "electric_bike" "electric_bike" ...
##  $ started_at        : POSIXct[1:631226], format: "2021-10-22 12:46:42" "2021-10-21 09:12:37" ...
##  $ ended_at          : POSIXct[1:631226], format: "2021-10-22 12:49:50" "2021-10-21 09:14:14" ...
##  $ start_station_name: chr [1:631226] "Kingsbury St & Kinzie St" NA NA NA ...
##  $ start_station_id  : chr [1:631226] "KA1503000043" NA NA NA ...
##  $ end_station_name  : chr [1:631226] NA NA NA NA ...
##  $ end_station_id    : chr [1:631226] NA NA NA NA ...
##  $ start_lat         : num [1:631226] 41.9 41.9 41.9 41.9 41.9 ...
##  $ start_lng         : num [1:631226] -87.6 -87.7 -87.7 -87.7 -87.7 ...
##  $ end_lat           : num [1:631226] 41.9 41.9 41.9 41.9 41.9 ...
##  $ end_lng           : num [1:631226] -87.6 -87.7 -87.7 -87.7 -87.7 ...
##  $ member_casual     : chr [1:631226] "member" "member" "member" "member" ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   ride_id = col_character(),
##   ..   rideable_type = col_character(),
##   ..   started_at = col_datetime(format = ""),
##   ..   ended_at = col_datetime(format = ""),
##   ..   start_station_name = col_character(),
##   ..   start_station_id = col_character(),
##   ..   end_station_name = col_character(),
##   ..   end_station_id = col_character(),
##   ..   start_lat = col_double(),
##   ..   start_lng = col_double(),
##   ..   end_lat = col_double(),
##   ..   end_lng = col_double(),
##   ..   member_casual = col_character()
##   .. )
##  - attr(*, "problems")=<externalptr>
str(Nov_2021_trip)
## spec_tbl_df [359,978 x 13] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ ride_id           : chr [1:359978] "7C00A93E10556E47" "90854840DFD508BA" "0A7D10CDD144061C" "2F3BE33085BCFF02" ...
##  $ rideable_type     : chr [1:359978] "electric_bike" "electric_bike" "electric_bike" "electric_bike" ...
##  $ started_at        : POSIXct[1:359978], format: "2021-11-27 13:27:38" "2021-11-27 13:38:25" ...
##  $ ended_at          : POSIXct[1:359978], format: "2021-11-27 13:46:38" "2021-11-27 13:56:10" ...
##  $ start_station_name: chr [1:359978] NA NA NA NA ...
##  $ start_station_id  : chr [1:359978] NA NA NA NA ...
##  $ end_station_name  : chr [1:359978] NA NA NA NA ...
##  $ end_station_id    : chr [1:359978] NA NA NA NA ...
##  $ start_lat         : num [1:359978] 41.9 42 42 41.9 41.9 ...
##  $ start_lng         : num [1:359978] -87.7 -87.7 -87.7 -87.8 -87.6 ...
##  $ end_lat           : num [1:359978] 42 41.9 42 41.9 41.9 ...
##  $ end_lng           : num [1:359978] -87.7 -87.7 -87.7 -87.8 -87.6 ...
##  $ member_casual     : chr [1:359978] "casual" "casual" "casual" "casual" ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   ride_id = col_character(),
##   ..   rideable_type = col_character(),
##   ..   started_at = col_datetime(format = ""),
##   ..   ended_at = col_datetime(format = ""),
##   ..   start_station_name = col_character(),
##   ..   start_station_id = col_character(),
##   ..   end_station_name = col_character(),
##   ..   end_station_id = col_character(),
##   ..   start_lat = col_double(),
##   ..   start_lng = col_double(),
##   ..   end_lat = col_double(),
##   ..   end_lng = col_double(),
##   ..   member_casual = col_character()
##   .. )
##  - attr(*, "problems")=<externalptr>
str(Dec_2021_trip)
## spec_tbl_df [247,540 x 13] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ ride_id           : chr [1:247540] "46F8167220E4431F" "73A77762838B32FD" "4CF42452054F59C5" "3278BA87BF698339" ...
##  $ rideable_type     : chr [1:247540] "electric_bike" "electric_bike" "electric_bike" "classic_bike" ...
##  $ started_at        : POSIXct[1:247540], format: "2021-12-07 15:06:07" "2021-12-11 03:43:29" ...
##  $ ended_at          : POSIXct[1:247540], format: "2021-12-07 15:13:42" "2021-12-11 04:10:23" ...
##  $ start_station_name: chr [1:247540] "Laflin St & Cullerton St" "LaSalle Dr & Huron St" "Halsted St & North Branch St" "Halsted St & North Branch St" ...
##  $ start_station_id  : chr [1:247540] "13307" "KP1705001026" "KA1504000117" "KA1504000117" ...
##  $ end_station_name  : chr [1:247540] "Morgan St & Polk St" "Clarendon Ave & Leland Ave" "Broadway & Barry Ave" "LaSalle Dr & Huron St" ...
##  $ end_station_id    : chr [1:247540] "TA1307000130" "TA1307000119" "13137" "KP1705001026" ...
##  $ start_lat         : num [1:247540] 41.9 41.9 41.9 41.9 41.9 ...
##  $ start_lng         : num [1:247540] -87.7 -87.6 -87.6 -87.6 -87.7 ...
##  $ end_lat           : num [1:247540] 41.9 42 41.9 41.9 41.9 ...
##  $ end_lng           : num [1:247540] -87.7 -87.7 -87.6 -87.6 -87.6 ...
##  $ member_casual     : chr [1:247540] "member" "casual" "member" "member" ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   ride_id = col_character(),
##   ..   rideable_type = col_character(),
##   ..   started_at = col_datetime(format = ""),
##   ..   ended_at = col_datetime(format = ""),
##   ..   start_station_name = col_character(),
##   ..   start_station_id = col_character(),
##   ..   end_station_name = col_character(),
##   ..   end_station_id = col_character(),
##   ..   start_lat = col_double(),
##   ..   start_lng = col_double(),
##   ..   end_lat = col_double(),
##   ..   end_lng = col_double(),
##   ..   member_casual = col_character()
##   .. )
##  - attr(*, "problems")=<externalptr>
str(Jan_2022_trip)
## spec_tbl_df [103,770 x 13] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ ride_id           : chr [1:103770] "C2F7DD78E82EC875" "A6CF8980A652D272" "BD0F91DFF741C66D" "CBB80ED419105406" ...
##  $ rideable_type     : chr [1:103770] "electric_bike" "electric_bike" "classic_bike" "classic_bike" ...
##  $ started_at        : POSIXct[1:103770], format: "2022-01-13 11:59:47" "2022-01-10 08:41:56" ...
##  $ ended_at          : POSIXct[1:103770], format: "2022-01-13 12:02:44" "2022-01-10 08:46:17" ...
##  $ start_station_name: chr [1:103770] "Glenwood Ave & Touhy Ave" "Glenwood Ave & Touhy Ave" "Sheffield Ave & Fullerton Ave" "Clark St & Bryn Mawr Ave" ...
##  $ start_station_id  : chr [1:103770] "525" "525" "TA1306000016" "KA1504000151" ...
##  $ end_station_name  : chr [1:103770] "Clark St & Touhy Ave" "Clark St & Touhy Ave" "Greenview Ave & Fullerton Ave" "Paulina St & Montrose Ave" ...
##  $ end_station_id    : chr [1:103770] "RP-007" "RP-007" "TA1307000001" "TA1309000021" ...
##  $ start_lat         : num [1:103770] 42 42 41.9 42 41.9 ...
##  $ start_lng         : num [1:103770] -87.7 -87.7 -87.7 -87.7 -87.6 ...
##  $ end_lat           : num [1:103770] 42 42 41.9 42 41.9 ...
##  $ end_lng           : num [1:103770] -87.7 -87.7 -87.7 -87.7 -87.6 ...
##  $ member_casual     : chr [1:103770] "casual" "casual" "member" "casual" ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   ride_id = col_character(),
##   ..   rideable_type = col_character(),
##   ..   started_at = col_datetime(format = ""),
##   ..   ended_at = col_datetime(format = ""),
##   ..   start_station_name = col_character(),
##   ..   start_station_id = col_character(),
##   ..   end_station_name = col_character(),
##   ..   end_station_id = col_character(),
##   ..   start_lat = col_double(),
##   ..   start_lng = col_double(),
##   ..   end_lat = col_double(),
##   ..   end_lng = col_double(),
##   ..   member_casual = col_character()
##   .. )
##  - attr(*, "problems")=<externalptr>

Keeping format uniform for Consistancy

Feb_2021_trip <- mutate(Feb_2021_trip, start_station_id = as.character(start_station_id), end_station_id = as.character(end_station_id), started_at = as.POSIXct(started_at, format="%Y-%m-%d %H:%M:%S"), ended_at = as.POSIXct(ended_at, format="%Y-%m-%d %H:%M:%S"))
March_2021_trip <- mutate(March_2021_trip, start_station_id = as.character(start_station_id), end_station_id = as.character(end_station_id), started_at = as.POSIXct(started_at, format="%Y-%m-%d %H:%M:%S"), ended_at = as.POSIXct(ended_at, format="%Y-%m-%d %H:%M:%S"))
April_2021_trip <- mutate(April_2021_trip, start_station_id = as.character(start_station_id), end_station_id = as.character(end_station_id), started_at = as.POSIXct(started_at, format="%Y-%m-%d %H:%M:%S"), ended_at = as.POSIXct(ended_at, format="%Y-%m-%d %H:%M:%S"))
May_2021_trip <- mutate(May_2021_trip, start_station_id = as.character(start_station_id), end_station_id = as.character(end_station_id), started_at = as.POSIXct(started_at, format="%Y-%m-%d %H:%M:%S"), ended_at = as.POSIXct(ended_at, format="%Y-%m-%d %H:%M:%S"))
June_2021_trip <- mutate(June_2021_trip, start_station_id = as.character(start_station_id), end_station_id = as.character(end_station_id), started_at = as.POSIXct(started_at, format="%Y-%m-%d %H:%M:%S"), ended_at = as.POSIXct(ended_at, format="%Y-%m-%d %H:%M:%S"))
July_2021_trip <- mutate(July_2021_trip, start_station_id = as.character(start_station_id), end_station_id = as.character(end_station_id), started_at = as.POSIXct(started_at, format="%Y-%m-%d %H:%M:%S"), ended_at = as.POSIXct(ended_at, format="%Y-%m-%d %H:%M:%S"))
August_2021_trip <- mutate(August_2021_trip, start_station_id = as.character(start_station_id), end_station_id = as.character(end_station_id), started_at = as.POSIXct(started_at, format="%Y-%m-%d %H:%M:%S"), ended_at = as.POSIXct(ended_at, format="%Y-%m-%d %H:%M:%S"))
Sept_2021_trip <- mutate(Sept_2021_trip, start_station_id = as.character(start_station_id), end_station_id = as.character(end_station_id), started_at = as.POSIXct(started_at, format="%Y-%m-%d %H:%M:%S"), ended_at = as.POSIXct(ended_at, format="%Y-%m-%d %H:%M:%S"))
Oct_2021_trip <- mutate(Oct_2021_trip, start_station_id = as.character(start_station_id), end_station_id = as.character(end_station_id), started_at = as.POSIXct(started_at, format="%Y-%m-%d %H:%M:%S"), ended_at = as.POSIXct(ended_at, format="%Y-%m-%d %H:%M:%S"))
Nov_2021_trip <- mutate(Nov_2021_trip, start_station_id = as.character(start_station_id), end_station_id = as.character(end_station_id), started_at = as.POSIXct(started_at, format="%Y-%m-%d %H:%M:%S"), ended_at = as.POSIXct(ended_at, format="%Y-%m-%d %H:%M:%S"))
Dec_2021_trip <- mutate(Dec_2021_trip, start_station_id = as.character(start_station_id), end_station_id = as.character(end_station_id), started_at = as.POSIXct(started_at, format="%Y-%m-%d %H:%M:%S"), ended_at = as.POSIXct(ended_at, format="%Y-%m-%d %H:%M:%S"))
Jan_2022_trip <- mutate(Jan_2022_trip, start_station_id = as.character(start_station_id), end_station_id = as.character(end_station_id), started_at = as.POSIXct(started_at, format="%Y-%m-%d %H:%M:%S"), ended_at = as.POSIXct(ended_at, format="%Y-%m-%d %H:%M:%S"))

Combining all files together into one by binding rows

All_trip <- bind_rows(Feb_2021_trip, March_2021_trip, April_2021_trip, May_2021_trip, June_2021_trip, July_2021_trip, August_2021_trip, Sept_2021_trip, Oct_2021_trip, Nov_2021_trip, Dec_2021_trip, Jan_2022_trip)

Inspecting Combined Data set

#checking Column name
colnames(All_trip)
##  [1] "ride_id"            "rideable_type"      "started_at"        
##  [4] "ended_at"           "start_station_name" "start_station_id"  
##  [7] "end_station_name"   "end_station_id"     "start_lat"         
## [10] "start_lng"          "end_lat"            "end_lng"           
## [13] "member_casual"
#checking number of row
nrow(All_trip)
## [1] 5601999
#Checking Dimensions
dim(All_trip)
## [1] 5601999      13
#checking First six row of data
head(All_trip)
## # A tibble: 6 x 13
##   ride_id rideable_type started_at ended_at start_station_name        
##   <chr>   <chr>         <dttm>     <dttm>   <chr>                     
## 1 89E7AA~ classic_bike  NA         NA       Glenwood Ave & Touhy Ave  
## 2 0FEFDE~ classic_bike  NA         NA       Glenwood Ave & Touhy Ave  
## 3 E6159D~ electric_bike NA         NA       Clark St & Lake St        
## 4 B32D31~ classic_bike  NA         NA       Wood St & Chicago Ave     
## 5 83E463~ electric_bike NA         NA       State St & 33rd St        
## 6 BDAA7E~ electric_bike NA         NA       Fairbanks St & Superior St
## # ... with 8 more variables: start_station_id <chr>, end_station_name <chr>,
## #   end_station_id <chr>, start_lat <dbl>, start_lng <dbl>, end_lat <dbl>,
## #   end_lng <dbl>, member_casual <chr>
#Checking Column names and Data types
str(All_trip)
## tibble [5,601,999 x 13] (S3: tbl_df/tbl/data.frame)
##  $ ride_id           : chr [1:5601999] "89E7AA6C29227EFF" "0FEFDE2603568365" "E6159D746B2DBB91" "B32D3199F1C2E75B" ...
##  $ rideable_type     : chr [1:5601999] "classic_bike" "classic_bike" "electric_bike" "classic_bike" ...
##  $ started_at        : POSIXct[1:5601999], format: NA NA ...
##  $ ended_at          : POSIXct[1:5601999], format: NA NA ...
##  $ start_station_name: chr [1:5601999] "Glenwood Ave & Touhy Ave" "Glenwood Ave & Touhy Ave" "Clark St & Lake St" "Wood St & Chicago Ave" ...
##  $ start_station_id  : chr [1:5601999] "525" "525" "KA1503000012" "637" ...
##  $ end_station_name  : chr [1:5601999] "Sheridan Rd & Columbia Ave" "Bosworth Ave & Howard St" "State St & Randolph St" "Honore St & Division St" ...
##  $ end_station_id    : chr [1:5601999] "660" "16806" "TA1305000029" "TA1305000034" ...
##  $ start_lat         : num [1:5601999] 42 42 41.9 41.9 41.8 ...
##  $ start_lng         : num [1:5601999] -87.7 -87.7 -87.6 -87.7 -87.6 ...
##  $ end_lat           : num [1:5601999] 42 42 41.9 41.9 41.8 ...
##  $ end_lng           : num [1:5601999] -87.7 -87.7 -87.6 -87.7 -87.6 ...
##  $ member_casual     : chr [1:5601999] "member" "casual" "member" "member" ...
#checking Summary of data frame
summary(All_trip)
##    ride_id          rideable_type        started_at                 
##  Length:5601999     Length:5601999     Min.   :2021-03-01 00:01:09  
##  Class :character   Class :character   1st Qu.:2021-06-12 13:30:12  
##  Mode  :character   Mode  :character   Median :2021-08-05 21:12:13  
##                                        Mean   :2021-08-06 08:30:19  
##                                        3rd Qu.:2021-09-29 07:41:17  
##                                        Max.   :2022-01-31 23:58:37  
##                                        NA's   :49622                
##     ended_at                   start_station_name start_station_id  
##  Min.   :2021-03-01 00:06:28   Length:5601999     Length:5601999    
##  1st Qu.:2021-06-12 13:57:47   Class :character   Class :character  
##  Median :2021-08-05 21:33:22   Mode  :character   Mode  :character  
##  Mean   :2021-08-06 08:52:13                                        
##  3rd Qu.:2021-09-29 07:55:19                                        
##  Max.   :2022-02-01 01:46:16                                        
##  NA's   :49622                                                      
##  end_station_name   end_station_id       start_lat       start_lng     
##  Length:5601999     Length:5601999     Min.   :41.64   Min.   :-87.84  
##  Class :character   Class :character   1st Qu.:41.88   1st Qu.:-87.66  
##  Mode  :character   Mode  :character   Median :41.90   Median :-87.64  
##                                        Mean   :41.90   Mean   :-87.65  
##                                        3rd Qu.:41.93   3rd Qu.:-87.63  
##                                        Max.   :45.64   Max.   :-73.80  
##                                                                        
##     end_lat         end_lng       member_casual     
##  Min.   :41.39   Min.   :-88.97   Length:5601999    
##  1st Qu.:41.88   1st Qu.:-87.66   Class :character  
##  Median :41.90   Median :-87.64   Mode  :character  
##  Mean   :41.90   Mean   :-87.65                     
##  3rd Qu.:41.93   3rd Qu.:-87.63                     
##  Max.   :42.17   Max.   :-87.49                     
##  NA's   :4754    NA's   :4754
#Viewing some data 
glimpse(All_trip)
## Rows: 5,601,999
## Columns: 13
## $ ride_id            <chr> "89E7AA6C29227EFF", "0FEFDE2603568365", "E6159D746B~
## $ rideable_type      <chr> "classic_bike", "classic_bike", "electric_bike", "c~
## $ started_at         <dttm> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA~
## $ ended_at           <dttm> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA~
## $ start_station_name <chr> "Glenwood Ave & Touhy Ave", "Glenwood Ave & Touhy A~
## $ start_station_id   <chr> "525", "525", "KA1503000012", "637", "13216", "1800~
## $ end_station_name   <chr> "Sheridan Rd & Columbia Ave", "Bosworth Ave & Howar~
## $ end_station_id     <chr> "660", "16806", "TA1305000029", "TA1305000034", "TA~
## $ start_lat          <dbl> 42.01270, 42.01270, 41.88579, 41.89563, 41.83473, 4~
## $ start_lng          <dbl> -87.66606, -87.66606, -87.63110, -87.67207, -87.625~
## $ end_lat            <dbl> 42.00458, 42.01954, 41.88487, 41.90312, 41.83816, 4~
## $ end_lng            <dbl> -87.66141, -87.66956, -87.62750, -87.67394, -87.645~
## $ member_casual      <chr> "member", "casual", "member", "member", "member", "~

Filtering and cleaning Data set

colSums(is.na(All_trip))
##            ride_id      rideable_type         started_at           ended_at 
##                  0                  0              49622              49622 
## start_station_name   start_station_id   end_station_name     end_station_id 
##             698444             698441             746820             746820 
##          start_lat          start_lng            end_lat            end_lng 
##                  0                  0               4754               4754 
##      member_casual 
##                  0
All_trip_cleaned <- All_trip[complete.cases(All_trip), ]
# Filtering data In which trip Trip start time is greater than Trip end time
All_trip_cleaned <- All_trip_cleaned %>% filter(All_trip_cleaned$started_at<All_trip_cleaned$ended_at)
#Saving new column date
All_trip_cleaned$date <- as.Date(All_trip_cleaned$started_at, format= "%m/%d/%Y %H:%M")
#Saving new column month
All_trip_cleaned$month <- format(as.Date(All_trip_cleaned$date), "%m")
#Saving new column dates
All_trip_cleaned$dates <- format(as.Date(All_trip_cleaned$date), "%d")
#Saving new column year
All_trip_cleaned$year <- format(as.Date(All_trip_cleaned$date), "%Y")
#Saving new column day
All_trip_cleaned$day <- format(as.Date(All_trip_cleaned$date), "%A")
#Viewing some data to check everything is uniform and consistent 
glimpse(All_trip_cleaned)
## Rows: 4,541,730
## Columns: 18
## $ ride_id            <chr> "CFA86D4455AA1030", "30D9DC61227D1AF3", "846D87A156~
## $ rideable_type      <chr> "classic_bike", "classic_bike", "classic_bike", "cl~
## $ started_at         <dttm> 2021-03-16 08:32:30, 2021-03-28 01:26:28, 2021-03-~
## $ ended_at           <dttm> 2021-03-16 08:36:34, 2021-03-28 01:36:55, 2021-03-~
## $ start_station_name <chr> "Humboldt Blvd & Armitage Ave", "Humboldt Blvd & Ar~
## $ start_station_id   <chr> "15651", "15651", "15443", "TA1308000021", "525", "~
## $ end_station_name   <chr> "Stave St & Armitage Ave", "Central Park Ave & Bloo~
## $ end_station_id     <chr> "13266", "18017", "TA1308000043", "13323", "E008", ~
## $ start_lat          <dbl> 41.91751, 41.91751, 41.84273, 41.96881, 42.01270, 4~
## $ start_lng          <dbl> -87.70181, -87.70181, -87.63549, -87.65766, -87.666~
## $ end_lat            <dbl> 41.91774, 41.91417, 41.83066, 41.95283, 42.05049, 4~
## $ end_lng            <dbl> -87.69139, -87.71676, -87.64717, -87.64999, -87.677~
## $ member_casual      <chr> "casual", "casual", "casual", "casual", "casual", "~
## $ date               <date> 2021-03-16, 2021-03-28, 2021-03-11, 2021-03-11, 20~
## $ month              <chr> "03", "03", "03", "03", "03", "03", "03", "03", "03~
## $ dates              <chr> "16", "28", "11", "11", "21", "20", "20", "23", "31~
## $ year               <chr> "2021", "2021", "2021", "2021", "2021", "2021", "20~
## $ day                <chr> "Tuesday", "Sunday", "Thursday", "Thursday", "Sunda~
#Calculating ride duration/ Trip duration and dividing it by 60 to convert it to minutes
All_trip_cleaned$ride_duration <- difftime(All_trip_cleaned$ended_at,All_trip_cleaned$started_at)/60
str(All_trip_cleaned)
## tibble [4,541,730 x 19] (S3: tbl_df/tbl/data.frame)
##  $ ride_id           : chr [1:4541730] "CFA86D4455AA1030" "30D9DC61227D1AF3" "846D87A15682A284" "994D05AA75A168F2" ...
##  $ rideable_type     : chr [1:4541730] "classic_bike" "classic_bike" "classic_bike" "classic_bike" ...
##  $ started_at        : POSIXct[1:4541730], format: "2021-03-16 08:32:30" "2021-03-28 01:26:28" ...
##  $ ended_at          : POSIXct[1:4541730], format: "2021-03-16 08:36:34" "2021-03-28 01:36:55" ...
##  $ start_station_name: chr [1:4541730] "Humboldt Blvd & Armitage Ave" "Humboldt Blvd & Armitage Ave" "Shields Ave & 28th Pl" "Winthrop Ave & Lawrence Ave" ...
##  $ start_station_id  : chr [1:4541730] "15651" "15651" "15443" "TA1308000021" ...
##  $ end_station_name  : chr [1:4541730] "Stave St & Armitage Ave" "Central Park Ave & Bloomingdale Ave" "Halsted St & 35th St" "Broadway & Sheridan Rd" ...
##  $ end_station_id    : chr [1:4541730] "13266" "18017" "TA1308000043" "13323" ...
##  $ start_lat         : num [1:4541730] 41.9 41.9 41.8 42 42 ...
##  $ start_lng         : num [1:4541730] -87.7 -87.7 -87.6 -87.7 -87.7 ...
##  $ end_lat           : num [1:4541730] 41.9 41.9 41.8 42 42.1 ...
##  $ end_lng           : num [1:4541730] -87.7 -87.7 -87.6 -87.6 -87.7 ...
##  $ member_casual     : chr [1:4541730] "casual" "casual" "casual" "casual" ...
##  $ date              : Date[1:4541730], format: "2021-03-16" "2021-03-28" ...
##  $ month             : chr [1:4541730] "03" "03" "03" "03" ...
##  $ dates             : chr [1:4541730] "16" "28" "11" "11" ...
##  $ year              : chr [1:4541730] "2021" "2021" "2021" "2021" ...
##  $ day               : chr [1:4541730] "Tuesday" "Sunday" "Thursday" "Thursday" ...
##  $ ride_duration     : 'difftime' num [1:4541730] 4.06666666666667 10.45 16.4 28.9833333333333 ...
##   ..- attr(*, "units")= chr "secs"
#Converting “ride_length” from Factor to numeric in order to do calculations
is.factor(All_trip_cleaned$ride_duration)
## [1] FALSE
All_trip_cleaned$ride_duration <- as.numeric(as.character(All_trip_cleaned$ride_duration))
is.numeric(All_trip_cleaned$ride_duration)
## [1] TRUE
#Removing “bad” data and creating a new data frame
All_trip_cleaned_v1<- All_trip_cleaned[!(All_trip_cleaned$start_station_name == "HQ QR" |All_trip_cleaned$ride_duration < 0),]

Analyze

#calculating Average ride duration 
mean(All_trip_cleaned_v1$ride_duration)
## [1] 21.79437
#calculating Median ride duration
median(All_trip_cleaned_v1$ride_duration)
## [1] 12.18333
#calculating maximum ride duration
max(All_trip_cleaned_v1$ride_duration)
## [1] 55944.15
#Calculating minimum rude duration 
min(All_trip_cleaned_v1$ride_duration)
## [1] 0.01666667
#Viewing Average, median, maximum and minimum ride duration for better comparison 
summary(All_trip_cleaned_v1$ride_duration)
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##     0.02     6.93    12.18    21.79    22.08 55944.15

Comparing Rider type : Member Vs Casual Riders

aggregate(All_trip_cleaned_v1$ride_duration ~ All_trip_cleaned_v1$member_casual, FUN = mean)
##   All_trip_cleaned_v1$member_casual All_trip_cleaned_v1$ride_duration
## 1                            casual                          32.46070
## 2                            member                          13.11519
aggregate(All_trip_cleaned_v1$ride_duration ~ All_trip_cleaned_v1$member_casual, FUN = median)
##   All_trip_cleaned_v1$member_casual All_trip_cleaned_v1$ride_duration
## 1                            casual                         16.633333
## 2                            member                          9.666667
aggregate(All_trip_cleaned_v1$ride_duration ~ All_trip_cleaned_v1$member_casual, FUN = min)
##   All_trip_cleaned_v1$member_casual All_trip_cleaned_v1$ride_duration
## 1                            casual                        0.01666667
## 2                            member                        0.01666667
aggregate(All_trip_cleaned_v1$ride_duration ~ All_trip_cleaned_v1$member_casual, FUN = max)
##   All_trip_cleaned_v1$member_casual All_trip_cleaned_v1$ride_duration
## 1                            casual                         55944.150
## 2                            member                          1495.633

Average Ride duration : Member Vs Casual rider

# Arranging and maintaining the order of day by week
All_trip_cleaned_v1$day <- ordered(All_trip_cleaned_v1$day, levels=c("Sunday", "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday"))
aggregate(All_trip_cleaned_v1$ride_duration ~ All_trip_cleaned_v1$member_casual + All_trip_cleaned_v1$day, FUN = mean)
##    All_trip_cleaned_v1$member_casual All_trip_cleaned_v1$day
## 1                             casual                  Sunday
## 2                             member                  Sunday
## 3                             casual                  Monday
## 4                             member                  Monday
## 5                             casual                 Tuesday
## 6                             member                 Tuesday
## 7                             casual               Wednesday
## 8                             member               Wednesday
## 9                             casual                Thursday
## 10                            member                Thursday
## 11                            casual                  Friday
## 12                            member                  Friday
## 13                            casual                Saturday
## 14                            member                Saturday
##    All_trip_cleaned_v1$ride_duration
## 1                           37.59261
## 2                           15.14081
## 3                           32.66038
## 4                           12.66732
## 5                           28.69078
## 6                           12.31038
## 7                           28.32422
## 8                           12.37723
## 9                           28.19142
## 10                          12.28490
## 11                          30.78521
## 12                          12.74821
## 13                          34.68589
## 14                          14.74594

Calculating Average Ride duration, Number of rides : Member Vs Casual rider & Day

All_trip_cleaned_v1 %>% mutate(days = wday(started_at, label = TRUE)) %>% group_by(member_casual,days) %>% summarise(number_of_rides = n(), average_duration = mean(ride_duration)) %>% arrange(member_casual,days)
## `summarise()` has grouped output by 'member_casual'. You can override using the
## `.groups` argument.
## # A tibble: 14 x 4
## # Groups:   member_casual [2]
##    member_casual days  number_of_rides average_duration
##    <chr>         <ord>           <int>            <dbl>
##  1 casual        Sun            401873             37.6
##  2 casual        Mon            228420             32.7
##  3 casual        Tue            214205             28.7
##  4 casual        Wed            217115             28.3
##  5 casual        Thu            223205             28.2
##  6 casual        Fri            288289             30.8
##  7 casual        Sat            464497             34.7
##  8 member        Sun            306851             15.1
##  9 member        Mon            343722             12.7
## 10 member        Tue            384700             12.3
## 11 member        Wed            392458             12.4
## 12 member        Thu            369009             12.3
## 13 member        Fri            358201             12.7
## 14 member        Sat            349185             14.7

Visualization of Number of rides by day : Member Vs Casual rider

All_trip_cleaned_v1 %>% group_by(member_casual, day) %>% summarise(Number_of_rides = n(), .groups = 'drop') %>% ggplot(aes(x = day, y = Number_of_rides, fill = member_casual)) + geom_col(position = "dodge") + scale_y_continuous(labels = scales::comma) +labs(title = "Number of Rides by Day: Members vs. Casual Riders")

Visualization of Average ride duration by day : Member Vs Casual rider

All_trip_cleaned_v1 %>% group_by(member_casual, day) %>% summarise(Average_duration = mean(ride_duration), .groups = 'drop') %>% ggplot(aes(x = day, y =Average_duration, fill = member_casual )) + geom_col(position = "dodge") + labs(title = "Average ride duration per day Vs Member/Casual")

Visualization of Number of rides by month : Member Vs Casual rider

All_trip_cleaned_v1 %>% group_by(month,member_casual) %>% summarise( number_of_rides= n(), .groups = 'drop') %>% ggplot(aes(x = month,y = number_of_rides, fill = member_casual)) + geom_bar(position = 'dodge', stat = 'identity') + scale_y_continuous(labels = scales::comma) + labs(title = "Number of Rides by Month: Casual Vs member")

Visualization of Average ride duration by month : Member Vs Casual rider

All_trip_cleaned_v1 %>% group_by(month,member_casual) %>% summarise( average_ride_duration = mean(ride_duration), .groups = 'drop') %>% ggplot(aes(x = month,y = average_ride_duration, fill = member_casual)) + geom_bar(position = 'dodge', stat = 'identity') + scale_y_continuous(labels = scales::comma) + labs(title = "Average ride duration by Month: Casual Vs member")

Visualizaion of Ride pecent by Rider type : Member, Casual

All_trip_cleaned_v1 %>% group_by(member_casual) %>% summarize(number_of_rides = n()) %>% mutate(percentage = round(number_of_rides*100/ sum(number_of_rides))) %>% ggplot(aes(x = "", y = number_of_rides, fill = member_casual)) + geom_bar(width = 1, stat="identity", color = "white", show.legend = FALSE) + coord_polar("y", start=0) + geom_text(aes(label = paste(member_casual,paste(percentage,"%"), sep = "\n")), position = position_stack(vjust = 0.5),color ="white") + labs(title = "Ride Percentage of Rider type ") + theme_void()

Analyzing Best performing stations

#Combining all Station's name
#Removing No station name
All_stations <- bind_rows( data.frame("station" = All_trip_cleaned_v1$start_station_name,"member_casual" = All_trip_cleaned_v1$member_casual), data.frame("station" = All_trip_cleaned_v1$end_station_name, "member_casual" = All_trip_cleaned_v1$member_casual))
All_stations_v1 <- All_stations[!(All_stations$station == "" | is.na(All_stations$station)), ]
#Separating rider type
All_station_member <- All_stations_v1[All_stations_v1$member_casual == 'member',]
All_station_casual <- All_stations_v1[All_stations_v1$member_casual == 'casual',]
#Top 10 Station's combined casual & member 
Top_station <- All_stations_v1 %>% group_by(station) %>% summarise(station_count =n()) %>% arrange(desc(station_count)) %>% arrange(desc(station_count)) %>% slice(1:10)
#Top 10 station member rider type
Top_10_station_member <- All_station_member %>% group_by(station) %>% summarise(station_count = n()) %>% arrange(desc(station_count)) %>% head(n=10)
#Top 10 station casual rider type
Top_10_station_casual <- All_station_casual %>% group_by(station) %>% summarise(station_count = n()) %>% arrange(desc(station_count)) %>% head(n=10)

Visualizaion of Top 10 Station by rider type : Member

ggplot(data = Top_10_station_member) + geom_col(aes(x= station, y= station_count,  fill = "thistle")) + labs(title = "Top 10 Used Stations by Members", y = "Number of Rides", x = "") + scale_y_continuous(labels = scales::comma)+ coord_flip() + theme_minimal()

Visualizaion of Top 10 Station by rider type : Casual

ggplot(data = Top_10_station_casual) + geom_col(aes(x= station, y= station_count,  fill = "thistle")) + labs(title = "Top 10 Used Stations by Casual", y = "Number of Rides", x = "") + scale_y_continuous(labels = scales::comma)+ coord_flip() + theme_minimal()

Analyzing Bike type prefrences

#Bike type preference of rider type
All_trip_cleaned_v1 %>% group_by(rideable_type, member_casual) %>% summarize(number_of_rides = n(), .groups = 'drop')
## # A tibble: 5 x 3
##   rideable_type member_casual number_of_rides
##   <chr>         <chr>                   <int>
## 1 classic_bike  casual                1254587
## 2 classic_bike  member                1945901
## 3 docked_bike   casual                 309607
## 4 electric_bike casual                 473410
## 5 electric_bike member                 558225

Visualizaion of Bike prefrence of Rider’s compared to Number of rides

All_trip_cleaned_v1 %>% group_by(rideable_type, member_casual) %>% summarize(number_of_rides = n(), .groups = 'drop') %>% ggplot(aes(x = rideable_type, y =number_of_rides, fill = member_casual )) + geom_col(position = "dodge") + labs(title = "Number of rides & Rideable type Vs Member/Casual")

Visualizaion of Ride pecent by Bike type

All_trip_cleaned_v1 %>% group_by(rideable_type) %>% summarize(number_of_rides = n()) %>% mutate(percentage = round(number_of_rides*100/ sum(number_of_rides))) %>% ggplot(aes(x = "", y = number_of_rides, fill = rideable_type)) + geom_bar(width = 1, stat="identity", color = "white", show.legend = FALSE) + coord_polar("y", start=0) + geom_text(aes(label = paste(rideable_type,paste(percentage,"%"), sep = "\n")), position = position_stack(vjust = 0.5),color ="white") + labs(title = "Ride Percentage of Bike type") + theme_void()

Visualizaion of Number of rides by Month : Bike type & Rider type

All_trip_cleaned_v1 %>% group_by(month, member_casual, rideable_type) %>% summarize(number_of_rides = n(), .groups = 'drop') %>% drop_na() %>% ggplot(aes(x = month, y = number_of_rides, fill = member_casual)) + geom_col(position = "dodge") + scale_y_continuous(labels = scales::comma) + facet_grid(member_casual~rideable_type) + labs(title = "Number of Rides by Month") + theme(axis.text.x = element_text(angle = 90))

Visualizaion Average ride duration by month : Bike type & Rider type

All_trip_cleaned_v1 %>% group_by(month, member_casual, rideable_type) %>% summarize(average_ride_duration = mean(ride_duration), .groups = 'drop') %>% drop_na() %>% ggplot(aes(x = month, y = average_ride_duration, fill = member_casual)) + geom_col(position = "dodge") + scale_y_continuous(labels = scales::comma) + facet_grid(member_casual~rideable_type) + labs(title = "Average Ride duration by Month") + theme(axis.text.x = element_text(angle = 90))

Conclusion

The sole purpose of this data analysis is to design a marketing strategies to convert casual rider into member rider by using digital media and to perform data analysis using Data driven decision making process : Ask - Prepare - Process - Analyze - Share - Act.

Summary of data analysis result

  1. Number of rides of Casual rider is higher On Saturday & Sunday, Number of rides of Member rider is higher from Monday to Friday.

  2. Average ride duration of Casual rider is almost double compared to Member rider type on any given day of week.

  3. In any given month of calender the Average ride duration of Casual rider is almost double of Member rider type.

  4. In last 12 months the Ride percentage of Rider type was Casual Rider : 45 % Member Rider : 55 %

  5. In last 12 months the Bike preference type of rider was Classic Bike : 71% Electric Bike : 23% Docked Bike : 7%

  6. Docked bike is only Preferred by casual member

Based on analysis my 3 key recommendation are :

  1. Design a digital marketing campaign to offer discount on Membership and start running campaign since Thursday and end it on Saturday morning.

  2. Design a digital marketing campaign (Video or series of posts) to educate : How using Cyclist user can avoid heavy traffic, Health benefit of using a bike over other commuting option, Emission free commute and user’s contribution toward Environment.

    as well as keeping user Interest toward Membership.

  3. Design a marketing campaign for local business for choosing Cyclist as their commuting partner, showing advantages compared to other commuting options.

Future Scope

Further analysis of Geometric coordinate, Ride distance analysis, Ride route preferences as well as Regression can be done to further enhance the result.

Survey can be conducted on Casual rider to understand more about user and can be mixed survey data with this analysis to increase effectiveness of marketing camping.

Data-driven decision-making

This data analysis follows : Ask - Prepare - Process - Analyze - Share - Act. We followed Ask- Prepare - Process - Analyze as well as created effective visualization and make recommendation.

Share & Act

To share my visualization, key finding and Recommendation I will make power point presentation, and ask time from stakeholder to present my findings.

Its up to Stakeholder’s to accept my recommendation & finding to Act upon them.