I newly joined a fictional Cyclistic marketing analytics team as a junior data analyst in a bike-share company in Chicago called Cylistic. The company has two major types of memberships which are casual membership and annual membership. The Marketing Director of the Cyclistic company Lily Moreno wants to maximize the number of annual members in the company. She is of the opinion that the success of the company would depend on the growth of the annual members registered to the company.
The financial analyst team at Cyclistic company have concluded that the annual members are more profitable to the company than the casual member riders and this is why the Marketing Director of the company wants to expand the growth of the annual members. But Lily Moreno does not want to create a marketing campaign that would target all new customers but wants to convert the already existing casual members to annual members.
The business task of this project is to convert causal members into annual members. But to do this, there are key questions we need to take into consideration
The data source was gotten from the link {https://divvy-tripdata.s3.amazonaws.com/index.html} I downloaded the twelve previous months of the Cyclistic trip data, which was from December 2020 till November 2021.
I created a folder where i saved all the 12 data sets, then i renamed all the data sets to a suitable name for easy understanding. I checked each data sets to make sure it passed the ROCCC(Reliable,Original, Comprehensive,Current and Cited) check.
The data included information about the causal and annual member cyclists during the period of December 2020 and November 2021. The breakdown of some of the information included are:
#loading the necessary library
library(readr) # helps read data
## Warning: package 'readr' was built under R version 4.1.2
library(dplyr) # helps with data frame
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggplot2) # helps visualize data
## Warning: package 'ggplot2' was built under R version 4.1.2
library(tidyverse) # helps wrangle data
## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --
## v tibble 3.1.6 v stringr 1.4.0
## v tidyr 1.1.4 v forcats 0.5.1
## v purrr 0.3.4
## Warning: package 'tibble' was built under R version 4.1.2
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(lubridate) # helps wrangle data attributes
##
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
##
## date, intersect, setdiff, union
library(janitor) # helps with data cleaning
## Warning: package 'janitor' was built under R version 4.1.2
##
## Attaching package: 'janitor'
## The following objects are masked from 'package:stats':
##
## chisq.test, fisher.test
library(skimr) # helps with data cleaning and visuals
## Warning: package 'skimr' was built under R version 4.1.2
After the libraries and package were loaded, i proceeded to setting my working directory to the file where i had all my data sets saved. Then i used the read_csv function which part of the readr package to read my data set files into the session.
# Step1: Collection of Data
#Now we load these files and assign them to a variable name
trip_dec20 <- read_csv("bike_ride_dec_20.csv")
## Rows: 259716 Columns: 13
## -- Column specification --------------------------------------------------------
## Delimiter: ","
## chr (5): ride_id, rideable_type, start_station_name, end_station_name, memb...
## dbl (6): start_station_id, end_station_id, start_lat, start_lng, end_lat, e...
## dttm (2): started_at, ended_at
##
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.
trip_jan21 <- read_csv("bike_ride_jan_21.csv")
## Rows: 131573 Columns: 13
## -- Column specification --------------------------------------------------------
## Delimiter: ","
## chr (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
##
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.
trip_feb21 <- read_csv("bike_ride_feb_21.csv")
## Rows: 96834 Columns: 13
## -- Column specification --------------------------------------------------------
## Delimiter: ","
## chr (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
##
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.
trip_mar21 <- read_csv("bike_ride_mar_21.csv")
## Rows: 49622 Columns: 13
## -- Column specification --------------------------------------------------------
## Delimiter: ","
## chr (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
##
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.
trip_apr21 <- read_csv("bike_ride_apr_21.csv")
## Rows: 228496 Columns: 13
## -- Column specification --------------------------------------------------------
## Delimiter: ","
## chr (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
##
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.
trip_may21 <- read_csv("bike_ride_may_21.csv")
## Rows: 337230 Columns: 13
## -- Column specification --------------------------------------------------------
## Delimiter: ","
## chr (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
##
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.
trip_jun21 <- read_csv("bike_ride_jun_21.csv")
## Rows: 531633 Columns: 13
## -- Column specification --------------------------------------------------------
## Delimiter: ","
## chr (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
##
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.
trip_jul21 <- read_csv("bike_ride_jul_21.csv")
## Rows: 729595 Columns: 13
## -- Column specification --------------------------------------------------------
## Delimiter: ","
## chr (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
##
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.
trip_aug21 <- read_csv("bike_ride_aug_21.csv")
## Rows: 822410 Columns: 13
## -- Column specification --------------------------------------------------------
## Delimiter: ","
## chr (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
##
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.
trip_sep21 <- read_csv("bike_ride_sep_21.csv")
## Rows: 804352 Columns: 13
## -- Column specification --------------------------------------------------------
## Delimiter: ","
## chr (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
##
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.
trip_oct21 <- read_csv("bike_ride_oct_21.csv")
## Rows: 756147 Columns: 13
## -- Column specification --------------------------------------------------------
## Delimiter: ","
## chr (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
##
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.
trip_nov21 <- read_csv("bike_ride_nov_21.csv")
## Rows: 631226 Columns: 13
## -- Column specification --------------------------------------------------------
## Delimiter: ","
## chr (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
##
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.
I then compared the column names using colnames to make sure the names were in the same order and format
#Step 2 We Wrangle Data and Combine into a single file
#Now we compare the column names of each file
colnames(trip_dec20)
## [1] "ride_id" "rideable_type" "started_at"
## [4] "ended_at" "start_station_name" "start_station_id"
## [7] "end_station_name" "end_station_id" "start_lat"
## [10] "start_lng" "end_lat" "end_lng"
## [13] "member_casual"
colnames(trip_jan21)
## [1] "ride_id" "rideable_type" "started_at"
## [4] "ended_at" "start_station_name" "start_station_id"
## [7] "end_station_name" "end_station_id" "start_lat"
## [10] "start_lng" "end_lat" "end_lng"
## [13] "member_casual"
colnames(trip_feb21)
## [1] "ride_id" "rideable_type" "started_at"
## [4] "ended_at" "start_station_name" "start_station_id"
## [7] "end_station_name" "end_station_id" "start_lat"
## [10] "start_lng" "end_lat" "end_lng"
## [13] "member_casual"
colnames(trip_mar21)
## [1] "ride_id" "rideable_type" "started_at"
## [4] "ended_at" "start_station_name" "start_station_id"
## [7] "end_station_name" "end_station_id" "start_lat"
## [10] "start_lng" "end_lat" "end_lng"
## [13] "member_casual"
colnames(trip_apr21)
## [1] "ride_id" "rideable_type" "started_at"
## [4] "ended_at" "start_station_name" "start_station_id"
## [7] "end_station_name" "end_station_id" "start_lat"
## [10] "start_lng" "end_lat" "end_lng"
## [13] "member_casual"
colnames(trip_may21)
## [1] "ride_id" "rideable_type" "started_at"
## [4] "ended_at" "start_station_name" "start_station_id"
## [7] "end_station_name" "end_station_id" "start_lat"
## [10] "start_lng" "end_lat" "end_lng"
## [13] "member_casual"
colnames(trip_jun21)
## [1] "ride_id" "rideable_type" "started_at"
## [4] "ended_at" "start_station_name" "start_station_id"
## [7] "end_station_name" "end_station_id" "start_lat"
## [10] "start_lng" "end_lat" "end_lng"
## [13] "member_casual"
colnames(trip_jul21)
## [1] "ride_id" "rideable_type" "started_at"
## [4] "ended_at" "start_station_name" "start_station_id"
## [7] "end_station_name" "end_station_id" "start_lat"
## [10] "start_lng" "end_lat" "end_lng"
## [13] "member_casual"
colnames(trip_aug21)
## [1] "ride_id" "rideable_type" "started_at"
## [4] "ended_at" "start_station_name" "start_station_id"
## [7] "end_station_name" "end_station_id" "start_lat"
## [10] "start_lng" "end_lat" "end_lng"
## [13] "member_casual"
colnames(trip_sep21)
## [1] "ride_id" "rideable_type" "started_at"
## [4] "ended_at" "start_station_name" "start_station_id"
## [7] "end_station_name" "end_station_id" "start_lat"
## [10] "start_lng" "end_lat" "end_lng"
## [13] "member_casual"
colnames(trip_oct21)
## [1] "ride_id" "rideable_type" "started_at"
## [4] "ended_at" "start_station_name" "start_station_id"
## [7] "end_station_name" "end_station_id" "start_lat"
## [10] "start_lng" "end_lat" "end_lng"
## [13] "member_casual"
colnames(trip_nov21)
## [1] "ride_id" "rideable_type" "started_at"
## [4] "ended_at" "start_station_name" "start_station_id"
## [7] "end_station_name" "end_station_id" "start_lat"
## [10] "start_lng" "end_lat" "end_lng"
## [13] "member_casual"
I looked at the data types using stringR to look for errors or inconsistency.
#Now we want to look at the data types of each data frame using str()
str(trip_dec20)
## spec_tbl_df [259,716 x 13] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ ride_id : chr [1:259716] "BD0A6FF6FFF9B921" "96A7A7A4BDE4F82D" "C61526D06582BDC5" "E533E89C32080B9E" ...
## $ rideable_type : chr [1:259716] "electric_bike" "electric_bike" "electric_bike" "electric_bike" ...
## $ started_at : POSIXct[1:259716], format: "2020-11-01 13:36:00" "2020-11-01 10:03:26" ...
## $ ended_at : POSIXct[1:259716], format: "2020-11-01 13:45:40" "2020-11-01 10:14:45" ...
## $ start_station_name: chr [1:259716] "Dearborn St & Erie St" "Franklin St & Illinois St" "Lake Shore Dr & Monroe St" "Leavitt St & Chicago Ave" ...
## $ start_station_id : num [1:259716] 110 672 76 659 2 72 76 NA 58 394 ...
## $ end_station_name : chr [1:259716] "St. Clair St & Erie St" "Noble St & Milwaukee Ave" "Federal St & Polk St" "Stave St & Armitage Ave" ...
## $ end_station_id : num [1:259716] 211 29 41 185 2 76 72 NA 288 273 ...
## $ start_lat : num [1:259716] 41.9 41.9 41.9 41.9 41.9 ...
## $ start_lng : num [1:259716] -87.6 -87.6 -87.6 -87.7 -87.6 ...
## $ end_lat : num [1:259716] 41.9 41.9 41.9 41.9 41.9 ...
## $ end_lng : num [1:259716] -87.6 -87.7 -87.6 -87.7 -87.6 ...
## $ member_casual : chr [1:259716] "casual" "casual" "casual" "casual" ...
## - attr(*, "spec")=
## .. cols(
## .. ride_id = col_character(),
## .. rideable_type = col_character(),
## .. started_at = col_datetime(format = ""),
## .. ended_at = col_datetime(format = ""),
## .. start_station_name = col_character(),
## .. start_station_id = col_double(),
## .. end_station_name = col_character(),
## .. end_station_id = col_double(),
## .. start_lat = col_double(),
## .. start_lng = col_double(),
## .. end_lat = col_double(),
## .. end_lng = col_double(),
## .. member_casual = col_character()
## .. )
## - attr(*, "problems")=<externalptr>
str(trip_jan21)
## spec_tbl_df [131,573 x 13] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ ride_id : chr [1:131573] "70B6A9A437D4C30D" "158A465D4E74C54A" "5262016E0F1F2F9A" "BE119628E44F871E" ...
## $ rideable_type : chr [1:131573] "classic_bike" "electric_bike" "electric_bike" "electric_bike" ...
## $ started_at : POSIXct[1:131573], format: "2020-12-27 12:44:29" "2020-12-18 17:37:15" ...
## $ ended_at : POSIXct[1:131573], format: "2020-12-27 12:55:06" "2020-12-18 17:44:19" ...
## $ start_station_name: chr [1:131573] "Aberdeen St & Jackson Blvd" NA NA NA ...
## $ start_station_id : chr [1:131573] "13157" NA NA NA ...
## $ end_station_name : chr [1:131573] "Desplaines St & Kinzie St" NA NA NA ...
## $ end_station_id : chr [1:131573] "TA1306000003" NA NA NA ...
## $ start_lat : num [1:131573] 41.9 41.9 41.9 41.9 41.8 ...
## $ start_lng : num [1:131573] -87.7 -87.7 -87.7 -87.7 -87.6 ...
## $ end_lat : num [1:131573] 41.9 41.9 41.9 41.9 41.8 ...
## $ end_lng : num [1:131573] -87.6 -87.7 -87.7 -87.7 -87.6 ...
## $ member_casual : chr [1:131573] "member" "member" "member" "member" ...
## - attr(*, "spec")=
## .. cols(
## .. ride_id = col_character(),
## .. rideable_type = col_character(),
## .. started_at = col_datetime(format = ""),
## .. ended_at = col_datetime(format = ""),
## .. start_station_name = col_character(),
## .. start_station_id = col_character(),
## .. end_station_name = col_character(),
## .. end_station_id = col_character(),
## .. start_lat = col_double(),
## .. start_lng = col_double(),
## .. end_lat = col_double(),
## .. end_lng = col_double(),
## .. member_casual = col_character()
## .. )
## - attr(*, "problems")=<externalptr>
str(trip_feb21)
## spec_tbl_df [96,834 x 13] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ ride_id : chr [1:96834] "E19E6F1B8D4C42ED" "DC88F20C2C55F27F" "EC45C94683FE3F27" "4FA453A75AE377DB" ...
## $ rideable_type : chr [1:96834] "electric_bike" "electric_bike" "electric_bike" "electric_bike" ...
## $ started_at : POSIXct[1:96834], format: "2021-01-23 16:14:19" "2021-01-27 18:43:08" ...
## $ ended_at : POSIXct[1:96834], format: "2021-01-23 16:24:44" "2021-01-27 18:47:12" ...
## $ start_station_name: chr [1:96834] "California Ave & Cortez St" "California Ave & Cortez St" "California Ave & Cortez St" "California Ave & Cortez St" ...
## $ start_station_id : chr [1:96834] "17660" "17660" "17660" "17660" ...
## $ end_station_name : chr [1:96834] NA NA NA NA ...
## $ end_station_id : chr [1:96834] NA NA NA NA ...
## $ start_lat : num [1:96834] 41.9 41.9 41.9 41.9 41.9 ...
## $ start_lng : num [1:96834] -87.7 -87.7 -87.7 -87.7 -87.7 ...
## $ end_lat : num [1:96834] 41.9 41.9 41.9 41.9 41.9 ...
## $ end_lng : num [1:96834] -87.7 -87.7 -87.7 -87.7 -87.7 ...
## $ member_casual : chr [1:96834] "member" "member" "member" "member" ...
## - attr(*, "spec")=
## .. cols(
## .. ride_id = col_character(),
## .. rideable_type = col_character(),
## .. started_at = col_datetime(format = ""),
## .. ended_at = col_datetime(format = ""),
## .. start_station_name = col_character(),
## .. start_station_id = col_character(),
## .. end_station_name = col_character(),
## .. end_station_id = col_character(),
## .. start_lat = col_double(),
## .. start_lng = col_double(),
## .. end_lat = col_double(),
## .. end_lng = col_double(),
## .. member_casual = col_character()
## .. )
## - attr(*, "problems")=<externalptr>
str(trip_mar21)
## spec_tbl_df [49,622 x 13] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ ride_id : chr [1:49622] "89E7AA6C29227EFF" "0FEFDE2603568365" "E6159D746B2DBB91" "B32D3199F1C2E75B" ...
## $ rideable_type : chr [1:49622] "classic_bike" "classic_bike" "electric_bike" "classic_bike" ...
## $ started_at : POSIXct[1:49622], format: "2021-02-12 16:14:56" "2021-02-14 17:52:38" ...
## $ ended_at : POSIXct[1:49622], format: "2021-02-12 16:21:43" "2021-02-14 18:12:09" ...
## $ start_station_name: chr [1:49622] "Glenwood Ave & Touhy Ave" "Glenwood Ave & Touhy Ave" "Clark St & Lake St" "Wood St & Chicago Ave" ...
## $ start_station_id : chr [1:49622] "525" "525" "KA1503000012" "637" ...
## $ end_station_name : chr [1:49622] "Sheridan Rd & Columbia Ave" "Bosworth Ave & Howard St" "State St & Randolph St" "Honore St & Division St" ...
## $ end_station_id : chr [1:49622] "660" "16806" "TA1305000029" "TA1305000034" ...
## $ start_lat : num [1:49622] 42 42 41.9 41.9 41.8 ...
## $ start_lng : num [1:49622] -87.7 -87.7 -87.6 -87.7 -87.6 ...
## $ end_lat : num [1:49622] 42 42 41.9 41.9 41.8 ...
## $ end_lng : num [1:49622] -87.7 -87.7 -87.6 -87.7 -87.6 ...
## $ member_casual : chr [1:49622] "member" "casual" "member" "member" ...
## - attr(*, "spec")=
## .. cols(
## .. ride_id = col_character(),
## .. rideable_type = col_character(),
## .. started_at = col_datetime(format = ""),
## .. ended_at = col_datetime(format = ""),
## .. start_station_name = col_character(),
## .. start_station_id = col_character(),
## .. end_station_name = col_character(),
## .. end_station_id = col_character(),
## .. start_lat = col_double(),
## .. start_lng = col_double(),
## .. end_lat = col_double(),
## .. end_lng = col_double(),
## .. member_casual = col_character()
## .. )
## - attr(*, "problems")=<externalptr>
str(trip_apr21)
## spec_tbl_df [228,496 x 13] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ ride_id : chr [1:228496] "CFA86D4455AA1030" "30D9DC61227D1AF3" "846D87A15682A284" "994D05AA75A168F2" ...
## $ rideable_type : chr [1:228496] "classic_bike" "classic_bike" "classic_bike" "classic_bike" ...
## $ started_at : POSIXct[1:228496], format: "2021-03-16 08:32:30" "2021-03-28 01:26:28" ...
## $ ended_at : POSIXct[1:228496], format: "2021-03-16 08:36:34" "2021-03-28 01:36:55" ...
## $ start_station_name: chr [1:228496] "Humboldt Blvd & Armitage Ave" "Humboldt Blvd & Armitage Ave" "Shields Ave & 28th Pl" "Winthrop Ave & Lawrence Ave" ...
## $ start_station_id : chr [1:228496] "15651" "15651" "15443" "TA1308000021" ...
## $ end_station_name : chr [1:228496] "Stave St & Armitage Ave" "Central Park Ave & Bloomingdale Ave" "Halsted St & 35th St" "Broadway & Sheridan Rd" ...
## $ end_station_id : chr [1:228496] "13266" "18017" "TA1308000043" "13323" ...
## $ start_lat : num [1:228496] 41.9 41.9 41.8 42 42 ...
## $ start_lng : num [1:228496] -87.7 -87.7 -87.6 -87.7 -87.7 ...
## $ end_lat : num [1:228496] 41.9 41.9 41.8 42 42.1 ...
## $ end_lng : num [1:228496] -87.7 -87.7 -87.6 -87.6 -87.7 ...
## $ member_casual : chr [1:228496] "casual" "casual" "casual" "casual" ...
## - attr(*, "spec")=
## .. cols(
## .. ride_id = col_character(),
## .. rideable_type = col_character(),
## .. started_at = col_datetime(format = ""),
## .. ended_at = col_datetime(format = ""),
## .. start_station_name = col_character(),
## .. start_station_id = col_character(),
## .. end_station_name = col_character(),
## .. end_station_id = col_character(),
## .. start_lat = col_double(),
## .. start_lng = col_double(),
## .. end_lat = col_double(),
## .. end_lng = col_double(),
## .. member_casual = col_character()
## .. )
## - attr(*, "problems")=<externalptr>
str(trip_may21)
## spec_tbl_df [337,230 x 13] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ ride_id : chr [1:337230] "6C992BD37A98A63F" "1E0145613A209000" "E498E15508A80BAD" "1887262AD101C604" ...
## $ rideable_type : chr [1:337230] "classic_bike" "docked_bike" "docked_bike" "classic_bike" ...
## $ started_at : POSIXct[1:337230], format: "2021-04-12 18:25:36" "2021-04-27 17:27:11" ...
## $ ended_at : POSIXct[1:337230], format: "2021-04-12 18:56:55" "2021-04-27 18:31:29" ...
## $ start_station_name: chr [1:337230] "State St & Pearson St" "Dorchester Ave & 49th St" "Loomis Blvd & 84th St" "Honore St & Division St" ...
## $ start_station_id : chr [1:337230] "TA1307000061" "KA1503000069" "20121" "TA1305000034" ...
## $ end_station_name : chr [1:337230] "Southport Ave & Waveland Ave" "Dorchester Ave & 49th St" "Loomis Blvd & 84th St" "Southport Ave & Waveland Ave" ...
## $ end_station_id : chr [1:337230] "13235" "KA1503000069" "20121" "13235" ...
## $ start_lat : num [1:337230] 41.9 41.8 41.7 41.9 41.7 ...
## $ start_lng : num [1:337230] -87.6 -87.6 -87.7 -87.7 -87.7 ...
## $ end_lat : num [1:337230] 41.9 41.8 41.7 41.9 41.7 ...
## $ end_lng : num [1:337230] -87.7 -87.6 -87.7 -87.7 -87.7 ...
## $ member_casual : chr [1:337230] "member" "casual" "casual" "member" ...
## - attr(*, "spec")=
## .. cols(
## .. ride_id = col_character(),
## .. rideable_type = col_character(),
## .. started_at = col_datetime(format = ""),
## .. ended_at = col_datetime(format = ""),
## .. start_station_name = col_character(),
## .. start_station_id = col_character(),
## .. end_station_name = col_character(),
## .. end_station_id = col_character(),
## .. start_lat = col_double(),
## .. start_lng = col_double(),
## .. end_lat = col_double(),
## .. end_lng = col_double(),
## .. member_casual = col_character()
## .. )
## - attr(*, "problems")=<externalptr>
str(trip_jun21)
## spec_tbl_df [531,633 x 13] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ ride_id : chr [1:531633] "C809ED75D6160B2A" "DD59FDCE0ACACAF3" "0AB83CB88C43EFC2" "7881AC6D39110C60" ...
## $ rideable_type : chr [1:531633] "electric_bike" "electric_bike" "electric_bike" "electric_bike" ...
## $ started_at : POSIXct[1:531633], format: "2021-05-30 11:58:15" "2021-05-30 11:29:14" ...
## $ ended_at : POSIXct[1:531633], format: "2021-05-30 12:10:39" "2021-05-30 12:14:09" ...
## $ start_station_name: chr [1:531633] NA NA NA NA ...
## $ start_station_id : chr [1:531633] NA NA NA NA ...
## $ end_station_name : chr [1:531633] NA NA NA NA ...
## $ end_station_id : chr [1:531633] NA NA NA NA ...
## $ start_lat : num [1:531633] 41.9 41.9 41.9 41.9 41.9 ...
## $ start_lng : num [1:531633] -87.6 -87.6 -87.7 -87.7 -87.7 ...
## $ end_lat : num [1:531633] 41.9 41.8 41.9 41.9 41.9 ...
## $ end_lng : num [1:531633] -87.6 -87.6 -87.7 -87.7 -87.7 ...
## $ member_casual : chr [1:531633] "casual" "casual" "casual" "casual" ...
## - attr(*, "spec")=
## .. cols(
## .. ride_id = col_character(),
## .. rideable_type = col_character(),
## .. started_at = col_datetime(format = ""),
## .. ended_at = col_datetime(format = ""),
## .. start_station_name = col_character(),
## .. start_station_id = col_character(),
## .. end_station_name = col_character(),
## .. end_station_id = col_character(),
## .. start_lat = col_double(),
## .. start_lng = col_double(),
## .. end_lat = col_double(),
## .. end_lng = col_double(),
## .. member_casual = col_character()
## .. )
## - attr(*, "problems")=<externalptr>
str(trip_jul21)
## spec_tbl_df [729,595 x 13] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ ride_id : chr [1:729595] "99FEC93BA843FB20" "06048DCFC8520CAF" "9598066F68045DF2" "B03C0FE48C412214" ...
## $ rideable_type : chr [1:729595] "electric_bike" "electric_bike" "electric_bike" "electric_bike" ...
## $ started_at : POSIXct[1:729595], format: "2021-06-13 14:31:28" "2021-06-04 11:18:02" ...
## $ ended_at : POSIXct[1:729595], format: "2021-06-13 14:34:11" "2021-06-04 11:24:19" ...
## $ start_station_name: chr [1:729595] NA NA NA NA ...
## $ start_station_id : chr [1:729595] NA NA NA NA ...
## $ end_station_name : chr [1:729595] NA NA NA NA ...
## $ end_station_id : chr [1:729595] NA NA NA NA ...
## $ start_lat : num [1:729595] 41.8 41.8 41.8 41.8 41.8 ...
## $ start_lng : num [1:729595] -87.6 -87.6 -87.6 -87.6 -87.6 ...
## $ end_lat : num [1:729595] 41.8 41.8 41.8 41.8 41.8 ...
## $ end_lng : num [1:729595] -87.6 -87.6 -87.6 -87.6 -87.6 ...
## $ member_casual : chr [1:729595] "member" "member" "member" "member" ...
## - attr(*, "spec")=
## .. cols(
## .. ride_id = col_character(),
## .. rideable_type = col_character(),
## .. started_at = col_datetime(format = ""),
## .. ended_at = col_datetime(format = ""),
## .. start_station_name = col_character(),
## .. start_station_id = col_character(),
## .. end_station_name = col_character(),
## .. end_station_id = col_character(),
## .. start_lat = col_double(),
## .. start_lng = col_double(),
## .. end_lat = col_double(),
## .. end_lng = col_double(),
## .. member_casual = col_character()
## .. )
## - attr(*, "problems")=<externalptr>
str(trip_aug21)
## spec_tbl_df [822,410 x 13] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ ride_id : chr [1:822410] "0A1B623926EF4E16" "B2D5583A5A5E76EE" "6F264597DDBF427A" "379B58EAB20E8AA5" ...
## $ rideable_type : chr [1:822410] "docked_bike" "classic_bike" "classic_bike" "classic_bike" ...
## $ started_at : POSIXct[1:822410], format: "2021-07-02 14:44:36" "2021-07-07 16:57:42" ...
## $ ended_at : POSIXct[1:822410], format: "2021-07-02 15:19:58" "2021-07-07 17:16:09" ...
## $ start_station_name: chr [1:822410] "Michigan Ave & Washington St" "California Ave & Cortez St" "Wabash Ave & 16th St" "California Ave & Cortez St" ...
## $ start_station_id : chr [1:822410] "13001" "17660" "SL-012" "17660" ...
## $ end_station_name : chr [1:822410] "Halsted St & North Branch St" "Wood St & Hubbard St" "Rush St & Hubbard St" "Carpenter St & Huron St" ...
## $ end_station_id : chr [1:822410] "KA1504000117" "13432" "KA1503000044" "13196" ...
## $ start_lat : num [1:822410] 41.9 41.9 41.9 41.9 41.9 ...
## $ start_lng : num [1:822410] -87.6 -87.7 -87.6 -87.7 -87.7 ...
## $ end_lat : num [1:822410] 41.9 41.9 41.9 41.9 41.9 ...
## $ end_lng : num [1:822410] -87.6 -87.7 -87.6 -87.7 -87.7 ...
## $ member_casual : chr [1:822410] "casual" "casual" "member" "member" ...
## - attr(*, "spec")=
## .. cols(
## .. ride_id = col_character(),
## .. rideable_type = col_character(),
## .. started_at = col_datetime(format = ""),
## .. ended_at = col_datetime(format = ""),
## .. start_station_name = col_character(),
## .. start_station_id = col_character(),
## .. end_station_name = col_character(),
## .. end_station_id = col_character(),
## .. start_lat = col_double(),
## .. start_lng = col_double(),
## .. end_lat = col_double(),
## .. end_lng = col_double(),
## .. member_casual = col_character()
## .. )
## - attr(*, "problems")=<externalptr>
str(trip_sep21)
## spec_tbl_df [804,352 x 13] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ ride_id : chr [1:804352] "99103BB87CC6C1BB" "EAFCCCFB0A3FC5A1" "9EF4F46C57AD234D" "5834D3208BFAF1DA" ...
## $ rideable_type : chr [1:804352] "electric_bike" "electric_bike" "electric_bike" "electric_bike" ...
## $ started_at : POSIXct[1:804352], format: "2021-08-10 17:15:49" "2021-08-10 17:23:14" ...
## $ ended_at : POSIXct[1:804352], format: "2021-08-10 17:22:44" "2021-08-10 17:39:24" ...
## $ start_station_name: chr [1:804352] NA NA NA NA ...
## $ start_station_id : chr [1:804352] NA NA NA NA ...
## $ end_station_name : chr [1:804352] NA NA NA NA ...
## $ end_station_id : chr [1:804352] NA NA NA NA ...
## $ start_lat : num [1:804352] 41.8 41.8 42 42 41.8 ...
## $ start_lng : num [1:804352] -87.7 -87.7 -87.7 -87.7 -87.6 ...
## $ end_lat : num [1:804352] 41.8 41.8 42 42 41.8 ...
## $ end_lng : num [1:804352] -87.7 -87.6 -87.7 -87.7 -87.6 ...
## $ member_casual : chr [1:804352] "member" "member" "member" "member" ...
## - attr(*, "spec")=
## .. cols(
## .. ride_id = col_character(),
## .. rideable_type = col_character(),
## .. started_at = col_datetime(format = ""),
## .. ended_at = col_datetime(format = ""),
## .. start_station_name = col_character(),
## .. start_station_id = col_character(),
## .. end_station_name = col_character(),
## .. end_station_id = col_character(),
## .. start_lat = col_double(),
## .. start_lng = col_double(),
## .. end_lat = col_double(),
## .. end_lng = col_double(),
## .. member_casual = col_character()
## .. )
## - attr(*, "problems")=<externalptr>
str(trip_oct21)
## spec_tbl_df [756,147 x 13] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ ride_id : chr [1:756147] "9DC7B962304CBFD8" "F930E2C6872D6B32" "6EF72137900BB910" "78D1DE133B3DBF55" ...
## $ rideable_type : chr [1:756147] "electric_bike" "electric_bike" "electric_bike" "electric_bike" ...
## $ started_at : POSIXct[1:756147], format: "2021-09-28 16:07:10" "2021-09-28 14:24:51" ...
## $ ended_at : POSIXct[1:756147], format: "2021-09-28 16:09:54" "2021-09-28 14:40:05" ...
## $ start_station_name: chr [1:756147] NA NA NA NA ...
## $ start_station_id : chr [1:756147] NA NA NA NA ...
## $ end_station_name : chr [1:756147] NA NA NA NA ...
## $ end_station_id : chr [1:756147] NA NA NA NA ...
## $ start_lat : num [1:756147] 41.9 41.9 41.8 41.8 41.9 ...
## $ start_lng : num [1:756147] -87.7 -87.6 -87.7 -87.7 -87.7 ...
## $ end_lat : num [1:756147] 41.9 42 41.8 41.8 41.9 ...
## $ end_lng : num [1:756147] -87.7 -87.7 -87.7 -87.7 -87.7 ...
## $ member_casual : chr [1:756147] "casual" "casual" "casual" "casual" ...
## - attr(*, "spec")=
## .. cols(
## .. ride_id = col_character(),
## .. rideable_type = col_character(),
## .. started_at = col_datetime(format = ""),
## .. ended_at = col_datetime(format = ""),
## .. start_station_name = col_character(),
## .. start_station_id = col_character(),
## .. end_station_name = col_character(),
## .. end_station_id = col_character(),
## .. start_lat = col_double(),
## .. start_lng = col_double(),
## .. end_lat = col_double(),
## .. end_lng = col_double(),
## .. member_casual = col_character()
## .. )
## - attr(*, "problems")=<externalptr>
str(trip_nov21)
## spec_tbl_df [631,226 x 13] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ ride_id : chr [1:631226] "620BC6107255BF4C" "4471C70731AB2E45" "26CA69D43D15EE14" "362947F0437E1514" ...
## $ rideable_type : chr [1:631226] "electric_bike" "electric_bike" "electric_bike" "electric_bike" ...
## $ started_at : POSIXct[1:631226], format: "2021-10-22 12:46:42" "2021-10-21 09:12:37" ...
## $ ended_at : POSIXct[1:631226], format: "2021-10-22 12:49:50" "2021-10-21 09:14:14" ...
## $ start_station_name: chr [1:631226] "Kingsbury St & Kinzie St" NA NA NA ...
## $ start_station_id : chr [1:631226] "KA1503000043" NA NA NA ...
## $ end_station_name : chr [1:631226] NA NA NA NA ...
## $ end_station_id : chr [1:631226] NA NA NA NA ...
## $ start_lat : num [1:631226] 41.9 41.9 41.9 41.9 41.9 ...
## $ start_lng : num [1:631226] -87.6 -87.7 -87.7 -87.7 -87.7 ...
## $ end_lat : num [1:631226] 41.9 41.9 41.9 41.9 41.9 ...
## $ end_lng : num [1:631226] -87.6 -87.7 -87.7 -87.7 -87.7 ...
## $ member_casual : chr [1:631226] "member" "member" "member" "member" ...
## - attr(*, "spec")=
## .. cols(
## .. ride_id = col_character(),
## .. rideable_type = col_character(),
## .. started_at = col_datetime(format = ""),
## .. ended_at = col_datetime(format = ""),
## .. start_station_name = col_character(),
## .. start_station_id = col_character(),
## .. end_station_name = col_character(),
## .. end_station_id = col_character(),
## .. start_lat = col_double(),
## .. start_lng = col_double(),
## .. end_lat = col_double(),
## .. end_lng = col_double(),
## .. member_casual = col_character()
## .. )
## - attr(*, "problems")=<externalptr>
I converted the tibbles to data frames using the function as.data.frame and assigned them to the names variables
#Now we want to convert the data sets into data frame
trip_dec20 <- as.data.frame(trip_dec20)
trip_jan21 <- as.data.frame(trip_jan21)
trip_feb21 <- as.data.frame(trip_feb21)
trip_mar21 <- as.data.frame(trip_mar21)
trip_apr21 <- as.data.frame(trip_apr21)
trip_may21 <- as.data.frame(trip_may21)
trip_jun21 <- as.data.frame(trip_jun21)
trip_jul21 <- as.data.frame(trip_jul21)
trip_aug21 <- as.data.frame(trip_aug21)
trip_sep21 <- as.data.frame(trip_sep21)
trip_oct21 <- as.data.frame(trip_oct21)
trip_nov21 <- as.data.frame(trip_nov21)
I inspected the columns of each data frame by using compare_df_cols and asked it to return any mismatch.I used this because we are dealing with a large data sets.
#Now we want to compare the column data type across all data frame by using comparedf
compare_df_cols(trip_dec20,trip_jan21,trip_feb21,trip_mar21,trip_apr21,trip_may21,
trip_jun21,trip_jul21,trip_aug21,trip_sep21,trip_oct21,trip_nov21,return = "mismatch")
## column_name trip_dec20 trip_jan21 trip_feb21 trip_mar21 trip_apr21
## 1 end_station_id numeric character character character character
## 2 start_station_id numeric character character character character
## trip_may21 trip_jun21 trip_jul21 trip_aug21 trip_sep21 trip_oct21 trip_nov21
## 1 character character character character character character character
## 2 character character character character character character character
I found a mismatch in the data type in trip_dec20 data frame. It had two columns in numeric data type. So we needed to convert them into character data type
#We observe that trip_dec20 has end_station_id and start_station_id in numeric data type, so we convert them to character data type
trip_dec20 <- mutate(trip_dec20,end_station_id = as.character(end_station_id),start_station_id = as.character(start_station_id))
After the conversion,I compared the column data type again to be sure there no no more mismatch.
#We want to confirm we have no more mismatch in data types so we re-run the comparedf again
compare_df_cols(trip_dec20,trip_jan21,trip_feb21,trip_mar21,trip_apr21,trip_may21,
trip_jun21,trip_jul21,trip_aug21,trip_sep21,trip_oct21,trip_nov21,return = "mismatch")
## [1] column_name trip_dec20 trip_jan21 trip_feb21 trip_mar21 trip_apr21
## [7] trip_may21 trip_jun21 trip_jul21 trip_aug21 trip_sep21 trip_oct21
## [13] trip_nov21
## <0 rows> (or 0-length row.names)
I binded the data frames together using the bind function and assigned them to a named variable
#Now we can combine or stack these data frames into one big data frame using bind function
all_trips <- bind_rows(trip_dec20,trip_jan21,trip_feb21,trip_mar21,trip_apr21,trip_may21,trip_jun21,trip_jul21,trip_aug21,
trip_sep21,trip_oct21,trip_nov21)
I dropped the columns that were not needed for the analysis by using the select function
#Now we drop the columns we do not need
all_trips <- all_trips %>%
select(-c(start_lat,start_lng,end_lat,end_lng))
I renamed the column names to a more suitable name i would prefer
#We want to rename the columns to a more prefered name
all_trips <- all_trips %>%
rename(trip_id = ride_id,
ride_type = rideable_type,
trip_start = started_at,
trip_end = ended_at,
from_station_name = start_station_name,
from_station_id = start_station_id,
to_station_name = end_station_name,
to_station_id = end_station_id,
customer_type = member_casual
)
Excellent!! Now that i have sorted the data frame accordingly i can now move on to the data cleaning process. First i will inspect my data frame
#Step 3: Clean up and Add Data to Prepare for Analysis
#Now we inspect our data frame
head(all_trips)
## trip_id ride_type trip_start trip_end
## 1 BD0A6FF6FFF9B921 electric_bike 2020-11-01 13:36:00 2020-11-01 13:45:40
## 2 96A7A7A4BDE4F82D electric_bike 2020-11-01 10:03:26 2020-11-01 10:14:45
## 3 C61526D06582BDC5 electric_bike 2020-11-01 00:34:05 2020-11-01 01:03:06
## 4 E533E89C32080B9E electric_bike 2020-11-01 00:45:16 2020-11-01 00:54:31
## 5 1C9F4EF18C168C60 electric_bike 2020-11-01 15:43:25 2020-11-01 16:16:52
## 6 7259585D8276D338 electric_bike 2020-11-14 15:55:17 2020-11-14 16:44:38
## from_station_name from_station_id to_station_name
## 1 Dearborn St & Erie St 110 St. Clair St & Erie St
## 2 Franklin St & Illinois St 672 Noble St & Milwaukee Ave
## 3 Lake Shore Dr & Monroe St 76 Federal St & Polk St
## 4 Leavitt St & Chicago Ave 659 Stave St & Armitage Ave
## 5 Buckingham Fountain 2 Buckingham Fountain
## 6 Wabash Ave & 16th St 72 Lake Shore Dr & Monroe St
## to_station_id customer_type
## 1 211 casual
## 2 29 casual
## 3 41 casual
## 4 185 casual
## 5 2 casual
## 6 76 casual
colnames(all_trips)
## [1] "trip_id" "ride_type" "trip_start"
## [4] "trip_end" "from_station_name" "from_station_id"
## [7] "to_station_name" "to_station_id" "customer_type"
str(all_trips)
## 'data.frame': 5378834 obs. of 9 variables:
## $ trip_id : chr "BD0A6FF6FFF9B921" "96A7A7A4BDE4F82D" "C61526D06582BDC5" "E533E89C32080B9E" ...
## $ ride_type : chr "electric_bike" "electric_bike" "electric_bike" "electric_bike" ...
## $ trip_start : POSIXct, format: "2020-11-01 13:36:00" "2020-11-01 10:03:26" ...
## $ trip_end : POSIXct, format: "2020-11-01 13:45:40" "2020-11-01 10:14:45" ...
## $ from_station_name: chr "Dearborn St & Erie St" "Franklin St & Illinois St" "Lake Shore Dr & Monroe St" "Leavitt St & Chicago Ave" ...
## $ from_station_id : chr "110" "672" "76" "659" ...
## $ to_station_name : chr "St. Clair St & Erie St" "Noble St & Milwaukee Ave" "Federal St & Polk St" "Stave St & Armitage Ave" ...
## $ to_station_id : chr "211" "29" "41" "185" ...
## $ customer_type : chr "casual" "casual" "casual" "casual" ...
dim(all_trips)
## [1] 5378834 9
summary(all_trips)
## trip_id ride_type trip_start
## Length:5378834 Length:5378834 Min. :2020-11-01 00:00:08
## Class :character Class :character 1st Qu.:2021-05-17 12:45:18
## Mode :character Mode :character Median :2021-07-13 22:33:14
## Mean :2021-06-27 18:37:41
## 3rd Qu.:2021-09-02 18:18:14
## Max. :2021-10-31 23:59:49
## trip_end from_station_name from_station_id
## Min. :2020-11-01 00:02:20 Length:5378834 Length:5378834
## 1st Qu.:2021-05-17 13:07:36 Class :character Class :character
## Median :2021-07-13 22:57:23 Mode :character Mode :character
## Mean :2021-06-27 18:58:10
## 3rd Qu.:2021-09-02 18:35:16
## Max. :2021-11-03 21:45:48
## to_station_name to_station_id customer_type
## Length:5378834 Length:5378834 Length:5378834
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
skim(all_trips)
| Name | all_trips |
| Number of rows | 5378834 |
| Number of columns | 9 |
| _______________________ | |
| Column type frequency: | |
| character | 7 |
| POSIXct | 2 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| trip_id | 0 | 1.00 | 16 | 16 | 0 | 5378625 | 0 |
| ride_type | 0 | 1.00 | 11 | 13 | 0 | 3 | 0 |
| from_station_name | 600479 | 0.89 | 3 | 53 | 0 | 814 | 0 |
| from_station_id | 600586 | 0.89 | 1 | 36 | 0 | 1303 | 0 |
| to_station_name | 646471 | 0.88 | 10 | 53 | 0 | 811 | 0 |
| to_station_id | 646548 | 0.88 | 1 | 36 | 0 | 1298 | 0 |
| customer_type | 0 | 1.00 | 6 | 6 | 0 | 2 | 0 |
Variable type: POSIXct
| skim_variable | n_missing | complete_rate | min | max | median | n_unique |
|---|---|---|---|---|---|---|
| trip_start | 0 | 1 | 2020-11-01 00:00:08 | 2021-10-31 23:59:49 | 2021-07-13 22:33:14 | 4487412 |
| trip_end | 0 | 1 | 2020-11-01 00:02:20 | 2021-11-03 21:45:48 | 2021-07-13 22:57:23 | 4479067 |
Now I want to add columns that would enable us aggregate the data
# Now we add columns that list the date, month, day, and year of each the trip
# This will allow us to aggregate ride data for each month, day, or year
#The default format of the date is yyyy-mm-dd
all_trips$date <- as.Date(all_trips$trip_start)
all_trips$month <- format(as.Date(all_trips$date), "%m")
all_trips$day <- format(as.Date(all_trips$date), "%d")
all_trips$year <- format(as.Date(all_trips$date), "%Y")
all_trips$day_of_week <- format(as.Date(all_trips$date), "%A")
I would then add the ride length calculation, convert the ride length data type and remove bad data using the skim function
#Now we Add a "ride_length" calculation to all_trips (in seconds) using difftime which calculates time intervals
all_trips$ride_length <- difftime(all_trips$trip_end,all_trips$trip_start)
str(all_trips$ride_length)
## 'difftime' num [1:5378834] 580 679 1741 555 ...
## - attr(*, "units")= chr "secs"
#Now we convert ride_length from factor to numeric datatype
is.factor(all_trips$ride_length)
## [1] FALSE
all_trips$ride_length <- as.numeric(as.character(all_trips$ride_length))
is.numeric(all_trips$ride_length)
## [1] TRUE
#Now we want to remove bad data
#The data frame includes a few hundred entries when bikes were taken out of docks and checked for quality by Divvy or ride_length was negative
skim(all_trips$ride_length)
| Name | all_trips$ride_length |
| Number of rows | 5378834 |
| Number of columns | 1 |
| _______________________ | |
| Column type frequency: | |
| numeric | 1 |
| ________________________ | |
| Group variables | None |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| data | 0 | 1 | 1229.37 | 17999.53 | -1742998 | 418 | 743 | 1346 | 3356649 | ▁▇▁▁▁ |
Creating a new data frame to exclude the negative ride length values
#Now we will create a new data frame v2 that will exclude ride_length that were negative
all_trips_v2 <- all_trips[!(all_trips$ride_length<0),]
head(all_trips_v2$ride_length)
## [1] 580 679 1741 555 2007 2961
Now that we have cleaned, prepared data, we can move onto analysis. We take a look at our data frame using the aggregate and summary functions. The results were in seconds
#Step 4: Conduct Descriptive Analysis
#Now a descriptive analysis on ride_length
summary(all_trips_v2$ride_length)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0 418 743 1350 1346 3356649
#Now we want to compare the members and causal riders
aggregate(all_trips_v2$ride_length ~ all_trips_v2$customer_type, FUN = mean)
## all_trips_v2$customer_type all_trips_v2$ride_length
## 1 casual 1953.4299
## 2 member 837.7915
aggregate(all_trips_v2$ride_length ~ all_trips_v2$customer_type, FUN = median)
## all_trips_v2$customer_type all_trips_v2$ride_length
## 1 casual 982
## 2 member 595
aggregate(all_trips_v2$ride_length ~ all_trips_v2$customer_type, FUN = max)
## all_trips_v2$customer_type all_trips_v2$ride_length
## 1 casual 3356649
## 2 member 93596
aggregate(all_trips_v2$ride_length ~ all_trips_v2$customer_type, FUN = min)
## all_trips_v2$customer_type all_trips_v2$ride_length
## 1 casual 0
## 2 member 0
We noticed that the days of the week are out of order. We use the order fundtion to fix that
# Now we want to see the average ride time by each day for members vs casual users
aggregate(all_trips_v2$ride_length ~ all_trips_v2$customer_type + all_trips_v2$day_of_week, FUN = mean)
## all_trips_v2$customer_type all_trips_v2$day_of_week all_trips_v2$ride_length
## 1 casual Friday 1854.5624
## 2 member Friday 819.7759
## 3 casual Monday 1939.1773
## 4 member Monday 811.1557
## 5 casual Saturday 2109.5236
## 6 member Saturday 933.4372
## 7 casual Sunday 2286.3571
## 8 member Sunday 955.1888
## 9 casual Thursday 1692.3220
## 10 member Thursday 786.1396
## 11 casual Tuesday 1717.1894
## 12 member Tuesday 786.2936
## 13 casual Wednesday 1688.8956
## 14 member Wednesday 789.9640
# We Notice that the days of the week are out of order. Let's fix that.
all_trips_v2$day_of_week <- ordered(all_trips_v2$day_of_week, levels=c("Sunday", "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday"))
#Now we re run the average ride time by each day for members vs causal riders
aggregate(all_trips_v2$ride_length ~ all_trips_v2$customer_type + all_trips_v2$day_of_week, FUN = mean)
## all_trips_v2$customer_type all_trips_v2$day_of_week all_trips_v2$ride_length
## 1 casual Sunday 2286.3571
## 2 member Sunday 955.1888
## 3 casual Monday 1939.1773
## 4 member Monday 811.1557
## 5 casual Tuesday 1717.1894
## 6 member Tuesday 786.2936
## 7 casual Wednesday 1688.8956
## 8 member Wednesday 789.9640
## 9 casual Thursday 1692.3220
## 10 member Thursday 786.1396
## 11 casual Friday 1854.5624
## 12 member Friday 819.7759
## 13 casual Saturday 2109.5236
## 14 member Saturday 933.4372
We added a new column called weekday so we could analyse the ridership data by the type and weekday
# analyze ridership data by type and weekday.
#But we have no weekday column so we create it
all_trips_v2 %>%
mutate(weekday = wday(trip_start, label = TRUE)) %>% #creates weekday field using wday()
group_by(customer_type, weekday) %>% #groups by customertype and weekday
summarise(number_of_rides = n(),average_duration = mean(ride_length)) %>%
arrange(customer_type, weekday)
## `summarise()` has grouped output by 'customer_type'. You can override using the `.groups` argument.
## # A tibble: 14 x 4
## # Groups: customer_type [2]
## customer_type weekday number_of_rides average_duration
## <chr> <ord> <int> <dbl>
## 1 casual Sun 476138 2286.
## 2 casual Mon 278252 1939.
## 3 casual Tue 264254 1717.
## 4 casual Wed 267472 1689.
## 5 casual Thu 277319 1692.
## 6 casual Fri 354969 1855.
## 7 casual Sat 551814 2110.
## 8 member Sun 368376 955.
## 9 member Mon 391287 811.
## 10 member Tue 431459 786.
## 11 member Wed 444255 790.
## 12 member Thu 425562 786.
## 13 member Fri 425120 820.
## 14 member Sat 421164 933.
#Now we try to get the total number of causal and member ridership and use that to get the approximate percentage of each member type
all_trips %>%
mutate(weekday = wday(trip_start, label = TRUE)) %>%
group_by(customer_type) %>%
summarise(number_of_rides = n()) %>%
arrange(customer_type)
## # A tibble: 2 x 2
## customer_type number_of_rides
## <chr> <int>
## 1 casual 2470517
## 2 member 2908317
#From the data frame, the causal number of rides = 2,470517 and member or annual number of rides = 2,908317. We pipe the result to sum function to give us the sum figure
We visualized our data by showing how the length of the trip and the number of the trip compare by weekday and the membership status. We will first create the bar graph showing the total ride count per day in comparison to the membership status or type.
# Let's visualize the number of rides by rider type
all_trips_v2 %>%
mutate(weekday = wday(trip_start, label = TRUE)) %>%
group_by(customer_type, weekday) %>%
summarise(number_of_rides = n(),average_duration = mean(ride_length)) %>%
arrange(customer_type, weekday) %>%
ggplot(aes(x = weekday, y = number_of_rides, fill = customer_type)) +
geom_col(position = "dodge") +
scale_fill_manual(values = c("casual" = "green", "member" = "blue")) +
scale_y_continuous(labels = scales::label_number(suffix = " K", scale = 1e-3)) +
labs(title="Number of trip rides per day in comparison to customer type", subtitle = "Causal members vs Annual members",caption=paste0("The data between December 2020 and November 2021"),x= "The Days of the Week", y= "The Total Number of Rides",fill= "Membership type")
## `summarise()` has grouped output by 'customer_type'. You can override using the `.groups` argument.
# 20. Second Data Visualisation.
We created the second bar plot showing the average duration of the trip in comparison to the causal and annual members
# Let's create a visualization for average duration
all_trips_v2 %>%
mutate(weekday = wday(trip_start, label = TRUE)) %>%
group_by(customer_type, weekday) %>%
summarise(number_of_rides = n(),average_duration = mean(ride_length)) %>%
arrange(customer_type, weekday) %>%
ggplot(aes(x = weekday, y = average_duration, fill = customer_type)) +
geom_col(position = "dodge") +
scale_fill_manual(values = c("casual" = "forestgreen",
"member" = "orange")) +
scale_y_continuous(labels = scales::label_number(suffix="sec", scale = 10)) +
labs(title="The average duration of trip rides",subtitles= "Member vs Casual",x="The Days of the Week",y="Average Ride Length in seconds",fill= "Membership type",caption=paste0("The data between December 2020 and November 2021"))
## `summarise()` has grouped output by 'customer_type'. You can override using the `.groups` argument.
From the data analysis above, in step 18 we have a data frame that shows us the number of rides for each customer type. The result from that data frame showed the following * The casual number of rides was 2,470,517 * The member number of ride was 2,908,317 * This would make the sum of 5,378,834
We could take the percentage to know the percentage of each customer type.The percentage calculations would give us * Casual rider 46% * Member rider 54%
This reveals that member riders were slightly higher than casual riders.
Also, from the first visualization on step 19, we found out that members rode more on weekdays (Monday to Friday) when compared to casual riders. But during the weekend (Saturday and Sunday) the casual riders were at the peak level above the member riders.
From the second visualization on step 20, we found out that casual riders went on more rides than member riders. The causal riders rode for as much as twice when compared to how much member riders rode.
The above explanations has helped in answering the key questions in trying to understand how casual riders and member or annual riders used the Cyclistic bike differently.
From the analysis, we can see that casual riders ride more than the annual members. So it would be of financial benefit to design the causal members bikes with less expensive materials that would be equivalent to the savings that would be made from an annual member in a month
We have found out from the analysis that the casual riders go on rides the most during the weekends. The marketing team can set up a of different people that would be at different start stations to meet with the causal riders and offer incentives to try to convert them to annual riders. For, example each causal rider could be offered a 15% slash bonus to register as an annual member. This could be done for a month and afterwards another analysis could be dope to see how many casual riders have converted to annual members.
In convincing the causal riders to convert, we should not forget the already existing annual members. Together with the previous step, a package can be given to the annual riders. For example, giving 40% off the monthly payment to an annual rider who brings in a friend or even an already existing casual rider to register as a member with the Cyclistic company.
Since the aim of this analysis is to convince causal riders to convert to annual members. The company could find the data on the routes with the most traffic and during the engagement with the casual riders on the weekends (as discussed in the second point),the casual riders who don’t ride on weekdays could be convinced to switch to annual membership if riding a bike along those routes would save time and stress from the traffic.