Project Background.

I newly joined a fictional Cyclistic marketing analytics team as a junior data analyst in a bike-share company in Chicago called Cylistic. The company has two major types of memberships which are casual membership and annual membership. The Marketing Director of the Cyclistic company Lily Moreno wants to maximize the number of annual members in the company. She is of the opinion that the success of the company would depend on the growth of the annual members registered to the company.

Financial Incentive for the project.

The financial analyst team at Cyclistic company have concluded that the annual members are more profitable to the company than the casual member riders and this is why the Marketing Director of the company wants to expand the growth of the annual members. But Lily Moreno does not want to create a marketing campaign that would target all new customers but wants to convert the already existing casual members to annual members.

Business Task.

The business task of this project is to convert causal members into annual members. But to do this, there are key questions we need to take into consideration

Key Questions.

How do causal members and annual members use the bike-share differently,if at all?
Why would a casual member buy the annual membership?
How can Cyclistic use digital media to influence causal members to become annual members?

Data Sources.

The data source was gotten from the link {https://divvy-tripdata.s3.amazonaws.com/index.html} I downloaded the twelve previous months of the Cyclistic trip data, which was from December 2020 till November 2021.

Preparing the Data.

I created a folder where i saved all the 12 data sets, then i renamed all the data sets to a suitable name for easy understanding. I checked each data sets to make sure it passed the ROCCC(Reliable,Original, Comprehensive,Current and Cited) check.

Information in the Data.

The data included information about the causal and annual member cyclists during the period of December 2020 and November 2021. The breakdown of some of the information included are:

The start stations where the cyclists rent the bike
The end stations where the cyclists drop the bike
The cyclists Identification number
The type of bike the cyclists ride
The start and end station latitude
The start and end station longitude
The type of membership
The start and end time of each cyclist trip

Below are the steps taken in Rstudio to carry out this analysis:

1. Installing the package and loading the different libraries needed.

#loading the necessary library
library(readr) # helps read data

## Warning: package 'readr' was built under R version 4.1.2

library(dplyr) # helps with data frame

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(ggplot2) # helps visualize data

## Warning: package 'ggplot2' was built under R version 4.1.2

library(tidyverse) # helps wrangle data

## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --

## v tibble  3.1.6     v stringr 1.4.0
## v tidyr   1.1.4     v forcats 0.5.1
## v purrr   0.3.4

## Warning: package 'tibble' was built under R version 4.1.2

## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

library(lubridate) # helps wrangle data attributes

## 
## Attaching package: 'lubridate'

## The following objects are masked from 'package:base':
## 
##     date, intersect, setdiff, union

library(janitor)  # helps with data cleaning

## Warning: package 'janitor' was built under R version 4.1.2

## 
## Attaching package: 'janitor'

## The following objects are masked from 'package:stats':
## 
##     chisq.test, fisher.test

library(skimr) # helps with data cleaning and visuals

## Warning: package 'skimr' was built under R version 4.1.2

2. Loading the data set files and assigning them to different variable names.

After the libraries and package were loaded, i proceeded to setting my working directory to the file where i had all my data sets saved. Then i used the read_csv function which part of the readr package to read my data set files into the session.

# Step1: Collection of Data

#Now we load these files and assign them to a variable name
trip_dec20 <- read_csv("bike_ride_dec_20.csv")

## Rows: 259716 Columns: 13

## -- Column specification --------------------------------------------------------
## Delimiter: ","
## chr  (5): ride_id, rideable_type, start_station_name, end_station_name, memb...
## dbl  (6): start_station_id, end_station_id, start_lat, start_lng, end_lat, e...
## dttm (2): started_at, ended_at

## 
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.

trip_jan21 <- read_csv("bike_ride_jan_21.csv")

## Rows: 131573 Columns: 13

## -- Column specification --------------------------------------------------------
## Delimiter: ","
## chr  (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl  (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at

## 
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.

trip_feb21 <- read_csv("bike_ride_feb_21.csv")

## Rows: 96834 Columns: 13

## -- Column specification --------------------------------------------------------
## Delimiter: ","
## chr  (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl  (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at

## 
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.

trip_mar21 <- read_csv("bike_ride_mar_21.csv")

## Rows: 49622 Columns: 13

## -- Column specification --------------------------------------------------------
## Delimiter: ","
## chr  (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl  (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at

## 
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.

trip_apr21 <- read_csv("bike_ride_apr_21.csv")

## Rows: 228496 Columns: 13

## -- Column specification --------------------------------------------------------
## Delimiter: ","
## chr  (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl  (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at

## 
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.

trip_may21 <- read_csv("bike_ride_may_21.csv")

## Rows: 337230 Columns: 13

## -- Column specification --------------------------------------------------------
## Delimiter: ","
## chr  (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl  (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at

## 
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.

trip_jun21 <- read_csv("bike_ride_jun_21.csv")

## Rows: 531633 Columns: 13

## -- Column specification --------------------------------------------------------
## Delimiter: ","
## chr  (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl  (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at

## 
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.

trip_jul21 <- read_csv("bike_ride_jul_21.csv")

## Rows: 729595 Columns: 13

## -- Column specification --------------------------------------------------------
## Delimiter: ","
## chr  (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl  (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at

## 
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.

trip_aug21 <- read_csv("bike_ride_aug_21.csv")

## Rows: 822410 Columns: 13

## -- Column specification --------------------------------------------------------
## Delimiter: ","
## chr  (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl  (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at

## 
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.

trip_sep21 <- read_csv("bike_ride_sep_21.csv")

## Rows: 804352 Columns: 13

## -- Column specification --------------------------------------------------------
## Delimiter: ","
## chr  (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl  (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at

## 
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.

trip_oct21 <- read_csv("bike_ride_oct_21.csv")

## Rows: 756147 Columns: 13

## -- Column specification --------------------------------------------------------
## Delimiter: ","
## chr  (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl  (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at

## 
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.

trip_nov21 <- read_csv("bike_ride_nov_21.csv")

## Rows: 631226 Columns: 13

## -- Column specification --------------------------------------------------------
## Delimiter: ","
## chr  (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl  (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at

## 
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.

3. Compare column names.

I then compared the column names using colnames to make sure the names were in the same order and format

#Step 2 We Wrangle Data and Combine into a single file

#Now we compare the column names of each file
colnames(trip_dec20)

##  [1] "ride_id"            "rideable_type"      "started_at"        
##  [4] "ended_at"           "start_station_name" "start_station_id"  
##  [7] "end_station_name"   "end_station_id"     "start_lat"         
## [10] "start_lng"          "end_lat"            "end_lng"           
## [13] "member_casual"

colnames(trip_jan21)

##  [1] "ride_id"            "rideable_type"      "started_at"        
##  [4] "ended_at"           "start_station_name" "start_station_id"  
##  [7] "end_station_name"   "end_station_id"     "start_lat"         
## [10] "start_lng"          "end_lat"            "end_lng"           
## [13] "member_casual"

colnames(trip_feb21)

##  [1] "ride_id"            "rideable_type"      "started_at"        
##  [4] "ended_at"           "start_station_name" "start_station_id"  
##  [7] "end_station_name"   "end_station_id"     "start_lat"         
## [10] "start_lng"          "end_lat"            "end_lng"           
## [13] "member_casual"

colnames(trip_mar21)

##  [1] "ride_id"            "rideable_type"      "started_at"        
##  [4] "ended_at"           "start_station_name" "start_station_id"  
##  [7] "end_station_name"   "end_station_id"     "start_lat"         
## [10] "start_lng"          "end_lat"            "end_lng"           
## [13] "member_casual"

colnames(trip_apr21)

##  [1] "ride_id"            "rideable_type"      "started_at"        
##  [4] "ended_at"           "start_station_name" "start_station_id"  
##  [7] "end_station_name"   "end_station_id"     "start_lat"         
## [10] "start_lng"          "end_lat"            "end_lng"           
## [13] "member_casual"

colnames(trip_may21)

##  [1] "ride_id"            "rideable_type"      "started_at"        
##  [4] "ended_at"           "start_station_name" "start_station_id"  
##  [7] "end_station_name"   "end_station_id"     "start_lat"         
## [10] "start_lng"          "end_lat"            "end_lng"           
## [13] "member_casual"

colnames(trip_jun21)

##  [1] "ride_id"            "rideable_type"      "started_at"        
##  [4] "ended_at"           "start_station_name" "start_station_id"  
##  [7] "end_station_name"   "end_station_id"     "start_lat"         
## [10] "start_lng"          "end_lat"            "end_lng"           
## [13] "member_casual"

colnames(trip_jul21)

##  [1] "ride_id"            "rideable_type"      "started_at"        
##  [4] "ended_at"           "start_station_name" "start_station_id"  
##  [7] "end_station_name"   "end_station_id"     "start_lat"         
## [10] "start_lng"          "end_lat"            "end_lng"           
## [13] "member_casual"

colnames(trip_aug21)

##  [1] "ride_id"            "rideable_type"      "started_at"        
##  [4] "ended_at"           "start_station_name" "start_station_id"  
##  [7] "end_station_name"   "end_station_id"     "start_lat"         
## [10] "start_lng"          "end_lat"            "end_lng"           
## [13] "member_casual"

colnames(trip_sep21)

##  [1] "ride_id"            "rideable_type"      "started_at"        
##  [4] "ended_at"           "start_station_name" "start_station_id"  
##  [7] "end_station_name"   "end_station_id"     "start_lat"         
## [10] "start_lng"          "end_lat"            "end_lng"           
## [13] "member_casual"

colnames(trip_oct21)

##  [1] "ride_id"            "rideable_type"      "started_at"        
##  [4] "ended_at"           "start_station_name" "start_station_id"  
##  [7] "end_station_name"   "end_station_id"     "start_lat"         
## [10] "start_lng"          "end_lat"            "end_lng"           
## [13] "member_casual"

colnames(trip_nov21)

##  [1] "ride_id"            "rideable_type"      "started_at"        
##  [4] "ended_at"           "start_station_name" "start_station_id"  
##  [7] "end_station_name"   "end_station_id"     "start_lat"         
## [10] "start_lng"          "end_lat"            "end_lng"           
## [13] "member_casual"

4. Inspect the data.

I looked at the data types using stringR to look for errors or inconsistency.

#Now we want to look at the data types of each data frame using str()
str(trip_dec20)

## spec_tbl_df [259,716 x 13] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ ride_id           : chr [1:259716] "BD0A6FF6FFF9B921" "96A7A7A4BDE4F82D" "C61526D06582BDC5" "E533E89C32080B9E" ...
##  $ rideable_type     : chr [1:259716] "electric_bike" "electric_bike" "electric_bike" "electric_bike" ...
##  $ started_at        : POSIXct[1:259716], format: "2020-11-01 13:36:00" "2020-11-01 10:03:26" ...
##  $ ended_at          : POSIXct[1:259716], format: "2020-11-01 13:45:40" "2020-11-01 10:14:45" ...
##  $ start_station_name: chr [1:259716] "Dearborn St & Erie St" "Franklin St & Illinois St" "Lake Shore Dr & Monroe St" "Leavitt St & Chicago Ave" ...
##  $ start_station_id  : num [1:259716] 110 672 76 659 2 72 76 NA 58 394 ...
##  $ end_station_name  : chr [1:259716] "St. Clair St & Erie St" "Noble St & Milwaukee Ave" "Federal St & Polk St" "Stave St & Armitage Ave" ...
##  $ end_station_id    : num [1:259716] 211 29 41 185 2 76 72 NA 288 273 ...
##  $ start_lat         : num [1:259716] 41.9 41.9 41.9 41.9 41.9 ...
##  $ start_lng         : num [1:259716] -87.6 -87.6 -87.6 -87.7 -87.6 ...
##  $ end_lat           : num [1:259716] 41.9 41.9 41.9 41.9 41.9 ...
##  $ end_lng           : num [1:259716] -87.6 -87.7 -87.6 -87.7 -87.6 ...
##  $ member_casual     : chr [1:259716] "casual" "casual" "casual" "casual" ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   ride_id = col_character(),
##   ..   rideable_type = col_character(),
##   ..   started_at = col_datetime(format = ""),
##   ..   ended_at = col_datetime(format = ""),
##   ..   start_station_name = col_character(),
##   ..   start_station_id = col_double(),
##   ..   end_station_name = col_character(),
##   ..   end_station_id = col_double(),
##   ..   start_lat = col_double(),
##   ..   start_lng = col_double(),
##   ..   end_lat = col_double(),
##   ..   end_lng = col_double(),
##   ..   member_casual = col_character()
##   .. )
##  - attr(*, "problems")=<externalptr>

str(trip_jan21)

## spec_tbl_df [131,573 x 13] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ ride_id           : chr [1:131573] "70B6A9A437D4C30D" "158A465D4E74C54A" "5262016E0F1F2F9A" "BE119628E44F871E" ...
##  $ rideable_type     : chr [1:131573] "classic_bike" "electric_bike" "electric_bike" "electric_bike" ...
##  $ started_at        : POSIXct[1:131573], format: "2020-12-27 12:44:29" "2020-12-18 17:37:15" ...
##  $ ended_at          : POSIXct[1:131573], format: "2020-12-27 12:55:06" "2020-12-18 17:44:19" ...
##  $ start_station_name: chr [1:131573] "Aberdeen St & Jackson Blvd" NA NA NA ...
##  $ start_station_id  : chr [1:131573] "13157" NA NA NA ...
##  $ end_station_name  : chr [1:131573] "Desplaines St & Kinzie St" NA NA NA ...
##  $ end_station_id    : chr [1:131573] "TA1306000003" NA NA NA ...
##  $ start_lat         : num [1:131573] 41.9 41.9 41.9 41.9 41.8 ...
##  $ start_lng         : num [1:131573] -87.7 -87.7 -87.7 -87.7 -87.6 ...
##  $ end_lat           : num [1:131573] 41.9 41.9 41.9 41.9 41.8 ...
##  $ end_lng           : num [1:131573] -87.6 -87.7 -87.7 -87.7 -87.6 ...
##  $ member_casual     : chr [1:131573] "member" "member" "member" "member" ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   ride_id = col_character(),
##   ..   rideable_type = col_character(),
##   ..   started_at = col_datetime(format = ""),
##   ..   ended_at = col_datetime(format = ""),
##   ..   start_station_name = col_character(),
##   ..   start_station_id = col_character(),
##   ..   end_station_name = col_character(),
##   ..   end_station_id = col_character(),
##   ..   start_lat = col_double(),
##   ..   start_lng = col_double(),
##   ..   end_lat = col_double(),
##   ..   end_lng = col_double(),
##   ..   member_casual = col_character()
##   .. )
##  - attr(*, "problems")=<externalptr>

str(trip_feb21)

## spec_tbl_df [96,834 x 13] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ ride_id           : chr [1:96834] "E19E6F1B8D4C42ED" "DC88F20C2C55F27F" "EC45C94683FE3F27" "4FA453A75AE377DB" ...
##  $ rideable_type     : chr [1:96834] "electric_bike" "electric_bike" "electric_bike" "electric_bike" ...
##  $ started_at        : POSIXct[1:96834], format: "2021-01-23 16:14:19" "2021-01-27 18:43:08" ...
##  $ ended_at          : POSIXct[1:96834], format: "2021-01-23 16:24:44" "2021-01-27 18:47:12" ...
##  $ start_station_name: chr [1:96834] "California Ave & Cortez St" "California Ave & Cortez St" "California Ave & Cortez St" "California Ave & Cortez St" ...
##  $ start_station_id  : chr [1:96834] "17660" "17660" "17660" "17660" ...
##  $ end_station_name  : chr [1:96834] NA NA NA NA ...
##  $ end_station_id    : chr [1:96834] NA NA NA NA ...
##  $ start_lat         : num [1:96834] 41.9 41.9 41.9 41.9 41.9 ...
##  $ start_lng         : num [1:96834] -87.7 -87.7 -87.7 -87.7 -87.7 ...
##  $ end_lat           : num [1:96834] 41.9 41.9 41.9 41.9 41.9 ...
##  $ end_lng           : num [1:96834] -87.7 -87.7 -87.7 -87.7 -87.7 ...
##  $ member_casual     : chr [1:96834] "member" "member" "member" "member" ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   ride_id = col_character(),
##   ..   rideable_type = col_character(),
##   ..   started_at = col_datetime(format = ""),
##   ..   ended_at = col_datetime(format = ""),
##   ..   start_station_name = col_character(),
##   ..   start_station_id = col_character(),
##   ..   end_station_name = col_character(),
##   ..   end_station_id = col_character(),
##   ..   start_lat = col_double(),
##   ..   start_lng = col_double(),
##   ..   end_lat = col_double(),
##   ..   end_lng = col_double(),
##   ..   member_casual = col_character()
##   .. )
##  - attr(*, "problems")=<externalptr>

str(trip_mar21)

## spec_tbl_df [49,622 x 13] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ ride_id           : chr [1:49622] "89E7AA6C29227EFF" "0FEFDE2603568365" "E6159D746B2DBB91" "B32D3199F1C2E75B" ...
##  $ rideable_type     : chr [1:49622] "classic_bike" "classic_bike" "electric_bike" "classic_bike" ...
##  $ started_at        : POSIXct[1:49622], format: "2021-02-12 16:14:56" "2021-02-14 17:52:38" ...
##  $ ended_at          : POSIXct[1:49622], format: "2021-02-12 16:21:43" "2021-02-14 18:12:09" ...
##  $ start_station_name: chr [1:49622] "Glenwood Ave & Touhy Ave" "Glenwood Ave & Touhy Ave" "Clark St & Lake St" "Wood St & Chicago Ave" ...
##  $ start_station_id  : chr [1:49622] "525" "525" "KA1503000012" "637" ...
##  $ end_station_name  : chr [1:49622] "Sheridan Rd & Columbia Ave" "Bosworth Ave & Howard St" "State St & Randolph St" "Honore St & Division St" ...
##  $ end_station_id    : chr [1:49622] "660" "16806" "TA1305000029" "TA1305000034" ...
##  $ start_lat         : num [1:49622] 42 42 41.9 41.9 41.8 ...
##  $ start_lng         : num [1:49622] -87.7 -87.7 -87.6 -87.7 -87.6 ...
##  $ end_lat           : num [1:49622] 42 42 41.9 41.9 41.8 ...
##  $ end_lng           : num [1:49622] -87.7 -87.7 -87.6 -87.7 -87.6 ...
##  $ member_casual     : chr [1:49622] "member" "casual" "member" "member" ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   ride_id = col_character(),
##   ..   rideable_type = col_character(),
##   ..   started_at = col_datetime(format = ""),
##   ..   ended_at = col_datetime(format = ""),
##   ..   start_station_name = col_character(),
##   ..   start_station_id = col_character(),
##   ..   end_station_name = col_character(),
##   ..   end_station_id = col_character(),
##   ..   start_lat = col_double(),
##   ..   start_lng = col_double(),
##   ..   end_lat = col_double(),
##   ..   end_lng = col_double(),
##   ..   member_casual = col_character()
##   .. )
##  - attr(*, "problems")=<externalptr>

str(trip_apr21)

## spec_tbl_df [228,496 x 13] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ ride_id           : chr [1:228496] "CFA86D4455AA1030" "30D9DC61227D1AF3" "846D87A15682A284" "994D05AA75A168F2" ...
##  $ rideable_type     : chr [1:228496] "classic_bike" "classic_bike" "classic_bike" "classic_bike" ...
##  $ started_at        : POSIXct[1:228496], format: "2021-03-16 08:32:30" "2021-03-28 01:26:28" ...
##  $ ended_at          : POSIXct[1:228496], format: "2021-03-16 08:36:34" "2021-03-28 01:36:55" ...
##  $ start_station_name: chr [1:228496] "Humboldt Blvd & Armitage Ave" "Humboldt Blvd & Armitage Ave" "Shields Ave & 28th Pl" "Winthrop Ave & Lawrence Ave" ...
##  $ start_station_id  : chr [1:228496] "15651" "15651" "15443" "TA1308000021" ...
##  $ end_station_name  : chr [1:228496] "Stave St & Armitage Ave" "Central Park Ave & Bloomingdale Ave" "Halsted St & 35th St" "Broadway & Sheridan Rd" ...
##  $ end_station_id    : chr [1:228496] "13266" "18017" "TA1308000043" "13323" ...
##  $ start_lat         : num [1:228496] 41.9 41.9 41.8 42 42 ...
##  $ start_lng         : num [1:228496] -87.7 -87.7 -87.6 -87.7 -87.7 ...
##  $ end_lat           : num [1:228496] 41.9 41.9 41.8 42 42.1 ...
##  $ end_lng           : num [1:228496] -87.7 -87.7 -87.6 -87.6 -87.7 ...
##  $ member_casual     : chr [1:228496] "casual" "casual" "casual" "casual" ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   ride_id = col_character(),
##   ..   rideable_type = col_character(),
##   ..   started_at = col_datetime(format = ""),
##   ..   ended_at = col_datetime(format = ""),
##   ..   start_station_name = col_character(),
##   ..   start_station_id = col_character(),
##   ..   end_station_name = col_character(),
##   ..   end_station_id = col_character(),
##   ..   start_lat = col_double(),
##   ..   start_lng = col_double(),
##   ..   end_lat = col_double(),
##   ..   end_lng = col_double(),
##   ..   member_casual = col_character()
##   .. )
##  - attr(*, "problems")=<externalptr>

str(trip_may21)

## spec_tbl_df [337,230 x 13] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ ride_id           : chr [1:337230] "6C992BD37A98A63F" "1E0145613A209000" "E498E15508A80BAD" "1887262AD101C604" ...
##  $ rideable_type     : chr [1:337230] "classic_bike" "docked_bike" "docked_bike" "classic_bike" ...
##  $ started_at        : POSIXct[1:337230], format: "2021-04-12 18:25:36" "2021-04-27 17:27:11" ...
##  $ ended_at          : POSIXct[1:337230], format: "2021-04-12 18:56:55" "2021-04-27 18:31:29" ...
##  $ start_station_name: chr [1:337230] "State St & Pearson St" "Dorchester Ave & 49th St" "Loomis Blvd & 84th St" "Honore St & Division St" ...
##  $ start_station_id  : chr [1:337230] "TA1307000061" "KA1503000069" "20121" "TA1305000034" ...
##  $ end_station_name  : chr [1:337230] "Southport Ave & Waveland Ave" "Dorchester Ave & 49th St" "Loomis Blvd & 84th St" "Southport Ave & Waveland Ave" ...
##  $ end_station_id    : chr [1:337230] "13235" "KA1503000069" "20121" "13235" ...
##  $ start_lat         : num [1:337230] 41.9 41.8 41.7 41.9 41.7 ...
##  $ start_lng         : num [1:337230] -87.6 -87.6 -87.7 -87.7 -87.7 ...
##  $ end_lat           : num [1:337230] 41.9 41.8 41.7 41.9 41.7 ...
##  $ end_lng           : num [1:337230] -87.7 -87.6 -87.7 -87.7 -87.7 ...
##  $ member_casual     : chr [1:337230] "member" "casual" "casual" "member" ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   ride_id = col_character(),
##   ..   rideable_type = col_character(),
##   ..   started_at = col_datetime(format = ""),
##   ..   ended_at = col_datetime(format = ""),
##   ..   start_station_name = col_character(),
##   ..   start_station_id = col_character(),
##   ..   end_station_name = col_character(),
##   ..   end_station_id = col_character(),
##   ..   start_lat = col_double(),
##   ..   start_lng = col_double(),
##   ..   end_lat = col_double(),
##   ..   end_lng = col_double(),
##   ..   member_casual = col_character()
##   .. )
##  - attr(*, "problems")=<externalptr>

str(trip_jun21)

## spec_tbl_df [531,633 x 13] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ ride_id           : chr [1:531633] "C809ED75D6160B2A" "DD59FDCE0ACACAF3" "0AB83CB88C43EFC2" "7881AC6D39110C60" ...
##  $ rideable_type     : chr [1:531633] "electric_bike" "electric_bike" "electric_bike" "electric_bike" ...
##  $ started_at        : POSIXct[1:531633], format: "2021-05-30 11:58:15" "2021-05-30 11:29:14" ...
##  $ ended_at          : POSIXct[1:531633], format: "2021-05-30 12:10:39" "2021-05-30 12:14:09" ...
##  $ start_station_name: chr [1:531633] NA NA NA NA ...
##  $ start_station_id  : chr [1:531633] NA NA NA NA ...
##  $ end_station_name  : chr [1:531633] NA NA NA NA ...
##  $ end_station_id    : chr [1:531633] NA NA NA NA ...
##  $ start_lat         : num [1:531633] 41.9 41.9 41.9 41.9 41.9 ...
##  $ start_lng         : num [1:531633] -87.6 -87.6 -87.7 -87.7 -87.7 ...
##  $ end_lat           : num [1:531633] 41.9 41.8 41.9 41.9 41.9 ...
##  $ end_lng           : num [1:531633] -87.6 -87.6 -87.7 -87.7 -87.7 ...
##  $ member_casual     : chr [1:531633] "casual" "casual" "casual" "casual" ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   ride_id = col_character(),
##   ..   rideable_type = col_character(),
##   ..   started_at = col_datetime(format = ""),
##   ..   ended_at = col_datetime(format = ""),
##   ..   start_station_name = col_character(),
##   ..   start_station_id = col_character(),
##   ..   end_station_name = col_character(),
##   ..   end_station_id = col_character(),
##   ..   start_lat = col_double(),
##   ..   start_lng = col_double(),
##   ..   end_lat = col_double(),
##   ..   end_lng = col_double(),
##   ..   member_casual = col_character()
##   .. )
##  - attr(*, "problems")=<externalptr>

str(trip_jul21)

## spec_tbl_df [729,595 x 13] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ ride_id           : chr [1:729595] "99FEC93BA843FB20" "06048DCFC8520CAF" "9598066F68045DF2" "B03C0FE48C412214" ...
##  $ rideable_type     : chr [1:729595] "electric_bike" "electric_bike" "electric_bike" "electric_bike" ...
##  $ started_at        : POSIXct[1:729595], format: "2021-06-13 14:31:28" "2021-06-04 11:18:02" ...
##  $ ended_at          : POSIXct[1:729595], format: "2021-06-13 14:34:11" "2021-06-04 11:24:19" ...
##  $ start_station_name: chr [1:729595] NA NA NA NA ...
##  $ start_station_id  : chr [1:729595] NA NA NA NA ...
##  $ end_station_name  : chr [1:729595] NA NA NA NA ...
##  $ end_station_id    : chr [1:729595] NA NA NA NA ...
##  $ start_lat         : num [1:729595] 41.8 41.8 41.8 41.8 41.8 ...
##  $ start_lng         : num [1:729595] -87.6 -87.6 -87.6 -87.6 -87.6 ...
##  $ end_lat           : num [1:729595] 41.8 41.8 41.8 41.8 41.8 ...
##  $ end_lng           : num [1:729595] -87.6 -87.6 -87.6 -87.6 -87.6 ...
##  $ member_casual     : chr [1:729595] "member" "member" "member" "member" ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   ride_id = col_character(),
##   ..   rideable_type = col_character(),
##   ..   started_at = col_datetime(format = ""),
##   ..   ended_at = col_datetime(format = ""),
##   ..   start_station_name = col_character(),
##   ..   start_station_id = col_character(),
##   ..   end_station_name = col_character(),
##   ..   end_station_id = col_character(),
##   ..   start_lat = col_double(),
##   ..   start_lng = col_double(),
##   ..   end_lat = col_double(),
##   ..   end_lng = col_double(),
##   ..   member_casual = col_character()
##   .. )
##  - attr(*, "problems")=<externalptr>

str(trip_aug21)

## spec_tbl_df [822,410 x 13] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ ride_id           : chr [1:822410] "0A1B623926EF4E16" "B2D5583A5A5E76EE" "6F264597DDBF427A" "379B58EAB20E8AA5" ...
##  $ rideable_type     : chr [1:822410] "docked_bike" "classic_bike" "classic_bike" "classic_bike" ...
##  $ started_at        : POSIXct[1:822410], format: "2021-07-02 14:44:36" "2021-07-07 16:57:42" ...
##  $ ended_at          : POSIXct[1:822410], format: "2021-07-02 15:19:58" "2021-07-07 17:16:09" ...
##  $ start_station_name: chr [1:822410] "Michigan Ave & Washington St" "California Ave & Cortez St" "Wabash Ave & 16th St" "California Ave & Cortez St" ...
##  $ start_station_id  : chr [1:822410] "13001" "17660" "SL-012" "17660" ...
##  $ end_station_name  : chr [1:822410] "Halsted St & North Branch St" "Wood St & Hubbard St" "Rush St & Hubbard St" "Carpenter St & Huron St" ...
##  $ end_station_id    : chr [1:822410] "KA1504000117" "13432" "KA1503000044" "13196" ...
##  $ start_lat         : num [1:822410] 41.9 41.9 41.9 41.9 41.9 ...
##  $ start_lng         : num [1:822410] -87.6 -87.7 -87.6 -87.7 -87.7 ...
##  $ end_lat           : num [1:822410] 41.9 41.9 41.9 41.9 41.9 ...
##  $ end_lng           : num [1:822410] -87.6 -87.7 -87.6 -87.7 -87.7 ...
##  $ member_casual     : chr [1:822410] "casual" "casual" "member" "member" ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   ride_id = col_character(),
##   ..   rideable_type = col_character(),
##   ..   started_at = col_datetime(format = ""),
##   ..   ended_at = col_datetime(format = ""),
##   ..   start_station_name = col_character(),
##   ..   start_station_id = col_character(),
##   ..   end_station_name = col_character(),
##   ..   end_station_id = col_character(),
##   ..   start_lat = col_double(),
##   ..   start_lng = col_double(),
##   ..   end_lat = col_double(),
##   ..   end_lng = col_double(),
##   ..   member_casual = col_character()
##   .. )
##  - attr(*, "problems")=<externalptr>

str(trip_sep21)

## spec_tbl_df [804,352 x 13] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ ride_id           : chr [1:804352] "99103BB87CC6C1BB" "EAFCCCFB0A3FC5A1" "9EF4F46C57AD234D" "5834D3208BFAF1DA" ...
##  $ rideable_type     : chr [1:804352] "electric_bike" "electric_bike" "electric_bike" "electric_bike" ...
##  $ started_at        : POSIXct[1:804352], format: "2021-08-10 17:15:49" "2021-08-10 17:23:14" ...
##  $ ended_at          : POSIXct[1:804352], format: "2021-08-10 17:22:44" "2021-08-10 17:39:24" ...
##  $ start_station_name: chr [1:804352] NA NA NA NA ...
##  $ start_station_id  : chr [1:804352] NA NA NA NA ...
##  $ end_station_name  : chr [1:804352] NA NA NA NA ...
##  $ end_station_id    : chr [1:804352] NA NA NA NA ...
##  $ start_lat         : num [1:804352] 41.8 41.8 42 42 41.8 ...
##  $ start_lng         : num [1:804352] -87.7 -87.7 -87.7 -87.7 -87.6 ...
##  $ end_lat           : num [1:804352] 41.8 41.8 42 42 41.8 ...
##  $ end_lng           : num [1:804352] -87.7 -87.6 -87.7 -87.7 -87.6 ...
##  $ member_casual     : chr [1:804352] "member" "member" "member" "member" ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   ride_id = col_character(),
##   ..   rideable_type = col_character(),
##   ..   started_at = col_datetime(format = ""),
##   ..   ended_at = col_datetime(format = ""),
##   ..   start_station_name = col_character(),
##   ..   start_station_id = col_character(),
##   ..   end_station_name = col_character(),
##   ..   end_station_id = col_character(),
##   ..   start_lat = col_double(),
##   ..   start_lng = col_double(),
##   ..   end_lat = col_double(),
##   ..   end_lng = col_double(),
##   ..   member_casual = col_character()
##   .. )
##  - attr(*, "problems")=<externalptr>

str(trip_oct21)

## spec_tbl_df [756,147 x 13] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ ride_id           : chr [1:756147] "9DC7B962304CBFD8" "F930E2C6872D6B32" "6EF72137900BB910" "78D1DE133B3DBF55" ...
##  $ rideable_type     : chr [1:756147] "electric_bike" "electric_bike" "electric_bike" "electric_bike" ...
##  $ started_at        : POSIXct[1:756147], format: "2021-09-28 16:07:10" "2021-09-28 14:24:51" ...
##  $ ended_at          : POSIXct[1:756147], format: "2021-09-28 16:09:54" "2021-09-28 14:40:05" ...
##  $ start_station_name: chr [1:756147] NA NA NA NA ...
##  $ start_station_id  : chr [1:756147] NA NA NA NA ...
##  $ end_station_name  : chr [1:756147] NA NA NA NA ...
##  $ end_station_id    : chr [1:756147] NA NA NA NA ...
##  $ start_lat         : num [1:756147] 41.9 41.9 41.8 41.8 41.9 ...
##  $ start_lng         : num [1:756147] -87.7 -87.6 -87.7 -87.7 -87.7 ...
##  $ end_lat           : num [1:756147] 41.9 42 41.8 41.8 41.9 ...
##  $ end_lng           : num [1:756147] -87.7 -87.7 -87.7 -87.7 -87.7 ...
##  $ member_casual     : chr [1:756147] "casual" "casual" "casual" "casual" ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   ride_id = col_character(),
##   ..   rideable_type = col_character(),
##   ..   started_at = col_datetime(format = ""),
##   ..   ended_at = col_datetime(format = ""),
##   ..   start_station_name = col_character(),
##   ..   start_station_id = col_character(),
##   ..   end_station_name = col_character(),
##   ..   end_station_id = col_character(),
##   ..   start_lat = col_double(),
##   ..   start_lng = col_double(),
##   ..   end_lat = col_double(),
##   ..   end_lng = col_double(),
##   ..   member_casual = col_character()
##   .. )
##  - attr(*, "problems")=<externalptr>

str(trip_nov21)

## spec_tbl_df [631,226 x 13] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ ride_id           : chr [1:631226] "620BC6107255BF4C" "4471C70731AB2E45" "26CA69D43D15EE14" "362947F0437E1514" ...
##  $ rideable_type     : chr [1:631226] "electric_bike" "electric_bike" "electric_bike" "electric_bike" ...
##  $ started_at        : POSIXct[1:631226], format: "2021-10-22 12:46:42" "2021-10-21 09:12:37" ...
##  $ ended_at          : POSIXct[1:631226], format: "2021-10-22 12:49:50" "2021-10-21 09:14:14" ...
##  $ start_station_name: chr [1:631226] "Kingsbury St & Kinzie St" NA NA NA ...
##  $ start_station_id  : chr [1:631226] "KA1503000043" NA NA NA ...
##  $ end_station_name  : chr [1:631226] NA NA NA NA ...
##  $ end_station_id    : chr [1:631226] NA NA NA NA ...
##  $ start_lat         : num [1:631226] 41.9 41.9 41.9 41.9 41.9 ...
##  $ start_lng         : num [1:631226] -87.6 -87.7 -87.7 -87.7 -87.7 ...
##  $ end_lat           : num [1:631226] 41.9 41.9 41.9 41.9 41.9 ...
##  $ end_lng           : num [1:631226] -87.6 -87.7 -87.7 -87.7 -87.7 ...
##  $ member_casual     : chr [1:631226] "member" "member" "member" "member" ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   ride_id = col_character(),
##   ..   rideable_type = col_character(),
##   ..   started_at = col_datetime(format = ""),
##   ..   ended_at = col_datetime(format = ""),
##   ..   start_station_name = col_character(),
##   ..   start_station_id = col_character(),
##   ..   end_station_name = col_character(),
##   ..   end_station_id = col_character(),
##   ..   start_lat = col_double(),
##   ..   start_lng = col_double(),
##   ..   end_lat = col_double(),
##   ..   end_lng = col_double(),
##   ..   member_casual = col_character()
##   .. )
##  - attr(*, "problems")=<externalptr>

5. Convert tibbles to data frame.

I converted the tibbles to data frames using the function as.data.frame and assigned them to the names variables

#Now we want to convert the data sets into data frame
trip_dec20 <- as.data.frame(trip_dec20)
trip_jan21 <- as.data.frame(trip_jan21)
trip_feb21 <- as.data.frame(trip_feb21)
trip_mar21 <- as.data.frame(trip_mar21)
trip_apr21 <- as.data.frame(trip_apr21)
trip_may21 <- as.data.frame(trip_may21)
trip_jun21 <- as.data.frame(trip_jun21)
trip_jul21 <- as.data.frame(trip_jul21)
trip_aug21 <- as.data.frame(trip_aug21)
trip_sep21 <- as.data.frame(trip_sep21)
trip_oct21 <- as.data.frame(trip_oct21)
trip_nov21 <- as.data.frame(trip_nov21)

6. Compare data frame columns.

I inspected the columns of each data frame by using compare_df_cols and asked it to return any mismatch.I used this because we are dealing with a large data sets.

#Now we want to compare the column data type across all data frame by using comparedf
compare_df_cols(trip_dec20,trip_jan21,trip_feb21,trip_mar21,trip_apr21,trip_may21,
                trip_jun21,trip_jul21,trip_aug21,trip_sep21,trip_oct21,trip_nov21,return = "mismatch")

##        column_name trip_dec20 trip_jan21 trip_feb21 trip_mar21 trip_apr21
## 1   end_station_id    numeric  character  character  character  character
## 2 start_station_id    numeric  character  character  character  character
##   trip_may21 trip_jun21 trip_jul21 trip_aug21 trip_sep21 trip_oct21 trip_nov21
## 1  character  character  character  character  character  character  character
## 2  character  character  character  character  character  character  character

7. Convert the necessary columns to character data type.

I found a mismatch in the data type in trip_dec20 data frame. It had two columns in numeric data type. So we needed to convert them into character data type

#We observe that trip_dec20 has end_station_id and start_station_id in numeric data type, so we convert them to character data type
trip_dec20 <- mutate(trip_dec20,end_station_id = as.character(end_station_id),start_station_id = as.character(start_station_id))

8. Re-compare the data frame columns.

After the conversion,I compared the column data type again to be sure there no no more mismatch.

#We want to confirm we have no more mismatch in data types so we re-run the comparedf again
compare_df_cols(trip_dec20,trip_jan21,trip_feb21,trip_mar21,trip_apr21,trip_may21,
                trip_jun21,trip_jul21,trip_aug21,trip_sep21,trip_oct21,trip_nov21,return = "mismatch")

##  [1] column_name trip_dec20  trip_jan21  trip_feb21  trip_mar21  trip_apr21 
##  [7] trip_may21  trip_jun21  trip_jul21  trip_aug21  trip_sep21  trip_oct21 
## [13] trip_nov21 
## <0 rows> (or 0-length row.names)

9. Bind the data frames together.

I binded the data frames together using the bind function and assigned them to a named variable

#Now we can combine or stack these data frames into one big data frame using bind function
all_trips <- bind_rows(trip_dec20,trip_jan21,trip_feb21,trip_mar21,trip_apr21,trip_may21,trip_jun21,trip_jul21,trip_aug21,
                       trip_sep21,trip_oct21,trip_nov21)

10. Drop Columns.

I dropped the columns that were not needed for the analysis by using the select function

#Now we drop the columns we do not need
all_trips <- all_trips %>%
  select(-c(start_lat,start_lng,end_lat,end_lng))

11. Rename Column Names.

I renamed the column names to a more suitable name i would prefer

#We want to rename the columns to a more prefered name
all_trips <- all_trips %>%
  rename(trip_id = ride_id,
         ride_type = rideable_type,
         trip_start = started_at,
         trip_end = ended_at,
         from_station_name = start_station_name,
         from_station_id = start_station_id,
         to_station_name = end_station_name,
         to_station_id = end_station_id,
         customer_type = member_casual
    
  )

12. Inspect Data Frame.

Excellent!! Now that i have sorted the data frame accordingly i can now move on to the data cleaning process. First i will inspect my data frame

#Step 3: Clean up and Add Data to Prepare for Analysis

#Now we inspect our data frame
head(all_trips)

##            trip_id     ride_type          trip_start            trip_end
## 1 BD0A6FF6FFF9B921 electric_bike 2020-11-01 13:36:00 2020-11-01 13:45:40
## 2 96A7A7A4BDE4F82D electric_bike 2020-11-01 10:03:26 2020-11-01 10:14:45
## 3 C61526D06582BDC5 electric_bike 2020-11-01 00:34:05 2020-11-01 01:03:06
## 4 E533E89C32080B9E electric_bike 2020-11-01 00:45:16 2020-11-01 00:54:31
## 5 1C9F4EF18C168C60 electric_bike 2020-11-01 15:43:25 2020-11-01 16:16:52
## 6 7259585D8276D338 electric_bike 2020-11-14 15:55:17 2020-11-14 16:44:38
##           from_station_name from_station_id           to_station_name
## 1     Dearborn St & Erie St             110    St. Clair St & Erie St
## 2 Franklin St & Illinois St             672  Noble St & Milwaukee Ave
## 3 Lake Shore Dr & Monroe St              76      Federal St & Polk St
## 4  Leavitt St & Chicago Ave             659   Stave St & Armitage Ave
## 5       Buckingham Fountain               2       Buckingham Fountain
## 6      Wabash Ave & 16th St              72 Lake Shore Dr & Monroe St
##   to_station_id customer_type
## 1           211        casual
## 2            29        casual
## 3            41        casual
## 4           185        casual
## 5             2        casual
## 6            76        casual

colnames(all_trips)

## [1] "trip_id"           "ride_type"         "trip_start"       
## [4] "trip_end"          "from_station_name" "from_station_id"  
## [7] "to_station_name"   "to_station_id"     "customer_type"

str(all_trips)

## 'data.frame':    5378834 obs. of  9 variables:
##  $ trip_id          : chr  "BD0A6FF6FFF9B921" "96A7A7A4BDE4F82D" "C61526D06582BDC5" "E533E89C32080B9E" ...
##  $ ride_type        : chr  "electric_bike" "electric_bike" "electric_bike" "electric_bike" ...
##  $ trip_start       : POSIXct, format: "2020-11-01 13:36:00" "2020-11-01 10:03:26" ...
##  $ trip_end         : POSIXct, format: "2020-11-01 13:45:40" "2020-11-01 10:14:45" ...
##  $ from_station_name: chr  "Dearborn St & Erie St" "Franklin St & Illinois St" "Lake Shore Dr & Monroe St" "Leavitt St & Chicago Ave" ...
##  $ from_station_id  : chr  "110" "672" "76" "659" ...
##  $ to_station_name  : chr  "St. Clair St & Erie St" "Noble St & Milwaukee Ave" "Federal St & Polk St" "Stave St & Armitage Ave" ...
##  $ to_station_id    : chr  "211" "29" "41" "185" ...
##  $ customer_type    : chr  "casual" "casual" "casual" "casual" ...

dim(all_trips)

## [1] 5378834       9

summary(all_trips)

##    trip_id           ride_type           trip_start                 
##  Length:5378834     Length:5378834     Min.   :2020-11-01 00:00:08  
##  Class :character   Class :character   1st Qu.:2021-05-17 12:45:18  
##  Mode  :character   Mode  :character   Median :2021-07-13 22:33:14  
##                                        Mean   :2021-06-27 18:37:41  
##                                        3rd Qu.:2021-09-02 18:18:14  
##                                        Max.   :2021-10-31 23:59:49  
##     trip_end                   from_station_name  from_station_id   
##  Min.   :2020-11-01 00:02:20   Length:5378834     Length:5378834    
##  1st Qu.:2021-05-17 13:07:36   Class :character   Class :character  
##  Median :2021-07-13 22:57:23   Mode  :character   Mode  :character  
##  Mean   :2021-06-27 18:58:10                                        
##  3rd Qu.:2021-09-02 18:35:16                                        
##  Max.   :2021-11-03 21:45:48                                        
##  to_station_name    to_station_id      customer_type     
##  Length:5378834     Length:5378834     Length:5378834    
##  Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character  
##                                                          
##                                                          
##

skim(all_trips)

Data summary
Name	all_trips
Number of rows	5378834
Number of columns	9
_______________________
Column type frequency:
character	7
POSIXct	2
________________________
Group variables	None

Variable type: character

skim_variable	n_missing	complete_rate	min	max	n_unique
trip_id	0	1.00	16	16	5378625
ride_type	0	1.00	11	13	3
from_station_name	600479	0.89	3	53	814
from_station_id	600586	0.89	1	36	1303
to_station_name	646471	0.88	10	53	811
to_station_id	646548	0.88	1	36	1298
customer_type	0	1.00	6	6	2

Variable type: POSIXct

skim_variable	n_missing	complete_rate	min	max	median	n_unique
trip_start	0	1	2020-11-01 00:00:08	2021-10-31 23:59:49	2021-07-13 22:33:14	4487412
trip_end	0	1	2020-11-01 00:02:20	2021-11-03 21:45:48	2021-07-13 22:57:23	4479067

13. Add Necessary Columns.

Now I want to add columns that would enable us aggregate the data

# Now we add columns that list the date, month, day, and year of each the trip
# This will allow us to aggregate ride data for each month, day, or year 
#The default format of the date is yyyy-mm-dd

all_trips$date <- as.Date(all_trips$trip_start) 
all_trips$month <- format(as.Date(all_trips$date), "%m")
all_trips$day <- format(as.Date(all_trips$date), "%d")
all_trips$year <- format(as.Date(all_trips$date), "%Y")
all_trips$day_of_week <- format(as.Date(all_trips$date), "%A")

14. Add Ride Length,Convert the data type and Remove Bad Data.

I would then add the ride length calculation, convert the ride length data type and remove bad data using the skim function

#Now we Add a "ride_length" calculation to all_trips (in seconds) using difftime which calculates time intervals
all_trips$ride_length <- difftime(all_trips$trip_end,all_trips$trip_start)
str(all_trips$ride_length)

##  'difftime' num [1:5378834] 580 679 1741 555 ...
##  - attr(*, "units")= chr "secs"

#Now we convert ride_length from factor to numeric datatype
is.factor(all_trips$ride_length)

## [1] FALSE

all_trips$ride_length <- as.numeric(as.character(all_trips$ride_length))
is.numeric(all_trips$ride_length)

## [1] TRUE

#Now we want to remove bad data
#The data frame includes a few hundred entries when bikes were taken out of docks and checked for quality by Divvy or ride_length was negative
skim(all_trips$ride_length)

Data summary
Name	all_trips$ride_length
Number of rows	5378834
Number of columns	1
_______________________
Column type frequency:
numeric	1
________________________
Group variables	None

Variable type: numeric

skim_variable	n_missing	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
data	0	1	1229.37	17999.53	-1742998	418	743	1346	3356649	▁▇▁▁▁

15. New Data frame Excluding the negative ride_length values.

Creating a new data frame to exclude the negative ride length values

#Now we will create a new data frame v2 that will exclude ride_length that were negative
all_trips_v2 <- all_trips[!(all_trips$ride_length<0),]
head(all_trips_v2$ride_length)

## [1]  580  679 1741  555 2007 2961

16. Descriptive Analysis.

Now that we have cleaned, prepared data, we can move onto analysis. We take a look at our data frame using the aggregate and summary functions. The results were in seconds

#Step 4: Conduct Descriptive Analysis

#Now a descriptive analysis on ride_length
summary(all_trips_v2$ride_length)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       0     418     743    1350    1346 3356649

#Now we want to compare the members and causal riders
aggregate(all_trips_v2$ride_length ~ all_trips_v2$customer_type, FUN = mean)

##   all_trips_v2$customer_type all_trips_v2$ride_length
## 1                     casual                1953.4299
## 2                     member                 837.7915

aggregate(all_trips_v2$ride_length ~ all_trips_v2$customer_type, FUN = median)

##   all_trips_v2$customer_type all_trips_v2$ride_length
## 1                     casual                      982
## 2                     member                      595

aggregate(all_trips_v2$ride_length ~ all_trips_v2$customer_type, FUN = max)

##   all_trips_v2$customer_type all_trips_v2$ride_length
## 1                     casual                  3356649
## 2                     member                    93596

aggregate(all_trips_v2$ride_length ~ all_trips_v2$customer_type, FUN = min)

##   all_trips_v2$customer_type all_trips_v2$ride_length
## 1                     casual                        0
## 2                     member                        0

17. Fixing the day of the week.

We noticed that the days of the week are out of order. We use the order fundtion to fix that

# Now we want to see the average ride time by each day for members vs casual users
aggregate(all_trips_v2$ride_length ~ all_trips_v2$customer_type + all_trips_v2$day_of_week, FUN = mean)

##    all_trips_v2$customer_type all_trips_v2$day_of_week all_trips_v2$ride_length
## 1                      casual                   Friday                1854.5624
## 2                      member                   Friday                 819.7759
## 3                      casual                   Monday                1939.1773
## 4                      member                   Monday                 811.1557
## 5                      casual                 Saturday                2109.5236
## 6                      member                 Saturday                 933.4372
## 7                      casual                   Sunday                2286.3571
## 8                      member                   Sunday                 955.1888
## 9                      casual                 Thursday                1692.3220
## 10                     member                 Thursday                 786.1396
## 11                     casual                  Tuesday                1717.1894
## 12                     member                  Tuesday                 786.2936
## 13                     casual                Wednesday                1688.8956
## 14                     member                Wednesday                 789.9640

# We Notice that the days of the week are out of order. Let's fix that.
all_trips_v2$day_of_week <- ordered(all_trips_v2$day_of_week, levels=c("Sunday", "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday"))

#Now we re run the average ride time by each day for members vs causal riders
aggregate(all_trips_v2$ride_length ~ all_trips_v2$customer_type + all_trips_v2$day_of_week, FUN = mean)

##    all_trips_v2$customer_type all_trips_v2$day_of_week all_trips_v2$ride_length
## 1                      casual                   Sunday                2286.3571
## 2                      member                   Sunday                 955.1888
## 3                      casual                   Monday                1939.1773
## 4                      member                   Monday                 811.1557
## 5                      casual                  Tuesday                1717.1894
## 6                      member                  Tuesday                 786.2936
## 7                      casual                Wednesday                1688.8956
## 8                      member                Wednesday                 789.9640
## 9                      casual                 Thursday                1692.3220
## 10                     member                 Thursday                 786.1396
## 11                     casual                   Friday                1854.5624
## 12                     member                   Friday                 819.7759
## 13                     casual                 Saturday                2109.5236
## 14                     member                 Saturday                 933.4372

18. Adding a new Column.

We added a new column called weekday so we could analyse the ridership data by the type and weekday

# analyze ridership data by type and weekday.
#But we have no weekday column so we create it 
 all_trips_v2 %>% 
  mutate(weekday = wday(trip_start, label = TRUE)) %>%  #creates weekday field using wday()
  group_by(customer_type, weekday) %>%  #groups by customertype and weekday
  summarise(number_of_rides = n(),average_duration = mean(ride_length)) %>%                 
  arrange(customer_type, weekday)

## `summarise()` has grouped output by 'customer_type'. You can override using the `.groups` argument.

## # A tibble: 14 x 4
## # Groups:   customer_type [2]
##    customer_type weekday number_of_rides average_duration
##    <chr>         <ord>             <int>            <dbl>
##  1 casual        Sun              476138            2286.
##  2 casual        Mon              278252            1939.
##  3 casual        Tue              264254            1717.
##  4 casual        Wed              267472            1689.
##  5 casual        Thu              277319            1692.
##  6 casual        Fri              354969            1855.
##  7 casual        Sat              551814            2110.
##  8 member        Sun              368376             955.
##  9 member        Mon              391287             811.
## 10 member        Tue              431459             786.
## 11 member        Wed              444255             790.
## 12 member        Thu              425562             786.
## 13 member        Fri              425120             820.
## 14 member        Sat              421164             933.

#Now we try to get the total number of causal and member ridership and use that to get the approximate percentage of each member type

 all_trips %>%
   mutate(weekday = wday(trip_start, label = TRUE)) %>%
   group_by(customer_type) %>%
   summarise(number_of_rides = n()) %>%
   arrange(customer_type)

## # A tibble: 2 x 2
##   customer_type number_of_rides
##   <chr>                   <int>
## 1 casual                2470517
## 2 member                2908317

   #From the data frame, the causal number of rides = 2,470517 and member or annual number of rides = 2,908317. We pipe the result to sum function to give us the sum figure

19. Data Visualisation.

We visualized our data by showing how the length of the trip and the number of the trip compare by weekday and the membership status. We will first create the bar graph showing the total ride count per day in comparison to the membership status or type.

# Let's visualize the number of rides by rider type
all_trips_v2 %>% 
  mutate(weekday = wday(trip_start, label = TRUE)) %>% 
  group_by(customer_type, weekday) %>% 
  summarise(number_of_rides = n(),average_duration = mean(ride_length)) %>% 
  arrange(customer_type, weekday)  %>% 
  ggplot(aes(x = weekday, y = number_of_rides, fill = customer_type)) +
  geom_col(position = "dodge") +
  scale_fill_manual(values = c("casual" = "green", "member" = "blue")) +
 scale_y_continuous(labels = scales::label_number(suffix = " K", scale = 1e-3)) +
  labs(title="Number of trip rides per day in comparison to customer type", subtitle = "Causal members vs Annual members",caption=paste0("The data between December 2020 and November 2021"),x= "The Days of the Week", y= "The Total Number of Rides",fill= "Membership type")

## `summarise()` has grouped output by 'customer_type'. You can override using the `.groups` argument.

# 20. Second Data Visualisation.

We created the second bar plot showing the average duration of the trip in comparison to the causal and annual members

# Let's create a visualization for average duration
all_trips_v2 %>% 
  mutate(weekday = wday(trip_start, label = TRUE)) %>% 
  group_by(customer_type, weekday) %>% 
  summarise(number_of_rides = n(),average_duration = mean(ride_length)) %>% 
  arrange(customer_type, weekday)  %>% 
  ggplot(aes(x = weekday, y = average_duration, fill = customer_type)) +
  geom_col(position = "dodge") +
  scale_fill_manual(values = c("casual" = "forestgreen", 
                               "member" = "orange")) +
   scale_y_continuous(labels = scales::label_number(suffix="sec", scale = 10)) +
  labs(title="The average duration of trip rides",subtitles= "Member vs Casual",x="The Days of the Week",y="Average Ride Length in seconds",fill= "Membership type",caption=paste0("The data between December 2020 and November 2021"))

## `summarise()` has grouped output by 'customer_type'. You can override using the `.groups` argument.

Findings from the Analysis.

From the data analysis above, in step 18 we have a data frame that shows us the number of rides for each customer type. The result from that data frame showed the following * The casual number of rides was 2,470,517 * The member number of ride was 2,908,317 * This would make the sum of 5,378,834

We could take the percentage to know the percentage of each customer type.The percentage calculations would give us * Casual rider 46% * Member rider 54%

This reveals that member riders were slightly higher than casual riders.

Also, from the first visualization on step 19, we found out that members rode more on weekdays (Monday to Friday) when compared to casual riders. But during the weekend (Saturday and Sunday) the casual riders were at the peak level above the member riders.

From the second visualization on step 20, we found out that casual riders went on more rides than member riders. The causal riders rode for as much as twice when compared to how much member riders rode.

The above explanations has helped in answering the key questions in trying to understand how casual riders and member or annual riders used the Cyclistic bike differently.

My recommendation.

1. Use of less expensive materials for casual riders

From the analysis, we can see that casual riders ride more than the annual members. So it would be of financial benefit to design the causal members bikes with less expensive materials that would be equivalent to the savings that would be made from an annual member in a month

2. Engaging the casual riders

We have found out from the analysis that the casual riders go on rides the most during the weekends. The marketing team can set up a of different people that would be at different start stations to meet with the causal riders and offer incentives to try to convert them to annual riders. For, example each causal rider could be offered a 15% slash bonus to register as an annual member. This could be done for a month and afterwards another analysis could be dope to see how many casual riders have converted to annual members.

3. Satisfying the Annual riders

In convincing the causal riders to convert, we should not forget the already existing annual members. Together with the previous step, a package can be given to the annual riders. For example, giving 40% off the monthly payment to an annual rider who brings in a friend or even an already existing casual rider to register as a member with the Cyclistic company.

4. Find the traffic congested routes.

Since the aim of this analysis is to convince causal riders to convert to annual members. The company could find the data on the routes with the most traffic and during the engagement with the casual riders on the weekends (as discussed in the second point),the casual riders who don’t ride on weekdays could be convinced to switch to annual membership if riding a bike along those routes would save time and stress from the traffic.

Bike Ride Data Analysis

Idowu, Adetola

December 23, 2021