Processing

The processing stage involves cleaning the data after confirming that it is from a good source and it satisfies the necessary conditions.It also involves confirming that the data is complete, correct and relevant. I chose R studios as the tool to use for this phase of the analysis process. I started by installing and then loading the relevant packages.

library(tidyverse)
Warning: package ‘tidyverse’ was built under R version 4.2.1
Registered S3 methods overwritten by 'dbplyr':
  method         from
  print.tbl_lazy     
  print.tbl_sql      
── Attaching packages ────────────────────────────────────────────────── tidyverse 1.3.1 ──
✔ ggplot2 3.3.6     ✔ purrr   0.3.4
✔ tibble  3.1.7     ✔ dplyr   1.0.9
✔ tidyr   1.2.0     ✔ stringr 1.4.0
✔ readr   2.1.2     ✔ forcats 0.5.1
Warning: package ‘ggplot2’ was built under R version 4.2.1
Warning: package ‘tibble’ was built under R version 4.2.1
Warning: package ‘tidyr’ was built under R version 4.2.1
Warning: package ‘readr’ was built under R version 4.2.1
Warning: package ‘purrr’ was built under R version 4.2.1
Warning: package ‘dplyr’ was built under R version 4.2.1
Warning: package ‘stringr’ was built under R version 4.2.1
Warning: package ‘forcats’ was built under R version 4.2.1
── Conflicts ───────────────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
library(lubridate)
Warning: package ‘lubridate’ was built under R version 4.2.1

Attaching package: ‘lubridate’

The following objects are masked from ‘package:base’:

    date, intersect, setdiff, union
library(janitor)
Warning: package ‘janitor’ was built under R version 4.2.1

Attaching package: ‘janitor’

The following objects are masked from ‘package:stats’:

    chisq.test, fisher.test
library(skimr)
Warning: package ‘skimr’ was built under R version 4.2.1

Loading datasets into the environment with the correct naming conventions.

June_2021 <- read_csv("202106-divvy-tripdata.csv")
Rows: 729595 Columns: 13
── Column specification ───────────────────────────────────────────────────────────────────
Delimiter: ","
chr  (7): ride_id, rideable_type, start_station_name, start_station_id, end_station_nam...
dbl  (4): start_lat, start_lng, end_lat, end_lng
dttm (2): started_at, ended_at

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
July_2021 <- read_csv("202107-divvy-tripdata.csv")
Rows: 822410 Columns: 13
── Column specification ───────────────────────────────────────────────────────────────────
Delimiter: ","
chr  (7): ride_id, rideable_type, start_station_name, start_station_id, end_station_nam...
dbl  (4): start_lat, start_lng, end_lat, end_lng
dttm (2): started_at, ended_at

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
August_2021 <- read_csv("202108-divvy-tripdata.csv")
Rows: 804352 Columns: 13
── Column specification ───────────────────────────────────────────────────────────────────
Delimiter: ","
chr  (7): ride_id, rideable_type, start_station_name, start_station_id, end_station_nam...
dbl  (4): start_lat, start_lng, end_lat, end_lng
dttm (2): started_at, ended_at

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
September_2021 <- read_csv("202109-divvy-tripdata.csv")
Rows: 756147 Columns: 13
── Column specification ───────────────────────────────────────────────────────────────────
Delimiter: ","
chr  (7): ride_id, rideable_type, start_station_name, start_station_id, end_station_nam...
dbl  (4): start_lat, start_lng, end_lat, end_lng
dttm (2): started_at, ended_at

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
October_2021 <- read_csv("202110-divvy-tripdata.csv")
Rows: 631226 Columns: 13
── Column specification ───────────────────────────────────────────────────────────────────
Delimiter: ","
chr  (7): ride_id, rideable_type, start_station_name, start_station_id, end_station_nam...
dbl  (4): start_lat, start_lng, end_lat, end_lng
dttm (2): started_at, ended_at

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
November_2021 <- read_csv("202111-divvy-tripdata.csv")
Rows: 359978 Columns: 13
── Column specification ───────────────────────────────────────────────────────────────────
Delimiter: ","
chr  (7): ride_id, rideable_type, start_station_name, start_station_id, end_station_nam...
dbl  (4): start_lat, start_lng, end_lat, end_lng
dttm (2): started_at, ended_at

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
December_2021 <- read_csv("202112-divvy-tripdata.csv")
Rows: 247540 Columns: 13
── Column specification ───────────────────────────────────────────────────────────────────
Delimiter: ","
chr  (7): ride_id, rideable_type, start_station_name, start_station_id, end_station_nam...
dbl  (4): start_lat, start_lng, end_lat, end_lng
dttm (2): started_at, ended_at

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
January_2022 <- read_csv("202201-divvy-tripdata.csv")
Rows: 103770 Columns: 13
── Column specification ───────────────────────────────────────────────────────────────────
Delimiter: ","
chr  (7): ride_id, rideable_type, start_station_name, start_station_id, end_station_nam...
dbl  (4): start_lat, start_lng, end_lat, end_lng
dttm (2): started_at, ended_at

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
February_2022 <- read_csv("202202-divvy-tripdata.csv")
Rows: 115609 Columns: 13
── Column specification ───────────────────────────────────────────────────────────────────
Delimiter: ","
chr  (7): ride_id, rideable_type, start_station_name, start_station_id, end_station_nam...
dbl  (4): start_lat, start_lng, end_lat, end_lng
dttm (2): started_at, ended_at

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
March_2022 <- read_csv("202203-divvy-tripdata.csv")
Rows: 284042 Columns: 13
── Column specification ───────────────────────────────────────────────────────────────────
Delimiter: ","
chr  (7): ride_id, rideable_type, start_station_name, start_station_id, end_station_nam...
dbl  (4): start_lat, start_lng, end_lat, end_lng
dttm (2): started_at, ended_at

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
April_2022 <- read_csv("202204-divvy-tripdata.csv")
Rows: 371249 Columns: 13
── Column specification ───────────────────────────────────────────────────────────────────
Delimiter: ","
chr  (7): ride_id, rideable_type, start_station_name, start_station_id, end_station_nam...
dbl  (4): start_lat, start_lng, end_lat, end_lng
dttm (2): started_at, ended_at

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
May_2022 <- read_csv("202205-divvy-tripdata.csv")
Rows: 634858 Columns: 13
── Column specification ───────────────────────────────────────────────────────────────────
Delimiter: ","
chr  (7): ride_id, rideable_type, start_station_name, start_station_id, end_station_nam...
dbl  (4): start_lat, start_lng, end_lat, end_lng
dttm (2): started_at, ended_at

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

The next step would ideally be to combine all of the data sets into one single data frame. But first, I confirmed that the data sets include the same column names.

colnames(June_2021)
 [1] "ride_id"            "rideable_type"      "started_at"        
 [4] "ended_at"           "start_station_name" "start_station_id"  
 [7] "end_station_name"   "end_station_id"     "start_lat"         
[10] "start_lng"          "end_lat"            "end_lng"           
[13] "member_casual"     
colnames(July_2021)
 [1] "ride_id"            "rideable_type"      "started_at"        
 [4] "ended_at"           "start_station_name" "start_station_id"  
 [7] "end_station_name"   "end_station_id"     "start_lat"         
[10] "start_lng"          "end_lat"            "end_lng"           
[13] "member_casual"     
colnames(August_2021)
 [1] "ride_id"            "rideable_type"      "started_at"        
 [4] "ended_at"           "start_station_name" "start_station_id"  
 [7] "end_station_name"   "end_station_id"     "start_lat"         
[10] "start_lng"          "end_lat"            "end_lng"           
[13] "member_casual"     
colnames(September_2021)
 [1] "ride_id"            "rideable_type"      "started_at"        
 [4] "ended_at"           "start_station_name" "start_station_id"  
 [7] "end_station_name"   "end_station_id"     "start_lat"         
[10] "start_lng"          "end_lat"            "end_lng"           
[13] "member_casual"     
colnames(October_2021)
 [1] "ride_id"            "rideable_type"      "started_at"        
 [4] "ended_at"           "start_station_name" "start_station_id"  
 [7] "end_station_name"   "end_station_id"     "start_lat"         
[10] "start_lng"          "end_lat"            "end_lng"           
[13] "member_casual"     
colnames(November_2021)
 [1] "ride_id"            "rideable_type"      "started_at"        
 [4] "ended_at"           "start_station_name" "start_station_id"  
 [7] "end_station_name"   "end_station_id"     "start_lat"         
[10] "start_lng"          "end_lat"            "end_lng"           
[13] "member_casual"     
colnames(December_2021)
 [1] "ride_id"            "rideable_type"      "started_at"        
 [4] "ended_at"           "start_station_name" "start_station_id"  
 [7] "end_station_name"   "end_station_id"     "start_lat"         
[10] "start_lng"          "end_lat"            "end_lng"           
[13] "member_casual"     
colnames(January_2022)
 [1] "ride_id"            "rideable_type"      "started_at"        
 [4] "ended_at"           "start_station_name" "start_station_id"  
 [7] "end_station_name"   "end_station_id"     "start_lat"         
[10] "start_lng"          "end_lat"            "end_lng"           
[13] "member_casual"     
colnames(February_2022)
 [1] "ride_id"            "rideable_type"      "started_at"        
 [4] "ended_at"           "start_station_name" "start_station_id"  
 [7] "end_station_name"   "end_station_id"     "start_lat"         
[10] "start_lng"          "end_lat"            "end_lng"           
[13] "member_casual"     
colnames(March_2022)
 [1] "ride_id"            "rideable_type"      "started_at"        
 [4] "ended_at"           "start_station_name" "start_station_id"  
 [7] "end_station_name"   "end_station_id"     "start_lat"         
[10] "start_lng"          "end_lat"            "end_lng"           
[13] "member_casual"     
colnames(April_2022)
 [1] "ride_id"            "rideable_type"      "started_at"        
 [4] "ended_at"           "start_station_name" "start_station_id"  
 [7] "end_station_name"   "end_station_id"     "start_lat"         
[10] "start_lng"          "end_lat"            "end_lng"           
[13] "member_casual"     
colnames(May_2022)
 [1] "ride_id"            "rideable_type"      "started_at"        
 [4] "ended_at"           "start_station_name" "start_station_id"  
 [7] "end_station_name"   "end_station_id"     "start_lat"         
[10] "start_lng"          "end_lat"            "end_lng"           
[13] "member_casual"     

Checking the overall structure of each data set and if they have the same data types.

str(June_2021)
spec_tbl_df [729,595 × 13] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
 $ ride_id           : chr [1:729595] "99FEC93BA843FB20" "06048DCFC8520CAF" "9598066F68045DF2" "B03C0FE48C412214" ...
 $ rideable_type     : chr [1:729595] "electric_bike" "electric_bike" "electric_bike" "electric_bike" ...
 $ started_at        : POSIXct[1:729595], format: "2021-06-13 14:31:28" "2021-06-04 11:18:02" ...
 $ ended_at          : POSIXct[1:729595], format: "2021-06-13 14:34:11" "2021-06-04 11:24:19" ...
 $ start_station_name: chr [1:729595] NA NA NA NA ...
 $ start_station_id  : chr [1:729595] NA NA NA NA ...
 $ end_station_name  : chr [1:729595] NA NA NA NA ...
 $ end_station_id    : chr [1:729595] NA NA NA NA ...
 $ start_lat         : num [1:729595] 41.8 41.8 41.8 41.8 41.8 ...
 $ start_lng         : num [1:729595] -87.6 -87.6 -87.6 -87.6 -87.6 ...
 $ end_lat           : num [1:729595] 41.8 41.8 41.8 41.8 41.8 ...
 $ end_lng           : num [1:729595] -87.6 -87.6 -87.6 -87.6 -87.6 ...
 $ member_casual     : chr [1:729595] "member" "member" "member" "member" ...
 - attr(*, "spec")=
  .. cols(
  ..   ride_id = col_character(),
  ..   rideable_type = col_character(),
  ..   started_at = col_datetime(format = ""),
  ..   ended_at = col_datetime(format = ""),
  ..   start_station_name = col_character(),
  ..   start_station_id = col_character(),
  ..   end_station_name = col_character(),
  ..   end_station_id = col_character(),
  ..   start_lat = col_double(),
  ..   start_lng = col_double(),
  ..   end_lat = col_double(),
  ..   end_lng = col_double(),
  ..   member_casual = col_character()
  .. )
 - attr(*, "problems")=<externalptr> 
str(July_2021)
spec_tbl_df [822,410 × 13] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
 $ ride_id           : chr [1:822410] "0A1B623926EF4E16" "B2D5583A5A5E76EE" "6F264597DDBF427A" "379B58EAB20E8AA5" ...
 $ rideable_type     : chr [1:822410] "docked_bike" "classic_bike" "classic_bike" "classic_bike" ...
 $ started_at        : POSIXct[1:822410], format: "2021-07-02 14:44:36" "2021-07-07 16:57:42" ...
 $ ended_at          : POSIXct[1:822410], format: "2021-07-02 15:19:58" "2021-07-07 17:16:09" ...
 $ start_station_name: chr [1:822410] "Michigan Ave & Washington St" "California Ave & Cortez St" "Wabash Ave & 16th St" "California Ave & Cortez St" ...
 $ start_station_id  : chr [1:822410] "13001" "17660" "SL-012" "17660" ...
 $ end_station_name  : chr [1:822410] "Halsted St & North Branch St" "Wood St & Hubbard St" "Rush St & Hubbard St" "Carpenter St & Huron St" ...
 $ end_station_id    : chr [1:822410] "KA1504000117" "13432" "KA1503000044" "13196" ...
 $ start_lat         : num [1:822410] 41.9 41.9 41.9 41.9 41.9 ...
 $ start_lng         : num [1:822410] -87.6 -87.7 -87.6 -87.7 -87.7 ...
 $ end_lat           : num [1:822410] 41.9 41.9 41.9 41.9 41.9 ...
 $ end_lng           : num [1:822410] -87.6 -87.7 -87.6 -87.7 -87.7 ...
 $ member_casual     : chr [1:822410] "casual" "casual" "member" "member" ...
 - attr(*, "spec")=
  .. cols(
  ..   ride_id = col_character(),
  ..   rideable_type = col_character(),
  ..   started_at = col_datetime(format = ""),
  ..   ended_at = col_datetime(format = ""),
  ..   start_station_name = col_character(),
  ..   start_station_id = col_character(),
  ..   end_station_name = col_character(),
  ..   end_station_id = col_character(),
  ..   start_lat = col_double(),
  ..   start_lng = col_double(),
  ..   end_lat = col_double(),
  ..   end_lng = col_double(),
  ..   member_casual = col_character()
  .. )
 - attr(*, "problems")=<externalptr> 
str(August_2021)
spec_tbl_df [804,352 × 13] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
 $ ride_id           : chr [1:804352] "99103BB87CC6C1BB" "EAFCCCFB0A3FC5A1" "9EF4F46C57AD234D" "5834D3208BFAF1DA" ...
 $ rideable_type     : chr [1:804352] "electric_bike" "electric_bike" "electric_bike" "electric_bike" ...
 $ started_at        : POSIXct[1:804352], format: "2021-08-10 17:15:49" "2021-08-10 17:23:14" ...
 $ ended_at          : POSIXct[1:804352], format: "2021-08-10 17:22:44" "2021-08-10 17:39:24" ...
 $ start_station_name: chr [1:804352] NA NA NA NA ...
 $ start_station_id  : chr [1:804352] NA NA NA NA ...
 $ end_station_name  : chr [1:804352] NA NA NA NA ...
 $ end_station_id    : chr [1:804352] NA NA NA NA ...
 $ start_lat         : num [1:804352] 41.8 41.8 42 42 41.8 ...
 $ start_lng         : num [1:804352] -87.7 -87.7 -87.7 -87.7 -87.6 ...
 $ end_lat           : num [1:804352] 41.8 41.8 42 42 41.8 ...
 $ end_lng           : num [1:804352] -87.7 -87.6 -87.7 -87.7 -87.6 ...
 $ member_casual     : chr [1:804352] "member" "member" "member" "member" ...
 - attr(*, "spec")=
  .. cols(
  ..   ride_id = col_character(),
  ..   rideable_type = col_character(),
  ..   started_at = col_datetime(format = ""),
  ..   ended_at = col_datetime(format = ""),
  ..   start_station_name = col_character(),
  ..   start_station_id = col_character(),
  ..   end_station_name = col_character(),
  ..   end_station_id = col_character(),
  ..   start_lat = col_double(),
  ..   start_lng = col_double(),
  ..   end_lat = col_double(),
  ..   end_lng = col_double(),
  ..   member_casual = col_character()
  .. )
 - attr(*, "problems")=<externalptr> 
str(September_2021)
spec_tbl_df [756,147 × 13] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
 $ ride_id           : chr [1:756147] "9DC7B962304CBFD8" "F930E2C6872D6B32" "6EF72137900BB910" "78D1DE133B3DBF55" ...
 $ rideable_type     : chr [1:756147] "electric_bike" "electric_bike" "electric_bike" "electric_bike" ...
 $ started_at        : POSIXct[1:756147], format: "2021-09-28 16:07:10" "2021-09-28 14:24:51" ...
 $ ended_at          : POSIXct[1:756147], format: "2021-09-28 16:09:54" "2021-09-28 14:40:05" ...
 $ start_station_name: chr [1:756147] NA NA NA NA ...
 $ start_station_id  : chr [1:756147] NA NA NA NA ...
 $ end_station_name  : chr [1:756147] NA NA NA NA ...
 $ end_station_id    : chr [1:756147] NA NA NA NA ...
 $ start_lat         : num [1:756147] 41.9 41.9 41.8 41.8 41.9 ...
 $ start_lng         : num [1:756147] -87.7 -87.6 -87.7 -87.7 -87.7 ...
 $ end_lat           : num [1:756147] 41.9 42 41.8 41.8 41.9 ...
 $ end_lng           : num [1:756147] -87.7 -87.7 -87.7 -87.7 -87.7 ...
 $ member_casual     : chr [1:756147] "casual" "casual" "casual" "casual" ...
 - attr(*, "spec")=
  .. cols(
  ..   ride_id = col_character(),
  ..   rideable_type = col_character(),
  ..   started_at = col_datetime(format = ""),
  ..   ended_at = col_datetime(format = ""),
  ..   start_station_name = col_character(),
  ..   start_station_id = col_character(),
  ..   end_station_name = col_character(),
  ..   end_station_id = col_character(),
  ..   start_lat = col_double(),
  ..   start_lng = col_double(),
  ..   end_lat = col_double(),
  ..   end_lng = col_double(),
  ..   member_casual = col_character()
  .. )
 - attr(*, "problems")=<externalptr> 
str(October_2021)
spec_tbl_df [631,226 × 13] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
 $ ride_id           : chr [1:631226] "620BC6107255BF4C" "4471C70731AB2E45" "26CA69D43D15EE14" "362947F0437E1514" ...
 $ rideable_type     : chr [1:631226] "electric_bike" "electric_bike" "electric_bike" "electric_bike" ...
 $ started_at        : POSIXct[1:631226], format: "2021-10-22 12:46:42" "2021-10-21 09:12:37" ...
 $ ended_at          : POSIXct[1:631226], format: "2021-10-22 12:49:50" "2021-10-21 09:14:14" ...
 $ start_station_name: chr [1:631226] "Kingsbury St & Kinzie St" NA NA NA ...
 $ start_station_id  : chr [1:631226] "KA1503000043" NA NA NA ...
 $ end_station_name  : chr [1:631226] NA NA NA NA ...
 $ end_station_id    : chr [1:631226] NA NA NA NA ...
 $ start_lat         : num [1:631226] 41.9 41.9 41.9 41.9 41.9 ...
 $ start_lng         : num [1:631226] -87.6 -87.7 -87.7 -87.7 -87.7 ...
 $ end_lat           : num [1:631226] 41.9 41.9 41.9 41.9 41.9 ...
 $ end_lng           : num [1:631226] -87.6 -87.7 -87.7 -87.7 -87.7 ...
 $ member_casual     : chr [1:631226] "member" "member" "member" "member" ...
 - attr(*, "spec")=
  .. cols(
  ..   ride_id = col_character(),
  ..   rideable_type = col_character(),
  ..   started_at = col_datetime(format = ""),
  ..   ended_at = col_datetime(format = ""),
  ..   start_station_name = col_character(),
  ..   start_station_id = col_character(),
  ..   end_station_name = col_character(),
  ..   end_station_id = col_character(),
  ..   start_lat = col_double(),
  ..   start_lng = col_double(),
  ..   end_lat = col_double(),
  ..   end_lng = col_double(),
  ..   member_casual = col_character()
  .. )
 - attr(*, "problems")=<externalptr> 
str(November_2021)
spec_tbl_df [359,978 × 13] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
 $ ride_id           : chr [1:359978] "7C00A93E10556E47" "90854840DFD508BA" "0A7D10CDD144061C" "2F3BE33085BCFF02" ...
 $ rideable_type     : chr [1:359978] "electric_bike" "electric_bike" "electric_bike" "electric_bike" ...
 $ started_at        : POSIXct[1:359978], format: "2021-11-27 13:27:38" "2021-11-27 13:38:25" ...
 $ ended_at          : POSIXct[1:359978], format: "2021-11-27 13:46:38" "2021-11-27 13:56:10" ...
 $ start_station_name: chr [1:359978] NA NA NA NA ...
 $ start_station_id  : chr [1:359978] NA NA NA NA ...
 $ end_station_name  : chr [1:359978] NA NA NA NA ...
 $ end_station_id    : chr [1:359978] NA NA NA NA ...
 $ start_lat         : num [1:359978] 41.9 42 42 41.9 41.9 ...
 $ start_lng         : num [1:359978] -87.7 -87.7 -87.7 -87.8 -87.6 ...
 $ end_lat           : num [1:359978] 42 41.9 42 41.9 41.9 ...
 $ end_lng           : num [1:359978] -87.7 -87.7 -87.7 -87.8 -87.6 ...
 $ member_casual     : chr [1:359978] "casual" "casual" "casual" "casual" ...
 - attr(*, "spec")=
  .. cols(
  ..   ride_id = col_character(),
  ..   rideable_type = col_character(),
  ..   started_at = col_datetime(format = ""),
  ..   ended_at = col_datetime(format = ""),
  ..   start_station_name = col_character(),
  ..   start_station_id = col_character(),
  ..   end_station_name = col_character(),
  ..   end_station_id = col_character(),
  ..   start_lat = col_double(),
  ..   start_lng = col_double(),
  ..   end_lat = col_double(),
  ..   end_lng = col_double(),
  ..   member_casual = col_character()
  .. )
 - attr(*, "problems")=<externalptr> 
str(December_2021)
spec_tbl_df [247,540 × 13] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
 $ ride_id           : chr [1:247540] "46F8167220E4431F" "73A77762838B32FD" "4CF42452054F59C5" "3278BA87BF698339" ...
 $ rideable_type     : chr [1:247540] "electric_bike" "electric_bike" "electric_bike" "classic_bike" ...
 $ started_at        : POSIXct[1:247540], format: "2021-12-07 15:06:07" "2021-12-11 03:43:29" ...
 $ ended_at          : POSIXct[1:247540], format: "2021-12-07 15:13:42" "2021-12-11 04:10:23" ...
 $ start_station_name: chr [1:247540] "Laflin St & Cullerton St" "LaSalle Dr & Huron St" "Halsted St & North Branch St" "Halsted St & North Branch St" ...
 $ start_station_id  : chr [1:247540] "13307" "KP1705001026" "KA1504000117" "KA1504000117" ...
 $ end_station_name  : chr [1:247540] "Morgan St & Polk St" "Clarendon Ave & Leland Ave" "Broadway & Barry Ave" "LaSalle Dr & Huron St" ...
 $ end_station_id    : chr [1:247540] "TA1307000130" "TA1307000119" "13137" "KP1705001026" ...
 $ start_lat         : num [1:247540] 41.9 41.9 41.9 41.9 41.9 ...
 $ start_lng         : num [1:247540] -87.7 -87.6 -87.6 -87.6 -87.7 ...
 $ end_lat           : num [1:247540] 41.9 42 41.9 41.9 41.9 ...
 $ end_lng           : num [1:247540] -87.7 -87.7 -87.6 -87.6 -87.6 ...
 $ member_casual     : chr [1:247540] "member" "casual" "member" "member" ...
 - attr(*, "spec")=
  .. cols(
  ..   ride_id = col_character(),
  ..   rideable_type = col_character(),
  ..   started_at = col_datetime(format = ""),
  ..   ended_at = col_datetime(format = ""),
  ..   start_station_name = col_character(),
  ..   start_station_id = col_character(),
  ..   end_station_name = col_character(),
  ..   end_station_id = col_character(),
  ..   start_lat = col_double(),
  ..   start_lng = col_double(),
  ..   end_lat = col_double(),
  ..   end_lng = col_double(),
  ..   member_casual = col_character()
  .. )
 - attr(*, "problems")=<externalptr> 
str(January_2022)
spec_tbl_df [103,770 × 13] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
 $ ride_id           : chr [1:103770] "C2F7DD78E82EC875" "A6CF8980A652D272" "BD0F91DFF741C66D" "CBB80ED419105406" ...
 $ rideable_type     : chr [1:103770] "electric_bike" "electric_bike" "classic_bike" "classic_bike" ...
 $ started_at        : POSIXct[1:103770], format: "2022-01-13 11:59:47" "2022-01-10 08:41:56" ...
 $ ended_at          : POSIXct[1:103770], format: "2022-01-13 12:02:44" "2022-01-10 08:46:17" ...
 $ start_station_name: chr [1:103770] "Glenwood Ave & Touhy Ave" "Glenwood Ave & Touhy Ave" "Sheffield Ave & Fullerton Ave" "Clark St & Bryn Mawr Ave" ...
 $ start_station_id  : chr [1:103770] "525" "525" "TA1306000016" "KA1504000151" ...
 $ end_station_name  : chr [1:103770] "Clark St & Touhy Ave" "Clark St & Touhy Ave" "Greenview Ave & Fullerton Ave" "Paulina St & Montrose Ave" ...
 $ end_station_id    : chr [1:103770] "RP-007" "RP-007" "TA1307000001" "TA1309000021" ...
 $ start_lat         : num [1:103770] 42 42 41.9 42 41.9 ...
 $ start_lng         : num [1:103770] -87.7 -87.7 -87.7 -87.7 -87.6 ...
 $ end_lat           : num [1:103770] 42 42 41.9 42 41.9 ...
 $ end_lng           : num [1:103770] -87.7 -87.7 -87.7 -87.7 -87.6 ...
 $ member_casual     : chr [1:103770] "casual" "casual" "member" "casual" ...
 - attr(*, "spec")=
  .. cols(
  ..   ride_id = col_character(),
  ..   rideable_type = col_character(),
  ..   started_at = col_datetime(format = ""),
  ..   ended_at = col_datetime(format = ""),
  ..   start_station_name = col_character(),
  ..   start_station_id = col_character(),
  ..   end_station_name = col_character(),
  ..   end_station_id = col_character(),
  ..   start_lat = col_double(),
  ..   start_lng = col_double(),
  ..   end_lat = col_double(),
  ..   end_lng = col_double(),
  ..   member_casual = col_character()
  .. )
 - attr(*, "problems")=<externalptr> 
str(February_2022)
spec_tbl_df [115,609 × 13] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
 $ ride_id           : chr [1:115609] "E1E065E7ED285C02" "1602DCDC5B30FFE3" "BE7DD2AF4B55C4AF" "A1789BDF844412BE" ...
 $ rideable_type     : chr [1:115609] "classic_bike" "classic_bike" "classic_bike" "classic_bike" ...
 $ started_at        : POSIXct[1:115609], format: "2022-02-19 18:08:41" "2022-02-20 17:41:30" ...
 $ ended_at          : POSIXct[1:115609], format: "2022-02-19 18:23:56" "2022-02-20 17:45:56" ...
 $ start_station_name: chr [1:115609] "State St & Randolph St" "Halsted St & Wrightwood Ave" "State St & Randolph St" "Southport Ave & Waveland Ave" ...
 $ start_station_id  : chr [1:115609] "TA1305000029" "TA1309000061" "TA1305000029" "13235" ...
 $ end_station_name  : chr [1:115609] "Clark St & Lincoln Ave" "Southport Ave & Wrightwood Ave" "Canal St & Adams St" "Broadway & Sheridan Rd" ...
 $ end_station_id    : chr [1:115609] "13179" "TA1307000113" "13011" "13323" ...
 $ start_lat         : num [1:115609] 41.9 41.9 41.9 41.9 41.9 ...
 $ start_lng         : num [1:115609] -87.6 -87.6 -87.6 -87.7 -87.6 ...
 $ end_lat           : num [1:115609] 41.9 41.9 41.9 42 41.9 ...
 $ end_lng           : num [1:115609] -87.6 -87.7 -87.6 -87.6 -87.6 ...
 $ member_casual     : chr [1:115609] "member" "member" "member" "member" ...
 - attr(*, "spec")=
  .. cols(
  ..   ride_id = col_character(),
  ..   rideable_type = col_character(),
  ..   started_at = col_datetime(format = ""),
  ..   ended_at = col_datetime(format = ""),
  ..   start_station_name = col_character(),
  ..   start_station_id = col_character(),
  ..   end_station_name = col_character(),
  ..   end_station_id = col_character(),
  ..   start_lat = col_double(),
  ..   start_lng = col_double(),
  ..   end_lat = col_double(),
  ..   end_lng = col_double(),
  ..   member_casual = col_character()
  .. )
 - attr(*, "problems")=<externalptr> 
str(March_2022)
spec_tbl_df [284,042 × 13] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
 $ ride_id           : chr [1:284042] "47EC0A7F82E65D52" "8494861979B0F477" "EFE527AF80B66109" "9F446FD9DEE3F389" ...
 $ rideable_type     : chr [1:284042] "classic_bike" "electric_bike" "classic_bike" "classic_bike" ...
 $ started_at        : POSIXct[1:284042], format: "2022-03-21 13:45:01" "2022-03-16 09:37:16" ...
 $ ended_at          : POSIXct[1:284042], format: "2022-03-21 13:51:18" "2022-03-16 09:43:34" ...
 $ start_station_name: chr [1:284042] "Wabash Ave & Wacker Pl" "Michigan Ave & Oak St" "Broadway & Berwyn Ave" "Wabash Ave & Wacker Pl" ...
 $ start_station_id  : chr [1:284042] "TA1307000131" "13042" "13109" "TA1307000131" ...
 $ end_station_name  : chr [1:284042] "Kingsbury St & Kinzie St" "Orleans St & Chestnut St (NEXT Apts)" "Broadway & Ridge Ave" "Franklin St & Jackson Blvd" ...
 $ end_station_id    : chr [1:284042] "KA1503000043" "620" "15578" "TA1305000025" ...
 $ start_lat         : num [1:284042] 41.9 41.9 42 41.9 41.9 ...
 $ start_lng         : num [1:284042] -87.6 -87.6 -87.7 -87.6 -87.6 ...
 $ end_lat           : num [1:284042] 41.9 41.9 42 41.9 41.9 ...
 $ end_lng           : num [1:284042] -87.6 -87.6 -87.7 -87.6 -87.7 ...
 $ member_casual     : chr [1:284042] "member" "member" "member" "member" ...
 - attr(*, "spec")=
  .. cols(
  ..   ride_id = col_character(),
  ..   rideable_type = col_character(),
  ..   started_at = col_datetime(format = ""),
  ..   ended_at = col_datetime(format = ""),
  ..   start_station_name = col_character(),
  ..   start_station_id = col_character(),
  ..   end_station_name = col_character(),
  ..   end_station_id = col_character(),
  ..   start_lat = col_double(),
  ..   start_lng = col_double(),
  ..   end_lat = col_double(),
  ..   end_lng = col_double(),
  ..   member_casual = col_character()
  .. )
 - attr(*, "problems")=<externalptr> 
str(April_2022)
spec_tbl_df [371,249 × 13] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
 $ ride_id           : chr [1:371249] "3564070EEFD12711" "0B820C7FCF22F489" "89EEEE32293F07FF" "84D4751AEB31888D" ...
 $ rideable_type     : chr [1:371249] "electric_bike" "classic_bike" "classic_bike" "classic_bike" ...
 $ started_at        : POSIXct[1:371249], format: "2022-04-06 17:42:48" "2022-04-24 19:23:07" ...
 $ ended_at          : POSIXct[1:371249], format: "2022-04-06 17:54:36" "2022-04-24 19:43:17" ...
 $ start_station_name: chr [1:371249] "Paulina St & Howard St" "Wentworth Ave & Cermak Rd" "Halsted St & Polk St" "Wentworth Ave & Cermak Rd" ...
 $ start_station_id  : chr [1:371249] "515" "13075" "TA1307000121" "13075" ...
 $ end_station_name  : chr [1:371249] "University Library (NU)" "Green St & Madison St" "Green St & Madison St" "Delano Ct & Roosevelt Rd" ...
 $ end_station_id    : chr [1:371249] "605" "TA1307000120" "TA1307000120" "KA1706005007" ...
 $ start_lat         : num [1:371249] 42 41.9 41.9 41.9 41.9 ...
 $ start_lng         : num [1:371249] -87.7 -87.6 -87.6 -87.6 -87.6 ...
 $ end_lat           : num [1:371249] 42.1 41.9 41.9 41.9 41.9 ...
 $ end_lng           : num [1:371249] -87.7 -87.6 -87.6 -87.6 -87.6 ...
 $ member_casual     : chr [1:371249] "member" "member" "member" "casual" ...
 - attr(*, "spec")=
  .. cols(
  ..   ride_id = col_character(),
  ..   rideable_type = col_character(),
  ..   started_at = col_datetime(format = ""),
  ..   ended_at = col_datetime(format = ""),
  ..   start_station_name = col_character(),
  ..   start_station_id = col_character(),
  ..   end_station_name = col_character(),
  ..   end_station_id = col_character(),
  ..   start_lat = col_double(),
  ..   start_lng = col_double(),
  ..   end_lat = col_double(),
  ..   end_lng = col_double(),
  ..   member_casual = col_character()
  .. )
 - attr(*, "problems")=<externalptr> 
str(May_2022)
spec_tbl_df [634,858 × 13] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
 $ ride_id           : chr [1:634858] "EC2DE40644C6B0F4" "1C31AD03897EE385" "1542FBEC830415CF" "6FF59852924528F8" ...
 $ rideable_type     : chr [1:634858] "classic_bike" "classic_bike" "classic_bike" "classic_bike" ...
 $ started_at        : POSIXct[1:634858], format: "2022-05-23 23:06:58" "2022-05-11 08:53:28" ...
 $ ended_at          : POSIXct[1:634858], format: "2022-05-23 23:40:19" "2022-05-11 09:31:22" ...
 $ start_station_name: chr [1:634858] "Wabash Ave & Grand Ave" "DuSable Lake Shore Dr & Monroe St" "Clinton St & Madison St" "Clinton St & Madison St" ...
 $ start_station_id  : chr [1:634858] "TA1307000117" "13300" "TA1305000032" "TA1305000032" ...
 $ end_station_name  : chr [1:634858] "Halsted St & Roscoe St" "Field Blvd & South Water St" "Wood St & Milwaukee Ave" "Clark St & Randolph St" ...
 $ end_station_id    : chr [1:634858] "TA1309000025" "15534" "13221" "TA1305000030" ...
 $ start_lat         : num [1:634858] 41.9 41.9 41.9 41.9 41.9 ...
 $ start_lng         : num [1:634858] -87.6 -87.6 -87.6 -87.6 -87.6 ...
 $ end_lat           : num [1:634858] 41.9 41.9 41.9 41.9 41.9 ...
 $ end_lng           : num [1:634858] -87.6 -87.6 -87.7 -87.6 -87.7 ...
 $ member_casual     : chr [1:634858] "member" "member" "member" "member" ...
 - attr(*, "spec")=
  .. cols(
  ..   ride_id = col_character(),
  ..   rideable_type = col_character(),
  ..   started_at = col_datetime(format = ""),
  ..   ended_at = col_datetime(format = ""),
  ..   start_station_name = col_character(),
  ..   start_station_id = col_character(),
  ..   end_station_name = col_character(),
  ..   end_station_id = col_character(),
  ..   start_lat = col_double(),
  ..   start_lng = col_double(),
  ..   end_lat = col_double(),
  ..   end_lng = col_double(),
  ..   member_casual = col_character()
  .. )
 - attr(*, "problems")=<externalptr> 

Confirming that datatypes are consistent across all datasets.

compare_df_cols(June_2021,July_2021,August_2021,September_2021,October_2021,November_2021,December_2021,January_2022,February_2022,March_2022,April_2022,May_2022,return = "mismatch")

The function above confirms that there are no mismatched datatypes across all 12 of the data sets. I proceeded by combining all of them into a single data frame by stacking them untop of each other vertically.

combined_trips <- bind_rows(June_2021,July_2021,August_2021,September_2021,October_2021,November_2021,December_2021,January_2022,February_2022,March_2022,April_2022,May_2022)

Running some functions to get a better idea of the data frame.

str (combined_trips)
spec_tbl_df [5,860,776 × 13] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
 $ ride_id           : chr [1:5860776] "99FEC93BA843FB20" "06048DCFC8520CAF" "9598066F68045DF2" "B03C0FE48C412214" ...
 $ rideable_type     : chr [1:5860776] "electric_bike" "electric_bike" "electric_bike" "electric_bike" ...
 $ started_at        : POSIXct[1:5860776], format: "2021-06-13 14:31:28" "2021-06-04 11:18:02" ...
 $ ended_at          : POSIXct[1:5860776], format: "2021-06-13 14:34:11" "2021-06-04 11:24:19" ...
 $ start_station_name: chr [1:5860776] NA NA NA NA ...
 $ start_station_id  : chr [1:5860776] NA NA NA NA ...
 $ end_station_name  : chr [1:5860776] NA NA NA NA ...
 $ end_station_id    : chr [1:5860776] NA NA NA NA ...
 $ start_lat         : num [1:5860776] 41.8 41.8 41.8 41.8 41.8 ...
 $ start_lng         : num [1:5860776] -87.6 -87.6 -87.6 -87.6 -87.6 ...
 $ end_lat           : num [1:5860776] 41.8 41.8 41.8 41.8 41.8 ...
 $ end_lng           : num [1:5860776] -87.6 -87.6 -87.6 -87.6 -87.6 ...
 $ member_casual     : chr [1:5860776] "member" "member" "member" "member" ...
 - attr(*, "spec")=
  .. cols(
  ..   ride_id = col_character(),
  ..   rideable_type = col_character(),
  ..   started_at = col_datetime(format = ""),
  ..   ended_at = col_datetime(format = ""),
  ..   start_station_name = col_character(),
  ..   start_station_id = col_character(),
  ..   end_station_name = col_character(),
  ..   end_station_id = col_character(),
  ..   start_lat = col_double(),
  ..   start_lng = col_double(),
  ..   end_lat = col_double(),
  ..   end_lng = col_double(),
  ..   member_casual = col_character()
  .. )
 - attr(*, "problems")=<externalptr> 
skim_without_charts(combined_trips)
── Data Summary ────────────────────────
                           Values        
Name                       combined_trips
Number of rows             5860776       
Number of columns          13            
_______________________                  
Column type frequency:                   
  character                7             
  numeric                  4             
  POSIXct                  2             
________________________                 
Group variables            None          
dim(combined_trips)
[1] 5860776      13
head(combined_trips)

Creating columns that extract the day, month, year and day of the week of each trip using its recorded “started_at” time.

combined_trips$date <- as.Date(combined_trips$started_at)
combined_trips$month <- format(as.Date(combined_trips$date), "%B")
combined_trips$day <- format(as.Date(combined_trips$date), "%d")
combined_trips$year <- format(as.Date(combined_trips$date), "%Y")
combined_trips$day_of_the_week <- weekdays(combined_trips$date)
head(combined_trips)

Adding a fifth variable; the ride length. This can be gotten from the difference between the “started_at” and “ended_at” time of each trip.

combined_trips$ride_length <- difftime(combined_trips$ended_at,combined_trips$started_at, units = "min") 

Reducing the decimal points to two places and changing the datatype to numeric so that arithmetic analysis can be done on the ride_length column.

combined_trips$ride_length <- round(combined_trips$ride_length, 2)
combined_trips$ride_length <- as.numeric(as.character(combined_trips$ride_length))

Checking the structure of the dataset again to confirm that each variable is in the right form for analysis.

str(combined_trips)
spec_tbl_df [5,860,776 × 19] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
 $ ride_id           : chr [1:5860776] "99FEC93BA843FB20" "06048DCFC8520CAF" "9598066F68045DF2" "B03C0FE48C412214" ...
 $ rideable_type     : chr [1:5860776] "electric_bike" "electric_bike" "electric_bike" "electric_bike" ...
 $ started_at        : POSIXct[1:5860776], format: "2021-06-13 14:31:28" "2021-06-04 11:18:02" ...
 $ ended_at          : POSIXct[1:5860776], format: "2021-06-13 14:34:11" "2021-06-04 11:24:19" ...
 $ start_station_name: chr [1:5860776] NA NA NA NA ...
 $ start_station_id  : chr [1:5860776] NA NA NA NA ...
 $ end_station_name  : chr [1:5860776] NA NA NA NA ...
 $ end_station_id    : chr [1:5860776] NA NA NA NA ...
 $ start_lat         : num [1:5860776] 41.8 41.8 41.8 41.8 41.8 ...
 $ start_lng         : num [1:5860776] -87.6 -87.6 -87.6 -87.6 -87.6 ...
 $ end_lat           : num [1:5860776] 41.8 41.8 41.8 41.8 41.8 ...
 $ end_lng           : num [1:5860776] -87.6 -87.6 -87.6 -87.6 -87.6 ...
 $ member_casual     : chr [1:5860776] "member" "member" "member" "member" ...
 $ date              : Date[1:5860776], format: "2021-06-13" "2021-06-04" ...
 $ month             : chr [1:5860776] "June" "June" "June" "June" ...
 $ day               : chr [1:5860776] "13" "04" "04" "03" ...
 $ year              : chr [1:5860776] "2021" "2021" "2021" "2021" ...
 $ day_of_the_week   : chr [1:5860776] "Sunday" "Friday" "Friday" "Thursday" ...
 $ ride_length       : num [1:5860776] 2.72 6.28 5.98 25.83 4.13 ...
 - attr(*, "spec")=
  .. cols(
  ..   ride_id = col_character(),
  ..   rideable_type = col_character(),
  ..   started_at = col_datetime(format = ""),
  ..   ended_at = col_datetime(format = ""),
  ..   start_station_name = col_character(),
  ..   start_station_id = col_character(),
  ..   end_station_name = col_character(),
  ..   end_station_id = col_character(),
  ..   start_lat = col_double(),
  ..   start_lng = col_double(),
  ..   end_lat = col_double(),
  ..   end_lng = col_double(),
  ..   member_casual = col_character()
  .. )
 - attr(*, "problems")=<externalptr> 

Cleaning the data

In a real life scenario, I would normally ask a superior colleague or one of the primary shareholders of the project on what specific data cleaning procedures I’m allowed to carry out on the data. An example would be what to do about null values in the data and whether I’m allowed to delete anything completely from the data frame. To stay on the safe side, I created a new data frame called “combined_trips_v2” where I performed all data cleaning (deleting) exercises whilst retaining the integrity of the original “combined_trips” data frame.

#deleting null values 
combined_trips_v2 <- drop_na(combined_trips)

Deleting the entries with negative ride_length since the end time should ideally not be earlier than the start time.

combined_trips_v2 <- filter (combined_trips_v2, ride_length > 0)

Extracting the longitude and latitude data from the data set as it will not be needed in this phase of the analysis.

lat_lng1 <- select(combined_trips_v2,start_station_name,end_station_name,start_lat,start_lng,end_lat,end_lng,member_casual)
combined_trips_v2 <- combined_trips_v2 %>%
select (-c(start_lat, start_lng, end_lat, end_lng))
head (combined_trips_v2)

After I concluded that my data was in the correct format and has been properly cleaned, I moved on to the next phase of the analysis process; Analsis.

Analysis

Analysis is the process of making sense of the data collected (and cleaned). The goal is to identify trends and relationships within the data that will help in solving the business task. The main purpose of the analysis was to deduce all the ways that cyclist members and cyclist casuals used the bicycles differently, in line with the business task.

I started by performing descriptive analysis on the data. ### Descriptive analysis

combined_trips_v2 %>%
  group_by (member_casual) %>%
  summarise(number_of_rides=n(), average_ride_length=mean(ride_length))
combined_trips_v2 %>%
  count(rideable_type)
combined_trips_v2 %>%
  group_by(member_casual)%>%
  count (rideable_type)
summary (combined_trips_v2$ride_length) #This shows a summary of the ride length. 
    Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
    0.02     6.60    11.60    20.22    20.97 55944.15 
combined_trips_v2%>%
  group_by(member_casual)%>%
  summarise(number_of_rides=n(), min_ride_length=min(ride_length),max_ride_length=max(ride_length),avg_ride_length=mean(ride_length),median_ride_length=median(ride_length))

Ordering the months from June 2021 to May 2022 and the days of the week from Sunday to Saturday.

combined_trips_v2$month <- ordered(combined_trips_v2$month, levels=c( "June", "July", "August", "September", "October", "November", "December","January", "February", "March", "April", "May"))

combined_trips_v2$day_of_the_week <- ordered(combined_trips_v2$day_of_the_week, levels=c("Sunday","Monday","Tuesday","Wednesday","Thursday","Friday","Saturday"))
combined_trips_v2 %>%
  group_by(member_casual, month) %>%
  summarise(number_of_rides=n(), average_ride_length=mean(ride_length))%>%
  arrange (month)
`summarise()` has grouped output by 'member_casual'. You can override
using the `.groups` argument.
combined_trips_v2 %>%
  group_by(member_casual, day_of_the_week) %>%
  summarise(number_of_rides=n(), average_ride_length=mean(ride_length))%>%
  arrange (day_of_the_week)
`summarise()` has grouped output by 'member_casual'. You can override
using the `.groups` argument.

Most popular start stations for casual riders

combined_trips_v2 %>%
  group_by(start_station_name, member_casual) %>%
  summarise(number_of_trips=n()) %>%
  arrange(desc (number_of_trips)) %>%
  filter(member_casual== "casual") %>%
  select(start_station_name, number_of_trips)
`summarise()` has grouped output by 'start_station_name'. You can override using the `.groups` argument.

Most popular end stations for casual riders

combined_trips_v2 %>%
  group_by(end_station_name, member_casual) %>%
  summarise(number_of_trips=n()) %>%
  arrange(desc (number_of_trips)) %>%
  filter(member_casual== "casual") %>%
  select(end_station_name, number_of_trips)
`summarise()` has grouped output by 'end_station_name'. You can override using the `.groups` argument.

Creating a new version of the data set that combines the start_station_name and the end_station_name to serve as an aggregate representation of each ride route.

combined_trips_v3 <- (unite(combined_trips_v2, "ride_routes", start_station_name, end_station_name, sep= " to "))
head (combined_trips_v3)

Extracting data about the ride routes to be used for visualizations in Tableau.

top_routes <- combined_trips_v3 %>%
  group_by(ride_routes) %>%
  summarise(number_of_rides=n()) %>%
  arrange (desc (number_of_rides))
head (top_routes,10)
top_routes_1 <- combined_trips_v3 %>%
  group_by(ride_routes, member_casual) %>%
  summarise(number_of_rides=n()) %>%
  arrange (desc(number_of_rides)) 
head (top_routes_1, 10)

Visualizations

Creating charts to visualize different parts of the data and acquire useful insights. Tableau was also used for this section of the analysis.

Number of rides and Average ride length segmented by rider type

combined_trips_v3 %>%
  group_by(member_casual) %>%
  summarise(Average_ride_length=mean(ride_length)) %>%
  ggplot(aes(x= member_casual, y=Average_ride_length, fill=member_casual)) + geom_col() + labs(title = "Average ride length by rider type", x="Rider type", y="Average ride length")

combined_trips_v3 %>%
  group_by(member_casual, month) %>%
  summarise(number_of_rides=n(), average_ride_length=mean(ride_length))%>%
  ggplot (aes(x=month, y=number_of_rides, fill=member_casual)) + geom_col(position= "dodge2") + labs(title = "Number of rides per month segmented by rider type", x= "Month", y="Number of rides")+ theme(axis.text.x = element_text(angle = 60, hjust=1))
`summarise()` has grouped output by 'member_casual'. You can override using the `.groups` argument.

combined_trips_v3 %>%
  group_by(member_casual, month) %>%
  summarise(number_of_rides=n(), average_ride_length=mean(ride_length))%>%
  ggplot (aes(x=month, y=average_ride_length, fill=member_casual)) + geom_col(position= "dodge2") + labs(title = "Average ride length per month segmented by rider type", x= "Month", y="Average ride length")+ theme(axis.text.x = element_text(angle = 60, hjust=1))
`summarise()` has grouped output by 'member_casual'. You can override using the `.groups` argument.

combined_trips_v3 %>%
  group_by(member_casual, day_of_the_week) %>%
  summarise(number_of_rides=n(), average_ride_length=mean(ride_length))%>%
  ggplot (aes(x=day_of_the_week, y=number_of_rides, fill=member_casual)) + geom_col(position= "dodge2") + labs(title = "Number of rides per day of the week", x= "Day of the week", y="Number of rides")
`summarise()` has grouped output by 'member_casual'. You can override using the `.groups` argument.

combined_trips_v3 %>%
  group_by(member_casual, day_of_the_week) %>%
  summarise(number_of_rides=n(), average_ride_length=mean(ride_length))%>%
  ggplot (aes(x=day_of_the_week, y=average_ride_length, fill=member_casual)) + geom_col(position= "dodge2") + labs(title = "Average ride length per week day segmented by rider type", x= "Day of the week", y="Average ride length")+ theme(axis.text.x = element_text(angle = 60, hjust=1))
`summarise()` has grouped output by 'member_casual'. You can override using the `.groups` argument.

Number of rides and Average ride length segmented by rideable type

combined_trips_v3 %>%
  group_by(rideable_type, member_casual) %>%
  summarise(number_of_rides=n()) %>%
  ggplot(aes(x=rideable_type, y=number_of_rides, fill=member_casual)) + geom_col(position = "dodge") + labs(title="Number of rides per rideable type" , x="Rideable Type", y="Number of rides")
`summarise()` has grouped output by 'rideable_type'. You can override using the `.groups` argument.

combined_trips_v3 %>%
  group_by(rideable_type, member_casual) %>%
  summarise(Average_ride_length = mean(ride_length)) %>%
  ggplot(aes(x=rideable_type, y=Average_ride_length, fill=member_casual)) + geom_col(position = "dodge") + labs(title="Average ride length per rideable type" , x="Rideable Type", y="Average ride length")
`summarise()` has grouped output by 'rideable_type'. You can override using the `.groups` argument.

combined_trips_v3 %>%
  group_by(member_casual, month, rideable_type) %>%
  summarise(number_of_rides=n(), average_ride_length=mean(ride_length))%>%
  ggplot (aes(x=month, y=number_of_rides, fill=rideable_type)) + geom_col(position="dodge2") + facet_wrap(~member_casual) + labs(title = "Number of rides segmented by rideable types for each month", x="Month", y="Number of rides") + theme(axis.text.x = element_text(angle = 60, hjust=1))
`summarise()` has grouped output by 'member_casual', 'month'. You can override using the `.groups` argument.

combined_trips_v3 %>%
  group_by(member_casual, day_of_the_week, rideable_type) %>%
  summarise(number_of_rides=n(), average_ride_length=mean(ride_length))%>%
  ggplot (aes(x=day_of_the_week, y=number_of_rides, fill=rideable_type)) + geom_col(position = "dodge") + facet_wrap(~member_casual) + labs(title = "Number of rides segmented by rideable types for each day of the week", x="Day of the week", y="Number of rides") + theme(axis.text.x = element_text(angle = 60, hjust=1))
`summarise()` has grouped output by 'member_casual', 'day_of_the_week'. You can override using the `.groups` argument.

head (combined_trips_v2)
combined_trips_v2 %>%
  group_by(start_station_name, member_casual) %>%
  summarise(number_of_trips=n()) %>%
  arrange(desc (number_of_trips)) %>%
  filter(member_casual== "casual", number_of_trips >= 15460) %>%
  select(start_station_name, number_of_trips) %>%
  ggplot(aes(x=start_station_name, y=number_of_trips)) + geom_col(fill="blue") + coord_flip() + labs(title="Top 10 most popular start stations for casual riders", x="Start station name", y="Number of trips")
`summarise()` has grouped output by 'start_station_name'. You can override using the `.groups` argument.

NA
combined_trips_v2 %>%
  group_by(end_station_name, member_casual) %>%
  summarise(number_of_trips=n()) %>%
  arrange(desc (number_of_trips)) %>%
  filter(member_casual== "casual", number_of_trips >= 15596) %>%
  select(end_station_name, number_of_trips) %>%
  ggplot(aes(x=end_station_name, y=number_of_trips)) + geom_col(fill="red") + coord_flip() + labs(title="Top 10 most popular end stations for casual riders", x="End station name", y="Number of trips")
`summarise()` has grouped output by 'end_station_name'. You can override using the `.groups` argument.

NA

The document, plots and tables are then exported for further analysis. Further visualisations also performed on Tableau. Saving the dataframes as CSV files on my local desktop.

write.csv(top_routes,"C:\\Users\\Temi\\Desktop\\Cyclist project\\top_routes.csv", row.names=FALSE)
write.csv(combined_trips_v3,"C:\\Users\\Temi\\Desktop\\Cyclist project\\combined_trips_v3.csv", row.names=FALSE)
write.csv(top_routes_1,"C:\\Users\\Temi\\Desktop\\Cyclist project\\top_routes_1.csv", row.names=FALSE)
write.csv(lat_lng1,"C:\\Users\\Temi\\Desktop\\Cyclist project\\lat_lng1.csv", row.names=FALSE)
---
title: "Cyclistic Case Study"
author: 'Author: Abdullaah Odunmbaku'
output: html_notebook
---
## Processing
The processing stage involves cleaning the data after confirming that it is from a good source and it satisfies the necessary conditions.It also involves confirming that the data is complete, correct and relevant. I chose R studios as the tool to use for this phase of the analysis process. I started by installing and then loading the relevant packages. 
```{r include=FALSE}
install.packages("tidyverse")
install.packages ("lubridate")
install.packages("janitor")
install.packages("skimr")
```
```{r}
library(tidyverse)
library(lubridate)
library(janitor)
library(skimr)
```
Loading datasets into the environment with the correct naming conventions. 
```{r}
June_2021 <- read_csv("202106-divvy-tripdata.csv")
July_2021 <- read_csv("202107-divvy-tripdata.csv")
August_2021 <- read_csv("202108-divvy-tripdata.csv")
September_2021 <- read_csv("202109-divvy-tripdata.csv")
October_2021 <- read_csv("202110-divvy-tripdata.csv")
November_2021 <- read_csv("202111-divvy-tripdata.csv")
December_2021 <- read_csv("202112-divvy-tripdata.csv")
January_2022 <- read_csv("202201-divvy-tripdata.csv")
February_2022 <- read_csv("202202-divvy-tripdata.csv")
March_2022 <- read_csv("202203-divvy-tripdata.csv")
April_2022 <- read_csv("202204-divvy-tripdata.csv")
May_2022 <- read_csv("202205-divvy-tripdata.csv")

```
The next step would ideally be to combine all of the data sets into one single data frame. But first, I confirmed that the data sets include the same column names. 
```{r}
colnames(June_2021)
colnames(July_2021)
colnames(August_2021)
colnames(September_2021)
colnames(October_2021)
colnames(November_2021)
colnames(December_2021)
colnames(January_2022)
colnames(February_2022)
colnames(March_2022)
colnames(April_2022)
colnames(May_2022)
```
Checking the overall structure of each data set and if they have the same data types.
```{r}
str(June_2021)
str(July_2021)
str(August_2021)
str(September_2021)
str(October_2021)
str(November_2021)
str(December_2021)
str(January_2022)
str(February_2022)
str(March_2022)
str(April_2022)
str(May_2022)
```
Confirming that datatypes are consistent across all datasets.
```{r}
compare_df_cols(June_2021,July_2021,August_2021,September_2021,October_2021,November_2021,December_2021,January_2022,February_2022,March_2022,April_2022,May_2022,return = "mismatch")
```
The function above confirms that there are no mismatched datatypes across all 12 of the data sets. I proceeded by combining all of them into a single data frame by stacking them untop of each other vertically. 
```{r}
combined_trips <- bind_rows(June_2021,July_2021,August_2021,September_2021,October_2021,November_2021,December_2021,January_2022,February_2022,March_2022,April_2022,May_2022)
```

Running some functions to get a better idea of the data frame.
```{r}
str (combined_trips)
```

```{r}
skim_without_charts(combined_trips)
```
```{r}
dim(combined_trips)
```
```{r}
head(combined_trips)
```
Creating columns that extract the day, month, year and day of the week of each trip using its recorded "started_at" time.
```{r}
combined_trips$date <- as.Date(combined_trips$started_at)
combined_trips$month <- format(as.Date(combined_trips$date), "%B")
combined_trips$day <- format(as.Date(combined_trips$date), "%d")
combined_trips$year <- format(as.Date(combined_trips$date), "%Y")
combined_trips$day_of_the_week <- weekdays(combined_trips$date)
```

```{r}
head(combined_trips)
```

Adding a fifth variable; the ride length. This can be gotten from the difference between the "started_at" and "ended_at" time of each trip. 
```{r}
combined_trips$ride_length <- difftime(combined_trips$ended_at,combined_trips$started_at, units = "min") 
```
Reducing the decimal points to two places and changing the datatype to numeric so that arithmetic analysis can be done on the ride_length column. 
```{r}
combined_trips$ride_length <- round(combined_trips$ride_length, 2)
```
```{r}
combined_trips$ride_length <- as.numeric(as.character(combined_trips$ride_length))
```
Checking the structure of the dataset again to confirm that each variable is in the right form for analysis. 
```{r}
str(combined_trips)
```
### Cleaning the data
In a real life scenario, I would normally ask a superior colleague or one of the primary shareholders of the project on what specific data cleaning procedures I'm allowed to carry out on the data. An example would be what to do about null values in the data and whether I'm allowed to delete anything completely from the data frame. To stay on the safe side, I created a new data frame called "combined_trips_v2" where I performed all data cleaning (deleting) exercises whilst retaining the integrity of the original "combined_trips" data frame. 

```{r}
#deleting null values 
combined_trips_v2 <- drop_na(combined_trips)
```

Deleting the entries with negative ride_length since the end time should ideally not be earlier than the start time. 
```{r}
combined_trips_v2 <- filter (combined_trips_v2, ride_length > 0)
```

Extracting the longitude and latitude data from the data set as it will not be needed in this phase of the analysis. 
```{r}
lat_lng1 <- select(combined_trips_v2,start_station_name,end_station_name,start_lat,start_lng,end_lat,end_lng,member_casual)
```
```{r}
combined_trips_v2 <- combined_trips_v2 %>%
select (-c(start_lat, start_lng, end_lat, end_lng))
```
```{r}
head (combined_trips_v2)
```
After I concluded that my data was in the correct format and has been properly cleaned, I moved on to the next phase of the analysis process; Analsis.

## Analysis
Analysis is the process of making sense of the data collected (and cleaned). The goal is to identify trends and relationships within the data that will help in solving the business task. 
The main purpose of the analysis was to deduce all the ways that cyclist members and cyclist casuals used the bicycles differently, in line with the business task. 

I started by performing descriptive analysis on the data. 
### Descriptive analysis
```{r}
combined_trips_v2 %>%
  group_by (member_casual) %>%
  summarise(number_of_rides=n(), average_ride_length=mean(ride_length))
```

```{r}
combined_trips_v2 %>%
  count(rideable_type)
```
```{r}
combined_trips_v2 %>%
  group_by(member_casual)%>%
  count (rideable_type)
```

```{r}
summary (combined_trips_v2$ride_length) #This shows a summary of the ride length. 
```

```{r}
combined_trips_v2%>%
  group_by(member_casual)%>%
  summarise(number_of_rides=n(), min_ride_length=min(ride_length),max_ride_length=max(ride_length),avg_ride_length=mean(ride_length),median_ride_length=median(ride_length))
```
Ordering the months from June 2021 to May 2022 and the days of the week from Sunday to Saturday.
```{r}
combined_trips_v2$month <- ordered(combined_trips_v2$month, levels=c( "June", "July", "August", "September", "October", "November", "December","January", "February", "March", "April", "May"))

combined_trips_v2$day_of_the_week <- ordered(combined_trips_v2$day_of_the_week, levels=c("Sunday","Monday","Tuesday","Wednesday","Thursday","Friday","Saturday"))
```

```{r}
combined_trips_v2 %>%
  group_by(member_casual, month) %>%
  summarise(number_of_rides=n(), average_ride_length=mean(ride_length))%>%
  arrange (month)
```

```{r}
combined_trips_v2 %>%
  group_by(member_casual, day_of_the_week) %>%
  summarise(number_of_rides=n(), average_ride_length=mean(ride_length))%>%
  arrange (day_of_the_week)
```
Most popular start stations for casual riders
```{r}
combined_trips_v2 %>%
  group_by(start_station_name, member_casual) %>%
  summarise(number_of_trips=n()) %>%
  arrange(desc (number_of_trips)) %>%
  filter(member_casual== "casual") %>%
  select(start_station_name, number_of_trips)
```
Most popular end stations for casual riders
```{r}
combined_trips_v2 %>%
  group_by(end_station_name, member_casual) %>%
  summarise(number_of_trips=n()) %>%
  arrange(desc (number_of_trips)) %>%
  filter(member_casual== "casual") %>%
  select(end_station_name, number_of_trips)
```

Creating a new version of the data set that combines the start_station_name and the end_station_name to serve as an aggregate representation of each ride route.
```{r}
combined_trips_v3 <- (unite(combined_trips_v2, "ride_routes", start_station_name, end_station_name, sep= " to "))
```

```{r}
head (combined_trips_v3)
```

Extracting data about the ride routes to be used for visualizations in Tableau.
```{r}
top_routes <- combined_trips_v3 %>%
  group_by(ride_routes) %>%
  summarise(number_of_rides=n()) %>%
  arrange (desc (number_of_rides))
```

```{r}
head (top_routes,10)
```

```{r}
top_routes_1 <- combined_trips_v3 %>%
  group_by(ride_routes, member_casual) %>%
  summarise(number_of_rides=n()) %>%
  arrange (desc(number_of_rides)) 
```

```{r}
head (top_routes_1, 10)
```

## Visualizations
Creating charts to visualize different parts of the data and acquire useful insights. Tableau was also used for this section of the analysis.

### Number of rides and Average ride length segmented by rider type 
```{r}
combined_trips_v3 %>%
  group_by(member_casual) %>%
  summarise(Average_ride_length=mean(ride_length)) %>%
  ggplot(aes(x= member_casual, y=Average_ride_length, fill=member_casual)) + geom_col() + labs(title = "Average ride length by rider type", x="Rider type", y="Average ride length")
```
```{r}
combined_trips_v3 %>%
  group_by(member_casual, month) %>%
  summarise(number_of_rides=n(), average_ride_length=mean(ride_length))%>%
  ggplot (aes(x=month, y=number_of_rides, fill=member_casual)) + geom_col(position= "dodge2") + labs(title = "Number of rides per month segmented by rider type", x= "Month", y="Number of rides")+ theme(axis.text.x = element_text(angle = 60, hjust=1))
```
```{r}
combined_trips_v3 %>%
  group_by(member_casual, month) %>%
  summarise(number_of_rides=n(), average_ride_length=mean(ride_length))%>%
  ggplot (aes(x=month, y=average_ride_length, fill=member_casual)) + geom_col(position= "dodge2") + labs(title = "Average ride length per month segmented by rider type", x= "Month", y="Average ride length")+ theme(axis.text.x = element_text(angle = 60, hjust=1))
```

```{r}
combined_trips_v3 %>%
  group_by(member_casual, day_of_the_week) %>%
  summarise(number_of_rides=n(), average_ride_length=mean(ride_length))%>%
  ggplot (aes(x=day_of_the_week, y=number_of_rides, fill=member_casual)) + geom_col(position= "dodge2") + labs(title = "Number of rides per day of the week", x= "Day of the week", y="Number of rides")
```

```{r}
combined_trips_v3 %>%
  group_by(member_casual, day_of_the_week) %>%
  summarise(number_of_rides=n(), average_ride_length=mean(ride_length))%>%
  ggplot (aes(x=day_of_the_week, y=average_ride_length, fill=member_casual)) + geom_col(position= "dodge2") + labs(title = "Average ride length per week day segmented by rider type", x= "Day of the week", y="Average ride length")+ theme(axis.text.x = element_text(angle = 60, hjust=1))
```

### Number of rides and Average ride length segmented by rideable type
```{r}
combined_trips_v3 %>%
  group_by(rideable_type, member_casual) %>%
  summarise(number_of_rides=n()) %>%
  ggplot(aes(x=rideable_type, y=number_of_rides, fill=member_casual)) + geom_col(position = "dodge") + labs(title="Number of rides per rideable type" , x="Rideable Type", y="Number of rides")
```
```{r}
combined_trips_v3 %>%
  group_by(rideable_type, member_casual) %>%
  summarise(Average_ride_length = mean(ride_length)) %>%
  ggplot(aes(x=rideable_type, y=Average_ride_length, fill=member_casual)) + geom_col(position = "dodge") + labs(title="Average ride length per rideable type" , x="Rideable Type", y="Average ride length")
```
```{r}
combined_trips_v3 %>%
  group_by(member_casual, month, rideable_type) %>%
  summarise(number_of_rides=n(), average_ride_length=mean(ride_length))%>%
  ggplot (aes(x=month, y=number_of_rides, fill=rideable_type)) + geom_col(position="dodge2") + facet_wrap(~member_casual) + labs(title = "Number of rides segmented by rideable types for each month", x="Month", y="Number of rides") + theme(axis.text.x = element_text(angle = 60, hjust=1))
```

```{r}
combined_trips_v3 %>%
  group_by(member_casual, day_of_the_week, rideable_type) %>%
  summarise(number_of_rides=n(), average_ride_length=mean(ride_length))%>%
  ggplot (aes(x=day_of_the_week, y=number_of_rides, fill=rideable_type)) + geom_col(position = "dodge") + facet_wrap(~member_casual) + labs(title = "Number of rides segmented by rideable types for each day of the week", x="Day of the week", y="Number of rides") + theme(axis.text.x = element_text(angle = 60, hjust=1))
```
```{r}
head (combined_trips_v2)
```

```{r}
combined_trips_v2 %>%
  group_by(start_station_name, member_casual) %>%
  summarise(number_of_trips=n()) %>%
  arrange(desc (number_of_trips)) %>%
  filter(member_casual== "casual", number_of_trips >= 15460) %>%
  select(start_station_name, number_of_trips) %>%
  ggplot(aes(x=start_station_name, y=number_of_trips)) + geom_col(fill="blue") + coord_flip() + labs(title="Top 10 most popular start stations for casual riders", x="Start station name", y="Number of trips")
  
```
```{r}
combined_trips_v2 %>%
  group_by(end_station_name, member_casual) %>%
  summarise(number_of_trips=n()) %>%
  arrange(desc (number_of_trips)) %>%
  filter(member_casual== "casual", number_of_trips >= 15596) %>%
  select(end_station_name, number_of_trips) %>%
  ggplot(aes(x=end_station_name, y=number_of_trips)) + geom_col(fill="red") + coord_flip() + labs(title="Top 10 most popular end stations for casual riders", x="End station name", y="Number of trips")
  
```
The document, plots and tables are then exported for further analysis. Further visualisations also performed on Tableau. 
Saving the dataframes as CSV files on my local desktop. 
```{r}
write.csv(top_routes,"C:\\Users\\Temi\\Desktop\\Cyclist project\\top_routes.csv", row.names=FALSE)
```
```{r}
write.csv(combined_trips_v3,"C:\\Users\\Temi\\Desktop\\Cyclist project\\combined_trips_v3.csv", row.names=FALSE)
```
```{r}
write.csv(top_routes_1,"C:\\Users\\Temi\\Desktop\\Cyclist project\\top_routes_1.csv", row.names=FALSE)
```
```{r}
write.csv(lat_lng1,"C:\\Users\\Temi\\Desktop\\Cyclist project\\lat_lng1.csv", row.names=FALSE)
```






