Alexa 2023-11-10

This analysis is for case study 1, “Cyclistic bike-share”, from the Google Data Analytics Certificate. To identify trends, I will use Cyclistic’s historical trip data, which is the City of Chicago’s Divvy bicycle-sharing service data, made available by Motivate International Inc. under this license. This public data can be accessed online.

I will analyze the data from October 1st 2022 to September 30th 2023.

The purpose of this notebook is to consolidate downloaded Cyclistic data into a single data frame and then conduct simple analysis to help answer the key question: “In what ways do members and casual riders use Cyclistic bikes differently?”

Setting up my environment

The ‘tidyverse’ includes packages that I will use throughout this analysis, such as ‘dplyr’, ‘stringr’, ‘lubridate’ and ‘ggplot2’. I will use ‘scales’ to customize the appearance of the axis and legend labels of my charts, and ‘gt’ and ‘formattable’ to improve the appearance of a few tables. I will use ‘sf’ and ‘mapview’ to create a map of the most frequently used starting stations for customers’ trips.

library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.3     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.4     ✔ tibble    3.2.1
## ✔ lubridate 1.9.2     ✔ tidyr     1.3.0
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(scales)

## 
## Attaching package: 'scales'
## 
## The following object is masked from 'package:purrr':
## 
##     discard
## 
## The following object is masked from 'package:readr':
## 
##     col_factor

library(gt)
library(sf)

## Linking to GEOS 3.11.0, GDAL 3.5.3, PROJ 9.1.0; sf_use_s2() is TRUE

library(mapview)
library(formattable)

## 
## Attaching package: 'formattable'
## 
## The following object is masked from 'package:gt':
## 
##     currency
## 
## The following objects are masked from 'package:scales':
## 
##     comma, percent, scientific

Data Wrangling

There are several functions for loading the files: read.csv(), the default CSV reader that comes with base R and creates data frames; read_csv() from the readr package, included in the tidyverse, which creates tibbles; and fread() from the data.table package. I will use read_csv(), which is generally faster than read.csv() and slower than fread(), but is already available through the tidyverse.

dec <- read_csv("/Users/alexandravelez/Desktop/R programming/Cyclist study case data 2/202212-divvy-tripdata.csv")

## Rows: 181806 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl  (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

nov <- read_csv("/Users/alexandravelez/Desktop/R programming/Cyclist study case data 2/202211-divvy-tripdata.csv")

## Rows: 337735 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl  (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

oct <- read_csv("/Users/alexandravelez/Desktop/R programming/Cyclist study case data 2/202210-divvy-tripdata.csv")

## Rows: 558685 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl  (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

aug <- read_csv("/Users/alexandravelez/Desktop/R programming/Cyclist study case data 2/202308-divvy-tripdata.csv")

## Rows: 771693 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl  (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

jul <- read_csv("/Users/alexandravelez/Desktop/R programming/Cyclist study case data 2/202307-divvy-tripdata.csv")

## Rows: 767650 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl  (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

jun <- read_csv("/Users/alexandravelez/Desktop/R programming/Cyclist study case data 2/202306-divvy-tripdata.csv")

## Rows: 719618 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl  (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

may <- read_csv("/Users/alexandravelez/Desktop/R programming/Cyclist study case data 2/202305-divvy-tripdata.csv")

## Rows: 604827 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl  (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

apr <- read_csv("/Users/alexandravelez/Desktop/R programming/Cyclist study case data 2/202304-divvy-tripdata.csv")

## Rows: 426590 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl  (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

mar <- read_csv("/Users/alexandravelez/Desktop/R programming/Cyclist study case data 2/202303-divvy-tripdata.csv")

## Rows: 258678 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl  (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

feb <- read_csv("/Users/alexandravelez/Desktop/R programming/Cyclist study case data 2/202302-divvy-tripdata.csv")

## Rows: 190445 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl  (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

jan <- read_csv("/Users/alexandravelez/Desktop/R programming/Cyclist study case data 2/202301-divvy-tripdata.csv")

## Rows: 190301 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl  (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

sep <- read_csv("/Users/alexandravelez/Desktop/R programming/Cyclist study case data 2/202309-divvy-tripdata.csv")

## Rows: 666371 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl  (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
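
The twelve read_csv() calls above could also be written as a loop over the files in one folder. The following is a minimal sketch, not the code used in this notebook: read_monthly_trips() and its data_dir argument are hypothetical names, and it assumes every file in the folder follows the YYYYMM-divvy-tripdata.csv naming pattern seen above.

```r
library(readr)

# Hypothetical helper: read every monthly Divvy CSV found in `data_dir`
# and return a named list of tibbles, one per file.
read_monthly_trips <- function(data_dir) {
  files <- list.files(data_dir, pattern = "-divvy-tripdata\\.csv$", full.names = TRUE)
  # show_col_types = FALSE silences the column specification message
  trips <- lapply(files, read_csv, show_col_types = FALSE)
  # Name each element after the YYYYMM prefix of its file name, e.g. "202210"
  names(trips) <- substr(basename(files), 1, 6)
  trips
}
```

The resulting list can be passed directly to dplyr::bind_rows(), which accepts a list of data frames, so the individual month objects would not be needed.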

First I inspected all the files to get familiar with the column names and data types with these functions:

head(oct, 10)

## # A tibble: 10 × 13
##    ride_id          rideable_type started_at          ended_at           
##    <chr>            <chr>         <dttm>              <dttm>             
##  1 A50255C1E17942AB classic_bike  2022-10-14 17:13:30 2022-10-14 17:19:39
##  2 DB692A70BD2DD4E3 electric_bike 2022-10-01 16:29:26 2022-10-01 16:49:06
##  3 3C02727AAF60F873 electric_bike 2022-10-19 18:55:40 2022-10-19 19:03:30
##  4 47E653FDC2D99236 electric_bike 2022-10-31 07:52:36 2022-10-31 07:58:49
##  5 8B5407BE535159BF classic_bike  2022-10-13 18:41:03 2022-10-13 19:26:18
##  6 A177C92E9F021B99 electric_bike 2022-10-13 15:53:27 2022-10-13 15:59:17
##  7 DF5EC7678DE3C2B3 electric_bike 2022-10-06 15:51:21 2022-10-06 15:55:06
##  8 407DE6D80130A297 classic_bike  2022-10-26 17:30:10 2022-10-26 17:37:57
##  9 45EEAF68A1A051CA classic_bike  2022-10-22 09:47:56 2022-10-22 09:57:42
## 10 66CD8E4D0C38C0F3 electric_bike 2022-10-24 12:39:47 2022-10-24 12:48:36
## # ℹ 9 more variables: start_station_name <chr>, start_station_id <chr>,
## #   end_station_name <chr>, end_station_id <chr>, start_lat <dbl>,
## #   start_lng <dbl>, end_lat <dbl>, end_lng <dbl>, member_casual <chr>

str(oct)

## spc_tbl_ [558,685 × 13] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ ride_id           : chr [1:558685] "A50255C1E17942AB" "DB692A70BD2DD4E3" "3C02727AAF60F873" "47E653FDC2D99236" ...
##  $ rideable_type     : chr [1:558685] "classic_bike" "electric_bike" "electric_bike" "electric_bike" ...
##  $ started_at        : POSIXct[1:558685], format: "2022-10-14 17:13:30" "2022-10-01 16:29:26" ...
##  $ ended_at          : POSIXct[1:558685], format: "2022-10-14 17:19:39" "2022-10-01 16:49:06" ...
##  $ start_station_name: chr [1:558685] "Noble St & Milwaukee Ave" "Damen Ave & Charleston St" "Hoyne Ave & Balmoral Ave" "Rush St & Cedar St" ...
##  $ start_station_id  : chr [1:558685] "13290" "13288" "655" "KA1504000133" ...
##  $ end_station_name  : chr [1:558685] "Larrabee St & Division St" "Damen Ave & Cullerton St" "Western Ave & Leland Ave" "Orleans St & Chestnut St (NEXT Apts)" ...
##  $ end_station_id    : chr [1:558685] "KA1504000079" "13089" "TA1307000140" "620" ...
##  $ start_lat         : num [1:558685] 41.9 41.9 42 41.9 41.9 ...
##  $ start_lng         : num [1:558685] -87.7 -87.7 -87.7 -87.6 -87.6 ...
##  $ end_lat           : num [1:558685] 41.9 41.9 42 41.9 41.9 ...
##  $ end_lng           : num [1:558685] -87.6 -87.7 -87.7 -87.6 -87.6 ...
##  $ member_casual     : chr [1:558685] "member" "casual" "member" "member" ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   ride_id = col_character(),
##   ..   rideable_type = col_character(),
##   ..   started_at = col_datetime(format = ""),
##   ..   ended_at = col_datetime(format = ""),
##   ..   start_station_name = col_character(),
##   ..   start_station_id = col_character(),
##   ..   end_station_name = col_character(),
##   ..   end_station_id = col_character(),
##   ..   start_lat = col_double(),
##   ..   start_lng = col_double(),
##   ..   end_lat = col_double(),
##   ..   end_lng = col_double(),
##   ..   member_casual = col_character()
##   .. )
##  - attr(*, "problems")=<externalptr>

summary(oct)

##    ride_id          rideable_type        started_at                    
##  Length:558685      Length:558685      Min.   :2022-10-01 00:00:15.00  
##  Class :character   Class :character   1st Qu.:2022-10-08 01:44:43.00  
##  Mode  :character   Mode  :character   Median :2022-10-15 15:09:17.00  
##                                        Mean   :2022-10-16 00:37:00.44  
##                                        3rd Qu.:2022-10-23 15:13:33.00  
##                                        Max.   :2022-10-31 23:59:33.00  
##                                                                        
##     ended_at                      start_station_name start_station_id  
##  Min.   :2022-10-01 00:01:05.00   Length:558685      Length:558685     
##  1st Qu.:2022-10-08 02:00:54.00   Class :character   Class :character  
##  Median :2022-10-15 15:29:01.00   Mode  :character   Mode  :character  
##  Mean   :2022-10-16 00:54:21.77                                        
##  3rd Qu.:2022-10-23 15:36:58.00                                        
##  Max.   :2022-11-07 04:53:58.00                                        
##                                                                        
##  end_station_name   end_station_id       start_lat       start_lng     
##  Length:558685      Length:558685      Min.   :41.64   Min.   :-87.84  
##  Class :character   Class :character   1st Qu.:41.88   1st Qu.:-87.66  
##  Mode  :character   Mode  :character   Median :41.90   Median :-87.64  
##                                        Mean   :41.90   Mean   :-87.65  
##                                        3rd Qu.:41.93   3rd Qu.:-87.63  
##                                        Max.   :42.07   Max.   :-87.53  
##                                                                        
##     end_lat         end_lng       member_casual     
##  Min.   :41.59   Min.   :-87.87   Length:558685     
##  1st Qu.:41.88   1st Qu.:-87.66   Class :character  
##  Median :41.90   Median :-87.64   Mode  :character  
##  Mean   :41.90   Mean   :-87.65                     
##  3rd Qu.:41.93   3rd Qu.:-87.63                     
##  Max.   :42.13   Max.   :-87.52                     
##  NA's   :475     NA's   :475

I applied these functions to all the data sets.

Inspecting the data shows that each file represents one month of data, all of the files have the same number of columns, the columns are in the same order, and the column names are consistent. This means that they can easily be combined into a single data frame with rbind() from base R, bind_rows() from the dplyr package, or rbindlist() from the data.table package; in terms of performance, rbind() is generally the slowest and rbindlist() the fastest. I used bind_rows() here.

year <- bind_rows(apr, may, jun, jul, aug, sep, oct, nov, dec, jan, feb, mar)
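
The consistency of column names and types across files was checked by eye here. As a rough sketch of how the same check could be automated, the base R snippet below compares every table against the first one. The two toy data frames (oct_demo, nov_demo) are made-up stand-ins for the monthly tibbles, and col_classes() is a hypothetical helper.

```r
# Toy stand-ins for two monthly tibbles (the real ones have 13 columns)
oct_demo <- data.frame(ride_id = "A1", member_casual = "member")
nov_demo <- data.frame(ride_id = "B2", member_casual = "casual")

months <- list(oct = oct_demo, nov = nov_demo)

# TRUE for every table whose column names and order match the first table
same_names <- vapply(
  months,
  function(d) identical(names(d), names(months[[1]])),
  logical(1)
)

# First class of each column, e.g. "character" or "numeric"
col_classes <- function(d) vapply(d, function(col) class(col)[1], character(1))

# TRUE for every table whose column classes match the first table
same_types <- vapply(
  months,
  function(d) identical(col_classes(d), col_classes(months[[1]])),
  logical(1)
)

same_names
same_types
```

If either vector contains FALSE, the corresponding month would need to be inspected before binding.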

Checking the tibbles were successfully combined:

str(year)

## spc_tbl_ [5,674,399 × 13] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ ride_id           : chr [1:5674399] "8FE8F7D9C10E88C7" "34E4ED3ADF1D821B" "5296BF07A2F77CB5" "40759916B76D5D52" ...
##  $ rideable_type     : chr [1:5674399] "electric_bike" "electric_bike" "electric_bike" "electric_bike" ...
##  $ started_at        : POSIXct[1:5674399], format: "2023-04-02 08:37:28" "2023-04-19 11:29:02" ...
##  $ ended_at          : POSIXct[1:5674399], format: "2023-04-02 08:41:37" "2023-04-19 11:52:12" ...
##  $ start_station_name: chr [1:5674399] NA NA NA NA ...
##  $ start_station_id  : chr [1:5674399] NA NA NA NA ...
##  $ end_station_name  : chr [1:5674399] NA NA NA NA ...
##  $ end_station_id    : chr [1:5674399] NA NA NA NA ...
##  $ start_lat         : num [1:5674399] 41.8 41.9 41.9 41.9 41.9 ...
##  $ start_lng         : num [1:5674399] -87.6 -87.7 -87.7 -87.7 -87.7 ...
##  $ end_lat           : num [1:5674399] 41.8 41.9 41.9 41.9 41.9 ...
##  $ end_lng           : num [1:5674399] -87.6 -87.7 -87.7 -87.7 -87.6 ...
##  $ member_casual     : chr [1:5674399] "member" "member" "member" "member" ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   ride_id = col_character(),
##   ..   rideable_type = col_character(),
##   ..   started_at = col_datetime(format = ""),
##   ..   ended_at = col_datetime(format = ""),
##   ..   start_station_name = col_character(),
##   ..   start_station_id = col_character(),
##   ..   end_station_name = col_character(),
##   ..   end_station_id = col_character(),
##   ..   start_lat = col_double(),
##   ..   start_lng = col_double(),
##   ..   end_lat = col_double(),
##   ..   end_lng = col_double(),
##   ..   member_casual = col_character()
##   .. )
##  - attr(*, "problems")=<externalptr>

year is a tibble with 5,674,399 rows and 13 columns. Looking at the data types of the variables, rideable_type and member_casual are categorical, so it makes sense to store them as factors instead of strings. The columns started_at and ended_at are stored as POSIXct, and start_station_name, start_station_id, end_station_name and end_station_id are stored as character. start_lat, start_lng, end_lat and end_lng are stored as numeric, which is the appropriate data type for these variables.

Converting rideable_type and member_casual into factors:

year$rideable_type <- factor(year$rideable_type, levels = c('classic_bike', 'electric_bike', 'docked_bike'))
levels(year$rideable_type)

## [1] "classic_bike"  "electric_bike" "docked_bike"

year$member_casual <- factor(year$member_casual)
str(year)

## spc_tbl_ [5,674,399 × 13] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ ride_id           : chr [1:5674399] "8FE8F7D9C10E88C7" "34E4ED3ADF1D821B" "5296BF07A2F77CB5" "40759916B76D5D52" ...
##  $ rideable_type     : Factor w/ 3 levels "classic_bike",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ started_at        : POSIXct[1:5674399], format: "2023-04-02 08:37:28" "2023-04-19 11:29:02" ...
##  $ ended_at          : POSIXct[1:5674399], format: "2023-04-02 08:41:37" "2023-04-19 11:52:12" ...
##  $ start_station_name: chr [1:5674399] NA NA NA NA ...
##  $ start_station_id  : chr [1:5674399] NA NA NA NA ...
##  $ end_station_name  : chr [1:5674399] NA NA NA NA ...
##  $ end_station_id    : chr [1:5674399] NA NA NA NA ...
##  $ start_lat         : num [1:5674399] 41.8 41.9 41.9 41.9 41.9 ...
##  $ start_lng         : num [1:5674399] -87.6 -87.7 -87.7 -87.7 -87.7 ...
##  $ end_lat           : num [1:5674399] 41.8 41.9 41.9 41.9 41.9 ...
##  $ end_lng           : num [1:5674399] -87.6 -87.7 -87.7 -87.7 -87.6 ...
##  $ member_casual     : Factor w/ 2 levels "casual","member": 2 2 2 2 2 2 2 2 2 2 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   ride_id = col_character(),
##   ..   rideable_type = col_character(),
##   ..   started_at = col_datetime(format = ""),
##   ..   ended_at = col_datetime(format = ""),
##   ..   start_station_name = col_character(),
##   ..   start_station_id = col_character(),
##   ..   end_station_name = col_character(),
##   ..   end_station_id = col_character(),
##   ..   start_lat = col_double(),
##   ..   start_lng = col_double(),
##   ..   end_lat = col_double(),
##   ..   end_lng = col_double(),
##   ..   member_casual = col_character()
##   .. )
##  - attr(*, "problems")=<externalptr>

With the str() function I made sure the conversion was successful.
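
One caveat when the levels are listed explicitly, as for rideable_type: any value that is not in the list becomes NA without a warning. The small base R illustration below uses a made-up fourth value ("scooter") that does not occur in this data set, purely to show the behavior.

```r
# "scooter" is a made-up value, used only to show what happens to unlisted values
bike_types <- c("classic_bike", "electric_bike", "docked_bike", "scooter")

# Explicit levels: values that are not listed become NA without a warning
with_levels <- factor(bike_types, levels = c("classic_bike", "electric_bike", "docked_bike"))
sum(is.na(with_levels))

# Default levels: every distinct value is kept, sorted alphabetically
default_levels <- factor(bike_types)
levels(default_levels)
```

Counting the NA values in the factor column after the conversion is therefore a cheap way to confirm that no unexpected category was dropped.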

Before any additional modifications of my data, I created a backup, so I can restore my tibble without executing all the code again in case of any errors.

year_backup <- year

To make the name of the column “member_casual” easier to understand, I renamed it ‘customer_type’:

year <- year %>%
  rename(customer_type = member_casual)

Checking the column name was correctly changed:

levels(year$customer_type)

## [1] "casual" "member"

I needed a calculated field for the ride length in order to analyze trip duration for members and casual customers. I computed it in minutes as a difftime and then converted it to numeric so that I can perform calculations later in my analysis.

year$ride_length_min <- difftime(year$ended_at, year$started_at, units = "mins")

year$ride_length_min <- as.numeric(year$ride_length_min, units = "mins")

Checking the new column ride_length_min was created correctly:

glimpse(year)

## Rows: 5,674,399
## Columns: 14
## $ ride_id            <chr> "8FE8F7D9C10E88C7", "34E4ED3ADF1D821B", "5296BF07A2…
## $ rideable_type      <fct> electric_bike, electric_bike, electric_bike, electr…
## $ started_at         <dttm> 2023-04-02 08:37:28, 2023-04-19 11:29:02, 2023-04-…
## $ ended_at           <dttm> 2023-04-02 08:41:37, 2023-04-19 11:52:12, 2023-04-…
## $ start_station_name <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ start_station_id   <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ end_station_name   <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ end_station_id     <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ start_lat          <dbl> 41.80, 41.87, 41.93, 41.92, 41.91, 41.91, 41.93, 42…
## $ start_lng          <dbl> -87.60, -87.65, -87.66, -87.65, -87.65, -87.63, -87…
## $ end_lat            <dbl> 41.79, 41.93, 41.93, 41.91, 41.91, 41.92, 41.91, 41…
## $ end_lng            <dbl> -87.60, -87.68, -87.66, -87.65, -87.63, -87.65, -87…
## $ customer_type      <fct> member, member, member, member, member, member, mem…
## $ ride_length_min    <dbl> 4.1500000, 23.1666667, 2.0000000, 3.6500000, 4.8333…
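
As a worked check of the calculation, the first two rides in the glimpse() output above can be recomputed by hand. This is only an illustration on two values copied from that output; the UTC time zone is an assumption and does not affect the difference between two timestamps on the same day.

```r
# First two rides shown in the glimpse() output (time zone assumed to be UTC)
started <- as.POSIXct(c("2023-04-02 08:37:28", "2023-04-19 11:29:02"), tz = "UTC")
ended   <- as.POSIXct(c("2023-04-02 08:41:37", "2023-04-19 11:52:12"), tz = "UTC")

# difftime() returns a "difftime" object; as.numeric() drops the class
# and leaves plain numbers expressed in minutes
ride_length_min <- as.numeric(difftime(ended, started, units = "mins"))
round(ride_length_min, 4)
```

The first ride lasts 4 minutes 9 seconds (249 / 60 = 4.15 minutes) and the second 23 minutes 10 seconds (about 23.1667 minutes), which agrees with the first two values of ride_length_min.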

Creating columns for the month (‘month’), the day of the week (‘week_day’) and the hour (‘hour’) to make it easier to aggregate the data for analysis:

year$month <- month(year$started_at, label = TRUE)
year$week_day <- wday(year$started_at, label = TRUE)
year$hour <- hour(year$started_at)
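
To make the three lubridate helpers concrete, here is a small sketch on a single timestamp, the start time of the first ride in the combined data. The UTC time zone is an assumption, and the calls omit label = TRUE so that the results are plain numbers.

```r
library(lubridate)

# Start time of the first ride in the combined data; UTC is an assumption here
x <- as.POSIXct("2023-04-02 08:37:28", tz = "UTC")

month(x)  # month as a number (April is 4)
wday(x)   # day of the week as a number; by default the week starts on Sunday = 1
hour(x)   # hour of the day on a 24-hour clock; minutes and seconds are dropped
```

With label = TRUE, month() and wday() return ordered factors of abbreviated names instead, which keeps months and weekdays in calendar order when they are used for grouping or plotting.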

Checking for duplicate entries:

which(duplicated(year$ride_id))

## integer(0)

No duplicated entries were found.
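
For reference, this is how the duplicate check behaves on a tiny made-up vector of IDs that does contain a repeat. anyDuplicated() is a base R alternative that returns a single number instead of a vector of positions.

```r
ids <- c("A1", "B2", "C3", "B2")  # made-up ride IDs with one repeat

duplicated(ids)         # flags only the second and later occurrences
which(duplicated(ids))  # positions of those repeats
anyDuplicated(ids)      # index of the first repeat, or 0 if there is none
```

An empty result from which(duplicated(...)), printed as integer(0), corresponds to anyDuplicated(...) returning 0.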

Cleaning the data

One of the most noticeable problems with this data set is the high number of NA values in the start_station_name, start_station_id, end_station_name and end_station_id columns:

year[is.na(year$start_station_name),]

## # A tibble: 873,186 × 17
##    ride_id          rideable_type started_at          ended_at           
##    <chr>            <fct>         <dttm>              <dttm>             
##  1 8FE8F7D9C10E88C7 electric_bike 2023-04-02 08:37:28 2023-04-02 08:41:37
##  2 34E4ED3ADF1D821B electric_bike 2023-04-19 11:29:02 2023-04-19 11:52:12
##  3 5296BF07A2F77CB5 electric_bike 2023-04-19 08:41:22 2023-04-19 08:43:22
##  4 40759916B76D5D52 electric_bike 2023-04-19 13:31:30 2023-04-19 13:35:09
##  5 77A96F460101AC63 electric_bike 2023-04-19 12:05:36 2023-04-19 12:10:26
##  6 8D6A2328E19DC168 electric_bike 2023-04-19 12:17:34 2023-04-19 12:21:38
##  7 C97BBA66E07889F9 electric_bike 2023-04-19 09:35:48 2023-04-19 09:45:00
##  8 6687AD4C575FF734 electric_bike 2023-04-11 16:13:43 2023-04-11 16:18:41
##  9 A8FA4F73B22BC11F electric_bike 2023-04-11 16:29:24 2023-04-11 16:40:23
## 10 81E158FE63D99994 electric_bike 2023-04-19 17:35:40 2023-04-19 17:36:11
## # ℹ 873,176 more rows
## # ℹ 13 more variables: start_station_name <chr>, start_station_id <chr>,
## #   end_station_name <chr>, end_station_id <chr>, start_lat <dbl>,
## #   start_lng <dbl>, end_lat <dbl>, end_lng <dbl>, customer_type <fct>,
## #   ride_length_min <dbl>, month <ord>, week_day <ord>, hour <int>

year[is.na(year$end_station_name),]

## # A tibble: 926,160 × 17
##    ride_id          rideable_type started_at          ended_at           
##    <chr>            <fct>         <dttm>              <dttm>             
##  1 8FE8F7D9C10E88C7 electric_bike 2023-04-02 08:37:28 2023-04-02 08:41:37
##  2 34E4ED3ADF1D821B electric_bike 2023-04-19 11:29:02 2023-04-19 11:52:12
##  3 5296BF07A2F77CB5 electric_bike 2023-04-19 08:41:22 2023-04-19 08:43:22
##  4 40759916B76D5D52 electric_bike 2023-04-19 13:31:30 2023-04-19 13:35:09
##  5 77A96F460101AC63 electric_bike 2023-04-19 12:05:36 2023-04-19 12:10:26
##  6 8D6A2328E19DC168 electric_bike 2023-04-19 12:17:34 2023-04-19 12:21:38
##  7 C97BBA66E07889F9 electric_bike 2023-04-19 09:35:48 2023-04-19 09:45:00
##  8 6687AD4C575FF734 electric_bike 2023-04-11 16:13:43 2023-04-11 16:18:41
##  9 A8FA4F73B22BC11F electric_bike 2023-04-11 16:29:24 2023-04-11 16:40:23
## 10 81E158FE63D99994 electric_bike 2023-04-19 17:35:40 2023-04-19 17:36:11
## # ℹ 926,150 more rows
## # ℹ 13 more variables: start_station_name <chr>, start_station_id <chr>,
## #   end_station_name <chr>, end_station_id <chr>, start_lat <dbl>,
## #   start_lng <dbl>, end_lat <dbl>, end_lng <dbl>, customer_type <fct>,
## #   ride_length_min <dbl>, month <ord>, week_day <ord>, hour <int>

year[is.na(year$start_station_id),]

## # A tibble: 873,318 × 17
##    ride_id          rideable_type started_at          ended_at           
##    <chr>            <fct>         <dttm>              <dttm>             
##  1 8FE8F7D9C10E88C7 electric_bike 2023-04-02 08:37:28 2023-04-02 08:41:37
##  2 34E4ED3ADF1D821B electric_bike 2023-04-19 11:29:02 2023-04-19 11:52:12
##  3 5296BF07A2F77CB5 electric_bike 2023-04-19 08:41:22 2023-04-19 08:43:22
##  4 40759916B76D5D52 electric_bike 2023-04-19 13:31:30 2023-04-19 13:35:09
##  5 77A96F460101AC63 electric_bike 2023-04-19 12:05:36 2023-04-19 12:10:26
##  6 8D6A2328E19DC168 electric_bike 2023-04-19 12:17:34 2023-04-19 12:21:38
##  7 C97BBA66E07889F9 electric_bike 2023-04-19 09:35:48 2023-04-19 09:45:00
##  8 6687AD4C575FF734 electric_bike 2023-04-11 16:13:43 2023-04-11 16:18:41
##  9 A8FA4F73B22BC11F electric_bike 2023-04-11 16:29:24 2023-04-11 16:40:23
## 10 81E158FE63D99994 electric_bike 2023-04-19 17:35:40 2023-04-19 17:36:11
## # ℹ 873,308 more rows
## # ℹ 13 more variables: start_station_name <chr>, start_station_id <chr>,
## #   end_station_name <chr>, end_station_id <chr>, start_lat <dbl>,
## #   start_lng <dbl>, end_lat <dbl>, end_lng <dbl>, customer_type <fct>,
## #   ride_length_min <dbl>, month <ord>, week_day <ord>, hour <int>

year[is.na(year$end_station_id),]

## # A tibble: 926,301 × 17
##    ride_id          rideable_type started_at          ended_at           
##    <chr>            <fct>         <dttm>              <dttm>             
##  1 8FE8F7D9C10E88C7 electric_bike 2023-04-02 08:37:28 2023-04-02 08:41:37
##  2 34E4ED3ADF1D821B electric_bike 2023-04-19 11:29:02 2023-04-19 11:52:12
##  3 5296BF07A2F77CB5 electric_bike 2023-04-19 08:41:22 2023-04-19 08:43:22
##  4 40759916B76D5D52 electric_bike 2023-04-19 13:31:30 2023-04-19 13:35:09
##  5 77A96F460101AC63 electric_bike 2023-04-19 12:05:36 2023-04-19 12:10:26
##  6 8D6A2328E19DC168 electric_bike 2023-04-19 12:17:34 2023-04-19 12:21:38
##  7 C97BBA66E07889F9 electric_bike 2023-04-19 09:35:48 2023-04-19 09:45:00
##  8 6687AD4C575FF734 electric_bike 2023-04-11 16:13:43 2023-04-11 16:18:41
##  9 A8FA4F73B22BC11F electric_bike 2023-04-11 16:29:24 2023-04-11 16:40:23
## 10 81E158FE63D99994 electric_bike 2023-04-19 17:35:40 2023-04-19 17:36:11
## # ℹ 926,291 more rows
## # ℹ 13 more variables: start_station_name <chr>, start_station_id <chr>,
## #   end_station_name <chr>, end_station_id <chr>, start_lat <dbl>,
## #   start_lng <dbl>, end_lat <dbl>, end_lng <dbl>, customer_type <fct>,
## #   ride_length_min <dbl>, month <ord>, week_day <ord>, hour <int>

year[is.na(year$end_lat),]

## # A tibble: 6,642 × 17
##    ride_id          rideable_type started_at          ended_at           
##    <chr>            <fct>         <dttm>              <dttm>             
##  1 C0D866D60D247389 docked_bike   2023-04-29 19:53:48 2023-05-01 04:48:13
##  2 667C60C2E29B3527 docked_bike   2023-04-11 11:57:32 2023-04-14 15:35:22
##  3 6931AA8C6820608F docked_bike   2023-04-13 18:07:14 2023-04-13 18:14:14
##  4 78DBBD82B3A9B783 classic_bike  2023-04-28 16:29:43 2023-04-29 17:29:35
##  5 30EB784FB59361E9 classic_bike  2023-04-15 19:35:51 2023-04-16 20:35:46
##  6 52F50C9221FBA751 docked_bike   2023-04-30 17:26:49 2023-05-01 18:26:50
##  7 8A647A93C6DD01BA classic_bike  2023-04-30 10:36:21 2023-05-01 11:36:16
##  8 98F35284E4F8321F classic_bike  2023-04-18 12:03:44 2023-04-19 13:03:40
##  9 8CF9D8D92B102A65 classic_bike  2023-04-26 17:54:23 2023-04-27 18:54:02
## 10 184563421ED28E1A classic_bike  2023-04-28 06:38:51 2023-04-29 07:38:45
## # ℹ 6,632 more rows
## # ℹ 13 more variables: start_station_name <chr>, start_station_id <chr>,
## #   end_station_name <chr>, end_station_id <chr>, start_lat <dbl>,
## #   start_lng <dbl>, end_lat <dbl>, end_lng <dbl>, customer_type <fct>,
## #   ride_length_min <dbl>, month <ord>, week_day <ord>, hour <int>

year[is.na(year$end_lng),]

## # A tibble: 6,642 × 17
##    ride_id          rideable_type started_at          ended_at           
##    <chr>            <fct>         <dttm>              <dttm>             
##  1 C0D866D60D247389 docked_bike   2023-04-29 19:53:48 2023-05-01 04:48:13
##  2 667C60C2E29B3527 docked_bike   2023-04-11 11:57:32 2023-04-14 15:35:22
##  3 6931AA8C6820608F docked_bike   2023-04-13 18:07:14 2023-04-13 18:14:14
##  4 78DBBD82B3A9B783 classic_bike  2023-04-28 16:29:43 2023-04-29 17:29:35
##  5 30EB784FB59361E9 classic_bike  2023-04-15 19:35:51 2023-04-16 20:35:46
##  6 52F50C9221FBA751 docked_bike   2023-04-30 17:26:49 2023-05-01 18:26:50
##  7 8A647A93C6DD01BA classic_bike  2023-04-30 10:36:21 2023-05-01 11:36:16
##  8 98F35284E4F8321F classic_bike  2023-04-18 12:03:44 2023-04-19 13:03:40
##  9 8CF9D8D92B102A65 classic_bike  2023-04-26 17:54:23 2023-04-27 18:54:02
## 10 184563421ED28E1A classic_bike  2023-04-28 06:38:51 2023-04-29 07:38:45
## # ℹ 6,632 more rows
## # ℹ 13 more variables: start_station_name <chr>, start_station_id <chr>,
## #   end_station_name <chr>, end_station_id <chr>, start_lat <dbl>,
## #   start_lng <dbl>, end_lat <dbl>, end_lng <dbl>, customer_type <fct>,
## #   ride_length_min <dbl>, month <ord>, week_day <ord>, hour <int>

The data frame has 873,186 NA values in the start_station_name column, 873,318 in the start_station_id column, 926,160 in the end_station_name column, and 926,301 in the end_station_id column.

The data frame also has 6,642 NA values in each of the end_lat and end_lng columns (about 0.12% of the rows). However, the longitude and latitude of the starting point are complete, which means I can use these columns to map the locations where casual customers’ and members’ trips start, as we will see later.
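
The NA counts can also be obtained for all columns at once, without printing each subset. The sketch below uses base R on a toy table; trips_demo and its values are made up, with station names borrowed from the output shown earlier.

```r
# Toy table with the same kind of gaps (values are made up)
trips_demo <- data.frame(
  start_station_name = c("Rush St & Cedar St", NA, NA),
  end_station_name   = c(NA, "Western Ave & Leland Ave", NA),
  start_lat          = c(41.9, 41.8, 41.9)
)

# Number of NA values in each column
na_count <- colSums(is.na(trips_demo))

# Same information as a percentage of all rows
na_share <- round(100 * colMeans(is.na(trips_demo)), 2)

na_count
na_share
```

Applied to the combined tibble, colSums(is.na(year)) would return one count per column in a single call.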

I will drop the incomplete columns, except for start_station_name, because the missing values cannot be recovered and these columns are not essential for addressing the business problem.

year <- subset(year, select = -c(start_station_id, end_station_name, end_station_id, end_lat, end_lng))

Checking that the columns were correctly eliminated:

str(year)

## tibble [5,674,399 × 12] (S3: tbl_df/tbl/data.frame)
##  $ ride_id           : chr [1:5674399] "8FE8F7D9C10E88C7" "34E4ED3ADF1D821B" "5296BF07A2F77CB5" "40759916B76D5D52" ...
##  $ rideable_type     : Factor w/ 3 levels "classic_bike",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ started_at        : POSIXct[1:5674399], format: "2023-04-02 08:37:28" "2023-04-19 11:29:02" ...
##  $ ended_at          : POSIXct[1:5674399], format: "2023-04-02 08:41:37" "2023-04-19 11:52:12" ...
##  $ start_station_name: chr [1:5674399] NA NA NA NA ...
##  $ start_lat         : num [1:5674399] 41.8 41.9 41.9 41.9 41.9 ...
##  $ start_lng         : num [1:5674399] -87.6 -87.7 -87.7 -87.7 -87.7 ...
##  $ customer_type     : Factor w/ 2 levels "casual","member": 2 2 2 2 2 2 2 2 2 2 ...
##  $ ride_length_min   : num [1:5674399] 4.15 23.17 2 3.65 4.83 ...
##  $ month             : Ord.factor w/ 12 levels "Jan"<"Feb"<"Mar"<..: 4 4 4 4 4 4 4 4 4 4 ...
##  $ week_day          : Ord.factor w/ 7 levels "Sun"<"Mon"<"Tue"<..: 1 4 4 4 4 4 4 3 3 4 ...
##  $ hour              : int [1:5674399] 8 11 8 13 12 12 9 16 16 17 ...

year[!complete.cases(year),]

## # A tibble: 873,186 × 12
##    ride_id          rideable_type started_at          ended_at           
##    <chr>            <fct>         <dttm>              <dttm>             
##  1 8FE8F7D9C10E88C7 electric_bike 2023-04-02 08:37:28 2023-04-02 08:41:37
##  2 34E4ED3ADF1D821B electric_bike 2023-04-19 11:29:02 2023-04-19 11:52:12
##  3 5296BF07A2F77CB5 electric_bike 2023-04-19 08:41:22 2023-04-19 08:43:22
##  4 40759916B76D5D52 electric_bike 2023-04-19 13:31:30 2023-04-19 13:35:09
##  5 77A96F460101AC63 electric_bike 2023-04-19 12:05:36 2023-04-19 12:10:26
##  6 8D6A2328E19DC168 electric_bike 2023-04-19 12:17:34 2023-04-19 12:21:38
##  7 C97BBA66E07889F9 electric_bike 2023-04-19 09:35:48 2023-04-19 09:45:00
##  8 6687AD4C575FF734 electric_bike 2023-04-11 16:13:43 2023-04-11 16:18:41
##  9 A8FA4F73B22BC11F electric_bike 2023-04-11 16:29:24 2023-04-11 16:40:23
## 10 81E158FE63D99994 electric_bike 2023-04-19 17:35:40 2023-04-19 17:36:11
## # ℹ 873,176 more rows
## # ℹ 8 more variables: start_station_name <chr>, start_lat <dbl>,
## #   start_lng <dbl>, customer_type <fct>, ride_length_min <dbl>, month <ord>,
## #   week_day <ord>, hour <int>

We can see that the only remaining incomplete rows are the 873,186 with NA in the start_station_name column. I will leave this column as it is for now, since I have the exact location (latitude and longitude) where each of these trips started, and having all the names is not necessary for my analysis.

All the data except for start_station_name is complete.

Descriptive analysis of ride length:

year%>%
summarize(avg_ride_length =mean(ride_length_min), median_ride_length=median(ride_length_min), max_ride_length=max(ride_length_min), min_ride_length=min(ride_length_min))

## # A tibble: 1 × 4
##   avg_ride_length median_ride_length max_ride_length min_ride_length
##             <dbl>              <dbl>           <dbl>           <dbl>
## 1            18.4               9.55          98489.           -169.

This descriptive analysis indicates the probable presence of outliers and abnormal values, reflected in a negative minimum ride length and a very long maximum ride length. I will look for ride lengths <= 0 minutes.

year[which(year$ride_length_min < 0),]

## # A tibble: 207 × 12
##    ride_id          rideable_type started_at          ended_at           
##    <chr>            <fct>         <dttm>              <dttm>             
##  1 7A4D237E2C99D424 electric_bike 2023-04-04 17:15:08 2023-04-04 17:15:05
##  2 81E1C5175FA5A23D classic_bike  2023-04-19 14:47:18 2023-04-19 14:47:14
##  3 0063C3704F56EC55 electric_bike 2023-04-27 07:51:14 2023-04-27 07:51:09
##  4 DFC43BD5CB34ACBF electric_bike 2023-04-06 23:09:31 2023-04-06 23:00:35
##  5 934174DB8E2AD791 classic_bike  2023-05-29 17:34:21 2023-05-29 17:34:09
##  6 ED9038136686A88A electric_bike 2023-05-29 16:57:34 2023-05-29 16:57:27
##  7 06EC5ECAF8E26A2C electric_bike 2023-05-26 15:39:47 2023-05-26 15:38:17
##  8 F74E0B3EB302A3AE electric_bike 2023-05-26 15:38:53 2023-05-26 15:38:17
##  9 00AC4040E25E347E classic_bike  2023-05-07 15:54:58 2023-05-07 15:54:47
## 10 579596DD4C7C7538 classic_bike  2023-05-23 17:39:38 2023-05-23 17:39:35
## # ℹ 197 more rows
## # ℹ 8 more variables: start_station_name <chr>, start_lat <dbl>,
## #   start_lng <dbl>, customer_type <fct>, ride_length_min <dbl>, month <ord>,
## #   week_day <ord>, hour <int>

year[which(year$ride_length_min == 0),]

## # A tibble: 836 × 12
##    ride_id          rideable_type started_at          ended_at           
##    <chr>            <fct>         <dttm>              <dttm>             
##  1 D0FBBEEF715FD098 classic_bike  2023-04-13 20:35:39 2023-04-13 20:35:39
##  2 183EFB828ABAEB6D electric_bike 2023-04-25 18:50:07 2023-04-25 18:50:07
##  3 ADBFE4E866050462 electric_bike 2023-04-18 07:25:25 2023-04-18 07:25:25
##  4 6BB162FF6B146FA1 electric_bike 2023-04-18 18:52:01 2023-04-18 18:52:01
##  5 F8063DB82D95B1CB classic_bike  2023-04-02 15:04:15 2023-04-02 15:04:15
##  6 1B97C99C3B8ACC83 classic_bike  2023-04-10 21:14:18 2023-04-10 21:14:18
##  7 7C7D88F336F55B87 electric_bike 2023-04-05 07:21:55 2023-04-05 07:21:55
##  8 15FD035AE63ED487 electric_bike 2023-04-09 06:29:48 2023-04-09 06:29:48
##  9 33CBBA7EBFD81651 electric_bike 2023-04-19 11:54:53 2023-04-19 11:54:53
## 10 57381E8364077D79 electric_bike 2023-04-22 16:20:59 2023-04-22 16:20:59
## # ℹ 826 more rows
## # ℹ 8 more variables: start_station_name <chr>, start_lat <dbl>,
## #   start_lng <dbl>, customer_type <fct>, ride_length_min <dbl>, month <ord>,
## #   week_day <ord>, hour <int>

Notably, there are 207 rides with a negative ride length and 836 with a length of 0. This pattern may correspond to instances when bikes were temporarily taken out of the docks by the company for quality checks or repairs.

Since I lack additional information about rides with negative or zero lengths, and this data cannot be meaningfully analyzed alongside the data for normal bike use by members or casual customers, I will store these rows in an object named ‘negative_ride_lengths’ in case I need them in the future, and exclude them from the rest of the dataset.

negative_ride_lengths <- year[which(year$ride_length_min <= 0),]

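# Keep only rides with a positive length. Logical indexing is used instead of
# -which(), which would drop every row if no ride matched the condition.
year <- year[year$ride_length_min > 0,]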

Inspecting the data again to verify that the minimum ride length is greater than 0:

year%>%
summarize(avg_ride_length =mean(ride_length_min), median_ride_length=median(ride_length_min), max_ride_length=max(ride_length_min), min_ride_length=min(ride_length_min))

## # A tibble: 1 × 4
##   avg_ride_length median_ride_length max_ride_length min_ride_length
##             <dbl>              <dbl>           <dbl>           <dbl>
## 1            18.4               9.55          98489.          0.0167

On the other hand, it’s worth noting that the maximum ride length is 98,489.07 minutes, which translates to about 68.4 days. This is highly abnormal for the Cyclistic service, given that, according to their website, a day pass allows for an unlimited number of three-hour rides within a 24-hour period. Additionally, for members, the first 45 minutes of a ride are free, with additional minutes incurring extra fees. Instances like this may be indicative of a stolen or lost bike, or potentially stem from data quality issues.
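The conversion from minutes to days can be checked with a couple of lines. This is only a sketch of the arithmetic; the variable names are my own and the input is the maximum reported by the summary above.

```r
# Maximum ride length reported by the summary, in minutes.
max_minutes <- 98489.07

# 60 minutes per hour, 24 hours per day.
max_hours <- max_minutes / 60
max_days <- max_hours / 24

round(max_hours, 1)
round(max_days, 1)
```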

In order to quickly grasp and identify outliers in this scenario, I will create a boxplot.

Since these ride lengths are very long, I will convert them to hours before creating the boxplot.

ride_length_hours <- with(year, difftime(ended_at,started_at,units="hours"))

Creating the boxplot:

boxplot(ride_length_hours, horizontal = TRUE, main = "Ride length in hours from October 1st 2022 to September 30th 2023", xlab = "Hours", ylab = "Bike rides", notch = TRUE)

I can also use the IQR method to identify high outliers. Under this rule, a data point is considered a high outlier if it falls more than 1.5 times the interquartile range (IQR) above the third quartile, that is, above Q3 + 1.5 × IQR.

summary(year$ride_length_min)

##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##     0.02     5.43     9.55    18.42    17.00 98489.07

IQR(year$ride_length_min)

## [1] 11.56667

With Q3 = 17 and IQR = 11.56, the threshold is Q3 + 1.5 × IQR = 17 + 1.5 × 11.56 = 34.34.
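The same threshold can be computed in R rather than by hand. The sketch below first plugs in the rounded Q3 and the IQR value printed above, then applies the rule to a small made-up vector (`ride_minutes` is a hypothetical example, not the real data). The result from the printed values differs from the hand calculation only in the second decimal, because the IQR is not truncated.

```r
# Values taken from the summary() and IQR() output above (minutes).
q3 <- 17.00
iqr <- 11.56667

# Upper fence: points above this are flagged as high outliers.
upper_fence <- q3 + 1.5 * iqr
upper_fence

# The same rule applied directly to a numeric vector (toy values).
ride_minutes <- c(2, 5, 8, 10, 12, 17, 20, 45, 300)
fence <- unname(quantile(ride_minutes, 0.75)) + 1.5 * IQR(ride_minutes)
high_outliers <- ride_minutes[ride_minutes > fence]
high_outliers
```

Computing the fence from `quantile()` and `IQR()` directly avoids carrying rounding from the printed summary into the threshold.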

Upon examining the box plot, along with the statistical information provided by the summary function and the IQR method for detecting outliers, it can be concluded that values exceeding 34.34 minutes may be considered ‘abnormally high’ for the distribution of ride lengths in this dataset. Nevertheless, it is crucial to not hastily exclude data that, while appearing as outliers in the distribution, may actually reflect variations in how individuals use the bikes. For example, ride lengths for commuting may differ from those for leisure. Therefore, establishing a definitive cutoff between an outlier and a long ride is not straightforward.

However, considering the company’s policies, if a customer fails to return a bike within a 24-hour period, they may face a lost or stolen bike fee of $250. With this in mind, for the purposes of this analysis, I will be excluding trips with a duration exceeding 24 hours (1440 minutes) from the main dataset. I will store them separately in an object named ‘long_rides’ to perform an analysis on this data later.

long_rides <- year[which(year$ride_length_min > 1440),]
long_rides

## # A tibble: 5,946 × 12
##    ride_id          rideable_type started_at          ended_at           
##    <chr>            <fct>         <dttm>              <dttm>             
##  1 A6EA2393A6E2EBA3 docked_bike   2023-04-15 15:29:11 2023-04-16 15:33:18
##  2 2E1F90AAF861B305 docked_bike   2023-04-15 15:30:15 2023-04-16 15:32:53
##  3 82133AC14BD86DDF docked_bike   2023-04-15 15:28:50 2023-04-16 15:32:33
##  4 C0D866D60D247389 docked_bike   2023-04-29 19:53:48 2023-05-01 04:48:13
##  5 667C60C2E29B3527 docked_bike   2023-04-11 11:57:32 2023-04-14 15:35:22
##  6 78DBBD82B3A9B783 classic_bike  2023-04-28 16:29:43 2023-04-29 17:29:35
##  7 30EB784FB59361E9 classic_bike  2023-04-15 19:35:51 2023-04-16 20:35:46
##  8 52F50C9221FBA751 docked_bike   2023-04-30 17:26:49 2023-05-01 18:26:50
##  9 8A647A93C6DD01BA classic_bike  2023-04-30 10:36:21 2023-05-01 11:36:16
## 10 98F35284E4F8321F classic_bike  2023-04-18 12:03:44 2023-04-19 13:03:40
## # ℹ 5,936 more rows
## # ℹ 8 more variables: start_station_name <chr>, start_lat <dbl>,
## #   start_lng <dbl>, customer_type <fct>, ride_length_min <dbl>, month <ord>,
## #   week_day <ord>, hour <int>

5,946 rows were stored in the long_rides object.

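# Keep rides of at most 24 hours (1440 minutes); logical indexing stays safe
# even if no ride exceeds the limit.
year <- year[year$ride_length_min <= 1440,]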

Now I check that the rows with ride lengths greater than 24 hours (1440 minutes) were successfully removed from the main dataset.

year%>%
summarize(avg_ride_length =mean(ride_length_min), median_ride_length=median(ride_length_min), max_ride_length=max(ride_length_min), min_ride_length=min(ride_length_min))

## # A tibble: 1 × 4
##   avg_ride_length median_ride_length max_ride_length min_ride_length
##             <dbl>              <dbl>           <dbl>           <dbl>
## 1            15.2               9.55           1440.          0.0167

Analysis

Analysis of ride length for different customer types, focusing on rides longer than 0 minutes but shorter than 24 hours:

First I will summarize the ride length by customer type, obtaining the average, median, maximum, and minimum ride lengths and a count of the trips.

ride_length_per_customer_type <- year %>% group_by(customer_type) %>% 
  summarise(avg_ride_length =mean(ride_length_min), median_ride_length=median(ride_length_min),  max_ride_length=max(ride_length_min), min_ride_length=min(ride_length_min),
            .groups = 'drop', number_of_rides = n_distinct(ride_id))

ride_length_per_customer_type

## # A tibble: 2 × 6
##   customer_type avg_ride_length median_ride_length max_ride_length
##   <fct>                   <dbl>              <dbl>           <dbl>
## 1 casual                   20.6               11.8           1440.
## 2 member                   12.1                8.5           1440.
## # ℹ 2 more variables: min_ride_length <dbl>, number_of_rides <int>

Then I will create a bar chart to analyze the average ride length by customer type for rides with length longer than 0 min and shorter than 24 hours:

First I created and stored a theme to apply to the next charts to keep consistency.

mytheme <- theme(
  plot.title = element_text(family = "Arial", face = "bold", size = (15), colour = "#5A5A5A"),
  axis.title = element_text(family = "Arial", size = (10), colour = "#808080",  hjust=c(1), vjust=c(0)),
  axis.text = element_text(family = "Arial", size = (10), colour = "#808080"),
  legend.title = element_text(colour = "#808080", face = "bold", family = "Arial"),
  legend.text = element_text(colour = "#808080", family = "Arial"),
  plot.subtitle = element_text(colour = "#5A5A5A", family = "Arial"),
  plot.caption = element_text(colour = "#5A5A5A", family = "Arial")
)

Then I created a bar chart with ggplot2.

p1 <- ggplot(ride_length_per_customer_type, aes(x = customer_type, y = avg_ride_length, fill = c("Casual", "Member"))) + 
  geom_col(width = 0.4) +
  scale_x_discrete(labels = c("Casual", "Member"))+
  scale_fill_manual(values = c("#0C2D48", "#B1D4E0"))+
  labs(y= "Average Ride Length in Minutes", x = "Customer Type", title = "Average Ride Length in Minutes by Customer Type", caption = "October 1st 2022 to September 30th 2023", subtitle = "The average ride length of casual customers is almost twice that of members, suggesting that the bikes\nmay be used for different purposes, such as leisure versus commuting", fill = "Customer Type")+
  mytheme +
  geom_text(data = NULL, label = "21 Min", y = 20, x=1, colour = "white", size = 3.5, family = "Arial" ) +
  geom_text(data = NULL, label = "12 Min", y = 11.5, x=2, colour = "white", size = 3.5, family = "Arial")

p1

The average ride length of casual customers (20.6 minutes) is about 1.7 times that of members (12.1 minutes), and the median ride length is 11.8 minutes for casual customers vs. 8.5 minutes for members.

Number of rides by customer type:

Then, to analyze the number of rides by customer type for rides longer than 0 min and shorter than 24 hours I will use another bar chart.

p2 <-ggplot(ride_length_per_customer_type, aes(x = customer_type, y = number_of_rides, fill = c("Casual", "Member"))) +
  geom_col(width = 0.4) +
  scale_x_discrete(labels = c("Casual", "Member"))+
  expand_limits(y = c(0, NA)) +
  scale_y_continuous(labels = unit_format(unit = "M", scale = 1e-6))+
  scale_fill_manual(values = c("#0C2D48", "#B1D4E0"))+
  labs(y= "Number of Rides in Millions", x = "Customer Type", title = "Number of Rides by Customer Type", caption = "October 1st 2022 to September 30th 2023", subtitle = "Members use the bikes more often than casual customers.\n64% of the rides are taken by members, compared to 36% by casual customers." , fill = "Customer Type")+
  mytheme +
  geom_text(data = NULL, label = "2 Million", y = 2000000, x=1, colour = "white", size = 3.5, family = "Arial" ) +
  geom_text(data = NULL, label = "3.5 Million", y = 3500000, x=2, colour = "white", size = 3.5, family = "Arial")

p2

According to this data, members use the bikes more often than casual customers. 64% of the rides are taken by members, compared to 36% by casual customers.
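The percentages can be derived from the ride counts instead of being typed by hand. Here is a minimal sketch that uses the rounded counts shown on the chart (2 and 3.5 million) as placeholder inputs; the `rides` data frame is an assumption for illustration, and the exact counts in the summary table would be used in practice.

```r
# Rounded ride counts from the chart labels (placeholders, not exact values).
rides <- data.frame(
  customer_type = c("casual", "member"),
  number_of_rides = c(2.0e6, 3.5e6)
)

# Share of all rides taken by each customer type, in percent.
rides$share_pct <- round(100 * rides$number_of_rides / sum(rides$number_of_rides))

rides
```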

Analysis of ride length and the number of rides, categorized by customer type, on a monthly basis.

First I summarized the data by customer type and month. Then I stored this data in the ‘monthly_rides_per_customer_type’ object:

monthly_rides_per_customer_type <- year %>% group_by(customer_type, month) %>% 
  summarise(avg_ride_length =mean(ride_length_min), median_ride_length=median(ride_length_min), number_of_rides = n_distinct(ride_id))

## `summarise()` has grouped output by 'customer_type'. You can override using the
## `.groups` argument.

Then, I changed the capitalization of customer types to make it easier to work with it when I make the next graph.

levels(monthly_rides_per_customer_type$customer_type) <- c("Casual", "Member")

I also created a function named number_formatter to quickly and easily convert the units for the charts’ axes and labels.

number_formatter <- function(x) {
  dplyr::case_when(
    x < 1e3 ~ as.character(x),
    x < 1e6 ~ paste0(as.character(x/1e3), "K"),
    x < 1e9 ~ paste0(as.character(x/1e6), "M"),
    TRUE ~ "To be implemented..."
  )
  }
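A quick way to see what number_formatter returns is to call it on a few hand-picked values. The sketch below repeats the function so it runs on its own; the inputs are illustrative and not taken from the data.

```r
library(dplyr)

# Same function as above, repeated so this snippet is self-contained.
number_formatter <- function(x) {
  dplyr::case_when(
    x < 1e3 ~ as.character(x),
    x < 1e6 ~ paste0(as.character(x / 1e3), "K"),
    x < 1e9 ~ paste0(as.character(x / 1e6), "M"),
    TRUE ~ "To be implemented..."
  )
}

# One value per branch: below a thousand, thousands, millions.
formatted <- number_formatter(c(500, 1500, 2500000))
formatted
```

Because the function does not round, a value such as 123456 would come out with all its decimals, which is why the inputs are passed through signif() before formatting in the chart code.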

I used a stacked bar chart to illustrate the number of rides by customer type and month. I put the values inside every bar to make the chart easier for the reader to understand. To avoid hard-coding the values, I read the number of rides for every month and customer type from the summary table, converted them to doubles, rounded them, and pasted them into the text of a geom_text call inside a loop. Each iteration of the loop builds one annotation and adds it to the ggplot object for the chart.

p3 <- ggplot(monthly_rides_per_customer_type, aes(x = month, y = number_of_rides, fill = customer_type)) +
  geom_col()+
  scale_x_discrete(labels = c("Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov","Dec"))+
  expand_limits(y = c(0, NA)) +
  scale_y_continuous(labels = unit_format(unit = "K", scale = 1e-3))+
  scale_fill_manual(values = c("#0C2D48", "#B1D4E0")) +
  annotate("rect", xmin = 6.5, xmax =8.5, ymin = 0, ymax = 810000,
           alpha = .01, colour = "#880808") +
  labs(y= "Number of Rides in Thousands", x = "Month", title = "Number of Rides by Customer Type and Month", caption = "October 1st 2022 to September 30th 2023", subtitle = "Members and casual customers tend to use the bikes more frequently during the summer months, particularly\nin July and August, and significantly less during the colder months, from December to February", fill = "Customer Type")+
  mytheme 

for (i in 1:12) {
  loop_input = paste( "geom_text(data = NULL, label = number_formatter(signif(as.double(",monthly_rides_per_customer_type[i,5], "), digits = 2)), y = ", monthly_rides_per_customer_type[i+12,5]," + ", (monthly_rides_per_customer_type[i,5]/2), ", x=", i, ", colour = 'white', size = 3, family = 'Arial')", sep = "")
  p3 <- p3 + eval(parse(text=loop_input))
  }

for (i in 1:12) {
  loop_input = paste("geom_text(data = NULL, label = number_formatter(signif(as.double(",monthly_rides_per_customer_type[i+12,5], "), digits = 2)), y = ", monthly_rides_per_customer_type[i+12,5]/2,", x=", i, ", colour = 'white', size = 3, family = 'Arial')", sep = "")
  p3 <- p3 + eval(parse(text=loop_input))
}

p3
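As an alternative to building each annotation as a string and evaluating it, labels can be centred inside stacked bars by mapping a label column and using position_stack(). The following is a sketch on made-up numbers, not the chart used in this analysis; the `monthly` data frame and its values are assumptions for illustration.

```r
library(ggplot2)

# Toy monthly summary (made-up counts) with the same shape as the real one.
monthly <- data.frame(
  month = factor(rep(c("Jun", "Jul", "Aug"), times = 2),
                 levels = c("Jun", "Jul", "Aug")),
  customer_type = rep(c("Casual", "Member"), each = 3),
  number_of_rides = c(300000, 330000, 310000, 420000, 440000, 460000)
)

# Compact labels rounded to two significant digits, e.g. 300000 -> "300K".
monthly$label <- paste0(signif(monthly$number_of_rides / 1e3, 2), "K")

p <- ggplot(monthly, aes(x = month, y = number_of_rides, fill = customer_type)) +
  geom_col() +
  # vjust = 0.5 places each label in the middle of its own bar segment.
  geom_text(aes(label = label),
            position = position_stack(vjust = 0.5),
            colour = "white", size = 3)

p
```

With this approach the label positions follow the data automatically, so the code does not depend on the row order or column positions of the summary table.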

According to this data, members and casual customers tend to use the bikes more frequently during the summer months, particularly in July and August, and significantly less during the colder months, from December to February.

Average ride length by customer type per month

For this analysis, I created a line chart and used some annotations to make it easier to read and to focus the reader’s attention on the important facts.

p4 <- ggplot(monthly_rides_per_customer_type, aes(x = month, y = avg_ride_length, group = customer_type, color = customer_type)) +
  geom_point(size = 1.5) +
  geom_line(size = 1.5) +
  labs(title="Average Ride Length per Customer Type per Month", x="Month", y = "Average Ride Length in Minutes", caption = "October 1st 2022 to September 30th 2023", subtitle = "Both members and casual customers take longer rides during the spring and summer months compared to the winter months.\nHowever, casual customers see a more significant increase in ride length (64%) compared to members (30%)", color = "Customer Type") +
  scale_color_manual(values = c("#0C2D48", "#B1D4E0"))+
  mytheme + 
  annotate("rect", xmin = 4.5, xmax = 9, ymin = 21, ymax = 23.5,
            alpha = .2) +
  annotate("rect", xmin = 4.5, xmax = 9, ymin = 12, ymax = 14,
           alpha = .2) +
  annotate("text", x = 7, y = 13.5, label = "13",
           alpha = .6, size = 3) +
  annotate("text", x = 7, y = 23.2, label = "23",
           alpha = .6, size = 3) +
  annotate("text", x = 1, y = 10.4, label = "10",
           alpha = .8, size = 3, color = "#880808") +
  annotate("text", x = 0.9, y = 14.2, label = "14",
           alpha = .8, size = 3, color = "#880808")

## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

p4

Both members and casual customers take longer rides during the spring and summer months compared to the winter months. However, casual customers see a more significant increase in ride length (64%) compared to members (30%), which might reflect a different purpose for their rides, like commuting vs leisure or tourism.
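The percentage increases quoted here follow from the rounded values annotated on the chart (14 to 23 minutes for casual customers, 10 to 13 minutes for members). A small sketch of that arithmetic, where `pct_increase` is a helper name of my own:

```r
# Percentage increase from a winter value to a summer value, rounded.
pct_increase <- function(from, to) {
  round(100 * (to - from) / from)
}

casual_increase <- pct_increase(14, 23)
member_increase <- pct_increase(10, 13)

casual_increase
member_increase
```

Since the inputs are already rounded to whole minutes, the resulting percentages are approximate.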

Bike trips by customer type per day

First, I created a summary of the data by customer type and day of the week and stored it in the day_customer_type object.

day_customer_type <- year %>% group_by(customer_type, week_day) %>% 
  summarise( number_of_rides = n_distinct(ride_id))

## `summarise()` has grouped output by 'customer_type'. You can override using the
## `.groups` argument.

Then, I changed the capitalization of customer types to make it easier to work with it when I make the next graph.

levels(day_customer_type$customer_type) <- c("Casual", "Member")

I created and stored the labels for the axis, legends and annotations to pass them later into the ggplot function.

p5_labels <- data.frame(
  label = c("428 K", "575 K"),
  customer_type = c("Casual", "Member"),
  # Use `=` so these become columns named x and y instead of global variables.
  x = c("Sat", "Thu"),
  y = c(415000, 565000)
)

p5_labels

##   label customer_type   x      y
## 1 428 K        Casual Sat 415000
## 2 575 K        Member Thu 565000

Then I created two bar charts using the facet_wrap function by customer type.

p5 <- ggplot(day_customer_type, aes(x = week_day, y = number_of_rides, fill = customer_type)) +
  geom_col(width = 0.7) +
  scale_fill_manual(values = c("#0C2D48", "#B1D4E0"))+
  expand_limits(y = c(0, NA)) +
  scale_y_continuous(labels = unit_format(unit = "K", scale = 1e-3))+
  geom_text(
    data = p5_labels, 
    mapping = aes(x = x, y = y, label = label), colour = "white", family = "Arial", size = 3)+
  labs(y= "Number of Rides in Thousands", x = "Day", title = "Number of Rides by Customer Type and Day of the Week", caption = "October 1st 2022 to September 30th 2023", subtitle = "Members tend to use the bikes more frequently on weekdays, particularly from Tuesday to Thursday,\nwhile casual customers show a higher usage on weekends. This observation may indicate distinct\npurposes for bike rides, such as commuting versus leisure", fill = "Customer Type")+
  facet_wrap(~customer_type)+
  mytheme
  

p5

According to this data, members tend to use the bikes more frequently on weekdays, particularly from Tuesday to Thursday, while casual customers show a higher usage on weekends. This observation may also indicate distinct purposes for bike rides, such as commuting versus leisure or tourism.

Bike trips by customer type per hour of the day

First, I created a summary for the bike trips by customer type and hour. This data will be stored in the hour_rides_per_customer_type object.

hour_rides_per_customer_type <- year %>% group_by(customer_type, hour) %>% 
  summarise(avg_ride_length = mean(ride_length_min), number_of_rides = n_distinct(ride_id))

## `summarise()` has grouped output by 'customer_type'. You can override using the
## `.groups` argument.

Then, I changed the capitalization of customer types to make it easier to work with it when I make the next graph.

levels(hour_rides_per_customer_type$customer_type) <- c("Casual", "Member")

Then I created a line chart since I wanted to analyze changes in trends over time. I created a few annotations so the reader can quickly spot relevant numbers.

p6 <- ggplot(hour_rides_per_customer_type, aes(x = hour, y = number_of_rides, fill = customer_type, color=customer_type)) +
  geom_line(size = 1.5) +
  expand_limits(y = c(0, NA)) +
  scale_y_continuous(labels = unit_format(unit = "K", scale = 1e-3))+
  labs(title="Number of Rides per Customer Type per Hour", x="Hour", y = "Number of Rides in Thousands", caption = "October 1st 2022 to September 30th 2023", subtitle = "Casual customers tend to use the bikes more frequently in the afternoon hours, with a peak at 5 pm.\nIn contrast, members exhibit their highest ridership during the morning and afternoon rush hours, suggesting that\nmembers might be primarily using the bikes for commuting", color = "Customer Type") +
  scale_color_manual(values = c("#0C2D48", "#B1D4E0"))+
  annotate("text", x = 8, y = 242000, label = "236K",
           alpha = .6, size = 3) +
  annotate("text", x = 17, y = 385000, label = "378K",
           alpha = .6, size = 3) +
  annotate("text", x = 17, y = 209000, label = "201K",
           alpha = .6, size = 3) +
  mytheme

p6

Casual customers tend to use the bikes more frequently in the afternoon hours, with a peak at 5 pm. In contrast, members exhibit their highest ridership during the morning and afternoon rush hours, suggesting that members might be primarily using the bikes for commuting.

Analysis of rideable type by customer type:

I started this section by summarizing the data on rideable type by customer type and stored it in an object.

rideable_type <- year %>% group_by(customer_type, rideable_type) %>% 
  summarise( number_of_rides = n_distinct(ride_id), avg_ride_length = mean(ride_length_min))

## `summarise()` has grouped output by 'customer_type'. You can override using the
## `.groups` argument.

Then, I changed the capitalization of customer types to make it easier to work with it when I make the next graph.

levels(rideable_type$customer_type) <- c("Casual", "Member")

Then I created a bar chart to illustrate how often each customer type uses the different types of bikes.

p7 <- ggplot(rideable_type,aes(rideable_type,number_of_rides, fill = customer_type))+
  geom_bar(stat="identity", position = "dodge", width = 0.4)+
  expand_limits(y = c(0, NA)) +
  scale_fill_manual(values = c("#0C2D48", "#B1D4E0"))+
  scale_y_continuous(labels = unit_format(unit = "M", scale = 1e-6))+
  scale_x_discrete(labels = c("Classic bike", "Electric bike", "Docked bike"))+
  labs(y= "Number of Rides in Millions", x = "Bike Type", title = "Number of Rides per Bike and Customer Type", caption = "October 1st 2022 to September 30th 2023", subtitle = "Casual customers and members both use electric bikes more frequently than classic bikes.\nHowever, a third class of bike shows up in the data, 'Docked Bike', which is used only by casual customers,\nand much less frequently. This observation needs further analysis.", fill = "Customer Type")+
  mytheme+
  geom_segment(aes(x = 3, y = 250000, xend = 3, yend = 110000), colour = "#880808",
               arrow = arrow(length = unit(0.5, "cm")))+
  geom_text(data = NULL, label = "834K", y = 790000, x=0.9, colour = "white", size = 3.5, family = "Arial" )+
  geom_text(data = NULL, label = "1.7 M", y = 1700000, x=1.1, colour = "white", size = 3.5, family = "Arial" ) +
  geom_text(data = NULL, label = "1.1 M", y = 1110000, x=1.9, colour = "white", size = 3.5, family = "Arial" )+
  geom_text(data = NULL, label = "1.8 M", y = 1810000, x=2.1, colour = "white", size = 3.5, family = "Arial" )+
  geom_text(data = NULL, label = "96 K", y = 50000, x=3, colour = "white", size = 3.5, family = "Arial" )


p7

Casual customers and members both use electric bikes more frequently than classic bikes. However, a third class of bike shows up in the data, ‘Docked Bike’, which is used only by casual customers and much less frequently. This observation needs further analysis, since it is not something I expected to see in this data.

For this, I created an object with the information needed to take a closer look at the length of the rides by bike type.

bike_type <- year%>%
 group_by(rideable_type)%>%
  summarise(avg_ride_length = mean(ride_length_min))

bike_type

## # A tibble: 3 × 2
##   rideable_type avg_ride_length
##   <fct>                   <dbl>
## 1 classic_bike             17.0
## 2 electric_bike            12.4
## 3 docked_bike              54.0

We can see that the average length of the trips with docked bikes is very long. I created a bar chart to compare it to the average ride lengths for the other types of bikes.

p8 <- ggplot(bike_type, aes(rideable_type, avg_ride_length, fill = rideable_type))+
  geom_col(width = 0.4, fill =c("#005b96", "#005b96", "#880808"))+
  scale_x_discrete(labels = c("Classic bike", "Electric bike", "Docked bike"))+
  labs(y= "Average Ride Length in Minutes", x = "Bike Type", title = "Average Ride Length by Rideable Type", caption = "October 1st 2022 to September 30th 2023", subtitle = "The average ride length for docked bikes is more than three times longer than that of electric or classic bikes.\nFurthermore, the fact that members have recorded zero trips with docked bikes, compared to 4.56% of the\ntotal trips taken by casual customers, may suggest a potential error in these entries.", fill = "Bike Type")+
  mytheme+
  geom_text(data = NULL, label = "17 Min", y = 16, x=1, colour = "white", size = 3.5, family = "Arial" )+
  geom_text(data = NULL, label = "12 Min", y = 11.2, x=2, colour = "white", size = 3.5, family = "Arial" )+
  geom_text(data = NULL, label = "54 Min", y = 53, x=3, colour = "white", size = 3.5, family = "Arial" )

p8

The average ride length for docked bikes is more than three times longer than that of electric or classic bikes. Furthermore, the fact that members have recorded zero trips with docked bikes, and docked bikes correspond to 4.56% of the total trips taken by casual customers, may suggest a potential error in these entries. For instance it might be the case that the trip ended, and the bike was docked back at the station, but the registration of the length of the trip didn’t stop for some reason.

Rides with length longer than 24 hours

For this analysis I came back to the previously stored information with rides longer than 24 hours. Then I summarized it by ride length and number of rides.

long_rides_bike <- long_rides%>%
  group_by(rideable_type)%>%
  summarize(number_of_rides = n_distinct(ride_id), avg_ride_length = mean(ride_length_min))

Since these rides are long, I converted them from minutes to hours.

long_rides_bike$avg_ride_length = long_rides_bike$avg_ride_length/60

Then I changed the capitalization of the bike types so I can use them nicely in a table.

levels(long_rides_bike$rideable_type) <- c("Classic Bike", "Electric Bike", "Docked Bike")

Since with this data I want to compare just a couple of numbers, a simple table is the right fit. I used the gt function from the gt package to give a nice format to this table.

t1 <- long_rides_bike%>%
  gt()%>%
  tab_header(
    title = md("Bike Rides Longer than 24 Hours"))%>%
  tab_source_note(md("October 1st 2022 to September 30th 2023"))%>%
  cols_label(
    rideable_type = "Bike Type",
    number_of_rides = "Number of Rides",
    avg_ride_length= "Average Ride Length(Hours)"
  )%>%
  opt_stylize(style = 6, color = "blue")

cols_align(t1,
    align = c("center"),
    columns = everything()
  )

Bike Rides Longer than 24 Hours
Bike Type	Number of Rides	Average Ride Length(Hours)
Classic Bike	4207	24.98829
Docked Bike	1739	116.35724
October 1st 2022 to September 30th 2023

t1

Bike Rides Longer than 24 Hours
Bike Type	Number of Rides	Average Ride Length(Hours)
Classic Bike	4207	24.98829
Docked Bike	1739	116.35724
October 1st 2022 to September 30th 2023

According to this data, there are no rides with electric bikes longer than 24 hours, 4,207 rides with classic bikes, and 1,739 with docked bikes. Notably, for rides longer than 24 hours, the average ride length for docked bikes is substantially longer, at about 116 hours, compared to about 25 hours for classic bikes. This further suggests that there may be errors in the docked bike entries.

Analysis of the most frequently used starting locations for customers’ trips

As I mentioned earlier, I have complete information for the starting locations in terms of latitude and longitude, whereas the station name column is incomplete.

I started by storing this information in an object, grouped by customer type and location, then summarizing it by number of rides departing from each location, and then sorting the information in descending order.

stations <- year%>%
  select(ride_id, start_station_name, start_lat, start_lng, customer_type)%>%
  group_by( customer_type, start_lat, start_lng, start_station_name)%>%
  summarise(number_of_rides = n_distinct(ride_id))%>%
  arrange(-number_of_rides)

## `summarise()` has grouped output by 'customer_type', 'start_lat', 'start_lng'.
## You can override using the `.groups` argument.

stations

## # A tibble: 2,072,795 × 5
## # Groups:   customer_type, start_lat, start_lng [2,070,521]
##    customer_type start_lat start_lng start_station_name          number_of_rides
##    <fct>             <dbl>     <dbl> <chr>                                 <int>
##  1 casual             41.9     -87.6 Streeter Dr & Grand Ave               33902
##  2 casual             41.9     -87.6 DuSable Lake Shore Dr & Mo…           22256
##  3 member             41.9     -87.6 <NA>                                  15853
##  4 member             41.9     -87.6 Clark St & Elm St                     15604
##  5 member             41.8     -87.6 Ellis Ave & 60th St                   15350
##  6 member             41.8     -87.6 University Ave & 57th St              15259
##  7 member             41.9     -87.6 Kingsbury St & Kinzie St              14462
##  8 casual             41.9     -87.6 DuSable Lake Shore Dr & No…           13908
##  9 member             41.9     -87.6 <NA>                                  13427
## 10 casual             41.9     -87.6 Michigan Ave & Oak St                 13325
## # ℹ 2,072,785 more rows
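The message above appears because summarise() drops only the last grouping variable by default, so the result stays grouped by the remaining three. Here is a minimal sketch, on invented toy data, of how the `.groups` argument returns an ungrouped result; it also uses desc(), which is equivalent to the minus sign in arrange(). The toy_trips data frame and its values are made up for illustration.

```r
library(dplyr)

# Invented toy data that mimics the columns used above.
toy_trips <- tibble(
  ride_id = c("a", "b", "c", "d", "e"),
  customer_type = c("casual", "casual", "member", "member", "member"),
  start_lat = c(41.89, 41.89, 41.90, 41.90, 41.90),
  start_lng = c(-87.61, -87.61, -87.63, -87.63, -87.63),
  start_station_name = c("A St", "A St", NA, NA, NA)
)

toy_stations <- toy_trips %>%
  group_by(customer_type, start_lat, start_lng, start_station_name) %>%
  # .groups = "drop" returns an ungrouped tibble and silences the message
  summarise(number_of_rides = n_distinct(ride_id), .groups = "drop") %>%
  arrange(desc(number_of_rides))

toy_stations
```

An ungrouped result also makes later row-wise steps, such as taking the first rows, easier to reason about.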

Then I filtered the information by customer type in order to create a map of the 20 most frequently used starting locations for each customer type.

stations_casual <- filter( stations, customer_type == "casual")

Then I selected the 20 most frequently used locations by casual customers.

stations_casual<- stations_casual[1:20, ]
stations_casual

## # A tibble: 20 × 5
## # Groups:   customer_type, start_lat, start_lng [20]
##    customer_type start_lat start_lng start_station_name          number_of_rides
##    <fct>             <dbl>     <dbl> <chr>                                 <int>
##  1 casual             41.9     -87.6 Streeter Dr & Grand Ave               33902
##  2 casual             41.9     -87.6 DuSable Lake Shore Dr & Mo…           22256
##  3 casual             41.9     -87.6 DuSable Lake Shore Dr & No…           13908
##  4 casual             41.9     -87.6 Michigan Ave & Oak St                 13325
##  5 casual             41.9     -87.6 Theater on the Lake                   11512
##  6 casual             41.9     -87.6 Millennium Park                       11043
##  7 casual             41.9     -87.6 Dusable Harbor                        10408
##  8 casual             41.9     -87.6 Shedd Aquarium                         9975
##  9 casual             41.9     -87.6 <NA>                                   8802
## 10 casual             42.0     -87.7 <NA>                                   8310
## 11 casual             41.9     -87.6 Adler Planetarium                      8281
## 12 casual             41.9     -87.6 <NA>                                   7790
## 13 casual             41.9     -87.6 <NA>                                   7707
## 14 casual             41.9     -87.6 Indiana Ave & Roosevelt Rd             7643
## 15 casual             41.9     -87.6 <NA>                                   7613
## 16 casual             41.9     -87.6 Michigan Ave & 8th St                  7512
## 17 casual             42.0     -87.6 Montrose Harbor                        7323
## 18 casual             41.9     -87.6 Clark St & Lincoln Ave                 6800
## 19 casual             41.9     -87.6 Wells St & Concord Ln                  6599
## 20 casual             41.9     -87.6 Clark St & Armitage Ave                6501
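Indexing rows 1:20 works here because stations is already sorted in descending order. Below is a sketch of an alternative that does not depend on the earlier sort, using dplyr's slice_max() on invented toy data. ungroup() is called first because slice_max() would otherwise pick the top rows within each group.

```r
library(dplyr)

# Invented toy summary, deliberately left unsorted and grouped.
toy_stations <- tibble(
  customer_type = c("casual", "member", "casual", "casual", "member"),
  start_station_name = c("A St", "B St", "C St", "D St", "E St"),
  number_of_rides = c(10L, 50L, 30L, 20L, 40L)
) %>%
  group_by(customer_type, start_station_name)

top_casual <- toy_stations %>%
  ungroup() %>%
  filter(customer_type == "casual") %>%
  # keep the 2 rows with the most rides (n = 20 in the analysis above)
  slice_max(number_of_rides, n = 2, with_ties = FALSE)

top_casual
```

With `with_ties = FALSE` the result has exactly n rows even when several locations share the same count.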

Same steps for member customers. Filtering the information.

stations_member <- filter( stations, customer_type == "member")

Selecting the 20 most frequently used locations by members.

stations_member<- stations_member[1:20, ]
stations_member

## # A tibble: 20 × 5
## # Groups:   customer_type, start_lat, start_lng [20]
##    customer_type start_lat start_lng start_station_name          number_of_rides
##    <fct>             <dbl>     <dbl> <chr>                                 <int>
##  1 member             41.9     -87.6 <NA>                                  15853
##  2 member             41.9     -87.6 Clark St & Elm St                     15604
##  3 member             41.8     -87.6 Ellis Ave & 60th St                   15350
##  4 member             41.8     -87.6 University Ave & 57th St              15259
##  5 member             41.9     -87.6 Kingsbury St & Kinzie St              14462
##  6 member             41.9     -87.6 <NA>                                  13427
##  7 member             41.9     -87.6 <NA>                                  13021
##  8 member             41.9     -87.6 Clinton St & Washington Bl…           12602
##  9 member             41.9     -87.6 Streeter Dr & Grand Ave               12232
## 10 member             41.9     -87.6 <NA>                                  12034
## 11 member             41.8     -87.6 Ellis Ave & 55th St                   12013
## 12 member             41.9     -87.6 <NA>                                  12003
## 13 member             41.9     -87.6 Wells St & Concord Ln                 11884
## 14 member             41.9     -87.6 Wells St & Elm St                     11767
## 15 member             41.9     -87.6 Broadway & Barry Ave                  11342
## 16 member             41.9     -87.6 <NA>                                  11214
## 17 member             41.9     -87.6 <NA>                                  11009
## 18 member             41.9     -87.6 DuSable Lake Shore Dr & No…           10757
## 19 member             41.9     -87.6 State St & Chicago Ave                10718
## 20 member             41.8     -87.6 <NA>                                  10623

The top 20 most frequently used departure locations for casual customers:

To visualize this data, I created a map and a table:

map_casual <- mapview(stations_casual, xcol = "start_lng", ycol = "start_lat", crs = 4269, grid = FALSE)
map_casual

The top 20 most frequently used departure locations for members:

map_member <- mapview(stations_member, xcol = "start_lng", ycol = "start_lat", crs = 4269, grid = FALSE)
map_member
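mapview() can also plot an sf object directly. Below is a minimal sketch, on invented coordinates, of first converting the longitude and latitude columns into point geometries with sf::st_as_sf(). It assumes the coordinates are GPS readings in WGS 84 (EPSG:4326); that is an assumption, not something stated in the data. The calls above use EPSG:4269 (NAD83), and the two differ only marginally at this scale.

```r
library(sf)

# Invented points standing in for stations_casual.
toy_points <- data.frame(
  start_station_name = c("A St", "B St"),
  start_lng = c(-87.61, -87.63),
  start_lat = c(41.89, 41.90),
  number_of_rides = c(100L, 80L)
)

# coords takes the x (longitude) column first, then y (latitude).
toy_sf <- st_as_sf(toy_points, coords = c("start_lng", "start_lat"), crs = 4326)

toy_sf
# Interactive map colored by ride count, if mapview is installed:
# mapview::mapview(toy_sf, zcol = "number_of_rides")
```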

To make this data easier to compare, I created two tables with the top 10 most frequently used departure locations for each customer type. For the locations that have no station name in the data, I assigned provisional names with consecutive numbers, such as “Station 1” and “Station 2”.

Selecting the top ten most frequently used locations for each customer type:

stations_member<- stations_member[1:10, ]
stations_casual<- stations_casual[1:10, ]

Adding provisional names:

stations_member[1,4]<- "Station 1"
stations_member[6,4]<- "Station 2"
stations_member[7,4]<- "Station 3"
stations_member[10,4]<- "Station 4"
stations_casual[9,4]<- "Station 5"
stations_casual[10,4]<- "Station 6"
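The assignments above rely on hard-coded row positions, which would need updating if the data changed. Below is a sketch, on invented toy data, of filling in the missing names programmatically with a running counter. Note that it numbers each table starting from 1, unlike the continuous numbering across both tables used above.

```r
library(dplyr)

# Invented toy table with two unnamed locations.
toy_top <- tibble(
  start_station_name = c(NA, "Clark St & Elm St", NA, "A St"),
  number_of_rides = c(40L, 30L, 20L, 10L)
)

toy_named <- toy_top %>%
  mutate(
    # cumsum(is.na(...)) counts the missing names seen so far: 1, 1, 2, 2
    start_station_name = if_else(
      is.na(start_station_name),
      paste("Station", cumsum(is.na(start_station_name))),
      start_station_name
    )
  )

toy_named$start_station_name
```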

Removing the customer type column from each table, since it is not necessary because the data is already filtered.

stations_casual <- subset(stations_casual, select = -customer_type)
stations_casual

## # A tibble: 10 × 4
## # Groups:   start_lat, start_lng [10]
##    start_lat start_lng start_station_name                 number_of_rides
##        <dbl>     <dbl> <chr>                                        <int>
##  1      41.9     -87.6 Streeter Dr & Grand Ave                      33902
##  2      41.9     -87.6 DuSable Lake Shore Dr & Monroe St            22256
##  3      41.9     -87.6 DuSable Lake Shore Dr & North Blvd           13908
##  4      41.9     -87.6 Michigan Ave & Oak St                        13325
##  5      41.9     -87.6 Theater on the Lake                          11512
##  6      41.9     -87.6 Millennium Park                              11043
##  7      41.9     -87.6 Dusable Harbor                               10408
##  8      41.9     -87.6 Shedd Aquarium                                9975
##  9      41.9     -87.6 Station 5                                     8802
## 10      42.0     -87.7 Station 6                                     8310

stations_member <- subset(stations_member, select = -customer_type)
stations_member

## # A tibble: 10 × 4
## # Groups:   start_lat, start_lng [10]
##    start_lat start_lng start_station_name           number_of_rides
##        <dbl>     <dbl> <chr>                                  <int>
##  1      41.9     -87.6 Station 1                              15853
##  2      41.9     -87.6 Clark St & Elm St                      15604
##  3      41.8     -87.6 Ellis Ave & 60th St                    15350
##  4      41.8     -87.6 University Ave & 57th St               15259
##  5      41.9     -87.6 Kingsbury St & Kinzie St               14462
##  6      41.9     -87.6 Station 2                              13427
##  7      41.9     -87.6 Station 3                              13021
##  8      41.9     -87.6 Clinton St & Washington Blvd           12602
##  9      41.9     -87.6 Streeter Dr & Grand Ave                12232
## 10      41.9     -87.6 Station 4                              12034

Subsequently, I gave the columns of each table a more reader-friendly name:

colnames(stations_casual) <- c("Latitude", "Longitude", "Station Name", "Number of Rides")
colnames(stations_casual)

## [1] "Latitude"        "Longitude"       "Station Name"    "Number of Rides"

colnames(stations_member) <- c("Latitude", "Longitude", "Station Name", "Number of Rides")
colnames(stations_member)

## [1] "Latitude"        "Longitude"       "Station Name"    "Number of Rides"

Lastly, I used the formattable function from the formattable package to quickly add some formatting to these tables.

The top 10 most frequently used departure locations for casual customers

t2 <- formattable(
  stations_casual,
  align = "c"
)
t2

Latitude	Longitude	Station Name	Number of Rides
41.89228	-87.61204	Streeter Dr & Grand Ave	33902
41.88096	-87.61674	DuSable Lake Shore Dr & Monroe St	22256
41.91172	-87.62680	DuSable Lake Shore Dr & North Blvd	13908
41.90096	-87.62378	Michigan Ave & Oak St	13325
41.92628	-87.63083	Theater on the Lake	11512
41.88103	-87.62408	Millennium Park	11043
41.88698	-87.61281	Dusable Harbor	10408
41.86723	-87.61536	Shedd Aquarium	9975
41.91000	-87.63000	Station 5	8802
41.95000	-87.66000	Station 6	8310

The top 10 most frequently used departure locations for members

t3 <- formattable(
  stations_member,
  align = "c"
)
t3

Latitude	Longitude	Station Name	Number of Rides
41.89000	-87.63000	Station 1	15853
41.90297	-87.63128	Clark St & Elm St	15604
41.78510	-87.60107	Ellis Ave & 60th St	15350
41.79148	-87.59986	University Ave & 57th St	15259
41.88918	-87.63851	Kingsbury St & Kinzie St	14462
41.90000	-87.63000	Station 2	13427
41.88000	-87.63000	Station 3	13021
41.88338	-87.64117	Clinton St & Washington Blvd	12602
41.89228	-87.61204	Streeter Dr & Grand Ave	12232
41.94000	-87.65000	Station 4	12034

Among the most frequently used departure locations for casual customers is Streeter Dr & Grand Ave, situated near Jane Addams Memorial Park, the Lake Point Tower apartment building, the Chicago Children’s Museum, and Milton Lee Olive Park. Another is DuSable Lake Shore Dr & Monroe St, which is near the Slide Crater playground, the Monroe Harbor - North Harbor Public Dock pier, Millennium Park, Maggie Daley Park, and Grant Park.

Among the most frequently used departure locations for members is the location at latitude 41.89000, longitude -87.63000, provisionally named “Station 1”, which is in the River North neighborhood in downtown Chicago. This neighborhood is well known for its dozens of high-end art and design galleries, dining establishments, and nightlife spots. Most residents of the area rent their homes, and many are young professionals. Another is Clark St & Elm St, located in Chicago’s Near North Side, which is also known for its fine dining, galleries, nightlife, and riverwalk amenities.

Since the information about where trips end is incomplete, this section of the analysis is intended only to give a general idea of how each type of customer uses the bikes.

Recommendations

The records with negative ride lengths need to be reviewed. If the negative ride lengths come from errors in how the ride duration is recorded, the recording needs to be fixed. If they come from bikes being taken out for quality checks or repairs, those events should be recorded as such and not as customer rides.
The information about the docked bikes needs to be reviewed, especially the reason why these rides are so long and why they were taken only by casual customers, which doesn’t make sense.
The data shows that members use the bikes primarily on weekdays, with peaks at 8 am and 5 pm, while casual customers use the bikes more often on weekend afternoons, with a peak at 5 pm. Members usually take shorter rides than casual customers. This suggests that members use the bikes primarily for commuting, while casual customers might be using them for leisure rides or tourism. A marketing campaign targeting young professionals interested in a transportation alternative for commuting could therefore be a good idea.
Advertising resources should be spent during the spring and summer months rather than the cold-weather months, when the demand for bikes is very low.
Casual customers use the bikes more often during the weekends. To convert casual customers into members, one possibility is to offer these customers some additional perks, like a weekend-only membership.
