Alexa 2023-11-10
This analysis is for case study 1, “Cyclistic bike-share”, from the Google Data Analytics Certificate. I will use Cyclistic’s historical trip data to identify trends. The data come from the City of Chicago’s Divvy bicycle sharing service and are made available by Motivate International Inc. under this license. This public data can be accessed online.
I will analyze the data from October 1st 2022 to September 30th 2023.
The purpose of this notebook is to consolidate downloaded Cyclistic data into a single data frame and then conduct simple analysis to help answer the key question: “In what ways do members and casual riders use Cyclistic bikes differently?”
The ‘tidyverse’ includes packages that I will use throughout this analysis, such as ‘dplyr’, ‘stringr’, ‘lubridate’ and ‘ggplot2’. I will use ‘scales’ to customize the appearance of the axis and legend labels of my charts. I will use ‘gt’ and ‘formattable’ to improve the appearance of a few tables. I will use ‘sf’ and ‘mapview’ to create a map of the most frequently used starting stations for customers’ trips.
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.3 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ ggplot2 3.4.4 ✔ tibble 3.2.1
## ✔ lubridate 1.9.2 ✔ tidyr 1.3.0
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
##
## Attaching package: 'scales'
##
## The following object is masked from 'package:purrr':
##
## discard
##
## The following object is masked from 'package:readr':
##
## col_factor
## Linking to GEOS 3.11.0, GDAL 3.5.3, PROJ 9.1.0; sf_use_s2() is TRUE
##
## Attaching package: 'formattable'
##
## The following object is masked from 'package:gt':
##
## currency
##
## The following objects are masked from 'package:scales':
##
## comma, percent, scientific
There are multiple functions to load the files: read.csv(), the default CSV reader that comes with base R and creates data frames; read_csv() from the readr package, included in the tidyverse, which creates tibbles; and fread() from the data.table package. I will use read_csv(), which is faster than read.csv() and slower than fread(), but is available with the tidyverse.
dec <- read_csv("/Users/alexandravelez/Desktop/R programming/Cyclist study case data 2/202212-divvy-tripdata.csv")
## Rows: 181806 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
nov <- read_csv("/Users/alexandravelez/Desktop/R programming/Cyclist study case data 2/202211-divvy-tripdata.csv")
## Rows: 337735 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
oct <- read_csv("/Users/alexandravelez/Desktop/R programming/Cyclist study case data 2/202210-divvy-tripdata.csv")
## Rows: 558685 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
aug <- read_csv("/Users/alexandravelez/Desktop/R programming/Cyclist study case data 2/202308-divvy-tripdata.csv")
## Rows: 771693 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
jul <- read_csv("/Users/alexandravelez/Desktop/R programming/Cyclist study case data 2/202307-divvy-tripdata.csv")
## Rows: 767650 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
jun <- read_csv("/Users/alexandravelez/Desktop/R programming/Cyclist study case data 2/202306-divvy-tripdata.csv")
## Rows: 719618 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
may <- read_csv("/Users/alexandravelez/Desktop/R programming/Cyclist study case data 2/202305-divvy-tripdata.csv")
## Rows: 604827 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
apr <- read_csv("/Users/alexandravelez/Desktop/R programming/Cyclist study case data 2/202304-divvy-tripdata.csv")
## Rows: 426590 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
mar <- read_csv("/Users/alexandravelez/Desktop/R programming/Cyclist study case data 2/202303-divvy-tripdata.csv")
## Rows: 258678 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
feb <- read_csv("/Users/alexandravelez/Desktop/R programming/Cyclist study case data 2/202302-divvy-tripdata.csv")
## Rows: 190445 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
jan <- read_csv("/Users/alexandravelez/Desktop/R programming/Cyclist study case data 2/202301-divvy-tripdata.csv")
## Rows: 190301 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
sep <- read_csv("/Users/alexandravelez/Desktop/R programming/Cyclist study case data 2/202309-divvy-tripdata.csv")
## Rows: 666371 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
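The twelve calls above differ only in the file name, so the import could also be written as a loop. The following is a minimal sketch of that alternative, not the code used in this notebook. It writes two tiny stand-in CSV files to a temporary folder so that it runs anywhere; the folder, the file contents and the names data_dir, files and monthly are hypothetical. Passing show_col_types = FALSE silences the column specification message, as the output above suggests.

```r
library(readr)

# Hypothetical stand-in files; the real inputs are the monthly
# "*-divvy-tripdata.csv" downloads.
data_dir <- file.path(tempdir(), "divvy_demo")
dir.create(data_dir, showWarnings = FALSE)
write_csv(data.frame(ride_id = c("A1", "A2"),
                     member_casual = c("member", "casual")),
          file.path(data_dir, "202210-divvy-tripdata.csv"))
write_csv(data.frame(ride_id = "B1", member_casual = "member"),
          file.path(data_dir, "202211-divvy-tripdata.csv"))

# List the monthly files (sorted by name) and read each into its own tibble.
files <- list.files(data_dir, pattern = "divvy-tripdata\\.csv$", full.names = TRUE)
monthly <- lapply(files, read_csv, show_col_types = FALSE)

# Name each tibble after the year-month prefix of its file.
names(monthly) <- substr(basename(files), 1, 6)

# Row counts per month, comparable to the "Rows:" lines in the output above.
sapply(monthly, nrow)
```

Keeping the months in a named list makes it possible to inspect each one separately before combining them.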
First I inspected all the files to get familiar with the column names and data types with these functions:
## # A tibble: 10 × 13
## ride_id rideable_type started_at ended_at
## <chr> <chr> <dttm> <dttm>
## 1 A50255C1E17942AB classic_bike 2022-10-14 17:13:30 2022-10-14 17:19:39
## 2 DB692A70BD2DD4E3 electric_bike 2022-10-01 16:29:26 2022-10-01 16:49:06
## 3 3C02727AAF60F873 electric_bike 2022-10-19 18:55:40 2022-10-19 19:03:30
## 4 47E653FDC2D99236 electric_bike 2022-10-31 07:52:36 2022-10-31 07:58:49
## 5 8B5407BE535159BF classic_bike 2022-10-13 18:41:03 2022-10-13 19:26:18
## 6 A177C92E9F021B99 electric_bike 2022-10-13 15:53:27 2022-10-13 15:59:17
## 7 DF5EC7678DE3C2B3 electric_bike 2022-10-06 15:51:21 2022-10-06 15:55:06
## 8 407DE6D80130A297 classic_bike 2022-10-26 17:30:10 2022-10-26 17:37:57
## 9 45EEAF68A1A051CA classic_bike 2022-10-22 09:47:56 2022-10-22 09:57:42
## 10 66CD8E4D0C38C0F3 electric_bike 2022-10-24 12:39:47 2022-10-24 12:48:36
## # ℹ 9 more variables: start_station_name <chr>, start_station_id <chr>,
## # end_station_name <chr>, end_station_id <chr>, start_lat <dbl>,
## # start_lng <dbl>, end_lat <dbl>, end_lng <dbl>, member_casual <chr>
## spc_tbl_ [558,685 × 13] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ ride_id : chr [1:558685] "A50255C1E17942AB" "DB692A70BD2DD4E3" "3C02727AAF60F873" "47E653FDC2D99236" ...
## $ rideable_type : chr [1:558685] "classic_bike" "electric_bike" "electric_bike" "electric_bike" ...
## $ started_at : POSIXct[1:558685], format: "2022-10-14 17:13:30" "2022-10-01 16:29:26" ...
## $ ended_at : POSIXct[1:558685], format: "2022-10-14 17:19:39" "2022-10-01 16:49:06" ...
## $ start_station_name: chr [1:558685] "Noble St & Milwaukee Ave" "Damen Ave & Charleston St" "Hoyne Ave & Balmoral Ave" "Rush St & Cedar St" ...
## $ start_station_id : chr [1:558685] "13290" "13288" "655" "KA1504000133" ...
## $ end_station_name : chr [1:558685] "Larrabee St & Division St" "Damen Ave & Cullerton St" "Western Ave & Leland Ave" "Orleans St & Chestnut St (NEXT Apts)" ...
## $ end_station_id : chr [1:558685] "KA1504000079" "13089" "TA1307000140" "620" ...
## $ start_lat : num [1:558685] 41.9 41.9 42 41.9 41.9 ...
## $ start_lng : num [1:558685] -87.7 -87.7 -87.7 -87.6 -87.6 ...
## $ end_lat : num [1:558685] 41.9 41.9 42 41.9 41.9 ...
## $ end_lng : num [1:558685] -87.6 -87.7 -87.7 -87.6 -87.6 ...
## $ member_casual : chr [1:558685] "member" "casual" "member" "member" ...
## - attr(*, "spec")=
## .. cols(
## .. ride_id = col_character(),
## .. rideable_type = col_character(),
## .. started_at = col_datetime(format = ""),
## .. ended_at = col_datetime(format = ""),
## .. start_station_name = col_character(),
## .. start_station_id = col_character(),
## .. end_station_name = col_character(),
## .. end_station_id = col_character(),
## .. start_lat = col_double(),
## .. start_lng = col_double(),
## .. end_lat = col_double(),
## .. end_lng = col_double(),
## .. member_casual = col_character()
## .. )
## - attr(*, "problems")=<externalptr>
## ride_id rideable_type started_at
## Length:558685 Length:558685 Min. :2022-10-01 00:00:15.00
## Class :character Class :character 1st Qu.:2022-10-08 01:44:43.00
## Mode :character Mode :character Median :2022-10-15 15:09:17.00
## Mean :2022-10-16 00:37:00.44
## 3rd Qu.:2022-10-23 15:13:33.00
## Max. :2022-10-31 23:59:33.00
##
## ended_at start_station_name start_station_id
## Min. :2022-10-01 00:01:05.00 Length:558685 Length:558685
## 1st Qu.:2022-10-08 02:00:54.00 Class :character Class :character
## Median :2022-10-15 15:29:01.00 Mode :character Mode :character
## Mean :2022-10-16 00:54:21.77
## 3rd Qu.:2022-10-23 15:36:58.00
## Max. :2022-11-07 04:53:58.00
##
## end_station_name end_station_id start_lat start_lng
## Length:558685 Length:558685 Min. :41.64 Min. :-87.84
## Class :character Class :character 1st Qu.:41.88 1st Qu.:-87.66
## Mode :character Mode :character Median :41.90 Median :-87.64
## Mean :41.90 Mean :-87.65
## 3rd Qu.:41.93 3rd Qu.:-87.63
## Max. :42.07 Max. :-87.53
##
## end_lat end_lng member_casual
## Min. :41.59 Min. :-87.87 Length:558685
## 1st Qu.:41.88 1st Qu.:-87.66 Class :character
## Median :41.90 Median :-87.64 Mode :character
## Mean :41.90 Mean :-87.65
## 3rd Qu.:41.93 3rd Qu.:-87.63
## Max. :42.13 Max. :-87.52
## NA's :475 NA's :475
I applied these functions to all the data sets.
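The inspection calls themselves are not shown in this rendering. Judging from the output (the first ten rows, a structure listing and a per-column summary), they were probably head(), str() and summary(), but that is an assumption. A minimal sketch on a small made-up tibble, here called demo:

```r
library(tibble)

# Made-up stand-in for one monthly file (the real object would be e.g. oct).
demo <- tibble(
  ride_id       = c("A1", "A2", "A3"),
  rideable_type = c("classic_bike", "electric_bike", "electric_bike"),
  start_lat     = c(41.90, 41.88, 41.93)
)

head(demo, 10)   # first rows (at most 10)
str(demo)        # column names and data types
summary(demo)    # per-column summary statistics
```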
Inspecting the data shows that each file represents one month of data, all of the files have the same number of columns in the same order, and the column names are consistent. This means that they can easily be combined into a single data frame with rbind() from base R, bind_rows() from the dplyr package or rbindlist() from the data.table package; rbind() is the slowest and rbindlist() the fastest. I used bind_rows() here.
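The combining step is not shown in this rendering, so here is a minimal sketch of how bind_rows() behaves, using two made-up tibbles. The names m1, m2 and combined are hypothetical; in the notebook the inputs are the twelve monthly tibbles and the result is called year.

```r
library(dplyr)
library(tibble)

# Two made-up "months" with identical column names and types.
m1 <- tibble(ride_id = c("A1", "A2"), member_casual = c("member", "casual"))
m2 <- tibble(ride_id = "B1", member_casual = "member")

# Confirm the column names agree before stacking.
same_columns <- identical(names(m1), names(m2))

# Stack the rows; columns are matched by name.
combined <- bind_rows(m1, m2)

# The combined row count should equal the sum of the parts.
rows_match <- nrow(combined) == nrow(m1) + nrow(m2)
```

Comparing the combined row count with the sum of the monthly counts is a cheap way to confirm that no rows were lost.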
Checking the tibbles were successfully combined:
## spc_tbl_ [5,674,399 × 13] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ ride_id : chr [1:5674399] "8FE8F7D9C10E88C7" "34E4ED3ADF1D821B" "5296BF07A2F77CB5" "40759916B76D5D52" ...
## $ rideable_type : chr [1:5674399] "electric_bike" "electric_bike" "electric_bike" "electric_bike" ...
## $ started_at : POSIXct[1:5674399], format: "2023-04-02 08:37:28" "2023-04-19 11:29:02" ...
## $ ended_at : POSIXct[1:5674399], format: "2023-04-02 08:41:37" "2023-04-19 11:52:12" ...
## $ start_station_name: chr [1:5674399] NA NA NA NA ...
## $ start_station_id : chr [1:5674399] NA NA NA NA ...
## $ end_station_name : chr [1:5674399] NA NA NA NA ...
## $ end_station_id : chr [1:5674399] NA NA NA NA ...
## $ start_lat : num [1:5674399] 41.8 41.9 41.9 41.9 41.9 ...
## $ start_lng : num [1:5674399] -87.6 -87.7 -87.7 -87.7 -87.7 ...
## $ end_lat : num [1:5674399] 41.8 41.9 41.9 41.9 41.9 ...
## $ end_lng : num [1:5674399] -87.6 -87.7 -87.7 -87.7 -87.6 ...
## $ member_casual : chr [1:5674399] "member" "member" "member" "member" ...
## - attr(*, "spec")=
## .. cols(
## .. ride_id = col_character(),
## .. rideable_type = col_character(),
## .. started_at = col_datetime(format = ""),
## .. ended_at = col_datetime(format = ""),
## .. start_station_name = col_character(),
## .. start_station_id = col_character(),
## .. end_station_name = col_character(),
## .. end_station_id = col_character(),
## .. start_lat = col_double(),
## .. start_lng = col_double(),
## .. end_lat = col_double(),
## .. end_lng = col_double(),
## .. member_casual = col_character()
## .. )
## - attr(*, "problems")=<externalptr>
The tibble year has 5,674,399 rows and 13 columns. Looking at the data types of the variables, rideable_type and member_casual are categorical, so it makes sense to store them as factors instead of strings. The columns started_at and ended_at are stored as POSIXct, and start_station_name, start_station_id, end_station_name and end_station_id are stored as characters. start_lat, start_lng, end_lat and end_lng are stored as numeric, which is the appropriate data type for these variables.
Converting rideable_type and member_casual into factors:
year$rideable_type <- factor(year$rideable_type, levels = c('classic_bike', 'electric_bike', 'docked_bike'))
levels(year$rideable_type)
## [1] "classic_bike" "electric_bike" "docked_bike"
## spc_tbl_ [5,674,399 × 13] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ ride_id : chr [1:5674399] "8FE8F7D9C10E88C7" "34E4ED3ADF1D821B" "5296BF07A2F77CB5" "40759916B76D5D52" ...
## $ rideable_type : Factor w/ 3 levels "classic_bike",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ started_at : POSIXct[1:5674399], format: "2023-04-02 08:37:28" "2023-04-19 11:29:02" ...
## $ ended_at : POSIXct[1:5674399], format: "2023-04-02 08:41:37" "2023-04-19 11:52:12" ...
## $ start_station_name: chr [1:5674399] NA NA NA NA ...
## $ start_station_id : chr [1:5674399] NA NA NA NA ...
## $ end_station_name : chr [1:5674399] NA NA NA NA ...
## $ end_station_id : chr [1:5674399] NA NA NA NA ...
## $ start_lat : num [1:5674399] 41.8 41.9 41.9 41.9 41.9 ...
## $ start_lng : num [1:5674399] -87.6 -87.7 -87.7 -87.7 -87.7 ...
## $ end_lat : num [1:5674399] 41.8 41.9 41.9 41.9 41.9 ...
## $ end_lng : num [1:5674399] -87.6 -87.7 -87.7 -87.7 -87.6 ...
## $ member_casual : Factor w/ 2 levels "casual","member": 2 2 2 2 2 2 2 2 2 2 ...
## - attr(*, "spec")=
## .. cols(
## .. ride_id = col_character(),
## .. rideable_type = col_character(),
## .. started_at = col_datetime(format = ""),
## .. ended_at = col_datetime(format = ""),
## .. start_station_name = col_character(),
## .. start_station_id = col_character(),
## .. end_station_name = col_character(),
## .. end_station_id = col_character(),
## .. start_lat = col_double(),
## .. start_lng = col_double(),
## .. end_lat = col_double(),
## .. end_lng = col_double(),
## .. member_casual = col_character()
## .. )
## - attr(*, "problems")=<externalptr>
With the str() function I made sure the conversion was successful.
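Only the rideable_type conversion appears above; the one for member_casual is not visible in this rendering. The following sketch shows how both conversions could look on a made-up tibble (demo is a hypothetical name). One caveat: when levels are given explicitly, factor() turns any value that is not listed into NA, so it is worth checking for new NA values afterwards.

```r
library(tibble)

# Made-up rows with the same two categorical columns.
demo <- tibble(
  rideable_type = c("electric_bike", "classic_bike", "docked_bike"),
  member_casual = c("member", "casual", "member")
)

# Explicit levels fix the order of the categories.
demo$rideable_type <- factor(demo$rideable_type,
                             levels = c("classic_bike", "electric_bike", "docked_bike"))

# Without explicit levels, factor() sorts the distinct values alphabetically.
demo$member_casual <- factor(demo$member_casual)

levels(demo$member_casual)   # "casual" "member"
anyNA(demo$rideable_type)    # FALSE: every value matched a listed level
```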
Before any additional modifications of my data, I created a backup, so I can restore my tibble without executing all the code again in case of any errors.
To make the name of the column “member_casual” easier to understand, I renamed it ‘customer_type’:
Checking the column name was correctly changed:
## [1] "casual" "member"
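The backup and rename code is not shown in this rendering. One way to do both is sketched below on a made-up tibble. The names demo and demo_backup are hypothetical, and the use of rename() from dplyr is an assumption about how the column was renamed.

```r
library(dplyr)
library(tibble)

demo <- tibble(ride_id = c("A1", "A2"),
               member_casual = factor(c("member", "casual")))

# Keep a copy so the tibble can be restored without rerunning earlier steps.
demo_backup <- demo

# rename() takes new_name = old_name.
demo <- rename(demo, customer_type = member_casual)

names(demo)                  # "ride_id" "customer_type"
levels(demo$customer_type)   # "casual" "member"
```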
I needed a calculated field for ride length in order to analyze the trip duration for members and casual customers. I computed it as a difftime in minutes and then converted it to numeric, so that I can perform calculations later in my analysis.
year$ride_length_min <- with(year, difftime(ended_at, started_at, units = "mins"))
year$ride_length_min <- as.numeric(year$ride_length_min, units = "mins")
Checking that the new column ride_length_min was created correctly:
## Rows: 5,674,399
## Columns: 14
## $ ride_id <chr> "8FE8F7D9C10E88C7", "34E4ED3ADF1D821B", "5296BF07A2…
## $ rideable_type <fct> electric_bike, electric_bike, electric_bike, electr…
## $ started_at <dttm> 2023-04-02 08:37:28, 2023-04-19 11:29:02, 2023-04-…
## $ ended_at <dttm> 2023-04-02 08:41:37, 2023-04-19 11:52:12, 2023-04-…
## $ start_station_name <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ start_station_id <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ end_station_name <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ end_station_id <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ start_lat <dbl> 41.80, 41.87, 41.93, 41.92, 41.91, 41.91, 41.93, 42…
## $ start_lng <dbl> -87.60, -87.65, -87.66, -87.65, -87.65, -87.63, -87…
## $ end_lat <dbl> 41.79, 41.93, 41.93, 41.91, 41.91, 41.92, 41.91, 41…
## $ end_lng <dbl> -87.60, -87.68, -87.66, -87.65, -87.63, -87.65, -87…
## $ customer_type <fct> member, member, member, member, member, member, mem…
## $ ride_length_min <dbl> 4.1500000, 23.1666667, 2.0000000, 3.6500000, 4.8333…
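As a sanity check, the first value in the output above can be recomputed by hand: the ride started at 08:37:28 and ended at 08:41:37, which is 4 minutes and 9 seconds, or 4.15 minutes. A small sketch of the same calculation follows; the time zone is set to UTC only to make the example self-contained, which is an assumption.

```r
# Start and end of the first ride shown in the output.
started <- as.POSIXct("2023-04-02 08:37:28", tz = "UTC")
ended   <- as.POSIXct("2023-04-02 08:41:37", tz = "UTC")

# difftime() returns a duration with a unit attached.
ride_length <- difftime(ended, started, units = "mins")

# as.numeric() drops the unit and leaves a plain number of minutes.
ride_length_min <- as.numeric(ride_length, units = "mins")
ride_length_min   # 4.15
```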
Creating columns for ‘month’, day of the week as ‘week_day’, and ‘hour’ to make it easier to aggregate the data for analysis:
year$month <- month(year$started_at, label = TRUE)
year$week_day <- wday(year$started_at, label = TRUE)
year$hour <- hour(year$started_at)
Checking for duplicate entries:
## integer(0)
No duplicated entries were found.
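The code for the duplicate check is not shown. The output integer(0) is what which() returns when nothing matches, so the check may have been which(duplicated(...)) on the ride IDs; that is an assumption. A sketch on made-up IDs:

```r
# No repeated IDs: which() finds no positions.
ride_ids <- c("A1", "A2", "A3")
no_dups <- which(duplicated(ride_ids))       # integer(0)

# A repeated ID: duplicated() flags the second and later occurrences.
ride_ids_dup <- c("A1", "A2", "A1")
dup_pos <- which(duplicated(ride_ids_dup))   # 3

# anyDuplicated() returns the index of the first duplicate, or 0 if none.
first_dup <- anyDuplicated(ride_ids_dup)
```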
One of the most noticeable problems with this data set is the high number of NA values in the start_station_name, start_station_id, end_station_name and end_station_id columns.
## # A tibble: 873,186 × 17
## ride_id rideable_type started_at ended_at
## <chr> <fct> <dttm> <dttm>
## 1 8FE8F7D9C10E88C7 electric_bike 2023-04-02 08:37:28 2023-04-02 08:41:37
## 2 34E4ED3ADF1D821B electric_bike 2023-04-19 11:29:02 2023-04-19 11:52:12
## 3 5296BF07A2F77CB5 electric_bike 2023-04-19 08:41:22 2023-04-19 08:43:22
## 4 40759916B76D5D52 electric_bike 2023-04-19 13:31:30 2023-04-19 13:35:09
## 5 77A96F460101AC63 electric_bike 2023-04-19 12:05:36 2023-04-19 12:10:26
## 6 8D6A2328E19DC168 electric_bike 2023-04-19 12:17:34 2023-04-19 12:21:38
## 7 C97BBA66E07889F9 electric_bike 2023-04-19 09:35:48 2023-04-19 09:45:00
## 8 6687AD4C575FF734 electric_bike 2023-04-11 16:13:43 2023-04-11 16:18:41
## 9 A8FA4F73B22BC11F electric_bike 2023-04-11 16:29:24 2023-04-11 16:40:23
## 10 81E158FE63D99994 electric_bike 2023-04-19 17:35:40 2023-04-19 17:36:11
## # ℹ 873,176 more rows
## # ℹ 13 more variables: start_station_name <chr>, start_station_id <chr>,
## # end_station_name <chr>, end_station_id <chr>, start_lat <dbl>,
## # start_lng <dbl>, end_lat <dbl>, end_lng <dbl>, customer_type <fct>,
## # ride_length_min <dbl>, month <ord>, week_day <ord>, hour <int>
## # A tibble: 926,160 × 17
## ride_id rideable_type started_at ended_at
## <chr> <fct> <dttm> <dttm>
## 1 8FE8F7D9C10E88C7 electric_bike 2023-04-02 08:37:28 2023-04-02 08:41:37
## 2 34E4ED3ADF1D821B electric_bike 2023-04-19 11:29:02 2023-04-19 11:52:12
## 3 5296BF07A2F77CB5 electric_bike 2023-04-19 08:41:22 2023-04-19 08:43:22
## 4 40759916B76D5D52 electric_bike 2023-04-19 13:31:30 2023-04-19 13:35:09
## 5 77A96F460101AC63 electric_bike 2023-04-19 12:05:36 2023-04-19 12:10:26
## 6 8D6A2328E19DC168 electric_bike 2023-04-19 12:17:34 2023-04-19 12:21:38
## 7 C97BBA66E07889F9 electric_bike 2023-04-19 09:35:48 2023-04-19 09:45:00
## 8 6687AD4C575FF734 electric_bike 2023-04-11 16:13:43 2023-04-11 16:18:41
## 9 A8FA4F73B22BC11F electric_bike 2023-04-11 16:29:24 2023-04-11 16:40:23
## 10 81E158FE63D99994 electric_bike 2023-04-19 17:35:40 2023-04-19 17:36:11
## # ℹ 926,150 more rows
## # ℹ 13 more variables: start_station_name <chr>, start_station_id <chr>,
## # end_station_name <chr>, end_station_id <chr>, start_lat <dbl>,
## # start_lng <dbl>, end_lat <dbl>, end_lng <dbl>, customer_type <fct>,
## # ride_length_min <dbl>, month <ord>, week_day <ord>, hour <int>
## # A tibble: 873,318 × 17
## ride_id rideable_type started_at ended_at
## <chr> <fct> <dttm> <dttm>
## 1 8FE8F7D9C10E88C7 electric_bike 2023-04-02 08:37:28 2023-04-02 08:41:37
## 2 34E4ED3ADF1D821B electric_bike 2023-04-19 11:29:02 2023-04-19 11:52:12
## 3 5296BF07A2F77CB5 electric_bike 2023-04-19 08:41:22 2023-04-19 08:43:22
## 4 40759916B76D5D52 electric_bike 2023-04-19 13:31:30 2023-04-19 13:35:09
## 5 77A96F460101AC63 electric_bike 2023-04-19 12:05:36 2023-04-19 12:10:26
## 6 8D6A2328E19DC168 electric_bike 2023-04-19 12:17:34 2023-04-19 12:21:38
## 7 C97BBA66E07889F9 electric_bike 2023-04-19 09:35:48 2023-04-19 09:45:00
## 8 6687AD4C575FF734 electric_bike 2023-04-11 16:13:43 2023-04-11 16:18:41
## 9 A8FA4F73B22BC11F electric_bike 2023-04-11 16:29:24 2023-04-11 16:40:23
## 10 81E158FE63D99994 electric_bike 2023-04-19 17:35:40 2023-04-19 17:36:11
## # ℹ 873,308 more rows
## # ℹ 13 more variables: start_station_name <chr>, start_station_id <chr>,
## # end_station_name <chr>, end_station_id <chr>, start_lat <dbl>,
## # start_lng <dbl>, end_lat <dbl>, end_lng <dbl>, customer_type <fct>,
## # ride_length_min <dbl>, month <ord>, week_day <ord>, hour <int>
## # A tibble: 926,301 × 17
## ride_id rideable_type started_at ended_at
## <chr> <fct> <dttm> <dttm>
## 1 8FE8F7D9C10E88C7 electric_bike 2023-04-02 08:37:28 2023-04-02 08:41:37
## 2 34E4ED3ADF1D821B electric_bike 2023-04-19 11:29:02 2023-04-19 11:52:12
## 3 5296BF07A2F77CB5 electric_bike 2023-04-19 08:41:22 2023-04-19 08:43:22
## 4 40759916B76D5D52 electric_bike 2023-04-19 13:31:30 2023-04-19 13:35:09
## 5 77A96F460101AC63 electric_bike 2023-04-19 12:05:36 2023-04-19 12:10:26
## 6 8D6A2328E19DC168 electric_bike 2023-04-19 12:17:34 2023-04-19 12:21:38
## 7 C97BBA66E07889F9 electric_bike 2023-04-19 09:35:48 2023-04-19 09:45:00
## 8 6687AD4C575FF734 electric_bike 2023-04-11 16:13:43 2023-04-11 16:18:41
## 9 A8FA4F73B22BC11F electric_bike 2023-04-11 16:29:24 2023-04-11 16:40:23
## 10 81E158FE63D99994 electric_bike 2023-04-19 17:35:40 2023-04-19 17:36:11
## # ℹ 926,291 more rows
## # ℹ 13 more variables: start_station_name <chr>, start_station_id <chr>,
## # end_station_name <chr>, end_station_id <chr>, start_lat <dbl>,
## # start_lng <dbl>, end_lat <dbl>, end_lng <dbl>, customer_type <fct>,
## # ride_length_min <dbl>, month <ord>, week_day <ord>, hour <int>
## # A tibble: 6,642 × 17
## ride_id rideable_type started_at ended_at
## <chr> <fct> <dttm> <dttm>
## 1 C0D866D60D247389 docked_bike 2023-04-29 19:53:48 2023-05-01 04:48:13
## 2 667C60C2E29B3527 docked_bike 2023-04-11 11:57:32 2023-04-14 15:35:22
## 3 6931AA8C6820608F docked_bike 2023-04-13 18:07:14 2023-04-13 18:14:14
## 4 78DBBD82B3A9B783 classic_bike 2023-04-28 16:29:43 2023-04-29 17:29:35
## 5 30EB784FB59361E9 classic_bike 2023-04-15 19:35:51 2023-04-16 20:35:46
## 6 52F50C9221FBA751 docked_bike 2023-04-30 17:26:49 2023-05-01 18:26:50
## 7 8A647A93C6DD01BA classic_bike 2023-04-30 10:36:21 2023-05-01 11:36:16
## 8 98F35284E4F8321F classic_bike 2023-04-18 12:03:44 2023-04-19 13:03:40
## 9 8CF9D8D92B102A65 classic_bike 2023-04-26 17:54:23 2023-04-27 18:54:02
## 10 184563421ED28E1A classic_bike 2023-04-28 06:38:51 2023-04-29 07:38:45
## # ℹ 6,632 more rows
## # ℹ 13 more variables: start_station_name <chr>, start_station_id <chr>,
## # end_station_name <chr>, end_station_id <chr>, start_lat <dbl>,
## # start_lng <dbl>, end_lat <dbl>, end_lng <dbl>, customer_type <fct>,
## # ride_length_min <dbl>, month <ord>, week_day <ord>, hour <int>
## # A tibble: 6,642 × 17
## ride_id rideable_type started_at ended_at
## <chr> <fct> <dttm> <dttm>
## 1 C0D866D60D247389 docked_bike 2023-04-29 19:53:48 2023-05-01 04:48:13
## 2 667C60C2E29B3527 docked_bike 2023-04-11 11:57:32 2023-04-14 15:35:22
## 3 6931AA8C6820608F docked_bike 2023-04-13 18:07:14 2023-04-13 18:14:14
## 4 78DBBD82B3A9B783 classic_bike 2023-04-28 16:29:43 2023-04-29 17:29:35
## 5 30EB784FB59361E9 classic_bike 2023-04-15 19:35:51 2023-04-16 20:35:46
## 6 52F50C9221FBA751 docked_bike 2023-04-30 17:26:49 2023-05-01 18:26:50
## 7 8A647A93C6DD01BA classic_bike 2023-04-30 10:36:21 2023-05-01 11:36:16
## 8 98F35284E4F8321F classic_bike 2023-04-18 12:03:44 2023-04-19 13:03:40
## 9 8CF9D8D92B102A65 classic_bike 2023-04-26 17:54:23 2023-04-27 18:54:02
## 10 184563421ED28E1A classic_bike 2023-04-28 06:38:51 2023-04-29 07:38:45
## # ℹ 6,632 more rows
## # ℹ 13 more variables: start_station_name <chr>, start_station_id <chr>,
## # end_station_name <chr>, end_station_id <chr>, start_lat <dbl>,
## # start_lng <dbl>, end_lat <dbl>, end_lng <dbl>, customer_type <fct>,
## # ride_length_min <dbl>, month <ord>, week_day <ord>, hour <int>
The data frame has 873,186 NA values in the start_station_name column, 873,318 in the start_station_id column, 926,160 in the end_station_name column, and 926,301 in the end_station_id column.
The data frame also has 6,642 NA values in each of the end_lat and end_lng columns (about 0.12% of the rows). However, the longitude and latitude of the starting point are complete, which means I can use these columns to map the locations where casual customers’ and members’ trips start, as we will see later.
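Instead of filtering one column at a time, the NA counts for every column can be tallied in a single step. The sketch below uses a made-up tibble (demo and na_counts are hypothetical names) and base R's colSums() on is.na(); it illustrates the idea and is not the code that produced the outputs above.

```r
library(tibble)

# Made-up rows with some missing station names and coordinates.
demo <- tibble(
  start_station_name = c("Clark St", NA, NA, "State St"),
  end_lat            = c(41.9, NA, 41.8, 41.9),
  start_lat          = c(41.9, 41.9, 41.8, 41.9)
)

# is.na() gives a logical matrix; colSums() counts the TRUE values per column.
na_counts <- colSums(is.na(demo))
na_counts

# The same counts as a percentage of all rows.
na_percent <- round(100 * na_counts / nrow(demo), 2)
na_percent
```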
I will exclude the incomplete columns from the analysis except for start_station_name, as it is not possible to obtain complete information for these columns, and they are not essential for addressing the business problem.
year <- subset(year, select = -c(start_station_id, end_station_name, end_station_id, end_lat, end_lng))
Checking that the columns were correctly eliminated:
## tibble [5,674,399 × 12] (S3: tbl_df/tbl/data.frame)
## $ ride_id : chr [1:5674399] "8FE8F7D9C10E88C7" "34E4ED3ADF1D821B" "5296BF07A2F77CB5" "40759916B76D5D52" ...
## $ rideable_type : Factor w/ 3 levels "classic_bike",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ started_at : POSIXct[1:5674399], format: "2023-04-02 08:37:28" "2023-04-19 11:29:02" ...
## $ ended_at : POSIXct[1:5674399], format: "2023-04-02 08:41:37" "2023-04-19 11:52:12" ...
## $ start_station_name: chr [1:5674399] NA NA NA NA ...
## $ start_lat : num [1:5674399] 41.8 41.9 41.9 41.9 41.9 ...
## $ start_lng : num [1:5674399] -87.6 -87.7 -87.7 -87.7 -87.7 ...
## $ customer_type : Factor w/ 2 levels "casual","member": 2 2 2 2 2 2 2 2 2 2 ...
## $ ride_length_min : num [1:5674399] 4.15 23.17 2 3.65 4.83 ...
## $ month : Ord.factor w/ 12 levels "Jan"<"Feb"<"Mar"<..: 4 4 4 4 4 4 4 4 4 4 ...
## $ week_day : Ord.factor w/ 7 levels "Sun"<"Mon"<"Tue"<..: 1 4 4 4 4 4 4 3 3 4 ...
## $ hour : int [1:5674399] 8 11 8 13 12 12 9 16 16 17 ...
## # A tibble: 873,186 × 12
## ride_id rideable_type started_at ended_at
## <chr> <fct> <dttm> <dttm>
## 1 8FE8F7D9C10E88C7 electric_bike 2023-04-02 08:37:28 2023-04-02 08:41:37
## 2 34E4ED3ADF1D821B electric_bike 2023-04-19 11:29:02 2023-04-19 11:52:12
## 3 5296BF07A2F77CB5 electric_bike 2023-04-19 08:41:22 2023-04-19 08:43:22
## 4 40759916B76D5D52 electric_bike 2023-04-19 13:31:30 2023-04-19 13:35:09
## 5 77A96F460101AC63 electric_bike 2023-04-19 12:05:36 2023-04-19 12:10:26
## 6 8D6A2328E19DC168 electric_bike 2023-04-19 12:17:34 2023-04-19 12:21:38
## 7 C97BBA66E07889F9 electric_bike 2023-04-19 09:35:48 2023-04-19 09:45:00
## 8 6687AD4C575FF734 electric_bike 2023-04-11 16:13:43 2023-04-11 16:18:41
## 9 A8FA4F73B22BC11F electric_bike 2023-04-11 16:29:24 2023-04-11 16:40:23
## 10 81E158FE63D99994 electric_bike 2023-04-19 17:35:40 2023-04-19 17:36:11
## # ℹ 873,176 more rows
## # ℹ 8 more variables: start_station_name <chr>, start_lat <dbl>,
## # start_lng <dbl>, customer_type <fct>, ride_length_min <dbl>, month <ord>,
## # week_day <ord>, hour <int>
We can see that only the 873,186 NA values identified in the start_station_name column are present. I will leave this column as it is for now, since I have the exact location (latitude and longitude) for these stations and having all the names will not be necessary for my analysis.
All the data except for start_station_name is complete.
#### Descriptive analysis on ride length:
year %>%
  summarize(avg_ride_length = mean(ride_length_min), median_ride_length = median(ride_length_min), max_ride_length = max(ride_length_min), min_ride_length = min(ride_length_min))
## # A tibble: 1 × 4
## avg_ride_length median_ride_length max_ride_length min_ride_length
## <dbl> <dbl> <dbl> <dbl>
## 1 18.4 9.55 98489. -169.
This descriptive analysis indicates the probable presence of outliers and abnormal values, reflected in a negative minimum ride length and a very long maximum ride length. I will look for ride lengths of 0 minutes or less.
## # A tibble: 207 × 12
## ride_id rideable_type started_at ended_at
## <chr> <fct> <dttm> <dttm>
## 1 7A4D237E2C99D424 electric_bike 2023-04-04 17:15:08 2023-04-04 17:15:05
## 2 81E1C5175FA5A23D classic_bike 2023-04-19 14:47:18 2023-04-19 14:47:14
## 3 0063C3704F56EC55 electric_bike 2023-04-27 07:51:14 2023-04-27 07:51:09
## 4 DFC43BD5CB34ACBF electric_bike 2023-04-06 23:09:31 2023-04-06 23:00:35
## 5 934174DB8E2AD791 classic_bike 2023-05-29 17:34:21 2023-05-29 17:34:09
## 6 ED9038136686A88A electric_bike 2023-05-29 16:57:34 2023-05-29 16:57:27
## 7 06EC5ECAF8E26A2C electric_bike 2023-05-26 15:39:47 2023-05-26 15:38:17
## 8 F74E0B3EB302A3AE electric_bike 2023-05-26 15:38:53 2023-05-26 15:38:17
## 9 00AC4040E25E347E classic_bike 2023-05-07 15:54:58 2023-05-07 15:54:47
## 10 579596DD4C7C7538 classic_bike 2023-05-23 17:39:38 2023-05-23 17:39:35
## # ℹ 197 more rows
## # ℹ 8 more variables: start_station_name <chr>, start_lat <dbl>,
## # start_lng <dbl>, customer_type <fct>, ride_length_min <dbl>, month <ord>,
## # week_day <ord>, hour <int>
## # A tibble: 836 × 12
## ride_id rideable_type started_at ended_at
## <chr> <fct> <dttm> <dttm>
## 1 D0FBBEEF715FD098 classic_bike 2023-04-13 20:35:39 2023-04-13 20:35:39
## 2 183EFB828ABAEB6D electric_bike 2023-04-25 18:50:07 2023-04-25 18:50:07
## 3 ADBFE4E866050462 electric_bike 2023-04-18 07:25:25 2023-04-18 07:25:25
## 4 6BB162FF6B146FA1 electric_bike 2023-04-18 18:52:01 2023-04-18 18:52:01
## 5 F8063DB82D95B1CB classic_bike 2023-04-02 15:04:15 2023-04-02 15:04:15
## 6 1B97C99C3B8ACC83 classic_bike 2023-04-10 21:14:18 2023-04-10 21:14:18
## 7 7C7D88F336F55B87 electric_bike 2023-04-05 07:21:55 2023-04-05 07:21:55
## 8 15FD035AE63ED487 electric_bike 2023-04-09 06:29:48 2023-04-09 06:29:48
## 9 33CBBA7EBFD81651 electric_bike 2023-04-19 11:54:53 2023-04-19 11:54:53
## 10 57381E8364077D79 electric_bike 2023-04-22 16:20:59 2023-04-22 16:20:59
## # ℹ 826 more rows
## # ℹ 8 more variables: start_station_name <chr>, start_lat <dbl>,
## # start_lng <dbl>, customer_type <fct>, ride_length_min <dbl>, month <ord>,
## # week_day <ord>, hour <int>
Notably, there are 207 rides with a negative ride length and 836 with a length of 0. This pattern may correspond to instances when bikes were temporarily taken out of the docks by the company for quality checks or repairs.
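The filtering code behind the two tables above is not shown. As a minimal sketch, assuming a data frame with a ride_length_min column like the one used here (the toy values below are made up for illustration), negative and zero-length rides could be counted as follows:

```r
# Toy data standing in for the real `year` data frame (values are invented).
rides <- data.frame(
  ride_id = c("a", "b", "c", "d", "e"),
  ride_length_min = c(12.5, -0.05, 0, 7.2, 0)
)

# Rows with a negative duration (end time earlier than start time).
negative_rides <- rides[rides$ride_length_min < 0, ]

# Rows with a duration of exactly zero (start and end times identical).
zero_rides <- rides[rides$ride_length_min == 0, ]

n_negative <- nrow(negative_rides)
n_zero <- nrow(zero_rides)
```

Counting the two groups separately makes it easier to see whether the problem is mostly reversed timestamps or rides that never really started.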
Since I lack additional information about rides with negative values, and this data cannot be meaningfully analyzed alongside the data for normal bike use by members or casual customers, I will store this information in an object named ‘negative_ride_lengths’ in case I need it in the future, and exclude these items from the rest of the dataset.
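negative_ride_lengths <- year[year$ride_length_min <= 0, ]
# Logical indexing instead of -which(): with no matching rows, year[-which(...), ] would drop every row
year <- year[year$ride_length_min > 0, ]
Inspecting the data again to verify that the minimum ride length is greater than 0: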
year %>%
  summarize(avg_ride_length = mean(ride_length_min), median_ride_length = median(ride_length_min), max_ride_length = max(ride_length_min), min_ride_length = min(ride_length_min))
## # A tibble: 1 × 4
## avg_ride_length median_ride_length max_ride_length min_ride_length
## <dbl> <dbl> <dbl> <dbl>
## 1 18.4 9.55 98489. 0.0167
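One detail of the exclusion step is worth illustrating. Removing rows with a negative which() index behaves unexpectedly when no row matches the condition, because an empty negative index selects nothing. The sketch below uses a made-up data frame to compare that form with plain logical indexing:

```r
# Toy data with no non-positive ride lengths (values are invented).
rides <- data.frame(ride_length_min = c(3.5, 12, 48))

# Negative which() indexing: which() returns integer(0), and
# rides[-integer(0), ] is the same as rides[integer(0), ], i.e. zero rows.
dropped_with_which <- rides[-which(rides$ride_length_min <= 0), , drop = FALSE]

# Logical indexing keeps every row that satisfies the condition.
kept_with_logical <- rides[rides$ride_length_min > 0, , drop = FALSE]

nrow(dropped_with_which)  # 0: all rows were lost
nrow(kept_with_logical)   # 3: nothing was removed
```

In this dataset the condition did match rows, so both forms give the same result; the difference only matters when a filter finds nothing to remove.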
On the other hand, it’s worth noting that the maximum ride length is 98,489.07 minutes, which translates to 68.39 days. This is highly abnormal for Cyclistic service, given that, according to their website, a day pass allows for an unlimited number of three-hour rides within a 24-hour period. Additionally, for members, the first 45 minutes of a ride are free, with additional minutes incurring extra fees. Instances like this may be indicative of a stolen or lost bike, or potentially stem from data quality issues.
In order to quickly grasp and identify outliers in this scenario, I will create a boxplot.
Since these ride lengths are very long, I will convert them to hours before creating the boxplot.
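The conversion itself is not shown in the text. A minimal sketch, assuming that the name ride_length_hours used in the boxplot call refers to a plain numeric vector derived from ride_length_min (the sample values are illustrative, apart from the maximum quoted above):

```r
# Sample ride lengths in minutes; 98489.07 is the maximum reported above,
# the other values are invented for illustration.
ride_length_min <- c(9.55, 18.4, 1440, 98489.07)

# Minutes to hours.
ride_length_hours <- ride_length_min / 60

# Minutes to days, as a sanity check on the longest ride (roughly 68.4 days).
longest_ride_days <- max(ride_length_min) / (60 * 24)
```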
Creating the boxplot:
boxplot(ride_length_hours, horizontal = TRUE, main = "Ride length in hours from October 1st 2022 to September 30th 2023", xlab = "Hours", ylab = "Bike rides", notch = TRUE)
I can also use the IQR method to identify high outliers. Under this rule, a data point is considered a high outlier if it lies more than 1.5 times the interquartile range (IQR) above the third quartile, that is, above Q3 + 1.5 × IQR.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.02 5.43 9.55 18.42 17.00 98489.07
## [1] 11.56667
With Q3 = 17 and IQR = 11.56, the upper bound is Q3 + 1.5 × IQR = 17 + 1.5 × 11.56 = 34.34 minutes.
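As an illustration of the same rule, the sketch below applies it to a small invented vector rather than to the Cyclistic data. quantile() and IQR() are base R functions; the variable names are my own:

```r
# Invented ride lengths in minutes, with one obviously extreme value.
x <- c(2, 5, 6, 8, 9, 10, 12, 15, 17, 20, 300)

# Third quartile and interquartile range (default quantile type 7).
q3 <- unname(quantile(x, 0.75))
iqr <- IQR(x)

# Upper fence of the IQR rule: Q3 + 1.5 * IQR.
upper_fence <- q3 + 1.5 * iqr

# Values above the fence are flagged as high outliers.
high_outliers <- x[x > upper_fence]

upper_fence    # 29.5
high_outliers  # 300
```

The rule only flags candidates; whether a flagged value is an error or a genuine long ride still has to be judged from context.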
Upon examining the box plot, along with the statistical information provided by the summary function and the IQR method for detecting outliers, it can be concluded that values exceeding 34.34 minutes may be considered ‘abnormally high’ for the distribution of ride lengths in this dataset. Nevertheless, it is crucial to not hastily exclude data that, while appearing as outliers in the distribution, may actually reflect variations in how individuals use the bikes. For example, ride lengths for commuting may differ from those for leisure. Therefore, establishing a definitive cutoff between an outlier and a long ride is not straightforward.
However, considering the company’s policies, if a customer fails to return a bike within a 24-hour period, they may face a lost or stolen bike fee of $250. With this in mind, for the purposes of this analysis, I will be excluding trips with a duration exceeding 24 hours (1440 minutes) from the main dataset. I will store them separately in an object named ‘long_rides’ to perform an analysis on this data later.
## # A tibble: 5,946 × 12
## ride_id rideable_type started_at ended_at
## <chr> <fct> <dttm> <dttm>
## 1 A6EA2393A6E2EBA3 docked_bike 2023-04-15 15:29:11 2023-04-16 15:33:18
## 2 2E1F90AAF861B305 docked_bike 2023-04-15 15:30:15 2023-04-16 15:32:53
## 3 82133AC14BD86DDF docked_bike 2023-04-15 15:28:50 2023-04-16 15:32:33
## 4 C0D866D60D247389 docked_bike 2023-04-29 19:53:48 2023-05-01 04:48:13
## 5 667C60C2E29B3527 docked_bike 2023-04-11 11:57:32 2023-04-14 15:35:22
## 6 78DBBD82B3A9B783 classic_bike 2023-04-28 16:29:43 2023-04-29 17:29:35
## 7 30EB784FB59361E9 classic_bike 2023-04-15 19:35:51 2023-04-16 20:35:46
## 8 52F50C9221FBA751 docked_bike 2023-04-30 17:26:49 2023-05-01 18:26:50
## 9 8A647A93C6DD01BA classic_bike 2023-04-30 10:36:21 2023-05-01 11:36:16
## 10 98F35284E4F8321F classic_bike 2023-04-18 12:03:44 2023-04-19 13:03:40
## # ℹ 5,936 more rows
## # ℹ 8 more variables: start_station_name <chr>, start_lat <dbl>,
## # start_lng <dbl>, customer_type <fct>, ride_length_min <dbl>, month <ord>,
## # week_day <ord>, hour <int>
5,946 rows were stored in the long_rides object.
Now I check that the rows with ride lengths greater than 24 hours (1440 minutes) were successfully removed from the main dataset.
year %>%
  summarize(avg_ride_length = mean(ride_length_min), median_ride_length = median(ride_length_min), max_ride_length = max(ride_length_min), min_ride_length = min(ride_length_min))
## # A tibble: 1 × 4
## avg_ride_length median_ride_length max_ride_length min_ride_length
## <dbl> <dbl> <dbl> <dbl>
## 1 15.2 9.55 1440. 0.0167
First I will summarize the ride length by customer type, obtaining the mean, median, maximum and minimum ride length and a count of the trips.
ride_length_per_customer_type <- year %>%
  group_by(customer_type) %>%
  summarise(avg_ride_length = mean(ride_length_min), median_ride_length = median(ride_length_min), max_ride_length = max(ride_length_min), min_ride_length = min(ride_length_min),
            number_of_rides = n_distinct(ride_id), .groups = "drop")
ride_length_per_customer_type
## # A tibble: 2 × 6
## customer_type avg_ride_length median_ride_length max_ride_length
## <fct> <dbl> <dbl> <dbl>
## 1 casual 20.6 11.8 1440.
## 2 member 12.1 8.5 1440.
## # ℹ 2 more variables: min_ride_length <dbl>, number_of_rides <int>
Then I will create a bar chart to analyze the average ride length by customer type for rides longer than 0 minutes and shorter than 24 hours.
First I created and stored a theme to apply to the following charts for consistency.
mytheme <- theme(
  plot.title = element_text(family = "Arial", face = "bold", size = 15, colour = "#5A5A5A"),
  axis.title = element_text(family = "Arial", size = 10, colour = "#808080", hjust = 1, vjust = 0),
  axis.text = element_text(family = "Arial", size = 10, colour = "#808080"),
  legend.title = element_text(colour = "#808080", face = "bold", family = "Arial"),
  legend.text = element_text(colour = "#808080", family = "Arial"),
  plot.subtitle = element_text(colour = "#5A5A5A", family = "Arial"),
  plot.caption = element_text(colour = "#5A5A5A", family = "Arial")
)
Then I created a bar chart with ggplot2.
p1 <- ggplot(ride_length_per_customer_type, aes(x = customer_type, y = avg_ride_length, fill = c("Casual", "Member"))) +
  geom_col(width = 0.4) +
  scale_x_discrete(labels = c("Casual", "Member")) +
  scale_fill_manual(values = c("#0C2D48", "#B1D4E0")) +
  labs(y = "Average Ride Length in Minutes", x = "Customer Type", title = "Average Ride Length in Minutes by Customer Type", caption = "October 1st 2022 to September 30th 2023", subtitle = "The average ride length of casual customers is almost twice that of members, suggesting that the bikes\nmay be used for different purposes, such as leisure versus commuting", fill = "Customer Type") +
  mytheme +
  # Value labels placed inside each bar
  geom_text(data = NULL, label = "21 Min", y = 20, x = 1, colour = "white", size = 3.5, family = "Arial") +
  geom_text(data = NULL, label = "12 Min", y = 11.5, x = 2, colour = "white", size = 3.5, family = "Arial")
p1
The average ride length of casual customers is almost twice that of members (20.6 vs. 12.1 minutes), and the median ride length is 11.8 minutes for casual customers vs. 8.5 minutes for members.
Then, to analyze the number of rides by customer type for rides longer than 0 minutes and shorter than 24 hours, I will use another bar chart.
p2 <- ggplot(ride_length_per_customer_type, aes(x = customer_type, y = number_of_rides, fill = c("Casual", "Member"))) +
  geom_col(width = 0.4) +
  scale_x_discrete(labels = c("Casual", "Member")) +
  expand_limits(y = c(0, NA)) +
  scale_y_continuous(labels = unit_format(unit = "M", scale = 1e-6)) +
  scale_fill_manual(values = c("#0C2D48", "#B1D4E0")) +
  labs(y = "Number of Rides in Millions", x = "Customer Type", title = "Number of Rides by Customer Type", caption = "October 1st 2022 to September 30th 2023", subtitle = "Members use the bikes more often than casual customers.\n64% of the rides are taken by members, compared to 36% by casual customers.", fill = "Customer Type") +
  mytheme +
  # Value labels placed inside each bar
  geom_text(data = NULL, label = "2 Million", y = 2000000, x = 1, colour = "white", size = 3.5, family = "Arial") +
  geom_text(data = NULL, label = "3.5 Million", y = 3500000, x = 2, colour = "white", size = 3.5, family = "Arial")
p2
According to this data, members use the bikes more often than casual customers. 64% of the rides are taken by members, compared to 36% by casual customers.
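As a quick arithmetic check of these shares, the sketch below recomputes them from the rounded ride counts shown on the chart labels (about 2 million casual rides and 3.5 million member rides). The exact counts are not printed in the text, so these inputs are approximations:

```r
# Rounded ride counts taken from the chart labels (approximate values).
rides_by_type <- c(casual = 2.0e6, member = 3.5e6)

# Share of all rides per customer type, in percent.
share_pct <- round(100 * prop.table(rides_by_type))
share_pct
```

With these rounded inputs the shares come out at 36% for casual customers and 64% for members, in line with the figures quoted above.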
First I summarized the data by customer type and month and stored the result in the ‘monthly_rides_per_customer_type’ object:
monthly_rides_per_customer_type <- year %>%
  group_by(customer_type, month) %>%
  summarise(avg_ride_length = mean(ride_length_min), median_ride_length = median(ride_length_min), number_of_rides = n_distinct(ride_id))
## `summarise()` has grouped output by 'customer_type'. You can override using the
## `.groups` argument.
Then I changed the capitalization of the customer types to make them easier to work with in the next graph.
I also created a function named number_formatter to quickly and easily convert the units for the chart axes.
number_formatter <- function(x) {
  # Abbreviate numbers: thousands as "K", millions as "M"
  dplyr::case_when(
    x < 1e3 ~ as.character(x),
    x < 1e6 ~ paste0(as.character(x / 1e3), "K"),
    x < 1e9 ~ paste0(as.character(x / 1e6), "M"),
    TRUE ~ "To be implemented..."
  )
}
I used a stacked bar chart to illustrate the number of rides by customer type and month. I put the values inside every bar to make the chart easier to read. To avoid hard-coding the values, I read them from the summary table for every month and customer type, converted them to doubles, rounded them, and passed them to paste() inside a loop. The loop builds a geom_text() annotation for each value and adds it to the ggplot object for the chart.
p3 <- ggplot(monthly_rides_per_customer_type, aes(x = month, y = number_of_rides, fill = customer_type)) +
  geom_col() +
  scale_x_discrete(labels = c("Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec")) +
  expand_limits(y = c(0, NA)) +
  scale_y_continuous(labels = unit_format(unit = "K", scale = 1e-3)) +
  scale_fill_manual(values = c("#0C2D48", "#B1D4E0")) +
  # Red outline highlighting July and August
  annotate("rect", xmin = 6.5, xmax = 8.5, ymin = 0, ymax = 810000,
           alpha = .01, colour = "#880808") +
  labs(y = "Number of Rides in Thousands", x = "Month", title = "Number of Rides by Customer Type and Month", caption = "October 1st 2022 to September 30th 2023", subtitle = "Members and casual customers tend to use the bikes more frequently during the summer months, particularly\nin July and August, and significantly less during the colder months, from December to February", fill = "Customer Type") +
  mytheme
# Column 5 of the summary is number_of_rides; rows 1-12 hold the casual counts and rows 13-24 the member counts.
# Casual labels: centered in the upper segment of each stacked bar (member count + half of the casual count)
for (i in 1:12) {
  loop_input <- paste("geom_text(data = NULL, label = number_formatter(signif(as.double(", monthly_rides_per_customer_type[i,5], "), digits = 2)), y = ", monthly_rides_per_customer_type[i+12,5], " + ", (monthly_rides_per_customer_type[i,5]/2), ", x=", i, ", colour = 'white', size = 3, family = 'Arial')", sep = "")
  p3 <- p3 + eval(parse(text = loop_input))
}
# Member labels: centered in the lower segment of each stacked bar
for (i in 1:12) {
  loop_input <- paste("geom_text(data = NULL, label = number_formatter(signif(as.double(", monthly_rides_per_customer_type[i+12,5], "), digits = 2)), y = ", monthly_rides_per_customer_type[i+12,5]/2, ", x=", i, ", colour = 'white', size = 3, family = 'Arial')", sep = "")
  p3 <- p3 + eval(parse(text = loop_input))
}
p3
According to this data, members and casual customers tend to use the bikes more frequently during the summer months, particularly in July and August, and significantly less during the colder months, from December to February.
For this analysis, I created a line chart and used some annotations to make it easier to read and to focus the reader's attention on the important facts.
p4 <- ggplot(monthly_rides_per_customer_type, aes(x = month, y = avg_ride_length, group = customer_type, color = customer_type)) +
  geom_point(size = 1.5) +
  # linewidth replaces the size aesthetic for lines, deprecated since ggplot2 3.4.0
  geom_line(linewidth = 1.5) +
  labs(title = "Average Ride Length per Customer Type per Month", x = "Month", y = "Average Ride Length in Minutes", caption = "October 1st 2022 to September 30th 2023", subtitle = "Both members and casual customers take longer rides during the spring and summer months compared to the winter months.\nHowever, casual customers see a more significant increase in ride length (64%) compared to members (30%)", color = "Customer Type") +
  scale_color_manual(values = c("#0C2D48", "#B1D4E0")) +
  mytheme +
  # Shaded bands marking the spring and summer averages
  annotate("rect", xmin = 4.5, xmax = 9, ymin = 21, ymax = 23.5,
           alpha = .2) +
  annotate("rect", xmin = 4.5, xmax = 9, ymin = 12, ymax = 14,
           alpha = .2) +
  annotate("text", x = 7, y = 13.5, label = "13",
           alpha = .6, size = 3) +
  annotate("text", x = 7, y = 23.2, label = "23",
           alpha = .6, size = 3) +
  # Winter values for comparison
  annotate("text", x = 1, y = 10.4, label = "10",
           alpha = .8, size = 3, color = "#880808") +
  annotate("text", x = 0.9, y = 14.2, label = "14",
           alpha = .8, size = 3, color = "#880808")
Both members and casual customers take longer rides during the spring and summer months compared to the winter months. However, casual customers see a more significant increase in ride length (64%) compared to members (30%), which might reflect a different purpose for their rides, like commuting vs leisure or tourism.
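The percentages quoted above can be reproduced from the rounded values annotated on the chart (winter averages of about 14 and 10 minutes, summer averages of about 23 and 13 minutes). The helper pct_increase below is a hypothetical name introduced only for this check:

```r
# Percentage increase from a baseline value to a new value.
pct_increase <- function(from, to) {
  100 * (to - from) / from
}

# Rounded averages read from the chart annotations (minutes).
casual_increase <- pct_increase(from = 14, to = 23)
member_increase <- pct_increase(from = 10, to = 13)

round(casual_increase)  # 64
round(member_increase)  # 30
```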
First, I created a summary of the data by customer type and day of the week and stored it in the day_customer_type object.
day_customer_type <- year %>%
  group_by(customer_type, week_day) %>%
  summarise(number_of_rides = n_distinct(ride_id))
## `summarise()` has grouped output by 'customer_type'. You can override using the
## `.groups` argument.
Then I changed the capitalization of the customer types to make them easier to work with in the next graph.
I created and stored the labels for the axis, legends and annotations to pass them later into the ggplot function.
p5_labels <- data.frame(
  label = c("428K", "575K"),
  customer_type = c("Casual", "Member"),
  # Columns are named with `=`; `<-` here would create oddly named columns
  x = c("Sat", "Thu"),
  y = c(415000, 565000)
)
p5_labels
## label customer_type x y
## 1 428K Casual Sat 415000
## 2 575K Member Thu 565000
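When building a data frame like this one, column names are set with `=` inside data.frame(). Using `<-` there is an easy slip: R then treats the argument as an unnamed expression, derives a mangled column name from it, and also creates a variable of that name in the calling environment. A minimal sketch with made-up values (x_side_effect is a name chosen only for this demonstration):

```r
# Columns named with `=`: the names are exactly as written.
good <- data.frame(
  label = c("A", "B"),
  x = c("Sat", "Thu")
)
names(good)  # "label" "x"

# Second column passed with `<-`: the assignment runs in the calling
# environment and the column gets a name derived from the whole expression.
bad <- data.frame(
  label = c("A", "B"),
  x_side_effect <- c("Sat", "Thu")
)
names(bad)               # the second name is not "x_side_effect"
exists("x_side_effect")  # TRUE: the variable leaked into the workspace
```

A plot that maps aes(x = x, y = y) against such a data frame may still appear to work, but only because ggplot2 falls back to the leaked variables rather than to columns of the data.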
Then I created two bar charts, one per customer type, using the facet_wrap function.
p5 <- ggplot(day_customer_type, aes(x = week_day, y = number_of_rides, fill = customer_type)) +
  geom_col(width = 0.7) +
  scale_fill_manual(values = c("#0C2D48", "#B1D4E0")) +
  expand_limits(y = c(0, NA)) +
  scale_y_continuous(labels = unit_format(unit = "K", scale = 1e-3)) +
  # Peak-day labels, one per facet, taken from p5_labels
  geom_text(
    data = p5_labels,
    mapping = aes(x = x, y = y, label = label), colour = "white", family = "Arial", size = 3) +
  labs(y = "Number of Rides in Thousands", x = "Day", title = "Number of Rides by Customer Type and Day of the Week", caption = "October 1st 2022 to September 30th 2023", subtitle = "Members tend to use the bikes more frequently on weekdays, particularly from Tuesday to Thursday,\nwhile casual customers show a higher usage on weekends. This observation may indicate distinct\npurposes for bike rides, such as commuting versus leisure", fill = "Customer Type") +
  facet_wrap(~customer_type) +
  mytheme
p5
According to this data, members tend to use the bikes more frequently on weekdays, particularly from Tuesday to Thursday, while casual customers show a higher usage on weekends. This observation may also indicate distinct purposes for bike rides, such as commuting versus leisure or tourism.
First, I created a summary of the bike trips by customer type and hour and stored it in the hour_rides_per_customer_type object.
hour_rides_per_customer_type <- year %>%
  group_by(customer_type, hour) %>%
  summarise(avg_ride_length = mean(ride_length_min), number_of_rides = n_distinct(ride_id))
## `summarise()` has grouped output by 'customer_type'. You can override using the
## `.groups` argument.
Then I changed the capitalization of the customer types to make them easier to work with in the next graph.
Then I created a line chart since I wanted to analyze changes in trends over time. I created a few annotations so the reader can quickly spot relevant numbers.
p6 <- ggplot(hour_rides_per_customer_type, aes(x = hour, y = number_of_rides, color = customer_type)) +
  # linewidth replaces the size aesthetic for lines, deprecated since ggplot2 3.4.0
  geom_line(linewidth = 1.5) +
  expand_limits(y = c(0, NA)) +
  scale_y_continuous(labels = unit_format(unit = "K", scale = 1e-3)) +
  labs(title = "Number of Rides per Customer Type per Hour", x = "Hour", y = "Number of Rides in Thousands", caption = "October 1st 2022 to September 30th 2023", subtitle = "Casual customers tend to use the bikes more frequently in the afternoon hours, with a peak at 5 pm.\nIn contrast, members exhibit their highest ridership during the morning and afternoon rush hours, suggesting that\nmembers might be primarily using the bikes for commuting", color = "Customer Type") +
  scale_color_manual(values = c("#0C2D48", "#B1D4E0")) +
  # Peak values: members at 8 am and 5 pm, casual customers at 5 pm
  annotate("text", x = 8, y = 242000, label = "236K",
           alpha = .6, size = 3) +
  annotate("text", x = 17, y = 385000, label = "378K",
           alpha = .6, size = 3) +
  annotate("text", x = 17, y = 209000, label = "201K",
           alpha = .6, size = 3) +
  mytheme
p6
Casual customers tend to use the bikes more frequently in the afternoon hours, with a peak at 5 pm. In contrast, members exhibit their highest ridership during the morning and afternoon rush hours, suggesting that members might be primarily using the bikes for commuting.
I started this section by summarizing the data on rideable type by customer type and stored it in an object.
rideable_type <- year %>%
  group_by(customer_type, rideable_type) %>%
  summarise(number_of_rides = n_distinct(ride_id), avg_ride_length = mean(ride_length_min))
## `summarise()` has grouped output by 'customer_type'. You can override using the
## `.groups` argument.
Then I changed the capitalization of the customer types to make them easier to work with in the next graph.
Then I created a bar chart to illustrate how often each customer type uses the different types of bikes.
p7 <- ggplot(rideable_type, aes(rideable_type, number_of_rides, fill = customer_type)) +
  geom_col(position = "dodge", width = 0.4) +
  expand_limits(y = c(0, NA)) +
  scale_fill_manual(values = c("#0C2D48", "#B1D4E0")) +
  scale_y_continuous(labels = unit_format(unit = "M", scale = 1e-6)) +
  scale_x_discrete(labels = c("Classic bike", "Electric bike", "Docked bike")) +
  labs(y = "Number of Rides in Millions", x = "Bike Type", title = "Number of Rides per Bike and Customer Type", caption = "October 1st 2022 to September 30th 2023", subtitle = "Casual customers and members both use electric bikes more frequently than classic bikes.\nHowever, a third class of bike shows up in the data, 'Docked Bike', which is used only by casual customers\nand substantially less often. This observation needs further analysis.", fill = "Customer Type") +
  mytheme +
  # Red arrow pointing at the docked bike bar
  geom_segment(aes(x = 3, y = 250000, xend = 3, yend = 110000), colour = "#880808",
               arrow = arrow(length = unit(0.5, "cm"))) +
  geom_text(data = NULL, label = "834K", y = 790000, x = 0.9, colour = "white", size = 3.5, family = "Arial") +
  geom_text(data = NULL, label = "1.7 M", y = 1700000, x = 1.1, colour = "white", size = 3.5, family = "Arial") +
  geom_text(data = NULL, label = "1.1 M", y = 1110000, x = 1.9, colour = "white", size = 3.5, family = "Arial") +
  geom_text(data = NULL, label = "1.8 M", y = 1810000, x = 2.1, colour = "white", size = 3.5, family = "Arial") +
  geom_text(data = NULL, label = "96 K", y = 50000, x = 3, colour = "white", size = 3.5, family = "Arial")
p7
Casual customers and members both use electric bikes more frequently than classic bikes. However, a third class of bike shows up in the data, ‘Docked Bike’, which is used only by casual customers and substantially less often. This observation needs further analysis, since it is not expected in this data.
For this, I created an object with the information needed to take a closer look at the length of the rides by bike type.
bike_type <- year %>%
  group_by(rideable_type) %>%
  summarise(avg_ride_length = mean(ride_length_min))
bike_type
## # A tibble: 3 × 2
## rideable_type avg_ride_length
## <fct> <dbl>
## 1 classic_bike 17.0
## 2 electric_bike 12.4
## 3 docked_bike 54.0
We can see that trips with docked bikes are very long on average. I created a bar chart to compare them with the average ride lengths for the other types of bikes.
p8 <- ggplot(bike_type, aes(rideable_type, avg_ride_length, fill = rideable_type)) +
  # Docked bikes are highlighted in red
  geom_col(width = 0.4, fill = c("#005b96", "#005b96", "#880808")) +
  scale_x_discrete(labels = c("Classic bike", "Electric bike", "Docked bike")) +
  labs(y = "Average Ride Length in Minutes", x = "Bike Type", title = "Average Ride Length by Rideable Type", caption = "October 1st 2022 to September 30th 2023", subtitle = "The average ride length for docked bikes is more than three times longer than that of electric or classic bikes.\nFurthermore, the fact that members have recorded zero trips with docked bikes, compared to 4.56% of the\ntotal trips taken by casual customers, may suggest a potential error in these entries.", fill = "Bike Type") +
  mytheme +
  geom_text(data = NULL, label = "17 Min", y = 16, x = 1, colour = "white", size = 3.5, family = "Arial") +
  geom_text(data = NULL, label = "12 Min", y = 11.2, x = 2, colour = "white", size = 3.5, family = "Arial") +
  geom_text(data = NULL, label = "54 Min", y = 53, x = 3, colour = "white", size = 3.5, family = "Arial")
p8
The average ride length for docked bikes is more than three times longer than that of electric or classic bikes. Furthermore, the fact that members have recorded zero trips with docked bikes, while docked bikes account for 4.56% of the total trips taken by casual customers, may suggest a potential error in these entries. For instance, the trip may have ended and the bike been docked back at the station, but the recording of the trip length did not stop for some reason.
For this analysis I came back to the previously stored information with rides longer than 24 hours. Then I summarized it by ride length and number of rides.
long_rides_bike <- long_rides %>%
  group_by(rideable_type) %>%
  summarize(number_of_rides = n_distinct(ride_id), avg_ride_length = mean(ride_length_min))
Since these rides are long, I converted them from minutes to hours.
Then I changed the capitalization of the bike types so I can use them nicely in a table.
Since with this data I want to compare just a couple of numbers, a simple table is the right fit. I used the gt function from the gt package to give a nice format to this table.
t1 <- long_rides_bike %>%
  gt() %>%
  tab_header(
    title = md("Bike Rides Longer than 24 Hours")) %>%
  tab_source_note(md("October 1st 2022 to September 30th 2023")) %>%
  cols_label(
    rideable_type = "Bike Type",
    number_of_rides = "Number of Rides",
    avg_ride_length = "Average Ride Length (Hours)"
  ) %>%
  opt_stylize(style = 6, color = "blue")
cols_align(t1,
  align = c("center"),
  columns = everything()
)
Bike Rides Longer than 24 Hours

| Bike Type | Number of Rides | Average Ride Length (Hours) |
|---|---|---|
| Classic Bike | 4207 | 24.98829 |
| Docked Bike | 1739 | 116.35724 |

October 1st 2022 to September 30th 2023
According to this data, there are no rides longer than 24 hours with electric bikes, 4207 with classic bikes, and 1739 with docked bikes. Notably, for rides longer than 24 hours, the average ride length for docked bikes is substantially longer, at 116 hours, compared to about 25 hours for classic bikes. This further suggests that there may be errors in the docked bike entries.
As I mentioned earlier, I have complete information for the starting locations in terms of latitude and longitude, whereas the station name column is incomplete.
I started by storing this information in an object, grouped by customer type and location, then summarizing it by number of rides departing from each location, and then sorting the information in descending order.
stations <- year %>%
  select(ride_id, start_station_name, start_lat, start_lng, customer_type) %>%
  group_by(customer_type, start_lat, start_lng, start_station_name) %>%
  summarise(number_of_rides = n_distinct(ride_id)) %>%
  arrange(desc(number_of_rides))
## `summarise()` has grouped output by 'customer_type', 'start_lat', 'start_lng'.
## You can override using the `.groups` argument.
## # A tibble: 2,072,795 × 5
## # Groups: customer_type, start_lat, start_lng [2,070,521]
## customer_type start_lat start_lng start_station_name number_of_rides
## <fct> <dbl> <dbl> <chr> <int>
## 1 casual 41.9 -87.6 Streeter Dr & Grand Ave 33902
## 2 casual 41.9 -87.6 DuSable Lake Shore Dr & Mo… 22256
## 3 member 41.9 -87.6 <NA> 15853
## 4 member 41.9 -87.6 Clark St & Elm St 15604
## 5 member 41.8 -87.6 Ellis Ave & 60th St 15350
## 6 member 41.8 -87.6 University Ave & 57th St 15259
## 7 member 41.9 -87.6 Kingsbury St & Kinzie St 14462
## 8 casual 41.9 -87.6 DuSable Lake Shore Dr & No… 13908
## 9 member 41.9 -87.6 <NA> 13427
## 10 casual 41.9 -87.6 Michigan Ave & Oak St 13325
## # ℹ 2,072,785 more rows
Then I filtered the information by customer type in order to create a map of the 20 most frequently used starting locations for each customer type.
Then I selected the 20 most frequently used locations by casual customers.
## # A tibble: 20 × 5
## # Groups: customer_type, start_lat, start_lng [20]
## customer_type start_lat start_lng start_station_name number_of_rides
## <fct> <dbl> <dbl> <chr> <int>
## 1 casual 41.9 -87.6 Streeter Dr & Grand Ave 33902
## 2 casual 41.9 -87.6 DuSable Lake Shore Dr & Mo… 22256
## 3 casual 41.9 -87.6 DuSable Lake Shore Dr & No… 13908
## 4 casual 41.9 -87.6 Michigan Ave & Oak St 13325
## 5 casual 41.9 -87.6 Theater on the Lake 11512
## 6 casual 41.9 -87.6 Millennium Park 11043
## 7 casual 41.9 -87.6 Dusable Harbor 10408
## 8 casual 41.9 -87.6 Shedd Aquarium 9975
## 9 casual 41.9 -87.6 <NA> 8802
## 10 casual 42.0 -87.7 <NA> 8310
## 11 casual 41.9 -87.6 Adler Planetarium 8281
## 12 casual 41.9 -87.6 <NA> 7790
## 13 casual 41.9 -87.6 <NA> 7707
## 14 casual 41.9 -87.6 Indiana Ave & Roosevelt Rd 7643
## 15 casual 41.9 -87.6 <NA> 7613
## 16 casual 41.9 -87.6 Michigan Ave & 8th St 7512
## 17 casual 42.0 -87.6 Montrose Harbor 7323
## 18 casual 41.9 -87.6 Clark St & Lincoln Ave 6800
## 19 casual 41.9 -87.6 Wells St & Concord Ln 6599
## 20 casual 41.9 -87.6 Clark St & Armitage Ave 6501
I repeated the same steps for members: first filtering the information, then selecting the 20 locations most frequently used by members.
## # A tibble: 20 × 5
## # Groups: customer_type, start_lat, start_lng [20]
## customer_type start_lat start_lng start_station_name number_of_rides
## <fct> <dbl> <dbl> <chr> <int>
## 1 member 41.9 -87.6 <NA> 15853
## 2 member 41.9 -87.6 Clark St & Elm St 15604
## 3 member 41.8 -87.6 Ellis Ave & 60th St 15350
## 4 member 41.8 -87.6 University Ave & 57th St 15259
## 5 member 41.9 -87.6 Kingsbury St & Kinzie St 14462
## 6 member 41.9 -87.6 <NA> 13427
## 7 member 41.9 -87.6 <NA> 13021
## 8 member 41.9 -87.6 Clinton St & Washington Bl… 12602
## 9 member 41.9 -87.6 Streeter Dr & Grand Ave 12232
## 10 member 41.9 -87.6 <NA> 12034
## 11 member 41.8 -87.6 Ellis Ave & 55th St 12013
## 12 member 41.9 -87.6 <NA> 12003
## 13 member 41.9 -87.6 Wells St & Concord Ln 11884
## 14 member 41.9 -87.6 Wells St & Elm St 11767
## 15 member 41.9 -87.6 Broadway & Barry Ave 11342
## 16 member 41.9 -87.6 <NA> 11214
## 17 member 41.9 -87.6 <NA> 11009
## 18 member 41.9 -87.6 DuSable Lake Shore Dr & No… 10757
## 19 member 41.9 -87.6 State St & Chicago Ave 10718
## 20 member 41.8 -87.6 <NA> 10623
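The filtering and top-20 selection are described above without their code. A minimal base R sketch of that kind of step, using an invented summary table and a hypothetical helper top_n_locations (neither the helper nor the toy values come from the original notebook):

```r
# Invented summary table: one row per customer type and start location.
stations <- data.frame(
  customer_type = c("casual", "casual", "casual", "member", "member", "member"),
  start_station_name = c("A", "B", "C", "A", "D", "E"),
  number_of_rides = c(120, 300, 45, 210, 80, 150),
  stringsAsFactors = FALSE
)

# Keep one customer type, sort by ride count (descending), take the top n.
top_n_locations <- function(data, type, n) {
  subset_rows <- data[data$customer_type == type, ]
  ordered_rows <- subset_rows[order(-subset_rows$number_of_rides), ]
  head(ordered_rows, n)
}

stations_casual <- top_n_locations(stations, "casual", n = 2)
stations_member <- top_n_locations(stations, "member", n = 2)

stations_casual$start_station_name  # "B" "A"
stations_member$start_station_name  # "A" "E"
```

With the real data, n would be 20 for the maps and 10 for the comparison tables.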
To visualize this data, I created a map for each customer type:
map_casual <- mapview(stations_casual, xcol = "start_lng", ycol = "start_lat", crs = 4269, grid = FALSE)
map_casual
map_member <- mapview(stations_member, xcol = "start_lng", ycol = "start_lat", crs = 4269, grid = FALSE)
map_member
To make this data easier to compare, I created two tables with the 10 most frequently used departure locations by customer type, and assigned provisional names with consecutive numbers, such as “Station 1” and “Station 2”, to the locations whose station name is not available in the data.
Selecting the top ten most frequently used locations for each customer type:
Adding provisional names:
stations_member[1, 4] <- "Station 1"
stations_member[6, 4] <- "Station 2"
stations_member[7, 4] <- "Station 3"
stations_member[10, 4] <- "Station 4"
stations_casual[9, 4] <- "Station 5"
stations_casual[10, 4] <- "Station 6"
Removing the customer type from each table, since it is not necessary because the data is already filtered.
## # A tibble: 10 × 4
## # Groups: start_lat, start_lng [10]
## start_lat start_lng start_station_name number_of_rides
## <dbl> <dbl> <chr> <int>
## 1 41.9 -87.6 Streeter Dr & Grand Ave 33902
## 2 41.9 -87.6 DuSable Lake Shore Dr & Monroe St 22256
## 3 41.9 -87.6 DuSable Lake Shore Dr & North Blvd 13908
## 4 41.9 -87.6 Michigan Ave & Oak St 13325
## 5 41.9 -87.6 Theater on the Lake 11512
## 6 41.9 -87.6 Millennium Park 11043
## 7 41.9 -87.6 Dusable Harbor 10408
## 8 41.9 -87.6 Shedd Aquarium 9975
## 9 41.9 -87.6 Station 5 8802
## 10 42.0 -87.7 Station 6 8310
## # A tibble: 10 × 4
## # Groups: start_lat, start_lng [10]
## start_lat start_lng start_station_name number_of_rides
## <dbl> <dbl> <chr> <int>
## 1 41.9 -87.6 Station 1 15853
## 2 41.9 -87.6 Clark St & Elm St 15604
## 3 41.8 -87.6 Ellis Ave & 60th St 15350
## 4 41.8 -87.6 University Ave & 57th St 15259
## 5 41.9 -87.6 Kingsbury St & Kinzie St 14462
## 6 41.9 -87.6 Station 2 13427
## 7 41.9 -87.6 Station 3 13021
## 8 41.9 -87.6 Clinton St & Washington Blvd 12602
## 9 41.9 -87.6 Streeter Dr & Grand Ave 12232
## 10 41.9 -87.6 Station 4 12034
Subsequently, I gave the columns of each table a more reader-friendly name:
colnames(stations_casual) <- c("Latitud", "Longitud", "Station Name", "Number of Rides")
colnames(stations_casual)
## [1] "Latitud" "Longitud" "Station Name" "Number of Rides"
colnames(stations_member) <- c("Latitud", "Longitud", "Station Name", "Number of Rides")
colnames(stations_member)
## [1] "Latitud" "Longitud" "Station Name" "Number of Rides"
Lastly, I used the formattable function from the ‘formattable’ package to quickly add some formatting to these tables, shown first for casual customers and then for members:
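The exact call is not visible in the rendered output. A minimal sketch of how `formattable()` can wrap a data frame, using a toy two-row table (`top_casual_demo` is a hypothetical name, and the `align` choice is an assumption rather than the setting used here):

```r
library(formattable)

# Toy two-row table reusing values shown in this notebook.
top_casual_demo <- data.frame(
  `Station Name`    = c("Streeter Dr & Grand Ave",
                        "DuSable Lake Shore Dr & Monroe St"),
  `Number of Rides` = c(33902, 22256),
  check.names = FALSE   # keep the reader-friendly names with spaces
)

# formattable() wraps the data frame so it renders as a styled table;
# align sets left alignment for names and right alignment for counts.
top_casual_formatted <- formattable(top_casual_demo, align = c("l", "r"))
top_casual_formatted
```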
| Latitud | Longitud | Station Name | Number of Rides |
|---|---|---|---|
| 41.89228 | -87.61204 | Streeter Dr & Grand Ave | 33902 |
| 41.88096 | -87.61674 | DuSable Lake Shore Dr & Monroe St | 22256 |
| 41.91172 | -87.62680 | DuSable Lake Shore Dr & North Blvd | 13908 |
| 41.90096 | -87.62378 | Michigan Ave & Oak St | 13325 |
| 41.92628 | -87.63083 | Theater on the Lake | 11512 |
| 41.88103 | -87.62408 | Millennium Park | 11043 |
| 41.88698 | -87.61281 | Dusable Harbor | 10408 |
| 41.86723 | -87.61536 | Shedd Aquarium | 9975 |
| 41.91000 | -87.63000 | Station 5 | 8802 |
| 41.95000 | -87.66000 | Station 6 | 8310 |

Top 10 departure locations for members:

| Latitud | Longitud | Station Name | Number of Rides |
|---|---|---|---|
| 41.89000 | -87.63000 | Station 1 | 15853 |
| 41.90297 | -87.63128 | Clark St & Elm St | 15604 |
| 41.78510 | -87.60107 | Ellis Ave & 60th St | 15350 |
| 41.79148 | -87.59986 | University Ave & 57th St | 15259 |
| 41.88918 | -87.63851 | Kingsbury St & Kinzie St | 14462 |
| 41.90000 | -87.63000 | Station 2 | 13427 |
| 41.88000 | -87.63000 | Station 3 | 13021 |
| 41.88338 | -87.64117 | Clinton St & Washington Blvd | 12602 |
| 41.89228 | -87.61204 | Streeter Dr & Grand Ave | 12232 |
| 41.94000 | -87.65000 | Station 4 | 12034 |
Among the most frequently used departure locations for casual customers is Streeter Dr & Grand Ave, near Jane Addams Memorial Park, the Lake Point Tower apartment building, the Chicago Children’s Museum, and Milton Lee Olive Park. Another is DuSable Lake Shore Dr & Monroe St, near the Slide Crater playground, the Monroe Harbor - North Harbor Public Dock pier, Millennium Park, Maggie Daley Park, and Grant Park.
Among the most frequently used departure locations for members is the station at latitude 41.89000, longitude -87.63000, provisionally named “Station 1”, in the River North neighborhood of downtown Chicago. This neighborhood is well known for its dozens of high-end art and design galleries, dining establishments, and nightlife spots. Most residents of the area rent their homes, and many are young professionals. Another is Clark St & Elm St, in the Near North Side community area of Chicago, also known for its fine dining, galleries, nightlife, and riverwalk amenities.
Since the information about the end of the trips is incomplete, this section of the analysis is intended only to give a general idea of how each type of customer uses the bikes.
The records with negative ride lengths need to be reviewed. If they correspond to errors in recording the trip times, the errors need to be fixed. If they correspond to bikes taken out for quality checks or repairs, they should be recorded as such and not as customer trips.
The information about the docked bikes also needs to be reviewed, especially why these ride lengths are so long and why they were taken only by casual customers, which doesn’t make sense.
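The review of negative ride lengths suggested above could start by isolating the affected records. A minimal sketch on toy data, assuming trip timestamps in columns named `started_at` and `ended_at` (`trips_demo`, `ride_length_min`, and `negative_rides` are hypothetical names):

```r
library(dplyr)

# Toy trips; the second one ends before it starts.
trips_demo <- tibble(
  ride_id    = c("a", "b", "c"),
  started_at = as.POSIXct(c("2023-05-01 08:00:00",
                            "2023-05-01 09:00:00",
                            "2023-05-01 10:00:00"), tz = "UTC"),
  ended_at   = as.POSIXct(c("2023-05-01 08:15:00",
                            "2023-05-01 08:50:00",
                            "2023-05-01 10:05:00"), tz = "UTC")
)

# Ride length in minutes; negative values flag suspicious records.
trips_demo <- trips_demo |>
  mutate(ride_length_min = as.numeric(difftime(ended_at, started_at, units = "mins")))

negative_rides <- trips_demo |>
  filter(ride_length_min < 0)

negative_rides
```

Keeping the flagged rows in a separate table makes it possible to inspect them before deciding whether to correct or exclude them.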
The data shows that members use the bikes primarily on weekdays, with peaks at 8 am and 5 pm, while casual customers use the bikes more often on weekend afternoons, with a peak at 5 pm. Members usually take shorter rides than casual customers. This suggests that members use the bikes primarily for commuting, while casual customers might be using them for leisure rides or tourism. A marketing campaign could therefore target young professionals interested in a transportation alternative for commuting.
Advertising resources should be concentrated in the spring and summer months and kept out of the cold-weather months, when demand for bikes is very low.
Casual customers use the bikes more often during the weekends. To convert casual customers into members, one possibility is to offer these customers additional perks, such as a weekend-only membership.