In this case, I am assuming the role of a junior data analyst working in the marketing and analytics team at Cyclistic, a bike-share company in Chicago with two types of customers. The director of marketing believes the company’s future success depends on maximizing the number of annual memberships. Customers who buy single-ride or full-day passes are known as casual riders, while the customers who purchase annual memberships are known as members. Therefore, the team wants to understand how casual riders and annual members use Cyclistic bikes differently. From these insights, the team will design a new marketing strategy to convert casual riders into annual members. But first, Cyclistic executives must approve the recommendations, so they must be backed up with compelling data insights and professional data visualizations. Characters and teams.
In this step, I asked myself questions such as ‘How do annual members and casual riders use Cyclistic bikes differently?’ to guide my analysis. Lily Moreno believes that the annual Members were more profitable than the casual riders. The goal before me is to assist the marketing team analyze the contrast between casual riders and members with the aim of providing recommendations that will aid the conversion of casual riders into annual members to maximize profits.
A bike-share program that features more than 5,800 bicycles and 600 docking stations. Cyclistic sets itself apart by also offering reclining bikes, hand tricycles, and cargo bikes, making bike-share more inclusive to people with disabilities and riders who can’t use a standard two-wheeled bike. The majority of riders opt for traditional bikes; about 8% of riders use the assistive options. Cyclistic users are more likely to ride for leisure, but about 30% use them to commute to work each day.
Lily Moreno: The director of marketing and your manager. Moreno is responsible for the development of campaigns and initiatives to promote the bike-share program. These may include email, social media, and other channels.
Cyclistic marketing analytics team: A team of data analysts who are responsible for collecting, analyzing, and reporting data that helps guide Cyclistic marketing strategy. You joined this team six months ago and have been busy learning about Cyclistic’s mission and business goals — as well as how you, as a junior data analyst, can help Cyclistic achieve them.
Cyclistic executive team: The notoriously detail-oriented executive team will decide whether to approve the recommended marketing program
Data analysis is a crucial step in solving problems. It is the process of collecting, organizing, and interpreting data to answer questions and reach conclusions. By analyzing data, businesses, organizations, and individuals can identify patterns and trends that can help them make informed decisions and improve their performance. This will involve sorting the data into categories and performing calculations or visualizing it with charts and graphs. I will also examine the data for any outliers or inconsistencies that may indicate a problem. After the data has been analyzed, it is important to interpret the results. This involves examining the data to identify any patterns or trends that may be present. It is also important to consider any potential causes of the problem, such as internal processes or external factors.
The data used for this analysis is located at (https://divvy-tripdata.s3.amazonaws.com/index.html), The data is made available under this (https://ride.divvybikes.com/data-license-agreement) I downloaded the most recent 12 months from February 2022 to January 2023 and stored them locally. This version of my analysis was done using R, spreadsheet, and Tableau. The dataset has no user ID or any specific commuter identifier, so the data on duplicate riders are not present.
Given the size of the dataset after inspection with Excel, I decided on using R for analysis and specifically R studio.
library(tidyverse)
library(ggplot2)
library(lubridate)
library(skimr)
feb_2022 <- read_csv("2022_02-divvy-tripdata.csv")
## Rows: 115609 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
mar_2022 <- read_csv("2022_03-divvy-tripdata.csv")
## Rows: 284042 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
apr_2022 <- read_csv("2022_04-divvy-tripdata.csv")
## Rows: 371249 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
may_2022 <- read_csv("2022_05-divvy-tripdata.csv")
## Rows: 634858 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
jun_2022 <- read_csv("2022_06-divvy-tripdata.csv")
## Rows: 769204 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
jul_2022 <- read_csv("2022_07-divvy-tripdata.csv")
## Rows: 823488 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
aug_2022 <- read_csv("2022_08-divvy-tripdata.csv")
## Rows: 785932 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
sep_2022 <- read_csv("2022_09-divvy-tripdata.csv")
## Rows: 701339 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
oct_2022 <- read_csv("2022_10-divvy-tripdata.csv")
## Rows: 558685 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
nov_2022 <- read_csv("2022_11-divvy-tripdata.csv")
## Rows: 337735 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
dec_2022 <- read_csv("2022_12-divvy-tripdata.csv")
## Rows: 181806 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
jan_2023 <- read_csv("2023_01-divvy-tripdata.csv")
## Rows: 190301 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
View to make sure all have the same columns and combine the 12 dataframes into one dataframe and give it a name.
colnames(feb_2022)
## [1] "ride_id" "rideable_type" "started_at"
## [4] "ended_at" "start_station_name" "start_station_id"
## [7] "end_station_name" "end_station_id" "start_lat"
## [10] "start_lng" "end_lat" "end_lng"
## [13] "member_casual"
colnames(mar_2022)
## [1] "ride_id" "rideable_type" "started_at"
## [4] "ended_at" "start_station_name" "start_station_id"
## [7] "end_station_name" "end_station_id" "start_lat"
## [10] "start_lng" "end_lat" "end_lng"
## [13] "member_casual"
colnames(apr_2022)
## [1] "ride_id" "rideable_type" "started_at"
## [4] "ended_at" "start_station_name" "start_station_id"
## [7] "end_station_name" "end_station_id" "start_lat"
## [10] "start_lng" "end_lat" "end_lng"
## [13] "member_casual"
colnames(may_2022)
## [1] "ride_id" "rideable_type" "started_at"
## [4] "ended_at" "start_station_name" "start_station_id"
## [7] "end_station_name" "end_station_id" "start_lat"
## [10] "start_lng" "end_lat" "end_lng"
## [13] "member_casual"
colnames(jun_2022)
## [1] "ride_id" "rideable_type" "started_at"
## [4] "ended_at" "start_station_name" "start_station_id"
## [7] "end_station_name" "end_station_id" "start_lat"
## [10] "start_lng" "end_lat" "end_lng"
## [13] "member_casual"
colnames(jul_2022)
## [1] "ride_id" "rideable_type" "started_at"
## [4] "ended_at" "start_station_name" "start_station_id"
## [7] "end_station_name" "end_station_id" "start_lat"
## [10] "start_lng" "end_lat" "end_lng"
## [13] "member_casual"
colnames(aug_2022)
## [1] "ride_id" "rideable_type" "started_at"
## [4] "ended_at" "start_station_name" "start_station_id"
## [7] "end_station_name" "end_station_id" "start_lat"
## [10] "start_lng" "end_lat" "end_lng"
## [13] "member_casual"
colnames(sep_2022)
## [1] "ride_id" "rideable_type" "started_at"
## [4] "ended_at" "start_station_name" "start_station_id"
## [7] "end_station_name" "end_station_id" "start_lat"
## [10] "start_lng" "end_lat" "end_lng"
## [13] "member_casual"
colnames(oct_2022)
## [1] "ride_id" "rideable_type" "started_at"
## [4] "ended_at" "start_station_name" "start_station_id"
## [7] "end_station_name" "end_station_id" "start_lat"
## [10] "start_lng" "end_lat" "end_lng"
## [13] "member_casual"
colnames(nov_2022)
## [1] "ride_id" "rideable_type" "started_at"
## [4] "ended_at" "start_station_name" "start_station_id"
## [7] "end_station_name" "end_station_id" "start_lat"
## [10] "start_lng" "end_lat" "end_lng"
## [13] "member_casual"
colnames(dec_2022)
## [1] "ride_id" "rideable_type" "started_at"
## [4] "ended_at" "start_station_name" "start_station_id"
## [7] "end_station_name" "end_station_id" "start_lat"
## [10] "start_lng" "end_lat" "end_lng"
## [13] "member_casual"
colnames(jan_2023)
## [1] "ride_id" "rideable_type" "started_at"
## [4] "ended_at" "start_station_name" "start_station_id"
## [7] "end_station_name" "end_station_id" "start_lat"
## [10] "start_lng" "end_lat" "end_lng"
## [13] "member_casual"
str(jan_2023)
## spc_tbl_ [190,301 × 13] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ ride_id : chr [1:190301] "F96D5A74A3E41399" "13CB7EB698CEDB88" "BD88A2E670661CE5" "C90792D034FED968" ...
## $ rideable_type : chr [1:190301] "electric_bike" "classic_bike" "electric_bike" "classic_bike" ...
## $ started_at : POSIXct[1:190301], format: "2023-01-21 20:05:42" "2023-01-10 15:37:36" ...
## $ ended_at : POSIXct[1:190301], format: "2023-01-21 20:16:33" "2023-01-10 15:46:05" ...
## $ start_station_name: chr [1:190301] "Lincoln Ave & Fullerton Ave" "Kimbark Ave & 53rd St" "Western Ave & Lunt Ave" "Kimbark Ave & 53rd St" ...
## $ start_station_id : chr [1:190301] "TA1309000058" "TA1309000037" "RP-005" "TA1309000037" ...
## $ end_station_name : chr [1:190301] "Hampden Ct & Diversey Ave" "Greenwood Ave & 47th St" "Valli Produce - Evanston Plaza" "Greenwood Ave & 47th St" ...
## $ end_station_id : chr [1:190301] "202480.0" "TA1308000002" "599" "TA1308000002" ...
## $ start_lat : num [1:190301] 41.9 41.8 42 41.8 41.8 ...
## $ start_lng : num [1:190301] -87.6 -87.6 -87.7 -87.6 -87.6 ...
## $ end_lat : num [1:190301] 41.9 41.8 42 41.8 41.8 ...
## $ end_lng : num [1:190301] -87.6 -87.6 -87.7 -87.6 -87.6 ...
## $ member_casual : chr [1:190301] "member" "member" "casual" "member" ...
## - attr(*, "spec")=
## .. cols(
## .. ride_id = col_character(),
## .. rideable_type = col_character(),
## .. started_at = col_datetime(format = ""),
## .. ended_at = col_datetime(format = ""),
## .. start_station_name = col_character(),
## .. start_station_id = col_character(),
## .. end_station_name = col_character(),
## .. end_station_id = col_character(),
## .. start_lat = col_double(),
## .. start_lng = col_double(),
## .. end_lat = col_double(),
## .. end_lng = col_double(),
## .. member_casual = col_character()
## .. )
## - attr(*, "problems")=<externalptr>
feb_2022 <- mutate(feb_2022, ride_id = as.character(ride_id)
,rideable_type = as.character(rideable_type))
mar_2022 <- mutate(mar_2022, ride_id = as.character(ride_id)
,rideable_type = as.character(rideable_type))
apr_2022 <- mutate(apr_2022, ride_id = as.character(ride_id)
,rideable_type = as.character(rideable_type))
may_2022 <- mutate(may_2022, ride_id = as.character(ride_id)
,rideable_type = as.character(rideable_type))
jun_2022 <- mutate(jun_2022, ride_id = as.character(ride_id)
,rideable_type = as.character(rideable_type))
jul_2022 <- mutate(jul_2022, ride_id = as.character(ride_id)
,rideable_type = as.character(rideable_type))
aug_2022 <- mutate(aug_2022, ride_id = as.character(ride_id)
,rideable_type = as.character(rideable_type))
sep_2022 <- mutate(sep_2022, ride_id = as.character(ride_id)
,rideable_type = as.character(rideable_type))
oct_2022 <- mutate(oct_2022, ride_id = as.character(ride_id)
,rideable_type = as.character(rideable_type))
nov_2022 <- mutate(nov_2022, ride_id = as.character(ride_id)
,rideable_type = as.character(rideable_type))
dec_2022 <- mutate(dec_2022, ride_id = as.character(ride_id)
,rideable_type = as.character(rideable_type))
jan_2023 <- mutate(jan_2023, ride_id = as.character(ride_id)
,rideable_type = as.character(rideable_type))
all_trips <- bind_rows(feb_2022, mar_2022, apr_2022, may_2022, jun_2022, jul_2022,
aug_2022, sep_2022, oct_2022, nov_2022, dec_2022, jan_2023)
Perform an inspection to confirm that the data has remained accurate after the combination, use data observed in the spreadsheet as a reference point for confirmation.
colnames(all_trips)
## [1] "ride_id" "rideable_type" "started_at"
## [4] "ended_at" "start_station_name" "start_station_id"
## [7] "end_station_name" "end_station_id" "start_lat"
## [10] "start_lng" "end_lat" "end_lng"
## [13] "member_casual"
str(all_trips)
## tibble [5,754,248 × 13] (S3: tbl_df/tbl/data.frame)
## $ ride_id : chr [1:5754248] "E1E065E7ED285C02" "1602DCDC5B30FFE3" "BE7DD2AF4B55C4AF" "A1789BDF844412BE" ...
## $ rideable_type : chr [1:5754248] "classic_bike" "classic_bike" "classic_bike" "classic_bike" ...
## $ started_at : POSIXct[1:5754248], format: "2022-02-19 18:08:41" "2022-02-20 17:41:30" ...
## $ ended_at : POSIXct[1:5754248], format: "2022-02-19 18:23:56" "2022-02-20 17:45:56" ...
## $ start_station_name: chr [1:5754248] "State St & Randolph St" "Halsted St & Wrightwood Ave" "State St & Randolph St" "Southport Ave & Waveland Ave" ...
## $ start_station_id : chr [1:5754248] "TA1305000029" "TA1309000061" "TA1305000029" "13235" ...
## $ end_station_name : chr [1:5754248] "Clark St & Lincoln Ave" "Southport Ave & Wrightwood Ave" "Canal St & Adams St" "Broadway & Sheridan Rd" ...
## $ end_station_id : chr [1:5754248] "13179" "TA1307000113" "13011" "13323" ...
## $ start_lat : num [1:5754248] 41.9 41.9 41.9 41.9 41.9 ...
## $ start_lng : num [1:5754248] -87.6 -87.6 -87.6 -87.7 -87.6 ...
## $ end_lat : num [1:5754248] 41.9 41.9 41.9 42 41.9 ...
## $ end_lng : num [1:5754248] -87.6 -87.7 -87.6 -87.6 -87.6 ...
## $ member_casual : chr [1:5754248] "member" "member" "member" "member" ...
summary(all_trips)
## ride_id rideable_type started_at
## Length:5754248 Length:5754248 Min. :2022-02-01 00:03:18.00
## Class :character Class :character 1st Qu.:2022-06-02 15:18:09.50
## Mode :character Mode :character Median :2022-07-27 22:50:40.50
## Mean :2022-07-29 13:28:03.16
## 3rd Qu.:2022-09-22 20:34:47.25
## Max. :2023-01-31 23:56:09.00
##
## ended_at start_station_name start_station_id
## Min. :2022-02-01 00:09:37.00 Length:5754248 Length:5754248
## 1st Qu.:2022-06-02 15:37:50.50 Class :character Class :character
## Median :2022-07-27 23:09:33.00 Mode :character Mode :character
## Mean :2022-07-29 13:47:21.50
## 3rd Qu.:2022-09-22 20:53:25.25
## Max. :2023-02-04 04:27:03.00
##
## end_station_name end_station_id start_lat start_lng
## Length:5754248 Length:5754248 Min. :41.64 Min. :-87.84
## Class :character Class :character 1st Qu.:41.88 1st Qu.:-87.66
## Mode :character Mode :character Median :41.90 Median :-87.64
## Mean :41.90 Mean :-87.65
## 3rd Qu.:41.93 3rd Qu.:-87.63
## Max. :42.07 Max. :-87.52
##
## end_lat end_lng member_casual
## Min. : 0.00 Min. :-88.14 Length:5754248
## 1st Qu.:41.88 1st Qu.:-87.66 Class :character
## Median :41.90 Median :-87.64 Mode :character
## Mean :41.90 Mean :-87.65
## 3rd Qu.:41.93 3rd Qu.:-87.63
## Max. :42.37 Max. : 0.00
## NA's :5899 NA's :5899
This phase involves checking for any data errors, cleaning the data of any errors, transforming said data by adding, then renaming and by removing data as needed, the rounding it up by checking to see that the data is clean and ready for analysis.
Adding columns for: date, month, day, ride_length, year, day_of_week, and hour
all_trips$date <- as.Date(all_trips$started_at)
all_trips$month <- format(as.Date(all_trips$date), "%B")
all_trips$day <- format(as.Date(all_trips$date), "%d")
all_trips$year <- format(as.Date(all_trips$date), "%Y")
all_trips$day_of_week <- format(as.Date(all_trips$date), "%A")
all_trips$hour <- lubridate::hour(all_trips$started_at)
all_trips$ride_length <- round(as.numeric(difftime(all_trips$ended_at,all_trips$started_at, units = "mins")),2)
Add column for seasons
all_trips <- all_trips %>% mutate(season = recode(month,
December = "Winter",
January = "Winter",
February = "Winter",
March = "Spring",
April = "Spring",
May = "Spring",
June = "Summer",
July = "Summer",
August = "Summer",
September = "Fall",
October = "Fall",
November = "Fall"))
Add a column to represent the time of the day
all_trips <- all_trips %>% mutate(time_of_day = case_when(
hour >= 6 & hour < 9 ~ "Early Morning",
hour >= 9 & hour < 12 ~ "Mid Morning",
hour >= 12 & hour < 18 ~ "Afternoon",
hour >= 18 & hour <= 23 ~ "Evening",
hour >= 0 & hour < 3 ~ "Early Night",
hour >= 3 & hour < 6 ~ "Late Night"))
Check the user type and rideable type for missing values.
all_trips %>% filter(is.na(start_station_id)) %>%
count(start_station_id, start_station_name, rideable_type, member_casual)
## # A tibble: 2 × 5
## start_station_id start_station_name rideable_type member_casual n
## <chr> <chr> <chr> <chr> <int>
## 1 <NA> <NA> electric_bike casual 350065
## 2 <NA> <NA> electric_bike member 493460
The missing values have been located in the electric_bike
Create new column to address missing values and also uniformize the longitude and latitude.
all_trips <- all_trips %>%
mutate(start_lat_round = round(start_lat, digits = 2),
start_lng_round = round(start_lng, digits = 2),
end_lat_round = round(end_lat, digits = 2),
end_lng_round = round(end_lng, digits = 2))
Fix missing station names and station ID
all_trips <- all_trips %>%
group_by(start_lat_round, start_lng_round) %>%
tidyr::fill(start_station_name, .direction = "downup") %>%
ungroup()
all_trips <- all_trips %>%
group_by(end_lat_round, end_lng_round) %>%
tidyr::fill(end_station_name, .direction = "downup") %>%
ungroup()
all_trips <- all_trips %>%
group_by(start_station_name) %>%
tidyr::fill(start_station_id, .direction = "downup") %>%
ungroup()
all_trips <- all_trips %>%
group_by(end_station_name) %>%
tidyr::fill(end_station_id, .direction = "downup") %>%
ungroup()
Check the data again to make sure the missing values have been fixed
colSums(is.na(all_trips))
## ride_id rideable_type started_at ended_at
## 0 0 0 0
## start_station_name start_station_id end_station_name end_station_id
## 11929 11929 40880 40880
## start_lat start_lng end_lat end_lng
## 0 0 5899 5899
## member_casual date month day
## 0 0 0 0
## year day_of_week hour ride_length
## 0 0 0 0
## season time_of_day start_lat_round start_lng_round
## 0 0 0 0
## end_lat_round end_lng_round
## 5899 5899
These shows that some of the missing values can’t be fixed because they lack critical information, this will be handled by removing them from the dataframe. Their number are very small and insignificant so it can be disregarded.
Rides that are by the company employees for the purpose of testing or repairs are immaterial to the analysis being done and can actually skew the data. They will be isolated and eliminated.
all_trips %>%
select(start_station_id) %>%
count(start_station_id) %>%
arrange(desc(n))
## # A tibble: 1,312 × 2
## start_station_id n
## <chr> <int>
## 1 13022 81578
## 2 LF-005 46593
## 3 TA1308000050 43541
## 4 13042 42671
## 5 13300 42225
## 6 TA1307000039 38861
## 7 13008 37142
## 8 KA1503000014 36188
## 9 13179 35857
## 10 TA1308000001 35376
## # … with 1,302 more rows
all_trips %>%
select(end_station_id) %>%
count(end_station_id) %>%
arrange(desc(n))
## # A tibble: 1,317 × 2
## end_station_id n
## <chr> <int>
## 1 13022 81458
## 2 LF-005 45159
## 3 TA1308000050 44744
## 4 13042 42189
## 5 KA1503000014 42012
## 6 13300 41533
## 7 <NA> 40880
## 8 TA1307000039 38672
## 9 13008 37270
## 10 13179 36686
## # … with 1,307 more rows
Filter out the station related to testing and repair
all_trips %>% filter(start_station_id %in% c("DIVVY 001", "DIVVY 001 - Warehouse test station",
"Hubbard Bike-checking (LBS-WH-TEST)", "Pawel Bialowas - Test- PBSC charging station", "DIVVY CASSETTE REPAIR MOBILE STATION",
"2059 Hastings Warehouse Station", "Hastings WH 2", "Throop/Hastings Mobile Station")) %>%
count(start_station_id)
## # A tibble: 6 × 2
## start_station_id n
## <chr> <int>
## 1 2059 Hastings Warehouse Station 585
## 2 DIVVY 001 77
## 3 DIVVY 001 - Warehouse test station 14
## 4 DIVVY CASSETTE REPAIR MOBILE STATION 7
## 5 Hastings WH 2 14
## 6 Hubbard Bike-checking (LBS-WH-TEST) 8124
all_trips %>% filter(end_station_id %in% c("DIVVY 001", "DIVVY 001 - Warehouse test station", "Hubbard Bike-checking (LBS-WH-TEST)",
"Pawel Bialowas - Test- PBSC charging station", "DIVVY CASSETTE REPAIR MOBILE STATION",
"2059 Hastings Warehouse Station", "Hastings WH 2", "Throop/Hastings Mobile Station")) %>%
count(end_station_id)
## # A tibble: 4 × 2
## end_station_id n
## <chr> <int>
## 1 2059 Hastings Warehouse Station 791
## 2 DIVVY CASSETTE REPAIR MOBILE STATION 7
## 3 Hastings WH 2 167
## 4 Hubbard Bike-checking (LBS-WH-TEST) 774
all_trips <- all_trips %>%
filter(ride_length >= 1 & ride_length <= 1440)
all_trips_v2 <- all_trips %>%
filter(!is.na(end_lat) & !is.na(end_lng))
all_trips_v2 <- all_trips_v2 %>%
filter(!start_station_id %in% c("DIVVY 001", "DIVVY 001 - Warehouse test station", "Hubbard Bike-checking (LBS-WH-TEST)",
"Pawel Bialowas - Test- PBSC charging station", "DIVVY CASSETTE REPAIR MOBILE STATION",
"2059 Hastings Warehouse Station", "Hastings WH 2", "Throop/Hastings Mobile Station"))
all_trips_v2 <- all_trips_v2 %>%
filter(!end_station_id %in% c("DIVVY 001", "DIVVY 001 - Warehouse test station", "Hubbard Bike-checking (LBS-WH-TEST)",
"Pawel Bialowas - Test- PBSC charging station", "DIVVY CASSETTE REPAIR MOBILE STATION",
"2059 Hastings Warehouse Station", "Hastings WH 2", "Throop/Hastings Mobile Station"))
Removed the testing and repair stations and also renamed the new dataframe created.
check the removed stations for accuracy
str(all_trips_v2)
## tibble [5,612,878 × 26] (S3: tbl_df/tbl/data.frame)
## $ ride_id : chr [1:5612878] "E1E065E7ED285C02" "1602DCDC5B30FFE3" "BE7DD2AF4B55C4AF" "A1789BDF844412BE" ...
## $ rideable_type : chr [1:5612878] "classic_bike" "classic_bike" "classic_bike" "classic_bike" ...
## $ started_at : POSIXct[1:5612878], format: "2022-02-19 18:08:41" "2022-02-20 17:41:30" ...
## $ ended_at : POSIXct[1:5612878], format: "2022-02-19 18:23:56" "2022-02-20 17:45:56" ...
## $ start_station_name: chr [1:5612878] "State St & Randolph St" "Halsted St & Wrightwood Ave" "State St & Randolph St" "Southport Ave & Waveland Ave" ...
## $ start_station_id : chr [1:5612878] "TA1305000029" "TA1309000061" "TA1305000029" "13235" ...
## $ end_station_name : chr [1:5612878] "Clark St & Lincoln Ave" "Southport Ave & Wrightwood Ave" "Canal St & Adams St" "Broadway & Sheridan Rd" ...
## $ end_station_id : chr [1:5612878] "13179" "TA1307000113" "13011" "13323" ...
## $ start_lat : num [1:5612878] 41.9 41.9 41.9 41.9 41.9 ...
## $ start_lng : num [1:5612878] -87.6 -87.6 -87.6 -87.7 -87.6 ...
## $ end_lat : num [1:5612878] 41.9 41.9 41.9 42 41.9 ...
## $ end_lng : num [1:5612878] -87.6 -87.7 -87.6 -87.6 -87.6 ...
## $ member_casual : chr [1:5612878] "member" "member" "member" "member" ...
## $ date : Date[1:5612878], format: "2022-02-19" "2022-02-20" ...
## $ month : chr [1:5612878] "February" "February" "February" "February" ...
## $ day : chr [1:5612878] "19" "20" "25" "14" ...
## $ year : chr [1:5612878] "2022" "2022" "2022" "2022" ...
## $ day_of_week : chr [1:5612878] "Saturday" "Sunday" "Friday" "Monday" ...
## $ hour : int [1:5612878] 18 17 18 11 5 9 10 20 13 7 ...
## $ ride_length : num [1:5612878] 15.25 4.43 13.63 6.95 2.9 ...
## $ season : chr [1:5612878] "Winter" "Winter" "Winter" "Winter" ...
## $ time_of_day : chr [1:5612878] "Evening" "Afternoon" "Evening" "Mid Morning" ...
## $ start_lat_round : num [1:5612878] 41.9 41.9 41.9 42 41.9 ...
## $ start_lng_round : num [1:5612878] -87.6 -87.7 -87.6 -87.7 -87.6 ...
## $ end_lat_round : num [1:5612878] 41.9 41.9 41.9 42 41.9 ...
## $ end_lng_round : num [1:5612878] -87.6 -87.7 -87.6 -87.7 -87.6 ...
is.factor(all_trips_v2$ride_length)
## [1] FALSE
all_trips_v2$ride_length <- as.numeric(as.character(all_trips_v2$ride_length))
is.numeric(all_trips_v2$ride_length)
## [1] TRUE
all_trips_v2%>%
select(start_station_id) %>%
count(start_station_id) %>%
arrange(desc(n))
## # A tibble: 1,306 × 2
## start_station_id n
## <chr> <int>
## 1 13022 79566
## 2 LF-005 45310
## 3 TA1308000050 42731
## 4 13042 41847
## 5 13300 41325
## 6 TA1307000039 38115
## 7 13008 36247
## 8 KA1503000014 35531
## 9 13179 35118
## 10 TA1308000001 34776
## # … with 1,296 more rows
all_trips_v2%>%
select(end_station_id) %>%
count(end_station_id) %>%
arrange(desc(n))
## # A tibble: 1,313 × 2
## end_station_id n
## <chr> <int>
## 1 13022 79717
## 2 LF-005 44366
## 3 TA1308000050 43686
## 4 13042 41423
## 5 KA1503000014 41239
## 6 13300 40734
## 7 TA1307000039 37702
## 8 13008 36598
## 9 13179 35887
## 10 TA1308000001 34615
## # … with 1,303 more rows
class(all_trips_v2)
## [1] "tbl_df" "tbl" "data.frame"
colSums(is.na(all_trips_v2))
## ride_id rideable_type started_at ended_at
## 0 0 0 0
## start_station_name start_station_id end_station_name end_station_id
## 11558 11558 33871 33871
## start_lat start_lng end_lat end_lng
## 0 0 0 0
## member_casual date month day
## 0 0 0 0
## year day_of_week hour ride_length
## 0 0 0 0
## season time_of_day start_lat_round start_lng_round
## 0 0 0 0
## end_lat_round end_lng_round
## 0 0
skim_without_charts(all_trips_v2)
| Name | all_trips_v2 |
| Number of rows | 5612878 |
| Number of columns | 26 |
| _______________________ | |
| Column type frequency: | |
| character | 13 |
| Date | 1 |
| numeric | 10 |
| POSIXct | 2 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| ride_id | 0 | 1.00 | 16 | 16 | 0 | 5612878 | 0 |
| rideable_type | 0 | 1.00 | 11 | 13 | 0 | 3 | 0 |
| start_station_name | 11558 | 1.00 | 9 | 64 | 0 | 1671 | 0 |
| start_station_id | 11558 | 1.00 | 3 | 37 | 0 | 1305 | 0 |
| end_station_name | 33871 | 0.99 | 9 | 64 | 0 | 1688 | 0 |
| end_station_id | 33871 | 0.99 | 3 | 37 | 0 | 1312 | 0 |
| member_casual | 0 | 1.00 | 6 | 6 | 0 | 2 | 0 |
| month | 0 | 1.00 | 3 | 9 | 0 | 12 | 0 |
| day | 0 | 1.00 | 2 | 2 | 0 | 31 | 0 |
| year | 0 | 1.00 | 4 | 4 | 0 | 2 | 0 |
| day_of_week | 0 | 1.00 | 6 | 9 | 0 | 7 | 0 |
| season | 0 | 1.00 | 4 | 6 | 0 | 4 | 0 |
| time_of_day | 0 | 1.00 | 7 | 13 | 0 | 6 | 0 |
Variable type: Date
| skim_variable | n_missing | complete_rate | min | max | median | n_unique |
|---|---|---|---|---|---|---|
| date | 0 | 1 | 2022-02-01 | 2023-01-31 | 2022-07-27 | 365 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 |
|---|---|---|---|---|---|---|---|---|---|
| start_lat | 0 | 1 | 41.90 | 0.05 | 41.64 | 41.88 | 41.90 | 41.93 | 42.07 |
| start_lng | 0 | 1 | -87.65 | 0.03 | -87.84 | -87.66 | -87.64 | -87.63 | -87.52 |
| end_lat | 0 | 1 | 41.90 | 0.07 | 0.00 | 41.88 | 41.90 | 41.93 | 42.37 |
| end_lng | 0 | 1 | -87.65 | 0.10 | -88.14 | -87.66 | -87.64 | -87.63 | 0.00 |
| hour | 0 | 1 | 14.21 | 5.03 | 0.00 | 11.00 | 15.00 | 18.00 | 23.00 |
| ride_length | 0 | 1 | 16.46 | 28.68 | 1.00 | 6.02 | 10.43 | 18.58 | 1439.37 |
| start_lat_round | 0 | 1 | 41.90 | 0.05 | 41.64 | 41.88 | 41.90 | 41.93 | 42.07 |
| start_lng_round | 0 | 1 | -87.65 | 0.03 | -87.84 | -87.66 | -87.64 | -87.63 | -87.52 |
| end_lat_round | 0 | 1 | 41.90 | 0.07 | 0.00 | 41.88 | 41.90 | 41.93 | 42.37 |
| end_lng_round | 0 | 1 | -87.65 | 0.10 | -88.14 | -87.66 | -87.64 | -87.63 | 0.00 |
Variable type: POSIXct
| skim_variable | n_missing | complete_rate | min | max | median | n_unique |
|---|---|---|---|---|---|---|
| started_at | 0 | 1 | 2022-02-01 00:03:18 | 2023-01-31 23:56:09 | 2022-07-27 19:51:47 | 4724012 |
| ended_at | 0 | 1 | 2022-02-01 00:09:37 | 2023-02-01 00:28:12 | 2022-07-27 20:10:25 | 4736481 |
Analyzing the data in order to get different insights from it in different ways. Focusing on analyzing how the annual member riders and the casual riders use Cyclistic bikes differently.
Get the summary of the data to be used for analysis in order to get a snapshot of the information.
summary(all_trips_v2)
## ride_id rideable_type started_at
## Length:5612878 Length:5612878 Min. :2022-02-01 00:03:18.00
## Class :character Class :character 1st Qu.:2022-06-02 12:16:51.50
## Mode :character Mode :character Median :2022-07-27 19:51:47.50
## Mean :2022-07-29 10:10:19.89
## 3rd Qu.:2022-09-22 19:13:23.50
## Max. :2023-01-31 23:56:09.00
## ended_at start_station_name start_station_id
## Min. :2022-02-01 00:09:37.00 Length:5612878 Length:5612878
## 1st Qu.:2022-06-02 12:36:19.00 Class :character Class :character
## Median :2022-07-27 20:10:25.00 Mode :character Mode :character
## Mean :2022-07-29 10:26:47.51
## 3rd Qu.:2022-09-22 19:28:37.00
## Max. :2023-02-01 00:28:12.00
## end_station_name end_station_id start_lat start_lng
## Length:5612878 Length:5612878 Min. :41.64 Min. :-87.84
## Class :character Class :character 1st Qu.:41.88 1st Qu.:-87.66
## Mode :character Mode :character Median :41.90 Median :-87.64
## Mean :41.90 Mean :-87.65
## 3rd Qu.:41.93 3rd Qu.:-87.63
## Max. :42.07 Max. :-87.52
## end_lat end_lng member_casual date
## Min. : 0.00 Min. :-88.14 Length:5612878 Min. :2022-02-01
## 1st Qu.:41.88 1st Qu.:-87.66 Class :character 1st Qu.:2022-06-02
## Median :41.90 Median :-87.64 Mode :character Median :2022-07-27
## Mean :41.90 Mean :-87.65 Mean :2022-07-28
## 3rd Qu.:41.93 3rd Qu.:-87.63 3rd Qu.:2022-09-22
## Max. :42.37 Max. : 0.00 Max. :2023-01-31
## month day year day_of_week
## Length:5612878 Length:5612878 Length:5612878 Length:5612878
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
## hour ride_length season time_of_day
## Min. : 0.00 Min. : 1.00 Length:5612878 Length:5612878
## 1st Qu.:11.00 1st Qu.: 6.02 Class :character Class :character
## Median :15.00 Median : 10.43 Mode :character Mode :character
## Mean :14.21 Mean : 16.46
## 3rd Qu.:18.00 3rd Qu.: 18.58
## Max. :23.00 Max. :1439.37
## start_lat_round start_lng_round end_lat_round end_lng_round
## Min. :41.64 Min. :-87.84 Min. : 0.00 Min. :-88.14
## 1st Qu.:41.88 1st Qu.:-87.66 1st Qu.:41.88 1st Qu.:-87.66
## Median :41.90 Median :-87.64 Median :41.90 Median :-87.64
## Mean :41.90 Mean :-87.65 Mean :41.90 Mean :-87.65
## 3rd Qu.:41.93 3rd Qu.:-87.63 3rd Qu.:41.93 3rd Qu.:-87.63
## Max. :42.07 Max. :-87.52 Max. :42.37 Max. : 0.00
Starting off with numbers and percentage comparison
all_trips_v2%>%
group_by(member_casual) %>%
summarise(count = n(), Percentage = n()/nrow(all_trips_v2)*100)
## # A tibble: 2 × 3
## member_casual count Percentage
## <chr> <int> <dbl>
## 1 casual 2284633 40.7
## 2 member 3328245 59.3
Analysis
During the time frame studied for this analysis (12 month period), 40.7% of the riders that used the rideables were casual riders while about 59.3% of the riders in the same time period were member riders. Converting a portion of the casual riders to members would be very advantageous.
Visualizing by member type
ggplot(all_trips_v2, aes(member_casual, fill=member_casual)) +
geom_bar() +
labs(x="User Type", y="Count", title = "Total Rides: Casual vs Member") +
annotate("text",x=1,y=2000000,label="2,284,633 / (40.7%)",color="black",size=3.5) +
annotate("text",x=2,y=3000000,label="3,328,245 / (59.3%)",color="black",size=3.5)
A majority of the riders (59.3%) are members while the rest (40.7%) are casual riders.
Visualizing the total rides by the user type and rideable type
all_trips_v2%>%
group_by(member_casual, rideable_type) %>%
summarise(count = n()) %>%
ggplot(aes(x=member_casual, y=count, fill=rideable_type)) +
geom_bar(stat="identity", width = 0.3) +
labs(x="Rideable Type", y="Number of Rides", title = "Total Rides by user type and rideable type")
## `summarise()` has grouped output by 'member_casual'. You can override using the
## `.groups` argument.
The members riders slightly favor the classic bike over the electric bike: - Most casual riders use electric bikes - members don’t use docked bikes - members use both classic and electric bikes almost equally
summary(all_trips_v2$ride_length)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 6.02 10.43 16.46 18.58 1439.37
The average length of ride for each trip (in minutes) by the type of user
aggregate(all_trips_v2$ride_length~all_trips_v2$member_casual, FUN = mean)
## all_trips_v2$member_casual all_trips_v2$ride_length
## 1 casual 22.10731
## 2 member 12.58425
Analysis The casual riders rode the bikes for significantly more time on average than the member riders.
Breaking down ride lengths for more analysis
all_trips_v2%>%
group_by(member_casual) %>%
summarize("<=5min" = sum(ride_length <=5),
"<=15min" = sum(ride_length <=15),
"<=30min" = sum(ride_length <=30),
"<=45min" = sum(ride_length <=45),
"<=60min" = sum(ride_length <=60),
">2hrs" = sum(ride_length >120),
">4hrs" = sum(ride_length >240),
">6hrs" = sum(ride_length >360),)
## # A tibble: 2 × 9
## member_casual `<=5min` <=15m…¹ <=30m…² <=45m…³ <=60m…⁴ `>2hrs` `>4hrs` `>6hrs`
## <chr> <int> <int> <int> <int> <int> <int> <int> <int>
## 1 casual 266189 1278389 1863455 2057141 2142989 33432 5731 3206
## 2 member 757992 2462997 3116478 3273578 3305079 5982 2111 1218
## # … with abbreviated variable names ¹`<=15min`, ²`<=30min`, ³`<=45min`,
## # ⁴`<=60min`
Analysis This shows that for both user types, the vast majority of ride length is below 60 minutes, more data would be needed to analyze this trend but I suspect that it’s the maximum endurance of non professional riders. And distances beyond this point most riders will choose another mode of transportation. Perhaps an availability of more electric bikes would encourage longer rides.
all_trips_v2 %>%
group_by(member_casual) %>%
summarize("<10 min" = sum(ride_length <9.99),
"10-20 min" = sum(ride_length >=10 & ride_length <=19.99),
"20-30 min" = sum(ride_length >=20 & ride_length <=29.99),
"30-60 min" = sum(ride_length >=30 & ride_length <=59.99),
"60-120 min" = sum(ride_length >=60 & ride_length <=119.99),
"120-240 min" = sum(ride_length >=120 & ride_length <=239.99),
"240+min" = sum(ride_length >=240))
## # A tibble: 2 × 8
## member_casual `<10 min` `10-20 min` 20-30 mi…¹ 30-60…² 60-12…³ 120-2…⁴ 240+m…⁵
## <chr> <int> <int> <int> <int> <int> <int> <int>
## 1 casual 845416 707829 309833 279832 108280 27712 5731
## 2 member 1844513 954711 316957 188886 17195 3872 2111
## # … with abbreviated variable names ¹`20-30 min`, ²`30-60 min`, ³`60-120 min`,
## # ⁴`120-240 min`, ⁵`240+min`
Analysis This shows that casual riders tend to keep the rideables for much longer than member riders, but the members tend to take shorter rides.
all_trips_v3 <- all_trips_v2%>% mutate(ride_length_grp = case_when(
ride_length <9.99 ~ "< 10 min",
ride_length >=10 & ride_length <=19.99 ~ "10-20 min",
ride_length >=20 & ride_length <=29.99 ~ "20-30 min",
ride_length >=30 & ride_length <=59.99 ~ "30-60 min",
ride_length >=60 & ride_length <=119.99 ~ "60-120 min",
ride_length >=120 & ride_length <=239.99 ~ "120-240 min",
ride_length >=240 ~ "240+ min"))
all_trips_v3 %>%
group_by(member_casual, ride_length_grp) %>%
summarise(count = n()) %>%
ggplot(aes(x=factor(ride_length_grp, level = c("< 10 min", "10-20 min", "20-30 min", "30-60 min",
"60-120 min", "120-240 min", "240+ min")), y=count, fill=member_casual)) +
geom_col(position = "dodge") +
labs(x="Ride Length", y="Number of Rides", title = "Total Rides by user type and ride length grouping")
## `summarise()` has grouped output by 'member_casual'. You can override using the
## `.groups` argument.
Analysis Most of the rides fall within the 1 to 20 minutes ride length in minutes, this information can be further enhanced by obtaining survey information on riding behaviors.
Looking at the average ride length of each trip (in minutes) based on user type and the hour of day
all_trips_v3 %>%
group_by(member_casual, hour) %>%
summarise(count = n(), average_ride_length=mean(ride_length)) %>%
arrange(member_casual, hour) %>%
ggplot(aes(x=factor(hour, level= c(6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,0,1,2,3,4,5)), y=average_ride_length, fill=member_casual)) +
geom_col(position = "dodge") +
labs(x="Hour of day", y="Ride Length (in minutes)", title = "Average ride length by user type and hour of the day")
## `summarise()` has grouped output by 'member_casual'. You can override using the
## `.groups` argument.
Analysis The casual riders had on average longer rides than member riders peaking between 10AM - 3PM, while member riders stayed relatively constant throughout the day.This insight can be useful in planning maintenance times for the rideables.
axis_labels <- c("Early Morning \n6am-9am", "Mid Morning \n9am-12pm", "Afternoon \n12pm-6pm", "Evening \n6pm-11pm", "Early Night \n11pm-3am", "Wee Night \n3am-6am")
all_trips_v3 %>%
group_by(member_casual, time_of_day) %>%
summarise(count = n(), average_ride_length=mean(ride_length)) %>%
ggplot(aes(x=factor(time_of_day, level= c("Early Morning", "Mid Morning", "Afternoon", "Evening", "Early Night", "Late Night")), y=average_ride_length, fill=member_casual)) +
geom_col(position = "dodge", width = 0.4) +
labs(x="Time of Day", y="Ride Length (minutes)", title = "Average ride length by user type and time of the day") +
scale_x_discrete(labels = axis_labels)
## `summarise()` has grouped output by 'member_casual'. You can override using the
## `.groups` argument.
Analysis
aggregate(all_trips_v3$ride_length ~ all_trips_v3$member_casual +
all_trips_v3$day_of_week, FUN = mean)
## all_trips_v3$member_casual all_trips_v3$day_of_week all_trips_v3$ride_length
## 1 casual Friday 20.71813
## 2 member Friday 12.39689
## 3 casual Monday 22.55896
## 4 member Monday 12.15048
## 5 casual Saturday 24.84920
## 6 member Saturday 14.04720
## 7 casual Sunday 25.31523
## 8 member Sunday 13.92415
## 9 casual Thursday 19.70325
## 10 member Thursday 12.16705
## 11 casual Tuesday 19.76585
## 12 member Tuesday 11.94311
## 13 casual Wednesday 19.07947
## 14 member Wednesday 11.99994
all_trips_v3 %>%
group_by(member_casual, day_of_week) %>%
summarise(count = n(), average_ride_length=mean(ride_length)) %>%
ggplot(aes(x=factor(day_of_week, level= c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday")), y=average_ride_length, fill=member_casual)) +
geom_col(position = "dodge", width = 0.4) +
labs(x="Day of Week", y="Ride Length ( minutes)", title = "Average ride length by the user type and day of the week")
## `summarise()` has grouped output by 'member_casual'. You can override using the
## `.groups` argument.
Analysis
The weekends saw increased bike use by minutes for both user types, casual riders were consistent throughout the week up until Saturday which is their preferred day of week. Meanwhile, the member riders looked to have a consistent weekly ride and a small weekend spike on the weekend for their bike use. The most popular day for minutes used for both user types is on Sundays.
all_trips_v3 %>%
group_by(member_casual, hour) %>%
summarise(count = n()) %>%
arrange(member_casual, hour) %>%
ggplot(aes(x=factor(hour, level= c(6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,0,1,2,3,4,5)), y=count, fill=member_casual)) +
geom_col(position = "dodge") +
labs(x="Hour of the day", y="Number of Rides", title = "Total Rides by user type and hour of the day")
## `summarise()` has grouped output by 'member_casual'. You can override using the
## `.groups` argument.
Analysis
all_trips_v3 %>%
group_by(member_casual) %>%
summarize("Early Morning" = sum(time_of_day == "Early Morning"),
"Mid Morning" = sum(time_of_day == "Mid Morning"),
"Afternoon" = sum(time_of_day == "Afternoon"),
"Evening" = sum(time_of_day == "Evening"),
"Early Night" = sum(time_of_day == "Early Night"),
"Late Night" = sum(time_of_day == "Late Night"))
## # A tibble: 2 × 7
## member_casual `Early Morning` `Mid Morning` Afternoon Evening Early …¹ Late …²
## <chr> <int> <int> <int> <int> <int> <int>
## 1 casual 150178 282434 1034339 694212 92950 30520
## 2 member 470104 440497 1412537 885751 70557 48799
## # … with abbreviated variable names ¹`Early Night`, ²`Late Night`
all_trips_v3 %>%
group_by(member_casual, time_of_day) %>%
summarise(count = n()) %>%
ggplot(aes(x=factor(time_of_day, level= c("Early Morning", "Mid Morning", "Afternoon", "Evening", "Early Night", "Late Night")), y=count, fill=member_casual)) +
geom_col(position = "dodge", width = 0.4) +
labs(x="Time of the Day", y="Number of Rides", title = "Total Rides by user type and time of the day") +
scale_x_discrete(labels = axis_labels)
## `summarise()` has grouped output by 'member_casual'. You can override using the
## `.groups` argument.
Analysis
all_trips_v3 %>%
group_by(member_casual, time_of_day) %>%
summarise(count = n()) %>%
ggplot(aes(x=factor(time_of_day, level= c("Early Morning", "Mid Morning", "Afternoon", "Evening", "Early Night", "Late Night")), y=count, color=member_casual)) +
geom_point() + geom_line(aes(group = member_casual)) +
labs(x="Time of the Day", y="Number of Rides", title = "Total Rides by user type and time of the day") + ylim(0, NA) +
scale_x_discrete(labels = axis_labels)
## `summarise()` has grouped output by 'member_casual'. You can override using the
## `.groups` argument.
Analysis
This follows the trends previously noticed, with peak times, suggestion that maintenance should be done outside of peak hours in order to main rideable availability and create customer satisfaction.
all_trips_v3 %>%
group_by(member_casual) %>%
summarize("Monday" = sum(day_of_week == "Monday"),
"Tuesday" = sum(day_of_week == "Tuesday"),
"Wednesday" = sum(day_of_week == "Wednesday"),
"Thursday" = sum(day_of_week == "Thursday"),
"Friday" = sum(day_of_week == "Friday"),
"Saturday" = sum(day_of_week == "Saturday"),
"Sunday" = sum(day_of_week == "Sunday"))
## # A tibble: 2 × 8
## member_casual Monday Tuesday Wednesday Thursday Friday Saturday Sunday
## <chr> <int> <int> <int> <int> <int> <int> <int>
## 1 casual 273930 261419 271058 304027 328728 462513 382958
## 2 member 471235 521864 523339 527774 464289 435606 384138
all_trips_v3 %>%
group_by(member_casual, day_of_week) %>%
summarise(count = n()) %>%
ggplot(aes(x=factor(day_of_week, level= c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday")), y=count, fill=member_casual)) +
geom_col(position = "dodge", width = 0.4) +
labs(x="Day of the Week", y="Number of Rides", title = "Total Rides by user type and day of the week")
## `summarise()` has grouped output by 'member_casual'. You can override using the
## `.groups` argument.
Analysis
The member riders use more rideables during the week while the casual riders use more rideables during the weekends.
all_trips_v3 %>%
group_by(member_casual, day_of_week) %>%
summarise(count = n()) %>%
ggplot(aes(x=factor(day_of_week, level= c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday")), y=count, color=member_casual)) +
geom_point() + geom_line(aes(group = member_casual)) +
labs(x="Day of Week", y="Number of Rides", title = "Total Rides by user type and day of the week") +
ylim(0, NA)
## `summarise()` has grouped output by 'member_casual'. You can override using the
## `.groups` argument.
all_trips_v3 %>%
group_by(member_casual, season) %>%
summarise(count = n()) %>%
ggplot(aes(x=factor(season, level= c("Spring", "Summer", "Fall", "Winter")), y=count, fill=member_casual)) +
geom_col(position = "dodge", width = 0.4) +
labs(x="Season of the Year", y="Number of Rides", title = "Total Rides by user type and season of the year")
## `summarise()` has grouped output by 'member_casual'. You can override using the
## `.groups` argument.
Analysis Summer saw the highest rideable use and the winter saw the lowest rideable use, this is probably due to the weather. There is little that can be done to mitigate this but heated bikes is an avenue worth exploring.
all_trips_v3 %>%
filter(!(is.na(start_station_name))) %>%
filter(member_casual == "casual") %>%
group_by(start_station_name) %>%
summarize(count=n()) %>%
arrange(-count) %>%
top_n(10)
## Selecting by count
## # A tibble: 10 × 2
## start_station_name count
## <chr> <int>
## 1 Streeter Dr & Grand Ave 61066
## 2 DuSable Lake Shore Dr & Monroe St 31809
## 3 Michigan Ave & Oak St 26456
## 4 DuSable Lake Shore Dr & North Blvd 26386
## 5 Millennium Park 26308
## 6 Shedd Aquarium 20493
## 7 Theater on the Lake 19651
## 8 Wells St & Concord Ln 18923
## 9 Clark St & Lincoln Ave 18419
## 10 Dusable Harbor 16228
all_trips_v3 %>%
filter(!(is.na(start_station_name))) %>%
filter(member_casual == "casual") %>%
group_by(start_station_name) %>%
summarize(count=n()) %>%
arrange(-count) %>%
top_n(10) %>%
mutate(start_station_name= fct_reorder(start_station_name, count)) %>%
ggplot(aes(x=start_station_name, y=count, fill=count)) +
geom_bar(stat = "identity") +
coord_flip() +
labs(x="Number of Rides", y="Name of Start Station ", title="Top 10 starting stations for casual riders")
## Selecting by count
all_trips_v3 %>%
filter(!(is.na(end_station_name))) %>%
filter(member_casual == "casual") %>%
group_by(end_station_name) %>%
summarize(count=n()) %>%
arrange(-count) %>%
top_n(10)
## Selecting by count
## # A tibble: 10 × 2
## end_station_name count
## <chr> <int>
## 1 Streeter Dr & Grand Ave 63523
## 2 DuSable Lake Shore Dr & Monroe St 30126
## 3 Millennium Park 27700
## 4 DuSable Lake Shore Dr & North Blvd 27217
## 5 Michigan Ave & Oak St 27198
## 6 Theater on the Lake 20663
## 7 Clark St & Lincoln Ave 19096
## 8 Wells St & Concord Ln 18814
## 9 Shedd Aquarium 18584
## 10 Dusable Harbor 15474
all_trips_v3 %>%
filter(!(is.na(start_station_name))) %>%
filter(member_casual == "member") %>%
group_by(start_station_name) %>%
summarize(count=n()) %>%
arrange(-count) %>%
top_n(10)
## Selecting by count
## # A tibble: 10 × 2
## start_station_name count
## <chr> <int>
## 1 Ellis Ave & 60th St 29268
## 2 Kingsbury St & Kinzie St 25303
## 3 University Ave & 57th St 25261
## 4 Ellis Ave & 55th St 24845
## 5 Clark St & Elm St 23954
## 6 Wells St & Concord Ln 23808
## 7 Clinton St & Washington Blvd 21257
## 8 Wells St & Elm St 20687
## 9 Loomis St & Lexington St 19920
## 10 Broadway & Barry Ave 19672
all_trips_v3 %>%
filter(!(is.na(start_station_name))) %>%
filter(member_casual == "member") %>%
group_by(start_station_name) %>%
summarize(count=n()) %>%
arrange(-count) %>%
top_n(10) %>%
mutate(start_station_name= fct_reorder(start_station_name, count)) %>%
ggplot(aes(x=start_station_name, y=count, fill=count)) +
geom_bar(stat = "identity") +
coord_flip() +
labs(x="Number of Rides", y="Name of Start Station ", title="Top 10 starting stations for member riders")
## Selecting by count
all_trips_v3 %>%
filter(!(is.na(end_station_name))) %>%
filter(member_casual == "member") %>%
group_by(end_station_name) %>%
summarize(count=n()) %>%
arrange(-count) %>%
top_n(10)
## Selecting by count
## # A tibble: 10 × 2
## end_station_name count
## <chr> <int>
## 1 Ellis Ave & 60th St 34214
## 2 Kingsbury St & Kinzie St 25718
## 3 University Ave & 57th St 25095
## 4 Wells St & Concord Ln 24872
## 5 Clark St & Elm St 24082
## 6 Ellis Ave & 55th St 22564
## 7 Clinton St & Washington Blvd 22176
## 8 Clinton St & Madison St 19623
## 9 Broadway & Barry Ave 19552
## 10 Loomis St & Lexington St 19219
Analysis
The information on the top 10 end and start station for both riders can be used to determine availability of rideables.
write_csv(all_trips_v3, file = "all_trips_v3.csv")
The first phenomenon that is easily noticeable is the stark difference between casual riders and riders with membership subscriptions, there is a significant difference in length of ride between both sets of riders.
Apart from the weekends (Saturday and Sunday) the ride length of the member riders remained relatively unchanged.
There is a slight uptick (about 16%) in member riders’ length of rides on weekends, it starts on Friday and experiences a sharp drop on Monday. It remains relatively the same until Friday again.
The ride length for casual riders changed significantly from day to day with the peek coming on weekends (Saturday and Sunday).
There is a visible inverse bell curve for the ride length of the casual riders, from the beginning of the week (Sunday) to the end of the week (Saturday). There is a dip in midweek and then a steady climb until the weekend, then a slow drop after the weekend.
Comparing the casual riders to the member riders, there is a doubling (100% increase) in ride length from the member riders to the casual riders. This means that for any giving day of the week the casual riders have twice the ride time of the member riders.
Challenges
The lack of individual rider identifiers which would have been integral in deciphering the ride patterns of casual riders would have been quite useful. Finding out if the casual riders were repeat users or just one-time users would have been beneficial.
Given the information analyzed, the following steps can be deployed to increase annual membership among casual riders:
Overall, more insight could be provided if more data is available, especially individual ride identities. I’m positive that these recommendations provided would convince casual riders to subscribe to a membership, which in turn will lead to increased revenue.