Scenario
You are a junior data analyst working in the marketing analyst team at Cyclistic, a bike-share company in Chicago. The director of marketing believes the company’s future success depends on maximizing the number of annual memberships. Therefore, your team wants to understand how casual riders and annual members use Cyclistic bikes differently. From these insights, your team will design a new marketing strategy to convert casual riders into annual members. But first, Cyclistic executives must approve your recommendations, so they must be backed up with compelling data insights and professional data visualizations.
Characters and teams
Cyclistic: A bike-share program that features more than 5,800 bicycles and 600 docking stations. Cyclistic sets itself apart by also offering reclining bikes, hand tricycles, and cargo bikes, making bike-share more inclusive to people with disabilities and riders who can’t use a standard two-wheeled bike. The majority of riders opt for traditional bikes; about 8% of riders use the assistive options. Cyclistic users are more likely to ride for leisure, but about 30% use them to commute to work each day.
Lily Moreno: The director of marketing and your manager. Moreno is responsible for the development of campaigns and initiatives to promote the bike-share program. These may include email, social media, and other channels.
Cyclistic marketing analytics team: A team of data analysts who are responsible for collecting, analyzing, and reporting data that helps guide Cyclistic marketing strategy. You joined this team six months ago and have been busy learning about Cyclistic’s mission and business goals — as well as how you, as a junior data analyst, can help Cyclistic achieve them.
Cyclistic executive team: The notoriously detail-oriented executive team will decide whether to approve the recommended marketing program
Approach
Data analysis is a crucial step in solving problems. It is the process of collecting, organizing, and interpreting data to answer questions and reach conclusions. By analyzing data, businesses, organizations, and individuals can identify patterns and trends that can help them make informed decisions and improve their performance. This will involve sorting the data into categories and performing calculations or visualizing it with charts and graphs. I will also examine the data for any outliers or inconsistencies that may indicate a problem. After the data has been analyzed, it is important to interpret the results. This involves examining the data to identify any patterns or trends that may be present. It is also important to consider any potential causes of the problem, such as internal processes or external factors.
Ask phase
In this step, I asked myself questions such as ‘How do annual members and casual riders use Cyclistic bikes differently?’ to guide my analysis. Lily Moreno believes that the annual Members were more profitable than the casual riders. The goal before me is to assist the marketing team analyze the contrast between casual riders and members with the aim of providing recommendations that will aid the conversion of casual riders into annual members to maximize profits.
Prepare phase
The data used for this analysis is located at (https://divvy-tripdata.s3.amazonaws.com/index.html), I downloaded the most recent 12 months from February 2022 to January 2023 and stored them locally. This version of my analysis was done using R. The dataset has no user ID or any specific commuter identifier, so the data on duplicate riders are not present.
Process Phase
I used Rstudio for the process phase. I have outlined the steps taken:
-uploaded all 12 months of the datasets into R
-Combine data into single file
-Clean and inspect data for uniformity
-Clean up by removing rows with missing or unavailable data
-Clean up column names and prepare for data analysis
Analyze Phase
The analysis was done using Rstudio, the steps are outline:
Loading Packages
library(tidyverse)
## Warning: package 'tidyverse' was built under R version 4.2.2
## Warning: package 'ggplot2' was built under R version 4.2.2
## Warning: package 'tibble' was built under R version 4.2.2
## Warning: package 'tidyr' was built under R version 4.2.2
## Warning: package 'readr' was built under R version 4.2.2
## Warning: package 'purrr' was built under R version 4.2.2
## Warning: package 'dplyr' was built under R version 4.2.2
## Warning: package 'stringr' was built under R version 4.2.2
## Warning: package 'forcats' was built under R version 4.2.2
## Warning: package 'lubridate' was built under R version 4.2.2
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.0 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ ggplot2 3.4.1 ✔ tibble 3.1.8
## ✔ lubridate 1.9.2 ✔ tidyr 1.3.0
## ✔ purrr 1.0.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the ]8;;http://conflicted.r-lib.org/conflicted package]8;; to force all conflicts to become errors
library(ggplot2)
library(lubridate)
collect and load your data into R
getwd()
## [1] "C:/Users/okoli/Desktop/data analyst/Capstone_Cyclistic_Project"
feb_2022 <- read_csv("2022_02-divvy-tripdata.csv")
## Rows: 115609 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
mar_2022 <- read_csv("2022_03-divvy-tripdata.csv")
## Rows: 284042 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
apr_2022 <- read_csv("2022_04-divvy-tripdata.csv")
## Rows: 371249 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
may_2022 <- read_csv("2022_05-divvy-tripdata.csv")
## Rows: 634858 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
jun_2022 <- read_csv("2022_06-divvy-tripdata.csv")
## Rows: 769204 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
jul_2022 <- read_csv("2022_07-divvy-tripdata.csv")
## Rows: 823488 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
aug_2022 <- read_csv("2022_08-divvy-tripdata.csv")
## Rows: 785932 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
sep_2022 <- read_csv("2022_09-divvy-tripdata.csv")
## Rows: 701339 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
oct_2022 <- read_csv("2022_10-divvy-tripdata.csv")
## Rows: 558685 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
nov_2022 <- read_csv("2022_11-divvy-tripdata.csv")
## Rows: 337735 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
dec_2022 <- read_csv("2022_12-divvy-tripdata.csv")
## Rows: 181806 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
jan_2023 <- read_csv("2023_01-divvy-tripdata.csv")
## Rows: 190301 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Inspect the dataset
check columns for differences
colnames(feb_2022)
## [1] "ride_id" "rideable_type" "started_at"
## [4] "ended_at" "start_station_name" "start_station_id"
## [7] "end_station_name" "end_station_id" "start_lat"
## [10] "start_lng" "end_lat" "end_lng"
## [13] "member_casual"
colnames(mar_2022)
## [1] "ride_id" "rideable_type" "started_at"
## [4] "ended_at" "start_station_name" "start_station_id"
## [7] "end_station_name" "end_station_id" "start_lat"
## [10] "start_lng" "end_lat" "end_lng"
## [13] "member_casual"
colnames(apr_2022)
## [1] "ride_id" "rideable_type" "started_at"
## [4] "ended_at" "start_station_name" "start_station_id"
## [7] "end_station_name" "end_station_id" "start_lat"
## [10] "start_lng" "end_lat" "end_lng"
## [13] "member_casual"
colnames(may_2022)
## [1] "ride_id" "rideable_type" "started_at"
## [4] "ended_at" "start_station_name" "start_station_id"
## [7] "end_station_name" "end_station_id" "start_lat"
## [10] "start_lng" "end_lat" "end_lng"
## [13] "member_casual"
colnames(jun_2022)
## [1] "ride_id" "rideable_type" "started_at"
## [4] "ended_at" "start_station_name" "start_station_id"
## [7] "end_station_name" "end_station_id" "start_lat"
## [10] "start_lng" "end_lat" "end_lng"
## [13] "member_casual"
colnames(jul_2022)
## [1] "ride_id" "rideable_type" "started_at"
## [4] "ended_at" "start_station_name" "start_station_id"
## [7] "end_station_name" "end_station_id" "start_lat"
## [10] "start_lng" "end_lat" "end_lng"
## [13] "member_casual"
colnames(aug_2022)
## [1] "ride_id" "rideable_type" "started_at"
## [4] "ended_at" "start_station_name" "start_station_id"
## [7] "end_station_name" "end_station_id" "start_lat"
## [10] "start_lng" "end_lat" "end_lng"
## [13] "member_casual"
colnames(sep_2022)
## [1] "ride_id" "rideable_type" "started_at"
## [4] "ended_at" "start_station_name" "start_station_id"
## [7] "end_station_name" "end_station_id" "start_lat"
## [10] "start_lng" "end_lat" "end_lng"
## [13] "member_casual"
colnames(oct_2022)
## [1] "ride_id" "rideable_type" "started_at"
## [4] "ended_at" "start_station_name" "start_station_id"
## [7] "end_station_name" "end_station_id" "start_lat"
## [10] "start_lng" "end_lat" "end_lng"
## [13] "member_casual"
colnames(nov_2022)
## [1] "ride_id" "rideable_type" "started_at"
## [4] "ended_at" "start_station_name" "start_station_id"
## [7] "end_station_name" "end_station_id" "start_lat"
## [10] "start_lng" "end_lat" "end_lng"
## [13] "member_casual"
colnames(dec_2022)
## [1] "ride_id" "rideable_type" "started_at"
## [4] "ended_at" "start_station_name" "start_station_id"
## [7] "end_station_name" "end_station_id" "start_lat"
## [10] "start_lng" "end_lat" "end_lng"
## [13] "member_casual"
colnames(jan_2023)
## [1] "ride_id" "rideable_type" "started_at"
## [4] "ended_at" "start_station_name" "start_station_id"
## [7] "end_station_name" "end_station_id" "start_lat"
## [10] "start_lng" "end_lat" "end_lng"
## [13] "member_casual"
str(jan_2023)
## spc_tbl_ [190,301 × 13] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ ride_id : chr [1:190301] "F96D5A74A3E41399" "13CB7EB698CEDB88" "BD88A2E670661CE5" "C90792D034FED968" ...
## $ rideable_type : chr [1:190301] "electric_bike" "classic_bike" "electric_bike" "classic_bike" ...
## $ started_at : POSIXct[1:190301], format: "2023-01-21 20:05:42" "2023-01-10 15:37:36" ...
## $ ended_at : POSIXct[1:190301], format: "2023-01-21 20:16:33" "2023-01-10 15:46:05" ...
## $ start_station_name: chr [1:190301] "Lincoln Ave & Fullerton Ave" "Kimbark Ave & 53rd St" "Western Ave & Lunt Ave" "Kimbark Ave & 53rd St" ...
## $ start_station_id : chr [1:190301] "TA1309000058" "TA1309000037" "RP-005" "TA1309000037" ...
## $ end_station_name : chr [1:190301] "Hampden Ct & Diversey Ave" "Greenwood Ave & 47th St" "Valli Produce - Evanston Plaza" "Greenwood Ave & 47th St" ...
## $ end_station_id : chr [1:190301] "202480.0" "TA1308000002" "599" "TA1308000002" ...
## $ start_lat : num [1:190301] 41.9 41.8 42 41.8 41.8 ...
## $ start_lng : num [1:190301] -87.6 -87.6 -87.7 -87.6 -87.6 ...
## $ end_lat : num [1:190301] 41.9 41.8 42 41.8 41.8 ...
## $ end_lng : num [1:190301] -87.6 -87.6 -87.7 -87.6 -87.6 ...
## $ member_casual : chr [1:190301] "member" "member" "casual" "member" ...
## - attr(*, "spec")=
## .. cols(
## .. ride_id = col_character(),
## .. rideable_type = col_character(),
## .. started_at = col_datetime(format = ""),
## .. ended_at = col_datetime(format = ""),
## .. start_station_name = col_character(),
## .. start_station_id = col_character(),
## .. end_station_name = col_character(),
## .. end_station_id = col_character(),
## .. start_lat = col_double(),
## .. start_lng = col_double(),
## .. end_lat = col_double(),
## .. end_lng = col_double(),
## .. member_casual = col_character()
## .. )
## - attr(*, "problems")=<externalptr>
ride_id and rideable_id were converted to character so they could be stacked correctly
feb_2022 <- mutate(feb_2022, ride_id = as.character(ride_id)
,rideable_type = as.character(rideable_type))
mar_2022 <- mutate(mar_2022, ride_id = as.character(ride_id)
,rideable_type = as.character(rideable_type))
apr_2022 <- mutate(apr_2022, ride_id = as.character(ride_id)
,rideable_type = as.character(rideable_type))
may_2022 <- mutate(may_2022, ride_id = as.character(ride_id)
,rideable_type = as.character(rideable_type))
jun_2022 <- mutate(jun_2022, ride_id = as.character(ride_id)
,rideable_type = as.character(rideable_type))
jul_2022 <- mutate(jul_2022, ride_id = as.character(ride_id)
,rideable_type = as.character(rideable_type))
aug_2022 <- mutate(aug_2022, ride_id = as.character(ride_id)
,rideable_type = as.character(rideable_type))
sep_2022 <- mutate(sep_2022, ride_id = as.character(ride_id)
,rideable_type = as.character(rideable_type))
oct_2022 <- mutate(oct_2022, ride_id = as.character(ride_id)
,rideable_type = as.character(rideable_type))
nov_2022 <- mutate(nov_2022, ride_id = as.character(ride_id)
,rideable_type = as.character(rideable_type))
dec_2022 <- mutate(dec_2022, ride_id = as.character(ride_id)
,rideable_type = as.character(rideable_type))
jan_2023 <- mutate(jan_2023, ride_id = as.character(ride_id)
,rideable_type = as.character(rideable_type))
Stack all months into one larger dataframe
all_trips <- bind_rows(feb_2022, mar_2022, apr_2022, may_2022, jun_2022, jul_2022,
aug_2022, sep_2022, oct_2022, nov_2022, dec_2022, jan_2023)
Removed the columns not necessary for analysis (lat, long)
all_trips <- all_trips %>%
select(-c(start_lat, start_lng, end_lat, end_lng))
Inspected the newly formed table for accuracy
colnames(all_trips)
## [1] "ride_id" "rideable_type" "started_at"
## [4] "ended_at" "start_station_name" "start_station_id"
## [7] "end_station_name" "end_station_id" "member_casual"
str(all_trips)
## tibble [5,754,248 × 9] (S3: tbl_df/tbl/data.frame)
## $ ride_id : chr [1:5754248] "E1E065E7ED285C02" "1602DCDC5B30FFE3" "BE7DD2AF4B55C4AF" "A1789BDF844412BE" ...
## $ rideable_type : chr [1:5754248] "classic_bike" "classic_bike" "classic_bike" "classic_bike" ...
## $ started_at : POSIXct[1:5754248], format: "2022-02-19 18:08:41" "2022-02-20 17:41:30" ...
## $ ended_at : POSIXct[1:5754248], format: "2022-02-19 18:23:56" "2022-02-20 17:45:56" ...
## $ start_station_name: chr [1:5754248] "State St & Randolph St" "Halsted St & Wrightwood Ave" "State St & Randolph St" "Southport Ave & Waveland Ave" ...
## $ start_station_id : chr [1:5754248] "TA1305000029" "TA1309000061" "TA1305000029" "13235" ...
## $ end_station_name : chr [1:5754248] "Clark St & Lincoln Ave" "Southport Ave & Wrightwood Ave" "Canal St & Adams St" "Broadway & Sheridan Rd" ...
## $ end_station_id : chr [1:5754248] "13179" "TA1307000113" "13011" "13323" ...
## $ member_casual : chr [1:5754248] "member" "member" "member" "member" ...
summary(all_trips)
## ride_id rideable_type started_at
## Length:5754248 Length:5754248 Min. :2022-02-01 00:03:18.00
## Class :character Class :character 1st Qu.:2022-06-02 15:18:09.50
## Mode :character Mode :character Median :2022-07-27 22:50:40.50
## Mean :2022-07-29 13:28:03.16
## 3rd Qu.:2022-09-22 20:34:47.25
## Max. :2023-01-31 23:56:09.00
## ended_at start_station_name start_station_id
## Min. :2022-02-01 00:09:37.00 Length:5754248 Length:5754248
## 1st Qu.:2022-06-02 15:37:50.50 Class :character Class :character
## Median :2022-07-27 23:09:33.00 Mode :character Mode :character
## Mean :2022-07-29 13:47:21.50
## 3rd Qu.:2022-09-22 20:53:25.25
## Max. :2023-02-04 04:27:03.00
## end_station_name end_station_id member_casual
## Length:5754248 Length:5754248 Length:5754248
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
Added columns that listed the date, month, day, and year of each ride
all_trips$date <- as.Date(all_trips$started_at)
all_trips$month <- format(as.Date(all_trips$date), "%m")
all_trips$day <- format(as.Date(all_trips$date), "%d")
all_trips$year <- format(as.Date(all_trips$date), "%Y")
all_trips$day_of_week <- format(as.Date(all_trips$date), "%A")
Added “ride_length” column as a calculation to all_trips (in seconds)
Inspected the current structure of the columns
all_trips$ride_length <- difftime(all_trips$ended_at,all_trips$started_at)
str(all_trips)
## tibble [5,754,248 × 15] (S3: tbl_df/tbl/data.frame)
## $ ride_id : chr [1:5754248] "E1E065E7ED285C02" "1602DCDC5B30FFE3" "BE7DD2AF4B55C4AF" "A1789BDF844412BE" ...
## $ rideable_type : chr [1:5754248] "classic_bike" "classic_bike" "classic_bike" "classic_bike" ...
## $ started_at : POSIXct[1:5754248], format: "2022-02-19 18:08:41" "2022-02-20 17:41:30" ...
## $ ended_at : POSIXct[1:5754248], format: "2022-02-19 18:23:56" "2022-02-20 17:45:56" ...
## $ start_station_name: chr [1:5754248] "State St & Randolph St" "Halsted St & Wrightwood Ave" "State St & Randolph St" "Southport Ave & Waveland Ave" ...
## $ start_station_id : chr [1:5754248] "TA1305000029" "TA1309000061" "TA1305000029" "13235" ...
## $ end_station_name : chr [1:5754248] "Clark St & Lincoln Ave" "Southport Ave & Wrightwood Ave" "Canal St & Adams St" "Broadway & Sheridan Rd" ...
## $ end_station_id : chr [1:5754248] "13179" "TA1307000113" "13011" "13323" ...
## $ member_casual : chr [1:5754248] "member" "member" "member" "member" ...
## $ date : Date[1:5754248], format: "2022-02-19" "2022-02-20" ...
## $ month : chr [1:5754248] "02" "02" "02" "02" ...
## $ day : chr [1:5754248] "19" "20" "25" "14" ...
## $ year : chr [1:5754248] "2022" "2022" "2022" "2022" ...
## $ day_of_week : chr [1:5754248] "Saturday" "Sunday" "Friday" "Monday" ...
## $ ride_length : 'difftime' num [1:5754248] 915 266 818 417 ...
## ..- attr(*, "units")= chr "secs"
Converted “ride_length” from Factor to numeric to enable the ability to run calculations on the data
is.factor(all_trips$ride_length)
## [1] FALSE
all_trips$ride_length <- as.numeric(as.character(all_trips$ride_length))
is.numeric(all_trips$ride_length)
## [1] TRUE
Removed invalid data
all_trips_v2 <- all_trips[!(all_trips$start_station_name == "HQ QR" | all_trips$ride_length<0 ),]
all_trips_v3 <- na.omit(all_trips_v2)
Descriptive analysis on ride_length (all figures in seconds)
mean(all_trips_v3$ride_length)
## [1] 1017.968
median(all_trips_v3$ride_length)
## [1] 631
max(all_trips_v3$ride_length)
## [1] 2061244
min(all_trips_v3$ride_length)
## [1] 0
summary(all_trips_v3$ride_length)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0 360 631 1018 1133 2061244
#Comparism of members and casual users
aggregate(all_trips_v3$ride_length ~ all_trips_v3$member_casual, FUN = mean)
## all_trips_v3$member_casual all_trips_v3$ride_length
## 1 casual 1429.0284
## 2 member 743.8795
aggregate(all_trips_v3$ride_length ~ all_trips_v3$member_casual, FUN = max)
## all_trips_v3$member_casual all_trips_v3$ride_length
## 1 casual 2061244
## 2 member 89872
aggregate(all_trips_v3$ride_length ~ all_trips_v3$member_casual, FUN = min)
## all_trips_v3$member_casual all_trips_v3$ride_length
## 1 casual 0
## 2 member 0
average ride time by each day for members vs casual users
aggregate(all_trips_v3$ride_length ~ all_trips_v3$member_casual + all_trips_v3$day_of_week, FUN = mean)
## all_trips_v3$member_casual all_trips_v3$day_of_week all_trips_v3$ride_length
## 1 casual Friday 1333.3149
## 2 member Friday 731.0674
## 3 casual Monday 1478.8465
## 4 member Monday 718.0486
## 5 casual Saturday 1597.8397
## 6 member Saturday 836.7879
## 7 casual Sunday 1627.6773
## 8 member Sunday 828.0498
## 9 casual Thursday 1266.9570
## 10 member Thursday 719.0429
## 11 casual Tuesday 1277.4698
## 12 member Tuesday 703.4152
## 13 casual Wednesday 1228.4272
## 14 member Wednesday 708.4237
Fixed the days of the week to appear in order
all_trips_v3$day_of_week <- ordered(all_trips_v3$day_of_week, levels=c("Sunday", "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday"))
now view average ride time by each day for members vs casual users
aggregate(all_trips_v3$ride_length ~ all_trips_v3$member_casual + all_trips_v3$day_of_week, FUN = mean)
## all_trips_v3$member_casual all_trips_v3$day_of_week all_trips_v3$ride_length
## 1 casual Sunday 1627.6773
## 2 member Sunday 828.0498
## 3 casual Monday 1478.8465
## 4 member Monday 718.0486
## 5 casual Tuesday 1277.4698
## 6 member Tuesday 703.4152
## 7 casual Wednesday 1228.4272
## 8 member Wednesday 708.4237
## 9 casual Thursday 1266.9570
## 10 member Thursday 719.0429
## 11 casual Friday 1333.3149
## 12 member Friday 731.0674
## 13 casual Saturday 1597.8397
## 14 member Saturday 836.7879
Analyzed the ridership data by type and weekday
all_trips_v3 %>%
mutate(weekday = wday(started_at, label = TRUE)) %>%
group_by(member_casual, weekday) %>%
summarise(number_of_rides = n()
,average_duration = mean(ride_length)) %>%
arrange(member_casual, weekday)
## `summarise()` has grouped output by 'member_casual'. You can override using the
## `.groups` argument.
## # A tibble: 14 × 4
## # Groups: member_casual [2]
## member_casual weekday number_of_rides average_duration
## <chr> <ord> <int> <dbl>
## 1 casual Sun 304430 1628.
## 2 casual Mon 213357 1479.
## 3 casual Tue 199913 1277.
## 4 casual Wed 206447 1228.
## 5 casual Thu 231915 1267.
## 6 casual Fri 250701 1333.
## 7 casual Sat 368403 1598.
## 8 member Sun 302990 828.
## 9 member Mon 382627 718.
## 10 member Tue 424050 703.
## 11 member Wed 422211 708.
## 12 member Thu 422467 719.
## 13 member Fri 366731 731.
## 14 member Sat 341205 837.
visualized the number of rides by rider type
all_trips_v3 %>%
mutate(weekday = wday(started_at, label = TRUE)) %>%
group_by(member_casual, weekday) %>%
summarise(number_of_rides = n()
,average_duration = mean(ride_length)) %>%
arrange(member_casual, weekday) %>%
ggplot(aes(x = weekday, y = number_of_rides, fill = member_casual)) +
geom_col(position = "dodge")
## `summarise()` has grouped output by 'member_casual'. You can override using the
## `.groups` argument.
created a visualization for the average duration
all_trips_v3 %>%
mutate(weekday = wday(started_at, label = TRUE)) %>%
group_by(member_casual, weekday) %>%
summarise(number_of_rides = n()
,average_duration = mean(ride_length)) %>%
arrange(member_casual, weekday) %>%
ggplot(aes(x = weekday, y = average_duration, fill = member_casual)) +
geom_col(position = "dodge")
## `summarise()` has grouped output by 'member_casual'. You can override using the
## `.groups` argument.
Created a csv file and exported it for further analysis
counts <- aggregate(all_trips_v3$ride_length ~ all_trips_v3$member_casual
+ all_trips_v3$day_of_week, FUN = mean)
write.csv(counts, file = "C:/Users/okoli/Downloads/Capstone_Cyclistic_Project/avg_ride_length.csv")
Share Phase
Interactive table was created with tableau and can be accessed at https://public.tableau.com/app/profile/n.o2620/viz/Ridelengthbyweekdayandridertype/Sheet1
Act phase
The first phenomenon that is easily noticeable is the stark difference between casual riders and riders with membership subscriptions, there is a significant difference in length of ride between both sets of riders.
Key Findings
Apart from the weekends (Saturday and Sunday) the ride length of the member riders remained relatively unchanged.
There is a slight untick (about 16%) in member riders’ length of rides on weekends, it starts on Friday and experiences a sharp drop on Monday. It remains relatively the same until Friday again.
The ride length for casual riders changed significantly from day to day with the peek coming on weekends (Saturday and Sunday).
There is a visible inverse bell curve for the ride length of the casual riders, from the beginning of the week (Sunday) to the end of the week (Saturday). There is a dip in midweek and then a steady climb until the weekend, then a slow drop after the weekend.
Comparing the casual riders to the member riders, there is a doubling (100% increase) in ride length from the member riders to the casual riders. This means that for any giving day of the week the casual riders have twice the ride time of the member riders.
Business Task Questions
How do annual members and casual riders use Cyclistic bikes differently?
Why would casual riders buy Cyclistic annual memberships?
How can Cyclistic use digital media to influence casual riders to become members?
Conclusion
The lack of individual rider identifiers which would have been integral in deciphering the ride patterns of casual riders would have been quite useful. Finding out if the casual riders were repeat users or just one-time users would have been beneficial.
Convincing casual riders to buy annual memberships
Given the information analyzed, the following steps can be deployed to increase annual membership among casual riders:
-If the cost to rent for a day or 2 is lower than the weekly cost of membership then the cost of a one-day usage should be increased, this would encourage weekend users to subscribe.
-Offer different types of memberships to cater to different types of riders, for example a weekend pass can be offered to capitalize on the weekend spike. The weekday pass can be made cheaper than the weekend pass to encourage weekday casual users.
-Offer members only exclusives with the rides, an example will be bikes reserved for members only, priority bike usage over the weekend due to increased demands
-Run an ad campaign to showcase the benefits of a membership, explain the pricing and the benefits
-Offer a discount if the whole annual membership fee was paid all at once instead of monthly, also offer autopay capabilities and make payment as seamless as possible by offering different pay types.
-Explore charging casual riders by the hour, while members are given unlimited ride time
Overall, more insight could be provided if more data is available, especially individual ride identities. I’m positive that these recommendations provided would convince casual riders to subscribe to a membership, which in turn will lead to increased revenue.