This is one of the capstone projects of the Google Data Analytics Professional Certificate. In this scenario, I am working as a junior data analyst in the marketing analyst team at Cyclistic, a bike-share company in Chicago.
Cyclistic is a bike-share program that was launched in 2016 that features more than 5,800 bicycles and 600 docking stations. Cyclistic sets itself apart by also offering reclining bikes, hand tricycles, and cargo bikes, making bike-share more inclusive to people with disabilities and riders who can’t use a standard two-wheeled bike. The majority of riders opt for traditional bikes; about 8% of riders use the assistive options. Cyclistic users are more likely to ride for leisure, but about 30% use them to commute to work each day. The pricing plans are: single-ride passes, full-day passes, and annual memberships. Customers who purchase single-ride or full-day passes are referred to as casual riders. Customers who purchase annual memberships are Cyclistic members.
The director of marketing believes the company’s future success depends on maximizing the number of annual memberships. My team wants to understand how casual riders and annual members use Cyclistic bikes differently. We would use the insights gained to design a new marketing strategy to convert casual riders into annual members. But first, Cyclistic executives must approve our recommendations, so they must be backed up with compelling data insights and professional data visualizations. I have followed the steps of the data analysis process:ask, prepare, process, analyze, share, and act, to be able to achieve our goal.
How do Annual members and Casual riders use Cyclistic bikes differently?
Key stakeholders
1.Lily Moreno, the director of marketing and my manger
2.Cyclistic executive team.
*By identifying and analysing trends,my team will be able to answer the key business question and design a new marketing strategy to convert casual riders into annual members
The data used is the Cyclistic’s Trip Data to analyze and identify trends.
The previous 4 months data from 2023 March 1 to 2023 August 31 is used for analysis.
The data is stored in CSV files. Each file contains one month data. So there are a total of 4 csv files.
The data is structured data
Cyclistic is a fictional company. For the purposes of this case study. It is originally based on the case study “‘Sophisticated, Clear, and Polished’: Divvy and Data Visualization” written by Kevin Hartman found here, the datasets are appropriate
There are no issues with bias or credibility.It is Reliable, Original, Current and Cited
The datasets have a different name because Cyclistic is a fictional company. For the purposes of this case study, the datasets are appropriate.
The data has been made available by Motivate International Inc. under this license
I’m using R for this analysis because given the size of this data, it’s easier and faster.
As of data Integrity, its Accurate, Consistent and Trustworthy.
This data was first cleaned in excel by removing duplicates, checking for missing data and ensuring right data type formatting of the column.
#installing and loading necessary packages
library(tidyverse) #helps wrangle data
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.3 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ ggplot2 3.4.3 ✔ tibble 3.2.1
## ✔ lubridate 1.9.2 ✔ tidyr 1.3.0
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(lubridate) #helps wrangle date attributes
library(ggplot2) #helps visualize data
library(tidyr)
library(dplyr)
#collect the data
#august_2022 <- read_csv("data/202208-202307-divvy-tripdata/202208-divvy-tripdata.csv")
#september_2022 <- read_csv("data/202208-202307-divvy-tripdata/202209-divvy-publictripdata.csv")
#october_2022 <- read_csv("data/202208-202307-divvy-tripdata/202210-divvy-tripdata.csv")
#november_2022 <- read_csv("data/202208-202307-divvy-tripdata/202211-divvy-tripdata.csv")
#december_2022 <- read_csv("data/202208-202307-divvy-tripdata/202212-divvy-tripdata.csv")
#january_2023 <- read_csv("data/202208-202307-divvy-tripdata/202301-divvy-tripdata.csv")
february_2023 <- read_csv("data/202302-divvy-tripdata.csv")
## Rows: 190445 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
march_2023 <- read_csv("data/202303-divvy-tripdata.csv")
## Rows: 258678 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
april_2023 <- read_csv("data/202304-divvy-tripdata.csv")
## Rows: 426590 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
may_2023 <- read_csv("data/202305-divvy-tripdata.csv")
## Rows: 604827 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
june_2023 <- read_csv("data//202306-divvy-tripdata.csv")
## Rows: 719618 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
july_2023 <- read_csv("data/202307-divvy-tripdata.csv")
## Rows: 767650 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
#wrangle data and combine into a single file
#compare column names with each other to ensure consistency. so that they can be joined into one file
#colnames(august_2022)
#colnames(september_2022)
#colnames(october_2022)
#colnames(novemer_2022)
#colnames(december_2022)
#colnames(january_2023)
colnames(february_2023)
## [1] "ride_id" "rideable_type" "started_at"
## [4] "ended_at" "start_station_name" "start_station_id"
## [7] "end_station_name" "end_station_id" "start_lat"
## [10] "start_lng" "end_lat" "end_lng"
## [13] "member_casual"
colnames(march_2023)
## [1] "ride_id" "rideable_type" "started_at"
## [4] "ended_at" "start_station_name" "start_station_id"
## [7] "end_station_name" "end_station_id" "start_lat"
## [10] "start_lng" "end_lat" "end_lng"
## [13] "member_casual"
colnames(april_2023)
## [1] "ride_id" "rideable_type" "started_at"
## [4] "ended_at" "start_station_name" "start_station_id"
## [7] "end_station_name" "end_station_id" "start_lat"
## [10] "start_lng" "end_lat" "end_lng"
## [13] "member_casual"
colnames(may_2023)
## [1] "ride_id" "rideable_type" "started_at"
## [4] "ended_at" "start_station_name" "start_station_id"
## [7] "end_station_name" "end_station_id" "start_lat"
## [10] "start_lng" "end_lat" "end_lng"
## [13] "member_casual"
colnames(june_2023)
## [1] "ride_id" "rideable_type" "started_at"
## [4] "ended_at" "start_station_name" "start_station_id"
## [7] "end_station_name" "end_station_id" "start_lat"
## [10] "start_lng" "end_lat" "end_lng"
## [13] "member_casual"
colnames(july_2023)
## [1] "ride_id" "rideable_type" "started_at"
## [4] "ended_at" "start_station_name" "start_station_id"
## [7] "end_station_name" "end_station_id" "start_lat"
## [10] "start_lng" "end_lat" "end_lng"
## [13] "member_casual"
#now that it is confirm the column names are consistent, combine the dataset as one file
all_trips <- rbind(april_2023, may_2023, june_2023, july_2023)
str(all_trips)
## spc_tbl_ [2,518,685 × 13] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ ride_id : chr [1:2518685] "8FE8F7D9C10E88C7" "34E4ED3ADF1D821B" "5296BF07A2F77CB5" "40759916B76D5D52" ...
## $ rideable_type : chr [1:2518685] "electric_bike" "electric_bike" "electric_bike" "electric_bike" ...
## $ started_at : POSIXct[1:2518685], format: "2023-04-02 08:37:28" "2023-04-19 11:29:02" ...
## $ ended_at : POSIXct[1:2518685], format: "2023-04-02 08:41:37" "2023-04-19 11:52:12" ...
## $ start_station_name: chr [1:2518685] NA NA NA NA ...
## $ start_station_id : chr [1:2518685] NA NA NA NA ...
## $ end_station_name : chr [1:2518685] NA NA NA NA ...
## $ end_station_id : chr [1:2518685] NA NA NA NA ...
## $ start_lat : num [1:2518685] 41.8 41.9 41.9 41.9 41.9 ...
## $ start_lng : num [1:2518685] -87.6 -87.7 -87.7 -87.7 -87.7 ...
## $ end_lat : num [1:2518685] 41.8 41.9 41.9 41.9 41.9 ...
## $ end_lng : num [1:2518685] -87.6 -87.7 -87.7 -87.7 -87.6 ...
## $ member_casual : chr [1:2518685] "member" "member" "member" "member" ...
## - attr(*, "spec")=
## .. cols(
## .. ride_id = col_character(),
## .. rideable_type = col_character(),
## .. started_at = col_datetime(format = ""),
## .. ended_at = col_datetime(format = ""),
## .. start_station_name = col_character(),
## .. start_station_id = col_character(),
## .. end_station_name = col_character(),
## .. end_station_id = col_character(),
## .. start_lat = col_double(),
## .. start_lng = col_double(),
## .. end_lat = col_double(),
## .. end_lng = col_double(),
## .. member_casual = col_character()
## .. )
## - attr(*, "problems")=<externalptr>
#remove data with missing values
all_trips <- drop_na(all_trips)
# Convert ride_id and rideable_type to character so that they can stack correctly
all_trips <- mutate(all_trips, ride_id = as.character(ride_id))
# Remove columns that are not needed
all_trips <- all_trips %>% select(-c(start_lat, start_lng, end_lat, end_lng))
#all_trips <- subset(all_trips, select = -c(start_lat, start_lng, end_lat, end_lng))
# Inspect the new table that has been created
colnames(all_trips) #List of column names
## [1] "ride_id" "rideable_type" "started_at"
## [4] "ended_at" "start_station_name" "start_station_id"
## [7] "end_station_name" "end_station_id" "member_casual"
nrow(all_trips) #How many rows are in data frame
## [1] 1896140
dim(all_trips) #Dimensions of the data frame
## [1] 1896140 9
head(all_trips) #See the first 6 rows of data frame. Also tail(qs_raw)
## # A tibble: 6 × 9
## ride_id rideable_type started_at ended_at
## <chr> <chr> <dttm> <dttm>
## 1 5B6500E1E58655C0 classic_bike 2023-04-10 17:34:35 2023-04-10 18:02:36
## 2 AA65D25D69AF771F classic_bike 2023-04-12 12:29:46 2023-04-12 12:54:00
## 3 079FB2C196414482 electric_bike 2023-04-13 17:39:23 2023-04-13 17:40:57
## 4 599623864C871207 classic_bike 2023-04-29 20:57:10 2023-04-29 20:57:13
## 5 63ECC8A13D11A76A classic_bike 2023-04-20 17:03:11 2023-04-20 17:24:58
## 6 A396F54F4C1927D8 electric_bike 2023-04-14 22:29:12 2023-04-14 22:29:15
## # ℹ 5 more variables: start_station_name <chr>, start_station_id <chr>,
## # end_station_name <chr>, end_station_id <chr>, member_casual <chr>
str(all_trips) #See list of columns and data types (numeric, character, etc)
## tibble [1,896,140 × 9] (S3: tbl_df/tbl/data.frame)
## $ ride_id : chr [1:1896140] "5B6500E1E58655C0" "AA65D25D69AF771F" "079FB2C196414482" "599623864C871207" ...
## $ rideable_type : chr [1:1896140] "classic_bike" "classic_bike" "electric_bike" "classic_bike" ...
## $ started_at : POSIXct[1:1896140], format: "2023-04-10 17:34:35" "2023-04-12 12:29:46" ...
## $ ended_at : POSIXct[1:1896140], format: "2023-04-10 18:02:36" "2023-04-12 12:54:00" ...
## $ start_station_name: chr [1:1896140] "Avenue O & 134th St" "Cottage Grove Ave & 51st St" "Morgan Ave & 14th Pl" "Cottage Grove Ave & 51st St" ...
## $ start_station_id : chr [1:1896140] "20214" "TA1309000067" "TA1306000002" "TA1309000067" ...
## $ end_station_name : chr [1:1896140] "Avenue O & 134th St" "Cottage Grove Ave & 51st St" "Morgan Ave & 14th Pl" "Cottage Grove Ave & 51st St" ...
## $ end_station_id : chr [1:1896140] "20214" "TA1309000067" "TA1306000002" "TA1309000067" ...
## $ member_casual : chr [1:1896140] "member" "member" "member" "member" ...
summary(all_trips) #Statistical summary of data. Mainly for numerics
## ride_id rideable_type started_at
## Length:1896140 Length:1896140 Min. :2023-04-01 00:00:02.0
## Class :character Class :character 1st Qu.:2023-05-11 20:42:45.5
## Mode :character Mode :character Median :2023-06-09 13:37:44.0
## Mean :2023-06-07 20:49:18.5
## 3rd Qu.:2023-07-07 13:17:26.5
## Max. :2023-07-31 23:59:15.0
## ended_at start_station_name start_station_id
## Min. :2023-04-01 00:03:10.00 Length:1896140 Length:1896140
## 1st Qu.:2023-05-11 20:58:41.75 Class :character Class :character
## Median :2023-06-09 13:55:39.00 Mode :character Mode :character
## Mean :2023-06-07 21:06:39.20
## 3rd Qu.:2023-07-07 13:37:24.75
## Max. :2023-08-01 20:40:50.00
## end_station_name end_station_id member_casual
## Length:1896140 Length:1896140 Length:1896140
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
# Add columns that list the date, month, day, and year of each ride
# This will allow us to aggregate ride data for each month, day, or year
all_trips$date <- as.Date(all_trips$started_at) #The default format for date in R is yyyy-mm-dd
all_trips$month <- format(as.Date(all_trips$date), "%m")
all_trips$day <- format(as.Date(all_trips$date), "%d")
all_trips$year <- format(as.Date(all_trips$date), "%Y")
all_trips$day_of_week <- format(as.Date(all_trips$date), "%A")
# Add a "ride_length" calculation to all_trips (in seconds)
all_trips$ride_length <- difftime(all_trips$ended_at, all_trips$started_at)
# Inspect the structure of the columns
str(all_trips)
## tibble [1,896,140 × 15] (S3: tbl_df/tbl/data.frame)
## $ ride_id : chr [1:1896140] "5B6500E1E58655C0" "AA65D25D69AF771F" "079FB2C196414482" "599623864C871207" ...
## $ rideable_type : chr [1:1896140] "classic_bike" "classic_bike" "electric_bike" "classic_bike" ...
## $ started_at : POSIXct[1:1896140], format: "2023-04-10 17:34:35" "2023-04-12 12:29:46" ...
## $ ended_at : POSIXct[1:1896140], format: "2023-04-10 18:02:36" "2023-04-12 12:54:00" ...
## $ start_station_name: chr [1:1896140] "Avenue O & 134th St" "Cottage Grove Ave & 51st St" "Morgan Ave & 14th Pl" "Cottage Grove Ave & 51st St" ...
## $ start_station_id : chr [1:1896140] "20214" "TA1309000067" "TA1306000002" "TA1309000067" ...
## $ end_station_name : chr [1:1896140] "Avenue O & 134th St" "Cottage Grove Ave & 51st St" "Morgan Ave & 14th Pl" "Cottage Grove Ave & 51st St" ...
## $ end_station_id : chr [1:1896140] "20214" "TA1309000067" "TA1306000002" "TA1309000067" ...
## $ member_casual : chr [1:1896140] "member" "member" "member" "member" ...
## $ date : Date[1:1896140], format: "2023-04-10" "2023-04-12" ...
## $ month : chr [1:1896140] "04" "04" "04" "04" ...
## $ day : chr [1:1896140] "10" "12" "13" "29" ...
## $ year : chr [1:1896140] "2023" "2023" "2023" "2023" ...
## $ day_of_week : chr [1:1896140] "Monday" "Wednesday" "Thursday" "Saturday" ...
## $ ride_length : 'difftime' num [1:1896140] 1681 1454 94 3 ...
## ..- attr(*, "units")= chr "secs"
# Convert "ride_length" from Factor to numeric so we can run calculations on the data
is.factor(all_trips$ride_length)
## [1] FALSE
all_trips$ride_length <- as.numeric(as.character(all_trips$ride_length))
is.numeric(all_trips$ride_length)
## [1] TRUE
# Remove "bad" data
# The dataframe includes a few hundred entries when bikes were taken out of docks and checked for quality by Divvy or ride_length was negative
# new version of the data v2 will be;
all_trips_v2 <- all_trips[!(all_trips$start_station_name == "HQ QR" | all_trips$ride_length<0),]
#Conduct descriptive analysis
# Descriptive analysis on ride_length (all figures in seconds)
mean(all_trips_v2$ride_length) #straight average (total ride length / rides)
## [1] 1040.609
median(all_trips_v2$ride_length) #midpoint number in the ascending array of ride lengths
## [1] 640
max(all_trips_v2$ride_length) #longest ride
## [1] 728178
min(all_trips_v2$ride_length) #shortest ride
## [1] 0
#summary of the descriptive analysis
summary(all_trips_v2$ride_length)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0 362 640 1041 1150 728178
# Compare members and casual users
aggregate(all_trips_v2$ride_length ~ all_trips_v2$member_casual, FUN = mean)
## all_trips_v2$member_casual all_trips_v2$ride_length
## 1 casual 1460.3255
## 2 member 764.3405
aggregate(all_trips_v2$ride_length ~ all_trips_v2$member_casual, FUN = median)
## all_trips_v2$member_casual all_trips_v2$ride_length
## 1 casual 828
## 2 member 548
aggregate(all_trips_v2$ride_length ~ all_trips_v2$member_casual, FUN = max)
## all_trips_v2$member_casual all_trips_v2$ride_length
## 1 casual 728178
## 2 member 89165
aggregate(all_trips_v2$ride_length ~ all_trips_v2$member_casual, FUN = min)
## all_trips_v2$member_casual all_trips_v2$ride_length
## 1 casual 0
## 2 member 0
# See the average ride time by each day for members vs casual users
aggregate(all_trips_v2$ride_length ~ all_trips_v2$member_casual + all_trips_v2$day_of_week, FUN = mean)
## all_trips_v2$member_casual all_trips_v2$day_of_week all_trips_v2$ride_length
## 1 casual Friday 1405.3515
## 2 member Friday 747.9173
## 3 casual Monday 1465.2394
## 4 member Monday 733.0519
## 5 casual Saturday 1619.7595
## 6 member Saturday 857.7113
## 7 casual Sunday 1660.4473
## 8 member Sunday 855.6031
## 9 casual Thursday 1285.8628
## 10 member Thursday 731.8465
## 11 casual Tuesday 1350.4401
## 12 member Tuesday 737.8307
## 13 casual Wednesday 1240.3217
## 14 member Wednesday 719.3947
# Fixing the days of the week that are out of order
all_trips_v2$day_of_week <- ordered(all_trips_v2$day_of_week, levels=c("Sunday", "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday"))
# To get the average ride time by each day for members vs casual users
aggregate(all_trips_v2$ride_length ~ all_trips_v2$member_casual + all_trips_v2$day_of_week, FUN = mean)
## all_trips_v2$member_casual all_trips_v2$day_of_week all_trips_v2$ride_length
## 1 casual Sunday 1660.4473
## 2 member Sunday 855.6031
## 3 casual Monday 1465.2394
## 4 member Monday 733.0519
## 5 casual Tuesday 1350.4401
## 6 member Tuesday 737.8307
## 7 casual Wednesday 1240.3217
## 8 member Wednesday 719.3947
## 9 casual Thursday 1285.8628
## 10 member Thursday 731.8465
## 11 casual Friday 1405.3515
## 12 member Friday 747.9173
## 13 casual Saturday 1619.7595
## 14 member Saturday 857.7113
# analyze ridership data by type and weekday
all_trips_v2 %>%
mutate(weekday = wday(started_at, label = TRUE)) %>%
group_by(member_casual, weekday) %>%
summarise(number_of_rides = n()
,average_duration = mean(ride_length)) %>%
arrange(member_casual, weekday)
## `summarise()` has grouped output by 'member_casual'. You can override using the
## `.groups` argument.
## # A tibble: 14 × 4
## # Groups: member_casual [2]
## member_casual weekday number_of_rides average_duration
## <chr> <ord> <int> <dbl>
## 1 casual Sun 122944 1660.
## 2 casual Mon 91537 1465.
## 3 casual Tue 84486 1350.
## 4 casual Wed 82016 1240.
## 5 casual Thu 98037 1286.
## 6 casual Fri 113100 1405.
## 7 casual Sat 160539 1620.
## 8 member Sun 128154 856.
## 9 member Mon 162802 733.
## 10 member Tue 170013 738.
## 11 member Wed 173191 719.
## 12 member Thu 186382 732.
## 13 member Fri 167642 748.
## 14 member Sat 155283 858.
How do Annual members and Casual riders use Cyclistic bike share differently?
From the first visual which shows the number of riders (both casual and member) per day
It is evident that on weekdays (Monday through Friday),relatively the same number of casual riders use Cyclistic each day. Same holds true for the member riders but comparing both groups, there were significantly more member riders than casual riders utilizing Cyclistic.
Interestingly, on the weekends (Saturday and Sunday), the number of riders from both groups remains relatively close, with Saturday even witnessing a higher count of casual riders.
From the second visual which shows the average_duration of both casual and member riders per day