Scenario

I am working as a junior data analyst on the marketing analyst team at Cyclistic, a bike-share company in Chicago. The director of marketing believes the company’s future success depends on maximizing the number of annual memberships. Therefore, the team wants to understand how casual riders and annual members use Cyclistic bikes differently. From these insights, the team will design a new marketing strategy to convert casual riders into annual members. But first, Cyclistic executives must approve the recommendations, so they must be backed up with compelling data insights and professional data visualizations.

About the company

In 2016, Cyclistic launched a successful bike-share offering. Since then, the program has grown to a fleet of 5,824 bicycles that are geotracked and locked into a network of 692 stations across Chicago. The bikes can be unlocked from one station and returned to any other station in the system anytime.

Until now, Cyclistic’s marketing strategy relied on building general awareness and appealing to broad consumer segments. One approach that helped make these things possible was the flexibility of its pricing plans: single-ride passes, full-day passes, and annual memberships. Customers who purchase single-ride or full-day passes are referred to as casual riders. Customers who purchase annual memberships are Cyclistic members.

Cyclistic’s finance analysts have concluded that annual members are much more profitable than casual riders. Although the pricing flexibility helps Cyclistic attract more customers, the director of marketing believes that maximizing the number of annual members will be key to future growth. Rather than creating a marketing campaign that targets all-new customers, the director believes there is a solid opportunity to convert casual riders into members. She notes that casual riders are already aware of the Cyclistic program and have chosen Cyclistic for their mobility needs.

The goal of the case study

Three questions will guide the future marketing program:

How do annual members and casual riders use Cyclistic bikes differently?
Why would casual riders buy Cyclistic annual memberships?
How can Cyclistic use digital media to influence casual riders to become members?

The director of marketing has assigned me the first question to answer: How do annual members and casual riders use Cyclistic bikes differently?

This report includes the following deliverables:

A clear statement of the business task
A description of all data sources used
Documentation of any cleaning or manipulation of data
A summary of my analysis
Supporting visualizations and key findings

Note: This case study follows Google’s data analysis process (Ask, Prepare, Process, Analyze, Share, Act).

Ask

1. Business Task: To maximize the number of annual memberships, I, as the data analyst, will find trends and patterns among casual riders and members, and identify riders who could benefit from an annual membership. I do not need to raise awareness of annual memberships among casual riders, as they are already aware of the program.

2. Stakeholders

The director of marketing
The marketing analysis team
Cyclistic’s Executive team

3. Stakeholders’ expectations: Design marketing strategies aimed at converting casual riders into annual members. To do that, however, the marketing analyst team needs to better understand how annual members and casual riders differ, why casual riders would buy a membership, and how digital media could affect their marketing tactics. The marketing team is interested in analyzing the Cyclistic historical bike trip data to identify trends.

Prepare

About the data set:

Since Cyclistic is a fictional company, I will use data from Divvy, a bike-share program based in Chicago, covering January 2023 – December 2023 to complete this case study. This data was made public by Motivate International Inc. under this license. Due to data privacy issues, personal information has been removed or encrypted.

In this phase, the required packages and the data are loaded.

library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.4.4     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(dplyr)
library(geosphere)

Data loaded, verified, and merged into a single dataframe

all_trips <- list.files(path = "Trip_data",full.names = TRUE) %>% 
  lapply(read_csv) %>% 
  bind_rows()

## Rows: 190301 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl  (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
## Rows: 190445 Columns: 13
## Rows: 258678 Columns: 13
## Rows: 426590 Columns: 13
## Rows: 604827 Columns: 13
## Rows: 719618 Columns: 13
## Rows: 767650 Columns: 13
## Rows: 771693 Columns: 13
## Rows: 666371 Columns: 13
## Rows: 537113 Columns: 13
## Rows: 362518 Columns: 13
## Rows: 224073 Columns: 13
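The loading step can be sketched on throwaway files. This is only an illustration under stated assumptions: the temporary folder, the file names, and the two toy columns are hypothetical, and base R's `read.csv` and `rbind` are swapped in for `read_csv` and `bind_rows` so that the sketch needs no packages. It also shows one possible way to verify that all files share the same columns before they are stacked.

```r
# Toy stand-in for the monthly trip files: two small CSVs in a temp folder.
trip_dir <- file.path(tempdir(), "toy_trip_data")   # hypothetical folder
dir.create(trip_dir, showWarnings = FALSE)
write.csv(data.frame(ride_id = c("A1", "A2"), member_casual = c("member", "casual")),
          file.path(trip_dir, "2023-01.csv"), row.names = FALSE)
write.csv(data.frame(ride_id = "B1", member_casual = "member"),
          file.path(trip_dir, "2023-02.csv"), row.names = FALSE)

# Read every CSV in the folder into a list of data frames.
files <- list.files(path = trip_dir, pattern = "\\.csv$", full.names = TRUE)
monthly <- lapply(files, read.csv, stringsAsFactors = FALSE)

# Verify that every file has the same columns before stacking them.
same_columns <- all(vapply(monthly,
                           function(d) identical(names(d), names(monthly[[1]])),
                           logical(1)))
stopifnot(same_columns)

# Stack the monthly data frames into one.
toy_trips <- do.call(rbind, monthly)
nrow(toy_trips)
```

Checking the column names first makes a mismatch fail loudly instead of silently producing extra columns filled with NA.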

Column names, dimensions, and summary checked

colnames(all_trips)

##  [1] "ride_id"            "rideable_type"      "started_at"        
##  [4] "ended_at"           "start_station_name" "start_station_id"  
##  [7] "end_station_name"   "end_station_id"     "start_lat"         
## [10] "start_lng"          "end_lat"            "end_lng"           
## [13] "member_casual"

dim(all_trips)

## [1] 5719877      13

head(all_trips)

## # A tibble: 6 × 13
##   ride_id          rideable_type started_at          ended_at           
##   <chr>            <chr>         <dttm>              <dttm>             
## 1 F96D5A74A3E41399 electric_bike 2023-01-21 20:05:42 2023-01-21 20:16:33
## 2 13CB7EB698CEDB88 classic_bike  2023-01-10 15:37:36 2023-01-10 15:46:05
## 3 BD88A2E670661CE5 electric_bike 2023-01-02 07:51:57 2023-01-02 08:05:11
## 4 C90792D034FED968 classic_bike  2023-01-22 10:52:58 2023-01-22 11:01:44
## 5 3397017529188E8A classic_bike  2023-01-12 13:58:01 2023-01-12 14:13:20
## 6 58E68156DAE3E311 electric_bike 2023-01-31 07:18:03 2023-01-31 07:21:16
## # ℹ 9 more variables: start_station_name <chr>, start_station_id <chr>,
## #   end_station_name <chr>, end_station_id <chr>, start_lat <dbl>,
## #   start_lng <dbl>, end_lat <dbl>, end_lng <dbl>, member_casual <chr>

summary(all_trips)

##    ride_id          rideable_type        started_at                    
##  Length:5719877     Length:5719877     Min.   :2023-01-01 00:01:58.00  
##  Class :character   Class :character   1st Qu.:2023-05-21 12:50:44.00  
##  Mode  :character   Mode  :character   Median :2023-07-20 18:02:50.00  
##                                        Mean   :2023-07-16 10:27:50.01  
##                                        3rd Qu.:2023-09-16 20:08:49.00  
##                                        Max.   :2023-12-31 23:59:38.00  
##                                                                        
##     ended_at                      start_station_name start_station_id  
##  Min.   :2023-01-01 00:02:41.00   Length:5719877     Length:5719877    
##  1st Qu.:2023-05-21 13:14:09.00   Class :character   Class :character  
##  Median :2023-07-20 18:19:47.00   Mode  :character   Mode  :character  
##  Mean   :2023-07-16 10:46:00.18                                        
##  3rd Qu.:2023-09-16 20:28:10.00                                        
##  Max.   :2024-01-01 23:50:51.00                                        
##                                                                        
##  end_station_name   end_station_id       start_lat       start_lng     
##  Length:5719877     Length:5719877     Min.   :41.63   Min.   :-87.94  
##  Class :character   Class :character   1st Qu.:41.88   1st Qu.:-87.66  
##  Mode  :character   Mode  :character   Median :41.90   Median :-87.64  
##                                        Mean   :41.90   Mean   :-87.65  
##                                        3rd Qu.:41.93   3rd Qu.:-87.63  
##                                        Max.   :42.07   Max.   :-87.46  
##                                                                        
##     end_lat         end_lng       member_casual     
##  Min.   : 0.00   Min.   :-88.16   Length:5719877    
##  1st Qu.:41.88   1st Qu.:-87.66   Class :character  
##  Median :41.90   Median :-87.64   Mode  :character  
##  Mean   :41.90   Mean   :-87.65                     
##  3rd Qu.:41.93   3rd Qu.:-87.63                     
##  Max.   :42.18   Max.   :  0.00                     
##  NA's   :6990    NA's   :6990

Process

Data Cleaning before conducting analysis

Added columns that list the date, month, day, year, and day of the week of each ride, since we might need to aggregate ride data by month, day, or year. The default date format is yyyy-mm-dd. Columns verified:

all_trips$date <- as.Date(all_trips$started_at)
all_trips$month <- format(as.Date(all_trips$date),"%m")
all_trips$day <- format(as.Date(all_trips$date),"%d")
all_trips$year <- format(as.Date(all_trips$date),"%Y")
all_trips$day_of_week <- format(as.Date(all_trips$date),"%A")

colnames(all_trips)

##  [1] "ride_id"            "rideable_type"      "started_at"        
##  [4] "ended_at"           "start_station_name" "start_station_id"  
##  [7] "end_station_name"   "end_station_id"     "start_lat"         
## [10] "start_lng"          "end_lat"            "end_lng"           
## [13] "member_casual"      "date"               "month"             
## [16] "day"                "year"               "day_of_week"
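As a small sketch of what these `format()` calls return, the start time of the first ride shown in the `head()` output can be broken into its parts. The variable names are illustrative only. Note that `%A` returns the weekday name in the session's locale, so the locale-independent `%u` (ISO weekday number, Monday = 1) is shown alongside it as an assumption-free alternative.

```r
# Start time of the first ride shown in the head() output.
started_at <- as.POSIXct("2023-01-21 20:05:42", tz = "UTC")
ride_date  <- as.Date(started_at)

month_chr <- format(ride_date, "%m")      # zero-padded month
day_chr   <- format(ride_date, "%d")      # zero-padded day of the month
year_chr  <- format(ride_date, "%Y")      # four-digit year
weekday_name <- format(ride_date, "%A")   # full weekday name, locale-dependent
weekday_num  <- format(ride_date, "%u")   # ISO weekday number, Monday = 1

c(month_chr, day_chr, year_chr, weekday_num)
```

All of these values are character strings, which is why the month and day columns appear as `character` in the later summaries.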

Added a “ride_length” column to all_trips (in seconds) so that ride lengths can be compared across rides

all_trips$ride_length <- difftime(all_trips$ended_at, all_trips$started_at, units = "secs")

Converted “ride_length” from difftime to numeric so we can run calculations on the data

all_trips$ride_length <- as.numeric(as.character(all_trips$ride_length))
is.numeric(all_trips$ride_length)

## [1] TRUE
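A minimal sketch of the ride-length calculation for a single ride, using the first ride from the `head()` output. When `difftime()` is left at its default `units = "auto"`, it chooses the unit from the data, so requesting seconds explicitly keeps the column's unit predictable. The variable names are illustrative.

```r
started_at <- as.POSIXct("2023-01-21 20:05:42", tz = "UTC")
ended_at   <- as.POSIXct("2023-01-21 20:16:33", tz = "UTC")

# With automatic units, a single ride of about 11 minutes is reported in minutes.
auto_units <- units(difftime(ended_at, started_at))

# Asking for seconds explicitly avoids relying on the automatic choice.
ride_length_secs <- as.numeric(difftime(ended_at, started_at, units = "secs"))
ride_length_mins <- ride_length_secs / 60

ride_length_secs
```

Dividing by 60 afterwards gives minutes, which may be easier to read in tables and plots.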

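Remove dirty data:

Removed rides whose length is less than or equal to 0 seconds or greater than 1440, since a ride length should be neither negative nor longer than one day. Because ride_length is stored in seconds, the threshold of 1440 in the code below keeps only rides of up to 1440 seconds (24 minutes); a one-day cutoff of 1440 minutes would correspond to 86400 seconds. Created a new data frame without those records. New data frame checked: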

all_trips_v2 <- all_trips[!(all_trips$ride_length <= 0 | all_trips$ride_length > 1440),]

dim(all_trips_v2)

## [1] 4902180      19
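Because `ride_length` is stored in seconds, a threshold written as 1440 is compared against seconds. The following sketch, using made-up ride lengths, contrasts the filter as coded with a cutoff of one full day expressed in seconds. The variable names are hypothetical.

```r
secs_per_minute <- 60
one_day_secs <- 1440 * secs_per_minute   # 1440 minutes = 86400 seconds

# Hypothetical ride lengths in seconds.
ride_length <- c(-30, 0, 600, 1440, 1500, 86400, 90000)

keep_as_coded <- ride_length > 0 & ride_length <= 1440          # what the filter above keeps
keep_one_day  <- ride_length > 0 & ride_length <= one_day_secs  # a one-day cutoff

ride_length[keep_as_coded]
ride_length[keep_one_day]
```

With the filter as coded, the 1500-second ride (25 minutes) is dropped, which is consistent with the maximum `ride_length` of 1440.0 in the summary that follows.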

summary(all_trips_v2)

##    ride_id          rideable_type        started_at                    
##  Length:4902180     Length:4902180     Min.   :2023-01-01 00:01:58.00  
##  Class :character   Class :character   1st Qu.:2023-05-18 22:29:22.25  
##  Mode  :character   Mode  :character   Median :2023-07-20 20:34:25.50  
##                                        Mean   :2023-07-16 07:43:44.45  
##                                        3rd Qu.:2023-09-19 18:09:40.50  
##                                        Max.   :2023-12-31 23:59:38.00  
##                                                                        
##     ended_at                      start_station_name start_station_id  
##  Min.   :2023-01-01 00:02:41.00   Length:4902180     Length:4902180    
##  1st Qu.:2023-05-18 22:37:03.75   Class :character   Class :character  
##  Median :2023-07-20 20:44:02.00   Mode  :character   Mode  :character  
##  Mean   :2023-07-16 07:53:04.57                                        
##  3rd Qu.:2023-09-19 18:18:52.25                                        
##  Max.   :2024-01-01 00:06:08.00                                        
##                                                                        
##  end_station_name   end_station_id       start_lat       start_lng     
##  Length:4902180     Length:4902180     Min.   :41.64   Min.   :-87.92  
##  Class :character   Class :character   1st Qu.:41.88   1st Qu.:-87.66  
##  Mode  :character   Mode  :character   Median :41.90   Median :-87.64  
##                                        Mean   :41.90   Mean   :-87.65  
##                                        3rd Qu.:41.93   3rd Qu.:-87.63  
##                                        Max.   :42.07   Max.   :-87.52  
##                                                                        
##     end_lat         end_lng       member_casual           date           
##  Min.   : 0.00   Min.   :-87.99   Length:4902180     Min.   :2023-01-01  
##  1st Qu.:41.88   1st Qu.:-87.66   Class :character   1st Qu.:2023-05-18  
##  Median :41.90   Median :-87.65   Mode  :character   Median :2023-07-20  
##  Mean   :41.90   Mean   :-87.65                      Mean   :2023-07-15  
##  3rd Qu.:41.93   3rd Qu.:-87.63                      3rd Qu.:2023-09-19  
##  Max.   :42.09   Max.   :  0.00                      Max.   :2023-12-31  
##  NA's   :146     NA's   :146                                             
##     month               day                year           day_of_week       
##  Length:4902180     Length:4902180     Length:4902180     Length:4902180    
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##                                                                             
##   ride_length    
##  Min.   :   1.0  
##  1st Qu.: 295.0  
##  Median : 492.0  
##  Mean   : 560.1  
##  3rd Qu.: 778.0  
##  Max.   :1440.0  
##

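Removed rows containing NA values from all_trips_v2. drop_na() drops a row when any column is missing, which reduces the data from 4,902,180 to 3,667,286 rows. Only 146 of the remaining rows lacked end coordinates, so most of the removed rows presumably have a missing station name or ID.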

all_trips_v2 <- drop_na(all_trips_v2)
summary(all_trips_v2)

##    ride_id          rideable_type        started_at                    
##  Length:3667286     Length:3667286     Min.   :2023-01-01 00:03:26.00  
##  Class :character   Class :character   1st Qu.:2023-05-17 15:06:54.50  
##  Mode  :character   Mode  :character   Median :2023-07-20 17:09:49.00  
##                                        Mean   :2023-07-15 13:37:51.17  
##                                        3rd Qu.:2023-09-19 16:28:53.00  
##                                        Max.   :2023-12-31 23:58:55.00  
##     ended_at                      start_station_name start_station_id  
##  Min.   :2023-01-01 00:07:23.00   Length:3667286     Length:3667286    
##  1st Qu.:2023-05-17 15:15:37.00   Class :character   Class :character  
##  Median :2023-07-20 17:20:26.00   Mode  :character   Mode  :character  
##  Mean   :2023-07-15 13:47:20.02                                        
##  3rd Qu.:2023-09-19 16:38:41.50                                        
##  Max.   :2024-01-01 00:06:08.00                                        
##  end_station_name   end_station_id       start_lat       start_lng     
##  Length:3667286     Length:3667286     Min.   :41.65   Min.   :-87.84  
##  Class :character   Class :character   1st Qu.:41.88   1st Qu.:-87.66  
##  Mode  :character   Mode  :character   Median :41.90   Median :-87.64  
##                                        Mean   :41.90   Mean   :-87.65  
##                                        3rd Qu.:41.93   3rd Qu.:-87.63  
##                                        Max.   :42.06   Max.   :-87.53  
##     end_lat         end_lng       member_casual           date           
##  Min.   : 0.00   Min.   :-87.84   Length:3667286     Min.   :2023-01-01  
##  1st Qu.:41.88   1st Qu.:-87.66   Class :character   1st Qu.:2023-05-17  
##  Median :41.90   Median :-87.64   Mode  :character   Median :2023-07-20  
##  Mean   :41.90   Mean   :-87.65                      Mean   :2023-07-14  
##  3rd Qu.:41.93   3rd Qu.:-87.63                      3rd Qu.:2023-09-19  
##  Max.   :42.06   Max.   :  0.00                      Max.   :2023-12-31  
##     month               day                year           day_of_week       
##  Length:3667286     Length:3667286     Length:3667286     Length:3667286    
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##   ride_length    
##  Min.   :   1.0  
##  1st Qu.: 304.0  
##  Median : 501.0  
##  Mean   : 568.9  
##  3rd Qu.: 786.0  
##  Max.   :1440.0
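Before dropping rows, it can help to see which columns hold the missing values. A minimal sketch on a hypothetical toy data frame follows; `complete.cases()` is used here as the base R counterpart of `drop_na()` so that the example needs no packages.

```r
# Hypothetical toy data with one missing station name and one missing coordinate.
toy <- data.frame(
  ride_id = c("A1", "A2", "A3", "A4"),
  start_station_name = c("Clark St", NA, "State St", "Lake St"),
  end_lat = c(41.90, 41.88, NA, 41.93),
  stringsAsFactors = FALSE
)

na_per_column <- colSums(is.na(toy))          # number of missing values per column
complete_rows <- toy[complete.cases(toy), ]   # keeps only rows with no NA in any column

na_per_column
nrow(complete_rows)
```

A per-column count like this shows whether the dropped rows come from missing coordinates or from missing station fields.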

Computed the distance of each ride in meters from its start and end coordinates, then viewed the data frame and its summary

all_trips_v2$ride_distance <- distGeo(matrix(c(all_trips_v2$start_lng,all_trips_v2$start_lat),ncol = 2),
                                      matrix(c(all_trips_v2$end_lng,all_trips_v2$end_lat),ncol = 2))
View(all_trips_v2)
summary(all_trips_v2)

##    ride_id          rideable_type        started_at                    
##  Length:3667286     Length:3667286     Min.   :2023-01-01 00:03:26.00  
##  Class :character   Class :character   1st Qu.:2023-05-17 15:06:54.50  
##  Mode  :character   Mode  :character   Median :2023-07-20 17:09:49.00  
##                                        Mean   :2023-07-15 13:37:51.17  
##                                        3rd Qu.:2023-09-19 16:28:53.00  
##                                        Max.   :2023-12-31 23:58:55.00  
##     ended_at                      start_station_name start_station_id  
##  Min.   :2023-01-01 00:07:23.00   Length:3667286     Length:3667286    
##  1st Qu.:2023-05-17 15:15:37.00   Class :character   Class :character  
##  Median :2023-07-20 17:20:26.00   Mode  :character   Mode  :character  
##  Mean   :2023-07-15 13:47:20.02                                        
##  3rd Qu.:2023-09-19 16:38:41.50                                        
##  Max.   :2024-01-01 00:06:08.00                                        
##  end_station_name   end_station_id       start_lat       start_lng     
##  Length:3667286     Length:3667286     Min.   :41.65   Min.   :-87.84  
##  Class :character   Class :character   1st Qu.:41.88   1st Qu.:-87.66  
##  Mode  :character   Mode  :character   Median :41.90   Median :-87.64  
##                                        Mean   :41.90   Mean   :-87.65  
##                                        3rd Qu.:41.93   3rd Qu.:-87.63  
##                                        Max.   :42.06   Max.   :-87.53  
##     end_lat         end_lng       member_casual           date           
##  Min.   : 0.00   Min.   :-87.84   Length:3667286     Min.   :2023-01-01  
##  1st Qu.:41.88   1st Qu.:-87.66   Class :character   1st Qu.:2023-05-17  
##  Median :41.90   Median :-87.64   Mode  :character   Median :2023-07-20  
##  Mean   :41.90   Mean   :-87.65                      Mean   :2023-07-14  
##  3rd Qu.:41.93   3rd Qu.:-87.63                      3rd Qu.:2023-09-19  
##  Max.   :42.06   Max.   :  0.00                      Max.   :2023-12-31  
##     month               day                year           day_of_week       
##  Length:3667286     Length:3667286     Length:3667286     Length:3667286    
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##   ride_length     ride_distance    
##  Min.   :   1.0   Min.   :      0  
##  1st Qu.: 304.0   1st Qu.:    857  
##  Median : 501.0   Median :   1414  
##  Mean   : 568.9   Mean   :   1733  
##  3rd Qu.: 786.0   3rd Qu.:   2325  
##  Max.   :1440.0   Max.   :9818680
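To illustrate how a distance in meters follows from two coordinate pairs, here is a self-contained haversine sketch. It is a spherical approximation swapped in for illustration, not the ellipsoidal calculation that `distGeo()` performs, so its results differ slightly. The function name `haversine_m` and the sample coordinates are hypothetical.

```r
# Great-circle distance in meters on a sphere of radius 6,371 km.
haversine_m <- function(lng1, lat1, lng2, lat2, radius = 6371000) {
  to_rad <- pi / 180
  dlat <- (lat2 - lat1) * to_rad
  dlng <- (lng2 - lng1) * to_rad
  a <- sin(dlat / 2)^2 + cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlng / 2)^2
  2 * radius * asin(pmin(1, sqrt(a)))
}

same_station     <- haversine_m(-87.65, 41.90, -87.65, 41.90)  # start equals end
one_degree_north <- haversine_m(-87.65, 41.00, -87.65, 42.00)  # one degree of latitude
to_zero_zero     <- haversine_m(-87.65, 41.90, 0, 0)           # end recorded as (0, 0)
```

The last case comes out at roughly 9,800 km. That is close to the maximum `ride_distance` in the summary, which suggests that the largest distances come from rides whose end coordinates were recorded as 0.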

Analyze

First, let’s find the number of rides by rider type. Assign the correct order to the days of the week:

all_trips_v2$day_of_week <- 
  ordered(all_trips_v2$day_of_week, levels = c('Monday', 'Tuesday', 'Wednesday',
                                               'Thursday', 'Friday', 'Saturday', 'Sunday'))

all_trips_v2 %>%
  group_by(member_casual, day_of_week) %>%
  summarise(number_of_ride = n(), .groups = 'drop') %>%
  arrange(day_of_week)

## # A tibble: 14 × 3
##    member_casual day_of_week number_of_ride
##    <chr>         <ord>                <int>
##  1 casual        Monday              132295
##  2 member        Monday              351944
##  3 casual        Tuesday             143391
##  4 member        Tuesday             408089
##  5 casual        Wednesday           148147
##  6 member        Wednesday           413726
##  7 casual        Thursday            159769
##  8 member        Thursday            412675
##  9 casual        Friday              173226
## 10 member        Friday              362254
## 11 casual        Saturday            215442
## 12 member        Saturday            304606
## 13 casual        Sunday              175226
## 14 member        Sunday              266496

Assign an order to the months of the year. The levels below start at May, so month-sorted tables and plots begin with May rather than January.

all_trips_v2$month <-
  ordered(all_trips_v2$month, levels = c('05', '06', '07', '08', '09', '10', '11', '12', '01', '02', '03', '04'))

all_trips_v2 %>%
  group_by(member_casual, month) %>%
  summarise(number_of_ride = n(), .groups = 'drop') %>%
  arrange(month)

## # A tibble: 24 × 3
##    member_casual month number_of_ride
##    <chr>         <ord>          <int>
##  1 casual        05            126599
##  2 member        05            253682
##  3 casual        06            160526
##  4 member        06            278485
##  5 casual        07            174088
##  6 member        07            287690
##  7 casual        08            170883
##  8 member        08            309150
##  9 casual        09            145166
## 10 member        09            275832
## # ℹ 14 more rows
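Since the data covers January through December of a single year, calendar order is another possible choice of levels. A small sketch follows, with a hypothetical sample of month strings; the levels are generated with `sprintf()` rather than typed by hand.

```r
month_codes <- c("05", "12", "01", "07")     # hypothetical sample of month strings
calendar_levels <- sprintf("%02d", 1:12)     # "01", "02", ..., "12"

# An ordered factor sorts by its levels, not alphabetically.
month_ordered <- ordered(month_codes, levels = calendar_levels)
sorted_months <- as.character(sort(month_ordered))

sorted_months
```

With these levels, `arrange(month)` and the x-axis of a plot would run from January to December.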

Findings:

Casual riders take the most rides on weekends, while members ride more often on weekdays.
Summer is the peak season for both rider types.

Now, find out whether ride length differs by rider type.

aggregate(all_trips_v2$ride_length ~ all_trips_v2$member_casual + all_trips_v2$day_of_week, FUN=mean)

##    all_trips_v2$member_casual all_trips_v2$day_of_week all_trips_v2$ride_length
## 1                      casual                   Monday                 612.7304
## 2                      member                   Monday                 523.9225
## 3                      casual                  Tuesday                 606.4741
## 4                      member                  Tuesday                 532.6762
## 5                      casual                Wednesday                 601.8963
## 6                      member                Wednesday                 534.7719
## 7                      casual                 Thursday                 609.6153
## 8                      member                 Thursday                 535.8058
## 9                      casual                   Friday                 631.5177
## 10                     member                   Friday                 533.8625
## 11                     casual                 Saturday                 679.1902
## 12                     member                 Saturday                 564.8961
## 13                     casual                   Sunday                 671.2377
## 14                     member                   Sunday                 557.8728
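The same kind of table can be produced with shorter column names by passing the data frame through the `data` argument of `aggregate()`. The sketch below uses a hypothetical toy data frame and adds the mean in minutes, which may be easier to read than seconds.

```r
# Hypothetical toy rides; ride_length is in seconds.
toy <- data.frame(
  member_casual = c("casual", "casual", "member", "member"),
  day_of_week   = c("Monday", "Monday", "Monday", "Monday"),
  ride_length   = c(600, 660, 480, 540)
)

# Mean ride length per rider type and day, with plain column names.
mean_length <- aggregate(ride_length ~ member_casual + day_of_week, data = toy, FUN = mean)
mean_length$ride_length_min <- mean_length$ride_length / 60

mean_length
```

In this toy data the casual mean is 630 seconds (10.5 minutes) and the member mean is 510 seconds (8.5 minutes).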

all_trips_v2 %>%
  group_by(member_casual, month) %>%
  summarise(average_ride_length = mean(ride_length), .groups = 'drop') %>%
  arrange(month)

## # A tibble: 24 × 3
##    member_casual month average_ride_length
##    <chr>         <ord>               <dbl>
##  1 casual        05                   653.
##  2 member        05                   554.
##  3 casual        06                   661.
##  4 member        06                   571.
##  5 casual        07                   669.
##  6 member        07                   573.
##  7 casual        08                   665.
##  8 member        08                   572.
##  9 casual        09                   648.
## 10 member        09                   559.
## # ℹ 14 more rows

Findings:

Casual riders’ trips are longer on average than members’ trips on every day of the week and in every month shown.
All users take longer trips on weekends and in summer.

Next, compare how far each type of rider travels by looking at ride distance.

all_trips_v2 %>%
  group_by(member_casual, day_of_week) %>%
  summarise(distance_of_ride = mean(ride_distance), .groups = 'drop') %>%
  arrange(day_of_week)

## # A tibble: 14 × 3
##    member_casual day_of_week distance_of_ride
##    <chr>         <ord>                  <dbl>
##  1 casual        Monday                 1677.
##  2 member        Monday                 1708.
##  3 casual        Tuesday                1728.
##  4 member        Tuesday                1742.
##  5 casual        Wednesday              1732.
##  6 member        Wednesday              1750.
##  7 casual        Thursday               1799.
##  8 member        Thursday               1769.
##  9 casual        Friday                 1707.
## 10 member        Friday                 1705.
## 11 casual        Saturday               1716.
## 12 member        Saturday               1739.
## 13 casual        Sunday                 1713.
## 14 member        Sunday                 1730.

all_trips_v2 %>%
  group_by(member_casual, month) %>%
  summarise(distance_of_ride = mean(ride_distance), .groups = 'drop') %>%
  arrange(month)

## # A tibble: 24 × 3
##    member_casual month distance_of_ride
##    <chr>         <ord>            <dbl>
##  1 casual        05               1766.
##  2 member        05               1803.
##  3 casual        06               1849.
##  4 member        06               1875.
##  5 casual        07               1753.
##  6 member        07               1820.
##  7 casual        08               1756.
##  8 member        08               1794.
##  9 casual        09               1727.
## 10 member        09               1764.
## # ℹ 14 more rows

Findings:

Average ride distance is nearly identical for the two groups: members ride slightly farther on most days, while casual riders ride slightly farther on Thursday and Friday.
Both groups appear to take slightly longer-distance trips in late spring and early summer; among the months displayed, June has the highest average.

Finally, to support my assumption, let’s find out how many rides start and end at the same bike station (ride_distance under 1 meter).

all_trips_v2 %>%
  group_by(member_casual) %>%
  summarize(number_of_rides = n() , .groups = 'drop')

## # A tibble: 2 × 2
##   member_casual number_of_rides
##   <chr>                   <int>
## 1 casual                1147496
## 2 member                2519790

all_trips_v2 %>%
  group_by(member_casual) %>%
  filter(ride_distance < 1) %>%
  summarize(number_of_rides = n() , .groups = 'drop')

## # A tibble: 2 × 2
##   member_casual number_of_rides
##   <chr>                   <int>
## 1 casual                  53738
## 2 member                  57589

Finding:

About 4.7% of casual rides (53,738 of 1,147,496) end at the station where they started, compared with about 2.3% of member rides (57,589 of 2,519,790).

The visualizations below are shared so that executives can easily understand my conclusions.
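The shares can be recomputed directly from the two count tables above. This is a short arithmetic sketch; the vector names are illustrative.

```r
# Counts taken from the two tables above.
total_rides <- c(casual = 1147496, member = 2519790)
round_trips <- c(casual = 53738,   member = 57589)   # rides with ride_distance < 1 m

# Share of rides that end where they started, in percent, rounded to one decimal.
share_pct <- round(100 * round_trips / total_rides, 1)
share_pct
```

This gives about 4.7% for casual rides and about 2.3% for member rides, so casual rides are roughly twice as likely to end at their starting station.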

all_trips_v2 %>%
  group_by(member_casual, day_of_week) %>%
  summarise(number_of_rides = n(), .groups = 'drop') %>%
  ggplot(aes(x = day_of_week, y = number_of_rides, fill = member_casual)) + 
  geom_bar(position = "dodge", stat = "identity")+
  theme(axis.text.x = element_text(angle = 45))

all_trips_v2 %>%
  group_by(member_casual, month) %>%
  summarise(number_of_rides = n(), .groups = 'drop') %>%
  ggplot(aes(x = month, y = number_of_rides, fill = member_casual)) + 
  geom_bar(position = "dodge", stat = "identity")

all_trips_v2 %>%
  group_by(member_casual, day_of_week) %>%
  summarise(average_ride_length = mean(ride_length), .groups = 'drop') %>%
  ggplot(aes(x = day_of_week, y = average_ride_length, fill = member_casual)) + 
  geom_bar(position = "dodge", stat = "identity")+
  theme(axis.text.x = element_text(angle = 45))

all_trips_v2 %>%
  group_by(member_casual, month) %>%
  summarise(average_ride_length = mean(ride_length), .groups = 'drop') %>%
  ggplot(aes(x = month, y = average_ride_length, fill = member_casual)) + 
  geom_bar(position = "dodge", stat = "identity")

Rides of 10,000 meters or more are removed as outliers; binwidth is left at its default

all_trips_v2 %>%
  group_by(member_casual) %>%
  filter(ride_distance < 10000) %>% 
  ggplot(aes(x = ride_distance, fill = member_casual)) + 
  geom_histogram()

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
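The message above asks for an explicit bin width, which `geom_histogram()` accepts through its `binwidth` argument. To show what fixed-width bins contain, here is a base R sketch with hypothetical distances and an assumed width of 500 meters; the width is an illustrative choice, not one taken from the report.

```r
ride_distance <- c(120, 480, 950, 1400, 1450, 2300, 9900, 12000)  # meters, hypothetical
kept <- ride_distance[ride_distance < 10000]                      # same cutoff as the plot

bin_width <- 500
breaks <- seq(0, 10000, by = bin_width)   # 20 bins of 500 m each

# Count rides per bin; each bin includes its lower edge and excludes its upper edge.
bin_counts <- table(cut(kept, breaks = breaks, right = FALSE))
bin_counts[1:3]
```

In the plot itself, the equivalent change would be `geom_histogram(binwidth = 500)`, which makes each bar cover a fixed 500-meter range instead of one thirtieth of the data range.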

all_trips_v2 %>%
  group_by(member_casual, day_of_week) %>%
  summarise(average_ride_distance = mean(ride_distance), .groups = 'drop') %>%
  ggplot(aes(x = day_of_week, y = average_ride_distance, fill = member_casual)) + 
  geom_bar(position = "dodge", stat = "identity")

all_trips_v2 %>%
  group_by(member_casual, month) %>%
  summarise(average_ride_distance = mean(ride_distance), .groups = 'drop') %>%
  ggplot(aes(x = month, y = average_ride_distance, fill = member_casual)) + 
  geom_bar(position = "dodge", stat = "identity")

Analysis:

Casual users travel about the same average distance as members, but their rides last longer, which suggests more leisure-oriented usage versus a more “public transport” or pragmatic use of the bikes by annual members.
Casual riders are more likely to return their bikes to the station where they started.
Additionally, while members are more active on weekdays, casual riders use the service more often on weekends. This leads me to conclude that members use the service for commuting while casual riders use it for fun.

Conclusion

Casual users mostly take leisure and tourism rides, concentrated on weekends.

Annual members mostly take commute or pragmatic rides on weekdays.

                                      ***End of the Report***

Complete Case Study on Cyclistic, a bike-share company

Shuvam Anupam

2024-03-02
