Google Certificate Case Study Analysis

INTRODUCTION

This is a capstone project required for the Google Course Certificate Program. It analyzes a fictional company, Cyclistic, a bike-share company in Chicago.

The director of marketing believes the company’s future success depends on maximizing the number of annual memberships. This analysis aims to understand how casual riders and annual members use Cyclistic bikes differently. From these insights, a new marketing strategy targeting the most profitable rider category will be designed.

Customers who purchase single-ride or full-day passes are referred to as casual riders. Customers who purchase annual memberships are Cyclistic members or annual members.

METHODOLOGY

This analysis follows the six steps of the data analysis process: Ask, Prepare, Process, Analyze, Share, and Act. Trends are analyzed using the company's historical trip data.

The study uses the SMART methodology to ask questions that help solve the business problem and align with the business task. The methodology ensures that the questions asked are specific, measurable, action-oriented, relevant, and time-bound.

Three questions will guide the future marketing program:

How do annual members and casual riders use Cyclistic bikes differently?
Why would casual riders buy Cyclistic annual memberships?
How can Cyclistic use digital media to influence casual riders to become members?

The business task is to identify how casual riders and annual members use Cyclistic bikes differently and to provide actionable insights for a new marketing strategy that converts casual riders into annual members.

The key stakeholders are the marketing team members, including the director of marketing, Lily Moreno, and the executive team members.

PREPARE AND PROCESS

DATA LOCATION AND ORGANIZATION

The data is made available by Motivate International Inc. and is publicly available for download. It contains the historical trip data of the Cyclistic bike share company, organized into monthly and quarterly files. All the data files are zip archives, which can be extracted to comma-delimited (.csv) files for easier processing. The data can be viewed or downloaded at https://divvy-tripdata.s3.amazonaws.com/index.html

DATA CREDIBILITY

The population of the data set is the Cyclistic bike riders. The data is gathered by the fictional company itself, which makes it first-hand primary data that is original and reliable. It is also downloaded from an open source, which makes it accessible. The most recent data is available for the analysis: millions of rows and several columns of trip details (routes, start and end times with their corresponding station names and ids) that are relevant to the business questions.

DATA MANIPULATION

To verify the credibility of the data, a quick summary of the whole data set was produced. The columns were found to be consistent across the individual files, although some rows have missing values.

The data is initially in zip files. These were extracted to .csv files, loaded into RStudio for analysis, and then merged into a single data frame. The summary below shows 1,113,914 rows with missing start station details and 1,144,509 rows with missing end station details; rows missing either were deleted to ensure data quality.

library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(janitor)

## 
## Attaching package: 'janitor'

## The following objects are masked from 'package:stats':
## 
##     chisq.test, fisher.test

library(lubridate)

library(skimr)

library(ggplot2)

These calls load the packages needed for the analysis into the R session.

The next step is to load the data used in the study: the 12 most recent monthly files, from December 2023 through November 2024.

library(readr)
dec_2023_tripdata <- read_csv("202312-divvy-tripdata.csv")

## Rows: 224073 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl  (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

library(readr)
jan_2024_tripdata <- read_csv("202401-divvy-tripdata.csv")

## Rows: 144873 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl  (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

library(readr)
feb_2024_tripdata <- read_csv("202402-divvy-tripdata.csv")

## Rows: 223164 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl  (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

library(readr)
mar_2024_tripdata <- read_csv("202403-divvy-tripdata.csv")

## Rows: 301687 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl  (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

library(readr)
apr_2024_tripdata <- read_csv("202404-divvy-tripdata.csv")

## Rows: 415025 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl  (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

library(readr)
may_2024_tripdata <- read_csv("202405-divvy-tripdata.csv")

## Rows: 609493 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl  (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

library(readr)
jun_2024_tripdata <- read_csv("202406-divvy-tripdata.csv")

## Rows: 710721 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl  (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

library(readr)
jul_2024_tripdata <- read_csv("202407-divvy-tripdata.csv")

## Rows: 748962 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl  (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

library(readr)
aug_2024_tripdata <- read_csv("202408-divvy-tripdata.csv")

## Rows: 755639 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl  (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

library(readr)
# September 2024 trips are in the 202409 file (202408 is August)
sep_2024_tripdata <- read_csv("202409-divvy-tripdata.csv")

## Rows: 755639 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl  (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

library(readr)
# October 2024 trips are in the 202410 file (202409 is September)
oct_2024_tripdata <- read_csv("202410-divvy-tripdata.csv")

## Rows: 821276 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl  (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

library(readr)
nov_2024_tripdata <- read_csv("202411-divvy-tripdata.csv")

## Rows: 335075 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl  (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

#merging the separate 12 monthly data files into 1 data frame
total_trip_data <- rbind(dec_2023_tripdata, jan_2024_tripdata, feb_2024_tripdata, mar_2024_tripdata, apr_2024_tripdata, may_2024_tripdata, jun_2024_tripdata, jul_2024_tripdata, aug_2024_tripdata, sep_2024_tripdata, oct_2024_tripdata, nov_2024_tripdata)
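Typing twelve nearly identical read_csv() calls makes it easy to mistype a file name. As a hedged alternative, the sketch below builds the file names from the month stamps and reads them in one pass. The folder `data_dir`, the two tiny stand-in files and the column `source_month` are assumptions made only so the example runs on its own; they are not part of the original analysis.

```r
library(readr)
library(dplyr)
library(purrr)

# Hypothetical folder for the monthly CSV files. Two tiny stand-in files
# are written here so that the sketch is self-contained.
data_dir <- file.path(tempdir(), "divvy_demo")
dir.create(data_dir, showWarnings = FALSE)
write_csv(tibble(ride_id = c("A1", "A2"), member_casual = c("member", "casual")),
          file.path(data_dir, "202409-divvy-tripdata.csv"))
write_csv(tibble(ride_id = "B1", member_casual = "member"),
          file.path(data_dir, "202410-divvy-tripdata.csv"))

# Build the file names from the month stamps instead of typing each one.
months <- c("202409", "202410")
files <- file.path(data_dir, paste0(months, "-divvy-tripdata.csv"))
stopifnot(all(file.exists(files)))  # fail early if a month is missing

# Read every file and stack the rows; .id records the month each row came from.
demo_trips <- files %>%
  set_names(months) %>%
  map(read_csv, show_col_types = FALSE) %>%
  bind_rows(.id = "source_month")

# One row per month with its row count, to confirm each file was read once.
count(demo_trips, source_month)
```

Counting rows per `source_month` gives a quick check that no month was loaded twice or skipped.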

skim_without_charts(total_trip_data) #in order to have a comprehensive summary of the data set

Data summary
Name	total_trip_data
Number of rows	6045627
Number of columns	13
_______________________
Column type frequency:
character	7
numeric	4
POSIXct	2
________________________
Group variables	None

Variable type: character

skim_variable	n_missing	complete_rate	min	max	n_unique
ride_id	0	1.00	16	16	5289777
rideable_type	0	1.00	12	16	3
start_station_name	1113914	0.82	10	64	1775
start_station_id	1113914	0.82	3	35	1727
end_station_name	1144509	0.81	10	64	1788
end_station_id	1144509	0.81	3	36	1740
member_casual	0	1.00	6	6	2

Variable type: numeric

skim_variable	n_missing	complete_rate	mean	sd	p0	p25	p50	p75	p100
start_lat	0	1	41.90	0.04	41.64	41.88	41.90	41.93	42.07
start_lng	0	1	-87.65	0.03	-87.91	-87.66	-87.64	-87.63	-87.52
end_lat	7799	1	41.90	0.06	16.06	41.88	41.90	41.93	87.96
end_lng	7799	1	-87.65	0.06	-144.05	-87.66	-87.64	-87.63	1.72

Variable type: POSIXct

skim_variable	n_missing	complete_rate	min	max	median	n_unique
started_at	0	1	2023-12-01 00:00:03	2024-11-30 23:52:17	2024-07-18 00:44:23	5063177
ended_at	0	1	2023-12-01 00:04:12	2024-11-30 23:57:43	2024-07-18 01:13:44	5066285

#to familiarize with the data structure and columns
str(total_trip_data)

## spc_tbl_ [6,045,627 × 13] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ ride_id           : chr [1:6045627] "C9BD54F578F57246" "CDBD92F067FA620E" "ABC0858E52CBFC84" "F44B6F0E8F76DC90" ...
##  $ rideable_type     : chr [1:6045627] "electric_bike" "electric_bike" "electric_bike" "electric_bike" ...
##  $ started_at        : POSIXct[1:6045627], format: "2023-12-02 18:44:01" "2023-12-02 18:48:19" ...
##  $ ended_at          : POSIXct[1:6045627], format: "2023-12-02 18:47:51" "2023-12-02 18:54:48" ...
##  $ start_station_name: chr [1:6045627] NA NA NA NA ...
##  $ start_station_id  : chr [1:6045627] NA NA NA NA ...
##  $ end_station_name  : chr [1:6045627] NA NA NA NA ...
##  $ end_station_id    : chr [1:6045627] NA NA NA NA ...
##  $ start_lat         : num [1:6045627] 41.9 41.9 41.9 42 41.9 ...
##  $ start_lng         : num [1:6045627] -87.7 -87.7 -87.6 -87.7 -87.6 ...
##  $ end_lat           : num [1:6045627] 41.9 41.9 41.9 41.9 41.9 ...
##  $ end_lng           : num [1:6045627] -87.7 -87.6 -87.6 -87.7 -87.6 ...
##  $ member_casual     : chr [1:6045627] "member" "member" "member" "member" ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   ride_id = col_character(),
##   ..   rideable_type = col_character(),
##   ..   started_at = col_datetime(format = ""),
##   ..   ended_at = col_datetime(format = ""),
##   ..   start_station_name = col_character(),
##   ..   start_station_id = col_character(),
##   ..   end_station_name = col_character(),
##   ..   end_station_id = col_character(),
##   ..   start_lat = col_double(),
##   ..   start_lng = col_double(),
##   ..   end_lat = col_double(),
##   ..   end_lng = col_double(),
##   ..   member_casual = col_character()
##   .. )
##  - attr(*, "problems")=<externalptr>

colnames(total_trip_data)

##  [1] "ride_id"            "rideable_type"      "started_at"        
##  [4] "ended_at"           "start_station_name" "start_station_id"  
##  [7] "end_station_name"   "end_station_id"     "start_lat"         
## [10] "start_lng"          "end_lat"            "end_lng"           
## [13] "member_casual"

All the monthly files have the same 13 columns.

# have a view of first few rows of the data frame
head(total_trip_data)

## # A tibble: 6 × 13
##   ride_id          rideable_type started_at          ended_at           
##   <chr>            <chr>         <dttm>              <dttm>             
## 1 C9BD54F578F57246 electric_bike 2023-12-02 18:44:01 2023-12-02 18:47:51
## 2 CDBD92F067FA620E electric_bike 2023-12-02 18:48:19 2023-12-02 18:54:48
## 3 ABC0858E52CBFC84 electric_bike 2023-12-24 01:56:32 2023-12-24 02:04:09
## 4 F44B6F0E8F76DC90 electric_bike 2023-12-24 10:58:12 2023-12-24 11:03:04
## 5 3C876413281A90DF electric_bike 2023-12-24 12:43:16 2023-12-24 12:44:57
## 6 28C0D6EFB81E1769 electric_bike 2023-12-24 13:59:57 2023-12-24 14:10:57
## # ℹ 9 more variables: start_station_name <chr>, start_station_id <chr>,
## #   end_station_name <chr>, end_station_id <chr>, start_lat <dbl>,
## #   start_lng <dbl>, end_lat <dbl>, end_lng <dbl>, member_casual <chr>

# alternatively, using built-in functions to check for columns with missing (NA) values
sapply(total_trip_data, function(x) sum(is.na(x)))

##            ride_id      rideable_type         started_at           ended_at 
##                  0                  0                  0                  0 
## start_station_name   start_station_id   end_station_name     end_station_id 
##            1113914            1113914            1144509            1144509 
##          start_lat          start_lng            end_lat            end_lng 
##                  0                  0               7799               7799 
##      member_casual 
##                  0

#drop rows with missing station details
#missing end_lat and end_lng values are not used as a drop criterion because the other data relevant for the study are complete in these rows.
clean_total_tripdata <- total_trip_data %>% 
  drop_na(start_station_name, start_station_id, end_station_name, end_station_id)

#confirm that no missing values remain
sapply(clean_total_tripdata, function(x) sum(is.na(x)))

##            ride_id      rideable_type         started_at           ended_at 
##                  0                  0                  0                  0 
## start_station_name   start_station_id   end_station_name     end_station_id 
##                  0                  0                  0                  0 
##          start_lat          start_lng            end_lat            end_lng 
##                  0                  0                  0                  0 
##      member_casual 
##                  0
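Dropping rows can change the balance between members and casual riders, so it can help to see who loses rows. The sketch below, on a small made-up table (`toy` and its values are illustrative assumptions, not the trip data), applies the same drop_na() idea and counts the removed rows per rider type.

```r
library(dplyr)
library(tidyr)

# Made-up rides; NA marks a missing station name.
toy <- tibble(
  ride_id            = c("r1", "r2", "r3", "r4"),
  member_casual      = c("member", "casual", "casual", "member"),
  start_station_name = c("A", NA, "B", "C"),
  end_station_name   = c("B", "C", NA, "A")
)

# Keep only rows where both station names are present.
toy_clean <- drop_na(toy, start_station_name, end_station_name)

# Rows that were dropped, counted by rider type.
removed <- toy %>%
  anti_join(toy_clean, by = "ride_id") %>%
  count(member_casual, name = "rows_removed")

removed
```

In this toy table both dropped rows belong to casual riders; running the same count on the real data would show whether the cleaning step affects one group more than the other.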

unique(clean_total_tripdata) # displays the distinct rows; the result is not assigned, so clean_total_tripdata itself is unchanged

## # A tibble: 3,795,608 × 13
##    ride_id          rideable_type started_at          ended_at           
##    <chr>            <chr>         <dttm>              <dttm>             
##  1 84BFC1F137684EAB classic_bike  2023-12-02 23:12:51 2023-12-02 23:21:01
##  2 EEC92D30A70471E5 classic_bike  2023-12-14 13:43:14 2023-12-14 13:44:14
##  3 1C33464DEEB1F23C electric_bike 2023-12-04 11:57:04 2023-12-04 12:13:59
##  4 E0A61810C305E5EC classic_bike  2023-12-04 09:34:22 2023-12-04 09:35:56
##  5 0706CEB2E1924F3D classic_bike  2023-12-04 09:36:27 2023-12-04 09:36:40
##  6 EB09035006DCCB2C electric_bike 2023-12-02 06:06:32 2023-12-02 06:09:06
##  7 81EE8687F217E531 classic_bike  2023-12-27 23:55:45 2023-12-28 01:43:13
##  8 2C519D5FC6290C41 electric_bike 2023-12-02 13:08:54 2023-12-02 13:14:45
##  9 BACE7E3BCE0919A8 electric_bike 2023-12-24 07:38:07 2023-12-24 07:45:46
## 10 DCCFC2DE81C0B1F9 electric_bike 2023-12-25 10:23:13 2023-12-25 10:25:53
## # ℹ 3,795,598 more rows
## # ℹ 9 more variables: start_station_name <chr>, start_station_id <chr>,
## #   end_station_name <chr>, end_station_id <chr>, start_lat <dbl>,
## #   start_lng <dbl>, end_lat <dbl>, end_lng <dbl>, member_casual <chr>
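Since unique() only prints its result unless it is assigned, duplicates stay in the data frame. A minimal sketch of removing them, assuming that `ride_id` is meant to identify a ride uniquely (the toy table is made up for illustration):

```r
library(dplyr)

# Made-up rides in which "r2" appears twice.
toy <- tibble(
  ride_id       = c("r1", "r2", "r2", "r3"),
  member_casual = c("member", "casual", "casual", "member")
)

# Compare the number of rows with the number of distinct ride ids.
nrow(toy)
n_distinct(toy$ride_id)

# Keep the first row for each ride_id and assign the result.
toy_dedup <- distinct(toy, ride_id, .keep_all = TRUE)
nrow(toy_dedup)
```

Assigning the result is the step that actually changes the data used downstream.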

#extracting year, month, date, day of week and start hour columns; also calculating ride length
trip_data <- clean_total_tripdata %>% 
  mutate(
    year = format(as.Date(started_at), "%Y"), # extract year
    month = format(as.Date(started_at), "%B"), # extract month name
    date = format(as.Date(started_at), "%d"), # extract day of the month
    day_of_week = format(as.Date(started_at), "%A"), # extract day of week
    ride_length = difftime(ended_at, started_at, units = "secs"), # ride length in seconds
    start_time = strftime(started_at, "%H") # start hour, formatted in the session's local time zone
  )

trip_data <- trip_data %>% 
  mutate(ride_length = as.numeric(ride_length))
is.numeric(trip_data$ride_length) # to check it is right format

## [1] TRUE
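The ride length and start hour depend on units and time zones, so a small worked example may help. The sketch below uses two start and end times taken from the rows printed earlier and assumes the timestamps are stored as UTC, as readr does by default. The column names `ride_length_secs`, `ride_length_mins` and `start_hour` are illustrative.

```r
library(dplyr)

# Two rides with the timestamps shown in the output above, stored as UTC.
toy <- tibble(
  started_at = as.POSIXct(c("2023-12-02 23:12:51", "2023-12-14 13:43:14"), tz = "UTC"),
  ended_at   = as.POSIXct(c("2023-12-02 23:21:01", "2023-12-14 13:44:14"), tz = "UTC")
)

toy <- toy %>%
  mutate(
    # Asking for seconds explicitly avoids difftime() choosing the unit itself.
    ride_length_secs = as.numeric(difftime(ended_at, started_at, units = "secs")),
    ride_length_mins = ride_length_secs / 60,
    # Formatting with an explicit tz keeps the hour independent of the machine's time zone.
    start_hour = format(started_at, "%H", tz = "UTC")
  )

toy
```

The first ride lasts 8 minutes 10 seconds, which is 490 seconds, matching the first `ride_length` value in the structure output.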

#keep only rides longer than 1 second, which also removes zero and negative ride lengths
clean_trip_data <- filter(trip_data,ride_length > 1)

str(clean_trip_data)

## tibble [4,336,182 × 19] (S3: tbl_df/tbl/data.frame)
##  $ ride_id           : chr [1:4336182] "84BFC1F137684EAB" "EEC92D30A70471E5" "1C33464DEEB1F23C" "E0A61810C305E5EC" ...
##  $ rideable_type     : chr [1:4336182] "classic_bike" "classic_bike" "electric_bike" "classic_bike" ...
##  $ started_at        : POSIXct[1:4336182], format: "2023-12-02 23:12:51" "2023-12-14 13:43:14" ...
##  $ ended_at          : POSIXct[1:4336182], format: "2023-12-02 23:21:01" "2023-12-14 13:44:14" ...
##  $ start_station_name: chr [1:4336182] "DuSable Museum" "California Ave & Division St" "Chicago State University" "Cottage Grove Ave & 51st St" ...
##  $ start_station_id  : chr [1:4336182] "KA1503000075" "13256" "20106" "TA1309000067" ...
##  $ end_station_name  : chr [1:4336182] "Cottage Grove Ave & 51st St" "California Ave & Division St" "Chicago State University" "Cottage Grove Ave & 51st St" ...
##  $ end_station_id    : chr [1:4336182] "TA1309000067" "13256" "20106" "TA1309000067" ...
##  $ start_lat         : num [1:4336182] 41.8 41.9 41.7 41.8 41.8 ...
##  $ start_lng         : num [1:4336182] -87.6 -87.7 -87.6 -87.6 -87.6 ...
##  $ end_lat           : num [1:4336182] 41.8 41.9 41.7 41.8 41.8 ...
##  $ end_lng           : num [1:4336182] -87.6 -87.7 -87.6 -87.6 -87.6 ...
##  $ member_casual     : chr [1:4336182] "member" "casual" "casual" "casual" ...
##  $ year              : chr [1:4336182] "2023" "2023" "2023" "2023" ...
##  $ month             : chr [1:4336182] "December" "December" "December" "December" ...
##  $ date              : chr [1:4336182] "02" "14" "04" "04" ...
##  $ day_of_week       : chr [1:4336182] "Saturday" "Thursday" "Monday" "Monday" ...
##  $ ride_length       : num [1:4336182] 490 60 1015 94 13 ...
##  $ start_time        : chr [1:4336182] "00" "14" "12" "10" ...

#to have a view of the data

#checking details of the cleaned data set, in summary
summary(clean_trip_data)

##    ride_id          rideable_type        started_at                    
##  Length:4336182     Length:4336182     Min.   :2023-12-01 00:00:20.00  
##  Class :character   Class :character   1st Qu.:2024-05-08 05:15:29.00  
##  Mode  :character   Mode  :character   Median :2024-07-15 14:12:23.23  
##                                        Mean   :2024-06-26 21:18:35.27  
##                                        3rd Qu.:2024-08-23 14:54:12.25  
##                                        Max.   :2024-11-30 23:50:53.45  
##     ended_at                      start_station_name start_station_id  
##  Min.   :2023-12-01 00:05:59.00   Length:4336182     Length:4336182    
##  1st Qu.:2024-05-08 05:27:56.75   Class :character   Class :character  
##  Median :2024-07-15 14:30:30.95   Mode  :character   Mode  :character  
##  Mean   :2024-06-26 21:35:30.93                                        
##  3rd Qu.:2024-08-23 15:12:52.67                                        
##  Max.   :2024-11-30 23:57:43.00                                        
##  end_station_name   end_station_id       start_lat       start_lng     
##  Length:4336182     Length:4336182     Min.   :41.65   Min.   :-87.86  
##  Class :character   Class :character   1st Qu.:41.88   1st Qu.:-87.66  
##  Mode  :character   Mode  :character   Median :41.89   Median :-87.64  
##                                        Mean   :41.90   Mean   :-87.64  
##                                        3rd Qu.:41.93   3rd Qu.:-87.63  
##                                        Max.   :42.06   Max.   :-87.53  
##     end_lat         end_lng       member_casual          year          
##  Min.   :41.65   Min.   :-87.84   Length:4336182     Length:4336182    
##  1st Qu.:41.88   1st Qu.:-87.66   Class :character   Class :character  
##  Median :41.90   Median :-87.64   Mode  :character   Mode  :character  
##  Mean   :41.90   Mean   :-87.64                                        
##  3rd Qu.:41.93   3rd Qu.:-87.63                                        
##  Max.   :42.06   Max.   :-87.53                                        
##     month               date           day_of_week         ride_length      
##  Length:4336182     Length:4336182     Length:4336182     Min.   :    1.01  
##  Class :character   Class :character   Class :character   1st Qu.:  355.66  
##  Mode  :character   Mode  :character   Mode  :character   Median :  618.76  
##                                                           Mean   : 1015.65  
##                                                           3rd Qu.: 1112.30  
##                                                           Max.   :90562.00  
##   start_time       
##  Length:4336182    
##  Class :character  
##  Mode  :character  
##                    
##                    
##

ANALYSIS

#descriptive analysis of the data
#ride length analysis (ride_length is a duration in seconds)
# mean of ride length = average length of a ride
# median of ride length = midpoint of the ride lengths
# max ride length = longest ride
# min ride length = shortest ride
clean_trip_data %>% 
  summarize(
    average_ride_length = mean(ride_length),
    median_ride_length = median(ride_length),
    max_ride_length = max(ride_length),
    min_ride_length = min(ride_length)
  )

## # A tibble: 1 × 4
##   average_ride_length median_ride_length max_ride_length min_ride_length
##                 <dbl>              <dbl>           <dbl>           <dbl>
## 1               1016.               619.           90562            1.01

clean_trip_data %>% 
  group_by(member_casual) %>% 
  summarize(count = n()) %>% 
  mutate(percentage = count/sum(count)*100)

## # A tibble: 2 × 3
##   member_casual   count percentage
##   <chr>           <int>      <dbl>
## 1 casual        1598947       36.9
## 2 member        2737235       63.1

#member_casual is the customer type.

The average ride length is about 1,016 seconds (roughly 17 minutes). About 63% of the rides are taken by customers with an annual membership, while the remaining 37% are taken by casual riders who buy single-ride or full-day passes.
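Because `ride_length` is stored in seconds, it can be easier to read in minutes. The sketch below converts the overall mean reported in the summary output (1015.65 seconds) and then shows, on a made-up table, how the mean in minutes and the share of rides could be computed per rider type. The toy values are assumptions for illustration only.

```r
library(dplyr)

# Overall mean from the summary output, converted from seconds to minutes.
mean_secs <- 1015.65
mean_mins <- mean_secs / 60
round(mean_mins, 1)

# Made-up ride lengths in seconds.
toy <- tibble(
  member_casual = c("casual", "casual", "member", "member", "member"),
  ride_length   = c(1200, 1800, 600, 720, 480)
)

# Rides, mean ride length in minutes, and share of all rides per rider type.
by_type <- toy %>%
  group_by(member_casual) %>%
  summarize(rides = n(), mean_minutes = mean(ride_length) / 60) %>%
  mutate(share_pct = rides / sum(rides) * 100)

by_type
```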

#finding the trends of how different customers ride the bikes
#analyse the pattern of ride lengths by month and by day of the week
clean_trip_data %>% 
  group_by(month, member_casual) %>% 
  summarize(mean(ride_length), min(ride_length), max(ride_length))

## `summarise()` has grouped output by 'month'. You can override using the
## `.groups` argument.

## # A tibble: 24 × 5
## # Groups:   month [12]
##    month member_casual `mean(ride_length)` `min(ride_length)` `max(ride_length)`
##    <chr> <chr>                       <dbl>              <dbl>              <dbl>
##  1 April casual                      1487.               2                89613 
##  2 April member                       737.               2                89007 
##  3 Augu… casual                      1486.               1.06             89853.
##  4 Augu… member                       788.               1.01             86887.
##  5 Dece… casual                       992.               2                84885 
##  6 Dece… member                       648.               2                89668 
##  7 Febr… casual                      1190.               2                89100 
##  8 Febr… member                       705.               2                89859 
##  9 Janu… casual                       932.               2                88737 
## 10 Janu… member                       694.               2                89839 
## # ℹ 14 more rows

#for daily trends
clean_trip_data %>% 
  group_by(day_of_week, member_casual) %>% 
  summarize(mean(ride_length), min(ride_length), max(ride_length))

## `summarise()` has grouped output by 'day_of_week'. You can override using the
## `.groups` argument.

## # A tibble: 14 × 5
## # Groups:   day_of_week [7]
##    day_of_week member_casual `mean(ride_length)` `min(ride_length)`
##    <chr>       <chr>                       <dbl>              <dbl>
##  1 Friday      casual                      1407.               1.18
##  2 Friday      member                       733.               1.10
##  3 Monday      casual                      1406.               1.07
##  4 Monday      member                       720.               1.03
##  5 Saturday    casual                      1642.               1.03
##  6 Saturday    member                       848.               1.02
##  7 Sunday      casual                      1658.               1.01
##  8 Sunday      member                       849.               1.07
##  9 Thursday    casual                      1276.               1.03
## 10 Thursday    member                       723.               1.01
## 11 Tuesday     casual                      1262.               1.08
## 12 Tuesday     member                       721.               1.01
## 13 Wednesday   casual                      1320.               1.05
## 14 Wednesday   member                       740.               1.03
## # ℹ 1 more variable: `max(ride_length)` <dbl>
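The daily table above is sorted alphabetically and reports ride length only. As a hedged variation, the sketch below orders the days from Monday to Sunday, names the summary columns and adds the number of rides, which is the quantity the charts later compare. The toy data and the column names `rides` and `mean_ride_length` are assumptions.

```r
library(dplyr)

weekday_levels <- c("Monday", "Tuesday", "Wednesday", "Thursday",
                    "Friday", "Saturday", "Sunday")

# Made-up rides with ride length in seconds.
toy <- tibble(
  day_of_week   = c("Saturday", "Saturday", "Monday", "Tuesday", "Monday"),
  member_casual = c("casual", "casual", "member", "member", "member"),
  ride_length   = c(1800, 1200, 600, 700, 800)
)

daily <- toy %>%
  # An ordered set of levels makes the output follow the calendar week.
  mutate(day_of_week = factor(day_of_week, levels = weekday_levels)) %>%
  group_by(day_of_week, member_casual) %>%
  summarize(
    rides = n(),
    mean_ride_length = mean(ride_length),
    .groups = "drop"  # return an ungrouped table and silence the grouping message
  )

daily
```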

SHARE AND ACT

#VISUALIZATION
#chart of rides by members and casual riders, grouped by month of the year
#this will help us find the peak months and days with the most frequent rides
library(ggplot2)

ggplot(data = clean_trip_data)+
  geom_bar(mapping=aes(x=month, fill=member_casual))+
  theme(
    axis.text.x = element_text(angle = 45, hjust = 1) # Rotate labels 45 degrees and adjust alignment
  )

The chart shows that annual members ride the bikes more frequently than casual riders. Irrespective of rider type, the months with the largest numbers of rides are May, June, July, August, September and October; the other months have fewer rides. This may be due to the weather: the months with the fewest rides are those with low to very low temperatures.
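Because `month` is stored as text, ggplot2 places the months in alphabetical order on the x axis, which makes the seasonal pattern harder to read. One possible adjustment, sketched on made-up data, is to turn the month into a factor whose levels follow the study period (December first, then January to November). The `toy` table and the `month_levels` vector are assumptions for illustration.

```r
library(ggplot2)

# Made-up rides; only the month and rider type matter here.
toy <- data.frame(
  month         = c("December", "January", "July", "July", "November"),
  member_casual = c("member", "casual", "casual", "member", "member")
)

# Study period order: December 2023, then January to November 2024.
month_levels <- c("December", month.name[1:11])
toy$month <- factor(toy$month, levels = month_levels)

p <- ggplot(data = toy) +
  geom_bar(mapping = aes(x = month, fill = member_casual)) +
  scale_x_discrete(drop = FALSE) +  # keep months that have no rides in the toy data
  theme(
    axis.text.x = element_text(angle = 45, hjust = 1) # rotate labels 45 degrees
  )

p
```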

ggplot(data = clean_trip_data)+
  geom_bar(mapping=aes(x=month, fill=member_casual))+
  facet_wrap(~rideable_type)+
  theme(
    axis.text.x = element_text(angle = 45, hjust = 1) # Rotate labels 45 degrees and adjust alignment
  )

Classic bikes are booked more often than the other bike options available. This may be due to factors such as convenience or pricing. Electric scooters are almost never booked by customers.

ggplot(data=clean_trip_data)+
  geom_bar(mapping = aes(x=day_of_week, fill=rideable_type))+
  facet_wrap(~member_casual)+
  scale_x_discrete(limits=c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"))+
  theme(
    axis.text.x = element_text(angle = 45, hjust = 1) # Rotate labels 45 degrees and adjust alignment
  )

The chart above shows how members and casual riders use Cyclistic bikes across the week. For casual riders, the number of rides is quite irregular through the week, with more frequent rides on weekends. This suggests that casual riders use the bikes for leisure rides or errands at the weekend.

Annual members, on the other hand, show an almost even number of rides on weekdays and fewer rides on weekends. This may be because annual members use the bikes for routine trips, such as commuting to work or school. The trend is quite regular regardless of the type of bike used.
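The weekday and weekend pattern read from the chart can also be put into numbers. The sketch below, on made-up rides, labels each ride as weekday or weekend and computes the share of each within a rider type. The column names `day_type`, `rides` and `share_pct` and the toy values are assumptions.

```r
library(dplyr)

# Made-up rides; only the day and rider type matter here.
toy <- tibble(
  day_of_week   = c("Monday", "Tuesday", "Saturday", "Sunday", "Saturday", "Friday"),
  member_casual = c("member", "member", "member", "casual", "casual", "casual")
)

shares <- toy %>%
  # Saturday and Sunday count as weekend, every other day as weekday.
  mutate(day_type = if_else(day_of_week %in% c("Saturday", "Sunday"),
                            "weekend", "weekday")) %>%
  count(member_casual, day_type, name = "rides") %>%
  group_by(member_casual) %>%
  mutate(share_pct = rides / sum(rides) * 100) %>%  # share within each rider type
  ungroup()

shares
```

Applied to the cleaned trip data, a table like this would state how much of each group's riding falls on weekends rather than leaving it to a visual impression.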

RECOMMENDATIONS

The annual members ride more consistently. Their almost predictable ride numbers may be because they use the bikes for routine trips. Having a membership may also encourage using Cyclistic bikes for commuting.

The casual riders' trend shows that they ride less regularly. They have no commitment to the bike company and are therefore open to alternatives when the need for a trip arises. Casual riders may be individuals who go out at irregular times and so do not need a plan for getting around.

Based on these findings, I suggest marketing campaigns that target casual riders to sign up for membership, because a commitment is likely to encourage them to keep using the company's bikes. Packages can also be tailored to the different rider groups, since each has a different ride pattern: weekend packages, weekday packages and combined packages can be designed to appeal both to riders who use the bikes casually and to those who use them for commuting. Special promotions can also be run during periods with high numbers of riders, such as August or throughout the summer season.