Google Data Analytics Course - Cyclistic Case Study

Scenario

You are a junior data analyst working in the marketing analyst team at Cyclistic, a bike-share company in Chicago. The director of marketing believes the company’s future success depends on maximizing the number of annual memberships. Therefore, your team wants to understand how casual riders and annual members use Cyclistic bikes differently. From these insights, your team will design a new marketing strategy to convert casual riders into annual members. But first, Cyclistic executives must approve your recommendations, so they must be backed up with compelling data insights and professional data visualizations.

Characters and teams

Cyclistic: A bike-share program that features more than 5,800 bicycles and 600 docking stations. Cyclistic sets itself apart by also offering reclining bikes, hand tricycles, and cargo bikes, making bike-share more inclusive to people with disabilities and riders who can’t use a standard two-wheeled bike. The majority of riders opt for traditional bikes; about 8% of riders use the assistive options. Cyclistic users are more likely to ride for leisure, but about 30% use them to commute to work each day.

Lily Moreno: The director of marketing and your manager. Moreno is responsible for the development of campaigns and initiatives to promote the bike-share program. These may include email, social media, and other channels.

Cyclistic marketing analytics team: A team of data analysts who are responsible for collecting, analyzing, and reporting data that helps guide Cyclistic marketing strategy. You joined this team six months ago and have been busy learning about Cyclistic’s mission and business goals — as well as how you, as a junior data analyst, can help Cyclistic achieve them.

Cyclistic executive team: The notoriously detail-oriented executive team will decide whether to approve the recommended marketing program

Approach

Data analysis is a crucial step in solving problems. It is the process of collecting, organizing, and interpreting data to answer questions and reach conclusions. By analyzing data, businesses, organizations, and individuals can identify patterns and trends that can help them make informed decisions and improve their performance. This will involve sorting the data into categories and performing calculations or visualizing it with charts and graphs. I will also examine the data for any outliers or inconsistencies that may indicate a problem. After the data has been analyzed, it is important to interpret the results. This involves examining the data to identify any patterns or trends that may be present. It is also important to consider any potential causes of the problem, such as internal processes or external factors.

Ask phase

In this step, I asked myself questions such as ‘How do annual members and casual riders use Cyclistic bikes differently?’ to guide my analysis. Lily Moreno believes that the annual Members were more profitable than the casual riders. The goal before me is to assist the marketing team analyze the contrast between casual riders and members with the aim of providing recommendations that will aid the conversion of casual riders into annual members to maximize profits.

Prepare phase

The data used for this analysis is located at (https://divvy-tripdata.s3.amazonaws.com/index.html), I downloaded the most recent 12 months from February 2022 to January 2023 and stored them locally. This version of my analysis was done using R. The dataset has no user ID or any specific commuter identifier, so the data on duplicate riders are not present.

Process Phase

I used Rstudio for the process phase. I have outlined the steps taken:

-uploaded all 12 months of the datasets into R

-Combine data into single file

-Clean and inspect data for uniformity

-Clean up by removing rows with missing or unavailable data

-Clean up column names and prepare for data analysis

Analyze Phase

The analysis was done using Rstudio, the steps are outline:

Loading Packages

library(tidyverse)

## Warning: package 'tidyverse' was built under R version 4.2.2

## Warning: package 'ggplot2' was built under R version 4.2.2

## Warning: package 'tibble' was built under R version 4.2.2

## Warning: package 'tidyr' was built under R version 4.2.2

## Warning: package 'readr' was built under R version 4.2.2

## Warning: package 'purrr' was built under R version 4.2.2

## Warning: package 'dplyr' was built under R version 4.2.2

## Warning: package 'stringr' was built under R version 4.2.2

## Warning: package 'forcats' was built under R version 4.2.2

## Warning: package 'lubridate' was built under R version 4.2.2

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.0     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.1     ✔ tibble    3.1.8
## ✔ lubridate 1.9.2     ✔ tidyr     1.3.0
## ✔ purrr     1.0.1     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the ]8;;http://conflicted.r-lib.org/conflicted package]8;; to force all conflicts to become errors

library(ggplot2)
library(lubridate)

collect and load your data into R

getwd()

## [1] "C:/Users/okoli/Desktop/data analyst/Capstone_Cyclistic_Project"

feb_2022 <- read_csv("2022_02-divvy-tripdata.csv")

## Rows: 115609 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl  (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

mar_2022 <- read_csv("2022_03-divvy-tripdata.csv")

## Rows: 284042 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl  (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

apr_2022 <- read_csv("2022_04-divvy-tripdata.csv")

## Rows: 371249 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl  (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

may_2022 <- read_csv("2022_05-divvy-tripdata.csv")

## Rows: 634858 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl  (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

jun_2022 <- read_csv("2022_06-divvy-tripdata.csv")

## Rows: 769204 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl  (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

jul_2022 <- read_csv("2022_07-divvy-tripdata.csv")

## Rows: 823488 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl  (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

aug_2022 <- read_csv("2022_08-divvy-tripdata.csv")

## Rows: 785932 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl  (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

sep_2022 <- read_csv("2022_09-divvy-tripdata.csv")

## Rows: 701339 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl  (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

oct_2022 <- read_csv("2022_10-divvy-tripdata.csv")

## Rows: 558685 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl  (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

nov_2022 <- read_csv("2022_11-divvy-tripdata.csv")

## Rows: 337735 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl  (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

dec_2022 <- read_csv("2022_12-divvy-tripdata.csv")

## Rows: 181806 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl  (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

jan_2023 <- read_csv("2023_01-divvy-tripdata.csv")

## Rows: 190301 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl  (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Inspect the dataset

check columns for differences

colnames(feb_2022)

##  [1] "ride_id"            "rideable_type"      "started_at"        
##  [4] "ended_at"           "start_station_name" "start_station_id"  
##  [7] "end_station_name"   "end_station_id"     "start_lat"         
## [10] "start_lng"          "end_lat"            "end_lng"           
## [13] "member_casual"

colnames(mar_2022)

##  [1] "ride_id"            "rideable_type"      "started_at"        
##  [4] "ended_at"           "start_station_name" "start_station_id"  
##  [7] "end_station_name"   "end_station_id"     "start_lat"         
## [10] "start_lng"          "end_lat"            "end_lng"           
## [13] "member_casual"

colnames(apr_2022)

##  [1] "ride_id"            "rideable_type"      "started_at"        
##  [4] "ended_at"           "start_station_name" "start_station_id"  
##  [7] "end_station_name"   "end_station_id"     "start_lat"         
## [10] "start_lng"          "end_lat"            "end_lng"           
## [13] "member_casual"

colnames(may_2022)

##  [1] "ride_id"            "rideable_type"      "started_at"        
##  [4] "ended_at"           "start_station_name" "start_station_id"  
##  [7] "end_station_name"   "end_station_id"     "start_lat"         
## [10] "start_lng"          "end_lat"            "end_lng"           
## [13] "member_casual"

colnames(jun_2022)

##  [1] "ride_id"            "rideable_type"      "started_at"        
##  [4] "ended_at"           "start_station_name" "start_station_id"  
##  [7] "end_station_name"   "end_station_id"     "start_lat"         
## [10] "start_lng"          "end_lat"            "end_lng"           
## [13] "member_casual"

colnames(jul_2022)

##  [1] "ride_id"            "rideable_type"      "started_at"        
##  [4] "ended_at"           "start_station_name" "start_station_id"  
##  [7] "end_station_name"   "end_station_id"     "start_lat"         
## [10] "start_lng"          "end_lat"            "end_lng"           
## [13] "member_casual"

colnames(aug_2022)

##  [1] "ride_id"            "rideable_type"      "started_at"        
##  [4] "ended_at"           "start_station_name" "start_station_id"  
##  [7] "end_station_name"   "end_station_id"     "start_lat"         
## [10] "start_lng"          "end_lat"            "end_lng"           
## [13] "member_casual"

colnames(sep_2022)

##  [1] "ride_id"            "rideable_type"      "started_at"        
##  [4] "ended_at"           "start_station_name" "start_station_id"  
##  [7] "end_station_name"   "end_station_id"     "start_lat"         
## [10] "start_lng"          "end_lat"            "end_lng"           
## [13] "member_casual"

colnames(oct_2022)

##  [1] "ride_id"            "rideable_type"      "started_at"        
##  [4] "ended_at"           "start_station_name" "start_station_id"  
##  [7] "end_station_name"   "end_station_id"     "start_lat"         
## [10] "start_lng"          "end_lat"            "end_lng"           
## [13] "member_casual"

colnames(nov_2022)

##  [1] "ride_id"            "rideable_type"      "started_at"        
##  [4] "ended_at"           "start_station_name" "start_station_id"  
##  [7] "end_station_name"   "end_station_id"     "start_lat"         
## [10] "start_lng"          "end_lat"            "end_lng"           
## [13] "member_casual"

colnames(dec_2022)

##  [1] "ride_id"            "rideable_type"      "started_at"        
##  [4] "ended_at"           "start_station_name" "start_station_id"  
##  [7] "end_station_name"   "end_station_id"     "start_lat"         
## [10] "start_lng"          "end_lat"            "end_lng"           
## [13] "member_casual"

colnames(jan_2023)

##  [1] "ride_id"            "rideable_type"      "started_at"        
##  [4] "ended_at"           "start_station_name" "start_station_id"  
##  [7] "end_station_name"   "end_station_id"     "start_lat"         
## [10] "start_lng"          "end_lat"            "end_lng"           
## [13] "member_casual"

str(jan_2023)

## spc_tbl_ [190,301 × 13] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ ride_id           : chr [1:190301] "F96D5A74A3E41399" "13CB7EB698CEDB88" "BD88A2E670661CE5" "C90792D034FED968" ...
##  $ rideable_type     : chr [1:190301] "electric_bike" "classic_bike" "electric_bike" "classic_bike" ...
##  $ started_at        : POSIXct[1:190301], format: "2023-01-21 20:05:42" "2023-01-10 15:37:36" ...
##  $ ended_at          : POSIXct[1:190301], format: "2023-01-21 20:16:33" "2023-01-10 15:46:05" ...
##  $ start_station_name: chr [1:190301] "Lincoln Ave & Fullerton Ave" "Kimbark Ave & 53rd St" "Western Ave & Lunt Ave" "Kimbark Ave & 53rd St" ...
##  $ start_station_id  : chr [1:190301] "TA1309000058" "TA1309000037" "RP-005" "TA1309000037" ...
##  $ end_station_name  : chr [1:190301] "Hampden Ct & Diversey Ave" "Greenwood Ave & 47th St" "Valli Produce - Evanston Plaza" "Greenwood Ave & 47th St" ...
##  $ end_station_id    : chr [1:190301] "202480.0" "TA1308000002" "599" "TA1308000002" ...
##  $ start_lat         : num [1:190301] 41.9 41.8 42 41.8 41.8 ...
##  $ start_lng         : num [1:190301] -87.6 -87.6 -87.7 -87.6 -87.6 ...
##  $ end_lat           : num [1:190301] 41.9 41.8 42 41.8 41.8 ...
##  $ end_lng           : num [1:190301] -87.6 -87.6 -87.7 -87.6 -87.6 ...
##  $ member_casual     : chr [1:190301] "member" "member" "casual" "member" ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   ride_id = col_character(),
##   ..   rideable_type = col_character(),
##   ..   started_at = col_datetime(format = ""),
##   ..   ended_at = col_datetime(format = ""),
##   ..   start_station_name = col_character(),
##   ..   start_station_id = col_character(),
##   ..   end_station_name = col_character(),
##   ..   end_station_id = col_character(),
##   ..   start_lat = col_double(),
##   ..   start_lng = col_double(),
##   ..   end_lat = col_double(),
##   ..   end_lng = col_double(),
##   ..   member_casual = col_character()
##   .. )
##  - attr(*, "problems")=<externalptr>

ride_id and rideable_id were converted to character so they could be stacked correctly

feb_2022 <- mutate(feb_2022, ride_id = as.character(ride_id) 
                   ,rideable_type = as.character(rideable_type))
mar_2022 <- mutate(mar_2022, ride_id = as.character(ride_id) 
                   ,rideable_type = as.character(rideable_type))
apr_2022 <- mutate(apr_2022, ride_id = as.character(ride_id) 
                   ,rideable_type = as.character(rideable_type))
may_2022 <- mutate(may_2022, ride_id = as.character(ride_id) 
                   ,rideable_type = as.character(rideable_type))
jun_2022 <- mutate(jun_2022, ride_id = as.character(ride_id) 
                   ,rideable_type = as.character(rideable_type))
jul_2022 <- mutate(jul_2022, ride_id = as.character(ride_id) 
                   ,rideable_type = as.character(rideable_type))
aug_2022 <- mutate(aug_2022, ride_id = as.character(ride_id) 
                   ,rideable_type = as.character(rideable_type))
sep_2022 <- mutate(sep_2022, ride_id = as.character(ride_id) 
                   ,rideable_type = as.character(rideable_type))
oct_2022 <- mutate(oct_2022, ride_id = as.character(ride_id) 
                   ,rideable_type = as.character(rideable_type))
nov_2022 <- mutate(nov_2022, ride_id = as.character(ride_id) 
                   ,rideable_type = as.character(rideable_type))
dec_2022 <- mutate(dec_2022, ride_id = as.character(ride_id) 
                   ,rideable_type = as.character(rideable_type))
jan_2023 <- mutate(jan_2023, ride_id = as.character(ride_id) 
                   ,rideable_type = as.character(rideable_type))

Stack all months into one larger dataframe

all_trips <- bind_rows(feb_2022, mar_2022, apr_2022, may_2022, jun_2022, jul_2022, 
                       aug_2022, sep_2022, oct_2022, nov_2022, dec_2022, jan_2023)

Removed the columns not necessary for analysis (lat, long)

all_trips <- all_trips %>%  
  select(-c(start_lat, start_lng, end_lat, end_lng))

Inspected the newly formed table for accuracy

colnames(all_trips)

## [1] "ride_id"            "rideable_type"      "started_at"        
## [4] "ended_at"           "start_station_name" "start_station_id"  
## [7] "end_station_name"   "end_station_id"     "member_casual"

str(all_trips)

## tibble [5,754,248 × 9] (S3: tbl_df/tbl/data.frame)
##  $ ride_id           : chr [1:5754248] "E1E065E7ED285C02" "1602DCDC5B30FFE3" "BE7DD2AF4B55C4AF" "A1789BDF844412BE" ...
##  $ rideable_type     : chr [1:5754248] "classic_bike" "classic_bike" "classic_bike" "classic_bike" ...
##  $ started_at        : POSIXct[1:5754248], format: "2022-02-19 18:08:41" "2022-02-20 17:41:30" ...
##  $ ended_at          : POSIXct[1:5754248], format: "2022-02-19 18:23:56" "2022-02-20 17:45:56" ...
##  $ start_station_name: chr [1:5754248] "State St & Randolph St" "Halsted St & Wrightwood Ave" "State St & Randolph St" "Southport Ave & Waveland Ave" ...
##  $ start_station_id  : chr [1:5754248] "TA1305000029" "TA1309000061" "TA1305000029" "13235" ...
##  $ end_station_name  : chr [1:5754248] "Clark St & Lincoln Ave" "Southport Ave & Wrightwood Ave" "Canal St & Adams St" "Broadway & Sheridan Rd" ...
##  $ end_station_id    : chr [1:5754248] "13179" "TA1307000113" "13011" "13323" ...
##  $ member_casual     : chr [1:5754248] "member" "member" "member" "member" ...

summary(all_trips)

##    ride_id          rideable_type        started_at                    
##  Length:5754248     Length:5754248     Min.   :2022-02-01 00:03:18.00  
##  Class :character   Class :character   1st Qu.:2022-06-02 15:18:09.50  
##  Mode  :character   Mode  :character   Median :2022-07-27 22:50:40.50  
##                                        Mean   :2022-07-29 13:28:03.16  
##                                        3rd Qu.:2022-09-22 20:34:47.25  
##                                        Max.   :2023-01-31 23:56:09.00  
##     ended_at                      start_station_name start_station_id  
##  Min.   :2022-02-01 00:09:37.00   Length:5754248     Length:5754248    
##  1st Qu.:2022-06-02 15:37:50.50   Class :character   Class :character  
##  Median :2022-07-27 23:09:33.00   Mode  :character   Mode  :character  
##  Mean   :2022-07-29 13:47:21.50                                        
##  3rd Qu.:2022-09-22 20:53:25.25                                        
##  Max.   :2023-02-04 04:27:03.00                                        
##  end_station_name   end_station_id     member_casual     
##  Length:5754248     Length:5754248     Length:5754248    
##  Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character  
##                                                          
##                                                          
##

Added columns that listed the date, month, day, and year of each ride

all_trips$date <- as.Date(all_trips$started_at)
all_trips$month <- format(as.Date(all_trips$date), "%m")
all_trips$day <- format(as.Date(all_trips$date), "%d")
all_trips$year <- format(as.Date(all_trips$date), "%Y")
all_trips$day_of_week <- format(as.Date(all_trips$date), "%A")

Added “ride_length” column as a calculation to all_trips (in seconds)

Inspected the current structure of the columns

all_trips$ride_length <- difftime(all_trips$ended_at,all_trips$started_at)
str(all_trips)

## tibble [5,754,248 × 15] (S3: tbl_df/tbl/data.frame)
##  $ ride_id           : chr [1:5754248] "E1E065E7ED285C02" "1602DCDC5B30FFE3" "BE7DD2AF4B55C4AF" "A1789BDF844412BE" ...
##  $ rideable_type     : chr [1:5754248] "classic_bike" "classic_bike" "classic_bike" "classic_bike" ...
##  $ started_at        : POSIXct[1:5754248], format: "2022-02-19 18:08:41" "2022-02-20 17:41:30" ...
##  $ ended_at          : POSIXct[1:5754248], format: "2022-02-19 18:23:56" "2022-02-20 17:45:56" ...
##  $ start_station_name: chr [1:5754248] "State St & Randolph St" "Halsted St & Wrightwood Ave" "State St & Randolph St" "Southport Ave & Waveland Ave" ...
##  $ start_station_id  : chr [1:5754248] "TA1305000029" "TA1309000061" "TA1305000029" "13235" ...
##  $ end_station_name  : chr [1:5754248] "Clark St & Lincoln Ave" "Southport Ave & Wrightwood Ave" "Canal St & Adams St" "Broadway & Sheridan Rd" ...
##  $ end_station_id    : chr [1:5754248] "13179" "TA1307000113" "13011" "13323" ...
##  $ member_casual     : chr [1:5754248] "member" "member" "member" "member" ...
##  $ date              : Date[1:5754248], format: "2022-02-19" "2022-02-20" ...
##  $ month             : chr [1:5754248] "02" "02" "02" "02" ...
##  $ day               : chr [1:5754248] "19" "20" "25" "14" ...
##  $ year              : chr [1:5754248] "2022" "2022" "2022" "2022" ...
##  $ day_of_week       : chr [1:5754248] "Saturday" "Sunday" "Friday" "Monday" ...
##  $ ride_length       : 'difftime' num [1:5754248] 915 266 818 417 ...
##   ..- attr(*, "units")= chr "secs"

Converted “ride_length” from Factor to numeric to enable the ability to run calculations on the data

is.factor(all_trips$ride_length)

## [1] FALSE

all_trips$ride_length <- as.numeric(as.character(all_trips$ride_length))
is.numeric(all_trips$ride_length)

## [1] TRUE

Removed invalid data

all_trips_v2 <- all_trips[!(all_trips$start_station_name == "HQ QR" | all_trips$ride_length<0 ),]
all_trips_v3 <- na.omit(all_trips_v2)

Descriptive analysis on ride_length (all figures in seconds)

mean(all_trips_v3$ride_length)

## [1] 1017.968

median(all_trips_v3$ride_length)

## [1] 631

max(all_trips_v3$ride_length)

## [1] 2061244

min(all_trips_v3$ride_length)

## [1] 0

summary(all_trips_v3$ride_length)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       0     360     631    1018    1133 2061244

#Comparism of members and casual users

aggregate(all_trips_v3$ride_length ~ all_trips_v3$member_casual, FUN = mean)

##   all_trips_v3$member_casual all_trips_v3$ride_length
## 1                     casual                1429.0284
## 2                     member                 743.8795

aggregate(all_trips_v3$ride_length ~ all_trips_v3$member_casual, FUN = max)

##   all_trips_v3$member_casual all_trips_v3$ride_length
## 1                     casual                  2061244
## 2                     member                    89872

aggregate(all_trips_v3$ride_length ~ all_trips_v3$member_casual, FUN = min)

##   all_trips_v3$member_casual all_trips_v3$ride_length
## 1                     casual                        0
## 2                     member                        0

average ride time by each day for members vs casual users

aggregate(all_trips_v3$ride_length ~ all_trips_v3$member_casual + all_trips_v3$day_of_week, FUN = mean)

##    all_trips_v3$member_casual all_trips_v3$day_of_week all_trips_v3$ride_length
## 1                      casual                   Friday                1333.3149
## 2                      member                   Friday                 731.0674
## 3                      casual                   Monday                1478.8465
## 4                      member                   Monday                 718.0486
## 5                      casual                 Saturday                1597.8397
## 6                      member                 Saturday                 836.7879
## 7                      casual                   Sunday                1627.6773
## 8                      member                   Sunday                 828.0498
## 9                      casual                 Thursday                1266.9570
## 10                     member                 Thursday                 719.0429
## 11                     casual                  Tuesday                1277.4698
## 12                     member                  Tuesday                 703.4152
## 13                     casual                Wednesday                1228.4272
## 14                     member                Wednesday                 708.4237

Fixed the days of the week to appear in order

all_trips_v3$day_of_week <- ordered(all_trips_v3$day_of_week, levels=c("Sunday", "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday"))

now view average ride time by each day for members vs casual users

aggregate(all_trips_v3$ride_length ~ all_trips_v3$member_casual + all_trips_v3$day_of_week, FUN = mean)

##    all_trips_v3$member_casual all_trips_v3$day_of_week all_trips_v3$ride_length
## 1                      casual                   Sunday                1627.6773
## 2                      member                   Sunday                 828.0498
## 3                      casual                   Monday                1478.8465
## 4                      member                   Monday                 718.0486
## 5                      casual                  Tuesday                1277.4698
## 6                      member                  Tuesday                 703.4152
## 7                      casual                Wednesday                1228.4272
## 8                      member                Wednesday                 708.4237
## 9                      casual                 Thursday                1266.9570
## 10                     member                 Thursday                 719.0429
## 11                     casual                   Friday                1333.3149
## 12                     member                   Friday                 731.0674
## 13                     casual                 Saturday                1597.8397
## 14                     member                 Saturday                 836.7879

Analyzed the ridership data by type and weekday

all_trips_v3 %>% 
  mutate(weekday = wday(started_at, label = TRUE)) %>%
  group_by(member_casual, weekday) %>% 
  summarise(number_of_rides = n()
            ,average_duration = mean(ride_length)) %>%
  arrange(member_casual, weekday)

## `summarise()` has grouped output by 'member_casual'. You can override using the
## `.groups` argument.

## # A tibble: 14 × 4
## # Groups:   member_casual [2]
##    member_casual weekday number_of_rides average_duration
##    <chr>         <ord>             <int>            <dbl>
##  1 casual        Sun              304430            1628.
##  2 casual        Mon              213357            1479.
##  3 casual        Tue              199913            1277.
##  4 casual        Wed              206447            1228.
##  5 casual        Thu              231915            1267.
##  6 casual        Fri              250701            1333.
##  7 casual        Sat              368403            1598.
##  8 member        Sun              302990             828.
##  9 member        Mon              382627             718.
## 10 member        Tue              424050             703.
## 11 member        Wed              422211             708.
## 12 member        Thu              422467             719.
## 13 member        Fri              366731             731.
## 14 member        Sat              341205             837.

visualized the number of rides by rider type

all_trips_v3 %>% 
  mutate(weekday = wday(started_at, label = TRUE)) %>% 
  group_by(member_casual, weekday) %>% 
  summarise(number_of_rides = n()
            ,average_duration = mean(ride_length)) %>% 
  arrange(member_casual, weekday)  %>% 
  ggplot(aes(x = weekday, y = number_of_rides, fill = member_casual)) +
  geom_col(position = "dodge")

## `summarise()` has grouped output by 'member_casual'. You can override using the
## `.groups` argument.

created a visualization for the average duration

all_trips_v3 %>% 
  mutate(weekday = wday(started_at, label = TRUE)) %>% 
  group_by(member_casual, weekday) %>% 
  summarise(number_of_rides = n()
            ,average_duration = mean(ride_length)) %>% 
  arrange(member_casual, weekday)  %>% 
  ggplot(aes(x = weekday, y = average_duration, fill = member_casual)) +
  geom_col(position = "dodge")

## `summarise()` has grouped output by 'member_casual'. You can override using the
## `.groups` argument.

Created a csv file and exported it for further analysis

counts <- aggregate(all_trips_v3$ride_length ~ all_trips_v3$member_casual 
                    + all_trips_v3$day_of_week, FUN = mean)
write.csv(counts, file = "C:/Users/okoli/Downloads/Capstone_Cyclistic_Project/avg_ride_length.csv")

Share Phase

Interactive table was created with tableau and can be accessed at https://public.tableau.com/app/profile/n.o2620/viz/Ridelengthbyweekdayandridertype/Sheet1

Act phase

The first phenomenon that is easily noticeable is the stark difference between casual riders and riders with membership subscriptions, there is a significant difference in length of ride between both sets of riders.

Key Findings

Apart from the weekends (Saturday and Sunday) the ride length of the member riders remained relatively unchanged.
There is a slight untick (about 16%) in member riders’ length of rides on weekends, it starts on Friday and experiences a sharp drop on Monday. It remains relatively the same until Friday again.
The ride length for casual riders changed significantly from day to day with the peek coming on weekends (Saturday and Sunday).
There is a visible inverse bell curve for the ride length of the casual riders, from the beginning of the week (Sunday) to the end of the week (Saturday). There is a dip in midweek and then a steady climb until the weekend, then a slow drop after the weekend.
Comparing the casual riders to the member riders, there is a doubling (100% increase) in ride length from the member riders to the casual riders. This means that for any giving day of the week the casual riders have twice the ride time of the member riders.

Business Task Questions

How do annual members and casual riders use Cyclistic bikes differently?
Why would casual riders buy Cyclistic annual memberships?
How can Cyclistic use digital media to influence casual riders to become members?

Conclusion

The lack of individual rider identifiers which would have been integral in deciphering the ride patterns of casual riders would have been quite useful. Finding out if the casual riders were repeat users or just one-time users would have been beneficial.

Convincing casual riders to buy annual memberships

Given the information analyzed, the following steps can be deployed to increase annual membership among casual riders:

-If the cost to rent for a day or 2 is lower than the weekly cost of membership then the cost of a one-day usage should be increased, this would encourage weekend users to subscribe.

-Offer different types of memberships to cater to different types of riders, for example a weekend pass can be offered to capitalize on the weekend spike. The weekday pass can be made cheaper than the weekend pass to encourage weekday casual users.

-Offer members only exclusives with the rides, an example will be bikes reserved for members only, priority bike usage over the weekend due to increased demands

-Run an ad campaign to showcase the benefits of a membership, explain the pricing and the benefits

-Offer a discount if the whole annual membership fee was paid all at once instead of monthly, also offer autopay capabilities and make payment as seamless as possible by offering different pay types.

-Explore charging casual riders by the hour, while members are given unlimited ride time

Overall, more insight could be provided if more data is available, especially individual ride identities. I’m positive that these recommendations provided would convince casual riders to subscribe to a membership, which in turn will lead to increased revenue.

Google Data Analytics Course - Cyclistic Case Study

Nnamdi Okoli

2023-04-25