1 Table of Contents

2 Scenario

In this case, I am assuming the role of a junior data analyst working in the marketing and analytics team at Cyclistic, a bike-share company in Chicago with two types of customers. The director of marketing believes the company’s future success depends on maximizing the number of annual memberships. Customers who buy single-ride or full-day passes are known as casual riders, while the customers who purchase annual memberships are known as members. Therefore, the team wants to understand how casual riders and annual members use Cyclistic bikes differently. From these insights, the team will design a new marketing strategy to convert casual riders into annual members. But first, Cyclistic executives must approve the recommendations, so they must be backed up with compelling data insights and professional data visualizations. Characters and teams.

3 Asking questions and problem definition

3.1 Ask the right questions

In this step, I asked myself questions such as ‘How do annual members and casual riders use Cyclistic bikes differently?’ to guide my analysis. Lily Moreno believes that the annual Members were more profitable than the casual riders. The goal before me is to assist the marketing team analyze the contrast between casual riders and members with the aim of providing recommendations that will aid the conversion of casual riders into annual members to maximize profits.

How do annual members and casual riders use Cyclistic bikes differently?
Why would casual riders buy Cyclistic annual memberships?
How can Cyclistic use digital media to influence casual riders to become members?

3.2 Identify the business task

A bike-share program that features more than 5,800 bicycles and 600 docking stations. Cyclistic sets itself apart by also offering reclining bikes, hand tricycles, and cargo bikes, making bike-share more inclusive to people with disabilities and riders who can’t use a standard two-wheeled bike. The majority of riders opt for traditional bikes; about 8% of riders use the assistive options. Cyclistic users are more likely to ride for leisure, but about 30% use them to commute to work each day.

3.3 Identifying key stakeholders

Lily Moreno: The director of marketing and your manager. Moreno is responsible for the development of campaigns and initiatives to promote the bike-share program. These may include email, social media, and other channels.

Cyclistic marketing analytics team: A team of data analysts who are responsible for collecting, analyzing, and reporting data that helps guide Cyclistic marketing strategy. You joined this team six months ago and have been busy learning about Cyclistic’s mission and business goals — as well as how you, as a junior data analyst, can help Cyclistic achieve them.

Cyclistic executive team: The notoriously detail-oriented executive team will decide whether to approve the recommended marketing program

3.4 Approach

Data analysis is a crucial step in solving problems. It is the process of collecting, organizing, and interpreting data to answer questions and reach conclusions. By analyzing data, businesses, organizations, and individuals can identify patterns and trends that can help them make informed decisions and improve their performance. This will involve sorting the data into categories and performing calculations or visualizing it with charts and graphs. I will also examine the data for any outliers or inconsistencies that may indicate a problem. After the data has been analyzed, it is important to interpret the results. This involves examining the data to identify any patterns or trends that may be present. It is also important to consider any potential causes of the problem, such as internal processes or external factors.

4 Prepare the data

The data used for this analysis is located at (https://divvy-tripdata.s3.amazonaws.com/index.html), The data is made available under this (https://ride.divvybikes.com/data-license-agreement) I downloaded the most recent 12 months from February 2022 to January 2023 and stored them locally. This version of my analysis was done using R, spreadsheet, and Tableau. The dataset has no user ID or any specific commuter identifier, so the data on duplicate riders are not present.

4.1 Sort and filter the data in a spreasheet

I checked for incomplete values in all columns
I checked for duplicate values
I checked for outliers

4.2 Utilize R and RStudio

Given the size of the dataset after inspection with Excel, I decided on using R for analysis and specifically R studio.

4.3 Loaded the packages

library(tidyverse)
library(ggplot2)
library(lubridate)
library(skimr)

4.4 Upload the data files and create data frames

feb_2022 <- read_csv("2022_02-divvy-tripdata.csv")

## Rows: 115609 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl  (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

mar_2022 <- read_csv("2022_03-divvy-tripdata.csv")

## Rows: 284042 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl  (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

apr_2022 <- read_csv("2022_04-divvy-tripdata.csv")

## Rows: 371249 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl  (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

may_2022 <- read_csv("2022_05-divvy-tripdata.csv")

## Rows: 634858 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl  (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

jun_2022 <- read_csv("2022_06-divvy-tripdata.csv")

## Rows: 769204 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl  (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

jul_2022 <- read_csv("2022_07-divvy-tripdata.csv")

## Rows: 823488 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl  (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

aug_2022 <- read_csv("2022_08-divvy-tripdata.csv")

## Rows: 785932 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl  (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

sep_2022 <- read_csv("2022_09-divvy-tripdata.csv")

## Rows: 701339 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl  (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

oct_2022 <- read_csv("2022_10-divvy-tripdata.csv")

## Rows: 558685 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl  (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

nov_2022 <- read_csv("2022_11-divvy-tripdata.csv")

## Rows: 337735 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl  (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

dec_2022 <- read_csv("2022_12-divvy-tripdata.csv")

## Rows: 181806 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl  (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

jan_2023 <- read_csv("2023_01-divvy-tripdata.csv")

## Rows: 190301 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl  (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

4.5 Check for column consistency

View to make sure all have the same columns and combine the 12 dataframes into one dataframe and give it a name.

colnames(feb_2022)

##  [1] "ride_id"            "rideable_type"      "started_at"        
##  [4] "ended_at"           "start_station_name" "start_station_id"  
##  [7] "end_station_name"   "end_station_id"     "start_lat"         
## [10] "start_lng"          "end_lat"            "end_lng"           
## [13] "member_casual"

colnames(mar_2022)

##  [1] "ride_id"            "rideable_type"      "started_at"        
##  [4] "ended_at"           "start_station_name" "start_station_id"  
##  [7] "end_station_name"   "end_station_id"     "start_lat"         
## [10] "start_lng"          "end_lat"            "end_lng"           
## [13] "member_casual"

colnames(apr_2022)

##  [1] "ride_id"            "rideable_type"      "started_at"        
##  [4] "ended_at"           "start_station_name" "start_station_id"  
##  [7] "end_station_name"   "end_station_id"     "start_lat"         
## [10] "start_lng"          "end_lat"            "end_lng"           
## [13] "member_casual"

colnames(may_2022)

##  [1] "ride_id"            "rideable_type"      "started_at"        
##  [4] "ended_at"           "start_station_name" "start_station_id"  
##  [7] "end_station_name"   "end_station_id"     "start_lat"         
## [10] "start_lng"          "end_lat"            "end_lng"           
## [13] "member_casual"

colnames(jun_2022)

##  [1] "ride_id"            "rideable_type"      "started_at"        
##  [4] "ended_at"           "start_station_name" "start_station_id"  
##  [7] "end_station_name"   "end_station_id"     "start_lat"         
## [10] "start_lng"          "end_lat"            "end_lng"           
## [13] "member_casual"

colnames(jul_2022)

##  [1] "ride_id"            "rideable_type"      "started_at"        
##  [4] "ended_at"           "start_station_name" "start_station_id"  
##  [7] "end_station_name"   "end_station_id"     "start_lat"         
## [10] "start_lng"          "end_lat"            "end_lng"           
## [13] "member_casual"

colnames(aug_2022)

##  [1] "ride_id"            "rideable_type"      "started_at"        
##  [4] "ended_at"           "start_station_name" "start_station_id"  
##  [7] "end_station_name"   "end_station_id"     "start_lat"         
## [10] "start_lng"          "end_lat"            "end_lng"           
## [13] "member_casual"

colnames(sep_2022)

##  [1] "ride_id"            "rideable_type"      "started_at"        
##  [4] "ended_at"           "start_station_name" "start_station_id"  
##  [7] "end_station_name"   "end_station_id"     "start_lat"         
## [10] "start_lng"          "end_lat"            "end_lng"           
## [13] "member_casual"

colnames(oct_2022)

##  [1] "ride_id"            "rideable_type"      "started_at"        
##  [4] "ended_at"           "start_station_name" "start_station_id"  
##  [7] "end_station_name"   "end_station_id"     "start_lat"         
## [10] "start_lng"          "end_lat"            "end_lng"           
## [13] "member_casual"

colnames(nov_2022)

##  [1] "ride_id"            "rideable_type"      "started_at"        
##  [4] "ended_at"           "start_station_name" "start_station_id"  
##  [7] "end_station_name"   "end_station_id"     "start_lat"         
## [10] "start_lng"          "end_lat"            "end_lng"           
## [13] "member_casual"

colnames(dec_2022)

##  [1] "ride_id"            "rideable_type"      "started_at"        
##  [4] "ended_at"           "start_station_name" "start_station_id"  
##  [7] "end_station_name"   "end_station_id"     "start_lat"         
## [10] "start_lng"          "end_lat"            "end_lng"           
## [13] "member_casual"

colnames(jan_2023)

##  [1] "ride_id"            "rideable_type"      "started_at"        
##  [4] "ended_at"           "start_station_name" "start_station_id"  
##  [7] "end_station_name"   "end_station_id"     "start_lat"         
## [10] "start_lng"          "end_lat"            "end_lng"           
## [13] "member_casual"

str(jan_2023)

## spc_tbl_ [190,301 × 13] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ ride_id           : chr [1:190301] "F96D5A74A3E41399" "13CB7EB698CEDB88" "BD88A2E670661CE5" "C90792D034FED968" ...
##  $ rideable_type     : chr [1:190301] "electric_bike" "classic_bike" "electric_bike" "classic_bike" ...
##  $ started_at        : POSIXct[1:190301], format: "2023-01-21 20:05:42" "2023-01-10 15:37:36" ...
##  $ ended_at          : POSIXct[1:190301], format: "2023-01-21 20:16:33" "2023-01-10 15:46:05" ...
##  $ start_station_name: chr [1:190301] "Lincoln Ave & Fullerton Ave" "Kimbark Ave & 53rd St" "Western Ave & Lunt Ave" "Kimbark Ave & 53rd St" ...
##  $ start_station_id  : chr [1:190301] "TA1309000058" "TA1309000037" "RP-005" "TA1309000037" ...
##  $ end_station_name  : chr [1:190301] "Hampden Ct & Diversey Ave" "Greenwood Ave & 47th St" "Valli Produce - Evanston Plaza" "Greenwood Ave & 47th St" ...
##  $ end_station_id    : chr [1:190301] "202480.0" "TA1308000002" "599" "TA1308000002" ...
##  $ start_lat         : num [1:190301] 41.9 41.8 42 41.8 41.8 ...
##  $ start_lng         : num [1:190301] -87.6 -87.6 -87.7 -87.6 -87.6 ...
##  $ end_lat           : num [1:190301] 41.9 41.8 42 41.8 41.8 ...
##  $ end_lng           : num [1:190301] -87.6 -87.6 -87.7 -87.6 -87.6 ...
##  $ member_casual     : chr [1:190301] "member" "member" "casual" "member" ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   ride_id = col_character(),
##   ..   rideable_type = col_character(),
##   ..   started_at = col_datetime(format = ""),
##   ..   ended_at = col_datetime(format = ""),
##   ..   start_station_name = col_character(),
##   ..   start_station_id = col_character(),
##   ..   end_station_name = col_character(),
##   ..   end_station_id = col_character(),
##   ..   start_lat = col_double(),
##   ..   start_lng = col_double(),
##   ..   end_lat = col_double(),
##   ..   end_lng = col_double(),
##   ..   member_casual = col_character()
##   .. )
##  - attr(*, "problems")=<externalptr>

4.6 Combine the data frames into one data frame

feb_2022 <- mutate(feb_2022, ride_id = as.character(ride_id)
                   ,rideable_type = as.character(rideable_type))
mar_2022 <- mutate(mar_2022, ride_id = as.character(ride_id)
                   ,rideable_type = as.character(rideable_type))
apr_2022 <- mutate(apr_2022, ride_id = as.character(ride_id)
                   ,rideable_type = as.character(rideable_type))
may_2022 <- mutate(may_2022, ride_id = as.character(ride_id)
                   ,rideable_type = as.character(rideable_type))
jun_2022 <- mutate(jun_2022, ride_id = as.character(ride_id)
                   ,rideable_type = as.character(rideable_type))
jul_2022 <- mutate(jul_2022, ride_id = as.character(ride_id)
                   ,rideable_type = as.character(rideable_type))
aug_2022 <- mutate(aug_2022, ride_id = as.character(ride_id)
                   ,rideable_type = as.character(rideable_type))
sep_2022 <- mutate(sep_2022, ride_id = as.character(ride_id)
                   ,rideable_type = as.character(rideable_type))
oct_2022 <- mutate(oct_2022, ride_id = as.character(ride_id)
                   ,rideable_type = as.character(rideable_type))
nov_2022 <- mutate(nov_2022, ride_id = as.character(ride_id)
                   ,rideable_type = as.character(rideable_type))
dec_2022 <- mutate(dec_2022, ride_id = as.character(ride_id)
                   ,rideable_type = as.character(rideable_type))
jan_2023 <- mutate(jan_2023, ride_id = as.character(ride_id)
                   ,rideable_type = as.character(rideable_type))
all_trips <- bind_rows(feb_2022, mar_2022, apr_2022, may_2022, jun_2022, jul_2022,
                       aug_2022, sep_2022, oct_2022, nov_2022, dec_2022, jan_2023)

4.7 Inspect the new data frame

Perform an inspection to confirm that the data has remained accurate after the combination, use data observed in the spreadsheet as a reference point for confirmation.

colnames(all_trips)

##  [1] "ride_id"            "rideable_type"      "started_at"        
##  [4] "ended_at"           "start_station_name" "start_station_id"  
##  [7] "end_station_name"   "end_station_id"     "start_lat"         
## [10] "start_lng"          "end_lat"            "end_lng"           
## [13] "member_casual"

str(all_trips)

## tibble [5,754,248 × 13] (S3: tbl_df/tbl/data.frame)
##  $ ride_id           : chr [1:5754248] "E1E065E7ED285C02" "1602DCDC5B30FFE3" "BE7DD2AF4B55C4AF" "A1789BDF844412BE" ...
##  $ rideable_type     : chr [1:5754248] "classic_bike" "classic_bike" "classic_bike" "classic_bike" ...
##  $ started_at        : POSIXct[1:5754248], format: "2022-02-19 18:08:41" "2022-02-20 17:41:30" ...
##  $ ended_at          : POSIXct[1:5754248], format: "2022-02-19 18:23:56" "2022-02-20 17:45:56" ...
##  $ start_station_name: chr [1:5754248] "State St & Randolph St" "Halsted St & Wrightwood Ave" "State St & Randolph St" "Southport Ave & Waveland Ave" ...
##  $ start_station_id  : chr [1:5754248] "TA1305000029" "TA1309000061" "TA1305000029" "13235" ...
##  $ end_station_name  : chr [1:5754248] "Clark St & Lincoln Ave" "Southport Ave & Wrightwood Ave" "Canal St & Adams St" "Broadway & Sheridan Rd" ...
##  $ end_station_id    : chr [1:5754248] "13179" "TA1307000113" "13011" "13323" ...
##  $ start_lat         : num [1:5754248] 41.9 41.9 41.9 41.9 41.9 ...
##  $ start_lng         : num [1:5754248] -87.6 -87.6 -87.6 -87.7 -87.6 ...
##  $ end_lat           : num [1:5754248] 41.9 41.9 41.9 42 41.9 ...
##  $ end_lng           : num [1:5754248] -87.6 -87.7 -87.6 -87.6 -87.6 ...
##  $ member_casual     : chr [1:5754248] "member" "member" "member" "member" ...

summary(all_trips)

##    ride_id          rideable_type        started_at                    
##  Length:5754248     Length:5754248     Min.   :2022-02-01 00:03:18.00  
##  Class :character   Class :character   1st Qu.:2022-06-02 15:18:09.50  
##  Mode  :character   Mode  :character   Median :2022-07-27 22:50:40.50  
##                                        Mean   :2022-07-29 13:28:03.16  
##                                        3rd Qu.:2022-09-22 20:34:47.25  
##                                        Max.   :2023-01-31 23:56:09.00  
##                                                                        
##     ended_at                      start_station_name start_station_id  
##  Min.   :2022-02-01 00:09:37.00   Length:5754248     Length:5754248    
##  1st Qu.:2022-06-02 15:37:50.50   Class :character   Class :character  
##  Median :2022-07-27 23:09:33.00   Mode  :character   Mode  :character  
##  Mean   :2022-07-29 13:47:21.50                                        
##  3rd Qu.:2022-09-22 20:53:25.25                                        
##  Max.   :2023-02-04 04:27:03.00                                        
##                                                                        
##  end_station_name   end_station_id       start_lat       start_lng     
##  Length:5754248     Length:5754248     Min.   :41.64   Min.   :-87.84  
##  Class :character   Class :character   1st Qu.:41.88   1st Qu.:-87.66  
##  Mode  :character   Mode  :character   Median :41.90   Median :-87.64  
##                                        Mean   :41.90   Mean   :-87.65  
##                                        3rd Qu.:41.93   3rd Qu.:-87.63  
##                                        Max.   :42.07   Max.   :-87.52  
##                                                                        
##     end_lat         end_lng       member_casual     
##  Min.   : 0.00   Min.   :-88.14   Length:5754248    
##  1st Qu.:41.88   1st Qu.:-87.66   Class :character  
##  Median :41.90   Median :-87.64   Mode  :character  
##  Mean   :41.90   Mean   :-87.65                     
##  3rd Qu.:41.93   3rd Qu.:-87.63                     
##  Max.   :42.37   Max.   :  0.00                     
##  NA's   :5899    NA's   :5899

4.8 Data summary, addressing missing data and the linitations of the data

Some of the data columns have missing values this will be dealt with in the cleaning process.
A lot of user data is not available due to privacy issues.

5 Processing the data by cleaning

This phase involves checking for any data errors, cleaning the data of any errors, transforming said data by adding, then renaming and by removing data as needed, the rounding it up by checking to see that the data is clean and ready for analysis.

5.1 Add colums

Adding columns for: date, month, day, ride_length, year, day_of_week, and hour

all_trips$date <- as.Date(all_trips$started_at)
all_trips$month <- format(as.Date(all_trips$date), "%B")
all_trips$day <- format(as.Date(all_trips$date), "%d")
all_trips$year <- format(as.Date(all_trips$date), "%Y")
all_trips$day_of_week <- format(as.Date(all_trips$date), "%A")
all_trips$hour <- lubridate::hour(all_trips$started_at)
all_trips$ride_length <- round(as.numeric(difftime(all_trips$ended_at,all_trips$started_at, units = "mins")),2)

Add column for seasons

all_trips <- all_trips %>% mutate(season = recode(month,
                                                  December = "Winter",
                                                  January = "Winter",
                                                  February = "Winter",
                                                  March = "Spring",
                                                  April = "Spring",
                                                  May = "Spring",
                                                  June = "Summer",
                                                  July = "Summer",
                                                  August = "Summer",
                                                  September = "Fall",
                                                  October = "Fall",
                                                  November = "Fall"))

Add a column to represent the time of the day

all_trips <- all_trips %>% mutate(time_of_day = case_when(
  hour >= 6 & hour < 9 ~ "Early Morning",
  hour >= 9 & hour < 12 ~ "Mid Morning",
  hour >= 12 & hour < 18  ~ "Afternoon",
  hour >= 18 & hour <= 23  ~ "Evening",
  hour >= 0 & hour < 3  ~ "Early Night",
  hour >= 3 & hour < 6  ~ "Late Night"))

5.2 Find and address missing values

Check the user type and rideable type for missing values.

all_trips %>% filter(is.na(start_station_id)) %>% 
  count(start_station_id, start_station_name, rideable_type, member_casual)

## # A tibble: 2 × 5
##   start_station_id start_station_name rideable_type member_casual      n
##   <chr>            <chr>              <chr>         <chr>          <int>
## 1 <NA>             <NA>               electric_bike casual        350065
## 2 <NA>             <NA>               electric_bike member        493460

The missing values have been located in the electric_bike

Create new column to address missing values and also uniformize the longitude and latitude.

all_trips <- all_trips %>% 
  mutate(start_lat_round = round(start_lat, digits = 2),
         start_lng_round = round(start_lng, digits = 2),
         end_lat_round = round(end_lat, digits = 2),
         end_lng_round = round(end_lng, digits = 2))

Fix missing station names and station ID

all_trips <- all_trips %>% 
  group_by(start_lat_round, start_lng_round) %>% 
  tidyr::fill(start_station_name, .direction = "downup") %>% 
  ungroup()

all_trips <- all_trips %>% 
  group_by(end_lat_round, end_lng_round) %>% 
  tidyr::fill(end_station_name, .direction = "downup") %>% 
  ungroup()

all_trips <- all_trips %>% 
  group_by(start_station_name) %>% 
  tidyr::fill(start_station_id, .direction = "downup") %>% 
  ungroup()

all_trips <- all_trips %>% 
  group_by(end_station_name) %>% 
  tidyr::fill(end_station_id, .direction = "downup") %>% 
  ungroup()

Check the data again to make sure the missing values have been fixed

colSums(is.na(all_trips))

##            ride_id      rideable_type         started_at           ended_at 
##                  0                  0                  0                  0 
## start_station_name   start_station_id   end_station_name     end_station_id 
##              11929              11929              40880              40880 
##          start_lat          start_lng            end_lat            end_lng 
##                  0                  0               5899               5899 
##      member_casual               date              month                day 
##                  0                  0                  0                  0 
##               year        day_of_week               hour        ride_length 
##                  0                  0                  0                  0 
##             season        time_of_day    start_lat_round    start_lng_round 
##                  0                  0                  0                  0 
##      end_lat_round      end_lng_round 
##               5899               5899

These shows that some of the missing values can’t be fixed because they lack critical information, this will be handled by removing them from the dataframe. Their number are very small and insignificant so it can be disregarded.

5.3 Find and clean “testing” or “repair” rides

Rides that are by the company employees for the purpose of testing or repairs are immaterial to the analysis being done and can actually skew the data. They will be isolated and eliminated.

all_trips %>% 
  select(start_station_id) %>% 
  count(start_station_id) %>% 
  arrange(desc(n))

## # A tibble: 1,312 × 2
##    start_station_id     n
##    <chr>            <int>
##  1 13022            81578
##  2 LF-005           46593
##  3 TA1308000050     43541
##  4 13042            42671
##  5 13300            42225
##  6 TA1307000039     38861
##  7 13008            37142
##  8 KA1503000014     36188
##  9 13179            35857
## 10 TA1308000001     35376
## # … with 1,302 more rows

all_trips %>% 
  select(end_station_id) %>% 
  count(end_station_id) %>% 
  arrange(desc(n))

## # A tibble: 1,317 × 2
##    end_station_id     n
##    <chr>          <int>
##  1 13022          81458
##  2 LF-005         45159
##  3 TA1308000050   44744
##  4 13042          42189
##  5 KA1503000014   42012
##  6 13300          41533
##  7 <NA>           40880
##  8 TA1307000039   38672
##  9 13008          37270
## 10 13179          36686
## # … with 1,307 more rows

Filter out the station related to testing and repair

all_trips %>% filter(start_station_id %in% c("DIVVY 001", "DIVVY 001 - Warehouse test station", 
                                             "Hubbard Bike-checking (LBS-WH-TEST)", "Pawel Bialowas - Test- PBSC charging station", "DIVVY CASSETTE REPAIR MOBILE STATION", 
                                             "2059 Hastings Warehouse Station", "Hastings WH 2", "Throop/Hastings Mobile Station")) %>%
  count(start_station_id)

## # A tibble: 6 × 2
##   start_station_id                         n
##   <chr>                                <int>
## 1 2059 Hastings Warehouse Station        585
## 2 DIVVY 001                               77
## 3 DIVVY 001 - Warehouse test station      14
## 4 DIVVY CASSETTE REPAIR MOBILE STATION     7
## 5 Hastings WH 2                           14
## 6 Hubbard Bike-checking (LBS-WH-TEST)   8124

all_trips %>% filter(end_station_id %in% c("DIVVY 001", "DIVVY 001 - Warehouse test station", "Hubbard Bike-checking (LBS-WH-TEST)", 
                                           "Pawel Bialowas - Test- PBSC charging station", "DIVVY CASSETTE REPAIR MOBILE STATION", 
                                           "2059 Hastings Warehouse Station", "Hastings WH 2", "Throop/Hastings Mobile Station")) %>% 
  count(end_station_id)

## # A tibble: 4 × 2
##   end_station_id                           n
##   <chr>                                <int>
## 1 2059 Hastings Warehouse Station        791
## 2 DIVVY CASSETTE REPAIR MOBILE STATION     7
## 3 Hastings WH 2                          167
## 4 Hubbard Bike-checking (LBS-WH-TEST)    774

5.4 remove data with unreasonable ride length

Rides that have a negative ride length will be treated as invalid, this is because the trip start time can’t be greater than the trip end time
When the rides last more than a day (24hrs) it can be considered invalid because it is probably not being used, it was probably not docked after use.
The rides that are less than a minute (60s) are probably a docking mistake.

all_trips <- all_trips %>% 
  filter(ride_length >= 1 & ride_length <= 1440)

5.5 Removed data that were for the repair and testing station

all_trips_v2 <- all_trips %>%
  filter(!is.na(end_lat) & !is.na(end_lng))

all_trips_v2 <- all_trips_v2 %>% 
  filter(!start_station_id %in% c("DIVVY 001", "DIVVY 001 - Warehouse test station", "Hubbard Bike-checking (LBS-WH-TEST)", 
                                  "Pawel Bialowas - Test- PBSC charging station", "DIVVY CASSETTE REPAIR MOBILE STATION", 
                                  "2059 Hastings Warehouse Station", "Hastings WH 2", "Throop/Hastings Mobile Station"))

all_trips_v2 <- all_trips_v2 %>%
  filter(!end_station_id %in% c("DIVVY 001", "DIVVY 001 - Warehouse test station", "Hubbard Bike-checking (LBS-WH-TEST)", 
                                "Pawel Bialowas - Test- PBSC charging station", "DIVVY CASSETTE REPAIR MOBILE STATION", 
                                "2059 Hastings Warehouse Station", "Hastings WH 2", "Throop/Hastings Mobile Station"))

Removed the testing and repair stations and also renamed the new dataframe created.

check the removed stations for accuracy

str(all_trips_v2)

## tibble [5,612,878 × 26] (S3: tbl_df/tbl/data.frame)
##  $ ride_id           : chr [1:5612878] "E1E065E7ED285C02" "1602DCDC5B30FFE3" "BE7DD2AF4B55C4AF" "A1789BDF844412BE" ...
##  $ rideable_type     : chr [1:5612878] "classic_bike" "classic_bike" "classic_bike" "classic_bike" ...
##  $ started_at        : POSIXct[1:5612878], format: "2022-02-19 18:08:41" "2022-02-20 17:41:30" ...
##  $ ended_at          : POSIXct[1:5612878], format: "2022-02-19 18:23:56" "2022-02-20 17:45:56" ...
##  $ start_station_name: chr [1:5612878] "State St & Randolph St" "Halsted St & Wrightwood Ave" "State St & Randolph St" "Southport Ave & Waveland Ave" ...
##  $ start_station_id  : chr [1:5612878] "TA1305000029" "TA1309000061" "TA1305000029" "13235" ...
##  $ end_station_name  : chr [1:5612878] "Clark St & Lincoln Ave" "Southport Ave & Wrightwood Ave" "Canal St & Adams St" "Broadway & Sheridan Rd" ...
##  $ end_station_id    : chr [1:5612878] "13179" "TA1307000113" "13011" "13323" ...
##  $ start_lat         : num [1:5612878] 41.9 41.9 41.9 41.9 41.9 ...
##  $ start_lng         : num [1:5612878] -87.6 -87.6 -87.6 -87.7 -87.6 ...
##  $ end_lat           : num [1:5612878] 41.9 41.9 41.9 42 41.9 ...
##  $ end_lng           : num [1:5612878] -87.6 -87.7 -87.6 -87.6 -87.6 ...
##  $ member_casual     : chr [1:5612878] "member" "member" "member" "member" ...
##  $ date              : Date[1:5612878], format: "2022-02-19" "2022-02-20" ...
##  $ month             : chr [1:5612878] "February" "February" "February" "February" ...
##  $ day               : chr [1:5612878] "19" "20" "25" "14" ...
##  $ year              : chr [1:5612878] "2022" "2022" "2022" "2022" ...
##  $ day_of_week       : chr [1:5612878] "Saturday" "Sunday" "Friday" "Monday" ...
##  $ hour              : int [1:5612878] 18 17 18 11 5 9 10 20 13 7 ...
##  $ ride_length       : num [1:5612878] 15.25 4.43 13.63 6.95 2.9 ...
##  $ season            : chr [1:5612878] "Winter" "Winter" "Winter" "Winter" ...
##  $ time_of_day       : chr [1:5612878] "Evening" "Afternoon" "Evening" "Mid Morning" ...
##  $ start_lat_round   : num [1:5612878] 41.9 41.9 41.9 42 41.9 ...
##  $ start_lng_round   : num [1:5612878] -87.6 -87.7 -87.6 -87.7 -87.6 ...
##  $ end_lat_round     : num [1:5612878] 41.9 41.9 41.9 42 41.9 ...
##  $ end_lng_round     : num [1:5612878] -87.6 -87.7 -87.6 -87.7 -87.6 ...

is.factor(all_trips_v2$ride_length)

## [1] FALSE

all_trips_v2$ride_length <- as.numeric(as.character(all_trips_v2$ride_length))
is.numeric(all_trips_v2$ride_length)

## [1] TRUE

all_trips_v2%>% 
  select(start_station_id) %>% 
  count(start_station_id) %>% 
  arrange(desc(n))

## # A tibble: 1,306 × 2
##    start_station_id     n
##    <chr>            <int>
##  1 13022            79566
##  2 LF-005           45310
##  3 TA1308000050     42731
##  4 13042            41847
##  5 13300            41325
##  6 TA1307000039     38115
##  7 13008            36247
##  8 KA1503000014     35531
##  9 13179            35118
## 10 TA1308000001     34776
## # … with 1,296 more rows

all_trips_v2%>% 
  select(end_station_id) %>% 
  count(end_station_id) %>% 
  arrange(desc(n))

## # A tibble: 1,313 × 2
##    end_station_id     n
##    <chr>          <int>
##  1 13022          79717
##  2 LF-005         44366
##  3 TA1308000050   43686
##  4 13042          41423
##  5 KA1503000014   41239
##  6 13300          40734
##  7 TA1307000039   37702
##  8 13008          36598
##  9 13179          35887
## 10 TA1308000001   34615
## # … with 1,303 more rows

class(all_trips_v2)

## [1] "tbl_df"     "tbl"        "data.frame"

colSums(is.na(all_trips_v2))

##            ride_id      rideable_type         started_at           ended_at 
##                  0                  0                  0                  0 
## start_station_name   start_station_id   end_station_name     end_station_id 
##              11558              11558              33871              33871 
##          start_lat          start_lng            end_lat            end_lng 
##                  0                  0                  0                  0 
##      member_casual               date              month                day 
##                  0                  0                  0                  0 
##               year        day_of_week               hour        ride_length 
##                  0                  0                  0                  0 
##             season        time_of_day    start_lat_round    start_lng_round 
##                  0                  0                  0                  0 
##      end_lat_round      end_lng_round 
##                  0                  0

skim_without_charts(all_trips_v2)

Data summary
Name	all_trips_v2
Number of rows	5612878
Number of columns	26
_______________________
Column type frequency:
character	13
Date	1
numeric	10
POSIXct	2
________________________
Group variables	None

Variable type: character

skim_variable	n_missing	complete_rate	min	max	n_unique
ride_id	0	1.00	16	16	5612878
rideable_type	0	1.00	11	13	3
start_station_name	11558	1.00	9	64	1671
start_station_id	11558	1.00	3	37	1305
end_station_name	33871	0.99	9	64	1688
end_station_id	33871	0.99	3	37	1312
member_casual	0	1.00	6	6	2
month	0	1.00	3	9	12
day	0	1.00	2	2	31
year	0	1.00	4	4	2
day_of_week	0	1.00	6	9	7
season	0	1.00	4	6	4
time_of_day	0	1.00	7	13	6

Variable type: Date

skim_variable	n_missing	complete_rate	min	max	median	n_unique
date	0	1	2022-02-01	2023-01-31	2022-07-27	365

Variable type: numeric

skim_variable	complete_rate	mean	sd	p0	p25	p50	p75	p100
start_lat	1	41.90	0.05	41.64	41.88	41.90	41.93	42.07
start_lng	1	-87.65	0.03	-87.84	-87.66	-87.64	-87.63	-87.52
end_lat	1	41.90	0.07	0.00	41.88	41.90	41.93	42.37
end_lng	1	-87.65	0.10	-88.14	-87.66	-87.64	-87.63	0.00
hour	1	14.21	5.03	0.00	11.00	15.00	18.00	23.00
ride_length	1	16.46	28.68	1.00	6.02	10.43	18.58	1439.37
start_lat_round	1	41.90	0.05	41.64	41.88	41.90	41.93	42.07
start_lng_round	1	-87.65	0.03	-87.84	-87.66	-87.64	-87.63	-87.52
end_lat_round	1	41.90	0.07	0.00	41.88	41.90	41.93	42.37
end_lng_round	1	-87.65	0.10	-88.14	-87.66	-87.64	-87.63	0.00

Variable type: POSIXct

skim_variable	n_missing	complete_rate	min	max	median	n_unique
started_at	0	1	2022-02-01 00:03:18	2023-01-31 23:56:09	2022-07-27 19:51:47	4724012
ended_at	0	1	2022-02-01 00:09:37	2023-02-01 00:28:12	2022-07-27 20:10:25	4736481

6 Analysis and Visualization of the Data

Analyzing the data in order to get different insights from it in different ways. Focusing on analyzing how the annual member riders and the casual riders use Cyclistic bikes differently.

6.1 Summary

Get the summary of the data to be used for analysis in order to get a snapshot of the information.

summary(all_trips_v2)

##    ride_id          rideable_type        started_at                    
##  Length:5612878     Length:5612878     Min.   :2022-02-01 00:03:18.00  
##  Class :character   Class :character   1st Qu.:2022-06-02 12:16:51.50  
##  Mode  :character   Mode  :character   Median :2022-07-27 19:51:47.50  
##                                        Mean   :2022-07-29 10:10:19.89  
##                                        3rd Qu.:2022-09-22 19:13:23.50  
##                                        Max.   :2023-01-31 23:56:09.00  
##     ended_at                      start_station_name start_station_id  
##  Min.   :2022-02-01 00:09:37.00   Length:5612878     Length:5612878    
##  1st Qu.:2022-06-02 12:36:19.00   Class :character   Class :character  
##  Median :2022-07-27 20:10:25.00   Mode  :character   Mode  :character  
##  Mean   :2022-07-29 10:26:47.51                                        
##  3rd Qu.:2022-09-22 19:28:37.00                                        
##  Max.   :2023-02-01 00:28:12.00                                        
##  end_station_name   end_station_id       start_lat       start_lng     
##  Length:5612878     Length:5612878     Min.   :41.64   Min.   :-87.84  
##  Class :character   Class :character   1st Qu.:41.88   1st Qu.:-87.66  
##  Mode  :character   Mode  :character   Median :41.90   Median :-87.64  
##                                        Mean   :41.90   Mean   :-87.65  
##                                        3rd Qu.:41.93   3rd Qu.:-87.63  
##                                        Max.   :42.07   Max.   :-87.52  
##     end_lat         end_lng       member_casual           date           
##  Min.   : 0.00   Min.   :-88.14   Length:5612878     Min.   :2022-02-01  
##  1st Qu.:41.88   1st Qu.:-87.66   Class :character   1st Qu.:2022-06-02  
##  Median :41.90   Median :-87.64   Mode  :character   Median :2022-07-27  
##  Mean   :41.90   Mean   :-87.65                      Mean   :2022-07-28  
##  3rd Qu.:41.93   3rd Qu.:-87.63                      3rd Qu.:2022-09-22  
##  Max.   :42.37   Max.   :  0.00                      Max.   :2023-01-31  
##     month               day                year           day_of_week       
##  Length:5612878     Length:5612878     Length:5612878     Length:5612878    
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##       hour        ride_length         season          time_of_day       
##  Min.   : 0.00   Min.   :   1.00   Length:5612878     Length:5612878    
##  1st Qu.:11.00   1st Qu.:   6.02   Class :character   Class :character  
##  Median :15.00   Median :  10.43   Mode  :character   Mode  :character  
##  Mean   :14.21   Mean   :  16.46                                        
##  3rd Qu.:18.00   3rd Qu.:  18.58                                        
##  Max.   :23.00   Max.   :1439.37                                        
##  start_lat_round start_lng_round  end_lat_round   end_lng_round   
##  Min.   :41.64   Min.   :-87.84   Min.   : 0.00   Min.   :-88.14  
##  1st Qu.:41.88   1st Qu.:-87.66   1st Qu.:41.88   1st Qu.:-87.66  
##  Median :41.90   Median :-87.64   Median :41.90   Median :-87.64  
##  Mean   :41.90   Mean   :-87.65   Mean   :41.90   Mean   :-87.65  
##  3rd Qu.:41.93   3rd Qu.:-87.63   3rd Qu.:41.93   3rd Qu.:-87.63  
##  Max.   :42.07   Max.   :-87.52   Max.   :42.37   Max.   :  0.00

6.2 Comparism of Member riders to Casual Riders

Starting off with numbers and percentage comparison

all_trips_v2%>% 
  group_by(member_casual) %>% 
  summarise(count = n(), Percentage = n()/nrow(all_trips_v2)*100)

## # A tibble: 2 × 3
##   member_casual   count Percentage
##   <chr>           <int>      <dbl>
## 1 casual        2284633       40.7
## 2 member        3328245       59.3

Analysis

During the time frame studied for this analysis (12 month period), 40.7% of the riders that used the rideables were casual riders while about 59.3% of the riders in the same time period were member riders. Converting a portion of the casual riders to members would be very advantageous.

Visualizing by member type

ggplot(all_trips_v2, aes(member_casual, fill=member_casual)) +
  geom_bar() +
  labs(x="User Type", y="Count", title = "Total Rides: Casual vs Member") +
  annotate("text",x=1,y=2000000,label="2,284,633 / (40.7%)",color="black",size=3.5) +
  annotate("text",x=2,y=3000000,label="3,328,245 / (59.3%)",color="black",size=3.5)

A majority of the riders (59.3%) are members while the rest (40.7%) are casual riders.

Visualizing the total rides by the user type and rideable type

all_trips_v2%>% 
  group_by(member_casual, rideable_type) %>% 
  summarise(count = n()) %>% 
  ggplot(aes(x=member_casual, y=count, fill=rideable_type)) + 
  geom_bar(stat="identity", width = 0.3) +
  labs(x="Rideable Type", y="Number of Rides", title = "Total Rides by user type and rideable type")

## `summarise()` has grouped output by 'member_casual'. You can override using the
## `.groups` argument.

The members riders slightly favor the classic bike over the electric bike: - Most casual riders use electric bikes - members don’t use docked bikes - members use both classic and electric bikes almost equally

6.3 Analyzing the length of rides

summary(all_trips_v2$ride_length)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00    6.02   10.43   16.46   18.58 1439.37

The minimum ride length is 1 minute, which is expected due to cleaning
Average of all ride lengths is 16.46 minutes

The average length of ride for each trip (in minutes) by the type of user

aggregate(all_trips_v2$ride_length~all_trips_v2$member_casual, FUN = mean)

##   all_trips_v2$member_casual all_trips_v2$ride_length
## 1                     casual                 22.10731
## 2                     member                 12.58425

Analysis The casual riders rode the bikes for significantly more time on average than the member riders.

Breaking down ride lengths for more analysis

all_trips_v2%>% 
  group_by(member_casual) %>% 
  summarize("<=5min" = sum(ride_length <=5),
            "<=15min" = sum(ride_length <=15),
            "<=30min" = sum(ride_length <=30),
            "<=45min" = sum(ride_length <=45),
            "<=60min" = sum(ride_length <=60),
            ">2hrs" = sum(ride_length >120),
            ">4hrs" = sum(ride_length >240),
            ">6hrs" = sum(ride_length >360),)

## # A tibble: 2 × 9
##   member_casual `<=5min` <=15m…¹ <=30m…² <=45m…³ <=60m…⁴ `>2hrs` `>4hrs` `>6hrs`
##   <chr>            <int>   <int>   <int>   <int>   <int>   <int>   <int>   <int>
## 1 casual          266189 1278389 1863455 2057141 2142989   33432    5731    3206
## 2 member          757992 2462997 3116478 3273578 3305079    5982    2111    1218
## # … with abbreviated variable names ¹`<=15min`, ²`<=30min`, ³`<=45min`,
## #   ⁴`<=60min`

Analysis This shows that for both user types, the vast majority of ride length is below 60 minutes, more data would be needed to analyze this trend but I suspect that it’s the maximum endurance of non professional riders. And distances beyond this point most riders will choose another mode of transportation. Perhaps an availability of more electric bikes would encourage longer rides.

all_trips_v2 %>% 
  group_by(member_casual) %>% 
  summarize("<10 min" = sum(ride_length <9.99),
            "10-20 min" = sum(ride_length >=10 & ride_length <=19.99),
            "20-30 min" = sum(ride_length >=20 & ride_length <=29.99),
            "30-60 min" = sum(ride_length >=30 & ride_length <=59.99),
            "60-120 min" = sum(ride_length >=60 & ride_length <=119.99),
            "120-240 min" = sum(ride_length >=120 & ride_length <=239.99),
            "240+min" = sum(ride_length >=240))

## # A tibble: 2 × 8
##   member_casual `<10 min` `10-20 min` 20-30 mi…¹ 30-60…² 60-12…³ 120-2…⁴ 240+m…⁵
##   <chr>             <int>       <int>      <int>   <int>   <int>   <int>   <int>
## 1 casual           845416      707829     309833  279832  108280   27712    5731
## 2 member          1844513      954711     316957  188886   17195    3872    2111
## # … with abbreviated variable names ¹`20-30 min`, ²`30-60 min`, ³`60-120 min`,
## #   ⁴`120-240 min`, ⁵`240+min`

Analysis This shows that casual riders tend to keep the rideables for much longer than member riders, but the members tend to take shorter rides.

all_trips_v3 <- all_trips_v2%>% mutate(ride_length_grp = case_when(
  ride_length <9.99 ~ "< 10 min",
  ride_length >=10 & ride_length <=19.99 ~ "10-20 min",
  ride_length >=20 & ride_length <=29.99  ~ "20-30 min",
  ride_length >=30 & ride_length <=59.99  ~ "30-60 min",
  ride_length >=60 & ride_length <=119.99  ~ "60-120 min",
  ride_length >=120 & ride_length <=239.99  ~ "120-240 min",
  ride_length >=240  ~ "240+ min"))

all_trips_v3 %>% 
  group_by(member_casual, ride_length_grp) %>% 
  summarise(count = n()) %>% 
  ggplot(aes(x=factor(ride_length_grp, level = c("< 10 min", "10-20 min", "20-30 min", "30-60 min",
                                                 "60-120 min", "120-240 min", "240+ min")), y=count, fill=member_casual)) + 
  geom_col(position = "dodge") +
  labs(x="Ride Length", y="Number of Rides", title = "Total Rides by user type and ride length grouping")

## `summarise()` has grouped output by 'member_casual'. You can override using the
## `.groups` argument.

Analysis Most of the rides fall within the 1 to 20 minutes ride length in minutes, this information can be further enhanced by obtaining survey information on riding behaviors.

Looking at the average ride length of each trip (in minutes) based on user type and the hour of day

all_trips_v3 %>% 
  group_by(member_casual, hour) %>% 
  summarise(count = n(), average_ride_length=mean(ride_length)) %>% 
  arrange(member_casual, hour) %>% 
  ggplot(aes(x=factor(hour, level= c(6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,0,1,2,3,4,5)), y=average_ride_length, fill=member_casual)) + 
  geom_col(position = "dodge") +
  labs(x="Hour of day", y="Ride Length (in minutes)", title = "Average ride length by user type and hour of the day")

## `summarise()` has grouped output by 'member_casual'. You can override using the
## `.groups` argument.

Analysis The casual riders had on average longer rides than member riders peaking between 10AM - 3PM, while member riders stayed relatively constant throughout the day.This insight can be useful in planning maintenance times for the rideables.

axis_labels <- c("Early Morning \n6am-9am", "Mid Morning \n9am-12pm", "Afternoon \n12pm-6pm", "Evening \n6pm-11pm", "Early Night \n11pm-3am", "Wee Night \n3am-6am")
all_trips_v3 %>% 
  group_by(member_casual, time_of_day) %>% 
  summarise(count = n(), average_ride_length=mean(ride_length)) %>% 
  ggplot(aes(x=factor(time_of_day, level= c("Early Morning", "Mid Morning", "Afternoon", "Evening", "Early Night", "Late Night")), y=average_ride_length, fill=member_casual)) + 
  geom_col(position = "dodge", width = 0.4) +
  labs(x="Time of Day", y="Ride Length (minutes)", title = "Average ride length by user type and time of the day") +
  scale_x_discrete(labels = axis_labels)

## `summarise()` has grouped output by 'member_casual'. You can override using the
## `.groups` argument.

Analysis

The length of the bike ride for the casual riders peaks during the mid morning through the afternoon
The length of the bike ride for member riders remains more constant throughout the day

aggregate(all_trips_v3$ride_length ~ all_trips_v3$member_casual + 
            all_trips_v3$day_of_week, FUN = mean)

##    all_trips_v3$member_casual all_trips_v3$day_of_week all_trips_v3$ride_length
## 1                      casual                   Friday                 20.71813
## 2                      member                   Friday                 12.39689
## 3                      casual                   Monday                 22.55896
## 4                      member                   Monday                 12.15048
## 5                      casual                 Saturday                 24.84920
## 6                      member                 Saturday                 14.04720
## 7                      casual                   Sunday                 25.31523
## 8                      member                   Sunday                 13.92415
## 9                      casual                 Thursday                 19.70325
## 10                     member                 Thursday                 12.16705
## 11                     casual                  Tuesday                 19.76585
## 12                     member                  Tuesday                 11.94311
## 13                     casual                Wednesday                 19.07947
## 14                     member                Wednesday                 11.99994

all_trips_v3 %>% 
  group_by(member_casual, day_of_week) %>% 
  summarise(count = n(), average_ride_length=mean(ride_length)) %>% 
  ggplot(aes(x=factor(day_of_week, level= c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday")), y=average_ride_length, fill=member_casual)) + 
  geom_col(position = "dodge", width = 0.4) + 
  labs(x="Day of Week", y="Ride Length ( minutes)", title = "Average ride length by the user type and day of the week")

## `summarise()` has grouped output by 'member_casual'. You can override using the
## `.groups` argument.

Analysis

The weekends saw increased bike use by minutes for both user types, casual riders were consistent throughout the week up until Saturday which is their preferred day of week. Meanwhile, the member riders looked to have a consistent weekly ride and a small weekend spike on the weekend for their bike use. The most popular day for minutes used for both user types is on Sundays.

all_trips_v3 %>% 
  group_by(member_casual, hour) %>% 
  summarise(count = n()) %>%  
  arrange(member_casual, hour) %>% 
  ggplot(aes(x=factor(hour, level= c(6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,0,1,2,3,4,5)), y=count, fill=member_casual)) + 
  geom_col(position = "dodge") +
  labs(x="Hour of the day", y="Number of Rides", title = "Total Rides by user type and hour of the day")

## `summarise()` has grouped output by 'member_casual'. You can override using the
## `.groups` argument.

Analysis

The member riders show an increased use of rideables starting from 6am and peaking at 8am and they see another peak from 4pm to 6pm.
The Casual riders only peak from 4pm to 6 pm.

all_trips_v3 %>% 
  group_by(member_casual) %>% 
  summarize("Early Morning" = sum(time_of_day == "Early Morning"),
            "Mid Morning" = sum(time_of_day == "Mid Morning"),
            "Afternoon" = sum(time_of_day == "Afternoon"),
            "Evening" = sum(time_of_day == "Evening"),
            "Early Night" = sum(time_of_day == "Early Night"),
            "Late Night" = sum(time_of_day == "Late Night"))

## # A tibble: 2 × 7
##   member_casual `Early Morning` `Mid Morning` Afternoon Evening Early …¹ Late …²
##   <chr>                   <int>         <int>     <int>   <int>    <int>   <int>
## 1 casual                 150178        282434   1034339  694212    92950   30520
## 2 member                 470104        440497   1412537  885751    70557   48799
## # … with abbreviated variable names ¹`Early Night`, ²`Late Night`

all_trips_v3 %>% 
  group_by(member_casual, time_of_day) %>% 
  summarise(count = n()) %>% 
  ggplot(aes(x=factor(time_of_day, level= c("Early Morning", "Mid Morning", "Afternoon", "Evening", "Early Night", "Late Night")), y=count, fill=member_casual)) + 
  geom_col(position = "dodge", width = 0.4) + 
  labs(x="Time of the Day", y="Number of Rides", title = "Total Rides by user type and time of the day") +
  scale_x_discrete(labels = axis_labels)

## `summarise()` has grouped output by 'member_casual'. You can override using the
## `.groups` argument.

Analysis

Member riders out use the causal riders at all times except for during the early night
Afternoon and evening saw the biggest number of rides for both membership types

all_trips_v3 %>% 
  group_by(member_casual, time_of_day) %>% 
  summarise(count = n()) %>% 
  ggplot(aes(x=factor(time_of_day, level= c("Early Morning", "Mid Morning", "Afternoon", "Evening", "Early Night", "Late Night")), y=count, color=member_casual)) + 
  geom_point() + geom_line(aes(group = member_casual)) +
  labs(x="Time of the Day", y="Number of Rides", title = "Total Rides by user type and time of the day") + ylim(0, NA) +
  scale_x_discrete(labels = axis_labels)

## `summarise()` has grouped output by 'member_casual'. You can override using the
## `.groups` argument.

Analysis

This follows the trends previously noticed, with peak times, suggestion that maintenance should be done outside of peak hours in order to main rideable availability and create customer satisfaction.

6.4 Analysis of total rides based on user type and day of the week

all_trips_v3 %>% 
  group_by(member_casual) %>% 
  summarize("Monday" = sum(day_of_week == "Monday"),
            "Tuesday" = sum(day_of_week == "Tuesday"),
            "Wednesday" = sum(day_of_week == "Wednesday"),
            "Thursday" = sum(day_of_week == "Thursday"),
            "Friday" = sum(day_of_week == "Friday"),
            "Saturday" = sum(day_of_week == "Saturday"),
            "Sunday" = sum(day_of_week == "Sunday"))

## # A tibble: 2 × 8
##   member_casual Monday Tuesday Wednesday Thursday Friday Saturday Sunday
##   <chr>          <int>   <int>     <int>    <int>  <int>    <int>  <int>
## 1 casual        273930  261419    271058   304027 328728   462513 382958
## 2 member        471235  521864    523339   527774 464289   435606 384138

all_trips_v3 %>% 
  group_by(member_casual, day_of_week) %>% 
  summarise(count = n()) %>% 
  ggplot(aes(x=factor(day_of_week, level= c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday")), y=count, fill=member_casual)) + 
  geom_col(position = "dodge", width = 0.4) + 
  labs(x="Day of the Week", y="Number of Rides", title = "Total Rides by user type and day of the week")

## `summarise()` has grouped output by 'member_casual'. You can override using the
## `.groups` argument.

Analysis

The member riders use more rideables during the week while the casual riders use more rideables during the weekends.

all_trips_v3 %>% 
  group_by(member_casual, day_of_week) %>% 
  summarise(count = n()) %>% 
  ggplot(aes(x=factor(day_of_week, level= c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday")), y=count, color=member_casual)) + 
  geom_point() + geom_line(aes(group = member_casual)) +
  labs(x="Day of Week", y="Number of Rides", title = "Total Rides by user type and day of the week") +
  ylim(0, NA)

## `summarise()` has grouped output by 'member_casual'. You can override using the
## `.groups` argument.

6.5 The analysis of the total rides by the type of user and the season

all_trips_v3 %>% 
  group_by(member_casual, season) %>% 
  summarise(count = n()) %>% 
  ggplot(aes(x=factor(season, level= c("Spring", "Summer", "Fall", "Winter")), y=count, fill=member_casual)) + 
  geom_col(position = "dodge", width = 0.4) + 
  labs(x="Season of the Year", y="Number of Rides", title = "Total Rides by user type and season of the year")

## `summarise()` has grouped output by 'member_casual'. You can override using the
## `.groups` argument.

Analysis Summer saw the highest rideable use and the winter saw the lowest rideable use, this is probably due to the weather. There is little that can be done to mitigate this but heated bikes is an avenue worth exploring.

all_trips_v3 %>% 
  filter(!(is.na(start_station_name))) %>% 
  filter(member_casual == "casual") %>% 
  group_by(start_station_name) %>% 
  summarize(count=n()) %>% 
  arrange(-count) %>% 
  top_n(10)

## Selecting by count

## # A tibble: 10 × 2
##    start_station_name                 count
##    <chr>                              <int>
##  1 Streeter Dr & Grand Ave            61066
##  2 DuSable Lake Shore Dr & Monroe St  31809
##  3 Michigan Ave & Oak St              26456
##  4 DuSable Lake Shore Dr & North Blvd 26386
##  5 Millennium Park                    26308
##  6 Shedd Aquarium                     20493
##  7 Theater on the Lake                19651
##  8 Wells St & Concord Ln              18923
##  9 Clark St & Lincoln Ave             18419
## 10 Dusable Harbor                     16228

all_trips_v3 %>% 
  filter(!(is.na(start_station_name))) %>% 
  filter(member_casual == "casual") %>% 
  group_by(start_station_name) %>% 
  summarize(count=n()) %>% 
  arrange(-count) %>% 
  top_n(10) %>% 
  mutate(start_station_name= fct_reorder(start_station_name, count)) %>% 
  ggplot(aes(x=start_station_name, y=count, fill=count)) +
  geom_bar(stat = "identity") +
  coord_flip() +
  labs(x="Number of Rides", y="Name of Start Station ", title="Top 10 starting stations for casual riders")

## Selecting by count

all_trips_v3 %>% 
  filter(!(is.na(end_station_name))) %>% 
  filter(member_casual == "casual") %>% 
  group_by(end_station_name) %>% 
  summarize(count=n()) %>% 
  arrange(-count) %>% 
  top_n(10)

## Selecting by count

## # A tibble: 10 × 2
##    end_station_name                   count
##    <chr>                              <int>
##  1 Streeter Dr & Grand Ave            63523
##  2 DuSable Lake Shore Dr & Monroe St  30126
##  3 Millennium Park                    27700
##  4 DuSable Lake Shore Dr & North Blvd 27217
##  5 Michigan Ave & Oak St              27198
##  6 Theater on the Lake                20663
##  7 Clark St & Lincoln Ave             19096
##  8 Wells St & Concord Ln              18814
##  9 Shedd Aquarium                     18584
## 10 Dusable Harbor                     15474

all_trips_v3 %>% 
  filter(!(is.na(start_station_name))) %>% 
  filter(member_casual == "member") %>% 
  group_by(start_station_name) %>% 
  summarize(count=n()) %>% 
  arrange(-count) %>% 
  top_n(10)

## Selecting by count

## # A tibble: 10 × 2
##    start_station_name           count
##    <chr>                        <int>
##  1 Ellis Ave & 60th St          29268
##  2 Kingsbury St & Kinzie St     25303
##  3 University Ave & 57th St     25261
##  4 Ellis Ave & 55th St          24845
##  5 Clark St & Elm St            23954
##  6 Wells St & Concord Ln        23808
##  7 Clinton St & Washington Blvd 21257
##  8 Wells St & Elm St            20687
##  9 Loomis St & Lexington St     19920
## 10 Broadway & Barry Ave         19672

all_trips_v3 %>% 
  filter(!(is.na(start_station_name))) %>% 
  filter(member_casual == "member") %>% 
  group_by(start_station_name) %>% 
  summarize(count=n()) %>% 
  arrange(-count) %>% 
  top_n(10) %>% 
  mutate(start_station_name= fct_reorder(start_station_name, count)) %>% 
  ggplot(aes(x=start_station_name, y=count, fill=count)) +
  geom_bar(stat = "identity") +
  coord_flip() +
  labs(x="Number of Rides", y="Name of Start Station ", title="Top 10 starting stations for member riders")

## Selecting by count

all_trips_v3 %>% 
  filter(!(is.na(end_station_name))) %>% 
  filter(member_casual == "member") %>% 
  group_by(end_station_name) %>% 
  summarize(count=n()) %>% 
  arrange(-count) %>% 
  top_n(10)

## Selecting by count

## # A tibble: 10 × 2
##    end_station_name             count
##    <chr>                        <int>
##  1 Ellis Ave & 60th St          34214
##  2 Kingsbury St & Kinzie St     25718
##  3 University Ave & 57th St     25095
##  4 Wells St & Concord Ln        24872
##  5 Clark St & Elm St            24082
##  6 Ellis Ave & 55th St          22564
##  7 Clinton St & Washington Blvd 22176
##  8 Clinton St & Madison St      19623
##  9 Broadway & Barry Ave         19552
## 10 Loomis St & Lexington St     19219

Analysis

The information on the top 10 end and start station for both riders can be used to determine availability of rideables.

6.6 Export file for more analysis

write_csv(all_trips_v3, file = "all_trips_v3.csv")

7 Conclusion

The first phenomenon that is easily noticeable is the stark difference between casual riders and riders with membership subscriptions, there is a significant difference in length of ride between both sets of riders.

7.1 Key Findings

Apart from the weekends (Saturday and Sunday) the ride length of the member riders remained relatively unchanged.
There is a slight uptick (about 16%) in member riders’ length of rides on weekends, it starts on Friday and experiences a sharp drop on Monday. It remains relatively the same until Friday again.
The ride length for casual riders changed significantly from day to day with the peek coming on weekends (Saturday and Sunday).
There is a visible inverse bell curve for the ride length of the casual riders, from the beginning of the week (Sunday) to the end of the week (Saturday). There is a dip in midweek and then a steady climb until the weekend, then a slow drop after the weekend.
Comparing the casual riders to the member riders, there is a doubling (100% increase) in ride length from the member riders to the casual riders. This means that for any giving day of the week the casual riders have twice the ride time of the member riders.

7.2 Business Task Questions

How do annual members and casual riders use Cyclistic bikes differently?
Why would casual riders buy Cyclistic annual memberships?
How can Cyclistic use digital media to influence casual riders to become members?

Challenges

The lack of individual rider identifiers which would have been integral in deciphering the ride patterns of casual riders would have been quite useful. Finding out if the casual riders were repeat users or just one-time users would have been beneficial.

7.3 Convincing casual riders to buy annual memberships

Given the information analyzed, the following steps can be deployed to increase annual membership among casual riders:

If the cost to rent for a day or 2 is lower than the weekly cost of membership then the cost of a one-day usage should be increased, this would encourage weekend users to subscribe.
Offer different types of memberships to cater to different types of riders, for example a weekend pass can be offered to capitalize on the weekend spike. The weekday pass can be made cheaper than the weekend pass to encourage weekday casual users.
Offer members only exclusives with the rides, an example will be bikes reserved for members only, priority bike usage over the weekend due to increased demands
Run an ad campaign to showcase the benefits of a membership, explain the pricing and the benefits
Offer a discount if the whole annual membership fee was paid all at once instead of monthly, also offer auto-pay capabilities and make payment as seamless as possible by offering different pay types.
Explore charging casual riders by the hour, while members are given unlimited ride time

Overall, more insight could be provided if more data is available, especially individual ride identities. I’m positive that these recommendations provided would convince casual riders to subscribe to a membership, which in turn will lead to increased revenue.

Cyclistic Bike-Share Case Study

Nnamdi Okoli

2023-07-15

1 Table of Contents

2 Scenario

3 Asking questions and problem definition

3.1 Ask the right questions

3.2 Identify the business task

3.3 Identifying key stakeholders

3.4 Approach

4 Prepare the data

4.1 Sort and filter the data in a spreasheet

4.2 Utilize R and RStudio

4.3 Loaded the packages

4.4 Upload the data files and create data frames

4.5 Check for column consistency

4.6 Combine the data frames into one data frame

4.7 Inspect the new data frame

4.8 Data summary, addressing missing data and the linitations of the data

5 Processing the data by cleaning

5.1 Add colums

5.2 Find and address missing values

5.3 Find and clean “testing” or “repair” rides

5.4 remove data with unreasonable ride length

5.5 Removed data that were for the repair and testing station

6 Analysis and Visualization of the Data

6.1 Summary

6.2 Comparism of Member riders to Casual Riders

6.3 Analyzing the length of rides

6.4 Analysis of total rides based on user type and day of the week

6.5 The analysis of the total rides by the type of user and the season

6.6 Export file for more analysis

7 Conclusion

7.1 Key Findings

7.2 Business Task Questions

7.3 Convincing casual riders to buy annual memberships