Introduction

This project is the capstone assignment for the Google Data Analytics Professional Certificate program. The program prepares participants for a career in data analytics with training focused on key analytical skills (data cleaning, analysis, and visualization) and tools (Excel, SQL, R Programming, Tableau).

In the project the steps of the data analysis process followed: Ask, Prepare, Process, Analyze, and Share, Act.

In this project publicly available data sets provided by the course, for a bike share program based in Chicago are analyzed.

The primary purpose is to provide information for the company to use data-driven making decisions

Background

Scenario

I am a junior data analyst working in the marketing analyst team at Cyclistic, a bike-share company in Chicago. The director of marketing believes the company’s future success depends on maximizing the number of annual memberships. Therefore, my team wants to understand how casual riders and annual members use Cyclistic bikes differently. From these insights, my team will design a new marketing strategy to convert casual riders into annual members.

About the company

Cyclistic is a successful bike-share program launched in 2016. It has since grown to a fleet of 5,824 bicycles that are geo-tracked and locked into a network of 692 stations across Chicago. The bikes can be unlocked from one station and returned to any other station in the system anytime.

Cyclistic offers a variety of pricing plans including: single-ride passes, full-day passes, and annual memberships. Customers who purchase a single-ride or full-day passes are referred to as casual riders. Customers who purchase annual memberships are Cyclistic members.

The goal of this case study

Three questions will guide the future marketing program:

  1. How do annual members and casual riders use Cyclistic bikes differently?
  2. Why would casual riders buy Cyclistic annual memberships?
  3. How can Cyclistic use digital media to influence casual riders to become members?

The director of marketing has assigned me the first question to answer: How do annual members and casual riders use Cyclistic bikes differently?

In this assignment, I will produce a report with the following deliverables:

  1. A clear statement of the business task
  2. A description of all data sources used
  3. Documentation of any cleaning or manipulation of data
  4. A summary of your analysis
  5. Supporting visualizations and key findings
  6. Your top three recommendations based on your analysis

Ask

Business objective

The business objective of the case study is to identify trend, pattern & opportunities among casual riders and membership riders, and identify potential riders who can get benefit from annual membership. This will be done through analysis of bike trip data and understanding the user behavior and preferences. The ultimate goal is to increase profitability and drive future growth for the company.

Key stekholders

Cyclistic Director of Marketing: Lily Moreno, responsible for implementing the marketing campaigns at Cyclistic.

Cyclistic marketing analytics team: Team of data analysts who are responsible for collecting, analyzing, and reporting data that helps guide Cyclistic marketing strategy.

Cyclistic executive team: Executive team will decide whether to approve the recommended marketing program.

Cyclistic casual riders: They are the target audience of the marketing campaign, and their behavior and preferences will be a key focus of the analysis.

Cyclistic annual members: They are the group that the marketing campaign is trying to convert casual riders into, and their behavior and preferences may be compared to those of casual riders in the analysis.

Cyclistic investors and shareholders: They have a financial interest in the company’s success and may be interested in the results of the analysis and any changes to the marketing strategy.

Prepare

Data sources used: Cyclistic’s historical trip data downloaded from https://divvy-tripdata.s3.amazonaws.com/index.html

Data owner: Unknown.

Data License: Made available by Motivate International Inc. under the license https://www.divvybikes.com/data-license-agreement.

How is the Data Organized?

The data used consists of monthly .csv files for the past 12 months from January 2022 - December 2022. Each file contains 13 columns of information related to trip data.

Data ROCCC validation:

Reliable(R): The data is reliable, not biased, and can be used in the project.

Original(O): The originality of the project is certain the original public data can be located.

Comprehensive(C): The data includes trip duration, user type, start & destination station name, start & end time which are comprehensive for analyzing user trends.

Current(C): Data is not obsolete as the historical data records from 2013 to 2022.

Cited(C): Yes the data is cited with the proper reference.

Licensing, privacy, security, and accessibility

The data used for this analysis has had all identifying information removed in order to protect the privacy of users. This limitation on the data does restrict the scope of the possible analysis, as it is not possible to determine whether casual riders are repeat users or residents of the Chicago area.

Problem with data

The data consists of 12 monthly trips which are completely separate .csv files. We need to combine all the 12 files into one aggregate file for proper analysis. Unnecessary variables, wrong & duplicated data were removed.

Process

Tools Used

Since the amount of data is large tools like Excel is not able to handle it, hence R programming language is used.For Visualization tableau was considered but due to large size of the data and some limitations, R is chosen for it.

Data Review

All 12 files were combined into one data frame after checking the column names. The combined data set is reviewed.The final data set consisted of 5667186 rows with 13 columns of character and numeric data.

Data Description

The data contains the following columns:

  1. ride_id (categorical): Unique number assigned to a ride trip.
  2. rideable_type (categorical): Type of bike used during trip; standard two-wheel bike, reclining bike, hand tricycle, or cargo bike.
  3. started_at (datetime): Start date and time for the trip
  4. ended_at (datetime): End data and time for the trip
  5. start_station_name (categorical): Name of the station where the trip started
  6. start_station_id (categorical): Unique identification code assigned to the start station.
  7. end_station_name (categorical): Name of the station where the trip ended.
  8. end_station_id (categorical): Unique identification code assigned to the end station.
  9. start_lat (numeric): Latitude coordinate of where the trip started.
  10. start_lng (numeric): Longitude coordinate of where the trip started.
  11. end_lat (numeric): Latitude coordinate of where the trip ended.
  12. end_lng (numeric): Longitude coordinate of where the trip ended.
  13. member_casual (categorical): Customer type; “member” = annual member, “casual” = casual rider.

Process

Data processing and analyzing will occur in RStudio using the R programming language including visualizations.

Load packages

library(tidyverse)
## Warning: package 'tidyverse' was built under R version 4.4.1
## Warning: package 'ggplot2' was built under R version 4.4.1
## Warning: package 'dplyr' was built under R version 4.4.1
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## âś” dplyr     1.1.4     âś” readr     2.1.5
## âś” forcats   1.0.0     âś” stringr   1.5.1
## âś” ggplot2   3.5.1     âś” tibble    3.2.1
## âś” lubridate 1.9.3     âś” tidyr     1.3.1
## âś” purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## âś– dplyr::filter() masks stats::filter()
## âś– dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(dplyr)
library(janitor)
## Warning: package 'janitor' was built under R version 4.4.1
## 
## Attaching package: 'janitor'
## 
## The following objects are masked from 'package:stats':
## 
##     chisq.test, fisher.test
library(readr)
library(psych)
## Warning: package 'psych' was built under R version 4.4.1
## 
## Attaching package: 'psych'
## 
## The following objects are masked from 'package:ggplot2':
## 
##     %+%, alpha

Import 12 month trip data csv file

trip_202201 <- read.csv("D://Data Analysis/Case Study/trip_data_2022/202201-divvy-tripdata.csv")
trip_202202 <- read.csv("D://Data Analysis/Case Study/trip_data_2022/202202-divvy-tripdata.csv")
trip_202203 <- read.csv("D://Data Analysis/Case Study/trip_data_2022/202203-divvy-tripdata.csv")
trip_202204 <- read.csv("D://Data Analysis/Case Study/trip_data_2022/202204-divvy-tripdata.csv")
trip_202205 <- read.csv("D://Data Analysis/Case Study/trip_data_2022/202205-divvy-tripdata.csv")
trip_202206 <- read.csv("D://Data Analysis/Case Study/trip_data_2022/202206-divvy-tripdata.csv")
trip_202207 <- read.csv("D://Data Analysis/Case Study/trip_data_2022/202207-divvy-tripdata.csv")
trip_202208 <- read.csv("D://Data Analysis/Case Study/trip_data_2022/202208-divvy-tripdata.csv")
trip_202209 <- read.csv("D://Data Analysis/Case Study/trip_data_2022/202209-divvy-tripdata.csv")
trip_202210 <- read.csv("D://Data Analysis/Case Study/trip_data_2022/202210-divvy-tripdata.csv")
trip_202211 <- read.csv("D://Data Analysis/Case Study/trip_data_2022/202211-divvy-tripdata.csv")
trip_202212 <- read.csv("D://Data Analysis/Case Study/trip_data_2022/202212-divvy-tripdata.csv")

Checking the column names of datasets

colnames(trip_202201)
##  [1] "ride_id"            "rideable_type"      "started_at"        
##  [4] "ended_at"           "start_station_name" "start_station_id"  
##  [7] "end_station_name"   "end_station_id"     "start_lat"         
## [10] "start_lng"          "end_lat"            "end_lng"           
## [13] "member_casual"
colnames(trip_202202)
##  [1] "ride_id"            "rideable_type"      "started_at"        
##  [4] "ended_at"           "start_station_name" "start_station_id"  
##  [7] "end_station_name"   "end_station_id"     "start_lat"         
## [10] "start_lng"          "end_lat"            "end_lng"           
## [13] "member_casual"
colnames(trip_202203)
##  [1] "ride_id"            "rideable_type"      "started_at"        
##  [4] "ended_at"           "start_station_name" "start_station_id"  
##  [7] "end_station_name"   "end_station_id"     "start_lat"         
## [10] "start_lng"          "end_lat"            "end_lng"           
## [13] "member_casual"
colnames(trip_202204)
##  [1] "ride_id"            "rideable_type"      "started_at"        
##  [4] "ended_at"           "start_station_name" "start_station_id"  
##  [7] "end_station_name"   "end_station_id"     "start_lat"         
## [10] "start_lng"          "end_lat"            "end_lng"           
## [13] "member_casual"
colnames(trip_202205)
##  [1] "ride_id"            "rideable_type"      "started_at"        
##  [4] "ended_at"           "start_station_name" "start_station_id"  
##  [7] "end_station_name"   "end_station_id"     "start_lat"         
## [10] "start_lng"          "end_lat"            "end_lng"           
## [13] "member_casual"
colnames(trip_202206)
##  [1] "ride_id"            "rideable_type"      "started_at"        
##  [4] "ended_at"           "start_station_name" "start_station_id"  
##  [7] "end_station_name"   "end_station_id"     "start_lat"         
## [10] "start_lng"          "end_lat"            "end_lng"           
## [13] "member_casual"
colnames(trip_202207)
##  [1] "ride_id"            "rideable_type"      "started_at"        
##  [4] "ended_at"           "start_station_name" "start_station_id"  
##  [7] "end_station_name"   "end_station_id"     "start_lat"         
## [10] "start_lng"          "end_lat"            "end_lng"           
## [13] "member_casual"
colnames(trip_202208)
##  [1] "ride_id"            "rideable_type"      "started_at"        
##  [4] "ended_at"           "start_station_name" "start_station_id"  
##  [7] "end_station_name"   "end_station_id"     "start_lat"         
## [10] "start_lng"          "end_lat"            "end_lng"           
## [13] "member_casual"
colnames(trip_202209)
##  [1] "ride_id"            "rideable_type"      "started_at"        
##  [4] "ended_at"           "start_station_name" "start_station_id"  
##  [7] "end_station_name"   "end_station_id"     "start_lat"         
## [10] "start_lng"          "end_lat"            "end_lng"           
## [13] "member_casual"
colnames(trip_202210)
##  [1] "ride_id"            "rideable_type"      "started_at"        
##  [4] "ended_at"           "start_station_name" "start_station_id"  
##  [7] "end_station_name"   "end_station_id"     "start_lat"         
## [10] "start_lng"          "end_lat"            "end_lng"           
## [13] "member_casual"
colnames(trip_202211)
##  [1] "ride_id"            "rideable_type"      "started_at"        
##  [4] "ended_at"           "start_station_name" "start_station_id"  
##  [7] "end_station_name"   "end_station_id"     "start_lat"         
## [10] "start_lng"          "end_lat"            "end_lng"           
## [13] "member_casual"
colnames(trip_202212)
##  [1] "ride_id"            "rideable_type"      "started_at"        
##  [4] "ended_at"           "start_station_name" "start_station_id"  
##  [7] "end_station_name"   "end_station_id"     "start_lat"         
## [10] "start_lng"          "end_lat"            "end_lng"           
## [13] "member_casual"

Checking the structure of datasets

str(trip_202201)
## 'data.frame':    103770 obs. of  13 variables:
##  $ ride_id           : chr  "C2F7DD78E82EC875" "A6CF8980A652D272" "BD0F91DFF741C66D" "CBB80ED419105406" ...
##  $ rideable_type     : chr  "electric_bike" "electric_bike" "classic_bike" "classic_bike" ...
##  $ started_at        : chr  "2022-01-13 11:59:47" "2022-01-10 08:41:56" "2022-01-25 04:53:40" "2022-01-04 00:18:04" ...
##  $ ended_at          : chr  "2022-01-13 12:02:44" "2022-01-10 08:46:17" "2022-01-25 04:58:01" "2022-01-04 00:33:00" ...
##  $ start_station_name: chr  "Glenwood Ave & Touhy Ave" "Glenwood Ave & Touhy Ave" "Sheffield Ave & Fullerton Ave" "Clark St & Bryn Mawr Ave" ...
##  $ start_station_id  : chr  "525" "525" "TA1306000016" "KA1504000151" ...
##  $ end_station_name  : chr  "Clark St & Touhy Ave" "Clark St & Touhy Ave" "Greenview Ave & Fullerton Ave" "Paulina St & Montrose Ave" ...
##  $ end_station_id    : chr  "RP-007" "RP-007" "TA1307000001" "TA1309000021" ...
##  $ start_lat         : num  42 42 41.9 42 41.9 ...
##  $ start_lng         : num  -87.7 -87.7 -87.7 -87.7 -87.6 ...
##  $ end_lat           : num  42 42 41.9 42 41.9 ...
##  $ end_lng           : num  -87.7 -87.7 -87.7 -87.7 -87.6 ...
##  $ member_casual     : chr  "casual" "casual" "member" "casual" ...
str(trip_202202)
## 'data.frame':    115609 obs. of  13 variables:
##  $ ride_id           : chr  "E1E065E7ED285C02" "1602DCDC5B30FFE3" "BE7DD2AF4B55C4AF" "A1789BDF844412BE" ...
##  $ rideable_type     : chr  "classic_bike" "classic_bike" "classic_bike" "classic_bike" ...
##  $ started_at        : chr  "2022-02-19 18:08:41" "2022-02-20 17:41:30" "2022-02-25 18:55:56" "2022-02-14 11:57:03" ...
##  $ ended_at          : chr  "2022-02-19 18:23:56" "2022-02-20 17:45:56" "2022-02-25 19:09:34" "2022-02-14 12:04:00" ...
##  $ start_station_name: chr  "State St & Randolph St" "Halsted St & Wrightwood Ave" "State St & Randolph St" "Southport Ave & Waveland Ave" ...
##  $ start_station_id  : chr  "TA1305000029" "TA1309000061" "TA1305000029" "13235" ...
##  $ end_station_name  : chr  "Clark St & Lincoln Ave" "Southport Ave & Wrightwood Ave" "Canal St & Adams St" "Broadway & Sheridan Rd" ...
##  $ end_station_id    : chr  "13179" "TA1307000113" "13011" "13323" ...
##  $ start_lat         : num  41.9 41.9 41.9 41.9 41.9 ...
##  $ start_lng         : num  -87.6 -87.6 -87.6 -87.7 -87.6 ...
##  $ end_lat           : num  41.9 41.9 41.9 42 41.9 ...
##  $ end_lng           : num  -87.6 -87.7 -87.6 -87.6 -87.6 ...
##  $ member_casual     : chr  "member" "member" "member" "member" ...
str(trip_202203)
## 'data.frame':    284042 obs. of  13 variables:
##  $ ride_id           : chr  "47EC0A7F82E65D52" "8494861979B0F477" "EFE527AF80B66109" "9F446FD9DEE3F389" ...
##  $ rideable_type     : chr  "classic_bike" "electric_bike" "classic_bike" "classic_bike" ...
##  $ started_at        : chr  "2022-03-21 13:45:01" "2022-03-16 09:37:16" "2022-03-23 19:52:02" "2022-03-01 19:12:26" ...
##  $ ended_at          : chr  "2022-03-21 13:51:18" "2022-03-16 09:43:34" "2022-03-23 19:54:48" "2022-03-01 19:22:14" ...
##  $ start_station_name: chr  "Wabash Ave & Wacker Pl" "Michigan Ave & Oak St" "Broadway & Berwyn Ave" "Wabash Ave & Wacker Pl" ...
##  $ start_station_id  : chr  "TA1307000131" "13042" "13109" "TA1307000131" ...
##  $ end_station_name  : chr  "Kingsbury St & Kinzie St" "Orleans St & Chestnut St (NEXT Apts)" "Broadway & Ridge Ave" "Franklin St & Jackson Blvd" ...
##  $ end_station_id    : chr  "KA1503000043" "620" "15578" "TA1305000025" ...
##  $ start_lat         : num  41.9 41.9 42 41.9 41.9 ...
##  $ start_lng         : num  -87.6 -87.6 -87.7 -87.6 -87.6 ...
##  $ end_lat           : num  41.9 41.9 42 41.9 41.9 ...
##  $ end_lng           : num  -87.6 -87.6 -87.7 -87.6 -87.7 ...
##  $ member_casual     : chr  "member" "member" "member" "member" ...
str(trip_202204)
## 'data.frame':    371249 obs. of  13 variables:
##  $ ride_id           : chr  "3564070EEFD12711" "0B820C7FCF22F489" "89EEEE32293F07FF" "84D4751AEB31888D" ...
##  $ rideable_type     : chr  "electric_bike" "classic_bike" "classic_bike" "classic_bike" ...
##  $ started_at        : chr  "2022-04-06 17:42:48" "2022-04-24 19:23:07" "2022-04-20 19:29:08" "2022-04-22 21:14:06" ...
##  $ ended_at          : chr  "2022-04-06 17:54:36" "2022-04-24 19:43:17" "2022-04-20 19:35:16" "2022-04-22 21:23:29" ...
##  $ start_station_name: chr  "Paulina St & Howard St" "Wentworth Ave & Cermak Rd" "Halsted St & Polk St" "Wentworth Ave & Cermak Rd" ...
##  $ start_station_id  : chr  "515" "13075" "TA1307000121" "13075" ...
##  $ end_station_name  : chr  "University Library (NU)" "Green St & Madison St" "Green St & Madison St" "Delano Ct & Roosevelt Rd" ...
##  $ end_station_id    : chr  "605" "TA1307000120" "TA1307000120" "KA1706005007" ...
##  $ start_lat         : num  42 41.9 41.9 41.9 41.9 ...
##  $ start_lng         : num  -87.7 -87.6 -87.6 -87.6 -87.6 ...
##  $ end_lat           : num  42.1 41.9 41.9 41.9 41.9 ...
##  $ end_lng           : num  -87.7 -87.6 -87.6 -87.6 -87.6 ...
##  $ member_casual     : chr  "member" "member" "member" "casual" ...
str(trip_202205)
## 'data.frame':    634858 obs. of  13 variables:
##  $ ride_id           : chr  "EC2DE40644C6B0F4" "1C31AD03897EE385" "1542FBEC830415CF" "6FF59852924528F8" ...
##  $ rideable_type     : chr  "classic_bike" "classic_bike" "classic_bike" "classic_bike" ...
##  $ started_at        : chr  "2022-05-23 23:06:58" "2022-05-11 08:53:28" "2022-05-26 18:36:28" "2022-05-10 07:30:07" ...
##  $ ended_at          : chr  "2022-05-23 23:40:19" "2022-05-11 09:31:22" "2022-05-26 18:58:18" "2022-05-10 07:38:49" ...
##  $ start_station_name: chr  "Wabash Ave & Grand Ave" "DuSable Lake Shore Dr & Monroe St" "Clinton St & Madison St" "Clinton St & Madison St" ...
##  $ start_station_id  : chr  "TA1307000117" "13300" "TA1305000032" "TA1305000032" ...
##  $ end_station_name  : chr  "Halsted St & Roscoe St" "Field Blvd & South Water St" "Wood St & Milwaukee Ave" "Clark St & Randolph St" ...
##  $ end_station_id    : chr  "TA1309000025" "15534" "13221" "TA1305000030" ...
##  $ start_lat         : num  41.9 41.9 41.9 41.9 41.9 ...
##  $ start_lng         : num  -87.6 -87.6 -87.6 -87.6 -87.6 ...
##  $ end_lat           : num  41.9 41.9 41.9 41.9 41.9 ...
##  $ end_lng           : num  -87.6 -87.6 -87.7 -87.6 -87.7 ...
##  $ member_casual     : chr  "member" "member" "member" "member" ...
str(trip_202206)
## 'data.frame':    769204 obs. of  13 variables:
##  $ ride_id           : chr  "600CFD130D0FD2A4" "F5E6B5C1682C6464" "B6EB6D27BAD771D2" "C9C320375DE1D5C6" ...
##  $ rideable_type     : chr  "electric_bike" "electric_bike" "electric_bike" "electric_bike" ...
##  $ started_at        : chr  "2022-06-30 17:27:53" "2022-06-30 18:39:52" "2022-06-30 11:49:25" "2022-06-30 11:15:25" ...
##  $ ended_at          : chr  "2022-06-30 17:35:15" "2022-06-30 18:47:28" "2022-06-30 12:02:54" "2022-06-30 11:19:43" ...
##  $ start_station_name: chr  "" "" "" "" ...
##  $ start_station_id  : chr  "" "" "" "" ...
##  $ end_station_name  : chr  "" "" "" "" ...
##  $ end_station_id    : chr  "" "" "" "" ...
##  $ start_lat         : num  41.9 41.9 41.9 41.8 41.9 ...
##  $ start_lng         : num  -87.6 -87.6 -87.7 -87.7 -87.6 ...
##  $ end_lat           : num  41.9 41.9 41.9 41.8 41.9 ...
##  $ end_lng           : num  -87.6 -87.6 -87.6 -87.7 -87.6 ...
##  $ member_casual     : chr  "casual" "casual" "casual" "casual" ...
str(trip_202207)
## 'data.frame':    823488 obs. of  13 variables:
##  $ ride_id           : chr  "954144C2F67B1932" "292E027607D218B6" "57765852588AD6E0" "B5B6BE44314590E6" ...
##  $ rideable_type     : chr  "classic_bike" "classic_bike" "classic_bike" "classic_bike" ...
##  $ started_at        : chr  "2022-07-05 08:12:47" "2022-07-26 12:53:38" "2022-07-03 13:58:49" "2022-07-31 17:44:21" ...
##  $ ended_at          : chr  "2022-07-05 08:24:32" "2022-07-26 12:55:31" "2022-07-03 14:06:32" "2022-07-31 18:42:50" ...
##  $ start_station_name: chr  "Ashland Ave & Blackhawk St" "Buckingham Fountain (Temp)" "Buckingham Fountain (Temp)" "Buckingham Fountain (Temp)" ...
##  $ start_station_id  : chr  "13224" "15541" "15541" "15541" ...
##  $ end_station_name  : chr  "Kingsbury St & Kinzie St" "Michigan Ave & 8th St" "Michigan Ave & 8th St" "Woodlawn Ave & 55th St" ...
##  $ end_station_id    : chr  "KA1503000043" "623" "623" "TA1307000164" ...
##  $ start_lat         : num  41.9 41.9 41.9 41.9 41.9 ...
##  $ start_lng         : num  -87.7 -87.6 -87.6 -87.6 -87.6 ...
##  $ end_lat           : num  41.9 41.9 41.9 41.8 41.9 ...
##  $ end_lng           : num  -87.6 -87.6 -87.6 -87.6 -87.7 ...
##  $ member_casual     : chr  "member" "casual" "casual" "casual" ...
str(trip_202208)
## 'data.frame':    785932 obs. of  13 variables:
##  $ ride_id           : chr  "550CF7EFEAE0C618" "DAD198F405F9C5F5" "E6F2BC47B65CB7FD" "F597830181C2E13C" ...
##  $ rideable_type     : chr  "electric_bike" "electric_bike" "electric_bike" "electric_bike" ...
##  $ started_at        : chr  "2022-08-07 21:34:15" "2022-08-08 14:39:21" "2022-08-08 15:29:50" "2022-08-08 02:43:50" ...
##  $ ended_at          : chr  "2022-08-07 21:41:46" "2022-08-08 14:53:23" "2022-08-08 15:40:34" "2022-08-08 02:58:53" ...
##  $ start_station_name: chr  "" "" "" "" ...
##  $ start_station_id  : chr  "" "" "" "" ...
##  $ end_station_name  : chr  "" "" "" "" ...
##  $ end_station_id    : chr  "" "" "" "" ...
##  $ start_lat         : num  41.9 41.9 42 41.9 41.9 ...
##  $ start_lng         : num  -87.7 -87.6 -87.7 -87.7 -87.7 ...
##  $ end_lat           : num  41.9 41.9 42 42 41.8 ...
##  $ end_lng           : num  -87.7 -87.6 -87.7 -87.7 -87.7 ...
##  $ member_casual     : chr  "casual" "casual" "casual" "casual" ...
str(trip_202209)
## 'data.frame':    701339 obs. of  13 variables:
##  $ ride_id           : chr  "5156990AC19CA285" "E12D4A16BF51C274" "A02B53CD7DB72DD7" "C82E05FEE872DF11" ...
##  $ rideable_type     : chr  "electric_bike" "electric_bike" "electric_bike" "electric_bike" ...
##  $ started_at        : chr  "2022-09-01 08:36:22" "2022-09-01 17:11:29" "2022-09-01 17:15:50" "2022-09-01 09:00:28" ...
##  $ ended_at          : chr  "2022-09-01 08:39:05" "2022-09-01 17:14:45" "2022-09-01 17:16:12" "2022-09-01 09:10:32" ...
##  $ start_station_name: chr  "" "" "" "" ...
##  $ start_station_id  : chr  "" "" "" "" ...
##  $ end_station_name  : chr  "California Ave & Milwaukee Ave" "" "" "" ...
##  $ end_station_id    : chr  "13084" "" "" "" ...
##  $ start_lat         : num  41.9 41.9 41.9 41.9 41.9 ...
##  $ start_lng         : num  -87.7 -87.6 -87.6 -87.7 -87.7 ...
##  $ end_lat           : num  41.9 41.9 41.9 41.9 41.9 ...
##  $ end_lng           : num  -87.7 -87.6 -87.6 -87.7 -87.7 ...
##  $ member_casual     : chr  "casual" "casual" "casual" "casual" ...
str(trip_202210)
## 'data.frame':    558685 obs. of  13 variables:
##  $ ride_id           : chr  "A50255C1E17942AB" "DB692A70BD2DD4E3" "3C02727AAF60F873" "47E653FDC2D99236" ...
##  $ rideable_type     : chr  "classic_bike" "electric_bike" "electric_bike" "electric_bike" ...
##  $ started_at        : chr  "2022-10-14 17:13:30" "2022-10-01 16:29:26" "2022-10-19 18:55:40" "2022-10-31 07:52:36" ...
##  $ ended_at          : chr  "2022-10-14 17:19:39" "2022-10-01 16:49:06" "2022-10-19 19:03:30" "2022-10-31 07:58:49" ...
##  $ start_station_name: chr  "Noble St & Milwaukee Ave" "Damen Ave & Charleston St" "Hoyne Ave & Balmoral Ave" "Rush St & Cedar St" ...
##  $ start_station_id  : chr  "13290" "13288" "655" "KA1504000133" ...
##  $ end_station_name  : chr  "Larrabee St & Division St" "Damen Ave & Cullerton St" "Western Ave & Leland Ave" "Orleans St & Chestnut St (NEXT Apts)" ...
##  $ end_station_id    : chr  "KA1504000079" "13089" "TA1307000140" "620" ...
##  $ start_lat         : num  41.9 41.9 42 41.9 41.9 ...
##  $ start_lng         : num  -87.7 -87.7 -87.7 -87.6 -87.6 ...
##  $ end_lat           : num  41.9 41.9 42 41.9 41.9 ...
##  $ end_lng           : num  -87.6 -87.7 -87.7 -87.6 -87.6 ...
##  $ member_casual     : chr  "member" "casual" "member" "member" ...
str(trip_202211)
## 'data.frame':    337735 obs. of  13 variables:
##  $ ride_id           : chr  "BCC66FC6FAB27CC7" "772AB67E902C180F" "585EAD07FDEC0152" "91C4E7ED3C262FF9" ...
##  $ rideable_type     : chr  "electric_bike" "classic_bike" "classic_bike" "classic_bike" ...
##  $ started_at        : chr  "2022-11-10 06:21:55" "2022-11-04 07:31:55" "2022-11-21 17:20:29" "2022-11-25 17:29:34" ...
##  $ ended_at          : chr  "2022-11-10 06:31:27" "2022-11-04 07:46:25" "2022-11-21 17:34:36" "2022-11-25 17:45:15" ...
##  $ start_station_name: chr  "Canal St & Adams St" "Canal St & Adams St" "Indiana Ave & Roosevelt Rd" "Indiana Ave & Roosevelt Rd" ...
##  $ start_station_id  : chr  "13011" "13011" "SL-005" "SL-005" ...
##  $ end_station_name  : chr  "St. Clair St & Erie St" "St. Clair St & Erie St" "St. Clair St & Erie St" "St. Clair St & Erie St" ...
##  $ end_station_id    : chr  "13016" "13016" "13016" "13016" ...
##  $ start_lat         : num  41.9 41.9 41.9 41.9 41.9 ...
##  $ start_lng         : num  -87.6 -87.6 -87.6 -87.6 -87.6 ...
##  $ end_lat           : num  41.9 41.9 41.9 41.9 41.9 ...
##  $ end_lng           : num  -87.6 -87.6 -87.6 -87.6 -87.6 ...
##  $ member_casual     : chr  "member" "member" "member" "member" ...
str(trip_202212)
## 'data.frame':    181806 obs. of  13 variables:
##  $ ride_id           : chr  "65DBD2F447EC51C2" "0C201AA7EA0EA1AD" "E0B148CCB358A49D" "54C5775D2B7C9188" ...
##  $ rideable_type     : chr  "electric_bike" "classic_bike" "electric_bike" "classic_bike" ...
##  $ started_at        : chr  "2022-12-05 10:47:18" "2022-12-18 06:42:33" "2022-12-13 08:47:45" "2022-12-13 18:50:47" ...
##  $ ended_at          : chr  "2022-12-05 10:56:34" "2022-12-18 07:08:44" "2022-12-13 08:59:51" "2022-12-13 19:19:48" ...
##  $ start_station_name: chr  "Clifton Ave & Armitage Ave" "Broadway & Belmont Ave" "Sangamon St & Lake St" "Shields Ave & 31st St" ...
##  $ start_station_id  : chr  "TA1307000163" "13277" "TA1306000015" "KA1503000038" ...
##  $ end_station_name  : chr  "Sedgwick St & Webster Ave" "Sedgwick St & Webster Ave" "St. Clair St & Erie St" "Damen Ave & Madison St" ...
##  $ end_station_id    : chr  "13191" "13191" "13016" "13134" ...
##  $ start_lat         : num  41.9 41.9 41.9 41.8 41.9 ...
##  $ start_lng         : num  -87.7 -87.6 -87.7 -87.6 -87.7 ...
##  $ end_lat           : num  41.9 41.9 41.9 41.9 41.9 ...
##  $ end_lng           : num  -87.6 -87.6 -87.6 -87.7 -87.7 ...
##  $ member_casual     : chr  "member" "casual" "member" "member" ...

Checking total number of rows

sum(nrow(trip_202201), nrow(trip_202202), nrow(trip_202203), nrow(trip_202204), 
    nrow(trip_202205), nrow(trip_202206), nrow(trip_202207), nrow(trip_202208), 
    nrow(trip_202209), nrow(trip_202210), nrow(trip_202211), nrow(trip_202212))
## [1] 5667717

Combining the monthly datasets

trip_data_2022 <- rbind(trip_202201, trip_202202, trip_202203, trip_202204, 
                        trip_202205, trip_202206, trip_202207, trip_202208, 
                        trip_202209, trip_202210, trip_202211, trip_202212)

Checking total rows of combined datasets

nrow(trip_data_2022)
## [1] 5667717

Checking column names of combined datasets

colnames(trip_data_2022)
##  [1] "ride_id"            "rideable_type"      "started_at"        
##  [4] "ended_at"           "start_station_name" "start_station_id"  
##  [7] "end_station_name"   "end_station_id"     "start_lat"         
## [10] "start_lng"          "end_lat"            "end_lng"           
## [13] "member_casual"

Checking structure of combined datasets

str(trip_data_2022)
## 'data.frame':    5667717 obs. of  13 variables:
##  $ ride_id           : chr  "C2F7DD78E82EC875" "A6CF8980A652D272" "BD0F91DFF741C66D" "CBB80ED419105406" ...
##  $ rideable_type     : chr  "electric_bike" "electric_bike" "classic_bike" "classic_bike" ...
##  $ started_at        : chr  "2022-01-13 11:59:47" "2022-01-10 08:41:56" "2022-01-25 04:53:40" "2022-01-04 00:18:04" ...
##  $ ended_at          : chr  "2022-01-13 12:02:44" "2022-01-10 08:46:17" "2022-01-25 04:58:01" "2022-01-04 00:33:00" ...
##  $ start_station_name: chr  "Glenwood Ave & Touhy Ave" "Glenwood Ave & Touhy Ave" "Sheffield Ave & Fullerton Ave" "Clark St & Bryn Mawr Ave" ...
##  $ start_station_id  : chr  "525" "525" "TA1306000016" "KA1504000151" ...
##  $ end_station_name  : chr  "Clark St & Touhy Ave" "Clark St & Touhy Ave" "Greenview Ave & Fullerton Ave" "Paulina St & Montrose Ave" ...
##  $ end_station_id    : chr  "RP-007" "RP-007" "TA1307000001" "TA1309000021" ...
##  $ start_lat         : num  42 42 41.9 42 41.9 ...
##  $ start_lng         : num  -87.7 -87.7 -87.7 -87.7 -87.6 ...
##  $ end_lat           : num  42 42 41.9 42 41.9 ...
##  $ end_lng           : num  -87.7 -87.7 -87.7 -87.7 -87.6 ...
##  $ member_casual     : chr  "casual" "casual" "member" "casual" ...

Glimpse of combined datasets

glimpse(trip_data_2022)
## Rows: 5,667,717
## Columns: 13
## $ ride_id            <chr> "C2F7DD78E82EC875", "A6CF8980A652D272", "BD0F91DFF7…
## $ rideable_type      <chr> "electric_bike", "electric_bike", "classic_bike", "…
## $ started_at         <chr> "2022-01-13 11:59:47", "2022-01-10 08:41:56", "2022…
## $ ended_at           <chr> "2022-01-13 12:02:44", "2022-01-10 08:46:17", "2022…
## $ start_station_name <chr> "Glenwood Ave & Touhy Ave", "Glenwood Ave & Touhy A…
## $ start_station_id   <chr> "525", "525", "TA1306000016", "KA1504000151", "TA13…
## $ end_station_name   <chr> "Clark St & Touhy Ave", "Clark St & Touhy Ave", "Gr…
## $ end_station_id     <chr> "RP-007", "RP-007", "TA1307000001", "TA1309000021",…
## $ start_lat          <dbl> 42.01280, 42.01276, 41.92560, 41.98359, 41.87785, 4…
## $ start_lng          <dbl> -87.66591, -87.66597, -87.65371, -87.66915, -87.624…
## $ end_lat            <dbl> 42.01256, 42.01256, 41.92533, 41.96151, 41.88462, 4…
## $ end_lng            <dbl> -87.67437, -87.67437, -87.66580, -87.67139, -87.627…
## $ member_casual      <chr> "casual", "casual", "member", "casual", "member", "…

Removing columns start lat, start lng, end lat, end lng (not required for analysis)

trip_data_2022 <- trip_data_2022 %>% 
  select(-c(start_lat, start_lng, end_lat, end_lng))

Removing data where started_at is greater than ended_at

trip_data_2022 <- trip_data_2022 %>% 
  filter(started_at < ended_at)

Removing duplicate data

trip_data_2022 <- distinct(trip_data_2022)

Separate date in date, day, month, year

trip_data_2022$date <- as.Date(trip_data_2022$started_at)
trip_data_2022$month <- format(as.Date(trip_data_2022$date), "%b_%y")
trip_data_2022$day <- format(as.Date(trip_data_2022$date), "%d")
trip_data_2022$week_day <- format(as.Date(trip_data_2022$date), "%A")
trip_data_2022$year <- format(as.Date(trip_data_2022$date), "%Y")

Insert separate column for start time

trip_data_2022$start_hour <- lubridate:: hour(trip_data_2022$started_at)

Calculate ride duration

trip_data_2022$ride_length <- difftime(trip_data_2022$ended_at, trip_data_2022$started_at, units = "mins")

Rename columns member_casual & rideable_type for better understanding

trip_data_2022 <- dplyr::rename(trip_data_2022, customer_type = member_casual, bike_type = rideable_type)

Removing data where ride lenth < 0

clean_trip_data_2022 <- trip_data_2022[!trip_data_2022$ride_length < 0,]

Checking dataset dimension before saving

dim(clean_trip_data_2022)
## [1] 5667186      16

Saving the final dataset into .csv file for analysis

write.csv(clean_trip_data_2022, "D://Data Analysis/Case Study/trip_data_2022/clean_trip_data_2022.csv", row.names = FALSE)

Analysis

Import clean trip data for data analysis

clean_trip_final_2022 <- read.csv("D://Data Analysis/Case Study/trip_data_2022/clean_trip_data_2022.csv")

Order clean trip data wrt month

clean_trip_final_2022$month <- ordered(clean_trip_final_2022$month, 
                                       levels=c("Jan_22", "Feb_22", "Mar_22", "Apr_22", 
                                                "May_22", "Jun_22", "Jul_22", "Aug_22", 
                                                "Sep_22", "Oct_22", "Nov_22", "Dec_22"))

Order clean trip data wrt week days

clean_trip_final_2022$week_day <- ordered(clean_trip_final_2022$week_day, 
                                       levels = c("Sunday", "Monday", "Tuesday", 
                                                  "Wednesday", "Thursday", "Friday", 
                                                  "Saturday"))

Descriptive Analysis

Analysis: min, max, mean, median, sd

View(describe(clean_trip_final_2022$ride_length, fast=TRUE))

Data customer type wise

view(table(clean_trip_final_2022$customer_type))

Total rides for each customer type in minutes

View(setNames(aggregate(ride_length ~ customer_type, clean_trip_final_2022, sum),
              c("customer_type", "total_ride_length(mins)")))

Difference between member and casual riders wrt ride length

View(clean_trip_final_2022 %>% 
       group_by(customer_type) %>% 
       summarise(min_length_mins = min(ride_length), 
                 max_length_mins = max(ride_length),
                 median_length_mins = median(ride_length), 
                 mean_length_mins = mean(ride_length)))

Average ride_length for users by week days and total rides by week days

View(clean_trip_final_2022 %>% 
       group_by(week_day) %>% 
       summarise(avg_length = mean(ride_length),
                 number_of_ride = n()))

Average ride_length by month

view(clean_trip_final_2022 %>% 
       group_by(month) %>% 
       summarise(avg_length = mean(ride_length), number_of_ride = n()))

Average ride_length comparison by each week day wrt customer type

view(aggregate(clean_trip_final_2022$ride_length ~ clean_trip_final_2022$customer_type +
                 clean_trip_final_2022$week_day, FUN = mean))

Average ride_length comparison by each month wrt customer type

view(aggregate(clean_trip_final_2022$ride_length ~ clean_trip_final_2022$customer_type +
                 clean_trip_final_2022$month, FUN = mean))

Analyze ride_length data by customer type and week_day

view(clean_trip_final_2022 %>% 
       group_by(customer_type, week_day) %>% 
       summarise(number_of_ride = n(), avg_duration = mean(ride_length),
     median_duration = median(ride_length), max_duration = max(ride_length),
     min_duration = min(ride_length)))
## `summarise()` has grouped output by 'customer_type'. You can override using the
## `.groups` argument.

Analyze ride_length data by customer type and month

view(clean_trip_final_2022 %>% 
       group_by(customer_type, month) %>% 
       summarise(number_of_ride = n(), avg_duration = mean(ride_length),
                 median_duration = median(ride_length),
                 max_duration = max(ride_length),
                 min_duration = min(ride_length)))
## `summarise()` has grouped output by 'customer_type'. You can override using the
## `.groups` argument.

Save the data for data visualization in .csv format

write.csv(clean_trip_final_2022, 
          "D://Data Analysis/Case Study/trip_data_2022/clean_trip_final_2022_visual.csv",
          row.names = FALSE)

Share

Visualization

Import data for visualization

clean_trip_final_2022_viz <- read.csv("D://Data Analysis/Case Study/trip_data_2022/clean_trip_final_2022_visual.csv")

order clean trip data wrt month

clean_trip_final_2022_viz$month <- ordered(clean_trip_final_2022$month, 
                                       levels=c("Jan_22", "Feb_22", "Mar_22", "Apr_22", 
                                                "May_22", "Jun_22", "Jul_22", "Aug_22", 
                                                "Sep_22", "Oct_22", "Nov_22", "Dec_22"))

order clean trip data wrt week days

clean_trip_final_2022_viz$week_day <- ordered(clean_trip_final_2022$week_day, 
                                       levels = c("Sunday", "Monday", "Tuesday", 
                                                  "Wednesday", "Thursday", "Friday", 
                                                  "Saturday"))

Plot: Number of Rides by Rider Type

clean_trip_final_2022_viz %>% 
  group_by(customer_type, week_day) %>% 
  summarise(number_of_ride = n(),
            avgerage_duration = mean(ride_length))%>% 
  arrange(customer_type, week_day)  %>% 
  ggplot(aes(x = week_day, y = number_of_ride, fill = customer_type)) +
  geom_col(position = "dodge") + 
  scale_y_continuous(labels = function(x) format(x, scientific = FALSE))+
  labs(title = "Number of Rides: Casual vs. Member Riders")
## `summarise()` has grouped output by 'customer_type'. You can override using the
## `.groups` argument.

Plot: Average Duration: Casual vs. Member Riders

clean_trip_final_2022_viz %>% 
  group_by(customer_type, week_day) %>% 
  summarise(number_of_ride = n(),average_duration = mean(ride_length)) %>% 
  arrange(customer_type, week_day)  %>% 
  ggplot(aes(x = week_day, y = average_duration, fill = customer_type)) +
  geom_col(position = "dodge") +
  labs(title = "Average Duration: Casual vs. Member Riders")+
  scale_y_continuous(labels = function(x) format(x, scientific = FALSE))
## `summarise()` has grouped output by 'customer_type'. You can override using the
## `.groups` argument.

Plot: Average number of trips by customer_type vs. month:

clean_trip_final_2022_viz %>%
  group_by(customer_type, month) %>%
  summarise(number_of_rides = n()) %>%
  arrange(customer_type, month) %>%
  ggplot(aes(x = month, y = number_of_rides, fill = customer_type)) +
  geom_col(position = "dodge")+
  labs(title = "Total trips by customer type Vs. Month") +
  scale_y_continuous(labels = function(x) format(x, scientific = FALSE))
## `summarise()` has grouped output by 'customer_type'. You can override using the
## `.groups` argument.

Plot: Difference between customer_type vs ride length

clean_trip_final_2022_viz %>% 
  group_by(customer_type) %>% 
  summarise(mean_length_mins = mean(ride_length)) %>% 
  ggplot(aes(x = customer_type, y = mean_length_mins, fill = customer_type)) +
  geom_col(position = "dodge") + 
  labs(title = "customer_type vs ride length", 
       x= "Customer Type", y = "Average length (mins)") +
  scale_y_continuous(labels = function(x) format(x, scientific = FALSE))

Plot: Bike demand over a day

clean_trip_final_2022_viz %>% 
  group_by(customer_type, start_hour) %>% 
  summarise(number_of_rides=n()) %>% 
  ggplot(aes(x = start_hour, y = number_of_rides, color = customer_type)) +
  geom_line()+
  scale_y_continuous(labels = function(x) format(x, scientific = FALSE))+
  labs(title = "Rides by time of day and user type", 
       x = "Start hour", y = "Number of rides")
## `summarise()` has grouped output by 'customer_type'. You can override using the
## `.groups` argument.

Plot: Number_of_trips by customer_type vs rider_type:

clean_trip_final_2022_viz %>%
  group_by(bike_type, customer_type) %>%
  summarise(number_of_trips = n()) %>%
  ggplot(aes(x= bike_type, y=number_of_trips, fill= customer_type))+
  geom_bar(stat='identity') +
  scale_y_continuous(labels = function(x) format(x, scientific = FALSE)) +
  labs(title ="Ride type Vs. Number of trips", 
       x = "Bike type", y = "Number of trips")
## `summarise()` has grouped output by 'bike_type'. You can override using the
## `.groups` argument.

Plot: Rides by weekday vs bike type

clean_trip_final_2022_viz  %>% 
  group_by(bike_type, week_day) %>% 
  summarise(number_of_rides=n()) %>% 
  ggplot(aes(x=week_day,y=number_of_rides, fill=bike_type)) +
  geom_col(position = "dodge")+
  scale_y_continuous(labels = function(x) format (x, scientific = FALSE))+
  labs(Title = "Rides by weekday and bike type:",
       x = "Week Day", y = "Number. of rides")
## `summarise()` has grouped output by 'bike_type'. You can override using the
## `.groups` argument.

Plot: Average Ride Length per Month: Casual vs. Member

clean_trip_final_2022_viz %>% 
  group_by(customer_type, month) %>% 
  summarise(average_duration = mean(ride_length))%>% 
  arrange(average_duration, month) %>% 
  ggplot(aes(x = month, y = average_duration, group = customer_type)) +
  geom_line(aes(color = customer_type)) +
  labs(title = "Average Trip Duration by Month for Casual Customers",
       x = "Month", y = "Average Duration (minutes)")
## `summarise()` has grouped output by 'customer_type'. You can override using the
## `.groups` argument.

Plot: Number of rides by month and rider type

clean_trip_final_2022_viz %>% 
  group_by(customer_type, month) %>% 
  summarise(number_of_rides = n(),
            average_duration = mean(ride_length)) %>% 
  arrange(customer_type, month)  %>% 
  ggplot(aes(x = month, y = number_of_rides, group = customer_type)) +
  geom_line(aes(color = customer_type)) + 
  scale_y_continuous(labels = function(x) format (x, scientific = FALSE)) +
  labs(title = "Number of rides by month and rider type", 
       x= "Month", y = "Number of rides")
## `summarise()` has grouped output by 'customer_type'. You can override using the
## `.groups` argument.

Act

Key Takeaways :

  1. Annual members use the bike-sharing service primarily for commuting purposes, while casual riders tend to use it for leisure purpose, particularly on weekends and in the summer months.
  2. Annual members exhibit a more consistent usage of the service throughout the week and year, compared to casual riders.
  3. Both annual members and casual riders favor classic bikes over the other two types of bikes offered. Annual members primarily use classic bikes and rarely use docked bikes, while casual riders use all types of bikes.
  4. Casual riders tend to have longer ride duration’s, averaging around 50% longer than annual members.
  5. Casual riders show lower usage of the service during the winter months compared to annual members.

Recommendations :

  1. Increase marketing efforts targeting leisure riders, especially during the summer months and on weekends, in order to increase bike usage and revenue.
  2. Discounts or incentives should be offered for annual members to encourage them to use the bikes more regularly throughout the week and year.
  3. More marketing should be done for the docked bikes due to their less popularity among the casual and member riders.
  4. Offers should be focussed on longer rental periods for casual riders, as their average ride length is longer than annual members, in order to increase revenue.
  5. Increase marketing efforts targeting casual riders during the winter months in order to increase usage and revenue during traditionally slower periods.
  6. Since casual riders take long rides in weekends, weekend membership with discounts or incentives can be provided.