Introduction to my Capstone Project

This Capstone project is the final part of my Google Data Analytics Certification. This project showcases my analytical skills and my thought process through out the project.This case study is based on a fictional bike-share company called Cyclistic.

About the Company

In 2016, Cyclistic launched a successful bike-share offering. Since then, the program has grown to a fleet of 5,824 bicycles that are geotracked and locked into a network of 692 stations across Chicago. The bikes can be unlocked from one station and returned to any other station in the system anytime.

It compromises of two types of users, Casual and members
Cyclistic users are more likely to ride for leisure, but about 30% use them to commute to work each day.

Scenario

I am a junior data analyst working in the marketing analyst team at Cyclistic, a bike-share company in Chicago. The director of marketing believes the company’s future success depends on maximizing the number of annual memberships. Therefore, your team wants to understand how casual riders and annual members use Cyclistic bikes differently. From these insights, your team will design a new marketing strategy to convert casual riders into annual members. But first, Cyclistic executives must approve your recommendations, so they must be backed up with compelling data insights and professional data visualizations.

This casestudy would follow the six stages of the Google Data Analytics; Ask,Prepare,Process,Analyze,Share,Act.

Step 1: Ask

1.1 Three questions will guide the future marketing program:

How do annual members and casual riders use Cyclistic bikes differently?
Why would casual riders buy Cyclistic annual memberships?
How can Cyclistic use digital media to influence casual riders to become members?

1.2 Consider key stakeholders

The director of marketing : responsible for the development of campaigns and initiatives to promote the bike-share program. These may include email, social media, and other channels.
Cyclistic executive team : The notoriously detail-oriented executive team will decide whether to approve the recommended marketing program.

Step 2: Prepare the Data

This step includes identifying and collecting the data from its location and determining its integrity, credibility and accessibility.

2.1 Data Location

Since Cyclistic is a fictional company, real-world data of Divvy, a bike-share company in Chicago has been used.
Data Source link: Divvy Public Data

2.2 Data Organization

Data in this project includes the previous 12 month historical trip from May,2022 to June 2021 with one .csv file for each month -Each .csv file is organized in rows and columns structure with 13 Columns and variable rows

2.3 Data Credibility

Data is a from a standard source, is credible, and free from bias
Data obeys the the ROCCC standard for any data set; Reliable, Original,Comprehensive, Current and Cited

2.4 Licensing, Privacy, Security, and Accessibility

The data has been made available by Motivate International Inc. under this license
The data does not contain any private information of the riders, thereby maintaining their privacy.
The data stands secure in an AWS web portal.
The data is open-source and accessible to all

2.5 Loading Required R Packages

library(tidyverse) #data analysis

## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --

## v ggplot2 3.3.6     v purrr   0.3.4
## v tibble  3.1.7     v dplyr   1.0.9
## v tidyr   1.2.0     v stringr 1.4.0
## v readr   2.1.2     v forcats 0.5.1

## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

library(lubridate) #date and time manipulation

## 
## Attaching package: 'lubridate'

## The following objects are masked from 'package:base':
## 
##     date, intersect, setdiff, union

library(ggplot2) #data visualization
library(janitor) #data cleaning

## 
## Attaching package: 'janitor'

## The following objects are masked from 'package:stats':
## 
##     chisq.test, fisher.test

library(skimr) #data inspection
library(dplyr) #data manipulation

2.6 Getting working direcory

#getwd()

2.7 Collecting Data

may_2022 <- read.csv("202204-divvy-tripdata.csv")
apr_2022 <- read.csv("202203-divvy-tripdata.csv")
march_2022 <- read.csv("202202-divvy-tripdata.csv")
feb_2022 <- read.csv("202201-divvy-tripdata.csv")
jan_2022 <- read.csv("202112-divvy-tripdata.csv")
dec_2021 <- read.csv("202111-divvy-tripdata.csv")
nov_2021 <- read.csv("202110-divvy-tripdata.csv")
oct_2021 <- read.csv("202109-divvy-tripdata.csv")
sep_2021 <- read.csv("202108-divvy-tripdata.csv")
aug_2021 <- read.csv("202107-divvy-tripdata.csv")
july_2021 <- read.csv("202106-divvy-tripdata.csv")
june_2021 <- read.csv("202105-divvy-tripdata.csv")

2.8 Checking for data frame consistency

#I want to check if all columns in the 12 datasets match, this is part of EDA; Exploratory Data Analysis
compare_df_cols_same(may_2022,apr_2022,march_2022,feb_2022,jan_2022,dec_2021,nov_2021,oct_2021,sep_2021,aug_2021,july_2021,june_2021)

## [1] TRUE

2.9 Joining the the datasets together

Annual_trip <- bind_rows(may_2022,apr_2022,march_2022,feb_2022,jan_2022,dec_2021,nov_2021,oct_2021,sep_2021,aug_2021,july_2021,june_2021)

2.10 Inspect the new dataset

skim_without_charts(Annual_trip) #checking the total dataset structure

Data summary
Name	Annual_trip
Number of rows	5757551
Number of columns	13
_______________________
Column type frequency:
character	9
numeric	4
________________________
Group variables	None

Variable type: character

skim_variable	complete_rate	min	max	empty	n_unique
ride_id	1	16	16	0	5757551
rideable_type	1	11	13	0	3
started_at	1	19	19	0	4821094
ended_at	1	19	19	0	4815847
start_station_name	1	0	53	790207	866
start_station_id	1	0	44	790204	857
end_station_name	1	0	53	843361	865
end_station_id	1	0	44	843361	857
member_casual	1	6	6	0	2

Variable type: numeric

skim_variable	n_missing	complete_rate	mean	sd	p0	p25	p50	p75	p100
start_lat	0	1	41.90	0.05	41.64	41.88	41.90	41.93	45.64
start_lng	0	1	-87.65	0.03	-87.84	-87.66	-87.64	-87.63	-73.80
end_lat	4766	1	41.90	0.05	41.39	41.88	41.90	41.93	42.17
end_lng	4766	1	-87.65	0.03	-88.97	-87.66	-87.64	-87.63	-87.49

head(Annual_trip) #checking the first six rows of the data set

##            ride_id rideable_type          started_at            ended_at
## 1 3564070EEFD12711 electric_bike 2022-04-06 17:42:48 2022-04-06 17:54:36
## 2 0B820C7FCF22F489  classic_bike 2022-04-24 19:23:07 2022-04-24 19:43:17
## 3 89EEEE32293F07FF  classic_bike 2022-04-20 19:29:08 2022-04-20 19:35:16
## 4 84D4751AEB31888D  classic_bike 2022-04-22 21:14:06 2022-04-22 21:23:29
## 5 5664BCF0D1DE7A8B electric_bike 2022-04-16 15:56:30 2022-04-16 16:02:11
## 6 AA9EB7BD2E1FC128  classic_bike 2022-04-21 16:52:33 2022-04-21 16:56:51
##            start_station_name start_station_id         end_station_name
## 1      Paulina St & Howard St              515  University Library (NU)
## 2   Wentworth Ave & Cermak Rd            13075    Green St & Madison St
## 3        Halsted St & Polk St     TA1307000121    Green St & Madison St
## 4   Wentworth Ave & Cermak Rd            13075 Delano Ct & Roosevelt Rd
## 5        Halsted St & Polk St     TA1307000121  Clinton St & Madison St
## 6 Desplaines St & Randolph St            15535      Canal St & Adams St
##   end_station_id start_lat start_lng  end_lat   end_lng member_casual
## 1            605  42.01913 -87.67353 42.05294 -87.67345        member
## 2   TA1307000120  41.85308 -87.63193 41.88189 -87.64879        member
## 3   TA1307000120  41.87184 -87.64664 41.88189 -87.64879        member
## 4   KA1706005007  41.85308 -87.63193 41.86749 -87.63219        casual
## 5   TA1305000032  41.87181 -87.64657 41.88224 -87.64107        member
## 6          13011  41.88462 -87.64457 41.87926 -87.63990        member

glimpse(Annual_trip) #checking the summary of the rows

## Rows: 5,757,551
## Columns: 13
## $ ride_id            <chr> "3564070EEFD12711", "0B820C7FCF22F489", "89EEEE3229~
## $ rideable_type      <chr> "electric_bike", "classic_bike", "classic_bike", "c~
## $ started_at         <chr> "2022-04-06 17:42:48", "2022-04-24 19:23:07", "2022~
## $ ended_at           <chr> "2022-04-06 17:54:36", "2022-04-24 19:43:17", "2022~
## $ start_station_name <chr> "Paulina St & Howard St", "Wentworth Ave & Cermak R~
## $ start_station_id   <chr> "515", "13075", "TA1307000121", "13075", "TA1307000~
## $ end_station_name   <chr> "University Library (NU)", "Green St & Madison St",~
## $ end_station_id     <chr> "605", "TA1307000120", "TA1307000120", "KA170600500~
## $ start_lat          <dbl> 42.01913, 41.85308, 41.87184, 41.85308, 41.87181, 4~
## $ start_lng          <dbl> -87.67353, -87.63193, -87.64664, -87.63193, -87.646~
## $ end_lat            <dbl> 42.05294, 41.88189, 41.88189, 41.86749, 41.88224, 4~
## $ end_lng            <dbl> -87.67345, -87.64879, -87.64879, -87.63219, -87.641~
## $ member_casual      <chr> "member", "member", "member", "casual", "member", "~

colnames(Annual_trip) #checking the column names of the data set

##  [1] "ride_id"            "rideable_type"      "started_at"        
##  [4] "ended_at"           "start_station_name" "start_station_id"  
##  [7] "end_station_name"   "end_station_id"     "start_lat"         
## [10] "start_lng"          "end_lat"            "end_lng"           
## [13] "member_casual"

str(Annual_trip) #checking the properties of the columns

## 'data.frame':    5757551 obs. of  13 variables:
##  $ ride_id           : chr  "3564070EEFD12711" "0B820C7FCF22F489" "89EEEE32293F07FF" "84D4751AEB31888D" ...
##  $ rideable_type     : chr  "electric_bike" "classic_bike" "classic_bike" "classic_bike" ...
##  $ started_at        : chr  "2022-04-06 17:42:48" "2022-04-24 19:23:07" "2022-04-20 19:29:08" "2022-04-22 21:14:06" ...
##  $ ended_at          : chr  "2022-04-06 17:54:36" "2022-04-24 19:43:17" "2022-04-20 19:35:16" "2022-04-22 21:23:29" ...
##  $ start_station_name: chr  "Paulina St & Howard St" "Wentworth Ave & Cermak Rd" "Halsted St & Polk St" "Wentworth Ave & Cermak Rd" ...
##  $ start_station_id  : chr  "515" "13075" "TA1307000121" "13075" ...
##  $ end_station_name  : chr  "University Library (NU)" "Green St & Madison St" "Green St & Madison St" "Delano Ct & Roosevelt Rd" ...
##  $ end_station_id    : chr  "605" "TA1307000120" "TA1307000120" "KA1706005007" ...
##  $ start_lat         : num  42 41.9 41.9 41.9 41.9 ...
##  $ start_lng         : num  -87.7 -87.6 -87.6 -87.6 -87.6 ...
##  $ end_lat           : num  42.1 41.9 41.9 41.9 41.9 ...
##  $ end_lng           : num  -87.7 -87.6 -87.6 -87.6 -87.6 ...
##  $ member_casual     : chr  "member" "member" "member" "casual" ...

2.11 Observations:

Total Rows: 5757551, Total Columns:13
Rows represents the total number of rides
No duplicates
Three different types of rideable_type; Classic, electric and docked bike
Two distinct users; Casual and Member

2.12 Data Issues

Some of the columns have missing values; end_lat and end_lng
The column started_at and ended_at have dates and time but are expressed as a ‘chr’ property, that would be worked on
No difference in time a ride started to when it ended, a column showing that would be created.
To properly analyze the data we would need to aggregate it, showing days hours,days,month of ride.

2.13 Data that would address business task

the type of bike
trip length
weekday of trip
time of trip
trip start station
trip end station
monthly trends
weekly trends

Step3: Process the Data

This involves the Data Wrangling,to aid further analysis

The Annual_trip data set has 5.7million Rows, therefore Rstudio would be employed for the cleaning and analysis of this data set.
Tableau would be used for the dashboard creation.

3.1 Data Manipulation

Configuration of the started_at and ended_at column

Annual_trip$started_at <- as.POSIXct(Annual_trip$started_at)
Annual_trip$ended_at <- as.POSIXct(Annual_trip$ended_at)

Confirmation:

str(Annual_trip)

## 'data.frame':    5757551 obs. of  13 variables:
##  $ ride_id           : chr  "3564070EEFD12711" "0B820C7FCF22F489" "89EEEE32293F07FF" "84D4751AEB31888D" ...
##  $ rideable_type     : chr  "electric_bike" "classic_bike" "classic_bike" "classic_bike" ...
##  $ started_at        : POSIXct, format: "2022-04-06 17:42:48" "2022-04-24 19:23:07" ...
##  $ ended_at          : POSIXct, format: "2022-04-06 17:54:36" "2022-04-24 19:43:17" ...
##  $ start_station_name: chr  "Paulina St & Howard St" "Wentworth Ave & Cermak Rd" "Halsted St & Polk St" "Wentworth Ave & Cermak Rd" ...
##  $ start_station_id  : chr  "515" "13075" "TA1307000121" "13075" ...
##  $ end_station_name  : chr  "University Library (NU)" "Green St & Madison St" "Green St & Madison St" "Delano Ct & Roosevelt Rd" ...
##  $ end_station_id    : chr  "605" "TA1307000120" "TA1307000120" "KA1706005007" ...
##  $ start_lat         : num  42 41.9 41.9 41.9 41.9 ...
##  $ start_lng         : num  -87.7 -87.6 -87.6 -87.6 -87.6 ...
##  $ end_lat           : num  42.1 41.9 41.9 41.9 41.9 ...
##  $ end_lng           : num  -87.7 -87.6 -87.6 -87.6 -87.6 ...
##  $ member_casual     : chr  "member" "member" "member" "casual" ...

Renaming Column

Annual_trip <- rename(Annual_trip, "user"="member_casual")

Generating Ride length Column

Annual_trip$ride_length <- difftime(Annual_trip$ended_at,Annual_trip$started_at)

Generating column for date, day,week and month of ride

Annual_trip$date <- as.Date(Annual_trip$started_at)  #generating date of ride column
Annual_trip$day <- format(as.Date(Annual_trip$date), "%d")  #generating day column
Annual_trip$weekday <- format(as.Date(Annual_trip$date),"%a")  #generating week day column
Annual_trip$month <- format(as.Date(Annual_trip$date), "%b")  #generating month column
Annual_trip$year <- format(as.Date(Annual_trip$date), "%Y")  #generating year column

Adding hour and time of day columns

Annual_trip$hour <- lubridate::hour(Annual_trip$started_at)  #generating hour each trip started
Annual_trip <- Annual_trip %>%
  mutate(time_of_day=case_when(hour>=5 & hour<12 ~ "Morning",hour>=12 & hour<16 ~ "Afternoon",hour>=16 & hour<21 ~ "Evening",hour<5 | hour>=21 ~ "Night"))

3.2 Inspect the dataset

skim_without_charts(Annual_trip) #generating summary of the process

Data summary
Name	Annual_trip
Number of rows	5757551
Number of columns	21
_______________________
Column type frequency:
character	12
Date	1
difftime	1
numeric	5
POSIXct	2
________________________
Group variables	None

Variable type: character

skim_variable	complete_rate	min	max	empty	n_unique
ride_id	1	16	16	0	5757551
rideable_type	1	11	13	0	3
start_station_name	1	0	53	790207	866
start_station_id	1	0	44	790204	857
end_station_name	1	0	53	843361	865
end_station_id	1	0	44	843361	857
user	1	6	6	0	2
day	1	2	2	0	31
weekday	1	3	3	0	7
month	1	3	3	0	12
year	1	4	4	0	2
time_of_day	1	5	9	0	4

Variable type: Date

skim_variable	n_missing	complete_rate	min	max	median	n_unique
date	0	1	2021-04-30	2022-04-30	2021-08-31	366

Variable type: difftime

skim_variable	n_missing	complete_rate	min	max	median	n_unique
ride_length	0	1	-3482 secs	3356649 secs	691 secs	25358

Variable type: numeric

skim_variable	n_missing	complete_rate	mean	sd	p0	p25	p50	p75	p100
start_lat	0	1	41.90	0.05	41.64	41.88	41.90	41.93	45.64
start_lng	0	1	-87.65	0.03	-87.84	-87.66	-87.64	-87.63	-73.80
end_lat	4766	1	41.90	0.05	41.39	41.88	41.90	41.93	42.17
end_lng	4766	1	-87.65	0.03	-88.97	-87.66	-87.64	-87.63	-87.49
hour	0	1	14.20	5.07	0.00	11.00	15.00	18.00	23.00

Variable type: POSIXct

skim_variable	n_missing	complete_rate	min	max	median	n_unique
started_at	0	1	2021-05-01 00:00:11	2022-04-30 23:59:54	2021-08-31 17:17:20	4821094
ended_at	0	1	2021-05-01 00:03:26	2022-05-02 00:35:01	2021-08-31 17:34:09	4815847

3.2.1 Observations

The minimum value of the ride_length is negative, which is impossible
The maximum value of the ride_length is 3356649 secs (932.4 hrs)
Ride_length is in sec, we need to convert it to numeric to make calculations

3.3 Converting ride_length to numeric

Annual_trip$ride_length <- as.numeric(Annual_trip$ride_length)

Check if it is numeric

is.numeric(Annual_trip$ride_length)

## [1] TRUE

3.4 Data Cleaning

Ride with negative values are considered invalid, since rides are counted with respect to time.
Company website stated that rides less than 60 seconds are invalid as it was potentially customers docking their bike or a false start link to website
Rides greater than 24 hrs are also considered as outliers, because they are grossly inflated.
Rides with empty end_lat, and end_lng are considered invalid because the rides were not ended in a proper way.

3.4.1 Removing Ride length less than 60 sec and greater than 24hrs

Annual_trip_2 <- Annual_trip %>%
  filter(!(ride_length<60)) %>%
  filter(!(ride_length>86400)) #24hrs = 86400 secs

3.4.2 Checking Minimum and Maximum ride

min(Annual_trip_2$ride_length)

## [1] 60

max(Annual_trip_2$ride_length)

## [1] 86397

3.4.3 Removing empty end_lat and end_lng

Annual_trip_3 <- Annual_trip_2 %>%
  filter(!(is.na(end_lng)) | !(is.na(end_lat)))

3.4.4 Inspecting the dataset

skim_without_charts(Annual_trip_3)

Data summary
Name	Annual_trip_3
Number of rows	5659755
Number of columns	21
_______________________
Column type frequency:
character	12
Date	1
numeric	6
POSIXct	2
________________________
Group variables	None

Variable type: character

skim_variable	complete_rate	min	max	empty	n_unique
ride_id	1	16	16	0	5659755
rideable_type	1	11	13	0	3
start_station_name	1	0	53	766171	865
start_station_id	1	0	44	766168	856
end_station_name	1	0	53	810022	863
end_station_id	1	0	44	810022	855
user	1	6	6	0	2
day	1	2	2	0	31
weekday	1	3	3	0	7
month	1	3	3	0	12
year	1	4	4	0	2
time_of_day	1	5	9	0	4

Variable type: Date

skim_variable	n_missing	complete_rate	min	max	median	n_unique
date	0	1	2021-04-30	2022-04-30	2021-08-31	366

Variable type: numeric

skim_variable	complete_rate	mean	sd	p0	p25	p50	p75	p100
start_lat	1	41.90	0.05	41.64	41.88	41.90	41.93	45.64
start_lng	1	-87.65	0.03	-87.84	-87.66	-87.64	-87.63	-73.80
end_lat	1	41.90	0.05	41.39	41.88	41.90	41.93	42.17
end_lng	1	-87.65	0.03	-88.97	-87.66	-87.64	-87.63	-87.49
ride_length	1	1136.45	2057.57	60.00	399.00	702.00	1269.00	86391.00
hour	1	14.20	5.07	0.00	11.00	15.00	18.00	23.00

Variable type: POSIXct

skim_variable	n_missing	complete_rate	min	max	median	n_unique
started_at	0	1	2021-05-01 00:00:11	2022-04-30 23:59:52	2021-08-31 16:47:11	4751555
ended_at	0	1	2021-05-01 00:03:26	2022-05-01 16:16:59	2021-08-31 17:03:58	4746038

3.5 Getting summary of the ride length with fitered NA for start_station_name and end_station_name

Annual_trip_3 <- Annual_trip_3 %>%
  filter(!(is.na(start_station_name)) | !(is.na(end_station_name))) 
summary(Annual_trip_3$ride_length)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##      60     399     702    1136    1269   86391

3.6 Data Check

No duplicates
All outliers and grossly inflated ride_lengths have been removed.
The data is not outdated, it is current, hence having the recent 12 months of data set.
All the 12 data sets have been combined together to form a bigger singular set.
The Singular dataset is Reliable, Original, Comprehensive, Current and Cited.
Data is set to answer the business questions.

3.7 Data Documentation

This R markdown gives a detailed documentation of the overall process using code chunks for it’s analysis

Step 4: Analysis and Visualization

Summary of the Ride length

summary(Annual_trip_3$ride_length)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##      60     399     702    1136    1269   86391

Observations

Minimum ride_length is 60 seconds for all rides
Average ride_length is 1136 seconds (18.93 mins) for all rides _ Maximum ride_length is 86391 seconds (23.9hrs) for all rides

4.1 Brief rundown of all the process

summary(Annual_trip_3)

##    ride_id          rideable_type        started_at                 
##  Length:5659755     Length:5659755     Min.   :2021-05-01 00:00:11  
##  Class :character   Class :character   1st Qu.:2021-07-07 15:23:51  
##  Mode  :character   Mode  :character   Median :2021-08-31 16:47:11  
##                                        Mean   :2021-09-18 16:06:55  
##                                        3rd Qu.:2021-11-03 17:50:56  
##                                        Max.   :2022-04-30 23:59:52  
##     ended_at                   start_station_name start_station_id  
##  Min.   :2021-05-01 00:03:26   Length:5659755     Length:5659755    
##  1st Qu.:2021-07-07 15:45:37   Class :character   Class :character  
##  Median :2021-08-31 17:03:58   Mode  :character   Mode  :character  
##  Mean   :2021-09-18 16:25:52                                        
##  3rd Qu.:2021-11-03 18:03:08                                        
##  Max.   :2022-05-01 16:16:59                                        
##  end_station_name   end_station_id       start_lat       start_lng     
##  Length:5659755     Length:5659755     Min.   :41.64   Min.   :-87.84  
##  Class :character   Class :character   1st Qu.:41.88   1st Qu.:-87.66  
##  Mode  :character   Mode  :character   Median :41.90   Median :-87.64  
##                                        Mean   :41.90   Mean   :-87.65  
##                                        3rd Qu.:41.93   3rd Qu.:-87.63  
##                                        Max.   :45.64   Max.   :-73.80  
##     end_lat         end_lng           user            ride_length   
##  Min.   :41.39   Min.   :-88.97   Length:5659755     Min.   :   60  
##  1st Qu.:41.88   1st Qu.:-87.66   Class :character   1st Qu.:  399  
##  Median :41.90   Median :-87.64   Mode  :character   Median :  702  
##  Mean   :41.90   Mean   :-87.65                      Mean   : 1136  
##  3rd Qu.:41.93   3rd Qu.:-87.63                      3rd Qu.: 1269  
##  Max.   :42.17   Max.   :-87.49                      Max.   :86391  
##       date                day              weekday             month          
##  Min.   :2021-04-30   Length:5659755     Length:5659755     Length:5659755    
##  1st Qu.:2021-07-07   Class :character   Class :character   Class :character  
##  Median :2021-08-31   Mode  :character   Mode  :character   Mode  :character  
##  Mean   :2021-09-18                                                           
##  3rd Qu.:2021-11-03                                                           
##  Max.   :2022-04-30                                                           
##      year                hour      time_of_day       
##  Length:5659755     Min.   : 0.0   Length:5659755    
##  Class :character   1st Qu.:11.0   Class :character  
##  Mode  :character   Median :15.0   Mode  :character  
##                     Mean   :14.2                     
##                     3rd Qu.:18.0                     
##                     Max.   :23.0

glimpse(Annual_trip_3)

## Rows: 5,659,755
## Columns: 21
## $ ride_id            <chr> "3564070EEFD12711", "0B820C7FCF22F489", "89EEEE3229~
## $ rideable_type      <chr> "electric_bike", "classic_bike", "classic_bike", "c~
## $ started_at         <dttm> 2022-04-06 17:42:48, 2022-04-24 19:23:07, 2022-04-~
## $ ended_at           <dttm> 2022-04-06 17:54:36, 2022-04-24 19:43:17, 2022-04-~
## $ start_station_name <chr> "Paulina St & Howard St", "Wentworth Ave & Cermak R~
## $ start_station_id   <chr> "515", "13075", "TA1307000121", "13075", "TA1307000~
## $ end_station_name   <chr> "University Library (NU)", "Green St & Madison St",~
## $ end_station_id     <chr> "605", "TA1307000120", "TA1307000120", "KA170600500~
## $ start_lat          <dbl> 42.01913, 41.85308, 41.87184, 41.85308, 41.87181, 4~
## $ start_lng          <dbl> -87.67353, -87.63193, -87.64664, -87.63193, -87.646~
## $ end_lat            <dbl> 42.05294, 41.88189, 41.88189, 41.86749, 41.88224, 4~
## $ end_lng            <dbl> -87.67345, -87.64879, -87.64879, -87.63219, -87.641~
## $ user               <chr> "member", "member", "member", "casual", "member", "~
## $ ride_length        <dbl> 708, 1210, 368, 563, 341, 258, 278, 744, 89, 826, 2~
## $ date               <date> 2022-04-06, 2022-04-24, 2022-04-20, 2022-04-22, 20~
## $ day                <chr> "06", "24", "20", "22", "16", "21", "04", "05", "29~
## $ weekday            <chr> "Wed", "Sun", "Wed", "Fri", "Sat", "Thu", "Mon", "T~
## $ month              <chr> "Apr", "Apr", "Apr", "Apr", "Apr", "Apr", "Apr", "A~
## $ year               <chr> "2022", "2022", "2022", "2022", "2022", "2022", "20~
## $ hour               <int> 17, 19, 19, 21, 15, 16, 17, 8, 23, 15, 8, 13, 11, 2~
## $ time_of_day        <chr> "Evening", "Evening", "Evening", "Night", "Afternoo~

4.2 Analyzing users with respect to ride_length

Annual_trip_3 %>%
  group_by(user) %>%
  summarise(average_ride_length=mean(ride_length),median_ride_length=median(ride_length),min_ride_length=min(ride_length),max_ride_length=max(ride_length))

## # A tibble: 2 x 5
##   user   average_ride_length median_ride_length min_ride_length max_ride_length
##   <chr>                <dbl>              <dbl>           <dbl>           <dbl>
## 1 casual               1581.                947              60           86391
## 2 member                786.                561              60           86128

Visualize the Average ride_length

Annual_trip_3 %>%
  group_by(user) %>%
  summarise(average_ride_length=mean(ride_length)) %>%
  ggplot() + geom_bar(mapping=aes(x=user,y=average_ride_length,fill=user),stat='identity',width=0.3) + labs(title="Ride length average for each user",x="USER TYPE",y="AVERAGE RIDE LENGTH") + coord_flip() + scale_fill_manual(values=c("Orange","Blue"))

4.2.1 Observations

The average ride length for casual is 1581sec(26.35min) and that for member is 786sec(13.1min)
Average ride length for casual is twice that of members

4.3 Analyzing the total number of rides with respect to user

Annual_trip_3 %>%
  group_by(user)%>%
  summarise(number_of_rides=n())

## # A tibble: 2 x 2
##   user   number_of_rides
##   <chr>            <int>
## 1 casual         2495672
## 2 member         3164083

Visualizing the total rides with respect to bike type

Annual_trip_3 %>%
  group_by(user,rideable_type)%>%
  summarise(number_of_ride=n())%>%
  ggplot() + geom_bar(mapping=aes(x=user,y=number_of_ride,fill=rideable_type),stat='identity',width=0.4) + labs(title="Total No. of ride with respect to bike type",x="USER TYPE",y="TOTAL NO OF RIDES") + scale_fill_brewer(palette="Dark2")

## `summarise()` has grouped output by 'user'. You can override using the
## `.groups` argument.

4.3.1 Observations

Classic bik is the most used bike type
docked bike is only used by casual members
Total number of Member rides is larger than Caual ride

4.4 Analyzing Monthly rides with respect to each user

Annual_trip_3 <- Annual_trip_3%>%
  mutate(month=factor(month,levels=c("Jun","Jul","Aug","Sep","Oct","Nov","Dec","Jan","Feb","Mar","Apr","May")))

Annual_trip_3%>%
  group_by(user,month)%>%
  summarise(number_of_ride=n())%>%
  arrange(user)

## `summarise()` has grouped output by 'user'. You can override using the
## `.groups` argument.

## # A tibble: 24 x 3
## # Groups:   user [2]
##    user   month number_of_ride
##    <chr>  <fct>          <int>
##  1 casual Jun           364315
##  2 casual Jul           435611
##  3 casual Aug           406314
##  4 casual Sep           358711
##  5 casual Oct           253260
##  6 casual Nov           105161
##  7 casual Dec            68690
##  8 casual Jan            17960
##  9 casual Feb            20925
## 10 casual Mar            88068
## # ... with 14 more rows

Visualizing Monthly Rides

Annual_trip_3 %>%
  group_by(user,month)%>%
  summarise(number_of_ride=n())%>%
  ggplot(aes(x=month,y=number_of_ride,group=user,color=user)) + geom_line(size=1.0) + geom_point(size=2.0) + labs(title="Monthly rides for Users",x="Month",y="Number of Rides") + scale_color_manual(values=c("Green","Blue"))

## `summarise()` has grouped output by 'user'. You can override using the
## `.groups` argument.

4.4.1 Observations

Casual rides are significantly higher from June to August but are relatively the lowest from Dec to Feb.
Member Rides maintain a significant balance from June to Oct but decreases rapidly to February, although not as low as Casual rides.

4.5 Analyzing weekly rides with respect to each user

Annual_trip_3 <- Annual_trip_3%>%
  mutate(weekday=factor(weekday, levels=c("Mon","Tue","Wed","Thu","Fri","Sat","Sun")))

Annual_trip_3%>%
  group_by(user,weekday)%>%
  summarise(number_of_ride=n())%>%
  arrange(user)

## `summarise()` has grouped output by 'user'. You can override using the
## `.groups` argument.

## # A tibble: 14 x 3
## # Groups:   user [2]
##    user   weekday number_of_ride
##    <chr>  <fct>            <int>
##  1 casual Mon             282629
##  2 casual Tue             266664
##  3 casual Wed             280917
##  4 casual Thu             295797
##  5 casual Fri             360045
##  6 casual Sat             553326
##  7 casual Sun             456294
##  8 member Mon             437391
##  9 member Tue             490862
## 10 member Wed             498813
## 11 member Thu             479008
## 12 member Fri             449561
## 13 member Sat             434910
## 14 member Sun             373538

Visualizing Weekly Total Rides with respect to each user

Annual_trip_3%>%
  group_by(user,weekday)%>%
  summarise(number_of_ride=n())%>%
  ggplot() + geom_bar(mapping=aes(x=weekday,y=number_of_ride,fill=user),stat='identity',width=0.2) + labs(title="Total Weekly Ride for Users",x="Weekday",y="Number of Rides") + scale_fill_manual(values=c("Orange","Blue"))

## `summarise()` has grouped output by 'user'. You can override using the
## `.groups` argument.

4.5.1 Observations

Casual Rides are higher than member rides during weekends.
Member rides are higher during weekdays, which means members use this as a medium to commute to work.

4.6 Analyzing Average weekly rides per user

Annual_trip_3 %>%
  group_by(user,weekday)%>%
  summarise(Average_ride=mean(ride_length))%>%
  arrange(user)

## `summarise()` has grouped output by 'user'. You can override using the
## `.groups` argument.

## # A tibble: 14 x 3
## # Groups:   user [2]
##    user   weekday Average_ride
##    <chr>  <fct>          <dbl>
##  1 casual Mon            1594.
##  2 casual Tue            1392.
##  3 casual Wed            1374.
##  4 casual Thu            1379.
##  5 casual Fri            1474.
##  6 casual Sat            1734.
##  7 casual Sun            1839.
##  8 member Mon             758.
##  9 member Tue             734.
## 10 member Wed             747.
## 11 member Thu             743.
## 12 member Fri             771.
## 13 member Sat             884.
## 14 member Sun             899.

Visualizing the Average Weekly Ride

Annual_trip_3%>%
  group_by(user,weekday)%>%
  summarise(Average_ride=mean(ride_length))%>%
  ggplot() + geom_bar(mapping=aes(x=weekday,y=Average_ride,fill=user),stat='identity',width=0.2) + labs(title="Average Weekly Ride for Users",x="Weekday",y="Average Rides") + scale_fill_brewer(palette="Set2")

## `summarise()` has grouped output by 'user'. You can override using the
## `.groups` argument.

4.6.2 Observation

Average casual ride is relatively higher on weekends.
Average ride for members is relatively constant during weekdays.

4.7 Analyzing time of ride with respect to each user

Annual_trip_3 <- Annual_trip_3%>%
  mutate(time_of_day=factor(time_of_day,levels=c("Morning","Afternoon","Evening","Night")))

Annual_trip_3%>%
  group_by(user,time_of_day)%>%
  summarise(number_of_ride=n())%>%
  arrange(user)

## `summarise()` has grouped output by 'user'. You can override using the
## `.groups` argument.

## # A tibble: 8 x 3
## # Groups:   user [2]
##   user   time_of_day number_of_ride
##   <chr>  <fct>                <int>
## 1 casual Morning             467741
## 2 casual Afternoon           688578
## 3 casual Evening             926619
## 4 casual Night               412734
## 5 member Morning             894178
## 6 member Afternoon           745771
## 7 member Evening            1205045
## 8 member Night               319089

Visualizing the time of day ride with respect to user

Annual_trip_3%>%
  group_by(user,time_of_day)%>%
  summarise(number_of_ride=n())%>%
  ggplot(aes(x=time_of_day,y=number_of_ride,group=user,color=user))+geom_line(size=1.1)+geom_point(size=2.2)+labs(title="Time of Day Ride took place",x="Time Period",y="Number of Ride",caption="Morning:5AM-12PM, Afternoon:12PM-4PM,Evening:4PM-9PM,Night:9PM-5AM") + scale_color_manual(values=c("Orange","Blue"))

## `summarise()` has grouped output by 'user'. You can override using the
## `.groups` argument.

4.7.1 Observations

Afternoon; 12PM-4PM is the most active time for both casual and member users
Member user are significantly higher than Casual rider during the active time

4.8 Analyzing top three Ride Start Stations for Member Users

Annual_trip_3%>%
  filter(!(is.na(start_station_name)))%>%   #filter out missing 
  filter(user=="member")%>%
  group_by(start_station_name)%>%
  summarise(number_of_ride=n())%>%
  arrange(-number_of_ride)%>%
  top_n(4)

## Selecting by number_of_ride

## # A tibble: 4 x 2
##   start_station_name         number_of_ride
##   <chr>                               <int>
## 1 ""                                 433663
## 2 "Kingsbury St & Kinzie St"          25292
## 3 "Clark St & Elm St"                 24429
## 4 "Wells St & Concord Ln"             23690

4.9 Analyzing top three Ride Start Stations for Casual Users

Annual_trip_3%>%
  filter(!(is.na(start_station_name)))%>%  #filter out missing
  filter(user=="casual")%>%
  group_by(start_station_name)%>%
  summarise(number_of_ride=n())%>%
  arrange(-number_of_ride)%>%
  top_n(4)

## Selecting by number_of_ride

## # A tibble: 4 x 2
##   start_station_name        number_of_ride
##   <chr>                              <int>
## 1 ""                                332508
## 2 "Streeter Dr & Grand Ave"          67118
## 3 "Millennium Park"                  32147
## 4 "Michigan Ave & Oak St"            28672

4.10 Analyzing top three End Ride Stations for Member Users

Annual_trip_3%>%
  filter(!(is.na(end_station_name)))%>%  #filter out missing
  filter(user=="member")%>%
  group_by(end_station_name)%>%
  summarise(number_of_ride=n())%>%
  arrange(-number_of_ride)%>%
  top_n(4)

## Selecting by number_of_ride

## # A tibble: 4 x 2
##   end_station_name           number_of_ride
##   <chr>                               <int>
## 1 ""                                 431268
## 2 "Kingsbury St & Kinzie St"          25154
## 3 "Clark St & Elm St"                 24438
## 4 "Wells St & Concord Ln"             24316

4.11 Analyzing top three End Ride stations for Casual Users

Annual_trip_3%>%
  filter(!(is.na(end_station_name)))%>%
  filter(user=="casual")%>%
  group_by(end_station_name)%>%
  summarise(number_of_ride=n())%>%
  arrange(-number_of_ride)%>%
  top_n(4)

## Selecting by number_of_ride

## # A tibble: 4 x 2
##   end_station_name          number_of_ride
##   <chr>                              <int>
## 1 ""                                378754
## 2 "Streeter Dr & Grand Ave"          69695
## 3 "Millennium Park"                  33202
## 4 "Michigan Ave & Oak St"            30010

Step5: Exportation of Data

5.1 Creating a csv file to visualize with

write.csv(Annual_trip_3,file = "Annual_trip_3.csv")

Step6: Act

This involves a conclusive statement from the Analysis done to aid Marketing startegy

6.1 Summation of Analysis

The Average ride length for casual is 26 mins while that of member is half of that 13 mins
Casual members are more active on weekends while members are more active on weekdays
Casual Members are more active from June to August with its highest in July but are very low from December to February

6.2 Recommendation from My Analysis

For successful marketing campaign, promotion should be done from June, July and August for casual members
Timing is also a factor; since Afternoon 12PM to 5PM is the most active time for casual riders, this would be the perfect time to push marketing campaign.
Location also matters; Casual Riders have the same top locations for both start and end locations; it would be very effective to target these top locations for marketing.

Conclusion

This wraps up my study on the Capstone Project on my Google Data Analytics Certification, helped in exploring the wonderful tools and packages that comes with R for effective Data Analysis.

Capstone Project

Emmanuel Azi-love

2022-05-27

Introduction to my Capstone Project

About the Company

Scenario

Step 1: Ask

1.1 Three questions will guide the future marketing program:

1.2 Consider key stakeholders

Step 2: Prepare the Data

2.1 Data Location

2.2 Data Organization

2.3 Data Credibility

2.4 Licensing, Privacy, Security, and Accessibility

2.5 Loading Required R Packages

2.6 Getting working direcory

2.7 Collecting Data

2.8 Checking for data frame consistency

2.9 Joining the the datasets together

2.10 Inspect the new dataset

2.11 Observations:

2.12 Data Issues

2.13 Data that would address business task

Step3: Process the Data

3.1 Data Manipulation

Configuration of the started_at and ended_at column

Renaming Column

Generating Ride length Column

Generating column for date, day,week and month of ride

Adding hour and time of day columns

3.2 Inspect the dataset

3.2.1 Observations

3.3 Converting ride_length to numeric

3.4 Data Cleaning

3.4.1 Removing Ride length less than 60 sec and greater than 24hrs

3.4.2 Checking Minimum and Maximum ride

3.4.3 Removing empty end_lat and end_lng

3.4.4 Inspecting the dataset

3.5 Getting summary of the ride length with fitered NA for start_station_name and end_station_name

3.6 Data Check

3.7 Data Documentation

Step 4: Analysis and Visualization

4.1 Brief rundown of all the process

4.2 Analyzing users with respect to ride_length

4.2.1 Observations

4.3 Analyzing the total number of rides with respect to user

4.3.1 Observations

4.4 Analyzing Monthly rides with respect to each user

4.4.1 Observations

4.5 Analyzing weekly rides with respect to each user

4.5.1 Observations

4.6 Analyzing Average weekly rides per user

4.6.2 Observation

4.7 Analyzing time of ride with respect to each user

4.7.1 Observations

4.8 Analyzing top three Ride Start Stations for Member Users

4.9 Analyzing top three Ride Start Stations for Casual Users

4.10 Analyzing top three End Ride Stations for Member Users

4.11 Analyzing top three End Ride stations for Casual Users

Step5: Exportation of Data

5.1 Creating a csv file to visualize with

Step6: Act

6.1 Summation of Analysis

6.2 Recommendation from My Analysis

Conclusion