This Capstone project is the final part of my Google Data Analytics Certification. This project showcases my analytical skills and my thought process through out the project.This case study is based on a fictional bike-share company called Cyclistic.
In 2016, Cyclistic launched a successful bike-share offering. Since then, the program has grown to a fleet of 5,824 bicycles that are geotracked and locked into a network of 692 stations across Chicago. The bikes can be unlocked from one station and returned to any other station in the system anytime.
I am a junior data analyst working in the marketing analyst team at Cyclistic, a bike-share company in Chicago. The director of marketing believes the company’s future success depends on maximizing the number of annual memberships. Therefore, your team wants to understand how casual riders and annual members use Cyclistic bikes differently. From these insights, your team will design a new marketing strategy to convert casual riders into annual members. But first, Cyclistic executives must approve your recommendations, so they must be backed up with compelling data insights and professional data visualizations.
This step includes identifying and collecting the data from its location and determining its integrity, credibility and accessibility.
library(tidyverse) #data analysis
## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --
## v ggplot2 3.3.6 v purrr 0.3.4
## v tibble 3.1.7 v dplyr 1.0.9
## v tidyr 1.2.0 v stringr 1.4.0
## v readr 2.1.2 v forcats 0.5.1
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(lubridate) #date and time manipulation
##
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
##
## date, intersect, setdiff, union
library(ggplot2) #data visualization
library(janitor) #data cleaning
##
## Attaching package: 'janitor'
## The following objects are masked from 'package:stats':
##
## chisq.test, fisher.test
library(skimr) #data inspection
library(dplyr) #data manipulation
#getwd()
may_2022 <- read.csv("202204-divvy-tripdata.csv")
apr_2022 <- read.csv("202203-divvy-tripdata.csv")
march_2022 <- read.csv("202202-divvy-tripdata.csv")
feb_2022 <- read.csv("202201-divvy-tripdata.csv")
jan_2022 <- read.csv("202112-divvy-tripdata.csv")
dec_2021 <- read.csv("202111-divvy-tripdata.csv")
nov_2021 <- read.csv("202110-divvy-tripdata.csv")
oct_2021 <- read.csv("202109-divvy-tripdata.csv")
sep_2021 <- read.csv("202108-divvy-tripdata.csv")
aug_2021 <- read.csv("202107-divvy-tripdata.csv")
july_2021 <- read.csv("202106-divvy-tripdata.csv")
june_2021 <- read.csv("202105-divvy-tripdata.csv")
#I want to check if all columns in the 12 datasets match, this is part of EDA; Exploratory Data Analysis
compare_df_cols_same(may_2022,apr_2022,march_2022,feb_2022,jan_2022,dec_2021,nov_2021,oct_2021,sep_2021,aug_2021,july_2021,june_2021)
## [1] TRUE
Annual_trip <- bind_rows(may_2022,apr_2022,march_2022,feb_2022,jan_2022,dec_2021,nov_2021,oct_2021,sep_2021,aug_2021,july_2021,june_2021)
skim_without_charts(Annual_trip) #checking the total dataset structure
| Name | Annual_trip |
| Number of rows | 5757551 |
| Number of columns | 13 |
| _______________________ | |
| Column type frequency: | |
| character | 9 |
| numeric | 4 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| ride_id | 0 | 1 | 16 | 16 | 0 | 5757551 | 0 |
| rideable_type | 0 | 1 | 11 | 13 | 0 | 3 | 0 |
| started_at | 0 | 1 | 19 | 19 | 0 | 4821094 | 0 |
| ended_at | 0 | 1 | 19 | 19 | 0 | 4815847 | 0 |
| start_station_name | 0 | 1 | 0 | 53 | 790207 | 866 | 0 |
| start_station_id | 0 | 1 | 0 | 44 | 790204 | 857 | 0 |
| end_station_name | 0 | 1 | 0 | 53 | 843361 | 865 | 0 |
| end_station_id | 0 | 1 | 0 | 44 | 843361 | 857 | 0 |
| member_casual | 0 | 1 | 6 | 6 | 0 | 2 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 |
|---|---|---|---|---|---|---|---|---|---|
| start_lat | 0 | 1 | 41.90 | 0.05 | 41.64 | 41.88 | 41.90 | 41.93 | 45.64 |
| start_lng | 0 | 1 | -87.65 | 0.03 | -87.84 | -87.66 | -87.64 | -87.63 | -73.80 |
| end_lat | 4766 | 1 | 41.90 | 0.05 | 41.39 | 41.88 | 41.90 | 41.93 | 42.17 |
| end_lng | 4766 | 1 | -87.65 | 0.03 | -88.97 | -87.66 | -87.64 | -87.63 | -87.49 |
head(Annual_trip) #checking the first six rows of the data set
## ride_id rideable_type started_at ended_at
## 1 3564070EEFD12711 electric_bike 2022-04-06 17:42:48 2022-04-06 17:54:36
## 2 0B820C7FCF22F489 classic_bike 2022-04-24 19:23:07 2022-04-24 19:43:17
## 3 89EEEE32293F07FF classic_bike 2022-04-20 19:29:08 2022-04-20 19:35:16
## 4 84D4751AEB31888D classic_bike 2022-04-22 21:14:06 2022-04-22 21:23:29
## 5 5664BCF0D1DE7A8B electric_bike 2022-04-16 15:56:30 2022-04-16 16:02:11
## 6 AA9EB7BD2E1FC128 classic_bike 2022-04-21 16:52:33 2022-04-21 16:56:51
## start_station_name start_station_id end_station_name
## 1 Paulina St & Howard St 515 University Library (NU)
## 2 Wentworth Ave & Cermak Rd 13075 Green St & Madison St
## 3 Halsted St & Polk St TA1307000121 Green St & Madison St
## 4 Wentworth Ave & Cermak Rd 13075 Delano Ct & Roosevelt Rd
## 5 Halsted St & Polk St TA1307000121 Clinton St & Madison St
## 6 Desplaines St & Randolph St 15535 Canal St & Adams St
## end_station_id start_lat start_lng end_lat end_lng member_casual
## 1 605 42.01913 -87.67353 42.05294 -87.67345 member
## 2 TA1307000120 41.85308 -87.63193 41.88189 -87.64879 member
## 3 TA1307000120 41.87184 -87.64664 41.88189 -87.64879 member
## 4 KA1706005007 41.85308 -87.63193 41.86749 -87.63219 casual
## 5 TA1305000032 41.87181 -87.64657 41.88224 -87.64107 member
## 6 13011 41.88462 -87.64457 41.87926 -87.63990 member
glimpse(Annual_trip) #checking the summary of the rows
## Rows: 5,757,551
## Columns: 13
## $ ride_id <chr> "3564070EEFD12711", "0B820C7FCF22F489", "89EEEE3229~
## $ rideable_type <chr> "electric_bike", "classic_bike", "classic_bike", "c~
## $ started_at <chr> "2022-04-06 17:42:48", "2022-04-24 19:23:07", "2022~
## $ ended_at <chr> "2022-04-06 17:54:36", "2022-04-24 19:43:17", "2022~
## $ start_station_name <chr> "Paulina St & Howard St", "Wentworth Ave & Cermak R~
## $ start_station_id <chr> "515", "13075", "TA1307000121", "13075", "TA1307000~
## $ end_station_name <chr> "University Library (NU)", "Green St & Madison St",~
## $ end_station_id <chr> "605", "TA1307000120", "TA1307000120", "KA170600500~
## $ start_lat <dbl> 42.01913, 41.85308, 41.87184, 41.85308, 41.87181, 4~
## $ start_lng <dbl> -87.67353, -87.63193, -87.64664, -87.63193, -87.646~
## $ end_lat <dbl> 42.05294, 41.88189, 41.88189, 41.86749, 41.88224, 4~
## $ end_lng <dbl> -87.67345, -87.64879, -87.64879, -87.63219, -87.641~
## $ member_casual <chr> "member", "member", "member", "casual", "member", "~
colnames(Annual_trip) #checking the column names of the data set
## [1] "ride_id" "rideable_type" "started_at"
## [4] "ended_at" "start_station_name" "start_station_id"
## [7] "end_station_name" "end_station_id" "start_lat"
## [10] "start_lng" "end_lat" "end_lng"
## [13] "member_casual"
str(Annual_trip) #checking the properties of the columns
## 'data.frame': 5757551 obs. of 13 variables:
## $ ride_id : chr "3564070EEFD12711" "0B820C7FCF22F489" "89EEEE32293F07FF" "84D4751AEB31888D" ...
## $ rideable_type : chr "electric_bike" "classic_bike" "classic_bike" "classic_bike" ...
## $ started_at : chr "2022-04-06 17:42:48" "2022-04-24 19:23:07" "2022-04-20 19:29:08" "2022-04-22 21:14:06" ...
## $ ended_at : chr "2022-04-06 17:54:36" "2022-04-24 19:43:17" "2022-04-20 19:35:16" "2022-04-22 21:23:29" ...
## $ start_station_name: chr "Paulina St & Howard St" "Wentworth Ave & Cermak Rd" "Halsted St & Polk St" "Wentworth Ave & Cermak Rd" ...
## $ start_station_id : chr "515" "13075" "TA1307000121" "13075" ...
## $ end_station_name : chr "University Library (NU)" "Green St & Madison St" "Green St & Madison St" "Delano Ct & Roosevelt Rd" ...
## $ end_station_id : chr "605" "TA1307000120" "TA1307000120" "KA1706005007" ...
## $ start_lat : num 42 41.9 41.9 41.9 41.9 ...
## $ start_lng : num -87.7 -87.6 -87.6 -87.6 -87.6 ...
## $ end_lat : num 42.1 41.9 41.9 41.9 41.9 ...
## $ end_lng : num -87.7 -87.6 -87.6 -87.6 -87.6 ...
## $ member_casual : chr "member" "member" "member" "casual" ...
This involves the Data Wrangling,to aid further analysis
Annual_trip$started_at <- as.POSIXct(Annual_trip$started_at)
Annual_trip$ended_at <- as.POSIXct(Annual_trip$ended_at)
Confirmation:
str(Annual_trip)
## 'data.frame': 5757551 obs. of 13 variables:
## $ ride_id : chr "3564070EEFD12711" "0B820C7FCF22F489" "89EEEE32293F07FF" "84D4751AEB31888D" ...
## $ rideable_type : chr "electric_bike" "classic_bike" "classic_bike" "classic_bike" ...
## $ started_at : POSIXct, format: "2022-04-06 17:42:48" "2022-04-24 19:23:07" ...
## $ ended_at : POSIXct, format: "2022-04-06 17:54:36" "2022-04-24 19:43:17" ...
## $ start_station_name: chr "Paulina St & Howard St" "Wentworth Ave & Cermak Rd" "Halsted St & Polk St" "Wentworth Ave & Cermak Rd" ...
## $ start_station_id : chr "515" "13075" "TA1307000121" "13075" ...
## $ end_station_name : chr "University Library (NU)" "Green St & Madison St" "Green St & Madison St" "Delano Ct & Roosevelt Rd" ...
## $ end_station_id : chr "605" "TA1307000120" "TA1307000120" "KA1706005007" ...
## $ start_lat : num 42 41.9 41.9 41.9 41.9 ...
## $ start_lng : num -87.7 -87.6 -87.6 -87.6 -87.6 ...
## $ end_lat : num 42.1 41.9 41.9 41.9 41.9 ...
## $ end_lng : num -87.7 -87.6 -87.6 -87.6 -87.6 ...
## $ member_casual : chr "member" "member" "member" "casual" ...
Annual_trip <- rename(Annual_trip, "user"="member_casual")
Annual_trip$ride_length <- difftime(Annual_trip$ended_at,Annual_trip$started_at)
Annual_trip$date <- as.Date(Annual_trip$started_at) #generating date of ride column
Annual_trip$day <- format(as.Date(Annual_trip$date), "%d") #generating day column
Annual_trip$weekday <- format(as.Date(Annual_trip$date),"%a") #generating week day column
Annual_trip$month <- format(as.Date(Annual_trip$date), "%b") #generating month column
Annual_trip$year <- format(as.Date(Annual_trip$date), "%Y") #generating year column
Annual_trip$hour <- lubridate::hour(Annual_trip$started_at) #generating hour each trip started
Annual_trip <- Annual_trip %>%
mutate(time_of_day=case_when(hour>=5 & hour<12 ~ "Morning",hour>=12 & hour<16 ~ "Afternoon",hour>=16 & hour<21 ~ "Evening",hour<5 | hour>=21 ~ "Night"))
skim_without_charts(Annual_trip) #generating summary of the process
| Name | Annual_trip |
| Number of rows | 5757551 |
| Number of columns | 21 |
| _______________________ | |
| Column type frequency: | |
| character | 12 |
| Date | 1 |
| difftime | 1 |
| numeric | 5 |
| POSIXct | 2 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| ride_id | 0 | 1 | 16 | 16 | 0 | 5757551 | 0 |
| rideable_type | 0 | 1 | 11 | 13 | 0 | 3 | 0 |
| start_station_name | 0 | 1 | 0 | 53 | 790207 | 866 | 0 |
| start_station_id | 0 | 1 | 0 | 44 | 790204 | 857 | 0 |
| end_station_name | 0 | 1 | 0 | 53 | 843361 | 865 | 0 |
| end_station_id | 0 | 1 | 0 | 44 | 843361 | 857 | 0 |
| user | 0 | 1 | 6 | 6 | 0 | 2 | 0 |
| day | 0 | 1 | 2 | 2 | 0 | 31 | 0 |
| weekday | 0 | 1 | 3 | 3 | 0 | 7 | 0 |
| month | 0 | 1 | 3 | 3 | 0 | 12 | 0 |
| year | 0 | 1 | 4 | 4 | 0 | 2 | 0 |
| time_of_day | 0 | 1 | 5 | 9 | 0 | 4 | 0 |
Variable type: Date
| skim_variable | n_missing | complete_rate | min | max | median | n_unique |
|---|---|---|---|---|---|---|
| date | 0 | 1 | 2021-04-30 | 2022-04-30 | 2021-08-31 | 366 |
Variable type: difftime
| skim_variable | n_missing | complete_rate | min | max | median | n_unique |
|---|---|---|---|---|---|---|
| ride_length | 0 | 1 | -3482 secs | 3356649 secs | 691 secs | 25358 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 |
|---|---|---|---|---|---|---|---|---|---|
| start_lat | 0 | 1 | 41.90 | 0.05 | 41.64 | 41.88 | 41.90 | 41.93 | 45.64 |
| start_lng | 0 | 1 | -87.65 | 0.03 | -87.84 | -87.66 | -87.64 | -87.63 | -73.80 |
| end_lat | 4766 | 1 | 41.90 | 0.05 | 41.39 | 41.88 | 41.90 | 41.93 | 42.17 |
| end_lng | 4766 | 1 | -87.65 | 0.03 | -88.97 | -87.66 | -87.64 | -87.63 | -87.49 |
| hour | 0 | 1 | 14.20 | 5.07 | 0.00 | 11.00 | 15.00 | 18.00 | 23.00 |
Variable type: POSIXct
| skim_variable | n_missing | complete_rate | min | max | median | n_unique |
|---|---|---|---|---|---|---|
| started_at | 0 | 1 | 2021-05-01 00:00:11 | 2022-04-30 23:59:54 | 2021-08-31 17:17:20 | 4821094 |
| ended_at | 0 | 1 | 2021-05-01 00:03:26 | 2022-05-02 00:35:01 | 2021-08-31 17:34:09 | 4815847 |
Annual_trip$ride_length <- as.numeric(Annual_trip$ride_length)
Check if it is numeric
is.numeric(Annual_trip$ride_length)
## [1] TRUE
Annual_trip_2 <- Annual_trip %>%
filter(!(ride_length<60)) %>%
filter(!(ride_length>86400)) #24hrs = 86400 secs
min(Annual_trip_2$ride_length)
## [1] 60
max(Annual_trip_2$ride_length)
## [1] 86397
Annual_trip_3 <- Annual_trip_2 %>%
filter(!(is.na(end_lng)) | !(is.na(end_lat)))
skim_without_charts(Annual_trip_3)
| Name | Annual_trip_3 |
| Number of rows | 5659755 |
| Number of columns | 21 |
| _______________________ | |
| Column type frequency: | |
| character | 12 |
| Date | 1 |
| numeric | 6 |
| POSIXct | 2 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| ride_id | 0 | 1 | 16 | 16 | 0 | 5659755 | 0 |
| rideable_type | 0 | 1 | 11 | 13 | 0 | 3 | 0 |
| start_station_name | 0 | 1 | 0 | 53 | 766171 | 865 | 0 |
| start_station_id | 0 | 1 | 0 | 44 | 766168 | 856 | 0 |
| end_station_name | 0 | 1 | 0 | 53 | 810022 | 863 | 0 |
| end_station_id | 0 | 1 | 0 | 44 | 810022 | 855 | 0 |
| user | 0 | 1 | 6 | 6 | 0 | 2 | 0 |
| day | 0 | 1 | 2 | 2 | 0 | 31 | 0 |
| weekday | 0 | 1 | 3 | 3 | 0 | 7 | 0 |
| month | 0 | 1 | 3 | 3 | 0 | 12 | 0 |
| year | 0 | 1 | 4 | 4 | 0 | 2 | 0 |
| time_of_day | 0 | 1 | 5 | 9 | 0 | 4 | 0 |
Variable type: Date
| skim_variable | n_missing | complete_rate | min | max | median | n_unique |
|---|---|---|---|---|---|---|
| date | 0 | 1 | 2021-04-30 | 2022-04-30 | 2021-08-31 | 366 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 |
|---|---|---|---|---|---|---|---|---|---|
| start_lat | 0 | 1 | 41.90 | 0.05 | 41.64 | 41.88 | 41.90 | 41.93 | 45.64 |
| start_lng | 0 | 1 | -87.65 | 0.03 | -87.84 | -87.66 | -87.64 | -87.63 | -73.80 |
| end_lat | 0 | 1 | 41.90 | 0.05 | 41.39 | 41.88 | 41.90 | 41.93 | 42.17 |
| end_lng | 0 | 1 | -87.65 | 0.03 | -88.97 | -87.66 | -87.64 | -87.63 | -87.49 |
| ride_length | 0 | 1 | 1136.45 | 2057.57 | 60.00 | 399.00 | 702.00 | 1269.00 | 86391.00 |
| hour | 0 | 1 | 14.20 | 5.07 | 0.00 | 11.00 | 15.00 | 18.00 | 23.00 |
Variable type: POSIXct
| skim_variable | n_missing | complete_rate | min | max | median | n_unique |
|---|---|---|---|---|---|---|
| started_at | 0 | 1 | 2021-05-01 00:00:11 | 2022-04-30 23:59:52 | 2021-08-31 16:47:11 | 4751555 |
| ended_at | 0 | 1 | 2021-05-01 00:03:26 | 2022-05-01 16:16:59 | 2021-08-31 17:03:58 | 4746038 |
Annual_trip_3 <- Annual_trip_3 %>%
filter(!(is.na(start_station_name)) | !(is.na(end_station_name)))
summary(Annual_trip_3$ride_length)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 60 399 702 1136 1269 86391
This R markdown gives a detailed documentation of the overall process using code chunks for it’s analysis
Summary of the Ride length
summary(Annual_trip_3$ride_length)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 60 399 702 1136 1269 86391
Observations
summary(Annual_trip_3)
## ride_id rideable_type started_at
## Length:5659755 Length:5659755 Min. :2021-05-01 00:00:11
## Class :character Class :character 1st Qu.:2021-07-07 15:23:51
## Mode :character Mode :character Median :2021-08-31 16:47:11
## Mean :2021-09-18 16:06:55
## 3rd Qu.:2021-11-03 17:50:56
## Max. :2022-04-30 23:59:52
## ended_at start_station_name start_station_id
## Min. :2021-05-01 00:03:26 Length:5659755 Length:5659755
## 1st Qu.:2021-07-07 15:45:37 Class :character Class :character
## Median :2021-08-31 17:03:58 Mode :character Mode :character
## Mean :2021-09-18 16:25:52
## 3rd Qu.:2021-11-03 18:03:08
## Max. :2022-05-01 16:16:59
## end_station_name end_station_id start_lat start_lng
## Length:5659755 Length:5659755 Min. :41.64 Min. :-87.84
## Class :character Class :character 1st Qu.:41.88 1st Qu.:-87.66
## Mode :character Mode :character Median :41.90 Median :-87.64
## Mean :41.90 Mean :-87.65
## 3rd Qu.:41.93 3rd Qu.:-87.63
## Max. :45.64 Max. :-73.80
## end_lat end_lng user ride_length
## Min. :41.39 Min. :-88.97 Length:5659755 Min. : 60
## 1st Qu.:41.88 1st Qu.:-87.66 Class :character 1st Qu.: 399
## Median :41.90 Median :-87.64 Mode :character Median : 702
## Mean :41.90 Mean :-87.65 Mean : 1136
## 3rd Qu.:41.93 3rd Qu.:-87.63 3rd Qu.: 1269
## Max. :42.17 Max. :-87.49 Max. :86391
## date day weekday month
## Min. :2021-04-30 Length:5659755 Length:5659755 Length:5659755
## 1st Qu.:2021-07-07 Class :character Class :character Class :character
## Median :2021-08-31 Mode :character Mode :character Mode :character
## Mean :2021-09-18
## 3rd Qu.:2021-11-03
## Max. :2022-04-30
## year hour time_of_day
## Length:5659755 Min. : 0.0 Length:5659755
## Class :character 1st Qu.:11.0 Class :character
## Mode :character Median :15.0 Mode :character
## Mean :14.2
## 3rd Qu.:18.0
## Max. :23.0
glimpse(Annual_trip_3)
## Rows: 5,659,755
## Columns: 21
## $ ride_id <chr> "3564070EEFD12711", "0B820C7FCF22F489", "89EEEE3229~
## $ rideable_type <chr> "electric_bike", "classic_bike", "classic_bike", "c~
## $ started_at <dttm> 2022-04-06 17:42:48, 2022-04-24 19:23:07, 2022-04-~
## $ ended_at <dttm> 2022-04-06 17:54:36, 2022-04-24 19:43:17, 2022-04-~
## $ start_station_name <chr> "Paulina St & Howard St", "Wentworth Ave & Cermak R~
## $ start_station_id <chr> "515", "13075", "TA1307000121", "13075", "TA1307000~
## $ end_station_name <chr> "University Library (NU)", "Green St & Madison St",~
## $ end_station_id <chr> "605", "TA1307000120", "TA1307000120", "KA170600500~
## $ start_lat <dbl> 42.01913, 41.85308, 41.87184, 41.85308, 41.87181, 4~
## $ start_lng <dbl> -87.67353, -87.63193, -87.64664, -87.63193, -87.646~
## $ end_lat <dbl> 42.05294, 41.88189, 41.88189, 41.86749, 41.88224, 4~
## $ end_lng <dbl> -87.67345, -87.64879, -87.64879, -87.63219, -87.641~
## $ user <chr> "member", "member", "member", "casual", "member", "~
## $ ride_length <dbl> 708, 1210, 368, 563, 341, 258, 278, 744, 89, 826, 2~
## $ date <date> 2022-04-06, 2022-04-24, 2022-04-20, 2022-04-22, 20~
## $ day <chr> "06", "24", "20", "22", "16", "21", "04", "05", "29~
## $ weekday <chr> "Wed", "Sun", "Wed", "Fri", "Sat", "Thu", "Mon", "T~
## $ month <chr> "Apr", "Apr", "Apr", "Apr", "Apr", "Apr", "Apr", "A~
## $ year <chr> "2022", "2022", "2022", "2022", "2022", "2022", "20~
## $ hour <int> 17, 19, 19, 21, 15, 16, 17, 8, 23, 15, 8, 13, 11, 2~
## $ time_of_day <chr> "Evening", "Evening", "Evening", "Night", "Afternoo~
Annual_trip_3 %>%
group_by(user) %>%
summarise(average_ride_length=mean(ride_length),median_ride_length=median(ride_length),min_ride_length=min(ride_length),max_ride_length=max(ride_length))
## # A tibble: 2 x 5
## user average_ride_length median_ride_length min_ride_length max_ride_length
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 casual 1581. 947 60 86391
## 2 member 786. 561 60 86128
Visualize the Average ride_length
Annual_trip_3 %>%
group_by(user) %>%
summarise(average_ride_length=mean(ride_length)) %>%
ggplot() + geom_bar(mapping=aes(x=user,y=average_ride_length,fill=user),stat='identity',width=0.3) + labs(title="Ride length average for each user",x="USER TYPE",y="AVERAGE RIDE LENGTH") + coord_flip() + scale_fill_manual(values=c("Orange","Blue"))
Annual_trip_3 %>%
group_by(user)%>%
summarise(number_of_rides=n())
## # A tibble: 2 x 2
## user number_of_rides
## <chr> <int>
## 1 casual 2495672
## 2 member 3164083
Visualizing the total rides with respect to bike type
Annual_trip_3 %>%
group_by(user,rideable_type)%>%
summarise(number_of_ride=n())%>%
ggplot() + geom_bar(mapping=aes(x=user,y=number_of_ride,fill=rideable_type),stat='identity',width=0.4) + labs(title="Total No. of ride with respect to bike type",x="USER TYPE",y="TOTAL NO OF RIDES") + scale_fill_brewer(palette="Dark2")
## `summarise()` has grouped output by 'user'. You can override using the
## `.groups` argument.
Annual_trip_3 <- Annual_trip_3%>%
mutate(month=factor(month,levels=c("Jun","Jul","Aug","Sep","Oct","Nov","Dec","Jan","Feb","Mar","Apr","May")))
Annual_trip_3%>%
group_by(user,month)%>%
summarise(number_of_ride=n())%>%
arrange(user)
## `summarise()` has grouped output by 'user'. You can override using the
## `.groups` argument.
## # A tibble: 24 x 3
## # Groups: user [2]
## user month number_of_ride
## <chr> <fct> <int>
## 1 casual Jun 364315
## 2 casual Jul 435611
## 3 casual Aug 406314
## 4 casual Sep 358711
## 5 casual Oct 253260
## 6 casual Nov 105161
## 7 casual Dec 68690
## 8 casual Jan 17960
## 9 casual Feb 20925
## 10 casual Mar 88068
## # ... with 14 more rows
Visualizing Monthly Rides
Annual_trip_3 %>%
group_by(user,month)%>%
summarise(number_of_ride=n())%>%
ggplot(aes(x=month,y=number_of_ride,group=user,color=user)) + geom_line(size=1.0) + geom_point(size=2.0) + labs(title="Monthly rides for Users",x="Month",y="Number of Rides") + scale_color_manual(values=c("Green","Blue"))
## `summarise()` has grouped output by 'user'. You can override using the
## `.groups` argument.
Annual_trip_3 <- Annual_trip_3%>%
mutate(weekday=factor(weekday, levels=c("Mon","Tue","Wed","Thu","Fri","Sat","Sun")))
Annual_trip_3%>%
group_by(user,weekday)%>%
summarise(number_of_ride=n())%>%
arrange(user)
## `summarise()` has grouped output by 'user'. You can override using the
## `.groups` argument.
## # A tibble: 14 x 3
## # Groups: user [2]
## user weekday number_of_ride
## <chr> <fct> <int>
## 1 casual Mon 282629
## 2 casual Tue 266664
## 3 casual Wed 280917
## 4 casual Thu 295797
## 5 casual Fri 360045
## 6 casual Sat 553326
## 7 casual Sun 456294
## 8 member Mon 437391
## 9 member Tue 490862
## 10 member Wed 498813
## 11 member Thu 479008
## 12 member Fri 449561
## 13 member Sat 434910
## 14 member Sun 373538
Visualizing Weekly Total Rides with respect to each user
Annual_trip_3%>%
group_by(user,weekday)%>%
summarise(number_of_ride=n())%>%
ggplot() + geom_bar(mapping=aes(x=weekday,y=number_of_ride,fill=user),stat='identity',width=0.2) + labs(title="Total Weekly Ride for Users",x="Weekday",y="Number of Rides") + scale_fill_manual(values=c("Orange","Blue"))
## `summarise()` has grouped output by 'user'. You can override using the
## `.groups` argument.
Annual_trip_3 %>%
group_by(user,weekday)%>%
summarise(Average_ride=mean(ride_length))%>%
arrange(user)
## `summarise()` has grouped output by 'user'. You can override using the
## `.groups` argument.
## # A tibble: 14 x 3
## # Groups: user [2]
## user weekday Average_ride
## <chr> <fct> <dbl>
## 1 casual Mon 1594.
## 2 casual Tue 1392.
## 3 casual Wed 1374.
## 4 casual Thu 1379.
## 5 casual Fri 1474.
## 6 casual Sat 1734.
## 7 casual Sun 1839.
## 8 member Mon 758.
## 9 member Tue 734.
## 10 member Wed 747.
## 11 member Thu 743.
## 12 member Fri 771.
## 13 member Sat 884.
## 14 member Sun 899.
Visualizing the Average Weekly Ride
Annual_trip_3%>%
group_by(user,weekday)%>%
summarise(Average_ride=mean(ride_length))%>%
ggplot() + geom_bar(mapping=aes(x=weekday,y=Average_ride,fill=user),stat='identity',width=0.2) + labs(title="Average Weekly Ride for Users",x="Weekday",y="Average Rides") + scale_fill_brewer(palette="Set2")
## `summarise()` has grouped output by 'user'. You can override using the
## `.groups` argument.
Annual_trip_3 <- Annual_trip_3%>%
mutate(time_of_day=factor(time_of_day,levels=c("Morning","Afternoon","Evening","Night")))
Annual_trip_3%>%
group_by(user,time_of_day)%>%
summarise(number_of_ride=n())%>%
arrange(user)
## `summarise()` has grouped output by 'user'. You can override using the
## `.groups` argument.
## # A tibble: 8 x 3
## # Groups: user [2]
## user time_of_day number_of_ride
## <chr> <fct> <int>
## 1 casual Morning 467741
## 2 casual Afternoon 688578
## 3 casual Evening 926619
## 4 casual Night 412734
## 5 member Morning 894178
## 6 member Afternoon 745771
## 7 member Evening 1205045
## 8 member Night 319089
Visualizing the time of day ride with respect to user
Annual_trip_3%>%
group_by(user,time_of_day)%>%
summarise(number_of_ride=n())%>%
ggplot(aes(x=time_of_day,y=number_of_ride,group=user,color=user))+geom_line(size=1.1)+geom_point(size=2.2)+labs(title="Time of Day Ride took place",x="Time Period",y="Number of Ride",caption="Morning:5AM-12PM, Afternoon:12PM-4PM,Evening:4PM-9PM,Night:9PM-5AM") + scale_color_manual(values=c("Orange","Blue"))
## `summarise()` has grouped output by 'user'. You can override using the
## `.groups` argument.
Annual_trip_3%>%
filter(!(is.na(start_station_name)))%>% #filter out missing
filter(user=="member")%>%
group_by(start_station_name)%>%
summarise(number_of_ride=n())%>%
arrange(-number_of_ride)%>%
top_n(4)
## Selecting by number_of_ride
## # A tibble: 4 x 2
## start_station_name number_of_ride
## <chr> <int>
## 1 "" 433663
## 2 "Kingsbury St & Kinzie St" 25292
## 3 "Clark St & Elm St" 24429
## 4 "Wells St & Concord Ln" 23690
Annual_trip_3%>%
filter(!(is.na(start_station_name)))%>% #filter out missing
filter(user=="casual")%>%
group_by(start_station_name)%>%
summarise(number_of_ride=n())%>%
arrange(-number_of_ride)%>%
top_n(4)
## Selecting by number_of_ride
## # A tibble: 4 x 2
## start_station_name number_of_ride
## <chr> <int>
## 1 "" 332508
## 2 "Streeter Dr & Grand Ave" 67118
## 3 "Millennium Park" 32147
## 4 "Michigan Ave & Oak St" 28672
Annual_trip_3%>%
filter(!(is.na(end_station_name)))%>% #filter out missing
filter(user=="member")%>%
group_by(end_station_name)%>%
summarise(number_of_ride=n())%>%
arrange(-number_of_ride)%>%
top_n(4)
## Selecting by number_of_ride
## # A tibble: 4 x 2
## end_station_name number_of_ride
## <chr> <int>
## 1 "" 431268
## 2 "Kingsbury St & Kinzie St" 25154
## 3 "Clark St & Elm St" 24438
## 4 "Wells St & Concord Ln" 24316
Annual_trip_3%>%
filter(!(is.na(end_station_name)))%>%
filter(user=="casual")%>%
group_by(end_station_name)%>%
summarise(number_of_ride=n())%>%
arrange(-number_of_ride)%>%
top_n(4)
## Selecting by number_of_ride
## # A tibble: 4 x 2
## end_station_name number_of_ride
## <chr> <int>
## 1 "" 378754
## 2 "Streeter Dr & Grand Ave" 69695
## 3 "Millennium Park" 33202
## 4 "Michigan Ave & Oak St" 30010
write.csv(Annual_trip_3,file = "Annual_trip_3.csv")
This involves a conclusive statement from the Analysis done to aid Marketing startegy
This wraps up my study on the Capstone Project on my Google Data Analytics Certification, helped in exploring the wonderful tools and packages that comes with R for effective Data Analysis.