Google Data Analytics Capstone Project-Cyclistic (According to Gender and Different Age Group)

Case Study 1: How Does a Bike-Share Navigate Speedy Success?

Introduction

In my previous project, I examined how people use different types of bikes by day and by month, and how much time they spend on each ride. In this project, I look at how male and female riders and different age groups use the bikes, and how much time they spend on rides. To answer the key business questions, I followed the steps of the data analysis process: ask, prepare, process, analyze, share, and act.

About the company

In 2016, Cyclistic launched a successful bike-share offering. Since then, the program has grown to a fleet of 5,824 bicycles that are geotracked and locked into a network of 692 stations across Chicago. The bikes can be unlocked from one station and returned to any other station in the system anytime.

Until now, Cyclistic’s marketing strategy relied on building general awareness and appealing to broad consumer segments. One approach that helped make these things possible was the flexibility of its pricing plans: single-ride passes, full-day passes, and annual memberships. Customers who purchase single-ride or full-day passes are referred to as casual riders. Customers who purchase annual memberships are Cyclistic members.

Cyclistic’s finance analysts have concluded that annual members are much more profitable than casual riders. Although the pricing flexibility helps Cyclistic attract more customers, Moreno believes that maximizing the number of annual members will be key to future growth. Rather than creating a marketing campaign that targets all-new customers, Moreno believes there is a very good chance to convert casual riders into members. She notes that casual riders are already aware of the Cyclistic program and have chosen Cyclistic for their mobility needs.

Moreno has set a clear goal: Design marketing strategies aimed at converting casual riders into annual members. In order to do that, however, the marketing analyst team needs to better understand how annual members and casual riders differ, why casual riders would buy a membership, and how digital media could affect their marketing tactics. Moreno and her team are interested in analyzing the Cyclistic historical bike trip data to identify trends.

Ask

Key tasks

Total number of male and female riders who used bikes.
Trip duration by gender.
Total number of riders in each age group.
Trip duration by age group.

Prepare

I will use Cyclistic’s historical trip data to analyze and identify trends. The data has been made available by Motivate International Inc. under this license. I chose to work with the quarterly data for 2019. This is public data that I will use to explore how different customer types are using Cyclistic bikes. Note, however, that data-privacy issues prohibit me from using riders’ personally identifiable information. This means that I won’t be able to connect pass purchases to credit card numbers to determine if casual riders live in the Cyclistic service area or if they have purchased multiple single passes.

Key tasks

Download data and store it appropriately.
Identify how it’s organized.
Sort and filter the data.
Determine the credibility of the data.

Loading the Required Packages

library(ggplot2)
library(tidyr)
library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.2     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ lubridate 1.9.2     ✔ tibble    3.2.1
## ✔ purrr     1.0.1     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(dplyr)
library(readr)

Importing the data

Divvy_Trips_2019_Q1 <- read_csv("C:/Users/SUKHVIR/Downloads/Divvy_Trips_2019_Q1/Divvy_Trips_2019_Q1.csv")
Divvy_Trips_2019_Q2 <- read_csv("C:/Users/SUKHVIR/Downloads/Divvy_Trips_2019_Q2/Divvy_Trips_2019_Q2.csv")
Divvy_Trips_2019_Q3 <- read_csv("C:/Users/SUKHVIR/Downloads/Divvy_Trips_2019_Q3/Divvy_Trips_2019_Q3.csv")
Divvy_Trips_2019_Q4 <- read_csv("C:/Users/SUKHVIR/Downloads/Divvy_Trips_2019_Q4/Divvy_Trips_2019_Q4.csv")

Renaming the Q2 columns, because the Q2 file uses different column names

colnames(Divvy_Trips_2019_Q2)<- colnames(Divvy_Trips_2019_Q1)
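Assigning column names by position only works if both tables have the same number of columns in the same order. The sketch below shows one way to check that before renaming. It uses two toy data frames; the names `q1`, `q2`, `rental_id`, and `duration_seconds` are illustrative assumptions, not the real file headers.

```r
# Toy stand-ins for the Q1 and Q2 tables (the real files have 12 columns).
q1 <- data.frame(trip_id = c(1, 2), tripduration = c(390, 441))
q2 <- data.frame(rental_id = c(3, 4), duration_seconds = c(829, 1783))

# Positional renaming is only safe when the column count and types line up.
stopifnot(ncol(q1) == ncol(q2))
stopifnot(identical(unname(sapply(q1, class)), unname(sapply(q2, class))))

# Copy the Q1 names onto Q2, then stack the two tables.
colnames(q2) <- colnames(q1)
combined <- rbind(q1, q2)
print(combined)
```

If either check fails, the script stops before any column is silently mislabeled.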

Combining the data into one data frame

Combine_data <-rbind(Divvy_Trips_2019_Q1,Divvy_Trips_2019_Q2, Divvy_Trips_2019_Q3,Divvy_Trips_2019_Q4)

Reviewing the Data

glimpse(Combine_data)

## Rows: 3,818,004
## Columns: 12
## $ trip_id           <dbl> 21742443, 21742444, 21742445, 21742446, 21742447, 21…
## $ start_time        <dttm> 2019-01-01 00:04:37, 2019-01-01 00:08:13, 2019-01-0…
## $ end_time          <dttm> 2019-01-01 00:11:07, 2019-01-01 00:15:34, 2019-01-0…
## $ bikeid            <dbl> 2167, 4386, 1524, 252, 1170, 2437, 2708, 2796, 6205,…
## $ tripduration      <dbl> 390, 441, 829, 1783, 364, 216, 177, 100, 1727, 336, …
## $ from_station_id   <dbl> 199, 44, 15, 123, 173, 98, 98, 211, 150, 268, 299, 2…
## $ from_station_name <chr> "Wabash Ave & Grand Ave", "State St & Randolph St", …
## $ to_station_id     <dbl> 84, 624, 644, 176, 35, 49, 49, 142, 148, 141, 295, 4…
## $ to_station_name   <chr> "Milwaukee Ave & Grand Ave", "Dearborn St & Van Bure…
## $ usertype          <chr> "Subscriber", "Subscriber", "Subscriber", "Subscribe…
## $ gender            <chr> "Male", "Female", "Female", "Male", "Male", "Female"…
## $ birthyear         <dbl> 1989, 1990, 1994, 1993, 1994, 1983, 1984, 1990, 1995…

str(Combine_data)

## spc_tbl_ [3,818,004 × 12] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ trip_id          : num [1:3818004] 21742443 21742444 21742445 21742446 21742447 ...
##  $ start_time       : POSIXct[1:3818004], format: "2019-01-01 00:04:37" "2019-01-01 00:08:13" ...
##  $ end_time         : POSIXct[1:3818004], format: "2019-01-01 00:11:07" "2019-01-01 00:15:34" ...
##  $ bikeid           : num [1:3818004] 2167 4386 1524 252 1170 ...
##  $ tripduration     : num [1:3818004] 390 441 829 1783 364 ...
##  $ from_station_id  : num [1:3818004] 199 44 15 123 173 98 98 211 150 268 ...
##  $ from_station_name: chr [1:3818004] "Wabash Ave & Grand Ave" "State St & Randolph St" "Racine Ave & 18th St" "California Ave & Milwaukee Ave" ...
##  $ to_station_id    : num [1:3818004] 84 624 644 176 35 49 49 142 148 141 ...
##  $ to_station_name  : chr [1:3818004] "Milwaukee Ave & Grand Ave" "Dearborn St & Van Buren St (*)" "Western Ave & Fillmore St (*)" "Clark St & Elm St" ...
##  $ usertype         : chr [1:3818004] "Subscriber" "Subscriber" "Subscriber" "Subscriber" ...
##  $ gender           : chr [1:3818004] "Male" "Female" "Female" "Male" ...
##  $ birthyear        : num [1:3818004] 1989 1990 1994 1993 1994 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   trip_id = col_double(),
##   ..   start_time = col_datetime(format = ""),
##   ..   end_time = col_datetime(format = ""),
##   ..   bikeid = col_double(),
##   ..   tripduration = col_number(),
##   ..   from_station_id = col_double(),
##   ..   from_station_name = col_character(),
##   ..   to_station_id = col_double(),
##   ..   to_station_name = col_character(),
##   ..   usertype = col_character(),
##   ..   gender = col_character(),
##   ..   birthyear = col_double()
##   .. )
##  - attr(*, "problems")=<externalptr>

Process

Key tasks

Check the data for errors.
Choose your tools.
Transform the data so you can work with it effectively.
Document the cleaning process.

Selecting the required columns

Combine_data_01<-Combine_data %>% 
  select(trip_id,bikeid,tripduration,usertype,gender,birthyear)

colnames(Combine_data_01)

## [1] "trip_id"      "bikeid"       "tripduration" "usertype"     "gender"      
## [6] "birthyear"

Adding a new column for the current year, used later to calculate each rider's age

Combine_data_02<-Combine_data_01 %>% 
  mutate(current_year= 2019)

glimpse(Combine_data_02)

## Rows: 3,818,004
## Columns: 7
## $ trip_id      <dbl> 21742443, 21742444, 21742445, 21742446, 21742447, 2174244…
## $ bikeid       <dbl> 2167, 4386, 1524, 252, 1170, 2437, 2708, 2796, 6205, 3939…
## $ tripduration <dbl> 390, 441, 829, 1783, 364, 216, 177, 100, 1727, 336, 886, …
## $ usertype     <chr> "Subscriber", "Subscriber", "Subscriber", "Subscriber", "…
## $ gender       <chr> "Male", "Female", "Female", "Male", "Male", "Female", "Ma…
## $ birthyear    <dbl> 1989, 1990, 1994, 1993, 1994, 1983, 1984, 1990, 1995, 199…
## $ current_year <dbl> 2019, 2019, 2019, 2019, 2019, 2019, 2019, 2019, 2019, 201…

Checking for duplicates and NA values

sum(duplicated(Combine_data_02))

## [1] 0

sum(is.na(Combine_data_02))

## [1] 1097957
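Note that `sum(is.na(...))` counts missing cells, not rows, which is why the figure above is larger than the number of rows later removed (3,818,004 − 3,258,796 = 559,208). A minimal sketch on toy data, using hypothetical values, shows the difference between the three counts:

```r
# Toy table with missing values in two columns.
trips <- data.frame(
  trip_id   = c(1, 2, 3, 4),
  gender    = c("Male", NA, "Female", NA),
  birthyear = c(1989, NA, NA, 1990)
)

na_cells  <- sum(is.na(trips))            # missing cells across the whole table
na_by_col <- colSums(is.na(trips))        # missing cells per column
na_rows   <- sum(!complete.cases(trips))  # rows with at least one NA

print(na_cells)
print(na_by_col)
print(na_rows)
```

The per-column view makes it easier to see which fields drive the missing data before rows are dropped.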

Removing rows with NA values

Combine_data_02_Na<-drop_na(Combine_data_02)
sum(is.na(Combine_data_02_Na))

## [1] 0

Subtracting birthyear from current_year to get each rider's age (stored in the column named year_of_birth)

Combine_data_02_Na_F<- Combine_data_02_Na %>% 
  mutate(year_of_birth=current_year - birthyear )

glimpse(Combine_data_02_Na_F)

## Rows: 3,258,796
## Columns: 8
## $ trip_id       <dbl> 21742443, 21742444, 21742445, 21742446, 21742447, 217424…
## $ bikeid        <dbl> 2167, 4386, 1524, 252, 1170, 2437, 2708, 2796, 6205, 393…
## $ tripduration  <dbl> 390, 441, 829, 1783, 364, 216, 177, 100, 1727, 336, 886,…
## $ usertype      <chr> "Subscriber", "Subscriber", "Subscriber", "Subscriber", …
## $ gender        <chr> "Male", "Female", "Female", "Male", "Male", "Female", "M…
## $ birthyear     <dbl> 1989, 1990, 1994, 1993, 1994, 1983, 1984, 1990, 1995, 19…
## $ current_year  <dbl> 2019, 2019, 2019, 2019, 2019, 2019, 2019, 2019, 2019, 20…
## $ year_of_birth <dbl> 30, 29, 25, 26, 25, 36, 35, 29, 24, 23, 25, 25, 33, 29, …
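Despite its name, the `year_of_birth` column holds the rider's age in 2019. As a minimal sketch, the same value can be computed in one step without a helper column; the column name `age` and the variable `data_year` are hypothetical names chosen here for clarity.

```r
# Toy birth years, including the implausible 2014 value seen in the data.
trips <- data.frame(birthyear = c(1989, 1990, 2014))

data_year <- 2019  # the year the trips were recorded
trips$age <- data_year - trips$birthyear

print(trips)
```

This gives the age a rider reaches during 2019, since the data contains only the birth year and not the full birth date.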

Analyze

Key tasks

Aggregate your data so it’s useful and accessible.
Organize and format your data.
Perform calculations.
Identify trends and relationships.

Checking the age values and filtering out implausible ones

Combine_data_02_Na_F %>% 
  distinct(year_of_birth) %>% 
  arrange(year_of_birth)

## # A tibble: 89 × 1
##    year_of_birth
##            <dbl>
##  1             5
##  2            16
##  3            17
##  4            18
##  5            19
##  6            20
##  7            21
##  8            22
##  9            23
## 10            24
## # ℹ 79 more rows

Combine_data_02_Na_F %>% 
 filter(year_of_birth == 5)

## # A tibble: 5 × 8
##    trip_id bikeid tripduration usertype   gender birthyear current_year
##      <dbl>  <dbl>        <dbl> <chr>      <chr>      <dbl>        <dbl>
## 1 22463474   6225         7209 Subscriber Female      2014         2019
## 2 22483110   6391         4515 Subscriber Female      2014         2019
## 3 22634065   2076         8469 Subscriber Female      2014         2019
## 4 22670749   2076       175251 Subscriber Female      2014         2019
## 5 22895143   2334      2479420 Subscriber Female      2014         2019
## # ℹ 1 more variable: year_of_birth <dbl>

Combine_data_02_Na_F_1<- Combine_data_02_Na_F %>% 
  filter(year_of_birth >= 16)


Combine_data_02_Na_F_1 %>% 
  distinct(year_of_birth) %>% 
  arrange(year_of_birth)

## # A tibble: 88 × 1
##    year_of_birth
##            <dbl>
##  1            16
##  2            17
##  3            18
##  4            19
##  5            20
##  6            21
##  7            22
##  8            23
##  9            24
## 10            25
## # ℹ 78 more rows

Checking the minimum, maximum, and average trip duration by gender

Combine_data_02_Na_F_1 %>% 
  group_by(gender) %>% 
  summarise(min_trip= min(tripduration),max_trip = max(tripduration),avg_trip=mean(tripduration))

## # A tibble: 2 × 4
##   gender min_trip max_trip avg_trip
##   <chr>     <dbl>    <dbl>    <dbl>
## 1 Female       61  8203637    1301.
## 2 Male         61  9056633     987.
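The `tripduration` values are in seconds (the first trip runs from 00:04:37 to 00:11:07 and is recorded as 390), and the maximums above show that a few very long trips can pull the mean upward. A minimal sketch on toy data, assuming nothing beyond base R, compares the mean with the median in minutes:

```r
# Toy trips in seconds; one Female trip is a day-long outlier.
trips <- data.frame(
  gender       = c("Female", "Female", "Female", "Male", "Male", "Male"),
  tripduration = c(300, 600, 86400, 240, 480, 720)
)

# Per-gender mean and median, converted from seconds to minutes.
mean_min   <- tapply(trips$tripduration, trips$gender, mean) / 60
median_min <- tapply(trips$tripduration, trips$gender, median) / 60

print(round(cbind(mean_min, median_min), 1))
```

In this toy example the single outlier moves the Female mean far from the median, which is why the median can be a useful companion figure.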

Adding a new age-category column

Combine_data_Final<-Combine_data_02_Na_F_1 %>% 
  # year_of_birth holds the rider's age; ages above 119 match no bin and become NA
  mutate(age_cat = case_when(year_of_birth >= 16 & year_of_birth <= 30 ~ "16-30",
                             year_of_birth >= 31 & year_of_birth <= 50 ~  "31-50",
                             year_of_birth >= 51 & year_of_birth <= 70 ~  "51-70",
                             year_of_birth >= 71 & year_of_birth <= 90 ~   "71-90",
                             # this bin covers ages 91-119, although its label reads "94-119"
                             year_of_birth >= 91 & year_of_birth <=119 ~ "94-119"))

glimpse(Combine_data_Final)

## Rows: 3,258,791
## Columns: 9
## $ trip_id       <dbl> 21742443, 21742444, 21742445, 21742446, 21742447, 217424…
## $ bikeid        <dbl> 2167, 4386, 1524, 252, 1170, 2437, 2708, 2796, 6205, 393…
## $ tripduration  <dbl> 390, 441, 829, 1783, 364, 216, 177, 100, 1727, 336, 886,…
## $ usertype      <chr> "Subscriber", "Subscriber", "Subscriber", "Subscriber", …
## $ gender        <chr> "Male", "Female", "Female", "Male", "Male", "Female", "M…
## $ birthyear     <dbl> 1989, 1990, 1994, 1993, 1994, 1983, 1984, 1990, 1995, 19…
## $ current_year  <dbl> 2019, 2019, 2019, 2019, 2019, 2019, 2019, 2019, 2019, 20…
## $ year_of_birth <dbl> 30, 29, 25, 26, 25, 36, 35, 29, 24, 23, 25, 25, 33, 29, …
## $ age_cat       <chr> "16-30", "16-30", "16-30", "16-30", "16-30", "31-50", "3…

Checking for NA values and creating the final data

sum(is.na(Combine_data_Final))

## [1] 136
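`case_when()` returns NA when no condition matches, so the missing values here are most likely ages above 119 that fall outside every bin. The sketch below reproduces that behavior with base R's `cut()` as a stand-in for `case_when()`; the ages and the `"91-119"` label are illustrative assumptions.

```r
# Toy ages; 130 lies outside every bin.
ages <- c(25, 40, 95, 130)

# Bins are right-closed: (15,30], (30,50], (50,70], (70,90], (90,119].
age_cat <- as.character(cut(
  ages,
  breaks = c(15, 30, 50, 70, 90, 119),
  labels = c("16-30", "31-50", "51-70", "71-90", "91-119")
))

print(age_cat)
print(sum(is.na(age_cat)))  # values that fell outside every bin
```

Counting the NA values right after binning shows how many rows the following `drop_na()` call will remove.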

Combine_data_Final<-drop_na(Combine_data_Final)

Combine_data_Final %>% 
  group_by(age_cat) %>% 
  summarise(count= n())

## # A tibble: 5 × 2
##   age_cat   count
##   <chr>     <int>
## 1 16-30   1463681
## 2 31-50   1414519
## 3 51-70    372814
## 4 71-90      6789
## 5 94-119      852

view(Combine_data_Final)

Share

Key tasks

Determine the best way to share your findings.
Create effective data visualizations.
Present your findings.
Ensure your work is accessible.

I will visualize the data according to the gender type and age category.

By Gender

Total count

Total_Gender_Per<-Combine_data_Final %>% 
  group_by(gender) %>%
  summarise(count=n()) %>% 
  mutate(Percent=paste0(round(count/sum(count)*100,2),"%"))


ggplot(Total_Gender_Per,aes(x=gender,y=count,fill =count))+
  geom_col()+theme_minimal()+
  geom_text(aes(label=Percent),vjust=-0.4)+
  labs(title = "Total Count of Men & Women")
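The percentage labels on the bars come from dividing each group's count by the total, rounding to two decimals, and appending a percent sign. A minimal sketch of that calculation in base R, using made-up counts:

```r
# Made-up counts for two groups.
counts <- c(Female = 1, Male = 2)

# Share of the total, rounded to two decimals, formatted as a label.
share  <- round(counts / sum(counts) * 100, 2)
labels <- paste0(share, "%")

print(labels)
```

Because the labels are character strings, they are suitable for `geom_text()` but not for further arithmetic; keep the numeric `share` for any later calculation.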

Trip Duration According to Male and Female

trip_percent<- Combine_data_Final %>% 
  group_by(gender) %>% 
  summarise(Count=sum(tripduration)) %>% 
  mutate(percent= paste0(round(Count/sum(Count)*100,2),"%"))

ggplot(trip_percent,aes(x=gender,y=Count,fill=Count))+
  geom_col()+theme_minimal()+
  geom_text(aes(label=percent),vjust=-0.4)+
  labs(title = "Trip Duration According to Male And Female")

Age category

Total Count

Total_count_Age<-Combine_data_Final %>% 
  group_by(age_cat) %>% 
  summarise(count=n()) %>% 
  mutate(Percent=paste0(round(count/sum(count)*100,2),"%"))

ggplot(Total_count_Age,aes(x=age_cat,y=count,fill=count))+
  geom_col()+theme_minimal()+
  geom_text(aes(label=Percent), vjust=-0.4)+
  labs(title = "Total Count According to the Different Age Group")

Trip Duration According to Different Age Group

Trip_duration_age<- Combine_data_Final %>% 
  group_by(age_cat) %>% 
  summarise(Count=sum(tripduration)) %>% 
  mutate(Percent=paste0(round(Count/sum(Count)*100,2),"%"))


ggplot(Trip_duration_age,aes(x=age_cat,y=Count,fill=Count))+
  geom_col()+theme_minimal()+
  geom_text(aes(label=Percent),vjust=- 0.4)+
  labs(title = "Total Trip Duration According to Different Age Group")

Findings

1. By gender: Male riders account for most rides, 73.67% compared with 26.33% for female riders. Male riders also account for the larger share of total trip duration (67.99%, compared with 32.01% for female riders), although the average trip is longer for female riders (about 1,301 seconds versus 987).

2. By age category: Riders aged 16 to 50 account for about 88.3% of rides, and this group also has the highest total trip duration.
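The share for the 16 to 50 group can be recomputed from the age-category counts reported earlier. The sketch below does the arithmetic in base R; the variable names are hypothetical. Adding the two rounded percentages (44.92 + 43.41) gives 88.33, while the unrounded share rounds to 88.32.

```r
# Ride counts per age category, copied from the summary table in this report.
age_counts <- c("16-30" = 1463681, "31-50" = 1414519, "51-70" = 372814,
                "71-90" = 6789, "94-119" = 852)

# Percentage share of each category, then the combined 16-50 share.
share <- age_counts / sum(age_counts) * 100
combined_16_50 <- sum(share[c("16-30", "31-50")])

print(round(share, 2))
print(round(combined_16_50, 2))
```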

Conclusion

To conclude, male riders account for more rides and more total trip duration than female riders, and riders aged 16 to 50 lead in both ride count and total trip duration.

Act

Recommendations

We should target riders in the 31 to 50 age category with dedicated online campaigns. If possible, we should also survey them to learn what they want, so that we can convert casual riders into annual members.
We should also focus more on female riders, because there is room to increase their ride count and convert them into annual members. An online survey would help us learn their preferences, and digital marketing and promotions can support the effort.

Sukhvir Singh

2023-09-29