#Introduction
Cyclistic bike sharing Company wants to design a marketing strategy that will help increase their annual membership.As a Junior Data Analyst i am tasked with finding out how annual membership and casual riders use Cyclistic bikes differently.It involves analyzing data made available by Motivate International Inc https://divvy-tripdata.s3.amazonaws.com/index.html.
#1.Ask
##Business Objective
Maximizing annual memberships by converting casual riders through a targeted marketing approach
##Business Task
How do annual memberships and casual riders use cyclistic bikes differently?
##Stakeholder Involvement
*Lily Moreno: The director of marketing and your manager. Moreno is responsible for the development of campaigns and initiatives to promote the bike-share program
*Cyclistic marketing analytics team: A team of data analysts who are responsible for collecting, analyzing, and reporting data that helps guide Cyclistic marketing strategy.
*Cyclistic executive team: The notoriously detail-oriented executive team will decide whether to approve the recommended marketing program
#2.Prepare
##Location of Data
The data used in the analysis can be found herehttps://divvy-tripdata.s3.amazonaws.com/index.html This data was collected by Motivate International employed by the City of Chicago to collect data about the usage of bikes.
##Data Oganization
The data is stored in csv file.For this research the most current data was used.it included 13 columns.
##Credibilty of Data
Motivate International is employed by the City of Chicago to collect data on Cyclistic Bikes sharing program directly. The information is current and data includes all ride types rather than just a sample.
##Licensing,Privacy, Security and Accessibility
This licensehttps://ride.divvybikes.com/data-license-agreement governs the distribution of this data.The anonymity of the data has been enforced, this protects privacy and at the same time restricts the scoop of the investigation.
##Usefulness of Data
The type of riders and starting and closing trip location has beeen provided. This data is necessary in knowing areas to target for marketing
##Problems with the dataset
data types need to be changed, missing observations and duplicates can be solved by data cleaning.
#3.Process
I chose RStudio Desktop to explore,clean,analyse and visualize the data.
##Loading data and libraries
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.4.0 ✔ purrr 0.3.5
## ✔ tibble 3.1.8 ✔ dplyr 1.0.10
## ✔ tidyr 1.2.1 ✔ stringr 1.4.1
## ✔ readr 2.1.3 ✔ forcats 0.5.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
library(scales)
##
## Attaching package: 'scales'
##
## The following object is masked from 'package:purrr':
##
## discard
##
## The following object is masked from 'package:readr':
##
## col_factor
library(lubridate)
## Loading required package: timechange
##
## Attaching package: 'lubridate'
##
## The following objects are masked from 'package:base':
##
## date, intersect, setdiff, union
library(skimr)
library(knitr)
library(scales)
getwd()
## [1] "C:/Users/dell/Documents/cyclistic folder"
my_data <-read.csv("202210-divvy-tripdata.csv")
#Glimpse of Data
glimpse(my_data)
## Rows: 558,685
## Columns: 13
## $ ride_id <chr> "A50255C1E17942AB", "DB692A70BD2DD4E3", "3C02727AAF…
## $ rideable_type <chr> "classic_bike", "electric_bike", "electric_bike", "…
## $ started_at <chr> "2022-10-14 17:13:30", "2022-10-01 16:29:26", "2022…
## $ ended_at <chr> "2022-10-14 17:19:39", "2022-10-01 16:49:06", "2022…
## $ start_station_name <chr> "Noble St & Milwaukee Ave", "Damen Ave & Charleston…
## $ start_station_id <chr> "13290", "13288", "655", "KA1504000133", "13028", "…
## $ end_station_name <chr> "Larrabee St & Division St", "Damen Ave & Cullerton…
## $ end_station_id <chr> "KA1504000079", "13089", "TA1307000140", "620", "13…
## $ start_lat <dbl> 41.90068, 41.92004, 41.97988, 41.90227, 41.87475, 4…
## $ start_lng <dbl> -87.66260, -87.67794, -87.68190, -87.62769, -87.649…
## $ end_lat <dbl> 41.90349, 41.85497, 41.96640, 41.89820, 41.86610, 4…
## $ end_lng <dbl> -87.64335, -87.67570, -87.68870, -87.63754, -87.607…
## $ member_casual <chr> "member", "casual", "member", "member", "casual", "…
#Missing data
trip<-my_data %>%
filter(complete.cases(.)
)
View(trip)
##Remove duplicates
trip_1<-distinct(trip)
View(trip_1)
summary(trip_1)
## ride_id rideable_type started_at ended_at
## Length:558210 Length:558210 Length:558210 Length:558210
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
## start_station_name start_station_id end_station_name end_station_id
## Length:558210 Length:558210 Length:558210 Length:558210
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
## start_lat start_lng end_lat end_lng
## Min. :41.64 Min. :-87.84 Min. :41.59 Min. :-87.87
## 1st Qu.:41.88 1st Qu.:-87.66 1st Qu.:41.88 1st Qu.:-87.66
## Median :41.90 Median :-87.64 Median :41.90 Median :-87.64
## Mean :41.90 Mean :-87.65 Mean :41.90 Mean :-87.65
## 3rd Qu.:41.93 3rd Qu.:-87.63 3rd Qu.:41.93 3rd Qu.:-87.63
## Max. :42.07 Max. :-87.53 Max. :42.13 Max. :-87.52
## member_casual
## Length:558210
## Class :character
## Mode :character
##
##
##
#Renaming Columns for consistency
trip_2<-trip_1 %>%
rename(bikes=rideable_type,
Users=member_casual
)
View(trip_2)
#Converting column type-bikes and users
trip_3<-trip_2 %>%
mutate(bikes=as_factor(bikes),
Users=as_factor(Users),
started_at=ymd_hms(as_datetime(started_at)),
ended_at=ymd_hms(as_datetime(ended_at))
)
glimpse(trip_3)
## Rows: 558,210
## Columns: 13
## $ ride_id <chr> "A50255C1E17942AB", "DB692A70BD2DD4E3", "3C02727AAF…
## $ bikes <fct> classic_bike, electric_bike, electric_bike, electri…
## $ started_at <dttm> 2022-10-14 17:13:30, 2022-10-01 16:29:26, 2022-10-…
## $ ended_at <dttm> 2022-10-14 17:19:39, 2022-10-01 16:49:06, 2022-10-…
## $ start_station_name <chr> "Noble St & Milwaukee Ave", "Damen Ave & Charleston…
## $ start_station_id <chr> "13290", "13288", "655", "KA1504000133", "13028", "…
## $ end_station_name <chr> "Larrabee St & Division St", "Damen Ave & Cullerton…
## $ end_station_id <chr> "KA1504000079", "13089", "TA1307000140", "620", "13…
## $ start_lat <dbl> 41.90068, 41.92004, 41.97988, 41.90227, 41.87475, 4…
## $ start_lng <dbl> -87.66260, -87.67794, -87.68190, -87.62769, -87.649…
## $ end_lat <dbl> 41.90349, 41.85497, 41.96640, 41.89820, 41.86610, 4…
## $ end_lng <dbl> -87.64335, -87.67570, -87.68870, -87.63754, -87.607…
## $ Users <fct> member, casual, member, member, casual, casual, mem…
#Create extra columns based on started_at and ended_at date-time column
trip_4<-trip_3 %>%
mutate(
Hours_Begin=hour(started_at),
Week_day=wday(started_at,label = T,abbr = F),
Month=month(started_at,label = T,abbr = F),
Day=day(started_at),
Week=strftime(started_at,format = "%V"),
Trip_Time = difftime(ended_at,started_at,units = "mins")
)
View(trip_4)
#Filtering data based on trip_time of 1 min to 1440 mins
trip_5<-trip_4 %>%
filter(between(Trip_Time,1,1440))
## Warning: between() called on numeric vector with S3 class
str(trip_4)
## 'data.frame': 558210 obs. of 19 variables:
## $ ride_id : chr "A50255C1E17942AB" "DB692A70BD2DD4E3" "3C02727AAF60F873" "47E653FDC2D99236" ...
## $ bikes : Factor w/ 3 levels "classic_bike",..: 1 2 2 2 1 2 2 1 1 2 ...
## $ started_at : POSIXct, format: "2022-10-14 17:13:30" "2022-10-01 16:29:26" ...
## $ ended_at : POSIXct, format: "2022-10-14 17:19:39" "2022-10-01 16:49:06" ...
## $ start_station_name: chr "Noble St & Milwaukee Ave" "Damen Ave & Charleston St" "Hoyne Ave & Balmoral Ave" "Rush St & Cedar St" ...
## $ start_station_id : chr "13290" "13288" "655" "KA1504000133" ...
## $ end_station_name : chr "Larrabee St & Division St" "Damen Ave & Cullerton St" "Western Ave & Leland Ave" "Orleans St & Chestnut St (NEXT Apts)" ...
## $ end_station_id : chr "KA1504000079" "13089" "TA1307000140" "620" ...
## $ start_lat : num 41.9 41.9 42 41.9 41.9 ...
## $ start_lng : num -87.7 -87.7 -87.7 -87.6 -87.6 ...
## $ end_lat : num 41.9 41.9 42 41.9 41.9 ...
## $ end_lng : num -87.6 -87.7 -87.7 -87.6 -87.6 ...
## $ Users : Factor w/ 2 levels "member","casual": 1 2 1 1 2 2 1 1 2 1 ...
## $ Hours_Begin : int 17 16 18 7 18 15 15 17 9 12 ...
## $ Week_day : Ord.factor w/ 7 levels "Sunday"<"Monday"<..: 6 7 4 2 5 5 5 4 7 2 ...
## $ Month : Ord.factor w/ 12 levels "January"<"February"<..: 10 10 10 10 10 10 10 10 10 10 ...
## $ Day : int 14 1 19 31 13 13 6 26 22 24 ...
## $ Week : chr "41" "39" "42" "44" ...
## $ Trip_Time : 'difftime' num 6.15 19.6666666666667 7.83333333333333 6.21666666666667 ...
## ..- attr(*, "units")= chr "mins"
View(trip_4)
head(trip_4,10)
## ride_id bikes started_at ended_at
## 1 A50255C1E17942AB classic_bike 2022-10-14 17:13:30 2022-10-14 17:19:39
## 2 DB692A70BD2DD4E3 electric_bike 2022-10-01 16:29:26 2022-10-01 16:49:06
## 3 3C02727AAF60F873 electric_bike 2022-10-19 18:55:40 2022-10-19 19:03:30
## 4 47E653FDC2D99236 electric_bike 2022-10-31 07:52:36 2022-10-31 07:58:49
## 5 8B5407BE535159BF classic_bike 2022-10-13 18:41:03 2022-10-13 19:26:18
## 6 A177C92E9F021B99 electric_bike 2022-10-13 15:53:27 2022-10-13 15:59:17
## 7 DF5EC7678DE3C2B3 electric_bike 2022-10-06 15:51:21 2022-10-06 15:55:06
## 8 407DE6D80130A297 classic_bike 2022-10-26 17:30:10 2022-10-26 17:37:57
## 9 45EEAF68A1A051CA classic_bike 2022-10-22 09:47:56 2022-10-22 09:57:42
## 10 66CD8E4D0C38C0F3 electric_bike 2022-10-24 12:39:47 2022-10-24 12:48:36
## start_station_name start_station_id
## 1 Noble St & Milwaukee Ave 13290
## 2 Damen Ave & Charleston St 13288
## 3 Hoyne Ave & Balmoral Ave 655
## 4 Rush St & Cedar St KA1504000133
## 5 900 W Harrison St 13028
## 6 900 W Harrison St 13028
## 7 900 W Harrison St 13028
## 8 Rush St & Cedar St KA1504000133
## 9 Noble St & Milwaukee Ave 13290
## 10 Noble St & Milwaukee Ave 13290
## end_station_name end_station_id start_lat start_lng
## 1 Larrabee St & Division St KA1504000079 41.90068 -87.66260
## 2 Damen Ave & Cullerton St 13089 41.92004 -87.67794
## 3 Western Ave & Leland Ave TA1307000140 41.97988 -87.68190
## 4 Orleans St & Chestnut St (NEXT Apts) 620 41.90227 -87.62769
## 5 Adler Planetarium 13431 41.87475 -87.64981
## 6 Loomis St & Lexington St 13332 41.87472 -87.64983
## 7 Halsted St & Roosevelt Rd TA1305000017 41.87471 -87.64987
## 8 Larrabee St & Division St KA1504000079 41.90231 -87.62769
## 9 Kingsbury St & Kinzie St KA1503000043 41.90068 -87.66260
## 10 Damen Ave & Cortland St 13133 41.90070 -87.66257
## end_lat end_lng Users Hours_Begin Week_day Month Day Week
## 1 41.90349 -87.64335 member 17 Friday October 14 41
## 2 41.85497 -87.67570 casual 16 Saturday October 1 39
## 3 41.96640 -87.68870 member 18 Wednesday October 19 42
## 4 41.89820 -87.63754 member 7 Monday October 31 44
## 5 41.86610 -87.60727 casual 18 Thursday October 13 41
## 6 41.87219 -87.66150 casual 15 Thursday October 13 41
## 7 41.86732 -87.64862 member 15 Thursday October 6 40
## 8 41.90349 -87.64335 member 17 Wednesday October 26 43
## 9 41.88918 -87.63851 casual 9 Saturday October 22 42
## 10 41.91598 -87.67733 member 12 Monday October 24 43
## Trip_Time
## 1 6.150000 mins
## 2 19.666667 mins
## 3 7.833333 mins
## 4 6.216667 mins
## 5 45.250000 mins
## 6 5.833333 mins
## 7 3.750000 mins
## 8 7.783333 mins
## 9 9.766667 mins
## 10 8.816667 mins
View(trip_5)
#Data Summary
summary(trip_5)
## ride_id bikes started_at
## Length:545089 classic_bike :210221 Min. :2022-10-01 00:00:15.00
## Class :character electric_bike:322603 1st Qu.:2022-10-08 00:57:44.00
## Mode :character docked_bike : 12265 Median :2022-10-15 15:01:36.00
## Mean :2022-10-16 00:27:21.63
## 3rd Qu.:2022-10-23 15:09:43.00
## Max. :2022-10-31 23:59:33.00
##
## ended_at start_station_name start_station_id
## Min. :2022-10-01 00:02:52.00 Length:545089 Length:545089
## 1st Qu.:2022-10-08 01:10:31.00 Class :character Class :character
## Median :2022-10-15 15:19:40.00 Mode :character Mode :character
## Mean :2022-10-16 00:41:46.25
## 3rd Qu.:2022-10-23 15:31:27.00
## Max. :2022-11-01 08:59:46.00
##
## end_station_name end_station_id start_lat start_lng
## Length:545089 Length:545089 Min. :41.64 Min. :-87.84
## Class :character Class :character 1st Qu.:41.88 1st Qu.:-87.66
## Mode :character Mode :character Median :41.90 Median :-87.64
## Mean :41.90 Mean :-87.65
## 3rd Qu.:41.93 3rd Qu.:-87.63
## Max. :42.07 Max. :-87.53
##
## end_lat end_lng Users Hours_Begin
## Min. :41.59 Min. :-87.87 member:341269 Min. : 0.00
## 1st Qu.:41.88 1st Qu.:-87.66 casual:203820 1st Qu.:11.00
## Median :41.90 Median :-87.64 Median :15.00
## Mean :41.90 Mean :-87.65 Mean :13.92
## 3rd Qu.:41.93 3rd Qu.:-87.63 3rd Qu.:17.00
## Max. :42.13 Max. :-87.52 Max. :23.00
##
## Week_day Month Day Week
## Sunday : 92647 October :545089 Min. : 1.00 Length:545089
## Monday : 83321 January : 0 1st Qu.: 8.00 Class :character
## Tuesday : 54081 February: 0 Median :15.00 Mode :character
## Wednesday: 67910 March : 0 Mean :15.42
## Thursday : 70114 April : 0 3rd Qu.:23.00
## Friday : 69372 May : 0 Max. :31.00
## Saturday :107644 (Other) : 0
## Trip_Time
## Length:545089
## Class :difftime
## Mode :numeric
##
##
##
##
#Conclusion
Based on the summary, there are members:341269 and casual:203820. The field name formally rideable_types now bikes have three categories classic_bikes:210221, electric_bikes:322603 and docked_bikes:12265 Saturday of the weekdays provided the highest values of 107663.
#4.Analysis
#Creating Data frames for analysis
Trips_time_data_frame<-trip_5 %>%
drop_na(end_lat,end_lng) %>%
select(ride_id,Users,bikes,Hours_Begin,Week_day,Month,Day,Week,Trip_Time)
colSums(is.na(Trips_time_data_frame))
## ride_id Users bikes Hours_Begin Week_day Month
## 0 0 0 0 0 0
## Day Week Trip_Time
## 0 0 0
View(Trips_time_data_frame)
Trip_point_data_frame<-trip_5 %>%
select(ride_id,start_station_name,end_station_name,start_lat,end_lat,start_lng,end_lng,Users,Trip_Time) %>%
drop_na(start_station_name,end_station_name)
View(Trip_point_data_frame)
colSums(is.na(Trip_point_data_frame))
## ride_id start_station_name end_station_name start_lat
## 0 0 0 0
## end_lat start_lng end_lng Users
## 0 0 0 0
## Trip_Time
## 0
#5.Data Visualization
#Graphs on Users and hour of day
library(ggplot2)
ride_hours<-Trips_time_data_frame %>%
group_by(Users,Hours_Begin) %>%
summarise(
number_trips=n(),
average_trips=mean(Trip_Time),
total_trips=sum(Trip_Time)
)
## `summarise()` has grouped output by 'Users'. You can override using the
## `.groups` argument.
View(ride_hours)
#Number of Trips per hour
ride_hours %>%
ggplot(aes(Hours_Begin,number_trips,fill=Users))+geom_col(position = "dodge")+scale_y_continuous()+
labs(title = "Number of Trips per Hours",caption = "Figure 1",x="hour of the day",y="number of Trips",)+
theme()
#Average Number of Trips per hour
ride_hours %>%
ggplot(aes(x=Hours_Begin,y=average_trips,fill=Users))+geom_col(position = "dodge")+scale_y_continuous()+
labs(title = "Average number of Trips per Hour",caption = "Figure 2",x="hour of the day",y="average number of Trips",)+
theme()
#Total Number of Trips per hour
ride_hours %>%
ggplot(aes(x=Hours_Begin,y=total_trips,fill=Users))+geom_col(position = "dodge")+scale_y_continuous()+
labs(title = "total number of Trips per Hour",caption = "Figure 3",x="hour of the day",y="total number of Trips",)+
theme()
ride_week<-Trips_time_data_frame %>%
group_by(Users,Week_day) %>%
summarise(
average_trips_week=mean(Trip_Time),
total_trips_week=sum(Trip_Time),
number_trips_week=n())
## `summarise()` has grouped output by 'Users'. You can override using the
## `.groups` argument.
#Number of trips per week
ride_week %>%
ggplot(aes(x=Week_day,y=number_trips_week,fill=Users))+geom_col(position = "dodge")+scale_y_continuous()+
labs(title = "number of Trips per week",caption = "Figure 4",x="Day of Week",y="number of Trips",)+
theme()
#Average number of trips per week
ride_week %>%
ggplot(aes(x=Week_day,y=average_trips_week,fill=Users))+geom_col(position = "dodge")+scale_y_continuous()+
labs(title = "Average number of Trips per week",caption = "Figure 5",x="Day of Week",y="Average number of Trips",)+
theme()
#Total number of trips per week
ride_week %>%
ggplot(aes(x=Week_day,y=total_trips_week,fill=Users))+geom_col(position = "dodge")+scale_y_continuous()+
labs(title = "Total number of Trips per week",caption = "Figure 6",x="Day of Week",y="Total number of Trips",)+
theme()
#Conclusion
ride_day<-Trips_time_data_frame %>%
group_by(Users,Day) %>%
summarise(number_trips_day=n(),
avg_trip_day=mean(Trip_Time),
Total_trip_day=sum(Trip_Time)
)
## `summarise()` has grouped output by 'Users'. You can override using the
## `.groups` argument.
#Number of trips per day
ride_day %>%
ggplot(aes(x=Day,y=number_trips_day,fill=Users))+geom_col(position = "dodge")+scale_y_continuous()+
labs(title = "number of Trips per day",caption = "Figure 7",x="number of Trips",y="Days",)+
theme()
#Average number of trips per day
ride_day %>%
ggplot(aes(x=Day,y=avg_trip_day,fill=Users))+geom_col(position = "dodge")+scale_y_continuous()+
labs(title = "Average number of Trips per day",caption = "Figure 8",x="Average number of Trips",y="Days",)+
theme()
#Total number of trips per day
ride_day %>%
ggplot(aes(x=Day,y=Total_trip_day,fill=Users))+geom_col(position = "dodge")+scale_y_continuous()+
labs(title = "Total number of Trips per day",caption = "Figure 9",x="Total number of Trips",y="Days",)+
theme()
#Analysis based on type of bikes
ride_bikes<-Trips_time_data_frame %>%
group_by(Users,bikes,Trip_Time) %>%
summarise(no_trips=n(),
average_no_trips=mean(Trip_Time),
Total_no_trips=sum(Trip_Time)
)
## `summarise()` has grouped output by 'Users', 'bikes'. You can override using
## the `.groups` argument.
#Number of Trips per bike
ride_bikes %>%
ggplot(aes(x=bikes,y=no_trips,fill=Users))+geom_col(position = "dodge")+scale_y_continuous()+
labs(title = "Number of Trips per Bikes",caption = "Figure 10",x="bike types",y="number of trips")+
theme()
#Average Number of Trips per bike
ride_bikes %>%
ggplot(aes(x=bikes,y=average_no_trips,fill=Users))+geom_col(position = "dodge")+scale_y_continuous()+
labs(title = "Average Number of Trips per Bikes",caption = "Figure 11",x="bike types",y="Average number of trips")+
theme()
#Total Number of Trips per bike
ride_bikes %>%
ggplot(aes(x=bikes,y=Total_no_trips,fill=Users))+geom_col(position = "dodge")+scale_y_continuous()+
labs(title = "Total Number of Trips per Bikes",caption = "Figure 12",x="bike types",y="Total number of trips")+
theme()
#Analysis by Location
starting_station<-Trip_point_data_frame %>%
group_by(
Users,start_station_name,start_lat,start_lng) %>%
summarise(
no_of_trips=n(),
) %>%
arrange(-no_of_trips)
## `summarise()` has grouped output by 'Users', 'start_station_name', 'start_lat'.
## You can override using the `.groups` argument.
View(starting_station)
colSums(is.na(starting_station))
## Users start_station_name start_lat start_lng
## 0 0 0 0
## no_of_trips
## 0
View(starting_station)
ending_station<-Trip_point_data_frame %>%
group_by(
Users,end_station_name,end_lat,end_lng) %>%
summarise(
no_of_trip_end=n()
) %>%
arrange(-no_of_trip_end)
## `summarise()` has grouped output by 'Users', 'end_station_name', 'end_lat'. You
## can override using the `.groups` argument.
View(ending_station)
#Ten Most popular start stations
starting_station[1:10,] %>%
ggplot(aes(no_of_trips,start_station_name,fill=Users))+
geom_col(position = "dodge")+labs(title = "10 most liked start station",caption = "Figure 13",x="Number of Trips",y="Station Names")+
theme()
#Ten Most popular end stations
ending_station[1:10,] %>%
ggplot(aes(no_of_trip_end,end_station_name,fill=Users))+
geom_col(position = "dodge")+labs(title = "10 most liked end stations",caption = "Figure 14", x="Number of Trips",y="Station Names")+
theme()
Exported files for further Analysis
write.csv(trip_3), trip_3.csv
write.csv(Trip_point_data_frame), Trip_point_data_frame.csv
write.csv(Trip_time_data_frame),Trip_time_data_frame.csv
write.csv(starting_station), starting_station.csv
write.csv(ending_station), ending_station.csv
#5.Share
#6.Act Recommendation
Marketing campaign should target Streeter Dr & Grand Avenue which is the most popular start and end station used by casual riders.