#NOTE: This is a revision. Originally, observations with any blank fields were omitted. This removed over 20% of the data. It was determined that the observations with some blank fields should still be included and would not disrupt proper analysis or skew results. Most blank fields occurred for the following variables: start station name, start station ID, end station name, end station ID. Because latitude and longitude data is complete, it was determined this would be a sufficient substitute during analysis. Other than trip counts, the biggest impact this adjustment had was on the rideable type preference for casual riders (revised from classic bike to electric bike), and rank of the most popular stations.
knitr::opts_chunk$set(warning = FALSE, message = FALSE)
Task:
You are a junior data analyst working in the marketing analyst team at Cyclistic, a bike-share company in Chicago.
The director of marketing believes the company’s future success depends on maximizing the number of annual memberships. Therefore, your team wants to understand how casual riders and annual members use Cyclistic bikes differently. From these insights, your team will design a new marketing strategy to convert casual riders into annual members. But first, Cyclistic executives must approve your recommendations, so they must be backed up with compelling data insights and professional data visualizations.
In 2016, Cyclistic launched a successful bike-share offering. Since then, the program has grown to a fleet of 5,824 bicycles that are geotracked and locked into a network of 692 stations across Chicago. The bikes can be unlocked from one station and returned to any other station in the system anytime. Until now, Cyclistic’s marketing strategy relied on building general awareness and appealing to broad consumer segments. One approach that helped make these things possible was the flexibility of its pricing plans: single-ride passes, full-day passes, and annual memberships. Customers who purchase single-ride or full-day passes are referred to as casual riders. Customers who purchase annual memberships are Cyclistic members.
Cyclistic’s finance analysts have concluded that annual members are much more profitable than casual riders. Although the pricing flexibility helps Cyclistic attract more customers, Moreno believes that maximizing the number of annual members will be key to future growth. Rather than creating a marketing campaign that targets all-new customers, Moreno believes there is a very good chance to convert casual riders into members. She notes that casual riders are already aware of the Cyclistic program and have chosen Cyclistic for their mobility needs.
Moreno has set a clear goal: Design marketing strategies aimed at converting casual riders into annual members. In order to do that, however, the marketing analyst team needs to better understand how annual members and casual riders differ, why casual riders would buy a membership, and how digital media could affect their marketing tactics. Moreno and her team are interested in analyzing the Cyclistic historical bike trip data to identify trends.
Stakeholders:
Cyclistic: A bike-share program that features more than 5,800 bicycles and 600 docking stations. Cyclistic sets itself apart by also offering reclining bikes, hand tricycles, and cargo bikes, making bike-share more inclusive to people with disabilities and riders who can’t use a standard two-wheeled bike. The majority of riders opt for traditional bikes; about 8% of riders use the assistive options. Cyclistic users are more likely to ride for leisure, but about 30% use them to commute to work each day.
Lily Moreno: The director of marketing and your manager. Moreno is responsible for the development of campaigns and initiatives to promote the bike-share program. These may include email, social media, and other channels.
Cyclistic marketing analytics team: A team of data analysts who are responsible for collecting, analyzing, and reporting data that helps guide Cyclistic marketing strategy. You joined this team six months ago and have been busy learning about Cyclistic’s mission and business goals — as well as how you, as a junior data analyst, can help Cyclistic achieve them.
Cyclistic executive team: The notoriously detail-oriented executive team will decide whether to approve the recommended marketing program.
Three questions will guide the future marketing program:
Moreno has assigned you the first question to answer: How do annual members and casual riders use Cyclistic bikes differently?
You will produce a report with the following deliverables:
#SET WORKING DIRECTORY ---------------------------------------------------------
setwd("/Users/jenmason/Documents")
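options(repos = list(CRAN="http://cran.rstudio.com/"))
#LOAD PACKAGES -----------------------------------------------------------------
#packages used by the code below; assumed to be installed already.
library(dplyr) #%>%, mutate, group_by, summarize, distinct, count, filter, case_when
library(ggplot2) #plots
library(janitor) #clean_names, remove_empty
library(lubridate) #year, month
#library(skimr) #only needed if the skim() call below is uncommented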
#IMPORT DATA -------------------------------------------------------------------
table2021_11 <- read.csv("202111-divvy-tripdata.csv")
table2021_12 <- read.csv("202112-divvy-tripdata.csv")
table2022_01 <- read.csv("202201-divvy-tripdata.csv")
table2022_02 <- read.csv("202202-divvy-tripdata.csv")
table2022_03 <- read.csv("202203-divvy-tripdata.csv")
table2022_04 <- read.csv("202204-divvy-tripdata.csv")
table2022_05 <- read.csv("202205-divvy-tripdata.csv")
table2022_06 <- read.csv("202206-divvy-tripdata.csv")
table2022_07 <- read.csv("202207-divvy-tripdata.csv")
table2022_08 <- read.csv("202208-divvy-tripdata.csv")
table2022_09 <- read.csv("202209-divvy-tripdata.csv")
table2022_10 <- read.csv("202210-divvy-tripdata.csv")
#combine/merge CSV files
data_all <- rbind(table2021_11, table2021_12, table2022_01, table2022_02,
table2022_03, table2022_04, table2022_05, table2022_06,
table2022_07, table2022_08, table2022_09, table2022_10)
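The twelve monthly files could also be read in a loop instead of one `read.csv()` call per file. This is a minimal sketch, assuming the files follow the `YYYYMM-divvy-tripdata.csv` naming shown above. The folder `demo_dir`, the object `data_all_demo`, and the toy rows are hypothetical stand-ins so the sketch runs without the real downloads.

```r
# Hypothetical stand-in files so the sketch runs without the real data.
demo_dir <- file.path(tempdir(), "divvy_demo")
dir.create(demo_dir, showWarnings = FALSE)
toy <- data.frame(ride_id = c("A1", "B2"),
                  member_casual = c("casual", "member"))
write.csv(toy, file.path(demo_dir, "202111-divvy-tripdata.csv"), row.names = FALSE)
write.csv(toy, file.path(demo_dir, "202112-divvy-tripdata.csv"), row.names = FALSE)

# List every monthly file, read each one, and stack the results.
csv_files <- list.files(demo_dir, pattern = "-divvy-tripdata\\.csv$", full.names = TRUE)
monthly_tables <- lapply(csv_files, read.csv)
data_all_demo <- do.call(rbind, monthly_tables)

nrow(data_all_demo)
```

With the real data, `demo_dir` would be replaced by the folder that holds the downloaded files.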
#EXPLORE DATA ------------------------------------------------------------------
#preview data
View(data_all)
#view data structure.
str(data_all)
## 'data.frame': 5755694 obs. of 13 variables:
## $ ride_id : chr "7C00A93E10556E47" "90854840DFD508BA" "0A7D10CDD144061C" "2F3BE33085BCFF02" ...
## $ rideable_type : chr "electric_bike" "electric_bike" "electric_bike" "electric_bike" ...
## $ started_at : chr "2021-11-27 13:27:38" "2021-11-27 13:38:25" "2021-11-26 22:03:34" "2021-11-27 09:56:49" ...
## $ ended_at : chr "2021-11-27 13:46:38" "2021-11-27 13:56:10" "2021-11-26 22:05:56" "2021-11-27 10:01:50" ...
## $ start_station_name: chr "" "" "" "" ...
## $ start_station_id : chr "" "" "" "" ...
## $ end_station_name : chr "" "" "" "" ...
## $ end_station_id : chr "" "" "" "" ...
## $ start_lat : num 41.9 42 42 41.9 41.9 ...
## $ start_lng : num -87.7 -87.7 -87.7 -87.8 -87.6 ...
## $ end_lat : num 42 41.9 42 41.9 41.9 ...
## $ end_lng : num -87.7 -87.7 -87.7 -87.8 -87.6 ...
## $ member_casual : chr "casual" "casual" "casual" "casual" ...
#skim data for empty fields, unique values, whitespace.
#skim(data_all)
#showed empty fields for station names and IDs; no whitespace; appropriate number of unique values for rideable type and membership.
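The blank fields can also be counted without `skim()`. The sketch below uses base R only and treats both `NA` and the empty string as blank. The data frame `toy_trips` and its values are made up for illustration.

```r
# Toy data with the same kind of gaps as the trip data (values are made up).
toy_trips <- data.frame(
  ride_id            = c("A1", "B2", "C3", "D4"),
  start_station_name = c("", "Station A", "", "Station B"),
  end_lat            = c(41.9, NA, 41.8, 42.0),
  stringsAsFactors   = FALSE
)

# Count the values in each column that are NA or an empty string.
blank_counts <- sapply(toy_trips, function(col) sum(is.na(col) | col == ""))
blank_counts
```

Applied to `data_all`, the same `sapply()` call would give one blank count per column.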
#get initial summary of data
summary(data_all)
## ride_id rideable_type started_at ended_at
## Length:5755694 Length:5755694 Length:5755694 Length:5755694
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
##
## start_station_name start_station_id end_station_name end_station_id
## Length:5755694 Length:5755694 Length:5755694 Length:5755694
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
##
## start_lat start_lng end_lat end_lng
## Min. :41.64 Min. :-87.84 Min. :41.39 Min. :-88.97
## 1st Qu.:41.88 1st Qu.:-87.66 1st Qu.:41.88 1st Qu.:-87.66
## Median :41.90 Median :-87.64 Median :41.90 Median :-87.64
## Mean :41.90 Mean :-87.65 Mean :41.90 Mean :-87.65
## 3rd Qu.:41.93 3rd Qu.:-87.63 3rd Qu.:41.93 3rd Qu.:-87.63
## Max. :45.64 Max. :-73.80 Max. :42.37 Max. :-87.30
## NA's :5835 NA's :5835
## member_casual
## Length:5755694
## Class :character
## Mode :character
##
##
##
##
Summary of Data Exploration
#DATA CLEANING------------------------------------------------------------------
#clean names; include only characters, numbers, and underscores in names.
data_all_v02 <- clean_names(data_all)
#remove empty rows and columns.
data_all_v03 <- remove_empty(data_all_v02)
#remove duplicates; select only unique rows.
data_all_v04 <- distinct(data_all_v03)
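Note that `distinct()` only removes rows that are identical in every column. Rows that share a `ride_id` but differ elsewhere would survive, so a separate check on the identifier may be worthwhile. A minimal sketch on made-up data (`toy_trips` is hypothetical):

```r
toy_trips <- data.frame(
  ride_id       = c("A1", "B2", "B2", "C3"),
  rideable_type = c("classic_bike", "electric_bike", "classic_bike", "docked_bike"),
  stringsAsFactors = FALSE
)

# Whole-row duplicates: none here, because the two "B2" rows differ.
whole_row_dupes <- sum(duplicated(toy_trips))

# Repeated identifiers: flags the second "B2" row.
repeated_ids <- toy_trips$ride_id[duplicated(toy_trips$ride_id)]
repeated_ids
```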
#check for naming anomalies; include only appropriate entries for variables with limited response types.
count(data_all_v04, rideable_type)
count(data_all_v04, member_casual)
#convert to date-time format.
data_all_v05 <- data_all_v04
data_all_v05$started_at <- as.POSIXct(data_all_v05$started_at, format = "%Y-%m-%d %H:%M:%S")
data_all_v05$ended_at <- as.POSIXct(data_all_v05$ended_at, format = "%Y-%m-%d %H:%M:%S")
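A quick way to confirm the conversion worked is to count the values that failed to parse, since `as.POSIXct()` returns `NA` for strings that do not match the format. The sketch below uses the two timestamps from the first row of the `str()` output above. The time zone `America/Chicago` is an assumption, as the source data does not state one.

```r
fmt <- "%Y-%m-%d %H:%M:%S"

# Timestamps from the first row of the str() output; time zone is assumed.
started <- as.POSIXct("2021-11-27 13:27:38", format = fmt, tz = "America/Chicago")
ended   <- as.POSIXct("2021-11-27 13:46:38", format = fmt, tz = "America/Chicago")

# A string that does not match the format becomes NA instead of raising an error.
bad <- as.POSIXct("not a timestamp", format = fmt, tz = "America/Chicago")

parse_failures <- sum(is.na(c(started, ended, bad)))
duration_secs  <- as.numeric(difftime(ended, started, units = "secs"))
```

Here the ride lasts 19 minutes, so `duration_secs` is 1140.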
#In previous versions, this is where any incomplete cases were omitted. This removed more than 20% of observations. After reconsideration, this action may not have been necessary and will not be performed in this version.
#check for data added/removed during analysis period ----
#rideable type:
#create data frame to see when a unique rideable type was added.
rideable_type_check <- data_all_v05 %>%
mutate(year = year(started_at),
month = month(started_at)) %>%
group_by(month, year) %>%
select(rideable_type, month, year) %>%
count(rideable_type)
#no unique types added or removed during the analysis period.
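The same check can be expressed as a cross-tabulation of month against rideable type, where a zero cell means a type was absent that month. A minimal base R sketch on made-up rows (`toy_trips` is hypothetical):

```r
toy_trips <- data.frame(
  month = c("2021-11", "2021-11", "2021-12", "2021-12", "2021-12"),
  rideable_type = c("classic_bike", "electric_bike",
                    "classic_bike", "electric_bike", "electric_bike"),
  stringsAsFactors = FALSE
)

# Cross-tabulate months against rideable types.
type_by_month <- table(toy_trips$month, toy_trips$rideable_type)

# TRUE only when every type appears in every month.
all_types_every_month <- all(type_by_month > 0)
```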
#parse date for analysis; create columns for year, month, day, hour; format codes are case-sensitive (%Y, %m, %d, %H, %M, %S).
data_all_v06 <- data_all_v05
data_all_v06$year <- format(data_all_v06$started_at, "%Y")
data_all_v06$month <- format(data_all_v06$started_at, "%m")
data_all_v06$day <- format(data_all_v06$started_at, "%d")
data_all_v06$hour <- format(data_all_v06$started_at, "%H")
Summary of Data Cleaning:
#days of the week:
#create column for day of the week.
data_all_v06$day_of_week <- format(data_all_v06$started_at,"%A")
#convert to factor w/levels, specify order (M-S).
data_all_v06$day_of_week <- factor(data_all_v06$day_of_week, levels = c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"))
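One caveat: `format(x, "%A")` returns day names in the session's locale, so the English factor levels would produce `NA` on a non-English system. The sketch below is one locale-independent alternative, assuming English labels are wanted. `day_names` is a hypothetical helper vector, and the time zone is an assumption.

```r
# English day names indexed by the POSIXlt wday field (0 = Sunday ... 6 = Saturday).
day_names <- c("Sunday", "Monday", "Tuesday", "Wednesday",
               "Thursday", "Friday", "Saturday")

started <- as.POSIXct("2021-11-27 13:27:38", tz = "America/Chicago")
day_label <- day_names[as.POSIXlt(started)$wday + 1]

# Same Monday-to-Sunday ordering as the factor above.
day_factor <- factor(day_label, levels = c(day_names[2:7], day_names[1]))
```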
#trip duration:
#calculate trip duration for analysis; ensure data type is numeric.
data_all_v06$trip_duration_seconds <- as.numeric(difftime(data_all_v06$ended_at, data_all_v06$started_at, units = "secs"))
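#remove trips < 1 minute.
data_all_v07 <- data_all_v06 %>%
filter(trip_duration_seconds >= 60)
#carry the filtered data forward under the name used in the outlier steps below.
data_all_v08 <- data_all_v07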
#check for outlier data; Rosner test to detect multiple outliers.
install.packages("EnvStats")
library(EnvStats)
test_outliers <- rosnerTest(data_all_v08$trip_duration_seconds)
#outlier data is present.
#arrange/order data by trip duration, descending; inspect long trips.
order_by_trip_duration_desc <- data_all_v08[order(data_all_v08$trip_duration_seconds ,decreasing=TRUE),]
#remove outlier data using IQR method; find upper and lower limit of the interquartile range; filter data.
quartiles <- quantile(data_all_v08$trip_duration_seconds, probs=c(.25, .75), na.rm = FALSE)
#store the interquartile range under its own name so the IQR() function is not masked.
iqr_value <- IQR(data_all_v08$trip_duration_seconds)
Lower <- quartiles[1] - 1.5*iqr_value
Upper <- quartiles[2] + 1.5*iqr_value
data_no_outliers_IQR <- subset(data_all_v08, data_all_v08$trip_duration_seconds > Lower & data_all_v08$trip_duration_seconds < Upper)
#this removed 414,368 observations.
data_all_v09 <- data_no_outliers_IQR
summary(data_all_v09$trip_duration_seconds)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 60.0 349.0 588.0 723.6 978.0 2279.0
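The IQR rule is easier to follow on a small vector. In the sketch below the ten durations are made up. With R's default quantile method the quartiles are 195 and 465 seconds, so the fences fall at -210 and 870 seconds and only the 6000-second trip is removed.

```r
# Ten made-up trip durations in seconds, with one extreme value.
durations <- c(60, 120, 180, 240, 300, 360, 420, 480, 540, 6000)

q <- quantile(durations, probs = c(0.25, 0.75))
iqr_value <- q[[2]] - q[[1]]

# Fences at 1.5 * IQR below the first and above the third quartile.
lower <- q[[1]] - 1.5 * iqr_value
upper <- q[[2]] + 1.5 * iqr_value

kept <- durations[durations > lower & durations < upper]
```

Because trip durations are right-skewed, the lower fence is negative here and only long trips are affected.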
#inspect trip duration.
order_by_trip_duration_desc <- data_all_v09[order(data_all_v09$trip_duration_seconds ,decreasing=TRUE),]
order_by_trip_duration_asc <- data_all_v09[order(data_all_v09$trip_duration_seconds),]
#create column with trip duration in minutes.
data_all_v10 <- data_all_v09
data_all_v10$trip_duration_minutes <- data_all_v10$trip_duration_seconds/60
#create column with trip duration in hours.
data_all_v10$trip_duration_hours <- data_all_v10$trip_duration_minutes/60
#create bins for trips < 1 hour (the majority of trips).
data_all_v10 <- data_all_v10 %>%
mutate(trip_interval_minutes
= case_when(trip_duration_minutes >00 & trip_duration_minutes <=05 ~'0-5 minutes',
trip_duration_minutes >05 & trip_duration_minutes <=10 ~'5-10 minutes',
trip_duration_minutes >10 & trip_duration_minutes <=15 ~'10-15 minutes',
trip_duration_minutes >15 & trip_duration_minutes <=20 ~'15-20 minutes',
trip_duration_minutes >20 & trip_duration_minutes <=25 ~'20-25 minutes',
trip_duration_minutes >25 & trip_duration_minutes <=30 ~'25-30 minutes',
trip_duration_minutes >30 & trip_duration_minutes <=35 ~'30-35 minutes',
trip_duration_minutes >35 & trip_duration_minutes <=40 ~'35-40 minutes',
trip_duration_minutes >40 & trip_duration_minutes <=45 ~'40-45 minutes',
trip_duration_minutes >45 & trip_duration_minutes <=50 ~'45-50 minutes',
trip_duration_minutes >50 & trip_duration_minutes <=55 ~'50-55 minutes',
trip_duration_minutes >55 & trip_duration_minutes <=60 ~'55-60 minutes'))
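The twelve `case_when()` conditions could be replaced by a single call to base R's `cut()`, which builds the same right-closed intervals (greater than the lower bound, up to and including the upper bound). A minimal sketch on made-up durations, where `breaks`, `bin_labels`, and `trip_minutes` are hypothetical names:

```r
# Bin edges every 5 minutes from 0 to 60.
breaks <- seq(0, 60, by = 5)

# Labels in the same "0-5 minutes" style used above.
bin_labels <- paste0(head(breaks, -1), "-", tail(breaks, -1), " minutes")

trip_minutes <- c(1, 5, 5.01, 37.98)

# right = TRUE gives intervals of the form (lower, upper].
trip_interval <- cut(trip_minutes, breaks = breaks, labels = bin_labels, right = TRUE)
as.character(trip_interval)
```

`cut()` returns a factor whose levels are already in bin order, so no separate re-levelling step would be needed before plotting.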
#membership:
#create separate data frames for casual riders and members; may be useful for analysis.
data_member <- data_all_v10 %>%
filter(member_casual == "member")
data_casual <- data_all_v10 %>%
filter(member_casual == "casual")
Summary of Data Preparation
#summarize number of trips by membership.
num_trips_by_membership <- table(data_all_v10$member_casual) %>%
as.data.frame()
#create pie chart to show number of trips by membership.
ggplot(num_trips_by_membership, aes(x = "", y = Freq, fill = Var1)) +
geom_bar(stat = "identity", width = 1, color="white") +
coord_polar("y", start = 0, direction = -1) +
theme_void() +
theme(axis.line = element_blank(), axis.ticks = element_blank(), plot.title = element_text(hjust = 0.5)) +
ggtitle("Trips per Membership Type") +
theme(plot.title=element_text( hjust=0.5, vjust=0.5, face='bold')) +
labs(fill = "Membership Type", x = NULL, y = NULL) +
geom_text(aes(label = paste(round(Freq / sum(Freq) * 100, 1), "%")), position = position_stack(vjust = 0.5))
#create bar plot to show number of trips by membership (alternative to pie chart).
ggplot(data_all_v10, aes(x=factor(member_casual))) +
geom_bar(stat="count", position = "dodge", width=0.6) +
scale_y_continuous(name = "Number of Trips", labels = function(x) format(x, big.mark = ",", scientific = FALSE)) +
labs(x= "Membership Type", fill = "Membership Type") +
coord_cartesian(ylim = c(1000000,3500000)) +
ggtitle("Trips per Membership Type") +
theme(plot.title=element_text( hjust=0.5, vjust=0.5, face='bold'))
Casual riders took approximately 2 million trips while members took approximately 3.2 million trips.
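The percentage labels on the pie chart come from dividing each count by the total. The same shares can be computed directly with `prop.table()`. The sketch uses five made-up trips rather than the real counts.

```r
# Five made-up trips: two casual, three member.
membership <- c(rep("casual", 2), rep("member", 3))

trip_counts <- table(membership)

# Share of all trips per membership type, in percent.
trip_share <- round(prop.table(trip_counts) * 100, 1)
trip_share
```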
#plot number of trips per day of the week, grouped by membership (bar plot).
ggplot(data = data_all_v10) +
geom_bar(mapping = aes(x=day_of_week, fill=member_casual), position = "dodge") +
scale_x_discrete(name = "Day of the Week",
limits = c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday")) +
scale_y_continuous(name = "Number of Trips", labels = function(x) format(x, big.mark = ",", scientific = FALSE)) +
ggtitle("Trips per Day of the Week") +
theme(plot.title=element_text( hjust=0.5, vjust=0.5, face='bold')) +
labs(fill = "Membership Type")
Insight: Members make more trips Monday-Friday while casual riders make more trips on the weekends.
#summarize mean trip duration by membership.
data_all_v10 %>%
group_by(member_casual) %>%
summarize(mean_td_minutes=mean(trip_duration_minutes))
#summarize mean trip duration by rideable type, for each membership type.
data_all_v10 %>%
group_by(rideable_type, member_casual) %>%
summarize(mean_td_minutes=mean(trip_duration_minutes))
#create dataframe for summarized data.
mean_trip_duration_by_rideable_type <- data_all_v10 %>%
group_by(rideable_type, member_casual) %>%
summarize(mean_td_minutes=mean(trip_duration_minutes))
#substitute zero for missing value.
mean_trip_duration_by_rideable_type[nrow(mean_trip_duration_by_rideable_type) + 1,] <- list("docked_bike", "member", 0)
#create bar plot for mean trip duration by rideable type, for each membership type.
mean_trip_duration_by_rideable_type %>%
ggplot(aes(x = rideable_type, y = mean_td_minutes, fill = member_casual)) +
geom_col(position = position_dodge(), width=0.6) +
coord_cartesian(ylim = c(9,20)) +
labs(x = "Rideable Type", fill = "Membership Type") +
scale_x_discrete(labels=function(x) gsub("_", " ", x, fixed=TRUE)) +
scale_y_continuous(name = "Mean Trip Duration (minutes)") +
ggtitle("Mean Trip Duration by Rideable Type") +
theme(plot.title=element_text( hjust=0.5, vjust=0.5, face='bold'))
Insight: On average, casual riders took longer trips (14 minutes) than members (11 minutes). The longest trips (20 minutes) were taken using docked bikes by casual riders, and the shortest trips (10 minutes) on electric bikes by members.
#summarize trip duration by day of the week, for each membership type.
data_all_v10 %>%
group_by(day_of_week, member_casual) %>%
summarize(mean_td_minutes=mean(trip_duration_minutes))
#create dataframe for summarized data.
mean_trip_duration_by_day <- data_all_v10 %>%
group_by(day_of_week, member_casual) %>%
summarize(mean_td_minutes=mean(trip_duration_minutes))
#plot mean trip duration by day of the week, for each membership type.
mean_trip_duration_by_day %>%
ggplot(aes(x = day_of_week, y = mean_td_minutes, fill = member_casual)) +
geom_col(position = position_dodge(), width=0.6) +
coord_cartesian(ylim = c(10,16)) +
labs(x = "Day of the Week", y = "Mean Trip Duration (minutes)", fill = "Membership Type") +
ggtitle("Mean Trip Duration by Day of the Week") +
theme(plot.title=element_text( hjust=0.5, vjust=0.5, face='bold'))
Insight: Both casual riders and members took longer trips on weekends.
#summarize mean trip duration by time of day, for each membership type.
data_all_v10 %>%
group_by(hour, member_casual) %>%
summarize(mean=mean(trip_duration_minutes)) %>%
mutate_if(is.numeric, round, 2) %>%
as.data.frame() %>%
print()
#create dataframe for summarized data.
mean_trip_duration_by_TOD <- data_all_v10 %>%
group_by(hour, member_casual) %>%
summarize(mean_td_minutes=mean(trip_duration_minutes))
#plot mean trip duration by time of day, for each membership type.
mean_trip_duration_by_TOD %>%
ggplot() +
geom_line(aes(x=hour, y=mean_td_minutes, color=member_casual, group=member_casual), size = 2) +
coord_cartesian(ylim = c(9,16)) +
labs(x = "Time of Day", y = "Mean Trip Duration (minutes)", fill = "Membership Type") +
ggtitle("Mean Trip Duration by Time of Day") +
theme(plot.title=element_text( hjust=0.5, vjust=0.5, face='bold')) +
guides(color=guide_legend("Membership Type"))
Insight: The longest trips taken by casual riders occur between 10am - 6pm. Members took their longest trips during what is presumably the commute from work (5pm).
#create dataframe for trips < 1 hour.
trips_less_than_1_hour <- data_all_v10 %>%
group_by(trip_interval_minutes, member_casual) %>%
summarize(n=n()) %>%
mutate_if(is.numeric, round, 2) %>%
as.data.frame()
trips_less_than_1_hour$trip_interval_minutes <- factor(trips_less_than_1_hour$trip_interval_minutes, levels = c("0-5 minutes", "5-10 minutes", "10-15 minutes", "15-20 minutes", "20-25 minutes", "25-30 minutes", "30-35 minutes", "35-40 minutes", "40-45 minutes", "45-50 minutes", "50-55 minutes", "55-60 minutes"))
trips_less_than_1_hour %>%
ggplot(aes(x = trip_interval_minutes, y = n, fill = member_casual)) +
geom_col(position = position_dodge(), width=0.6) +
theme(axis.text.x=element_text(angle = -45, hjust = 0)) +
scale_y_continuous(name = "Number of Trips", labels = function(x) format(x, big.mark = ",", scientific = FALSE)) +
labs(x = "Trip Duration", fill = "Membership Type") +
ggtitle("Trip Duration for Trips Less Than 1 Hour") +
theme(plot.title=element_text( hjust=0.5, vjust=0.5, face='bold'))
Insight: Nearly all trips, for both casual riders and members, are less than one hour. The greatest number of trips was between 5-10 minutes for both groups. Trips between 0-5 minutes are mostly attributed to members. Beyond the 5-10 minute bin, the number of trips decreases as trip duration increases.
#summarize number of trips per time of day, for each membership type.
table(data_all_v10$hour, data_all_v10$member_casual)
#create data frame of trips per time of day, for each membership type.
num_trips_by_TOD <- data_all_v10 %>%
group_by(hour, member_casual) %>%
summarize(n=n()) %>%
mutate_if(is.numeric, round, 2) %>%
as.data.frame() %>%
print()
## hour member_casual n
## 1 00 casual 41811
## 2 00 member 34551
## 3 01 casual 26960
## 4 01 member 21259
## 5 02 casual 17017
## 6 02 member 12209
## 7 03 casual 10184
## 8 03 member 7584
## 9 04 casual 7086
## 10 04 member 8496
## 11 05 casual 11586
## 12 05 member 32052
## 13 06 casual 26804
## 14 06 member 87874
## 15 07 casual 47799
## 16 07 member 167742
## 17 08 casual 62715
## 18 08 member 198911
## 19 09 casual 61902
## 20 09 member 140206
## 21 10 casual 76170
## 22 10 member 131642
## 23 11 casual 98514
## 24 11 member 156776
## 25 12 casual 118271
## 26 12 member 181635
## 27 13 casual 123374
## 28 13 member 181071
## 29 14 casual 131318
## 30 14 member 178874
## 31 15 casual 148424
## 32 15 member 212887
## 33 16 casual 167907
## 34 16 member 279133
## 35 17 casual 192057
## 36 17 member 333827
## 37 18 casual 173916
## 38 18 member 273027
## 39 19 casual 133507
## 40 19 member 197278
## 41 20 casual 98741
## 42 20 member 139696
## 43 21 casual 85741
## 44 21 member 108884
## 45 22 casual 78560
## 46 22 member 83741
## 47 23 casual 58183
## 48 23 member 54971
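The busiest hour for each membership type can be read off a table like this programmatically. The sketch below copies six rows from the output above (hours 08, 12, and 17) into a small data frame. `hourly` and `peak_hour` are hypothetical names.

```r
# Six rows copied from the printed table above.
hourly <- data.frame(
  hour = c("08", "08", "12", "12", "17", "17"),
  member_casual = c("casual", "member", "casual", "member", "casual", "member"),
  n = c(62715, 198911, 118271, 181635, 192057, 333827),
  stringsAsFactors = FALSE
)

# For each membership type, pick the hour with the most trips.
peak_hour <- sapply(split(hourly, hourly$member_casual),
                    function(g) g$hour[which.max(g$n)])
peak_hour
```

Run on the full `num_trips_by_TOD` data frame, the same `split()` and `which.max()` pattern would consider all 24 hours.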
#plot trips per time of day, for each membership type.
num_trips_by_TOD %>%
ggplot() +
geom_line(aes(x=hour, y=n, color=member_casual, group=member_casual), size = 2) +
scale_y_continuous(name = "Number of Trips", labels = function(x) format(x, big.mark = ",", scientific = FALSE)) +
labs(x = "Time of Day", y = "Number of Trips", fill = "Membership Type") +
ggtitle("Number of Trips per Time of Day") +
theme(plot.title=element_text( hjust=0.5, vjust=0.5, face='bold')) +
guides(color=guide_legend("Membership Type"))
Insight: For both casual riders and members, spikes in ridership occur between 6-9am, at 12pm, and at 5pm, suggesting bikes are used for work commutes.
#create data frame for number of trips by rideable type.
num_trips_by_rideable_type <- table(data_all_v10$rideable_type, data_all_v10$member_casual) %>%
as.data.frame()
num_trips_by_rideable_type_renamed <- rename(num_trips_by_rideable_type, Rideable_Type = Var1, Membership_Type = Var2)
#plot number of trips by rideable type, for each membership type.
num_trips_by_rideable_type_renamed %>%
ggplot(aes(x=Rideable_Type, y=Freq, fill=Membership_Type)) +
geom_col(position = position_dodge(), width=0.6) +
labs(x = "Rideable Type", y = "Number of Trips", fill = "Membership Type") +
scale_x_discrete(labels=function(x) gsub("_", " ", x, fixed=TRUE)) +
scale_y_continuous(name = "Number of Trips", labels = function(x) format(x, big.mark = ",", scientific = FALSE)) +
ggtitle("Number of Trips by Rideable Type") +
theme(plot.title=element_text( hjust=0.5, vjust=0.5, face='bold'))
#plot mean trip duration by rideable type, for each membership type.
mean_trip_duration_by_rideable_type %>%
ggplot(aes(x = rideable_type, y = mean_td_minutes, fill = member_casual)) +
geom_col(position = position_dodge(), width=0.6) +
coord_cartesian(ylim = c(9,20)) +
labs(x = "Rideable Type", y = "Mean Trip Duration (minutes)", fill = "Membership Type") +
scale_x_discrete(labels=function(x) gsub("_", " ", x, fixed=TRUE)) +
ggtitle("Mean Trip Duration by Rideable Type") +
theme(plot.title=element_text( hjust=0.5, vjust=0.5, face='bold'))
Insight: The most popular rideable type for casual riders is the electric bike, while members prefer the classic bike. Docked bikes are the least popular and are ridden exclusively by casual riders.
#count number of trips per month.
data_all_v10$YearMon <- paste(data_all_v10$year,data_all_v10$month, sep="-")
num_trips_per_month <- table(data_all_v10$YearMon, data_all_v10$member_casual) %>%
as.data.frame()
num_trips_per_month_2 <- filter(num_trips_per_month, Freq != 0) %>%
as.data.frame()
#plot trips per month.
num_trips_per_month_2 %>%
ggplot() +
geom_line(aes(x=Var1, y=Freq, color=Var2, group=Var2), size = 2) +
theme(axis.text.x=element_text(angle = -45, hjust = 0)) +
scale_y_continuous(name = "Number of Trips", labels = function(x) format(x, big.mark = ",", scientific = FALSE)) +
labs(x = " ", y = "Number of Trips") +
ggtitle("Number of Trips per Month") +
theme(plot.title=element_text( hjust=0.5, vjust=0.5, face='bold')) +
guides(color=guide_legend("Membership Type"))
Insight: The most popular months for both casual riders and members are May through September. January and February were the least popular. Peak ridership for casual riders occurred in July.
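The monthly plot runs in calendar order because the `YearMon` labels are zero-padded, so sorting them as text is the same as sorting them by date. The sketch below builds the label in one step with `format()`. The three timestamps are made up and the time zone is an assumption.

```r
# Three made-up start times, deliberately out of order.
started_at <- as.POSIXct(
  c("2022-01-15 08:00:00", "2021-11-27 13:27:38", "2022-10-03 17:30:00"),
  tz = "America/Chicago"
)

# Year-month label built in one step.
year_month <- format(started_at, "%Y-%m")

# Text sorting puts the labels in chronological order.
sorted_labels <- sort(unique(year_month))
sorted_labels
```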
#list most popular stations, note the number of trips and notable characteristics.
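num_trips_by_station <- data_all_v10 %>%
#exclude trips with no recorded start station so the blank name is not ranked as a station.
filter(start_station_name != "") %>%
group_by(start_station_name) %>%
summarize(n=n())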
num_trips_by_station[order(num_trips_by_station$n, decreasing = TRUE),]
#create bubble map to show the number of rides per station.
install.packages("leaflet")
library(leaflet)
install.packages("htmlwidgets")
library(htmlwidgets)
install.packages("htmltools")
library(htmltools)
#create dataframe that groups number of trips by station name w/ latitude and longitude data.
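map_data <- data_all_v10 %>%
#exclude trips with no recorded start station; they cannot be placed at a single station marker.
filter(start_station_name != "") %>%
select( start_station_name,
start_lat,
start_lng) %>%
group_by(start_station_name) %>%
mutate(numtrips = n()) %>%
distinct(start_station_name, .keep_all = TRUE)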
#create bins for map key.
map_bins <- seq(0, 50000, by = 5000)
#assign color palette to show trip density per station.
my_palette <- colorBin(palette ="viridis", domain = map_data$numtrips, na.color = "transparent", bins = map_bins, reverse = TRUE)
#set text for interactive tooltip.
map_text <- paste("Station name: ", map_data$start_station_name, "<br/>","Number of trips: ", map_data$numtrips, sep = "") %>%
lapply(htmltools::HTML)
#create interactive html leaflet widget to show trip density per station.
trips_per_station_map <- leaflet(map_data) %>%
addTiles() %>%
#set Chicago coordinates.
setView(lng = -87.6298, lat = 41.8781, zoom = 10.5) %>%
#set map style.
addProviderTiles("Esri.WorldGrayCanvas") %>%
#add circle markers for each station, set fill color, add tooltip.
addCircleMarkers(~ start_lng, ~ start_lat,
fillColor = ~ my_palette(numtrips),
fillOpacity = 0.6,
color = "white",
# radius = ~ num_trips_by_station$n/1500, #this is an option for variable marker size based on number of trips.
radius = 6,
stroke = FALSE,
label = map_text,
labelOptions = labelOptions(style = list("font-weight" = "normal", padding = "3px 8px"),
textsize = "13px",
direction = "auto")) %>%
#add legend.
addLegend(
pal = my_palette,
values = ~ numtrips,
opacity = 0.8,
title = "Number of Trips",
position = "bottomright")
#view map.
trips_per_station_map
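The station data frame built for the map keeps the coordinates of the first trip seen for each station. If the recorded coordinates vary between trips from the same station (an assumption, since this was not checked in the analysis), averaging them per station may place the markers more reliably. A minimal base R sketch on made-up rows, where `toy_trips` and `station_summary` are hypothetical:

```r
toy_trips <- data.frame(
  start_station_name = c("Station A", "Station A", "Station B"),
  start_lat = c(41.90, 41.92, 41.88),
  start_lng = c(-87.60, -87.62, -87.64),
  stringsAsFactors = FALSE
)

# Mean latitude and longitude per station.
station_summary <- aggregate(cbind(start_lat, start_lng) ~ start_station_name,
                             data = toy_trips, FUN = mean)

# Number of trips per station, matched to the summary rows by name.
trip_counts <- table(toy_trips$start_station_name)
station_summary$numtrips <- as.vector(trip_counts[station_summary$start_station_name])
station_summary
```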
Insight: By far, the most popular station is Streeter Dr & Grand Ave. This station is close to the lakeside train connector, as well as Jane Addams Memorial Park and Ohio Street Beach, and is likely used by tourists and weekend visitors. Six of the top ten most popular stations are adjacent to public parks, three are in front of large residential buildings, and two are in close vicinity to a fitness center.
Casual riders make more trips on weekends and take longer trips on average than members. Like members, casual riders appear to use bikes for their work commutes during the week, evident in ridership spikes during a.m. and p.m. rush hours, as well as the typical lunch hour. Casual riders prefer the electric bike, while members prefer the classic bike. Like members, casual riders are more likely to use bikes in the months of May through September. The most popular stations were adjacent to public parks, large residential buildings, and fitness centers. The most popular station is close to a train connector, public beach, and park. Because of casual riders’ preference for weekends, and the proximity of popular stations to parks, it is reasonable to assume that casual riders use bikes for exercise and leisure. Because the most popular station is close to a major train connector, it may be deduced that casual riders are coming in from suburbs or areas of the city further from the waterfront parks in east Chicago.
The following is recommended to Cyclistic’s marketing strategy team: