Cyclistic: Bike Share Case Study

Cyclistic: How does a bike-share navigate speedy success?

Summary

Industry focus: Bike-share company.
Problem statement: Maximize the number of annual memberships by converting casual riders into annual members.
Business use case: Identify how annual members and casual riders use Cyclistic bikes differently.

Other questions for the marketing analysis team are:
1. Why would casual riders buy Cyclistic annual memberships?
2. How can Cyclistic use digital media to influence casual riders to become members?
Deliverables:
1. A clear statement of the business task
2. A description of all data sources used
3. Documentation of any cleaning or manipulation of data
4. A summary of analysis
5. Supporting visualizations and key findings
6. Top three recommendations based on analysis
Datasets are available at: https://divvy-tripdata.s3.amazonaws.com/index.html

1. About the Company

Cyclistic is a fictional company created for the purposes of this case study. In 2016, Cyclistic launched a successful bike-share offering. Since then, the program has grown to a fleet of 5,824 bicycles that are geotracked and locked into a network of 692 stations across Chicago. The bikes can be unlocked from one station and returned to any other station in the system anytime.

Some Key Notes:

Cyclistic users are more likely to ride for leisure, but about 30% use the bikes to commute to work each day.
Strategy so far: building general awareness and appealing to broad consumer segments through flexible pricing plans: single-ride passes, full-day passes, and annual memberships. Customers who buy single-ride or full-day passes are considered casual riders.
Annual members are much more profitable than casual riders.
Rather than creating a marketing campaign that targets all-new customers, the Director of Marketing believes there is a very good chance to convert casual riders into members.
Marketing Analyst Team’s questions: how annual members and casual riders differ, why casual riders would buy a membership, and how digital media could affect their marketing tactics.

Stakeholders

Lily Moreno: The director of marketing and my manager.
Cyclistic marketing analytics team: A team of data analysts who are responsible for collecting, analyzing, and reporting data that helps guide Cyclistic marketing strategy.
Cyclistic executive team: The detail-oriented executive team that will decide whether to approve the recommended marketing program or not.

2. Preparing Data

The structured data is organized in 51 zipped CSV files, with information about trips (from 06/2013 to 12/2022) and stations (from 06/2013 to 09/2016).
The data is licensed by Motivate International Inc. under https://ride.divvybikes.com/data-license-agreement
Limitations: Data-privacy issues prohibit the use of riders’ personally identifiable information, so it is not possible to connect pass purchases to credit card numbers or IDs.
The unzipped files contain 3 different types of data and structures, which can be classified as information about trips or about stations.
For this analysis, I consider the data from the last 12 months (01/2022 to 12/2022) to be relevant, that is, reliable, complete, and current. Earlier data does not meet this standard because it covers the pandemic period.
Files were checked and renamed consistently according to the information they contain, following this convention for trip data: YYYYMM-divvy-tripdata.
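
The naming convention can also be checked in code. The sketch below is only an illustration: it assumes the renamed files carry a .csv extension, and the names expected_files, found_files, missing_files, and unexpected_files are hypothetical and not used elsewhere in this report.

```r
# Expected trip files for 2022 under the YYYYMM-divvy-tripdata convention.
# Assumption: the files end in ".csv"; adjust if they are named differently.
expected_files <- sprintf("2022%02d-divvy-tripdata.csv", 1:12)

# Hypothetical folder listing with one badly named file
# (in practice this would come from list.files() on the data folder)
found_files <- c(expected_files[1:11], "2022-12-divvy-tripdata.csv")

# Files that are missing, and files that do not follow the convention
missing_files <- setdiff(expected_files, found_files)
unexpected_files <- found_files[!grepl("^[0-9]{6}-divvy-tripdata\\.csv$", found_files)]

print(missing_files)
print(unexpected_files)
```

Both vectors being empty would indicate that the twelve monthly files are present and consistently named.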

3. Processing Data

I chose RStudio for the data wrangling and, later, for the analysis.

3.1. Setting up the environment

# Setting up the environment
# install.packages("tidyverse")
# install.packages("lubridate")
# install.packages("janitor")
# install.packages("skimr")
# install.packages("geosphere")

library(tidyverse)

## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.4.0      ✔ purrr   1.0.0 
## ✔ tibble  3.1.8      ✔ dplyr   1.0.10
## ✔ tidyr   1.2.1      ✔ stringr 1.5.0 
## ✔ readr   2.1.3      ✔ forcats 0.5.2 
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()

library(lubridate)

## 
## Attaching package: 'lubridate'
## 
## The following objects are masked from 'package:base':
## 
##     date, intersect, setdiff, union

library(janitor)

## 
## Attaching package: 'janitor'
## 
## The following objects are masked from 'package:stats':
## 
##     chisq.test, fisher.test

library(skimr)
library(geosphere)

3.2. Importing data

There are 12 csv files, one for each month of 2022. Before stacking them, I checked all files in Excel to see their structure. The structures were compatible, so I stacked them into one big data frame.

# Importing the csv files and stacking them into one data frame
trips22 <- list.files(path = "2023_Bikeshare_files", full.names = TRUE) %>%
  lapply(read.csv) %>%
  bind_rows()
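
The compatibility check that was done by hand in Excel could also be scripted. This is a minimal sketch in base R, not the procedure actually used for this report: it writes two small hypothetical files to a temporary folder (demo_dir is an assumed name) and compares their column names.

```r
# Two small hypothetical files written to a temporary folder for the demo
demo_dir <- file.path(tempdir(), "bikeshare_demo")
dir.create(demo_dir, showWarnings = FALSE)
write.csv(data.frame(ride_id = "A1", member_casual = "member"),
          file.path(demo_dir, "202201-divvy-tripdata.csv"), row.names = FALSE)
write.csv(data.frame(ride_id = "B2", member_casual = "casual"),
          file.path(demo_dir, "202202-divvy-tripdata.csv"), row.names = FALSE)

files <- list.files(demo_dir, pattern = "\\.csv$", full.names = TRUE)

# Read only the first data row of each file and keep the column names
headers <- lapply(files, function(f) names(read.csv(f, nrows = 1)))

# TRUE when every file has the same columns, in the same order, as the first one
same_structure <- all(vapply(headers, identical, logical(1), headers[[1]]))
print(same_structure)
```

If same_structure is FALSE, the files should be inspected before they are stacked, because bind_rows() fills columns that are absent from a file with NA instead of stopping.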

Before cleaning, I check the basic structure of the data:

# Checking the new data frame 
str(trips22)

## 'data.frame':    5667717 obs. of  13 variables:
##  $ ride_id           : chr  "C2F7DD78E82EC875" "A6CF8980A652D272" "BD0F91DFF741C66D" "CBB80ED419105406" ...
##  $ rideable_type     : chr  "electric_bike" "electric_bike" "classic_bike" "classic_bike" ...
##  $ started_at        : chr  "2022-01-13 11:59:47" "2022-01-10 08:41:56" "2022-01-25 04:53:40" "2022-01-04 00:18:04" ...
##  $ ended_at          : chr  "2022-01-13 12:02:44" "2022-01-10 08:46:17" "2022-01-25 04:58:01" "2022-01-04 00:33:00" ...
##  $ start_station_name: chr  "Glenwood Ave & Touhy Ave" "Glenwood Ave & Touhy Ave" "Sheffield Ave & Fullerton Ave" "Clark St & Bryn Mawr Ave" ...
##  $ start_station_id  : chr  "525" "525" "TA1306000016" "KA1504000151" ...
##  $ end_station_name  : chr  "Clark St & Touhy Ave" "Clark St & Touhy Ave" "Greenview Ave & Fullerton Ave" "Paulina St & Montrose Ave" ...
##  $ end_station_id    : chr  "RP-007" "RP-007" "TA1307000001" "TA1309000021" ...
##  $ start_lat         : num  42 42 41.9 42 41.9 ...
##  $ start_lng         : num  -87.7 -87.7 -87.7 -87.7 -87.6 ...
##  $ end_lat           : num  42 42 41.9 42 41.9 ...
##  $ end_lng           : num  -87.7 -87.7 -87.7 -87.7 -87.6 ...
##  $ member_casual     : chr  "casual" "casual" "member" "casual" ...

head(trips22)

tail(trips22)

summary(trips22)

##    ride_id          rideable_type       started_at          ended_at        
##  Length:5667717     Length:5667717     Length:5667717     Length:5667717    
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##                                                                             
##  start_station_name start_station_id   end_station_name   end_station_id    
##  Length:5667717     Length:5667717     Length:5667717     Length:5667717    
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##                                                                             
##    start_lat       start_lng         end_lat         end_lng      
##  Min.   :41.64   Min.   :-87.84   Min.   : 0.00   Min.   :-88.14  
##  1st Qu.:41.88   1st Qu.:-87.66   1st Qu.:41.88   1st Qu.:-87.66  
##  Median :41.90   Median :-87.64   Median :41.90   Median :-87.64  
##  Mean   :41.90   Mean   :-87.65   Mean   :41.90   Mean   :-87.65  
##  3rd Qu.:41.93   3rd Qu.:-87.63   3rd Qu.:41.93   3rd Qu.:-87.63  
##  Max.   :45.64   Max.   :-73.80   Max.   :42.37   Max.   :  0.00  
##                                   NA's   :5858    NA's   :5858    
##  member_casual     
##  Length:5667717    
##  Class :character  
##  Mode  :character  
##                    
##                    
##                    
##

3.3. Cleaning the Dataframe

The started_at and ended_at attributes need to be converted to datetime, and some new columns need to be calculated, in order to get the information required for the analysis.

Dates transformation

Date

# Date
trips22$date <- as.Date(trips22$started_at)

Year

# Year
trips22$year <- format(as.Date(trips22$date), "%Y")

Month

# Month
trips22$month <- format(as.Date(trips22$date), "%m")

Day

# Day
trips22$day <- format(as.Date(trips22$date), "%d")

Day of the week

# Day of the week
trips22$weekday <- format(as.Date(trips22$date),"%A")

Part of the week

# Part of the week
trips22 <- trips22 %>% 
  mutate(part_of_week = case_when(weekday == "Monday" ~ "Workday",
                                  weekday == "Tuesday" ~ "Workday",
                                  weekday == "Wednesday" ~ "Workday",
                                  weekday == "Thursday" ~ "Workday",
                                  weekday == "Friday" ~ "Workday",
                                  weekday == "Saturday" ~ "Weekend",
                                  weekday == "Sunday" ~ "Weekend"))
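
One caveat: format(date, "%A") returns weekday names in the language of the R session, so the comparison with English names only works in an English locale. A minimal locale-independent sketch in base R is shown below; it is an alternative, not the code used in this report, and day_number is a hypothetical helper variable.

```r
# Example dates: the first two rides in the data, plus one Saturday
dates <- as.Date(c("2022-01-13", "2022-01-10", "2022-01-15"))

# as.POSIXlt()$wday is numeric (0 = Sunday, ..., 6 = Saturday), so it does not
# depend on the language of the session
day_number <- as.POSIXlt(dates)$wday
part_of_week <- ifelse(day_number %in% c(0, 6), "Weekend", "Workday")

print(data.frame(dates, day_number, part_of_week))
```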

Time transformation

Datetime: the started_at and ended_at attributes are converted to datetime, since they are stored as character fields.

# Datetime
trips22$started_time <-strptime(trips22$started_at, "%Y-%m-%d %H:%M:%S")
trips22$ended_time <-strptime(trips22$ended_at, "%Y-%m-%d %H:%M:%S")

Hour

# Hour
trips22$hour <- trips22$started_at %>% hour()

Duration of rides: calculated as the difference between the ending and starting time of the ride.

# Calculating ride length in seconds
# (units are set explicitly so that difftime() does not choose them automatically)
trips22$ride_length_s <- difftime(trips22$ended_time, trips22$started_time, units = "secs")
trips22$ride_length_s <- as.numeric(trips22$ride_length_s)
is.numeric(trips22$ride_length_s)

## [1] TRUE

# Converting Ride Length from seconds into minutes in a new column
trips22 <- trips22 %>% 
  mutate(ride_length_m = ride_length_s/60)
trips22$ride_length_m <- round(trips22$ride_length_m, digits = 0)
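
A note on units: unless told otherwise, difftime() chooses its units from the smallest difference it is given, so the same call can return seconds, minutes, hours, or days depending on the data. The sketch below uses the start and end times of the first ride in the data to illustrate this; the variable names are for illustration only.

```r
# Start and end of the first ride in the data
start <- as.POSIXct("2022-01-13 11:59:47", tz = "UTC")
end   <- as.POSIXct("2022-01-13 12:02:44", tz = "UTC")

auto_units <- difftime(end, start)                  # units chosen by R
secs_units <- difftime(end, start, units = "secs")  # units fixed to seconds

print(units(auto_units))   # minutes here, because the difference is over 60 s
print(units(secs_units))

# Same conversion as in the report: seconds, then rounded minutes
ride_length_s <- as.numeric(secs_units)
ride_length_m <- round(ride_length_s / 60, digits = 0)
print(c(ride_length_s, ride_length_m))
```

Passing units = "secs" keeps the ride_length_s column in seconds regardless of which rides are in the data.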

Distance transformation

Calculating distance in kilometers in a new column: the distance of a ride is approximated as the straight-line (great-circle) distance between its start and end coordinates, computed with the Haversine formula. It is not the distance of the route actually traveled.

# Calculating distance in kilometers in a new column
# (distHaversine() returns meters, hence the conversion factor)
trips22 <- trips22 %>% 
  mutate(distance_km = distHaversine(cbind(start_lng, start_lat),
                                     cbind(end_lng, end_lat)) * 0.001)
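
To make the distance calculation more transparent, here is a minimal base R sketch of the Haversine formula. It is an illustration, not a replacement for geosphere: haversine_km is a hypothetical helper, and the radius of 6378.137 km is chosen to match the default radius of distHaversine() (6378137 m).

```r
# Great-circle distance in km between two points given in decimal degrees.
# haversine_km is a hypothetical helper written for this illustration.
haversine_km <- function(lng1, lat1, lng2, lat2, r = 6378.137) {
  to_rad <- pi / 180
  dlat <- (lat2 - lat1) * to_rad
  dlng <- (lng2 - lng1) * to_rad
  a <- sin(dlat / 2)^2 +
    cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlng / 2)^2
  # pmin() guards against rounding errors that push sqrt(a) slightly above 1
  2 * r * asin(pmin(1, sqrt(a)))
}

# One degree of longitude along the equator
one_degree <- haversine_km(0, 0, 1, 0)
print(one_degree)

# A ride that starts and ends at the same place has distance 0
same_point <- haversine_km(-87.64, 41.90, -87.64, 41.90)
print(same_point)
```

The second case matters for interpretation: round trips that end at the start station get a distance of 0 km, however far the rider actually went.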

# Checking the new attributes
str(trips22)

## 'data.frame':    5667717 obs. of  25 variables:
##  $ ride_id           : chr  "C2F7DD78E82EC875" "A6CF8980A652D272" "BD0F91DFF741C66D" "CBB80ED419105406" ...
##  $ rideable_type     : chr  "electric_bike" "electric_bike" "classic_bike" "classic_bike" ...
##  $ started_at        : chr  "2022-01-13 11:59:47" "2022-01-10 08:41:56" "2022-01-25 04:53:40" "2022-01-04 00:18:04" ...
##  $ ended_at          : chr  "2022-01-13 12:02:44" "2022-01-10 08:46:17" "2022-01-25 04:58:01" "2022-01-04 00:33:00" ...
##  $ start_station_name: chr  "Glenwood Ave & Touhy Ave" "Glenwood Ave & Touhy Ave" "Sheffield Ave & Fullerton Ave" "Clark St & Bryn Mawr Ave" ...
##  $ start_station_id  : chr  "525" "525" "TA1306000016" "KA1504000151" ...
##  $ end_station_name  : chr  "Clark St & Touhy Ave" "Clark St & Touhy Ave" "Greenview Ave & Fullerton Ave" "Paulina St & Montrose Ave" ...
##  $ end_station_id    : chr  "RP-007" "RP-007" "TA1307000001" "TA1309000021" ...
##  $ start_lat         : num  42 42 41.9 42 41.9 ...
##  $ start_lng         : num  -87.7 -87.7 -87.7 -87.7 -87.6 ...
##  $ end_lat           : num  42 42 41.9 42 41.9 ...
##  $ end_lng           : num  -87.7 -87.7 -87.7 -87.7 -87.6 ...
##  $ member_casual     : chr  "casual" "casual" "member" "casual" ...
##  $ date              : Date, format: "2022-01-13" "2022-01-10" ...
##  $ year              : chr  "2022" "2022" "2022" "2022" ...
##  $ month             : chr  "01" "01" "01" "01" ...
##  $ day               : chr  "13" "10" "25" "04" ...
##  $ weekday           : chr  "Thursday" "Monday" "Tuesday" "Tuesday" ...
##  $ part_of_week      : chr  "Workday" "Workday" "Workday" "Workday" ...
##  $ started_time      : POSIXlt, format: "2022-01-13 11:59:47" "2022-01-10 08:41:56" ...
##  $ ended_time        : POSIXlt, format: "2022-01-13 12:02:44" "2022-01-10 08:46:17" ...
##  $ hour              : int  11 8 4 0 1 18 18 12 7 15 ...
##  $ ride_length_s     : num  177 261 261 896 362 ...
##  $ ride_length_m     : num  3 4 4 15 6 3 17 12 25 7 ...
##  $ distance_km       : num  0.7 0.695 1.002 2.466 0.815 ...

Removing rides with negative or zero duration, and checking for duplicates

The clean data is stored in a new data frame: trips22_2

# Cleaning the dataframe of negative and zero times
trips22_2 <- trips22[(trips22$ride_length_s > 0),]
# Calculating the rows eliminated
nrow(trips22)-nrow(trips22_2)

## [1] 531

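Checking for duplicated ride IDs: the only repeated value is NA, so no ride appears twice. The rest of the cleaning and analysis takes place in the new data frame.

# Listing duplicated ride_id values (they are only listed here, not removed)
trips22_2$ride_id[duplicated(trips22_2$ride_id)]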

##    [1] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
##  ...   (output shortened: entries 25 to 1008 are all NA)
## [1009] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA

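Limitations: Some attributes have missing values, which may cause inaccuracies in further analysis. In the output below, 1,026 rows are missing in every column (including ride_id), while end_lat, end_lng, and distance_km are missing in 6,884 rows.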

# Identifying missing data  
sapply(trips22_2,function(x) sum(is.na(x)))

##            ride_id      rideable_type         started_at           ended_at 
##               1026               1026               1026               1026 
## start_station_name   start_station_id   end_station_name     end_station_id 
##               1026               1026               1026               1026 
##          start_lat          start_lng            end_lat            end_lng 
##               1026               1026               6884               6884 
##      member_casual               date               year              month 
##               1026               1026               1026               1026 
##                day            weekday       part_of_week       started_time 
##               1026               1026               1026               1026 
##         ended_time               hour      ride_length_s      ride_length_m 
##               1026               1026               1026               1026 
##        distance_km 
##               6884
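
A count of 1,026 missing values in every column, including ride_id, is what R produces when a data frame is subset with a logical condition that contains NA: each NA in the condition yields a row made entirely of NA. That would be the case here if ride_length_s could not be computed for some rides, which I have not verified. The sketch below reproduces the effect on a small made-up data frame and shows one way to avoid it.

```r
# Made-up rides: one has an unknown duration, one a negative duration
rides <- data.frame(
  ride_id = c("a", "b", "c", "d"),
  ride_length_s = c(120, NA, -5, 300)
)

# Logical subsetting keeps an all-NA row for the NA in the condition
kept_bracket <- rides[rides$ride_length_s > 0, ]
print(kept_bracket)

# which() returns only the positions where the condition is TRUE
kept_which <- rides[which(rides$ride_length_s > 0), ]
print(kept_which)
```

dplyr::filter() behaves like the which() version, since it drops rows where the condition is NA.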

4. Analyzing the Data Frame

4.1. General Statistics of key attributes

Summary

Where:
- number_rides: number of rides made by users
- avg_duration: average duration of a ride
- min_duration: shortest duration of a ride
- max_duration: longest duration of a ride
- avg_distance: average distance of a ride
- min_distance: shortest distance of a ride
- max_distance: longest distance of a ride

# General stats 
trips22_2 %>% 
  group_by(member_casual) %>% 
  summarize(number_rides = n(), avg_duration_m = mean(ride_length_m),
            min_duration_m = min(ride_length_m), max_duration_m = max(ride_length_m), 
            avg_distance_km = mean(distance_km, na.rm = TRUE) ) %>% 
  drop_na()

# For distance
summary(trips22_2$distance_km)

##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max.     NA's 
##    0.000    0.874    1.576    2.142    2.783 9825.063     6884
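
A maximum of 9,825 km is not a plausible bike ride. The earlier summary() output shows a minimum end_lat of 0 and a maximum end_lng of 0, so this value most likely comes from rides whose end point was recorded as (0, 0). The following is a minimal sketch, on made-up data, of how such rides could be flagged before distances are averaged; it was not part of the cleaning in this report, and the names zero_end, missing_end, and plausible are hypothetical.

```r
# Made-up rides: a normal one, one ending at (0, 0), one with no end point
rides <- data.frame(
  ride_id = c("r1", "r2", "r3"),
  end_lat = c(41.90, 0, NA),
  end_lng = c(-87.64, 0, NA)
)

# End point recorded as (0, 0); missing coordinates are flagged separately
zero_end <- !is.na(rides$end_lat) & !is.na(rides$end_lng) &
  rides$end_lat == 0 & rides$end_lng == 0
missing_end <- is.na(rides$end_lat) | is.na(rides$end_lng)

# Rides whose end point can be used for a distance calculation
plausible <- rides[!zero_end & !missing_end, ]
print(plausible)
```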

# For duration
trips22_2 %>% 
  group_by(member_casual) %>%
  summarize(avg_duration_m = mean (ride_length_m), median_duration_m = median (ride_length_m), 
            max_duration_m = max (ride_length_m), min_duration_m = min (ride_length_m)) %>% 
  drop_na()

Pie chart of total rides

# Pie chart of total rides
trips22_2 %>% 
  group_by(member_casual) %>%
  # digits = 0 belongs inside round(); outside it, it created an extra column
  summarize(count_of = round(n() / 5667186 * 100, 0)) %>% 
  drop_na() %>% 
  ggplot(aes(x = "", y = count_of, fill = member_casual)) +
  coord_polar(theta = "y") +
  geom_bar(stat = "identity") +
  labs(title = "Number of rides by user type (in %)",
       subtitle = "For the period from January to December of 2022",
       caption = "Total rides = 5,667,186",
       fill = "Member type") +
  theme_void() + 
  geom_text(aes(label = count_of), position = position_stack(vjust = 0.5), color = "white", size = 5)
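
The chart above divides by a hard-coded total. As a hedged alternative, the shares can be derived from the data itself, so they stay correct if the cleaning steps change the number of rows. A minimal base R sketch on a made-up vector (share_pct is a hypothetical name):

```r
# Made-up user types for five rides
member_casual <- c("member", "member", "member", "casual", "casual")

# table() counts each type, prop.table() turns counts into proportions
share_pct <- round(prop.table(table(member_casual)) * 100, 0)
print(share_pct)
```

table() leaves out NA values by default, so rows with a missing user type do not enter the denominator.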

4.2. Featuring members and casual riders

Number of rides by user type and day of the week

# Number of rides by user type by day of the week
trips22_2 %>% 
  mutate(weekday = wday(started_at, label = TRUE)) %>% 
  group_by(member_casual, weekday) %>%
    summarize(number_rides_m = n()/1000
            ,avg_duration_m = mean(ride_length_m)) %>% 
  arrange(member_casual, weekday)  %>%
  drop_na() %>% 
  ggplot(aes(x = weekday, y = number_rides_m, fill = member_casual)) + facet_wrap(~member_casual)+
  geom_col(position = "dodge") + 
  labs(title = "Number of rides by user type and day of the week",
                                      subtitle = "For the period from January to December of 2022",
                                      x = "Day of the week", y = "Number of rides (in thousands)", fill = "Member Type")

## `summarise()` has grouped output by 'member_casual'. You can override using the
## `.groups` argument.

Number of rides by user type and month

# Number of rides by user type by month
trips22_2 %>% 
  mutate(month = month(started_at, label = TRUE)) %>% 
  group_by(member_casual, month) %>%
  summarize(number_of_rides_m = n()/1000
            ,avg_duration_m = mean(ride_length_m)) %>% 
  arrange(member_casual, month) %>%
  drop_na() %>% 
  ggplot(aes(x = month, y = number_of_rides_m)) + 
  geom_point(aes(group = member_casual, color = member_casual)) + 
  geom_line(aes(group = member_casual, color = member_casual)) +
  # the legend belongs to the color aesthetic, so it is labeled with color =
  labs(title = "Number of rides by user type and month",
       subtitle = "For the period from January to December of 2022", 
       x = "Month", y = "Number of rides (in thousands)", color = "Member Type")

## `summarise()` has grouped output by 'member_casual'. You can override using the
## `.groups` argument.

Number of rides by hour of the day, part of the week, and user type

# Number of rides by day hour, part of the week, and user
trips22_2 %>% 
  group_by(member_casual, hour, part_of_week) %>%
  summarize(number_of_rides_m = n()/1000
            ,avg_duration_m = mean(ride_length_m)) %>% 
  arrange(member_casual, hour) %>%
  drop_na() %>% 
  ggplot(aes(x = hour, y = number_of_rides_m)) + 
  geom_point(aes(group = member_casual, color = member_casual)) + 
  geom_line(aes(group = member_casual, color = member_casual)) + 
  facet_wrap(~part_of_week) +
  # the legend belongs to the color aesthetic, so it is labeled with color =
  labs(title = "Number of rides by user type, hour, and part of the week",
       subtitle = "For the period from January to December of 2022",
       x = "Start hour", y = "Number of rides (in thousands)", 
       color = "Member Type")

## `summarise()` has grouped output by 'member_casual', 'hour'. You can override
## using the `.groups` argument.

Average distance by member type and part of the week

# Average distance by member type, and part of the week
trips22_2 %>%
  group_by(member_casual, part_of_week) %>%
  summarize(avg_distance_km = mean(distance_km, na.rm = TRUE)) %>%
  drop_na() %>% 
  ggplot(aes(x = part_of_week, y = avg_distance_km, fill = member_casual))+facet_wrap(~member_casual)+
  geom_col(position = "dodge")+ labs(title = "Average distance by user type and part of the week",
                                     subtitle = "For the period from January to December of 2022",
                                     x = "Part of the week", y = "Average distance (in km)", fill = "Member Type")

## `summarise()` has grouped output by 'member_casual'. You can override using the
## `.groups` argument.

Average distance by member type and month

# Average distance by member type and month
trips22_2 %>%
  mutate(month = month(started_at, label = TRUE)) %>% 
  group_by(member_casual, month) %>%
  summarize(avg_distance_km = mean(distance_km, na.rm = TRUE)) %>%
  drop_na() %>% 
  ggplot(aes(x = month, y = avg_distance_km, fill = member_casual))+
  geom_col(position = "dodge")+ labs(title = "Average distance by user type and month",
                                     subtitle = "For the period from January to December of 2022",
                                     x = "Month", y = "Average distance (in km)", fill = "Member Type")

## `summarise()` has grouped output by 'member_casual'. You can override using the
## `.groups` argument.

Average duration of rides (in minutes) by user type and month

# Average duration of rides (in minutes) by user type, and month
trips22_2 %>%
  mutate(month = month(started_at, label = TRUE)) %>% 
  group_by(member_casual, month) %>% 
  summarize(avg_duration_m = mean (ride_length_m)) %>% 
  drop_na() %>% 
  ggplot(aes(x = month, y = avg_duration_m, fill = member_casual))+facet_wrap(~member_casual)+
  geom_col(position = "dodge")+ labs(title = "Average duration by user type and month",
                                     subtitle = "For the period from January to December of 2022",
                                     x = "Month", y = "Average duration (in minutes)", fill = "Member Type")

## `summarise()` has grouped output by 'member_casual'. You can override using the
## `.groups` argument.

Average duration of rides (in minutes) by day of the week

# Average duration of rides (in minutes) by day of the week
trips22_2 %>% 
  mutate(weekday = wday(started_at, label = TRUE)) %>% 
  group_by(member_casual, weekday) %>% 
  summarize(avg_duration_m = mean (ride_length_m)) %>%
  drop_na() %>% 
  ggplot(aes(x = weekday, y = avg_duration_m, fill = member_casual))+facet_wrap(~member_casual)+
  geom_col(position = "dodge")+ labs(title = "Average duration by user type and day of the week",
                                     subtitle = "For the period from January to December of 2022",
                                     x = "Day of the week", y = "Average duration (in minutes)", fill = "Member Type")

## `summarise()` has grouped output by 'member_casual'. You can override using the
## `.groups` argument.

Bike preference by user type (in thousands of rides)

# Bike preference by user type (in thousands of rides)
# part_of_week is already a grouping variable; repeating it inside summarize()
# returned one row per ride, which inflated the stacked bars
trips22_2 %>% 
  group_by(member_casual, rideable_type, part_of_week) %>%
  summarize(count_of_m = n()/1000) %>% 
  drop_na() %>% 
  ggplot(aes(x = member_casual, y = count_of_m, fill = rideable_type)) + facet_wrap(~part_of_week)+
  geom_bar(stat = "identity")+ labs(title = "Bike preference by user type",
                                    subtitle = "For the period from January to December of 2022",
                                    x = "User Type", y = "Number of rides (in thousands)", fill = "Bike Type")

## `summarise()` has grouped output by 'member_casual', 'rideable_type'. You can
## override using the `.groups` argument.

Top 5 start stations

First, I filtered out the start stations without a name (empty or NA).

# Filtering out start stations without names
top_5_start_st <- trips22_2 %>% 
  filter (start_station_name != "") %>% 
  group_by(member_casual, start_station_name) %>%
  drop_na(start_station_name) %>% 
  summarize(count_of= n()) %>% 
  arrange(desc(count_of))

## `summarise()` has grouped output by 'member_casual'. You can override using the
## `.groups` argument.

Then, I ranked the top 5 stations for members and for casual riders.

# Top 5 start stations for members
top_5_start_member <- filter (top_5_start_st, member_casual == "member") %>% 
  slice(1:5)
print(top_5_start_member)

## # A tibble: 5 × 3
## # Groups:   member_casual [1]
##   member_casual start_station_name           count_of
##   <chr>         <chr>                           <int>
## 1 member        Kingsbury St & Kinzie St        24936
## 2 member        Clark St & Elm St               22030
## 3 member        Wells St & Concord Ln           21294
## 4 member        University Ave & 57th St        19948
## 5 member        Clinton St & Washington Blvd    19827

# Top 5 start stations for casual riders
top_5_start_casual <- filter (top_5_start_st, member_casual == "casual") %>%
  slice(1:5)
print(top_5_start_casual)

## # A tibble: 5 × 3
## # Groups:   member_casual [1]
##   member_casual start_station_name                 count_of
##   <chr>         <chr>                                 <int>
## 1 casual        Streeter Dr & Grand Ave               58078
## 2 casual        DuSable Lake Shore Dr & Monroe St     31850
## 3 casual        Millennium Park                       25519
## 4 casual        Michigan Ave & Oak St                 25263
## 5 casual        DuSable Lake Shore Dr & North Blvd    23651

Top 5 end stations

# Filtering out end stations without names
top_5_end_st <- trips22_2 %>% 
  filter (end_station_name != "") %>% 
  group_by(member_casual, end_station_name) %>%
  drop_na(end_station_name) %>% 
  summarize(count_of= n()) %>% 
  arrange(desc(count_of))

## `summarise()` has grouped output by 'member_casual'. You can override using the
## `.groups` argument.

# Top 5 end stations for members
top_5_end_member <- filter (top_5_end_st, member_casual == "member") %>% 
  slice(1:5)
print(top_5_end_member)

## # A tibble: 5 × 3
## # Groups:   member_casual [1]
##   member_casual end_station_name             count_of
##   <chr>         <chr>                           <int>
## 1 member        Kingsbury St & Kinzie St        24634
## 2 member        Clark St & Elm St               22361
## 3 member        Wells St & Concord Ln           21912
## 4 member        University Ave & 57th St        20531
## 5 member        Clinton St & Washington Blvd    20529

# Top 5 end stations for casual riders
top_5_end_casual <- filter (top_5_end_st, member_casual == "casual") %>%
  slice(1:5)
print(top_5_end_casual)

## # A tibble: 5 × 3
## # Groups:   member_casual [1]
##   member_casual end_station_name                   count_of
##   <chr>         <chr>                                 <int>
## 1 casual        Streeter Dr & Grand Ave               59864
## 2 casual        DuSable Lake Shore Dr & Monroe St     29600
## 3 casual        Millennium Park                       26673
## 4 casual        Michigan Ave & Oak St                 26446
## 5 casual        DuSable Lake Shore Dr & North Blvd    26139
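
The four rankings above repeat the same steps with a different column or user type. As an illustration only, these steps could be wrapped in one small function. The sketch below uses base R and made-up data; top_n_stations is a hypothetical helper and not the code that produced the tables above.

```r
# Top n stations per user type for a given station column.
# top_n_stations is a hypothetical helper written for this illustration.
top_n_stations <- function(df, station_col, n = 5) {
  # Drop rides whose station name is missing or empty
  named <- df[!is.na(df[[station_col]]) & df[[station_col]] != "", ]
  # Count rides per user type and station
  counts <- as.data.frame(
    table(member_casual = named$member_casual, station = named[[station_col]]),
    responseName = "count_of", stringsAsFactors = FALSE
  )
  counts <- counts[counts$count_of > 0, ]
  # Sort by user type, then by descending count, and keep the first n per type
  counts <- counts[order(counts$member_casual, -counts$count_of), ]
  top <- do.call(rbind, lapply(split(counts, counts$member_casual), head, n = n))
  rownames(top) <- NULL
  top
}

# Made-up rides, including one without a station name
rides <- data.frame(
  member_casual = c("casual", "casual", "casual", "member", "member", "member", "member"),
  start_station_name = c("A", "A", "B", "C", "C", "C", "")
)
top <- top_n_stations(rides, "start_station_name", n = 1)
print(top)
```

Calling the same function with "end_station_name" would give the end-station ranking without duplicating the code.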

5. Findings and conclusions

41% of the rides are made by casual riders, and 59% by members.
Members and casual riders behave in opposite ways: members use the service more on workdays, which suggests they use it as a means of transportation, while casual riders use it more on weekends, which suggests leisure use.
Overall, the service is used more on workdays than on weekends.
On workdays, members make more rides than casual riders. On weekends, the number of rides in both groups is similar.
Rides by casual riders peak between June and September; rides by members peak from May to October.
The hours of heaviest use are:
- Workdays: from 7 to 10 h and from 15 to 20 h.
- Weekends: from 10 to 19 h.
The average distance traveled by members and casual riders is similar. There is no difference between them on weekends or workdays, but from January to May casual riders travel farther.
The average ride of a casual rider (29.1 min) lasts more than twice as long as that of a member (12.7 min). Both groups ride longer on weekends.
Members are almost evenly split between electric and classic bikes, while casual riders prefer electric bikes. Docked bikes account for a negligible share of rides, even on weekends.

6. Recommendations

Create a special membership campaign for casual riders:
- With “extended weekend” use only (Fridays, Saturdays, and Sundays).
- Limited by time (for example: 50 hours or a year).
Organize members-only biking competitions on weekends, with special prizes. This would encourage members to ride on weekends and could also motivate casual riders to buy a membership.
Create a loyalty program with discount coupons for members.

7. Considerations for further exploration

Analyze the return on investment of each type of bike, since docked bikes are barely used.
Study the actual routes of the rides to improve the distance analysis (in this analysis, distance is the straight-line distance between the start and end points, not the real route).
Build a better profile of casual riders in order to define better membership alternatives.
Study the price elasticity of each segment to refine the pricing strategy.

Claudia Aravena

February 12, 2023