Cyclistic Case Study

#NOTE: This is a revision. Originally, observations with any blank fields were omitted, which removed over 20% of the data. On review, observations with some blank fields are retained because they do not disrupt the analysis or skew results. Most blank fields occur in four variables: start station name, start station ID, end station name, and end station ID. Because latitude and longitude data is nearly complete (start coordinates have no missing values; 5,835 end coordinates are missing), coordinates serve as a sufficient substitute during analysis. Other than trip counts, the biggest effects of this adjustment were on the rideable type preference for casual riders (revised from classic bike to electric bike) and on the ranking of the most popular stations.

knitr::opts_chunk$set(warning = FALSE, message = FALSE)

Background

Task:

You are a junior data analyst working in the marketing analyst team at Cyclistic, a bike-share company in Chicago.

The director of marketing believes the company’s future success depends on maximizing the number of annual memberships. Therefore, your team wants to understand how casual riders and annual members use Cyclistic bikes differently. From these insights, your team will design a new marketing strategy to convert casual riders into annual members. But first, Cyclistic executives must approve your recommendations, so they must be backed up with compelling data insights and professional data visualizations.

In 2016, Cyclistic launched a successful bike-share offering. Since then, the program has grown to a fleet of 5,824 bicycles that are geotracked and locked into a network of 692 stations across Chicago. The bikes can be unlocked from one station and returned to any other station in the system anytime. Until now, Cyclistic’s marketing strategy relied on building general awareness and appealing to broad consumer segments. One approach that helped make these things possible was the flexibility of its pricing plans: single-ride passes, full-day passes, and annual memberships. Customers who purchase single-ride or full-day passes are referred to as casual riders. Customers who purchase annual memberships are Cyclistic members.

Cyclistic’s finance analysts have concluded that annual members are much more profitable than casual riders. Although the pricing flexibility helps Cyclistic attract more customers, Moreno believes that maximizing the number of annual members will be key to future growth. Rather than creating a marketing campaign that targets all-new customers, Moreno believes there is a very good chance to convert casual riders into members. She notes that casual riders are already aware of the Cyclistic program and have chosen Cyclistic for their mobility needs.

Moreno has set a clear goal: Design marketing strategies aimed at converting casual riders into annual members. In order to do that, however, the marketing analyst team needs to better understand how annual members and casual riders differ, why casual riders would buy a membership, and how digital media could affect their marketing tactics. Moreno and her team are interested in analyzing the Cyclistic historical bike trip data to identify trends.

Stakeholders:

Cyclistic: A bike-share program that features more than 5,800 bicycles and 600 docking stations. Cyclistic sets itself apart by also offering reclining bikes, hand tricycles, and cargo bikes, making bike-share more inclusive to people with disabilities and riders who can’t use a standard two-wheeled bike. The majority of riders opt for traditional bikes; about 8% of riders use the assistive options. Cyclistic users are more likely to ride for leisure, but about 30% use them to commute to work each day.
Lily Moreno: The director of marketing and your manager. Moreno is responsible for the development of campaigns and initiatives to promote the bike-share program. These may include email, social media, and other channels.
Cyclistic marketing analytics team: A team of data analysts who are responsible for collecting, analyzing, and reporting data that helps guide Cyclistic marketing strategy. You joined this team six months ago and have been busy learning about Cyclistic’s mission and business goals — as well as how you, as a junior data analyst, can help Cyclistic achieve them.
Cyclistic executive team: The notoriously detail-oriented executive team will decide whether to approve the recommended marketing program.

Three questions will guide the future marketing program:

How do annual members and casual riders use Cyclistic bikes differently?
Why would casual riders buy Cyclistic annual memberships?
How can Cyclistic use digital media to influence casual riders to become members?

Moreno has assigned you the first question to answer: How do annual members and casual riders use Cyclistic bikes differently?

You will produce a report with the following deliverables:

A clear statement of the business task
A description of all data sources used
Documentation of any cleaning or manipulation of data
A summary of your analysis
Supporting visualizations and key findings
Your top three recommendations based on your analysis

[Above text in “Background” authored by Google.]

Business Task: Clean, process, and analyze historical trip data over a 12-month period (November 2021 - October 2022) to determine differences between casual riders and members. These differences will inform recommendations on a new marketing strategy aimed at recruiting casual riders for annual memberships.

Data Source: https://divvy-tripdata.s3.amazonaws.com/index.html

Data Integrity: The data has been made available by Motivate International Inc. under this license. This is primary data, collected internally, with no identifying information on clients.

Data Limitations: To protect the privacy of its users, Cyclistic does not link trip data to individuals. Data can be analyzed only on a single-trip basis. After processing the data, it is clear that there is a large volume of invalid entries, likely caused by canceled transactions and the proper docking of bikes. Many station names are also missing, though coordinates are available.

Data Exploration

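#SET WORKING DIRECTORY ---------------------------------------------------------

  setwd("/Users/jenmason/Documents")
  options(repos = list(CRAN="http://cran.rstudio.com/"))

#LOAD PACKAGES -----------------------------------------------------------------

  library(tidyverse)  #dplyr, ggplot2, and the %>% pipe.
  library(janitor)    #clean_names(), remove_empty().
  library(lubridate)  #year(), month().
  library(skimr)      #skim().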

#IMPORT DATA -------------------------------------------------------------------

  table2021_11 <- read.csv("202111-divvy-tripdata.csv")
  table2021_12 <- read.csv("202112-divvy-tripdata.csv")
  table2022_01 <- read.csv("202201-divvy-tripdata.csv")
  table2022_02 <- read.csv("202202-divvy-tripdata.csv")
  table2022_03 <- read.csv("202203-divvy-tripdata.csv")
  table2022_04 <- read.csv("202204-divvy-tripdata.csv")
  table2022_05 <- read.csv("202205-divvy-tripdata.csv")
  table2022_06 <- read.csv("202206-divvy-tripdata.csv")
  table2022_07 <- read.csv("202207-divvy-tripdata.csv")
  table2022_08 <- read.csv("202208-divvy-tripdata.csv")
  table2022_09 <- read.csv("202209-divvy-tripdata.csv")
  table2022_10 <- read.csv("202210-divvy-tripdata.csv")

#combine/merge CSV files
  data_all <- rbind(table2021_11, table2021_12, table2022_01, table2022_02, 
                    table2022_03, table2022_04, table2022_05, table2022_06, 
                    table2022_07, table2022_08, table2022_09, table2022_10)
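
The twelve read.csv() calls and the rbind() above could also be written as a loop over file names. The following is a minimal sketch, assuming every monthly file shares the same columns. The temporary folder, the two toy files, and the names demo_dir, csv_files, monthly_tables, and data_all_demo are stand-ins for illustration and are not part of the original analysis.

```r
# Sketch only: two small stand-in files are written to a temporary folder
# so the example runs without the real trip data downloads.
demo_dir <- file.path(tempdir(), "divvy_demo")
dir.create(demo_dir, showWarnings = FALSE)

write.csv(data.frame(ride_id = c("A1", "A2"), member_casual = c("casual", "member")),
          file.path(demo_dir, "202111-divvy-tripdata.csv"), row.names = FALSE)
write.csv(data.frame(ride_id = "B1", member_casual = "member"),
          file.path(demo_dir, "202112-divvy-tripdata.csv"), row.names = FALSE)

# List every monthly file, read each one, then stack the tables row-wise.
csv_files <- list.files(demo_dir, pattern = "-divvy-tripdata\\.csv$", full.names = TRUE)
monthly_tables <- lapply(csv_files, read.csv)
data_all_demo <- do.call(rbind, monthly_tables)

nrow(data_all_demo)
```

With the real files, pointing list.files() at the working directory would pick up all twelve months without naming each table.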

#EXPLORE DATA ------------------------------------------------------------------
  
#preview data
  View(data_all)

#view data structure.
  str(data_all)

## 'data.frame':    5755694 obs. of  13 variables:
##  $ ride_id           : chr  "7C00A93E10556E47" "90854840DFD508BA" "0A7D10CDD144061C" "2F3BE33085BCFF02" ...
##  $ rideable_type     : chr  "electric_bike" "electric_bike" "electric_bike" "electric_bike" ...
##  $ started_at        : chr  "2021-11-27 13:27:38" "2021-11-27 13:38:25" "2021-11-26 22:03:34" "2021-11-27 09:56:49" ...
##  $ ended_at          : chr  "2021-11-27 13:46:38" "2021-11-27 13:56:10" "2021-11-26 22:05:56" "2021-11-27 10:01:50" ...
##  $ start_station_name: chr  "" "" "" "" ...
##  $ start_station_id  : chr  "" "" "" "" ...
##  $ end_station_name  : chr  "" "" "" "" ...
##  $ end_station_id    : chr  "" "" "" "" ...
##  $ start_lat         : num  41.9 42 42 41.9 41.9 ...
##  $ start_lng         : num  -87.7 -87.7 -87.7 -87.8 -87.6 ...
##  $ end_lat           : num  42 41.9 42 41.9 41.9 ...
##  $ end_lng           : num  -87.7 -87.7 -87.7 -87.8 -87.6 ...
##  $ member_casual     : chr  "casual" "casual" "casual" "casual" ...

#skim data for empty fields, unique values, whitespace.
  #skim(data_all)

#showed empty fields for station names and IDs; no whitespace; appropriate number of unique values for rideable type and membership.

#get initial summary of data
  summary(data_all)

##    ride_id          rideable_type       started_at          ended_at        
##  Length:5755694     Length:5755694     Length:5755694     Length:5755694    
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##                                                                             
##  start_station_name start_station_id   end_station_name   end_station_id    
##  Length:5755694     Length:5755694     Length:5755694     Length:5755694    
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##                                                                             
##    start_lat       start_lng         end_lat         end_lng      
##  Min.   :41.64   Min.   :-87.84   Min.   :41.39   Min.   :-88.97  
##  1st Qu.:41.88   1st Qu.:-87.66   1st Qu.:41.88   1st Qu.:-87.66  
##  Median :41.90   Median :-87.64   Median :41.90   Median :-87.64  
##  Mean   :41.90   Mean   :-87.65   Mean   :41.90   Mean   :-87.65  
##  3rd Qu.:41.93   3rd Qu.:-87.63   3rd Qu.:41.93   3rd Qu.:-87.63  
##  Max.   :45.64   Max.   :-73.80   Max.   :42.37   Max.   :-87.30  
##                                   NA's   :5835    NA's   :5835    
##  member_casual     
##  Length:5755694    
##  Class :character  
##  Mode  :character  
##                    
##                    
##                    
##

Summary of Data Exploration

Twelve CSV files, representing historical trip data from November 2021 through October 2022, were collected and merged.
Data was previewed for structure, incomplete fields, whitespace, appropriate number of unique values for categorical variables, and the initial summary statistics.
This preview revealed many incomplete fields, most often in the station name and ID variables, as well as many entries that were likely invalid (discussed below).
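
The check for incomplete fields can be sketched in base R. This is an illustration only, not the method used above (which relied on skim() and summary()). The data frame trips_demo, its values, the station names, and the helper count_missing() are all made up.

```r
# Toy data frame standing in for data_all (all values are made up).
trips_demo <- data.frame(
  start_station_name = c("", "Clark St", ""),
  end_station_name   = c("Lake St", "", ""),
  end_lat            = c(41.9, NA, 41.8),
  stringsAsFactors   = FALSE
)

# For each column, count entries that are NA or an empty string.
count_missing <- function(column) {
  sum(is.na(column) | column == "")
}
missing_per_column <- sapply(trips_demo, count_missing)
missing_per_column
```

Counting empty strings and NA together matters here because the station fields are blank strings while the missing coordinates are NA.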

Data Cleaning

#DATA CLEANING------------------------------------------------------------------

#clean names; include only characters, numbers, and underscores in names.
  data_all_v02 <- clean_names(data_all)

#remove empty rows and columns.
  data_all_v03 <- remove_empty(data_all_v02)

#remove duplicates; select only unique rows.
  data_all_v04 <- distinct(data_all_v03)

#check for naming anomalies; include only appropriate entries for variables with limited response types.
  count(data_all_v04, rideable_type)
  count(data_all_v04, member_casual)

#convert to date-time format.
  data_all_v05 <- data_all_v04
    data_all_v05$started_at <- as.POSIXct(data_all_v05$started_at, format = "%Y-%m-%d %H:%M:%S")
    data_all_v05$ended_at <- as.POSIXct(data_all_v05$ended_at, format = "%Y-%m-%d %H:%M:%S")

#In previous versions, this is where any incomplete cases were omitted. This removed more than 20% of observations. After reconsideration, this action may not have been necessary and will not be performed in this version.

#check for data added/removed during analysis period ----

#rideable type:

#create data frame to see when a unique rideable type was added.
rideable_type_check <- data_all_v05 %>%
  mutate(year = year(started_at), 
         month = month(started_at)) %>%
  group_by(month, year) %>%
  select(rideable_type, month, year) %>%
  count(rideable_type)
    #no unique types added or removed during the analysis period.

#parse date for analysis; create columns for year, month, day, hour; ensure case of "Y/y, M/m, D/d, H/h, M/m, S/s" is correct.
  data_all_v06 <- data_all_v05
  data_all_v06$year <- format(data_all_v06$started_at, "%Y")
  data_all_v06$month <- format(data_all_v06$started_at, "%m")
  data_all_v06$day <- format(data_all_v06$started_at, "%d")
  data_all_v06$hour <- format(data_all_v06$started_at, "%H")

Summary of Data Cleaning:

Variable names were formatted to include only characters, numbers, and underscores.
Empty rows and columns were removed.
Duplicate observations were removed.
Observations were checked for naming anomalies to include only appropriate entries for categorical variables or variables with limited response types.
A workable date-time format was applied.
Rows with fields that were blank or “NA” were not removed because they accounted for over 20% of the data.
The only variables added or removed during the analysis period were stations with low trip counts, having a negligible impact on data analysis.
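
The date-time conversion step can be checked for parsing failures. The sketch below uses made-up timestamps in the same layout as started_at. The explicit time zone "America/Chicago" is an assumption, since the source does not state which time zone the timestamps use.

```r
# Made-up timestamps; the last one has an invalid month on purpose.
raw_times <- c("2021-11-27 13:27:38", "2022-07-04 08:05:00", "2022-13-01 10:00:00")

# tz = "America/Chicago" is an assumption, not something stated in the data source.
parsed_times <- as.POSIXct(raw_times, format = "%Y-%m-%d %H:%M:%S", tz = "America/Chicago")

# Values that fail to parse become NA, so counting NA flags bad rows.
sum(is.na(parsed_times))

# Components can then be extracted as zero-padded strings.
format(parsed_times[1], "%H")
```

Because format() returns character values, the year, month, day, and hour columns created this way are strings such as "08" rather than numbers.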

Data Manipulation & Preparation for Analysis

#days of the week:

#create column for day of the week.
  data_all_v06$day_of_week <- format(data_all_v06$started_at,"%A")

#convert to factor w/levels, specify order (M-S).
  data_all_v06$day_of_week <- factor(data_all_v06$day_of_week, levels = c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"))

#trip duration:

#calculate trip duration for analysis; ensure data type is numeric.
  data_all_v06$trip_duration_seconds <- as.numeric(difftime(data_all_v06$ended_at, data_all_v06$started_at, units = "secs"))

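#remove trips < 1 minute.
  data_all_v07 <- data_all_v06 %>%
    filter(trip_duration_seconds >= 60)

#the step that created data_all_v08 is not shown in this document; carry data_all_v07 forward so the code below runs.
  data_all_v08 <- data_all_v07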

#check for outlier data; Rosner test to detect multiple outliers.
  if (!requireNamespace("EnvStats", quietly = TRUE)) install.packages("EnvStats")
    library(EnvStats)
  test_outliers <- rosnerTest(data_all_v08$trip_duration_seconds)
    #outlier data is present.

#arrange/order data by trip duration, descending; inspect long trips.
  order_by_trip_duration_desc <- data_all_v08[order(data_all_v08$trip_duration_seconds ,decreasing=TRUE),]

#remove outlier data using IQR method; find upper and lower limit of the interquartile range; filter data.
  quartiles <- quantile(data_all_v08$trip_duration_seconds, probs=c(.25, .75), na.rm = FALSE)
  IQR <- IQR(data_all_v08$trip_duration_seconds)
  
  Lower <- quartiles[1] - 1.5*IQR
  Upper <- quartiles[2] + 1.5*IQR

data_no_outliers_IQR <- subset(data_all_v08, data_all_v08$trip_duration_seconds > Lower & data_all_v08$trip_duration_seconds < Upper)
  #this removed 414,368 observations.

data_all_v09 <- data_no_outliers_IQR
  summary(data_all_v09$trip_duration_seconds)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    60.0   349.0   588.0   723.6   978.0  2279.0

#inspect trip duration.
  order_by_trip_duration_desc <- data_all_v09[order(data_all_v09$trip_duration_seconds ,decreasing=TRUE),]
  order_by_trip_duration_asc  <- data_all_v09[order(data_all_v09$trip_duration_seconds),]

#create column with trip duration in minutes.
  data_all_v10 <- data_all_v09 
  data_all_v10$trip_duration_minutes <- data_all_v10$trip_duration_seconds/60

#create column with trip duration in hours.
data_all_v10$trip_duration_hours <- data_all_v10$trip_duration_minutes/60

#create bins for trips < 1 hour (the majority of trips).
  data_all_v10 <- data_all_v10 %>%
    mutate(trip_interval_minutes 
     = case_when(trip_duration_minutes >00 & trip_duration_minutes <=05 ~'0-5 minutes',
                                       trip_duration_minutes >05 & trip_duration_minutes <=10 ~'5-10 minutes',
                                       trip_duration_minutes >10 & trip_duration_minutes <=15 ~'10-15 minutes',
                                       trip_duration_minutes >15 & trip_duration_minutes <=20 ~'15-20 minutes',
                                       trip_duration_minutes >20 & trip_duration_minutes <=25 ~'20-25 minutes',
                                       trip_duration_minutes >25 & trip_duration_minutes <=30 ~'25-30 minutes',
                                       trip_duration_minutes >30 & trip_duration_minutes <=35 ~'30-35 minutes',
                                       trip_duration_minutes >35 & trip_duration_minutes <=40 ~'35-40 minutes',
                                       trip_duration_minutes >40 & trip_duration_minutes <=45 ~'40-45 minutes',
                                       trip_duration_minutes >45 & trip_duration_minutes <=50 ~'45-50 minutes',
                                       trip_duration_minutes >50 & trip_duration_minutes <=55 ~'50-55 minutes',
                                       trip_duration_minutes >55 & trip_duration_minutes <=60 ~'55-60 minutes'))

#membership:

#create separate data frames for casual riders and members; may be useful for analysis.
  data_member <- data_all_v10 %>% 
    filter(member_casual == "member")
  data_casual <- data_all_v10 %>% 
    filter(member_casual == "casual")

Summary of Data Preparation

A column for days of the week was created and factored.
Trip duration in seconds was calculated.
A dataframe was created for trips where start station = end station.
Trips that likely represented invalid data, such as canceled transactions or other errors in data collection, were removed. This included trips less than 1 minute. This threshold should be confirmed with stakeholders.
The Rosner Test was used to detect outlier data. The interquartile range method was used to exclude outlier data. Upper and lower limits of data inclusion were calculated as follows:
- Lower <- quartiles[1] - 1.5*IQR
- Upper <- quartiles[2] + 1.5*IQR
Trip duration in minutes and hours was calculated for ease of analysis.
Trips less than one hour in duration (a majority of all data) were binned in five minute increments for ease of analysis.
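
The interquartile range limits can be illustrated on a small example. The durations below are made up and are not taken from the trip data. The names ending in _demo are introduced only for this sketch.

```r
# Made-up durations in seconds; 9000 is an obvious outlier.
durations_demo <- c(120, 300, 420, 600, 780, 900, 1200, 9000)

quartiles_demo <- quantile(durations_demo, probs = c(0.25, 0.75))
iqr_demo <- IQR(durations_demo)

# Fences: 1.5 times the IQR below the first and above the third quartile.
lower_demo <- quartiles_demo[[1]] - 1.5 * iqr_demo
upper_demo <- quartiles_demo[[2]] + 1.5 * iqr_demo

# Keep only values strictly inside the fences, as in the analysis.
kept_demo <- durations_demo[durations_demo > lower_demo & durations_demo < upper_demo]
kept_demo
```

Here the quartiles are 390 and 975, so the fences fall at -487.5 and 1852.5 and only the 9000-second value is dropped. When the lower fence is negative, as in this example, only the upper fence removes anything.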

Data Analysis

Membership

#summarize number of trips by membership.
  num_trips_by_membership <- table(data_all_v10$member_casual) %>% 
    as.data.frame()

#create pie chart to show number of trips by membership.
  ggplot(num_trips_by_membership, aes(x = "", y = Freq, fill = Var1)) +
        geom_bar(stat = "identity", width = 1, color="white") +
        coord_polar("y", start = 0, direction = -1) +
        theme_void() +
        theme(axis.line = element_blank(), axis.ticks = element_blank(), plot.title = element_text(hjust = 0.5)) +
        ggtitle("Trips per Membership Type") +
        theme(plot.title=element_text( hjust=0.5, vjust=0.5, face='bold')) +
        labs(fill = "Membership Type", x = NULL, y = NULL) +
        geom_text(aes(label = paste(round(Freq / sum(Freq) * 100, 1), "%")), position = position_stack(vjust = 0.5))

#create bar plot to show number of trips by membership (alternative to pie chart).        
  ggplot(data_all_v10, aes(x=factor(member_casual))) +
    geom_bar(stat="count", position = "dodge", width=0.6) +
    scale_y_continuous(name = "Number of Trips", labels = function(x) format(x, big.mark = ",", scientific = FALSE)) +
    labs(x= "Membership Type", fill = "Membership Type") +
    coord_cartesian(ylim = c(1000000,3500000)) +
    ggtitle("Trips per Membership Type") +
    theme(plot.title=element_text( hjust=0.5, vjust=0.5, face='bold'))

Casual riders took approximately 2.0 million trips while members took approximately 3.2 million trips.

Days of the Week

#plot number of trips per day of the week, grouped by membership (bar plot).
  ggplot(data = data_all_v10) +
    geom_bar(mapping = aes(x=day_of_week, fill=member_casual), position = "dodge") +
    scale_x_discrete(name = "Day of the Week", 
      limits = c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday")) +
    scale_y_continuous(name = "Number of Trips", labels = function(x) format(x, big.mark = ",", scientific = FALSE)) +
    ggtitle("Trips per Day of the Week") +
    theme(plot.title=element_text( hjust=0.5, vjust=0.5, face='bold')) +
    labs(fill = "Membership Type")

Insight: Members make more trips Monday-Friday while casual riders make more trips on the weekends.
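
The weekday and weekend comparison can be quantified as a share of each group's trips. This is a minimal sketch with made-up records. The real analysis would use data_all_v10, and the names trips_demo, day_type, trip_counts, and trip_shares are introduced only for illustration.

```r
# Made-up trip records standing in for data_all_v10.
trips_demo <- data.frame(
  member_casual = c("member", "member", "member", "casual", "casual", "casual"),
  day_of_week   = c("Monday", "Tuesday", "Saturday", "Saturday", "Sunday", "Friday")
)

# Flag weekend days, then compute the share of each rider type's trips by day type.
trips_demo$day_type <- ifelse(trips_demo$day_of_week %in% c("Saturday", "Sunday"),
                              "weekend", "weekday")
trip_counts <- table(trips_demo$member_casual, trips_demo$day_type)
trip_shares <- prop.table(trip_counts, margin = 1)  # each row sums to 1

round(trip_shares, 2)
```

Row-wise shares make the two groups comparable even though members take more trips overall.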

Trip Duration

Trip Duration by Membership

#summarize mean trip duration by membership.
  data_all_v10 %>% 
    group_by(member_casual) %>% 
      summarize(mean_td_minutes=mean(trip_duration_minutes))

#summarize mean trip duration by rideable type, for each membership type. 
  data_all_v10 %>% 
    group_by(rideable_type, member_casual) %>% 
      summarize(mean_td_minutes=mean(trip_duration_minutes))

#create dataframe for summarized data.
  mean_trip_duration_by_rideable_type <- data_all_v10 %>%
    group_by(rideable_type, member_casual) %>%
    summarize(mean_td_minutes=mean(trip_duration_minutes))

#substitute zero for missing value.
  mean_trip_duration_by_rideable_type[nrow(mean_trip_duration_by_rideable_type) + 1,] <- list("docked_bike", "member", 0)

#create bar plot for mean trip duration by rideable type, for each membership type.
  mean_trip_duration_by_rideable_type %>% 
    ggplot(aes(x = rideable_type, y = mean_td_minutes, fill = member_casual)) + 
    geom_col(position = position_dodge(), width=0.6) +
    coord_cartesian(ylim = c(9,20)) +
    labs(x = "Rideable Type", fill = "Membership Type") +
    scale_x_discrete(labels=function(x) gsub("_", " ", x, fixed=TRUE)) +
    scale_y_continuous(name = "Mean Trip Duration (minutes)") +
    ggtitle("Mean Trip Duration by Rideable Type") +
    theme(plot.title=element_text( hjust=0.5, vjust=0.5, face='bold'))

Insight: On average, casual riders took longer trips (14 minutes) than members (11 minutes). The longest trips (20 minutes) were taken using docked bikes by casual riders, and the shortest trips (10 minutes) on electric bikes by members.
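
The zero placeholder for the missing docked bike and member combination was appended by hand. A more general approach is sketched below in base R, building every rideable type and membership pair and then filling the gaps. The values in means_demo are made up for illustration, and the names means_demo, all_pairs, and means_complete are not part of the original analysis.

```r
# Made-up means; the docked_bike/member combination is absent, as in the real summary.
means_demo <- data.frame(
  rideable_type   = c("classic_bike", "classic_bike", "electric_bike", "electric_bike", "docked_bike"),
  member_casual   = c("casual", "member", "casual", "member", "casual"),
  mean_td_minutes = c(14, 11, 13, 10, 20)
)

# Build every rideable type x membership pair, then left-join the observed means.
all_pairs <- expand.grid(
  rideable_type = unique(means_demo$rideable_type),
  member_casual = unique(means_demo$member_casual),
  stringsAsFactors = FALSE
)
means_complete <- merge(all_pairs, means_demo, all.x = TRUE)

# Pairs with no trips come back as NA; replace them with 0 for plotting.
means_complete$mean_td_minutes[is.na(means_complete$mean_td_minutes)] <- 0
means_complete
```

This avoids hard-coding the missing pair, so it would still work if a different combination were absent. tidyr::complete() offers a similar one-step approach for tidyverse pipelines.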

#summarize trip duration by day of the week, for each membership type.
  data_all_v10 %>% 
    group_by(day_of_week, member_casual) %>% 
      summarize(mean_td_minutes=mean(trip_duration_minutes))

#create dataframe for summarized data.
  mean_trip_duration_by_day <- data_all_v10 %>%
    group_by(day_of_week, member_casual) %>%
    summarize(mean_td_minutes=mean(trip_duration_minutes))

#plot mean trip duration by day of the week, for each membership type.
mean_trip_duration_by_day %>% 
  ggplot(aes(x = day_of_week, y = mean_td_minutes, fill = member_casual)) + 
  geom_col(position = position_dodge(), width=0.6) + 
  coord_cartesian(ylim = c(10,16)) +
  labs(x = "Day of the Week", y = "Mean Trip Duration (minutes)", fill = "Membership Type") +
  ggtitle("Mean Trip Duration by Day of the Week") +
  theme(plot.title=element_text( hjust=0.5, vjust=0.5, face='bold'))

Insight: Both casual riders and members took longer trips on weekends.

#summarize mean trip duration by time of day, for each membership type.
  data_all_v10 %>% 
    group_by(hour, member_casual) %>%
    summarize(mean=mean(trip_duration_minutes)) %>%
    mutate_if(is.numeric, round, 2) %>%
    as.data.frame() %>% 
    print()

#create dataframe for summarized data.
  mean_trip_duration_by_TOD <- data_all_v10 %>%
    group_by(hour, member_casual) %>%
    summarize(mean_td_minutes=mean(trip_duration_minutes))

#plot mean trip duration by time of day, for each membership type.
mean_trip_duration_by_TOD %>% 
  ggplot() +
  geom_line(aes(x=hour, y=mean_td_minutes, color=member_casual, group=member_casual), size = 2) +
  coord_cartesian(ylim = c(9,16)) +
  labs(x = "Time of Day", y = "Mean Trip Duration (minutes)", fill = "Membership Type") +
  ggtitle("Mean Trip Duration by Time of Day") +
  theme(plot.title=element_text( hjust=0.5, vjust=0.5, face='bold')) +
  guides(color=guide_legend("Membership Type"))

Insight: The longest trips taken by casual riders occur between 10am - 6pm. Members took their longest trips during what is presumably the commute from work (5pm).

#create dataframe for trips < 1 hour.
  trips_less_than_1_hour <- data_all_v10 %>% 
    group_by(trip_interval_minutes, member_casual) %>%
    summarize(n=n()) %>%
    mutate_if(is.numeric, round, 2) %>%
    as.data.frame()

  trips_less_than_1_hour$trip_interval_minutes <- factor(trips_less_than_1_hour$trip_interval_minutes, levels = c("0-5 minutes", "5-10 minutes", "10-15 minutes", "15-20 minutes", "20-25 minutes", "25-30 minutes", "30-35 minutes", "35-40 minutes", "40-45 minutes", "45-50 minutes", "50-55 minutes", "55-60 minutes"))

  trips_less_than_1_hour %>% 
    ggplot(aes(x = trip_interval_minutes, y = n, fill = member_casual)) + 
    geom_col(position = position_dodge(), width=0.6) + 
    theme(axis.text.x=element_text(angle = -45, hjust = 0)) +
    scale_y_continuous(name = "Number of Trips", labels = function(x) format(x, big.mark = ",", scientific = FALSE)) +
    labs(x = "Trip Duration", fill = "Membership Type") +
    ggtitle("Trip Duration for Trips Less Than 1 Hour") +
    theme(plot.title=element_text( hjust=0.5, vjust=0.5, face='bold'))

Insight: Nearly all trips, for both casual riders and members, are less than one hour. The greatest number of trips were between 5-10 minutes for both groups. Trips between 0-5 minutes are mostly attributed to members. After trips 5-10 minutes in duration, the number of trips decreases as trip duration increases.

Time of Day

#summarize number of trips per time of day, for each membership type.
  table(data_all_v10$hour, data_all_v10$member_casual)

#create data frame of trips per time of day, for each membership type.
  num_trips_by_TOD <- data_all_v10 %>% 
    group_by(hour, member_casual) %>%
    summarize(n=n()) %>%
    mutate_if(is.numeric, round, 2) %>%
    as.data.frame() %>% 
    print()

##    hour member_casual      n
## 1    00        casual  41811
## 2    00        member  34551
## 3    01        casual  26960
## 4    01        member  21259
## 5    02        casual  17017
## 6    02        member  12209
## 7    03        casual  10184
## 8    03        member   7584
## 9    04        casual   7086
## 10   04        member   8496
## 11   05        casual  11586
## 12   05        member  32052
## 13   06        casual  26804
## 14   06        member  87874
## 15   07        casual  47799
## 16   07        member 167742
## 17   08        casual  62715
## 18   08        member 198911
## 19   09        casual  61902
## 20   09        member 140206
## 21   10        casual  76170
## 22   10        member 131642
## 23   11        casual  98514
## 24   11        member 156776
## 25   12        casual 118271
## 26   12        member 181635
## 27   13        casual 123374
## 28   13        member 181071
## 29   14        casual 131318
## 30   14        member 178874
## 31   15        casual 148424
## 32   15        member 212887
## 33   16        casual 167907
## 34   16        member 279133
## 35   17        casual 192057
## 36   17        member 333827
## 37   18        casual 173916
## 38   18        member 273027
## 39   19        casual 133507
## 40   19        member 197278
## 41   20        casual  98741
## 42   20        member 139696
## 43   21        casual  85741
## 44   21        member 108884
## 45   22        casual  78560
## 46   22        member  83741
## 47   23        casual  58183
## 48   23        member  54971

#plot trips per time of day, for each membership type.
  num_trips_by_TOD %>% 
    ggplot() +
    geom_line(aes(x=hour, y=n, color=member_casual, group=member_casual), size = 2) +
    scale_y_continuous(name = "Number of Trips", labels = function(x) format(x, big.mark = ",", scientific = FALSE)) +
    labs(x = "Time of Day", y = "Number of Trips", fill = "Membership Type") +
    ggtitle("Number of Trips per Time of Day") +
    theme(plot.title=element_text( hjust=0.5, vjust=0.5, face='bold')) +
    guides(color=guide_legend("Membership Type"))

Insight: Member ridership spikes at 8am, around 12pm, and at 5pm, suggesting bikes are used for work commutes. Casual ridership shows little morning spike, rising through the day to a peak at 5pm.
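
Peak hours can be read from the hourly counts programmatically. The sketch below copies four hours (03, 08, 12, 17) from the table above rather than using the full data. The names hourly_demo, peak_hour, and wide are introduced only for this example.

```r
# A few rows copied from the hourly table above (hours 03, 08, 12, 17).
hourly_demo <- data.frame(
  hour          = c("03", "03", "08", "08", "12", "12", "17", "17"),
  member_casual = rep(c("casual", "member"), times = 4),
  n             = c(10184, 7584, 62715, 198911, 118271, 181635, 192057, 333827)
)

# Split by rider type and pick the hour with the largest trip count in each group.
peak_hour <- sapply(split(hourly_demo, hourly_demo$member_casual),
                    function(group) group$hour[which.max(group$n)])
peak_hour

# Ratio of member to casual trips per hour shows where the groups diverge most.
wide <- reshape(hourly_demo, idvar = "hour", timevar = "member_casual", direction = "wide")
wide$member_to_casual <- round(wide$n.member / wide$n.casual, 2)
wide[, c("hour", "member_to_casual")]
```

In this subset both groups peak at hour 17, while the member-to-casual ratio is highest at hour 08 (about 3.17), where member trips outnumber casual trips most strongly.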

Rideable Type

#create data frame for number of trips by rideable type.      
  num_trips_by_rideable_type <- table(data_all_v10$rideable_type, data_all_v10$member_casual) %>% 
    as.data.frame()
  num_trips_by_rideable_type_renamed <- rename(num_trips_by_rideable_type, Rideable_Type = Var1, Membership_Type = Var2)

#plot number of trips by rideable type, for each membership type.
  num_trips_by_rideable_type_renamed %>% 
    ggplot(aes(x=Rideable_Type, y=Freq, fill=Membership_Type)) +
    geom_col(position = position_dodge(), width=0.6) +
    labs(x = "Rideable Type", y = "Number of Trips", fill = "Membership Type") +
    scale_x_discrete(labels=function(x) gsub("_", " ", x, fixed=TRUE)) +
    scale_y_continuous(name = "Number of Trips", labels = function(x) format(x, big.mark = ",", scientific = FALSE)) +
    ggtitle("Number of Trips by Rideable Type") +
    theme(plot.title=element_text( hjust=0.5, vjust=0.5, face='bold'))

#plot mean trip duration by rideable type, for each membership type.  
  mean_trip_duration_by_rideable_type %>%   
    ggplot(aes(x = rideable_type, y = mean_td_minutes, fill = member_casual)) + 
    geom_col(position = position_dodge(), width=0.6) +
    coord_cartesian(ylim = c(9,20)) +
    labs(x = "Rideable Type", y = "Mean Trip Duration (minutes)", fill = "Membership Type") +
    scale_x_discrete(labels=function(x) gsub("_", " ", x, fixed=TRUE)) +
    ggtitle("Mean Trip Duration by Rideable Type") +
    theme(plot.title=element_text( hjust=0.5, vjust=0.5, face='bold'))

Insight: The most popular rideable type for casual riders is the electric bike, while members prefer the classic bike. Docked bikes are the least popular and are ridden exclusively by casual riders.

Month

#count number of trips per month.
  data_all_v10$YearMon <- paste(data_all_v10$year,data_all_v10$month, sep="-")
  num_trips_per_month <- table(data_all_v10$YearMon, data_all_v10$member_casual) %>%
    as.data.frame()
  num_trips_per_month_2 <- filter_if(num_trips_per_month, is.numeric, all_vars((.) != 0)) %>%
    as.data.frame()

#plot trips per month.
  num_trips_per_month_2 %>% 
    ggplot() +
    geom_line(aes(x=Var1, y=Freq, color=Var2, group=Var2), size = 2) +
    theme(axis.text.x=element_text(angle = -45, hjust = 0)) +
    scale_y_continuous(name = "Number of Trips", labels = function(x) format(x, big.mark = ",", scientific = FALSE)) +
    labs(x = " ", y = "Number of Trips") +
    ggtitle("Number of Trips per Month") +
    theme(plot.title=element_text( hjust=0.5, vjust=0.5, face='bold')) +
    guides(color=guide_legend("Membership Type"))

Insight: The most popular months for both casual riders and members are May through September. January and February were the least popular. Peak ridership for casual riders occurred in July.

Stations

#list most popular stations, note the number of trips and notable characteristics.
  num_trips_by_station <- data_all_v10 %>% 
    group_by(start_station_name) %>% 
    summarize(n=n()) 
    num_trips_by_station[order(num_trips_by_station$n, decreasing = TRUE),]

#create bubble map to show the number of rides per station.
  if (!requireNamespace("leaflet", quietly = TRUE)) install.packages("leaflet")
  if (!requireNamespace("htmlwidgets", quietly = TRUE)) install.packages("htmlwidgets")
  if (!requireNamespace("htmltools", quietly = TRUE)) install.packages("htmltools")
    library(leaflet)
    library(htmlwidgets)
    library(htmltools)

#create dataframe that groups number of trips by station name w/ latitude and longitude data.
  map_data <- data_all_v10 %>%
    select( start_station_name, 
            start_lat, 
            start_lng) %>%
    group_by(start_station_name) %>%
    mutate(numtrips = n()) %>%
    distinct(start_station_name, .keep_all = TRUE)
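
The distinct() step keeps the coordinates from the first row seen for each station. If coordinates differ slightly between trips from the same station (an assumption about the data, not something checked here), averaging them per station is one alternative. The following is a minimal sketch on a toy data frame. The names toy_rides and toy_map_data and all values are invented for illustration.

```r
library(dplyr)

# Toy stand-in for data_all_v10; coordinates are invented.
toy_rides <- data.frame(
  start_station_name = c("Station A", "Station A", "Station B"),
  start_lat = c(41.8900, 41.8902, 41.8800),
  start_lng = c(-87.6100, -87.6102, -87.6300)
)

# One row per station: trip count plus mean latitude and longitude.
toy_map_data <- toy_rides %>%
  group_by(start_station_name) %>%
  summarize(
    numtrips = n(),
    start_lat = mean(start_lat, na.rm = TRUE),
    start_lng = mean(start_lng, na.rm = TRUE),
    .groups = "drop"
  )

toy_map_data
```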

#create bins for map key.
  map_bins <- seq(0, 50000, by = 5000)

#assign color palette to show trip density per station. 
  my_palette <- colorBin(palette ="viridis", domain = map_data$numtrips, na.color = "transparent", bins = map_bins, reverse = TRUE)
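
With fixed bins, any station whose trip count exceeds the top bin edge falls outside the palette and takes the na.color, which is transparent here. The following is a minimal sketch of deriving the upper edge from the data instead. The toy_numtrips vector is invented, and the 5,000-trip bin width is carried over from the code above.

```r
# Toy trip counts standing in for map_data$numtrips; values are invented.
toy_numtrips <- c(120, 4800, 23500, 61250)

# Round the maximum up to the next multiple of the bin width.
bin_width <- 5000
upper_edge <- ceiling(max(toy_numtrips, na.rm = TRUE) / bin_width) * bin_width

# Bin edges that are guaranteed to cover every station.
toy_bins <- seq(0, upper_edge, by = bin_width)
toy_bins
```

The resulting vector could be passed as the bins argument in place of map_bins.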

#set text for interactive tooltip.
  map_text <- paste("Station name: ", map_data$start_station_name, "<br/>","Number of trips: ", map_data$numtrips, sep = "") %>%
    lapply(htmltools::HTML)

#create interactive html leaflet widget to show trip density per station.
  trips_per_station_map <- leaflet(map_data) %>% 
    addTiles() %>%  

#set Chicago coordinates.
  setView(lng = -87.6298, lat = 41.8781, zoom = 10.5) %>% 
  
#set map style. 
  addProviderTiles("Esri.WorldGrayCanvas") %>%
  
#add circle markers for each station, set fill color, add tooltip.
  addCircleMarkers(~ start_lng, ~ start_lat, 
    fillColor = ~ my_palette(numtrips), 
    fillOpacity = 0.6, 
    color = "white", 
    # radius = ~ numtrips/1500, #this is an option for variable marker size based on number of trips.
    radius = 6,
    stroke = FALSE,
    label = map_text,
    labelOptions = labelOptions(style = list("font-weight" = "normal", padding = "3px 8px"), 
    textsize = "13px", 
    direction = "auto")) %>%
  
#add legend.
  addLegend( 
    pal = my_palette, 
    values = ~ numtrips, 
    opacity = 0.8,
    title = "Number of Trips", 
    position = "bottomright")

#view map.
  trips_per_station_map
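
The htmlwidgets package can write the map to a standalone HTML file for sharing outside the report. The following is a minimal sketch, assuming a leaflet widget like trips_per_station_map. The toy_map object and the output path are hypothetical choices for illustration.

```r
library(leaflet)
library(htmlwidgets)

# Toy map standing in for trips_per_station_map.
toy_map <- leaflet() %>%
  addTiles() %>%
  setView(lng = -87.6298, lat = 41.8781, zoom = 10)

# Hypothetical output location; change to any writable path.
out_file <- file.path(tempdir(), "trips_per_station_map.html")

# selfcontained = FALSE writes supporting files alongside the HTML
# and avoids the need for pandoc.
saveWidget(toy_map, file = out_file, selfcontained = FALSE)
```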

Insight: By far, the most popular station is Streeter Dr & Grand Ave. This station is close to the lakeside train connector, as well as Jane Addams Memorial Park and Ohio Street Beach, so it is likely used by tourists and weekend visitors. Six of the top ten most popular stations are adjacent to public parks, three are in front of large residential buildings, and two are in close vicinity to a fitness center.

Conclusions & Recommendations

Casual riders make more trips on weekends and take longer trips on average than members. Like members, casual riders appear to use bikes for weekday work commutes, as shown by ridership spikes during the a.m. and p.m. rush hours and the typical lunch hour. Casual riders prefer the electric bike, while members prefer the classic bike. Like members, casual riders are most likely to ride in the months of May through September. The most popular stations are adjacent to public parks, large residential buildings, and fitness centers, and the most popular station is close to a train connector, a public beach, and a park. Given casual riders’ preference for weekends and the proximity of popular stations to parks, it is reasonable to assume that casual riders use bikes for exercise and leisure. Because the most popular station is close to a major train connector, casual riders may be coming in from the suburbs or from areas of the city farther from the waterfront parks in east Chicago.

The following is recommended to Cyclistic’s marketing strategy team:

A digital campaign should focus on outdoor recreation in Chicago’s public parks.
Because casual riders are using bikes recreationally, include a limited number of “buddy passes” with membership. This may encourage members to invite friends and family to ride with them, potentially expanding clientele. Passes could be limited to weekends.
The campaign can also encourage patrons of fitness centers to choose bikes over other ways of traveling to and from their fitness centers. Walking is health-conscious but time-consuming; cars and trains are time-efficient but sedentary. Biking is both time-efficient and health-conscious.
Ensure electric bikes are available at the most popular stations, as they are preferred by casual riders.
If financially feasible, consider offering seasonal memberships in addition to annual ones. To potential members, a seasonal plan would likely feel like less of a commitment and less wasteful, since winter weather in Chicago is prohibitive.