knitr::opts_chunk$set(echo = TRUE)
This is for the Google Data Analytics Capstone project on Coursera. For the purposes of this exercise, we will be using public data from a bike share company called Divvy. In this exercise, I work for a fictional company, Divvy. In order to answer the key business questions, I will follow the steps of the data analysis process: ask, prepare, process, analyze, share, and act.
In 2016, Cyclistic launched a successful bike-share offering. Since then, the program has grown to a fleet of 5,824 bicycles that are geotracked and locked into a network of 692 stations across Chicago. The bikes can be unlocked from one station and returned to any other station in the system anytime. Until now, Cyclistic’s marketing strategy relied on building general awareness and appealing to broad consumer segments. One approach that helped make these things possible was the flexibility of its pricing plans: single-ride passes, full-day passes, and annual memberships. Customers who purchase single-ride or full-day passes are referred to as casual riders. Customers who purchase annual memberships are Cyclistic members. Cyclistic finance analyst: Design marketing strategies aimed at converting casual riders into annual members. In order to do that, however, the marketing analyst team needs to better understand how annual members and casual riders differ, why casual riders would buy a membership, and how digital media could affect their marketing tactics. Moreno and her team are interested in analyzing the Cyclistic historical bike trip data to identify trends.
1. How do annual members and casual riders use Cyclistic bikes differently?
2. Why would casual riders buy Cyclistic annual memberships?
3. How can Cyclistic use digital media to influence casual riders to become members?
To deliver recommendations by analyzing the 2022 trip data for Divvy to understand how casual riders, and annual members use the service differently. Using that information I will try to provide answers on how to convert casual riders to annual members.
I used the most recent twelve-month (#December_2021-September_2022) historical trip datasets provided by Divvy to perform this analysis. This includes twelve different files each containing a month of data ranging from August 2021, to July 2022. This is public data that you can use to explore how different customer types are using Divvy bikes. The data has been made available by Motivate International Inc. under this license.
Each dataset is made up of thirteen columns:
[1] ride_id rideable_type started_at
[4] ended_at start_station_name start_station_id
[7] end_station_name end_station_id start_lat
[10] start_lng end_lat end_lng
[13] member_casual
In this phase we will process the data by cleaning and ensuring that it is correct,relevant,complete and error free. - We have to check if data contains any missing or null values -Transform the data into format we want for the analysis
I will be using the programming language R to conduct this analysis.
First I need to install and load relevant packages:
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.4.0 ✔ purrr 1.0.1
## ✔ tibble 3.1.8 ✔ dplyr 1.1.0
## ✔ tidyr 1.3.0 ✔ stringr 1.5.0
## ✔ readr 2.1.3 ✔ forcats 1.0.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
library(lubridate)
##
## Attaching package: 'lubridate'
##
## The following objects are masked from 'package:base':
##
## date, intersect, setdiff, union
library(timechange)
library(ggplot2)
library(dplyr)
In the project we make use of Cyclistic’s historical trip data to analyze and identify trends.Download the previous 12 months of Cyclistic trip data here. (Note: The datasets have a different name because Cyclistic is a fictional company. For the purposes of this case study, the datasets are appropriate and will enable you to answer the business questions. The data has been made available by Motivate International Inc. under this license.) This is a public data that can be use to explore how different customer types are using Cyclistic bikes.
Dec_21 <-read.csv("C:\\Users\\LAITAN PC\\Desktop\\Case1_divvy_tripdata\\Case_1\\csv\\202112-divvy-tripdata.csv")
Nov_21 <-read.csv("C:\\Users\\LAITAN PC\\Desktop\\Case1_divvy_tripdata\\Case_1\\csv\\202111-divvy-tripdata.csv")
Oct_21 <-read.csv("C:\\Users\\LAITAN PC\\Desktop\\Case1_divvy_tripdata\\Case_1\\csv\\202110-divvy-tripdata.csv")
Jan_22 <-read.csv("C:\\Users\\LAITAN PC\\Desktop\\Case1_divvy_tripdata\\Case_1\\csv\\202201-divvy-tripdata.csv")
Feb_22 <- read.csv("C:\\Users\\LAITAN PC\\Desktop\\Case1_divvy_tripdata\\Case_1\\csv\\202202-divvy-tripdata.csv")
Mar_22 <-read.csv("C:\\Users\\LAITAN PC\\Desktop\\Case1_divvy_tripdata\\Case_1\\csv\\202203-divvy-tripdata.csv")
Apr_22 <-read.csv("C:\\Users\\LAITAN PC\\Desktop\\Case1_divvy_tripdata\\Case_1\\csv\\202204-divvy-tripdata.csv")
May_22 <-read.csv("C:\\Users\\LAITAN PC\\Desktop\\Case1_divvy_tripdata\\Case_1\\csv\\202205-divvy-tripdata.csv")
Jun_22 <-read.csv("C:\\Users\\LAITAN PC\\Desktop\\Case1_divvy_tripdata\\Case_1\\csv\\202206-divvy-tripdata.csv")
Jul_22 <-read.csv("C:\\Users\\LAITAN PC\\Desktop\\Case1_divvy_tripdata\\Case_1\\csv\\202207-divvy-tripdata.csv")
Aug_22 <-read.csv("C:\\Users\\LAITAN PC\\Desktop\\Case1_divvy_tripdata\\Case_1\\csv\\202208-divvy-tripdata.csv")
Sep_22 <-read.csv("C:\\Users\\LAITAN PC\\Desktop\\Case1_divvy_tripdata\\Case_1\\csv\\202209-divvy-publictripdata.csv")
To ensure consistency, we make sure all of these files have the same columns, so I will run colnames() on each csv file. Once I verify that all csv files contain each of the 13 columns shown above, I need to use rbind() to merge all files together vertically into a single large data frame.
Total_trip <- rbind(Dec_21, Nov_21, Oct_21, Jan_22, Feb_22, Mar_22, Apr_22, May_22, Jun_22, Jul_22 , Aug_22, Sep_22)
To make sure our data is clean, let’s check for and remove NA/null values. We will have to assign this to a new data frame.
Total_trip_1 <- na.omit(Total_trip)
We are going to add a new column called “ride_length” to calculate the length of each trip in minutes. We will use the difftime() function to do this, and shown in the chunk below:
Total_trip_1$ride_length <- difftime(Total_trip_1$ended_at, Total_trip_1$started_at, units="mins")
This will help to confirm the data_types that was previuosly loaded.
str(Total_trip_1)
## 'data.frame': 5822391 obs. of 14 variables:
## $ ride_id : chr "46F8167220E4431F" "73A77762838B32FD" "4CF42452054F59C5" "3278BA87BF698339" ...
## $ rideable_type : chr "electric_bike" "electric_bike" "electric_bike" "classic_bike" ...
## $ started_at : chr "2021-12-07 15:06:07" "2021-12-11 03:43:29" "2021-12-15 23:10:28" "2021-12-26 16:16:10" ...
## $ ended_at : chr "2021-12-07 15:13:42" "2021-12-11 04:10:23" "2021-12-15 23:23:14" "2021-12-26 16:30:53" ...
## $ start_station_name: chr "Laflin St & Cullerton St" "LaSalle Dr & Huron St" "Halsted St & North Branch St" "Halsted St & North Branch St" ...
## $ start_station_id : chr "13307" "KP1705001026" "KA1504000117" "KA1504000117" ...
## $ end_station_name : chr "Morgan St & Polk St" "Clarendon Ave & Leland Ave" "Broadway & Barry Ave" "LaSalle Dr & Huron St" ...
## $ end_station_id : chr "TA1307000130" "TA1307000119" "13137" "KP1705001026" ...
## $ start_lat : num 41.9 41.9 41.9 41.9 41.9 ...
## $ start_lng : num -87.7 -87.6 -87.6 -87.6 -87.7 ...
## $ end_lat : num 41.9 42 41.9 41.9 41.9 ...
## $ end_lng : num -87.7 -87.7 -87.6 -87.6 -87.6 ...
## $ member_casual : chr "member" "casual" "member" "member" ...
## $ ride_length : 'difftime' num 7.58333333333333 26.9 12.7666666666667 14.7166666666667 ...
## ..- attr(*, "units")= chr "mins"
## - attr(*, "na.action")= 'omit' Named int [1:5844] 3342 3391 3467 3586 3681 3692 3693 111319 111481 111526 ...
## ..- attr(*, "names")= chr [1:5844] "3342" "3391" "3467" "3586" ...
Using View(trip_data) to check out our new column, we can see that there are some negative ride times. Let’s check how many columns display negative ride lengths and then remove them:
Total_trip_2 <- Total_trip_1[!Total_trip_1$ride_length < 0,]
Now we are going to add four new columns. The first will be populated with the abbreviated day of the week that the ride started on. We will also make a column for the month and the year. We can use this to explore how both casual riders and members use the bikes differently depending on the day of the week, month, or year. Going further, lets add a column isolating the hour that the ride started_at
Even though we already have this data in the “started_at” and “ended_at” columns, breaking it down into its own columns will make it easier to aggregate later on.
Total_trip_2$month <- format(as.Date(Total_trip_2$started_at), "%m")
Total_trip_2$year <- format(as.Date(Total_trip_2$started_at), "%Y")
Total_trip_2$start_hour <- format(as.POSIXct(Total_trip_2$started_at), format = "%H")
Total_trip_2 <- mutate(Total_trip_2, started_at = as_datetime(started_at), ended_at = as_datetime(ended_at))
Total_trip_2 <- mutate(Total_trip_2, "days_of_week" = weekdays(started_at))
glimpse(Total_trip_2)
## Rows: 5,822,283
## Columns: 18
## $ ride_id <chr> "46F8167220E4431F", "73A77762838B32FD", "4CF4245205…
## $ rideable_type <chr> "electric_bike", "electric_bike", "electric_bike", …
## $ started_at <dttm> 2021-12-07 15:06:07, 2021-12-11 03:43:29, 2021-12-…
## $ ended_at <dttm> 2021-12-07 15:13:42, 2021-12-11 04:10:23, 2021-12-…
## $ start_station_name <chr> "Laflin St & Cullerton St", "LaSalle Dr & Huron St"…
## $ start_station_id <chr> "13307", "KP1705001026", "KA1504000117", "KA1504000…
## $ end_station_name <chr> "Morgan St & Polk St", "Clarendon Ave & Leland Ave"…
## $ end_station_id <chr> "TA1307000130", "TA1307000119", "13137", "KP1705001…
## $ start_lat <dbl> 41.85483, 41.89441, 41.89936, 41.89939, 41.89558, 4…
## $ start_lng <dbl> -87.66366, -87.63233, -87.64852, -87.64854, -87.682…
## $ end_lat <dbl> 41.87197, 41.96797, 41.93758, 41.89488, 41.93125, 4…
## $ end_lng <dbl> -87.65097, -87.65000, -87.64410, -87.63233, -87.644…
## $ member_casual <chr> "member", "casual", "member", "member", "member", "…
## $ ride_length <drtn> 7.583333 mins, 26.900000 mins, 12.766667 mins, 14.…
## $ month <chr> "12", "12", "12", "12", "12", "12", "12", "12", "12…
## $ year <chr> "2021", "2021", "2021", "2021", "2021", "2021", "20…
## $ start_hour <chr> "15", "03", "23", "16", "11", "18", "15", "13", "14…
## $ days_of_week <chr> "Tuesday", "Saturday", "Wednesday", "Sunday", "Thur…
Before we move on to the analysis, we need to first convert ‘ride_length’ from Factor to Numeric so we can run calculations on the data.
is.factor(Total_trip_2$ride_length)
## [1] FALSE
Total_trip_2$ride_length <- as.numeric(as.character(Total_trip_2$ride_length))
is.numeric(Total_trip_2$ride_length)
## [1] TRUE
Now we have everything we need to move on to the analysis phase.
We will use the summary() function to quickly get the min, median, mean, and max for “ride_length”.
summary(Total_trip_2$ride_length)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 5.93 10.48 16.76 18.82 40705.02
#Number of rides per hour of the day
Total_trip_2 %>%
mutate(hour = format(as.POSIXct(Total_trip_2$started_at), format = "%H")) %>%
group_by(member_casual, hour) %>%
summarise(number_of_rides = n()
,average_duration = mean(ride_length)) %>%
arrange(member_casual, hour) %>%
ggplot(aes(x = hour, y = number_of_rides, fill = member_casual)) +
geom_col(position = "dodge")
## `summarise()` has grouped output by 'member_casual'. You can override using the
## `.groups` argument.
Based on the graph above, we can come to the following conclusions: Members tend to have a higher frequency of rides between 6:00-9:00 and 15:00-19:00. This would seem to suggest that annual members use their bikes to commute to and from work. Casual riders have a spike in number of rides around 15:00-18:00 as well, but that isn’t reflected in the morning meaning this is likely not reflecting casual riders using the bikes to commute to and from work but rather perhaps after work activities.
# analyze ridership data by type and weekday
Total_trip_2 %>%
mutate(day_of_week = wday(started_at, label = TRUE)) %>% #creates weekday field using wday()
group_by(member_casual, day_of_week) %>% #groups by usertype(member_casual) and weekday
summarise(number_of_rides = n() #calculates the number of rides and average duration
,average_duration = mean(ride_length)) %>% # calculates the average duration
arrange(member_casual, day_of_week) # sorts
## `summarise()` has grouped output by 'member_casual'. You can override using the
## `.groups` argument.
## # A tibble: 14 × 4
## # Groups: member_casual [2]
## member_casual day_of_week number_of_rides average_duration
## <chr> <ord> <int> <dbl>
## 1 casual Sun 403983 26.3
## 2 casual Mon 279170 23.4
## 3 casual Tue 275219 20.3
## 4 casual Wed 281149 19.8
## 5 casual Thu 306111 20.3
## 6 casual Fri 351729 21.3
## 7 casual Sat 498682 25.5
## 8 member Sun 393502 13.9
## 9 member Mon 472981 12.1
## 10 member Tue 541416 11.9
## 11 member Wed 538401 11.9
## 12 member Thu 530450 12.0
## 13 member Fri 491368 12.3
## 14 member Sat 458122 13.9
# Let's visualize the number of rides by rider type per weekday
Total_trip_2 %>%
mutate(day_of_week = wday(started_at, label = TRUE)) %>%
group_by(member_casual, day_of_week) %>%
summarise(number_of_rides = n()
,average_duration = mean(ride_length)) %>%
arrange(member_casual, day_of_week) %>%
ggplot(aes(x = day_of_week, y = number_of_rides, fill = member_casual)) +
geom_col(position = "dodge")
## `summarise()` has grouped output by 'member_casual'. You can override using the
## `.groups` argument.
We can make some interesting discoveries from the graph shown above. We can see that during week day, members have a much higher frequency of rides than casual riders. Casual riders however, pull ahead with slightly more rides on the weekends, while the number of rides for members actually decreases slightly.
Total_trip_2 %>%
mutate(day_of_week = wday(started_at, label = TRUE)) %>%
group_by(member_casual, day_of_week) %>%
summarise(number_of_rides = n()
,average_duration = mean(ride_length)) %>%
arrange(member_casual, day_of_week) %>%
ggplot(aes(x = day_of_week, y = average_duration, fill = member_casual)) +
geom_col(position = "dodge")
## `summarise()` has grouped output by 'member_casual'. You can override using the
## `.groups` argument.
In this graph, we find that casual riders, while not taking quite as many rides, do in fact take much longer rides than members throughout each day of the week, with slight increases on the weekends. The average duration for members rides stays very consistent throughout the week.
Total_trip_2 %>%
mutate(month = format(as.Date(Total_trip_2$started_at), "%m")) %>%
group_by(member_casual, month) %>%
summarise(number_of_rides = n()
,average_duration = mean(ride_length)) %>%
arrange(member_casual, month) %>%
ggplot(aes(x = month, y = number_of_rides, fill = member_casual)) +
geom_col(position = "dodge")
## `summarise()` has grouped output by 'member_casual'. You can override using the
## `.groups` argument.
We observe from this graph that casual riders always has less number of rides than annual members in every months of the year. It is worth noting that both members and casual riders number of rides follows a bell curve, meaning that both parties are using bikes more in the warmer months, than the colder ones. Casual riders bike trips dips drastically between November and March.
Total_trip_2 %>%
group_by(member_casual, rideable_type) %>%
arrange(member_casual, rideable_type) %>%
ggplot(aes(x = rideable_type, fill = member_casual)) +
geom_bar(position = "dodge")
Based on this graph, both members and casual riders prefer to use the classic bike type as well electric bike type. It appear to be less important to focus on the type of bike the riders used.
Casual riders take less trips during the week day than annual members do. We believe this to be due to members taking more rides on the way to work, and on the way home from work.
We suggest increasing casual riders week day trips by advertising at work commute hours. (7:00-9:00 and 16:00-18:00)
We recommend advertising between May and September. These are the months with the highest frequencies of bike trips.
Casual riders and members both prefer the classic bike, since the use of electric bikes aren’t too far behind, advertiser should recommend both bike to both user types .
Note that the echo = FALSE parameter was added to the code chunk to prevent printing of the R code that generated the plot.