Main Purpose
The purpose of this script was to consolidate downloaded Divvy data into a single data frame and then conduct a simple analysis to help answer the key question:
This analysis was based on the Divvy case study ‘Sophisticated, Clear, and Polished’: Divvy and Data Visualization written by Kevin Hartman.
Divvy, a Chicago Department of Transportation program, is a bike-share system in Chicago and Evanston. Divvy is a convenient, fun, and affordable transportation option for commuting to work, getting around town, and exploring Chicago. Like other bike-share systems, Divvy consists of a fleet of specially designed, geotracked, and durable bikes that are locked into a network of docking stations throughout the region. The bikes can be unlocked from one station and returned to any other station in the system 24/7.
Consumers can buy access to Divvy bikes using these options: (1) Single-ride passes for $3 per 30-minute trip; (2) Full-day passes for $15 per day for unlimited three-hour rides in a 24-hour period; and (3) Annual memberships for $99 per year for unlimited 45-minute rides. Small charges (15 cents per minute) are incurred when single rides exceed the maximum time allowance to dissuade consumers from checking out bikes and not returning them on time.
Director of Marketing for Divvy, Lindsay Silk-Kremenak and her small but effective marketing team had overseen a successful launch of the program in the Chicago area, with 750 bikes at 75 stations, in 2013. Since that launch, Silk-Kremenak and her team were the engine behind the program’s growth. By 2019, Divvy had expanded to feature 5,800 bicycles and 608 stations, covering almost all of the city of Chicago and two nearby suburbs.
Consumers could use a bike for short trips by buying a single-ride pass or for an entire day through a full-day pass (internally, Divvy referred to these consumers as “casual riders”). Alternatively, consumers could become annual members to gain unlimited access to Divvy bikes for the entire year (customers Divvy called “members”). Silk-Kremenak’s finance analysts concluded that members were much more profitable than casual riders (both single ride and full day). Although the pricing flexibility helped Divvy attract more customers, it was clear to Silk-Kremenak that her primary objective was maximizing the number of annual members.
Silk-Kremenak turned to her team of data analysts for help. With her objective of converting casual riders to members, she worked with her team to design three questions which, once answered, could provide guidance for the marketing program she needed:
What ways do members and casual riders use Divvy bikes differently?
Why would Casual Riders want use Divvy more?
How can Divvy influence Casual Riders to become Members?
\[\\[.1in]\]
The purpose of this script was to consolidate downloaded Divvy data into a single data frame and then conduct a simple analysis to help answer the key question:
\[\\[.1in]\]
I used the desktop version of RStudio to conduct a case study on time spent between June 2022 and May 2023.
First, I installed the required packages:
instal.packages("tidyverse")
instal.packages("lubridate")
instal.packages("ggplot2")
then:
library(tidyverse)
library(lubridate)
library(ggplot2)
\[\\[.1in]\]
I imported datasets and assigned new file names by these codes:
m06_22 <- (x202206_divvy_tripdata)
m07_22 <- (x202207_divvy_tripdata)
m08_22 <- (x202208_divvy_tripdata)
m09_22 <- (x2022069_divvy_publictripdata)
m10_22 <- (x202210_divvy_tripdata)
m11_22 <- (x202211_divvy_tripdata)
m12_22 <- (x202212_divvy_tripdata)
m01_23 <- (x202301_divvy_tripdata)
m02_23 <- (x202302_divvy_tripdata)
m03_23 <- (x202303_divvy_tripdata)
m04_23 <- (x202304_divvy_tripdata)
m05_23 <- (x202305_divvy_tripdata)
\[\\[.1in]\]
To ensure data consistency, I compared the column names of each file.
colnames(m06_22)
colnames(m07_22)
colnames(m08_22)
colnames(m09_22)
colnames(m10_22)
colnames(m11_22)
colnames(m12_22)
colnames(m01_23)
colnames(m02_23)
\[\\[.1in]\]
The columns in all Tibbles were identified and I didn’t need to rename them in all files.
I Inspected the data frames and look for incongruencies:
str(m06_22)
str(m07_22)
str(m08_22)
str(m09_22)
str(m10_22)
str(m11_22)
str(m12_22)
str(m01_23)
str(m02_23)
str(m03_23)
str(m04_23)
str(m05_23)
I didn’t find an inconsistency in data formatting.
\[\\[.1in]\]
I converted ride_id and rideable_type to character so that they can stack correctly:
m06_22<- mutate(all_trips, ride_id = as.character(ride_id), rideable_type = as.character(rideable_type))
I applied mutate functions for all data frames.
\[\\[.1in]\]
I stacked data frames into one big data frame named all_trips:
all_trips <- rbind(m06_22, m07_22, m08_22, m09_22, m10_22, m011_22, m12_22, m01_23, m02_23, m03_23, m04_23, m05_23)
\[\\[.1in]\]
The data frame contained additional fields, so I removed the following columns: start_lat, end_lat, start_lng, end_lng, start_station_id, start_station_name, end_station_id, end_station_name:
all_trips <- all_trips %>%
select(-c(start_lat, start_lng, end_lat, end_lng, start_station_id, start_station_name, end_station_id, end_station_name))
\[\\[.1in]\]
I inspected the new created tibble named all_trips:
colnames(all_trips)
nrow(all_trips)
dim(all_trips)
head(all_trips)
tail(all_trips)
summary(all_trips)
str(all_trips)
\[\\[.1in]\]
The structure of the all_trips was as this:
\[\\[.1in]\]
I added a ride_length column to create a calculated field for the length of the ride:
all_trips$ride_length <- difftime(all_trips$ended_at,all_trips$started_at)
\[\\[.1in]\]
I needed to aggregate ride data for each month, day, or year. So I added columns for recording each ride’s date, month, day, and year and day of week (before completing these operations, I could only aggregate at the ride level):
all_trips$date <- as.Date(all_trips$started_at)
all_trips$month <- format(as.Date(all_trips$date), "%m")
all_trips$day <- format(as.Date(all_trips$date), "%d")
all_trips$year <- format(as.Date(all_trips$date), "%Y")
all_trips$day_of_week <- format(as.Date(all_trips$date), "%A")
\[\\[.1in]\]
I Inspected the structure of the columns:
str(all_trips)
\[\\[.1in]\]
The ride_length column was not numeric. I converted the type of “ride_length” to numeric in order to perform calculations on the data:
all_trips$ride_length <- as.numeric(as.character(all_trips$ride_length))
Then I checked it:
is.numeric(all_trips$ride_length).
The response was true.
\[\\[.1in]\]
Since the start time will never be equal to or greater than the end time, any value equal to or lower than zero is considered incorrect and represents bad data. To remove this bad data, I created a new version of the data frame named all_trips_v2:
all_trips_v2 <- all_trips %>%
filter(all_trips$ride_length > 0)
I checked new data frame:
str(all_trips_v2)
\[\\[.1in]\]
The new tibble all_trips_v2 had 5,828,463 rows, which is 567 rows fewer than the original “all_trips” tibble.
Fig.2. The structure of the all_trips_v2
\[\\[.1in]\]
I conducted a descriptive analysis on the “ride_length” column in
seconds. I calculated:
Average of ride_length (it was 11122.857 secs):
mean(all_trips_v2$ride_length)
Median (midpoint number in the ascending array) of ride lengths (it was 591 secs):
median(all_trips_v2$ride_length)
Maximum ride length (it was 42483235):
max(all_trips_v2$ride_length)
Minimum ride length (it was 1 sec):
min(all_trips_v2$ride_length)
\[\\[.1in]\]
To apply object-oriented calculations, I aggregated the “ride_length” and “member_casual” columns and applied the mean, median, maximum, and minimum functions. This allowed me to compare the usage patterns between member and casual users:
— Average ride length by users:
aggregate(all_trips_v2$ride_length ~ all_trips_v2$member_casual, FUN = mean)
Average Ride Length time by User tibble was as Fig.3.
\[\\[.1in]\]
Fig.3. Average Ride Length time by User
It was found that significantly casual’s mean ride length is more than 2.26 folds greater than member’s mean ride length.
\[\\[.1in]\]
— Median ride length by users:
aggregate(all_trips_v2$ride_length ~ all_trips_v2$member_casual, FUN = median)
Median Ride Length time by User tibble was as Fig.4.
\[\\[.1in]\]
\[\\[.1in]\]
— Maximum ride length by users:
aggregate(all_trips_v2$ride_length ~ all_trips_v2$member_casual, FUN = max)
Median Ride Length time by User tibble was as Fig.5.
\[\\[.1in]\]
\[\\[.1in]\]
— Minimum ride length by users:
(all_trips_v2$ride_length ~ all_trips_v2$member_casual, FUN = min)
It was found that minimum ride length in both users is 1 sec.
\[\\[.1in]\]
The days of the week were out of order. I fixed it by:
all_trips_v2$day_of_week <- ordered(all_trips_v2$day_of_week, levels=c("Sunday", "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday"))
\[\\[.1in]\]
Then I ran the average ride time by each day for members vs casual users:
aggregate(all_trips_v2$ride_length ~ all_trips_v2$member_casual + all_trips_v2$day_of_week, FUN = mean)
The Average Ride Length Time by Each Day for Members vs Casual Users tibble was as Fig.6.
\[\\[.1in]\]
\[\\[.1in]\]
I analyzed ridership data by type and weekday. I created a weekday field using wday(), then groepd by user type and weekday, then calculated number of rides and average duration and eentually sorted by member type:
all_trips_v2 %>%
mutate(weekday = wday(started_at, label = TRUE)) %>%
group_by(member_casual, weekday) %>%
summarise(number_of_rides = n(), average_duration = mean(ride_length)) %>%
arrange(member_casual, weekday)
The Number and Average Length of Rides by Users in Weekdays tibble was as Fig.7.
\[\\[.1in]\]
\[\\[.1in]\]
I made a plot for the number of rides by rider type in weekdays:
all_trips_v2 %>%
mutate(weekday = wday(started_at, label = TRUE)) %>%
group_by(member_casual, weekday) %>%
summarise(number_of_rides = n(), average_duration = mean(ride_length)) %>%
arrange(member_casual, weekday) %>%
ggplot(aes(x = weekday, y = number_of_rides, fill = member_casual)) +
geom_col(position = "dodge") +
ggtitle("Number of Rides by Rider Type", subtitle = "06-2022 to 05-2023")
\[\\[.1in]\]
\[\\[.1in]\]
I made a plot for the average duration by rider type:
all_trips_v2 %>%
mutate(weekday = wday(started_at, label = TRUE)) %>%
group_by(member_casual, weekday) %>%
summarise(number_of_rides = n(), average_duration = mean(ride_length)) %>%
arrange(member_casual, weekday) %>%
ggplot(aes(x = weekday, y = average_duration, fill = member_casual)) +
geom_col(position = "dodge") +
ggtitle("Average of Ride Duration by Rider Type", subtitle = "06-2022 to 05-2023")
\[\\[.1in]\]
I created a summary csv file named ride_summary to visualize data in Excel, Tableau, or any other presentation software:
ride_summary <- aggregate(all_trips_v2$ride_length ~ all_trips_v2$member_casual + all_trips_v2$day_of_week, FUN = mean)
Then I exported it in csv format to my computer:
write.csv(ride_summary, file = '~/Desktop/Divvy_Exercise/ride_summary.csv')
\[\\[.1in]\]