This analysis was based on the Divvy case study ‘Sophisticated, Clear, and Polished’: Divvy and Data Visualization written by Kevin Hartman. The purpose of this script was to consolidate downloaded Divvy data into a single data frame and then conduct a simple analysis to help answer the key question:
I used the desktop version of RStudio to conduct a case study on time spent between June 2022 and May 2023.
\[\\[.1in]\]
First, I installed the required packages:
for data import and wrangling:
instal.packages(“tidyverse”)
for date functions:
instal.packages (“lubridate”)
for visualization:
instal.packages (“ggplot”)
then:
\[\\[.1in]\]
I imported datasets and assigned new file names by these codes:
m06_22 <- (x202206_divvy_tripdata)
m07_22 <- (x202207_divvy_tripdata)
m08_22 <- (x202208_divvy_tripdata)
m09_22 <- (x2022069_divvy_publictripdata)
m10_22 <- (x202210_divvy_tripdata)
m11_22 <- (x202211_divvy_tripdata)
m12_22 <- (x202212_divvy_tripdata)
m01_23 <- (x202301_divvy_tripdata)
m02_23 <- (x202302_divvy_tripdata)
m03_23 <- (x202303_divvy_tripdata)
m04_23 <- (x202304_divvy_tripdata)
m05_23 <- (x202305_divvy_tripdata)
\[\\[.1in]\]
To ensure data consistency, I compared the column names of each file.
colnames(m06_22)
colnames(m07_22)
colnames(m08_22)
colnames(m09_22)
colnames(m10_22)
colnames(m11_22)
colnames(m12_22)
colnames(m01_23)
colnames(m02_23)
\[\\[.1in]\]
The columns in all Tibbles were identified and I didn’t need to rename them in all files.
I Inspected the data frames and look for incongruencies:
str(m06_22)
str(m07_22)
str(m08_22)
str(m09_22)
str(m10_22)
str(m11_22)
str(m12_22)
str(m01_23)
str(m02_23)
str(m03_23)
str(m04_23)
str(m05_23)
I didn’t find an inconsistency in data formatting.
\[\\[.1in]\]
I converted ride_id and rideable_type to character so that they can stack correctly:
m06_22<- mutate(all_trips, ride_id = as.character(ride_id), rideable_type = as.character(rideable_type))
I applied mutate functions for all data frames.
\[\\[.1in]\]
I stacked data frames into one big data frame named all_trips:
all_trips <- rbind(m06_22, m07_22, m08_22, m09_22, m10_22, m011_22, m12_22, m01_23, m02_23, m03_23, m04_23, m05_23)
\[\\[.1in]\]
The data frame contained additional fields, so I removed the following columns: start_lat, end_lat, start_lng, end_lng, start_station_id, start_station_name, end_station_id, end_station_name:
all_trips <- all_trips %>%
select(-c(start_lat, start_lng, end_lat, end_lng, start_station_id, start_station_name, end_station_id, end_station_name))
\[\\[.1in]\]
I inspected the new created tibble named all_trips:
colnames(all_trips)
nrow(all_trips)
dim(all_trips)
head(all_trips)
tail(all_trips)
summary(all_trips)
str(all_trips)
\[\\[.1in]\]
The structure of the all_trips was as this:
\[\\[.1in]\]
I added a ride_length column to create a calculated field for the length of the ride:
all_trips$ride_length <- difftime(all_trips$ended_at,all_trips$started_at)
\[\\[.1in]\]
I needed to aggregate ride data for each month, day, or year. So I added columns for recording each ride’s date, month, day, and year and day of week (before completing these operations, I could only aggregate at the ride level):
all_trips$date <- as.Date(all_trips$started_at)
all_trips$month <- format(as.Date(all_trips$date), "%m")
all_trips$day <- format(as.Date(all_trips$date), "%d")
all_trips$year <- format(as.Date(all_trips$date), "%Y")
all_trips$day_of_week <- format(as.Date(all_trips$date), "%A")
\[\\[.1in]\]
I Inspected the structure of the columns:
str(all_trips)
\[\\[.1in]\]
The ride_length column was not numeric. I converted the type of “ride_length” to numeric in order to perform calculations on the data:
all_trips$ride_length <- as.numeric(as.character(all_trips$ride_length))
Then I checked it:
is.numeric(all_trips$ride_length).
The response was true.
\[\\[.1in]\]
Since the start time will never be equal to or greater than the end time, any value equal to or lower than zero is considered incorrect and represents bad data. To remove this bad data, I created a new version of the data frame named all_trips_v2:
all_trips_v2 <- all_trips %>%
filter(all_trips$ride_length > 0)
I checked new data frame:
str(all_trips_v2)
\[\\[.1in]\]
The new tibble all_trips_v2 had 5,828,463 rows, which is 567 rows fewer than the original “all_trips” tibble.
Fig.2. The structure of the all_trips_v2
\[\\[.1in]\]
I conducted a descriptive analysis on the “ride_length” column in
seconds. I calculated:
Average of ride_length (it was 11122.857 secs):
mean(all_trips_v2$ride_length)
Median (midpoint number in the ascending array) of ride lengths (it was 591 secs):
median(all_trips_v2$ride_length)
Maximum ride length (it was 42483235):
max(all_trips_v2$ride_length)
Minimum ride length (it was 1 sec):
min(all_trips_v2$ride_length)
\[\\[.1in]\]
To apply object-oriented calculations, I aggregated the “ride_length” and “member_casual” columns and applied the mean, median, maximum, and minimum functions. This allowed me to compare the usage patterns between member and casual users:
— Average ride length by users:
aggregate(all_trips_v2$ride_length ~ all_trips_v2$member_casual, FUN = mean)
Average Ride Length time by User tibble was as Fig.3.
\[\\[.1in]\]
Fig.3. Average Ride Length time by User
It was found that significantly casual’s mean ride length is more than 2.26 folds greater than member’s mean ride length.
\[\\[.1in]\]
— Median ride length by users:
aggregate(all_trips_v2$ride_length ~ all_trips_v2$member_casual, FUN = median)
Median Ride Length time by User tibble was as Fig.4.
\[\\[.1in]\]
\[\\[.1in]\]
— Maximum ride length by users:
aggregate(all_trips_v2$ride_length ~ all_trips_v2$member_casual, FUN = max)
Median Ride Length time by User tibble was as Fig.5.
\[\\[.1in]\]
\[\\[.1in]\]
— Minimum ride length by users:
(all_trips_v2$ride_length ~ all_trips_v2$member_casual, FUN = min)
It was found that minimum ride length in both users is 1 sec.
\[\\[.1in]\]
The days of the week were out of order. I fixed it by:
all_trips_v2$day_of_week <- ordered(all_trips_v2$day_of_week, levels=c("Sunday", "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday"))
\[\\[.1in]\]
Then I ran the average ride time by each day for members vs casual users:
aggregate(all_trips_v2$ride_length ~ all_trips_v2$member_casual + all_trips_v2$day_of_week, FUN = mean)
The Average Ride Length Time by Each Day for Members vs Casual Users tibble was as Fig.6.
\[\\[.1in]\]
\[\\[.1in]\]
I analyzed ridership data by type and weekday. I created a weekday field using wday(), then groepd by user type and weekday, then calculated number of rides and average duration and eentually sorted by member type:
all_trips_v2 %>%
mutate(weekday = wday(started_at, label = TRUE)) %>%
group_by(member_casual, weekday) %>%
summarise(number_of_rides = n(), average_duration = mean(ride_length)) %>%
arrange(member_casual, weekday)
The Number and Average Length of Rides by Users in Weekdays tibble was as Fig.7.
\[\\[.1in]\]
\[\\[.1in]\]
I made a plot for the number of rides by rider type in weekdays:
all_trips_v2 %>%
mutate(weekday = wday(started_at, label = TRUE)) %>%
group_by(member_casual, weekday) %>%
summarise(number_of_rides = n(), average_duration = mean(ride_length)) %>%
arrange(member_casual, weekday) %>%
ggplot(aes(x = weekday, y = number_of_rides, fill = member_casual)) +
geom_col(position = "dodge") +
ggtitle("Number of Rides by Rider Type", subtitle = "06-2022 to 05-2023")
\[\\[.1in]\]
\[\\[.1in]\]
I made a plot for the average duration by rider type:
all_trips_v2 %>%
mutate(weekday = wday(started_at, label = TRUE)) %>%
group_by(member_casual, weekday) %>%
summarise(number_of_rides = n(), average_duration = mean(ride_length)) %>%
arrange(member_casual, weekday) %>%
ggplot(aes(x = weekday, y = average_duration, fill = member_casual)) +
geom_col(position = "dodge") +
ggtitle("Average of Ride Duration by Rider Type", subtitle = "06-2022 to 05-2023")
\[\\[.1in]\]
I created a summary csv file named ride_summary to visualize data in Excel, Tableau, or any other presentation software:
ride_summary <- aggregate(all_trips_v2$ride_length ~ all_trips_v2$member_casual + all_trips_v2$day_of_week, FUN = mean)
Then I exported it in csv format to my computer:
write.csv(ride_summary, file = '~/Desktop/Divvy_Exercise/ride_summary.csv')
\[\\[.1in]\]