This analysis was based on the Divvy case study ‘Sophisticated, Clear, and Polished’: Divvy and Data Visualization written by Kevin Hartman. The purpose of this script was to consolidate downloaded Divvy data into a single data frame and then conduct a simple analysis to help answer the key question:

“In what ways do members and casual riders use Divvy bikes differently?”

I used the desktop version of RStudio to conduct a case study on time spent between June 2022 and May 2023.

\[\\[.1in]\]


1. RSTUDIO PREPARATION

First, I installed the required packages:

then:

\[\\[.1in]\]


2. DATA GATHERING

I imported datasets and assigned new file names by these codes:

m06_22 <- (x202206_divvy_tripdata)
m07_22 <- (x202207_divvy_tripdata)
m08_22 <- (x202208_divvy_tripdata)
m09_22 <- (x2022069_divvy_publictripdata)
m10_22 <- (x202210_divvy_tripdata)
m11_22 <- (x202211_divvy_tripdata)
m12_22 <- (x202212_divvy_tripdata)
m01_23 <- (x202301_divvy_tripdata)
m02_23 <- (x202302_divvy_tripdata)
m03_23 <- (x202303_divvy_tripdata)
m04_23 <- (x202304_divvy_tripdata)
m05_23 <- (x202305_divvy_tripdata)

\[\\[.1in]\]


3. DATA MANIPULATION

To ensure data consistency, I compared the column names of each file.

colnames(m06_22)
colnames(m07_22)
colnames(m08_22)
colnames(m09_22)
colnames(m10_22)
colnames(m11_22)
colnames(m12_22)
colnames(m01_23)
colnames(m02_23)

\[\\[.1in]\]

The columns in all Tibbles were identified and I didn’t need to rename them in all files.

I Inspected the data frames and look for incongruencies:

str(m06_22)
str(m07_22)
str(m08_22)
str(m09_22)
str(m10_22)
str(m11_22)
str(m12_22)
str(m01_23)
str(m02_23)
str(m03_23)
str(m04_23)
str(m05_23)

I didn’t find an inconsistency in data formatting.

\[\\[.1in]\]

I converted ride_id and rideable_type to character so that they can stack correctly:

m06_22<-  mutate(all_trips, ride_id = as.character(ride_id), rideable_type = as.character(rideable_type)) 

I applied mutate functions for all data frames.

\[\\[.1in]\]

I stacked data frames into one big data frame named all_trips:

all_trips <- rbind(m06_22, m07_22, m08_22, m09_22, m10_22, m011_22, m12_22, m01_23, m02_23, m03_23, m04_23, m05_23)

\[\\[.1in]\]

The data frame contained additional fields, so I removed the following columns: start_lat, end_lat, start_lng, end_lng, start_station_id, start_station_name, end_station_id, end_station_name:

all_trips <- all_trips %>%  
  select(-c(start_lat, start_lng, end_lat, end_lng, start_station_id, start_station_name, end_station_id, end_station_name))

\[\\[.1in]\]


4. DATA CLEANING

I inspected the new created tibble named all_trips:

colnames(all_trips) 
nrow(all_trips)
dim(all_trips)
head(all_trips)  
tail(all_trips)
summary(all_trips)
str(all_trips)  

\[\\[.1in]\]

The structure of the all_trips was as this:

Fig.1. Structure of all_trips
Fig.1. Structure of all_trips

\[\\[.1in]\]

I added a ride_length column to create a calculated field for the length of the ride:

all_trips$ride_length <- difftime(all_trips$ended_at,all_trips$started_at)

\[\\[.1in]\]

I needed to aggregate ride data for each month, day, or year. So I added columns for recording each ride’s date, month, day, and year and day of week (before completing these operations, I could only aggregate at the ride level):

all_trips$date <- as.Date(all_trips$started_at)
all_trips$month <- format(as.Date(all_trips$date), "%m")
all_trips$day <- format(as.Date(all_trips$date), "%d")
all_trips$year <- format(as.Date(all_trips$date), "%Y")
all_trips$day_of_week <- format(as.Date(all_trips$date), "%A")

\[\\[.1in]\]

I Inspected the structure of the columns:

str(all_trips)

\[\\[.1in]\]

The ride_length column was not numeric. I converted the type of “ride_length” to numeric in order to perform calculations on the data:

all_trips$ride_length <- as.numeric(as.character(all_trips$ride_length))  

Then I checked it:

is.numeric(all_trips$ride_length). 

The response was true.

\[\\[.1in]\]

Since the start time will never be equal to or greater than the end time, any value equal to or lower than zero is considered incorrect and represents bad data. To remove this bad data, I created a new version of the data frame named all_trips_v2:

all_trips_v2 <- all_trips %>% 
    filter(all_trips$ride_length > 0)

I checked new data frame:

str(all_trips_v2)  

\[\\[.1in]\]

The new tibble all_trips_v2 had 5,828,463 rows, which is 567 rows fewer than the original “all_trips” tibble.

Fig.2. The structure of the all_trips_v2

\[\\[.1in]\]


5. DATA ANALYSIS

I conducted a descriptive analysis on the “ride_length” column in seconds. I calculated:
Average of ride_length (it was 11122.857 secs):

mean(all_trips_v2$ride_length)   

Median (midpoint number in the ascending array) of ride lengths (it was 591 secs):

median(all_trips_v2$ride_length) 

Maximum ride length (it was 42483235):

max(all_trips_v2$ride_length)

Minimum ride length (it was 1 sec):

min(all_trips_v2$ride_length)   

\[\\[.1in]\]

To apply object-oriented calculations, I aggregated the “ride_length” and “member_casual” columns and applied the mean, median, maximum, and minimum functions. This allowed me to compare the usage patterns between member and casual users:

— Average ride length by users:

aggregate(all_trips_v2$ride_length ~ all_trips_v2$member_casual, FUN = mean)

Average Ride Length time by User tibble was as Fig.3.

\[\\[.1in]\]

Fig.3. Average Ride Length time by User

It was found that significantly casual’s mean ride length is more than 2.26 folds greater than member’s mean ride length.

\[\\[.1in]\]

— Median ride length by users:

aggregate(all_trips_v2$ride_length ~ all_trips_v2$member_casual, FUN = median)

Median Ride Length time by User tibble was as Fig.4.

\[\\[.1in]\]

Fig.4. The median ride length by users
Fig.4. The median ride length by users

\[\\[.1in]\]

— Maximum ride length by users:

aggregate(all_trips_v2$ride_length ~ all_trips_v2$member_casual, FUN = max)

Median Ride Length time by User tibble was as Fig.5.

\[\\[.1in]\]

Fig.5. Maximum Ride Length by Users
Fig.5. Maximum Ride Length by Users

\[\\[.1in]\]

— Minimum ride length by users:

 (all_trips_v2$ride_length ~ all_trips_v2$member_casual, FUN = min)

It was found that minimum ride length in both users is 1 sec.

\[\\[.1in]\]

The days of the week were out of order. I fixed it by:

all_trips_v2$day_of_week <- ordered(all_trips_v2$day_of_week, levels=c("Sunday", "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday"))  

\[\\[.1in]\]

Then I ran the average ride time by each day for members vs casual users:

aggregate(all_trips_v2$ride_length ~ all_trips_v2$member_casual + all_trips_v2$day_of_week, FUN = mean)  

The Average Ride Length Time by Each Day for Members vs Casual Users tibble was as Fig.6.

\[\\[.1in]\]

Fig.6. The Average Ride Length Time by Each Day for Members vs Casual Users
Fig.6. The Average Ride Length Time by Each Day for Members vs Casual Users

\[\\[.1in]\]

I analyzed ridership data by type and weekday. I created a weekday field using wday(), then groepd by user type and weekday, then calculated number of rides and average duration and eentually sorted by member type:

all_trips_v2 %>% 
  mutate(weekday = wday(started_at, label = TRUE)) %>%  
  group_by(member_casual, weekday) %>%  
  summarise(number_of_rides = n(), average_duration = mean(ride_length)) %>%        
  arrange(member_casual, weekday)                               

The Number and Average Length of Rides by Users in Weekdays tibble was as Fig.7.

\[\\[.1in]\]

Fig.7. The Number and Average Length of Rides by Users in Weekdays
Fig.7. The Number and Average Length of Rides by Users in Weekdays

\[\\[.1in]\]


6. DATA VISUALIZATION

I made a plot for the number of rides by rider type in weekdays:

all_trips_v2 %>% 
  mutate(weekday = wday(started_at, label = TRUE)) %>% 
  group_by(member_casual, weekday) %>% 
  summarise(number_of_rides = n(), average_duration = mean(ride_length)) %>% 
  arrange(member_casual, weekday)  %>% 
  ggplot(aes(x = weekday, y = number_of_rides, fill = member_casual)) + 
  geom_col(position = "dodge") +
  ggtitle("Number of Rides by Rider Type", subtitle = "06-2022 to 05-2023")

\[\\[.1in]\]

Fig.8. The Plot of Number of Rides by Rider Type
Fig.8. The Plot of Number of Rides by Rider Type

\[\\[.1in]\]

I made a plot for the average duration by rider type:

all_trips_v2 %>% 
  mutate(weekday = wday(started_at, label = TRUE)) %>% 
  group_by(member_casual, weekday) %>% 
  summarise(number_of_rides = n(), average_duration = mean(ride_length)) %>% 
  arrange(member_casual, weekday)  %>% 
  ggplot(aes(x = weekday, y = average_duration, fill = member_casual)) + 
  geom_col(position = "dodge") +
  ggtitle("Average of Ride Duration by Rider Type", subtitle = "06-2022 to 05-2023")
Fig.9. The Average of Ride Duration by Rider Type
Fig.9. The Average of Ride Duration by Rider Type

\[\\[.1in]\]


7. DATA TRANSFERE

I created a summary csv file named ride_summary to visualize data in Excel, Tableau, or any other presentation software:

ride_summary <- aggregate(all_trips_v2$ride_length ~ all_trips_v2$member_casual + all_trips_v2$day_of_week, FUN = mean)

Then I exported it in csv format to my computer:

write.csv(ride_summary, file = '~/Desktop/Divvy_Exercise/ride_summary.csv')

\[\\[.1in]\]