1 INTRODUCTION TO TASKS AND DATASET

1.1 Overview

This document presents a summarized version of the data analysis conducted on the Divvy bike-share dataset. It focuses on showcasing key code snippets and outputs from the main phases of the project: data cleaning, transformation and aggregation.

Rather than detailing every step or decision made, this R Markdown file serves as a concise reference or codebook highlighting the most important operations and findings. The full exploratory work was done separately, and this summary is intended to support presentation or documentation purposes. The dataset used consists of internal CSV files covering the last 12 months of data.

To download the original data, please visit: Divvy Trip Data.

Note: This case study was conducted in February 2025. The data analyzed includes trips from the 12-month period up to and including February 2025. Any data beyond this date were not used in this study.

2 PREPARE

The data was originally provided as separate monthly CSV files. In this step, we combined them into a single dataset using the bind_rows() function from the dplyr package.

# Install required packages (run once)
install.packages('tidyverse')
install.packages('janitor')
install.packages('lubridate')
install.packages('geosphere')   # needed later for distHaversine()

# Load libraries (readr and ggplot2 are attached with the tidyverse)
library(tidyverse)
library(lubridate)
library(janitor)
library(scales)
library(geosphere)

# Set working directory or folder path (adjust as needed)
setwd("path")
# Load the monthly files
divvy_tripdata_202403 <- read_csv("202403-divvy-tripdata.csv")
divvy_tripdata_202404 <- read_csv("202404-divvy-tripdata.csv")
divvy_tripdata_202405 <- read_csv("202405-divvy-tripdata.csv")
divvy_tripdata_202406 <- read_csv("202406-divvy-tripdata.csv")
divvy_tripdata_202407 <- read_csv("202407-divvy-tripdata.csv")
divvy_tripdata_202408 <- read_csv("202408-divvy-tripdata.csv")
divvy_tripdata_202409 <- read_csv("202409-divvy-tripdata.csv")
divvy_tripdata_202410 <- read_csv("202410-divvy-tripdata.csv")
divvy_tripdata_202411 <- read_csv("202411-divvy-tripdata.csv")
divvy_tripdata_202412 <- read_csv("202412-divvy-tripdata.csv")
divvy_tripdata_202501 <- read_csv("202501-divvy-tripdata.csv")
divvy_tripdata_202502 <- read_csv("202502-divvy-tripdata.csv")

# Bind the monthly files into one dataset
divvy_tripdata_complet <- bind_rows(
  divvy_tripdata_202403, divvy_tripdata_202404, divvy_tripdata_202405,
  divvy_tripdata_202406, divvy_tripdata_202407, divvy_tripdata_202408,
  divvy_tripdata_202409, divvy_tripdata_202410, divvy_tripdata_202411,
  divvy_tripdata_202412, divvy_tripdata_202501, divvy_tripdata_202502
)
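
As a side note, a more compact alternative (a hypothetical sketch, not the approach used here) is to list every matching file in the working directory and bind them in one pass:

# Sketch: read and bind all monthly files in one step
trip_files <- list.files(pattern = "-divvy-tripdata\\.csv$")
divvy_tripdata_complet <- purrr::map_dfr(trip_files, read_csv)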

3 CLEAN

3.1 Removing duplicates

To remove duplicates, we first check for fully duplicated rows, then for duplicated ride_id values, to determine whether different rides share the same ride_id.

# Check for fully duplicated rows
get_dupes(divvy_tripdata_complet)

# Remove duplicated rows, keeping one copy of each
dtc_cleaned <- divvy_tripdata_complet %>% distinct()

# Re-check: are there duplicated ride_id values among the remaining rows?
get_dupes(dtc_cleaned, ride_id)

3.2 Removing incomplete or irrelevant columns

As part of the cleaning process, some columns with a high percentage of missing values were removed. These columns contained station-related information which was either incomplete or not relevant to the current analysis.
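
Before dropping anything, the share of missing values in each column can be inspected directly (a quick sketch on the deduplicated data):

# Percentage of missing values in each column
colMeans(is.na(dtc_cleaned)) * 100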

The following code was used to remove them:

# Remove columns with many missing values
dtc_cleaned <- dtc_cleaned %>%
  select(-start_station_name, -start_station_id, -end_station_name, -end_station_id)

3.3 Removing missing values

3.3.1 Removing Rows with Missing Location Data

During the cleaning phase, we identified some rows with missing values in key location columns such as end_lat. Since these values are essential for ride analysis (e.g., calculating distance, direction, or mapping), we needed to assess whether it was safe to remove them.

We first isolated the rows with missing values and evaluated how much they represent in the dataset:

# Find rows with missing end_lat
null_frames <- dtc_cleaned %>% filter(is.na(end_lat))

# Compare member types in full data vs. missing rows
dtc_cleaned %>% count(member_casual)
null_frames %>% count(member_casual)
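
A quick proportion check (a sketch using the objects defined above) makes the comparison easier to read:

# Share of all rides that are missing end_lat
nrow(null_frames) / nrow(dtc_cleaned)

# Member split inside the missing rows vs. the full dataset
null_frames %>% count(member_casual) %>% mutate(pct = n / sum(n))
dtc_cleaned %>% count(member_casual) %>% mutate(pct = n / sum(n))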

After reviewing the counts, we confirmed that the rows with missing end_lat values made up only a small portion of the overall dataset and were evenly distributed across member types. Therefore, removing them would not introduce bias.

The following code was used to remove those incomplete rows:

# Remove rows with missing end_lat
dtc_cleaned <- dtc_cleaned %>% filter(!is.na(end_lat))

3.3.2 Removing Rows with Any Missing Values

After cleaning specific columns such as end_lat, we performed a full check for missing values across the entire dataset to ensure data integrity.

# Check for any missing values in the dataset
colSums(is.na(dtc_cleaned))

The result showed that no missing values remain in the dataset. Therefore, applying drop_na() was not necessary.

3.4 Ride time and distance calculation.

This section computes the ride duration and ride distance between starting and ending points.

# Ride time
dtc_cleaned <- dtc_cleaned %>%
  mutate(trip_duration_mins = as.numeric(difftime(as.POSIXct(ended_at), as.POSIXct(started_at), units = "mins")),
         trip_duration_Hours = trip_duration_mins / 60)

# Ride distance (great-circle distance via geosphere::distHaversine)
dtc_cleaned <- dtc_cleaned %>%
  mutate(distance_meters = distHaversine(cbind(start_lng, start_lat), cbind(end_lng, end_lat)),
         distance_km = distance_meters / 1000)
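
As a quick illustration (hypothetical coordinates, not taken from the dataset), distHaversine() returns the great-circle distance in meters, which is why the conversion to kilometers above divides by 1000:

# Example: two hypothetical points in Chicago, roughly 1.3 km apart
distHaversine(c(-87.6298, 41.8781), c(-87.6244, 41.8892))  # result is in meters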

4 Aggregating data.

In this section, I used dplyr to aggregate and summarize the cleaned data. These aggregations help identify how usage differs across user types, time periods, and distance or duration patterns. This step is essential before visualization, as it provides a clear structure and insight into the dataset.

4.1 Rides per date.

ride_per_day <- dtc_cleaned %>%
  mutate(date_only = as_date(started_at), day = wday(date_only, label = TRUE, abbr = FALSE, week_start = 1),
         date_type = if_else(wday(date_only, week_start = 1) %in% c(6, 7), "weekend", "weekday")) %>%  # 6/7 = Sat/Sun, locale-independent
  count(date_only, day, date_type, member_casual)  # one row per date, day type and member type

The result is a data frame summarizing rides by date, member type, and day type. This summary highlights differences in usage patterns between member types and reveals the most frequent dates and weekdays for bike trips.

4.2 Rides per month.

ride_per_month <- ride_per_day %>%
  mutate(month = floor_date(date_only, unit = "month")) %>%
  group_by(month, member_casual) %>%
  summarise(total = sum(n))

The result is the total number of rides per month for each member type. This summary helps identify differences between member types and shows which months have the highest bike usage.

4.3 Rush hours vs. leisure.

rush_hours_vs_leisure <- dtc_cleaned %>%
  mutate(hour = hour(started_at)) %>%
  count(hour, member_casual)

The result is a data frame summarizing rides by hour of day and by member type. This summary highlights differences in usage patterns between member types and reveals the most frequent hours for bike trips.
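
For a quick preview of the hourly pattern, a minimal ggplot sketch (not part of the exported visuals, which were built in Tableau Public as noted in the conclusion) could look like this:

# Sketch: rides per hour of day, by member type
ggplot(rush_hours_vs_leisure, aes(x = hour, y = n, color = member_casual)) +
  geom_line() +
  labs(x = "Hour of day", y = "Rides", color = "Member type")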

4.4 Rides per day of the week.

ride_per_day_2 <- ride_per_day %>%
  group_by(day, member_casual) %>%
  summarise(total = sum(n))

The result is a data frame showing total rides by day of the week and member type. This summary shows which days have the highest bike usage.
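
Similarly, a minimal bar-chart sketch (hypothetical, for a quick check only) of this weekday summary, using the scales package loaded earlier for readable axis labels:

# Sketch: total rides per day of the week, by member type
ggplot(ride_per_day_2, aes(x = day, y = total, fill = member_casual)) +
  geom_col(position = "dodge") +
  scale_y_continuous(labels = comma) +
  labs(x = "Day of the week", y = "Total rides", fill = "Member type")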

4.5 Average ride duration by month.

avg_dur_ride_Month <- dtc_cleaned %>%
  mutate(month = month(started_at, label = TRUE)) %>%
  group_by(month, member_casual) %>%
  summarise(avg_dur = mean(trip_duration_mins))

The result is a data frame showing the average ride duration by month for each member type. This helps identify seasonal patterns in bike usage and the times of year when riders take longer trips. It also helps determine whether bikes are used most during the summer holiday season.

4.6 Total ride time by member type

dist_ride_dur <- dtc_cleaned %>%
  group_by(member_casual) %>%
  summarise(total = sum(trip_duration_mins))

This shows the total ride time (in minutes) for each member type.

4.7 Ride duration breakdown.

long_ride <- dtc_cleaned %>%
  mutate(under_30 = trip_duration_mins < 30,
         ovr_30 = trip_duration_mins > 30,   # note: rides over 60 minutes are also counted here
         ovr_60 = trip_duration_mins > 60) %>%
  group_by(member_casual) %>%
  summarise(und30 = sum(under_30), ovr30 = sum(ovr_30), ovr60 = sum(ovr_60))

The resulting data frame summarizes the total number of rides segmented by duration: under 30 minutes, over 30 minutes, and over 60 minutes (the over-30 group also includes the over-60 rides). This breakdown helps assess whether bikes are primarily used for short or long trips across different member types. It also provides insight into whether ride duration can be used as a basis for marketing strategies, and whether promoting the annual membership offers greater value to frequent riders.
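
To compare member types on a like-for-like basis, the same idea can also be expressed as a share of each group's rides (a small sketch, not part of the original aggregation):

# Sketch: share of rides longer than 30 and 60 minutes, by member type
dtc_cleaned %>%
  group_by(member_casual) %>%
  summarise(pct_over_30 = mean(trip_duration_mins > 30) * 100,
            pct_over_60 = mean(trip_duration_mins > 60) * 100)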

4.8 Usage frequency patterns.

freq_usg_pat <- dtc_cleaned %>%
  mutate(day = wday(started_at, label = TRUE, abbr = FALSE, week_start = 1),
         day_type = if_else(wday(started_at, week_start = 1) %in% c(6, 7), "weekend", "weekday"),  # 6/7 = Sat/Sun
         hour = hour(started_at),
         under_30 = trip_duration_mins < 30, ovr_30 = trip_duration_mins > 30, ovr_60 = trip_duration_mins > 60) %>%
  group_by(hour, member_casual, day_type) %>%
  summarise(und30 = sum(under_30), ovr30 = sum(ovr_30), ovr60 = sum(ovr_60))

The resulting data frame summarizes the number of rides by hour and day type, segmented by duration: under 30 minutes, over 30 minutes, and over 60 minutes. This breakdown helps assess when bikes are used for short or long trips by each member type, and suggests whether each member type rides mainly for commuting or for leisure outside working hours.

5 Conclusion

This analysis provided insights into the usage patterns of Divvy bike-share services over the past year. By aggregating and visualizing key metrics such as ride duration, member type preferences, and time-based trends, we identified clear behavioral differences between casual and member riders.

Key findings include:

- Casual riders tend to use bikes more during weekends and summer months, suggesting leisure-based usage.
- Members exhibit more consistent usage across weekdays, indicating commuter behavior.
- Ride durations were generally longer for casual riders across all time periods.

These observations can inform targeted marketing strategies, resource allocation, and operational improvements for Divvy’s service.

Additionally, the aggregated data frames generated during this analysis were exported for further interactive visualization and storytelling using Tableau Public.
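
For completeness, the export step can be as simple as the sketch below (the file names are examples, not necessarily the ones used):

# Sketch: export aggregated summaries as CSV for Tableau Public
write_csv(ride_per_month, "ride_per_month.csv")
write_csv(ride_per_day_2, "ride_per_day.csv")
write_csv(rush_hours_vs_leisure, "rush_hours_vs_leisure.csv")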

Note:
Although variables such as start_lat, start_lng, end_lat, and end_lng were not fully explored in this analysis, they represent valuable geographic coordinates that can be leveraged in advanced spatial analysis.
For example, the straight-line (Haversine) distance between starting and ending points, computed in section 3.4 but not analyzed in depth here, can offer deeper insights into travel behavior and trip length. These coordinates could also be used in future geospatial visualizations or route optimization studies.
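
For reference, distHaversine() is based on the haversine formula for great-circle distance: with latitudes φ1, φ2 and longitudes λ1, λ2 expressed in radians, and r the Earth's radius,

d = 2r · arcsin( √( sin²((φ2 − φ1)/2) + cos(φ1) · cos(φ2) · sin²((λ2 − λ1)/2) ) )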