This assignment focuses on analyzing public transport operations in Mumbai. You are provided with a dataset containing trip details, passenger counts, and weather conditions.

Instructions:

  1. Complete the R code for each question. The code block must contain clean, commented R code.
  2. Handle “dirty data” (negative values, inconsistent casing, missing values) in the pre-processing section.
  3. Provide a 2-3 sentence interpretation of the visualization. The visualization must have clear labels, appropriate scales, and a professional theme (e.g., theme_minimal()).
  4. Provide an analytical interpretation of each visualization: a brief paragraph explaining what the plot reveals about Mumbai’s transport system.
  5. Submit the R notebook containing the R code and the interpretation. The notebook should execute without any errors. Also submit a rendered version of the notebook, preferably in PDF format (Rmd -> HTML -> PDF).

For more information on R notebooks, visit the following link: https://bookdown.org/yihui/rmarkdown/notebook.html

Part 1: Data Pre-processing

Before plotting, clean the dataset. Handle the negative boarding counts, inconsistent weather names (e.g., ‘monsoon’ vs ‘Monsoon’), and convert time strings into usable time objects.

df$weather_condition <- tolower(df$weather_condition)

df$passenger_count_boarding[df$passenger_count_boarding < 0] <- NA
df$passenger_count_alighting[df$passenger_count_alighting < 0] <- NA

df$date <- ymd(df$date)

df$arrival_time_scheduled   <- hms(df$arrival_time_scheduled)
df$arrival_time_actual      <- hms(df$arrival_time_actual)
df$departure_time_scheduled <- hms(df$departure_time_scheduled)
df$departure_time_actual    <- hms(df$departure_time_actual)

df_clean <- df[complete.cases(
  df$passenger_count_boarding,
  df$passenger_count_alighting,
  df$arrival_time_scheduled,
  df$arrival_time_actual,
  df$weather_condition), ]

df_clean$boarding_num <- suppressWarnings(
  as.numeric(as.character(df_clean$passenger_count_boarding)) )

#Data Clean check
nrow(df)
[1] 1500
nrow(df_clean)
[1] 1249
sum(is.na(df_clean$passenger_count_boarding))
[1] 0
sum(is.na(df_clean$passenger_count_alighting))
[1] 0

Part 2: Visualizing with ggplot2

Q1. Distribution of Boarding by Route (Faceting)

Create a histogram for passenger_count_boarding. Use faceting to display a separate plot for each route_id.

df_clean$route_type <- ifelse(
  df_clean$route_id %in% c("MUM_Route_C", "MUM_Route_E"),
  "High variability routes",
  "Stable routes" )

ggplot( df_clean,
  aes( x = boarding_num,
    fill = route_type ) ) +
  geom_histogram(
    bins = 15,
    color = "white",
    na.rm = TRUE ) +
  facet_wrap(~ route_id, scales = "free_y") +
  scale_fill_manual(
    values = c(
      "High variability routes" = "#E67E22",
      "Stable routes" = "#628141" ) ) +
  labs(
    title = "Distribution of Passenger Boarding by Route",
    subtitle = "Routes C and E show higher variability and demand spikes",
    x = "Passenger Boarding Count",
    y = "Frequency",
    fill = "Route Type" ) +
  theme_minimal()

Interpretation: [Routes C and E exhibit greater variability in passenger boarding, with longer right tails and higher-frequency peaks at larger boarding values. This indicates intermittent demand surges, possibly driven by land-use concentration or transfer activity along these corridors. In contrast, the remaining routes display more consistent and evenly distributed boarding patterns, suggesting steadier demand across time.]

Q2. Ridership Dynamics (Conditional Coloring)

Create a scatter plot showing Boarding Count vs. Alighting Count. Color the points based on weather_condition.

ggplot(df_clean,
       aes(x = passenger_count_boarding,
           y = passenger_count_alighting,
           color = ifelse(weather_condition %in% c("monsoon","rainy"),
                          "Greater impact weather",
                          "Less impact weather"))) +
  geom_point(alpha = 0.6) +
  facet_wrap(~weather_condition) +
  scale_color_manual(values = c("Greater impact weather" = "#E67E22",
                                "Less impact weather" = "#628141")) +
  labs(title = "Boarding vs Alighting Across Weather Conditions",
       subtitle = "Monsoon and rainy conditions show greater impact on passenger movement",
       x = "Passenger Boarding Count",
       y = "Passenger Alighting Count",
       color = "Weather Impact") +
  theme_minimal()

Interpretation: [Passenger boarding and alighting patterns show greater dispersion during monsoon and rainy conditions, indicating increased variability in passenger movement due to weather-related disruptions. In contrast, sunny, cloudy, and partly cloudy conditions exhibit more stable and predictable passenger flows, suggesting minimal weather impact. These patterns highlight the need for weather-responsive operational planning, such as flexible scheduling and buffer capacity during adverse weather conditions.]

Q3. Hourly Peak Demand (Faceting)

Extract the hour from arrival_time_scheduled. Create a bar chart of total boarding per hour, faceted by day_of_week.

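# Hour of the scheduled arrival, as the question asks
df_clean$arrival_hour <- hour(df_clean$arrival_time_scheduled)

# Weekday/weekend flag, used only for colouring the bars
df_clean$day_type <- ifelse(
  df_clean$day_of_week %in% c("Saturday", "Sunday"),
  "Weekend",
  "Weekday" )

# Total boardings per hour for each day of the week
hourly_boarding_type <- aggregate(
  passenger_count_boarding ~ arrival_hour + day_of_week + day_type,
  data = df_clean,
  FUN = sum,
  na.rm = TRUE )

# Order the panels Monday to Sunday
hourly_boarding_type$day_of_week <- factor(
  hourly_boarding_type$day_of_week,
  levels = c("Monday", "Tuesday", "Wednesday",
             "Thursday", "Friday", "Saturday", "Sunday") )

ggplot( hourly_boarding_type,
  aes(
    x = arrival_hour,
    y = passenger_count_boarding,
    fill = day_type ) ) +
  geom_col() +
  # One panel per day of the week
  facet_wrap(~ day_of_week) +
  scale_fill_manual(
    values = c(
      "Weekday" = "#628141",
      "Weekend" = "#E67E22" ) ) +
  labs(
    title = "Hourly Passenger Boarding: Weekdays vs Weekends",
    subtitle = "Weekday passenger boarding is consistently higher",
    x = "Hour of Day",
    y = "Total Passenger Boarding",
    fill = "Day Type"
  ) +
  theme_minimal()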

Interpretation: [Hourly passenger boarding on weekdays is consistently higher than weekends across all hours, with pronounced peaks during typical commuting periods. Weekend demand remains lower and flatter, indicating predominantly discretionary travel. This pattern supports higher peak-hour frequencies and greater fleet deployment on weekdays, while reduced or evenly spaced headway on weekends can improve operational efficiency without affecting service quality.]

Q4. Delay Distribution

Calculate delay_minutes (Actual - Scheduled). Create a scatter plot of Scheduled Time vs Actual Time, coloring points by whether the bus was “Late” (delay > 5 mins) or “On-Time”.

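# Arrival times in minutes from midnight
# (as.numeric() on a lubridate Period returns seconds)
df_clean$scheduled_minutes <-
  as.numeric(df_clean$arrival_time_scheduled) / 60

df_clean$actual_minutes <-
  as.numeric(df_clean$arrival_time_actual) / 60

# Delay in minutes: actual minus scheduled arrival
df_clean$delay_minutes <-
  df_clean$actual_minutes - df_clean$scheduled_minutes

# A trip counts as late only when the delay exceeds 5 minutes
df_clean$delay_status <- ifelse(
  df_clean$delay_minutes > 5,
  "Late",
  "OnTime" )

df_clean$delay_status <- factor(
  df_clean$delay_status,
  levels = c("OnTime", "Late") )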

ggplot(df_clean, aes(
  x = scheduled_minutes,
  y = actual_minutes,
  color = delay_status )) +
  geom_point(alpha = 0.6) +
  geom_abline(
    intercept = 0,
    slope = 1,
    linetype = "dashed",
    color = "black" ) +
  coord_cartesian(
    xlim = c(350, 1150),
    ylim = c(350, 1150) ) +
  scale_color_manual(values = c(
    "OnTime" = "#628141",
    "Late"   = "#E67E22" )) +
  labs(
    title = "Scheduled vs Actual Arrival Time (Focused View)",
    subtitle = "Deviations from the 45-degree line indicate delays",
    x = "Scheduled Arrival Time (minutes from midnight)",
    y = "Actual Arrival Time (minutes from midnight)",
    color = "Status" ) +
  theme_minimal()

Interpretation: [The majority of trips lie close to the 45-degree reference line, indicating good schedule adherence during active service hours. However, consistent deviations above the line reflect systematic late arrivals, particularly as scheduled times increase, suggesting cumulative delays over the service period. This indicates a need for schedule recovery time or headway adjustments to improve reliability.]


Part 3: Conditioning with Lattice

Q5. Density Conditioning

Using the lattice package, create a density plot (densityplot) of passenger_count_boarding conditioned on day_of_week.

df_clean$day_of_week <- factor(
  df_clean$day_of_week,
  levels = c("Monday", "Tuesday", "Wednesday",
             "Thursday", "Friday", "Saturday", "Sunday") )
densityplot(
  ~ passenger_count_boarding | day_of_week,
  data = df_clean,
  layout = c(4, 2),
  groups = ifelse(day_of_week %in% c("Saturday", "Sunday"),
                  "Weekend", "Weekday"),
  plot.points = FALSE,
  col = c("#628141", "#E67E22"),
  lwd = 3,
  main = "Passenger Boarding Density by Day of Week",
  xlab = "Passenger Boarding Count",
  ylab = "Density",
  auto.key = TRUE )

Interpretation: [Weekdays exhibit consistent and sharply peaked boarding densities, reflecting stable commuter-driven demand, with mid-week days showing the highest reliability. In contrast, weekends display broader and flatter distributions, indicating lower but more evenly spread travel demand. This suggests the need for peak-oriented scheduling on weekdays and flexible service provision on weekends.]

Q6. Using Shingles for Temporal Analysis

Create a shingle for the arrival_time_scheduled variable (e.g., 3-4 overlapping windows). Use xyplot to plot passenger_count_boarding vs. passenger_count_alighting conditioned on the time shingles.

# Scheduled arrival time in seconds from midnight
df_clean$arrival_seconds <- as.numeric(df_clean$arrival_time_scheduled)

# Four time-of-day bins. Note that cut() gives non-overlapping bins,
# not the overlapping windows of a lattice shingle
df_clean$time_of_day <- cut(
  df_clean$arrival_seconds,
  breaks = c(0, 6*3600, 10*3600, 15*3600, 24*3600),
  labels = c("Early Morning", "Morning Peak", "Midday", "Evening"),
  include.lowest = TRUE )

xyplot(
  passenger_count_alighting ~ passenger_count_boarding | time_of_day,
  data = df_clean,
  # Highlight the evening period against all other periods
  groups = factor(
    ifelse(df_clean$time_of_day == "Evening",
           "Evening", "Other Periods"),
    levels = c("Evening", "Other Periods") ),
  xlab = "Passenger Boarding Count",
  ylab = "Passenger Alighting Count",
  main = "Boarding vs Alighting by Time of Day",
  pch = 16,
  col = c("#E67E22", "#628141"),
  alpha = 0.6,
  auto.key = list(title = "Time Period", columns = 1) )

Interpretation: [The evening period exhibits the highest simultaneous levels and widest dispersion of passenger boarding and alighting, indicating intense bidirectional movement associated with return trips, transfers, and activity-based travel. In contrast, morning peak demand is more directional, with comparatively constrained alighting relative to boarding, while early morning and midday show lower and more stable interactions. This suggests that evening operations require greater dwell-time management, platform capacity, and schedule recovery buffers than other time periods.]

Q7. Route Performance (Lattice Boxplots)

Use bwplot to compare the distribution of passenger_count_boarding conditioned on route_id.

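# Flag the two routes singled out for higher variability
df_clean$route_highlight <- ifelse(
  df_clean$route_id %in% c("MUM_Route_C", "MUM_Route_E"),
  "Higher Variability Routes",
  "Other Routes" )

# One colour per route, in the order the boxes are drawn
route_levels <- levels(factor(df_clean$route_id))
route_cols <- ifelse(
  route_levels %in% c("MUM_Route_C", "MUM_Route_E"),
  "#E67E22", "#628141" )

bwplot(
  passenger_count_boarding ~ route_id,
  data = df_clean,
  box.width = 0.7,
  lwd = 2,
  # bwplot() does not colour boxes by `groups`, so the legend is built by hand
  key = list(
    space = "right",
    title = "Route Category",
    text = list(c("Higher Variability Routes", "Other Routes")),
    rectangles = list(col = c("#E67E22", "#628141")) ),
  xlab = "Route ID",
  ylab = "Passenger Boarding Count",
  main = "Passenger Boarding Distribution by Route",
  par.settings = list(
    box.rectangle = list(col = route_cols),
    box.umbrella  = list(col = route_cols),
    plot.symbol   = list(col = "grey40") ) )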

Interpretation: [MUM_Route_C and MUM_Route_E have greater demand variability and pronounced peak loads. In contrast, MUM_Route_A, MUM_Route_B, and MUM_Route_D have stable and predictable passenger demand. This suggests that Routes C and E require flexible headways or higher-capacity vehicles, while Routes A, B, and D can be efficiently operated with fixed schedules and standard fleet deployment.]


Part 4: Synthesis & Operations

Q8. Dwell Time and Weather (ggplot2)

Calculate “Dwell Time” (Actual Departure - Actual Arrival). Create a scatter plot of Boarding Count vs. Dwell Time. Facet by weather_condition and color by route_id.

# Dwell time in minutes: actual departure minus actual arrival
df_clean$dwell_time_minutes <-
  as.numeric(df_clean$departure_time_actual - df_clean$arrival_time_actual) / 60

# Keep plausible dwell times only (0 to 30 minutes)
df_clean_q8 <- df_clean %>%
  filter(dwell_time_minutes >= 0, dwell_time_minutes <= 30)

ggplot(
  df_clean_q8,
  aes(
    x = passenger_count_boarding,
    y = dwell_time_minutes) ) +
  # Points coloured by route, as the question asks
  geom_point(aes(color = route_id), alpha = 0.5) +
  # One linear trend line per weather panel
  geom_smooth(method = "lm", se = FALSE, color = "black") +
  facet_wrap(~ weather_condition) +
  labs(
    title = "Impact of Boarding Volume on Dwell Time Across Weather Conditions",
    subtitle = "Weather influences how boarding volumes translate into dwell time",
    x = "Passenger Boarding Count",
    y = "Dwell Time (minutes)",
    color = "Route ID" ) +
  theme_minimal()

Interpretation: [During monsoon and rainy weather, buses take longer to stop even when there are not many passengers, showing that rain slows down boarding and operations more than passenger numbers do.

In sunny and partly cloudy weather, boarding is quicker and dwell time increases gradually as passenger numbers increase, indicating smoother and more efficient movement.

This shows that weather itself affects bus stopping time, not just how many people are boarding.

Therefore, bus schedules should allow extra stopping time during bad weather, instead of using the same assumptions for all conditions.]

Q9. Specific Stop Analysis (Lattice)

Filter data for the top 3 busiest stops. Create an xyplot showing boarding counts over date, conditioned on stop_name.

# Top 3 stops by total boardings
top_stops <- df_clean %>%
  group_by(stop_name) %>%
  summarise(
    total_boarding = sum(passenger_count_boarding, na.rm = TRUE)
  ) %>%
  arrange(desc(total_boarding)) %>%
  slice(1:3) %>%
  pull(stop_name)

# Keep those stops and sort by date so the line is drawn in time order
df_top_stops <- df_clean %>%
  filter(stop_name %in% top_stops) %>%
  arrange(stop_name, date)

xyplot(
  passenger_count_boarding ~ date | stop_name,
  data = df_top_stops,
  type = c("p", "l"),
  pch = 16,
  col = "#628141",
  alpha = 0.6,
  layout = c(1, 3),
  xlab = "Date",
  ylab = "Passenger Boarding Count",
  main = "Passenger Boarding Trends at Top 3 Busiest Stops",
  panel = function(x, y, ...) {
    panel.xyplot(x, y, ...)
    # Dashed box marking the 20-50 boarding band
    panel.rect(
      xleft = min(x, na.rm = TRUE),
      ybottom = 20,
      xright = max(x, na.rm = TRUE),
      ytop = 50,
      border = "black",
      lty = 2 ) } )

Interpretation: [Nerul, Borivali, and Belapur consistently operate within a common high-demand boarding range, indicating persistent passenger pressure at these locations rather than occasional spikes.

Borivali and Belapur exhibit higher variability and more frequent peaks, suggesting stronger sensitivity to time-of-day effects, surrounding land-use activity, or transfer movements.

Nerul shows comparatively steadier boarding levels, indicating a more predictable demand pattern.

The repeated clustering of observations within the highlighted demand band confirms that these stops act as structural demand anchors in the network.

From an operational perspective, these stops should be prioritised for capacity management, dwell-time control, and passenger facility upgrades, as congestion or delay here is likely to propagate along the route.]

Q10. Reliability Heatmap (ggplot2)

Create a heatmap (geom_tile) where the X-axis is day_of_week, the Y-axis is route_id, and the fill color represents the average delay.

df_clean$delay_minutes <-
  (as.numeric(df_clean$arrival_time_actual) -
  as.numeric(df_clean$arrival_time_scheduled)) / 60
delay_summary <- df_clean %>%
  group_by(route_id, day_of_week) %>%
  summarise(
    avg_delay = mean(delay_minutes, na.rm = TRUE),
    .groups = "drop" )

network_avg <- mean(delay_summary$avg_delay, na.rm = TRUE)

delay_summary <- delay_summary %>%
  mutate( delay_vs_network = avg_delay - network_avg )

ggplot( delay_summary,
  aes(
    x = day_of_week,
    y = route_id,
    fill = delay_vs_network ) ) +
  geom_tile(color = "white") +
  # Diverging scale centred on the network average (deviation = 0);
  # avoids a hard-coded breakpoint that may fall outside the data range
  scale_fill_gradient2(
    low = "#40513B",
    mid = "#E5D9B6",
    high = "#E67E22",
    midpoint = 0,
    name = "Deviation from\nNetwork Avg (min)" ) +
  labs(
    title = "Route Reliability Relative to Network Average",
    subtitle = "Routes performing better or worse than network average by day",
    x = "Day of Week",
    y = "Route ID"
  ) +
  theme_minimal()

Interpretation: [MUM_Route_B and MUM_Route_D perform better than the network average on multiple weekdays, suggesting more stable operations and better schedule adherence.

MUM_Route_E shows comparatively poorer reliability on certain mid-week days, indicating recurring operational constraints that may be linked to congestion or route design.

MUM_Route_C exhibits strong performance on some days but deteriorates on others, reflecting inconsistent reliability rather than persistent underperformance.

These patterns highlight the need for route- and day-specific reliability interventions, such as targeted schedule adjustments, fleet allocation, or operational priority measures, instead of uniform system-wide solutions.]

---
title: 'Assignment #2'
subtitle: Data Visualization
output:
  html_notebook: default
  word_document: default
  pdf_document: default
---
  
```{r setup, include=FALSE}
# Loading necessary libraries
library(dplyr)
library(ggplot2)
library(lattice)
library(lubridate)

setwd("D:\\CEPT\\4th_Sem\\Data_Analysis\\UT4609-2026-main\\UT4609-2026-main")
#filename<-"mumbai_bus_data.csv"

df <-read.csv("mumbai_bus_data.csv")

```


This assignment focuses on analyzing public transport operations in Mumbai. You are provided with a dataset containing trip details, passenger counts, and weather conditions. 

**Instructions:**

1. Complete the R code for each question. The code block must contain clean, commented R code.
1. Handle "dirty data" (negative values, inconsistent casing, missing values) in the pre-processing section.
1. Provide a 2-3 sentence interpretation of the visualization. The visualization must have clear labels, appropriate scales, and a professional theme (e.g., theme_minimal()).
1. Provide an analytical interpretation of each visualization: a brief paragraph explaining what the plot reveals about Mumbai's transport system.
1. Submit the R notebook containing the R code and the interpretation. The notebook should execute without any errors. Also submit a *rendered* version of the notebook, preferably in PDF format (Rmd -> HTML -> PDF).

For more information on R notebooks, visit the following link:
https://bookdown.org/yihui/rmarkdown/notebook.html

## Part 1: Data Pre-processing
Before plotting, clean the dataset. Handle the negative boarding counts, inconsistent weather names (e.g., 'monsoon' vs 'Monsoon'), and convert time strings into usable time objects.

```{r cleaning}
df$weather_condition <- tolower(df$weather_condition)

df$passenger_count_boarding[df$passenger_count_boarding < 0] <- NA
df$passenger_count_alighting[df$passenger_count_alighting < 0] <- NA

df$date <- ymd(df$date)

df$arrival_time_scheduled   <- hms(df$arrival_time_scheduled)
df$arrival_time_actual      <- hms(df$arrival_time_actual)
df$departure_time_scheduled <- hms(df$departure_time_scheduled)
df$departure_time_actual    <- hms(df$departure_time_actual)

df_clean <- df[complete.cases(
  df$passenger_count_boarding,
  df$passenger_count_alighting,
  df$arrival_time_scheduled,
  df$arrival_time_actual,
  df$weather_condition), ]

df_clean$boarding_num <- suppressWarnings(
  as.numeric(as.character(df_clean$passenger_count_boarding)) )

#Data Clean check
nrow(df)
nrow(df_clean)
sum(is.na(df_clean$passenger_count_boarding))
sum(is.na(df_clean$passenger_count_alighting))

```
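
The cleaning steps above can be tried on a small invented data frame before running them on the full dataset. The sketch below is base R only; the `toy` data frame and its values are made up for illustration and are not part of `mumbai_bus_data.csv`. It uses `which()` so that rows with missing counts are skipped when negative values are replaced.

```r
# Toy data with the same kinds of problems as the assignment data
# (all values are invented for illustration)
toy <- data.frame(
  weather_condition        = c("Monsoon", "monsoon", "SUNNY", "Rainy"),
  passenger_count_boarding = c(12, -4, 30, NA),
  stringsAsFactors = FALSE
)

# 1. Harmonise casing so "Monsoon" and "monsoon" become one category
toy$weather_condition <- tolower(toy$weather_condition)

# 2. Negative counts are impossible, so treat them as missing
neg_rows <- which(toy$passenger_count_boarding < 0)
toy$passenger_count_boarding[neg_rows] <- NA

# 3. Keep only rows with no missing values
toy_clean <- toy[complete.cases(toy), ]

# Quick checks: number of weather categories and remaining rows
n_levels <- length(unique(toy$weather_condition))
n_levels
nrow(toy_clean)
```

Comparing row counts before and after, as the chunk above does with `nrow(df)` and `nrow(df_clean)`, shows how many records the cleaning removed.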

---
  
## Part 2: Visualizing with ggplot2
  
#### Q1. Distribution of Boarding by Route (Faceting)
Create a histogram for `passenger_count_boarding`. Use **faceting** to display a separate plot for each `route_id`.

```{r q1}
df_clean$route_type <- ifelse(
  df_clean$route_id %in% c("MUM_Route_C", "MUM_Route_E"),
  "High variability routes",
  "Stable routes" )

ggplot( df_clean,
  aes( x = boarding_num,
    fill = route_type ) ) +
  geom_histogram(
    bins = 15,
    color = "white",
    na.rm = TRUE ) +
  facet_wrap(~ route_id, scales = "free_y") +
  scale_fill_manual(
    values = c(
      "High variability routes" = "#E67E22",
      "Stable routes" = "#628141" ) ) +
  labs(
    title = "Distribution of Passenger Boarding by Route",
    subtitle = "Routes C and E show higher variability and demand spikes",
    x = "Passenger Boarding Count",
    y = "Frequency",
    fill = "Route Type" ) +
  theme_minimal()

```
**Interpretation:** [Routes C and E exhibit greater variability in passenger boarding, with longer right tails and higher-frequency peaks at larger boarding values. This indicates intermittent demand surges, possibly driven by land-use concentration or transfer activity along these corridors. In contrast, the remaining routes display more consistent and evenly distributed boarding patterns, suggesting steadier demand across time.]

#### Q2. Ridership Dynamics (Conditional Coloring)
Create a scatter plot showing Boarding Count vs. Alighting Count. **Color the points** based on `weather_condition`.

```{r q2}
ggplot(df_clean,
       aes(x = passenger_count_boarding,
           y = passenger_count_alighting,
           color = ifelse(weather_condition %in% c("monsoon","rainy"),
                          "Greater impact weather",
                          "Less impact weather"))) +
  geom_point(alpha = 0.6) +
  facet_wrap(~weather_condition) +
  scale_color_manual(values = c("Greater impact weather" = "#E67E22",
                                "Less impact weather" = "#628141")) +
  labs(title = "Boarding vs Alighting Across Weather Conditions",
       subtitle = "Monsoon and rainy conditions show greater impact on passenger movement",
       x = "Passenger Boarding Count",
       y = "Passenger Alighting Count",
       color = "Weather Impact") +
  theme_minimal()
```
**Interpretation:** [Passenger boarding and alighting patterns show greater dispersion during monsoon and rainy conditions, indicating increased variability in passenger movement due to weather-related disruptions. In contrast, sunny, cloudy, and partly cloudy conditions exhibit more stable and predictable passenger flows, suggesting minimal weather impact. These patterns highlight the need for weather-responsive operational planning, such as flexible scheduling and buffer capacity during adverse weather conditions.]

#### Q3. Hourly Peak Demand (Faceting)
Extract the hour from `arrival_time_scheduled`. Create a bar chart of total boarding per hour, **faceted by** `day_of_week`.

```{r q3}
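# Hour of the scheduled arrival, as the question asks
df_clean$arrival_hour <- hour(df_clean$arrival_time_scheduled)

# Weekday/weekend flag, used only for colouring the bars
df_clean$day_type <- ifelse(
  df_clean$day_of_week %in% c("Saturday", "Sunday"),
  "Weekend",
  "Weekday" )

# Total boardings per hour for each day of the week
hourly_boarding_type <- aggregate(
  passenger_count_boarding ~ arrival_hour + day_of_week + day_type,
  data = df_clean,
  FUN = sum,
  na.rm = TRUE )

# Order the panels Monday to Sunday
hourly_boarding_type$day_of_week <- factor(
  hourly_boarding_type$day_of_week,
  levels = c("Monday", "Tuesday", "Wednesday",
             "Thursday", "Friday", "Saturday", "Sunday") )

ggplot( hourly_boarding_type,
  aes(
    x = arrival_hour,
    y = passenger_count_boarding,
    fill = day_type ) ) +
  geom_col() +
  # One panel per day of the week
  facet_wrap(~ day_of_week) +
  scale_fill_manual(
    values = c(
      "Weekday" = "#628141",
      "Weekend" = "#E67E22" ) ) +
  labs(
    title = "Hourly Passenger Boarding: Weekdays vs Weekends",
    subtitle = "Weekday passenger boarding is consistently higher",
    x = "Hour of Day",
    y = "Total Passenger Boarding",
    fill = "Day Type"
  ) +
  theme_minimal()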
```
**Interpretation:** [Hourly passenger boarding on weekdays is consistently higher than weekends across all hours, with pronounced peaks during typical commuting periods. Weekend demand remains lower and flatter, indicating predominantly discretionary travel. This pattern supports higher peak-hour frequencies and greater fleet deployment on weekdays, while reduced or evenly spaced headway on weekends can improve operational efficiency without affecting service quality.]

#### Q4. Delay Distribution
Calculate `delay_minutes` (Actual - Scheduled). Create a scatter plot of Scheduled Time vs Actual Time, **coloring points** by whether the bus was "Late" (delay > 5 mins) or "On-Time".

```{r q4}
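# Arrival times in minutes from midnight
# (as.numeric() on a lubridate Period returns seconds)
df_clean$scheduled_minutes <-
  as.numeric(df_clean$arrival_time_scheduled) / 60

df_clean$actual_minutes <-
  as.numeric(df_clean$arrival_time_actual) / 60

# Delay in minutes: actual minus scheduled arrival
df_clean$delay_minutes <-
  df_clean$actual_minutes - df_clean$scheduled_minutes

# A trip counts as late only when the delay exceeds 5 minutes
df_clean$delay_status <- ifelse(
  df_clean$delay_minutes > 5,
  "Late",
  "OnTime" )

df_clean$delay_status <- factor(
  df_clean$delay_status,
  levels = c("OnTime", "Late") )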

ggplot(df_clean, aes(
  x = scheduled_minutes,
  y = actual_minutes,
  color = delay_status )) +
  geom_point(alpha = 0.6) +
  geom_abline(
    intercept = 0,
    slope = 1,
    linetype = "dashed",
    color = "black" ) +
  coord_cartesian(
    xlim = c(350, 1150),
    ylim = c(350, 1150) ) +
  scale_color_manual(values = c(
    "OnTime" = "#628141",
    "Late"   = "#E67E22" )) +
  labs(
    title = "Scheduled vs Actual Arrival Time (Focused View)",
    subtitle = "Deviations from the 45-degree line indicate delays",
    x = "Scheduled Arrival Time (minutes from midnight)",
    y = "Actual Arrival Time (minutes from midnight)",
    color = "Status" ) +
  theme_minimal()
```
**Interpretation:** [The majority of trips lie close to the 45-degree reference line, indicating good schedule adherence during active service hours. However, consistent deviations above the line reflect systematic late arrivals, particularly as scheduled times increase, suggesting cumulative delays over the service period. This indicates a need for schedule recovery time or headway adjustments to improve reliability.]
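
The delay rule in the question ("Late" when delay > 5 minutes) can be checked by hand on a few invented times. This is a dependency-free sketch: `to_minutes()` is a hypothetical helper written for this example (the notebook itself uses `lubridate::hms()`), and the three trips are made up.

```r
# Hypothetical helper: "HH:MM:SS" strings -> minutes after midnight
to_minutes <- function(x) {
  parts <- strsplit(x, ":", fixed = TRUE)
  vapply(parts,
         function(p) sum(as.numeric(p) * c(60, 1, 1 / 60)),
         numeric(1))
}

# Invented scheduled and actual arrival times for three trips
scheduled <- c("08:00:00", "08:30:00", "09:15:00")
actual    <- c("08:03:00", "08:40:30", "09:10:00")

# Delay = actual - scheduled, in minutes (negative means early)
delay_minutes <- to_minutes(actual) - to_minutes(scheduled)

# Only delays strictly greater than 5 minutes count as late
delay_status <- ifelse(delay_minutes > 5, "Late", "On-Time")

data.frame(scheduled, actual, delay_minutes, delay_status)
```

Under this rule a trip that arrives 3 minutes behind schedule is still on time, so the share of late trips is smaller than it would be if any positive delay were counted.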

---

## Part 3: Conditioning with Lattice
  
#### Q5. Density Conditioning

Using the `lattice` package, create a density plot (`densityplot`) of `passenger_count_boarding` **conditioned on** `day_of_week`.

```{r q5}
df_clean$day_of_week <- factor(
  df_clean$day_of_week,
  levels = c("Monday", "Tuesday", "Wednesday",
             "Thursday", "Friday", "Saturday", "Sunday") )
densityplot(
  ~ passenger_count_boarding | day_of_week,
  data = df_clean,
  layout = c(4, 2),
  groups = ifelse(day_of_week %in% c("Saturday", "Sunday"),
                  "Weekend", "Weekday"),
  plot.points = FALSE,
  col = c("#628141", "#E67E22"),
  lwd = 3,
  main = "Passenger Boarding Density by Day of Week",
  xlab = "Passenger Boarding Count",
  ylab = "Density",
  auto.key = TRUE )
```
**Interpretation:** [Weekdays exhibit consistent and sharply peaked boarding densities, reflecting stable commuter-driven demand, with mid-week days showing the highest reliability. In contrast, weekends display broader and flatter distributions, indicating lower but more evenly spread travel demand. This suggests the need for peak-oriented scheduling on weekdays and flexible service provision on weekends.]

#### Q6. Using Shingles for Temporal Analysis
Create a **shingle** for the `arrival_time_scheduled` variable (e.g., 3-4 overlapping windows). Use `xyplot` to plot `passenger_count_boarding` vs. `passenger_count_alighting` **conditioned on the time shingles**.

```{r q6}
# Scheduled arrival time in seconds from midnight
df_clean$arrival_seconds <- as.numeric(df_clean$arrival_time_scheduled)

# Four time-of-day bins. Note that cut() gives non-overlapping bins,
# not the overlapping windows of a lattice shingle
df_clean$time_of_day <- cut(
  df_clean$arrival_seconds,
  breaks = c(0, 6*3600, 10*3600, 15*3600, 24*3600),
  labels = c("Early Morning", "Morning Peak", "Midday", "Evening"),
  include.lowest = TRUE )

xyplot(
  passenger_count_alighting ~ passenger_count_boarding | time_of_day,
  data = df_clean,
  # Highlight the evening period against all other periods
  groups = factor(
    ifelse(df_clean$time_of_day == "Evening",
           "Evening", "Other Periods"),
    levels = c("Evening", "Other Periods") ),
  xlab = "Passenger Boarding Count",
  ylab = "Passenger Alighting Count",
  main = "Boarding vs Alighting by Time of Day",
  pch = 16,
  col = c("#E67E22", "#628141"),
  alpha = 0.6,
  auto.key = list(title = "Time Period", columns = 1) )
```
**Interpretation:** [The evening period exhibits the highest simultaneous levels and widest dispersion of passenger boarding and alighting, indicating intense bidirectional movement associated with return trips, transfers, and activity-based travel. In contrast, morning peak demand is more directional, with comparatively constrained alighting relative to boarding, while early morning and midday show lower and more stable interactions. This suggests that evening operations require greater dwell-time management, platform capacity, and schedule recovery buffers than other time periods.]
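
The question asks for overlapping windows, which is what distinguishes a lattice shingle from a factor made with `cut()`. The following is a minimal sketch of that idea on simulated data; the window boundaries, the variable names, and the random counts are assumptions chosen for illustration, not values taken from the assignment dataset.

```r
library(lattice)

# Simulated data (invented): arrival time in decimal hours, plus counts
set.seed(1)
arrival_hour <- runif(200, min = 5, max = 23)
boarding     <- rpois(200, lambda = 20)
alighting    <- rpois(200, lambda = 18)

# Four overlapping windows; each row is c(lower, upper) in hours.
# An observation at 09:00 falls in both the first and second window.
windows <- rbind(c(5, 10), c(8, 14), c(12, 18), c(16, 23))

# Build the shingle from the explicit intervals
time_shingle <- shingle(arrival_hour, intervals = windows)

# One panel per window; points in an overlap appear in two panels
p <- xyplot(
  alighting ~ boarding | time_shingle,
  layout = c(4, 1),
  pch = 16,
  alpha = 0.6,
  xlab = "Passenger Boarding Count",
  ylab = "Passenger Alighting Count",
  main = "Boarding vs Alighting by Overlapping Time Window" )
print(p)
```

If explicit boundaries are not needed, `equal.count()` from lattice builds a shingle with roughly equal numbers of observations per window, controlled by its `number` and `overlap` arguments.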

#### Q7. Route Performance (Lattice Boxplots)
Use `bwplot` to compare the distribution of `passenger_count_boarding` **conditioned on** `route_id`.

```{r q7}
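# Flag the two routes singled out for higher variability
df_clean$route_highlight <- ifelse(
  df_clean$route_id %in% c("MUM_Route_C", "MUM_Route_E"),
  "Higher Variability Routes",
  "Other Routes" )

# One colour per route, in the order the boxes are drawn
route_levels <- levels(factor(df_clean$route_id))
route_cols <- ifelse(
  route_levels %in% c("MUM_Route_C", "MUM_Route_E"),
  "#E67E22", "#628141" )

bwplot(
  passenger_count_boarding ~ route_id,
  data = df_clean,
  box.width = 0.7,
  lwd = 2,
  # bwplot() does not colour boxes by `groups`, so the legend is built by hand
  key = list(
    space = "right",
    title = "Route Category",
    text = list(c("Higher Variability Routes", "Other Routes")),
    rectangles = list(col = c("#E67E22", "#628141")) ),
  xlab = "Route ID",
  ylab = "Passenger Boarding Count",
  main = "Passenger Boarding Distribution by Route",
  par.settings = list(
    box.rectangle = list(col = route_cols),
    box.umbrella  = list(col = route_cols),
    plot.symbol   = list(col = "grey40") ) )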
```
**Interpretation:** [MUM_Route_C and MUM_Route_E have greater demand variability and pronounced peak loads. In contrast, MUM_Route_A, MUM_Route_B, and MUM_Route_D have stable and predictable passenger demand. This suggests that Routes C and E require flexible headways or higher-capacity vehicles, while Routes A, B, and D can be efficiently operated with fixed schedules and standard fleet deployment.]
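
A statement that some routes are more variable than others is easier to defend when it is backed by a number as well as a boxplot. One possible check, sketched here on invented counts for two hypothetical routes (`Route_X` and `Route_Y` are not in the dataset), is to compare the interquartile range and the coefficient of variation per route.

```r
# Invented boarding counts for two hypothetical routes
toy <- data.frame(
  route_id = rep(c("Route_X", "Route_Y"), each = 5),
  boarding = c(18, 20, 22, 19, 21,
               5, 40, 12, 55, 8)
)

# Interquartile range per route: the height of each box in a boxplot
iqr_by_route <- tapply(toy$boarding, toy$route_id, IQR)

# Coefficient of variation per route: spread relative to the mean
cv_by_route <- tapply(toy$boarding, toy$route_id,
                      function(v) sd(v) / mean(v))

iqr_by_route
round(cv_by_route, 2)
```

The same two `tapply()` calls could be applied to `df_clean$passenger_count_boarding` and `df_clean$route_id` to see whether Routes C and E really stand out.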

---
  
## Part 4: Synthesis & Operations
  
#### Q8. Dwell Time and Weather (ggplot2)

Calculate "Dwell Time" (Actual Departure - Actual Arrival). Create a scatter plot of Boarding Count vs. Dwell Time. **Facet by** `weather_condition` and **color by** `route_id`.

```{r q8}
# Dwell time in minutes: actual departure minus actual arrival
df_clean$dwell_time_minutes <-
  as.numeric(df_clean$departure_time_actual - df_clean$arrival_time_actual) / 60

# Keep plausible dwell times only (0 to 30 minutes)
df_clean_q8 <- df_clean %>%
  filter(dwell_time_minutes >= 0, dwell_time_minutes <= 30)

ggplot(
  df_clean_q8,
  aes(
    x = passenger_count_boarding,
    y = dwell_time_minutes) ) +
  # Points coloured by route, as the question asks
  geom_point(aes(color = route_id), alpha = 0.5) +
  # One linear trend line per weather panel
  geom_smooth(method = "lm", se = FALSE, color = "black") +
  facet_wrap(~ weather_condition) +
  labs(
    title = "Impact of Boarding Volume on Dwell Time Across Weather Conditions",
    subtitle = "Weather influences how boarding volumes translate into dwell time",
    x = "Passenger Boarding Count",
    y = "Dwell Time (minutes)",
    color = "Route ID" ) +
  theme_minimal()
```
**Interpretation:** [During monsoon and rainy weather, buses take longer to stop even when there are not many passengers, showing that rain slows down boarding and operations more than passenger numbers do.

In sunny and partly cloudy weather, boarding is quicker and dwell time increases gradually as passenger numbers increase, indicating smoother and more efficient movement.

This shows that weather itself affects bus stopping time, not just how many people are boarding.

Therefore, bus schedules should allow extra stopping time during bad weather, instead of using the same assumptions for all conditions.]

#### Q9. Specific Stop Analysis (Lattice)
Filter data for the top 3 busiest stops. Create an `xyplot` showing boarding counts over `date`, **conditioned on** `stop_name`.

```{r q9}
# Top 3 stops by total boardings
top_stops <- df_clean %>%
  group_by(stop_name) %>%
  summarise(
    total_boarding = sum(passenger_count_boarding, na.rm = TRUE)
  ) %>%
  arrange(desc(total_boarding)) %>%
  slice(1:3) %>%
  pull(stop_name)

# Keep those stops and sort by date so the line is drawn in time order
df_top_stops <- df_clean %>%
  filter(stop_name %in% top_stops) %>%
  arrange(stop_name, date)

xyplot(
  passenger_count_boarding ~ date | stop_name,
  data = df_top_stops,
  type = c("p", "l"),
  pch = 16,
  col = "#628141",
  alpha = 0.6,
  layout = c(1, 3),
  xlab = "Date",
  ylab = "Passenger Boarding Count",
  main = "Passenger Boarding Trends at Top 3 Busiest Stops",
  panel = function(x, y, ...) {
    panel.xyplot(x, y, ...)
    # Dashed box marking the 20-50 boarding band
    panel.rect(
      xleft = min(x, na.rm = TRUE),
      ybottom = 20,
      xright = max(x, na.rm = TRUE),
      ytop = 50,
      border = "black",
      lty = 2 ) } )
```
**Interpretation:** [Nerul, Borivali, and Belapur consistently operate within a common high-demand boarding range, indicating persistent passenger pressure at these locations rather than occasional spikes.

Borivali and Belapur exhibit higher variability and more frequent peaks, suggesting stronger sensitivity to time-of-day effects, surrounding land-use activity, or transfer movements.

Nerul shows comparatively steadier boarding levels, indicating a more predictable demand pattern.

The repeated clustering of observations within the highlighted demand band confirms that these stops act as structural demand anchors in the network.

From an operational perspective, these stops should be prioritised for capacity management, dwell-time control, and passenger facility upgrades, as congestion or delay here is likely to propagate along the route.]

#### Q10. Reliability Heatmap (ggplot2)
Create a heatmap (`geom_tile`) where the X-axis is `day_of_week`, the Y-axis is `route_id`, and the **fill color** represents the average delay.

```{r q10}
df_clean$delay_minutes <-
  (as.numeric(df_clean$arrival_time_actual) -
  as.numeric(df_clean$arrival_time_scheduled)) / 60
delay_summary <- df_clean %>%
  group_by(route_id, day_of_week) %>%
  summarise(
    avg_delay = mean(delay_minutes, na.rm = TRUE),
    .groups = "drop" )

network_avg <- mean(delay_summary$avg_delay, na.rm = TRUE)

delay_summary <- delay_summary %>%
  mutate( delay_vs_network = avg_delay - network_avg )

ggplot( delay_summary,
  aes(
    x = day_of_week,
    y = route_id,
    fill = delay_vs_network ) ) +
  geom_tile(color = "white") +
  # Diverging scale centred on the network average (deviation = 0);
  # avoids a hard-coded breakpoint that may fall outside the data range
  scale_fill_gradient2(
    low = "#40513B",
    mid = "#E5D9B6",
    high = "#E67E22",
    midpoint = 0,
    name = "Deviation from\nNetwork Avg (min)" ) +
  labs(
    title = "Route Reliability Relative to Network Average",
    subtitle = "Routes performing better or worse than network average by day",
    x = "Day of Week",
    y = "Route ID"
  ) +
  theme_minimal()
```
**Interpretation:** [MUM_Route_B and MUM_Route_D perform better than the network average on multiple weekdays, suggesting more stable operations and better schedule adherence.

MUM_Route_E shows comparatively poorer reliability on certain mid-week days, indicating recurring operational constraints that may be linked to congestion or route design.

MUM_Route_C exhibits strong performance on some days but deteriorates on others, reflecting inconsistent reliability rather than persistent underperformance.

These patterns highlight the need for route- and day-specific reliability interventions, such as targeted schedule adjustments, fleet allocation, or operational priority measures, instead of uniform system-wide solutions.]
