STS3404 Group Assignment 2: KL Transit Ridership Analytics

Author

Group 1

Published

July 3, 2026

1. Introduction

Urban rail systems such as the MRT, LRT, and Monorail support daily mobility in Kuala Lumpur. Understanding passenger demand patterns is essential for transport planning, scheduling, and capacity management.

This report aims to:

  • Analyse ridership patterns across multiple rail lines

  • Identify temporal trends (daily, weekly, seasonal behaviour)

  • Develop a predictive model using machine learning (tidymodels)

The analysis combines exploratory data analysis (EDA) and regression modelling.

2. Data Description

The dataset contains daily ridership values for multiple KL rail lines:

  • LRT Ampang Line

  • MRT Kajang Line

  • LRT Kelana Jaya Line

  • KL Monorail

  • MRT Putrajaya Line

Each row holds one day's passenger counts, with one column per rail line.

glimpse(raw_data)
Rows: 2,708
Columns: 6
$ date            <date> 2019-01-01, 2019-01-02, 2019-01-03, 2019-01-04, 2019-…
$ rail_lrt_ampang <int> 113357, 182715, 187904, 198420, 120773, 101145, 197569…
$ rail_mrt_kajang <int> 114173, 169316, 175304, 187891, 112660, 95913, 184365,…
$ rail_lrt_kj     <int> 139634, 274224, 286469, 304755, 145036, 120032, 301290…
$ rail_monorail   <int> 35804, 31859, 31893, 34121, 29950, 25342, 31988, 31792…
$ rail_mrt_pjy    <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
summary(raw_data)
      date            rail_lrt_ampang  rail_mrt_kajang   rail_lrt_kj    
 Min.   :2019-01-01   Min.   :  6587   Min.   :  4973   Min.   :  7195  
 1st Qu.:2020-11-07   1st Qu.: 95205   1st Qu.: 95571   1st Qu.:116982  
 Median :2022-09-15   Median :144048   Median :174088   Median :172135  
 Mean   :2022-09-15   Mean   :140994   Mean   :164170   Mean   :186375  
 3rd Qu.:2024-07-23   3rd Qu.:198979   3rd Qu.:221509   3rd Qu.:276786  
 Max.   :2026-05-31   Max.   :270317   Max.   :401756   Max.   :352504  
                                                                        
 rail_monorail    rail_mrt_pjy   
 Min.   : 1392   Min.   : 12108  
 1st Qu.:20475   1st Qu.: 72988  
 Median :39147   Median :113663  
 Mean   :37813   Mean   :111618  
 3rd Qu.:55069   3rd Qu.:163010  
 Max.   :99361   Max.   :236225  
                 NA's   :1262    

Interpretation:

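  • The dataset is a daily time series from 2019-01-01 to 2026-05-31; its 2,708 rows match the number of calendar days in that range

  • Each rail line is recorded as an integer count of daily riders

  • rail_mrt_pjy has 1,262 missing values, including the earliest dates in the series

  • The outcome is numeric rather than categorical, so prediction is a regression problem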
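Sections 3 and 4 work with a long-format table, df_long, with columns date, Line and Ridership; its construction is not shown in this report. The following is a minimal sketch of one way such a table could be built, assuming tidyr::pivot_longer() was applied to raw_data. The object names raw_data_demo and df_long_demo are hypothetical, and the demo rows are the first three days shown in the glimpse() output above.

```r
library(dplyr)
library(tidyr)

# Toy stand-in for raw_data: the first three rows shown by glimpse()
raw_data_demo <- tibble(
  date            = as.Date(c("2019-01-01", "2019-01-02", "2019-01-03")),
  rail_lrt_ampang = c(113357L, 182715L, 187904L),
  rail_mrt_kajang = c(114173L, 169316L, 175304L),
  rail_lrt_kj     = c(139634L, 274224L, 286469L),
  rail_monorail   = c(35804L, 31859L, 31893L),
  rail_mrt_pjy    = c(NA_integer_, NA_integer_, NA_integer_)
)

# Assumed reshape: one row per (date, Line) pair.
# Missing values are kept, so NA rows for rail_mrt_pjy remain in the result.
df_long_demo <- raw_data_demo %>%
  pivot_longer(
    cols      = starts_with("rail_"),
    names_to  = "Line",
    values_to = "Ridership"
  )

print(df_long_demo)
```

With this layout, the missing rail_mrt_pjy values become NA rows in the long table, which is consistent with the "Removed 1262 rows" warnings in the plots that follow.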

3. Exploratory Data Analysis (EDA)

3.2 Distribution of Ridership

ggplot(df_long, aes(Ridership, fill = Line)) +
  geom_histogram(alpha = 0.5, bins = 30) +
  theme_minimal() +
  labs(title = "Distribution of Ridership Across Rail Lines")
Warning: Removed 1262 rows containing non-finite outside the scale range
(`stat_bin()`).

Interpretation:

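The pooled distribution is right-skewed: most daily totals fall in the lower range, with a long tail of high-demand days. Because the data are daily totals, this tail reflects busy days (for example, days with special events) rather than peak hours within a day. The per-line means and medians in Section 2 are close to each other, which suggests that part of the pooled spread comes from differences in scale between lines. The 1,262 rows removed in the warning are the missing MRT Putrajaya Line values.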

3.3 Day of Week Pattern

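# Order the days Monday to Sunday; a plain character column would be sorted
# alphabetically on the axis. The level names assume an English locale.
df_long$weekday <- factor(
  weekdays(df_long$date),
  levels = c("Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday")
)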

ggplot(df_long, aes(weekday, Ridership)) +
  stat_summary(fun = mean, geom = "bar", fill = "steelblue") +
  theme_minimal() +
  labs(title = "Average Ridership by Day of Week")
Warning: Removed 1262 rows containing non-finite outside the scale range
(`stat_summary()`).

Interpretation:

Ridership is generally higher on weekdays than on weekends, which suggests that commuter travel (work and education) is the main driver of demand.
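The weekday and weekend gap can also be checked numerically rather than read off the bar chart. The sketch below is illustrative only: it uses the first seven rail_lrt_ampang values shown in the glimpse() output (2019-01-01 to 2019-01-07), and the names demo and weekend_summary are hypothetical. It uses format(date, "%u"), which returns the ISO weekday number (Monday = 1) and does not depend on the locale.

```r
library(dplyr)

# First seven days of rail_lrt_ampang, as shown by glimpse(raw_data)
demo <- tibble(
  date      = seq(as.Date("2019-01-01"), by = "day", length.out = 7),
  Ridership = c(113357, 182715, 187904, 198420, 120773, 101145, 197569)
)

weekend_summary <- demo %>%
  mutate(
    # ISO weekday number: 6 = Saturday, 7 = Sunday
    day_type = if_else(format(date, "%u") %in% c("6", "7"), "Weekend", "Weekday")
  ) %>%
  group_by(day_type) %>%
  summarise(
    mean_ridership = mean(Ridership, na.rm = TRUE),
    n_days         = n(),
    .groups        = "drop"
  )

print(weekend_summary)
```

Applied to the full long table, the same grouping would give one mean per day type (or per day type and line, if Line is added to group_by()), which can be reported alongside the plot.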

4. Predictive Modeling

A linear regression model was developed to predict ridership based on:

  • Time trend (time_index)

  • Rail line category

This allows us to capture both temporal and structural differences between rail lines.

4.1 Data Preparation

model_df <- df_long %>%
  group_by(date, Line) %>%
  summarise(Ridership = mean(Ridership), .groups = "drop") %>%
  mutate(time_index = as.numeric(date))

Explanation:

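as.numeric() converts each date to the number of days since 1970-01-01 (for example, 2019-01-01 becomes 17897), giving a numeric time index that lets the regression capture a long-term linear trend. Missing values are not removed at this step: mean() is called without na.rm = TRUE, so the NA values for the MRT Putrajaya Line carry through to model_df.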

4.2 Train-Test Split

set.seed(123)  
split <- initial_split(model_df, prop = 0.8) 
train <- training(split) 
test <- testing(split)

Explanation:

The dataset is split into:

  • 80% training data (model learning)

  • 20% testing data (model evaluation)

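Evaluating on held-out data gives a fairer estimate of performance than evaluating on the training data. Note that initial_split() samples rows at random, so test days are interleaved with training days; the test error therefore measures how well the model fills in days within the observed period, not how well it forecasts beyond it.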
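If the goal is to assess forecasting of future days, a chronological split is one alternative to the random split above. The sketch below is not the method used in this report; it assumes rsample::initial_time_split(), which assigns the first prop share of rows to training and the rest to testing, so the data must be sorted by date first. The data frame demo_df and its Ridership values are placeholders.

```r
library(rsample)

# Placeholder data: 100 consecutive days with made-up ridership values
demo_df <- data.frame(
  date      = seq(as.Date("2019-01-01"), by = "day", length.out = 100),
  Ridership = seq_len(100)
)

# initial_time_split() splits by row order, so sort by date first
demo_df <- demo_df[order(demo_df$date), ]

time_split <- initial_time_split(demo_df, prop = 0.8)
time_train <- training(time_split)  # earliest 80% of rows
time_test  <- testing(time_split)   # latest 20% of rows

range(time_train$date)
range(time_test$date)
```

With this design every test day falls after every training day, so the evaluation resembles genuine out-of-sample forecasting.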

4.3 Model Building

rec <- recipe(Ridership ~ time_index + Line, train)

model <- linear_reg() %>%
  set_engine("lm")

wf <- workflow() %>%
  add_recipe(rec) %>%
  add_model(model)

fit <- fit(wf, train)

Explanation:

A multiple linear regression model is used. It assumes:

  • Ridership changes linearly over time, with the same slope for every rail line

  • Different rail lines have different baseline demand levels
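Written out, the two assumptions above correspond to the following model. This is a sketch that assumes R's default treatment coding for the Line factor, so one of the five lines serves as the reference level and the other four each get an offset; the symbols are introduced here for illustration and do not appear in the original code.

```latex
\[
\text{Ridership}_{\ell,t}
  = \beta_0
  + \beta_1 \,\text{time\_index}_t
  + \sum_{j=2}^{5} \gamma_j \,\mathbf{1}[\ell = j]
  + \varepsilon_{\ell,t}
\]
```

Here \(\beta_1\) is the common daily trend, \(\gamma_j\) is the baseline difference between line \(j\) and the reference line, and \(\varepsilon_{\ell,t}\) is the error term. Because there is no interaction between time_index and Line, the fitted lines for the five rail lines are parallel.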

4.4 Model Evaluation

pred <- predict(fit, test) %>%   
  bind_cols(test)  

metrics(pred, truth = Ridership, estimate = .pred)
# A tibble: 3 × 3
  .metric .estimator .estimate
  <chr>   <chr>          <dbl>
1 rmse    standard   60371.   
2 rsq     standard       0.517
3 mae     standard   47238.   

Interpretation:

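The evaluation metrics (RMSE, MAE, R²) show how close predictions are to actual values:

  • RMSE is about 60,371 and MAE about 47,238 passengers per day (lower is better)

  • R² is 0.517, so the model accounts for roughly 52% of the variance in test-set ridership (higher is better)

These errors are large relative to typical daily ridership, which averages between 37,813 (KL Monorail) and 186,375 (LRT Kelana Jaya Line) per line. The model is best treated as a rough baseline that leaves much of the day-to-day variation unexplained.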
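To make the three metrics concrete, the sketch below computes them by hand and compares them with the yardstick vector functions. The truth and estimate values are hypothetical and are not taken from this report's data. Note that yardstick's rsq is the squared correlation between truth and estimate.

```r
library(yardstick)

# Hypothetical values for illustration only
truth    <- c(100, 200, 300, 400)
estimate <- c(110, 190, 320, 380)

err <- truth - estimate

# Manual definitions
rmse_manual <- sqrt(mean(err^2))        # root mean squared error
mae_manual  <- mean(abs(err))           # mean absolute error
rsq_manual  <- cor(truth, estimate)^2   # squared correlation

# The same metrics from yardstick
rmse_yard <- rmse_vec(truth, estimate)
mae_yard  <- mae_vec(truth, estimate)
rsq_yard  <- rsq_vec(truth, estimate)

c(rmse = rmse_yard, mae = mae_yard, rsq = rsq_yard)
```

RMSE squares the errors before averaging, so it penalises large misses more heavily than MAE; a wide gap between the two usually points to a few very large errors.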

4.5 Actual vs Predicted Results

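ggplot(pred, aes(time_index)) +
  geom_point(aes(y = Ridership)) +
  # Group by Line so each rail line gets its own fitted line
  # instead of one path zigzagging between lines
  geom_line(aes(y = .pred, group = Line), color = "red") +
  theme_minimal() +
  labs(title = "Actual vs Predicted Ridership")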
Warning: Removed 253 rows containing missing values or values outside the scale range
(`geom_point()`).

Interpretation:

The model captures the general trend of ridership but may underperform during sudden spikes or drops. This is expected: a straight-line trend for each rail line cannot bend to follow abrupt changes in demand.

5. Key Findings

From the analysis:

  • Ridership shows clear long-term trends

  • Different rail lines have distinct passenger volumes

  • Average weekday demand is higher than weekend demand

  • Linear regression provides a useful but simplified prediction model (test R² = 0.517)

6. Conclusion

This study shows that KL rail ridership varies with both time-based trends and differences between rail lines. While the linear regression model provides a baseline prediction, real-world ridership is likely also affected by factors not included here, such as weather, events, and economic conditions.

7. Recommendations

To improve future analysis:

  • Include external variables (weather, holidays, fuel prices)

  • Compare against more flexible models (for example, ARIMA for time-series forecasting, or Random Forest)

  • Build real-time prediction dashboards

  • Incorporate passenger flow simulation for planning
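As a first step towards richer predictors, calendar features can be derived from the date column itself. The sketch below is one possible approach, not part of the original analysis: it assumes recipes::step_date() to add day-of-week and month columns, which would let a model learn the weekday effect seen in Section 3.3. The demo data reuse the first seven rail_lrt_ampang values from the glimpse() output, and the names demo, rec_dow and baked are hypothetical.

```r
library(dplyr)
library(recipes)

# First seven days of rail_lrt_ampang, as shown by glimpse(raw_data)
demo <- tibble(
  date      = seq(as.Date("2019-01-01"), by = "day", length.out = 7),
  Line      = "rail_lrt_ampang",
  Ridership = c(113357, 182715, 187904, 198420, 120773, 101145, 197569)
)

# Derive day-of-week and month features from the date column
rec_dow <- recipe(Ridership ~ date + Line, data = demo) %>%
  step_date(date, features = c("dow", "month"))

# prep() estimates the recipe on the data; bake(new_data = NULL) returns
# the processed training set, including the new date_dow and date_month columns
baked <- bake(prep(rec_dow), new_data = NULL)

print(baked)
```

A recipe like this could replace the one in Section 4.3, with external variables such as weather or public holidays joined onto the data by date before modelling.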