Urban rail systems such as the MRT, LRT, and Monorail play a central role in daily mobility in Kuala Lumpur. Understanding passenger demand patterns supports transport planning, scheduling, and capacity management.
This report aims to:
Analyse ridership patterns across multiple rail lines
date                 rail_lrt_ampang   rail_mrt_kajang   rail_lrt_kj
Min.   :2019-01-01   Min.   :  6587    Min.   :  4973    Min.   :  7195
1st Qu.:2020-11-07   1st Qu.: 95205    1st Qu.: 95571    1st Qu.:116982
Median :2022-09-15   Median :144048    Median :174088    Median :172135
Mean   :2022-09-15   Mean   :140994    Mean   :164170    Mean   :186375
3rd Qu.:2024-07-23   3rd Qu.:198979    3rd Qu.:221509    3rd Qu.:276786
Max.   :2026-05-31   Max.   :270317    Max.   :401756    Max.   :352504

rail_monorail     rail_mrt_pjy
Min.   : 1392     Min.   : 12108
1st Qu.:20475     1st Qu.: 72988
Median :39147     Median :113663
Mean   :37813     Mean   :111618
3rd Qu.:55069     3rd Qu.:163010
Max.   :99361     Max.   :236225
                  NA's   :1262
Interpretation:
The dataset is time-series based
Each rail line has continuous numerical ridership values
The outcome (ridership) is numeric rather than categorical → regression problem
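The later plots use a long-format table, `df_long`, with `date`, `Line`, and `Ridership` columns, but the reshaping step is not shown. The following is a minimal sketch of how such a table could be built from the wide summary above, using base R only. The toy values are invented, and the report itself may well have used `tidyr::pivot_longer()` instead.

```r
# Toy wide table using three of the column names from the summary above.
# All values are invented for illustration only.
df <- data.frame(
  date            = as.Date("2022-06-14") + 0:3,
  rail_lrt_ampang = c(150000, 152000, 149000, 151000),
  rail_monorail   = c(40000, 41000, 39000, 42000),
  rail_mrt_pjy    = c(NA, NA, 60000, 65000)  # missing on the first two days
)

# Count missing values per column
# (the real data reports 1262 NA's for rail_mrt_pjy).
na_counts <- colSums(is.na(df))

# Wide -> long: one row per (date, Line) pair.
line_cols <- setdiff(names(df), "date")
df_long <- data.frame(
  date      = rep(df$date, times = length(line_cols)),
  Line      = rep(line_cols, each = nrow(df)),
  Ridership = unlist(df[line_cols], use.names = FALSE)
)

print(na_counts)
print(head(df_long))
```

Because the missing values stay in the long table, every ggplot2 layer built on it drops the same rows, which would explain why the warnings below repeat the figure 1262.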
Warning: Removed 1262 rows containing missing values or values outside the scale range
(`geom_line()`).
Interpretation:
The trend plot shows that ridership varies substantially over time. All rail lines follow similar temporal patterns, which suggests that system-wide factors such as economic conditions, public holidays, or pandemic effects drive overall demand.
3.2 Distribution of Ridership
ggplot(df_long, aes(Ridership, fill = Line)) +
  geom_histogram(alpha = 0.5, bins = 30) +
  theme_minimal() +
  labs(title = "Distribution of Ridership Across Rail Lines")
Warning: Removed 1262 rows containing non-finite outside the scale range
(`stat_bin()`).
Interpretation:
The pooled distribution is right-skewed: most observations are moderate, with a long tail of very high passenger demand. This is typical in transport systems, where peak periods and special events cause spikes in ridership.
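Skewness can also be checked numerically rather than read off the histogram. The sketch below defines a simple moment-based skewness function (a hypothetical helper, not part of the report) and applies it to invented values. In the summary table the mean is below the median for four of the five lines, so it may be worth running a check like this per line as well as on the pooled values.

```r
# Moment-based skewness: third central moment divided by the cubed
# (population) standard deviation. Positive values mean a longer right tail.
skewness <- function(x) {
  x <- x[is.finite(x)]  # drop NA and other non-finite values
  m <- mean(x)
  mean((x - m)^3) / mean((x - m)^2)^1.5
}

# Invented example: mostly moderate days plus two very busy ones.
ridership <- c(90, 100, 105, 110, 115, 120, 125, 300, 350, NA) * 1000

skew_value    <- skewness(ridership)
mean_value    <- mean(ridership, na.rm = TRUE)
median_value  <- median(ridership, na.rm = TRUE)

# For a right-skewed sample, skewness is positive and the mean
# usually sits above the median.
print(c(skewness = skew_value, mean = mean_value, median = median_value))
```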
3.3 Day of Week Pattern
df_long$weekday <- weekdays(df_long$date)

ggplot(df_long, aes(weekday, Ridership)) +
  stat_summary(fun = mean, geom = "bar", fill = "steelblue") +
  theme_minimal() +
  labs(title = "Average Ridership by Day of Week")
Warning: Removed 1262 rows containing non-finite outside the scale range
(`stat_summary()`).
Interpretation:
Ridership is generally higher on weekdays than on weekends. This suggests that commuter travel (work and education) is the main driver of demand.
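The weekday and weekend averages can also be computed directly, which gives a number to go with the bar chart. The sketch below uses invented counts. It also orders the days Monday to Sunday, because `weekdays()` returns plain character strings that would otherwise be plotted alphabetically, and it uses `as.POSIXlt()$wday` so that the result does not depend on the system locale.

```r
# Invented daily counts for two weeks starting on Monday 2024-01-01.
dates     <- as.Date("2024-01-01") + 0:13
ridership <- rep(c(200, 210, 205, 215, 220, 120, 100), times = 2) * 1000

# Locale-independent day index: 0 = Sunday, ..., 6 = Saturday.
wday      <- as.POSIXlt(dates)$wday
day_names <- c("Sunday", "Monday", "Tuesday", "Wednesday",
               "Thursday", "Friday", "Saturday")

# Factor with explicit levels so tables and plots run Monday -> Sunday.
weekday <- factor(day_names[wday + 1],
                  levels = c(day_names[2:7], day_names[1]))

mean_by_day <- tapply(ridership, weekday, mean, na.rm = TRUE)

# Weekday versus weekend averages.
is_weekend   <- wday %in% c(0, 6)
weekday_mean <- mean(ridership[!is_weekend], na.rm = TRUE)
weekend_mean <- mean(ridership[is_weekend], na.rm = TRUE)

print(mean_by_day)
print(c(weekday = weekday_mean, weekend = weekend_mean))
```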
4. Predictive Modeling
A linear regression model was developed to predict ridership based on:
Time trend (time_index)
Rail line category
This allows us to capture both temporal and structural differences between rail lines.
A multiple linear regression model is used. It assumes:
Ridership changes linearly over time
Different rail lines have different baseline demand levels
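The fitting code is not shown in this section. As a rough illustration of the two assumptions above, the sketch below fits the same kind of model with base R's `lm()` on simulated data. The column names `time_index`, `Line`, and `Ridership` follow the report, while the line names and all values are invented. The `.pred` column in the evaluation output suggests the report used a tidymodels workflow, so this is an equivalent sketch rather than the original code.

```r
set.seed(1)

# Simulated long-format data: two hypothetical lines with different
# baselines and a shared upward time trend.
n <- 100
sim <- data.frame(
  time_index = rep(seq_len(n), times = 2),
  Line       = factor(rep(c("line_a", "line_b"), each = n))
)
sim$Ridership <- 50000 + 300 * sim$time_index +
  ifelse(sim$Line == "line_b", 80000, 0) +
  rnorm(nrow(sim), sd = 5000)

# Ridership = b0 + b1 * time_index + (offset per non-reference line) + error
fit <- lm(Ridership ~ time_index + Line, data = sim)

# "time_index" is the common slope shared by all lines;
# "Lineline_b" is line_b's baseline offset relative to line_a.
print(coef(fit))
```

A model of this form gives every line the same slope. Letting the trend differ by line would need an interaction term (`time_index * Line`), which the report does not describe.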
4.4 Model Evaluation
pred <- predict(fit, test) %>%
  bind_cols(test)

metrics(pred, truth = Ridership, estimate = .pred)
# A tibble: 3 × 3
  .metric .estimator .estimate
  <chr>   <chr>          <dbl>
1 rmse    standard   60371.
2 rsq     standard       0.517
3 mae     standard   47238.
Interpretation:
The evaluation metrics (RMSE, MAE, R²) show how close predictions are to actual values:
Lower RMSE → better accuracy
Higher R² → better explanation of variance
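With R² ≈ 0.52, the model accounts for about half of the variance in ridership on the test set. An RMSE of about 60,400 and an MAE of about 47,200 passengers are large relative to typical ridership (line means range from about 37,800 to about 186,400), so the model is a rough baseline that does not capture many real-world fluctuations.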
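For reference, the three metrics can be computed by hand. The sketch below uses invented values, and it computes R² as the squared correlation between actual and predicted values, which is how yardstick documents `rsq()` (as opposed to `rsq_trad()`).

```r
# Invented actual and predicted values for five observations.
actual    <- c(100, 150, 200, 250, 300)
predicted <- c(110, 140, 210, 230, 320)

err  <- actual - predicted
rmse <- sqrt(mean(err^2))         # penalises large errors more heavily
mae  <- mean(abs(err))            # average absolute error, in ridership units
rsq  <- cor(actual, predicted)^2  # squared correlation

print(c(rmse = rmse, mae = mae, rsq = rsq))
```

RMSE is never smaller than MAE. A wide gap between the two points to a few large errors rather than many small ones.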
4.5 Actual vs Predicted Results
ggplot(pred, aes(time_index)) +
  geom_point(aes(y = Ridership)) +
  geom_line(aes(y = .pred), color = "red") +
  theme_minimal() +
  labs(title = "Actual vs Predicted Ridership")
Warning: Removed 253 rows containing missing values or values outside the scale range
(`geom_point()`).
Interpretation:
The model captures the general trend of ridership but may underperform during sudden spikes or drops. This is expected for a linear regression model.
5. Key Findings
From the analysis:
Ridership shows clear long-term trends
Different rail lines have distinct passenger volumes
Average weekday demand is higher than weekend demand
Linear regression provides a useful but simplified prediction model
6. Conclusion
This study demonstrates that KL rail ridership is influenced by both time-based trends and rail line differences. While the linear regression model provides a baseline prediction, real-world ridership is affected by additional factors such as weather, events, and economic conditions.
7. Recommendations
To improve future analysis:
Include external variables (weather, holidays, fuel prices)
Use advanced forecasting models (ARIMA, Random Forest)
Build real-time prediction dashboards
Incorporate passenger flow simulation for planning
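As a small illustration of the first recommendation, the sketch below adds a single calendar variable, a hypothetical `is_weekend` indicator, to a trend-only linear model and compares the in-sample fit. The data are simulated and all effect sizes are invented. Weather, holiday, or fuel-price variables could be added to the formula in the same way.

```r
set.seed(42)

# Simulated daily data for one hypothetical line (20 full weeks).
dates <- as.Date("2024-01-01") + 0:139
sim <- data.frame(
  time_index = seq_along(dates),
  is_weekend = as.POSIXlt(dates)$wday %in% c(0, 6)  # 0 = Sunday, 6 = Saturday
)
sim$Ridership <- 150000 + 100 * sim$time_index -
  60000 * sim$is_weekend + rnorm(nrow(sim), sd = 8000)

# Baseline: time trend only.
fit_trend <- lm(Ridership ~ time_index, data = sim)

# Extended: time trend plus the weekend indicator.
fit_weekend <- lm(Ridership ~ time_index + is_weekend, data = sim)

r2_trend   <- summary(fit_trend)$r.squared
r2_weekend <- summary(fit_weekend)$r.squared

print(c(trend_only = r2_trend, with_weekend = r2_weekend))
print(coef(fit_weekend))
```

In-sample R² cannot decrease when a predictor is added, so a fair comparison on real data would score both models on a held-out test set, as in Section 4.4.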