Analysis of Indian Railways Timetable

About the Data

Data source: The Open Government Data Platform of India
URL: data.gov.in

ISL stands for Intermediate Stoppage List and refers to the list of stations where the train stops (the origin and destination stations are counted in this list as well).

The columns are as follows:

Column	Explanation
Train No.	The train number
Train Name	The train name
Islno	The stoppage number for that train number
Station Code	The station code
Station Name	The station name
Arrival Time	Arrival time (HH:MM:SS)
Departure Time	Departure time (HH:MM:SS)
Distance	Distance from the source station (kms)
Source Station Code	Source station code
Source Station Name	Source station name
Destination Station Code	Destination station code
Destination Station Name	Destination station name

Area of study

Distribution of the number of stops a train makes during its journey.
Distribution of the distances travelled by trains.
Whether the distance travelled correlates with the number of stops for a train.
Analysis of trains from each station:
- How many trains leave from each station?
- What is the total distance travelled by trains from that station?

The analysis is based on the timetable data published by the Indian Railways.

Loading the data file

timetable <- read.csv('c:/rdata/isl_wise_train_details.csv', stringsAsFactors = FALSE)
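
The code that follows refers to columns as Train.No., Station.Code and so on, because read.csv by default rewrites the CSV headers into syntactically valid R names (spaces become dots). A minimal sketch of that behaviour, using a made-up three-row sample rather than the real file:

```r
# Hypothetical three-row sample; the real file has more columns and rows.
csv_text <- "Train No.,Train Name,Islno,Station Code,Distance,Destination Station Code
101,DEMO EXP,1,AAA,0,CCC
101,DEMO EXP,2,BBB,120,CCC
101,DEMO EXP,3,CCC,300,CCC"

demo <- read.csv(text = csv_text, stringsAsFactors = FALSE)

# Headers pass through make.names(): each space turns into a dot.
print(names(demo))
str(demo)
```

Running names(timetable) on the real data is a quick way to confirm the exact column names before subsetting.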

General Information

timetable.srs.dest <- subset(timetable, 
                             timetable$Station.Code == timetable$Destination.Station.Code)

cat('Unique train numbers in India:', sum(!is.na(timetable.srs.dest$Train.No.)))

## Unique train numbers in India: 2828

cat('Total distance covered by the trains (in kms):', sum(timetable.srs.dest$Distance))

## Total distance covered by the trains (in kms): 3034862
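
The subset above keeps one row per train: the row where the current station is the train's destination. On that row, Islno is the total number of stops and Distance is the full route length, assuming Distance is cumulative from the source station. A sketch on made-up data (train numbers and station codes are hypothetical):

```r
# Two hypothetical trains: 101 runs AAA -> BBB -> CCC, 202 runs BBB -> DDD.
toy <- data.frame(
  Train.No. = c(101, 101, 101, 202, 202),
  Islno = c(1, 2, 3, 1, 2),
  Station.Code = c("AAA", "BBB", "CCC", "BBB", "DDD"),
  Distance = c(0, 120, 300, 0, 80),
  Destination.Station.Code = c("CCC", "CCC", "CCC", "DDD", "DDD"),
  stringsAsFactors = FALSE
)

# Keep only the final stop of each train.
toy.dest <- subset(toy, Station.Code == Destination.Station.Code)
print(toy.dest)

# Distinct train numbers and total route length across trains.
n.trains <- length(unique(toy.dest$Train.No.))
total.distance <- sum(toy.dest$Distance)
cat('Trains:', n.trains, ' Total distance (kms):', total.distance, '\n')
```

Using length(unique(...)) counts distinct train numbers even if a number were to appear on more than one destination row, which a plain count of non-missing rows would not guard against.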

Intermediate Stoppage List

summary(timetable.srs.dest$Islno)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00   13.00   20.00   24.17   31.00  120.00

library(ggplot2)

ggplot(data = timetable.srs.dest, aes(x = Islno)) +
  geom_histogram(color = 'black', fill = 'firebrick', binwidth = 5) +
  geom_freqpoly(binwidth = 5) +
  xlab('Intermediate Stoppage List') + ylab('Count') +
  ggtitle("Distribution of Intermediate Stoppage List (ISL)")

Distance travelled by each train

The longest route runs from Dibrugarh (Assam) to Kanyakumari (Tamil Nadu), a distance of 4273 kms!

# Summary over every stop row of the timetable, not only each train's final stop
summary(timetable$Distance)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     0.0   173.0   432.0   647.8   931.0  4273.0

cat('Number of trains that travel 3000 kms or more:', sum(timetable.srs.dest$Distance >= 3000))

## Number of trains that travel 3000 kms or more: 62
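
Combining the figures reported so far gives two quick derived numbers. This is plain arithmetic on the printed outputs, not a new query against the data:

```r
n.trains <- 2828           # destination rows (one per train) reported earlier
total.distance <- 3034862  # total kms reported earlier
long.haul <- 62            # trains with Distance >= 3000

mean.per.train <- total.distance / n.trains
share.long.haul <- 100 * long.haul / n.trains

cat('Mean distance per train (kms):', round(mean.per.train, 1), '\n')
cat('Share of trains at 3000 kms or more (%):', round(share.long.haul, 1), '\n')
```

The per-train mean (about 1073 kms) is well above the 647.8 printed by summary(timetable$Distance), because that summary averages over every stop row rather than over final stops only.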

ggplot(data = timetable.srs.dest, aes(x = Distance)) +
  geom_histogram(color = 'black', fill = 'firebrick', binwidth = 200) +
  geom_freqpoly(binwidth = 200) +
  xlab('Distance travelled (kms)') + ylab('Count') +
  ggtitle("Distribution of distances travelled (kms)")

Distance vs ISL

ggplot(data = timetable.srs.dest, aes(x = Distance, y = Islno)) +
  geom_point(color = 'firebrick', alpha = 1/2) +
  xlab('Distance travelled (kms)') + ylab('Intermediate Stoppage List') +
  ggtitle("Distance vs. ISL")

Is there a correlation between the ISL and distance travelled by the train?

cor.test(timetable.srs.dest$Islno, timetable.srs.dest$Distance, method = 'pearson')

## 
##  Pearson's product-moment correlation
## 
## data:  timetable.srs.dest$Islno and timetable.srs.dest$Distance
## t = 24.369, df = 2826, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.3857724 0.4467039
## sample estimates:
##       cor 
## 0.4167061
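
The printed figures can be cross-checked by hand. For a Pearson test, the t statistic follows from the estimate r and the degrees of freedom as t = r * sqrt(df) / sqrt(1 - r^2), and r^2 is the share of variance in one variable that a linear fit on the other accounts for. A small sketch using only the numbers printed above:

```r
r <- 0.4167061    # sample estimate from the output
deg.free <- 2826  # df from the output (number of trains minus 2)

# t statistic implied by r and the degrees of freedom
t.stat <- r * sqrt(deg.free) / sqrt(1 - r^2)

# Share of variance accounted for by a linear fit
r.squared <- r^2

cat('t statistic:', round(t.stat, 2), '\n')
cat('r squared:', round(r.squared, 3), '\n')
```

The t value comes out near 24.37, in line with the output, and r^2 is about 0.17, so under a linear fit distance accounts for roughly 17% of the variance in the number of stops.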

The test shows a statistically significant but only moderate positive correlation (r ≈ 0.42). The number of stops seems to depend more on the route of the train than on the distance. We would expect the most stops for distances between 3000 kms and 4000 kms, but trains travelling between 1000 kms and 2000 kms have a rather high number of stops.

Narrowing down the distance to < 2000 kms and the ISL to < 20, we notice something interesting: some trains stop more often (like mail trains) while others don't (like superfast trains).

timetable.subset <- subset(timetable.srs.dest, Distance < 2000 & Islno < 20)
ggplot(data = timetable.subset, aes(x = Distance, y = Islno)) +
  geom_point(color = 'firebrick', alpha = 1/2) +
  xlab('Distance travelled (kms)') + ylab('Intermediate Stoppage List') +
  ggtitle("Distance (< 2000 kms) vs. ISL (< 20)")

After narrowing down the data, the correlation (r ≈ 0.30) is even weaker than the full data set’s correlation (r ≈ 0.42).

cor.test(timetable.subset$Islno, timetable.subset$Distance, method = 'pearson')

## 
##  Pearson's product-moment correlation
## 
## data:  timetable.subset$Islno and timetable.subset$Distance
## t = 11.275, df = 1310, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.2472770 0.3459671
## sample estimates:
##       cor 
## 0.2974163
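
One plausible reason for the weaker correlation is restriction of range: cutting both variables to a narrower window usually lowers Pearson's r even when the underlying relationship is unchanged. The sketch below illustrates this on simulated data only; the slope, noise level and cut-offs are arbitrary assumptions and are not fitted to the timetable.

```r
set.seed(42)

# Simulated (not real) trains: stops grow linearly with distance plus noise.
distance <- runif(2000, min = 0, max = 4000)
stops <- 5 + 0.01 * distance + rnorm(2000, sd = 8)
sim <- data.frame(distance = distance, stops = stops)

# Correlation over the full simulated range
r.full <- cor(sim$distance, sim$stops)

# Same cut-offs as in the text, applied to the simulated data
sim.subset <- subset(sim, distance < 2000 & stops < 20)
r.subset <- cor(sim.subset$distance, sim.subset$stops)

cat('Full range r:', round(r.full, 2), '\n')
cat('Restricted range r:', round(r.subset, 2), '\n')
```

This does not show that range restriction is the only effect at work in the real data; it only shows that a drop in r after subsetting is expected on its own.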

Analysis of trains from stations

Let’s see the total distance travelled from stations.

library(dplyr)

station.distance.summary <- timetable.srs.dest %>%
  group_by(Station.Name = factor(Source.Station.Name)) %>%
  summarise(Total.Distance.Of.Trains = sum(Distance)) %>%
  arrange(desc(Total.Distance.Of.Trains))

We limit the observations to the top 20. Howrah station definitely stands out!

top.20 <- top_n(station.distance.summary, 20, Total.Distance.Of.Trains)
knitr::kable(top.20)

Station.Name	Total.Distance.Of.Trains
HOWRAH JN	114943
LOKMANYATILAK T	98012
NEW DELHI	94948
H NIZAMUDDIN	88231
YESVANTPUR JN	80261
CHENNAI CENTRAL	74115
JAMMU TAWI	58600
MUMBAI CST	54653
PURI	53674
TRIVANDRUM CNTL	53641
AHMEDABAD JN	50689
BANDRA TERMINUS	49514
PUNE JN	48117
AMRITSAR JN	47672
GUWAHATI	44001
SECUNDERABAD JN	43651
BANGALORE CY JN	42615
GORAKHPUR JN	42292
AJMER JN	40773
VISAKHAPATNAM	36131
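
The group_by / summarise / arrange pipeline used for this table can be tried on a small made-up data frame. Station names and distances below are hypothetical; n() is included to show that a train count and a distance total can be computed in a single pass.

```r
library(dplyr)

# Hypothetical destination rows: one row per train.
toy.dest <- data.frame(
  Train.No. = c(101, 102, 103, 201, 301),
  Source.Station.Name = c("ALPHA JN", "ALPHA JN", "ALPHA JN", "BETA", "GAMMA CENTRAL"),
  Distance = c(300, 450, 120, 1500, 80),
  stringsAsFactors = FALSE
)

toy.summary <- toy.dest %>%
  group_by(Station.Name = Source.Station.Name) %>%
  summarise(Total.Trains = n(),                       # trains per source station
            Total.Distance.Of.Trains = sum(Distance)) %>%
  arrange(desc(Total.Distance.Of.Trains))             # largest total first

print(toy.summary)
```

In this toy data the station with a single long route ranks above the station with three shorter ones, which is why rankings by total distance and by train count need not agree.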

Let’s see the number of trains from each station, again limiting it to the top 20.

station.train.summary <- timetable.srs.dest %>%
  group_by(Station.Name = factor(Source.Station.Name)) %>%
  summarise(Total.Trains = n()) %>%
  arrange(desc(Total.Trains))

top.20 <- top_n(station.train.summary, 20, Total.Trains)
knitr::kable(top.20)

Station.Name	Total.Trains
HOWRAH JN	102
NEW DELHI	85
CHENNAI CENTRAL	69
LOKMANYATILAK T	64
MUMBAI CST	51
YESVANTPUR JN	50
H NIZAMUDDIN	48
PUNE JN	46
SECUNDERABAD JN	46
AHMEDABAD JN	44
BANGALORE CY JN	41
PURI	41
BANDRA TERMINUS	39
JAMMU TAWI	37
DELHI	36
VISAKHAPATNAM	36
AMRITSAR JN	34
CHENNAI EGMORE	34
INDORE JN BG	34
TIRUPATI	33

Analysis done using R 3.2.2.