About the Data

Data source: The Open Government Data Platform of India
URL: data.gov.in

ISL stands for Intermediate Stoppage List and refers to the list of stations where the train stops (the origin and destination stations are counted in this list as well).

The columns are as follows:

Column                     Explanation
Train No.                  The train number
Train Name                 The train name
Islno                      The stoppage number for that train number
Station Code               The station code
Station Name               The station name
Arrival Time               Arrival time (HH:MM:SS)
Departure Time             Departure time (HH:MM:SS)
Distance                   Distance (kms)
Source Station Code        Source station code
Source Station Name        Source station name
Destination Station Code   Destination station code
Destination Station Name   Destination station name

Area of study

The analysis is based on the timetable data published by the Indian Railways.

Loading the data file

timetable <- read.csv('c:/rdata/isl_wise_train_details.csv', stringsAsFactors = FALSE)
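
Note that read.csv rewrites the header names by default: spaces and other characters that are not valid in R names become dots, which is why the column listed above as Train No. is referred to as Train.No. in the code that follows. A minimal sketch of this behaviour, using a two-row inline sample (the rows and station codes are made up for illustration and are not taken from the data set):

```r
# Inline sample with the same style of header as the real file.
# The rows are hypothetical, for illustration only.
demo.csv <- "Train No.,Station Code,Islno,Distance
11111,AAA,1,0
11111,BBB,2,150"

demo <- read.csv(text = demo.csv, stringsAsFactors = FALSE)

# check.names = TRUE (the default) turns the spaces into dots
print(names(demo))

# Fail early if a column the analysis relies on is missing
required <- c("Train.No.", "Station.Code", "Islno", "Distance")
stopifnot(all(required %in% names(demo)))
```

The same stopifnot check could be run on timetable right after loading, so that a changed header in a newer download is caught before any plot is drawn.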

General Information

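# Keep only the row for each train's final stop; there Islno is the train's
# total number of stops and Distance is its end-to-end distance
timetable.srs.dest <- subset(timetable, Station.Code == Destination.Station.Code)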

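# Counts the non-missing rows of the destination subset; this equals the number
# of unique trains only if each train number appears there once
cat('Unique train numbers in India:', sum(!is.na(timetable.srs.dest$Train.No.)))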
## Unique train numbers in India: 2828
cat('Total distance covered by the trains (in kms):', sum(timetable.srs.dest$Distance))
## Total distance covered by the trains (in kms): 3034862
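
The count above is the number of non-missing rows in the destination subset, which matches the number of unique trains only when each train number appears there exactly once. A quick way to check that assumption is sketched below on a made-up table (in the real analysis the data frame would be timetable.srs.dest; the train numbers here are placeholders):

```r
# Hypothetical destination-row subset in which one train number is duplicated
dest.rows <- data.frame(
  Train.No. = c(101, 102, 102, 103),
  Distance  = c(500, 320, 320, 1210)
)

n.rows   <- sum(!is.na(dest.rows$Train.No.))     # counts rows
n.unique <- length(unique(dest.rows$Train.No.))  # counts distinct train numbers

cat('Rows:', n.rows, ' Distinct train numbers:', n.unique, '\n')

# Train numbers that occur more than once
dup <- unique(dest.rows$Train.No.[duplicated(dest.rows$Train.No.)])
print(dup)
```

If the two counts differ on the real data, the duplicated train numbers would also be double-counted in the total distance.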

Intermediate Stoppage List

summary(timetable.srs.dest$Islno)  
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00   13.00   20.00   24.17   31.00  120.00
library(ggplot2)

ggplot(data = timetable.srs.dest, aes(x = Islno)) +
  geom_histogram(color = 'black', fill = 'firebrick', binwidth = 5) +
  geom_freqpoly(binwidth = 5) +
  xlab('Intermediate Stoppage List') + ylab('Count') +
  ggtitle("Distribution of Intermediate Stoppage List (ISL)")

Distance travelled by each train

The longest train route runs from Dibrugarh (Assam) to Kanyakumari (Tamil Nadu), covering a distance of 4273 kms!

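# Note: this summarises the distance recorded at every stop (all rows of
# timetable), not one value per train, hence the minimum of 0
summary(timetable$Distance)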
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     0.0   173.0   432.0   647.8   931.0  4273.0
cat('Number of trains that travel 3000 kms or more:', sum(timetable.srs.dest$Distance >= 3000))
## Number of trains that travel 3000 kms or more: 62
ggplot(data = timetable.srs.dest, aes(x = Distance)) +
  geom_histogram(color = 'black', fill = 'firebrick', binwidth = 200) +
  geom_freqpoly(binwidth = 200) +
  xlab('Distance travelled (kms)') + ylab('Count') +
  ggtitle("Distribution of distances travelled (kms)")
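
The longest route quoted at the start of this section can be looked up in the data rather than read off by eye. A minimal sketch with which.max, run on a small made-up table that uses the same column names as timetable.srs.dest (the rows are placeholders, not real trains):

```r
# Hypothetical destination rows; names and distances are placeholders
dest.rows <- data.frame(
  Train.No.                = c(101, 102, 103),
  Source.Station.Name      = c("ALPHA", "BRAVO", "CHARLIE"),
  Destination.Station.Name = c("DELTA", "ECHO", "FOXTROT"),
  Distance                 = c(500, 2750, 1210),
  stringsAsFactors = FALSE
)

# Row with the greatest end-to-end distance
longest <- dest.rows[which.max(dest.rows$Distance), ]
print(longest[, c("Train.No.", "Source.Station.Name",
                  "Destination.Station.Name", "Distance")])
```

which.max returns only the first maximum, so if several trains share the longest distance, a filter such as Distance == max(Distance) would list all of them.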

Distance vs ISL

ggplot(data = timetable.srs.dest, 
       aes(x = Distance, y = Islno)) +
  geom_point(color = 'firebrick', alpha = 1/2) +
  xlab('Distance travelled (kms)') + ylab('Intermediate Stoppage List') +
  ggtitle("Distance vs. ISL")

Is there a correlation between the ISL and distance travelled by the train?

cor.test(timetable.srs.dest$Islno, timetable.srs.dest$Distance, method = 'pearson')
## 
##  Pearson's product-moment correlation
## 
## data:  timetable.srs.dest$Islno and timetable.srs.dest$Distance
## t = 24.369, df = 2826, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.3857724 0.4467039
## sample estimates:
##       cor 
## 0.4167061

The test shows only a moderate positive correlation (r ≈ 0.42): statistically significant, but far from strong. The number of stops seems to depend more on the route of the train than on the distance. We would expect the highest number of stops for distances between 3000 kms and 4000 kms, but trains travelling between 1000 kms and 2000 kms often have a rather high number of stops.
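
To put the coefficient in perspective, it can help to look at r squared, the share of the variance in one variable that a linear relationship with the other accounts for, and at how the reported t statistic follows from r and the degrees of freedom. The sketch below only re-uses the figures printed by cor.test above; the formula t = r * sqrt(df) / sqrt(1 - r^2) is the standard one for Pearson's test.

```r
r  <- 0.4167061   # sample estimate reported by cor.test
df <- 2826        # degrees of freedom (n - 2)

r.squared <- r^2                            # share of variance explained
t.stat    <- r * sqrt(df) / sqrt(1 - r^2)   # test statistic for H0: correlation = 0

cat('r squared:', round(r.squared, 3), '\n')
cat('t statistic:', round(t.stat, 3), '\n')
```

With r squared at roughly 0.17, distance accounts for about 17% of the variation in the number of stops under a linear model. The very small p-value reflects the large sample as much as the strength of the relationship.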

Narrowing down the Distance to < 2000 kms and ISL to < 20, we notice something interesting: some trains stop more often (like mail trains), while others don't (like superfast trains).

timetable.subset <- subset(timetable.srs.dest, Distance < 2000 & Islno < 20)
ggplot(data = timetable.subset, 
       aes(x = Distance, y = Islno)) +
  geom_point(color = 'firebrick', alpha = 1/2) +
  xlab('Distance travelled (kms)') + ylab('Intermediate Stoppage List') +
  ggtitle("Distance (< 2000 kms) vs. ISL (< 20)")

After narrowing down the data, the correlation (about 0.30) is even weaker than in the full data set (about 0.42).

cor.test(timetable.subset$Islno, timetable.subset$Distance, method = 'pearson')
## 
##  Pearson's product-moment correlation
## 
## data:  timetable.subset$Islno and timetable.subset$Distance
## t = 11.275, df = 1310, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.2472770 0.3459671
## sample estimates:
##       cor 
## 0.2974163
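
Part of this drop may come from the filtering itself: cutting off the upper range of both variables tends to lower a correlation coefficient even when the underlying relationship is unchanged (often called range restriction). The sketch below illustrates the effect on simulated data only. The sample size, the seed, and the population correlation of 0.4 are arbitrary assumptions, and normally distributed values are a simplification that the railway data does not follow.

```r
set.seed(42)  # arbitrary seed so the run is repeatable

n <- 5000
x <- rnorm(n)
y <- 0.4 * x + sqrt(1 - 0.4^2) * rnorm(n)  # population correlation of 0.4

r.full <- cor(x, y)

# Keep only the lower part of both variables, mimicking a filter on both columns
keep  <- x < 0 & y < 0
r.sub <- cor(x[keep], y[keep])

cat('Full sample:', round(r.full, 3), ' Restricted:', round(r.sub, 3), '\n')
```

The restricted correlation comes out lower than the full-sample one, so the weaker coefficient in the subset does not by itself show that short-distance trains behave differently.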

Analysis of trains from stations

Let’s look at the total distance covered by the trains originating at each station.

library(dplyr)

station.distance.summary <- timetable.srs.dest %>%
  group_by(Station.Name = factor(Source.Station.Name)) %>% 
  summarise(Total.Distance.Of.Trains = sum(Distance)) %>%
  arrange(desc(Total.Distance.Of.Trains))

We limit the observations to the top 20. Howrah station clearly stands out!

top.20 <- top_n(station.distance.summary, 20, Total.Distance.Of.Trains)
knitr::kable(top.20)
Station.Name       Total.Distance.Of.Trains
HOWRAH JN                            114943
LOKMANYATILAK T                       98012
NEW DELHI                             94948
H NIZAMUDDIN                          88231
YESVANTPUR JN                         80261
CHENNAI CENTRAL                       74115
JAMMU TAWI                            58600
MUMBAI CST                            54653
PURI                                  53674
TRIVANDRUM CNTL                       53641
AHMEDABAD JN                          50689
BANDRA TERMINUS                       49514
PUNE JN                               48117
AMRITSAR JN                           47672
GUWAHATI                              44001
SECUNDERABAD JN                       43651
BANGALORE CY JN                       42615
GORAKHPUR JN                          42292
AJMER JN                              40773
VISAKHAPATNAM                         36131

Now let’s count the trains originating at each station, again limiting the output to the top 20.

station.train.summary <- timetable.srs.dest %>%
  group_by(Station.Name = factor(Source.Station.Name)) %>% 
  summarise(Total.Trains = n()) %>%
  arrange(desc(Total.Trains))

top.20 <- top_n(station.train.summary, 20, Total.Trains)
knitr::kable(top.20)
Station.Name       Total.Trains
HOWRAH JN                   102
NEW DELHI                    85
CHENNAI CENTRAL              69
LOKMANYATILAK T              64
MUMBAI CST                   51
YESVANTPUR JN                50
H NIZAMUDDIN                 48
PUNE JN                      46
SECUNDERABAD JN              46
AHMEDABAD JN                 44
BANGALORE CY JN              41
PURI                         41
BANDRA TERMINUS              39
JAMMU TAWI                   37
DELHI                        36
VISAKHAPATNAM                36
AMRITSAR JN                  34
CHENNAI EGMORE               34
INDORE JN BG                 34
TIRUPATI                     33
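
The two rankings differ because total distance depends both on how many trains start at a station and on how far they go. LOKMANYATILAK T, for example, is second by total distance with 64 trains (about 1530 kms per train), while HOWRAH JN leads with 102 trains (about 1130 kms per train). A sketch of how both summaries and the mean distance per train could be computed in one pass, shown on made-up rows (the station names and distances are placeholders; in the real analysis the input would be timetable.srs.dest):

```r
library(dplyr)

# Hypothetical destination rows, for illustration only
dest.rows <- data.frame(
  Source.Station.Name = c("ALPHA", "ALPHA", "ALPHA", "BRAVO", "BRAVO"),
  Distance            = c(100, 200, 300, 1500, 500),
  stringsAsFactors = FALSE
)

station.summary <- dest.rows %>%
  group_by(Station.Name = Source.Station.Name) %>%
  summarise(Total.Trains   = n(),
            Total.Distance = sum(Distance)) %>%
  mutate(Mean.Distance = Total.Distance / Total.Trains) %>%
  arrange(desc(Total.Distance))

print(as.data.frame(station.summary))
```

In this toy table BRAVO ranks first by total distance despite having fewer trains, which is the same pattern that separates the two top-20 lists above.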

Analysis done using R 3.2.2.