Data source: The Open Government Data Platform of India
URL: data.gov.in

ISL stands for Intermediate Stoppage List and refers to the list of stations where the train stops (the origin and destination stations are counted in this list as well).

The columns are as follows:

| Column | Explanation |
|---|---|
| Train No. | The train number |
| Train Name | The train name |
| Islno | The stoppage number for that train number |
| Station Code | The station code |
| Station Name | The station name |
| Arrival Time | Arrival time (HH:MM:SS) |
| Departure Time | Departure time (HH:MM:SS) |
| Distance | Distance (km) |
| Source Station Code | Source station code |
| Source Station Name | Source station name |
| Destination Station Code | Destination station code |
| Destination Station Name | Destination station name |

### Area of study

• For each train, look at the distribution of the number of stops during the journey.
• Look at the distribution of the distances travelled by trains.
• Check whether there is any correlation between the distance travelled and the number of stops for a train.
• Analyse the trains from each station:
  • How many trains leave from each station?
  • What is the total distance travelled by trains from that station?

The analysis is based on the timetable data published by the Indian Railways.

timetable <- read.csv('c:/rdata/isl_wise_train_details.csv', stringsAsFactors = FALSE)

### General Information

timetable.srs.dest <- subset(timetable,
timetable$Station.Code == timetable$Destination.Station.Code)

cat('Unique train numbers in India:', sum(!is.na(timetable.srs.dest$Train.No.)))
## Unique train numbers in India: 2828
cat('Total distance covered by the trains (in kms):', sum(timetable.srs.dest$Distance))
## Total distance covered by the trains (in kms): 3034862
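
The filter above keeps only the row where a train reaches its destination. A minimal sketch of why that yields one row per train, assuming (as the analysis does) that `Distance` is cumulative from the source station. The `toy` data frame and its values are invented for illustration and are not part of the published timetable:

```r
# Toy timetable with two hypothetical trains (values invented for illustration).
toy <- data.frame(
  Train.No. = c(101, 101, 101, 202, 202),
  Islno = c(1, 2, 3, 1, 2),
  Station.Code = c("AAA", "BBB", "CCC", "DDD", "EEE"),
  Distance = c(0, 120, 300, 0, 85),
  Destination.Station.Code = c("CCC", "CCC", "CCC", "EEE", "EEE"),
  stringsAsFactors = FALSE
)

# Keep only the row where each train reaches its destination.
toy.dest <- subset(toy, Station.Code == Destination.Station.Code)

# One row per train: Islno is the stop count, Distance the journey length
# (under the assumption that Distance accumulates along the route).
print(toy.dest[, c("Train.No.", "Islno", "Distance")])
```

On the destination row, `Islno` doubles as the total number of stops and `Distance` as the total distance, which is what the summaries in the following sections rely on.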

### Intermediate Stoppage List

summary(timetable.srs.dest$Islno)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##    1.00   13.00   20.00   24.17   31.00  120.00

library(ggplot2)
ggplot(data = timetable.srs.dest, aes(x = Islno)) +
geom_histogram(color = 'black', fill = 'firebrick', binwidth = 5) +
geom_freqpoly(binwidth = 5) +
xlab('Intermediate Stoppage List') + ylab('Count') +
ggtitle("Distribution of Intermediate Stoppage List (ISL)")

### Distance travelled by each train

The longest train route runs from Dibrugarh (Assam) to Kanyakumari (Tamil Nadu), covering a distance of 4273 km!

summary(timetable$Distance)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##     0.0   173.0   432.0   647.8   931.0  4273.0

cat('Number of trains that travels over 3000 kms:', sum(timetable.srs.dest$Distance >= 3000))
## Number of trains that travels over 3000 kms: 62
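
The area-of-study question about whether distance and number of stops are correlated can be checked with a Pearson correlation. A minimal sketch on made-up numbers (the `stops` and `distance` vectors are hypothetical, not taken from the timetable), showing that the estimate reported by `cor.test` is the covariance divided by the product of the standard deviations:

```r
# Hypothetical per-train values, invented for illustration only.
stops <- c(5, 12, 20, 31, 44)
distance <- c(150, 420, 610, 1300, 2100)

# Pearson correlation computed from its definition:
# covariance divided by the product of the standard deviations.
r.manual <- cov(stops, distance) / (sd(stops) * sd(distance))

# The built-in test reports the same estimate, plus a confidence interval.
result <- cor.test(stops, distance, method = 'pearson')

print(r.manual)
print(unname(result$estimate))
```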
timetable.srs.dest$Islno < 20)

ggplot(data = timetable.subset, aes(x = Distance, y = Islno)) +
geom_point(color = 'firebrick', alpha = 1/2) +
xlab('Distance travelled (kms)') + ylab('Intermediate Stoppage List') +
ggtitle("Distance < 2000 kms) vs. ISL < 20")

After narrowing down the data, the correlation is even weaker than on the full data set.

cor.test(timetable.subset$Islno, timetable.subset$Distance, method = 'pearson')
##
##  Pearson's product-moment correlation
##
## data:  timetable.subset$Islno and timetable.subset$Distance
## t = 11.275, df = 1310, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.2472770 0.3459671
## sample estimates:
##       cor
## 0.2974163

### Analysis of trains from stations

Let’s see the total distance travelled from stations.

library(dplyr)
station.distance.summary <- group_by(timetable.srs.dest,
Station.Name = factor(timetable.srs.dest$Source.Station.Name)) %>%
summarise(Total.Distance.Of.Trains = sum(Distance)) %>%
arrange(desc(Total.Distance.Of.Trains))

We limit the observations to the top 20. Howrah station clearly stands out!

top.20 <- top_n(station.distance.summary, 20, Total.Distance.Of.Trains)
knitr::kable(top.20)

| Station.Name | Total.Distance.Of.Trains |
|---|---|
| HOWRAH JN | 114943 |
| LOKMANYATILAK T | 98012 |
| NEW DELHI | 94948 |
| H NIZAMUDDIN | 88231 |
| YESVANTPUR JN | 80261 |
| CHENNAI CENTRAL | 74115 |
| JAMMU TAWI | 58600 |
| MUMBAI CST | 54653 |
| PURI | 53674 |
| TRIVANDRUM CNTL | 53641 |
| BANDRA TERMINUS | 49514 |
| PUNE JN | 48117 |
| AMRITSAR JN | 47672 |
| GUWAHATI | 44001 |
| BANGALORE CY JN | 42615 |
| GORAKHPUR JN | 42292 |
| AJMER JN | 40773 |
| VISAKHAPATNAM | 36131 |
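
For readers without dplyr, the same "sum per source station, sort descending, keep the top rows" idea can be sketched in base R. This is an illustrative alternative, not the method used above; the `toy.dest` data frame, its station names, and its distances are invented:

```r
# Hypothetical one-row-per-train data (values invented for illustration).
toy.dest <- data.frame(
  Source.Station.Name = c("ALPHA", "BETA", "ALPHA", "GAMMA", "BETA", "ALPHA"),
  Distance = c(300, 85, 1200, 40, 415, 500),
  stringsAsFactors = FALSE
)

# Sum the distances of all trains leaving from each source station.
totals <- aggregate(Distance ~ Source.Station.Name, data = toy.dest, FUN = sum)

# Sort by total distance, largest first, and keep the top 2 stations.
totals <- totals[order(-totals$Distance), ]
top.2 <- head(totals, 2)
print(top.2)
```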

Let’s see the number of trains from each station, again limiting it to the top 20.

station.train.summary <- group_by(timetable.srs.dest,
Station.Name = factor(timetable.srs.dest$Source.Station.Name)) %>%
summarise(Total.Trains = length(Train.No.)) %>%
arrange(desc(Total.Trains))

top.20 <- top_n(station.train.summary, 20, Total.Trains)
knitr::kable(top.20)

| Station.Name | Total.Trains |
|---|---|
| HOWRAH JN | 102 |
| NEW DELHI | 85 |
| CHENNAI CENTRAL | 69 |
| LOKMANYATILAK T | 64 |
| MUMBAI CST | 51 |
| YESVANTPUR JN | 50 |
| H NIZAMUDDIN | 48 |
| PUNE JN | 46 |