Data source: The Open Government Data Platform of India
URL: data.gov.in
ISL stands for Intermediate Stoppage List and refers to the list of stations where the train stops (the origin and destination stations are counted in this list as well).
The columns are as follows:
Column | Explanation |
---|---|
Train No. | The train number |
Train Name | The train name |
Islno | The sequence number of the stop along that train's route |
Station Code | The station code |
Station Name | The station name |
Arrival Time | Arrival time (HH:MM:SS) |
Departure Time | Departure time (HH:MM:SS) |
Distance | Cumulative distance from the source station (km) |
Source Station Code | Source station code |
Source Station Name | Source station name |
Destination Station Code | Destination station code |
Destination Station Name | Destination station name |
The analysis is based on the timetable data published by the Indian Railways.
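The column headers in the table contain spaces and a trailing period, while the code refers to names such as `Train.No.` and `Station.Code`. This is because `read.csv()` by default passes headers through `make.names()`. A minimal sketch of that conversion, using a few of the headers listed above:

```r
# Headers as they appear in the column table
headers <- c("Train No.", "Station Code", "Source Station Name")

# read.csv() (with its default check.names = TRUE) applies make.names(),
# which replaces characters that are invalid in R names, such as spaces, with "."
r.names <- make.names(headers)
print(r.names)
```

Names produced this way are case-sensitive, so `Source.Station.Name` and `source.Station.Name` would refer to different columns.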
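# Load the stop-level timetable (one row per train per stop)
timetable <- read.csv('c:/rdata/isl_wise_train_details.csv', stringsAsFactors = FALSE)
# Keep only the row where each train reaches its destination: there, Islno is the
# total number of stops and Distance is the total distance of the route
timetable.srs.dest <- subset(timetable,
  Station.Code == Destination.Station.Code)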
cat('Unique train numbers in India:', sum(!is.na(timetable.srs.dest$Train.No.)))
## Unique train numbers in India: 2828
cat('Total distance covered by the trains (in kms):', sum(timetable.srs.dest$Distance))
## Total distance covered by the trains (in kms): 3034862
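To illustrate why filtering on the destination row yields one record per train, here is a small sketch on made-up data. The train numbers, station codes and distances below are hypothetical and only mimic the structure of the timetable. It also checks for duplicated train numbers, since counting rows equals counting trains only if each train reaches its destination code exactly once.

```r
# Hypothetical stop-level data: two trains with 3 and 2 stops
stops <- data.frame(
  Train.No. = c(101, 101, 101, 202, 202),
  Islno = c(1, 2, 3, 1, 2),
  Station.Code = c("AAA", "BBB", "CCC", "DDD", "EEE"),
  Distance = c(0, 120, 300, 0, 80),
  Destination.Station.Code = c("CCC", "CCC", "CCC", "EEE", "EEE"),
  stringsAsFactors = FALSE
)

# Keep the row where the current station is the train's destination
per.train <- subset(stops, Station.Code == Destination.Station.Code)
print(per.train)

# On that row, Islno is the stop count and Distance is the route length
total.distance <- sum(per.train$Distance)

# anyDuplicated() returns 0 when every train number appears only once
duplicates <- anyDuplicated(per.train$Train.No.)
```

If `anyDuplicated()` returned a non-zero value on the real data, `length(unique(...))` would be a safer way to count trains than counting rows.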
summary(timetable.srs.dest$Islno)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 13.00 20.00 24.17 31.00 120.00
library(ggplot2)
ggplot(data = timetable.srs.dest, aes(x = Islno)) +
geom_histogram(color = 'black', fill = 'firebrick', binwidth = 5) +
geom_freqpoly(binwidth = 5) +
xlab('Intermediate Stoppage List') + ylab('Count') +
ggtitle("Distribution of Intermediate Stoppage List (ISL)")
The longest route runs from Dibrugarh (Assam) to Kanyakumari (Tamil Nadu), covering a distance of 4273 kms!
summary(timetable$Distance)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0 173.0 432.0 647.8 931.0 4273.0
cat('Number of trains that travel 3000 kms or more:', sum(timetable.srs.dest$Distance >= 3000))
## Number of trains that travel 3000 kms or more: 62
ggplot(data = timetable.srs.dest, aes(x = Distance)) +
geom_histogram(color = 'black', fill = 'firebrick', binwidth = 200) +
geom_freqpoly(binwidth = 200) +
xlab('Distance travelled (kms)') + ylab('Count') +
ggtitle("Distribution of distances travelled (kms)")
ggplot(data = timetable.srs.dest,
  aes(x = Distance, y = Islno)) +
geom_point(color = 'firebrick', alpha = 1/2) +
xlab('Distance travelled (kms)') + ylab('Intermediate Stoppage List') +
ggtitle("Distance vs. ISL")
Is there a correlation between the ISL and distance travelled by the train?
cor.test(timetable.srs.dest$Islno, timetable.srs.dest$Distance, method = 'pearson')
##
## Pearson's product-moment correlation
##
## data: timetable.srs.dest$Islno and timetable.srs.dest$Distance
## t = 24.369, df = 2826, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.3857724 0.4467039
## sample estimates:
## cor
## 0.4167061
The test shows a statistically significant but only moderate correlation (r ≈ 0.42). The number of stops seems to depend more on the route of the train than on the distance. We would expect the most stops on routes between 3000 and 4000 kms, but routes between 1000 and 2000 kms show rather high numbers of stops.
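As a rough check on how the reported figures relate to each other, the t statistic of a Pearson test can be recomputed from the correlation and the degrees of freedom as t = r * sqrt(df) / sqrt(1 - r^2). The sketch below plugs in the values printed by `cor.test()` in this analysis; the helper name `cor.to.t` is made up for illustration.

```r
# Hypothetical helper: t statistic of a Pearson correlation test
cor.to.t <- function(r, df) {
  r * sqrt(df) / sqrt(1 - r^2)
}

# Full data set: r and df as reported by cor.test()
t.full <- cor.to.t(0.4167061, 2826)

# Subset with Distance < 2000 kms and ISL < 20
t.subset <- cor.to.t(0.2974163, 1310)

# Share of the variance in stop counts that a linear fit on distance would explain
r2.full <- 0.4167061^2
r2.subset <- 0.2974163^2

cat('t (full):', round(t.full, 3), ' t (subset):', round(t.subset, 3), '\n')
cat('r^2 (full):', round(r2.full, 3), ' r^2 (subset):', round(r2.subset, 3), '\n')
```

The tiny p-values mostly reflect the large number of trains; the squared correlations give a better sense of how little of the variation in stops is tied to distance.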
Narrowing the data down to distances < 2000 kms and ISL < 20, we notice something interesting: some trains stop more often (like mail trains) while others don't (like superfast trains).
timetable.subset <- subset(timetable.srs.dest,
  Distance < 2000 & Islno < 20)
ggplot(data = timetable.subset,
  aes(x = Distance, y = Islno)) +
geom_point(color = 'firebrick', alpha = 1/2) +
xlab('Distance travelled (kms)') + ylab('Intermediate Stoppage List') +
ggtitle("Distance (< 2000 kms) vs. ISL (< 20)")
After narrowing down the data, the correlation (r ≈ 0.30) is even weaker than on the full data set (r ≈ 0.42).
cor.test(timetable.subset$Islno, timetable.subset$Distance, method = 'pearson')
##
## Pearson's product-moment correlation
##
## data: timetable.subset$Islno and timetable.subset$Distance
## t = 11.275, df = 1310, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.2472770 0.3459671
## sample estimates:
## cor
## 0.2974163
Let’s look at the total distance travelled by the trains originating from each station.
library(dplyr)
station.distance.summary <- group_by(timetable.srs.dest,
Station.Name = factor(timetable.srs.dest$Source.Station.Name)) %>%
summarise(Total.Distance.Of.Trains = sum(Distance)) %>%
arrange(desc(Total.Distance.Of.Trains))
We limit the observations to the top 20. Howrah station definitely stands out!
top.20 <- top_n(station.distance.summary, 20, Total.Distance.Of.Trains)
knitr::kable(top.20)
Station.Name | Total.Distance.Of.Trains |
---|---|
HOWRAH JN | 114943 |
LOKMANYATILAK T | 98012 |
NEW DELHI | 94948 |
H NIZAMUDDIN | 88231 |
YESVANTPUR JN | 80261 |
CHENNAI CENTRAL | 74115 |
JAMMU TAWI | 58600 |
MUMBAI CST | 54653 |
PURI | 53674 |
TRIVANDRUM CNTL | 53641 |
AHMEDABAD JN | 50689 |
BANDRA TERMINUS | 49514 |
PUNE JN | 48117 |
AMRITSAR JN | 47672 |
GUWAHATI | 44001 |
SECUNDERABAD JN | 43651 |
BANGALORE CY JN | 42615 |
GORAKHPUR JN | 42292 |
AJMER JN | 40773 |
VISAKHAPATNAM | 36131 |
Let’s look at the number of trains originating from each station, again limited to the top 20.
station.train.summary <- group_by(timetable.srs.dest,
Station.Name = factor(timetable.srs.dest$Source.Station.Name)) %>%
summarise(Total.Trains = n()) %>%
arrange(desc(Total.Trains))
top.20 <- top_n(station.train.summary, 20, Total.Trains)
knitr::kable(top.20)
Station.Name | Total.Trains |
---|---|
HOWRAH JN | 102 |
NEW DELHI | 85 |
CHENNAI CENTRAL | 69 |
LOKMANYATILAK T | 64 |
MUMBAI CST | 51 |
YESVANTPUR JN | 50 |
H NIZAMUDDIN | 48 |
PUNE JN | 46 |
SECUNDERABAD JN | 46 |
AHMEDABAD JN | 44 |
BANGALORE CY JN | 41 |
PURI | 41 |
BANDRA TERMINUS | 39 |
JAMMU TAWI | 37 |
DELHI | 36 |
VISAKHAPATNAM | 36 |
AMRITSAR JN | 34 |
CHENNAI EGMORE | 34 |
INDORE JN BG | 34 |
TIRUPATI | 33 |
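One detail worth keeping in mind when reading these top 20 tables: `top_n()` ranks rows with `min_rank()` and keeps every row tied at the cutoff, so it can return more than the requested number of rows. The base R sketch below mimics that rule on made-up station counts (the stations and numbers are hypothetical).

```r
# Toy data (hypothetical stations, not from the timetable)
trains <- data.frame(
  Station.Name = c("A", "B", "C", "D"),
  Total.Trains = c(5, 3, 3, 1),
  stringsAsFactors = FALSE
)

n <- 2

# rank(ties.method = "min") on the negated counts mirrors min_rank(desc(x)):
# tied values all receive the smallest rank of their group
rk <- rank(-trains$Total.Trains, ties.method = "min")

# Keeping ranks <= n retains both stations tied for second place
top <- trains[rk <= n, ]
print(top)
```

If exactly 20 rows are needed regardless of ties, taking `head(station.train.summary, 20)` on the already sorted summary is one alternative.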
Analysis done using R 3.2.2.