**Data source: The Open Government Data Platform of India**

URL: data.gov.in

ISL stands for *Intermediate Stoppage List* and refers to the list of stations where the train stops (the origin and destination stations are counted in this list as well).

*The columns are as follows:*

Column | Explanation |
---|---|
Train No. | The train number |
Train Name | The train name |
Islno | The stoppage number for that train number |
Station Code | The station code |
Station Name | The station name |
Arrival Time | Arrival time (HH:MM:SS) |
Departure Time | Departure time (HH:MM:SS) |
Distance | Distance (kms) |
Source Station Code | Source station code |
Source Station Name | Source station name |
Destination Station Code | Destination station code |
Destination Station Name | Destination station name |
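The code in this post refers to columns as `Train.No.`, `Station.Code` and so on, rather than by the headings listed above. That is because `read.csv` by default passes the headers through `make.names()`, which replaces spaces with dots. A minimal sketch of that behaviour, using made-up rows and only a few of the columns (the values are not from the real data set):

```r
# Toy CSV text with the same header style as the data set (values are made up).
csv_text <- "Train No.,Train Name,Islno,Station Code,Distance
101,DEMO EXP,1,AAA,0
101,DEMO EXP,2,BBB,120"

# check.names = TRUE is the default, so spaces in headers become dots.
toy <- read.csv(text = csv_text, stringsAsFactors = FALSE)
print(names(toy))
```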

*The analysis looks at the following:*

- The distribution of the number of stops a train makes during its journey.
- The distribution of the distances travelled by trains.
- Whether there is any correlation between the distance travelled and the number of stops for a train.
- Analysis of trains from each station:
  - How many trains leave from each station?
  - What is the total distance travelled by trains from that station?

**The analysis is based on the timetable data published by the Indian Railways.**

`timetable <- read.csv('c:/rdata/isl_wise_train_details.csv', stringsAsFactors = FALSE)`

```
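# Keep only the rows where the stop is the train's destination station.
timetable.srs.dest <- subset(timetable,
                             Station.Code == Destination.Station.Code)
# Counting these rows counts trains, assuming one destination row per train.
cat('Unique train numbers in India:', sum(!is.na(timetable.srs.dest$Train.No.)))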
```

`## Unique train numbers in India: 2828`
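The figure above is a count of destination rows, which matches the number of distinct trains only if every train number has exactly one such row. A minimal sketch of how the two counts could be compared, on a made-up data frame (the `toy` data and the variable names `n_dest_rows` and `n_distinct_trains` are illustrative, not part of the original analysis):

```r
# Made-up timetable: train 101 has three stops, train 202 has two.
toy <- data.frame(
  Train.No. = c(101, 101, 101, 202, 202),
  Station.Code = c("AAA", "BBB", "CCC", "DDD", "EEE"),
  Destination.Station.Code = c("CCC", "CCC", "CCC", "EEE", "EEE"),
  stringsAsFactors = FALSE
)

# Rows where the stop is the train's destination.
dest_rows <- subset(toy, Station.Code == Destination.Station.Code)
n_dest_rows <- nrow(dest_rows)

# Distinct train numbers over the whole table.
n_distinct_trains <- length(unique(toy$Train.No.))

cat("Destination rows:", n_dest_rows, "\n")
cat("Distinct train numbers:", n_distinct_trains, "\n")
```

If the two numbers differ on the real data, some train numbers either have no destination row or have more than one.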

`cat('Total distance covered by the trains (in kms):', sum(timetable.srs.dest$Distance))`

`## Total distance covered by the trains (in kms): 3034862`

`summary(timetable.srs.dest$Islno)`

```
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 13.00 20.00 24.17 31.00 120.00
```

```
library(ggplot2)
# Histogram of the number of stops per train, in bins of 5 stops.
ggplot(data = timetable.srs.dest, aes(x = Islno)) +
  geom_histogram(color = 'black', fill = 'firebrick', binwidth = 5) +
  geom_freqpoly(binwidth = 5) +
  xlab('Intermediate Stoppage List') + ylab('Count') +
  ggtitle("Distribution of Intermediate Stoppage List (ISL)")
```

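The longest journey is from Dibrugarh (Assam) to Kanyakumari (Tamil Nadu), covering a distance of 4273 kms! Note that the summary below is computed over every row of `timetable` (one row per stop) rather than one row per train, so the minimum of 0 presumably comes from the rows for origin stations.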

`summary(timetable$Distance)`

```
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0 173.0 432.0 647.8 931.0 4273.0
```

`cat('Number of trains that travels over 3000 kms:', sum(timetable.srs.dest$Distance >= 3000))`

`## Number of trains that travels over 3000 kms: 62`

```
# Histogram of the total distance per train, in bins of 200 kms.
ggplot(data = timetable.srs.dest, aes(x = Distance)) +
  geom_histogram(color = 'black', fill = 'firebrick', binwidth = 200) +
  geom_freqpoly(binwidth = 200) +
  xlab('Distance travelled (kms)') + ylab('Count') +
  ggtitle("Distribution of distances travelled (kms)")
```

```
# Scatter plot of total distance against number of stops, one point per train.
ggplot(data = timetable.srs.dest, aes(x = Distance, y = Islno)) +
  geom_point(color = 'firebrick', alpha = 1/2) +
  xlab('Distance travelled (kms)') + ylab('Intermediate Stoppage List') +
  ggtitle("Distance vs. ISL")
```

**Is there a correlation between the ISL and distance travelled by the train?**

`cor.test(timetable.srs.dest$Islno, timetable.srs.dest$Distance, method = 'pearson')`

```
##
## Pearson's product-moment correlation
##
## data: timetable.srs.dest$Islno and timetable.srs.dest$Distance
## t = 24.369, df = 2826, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.3857724 0.4467039
## sample estimates:
## cor
## 0.4167061
```

The test shows a statistically significant but only moderate positive correlation (r ≈ 0.42). The ISL seems to depend more on the route of the train than on the distance. We would expect the highest number of stops for distances between 3000 kms and 4000 kms, but trains covering between 1000 kms and 2000 kms have a rather high number of stops.
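To make the reported statistic more concrete, here is a minimal sketch of what Pearson's r measures: the covariance of the two variables divided by the product of their standard deviations. The `distance` and `stops` vectors are made-up values for illustration only, not data from the timetable.

```r
# Made-up values, for illustration only.
distance <- c(100, 250, 400, 800, 1500)
stops <- c(5, 12, 9, 20, 18)

# Pearson's r: covariance divided by the product of the standard deviations.
r_manual <- cov(distance, stops) / (sd(distance) * sd(stops))

# The built-in function gives the same value.
r_builtin <- cor(distance, stops, method = "pearson")

# Squaring r gives the share of variance explained by a straight-line fit.
r_squared <- r_builtin^2

cat("r (manual):", r_manual, "\n")
cat("r (cor):   ", r_builtin, "\n")
cat("r squared: ", r_squared, "\n")
```

Applied to the value reported above, 0.4167 squared is roughly 0.17, so a linear fit on distance would account for about 17% of the variation in the number of stops.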

Narrowing the data down to Distance < 2000 kms and ISL < 20, we notice something interesting: some trains stop more often (like mail trains) while others do not (like superfast trains).

```
# Keep trains that travel under 2000 kms and make fewer than 20 stops.
timetable.subset <- subset(timetable.srs.dest,
                           Distance < 2000 & Islno < 20)
ggplot(data = timetable.subset, aes(x = Distance, y = Islno)) +
  geom_point(color = 'firebrick', alpha = 1/2) +
  xlab('Distance travelled (kms)') + ylab('Intermediate Stoppage List') +
  ggtitle("Distance (< 2000 kms) vs. ISL (< 20)")
```

After narrowing down the data, the correlation (about 0.30) is even lower than that of the full data set (about 0.42).

`cor.test(timetable.subset$Islno, timetable.subset$Distance, method = 'pearson')`

```
##
## Pearson's product-moment correlation
##
## data: timetable.subset$Islno and timetable.subset$Distance
## t = 11.275, df = 1310, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.2472770 0.3459671
## sample estimates:
## cor
## 0.2974163
```
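One way to put a number on the difference between frequently stopping and rarely stopping trains is a stop density, here expressed as stops per 100 kms. This is a sketch on made-up rows; the `Stops.Per.100km` column is a hypothetical name introduced for illustration and does not exist in the data set.

```r
# Made-up trains: two cover the same distance with very different stop counts.
toy <- data.frame(
  Train.No. = c(101, 202, 303),
  Distance = c(500, 500, 1200),
  Islno = c(19, 4, 12)
)

# Stops per 100 kms; a real run would first need to drop rows with Distance == 0.
toy$Stops.Per.100km <- toy$Islno / toy$Distance * 100

# Sort from the most to the least frequently stopping train.
toy <- toy[order(-toy$Stops.Per.100km), ]
print(toy)
```

On the real data, the same column could be added to `timetable.subset` to see which trains sit at either end of the range.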

Let’s see the total distance travelled from stations.

```
library(dplyr)
# Sum the total distance of all trains starting from each source station.
station.distance.summary <- timetable.srs.dest %>%
  group_by(Station.Name = factor(Source.Station.Name)) %>%
  summarise(Total.Distance.Of.Trains = sum(Distance)) %>%
  arrange(desc(Total.Distance.Of.Trains))
```

We limit the observations to the top 20. Howrah station definitely stands out!

```
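# Rank by total distance explicitly instead of relying on the last column.
top.20 <- top_n(station.distance.summary, 20, Total.Distance.Of.Trains)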
knitr::kable(top.20)
```

Station.Name | Total.Distance.Of.Trains |
---|---|
HOWRAH JN | 114943 |
LOKMANYATILAK T | 98012 |
NEW DELHI | 94948 |
H NIZAMUDDIN | 88231 |
YESVANTPUR JN | 80261 |
CHENNAI CENTRAL | 74115 |
JAMMU TAWI | 58600 |
MUMBAI CST | 54653 |
PURI | 53674 |
TRIVANDRUM CNTL | 53641 |
AHMEDABAD JN | 50689 |
BANDRA TERMINUS | 49514 |
PUNE JN | 48117 |
AMRITSAR JN | 47672 |
GUWAHATI | 44001 |
SECUNDERABAD JN | 43651 |
BANGALORE CY JN | 42615 |
GORAKHPUR JN | 42292 |
AJMER JN | 40773 |
VISAKHAPATNAM | 36131 |
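The group, sum, sort and truncate steps behind this table can also be written in base R, which may help readers who do not use `dplyr`. This is a sketch on made-up data using `aggregate()`, `order()` and `head()` in place of the `dplyr` verbs; the station codes and distances are invented.

```r
# Made-up destination rows: one row per train, with its source station.
toy <- data.frame(
  Source.Station.Name = c("AAA", "AAA", "BBB", "CCC", "CCC", "CCC"),
  Distance = c(100, 300, 250, 50, 60, 70),
  stringsAsFactors = FALSE
)

# Sum the distance of all trains per source station.
totals <- aggregate(Distance ~ Source.Station.Name, data = toy, FUN = sum)
names(totals) <- c("Station.Name", "Total.Distance.Of.Trains")

# Sort in descending order of total distance and keep the top 2.
totals <- totals[order(-totals$Total.Distance.Of.Trains), ]
top.2 <- head(totals, 2)
print(top.2)
```

Unlike `top_n()`, `head()` cuts at exactly the requested number of rows and does not keep extra rows that tie with the last one.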

Let’s see the number of trains from each station, again limiting it to the top 20.

```
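# Count the trains starting from each source station (one row per train).
station.train.summary <- timetable.srs.dest %>%
  group_by(Station.Name = factor(Source.Station.Name)) %>%
  summarise(Total.Trains = n()) %>%
  arrange(desc(Total.Trains))
# Rank by the train count explicitly instead of relying on the last column.
top.20 <- top_n(station.train.summary, 20, Total.Trains)
knitr::kable(top.20)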
```

Station.Name | Total.Trains |
---|---|
HOWRAH JN | 102 |
NEW DELHI | 85 |
CHENNAI CENTRAL | 69 |
LOKMANYATILAK T | 64 |
MUMBAI CST | 51 |
YESVANTPUR JN | 50 |
H NIZAMUDDIN | 48 |
PUNE JN | 46 |
SECUNDERABAD JN | 46 |
AHMEDABAD JN | 44 |
BANGALORE CY JN | 41 |
PURI | 41 |
BANDRA TERMINUS | 39 |
JAMMU TAWI | 37 |
DELHI | 36 |
VISAKHAPATNAM | 36 |
AMRITSAR JN | 34 |
CHENNAI EGMORE | 34 |
INDORE JN BG | 34 |
TIRUPATI | 33 |

*Analysis done using R 3.2.2*