Mapping out the train routes in India

I had nothing better to do on a Sunday morning, so I made a map. If you have been following my blog, you will know that I am trying to make sense of the Indian Railways: why the trains run late, and whether there's any way we can learn about the cause of the delays based on patterns in the delay data.

Towards that goal, I wrote two blog posts that described how and from where I was collecting my data, along with a first look at the data I was gathering. Even from that preliminary look, it can be seen that specific routes/stations are causing delays along a train's route. And the delays occurred at the same station on multiple runs, meaning that it wasn't simply a one-time thing.

Moving on, another way to look at the problem is to understand how crowded the railway lines are. By mapping all train movement in India, we will be able to understand how crowded specific routes are and whether they're crowded at a specific time of the day. By adding delay information from multiple trains to the map, we will also be able to identify, with good certainty, the routes that are crowded and are leading to large delays.
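A minimal sketch of how "crowdedness" could be counted, assuming each train's route is available as an ordered list of station codes. The function name, the station codes and the train labels are all made up for illustration. Note that consecutive stops in a timetable are only a rough stand-in for physical track segments.

```python
from collections import Counter


def count_segment_usage(routes):
    """Count how many trains traverse each station-to-station segment.

    `routes` maps a train label to its ordered list of station codes.
    Direction is ignored, and a segment counts once per train.
    """
    usage = Counter()
    for stations in routes.values():
        # Consecutive stops form a segment; sorting makes A-B equal to B-A.
        segments = {tuple(sorted(pair)) for pair in zip(stations, stations[1:])}
        usage.update(segments)
    return usage


# Hypothetical routes, not real trains or station codes.
example_routes = {
    "train_a": ["AAA", "BBB", "CCC"],
    "train_b": ["CCC", "BBB", "DDD"],
    "train_c": ["AAA", "BBB", "CCC"],
}

segment_usage = count_segment_usage(example_routes)
busiest = segment_usage.most_common(1)[0]
print(busiest)
```

The same counts could later be split by hour of day, or used to scale line widths on the map.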

I took a small step in that direction today by mapping out the routes of a few trains. Below you will find one such map/image, where the red lines correspond to the routes of a few trains. Note that this isn't the complete roster of trains run by the Indian Railways; it is only a very small subset. But the process by which such a map can be made is scalable, and the small subset is enough to make this proof of concept.



While this image isn't interactive, running the code creates an interactive map, which displays station codes when the user hovers over the stations along a route. I modified an available example that produced a USA flight paths map using plotly. The code can be found below.



import glob
import plotly.plotly as py  # plotly < 4; later releases moved this module to chart_studio.plotly
import pandas as pd


files = glob.glob('routes/*')

df_stations = pd.DataFrame(columns=['station', 'lat', 'long'])
df_station_paths = pd.DataFrame(columns=['start_lon', 'start_lat', 'end_lon', 'end_lat'])

for file in files:
    # Each file lists one train's stops in order: station code, lat, long.
    # Zero coordinates (0.0) are read as missing, and those stops are dropped.
    df_stations_temp = pd.read_csv(file,
                                   names=['station', 'lat', 'long'],
                                   na_values=[0.0])
    df_stations_temp = df_stations_temp.dropna(axis=0, how='any')
    # Pair each stop with the next one to get the route's line segments.
    lon = df_stations_temp['long'].values
    lat = df_stations_temp['lat'].values
    df_station_paths_temp = pd.DataFrame({'start_lon': lon[:-1],
                                          'start_lat': lat[:-1],
                                          'end_lon': lon[1:],
                                          'end_lat': lat[1:]})
    df_stations = pd.concat([df_stations, df_stations_temp], ignore_index=True)
    df_station_paths = pd.concat([df_station_paths, df_station_paths_temp], ignore_index=True)

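# One marker per station; hovering over a marker shows the station code.
# locationmode is omitted: 'India' is not a valid value, and it is not
# needed when explicit lon/lat coordinates are given.
stations = [ dict(
        type = 'scattergeo',
        lon = df_stations['long'],
        lat = df_stations['lat'],
        hoverinfo = 'text',
        text = df_stations['station'],
        mode = 'markers',
        marker = dict( 
            size=2, 
            color='rgb(255, 0, 0)',
            line = dict(
                width=3,
                color='rgba(68, 68, 68, 0)'
            )
        ))]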
        
# One line trace per segment between consecutive stations.
# locationmode is omitted here too, for the same reason as above.
station_paths = []
for i in range( len( df_station_paths ) ):
    station_paths.append(
        dict(
            type = 'scattergeo',
            lon = [ df_station_paths['start_lon'][i], df_station_paths['end_lon'][i] ],
            lat = [ df_station_paths['start_lat'][i], df_station_paths['end_lat'][i] ],
            mode = 'lines',
            line = dict(
                width = 1,
                color = 'red',
            ),
            opacity = 1.,
        )
    )
    
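layout = dict(
        showlegend = False, 
        height=1000,
        geo = dict(
            # plotly has no 'India' scope; 'asia' is the closest built-in one.
            scope='asia',
            # Approximate bounding box around India, used to zoom the map in.
            lonaxis=dict( range=[68, 98] ),
            lataxis=dict( range=[6, 38] ),
            projection=dict( type='azimuthal equal area' ),
            showland = True,
            landcolor = 'rgb(243, 243, 243)',
            countrycolor = 'rgb(204, 204, 204)',
        ),
    )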

fig = dict( data=station_paths+stations, layout=layout )
py.iplot( fig, filename='d3-station-paths' )


To briefly go over the code, the route files for individual trains were stored in `routes/train_number.csv`, and each file contained three columns: station code along the route, lat, long. Note that the locations and station codes along the route of a specific train were acquired using RailwayAPI. From each file, the above code first creates a Pandas DataFrame of stations, which is then used to create a second DataFrame containing the train's path/route as pairs of consecutive stations. These two DataFrames are finally converted into traces and passed on to plotly, which creates the above map.

The map is far from perfect. For starters, like I mentioned earlier, it shows only a small subset of all trains available. Secondly, the lat/long data seems to be faulty, because there are stray lines in the map that deviate from a train's actual route.
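One possible way to catch such stray points before plotting is sketched below. It assumes a stop is suspect when it lies far from both of its neighbours while those neighbours are close to each other. The 500 km threshold, the function names and the coordinates are assumptions for illustration, and the first and last stops of a route are not checked.

```python
import math

EARTH_RADIUS_KM = 6371.0


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two lat/long points, in kilometres."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def flag_suspect_stops(stops, max_hop_km=500.0):
    """Return codes of interior stops that look misplaced.

    `stops` is an ordered list of (station_code, lat, long) tuples.
    """
    suspects = []
    for prev, cur, nxt in zip(stops, stops[1:], stops[2:]):
        hop_in = haversine_km(prev[1], prev[2], cur[1], cur[2])
        hop_out = haversine_km(cur[1], cur[2], nxt[1], nxt[2])
        direct = haversine_km(prev[1], prev[2], nxt[1], nxt[2])
        # Far from both neighbours, while the neighbours are near each other.
        if hop_in > max_hop_km and hop_out > max_hop_km and direct <= max_hop_km:
            suspects.append(cur[0])
    return suspects


# Made-up stops; "CCC" has deliberately wrong coordinates.
example_stops = [
    ("AAA", 12.0, 77.0),
    ("BBB", 12.5, 77.5),
    ("CCC", 28.0, 60.0),
    ("DDD", 13.0, 78.0),
]
suspect_codes = flag_suspect_stops(example_stops)
print(suspect_codes)
```

Flagged stops could then be dropped or looked up in a second data source before the route is drawn.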

I am trying to look for a better source of information than RailwayAPI. I am trying to get a list of all trains run by Indian Railways. I am trying to find an official source of information, from the Indian Railways. I am trying to find an easier way to make such a map, and make it interactive. If there's something I can do to make the map/processing better, point it out to me! I'd love to hear comments/feedback.
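On the "easier way" front, one simplification worth trying is to draw every route in a single trace instead of one trace per segment, since plotly breaks a line wherever a coordinate is `None`. The sketch below only builds the trace dictionary; the helper name and the sample coordinates are made up, and I have not benchmarked it against the code above.

```python
def routes_to_single_trace(routes):
    """Build one scattergeo trace dict covering several routes.

    `routes` is a list of routes, each an ordered list of
    (station_code, lat, long) tuples. A None entry after each route
    stops plotly from joining the end of one route to the next.
    """
    lons, lats, labels = [], [], []
    for stops in routes:
        for code, lat, lon in stops:
            lons.append(lon)
            lats.append(lat)
            labels.append(code)
        # Gap marker between routes.
        lons.append(None)
        lats.append(None)
        labels.append(None)
    return dict(
        type='scattergeo',
        lon=lons,
        lat=lats,
        text=labels,
        hoverinfo='text',
        mode='lines+markers',
        line=dict(width=1, color='red'),
        marker=dict(size=2, color='red'),
    )


# Hypothetical routes with made-up coordinates.
example_routes = [
    [("AAA", 12.0, 77.0), ("BBB", 12.5, 77.5)],
    [("CCC", 20.0, 80.0), ("DDD", 20.5, 80.5)],
]
trace = routes_to_single_trace(example_routes)
print(trace['lon'])
```

The resulting dictionary can go into the `data` list of the figure in place of the separate station and path traces.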

Until the next time ...
