Bus, from the Latin omnibus, meaning “to or for, by, with or from everyone”

This is an R Markdown Notebook. Like the other Notebooks in this package, it contains a self-learning data science tutorial. This series of tutorials is based around public transport data. We will work with data from Southeast Queensland, Australia, but you can probably run many of the the examples on data from your city.

We will start with ‘static GTFS data’. GTFS is a standard format specification for public transit timetables and routes. The SE Queensland data are available from Translink. They are also in the file ../static_data/SEQ_GTFS.zip. The data are in a series of CSV files (but with .txt extensions)

Reading in the stop locations

First, we unzip the supplied file into a temporary directory. The temporary direction will be deleted when you quit R.

staticGTFS<-tempdir()
unzip("../static_data/SEQ_GTFS.zip",exdir=staticGTFS) #FIXME will use package.file() eventually
list.files(staticGTFS)
## [1] "agency.txt"         "calendar_dates.txt" "calendar.txt"      
## [4] "feed_info.txt"      "routes.txt"         "shapes.txt"        
## [7] "stop_times.txt"     "stops.txt"          "trips.txt"
stops<-read.csv(paste(staticGTFS,"stops.txt",sep="/"))
summary(stops)
##     stop_id        stop_code     
##  1      :    1   Min.   :     1  
##  10     :    1   1st Qu.:  4577  
##  100    :    1   Median :300210  
##  1000   :    1   Mean   :171094  
##  10000  :    1   3rd Qu.:311491  
##  10001  :    1   Max.   :600854  
##  (Other):12852   NA's   :197     
##                                       stop_name     stop_desc     
##  Beams Rd at St Flannans                   :    4   Mode:logical  
##  Cleveland Redland Bay Rd near Beveridge Rd:    4   NA's:12858    
##  Grieve Rd near Pillinger Rd               :    4                 
##  Miles Platting Rd near Gardner Rd         :    4                 
##  Mount Cotton Rd near Killarney Cr         :    4                 
##  Musgrave Ave near Botanical Dr            :    4                 
##  (Other)                                   :12834                 
##     stop_lat         stop_lon        zone_id    
##  Min.   :-28.17   Min.   :152.1   2      :4829  
##  1st Qu.:-27.62   1st Qu.:153.0   3      :2473  
##  Median :-27.50   Median :153.1   1      :2421  
##  Mean   :-27.48   Mean   :153.1   5      :1181  
##  3rd Qu.:-27.40   3rd Qu.:153.1   6      : 752  
##  Max.   :-26.16   Max.   :153.5   4      : 541  
##                                   (Other): 661  
##                                       stop_url     location_type    
##                                           :  320   Min.   :0.00000  
##  http://translink.com.au/stop/000001/gtfs/:    1   1st Qu.:0.00000  
##  http://translink.com.au/stop/000002/gtfs/:    1   Median :0.00000  
##  http://translink.com.au/stop/000003/gtfs/:    1   Mean   :0.01532  
##  http://translink.com.au/stop/000004/gtfs/:    1   3rd Qu.:0.00000  
##  http://translink.com.au/stop/000005/gtfs/:    1   Max.   :1.00000  
##  (Other)                                  :12533                    
##       parent_station  platform_code  
##              :12231          :12205  
##  place_QSBS  :   17   1      :  190  
##  place_INTGCY:   15   2      :  178  
##  place_INTCAR:   11   3      :   49  
##  place_INTCHE:   10   A      :   44  
##  place_INTCAN:    9   B      :   39  
##  (Other)     :  565   (Other):  153

We can plot the stops to get the crudest possible map

plot(stop_lat~stop_lon,data=stops)

This isn’t a very good map: the points are ugly, and the aspect ratio is wrong. A unit of longitude is larger than a unit of latitude except at the equator. The scale factor is the cosine of the latitude: in Brisbane, about 0.89

We can also use the zone_id variable to colour the points, and can make them smaller.

plot(stop_lat~stop_lon,data=stops, asp=1/0.89, col=zone_id,pch=19,cex=.5)

The zones don’t match what you’d expect: the colours seem to be reused in each population centre. Let’s try a table

table(stops$zone_id)
## 
##         1  1/2    2  2/3    3  3/4    4  4/5    5  5/6    6  6/7    7    8 
##  201 2421  126 4829   24 2473    3  541    9 1181    7  752    1   85  205

Oh. There are stops on the zone boundaries. The data are more complicated than we expected.

For a simpler map, I’ll recode the zones so each stop is listed only in its further-out zone.

stops$outzone<-factor(c(1,2,2,3,3,4,4,5,5,6,6,7,7,8)[stops$zone_id])
stops$inzone<-factor(c(1,1,2,2,3,3,4,4,5,5,6,6,7,8)[stops$zone_id])
plot(stop_lat~stop_lon,data=stops, asp=1/0.89, col=outzone,pch=19,cex=.5)

But it’s going to be useful to see where these stops really are: let’s try to get them on a street map. A quick search finds a package called “RgoogleMaps”

library("RgoogleMaps")
center <- with(stops, c(mean(stop_lat), mean(stop_lon)))
zoom <- with(stops, min(MaxZoom(range(stop_lat), range(stop_lon))))
MyMap <- GetMap(center=center, zoom=zoom, destfile = "MyTile1.png")
tmp <- PlotOnStaticMap(MyMap, lat = stops$stop_lat,lon = stops$stop_lon,destfile = "MyTile1.png", cex=1.5,pch=20,col=palette()[stops$outzone], add=FALSE)

The stops overlay southeast Queensland as we expect, so the data are broadly reasonable

Now lets zoom in a bit more, and make the points partially transparent

MyMap <- GetMap(center=center, zoom=zoom+3, destfile = "MyTile1.png")
tmp <- PlotOnStaticMap(MyMap, lat = stops$stop_lat,lon = stops$stop_lon,destfile = "MyTile1.png", cex=1.5,pch=20,col=adjustcolor(palette()[stops$outzone],alpha.f=.3), add=FALSE)

The railway stations at the airport and in central Brisbane are in the right places, and the buses follow the roads. The data look basically reasonable and we’re happy.