Linking RStudio and PostGIS using Docker

What is Docker?

System administrators and web app developers routinely use containers for virtualisation. However, the concept is not as well known among desktop users. I first became aware of Docker through being confused by the way a system I had originally set up on a single PC running as a server was being transferred onto another server by an experienced system engineer. My experience of virtual machines was through running complete copies of an operating system using VirtualBox. Under that setup all the applications are held in the same environment and share the same dependencies and files. This can cause problems. A common issue for open source geospatial applications is that they all rely on the GDAL library. If applications use different versions of the library there can be conflicts when they are installed on the same machine.

Docker avoids these problems by setting up each application in its own container. A Docker container holds all the basic components of a Linux operating system in its own isolated environment. If nothing else is added to the environment, a Docker container set up by one person is completely portable from one machine to another. The configuration of the host is of no importance providing it has a Linux kernel (Windows and Mac users can also use Docker within an environment that adds the kernel). Prebuilt Docker images that can be used to build local containers are stored in online repositories, making the install of a working server application very simple and clean. For desktop users this can be a useful way of running multiple versions of databases such as PostgreSQL without any fear of breaking anything. It can also be a simple way to run applications with complex setup requirements, such as CartoDB. If someone else has got all the install instructions to run on a clean operating system there is no need to go through the process again. You can just use their image and build your own containers from it. The Dockerfile used to build an image is essentially a set of commands such as apt-get install that are run after importing a basic Linux environment, so builds based on older versions of Debian or Ubuntu can be run on machines using newer versions and vice versa.
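
To give a flavour of this, a minimal Dockerfile might look something like the sketch below. The base image and package names are only placeholders for illustration, not taken from any particular published image.

FROM ubuntu:16.04
RUN apt-get update && apt-get -y install postgresql postgis
EXPOSE 5432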

The advantages of this for a system administrator are clear. However, the concepts involved can seem rather strange at first. For uninitiated users, rather than experienced system administrators, the main challenge of working with Docker containers is linking them together so that they can “talk to each other”.

Docker install on Ubuntu

Installing Docker on Ubuntu is very simple. Just run:

sudo apt-get install docker.io
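
A quick way to check that the daemon is working is to run the standard hello-world image.

sudo docker run hello-world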

Once Docker is installed, prebuilt images of almost any server application can be found on https://hub.docker.com. There is no need to start from scratch and build your own, although a basic understanding of Dockerfiles is useful in order to understand what an image is based on. The files used to build the images are made available through GitHub. Using prebuilt images avoids reinventing the wheel, but it may be necessary to add your own components to them.

Images and containers

The key concept behind Docker is easy to understand. The system works with images and containers. An image is the starting point for a local instance, i.e. the container. Once you have an image you can construct multiple containers from it. Changes made within each container do not affect the image, and they are lost if the container is destroyed. From a system administrator’s point of view the great thing about Docker is that containers share the layers used by the image. Images themselves can also share layers used by other images, so this is all very efficient with regard to storage. It is also a great way to run server software on a laptop without filling the disk with virtual machines. In effect it is a simple way to set up services that might conflict with existing software, and once a few concepts are understood it is not much harder than using apt-get.

The first step to forming a container is to build the image on the host machine. This can be done either from local Dockerfiles holding the instructions for the build or by pulling down a prebuilt image from the cloud. The Docker commands “build” and “pull” refer to images, not containers.

The command that confused me totally at first was “run”. It appeared logical to assume that run referred to the local container. In fact the run command in Docker is best thought of as a way of initialising a container, not “running” it. I mentally think of it as “init” rather than run. Restarting a previously built container that has stopped is achieved through the start command. Using run will build a brand new container based on the image. If you don’t realise this it may look as if changes are getting lost each time you run a container. They are not; they are still there on the old container built by the last run command, so you will gradually fill up the system with multiple containers. The run command also very cleverly either uses a local image or, if one is not available, pulls one from the cloud. So unless you are adapting images there is no real need for the pull and build commands, as they are effectively included in the run command.
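
The containers that have accumulated can be listed, and any that are no longer needed removed by name or id (the name below is just a placeholder).

sudo docker ps -a
sudo docker rm old_container_name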

There are a lot of different flags for the run command that can be looked up in the Docker documentation. The -p flag lets you map ports on the container to ports on the host. It is important to know which ports are exposed by the container and to map them to unused ports on the host. If, for example, you already have PostgreSQL on the host, port 5432 will be in use when you run a Docker image. It is also possible to link containers together and to hold data either in other containers or in files on the host. All this is done when the container is first built using a run command.

So, to set up a recent version of PostGIS using Docker I used the following:

mkdir -p ~/postgres_data
sudo docker run --name "postgis" -p 25432:5432 -d -v $HOME/postgres_data:/var/lib/postgresql kartoza/postgis`

This pulls a prebuilt image from https://hub.docker.com/r/kartoza/postgis/. I first set up a local directory in my home directory to hold the data and mounted it when the run command was executed. The container is named postgis, so I can then start and stop the container created by the run command by typing:

sudo docker stop postgis
sudo docker start postgis
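
To confirm that the container is up, or to see its startup messages, the following can be used.

sudo docker ps
sudo docker logs postgis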

There is no need to use the run command again unless you want a new container. One element that is missing from the original image is plr, which I routinely use to run R from within the database. There are two ways to handle this. One would be to add the commands to my own Dockerfile and build an image with plr included. This is preferred in many ways. The other is to install the packages within the container itself. These will be lost if the container is destroyed, of course, but it is possible to use the commit command to form a new image based on these changes. This is convenient, although it may be considered bad practice as it is not reproducible in the same way as a documented build would be. The following command opens a bash shell in the running container:

sudo docker exec -it postgis bash

Now, from within the container, the usual install can be run. The version of PostgreSQL in this build is 9.5, which determines the name of the plr package to install:

apt-get install r-base 
apt-get install postgresql-9.5-plr
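
If this state is worth keeping, the container can be committed to a new local image; the image name here is just an example.

sudo docker commit postgis local/postgis-plr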

Port 5432 within the postgis container is mapped to 25432 on the host, and the default user is “docker” with password “docker”. So a new database can be created from the host using:

createdb -h localhost -p 25432 -U docker elections
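
Assuming the plr install above succeeded, the extension can then be enabled in the new database.

psql -h localhost -p 25432 -U docker -d elections -c "CREATE EXTENSION plr;"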

Data dumped from a database can be loaded using.

psql -h localhost -p 25432 -U docker elections < dump.sql

Linking an RStudio Server container

RStudio Server is accessed through a web browser on port 8787. Linking the postgis container to it allows connections between R and the PostGIS database using ODBC.

sudo mkdir -p ~/rstudio_server
sudo chmod 777 -R ~/rstudio_server
 sudo docker run --name "rstudio" --link postgis:postgis -d -p 8787:8787 -v $HOME/rstudio_server:/home/rstudio rocker/rstudio

However, once more, the original image lacks some of the required packages. These can be added by logging in:

sudo docker exec -it rstudio bash

On a local install it is worth adding the default rstudio user to the sudo group, since a bash shell can then be run from within the browser if more changes are needed.

sudo adduser rstudio sudo 
sudo apt-get update
sudo apt-get -y install odbc-postgresql unixodbc
sudo apt-get -y install r-cran-rodbc

The final step to establish the connection is to add an /etc/odbc.ini file. One way to do this is to edit it directly within RStudio Server as a plain text file, save it to the home directory, and then copy it to /etc, taking advantage of the sudo rights given to the rstudio user.

So within RStudio make a plain text file and save it as odbc.ini:

[elections]
Driver = /usr/lib/x86_64-linux-gnu/odbc/psqlodbcw.so
Database = elections
Servername = postgis
Username = docker
Password = docker
Protocol = 8.2.5
ReadOnly = 0

Then open a shell under the Tools menu and run:

sudo cp /home/rstudio/odbc.ini /etc/odbc.ini
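
The DSN can then be given a rough test from the same shell using the isql utility that ships with unixodbc.

isql -v elections docker docker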

This process can be repeated to add more connections by adding them to the file.

Now to test that it is all working (this document was compiled on an RStudio Server set up using this procedure):

library(RODBC)
con <- odbcConnect("elections")
query <- "select pct_leave, mig from referendum"
d <- sqlQuery(con, query)
head(d)
##   pct_leave  mig
## 1     48.03  334
## 2     50.70   95
## 3     44.98  101
## 4     50.50 1200
## 5     71.39 1011
## 6     39.77  544

Next steps

In order to have a fully working system that links PostGIS with R you need libgdal-dev in the RStudio container in order to build rgdal. To use system commands to upload spatial data to PostGIS you also need shp2pgsql and raster2pgsql. Currently the only way to install these requires a full install of PostGIS again within the container running R. There is clearly some duplication involved, and the install increases the size of the container by around 1.5GB. It still seems preferable to add this functionality to the RStudio container and keep PostGIS as a second container in order to allow smooth updates and prevent conflicts. Having different versions of GDAL in each container is unlikely to cause any problems. Running these commands within the RStudio container will add this functionality:

sudo apt-get install libgdal-dev
sudo apt-get install postgis
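
With these in place a shapefile can be pushed into the dockerised database from the RStudio shell. The file name, table name and SRID below are only placeholders.

shp2pgsql -s 4326 boundaries.shp public.boundaries | psql -h localhost -p 25432 -U docker elections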

The examples shown here can now be run in the system:

https://rpubs.com/dgolicher/6373

Including Geoserver

Geoserver can also be set up in a container, linked to the postgis container and to a local folder for storing data:

sudo docker run --name "geoserver"  --link postgis:postgis -p 8081:8080 -d  -v $HOME/geoserver_data:/opt/geoserver/data_dir  -t kartoza/geoserver

I have also built a version of Geoserver which includes Geoexplorer. Since it maps the same host port, it should be run as an alternative to the container above:

sudo docker run --name "geoserver-explorer"  --link postgis:postgis -p 8081:8080 -d  -v $HOME/geoserver_data:/opt/geoserver/data_dir  -t dgolicher/geoserver-explorer

Conclusion

Docker containers have many advantages as an easy and portable way of safely running server applications locally. Setting up a full system on a network would of course require additional configuration for security and to allow multiple users access to the services.