Task

The Zalando Data Intelligence Team is searching for a new top analyst. We already know of an excellent candidate with top analytical and programming skills. Unfortunately, we don’t know her exact whereabouts but we only have some vague information where she might be. Can you tell us where to best send our recruiters and plot an easy to read map of your solution for them?

This is what we could extract from independent sources:

The candidate is likely to be close to the river Spree. The probability at any point is given by a Gaussian function of its shortest distance to the river. The function peaks at zero and has 95% of its total integral within \(\pm2730\) m.

A probability distribution centered around the Brandenburg Gate also informs us of the candidate’s location. The distribution’s radial profile is log-normal with a mean of 4700m and a mode of 3877m in every direction.

A satellite offers further information: with 95% probability she is located within 2400 m distance of the satellite’s path (assuming a normal probability distribution)

Coordinates

Earth radius: 6371km.

Brandenburg Gate GPS coordinates: 52.516288,13.377689.

Satellite path is a great circle path between coordinates (52.590117,13.39915) and (52.437385,13.553989).

River Spree can be approximated as piecewise linear between the following coordinates:

1-5 6-10 11-15 16-20
52.529198,13.274099 52.531835,13.29234 52.522116,13.298541 52.520569,13.317349
52.524877,13.322434 52.522788,13.329 52.517056,13.332075 52.522514,13.340743
52.517239,13.356665 52.523063,13.372158 52.519198,13.379453 52.522462,13.392328
52.520921,13.399703 52.515333,13.406054 52.514863,13.416354 52.506034,13.435923
52.496473,13.461587 52.487641,13.483216 52.488739,13.491456 52.464011,13.503386

Tip for conversion of coordinates

You can (but don’t have to) use following simple projection for getting GPS coordinates into an orthogonal coordinate system. The projection is reasonably accurate for the Berlin area.

Result is an XY coordinate system with the origin (0,0) at the South-West corner of the area we are interested in. The X axis corresponds to East-West and is given in kilometres. The Y axis corresponds to North-South and is also given in kilometres.

South-west corner of the area we are interested in: \[SW_{lat} = 52.464011, SW_{lon} = 13.274099\]

The x and y coordinates of a GPS coordinate P with (\(P_{lat}, P_{lon}\)) can be calculated using: \[P_x = − (P_{lon}-SW_{lon})\cdot cos(SW_{lat})\cdot 111.323\] \[P_y = (P_{lat}-SW_{lat})\cdot 111.323\]

Solution

We’ll start with writing some functions that we use afterwards.

Functions

First, function that converts GPS coordinates (2d-vector) into an orthogonal coordinate system (also 2d-vector).

geo2plane<-function(geo){
  swLat=52.464011
  swLong=13.274099
  x=-(geo[2]-swLong)*cos(swLat)*111.323
  y=(geo[1]-swLat)*111.323
  c(x,y)
}

Since we want to have the result in real GPS coordinates, we also need function that converts coordinates back from orthogonal system.

plane2geo<-function(plane){
  swLat=52.464011
  swLong=13.274099
  lat=(plane[2]/111.323)+swLat
  long=-plane[1]/(111.323*cos(swLat))+swLong
  geo=c(lat,long)
  geo
}

We will have to measure distances from different objects, so we need functions for that. All of them are for measuring distance on plane with orthogonal coordinates.

Function for measuring distance between two points \(p\) and \(v\).

dist2dot=function(p,v){
  sqrt((p[1]-v[1])^2+(p[2]-v[2])^2)
}

Function for measuring distance between point \(p\) and segment defined by its two vertices \(v\) and \(w\).

dist2Segment=function(p,v,w){
  dist=0
  segLength=dist2dot(v,w)
  if (segLength==0){
    dist=dist2dot(p,v)
  }
  projection=((p[1]-v[1])*(w[1]-v[1])+(p[2]-v[2])*(w[2]-v[2]))/segLength^2
  if (projection<0){
    dist=dist2dot(p,v)
  }
  else
    if (projection>1){
      dist=dist2dot(p,w)
    }
  else dist=dist2dot(p,c(v[1]+projection*(w[1]-v[1]),v[2]+projection*(w[2]-v[2])))
  dist
}

Function that measures distance from point \(p\) to polygon given as \(n \times 2\) array of \(n\) consecutive pairs of coordinates of vertices of the polygon.

dist2Polygon=function(p,polygon){
  dist=dist2Segment(p,polygon[1,],polygon[2,])
  for (i in 2:(nrow(polygon)-1)){
    if (dist>dist2Segment(p,polygon[i,],polygon[i+1,])){
      dist=dist2Segment(p,polygon[i,],polygon[i+1,])
    }
  }
  dist
}

Defining known objects

We create two \(20\times 2\) matrices that contain GPS and planar coordinates of vertices of river Spree.

spreeGeo<-matrix(c(52.529198,13.274099,  
                   52.531835,13.29234,  
                   52.522116,13.298541, 
                   52.520569,13.317349,  
                   52.524877,13.322434,  
                   52.522788,13.329,  
                   52.517056,13.332075,  
                   52.522514,13.340743,  
                   52.517239,13.356665,  
                   52.523063,13.372158,  
                   52.519198,13.379453,  
                   52.522462,13.392328,  
                   52.520921,13.399703,  
                   52.515333,13.406054,  
                   52.514863,13.416354,  
                   52.506034,13.435923,  
                   52.496473,13.461587,  
                   52.487641,13.483216,  
                   52.488739,13.491456,  
                   52.464011,13.503386),ncol=2,byrow=TRUE)

spreePlane<-t(apply(spreeGeo,1,geo2plane))

Now the Brandenburger Tor. We know that related distribution is lognormal and know its mean \(BT_{mean}\) and mode \(BT_{mode}\). To use some functions related to lognormal distribution we have to know mean (\(\mu\)) and standard deviation (\(\sigma\)) of the corresponding normal distribution.

We have formulas for mean and mode of lognormal distribution: \[BT_{mean}=e^{\mu+\sigma^2/2}\] \[BT_{mode}=e^{\mu - \sigma^2}\] From these we can derive formulas for \(\mu\) and \(\sigma\): \[\mu=\frac{2log(BT_{mean})+log(BT_{mode})}{3}\] \[\sigma=\sqrt{\frac{2}{3}(log(BT_{mean})-log(BT_{mode}))}\]

bt<-geo2plane(c(52.516288,13.377689))
btMean=4.7
btMode=3.877
btSigma=sqrt((2/3)*(log(btMean)-log(btMode)))
btMu=(2*log(btMean)+log(btMode))/3

Finally, we have to input information about satellite’s path. Since we’re interested in very small piece of Earth (about \(16\times 16\) square kilometers), we can easily approximate the great circle path there with the straight line going between two given points. Even more, as we will further see, given two points are at such distance and position that when we need to calculate distance from some point in Berlin to the satellite path, we can calculate distance to the line segment between those two given points, therefore there’s no need in writing another function for calculating distance between point and line.

sat1<-geo2plane(c(52.590117,13.39915))
sat2<-geo2plane(c(52.437385,13.553989))

Intersection of 95% confidence intervals

First thing that comes to mind is to look on the intersection of three 95% confidence intervals given by our three probability distribution. These intervals are:

  • points that are not far than 2730 m from Spree
  • points that are not far than 2400 m from satellite’s path
  • points that are not close to the Brandenburger Tor than 2.5%-quantile of given lognormal distribution and not far than 97.5%-quantile of aforementioned distribution.

For uniformity and future purposes let’s calculate standard deviation for Spree- and satellite-related distributions and after that rather operate in quantiles than the given numbers. We know that for normal distribution 95% confidence interval approximately corresponds to the distance of 1.96 standard errors from the mean.

sdSpree<-2.73/1.96
sdSat<-2.4/1.96

Also let’s look at the range of our coordinates to get the picture of scale:

level=.975
summary(spreePlane)
##        V1               V2       
##  Min.   : 0.000   Min.   :0.000  
##  1st Qu.: 3.482   1st Qu.:5.415  
##  Median : 6.650   Median :6.220  
##  Mean   : 6.940   Mean   :5.492  
##  3rd Qu.: 9.621   3rd Qu.:6.520  
##  Max.   :14.991   Max.   :7.550
sat1
## [1]  8.175985 14.038498
sat2
## [1] 18.299545 -2.964086
bt
## [1] 6.772839 5.819632
qlnorm(level,btMu,btSigma)
## [1] 8.89534

It seems that \(16\times 16\) \(km^2\) square will really give us pretty decent picture.

To depict the points within the confidence intervals we will for every distribution go through a grid with a 100 meter step and save the points that lie within confidence interval. Grids are slightly moved against each other so that points did not overlap.

#Spree dots
spreeStripe<-matrix(ncol=2)
xgrid=seq(0,16,.1)
ygrid=seq(0,16,.1)
for (i in xgrid){
  for (j in ygrid){
   if (dist2Polygon(c(i,j),spreePlane)<qnorm(level,0,sdSpree)){
     spreeStripe=rbind(spreeStripe,c(i,j))
   } 
  }
}
spreeStripe<-spreeStripe[-1,]
spreeStripe=t(apply(spreeStripe,1,plane2geo))

#satellite dots
satStripe<-matrix(ncol=2)
xgrid=seq(0.05,16,.1)
ygrid=seq(0.05,16,.1)
for (i in xgrid){
  for (j in ygrid){
    if (dist2Segment(c(i,j),sat1,sat2)<qnorm(level,0,sdSat)){
      satStripe=rbind(satStripe,c(i,j))
    } 
  }
}
satStripe<-satStripe[-1,]
satStripe=t(apply(satStripe,1,plane2geo))

#Brandenburger Tor
bTor<-matrix(ncol=2)
bTor[1,]<-bt #let's also keep the point that is Brandenburger Tor
xgrid=seq(0.03,16,.1)
ygrid=seq(0.03,16,.1)
for (i in xgrid){
  for (j in ygrid){
    if (dist2dot(c(i,j),bt)<qlnorm(level,btMu,btSigma) && 
          dist2dot(c(i,j),bt)>qlnorm(1-level,btMu,btSigma)){
      bTor=rbind(bTor,c(i,j))
    } 
  }
}
bTor=t(apply(bTor,1,plane2geo))

Now let see how it looks on the map

library(RgoogleMaps)
center = c(52.464011+8/111.323,13.274099+8/111.323);
zoom <- min(MaxZoom(c(52.464011,52.464011+16/111.323),c(13.274099,13.274099+16/111.323)));
MyMap <- GetMap(center=center, zoom=zoom, destfile = "MyTile.png");
tmp <-PlotOnStaticMap(MyMap, lat = spreeStripe[,1], lon = spreeStripe[,2],
                       destfile = "MyTile.png", col="blue",cex=.1, pch=20, add=FALSE);
PlotOnStaticMap(MyMap, lat = bTor[,1], lon = bTor[,2],
                 FUN = points, col="red", cex=.1,pch=20, add=TRUE)
PlotOnStaticMap(MyMap, lat = satStripe[,1], lon = satStripe[,2],
                FUN = points, col="black", cex=.1,pch=20, add=TRUE)

Intersection is quite big. What we can do now is try to narrow down intervals by choosing smaller confidence levels until we have very small intersection - e.g. about \(10\times 10 m^2\). This will be the most probable location of future analyst. We can do it by couting number of points from some grid in the intersection. To reduce computing time we can start from large grid and big step for the confidence level and then zoom in. Based on the picture, we can also start our grid not from (0,0) but rather from (8,0) and stop somewhere around (16,10): even for the 95%-interval all the intersection lies in that area.

Constriction of confidence intervals

intersection_size=1
level=.975
while(is.null(intersection_size)==FALSE){
  goodPoints<-matrix(ncol=2)
  xgrid=seq(8.03,16,.2)
  ygrid=seq(0.03,10,.2)
  for (i in xgrid){
    for (j in ygrid){
      if (dist2dot(c(i,j),bt)<qlnorm(level,btMu,btSigma) && 
            dist2dot(c(i,j),bt)>qlnorm(1-level,btMu,btSigma) &&
            dist2Segment(c(i,j),sat1,sat2)<qnorm(level,0,sdSat) &&
            dist2Polygon(c(i,j),spreePlane)<qnorm(level,0,sdSpree)){
        goodPoints=rbind(goodPoints,c(i,j))
      } 
    }
  }
  goodPoints=goodPoints[-1,]
  intersection_size=nrow(goodPoints)
  print(c(level,intersection_size))
  level=level-.01
}
## [1]   0.975 587.000
## [1]   0.965 443.000
## [1]   0.955 343.000
## [1]   0.945 271.000
## [1]   0.935 216.000
## [1]   0.925 169.000
## [1]   0.915 138.000
## [1]   0.905 108.000
## [1]  0.895 84.000
## [1]  0.885 67.000
## [1]  0.875 50.000
## [1]  0.865 36.000
## [1]  0.855 27.000
## [1]  0.845 18.000
## [1]  0.835 12.000
## [1] 0.825 6.000
## [1] 0.815 4.000
## [1] 0.805

Let’s look at the last point we got:

goodPoints
## [1] 12.63  4.63

We can do it another way around: plot contour lines for different confidence intervals:

xgrid=expand.grid(seq(0,16,.1),seq(0,16,.1))
levelContour<-vector(length=nrow(xgrid))
for (i in 1:nrow(xgrid)){
  level=1
  while(dist2dot(xgrid[i,],bt)<qlnorm(level-.05,btMu,btSigma) && 
    dist2dot(xgrid[i,],bt)>qlnorm(1-level,btMu,btSigma) &&
    dist2Segment(xgrid[i,],sat1,sat2)<qnorm(level-.05,0,sdSat) &&
    dist2Polygon(xgrid[i,],spreePlane)<qnorm(level-.05,0,sdSpree)){
    level=level-.05
  }
  levelContour[i]=level
}

geogrid<-t(apply(xgrid,1,plane2geo))
geogrid=as.data.frame(cbind(geogrid[,2],geogrid[,1],levelContour))
colnames(geogrid)=c('x','y','z')
library(ggmap)
## Loading required package: ggplot2
berlin<-get_map("Berlin",zoom=12)
## Map from URL : http://maps.googleapis.com/maps/api/staticmap?center=Berlin&zoom=12&size=640x640&scale=2&maptype=terrain&language=en-EN&sensor=false
## Information from URL : http://maps.googleapis.com/maps/api/geocode/json?address=Berlin&sensor=false
ggp=ggmap(berlin)
ggp=ggp+stat_contour(data=geogrid,aes(x=x,y=y,z=z,color=..level..))
plot(ggp)
## Warning: Removed 6193 rows containing non-finite values (stat_contour).

Now for more thorough search in that area:

intersection_size=4
level=.805
while(intersection_size>3){
  goodPoints<-matrix(ncol=2)
  xgrid=seq(12,13,.01)
  ygrid=seq(4,5,.01)
  for (i in xgrid){
    for (j in ygrid){
      if (dist2dot(c(i,j),bt)<qlnorm(level,btMu,btSigma) && 
            dist2dot(c(i,j),bt)>qlnorm(1-level,btMu,btSigma) &&
            dist2Segment(c(i,j),sat1,sat2)<qnorm(level,0,sdSat) &&
            dist2Polygon(c(i,j),spreePlane)<qnorm(level,0,sdSpree)){
        goodPoints=rbind(goodPoints,c(i,j))
      } 
    }
  }
  intersection_size=nrow(goodPoints)-1
  goodPoints=goodPoints[-1,]
  print(c(level,intersection_size))
  level=level-.001
}
## [1]   0.805 482.000
## [1]   0.804 419.000
## [1]   0.803 362.000
## [1]   0.802 309.000
## [1]   0.801 259.000
## [1]   0.8 214.0
## [1]   0.799 174.000
## [1]   0.798 139.000
## [1]   0.797 109.000
## [1]  0.796 80.000
## [1]  0.795 56.000
## [1]  0.794 36.000
## [1]  0.793 20.000
## [1]  0.792 12.000
## [1] 0.791 3.000

So it seems that the most probable location of future data analyst is somewhere around these coordinates:

plane2geo(goodPoints[1,])
## [1] 52.50677 13.46605

Now we can look where on the map are the last three points we got:

goodGeo<-t(apply(goodPoints,1,plane2geo))
center = goodGeo[1,];
zoom <- min(MaxZoom(c(52.464011,52.464011+.5/111.323),c(13.274099,13.274099+.5/111.323)));
MyMap <- GetMap(center=center, zoom=zoom, destfile = "MyTile3.png");
tmp <-PlotOnStaticMap(MyMap, lat = goodGeo[,1], lon = goodGeo[,2],
                       destfile = "MyTile3.png", col="blue",cex=.5, pch=20, add=FALSE);

So it’s that little Böcklinstraße and the area around it. If we do some Google search, we can discover that there is Zalando Customer Service in the neighborhood: Neue Bahnhofstraße 11-17. So it’s quite possible that future analyst is sitting there at the moment and you don’t have to go far to find her.