The Zalando Data Intelligence Team is searching for a new top analyst. We already know of an excellent candidate with top analytical and programming skills. Unfortunately, we don’t know her exact whereabouts but we only have some vague information where she might be. Can you tell us where to best send our recruiters and plot an easy to read map of your solution for them?
This is what we could extract from independent sources:
The candidate is likely to be close to the river Spree. The probability at any point is given by a Gaussian function of its shortest distance to the river. The function peaks at zero and has 95% of its total integral within \(\pm2730\) m.
A probability distribution centered around the Brandenburg Gate also informs us of the candidate’s location. The distribution’s radial profile is log-normal with a mean of 4700m and a mode of 3877m in every direction.
A satellite offers further information: with 95% probability she is located within 2400 m distance of the satellite’s path (assuming a normal probability distribution)
Earth radius: 6371km.
Brandenburg Gate GPS coordinates: 52.516288,13.377689.
Satellite path is a great circle path between coordinates (52.590117,13.39915) and (52.437385,13.553989).
River Spree can be approximated as piecewise linear between the following coordinates:
| 1-5 | 6-10 | 11-15 | 16-20 |
|---|---|---|---|
| 52.529198,13.274099 | 52.531835,13.29234 | 52.522116,13.298541 | 52.520569,13.317349 |
| 52.524877,13.322434 | 52.522788,13.329 | 52.517056,13.332075 | 52.522514,13.340743 |
| 52.517239,13.356665 | 52.523063,13.372158 | 52.519198,13.379453 | 52.522462,13.392328 |
| 52.520921,13.399703 | 52.515333,13.406054 | 52.514863,13.416354 | 52.506034,13.435923 |
| 52.496473,13.461587 | 52.487641,13.483216 | 52.488739,13.491456 | 52.464011,13.503386 |
You can (but don’t have to) use following simple projection for getting GPS coordinates into an orthogonal coordinate system. The projection is reasonably accurate for the Berlin area.
Result is an XY coordinate system with the origin (0,0) at the South-West corner of the area we are interested in. The X axis corresponds to East-West and is given in kilometres. The Y axis corresponds to North-South and is also given in kilometres.
South-west corner of the area we are interested in: \[SW_{lat} = 52.464011, SW_{lon} = 13.274099\]
The x and y coordinates of a GPS coordinate P with (\(P_{lat}, P_{lon}\)) can be calculated using: \[P_x = − (P_{lon}-SW_{lon})\cdot cos(SW_{lat})\cdot 111.323\] \[P_y = (P_{lat}-SW_{lat})\cdot 111.323\]
We’ll start with writing some functions that we use afterwards.
First, function that converts GPS coordinates (2d-vector) into an orthogonal coordinate system (also 2d-vector).
geo2plane<-function(geo){
swLat=52.464011
swLong=13.274099
x=-(geo[2]-swLong)*cos(swLat)*111.323
y=(geo[1]-swLat)*111.323
c(x,y)
}
Since we want to have the result in real GPS coordinates, we also need function that converts coordinates back from orthogonal system.
plane2geo<-function(plane){
swLat=52.464011
swLong=13.274099
lat=(plane[2]/111.323)+swLat
long=-plane[1]/(111.323*cos(swLat))+swLong
geo=c(lat,long)
geo
}
We will have to measure distances from different objects, so we need functions for that. All of them are for measuring distance on plane with orthogonal coordinates.
Function for measuring distance between two points \(p\) and \(v\).
dist2dot=function(p,v){
sqrt((p[1]-v[1])^2+(p[2]-v[2])^2)
}
Function for measuring distance between point \(p\) and segment defined by its two vertices \(v\) and \(w\).
dist2Segment=function(p,v,w){
dist=0
segLength=dist2dot(v,w)
if (segLength==0){
dist=dist2dot(p,v)
}
projection=((p[1]-v[1])*(w[1]-v[1])+(p[2]-v[2])*(w[2]-v[2]))/segLength^2
if (projection<0){
dist=dist2dot(p,v)
}
else
if (projection>1){
dist=dist2dot(p,w)
}
else dist=dist2dot(p,c(v[1]+projection*(w[1]-v[1]),v[2]+projection*(w[2]-v[2])))
dist
}
Function that measures distance from point \(p\) to polygon given as \(n \times 2\) array of \(n\) consecutive pairs of coordinates of vertices of the polygon.
dist2Polygon=function(p,polygon){
dist=dist2Segment(p,polygon[1,],polygon[2,])
for (i in 2:(nrow(polygon)-1)){
if (dist>dist2Segment(p,polygon[i,],polygon[i+1,])){
dist=dist2Segment(p,polygon[i,],polygon[i+1,])
}
}
dist
}
We create two \(20\times 2\) matrices that contain GPS and planar coordinates of vertices of river Spree.
spreeGeo<-matrix(c(52.529198,13.274099,
52.531835,13.29234,
52.522116,13.298541,
52.520569,13.317349,
52.524877,13.322434,
52.522788,13.329,
52.517056,13.332075,
52.522514,13.340743,
52.517239,13.356665,
52.523063,13.372158,
52.519198,13.379453,
52.522462,13.392328,
52.520921,13.399703,
52.515333,13.406054,
52.514863,13.416354,
52.506034,13.435923,
52.496473,13.461587,
52.487641,13.483216,
52.488739,13.491456,
52.464011,13.503386),ncol=2,byrow=TRUE)
spreePlane<-t(apply(spreeGeo,1,geo2plane))
Now the Brandenburger Tor. We know that related distribution is lognormal and know its mean \(BT_{mean}\) and mode \(BT_{mode}\). To use some functions related to lognormal distribution we have to know mean (\(\mu\)) and standard deviation (\(\sigma\)) of the corresponding normal distribution.
We have formulas for mean and mode of lognormal distribution: \[BT_{mean}=e^{\mu+\sigma^2/2}\] \[BT_{mode}=e^{\mu - \sigma^2}\] From these we can derive formulas for \(\mu\) and \(\sigma\): \[\mu=\frac{2log(BT_{mean})+log(BT_{mode})}{3}\] \[\sigma=\sqrt{\frac{2}{3}(log(BT_{mean})-log(BT_{mode}))}\]
bt<-geo2plane(c(52.516288,13.377689))
btMean=4.7
btMode=3.877
btSigma=sqrt((2/3)*(log(btMean)-log(btMode)))
btMu=(2*log(btMean)+log(btMode))/3
Finally, we have to input information about satellite’s path. Since we’re interested in very small piece of Earth (about \(16\times 16\) square kilometers), we can easily approximate the great circle path there with the straight line going between two given points. Even more, as we will further see, given two points are at such distance and position that when we need to calculate distance from some point in Berlin to the satellite path, we can calculate distance to the line segment between those two given points, therefore there’s no need in writing another function for calculating distance between point and line.
sat1<-geo2plane(c(52.590117,13.39915))
sat2<-geo2plane(c(52.437385,13.553989))
First thing that comes to mind is to look on the intersection of three 95% confidence intervals given by our three probability distribution. These intervals are:
For uniformity and future purposes let’s calculate standard deviation for Spree- and satellite-related distributions and after that rather operate in quantiles than the given numbers. We know that for normal distribution 95% confidence interval approximately corresponds to the distance of 1.96 standard errors from the mean.
sdSpree<-2.73/1.96
sdSat<-2.4/1.96
Also let’s look at the range of our coordinates to get the picture of scale:
level=.975
summary(spreePlane)
## V1 V2
## Min. : 0.000 Min. :0.000
## 1st Qu.: 3.482 1st Qu.:5.415
## Median : 6.650 Median :6.220
## Mean : 6.940 Mean :5.492
## 3rd Qu.: 9.621 3rd Qu.:6.520
## Max. :14.991 Max. :7.550
sat1
## [1] 8.175985 14.038498
sat2
## [1] 18.299545 -2.964086
bt
## [1] 6.772839 5.819632
qlnorm(level,btMu,btSigma)
## [1] 8.89534
It seems that \(16\times 16\) \(km^2\) square will really give us pretty decent picture.
To depict the points within the confidence intervals we will for every distribution go through a grid with a 100 meter step and save the points that lie within confidence interval. Grids are slightly moved against each other so that points did not overlap.
#Spree dots
spreeStripe<-matrix(ncol=2)
xgrid=seq(0,16,.1)
ygrid=seq(0,16,.1)
for (i in xgrid){
for (j in ygrid){
if (dist2Polygon(c(i,j),spreePlane)<qnorm(level,0,sdSpree)){
spreeStripe=rbind(spreeStripe,c(i,j))
}
}
}
spreeStripe<-spreeStripe[-1,]
spreeStripe=t(apply(spreeStripe,1,plane2geo))
#satellite dots
satStripe<-matrix(ncol=2)
xgrid=seq(0.05,16,.1)
ygrid=seq(0.05,16,.1)
for (i in xgrid){
for (j in ygrid){
if (dist2Segment(c(i,j),sat1,sat2)<qnorm(level,0,sdSat)){
satStripe=rbind(satStripe,c(i,j))
}
}
}
satStripe<-satStripe[-1,]
satStripe=t(apply(satStripe,1,plane2geo))
#Brandenburger Tor
bTor<-matrix(ncol=2)
bTor[1,]<-bt #let's also keep the point that is Brandenburger Tor
xgrid=seq(0.03,16,.1)
ygrid=seq(0.03,16,.1)
for (i in xgrid){
for (j in ygrid){
if (dist2dot(c(i,j),bt)<qlnorm(level,btMu,btSigma) &&
dist2dot(c(i,j),bt)>qlnorm(1-level,btMu,btSigma)){
bTor=rbind(bTor,c(i,j))
}
}
}
bTor=t(apply(bTor,1,plane2geo))
Now let see how it looks on the map
library(RgoogleMaps)
center = c(52.464011+8/111.323,13.274099+8/111.323);
zoom <- min(MaxZoom(c(52.464011,52.464011+16/111.323),c(13.274099,13.274099+16/111.323)));
MyMap <- GetMap(center=center, zoom=zoom, destfile = "MyTile.png");
tmp <-PlotOnStaticMap(MyMap, lat = spreeStripe[,1], lon = spreeStripe[,2],
destfile = "MyTile.png", col="blue",cex=.1, pch=20, add=FALSE);
PlotOnStaticMap(MyMap, lat = bTor[,1], lon = bTor[,2],
FUN = points, col="red", cex=.1,pch=20, add=TRUE)
PlotOnStaticMap(MyMap, lat = satStripe[,1], lon = satStripe[,2],
FUN = points, col="black", cex=.1,pch=20, add=TRUE)
Intersection is quite big. What we can do now is try to narrow down intervals by choosing smaller confidence levels until we have very small intersection - e.g. about \(10\times 10 m^2\). This will be the most probable location of future analyst. We can do it by couting number of points from some grid in the intersection. To reduce computing time we can start from large grid and big step for the confidence level and then zoom in. Based on the picture, we can also start our grid not from (0,0) but rather from (8,0) and stop somewhere around (16,10): even for the 95%-interval all the intersection lies in that area.
intersection_size=1
level=.975
while(is.null(intersection_size)==FALSE){
goodPoints<-matrix(ncol=2)
xgrid=seq(8.03,16,.2)
ygrid=seq(0.03,10,.2)
for (i in xgrid){
for (j in ygrid){
if (dist2dot(c(i,j),bt)<qlnorm(level,btMu,btSigma) &&
dist2dot(c(i,j),bt)>qlnorm(1-level,btMu,btSigma) &&
dist2Segment(c(i,j),sat1,sat2)<qnorm(level,0,sdSat) &&
dist2Polygon(c(i,j),spreePlane)<qnorm(level,0,sdSpree)){
goodPoints=rbind(goodPoints,c(i,j))
}
}
}
goodPoints=goodPoints[-1,]
intersection_size=nrow(goodPoints)
print(c(level,intersection_size))
level=level-.01
}
## [1] 0.975 587.000
## [1] 0.965 443.000
## [1] 0.955 343.000
## [1] 0.945 271.000
## [1] 0.935 216.000
## [1] 0.925 169.000
## [1] 0.915 138.000
## [1] 0.905 108.000
## [1] 0.895 84.000
## [1] 0.885 67.000
## [1] 0.875 50.000
## [1] 0.865 36.000
## [1] 0.855 27.000
## [1] 0.845 18.000
## [1] 0.835 12.000
## [1] 0.825 6.000
## [1] 0.815 4.000
## [1] 0.805
Let’s look at the last point we got:
goodPoints
## [1] 12.63 4.63
We can do it another way around: plot contour lines for different confidence intervals:
xgrid=expand.grid(seq(0,16,.1),seq(0,16,.1))
levelContour<-vector(length=nrow(xgrid))
for (i in 1:nrow(xgrid)){
level=1
while(dist2dot(xgrid[i,],bt)<qlnorm(level-.05,btMu,btSigma) &&
dist2dot(xgrid[i,],bt)>qlnorm(1-level,btMu,btSigma) &&
dist2Segment(xgrid[i,],sat1,sat2)<qnorm(level-.05,0,sdSat) &&
dist2Polygon(xgrid[i,],spreePlane)<qnorm(level-.05,0,sdSpree)){
level=level-.05
}
levelContour[i]=level
}
geogrid<-t(apply(xgrid,1,plane2geo))
geogrid=as.data.frame(cbind(geogrid[,2],geogrid[,1],levelContour))
colnames(geogrid)=c('x','y','z')
library(ggmap)
## Loading required package: ggplot2
berlin<-get_map("Berlin",zoom=12)
## Map from URL : http://maps.googleapis.com/maps/api/staticmap?center=Berlin&zoom=12&size=640x640&scale=2&maptype=terrain&language=en-EN&sensor=false
## Information from URL : http://maps.googleapis.com/maps/api/geocode/json?address=Berlin&sensor=false
ggp=ggmap(berlin)
ggp=ggp+stat_contour(data=geogrid,aes(x=x,y=y,z=z,color=..level..))
plot(ggp)
## Warning: Removed 6193 rows containing non-finite values (stat_contour).
Now for more thorough search in that area:
intersection_size=4
level=.805
while(intersection_size>3){
goodPoints<-matrix(ncol=2)
xgrid=seq(12,13,.01)
ygrid=seq(4,5,.01)
for (i in xgrid){
for (j in ygrid){
if (dist2dot(c(i,j),bt)<qlnorm(level,btMu,btSigma) &&
dist2dot(c(i,j),bt)>qlnorm(1-level,btMu,btSigma) &&
dist2Segment(c(i,j),sat1,sat2)<qnorm(level,0,sdSat) &&
dist2Polygon(c(i,j),spreePlane)<qnorm(level,0,sdSpree)){
goodPoints=rbind(goodPoints,c(i,j))
}
}
}
intersection_size=nrow(goodPoints)-1
goodPoints=goodPoints[-1,]
print(c(level,intersection_size))
level=level-.001
}
## [1] 0.805 482.000
## [1] 0.804 419.000
## [1] 0.803 362.000
## [1] 0.802 309.000
## [1] 0.801 259.000
## [1] 0.8 214.0
## [1] 0.799 174.000
## [1] 0.798 139.000
## [1] 0.797 109.000
## [1] 0.796 80.000
## [1] 0.795 56.000
## [1] 0.794 36.000
## [1] 0.793 20.000
## [1] 0.792 12.000
## [1] 0.791 3.000
So it seems that the most probable location of future data analyst is somewhere around these coordinates:
plane2geo(goodPoints[1,])
## [1] 52.50677 13.46605
Now we can look where on the map are the last three points we got:
goodGeo<-t(apply(goodPoints,1,plane2geo))
center = goodGeo[1,];
zoom <- min(MaxZoom(c(52.464011,52.464011+.5/111.323),c(13.274099,13.274099+.5/111.323)));
MyMap <- GetMap(center=center, zoom=zoom, destfile = "MyTile3.png");
tmp <-PlotOnStaticMap(MyMap, lat = goodGeo[,1], lon = goodGeo[,2],
destfile = "MyTile3.png", col="blue",cex=.5, pch=20, add=FALSE);
So it’s that little Böcklinstraße and the area around it. If we do some Google search, we can discover that there is Zalando Customer Service in the neighborhood: Neue Bahnhofstraße 11-17. So it’s quite possible that future analyst is sitting there at the moment and you don’t have to go far to find her.