Anomalies in Citydata

Orlenko Irina

2022-09-19

R Markdown

Here is a sample of the dataset:

did date time lat lon hAccuracy
YjU0c2M3YT 2019-09-02 14:02:20 1.307193 103.789969 16
NTVqZnByMj 2019-09-08 15:17:08 1.306691 103.788064 40
Yzg3aTI0Zm 2019-09-07 00:04:46 1.296493 103.788552 16
NDlwajkxZ3 2020-09-29 16:23:06 1.302764 103.789300 16
Ymh1bzJudj 2019-09-19 09:48:05 1.299170 103.788830 10
YXY2ZzUwcG 2020-09-27 23:48:38 1.306168 103.790950 17
Nm5rODRyMm 2020-09-12 20:47:53 1.291254 103.791026 43
N2d2dDBmY3 2021-09-19 08:34:20 1.297297 103.792369 9
M3A0ZGpyM2 2019-09-03 12:46:02 1.297263 103.792366 34
aGFsczRrb3 2019-09-18 14:22:41 1.297277 103.792415 11
N2d1MHFobT 2020-09-16 22:32:47 1.306542 103.790730 5
M3NtcTVob3 2019-09-16 15:21:41 1.298444 103.793557 18
NWlzOHEyaj 2020-09-17 15:37:02 1.296259 103.790090 11
MjlwbDBkYW 2020-09-03 17:49:38 1.305604 103.790868 4
YXNxcTlqdG 2020-09-10 12:26:56 1.299795 103.787668 22
M2liM3BocX 2021-09-17 23:03:59 1.298468 103.790561 28
ZGE5M2kxOG 2020-09-08 19:45:47 1.292339 103.791598 5
ZTEwZnZoOX 2020-09-02 01:42:49 1.306552 103.787742 11
NHA5bGwxNm 2019-09-27 20:18:12 1.297539 103.787827 7
ZjJmOHZic3 2020-09-24 14:43:20 1.305188 103.788919 18
ODI5OGphcT 2020-09-06 17:07:39 1.297297 103.792370 5514
NjZ1cW9pNG 2019-09-03 12:15:45 1.304208 103.791564 9
NGN0cmNjMm 2020-09-27 12:52:14 1.295670 103.788300 8
MmpkMXN1NT 2020-09-07 06:52:40 1.292591 103.793451 22
NGQwOWNhdH 2020-09-23 11:57:23 1.306516 103.790526 7
YnVlOWdtMz 2020-09-23 13:56:27 1.306118 103.789510 10
YmI4ajZsbH 2020-09-24 12:01:03 1.300471 103.788184 10
NWZvbmRwZz 2020-09-23 13:27:16 1.302981 103.792100 10
NTQ2bTJuN2 2019-09-13 03:06:06 1.296743 103.795281 17
YTU0cjU4b2 2020-09-14 18:10:48 1.305662 103.787796 5
MmVkdjhndH 2020-09-04 09:22:45 1.307362 103.789176 7
ZGFjcm5rNW 2019-09-19 13:39:09 1.305716 103.790522 6
MTNzYnJjZ2 2019-09-13 17:04:53 1.299018 103.787447 29
YW1kajNsYW 2019-09-25 08:25:28 1.303971 103.787800 5
Nmd2YmR1bn 2020-09-28 01:06:29 1.297297 103.792370 5514
YmlxOXRidj 2019-09-16 08:14:37 1.298590 103.788280 18
YzY4bGxoZT 2019-09-20 05:55:26 1.305514 103.790134 5
NW1qbW02Yj 2020-09-01 18:43:44 1.301549 103.787105 26
OGprMzY1bm 2020-09-02 17:37:23 1.302220 103.792820 6
NWlhcGtiZj 2019-09-24 20:35:48 1.298815 103.787456 17

Outliers

Pings

We have four data dimensions: location, time, pings, and devices.

  1. Location: proportion of records whose coordinates have trailing zeros

The ground distance that corresponds to each decimal place of a coordinate:

decimal places degrees distance
0 1.0 111 km
1 0.1 11.1 km
2 0.01 1.11 km
3 0.001 111 m
4 0.0001 11.1 m
5 0.00001 1.11 m
6 0.000001 0.111 m
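The table can be reproduced with a few lines of R. This is a minimal sketch that assumes one degree corresponds to roughly 111 km, which holds for latitude everywhere and for longitude near the equator (the pings here are at about 1.3° N). The variable names are illustrative.

```r
# Approximate ground distance covered by one unit of the d-th decimal place.
# Assumption: 1 degree is about 111 km.
km_per_degree <- 111
decimal_places <- 0:6
degrees <- 10^(-decimal_places)
distance_m <- km_per_degree * 1000 * degrees

resolution <- data.frame(decimal_places, degrees, distance_m)
print(resolution)
```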

Most pings are given with 6 decimal places, but some have fewer than 6 decimal places in one or both coordinates:

did lat lon lat dp lon dp
Ymt0c25rNm 1.289280 103.789420 5 5
ZWoybXFkNj 1.292877 103.785080 6 5
Ymt0c25rNm 1.289280 103.789420 5 5
NTBnZmc5MD 1.291510 103.785356 5 6
Ymt0c25rNm 1.289280 103.789420 5 5
NnRrdWJlcT 1.291309 103.784288 6 6
M2trb2RvcG 1.291588 103.784485 6 6
OW44MGdxNW 1.289958 103.790013 6 6
OWpzdDViZW 1.289099 103.790276 6 6
Ymt0c25rNm 1.289280 103.789420 5 5
Ymt0c25rNm 1.289280 103.789420 5 5
NjRnOG1pZj 1.290559 103.785370 6 5
Ymt0c25rNm 1.289280 103.789420 5 5
Ymt0c25rNm 1.289300 103.789444 4 6
OWJqdGhxY2 1.290833 103.784705 6 6
OW9sNjk4ZW 1.289461 103.791380 6 5
NDl0bTViN3 1.287974 103.791340 6 5
NzdhZWJibj 1.289070 103.791577 5 6
Y21kNHZvaj 1.291490 103.784394 5 6
NGY1cDlzdT 1.289850 103.787447 5 6
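One possible way to compute the "lat dp" and "lon dp" columns is sketched below. The helper `count_dp` is hypothetical (it is not taken from the original analysis) and assumes that coordinates are stored with at most 6 decimal places, so a trailing zero in the 6-digit representation counts as a missing decimal place.

```r
# Hypothetical helper: number of decimal places after dropping trailing zeros.
# Assumption: coordinates carry at most `max_dp` meaningful decimal places.
count_dp <- function(x, max_dp = 6) {
  s <- sprintf(paste0("%.", max_dp, "f"), x)  # fixed notation, e.g. "1.289280"
  s <- sub("0+$", "", s)                      # drop trailing zeros -> "1.28928"
  frac <- sub("^[^.]*\\.?", "", s)            # keep only the fractional digits
  nchar(frac)
}

lat <- c(1.289280, 1.292877, 1.289300)
lon <- c(103.789420, 103.785080, 103.789444)
data.frame(lat, lon, lat_dp = count_dp(lat), lon_dp = count_dp(lon))
```

The three rows are taken from the table above, so the result can be compared with it directly.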

Assume that the number of decimal places ranges from 2 to 6 (we count 0 or 1 decimal places as 2) and that the last digit is uniformly distributed. Then the fraction of points whose longitude has exactly 6 decimal places should be 90%: the 6th decimal digit can be 0, 1, 2, …, 9, and only when it is 0 (1 case in 10) does the longitude have 5 or fewer decimal places. By the same logic, the fraction of points whose longitude has exactly 5 decimal places should be 9% (90% of the remaining 10%), the fraction with exactly 4 decimal places should be 0.9%, and so on.

The same logic applies to latitude:

decimal places fraction of points
0 / 1 / 2 0.01%
3 0.09%
4 0.9%
5 9%
6 90%
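The expected fractions follow a simple pattern: 90% of what remains at each step, with everything left over assigned to the "2 or fewer" band. A short sketch, assuming uniformly distributed digits as in the argument above (variable names are illustrative):

```r
# Expected share of coordinates with exactly k decimal places, k = 3..6,
# plus the leftover share for "2 or fewer".
max_dp <- 6
dp <- 3:max_dp
expected <- c(0.1^(max_dp - 2), 0.9 * 0.1^(max_dp - dp))
names(expected) <- c("<=2", dp)

print(round(100 * expected, 2))  # percentages per band
```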

However, we observe different frequencies. Science Park I:

lon dp ping count % expected count expected % observed / expected
<=2 856 1.1 8 0.01 106.61
3 163 0.2 72 0.09 2.26
4 3236 4.0 723 0.90 4.48
5 17082 21.3 7226 9.00 2.36
6 58954 73.4 72262 90.00 0.82

lat dp ping count % expected count expected % observed / expected
<=2 532 0.7 8 0.01 66.26
3 38 0.0 72 0.09 0.53
4 1161 1.4 723 0.90 1.61
5 9026 11.2 7226 9.00 1.25
6 69534 86.6 72262 90.00 0.96
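The "expected count" and "observed / expected" columns can be recomputed from the ping counts alone. The sketch below does this for the Science Park I longitude table, using the counts listed above and the expected fractions derived earlier; the variable names are illustrative.

```r
# Observed ping counts by longitude decimal places (Science Park I, from the table).
observed <- c("<=2" = 856, "3" = 163, "4" = 3236, "5" = 17082, "6" = 58954)

# Expected fractions under the uniform-last-digit assumption.
expected_frac <- c(0.0001, 0.0009, 0.009, 0.09, 0.9)

expected_count <- sum(observed) * expected_frac
ratio <- observed / expected_count

print(data.frame(
  lon_dp = names(observed),
  observed = as.numeric(observed),
  expected_count = round(expected_count),
  ratio = round(as.numeric(ratio), 2)
))
```

A ratio well above 1 in the low-precision bands and below 1 in the 6-decimal band is the pattern the tables show for every site.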

Science Park II:

lon dp ping count % expected count expected % observed / expected
<=2 397 0.7 6 0.01 67.57
3 200 0.3 53 0.09 3.78
4 2519 4.3 529 0.90 4.76
5 17166 29.2 5288 9.00 3.25
6 38469 65.5 52876 90.00 0.73

lat dp ping count % expected count expected % observed / expected
<=2 341 0.6 6 0.01 58.04
3 24 0.0 53 0.09 0.45
4 1362 2.3 529 0.90 2.58
5 4934 8.4 5288 9.00 0.93
6 52090 88.7 52876 90.00 0.99

One North:

lon dp ping count % expected count expected % observed / expected
<=2 27720 3.2 86 0.01 324.06
3 18397 2.2 770 0.09 23.90
4 37764 4.4 7699 0.90 4.91
5 193082 22.6 76986 9.00 2.51
6 578437 67.6 769860 90.00 0.75

lat dp ping count % expected count expected % observed / expected
<=2 955 0.1 86 0.01 11.16
3 845 0.1 770 0.09 1.10
4 9460 1.1 7699 0.90 1.23
5 77512 9.1 76986 9.00 1.01
6 766628 89.6 769860 90.00 1.00

There are too many pings with abnormally few decimal places. Their presence can be explained by

To estimate the share of incorrect pings in our dataset, note that if the trailing digits of longitude and latitude are independent, the expected fraction of pings where both coordinates have 6 decimal places is 81% (90% of 90%). We attribute the gap between the expected and the observed fraction to incorrect data points.

Science Park I

lon dp lat dp ping count %
2 2 527 0.7
2 4 325 0.4
2 6 4 0.0
3 4 5 0.0
3 5 10 0.0
3 6 148 0.2
4 3 3 0.0
4 4 33 0.0
4 5 269 0.3
4 6 2931 3.7
5 3 3 0.0
5 4 112 0.1
5 5 3511 4.4
5 6 13456 16.8
6 2 5 0.0
6 3 32 0.0
6 4 686 0.9
6 5 5236 6.5
6 6 52995 66.0

The estimated percentage of incorrect data points is 81% - 66% = 15%.

Science Park II

lon dp lat dp ping count %
2 2 340 0.6
2 4 5 0.0
2 5 3 0.0
2 6 49 0.1
3 4 1 0.0
3 5 13 0.0
3 6 186 0.3
4 4 27 0.0
4 5 239 0.4
4 6 2253 3.8
5 3 2 0.0
5 4 907 1.5
5 5 1158 2.0
5 6 15099 25.7
6 2 1 0.0
6 3 22 0.0
6 4 422 0.7
6 5 3521 6.0
6 6 34503 58.7

The estimated percentage of incorrect data points is 81% - 59% = 22%.

One North

lon dp lat dp ping count %
2 1 888 0.1
2 3 4 0.0
2 4 247 0.0
2 5 5395 0.6
2 6 21186 2.5
3 3 203 0.0
3 4 310 0.0
3 5 3147 0.4
3 6 14737 1.7
4 1 9 0.0
4 3 42 0.0
4 4 1408 0.2
4 5 3114 0.4
4 6 33191 3.9
5 1 11 0.0
5 3 83 0.0
5 4 1429 0.2
5 5 14822 1.7
5 6 176737 20.7
6 1 46 0.0
6 2 1 0.0
6 3 513 0.1
6 4 6066 0.7
6 5 51034 6.0
6 6 520777 60.9

The estimated percentage of incorrect data points is 81% - 61% = 20%.
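The three estimates can be checked together. In the sketch below, the "both 6" counts are the (6, 6) rows of the tables above and the totals are the sums of the ping counts in each table; the 81% baseline relies on the independence assumption stated earlier.

```r
# Pings where both longitude and latitude have 6 decimal places, per site.
both6 <- c("Science Park I" = 52995, "Science Park II" = 34503, "One North" = 520777)

# Total pings per site (sum of the ping counts in each table).
total <- c("Science Park I" = 80291, "Science Park II" = 58751, "One North" = 855400)

expected_both6 <- 0.9 * 0.9          # 81% if lon and lat digits are independent
observed_both6 <- both6 / total
incorrect_share <- expected_both6 - observed_both6

print(round(100 * incorrect_share))  # estimated % of incorrect points per site
```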

Now we sort all pings into bands according to the smaller of the two decimal-place counts (longitude and latitude):

If we believe that the errors are just rounding errors, we can remove every ping in which at least one coordinate has too few decimal places.

Science Park I

dp ping count ping % cumulative ping % device count device %
2 861 1.1 1.1 99 2.3
3 201 0.3 1.3 94 2.2
4 4031 5.0 6.3 771 17.9
5 22203 27.7 34.0 2112 49.0
6 52995 66.0 100.0 3822 88.8
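The ping columns of the band table can be derived from the joint (lon dp, lat dp) table by taking the smaller of the two counts as the band. The sketch below does this for Science Park I, with the data copied from the table above; the device columns cannot be reproduced this way because they require the raw pings.

```r
# Joint counts for Science Park I, copied from the (lon dp, lat dp) table.
joint <- data.frame(
  lon_dp = c(2, 2, 2, 3, 3, 3, 4, 4, 4, 4, 5, 5, 5, 5, 6, 6, 6, 6, 6),
  lat_dp = c(2, 4, 6, 4, 5, 6, 3, 4, 5, 6, 3, 4, 5, 6, 2, 3, 4, 5, 6),
  pings  = c(527, 325, 4, 5, 10, 148, 3, 33, 269, 2931,
             3, 112, 3511, 13456, 5, 32, 686, 5236, 52995)
)

# Band = the less precise of the two coordinates.
joint$dp <- pmin(joint$lon_dp, joint$lat_dp)

bands <- aggregate(pings ~ dp, data = joint, FUN = sum)
bands$pct <- 100 * bands$pings / sum(bands$pings)
bands$cum_pct <- cumsum(bands$pct)

print(round(bands, 1))
```

Keeping only the `dp == 6` band corresponds to removing every ping that has at least one coordinate with too few decimal places.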

Science Park II

dp ping count ping % cumulative ping % device count device %
2 398 0.7 0.7 66 2.6
3 224 0.4 1.1 71 2.8
4 3848 6.5 7.6 565 22.5
5 19778 33.7 41.3 1373 54.7
6 34503 58.7 100.0 2241 89.3

One North

dp ping count ping % cumulative ping % device count device %
1 954 0.1 0.1 117 0.3
2 26833 3.1 3.2 2990 6.6
3 19035 2.2 5.5 3255 7.1
4 45208 5.3 10.8 7900 17.3
5 242593 28.4 39.1 22322 49.0
6 520777 60.9 100.0 37697 82.7

Devices

Need a plan!

  1. Count the unique points from which each device pinged. A unique point can be defined at different resolutions: e.g., if we count pairs (lon, lat) with both coordinates rounded to the 4th decimal place, we are essentially counting the number of distinct \(11\times 11\) m squares that the device has visited.

  2. Stationary devices are devices that pinged from very few \(11\times 11\) m squares.

  3. Devices that always pinged from a single \(11\times 11\) cm square are probably geolocated by IP address rather than by satellite.
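The plan can be sketched in base R as follows. The data frame below is made up for illustration (it is not from the dataset), `count_cells` is a hypothetical helper, and the stationary threshold `max_cells` is an assumption, since the plan only says "very few".

```r
# Made-up pings for two devices; device "b" always reports the same point.
pings <- data.frame(
  did = c("a", "a", "a", "b", "b", "b"),
  lon = c(103.789420, 103.789421, 103.790013, 103.785080, 103.785080, 103.785080),
  lat = c(1.289280, 1.289281, 1.289958, 1.292877, 1.292877, 1.292877)
)

# Hypothetical helper: distinct (lon, lat) cells per device after rounding
# both coordinates to `digits` decimal places.
count_cells <- function(pings, digits) {
  cell <- paste(round(pings$lon, digits), round(pings$lat, digits))
  tapply(cell, pings$did, function(x) length(unique(x)))
}

cells_4dp <- count_cells(pings, 4)  # roughly 11 m x 11 m squares
cells_6dp <- count_cells(pings, 6)  # roughly 11 cm x 11 cm squares

max_cells <- 1                      # assumed threshold for "very few"
stationary <- names(cells_4dp)[cells_4dp <= max_cells]
single_point <- names(cells_6dp)[cells_6dp == 1]

print(data.frame(did = names(cells_4dp),
                 cells_4dp = as.integer(cells_4dp),
                 cells_6dp = as.integer(cells_6dp)))
```

Devices in `single_point` are the candidates for IP-based geolocation from step 3; whether a device with only one or two pings should be included is a separate decision.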

Number of unique points per device, with coordinates taken at 6 decimal places:

did dp 6
NDE1ZWx2cjhvMmRudDpiamQxZmJkYTE0ajI5 4
YWZqdWhhNzZnZDE3djo5MjE1ZmRjM29oY2Ew 10
Mm5qdTVwNTU3cmwxcTo0OHVvMG82bWo0N2pz 4
YXBsZWE5ajU2cWQ4bTo2bWRpNmE1NDB1aXF2 16
OTZxa2JobzU5aTI2cjo0ZG1kbWgzNGt0OWhh 181
Y284a2dpMXRjNTcxMTpmazdhc2lyM21kcjNl 2
NmE5cXE0c3FhZmNxbDo1NzUyZmlwc2pscnNv 34
ZGw0dHRsb2NnNnAyajo0bmRhODhwZzkwanNs 3
OHJyNTJuNmJydGNsOTo1cTVnM3E5bjdvYW5q 3
MmEza3MzZXJwbmkzbzpjMW9vNGhmcTVobXRy 6
YnBqYjJyc3ZudGgxNzpiZTRrYmE2ODM1Z2Js 1
M25yMHRwdmRpYWwzcjo0dWYxMWhsbHU0MW03 38
ZjhibTZlZzRyNTJwaDo5YzhjOWp1cDRsYnJx 2
NmFmNzMxZW9qbGNjaTo4cXJiaG02a2MwYWVy 1
cG00aDcwYWJpdTYzOjNzanNxYTlxamo0Z2M= 1
N3BtbzlvOGFzZnU0cTozN2dyZ2s4ODc0MHY5 1
ZnUzcWw2ZmJ1MTMzMTo0Mm02b3E2cW02c2Rj 5
NjdiYmx0dWV0NzE4aTplbmxxaXQ5NW92dXFr 2
Y242N2k0ZDBibHFnZjo1cmVqMjhkM21xdjJy 1
NXRicHR1cnBwbWVsazpkcmFrZTNwcjFyNGZr 1

Number of distinct 1100 m cells per device (coordinates rounded to 2 decimal places):

did 1100 m
Zml2c2IzZHV0cnNwZToxbHE2amRmaW1yczNu 1
MzZmYmo4dmZlMjhwYzozNmNqbHIybXBka2Jy 1
N2gydDg2MjBiY2czaTpmNHU4cjZkMjQ4NHR2 1
MzByNGVuajhvbjk0OmM0cnFwc3QxamZsM2Y= 1
N3Nqa2ptYW9qbW9najpkNTQzc2cwcWU1bmFz 1
YmMybTZsN25ibHRscDo1MmtyZjg4ZDRvb2Zx 2
Y3NmMGcyMnU4M2piOTo3dWl0Mjd2Nmh0YnRi 1
NDBvMmc0M3NnbXFtYzozZHFhdnFjc3BxbG9l 1
NnM1M2d1OGVlMmdxdTpldjJqOGo4bm5obWt2 1
OWlmcG1mZzJmbmRmMDpkZzRwbjNuOGJpMnA2 1
NHYyNHF0MzM5b2trMDpmOTYwazJqcDU1aGU2 1
ZmMwOThtbDZoZzN2dDozNHB0ZG5lOXJpcHB2 1
MmNsN3MyaGgzMzR0NjppZGI1OWc4dDloZjg= 1
OTZxa2JobzU5aTI2cjo0ZG1kbWgzNGt0OWhh 1
NGozdnEyY2Ixc3ZmaDo3YW84cTdncGEzbnBu 1
MzRxdG9uMGRobGZtaDpjZXBycXNsN2RkaDho 1
ZmJmMXVpNWNvcGpsODo1MWs4MWswZzEwbTdu 1
YWMydDh2dWNhdmNpazozNmNndjZrNmxiM3Nw 1
NWUzOW8xN2hyOXJobzozMzF1dXJodDRnOGM2 1
NWU3dnRyZDR2M2w5ZjppNmxqMDh1YWR0cnQ= 1

Science Park

did 1100 m 110 m 11 m 1.1 m 0.11 m ping count
OHJ2NXNpY21lMzRwZzo0Ymd1OTZqMzhvYXJs 2 6 27 200 299 532
YWtmbW9ucWdnZmczajphcHF1a2p0Y2FsYTYw 1 5 34 332 453 636
N3ZmbmNidm1lOXQ3cjo2OG9lZHV2dnBkcXE1 1 7 67 324 499 501
MzUzc2FvdjFyOW5xazpldGJhZzJuNGx2dXNx 2 6 35 301 501 599
Nzg2ZXN1bTdyMDF2aTo3Z2FmNG1wYms3OGhz 2 7 68 433 502 506
ODQ1bGczdWhjcGRtMjpkc2wwMW9xZm9jdW81 2 6 87 477 522 522
YTJla2RlMTQ5Y29pMTo2MGdwN2ZhcTNmZms2 2 13 65 383 546 610
ZGttZGJ2dWowNGFxODo1OGQ5MjE5NnBnYTQ4 2 17 71 371 583 588
ZTlxdWRtMjhtMG1xYjo4Y2ozZ2xla2ZyNGNu 1 4 27 256 613 624
YnA4dnZqdnAxY2prMTpkNTkwZWVvM3BmOHVj 1 1 4 118 642 662
YnV0czdjdTBkZmxnbzozMnYwM2dqbjdpc3Fw 2 10 34 333 658 665
ZmFscm5xcmk3bmRwaTpkam9lZ2F2aDExZ2Jt 1 15 159 602 664 664
YWhnbWJmNWVwbmFsczphb281MHQwMWxtMXZz 2 15 122 456 686 690
YXJsZ3ZrZXVnbzlucDphNHZzbTNzZXNscjlx 1 1 2 121 736 762
OW9oMHMwbWFxOXNzbjoxaXMxOXFxODRtdm1y 2 10 49 427 751 762
NjRnOG1pZjN2MW9qczpjY2VhM3JoZjhudTYx 2 13 60 487 796 1013
ZGVlczRkcWUyNDRzMzo0dWZ0bWZlYnM1N3U4 2 4 38 528 818 820
Nm5xaXEzZzdnMHFzMzpiOTQ2NmszMGI0b2U5 1 13 182 688 966 969
NGxpNGllMmNmYjZhYzo2aWljMmg1NzloMjRp 2 10 63 476 1011 1028
Ymt0c25rNmc2NDZwdjppcXJlZDVvMG92azk= 1 1 4 121 1968 4434
Y2trMGk3amcwamNrcDo0NXMxMDlmanJhY211 2 5 42 617 2017 3727
Nm05cHFnOTNvaHA2YjozczlnZXI0NGpmdTBk 1 13 130 758 2100 2168
Y21kNHZvajJzMDA2NjozN3U5YjRqbTY1Z29q 2 4 18 359 2593 2742

One North