Google Location History (GLH) can be downloaded from your Google account at https://takeout.google.com/settings/takeout. Google provides the data as a .json file, which can be loaded with the jsonlite package. Loading this file into R might take a few minutes, depending on how many location points Google has saved about you.
The packages used to generate this report, loaded in the snippet after the list, are:
* jsonlite
* dplyr
* ggplot2
* pander
* lubridate
* leaflet
* leaflet.extras
* scales
* kableExtra
* knitr
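For reproducibility, the following snippet loads all of them (assuming they are already installed):
library(jsonlite)
library(dplyr)
library(ggplot2)
library(pander)
library(lubridate)
library(leaflet)
library(leaflet.extras)
library(scales)
library(kableExtra)
library(knitr)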
First, we need to load the JSON file into R and create a data frame. The location data is stored under the locations attribute.
datos <- fromJSON("Location HistoryLast.json")
# inspect the structure of the imported object
class(datos); attributes(datos); class(datos$locations)
## [1] "list"
## $names
## [1] "locations"
## [1] "data.frame"
# extract location dataframe
df <- datos$locations
Let’s get a glimpse of the data before we start cleaning it. There are 1,887,207 observations and 9 variables. Table 1 shows the number of missing values in each variable.
glimpse(df)
## Observations: 1,887,207
## Variables: 9
## $ timestampMs <chr> "1307998861249", "1307998872287", "1307998876320",...
## $ latitudeE7 <int> 403240840, 403242250, 403245230, 403240310, 403240...
## $ longitudeE7 <int> -37778120, -37776070, -37778520, -37775600, -37775...
## $ accuracy <int> 232, 93, 46, 34, 22, 22, 22, 25, 2, 2, 2, 2, 2, 2,...
## $ activity <list> [NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, ...
## $ velocity <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ altitude <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ heading <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ verticalAccuracy <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
# count missing values per variable and display them as a one-column table
y <- t(t(sapply(df, function(col) sum(is.na(col)))))
colnames(y) <- "NAs"
kable(y, caption = "Table 1. Number of NAs") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"),
                full_width = FALSE)
| Variable | NAs |
|---|---|
| timestampMs | 0 |
| latitudeE7 | 0 |
| longitudeE7 | 0 |
| accuracy | 3 |
| activity | 0 |
| velocity | 1811521 |
| altitude | 1539197 |
| heading | 1835180 |
| verticalAccuracy | 1543999 |
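The sensor-related variables (velocity, altitude, heading, verticalAccuracy) are missing for the large majority of records. To put the counts in perspective, a quick sketch of the percentage missing per variable:
# share of missing values per variable, in percent
round(sapply(df, function(col) mean(is.na(col))) * 100, 1)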
Tables 2 and 3 show the likely meaning of the attributes present in the Google Location History.
Table 2. Attributes of the Google Location History
| Attribute | Meaning |
|---|---|
| timestampMs | Timestamp in milliseconds at which the observation was recorded |
| latitudeE7 | Latitude of the observation, stored as an integer (degrees × 1e7) |
| longitudeE7 | Longitude of the observation, stored as an integer (degrees × 1e7) |
| accuracy | Google’s estimate of how accurate the data is |
| activity | List of activities (Table 3) |
| velocity | This could refer to the speed of the device at capture time |
| altitude | Altitude of the observation |
| heading | Direction the device is traveling |
| verticalAccuracy | This could refer to the accuracy of the vertical location of the device |
Table 3. Activity
| Attribute | Meaning |
|---|---|
| activity.type | Google’s inference of what the user was potentially doing; a single record may list several candidate types, and there are many possible values |
| activity.confidence | Google assigns a confidence value for the activity type guessed |
| activity.timestampMs | Timestamp in milliseconds for the recorded activity |
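The activity column is a nested list, so it does not fit in a flat table. Below is a minimal sketch, using purrr, of pulling out the most confident activity type per record; it assumes each non-NULL element of df$activity is a data frame whose activity column holds a nested data frame with type and confidence, which matches how fromJSON typically parses the GLH file, but verify against your own export:
# extract the highest-confidence activity type from each record (NA when absent)
library(purrr)
top_activity <- map_chr(df$activity, function(act) {
  if (is.null(act) || length(act$activity) == 0) return(NA_character_)
  inner <- act$activity[[1]]  # first recorded set of candidate activities
  inner$type[which.max(inner$confidence)]
})
table(top_activity, useNA = "ifany")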
Next, we transform some of the data into a more readable form and extract some information from the timestamps recorded by Google.
## Convert the positions and timestamps into a more readable form
df.map <- df %>%
  mutate(time = as_datetime(as.numeric(timestampMs) / 1000),
         date = date(time),
         hour.min = paste(hour(time), minute(time), sep = ":"),
         week = isoweek(time),
         year = isoyear(time),
         latitude = latitudeE7 / 1e7,
         longitude = longitudeE7 / 1e7) %>%
  select(-timestampMs, -latitudeE7, -longitudeE7, -time, -activity)
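As a quick sanity check of the conversion (a sketch), the rescaled coordinates should fall within valid ranges:
# latitudes should lie in [-90, 90], longitudes in [-180, 180]
range(df.map$latitude); range(df.map$longitude)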
The downloaded GLH file contains data from 2011-06-13 until 2020-02-22, covering 2,934 distinct days.
summary(df.map$date)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## "2011-06-13" "2014-11-04" "2016-12-22" "2016-09-16" "2018-06-15" "2020-02-22"
n_distinct(df.map$date)
## [1] 2934
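Not every calendar day in that span has data. A quick sketch of the coverage:
# distinct days with data vs. calendar days between first and last observation
span_days <- as.integer(max(df.map$date) - min(df.map$date)) + 1
n_distinct(df.map$date) / span_days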
kable(df.map %>% group_by(year) %>% summarise(n = n()),
      col.names = c("Year", "Observations"), align = c('c', 'r'),
      caption = "Table 4. Data collected by year") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"),
                full_width = FALSE)
| Year | Observations |
|---|---|
| 2011 | 23588 |
| 2012 | 18416 |
| 2013 | 219941 |
| 2014 | 240655 |
| 2015 | 177286 |
| 2016 | 270439 |
| 2017 | 353040 |
| 2018 | 275022 |
| 2019 | 276393 |
| 2020 | 32427 |
df.map %>% group_by(week,year) %>% summarise(n = n()) %>%
ggplot( aes(x=week, y=n)) +
geom_bar(stat="identity") +
facet_grid(facets = year ~ .) +
scale_x_continuous(breaks = c(1:54)) +
labs(x = "Week of year", y = "Entries",
title="Google Location History: Tracks per week") +
theme_bw()
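To identify the individual weeks with the most recorded points, a quick check (a sketch):
# busiest year/week combinations
df.map %>% count(year, week, sort = TRUE) %>% head()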
The mean accuracy is 270, while the median is only 24, so the distribution has a long right tail (the maximum is 4,984,961). The next figure shows the distribution of accuracy for values below 500.
summary(df.map$accuracy)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 1 20 24 270 29 4984961 3
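Before plotting, it is worth checking how much of the data the cutoff of 500 retains (a quick sketch):
# fraction of observations with accuracy below 500
mean(df.map$accuracy < 500, na.rm = TRUE)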
# filter() also drops the 3 rows with NA accuracy, which would otherwise
# trigger a stat_density warning about non-finite values
df.map %>% filter(accuracy < 500) %>%
  ggplot(aes(x = accuracy)) +
  geom_density(size = 1, col = 'grey') +
  theme_bw()
The next figure shows the variation in altitude during the year 2017.
df.map %>% filter(!is.na(altitude) & year==2017) %>%
ggplot(aes(x=as.Date(date),y=altitude)) +
geom_point() +
theme_bw() + theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
scale_x_date(breaks = function(x) seq.Date(from = min(x), to = max(x), by = "1 week"),
minor_breaks = function(x) seq.Date(from = min(x), to = max(x), by = "1 week")) +
ggtitle("Altitude variation 2017") + labs(x="Date")
The following figure shows the altitude variation during 2018.
df.map %>% filter(!is.na(altitude) & year==2018) %>%
ggplot(aes(x=as.Date(date),y=altitude)) +
geom_point() +
theme_bw() + theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
scale_x_date(breaks = function(x) seq.Date(from = min(x), to = max(x), by = "1 week"),
minor_breaks = function(x) seq.Date(from = min(x), to = max(x), by = "1 week")) +
ggtitle("Altitude variation 2018") + labs(x="Date")
The next figure shows the altitude variation during 2019.
df.map %>% filter(!is.na(altitude) & year==2019) %>%
ggplot(aes(x=as.Date(date),y=altitude)) +
geom_point() +
theme_bw() + theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
scale_x_date(breaks = function(x) seq.Date(from = min(x), to = max(x), by = "1 week"),
minor_breaks = function(x) seq.Date(from = min(x), to = max(x), by = "1 week")) +
ggtitle("Altitude variation 2019") + labs(x="Date")
The next figure shows the altitude variation during the period that Google has been collecting data.
df.map %>% filter(!is.na(altitude)) %>% arrange(date) %>%
ggplot(aes(x=as.Date(date),y=altitude)) +
geom_point() +
theme_bw() + theme(axis.text.x = element_text(angle = 90, hjust = 1))+
scale_x_date(breaks = function(x) seq.Date(from = min(x), to = max(x), by = "1 month"),
minor_breaks = function(x) seq.Date(from = min(x), to = max(x), by = "1 month")) +
ggtitle("Altitude Variation by month") + labs(x="Date")
map2019 <- df.map %>% filter(year == 2019)
myMap <- leaflet(map2019) %>%
  addProviderTiles(providers$CartoDB.Positron) %>%
  fitBounds(~min(longitude), ~min(latitude), ~max(longitude), ~max(latitude)) %>%
  addHeatmap(lng = ~longitude, lat = ~latitude, group = "HeatMap",
             blur = 20, max = 0.01, radius = 15) %>%
  addMarkers(~longitude, ~latitude, clusterOptions = markerClusterOptions(),
             group = "Points")
myMap
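Because both layers are assigned to named groups, a layer switcher can be added so the reader can toggle between the two views (a sketch using leaflet's addLayersControl):
# add an expanded control to toggle the "HeatMap" and "Points" groups
myMap %>%
  addLayersControl(overlayGroups = c("HeatMap", "Points"),
                   options = layersControlOptions(collapsed = FALSE))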