Your Google Location History (GLH) can be downloaded from your Google account at https://takeout.google.com/settings/takeout. Google provides the data as a .json file, which can be loaded with the jsonlite package. Loading the file into R may take a few minutes, depending on how many location points Google has saved about you.

R Packages Used

The packages used to generate this report are:
* jsonlite
* dplyr
* ggplot2
* pander
* lubridate
* leaflet
* leaflet.extras
* scales
* kableExtra
* knitr
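
A minimal sketch of attaching these packages at the top of the report, assuming they are already installed. The names `pkgs`, `installed` and `missing_pkgs` are illustrative and do not come from the original report.

```r
# Packages listed above, as named in the report.
pkgs <- c("jsonlite", "dplyr", "ggplot2", "pander", "lubridate",
          "leaflet", "leaflet.extras", "scales", "kableExtra", "knitr")

# Check which packages are installed instead of failing midway through.
installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
missing_pkgs <- pkgs[!installed]
if (length(missing_pkgs) > 0) {
  message("Not installed: ", paste(missing_pkgs, collapse = ", "))
}

# Attach the packages that are available.
for (p in pkgs[installed]) {
  library(p, character.only = TRUE)
}
```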

Loading the data from a JSON file

First, we need to load the JSON file into R and create a dataframe. The data is stored under the attribute locations.

datos <- fromJSON("Location HistoryLast.json")
class(datos);attributes(datos);class(datos$locations)
## [1] "list"
## $names
## [1] "locations"
## [1] "data.frame"
# extract location dataframe
df <- datos$locations
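
To see why the result is a list holding a data frame, here is a small sketch that parses an inline JSON string instead of the full export. The two records reuse values shown in the glimpse output further down; the names `sample_json`, `toy` and `toy_df` are made up for this illustration.

```r
library(jsonlite)

# A tiny stand-in for the Takeout file: one "locations" array of records.
sample_json <- '{"locations": [
  {"timestampMs": "1307998861249", "latitudeE7": 403240840,
   "longitudeE7": -37778120, "accuracy": 232},
  {"timestampMs": "1307998872287", "latitudeE7": 403242250,
   "longitudeE7": -37776070, "accuracy": 93}
]}'

# fromJSON() simplifies an array of records into a data frame by default.
toy <- fromJSON(sample_json)
toy_df <- toy$locations

class(toy)      # top level is a list with one element, "locations"
class(toy_df)   # the element itself is a data frame
str(toy_df)
```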

Let’s get a glimpse of the data before we start cleaning it. There are 1,887,207 observations and 9 variables. Table 1 shows the number of missing values in each variable.

glimpse(df);
## Observations: 1,887,207
## Variables: 9
## $ timestampMs      <chr> "1307998861249", "1307998872287", "1307998876320",...
## $ latitudeE7       <int> 403240840, 403242250, 403245230, 403240310, 403240...
## $ longitudeE7      <int> -37778120, -37776070, -37778520, -37775600, -37775...
## $ accuracy         <int> 232, 93, 46, 34, 22, 22, 22, 25, 2, 2, 2, 2, 2, 2,...
## $ activity         <list> [NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, ...
## $ velocity         <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ altitude         <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ heading          <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ verticalAccuracy <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
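# Count the NA values per column and keep the result as a one-column matrix.
# Note: is.na() on the list column `activity` does not flag NULL entries,
# so its count of 0 does not mean every observation has an activity.
y <- as.matrix(sapply(df, function(col) sum(is.na(col))))
colnames(y) <- c("NAs")
kable(y,caption="Table 1. Number of NA") %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = F )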
Table 1. Number of NA

| Variable | NAs |
|---|---|
| timestampMs | 0 |
| latitudeE7 | 0 |
| longitudeE7 | 0 |
| accuracy | 3 |
| activity | 0 |
| velocity | 1811521 |
| altitude | 1539197 |
| heading | 1835180 |
| verticalAccuracy | 1543999 |
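
The raw counts are easier to judge as a share of all observations. The sketch below simply divides the counts reported in Table 1 by the 1,887,207 observations; `n_obs`, `na_counts` and `na_share` are illustrative names. It suggests that most of the velocity, altitude, heading and verticalAccuracy values are missing.

```r
# Total observations and NA counts, copied from the output above.
n_obs <- 1887207
na_counts <- c(accuracy = 3,
               velocity = 1811521,
               altitude = 1539197,
               heading = 1835180,
               verticalAccuracy = 1543999)

# Percentage of missing values per variable, rounded to one decimal.
na_share <- round(100 * na_counts / n_obs, 1)
na_share
```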

Data cleansing and Transformation

Tables 2 and 3 show the likely meaning of the attributes present in the Google Location History.

Table 2.

| Attribute | Meaning |
|---|---|
| timestampMs | Timestamp, in milliseconds, when the observation was recorded |
| latitudeE7 | Latitude of the observation, stored as an integer (degrees × 1e7) |
| longitudeE7 | Longitude of the observation, stored as an integer (degrees × 1e7) |
| accuracy | Google’s estimate of how accurate the data is |
| activity | List of activities (Table 3) |
| velocity | This could refer to the speed of the device at capture time |
| altitude | Altitude of the observation |
| heading | Direction the device is traveling |
| verticalAccuracy | This could refer to the accuracy of the vertical location of the device |

Table 3. Activity

| Attribute | Meaning |
|---|---|
| activity.type | It could refer to multiple values. It seems that Google infers what the user is potentially doing. There are many possible values |
| activity.confidence | Google assigns a confidence value for the activity type guessed |
| activity.timestampMs | Timestamp in milliseconds for the recorded activity |
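
As a rough illustration of how the activity list could be summarised, the sketch below picks the most confident activity type for each observation. It assumes each entry is either NULL or a small data frame with `type` and `confidence` columns; the real export may nest these fields differently, so `toy_activity` is a hand-built stand-in and `top_activity` is a hypothetical name.

```r
# Hand-built stand-in for the `activity` list column (assumed structure).
toy_activity <- list(
  NULL,
  data.frame(type = c("STILL", "IN_VEHICLE"), confidence = c(80L, 20L),
             stringsAsFactors = FALSE),
  data.frame(type = "ON_FOOT", confidence = 95L, stringsAsFactors = FALSE)
)

# For each observation, keep the type with the highest confidence,
# or NA when no activity was recorded.
top_activity <- vapply(toy_activity, function(a) {
  if (is.null(a) || nrow(a) == 0) return(NA_character_)
  a$type[which.max(a$confidence)]
}, character(1))

top_activity
```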

Next, we transform some of the data into a more readable form and extract some information from the timestamps recorded by Google.

## Convert the positions and timestamps into a more readable form
df.map <- df %>% mutate(time  = as_datetime(as.numeric(timestampMs)/1000),  # ms -> seconds since epoch (UTC)
                        date = date(time),
                        hour.min  = format(time, "%H:%M"),                  # zero-padded, e.g. "09:05"
                        week = isoweek(time),
                        year = isoyear(time),
                        latitude = latitudeE7/1e7,                          # E7 integers -> decimal degrees
                        longitude= longitudeE7/1e7) %>%
                        select(-timestampMs,-latitudeE7,-longitudeE7,-time,-activity)
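
The conversions can be checked by hand on the first record shown in the glimpse output. This sketch uses base R's `as.POSIXct()` rather than lubridate's `as_datetime()` so that it runs without extra packages; the variable names are illustrative.

```r
# First record from the glimpse output above.
ts_ms  <- "1307998861249"
lat_e7 <- 403240840L
lon_e7 <- -37778120L

# Milliseconds since the Unix epoch -> seconds -> date-time in UTC.
time_utc <- as.POSIXct(as.numeric(ts_ms) / 1000,
                       origin = "1970-01-01", tz = "UTC")

# E7 integers are degrees multiplied by 1e7.
latitude  <- lat_e7 / 1e7
longitude <- lon_e7 / 1e7

format(time_utc, "%Y-%m-%d %H:%M")  # "2011-06-13 21:01"
c(latitude, longitude)              # 40.324084 -3.777812
```

The resulting date matches the first day reported for the data set.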

How long has Google been collecting data?

The downloaded GLH file contains data from 2011-06-13 until 2020-02-22. There are 2934 distinct days reported.

summary(df.map$date)
##         Min.      1st Qu.       Median         Mean      3rd Qu.         Max. 
## "2011-06-13" "2014-11-04" "2016-12-22" "2016-09-16" "2018-06-15" "2020-02-22"
n_distinct(df.map$date)
## [1] 2934
kable(df.map %>% group_by(year) %>% summarise(n=n()),col.names=c("Year","Observations"), align=c('c','r'),caption="Table 4. Data collected by year") %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = F )
Table 4. Data collected by year

| Year | Observations |
|---|---|
| 2011 | 23588 |
| 2012 | 18416 |
| 2013 | 219941 |
| 2014 | 240655 |
| 2015 | 177286 |
| 2016 | 270439 |
| 2017 | 353040 |
| 2018 | 275022 |
| 2019 | 276393 |
| 2020 | 32427 |
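
To put the 2934 distinct days in context, this small sketch compares them with the number of calendar days between the first and last dates reported above. The variable names are illustrative; the dates and the day count are taken from the output above.

```r
# First and last dates and the number of distinct days, from the output above.
first_day      <- as.Date("2011-06-13")
last_day       <- as.Date("2020-02-22")
days_with_data <- 2934

# Calendar days in the period, counting both end points.
span_days <- as.integer(last_day - first_day) + 1L

# Share of calendar days with at least one recorded location.
coverage <- days_with_data / span_days
round(coverage, 2)
```

Under this calculation, roughly 92% of the days in the period have at least one observation.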

Tracks per week

df.map %>%  group_by(week,year) %>% summarise(n = n()) %>%
   ggplot( aes(x=week, y=n)) +
      geom_bar(stat="identity") +
      facet_grid(facets = year ~ .) +
      scale_x_continuous(breaks = c(1:54)) +
      labs(x = "Week of year", y = "Entries",
      title="Google Location History: Tracks per week") +
      theme_bw()
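
The weekly plot groups by ISO week and ISO year, which do not always agree with the calendar year around New Year. A short base R check, using the `%V` (ISO week) and `%G` (ISO year) format codes on an example date chosen for illustration:

```r
# 2018-12-31 is a Monday that already belongs to ISO week 1 of 2019.
d <- as.Date("2018-12-31")

iso_week <- format(d, "%V")   # ISO 8601 week number
iso_year <- format(d, "%G")   # ISO 8601 week-based year
cal_year <- format(d, "%Y")   # calendar year

c(iso_week = iso_week, iso_year = iso_year, cal_year = cal_year)
```

Pairing the ISO week with the ISO year, as the transformation step does, keeps such days in the correct facet. Pairing the ISO week with the calendar year would place this date in week 1 of 2018.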

Accuracy of the measurements

The mean accuracy value is 270, while the median is only 24; the mean is pulled up by a few extreme values (the maximum is 4,984,961). The next figure shows the distribution of the accuracy for values below 500.

summary(df.map$accuracy)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##       1      20      24     270      29 4984961       3
# Base indexing keeps the 3 rows with NA accuracy as all-NA rows, which triggers the warning below
df.map[df.map$accuracy<500,] %>%
ggplot(aes(x=accuracy))+
  geom_density(size=1, col='grey')+ 
#  coord_cartesian(xlim=c(0,2000)) +
  theme_bw() 
## Warning: Removed 3 rows containing non-finite values (stat_density).
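
The warning comes from how base R indexing treats NA in a logical condition. A toy example, with made-up values, shows the difference between plain indexing and an explicit NA check:

```r
# Toy data with one missing accuracy value.
toy <- data.frame(accuracy = c(20L, NA, 600L, 24L))

# Plain logical indexing: the NA comparison yields an all-NA row.
base_subset <- toy[toy$accuracy < 500, , drop = FALSE]

# Excluding NA explicitly keeps only the rows that satisfy the condition.
clean_subset <- toy[!is.na(toy$accuracy) & toy$accuracy < 500, , drop = FALSE]

nrow(base_subset)    # 3: two valid rows plus one NA row
nrow(clean_subset)   # 2
```

`dplyr::filter(accuracy < 500)` behaves like the second form, since it drops rows where the condition evaluates to NA.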

Altitude Variation

The next figure shows the variation in altitude during the year 2017.

df.map %>% filter(!is.na(altitude) & year==2017)  %>%
   ggplot(aes(x=as.Date(date),y=altitude)) +
   geom_point() +
   theme_bw() + theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
   scale_x_date(breaks = function(x) seq.Date(from = min(x), to = max(x), by = "1 week"),
                minor_breaks = function(x) seq.Date(from = min(x), to = max(x), by = "1 week")) +
   ggtitle("Altitude variation 2017") + labs(x="Date")

The following figure shows the variation during the year 2018.

df.map %>% filter(!is.na(altitude) & year==2018)  %>%
   ggplot(aes(x=as.Date(date),y=altitude)) +
   geom_point() +
   theme_bw() + theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
   scale_x_date(breaks = function(x) seq.Date(from = min(x), to = max(x), by = "1 week"),
                minor_breaks = function(x) seq.Date(from = min(x), to = max(x), by = "1 week")) +
   ggtitle("Altitude variation 2018") + labs(x="Date")

The next figure shows the altitude variation during 2019.

df.map %>% filter(!is.na(altitude) & year==2019)  %>%
   ggplot(aes(x=as.Date(date),y=altitude)) +
   geom_point() +
   theme_bw() + theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
   scale_x_date(breaks = function(x) seq.Date(from = min(x), to = max(x), by = "1 week"),
                minor_breaks = function(x) seq.Date(from = min(x), to = max(x), by = "1 week")) +
   ggtitle("Altitude variation 2019") + labs(x="Date")
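
The three yearly plots differ only in the year, so they could be produced by one function. This is a sketch, not the code used for the report; `plot_altitude_year` and `weekly_breaks` are hypothetical helpers, and the data frame is assumed to have the `date`, `altitude` and `year` columns created earlier.

```r
library(dplyr)
library(ggplot2)

# One break per week across the plotted date range.
weekly_breaks <- function(x) seq.Date(from = min(x), to = max(x), by = "1 week")

# Scatter plot of altitude against date for a single year.
plot_altitude_year <- function(data, yr) {
  data %>%
    filter(!is.na(altitude), year == yr) %>%
    ggplot(aes(x = date, y = altitude)) +
    geom_point() +
    theme_bw() +
    theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
    scale_x_date(breaks = weekly_breaks, minor_breaks = weekly_breaks) +
    ggtitle(paste("Altitude variation", yr)) +
    labs(x = "Date")
}

# Toy data standing in for df.map.
toy <- data.frame(date = as.Date("2018-01-01") + 0:20,
                  altitude = c(NA, 600 + 0:19),
                  year = 2018)
p <- plot_altitude_year(toy, 2018)
```

With the real data, `plot_altitude_year(df.map, 2017)` and the equivalent calls for 2018 and 2019 would replace the three repeated blocks.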

The next figure shows the altitude variation during the period that Google has been collecting data.

df.map %>% filter(!is.na(altitude)) %>% arrange(date) %>%
  ggplot(aes(x=as.Date(date),y=altitude)) +
  geom_point() +
  theme_bw() + theme(axis.text.x = element_text(angle = 90, hjust = 1))+
  scale_x_date(breaks = function(x) seq.Date(from = min(x), to = max(x), by = "1 month"),
                minor_breaks = function(x) seq.Date(from = min(x), to = max(x), by = "1 month")) +
   ggtitle("Altitude Variation by month") + labs(x="Date")

Locations visited during 2019

map2019 <- df.map %>% filter(year==2019)
myMap = leaflet(map2019) %>% 
  addProviderTiles(providers$CartoDB.Positron) %>%
  fitBounds(~min(longitude), ~min(latitude), ~max(longitude), ~max(latitude)) %>%  
  addHeatmap(lng = ~longitude, lat = ~latitude, group = "HeatMap", blur = 20, max = 0.01, radius = 15) %>%
  addMarkers(data = map2019, ~longitude, ~latitude, clusterOptions = markerClusterOptions(), group = "Points")

myMap

Locations visited during 2018

map2018 <- df.map %>% filter(year==2018)
myMap = leaflet(map2018) %>% 
  addProviderTiles(providers$CartoDB.Positron) %>%
  fitBounds(~min(longitude), ~min(latitude), ~max(longitude), ~max(latitude)) %>%  
  addHeatmap(lng = ~longitude, lat = ~latitude, group = "HeatMap", blur = 20, max = 0.01, radius = 15) %>%
  addMarkers(data = map2018, ~longitude, ~latitude, clusterOptions = markerClusterOptions(), group = "Points")

myMap
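
Each yearly map receives a few hundred thousand points (Table 4), which may make the marker layer slow to render. One possible way to lighten it, shown here only as a sketch on made-up coordinates, is to round the coordinates and keep one point per rounded location before calling `addMarkers()`. The three-decimal precision and the names `toy`, `lat_r`, `lon_r` and `thinned` are assumptions for illustration.

```r
# Toy coordinates: the first two points are only a few metres apart.
toy <- data.frame(latitude  = c(40.32408, 40.32411, 40.41000),
                  longitude = c(-3.77781, -3.77779, -3.70000))

# Round to three decimals (about 100 m in latitude) to group nearby points.
toy$lat_r <- round(toy$latitude, 3)
toy$lon_r <- round(toy$longitude, 3)

# Keep the first point of each rounded location.
thinned <- toy[!duplicated(toy[c("lat_r", "lon_r")]), ]

nrow(thinned)   # 2
```

The heatmap layer could still use the full data, since it depends on point density rather than on individual markers.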