Google Location History (GLH) can be downloaded from your Google account at https://takeout.google.com/settings/takeout. Google provides the data as a .json file, which can be loaded with the jsonlite package. Loading this file into R might take a few minutes, depending on how many location points Google has saved about you.
The packages used to generate this report, loaded in the snippet after the list, are:
* jsonlite
* dplyr
* ggplot2
* pander
* lubridate
* leaflet
* leaflet.extras
* scales
* kableExtra
* knitr
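For reproducibility, the following snippet loads all of them (assuming they are already installed):
library(jsonlite)
library(dplyr)
library(ggplot2)
library(pander)
library(lubridate)
library(leaflet)
library(leaflet.extras)
library(scales)
library(kableExtra)
library(knitr)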
First, we need to load the JSON file into R and create a data frame. The location data is stored under the locations attribute.
datos <- fromJSON("Location HistoryLast.json")
# inspect the structure of the imported object
class(datos); attributes(datos); class(datos$locations)
## [1] "list"
## $names
## [1] "locations"
## [1] "data.frame"
# extract location dataframe
df <- datos$locations
Let’s get a glimpse of the data before we start cleaning it. There are 1,887,207 observations and 9 variables. Table 1 shows the number of missing values in each variable.
glimpse(df)
## Observations: 1,887,207
## Variables: 9
## $ timestampMs <chr> "1307998861249", "1307998872287", "1307998876320",...
## $ latitudeE7 <int> 403240840, 403242250, 403245230, 403240310, 403240...
## $ longitudeE7 <int> -37778120, -37776070, -37778520, -37775600, -37775...
## $ accuracy <int> 232, 93, 46, 34, 22, 22, 22, 25, 2, 2, 2, 2, 2, 2,...
## $ activity <list> [NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, ...
## $ velocity <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ altitude <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ heading <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ verticalAccuracy <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
# count missing values per variable and display them as a one-column table
y <- t(t(sapply(df, function(col) sum(is.na(col)))))
colnames(y) <- "NAs"
kable(y, caption = "Table 1. Number of NAs") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"),
                full_width = FALSE)
| Variable | NAs |
|---|---|
| timestampMs | 0 |
| latitudeE7 | 0 |
| longitudeE7 | 0 |
| accuracy | 3 |
| activity | 0 |
| velocity | 1811521 |
| altitude | 1539197 |
| heading | 1835180 |
| verticalAccuracy | 1543999 |
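The sensor-related variables (velocity, altitude, heading, verticalAccuracy) are missing for the large majority of records. To put the counts in perspective, a quick sketch of the percentage missing per variable:
# share of missing values per variable, in percent
round(sapply(df, function(col) mean(is.na(col))) * 100, 1)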
Tables 2 and 3 show the likely meaning of the attributes present in the Google Location History.
Table 2. Attributes of the Google Location History
| Attribute | Meaning |
|---|---|
| timestampMs | Timestamp in milliseconds at which the observation was recorded |
| latitudeE7 | Latitude of the observation, stored as an integer (degrees × 1e7) |
| longitudeE7 | Longitude of the observation, stored as an integer (degrees × 1e7) |
| accuracy | Google’s estimate of how accurate the data is |
| activity | List of activities (Table 3) |
| velocity | This could refer to the speed of the device at capture time |
| altitude | Altitude of the observation |
| heading | Direction the device is traveling |
| verticalAccuracy | This could refer to the accuracy of the vertical location of the device |
Table 3. Activity
| Attribute | Meaning |
|---|---|
| activity.type | Google’s inference of what the user was potentially doing; a single record may list several candidate types, and there are many possible values |
| activity.confidence | Google assigns a confidence value for the activity type guessed |
| activity.timestampMs | Timestamp in milliseconds for the recorded activity |
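The activity column is a nested list, so it does not fit in a flat table. Below is a minimal sketch, using purrr, of pulling out the most confident activity type per record; it assumes each non-NULL element of df$activity is a data frame whose activity column holds a nested data frame with type and confidence, which matches how fromJSON typically parses the GLH file, but verify against your own export:
# extract the highest-confidence activity type from each record (NA when absent)
library(purrr)
top_activity <- map_chr(df$activity, function(act) {
  if (is.null(act) || length(act$activity) == 0) return(NA_character_)
  inner <- act$activity[[1]]  # first recorded set of candidate activities
  inner$type[which.max(inner$confidence)]
})
table(top_activity, useNA = "ifany")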
Next, we transform some of the data into a more readable form and extract some information from the timestamps recorded by Google.
## Convert the positions and timestamps into a more readable form
df.map <- df %>%
  mutate(time = as_datetime(as.numeric(timestampMs) / 1000),
         date = date(time),
         hour.min = paste(hour(time), minute(time), sep = ":"),
         week = isoweek(time),
         year = isoyear(time),
         latitude = latitudeE7 / 1e7,
         longitude = longitudeE7 / 1e7) %>%
  select(-timestampMs, -latitudeE7, -longitudeE7, -time, -activity)
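As a quick sanity check of the conversion (a sketch), the rescaled coordinates should fall within valid ranges:
# latitudes should lie in [-90, 90], longitudes in [-180, 180]
range(df.map$latitude); range(df.map$longitude)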
The downloaded GLH file contains data from 2011-06-13 until 2020-02-22, covering 2,934 distinct days.
summary(df.map$date)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## "2011-06-13" "2014-11-04" "2016-12-22" "2016-09-16" "2018-06-15" "2020-02-22"
n_distinct(df.map$date)
## [1] 2934
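Not every calendar day in that span has data. A quick sketch of the coverage:
# distinct days with data vs. calendar days between first and last observation
span_days <- as.integer(max(df.map$date) - min(df.map$date)) + 1
n_distinct(df.map$date) / span_days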
kable(df.map %>% group_by(year) %>% summarise(n = n()),
      col.names = c("Year", "Observations"), align = c('c', 'r'),
      caption = "Table 4. Data collected by year") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"),
                full_width = FALSE)
| Year | Observations |
|---|---|
| 2011 | 23588 |
| 2012 | 18416 |
| 2013 | 219941 |
| 2014 | 240655 |
| 2015 | 177286 |
| 2016 | 270439 |
| 2017 | 353040 |
| 2018 | 275022 |
| 2019 | 276393 |
| 2020 | 32427 |
df.map %>% group_by(week,year) %>% summarise(n = n()) %>%
ggplot( aes(x=week, y=n)) +
geom_bar(stat="identity") +
facet_grid(facets = year ~ .) +
scale_x_continuous(breaks = c(1:54)) +
labs(x = "Week of year", y = "Entries",
title="Google Location History: Tracks per week") +
theme_bw()
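To identify the individual weeks with the most recorded points, a quick check (a sketch):
# busiest year/week combinations
df.map %>% count(year, week, sort = TRUE) %>% head()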
The mean accuracy is 270, while the median is only 24, so the distribution has a long right tail (the maximum is 4,984,961). The next figure shows the distribution of accuracy for values below 500.
summary(df.map$accuracy)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 1 20 24 270 29 4984961 3
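Before plotting, it is worth checking how much of the data the cutoff of 500 retains (a quick sketch):
# fraction of observations with accuracy below 500
mean(df.map$accuracy < 500, na.rm = TRUE)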
# filter() also drops the 3 rows with NA accuracy, which would otherwise
# trigger a stat_density warning about non-finite values
df.map %>% filter(accuracy < 500) %>%
  ggplot(aes(x = accuracy)) +
  geom_density(size = 1, col = 'grey') +
  theme_bw()
The next figure shows the variation in altitude during the year 2017.
df.map %>% filter(!is.na(altitude) & year==2017) %>%
ggplot(aes(x=as.Date(date),y=altitude)) +
geom_point() +
theme_bw() + theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
scale_x_date(breaks = function(x) seq.Date(from = min(x), to = max(x), by = "1 week"),
minor_breaks = function(x) seq.Date(from = min(x), to = max(x), by = "1 week")) +
ggtitle("Altitude variation 2017") + labs(x="Date")
The following figure shows the altitude variation during 2018.
df.map %>% filter(!is.na(altitude) & year==2018) %>%
ggplot(aes(x=as.Date(date),y=altitude)) +
geom_point() +
theme_bw() + theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
scale_x_date(breaks = function(x) seq.Date(from = min(x), to = max(x), by = "1 week"),
minor_breaks = function(x) seq.Date(from = min(x), to = max(x), by = "1 week")) +
ggtitle("Altitude variation 2018") + labs(x="Date")
The next figure shows the altitude variation during 2019.
df.map %>% filter(!is.na(altitude) & year==2019) %>%
ggplot(aes(x=as.Date(date),y=altitude)) +
geom_point() +
theme_bw() + theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
scale_x_date(breaks = function(x) seq.Date(from = min(x), to = max(x), by = "1 week"),
minor_breaks = function(x) seq.Date(from = min(x), to = max(x), by = "1 week")) +
ggtitle("Altitude variation 2019") + labs(x="Date")
The next figure shows the altitude variation during the period that Google has been collecting data.
df.map %>% filter(!is.na(altitude)) %>% arrange(date) %>%
ggplot(aes(x=as.Date(date),y=altitude)) +
geom_point() +
theme_bw() + theme(axis.text.x = element_text(angle = 90, hjust = 1))+
scale_x_date(breaks = function(x) seq.Date(from = min(x), to = max(x), by = "1 month"),
minor_breaks = function(x) seq.Date(from = min(x), to = max(x), by = "1 month")) +
ggtitle("Altitude Variation by month") + labs(x="Date")
map2019 <- df.map %>% filter(year == 2019)
myMap <- leaflet(map2019) %>%
  addProviderTiles(providers$CartoDB.Positron) %>%
  fitBounds(~min(longitude), ~min(latitude), ~max(longitude), ~max(latitude)) %>%
  addHeatmap(lng = ~longitude, lat = ~latitude, group = "HeatMap",
             blur = 20, max = 0.01, radius = 15) %>%
  addMarkers(~longitude, ~latitude, clusterOptions = markerClusterOptions(),
             group = "Points")
myMap
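Because both layers are assigned to named groups, a layer switcher can be added so the reader can toggle between the two views (a sketch using leaflet's addLayersControl):
# add an expanded control to toggle the "HeatMap" and "Points" groups
myMap %>%
  addLayersControl(overlayGroups = c("HeatMap", "Points"),
                   options = layersControlOptions(collapsed = FALSE))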