About this Notebook





Analytics Toolkit: Required Packages



Install required packages

install.packages(c("rvest", 
                   "lubridate", 
                   "wordcloud", 
                   "tm"), dependencies = TRUE)
Warning in install.packages :
  dependencies ‘Rcampdf’, ‘Rgraphviz’, ‘tm.lexicon.GeneralInquirer’ are not available
trying URL 'https://cran.rstudio.com/bin/macosx/el-capitan/contrib/3.4/rvest_0.3.2.tgz'
Content type 'application/x-gzip' length 852813 bytes (832 KB)
==================================================
downloaded 832 KB

trying URL 'https://cran.rstudio.com/bin/macosx/el-capitan/contrib/3.4/lubridate_1.7.2.tgz'
Content type 'application/x-gzip' length 1245093 bytes (1.2 MB)
==================================================
downloaded 1.2 MB

trying URL 'https://cran.rstudio.com/bin/macosx/el-capitan/contrib/3.4/wordcloud_2.5.tgz'
Content type 'application/x-gzip' length 143945 bytes (140 KB)
==================================================
downloaded 140 KB

trying URL 'https://cran.rstudio.com/bin/macosx/el-capitan/contrib/3.4/tm_0.7-3.tgz'
Content type 'application/x-gzip' length 969407 bytes (946 KB)
==================================================
downloaded 946 KB

The downloaded binary packages are in
    /var/folders/tx/fd4xqmys47lcf6p2b26w8pzm0000gn/T//RtmpCjbqMs/downloaded_packages


Load required packages

library(tm)
Loading required package: NLP
library(rvest)
Loading required package: xml2
library(lubridate)

Attaching package: ‘lubridate’

The following object is masked from ‘package:base’:

    date
library(tidyverse)
── Attaching packages ────────────────────────────────────────── tidyverse 1.2.1 ──
✔ ggplot2 2.2.1     ✔ purrr   0.2.4
✔ tibble  1.4.1     ✔ dplyr   0.7.4
✔ tidyr   0.7.2     ✔ stringr 1.2.0
✔ readr   1.1.1     ✔ forcats 0.2.0
── Conflicts ───────────────────────────────────────────── tidyverse_conflicts() ──
✖ ggplot2::annotate()      masks NLP::annotate()
✖ lubridate::as.difftime() masks base::as.difftime()
✖ lubridate::date()        masks base::date()
✖ dplyr::filter()          masks stats::filter()
✖ readr::guess_encoding()  masks rvest::guess_encoding()
✖ lubridate::intersect()   masks base::intersect()
✖ dplyr::lag()             masks stats::lag()
✖ purrr::pluck()           masks rvest::pluck()
✖ lubridate::setdiff()     masks base::setdiff()
✖ lubridate::union()       masks base::union()
library(wordcloud)
Loading required package: RColorBrewer



Data Collection: Claiming your Google Search Data



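1) Sign in to your Google account, then go to: https://myaccount.google.com/privacy

2) Find the link to download your data archive, or go to: https://takeout.google.com/settings/takeout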

3) Select all Google products to create a complete archive of your data


4) After selecting the products, choose the file type and maximum archive size to make sure that all your account data is archived



Data Preparation: Extracting Google Search Information



Locate the Google archive, then find the search data. In this case, it is an HTML file named “MyActivity.html”, located in the “Search” folder inside the “My Activity” folder

  • Takeout -> My Activity -> Search -> MyActivity.html

Using the rvest package we can read the HTML document that contains the Google search data

doc <- "Takeout/My Activity/Search/MyActivity.html"
search_archive <- read_html(doc)

Leveraging regular expressions (regex), we can extract relevant information from the HTML document:


Extract Search Time

date_search <- search_archive %>% 
  html_nodes(xpath = '//div[@class="mdl-grid"]/div/div') %>% 
  str_extract(pattern = "(?<=<br>)(.*)(?<=PM|AM)") %>%
  mdy_hms()
 2 failed to parse.


Extract Search Text

text_search <- search_archive %>% 
  html_nodes(xpath = '//div[@class="mdl-grid"]/div/div') %>%
  str_extract(pattern = '(?<=<a)(.*)(?=</a>)') %>% 
  str_extract(pattern = '(?<=\">)(.*)')


Extract Search Type

type_search <- search_archive %>% 
  html_nodes(xpath = '//div[@class="mdl-grid"]/div/div') %>% 
  str_extract(pattern = "(?<=mdl-typography--body-1\">)(.*)(?=<a)") %>% 
  str_extract(pattern = "(\\w+)(?=\\s)")


Create a data frame using the data extracted from the html file

search_data <- tibble(timestamp = date_search,
                      date = as_date(date_search),
                      year = year(date_search),
                      month = month(date_search, label = TRUE),
                      day = weekdays(date_search),
                      hour = hour(date_search),
                      type = type_search,
                      search = text_search)
search_data$day <- factor(search_data$day, 
                          levels = c("Sunday", "Monday", "Tuesday",
                                     "Wednesday","Thursday", "Friday",
                                     "Saturday"))
search_data <- na.omit(search_data)
head(search_data)



Data Analysis: Visualizing Google Searches



To get an overall idea of the search volume, we can plot searches by year

p <- ggplot(search_data, aes(year))
p + geom_bar()


After determining the years with the largest search volume, we can plot monthly searches

monthly <- search_data[(search_data$year > 2014 & search_data$year< 2018), ]
ggplot(monthly) + geom_bar(aes(x = month, group = year)) +
  theme(axis.text.x = element_text(angle=90)) +
  facet_grid(.~year, scales="free")


Another interesting metric is searches by hour

p <- ggplot(search_data, aes(hour))
p + geom_bar()


We can also plot the search data by day of the week to determine which days are the most active

p <- ggplot(search_data, aes(day))
p + geom_bar()


We can take it a step further and group search time by day of the week.

ggplot(search_data) + 
  geom_bar(aes(x = hour, group = day) ) +
  facet_grid(.~day, scales = "free")


We can group the search data by year and day of the week, to visualize the overall trend

wkday <- group_by(search_data, year, day) %>% summarize(count = n())
p <- ggplot(wkday, aes(day, count, fill = year)) 
p + geom_bar(stat = "identity") + labs(x = "", y = "Search Volume")



Reporting: A Wordcloud from Google Search Data



First we need to extract the text and clean it using regular expressions

search <- tolower(search_data$search)
search <- iconv(search, "ASCII", "UTF-8", " ")
search <- gsub('(http|https)\\S+\\s*|(#|@)\\S+\\s*|\\n|\\"', " ", search)
search <- gsub("(.*.)\\.com(.*.)\\S+\\s|[^[:alnum:]]", " ", search)
search <- trimws(search)


After cleaning the text we can create a Text Corpus (a large and structured set of texts) and remove some words

search_corpus <-  Corpus(VectorSource(search))
search_corpus <- tm_map(search_corpus, content_transformer(removePunctuation))
search_corpus <- tm_map(search_corpus, content_transformer(removeNumbers))
stopwords <- c(stopwords("english"), "chrome", "chicago", "jlroo", "google")
search_corpus <- tm_map(search_corpus, removeWords, stopwords)


Now from the corpus we need to create a Term Document Matrix in order to create word associations and a wordcloud

search_tdm <- TermDocumentMatrix(search_corpus)
search_matrix <- as.matrix(search_tdm)


Using the Term Document Matrix we can create a data frame with words and their frequencies

v <- sort(rowSums(search_matrix), decreasing = TRUE)
tw_names <- names(v)
d <- data.frame(word = tw_names, freq = v)


Set a threshold for the min/max frequency of words to create the wordcloud

wordcloud(d$word, d$freq, min.freq = 50, scale = c(3 , 0.5), max.words = 200)

---
title: "Analyzing Google Search History"
author: "Jose Luis Rodriguez"
output:
  html_notebook: default
  html_document: default
date: "January 30, 2018"
subtitle: "CME Group Foundation Business Analytics Lab"
---

<br>

--------------

## About this Notebook

--------------

<br>

* The Google search data in this notebook comes from a Google account archive

* The steps outlined here to collect and analyze the data may change at any time

* Below are the steps to claim your Google account data


<br>

--------------

## Analytics Toolkit: Required Packages

--------------

<br>

**Install required packages**

* Packages: tidyverse, lubridate, rvest, tm, wordcloud

```{r}

# tidyverse is loaded below with library(), so it is installed here as well
install.packages(c("tidyverse",
                   "rvest", 
                   "lubridate", 
                   "wordcloud", 
                   "tm"), dependencies = TRUE)

```
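
If some of these packages are already installed, reinstalling them is not necessary. The sketch below uses only base R to list the packages that are still missing. The helper name `missing_packages` is an illustrative assumption and is not part of the original notebook.

```r
# Hypothetical helper: return the names of packages that are not yet installed.
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

required <- c("tidyverse", "lubridate", "rvest", "tm", "wordcloud")
to_install <- missing_packages(required)
print(to_install)

# Uncomment to install only what is missing:
# if (length(to_install) > 0) install.packages(to_install, dependencies = TRUE)
```

The install call is left commented out so that running the chunk never triggers a download by accident.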


<br>

**Load required packages**
```{r}

library(tm)
library(rvest)
library(lubridate)
library(tidyverse)
library(wordcloud)

```

<br>

--------------

## Data Collection: Claiming your Google Search Data

--------------

<br>

#### 1) Sign in to your Google account, then go to:
* https://myaccount.google.com/privacy

#### 2) Find the link to download your data archive, or go to:
* https://takeout.google.com/settings/takeout

```{r, echo=FALSE}

knitr::include_graphics('imgs/img01.png')

```

<br>

#### 3) Select all Google products to create a complete archive of your data

```{r, echo=FALSE}

knitr::include_graphics('imgs/img02.png')

```

<br>

#### 4) After selecting the products, choose the file type and maximum archive size to make sure that all your account data is archived

```{r, echo=FALSE}

knitr::include_graphics('imgs/img03.png')

```

<br>

--------------

## Data Preparation: Extracting Google Search Information

--------------

<br>

#### Locate the Google archive, then find the search data. In this case, it is an HTML file named "MyActivity.html", located in the "Search" folder inside the "My Activity" folder

* Takeout -> My Activity -> Search -> MyActivity.html
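
Because the folder layout of the archive may change, it can help to search for the file instead of typing the path by hand. The following base R sketch builds a throwaway folder that mimics the layout described above (an assumption for illustration only) and then looks for the file recursively.

```r
# Build a throwaway folder that mimics the assumed Takeout layout.
root <- file.path(tempdir(), "Takeout")
search_dir <- file.path(root, "My Activity", "Search")
dir.create(search_dir, recursive = TRUE, showWarnings = FALSE)
file.create(file.path(search_dir, "MyActivity.html"))

# Search the extracted archive recursively for the activity file.
found <- list.files(root, pattern = "^MyActivity\\.html$",
                    recursive = TRUE, full.names = TRUE)
print(found)
```

With a real archive, `root` would point at the extracted "Takeout" folder and the first four lines would be dropped.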

#### Using the rvest package, we can read the HTML document that contains the Google search data

```{r}

doc <- "Takeout/My Activity/Search/MyActivity.html"
search_archive <- read_html(doc)

```
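
To see what the XPath used in the next steps selects, it can be tried on a small inline document first. The snippet below is a hypothetical single-entry example whose markup is inferred from the XPath and regex patterns in this notebook; the real archive may be structured differently.

```r
library(rvest)

# Hypothetical single-entry snippet; the real archive layout may differ.
sample_html <- paste0(
  '<html><body><div class="mdl-grid"><div class="outer-cell">',
  '<div class="content-cell mdl-typography--body-1">Searched for ',
  '<a href="https://www.google.com/search?q=r+lubridate">r lubridate</a>',
  '<br>Jan 30, 2018, 9:15:42 PM</div></div></div></body></html>'
)

# read_html() accepts a literal HTML string as well as a file path.
sample_archive <- read_html(sample_html)

# Same XPath as the notebook: the inner cells of each "mdl-grid" block.
cells <- html_nodes(sample_archive,
                    xpath = '//div[@class="mdl-grid"]/div/div')

length(cells)     # number of matched cells
html_text(cells)  # visible text of each cell
```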

<br>

--------------

### Leveraging regular expressions (regex), we can extract relevant information from the HTML document:

<br>

#### Extract Search Time

```{r}

date_search <- search_archive %>% 
  html_nodes(xpath = '//div[@class="mdl-grid"]/div/div') %>% 
  str_extract(pattern = "(?<=<br>)(.*)(?<=PM|AM)") %>%
  mdy_hms()

```
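
The pattern above uses lookbehinds to keep only the text after `<br>` up to and including the AM/PM marker. A minimal sketch on a single hypothetical entry (the markup and timestamp format are assumptions) shows both a successful parse and a failed one; entries that fail to parse become `NA`.

```r
library(stringr)
library(lubridate)

# Hypothetical cell markup; the timestamp format is an assumption.
cell <- 'Searched for <a href="https://www.google.com/search?q=r">r</a><br>Jan 30, 2018, 9:15:42 PM'

# Keep everything after "<br>" up to and including the AM/PM marker.
stamp <- str_extract(cell, "(?<=<br>)(.*)(?<=PM|AM)")

# quiet = TRUE suppresses the "failed to parse" warning; failures become NA.
parsed <- mdy_hms(c(stamp, "not a timestamp"), quiet = TRUE)
is.na(parsed)
```

Rows with an `NA` timestamp are removed later, when `na.omit()` is applied to the data frame.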

<br>

#### Extract Search Text

```{r}

text_search <- search_archive %>% 
  html_nodes(xpath = '//div[@class="mdl-grid"]/div/div') %>%
  str_extract(pattern = '(?<=<a)(.*)(?=</a>)') %>% 
  str_extract(pattern = '(?<=\">)(.*)')

```

<br>

#### Extract Search Type

```{r}

type_search <- search_archive %>% 
  html_nodes(xpath = '//div[@class="mdl-grid"]/div/div') %>% 
  str_extract(pattern = "(?<=mdl-typography--body-1\">)(.*)(?=<a)") %>% 
  str_extract(pattern = "(\\w+)(?=\\s)")

```
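
Both the search text and the search type are extracted in two passes: a broad match first, then a narrower one on the result. The sketch below walks through the two passes on one hypothetical cell, whose markup is inferred from the patterns above and may not match every archive.

```r
library(stringr)

# Hypothetical cell markup, inferred from the patterns used above.
cell <- paste0(
  '<div class="content-cell mdl-typography--body-1">Searched for ',
  '<a href="https://www.google.com/search?q=r+lubridate">r lubridate</a>',
  '<br>Jan 30, 2018, 9:15:42 PM</div>'
)

# Search text: take everything between "<a" and "</a>", then drop the attributes.
inside_link <- str_extract(cell, '(?<=<a)(.*)(?=</a>)')
search_text <- str_extract(inside_link, '(?<=\">)(.*)')

# Search type: take the text before the link, then keep its first word.
before_link <- str_extract(cell, "(?<=mdl-typography--body-1\">)(.*)(?=<a)")
search_type <- str_extract(before_link, "(\\w+)(?=\\s)")

search_text  # "r lubridate"
search_type  # "Searched"
```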

<br>

#### Create a data frame using the data extracted from the html file

```{r}

search_data <- tibble(timestamp = date_search,
                      date = as_date(date_search),
                      year = year(date_search),
                      month = month(date_search, label = TRUE),
                      day = weekdays(date_search),
                      hour = hour(date_search),
                      type = type_search,
                      search = text_search)

search_data$day <- factor(search_data$day, 
                          levels = c("Sunday", "Monday", "Tuesday",
                                     "Wednesday","Thursday", "Friday",
                                     "Saturday"))

search_data <- na.omit(search_data)

head(search_data)

```
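
One caveat: `weekdays()` returns day names in the language of the current R session, so the English factor levels above may produce `NA` values in a non-English locale, and `na.omit()` would then drop those rows. As a possible alternative (a sketch, not the notebook's method), `lubridate::wday()` can return an ordered factor directly.

```r
library(lubridate)

stamps <- ymd_hms(c("2018-01-30 21:15:42", "2018-01-28 08:00:00"))

# wday() returns an ordered factor whose levels already run Sunday..Saturday
# (with the default week start), so no manual factor() call is needed.
# The labels themselves follow the session locale.
day_of_week <- wday(stamps, label = TRUE, abbr = FALSE)

levels(day_of_week)
as.integer(day_of_week)  # 1 = Sunday with the default week start
```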

<br>

--------------

## Data Analysis: Visualizing Google Searches

--------------

<br>

#### To get an overall idea of the search volume, we can plot searches by year 

```{r}

p <- ggplot(search_data, aes(year))
p + geom_bar()

```

<br>

#### After determining the years with the largest search volume, we can plot monthly searches

```{r}

monthly <- search_data[(search_data$year > 2014 & search_data$year< 2018), ]

ggplot(monthly) + geom_bar(aes(x = month, group = year)) +
  theme(axis.text.x = element_text(angle=90)) +
  facet_grid(.~year, scales="free")

```
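
The year filter above uses base R indexing. For comparison, a dplyr version with `between()` might look like the sketch below; the toy data is made up and only stands in for `search_data`.

```r
library(dplyr)

# Toy data standing in for search_data; the values are made up.
toy <- tibble(year = c(2013, 2015, 2016, 2017, 2018))

# Base R indexing, as in the notebook.
base_subset <- toy[toy$year > 2014 & toy$year < 2018, ]

# dplyr equivalent; between() includes both end points.
dplyr_subset <- filter(toy, between(year, 2015, 2017))

identical(base_subset$year, dplyr_subset$year)
```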

<br>

#### Another interesting metric is searches by hour

```{r}

p <- ggplot(search_data, aes(hour))
p + geom_bar()

```

<br>

#### We can also plot the search data by day of the week to determine which days are the most active

```{r}

p <- ggplot(search_data, aes(day))
p + geom_bar()

```

<br>

#### We can take it a step further and group search time by day of the week.

```{r}

ggplot(search_data) + 
  geom_bar(aes(x = hour, group = day) ) +
  facet_grid(.~day, scales = "free")

```

<br>

#### We can group the search data by year and day of the week, to visualize the overall trend 

```{r}

wkday <- group_by(search_data, year, day) %>% summarize(count = n())
p <- ggplot(wkday, aes(day, count, fill = year)) 
p + geom_bar(stat = "identity") + labs(x = "", y = "Search Volume")

```
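
Because `year` is numeric, ggplot2 maps it to a continuous colour gradient. If one distinct colour per year is preferred, the year can be converted to a factor, as in this sketch on made-up data. It also uses `count()`, which is shorthand for `group_by()` followed by `summarize(n())`.

```r
library(dplyr)
library(ggplot2)

# Made-up data standing in for search_data.
toy <- tibble(
  year = c(2016, 2016, 2017, 2017, 2017),
  day  = factor(c("Monday", "Monday", "Monday", "Friday", "Friday"),
                levels = c("Monday", "Friday"))
)

# One row per year/day combination with its number of searches.
toy_counts <- count(toy, year, day, name = "count")

# factor(year) gives one discrete colour per year instead of a gradient.
p_toy <- ggplot(toy_counts, aes(day, count, fill = factor(year))) +
  geom_col() +
  labs(x = "", y = "Search Volume", fill = "Year")
p_toy
```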

<br>

--------------

## Reporting: A Wordcloud from Google Search Data

--------------

<br>

#### First we need to extract the text and clean it using regular expressions

```{r}

search <- tolower(search_data$search)
search <- iconv(search, "ASCII", "UTF-8", " ")
search <- gsub('(http|https)\\S+\\s*|(#|@)\\S+\\s*|\\n|\\"', " ", search)
search <- gsub("(.*.)\\.com(.*.)\\S+\\s|[^[:alnum:]]", " ", search)
search <- trimws(search)

```
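
To check what the cleaning steps do, they can be wrapped in a function and tried on a few sample strings. The function name `clean_search_text` and the sample queries are assumptions added for illustration; the steps themselves are the ones shown above.

```r
# Hypothetical wrapper around the cleaning steps shown above.
clean_search_text <- function(x) {
  x <- tolower(x)
  x <- iconv(x, "ASCII", "UTF-8", " ")
  x <- gsub('(http|https)\\S+\\s*|(#|@)\\S+\\s*|\\n|\\"', " ", x)
  x <- gsub("(.*.)\\.com(.*.)\\S+\\s|[^[:alnum:]]", " ", x)
  trimws(x)
}

# Made-up queries: one with a URL, one with an apostrophe and a hashtag.
samples <- c("R ggplot2 tutorial https://example.com/page",
             "what's #rstats?")
clean_search_text(samples)
```

Note that every non-alphanumeric character becomes a space, so contractions such as "what's" are split into two tokens.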

<br>

#### After cleaning the text we can create a Text Corpus (a large and structured set of texts) and remove some words 

```{r}

search_corpus <-  Corpus(VectorSource(search))
search_corpus <- tm_map(search_corpus, content_transformer(removePunctuation))
search_corpus <- tm_map(search_corpus, content_transformer(removeNumbers))
stopwords <- c(stopwords("english"), "chrome", "chicago", "jlroo", "google")
search_corpus <- tm_map(search_corpus, removeWords, stopwords)

```

<br>

#### Now from the corpus we need to create a Term Document Matrix in order to create word associations and a wordcloud

```{r}

search_tdm <- TermDocumentMatrix(search_corpus)
search_matrix <- as.matrix(search_tdm)

```


<br>

#### Using the Term Document Matrix we can create a data frame with words and their frequencies

```{r}

v <- sort(rowSums(search_matrix), decreasing = TRUE)
tw_names <- names(v)
d <- data.frame(word = tw_names, freq = v)

```
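
The same pipeline can be followed end to end on a tiny corpus, where the counts are easy to verify by hand. The three queries below are made up for illustration.

```r
library(tm)

# Made-up queries standing in for the cleaned search text.
toy_queries <- c("ggplot tutorial", "ggplot facet", "lubridate tutorial")

toy_corpus <- Corpus(VectorSource(toy_queries))
toy_tdm <- TermDocumentMatrix(toy_corpus)

# Rows are terms, columns are documents; row sums give total term counts.
toy_freq <- sort(rowSums(as.matrix(toy_tdm)), decreasing = TRUE)
toy_d <- data.frame(word = names(toy_freq), freq = toy_freq)
toy_d
```

Here "ggplot" and "tutorial" each appear in two queries, so they end up at the top of the table.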

<br>

#### Set a threshold for the min/max frequency of words to create the wordcloud

```{r}

wordcloud(d$word, d$freq, min.freq = 50, scale = c(3 , 0.5), max.words = 200)

```
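
A `min.freq` of 50 suits a large search history; with a shorter history it may leave few or no words to draw. The sketch below uses made-up frequencies and a lower threshold, and sets a seed because word placement is random.

```r
library(wordcloud)

# Made-up frequencies standing in for the data frame `d`.
toy_d <- data.frame(word = c("ggplot", "tutorial", "facet", "lubridate"),
                    freq = c(12, 9, 4, 3),
                    stringsAsFactors = FALSE)

set.seed(2018)  # word placement is random; a seed makes the layout repeatable
wordcloud(toy_d$word, toy_d$freq,
          min.freq = 3,          # lower threshold for a small data set
          max.words = 200,
          scale = c(3, 0.5),
          random.order = FALSE)  # plot words in decreasing order of frequency
```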
