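R: Google Trends and Finance - Introduction and Applications (1)

How do you use Google Trends?

Search engines have become part of everyday life. Whenever a question comes up, a quick Google search usually returns a flood of information. Seen from another angle, Google Trends can also be a useful research tool. Why not use this Google service to see how much search interest the public shows in individual stocks? Conveniently, community developers have written API wrappers for both R and Python: for R, see gtrendsR; for Python, see gtrends.

Below I introduce the R package and demonstrate a few examples that apply to finance. Usage is simple: the gtrends function is all you need to retrieve data. Note, however, that its arguments are fairly involved, so I will go through the most commonly used ones in turn: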

keyword

A vector of keywords. It can be a single string (e.g., "AAPL") or a character vector (e.g., c("AAPL", "QCOM")).

geo

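The region code. Codes can be looked up in the dataset that ships with the package (data("countries")).

These two arguments are enough to retrieve data. Taking Apple's ticker symbol (AAPL) as an example, the downloaded result is a list containing 7 objects; use the names function to see what each one is called. The one I use most is “interest_over_time”, which holds the time series. Printing it shows that hits is the search-interest variable we care about. The time variable gets its own section below; gprop and category matter less, but they are also defined in the sections that follow.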

library(gtrendsR)
# get trends data
trends_aapl <- gtrends(keyword = "AAPL", geo = "US")
# what are included in a gtrends object
names(trends_aapl)

## [1] "interest_over_time"  "interest_by_country" "interest_by_region" 
## [4] "interest_by_dma"     "interest_by_city"    "related_topics"     
## [7] "related_queries"

# data content
head(trends_aapl$interest_over_time)

##         date hits geo      time keyword gprop category
## 1 2014-06-08   81  US today+5-y    AAPL   web        0
## 2 2014-06-15   50  US today+5-y    AAPL   web        0
## 3 2014-06-22   41  US today+5-y    AAPL   web        0
## 4 2014-06-29   37  US today+5-y    AAPL   web        0
## 5 2014-07-06   44  US today+5-y    AAPL   web        0
## 6 2014-07-13   49  US today+5-y    AAPL   web        0

# plot the gtrends object
plot(trends_aapl)

time

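The time span. Google offers only a limited set of choices. Google Trends data start in January 2004, and the data frequency changes with the value you set: the finest is one observation per minute and the coarsest is monthly. I list the options in detail because the differing frequencies cause an awkward problem later on:
- “now 1-H”: the past hour, one observation per minute;
- “now 4-H”: the past 4 hours, one observation per minute;
- “now 1-d”: the past day, one observation every 8 minutes;
- “now 7-d”: the past 7 days, one observation per hour;
- “today 1-m”: the past 30 days, daily data;
- “today 3-m”: the past 90 days, daily data;
- “today 12-m”: the past 12 months, weekly data (7 days);
- “today+5-y”: the past 5 years, weekly data; this is the default;
- “all”: from January 2004 to the present, monthly data;
- “Y-m-d Y-m-d”: a custom interval. The format must match exactly, with a space between the two dates (e.g., "2019-04-01 2019-05-01"); the data frequency depends on the length of the interval.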
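The custom “Y-m-d Y-m-d” format is the one used for monthly batch downloads later on, so here is a minimal sketch of building that string for a single calendar month. It uses base R only; month_span() is a hypothetical helper name, not part of gtrendsR.

```r
# Build the "Y-m-d Y-m-d" string for one calendar month (base R only).
# month_span() is a hypothetical helper, not a gtrendsR function.
month_span <- function(year, month) {
  start <- as.Date(sprintf("%d-%02d-01", year, month))
  # The first day of the following month, minus one day, is the last day
  # of this month; this also handles December and leap years.
  end <- seq(start, by = "month", length.out = 2)[2] - 1
  paste(start, end)
}

month_span(2019, 4)   # "2019-04-01 2019-04-30"
month_span(2020, 2)   # "2020-02-01 2020-02-29"
```

The resulting string can then be passed as the time argument of gtrends.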

gprop

The Google product to query. There are five options: “web”, “news”, “images”, “froogle”, and “youtube”; “web” is the default.

onlyInterest

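A logical value (TRUE/FALSE): whether to return only the time series (interest_over_time). I use this argument a lot because it returns just the data I want. Note that the returned object is still a list, so remember to extract the data frame from it.

Other arguments

These include category, hl, low_search_volume, cookie_url, and tz. See the help page if you need them.

Once you know how to download the data, you may well want daily data over a longer period, since daily data are more common in finance applications. Unfortunately, lengthening the time argument only yields weekly or monthly data. The next section discusses how to download in batches and obtain sensible daily data over a long period.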
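To make the "still a list" point concrete, here is a small sketch that runs without a network call. The res object is a made-up stand-in for what gtrends(..., onlyInterest = TRUE) returns, assuming the single element keeps the name interest_over_time shown in the names() output earlier.

```r
# Stand-in for the list returned by gtrends(..., onlyInterest = TRUE).
# The values are invented so the snippet runs offline.
res <- list(
  interest_over_time = data.frame(
    date    = as.Date(c("2019-05-01", "2019-05-02")),
    hits    = c(27, 31),
    keyword = "AAPL"
  )
)

is.data.frame(res)              # FALSE: the result itself is a list
iot <- res$interest_over_time   # pull the data frame out by name
is.data.frame(iot)              # TRUE
```

Extracting by name is a little more explicit than the positional .[[1]] used in the code in this post; both pick the same element when the list has only one.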

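The problem of inconsistent data frequency in Google Trends

Anyone who wants to use Google Trends for research will quickly notice a problem: the data are not actual search counts! Indeed, for convenience Google randomly draws a subset of searches and normalizes it into an index running from 0 to 100, where 100 marks the point with the highest search volume within the requested time span. So when you want daily data over a longer period, you run into the following difficulties:

- Setting time to all directly only gives monthly data;
- Even if you loop over the period, fetch the data one month at a time, and stack the results, every month runs from 0 to 100. Each month's daily data are normalized against a different base, so they are simply not comparable. For example, a hits value of 100 in January does not mean the same search interest as a hits value of 100 in March, and we have no way of knowing the raw data.

Fixing this is not hard. We only need the monthly index over the whole sample period, and then adjust each daily value by the index of its month. The monthly index can be treated as a weight for the keyword's search volume in that month; multiplying the daily data by the corresponding monthly weight gives daily data that are comparable over the long period.

First, write a nested loop to fetch daily data for several months. It contains a simple if test: since 2019 has only just reached June, the loop should stop once it reaches the current, still incomplete month and keep running otherwise.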
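A toy example of the comparability problem, using invented raw counts (Google never exposes these). The to_index() function is a simplified stand-in for Google's scaling, assuming only that the peak of each requested window is mapped to 100.

```r
# Invented raw daily search counts for two months.
raw_jan <- c(200, 400, 1000, 600)   # a busy month
raw_mar <- c(20, 40, 100, 60)       # a month with one tenth of the traffic

# Simplified stand-in for the scaling: the window's peak becomes 100.
to_index <- function(x) round(100 * x / max(x))

to_index(raw_jan)   # 20 40 100 60
to_index(raw_mar)   # 20 40 100 60
```

Both months produce exactly the same index even though January had ten times the search volume, which is why month-by-month downloads cannot simply be stacked.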
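The adjustment can be sketched on invented data before touching the real API. The raw counts and the to_index() scaling below are assumptions for illustration only; in particular, the toy monthly index is taken to be proportional to each month's total count, which simplifies what Google actually does.

```r
# Simplified stand-in for Google's scaling: the peak of a window becomes 100.
to_index <- function(x) round(100 * x / max(x))

# Invented raw daily counts for two months (unobservable in practice).
raw <- data.frame(
  month = rep(c("2019-01", "2019-03"), each = 3),
  count = c(200, 400, 1000, 20, 40, 100)
)

# Daily index, scaled within each month separately (what monthly downloads give).
raw$hits <- ave(raw$count, raw$month, FUN = to_index)

# Monthly index over the whole period (what time = "all" gives).
monthly <- aggregate(count ~ month, data = raw, FUN = sum)
monthly$weight <- to_index(monthly$count)

# Adjusted daily value = daily index * monthly weight / 100.
adj <- merge(raw, monthly[, c("month", "weight")], by = "month")
adj$adjusted_hits <- adj$hits * adj$weight / 100
adj
```

In this toy setup adjusted_hits is exactly one tenth of the raw count on every day, so the two months become comparable again.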

# required packages
library(lubridate)
library(dplyr)
# create an empty dataframe
trends_unadjusted <- data.frame()
# loop over every year and month
for (yr in 2018:2019) {
  for (mon in 1:12) {
    # stop at the current (still incomplete) month of the current year
    if (yr == year(today()) && mon >= month(today())) {
      break
    } else {
      # first day of the month
      start <- ymd(paste(yr, formatC(mon, width = 2, flag = "0"), "01", sep = "-"))
      # last day of the month
      if (mon == 12) {
        end <- ymd(paste(yr, "12-31", sep = "-"))
      } else {
        end <- ymd(paste(yr, formatC(mon + 1, width = 2, flag = "0"), "01", sep = "-")) - 1
      }
      # "Y-m-d Y-m-d" span expected by the time argument
      span <- paste(start, end, sep = " ")
      temp <- gtrends(keyword = "AAPL", geo = "US", time = span, onlyInterest = TRUE) %>% .[[1]]
      trends_unadjusted <- rbind(trends_unadjusted, temp)
    }
  }
}
tail(trends_unadjusted)

##           date hits geo                  time keyword gprop category
## 511 2019-05-26    4  US 2019-05-01 2019-05-31    AAPL   web        0
## 512 2019-05-27    6  US 2019-05-01 2019-05-31    AAPL   web        0
## 513 2019-05-28   27  US 2019-05-01 2019-05-31    AAPL   web        0
## 514 2019-05-29   26  US 2019-05-01 2019-05-31    AAPL   web        0
## 515 2019-05-30   28  US 2019-05-01 2019-05-31    AAPL   web        0
## 516 2019-05-31   27  US 2019-05-01 2019-05-31    AAPL   web        0

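Next, use time = "all" to obtain the monthly weights. Because we are about to merge, keep only the variables we need and use “Y-m” as the linking variable (link). The package also seems to have a small bug: after the monthly data are downloaded, the date column comes back entirely empty, which is odd. We know the series runs from “2004-01” to the present, though, so we can assign the values ourselves: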

trends_all <- gtrends(keyword = "AAPL", geo = "US", time = "all", onlyInterest = T) %>% .[[1]] %>% 
    select(weight = hits)
# rebuild the month labels from 2004-01, one per row of the monthly data
# (length.out avoids an invalid month "00" when the row count is a multiple of 12)
trends_all <- mutate(trends_all, link = format(seq(ymd("2004-01-01"), by = "month", length.out = nrow(trends_all)), "%Y-%m"))
head(trends_all)

##   weight    link
## 1      3 2004-01
## 2      3 2004-02
## 3      3 2004-03
## 4      3 2004-04
## 5      2 2004-05
## 6      3 2004-06

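The last step is to merge the daily data with the weights and compute the adjusted hits. The adjusted search interest (adjusted_hits) is simply the unadjusted value multiplied by the weight. I divide by 100 only to keep the adjusted values between 0 and 100, which makes them easier to compare; the scale does not affect the later analysis: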

trends_adjusted <- trends_unadjusted %>% 
    mutate(link = format(date, "%Y-%m")) %>% 
    merge(trends_all, by = "link", all.x = T) %>% 
    mutate(adjusted_hits = hits * weight/100,
           date = as.Date(date))
head(trends_adjusted)

##      link       date hits geo                  time keyword gprop category
## 1 2018-01 2018-01-01    8  US 2018-01-01 2018-01-31    AAPL   web        0
## 2 2018-01 2018-01-02   56  US 2018-01-01 2018-01-31    AAPL   web        0
## 3 2018-01 2018-01-03   62  US 2018-01-01 2018-01-31    AAPL   web        0
## 4 2018-01 2018-01-04   59  US 2018-01-01 2018-01-31    AAPL   web        0
## 5 2018-01 2018-01-05   65  US 2018-01-01 2018-01-31    AAPL   web        0
## 6 2018-01 2018-01-06   16  US 2018-01-01 2018-01-31    AAPL   web        0
##   weight adjusted_hits
## 1     37          2.96
## 2     37         20.72
## 3     37         22.94
## 4     37         21.83
## 5     37         24.05
## 6     37          5.92

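Advanced usage

We can wrap this process in a function to make batch downloading and adjustment easier. One thing worth mentioning: if the program fetches data from 2004 through 2019 in one go, it fails with an error (Error in get_widget(comparison_item, category, gprop, hl, cookie_url, : widget$status_code == 200 is not TRUE). The likely cause is that a large number of downloads in a short time hits a limit on the server side. I therefore recommend not downloading too much at once; you can split the period with arguments, as follows: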

trends_daily <- function(keyword, geo, start_year = 2004, end_year = year(today())) {
  trends_unadjusted <- data.frame()
  for (yr in start_year:end_year) {
    for (mon in 1:12) {
      if(yr == year(today()) & mon >= month(today())) {
        break
      } else {
        start <- ymd(paste(yr, formatC(mon, width = 2, flag = "0"), "01", sep = "-"))
        if(mon == 12) {
          end <- ymd(paste(yr, "12-31", sep = "-"))
        } else {
          end <- ymd(paste(yr, formatC(mon+1, width = 2, flag = "0"), "01", sep = "-")) - 1
        }
        span <- paste(start, end, sep = " ")
        temp <- gtrends(keyword = keyword, geo = geo, time = span, onlyInterest = T) %>% .[[1]]
        trends_unadjusted <- rbind(trends_unadjusted, temp)
      }
    }
  }
  # Monthly Data
  trends_all <- gtrends(keyword = keyword, geo = geo, time = "all", onlyInterest = T) %>% .[[1]] %>% 
    select(weight = hits)
  # rebuild the month labels from 2004-01, one per row of the monthly data
  # (length.out avoids an invalid month "00" when the row count is a multiple of 12)
  trends_all <- mutate(trends_all, link = format(seq(ymd("2004-01-01"), by = "month", length.out = nrow(trends_all)), "%Y-%m"))
  # Adjustment
  trends_adjusted <- trends_unadjusted %>% 
    mutate(link = format(date, "%Y-%m")) %>% 
    merge(trends_all, by = "link", all.x = T) %>% 
    mutate(adjusted_hits = hits * weight/100,
           date = as.Date(date))
  return(trends_adjusted)
}
# test
trends_aapl <- trends_daily("AAPL", "US", 2018, 2019)
tail(trends_aapl)

##        link       date hits geo                  time keyword gprop
## 511 2019-05 2019-05-26    4  US 2019-05-01 2019-05-31    AAPL   web
## 512 2019-05 2019-05-27    6  US 2019-05-01 2019-05-31    AAPL   web
## 513 2019-05 2019-05-28   27  US 2019-05-01 2019-05-31    AAPL   web
## 514 2019-05 2019-05-29   26  US 2019-05-01 2019-05-31    AAPL   web
## 515 2019-05 2019-05-30   28  US 2019-05-01 2019-05-31    AAPL   web
## 516 2019-05 2019-05-31   27  US 2019-05-01 2019-05-31    AAPL   web
##     category weight adjusted_hits
## 511        0     39          1.56
## 512        0     39          2.34
## 513        0     39         10.53
## 514        0     39         10.14
## 515        0     39         10.92
## 516        0     39         10.53
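Besides splitting the period, another way to cope with the status-code error mentioned above is to pause and retry a failed request. The sketch below is only an illustration: with_retry() is a hypothetical helper (not part of gtrendsR), the default 60-second wait is an arbitrary assumption, and whether pausing actually clears the error depends on Google's limits. A simulated flaky function stands in for the network call so the snippet runs offline.

```r
# with_retry() is a hypothetical helper, not a gtrendsR function.
# It calls f() up to `tries` times, sleeping `wait` seconds between failures.
with_retry <- function(f, tries = 3, wait = 60) {
  for (i in seq_len(tries)) {
    result <- tryCatch(f(), error = function(e) e)
    if (!inherits(result, "error")) return(result)
    message("Attempt ", i, " failed: ", conditionMessage(result))
    if (i < tries) Sys.sleep(wait)
  }
  stop("All ", tries, " attempts failed")
}

# Simulated request that fails twice before succeeding.
calls <- 0
flaky <- function() {
  calls <<- calls + 1
  if (calls < 3) stop("simulated server error")
  "ok"
}

out <- with_retry(flaky, tries = 5, wait = 0)
out   # "ok", reached on the third attempt
```

Inside trends_daily, the download line could then be wrapped as with_retry(function() gtrends(keyword = keyword, geo = geo, time = span, onlyInterest = TRUE)).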

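Visual evidence

Let's use highcharter to see how the series differs before and after adjustment. Figure 1 shows the unadjusted search interest. Since 2018, quite a few days clearly reach 100, but each of these is only a local maximum. To find the global maximum we need the monthly weights (Figure 2): in October 2018 the weight is only 42, far below the 60 of November 2018, yet in Figure 1 both months look equally high. After adjustment, Figure 3 follows the same path as Figure 2 but at a higher frequency, so the picture is much sharper.

Take Apple as an example. It held an important launch event on October 30, 2018, and search volume peaked on 11/1 (60). The earlier events on 3/27 and 6/4 each reached 32. In addition, February, May, August, and November 2018 all show very high search volume. Those days fall within two or three days after earnings releases, and the search volume is in fact even more pronounced than around product launches. This is an interesting pattern, and it suggests that Google Trends is related to financial events.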

# highcharter
library(highcharter)
# Figure 1. Unadjusted hits
hchart(trends_aapl, type = "area", hcaes(x = date, y = hits)) %>% 
  hc_xAxis(title = list(text = "Date")) %>% 
  hc_yAxis(title = list(text = "Unadjusted Frequency")) %>% 
  hc_add_theme(hc_theme_ft())

# Figure 2. Monthly weights
trends_all <- gtrends(keyword = "AAPL", geo = "US", time = "all", onlyInterest = T) %>% .[[1]] %>% 
    select(weight = hits)
# rebuild the month labels from 2004-01, one per row of the monthly data
trends_all <- mutate(trends_all, link = format(seq(ymd("2004-01-01"), by = "month", length.out = nrow(trends_all)), "%Y-%m"))
hchart(trends_all, type = "area", hcaes(x = link, y = weight)) %>% 
  hc_xAxis(title = list(text = "Date")) %>% 
  hc_yAxis(title = list(text = "Weight")) %>% 
  hc_add_theme(hc_theme_ft())

# Figure 3. Adjusted hits
hchart(trends_aapl, type = "area", hcaes(x = date, y = adjusted_hits)) %>% 
  hc_xAxis(title = list(text = "Date")) %>% 
  hc_yAxis(title = list(text = "Adjusted Frequency")) %>% 
  hc_add_theme(hc_theme_ft())

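Another concern

During my research I found that the same query can return different data. For example, the very same line, trends_aapl <- gtrends(keyword = "AAPL", geo = "US"), can give different hits values: rerun it a few minutes later and the numbers are slightly different. It turns out that, to speed things up (probably so that users see results sooner), Google draws a random subset of searches, as I mentioned earlier, and normalizes after sampling. Suppose there were 100,000 search records in January 2019; Google might randomly draw 10,000 of them to build the index, which greatly reduces the computing resources required.

However, this approach also causes us quite a bit of inconvenience. If I only want a few days of data, that is, a small sample, the hits values may vary a lot and be of little use as a reference. Da, Engelberg, and Gao (JF, 2011) discuss this issue in detail in footnote 4 on page 1,467: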
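The effect of this sampling can be imitated with a small simulation. Everything here is invented: the 100,000 searches, their spread over 30 days, and the 10,000-record draw simply mirror the hypothetical numbers in the paragraph above, and scaling the peak to 100 is a simplified stand-in for Google's normalization.

```r
set.seed(1)

# Invented population: 100,000 searches spread unevenly over 30 days.
day_prob <- seq(1, 3, length.out = 30)
searches <- sample(1:30, size = 100000, replace = TRUE, prob = day_prob)

# One simulated "download": draw 10,000 searches at random, count them
# per day, and scale so that the busiest day equals 100.
draw_index <- function(searches, n = 10000) {
  sub <- sample(searches, n)
  counts <- tabulate(sub, nbins = 30)
  round(100 * counts / max(counts))
}

a <- draw_index(searches)
b <- draw_index(searches)

head(cbind(a, b))   # two downloads of the "same" query, side by side
cor(a, b)           # how closely the two downloads agree
```

Shrinking n or the number of days in the simulation is a quick way to see how the agreement between downloads changes with sample size.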

To increase the response speed, Google currently calculates SVI [from a random subset of the] actual historical search data. This is why SVIs on the same search term [might be slightly different] when they are downloaded at different points in time. We believe th[at the impact of such sampling] error is small for our study and should bias against finding significa[nt results. When we download] the SVIs several times and compute their correlation, we find the [correlations are usually above] 97%. In addition, we also find that if we restrict our analysis to a [subset of SVIs for which the] sampling error standard deviation reported by Google Trends is low, we get stronger results.

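Summary

In finance, Da et al. (2011) were the first to use Google Trends as a proxy for retail investor attention. Investor attention has long been an interesting topic in behavioral finance: investors cannot take in all the information circulating in the market within a short time, which affects the performance of their investment decisions, usually for the worse. As for their study, they find that an abnormally high search volume for a stock predicts a higher price over the next 2 weeks and a reversal within 1 year.

In the next post I will test whether the conclusions of Da et al. (2011) also hold in Taiwan. Conveniently, Google Trends lets you set the region, so we can look at the situation in Taiwan. If time permits, I will also cover a few interesting studies related to Google Trends.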
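To give "abnormally high search volume" a concrete form, here is a minimal sketch. It assumes the definition in Da et al. (2011) as I understand it: the log of this week's SVI minus the log of the median SVI over the previous eight weeks. asvi() is a hypothetical helper name and the input series is invented.

```r
# asvi() is a hypothetical helper; svi is a weekly search-volume index.
# Abnormal SVI = log(SVI_t) - log(median of the previous `window` weeks).
asvi <- function(svi, window = 8) {
  n <- length(svi)
  out <- rep(NA_real_, n)          # the first `window` weeks have no baseline
  if (n <= window) return(out)
  for (t in (window + 1):n) {
    out[t] <- log(svi[t]) - log(median(svi[(t - window):(t - 1)]))
  }
  out
}

# Invented series: flat at 20, one spike to 40, then back to 20.
svi <- c(rep(20, 8), 40, 20)
asvi(svi)   # NA for weeks 1 to 8, log(2) in week 9, 0 in week 10
```

Because of the logarithm, a week with an index of 0 would give -Inf, so real data may need such weeks dropped or handled separately.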

Reference

Da, Z., Engelberg, J., and Gao, P. (2011), In Search of Attention. The Journal of Finance, 66: 1461-1499. doi:10.1111/j.1540-6261.2011.01679.x

Franz B. (2018), Google Trends: How to acquire daily data for broad time frames. Medium.com.