Google searches are one of the most important datasets ever collected. Google is not only a tool for searching and getting answers but also a great means of understanding people around the world. It is a digital gold mine that can reveal much that was previously unknown.
Google Trends data is an unbiased sample of our Google search data. It’s anonymized (no one is personally identified), categorized (determining the topic for a search query) and aggregated (grouped together). This allows us to measure interest in a particular topic across search, from around the globe, right down to city-level geography.
Below is a report on “Analysis of Google Trends data”. I gathered Google Trends data in different ways and then tried to extract meaning from it. Initially, Google search data didn’t seem to be a proper source of information for “serious” research. But as we will see, the trails we leave as we query the Internet are extremely revealing. The idea of this project is to analyze multiple topics and, in some places, merge the search data with external data.
Although data from Google Trends is nicely formatted and easily available, there is no official API to query it. An API to accompany the Google Trends service was announced in 2007 by Marissa Mayer, then vice president of search products and user experience at Google, but so far it has not been released. Philippe Massicotte has written and maintains the unofficial gtrendsR package for R to get this data. Fortunately, the package is kept up to date, although the documentation is not.
Analysing Bitcoin price movement with Google search for term “Bitcoin”
library(plotly) # provides plot_ly() and the %>% pipe used below
#Reading the Bitcoin price data
bitcoinPrice <- read.csv("https://raw.githubusercontent.com/chirag-vithlani/Capstone/master/data/bitcoin/BitCoin_price_updated.csv")
#Formatting date
bitcoinPrice$newdate<-as.Date(bitcoinPrice$Date, "%m/%d/%Y")
#Price started moving only after year 2016, so reading data only from year 2016
newbitcoinPrice<-subset(bitcoinPrice, newdate > as.Date("2016-01-01"))
#converting price movement to percentage, as google search shows relative index value
newbitcoinPrice$pricePercent<-(newbitcoinPrice$Bitcoin.Price*100)/max(newbitcoinPrice$Bitcoin.Price)
head(newbitcoinPrice,2)
## Date Bitcoin.Price newdate pricePercent
## 1828 1/2/2016 433.36 2016-01-02 2.257822
## 1829 1/3/2016 433.36 2016-01-03 2.257822
bitcoinTrends <- read.csv("https://raw.githubusercontent.com/chirag-vithlani/Capstone/master/data/bitcoin/BitCoin_Trends_updated.csv")
#Formatting date
bitcoinTrends$newdate<-as.Date(bitcoinTrends$Week, "%m/%d/%Y")
head(bitcoinTrends,2)
## Week Bitcoin...United.States. newdate
## 1 1/3/2016 3 2016-01-03
## 2 1/10/2016 2 2016-01-10
bitCoinPriceAndTrends<-merge(bitcoinTrends,newbitcoinPrice)
head(bitCoinPriceAndTrends,3)
## newdate Week Bitcoin...United.States. Date Bitcoin.Price
## 1 2016-01-03 1/3/2016 3 1/3/2016 433.36
## 2 2016-01-10 1/10/2016 2 1/10/2016 447.46
## 3 2016-01-17 1/17/2016 2 1/17/2016 381.85
## pricePercent
## 1 2.257822
## 2 2.331283
## 3 1.989453
Plotting Bitcoin search Trends v/s price
Below, a line chart compares the Bitcoin price with Bitcoin search trends. There appears to be a correlation between how much people search for Bitcoin and its price, and both the price and the searches were highest around December and January. This does not mean that the more people search for Bitcoin, the higher its value will be, but the reverse might be true.
m <- list(
l = 50,
r = 50,
b = 100,
t = 100,
pad = 4
)
plot_ly(x = ~bitCoinPriceAndTrends$newdate) %>%
# line shape must be a valid plotly value such as "linear" or "spline", not a label
add_lines(y = ~bitCoinPriceAndTrends$pricePercent, name = "Actual Price Percentage", line = list(shape = "linear")) %>%
add_lines(y = ~bitCoinPriceAndTrends$Bitcoin...United.States., name = "Google Bitcoin Search Trends", line = list(shape = "linear"))%>%
layout(autosize = F, width = 1022, height = 500, margin = m,font = list(family = "\"Droid Sans\", sans-serif"),
title = "Google Bitcoin Search Trends Vs Bitcoin Price Percentage Change",xaxis = list(title = "Timeline"),yaxis = list(title = "Bitcoin Percentage"))
As we can see, both searches for Bitcoin and its price peaked around the last week of December 2017 and the start of January 2018. Google Trends is not a perfect indicator, though: it sometimes lags and sometimes leads Bitcoin’s price.
There appears to be a strong correlation between Bitcoin’s price and the popularity of the search term “Bitcoin” on Google. Maybe price follows interest: more interest brings more buyers and greater search volume. If that is the case, the Trends chart shows a reversal of the uptrend in interest in Bitcoin.
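The lead and lag relationship described above can be checked numerically. The sketch below is only an illustration: the `search` and `price_usd` vectors are made-up toy values (not the Bitcoin data), and `lagged_cor()` is a hypothetical helper, not part of any package. It rescales the price to a 0 to 100 index, the same way `pricePercent` is built, and then correlates search interest with the price `k` weeks later.

```r
# Toy weekly series (hypothetical values, NOT the real Bitcoin data).
search <- c(2, 3, 2, 5, 9, 20, 60, 100, 70, 40)
# In this toy example the price simply echoes last week's search interest.
price_usd <- 200 * c(2, head(search, -1))

# Rescale the price to a 0-100 index so it is comparable to Google Trends.
price_index <- 100 * price_usd / max(price_usd)

# Hypothetical helper: correlation between x at week t and y at week t + k.
# k > 0 means x leads y by k weeks; k < 0 means x lags y by |k| weeks.
lagged_cor <- function(x, y, k) {
  n <- length(x)
  if (k >= 0) {
    cor(x[1:(n - k)], y[(1 + k):n])
  } else {
    cor(x[(1 - k):n], y[1:(n + k)])
  }
}

lags <- -2:2
cors <- sapply(lags, function(k) lagged_cor(search, price_index, k))
best_lag <- lags[which.max(cors)]

print(round(cors, 2))
print(best_lag)
```

With the real data, `x` would be the `Bitcoin...United.States.` column and `y` the `pricePercent` column of `bitCoinPriceAndTrends`. A peak at a positive lag would suggest that search interest tends to move before price, although correlation alone cannot establish which one drives the other.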
Next, I chose to analyze movies nominated for the Academy Awards. I wanted to see whether there is a pattern between how people search for movies and the business those movies do, so I chose four nominated movies for the analysis. I got the Google search data from the Trends tool and the box office data from BoxOfficeMojo.com.
As we can see, “The Shape of Water” is the most searched movie of the four, and it is also the movie with the highest gross. The second most searched movie is “Three Billboards Outside Ebbing, Missouri”, whereas the second highest earner was “Darkest Hour”. The least searched of the four is “Lady Bird”, which also did the least business. So Google search seems to do a reasonably good job of picking the first and last places.
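One way to put a number on "reasonably good" is a rank correlation. The sketch below uses only the orderings stated in the paragraph above (1 = highest), not the raw search or box office figures, and computes Spearman's rho with base R. The data frame and its column names are illustrative.

```r
# Ranks read off the paragraph above (1 = highest); no raw figures are used.
movies <- data.frame(
  title = c("The Shape of Water",
            "Three Billboards Outside Ebbing, Missouri",
            "Darkest Hour",
            "Lady Bird"),
  search_rank = c(1, 2, 3, 4),
  gross_rank  = c(1, 3, 2, 4)
)

# Spearman's rho: 1 means identical ordering, -1 means reversed ordering.
rho <- cor(movies$search_rank, movies$gross_rank, method = "spearman")
print(rho)
```

For these ranks rho works out to 0.8, because only the two middle places are swapped. With just four movies this is indicative at best and should not be read as a statistical result.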
We frequently type “how to …” into Google, so here is an analysis of which “how to” queries we make the most.
To get the “how to” queries, we use the gtrendsR package. We query the term “how to” for each year from 2012 to 2017 and show the most searched terms as a word cloud.
library(data.table)
library(gtrendsR) # gtrends()
library(wordcloud2) # wordcloud2()
# Writing function to display wordcloud
getYearTrends <- function(timeline)
{
HowTo2017<-gtrends("How to", geo="US", time = timeline)
HowTo2017<-HowTo2017$related_queries
return (HowTo2017)
}
gTrendswordcloud <- function(timeline)
{
#HowTo2017<-gtrends("How to", geo="US", time = timeline)
HowTo2017<-getYearTrends(timeline)
HowTo2017$subjectNew<-gsub('%','',HowTo2017$subject)
HowTo2017[which(HowTo2017[,7]=='<1', arr.ind=TRUE), 7] <-0
HowTo2017[which(HowTo2017[,7]=='Breakout', arr.ind=TRUE), 7] <-9999
HowTo2017$subjectNew<-as.numeric(gsub(',','',HowTo2017$subjectNew))
HowTo2017$subjectNew<-as.numeric(HowTo2017$subjectNew)
max<-max(subset(HowTo2017, related_queries == 'rising')$subjectNew)
HowTo2017rising<-subset(HowTo2017,related_queries=='rising')
HowTo2017rising$subjectNew<-subset(HowTo2017rising,related_queries=='rising')$subjectNew*100/max
HowTo2017risingTop<-subset(HowTo2017,related_queries=='top')
HowTo2017<-rbind(HowTo2017rising ,HowTo2017risingTop)
HowTo2017$subjectNew<-as.integer(HowTo2017$subjectNew)
HowTo2017Top<-subset(HowTo2017,as.numeric(HowTo2017$subjectNew)>1)
HowTo2017Top<-subset(HowTo2017Top, select=c("value", "subjectNew"))
HowTo2017Top$subjectNew<-as.numeric(HowTo2017Top$subjectNew)
colnames(HowTo2017Top)[2]<-"freq"
HowTo2017Top<-HowTo2017Top[order(-HowTo2017Top$freq),]
wordcloud2(data = HowTo2017Top)
}
#gTrendswordcloud("2017-01-01 2017-12-31")
[Figure: word clouds of the most searched “how to” queries, one per year for 2012, 2013, 2014, 2015, 2016 and 2017.]
GIF made using ezgif.com.
The result shows the most searched “how to” queries on Google. The clouds give greater prominence to queries that are searched more frequently.
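The cleaning step inside `gTrendswordcloud()` is the part that is easiest to get wrong, so here it is in isolation. This is a minimal sketch that assumes the `subject` column of `related_queries` holds strings such as "100", "<1", "+1,250%" or "Breakout"; that assumption and the helper name `clean_subject()` are mine, and the placeholder value 9999 for "Breakout" simply mirrors the one used in the function above.

```r
# Hypothetical helper: turn Google Trends "subject" strings into numbers.
# Assumed input forms: "100", "<1", "+1,250%", "Breakout".
clean_subject <- function(x, breakout = 9999) {
  x <- as.character(x)
  out <- rep(NA_real_, length(x))

  is_breakout <- x %in% "Breakout"   # no numeric value is reported
  is_small    <- x %in% "<1"         # below the reporting threshold

  out[is_breakout] <- breakout
  out[is_small]    <- 0

  # Everything else: strip "+", "," and "%" and convert to numeric.
  rest <- !(is_breakout | is_small)
  out[rest] <- as.numeric(gsub("[+,%]", "", x[rest]))
  out
}

cleaned <- clean_subject(c("100", "<1", "Breakout", "+1,250%"))
print(cleaned)
```

Matching on the string values rather than on a column position (the `[,7]` indexing above) keeps the cleaning step working even if the column order of the result changes.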
library(data.table)
library(knitr) # kable()
library(kableExtra) # kable_styling()
tr2012<-getYearTrends("2012-01-01 2012-12-31")
tr2013<-getYearTrends("2013-01-01 2013-12-31")
tr2014<-getYearTrends("2014-01-01 2014-12-31")
tr2015<-getYearTrends("2015-01-01 2015-12-31")
tr2016<-getYearTrends("2016-01-01 2016-12-31")
tr2017<-getYearTrends("2017-01-01 2017-12-31")
all<-rbind(tr2012,tr2013,tr2014,tr2015,tr2016,tr2017)
#plot_ly(x = df$Var1,y = df$Freq,name = "SF Zoo",type = "bar")
df<-as.data.frame(table(all$value))
df<-subset(df, df$Freq > 4)
df[order(-df$Freq),]
## Var1 Freq
## 10 how to boil eggs 6
## 70 how to solve a rubix cube 6
## 74 how to tie a tie 6
## 75 how to train your dragon 6
## 26 how to draw 5
## 34 how to get away with murder 5
kable(df, "html") %>%
kable_styling(bootstrap_options = "striped", full_width = F)
Row | Var1 | Freq |
---|---|---|
10 | how to boil eggs | 6 |
26 | how to draw | 5 |
34 | how to get away with murder | 5 |
70 | how to solve a rubix cube | 6 |
74 | how to tie a tie | 6 |
75 | how to train your dragon | 6 |
get2012uniqueQueries <- function(){
allExcept2012<-rbind(tr2013,tr2014,tr2015,tr2016,tr2017)
only2012<-setDT(tr2012)[!allExcept2012, on="value"]
only2012df<-as.data.frame(only2012$value)
colnames(only2012df)="Year 2012 Unique queries"
only2012df<-unique(only2012df)
kable(only2012df, "html") %>%
kable_styling(bootstrap_options = "striped", full_width = F)
}
get2016uniqueQueries <- function(){
allExcept2016<-rbind(tr2012,tr2013,tr2014,tr2015,tr2017)
only2016<-setDT(tr2016)[!allExcept2016, on="value"]
only2016df<-as.data.frame(only2016$value)
colnames(only2016df)="Year 2016 Unique queries"
only2016df<-unique(only2016df)
kable(only2016df, "html") %>%
kable_styling(bootstrap_options = "striped", full_width = F)
}
get2017uniqueQueries <- function(){
allExcept2017<-rbind(tr2012,tr2013,tr2014,tr2015,tr2016)
only2017<-setDT(tr2017)[!allExcept2017, on="value"]
only2017df<-as.data.frame(only2017$value)
colnames(only2017df)="Year 2017 Unique queries"
only2017df<-unique(only2017df)
kable(only2017df, "html") %>%
kable_styling(bootstrap_options = "striped", full_width = F)
}
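The three functions above differ only in the year they single out, so the same idea can be written once. The sketch below is a hypothetical generalization using base R's `setdiff()` instead of the data.table anti-join; `unique_to_year()` and the toy input are made up for illustration and are not real Google Trends output.

```r
# Hypothetical helper: queries that appear in one year and in no other year.
# `queries_by_year` is a named list of character vectors, one per year.
unique_to_year <- function(queries_by_year, year) {
  others <- unlist(queries_by_year[names(queries_by_year) != year],
                   use.names = FALSE)
  setdiff(queries_by_year[[year]], others)
}

# Toy input (illustrative only).
toy <- list(
  "2015" = c("how to tie a tie", "how to boil eggs"),
  "2016" = c("how to tie a tie", "toy query seen only in 2016"),
  "2017" = c("how to boil eggs", "toy query seen only in 2017")
)

print(unique_to_year(toy, "2016"))
print(unique_to_year(toy, "2017"))
```

With the objects defined earlier, the input list could be built from the `value` column of each yearly result, for example `list("2012" = tr2012$value, "2013" = tr2013$value)` and so on.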
Here are some interesting queries from the last 12 months.
howToData <- read.csv("https://raw.githubusercontent.com/chirag-vithlani/Capstone/master/data/How_to_Interesting.csv")
howToDataSubSet<-subset(howToData, select = c(1, 4))
colnames(howToDataSubSet)[2] <- "Country"
kable(howToDataSubSet, "html") %>%
kable_styling(bootstrap_options = "striped", full_width = F)
Topic | Country |
---|---|
how to make paper flowers | Bhutan |
how to take pictures of northern lights | Iceland |
how to become good teacher | India |
how to get twins | Kenya |
how to hack facebook | Myanmar |
how to make carrot oil | Nigeria |
how to handle wife | Pakistan |
how to identify AIDS | Sri Lanka |
how science is trying to help us eat better | Israel |
how to make solar system | Jamaica |
how to make a girl like you | Solomon Islands |
#Create the data frame used for the world map
LAND_ISO <- howToData$Country
value <- howToData$val
topic<-howToData$Topic
data <- data.frame(LAND_ISO, value,topic)
g <- list(scope = 'world')
plot_geo(data) %>%
add_trace(
z = ~value, locations = ~LAND_ISO, colors = c(Pass="yellow", High="red", Low= "cyan", Sigma= "magenta", Mean='limegreen', Fail="blue", Median="violet"),text = ~paste(howToData$Topic)
) %>%
layout(geo = g)%>% hide_colorbar()
Of the unique “how to” queries above, I found “how to handle wife” funny and serious at the same time. It hints at gender inequality, and wherever we see such a query I would expect the location to have high gender inequality. So here we find the top five countries for this query.
howToHandleWifeSearch<-gtrends("how to handle wife", time = "today 12-m")
howToHandleWifeSearchHead<-head(howToHandleWifeSearch$interest_by_country,5)
howToHandleWifeSearchHead<-subset(howToHandleWifeSearchHead, select = c(1, 2))
colnames(howToHandleWifeSearchHead)[2] <- "Percentage of Hits"
kable(howToHandleWifeSearchHead, "html") %>%
kable_styling(bootstrap_options = "striped", full_width = F)
location | Percentage of Hits |
---|---|
Pakistan | 100 |
Sri Lanka | 84 |
United Arab Emirates | 64 |
India | 43 |
Bangladesh | 33 |
The aurora, or northern lights, is a natural light display in the Earth’s sky, predominantly seen in high-latitude regions like Iceland. That is why people in Iceland search for “how to take pictures of northern lights”. This was the most amazing thing I learned while working on this project.
Source : Wikipedia
Google Trends is a unique and useful tool for keeping track of what people want to know about. When people search on Google they misspell words, and collectively Google can tell which words are misspelled most frequently. Similarly, people use Google when they want to know the spelling of a complex word, with the query “how to spell …”.
An article titled “Google reveals top ‘how to spell’ searches by Canadian province” gave an interesting analysis of which words Canadians misspell. A similar analysis for the United States was shared by the Google Trends Twitter account.
ICYMI - here's our map of the most misspelled words in America #spellingbee
— GoogleTrends (@GoogleTrends) May 30, 2017
(corrected legend) pic.twitter.com/2w56NpDgGK
If we query “how to spell” right now, the top Google Trends result in the United States is “how to spell the sound of a sniff”, which is a strange result, but there is a story behind it. Macaroni Tony, with the Twitter handle @BigBeard_Ali and about 14K followers, tweeted the following on 11 Feb 2018.
I got $750 to anybody that can spell the sound of a sniff
— Macaroni Tony (@BigBeard_Ali) February 11, 2018
This tweet had around 6K retweets and 10K likes, and it led people to search for “how to spell sound of sniff”. I could see that this query picked up only after the tweet. It is amazing that a single tweet with a small amount of prize money can have such a ripple effect.
Inspired by the above analysis, I decided to do a “how to pronounce” analysis. It would be fun to understand which words people find difficult to pronounce.
state_code | top_query | breakout_query |
---|---|---|
AL | how to pronounce gif | how to pronounce gif |
AZ | how to pronounce acai | how to pronounce gif |
CA | how do you pronounce | how to pronounce khalid |
CO | how to pronounce pho | how to pronounce qatar |
CT | how to pronounce gif | how to pronounce acai |
FL | how do you pronounce | how to pronounce xxxtentacion |
GA | how to pronounce gyro | how to pronounce sza |
HI | how to pronounce acai | how to pronounce acai |
IL | how to pronounce names | how to pronounce sza |
IN | how to pronounce gif | how to pronounce acai |
IA | how to pronounce gif | how to pronounce gif |
KS | how to pronounce gyro | how to pronounce gif |
MD | how to pronounce acai | how to pronounce acai |
MA | how to pronounce gyro | how to pronounce nguyen |
MI | how to pronounce gyro | how to pronounce sza |
MN | how to pronounce names | how to pronounce qatar |
MS | how to pronounce gyro | how to pronounce gyro |
MO | how to pronounce acai | how to pronounce sza |
NV | how to pronounce nevada | how to pronounce gyro |
NJ | how to pronounce acai | how to pronounce nguyen |
NY | how do you pronounce | how to pronounce xxxtentacion |
NC | how to pronounce names | how to pronounce pho |
OH | how to pronounce gyro | how to pronounce sza |
OR | how to pronounce gyro | how to pronounce gyro |
PA | how to pronounce gyro | how to pronounce sza |
TX | how do you pronounce | how to pronounce khalid |
UT | how to pronounce gif | how to pronounce gif |
VA | how to pronounce names | how to pronounce pho |
WA | how to pronounce gif | how to pronounce sza |
WI | how to pronounce gif | how to pronounce gif |
DC | how to pronounce gyro | how to pronounce gyro |
In the last twelve months, the words people found most difficult to pronounce were people’s names.
First is the singer-songwriter “Sza”.
Followed by “Gal Gadot” of Wonder Woman fame.
Followed by the American rapper “XXXTentacion”.
The state-wise breakdown shows which word each state finds difficult to pronounce.
It also contains food-related items like “pho”, “gyro” and “acai”. The map was created with mapchart.net. On the world stage, the top query is “how to pronounce pyeongchang”, the place that hosted the 2018 Winter Olympics and the 2018 Winter Paralympics.
Every new year we make resolutions, and we keep making them year after year. We search for “lose weight” and “quit smoking” every year, and these queries peak around January.
library(prophet) # prophet(), make_future_dataframe(), prophet_plot_components()
trends_us = gtrends(c("quit smoking"), geo = c("US"), gprop = "web", time = "2015-01-01 2018-04-30")[[1]]
forcase <- trends_us[,c("date","hits")]
colnames(forcase) <-c("ds","y")
m <- prophet(forcase)
## Disabling daily seasonality. Run prophet with daily.seasonality=TRUE to override this.
## Initial log joint probability = -4.16404
## Optimization terminated normally:
## Convergence detected: relative gradient magnitude is below tolerance
future <- make_future_dataframe(m, periods = 365)
forecast <- predict(m, future)
plot(m, forecast)
prophet_plot_components(m, forecast)
The first image above shows the forecast (made with Facebook’s time series forecasting package, prophet) for the “quit smoking” query, which suggests that the January peak will continue in January 2019. The image after that shows the same data broken down over different time spans. As we can see, the overall trend is going down over the years. Somehow, we search for “quit smoking” more on weekdays.
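The January peak and the downward trend can also be checked without a forecasting model, simply by averaging by month and by year. The sketch below uses a made-up weekly series (a slow decline plus a bump every January) in place of the real `trends_us` data, so the numbers are illustrative only.

```r
# Toy weekly series (hypothetical): a slow decline plus a bump every January.
dates <- seq(as.Date("2015-01-04"), as.Date("2018-04-29"), by = "week")
month <- as.integer(format(dates, "%m"))
year  <- format(dates, "%Y")
hits  <- 60 - 0.05 * seq_along(dates) + ifelse(month == 1, 30, 0)

# Average interest per calendar month: the seasonal pattern.
monthly_mean <- tapply(hits, month, mean)
peak_month <- as.integer(names(which.max(monthly_mean)))

# Average interest per complete year: the long-term trend.
# 2018 is left out because it only covers January to April.
full_year <- year %in% c("2015", "2016", "2017")
yearly_mean <- tapply(hits[full_year], year[full_year], mean)

print(month.name[peak_month])
print(round(yearly_mean, 1))
```

With the real data, `dates` and `hits` would come from the `date` and `hits` columns of `trends_us`. Partial years are excluded from the yearly average because a year that contains only its January peak would look artificially high.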
In the 21st century we would like to think that we treat boys and girls the same, at least in the United States, which is arguably one of the most liberal countries in the world. We would like to think that American parents have similar standards and similar dreams for their sons and daughters, free of gender bias. But a study by Seth Stephens-Davidowitz suggests that is not the case. Google searches suggest parents have different concerns for male and female children. As he points out, and as the Google Trends data below shows, parents expect (maybe unknowingly) boys to be smarter and girls to be thinner. They are more excited by the intellectual potential of their sons and more concerned about the weight and appearance of their daughters.
giftedQuery<- function(query1,query2,timespan)
{
query11<-gtrends(query1, geo="US", time=timespan)
query11Avg<-mean(query11$interest_over_time$hits)
query22<-gtrends(query2, geo="US", time=timespan)
query22Avg<-mean(query22$interest_over_time$hits)
p <- plot_ly(
x = c(query1,query2),
y = c(query11Avg, query22Avg),
name = "Average search interest",
type = "bar"
)%>%
layout(title = 'Gender bias clearly visible',xaxis = list(title = paste(query1,query2,sep = " <b>v/s</b> ")),yaxis = list(title = 'Average of search'))
p
}
giftedQuery('is my daughter gifted','is my son gifted','today 12-m')
##2017-01-01 2017-12-31
giftedQuery('is my daughter overweight','is my son overweight','2017-01-01 2017-12-31')
Almost all parents like to believe that their kid has a special talent. But as the graph below shows, people in the US searched “is my son gifted” more than “is my daughter gifted”. The difference is almost 50%.
Unfortunately, this is despite the fact that in real life the opposite is true. As David Walsh, an American psychologist who specializes in parenting, points out: “Girls talk earlier than boys, have larger pre-school vocabularies, and use more complex sentence structures. Once in school, girls are one to one-and-a-half years ahead of boys in reading and writing. Boys are twice as likely to have a language or reading problem and three to four times more likely to stutter. Girls do better on tests of verbal memory, spelling and verbal fluency. On average, girls utter two to three times more words per day than boys and even speak faster, twice as many words per minute.”
Childhood obesity has undoubtedly become one of the most complex public health problems facing future generations. Parents search “Is my daughter overweight?” almost twice as frequently as they search “Is my son overweight?”. Clearly parents worry more about overweight girls than overweight boys. This is despite data showing that, in reality, boys are more likely to be overweight than girls.
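One caveat is worth keeping in mind when comparing two averages like this. Google Trends reports each request on a 0 to 100 scale relative to the highest point in that request, so two terms fetched in separate calls are each scaled to their own peak. The toy sketch below (made-up weekly counts and a hypothetical `to_index()` helper) shows how that can hide a real difference, and how scaling both terms against a shared peak preserves it.

```r
# Hypothetical helper: scale counts to a 0-100 index relative to a peak value.
to_index <- function(x, peak = max(x)) round(100 * x / peak)

# Made-up weekly search counts (illustrative only, not real data).
son      <- c(30, 60, 45)
daughter <- c(20, 40, 30)

# Separate requests: each term is scaled to its own peak.
separate <- c(son = mean(to_index(son)),
              daughter = mean(to_index(daughter)))

# One joint request: both terms share the overall peak.
peak  <- max(son, daughter)
joint <- c(son = mean(to_index(son, peak)),
           daughter = mean(to_index(daughter, peak)))

print(separate)
print(joint)
```

In this toy case the separately scaled averages come out equal even though one term is searched 1.5 times as often, while the jointly scaled averages keep that ratio. With gtrendsR, both terms can be requested together by passing a character vector of keywords to `gtrends()`, which puts them on a shared scale.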
Unlike “what women want”, which no one can claim to know, we think we have a good idea of “what pregnant women want”: cravings for pickles and chocolate, the avoidance of wine, the nausea and stretch marks, and a supportive husband. Google Trends can shed more light on this question, and we can surely expect different answers from different parts of the world. We can start with questions about what pregnant women can safely do. The top questions women ask in the United States are: can pregnant women “eat shrimp, tuna, fish or crab”, “drink wine, coffee or tea” or “take Tylenol”? There are (comparatively) less frequently asked questions too, like “fly”, “take bath” or “paint”.
pregnantWomenEat<- function(query1,geo1,time1)
{
query11<-gtrends(query1, geo=geo1, time=time1)
all<-subset(query11$related_queries,query11$related_queries$related_queries=='top')$value
#all1<-replace(all,"can pregnant women eat ","")
num<-grep('^can pregnant women eat',all)
all1<-all[num]
all1<-gsub("can pregnant women eat", "", all1)
all1<-head(all1,3)
df1<-data.frame(all1)
names(df1)<-geo1
#kable(df1, "html") %>% kable_styling(bootstrap_options = "striped", full_width = F)
# Return the data frame explicitly; otherwise the function returns the value of the last assignment
df1
}
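The core of `pregnantWomenEat()` is a prefix filter followed by stripping that prefix. The sketch below shows just that step on a small hand-written vector; the queries are taken from the examples in the text above and are not live Google Trends output.

```r
# Hand-written sample of related queries (based on the examples in the text).
related <- c(
  "can pregnant women eat shrimp",
  "can pregnant women drink wine",
  "can pregnant women eat tuna",
  "can pregnant women take tylenol",
  "can pregnant women eat crab"
)

# Keep only the "eat" questions, then drop the common prefix.
prefix <- "^can pregnant women eat "
foods <- sub(prefix, "", grep(prefix, related, value = TRUE))

print(foods)
```

Using `grep(..., value = TRUE)` returns the matching strings directly, and including the trailing space in the pattern avoids the leading space that a plain `gsub()` on "can pregnant women eat" would leave behind.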
But if we ask the same question in other countries, the answers don’t look much like those in the United States, or like one another. Whether pregnant women can “drink wine” is not among the top 10 questions in Canada, Australia or the United Kingdom. Australia’s unique concern is mostly related to eating cream cheese. The differences in questions have less to do with what is actually safe and more to do with the information coming from different sources in each country, including old stories, local customs and neighborhood gossip.
We can see another clear difference when we look at the top searches for “how to ___ during pregnancy?” In the United States, the top searches are about gaining or losing weight, whereas in India women are more concerned about how to sleep or about the weight of the baby. In India it is illegal to find out the sex of a baby before birth, which is why one of the top queries there asks for tricks to determine it.
While the cultural manifestations of pregnancy may differ, the physical experience tends to be similar everywhere. I tested how often various symptoms were searched in combination with the word “pregnant”: for example, how often is “pregnant” searched in conjunction with “nausea”, “back pain” or “constipation”? Canada’s symptoms were very close to those in the United States. Symptoms in countries like Britain, Australia and India were all roughly similar, too. Preliminary evidence suggests that no part of the world has stumbled upon a diet or environment that drastically reduces a pregnancy symptom.
We can extend this analysis and check what expectant fathers are searching for. In the United States, the top searches include “be nice to me my wife is pregnant shirt”.
howTopregnantWomen<- function(query1,geo1)
{
query11<-gtrends(query1, geo=geo1, time='2017-01-01 2017-12-31')
all<-subset(query11$related_queries,query11$related_queries$related_queries=='top')$value
num<-grep('during pregnancy',all)
all1<-all[num]
num<-grep('^how to',all1)
all2<-all1[num]
all3<-gsub("how to ", "", all2)
all4<-gsub("during pregnancy","",all3)
all5<-head(all4,7)
df1<-data.frame(all5)
names(df1)<-geo1
kable(df1, "html") %>%
kable_styling(bootstrap_options = "striped", full_width = F)
}
Ever since philosophers speculated about the “cerebroscope”, a mythical device that would display a person’s thoughts on a screen, people have been looking for tools to expose the workings of human nature. Google Trends data is one such tool: an anonymized, categorized and unbiased sample of Google search data. Google handles trillions of searches per year, making Trends one of the most useful real-time indicators of human interest by region and category. Google Trends is most often used to understand brand health and to monitor changes in consumer interest along competitive metrics and factors such as seasonality. Search engine query data offers insight into our lives on the smallest possible scale, that of individual actions.
To investigate whether Internet search volume is correlated with other aspects of our lives, I used search volume data provided by Google. I tried to focus on the incredible amount of information about localized behavior that we can get from Google Trends. We got answers to questions we never asked from people we never considered. On top of that, we got information on historic behavior: we can’t ask panels how they felt many years ago!
One of the reasons I preferred Google Trends as my source of information over standard surveys or focus groups is that it draws on the largest panel in the world: the Internet. It’s honest, trusted and not influenced or skewed. Google Trends is arguably the best publicly available data we have. I say trusted because Google is the default search engine most of us use; if people were spread across different search engines, this kind of analysis would have been difficult. I admire Google as a company for keeping this kind of data publicly available for free, taking the meaning of open [ https://googleblog.blogspot.com/2009/12/meaning-of-open.html ] to the next level. Thanks to this openness, anyone can run an analysis on a topic of their own interest. Ideas can come from anyone. Data analysis is no longer restricted to an elite group of researchers and academics. The opportunities are endless.
For many people, Google is more than just a simple search engine; it’s one of their closest confidants. Jeremy Ginsberg has provided evidence that Google Trends data can be used to track influenza-like illness in a population. Because the relative frequency of certain queries is highly correlated with the percentage of physician visits in which a patient presents with influenza-like symptoms, an estimate of weekly influenza activity can be reported.
This kind of data analysis can address taboo subjects, such as what percentage of American men are gay, and issues related to child sexual exploitation or child abuse can be analyzed more reliably, because the Internet is the first place we turn to. The advantage of this data source, of course, is that most people make these searches in private. Google Trends offers an unprecedented peek into people’s psyches. People can unburden themselves of some wish or fear without a real person reacting in dismay. The most important thing is that we ask the right questions.