Task:

Using keyword search data from Google Trends, examine relationships between various keyword searches over time. Test for correlations between “kombucha” and seven other search terms using the cor.test() function. Then create a data frame displaying the correlation coefficient and p-value from the output of each correlation test.

Data:

I used Google Trends to collect data on search frequency over time for ‘kombucha’, ‘guthealth’, ‘superfoods’, ‘matcha’, ‘paleodiet’, ‘microbiome’, ‘gutbacteria’ and ‘probiotic’. Data was collected weekly from 5/20/12 to 5/7/17.

Import data:

data=read.csv("Kombucha_Searches.csv")
colnames(data)=c('Week', 'kombucha', 'guthealth', 'superfoods', 'matcha', 
      'paleodiet', 'microbiome', 'gutbacteria', 'probiotic')

Trying out cor.test():

cor.test(data$kombucha, data$matcha)
## 
##  Pearson's product-moment correlation
## 
## data:  data$kombucha and data$matcha
## t = 30.754, df = 258, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.8571622 0.9099211
## sample estimates:
##       cor 
## 0.8863862

This output is too verbose for our purposes. We’re testing correlations between “kombucha” and seven other search terms, so we need a way to pull the key elements out of each cor.test() result (correlation coefficients and p-values in this case) and show them in an easy-to-read format.
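The object returned by cor.test() is a list of class "htest", so individual pieces can be pulled out by name. Here is a minimal sketch on simulated data; the vectors a and b are made up for illustration and are not the Google Trends series:

```r
# Simulated stand-in data (hypothetical, not the Google Trends series)
set.seed(1)
a <- rnorm(50)
b <- a + rnorm(50)

res <- cor.test(a, b)

# List the components available in the result
names(res)

# Pull out single values by name
r_est <- unname(res$estimate)  # correlation coefficient
p_val <- res$p.value           # p-value
ci    <- res$conf.int          # confidence interval (lower, upper)
```

The same `$estimate` and `$p.value` accessors are what the extraction step further down relies on.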


Start by splitting the columns into “kombucha” and a group of all other terms. Then create an empty list of length 7 to hold the cor.test() results (one slot per test):

# "kombucha" as our x
x=data$kombucha
# seven other search terms as our y's (columns 3 through 9)
y=data[3:length(data)]
# pre-allocate an empty list with one slot per test
results <- vector("list", length(y))

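Looping through the seven tests and storing each result in the “results” list. Each list item holds the nine components that cor.test() returns (statistic, parameter, p.value, estimate, null.value, alternative, method, data.name and conf.int):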

# seq_along() is safe even if y has no columns
for (i in seq_along(y)){
  results[[i]] <- cor.test(x, y[[i]])
}
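As an alternative to pre-allocating a list and filling it in a loop, lapply() can run one test per column and keep the column names on the results. A minimal sketch, assuming simulated stand-in data (the columns term_a, term_b and term_c are hypothetical):

```r
# Simulated stand-in for the search data (hypothetical columns)
set.seed(42)
x <- rnorm(100)
y <- data.frame(term_a = x + rnorm(100),
                term_b = rnorm(100),
                term_c = -x + rnorm(100))

# One cor.test() per column of y; the returned list keeps the column names
results <- lapply(y, function(col) cor.test(x, col))

names(results)            # the column names of y
results$term_a$estimate   # access a single test by name
```

Because the list is named, later steps can look up a test by search term instead of by position.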

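Extracting correlation coefficients and p-values from the “results” list and creating a data frame to display the corresponding values for each search term. Values are rounded to five decimal places, so a p-value shown as 0.00000 is smaller than 0.000005, not exactly zero: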

# shorten values using Jeromy Anglim's specify decimal function
# (defined once, outside the loop)
specify_decimal <- function(x, k) format(round(x, k), nsmall=k)
# Create an empty data frame: 2 rows, one column per test
corrs=as.data.frame(matrix(NA_character_, nrow=2, ncol=length(results)),
                    stringsAsFactors=FALSE)
# loop through "results" list, extract estimates/p-values
for (i in seq_along(results)){
  # row 1: correlation estimate, row 2: p-value
  corrs[1,i]=specify_decimal(results[[i]]$estimate,5)
  corrs[2,i]=specify_decimal(results[[i]]$p.value,5)
}
print(corrs)
##        V1       V2      V3       V4      V5      V6      V7
## 1 0.89568 -0.06868 0.88639 -0.49892 0.80940 0.69457 0.87756
## 2 0.00000  0.26988 0.00000  0.00000 0.00000 0.00000 0.00000

Cleaning up and transforming the data frame:

# Assigning column names from "data" (the seven other search terms)
names(corrs)=names(data)[3:length(data)]
# Assigning row names
row.names(corrs)=c("cor","p-value")
# transpose corrs into a new data frame, one row per search term
Correlations=as.data.frame(t(corrs))
print(Correlations)
##                  cor p-value
## guthealth    0.89568 0.00000
## superfoods  -0.06868 0.26988
## matcha       0.88639 0.00000
## paleodiet   -0.49892 0.00000
## microbiome   0.80940 0.00000
## gutbacteria  0.69457 0.00000
## probiotic    0.87756 0.00000
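The loop-and-transpose steps above store the values as formatted text, which makes them awkward to sort or filter later. One alternative is to keep the numbers numeric and round only for display. A minimal sketch, assuming simulated stand-in data (the columns term_a, term_b and term_c are hypothetical); format.pval() is the base R helper for printing very small p-values:

```r
# Simulated stand-in for the search data (hypothetical columns)
set.seed(42)
x <- rnorm(100)
y <- data.frame(term_a = x + rnorm(100),
                term_b = rnorm(100),
                term_c = -x + rnorm(100))
results <- lapply(y, function(col) cor.test(x, col))

# One number per test; vapply() checks each value is a single numeric
Correlations <- data.frame(
  cor     = vapply(results, function(r) unname(r$estimate), numeric(1)),
  p.value = vapply(results, function(r) r$p.value, numeric(1))
)

# Round only for display; p-values below eps get a "<" prefix instead of 0
display <- data.frame(
  cor     = round(Correlations$cor, 5),
  p.value = format.pval(Correlations$p.value, digits = 3, eps = 1e-5),
  row.names = rownames(Correlations),
  stringsAsFactors = FALSE
)
print(display)
```

With this layout no transpose is needed, and `Correlations` can still be sorted, for example with `Correlations[order(-abs(Correlations$cor)), ]`.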

Printing with knitr:

library('knitr')
kable(Correlations, format = "markdown")
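|             | cor      | p-value |
|:------------|:---------|:--------|
| guthealth   | 0.89568  | 0.00000 |
| superfoods  | -0.06868 | 0.26988 |
| matcha      | 0.88639  | 0.00000 |
| paleodiet   | -0.49892 | 0.00000 |
| microbiome  | 0.80940  | 0.00000 |
| gutbacteria | 0.69457  | 0.00000 |
| probiotic   | 0.87756  | 0.00000 |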