Task:
Using keyword search data from Google Trends, examine relationships between various keyword searches over time. Test for correlations between “kombucha” and seven other search terms using the cor.test() function. Then create a data frame displaying key values from the output of each correlation test (here, the correlation coefficients and p-values).
Data:
I used Google Trends to collect data on search frequency over time for ‘kombucha’, ‘guthealth’, ‘superfoods’, ‘matcha’, ‘paleodiet’, ‘microbiome’, ‘gutbacteria’, and ‘probiotic’. Data were collected weekly from 5/20/12 to 5/7/17.
Import data:
data=read.csv("Kombucha_Searches.csv")
colnames(data)=c('Week', 'kombucha', 'guthealth', 'superfoods', 'matcha',
'paleodiet', 'microbiome', 'gutbacteria', 'probiotic')
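Before running any tests, a quick sanity check that the file imported as expected can help; a minimal look at the dimensions and first few rows:
# expect 260 rows (one per week) and 9 columns (Week plus eight search terms)
dim(data)
head(data, 3)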
Trying out cor.test():
cor.test(data$kombucha, data$matcha)
##
## Pearson's product-moment correlation
##
## data: data$kombucha and data$matcha
## t = 30.754, df = 258, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.8571622 0.9099211
## sample estimates:
## cor
## 0.8863862
This output is too clunky to scan across multiple tests. We’re running tests for correlations between “kombucha” and seven other search terms, so what we need is a way to access key elements from each cor.test() output (the correlation coefficients and p-values, in this case) in an easy-to-read format.
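For a single test, those pieces can be pulled straight off the returned object, since cor.test() returns a list-like “htest” object with named components. A quick illustration (kt is just a throwaway variable name for one stored result):
kt <- cor.test(data$kombucha, data$matcha)
kt$estimate   # the correlation coefficient (named "cor")
kt$p.value    # the p-value of the test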
Start by subsetting the columns into “kombucha” and a group of all other terms. Then create an empty list of length 7 (one slot per test) to hold the cor.test() results:
# "kombucha" as our x
x=data[,2]
# seven other search terms as our y's
y=data[3:length(data)]
# empty list to store the results of the seven tests
results <- vector("list", 7)
Loop through the seven tests, storing each result in the “results” list. Each list item contains the nine pieces of output from cor.test().
for (i in 1:length(y)) {
  results[[i]] <- cor.test(x, y[[i]])
}
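To confirm what each stored test contains, the component names of any element can be listed; for a Pearson cor.test() these are statistic, parameter, p.value, estimate, null.value, alternative, method, data.name, and conf.int:
names(results[[1]])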
Extract the correlation coefficients and p-values from the “results” list, and create a data frame displaying the corresponding values for each search term:
# Create a placeholder data frame (its initial "nrow" column is overwritten in the loop)
corrs=data.frame(nrow=2)
# shorten values using Jeromy Anglim's specify_decimal function:
specify_decimal <- function(x, k) format(round(x, k), nsmall=k)
# loop through "results" list, extracting estimates (row 1) and p-values (row 2)
for (i in 1:length(results)){
  # assign the estimate and p-value to the column for search term i
  corrs[,i]=specify_decimal(results[[i]]$estimate,5)
  corrs[2,i]=specify_decimal(results[[i]]$p.value,5)
}
print(corrs)
## nrow V2 V3 V4 V5 V6 V7
## 1 0.89568 -0.06868 0.88639 -0.49892 0.80940 0.69457 0.87756
## 2 0.00000 0.26988 0.00000 0.00000 0.00000 0.00000 0.00000
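As an aside, the same two quantities could also be pulled in a single step with sapply() rather than a loop. This is only a sketch of an alternative (it reuses the results list and the column names of y from above), not the approach taken here:
# extract all estimates and p-values at once
est  <- sapply(results, function(r) unname(r$estimate))
pval <- sapply(results, function(r) r$p.value)
data.frame(cor = round(est, 5), p.value = round(pval, 5), row.names = names(y))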
Cleaning up and transforming the data frame:
#Assigning column names from "data"
words=as.list(names(data[3:length(data)]))
names(corrs)=words
#Assigning row names
row.names(corrs)=c("cor","p-value")
# transpose corrs into a new data frame
Correlations=as.data.frame(t(corrs))
print(Correlations)
## cor p-value
## guthealth 0.89568 0.00000
## superfoods -0.06868 0.26988
## matcha 0.88639 0.00000
## paleodiet -0.49892 0.00000
## microbiome 0.80940 0.00000
## gutbacteria 0.69457 0.00000
## probiotic 0.87756 0.00000
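Because specify_decimal() returns character strings, the columns of Correlations are character rather than numeric. If the values are needed for further computation (sorting by correlation strength, say), they can be converted back on a copy; a small sketch, with Correlations_num as a purely illustrative name:
# numeric copy of the table, preserving row and column names
Correlations_num <- Correlations
Correlations_num[] <- lapply(Correlations_num, as.numeric)
# e.g. order the terms by strength of correlation with "kombucha"
Correlations_num[order(-Correlations_num$cor), ]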
Printing with knitr:
library('knitr')
kable(Correlations, format = "markdown")
|  | cor | p-value |
|---|---|---|
| guthealth | 0.89568 | 0.00000 |
| superfoods | -0.06868 | 0.26988 |
| matcha | 0.88639 | 0.00000 |
| paleodiet | -0.49892 | 0.00000 |
| microbiome | 0.80940 | 0.00000 |
| gutbacteria | 0.69457 | 0.00000 |
| probiotic | 0.87756 | 0.00000 |
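If a numeric version of the table is kept (such as the Correlations_num sketch above), kable() can also handle the rounding itself through its digits argument instead of relying on pre-formatted character strings, e.g.:
kable(Correlations_num, format = "markdown", digits = 5)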