Yoga Queries on Google
Source datasets: monthly and yearly yoga searches by state, 2004 to 2016
The next data set comes from Google Trends (http://googletrends.github.io/data/). Google has summarized, for the years 2004-2016, how often the term "Yoga" was searched on Google and in which state each search happened. The data set has 149 rows and 52 variables: a date column plus one column for each of the 50 states and the District of Columbia. It will be fun to clean and analyse this data.
Readme: https://data.world/dotslashmaggie/google-trends-yoga/workspace/project-summary
1. I plan to clean up the data to analyse where in the United States yoga was most popular
2. I also plan to analyse yearly summaries of yoga searches
3. I also plan to analyse/plot the data by state and year to see if there are any trends
# Load the libraries
library(tidyverse) # Core tidyverse packages
## -- Attaching packages ---------------------------------------- tidyverse 1.2.1 --
## v ggplot2 2.2.1 v purrr 0.2.4
## v tibble 1.4.1 v dplyr 0.7.4
## v tidyr 0.8.0 v stringr 1.2.0
## v readr 1.1.1 v forcats 0.3.0
## -- Conflicts ------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(RCurl) # For file/URL operations
## Loading required package: bitops
##
## Attaching package: 'RCurl'
## The following object is masked from 'package:tidyr':
##
## complete
library(dplyr)   # For manipulating data frames (already attached by tidyverse)
library(DT)      # For the datatable() widget
library(ggplot2) # For visualizations (already attached by tidyverse)

# Good practice: basic housekeeping -- clean up the environment before starting new work
rm(list=ls())
# Garbage collector to free the memory
gc()
## used (Mb) gc trigger (Mb) max used (Mb)
## Ncells 786354 42.0 1442291 77.1 1168576 62.5
## Vcells 1152219 8.8 2060183 15.8 1430183 11.0
# Good practice: set the working directory when working with the file system
setwd("C:\\CUNY\\607Data\\Assignments\\project02")
# Read the file from the local working directory
# (a sketch of reading it directly from a hosted URL follows below)
untidy_data <- read.csv("20160502_YogaByStateMonth.csv", header = TRUE, sep = ",")
nrow(untidy_data)
## [1] 149
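The data could also be pulled straight from a hosted copy rather than a local file. A minimal sketch, assuming the CSV is available at some raw URL (the address below is a hypothetical placeholder, not the file's real location):
# Sketch: fetch the raw CSV text over HTTP with RCurl, then parse it with read.csv
csv_url  <- "https://raw.githubusercontent.com/user/repo/master/20160502_YogaByStateMonth.csv"  # placeholder URL
csv_text <- getURL(csv_url)
untidy_data <- read.csv(text = csv_text, header = TRUE, sep = ",")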
# Remove the first row, which holds comments rather than data
untidy_data <- untidy_data[2:149, ]
# Check the dimensions
dim(untidy_data)
## [1] 148 52
# Check the first rows to verify the data looks valid
head(untidy_data, 2)
## X Alabama..us.al. Alaska..us.ak. Arizona..us.az. Arkansas..us.ar.
## 2 2004-01 20 23 21 24
## 3 2004-02 8 26 25 16
## California..us.ca. Colorado..us.co. Connecticut..us.ct. Delaware..us.de.
## 2 32 33 27 47
## 3 27 30 26 28
## District.of.Columbia..us.dc. Florida..us.fl. Georgia..us.ga.
## 2 32 21 21
## 3 36 17 20
## Hawaii..us.hi. Idaho..us.id. Illinois..us.il. Indiana..us.in.
## 2 36 21 25 24
## 3 24 22 23 14
## Iowa..us.ia. Kansas..us.ks. Kentucky..us.ky. Louisiana..us.la.
## 2 14 20 17 20
## 3 16 12 19 15
## Maine..us.me. Maryland..us.md. Massachusetts..us.ma. Michigan..us.mi.
## 2 29 26 41 19
## 3 29 23 33 18
## Minnesota..us.mn. Mississippi..us.ms. Missouri..us.mo. Montana..us.mt.
## 2 26 16 19 44
## 3 22 20 18 26
## Nebraska..us.ne. Nevada..us.nv. New.Hampshire..us.nh. New.Jersey..us.nj.
## 2 15 20 45 27
## 3 21 25 20 22
## New.Mexico..us.nm. New.York..us.ny. North.Carolina..us.nc.
## 2 33 35 23
## 3 25 28 22
## North.Dakota..us.nd. Ohio..us.oh. Oklahoma..us.ok. Oregon..us.or.
## 2 52 19 22 34
## 3 45 16 19 30
## Pennsylvania..us.pa. Rhode.Island..us.ri. South.Carolina..us.sc.
## 2 19 44 24
## 3 18 26 19
## South.Dakota..us.sd. Tennessee..us.tn. Texas..us.tx. Utah..us.ut.
## 2 25 21 24 26
## 3 22 18 16 20
## Vermont..us.vt. Virginia..us.va. Washington..us.wa.
## 2 42 22 30
## 3 39 16 29
## West.Virginia..us.wv. Wisconsin..us.wi. Wyoming..us.wy.
## 2 23 18 0
## 3 17 17 37
names(untidy_data)
## [1] "X" "Alabama..us.al."
## [3] "Alaska..us.ak." "Arizona..us.az."
## [5] "Arkansas..us.ar." "California..us.ca."
## [7] "Colorado..us.co." "Connecticut..us.ct."
## [9] "Delaware..us.de." "District.of.Columbia..us.dc."
## [11] "Florida..us.fl." "Georgia..us.ga."
## [13] "Hawaii..us.hi." "Idaho..us.id."
## [15] "Illinois..us.il." "Indiana..us.in."
## [17] "Iowa..us.ia." "Kansas..us.ks."
## [19] "Kentucky..us.ky." "Louisiana..us.la."
## [21] "Maine..us.me." "Maryland..us.md."
## [23] "Massachusetts..us.ma." "Michigan..us.mi."
## [25] "Minnesota..us.mn." "Mississippi..us.ms."
## [27] "Missouri..us.mo." "Montana..us.mt."
## [29] "Nebraska..us.ne." "Nevada..us.nv."
## [31] "New.Hampshire..us.nh." "New.Jersey..us.nj."
## [33] "New.Mexico..us.nm." "New.York..us.ny."
## [35] "North.Carolina..us.nc." "North.Dakota..us.nd."
## [37] "Ohio..us.oh." "Oklahoma..us.ok."
## [39] "Oregon..us.or." "Pennsylvania..us.pa."
## [41] "Rhode.Island..us.ri." "South.Carolina..us.sc."
## [43] "South.Dakota..us.sd." "Tennessee..us.tn."
## [45] "Texas..us.tx." "Utah..us.ut."
## [47] "Vermont..us.vt." "Virginia..us.va."
## [49] "Washington..us.wa." "West.Virginia..us.wv."
## [51] "Wisconsin..us.wi." "Wyoming..us.wy."
##
# Create a long format
##
yoga_tidy <- untidy_data %>%
  gather(State, Value, 2:52) %>%
  filter(!is.na(Value))
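gather() works, but it has since been superseded in tidyr. A sketch of the same reshape with pivot_longer(), assuming tidyr 1.0.0 or later (newer than the 0.8.0 attached above):
# Sketch: the same wide-to-long reshape with pivot_longer (requires tidyr >= 1.0.0)
yoga_tidy <- untidy_data %>%
  pivot_longer(cols = -X, names_to = "State", values_to = "Value",
               values_drop_na = TRUE)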
##
# Clean up the State names using regex
##
yoga_tidy$State <- unlist(str_extract_all(yoga_tidy$State, ".+\\.\\.{1}"))
yoga_tidy$State <- gsub("\\.\\.", "", yoga_tidy$State)
yoga_tidy$State <- gsub("\\.", " ", yoga_tidy$State)
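The three steps above can also be collapsed into a single pass: drop everything from the first ".." onward, then turn the remaining dots into spaces. A sketch, equivalent for names like "New.York..us.ny.":
# Sketch: one-pass cleanup -- "New.York..us.ny." becomes "New York"
yoga_tidy$State <- gsub("\\.", " ", sub("\\.\\..*$", "", yoga_tidy$State))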
head(yoga_tidy)
## X State Value
## 1 2004-01 Alabama 20
## 2 2004-02 Alabama 8
## 3 2004-03 Alabama 10
## 4 2004-04 Alabama 15
## 5 2004-05 Alabama 15
## 6 2004-06 Alabama 12
##
# Separate the year and month so that we can work with each on its own
##
yoga_tidy_date <- separate(yoga_tidy, X, c("Year", "Month"), sep = "-")
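separate() returns Year and Month as character columns. An optional sketch (not part of the original pipeline) that converts them to numeric once, up front, so the as.numeric() coercion inside the plots below becomes unnecessary:
# Sketch: convert the split-out character columns to numeric in one mutate
yoga_tidy_date <- yoga_tidy_date %>%
  mutate(Year = as.numeric(Year), Month = as.numeric(Month))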
# Summarise by year
yoga_tidy_by_year <- yoga_tidy_date %>%
  group_by(Year) %>%
  summarise(Total_Interest = mean(Value))
ggplot(yoga_tidy_by_year, aes(x = Year, y = Total_Interest, fill = Total_Interest)) +
  geom_bar(stat = "identity") +
  xlab("Year") + ylab("Mean of the Indexed Google Searches by Year") +
  ggtitle("Interest in Yoga by Year from 2004-2016") +
  theme_bw() + # a complete theme must come before theme() tweaks, or it overwrites them
  theme(plot.title = element_text(lineheight = .8, face = "bold")) +
  theme(axis.text.x = element_text(angle = 60, vjust = .5, size = 9))

# Summarise by state
yoga_tidy_by_state <- yoga_tidy_date %>%
  group_by(State) %>%
  summarise(Total_Interest = mean(Value)) %>%
  arrange(desc(Total_Interest))
ggplot(yoga_tidy_by_state, aes(x = State, y = Total_Interest, fill = Total_Interest)) +
  geom_bar(stat = "identity") +
  xlab("US State") + ylab("Mean of the Indexed Google Searches by State") +
  ggtitle("Interest in Yoga by State from 2004-2016") +
  theme_bw() + # again, the complete theme goes first
  theme(plot.title = element_text(lineheight = .8, face = "bold")) +
  theme(axis.text.x = element_text(angle = 60, vjust = .5, size = 9)) +
  coord_flip()
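Note that arrange() sorts the summary table but not the bars: ggplot2 orders a character axis alphabetically. A sketch that orders the bars by mean interest instead, with reorder() inside aes() as the only substantive change:
# Sketch: order the bars by mean interest rather than alphabetically
ggplot(yoga_tidy_by_state,
       aes(x = reorder(State, Total_Interest), y = Total_Interest, fill = Total_Interest)) +
  geom_bar(stat = "identity") +
  xlab("US State") + ylab("Mean of the Indexed Google Searches by State") +
  theme_bw() +
  coord_flip()

# Summarise by state and year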
yoga_tidy_by_state_Year <- yoga_tidy_date %>%
  group_by(State, Year) %>%
  summarise(Total_Interest = mean(Value)) %>%
  arrange(desc(Total_Interest))
head(yoga_tidy_by_state_Year)
## # A tibble: 6 x 3
## # Groups: State [1]
## State Year Total_Interest
## <chr> <chr> <dbl>
## 1 Vermont 2016 92.8
## 2 Vermont 2015 72.0
## 3 Vermont 2014 68.0
## 4 Vermont 2013 63.4
## 5 Vermont 2012 62.8
## 6 Vermont 2011 61.5
ggplot(yoga_tidy_by_state_Year,
       aes(x = as.numeric(Year), y = Total_Interest, group = State, colour = State)) +
  geom_line() +
  geom_point() +
  scale_x_continuous(limits = c(2004, 2016)) +
  theme_bw() +
  xlab("Year") +
  ylab("Mean Interest") +
  ggtitle("Interest in Yoga by State") +
  theme(plot.title = element_text(lineheight = .8))
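Fifty-one coloured lines make a crowded plot. An optional sketch (not in the original) that keeps only the five states with the highest overall mean, reusing the already-sorted yoga_tidy_by_state:
# Sketch: limit the line plot to the five states with the highest overall mean
top5 <- head(yoga_tidy_by_state$State, 5)
yoga_tidy_by_state_Year %>%
  filter(State %in% top5) %>%
  ggplot(aes(x = as.numeric(Year), y = Total_Interest, colour = State)) +
  geom_line() +
  geom_point() +
  theme_bw() +
  xlab("Year") + ylab("Mean Interest") +
  ggtitle("Interest in Yoga: top five states")

# Datatable: yoga by state and year (column filters on top)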
datatable(yoga_tidy_by_state_Year, filter = "top")
1. We saw in 4a how interest in yoga dipped around 2009 but has been on the rise since then.
2. We saw in 4b that Vermont was the state with the highest mean interest in yoga, while Alabama had the lowest.
3. We saw in 4c how interest in yoga spiked in Vermont (I wonder why), while in all other states interest was either steady or rose slightly.