Yoga Searches on Google

Monthly yoga searches by state: 2004 to 2016 Yearly yoga searches by state: 2004 to 2016

Yoga Queries on Google

The next set of data comes from Google Trends: http://googletrends.github.io/data/.
Google has summarized how often the term "yoga" was searched on Google from 2004 to 2016, broken down by the state where the searches happened. The file has 149 rows and 52 columns: one date column plus one column for each of the 50 states and the District of Columbia. It would be fun to clean and analyse this data.

Readme: https://data.world/dotslashmaggie/google-trends-yoga/workspace/project-summary


My Planned Analysis

1. I plan to clean up the data and analyse where in the United States yoga was most popular.
2. I also plan to analyse yearly summaries of yoga searches.
3. I also plan to analyse and plot the data by state and year to see if there are any trends.

INDEX (Step by Step)

STEP 1. Load Libraries
STEP 2. Load the file
STEP 3a. Use tidyr to convert the data to long format
STEP 3b. Use regex to clean up the data
STEP 3c. Use dplyr group_by and summarise to summarise the values
STEP 5. Conclusion

STEP 1 : Load your libraries

# Load the libraries
library(tidyverse)  #For Tidyverse
## -- Attaching packages ---------------------------------------- tidyverse 1.2.1 --
## v ggplot2 2.2.1     v purrr   0.2.4
## v tibble  1.4.1     v dplyr   0.7.4
## v tidyr   0.8.0     v stringr 1.2.0
## v readr   1.1.1     v forcats 0.3.0
## -- Conflicts ------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
library(RCurl)      #For File Operations
## Loading required package: bitops
## 
## Attaching package: 'RCurl'
## The following object is masked from 'package:tidyr':
## 
##     complete
library(dplyr)      #For Manipulating the data frames
library(DT)         #For Data table package
library(ggplot2)    #For Visualizations

STEP 2 : Load the File

# Good practice: basic housekeeping, clean up the environment before starting new work
rm(list=ls())

# Garbage collector to free the memory
gc()
##           used (Mb) gc trigger (Mb) max used (Mb)
## Ncells  786354 42.0    1442291 77.1  1168576 62.5
## Vcells 1152219  8.8    2060183 15.8  1430183 11.0
# Good practice: set the working directory when working with local files
setwd("C:\\CUNY\\607Data\\Assignments\\project02")

# Read the file from the working directory (a local copy of the CSV)
untidy_data <- read.csv("20160502_YogaByStateMonth.csv", header = TRUE, sep = ",")
nrow(untidy_data)
## [1] 149
# Remove the first row, which holds comments rather than data
untidy_data <- untidy_data[2:149,]

# Check the dimensions
dim(untidy_data)
## [1] 148  52
# check the rows to ensure validity of data
head(untidy_data,2)
##         X Alabama..us.al. Alaska..us.ak. Arizona..us.az. Arkansas..us.ar.
## 2 2004-01              20             23              21               24
## 3 2004-02               8             26              25               16
##   California..us.ca. Colorado..us.co. Connecticut..us.ct. Delaware..us.de.
## 2                 32               33                  27               47
## 3                 27               30                  26               28
##   District.of.Columbia..us.dc. Florida..us.fl. Georgia..us.ga.
## 2                           32              21              21
## 3                           36              17              20
##   Hawaii..us.hi. Idaho..us.id. Illinois..us.il. Indiana..us.in.
## 2             36            21               25              24
## 3             24            22               23              14
##   Iowa..us.ia. Kansas..us.ks. Kentucky..us.ky. Louisiana..us.la.
## 2           14             20               17                20
## 3           16             12               19                15
##   Maine..us.me. Maryland..us.md. Massachusetts..us.ma. Michigan..us.mi.
## 2            29               26                    41               19
## 3            29               23                    33               18
##   Minnesota..us.mn. Mississippi..us.ms. Missouri..us.mo. Montana..us.mt.
## 2                26                  16               19              44
## 3                22                  20               18              26
##   Nebraska..us.ne. Nevada..us.nv. New.Hampshire..us.nh. New.Jersey..us.nj.
## 2               15             20                    45                 27
## 3               21             25                    20                 22
##   New.Mexico..us.nm. New.York..us.ny. North.Carolina..us.nc.
## 2                 33               35                     23
## 3                 25               28                     22
##   North.Dakota..us.nd. Ohio..us.oh. Oklahoma..us.ok. Oregon..us.or.
## 2                   52           19               22             34
## 3                   45           16               19             30
##   Pennsylvania..us.pa. Rhode.Island..us.ri. South.Carolina..us.sc.
## 2                   19                   44                     24
## 3                   18                   26                     19
##   South.Dakota..us.sd. Tennessee..us.tn. Texas..us.tx. Utah..us.ut.
## 2                   25                21            24           26
## 3                   22                18            16           20
##   Vermont..us.vt. Virginia..us.va. Washington..us.wa.
## 2              42               22                 30
## 3              39               16                 29
##   West.Virginia..us.wv. Wisconsin..us.wi. Wyoming..us.wy.
## 2                    23                18               0
## 3                    17                17              37
After dropping the comment row, the data has 148 observations and 52 variables, so it is in a wide format with one column per state. We need to convert it to the long format.
The values show monthly search interest in yoga, indexed so that 100 is the highest value.
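To make the idea of "indexed to 100" concrete, here is a minimal sketch of the rescaling step. The raw counts are made up for illustration, and Google's actual normalization also adjusts for total search volume, so this shows only the final rescaling idea and not how the file was really produced.

```r
# Hypothetical raw monthly search counts (invented for illustration only)
raw_counts <- c(120, 300, 240, 60)

# Rescale so that the largest value becomes 100
indexed <- round(100 * raw_counts / max(raw_counts))
indexed
```

Because every value is expressed relative to the peak, the numbers describe relative interest and cannot be read as absolute search counts.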

STEP 3a. Use tidyr to convert the data to long format

names(untidy_data)
##  [1] "X"                            "Alabama..us.al."             
##  [3] "Alaska..us.ak."               "Arizona..us.az."             
##  [5] "Arkansas..us.ar."             "California..us.ca."          
##  [7] "Colorado..us.co."             "Connecticut..us.ct."         
##  [9] "Delaware..us.de."             "District.of.Columbia..us.dc."
## [11] "Florida..us.fl."              "Georgia..us.ga."             
## [13] "Hawaii..us.hi."               "Idaho..us.id."               
## [15] "Illinois..us.il."             "Indiana..us.in."             
## [17] "Iowa..us.ia."                 "Kansas..us.ks."              
## [19] "Kentucky..us.ky."             "Louisiana..us.la."           
## [21] "Maine..us.me."                "Maryland..us.md."            
## [23] "Massachusetts..us.ma."        "Michigan..us.mi."            
## [25] "Minnesota..us.mn."            "Mississippi..us.ms."         
## [27] "Missouri..us.mo."             "Montana..us.mt."             
## [29] "Nebraska..us.ne."             "Nevada..us.nv."              
## [31] "New.Hampshire..us.nh."        "New.Jersey..us.nj."          
## [33] "New.Mexico..us.nm."           "New.York..us.ny."            
## [35] "North.Carolina..us.nc."       "North.Dakota..us.nd."        
## [37] "Ohio..us.oh."                 "Oklahoma..us.ok."            
## [39] "Oregon..us.or."               "Pennsylvania..us.pa."        
## [41] "Rhode.Island..us.ri."         "South.Carolina..us.sc."      
## [43] "South.Dakota..us.sd."         "Tennessee..us.tn."           
## [45] "Texas..us.tx."                "Utah..us.ut."                
## [47] "Vermont..us.vt."              "Virginia..us.va."            
## [49] "Washington..us.wa."           "West.Virginia..us.wv."       
## [51] "Wisconsin..us.wi."            "Wyoming..us.wy."
##
# Create a long format 
##
yoga_tidy <- untidy_data %>% 
    gather(State, Value, 2:52) %>% 
    filter(!is.na(Value)) 
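For readers on a newer tidyr (1.0.0 or later, which is an assumption here since this report was run with tidyr 0.8.0), `gather()` is superseded by `pivot_longer()`. The sketch below applies the same wide-to-long reshaping to a small data frame built from two rows and three state columns of the `head()` output above; the names `wide` and `long` are illustrative only.

```r
library(tidyr)
library(dplyr)

# Two rows and three state columns copied from the head() output above
wide <- data.frame(
  X = c("2004-01", "2004-02"),
  Alabama..us.al. = c(20, 8),
  Alaska..us.ak. = c(23, 26),
  Arizona..us.az. = c(21, 25),
  stringsAsFactors = FALSE
)

# Every column except X becomes a (State, Value) pair
long <- wide %>%
  pivot_longer(cols = -X, names_to = "State", values_to = "Value") %>%
  filter(!is.na(Value))

long
```

One difference to be aware of: `pivot_longer()` orders the result row by row (all states for 2004-01, then all states for 2004-02), whereas `gather()` stacks one state column after another. The content is the same either way.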

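STEP 3b. Use regex to clean up the data

##
# Clean up the state names using regex
##
# Keep everything up to and including the ".." that precedes the "us.xx." suffix
yoga_tidy$State <- str_extract(yoga_tidy$State, ".+\\.\\.")
# Drop the trailing ".."
yoga_tidy$State <- gsub("\\.\\.", "", yoga_tidy$State)
# Turn the remaining single dots into spaces (e.g. "New.York" becomes "New York")
yoga_tidy$State <- gsub("\\.", " ", yoga_tidy$State)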
head(yoga_tidy)
##         X   State Value
## 1 2004-01 Alabama    20
## 2 2004-02 Alabama     8
## 3 2004-03 Alabama    10
## 4 2004-04 Alabama    15
## 5 2004-05 Alabama    15
## 6 2004-06 Alabama    12
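The index lists a step 3c that summarises the values with `group_by` and `summarise`. The following is only a rough sketch of what such a step could look like, not the original code: it uses the six Alabama rows printed above, and the names `yoga_sample`, `Year`, `MeanValue`, and `yearly` are assumptions introduced for illustration. If `Value` was read in as text in the full data, it would first need converting, for example with `as.numeric(as.character(Value))`.

```r
library(dplyr)

# First six rows of yoga_tidy, copied from the head() output above
yoga_sample <- data.frame(
  X = c("2004-01", "2004-02", "2004-03", "2004-04", "2004-05", "2004-06"),
  State = "Alabama",
  Value = c(20, 8, 10, 15, 15, 12),
  stringsAsFactors = FALSE
)

yearly <- yoga_sample %>%
  mutate(Year = as.integer(substr(X, 1, 4))) %>%  # "2004-01" -> 2004
  group_by(State, Year) %>%
  summarise(MeanValue = mean(Value)) %>%          # mean index per state and year
  ungroup()

yearly
```

Grouping by `State` alone would give a per-state mean across all years, and grouping by `Year` alone would give a yearly summary across all states.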

STEP 5: Conclusion

1. In 4a, we saw that interest in yoga dipped around 2009 but has been rising since then.
2. In 4b, we saw that Vermont had the highest interest in yoga, while Alabama had the lowest mean interest.
3. In 4c, we saw that interest in yoga spiked in Vermont (I wonder why), while in all other states it was either steady or rose slightly.

