RK_607_Project02-03

Yoga Searches on Google

Monthly yoga searches by state: 2004 to 2016
Yearly yoga searches by state: 2004 to 2016

Yoga Queries on Google

The next data set comes from Google: http://googletrends.github.io/data/.
Google has summarized data for 2004 to 2016 showing how often the term "yoga" was searched on Google, broken down by the state where the searches happened. The data set has 149 rows and 52 variables: one date column plus one column for each of the 50 states and the District of Columbia. It would be fun to clean and analyse this data.

Readme: https://data.world/dotslashmaggie/google-trends-yoga/workspace/project-summary

My Planned Analysis

1. I plan to clean up the data to analyse where in the United States yoga was most popular
2. I also plan to analyse yearly summaries of yoga searches
3. I also plan to analyse and plot the data by state and year to see if there are any trends

INDEX (Step by Step)

STEP 1. Load Libraries

STEP 2. Load the file

STEP 3a. Use Dplyr to convert the data in long format

STEP 3b. Use REGEX to cleanup the data

STEP 3c. Use Dplyr group_by and Summarise to summarise the values

STEP 4a. Analysis 1: Analyse popularity trends of yoga by the year

STEP 4b. Analysis 2: Analyse popularity trends of yoga by the state

STEP 4c. Analysis 3: Analyse popularity trends of yoga by the state and Year

STEP 5. Conclusion

STEP 1 : Load your libraries

# Load the libraries
library(tidyverse)  #For Tidyverse

## -- Attaching packages ---------------------------------------- tidyverse 1.2.1 --

## v ggplot2 2.2.1     v purrr   0.2.4
## v tibble  1.4.1     v dplyr   0.7.4
## v tidyr   0.8.0     v stringr 1.2.0
## v readr   1.1.1     v forcats 0.3.0

## -- Conflicts ------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

library(RCurl)      #For File Operations

## Loading required package: bitops

## 
## Attaching package: 'RCurl'

## The following object is masked from 'package:tidyr':
## 
##     complete

library(dplyr)      #For Manipulating the data frames
library(DT)         #For Data table package
library(ggplot2)    #For Visualizations

STEP 2 : Load the File

# Good practice: basic housekeeping, clean up the environment before starting new work
rm(list=ls())

# Garbage collector to free the memory
gc()

##           used (Mb) gc trigger (Mb) max used (Mb)
## Ncells  786354 42.0    1442291 77.1  1168576 62.5
## Vcells 1152219  8.8    2060183 15.8  1430183 11.0

# Good practice: set the working directory when working with a file system
setwd("C:\\CUNY\\607Data\\Assignments\\project02")

# Read the file (a local copy of the GitHub data) from the working directory
untidy_data <- read.csv("20160502_YogaByStateMonth.csv", header = TRUE, sep = ",")
nrow(untidy_data)

## [1] 149

# Remove the first row, which holds comments rather than data
untidy_data <- untidy_data[2:149,]

# Check the dimensions
dim(untidy_data)

## [1] 148  52

# check the rows to ensure validity of data
head(untidy_data,2)

##         X Alabama..us.al. Alaska..us.ak. Arizona..us.az. Arkansas..us.ar.
## 2 2004-01              20             23              21               24
## 3 2004-02               8             26              25               16
##   California..us.ca. Colorado..us.co. Connecticut..us.ct. Delaware..us.de.
## 2                 32               33                  27               47
## 3                 27               30                  26               28
##   District.of.Columbia..us.dc. Florida..us.fl. Georgia..us.ga.
## 2                           32              21              21
## 3                           36              17              20
##   Hawaii..us.hi. Idaho..us.id. Illinois..us.il. Indiana..us.in.
## 2             36            21               25              24
## 3             24            22               23              14
##   Iowa..us.ia. Kansas..us.ks. Kentucky..us.ky. Louisiana..us.la.
## 2           14             20               17                20
## 3           16             12               19                15
##   Maine..us.me. Maryland..us.md. Massachusetts..us.ma. Michigan..us.mi.
## 2            29               26                    41               19
## 3            29               23                    33               18
##   Minnesota..us.mn. Mississippi..us.ms. Missouri..us.mo. Montana..us.mt.
## 2                26                  16               19              44
## 3                22                  20               18              26
##   Nebraska..us.ne. Nevada..us.nv. New.Hampshire..us.nh. New.Jersey..us.nj.
## 2               15             20                    45                 27
## 3               21             25                    20                 22
##   New.Mexico..us.nm. New.York..us.ny. North.Carolina..us.nc.
## 2                 33               35                     23
## 3                 25               28                     22
##   North.Dakota..us.nd. Ohio..us.oh. Oklahoma..us.ok. Oregon..us.or.
## 2                   52           19               22             34
## 3                   45           16               19             30
##   Pennsylvania..us.pa. Rhode.Island..us.ri. South.Carolina..us.sc.
## 2                   19                   44                     24
## 3                   18                   26                     19
##   South.Dakota..us.sd. Tennessee..us.tn. Texas..us.tx. Utah..us.ut.
## 2                   25                21            24           26
## 3                   22                18            16           20
##   Vermont..us.vt. Virginia..us.va. Washington..us.wa.
## 2              42               22                 30
## 3              39               16                 29
##   West.Virginia..us.wv. Wisconsin..us.wi. Wyoming..us.wy.
## 2                    23                18               0
## 3                    17                17              37

As we can see, the data is in a wide format: 148 observations (after dropping the comment row) and 52 variables. We need to convert it to a long format.

The values show monthly search interest in yoga, indexed so that 100 is the highest value.
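
To make the idea of an index concrete, here is a minimal sketch of rescaling a series so that its largest value becomes 100. The raw counts are made-up numbers, and dividing by the maximum is an assumption about how such an index is built; the source does not describe Google's exact normalisation.

```r
# Hypothetical raw monthly search counts (made-up numbers, not from the data set)
raw_counts <- c(120, 300, 240, 600)

# Rescale so the largest value becomes 100 and the others stay proportional
indexed <- round(100 * raw_counts / max(raw_counts))
indexed
# [1]  20  50  40 100
```

An indexed value therefore says how a month compares with the peak, not how many searches actually took place.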

STEP 3a. Use Dplyr to convert the data in long format

names(untidy_data)

##  [1] "X"                            "Alabama..us.al."             
##  [3] "Alaska..us.ak."               "Arizona..us.az."             
##  [5] "Arkansas..us.ar."             "California..us.ca."          
##  [7] "Colorado..us.co."             "Connecticut..us.ct."         
##  [9] "Delaware..us.de."             "District.of.Columbia..us.dc."
## [11] "Florida..us.fl."              "Georgia..us.ga."             
## [13] "Hawaii..us.hi."               "Idaho..us.id."               
## [15] "Illinois..us.il."             "Indiana..us.in."             
## [17] "Iowa..us.ia."                 "Kansas..us.ks."              
## [19] "Kentucky..us.ky."             "Louisiana..us.la."           
## [21] "Maine..us.me."                "Maryland..us.md."            
## [23] "Massachusetts..us.ma."        "Michigan..us.mi."            
## [25] "Minnesota..us.mn."            "Mississippi..us.ms."         
## [27] "Missouri..us.mo."             "Montana..us.mt."             
## [29] "Nebraska..us.ne."             "Nevada..us.nv."              
## [31] "New.Hampshire..us.nh."        "New.Jersey..us.nj."          
## [33] "New.Mexico..us.nm."           "New.York..us.ny."            
## [35] "North.Carolina..us.nc."       "North.Dakota..us.nd."        
## [37] "Ohio..us.oh."                 "Oklahoma..us.ok."            
## [39] "Oregon..us.or."               "Pennsylvania..us.pa."        
## [41] "Rhode.Island..us.ri."         "South.Carolina..us.sc."      
## [43] "South.Dakota..us.sd."         "Tennessee..us.tn."           
## [45] "Texas..us.tx."                "Utah..us.ut."                
## [47] "Vermont..us.vt."              "Virginia..us.va."            
## [49] "Washington..us.wa."           "West.Virginia..us.wv."       
## [51] "Wisconsin..us.wi."            "Wyoming..us.wy."

##
# Create a long format 
##
yoga_tidy <- untidy_data %>% 
    gather(State, Value, 2:52) %>% 
    filter(!is.na(Value))
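
The same wide-to-long reshaping can be sketched on a toy table. This example assumes a newer tidyr (1.0.0 or later), where `pivot_longer()` is the recommended replacement for `gather()`; the session above attached tidyr 0.8.0, which does not have it. The values in the toy table are made up.

```r
library(tidyr)

# Toy wide table: one row per month, one column per state (made-up values)
wide <- data.frame(
  X = c("2004-01", "2004-02"),
  Alabama..us.al. = c(20, 8),
  Alaska..us.ak.  = c(23, NA),
  stringsAsFactors = FALSE
)

# Every column except X becomes a (State, Value) pair;
# values_drop_na = TRUE plays the role of filter(!is.na(Value))
long <- pivot_longer(wide, cols = -X, names_to = "State", values_to = "Value",
                     values_drop_na = TRUE)
long
```

The toy table has four state cells, one of them missing, so the long version keeps three rows.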

STEP 3b. Use REGEX to cleanup the data

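##
# Clean up the names of the states using regex
##
# Keep everything up to and including the ".." that precedes the "us.xx." suffix
yoga_tidy$State <- str_extract(yoga_tidy$State, ".+\\.\\.")
# Drop the trailing ".."
yoga_tidy$State <- gsub("\\.\\.", '', yoga_tidy$State)
# Replace the remaining dots with spaces (e.g. "New.York" becomes "New York")
yoga_tidy$State <- gsub("\\.", ' ', yoga_tidy$State)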
head(yoga_tidy)

##         X   State Value
## 1 2004-01 Alabama    20
## 2 2004-02 Alabama     8
## 3 2004-03 Alabama    10
## 4 2004-04 Alabama    15
## 5 2004-05 Alabama    15
## 6 2004-06 Alabama    12
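
As a cross-check of the clean-up rule, here is a small base R sketch on three of the raw column names listed in STEP 3a. It is an alternative formulation, not the code used above: it removes the whole "..us.xx." suffix in one step and then swaps the remaining dots for spaces.

```r
# Three raw column names as printed by names(untidy_data)
raw_names <- c("Alabama..us.al.", "New.York..us.ny.", "District.of.Columbia..us.dc.")

# Strip the "..us.xx." suffix, then replace every remaining dot with a space
without_suffix <- sub("\\.\\.us\\.[a-z]{2}\\.$", "", raw_names)
clean_names <- gsub(".", " ", without_suffix, fixed = TRUE)
clean_names
# [1] "Alabama"              "New York"             "District of Columbia"
```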

STEP 4a. Analysis 1: Analyse popularity trends of yoga by the year

##
# Separate the year and month so that we can summarise by year
##
yoga_tidy_date <- separate(yoga_tidy, X, c("Year", "Month"), sep="-")


# Summarise by year
yoga_tidy_by_year <- yoga_tidy_date %>% 
    group_by(Year) %>% 
    summarise("Total_Interest" = mean(Value))

ggplot(yoga_tidy_by_year, aes(x=Year, y=Total_Interest, fill=Total_Interest)) +
  geom_bar(stat = "identity") +
  xlab("Year") + ylab("Mean of the Indexed Google Searches by Year") +
  ggtitle("Interest in Yoga by Year from 2004-2016") +
  # Apply the complete theme first so the tweaks below are not overwritten
  theme_bw() +
  theme(plot.title = element_text(lineheight = .8, face = "bold")) +
  theme(axis.text.x = element_text(angle = 60, vjust = .5, size = 9))
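
To show what the yearly summary computes, here is a minimal base R sketch on made-up values. It splits the "YYYY-MM" label into year and month and takes the mean per year, which is the same quantity the column named Total_Interest holds above (a mean, despite its name).

```r
# Toy long-format data for one state (made-up values)
toy <- data.frame(
  X     = c("2004-01", "2004-02", "2005-01", "2005-02"),
  Value = c(20, 10, 30, 50),
  stringsAsFactors = FALSE
)

# Split "YYYY-MM" into its two parts
parts <- do.call(rbind, strsplit(toy$X, "-", fixed = TRUE))
toy$Year  <- parts[, 1]
toy$Month <- parts[, 2]

# Mean of the indexed values within each year
yearly_mean <- tapply(toy$Value, toy$Year, mean)
yearly_mean
```

Because the values are indices rather than counts, a mean is easier to interpret than a sum here.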

STEP 4b. Analysis 2: Analyse popularity trends of yoga by the state

# Summarise by state
yoga_tidy_by_state <- yoga_tidy_date %>% 
    group_by(State) %>% 
    summarise("Total_Interest" = mean(Value)) %>% 
    arrange(desc(`Total_Interest`))

ggplot(yoga_tidy_by_state, aes(x=State, y=Total_Interest, fill=Total_Interest)) +
  geom_bar(stat = "identity") +
  xlab("US State") + ylab("Mean of the Indexed Google Searches by State") +
  ggtitle("Interest in Yoga by State from 2004-2016") +
  # Apply the complete theme first so the tweaks below are not overwritten
  theme_bw() +
  theme(plot.title = element_text(lineheight = .8, face = "bold")) +
  theme(axis.text.x = element_text(angle = 60, vjust = .5, size = 9)) +
  coord_flip()
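
Note that `arrange()` sorts the rows of the summary but does not change the order of the bars, because ggplot2 orders a character axis alphabetically. A minimal sketch of one way to sort the bars by value uses `reorder()` inside `aes()`; the three states and their values below are made up for illustration.

```r
library(ggplot2)

# Toy summary table (made-up values)
toy_by_state <- data.frame(
  State = c("Alabama", "Vermont", "Oregon"),
  Total_Interest = c(20, 60, 45),
  stringsAsFactors = FALSE
)

# reorder() turns State into a factor whose levels follow Total_Interest
ordered_states <- levels(reorder(toy_by_state$State, toy_by_state$Total_Interest))

p <- ggplot(toy_by_state, aes(x = reorder(State, Total_Interest), y = Total_Interest)) +
  geom_col() +
  xlab("US State") +
  coord_flip()
```

With `coord_flip()`, the state with the highest mean ends up at the top of the chart.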

STEP 4c. Analysis 3: Analyse popularity trends of yoga by the state and Year

# Summarise by state and Year
yoga_tidy_by_state_Year <- yoga_tidy_date %>% 
    group_by(State, Year) %>% 
    summarise("Total_Interest" = mean(Value)) %>% 
    arrange(desc(`Total_Interest`))

head(yoga_tidy_by_state_Year)

## # A tibble: 6 x 3
## # Groups: State [1]
##   State   Year  Total_Interest
##   <chr>   <chr>          <dbl>
## 1 Vermont 2016            92.8
## 2 Vermont 2015            72.0
## 3 Vermont 2014            68.0
## 4 Vermont 2013            63.4
## 5 Vermont 2012            62.8
## 6 Vermont 2011            61.5

ggplot(yoga_tidy_by_state_Year, aes(x = as.numeric(Year), y = Total_Interest, group = State, colour = State)) +
  geom_line() +
  geom_point() +
  scale_x_continuous(limits = c(2004, 2016)) +
  theme_bw() +
  xlab("Year") +
  ylab("Mean Interest") +
  ggtitle("Interest in Yoga by State") +
  theme(plot.title = element_text(lineheight = .8))

Create Data tables for these tidy dataframes

# Datatable: yoga by state and year, with column filters at the top
datatable(yoga_tidy_by_state_Year, filter = "top")

STEP 5: Conclusion

1. In 4a, we saw that interest in yoga dipped around 2009 but has been rising since then.

2. In 4b, we saw that Vermont had the highest mean interest in yoga, while Alabama had the lowest.

3. In 4c, we saw that interest in yoga spiked in Vermont (I wonder why), while in all other states interest was either steady or rose slightly.


Raj Kumar

March 13th, 2018
