Assignment

Choose any three of the “wide” datasets identified in the Week 6 Discussion items. (You may use your own dataset; please don’t use my Sample Post dataset, since that was used in your Week 6 assignment!)

For each of the three chosen datasets:

  1. Create a .CSV file (or optionally, a MySQL database!) that includes all of the information included in the dataset. You’re encouraged to use a “wide” structure similar to how the information appears in the discussion item, so that you can practice tidying and transformations as described below.

  2. Read the information from your .CSV file into R, and use tidyr and dplyr as needed to tidy and transform your data. [Most of your grade will be based on this step!]

  3. Perform the analysis requested in the discussion item.

  4. Your code should be in an R Markdown file, posted to rpubs.com, and should include narrative descriptions of your data cleanup work, analysis, and conclusions.

Solution

Data

Data Source

Data: Google’s monthly yoga searches by state from 2004 to 2016.

Data source: https://data.world/dotslashmaggie/google-trends-yoga/workspace/project-summary

Note about the data from Google: The context of our numbers also matters. We index our data to 100, where 100 is the maximum search interest for the time and location selected. That means that if we look at search interest in the 2016 elections since the start of 2012, we’ll see that March 2016 had the highest search interest, with a value of 100.

Source: https://medium.com/google-news-lab/what-is-google-trends-data-and-what-does-it-mean-b48f07342ee8

Read CSV from URL

yogaURL <- "https://data.world/dotslashmaggie/google-trends-yoga/workspace/file?filename=20160502_YogaByStateMonth.csv"
yoga <- read.csv("https://query.data.world/s/64O-zcKa02sO0C7fZDCjSyZlDJmZ7e", header=TRUE, stringsAsFactors=FALSE)
head(yoga)
##                                                                                                              X
## 1 Values show search interest per month in yoga and have been indexed to 100, where 100 is the maximum value. 
## 2                                                                                                      2004-01
## 3                                                                                                      2004-02
## 4                                                                                                      2004-03
## 5                                                                                                      2004-04
## 6                                                                                                      2004-05
##   Alabama..us.al. Alaska..us.ak. Arizona..us.az. Arkansas..us.ar.
## 1              NA             NA              NA               NA
## 2              20             23              21               24
## 3               8             26              25               16
## 4              10             26              22               26
## 5              15             34              24               18
## 6              15             14              24               11
##   California..us.ca. Colorado..us.co. Connecticut..us.ct. Delaware..us.de.
## 1                 NA               NA                  NA               NA
## 2                 32               33                  27               47
## 3                 27               30                  26               28
## 4                 28               29                  30               51
## 5                 25               27                  22               27
## 6                 25               24                  25               20
##   District.of.Columbia..us.dc. Florida..us.fl. Georgia..us.ga.
## 1                           NA              NA              NA
## 2                           32              21              21
## 3                           36              17              20
## 4                           29              17              20
## 5                           29              19              15
## 6                           24              20              16
##   Hawaii..us.hi. Idaho..us.id. Illinois..us.il. Indiana..us.in.
## 1             NA            NA               NA              NA
## 2             36            21               25              24
## 3             24            22               23              14
## 4             36            21               25              17
## 5             30            18               17              14
## 6             39            14               17               9
##   Iowa..us.ia. Kansas..us.ks. Kentucky..us.ky. Louisiana..us.la.
## 1           NA             NA               NA                NA
## 2           14             20               17                20
## 3           16             12               19                15
## 4           18             13               18                17
## 5           19             19               12                20
## 6           14             15               15                14
##   Maine..us.me. Maryland..us.md. Massachusetts..us.ma. Michigan..us.mi.
## 1            NA               NA                    NA               NA
## 2            29               26                    41               19
## 3            29               23                    33               18
## 4            26               22                    32               17
## 5            21               17                    31               15
## 6            25               21                    27               15
##   Minnesota..us.mn. Mississippi..us.ms. Missouri..us.mo. Montana..us.mt.
## 1                NA                  NA               NA              NA
## 2                26                  16               19              44
## 3                22                  20               18              26
## 4                21                  18               15              41
## 5                17                  18               12              25
## 6                22                  13               15              24
##   Nebraska..us.ne. Nevada..us.nv. New.Hampshire..us.nh. New.Jersey..us.nj.
## 1               NA             NA                    NA                 NA
## 2               15             20                    45                 27
## 3               21             25                    20                 22
## 4               16             24                    22                 23
## 5               16             17                    26                 20
## 6               14             25                    23                 22
##   New.Mexico..us.nm. New.York..us.ny. North.Carolina..us.nc.
## 1                 NA               NA                     NA
## 2                 33               35                     23
## 3                 25               28                     22
## 4                 26               29                     20
## 5                 18               26                     18
## 6                 21               28                     16
##   North.Dakota..us.nd. Ohio..us.oh. Oklahoma..us.ok. Oregon..us.or.
## 1                   NA           NA               NA             NA
## 2                   52           19               22             34
## 3                   45           16               19             30
## 4                   45           15               12             28
## 5                   45           13               11             22
## 6                   45           16               17             25
##   Pennsylvania..us.pa. Rhode.Island..us.ri. South.Carolina..us.sc.
## 1                   NA                   NA                     NA
## 2                   19                   44                     24
## 3                   18                   26                     19
## 4                   20                   27                     18
## 5                   19                   31                     14
## 6                   16                   26                     11
##   South.Dakota..us.sd. Tennessee..us.tn. Texas..us.tx. Utah..us.ut.
## 1                   NA                NA            NA           NA
## 2                   25                21            24           26
## 3                   22                18            16           20
## 4                   21                16            17           10
## 5                   21                21            16           20
## 6                   28                15            17           19
##   Vermont..us.vt. Virginia..us.va. Washington..us.wa.
## 1              NA               NA                 NA
## 2              42               22                 30
## 3              39               16                 29
## 4              41               19                 27
## 5              37               17                 25
## 6              31               18                 25
##   West.Virginia..us.wv. Wisconsin..us.wi. Wyoming..us.wy.
## 1                    NA                NA              NA
## 2                    23                18               0
## 3                    17                17              37
## 4                    27                20              35
## 5                    26                17              37
## 6                    21                18              35

Removed the 1st row of data

yoga <- yoga[-c(1:1), ]

Write CSV file

write.csv(yoga, file = "yoga.csv", row.names = FALSE)

Tidy Data

Tidy data with tidyr and dplyr

library(dplyr)
## Warning: package 'dplyr' was built under R version 3.4.2
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
yoga <- dplyr::tbl_df(yoga)
yoga
## # A tibble: 148 x 52
##          X Alabama..us.al. Alaska..us.ak. Arizona..us.az. Arkansas..us.ar.
##  *   <chr>           <int>          <int>           <int>            <int>
##  1 2004-01              20             23              21               24
##  2 2004-02               8             26              25               16
##  3 2004-03              10             26              22               26
##  4 2004-04              15             34              24               18
##  5 2004-05              15             14              24               11
##  6 2004-06              12             21              23               14
##  7 2004-07              13             17              25               15
##  8 2004-08              11             17              20               15
##  9 2004-09              10             23              20               10
## 10 2004-10              13             23              22               18
## # ... with 138 more rows, and 47 more variables: California..us.ca. <int>,
## #   Colorado..us.co. <int>, Connecticut..us.ct. <int>,
## #   Delaware..us.de. <int>, District.of.Columbia..us.dc. <int>,
## #   Florida..us.fl. <int>, Georgia..us.ga. <int>, Hawaii..us.hi. <int>,
## #   Idaho..us.id. <int>, Illinois..us.il. <int>, Indiana..us.in. <int>,
## #   Iowa..us.ia. <int>, Kansas..us.ks. <int>, Kentucky..us.ky. <int>,
## #   Louisiana..us.la. <int>, Maine..us.me. <int>, Maryland..us.md. <int>,
## #   Massachusetts..us.ma. <int>, Michigan..us.mi. <int>,
## #   Minnesota..us.mn. <int>, Mississippi..us.ms. <int>,
## #   Missouri..us.mo. <int>, Montana..us.mt. <int>, Nebraska..us.ne. <int>,
## #   Nevada..us.nv. <int>, New.Hampshire..us.nh. <int>,
## #   New.Jersey..us.nj. <int>, New.Mexico..us.nm. <int>,
## #   New.York..us.ny. <int>, North.Carolina..us.nc. <int>,
## #   North.Dakota..us.nd. <int>, Ohio..us.oh. <int>,
## #   Oklahoma..us.ok. <int>, Oregon..us.or. <int>,
## #   Pennsylvania..us.pa. <int>, Rhode.Island..us.ri. <int>,
## #   South.Carolina..us.sc. <int>, South.Dakota..us.sd. <int>,
## #   Tennessee..us.tn. <int>, Texas..us.tx. <int>, Utah..us.ut. <int>,
## #   Vermont..us.vt. <int>, Virginia..us.va. <int>,
## #   Washington..us.wa. <int>, West.Virginia..us.wv. <int>,
## #   Wisconsin..us.wi. <int>, Wyoming..us.wy. <int>
1. Rename the column headers

I wanted to rename all the state names to be cleaner using regular expressions, as I thought it would be faster and easier than typing in all the code (like below). But, unfortunately, I struggled (for a while) to figure out how. Then I tried to rename it using dplyr::rename and %>% but that didn’t work either… So I finally resorted to this method, which is long but worked:

yoga <- rename(yoga, Alabama = Alabama..us.al., Alaska = Alaska..us.ak., Arizona = Arizona..us.az., Arkansas = Arkansas..us.ar., California = California..us.ca., Colorado = Colorado..us.co., Connecticut = Connecticut..us.ct., Delaware = Delaware..us.de., DC = District.of.Columbia..us.dc., Florida = Florida..us.fl., Georgia = Georgia..us.ga., Hawaii = Hawaii..us.hi., Idaho = Idaho..us.id., Illinois = Illinois..us.il., Indiana = Indiana..us.in., Iowa = Iowa..us.ia., Kansas = Kansas..us.ks., Kentucky = Kentucky..us.ky., Louisiana = Louisiana..us.la., Maine = Maine..us.me., Maryland = Maryland..us.md., Massachusetts = Massachusetts..us.ma., Michigan = Michigan..us.mi., Minnesota = Minnesota..us.mn., Mississippi = Mississippi..us.ms., Missouri = Missouri..us.mo., Montana = Montana..us.mt., Nebraska = Nebraska..us.ne., Nevada = Nevada..us.nv., New_Hampshire = New.Hampshire..us.nh., New_Jersey = New.Jersey..us.nj., New_Mexico = New.Mexico..us.nm., New_York = New.York..us.ny., North_Carolina = North.Carolina..us.nc., North_Dakota = North.Dakota..us.nd., Ohio = Ohio..us.oh., Oklahoma = Oklahoma..us.ok., Oregon = Oregon..us.or., Pennsylvania = Pennsylvania..us.pa., Rhode_Island = Rhode.Island..us.ri., South_Carolina = South.Carolina..us.sc., South_Dakota = South.Dakota..us.sd., Tennessee = Tennessee..us.tn., Texas = Texas..us.tx., Utah = Utah..us.ut., Vermont = Vermont..us.vt., Virginia = Virginia..us.va., Washington = Washington..us.wa., West_Virginia = West.Virginia..us.wv., Wisconsin = Wisconsin..us.wi., Wyoming = Wyoming..us.wy.)
yoga 
## # A tibble: 148 x 52
##          X Alabama Alaska Arizona Arkansas California Colorado Connecticut
##  *   <chr>   <int>  <int>   <int>    <int>      <int>    <int>       <int>
##  1 2004-01      20     23      21       24         32       33          27
##  2 2004-02       8     26      25       16         27       30          26
##  3 2004-03      10     26      22       26         28       29          30
##  4 2004-04      15     34      24       18         25       27          22
##  5 2004-05      15     14      24       11         25       24          25
##  6 2004-06      12     21      23       14         27       27          26
##  7 2004-07      13     17      25       15         28       30          22
##  8 2004-08      11     17      20       15         28       27          27
##  9 2004-09      10     23      20       10         26       25          24
## 10 2004-10      13     23      22       18         25       30          19
## # ... with 138 more rows, and 44 more variables: Delaware <int>, DC <int>,
## #   Florida <int>, Georgia <int>, Hawaii <int>, Idaho <int>,
## #   Illinois <int>, Indiana <int>, Iowa <int>, Kansas <int>,
## #   Kentucky <int>, Louisiana <int>, Maine <int>, Maryland <int>,
## #   Massachusetts <int>, Michigan <int>, Minnesota <int>,
## #   Mississippi <int>, Missouri <int>, Montana <int>, Nebraska <int>,
## #   Nevada <int>, New_Hampshire <int>, New_Jersey <int>, New_Mexico <int>,
## #   New_York <int>, North_Carolina <int>, North_Dakota <int>, Ohio <int>,
## #   Oklahoma <int>, Oregon <int>, Pennsylvania <int>, Rhode_Island <int>,
## #   South_Carolina <int>, South_Dakota <int>, Tennessee <int>,
## #   Texas <int>, Utah <int>, Vermont <int>, Virginia <int>,
## #   Washington <int>, West_Virginia <int>, Wisconsin <int>, Wyoming <int>
2. Separate “Year-Month” into two columns
yoga <- tidyr::separate(yoga, X, c("Year", "Month"))
yoga
## # A tibble: 148 x 53
##     Year Month Alabama Alaska Arizona Arkansas California Colorado
##  * <chr> <chr>   <int>  <int>   <int>    <int>      <int>    <int>
##  1  2004    01      20     23      21       24         32       33
##  2  2004    02       8     26      25       16         27       30
##  3  2004    03      10     26      22       26         28       29
##  4  2004    04      15     34      24       18         25       27
##  5  2004    05      15     14      24       11         25       24
##  6  2004    06      12     21      23       14         27       27
##  7  2004    07      13     17      25       15         28       30
##  8  2004    08      11     17      20       15         28       27
##  9  2004    09      10     23      20       10         26       25
## 10  2004    10      13     23      22       18         25       30
## # ... with 138 more rows, and 45 more variables: Connecticut <int>,
## #   Delaware <int>, DC <int>, Florida <int>, Georgia <int>, Hawaii <int>,
## #   Idaho <int>, Illinois <int>, Indiana <int>, Iowa <int>, Kansas <int>,
## #   Kentucky <int>, Louisiana <int>, Maine <int>, Maryland <int>,
## #   Massachusetts <int>, Michigan <int>, Minnesota <int>,
## #   Mississippi <int>, Missouri <int>, Montana <int>, Nebraska <int>,
## #   Nevada <int>, New_Hampshire <int>, New_Jersey <int>, New_Mexico <int>,
## #   New_York <int>, North_Carolina <int>, North_Dakota <int>, Ohio <int>,
## #   Oklahoma <int>, Oregon <int>, Pennsylvania <int>, Rhode_Island <int>,
## #   South_Carolina <int>, South_Dakota <int>, Tennessee <int>,
## #   Texas <int>, Utah <int>, Vermont <int>, Virginia <int>,
## #   Washington <int>, West_Virginia <int>, Wisconsin <int>, Wyoming <int>
ncol(yoga)
## [1] 53
3. Make the states into observations from variables with tidyr ‘gather’ function
library(tidyr)
YogaLong <- gather(yoga, "State", "n", 3:53)
YogaLong
## # A tibble: 7,548 x 4
##     Year Month   State     n
##    <chr> <chr>   <chr> <int>
##  1  2004    01 Alabama    20
##  2  2004    02 Alabama     8
##  3  2004    03 Alabama    10
##  4  2004    04 Alabama    15
##  5  2004    05 Alabama    15
##  6  2004    06 Alabama    12
##  7  2004    07 Alabama    13
##  8  2004    08 Alabama    11
##  9  2004    09 Alabama    10
## 10  2004    10 Alabama    13
## # ... with 7,538 more rows
4. Rename “n” to “Searches”
YogaLong <- dplyr::rename(YogaLong, Searches = 'n')
YogaLong
## # A tibble: 7,548 x 4
##     Year Month   State Searches
##    <chr> <chr>   <chr>    <int>
##  1  2004    01 Alabama       20
##  2  2004    02 Alabama        8
##  3  2004    03 Alabama       10
##  4  2004    04 Alabama       15
##  5  2004    05 Alabama       15
##  6  2004    06 Alabama       12
##  7  2004    07 Alabama       13
##  8  2004    08 Alabama       11
##  9  2004    09 Alabama       10
## 10  2004    10 Alabama       13
## # ... with 7,538 more rows
5. Change “Year” and “Month” columns from characters to numeric?
YogaLong$Year <- as.numeric(as.character(YogaLong$Year))
YogaLong$Month <- as.numeric(as.character(YogaLong$Month))
YogaLong
## # A tibble: 7,548 x 4
##     Year Month   State Searches
##    <dbl> <dbl>   <chr>    <int>
##  1  2004     1 Alabama       20
##  2  2004     2 Alabama        8
##  3  2004     3 Alabama       10
##  4  2004     4 Alabama       15
##  5  2004     5 Alabama       15
##  6  2004     6 Alabama       12
##  7  2004     7 Alabama       13
##  8  2004     8 Alabama       11
##  9  2004     9 Alabama       10
## 10  2004    10 Alabama       13
## # ... with 7,538 more rows

Pretty Table

library(DT)
datatable(YogaLong)

Analysis

library(ggplot2)

1a. When was there the most interest in yoga and where??

arrange(YogaLong, desc(Searches))

1b. When was there the least interest in yoga and where?

arrange(YogaLong, Searches)

2. Which state had the highest interest in yoga based on their mean number of google searches from 2004-2016?

First need to group the searches by state and year and calculate the mean, min, and max of searches by each.

State <- YogaLong %>% 
  group_by(State) %>% 
  summarise(Mean = mean(Searches), Min = min(Searches), Max = max(Searches), n = n())
  
State <- arrange(State, desc(Mean))

State
## # A tibble: 51 x 5
##            State     Mean   Min   Max     n
##            <chr>    <dbl> <dbl> <dbl> <int>
##  1       Vermont 50.87162    20   100   148
##  2        Hawaii 38.50676    24    54   148
##  3  Rhode_Island 37.47297    20    55   148
##  4 New_Hampshire 35.75676    20    57   148
##  5         Maine 35.47973    19    57   148
##  6       Montana 33.62162    13    55   148
##  7       Wyoming 32.53378     0    59   148
##  8 Massachusetts 32.18243    25    43   148
##  9        Alaska 31.93243    12    61   148
## 10      Colorado 30.93243    23    43   148
## # ... with 41 more rows
ggplot(State, aes(x=State, y=Mean, fill=Mean)) +
  geom_bar(stat = "identity") +
  xlab("State") + ylab("Mean of the Indexed Google Searches by Month") +
  ggtitle("Interest in Yoga by State from 2004-2016") +
  theme(plot.title = element_text(lineheight = .8, face = "bold")) +
  theme(axis.text.x = element_text(angle = 60, vjust = .5, size = 9)) +
  theme_classic() +
  coord_flip()

3. What is the interest in yoga by year across the entire US?

Year <- YogaLong %>% 
  group_by(Year) %>% 
  summarise(Mean = mean(Searches), Min = min(Searches), Max = max(Searches), n = n())
Year
## # A tibble: 13 x 5
##     Year     Mean   Min   Max     n
##    <dbl>    <dbl> <dbl> <dbl> <int>
##  1  2004 22.00490     0    52   612
##  2  2005 21.52451     9    54   612
##  3  2006 21.16340     8    61   612
##  4  2007 20.18464     6    68   612
##  5  2008 19.74673     7    53   612
##  6  2009 20.70425     9    53   612
##  7  2010 24.30392    12    57   612
##  8  2011 23.19771     9    74   612
##  9  2012 24.30556    10    77   612
## 10  2013 26.71569    10    79   612
## 11  2014 28.50000    14    83   612
## 12  2015 30.18791    14    87   612
## 13  2016 32.01471    14   100   204
ggplot(Year, aes(x=Year, y=Mean)) +
  geom_line(size = .5, colour = "blue") +
  geom_point(size = 1, colour = "light blue") +
  ggtitle("Interest of Yoga Across US by Year from 2004-2016") +
  ylab("Mean of the Indexed Google Searches") +
  theme_classic()

4. When during the year do people search most for yoga?

Month <- YogaLong %>% 
  group_by(Month) %>% 
  summarise(Mean = mean(Searches), Min = min(Searches), Max = max(Searches), n = n())
Month
## # A tibble: 12 x 5
##    Month     Mean   Min   Max     n
##    <dbl>    <dbl> <dbl> <dbl> <int>
##  1     1 28.14781     0   100   663
##  2     2 24.66817     8    99   663
##  3     3 24.18250     6    87   663
##  4     4 23.03620     9    85   663
##  5     5 22.11275     9    60   612
##  6     6 23.80065     9    69   612
##  7     7 24.40196     6    81   612
##  8     8 24.32026     7    77   612
##  9     9 23.19281    10    77   612
## 10    10 21.92647     9    62   612
## 11    11 22.40033     9    79   612
## 12    12 22.68464     8    87   612
library(lubridate)
## 
## Attaching package: 'lubridate'
## The following object is masked from 'package:base':
## 
##     date
ggplot(Month, aes(x=month(Month, label = TRUE, abbr = TRUE), y=Mean, group=1)) +
  geom_line(size = .5, colour = "navy") +
  geom_point(size = 1, colour = "lime green") +
  ggtitle("Interest in Yoga by Month") +
  ylab("Indexed Google Searches") + xlab("Month") +
  theme_classic() 

Experienced much difficulty getting the Month names to appear in the X axis. Finally found this answer on stackoverflow: https://stackoverflow.com/questions/42425195/ggplot-data-year-over-year-axis-labels-not-showing

Conclusions

The highest search interest for yoga was in January 2016 in Vermont.

The least interest was in January 2004 in Wyoming.

Yoga experienced a good deal of interest in 2010, decreased in 2011, and since then has been steadily increasing in interest, with the most interest being in 2016.

People search for yoga the most in January. February, March, July and August are second to January for the most popular months when people search for “yoga” on Google.