Choose any three of the “wide” datasets identified in the Week 6 Discussion items. (You may use your own dataset; please don’t use my Sample Post dataset, since that was used in your Week 6 assignment!)
For each of the three chosen datasets:
Create a .CSV file (or optionally, a MySQL database!) that includes all of the information included in the dataset. You’re encouraged to use a “wide” structure similar to how the information appears in the discussion item, so that you can practice tidying and transformations as described below.
Read the information from your .CSV file into R, and use tidyr and dplyr as needed to tidy and transform your data. [Most of your grade will be based on this step!]
Perform the analysis requested in the discussion item.
Your code should be in an R Markdown file, posted to rpubs.com, and should include narrative descriptions of your data cleanup work, analysis, and conclusions.
Data: Google’s monthly yoga searches by state from 2004 to 2016.
Data source: https://data.world/dotslashmaggie/google-trends-yoga/workspace/project-summary
Note about the data from Google: The context of our numbers also matters. We index our data to 100, where 100 is the maximum search interest for the time and location selected. That means that if we look at search interest in the 2016 elections since the start of 2012, we’ll see that March 2016 had the highest search interest, with a value of 100.
Source: https://medium.com/google-news-lab/what-is-google-trends-data-and-what-does-it-mean-b48f07342ee8
yogaURL <- "https://data.world/dotslashmaggie/google-trends-yoga/workspace/file?filename=20160502_YogaByStateMonth.csv"
yoga <- read.csv("https://query.data.world/s/64O-zcKa02sO0C7fZDCjSyZlDJmZ7e", header=TRUE, stringsAsFactors=FALSE)
head(yoga)
## X
## 1 Values show search interest per month in yoga and have been indexed to 100, where 100 is the maximum value.
## 2 2004-01
## 3 2004-02
## 4 2004-03
## 5 2004-04
## 6 2004-05
## Alabama..us.al. Alaska..us.ak. Arizona..us.az. Arkansas..us.ar.
## 1 NA NA NA NA
## 2 20 23 21 24
## 3 8 26 25 16
## 4 10 26 22 26
## 5 15 34 24 18
## 6 15 14 24 11
## California..us.ca. Colorado..us.co. Connecticut..us.ct. Delaware..us.de.
## 1 NA NA NA NA
## 2 32 33 27 47
## 3 27 30 26 28
## 4 28 29 30 51
## 5 25 27 22 27
## 6 25 24 25 20
## District.of.Columbia..us.dc. Florida..us.fl. Georgia..us.ga.
## 1 NA NA NA
## 2 32 21 21
## 3 36 17 20
## 4 29 17 20
## 5 29 19 15
## 6 24 20 16
## Hawaii..us.hi. Idaho..us.id. Illinois..us.il. Indiana..us.in.
## 1 NA NA NA NA
## 2 36 21 25 24
## 3 24 22 23 14
## 4 36 21 25 17
## 5 30 18 17 14
## 6 39 14 17 9
## Iowa..us.ia. Kansas..us.ks. Kentucky..us.ky. Louisiana..us.la.
## 1 NA NA NA NA
## 2 14 20 17 20
## 3 16 12 19 15
## 4 18 13 18 17
## 5 19 19 12 20
## 6 14 15 15 14
## Maine..us.me. Maryland..us.md. Massachusetts..us.ma. Michigan..us.mi.
## 1 NA NA NA NA
## 2 29 26 41 19
## 3 29 23 33 18
## 4 26 22 32 17
## 5 21 17 31 15
## 6 25 21 27 15
## Minnesota..us.mn. Mississippi..us.ms. Missouri..us.mo. Montana..us.mt.
## 1 NA NA NA NA
## 2 26 16 19 44
## 3 22 20 18 26
## 4 21 18 15 41
## 5 17 18 12 25
## 6 22 13 15 24
## Nebraska..us.ne. Nevada..us.nv. New.Hampshire..us.nh. New.Jersey..us.nj.
## 1 NA NA NA NA
## 2 15 20 45 27
## 3 21 25 20 22
## 4 16 24 22 23
## 5 16 17 26 20
## 6 14 25 23 22
## New.Mexico..us.nm. New.York..us.ny. North.Carolina..us.nc.
## 1 NA NA NA
## 2 33 35 23
## 3 25 28 22
## 4 26 29 20
## 5 18 26 18
## 6 21 28 16
## North.Dakota..us.nd. Ohio..us.oh. Oklahoma..us.ok. Oregon..us.or.
## 1 NA NA NA NA
## 2 52 19 22 34
## 3 45 16 19 30
## 4 45 15 12 28
## 5 45 13 11 22
## 6 45 16 17 25
## Pennsylvania..us.pa. Rhode.Island..us.ri. South.Carolina..us.sc.
## 1 NA NA NA
## 2 19 44 24
## 3 18 26 19
## 4 20 27 18
## 5 19 31 14
## 6 16 26 11
## South.Dakota..us.sd. Tennessee..us.tn. Texas..us.tx. Utah..us.ut.
## 1 NA NA NA NA
## 2 25 21 24 26
## 3 22 18 16 20
## 4 21 16 17 10
## 5 21 21 16 20
## 6 28 15 17 19
## Vermont..us.vt. Virginia..us.va. Washington..us.wa.
## 1 NA NA NA
## 2 42 22 30
## 3 39 16 29
## 4 41 19 27
## 5 37 17 25
## 6 31 18 25
## West.Virginia..us.wv. Wisconsin..us.wi. Wyoming..us.wy.
## 1 NA NA NA
## 2 23 18 0
## 3 17 17 37
## 4 27 20 35
## 5 26 17 37
## 6 21 18 35
Removed the 1st row of data
yoga <- yoga[-c(1:1), ]
write.csv(yoga, file = "yoga.csv", row.names = FALSE)
Tidy data with tidyr and dplyr
library(dplyr)
## Warning: package 'dplyr' was built under R version 3.4.2
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
yoga <- dplyr::tbl_df(yoga)
yoga
## # A tibble: 148 x 52
## X Alabama..us.al. Alaska..us.ak. Arizona..us.az. Arkansas..us.ar.
## * <chr> <int> <int> <int> <int>
## 1 2004-01 20 23 21 24
## 2 2004-02 8 26 25 16
## 3 2004-03 10 26 22 26
## 4 2004-04 15 34 24 18
## 5 2004-05 15 14 24 11
## 6 2004-06 12 21 23 14
## 7 2004-07 13 17 25 15
## 8 2004-08 11 17 20 15
## 9 2004-09 10 23 20 10
## 10 2004-10 13 23 22 18
## # ... with 138 more rows, and 47 more variables: California..us.ca. <int>,
## # Colorado..us.co. <int>, Connecticut..us.ct. <int>,
## # Delaware..us.de. <int>, District.of.Columbia..us.dc. <int>,
## # Florida..us.fl. <int>, Georgia..us.ga. <int>, Hawaii..us.hi. <int>,
## # Idaho..us.id. <int>, Illinois..us.il. <int>, Indiana..us.in. <int>,
## # Iowa..us.ia. <int>, Kansas..us.ks. <int>, Kentucky..us.ky. <int>,
## # Louisiana..us.la. <int>, Maine..us.me. <int>, Maryland..us.md. <int>,
## # Massachusetts..us.ma. <int>, Michigan..us.mi. <int>,
## # Minnesota..us.mn. <int>, Mississippi..us.ms. <int>,
## # Missouri..us.mo. <int>, Montana..us.mt. <int>, Nebraska..us.ne. <int>,
## # Nevada..us.nv. <int>, New.Hampshire..us.nh. <int>,
## # New.Jersey..us.nj. <int>, New.Mexico..us.nm. <int>,
## # New.York..us.ny. <int>, North.Carolina..us.nc. <int>,
## # North.Dakota..us.nd. <int>, Ohio..us.oh. <int>,
## # Oklahoma..us.ok. <int>, Oregon..us.or. <int>,
## # Pennsylvania..us.pa. <int>, Rhode.Island..us.ri. <int>,
## # South.Carolina..us.sc. <int>, South.Dakota..us.sd. <int>,
## # Tennessee..us.tn. <int>, Texas..us.tx. <int>, Utah..us.ut. <int>,
## # Vermont..us.vt. <int>, Virginia..us.va. <int>,
## # Washington..us.wa. <int>, West.Virginia..us.wv. <int>,
## # Wisconsin..us.wi. <int>, Wyoming..us.wy. <int>
I wanted to rename all the state names to be cleaner using regular expressions, as I thought it would be faster and easier than typing in all the code (like below). But, unfortunately, I struggled (for a while) to figure out how. Then I tried to rename it using dplyr::rename and %>% but that didn’t work either… So I finally resorted to this method, which is long but worked:
yoga <- rename(yoga, Alabama = Alabama..us.al., Alaska = Alaska..us.ak., Arizona = Arizona..us.az., Arkansas = Arkansas..us.ar., California = California..us.ca., Colorado = Colorado..us.co., Connecticut = Connecticut..us.ct., Delaware = Delaware..us.de., DC = District.of.Columbia..us.dc., Florida = Florida..us.fl., Georgia = Georgia..us.ga., Hawaii = Hawaii..us.hi., Idaho = Idaho..us.id., Illinois = Illinois..us.il., Indiana = Indiana..us.in., Iowa = Iowa..us.ia., Kansas = Kansas..us.ks., Kentucky = Kentucky..us.ky., Louisiana = Louisiana..us.la., Maine = Maine..us.me., Maryland = Maryland..us.md., Massachusetts = Massachusetts..us.ma., Michigan = Michigan..us.mi., Minnesota = Minnesota..us.mn., Mississippi = Mississippi..us.ms., Missouri = Missouri..us.mo., Montana = Montana..us.mt., Nebraska = Nebraska..us.ne., Nevada = Nevada..us.nv., New_Hampshire = New.Hampshire..us.nh., New_Jersey = New.Jersey..us.nj., New_Mexico = New.Mexico..us.nm., New_York = New.York..us.ny., North_Carolina = North.Carolina..us.nc., North_Dakota = North.Dakota..us.nd., Ohio = Ohio..us.oh., Oklahoma = Oklahoma..us.ok., Oregon = Oregon..us.or., Pennsylvania = Pennsylvania..us.pa., Rhode_Island = Rhode.Island..us.ri., South_Carolina = South.Carolina..us.sc., South_Dakota = South.Dakota..us.sd., Tennessee = Tennessee..us.tn., Texas = Texas..us.tx., Utah = Utah..us.ut., Vermont = Vermont..us.vt., Virginia = Virginia..us.va., Washington = Washington..us.wa., West_Virginia = West.Virginia..us.wv., Wisconsin = Wisconsin..us.wi., Wyoming = Wyoming..us.wy.)
yoga
## # A tibble: 148 x 52
## X Alabama Alaska Arizona Arkansas California Colorado Connecticut
## * <chr> <int> <int> <int> <int> <int> <int> <int>
## 1 2004-01 20 23 21 24 32 33 27
## 2 2004-02 8 26 25 16 27 30 26
## 3 2004-03 10 26 22 26 28 29 30
## 4 2004-04 15 34 24 18 25 27 22
## 5 2004-05 15 14 24 11 25 24 25
## 6 2004-06 12 21 23 14 27 27 26
## 7 2004-07 13 17 25 15 28 30 22
## 8 2004-08 11 17 20 15 28 27 27
## 9 2004-09 10 23 20 10 26 25 24
## 10 2004-10 13 23 22 18 25 30 19
## # ... with 138 more rows, and 44 more variables: Delaware <int>, DC <int>,
## # Florida <int>, Georgia <int>, Hawaii <int>, Idaho <int>,
## # Illinois <int>, Indiana <int>, Iowa <int>, Kansas <int>,
## # Kentucky <int>, Louisiana <int>, Maine <int>, Maryland <int>,
## # Massachusetts <int>, Michigan <int>, Minnesota <int>,
## # Mississippi <int>, Missouri <int>, Montana <int>, Nebraska <int>,
## # Nevada <int>, New_Hampshire <int>, New_Jersey <int>, New_Mexico <int>,
## # New_York <int>, North_Carolina <int>, North_Dakota <int>, Ohio <int>,
## # Oklahoma <int>, Oregon <int>, Pennsylvania <int>, Rhode_Island <int>,
## # South_Carolina <int>, South_Dakota <int>, Tennessee <int>,
## # Texas <int>, Utah <int>, Vermont <int>, Virginia <int>,
## # Washington <int>, West_Virginia <int>, Wisconsin <int>, Wyoming <int>
yoga <- tidyr::separate(yoga, X, c("Year", "Month"))
yoga
## # A tibble: 148 x 53
## Year Month Alabama Alaska Arizona Arkansas California Colorado
## * <chr> <chr> <int> <int> <int> <int> <int> <int>
## 1 2004 01 20 23 21 24 32 33
## 2 2004 02 8 26 25 16 27 30
## 3 2004 03 10 26 22 26 28 29
## 4 2004 04 15 34 24 18 25 27
## 5 2004 05 15 14 24 11 25 24
## 6 2004 06 12 21 23 14 27 27
## 7 2004 07 13 17 25 15 28 30
## 8 2004 08 11 17 20 15 28 27
## 9 2004 09 10 23 20 10 26 25
## 10 2004 10 13 23 22 18 25 30
## # ... with 138 more rows, and 45 more variables: Connecticut <int>,
## # Delaware <int>, DC <int>, Florida <int>, Georgia <int>, Hawaii <int>,
## # Idaho <int>, Illinois <int>, Indiana <int>, Iowa <int>, Kansas <int>,
## # Kentucky <int>, Louisiana <int>, Maine <int>, Maryland <int>,
## # Massachusetts <int>, Michigan <int>, Minnesota <int>,
## # Mississippi <int>, Missouri <int>, Montana <int>, Nebraska <int>,
## # Nevada <int>, New_Hampshire <int>, New_Jersey <int>, New_Mexico <int>,
## # New_York <int>, North_Carolina <int>, North_Dakota <int>, Ohio <int>,
## # Oklahoma <int>, Oregon <int>, Pennsylvania <int>, Rhode_Island <int>,
## # South_Carolina <int>, South_Dakota <int>, Tennessee <int>,
## # Texas <int>, Utah <int>, Vermont <int>, Virginia <int>,
## # Washington <int>, West_Virginia <int>, Wisconsin <int>, Wyoming <int>
ncol(yoga)
## [1] 53
library(tidyr)
YogaLong <- gather(yoga, "State", "n", 3:53)
YogaLong
## # A tibble: 7,548 x 4
## Year Month State n
## <chr> <chr> <chr> <int>
## 1 2004 01 Alabama 20
## 2 2004 02 Alabama 8
## 3 2004 03 Alabama 10
## 4 2004 04 Alabama 15
## 5 2004 05 Alabama 15
## 6 2004 06 Alabama 12
## 7 2004 07 Alabama 13
## 8 2004 08 Alabama 11
## 9 2004 09 Alabama 10
## 10 2004 10 Alabama 13
## # ... with 7,538 more rows
YogaLong <- dplyr::rename(YogaLong, Searches = 'n')
YogaLong
## # A tibble: 7,548 x 4
## Year Month State Searches
## <chr> <chr> <chr> <int>
## 1 2004 01 Alabama 20
## 2 2004 02 Alabama 8
## 3 2004 03 Alabama 10
## 4 2004 04 Alabama 15
## 5 2004 05 Alabama 15
## 6 2004 06 Alabama 12
## 7 2004 07 Alabama 13
## 8 2004 08 Alabama 11
## 9 2004 09 Alabama 10
## 10 2004 10 Alabama 13
## # ... with 7,538 more rows
YogaLong$Year <- as.numeric(as.character(YogaLong$Year))
YogaLong$Month <- as.numeric(as.character(YogaLong$Month))
YogaLong
## # A tibble: 7,548 x 4
## Year Month State Searches
## <dbl> <dbl> <chr> <int>
## 1 2004 1 Alabama 20
## 2 2004 2 Alabama 8
## 3 2004 3 Alabama 10
## 4 2004 4 Alabama 15
## 5 2004 5 Alabama 15
## 6 2004 6 Alabama 12
## 7 2004 7 Alabama 13
## 8 2004 8 Alabama 11
## 9 2004 9 Alabama 10
## 10 2004 10 Alabama 13
## # ... with 7,538 more rows
library(DT)
datatable(YogaLong)
library(ggplot2)
arrange(YogaLong, desc(Searches))
arrange(YogaLong, Searches)
First need to group the searches by state and year and calculate the mean, min, and max of searches by each.
State <- YogaLong %>%
group_by(State) %>%
summarise(Mean = mean(Searches), Min = min(Searches), Max = max(Searches), n = n())
State <- arrange(State, desc(Mean))
State
## # A tibble: 51 x 5
## State Mean Min Max n
## <chr> <dbl> <dbl> <dbl> <int>
## 1 Vermont 50.87162 20 100 148
## 2 Hawaii 38.50676 24 54 148
## 3 Rhode_Island 37.47297 20 55 148
## 4 New_Hampshire 35.75676 20 57 148
## 5 Maine 35.47973 19 57 148
## 6 Montana 33.62162 13 55 148
## 7 Wyoming 32.53378 0 59 148
## 8 Massachusetts 32.18243 25 43 148
## 9 Alaska 31.93243 12 61 148
## 10 Colorado 30.93243 23 43 148
## # ... with 41 more rows
ggplot(State, aes(x=State, y=Mean, fill=Mean)) +
geom_bar(stat = "identity") +
xlab("State") + ylab("Mean of the Indexed Google Searches by Month") +
ggtitle("Interest in Yoga by State from 2004-2016") +
theme(plot.title = element_text(lineheight = .8, face = "bold")) +
theme(axis.text.x = element_text(angle = 60, vjust = .5, size = 9)) +
theme_classic() +
coord_flip()
Year <- YogaLong %>%
group_by(Year) %>%
summarise(Mean = mean(Searches), Min = min(Searches), Max = max(Searches), n = n())
Year
## # A tibble: 13 x 5
## Year Mean Min Max n
## <dbl> <dbl> <dbl> <dbl> <int>
## 1 2004 22.00490 0 52 612
## 2 2005 21.52451 9 54 612
## 3 2006 21.16340 8 61 612
## 4 2007 20.18464 6 68 612
## 5 2008 19.74673 7 53 612
## 6 2009 20.70425 9 53 612
## 7 2010 24.30392 12 57 612
## 8 2011 23.19771 9 74 612
## 9 2012 24.30556 10 77 612
## 10 2013 26.71569 10 79 612
## 11 2014 28.50000 14 83 612
## 12 2015 30.18791 14 87 612
## 13 2016 32.01471 14 100 204
ggplot(Year, aes(x=Year, y=Mean)) +
geom_line(size = .5, colour = "blue") +
geom_point(size = 1, colour = "light blue") +
ggtitle("Interest of Yoga Across US by Year from 2004-2016") +
ylab("Mean of the Indexed Google Searches") +
theme_classic()
Month <- YogaLong %>%
group_by(Month) %>%
summarise(Mean = mean(Searches), Min = min(Searches), Max = max(Searches), n = n())
Month
## # A tibble: 12 x 5
## Month Mean Min Max n
## <dbl> <dbl> <dbl> <dbl> <int>
## 1 1 28.14781 0 100 663
## 2 2 24.66817 8 99 663
## 3 3 24.18250 6 87 663
## 4 4 23.03620 9 85 663
## 5 5 22.11275 9 60 612
## 6 6 23.80065 9 69 612
## 7 7 24.40196 6 81 612
## 8 8 24.32026 7 77 612
## 9 9 23.19281 10 77 612
## 10 10 21.92647 9 62 612
## 11 11 22.40033 9 79 612
## 12 12 22.68464 8 87 612
library(lubridate)
##
## Attaching package: 'lubridate'
## The following object is masked from 'package:base':
##
## date
ggplot(Month, aes(x=month(Month, label = TRUE, abbr = TRUE), y=Mean, group=1)) +
geom_line(size = .5, colour = "navy") +
geom_point(size = 1, colour = "lime green") +
ggtitle("Interest in Yoga by Month") +
ylab("Indexed Google Searches") + xlab("Month") +
theme_classic()
Experienced much difficulty getting the Month names to appear in the X axis. Finally found this answer on stackoverflow: https://stackoverflow.com/questions/42425195/ggplot-data-year-over-year-axis-labels-not-showing
The highest search interest for yoga was in January 2016 in Vermont.
The least interest was in January 2004 in Wyoming.
Yoga experienced a good deal of interest in 2010, decreased in 2011, and since then has been steadily increasing in interest, with the most interest being in 2016.
People search for yoga the most in January. February, March, July and August are second to January for the most popular months when people search for “yoga” on Google.