library(tidyr)
library(ggplot2)
library(dplyr)
#Import data
rawData <- read.csv('https://raw.githubusercontent.com/fivethirtyeight/data/master/nfl-fandom/NFL_fandom_data-google_trends.csv', header = FALSE)
# Drop the first row, then promote the next row to column names
data <- rawData[-1, ]
colnames(data) <- as.character(unlist(data[1, ]))
data <- data[-1, ]
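# Tidy data
tidyData <- data %>%
  # Gather the seven league columns (2:8) into Team / searchPercent pairs
  gather(Team, searchPercent, 2:8) %>%
  # Arrange rows by DMA
  arrange(DMA)
# Save the league labels: the numeric conversion below turns Team into NA
team <- tidyData$Team
# Strip "%" signs and convert every column except DMA to numeric
tidyData[-1] <- data.frame(apply(tidyData[-1], 2, function(x)
  as.numeric(sub("%", "", as.character(x)))))
# Re-attach the league labels as a new column, team
fullDf <- cbind(tidyData, team)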
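The chunk above converts every non-DMA column to numeric, which is why `Team` ends up as all `NA` and has to be re-attached as `team`. As a possible alternative, here is a minimal sketch that reshapes with `pivot_longer()` (which supersedes `gather()` in tidyr 1.0.0 and later) and converts only the percentage columns. The toy values and the helper name `pctToNum` are made up for illustration and are not part of the original analysis.

```r
library(tidyr)
library(dplyr)

# Toy stand-in for the wide table (all values are made up)
wide <- data.frame(
  DMA = c("Market A", "Market B"),
  NFL = c("45%", "40%"),
  NBA = c("21%", "25%"),
  `Trump 2016 Vote%` = c("60.5%", "41.2%"),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

# Hypothetical helper: strip a literal "%" and convert to numeric
pctToNum <- function(x) as.numeric(sub("%", "", as.character(x), fixed = TRUE))

long <- wide %>%
  # Reshape only the league columns; DMA and the vote share stay as-is
  pivot_longer(cols = c(NFL, NBA), names_to = "team", values_to = "searchPercent") %>%
  # Convert only the two percentage columns, leaving the league labels intact
  mutate(
    searchPercent = pctToNum(searchPercent),
    `Trump 2016 Vote%` = pctToNum(`Trump 2016 Vote%`)
  ) %>%
  arrange(DMA)
```

With this approach the league label is never coerced to numeric, so no spare copy of the column is needed.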
# Flag rows where Trump won more than 50% of the DMA's vote (TRUE/FALSE)
fullDf2 <- fullDf %>%
  mutate(trumpMajority = `Trump 2016 Vote%` > 50)
#Display HTML data table
DT::datatable(fullDf2, editable = TRUE)
Is there an association between interest in major sports leagues, as measured by Google search share, and 2016 presidential voting patterns?
What are the cases, and how many are there?
Each case is a designated market area (DMA); there are 207 cases.
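The figure of 207 is consistent with the `summary()` output in this document, which lists 207 rows per league and 1,449 rows in total (1449 / 7 = 207). A minimal sketch of how the case count could be checked on the long-format table, using a made-up toy data frame in place of `fullDf2`:

```r
# Toy long-format table: 2 markets x 3 leagues (values are made up)
toy <- data.frame(
  DMA  = rep(c("Market A", "Market B"), each = 3),
  team = rep(c("NFL", "NBA", "MLB"), times = 2)
)

# Number of cases = number of distinct DMAs, not number of rows
nCases <- length(unique(toy$DMA))

# Each DMA should appear once per league
rowsPerCase <- table(toy$DMA)
```

On the real data the analogous call would be `length(unique(fullDf2$DMA))`.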
Describe the method of data collection.
The data come from Google Trends, comparing five years of search traffic for seven major sports leagues (https://g.co/trends/5P8aa).
Results are listed by designated market area (DMA).
The percentages are the approximate percentage of major-sports searches that were conducted for each league.
Trump’s percentage is his share of the vote within the DMA in the 2016 presidential election.
What type of study is this (observational/experiment)?
Observational
If you collected the data, state self-collected. If not, provide a citation/link.
Not self-collected. The Google Trends data, comparing five years of search traffic for seven major sports leagues, is published by FiveThirtyEight as a CSV file on GitHub:
https://github.com/fivethirtyeight/data/blob/master/nfl-fandom/NFL_fandom_data-google_trends.csv
What is the response variable? Is it quantitative or qualitative?
The response variable is Trump’s share of the 2016 presidential vote within the DMA; it is quantitative.
You should have two independent variables, one quantitative and one qualitative.
Quantitative: percentage of major-sports searches conducted for each league (searchPercent)
Qualitative: sports league (team: NFL, NBA, MLB, NHL, NASCAR, CBB, or CFB)
Provide summary statistics for each of the variables. Also include appropriate visualizations related to your research question (e.g. scatter plots, box plots, etc.). This step requires the use of R, hence a code chunk is provided below. Insert more code chunks as needed.
summary(fullDf2)
##                          DMA       Trump 2016 Vote%      Team
##  Abilene-Sweetwater TX     :   7   Min.   :18.56    Min.   : NA
##  Albany GA                 :   7   1st Qu.:46.28    1st Qu.: NA
##  Albany-Schenectady-Troy NY:   7   Median :55.26    Median : NA
##  Albuquerque-Santa Fe NM   :   7   Mean   :54.53    Mean   :NaN
##  Alexandria LA             :   7   3rd Qu.:63.82    3rd Qu.: NA
##  Alpena MI                 :   7   Max.   :79.13    Max.   : NA
##  (Other)                   :1407                    NA's   :1449
##  searchPercent       team     trumpMajority
##  Min.   : 0.00   CBB   :207   Mode :logical
##  1st Qu.: 4.00   CFB   :207   FALSE:483
##  Median :10.00   MLB   :207   TRUE :966
##  Mean   :14.29   NASCAR:207
##  3rd Qu.:20.00   NBA   :207
##  Max.   :56.00   NFL   :207
##                  NHL   :207
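Because the table is in long format, each DMA contributes seven rows, so the `trumpMajority` counts above are seven times the number of markets: 483 / 7 = 69 DMAs without a Trump majority and 966 / 7 = 138 with one, which together give the 207 cases. A minimal sketch of a DMA-level summary follows, using made-up toy values and a hypothetical column name `trumpVote` standing in for `Trump 2016 Vote%`:

```r
# Toy long-format table: 3 markets x 2 leagues (values are made up)
toy <- data.frame(
  DMA       = rep(c("Market A", "Market B", "Market C"), each = 2),
  team      = rep(c("NFL", "NBA"), times = 3),
  trumpVote = rep(c(61.0, 44.5, 52.3), each = 2)
)

# Keep one row per DMA so each market is counted once
perDma <- unique(toy[c("DMA", "trumpVote")])
perDma$trumpMajority <- perDma$trumpVote > 50

# Market-level counts of majority / no majority
majorityCounts <- table(perDma$trumpMajority)
```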
# Scatter plot of vote share against search share, one panel per league
plot <- ggplot(fullDf2, aes(x = searchPercent, y = `Trump 2016 Vote%`, color = trumpMajority)) +
  geom_point() +
  facet_wrap(~team)
plot
# Box plots of search share by league
plot <- ggplot(fullDf, aes(x = team, y = searchPercent)) +
  geom_boxplot()
plot
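The faceted scatter plot invites a league-by-league look at the association. One possible way to put a number on it, sketched here rather than taken from the original analysis, is a Pearson correlation between search share and vote share within each league. The values below are made up, and `trumpVote` is a hypothetical stand-in for the `Trump 2016 Vote%` column.

```r
# Toy long-format table: 2 leagues x 4 markets (values are made up)
toy <- data.frame(
  team          = rep(c("NFL", "NASCAR"), each = 4),
  searchPercent = c(40, 35, 45, 30, 2, 8, 5, 12),
  trumpVote     = c(50, 55, 45, 62, 40, 58, 50, 66)
)

# Split the rows by league and compute one correlation per league
corByLeague <- sapply(
  split(toy, toy$team),
  function(d) cor(d$searchPercent, d$trumpVote)
)
```

On the real data, the same pattern would split `fullDf2` by `team`. A correlation only describes a linear association in observational data and does not imply that fandom drives voting or the reverse.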