606 Project Proposal

Data Preparation

library(tidyr)
library(ggplot2)
library(dplyr)

#Import data
rawData <- read.csv('https://raw.githubusercontent.com/fivethirtyeight/data/master/nfl-fandom/NFL_fandom_data-google_trends.csv', header = FALSE)

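#Drop the first row, then promote the next row (the real header) to column names
data <- rawData[-1,]
colnames(data) <- as.character(unlist(data[1,]))
#Drop the header row now that it has been used for the column names
data <- data[-1, ]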

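#Tidy data: reshape from wide (one column per league) to long format
tidyData <- data %>%
  #Gather the seven league columns (2:8) into Team / searchPercent pairs
  gather(Team, searchPercent, 2:8) %>%
  #Arrange rows by DMA
  arrange(DMA)

#Save the league labels; the numeric conversion below turns the Team column into NA
team <- tidyData$Team

#Remove % sign and convert every column except DMA to numeric
tidyData[-1] <- data.frame(apply(tidyData[-1], 2, function(x)
  as.numeric(sub("%", "", as.character(x)))))

#Re-attach the league labels as a new column named team
fullDf <- cbind(tidyData, team)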

#Create a logical column that is TRUE when Trump won more than 50% of the vote in the DMA
fullDf2 <- fullDf %>%
  mutate(trumpMajority = `Trump 2016 Vote%` > 50)

#Display HTML data table
DT::datatable(fullDf2, editable = TRUE)

Research question

Is there an association between interest in major professional sports leagues, as measured by Google search share, and support for Trump in the 2016 presidential election?

Cases

What are the cases, and how many are there?

Each case is a designated market area (DMA), and there are 207 cases.
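Because the tidied data holds one row per DMA and league, the case count comes from the number of distinct DMAs rather than the number of rows. A minimal sketch of that check, assuming a long-format data frame shaped like `fullDf2`; the `toy` data frame and its values are made up for illustration and are not taken from the data set.

```r
library(dplyr)

# Hypothetical stand-in for fullDf2: two DMAs, two leagues each (values are made up)
toy <- data.frame(
  DMA = c("Market A", "Market A", "Market B", "Market B"),
  team = c("NFL", "NBA", "NFL", "NBA"),
  searchPercent = c(40, 20, 35, 25)
)

# Each DMA appears once per league, so count distinct DMAs instead of rows
nCases <- n_distinct(toy$DMA)
nRows <- nrow(toy)

nCases
nRows
```

Applied to the real data, `n_distinct(fullDf2$DMA)` should agree with the 207 cases reported here.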

Data collection

Describe the method of data collection.

Google Trends data was derived by comparing 5-year search traffic for 7 major sports leagues (https://g.co/trends/5P8aa).

Results are listed by designated market area (DMA).

The percentages are the approximate percentage of major-sports searches that were conducted for each league.

Trump’s percentage is his share of the vote within the DMA in the 2016 presidential election.
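The preparation script strips a `%` sign before converting values, which suggests the percentages arrive as text. A minimal sketch of that conversion in isolation; `percentToNumeric` is a hypothetical helper name, and the input values are made up to mimic the format.

```r
# Hypothetical helper (not part of the original script): "34%" -> 34
percentToNumeric <- function(x) {
  # fixed = TRUE treats "%" as a literal character, not a regular expression
  as.numeric(sub("%", "", as.character(x), fixed = TRUE))
}

# Made-up values in the assumed raw format
rawValues <- c("34%", "7%", "55.26%")
cleanValues <- percentToNumeric(rawValues)
cleanValues
```

Applying a helper like this only to the percentage columns would avoid coercing text columns, such as the league label, to `NA`.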

Type of study

What type of study is this (observational/experiment)?

Observational

Data Source

If you collected the data, state self-collected. If not, provide a citation/link.

Google Trends data was derived by comparing 5-year search traffic for 7 major sports leagues, and the CSV file is available on GitHub:

https://github.com/fivethirtyeight/data/blob/master/nfl-fandom/NFL_fandom_data-google_trends.csv

Dependent Variable

What is the response variable? Is it quantitative or qualitative?

The response variable is Trump’s percentage of the 2016 presidential election vote within the DMA, and it is quantitative.

Independent Variable

You should have two independent variables, one quantitative and one qualitative.

Quantitative: percentage of major-sports searches that were conducted for each league
Qualitative: major sports league (stored in the team column)
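One possible way to relate these variables to the response is a correlation between search share and Trump vote share, computed separately for each league. This is only a sketch of one option, not necessarily the analysis the project will use; the `toy` data frame, the `trumpVote` column name, and all values are made up for illustration.

```r
library(dplyr)

# Hypothetical long-format data: four DMAs per league (values are made up)
toy <- data.frame(
  team = rep(c("NFL", "NBA"), each = 4),
  searchPercent = c(30, 35, 40, 45, 25, 20, 15, 10),
  trumpVote = c(40, 45, 50, 55, 40, 45, 50, 55)
)

# Pearson correlation between search share and vote share within each league
corByLeague <- toy %>%
  group_by(team) %>%
  summarise(r = cor(searchPercent, trumpVote))

corByLeague
```

With the real data, the same pattern would use `Trump 2016 Vote%` in place of the hypothetical `trumpVote` column.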

Relevant summary statistics

Provide summary statistics for each of the variables. Also include appropriate visualizations related to your research question (e.g. scatter plot, boxplots, etc). This step requires the use of R, hence a code chunk is provided below. Insert more code chunks as needed.

summary(fullDf2)

##                          DMA       Trump 2016 Vote%      Team     
##  Abilene-Sweetwater TX     :   7   Min.   :18.56    Min.   : NA   
##  Albany GA                 :   7   1st Qu.:46.28    1st Qu.: NA   
##  Albany-Schenectady-Troy NY:   7   Median :55.26    Median : NA   
##  Albuquerque-Santa Fe NM   :   7   Mean   :54.53    Mean   :NaN   
##  Alexandria LA             :   7   3rd Qu.:63.82    3rd Qu.: NA   
##  Alpena MI                 :   7   Max.   :79.13    Max.   : NA   
##  (Other)                   :1407                    NA's   :1449  
##  searchPercent       team     trumpMajority  
##  Min.   : 0.00   CBB   :207   Mode :logical  
##  1st Qu.: 4.00   CFB   :207   FALSE:483      
##  Median :10.00   MLB   :207   TRUE :966      
##  Mean   :14.29   NASCAR:207                  
##  3rd Qu.:20.00   NBA   :207                  
##  Max.   :56.00   NFL   :207                  
##                  NHL   :207

#Create scatter plot by team
plot <- ggplot(fullDf2, aes(x=searchPercent, y=`Trump 2016 Vote%`, color = trumpMajority)) + 
  geom_point() +
  facet_wrap(~team)

plot

#Create box plots
plot <- ggplot(fullDf, aes(x=team, y=searchPercent)) + 
  geom_boxplot()

plot
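The `summary()` output pools all seven leagues, so a per-league breakdown may complement the box plots. A minimal sketch using `group_by()` and `summarise()`, assuming the same long format as `fullDf2`; the `toy` data frame and its values are made up for illustration.

```r
library(dplyr)

# Hypothetical stand-in for fullDf2: three DMAs per league (values are made up)
toy <- data.frame(
  team = rep(c("NFL", "NBA"), each = 3),
  searchPercent = c(30, 40, 50, 10, 20, 30)
)

# Count, mean, median, and standard deviation of search share per league
leagueStats <- toy %>%
  group_by(team) %>%
  summarise(
    n = n(),
    meanSearch = mean(searchPercent),
    medianSearch = median(searchPercent),
    sdSearch = sd(searchPercent)
  )

leagueStats
```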