Data Preparation

library(knitr)
library(tidyverse)
library(cowplot)
library(ggplot2)
library(psych)
sports <- read.csv("https://raw.githubusercontent.com/fivethirtyeight/data/master/nfl-fandom/NFL_fandom_data-google_trends.csv",skip=1,header=TRUE, stringsAsFactors = FALSE )

## Data Cleanup- remove % symbol, convert cols to numeric
sports2 <- as_data_frame(lapply(sports, gsub, pattern='\\%', replacement=''))
sports_names <- colnames(sports2)[-1]
sports2[sports_names] <- sapply(sports2[sports_names],as.numeric)
sports_names
## get the means of sports categories
team_means <- lapply(sports2[sports_names],mean)
## get sd of categories
col_sd <- sapply(sports2[sports_names], sd, na.rm = TRUE)
col_sd

Research Question

How does Google search results for the Seven major sports correlate to Trump’s 2016 vote percentage?

Cases

The towns are listed by designated market area (DMA)
207 different cases

Data Collection

“Google Trends data was derived from comparing 5-year search traffic for the 7 sports leagues we analyzed:”(from github info)

Type of study

Observational

Data Source

If you collected the data, state self-collected. If not, provide a citation/link.

[https://github.com/fivethirtyeight/data/blob/master/nfl-fandom/NFL_fandom_data-google_trends.csv]

Response

What is the response variable, and what type is it (numerical/categorical)?

Discrete numerical- Trump 2016 vote percentage

Explanatory

What is the explanatory variable, and what type is it (numerical/categorival)?

Discrete numerical- Pct. of major sports searches
- Create categorical clusters within this data based on \(\sigma\) to test if there is a difference between (higher, average, and lower sport specific searches) and Trump’s average 2016 vote

Relevant summary statistics

Provide summary statistics relevant to your research question. For example, if you’re comparing means across groups provide means, SDs, sample sizes of each group. This step requires the use of R, hence a code chunk is provided below. Insert more code chunks as needed.

paste("the means of column",sports_names,"is ",team_means," The sd of each column is ",col_sd)

## [1] "the means of column NFL is  39.0966183574879  The sd of each column is  6.43264868397972"             
## [2] "the means of column NBA is  22.8019323671498  The sd of each column is  5.5436832009321"              
## [3] "the means of column MLB is  13.5942028985507  The sd of each column is  3.99022655723329"             
## [4] "the means of column NHL is  5.09178743961353  The sd of each column is  3.6117768793827"              
## [5] "the means of column NASCAR is  5.3719806763285  The sd of each column is  2.3006515412352"            
## [6] "the means of column CBB is  4.75845410628019  The sd of each column is  3.7867440642748"              
## [7] "the means of column CFB is  9.28019323671498  The sd of each column is  5.21840878561084"             
## [8] "the means of column Trump.2016.Vote. is  54.5292270531401  The sd of each column is  12.2978147234567"

Generalizability

With an average Trump vote percentage of 54.5, our sample is clearly biased(Trump lost popular vote)
- This will be shown statistically
As an observational Study this will not prove there is a causation between sports searches and Trump’s 2016 vote percentage

Several ideas for Approach

Use linear regression to build a predictive model
Look at each sport individually. Categorize each sport into clusters based on \(\sigma\). I will attempt to create equal categories. I.E. (<-1.5 \(\sigma\),-1 \(\sigma\),-.5 \(\sigma\), .5 \(\sigma\),1 \(\sigma\), >1.5 \(\sigma\)) I will use these categories to determine if there are statistically significant differences between how these clusters voted for Trump in 2016.
My explanatory variables are completely dependent on each other as they add up to 100% overall. Any analysis involving how multiple explanatory variables effect Trumps voting percentage, will have to find a way to account for this

Visualizations

Attempt to plot all data together

Sports_dfsports2 <- sports2%>% 
    gather(.,"Sports",Viewership,2:8) %>% 
    arrange(Sports)



ggplot(Sports_dfsports2, aes(x=Viewership, y=Trump.2016.Vote., fill=Sports, color=Sports)) + geom_point(size=2, shape=17)+
    geom_smooth(aes(group=Sports),method=lm)

Split sports into similar viewership dataframes for visuals

NFL_NBA_MLB <- Sports_dfsports2 %>% 
    filter(Sports==c('MLB','NBA','NFL'))
ggplot(NFL_NBA_MLB, aes(x=Viewership, y=Trump.2016.Vote., fill=Sports, color=Sports)) + 
    geom_point(size=2, shape=17)+
    geom_smooth(aes(group=Sports),method=lm)

## Other sports
other_sports<- Sports_dfsports2 %>% 
    filter(Sports==c('CBB','CFB','NASCAR',"NHL"))

## Warning in Sports == c("CBB", "CFB", "NASCAR", "NHL"): longer object length
## is not a multiple of shorter object length

ggplot(other_sports, aes(x=Viewership, y=Trump.2016.Vote., fill=Sports, color=Sports)) + 
    geom_point(size=2, shape=17)+
    geom_smooth(aes(group=Sports),method=lm)

Just eyeballing these sports( which could be done better graphed individually but aggregated seems more concise)
- NBA most negative correlation between viewership and Trump vote
- NHL and MLB seem to have a strong negative correlation as well
- CFB and Nascar have a very positive correlation between viewership and Trump vote
Lets show Nascar and NBA individually

NBA <- Sports_dfsports2 %>% 
     filter(Sports=='NBA')
Nascar <- Sports_dfsports2 %>% 
     filter(Sports=='NASCAR')
theme_set(theme_cowplot(font_size=12)) # reduce default font size
plot.NBA <- ggplot(NBA,  aes(x=Viewership, y=Trump.2016.Vote., fill=Sports, color=Sports)) + 
    geom_point(size=2, shape=17)+
    geom_smooth(aes(group=Sports),method=lm)
plot.Nascar <- ggplot(Nascar,  aes(x=Viewership, y=Trump.2016.Vote., fill=Sports, color=Sports)) + 
    geom_point(size=2, shape=17)+
    geom_smooth(aes(group=Sports),method=lm)
plot_grid(plot.NBA, plot.Nascar, labels = c('A', 'B'))

Build sports Histograms

plot_gg <- function(z){df <- cbind(z,)
ggplot(z, aes(x=z, y=Trump.2016.Vote.)) + 
    geom_point(size=2, shape=17)+
    geom_smooth(method=lm)
}

Major_sport_names <- colnames(sports2)[2:8]
individual_sports <- mapply(hist,sports2[Major_sport_names],main=Major_sport_names)

MLB and College BB don’t appear to be normally distributed
- With CBB it appears to be a binning issue, MLB I’m not so sure
The other sports have some tails and outliers but are relatively normal

Closer look at MLB

Sports_dfsports2 %>% 
     filter(Sports=='MLB') %>% 
    ggplot(.,  aes(x=Viewership, y=Trump.2016.Vote., fill=Sports, color=Sports)) + 
    geom_point(size=2, shape=17)+
    geom_smooth(aes(group=Sports),method=lm)

hist(sports2$MLB,breaks=6)

describe(sports2$MLB)

##    vars   n  mean   sd median trimmed  mad min max range skew kurtosis
## X1    1 207 13.59 3.99     13   13.23 4.45   8  27    19 0.88      0.5
##      se
## X1 0.28

length(sports2$MLB[sports2$MLB>22])/207

## [1] 0.03381643

MLB seems ok in the end. Most of the data is within 2 sd as only 3% of data is 2 sd above mean

Describe function across Sports

sports_described <- lapply(sports2[Major_sport_names],describe)
sports_described

## $NFL
##    vars   n mean   sd median trimmed  mad min max range skew kurtosis   se
## X1    1 207 39.1 6.43     39   38.93 7.41  22  56    34 0.16    -0.24 0.45
## 
## $NBA
##    vars   n mean   sd median trimmed  mad min max range skew kurtosis   se
## X1    1 207 22.8 5.54     22   22.39 5.93  13  41    28 0.72     0.35 0.39
## 
## $MLB
##    vars   n  mean   sd median trimmed  mad min max range skew kurtosis
## X1    1 207 13.59 3.99     13   13.23 4.45   8  27    19 0.88      0.5
##      se
## X1 0.28
## 
## $NHL
##    vars   n mean   sd median trimmed  mad min max range skew kurtosis   se
## X1    1 207 5.09 3.61      4    4.53 2.97   0  23    23 1.79     4.22 0.25
## 
## $NASCAR
##    vars   n mean  sd median trimmed  mad min max range skew kurtosis   se
## X1    1 207 5.37 2.3      5    5.17 1.48   1  15    14  0.9     1.28 0.16
## 
## $CBB
##    vars   n mean   sd median trimmed  mad min max range skew kurtosis   se
## X1    1 207 4.76 3.79      4    4.08 1.48   1  35    34  4.2    24.88 0.26
## 
## $CFB
##    vars   n mean   sd median trimmed  mad min max range skew kurtosis   se
## X1    1 207 9.28 5.22      8    8.61 4.45   2  29    27 1.25     1.67 0.36

R Notebook

DATA 606

Justin Herman