Data Preparation

library(knitr)
library(tidyverse)
library(cowplot)
library(ggplot2)
library(psych)
sports <- read.csv("https://raw.githubusercontent.com/fivethirtyeight/data/master/nfl-fandom/NFL_fandom_data-google_trends.csv",skip=1,header=TRUE, stringsAsFactors = FALSE )

## Data Cleanup- remove % symbol, convert cols to numeric
sports2 <- as_data_frame(lapply(sports, gsub, pattern='\\%', replacement=''))
sports_names <- colnames(sports2)[-1]
sports2[sports_names] <- sapply(sports2[sports_names],as.numeric)
sports_names
## get the means of sports categories
team_means <- lapply(sports2[sports_names],mean)
## get sd of categories
col_sd <- sapply(sports2[sports_names], sd, na.rm = TRUE)
col_sd

Research Question

Cases

Data Collection

Type of study

Data Source

If you collected the data, state self-collected. If not, provide a citation/link.

[https://github.com/fivethirtyeight/data/blob/master/nfl-fandom/NFL_fandom_data-google_trends.csv]

Response

What is the response variable, and what type is it (numerical/categorical)?

Explanatory

What is the explanatory variable, and what type is it (numerical/categorival)?

Relevant summary statistics

Provide summary statistics relevant to your research question. For example, if you’re comparing means across groups provide means, SDs, sample sizes of each group. This step requires the use of R, hence a code chunk is provided below. Insert more code chunks as needed.

paste("the means of column",sports_names,"is ",team_means," The sd of each column is ",col_sd)
## [1] "the means of column NFL is  39.0966183574879  The sd of each column is  6.43264868397972"             
## [2] "the means of column NBA is  22.8019323671498  The sd of each column is  5.5436832009321"              
## [3] "the means of column MLB is  13.5942028985507  The sd of each column is  3.99022655723329"             
## [4] "the means of column NHL is  5.09178743961353  The sd of each column is  3.6117768793827"              
## [5] "the means of column NASCAR is  5.3719806763285  The sd of each column is  2.3006515412352"            
## [6] "the means of column CBB is  4.75845410628019  The sd of each column is  3.7867440642748"              
## [7] "the means of column CFB is  9.28019323671498  The sd of each column is  5.21840878561084"             
## [8] "the means of column Trump.2016.Vote. is  54.5292270531401  The sd of each column is  12.2978147234567"

Generalizability

Several ideas for Approach

Visualizations

Attempt to plot all data together

Sports_dfsports2 <- sports2%>% 
    gather(.,"Sports",Viewership,2:8) %>% 
    arrange(Sports)



ggplot(Sports_dfsports2, aes(x=Viewership, y=Trump.2016.Vote., fill=Sports, color=Sports)) + geom_point(size=2, shape=17)+
    geom_smooth(aes(group=Sports),method=lm)

Split sports into similar viewership dataframes for visuals

NFL_NBA_MLB <- Sports_dfsports2 %>% 
    filter(Sports==c('MLB','NBA','NFL'))
ggplot(NFL_NBA_MLB, aes(x=Viewership, y=Trump.2016.Vote., fill=Sports, color=Sports)) + 
    geom_point(size=2, shape=17)+
    geom_smooth(aes(group=Sports),method=lm)

## Other sports
other_sports<- Sports_dfsports2 %>% 
    filter(Sports==c('CBB','CFB','NASCAR',"NHL"))
## Warning in Sports == c("CBB", "CFB", "NASCAR", "NHL"): longer object length
## is not a multiple of shorter object length
ggplot(other_sports, aes(x=Viewership, y=Trump.2016.Vote., fill=Sports, color=Sports)) + 
    geom_point(size=2, shape=17)+
    geom_smooth(aes(group=Sports),method=lm)

  • Just eyeballing these sports( which could be done better graphed individually but aggregated seems more concise)
    • NBA most negative correlation between viewership and Trump vote
    • NHL and MLB seem to have a strong negative correlation as well
    • CFB and Nascar have a very positive correlation between viewership and Trump vote
  • Lets show Nascar and NBA individually
NBA <- Sports_dfsports2 %>% 
     filter(Sports=='NBA')
Nascar <- Sports_dfsports2 %>% 
     filter(Sports=='NASCAR')
theme_set(theme_cowplot(font_size=12)) # reduce default font size
plot.NBA <- ggplot(NBA,  aes(x=Viewership, y=Trump.2016.Vote., fill=Sports, color=Sports)) + 
    geom_point(size=2, shape=17)+
    geom_smooth(aes(group=Sports),method=lm)
plot.Nascar <- ggplot(Nascar,  aes(x=Viewership, y=Trump.2016.Vote., fill=Sports, color=Sports)) + 
    geom_point(size=2, shape=17)+
    geom_smooth(aes(group=Sports),method=lm)
plot_grid(plot.NBA, plot.Nascar, labels = c('A', 'B'))

Build sports Histograms

plot_gg <- function(z){df <- cbind(z,)
ggplot(z, aes(x=z, y=Trump.2016.Vote.)) + 
    geom_point(size=2, shape=17)+
    geom_smooth(method=lm)
}

Major_sport_names <- colnames(sports2)[2:8]
individual_sports <- mapply(hist,sports2[Major_sport_names],main=Major_sport_names)

  • MLB and College BB don’t appear to be normally distributed
    • With CBB it appears to be a binning issue, MLB I’m not so sure
  • The other sports have some tails and outliers but are relatively normal

Closer look at MLB

Sports_dfsports2 %>% 
     filter(Sports=='MLB') %>% 
    ggplot(.,  aes(x=Viewership, y=Trump.2016.Vote., fill=Sports, color=Sports)) + 
    geom_point(size=2, shape=17)+
    geom_smooth(aes(group=Sports),method=lm)

hist(sports2$MLB,breaks=6)

describe(sports2$MLB)
##    vars   n  mean   sd median trimmed  mad min max range skew kurtosis
## X1    1 207 13.59 3.99     13   13.23 4.45   8  27    19 0.88      0.5
##      se
## X1 0.28
length(sports2$MLB[sports2$MLB>22])/207
## [1] 0.03381643
  • MLB seems ok in the end. Most of the data is within 2 sd as only 3% of data is 2 sd above mean

Describe function across Sports

sports_described <- lapply(sports2[Major_sport_names],describe)
sports_described
## $NFL
##    vars   n mean   sd median trimmed  mad min max range skew kurtosis   se
## X1    1 207 39.1 6.43     39   38.93 7.41  22  56    34 0.16    -0.24 0.45
## 
## $NBA
##    vars   n mean   sd median trimmed  mad min max range skew kurtosis   se
## X1    1 207 22.8 5.54     22   22.39 5.93  13  41    28 0.72     0.35 0.39
## 
## $MLB
##    vars   n  mean   sd median trimmed  mad min max range skew kurtosis
## X1    1 207 13.59 3.99     13   13.23 4.45   8  27    19 0.88      0.5
##      se
## X1 0.28
## 
## $NHL
##    vars   n mean   sd median trimmed  mad min max range skew kurtosis   se
## X1    1 207 5.09 3.61      4    4.53 2.97   0  23    23 1.79     4.22 0.25
## 
## $NASCAR
##    vars   n mean  sd median trimmed  mad min max range skew kurtosis   se
## X1    1 207 5.37 2.3      5    5.17 1.48   1  15    14  0.9     1.28 0.16
## 
## $CBB
##    vars   n mean   sd median trimmed  mad min max range skew kurtosis   se
## X1    1 207 4.76 3.79      4    4.08 1.48   1  35    34  4.2    24.88 0.26
## 
## $CFB
##    vars   n mean   sd median trimmed  mad min max range skew kurtosis   se
## X1    1 207 9.28 5.22      8    8.61 4.45   2  29    27 1.25     1.67 0.36