Data Preparation
library(knitr)
library(tidyverse)
library(cowplot)
library(ggplot2)
library(psych)
sports <- read.csv("https://raw.githubusercontent.com/fivethirtyeight/data/master/nfl-fandom/NFL_fandom_data-google_trends.csv",skip=1,header=TRUE, stringsAsFactors = FALSE )
## Data Cleanup- remove % symbol, convert cols to numeric
sports2 <- as_data_frame(lapply(sports, gsub, pattern='\\%', replacement=''))
sports_names <- colnames(sports2)[-1]
sports2[sports_names] <- sapply(sports2[sports_names],as.numeric)
sports_names
## get the means of sports categories
team_means <- lapply(sports2[sports_names],mean)
## get sd of categories
col_sd <- sapply(sports2[sports_names], sd, na.rm = TRUE)
col_sd
Research Question
- How does Google search results for the Seven major sports correlate to Trump’s 2016 vote percentage?
Cases
- The towns are listed by designated market area (DMA)
- 207 different cases
Data Collection
- “Google Trends data was derived from comparing 5-year search traffic for the 7 sports leagues we analyzed:”(from github info)
Type of study
- Observational
Data Source
If you collected the data, state self-collected. If not, provide a citation/link.
[https://github.com/fivethirtyeight/data/blob/master/nfl-fandom/NFL_fandom_data-google_trends.csv]
Response
What is the response variable, and what type is it (numerical/categorical)?
- Discrete numerical- Trump 2016 vote percentage
Explanatory
What is the explanatory variable, and what type is it (numerical/categorival)?
- Discrete numerical- Pct. of major sports searches
- Create categorical clusters within this data based on \(\sigma\) to test if there is a difference between (higher, average, and lower sport specific searches) and Trump’s average 2016 vote
Relevant summary statistics
Provide summary statistics relevant to your research question. For example, if you’re comparing means across groups provide means, SDs, sample sizes of each group. This step requires the use of R, hence a code chunk is provided below. Insert more code chunks as needed.
paste("the means of column",sports_names,"is ",team_means," The sd of each column is ",col_sd)
## [1] "the means of column NFL is 39.0966183574879 The sd of each column is 6.43264868397972"
## [2] "the means of column NBA is 22.8019323671498 The sd of each column is 5.5436832009321"
## [3] "the means of column MLB is 13.5942028985507 The sd of each column is 3.99022655723329"
## [4] "the means of column NHL is 5.09178743961353 The sd of each column is 3.6117768793827"
## [5] "the means of column NASCAR is 5.3719806763285 The sd of each column is 2.3006515412352"
## [6] "the means of column CBB is 4.75845410628019 The sd of each column is 3.7867440642748"
## [7] "the means of column CFB is 9.28019323671498 The sd of each column is 5.21840878561084"
## [8] "the means of column Trump.2016.Vote. is 54.5292270531401 The sd of each column is 12.2978147234567"
Generalizability
- With an average Trump vote percentage of 54.5, our sample is clearly biased(Trump lost popular vote)
- This will be shown statistically
- As an observational Study this will not prove there is a causation between sports searches and Trump’s 2016 vote percentage
Several ideas for Approach
- Use linear regression to build a predictive model
- Look at each sport individually. Categorize each sport into clusters based on \(\sigma\). I will attempt to create equal categories. I.E. (<-1.5 \(\sigma\),-1 \(\sigma\),-.5 \(\sigma\), .5 \(\sigma\),1 \(\sigma\), >1.5 \(\sigma\)) I will use these categories to determine if there are statistically significant differences between how these clusters voted for Trump in 2016.
- My explanatory variables are completely dependent on each other as they add up to 100% overall. Any analysis involving how multiple explanatory variables effect Trumps voting percentage, will have to find a way to account for this
Visualizations
Attempt to plot all data together
Sports_dfsports2 <- sports2%>%
gather(.,"Sports",Viewership,2:8) %>%
arrange(Sports)
ggplot(Sports_dfsports2, aes(x=Viewership, y=Trump.2016.Vote., fill=Sports, color=Sports)) + geom_point(size=2, shape=17)+
geom_smooth(aes(group=Sports),method=lm)
Split sports into similar viewership dataframes for visuals
NFL_NBA_MLB <- Sports_dfsports2 %>%
filter(Sports==c('MLB','NBA','NFL'))
ggplot(NFL_NBA_MLB, aes(x=Viewership, y=Trump.2016.Vote., fill=Sports, color=Sports)) +
geom_point(size=2, shape=17)+
geom_smooth(aes(group=Sports),method=lm)
## Other sports
other_sports<- Sports_dfsports2 %>%
filter(Sports==c('CBB','CFB','NASCAR',"NHL"))
## Warning in Sports == c("CBB", "CFB", "NASCAR", "NHL"): longer object length
## is not a multiple of shorter object length
ggplot(other_sports, aes(x=Viewership, y=Trump.2016.Vote., fill=Sports, color=Sports)) +
geom_point(size=2, shape=17)+
geom_smooth(aes(group=Sports),method=lm)
- Just eyeballing these sports( which could be done better graphed individually but aggregated seems more concise)
- NBA most negative correlation between viewership and Trump vote
- NHL and MLB seem to have a strong negative correlation as well
- CFB and Nascar have a very positive correlation between viewership and Trump vote
- Lets show Nascar and NBA individually
NBA <- Sports_dfsports2 %>%
filter(Sports=='NBA')
Nascar <- Sports_dfsports2 %>%
filter(Sports=='NASCAR')
theme_set(theme_cowplot(font_size=12)) # reduce default font size
plot.NBA <- ggplot(NBA, aes(x=Viewership, y=Trump.2016.Vote., fill=Sports, color=Sports)) +
geom_point(size=2, shape=17)+
geom_smooth(aes(group=Sports),method=lm)
plot.Nascar <- ggplot(Nascar, aes(x=Viewership, y=Trump.2016.Vote., fill=Sports, color=Sports)) +
geom_point(size=2, shape=17)+
geom_smooth(aes(group=Sports),method=lm)
plot_grid(plot.NBA, plot.Nascar, labels = c('A', 'B'))
Build sports Histograms
plot_gg <- function(z){df <- cbind(z,)
ggplot(z, aes(x=z, y=Trump.2016.Vote.)) +
geom_point(size=2, shape=17)+
geom_smooth(method=lm)
}
Major_sport_names <- colnames(sports2)[2:8]
individual_sports <- mapply(hist,sports2[Major_sport_names],main=Major_sport_names)
- MLB and College BB don’t appear to be normally distributed
- With CBB it appears to be a binning issue, MLB I’m not so sure
- The other sports have some tails and outliers but are relatively normal
Closer look at MLB
Sports_dfsports2 %>%
filter(Sports=='MLB') %>%
ggplot(., aes(x=Viewership, y=Trump.2016.Vote., fill=Sports, color=Sports)) +
geom_point(size=2, shape=17)+
geom_smooth(aes(group=Sports),method=lm)
hist(sports2$MLB,breaks=6)
describe(sports2$MLB)
## vars n mean sd median trimmed mad min max range skew kurtosis
## X1 1 207 13.59 3.99 13 13.23 4.45 8 27 19 0.88 0.5
## se
## X1 0.28
length(sports2$MLB[sports2$MLB>22])/207
## [1] 0.03381643
- MLB seems ok in the end. Most of the data is within 2 sd as only 3% of data is 2 sd above mean
Describe function across Sports
sports_described <- lapply(sports2[Major_sport_names],describe)
sports_described
## $NFL
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 207 39.1 6.43 39 38.93 7.41 22 56 34 0.16 -0.24 0.45
##
## $NBA
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 207 22.8 5.54 22 22.39 5.93 13 41 28 0.72 0.35 0.39
##
## $MLB
## vars n mean sd median trimmed mad min max range skew kurtosis
## X1 1 207 13.59 3.99 13 13.23 4.45 8 27 19 0.88 0.5
## se
## X1 0.28
##
## $NHL
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 207 5.09 3.61 4 4.53 2.97 0 23 23 1.79 4.22 0.25
##
## $NASCAR
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 207 5.37 2.3 5 5.17 1.48 1 15 14 0.9 1.28 0.16
##
## $CBB
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 207 4.76 3.79 4 4.08 1.48 1 35 34 4.2 24.88 0.26
##
## $CFB
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 207 9.28 5.22 8 8.61 4.45 2 29 27 1.25 1.67 0.36