DATA 606 Final Project

Part 1 - Introduction

Research Question:

Is there an association between major professional sports league fan bases and 2016 political affiliations?

Motivation:

The association between major professional sports league fan bases and 2016 political affiliations could be insightful to marketers and political campaigns. Knowing sports league fan bases could help both groups target potential customers or supporters.

Part 2 - Data

Data collection:

I found this data on Five Thirty Eight and the csv file is available on github: https://github.com/fivethirtyeight/data/blob/master/nfl-fandom/NFL_fandom_data-google_trends.csv

The original data was collected from Google Trends data comparing 5-year search traffic for 7 major sports leagues (https://g.co/trends/5P8aa). The 7 sports leagues include: college basketball (CBB), college football (CFB), Major League Baseball (MLB), National Association for Stock Car Auto Racing (NASCAR), National Basketball Association (NBA), National Football League (NFL), and National Hockey League (NHL). Results are listed by designated market area (DMA). The percentages are the approximate percentage of major-sports searches that were conducted for each league. Trump’s percentage is his share of the vote within the DMA in the 2016 presidential election.

Cases:

The cases are each designated market area (DMA) and there are 207 cases

Variables:

Response variable:

Trump’s percentage of the 2016 presidential election vote within the DMA (quantitative)

Independent variables:

Percentage of major-sports searches that were conducted for each league (quantitative)
Major league team (qualitative)

Type of study:

This is an observational study because data was collected in a way that does not directly interfere with how the data arise; the data was collected by only monitoring what occurred.

Scope of inference - generalizability:

The population of interest is individuals who voted in the 2016 U.S. presidential election. Since this is a large sample, the findings from this analysis can be generalized to the whole population.

Scope of inference - causality:

These data cannot be used to establish causal links between the variables of interest because it is an observational study. Therefore, no causal conclusion can be drawn, but a correlation statement can be generalized to the whole population.

Part 3 - Exploratory data analysis

Load Libraries

library(tidyr)
library(ggplot2)
library(dplyr)
library(kableExtra)
library(RColorBrewer)

Import and Clean Data

#Import data
rawData <- read.csv('https://raw.githubusercontent.com/fivethirtyeight/data/master/nfl-fandom/NFL_fandom_data-google_trends.csv', header = FALSE)

#Remove first row and rename columns
data <- rawData[-1,]
colnames(data) <- as.character(unlist(data[1,]))
data <- data[-1, ]

#tidy data
tidyData <- data %>%
  #separate data into team and % of searches columns
  gather(Team, searchPercent, 2:8) %>%
  #arrange rows by DMA
  arrange(DMA)

team <- tidyData$Team
  
#Remove % sign
tidyData[-1]<-data.frame(apply(tidyData[-1], 2, function(x) 
    as.numeric(sub("%","",as.character(x)))))
  
fullDf <- cbind(tidyData, team)

#Create column that returns true or false if trump won over 50% of votes
fullDf2 <- fullDf %>% 
        mutate(trumpMajority = `Trump 2016 Vote%`>50) 

#Display HTML data table
DT::datatable(fullDf2, editable = TRUE)

#Subset of data where Trump had majority
trumpM <- subset(fullDf2, trumpMajority == 'TRUE')
#Subset of data where Trump had majority
trumpm <- subset(fullDf2, trumpMajority == 'FALSE')

Summary Statistics

summary(fullDf2)

##                          DMA       Trump 2016 Vote%      Team     
##  Abilene-Sweetwater TX     :   7   Min.   :18.56    Min.   : NA   
##  Albany GA                 :   7   1st Qu.:46.28    1st Qu.: NA   
##  Albany-Schenectady-Troy NY:   7   Median :55.26    Median : NA   
##  Albuquerque-Santa Fe NM   :   7   Mean   :54.53    Mean   :NaN   
##  Alexandria LA             :   7   3rd Qu.:63.82    3rd Qu.: NA   
##  Alpena MI                 :   7   Max.   :79.13    Max.   : NA   
##  (Other)                   :1407                    NA's   :1449  
##  searchPercent       team     trumpMajority  
##  Min.   : 0.00   CBB   :207   Mode :logical  
##  1st Qu.: 4.00   CFB   :207   FALSE:483      
##  Median :10.00   MLB   :207   TRUE :966      
##  Mean   :14.29   NASCAR:207                  
##  3rd Qu.:20.00   NBA   :207                  
##  Max.   :56.00   NFL   :207                  
##                  NHL   :207

Let’s see how mean search percent for each league compares between Trump and Clinton supporters.

rank1 <- trumpM %>%
    group_by(team) %>%
    summarize(meanSearchPercent = mean(searchPercent))
    
rank2 <- trumpm %>%
    group_by(team) %>%
    summarize(meanSearchPercent = mean(searchPercent))    
    
rank <- fullDf2 %>% 
          arrange(DMA, desc(searchPercent)) %>%
          group_by(DMA) %>%
          mutate(rank=row_number())

rankTrumpM <- subset(rank, trumpMajority == 'TRUE')
rankTrumpm <- subset(rank, trumpMajority == 'FALSE')

Search Percent Rankings (Trump Supporters)

rankTrumpM %>%
    group_by(team) %>%
    summarize(meanRank = mean(rank)) %>%
    arrange(meanRank) %>%
    mutate(roundedMeanRank=row_number()) %>% 
    kable() %>% 
    kable_styling(bootstrap_options = c("striped", "hover", "condensed"))

team	meanRank	roundedMeanRank
NFL	1.050725	1
NBA	2.202899	2
MLB	3.289855	3
CFB	4.268116	4
NASCAR	5.188406	5
NHL	5.913043	6
CBB	6.086957	7

Search Percent Rankings (Clinton Supporters)

rankTrumpm %>%
    group_by(team) %>%
    summarize(meanRank = mean(rank)) %>%
    arrange(meanRank) %>%
    mutate(roundedMeanRank=row_number())  %>% 
    kable() %>% 
    kable_styling(bootstrap_options = c("striped", "hover", "condensed"))

team	meanRank	roundedMeanRank
NFL	1.086957	1
NBA	1.971014	2
MLB	3.130435	3
CFB	4.666667	4
NHL	5.043478	5
NASCAR	5.797101	6
CBB	6.304348	7

NHL and NASCAR switch rounded rankings

Data Visualizations

#Create scatter plot by team
plot <- ggplot(fullDf2, aes(x=searchPercent, y=`Trump 2016 Vote%`, color = trumpMajority)) + 
  geom_point() +
  facet_wrap(~team) +
  scale_color_manual(values = c("#0000FF", "#FF0000")) +
  ggtitle("Search Percent vs. Trump 2016 vote % by League") +
  xlab("Search Percent")


plot

It appears that search traffic was larger for CFB and NASCAR among Trump supporters. Meanwhile, it appears that search traffic was larger for NBA among Clinton supporters.

#Create box plots
#All Data
plot <- ggplot(fullDf, aes(x=team, y=searchPercent, fill = team)) + 
  geom_boxplot() +
  scale_fill_brewer(palette="RdBu") +
  labs(title="Search Percent Box Plot by League",
        x ="Team", 
        y = "Search Percent")

plot

From the box plots, we see that NFL had the most search traffic followed by NBA, MLB, CFB, NASCAR, CBB, and NHL.

plot <- ggplot(fullDf2, aes(trumpMajority, searchPercent, fill = trumpMajority)) + 
  geom_boxplot() +
  facet_wrap(~team) +
  scale_fill_manual(values = c("#0000FF", "#FF0000")) +
  labs(title="Search Percent by League and Political Affiliation",
        x ="Trump Majority", 
        y = "Search Percent")

plot

Comparing search percent by political affiliation for each league, we see that the mean search percent appears different for CFB, MLB, NASCAR, and NBA.

EDA suggests that there is an association between only some of the 7 major professional sports league fan bases (CFB, MLB, NASCAR, & NBA) and 2016 political affiliations.

Part 4 - Inference

Let’s run two sample independent t-tests (difference of two means) for each league to determine if the sample means for Trump majority search percent and Trump minority search percent are different.

\[ H_0: \mu_1 - \mu_a = 0 \\ H_a: \mu_1 - \mu_a \neq 0 \]

Null hypothesis: The difference in sample means (Trump majority vs. Trump minority) is equal to 0
Alternative hypothesis: The difference in sample means (Trump majority vs. Trump minority) is not equal to 0

First we will check that the data meets the following assumptions:

Independence of observations and samples

We will assume the observations and samples are independent of each other. One ballot decision or sports league search history should not be influenced by any other.
Observations come from a nearly normal distribution

We will use histograms, Q-Q Plots, Shapiro-Wilk normality tests to verify the normality of each sample

The histogram of normal data follows a bell curve shape.

A Q-Q plot is a scatterplot created by plotting two sets of quantiles against one another. If both sets of quantiles are from the same distribution, the points should form a straight line.

The null hypothesis for Shapiro-Wilk normality tests is that the samples came from a normal distribution. If the p-value from the test is less than 0.05, then the null hypothesis is rejected and the conclusion is that the data is not from a Normal distribution. If the p-value is greater than 0.05, the conclusion is that the data is from a normal distribution.
Equal variance across two samples

From the box-plots displayed above, we will assume the two samples for each league have equal variance

CBB

cbbTM <- subset(trumpM, team == 'CBB')
cbbTm <- subset(trumpm, team == 'CBB')

par(mfrow = c(2, 2))
hist(cbbTM$searchPercent, xlim = c(0, 50), main = "CBB Search Percent Histogram", xlab = "CBB Search Percent Trump Majority")
hist(cbbTm$searchPercent, xlim = c(0, 50), main = "CBB Search Percent Histogram", xlab = "CBB Search Percent Trump Minority")

qqnorm(cbbTM$searchPercent)
qqline(cbbTM$searchPercent)

qqnorm(cbbTm$searchPercent)
qqline(cbbTm$searchPercent)

shapiro.test(cbbTM$searchPercent)

## 
##  Shapiro-Wilk normality test
## 
## data:  cbbTM$searchPercent
## W = 0.59182, p-value < 2.2e-16

shapiro.test(cbbTm$searchPercent)

## 
##  Shapiro-Wilk normality test
## 
## data:  cbbTm$searchPercent
## W = 0.69356, p-value = 1.032e-10

Sample data is not normal so we will run a simulation

#Simulation
cbbTMmean <- mean(cbbTM$searchPercent)
cbbTMsd <- sd(cbbTM$searchPercent)

cbbTmMean <- mean(cbbTm$searchPercent)
cbbTmSd <- sd(cbbTm$searchPercent)

sim_cbbTM <- rnorm(n = 100, mean = cbbTMmean, sd = cbbTMsd)
sim_cbbTm <- rnorm(n = 100, mean = cbbTmMean, sd = cbbTmSd)

par(mfrow = c(2, 2))
hist(sim_cbbTM, main = "CBB Search Percent Histogram", xlab = "CBB Search Percent Trump Majority")
hist(sim_cbbTm, main = "CBB Search Percent Histogram", xlab = "CBB Search Percent Trump Minority")

qqnorm(sim_cbbTM)
qqline(sim_cbbTM)

qqnorm(sim_cbbTm)
qqline(sim_cbbTm)

shapiro.test(sim_cbbTM)

## 
##  Shapiro-Wilk normality test
## 
## data:  sim_cbbTM
## W = 0.98996, p-value = 0.6613

shapiro.test(sim_cbbTm)

## 
##  Shapiro-Wilk normality test
## 
## data:  sim_cbbTm
## W = 0.99281, p-value = 0.8761

#simulation data is normal

#Perform two sample t-test
res <- t.test(sim_cbbTM, sim_cbbTm, var.equal = TRUE)
res # 6.894e-05

## 
##  Two Sample t-test
## 
## data:  sim_cbbTM and sim_cbbTm
## t = 1.7079, df = 198, p-value = 0.08922
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.1303769  1.8165672
## sample estimates:
## mean of x mean of y 
##  5.318356  4.475261

Sometimes the p-value from the t-test is greater than 0.05 and other times it is less than 0.05. Therefore, the CBB sample data is inconclusive.

CFB

cfbTM <- subset(trumpM, team == 'CFB')
cfbTm <- subset(trumpm, team == 'CFB')

par(mfrow = c(2, 2))
hist(cfbTM$searchPercent, xlim = c(0, 50), main = "CFB Search Percent Histogram", xlab = "CFB Search Percent Trump Majority")
hist(cfbTm$searchPercent, xlim = c(0, 50), main = "CFB Search Percent Histogram", xlab = "CFB Search Percent Trump Minority")

qqnorm(cfbTM$searchPercent)
qqline(cfbTM$searchPercent)

qqnorm(cfbTm$searchPercent)
qqline(cfbTm$searchPercent)

shapiro.test(cfbTM$searchPercent)

## 
##  Shapiro-Wilk normality test
## 
## data:  cfbTM$searchPercent
## W = 0.91346, p-value = 2.185e-07

shapiro.test(cfbTm$searchPercent)

## 
##  Shapiro-Wilk normality test
## 
## data:  cfbTm$searchPercent
## W = 0.84684, p-value = 6.184e-07

#tails skewed, not normal

Sample data is not normal so we will run a simulation

#Simulation
cfbTMmean <- mean(cfbTM$searchPercent)
cfbTMsd <- sd(cfbTM$searchPercent)

cfbTmMean <- mean(cfbTm$searchPercent)
cfbTmSd <- sd(cfbTm$searchPercent)

sim_cfbTM <- rnorm(n = 100, mean = cfbTMmean, sd = cfbTMsd)
sim_cfbTm <- rnorm(n = 100, mean = cfbTmMean, sd = cfbTmSd)

par(mfrow = c(2, 2))
hist(sim_cfbTM, main = "CFB Search Percent Histogram", xlab = "CBB Search Percent Trump Majority")
hist(sim_cfbTm, main = "CFB Search Percent Histogram", xlab = "CBB Search Percent Trump Minority")

qqnorm(sim_cfbTM)
qqline(sim_cfbTM)

qqnorm(sim_cfbTm)
qqline(sim_cfbTm)

shapiro.test(sim_cfbTM)

## 
##  Shapiro-Wilk normality test
## 
## data:  sim_cfbTM
## W = 0.98676, p-value = 0.4214

shapiro.test(sim_cfbTm)

## 
##  Shapiro-Wilk normality test
## 
## data:  sim_cfbTm
## W = 0.99183, p-value = 0.8087

#simulation data is normal

#Perform two sample t-test
res <- t.test(sim_cfbTM, sim_cfbTm, var.equal = TRUE)
res

## 
##  Two Sample t-test
## 
## data:  sim_cfbTM and sim_cfbTm
## t = 3.4469, df = 198, p-value = 0.0006925
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  0.9210949 3.3842747
## sample estimates:
## mean of x mean of y 
##  9.216064  7.063379

The t-test p-value is less than 0.05, so we reject the null hypothesis in favor of the alternative and conclude that the difference in the two sample means is not equal to zero. Thus, there is an association between CFB fan base and 2016 political affiliations.

MLB

mlbTM <- subset(trumpM, team == 'MLB')
mlbTm <- subset(trumpm, team == 'MLB')

par(mfrow = c(2, 2))
hist(mlbTM$searchPercent, xlim = c(0, 50), main = "MLB Search Percent Histogram", xlab = "MLB Search Percent Trump Majority")
hist(mlbTm$searchPercent, xlim = c(0, 50), main = "MLB Search Percent Histogram", xlab = "MLB Search Percent Trump Minority")

qqnorm(mlbTM$searchPercent)
qqline(mlbTM$searchPercent)

qqnorm(mlbTm$searchPercent)
qqline(mlbTm$searchPercent)

shapiro.test(mlbTM$searchPercent)

## 
##  Shapiro-Wilk normality test
## 
## data:  mlbTM$searchPercent
## W = 0.90523, p-value = 7.322e-08

shapiro.test(mlbTm$searchPercent)

## 
##  Shapiro-Wilk normality test
## 
## data:  mlbTm$searchPercent
## W = 0.96941, p-value = 0.08855

Sample data is not normal

#Simulation
mlbTMmean <- mean(mlbTM$searchPercent)
mlbTMsd <- sd(mlbTM$searchPercent)

mlbTmMean <- mean(mlbTm$searchPercent)
mlbTmSd <- sd(mlbTm$searchPercent)

sim_mlbTM <- rnorm(n = 100, mean = mlbTMmean, sd = mlbTMsd)
sim_mlbTm <- rnorm(n = 100, mean = mlbTmMean, sd = mlbTmSd)

par(mfrow = c(2, 2))
hist(sim_mlbTM, main = "CFB Search Percent Histogram", xlab = "MLB Search Percent Trump Majority")
hist(sim_mlbTm, main = "CFB Search Percent Histogram", xlab = "MLB Search Percent Trump Minority")

qqnorm(sim_mlbTM)
qqline(sim_mlbTM)

qqnorm(sim_mlbTm)
qqline(sim_mlbTm)

shapiro.test(sim_mlbTM)

## 
##  Shapiro-Wilk normality test
## 
## data:  sim_mlbTM
## W = 0.97968, p-value = 0.1255

shapiro.test(sim_mlbTm)

## 
##  Shapiro-Wilk normality test
## 
## data:  sim_mlbTm
## W = 0.99073, p-value = 0.7238

#simulation data is normal

#Perform two sample t-test
res <- t.test(sim_mlbTM, sim_mlbTm, var.equal = TRUE)
res

## 
##  Two Sample t-test
## 
## data:  sim_mlbTM and sim_mlbTm
## t = -1.6573, df = 198, p-value = 0.09904
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -1.7286542  0.1499017
## sample estimates:
## mean of x mean of y 
##  13.35736  14.14674

Sometimes the p-value is greater than 0.05 and other times it is less than 0.05. Therefore, the MLB data is inconclusive.

NASCAR

nascarTM <- subset(trumpM, team == 'NASCAR')
nascarTm <- subset(trumpm, team == 'NASCAR')

par(mfrow = c(2, 2))
hist(nascarTM$searchPercent, xlim = c(0, 50), main = "NASCAR Search Percent Histogram", xlab = "NASCAR Search Percent Trump Majority")
hist(nascarTm$searchPercent, xlim = c(0, 50), main = "NASCAR Search Percent Histogram", xlab = "NASCAR Search Percent Trump Minority")

qqnorm(nascarTM$searchPercent)
qqline(nascarTM$searchPercent)

qqnorm(nascarTm$searchPercent)
qqline(nascarTm$searchPercent)

shapiro.test(nascarTM$searchPercent)

## 
##  Shapiro-Wilk normality test
## 
## data:  nascarTM$searchPercent
## W = 0.91418, p-value = 2.412e-07

shapiro.test(nascarTm$searchPercent)

## 
##  Shapiro-Wilk normality test
## 
## data:  nascarTm$searchPercent
## W = 0.91533, p-value = 0.0001841

#Nascar data heavily right skewed

Sample data is not normal so we will run a simulation

#Simulation
nascarTMmean <- mean(nascarTM$searchPercent)
nascarTMsd <- sd(nascarTM$searchPercent)

nascarTmMean <- mean(nascarTm$searchPercent)
nascarTmSd <- sd(nascarTm$searchPercent)

par(mfrow = c(2, 2))
sim_nascarTM <- rnorm(n = 100, mean = nascarTMmean, sd = nascarTMsd)
sim_nascarTm <- rnorm(n = 100, mean = nascarTmMean, sd = nascarTmSd)

hist(sim_nascarTM, main = "NASCAR Search Percent Histogram", xlab = "NASCAR Search Percent Trump Majority")
hist(sim_nascarTm, main = "NASCAR Search Percent Histogram", xlab = "NASCAR Search Percent Trump Minority")

qqnorm(sim_nascarTM)
qqline(sim_nascarTM)

qqnorm(sim_nascarTm)
qqline(sim_nascarTm)

shapiro.test(sim_nascarTM)

## 
##  Shapiro-Wilk normality test
## 
## data:  sim_nascarTM
## W = 0.98799, p-value = 0.5076

shapiro.test(sim_nascarTm)

## 
##  Shapiro-Wilk normality test
## 
## data:  sim_nascarTm
## W = 0.98556, p-value = 0.3483

#simulation data is normal

#Perform two sample t-test
res <- t.test(sim_nascarTM, sim_nascarTm, var.equal = TRUE)
res

## 
##  Two Sample t-test
## 
## data:  sim_nascarTM and sim_nascarTm
## t = 8.3062, df = 198, p-value = 1.539e-14
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  1.686051 2.735882
## sample estimates:
## mean of x mean of y 
##  5.754701  3.543734

Since the p-value is less than 0.05, we reject the null hypothesis in favor of the alternative and conclude that the difference in the two sample means is not equal to zero. Thus, there is an association between NASCAR fan base and 2016 political affiliations.

NBA

nbaTM <- subset(trumpM, team == 'NBA')
nbaTm <- subset(trumpm, team == 'NBA')

par(mfrow = c(2, 2))
hist(nbaTM$searchPercent, xlim = c(0, 50), main = "NBA Search Percent Histogram", xlab = "NBA Search Percent Trump Majority")
hist(nbaTm$searchPercent, xlim = c(0, 50), main = "NBA Search Percent Histogram", xlab = "NBA Search Percent Trump Minority")

qqnorm(nbaTM$searchPercent)
qqline(nbaTM$searchPercent)

qqnorm(nbaTm$searchPercent)
qqline(nbaTm$searchPercent)

shapiro.test(nbaTM$searchPercent)

## 
##  Shapiro-Wilk normality test
## 
## data:  nbaTM$searchPercent
## W = 0.96921, p-value = 0.003271

shapiro.test(nbaTm$searchPercent)

## 
##  Shapiro-Wilk normality test
## 
## data:  nbaTm$searchPercent
## W = 0.95835, p-value = 0.02169

#NBA data does not meet normality assumption

Sample data is not normal so we will run a simulation

#Simulation
nbaTMmean <- mean(nbaTM$searchPercent)
nbaTMsd <- sd(nbaTM$searchPercent)

nbaTmMean <- mean(nbaTm$searchPercent)
nbaTmSd <- sd(nbaTm$searchPercent)

sim_nbaTM <- rnorm(n = 100, mean = nbaTMmean, sd = nbaTMsd)
sim_nbaTm <- rnorm(n = 100, mean = nbaTmMean, sd = nbaTmSd)

par(mfrow = c(2, 2))
hist(sim_nbaTM, main = "NBA Search Percent Histogram", xlab = "NBA Search Percent Trump Majority")
hist(sim_nbaTm, main = "NBA Search Percent Histogram", xlab = "NBA Search Percent Trump Minority")

qqnorm(sim_nbaTM)
qqline(sim_nbaTM)

qqnorm(sim_nbaTm)
qqline(sim_nbaTm)

shapiro.test(sim_nbaTM)

## 
##  Shapiro-Wilk normality test
## 
## data:  sim_nbaTM
## W = 0.98528, p-value = 0.3325

shapiro.test(sim_nbaTm)

## 
##  Shapiro-Wilk normality test
## 
## data:  sim_nbaTm
## W = 0.9906, p-value = 0.7127

#Simulation data is normal

#Perform two sample t-test
res <- t.test(sim_nbaTM, sim_nbaTm, var.equal = TRUE)
res

## 
##  Two Sample t-test
## 
## data:  sim_nbaTM and sim_nbaTm
## t = -5.0927, df = 198, p-value = 8.196e-07
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -5.281525 -2.332996
## sample estimates:
## mean of x mean of y 
##  21.77668  25.58394

The p-value is less than 0.05, so we reject the null hypothesis in favor of the alternative. The difference in the two sample means is not equal to zero. Thus, there is an association between NBA fan base and 2016 political affiliations.

NFL

nflTM <- subset(trumpM, team == 'NFL')
nflTm <- subset(trumpm, team == 'NFL')

par(mfrow = c(2, 2))
hist(nflTM$searchPercent, xlim = c(0, 50), main = "NFL Search Percent Histogram", xlab = "NFL Search Percent Trump Majority")
hist(nflTm$searchPercent, xlim = c(0, 50), main = "NFL Search Percent Histogram", xlab = "NFL Search Percent Trump Minority")

qqnorm(nflTM$searchPercent)
qqline(nflTM$searchPercent)

qqnorm(nflTm$searchPercent)
qqline(nflTm$searchPercent)

shapiro.test(nflTM$searchPercent)

## 
##  Shapiro-Wilk normality test
## 
## data:  nflTM$searchPercent
## W = 0.992, p-value = 0.6267

shapiro.test(nflTm$searchPercent)

## 
##  Shapiro-Wilk normality test
## 
## data:  nflTm$searchPercent
## W = 0.9799, p-value = 0.3312

#nfl sample data is normal

NFL sample data is normal

#Perform two sample t-test
res <- t.test(nflTM$searchPercent, nflTm$searchPercent, var.equal = TRUE)
res

## 
##  Two Sample t-test
## 
## data:  nflTM$searchPercent and nflTm$searchPercent
## t = -0.14482, df = 205, p-value = 0.885
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -2.012093  1.736730
## sample estimates:
## mean of x mean of y 
##  39.05072  39.18841

The p-value is greater than 0.05, so we fail to reject the null hypothesis and conclude that the difference in the two sample means is equal to zero. Thus, there is no association between NFL fan base and 2016 political affiliations.

Run Simulation:

#Simulation
nflTMmean <- mean(nflTM$searchPercent)
nflTMsd <- sd(nflTM$searchPercent)

nflTmMean <- mean(nflTm$searchPercent)
nflTmSd <- sd(nflTm$searchPercent)

sim_nflTM <- rnorm(n = 100, mean = nflTMmean, sd = nflTMsd)
sim_nflTm <- rnorm(n = 100, mean = nflTmMean, sd = nflTmSd)

par(mfrow = c(2, 2))
hist(sim_nflTM, main = "NFL Search Percent Histogram", xlab = "NFL Search Percent Trump Majority")
hist(sim_nflTm, main = "NFL Search Percent Histogram", xlab = "NFL Search Percent Trump Minority")

qqnorm(sim_nflTM)
qqline(sim_nflTM)

qqnorm(sim_nflTm)
qqline(sim_nflTm)

shapiro.test(sim_nflTM)

## 
##  Shapiro-Wilk normality test
## 
## data:  sim_nflTM
## W = 0.98991, p-value = 0.657

shapiro.test(sim_nflTm)

## 
##  Shapiro-Wilk normality test
## 
## data:  sim_nflTm
## W = 0.99496, p-value = 0.9741

#simulation data is normal

#Perform two sample t-test
res <- t.test(sim_nflTM, sim_nflTm, var.equal = TRUE)
res

## 
##  Two Sample t-test
## 
## data:  sim_nflTM and sim_nflTm
## t = 0.69219, df = 198, p-value = 0.4896
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -1.087075  2.262956
## sample estimates:
## mean of x mean of y 
##  39.19896  38.61102

NHL

#Create subset
nhl <- subset(fullDf, team == 'NHL')
nhlTM <- subset(trumpM, team == 'NHL')
nhlTm <- subset(trumpm, team == 'NHL')

par(mfrow = c(2, 2))
hist(nhlTM$searchPercent, xlim = c(0, 50), main = "NHL Search Percent Histogram", xlab = "NHL Search Percent Trump Majority")
hist(nhlTm$searchPercent, xlim = c(0, 50), main = "NHL Search Percent Histogram", xlab = "NHL Search Percent Trump Minority")
#Both plots are right skewed

qqnorm(nhlTM$searchPercent)
qqline(nhlTM$searchPercent)

qqnorm(nhlTm$searchPercent)
qqline(nhlTm$searchPercent)

#Both plots have variation in the tails

shapiro.test(nhlTM$searchPercent)

## 
##  Shapiro-Wilk normality test
## 
## data:  nhlTM$searchPercent
## W = 0.78445, p-value = 5.728e-13

shapiro.test(nhlTm$searchPercent)

## 
##  Shapiro-Wilk normality test
## 
## data:  nhlTm$searchPercent
## W = 0.90079, p-value = 4.688e-05

#NHL data does not meet normality assumption

Data is not normal so we will run a simulation

nhlTMmean <- mean(nhlTM$searchPercent)
nhlTMsd <- sd(nhlTM$searchPercent)

nhlTmMean <- mean(nhlTm$searchPercent)
nhlTmSd <- sd(nhlTm$searchPercent)

#Generate simulation data
sim_nhlTM <- rnorm(n = 100, mean = nhlTMmean, sd = nhlTMsd)
sim_nhlTm <- rnorm(n = 100, mean = nhlTmMean, sd = nhlTmSd)

par(mfrow = c(2, 2))
hist(sim_nhlTM, main = "NHL Search Percent Histogram", xlab = "NHL Search Percent Trump Majority")
hist(sim_nhlTm, main = "NHL Search Percent Histogram", xlab = "NHL Search Percent Trump Minority")

qqnorm(sim_nhlTM)
qqline(sim_nhlTM)

qqnorm(sim_nhlTm)
qqline(sim_nhlTm)

shapiro.test(sim_nhlTM)

## 
##  Shapiro-Wilk normality test
## 
## data:  sim_nhlTM
## W = 0.99028, p-value = 0.6873

shapiro.test(sim_nhlTm)

## 
##  Shapiro-Wilk normality test
## 
## data:  sim_nhlTm
## W = 0.97992, p-value = 0.1312

#Simulation data is normal

#Perform two sample t-test
res <- t.test(sim_nhlTM, sim_nhlTm, var.equal = TRUE)
res #0.009932

## 
##  Two Sample t-test
## 
## data:  sim_nhlTM and sim_nhlTm
## t = -2.308, df = 198, p-value = 0.02203
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -2.1059272 -0.1653022
## sample estimates:
## mean of x mean of y 
##  4.716346  5.851961

Sometimes the p-value is greater than 0.05 and other times it is less than 0.05. Therefore, the NHL sample data is inconclusive.

Part 5 - Conclusion

This analysis examined whether there is an association between major professional sports league fans and 2016 political affiliations. Two sample t-test results support that there is an association between 2016 political affiliations and the following teams: CFB, NASCAR, and NBA. NBA search traffic was higher among Clinton supporters while CFB and NASCAR search traffic were higher among Trump supporters. Since the t-tests were inconclusive for CBB, MLB, and NHL simulation data, there is no association between their fan bases and 2016 political affiliations.

Regarding my research question, I learned that there are associations between some major professional sports league fan bases and 2016 political affiliations. The NFL had the largest and least partisan fan base. This makes sense because it is the most popular sports league of the seven.

Future research could investigate the same question with historical U.S. presidential election data to see if similar results are produced. It would also be interesting to analyze the geographic locations of all the teams in each league. Having more teams based in prominently Republican or Democratic areas may affect the leagues’ fan base.

References

https://github.com/fivethirtyeight/data/blob/master/nfl-fandom/NFL_fandom_data-google_trends.csv

https://fivethirtyeight.com/features/how-every-nfl-teams-fans-lean-politically/

DATA 606 Final Project

Katherine Evers

4/22/2019

Part 1 - Introduction

Part 2 - Data

Part 3 - Exploratory data analysis

Load Libraries

Import and Clean Data

Summary Statistics

Data Visualizations

Part 4 - Inference

CBB

CFB

MLB

NASCAR

NBA

NFL

NHL

Part 5 - Conclusion

References