DATA 606 Final Project

Abstract

The FIFA World Cup is the largest sporting event in the world, in terms of television viewership. With over 5 billion viewers per tournament, there is significant interest in the event internationally. UEFA, the European football (also called soccer) confederation is often considered the best confederation, both in terms of quality of play and popularity. While the success of European teams in international competitions is unquestioned, I wanted to explore the notion that Europeans watch the game more than other continents. To examine this, I used 2010 World Cup TV viewership data collected by FiveThirtyEight. I used a statistical t-test to compare the mean viewership (both GDP-weighted and raw viewership) between European and non-european countries. As it turns out,there was a statistically significant difference in the mean GDP-weighted viewership share for that world cup. In addition, I ran an ANOVA to test whether or not there was a difference in the mean viewership between FIFA confederations. This test also produced results that would allow us to reject the null hypothesis that there is no difference in World Cup viewership between confederations. All told, there are various inputs that impact the TV viewership of a country (population, GDP, popularity of a sport). For the time being, UEFA is still the largest viewership bloc, but the increased popularity of the game globally may challenge the hegemony of the European confederation.

Introduction

Soccer (also called football internationally), is the most popular sport in the world. The FIFA World Cup is the most-watched sporting event in the world. UEFA, the European confederation for soccer, is generally considered the gold standard of federations. For instance, the UEFA Champions League is considered the highest level of soccer in the world, pitting the best teams in Europe’s top leagues against each other. With Europe claiming soccer superiority, does viewership reflect this?

Research Question: Is there a significant difference in the mean viewership of soccer in Europe vs another continent (confederation)?

library(ggplot2)
library(dplyr)
library(maps)
library(stringr)

Data

Our TV viewership data for the 2010 World Cup was collected into a csv file by FiveThirtyEight. The original csv file lives in their GitHub repo here. The original file contains 5 fields: - country - confederation - tv_audience_share - population_share - gdp_weighted_share

We’ll read in the csv into a dataframe from the original GitHub repo:

# load data
data_url <- "https://raw.githubusercontent.com/fivethirtyeight/data/master/fifa/fifa_countries_audience.csv"

fifa <- read.csv(data_url)

Exploratory Data Analysis

Adding in a cleaner continent column based on a country’s confederation. These loosely represent the geographic location of FIFA confederations

# Adding continent column for cleaner plots/readability
confederation <- c("CONCACAF", "UEFA", "CONMEBOL", "AFC", "CAF", "OFC")
continent <- c("North America", "Europe", "South America", "Asia", "Africa", "Oceania")
continent_map <- data.frame(confederation, continent)

fifa <- merge(fifa, continent_map)

We’ll need to look at the gdp_weighted_share variable in our dataframe as that accounts for the population/GDP of a country. First, to compare the viewership of European countries vs that of non-European countries, we’ll need to group our data into European vs non-European nations.

fifa <- fifa %>%
              mutate(european = ifelse(fifa$confederation == "UEFA", TRUE, FALSE))

Let’s take a look at some of the summary stats for our dataset via the summary method

summary(fifa)

##  confederation        country          population_share  tv_audience_share
##  Length:191         Length:191         Min.   : 0.0000   Min.   : 0.000   
##  Class :character   Class :character   1st Qu.: 0.0000   1st Qu.: 0.000   
##  Mode  :character   Mode  :character   Median : 0.1000   Median : 0.100   
##                                        Mean   : 0.5225   Mean   : 0.523   
##                                        3rd Qu.: 0.3500   3rd Qu.: 0.300   
##                                        Max.   :19.5000   Max.   :14.800   
##  gdp_weighted_share  continent          european      
##  Min.   : 0.0000    Length:191         Mode :logical  
##  1st Qu.: 0.0000    Class :character   FALSE:145      
##  Median : 0.0000    Mode  :character   TRUE :46       
##  Mean   : 0.5204                                      
##  3rd Qu.: 0.3000                                      
##  Max.   :11.3000

Data Visualization

Adding some plots to paint a better picture of our data. Let’s start with the gdp_weighted_share variable

# GDP weighted viewership share
ggplot(fifa, aes(x = gdp_weighted_share)) + geom_histogram()

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

ggplot(fifa, aes(x = gdp_weighted_share, fill = confederation)) + 
  geom_histogram() + 
  facet_grid(confederation ~ .)

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

# population share
ggplot(fifa, aes(x = population_share)) + geom_histogram()

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

# Also plotting by confederation
ggplot(fifa, aes(x = population_share, fill = confederation)) + 
  geom_histogram() + 
  facet_grid(confederation ~ .)

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

# TV audience share
ggplot(fifa, aes(x = tv_audience_share)) + geom_histogram()

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

# Also plot by confederation
ggplot(fifa, aes(x = tv_audience_share, fill = confederation)) + 
  geom_histogram() + 
  facet_grid(confederation ~ .)

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

One thing that’d be interested to see is the world map with countries colored by their tv_audience_share values. We’ll need to grab latitude and longitudinal data for each country to create a choropleth map. We can use the maps library and the map_data function to grab a listing of countries with their coordinates. From there we can create choropleths with colormaps representing the metrics we care about (gdo_weighted share, tv_audience_share, etc).

world_map <- map_data("world") %>%
              rename("country" = "region")

# USA named differently in each dataset, replacing to standardize column values (same wqith UK)
world_map$country <- str_replace(world_map$country, "USA", "United States")
world_map$country <- str_replace(world_map$country, "UK", "United Kingdom")

# Joining our world_map geographic data to our World Cup viewership data by country
fifa_coords <- world_map %>% 
  left_join(fifa, by = c("country"))

head(fifa_coords)

##        long      lat group order country subregion confederation
## 1 -69.89912 12.45200     1     1   Aruba      <NA>      CONCACAF
## 2 -69.89571 12.42300     1     2   Aruba      <NA>      CONCACAF
## 3 -69.94219 12.43853     1     3   Aruba      <NA>      CONCACAF
## 4 -70.00415 12.50049     1     4   Aruba      <NA>      CONCACAF
## 5 -70.06612 12.54697     1     5   Aruba      <NA>      CONCACAF
## 6 -70.05088 12.59707     1     6   Aruba      <NA>      CONCACAF
##   population_share tv_audience_share gdp_weighted_share     continent european
## 1                0                 0                  0 North America    FALSE
## 2                0                 0                  0 North America    FALSE
## 3                0                 0                  0 North America    FALSE
## 4                0                 0                  0 North America    FALSE
## 5                0                 0                  0 North America    FALSE
## 6                0                 0                  0 North America    FALSE

First, let’s take a look at the FIFA Confederation membership of each country

ggplot(fifa_coords, aes(long, lat, group = group)) +
  geom_polygon(aes( group=group, fill=confederation)) + 
  ggtitle("FIFA Confederation memberdship by country") +
  xlab("Longitude (deg)") + ylab("Latitude (deg)")

library(viridis)

## Loading required package: viridisLite

## 
## Attaching package: 'viridis'

## The following object is masked from 'package:maps':
## 
##     unemp

# Plotting TV Audience share by country
ggplot(fifa_coords, aes(long, lat, group = group)) +
  geom_polygon(aes( group=group, fill=tv_audience_share)) + 
  ggtitle("TV Audience share (%) of world cup viewership by country") +
  xlab("Longitude (deg)") + ylab("Latitude (deg)") + scale_fill_viridis()

ggplot(fifa_coords, aes(long, lat, group = group)) +
  geom_polygon(aes( group=group, fill=gdp_weighted_share)) + 
  ggtitle("GDP-weighted share (%) of world cup viewership by country") +
  xlab("Longitude (deg)") + ylab("Latitude (deg)") + scale_fill_viridis()

Inference

Let’s run our t-test comparing the gdp-adjusted viewership for european vs non-european countries’. We’ll be conducting this test with a significance level \(\alpha = 0.05\). Our null and alternative hypotheses are listed below:

\(H_0\): The mean GDP-weighted viewership share of the world cup in Europe is not higher than that of other confederations
\(H_a\): The mean GDP-weighter viewership share of the world cup is higher than that of other confederations

# Filtering into our two groups: Europe vs not Europe.
europe <- fifa %>% filter(european== TRUE)
other_countries <- fifa %>% filter(european == FALSE)

# Running one-tailed t-test using R built-in
t.test(europe$gdp_weighted_share, other_countries$gdp_weighted_share, alternative="greater")

## 
##  Welch Two Sample t-test
## 
## data:  europe$gdp_weighted_share and other_countries$gdp_weighted_share
## t = 1.7859, df = 77.677, p-value = 0.03901
## alternative hypothesis: true difference in means is greater than 0
## 95 percent confidence interval:
##  0.02926207        Inf
## sample estimates:
## mean of x mean of y 
## 0.8478261 0.4165517

Since our p-value is less than \(\alpha = 0.05\), we can reject the null hypothesis and claim that GDP-adjusted average world cup viewership in Europe is higher than that of non-European countries.

Let’s also run an inference test on our tv_audience_share variable (\(\alpha = 0.05\)) to see if there’s a significant difference between European and non-European countries. We’ll start with our hypotheses: - \(H_0\): The mean tv audience viewership share of the world cup in Europe is not higher than that of other confederations - \(H_a\): The mean tv audience viewership share of the world cup is higher than that of other confederations

# Running one-tailed t-test using R built-in
t.test(europe$tv_audience_share, other_countries$tv_audience_share, alternative="greater")

## 
##  Welch Two Sample t-test
## 
## data:  europe$tv_audience_share and other_countries$tv_audience_share
## t = 0.18208, df = 151.44, p-value = 0.4279
## alternative hypothesis: true difference in means is greater than 0
## 95 percent confidence interval:
##  -0.2641503        Inf
## sample estimates:
## mean of x mean of y 
## 0.5478261 0.5151724

Our p-value of \(0.4279 > \alpha\) indicates that we cannot reject the null hypothesis in this instance. There is no statistically significant different in raw TV audience viewership share between European and non-European countries

ANOVA Between Confederations

Lastly, it’d be interesting to run an ANOVA analysis on tv_audience_share between Confederations (not just Europe vs Non-Europe) to see if there’s any significant differences in World Cup viewership between confederations.

\(H_0\): There is no difference in mean tv audience World Cup viewership share between confederations
\(H_a\): The mean tv audience viewership share of the world cup is different between confederations

We will conduct our ANOVA testing at a 5% significance level (\(\alpha = 0.05\))

confederation_test <- aov(tv_audience_share ~ confederation, data = fifa)

summary(confederation_test)

##                Df Sum Sq Mean Sq F value Pr(>F)  
## confederation   5   26.7   5.333   2.648 0.0244 *
## Residuals     185  372.6   2.014                 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Since our p-value (0.0244) is less than \(\alpha = 0.05\), we can reject the null hypothesis and assert that there is a statistically significant different in mean TV audience viewership between FIFA confederations.

References

(Nate Silver 2015)

Nate Silver. 2015. How to Break FIFA. FiveThirtyEight. https://fivethirtyeight.com/features/how-to-break-fifa/.