The FIFA World Cup is the largest sporting event in the world in terms of television viewership. With over 5 billion viewers per tournament, there is significant international interest in the event. UEFA, the European football (also called soccer) confederation, is often considered the best confederation, both in quality of play and in popularity. While the success of European teams in international competitions is unquestioned, I wanted to explore the notion that Europeans watch the game more than people on other continents do. To examine this, I used 2010 World Cup TV viewership data collected by FiveThirtyEight. I used a t-test to compare mean viewership (both GDP-weighted and raw) between European and non-European countries. As it turns out, there was a statistically significant difference in the mean GDP-weighted viewership share for that World Cup, but not in the raw TV audience share. In addition, I ran an ANOVA to test whether mean viewership differs between FIFA confederations. This test also produced results that allow us to reject the null hypothesis that there is no difference in World Cup viewership between confederations. All told, various inputs affect a country's TV viewership (population, GDP, popularity of the sport). For the time being, UEFA is still the largest viewership bloc, but the increased popularity of the game globally may challenge the hegemony of the European confederation.
Soccer (also called football internationally) is the most popular sport in the world. The FIFA World Cup is the most-watched sporting event in the world. UEFA, the European confederation for soccer, is generally considered the gold standard of confederations. For instance, the UEFA Champions League is considered the highest level of soccer in the world, pitting the best teams in Europe’s top leagues against each other. With Europe claiming soccer superiority, does viewership reflect this?
Research Question: Is there a significant difference in mean World Cup viewership between Europe and the other continents (confederations)?
library(ggplot2)
library(dplyr)
library(maps)
library(stringr)
Our TV viewership data for the 2010 World Cup was collected into a
csv file by FiveThirtyEight. The original csv file lives in their GitHub
repo here.
The original file contains 5 fields:
- country
- confederation
- tv_audience_share
- population_share
- gdp_weighted_share
We’ll read the CSV from the original GitHub repo into a data frame:
# load data
data_url <- "https://raw.githubusercontent.com/fivethirtyeight/data/master/fifa/fifa_countries_audience.csv"
fifa <- read.csv(data_url)
Next, we add a cleaner continent column based on each
country’s confederation. The continents only loosely represent the
geographic reach of the FIFA confederations.
# Adding continent column for cleaner plots/readability
confederation <- c("CONCACAF", "UEFA", "CONMEBOL", "AFC", "CAF", "OFC")
continent <- c("North America", "Europe", "South America", "Asia", "Africa", "Oceania")
continent_map <- data.frame(confederation, continent)
fifa <- merge(fifa, continent_map)
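The `merge` call above joins on whatever column names the two data frames share, which here is only `confederation`. As a minimal sketch of that behavior, the example below uses a small made-up table (the `toy` rows are illustrative, not taken from the FiveThirtyEight file) and names the key explicitly. Because `merge` performs an inner join by default, a confederation missing from the lookup would silently drop rows, so the sketch also compares row counts.

```r
# Toy stand-in for the viewership data (rows are made up for illustration)
toy <- data.frame(
  country = c("Brazil", "Germany", "Japan"),
  confederation = c("CONMEBOL", "UEFA", "AFC"),
  stringsAsFactors = FALSE
)

# Same confederation-to-continent lookup as in the analysis
lookup <- data.frame(
  confederation = c("CONCACAF", "UEFA", "CONMEBOL", "AFC", "CAF", "OFC"),
  continent = c("North America", "Europe", "South America", "Asia", "Africa", "Oceania"),
  stringsAsFactors = FALSE
)

# Naming the key documents the join and avoids accidental matches
# if another shared column is ever added to either table
merged <- merge(toy, lookup, by = "confederation")

# merge() is an inner join by default, so unmatched rows would vanish;
# comparing row counts is a cheap way to notice that
rows_kept <- nrow(merged) == nrow(toy)
print(merged)
print(rows_kept)
```

The same row-count comparison could be applied to `fifa` before and after the merge.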
We’ll need to look at the gdp_weighted_share variable in
our dataframe as that accounts for the population/GDP of a country.
First, to compare the viewership of European countries vs that of
non-European countries, we’ll need to group our data into European vs
non-European nations.
# Flag UEFA members; the comparison already yields TRUE/FALSE
fifa <- fifa %>%
  mutate(european = confederation == "UEFA")
Let’s take a look at some of the summary stats for our dataset via
the summary function.
summary(fifa)
## confederation country population_share tv_audience_share
## Length:191 Length:191 Min. : 0.0000 Min. : 0.000
## Class :character Class :character 1st Qu.: 0.0000 1st Qu.: 0.000
## Mode :character Mode :character Median : 0.1000 Median : 0.100
## Mean : 0.5225 Mean : 0.523
## 3rd Qu.: 0.3500 3rd Qu.: 0.300
## Max. :19.5000 Max. :14.800
## gdp_weighted_share continent european
## Min. : 0.0000 Length:191 Mode :logical
## 1st Qu.: 0.0000 Class :character FALSE:145
## Median : 0.0000 Mode :character TRUE :46
## Mean : 0.5204
## 3rd Qu.: 0.3000
## Max. :11.3000
Let’s add some plots to paint a better picture of our data, starting
with the gdp_weighted_share variable.
# GDP weighted viewership share
ggplot(fifa, aes(x = gdp_weighted_share)) + geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
ggplot(fifa, aes(x = gdp_weighted_share, fill = confederation)) +
  geom_histogram() +
  facet_grid(confederation ~ .)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
# population share
ggplot(fifa, aes(x = population_share)) + geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
# Also plotting by confederation
ggplot(fifa, aes(x = population_share, fill = confederation)) +
  geom_histogram() +
  facet_grid(confederation ~ .)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
# TV audience share
ggplot(fifa, aes(x = tv_audience_share)) + geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
# Also plot by confederation
ggplot(fifa, aes(x = tv_audience_share, fill = confederation)) +
  geom_histogram() +
  facet_grid(confederation ~ .)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
One thing that’d be interesting to see is the world map with countries
colored by their tv_audience_share values. We’ll need to
grab latitude and longitude data for each country to create a
choropleth map. We can use the maps library and the
map_data function to grab a listing of countries with their
coordinates. From there we can create choropleths with colormaps
representing the metrics we care about (gdp_weighted_share,
tv_audience_share, etc.).
world_map <- map_data("world") %>%
  rename("country" = "region")
# The USA and the UK are named differently in each dataset; replacing to standardize column values
world_map$country <- str_replace(world_map$country, "USA", "United States")
world_map$country <- str_replace(world_map$country, "UK", "United Kingdom")
# Joining our world_map geographic data to our World Cup viewership data by country
fifa_coords <- world_map %>%
  left_join(fifa, by = c("country"))
head(fifa_coords)
## long lat group order country subregion confederation
## 1 -69.89912 12.45200 1 1 Aruba <NA> CONCACAF
## 2 -69.89571 12.42300 1 2 Aruba <NA> CONCACAF
## 3 -69.94219 12.43853 1 3 Aruba <NA> CONCACAF
## 4 -70.00415 12.50049 1 4 Aruba <NA> CONCACAF
## 5 -70.06612 12.54697 1 5 Aruba <NA> CONCACAF
## 6 -70.05088 12.59707 1 6 Aruba <NA> CONCACAF
## population_share tv_audience_share gdp_weighted_share continent european
## 1 0 0 0 North America FALSE
## 2 0 0 0 North America FALSE
## 3 0 0 0 North America FALSE
## 4 0 0 0 North America FALSE
## 5 0 0 0 North America FALSE
## 6 0 0 0 North America FALSE
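The USA and UK renames above handle two known naming differences, but any other country whose name differs between the two sources would come out of the left join with missing viewership values. A minimal sketch of one way to look for such cases is shown below. The name vectors are hypothetical stand-ins, not values taken from either dataset, and base R `sub` with anchored patterns is used in place of `str_replace` only to keep the example self-contained.

```r
# Hypothetical country names standing in for the two sources
viewership_names <- c("United States", "United Kingdom", "Brazil", "Czech Republic")
map_names <- c("USA", "UK", "Brazil", "Czechia", "Aruba")

# Apply the same two renames, anchored so only exact matches change
map_names <- sub("^USA$", "United States", map_names)
map_names <- sub("^UK$", "United Kingdom", map_names)

# Viewership countries that still have no counterpart in the map data
unmatched <- setdiff(viewership_names, map_names)
print(unmatched)
```

In the actual workflow the equivalent check would be `setdiff(unique(fifa$country), unique(world_map$country))`, run after the renames.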
First, let’s take a look at the FIFA confederation membership of each country.
ggplot(fifa_coords, aes(long, lat, group = group)) +
  geom_polygon(aes(fill = confederation)) +
  ggtitle("FIFA Confederation membership by country") +
  xlab("Longitude (deg)") + ylab("Latitude (deg)")
library(viridis)
## Loading required package: viridisLite
##
## Attaching package: 'viridis'
## The following object is masked from 'package:maps':
##
## unemp
# Plotting TV audience share by country
ggplot(fifa_coords, aes(long, lat, group = group)) +
  geom_polygon(aes(fill = tv_audience_share)) +
  ggtitle("TV audience share (%) of World Cup viewership by country") +
  xlab("Longitude (deg)") + ylab("Latitude (deg)") + scale_fill_viridis()
# Plotting GDP-weighted share by country
ggplot(fifa_coords, aes(long, lat, group = group)) +
  geom_polygon(aes(fill = gdp_weighted_share)) +
  ggtitle("GDP-weighted share (%) of World Cup viewership by country") +
  xlab("Longitude (deg)") + ylab("Latitude (deg)") + scale_fill_viridis()
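Let’s run our t-test comparing the GDP-weighted viewership share of European vs non-European countries. We’ll conduct this test at a significance level of \(\alpha = 0.05\). Our null and alternative hypotheses are:
- \(H_0\): The mean GDP-weighted viewership share of the World Cup in Europe is not higher than that of other confederations
- \(H_a\): The mean GDP-weighted viewership share of the World Cup in Europe is higher than that of other confederations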
# Filtering into our two groups: Europe vs not Europe.
europe <- fifa %>% filter(european)
other_countries <- fifa %>% filter(!european)
# Running one-tailed t-test using R built-in
t.test(europe$gdp_weighted_share, other_countries$gdp_weighted_share, alternative="greater")
##
## Welch Two Sample t-test
##
## data: europe$gdp_weighted_share and other_countries$gdp_weighted_share
## t = 1.7859, df = 77.677, p-value = 0.03901
## alternative hypothesis: true difference in means is greater than 0
## 95 percent confidence interval:
## 0.02926207 Inf
## sample estimates:
## mean of x mean of y
## 0.8478261 0.4165517
Since our p-value (0.03901) is less than \(\alpha = 0.05\), we can reject the null hypothesis and conclude that the mean GDP-weighted World Cup viewership share of European countries is higher than that of non-European countries.
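To make the "Welch Two Sample t-test" label in the output more concrete, here is a minimal sketch of how the statistic and its degrees of freedom are computed, checked against the built-in `t.test`. The data are simulated (only the group sizes, 46 and 145, mirror the analysis), and the names `t_manual`, `df_manual`, and `p_manual` are introduced for illustration.

```r
set.seed(42)
# Simulated skewed samples, not the World Cup data; sizes mirror 46 vs 145
x <- rexp(46, rate = 1)
y <- rexp(145, rate = 2)

# Per-group variance of the sample mean (no pooling, unlike Student's t)
vx <- var(x) / length(x)
vy <- var(y) / length(y)

# Welch statistic: difference in means over the unpooled standard error
t_manual <- (mean(x) - mean(y)) / sqrt(vx + vy)

# Welch-Satterthwaite approximation to the degrees of freedom
df_manual <- (vx + vy)^2 / (vx^2 / (length(x) - 1) + vy^2 / (length(y) - 1))

# One-tailed p-value for the alternative "mean of x is greater"
p_manual <- pt(t_manual, df_manual, lower.tail = FALSE)

# Compare with the built-in test
builtin <- t.test(x, y, alternative = "greater")
print(c(manual = t_manual, builtin = unname(builtin$statistic)))
print(c(manual = df_manual, builtin = unname(builtin$parameter)))
print(c(manual = p_manual, builtin = builtin$p.value))
```

Because the variances are not pooled, the degrees of freedom are generally not a whole number, which is why the output above reports a fractional df.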
Let’s also run an inference test on our
tv_audience_share variable (\(\alpha = 0.05\)) to see if there’s a
significant difference between European and non-European countries.
We’ll start with our hypotheses:
- \(H_0\): The mean TV audience share of the World Cup in Europe is not higher than that of other confederations
- \(H_a\): The mean TV audience share of the World Cup in Europe is higher than that of other confederations
# Running one-tailed t-test using R built-in
t.test(europe$tv_audience_share, other_countries$tv_audience_share, alternative="greater")
##
## Welch Two Sample t-test
##
## data: europe$tv_audience_share and other_countries$tv_audience_share
## t = 0.18208, df = 151.44, p-value = 0.4279
## alternative hypothesis: true difference in means is greater than 0
## 95 percent confidence interval:
## -0.2641503 Inf
## sample estimates:
## mean of x mean of y
## 0.5478261 0.5151724
Our p-value of \(0.4279 > \alpha\) indicates that we cannot reject the null hypothesis in this instance. There is no statistically significant difference in raw TV audience share between European and non-European countries.
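Both p-values can be recovered from the reported t statistics and degrees of freedom, which is a quick way to confirm how the one-tailed `alternative="greater"` setting is applied: the p-value is the upper-tail area of the t distribution beyond the observed statistic. The sketch below copies t and df from the two outputs above; the variable names are illustrative.

```r
# t and df copied from the two Welch outputs above
t_gdp <- 1.7859;  df_gdp <- 77.677   # gdp_weighted_share test
t_raw <- 0.18208; df_raw <- 151.44   # tv_audience_share test

# Upper-tail area, matching alternative = "greater"
p_gdp <- pt(t_gdp, df = df_gdp, lower.tail = FALSE)
p_raw <- pt(t_raw, df = df_raw, lower.tail = FALSE)

alpha <- 0.05
decisions <- c(gdp_weighted = p_gdp < alpha, raw = p_raw < alpha)
print(round(c(gdp_weighted = p_gdp, raw = p_raw), 4))
print(decisions)
```

A two-sided test would double these tail areas, so the GDP-weighted result would no longer fall below 0.05; the conclusion depends on having chosen the one-sided alternative up front.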
Lastly, it’d be interesting to run an ANOVA on
tv_audience_share between confederations (not just Europe
vs non-Europe) to see if there are any significant differences in World
Cup viewership between confederations.
We will conduct our ANOVA at a 5% significance level (\(\alpha = 0.05\)). The null hypothesis is that the mean tv_audience_share is the same in every confederation; the alternative is that at least one confederation’s mean differs.
confederation_test <- aov(tv_audience_share ~ confederation, data = fifa)
summary(confederation_test)
## Df Sum Sq Mean Sq F value Pr(>F)
## confederation 5 26.7 5.333 2.648 0.0244 *
## Residuals 185 372.6 2.014
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
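Since our p-value (0.0244) is less than \(\alpha = 0.05\), we can reject the null hypothesis and conclude that there is a statistically significant difference in mean TV audience share between FIFA confederations. Note that the ANOVA only tells us that at least one confederation’s mean differs; it does not identify which one.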
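As a rough check on how the ANOVA table fits together, the F statistic is the ratio of the between-group mean square to the residual mean square, and the p-value is the upper tail of the F distribution with (5, 185) degrees of freedom. The sketch below uses the rounded sums of squares printed in the table, so its results should agree with the table only approximately; the variable names are illustrative.

```r
# Values copied from the ANOVA table above (rounded as printed)
ss_between <- 26.7
df_between <- 5      # 6 confederations - 1
ss_within <- 372.6
df_within <- 185     # 191 countries - 6 confederations

# Mean squares are sums of squares divided by their degrees of freedom
ms_between <- ss_between / df_between
ms_within <- ss_within / df_within

# F statistic and its upper-tail p-value
f_stat <- ms_between / ms_within
p_value <- pf(f_stat, df_between, df_within, lower.tail = FALSE)

print(round(c(F = f_stat, p = p_value), 4))
```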