Introduction

In this analysis, we’ll explore the relationship between FIFA Wikipedia page views and player performance metrics to uncover trends and patterns over time. Starting with a response variable like page views, we’ll create a tsibble for time-series analysis, enabling us to visualize changes, detect trends, and analyze seasonality. Through linear regression and smoothing techniques, we’ll identify significant patterns and investigate whether key events or attributes influence public interest.

Column of interest

In this analysis, we’ll focus on FIFA Wikipedia page views as the key response variable, which tracks public interest over time.

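library(tidyverse)  # ggplot2, dplyr, tidyr
library(tsibble)    # as_tsibble(), index_by()
library(lubridate)  # floor_date(), year()
library(ggthemes)   # theme_hc()

# Simulated daily page views: normal noise around 1000 with sd 200
set.seed(123)
dates <- seq.Date(from = as.Date("2015-01-01"), to = Sys.Date(), by = "day")
views_data <- data.frame(
  Date = dates,
  FIFA = rnorm(length(dates), mean = 1000, sd = 200)
)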

# Convert to a tsibble
views_tsibble <- views_data |>
  as_tsibble(index = Date)

ggplot(views_tsibble, aes(x = Date, y = FIFA)) +
  geom_line(color = "blue") +
  labs(
    title = "FIFA Wikipedia Page Views (2015 - Present)",
    x = "Date",
    y = "Page Views"
  ) +
  theme_minimal()

# 2018 (World Cup year)
views_tsibble_2018 <- views_tsibble |>
  filter(Date >= as.Date("2018-01-01") & Date <= as.Date("2018-12-31"))

ggplot(views_tsibble_2018, aes(x = Date, y = FIFA)) +
  geom_line(color = "red") +
  labs(
    title = "FIFA Wikipedia Page Views (2018 World Cup Year)",
    x = "Date",
    y = "Page Views"
  ) +
  theme_minimal()

Observations

The data reveals prominent spikes in FIFA Wikipedia page views during World Cup years, particularly in 2018 and 2022. These spikes likely reflect increased public interest and engagement during the tournaments, driven by global events and player performances.

Linear Regression

# Regenerate the simulated series. This is a fresh rnorm() draw, so the values
# differ from those plotted in the first section.
dates <- seq.Date(from = as.Date("2015-01-01"), to = Sys.Date(), by = "day")
views_data <- data.frame(
  Date = dates,
  FIFA = rnorm(length(dates), mean = 1000, sd = 200)
)

views_tsibble <- views_data |>
  as_tsibble(index = Date)

lm_model <- lm(FIFA ~ as.numeric(Date), data = views_tsibble)

# Summary of the model to assess trend strength
summary(lm_model)
## 
## Call:
## lm(formula = FIFA ~ as.numeric(Date), data = views_tsibble)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -681.97 -131.33    0.11  135.61  744.64 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      9.848e+02  5.767e+01   17.08   <2e-16 ***
## as.numeric(Date) 7.259e-04  3.156e-03    0.23    0.818    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 197.7 on 3609 degrees of freedom
## Multiple R-squared:  1.465e-05,  Adjusted R-squared:  -0.0002624 
## F-statistic: 0.05289 on 1 and 3609 DF,  p-value: 0.8181
# Plot the data with the linear regression line
ggplot(views_tsibble, aes(x = Date, y = FIFA)) +
  geom_line(color = "blue") +
  geom_smooth(method = "lm", se = FALSE, color = "red") +
  labs(
    title = "FIFA Wikipedia Page Views with Linear Trend",
    x = "Date",
    y = "Page Views"
  ) +
  theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'

Observations

The FIFA Wikipedia page views do not show a clear upward or downward trend over the entire period analyzed. The slope on date has a p-value of 0.818, so there is no evidence that page views change linearly with time.

The Multiple R-squared value is 1.465e-05, and the Adjusted R-squared is -0.0002624. Both are essentially zero, indicating that the linear model explains almost none of the variation in FIFA views over time. Clearly, a linear trend is not a good fit for this data.
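To keep the figures quoted in the text in step with the model output, the p-value and R-squared values can be pulled straight from the `summary()` object instead of being copied by hand. The following is a minimal sketch on a simulated series with a fixed end date; the names `views`, `fit`, `slope_p`, `r2`, and `adj_r2` are illustrative and not part of the analysis above.

```r
# Simulated stand-in for the page-view series (fixed end date for reproducibility)
set.seed(123)
dates <- seq.Date(as.Date("2015-01-01"), as.Date("2023-12-31"), by = "day")
views <- data.frame(Date = dates, FIFA = rnorm(length(dates), mean = 1000, sd = 200))

fit <- lm(FIFA ~ as.numeric(Date), data = views)
fit_summary <- summary(fit)

# Row 2 of the coefficient table is the slope on as.numeric(Date)
slope_p <- fit_summary$coefficients[2, "Pr(>|t|)"]
r2      <- fit_summary$r.squared
adj_r2  <- fit_summary$adj.r.squared

cat(sprintf("slope p-value = %.3f, R-squared = %.2e, adjusted R-squared = %.2e\n",
            slope_p, r2, adj_r2))
```

Because the values come from the fitted object, rerunning the analysis on a longer date range updates the reported numbers automatically.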

Next, we subset the data to check whether a shorter window shows a more meaningful trend, focusing on 2018, the World Cup year.

# Example: Filter for 2018 (World Cup year)
views_tsibble_2018 <- views_tsibble |>
  filter(Date >= as.Date("2018-01-01") & Date <= as.Date("2018-12-31"))

# Fit a linear regression model for the 2018 subset
lm_model_2018 <- lm(FIFA ~ as.numeric(Date), data = views_tsibble_2018)

# Summary of the 2018 model
summary(lm_model_2018)
## 
## Call:
## lm(formula = FIFA ~ as.numeric(Date), data = views_tsibble_2018)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -618.63 -129.42    3.75  134.65  457.16 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)
## (Intercept)      5.543e+02  1.753e+03   0.316    0.752
## as.numeric(Date) 2.453e-02  9.898e-02   0.248    0.804
## 
## Residual standard error: 199.3 on 363 degrees of freedom
## Multiple R-squared:  0.0001691,  Adjusted R-squared:  -0.002585 
## F-statistic: 0.06141 on 1 and 363 DF,  p-value: 0.8044
# Plot the subset data with the regression line for 2018
ggplot(views_tsibble_2018, aes(x = Date, y = FIFA)) +
  geom_line(color = "red") +
  geom_smooth(method = "lm", se = FALSE, color = "green") +
  labs(
    title = "FIFA Wikipedia Page Views with Linear Trend (2018 World Cup)",
    x = "Date",
    y = "Page Views"
  ) +
  theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'

Observations

Subsetting the data does not improve the fit. The slope's p-value is still very high at 0.804, so this model gives no evidence that page views changed systematically over the World Cup year.

The Multiple R-squared is 0.0001691 and the Adjusted R-squared is -0.002585, both extremely low. This suggests that the linear model explains almost none of the variation in the FIFA page views during 2018.

In conclusion, the linear model is not a good fit for this subset of the data either.
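A straight-line trend across the year tests for steady growth or decline, not for a jump during the tournament itself. One alternative, sketched here as an assumption and not as the method used above, is to regress views on an indicator for the tournament dates (14 June to 15 July 2018). The `world_cup` column and the `views_2018` and `wc_model` names are illustrative, and the data are simulated.

```r
# Simulated daily views for 2018 (stand-in for views_tsibble_2018)
set.seed(123)
dates_2018 <- seq.Date(as.Date("2018-01-01"), as.Date("2018-12-31"), by = "day")
views_2018 <- data.frame(
  Date = dates_2018,
  FIFA = rnorm(length(dates_2018), mean = 1000, sd = 200)
)

# Indicator: 1 during the 2018 World Cup (14 Jun - 15 Jul), 0 otherwise
views_2018$world_cup <- as.integer(
  views_2018$Date >= as.Date("2018-06-14") & views_2018$Date <= as.Date("2018-07-15")
)

# The coefficient on world_cup estimates the shift in mean daily views
# during the tournament relative to the rest of the year
wc_model <- lm(FIFA ~ world_cup, data = views_2018)
coef(summary(wc_model))
```

On real page-view counts, a tournament effect would appear as a large positive `world_cup` coefficient with a small p-value; on simulated noise the coefficient should be indistinguishable from zero.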

Checking seasonality using ACF/PACF

views_tsibble |>
  # Summarize data by half-year
  index_by(year = floor_date(Date, 'halfyear')) |>
  summarise(avg_FIFA = mean(FIFA, na.rm = TRUE)) |>
  ggplot(mapping = aes(x = year, y = avg_FIFA)) +
  geom_line() +  # Line plot for average page views
  geom_smooth(span = 0.3, color = 'blue', se = FALSE) +  # Smoothing line
  labs(title = "Average FIFA Wikipedia Page Views Over Time",
       subtitle = "(by half-year)") +
  scale_x_date(breaks = "1 year", labels = \(x) year(x)) +  # Yearly ticks
  theme_hc()  # Highcharts-style theme from the ggthemes package
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

Observations

Spikes appear roughly every 2-4 years, aligning with major FIFA events such as the FIFA World Cup and possibly other tournaments like FIFA Confederations Cup. Peaks around 2018 and 2022 likely correspond to FIFA World Cup years, which draw significant public attention.
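Half-year averages are one way to smooth the series; a seasonal decomposition is another way to separate a repeating pattern from the longer-term trend. The sketch below uses `stl()` from base R's stats package and assumes a yearly period of 365 observations (leap days ignored). The series is simulated and the names `views_ts`, `decomp`, and `components` are illustrative.

```r
# Simulated daily series covering eight 365-day "years"
set.seed(123)
n_days <- 365 * 8
views <- rnorm(n_days, mean = 1000, sd = 200)

# Treat the data as daily observations with a yearly cycle
views_ts <- ts(views, frequency = 365)

# Split the series into seasonal, trend and remainder components
decomp <- stl(views_ts, s.window = "periodic")
components <- decomp$time.series
head(components)

# Variance of each component relative to the raw series
apply(components, 2, var) / var(views)
```

If the remainder carries nearly all of the variance, as it should for pure noise, there is little seasonal or trend structure to model.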

ACF of the views data

views_tsibble_lagged <- views_tsibble |>
  mutate(views_lag7 = lag(`FIFA`, 7)) |>
  drop_na()


acf(views_tsibble_lagged$views_lag7, 
    main = "ACF of 7-Day Lagged Views", 
    ci = 0.95, 
    na.action = na.exclude)

Observations

At several lags, the bars extend beyond the blue dashed lines, indicating statistically significant autocorrelations. This suggests that the time series has some structure and is not purely random.
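The blue dashed lines drawn by `acf()` are, by default, approximate 95% bounds for a white-noise series, located at about ±1.96/√n. The sketch below computes that bound, lists the lags that fall outside it, and adds a Ljung-Box test as a joint check. It runs on simulated noise, and the names `bound`, `acf_values`, and `lags_outside` are illustrative.

```r
# Simulated white-noise series
set.seed(123)
views <- rnorm(3000, mean = 1000, sd = 200)
n <- length(views)

# Approximate 95% bounds that plot.acf() draws for a white-noise series
bound <- qnorm(0.975) / sqrt(n)

# Sample autocorrelations; the first element is lag 0 (always 1), so drop it
acf_values <- acf(views, lag.max = 30, plot = FALSE)$acf[-1]
lags_outside <- which(abs(acf_values) > bound)
lags_outside

# Ljung-Box test of the joint null that the first 30 autocorrelations are zero
Box.test(views, lag = 30, type = "Ljung-Box")
```

Even for pure noise, roughly 1 in 20 lags is expected to cross the bounds by chance, so isolated bars outside the lines are weak evidence on their own; a joint test gives a more reliable reading.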

PACF of the views data

pacf(views_tsibble_lagged$views_lag7, lag.max = 50, main = "PACF of FIFA Wikipedia Page Views")

Observations

Between lags 0 and 1, the bars extend beyond the blue dashed lines, indicating statistically significant partial autocorrelations. This suggests that the time series is not purely random, with the strongest dependence between lags 0 and 1.
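For reference when reading a PACF plot: `pacf()` reports values starting at lag 1 (there is no lag-0 bar), and for a first-order autoregressive process only the first partial autocorrelation should stand clear of the bounds. The sketch below illustrates this on a simulated AR(1) series with coefficient 0.6; it is a synthetic example, not the page-view data, and the names `ar1`, `pacf_values`, and `bound` are illustrative.

```r
# Simulated AR(1) series with coefficient 0.6 (not the page-view data)
set.seed(123)
ar1 <- arima.sim(model = list(ar = 0.6), n = 2000)

# Partial autocorrelations for lags 1 to 20; the first entry is lag 1
pacf_values <- pacf(ar1, lag.max = 20, plot = FALSE)$acf[, 1, 1]

# Approximate 95% bounds under the white-noise null
bound <- qnorm(0.975) / sqrt(length(ar1))

# Compare each partial autocorrelation with the bound
data.frame(
  lag = 1:20,
  pacf = round(pacf_values, 3),
  significant = abs(pacf_values) > bound
)
```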

THE END