In this analysis, we'll explore the relationship between FIFA Wikipedia page views and player performance metrics to uncover trends and patterns over time. The key response variable is daily page views, which tracks public interest over time. We'll build a tsibble for time-series analysis so we can visualize changes, detect trends, and analyze seasonality. Using linear regression and smoothing, we'll look for significant patterns and investigate whether key events or attributes influence public interest.
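library(dplyr)      # filter(), mutate(), lag()
library(tidyr)      # drop_na()
library(ggplot2)
library(lubridate)  # floor_date(), year()
library(tsibble)    # as_tsibble(), index_by()
library(ggthemes)   # theme_hc()
set.seed(123)
# Simulated daily page views (normal noise) from 2015-01-01 to today
dates <- seq.Date(from = as.Date("2015-01-01"), to = Sys.Date(), by = "day")
views_data <- data.frame(
  Date = dates,
  FIFA = rnorm(length(dates), mean = 1000, sd = 200)
)
# Convert to a tsibble
views_tsibble <- views_data |>
  as_tsibble(index = Date)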
ggplot(views_tsibble, aes(x = Date, y = FIFA)) +
geom_line(color = "blue") +
labs(
title = "FIFA Wikipedia Page Views (2015 - Present)",
x = "Date",
y = "Page Views"
) +
theme_minimal()
# 2018 (World Cup year)
views_tsibble_2018 <- views_tsibble |>
  filter(Date >= as.Date("2018-01-01") & Date <= as.Date("2018-12-31"))
ggplot(views_tsibble_2018, aes(x = Date, y = FIFA)) +
geom_line(color = "red") +
labs(
title = "FIFA Wikipedia Page Views (2018 World Cup Year)",
x = "Date",
y = "Page Views"
) +
theme_minimal()
In real page-view data we would expect prominent spikes during World Cup years, particularly 2018 and 2022, reflecting increased public interest and engagement during the tournaments. Note that the series plotted here is simulated normal noise around a mean of 1000, so by construction it cannot show such spikes.
lm_model <- lm(FIFA ~ as.numeric(Date), data = views_tsibble)
# Summary of the model to assess trend strength
summary(lm_model)
##
## Call:
## lm(formula = FIFA ~ as.numeric(Date), data = views_tsibble)
##
## Residuals:
## Min 1Q Median 3Q Max
## -681.97 -131.33 0.11 135.61 744.64
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.848e+02 5.767e+01 17.08 <2e-16 ***
## as.numeric(Date) 7.259e-04 3.156e-03 0.23 0.818
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 197.7 on 3609 degrees of freedom
## Multiple R-squared: 1.465e-05, Adjusted R-squared: -0.0002624
## F-statistic: 0.05289 on 1 and 3609 DF, p-value: 0.8181
# Plot the data with the linear regression line
ggplot(views_tsibble, aes(x = Date, y = FIFA)) +
geom_line(color = "blue") +
geom_smooth(method = "lm", se = FALSE, color = "red") +
labs(
title = "FIFA Wikipedia Page Views with Linear Trend",
x = "Date",
y = "Page Views"
) +
theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'
### Observations
The FIFA Wikipedia page views do not show a clear upward or downward trend over the entire period analyzed. The slope's p-value of 0.818 means we cannot reject the hypothesis that there is no linear time trend in page views.
The Multiple R-squared value is 1.465e-05, and the Adjusted R-squared is -0.0002624. These values are very low, indicating that the linear model explains almost none of the variation in FIFA views over time.
Clearly, a linear trend is not a good fit for this data.
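To make the R-squared figure concrete, here is a minimal sketch that recomputes it by hand as 1 - RSS/TSS. It assumes a simulated series like the one used in this analysis, but with a fixed end date (2024-12-31, chosen only so the run is reproducible). The names `dates`, `views`, and `fit` are illustrative and not part of the original analysis.

```r
# Simulated stand-in for the page-view series (assumption: fixed end date)
set.seed(123)
dates <- seq.Date(as.Date("2015-01-01"), as.Date("2024-12-31"), by = "day")
views <- rnorm(length(dates), mean = 1000, sd = 200)

# Same model form as in the analysis: views regressed on numeric date
fit <- lm(views ~ as.numeric(dates))

# R-squared is the share of total variation explained by the fitted line
rss <- sum(residuals(fit)^2)          # residual sum of squares
tss <- sum((views - mean(views))^2)   # total sum of squares
r2_manual <- 1 - rss / tss

# p-value of the slope, read from the coefficient table
slope_p <- summary(fit)$coefficients[2, "Pr(>|t|)"]

r2_manual
slope_p
```

For pure noise, the residual sum of squares is almost as large as the total sum of squares, which is why R-squared sits near zero.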
# Example: Filter for 2018 (World Cup year)
views_tsibble_2018 <- views_tsibble |>
filter(Date >= as.Date("2018-01-01") & Date <= as.Date("2018-12-31"))
# Fit a linear regression model for the 2018 subset
lm_model_2018 <- lm(FIFA ~ as.numeric(Date), data = views_tsibble_2018)
# Summary of the 2018 model
summary(lm_model_2018)
##
## Call:
## lm(formula = FIFA ~ as.numeric(Date), data = views_tsibble_2018)
##
## Residuals:
## Min 1Q Median 3Q Max
## -618.63 -129.42 3.75 134.65 457.16
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.543e+02 1.753e+03 0.316 0.752
## as.numeric(Date) 2.453e-02 9.898e-02 0.248 0.804
##
## Residual standard error: 199.3 on 363 degrees of freedom
## Multiple R-squared: 0.0001691, Adjusted R-squared: -0.002585
## F-statistic: 0.06141 on 1 and 363 DF, p-value: 0.8044
# Plot the subset data with the regression line for 2018
ggplot(views_tsibble_2018, aes(x = Date, y = FIFA)) +
geom_line(color = "red") +
geom_smooth(method = "lm", se = FALSE, color = "green") +
labs(
title = "FIFA Wikipedia Page Views with Linear Trend (2018 World Cup)",
x = "Date",
y = "Page Views"
) +
theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'
Subsetting to 2018 brings little improvement: the slope's p-value is still very high at 0.804, so there is no significant linear trend in page views within the World Cup year. Note that this tests for a steady trend across 2018, not for a jump in views during the tournament itself.
The Multiple R-squared is 0.0001691 and the Adjusted R-squared is -0.002585, both extremely low. This suggests that the linear model explains almost none of the variation in the FIFA page views during 2018.
In conclusion, the linear model is not a good fit for this subset of the data either.
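A time slope only tests for a steady drift, so a more direct way to ask whether the tournament mattered is to regress views on an indicator for tournament days. The sketch below is one possible approach, not the method used in this analysis. It assumes simulated 2018 data and uses the 2018 World Cup dates (14 June to 15 July); `world_cup` and `fit_event` are hypothetical names.

```r
# Simulated 2018 page views (assumption: same noise model as the analysis)
set.seed(123)
dates_2018 <- seq.Date(as.Date("2018-01-01"), as.Date("2018-12-31"), by = "day")
views_2018 <- rnorm(length(dates_2018), mean = 1000, sd = 200)

# TRUE on tournament days, FALSE otherwise
world_cup <- dates_2018 >= as.Date("2018-06-14") &
  dates_2018 <= as.Date("2018-07-15")

# The coefficient on world_cupTRUE estimates the average difference in
# daily views between tournament days and all other days of the year
fit_event <- lm(views_2018 ~ world_cup)
coef(summary(fit_event))
```

With simulated noise the indicator should usually come out insignificant; on real page-view data the same model would show whether tournament days differ from the rest of the year.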
views_tsibble |>
# Summarize data by half-year
index_by(year = floor_date(Date, 'halfyear')) |>
summarise(avg_FIFA = mean(FIFA, na.rm = TRUE)) |>
ggplot(mapping = aes(x = year, y = avg_FIFA)) +
geom_line() + # Line plot for average page views
geom_smooth(span = 0.3, color = 'blue', se = FALSE) + # Smoothing line
labs(title = "Average FIFA Wikipedia Page Views Over Time",
subtitle = "(by half-year)") +
scale_x_date(breaks = "1 year", labels = \(x) year(x)) + # Yearly ticks
theme_hc() # Highcharts-style theme from ggthemes
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
Spikes appear roughly every 2-4 years, aligning with major FIFA events such as the FIFA World Cup and possibly other tournaments like the FIFA Confederations Cup. Peaks around 2018 and 2022 likely correspond to World Cup years, which draw significant public attention.
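Half-year averaging is one form of smoothing. As a complementary view at daily resolution, a centered moving average can be sketched in base R. This is an illustrative addition, assuming simulated data with a fixed end date and a 7-day window picked arbitrarily; `ma7` is a hypothetical name.

```r
# Simulated daily views (assumption: fixed end date for reproducibility)
set.seed(123)
dates <- seq.Date(as.Date("2015-01-01"), as.Date("2024-12-31"), by = "day")
views <- rnorm(length(dates), mean = 1000, sd = 200)

# Centered 7-day moving average. stats::filter() is called explicitly
# because dplyr::filter() masks it when dplyr is loaded.
ma7 <- stats::filter(views, rep(1 / 7, 7), sides = 2)

# The first and last 3 values are NA because the window is incomplete there
sum(is.na(ma7))

# Smoothing reduces day-to-day variability
sd(views)
sd(as.numeric(ma7), na.rm = TRUE)
```

A wider window gives a smoother curve but blurs short events, so the window length should roughly match the duration of the pattern of interest.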
views_tsibble_lagged <- views_tsibble |>
mutate(views_lag7 = lag(`FIFA`, 7)) |>
drop_na()
acf(views_tsibble_lagged$views_lag7,
main = "ACF of 7-Day Lagged Views",
ci = 0.95,
na.action = na.exclude)
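At several lags, the bars extend beyond the blue dashed lines (the 95% confidence band), indicating statistically significant autocorrelation at those lags. This suggests that the time series is not purely random, although with a 95% band roughly 1 in 20 lags is expected to cross the lines by chance alone. Because shifting a series by 7 days does not change its autocorrelation structure, this plot is equivalent to the ACF of the unlagged views with the last 7 observations dropped.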
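The dashed lines can be reproduced numerically, which makes it possible to count how many lags fall outside them instead of judging by eye. The sketch below assumes a simulated series of 3650 daily values; `views_lag7`, `band`, and `n_outside` are illustrative names.

```r
# Simulated daily views (assumption: 3650 days of normal noise)
set.seed(123)
views <- rnorm(3650, mean = 1000, sd = 200)

# lag(x, 7) followed by drop_na() keeps x[1:(n - 7)] in the lagged column
views_lag7 <- head(views, -7)

acf_out <- acf(views_lag7, plot = FALSE)

# Half-width of the 95% band that plot.acf() draws as dashed lines
band <- qnorm(0.975) / sqrt(acf_out$n.used)

# Drop lag 0 (always exactly 1) before comparing against the band
rho <- acf_out$acf[-1]
n_outside <- sum(abs(rho) > band)

band
n_outside
length(rho)
```

If `n_outside` is only around 5% of `length(rho)`, the crossings are consistent with white noise rather than evidence of real structure.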
pacf(views_tsibble_lagged$views_lag7, lag.max = 50, main = "PACF of FIFA Wikipedia Page Views")
Near the start of the plot (lags 0 to 1), the bars extend beyond the blue dashed lines, indicating statistically significant partial autocorrelation. This suggests that the time series is not purely random, with the strongest dependence at the shortest lags.
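Reading individual bars can be backed up with a formal test. The sketch below, offered as one option and not as part of the original analysis, shows that the PACF starts at lag 1 (where it equals the lag-1 autocorrelation) and runs a Ljung-Box test for joint autocorrelation. The simulated data and the choice of 20 lags are assumptions.

```r
# Simulated daily views (assumption: 3650 days of normal noise)
set.seed(123)
views <- rnorm(3650, mean = 1000, sd = 200)

# PACF values start at lag 1; the first equals the lag-1 autocorrelation
pacf_vals <- pacf(views, lag.max = 50, plot = FALSE)$acf
acf_lag1 <- acf(views, plot = FALSE)$acf[2]   # element 1 is lag 0

# Ljung-Box test: the null hypothesis is no autocorrelation up to lag 20
lb <- Box.test(views, lag = 20, type = "Ljung-Box")
lb$p.value
```

A large p-value here means the series is statistically indistinguishable from white noise over those lags, while a small one supports the claim that the series is not purely random.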