What is Average Viewership: Average viewership is calculated by measuring the number of concurrent viewers at regular intervals throughout a broadcast (e.g., every minute or 5 minutes), summing those totals, and dividing by the number of intervals.
# Libraries
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(forecast)
## Registered S3 method overwritten by 'quantmod':
## method from
## as.zoo.data.frame zoo
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 4.5.2
library(zoo)
##
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
# Reading in Data
sb_data <- read.csv("C:/Users/james/OneDrive/Desktop/super-bowl-ratings.csv")
# Viewing first few rows of data
head(sb_data)
## super_bowl super_bowl_number date network average_viewers total_viewers
## 1 I 1 1/15/1967 NBC 24430000 35600000
## 2 II 2 1/14/1968 CBS 39120000 51300000
## 3 III 3 1/12/1969 NBC 41660000 54500000
## 4 IV 4 1/11/1970 CBS 44270000 59200000
## 5 V 5 1/17/1971 NBC 46040000 58500000
## 6 VI 6 1/16/1972 CBS 56640000 67300000
## household_rating household_share cost_of_30_second_ad_usd
## 1 18.5 36 37500
## 2 36.8 68 54500
## 3 36.0 70 55000
## 4 39.4 69 78200
## 5 39.9 75 72500
## 6 44.2 74 86100
# NA values per column
colSums(is.na(sb_data))
## super_bowl super_bowl_number date
## 0 0 0
## network average_viewers total_viewers
## 0 0 8
## household_rating household_share cost_of_30_second_ad_usd
## 0 1 0
# Density plot of average viewers
ggplot(sb_data, aes(x = average_viewers))+
geom_density(fill = "skyblue")+
theme_bw()+
theme(panel.grid.major = element_blank(), # Turn of the background grid
panel.grid.minor = element_blank(),
panel.border = element_blank(),
panel.background = element_blank()) +
scale_x_continuous(breaks = scales::pretty_breaks(),
labels = scales::comma_format())+
labs(x="Average Viewership", y= "Density", title="Density Plot of Super Bowl Average Viewership (1967-2021)")
# Finding mean average viewership per super bowl by network
networks <- sb_data %>%
group_by(network) %>%
summarise(avg = mean(average_viewers)) %>%
ungroup() %>%
arrange(desc(avg))
# Plotting Average Viewership per Super Bowl by Network
ggplot(networks, aes(x=network, y=avg, fill=network))+
geom_bar(stat = "identity")+
theme_bw()+
theme(panel.grid.major = element_blank(), # Turn of the background grid
panel.grid.minor = element_blank(),
panel.border = element_blank(),
panel.background = element_blank()) +
labs(x="Network", y="Avg. Viewership Per Super Bowl (Millions)", title="Avg. Viewership Per Super Bowl Airing By Network")
# Filtering to get 50 years of data
sb_data30 <- sb_data %>%
filter(super_bowl_number >= 11)
# Converting to time series object
sb.ts <- ts(sb_data30$average_viewers, start = c(1977, 1), end = c(2026, 1), freq = 1)
# Plotting as Time Series Object
plot(sb.ts, main="Super Bowl Avg. Viewers (1977-2026)",
xlab="Year",
ylab="Avg. Viewers (Millions)",
col = "royalblue",
lwd = 2)
# Setting validation period length
nValid <- 10
# Setting training period length
ntrain <- length(sb.ts) - nValid
# Creating training data
train.ts <- window(sb.ts, end = c(2016, 1))
# Creating Validation data
valid.ts <- window(sb.ts, start = c(2017, 1), end = c(2026, 1))
# Finding lagged differences
diff.sb.ts <- diff(sb.ts)
# Plotting lagged differences
Acf(diff.sb.ts)
The ACF plot communicates that there is constant auto correlation structure meaning no trends or seasonality. This also signifies constant variance and a constant mean. The data possesses stationarity due to constant variance, mean, adn auto correlation.
# Difference Plot
plot.ts(diff(sb_data30$average_viewers))
This plot conveys the yearly change in super bowl average viewership from one year to the next. We can see a constant yearly fluctutation in the differences.
# Fitting Linear Trend Model
linear_mod <- tslm(train.ts ~ trend)
# Results
summary(linear_mod)
##
## Call:
## tslm(formula = train.ts ~ trend)
##
## Residuals:
## Min 1Q Median 3Q Max
## -10853979 -5735079 -584860 5153420 13066739
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 70334462 2095995 33.56 < 2e-16 ***
## trend 916880 89091 10.29 1.53e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6504000 on 38 degrees of freedom
## Multiple R-squared: 0.736, Adjusted R-squared: 0.729
## F-statistic: 105.9 on 1 and 38 DF, p-value: 1.527e-12
# Predicting on validation set
linear_pred <- forecast(linear_mod, h = nValid, level = 90)
autoplot(sb.ts)+
autolayer(linear_pred, alpha = 0.3)+
autolayer(linear_pred$mean, series="Linear Forecast")+
autolayer(linear_mod$fitted.values, series="Linear Fit")+
autolayer(valid.ts, series="Observed")+
labs(x="Year", y="Avg. Viewership (Millions)", title="Forecasts From Linear Model")+
guides(color = guide_legend(title = ""))
linear.res <- linear_mod$residuals
acf(linear.res)
The ACF Plot on the residuals will indicate whether the residuals are autocorrelated. If so the model has missed a pattern in the data. We can see significant autocorrelation at lags 0, 1, and 2. With the significant spike at low lags the model is underfitting a small bit. Ideally all residuals should be in the blue bounds (95% Confidience Interval) and their behavior is random. Auto correaltion is represented by points outside of the bounds.
# Fitting Exp Trend model
exp_mod <- tslm(train.ts ~ trend, lambda = 0) #lambda = 0 indicates exponential trend
# Results
summary(exp_mod)
##
## Call:
## tslm(formula = train.ts ~ trend, lambda = 0)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.154006 -0.058298 -0.001982 0.062365 0.154312
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.809e+01 2.347e-02 770.76 < 2e-16 ***
## trend 1.019e-02 9.975e-04 10.22 1.88e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.07282 on 38 degrees of freedom
## Multiple R-squared: 0.7331, Adjusted R-squared: 0.726
## F-statistic: 104.4 on 1 and 38 DF, p-value: 1.881e-12
# Forecasting on validation data
exp_pred <- forecast(exp_mod, h= nValid, level = 90)
# Ploting Forecast
autoplot(sb.ts) +
autolayer(exp_pred, alpha = 0.3)+
autolayer(exp_pred$mean, series="Exponential Forecast") +
autolayer(exp_mod$fitted.values, series = 'Exponential Fit') +
autolayer(valid.ts, series = "Observed") +
labs(x="Year", y="Avg. Viewership (Millions)", title="Exponential Forecast") +
guides(color = guide_legend(title = ""))
exp.res <- exp_mod$residuals
acf(exp.res)
We can see the spikes at early lags similar to the linear trend model meaning that this model is slightly underfitting.
# Fitting Quadratic Trend Model
quad_mod <- tslm(train.ts ~ trend + I(trend^2))
# Results
summary(quad_mod)
##
## Call:
## tslm(formula = train.ts ~ trend + I(trend^2))
##
## Residuals:
## Min 1Q Median 3Q Max
## -14625610 -4467887 -564261 3906991 13571833
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 76637154 2996871 25.572 < 2e-16 ***
## trend 16495 337107 0.049 0.96124
## I(trend^2) 21961 7974 2.754 0.00907 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6005000 on 37 degrees of freedom
## Multiple R-squared: 0.7809, Adjusted R-squared: 0.769
## F-statistic: 65.93 on 2 and 37 DF, p-value: 6.348e-13
# Forecasting on Validation Data
quad_pred <- forecast(quad_mod, h= nValid, level=90)
# Plotting Forecast
autoplot(sb.ts) +
autolayer(quad_pred, alpha = 0.3)+
autolayer(quad_pred$mean, series="Quadratic Forecast") +
autolayer(quad_mod$fitted.values, series = "Quadratic Fit") +
autolayer(valid.ts, series = "Observed") +
labs(x="Year", y="Avg. Viewership (Millions)", title="Quadratic Forecast") +
guides(color = guide_legend(title = ""))
# Plotting Residuals of Quadratic Trend Model
quad.res <- quad_mod$residuals
acf(quad.res)
# Holt Winters Model
ets_mod <- ets(train.ts, model = "ZZZ")
# Results
summary(ets_mod)
## ETS(A,N,N)
##
## Call:
## ets(y = train.ts, model = "ZZZ")
##
## Smoothing parameters:
## alpha = 0.5884
##
## Initial states:
## l = 78298996.0723
##
## sigma: 6307425
##
## AIC AICc BIC
## 1404.082 1404.749 1409.149
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE ACF1
## Training set 1445933 6147718 4830112 1.128857 5.666198 0.9610446 -0.1290618
# Forecasting on Validation Period
ets_pred <- forecast(ets_mod, h = nValid, level = 90)
autoplot(sb.ts) +
autolayer(ets_pred, alpha = 0.3)+
autolayer(ets_pred$mean, series = "ETS Forecast")+
autolayer(ets_pred$fitted, series = "ETS Fit")+
autolayer(valid.ts, series = "Observed")+
labs(x="Year", y="Avg. Viewership (Millions)", title="ETS Forecast")+
guides(color = guide_legend(title = ""))
# Plotting ACF of Residuals
ets.res <- ets_mod$residuals
acf(ets.res)
ets.2.mod <- ets(train.ts, model = "AAN")
summary(ets.2.mod)
## ETS(A,A,N)
##
## Call:
## ets(y = train.ts, model = "AAN")
##
## Smoothing parameters:
## alpha = 0.5433
## beta = 0.0264
##
## Initial states:
## l = 71627417.2278
## b = 2234848.6442
##
## sigma: 6145168
##
## AIC AICc BIC
## 1403.835 1405.600 1412.279
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE ACF1
## Training set -791350.3 5829819 4619342 -1.390291 5.518434 0.9191078 -0.0433885
ets.2_pred <- forecast(ets.2.mod, h = nValid, level = 90)
autoplot(sb.ts) +
autolayer(ets.2_pred, alpha = 0.3)+
autolayer(ets.2_pred$fitted, series = "ETS (AAN) Fit")+
autolayer(ets.2_pred$mean, series = "ETS (AAN) Forecast")+
autolayer(valid.ts, series = "Observed")+
labs(x="Year", y="Avg. Viewership (Millions)", title="ETS (AAN) Forecast")+
guides(color = guide_legend(title = ""))
ets2.res <- ets.2.mod$residuals
acf(ets2.res)
p <- 5 # Number of previous time steps used for forecast
P <- 0 # Number of previous seasonal values to use
size <- 9 # Number of hidden nodes
# Fitting Model
nn_mod <- nnetar(train.ts, repeats = 20, p=p, P=P, size = size)
# Forecasting on validation data
nn_pred <- forecast(nn_mod, h = nValid, level = 90)
#Plotting Forecast
autoplot(sb.ts) +
autolayer(nn_pred, alpha = 0.3)+
autolayer(nn_pred$mean, series = "NN Forecast")+
autolayer(valid.ts, series = "Observed")+
labs(x="Year", y="Avg. Viewership (Millions)", title="NN Model Forecast")+
guides(color = guide_legend(title = ""))
# Finding Neural Net Residuals
nn.res <- nn_mod$residuals
nn_df <- data.frame(nn.res)
nn_df <- na.omit(nn_df)
# ACF Plot
acf(nn_df)
The neural network residuals behave essentially like white noise, which is a good sign. It suggests the model has successfully captured the underlying structure in the data, leaving no exploitable autocorrelation behind in the residuals.
#Fitting Auto Arima
arima <- auto.arima(train.ts)
#Results
summary(arima)
## Series: train.ts
## ARIMA(0,1,1) with drift
##
## Coefficients:
## ma1 drift
## -0.4790 1154542.2
## s.e. 0.1525 482308.6
##
## sigma^2 = 3.34e+13: log likelihood = -661.66
## AIC=1329.33 AICc=1330.01 BIC=1334.32
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE ACF1
## Training set 192013.8 5558220 4212230 -0.0699361 4.883411 0.838105 0.04521311
# Forecasting on Validation Data
arima_pred <- forecast(arima, h = nValid, level = 90)
#Plotting Forecast
autoplot(sb.ts) +
autolayer(arima_pred, alpha = 0.3)+
autolayer(arima_pred$fitted, series = "Arima Fitted")+
autolayer(arima_pred$mean, series = "Arima Forecast")+
autolayer(valid.ts, series = "Observed")+
labs(x="Year", y="Avg. Viewership (Millions)", title="Arima Model Forecast")+
guides(color = guide_legend(title = ""))
# Finding Residuals of ARIMA
arima.res <- arima$residuals
# Finding differences
diff.res <- diff(arima.res)
# Plotting ACF of Differences
acf(diff.res)
# Moving Average with window size of two to predict viewership for next Super Bowl
ma.trailing <- rollmean(train.ts, k = 2, align = "right")
# Taking last 3 moving averages
last.ma <- tail(ma.trailing, 3)
# Using the Prior 3 Super Bowls moving averagea to make predictions
ma.trailing.pred <- ts(rep(last.ma, nValid), start = c(1977, ntrain + 1), end = c(1977, ntrain + nValid), freq = 1)
# Plotting Moving Average
autoplot(sb.ts)+
autolayer(ma.trailing, series = "Moving Average") +
autolayer(ma.trailing.pred, series = "MA Forecast")+
autolayer(valid.ts, series = "Observed")+
labs(x="Year", y="Avg. Viewership (Millions)", title="Moving Average Forecast")+
guides(color = guide_legend(title = ""))
# List of model names
model_names <- c("Linear Trend", "Exponential Trend", "Quadratic Trend", "Holt ZZZ", "Holt AAN", "Neural Net", "Auto Arima", "Moving Average")
# Linear Trend Accuracy
lr_acc <- accuracy(linear_pred$mean, valid.ts)
# Exponential Trend Accuracy
exp_acc <- accuracy(exp_pred$mean, valid.ts)
# Quadratic Trend Accuracy
quad_acc <- accuracy(quad_pred$mean, valid.ts)
# Holt ZZZ
holt_ZZZ_acc <- accuracy(ets_pred$mean, valid.ts)
# Holt AAN
holt_AAN_acc <- accuracy(ets.2_pred$mean, valid.ts)
# Neural Network accuracy
nn_acc <- accuracy(nn_pred$mean, valid.ts)
# Auto Arima Accuracy
arima_acc <- accuracy(arima_pred$mean, valid.ts)
# Moving Average Accuracy
ma_acc <- accuracy(ma.trailing.pred, valid.ts)
# Row Binding Error Measures for Display
errors_df <- rbind(lr_acc, exp_acc, quad_acc, holt_ZZZ_acc, holt_AAN_acc, nn_acc, arima_acc, ma_acc)
# Assigning Row Names
rownames(errors_df) <- model_names
# Displaying Error Measures for each model
errors_df
## ME RMSE MAE MPE MAPE ACF1
## Linear Trend -2149498 9373206 8397680 -2.759635 7.833089 0.6620626
## Exponential Trend -4053224 9621524 8298442 -4.464157 7.899401 0.6408273
## Quadratic Trend -13129798 15227857 13129798 -12.648481 12.648481 0.5442585
## Holt ZZZ -2425612 11070578 9974722 -3.170302 9.237705 0.7180497
## Holt AAN -11294425 14099564 11616135 -11.076495 11.328422 0.6168719
## Neural Net -2138868 11425551 10152979 -2.927622 9.390412 0.6716574
## Auto Arima -9723814 13091436 10545821 -9.675694 10.319668 0.6411429
## Moving Average -2212500 11414057 10331500 -3.006025 9.571976 0.6504315
## Theil's U
## Linear Trend 1.402724
## Exponential Trend 1.483165
## Quadratic Trend 2.387743
## Holt ZZZ 1.637872
## Holt AAN 2.216295
## Neural Net 1.686763
## Auto Arima 2.057607
## Moving Average 1.685917
train.res.ar.1 <- Arima(linear.res, order = c(1, 0, 0))
summary(train.res.ar.1)
## Series: linear.res
## ARIMA(1,0,0) with non-zero mean
##
## Coefficients:
## ar1 mean
## 0.5201 -112485.1
## s.e. 0.1371 1744510.0
##
## sigma^2 = 3.106e+13: log likelihood = -677.23
## AIC=1360.45 AICc=1361.12 BIC=1365.52
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE ACF1
## Training set 151659.8 5431810 4182438 196.4112 208.5965 0.8555932 -0.1219622
lr.res.for <- forecast(linear.res, h =nValid, level = 90)
autoplot(linear.res)+
autolayer(lr.res.for$residuals)
autoplot(sb.ts) +
autolayer(linear_pred$mean + lr.res.for$mean, series=" Linear Forecast + Residuals") +
autolayer(valid.ts, series = "Observed") +
labs(x="Year", y="Avg. Viewership (Millions)", title="Linear Forecast") +
guides(color = guide_legend(title = ""))
# Model Errors for improved neural network
nn_mod_imp <- accuracy(lr.res.for$mean + linear_pred$mean, valid.ts)
print(nn_mod_imp)
## ME RMSE MAE MPE MAPE ACF1 Theil's U
## Test set -8222950 12282245 10025562 -8.33791 9.758806 0.6620626 1.923796
As we can see running an arima model on the linear trend models residuals and plugging the residuals back into the model does not improve the forecast. Error measures of this forecast are higher then those of the first linear model across the board.
# Fitting Model
linear.final <- tslm(sb.ts ~ trend)
# Forecasting Next 10 Years of viewership
final_pred <- forecast(linear.final, h=nValid, level=90)
# Forecast Graph
autoplot(sb.ts)+
autolayer(final_pred, alpha = 0.3)+
autolayer(final_pred$mean, series = "Linear Forecast")+
autolayer(linear.final$fitted.values, series="Linear Fit")+
labs(x="Year", y="Avg. Viewership (Millions)", title="Linear Trend Model 10 Year Forecast")+
guides(color = guide_legend(title = ""))
# 90% Confidence Intervals
upper_bound <- final_pred$upper
lower_bound <- final_pred$lower
intervals <- data.frame(lower_bound, upper_bound)
rownames(intervals) <- c("2027", "2028", "2029", "2030", "2031", "2032", "2033", "2034", "2035", "2036")
intervals
## lower_bound upper_bound
## 2027 103451210 128481901
## 2028 104311845 129400211
## 2029 105171441 130319560
## 2030 106030006 131239941
## 2031 106887547 132161346
## 2032 107744071 133083767
## 2033 108599586 134007197
## 2034 109454101 134931627
## 2035 110307624 135857050
## 2036 111160163 136783457
For next years super bowl the model predicts that the average viewership will be approximately 116,000,000. Our lower bound of our confidence interval allows us to state that with 90% certainty average viewership will not fall lower that 103,000,000. Based off the recent trend of viewership skyrocketing we can expect that viewership will approach the the upper bound of our 90% confidence interval.