This document provides the code for the graphs and forecast included in the article ‘A Degree of Extrapolation’. We look at HESA statistics on qualifications achieved by first-degree graduates. The data may be downloaded from a Google Sheet.
First, we download the required packages and set the theme.
library(tidyverse)
library(readxl)
library(forecast)
The theme for the graphs is here:
theme_clean <- theme_bw(base_family="Calibri") +
theme(legend.position = "top",
legend.title = element_text(size = 12),
legend.text = element_text(size = 12),
plot.title = element_text(size = 18, face = "bold"),
plot.subtitle = element_text(size = 12, face = "italic", margin = margin(b=12)),
plot.caption = element_text(size = 10),
panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
panel.border = element_blank())
theme_set(theme_clean)
Using readxl, I draw the data from my Excel file. One column needs to be renamed, as the ‘/’ in “Third/Pass” is treated as a calculation in the mutate function.
hesa_degree_df <- read_excel("HESA Class of Degree by First Degree Graduates - 2019-12-30.xlsx",
sheet = "DATA") %>%
dplyr::rename(ThirdPass = "Third/Pass") %>%
dplyr::mutate(Classified = First + Upper_Second + Lower_Second + ThirdPass,
First_Share = (First / Classified)* 100)
This table then needs to be tidied, with factors set:
hesa_class_df <- hesa_degree_df %>%
select(1:5) %>%
pivot_longer(cols = 2:5,
names_to = "Class",
values_to = "Count")
hesa_class_df$Class <- factor(hesa_class_df$Class,
levels = c("First", "Upper_Second", "Lower_Second", "ThirdPass"))
We then produce the graphs used in the article. The first is the count of degrees by class.
hesa_class_gg <- ggplot(data = hesa_class_df,
aes(x = AcademicYear_EY,
y = Count,
group = Class)) +
geom_line(size = 1.2,
aes(color = Class)) +
scale_color_discrete(name = "Classification",
labels = c("First", "Upper Second", "Lower Second", "Third/Pass")) +
scale_y_continuous(labels = scales::comma) +
labs(title = "First and Upper Second degrees have generally risen in volume",
subtitle = "Rounded classifications of first degree graduates, in each academic year at UK institutions.",
caption = "Source: HESA Enrolment and Qualification reports; HESA Higher Education Student Statistics.",
x = "Academic Year (End Year)",
y = "")
The first graph is here:
hesa_class_gg
The second graph shows the proportion of Firsts.
hesa_firstshare_gg <- ggplot(data = hesa_degree_df,
aes(x = AcademicYear_EY,
y = First_Share)) +
geom_line(size = 1.2) +
geom_point(fill = "white", size = 3, shape = 21) +
geom_text(aes(label = round(First_Share)), nudge_y = 2) +
ylim(c(0,30)) +
labs(title = "28% of first-time graduates received a First-class degree in 2017/18",
subtitle = "Proportion of Firsts among first degree graduates with classified degrees at UK institutions [%].",
caption = "Source: HESA Enrolment and Qualification reports; HESA Higher Education Student Statistics.",
x = "Academic Year (End Year)",
y = "")
The second graph is here:
hesa_firstshare_gg
We use the forecasting function, for 12 years into the future. First, we convert the table into a time series:
hesa_firstshare_df <- hesa_degree_df %>%
select("AcademicYear_EY", "First_Share")
hesa_firstshare_ts <- ts(data = hesa_firstshare_df$First_Share,
start = 2004,
frequency = 1)
The forecast function then produces the following prediction:
forecast(hesa_firstshare_ts, h = 12) %>% autoplot()
We can see the exact values by not using autoplot:
forecast(hesa_firstshare_ts, h = 12)
## Point Forecast Lo 80 Hi 80 Lo 95 Hi 95
## 2019 29.90600 29.16713 30.64486 28.77600 31.03599
## 2020 31.98033 30.32921 33.63144 29.45516 34.50549
## 2021 34.05465 31.26486 36.84445 29.78803 38.32128
## 2022 36.12898 31.99096 40.26701 29.80042 42.45755
## 2023 38.20331 32.51715 43.88947 29.50708 46.89955
## 2024 40.27764 32.84876 47.70653 28.91614 51.63914
## 2025 42.35197 32.98835 51.71559 28.03155 56.67239
## 2026 44.42630 32.93669 55.91591 26.85446 61.99815
## 2027 46.50063 32.69317 60.30809 25.38394 67.61732
## 2028 48.57496 32.25618 64.89374 23.61753 73.53239
## 2029 50.64929 31.62326 69.67531 21.55150 79.74708
## 2030 52.72362 30.79129 74.65595 19.18101 86.26623
Also, this is the model being used:
forecast(hesa_firstshare_ts, h = 12)$model
## ETS(M,A,N)
##
## Call:
## ets(y = object, lambda = lambda, biasadj = biasadj, allow.multiplicative.trend = allow.multiplicative.trend)
##
## Smoothing parameters:
## alpha = 0.9999
## beta = 0.9619
##
## Initial states:
## l = 10.8162
## b = 0.3826
##
## sigma: 0.0193
##
## AIC AICc BIC
## 11.48065 18.14732 15.02090
We code our graph, with more labels:
hesa_tspred_gg <- forecast(hesa_firstshare_ts, h = 12) %>%
autoplot() +
labs(title = "These forecasts have wide prediction intervals by 2029/30",
subtitle = "ETS(M,A,N) forecasts of the First class proportion of first-time graduates at UK institutions [%]. (80% and 95% prediction intervals shown.)",
caption = "Sources: HESA Higher Education Statistics; Forecast function in R.",
x = "Academic Year (End Year)", y = "") +
ylim(c(0,100))
The forecast graph is here:
hesa_tspred_gg
The proportions of Firsts achieved cannot be lower than 0% nor exceed 100%.
We transform this series with the logit transformation. This will map the bounded series (from 0 to 100) to the real line. After we have produced a forecast, we transform it back.
hesa_fslogit_df <- hesa_degree_df %>%
dplyr::mutate(First_Share_Logit = log(First_Share/(100-First_Share))) %>%
dplyr::select("AcademicYear_EY", "First_Share_Logit")
hesa_fslogit_ts <- ts(data = hesa_fslogit_df$First_Share_Logit,
start = 2004,
frequency = 1)
We forecast from our new time series:
hesa_fslogit_fc <- forecast(hesa_fslogit_ts, h = 12)
hesa_fslogit_fc$model
## ETS(A,A,N)
##
## Call:
## ets(y = object, lambda = lambda, biasadj = biasadj, allow.multiplicative.trend = allow.multiplicative.trend)
##
## Smoothing parameters:
## alpha = 0.9999
## beta = 0.7294
##
## Initial states:
## l = -2.1107
## b = 0.0404
##
## sigma: 0.0217
##
## AIC AICc BIC
## -68.98018 -62.31352 -65.43993
Next, we do the inverse transformation on each forecasted series:
hesa_fslogit_fc$mean <- 100*exp(hesa_fslogit_fc$mean)/(1+exp(hesa_fslogit_fc$mean))
hesa_fslogit_fc$lower <- 100*exp(hesa_fslogit_fc$lower)/(1+exp(hesa_fslogit_fc$lower))
hesa_fslogit_fc$upper <- 100*exp(hesa_fslogit_fc$upper)/(1+exp(hesa_fslogit_fc$upper))
hesa_fslogit_fc$x <- 100*exp(hesa_fslogit_fc$x)/(1+exp(hesa_fslogit_fc$x))
hesa_fslogit_fc
## Point Forecast Lo 80 Hi 80 Lo 95 Hi 95
## 2019 30.02986 29.44947 30.61673 29.14489 30.92997
## 2020 32.32391 31.12217 33.54945 30.49602 34.20743
## 2021 34.70627 32.73992 36.72623 31.72250 37.81525
## 2022 37.16779 34.30193 40.12685 32.82842 41.72446
## 2023 39.69775 35.80849 43.72172 33.81934 45.88944
## 2024 42.28403 37.25957 47.47336 34.70005 50.25002
## 2025 44.91330 38.65497 51.33694 35.47469 54.73303
## 2026 47.57132 39.99444 55.26168 36.14693 59.25567
## 2027 50.24316 41.27779 59.19292 36.72016 63.73068
## 2028 52.91362 42.50498 63.07489 37.19753 68.07251
## 2029 55.56750 43.67609 66.85364 37.58203 72.20357
## 2030 58.18999 44.79140 70.47994 37.87653 76.05956
Finally, we make the graph:
hesa_fslogit_gg <- hesa_fslogit_fc %>%
autoplot() +
labs(title = "For 2029/30, the central forecast is 58% (with the 80% prediction interval from 45% to 70%).",
subtitle = "ETS(A,A,N) forecast of logit-transformed Firsts proportion achieved by first-time graduates at UK institutions [%] (80% and 95% prediction intervals shown).",
caption = "Sources: HESA Higher Education Statistics; Forecast function in R.",
x = "Academic Year (End Year)", y = "") +
scale_y_continuous(limits = c(0,100),
expand = c(0,0))
The new forecast graph is here:
hesa_fslogit_gg