Purpose

This document provides the code for the graphs and forecast included in the article ‘A Degree of Extrapolation’. We look at HESA statistics on qualifications achieved by first-degree graduates. The data may be downloaded from a Google Sheet.

Packages and Themes

First, we download the required packages and set the theme.

library(tidyverse)
library(readxl)
library(forecast)

The theme for the graphs is here:

theme_clean <- theme_bw(base_family="Calibri") + 
  theme(legend.position = "top",
        legend.title = element_text(size = 12),
        legend.text = element_text(size = 12),
        plot.title = element_text(size = 18, face = "bold"),
        plot.subtitle = element_text(size = 12, face = "italic", margin = margin(b=12)),
        plot.caption = element_text(size = 10),
        panel.grid.major = element_blank(),
        panel.grid.minor = element_blank(),
        panel.border = element_blank())
theme_set(theme_clean)

Import and tidy the data

Using readxl, I draw the data from my Excel file. One column needs to be renamed, as the ‘/’ in “Third/Pass” is treated as a calculation in the mutate function.

hesa_degree_df <- read_excel("HESA Class of Degree by First Degree Graduates - 2019-12-30.xlsx",
                             sheet = "DATA") %>%
  dplyr::rename(ThirdPass = "Third/Pass") %>%
  dplyr::mutate(Classified = First + Upper_Second + Lower_Second + ThirdPass,
                First_Share = (First / Classified)* 100)

This table then needs to be tidied, with factors set:

hesa_class_df <- hesa_degree_df %>%
  select(1:5) %>%
  pivot_longer(cols = 2:5,
               names_to = "Class",
               values_to = "Count")
hesa_class_df$Class <- factor(hesa_class_df$Class,
                              levels = c("First", "Upper_Second", "Lower_Second", "ThirdPass"))

Draw the graphs

We then produce the graphs used in the article. The first is the count of degrees by class.

hesa_class_gg <- ggplot(data = hesa_class_df,
                        aes(x = AcademicYear_EY,
                            y = Count,
                            group = Class)) +
  geom_line(size = 1.2,
            aes(color = Class)) +
  scale_color_discrete(name = "Classification",
                       labels = c("First", "Upper Second", "Lower Second", "Third/Pass")) +
  scale_y_continuous(labels = scales::comma) +
  labs(title = "First and Upper Second degrees have generally risen in volume",
       subtitle = "Rounded classifications of first degree graduates, in each academic year at UK institutions.",
       caption = "Source: HESA Enrolment and Qualification reports; HESA Higher Education Student Statistics.",
       x = "Academic Year (End Year)",
       y = "")

The first graph is here:

hesa_class_gg

The second graph shows the proportion of Firsts.

hesa_firstshare_gg <- ggplot(data = hesa_degree_df,
                             aes(x = AcademicYear_EY,
                                 y = First_Share)) +
  geom_line(size = 1.2) +
  geom_point(fill = "white", size = 3, shape = 21) +
  geom_text(aes(label = round(First_Share)), nudge_y = 2) +
  ylim(c(0,30)) +
  labs(title = "28% of first-time graduates received a First-class degree in 2017/18",
       subtitle = "Proportion of Firsts among first degree graduates with classified degrees at UK institutions [%].",
       caption = "Source: HESA Enrolment and Qualification reports; HESA Higher Education Student Statistics.",
       x = "Academic Year (End Year)",
       y = "")

The second graph is here:

hesa_firstshare_gg

Forecasting

We use the forecasting function, for 12 years into the future. First, we convert the table into a time series:

hesa_firstshare_df <- hesa_degree_df %>%
  select("AcademicYear_EY", "First_Share")
hesa_firstshare_ts <- ts(data = hesa_firstshare_df$First_Share,
                         start = 2004,
                         frequency = 1)

The forecast function then produces the following prediction:

forecast(hesa_firstshare_ts, h = 12) %>% autoplot()

We can see the exact values by not using autoplot:

forecast(hesa_firstshare_ts, h = 12)
##      Point Forecast    Lo 80    Hi 80    Lo 95    Hi 95
## 2019       29.90600 29.16713 30.64486 28.77600 31.03599
## 2020       31.98033 30.32921 33.63144 29.45516 34.50549
## 2021       34.05465 31.26486 36.84445 29.78803 38.32128
## 2022       36.12898 31.99096 40.26701 29.80042 42.45755
## 2023       38.20331 32.51715 43.88947 29.50708 46.89955
## 2024       40.27764 32.84876 47.70653 28.91614 51.63914
## 2025       42.35197 32.98835 51.71559 28.03155 56.67239
## 2026       44.42630 32.93669 55.91591 26.85446 61.99815
## 2027       46.50063 32.69317 60.30809 25.38394 67.61732
## 2028       48.57496 32.25618 64.89374 23.61753 73.53239
## 2029       50.64929 31.62326 69.67531 21.55150 79.74708
## 2030       52.72362 30.79129 74.65595 19.18101 86.26623

Also, this is the model being used:

forecast(hesa_firstshare_ts, h = 12)$model
## ETS(M,A,N) 
## 
## Call:
##  ets(y = object, lambda = lambda, biasadj = biasadj, allow.multiplicative.trend = allow.multiplicative.trend) 
## 
##   Smoothing parameters:
##     alpha = 0.9999 
##     beta  = 0.9619 
## 
##   Initial states:
##     l = 10.8162 
##     b = 0.3826 
## 
##   sigma:  0.0193
## 
##      AIC     AICc      BIC 
## 11.48065 18.14732 15.02090

We code our graph, with more labels:

hesa_tspred_gg <- forecast(hesa_firstshare_ts, h = 12) %>%
  autoplot() +
  labs(title = "These forecasts have wide prediction intervals by 2029/30",
       subtitle = "ETS(M,A,N) forecasts of the First class proportion of first-time graduates at UK institutions [%]. (80% and 95% prediction intervals shown.)",
       caption = "Sources: HESA Higher Education Statistics; Forecast function in R.",
       x = "Academic Year (End Year)", y = "") +
  ylim(c(0,100))

The forecast graph is here:

hesa_tspred_gg

Forecasting (Logit Transformation)

The proportions of Firsts achieved cannot be lower than 0% nor exceed 100%.

We transform this series with the logit transformation. This will map the bounded series (from 0 to 100) to the real line. After we have produced a forecast, we transform it back.

hesa_fslogit_df <- hesa_degree_df %>%
  dplyr::mutate(First_Share_Logit = log(First_Share/(100-First_Share))) %>%
  dplyr::select("AcademicYear_EY", "First_Share_Logit")

hesa_fslogit_ts <- ts(data = hesa_fslogit_df$First_Share_Logit,
                         start = 2004,
                         frequency = 1)

We forecast from our new time series:

hesa_fslogit_fc <- forecast(hesa_fslogit_ts, h = 12)

hesa_fslogit_fc$model
## ETS(A,A,N) 
## 
## Call:
##  ets(y = object, lambda = lambda, biasadj = biasadj, allow.multiplicative.trend = allow.multiplicative.trend) 
## 
##   Smoothing parameters:
##     alpha = 0.9999 
##     beta  = 0.7294 
## 
##   Initial states:
##     l = -2.1107 
##     b = 0.0404 
## 
##   sigma:  0.0217
## 
##       AIC      AICc       BIC 
## -68.98018 -62.31352 -65.43993

Next, we do the inverse transformation on each forecasted series:

hesa_fslogit_fc$mean <- 100*exp(hesa_fslogit_fc$mean)/(1+exp(hesa_fslogit_fc$mean))
hesa_fslogit_fc$lower <- 100*exp(hesa_fslogit_fc$lower)/(1+exp(hesa_fslogit_fc$lower))
hesa_fslogit_fc$upper <- 100*exp(hesa_fslogit_fc$upper)/(1+exp(hesa_fslogit_fc$upper))
hesa_fslogit_fc$x <- 100*exp(hesa_fslogit_fc$x)/(1+exp(hesa_fslogit_fc$x))

hesa_fslogit_fc
##      Point Forecast    Lo 80    Hi 80    Lo 95    Hi 95
## 2019       30.02986 29.44947 30.61673 29.14489 30.92997
## 2020       32.32391 31.12217 33.54945 30.49602 34.20743
## 2021       34.70627 32.73992 36.72623 31.72250 37.81525
## 2022       37.16779 34.30193 40.12685 32.82842 41.72446
## 2023       39.69775 35.80849 43.72172 33.81934 45.88944
## 2024       42.28403 37.25957 47.47336 34.70005 50.25002
## 2025       44.91330 38.65497 51.33694 35.47469 54.73303
## 2026       47.57132 39.99444 55.26168 36.14693 59.25567
## 2027       50.24316 41.27779 59.19292 36.72016 63.73068
## 2028       52.91362 42.50498 63.07489 37.19753 68.07251
## 2029       55.56750 43.67609 66.85364 37.58203 72.20357
## 2030       58.18999 44.79140 70.47994 37.87653 76.05956

Finally, we make the graph:

hesa_fslogit_gg <- hesa_fslogit_fc %>%
  autoplot() +
  labs(title = "For 2029/30, the central forecast is 58% (with the 80% prediction interval from 45% to 70%).",
       subtitle = "ETS(A,A,N) forecast of logit-transformed Firsts proportion achieved by first-time graduates at UK institutions [%] (80% and 95% prediction intervals shown).",
       caption = "Sources: HESA Higher Education Statistics; Forecast function in R.",
       x = "Academic Year (End Year)", y = "") +
  scale_y_continuous(limits = c(0,100),
                     expand = c(0,0))

The new forecast graph is here:

hesa_fslogit_gg