A Degree of Extrapolation

Purpose

This document provides the code for the graphs and forecast included in the article ‘A Degree of Extrapolation’. We look at HESA statistics on qualifications achieved by first-degree graduates. The data may be downloaded from a Google Sheet.

Packages and Themes

First, we download the required packages and set the theme.

library(tidyverse)
library(readxl)
library(forecast)

The theme for the graphs is here:

theme_clean <- theme_bw(base_family="Calibri") + 
  theme(legend.position = "top",
        legend.title = element_text(size = 12),
        legend.text = element_text(size = 12),
        plot.title = element_text(size = 18, face = "bold"),
        plot.subtitle = element_text(size = 12, face = "italic", margin = margin(b=12)),
        plot.caption = element_text(size = 10),
        panel.grid.major = element_blank(),
        panel.grid.minor = element_blank(),
        panel.border = element_blank())
theme_set(theme_clean)

Import and tidy the data

Using readxl, I draw the data from my Excel file. One column needs to be renamed, as the ‘/’ in “Third/Pass” is treated as a calculation in the mutate function.

hesa_degree_df <- read_excel("HESA Class of Degree by First Degree Graduates - 2019-12-30.xlsx",
                             sheet = "DATA") %>%
  dplyr::rename(ThirdPass = "Third/Pass") %>%
  dplyr::mutate(Classified = First + Upper_Second + Lower_Second + ThirdPass,
                First_Share = (First / Classified)* 100)

This table then needs to be tidied, with factors set:

hesa_class_df <- hesa_degree_df %>%
  select(1:5) %>%
  pivot_longer(cols = 2:5,
               names_to = "Class",
               values_to = "Count")
hesa_class_df$Class <- factor(hesa_class_df$Class,
                              levels = c("First", "Upper_Second", "Lower_Second", "ThirdPass"))

Draw the graphs

We then produce the graphs used in the article. The first is the count of degrees by class.

hesa_class_gg <- ggplot(data = hesa_class_df,
                        aes(x = AcademicYear_EY,
                            y = Count,
                            group = Class)) +
  geom_line(size = 1.2,
            aes(color = Class)) +
  scale_color_discrete(name = "Classification",
                       labels = c("First", "Upper Second", "Lower Second", "Third/Pass")) +
  scale_y_continuous(labels = scales::comma) +
  labs(title = "First and Upper Second degrees have generally risen in volume",
       subtitle = "Rounded classifications of first degree graduates, in each academic year at UK institutions.",
       caption = "Source: HESA Enrolment and Qualification reports; HESA Higher Education Student Statistics.",
       x = "Academic Year (End Year)",
       y = "")

The first graph is here:

hesa_class_gg

The second graph shows the proportion of Firsts.

hesa_firstshare_gg <- ggplot(data = hesa_degree_df,
                             aes(x = AcademicYear_EY,
                                 y = First_Share)) +
  geom_line(size = 1.2) +
  geom_point(fill = "white", size = 3, shape = 21) +
  geom_text(aes(label = round(First_Share)), nudge_y = 2) +
  ylim(c(0,30)) +
  labs(title = "28% of first-time graduates received a First-class degree in 2017/18",
       subtitle = "Proportion of Firsts among first degree graduates with classified degrees at UK institutions [%].",
       caption = "Source: HESA Enrolment and Qualification reports; HESA Higher Education Student Statistics.",
       x = "Academic Year (End Year)",
       y = "")

The second graph is here:

hesa_firstshare_gg

Forecasting

We use the forecasting function, for 12 years into the future. First, we convert the table into a time series:

hesa_firstshare_df <- hesa_degree_df %>%
  select("AcademicYear_EY", "First_Share")
hesa_firstshare_ts <- ts(data = hesa_firstshare_df$First_Share,
                         start = 2004,
                         frequency = 1)

The forecast function then produces the following prediction:

forecast(hesa_firstshare_ts, h = 12) %>% autoplot()

We can see the exact values by not using autoplot:

forecast(hesa_firstshare_ts, h = 12)

##      Point Forecast    Lo 80    Hi 80    Lo 95    Hi 95
## 2019       29.90600 29.16713 30.64486 28.77600 31.03599
## 2020       31.98033 30.32921 33.63144 29.45516 34.50549
## 2021       34.05465 31.26486 36.84445 29.78803 38.32128
## 2022       36.12898 31.99096 40.26701 29.80042 42.45755
## 2023       38.20331 32.51715 43.88947 29.50708 46.89955
## 2024       40.27764 32.84876 47.70653 28.91614 51.63914
## 2025       42.35197 32.98835 51.71559 28.03155 56.67239
## 2026       44.42630 32.93669 55.91591 26.85446 61.99815
## 2027       46.50063 32.69317 60.30809 25.38394 67.61732
## 2028       48.57496 32.25618 64.89374 23.61753 73.53239
## 2029       50.64929 31.62326 69.67531 21.55150 79.74708
## 2030       52.72362 30.79129 74.65595 19.18101 86.26623

Also, this is the model being used:

forecast(hesa_firstshare_ts, h = 12)$model

## ETS(M,A,N) 
## 
## Call:
##  ets(y = object, lambda = lambda, biasadj = biasadj, allow.multiplicative.trend = allow.multiplicative.trend) 
## 
##   Smoothing parameters:
##     alpha = 0.9999 
##     beta  = 0.9619 
## 
##   Initial states:
##     l = 10.8162 
##     b = 0.3826 
## 
##   sigma:  0.0193
## 
##      AIC     AICc      BIC 
## 11.48065 18.14732 15.02090

We code our graph, with more labels:

hesa_tspred_gg <- forecast(hesa_firstshare_ts, h = 12) %>%
  autoplot() +
  labs(title = "These forecasts have wide prediction intervals by 2029/30",
       subtitle = "ETS(M,A,N) forecasts of the First class proportion of first-time graduates at UK institutions [%]. (80% and 95% prediction intervals shown.)",
       caption = "Sources: HESA Higher Education Statistics; Forecast function in R.",
       x = "Academic Year (End Year)", y = "") +
  ylim(c(0,100))

The forecast graph is here:

hesa_tspred_gg

Forecasting (Logit Transformation)

The proportions of Firsts achieved cannot be lower than 0% nor exceed 100%.

We transform this series with the logit transformation. This will map the bounded series (from 0 to 100) to the real line. After we have produced a forecast, we transform it back.

hesa_fslogit_df <- hesa_degree_df %>%
  dplyr::mutate(First_Share_Logit = log(First_Share/(100-First_Share))) %>%
  dplyr::select("AcademicYear_EY", "First_Share_Logit")

hesa_fslogit_ts <- ts(data = hesa_fslogit_df$First_Share_Logit,
                         start = 2004,
                         frequency = 1)

We forecast from our new time series:

hesa_fslogit_fc <- forecast(hesa_fslogit_ts, h = 12)

hesa_fslogit_fc$model

## ETS(A,A,N) 
## 
## Call:
##  ets(y = object, lambda = lambda, biasadj = biasadj, allow.multiplicative.trend = allow.multiplicative.trend) 
## 
##   Smoothing parameters:
##     alpha = 0.9999 
##     beta  = 0.7294 
## 
##   Initial states:
##     l = -2.1107 
##     b = 0.0404 
## 
##   sigma:  0.0217
## 
##       AIC      AICc       BIC 
## -68.98018 -62.31352 -65.43993

Next, we do the inverse transformation on each forecasted series:

hesa_fslogit_fc$mean <- 100*exp(hesa_fslogit_fc$mean)/(1+exp(hesa_fslogit_fc$mean))
hesa_fslogit_fc$lower <- 100*exp(hesa_fslogit_fc$lower)/(1+exp(hesa_fslogit_fc$lower))
hesa_fslogit_fc$upper <- 100*exp(hesa_fslogit_fc$upper)/(1+exp(hesa_fslogit_fc$upper))
hesa_fslogit_fc$x <- 100*exp(hesa_fslogit_fc$x)/(1+exp(hesa_fslogit_fc$x))

hesa_fslogit_fc

##      Point Forecast    Lo 80    Hi 80    Lo 95    Hi 95
## 2019       30.02986 29.44947 30.61673 29.14489 30.92997
## 2020       32.32391 31.12217 33.54945 30.49602 34.20743
## 2021       34.70627 32.73992 36.72623 31.72250 37.81525
## 2022       37.16779 34.30193 40.12685 32.82842 41.72446
## 2023       39.69775 35.80849 43.72172 33.81934 45.88944
## 2024       42.28403 37.25957 47.47336 34.70005 50.25002
## 2025       44.91330 38.65497 51.33694 35.47469 54.73303
## 2026       47.57132 39.99444 55.26168 36.14693 59.25567
## 2027       50.24316 41.27779 59.19292 36.72016 63.73068
## 2028       52.91362 42.50498 63.07489 37.19753 68.07251
## 2029       55.56750 43.67609 66.85364 37.58203 72.20357
## 2030       58.18999 44.79140 70.47994 37.87653 76.05956

Finally, we make the graph:

hesa_fslogit_gg <- hesa_fslogit_fc %>%
  autoplot() +
  labs(title = "For 2029/30, the central forecast is 58% (with the 80% prediction interval from 45% to 70%).",
       subtitle = "ETS(A,A,N) forecast of logit-transformed Firsts proportion achieved by first-time graduates at UK institutions [%] (80% and 95% prediction intervals shown).",
       caption = "Sources: HESA Higher Education Statistics; Forecast function in R.",
       x = "Academic Year (End Year)", y = "") +
  scale_y_continuous(limits = c(0,100),
                     expand = c(0,0))

The new forecast graph is here:

hesa_fslogit_gg