Introduction

A review of the data over a 3-year period of program performance — December 2019 to December 2022 — for OregonSaves, Illinois Secure Choice, and CalSavers, highlighting several positive trends.
Note: This report only focuses on Total Assets indicator.

Data at a Glance

df <- as.data.frame(Assets_by_Program)
datatable(df, options = list(dom = 't'))

Data Cleaning and Preprocessing

In the data preprocessing and cleaning steps, several important procedures were carried out to ensure the quality and completeness of the dataset. These steps are detailed below:

1. Removing Duplicate Rows: Two duplicate rows in the dataset were identified and eliminated using the unique function in R. This process helps to ensure that each data point in the dataset is unique, preventing any duplication that might lead to inaccurate analysis or skewed results.

2. Handling Missing Values in “CalSavers”: In the “CalSavers” column, four rows contained missing values. This occurred because CalSavers did not initiate monthly reporting until June 30, 2020, and instead had a previous practice of reporting on a quarterly basis. To resolve this issue and align the data with the reporting frequency, we applied the Last Observation Carried Forward (LOCF) imputation method. The LOCF technique involves replacing missing values with the most recent observed value in the time series. This method was chosen to ensure that the imputed values accurately reflect the temporal context of the data. By utilizing the most recent available data point, we maintain the continuity of the information, addressing the missing values without disrupting the dataset’s chronological integrity. This approach enhances the dataset’s completeness and allows for more meaningful analysis and interpretation.

3. Filling in Missing “Total” Values: The “Total” column had four missing values as well because Calsavers was missing four values. To calculate the Total value for all the three states, the sum of the “OregonSaves,” “IL Secure Choice,” and “CalSavers” columns for their respective rows was computed using the rowSums function. This step allowed us to ensure that the “Total” column is complete and accurately reflects the total assets under management for each corresponding row.

These data cleaning and preprocessing steps were crucial to enhance the dataset’s reliability and suitability for subsequent analysis. Removing duplicates, imputing missing values, and filling in the “Total” column contribute to the overall data quality and integrity, which is essential for robust and accurate analysis and reporting.

# Cleaned look of the data
df <- as.data.frame(Assets_by_Program)
datatable(df, options = list(dom = 't'))

Total Assets Visualizations

State Retirement Programs - Total Assets Growth

This plot provides an overview of the growth rates of total assets across different state retirement programs over time. The purpose of this plot is to visualize how the assets under management in these programs have changed month by month.

X-Axis (Date): The horizontal axis represents time, with each point on the axis corresponding to a specific month and year.

Y-Axis (Value): The vertical axis represents the growth rate of total assets. It measures how much the total assets have changed relative to the previous month.

Programs: Each line in the plot represents a different retirement program. The legend on the right side of the plot indicates which program each color corresponds to. In this plot, you can see the growth rates for “OregonSaves,” “IL Secure Choice,” “CalSavers,” and “Total.”

Hover Information: When you hover your cursor over a data point in the plot, you can see additional information, such as the exact growth rate for that month. This information helps you understand the magnitude of the changes.

The plot allows you to visually compare the growth rates of the different state retirement programs over time. You can observe trends, fluctuations, and any significant changes in the assets under management for each program.

library(dplyr)
library(tidyr)
library(plotly)

# Calculate growth rates for each program
your_data <- Assets_by_Program %>%
  arrange(Date) %>%
  mutate(
    OregonSaves_Growth = (OregonSaves / lag(OregonSaves)) - 1,
    `IL Secure Choice_Growth` = (`IL Secure Choice` / lag(`IL Secure Choice`)) - 1,
    CalSavers_Growth = (CalSavers / lag(CalSavers)) - 1,
    Total_Growth = (Total / lag(Total)) - 1
  )

# Melt the data to long format
growth_melted <- your_data %>%
  select(Date, starts_with("OregonSaves"), starts_with("IL Secure Choice"), starts_with("CalSavers"), starts_with("Total")) %>%
  pivot_longer(cols = -Date, names_to = "Program", values_to = "Value")

# Split data into assets and growth
assets_data <- growth_melted %>%
  filter(!grepl("_Growth", Program))

growth_data <- growth_melted %>%
  filter(grepl("_Growth", Program)) %>%
  rename(Value2 = Value, Program2 = Program, Date2 = Date)

# Merge assets and growth data
data <- bind_cols(assets_data, growth_data)

# Create the Plotly plot
plot <- plot_ly(data, x = ~Date, y = ~Value, color = ~Program, 
                colors = c("Total" = "black", "OregonSaves" = "darkgreen", 
                           "IL Secure Choice" = "orange", "CalSavers" = "blue"),
                text = ~paste("Growth Rate: ", round(Value2 * 100, digits = 2), "%")) %>%
  #add_lines(line = list(width = 3)) %>%
  add_lines(data = data %>% filter(Program != "Total"), x = ~Date, y = ~Value, color = ~Program,
            line = list(width = 2)) %>%
  add_lines(data = data %>% filter(Program == "Total"), x = ~Date, y = ~Value, color = ~Program,
            line = list(width = 4)) %>%
  layout(
    title = list(text = "Total Assets", font = list(size = 17, color = "black")),
    xaxis = list(title = "Date", titlefont = list(size = 14, color = "black")),
    yaxis = list(title = "Assets (Millions)", titlefont = list(size = 14, color = "black")),
    hovermode = "x unified",
    height = 400,  
    width = 900  
  )

# Print the plot
plot
long_data <- your_data %>%
  gather(key = "State", value = "Assets", -Date)

long_data2 <- long_data %>%
  filter(State!= "Total",
         State!= "Total_Growth") %>%
  mutate(Type = ifelse(grepl("_Growth", State), "Growth", "Value"))

# Now plot the data with faceting
ggplot() +
  geom_line(data = long_data2, aes(x = Date, y = Assets, color = State)) +
  #geom_line(data = growth_data, aes(x = Date2, y = Value2, color = Program2)) +
  facet_wrap(~ Type, scales = "free_y", ncol = 1) +
  scale_color_manual(values = c("OregonSaves" = "green", "IL Secure Choice" = "orange", "CalSavers" = "blue", "Total" = "black", "OregonSaves_Growth" = "green", "IL Secure Choice_Growth" = "orange", "CalSavers_Growth" = "blue")) +
  labs(title = "State Retirement Programs - Assets Over Time",
       x = "Date",
       y = "Assets",
       color = "Program") +
  theme_minimal() +
  theme(legend.position = "bottom") +
  guides(color = guide_legend(title = "Program"))
## Warning: Removed 3 rows containing missing values (`geom_line()`).

# Define a custom label function for millions
label_millions <- function() {
  function(x) {
    paste0(round(x / 1e6, 1), "M")
  }
}

# Filter for the first and last dates for each program
annotate_data <- your_data %>%
  filter(Date == as.Date('2019-12-31') | Date == as.Date('2022-12-31')) %>%
  gather(key = "Program", value = "Value", -Date) %>%
  arrange(Program, Date)

# Now plot the data, format the y-axis, and add annotations
ggplot(your_data, aes(x = Date)) +
  geom_line(aes(y = Total, color = "Total"), size = 1.2) +
  geom_line(aes(y = OregonSaves, color = "OregonSaves")) +
  geom_line(aes(y = `IL Secure Choice`, color = "IL Secure Choice")) +
  geom_line(aes(y = CalSavers, color = "CalSavers")) +
  geom_text(data = annotate_data, aes(y = Value, label = paste0(round(Value / 1e6, 1), "M"), color = Program), 
            vjust = 1, hjust = 1, nudge_y = 0.02 * max(your_data$Total)) +
  scale_y_continuous(labels = label_millions()) + # This will apply the custom label function to y-axis
  scale_color_manual(values = c("Total" = "black", "OregonSaves" = "darkgreen", "IL Secure Choice" = "orange", "CalSavers" = "blue")) +
  labs(title = "Total Assets",
       x = "Date",
       y = "Assets (Millions)",
       color = "Legend") +
  theme_minimal() +
  theme(plot.title = element_text(face = "bold", size = 24, hjust = 0.5)) +
  guides(color = guide_legend(title = "Program"))

Plot Analysis/ Interpretation

Total Assets Growth: Total assets under management grew significantly, increasing by over 1000%, from $54.8 million in December 2019 to $640.2 million in December 2022.

CalSavers Key Insights

Rapid Asset Expansion

  • CalSavers commenced in December 2019 with assets amounting to a mere $1.4 million.
  • By December 2022, it witnessed a monumental expansion in its assets, catapulting the figure to $373.0 million.
  • This translates to a growth rate exceeding 26,000% over the span of three years.

Growth Pattern

  • The asset accumulation for CalSavers demonstrates an exponential growth curve, underlined by a robust positive trend.
  • On a month-on-month basis, CalSavers averaged a growth rate of 18.3%.
  • When annualized, this rate further magnifies, reflecting an average yearly growth of 219.6%.

Highlight on Monthly Percent Growth

  • Analyzing monthly growth percentages, December 2020 emerged as the pinnacle for CalSavers, where it recorded a monthly growth rate of 42.7%.

Important Note: A crucial factor to remember while interpreting these numbers is the preprocessing strategy employed on the dataset. The Last Observation Carried Forward (LOCF) method was utilized to handle NA values. This preprocessing, between January 2020 to June 2020, inadvertently inflated the growth percentages, making them appear higher. Thus, while some months in this range showed growth rates as high as 108% and 76%, these figures are not representative of the actual growth and are outcomes of the preprocessing strategy. The real, unskewed data was available post-June 2020, reaffirming the aforementioned peak growth rate in December 2020.

Comparison with Other State Programs

  • Initial figures from December 2019 depict CalSavers lagging behind its counterparts - Illinois Secure Choice and OregonSaves, in terms of total assets.

  • However, this narrative shifted dramatically as CalSavers displayed an aggressive growth pattern:

    • By March 2021, CalSavers outpaced Illinois in asset accumulation.
    • A few months later, by October 2021, it surpassed OregonSaves.
  • Post these milestones, CalSavers not only retained its lead but managed to substantially widen the asset gap with its counterparts.

Illinois Secure Choice Key Insights

OregonSaves Key Insights

Percentage Contribution of Retirement Programs Over Time - Total Assets

library(ggplot2)

# Calculate the percentage contribution for each program
percent_data <- Assets_by_Program %>%
  mutate(
    OregonSaves_pct = (OregonSaves / Total) * 100,
    `IL Secure Choice_pct` = (`IL Secure Choice` / Total) * 100,
    CalSavers_pct = (CalSavers / Total) * 100
  ) %>%
  mutate(year_month = paste(year(Date), month(Date, label = TRUE, abbr = TRUE))) %>%
  filter(year_month %in% c("2019 Dec","2020 Dec", "2021 Dec", "2022 Dec")) %>%
  select(Date, OregonSaves_pct, `IL Secure Choice_pct`, CalSavers_pct)
  
visualized_data <-percent_data %>%
  pivot_longer(cols = -Date, names_to = "Program", values_to = "Value")


# Modify the Date column to display only year and month
visualized_data$Date <- format(visualized_data$Date, "%Y %b")

# Create the static stacked bar chart using ggplot2
p <- ggplot(visualized_data, aes(x = Date, y = Value, fill = Program)) +
  geom_bar(stat = "identity", position = "stack") +
  labs(
    title = "Percentage Contribution of Retirement Programs Over Time",
    x = "Date",
    y = "Percentage"
  ) +
  geom_text(
    aes(label = sprintf("%.1f%%", Value), y = Value),
    position = position_stack(vjust = 0.5),
    size = 3
  ) +
  theme_minimal()

# Print the plot
print(p)

Seasonality Visualization

Assets_by_Program$Date <- as.Date(Assets_by_Program$Date)
t<-Assets_by_Program %>%
  select(Date, Total)

ts_data <- ts(t[, -1], start = c(2019, 12), frequency = 12)
ggsubseriesplot(ts_data)

ts_data <- c(40985051, 43831388, 45957188, 46302255, 51087593, 54650909, 58159405, 
             62516577, 66846469, 69167837, 71325749, 79106584, 84741739, 87886663, 
             92275410, 99147494, 118898674, 133535002, 107217323, 113149423, 
             125013352, 131548621, 140695842, 140697792, 150010539, 147278855, 
             148076981, 153896905, 147788686, 151300357, 146004291, 158032364, 
             156901495, 148185293, 156888309, 170449511, 168705846)

# Create a ts object with monthly frequency and starting year and month
ts_object <- ts(ts_data, frequency = 12, start = c(2019, 12))

# Perform STL decomposition (s.window = "periodic" implies a stable seasonal pattern)
stl_decomp <- stl(ts_object, s.window = "periodic")

# Plot the STL decomposition
autoplot(stl_decomp)

# Extract the components
trend <- stl_decomp$time.series[, "trend"]
seasonal <- stl_decomp$time.series[, "seasonal"]
remainder <- stl_decomp$time.series[, "remainder"]

# Calculate variances
total_variance <- var(ts_object)
trend_variance <- var(trend, na.rm = TRUE)
seasonal_variance <- var(seasonal, na.rm = TRUE)
remainder_variance <- var(remainder, na.rm = TRUE)

# Calculate strength measures
strength_of_trend <- trend_variance / total_variance
strength_of_seasonality <- seasonal_variance / total_variance

# Print the strength measures
print(paste("Strength of trend:", strength_of_trend))
## [1] "Strength of trend: 0.932901707541205"
print(paste("Strength of seasonality:", strength_of_seasonality))
## [1] "Strength of seasonality: 0.00730811152430397"

Seasonality Analysis/ Interpretation

working on it…

seasonal_component <- stl_decomp$time.series[, "seasonal"]

# Find the index of the peak and trough
peak_index <- which.max(seasonal_component)
trough_index <- which.min(seasonal_component)

# Get the corresponding months for the peak and trough
peak_month <- cycle(ts_object)[peak_index]
trough_month <- cycle(ts_object)[trough_index]

# Since 'cycle' gives us the position in the cycle, we need to map it to actual month names
months <- month.name
peak_month_name <- months[peak_month]
trough_month_name <- months[trough_month]

# Output the result
cat("The peak of the seasonal component occurs in:", peak_month_name, "\n")
## The peak of the seasonal component occurs in: November
cat("The trough of the seasonal component occurs in:", trough_month_name, "\n")
## The trough of the seasonal component occurs in: June