Introduction

This presentation explains how education level impacts income, savings, and net worth over time using a simulated dataset in R.

People in the training population have several variables:

  • A randomly assigned education, with factors such as:
    • Student loans
    • Costs of living
    • Age they start working
  • Age
    • Determines work experience
    • Determines if person is still in school
  • Net worth (Calculated based on these factors over time)

Required libraries

# if (!require(ggplot2)) install.packages("ggplot2")
# if (!require(dplyr)) install.packages("dplyr")

library(ggplot2)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

(I disabled these so that it doesnt mess with the slides) ## Data setup

set.seed(42)

N <- 50000
EducationLevels <- c("High School", "Associate", "Bachelor", "Master", "PhD", "Professional")

Df <- data.frame(
  ID = 1:N,
  Age = sample(18:70, N, TRUE),
  Education = sample(EducationLevels, N, TRUE, prob = c(0.15, 0.25, 0.25, 0.15, 0.10, 0.10))
)

This code creates 50,000 instances as a training population (I tested with a testing population in a sperate notebook and got 96% accuracy. When I asked chatGPT to cross reference this with studies and surveys I got around a 85% accuracy.

School end age and employment

EducationAges <- c("High School" = 18, "Associate" = 20, "Bachelor" = 22, 
                   "Master" = 24, "PhD" = 28, "Professional" = 30)
Df <- Df %>%
  mutate(
    SchoolEndAge = EducationAges[Education],
    EmploymentStatus = dplyr::case_when(
      Age < SchoolEndAge ~ "Student",
      Age <= 21 ~ "Part-time",
      Age < 55 ~ "Full-time",
      TRUE ~ "Retired"
    ),
    WorkExperience = pmax(0, Age - SchoolEndAge)
  )

This code decides when a certain “person” starts and stops working and also decides how long they have been working (necessary for exp bumps)(raises based off of work experience)

Income generation function

GenerateIncome <- function(Edu, Exp, Status) {
  if (Status != "Full-time") return(0)
  BaseSalaries <- c("High School" = 770 * 52, "Associate" = 900 * 52, "Bachelor" = 1300 * 52,
                    "Master" = 1500 * 52, "PhD" = 1800 * 52, "Professional" = 2100 * 52)
  ExpBumps <- c("High School" = 700, "Associate" = 800, "Bachelor" = 900,
                "Master" = 1000, "PhD" = 1200, "Professional" = 1500)
  Sds <- c("High School" = 9000, "Associate" = 11000, "Bachelor" = 13000,
           "Master" = 17000, "PhD" = 22000, "Professional" = 10000)
  
  rnorm(1, BaseSalaries[Edu] + Exp * ExpBumps[Edu], Sds[Edu])
}

Base income straight after their education with no experience (higher with better educations) then raises based off of work experience (randomized) and then because of the random element in it I added a sds function to remove outliers

Apply income and trim outliers

Df$Income <- mapply(GenerateIncome, Df$Education, Df$WorkExperience, Df$EmploymentStatus)
Df <- Df[abs(scale(Df$Income)) < 3, ]

This just adds all the data and removes the outliers before graphing. This uses standard deviations with this formula:

  • z = (x - μ) / σ
  • z| < 3

and this makes sure that none of those outliers skew the graph either way

Annual savings

CostOfLiving <- c("High School" = 40000, "Associate" = 45000, 
                  "Bachelor" = 50000, "Master" = 55000, 
                  "PhD" = 55000, "Professional" = 55000)

Df <- Df %>%
  mutate(
    AnnualSavings = pmax(Income - CostOfLiving[Education], 0)
  )

The code here subtracts the cost of living and then uses it later to calculate the annual savings of someone with their income/education.

Student loans

StudentLoans <- data.frame(
  Education = EducationLevels,
  TotalDebt = c(0, 0, 30000, 45000, 55000, 90000),
  MonthlyPayment = c(0, 0, 300, 450, 450, 700),
  LoanYears = c(0, 0, 10, 10, 12, 15)
)

Df <- Df %>%
  left_join(StudentLoans, by = "Education") %>%
  mutate(
    YearsSinceSchool = pmax(0, Age - SchoolEndAge),
    LoanPayment = ifelse(YearsSinceSchool < LoanYears, 
    MonthlyPayment * 12, 0), AdjustedSavings = pmax
    (AnnualSavings - LoanPayment, 0))

timed payments subtract money early in life and impact savings and later the amount of interest they gain

Net worth calculation

InterestRate <- 0.025

Df <- Df %>%
  mutate(
    NetWorth = ifelse(
      EmploymentStatus == "Full-time" & WorkExperience > 0,
      AdjustedSavings * ((1 + InterestRate)^WorkExperience - 1) / InterestRate,
      0
    )
  )

This “rewards” someone based on how much money they saved after a year. For example if someone pays several thousand a year on loans then they will have less money in savings and they will not gain as much interest.

Summarizing by age and education

SummaryDf <- Df %>%
  filter(EmploymentStatus == "Full-time", Age >= 20, Age <= 54) %>%
  group_by(Age, Education) %>%
  summarise(AvgNetWorth = mean(NetWorth, na.rm = TRUE), .groups = 'drop')

This decides where on the graph each person will go based off of their education, age, and net worth. This is then implemented into the next segment of code and ran.

Visualization

EducationColors <- c("High School" = "green",
                     "Associate" = "red",
                     "Bachelor" = "goldenrod",
                     "Master" = "deepskyblue",
                     "PhD" = "black",
                     "Professional" = "hotpink")

P <- ggplot(SummaryDf, aes(x = Age, y = AvgNetWorth, color = Education)) +
  geom_smooth(se = FALSE, method = "loess", size = 1.2) +
  labs(title = "Estimated net worth vs age by education",
       x = "Age", y = "Estimated net worth ($)", color = "Education") +
  scale_color_manual(values = EducationColors) +
  scale_y_continuous(limits = c(0, 2500000), labels = scales::comma) +
  scale_x_continuous(limits = c(20, 54)) +
  theme_minimal() +
  theme(legend.position = "bottom")
## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
P
## `geom_smooth()` using formula = 'y ~ x'
## Warning: Removed 3 rows containing non-finite outside the scale range
## (`stat_smooth()`).
## Warning: Removed 1 row containing missing values or values outside the scale range
## (`geom_smooth()`).

Conclusion

This graph explains how much extra money someone on average will have based off of their education. An example of this would be people who get more education such as PhD and Professional have less money earlier in life have more debts to pay but they also end up with more money later in life.