This document explores all data related to the care board, with the goal of documenting the logic behind the final care board figures. Rather than presenting a succinct causal story, it lays out a logical flow that makes clear every assumption we make and where all the data we use comes from. As such, it draws on figures from a variety of sources, all with the aim of analyzing the care economy.

The “Care Economy” refers to the economic activities, institutions, and systems related to the provision of care for people, such as children, the elderly, and those with disabilities or health conditions. It includes both paid and unpaid care work, encompassing formal sectors like health care, elder care, and child care services, as well as informal care often provided by family members or friends. The care economy is vital for societal well-being, yet it is often undervalued and undercompensated, particularly the unpaid labor that disproportionately falls on women. It plays a crucial role in supporting economic productivity by enabling other workers to engage in paid employment while ensuring the health and welfare of dependent populations.

The first step is loading all the packages required for this analysis.

if (!requireNamespace("pacman", quietly = TRUE)) install.packages("pacman")
pacman::p_load(
  ipumsr,
  tidyverse,
  data.table,
  caret,
  randomForest,
  doParallel,
  sf,
  tmap,
  viridis,
  plotly,
  networkD3)

options(scipen = 999)

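Here’s a bullet-point explanation of what each package does:

- ipumsr: reads IPUMS data extracts and their metadata.
- tidyverse: data manipulation and plotting (dplyr, tidyr, ggplot2, and related packages).
- data.table: fast data reading (fread) and manipulation.
- caret: training and tuning of predictive models.
- randomForest: random forest classification and regression.
- doParallel: parallel backend for foreach.
- sf: reading and manipulating spatial vector data.
- tmap: thematic maps.
- viridis: colorblind-friendly color scales.
- plotly: interactive plots.
- networkD3: interactive network graphics, including Sankey diagrams.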

Demand

The first section of this document asks: what is the societal demand for care? It looks at demand for care across both age and health characteristics in order to predict what the demand for care looks like throughout the United States of America. To address the demand for care, we start with the distribution of ages throughout the United States, which we take from the results of the 2010 Census. In future iterations we will expand this to use more up-to-date yearly information.

setwd("C:/Users/sc363/OneDrive/Work Items/Workspace/CareBoard/")

age_data <- fread("./00_CENSUSAges.csv")

Based on this data, we determine how much care time the average healthy member of each age group needs. For instance, a newborn naturally needs more care than a grown adult, so we draw distinctions between these groups. We fully acknowledge that care needs vary widely within each group and that assigning a single value can be problematic. Our aim, however, is to describe the median member of each group and how much care they need. To inform our view of the median member of each group, we use both the literature on the effort spent by caregivers and information from various institutions on the topic. We also have a considerable amount of data on the actual time spent on caregiving, which we can use to check that our assumptions are valid, but to start with we present our demand assumptions below. To be clear on what is included in caregiving here we include the following things as “Caregiving”

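Helpfully, we know how much time people typically spend providing these services in the informal and formal economies. But how much demand for these services is there? We provide our analysis below, multiplying the population of each age group by the hours of care per day we assume a member of that group needs.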

age_data <- age_data %>%
  ungroup() %>%
  summarise(
    child_under_five = sum(CN5AA2010) * 24,
    child_five_nine = sum(CN5AB2010) * 24,
    child_ten_fourteen = sum(CN5AC2010) * 24,
    child_fifteen_seventeen = sum(CN5AD2010) * 16,
    adult_eighteen_nineteen = sum(CN5AE2010) * 10,
    adult_twenty = sum(CN5AF2010) * 8,
    adult_twentyone = sum(CN5AG2010) * 6,
    adult_twentytwo_twentyfour = sum(CN5AH2010) * 4,
    adult_twentyfive_twentynine = sum(CN5AI2010) * 2,
    adult_thirty_thirtyfour = sum(CN5AJ2010) * 2,
    adult_thirtyfive_thirtynine = sum(CN5AK2010) * 2,
    adult_forty_fortyfour = sum(CN5AL2010) * 2,
    adult_fortyfive_fortynine = sum(CN5AM2010) * 2,
    adult_fifty_fiftyfour = sum(CN5AN2010) * 2,
    adult_fiftyfive_fiftynine = sum(CN5AO2010) * 3,
    adult_sixty_sixtyone = sum(CN5AP2010) * 3,
    adult_sixty_two_sixtyfour = sum(CN5AQ2010) * 4,
    adult_sixtyfive_sixtysix = sum(CN5AR2010) * 4,
    adult_sixty_seven_sixtynine = sum(CN5AS2010) * 5,
    adult_seventy_seventyfour = sum(CN5AT2010) * 6,
    adult_seventyfive_seventynine = sum(CN5AU2010) * 12,
    adult_eighty_eightyfour = sum(CN5AV2010) * 16,
    adult_eightyfive_over = sum(CN5AW2010) * 20
  )
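The calculation above multiplies the population of each age bin by an assumed number of daily care hours. As a minimal sketch of that logic, the base R example below uses made-up counts (the `toy_counts` data frame and its values are hypothetical, not Census figures) together with three of the hour weights used above.

```r
# Hypothetical population counts for three age bins across two areas
# (illustrative values only, not Census data)
toy_counts <- data.frame(
  under_five        = c(100, 50),
  thirty_thirtyfour = c(200, 100),
  eightyfive_over   = c(10, 5)
)

# Assumed daily care hours per person, copied from the weights used above
care_hours <- c(under_five = 24, thirty_thirtyfour = 2, eightyfive_over = 20)

# Total demand per bin = total people in the bin x hours each person needs
toy_demand <- colSums(toy_counts) * care_hours[names(toy_counts)]
print(toy_demand)
```

Even in this toy case the youngest bin dominates total demand despite the prime-age bin holding twice as many people, which is the pattern the hour weights are designed to capture.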

Let's now look at how this demand is distributed across the different care groups in society. The plot below shows total societal demand by age group. Societal demand reflects both the hours needed to care for a member of that age group and the number of people within that age group. Bar widths vary because the age bins currently used differ in width: some cover a single year of age while others aggregate five or more years.

age_data_sums <- age_data %>%
  summarise(across(starts_with("child_") | starts_with("adult_"), \(x) sum(x, na.rm = TRUE))) %>%
  pivot_longer(cols = everything(),
               names_to = "age_group",
               values_to = "total_count")
# Ensure the age_group factor levels follow the original dataset order
age_data_sums$age_group <- factor(age_data_sums$age_group, levels = age_data_sums$age_group)
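# Width of each age bin in years, in the same order as age_group
width <- c(5,5,5,3,2,1,1,3,5,5,5,5,5,5,5,2,3,2,3,5,5,5,10)

# Scale the bin widths down to usable bar widths
age_data_sums$width <- width * 0.12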


# Create a bar chart with adjusted widths for better separation
ggplot(age_data_sums, aes(x = age_group, y = total_count, width = width)) +
  geom_bar(stat = "identity", fill = "skyblue", color = "black") +
  theme_minimal() +
  labs(title = "Total Demand by Age Group", 
       x = "Age Group", 
       y = "Total Demand") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))  # Rotate x-axis labels for readability
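The bar widths used in this plot follow directly from the span of each age bin. As a hedged sketch, the widths can be derived from the bin boundaries rather than typed by hand. The `bin_edges` vector is an assumption reconstructed from the age bins in this document, with an assumed upper bound of 95 so that the open-ended 85+ bin gets a width of 10.

```r
# Lower bound of each age bin, plus an assumed upper bound of 95 for "85+"
bin_edges <- c(0, 5, 10, 15, 18, 20, 21, 22, 25, 30, 35, 40, 45, 50, 55,
               60, 62, 65, 67, 70, 75, 80, 85, 95)

# Number of single years of age covered by each bin
bin_years <- diff(bin_edges)

# Same scaling factor as in the plotting code
bar_width <- bin_years * 0.12
print(bin_years)
```

Deriving the widths this way keeps them in step with the bins if the age boundaries are changed later.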

As the plot shows, children account for by far the largest share of demand for care in the United States. This is not to say that demand for care for the elderly is non-existent or unimportant, only that demand for care of the very young is exceptionally high. This is partly because even the elderly often retain a level of independence that is not seen among the very young, and partly because there are more very young people in society than very old.

One piece of information we want to add to this plot is health characteristics. Different age brackets will have different demand for “deeper” care depending on health. For instance, a child under the age of five or an adult over the age of eighty-five is likely to be in a state of health that calls for deeper care. As such, even when two groups require the same number of hours, the care one of them needs can be more intense. To understand how this additional demand is distributed through society, we use data from the National Health Interview Survey (NHIS), which asks respondents to rate their current health on a five-point scale. We apply this scale to the plot above to see how depth of care is distributed.

nhis <- fread("./03_NHISdata.csv")

# Ensure the health column is numeric
nhis$HEALTH <- as.numeric(nhis$HEALTH)

# Subset out observations where health is 7 or 9
nhis <- nhis %>% filter(!HEALTH %in% c(7, 9))

# Define age groups and categorize
nhis <- nhis %>%
  mutate(age_group = cut(AGE, 
                         breaks = c(-Inf, 5, 10, 15, 18, 20, 21, 22, 25, 30, 35, 40, 45, 50, 55, 60, 62, 65, 67, 70, 75, 80, 85, Inf),
                         labels = c("Under 5", "5-9", "10-14", "15-17", "18-19", "20", "21", "22-24", "25-29", "30-34", "35-39", "40-44", 
                                    "45-49", "50-54", "55-59", "60-61", "62-64", "65-66", "67-69", "70-74", "75-79", "80-84", "85+"),
                         right = FALSE))

# Calculate average health score for each age group
average_health_by_age <- nhis %>%
  group_by(age_group) %>%
  summarize(average_health = mean(HEALTH, na.rm = TRUE))

health <- c(average_health_by_age$average_health)
age_data_sums$health <- health

# Create a bar chart with adjusted widths for better separation
ggplot(age_data_sums, aes(x = age_group, y = total_count, width = width, fill = health)) +
  geom_bar(stat = "identity", color = "black") +
  theme_minimal() +
  labs(title = "Total Demand by Age Group", 
       x = "Age Group", 
       y = "Total Demand") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))  # Rotate x-axis labels for readability

In the chart above, brighter colors indicate a deeper level of care than darker colors. As can be seen, both edges of the plot combine more hours demanded with a deeper level of care. Depth of care also varies within childhood, with the youngest children requiring the most, and from around age fifty-four health characteristics begin to push individuals toward higher levels of care.
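The age bins applied to the NHIS data rely on `cut()` with `right = FALSE`, which makes each interval closed on the left, so an age sitting exactly on a boundary falls into the older bin. A small illustration with a shortened, hypothetical set of breaks (the `toy_ages` values and the "18+" label are made up for this example):

```r
# Hypothetical ages chosen to sit on or next to bin boundaries
toy_ages <- c(4, 5, 17, 18, 85)

# Shortened version of the breaks used above; right = FALSE gives [lower, upper)
toy_groups <- cut(
  toy_ages,
  breaks = c(-Inf, 5, 10, 15, 18, Inf),
  labels = c("Under 5", "5-9", "10-14", "15-17", "18+"),
  right = FALSE
)

print(data.frame(age = toy_ages, group = toy_groups))
```

Here age 5 lands in "5-9" and age 18 in "18+", which matches the labeling convention of the Census age bins.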

nhis <- fread("./03_NHISdata.csv")
# Ensure the health column is numeric
nhis$HEALTH <- as.numeric(nhis$HEALTH)

# Subset out observations where health is 7 or 9
nhis <- nhis %>% filter(!HEALTH %in% c(7, 9))

# Define age groups and categorize
nhis <- nhis %>%
  mutate(age_group = cut(AGE, 
                         breaks = c(-Inf, 5, 10, 15, 18, 20, 21, 22, 25, 30, 35, 40, 45, 50, 55, 60, 62, 65, 67, 70, 75, 80, 85, Inf),
                         labels = c("Under 5", "5-9", "10-14", "15-17", "18-19", "20", "21", "22-24", "25-29", "30-34", "35-39", "40-44", 
                                    "45-49", "50-54", "55-59", "60-61", "62-64", "65-66", "67-69", "70-74", "75-79", "80-84", "85+"),
                         right = FALSE))


# Weighted count of people in each age group and health category
nhis <- nhis %>%
  group_by(age_group,health) %>%
  summarise(count = sum(SAMPWEIGHT), .groups = "drop")

# Create the line plot
ggplot(nhis, aes(x = age_group, y = count, color = health, group = health)) +
  geom_line(linewidth = 1) +  # Add lines with linewidth 1
  geom_point(size = 2) +  # Add points for each data point
  theme_minimal() +  # Use a minimal theme for a professional look
  labs(title = "Health Category Counts by Age Group",
       x = "Age Group",
       y = "Count of People",
       color = "Health Category") +  # Label for the color legend
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +  # Rotate x-axis labels for readability
  scale_color_brewer(palette = "Set1")  # Use a color palette for distinct lines

We can also draw the same plot by single year of age instead of by age group.

nhis <- fread("./03_NHISdata.csv")
# Ensure the health column is numeric
nhis$HEALTH <- as.numeric(nhis$HEALTH)

# Subset out observations where health is 7 or 9
nhis <- nhis %>% filter(!HEALTH %in% c(7, 9)) %>%
  filter(AGE < 85)


# Weighted count of people at each single year of age and health category
nhis <- nhis %>%
  group_by(AGE,health) %>%
  summarise(count = sum(SAMPWEIGHT), .groups = "drop")

# Create the line plot
ggplot(nhis, aes(x = AGE, y = count, color = health, group = health)) +
  geom_line(linewidth = 1) +  # Add lines with linewidth 1
  geom_point(size = 2) +  # Add points for each data point
  theme_minimal() +  # Use a minimal theme for a professional look
  labs(title = "Health Category Counts by Age",
       x = "Age",
       y = "Count of People",
       color = "Health Category") +  # Label for the color legend
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +  # Rotate x-axis labels for readability
  scale_color_brewer(palette = "Set1")  # Use a color palette for distinct lines
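The counts in the two plots above are sums of survey weights rather than counts of respondents. The sketch below shows the difference on a made-up sample, under the assumption that each respondent's weight represents the number of people they stand for in the population. The `toy_survey` data frame and its values are hypothetical.

```r
# Hypothetical survey rows: one row per respondent, with a person-level weight
toy_survey <- data.frame(
  age_group = c("Under 5", "Under 5", "85+", "85+", "85+"),
  health    = c("Excellent", "Good", "Good", "Poor", "Poor"),
  weight    = c(1200, 800, 300, 150, 250),
  stringsAsFactors = FALSE
)

# Summing weights within each (age_group, health) cell estimates people,
# whereas counting rows would only count respondents
toy_weighted <- aggregate(weight ~ age_group + health, data = toy_survey, FUN = sum)
print(toy_weighted)
```

The two "85+ / Poor" respondents together represent 400 people, which is the figure a weighted plot would show for that cell.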

All of this combined leads us to the following conclusions about demand for care in the United States. First, demand for care is by far largest in the childcare industry with young children needing both many hours and many

Sankey Diagram of Supply Hours

The first level of connections links the total supply of population to three age categories: children, the elderly, and those of prime age. We can adjust how we split the age groups later, but for now we are separating out the groups who are likely not care providers but net care receivers.

cps <- fread("./03_cpsdata.csv")

# Total person-hours per day (weighted population x 24) by age category
freq <- tapply(cps$WTFINL, cps$prime_age, sum) * 24

df_1 <- data.frame(
  population = "Total Supply",
  # Take the labels from the tapply() names so each value keeps its own category
  prime_age = names(freq),
  freq = as.vector(freq)
)
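One detail worth keeping in mind at this step: `tapply()` returns its results in sorted order of the grouping values, not in the order they appear in the data, so labels attached by position can end up on the wrong totals. The sketch below illustrates this with a hypothetical `toy_cps` data frame. The three category names are assumed to match the values of `prime_age`, based on the node names used later in this document.

```r
# Hypothetical weighted sample; category names are assumed, weights are made up
toy_cps <- data.frame(
  prime_age = c("Under Twenty-Five", "Prime Age", "Fifty-Five Plus", "Prime Age"),
  weight    = c(10, 20, 5, 30),
  stringsAsFactors = FALSE
)

# Weighted population per category, converted to hours available per day
toy_hours <- tapply(toy_cps$weight, toy_cps$prime_age, sum) * 24

# The names carry the category for each total, in sorted order
print(names(toy_hours))
print(as.vector(toy_hours))
```

Reading labels from `names()` keeps each total attached to its own category regardless of how the groups happen to sort.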

Let’s start by looking at how this split looks in terms of the hours available to supply care.

nodes <- data.frame(name = unique(c(df_1$population, df_1$prime_age)))

links <- data.frame(
  source = match(df_1$population, nodes$name) - 1,
  target = match(df_1$prime_age, nodes$name) - 1,
  value = df_1$freq
)

sankeyNetwork(
  Links = links,
  Nodes = nodes,
  Source = "source",
  Target = "target",
  Value = "value",
  NodeID = "name",
  units = "Frequency",
  fontSize = 12,
  nodeWidth = 30
)
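`sankeyNetwork()` expects the `source` and `target` columns to be zero-based positions in the nodes table, which is why the code above subtracts 1 from the result of `match()`. A minimal base R sketch with hypothetical flows (the `toy_flows` values are made up):

```r
# Hypothetical flows from a single source node to three age categories
toy_flows <- data.frame(
  from  = c("Total Supply", "Total Supply", "Total Supply"),
  to    = c("Fifty-Five Plus", "Prime Age", "Under Twenty-Five"),
  hours = c(120, 1200, 240),
  stringsAsFactors = FALSE
)

# One row per distinct node name, in order of first appearance
toy_nodes <- data.frame(name = unique(c(toy_flows$from, toy_flows$to)))

# match() is 1-based, so subtract 1 to get the zero-based index networkD3 needs
toy_links <- data.frame(
  source = match(toy_flows$from, toy_nodes$name) - 1,
  target = match(toy_flows$to, toy_nodes$name) - 1,
  value  = toy_flows$hours
)
print(toy_links)
```

All three links start at node 0 ("Total Supply") and end at nodes 1 to 3, mirroring the structure of the first diagram.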

For now, let’s focus only on the Prime Age adults, who provide care to the rest. The first thing we need to do is set aside the hours that each group spends asleep, which we assume to be 8 hours per person per day.
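As a rough sketch of this split, assuming 8 hours of sleep per person per day and a made-up weighted population of 1,000 people (both `sleep_hours` and `total_hours` are illustrative names, not objects from the analysis):

```r
# Assumed hours asleep per person per day
sleep_hours <- 8

# Hypothetical group: 1,000 weighted people x 24 hours in a day
total_hours <- 1000 * 24

# Split the group's daily hours into sleeping and waking time
asleep <- total_hours * (sleep_hours / 24)
awake  <- total_hours * ((24 - sleep_hours) / 24)

print(c(asleep = asleep, awake = awake))
```

The two parts always add back up to the group's total hours, which is a useful check on each later stage of the breakdown.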

# Original data frame
df_1 <- cps %>%
  filter(YEAR == 2024 & month == "March") %>%
  group_by(prime_age) %>%
  summarise(
    freq = sum(WTFINL*24)
  )

# Duplicate each row twice
df_1_expanded <- df_1 %>%
  slice(rep(1:n(), each = 2)) %>%
  mutate(sleepnotsleep = rep(c("sleep", "notsleep"), length.out = n()))

df_1_expanded <- df_1_expanded %>%
  mutate(
    freq_1 = if_else(
      sleepnotsleep == "sleep",
      freq * (8 / 24),
      freq * ((24 - 8) / 24)
    )
  )

### PART2
# Filter rows where sleepnotsleep is "notsleep"
rows_to_duplicate <- df_1_expanded %>%
  filter(sleepnotsleep == "notsleep")

# Bind the duplicated rows back to the original dataframe
df_1_expanded <- df_1_expanded %>%
  bind_rows(replicate(2, rows_to_duplicate, simplify = FALSE)) %>%
  arrange(prime_age, sleepnotsleep)


# Add the new column with repeating values for the expanded rows
df_1_expanded <- df_1_expanded %>%
  mutate(carejobnotcarejob = rep(c("carejob", "notcarejob", "notworking","NA"), length.out = n()))



# Calculate the values from the cps dataframe for Care_Occ and NonCare_Occ
# (AHRSWORKT is weekly hours worked; dividing by 5 approximates hours per workday)
carejob_freq <- cps %>%
  filter(YEAR == 2024 & month == "March") %>%
  filter(care_job == "Care_Occ") %>%
  filter(AHRSWORKT < 200) %>%
  filter(prime_age == "Prime Age") %>%
  summarise(freq_value = (sum(AHRSWORKT/5*WTFINL, na.rm = TRUE))) %>%
  pull(freq_value)

notcarejob_freq <- cps %>%
  filter(YEAR == 2024 & month == "March") %>%
  filter(care_job == "NonCare_Occ") %>%
  filter(prime_age == "Prime Age") %>%
  filter(AHRSWORKT < 200) %>%
  summarise(freq_value = (sum(AHRSWORKT/5*WTFINL, na.rm = TRUE))) %>%
  pull(freq_value)

# Add the freq_2 column to df_1_expanded with conditions
df_1_expanded <- df_1_expanded %>%
  mutate(
    freq_2 = case_when(
      sleepnotsleep == "notsleep" & carejobnotcarejob == "carejob" & prime_age == "Prime Age" ~ carejob_freq,
      sleepnotsleep == "notsleep" & carejobnotcarejob == "notcarejob" & prime_age == "Prime Age" ~ notcarejob_freq,
      TRUE ~ NA_real_
    )
  )

df_1_expanded <- df_1_expanded %>%
  filter(
    prime_age == "Prime Age" | 
      (prime_age != "Prime Age" & sleepnotsleep == "sleep")
  ) %>%
  arrange(prime_age, sleepnotsleep, carejobnotcarejob)



# Update freq_2 for rows where carejobnotcarejob is "notworking":
# waking hours (freq_1) minus the hours already assigned to care and non-care jobs
df_1_expanded <- df_1_expanded %>%
  group_by(prime_age) %>%
  mutate(
    freq_2 = if_else(
      is.na(freq_2) & carejobnotcarejob == "notworking" & prime_age == "Prime Age",
      freq_1 - sum(freq_2, na.rm = TRUE),
      freq_2
    )
  ) %>%
  mutate(
    freq_2 = if_else(
      is.na(freq_2), 
      freq,
      freq_2
    )
  ) %>%
  mutate(
    freq_2 = if_else(
      prime_age == "Prime Age" & sleepnotsleep == "sleep",
      freq_1,
      freq_2
    )
  )

nodes <- data.frame(name = c(
  "All Categories",
  "Fifty-Five Plus", "Prime Age", "Under Twenty-Five",
  "Prime Age Sleep", "Prime Age NotSleep",
  "Carejob", "Notcarejob", "Notworking"
))


# Helper: total hours (freq_2) over the rows matching the given conditions
link_value <- function(...) {
  df_1_expanded %>% filter(...) %>% pull(freq_2) %>% sum(na.rm = TRUE)
}

links <- data.frame(
  source = c(0, 0, 0, 2, 2, 5, 5, 5),  # Define connections
  target = c(1, 2, 3, 4, 5, 6, 7, 8),  # Define next levels in order
  value = c(
    link_value(prime_age == "Fifty-Five Plus"),
    link_value(prime_age == "Prime Age"),  # all Prime Age hours, asleep and awake
    link_value(prime_age == "Under Twenty-Five"),
    link_value(prime_age == "Prime Age", sleepnotsleep == "sleep"),
    link_value(prime_age == "Prime Age", sleepnotsleep == "notsleep"),
    link_value(prime_age == "Prime Age", sleepnotsleep == "notsleep", carejobnotcarejob == "carejob"),
    link_value(prime_age == "Prime Age", sleepnotsleep == "notsleep", carejobnotcarejob == "notcarejob"),
    link_value(prime_age == "Prime Age", sleepnotsleep == "notsleep", carejobnotcarejob == "notworking")
  )
)


sankeyNetwork(
  Links = links,
  Nodes = nodes,
  Source = "source",
  Target = "target",
  Value = "value",
  NodeID = "name",
  fontSize = 12,
  nodeWidth = 30
)
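A useful sanity check on a multi-level Sankey diagram like this one is that every intermediate node passes on exactly as many hours as it receives. The sketch below runs that check on a hypothetical links table with the same source and target layout as above. The `value` column is made up for illustration.

```r
# Hypothetical links with the same source/target structure as the diagram above
toy_links <- data.frame(
  source = c(0, 0, 0, 2, 2, 5, 5, 5),
  target = c(1, 2, 3, 4, 5, 6, 7, 8),
  value  = c(300, 2400, 500, 800, 1600, 200, 900, 500)
)

# Total hours flowing into and out of each node
inflow  <- tapply(toy_links$value, toy_links$target, sum)
outflow <- tapply(toy_links$value, toy_links$source, sum)

# Intermediate nodes are those that both receive and send hours
shared <- intersect(names(inflow), names(outflow))

# TRUE for each intermediate node whose inflow equals its outflow
balanced <- abs(inflow[shared] - outflow[shared]) < 1e-8
print(balanced)
```

In this toy table nodes 2 (Prime Age) and 5 (Prime Age NotSleep) both balance. Running the same check on the real `links` table would flag any level of the diagram where hours are double-counted or lost.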