Introduction
- Purpose and research questions
- Data source and context
Data wrangling and feature engineering
- Cleaning and preprocessing steps
- Defining aging related courses
Analysis
Key findings and implications
- Summary of findings
- Potential actions for the target audience
Limitations, ethics, and next steps
References

Introduction

Purpose and research questions

Lifelong learning supports social participation, cognitive health, and well being in later life. Online learning platforms often celebrate lifelong learning, yet older learners rarely appear as a focus in learning analytics work.

This project uses an EdX course catalog snapshot to study how often courses explicitly reference aging, older adults, or later life. The analysis addresses three research questions.

RQ1. Coverage. How common are courses that explicitly reference aging, older adults, or gerontology in this EdX dataset?
RQ2. Location. In which subjects and course levels do these aging related courses appear compared with the rest of the catalog?
RQ3. Positioning. Which ideas and keywords are most prominent in aging related course descriptions, and how do they frame aging and older adulthood?

The intended audience includes:

EdX and other MOOC platform staff who make decisions about catalog breadth.
Gerontology and adult education programs that might point older learners to MOOCs.
Learning analytics practitioners who study equity in access to learning opportunities.

Data source and context

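library(tidyverse)  # readr, dplyr, stringr, forcats, ggplot2
library(tidytext)   # unnest_tokens(), bind_tf_idf(), stop_words
library(scales)     # percent(), percent_format()

# path to where the CSV is stored
edx_raw <- readr::read_csv("edx_courses.csv")

glimpse(edx_raw)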

## Rows: 975
## Columns: 16
## $ title              <chr> "How to Learn Online", "Programming for Everybody (…
## $ summary            <chr> "Learn essential strategies for successful online l…
## $ n_enrolled         <dbl> 124980, 293864, 2442271, 129555, 81140, 301793, 328…
## $ course_type        <chr> "Self-paced on your time", "Self-paced on your time…
## $ institution        <chr> "edX", "The University of Michigan", "Harvard Unive…
## $ instructors        <chr> "Nina Huntemann-Robyn Belair-Ben Piscopo", "Charles…
## $ Level              <chr> "Introductory", "Introductory", "Introductory", "In…
## $ subject            <chr> "Education & Teacher Training", "Computer Science",…
## $ language           <chr> "English", "English", "English", "English", "Englis…
## $ subtitles          <chr> "English", "English", "English", "English", "Englis…
## $ course_effort      <chr> "2–3 hours per week", "2–4 hours per week", "6–18 h…
## $ course_length      <chr> "2 Weeks", "7 Weeks", "12 Weeks", "13 Weeks", "4 We…
## $ price              <chr> "FREE-Add a Verified Certificate for $49 USD", "FRE…
## $ course_description <chr> "Designed for those who are new to elearning, this …
## $ course_syllabus    <chr> "Welcome - We start with opportunities to meet your…
## $ course_url         <chr> "https://www.edx.org/course/how-to-learn-online", "…

The dataset contains 975 EdX courses with the following key fields.

title: course title
summary: short description shown in the catalog
course_description: full course description
course_syllabus: syllabus or weekly outline where available
subject: high level subject category, such as Social Sciences or Computer Science
Level: course level, such as Introductory, Intermediate, or Advanced
language: primary language of instruction
course_effort: estimated weekly effort text, such as “2–4 hours per week”
course_length: expected course length, such as “6 weeks”
n_enrolled: enrollment count, read as a number (<dbl>) in this file
other metadata such as institution, course type, and course URL

The file appears to be a scraped snapshot of the EdX catalog around 2021. It reflects courses that were public on the platform at that time rather than the complete historical catalog. The dataset contains course level metadata only, not learner level interaction data.

Data wrangling and feature engineering

Cleaning and preprocessing steps

The analysis builds a combined text field for keyword search and text mining and performs light cleaning of character variables. The goal is to keep transformations simple and transparent.

edx_clean <- edx_raw |>
  mutate(
    across(where(is.character), ~ str_squish(.x)),
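    # combined text across title, summary, description, and syllabus;
    # coalesce() stops missing fields from being pasted in as the string "NA"
    text_all = str_to_lower(
      str_squish(
        paste(
          coalesce(title, ""),
          coalesce(summary, ""),
          coalesce(course_description, ""),
          coalesce(course_syllabus, ""),
          sep = " "
        )
      )
    ),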
    subject = as.factor(subject),
    Level = as.factor(Level),
    language = as.factor(language)
  )

edx_clean |>
  summarise(
    n_courses = n(),
    missing_title = sum(is.na(title) | title == ""),
    missing_text = sum(is.na(text_all) | text_all == "")
  )

These steps:

Standardize spacing in character fields.
Lowercase the combined text for reliable string matching.
Combine several descriptive fields into text_all so that searches capture references in the title, summary, detailed description, or syllabus.
Convert subject, Level, and language to factors for grouped summaries and plots.
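
As a small illustration of the combined text step, the sketch below builds a text field on a made-up two-row tibble (toy_courses and its values are hypothetical, not the EdX data). It highlights one pitfall worth checking: paste() turns a missing value into the literal string "NA", so the safer version replaces missing fields with empty strings first. This is a sketch under those assumptions, not a statement of how every missing field in the real file behaves.

```r
library(dplyr)
library(stringr)

# Hypothetical toy data: the second course has no syllabus
toy_courses <- tibble(
  title = c("Healthy  Aging", "Intro to R"),
  course_syllabus = c("Week 1: Nutrition", NA)
)

toy_clean <- toy_courses |>
  mutate(
    # standardize spacing in every character column
    across(where(is.character), ~ str_squish(.x)),
    # naive version: NA becomes the text "NA" after paste()
    text_naive = str_to_lower(paste(title, course_syllabus, sep = " ")),
    # safer version: replace NA with "" before pasting, then tidy spacing
    text_all = str_to_lower(str_squish(
      paste(coalesce(title, ""), coalesce(course_syllabus, ""), sep = " ")
    ))
  )

toy_clean |> select(text_naive, text_all)
```

In the naive column the second row ends with a stray "na" token, which would later be counted as a word in any text mining step.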

More aggressive preprocessing, such as removing stop words and domain general high frequency words, is deferred to the text mining section, since those choices affect interpretability. Stemming is not applied.

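In this file, readr already reads n_enrolled as numeric (<dbl> in the glimpse output above). If another export stores it as formatted text such as “124,980”, it can be converted to a numeric variable with parse_number().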

edx_clean <- edx_clean |>
  mutate(
    n_enrolled_num = readr::parse_number(as.character(n_enrolled))
  )

This project does not rely on enrollment for the main research questions, so the numeric conversion is optional.
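
For reference, the sketch below shows what that conversion does on a few formatted strings. The values in enrolled_text are hypothetical examples of how a scraped export might store counts; they are not read from the dataset.

```r
library(readr)

# Hypothetical formatted strings, as they might appear in a scraped export
enrolled_text <- c("124,980", "2,442,271", "81140")

# parse_number() drops grouping marks and returns a numeric vector
enrolled_num <- parse_number(enrolled_text)
enrolled_num
```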

Defining aging related courses

To answer the research questions, the analysis uses an operational definition of an aging related course. The focus is on courses that explicitly reference aging, older adults, or later life in their text, not courses that might be relevant in a broader sense.

To reduce false positives from words like “senior” in job titles or the string “aging” inside other words, the analysis uses explicit word boundary patterns.

aging_patterns <- c(
  "\\bolder adult(s)?\\b",
  "\\bolder people\\b",
  "\\bolder persons\\b",
  "\\bsenior citizen(s)?\\b",
  "\\belderly\\b",
  "\\baging\\b",
  "\\bageing\\b",
  "\\bgerontology\\b",
  "\\bgerontological\\b",
  "\\bgeriatric(s)?\\b",
  "\\bretirement\\b",
  "\\bretiree(s)?\\b",
  "\\blater life\\b",
  "\\bthird age\\b",
  "\\blifelong learning\\b",
  "\\bencore career\\b"
)

aging_regex <- stringr::str_c(aging_patterns, collapse = "|")

edx_features <- edx_clean |>
  mutate(
    aging_flag = str_detect(
      text_all,
      regex(aging_regex, ignore_case = TRUE)
    )
  )

edx_features |>
  count(aging_flag)

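This rule flags courses that use explicit aging terms or clear older adult phrases. It also includes broader terms such as “retirement” and “lifelong learning”, which may flag some courses that are not about older adults. The rule will still miss courses that are relevant but do not use this language, and it can include some that discuss aging at the population level rather than focusing on older learners.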

A much looser rule that includes all uses of “senior” yields many more matches, but most of these refer to senior managers or senior level roles rather than older adults. For this project, the stricter definition keeps the focus on courses that clearly mention aging or later life.
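
The effect of word boundaries can be checked on a few strings. The sketch below uses two patterns from the list above and a loose “senior” pattern for contrast; the snippets in toy_text are invented for illustration and do not come from the catalog.

```r
library(stringr)

# Two strict patterns from the list above, plus a loose pattern for contrast
strict_regex <- regex("\\baging\\b|\\bolder adult(s)?\\b", ignore_case = TRUE)
loose_regex  <- regex("senior", ignore_case = TRUE)

# Hypothetical course snippets
toy_text <- c(
  "Nutrition and healthy aging",
  "Managing packaging supply chains",
  "Leadership for senior managers",
  "Designing services for older adults"
)

toy_matches <- data.frame(
  text = toy_text,
  # TRUE only when "aging" or "older adult(s)" stands as its own word
  strict = str_detect(toy_text, strict_regex),
  # TRUE for any occurrence of "senior", including job titles
  loose_senior = str_detect(toy_text, loose_regex)
)

toy_matches
```

Only the first and last snippets match the strict patterns. “Managing” and “packaging” contain the letters “aging” but not as a separate word, and the loose pattern picks up only the job title.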

Analysis

Overall prevalence of aging related courses

n_total <- nrow(edx_features)
n_aging <- sum(edx_features$aging_flag, na.rm = TRUE)

prevalence_tbl <- tibble(
  group = c("Aging related courses", "All other courses"),
  n = c(n_aging, n_total - n_aging)
) |>
  mutate(prop = n / sum(n))

prevalence_tbl
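
The plots in this report call custom_theme(), whose definition is not shown here. To run the plotting code, some stand-in is needed. The version below is a hypothetical placeholder built on theme_minimal(), not the theme used for the original figures. It hides the legend on the assumption that the bars are already labeled directly.

```r
library(ggplot2)

# Hypothetical stand-in for custom_theme(); the report's own definition is not shown
custom_theme <- function(base_size = 11) {
  theme_minimal(base_size = base_size) +
    theme(
      plot.title = element_text(face = "bold"),
      plot.title.position = "plot",        # align the title with the plot edge
      panel.grid.minor = element_blank(),  # keep only major grid lines
      legend.position = "none"             # bars carry their own labels
    )
}
```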

prevalence_tbl |>
  mutate(
    group = fct_relevel(group, "Aging related courses")
  ) |>
  ggplot(aes(x = group, y = prop, fill = group)) +
  geom_col(width = 0.6) +
  geom_text(
    aes(label = percent(prop, accuracy = 0.1)),
    vjust = -0.3,
    size = 3.5
  ) +
  scale_y_continuous(
    labels = percent_format(accuracy = 1),
    expand = expansion(mult = c(0, 0.1))
  ) +
  labs(
    x = NULL,
    y = "Percent of courses",
    title = "Most EdX courses do not explicitly reference aging or later life",
    subtitle = glue::glue("{n_aging} of {n_total} courses are aging related")
  ) +
  custom_theme()

Out of 975 courses in this dataset, only 8 courses are flagged as aging related using the strict pattern. This equals 0.8% of the catalog snapshot.

This result suggests that courses that directly name aging, older adults, or later life are rare in this EdX sample.

Where aging related courses appear

To address RQ2, the analysis compares the distribution of aging related courses by subject and level.

by_subject <- edx_features |>
  group_by(subject) |>
  summarise(
    n_total = n(),
    n_aging = sum(aging_flag, na.rm = TRUE),
    prop_aging = n_aging / n_total
  ) |>
  arrange(desc(n_aging))

by_subject

by_subject |>
  filter(n_aging > 0) |>
  mutate(
    subject = fct_reorder(subject, prop_aging)
  ) |>
  ggplot(aes(x = subject, y = prop_aging, fill = prop_aging)) +
  geom_col(width = 0.65) +
  geom_text(
    aes(label = percent(prop_aging, accuracy = 0.1)),
    hjust = -0.1,
    size = 3
  ) +
  coord_flip(clip = "off") +
  scale_y_continuous(
    labels = percent_format(accuracy = 0.1),
    expand = expansion(mult = c(0, 0.2))
  ) +
  labs(
    x = NULL,
    y = "Percent of courses in subject\nthat are aging related",
    title = "Aging related courses cluster in a few subjects"
  ) +
  custom_theme()

In this dataset, aging related courses appear in only a small group of subjects. Social Sciences, Health and Safety, and Food and Nutrition show both the most aging related courses and the highest shares within their subject. Humanities and Language include a single aging related course each.

Many subjects have no aging related courses with this strict definition, including Computer Science, Data Analysis and Statistics, Engineering, and Business and Management. That gap matters if the goal is to support older learners in technical or data rich fields.

Course level provides a different view.

by_level <- edx_features |>
  group_by(Level) |>
  summarise(
    n_total = n(),
    n_aging = sum(aging_flag, na.rm = TRUE),
    prop_aging = n_aging / n_total
  ) |>
  arrange(desc(prop_aging))

by_level

by_level |>
  mutate(
    Level = fct_reorder(Level, prop_aging)
  ) |>
  ggplot(aes(x = Level, y = prop_aging, fill = Level)) +
  geom_col(width = 0.6) +
  geom_text(
    aes(label = percent(prop_aging, accuracy = 0.1)),
    vjust = -0.3,
    size = 3
  ) +
  scale_y_continuous(
    labels = percent_format(accuracy = 0.1),
    expand = expansion(mult = c(0, 0.15))
  ) +
  labs(
    x = "Course level",
    y = "Percent of courses that are aging related",
    title = "Aging related courses are most often introductory"
  ) +
  custom_theme()

Most aging related courses are Introductory, with smaller numbers at Intermediate and Advanced levels. The proportions are around one percent or less for each level. Introductory courses dominate the catalog in general, so this pattern likely reflects catalog structure at least as much as any deliberate focus on lower barrier entry points.

Text mining on aging related language

To address RQ3, the analysis uses tidy text methods to explore how aging related courses describe their content, compared with all other courses.

data("stop_words")

tokens <- edx_features |>
  select(title, subject, Level, aging_flag, text_all) |>
  unnest_tokens(word, text_all) |>
  anti_join(stop_words, by = "word") |>
  filter(str_detect(word, "^[a-z]+$"))
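
To make the three tokenization steps concrete, the sketch below runs them on a single made-up sentence. toy_doc is hypothetical and not taken from the EdX data; the stop word list is the stop_words dataset that ships with tidytext.

```r
library(dplyr)
library(stringr)
library(tidytext)

# Hypothetical one-row example, not taken from the EdX data
toy_doc <- tibble(
  title = "Toy course",
  text_all = "aging in the 21st century: it's about health"
)

toy_tokens <- toy_doc |>
  unnest_tokens(word, text_all) |>        # one lowercase token per row
  anti_join(stop_words, by = "word") |>   # drop common English stop words
  filter(str_detect(word, "^[a-z]+$"))    # keep purely alphabetic tokens

toy_tokens
```

The final filter drops tokens such as “21st” that mix digits and letters. It also drops any token containing an apostrophe, a hyphen, or an accented character, which is worth keeping in mind for non-English course text.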

Frequent words in aging related courses

top_aging <- tokens |>
  filter(aging_flag) |>
  count(word, sort = TRUE)

top_aging |> slice_head(n = 30)

Some of the most frequent words in aging related courses are common across EdX, such as “course” and “learn”. To focus on content, the analysis removes a small set of domain general words before drawing the word cloud.

domain_stop <- c(
  "course", "learn", "week", "module", "students",
  "study", "online"
)

top_aging_filtered <- top_aging |>
  filter(! word %in% domain_stop)

top_aging_filtered |> slice_head(n = 30)

library(wordcloud)
set.seed(123)

wc_data <- top_aging_filtered |>
  slice_max(n, n = 100, with_ties = FALSE)  # top 100 non-domain words in aging courses

wordcloud(
  words = wc_data$word,
  freq = wc_data$n,
  max.words = 100,
  scale = c(3, 0.8),
  random.order = FALSE,
  rot.per = 0.15
)

Among the remaining words, aging related courses feature:

Health and care terms such as “health”, “nutrition”, “exercise”, “microbiome”, and “care”.
Social and rights terms such as “social”, “rights”, “individuals”, and “families”.
Explicit aging terms such as “aging”, “later”, “life”, and “longevity”.
Practice words such as “work”, “practice”, “intervention”, and “career”.

Distinctive words by aging flag using tf idf

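A simple tf idf comparison highlights words that are distinctive for aging related courses compared with all other courses. Because only two groups serve as documents, any word that appears in both groups receives an idf of zero, so the ranking surfaces only words that are unique to one group. A second comparison of within group word proportions, with domain general words removed, complements it.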

tfidf <- tokens |>
  count(aging_flag, word) |>
  bind_tf_idf(word, aging_flag, n) |>
  arrange(desc(tf_idf))

tfidf |>
  filter(aging_flag) |>
  slice_head(n = 20)
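
A toy example shows how tf idf behaves with only two documents. tidytext computes idf as ln(N / n_w), where N is the number of documents and n_w is the number of documents containing word w. With N = 2, a word found in both groups gets ln(2 / 2) = 0 and a word found in one group gets ln(2). The counts in toy_counts are invented for illustration.

```r
library(dplyr)
library(tidytext)

# Hypothetical word counts for two groups treated as two documents
toy_counts <- tibble(
  aging_flag = c(TRUE, TRUE, FALSE, FALSE),
  word = c("health", "longevity", "health", "python"),
  n = c(5, 3, 40, 60)
)

toy_tfidf <- toy_counts |>
  bind_tf_idf(word, aging_flag, n) |>   # adds tf, idf, and tf_idf columns
  arrange(desc(tf_idf))

toy_tfidf
```

Here “health” scores zero in both groups even though it is frequent in the aging group, while “longevity” and “python” receive positive scores. A word that is common in aging related courses but also appears once elsewhere therefore drops out of the tf idf ranking.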

freq_by_group <- tokens |>
  count(aging_flag, word) |>
  group_by(aging_flag) |>
  mutate(
    total = sum(n),
    prop = n / total
  ) |>
  ungroup()

top_compare <- freq_by_group |>
  filter(! word %in% domain_stop) |>
  group_by(aging_flag) |>
  slice_max(prop, n = 15) |>
  ungroup()

top_compare |>
  mutate(
    aging_flag = if_else(aging_flag, "Aging-related courses", "All other courses"),
    word = fct_reorder(word, prop)
  ) |>
  ggplot(aes(x = word, y = prop)) +
  geom_col() +
  coord_flip() +
  facet_wrap(~ aging_flag, scales = "free_y") +
  scale_y_continuous(labels = percent_format(accuracy = 0.1)) +
  labs(
    x = NULL,
    y = "Within-group word proportion",
    title = "Most frequent content words in aging vs non-aging courses"
  ) +
  custom_theme()

These distinctive terms reinforce the earlier pattern. Aging related courses in this catalog are framed mainly through health, human development across the lifespan, public health, social work, and the business of aging in the longevity economy. Courses rarely present aging as a context for ongoing learning in technical fields, data literacy, or creative domains.

Key findings and implications

Summary of findings

In this EdX catalog snapshot:

Only 8 of 975 courses, about 0.8%, explicitly reference aging, older adults, or later life in their course text.
Aging related courses cluster in Social Sciences, Health and Safety, and Food and Nutrition and appear rarely in technical subjects such as Computer Science, Engineering, or Data Analysis.
Most aging related courses are Introductory, with very few at Intermediate or Advanced levels.
Text mining shows that these courses emphasize health, social conditions, rights, and the broader demographic challenge of aging populations.
Few courses present aging as a context for ongoing learning in technical fields, digital skills, or creative domains.

These results suggest that older adults appear in this EdX sample mainly as subjects of health and social care or as part of a demographic and economic challenge rather than as a diverse group of learners with broad interests.

Potential actions for the target audience

For platform staff and course designers:

Identify aging related gaps in technical and creative domains and design new courses in areas such as digital literacy for later life, data skills for community engagement, or design for an aging society.
Build a curated pathway or catalog tag for aging, later adulthood, or longevity learning that makes existing courses more visible to older learners and professionals who work with them.

For gerontology and adult education programs:

Partner with MOOC providers to co design courses that reflect the interests and needs of older adults beyond health and risk management, such as civic engagement, entrepreneurship, and intergenerational learning.
Use catalog analytics like this as a baseline when arguing for new course development and partnerships.

For learning analytics practitioners:

Extend this catalog level analysis with learner level data when available, for example by studying who enrolls in aging related MOOCs, how engagement patterns compare across age groups, and how course design features support or hinder older learners.
Combine text analytics with survey or interview data from older learners about how they find and interpret online learning opportunities.

Limitations, ethics, and next steps

Data and measurement limitations

Several limitations shape these findings.

The dataset is a scraped snapshot of the EdX catalog around 2021, not a complete or current representation of all EdX offerings.
The analysis treats each course as a single unit and does not inspect course materials, assessments, or forums.
The aging flag uses explicit keyword patterns, which miss courses that might be relevant but avoid direct references to aging or older adults.
Some flagged courses discuss aging at the population level, not as a focus on older learners themselves.

These constraints mean that the prevalence estimates here describe explicit aging related language in course text. They likely understate the number of courses that could serve older learners and should not be read as definitive counts.

Ethical considerations

The analysis works with public course metadata and does not use any identifiable learner data, so privacy risks are low. Yet downstream use of these findings still raises questions.

If decision makers treat this snapshot as complete, they may overlook recent courses or closed cohort offerings aimed at older adults.
If aging related courses are defined only by a narrow set of keywords, course developers might over optimize descriptions for those terms instead of thinking carefully about inclusive design and content.

A more balanced approach uses catalog analytics as one input among several when planning how to support older learners.

Future work

Future work could:

Compare this EdX snapshot to other platforms and to later catalog snapshots to see whether aging related offerings are growing or static.
Incorporate learner data, where accessible and ethical, to examine participation and completion patterns for older learners in both aging specific and general courses.
Extend the aging flag to include related roles such as caregivers and intergenerational learning, paired with manual review, to draw a richer map of where later life learning appears in the online ecosystem.

References

Krumm, A., Means, B., & Bienkowski, M. (2018). Learning analytics goes to school: A collaborative approach to improving education. Routledge.

Nakhaee, M. [imuhammad]. (2010). Edx Courses: A list of online courses on edx.org learning platform. (Version 5) [Dataset]. Kaggle. https://www.kaggle.com/datasets/imuhammad/edx-courses

Where Are the Courses on Aging? A Learning Analytics Look at EdX

Anna Doo