Introduction

In this analysis, we will perform topic modeling on the census occupation data to extract key concepts and gain insights into the underlying themes present in the occupation descriptions. Additionally, we will conduct a trend analysis to examine how the prevalence of these topics has changed over time.

Mathematics Motivation

Bayes’ Theorem is a fundamental concept in probability theory that relates the conditional probability of an event A given event B to the conditional probability of event B given event A. Mathematically, Bayes’ Theorem can be expressed as:

\[ P(A|B) = \dfrac{P(B|A) \times P(A)}{P(B)} \]

Where:

  • \(P(A|B)\) is the probability of event A given event B (the posterior probability).
  • \(P(B|A)\) is the probability of event B given event A.
  • \(P(A)\) is the prior probability of event A.
  • \(P(B)\) is the prior probability of event B.

In machine learning, Bayes’ Theorem underlies methods that compute the probability of a class given the input features; a classic example is Linear Discriminant Analysis (which shares the abbreviation LDA with the Latent Dirichlet Allocation topic model used later in this analysis).

For a classification task in LDA, we assume that the features \(x\) are generated by a Gaussian distribution with parameters \(\mu_k\) and \(\Sigma\) for each class \(k\). The goal is to find the class label \(y\) that maximizes the posterior probability \(P(y | x)\) using Bayes’ Theorem:

\[ P(y | x) = \dfrac{P(x | y) \times P(y)}{P(x)} \]

Where:

  • \(P(y | x)\) is the posterior probability of class y given input features x.
  • \(P(x | y)\) is the likelihood of observing features x given class y.
  • \(P(y)\) is the prior probability of class y.
  • \(P(x)\) is the evidence, or marginal likelihood, of observing features x.

By calculating the posterior probabilities for each class given the input features, LDA can classify new data points by assigning them to the class with the highest posterior probability. This approach combines statistical inference with probability theory to make predictions based on the underlying patterns in the data.
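
To make this concrete, the short R sketch below applies Bayes’ Theorem to a toy one-dimensional, two-class problem with a shared variance. All numbers here are illustrative assumptions, not values taken from the census data.

priors <- c(A = 0.6, B = 0.4)            # P(y): assumed prior class probabilities
means  <- c(A = 1.0, B = 3.0)            # assumed class-conditional means mu_k
sigma  <- 1.5                            # assumed shared standard deviation

x <- 2.2                                 # a new observation to classify

# Likelihoods P(x | y) under the Gaussian assumption
likelihoods <- dnorm(x, mean = means, sd = sigma)

# Posterior P(y | x) = P(x | y) * P(y) / P(x), where P(x) is the sum over both classes
posterior <- likelihoods * priors / sum(likelihoods * priors)
posterior                                # the class with the highest posterior is chosen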

Loading Required Libraries

We begin by loading the necessary libraries for data manipulation, text preprocessing, topic modeling, and trend analysis.

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(tidytext)
library(SnowballC)
library(topicmodels)
library(tidylo)
library(tm)
## Loading required package: NLP
## 
## Attaching package: 'NLP'
## 
## The following object is masked from 'package:ggplot2':
## 
##     annotate
library(broom)
library(lubridate)

Load the Dataset

Next, we load the census dataset from the UCI Machine Learning Repository and preprocess the text data.

census <- read_csv("https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data",
                   col_names = FALSE)
## Rows: 32561 Columns: 15
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (9): X2, X4, X6, X7, X8, X9, X10, X14, X15
## dbl (6): X1, X3, X5, X11, X12, X13
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(census)

Rename the Columns

colnames(census) <- c("age", "workclass", "fnlwgt", "education", "education.num",
                      "marital.status", "occupation", "relationship", "race", "sex",
                      "capital.gain", "capital.loss", "hours.per.week", "native.country",
                      "income")

head(census)

Preprocess the text data

This preprocessing sequence cleans and normalizes the occupation text, removing noise and enforcing a uniform representation so that the subsequent analysis produces more accurate results.

Explanation of the Preprocessing Steps:

Convert Text to Lowercase (tolower):

mutate(occupation = tolower(occupation))
  • Function Used: tolower()
  • Purpose: This function converts all text data within the occupation field to lowercase. This normalization is crucial as it eliminates variations in text caused by case differences; for example, “Engineer,” “engineer,” and “ENGINEER” are treated as the same word, ensuring consistency in the textual analysis.

Remove Non-alphanumeric Characters (gsub):

occupation = gsub("[^a-zA-Z\\s]", "", occupation)
  • Function Used: gsub()
  • Regex Pattern: "[^a-zA-Z\\s]"
  • Purpose: We employ the gsub() function to strip away any characters from the occupation text that are neither alphabets nor whitespace. This step cleans the text by removing punctuation and special characters that are typically irrelevant to the text analysis, focusing the analysis purely on the words.

Remove Stop Words (tm::removeWords):

occupation = tm::removeWords(occupation, stopwords("en"))
  • Function Used: removeWords() from the tm package
  • Stopwords Source: stopwords("en") for English-language stop words
  • Purpose: Common words such as “and,” “the,” and “is,” known as stop words, are frequent and carry little meaningful information for analysis. Removing these using the removeWords() function helps to focus on more meaningful words in the text.

Word Stemming (SnowballC::wordStem):

occupation = SnowballC::wordStem(occupation)
  • Function Used: wordStem() from the SnowballC package
  • Purpose: Word stemming reduces words to their base or root form (e.g., reducing “running,” “runner,” and “ran” to “run”). This step is crucial as it helps to consolidate variations of a word into a single representative form, simplifying the data and reducing the complexity for analysis. This leads to more straightforward and robust insights.
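
Putting the four steps together, the small sketch below walks a single raw category through the pipeline. The input value is one of the occupation categories in this dataset; the stemmed form shown in the final comment matches the term that appears later in the topic output.

raw <- "Exec-managerial"
cleaned <- tolower(raw)                               # "exec-managerial"
cleaned <- gsub("[^a-zA-Z\\s]", "", cleaned)          # drop the hyphen: "execmanagerial"
cleaned <- tm::removeWords(cleaned, stopwords("en"))  # no stop words in this value
cleaned <- SnowballC::wordStem(cleaned)               # stemmed form, e.g. "execmanageri"
cleaned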

Summary

With these preprocessing steps, the textual data in the occupation field is standardized and ready for further analysis. This preparation is foundational for subsequent procedures such as vectorization, topic modeling, or sentiment analysis, ensuring that the data fed into analytical models accurately represents the content under study.

census <- census %>%
  mutate(occupation = tolower(occupation),
         occupation = gsub("[^a-zA-Z\\s]", "", occupation),
         occupation = tm::removeWords(occupation, stopwords("en")),
         occupation = SnowballC::wordStem(occupation))

# Display the first 5 rows
(head(census, 5))

Perform Topic Modeling

To perform topic modeling, we examine the occupation text data from the census data frame to discover latent themes or topics using natural language processing techniques.

Step 1: Creating a Document-Term Matrix (DTM)

The first task is to convert the textual data in the occupation column into a mathematical format that can be used for analysis. We do this through three primary steps:

  1. Tokenizing the text into words: We break down the entire chunk of text from each respondent’s occupation description into individual words (tokens). The tidytext::unnest_tokens() function makes this straightforward by automatically handling common preprocessing tasks like converting all text to lowercase.
  2. Counting the occurrence of each word: After splitting the text into words, we count how often each distinct word appears throughout the entire dataset. This is achieved using the dplyr::count() function.
  3. Casting the word counts into a Document-Term Matrix: We then transform the word counts into a matrix format where rows represent documents and columns represent words, with matrix entries giving the frequency of each word in each document. Because each record in this dataset holds a single occupation category, each distinct occupation term serves as its own document here. The cast_dtm() function facilitates this transformation, giving us a structured, sparse matrix.

Step 2: Applying Latent Dirichlet Allocation (LDA)

With a Document-Term Matrix in hand, we proceed to perform topic modeling using Latent Dirichlet Allocation (LDA):

  • We specify that the model should identify five distinct topics (k = 5). The number of topics is a model parameter that can be tuned depending on the desired granularity of analysis (see the sketch after this list for one way to compare candidate values).
  • We control the randomness of the algorithm by setting a seed (seed = 123). This ensures that the results are reproducible, an essential aspect when conducting scientific or analytical studies that may need verification or replication.
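
One common, if rough, way to compare candidate numbers of topics is to refit the model for several values of k and inspect the perplexity reported by the topicmodels package (lower is generally better). The sketch below assumes the dtm object created in the code block further down; the candidate values are arbitrary.

# Compare candidate topic counts by refitting the LDA model and checking perplexity
candidate_k <- c(2, 3, 5, 8)
perplexities <- sapply(candidate_k, function(k) {
  model <- LDA(dtm, k = k, control = list(seed = 123))
  topicmodels::perplexity(model)
})
data.frame(k = candidate_k, perplexity = perplexities)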

Step 3: Analyzing Topics

Once the LDA model is fitted, we take the following steps to understand and characterize the topics:

  1. Converting LDA model details to a tidy format: The tidy() generic (loaded here via broom, with the method for LDA objects supplied by tidytext) reshapes the model’s output into a tidy dataframe. Specifically, we extract the beta matrix, which contains the probabilities of words belonging to each topic.
  2. Selecting the top terms for each topic based on their beta values: For each topic, we identify the top 5 terms that have the highest probability (beta values). This helps us understand the essence or core themes of each topic discovered by the LDA model.

These top terms or concepts are then further arranged and clarified for presentation or further analysis, giving us clear insights into the main themes present in the occupational data from the census.

This procedure provides a method for extracting meaningful information from unstructured text data, facilitating a deeper understanding of the underlying patterns in occupational descriptions.

# Create a Document-Term Matrix from the `occupation` column in the `census` dataframe
# This process involves three main steps:
# 1. Tokenizing the text into words
# 2. Counting the occurrence of each word
# 3. Casting the word counts into a Document-Term Matrix

dtm <- census %>%
  tidytext::unnest_tokens(word, occupation) %>%
  dplyr::count(word) %>%
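  # Each distinct occupation term is used as both the document id and the term,
  # so the resulting matrix has one "document" per occupation category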
  cast_dtm(word, word, n)


# Apply Latent Dirichlet Allocation (LDA) to perform topic modeling on the Document-Term Matrix
# `k` specifies the number of topics to be generated by the model
# The `control` list includes parameters for the LDA algorithm, here we set a random seed for reproducibility

lda_model <- LDA(dtm, k = 5, control = list(seed = 123))

# Output the structure of the LDA model object, which includes details about the topics and their associated terms
str(lda_model)
## Formal class 'LDA_VEM' [package "topicmodels"] with 14 slots
##   ..@ alpha          : num 0.0217
##   ..@ call           : language LDA(x = dtm, k = 5, control = list(seed = 123))
##   ..@ Dim            : int [1:2] 14 14
##   ..@ control        :Formal class 'LDA_VEMcontrol' [package "topicmodels"] with 13 slots
##   .. .. ..@ estimate.alpha: logi TRUE
##   .. .. ..@ alpha         : num 10
##   .. .. ..@ seed          : int 123
##   .. .. ..@ verbose       : int 0
##   .. .. ..@ prefix        : chr "/var/folders/v6/vbndk1g95x1dh_h2t_t1wzhr0000gn/T//RtmpxVfL5Y/file1195f1cfcda0e"
##   .. .. ..@ save          : int 0
##   .. .. ..@ nstart        : int 1
##   .. .. ..@ best          : logi TRUE
##   .. .. ..@ keep          : int 0
##   .. .. ..@ estimate.beta : logi TRUE
##   .. .. ..@ var           :Formal class 'OPTcontrol' [package "topicmodels"] with 2 slots
##   .. .. .. .. ..@ iter.max: int 500
##   .. .. .. .. ..@ tol     : num 1e-06
##   .. .. ..@ em            :Formal class 'OPTcontrol' [package "topicmodels"] with 2 slots
##   .. .. .. .. ..@ iter.max: int 1000
##   .. .. .. .. ..@ tol     : num 1e-04
##   .. .. ..@ initialize    : chr "random"
##   ..@ k              : int 5
##   ..@ terms          : chr [1:14] "admcler" "armedforc" "craftrepair" "execmanageri" ...
##   ..@ documents      : chr [1:14] "admcler" "armedforc" "craftrepair" "execmanageri" ...
##   ..@ beta           : num [1:5, 1:14] -0.467 -180.446 -180.604 -169.638 -183.104 ...
##   ..@ gamma          : num [1:14, 1:5] 1.00 2.38e-03 5.29e-06 5.33e-06 2.18e-05 ...
##   ..@ wordassignments:List of 5
##   .. ..$ i   : int [1:14] 1 2 3 4 5 6 7 8 9 10 ...
##   .. ..$ j   : int [1:14] 1 2 3 4 5 6 7 8 9 10 ...
##   .. ..$ v   : num [1:14] 1 3 5 2 4 3 2 5 4 3 ...
##   .. ..$ nrow: int 14
##   .. ..$ ncol: int 14
##   .. ..- attr(*, "class")= chr "simple_triplet_matrix"
##   ..@ loglikelihood  : num [1:14] -1764.3 -59.6 -4065.1 -2208.9 -141.1 ...
##   ..@ iter           : int 7
##   ..@ logLiks        : num(0) 
##   ..@ n              : int 30718
# Identify the top terms (concepts) within each topic as determined by the LDA model
# This involves several steps:
# 1. Converting LDA model details to a tidy format
# 2. Selecting the top terms for each topic based on their 'beta' values

concepts <- lda_model %>%
  tidy(matrix = "beta") %>%
  group_by(topic) %>%
  top_n(5, beta) %>%
  ungroup() %>%
  arrange(topic, -beta) %>%
  mutate(concept = term)

print(concepts)
## # A tibble: 25 × 4
##    topic term                beta concept        
##    <int> <chr>              <dbl> <chr>          
##  1     1 admcler         6.27e- 1 admcler        
##  2     1 transportmov    2.65e- 1 transportmov   
##  3     1 protectiveserv  1.08e- 1 protectiveserv 
##  4     1 armedforc       8.21e-65 armedforc      
##  5     1 techsupport     6.18e-75 techsupport    
##  6     2 execmanageri    5.81e- 1 execmanageri   
##  7     2 machineopinspct 2.86e- 1 machineopinspct
##  8     2 techsupport     1.33e- 1 techsupport    
##  9     2 armedforc       3.41e-64 armedforc      
## 10     2 protectiveserv  2.61e-75 protectiveserv 
## # ℹ 15 more rows

3D Interactive Graph

The following R code block generates a 3D scatter plot using the plot_ly function from the Plotly library in R. This visualization is designed to elucidate the relationships between different variables across multiple topics. Here is a breakdown of what this 3D plot conveys to the reader based on the given dataset:

1. Topics and Their Distribution:

Each point in the 3D space represents a unique combination of a “term” and its associated “concept” with a particular “beta” value, the per-topic word probability estimated by the LDA model. By coloring these points based on “topic” (with topics defined as factors ranging from 1 to 5), the plot visually segregates data into clusters that indicate how terms are distributed across different topics.

2. Concepts vs. Terms:

The x-axis represents the “concept” and the y-axis the “topic” (the z-axis, discussed next, carries the beta value). Because each concept is simply a topic’s top term, this arrangement lets readers see which terms are associated with which topics and whether certain concepts are concentrated in specific topics or spread across several.

3. Significance of Terms (Beta Values):

The z-axis represents the “beta” values, the probability of each term within its topic. Higher positions on this axis indicate terms that carry more weight within a topic; lower positions indicate terms that contribute little.

4. Interactive Exploration:

Because this is an interactive Plotly figure, readers can rotate the view, zoom in and out, and hover over data points to see details such as the exact term and its beta value. This interactivity helps in examining the structure and relationships in the data more closely.

5. Clustering and Correlations:

By analyzing the spatial distribution of points (terms), readers can identify clusters where certain concepts/terms are closely related within topics. The beta values indicate how strongly each term is associated with its topic.

6. Insights into Data Structure:

Overall, the 3D plot gives a comprehensive picture of how different terms are categorized under various concepts across multiple topics, highlighting the intricate relationships that exist within the dataset. This visualization technique enables the identification of patterns, correlations, and potential anomalies within multi-dimensional data.

This 3D visualization ties together the terms, their per-topic probabilities (beta), and the topics they belong to, providing a holistic view of the data’s structure and surfacing relationships that might not be as apparent in lower-dimensional plots.

library(RColorBrewer)
library(plotly)
## 
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## The following object is masked from 'package:stats':
## 
##     filter
## The following object is masked from 'package:graphics':
## 
##     layout
# Generate a color palette with 5 colors from the "Dark2" palette
colors <- RColorBrewer::brewer.pal(n = 5, name = "Dark2")

# Ensure that there is a one-to-one mapping between the levels and the colors

color_mapping <- setNames(colors, levels(as.factor(concepts$topic)))

fig <- plot_ly(concepts, x = ~concept, y = ~topic, z = ~beta,
               type = 'scatter3d', mode = 'markers',
               marker = list(size = 5, 
                             color = ~color_mapping[as.character(topic)],  # Apply the palette based on topic
                             opacity = 0.8),
               text = ~paste('Term:', concept, '<br>Beta:', format(beta, scientific = TRUE))
)



fig <- fig %>% layout(title = '3D Visualization of Topics',
                      scene = list(xaxis = list(title = 'Term'
                                                # Optionally, set tickvals here if necessary
                                               ),
                                   yaxis = list(title = 'Topic',
                                                tickvals = 1:5 # Set tick values to only include integers
                                               ),
                                   zaxis = list(title = 'Beta'
                                                # Optionally, set tickvals here if necessary
                                               )))
fig
# Save the interactive plot as an HTML file
htmlwidgets::saveWidget(fig, "interactive_plot.html", selfcontained = TRUE)

Visualize the Extracted Concepts

ggplot2 is used to create a bar chart that visualizes the probability (beta) of each concept (word or term) across topics. The visualization makes it easy to see which concepts are most prevalent within each topic, as determined by the Latent Dirichlet Allocation (LDA) topic model.

ggplot(concepts, aes(x = reorder(concept, beta), y = beta, fill = factor(topic))) +
  geom_bar(stat = "identity", width = 0.8) +
  coord_flip() +
  labs(x = "Concept", y = "Probability", fill = "Topic")

Latent Dirichlet Allocation (LDA)

Latent Dirichlet Allocation (LDA) is a generative statistical model that categorizes textual information into topics. It assumes each document is produced from a mixture of topics, and each topic is in turn a distribution over words. The model is unsupervised, which means it discovers these topics directly from the data without requiring labeled datasets.

How Topics Are Chosen

LDA represents topics as distributions over words, and documents as distributions over these topics. Here’s how LDA chooses topics:

  1. Decide on the number of topics K that the LDA should find.
  2. Each topic is defined as a distribution over the entire vocabulary (all the words seen in all documents). These are the word probabilities per topic.
  3. Each document is considered a mixture of these topics in different proportions.

In the model training phase, the algorithm iteratively updates its guesses for the following:

  • Word distributions for each topic: Initially random, the model updates these based on the words in each document and their current topic assignments.
  • Topic distributions for each document: Similarly, these start as randomized guesses and are updated based on which words are present and how the word distributions for each topic are taking shape.

The model continues adjusting these guesses until it converges or reaches a predefined number of iterations.
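
To make the generative story concrete, the toy simulation below draws one short document from three hypothetical topics over a tiny made-up vocabulary. It is purely illustrative and not part of the fitted model; Dirichlet draws are obtained by normalizing Gamma samples so only base R is needed.

# Toy simulation of the LDA generative process (illustrative values only)
set.seed(42)
vocab <- c("clerical", "manager", "repair", "transport", "support")
K <- 3

# Each topic is a distribution over the vocabulary (each row sums to 1)
topics <- t(replicate(K, { p <- rgamma(length(vocab), 0.5); p / sum(p) }))

# The document gets its own mixture of topics (a Dirichlet draw via normalized Gammas)
doc_topic_mix <- { p <- rgamma(K, 0.1); p / sum(p) }

# Generate 10 words: pick a latent topic per word, then a word from that topic
words <- replicate(10, {
  z <- sample(K, 1, prob = doc_topic_mix)    # latent topic assignment
  sample(vocab, 1, prob = topics[z, ])       # observed word
})
words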

# Use the trained LDA model and the `concepts` table of top topic terms

# Extracting the document-topic matrix
topic_assignments <- lda_model %>%
  tidy(matrix = "gamma") %>%
  group_by(document) %>%
  top_n(1, gamma) %>%
  ungroup() %>%
  arrange(document, topic)

# Generating human-readable labels for each topic
topic_labels <- concepts %>%
  group_by(topic) %>%
  summarise(label = paste(concept, collapse = ", ")) %>%
  pull(label)

# topic_assignments now contains the most probable topic assignments
# topic_labels contains a vector of labels for each topic

Create a Date Variable for Trend Analysis

Because this census extract has no time stamp, we construct a synthetic date for each respondent from their age (days since 1970-01-01, approximating a year as 365 days), which gives the trend analysis an age-based time axis.

census$date <- as.Date("1970-01-01") + census$age * 365

Calculate Topic Proportions Over Time

topic_proportions <- census %>%
  inner_join(topic_assignments, by = c("occupation" = "document")) %>%
  group_by(date, topic) %>%
  summarize(proportion = n() / nrow(census), .groups = "drop") %>%
  ungroup()

Visualize Topic Proportions Over Time with Topic Labels

ggplot(topic_proportions, aes(x = date, y = proportion, color = factor(topic))) +
  geom_line() +
  labs(x = "Date", y = "Proportion", color = "Topic") +
  theme_minimal() +
  scale_color_discrete(name = "Topic")
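
Since the section title mentions topic labels, an optional tweak (a sketch using the topic_labels vector built earlier) is to show the concatenated top terms in the legend instead of bare topic numbers:

# Optional: use the human-readable labels from `topic_labels` in the legend
ggplot(topic_proportions, aes(x = date, y = proportion, color = factor(topic))) +
  geom_line() +
  labs(x = "Date", y = "Proportion") +
  theme_minimal() +
  scale_color_discrete(name = "Topic", labels = topic_labels)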

Summary

This workflow transforms raw occupation text into structured, interpretable information through topic modeling and visualization. The primary objective is to discover latent topics within the census occupation descriptions and to understand how these topics manifest and change over time. The methodology pairs the modeling steps with accessible visualizations, making the results approachable for analysts from a range of backgrounds.