In this analysis, we will perform topic modeling on the census occupation data to extract key concepts and gain insights into the underlying themes present in the occupation descriptions. Additionally, we will conduct a trend analysis to examine how the prevalence of these topics has changed over time.
Bayes’ Theorem is a fundamental concept in probability theory that relates the conditional probability of an event A given event B to the conditional probability of event B given event A. Mathematically, Bayes’ Theorem can be expressed as:
\[ P(A|B) = \dfrac{P(B|A) \times P(A)}{P(B)} \]
Where:

- \(P(A|B)\) is the probability of event A given event B (the posterior probability).
- \(P(B|A)\) is the probability of event B given event A.
- \(P(A)\) is the prior probability of event A.
- \(P(B)\) is the prior probability of event B.
In the context of machine learning, Bayes' Theorem underlies classifiers such as Linear Discriminant Analysis, which uses it to calculate the probability of a certain class given the input features. (Linear Discriminant Analysis shares the acronym LDA with Latent Dirichlet Allocation, the topic model applied later in this analysis; the two methods are distinct.)
For a classification task in LDA, we assume that the features \(x\) are generated by a Gaussian distribution with parameters \(\mu_k\) and \(\Sigma\) for each class \(k\). The goal is to find the class label \(y\) that maximizes the posterior probability \(P(y | x)\) using Bayes’ Theorem:
\[ P(y | x) = \dfrac{P(x | y) \times P(y)}{P(x)} \]
Where:

- \(P(y | x)\) is the posterior probability of class \(y\) given input features \(x\).
- \(P(x | y)\) is the likelihood of observing features \(x\) given class \(y\).
- \(P(y)\) is the prior probability of class \(y\).
- \(P(x)\) is the evidence, or marginal likelihood, of observing features \(x\).
By calculating the posterior probabilities for each class given the input features, LDA can classify new data points by assigning them to the class with the highest posterior probability. This approach combines statistical inference with probability theory to make predictions based on the underlying patterns in the data.
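As a quick illustration of the theorem itself, the short snippet below computes a posterior probability in R from an assumed prior and likelihoods; the numbers are hypothetical and chosen only to show the arithmetic.

# Hypothetical example of Bayes' Theorem:
# A = "the observation belongs to class 1", B = "a particular feature is present"
p_A <- 0.30            # prior P(A)
p_B_given_A <- 0.80    # likelihood P(B | A)
p_B_given_notA <- 0.10 # likelihood P(B | not A)
# Evidence P(B) via the law of total probability
p_B <- p_B_given_A * p_A + p_B_given_notA * (1 - p_A)
# Posterior P(A | B) = P(B | A) * P(A) / P(B)
p_A_given_B <- p_B_given_A * p_A / p_B
p_A_given_B # 0.7741935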
We begin by loading the necessary libraries for data manipulation, text preprocessing, topic modeling, and trend analysis.
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(tidytext)
library(SnowballC)
library(topicmodels)
library(tidylo)
library(tm)
## Loading required package: NLP
##
## Attaching package: 'NLP'
##
## The following object is masked from 'package:ggplot2':
##
## annotate
library(broom)
library(lubridate)
Next, we load the census dataset from the UCI Machine Learning Repository and preprocess the text data.
census <- read_csv("https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data",
col_names = FALSE)
## Rows: 32561 Columns: 15
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (9): X2, X4, X6, X7, X8, X9, X10, X14, X15
## dbl (6): X1, X3, X5, X11, X12, X13
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(census)
colnames(census) <- c("age", "workclass", "fnlwgt", "education", "education.num",
"marital.status", "occupation", "relationship", "race", "sex",
"capital.gain", "capital.loss", "hours.per.week", "native.country",
"income")
head(census)
This preprocessing sequence prepares the text for analysis by cleaning and normalizing the `occupation` data, ensuring uniformity and relevance and facilitating more accurate analytical outcomes. Each step contributes as follows:

- `mutate(occupation = tolower(occupation))` uses `tolower()` to convert the `occupation` field to lowercase. This normalization is crucial because it eliminates variations in text caused by case differences; for example, "Engineer," "engineer," and "ENGINEER" are treated as the same word, ensuring consistency in the textual analysis.
- `occupation = gsub("[^a-zA-Z\\s]", "", occupation)` applies the `gsub()` function with the pattern `"[^a-zA-Z\\s]"` to strip characters that are not letters from the `occupation` text. This step cleans the text by removing punctuation and special characters that are typically irrelevant to the text analysis, focusing the analysis purely on the words.
- `occupation = tm::removeWords(occupation, stopwords("en"))` uses `removeWords()` from the `tm` package together with `stopwords("en")`, the list of English-language stop words. Removing these common words helps the analysis focus on more meaningful words in the text.
- `occupation = SnowballC::wordStem(occupation)` applies `wordStem()` from the `SnowballC` package to reduce each word to its stem, so that inflected variants of the same word are counted together.

By conducting these preprocessing steps, the textual data in the `occupation` field is standardized and optimized for further analysis. This preparation is foundational for subsequent procedures such as vectorization, topic modeling, or sentiment analysis, ensuring that the data fed into analytical models is accurate and representative of the content intended for study.
census <- census %>%
mutate(occupation = tolower(occupation),
occupation = gsub("[^a-zA-Z\\s]", "", occupation),
occupation = tm::removeWords(occupation, stopwords("en")),
occupation = SnowballC::wordStem(occupation))
# Display the first 5 rows
(head(census, 5))
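Beyond the first few rows, a quick tabulation of the cleaned `occupation` values is a useful check (a small verification sketch) that the normalization produced the expected categories.

# Inspect the distinct cleaned occupation values and their frequencies
census %>%
  dplyr::count(occupation, sort = TRUE)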
To perform topic modeling, we examine the `occupation` text data from the `census` dataframe to discover latent themes or topics that might be present, using natural language processing and topic modeling techniques.

The first task is to convert the textual data in the `occupation` column into a mathematical format that can be used for analysis. We do this through three primary steps:

- Tokenization: the `tidytext::unnest_tokens()` function makes this straightforward by automatically handling common preprocessing tasks like converting all text to lowercase.
- Counting: the occurrences of each word are tallied with the `dplyr::count()` function.
- Casting: the `cast_dtm()` function transforms the word counts into a Document-Term Matrix, giving us a structured, sparse matrix.

With a Document-Term Matrix in hand, we proceed to perform topic modeling using Latent Dirichlet Allocation (LDA):

- We fit the model with five topics (`k = 5`). The number of topics is a model parameter that can be tuned depending on the desired granularity of analysis; a sketch for comparing candidate values of `k` appears after the model summary below.
- We fix the random seed (`seed = 123`). This ensures that the results are reproducible, an essential aspect when conducting scientific or analytical studies that may need verification or replication.

Once the LDA model is fitted, we take the following steps to understand and characterize the topics:

- The `tidy()` function from the broom package helps in reshaping the model's output into a tidy dataframe. Specifically, we extract the beta matrix, which contains the probabilities of words belonging to each topic.
- The top terms, or concepts, for each topic are then arranged and labeled for presentation or further analysis, giving us clear insights into the main themes present in the occupational data from the census.
This procedure provides a method for extracting meaningful information from unstructured text data, facilitating a deeper understanding of the underlying patterns in occupational descriptions.
# Create a Document-Term Matrix from the `occupation` column in the `census` dataframe
# This process involves three main steps:
# 1. Tokenizing the text into words
# 2. Counting the occurrence of each word
# 3. Casting the word counts into a Document-Term Matrix
dtm <- census %>%
tidytext::unnest_tokens(word, occupation) %>%
dplyr::count(word) %>%
cast_dtm(word, word, n)
# Apply Latent Dirichlet Allocation (LDA) to perform topic modeling on the Document-Term Matrix
# `k` specifies the number of topics to be generated by the model
# The `control` list includes parameters for the LDA algorithm, here we set a random seed for reproducibility
lda_model <- LDA(dtm, k = 5, control = list(seed = 123))
# Output the structure of the LDA model object, which includes details about the topics and their associated terms
str(lda_model)
## Formal class 'LDA_VEM' [package "topicmodels"] with 14 slots
## ..@ alpha : num 0.0217
## ..@ call : language LDA(x = dtm, k = 5, control = list(seed = 123))
## ..@ Dim : int [1:2] 14 14
## ..@ control :Formal class 'LDA_VEMcontrol' [package "topicmodels"] with 13 slots
## .. .. ..@ estimate.alpha: logi TRUE
## .. .. ..@ alpha : num 10
## .. .. ..@ seed : int 123
## .. .. ..@ verbose : int 0
## .. .. ..@ prefix : chr "/var/folders/v6/vbndk1g95x1dh_h2t_t1wzhr0000gn/T//RtmpxVfL5Y/file1195f1cfcda0e"
## .. .. ..@ save : int 0
## .. .. ..@ nstart : int 1
## .. .. ..@ best : logi TRUE
## .. .. ..@ keep : int 0
## .. .. ..@ estimate.beta : logi TRUE
## .. .. ..@ var :Formal class 'OPTcontrol' [package "topicmodels"] with 2 slots
## .. .. .. .. ..@ iter.max: int 500
## .. .. .. .. ..@ tol : num 1e-06
## .. .. ..@ em :Formal class 'OPTcontrol' [package "topicmodels"] with 2 slots
## .. .. .. .. ..@ iter.max: int 1000
## .. .. .. .. ..@ tol : num 1e-04
## .. .. ..@ initialize : chr "random"
## ..@ k : int 5
## ..@ terms : chr [1:14] "admcler" "armedforc" "craftrepair" "execmanageri" ...
## ..@ documents : chr [1:14] "admcler" "armedforc" "craftrepair" "execmanageri" ...
## ..@ beta : num [1:5, 1:14] -0.467 -180.446 -180.604 -169.638 -183.104 ...
## ..@ gamma : num [1:14, 1:5] 1.00 2.38e-03 5.29e-06 5.33e-06 2.18e-05 ...
## ..@ wordassignments:List of 5
## .. ..$ i : int [1:14] 1 2 3 4 5 6 7 8 9 10 ...
## .. ..$ j : int [1:14] 1 2 3 4 5 6 7 8 9 10 ...
## .. ..$ v : num [1:14] 1 3 5 2 4 3 2 5 4 3 ...
## .. ..$ nrow: int 14
## .. ..$ ncol: int 14
## .. ..- attr(*, "class")= chr "simple_triplet_matrix"
## ..@ loglikelihood : num [1:14] -1764.3 -59.6 -4065.1 -2208.9 -141.1 ...
## ..@ iter : int 7
## ..@ logLiks : num(0)
## ..@ n : int 30718
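The number of topics `k` was fixed at 5 above; since it is a tuning parameter, one simple way to compare candidate values is to refit the model for each and compare in-sample log-likelihoods. The sketch below reuses the `dtm` object from the earlier chunk; the candidate values are illustrative, and in practice held-out measures such as perplexity are also commonly used.

# Sketch: refit the LDA model for several candidate numbers of topics
# and compare the resulting log-likelihoods (higher is better in-sample)
candidate_k <- c(2, 3, 5, 7)   # illustrative values only
k_comparison <- tibble(
  k = candidate_k,
  log_likelihood = purrr::map_dbl(
    candidate_k,
    ~ as.numeric(logLik(LDA(dtm, k = .x, control = list(seed = 123))))
  )
)
k_comparison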
# Identify the top terms (concepts) within each topic as determined by the LDA model
# This involves several steps:
# 1. Converting LDA model details to a tidy format
# 2. Selecting the top terms for each topic based on their 'beta' values
concepts <- lda_model %>%
tidy(matrix = "beta") %>%
group_by(topic) %>%
top_n(5, beta) %>%
ungroup() %>%
arrange(topic, -beta) %>%
mutate(concept = term)
print(concepts)
## # A tibble: 25 × 4
## topic term beta concept
## <int> <chr> <dbl> <chr>
## 1 1 admcler 6.27e- 1 admcler
## 2 1 transportmov 2.65e- 1 transportmov
## 3 1 protectiveserv 1.08e- 1 protectiveserv
## 4 1 armedforc 8.21e-65 armedforc
## 5 1 techsupport 6.18e-75 techsupport
## 6 2 execmanageri 5.81e- 1 execmanageri
## 7 2 machineopinspct 2.86e- 1 machineopinspct
## 8 2 techsupport 1.33e- 1 techsupport
## 9 2 armedforc 3.41e-64 armedforc
## 10 2 protectiveserv 2.61e-75 protectiveserv
## # ℹ 15 more rows
The following R code block generates a 3D scatter plot using the `plot_ly()` function from the plotly library in R. This visualization is designed to elucidate the relationships between terms, topics, and their beta values. Here is a breakdown of what this 3D plot conveys to the reader based on the given dataset:
Each point in the 3D space represents a unique combination of a term and its associated concept together with its beta value, the probability of that term within a topic as estimated by the LDA model. By coloring these points based on topic (with topics defined as factors ranging from 1 to 5), the plot visually segregates the data into clusters that show how terms are distributed across different topics.
The x-axis shows the concept (term) and the y-axis the topic number. This arrangement allows readers to see which terms are associated with which topics and how these associations are distributed; for instance, one can identify whether certain concepts are predominantly associated with specific topics or spread across several.
The z-axis carries the beta values. Because beta is a per-topic word probability, higher positions on this axis indicate terms that carry more weight within their topic, and lower positions indicate terms that carry less.
Given that this is a 3D plot, readers can likely interact with the visualization by rotating the view, zooming in and out, and possibly hovering over data points to get more details (e.g., the exact term, its beta value, and associated concept). This interactivity helps in examining the structure and relationships in the data more closely.
By analyzing the spatial distribution of points (terms), readers can identify clusters where certain concepts/terms are closely related within topics. Beta values aid in understanding the depth or impact of these terms within their conceptual and topical context.
Overall, the 3D plot gives a comprehensive picture of how different terms are categorized under various concepts across multiple topics, highlighting the intricate relationships that exist within the dataset. This visualization technique enables the identification of patterns, correlations, and potential anomalies within multi-dimensional data.
This 3D visualization aids in bridging the understanding of complex relationships among terms, their significance (beta), and associated concepts distributed across different topics, ultimately providing a holistic view of the data’s structure and key insights that might not be as apparent in lower-dimensional plots.
library(RColorBrewer)
library(plotly)
##
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
# Generate a color palette with 5 colors from the "Dark2" palette
colors <- RColorBrewer::brewer.pal(n = 5, name = "Dark2")
# Ensure that there is a one-to-one mapping between the levels and the colors
color_mapping <- setNames(colors, levels(as.factor(concepts$topic)))
fig <- plot_ly(concepts, x = ~concept, y = ~topic, z = ~beta,
type = 'scatter3d', mode = 'markers',
marker = list(size = 5,
color = ~color_mapping[as.character(topic)], # Apply the palette based on topic
opacity = 0.8),
text = ~paste('Term:', concept, '<br>Beta:', format(beta, scientific = TRUE))
)
fig <- fig %>% layout(title = '3D Visualization of Topics',
scene = list(xaxis = list(title = 'Term'
# Optionally, set tickvals here if necessary
),
yaxis = list(title = 'Topic',
tickvals = 1:5 # Set tick values to only include integers
),
zaxis = list(title = 'Beta'
# Optionally, set tickvals here if necessary
)))
fig
# Save the interactive plot as an HTML file
htmlwidgets::saveWidget(fig, "interactive_plot.html", selfcontained = TRUE)
ggplot2 is used to create a bar chart that visualizes the relationship between the different concepts (words or terms) and their associated probabilities across the topics. The visualization makes it easy to see which concepts are most important or prevalent within each topic, as determined by the Latent Dirichlet Allocation (LDA) topic model.
ggplot(concepts, aes(x = reorder(concept, beta), y = beta, fill = factor(topic))) +
  geom_col(width = 0.8) +
  coord_flip() +
  labs(x = "Concept", y = "Probability", fill = "Topic")
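Because the same term can appear in more than one topic, a faceted variant can make per-topic rankings easier to read. The sketch below uses `tidytext::reorder_within()` (tidytext is already loaded above); the layout choices are illustrative rather than part of the original analysis.

# Sketch: one panel per topic, with bars ordered within each panel
concepts %>%
  mutate(concept = tidytext::reorder_within(concept, beta, topic)) %>%
  ggplot(aes(x = beta, y = concept, fill = factor(topic))) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~ topic, scales = "free_y") +
  tidytext::scale_y_reordered() +
  labs(x = "Probability", y = "Concept")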
# Latent Dirichlet Allocation (LDA)
Latent Dirichlet Allocation (LDA) is a generative statistical model that categorizes textual information into topics. It assumes documents are produced from a mixture of topics, and each topic is in turn a distribution over words. This model is unsupervised, which means it discovers these topics directly from the data without requiring labeled datasets.
LDA represents topics as distributions over words, and documents as distributions over these topics. Here is how LDA arrives at its topics:

- We first specify the number of topics `K` that LDA should find.
- In the model training phase, the algorithm iteratively updates its guesses for the assignment of words to topics and for the mixture of topics within each document.

The model continues adjusting these guesses until it converges or reaches a predefined number of iterations.
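Using the notation of the model output below (the beta matrix from `tidy(matrix = "beta")` and the gamma matrix from `tidy(matrix = "gamma")`), this mixture assumption can be written as:

\[ P(w \mid d) = \sum_{k=1}^{K} P(k \mid d)\, P(w \mid k) = \sum_{k=1}^{K} \gamma_{d,k}\, \beta_{k,w} \]

where \(\gamma_{d,k}\) is the proportion of document \(d\) attributed to topic \(k\) and \(\beta_{k,w}\) is the probability of word \(w\) within topic \(k\).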
# Use the trained LDA model and the concepts table of topic terms from above
# Extracting the document-topic matrix
topic_assignments <- lda_model %>%
tidy(matrix = "gamma") %>%
group_by(document) %>%
top_n(1, gamma) %>%
ungroup() %>%
arrange(document, topic)
# Generating human-readable labels for each topic
topic_labels <- concepts %>%
group_by(topic) %>%
summarise(label = paste(concept, collapse = ", ")) %>%
pull(label)
# topic_assignments now contains the most probable topic assignments
# topic_labels contains a vector of labels for each topic
# Construct a synthetic date from age (years converted to days since 1970-01-01),
# used here as a proxy time axis for the trend analysis
census$date <- as.Date("1970-01-01") + census$age * 365
topic_proportions <- census %>%
  inner_join(topic_assignments, by = c("occupation" = "document")) %>%
  group_by(date, topic) %>%
  summarize(proportion = n() / nrow(census), .groups = "drop")
ggplot(topic_proportions, aes(x = date, y = proportion, color = factor(topic))) +
geom_line() +
labs(x = "Date", y = "Proportion", color = "Topic") +
theme_minimal() +
scale_color_discrete(name = "Topic")
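The `topic_labels` vector constructed earlier can also be used to give the legend human-readable names. A minimal sketch, relabeling the same plot:

# Sketch: reuse the human-readable topic labels in the legend
ggplot(topic_proportions, aes(x = date, y = proportion, color = factor(topic))) +
  geom_line() +
  labs(x = "Date", y = "Proportion") +
  theme_minimal() +
  scale_color_discrete(name = "Topic", labels = topic_labels)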
# Summary
Throughout this R workflow, the focus is on transforming raw text data into structured, interpretable information using topic modeling and visualization. The primary objective is to discover latent topics within the census occupation descriptions and to understand how these topics manifest and change over time. The methodology pairs these machine learning extraction techniques with user-friendly visualizations, making the results accessible to readers from a range of backgrounds in data analysis.