Defining Semantic Analysis

  • Semantic analysis is the study of meaning in language.
  • It is primarily concerned with understanding the relationships between phrases and words.
  • When performed correctly, semantic analysis provides a computer analyzing text with context.
  • This is essential for any machine attempting to realistically parse and interpret text.

Semantic Analysis Through Natural Language Processing

  • In the field of Natural Language Processing (NLP), semantic analysis and the ability to determine context are key factors in achieving accurate results.
  • Developing a model with semantic analysis in mind ensures that the model can parse text and retrieve pertinent, useful information that could not otherwise be obtained via text mining.
  • Text-mining tools, while easier to build, cannot establish context and rely on extensive, explicit instructions to perform searches.

A Statistical Method for Semantic Analysis

  • From a statistical standpoint, a word’s meanings are derived from what words surround it.
  • Words that appear in similar contexts often hold similar meanings, and this is referred to as the Distributional Hypothesis.
  • This principle allows word embeddings to be created.
  • Word embeddings assign numerical values to words, which in turn allows the text to be analyzed statistically.
  • Visualization of semantic data is then possible.

The Vector Space Model

  • In the vector space model, words and documents are represented as vectors within a multi-dimensional space.
  • A word’s position in this space is determined by its co-occurrence with other words.
  • Words with similar meanings are therefore positioned close to each other, as the toy example below illustrates.
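
  • As a tiny sketch in R (toy counts and hypothetical context words, not real data), a word’s vector can be read off as its row in a co-occurrence table:

# rows are words; columns are hypothetical context words
vsm <- rbind(
  cat        = c(4, 3, 0),
  dog        = c(3, 2, 1),
  salamander = c(0, 1, 2)
)
colnames(vsm) <- c("pet", "sleepy", "hunt")

vsm["cat", ]  # the vector that positions "cat" in this toy space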

Cosine Similarity

  • Cosine similarity is a critical statistical tool that quantifies the relationship between two word vectors.
  • The cosine of the angle between the vectors can be calculated, and the similarity between the corresponding words can be inferred from the resulting value.
  • High similarity between words is indicated by a cosine value close to 1.
  • Low similarity is indicated by a value closer to 0.
\[\cos(\theta) = \frac{A \cdot B}{\|A\| \|B\|}\]

(where \(A\) and \(B\) represent word vectors)
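
  • As a minimal sketch in R, the formula above can be applied directly to two vectors (the “cat” and “kitten” vectors below are illustrative values, not trained embeddings):

# cosine similarity between two word vectors
cosine_sim <- function(a, b) {
  sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))
}

# toy vectors (illustrative values only)
cat_vec    <- c(0.8, 0.1, 0.3)
kitten_vec <- c(0.7, 0.2, 0.4)

cosine_sim(cat_vec, kitten_vec)  # ~0.98, i.e. high similarity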

Plotting the Most Frequent Words

Part 1

  • Word frequency analysis is a simple but powerful statistical tool.
  • It provides a quantitative overview of the most important terms in a document.
  • On the next slide we will see the final result of an evaluation of the following sample text:

    “The sneaky gray salamander creeps around the sleepy cat.”
    “The cat is very sleepy.”
    “The salamander is stealthy and the cat is sleepy.”

Plotting the Most Frequent Words

Part 2

  • The plot below shows the most frequent words in the sample text after it has been tokenized and evaluated.

R Code: Word Frequency Bar Plot

  • Here is the code used to create the word frequency plot on the previous slide; the setup lines that build the token iterator and vocabulary were not shown on the original slide and are a reconstruction based on standard text2vec usage.
# load required packages
library(text2vec)
library(dplyr)
library(ggplot2)

# reconstructed setup (not on the original slide):
# tokenize the sample text and build a vocabulary
texts <- c(
  "The sneaky gray salamander creeps around the sleepy cat.",
  "The cat is very sleepy.",
  "The salamander is stealthy and the cat is sleepy."
)
it_dtm <- itoken(texts, preprocessor = tolower, tokenizer = word_tokenizer)
v <- create_vocabulary(it_dtm)

# create doc term matrix (DTM)
vectorizer <- vocab_vectorizer(v)
dtm <- create_dtm(it_dtm, vectorizer)

# use as.matrix to force dtm into acceptable format for colSums
dtm_dense <- as.matrix(dtm)

# total frequency of each word across all documents
word_counts <- colSums(dtm_dense)

# assemble a data frame of the 15 most frequent words
word_counts_df <- data.frame(
  word = names(word_counts),
  count = as.vector(word_counts)
) %>% dplyr::arrange(desc(count)) %>% dplyr::slice_head(n = 15)

# make plot: horizontal bar chart of word frequencies
ggplot(word_counts_df, aes(x = reorder(word, count), y = count)) +
  geom_bar(stat = "identity", fill = "palegreen") +
  coord_flip() +
  labs(title = "Top 15 Most Frequent Words",
       x = "Word",
       y = "Frequency")

Similarity & Distance

  • Through word embeddings, we can determine which words are semantically similar.
  • Additionally, the proximity of two words within the vector space directly reflects the strength of their relationship and similarity.
  • A co-occurrence matrix is generally used to show how often words appear together in the same context.

Plotting a Word Co-occurrence Matrix

  • This statistical data forms the foundation for more advanced models and is derived from a term co-occurrence matrix (TCM), which can be built as shown below.
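
  • As a minimal sketch, a TCM can be built with text2vec by reusing the it_dtm iterator and vectorizer from the word frequency example (the window size of 2 is an illustrative choice):

# build a term co-occurrence matrix (TCM);
# skip_grams_window = 2 counts co-occurrences within 2 words on either side
tcm <- create_tcm(it_dtm, vectorizer, skip_grams_window = 2)

# text2vec stores the TCM in upper-triangular form,
# so symmetrize it before looking up individual word pairs
tcm_dense <- as.matrix(tcm)
tcm_full  <- tcm_dense + t(tcm_dense)

tcm_full["cat", "sleepy"]  # co-occurrence weight (1/distance-weighted by default)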

A More Sophisticated Methodology for Co-Occurrence

  • Pointwise Mutual Information (PMI) is a more advanced measure of the strength of the association between two words.
  • PMI compares the observed probability of two words co-occurring with the probability that they would co-occur by chance.

\[PMI(w_i, w_j) = \log_2 \frac{P(w_i, w_j)}{P(w_i)P(w_j)}\]

  • \(P(w_i, w_j)\) represents the probability of words \(w_i\) and \(w_j\) co-occurring.
  • \(P(w_i)\) and \(P(w_j)\) represent the individual probabilities of each word appearing.
  • A high positive PMI value indicates a strong association, lending support to PMI as a statistical measure of semantic relatedness.
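
  • As a rough sketch, PMI can be computed directly from the symmetrized co-occurrence matrix tcm_full built earlier, with probabilities estimated from co-occurrence weights:

# estimate probabilities from the co-occurrence matrix
total   <- sum(tcm_full)
p_joint <- tcm_full / total            # P(wi, wj)
p_word  <- rowSums(tcm_full) / total   # P(wi)

# outer(p_word, p_word) gives P(wi) * P(wj) for every word pair
pmi <- log2(p_joint / outer(p_word, p_word))
pmi[!is.finite(pmi)] <- 0  # pairs that never co-occur yield log2(0); zero them out

pmi["cat", "sleepy"]  # positive value suggests a stronger-than-chance association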

Word Embeddings in 3D

Part 1

  • The needed word embeddings can be created with statistical models such as Word2Vec.
  • These are high-dimensional vectors, so dimensionality reduction must be done to visualize them in 3D.
  • Words with similar meanings will cluster together within the plot.
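
  • As a sketch of this pipeline in R: text2vec ships a GloVe implementation rather than Word2Vec, but it likewise learns embeddings from co-occurrence statistics, and prcomp can then reduce them for plotting (the rank and iteration counts below are illustrative):

# train GloVe embeddings on the TCM from the co-occurrence example
glove   <- GlobalVectors$new(rank = 5, x_max = 5)
wv_main <- glove$fit_transform(tcm, n_iter = 20)
wv      <- wv_main + t(glove$components)  # combine main and context vectors

# reduce the embeddings to 3 dimensions with PCA for a 3D plot
coords_3d <- prcomp(wv)$x[, 1:3]
head(coords_3d)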

Word Embeddings in 3D

Part 2

  • The resulting 3D scatter plot of the reduced embeddings shows words with similar meanings clustering together.

Conclusion

  • Semantic analysis is a powerful statistical application of computer science.
  • By representing words as vectors in a multi-dimensional space, in-depth quantitative analysis can be performed on the meaning of language.
  • This approach underpins many modern applications, from search engines to machine translation.