Defining Semantic Analysis

  • Semantic analysis is the study of meaning in language.
  • It is primarily concerned with understanding the relationships between phrases and words.
  • When performed correctly, semantic analysis provides a computer analyzing text with context.
  • This is essential for any machine attempting to realistically parse and interpret text.

Semantic Analysis Through Natural Language Processing

  • In the field of Natural Language Processing (NLP), semantic analysis and the ability to determine context are key factors in achieving accurate results.
  • Developing a model with semantic analysis in mind ensures that the model can parse text and retrieve pertinent, useful information that could not otherwise be obtained via text mining.
  • Text-mining tools, while easier to build, cannot establish context and rely on extensive, explicit instructions to perform searches.

A Statistical Method for Semantic Analysis

  • From a statistical standpoint, a word’s meanings are derived from what words surround it.
  • Words that appear in similar contexts often hold similar meanings, and this is referred to as the Distributional Hypothesis.
  • This principle allows word embeddings to be created.
  • Word embeddings assign numerical values to words, which in turn allows the text to be analyzed statistically.
  • Visualization of semantic data is then possible.

The Vector Space Model

  • In the vector space model, words and documents are represented as vectors within a multi-dimensional space.
  • A word’s position in this space is determined by its co-occurrence with other words.
  • Words with similar meanings are therefore positioned close to each other, as the toy example below illustrates.
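
  • As a tiny sketch in R (toy counts and hypothetical context words, not real data), a word’s vector can be read off as its row in a co-occurrence table:

# rows are words; columns are hypothetical context words
vsm <- rbind(
  cat        = c(4, 3, 0),
  dog        = c(3, 2, 1),
  salamander = c(0, 1, 2)
)
colnames(vsm) <- c("pet", "sleepy", "hunt")

vsm["cat", ]  # the vector that positions "cat" in this toy space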

Cosine Similarity

  • Cosine similarity is a critical statistical tool that quantifies the relationship between two word vectors.
  • The cosine of the angle between the vectors can be calculated, and the similarity between the corresponding words can be inferred from the resulting value.
  • High similarity between words is indicated by a cosine value close to 1.
  • Low similarity is indicated by a value closer to 0.
\[\cos(\theta) = \frac{A \cdot B}{\|A\| \|B\|}\]

(where \(A\) and \(B\) represent word vectors)
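
  • As a minimal sketch in R, the formula above can be applied directly to two vectors (the “cat” and “kitten” vectors below are illustrative values, not trained embeddings):

# cosine similarity between two word vectors
cosine_sim <- function(a, b) {
  sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))
}

# toy vectors (illustrative values only)
cat_vec    <- c(0.8, 0.1, 0.3)
kitten_vec <- c(0.7, 0.2, 0.4)

cosine_sim(cat_vec, kitten_vec)  # ~0.98, i.e. high similarity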

Plotting the Most Frequent Words

Part 1

  • Word frequency analysis is a simple but powerful statistical tool.
  • It provides a quantitative overview of the most important terms in a document.
  • On the next slide we will see the final result of an evaluation of the following sample text:

    “The sneaky gray salamander creeps around the sleepy cat.”
    “The cat is very sleepy.”
    “The salamander is stealthy and the cat is sleepy.”

Plotting the Most Frequent Words

Part 2

  • The plot below shows the most frequent words in the sample text after it has been tokenized and evaluated.

R Code: Word Frequency Bar Plot

  • Here is the code used to create the word frequency plot on the previous slide; the setup lines that build the token iterator and vocabulary were not shown on the original slide and are a reconstruction based on standard text2vec usage.
# load required packages
library(text2vec)
library(dplyr)
library(ggplot2)

# reconstructed setup (not on the original slide):
# tokenize the sample text and build a vocabulary
texts <- c(
  "The sneaky gray salamander creeps around the sleepy cat.",
  "The cat is very sleepy.",
  "The salamander is stealthy and the cat is sleepy."
)
it_dtm <- itoken(texts, preprocessor = tolower, tokenizer = word_tokenizer)
v <- create_vocabulary(it_dtm)

# create doc term matrix (DTM)
vectorizer <- vocab_vectorizer(v)
dtm <- create_dtm(it_dtm, vectorizer)

# use as.matrix to force dtm into acceptable format for colSums
dtm_dense <- as.matrix(dtm)

# total frequency of each word across all documents
word_counts <- colSums(dtm_dense)

# assemble a data frame of the 15 most frequent words
word_counts_df <- data.frame(
  word = names(word_counts),
  count = as.vector(word_counts)
) %>% dplyr::arrange(desc(count)) %>% dplyr::slice_head(n = 15)

# make plot: horizontal bar chart of word frequencies
ggplot(word_counts_df, aes(x = reorder(word, count), y = count)) +
  geom_bar(stat = "identity", fill = "palegreen") +
  coord_flip() +
  labs(title = "Top 15 Most Frequent Words",
       x = "Word",
       y = "Frequency")

Similarity & Distance

  • Through word embeddings, we can determine which words are semantically similar.
  • Additionally, the proximity of two words within the vector space directly reflects the strength of their relationship and similarity.
  • A co-occurrence matrix is generally used to show how often words appear together in the same context.

Plotting a Word Co-occurrence Matrix

  • This statistical data forms the foundation for more advanced models and is derived from a term co-occurrence matrix (TCM), which can be built as shown below.
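
  • As a minimal sketch, a TCM can be built with text2vec by reusing the it_dtm iterator and vectorizer from the word frequency example (the window size of 2 is an illustrative choice):

# build a term co-occurrence matrix (TCM);
# skip_grams_window = 2 counts co-occurrences within 2 words on either side
tcm <- create_tcm(it_dtm, vectorizer, skip_grams_window = 2)

# text2vec stores the TCM in upper-triangular form,
# so symmetrize it before looking up individual word pairs
tcm_dense <- as.matrix(tcm)
tcm_full  <- tcm_dense + t(tcm_dense)

tcm_full["cat", "sleepy"]  # co-occurrence weight (1/distance-weighted by default)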

A More Sophisticated Methodology for Co-Occurrence

  • Pointwise Mutual Information (PMI) is a more advanced measure of the strength of the association between two words.
  • PMI compares the observed probability of two words co-occurring with the probability that they would co-occur by chance.

\[PMI(w_i, w_j) = \log_2 \frac{P(w_i, w_j)}{P(w_i)P(w_j)}\]

  • \(P(w_i, w_j)\) represents the probability of words \(w_i\) and \(w_j\) co-occurring.
  • \(P(w_i)\) and \(P(w_j)\) represent the individual probabilities of each word appearing.
  • A high positive PMI value indicates a strong association, lending support to PMI as a statistical measure of semantic relatedness.
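
  • As a rough sketch, PMI can be computed directly from the symmetrized co-occurrence matrix tcm_full built earlier, with probabilities estimated from co-occurrence weights:

# estimate probabilities from the co-occurrence matrix
total   <- sum(tcm_full)
p_joint <- tcm_full / total            # P(wi, wj)
p_word  <- rowSums(tcm_full) / total   # P(wi)

# outer(p_word, p_word) gives P(wi) * P(wj) for every word pair
pmi <- log2(p_joint / outer(p_word, p_word))
pmi[!is.finite(pmi)] <- 0  # pairs that never co-occur yield log2(0); zero them out

pmi["cat", "sleepy"]  # positive value suggests a stronger-than-chance association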

Word Embeddings in 3D

Part 1

  • The needed word embeddings can be created with statistical models such as Word2Vec.
  • These are high-dimensional vectors, so dimensionality reduction must be done to visualize them in 3D.
  • Words with similar meanings will cluster together within the plot.
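
  • As a sketch of this pipeline in R: text2vec ships a GloVe implementation rather than Word2Vec, but it likewise learns embeddings from co-occurrence statistics, and prcomp can then reduce them for plotting (the rank and iteration counts below are illustrative):

# train GloVe embeddings on the TCM from the co-occurrence example
glove   <- GlobalVectors$new(rank = 5, x_max = 5)
wv_main <- glove$fit_transform(tcm, n_iter = 20)
wv      <- wv_main + t(glove$components)  # combine main and context vectors

# reduce the embeddings to 3 dimensions with PCA for a 3D plot
coords_3d <- prcomp(wv)$x[, 1:3]
head(coords_3d)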

Word Embeddings in 3D

Part 2

  • The resulting 3D scatter plot of the reduced embeddings shows words with similar meanings clustering together.

Conclusion

  • Semantic analysis is a powerful statistical application of computer science.
  • By representing words as vectors in a multi-dimensional space, in-depth quantitative analysis can be performed on the meaning of language.
  • This approach underpins many modern applications, from search engines to machine translation.