Word_Frequency

output: html_document

A central question in text mining is how to quantify what a document is about. we can do this but looking at words that make up the document, and measuring term frequency.

There are a lot of words that may not be important, these are the stop words.

One way to remedy this is to look at inverse document frequency words, which decreases the weight for commonly used words and increases the weight for words that are not used very much

Term frequency in Darwins works

install.packages("gutenbergr")

## Installing package into '/home/student/R/x86_64-pc-linux-gnu-library/4.3'
## (as 'lib' is unspecified)

library(gutenbergr) 
library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(tidytext) 
  


# Download books from Project Gutenberg
book_words <- gutenberg_download(c(944, 1227, 1228, 2300), mirror = "http://mirror.csclub.uwaterloo.ca/gutenberg")

colnames(book_words)[1] <- "book"

book_words$book[book_words$book == 944] <- "The Voyage of the Beagle"
book_words$book[book_words$book == 1227] <- "The Expression of the Emotions in Man and Animals"
book_words$book[book_words$book == 1228] <- "On the Origin of Species By Means of Natural Selection"
book_words$book[book_words$book == 2300] <- "The Descent of Man, and Selection in Relation to Sex"

Now lets disect

book_words <- book_words %>% 
  unnest_tokens(word, text) %>% 
  count(book,word,sort = TRUE)

book_words

book_words$n <- as.numeric(book_words$n)

total_words <- book_words %>% 
  group_by(book) %>% 
  summarize(total = sum(n))

book_words

book_words <- left_join(book_words,total_words)

## Joining with `by = join_by(book)`

book_words

You can see that the usual supspects are the most common words, but don’t tell us anything about what the books topic is.

library(ggplot2)

ggplot(book_words, aes(n/total, fill = book)) +
geom_histogram(show.legend = FALSE) +
xlim(NA, 0.0009) +
facet_wrap(~book, ncol = 2, scales = "free_y")

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

## Warning: Removed 515 rows containing non-finite values (`stat_bin()`).

## Warning: Removed 4 rows containing missing values (`geom_bar()`).

Zipf’s Law

The frequency that a words appears is inversely proportional to its rank wihen predicting a topic.

Lets apply Zipf’s law to Darwin’s work

freq_by_rank <- book_words %>% 
  
  group_by(book) %>% 
  mutate(rank=row_number(),
         `term frequency`= n/total) %>% 
  ungroup()

freq_by_rank

freq_by_rank %>%
ggplot(aes(rank, `term frequency`, color = book)) +
geom_line(size = 1.1, alpha = 0.8, show.legend = FALSE) +
scale_x_log10() +
scale_y_log10()

## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

Lets us TF - IDF to find words for each document by decreasing the weight for commonly used words and increasing the weight for words that are not used very much in a collection of documents.

book_tf_idf <- book_words %>%
  bind_tf_idf(word, book, n)

book_tf_idf

Lets look at terms with high tf-idf in Darwin’s works

book_tf_idf %>% 
  select(-total) %>% 
  arrange(desc(tf_idf))

Lets look at a visualization for these high tf-idf words

book_tf_idf %>%
  group_by(book) %>%
  slice_max(tf_idf, n = 15) %>%
  ungroup() %>%
  ggplot(aes(tf_idf,forcats::fct_reorder(word,tf_idf), fill = book))+
  geom_col(show.legend= FALSE) +
  facet_wrap(~book, ncol = 2, scales = "free") +
  labs(x = "tf-idf", y= NULL)

Word_Frequency

Arabiah Jefferson

2025-02-24