output: html_document
A central question in text mining is how to quantify what a document is about. we can do this but looking at words that make up the document, and measuring term frequency.
There are a lot of words that may not be important, these are the stop words.
One way to remedy this is to look at inverse document frequency words, which decreases the weight for commonly used words and increases the weight for words that are not used very much
Term frequency in Darwins works
install.packages("gutenbergr")
## Installing package into '/home/student/R/x86_64-pc-linux-gnu-library/4.3'
## (as 'lib' is unspecified)
library(gutenbergr)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(tidytext)
# Download books from Project Gutenberg
book_words <- gutenberg_download(c(944, 1227, 1228, 2300), mirror = "http://mirror.csclub.uwaterloo.ca/gutenberg")
colnames(book_words)[1] <- "book"
book_words$book[book_words$book == 944] <- "The Voyage of the Beagle"
book_words$book[book_words$book == 1227] <- "The Expression of the Emotions in Man and Animals"
book_words$book[book_words$book == 1228] <- "On the Origin of Species By Means of Natural Selection"
book_words$book[book_words$book == 2300] <- "The Descent of Man, and Selection in Relation to Sex"
Now lets disect
book_words <- book_words %>%
unnest_tokens(word, text) %>%
count(book,word,sort = TRUE)
book_words
book_words$n <- as.numeric(book_words$n)
total_words <- book_words %>%
group_by(book) %>%
summarize(total = sum(n))
book_words
book_words <- left_join(book_words,total_words)
## Joining with `by = join_by(book)`
book_words
You can see that the usual supspects are the most common words, but don’t tell us anything about what the books topic is.
library(ggplot2)
ggplot(book_words, aes(n/total, fill = book)) +
geom_histogram(show.legend = FALSE) +
xlim(NA, 0.0009) +
facet_wrap(~book, ncol = 2, scales = "free_y")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 515 rows containing non-finite values (`stat_bin()`).
## Warning: Removed 4 rows containing missing values (`geom_bar()`).
Zipf’s Law
The frequency that a words appears is inversely proportional to its rank wihen predicting a topic.
Lets apply Zipf’s law to Darwin’s work
freq_by_rank <- book_words %>%
group_by(book) %>%
mutate(rank=row_number(),
`term frequency`= n/total) %>%
ungroup()
freq_by_rank
freq_by_rank %>%
ggplot(aes(rank, `term frequency`, color = book)) +
geom_line(size = 1.1, alpha = 0.8, show.legend = FALSE) +
scale_x_log10() +
scale_y_log10()
## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
Lets us TF - IDF to find words for each document by decreasing the weight for commonly used words and increasing the weight for words that are not used very much in a collection of documents.
book_tf_idf <- book_words %>%
bind_tf_idf(word, book, n)
book_tf_idf
Lets look at terms with high tf-idf in Darwin’s works
book_tf_idf %>%
select(-total) %>%
arrange(desc(tf_idf))
Lets look at a visualization for these high tf-idf words
book_tf_idf %>%
group_by(book) %>%
slice_max(tf_idf, n = 15) %>%
ungroup() %>%
ggplot(aes(tf_idf,forcats::fct_reorder(word,tf_idf), fill = book))+
geom_col(show.legend= FALSE) +
facet_wrap(~book, ncol = 2, scales = "free") +
labs(x = "tf-idf", y= NULL)