The goals of this week’s assignment are as follows:

1. Reproduce the primary example code from chapter 2 of *Text Mining with R: A Tidy Approach* by Julia Silge and David Robinson.
2. Work with a different corpus. I chose a collection of works by my favorite author, Charles Dickens, downloaded from Project Gutenberg, which archives the texts.
3. Incorporate the Loughran-McDonald sentiment lexicon.
4. Find the most frequent words.
5. Find the most important words.
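The setup chunk loads the packages used throughout. The load calls themselves did not survive rendering, but presumably they were along these lines (the list follows the package warnings that did render):

library(tidyverse)   # ggplot2, dplyr, tidyr, stringr, ...
library(openintro)
library(tidytext)
library(janeaustenr)
library(textdata)
library(syuzhet)
library(tm)
library(lexicon)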
library(wordcloud)  # visualization
library(irr)        # inter-rater reliability for lexicons
library(textstem)   # stemming and lemmatization (example purposes only)
library(gutenbergr) # downloading texts from Project Gutenberg
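Chapter 2 begins by looking at the sentiment lexicons themselves. The calls that print the three tables below are, per the book, simply:

get_sentiments("afinn")
get_sentiments("bing")
get_sentiments("nrc")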
## # A tibble: 2,477 × 2
## word value
## <chr> <dbl>
## 1 abandon -2
## 2 abandoned -2
## 3 abandons -2
## 4 abducted -2
## 5 abduction -2
## 6 abductions -2
## 7 abhor -3
## 8 abhorred -3
## 9 abhorrent -3
## 10 abhors -3
## # ℹ 2,467 more rows
## # A tibble: 6,786 × 2
## word sentiment
## <chr> <chr>
## 1 2-faces negative
## 2 abnormal negative
## 3 abolish negative
## 4 abominable negative
## 5 abominably negative
## 6 abominate negative
## 7 abomination negative
## 8 abort negative
## 9 aborted negative
## 10 aborts negative
## # ℹ 6,776 more rows
## # A tibble: 13,872 × 2
## word sentiment
## <chr> <chr>
## 1 abacus trust
## 2 abandon fear
## 3 abandon negative
## 4 abandon sadness
## 5 abandoned anger
## 6 abandoned fear
## 7 abandoned negative
## 8 abandoned sadness
## 9 abandonment anger
## 10 abandonment fear
## # ℹ 13,862 more rows
library(janeaustenr)
library(dplyr)
library(stringr)
tidy_books <- austen_books() %>%
  group_by(book) %>%
  mutate(linenumber = row_number(),
         chapter = cumsum(str_detect(text,
                                     regex("^chapter [\\divxlc]",
                                           ignore_case = TRUE)))) %>%
  ungroup() %>%
  unnest_tokens(word, text)

nrc_joy <- get_sentiments("nrc") %>%
  filter(sentiment == "joy")

tidy_books %>%
  filter(book == "Emma") %>%
  inner_join(nrc_joy) %>%
  count(word, sort = TRUE)

## Joining with `by = join_by(word)`
## # A tibble: 301 × 2
## word n
## <chr> <int>
## 1 good 359
## 2 friend 166
## 3 hope 143
## 4 happy 125
## 5 love 117
## 6 deal 92
## 7 found 92
## 8 present 89
## 9 kind 82
## 10 happiness 76
## # ℹ 291 more rows
library(tidyr)
library(dplyr)
library(tidytext)

# Perform sentiment analysis using the Bing lexicon
options(dplyr.summarise.inform = FALSE)
bing <- get_sentiments("bing") %>%
  distinct(word, sentiment)

jane_austen_sentiment <- tidy_books %>%
  inner_join(bing, by = "word") %>%
  count(book, index = linenumber %/% 80, sentiment) %>%
  pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) %>%
  mutate(sentiment = positive - negative)

## Warning in inner_join(., bing, by = "word"): Detected an unexpected many-to-many relationship between `x` and `y`.
## ℹ Row 435434 of `x` matches multiple rows in `y`.
## ℹ Row 5051 of `y` matches multiple rows in `x`.
## ℹ If a many-to-many relationship is expected, set `relationship =
## "many-to-many"` to silence this warning.
ggplot(jane_austen_sentiment, aes(index, sentiment, fill = book)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~book, ncol = 2, scales = "free_x")
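The next comparison works on *Pride & Prejudice* alone. The filtering step did not survive rendering; per the book it is:

pride_prejudice <- tidy_books %>%
  filter(book == "Pride & Prejudice")

pride_prejudice

## # A tibble: 117,077 × 4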
## book linenumber chapter word
## <fct> <int> <int> <chr>
## 1 Pride & Prejudice 1 0 pride
## 2 Pride & Prejudice 1 0 and
## 3 Pride & Prejudice 1 0 prejudice
## 4 Pride & Prejudice 3 0 by
## 5 Pride & Prejudice 3 0 jane
## 6 Pride & Prejudice 3 0 austen
## 7 Pride & Prejudice 7 1 chapter
## 8 Pride & Prejudice 7 1 1
## 9 Pride & Prejudice 10 1 it
## 10 Pride & Prejudice 10 1 is
## # ℹ 117,067 more rows
afinn <- pride_prejudice %>%
  inner_join(get_sentiments("afinn")) %>%
  group_by(index = linenumber %/% 80) %>%
  summarise(sentiment = sum(value)) %>%
  mutate(method = "AFINN")

## Joining with `by = join_by(word)`
bing_and_nrc <- bind_rows(
  pride_prejudice %>%
    inner_join(get_sentiments("bing")) %>%
    mutate(method = "Bing et al."),
  pride_prejudice %>%
    inner_join(get_sentiments("nrc") %>%
                 filter(sentiment %in% c("positive",
                                         "negative"))) %>%
    mutate(method = "NRC")) %>%
  count(method, index = linenumber %/% 80, sentiment) %>%
  pivot_wider(names_from = sentiment,
              values_from = n,
              values_fill = 0) %>%
  mutate(sentiment = positive - negative)

## Joining with `by = join_by(word)`
## Joining with `by = join_by(word)`
## Warning in inner_join(., get_sentiments("nrc") %>% filter(sentiment %in% : Detected an unexpected many-to-many relationship between `x` and `y`.
## ℹ Row 205 of `x` matches multiple rows in `y`.
## ℹ Row 5178 of `y` matches multiple rows in `x`.
## ℹ If a many-to-many relationship is expected, set `relationship =
## "many-to-many"` to silence this warning.
bind_rows(afinn,
          bing_and_nrc) %>%
  ggplot(aes(index, sentiment, fill = method)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~method, ncol = 1, scales = "free_y")
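To see why NRC skews more positive than Bing here, the book counts how many positive and negative words each lexicon contains; the two tables below show NRC first, then Bing:

get_sentiments("nrc") %>%
  filter(sentiment %in% c("positive", "negative")) %>%
  count(sentiment)

get_sentiments("bing") %>%
  count(sentiment)

## # A tibble: 2 × 2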
## sentiment n
## <chr> <int>
## 1 negative 3316
## 2 positive 2308
## # A tibble: 2 × 2
## sentiment n
## <chr> <int>
## 1 negative 4781
## 2 positive 2005
bing_word_counts <- tidy_books %>%
  inner_join(get_sentiments("bing")) %>%
  count(word, sentiment, sort = TRUE) %>%
  ungroup()

## Joining with `by = join_by(word)`
## Warning in inner_join(., get_sentiments("bing")): Detected an unexpected many-to-many relationship between `x` and `y`.
## ℹ Row 414333 of `x` matches multiple rows in `y`.
## ℹ Row 5051 of `y` matches multiple rows in `x`.
## ℹ If a many-to-many relationship is expected, set `relationship =
## "many-to-many"` to silence this warning.
## # A tibble: 2,585 × 3
## word sentiment n
## <chr> <chr> <int>
## 1 miss negative 1792
## 2 well positive 1480
## 3 good positive 1360
## 4 great positive 969
## 5 like positive 720
## 6 better positive 630
## 7 enough positive 605
## 8 happy positive 526
## 9 love positive 480
## 10 pleasure positive 460
## # ℹ 2,575 more rows
bing_word_counts %>%
  group_by(sentiment) %>%
  slice_max(n, n = 10) %>%
  ungroup() %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(n, word, fill = sentiment)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~sentiment, scales = "free_y") +
  labs(x = "Contribution to sentiment",
       y = NULL)

custom_stop_words <- bind_rows(tibble(word = c("miss"),
                                      lexicon = c("custom")),
                               stop_words)
custom_stop_words

## # A tibble: 1,150 × 2
## word lexicon
## <chr> <chr>
## 1 miss custom
## 2 a SMART
## 3 a's SMART
## 4 able SMART
## 5 about SMART
## 6 above SMART
## 7 according SMART
## 8 accordingly SMART
## 9 across SMART
## 10 actually SMART
## # ℹ 1,140 more rows
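The custom list is built but never applied below (the word cloud uses the stock stop_words); to apply it, the anti-join would simply swap in custom_stop_words, e.g.:

tidy_books %>%
  anti_join(custom_stop_words, by = "word") %>%
  count(word, sort = TRUE)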
library(wordcloud)
tidy_books %>%
  anti_join(stop_words) %>%
  count(word) %>%
  with(wordcloud(word, n, max.words = 100))

## Joining with `by = join_by(word)`
library(reshape2)  # for acast()
tidy_books %>%
  inner_join(get_sentiments("bing")) %>%
  count(word, sentiment, sort = TRUE) %>%
  acast(word ~ sentiment, value.var = "n", fill = 0) %>%
  comparison.cloud(colors = c("gray20", "gray80"),
                   max.words = 100)

## Joining with `by = join_by(word)`
## Warning in inner_join(., get_sentiments("bing")): Detected an unexpected many-to-many relationship between `x` and `y`.
## ℹ Row 414333 of `x` matches multiple rows in `y`.
## ℹ Row 5051 of `y` matches multiple rows in `x`.
## ℹ If a many-to-many relationship is expected, set `relationship =
## "many-to-many"` to silence this warning.
p_and_p_sentences <- tibble(text = prideprejudice) %>%
  unnest_tokens(sentence, text, token = "sentences")

austen_chapters <- austen_books() %>%
  group_by(book) %>%
  unnest_tokens(chapter, text, token = "regex",
                pattern = "Chapter|CHAPTER [\\dIVXLC]") %>%
  ungroup()

austen_chapters %>%
  group_by(book) %>%
  summarise(chapters = n())

## # A tibble: 6 × 2
## book chapters
## <fct> <int>
## 1 Sense & Sensibility 51
## 2 Pride & Prejudice 62
## 3 Mansfield Park 49
## 4 Emma 56
## 5 Northanger Abbey 32
## 6 Persuasion 25
bingnegative <- get_sentiments("bing") %>%
filter(sentiment == "negative")
wordcounts <- tidy_books %>%
group_by(book, chapter) %>%
summarize(words = n())
tidy_books %>%
semi_join(bingnegative) %>%
group_by(book, chapter) %>%
summarize(negativewords = n()) %>%
left_join(wordcounts, by = c("book", "chapter")) %>%
mutate(ratio = negativewords/words) %>%
filter(chapter != 0) %>%
slice_max(ratio, n = 1) %>%
ungroup()## Joining with `by = join_by(word)`
## # A tibble: 6 × 5
## book chapter negativewords words ratio
## <fct> <int> <int> <int> <dbl>
## 1 Sense & Sensibility 43 161 3247 0.0496
## 2 Pride & Prejudice 34 110 1998 0.0551
## 3 Mansfield Park 46 171 3463 0.0494
## 4 Emma 16 83 1804 0.0460
## 5 Northanger Abbey 21 149 2827 0.0527
## 6 Persuasion 24 54 1497 0.0361
# Load Loughran-McDonald sentiment lexicon
library(tidytext)
loughran_sentiments <- get_sentiments("loughran")
# Count word frequencies by sentiment using Loughran-McDonald lexicon
loughran_word_counts <- pride_prejudice %>%
  inner_join(loughran_sentiments, by = "word") %>%  # join by 'word'
  count(word, sentiment, sort = TRUE) %>%
  ungroup()

## Warning in inner_join(., loughran_sentiments, by = "word"): Detected an unexpected many-to-many relationship between `x` and `y`.
## ℹ Row 3 of `x` matches multiple rows in `y`.
## ℹ Row 2826 of `y` matches multiple rows in `x`.
## ℹ If a many-to-many relationship is expected, set `relationship =
## "many-to-many"` to silence this warning.
# Perform sentiment analysis by binning the text into groups of 80 lines.
# Loughran sentiment labels are categories ("positive", "negative", ...), not
# numbers, so they cannot be summed with as.numeric(); instead, count positive
# and negative words per bin and take the difference, as with Bing and NRC.
loughran <- pride_prejudice %>%
  inner_join(loughran_sentiments, by = "word",
             relationship = "many-to-many") %>%
  filter(sentiment %in% c("positive", "negative")) %>%
  count(index = linenumber %/% 80, sentiment) %>%
  pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) %>%
  mutate(sentiment = positive - negative,
         method = "Loughran")                       # add method label
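To see the result on the same kind of plot used for the other lexicons (a quick sketch, not from the book):

ggplot(loughran, aes(index, sentiment)) +
  geom_col(show.legend = FALSE) +
  labs(title = "Pride & Prejudice: Loughran-McDonald net sentiment per 80 lines")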
# Text mining and natural language processing
dickens_book1 <- readLines("https://raw.githubusercontent.com/tanzil64/Data-607-Assignment-10/refs/heads/main/A%20Christmas%20Carol.txt")

# View the first few lines of the downloaded book
head(dickens_book1)

## [1] "The Project Gutenberg eBook of A Christmas Carol in Prose; Being a Ghost Story of Christmas"
## [2] " "
## [3] "This ebook is for the use of anyone anywhere in the United States and"
## [4] "most other parts of the world at no cost and with almost no restrictions"
## [5] "whatsoever. You may copy it, give it away or re-use it under the terms"
## [6] "of the Project Gutenberg License included with this ebook or online"
dickens_book2 <- readLines("https://raw.githubusercontent.com/tanzil64/Data-607-Assignment-10/refs/heads/main/Great%20Expectations.txt")

# View the first few lines of the downloaded book
head(dickens_book2)

## [1] "The Project Gutenberg eBook of Great Expectations"
## [2] " "
## [3] "This ebook is for the use of anyone anywhere in the United States and"
## [4] "most other parts of the world at no cost and with almost no restrictions"
## [5] "whatsoever. You may copy it, give it away or re-use it under the terms"
## [6] "of the Project Gutenberg License included with this ebook or online"
# `dickens_book1` and `dickens_book2` are currently character vectors, one
# element per line of text.

# Collapse each book into a single string and store it in a one-row tibble
dickens_book1 <- tibble(book = "A Christmas Carol", text = paste(dickens_book1, collapse = " "))
dickens_book2 <- tibble(book = "Great Expectations", text = paste(dickens_book2, collapse = " "))

# Combine the two books into one tibble
dickens_works <- bind_rows(dickens_book1, dickens_book2)

# Check the combined tibble
glimpse(dickens_works)

## Rows: 2
## Columns: 2
## $ book <chr> "A Christmas Carol", "Great Expectations"
## $ text <chr> "The Project Gutenberg eBook of A Christmas Carol in Prose; Being…
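The tokenization step did not survive rendering, but the objects used below (`tidy_dickens` and `dickens_frequency`) imply something like the following reconstruction. Note that because each book was collapsed to a single string, there are no line numbers to bin on; the Dickens analysis works with plain word counts, with stop words removed:

tidy_dickens <- dickens_works %>%
  unnest_tokens(word, text) %>%
  anti_join(stop_words, by = "word")

dickens_frequency <- tidy_dickens %>%
  count(word, sort = TRUE)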
library(dplyr)
library(ggplot2)
# Filter the top 7 words with the highest frequencies
dickens_top_words <- dickens_frequency %>%
  filter(n > 5) %>%
  slice_max(n, n = 7)

# Plot the top 7 words
ggplot(dickens_top_words, aes(x = reorder(word, n), y = n)) +
  geom_col() +
  xlab("Words") +
  ylab("Frequency") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))  # rotate x-axis labels for readability
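Next, the joy words from the NRC lexicon. The filtering call is missing from the rendered output; presumably (matching the book's filter, with head() to show the first rows):

nrc_joy <- get_sentiments("nrc") %>%
  filter(sentiment == "joy")
head(nrc_joy)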
## # A tibble: 6 × 2
## word sentiment
## <chr> <chr>
## 1 absolution joy
## 2 abundance joy
## 3 abundant joy
## 4 accolade joy
## 5 accompaniment joy
## 6 accomplish joy
Here I filtered the NRC lexicon down to the words labeled "joy." Next, I need to find those joyful words in Charles Dickens's works.

# NRC joy: take all of the joy-related words that appear in Dickens's works
nrc_joy_dickens <- tidy_dickens %>%
  inner_join(get_sentiments("nrc") %>% filter(sentiment == "joy"), by = "word") %>%
  count(word, sort = TRUE)

## # A tibble: 6 × 2
## word n
## <chr> <int>
## 1 found 160
## 2 hope 106
## 3 money 94
## 4 child 78
## 5 friend 71
## 6 love 68
# AFINN positive: pull out the words from Dickens's works that AFINN rates as positive
afinn_positive <- get_sentiments("afinn") %>%
  filter(value > 0)

dickens_afinn_joy <- tidy_dickens %>%
  inner_join(afinn_positive, by = "word") %>%
  count(word, sort = TRUE)

# Bing positive: find positive Dickens words using the Bing lexicon
bing_joy <- get_sentiments("bing") %>%
  filter(sentiment == "positive")

dickens_bing_joy <- tidy_dickens %>%
  inner_join(bing_joy, by = "word") %>%
  count(word, sort = TRUE)

# NRC positive: the same, with the NRC lexicon
nrc_positive <- get_sentiments("nrc") %>%
  filter(sentiment == "positive")

dickens_nrc_joy <- tidy_dickens %>%
  inner_join(nrc_positive, by = "word") %>%
  count(word, sort = TRUE)

# Loughran positive: a lexicon not used in the book's examples
loughran_joy <- get_sentiments("loughran") %>%
  filter(sentiment == "positive")

dickens_loughran_joy <- tidy_dickens %>%
  inner_join(loughran_joy, by = "word") %>%
  count(word, sort = TRUE)

# Graphing the top words: a good way to compare how the lexicons behave
library(ggplot2)
library(dplyr)
# Keep only the top 10 words for each lexicon
graph_dickens_afinn_top10 <- dickens_afinn_joy %>%
  top_n(10, n)
graph_dickens_bing_top10 <- dickens_bing_joy %>%
  top_n(10, n)
graph_dickens_loughran_top10 <- dickens_loughran_joy %>%
  top_n(10, n)
graph_dickens_nrc_top10 <- dickens_nrc_joy %>%
  top_n(10, n)
# Plot the graph with the top 10 words from each lexicon
ggplot() +
  geom_point(data = graph_dickens_afinn_top10, aes(x = word, y = n), color = "red") +
  geom_line(data = graph_dickens_afinn_top10, aes(x = word, y = n, group = 1), color = "red") +
  geom_point(data = graph_dickens_bing_top10, aes(x = word, y = n), color = "black") +
  geom_line(data = graph_dickens_bing_top10, aes(x = word, y = n, group = 1), color = "black") +
  geom_point(data = graph_dickens_loughran_top10, aes(x = word, y = n), color = "green") +
  geom_line(data = graph_dickens_loughran_top10, aes(x = word, y = n, group = 1), color = "green") +
  geom_point(data = graph_dickens_nrc_top10, aes(x = word, y = n), color = "blue") +
  geom_line(data = graph_dickens_nrc_top10, aes(x = word, y = n, group = 1), color = "blue") +
  theme_minimal() +
  labs(title = "Top 10 Positive Words per Lexicon (AFINN red, Bing black, Loughran green, NRC blue)",
       x = "Words",
       y = "Frequency") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))
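Overlaying four hard-coded layers works, but a more idiomatic alternative (a sketch using tidytext's reorder_within() helper) would bind the four counts into one frame and facet:

library(tidytext)  # for reorder_within() / scale_x_reordered()

graph_all <- bind_rows(
  mutate(graph_dickens_afinn_top10, lexicon = "AFINN"),
  mutate(graph_dickens_bing_top10, lexicon = "Bing"),
  mutate(graph_dickens_loughran_top10, lexicon = "Loughran"),
  mutate(graph_dickens_nrc_top10, lexicon = "NRC"))

ggplot(graph_all, aes(reorder_within(word, n, lexicon), n, fill = lexicon)) +
  geom_col(show.legend = FALSE) +
  scale_x_reordered() +
  coord_flip() +
  facet_wrap(~lexicon, scales = "free") +
  labs(x = NULL, y = "Frequency")

# Now pull out word frequencies per book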
book_words <- tidy_dickens %>%
  count(book, word, sort = TRUE)

# Step 2: Calculate total number of words per book
total_words <- book_words %>%
  group_by(book) %>%
  summarize(total = sum(n), .groups = "drop")

# Step 3: Join total word count into the main data
book_words <- left_join(book_words, total_words, by = "book")

# View result
head(book_words)

## # A tibble: 6 × 4
## book word n total
## <chr> <chr> <int> <int>
## 1 Great Expectations joe 692 58721
## 2 Great Expectations miss 383 58721
## 3 Great Expectations time 373 58721
## 4 Great Expectations pip 326 58721
## 5 Great Expectations looked 325 58721
## 6 A Christmas Carol scrooge 314 11259
# To find which words are the most important, we use tf-idf.
freq_by_rank <- book_words %>%
  group_by(book) %>%
  mutate(rank = row_number(),
         `term frequency` = n / total) %>%
  ungroup()

freq_by_rank

## # A tibble: 14,918 × 6
## book word n total rank `term frequency`
## <chr> <chr> <int> <int> <int> <dbl>
## 1 Great Expectations joe 692 58721 1 0.0118
## 2 Great Expectations miss 383 58721 2 0.00652
## 3 Great Expectations time 373 58721 3 0.00635
## 4 Great Expectations pip 326 58721 4 0.00555
## 5 Great Expectations looked 325 58721 5 0.00553
## 6 A Christmas Carol scrooge 314 11259 1 0.0279
## 7 Great Expectations herbert 290 58721 6 0.00494
## 8 Great Expectations don’t 285 58721 7 0.00485
## 9 Great Expectations hand 270 58721 8 0.00460
## 10 Great Expectations wemmick 256 58721 9 0.00436
## # ℹ 14,908 more rows
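A quick sanity check on the math before computing it: tf is a word's count divided by the book's total words, idf is ln(number of books / number of books containing the word), and tf-idf is their product. With only two books, a word unique to one book gets idf = ln(2) ≈ 0.693, which is exactly the idf column in the table below. For example:

# "scrooge" appears only in A Christmas Carol (2 books, 1 contains the word)
tf  <- 314 / 11259          # term frequency, about 0.0279
idf <- log(2 / 1)           # about 0.693
tf * idf                    # about 0.0193, the tf-idf reported below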
# Step 1: Count words per book
word_counts <- tidy_dickens %>%
  count(book, word, sort = TRUE)

# Step 2: Apply tf-idf
dickens_tf_idf <- word_counts %>%
  bind_tf_idf(word, book, n) %>%
  arrange(desc(tf_idf))

# Step 3: View top distinctive words
head(dickens_tf_idf, 10)

## # A tibble: 10 × 6
## book word n tf idf tf_idf
## <chr> <chr> <int> <dbl> <dbl> <dbl>
## 1 A Christmas Carol scrooge 314 0.0279 0.693 0.0193
## 2 Great Expectations pip 326 0.00555 0.693 0.00385
## 3 Great Expectations herbert 290 0.00494 0.693 0.00342
## 4 Great Expectations don’t 285 0.00485 0.693 0.00336
## 5 Great Expectations wemmick 256 0.00436 0.693 0.00302
## 6 A Christmas Carol bob 49 0.00435 0.693 0.00302
## 7 A Christmas Carol scrooge's 48 0.00426 0.693 0.00296
## 8 Great Expectations havisham 243 0.00414 0.693 0.00287
## 9 Great Expectations estella 237 0.00404 0.693 0.00280
## 10 Great Expectations biddy 228 0.00388 0.693 0.00269
# Conclusions

In conclusion, the most frequently used words, which are largely character names, are also the ones tf-idf flags as most important for distinguishing the two books. I wonder how this would change if Project Gutenberg had more of Kurt Vonnegut's books; I imagine his characters' names would not dominate the top words in either frequency or importance.