Instructions: In Text Mining with R, Chapter 2 looks at sentiment analysis. In this assignment, you should start by getting the primary example code from Chapter 2 working in an R Markdown document. You should provide a citation to this base code. You're then asked to extend the code in two ways:
• Work with a different corpus of your choosing, and
• Incorporate at least one additional sentiment lexicon (possibly from another R package that you've found through research).
As usual, please submit links to both an .Rmd file posted in your GitHub repository and to your code on rpubs.com. You may work in a small team on this assignment.
This analysis reproduces the Chapter 2 sentiment-analysis example on Jane Austen's novels and then extends it to a different corpus: the works of Charles Dickens downloaded from Project Gutenberg. We want to know how the words in each section are associated with positive or negative feelings using different lexicons: Bing, NRC, and AFINN, with the Loughran lexicon as an additional option (a short example appears after the three lexicons are previewed below).
The three general-purpose lexicons are:
• AFINN from Finn Årup Nielsen
• Bing from Bing Liu and collaborators
• NRC from Saif Mohammad and Peter Turney
# Sentiment analysis
# Load libraries
library(tidytext)
## Warning: package 'tidytext' was built under R version 4.4.2
library(janeaustenr)
## Warning: package 'janeaustenr' was built under R version 4.4.2
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(stringr)
## Warning: package 'stringr' was built under R version 4.4.2
library(wordcloud)
## Warning: package 'wordcloud' was built under R version 4.4.2
## Loading required package: RColorBrewer
library(reshape2)
library(harrypotter)
## Warning: package 'harrypotter' was built under R version 4.4.2
library(httr)
library(readtext)
## Warning: package 'readtext' was built under R version 4.4.2
library(textdata)
## Warning: package 'textdata' was built under R version 4.4.2
##
## Attaching package: 'textdata'
## The following object is masked from 'package:httr':
##
## cache_info
library(tidyr)
##
## Attaching package: 'tidyr'
## The following object is masked from 'package:reshape2':
##
## smiths
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 4.4.2
library(wordcloud2)
## Warning: package 'wordcloud2' was built under R version 4.4.2
library(gutenbergr)
## Warning: package 'gutenbergr' was built under R version 4.4.2
library(devtools)
## Warning: package 'devtools' was built under R version 4.4.2
## Loading required package: usethis
library(tibble)
## Warning: package 'tibble' was built under R version 4.4.2
library(sos)
## Warning: package 'sos' was built under R version 4.4.2
## Loading required package: brew
##
## Attaching package: 'sos'
## The following object is masked from 'package:tidyr':
##
## matches
## The following object is masked from 'package:dplyr':
##
## matches
## The following object is masked from 'package:utils':
##
## ?
sentiments
## # A tibble: 6,786 × 2
## word sentiment
## <chr> <chr>
## 1 2-faces negative
## 2 abnormal negative
## 3 abolish negative
## 4 abominable negative
## 5 abominably negative
## 6 abominate negative
## 7 abomination negative
## 8 abort negative
## 9 aborted negative
## 10 aborts negative
## # ℹ 6,776 more rows
get_sentiments("afinn")
## # A tibble: 2,477 × 2
## word value
## <chr> <dbl>
## 1 abandon -2
## 2 abandoned -2
## 3 abandons -2
## 4 abducted -2
## 5 abduction -2
## 6 abductions -2
## 7 abhor -3
## 8 abhorred -3
## 9 abhorrent -3
## 10 abhors -3
## # ℹ 2,467 more rows
get_sentiments("bing")
## # A tibble: 6,786 × 2
## word sentiment
## <chr> <chr>
## 1 2-faces negative
## 2 abnormal negative
## 3 abolish negative
## 4 abominable negative
## 5 abominably negative
## 6 abominate negative
## 7 abomination negative
## 8 abort negative
## 9 aborted negative
## 10 aborts negative
## # ℹ 6,776 more rows
get_sentiments("nrc")
## # A tibble: 13,872 × 2
## word sentiment
## <chr> <chr>
## 1 abacus trust
## 2 abandon fear
## 3 abandon negative
## 4 abandon sadness
## 5 abandoned anger
## 6 abandoned fear
## 7 abandoned negative
## 8 abandoned sadness
## 9 abandonment anger
## 10 abandonment fear
## # ℹ 13,862 more rows
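Beyond these three, the Loughran-McDonald lexicon mentioned above can be pulled the same way. A minimal sketch, assuming the textdata package has already downloaded it (the first call prompts for the download):
#preview the Loughran lexicon, whose categories include "positive", "negative", and "uncertainty"
get_sentiments("loughran") %>%
  count(sentiment, sort = TRUE)
#tokenize Jane Austen's novels, keeping track of line and chapter numbers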
tidy_books <- austen_books() %>%
group_by(book) %>%
mutate(linenumber = row_number(),
chapter = cumsum(str_detect(text, regex("^chapter [\\divxlc]",
ignore_case = TRUE)))) %>%
ungroup() %>%
unnest_tokens(word, text)
tidy_books
## # A tibble: 725,055 × 4
## book linenumber chapter word
## <fct> <int> <int> <chr>
## 1 Sense & Sensibility 1 0 sense
## 2 Sense & Sensibility 1 0 and
## 3 Sense & Sensibility 1 0 sensibility
## 4 Sense & Sensibility 3 0 by
## 5 Sense & Sensibility 3 0 jane
## 6 Sense & Sensibility 3 0 austen
## 7 Sense & Sensibility 5 0 1811
## 8 Sense & Sensibility 10 1 chapter
## 9 Sense & Sensibility 10 1 1
## 10 Sense & Sensibility 13 1 the
## # ℹ 725,045 more rows
austen_books() %>%
distinct(book)
## # A tibble: 6 × 1
## book
## <fct>
## 1 Sense & Sensibility
## 2 Pride & Prejudice
## 3 Mansfield Park
## 4 Emma
## 5 Northanger Abbey
## 6 Persuasion
original_books <- austen_books()%>%
group_by(book)%>%
mutate(line= row_number())%>%
ungroup()
original_books
## # A tibble: 73,422 × 3
## text book line
## <chr> <fct> <int>
## 1 "SENSE AND SENSIBILITY" Sense & Sensibility 1
## 2 "" Sense & Sensibility 2
## 3 "by Jane Austen" Sense & Sensibility 3
## 4 "" Sense & Sensibility 4
## 5 "(1811)" Sense & Sensibility 5
## 6 "" Sense & Sensibility 6
## 7 "" Sense & Sensibility 7
## 8 "" Sense & Sensibility 8
## 9 "" Sense & Sensibility 9
## 10 "CHAPTER 1" Sense & Sensibility 10
## # ℹ 73,412 more rows
tidy_books <-original_books %>%
unnest_tokens(word, text)
tidy_books
## # A tibble: 725,055 × 3
## book line word
## <fct> <int> <chr>
## 1 Sense & Sensibility 1 sense
## 2 Sense & Sensibility 1 and
## 3 Sense & Sensibility 1 sensibility
## 4 Sense & Sensibility 3 by
## 5 Sense & Sensibility 3 jane
## 6 Sense & Sensibility 3 austen
## 7 Sense & Sensibility 5 1811
## 8 Sense & Sensibility 10 chapter
## 9 Sense & Sensibility 10 1
## 10 Sense & Sensibility 13 the
## # ℹ 725,045 more rows
#remove stop words using the anti_join() function
match_stop_word <-tidy_books %>%
anti_join(get_stopwords())
## Joining with `by = join_by(word)`
match_stop_word
## # A tibble: 325,084 × 3
## book line word
## <fct> <int> <chr>
## 1 Sense & Sensibility 1 sense
## 2 Sense & Sensibility 1 sensibility
## 3 Sense & Sensibility 3 jane
## 4 Sense & Sensibility 3 austen
## 5 Sense & Sensibility 5 1811
## 6 Sense & Sensibility 10 chapter
## 7 Sense & Sensibility 10 1
## 8 Sense & Sensibility 13 family
## 9 Sense & Sensibility 13 dashwood
## 10 Sense & Sensibility 13 long
## # ℹ 325,074 more rows
match_stop_word %>%
count(word, sort= TRUE)
## # A tibble: 14,375 × 2
## word n
## <chr> <int>
## 1 mr 3015
## 2 mrs 2446
## 3 must 2071
## 4 said 2041
## 5 much 1935
## 6 miss 1855
## 7 one 1831
## 8 well 1523
## 9 every 1456
## 10 think 1440
## # ℹ 14,365 more rows
match_stop_word %>%
count(word, sort=TRUE) %>%
head(100) %>%
wordcloud2(size = 0.4, shape = 'triangle-forward',
color = c("streeblue","firebrick","darkorchid"),
backgroundColor = "green")
match_stop_word %>%
count(word)%>%
with(wordcloud::wordcloud(word,n,max.words = 100))
positive <- get_sentiments("bing") %>%
filter(sentiment == "positive")
positive
## # A tibble: 2,005 × 2
## word sentiment
## <chr> <chr>
## 1 abound positive
## 2 abounds positive
## 3 abundance positive
## 4 abundant positive
## 5 accessable positive
## 6 accessible positive
## 7 acclaim positive
## 8 acclaimed positive
## 9 acclamation positive
## 10 accolade positive
## # ℹ 1,995 more rows
#count positive words in Emma
tidy_books %>%
filter(book == "Emma") %>%
semi_join(positive)%>%
count(word, sort=TRUE)
## Joining with `by = join_by(word)`
## # A tibble: 668 × 2
## word n
## <chr> <int>
## 1 well 401
## 2 good 359
## 3 great 264
## 4 like 200
## 5 better 173
## 6 enough 129
## 7 happy 125
## 8 love 117
## 9 pleasure 115
## 10 right 92
## # ℹ 658 more rows
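#net sentiment (positive minus negative Bing words) per 80-line section of each novel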
bing <-get_sentiments("bing")
janeaustensentiment <- tidy_books %>%
inner_join(bing) %>%
count(book, index = line %/% 80, sentiment) %>%
spread(sentiment, n, fill = 0) %>%
mutate(sentiment = positive - negative)
## Joining with `by = join_by(word)`
## Warning in inner_join(., bing): Detected an unexpected many-to-many relationship between `x` and `y`.
## ℹ Row 435434 of `x` matches multiple rows in `y`.
## ℹ Row 5051 of `y` matches multiple rows in `x`.
## ℹ If a many-to-many relationship is expected, set `relationship =
## "many-to-many"` to silence this warning.
janeaustensentiment
## # A tibble: 920 × 5
## book index negative positive sentiment
## <fct> <dbl> <dbl> <dbl> <dbl>
## 1 Sense & Sensibility 0 16 32 16
## 2 Sense & Sensibility 1 19 53 34
## 3 Sense & Sensibility 2 12 31 19
## 4 Sense & Sensibility 3 15 31 16
## 5 Sense & Sensibility 4 16 34 18
## 6 Sense & Sensibility 5 16 51 35
## 7 Sense & Sensibility 6 24 40 16
## 8 Sense & Sensibility 7 23 51 28
## 9 Sense & Sensibility 8 30 40 10
## 10 Sense & Sensibility 9 15 19 4
## # ℹ 910 more rows
Net sentiment per 80-line section of each novel: positive sections are shown in cadetblue and negative sections in firebrick, with a goldenrod line at zero.
janeaustensentiment %>%
ggplot(aes(index, sentiment))+
geom_col(show.legend = FALSE, fill="cadetblue")+
geom_col(data= .%>%filter(sentiment<0),show.legend = FALSE, fill="firebrick")+
geom_hline(yintercept = 0, color="goldenrod")+
facet_wrap(~book,ncol=2,scales="free_x")
#Most common positive and negative words
bing_word_counts <-tidy_books %>%
inner_join(bing)%>%
count(word,sentiment, sort=TRUE)
## Joining with `by = join_by(word)`
## Warning in inner_join(., bing): Detected an unexpected many-to-many relationship between `x` and `y`.
## ℹ Row 435434 of `x` matches multiple rows in `y`.
## ℹ Row 5051 of `y` matches multiple rows in `x`.
## ℹ If a many-to-many relationship is expected, set `relationship =
## "many-to-many"` to silence this warning.
bing_word_counts
## # A tibble: 2,585 × 3
## word sentiment n
## <chr> <chr> <int>
## 1 miss negative 1855
## 2 well positive 1523
## 3 good positive 1380
## 4 great positive 981
## 5 like positive 725
## 6 better positive 639
## 7 enough positive 613
## 8 happy positive 534
## 9 love positive 495
## 10 pleasure positive 462
## # ℹ 2,575 more rows
nrcjoy <- get_sentiments("nrc") %>%
filter(sentiment == "joy")
tidy_books %>%
filter(book == "Emma") %>%
inner_join(nrcjoy) %>%
count(word, sort = TRUE)
## Joining with `by = join_by(word)`
## # A tibble: 301 × 2
## word n
## <chr> <int>
## 1 good 359
## 2 friend 166
## 3 hope 143
## 4 happy 125
## 5 love 117
## 6 deal 92
## 7 found 92
## 8 present 89
## 9 kind 82
## 10 happiness 76
## # ℹ 291 more rows
ggplot(janeaustensentiment, aes(index, sentiment, fill = book)) +
geom_col(show.legend = FALSE) +
facet_wrap(~book, ncol = 2, scales = "free_x")
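#focus on Pride & Prejudice to compare the three lexicons on a single novel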
pride_prejudice <- tidy_books %>%
filter(book == "Pride & Prejudice")
pride_prejudice
## # A tibble: 122,204 × 3
## book line word
## <fct> <int> <chr>
## 1 Pride & Prejudice 1 pride
## 2 Pride & Prejudice 1 and
## 3 Pride & Prejudice 1 prejudice
## 4 Pride & Prejudice 3 by
## 5 Pride & Prejudice 3 jane
## 6 Pride & Prejudice 3 austen
## 7 Pride & Prejudice 7 chapter
## 8 Pride & Prejudice 7 1
## 9 Pride & Prejudice 10 it
## 10 Pride & Prejudice 10 is
## # ℹ 122,194 more rows
#words in Emma matching the AFINN lexicon
emma_afin <- tidy_books %>%
filter(book=="Emma")%>%
anti_join(get_stopwords())%>%
inner_join(get_sentiments("afinn"))
## Joining with `by = join_by(word)`
## Joining with `by = join_by(word)`
emma_afin
## # A tibble: 10,159 × 4
## book line word value
## <fct> <int> <chr> <dbl>
## 1 Emma 15 clever 2
## 2 Emma 15 rich 2
## 3 Emma 15 comfortable 2
## 4 Emma 16 happy 3
## 5 Emma 16 best 3
## 6 Emma 18 distress -2
## 7 Emma 20 affectionate 3
## 8 Emma 22 died -3
## 9 Emma 24 excellent 3
## 10 Emma 25 fallen -2
## # ℹ 10,149 more rows
emma_afin %>%
count(word,sort=TRUE)
## # A tibble: 894 × 2
## word n
## <chr> <int>
## 1 miss 599
## 2 good 359
## 3 great 264
## 4 dear 241
## 5 like 200
## 6 better 173
## 7 hope 143
## 8 poor 136
## 9 wish 135
## 10 happy 125
## # ℹ 884 more rows
#calculate sentiment by summing AFINN values over 80-word sections
emma_afinn_sentiment <-emma_afin %>%
mutate(word_count=1:n(),
index=word_count %/%80)%>%
group_by(index)%>%
summarize(sentiment=sum(value))
emma_afinn_sentiment
## # A tibble: 127 × 2
## index sentiment
## <dbl> <dbl>
## 1 0 40
## 2 1 33
## 3 2 77
## 4 3 84
## 5 4 52
## 6 5 80
## 7 6 98
## 8 7 80
## 9 8 69
## 10 9 68
## # ℹ 117 more rows
#visualize
emma_afin %>%
mutate(word_count=1:n(),
index=word_count %/%80)%>%
filter(index==104)%>%
count(word, sort=TRUE)%>%
with(wordcloud::wordcloud(word,n,rot.per=.3))
emma_afin%>%
mutate(word_count=1:n(),
index=word_count %/%80)%>%
filter(index==104)%>%
count(word, sort=TRUE)%>%
wordcloud2(size= 0.4,shape='diamond',
backgroundColor = "darkseagreen")
#visualize
emma_afinn_sentiment %>%
ggplot(aes(index,sentiment))+
geom_col(aes(fill=cut_interval(sentiment,n=5)))+
geom_hline(yintercept = 0,color="forestgreen",linetype="dashed")+
scale_fill_brewer(palette = "RdBu", guide = "none")+
theme(panel.background =element_rect(fill="grey"),
plot.background = element_rect(fill="grey"),
panel.grid.major = element_blank(),
panel.grid.minor = element_blank())+
labs(title = "Afinn sentiment analysis of Emma")
# boxplot
emma_afin %>%
mutate(word_count=1:n(),
index=as.character(word_count %/%80))%>%
filter(index==10 |index==104 |index==105)%>%
ggplot(aes(value,index))+
geom_boxplot()+
geom_jitter()+
coord_flip()+
labs(y="section",x="Afinn")
afinn <- get_sentiments("afinn")
afinn <- pride_prejudice %>%
inner_join(afinn) %>%
group_by(index = line %/% 80) %>%
summarise(sentiment = sum(value)) %>%
mutate(method = "AFINN")
## Joining with `by = join_by(word)`
bing_and_nrc <- bind_rows(
pride_prejudice %>%
inner_join(bing) %>%
mutate(method = "Bing et al."),
pride_prejudice %>%
inner_join(get_sentiments("nrc") %>%
filter(sentiment %in% c("positive","negative")))%>%
mutate(method = "NRC")) %>%
count(method, index = line %/% 80, sentiment) %>%
spread(sentiment, n, fill = 0) %>%
mutate(sentiment = positive - negative)
## Joining with `by = join_by(word)`
## Joining with `by = join_by(word)`
## Warning in inner_join(., get_sentiments("nrc") %>% filter(sentiment %in% : Detected an unexpected many-to-many relationship between `x` and `y`.
## ℹ Row 215 of `x` matches multiple rows in `y`.
## ℹ Row 5178 of `y` matches multiple rows in `x`.
## ℹ If a many-to-many relationship is expected, set `relationship =
## "many-to-many"` to silence this warning.
bind_rows(afinn,
bing_and_nrc) %>%
ggplot(aes(index, sentiment, fill = method)) +
geom_col(show.legend = FALSE) +
facet_wrap(~method, ncol = 1, scales = "free_y")
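#how many positive and negative entries does each lexicon contain?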
get_sentiments("nrc") %>%
filter(sentiment %in% c("positive",
"negative")) %>%
count(sentiment)
## # A tibble: 2 × 2
## sentiment n
## <chr> <int>
## 1 negative 3316
## 2 positive 2308
get_sentiments("bing") %>%
count(sentiment)
## # A tibble: 2 × 2
## sentiment n
## <chr> <int>
## 1 negative 4781
## 2 positive 2005
bing_word_counts <- tidy_books %>%
inner_join(get_sentiments("bing")) %>%
count(word, sentiment, sort = TRUE) %>%
ungroup()
## Joining with `by = join_by(word)`
## Warning in inner_join(., get_sentiments("bing")): Detected an unexpected many-to-many relationship between `x` and `y`.
## ℹ Row 435434 of `x` matches multiple rows in `y`.
## ℹ Row 5051 of `y` matches multiple rows in `x`.
## ℹ If a many-to-many relationship is expected, set `relationship =
## "many-to-many"` to silence this warning.
bing_word_counts
## # A tibble: 2,585 × 3
## word sentiment n
## <chr> <chr> <int>
## 1 miss negative 1855
## 2 well positive 1523
## 3 good positive 1380
## 4 great positive 981
## 5 like positive 725
## 6 better positive 639
## 7 enough positive 613
## 8 happy positive 534
## 9 love positive 495
## 10 pleasure positive 462
## # ℹ 2,575 more rows
bing_word_counts %>%
group_by(sentiment) %>%
top_n(10) %>%
ungroup() %>%
mutate(word = reorder(word, n)) %>%
ggplot(aes(word, n, fill = sentiment)) +
geom_col(show.legend = FALSE) +
facet_wrap(~sentiment, scales = "free_y") +
labs(y = "Contribution to sentiment",
x = NULL) +
coord_flip()
## Selecting by n
#custom stop words
custom_stop_words <- bind_rows(tibble(word = c("miss"),
                                      lexicon = c("custom")),
                               stop_words)
custom_stop_words
## # A tibble: 1,150 × 2
## word lexicon
## <chr> <chr>
## 1 miss custom
## 2 a SMART
## 3 a's SMART
## 4 able SMART
## 5 about SMART
## 6 above SMART
## 7 according SMART
## 8 accordingly SMART
## 9 across SMART
## 10 actually SMART
## # ℹ 1,140 more rows
tidy_books %>%
anti_join(stop_words) %>%
count(word) %>%
with(wordcloud(word, n, max.words = 100))
## Joining with `by = join_by(word)`
#Most common positive and negative words in Jane Austen’s novels
tidy_books %>%
inner_join(get_sentiments("bing")) %>%
count(word, sentiment, sort = TRUE) %>%
acast(word ~ sentiment, value.var = "n", fill = 0) %>%
comparison.cloud(colors = c("gray20", "gray80"),
max.words = 100)
## Joining with `by = join_by(word)`
## Warning in inner_join(., get_sentiments("bing")): Detected an unexpected many-to-many relationship between `x` and `y`.
## ℹ Row 435434 of `x` matches multiple rows in `y`.
## ℹ Row 5051 of `y` matches multiple rows in `x`.
## ℹ If a many-to-many relationship is expected, set `relationship =
## "many-to-many"` to silence this warning.
PandP_sentences <- tibble(text = prideprejudice) %>%
unnest_tokens(sentence, text, token = "sentences")
PandP_sentences$sentence[2]
## [1] "by jane austen"
austen_chapters <- austen_books() %>%
group_by(book) %>%
unnest_tokens(chapter, text, token = "regex",
pattern = "Chapter|CHAPTER [\\dIVXLC]") %>%
ungroup()
austen_chapters %>%
group_by(book) %>%
summarise(chapters = n())
## # A tibble: 6 × 2
## book chapters
## <fct> <int>
## 1 Sense & Sensibility 51
## 2 Pride & Prejudice 62
## 3 Mansfield Park 49
## 4 Emma 56
## 5 Northanger Abbey 32
## 6 Persuasion 25
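#which novel has the highest ratio of negative (Bing) words?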
bingnegative <- get_sentiments("bing") %>%
filter(sentiment == "negative")
wordcounts <- tidy_books %>%
group_by(book) %>%
summarize(words = n())
tidy_books %>%
semi_join(bingnegative) %>%
group_by(book) %>%
summarize(negativewords = n()) %>%
left_join(wordcounts, by = c("book")) %>%
mutate(ratio = negativewords/words) %>%
top_n(1) %>%
ungroup()
## Joining with `by = join_by(word)`
## Selecting by ratio
## # A tibble: 1 × 4
## book negativewords words ratio
## <fct> <int> <int> <dbl>
## 1 Northanger Abbey 2518 77780 0.0324
Analysis:
Having reproduced the Chapter 2 example from Text Mining with R above, we now extend it to a different corpus: the works of Charles Dickens, downloaded from Project Gutenberg with the gutenbergr package (installed from GitHub below).
devtools::install_github("ropensci/gutenbergr")
## WARNING: Rtools is required to build R packages, but is not currently installed.
##
## Please download and install Rtools 4.4 from https://cran.r-project.org/bin/windows/Rtools/.
## Using GitHub PAT from the git credential store.
## Downloading GitHub repo ropensci/gutenbergr@HEAD
##
## ── R CMD build ─────────────────────────────────────────────────────────────────
## WARNING: Rtools is required to build R packages, but is not currently installed.
##
## Please download and install Rtools 4.4 from https://cran.r-project.org/bin/windows/Rtools/.
## checking for file 'C:\Users\asadn\AppData\Local\Temp\RtmpsJdvxn\remotes2dc059556056\ropensci-gutenbergr-57b1415/DESCRIPTION' ... ✔
## ─ preparing 'gutenbergr':
## ─ checking DESCRIPTION meta-information ... ✔
## ─ checking for LF line-endings in source and make files and shell scripts (371ms)
## ─ checking for empty or unneeded directories
## ─ building 'gutenbergr_0.2.4.9000.tar.gz'
##
##
## Warning: package 'gutenbergr' is in use and will not be installed
library(gutenbergr)
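#list the Charles Dickens works available from Project Gutenberg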
dickens_books <- gutenberg_works(author == 'Dickens, Charles')
dickens_books
## # A tibble: 54 × 8
## gutenberg_id title author gutenberg_author_id language gutenberg_bookshelf
## <int> <chr> <chr> <int> <chr> <chr>
## 1 46 A Chris… Dicke… 37 en "Children's Litera…
## 2 564 The Mys… Dicke… 37 en "Mystery Fiction"
## 3 580 The Pic… Dicke… 37 en "Best Books Ever L…
## 4 699 A Child… Dicke… 37 en "Children's Histor…
## 5 700 The Old… Dicke… 37 en ""
## 6 730 Oliver … Dicke… 37 en ""
## 7 766 David C… Dicke… 37 en "Harvard Classics"
## 8 821 Dombey … Dicke… 37 en ""
## 9 917 Barnaby… Dicke… 37 en "Historical Fictio…
## 10 963 Little … Dicke… 37 en ""
## # ℹ 44 more rows
## # ℹ 2 more variables: rights <chr>, has_text <lgl>
head(dickens_books)
## # A tibble: 6 × 8
## gutenberg_id title author gutenberg_author_id language gutenberg_bookshelf
## <int> <chr> <chr> <int> <chr> <chr>
## 1 46 A Christ… Dicke… 37 en "Children's Litera…
## 2 564 The Myst… Dicke… 37 en "Mystery Fiction"
## 3 580 The Pick… Dicke… 37 en "Best Books Ever L…
## 4 699 A Child'… Dicke… 37 en "Children's Histor…
## 5 700 The Old … Dicke… 37 en ""
## 6 730 Oliver T… Dicke… 37 en ""
## # ℹ 2 more variables: rights <chr>, has_text <lgl>
glimpse(dickens_books)
## Rows: 54
## Columns: 8
## $ gutenberg_id <int> 46, 564, 580, 699, 700, 730, 766, 821, 917, 963, 9…
## $ title <chr> "A Christmas Carol in Prose; Being a Ghost Story o…
## $ author <chr> "Dickens, Charles", "Dickens, Charles", "Dickens, …
## $ gutenberg_author_id <int> 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37…
## $ language <chr> "en", "en", "en", "en", "en", "en", "en", "en", "e…
## $ gutenberg_bookshelf <chr> "Children's Literature/Christmas", "Mystery Fictio…
## $ rights <chr> "Public domain in the USA.", "Public domain in the…
## $ has_text <lgl> TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TR…
#download the Dickens texts and tidy them into one word per row
tidydata <- dickens_books %>%
gutenberg_download(meta_fields = 'title') %>%
group_by(gutenberg_id) %>%
mutate(linenumber = row_number()) %>%
ungroup() %>%
unnest_tokens(word, text)
## Determining mirror for Project Gutenberg from https://www.gutenberg.org/robot/harvest
## Using mirror http://aleph.gutenberg.org
## Warning: ! Could not download a book at http://aleph.gutenberg.org/4/46/46.zip.
## ℹ The book may have been archived.
## ℹ Alternatively, You may need to select a different mirror.
## → See https://www.gutenberg.org/MIRRORS.ALL for options.
## Warning: ! Could not download a book at http://aleph.gutenberg.org/7/3/730/730.zip.
## ℹ The book may have been archived.
## ℹ Alternatively, You may need to select a different mirror.
## → See https://www.gutenberg.org/MIRRORS.ALL for options.
## Warning: ! Could not download a book at http://aleph.gutenberg.org/7/6/766/766.zip.
## ℹ The book may have been archived.
## ℹ Alternatively, You may need to select a different mirror.
## → See https://www.gutenberg.org/MIRRORS.ALL for options.
## Warning: ! Could not download a book at http://aleph.gutenberg.org/9/6/963/963.zip.
## ℹ The book may have been archived.
## ℹ Alternatively, You may need to select a different mirror.
## → See https://www.gutenberg.org/MIRRORS.ALL for options.
## Warning: ! Could not download a book at http://aleph.gutenberg.org/1/0/2/1023/1023.zip.
## ℹ The book may have been archived.
## ℹ Alternatively, You may need to select a different mirror.
## → See https://www.gutenberg.org/MIRRORS.ALL for options.
## Warning: ! Could not download a book at http://aleph.gutenberg.org/1/4/2/1423/1423.zip.
## ℹ The book may have been archived.
## ℹ Alternatively, You may need to select a different mirror.
## → See https://www.gutenberg.org/MIRRORS.ALL for options.
## Warning: ! Could not download a book at
## http://aleph.gutenberg.org/3/0/1/2/30127/30127.zip.
## ℹ The book may have been archived.
## ℹ Alternatively, You may need to select a different mirror.
## → See https://www.gutenberg.org/MIRRORS.ALL for options.
## Warning: ! Could not download a book at
## http://aleph.gutenberg.org/4/0/7/2/40723/40723.zip.
## ℹ The book may have been archived.
## ℹ Alternatively, You may need to select a different mirror.
## → See https://www.gutenberg.org/MIRRORS.ALL for options.
## Warning: ! Could not download a book at
## http://aleph.gutenberg.org/4/0/7/2/40729/40729.zip.
## ℹ The book may have been archived.
## ℹ Alternatively, You may need to select a different mirror.
## → See https://www.gutenberg.org/MIRRORS.ALL for options.
## Warning: ! Could not download a book at
## http://aleph.gutenberg.org/4/1/7/3/41739/41739.zip.
## ℹ The book may have been archived.
## ℹ Alternatively, You may need to select a different mirror.
## → See https://www.gutenberg.org/MIRRORS.ALL for options.
## Warning: ! Could not download a book at
## http://aleph.gutenberg.org/4/1/8/9/41894/41894.zip.
## ℹ The book may have been archived.
## ℹ Alternatively, You may need to select a different mirror.
## → See https://www.gutenberg.org/MIRRORS.ALL for options.
## Warning: ! Could not download a book at
## http://aleph.gutenberg.org/4/2/2/3/42232/42232.zip.
## ℹ The book may have been archived.
## ℹ Alternatively, You may need to select a different mirror.
## → See https://www.gutenberg.org/MIRRORS.ALL for options.
## Warning: ! Could not download a book at
## http://aleph.gutenberg.org/4/3/1/1/43111/43111.zip.
## ℹ The book may have been archived.
## ℹ Alternatively, You may need to select a different mirror.
## → See https://www.gutenberg.org/MIRRORS.ALL for options.
## Warning: ! Could not download a book at
## http://aleph.gutenberg.org/4/3/2/0/43207/43207.zip.
## ℹ The book may have been archived.
## ℹ Alternatively, You may need to select a different mirror.
## → See https://www.gutenberg.org/MIRRORS.ALL for options.
## Warning: ! Could not download a book at
## http://aleph.gutenberg.org/4/6/6/7/46675/46675.zip.
## ℹ The book may have been archived.
## ℹ Alternatively, You may need to select a different mirror.
## → See https://www.gutenberg.org/MIRRORS.ALL for options.
## Warning: ! Could not download a book at
## http://aleph.gutenberg.org/4/7/5/3/47534/47534.zip.
## ℹ The book may have been archived.
## ℹ Alternatively, You may need to select a different mirror.
## → See https://www.gutenberg.org/MIRRORS.ALL for options.
## Warning: ! Could not download a book at
## http://aleph.gutenberg.org/4/7/5/3/47535/47535.zip.
## ℹ The book may have been archived.
## ℹ Alternatively, You may need to select a different mirror.
## → See https://www.gutenberg.org/MIRRORS.ALL for options.
#number the rows of the Dickens metadata table (one row per book)
line_number <- dickens_books%>%
group_by(gutenberg_id) %>%
mutate(line = row_number()) %>%
ungroup()
line_number
## # A tibble: 54 × 9
## gutenberg_id title author gutenberg_author_id language gutenberg_bookshelf
## <int> <chr> <chr> <int> <chr> <chr>
## 1 46 A Chris… Dicke… 37 en "Children's Litera…
## 2 564 The Mys… Dicke… 37 en "Mystery Fiction"
## 3 580 The Pic… Dicke… 37 en "Best Books Ever L…
## 4 699 A Child… Dicke… 37 en "Children's Histor…
## 5 700 The Old… Dicke… 37 en ""
## 6 730 Oliver … Dicke… 37 en ""
## 7 766 David C… Dicke… 37 en "Harvard Classics"
## 8 821 Dombey … Dicke… 37 en ""
## 9 917 Barnaby… Dicke… 37 en "Historical Fictio…
## 10 963 Little … Dicke… 37 en ""
## # ℹ 44 more rows
## # ℹ 3 more variables: rights <chr>, has_text <lgl>, line <int>
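The tidy Dickens data can now be scored the same way as the Austen novels. A minimal sketch, assuming at least some of the titles above downloaded successfully and mirroring the earlier Bing workflow (the dickens_sentiment name is introduced here only for illustration):
#sketch: net Bing sentiment per 80-line section of each Dickens title
dickens_sentiment <- tidydata %>%
  inner_join(get_sentiments("bing"), relationship = "many-to-many") %>%
  count(title, index = linenumber %/% 80, sentiment) %>%
  spread(sentiment, n, fill = 0) %>%
  mutate(sentiment = positive - negative)
dickens_sentiment %>%
  ggplot(aes(index, sentiment, fill = title)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~title, ncol = 2, scales = "free_x")
Finally, a small demonstration of tidy tokenization on the assignment prompt itself: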
text <- c("Sentiment text analysis for chapter two -",
"Start by getting the primary example code -",
"should provide a citation to this base code -",
"Incorporate at least one additional sentiment lexicon-",
"You make work on a small team on this assignment"
)
text
## [1] "Sentiment text analysis for chapter two -"
## [2] "Start by getting the primary example code -"
## [3] "should provide a citation to this base code -"
## [4] "Incorporate at least one additional sentiment lexicon-"
## [5] "You make work on a small team on this assignment"
#tidy table
text_df <- tibble(line =1:5, text=text)
text_df
## # A tibble: 5 × 2
## line text
## <int> <chr>
## 1 1 Sentiment text analysis for chapter two -
## 2 2 Start by getting the primary example code -
## 3 3 should provide a citation to this base code -
## 4 4 Incorporate at least one additional sentiment lexicon-
## 5 5 You make work on a small team on this assignment
text_df %>%
unnest_tokens(word, text)
## # A tibble: 38 × 2
## line word
## <int> <chr>
## 1 1 sentiment
## 2 1 text
## 3 1 analysis
## 4 1 for
## 5 1 chapter
## 6 1 two
## 7 2 start
## 8 2 by
## 9 2 getting
## 10 2 the
## # ℹ 28 more rows
Source / Citation:
Silge, Julia, and David Robinson. Text Mining with R: A Tidy Approach. O’Reilly Media, 2017. https://www.tidytextmining.com/