WoT Sentiment Analysis
library(tm)
library(pdftools)
library(here)
library(tidyverse)
library(tidytext)
library(wordcloud)
library(reshape2)
library(ggplot2)Abstract
As an avid reader, The Wheel of Time is one of my favorite high fantasy series. After learning about sentiment analysis in the DATA607 class, I was inspired to apply sentiment analysis to the first 9 books in the Wheel of Time series to compare the overall sentiment of each book to its related Goodreads rating and determine any correlation. I also hoped to see if the tone of the series changes with each book. My initial assumptions were that the series would get progressively more negative and that there would be some form of linear relationship between a series sentiment and its rating.
For my approach, I parsed the first 9 PDFs of the series and tokenized each word to match to a numerical positive or negative sentiment (from -5 to 5). Then, I plotted the sentiment progression by page for each book and gathered the initial set of summary statistics (mean, standard deviation etc.). Using linear regression, I found a near zero linear relationship between the page of a book and its corresponding sentiment value.
I compared the mean sentiment of each book to its respective Goodreads rating using linear regression. While the p-value (.04803) is less than .05, the adjusted R-squared is .3711 and thus there seems to be a very weak linear relationship between the mean sentiment of a book and its Goodreads rating with little statistical significance.
Further analysis can be done to strengthen my conclusion by including more books in the series as well as exploring other forms of sentiment analysis with more qualitative characteristics (anger, sadness, etc.). There were also some limitations while parsing the PDFs that could improve how the AFINN lexicon deals with fantasy terms that may have a different connotation within the novels.
Data Preparation
PDF Scraping
Here I build the path to all the pdfs stored in my project folder and call the pdf_text function to convert the PDFs into a character vector.
path_to_pdfs <- here("Final Project","WoT")
book_files <- list.files(path = path_to_pdfs, pattern = "pdf$")
# add relative path to file list
complete_path <- function(x) { str_c(path_to_pdfs, "/",x)}
files <- complete_path(book_files)
# read in all pdfs in the WoT folder
wot <- suppressMessages(lapply(files, pdf_text))I created a function pdf_parse to parse out the titles from the text, unnest words to a text column and remove stop words (and, the, etc.) using the default stop_word dataset.
pdf_parse <- function(list) {
data(stop_words)
title <- list[1] %>% str_split("\n")
title <- title[[1]][1]
pdf_book <- tibble(book = title, page=1:length(list), text = list)
pdf_book <- pdf_book %>% unnest_tokens(word, text)
pdf_book <- pdf_book %>% anti_join(stop_words)
}
# parse all pdfs
all_books <- lapply(wot, pdf_parse)## Joining, by = "word"
## Joining, by = "word"
## Joining, by = "word"
## Joining, by = "word"
## Joining, by = "word"
## Joining, by = "word"
## Joining, by = "word"
## Joining, by = "word"
## Joining, by = "word"
glimpse(all_books)## List of 9
## $ : tibble [103,799 x 3] (S3: tbl_df/tbl/data.frame)
## ..$ book: chr [1:103799] "A Crown of Swords" "A Crown of Swords" "A Crown of Swords" "A Crown of Swords" ...
## ..$ page: int [1:103799] 1 1 1 1 2 2 2 2 2 2 ...
## ..$ word: chr [1:103799] "crown" "swords" "robert" "jordan" ...
## $ : tibble [84,142 x 3] (S3: tbl_df/tbl/data.frame)
## ..$ book: chr [1:84142] "The Dragon Reborn" "The Dragon Reborn" "The Dragon Reborn" "The Dragon Reborn" ...
## ..$ page: int [1:84142] 1 1 1 1 2 2 2 2 2 2 ...
## ..$ word: chr [1:84142] "dragon" "reborn" "robert" "jordan" ...
## $ : tibble [105,491 x 3] (S3: tbl_df/tbl/data.frame)
## ..$ book: chr [1:105491] "The Eye Of The World" "The Eye Of The World" "The Eye Of The World" "The Eye Of The World" ...
## ..$ page: int [1:105491] 1 1 1 1 2 2 2 2 2 2 ...
## ..$ word: chr [1:105491] "eye" "world" "robert" "jordan" ...
## $ : tibble [116,686 x 3] (S3: tbl_df/tbl/data.frame)
## ..$ book: chr [1:116686] "The Fires of Heaven" "The Fires of Heaven" "The Fires of Heaven" "The Fires of Heaven" ...
## ..$ page: int [1:116686] 1 1 1 1 2 2 2 2 2 2 ...
## ..$ word: chr [1:116686] "fires" "heaven" "robert" "jordan" ...
## $ : tibble [87,899 x 3] (S3: tbl_df/tbl/data.frame)
## ..$ book: chr [1:87899] "The Great Hunt" "The Great Hunt" "The Great Hunt" "The Great Hunt" ...
## ..$ page: int [1:87899] 1 1 1 2 2 2 2 2 2 2 ...
## ..$ word: chr [1:87899] "hunt" "robert" "jordan" "prologue" ...
## $ : tibble [136,298 x 3] (S3: tbl_df/tbl/data.frame)
## ..$ book: chr [1:136298] "The Lord of Chaos" "The Lord of Chaos" "The Lord of Chaos" "The Lord of Chaos" ...
## ..$ page: int [1:136298] 1 1 1 1 2 2 2 2 2 2 ...
## ..$ word: chr [1:136298] "lord" "chaos" "robert" "jordan" ...
## $ : tibble [82,263 x 3] (S3: tbl_df/tbl/data.frame)
## ..$ book: chr [1:82263] "The Path of Daggers" "The Path of Daggers" "The Path of Daggers" "The Path of Daggers" ...
## ..$ page: int [1:82263] 1 1 1 1 2 2 2 2 2 2 ...
## ..$ word: chr [1:82263] "path" "daggers" "robert" "jordan" ...
## $ : tibble [136,043 x 3] (S3: tbl_df/tbl/data.frame)
## ..$ book: chr [1:136043] "The Shadow Rising" "The Shadow Rising" "The Shadow Rising" "The Shadow Rising" ...
## ..$ page: int [1:136043] 1 1 1 1 2 2 2 2 2 2 ...
## ..$ word: chr [1:136043] "shadow" "rising" "robert" "jordan" ...
## $ : tibble [84,749 x 3] (S3: tbl_df/tbl/data.frame)
## ..$ book: chr [1:84749] "Winter’s Heart" "Winter’s Heart" "Winter’s Heart" "Winter’s Heart" ...
## ..$ page: int [1:84749] 1 1 1 1 2 2 2 2 2 2 ...
## ..$ word: chr [1:84749] "winter’s" "heart" "robert" "jordan" ...
Now I have a list of all the books in a similar dataframe format that I can bind together
# bind all dataframes together
all_books <- all_books %>% reduce(rbind)
all_books## # A tibble: 937,370 x 3
## book page word
## <chr> <int> <chr>
## 1 A Crown of Swords 1 crown
## 2 A Crown of Swords 1 swords
## 3 A Crown of Swords 1 robert
## 4 A Crown of Swords 1 jordan
## 5 A Crown of Swords 2 health
## 6 A Crown of Swords 2 grown
## 7 A Crown of Swords 2 land
## 8 A Crown of Swords 2 dragon
## 9 A Crown of Swords 2 reborn
## 10 A Crown of Swords 2 land
## # ... with 937,360 more rows
CSV Book Ratings
Here I read in the ratings from a csv to store for use after I perform sentiment analysis.
csv_url <- here("Final Project", "book_ratings.csv")
ratings <- read.csv(csv_url)
ratings## book goodreads_rating num_ratings book_num
## 1 The Eye Of The World 4.17 436923 1
## 2 The Great Hunt 4.23 254637 2
## 3 The Dragon Reborn 4.25 234968 3
## 4 The Shadow Rising 4.24 189853 4
## 5 The Fires of Heaven 4.16 153907 5
## 6 The Lord of Chaos 4.14 143274 6
## 7 A Crown of Swords 4.04 139194 7
## 8 The Path of Daggers 3.91 119925 8
## 9 Winter’s Heart 3.94 114984 9
With the dataframe of all the selected books and associated words I can apply sentiment analysis to gather positive or negative values using the AFINN lexicon
Sentiment Analysis
AFINN
I pull in the sentiment using the AFINN lexicon to get a quantitative value of a word’s positive or negative sentiment.
afinn <- get_sentiments("afinn") # -5 to 5
all_books_afinn <- all_books %>% inner_join(afinn)## Joining, by = "word"
all_books_afinn_pages <- all_books %>% inner_join(afinn) %>% group_by(book,page) %>% summarize(value = sum(value))## Joining, by = "word"
## `summarise()` has grouped output by 'book'. You can override using the `.groups` argument.
glimpse(all_books_afinn)## Rows: 98,714
## Columns: 4
## $ book <chr> "A Crown of Swords", "A Crown of Swords", "A Crown of Swords", "~
## $ page <int> 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4, 4, 4, 4~
## $ word <chr> "fire", "proud", "pray", "tears", "fire", "love", "disputed", "t~
## $ value <dbl> -2, 2, 1, -2, -2, 3, -2, 2, 1, -1, -1, -2, -1, -2, 2, 1, -2, 2, ~
Statistical Analysis
I want to analyze my dependent variable (goodreads_rating) and its relationship to the independent variable sentiment mean which is the mean of all the words sentiment value for each book and its relationship to the sentiment progression of a book - i.e. does a book become progressively more or less positive with a linear relationship to page number.
Aggregated Statistics
I create a set of summary statistics for each book and its sentiment.
#add positive and negative classifiers for chi square test
all_books_afinn <- all_books_afinn %>% mutate (positive = if_else(value > 0, 1, 0), negative = if_else(value < 0, 1, 0))
all_books_stats <- all_books_afinn %>% group_by(book) %>% summarize(num_words = n(), sentiment_stdev = sd(value), sentiment_mean = mean(value), sentiment_min = min(value), sentiment_max = max(value), positive = sum(positive), negative = sum(negative))Summary Statistics
all_books_stats## # A tibble: 9 x 8
## book num_words sentiment_stdev sentiment_mean sentiment_min sentiment_max
## <chr> <int> <dbl> <dbl> <dbl> <dbl>
## 1 A Crown ~ 10928 1.91 -0.421 -4 5
## 2 The Drag~ 8873 1.90 -0.482 -5 4
## 3 The Eye ~ 10663 1.82 -0.512 -5 4
## 4 The Fire~ 12632 1.89 -0.441 -5 5
## 5 The Grea~ 8631 1.85 -0.579 -5 5
## 6 The Lord~ 14687 1.92 -0.382 -4 4
## 7 The Path~ 8634 1.89 -0.408 -4 4
## 8 The Shad~ 14520 1.89 -0.479 -5 5
## 9 Winter’s~ 9146 1.92 -0.402 -5 4
## # ... with 2 more variables: positive <dbl>, negative <dbl>
word_count <- all_books_afinn %>% group_by(book) %>% count(word, sort= TRUE)
word_count_total <- all_books_afinn %>% group_by(word) %>% summarize(count = n()) %>% arrange(desc(count))Top 10 words across books
word_count## # A tibble: 9,268 x 3
## # Groups: book [9]
## book word n
## <chr> <chr> <int>
## 1 The Lord of Chaos hard 287
## 2 The Fires of Heaven hard 250
## 3 The Shadow Rising hard 229
## 4 The Lord of Chaos dead 200
## 5 The Eye Of The World hard 193
## 6 Winter’s Heart hard 191
## 7 The Lord of Chaos smile 190
## 8 The Shadow Rising stop 185
## 9 The Lord of Chaos gray 181
## 10 The Eye Of The World fire 177
## # ... with 9,258 more rows
Top 30 words
wordcloud(word_count_total$word, word_count_total$count, max.words = 30)Page Linear Regression Model
Here we can see that the books sentiment stay about the same for each book in the series as well as for each page in a book.
ggplot(all_books_afinn_pages, aes(x=page,y=value)) + geom_point() + geom_smooth() + facet_wrap(~book, scales = "free_x")## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
ggplot(all_books_afinn, aes(page, value, fill = book)) +
geom_col(show.legend = FALSE) +
facet_wrap(~book, ncol = 3, scales = "free_x")ggplot(all_books_afinn, aes(x=book,y=value)) + geom_boxplot() + coord_flip()page_model<-lm(value~page,data=all_books_afinn_pages)
summary(page_model)##
## Call:
## lm(formula = value ~ page, data = all_books_afinn_pages)
##
## Residuals:
## Min 1Q Median 3Q Max
## -91.968 -8.973 1.076 11.047 52.085
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.310e+01 5.285e-01 -24.787 <2e-16 ***
## page 2.074e-04 2.259e-03 0.092 0.927
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 16.2 on 3419 degrees of freedom
## Multiple R-squared: 2.465e-06, Adjusted R-squared: -0.00029
## F-statistic: 0.008429 on 1 and 3419 DF, p-value: 0.9269
Our model shows that there is no linear relationship between what page you’re on in a book and the associated sentiment value of that page.
Sentiment Mean Linear Regression Model
I combine the ratings with the summary statistics dataframe to provide the necessary information for the sentiment mean linear regression model.
combined_books <- inner_join(all_books_stats, ratings, by = "book")
combined_books## # A tibble: 9 x 11
## book num_words sentiment_stdev sentiment_mean sentiment_min sentiment_max
## <chr> <int> <dbl> <dbl> <dbl> <dbl>
## 1 A Crown ~ 10928 1.91 -0.421 -4 5
## 2 The Drag~ 8873 1.90 -0.482 -5 4
## 3 The Eye ~ 10663 1.82 -0.512 -5 4
## 4 The Fire~ 12632 1.89 -0.441 -5 5
## 5 The Grea~ 8631 1.85 -0.579 -5 5
## 6 The Lord~ 14687 1.92 -0.382 -4 4
## 7 The Path~ 8634 1.89 -0.408 -4 4
## 8 The Shad~ 14520 1.89 -0.479 -5 5
## 9 Winter’s~ 9146 1.92 -0.402 -5 4
## # ... with 5 more variables: positive <dbl>, negative <dbl>,
## # goodreads_rating <dbl>, num_ratings <int>, book_num <int>
The key assumptions for a simple linear regression model are:
- Linearity: The relationship between X and the mean of Y is linear
- Homoscedasticity: The variance of the residual is the same for any value of X
- Independence: Observations are independent of each other
- Normality: For any fixed value of X, Y is normally distributed
The dataset roughly meets all of those criteria although it seems like there might be an outlier book with a high rating and a higher sentiment_mean.
combined_books %>% ggplot(aes(x=sentiment_mean, y = goodreads_rating)) + geom_point() + geom_smooth()## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
model<-lm(goodreads_rating~sentiment_mean,data=combined_books)
summary(model)##
## Call:
## lm(formula = goodreads_rating ~ sentiment_mean, data = combined_books)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.14389 -0.05745 -0.02602 0.08840 0.12071
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.4974 0.2625 13.325 3.14e-06 ***
## sentiment_mean -1.3646 0.5705 -2.392 0.048 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1014 on 7 degrees of freedom
## Multiple R-squared: 0.4498, Adjusted R-squared: 0.3711
## F-statistic: 5.722 on 1 and 7 DF, p-value: 0.04803
With a low adjusted R-squared at .3711, the low p-value (.04803) is not significant. The model is also pretty limited based on the small sample size of books. Further analysis would include up to 15 books and provide some more confidence in the model, as well as potential expansion to other high fantasy series to see if the trend holds.
Other limitations are that I used word sentiment instead of sentence sentiment. It’s possible that the overall sentiment of a book could change if I used a sentence token over a word token. For example, the word gray is ranked at -1 in the AFINN lexicon, but it could be in a neutral context when describing a characters eyes. There could also be a difference if I used a chapter separation instead of page separation when analyzing a book’s sentiment over the course of the book. These could all be future extensions of the analysis.
Citations
Finn Årup Nielsen A new ANEW: Evaluation of a word list for sentiment analysis in microblogs. Proceedings of the ESWC2011 Workshop on ‘Making Sense of Microposts’: Big things come in small packages 718 in CEUR Workshop Proceedings 93-98. 2011 May. http://arxiv.org/abs/1103.2903.