Wattpad - Books analysis

Introduction

Wattpad is a global storytelling platform that connects millions of writers with readers. Launched in 2006, it offers a diverse collection of user-generated stories spanning various genres. With an interactive and collaborative approach, Wattpad empowers writers to share their creativity, while readers can explore a vast digital library of engaging narratives in real-time.

This project examines 1500 Wattpad books from 15 different genres. We’re using text mining techniques to see if, despite being labeled in different categories like mystery or adventure, most of these books actually have a love story. By looking at the feelings in the stories and common themes, we want to find out if romance is a big part of Wattpad books, regardless of their official genre.

The data was scrapped with the use of package selenium in python.

Local Image

Loading and Cleaning Data

First step is loading needed packages.

library(Rmisc)
library(tidyverse)
library(tidytext)
library(topicmodels)
library(tm)
library(readr)
library(scales)
library(gridExtra)
library(kableExtra)
library(udpipe)

Now let us load the dataset and look into few examples.

# Saving csv files names 
csv_files <- list.files(getwd(), pattern = "_df.csv$", full.names = TRUE)


# Loading csv files into one data frame
books <- do.call(rbind, lapply(csv_files, read.csv))

# Filtering out title with highest reading time.
# The reason is that they are much longer than the rest of books and can 
# significantly influence the results.

books <- books %>% 
  filter(reading_time < quantile(reading_time, probs=0.95))

# Analyzing only 100 title from each genre
set.seed(31415)
books_100_titles <- books %>%
  select(genre,title,reading_time) %>% 
  distinct() %>% 
  group_by(genre) %>%
  slice_sample(n = 100) %>%
  ungroup()

books_new <- books %>%
  semi_join(books_100_titles, by = c("title", "genre", "reading_time"))

The table below presents examples of chapter content from two different genres: romance and horror.

example_chapters <- readRDS("example_chapters.rds")

kable(example_chapters, format = "html") %>%
  kable_styling(full_width = FALSE)

genre	title	chapter_name	chapter_content
horror	Japanese Urban Legends	Tomino	Tomino is a Japanese Urban Legend about a poem that kills anyone who recites it out loud.1In this world there things that you should never say out loud, and the Japanese poem “Tomino’s Hell” is one of them. According to the legend, if you read this poem out loud, disaster will strike. At best, you will feel very ill or injure yourself. At worst, you could die.2In this video you can hear Tomino being read in Japanese. You will notice that the person who made the video used text to speech software. They didn’t dare read it out loud themselves.A/N: I have the video but I don’t wanna put it here ’cause it freaks me out… And I’m scared that someone will die… T_TThis is a rough English Translation:(A/N: Please Don’t read it out loud T_T)10…
romance	The baby swap✓	Chapter 1: Starlight diamonds	ZOE67I laid on my couch and placed my bowl of popcorn on my lap.68It was Friday night, that meant I would find some killer series and binge-watch until I fell asleep.26While my friends were out on Friday nights, I stayed at home watching a romance movie about ‘true love’ that clearly didn’t exist.79’Just because your life is sad doesn’t mean everyone else’s is.’ A soft voice said in my head61I finally settled on the vampire diaries after scrolling and searching for about ten minutes. It was my all-time favourite. No matter how many times I watched it, I still found myself surprised by what happened next.147I filled my mouth with popcorn as I watched Damon and Katherine driving away from Mystic Falls.123My cell phone rang, interrupting me…

The next step of preparing data is deleting stop words and tokenization.

# Stop words
custom_stop_words <- tribble(
  ~word, ~lexicon,
  "a", "CUSTOM",
  "aa", "CUSTOM",
  "aaa", "CUSTOM",
  "yu", "CUSTOM",
  "couldnt", "CUSTOM",
  "hadnt", "CUSTOM",
  "dont", "CUSTOM",
  "havent", "CUSTOM",
  "qiu", "CUSTOM",
  "ling", "CUSTOM",
  "xiao", "CUSTOM",
  "xin", "CUSTOM",
  "lan", "CUSTOM",
  "hua", "CUSTOM",
  "qiao", "CUSTOM",
  "lin", "CUSTOM",
  "xi", "CUSTOM",
  "leng", "CUSTOM",
  "jin", "CUSTOM",
  "gu", "CUSTOM",
  "zheng", "CUSTOM",
  "jiang", "CUSTOM",
  "nuan", "CUSTOM",
  "tian", "CUSTOM",
  "hong", "CUSTOM",
  "sheng", "CUSTOM",
  "yuerong", "CUSTOM",
  "lu", "CUSTOM",
  "hao", "CUSTOM",
  "ruan", "CUSTOM",
  "yi", "CUSTOM",
  "mu", "CUSTOM",
  "bei", "CUSTOM",
  "zhan", "CUSTOM",
  "su", "CUSTOM",
  "nian", "CUSTOM",
  "chen", "CUSTOM",
  "wei", "CUSTOM",
  "li", "CUSTOM",
  "zhichu", "CUSTOM",
  "sang", "CUSTOM",
  "yuanyuan", "CUSTOM",
  "shen", "CUSTOM",
  "shen", "CUSTOM",
  "yichong", "CUSTOM",
  "xie", "CUSTOM",
  "tao", "CUSTOM",
  "luan", "CUSTOM",
  "ni", "CUSTOM",
  "yang", "CUSTOM",
  "liang", "CUSTOM",
  "xuiy", "CUSTOM"
)


stop_words2 <- stop_words %>%
  bind_rows(custom_stop_words)

# Tokenization
tidy_books <- books_new %>%
  dplyr::mutate(id = row_number()) %>%
  select(id, genre, title,best_ranking,reads, votes, reading_time, 
         chapter_publish_date, chapter_name, chapter_content) %>% 
  unnest_tokens(word, chapter_content) %>% 
  anti_join(stop_words) %>% 
  filter(!grepl("\\d", word)) # Regex, excluding numbers

# Tokenization of words separated with dot
tidy_books_update <- tidy_books %>%
  filter(str_detect(word, "\\.")) %>%
  mutate(word = str_split(word, "\\.")) %>%
  unnest() 

tidy_books <- bind_rows(tidy_books, tidy_books_update) %>% 
  filter(!str_detect(word, "\\.")) %>% 
  anti_join(stop_words2) %>% 
  mutate(word = str_replace_all(word, "[[:punct:]]", "")) %>% 
  filter(word != "") %>% # Deleting punctuation
  filter(!str_detect(word, "a{4,}")) %>% 
  mutate(chapter_publish_date=base::as.Date(chapter_publish_date))

Finally, we execute the lemmatization process. It proves beneficial in topic modeling, yielding more accurate results as we standardize words and eliminate unnecessary similarities and diversities, especially concerning tenses and inflections.

udpipe_download_model(language = "english")
model_path <- "english-ewt-ud-2.5-191206.udpipe"
ud_model <- udpipe_load_model(model_path)


# Empty data frame to store the annotated results
tidy_books_annotated <- data.frame()

chunk_size <- 100000
num_chunks <- ceiling(nrow(tidy_books) / chunk_size)

for (i in 1:num_chunks) {

  start_index <- (i - 1) * chunk_size + 1
  end_index <- min(i * chunk_size, nrow(tidy_books))
  
  current_chunk <- tidy_books$word[start_index:end_index]
  
  cat("Processing chunk", i, "of", num_chunks, "\n")
  
  chunk_annotated <- udpipe_annotate(ud_model, x = current_chunk)
  chunk_annotated_df <- as.data.frame(chunk_annotated)
  tidy_books_annotated <- rbind(tidy_books_annotated, chunk_annotated_df)
}

tidy_books_annotated <- tidy_books_annotated %>% 
  select(sentence, token, lemma, upos, xpos, dep_rel) %>% 
  distinct()

# Joining lemmatized words
tidy_books_lemm <- tidy_books %>% 
  inner_join(tidy_books_annotated %>% select(sentence, lemma),
  by = c("word" = "sentence"))

tidy_books_lemm <- tidy_books_lemm %>% 
  select(!word) %>% 
  rename(word = lemma)

tidy_books_lemm <- tidy_books_lemm %>%
  filter(nchar(word) > 1) %>% 
  filter(!grepl("hh", word)) %>% 
  filter(!grepl("aaa", word))

# Deleting words which occur only once
word_frequencies <- tidy_books_lemm %>% count(word)
tidy_books_lemm <- tidy_books_lemm %>%
  anti_join(word_frequencies %>% filter(n <= 2), by = "word")

Due to the size of the dataset and the limitations of processing it in R Markdown, I will read the data from saved RDS files.

tidy_books_lemm <- readRDS("tidy_books_lemm.rds")

Overview from the Data

Now we are going to look into the data set and see the overview.

Distribution of word counts across different genres

The plot below illustrates the distribution of word counts across different genres. It’s important to note that the dataset is not evenly balanced concerning word counts in each genre. This imbalance could potentially influence the outcomes of genre-specific analyses.

word_counts_genre <- tidy_books_lemm %>%
  dplyr::count(genre)


ggplot(word_counts_genre, aes(x = reorder(genre, -n))) + 
  geom_bar(aes(y = n), fill='#EA906C', stat = 'identity') +
  scale_y_continuous(name = "Number of words",
                     breaks = seq(0, 3e+06, by = 1e+06),
                     labels = scales::label_number(scale = 1e-6, suffix = "M")) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  xlab("Genre")

Distribution of popularity, ranking score and reads for all books together

The next plot illustrates the distribution of ranking scores and the number of reads across all analyzed books. It’s evident that both distributions are right-skewed. Notably, many books exhibit a high ranking score despite having a comparatively low number of reads. It’s worth mentioning that Wattpad’s ranking score isn’t solely derived from popularity but involves a more intricate algorithm, the details of which won’t be delved into here. To explore the relationship between ranking score and the number of reads, I created a correlation plot. The findings confirm that there is no positive correlation between these two variables.

stats_books <- tidy_books_lemm %>%
  dplyr::group_by(title,best_ranking,reads,votes) %>%
  dplyr::mutate(finishing_date = max(chapter_publish_date,na.rm = TRUE)) %>%
  dplyr::ungroup() %>%
  select(title,best_ranking,reads,votes,finishing_date) %>%
  distinct()

stats_books_rank <- stats_books %>% 
  group_by(best_ranking) %>% 
  filter(best_ranking <= 150) %>% 
  dplyr::summarise(number = n())

stats_books_reads <- stats_books %>% 
  group_by(reads) %>% 
  filter(reads <= 1e+06) %>% 
  dplyr::summarise(number = n())

grid.arrange(
  ggplot(stats_books_rank, aes(x = best_ranking, y = number)) + 
    geom_bar(stat = 'identity', fill = '#2B2A4C', color = '#2B2A4C') +
    theme_minimal() +
    xlab("Ranking"),
  ggplot(stats_books_reads, aes(x = reads)) +
    geom_histogram(position = 'stack', aes(y = ..count..),
                   fill = '#2B2A4C', color = '#2B2A4C',
                   binwidth = 10000) +
    scale_x_continuous(name = "Number of reads",
                       labels = scales::label_number(scale = 1e-3, suffix = "K")) +
    theme_minimal(),
  ncol = 1,top = "Distribution of ranking score and number of reads"
)

stats_books_sc <- stats_books %>% 
  filter(best_ranking <= 400,
         reads <= 1e+06) 

ggplot(data = stats_books_sc, aes(y = best_ranking, x = reads)) + 
  geom_point(fill = '#2B2A4C', color = '#2B2A4C') +
  geom_smooth(color = "#B31312") +
  theme_minimal() +
  labs(x = "Reads", y = "Ranking",
       title = "Relation between ranking score and number of reads")

Most Popular Words in Genres

Within each genre, numerous words are associated with the body, probably describing emotions: eyes, hand, head, smile, walk, look. In genres like historical fiction and werewolf, specific genre-related words emerge, including: lady, pack, mate.

word_counts <- tidy_books_lemm %>%
  dplyr::count(word, genre) %>%
  dplyr::group_by(genre) %>%
  dplyr::top_n(20, n) %>%
  ungroup() %>%
  mutate(word2 = fct_reorder(word, n))

top_words_by_genre <- tidy_books_lemm %>%
  group_by(genre) %>%
  count(word, sort = TRUE) %>%
  slice_max(n = 100, order_by = n) %>% 
  count(word, genre) %>%
  group_by(word) %>%
  summarise(num_genres = n_distinct(genre)) %>% 
  filter(num_genres == 15)

word_counts <- top_words_by_genre %>% 
  inner_join(tidy_books_lemm, by = "word") %>%
  count(word, genre) %>%
  group_by(genre) %>%
  top_n(15, n) %>%
  ungroup() %>%
  mutate(word2 = fct_reorder(word, n))


ggplot(word_counts, aes(x=reorder(word2,n), y=n)) + 
  geom_col(show.legend=FALSE, fill = '#EA906C') +
  facet_wrap(~genre, scales="free_y") +
  coord_flip() +
  labs(x = "", y = "",
       title = "Popular words per genre")

Most Popular Bigrams in Genres

Bigrams may be more informative in regards to context of most popular words. As expected body related words are associated with describing emotions: shook head, roll eyes, deep breath, hold hand, raise eyebrow, eyes widen, clear throat.

In fantasy there occur some more unique bigrams including dragon king, dragon realm.

However besides fantasy it seems that in all genres key words are almost identical. It suggests that genres may not be significantly different from each other.

bigrams <- tidy_books_lemm %>%
  mutate(next_word = lead(word)) %>% 
  filter(!is.na(next_word)) %>% 
  mutate(bigram = paste(word, next_word, sep = " ")) %>% 
  count(bigram, genre) %>%
  group_by(genre) %>%
  top_n(10, n) %>%
  ungroup()

bigrams <- readRDS("bigrams.rds")

ggplot(bigrams, aes(x = reorder(bigram, n), y = n)) +
  geom_col(fill = '#EA906C') +
  facet_wrap(~genre, scales = "free_y") +
  coord_flip() +
  labs(x = "", y = "",
       title = "Popular bigrams per genre")

Sentiment Analysis

Sentiment analysis helps us understand the emotions woven into the stories. By looking at different genres like romance, fantasy, and horror, we aim to uncover the feelings and experiences unique to each kind of book. Through sentiment analysis, I aim to investigate whether particular emotions align with specific genres or if genre serves as an unrelated characteristic indicating the type of book.

Positive/Negative Sentiment per genre

A straightforward categorization of sentiments into positive and negative, based on the dataset, reveals that they do not significantly differ. Across all genres, the majority of sentiments lean towards negativity, with a proportion ranging from 60% to 69%.

sentiment_prop_genre <- tidy_books_lemm %>%
  inner_join(get_sentiments("bing")) %>%
  group_by(genre) %>%
  dplyr::summarise(
    total_words = n(),
    positive = sum(sentiment == "positive")/ total_words,
    negative = sum(sentiment == "negative")/ total_words
  ) %>%
  gather(sentiment, proportion, -genre) %>%
  spread(sentiment, proportion)


ggplot(
  sentiment_prop_genre, aes(x = reorder(genre, -negative), y=negative)
  ) + 
  geom_col(show.legend=FALSE, fill='#B31312') +
  geom_text(
    aes(label = scales::percent(negative,accuracy = 1)),
    position = position_stack(vjust = 0.95),
    show.legend = FALSE,
    color = "#EEE2DE"
  ) +
  coord_flip() +
  scale_y_continuous(label = scales::percent) +
  labs(
    title = "Negative Sentiment by Genre",
    x = "Genre",
    y = "Negative Sentiment"
  )

In order to add more context to above numbers, below is presented the distribution of negative sentiment across all titles. While many of titles exhibit 60% proportion of negative sentiment, the distribution range is notably broader. This indicates a potential heightened diversity among titles compared to the relatively more homogenous nature of genres. It may suggest that assigned genre is not an accurate characteristic.

sentiment_prop_title <- tidy_books_lemm %>%
  inner_join(get_sentiments("bing")) %>%
  group_by(title) %>%
  dplyr::summarise(
    total_words = n(),
    positive = sum(sentiment == "positive")/ total_words,
    negative = sum(sentiment == "negative")/ total_words
  ) %>%
  gather(sentiment, proportion, -title) %>%
  spread(sentiment, proportion)

ggplot(
  sentiment_prop_title, aes(x = negative)
) + 
  geom_histogram(fill = '#B31312', bins = 30) +
  scale_x_continuous(labels = scales::percent_format(scale = 100), limits = c(0.25, 1)) +
  geom_vline(xintercept = 0.6, color = "#EEE2DE", linetype = "dashed") +
 labs(
    title = "Distribution of Negative Sentiment per Title",
    x = "Negative Sentiment Proportion",
    y = "Count"
  )

Furthermore, a sentiment analysis was conducted for a broader spectrum of emotions, leading to a similar conclusion: emotions show no significant variations across diverse genres.

Possibly we may distinguish two types of books in regards to sentiments. It is especially visible in the emotion of “joy”: thriller, science fiction, paranormal, horror, mystery seems to represents lower score than genres such as romance, teen fiction, new adult, LGBT, contemporary literature. A similar pattern is observed in the sentiment of “fear.” However, a more in-depth analysis is necessary for a comprehensive understanding.

nrc_sentiment <- tidy_books_lemm %>%
  inner_join(get_sentiments("nrc")) %>%
  group_by(genre) %>%
  dplyr::summarise(
    total_words = n(),
    positive = sum(sentiment == "positive")/ total_words,
    negative = sum(sentiment == "negative")/ total_words,
    anger = sum(sentiment == "anger")/ total_words,
    anticipation = sum(sentiment == "anticipation")/ total_words,
    disgust = sum(sentiment == "disgust")/ total_words,
    fear = sum(sentiment == "fear")/ total_words,
    joy = sum(sentiment == "joy")/ total_words,
    sadness = sum(sentiment == "sadness")/ total_words,
    surprise = sum(sentiment == "surprise")/ total_words,
    trust = sum(sentiment == "trust")/ total_words
  ) %>%
  gather(sentiment, proportion, -genre) %>%
  spread(sentiment, proportion)


nrc_sentiment_long <- pivot_longer(nrc_sentiment, cols = c(anger, anticipation, disgust, fear, joy, negative, positive, sadness, surprise, trust), names_to = "sentiment", values_to = "score")


ggplot(nrc_sentiment_long, aes(x = genre, y = score, fill = genre)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~sentiment, scales = "free_y") +
  scale_y_continuous(label = scales::percent) +
  coord_flip() +
  ggtitle("Sentiment Distribution by Genre") +
  scale_fill_manual(values = c("#2B2A4C", "#2B2A4C", "#2B2A4C", "#2B2A4C", "#2B2A4C", "#2B2A4C", "#2B2A4C", "#2B2A4C", "#2B2A4C", "#2B2A4C","#B31312", "#2B2A4C", "#2B2A4C", "#2B2A4C", "#2B2A4C")) +
  theme(legend.position = "none")

The sentiment distribution among titles is centered around a singular value, suggesting that the sentiment scores for the analyzed titles are closely aligned and do not vary significantly from one another.

nrc_lexicon <- get_sentiments("nrc")

nrc_sentiment_title <- tidy_books_lemm %>%
  inner_join(get_sentiments("nrc")) %>%
  group_by(title) %>%
  dplyr::summarise(
    total_words = n(),
    positive = sum(sentiment == "positive")/ total_words,
    negative = sum(sentiment == "negative")/ total_words,
    anger = sum(sentiment == "anger")/ total_words,
    anticipation = sum(sentiment == "anticipation")/ total_words,
    disgust = sum(sentiment == "disgust")/ total_words,
    fear = sum(sentiment == "fear")/ total_words,
    joy = sum(sentiment == "joy")/ total_words,
    sadness = sum(sentiment == "sadness")/ total_words,
    surprise = sum(sentiment == "surprise")/ total_words,
    trust = sum(sentiment == "trust")/ total_words
  ) %>%
  gather(sentiment, proportion, -title) %>%
  spread(sentiment, proportion)


nrc_sentiment_title_long <- pivot_longer(nrc_sentiment_title,
                                         cols = c(anger, anticipation, disgust,
                                                  fear, joy, negative, positive,
                                                  sadness, surprise, trust),
                                         names_to = "sentiment",
                                         values_to = "score")

ggplot(nrc_sentiment_title_long, aes(x = score, fill = sentiment)) +
  geom_histogram(show.legend = FALSE, bins=50) +
  facet_wrap(~sentiment, scales = "free_y") +
  scale_x_continuous(label = scales::percent) +
  labs(title = "Sentiment Distribution by Title",
       x = "Score", y = "Number of Titles")

Topic Modelling

Topic modeling allows us to uncover hidden thematic structures within a corpus of text, revealing patterns and insights. We employed the Latent Dirichlet Allocation (LDA) algorithm. Through the application of this modeling technique, our aim is to discover new themes and patterns not only between words but also within entire books.

Below there are plotted 15 top words for each topic. Based on that there are some simplified characterizations for the four topics:

Topic 1: Emotional Interactions
Key Words: hand, feel, head, pull, door, lip, smile, look, hold, voice, bed, arms, stop, leave, eyes
Characterization: This topic seems to revolve around emotional interactions and relationships, where characters express feelings through physical gestures, expressions, and moments of connection.

Topic 2: Mysterious Atmosphere
Key Words: hand, head, time, leave, eyes, light, people, foot, move, sound, wall, air, dark, blood, step
Characterization: This topic suggests a mysterious and atmospheric setting, possibly involving suspenseful events or moments. The presence of words like “dark,” “blood,” and “step” implies a sense of tension or intrigue.

Topic 3: Social Interactions
Key Words: walk, time, smile, talk, start, call, day, girl, tell, guy, friend, sit, car, laugh, phone
Characterization: This topic revolves around social interactions and everyday moments, including conversations, laughter, and activities. It portrays a social dynamic and may involve character relationships and daily life.

Topic 4: Family and Life
Key Words: time, father, smile, mother, look, day, life, leave, love, child, family, woman, eyes, word, people
Characterization: This topic is centered around themes of family, life, and love. It may explore relationships within a family, life experiences, and the emotions associated with these connections.

# dtm_books <- tidy_books_lemm %>%
#   count(word, id) %>%
#   cast_sparse(id, word, n)
# 
# lda_4topics <- LDA(
#   dtm_books,
#   k = 4,
#   method = "Gibbs",
#   control = list(seed=42)) %>% 
#   tidy(matrix = "beta")

lda_4topics <- readRDS("lda_4topics.rds")

word_probs_4 <- lda_4topics %>%
  group_by(topic) %>%
  top_n(15, beta) %>%
  ungroup() %>%
  mutate(term2 = fct_reorder(term, beta))


ggplot(
  word_probs_4,
  aes(term2, beta, fill=as.factor(topic))
  ) +
  geom_col(show.legend = FALSE) +
  scale_y_continuous(label = scales::percent) +
  facet_wrap(~ topic, scales = "free") +
  coord_flip() +
  labs(x="Beta", y="Word", title="Words with Highest Frequency per Topic")

Topic Modelling per book

We initially used topic modeling to find different themes within the entire collection of words. This process helped us identify four distinct topics with noticeable differences. Now, for a more in-depth analysis, we’re focusing on doing topic modeling for each individual book. The goal is to see if we can group books into specific categories beyond just their genres.

4 Topics

After looking at all the words together, we found four interesting topics. As a result, we decided to delve deeper and uncover four topics among the individual books themselves.

Topics are not balanced in regards to number of titles. The distirbution is presented on the below plot.

# dtm_books_title <- tidy_books_lemm %>%
#   count(word, title, reading_time) %>%
#   cast_sparse(title, word, n)
# 
# lda_4topics_title <- LDA(
#   dtm_books_title,
#   k = 4,
#   method = "Gibbs",
#   control = list(seed=42)) %>% 
#   tidy(matrix = "gamma")

lda_4topics_title <- readRDS("lda_4topics_title.rds")

lda_4topics_title %>%
  group_by(document) %>%
  slice(which.max(gamma)) %>%
  ungroup() %>%
  select(document, main_topic = topic, gamma) -> document_main_topic


ggplot(data = document_main_topic, aes(x = main_topic)) + 
  geom_bar(stat = 'count', aes(y = ..count..), fill='#BB9CC0') +
  geom_text(stat = 'count', aes(label = ..count..), vjust = 1.2, color = "white") +
  labs(x="Main Topic", y="Number of titles")

# tidy_books_topic_4 <- tidy_books_lemm %>% 
#   inner_join(document_main_topic, by = c('title'='document'))

tidy_books_topic_4 <- readRDS("tidy_books_topic_4.rds")

Plotting Popular Words per Topic - 4 Topics

Top 15 most popular words are very similar for each topic. Possibly bigrams introduce additional context and distinctions.

word_counts_4topic <- tidy_books_topic_4 %>%
  dplyr::count(word, main_topic) %>%
  dplyr::group_by(main_topic) %>%
  dplyr::top_n(15, n) %>%
  ungroup() %>%
  mutate(word2 = fct_reorder(word, n))

ggplot(word_counts_4topic, aes(x=reorder(word,n), y=n)) + 
  geom_col(show.legend=FALSE, fill = "#2B2A4C") +
  facet_wrap(~main_topic, scales="free_y") +
  coord_flip() +
  labs(title="Popular Words per Topic",
       x = "", y="Number of words")

Bigrams

The bigrams exhibit substantial similarity across topics, with the exception of the 2nd topic, which appears to be more inclined towards fantasy elements, evident in words like “prince,” “dragon,” “king,” and “realm.” The remaining phrases predominantly revolve around interpersonal relationships, expressions, and emotional experiences, displaying minimal distinctions among the topics.

bigrams_4 <- tidy_books_topic_4 %>%
  mutate(next_word = lead(word)) %>% 
  filter(!is.na(next_word)) %>% 
  mutate(bigram = paste(word, next_word, sep = " ")) %>% 
  count(bigram, main_topic) %>%
  group_by(main_topic) %>%
  top_n(15, n) %>%
  ungroup()

bigrams_4 <- readRDS("bigrams_4.rds")

# Plotting the bigrams
ggplot(bigrams_4, aes(x = reorder(bigram, n), y = n)) +
  geom_col(fill = '#EA906C') +
  facet_wrap(~main_topic, scales = "free_y") +
  coord_flip() +
  labs(x = "", y = "",
       title = "Popular Bigrams per Topic")

Sentiment Analysis per topic - 4 Topics

Regarding the positive/negative proportion in 4 analyzed topics, all lean slightly towards negative vocabulary, and the proportions are in range from 59-66%.

sentiment_prop_topic_4 <- tidy_books_topic_4 %>%
  inner_join(get_sentiments("bing")) %>%
  group_by(main_topic) %>%
  dplyr::summarise(
    total_words = n(),
    positive = sum(sentiment == "positive")/ total_words,
    negative = sum(sentiment == "negative")/ total_words
  ) %>%
  gather(sentiment, proportion, -main_topic) %>% 
  spread(sentiment, proportion)

ggplot(
  sentiment_prop_topic_4, aes(x = reorder(main_topic, -negative), y=negative)
) + 
  geom_col(show.legend=FALSE, fill='#B31312') +
  geom_text(
    aes(label = scales::percent(negative,accuracy = 1)),
    position = position_stack(vjust = 0.95),
    show.legend = FALSE,
    color = "#EEE2DE"
  ) +
  coord_flip() +
  scale_y_continuous(label = scales::percent) +
  labs(
    title = "Negative Sentiment by Topics",
    x = "Genre",
    y = "Negative Sentiment"
  )

Sentiment analysis performed on the topics yielded findings consistent with the sentiment analysis conducted across genres. Overall, emotional expressions exhibit minimal variations across a range of topics. Yet, there’s a potential distinction observable in certain emotional categories such as joy, fear, disgust, and surprise, suggesting a nuanced classification of books into two types. Nevertheless, it’s crucial to note that the observed differences are relatively subtle.

nrc_lexicon <- get_sentiments("nrc")

nrc_sentiment_4topics <- tidy_books_topic_4 %>%
  inner_join(get_sentiments("nrc")) %>%
  group_by(main_topic) %>%
  dplyr::summarise(
    total_words = n(),
    positive = sum(sentiment == "positive")/ total_words,
    negative = sum(sentiment == "negative")/ total_words,
    anger = sum(sentiment == "anger")/ total_words,
    anticipation = sum(sentiment == "anticipation")/ total_words,
    disgust = sum(sentiment == "disgust")/ total_words,
    fear = sum(sentiment == "fear")/ total_words,
    joy = sum(sentiment == "joy")/ total_words,
    sadness = sum(sentiment == "sadness")/ total_words,
    surprise = sum(sentiment == "surprise")/ total_words,
    trust = sum(sentiment == "trust")/ total_words
  ) %>%
  gather(sentiment, proportion, -main_topic) %>% 
  spread(sentiment, proportion)


nrc_sentiment_4topics_long <- pivot_longer(nrc_sentiment_4topics, cols = c(anger, anticipation, disgust, fear, joy, negative, positive, sadness, surprise, trust), names_to = "sentiment", values_to = "score")

ggplot(nrc_sentiment_4topics_long, aes(x = main_topic, y = score, fill = main_topic)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~sentiment, scales = "free_y") +
  scale_y_continuous(label = scales::percent) +
  ggtitle("Sentiment Distribution by Genre") +
  theme(legend.position = "none")

2 Topics

Based on the analysis conducted above, we decided to verify whether two topics would be more informative.

# lda_2topics_title <- LDA(
#   dtm_books_title,
#   k = 2,
#   method = "Gibbs",
#   control = list(seed=42)) %>% 
#   tidy(matrix = "gamma")

lda_2topics_title <- readRDS("lda_2topics_title.rds")

lda_2topics_title %>%
  group_by(document) %>%
  slice(which.max(gamma)) %>%
  ungroup() %>%
  select(document, main_topic = topic, gamma) -> document_main_topic_2


ggplot(data = document_main_topic_2, aes(x = main_topic)) + 
  geom_bar(stat = 'count', aes(y = ..count..), fill='#BB9CC0') +
  geom_text(stat = 'count', aes(label = ..count..), vjust = 1.2, color = "white") +
  labs(x="Main Topic", y="Number of titles")

# tidy_books_topic_2 <- tidy_books_lemm %>% 
#   inner_join(document_main_topic_2, by = c('title'='document'))

tidy_books_topic_2 <- readRDS("tidy_books_topic_2.rds")

Plotting Popular Words per Topic - 2 Topics

Again top 15 words and bigrams are very similar for both topics and there is no significant distinctions.

word_counts_topic_2 <- tidy_books_topic_2 %>%
  dplyr::count(word, main_topic) %>%
  dplyr::group_by(main_topic) %>%
  dplyr::top_n(15, n) %>%
  ungroup() %>%
  mutate(word2 = fct_reorder(word, n))

ggplot(word_counts_topic_2, aes(x=reorder(word,n), y=n)) + 
  geom_col(show.legend=FALSE, fill = "#2B2A4C") +
  facet_wrap(~main_topic, scales="free_y") +
  coord_flip() +
  labs(title="Popular words per topic",
       x = "", y="Number of words")

Bigrams

bigrams_2 <- tidy_books_topic_2 %>%
  mutate(next_word = lead(word)) %>% 
  filter(!is.na(next_word)) %>% 
  mutate(bigram = paste(word, next_word, sep = " ")) %>% 
  count(bigram, main_topic) %>%
  group_by(main_topic) %>%
  top_n(15, n) %>%
  ungroup()

bigrams_2 <- readRDS("bigrams_2.rds")

# Plotting the bigrams
ggplot(bigrams_2, aes(x = reorder(bigram, n), y = n)) +
  geom_col(fill = '#EA906C') +
  facet_wrap(~main_topic, scales = "free_y") +
  coord_flip() +
  labs(x = "", y = "",
       title = "Popular Bigrams per Topic")

Sentiment Analysis per Topic 2

Regarding the positive/negative proportion in topics 1 and 2, both lean slightly towards negative vocabulary, and the proportions are equivalent.

sentiment_prop_topic_2 <- tidy_books_topic_2 %>%
  inner_join(get_sentiments("bing")) %>%
  group_by(main_topic) %>%
  dplyr::summarise(
    total_words = n(),
    positive = sum(sentiment == "positive")/ total_words,
    negative = sum(sentiment == "negative")/ total_words
  ) %>%
  gather(sentiment, proportion, -main_topic) %>% 
  spread(sentiment, proportion)

ggplot(
  sentiment_prop_topic_2, aes(x = reorder(main_topic, -negative), y=negative)
) + 
  geom_col(show.legend=FALSE, fill='#B31312') +
  geom_text(
    aes(label = scales::percent(negative,accuracy = 1)),
    position = position_stack(vjust = 0.95),
    show.legend = FALSE,
    color = "#EEE2DE"
  ) +
  coord_flip() +
  scale_y_continuous(label = scales::percent) +
  labs(
    title = "Negative Sentiment by Genre",
    x = "Genre",
    y = "Negative Sentiment"
  )

As expected based on the analysis made before, the notable contrast between topics 1 and 2 is observed in sentiments joy and fear. However while differences are noticeable, they lack significant magnitude when taking into consideration all emotions.

# NRC 
nrc_lexicon <- get_sentiments("nrc")

nrc_sentiment_2topics <- tidy_books_topic_2 %>%
  inner_join(get_sentiments("nrc")) %>%
  group_by(main_topic) %>%
  dplyr::summarise(
    total_words = n(),
    positive = sum(sentiment == "positive")/ total_words,
    negative = sum(sentiment == "negative")/ total_words,
    anger = sum(sentiment == "anger")/ total_words,
    anticipation = sum(sentiment == "anticipation")/ total_words,
    disgust = sum(sentiment == "disgust")/ total_words,
    fear = sum(sentiment == "fear")/ total_words,
    joy = sum(sentiment == "joy")/ total_words,
    sadness = sum(sentiment == "sadness")/ total_words,
    surprise = sum(sentiment == "surprise")/ total_words,
    trust = sum(sentiment == "trust")/ total_words
  ) %>%
  gather(sentiment, proportion, -main_topic) %>% 
  spread(sentiment, proportion)


nrc_sentiment_2topics_long <- pivot_longer(nrc_sentiment_2topics, cols = c(anger, anticipation, disgust, fear, joy, negative, positive, sadness, surprise, trust), names_to = "sentiment", values_to = "score")

# sentiment plot with 10 different emotions
ggplot(nrc_sentiment_2topics_long, aes(x = main_topic, y = score, fill = main_topic)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~sentiment, scales = "free_y") +
  scale_y_continuous(label = scales::percent) +
  scale_x_continuous(breaks = seq(min(nrc_sentiment_2topics_long$main_topic),   max(nrc_sentiment_2topics_long$main_topic), 1)) +
  labs(title = "Sentiment Distribution by Genre",
       x = "Main Topic",
       y = "") +
  theme(legend.position = "none")

Summary

In the course of our text analysis, our aim was to uncover the presence of romance within Wattpad books by scrutinizing the emotions depicted in the stories and identifying popular phrases, irrespective of their official genre.

Through the application of topic modeling and sentiment analysis, our investigation revealed a noteworthy revelation: Despite the diverse genre labels, including mystery, romance, or adventure, noticeable differences between genres were lacking. This trend extended to the most frequently used words, bigrams, and sentiments, where books exhibited similar patterns. While many popular words revolved around relationships, expressions, and emotions, we did not observe a distinct focus on love.

In essence, our analysis suggests that genres do not significantly differentiate books. Examined books appear to share common ground in terms of vocabulary and emphasis on human relationships. However, we were unable to substantiate a predominant emphasis on love stories within these genres.

Wattpad - Books analysis

Sentiment Analysis and Topic Modelling

Oliwia Iwańska

2024-02-18 16:12:55

Introduction

Loading and Cleaning Data

Overview from the Data

Distribution of word counts across different genres

Distribution of popularity, ranking score and reads for all books together

Most Popular Words in Genres

Most Popular Bigrams in Genres

Sentiment Analysis

Positive/Negative Sentiment per genre

Topic Modelling

Topic Modelling per book

4 Topics

Plotting Popular Words per Topic - 4 Topics

Bigrams

Sentiment Analysis per topic - 4 Topics

2 Topics

Plotting Popular Words per Topic - 2 Topics

Bigrams

Sentiment Analysis per Topic 2

Summary