We are going to use a Harry Potter package to analyze the text it includes. You might need to install devtools first in order to install this package.
The data is the text in the Harry Potter novels for the following:
You can pick any book to analyze! Each book is structured so that each chapter is a row and there is a single column of text. These are giant text blocks, much like the descriptions in the NASA dataset you just learned about. Depending on the book you select, you might first need to coerce the data into a tibble/dataframe to get started. Then be sure to add a chapter id column so you can keep the chapter number as an id variable.
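# Load the book text and coerce it to a data frame with one row per chapter
data("philosophers_stone")
data <- as.data.frame(philosophers_stone)

# Keep the chapter number (taken from the row names) as an id column
data$chapter <- rownames(data)
rownames(data) <- NULL

# Make sure the text column is character rather than factor
data$philosophers_stone <- as.character(data$philosophers_stone)

I chose to analyze the book ‘Harry Potter and the Philosopher’s Stone’ (1997).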
In this section, you want to create a tibble/dataframe of the individual words from your book. Try using unnest_tokens and anti_join to create a unigram list of words without stopwords included.
The top rows of the cleaned up dataframe are below:
library(dplyr)
library(tidytext)

data_ut <- data %>%
  unnest_tokens(word, philosophers_stone)

# Remove stopwords by matching on the shared word column
data_aj <- data_ut %>%
  anti_join(stop_words, by = "word")

head(data_aj)

In this section, use the count function to determine the most frequent words used in Harry Potter that are not stopwords.
The most common words used in the book are below:
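The counting step can be sketched as follows on a small, hypothetical stand-in for `data_aj` (the `toy_words` tibble and the `word_counts` name are invented for illustration and are not the book text):

```r
library(dplyr)

# Hypothetical stand-in for data_aj: one row per non-stopword token
toy_words <- tibble(
  chapter = c("1", "1", "1", "2", "2", "2"),
  word    = c("harry", "owl", "harry", "harry", "wand", "owl")
)

# count() tallies the rows for each word; sort = TRUE lists the most frequent first
word_counts <- toy_words %>%
  count(word, sort = TRUE)

head(word_counts)
```

On the book data, the same `count(word, sort = TRUE)` call would be applied to `data_aj`.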
Create a tibble/dataframe that includes the collocate pairs in the Harry Potter book you picked using pairwise_count.
The top rows of the dataframe are below:
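A minimal sketch of building collocate pairs with `pairwise_count` from the widyr package, using a hypothetical `toy_words` tibble rather than the book text. It assumes the chapter is the unit within which two words count as co-occurring, and the name `word_pairs` is chosen to match the object used in the network plot code:

```r
library(dplyr)
library(widyr)

# Hypothetical stand-in for data_aj: one row per word per chapter
toy_words <- tibble(
  chapter = c("1", "1", "1", "2", "2", "3", "3"),
  word    = c("harry", "ron", "owl", "harry", "ron", "harry", "owl")
)

# Count how many chapters each pair of words shares.
# upper = FALSE keeps each pair once instead of in both orders.
word_pairs <- toy_words %>%
  pairwise_count(word, chapter, sort = TRUE, upper = FALSE)

head(word_pairs)
```

The result has the columns `item1`, `item2`, and `n`, which is the shape `graph_from_data_frame()` expects for an edge list with a weight column.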
Create a network plot of the collocates - remember you can change the n > XX to a number that keeps a lot of the data, but filters out a lot of the smaller combinations.
library(ggplot2)
library(igraph)
library(ggraph)
set.seed(12345)
word_pairs %>%
  filter(n >= 14) %>%
  graph_from_data_frame() %>%
  ggraph(layout = "fr") +
  geom_edge_link(aes(edge_alpha = n, edge_width = n), edge_colour = "purple") +
  geom_node_point(size = 5) +
  geom_node_text(aes(label = name), repel = TRUE,
                 point.padding = unit(0.2, "lines")) +
  theme_void()

Create a correlation tibble/dataframe of the strongest pairs from your book.
The top rows of the correlation dataframe are below:
library(widyr)

# Keep words that appear at least 50 times, then correlate them across chapters
keyword_cors <- data_aj %>%
  group_by(word) %>%
  filter(n() >= 50) %>%
  pairwise_cor(word, chapter, sort = TRUE, upper = FALSE)

head(keyword_cors)

Include a network plot of the correlation data, and you can change the correlation cut off to create the best visualization of the data.
Below is the network plot of the correlation data showing the strongest pairs with correlation > 0.6.
keyword_cors %>%
  filter(correlation > .6) %>%
  graph_from_data_frame() %>%
  ggraph(layout = "fr") +
  geom_edge_link(aes(edge_alpha = correlation, edge_width = correlation), edge_colour = "blue") +
  geom_node_point(size = 5) +
  geom_node_text(aes(label = name), repel = TRUE,
                 point.padding = unit(0.2, "lines")) +
  theme_void()

What do the simple statistics and network plots tell you about the book you selected? Interpret your output in a few sentences summarizing your visualizations.