ANLY540 - Analysis of Human Language

Getting Set Up

We are going to analyze the text included in the harrypotter package. The package is installed from GitHub, so you might need to install devtools first.

# install.packages('devtools') # uncomment and run once if you need it, then add the # back so it does not run when you knit
# devtools::install_github("bradleyboehmke/harrypotter") # same thing here

Load your libraries!

library(dplyr)
library(harrypotter)
library(tidytext)
library(tidyr)
library(widyr)

The data

The data is the text in the Harry Potter novels for the following:

philosophers_stone: Harry Potter and the Philosopher's Stone (1997)
chamber_of_secrets: Harry Potter and the Chamber of Secrets (1998)
prisoner_of_azkaban: Harry Potter and the Prisoner of Azkaban (1999)
goblet_of_fire: Harry Potter and the Goblet of Fire (2000)
order_of_the_phoenix: Harry Potter and the Order of the Phoenix (2003)
half_blood_prince: Harry Potter and the Half-Blood Prince (2005)
deathly_hallows: Harry Potter and the Deathly Hallows (2007)

You can pick any book to analyze! When you use a book, the data is structured such that each chapter is a row and there’s one column in the data. These are giant text blocks, much like the descriptions in the NASA dataset you just learned about. Depending on the book you select, you might first need to coerce the data into a tibble/dataframe. Then be sure to add a chapter id column so you can keep the chapter number as an id variable.

data("philosophers_stone")
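# the book holds one text block per chapter; make it a one-column dataframe
data <- as.data.frame(philosophers_stone)
# the row names are the chapter numbers, so keep them as an id column
data$chapter <- rownames(data)
rownames(data) <- NULL
# make sure the text column is character rather than factor
data$philosophers_stone <- as.character(data$philosophers_stone)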

I chose to analyze the book ‘Harry Potter and the Philosopher's Stone (1997)’.
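
As an alternative to the coercion steps above, the same shape (one row per chapter plus a chapter id) can be built directly as a tibble. This is only a sketch: toy_book is a made-up stand-in for a book object, assumed to be a character vector with one string per chapter, and the column name text is my own choice.

```r
library(dplyr)

# Hypothetical stand-in for a book: one string per chapter (not the real text)
toy_book <- c("The boy who lived.",
              "The vanishing glass.",
              "The letters from no one.")

# One row per chapter, with an integer chapter id next to the text
toy_data <- tibble(
  chapter = seq_along(toy_book),
  text    = toy_book
)
toy_data
```

Keeping the chapter id as an integer rather than a character row name means the chapters sort in reading order if you ever arrange by chapter.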

Clean up the data

In this section, you want to create a tibble/dataframe of the individual words from your book. Try using unnest_tokens and anti_join to create a unigram list of words without stopwords included.

The top rows of the cleaned up dataframe are below:

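# one row per word; unnest_tokens lowercases the text and strips punctuation
data_ut <- data %>% 
  unnest_tokens(word, philosophers_stone)
# drop stop words using the stop_words table from tidytext
data_aj <- data_ut %>% anti_join(stop_words, by = "word")
head(data_aj)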
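
To make the two steps above easier to follow, here is a small sketch on made-up sentences (the toy tibble and its text column are hypothetical, not taken from the book). It shows that unnest_tokens gives one lowercase word per row and keeps the chapter id, and that anti_join removes stop words such as "the".

```r
library(dplyr)
library(tidytext)

# Hypothetical two-chapter example, not the real text
toy <- tibble(
  chapter = 1:2,
  text    = c("The owl delivered the letter.", "Harry read the letter.")
)

toy_words <- toy %>%
  unnest_tokens(word, text) %>%        # one row per word, lowercased, punctuation removed
  anti_join(stop_words, by = "word")   # remove stop words such as "the"

toy_words
```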

Simple statistics

In this section, use the count function to determine the most frequent words used in Harry Potter that are not stopwords.

The most common words used in the book are below:

data_aj %>% 
  count(word, sort = TRUE)
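
As a quick illustration of what count does here, the sketch below runs it on a made-up word column (toy_words is hypothetical). With sort = TRUE the most frequent word comes first, and the counts are stored in a column named n.

```r
library(dplyr)

# Hypothetical word list, not from the book
toy_words <- tibble(word = c("harry", "ron", "harry", "owl", "harry", "ron"))

toy_counts <- toy_words %>%
  count(word, sort = TRUE)   # one row per distinct word, largest n first

toy_counts
```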

Collocates clean up

Create a tibble/dataframe that includes the collocate pairs in the Harry Potter book you picked using pairwise_count.

The top rows of the dataframe are below:

# n is the number of chapters in which both words of a pair appear
word_pairs <- data_aj %>% 
  pairwise_count(word, chapter, sort = TRUE, upper = FALSE)
head(word_pairs)
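
It helps to see pairwise_count on a tiny example. In the sketch below (made-up data, toy_words is hypothetical), harry and ron appear together in two chapters, so their pair gets n = 2. A word that shows up several times in one chapter still counts only once for that chapter, and upper = FALSE keeps each pair once instead of in both orders.

```r
library(dplyr)
library(widyr)

# Hypothetical chapter/word table, not from the book
toy_words <- tibble(
  chapter = c(1, 1, 1, 2, 2, 3),
  word    = c("harry", "ron", "owl", "harry", "ron", "harry")
)

toy_pairs <- toy_words %>%
  pairwise_count(word, chapter, sort = TRUE, upper = FALSE)

toy_pairs   # columns: item1, item2, n
```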

Create a network plot

Create a network plot of the collocates. Remember that you can change the cutoff in filter(n >= XX) to a number that keeps a lot of the data but filters out the many smaller combinations.

library(ggplot2)
library(igraph)
library(ggraph)

set.seed(12345)
word_pairs %>%
  filter(n >= 14) %>%
  graph_from_data_frame() %>%
  ggraph(layout = "fr") + 
  geom_edge_link(aes(edge_alpha = n, edge_width = n), edge_colour = "purple") +
  geom_node_point(size = 5) +
  geom_node_text(aes(label = name), repel = TRUE, 
                 point.padding = unit(0.2, "lines")) +
  theme_void()
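
The step in the pipeline above that turns the filtered pairs into a graph can be looked at on its own. In this sketch (toy_pairs is a made-up table), graph_from_data_frame treats the first two columns as the two ends of each edge and stores any remaining columns, here n, as edge attributes. That is why aes(edge_width = n) works in the plot.

```r
library(igraph)

# Hypothetical pair counts, not from the book
toy_pairs <- data.frame(
  item1 = c("harry", "harry", "ron"),
  item2 = c("ron", "owl", "owl"),
  n     = c(2, 1, 1)
)

g <- graph_from_data_frame(toy_pairs)

vcount(g)    # number of distinct words (nodes)
ecount(g)    # number of pairs (edges)
E(g)$n       # the n column carried along as an edge attribute
V(g)$name    # node labels used by geom_node_text
```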

Strongest pairs

Create a correlation tibble/dataframe of the strongest pairs from your book.

The top rows of the correlation dataframe are below:

keyword_cors <- data_aj %>% 
  group_by(word) %>%
  filter(n() >= 50) %>%   # keep only words that occur at least 50 times in the book
  pairwise_cor(word, chapter, sort = TRUE, upper = FALSE)
head(keyword_cors)
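
To get a feel for the number pairwise_cor returns, the sketch below uses made-up data (toy_words is hypothetical). Each word is treated as a 0/1 indicator of whether it appears in each chapter, and with the default settings the correlation should match the ordinary Pearson correlation of those two indicator vectors, which the last line computes by hand for comparison.

```r
library(dplyr)
library(widyr)

# Hypothetical chapter/word table: harry in chapters 1-3, ron in 1-2, owl in 1 and 4
toy_words <- tibble(
  chapter = c(1, 1, 1, 2, 2, 3, 4),
  word    = c("harry", "ron", "owl", "harry", "ron", "harry", "owl")
)

toy_cors <- toy_words %>%
  pairwise_cor(word, chapter, sort = TRUE, upper = FALSE)

toy_cors   # columns: item1, item2, correlation

# Hand check for harry vs ron: presence/absence across chapters 1 to 4
by_hand <- cor(c(1, 1, 1, 0), c(1, 1, 0, 0))
by_hand
```

Unlike the raw counts from pairwise_count, this measure also takes into account the chapters where only one of the two words appears, so a very common word is not automatically paired strongly with everything.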

Visualize the pairs

Include a network plot of the correlation data, and you can change the correlation cut off to create the best visualization of the data.

Below is the network plot of the correlation data, showing the strongest pairs (correlation > 0.6).

keyword_cors %>%
  filter(correlation > .6) %>%
  graph_from_data_frame() %>%
  ggraph(layout = "fr") +
  geom_edge_link(aes(edge_alpha = correlation, edge_width = correlation), edge_colour = "blue") +
  geom_node_point(size = 5) +
  geom_node_text(aes(label = name), repel = TRUE,
                 point.padding = unit(0.2, "lines")) +
  theme_void()

Interpretation

What do the simple statistics and network plots tell you about the book you selected? Interpret your output in a few sentences summarizing your visualizations.

The most common words in the book are character names and titles (Harry, Ron, Hagrid, Hermione, Professor, Dumbledore, Snape), along with other words such as looked, uncle, and time.
The collocates network plot shows the word pairs that occur together in the most chapters. Words in the central part of the network graph are the ones that belong to many collocate pairs. Harry’s perception, in terms of his sight, voice, and hearing, forms a central theme in how J.K. Rowling describes events in the book.
The correlation network plot shows pairs of words with high correlations. Hermione seems to be highly correlated with other characters such as Ron, Snape, Neville, and Malfoy, and with her house, Gryffindor. The other characters also seem to be highly correlated among themselves. The word Professor seems to be highly correlated with Dumbledore, feet, and pulled. Petunia (the name of Harry's aunt) seems to be highly correlated with aunt, Vernon (her husband), uncle, Dudley (her son), and boy (how the family referred to Harry).

ANLY540 - Analysis of Human Language - Executive Session 2

Suraj Kumaran

6/17/2019