Overview

In this lab assignment, you are going to construct a social network from the characters in the book “Pride & Prejudice”, a novel written by Jane Austen and available in the `janeaustenr` package. The social network will be a weighted graph connecting the characters, where the weight of each edge is the number of 10-line sections of the book in which the names of both characters appear. Once you create the graph, you will load it into tidygraph, make a visualization of the graph, and rank the most connected characters by a measure called “degree centrality”.

Problem 1

Load the text of Pride & Prejudice into R using the `janeaustenr` library. Then download and read pride_prejudice_characters.csv, the CSV file from my GitHub page containing a list of characters in Pride & Prejudice and their aliases.

Here, aliases refers to the different names that the characters go by in the book. For example, “Darcy” also goes by “Mr. Darcy” and “Mr. Fitzwilliam Darcy” (not to be confused with his cousin “Colonel Fitzwilliam”).

Process the text of Pride & Prejudice to replace each occurrence of an alias with the full name of the character. I recommend using the iteration techniques you learned earlier. I arranged the order of names in the `csv` file to minimize misidentifications if you replace names in the order that they appear in the file. Making this perfect would require a bit of effort, but it is fine if there are some misidentifications. The final name of each character will be a single word.

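library(tidyverse)
library(janeaustenr)

# Assumed setup (not shown in the original): the novel as a one-column tibble,
# and the alias table read from a local copy of the CSV
ppc <- tibble(prideprejudice = prideprejudice)
characters <- read_csv("pride_prejudice_characters.csv")

#removing blank lines
ppc <- ppc %>%
  filter(!str_detect(prideprejudice, pattern = "^\\s*$")) %>%
  print()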
## # A tibble: 10,721 × 1
##    prideprejudice                                                               
##    <chr>                                                                        
##  1 "PRIDE AND PREJUDICE"                                                        
##  2 "By Jane Austen"                                                             
##  3 "Chapter 1"                                                                  
##  4 "It is a truth universally acknowledged, that a single man in possession"    
##  5 "of a good fortune, must be in want of a wife."                              
##  6 "However little known the feelings or views of such a man may be on his"     
##  7 "first entering a neighbourhood, this truth is so well fixed in the minds"   
##  8 "of the surrounding families, that he is considered the rightful property"   
##  9 "of some one or other of their daughters."                                   
## 10 "\"My dear Mr. Bennet,\" said his lady to him one day, \"have you heard that"
## # ℹ 10,711 more rows
#new column for changes
ppc <- ppc %>% 
  mutate(new_col= prideprejudice)

# Replace aliases in file order; fixed = TRUE matches each alias as a literal string
for (i in seq_along(characters$alias)) {
  ppc <- ppc %>%
    mutate(new_col = gsub(pattern = characters$alias[i],
      replacement = characters$unique_name[i],
      x = new_col,
      fixed = TRUE))
}
print(ppc)
## # A tibble: 10,721 × 2
##    prideprejudice                                                        new_col
##    <chr>                                                                 <chr>  
##  1 "PRIDE AND PREJUDICE"                                                 "PRIDE…
##  2 "By Jane Austen"                                                      "By Ja…
##  3 "Chapter 1"                                                           "Chapt…
##  4 "It is a truth universally acknowledged, that a single man in posses… "It is…
##  5 "of a good fortune, must be in want of a wife."                       "of a …
##  6 "However little known the feelings or views of such a man may be on … "Howev…
##  7 "first entering a neighbourhood, this truth is so well fixed in the … "first…
##  8 "of the surrounding families, that he is considered the rightful pro… "of th…
##  9 "of some one or other of their daughters."                            "of so…
## 10 "\"My dear Mr. Bennet,\" said his lady to him one day, \"have you he… "\"My …
## # ℹ 10,711 more rows

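Some names came out as “MrMrBingley”, probably because a shorter alias such as “Bingley” matched again inside the already-substituted “MrBingley”. I’ll patch that up here.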

ppc<- ppc %>% 
  mutate(new_col= 
           gsub(x = new_col, pattern = "MrMrBingley",
                replacement = "MrBingley")) %>% 
  print()
## # A tibble: 10,721 × 2
##    prideprejudice                                                        new_col
##    <chr>                                                                 <chr>  
##  1 "PRIDE AND PREJUDICE"                                                 "PRIDE…
##  2 "By Jane Austen"                                                      "By Ja…
##  3 "Chapter 1"                                                           "Chapt…
##  4 "It is a truth universally acknowledged, that a single man in posses… "It is…
##  5 "of a good fortune, must be in want of a wife."                       "of a …
##  6 "However little known the feelings or views of such a man may be on … "Howev…
##  7 "first entering a neighbourhood, this truth is so well fixed in the … "first…
##  8 "of the surrounding families, that he is considered the rightful pro… "of th…
##  9 "of some one or other of their daughters."                            "of so…
## 10 "\"My dear Mr. Bennet,\" said his lady to him one day, \"have you he… "\"My …
## # ℹ 10,711 more rows
ppc <- ppc %>% 
  select(-prideprejudice)
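
One plausible source of doubled names like “MrMrBingley” is that literal (`fixed = TRUE`) matching lets a short alias match again inside a name that has already been substituted. The sketch below reproduces this on a made-up alias table (not the lab's CSV) and shows one possible way around it: wrapping each alias in word boundaries, with `\Q...\E` quoting under `perl = TRUE`. The helper name `replace_aliases` and its `whole_word` argument are hypothetical.

```r
# Toy alias table (made up for illustration; the lab's CSV may differ)
aliases <- data.frame(
  alias       = c("Mr. Fitzwilliam Darcy", "Colonel Fitzwilliam", "Mr. Darcy", "Darcy"),
  unique_name = c("MrDarcy", "ColonelFitzwilliam", "MrDarcy", "MrDarcy"),
  stringsAsFactors = FALSE
)
lines <- c("Mr. Darcy bowed to Colonel Fitzwilliam.", "Darcy said nothing.")

# Hypothetical helper: apply the replacements in table order.
# whole_word = FALSE reproduces literal matching; TRUE adds \b around each alias.
replace_aliases <- function(text, alias, unique_name, whole_word = FALSE) {
  for (i in seq_along(alias)) {
    if (whole_word) {
      # \Q...\E quotes the alias so "." is literal; \b requires a word boundary
      pattern <- paste0("\\b\\Q", alias[i], "\\E\\b")
      text <- gsub(pattern, unique_name[i], text, perl = TRUE)
    } else {
      text <- gsub(alias[i], unique_name[i], text, fixed = TRUE)
    }
  }
  text
}

naive   <- replace_aliases(lines, aliases$alias, aliases$unique_name)
bounded <- replace_aliases(lines, aliases$alias, aliases$unique_name, whole_word = TRUE)
print(naive)
print(bounded)
```

With literal matching, the first toy line ends up starting with “MrMrDarcy”, because “Darcy” is found inside the freshly written “MrDarcy”. With word boundaries it stays “MrDarcy”, since there is no boundary between “Mr” and “Darcy”.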

Problem 2

Following the example in Chapter 4 of the book Text Mining with R, create a new column in the data frame holding the Pride & Prejudice text that divides the text into sections of 10 lines each. Then use the `pairwise_count` function from `widyr` to count the number of 10-line sections in which each name occurs together with each other name.

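library(tidytext)
library(widyr)

# Number the lines in blocks of 10, then split each line into lowercase word tokens
ppc2 <- ppc %>%
  mutate(section = row_number() %/% 10) %>%
  filter(section > 0) %>%
  unnest_tokens(words, new_col)
# unnest_tokens lowercases the text, so lowercase the names before matching
characters <- characters %>%
  mutate(names = tolower(unique_name))
# Keep only tokens that are character names; names repeat in `characters`
# (one row per alias), hence the many-to-many relationship
filtered <- inner_join(x = ppc2, y = characters, by = c("words" = "names"), relationship = "many-to-many") %>%
  select(section, words) %>%
  print()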
## # A tibble: 7,929 × 2
##    section words    
##      <dbl> <chr>    
##  1       1 mrbennet 
##  2       1 mrbennet 
##  3       1 mrbennet 
##  4       2 mrbingley
##  5       2 mrbingley
##  6       2 mrbingley
##  7       3 mrbennet 
##  8       3 mrbingley
##  9       3 mrbingley
## 10       3 mrbingley
## # ℹ 7,919 more rows
pairwise <- filtered %>%
  pairwise_count(words,section, sort= TRUE) %>% 
  print()
## # A tibble: 236 × 3
##    item1           item2               n
##    <chr>           <chr>           <dbl>
##  1 mrdarcy         elizabethbennet   168
##  2 elizabethbennet mrdarcy           168
##  3 janebennet      elizabethbennet   135
##  4 elizabethbennet janebennet        135
##  5 elizabethbennet mrbingley          79
##  6 mrbingley       elizabethbennet    79
##  7 mrdarcy         mrbingley          63
##  8 mrbingley       mrdarcy            63
##  9 mrcollins       elizabethbennet    55
## 10 elizabethbennet mrcollins          55
## # ℹ 226 more rows
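
To make the counts above concrete, here is a minimal sketch of what a pairwise co-occurrence count means. It uses base R's `table()` and `crossprod()` instead of `widyr`, purely for illustration, and the `mentions` data frame is made up. It also shows why `row_number() %/% 10` leaves only nine lines in section 0.

```r
# Line numbers 1..25 grouped the same way as row_number() %/% 10
section <- seq_len(25) %/% 10
print(table(section))  # section 0 holds only lines 1 to 9

# Toy data (made up): one row per name mention, with its section number
mentions <- data.frame(
  section = c(1, 1, 1, 2, 2, 3, 3, 3),
  name    = c("elizabeth", "darcy", "darcy", "elizabeth", "jane",
              "elizabeth", "darcy", "jane"),
  stringsAsFactors = FALSE
)

# Presence/absence matrix: rows are sections, columns are names.
# Repeated mentions within a section collapse to a single 1.
present <- (unclass(table(mentions$section, mentions$name)) > 0) * 1

# crossprod(M) is t(M) %*% M: entry [a, b] is the number of sections
# that contain both a and b (the diagonal counts sections per name)
cooc <- crossprod(present)
print(cooc)
```

In this toy data, “darcy” and “elizabeth” share sections 1 and 3, so their count is 2 even though “darcy” is mentioned twice in section 1.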

Problem 3

Create a data frame of nodes that contains the id and unique name of each character, and create a data frame of edges that contains three columns named from, to, and weight, where from and to are the id numbers of the characters and weight is the number of co-occurrences you found in Problem 2. Each pair should appear only once in the edge list (i.e., Elizabeth and MrDarcy, but not also MrDarcy and Elizabeth). Create a tidygraph object using tbl_graph that contains the social network data that we just constructed.

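library(tidygraph)
# pairwise_count returns both orderings of every pair, so keep each pair once
edges <- pairwise %>%
  filter(item1 < item2) %>%
  rename(from = item1, to = item2, weight = n)
graph <- tbl_graph(edges = edges, directed = FALSE)  # character from/to become node names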
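
The problem statement asks for a node table with ids and an edge table that refers to those ids. A minimal sketch of one way that could look is below. The `pairs` data frame is a made-up stand-in shaped like `pairwise_count()` output, the tables are built with base R for self-containment, and the `tbl_graph()` call only runs if `tidygraph` is installed.

```r
# Toy symmetric pair counts (made up), shaped like pairwise_count() output
pairs <- data.frame(
  item1 = c("darcy", "elizabeth", "jane", "elizabeth"),
  item2 = c("elizabeth", "darcy", "elizabeth", "jane"),
  n     = c(168, 168, 135, 135),
  stringsAsFactors = FALSE
)

# Nodes: one row per character, with an integer id
node_names <- sort(unique(c(pairs$item1, pairs$item2)))
nodes <- data.frame(id = seq_along(node_names), name = node_names,
                    stringsAsFactors = FALSE)

# Edges: keep each unordered pair once, then swap the names for node ids
once <- pairs[pairs$item1 < pairs$item2, ]
edges <- data.frame(
  from   = match(once$item1, nodes$name),
  to     = match(once$item2, nodes$name),
  weight = once$n
)
print(nodes)
print(edges)

# Integer from/to are read as row positions in `nodes`
if (requireNamespace("tidygraph", quietly = TRUE)) {
  toy_graph <- tidygraph::tbl_graph(nodes = nodes, edges = edges, directed = FALSE)
  print(toy_graph)
}
```

Filtering on `item1 < item2` is just one convenient way to drop the mirrored rows; any rule that keeps exactly one ordering per pair would do.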

Problem 4

Using ggraph, graph the connections between the characters. Make sure that each node is labeled by the character name, and make sure that the weight is represented by the thickness of the edge plotted between the two nodes. Then use the centrality_degree function to calculate the weighted degree centrality of each character, and make a plot which shows the degree centrality of each character where the characters are arranged in order of degree centrality.

library(ggraph)

graph %>%
  ggraph(layout = "fr")+
  # edge thickness (and transparency) both follow the co-occurrence weight
  geom_edge_link(aes(edge_width = weight, edge_alpha = weight),
                 color = "#FA82A7", show.legend = FALSE) +
  scale_edge_width(range = c(0.2, 3)) +
  geom_node_point(size = 2, color = "#FA82A7")+
  geom_node_text(aes(label = name),
                 repel = TRUE, fontface = "italic",
                 size = 4, color = "#2e0614") +
  theme_void() +
  labs(title = "Character Co-occurrence Network in Pride and Prejudice")
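
The second half of Problem 4, the weighted degree centrality ranking, could be sketched as follows. The edge list here is a small made-up example, not the lab's result. Weighted degree is computed by hand first, as the sum of the weights of all edges touching a node, and the `tidygraph`/`ggplot2` version runs only if those packages are installed.

```r
# Toy edge list (made-up weights), one row per unordered pair
edges <- data.frame(
  from   = c("darcy", "elizabeth", "darcy"),
  to     = c("elizabeth", "jane", "bingley"),
  weight = c(168, 135, 63),
  stringsAsFactors = FALSE
)

# Weighted degree by hand: every edge contributes its weight to both endpoints
ends <- c(edges$from, edges$to)
wts  <- c(edges$weight, edges$weight)
degree_by_hand <- sort(sapply(split(wts, ends), sum), decreasing = TRUE)
print(degree_by_hand)

# The same quantity via tidygraph, plotted in order of centrality
if (requireNamespace("tidygraph", quietly = TRUE) &&
    requireNamespace("ggplot2", quietly = TRUE)) {
  library(tidygraph)
  library(ggplot2)
  ranked <- tbl_graph(edges = edges, directed = FALSE) %>%
    activate(nodes) %>%
    mutate(degree = centrality_degree(weights = weight)) %>%
    as_tibble()
  p <- ggplot(ranked, aes(x = degree, y = reorder(name, degree))) +
    geom_col() +
    labs(x = "Weighted degree centrality", y = NULL)
  print(p)
}
```

In the lab itself, the `tbl_graph(...)` call would be replaced by the `graph` object built in Problem 3.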