Lab 9: Graphs and Social Networks in Pride & Prejudice

Author

Amanda R. Knudsen

Overview

In the this lab assignment, you are going to construct a social network from the characters in the book “Pride & Prejudice”, a novel written by Jane Austen and available in the janeaustenr package. The social network will be a weighted graph connecting the characters, where the weight is equal to the number of times the names of each character appeared in each 10 line section of the book. Once you create the graph, you will load it into tidygraph, make a visualization of the graph, and rank the most connected characters by a measure called degree centrality

Problem 1

Load the text of Pride & Prejudice into R using the janeaustenr library. Then download and read pride_prejudice_characters.csv, the csv file from my github page containing a list of characters in Pride & Prejudice and their aliases. Here aliases refers to the different names that the characters go by in the books, for example “Darcy” also goes by the names “Mr. Darcy”, and “Mr. Fitzwilliam Darcy” (not to be confused with his cousin “Colonel Fitzwilliam”).

pride_prejudice_text <- austen_books() |> 
  filter(book == "Pride & Prejudice") |> select(text)

pride_prejudice_text

# A tibble: 13,030 × 1
   text                                                                     
   <chr>                                                                    
 1 "PRIDE AND PREJUDICE"                                                    
 2 ""                                                                       
 3 "By Jane Austen"                                                         
 4 ""                                                                       
 5 ""                                                                       
 6 ""                                                                       
 7 "Chapter 1"                                                              
 8 ""                                                                       
 9 ""                                                                       
10 "It is a truth universally acknowledged, that a single man in possession"
# ℹ 13,020 more rows

pride_prejudice_chars = read_csv("Lab09_Data/pride_prejudice_characters.csv")

Rows: 41 Columns: 3
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (2): unique_name, alias
dbl (1): id

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

pride_prejudice_chars

# A tibble: 41 × 3
      id unique_name     alias            
   <dbl> <chr>           <chr>            
 1     1 ElizabethBennet Elizabeth Bennet 
 2     1 ElizabethBennet Elizabeth        
 3     1 ElizabethBennet Miss Bennet      
 4     1 ElizabethBennet Miss Lizzy       
 5     1 ElizabethBennet Lizzy            
 6     1 ElizabethBennet Eliza Bennet     
 7     2 MrDarcy         Fitzwilliam Darcy
 8     2 MrDarcy         Darcy            
 9     3 MrBennet        Mr. Bennet       
10     4 MrsBennet       Mrs. Bennet      
# ℹ 31 more rows

Process the text of Pride & Prejudice to replace instances where an alias occurs with the full name of the character- I recommend using the iteration techniques you learned earlier, I arranged the order of names in the csv file to minimize misidentifications if you replace names in the order that they appear in the file. Making this perfect would require a bit of effort but we are ok if there are some misidentifications. Here the final name of each character will be a single word.

pride_prejudice_processed <- pride_prejudice_text |> 
  mutate(
    text = {
      updated_text <- text
      
      for (i in seq_len(nrow(pride_prejudice_chars))) {
        alias <- pride_prejudice_chars$alias[i]       
        unique_name <- pride_prejudice_chars$unique_name[i] 

        updated_text <- str_replace_all(
          string = updated_text, 
          pattern = regex(paste0("\\b", alias, "\\b"), ignore_case = TRUE),
          replacement = unique_name
        )
      }
      
      updated_text
    }
  )

pride_prejudice_processed

# A tibble: 13,030 × 1
   text                                                                     
   <chr>                                                                    
 1 "PRIDE AND PREJUDICE"                                                    
 2 ""                                                                       
 3 "By JaneBennet Austen"                                                   
 4 ""                                                                       
 5 ""                                                                       
 6 ""                                                                       
 7 "Chapter 1"                                                              
 8 ""                                                                       
 9 ""                                                                       
10 "It is a truth universally acknowledged, that a single man in possession"
# ℹ 13,020 more rows

It looks like even the “Jane” in Jane Austen has become transformed to JaneBennet, but I guess this is by design.

Problem 2

Following the example in chapter 4 of the text mining with R book, create a new column in the data frame corresponding to the Pride & Prejudice text that divides the text into sections of 10 lines each. Then use the pairwise_count function from widyr to determine the number of times each name occurs with each other name in the same 10 line section.

pride_prejudice_sections <- pride_prejudice_processed |> 
  mutate(section = (row_number() - 1) %/% 10 + 1)

pride_prejudice_tokens <- pride_prejudice_sections |> 
  unnest_tokens(word, text)

character_names <- pride_prejudice_chars |> 
  mutate(unique_name = str_to_lower(unique_name)) |> 
  pull(unique_name)

pride_prejudice_tokens <- pride_prejudice_tokens |> 
  mutate(word = str_to_lower(word))

pride_prejudice_names <- pride_prejudice_tokens |> 
  filter(word %in% character_names)

name_cooccurrences <- pride_prejudice_names |> 
  pairwise_count(word, section, sort = TRUE)

name_cooccurrences

# A tibble: 292 × 3
   item1           item2               n
   <chr>           <chr>           <dbl>
 1 mrdarcy         elizabethbennet   150
 2 elizabethbennet mrdarcy           150
 3 elizabethbennet janebennet        134
 4 janebennet      elizabethbennet   134
 5 elizabethbennet mrbingley          72
 6 mrbingley       elizabethbennet    72
 7 mrdarcy         mrbingley          58
 8 mrbingley       mrdarcy            58
 9 mrwickham       elizabethbennet    56
10 elizabethbennet mrwickham          56
# ℹ 282 more rows

Problem 3

Create a dataframe of nodes which contains the id and unique names of each character, and create a dataframe of edges which contains three columns: a column named from, a column named to, and a column named weight, where the from and to are the id numbers of each character and weight is the number of co-occurences you found in Problem 2. Each pair should only appear once in the edge list (i.e. Elizabeth and MrDarcy but not MrDarcy and then Elizabeth). Create a tidygraph object using tbl_graph that contains the social network data that we just constructed.

nodes <- name_cooccurrences  |> 
  distinct(item1)  |> 
  mutate(node_id = row_number()) |> 
  rename(unique_name = item1)

nodes

# A tibble: 20 × 2
   unique_name        node_id
   <chr>                <int>
 1 mrdarcy                  1
 2 elizabethbennet          2
 3 janebennet               3
 4 mrbingley                4
 5 mrwickham                5
 6 mrsbennet                6
 7 mrcollins                7
 8 missbingley              8
 9 lydia                    9
10 kitty                   10
11 ladycatherine           11
12 charlotte               12
13 mrsgardiner             13
14 mrbennet                14
15 marybennet              15
16 sirwilliam              16
17 mrgardiner              17
18 colonelfitzwilliam      18
19 georgiana               19
20 ladylucas               20

edges <- name_cooccurrences  |> 
  rename(weight = n) |> 
  left_join(nodes, by = c("item1" = "unique_name")) |> 
  rename(from = node_id) |> 
  left_join(nodes, by = c("item2" = "unique_name")) |> 
  rename(to = node_id) |> 
  select(from, to, weight) |> 
  filter(!is.na(from), !is.na(to))

graph <- tbl_graph(nodes = nodes |> 
                     select(unique_name), edges = edges, directed = FALSE)

graph

# A tbl_graph: 20 nodes and 292 edges
#
# An undirected multigraph with 1 component
#
# Node Data: 20 × 1 (active)
   unique_name       
   <chr>             
 1 mrdarcy           
 2 elizabethbennet   
 3 janebennet        
 4 mrbingley         
 5 mrwickham         
 6 mrsbennet         
 7 mrcollins         
 8 missbingley       
 9 lydia             
10 kitty             
11 ladycatherine     
12 charlotte         
13 mrsgardiner       
14 mrbennet          
15 marybennet        
16 sirwilliam        
17 mrgardiner        
18 colonelfitzwilliam
19 georgiana         
20 ladylucas         
#
# Edge Data: 292 × 3
   from    to weight
  <int> <int>  <dbl>
1     1     2    150
2     1     2    150
3     2     3    134
# ℹ 289 more rows

I am seeing duplicates here so the below is an effort to fix…

edges <- edges |> 
  group_by(from, to) |> 
  summarise(weight = sum(weight), .groups = "drop")

edges

# A tibble: 292 × 3
    from    to weight
   <int> <int>  <dbl>
 1     1     2    150
 2     1     3     44
 3     1     4     58
 4     1     5     33
 5     1     6     11
 6     1     7     12
 7     1     8     36
 8     1     9      8
 9     1    10      6
10     1    11     15
# ℹ 282 more rows

graph <- tbl_graph(nodes = nodes |> 
                     select(unique_name), edges = edges, directed = FALSE)
graph

# A tbl_graph: 20 nodes and 292 edges
#
# An undirected multigraph with 1 component
#
# Node Data: 20 × 1 (active)
   unique_name       
   <chr>             
 1 mrdarcy           
 2 elizabethbennet   
 3 janebennet        
 4 mrbingley         
 5 mrwickham         
 6 mrsbennet         
 7 mrcollins         
 8 missbingley       
 9 lydia             
10 kitty             
11 ladycatherine     
12 charlotte         
13 mrsgardiner       
14 mrbennet          
15 marybennet        
16 sirwilliam        
17 mrgardiner        
18 colonelfitzwilliam
19 georgiana         
20 ladylucas         
#
# Edge Data: 292 × 3
   from    to weight
  <int> <int>  <dbl>
1     1     2    150
2     1     3     44
3     1     4     58
# ℹ 289 more rows

Deduplication worked!

Problem 4

Using ggraph, graph the connections between the characters. Make sure that each node is labeled by the character name, and make sure that the weight is represented by the thickness of the edge plotted between the two nodes. Then use the centrality_degree function to calculate the weighted degree centrality of each character, and make a plot which shows the degree centrality of each character where the characters are arranged in order of degree centrality.

graph_plot <- ggraph(graph, layout = "stress") +
  geom_edge_link(aes(width = weight), alpha = 0.4, color = "lavender") +  
  geom_node_point(size = 2, color = "purple") +  
  geom_node_text(aes(label = unique_name), repel = TRUE, size = 4) + 
  labs(title = "Connections between Characters in Pride & Prejudice") + 
  theme_graph() 

ggsave("graph_prideprejudice.png", plot = graph_plot, 
       width = 10, height = 7, dpi = 300)

# graph_plot

I saved the output of the graph and commented-out the rendering of the graph because my Rstudio kept running into errors when I tried to render this as a PDF. So, below, the image is shown of the graph that the plot creates. If you un-comment the graph_plot this will run the actual graph inline.

degree_centrality <- graph |> 
  activate(nodes) |> 
  mutate(degree_centrality = centrality_degree(weights = weight)) |> 
  as_tibble()

degree_centrality |> 
  arrange(desc(degree_centrality))  |> 
  ggplot(aes(x = reorder(unique_name, degree_centrality), y = degree_centrality)) +
  geom_col(fill = "purple") +  
  coord_flip() +                  
  labs(
    title = "Degree Centrality of Characters in Pride & Prejudice",
    x = "Character",
    y = "Degree Centrality"
  ) +
  theme_minimal() # looks better without a grey background!

This is no surprise based on what we saw in the graph representation of character connections: Elizabeth Bennet is the most central of all.