In this lab assignment, you are going to construct a social network from the characters in the book “Pride & Prejudice”, a novel written by Jane Austen and available in the `janeaustenr` package. The social network will be a weighted graph connecting the characters, where the weight of each edge is the number of 10-line sections of the book in which the names of both characters appear. Once you create the graph, you will load it into `tidygraph`, make a visualization of the graph, and rank the most connected characters by a measure called “degree centrality”.
Load the text of Pride & Prejudice into R using the `janeaustenr` library. Then download and read `pride_prejudice_characters.csv`, the CSV file from my GitHub page containing a list of the characters in Pride & Prejudice and their aliases.
Here, aliases refers to the different names that a character goes by in the book. For example, “Darcy” also goes by the names “Mr. Darcy” and “Mr. Fitzwilliam Darcy” (not to be confused with his cousin “Colonel Fitzwilliam”).
Process the text of Pride & Prejudice to replace each occurrence of an alias with the full name of the character. I recommend using the iteration techniques you learned earlier. I arranged the order of names in the `csv` file to minimize misidentifications, provided you replace names in the order that they appear in the file. Making this perfect would require a bit of effort, but we are OK with a few misidentifications. Here the final name of each character will be a single word.
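The code that follows assumes the novel is already stored in a one-column tibble called `ppc` and the character list in a data frame called `characters` with `alias` and `unique_name` columns. A minimal sketch of that setup, assuming the CSV has been saved to the working directory under the file name given above (the local path is an assumption, not something the assignment specifies):

```r
library(tibble)
library(janeaustenr)

# One row per line of the novel, in a column named prideprejudice
ppc <- tibble(prideprejudice = janeaustenr::prideprejudice)

# Assumption: the character list was downloaded to the working directory
csv_path <- "pride_prejudice_characters.csv"
if (file.exists(csv_path)) {
  # Expected to contain the columns alias and unique_name
  characters <- read.csv(csv_path, stringsAsFactors = FALSE)
}
```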
# Remove blank lines
ppc <- ppc %>%
  filter(!str_detect(prideprejudice, pattern = "^\\s*$")) %>%
  select(prideprejudice) %>%
  print()
## # A tibble: 10,721 × 1
## prideprejudice
## <chr>
## 1 "PRIDE AND PREJUDICE"
## 2 "By Jane Austen"
## 3 "Chapter 1"
## 4 "It is a truth universally acknowledged, that a single man in possession"
## 5 "of a good fortune, must be in want of a wife."
## 6 "However little known the feelings or views of such a man may be on his"
## 7 "first entering a neighbourhood, this truth is so well fixed in the minds"
## 8 "of the surrounding families, that he is considered the rightful property"
## 9 "of some one or other of their daughters."
## 10 "\"My dear Mr. Bennet,\" said his lady to him one day, \"have you heard that"
## # ℹ 10,711 more rows
# New column to hold the text with aliases replaced
ppc <- ppc %>%
  mutate(new_col = prideprejudice)
# Replace aliases in file order; fixed = TRUE treats each alias as a literal string
for (i in seq_along(characters$alias)) {
  ppc <- ppc %>%
    mutate(new_col = gsub(pattern = characters$alias[i],
                          replacement = characters$unique_name[i],
                          x = new_col,
                          fixed = TRUE))
}
print(ppc)
## # A tibble: 10,721 × 2
## prideprejudice new_col
## <chr> <chr>
## 1 "PRIDE AND PREJUDICE" "PRIDE…
## 2 "By Jane Austen" "By Ja…
## 3 "Chapter 1" "Chapt…
## 4 "It is a truth universally acknowledged, that a single man in posses… "It is…
## 5 "of a good fortune, must be in want of a wife." "of a …
## 6 "However little known the feelings or views of such a man may be on … "Howev…
## 7 "first entering a neighbourhood, this truth is so well fixed in the … "first…
## 8 "of the surrounding families, that he is considered the rightful pro… "of th…
## 9 "of some one or other of their daughters." "of so…
## 10 "\"My dear Mr. Bennet,\" said his lady to him one day, \"have you he… "\"My …
## # ℹ 10,711 more rows
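Some names came out as “MrMrBingley”, most likely because a later, shorter alias (such as “Bingley”) matched inside an already-substituted “MrBingley”. I’ll patch those up.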
ppc <- ppc %>%
  mutate(new_col = gsub(x = new_col, pattern = "MrMrBingley",
                        replacement = "MrBingley")) %>%
  print()
## # A tibble: 10,721 × 2
## prideprejudice new_col
## <chr> <chr>
## 1 "PRIDE AND PREJUDICE" "PRIDE…
## 2 "By Jane Austen" "By Ja…
## 3 "Chapter 1" "Chapt…
## 4 "It is a truth universally acknowledged, that a single man in posses… "It is…
## 5 "of a good fortune, must be in want of a wife." "of a …
## 6 "However little known the feelings or views of such a man may be on … "Howev…
## 7 "first entering a neighbourhood, this truth is so well fixed in the … "first…
## 8 "of the surrounding families, that he is considered the rightful pro… "of th…
## 9 "of some one or other of their daughters." "of so…
## 10 "\"My dear Mr. Bennet,\" said his lady to him one day, \"have you he… "\"My …
## # ℹ 10,711 more rows
ppc <- ppc %>%
  select(-prideprejudice)
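The “MrMrBingley” artifact can be reproduced on a toy example. The sketch below uses a made-up two-row alias table (not the real CSV) to show how literal replacement in file order can re-match a short alias inside a name that was already substituted, and one possible way to avoid it by requiring word boundaries around each alias. The `\\Q...\\E` quoting relies on `perl = TRUE`.

```r
# Toy alias table (hypothetical rows, not the contents of the real CSV)
aliases <- data.frame(
  alias       = c("Mr. Bingley", "Bingley"),
  unique_name = c("MrBingley", "MrBingley"),
  stringsAsFactors = FALSE
)
text <- "Mr. Bingley arrived. Bingley smiled."

# Literal replacement: the second alias also matches inside "MrBingley"
naive <- text
for (i in seq_along(aliases$alias)) {
  naive <- gsub(aliases$alias[i], aliases$unique_name[i], naive, fixed = TRUE)
}

# Word-boundary replacement: \\Q...\\E quotes the alias literally,
# \\b stops "Bingley" from matching in the middle of "MrBingley"
safe <- text
for (i in seq_along(aliases$alias)) {
  pattern <- paste0("\\b\\Q", aliases$alias[i], "\\E\\b")
  safe <- gsub(pattern, aliases$unique_name[i], safe, perl = TRUE)
}

naive  # "MrMrBingley arrived. MrBingley smiled."
safe   # "MrBingley arrived. MrBingley smiled."
```

One caveat: an alias that ends in punctuation (for example a bare “Mr.”) would need different handling, because `\\b` does not match between a period and a space.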
Following the example in Chapter 4 of the Text Mining with R book, create a new column in the data frame corresponding to the Pride & Prejudice text that divides the text into sections of 10 lines each. Then use the `pairwise_count` function from `widyr` to determine the number of 10-line sections in which each name occurs together with each other name.
ppc2 <- ppc %>%
  mutate(section = row_number() %/% 10) %>%
  filter(section > 0) %>%
  unnest_tokens(words, new_col)
# Lowercase the unique names so they match the lowercased tokens
characters <- characters %>%
  mutate(names = tolower(unique_name))
filtered <- inner_join(x = ppc2, y = characters, by = c("words" = "names"), relationship = "many-to-many") %>%
  select(section, words) %>%
  print()
## # A tibble: 7,929 × 2
## section words
## <dbl> <chr>
## 1 1 mrbennet
## 2 1 mrbennet
## 3 1 mrbennet
## 4 2 mrbingley
## 5 2 mrbingley
## 6 2 mrbingley
## 7 3 mrbennet
## 8 3 mrbingley
## 9 3 mrbingley
## 10 3 mrbingley
## # ℹ 7,919 more rows
pairwise <- filtered %>%
  pairwise_count(words, section, sort = TRUE) %>%
  print()
## # A tibble: 236 × 3
## item1 item2 n
## <chr> <chr> <dbl>
## 1 mrdarcy elizabethbennet 168
## 2 elizabethbennet mrdarcy 168
## 3 janebennet elizabethbennet 135
## 4 elizabethbennet janebennet 135
## 5 elizabethbennet mrbingley 79
## 6 mrbingley elizabethbennet 79
## 7 mrdarcy mrbingley 63
## 8 mrbingley mrdarcy 63
## 9 mrcollins elizabethbennet 55
## 10 elizabethbennet mrcollins 55
## # ℹ 226 more rows
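To make the meaning of `n` concrete: `pairwise_count` counts the number of sections in which both names appear, so repeated mentions within one section count only once. A small sketch on made-up data (the section numbers and names are illustrative only), which also uses `upper = FALSE` so that each unordered pair is returned once instead of in both directions:

```r
library(dplyr)
library(widyr)

# Toy input: mrdarcy is mentioned twice in section 1
toy <- tibble(
  section = c(1, 1, 1, 2, 2, 3),
  words   = c("elizabethbennet", "mrdarcy", "mrdarcy",
              "elizabethbennet", "mrdarcy", "elizabethbennet")
)

# Both names appear together in sections 1 and 2, so n is 2;
# upper = FALSE keeps a single row per unordered pair
toy_pairs <- toy %>%
  pairwise_count(words, section, sort = TRUE, upper = FALSE)

toy_pairs
```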
Create a dataframe of nodes which contains the id and unique names of each character, and create a dataframe of edges which contains three columns: a column named `from`, a column named `to`, and a column named `weight`, where `from` and `to` are the id numbers of each character and `weight` is the number of co-occurrences you found in Problem 2. Each pair should only appear once in the edge list (e.g. Elizabeth and MrDarcy but not MrDarcy and then Elizabeth). Create a tidygraph object using `tbl_graph` that contains the social network data that we just constructed.
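# Keep each unordered pair once; node names are inferred from the from/to columns
edges <- pairwise %>%
  filter(item1 < item2) %>%
  rename(from = item1, to = item2, weight = n)
graph <- tbl_graph(edges = edges, directed = FALSE)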
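The prompt asks for an explicit node table with numeric ids and an edge list that refers to those ids. A minimal sketch of that layout on toy data (the counts below are invented for illustration, and `toy_pairs`, `nodes`, `edges` and `toy_graph` are names chosen for this example):

```r
library(dplyr)
library(tidygraph)

# Toy co-occurrence counts, one row per unordered pair (hypothetical numbers)
toy_pairs <- tibble(
  item1 = c("elizabethbennet", "elizabethbennet", "janebennet"),
  item2 = c("mrdarcy", "janebennet", "mrbingley"),
  n     = c(5, 3, 2)
)

# Node table: one row per character, with an integer id
nodes <- tibble(name = sort(unique(c(toy_pairs$item1, toy_pairs$item2)))) %>%
  mutate(id = row_number()) %>%
  select(id, name)

# Edge table: from/to hold node ids (row positions in nodes)
edges <- toy_pairs %>%
  transmute(from   = match(item1, nodes$name),
            to     = match(item2, nodes$name),
            weight = n)

# Co-occurrence has no direction, so the graph is undirected
toy_graph <- tbl_graph(nodes = nodes, edges = edges, directed = FALSE)
```

Because `from` and `to` are integers here, `tbl_graph` interprets them as row positions in the node table, which is why the ids are assigned with `row_number()`.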
Using `ggraph`, graph the connections between the characters. Make sure that each node is labeled by the character name, and make sure that the weight is represented by the thickness of the edge plotted between the two nodes. Then use the `centrality_degree` function to calculate the weighted degree centrality of each character, and make a plot which shows the degree centrality of each character where the characters are arranged in order of degree centrality.
graph %>%
  ggraph(layout = "fr") +
  # Map co-occurrence weight to edge thickness, as the prompt asks
  geom_edge_link(aes(edge_width = weight),
                 color = "#FA82A7", show.legend = FALSE) +
  geom_node_point(size = 2, color = "#FA82A7") +
  geom_node_text(aes(label = name),
                 repel = TRUE, fontface = "italic",
                 size = 4, color = "#2e0614") +
  theme_void() +
  labs(title = "Character Co-occurrence Network in Pride and Prejudice")
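The second half of the prompt, the ranked degree centrality plot, could be sketched as follows. This is an illustration on a small toy graph with invented weights, not the result for the novel. Weighted degree centrality here is the sum of the weights of the edges attached to each node, computed with `centrality_degree(weights = weight)`.

```r
library(dplyr)
library(ggplot2)
library(tidygraph)

# Toy undirected graph (hypothetical weights)
toy_graph <- tbl_graph(
  nodes = tibble(name = c("elizabethbennet", "janebennet", "mrbingley", "mrdarcy")),
  edges = tibble(from = c(1L, 1L, 2L), to = c(4L, 2L, 3L), weight = c(5, 3, 2)),
  directed = FALSE
)

# Weighted degree: sum of edge weights incident to each node
centrality <- toy_graph %>%
  activate(nodes) %>%
  mutate(degree = centrality_degree(weights = weight)) %>%
  as_tibble() %>%
  arrange(desc(degree))

# Bar chart with characters ordered by centrality
centrality_plot <- ggplot(centrality,
                          aes(x = degree, y = reorder(name, degree))) +
  geom_col(fill = "#FA82A7") +
  labs(title = "Weighted Degree Centrality of Characters",
       x = "Weighted degree centrality", y = NULL)
```

Printing `centrality_plot` draws the chart. Applying the same pipeline to the full graph object would give the ranking for the novel.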