Building recommendations for the entire dataset of 777 books from a single community breakdown would not be very meaningful, so I decided to choose one subgroup: a specific category of books.
Let’s examine the field that records which other “shelf” (category) a book is often assigned to (popular_shelves.3.name); this is where the “marvel” category is found.
books_net_info_check2 = books_net_info_copy %>%
  group_by(popular_shelves.3.name) %>%
  summarize(number_of_books_per_category = n()) %>%
  arrange(desc(number_of_books_per_category)) %>%
  rename("Category" = popular_shelves.3.name, "N_books" = number_of_books_per_category)
books_net_info_check2
## # A tibble: 41 × 2
## Category N_books
## <chr> <int>
## 1 graphic-novel 232
## 2 fantasy 96
## 3 marvel 52
## 4 horror 46
## 5 graphic-novels 43
## 6 currently-reading 42
## 7 comics 39
## 8 batman 34
## 9 cómics 30
## 10 favorites 28
## # ℹ 31 more rows
There are 52 books in the “marvel” category - a sufficient number to analyze.
# initial graph
comics_marvel = comics_net
# Filtering data in the dataset by "marvel" category
books_net_info_2 = books_net_info_copy %>% filter(popular_shelves.3.name=="marvel")
# Remove unnecessary vertices from the graph
# (relies on the vertices of comics_net being in the same order as the rows of books_net_info_copy)
V(comics_marvel)$category = books_net_info_copy$popular_shelves.3.name
comics_marvel1 = delete.vertices(comics_marvel, V(comics_marvel)$category != "marvel")
# Add ID (relies on the same ordering)
V(comics_marvel1)$book_id = books_net_info_2$book_id
In order to reliably answer the recommendation question, we need to use the betweenness centrality measure.
Betweenness centrality measures how many shortest paths between other pairs of nodes pass through a given node. Here the links between nodes represent “similarity of book scores”, so a node with high betweenness is a book whose score falls between two unrelated or weakly related groups of books, while the lowest values belong to books with the most unusual scores. For example, if one group of books has scores between 3 and 3.2 and another has scores between 3.6 and 3.8, a book with a score of 3.4 will have the highest betweenness among these nodes (provided it is connected to both groups). The larger the groups and the fewer the bridge nodes between them (down to a single one), the higher the betweenness of each bridge node.
Thus, a node with high betweenness signals a comic that can be recommended to more than one group, whereas a node with low betweenness signals a comic that can be recommended to at most one group.
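The bridge idea can be checked on a small toy graph. This is only a sketch with made-up data: the vertex names and the two triangles are assumptions for illustration, not books from the dataset.

```r
library(igraph)

# Toy graph (hypothetical data): two triangles of books with similar scores,
# connected to each other only through the bridge book "G".
toy_edges <- matrix(c(
  "A", "B",  "B", "C",  "A", "C",   # group with scores around 3.0-3.2
  "D", "E",  "E", "F",  "D", "F",   # group with scores around 3.6-3.8
  "C", "G",  "G", "D"               # "G" (score 3.4) links the two groups
), ncol = 2, byrow = TRUE)
toy_graph <- graph_from_edgelist(toy_edges, directed = FALSE)

# Betweenness per vertex, largest first
toy_betw <- betweenness(toy_graph)
sort(toy_betw, decreasing = TRUE)
```

Every shortest path between the two triangles has to pass through "G", so it gets the largest value (3 x 3 = 9 pairs), while the books that only touch their own triangle get zero.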
marvel_betw = books_net_info_2 %>%
  transmute(book_id,
            betw = betweenness(comics_marvel1)) %>%
  arrange(desc(betw)) %>%
  rename("Book ID" = book_id, "Betweenness" = betw)
head(marvel_betw)
## Book ID Betweenness
## 1: 4645370 119.0
## 2: 23017961 89.0
## 3: 25066770 87.0
## 4: 17899546 60.0
## 5: 59962 48.5
## 6: 31981 18.0
plot(comics_marvel1,
vertex.size=0.2*betweenness(comics_marvel1),
vertex.label = V(comics_marvel1)$book_id,
vertex.label.cex = 0.8)
Only 9 of the 52 comics have non-zero betweenness. This means that the graph is extremely heterogeneous: a few books act as bridges, while the vast majority lie on no shortest path at all.
We can also use closeness centrality, which indicates how close a node is to all the other nodes, so the number of steps needed to get from one node to another plays the main role here. In this case closeness depends on how common a book's rating is. Books with the most common ratings in the network have the highest closeness, because many other nodes can be reached from them in few steps. Books with the rarest ratings have the lowest closeness, because the paths from them to most other nodes are the longest.
Accordingly, the higher the closeness, the more books can be recommended; the lower the closeness, the narrower the circle of possible recommendations.
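A small chain of nodes illustrates how closeness rewards central positions. Again this is a sketch on hypothetical data; the vertex names are invented for the example.

```r
library(igraph)

# Toy chain (hypothetical data): five books, each linked only to its neighbours
# A - B - C - D - E
toy_chain <- graph_from_edgelist(
  cbind(c("A", "B", "C", "D"), c("B", "C", "D", "E")),
  directed = FALSE
)

# Normalized closeness: (n - 1) divided by the sum of distances to all other nodes
toy_clo <- closeness(toy_chain, normalized = TRUE)
sort(toy_clo, decreasing = TRUE)
```

The middle book "C" is at most two steps from everything and comes out on top, while the end books "A" and "E" score lowest. One caveat when reading the real table: as far as I understand, recent igraph versions compute closeness only over the vertices reachable from a node, so a book in a tiny connected component can reach a normalized closeness of 1 without being close to most of the network.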
options(scipen = 9999)
marvel_betw_clo = books_net_info_2 %>%
  transmute(book_id,
            closeness = closeness(comics_marvel1, normalized = TRUE),
            betw = betweenness(comics_marvel1)) %>%
  arrange(desc(closeness)) %>%
  rename("Book ID" = book_id, "Betweenness" = betw)
head(marvel_betw_clo, 15)
## Book ID closeness Betweenness
## 1: 17251115 1.0000000 2
## 2: 211461 1.0000000 0
## 3: 17182373 1.0000000 0
## 4: 9293295 1.0000000 0
## 5: 485381 1.0000000 0
## 6: 17277815 1.0000000 1
## 7: 207585 1.0000000 0
## 8: 105973 1.0000000 0
## 9: 17251114 0.7500000 0
## 10: 25066773 0.7500000 0
## 11: 18478257 0.6666667 0
## 12: 105925 0.6666667 0
## 13: 23018001 0.6000000 0
## 14: 4645370 0.4871795 119
## 15: 23017961 0.4222222 89
19 of the 52 vertices are isolated, which is why many books have no closeness value.
vertex_size = 10*closeness(comics_marvel1, normalized = TRUE)
# Isolated vertices have no closeness value, so draw them with size 0
vertex_size[is.na(vertex_size)] = 0
plot(comics_marvel1,
vertex.size= vertex_size,
vertex.label = V(comics_marvel1)$book_id,
vertex.label.cex = 0.8)
There are some groups of nodes with high closeness, but most nodes lie in a rather narrow range of values, from 0.2435897 to 0.3877551, which indicates that the nodes are not very close to each other.
Partitioning the graph into communities makes it easy to decide what to recommend: if a person liked book A, we can recommend the other books from the community that contains book A. To choose a partitioning method, let’s compare the modularity that different methods achieve on this graph.
Walktrap modularity
wt <- walktrap.community(comics_marvel1)
modularity(wt)
## [1] 0.7288781
Fast Greedy modularity
fg <- fastgreedy.community(comics_marvel1)
modularity(fg)
## [1] 0.7288781
Edge Betweenness modularity
eb = edge.betweenness.community(comics_marvel1)
modularity(eb)
## [1] 0.7288781
Modularity is identical (0.7288781) for all three methods, so it gives no reason to prefer one over another; we use Walktrap.
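The same comparison can be written more compactly by collecting the partitions in a list. The sketch below runs on a hypothetical toy graph (two 4-cliques joined by one edge) and uses the current igraph names cluster_walktrap(), cluster_fast_greedy() and cluster_edge_betweenness() for the functions called above.

```r
library(igraph)

# Toy graph (hypothetical data): two 4-cliques joined by a single edge
toy_graph <- disjoint_union(make_full_graph(4), make_full_graph(4))
toy_graph <- add_edges(toy_graph, c(4, 5))

# Run the three partitioning methods compared above
toy_partitions <- list(
  walktrap         = cluster_walktrap(toy_graph),
  fast_greedy      = cluster_fast_greedy(toy_graph),
  edge_betweenness = cluster_edge_betweenness(toy_graph)
)

# One modularity score per method; higher means a cleaner split
toy_modularity <- sapply(toy_partitions, modularity)
toy_modularity
```

When the scores coincide, as they do for the “marvel” graph, modularity alone cannot separate the methods, and comparing the memberships themselves (for example with compare() from igraph) may be a more informative check.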
set.seed(12346)
plot(wt, comics_marvel1,
vertex.label = books_net_info_2$title_without_series,
vertex.label.cex = 0.9,
vertex.color = membership(wt),
vertex.size=0.2*betweenness(comics_marvel1),
edge.width=0.0001)
The Walktrap method produced 27 communities. Only 8 of them contain several linked nodes; the remaining 19 consist of a single node each, so these books cannot be recommended on the basis of similar user ratings, whichever other book from this graph a person liked. Only 3 books (“The Invincible Iron Man, Volume 1: The Five Nightmares”, “Storm, Vol. 1: Make it Rain”, “Hawkeye, Volume 5: All-New Hawkeye”) can be recommended in more than one community based on similarity of user ratings. The other recommendation relationships are shown in the plot above.
Similarity of user ratings is not a very good criterion for selecting recommendations among comics in the “marvel” category: there are too few links between nodes in the graph (in other words, user ratings vary too much).
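The rule “recommend the other books from the same community” can be sketched as a small helper. The function recommend_from_community() and the toy data are hypothetical, written only for illustration; the helper also assumes that the graph vertices carry a name attribute.

```r
library(igraph)

# Hypothetical helper (not part of igraph): the other books that share a
# community with `liked`; empty if the book is unknown or sits alone.
recommend_from_community <- function(graph, partition, liked) {
  book_names <- V(graph)$name
  idx <- match(liked, book_names)
  if (is.na(idx)) return(character(0))
  member <- as.integer(membership(partition))
  setdiff(book_names[member == member[idx]], liked)
}

# Toy data (hypothetical): two triangles of similar books and one isolated book "Z"
toy_books <- data.frame(name = c("A", "B", "C", "D", "E", "F", "Z"))
toy_links <- data.frame(from = c("A", "B", "A", "D", "E", "D"),
                        to   = c("B", "C", "C", "E", "F", "F"))
toy_graph <- graph_from_data_frame(toy_links, directed = FALSE, vertices = toy_books)
toy_wt <- cluster_walktrap(toy_graph)

recommend_from_community(toy_graph, toy_wt, "A")  # books sharing a community with "A"
recommend_from_community(toy_graph, toy_wt, "Z")  # isolated book: nothing to recommend
```

An isolated book forms a community of its own, so the helper returns an empty vector for it, which mirrors the 19 single-node communities described above.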