For this first assignment, you are required to apply the basic functions of the igraph package. The following datasets will be used:
You have to complete the code chunks in this document, analyze the results, extract insights, and answer the short questions. Fill in the attached CSV with your answers: sometimes a number is enough, other times a short sentence. Remember to change the header to include your email.
In your submission, please upload both the R Markdown file and the CSV with the solutions.
In this section, the goal is to load the given datasets, build the graph, and analyze basic metrics. Include the edge or node attributes you consider relevant.
rm(list=ls())
library('igraph')
Attaching package: ‘igraph’
The following objects are masked from ‘package:stats’:
decompose, spectrum
The following object is masked from ‘package:base’:
union
library('ggplot2')
actors_dataset <- read.table("/Users/sauravghoshroy/Downloads/imdb_actors_key.tsv",header = TRUE, sep = '\t')
actors_edges <- read.table("/Users/sauravghoshroy/Downloads/imdb_actor_edges.tsv", header = TRUE, sep = '\t')
g <- graph_from_data_frame(d = actors_edges, directed = FALSE, vertices = actors_dataset)
summary(g)
IGRAPH d5b8d4d UNW- 17577 287074 --
+ attr: name (v/c), movies_95_04 (v/n), main_genre (v/c), genres (v/c), weight (e/n)
sprintf("The vertex attributes considered in graph g are %s",paste(unlist(vertex_attr_names(g)),collapse=","))
[1] "The vertex attributes considered in graph g are name,movies_95_04,main_genre,genres"
sprintf("The edge attribute considered in graph g is %s",paste(unlist(edge_attr_names(g)),collapse=","))
[1] "The edge attribute considered in graph g is weight"
plot(g)
Describe the values provided by the summary function on the graph object. 1) How many nodes are there? 2) How many edges are there?
sprintf("There are %i nodes in the graph",gorder(g))
[1] "There are 17577 nodes in the graph"
sprintf("There are %i edges in the graph",gsize(g))
[1] "There are 287074 edges in the graph"
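The four-character code in the summary header (UNW- above) can be read flag by flag. The following is a minimal sketch on a hypothetical three-node toy graph, not the assignment data; the data frames and their values are assumptions made only to keep the example self-contained.

```r
library(igraph)

# Hypothetical toy data mirroring the structure of the assignment's files
toy_vertices <- data.frame(name = c("A", "B", "C"),
                           movies = c(3, 5, 2),
                           stringsAsFactors = FALSE)
toy_edges <- data.frame(from = c("A", "B"),
                        to = c("B", "C"),
                        weight = c(2, 1),
                        stringsAsFactors = FALSE)
toy <- graph_from_data_frame(d = toy_edges, directed = FALSE, vertices = toy_vertices)

summary(toy)        # the header line carries the same kind of four-character code
is_directed(toy)    # FALSE corresponds to "U" (undirected)
is_named(toy)       # TRUE corresponds to "N" (vertices have a 'name' attribute)
is_weighted(toy)    # TRUE corresponds to "W" (edges have a 'weight' attribute)
is_bipartite(toy)   # FALSE corresponds to "-" (no vertex 'type' attribute)
gorder(toy)         # number of nodes
gsize(toy)          # number of edges
```

The two numbers after the flags in the header are the node and edge counts, which is what gorder() and gsize() return.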
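Analyse the degree distribution. Compute the total degree distribution. 3) What does this distribution look like? 4) What is the maximum degree? 5) What is the minimum degree?
# Cumulative distribution: entry k + 1 is the fraction of nodes with degree at least k
deg_dist <- degree_distribution(g, mode = "total", cumulative = TRUE)
# The first entry corresponds to degree 0, so shift the x-axis by one
plot(seq_along(deg_dist) - 1, deg_dist, col = "red", xlab = "Degree", ylab = "Cumulative distribution")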
sprintf("The maximum degree is %i",max(degree(g)))
[1] "The maximum degree is 784"
sprintf("The minimum degree is %i",min(degree(g)))
[1] "The minimum degree is 1"
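For question 3, it can help to look at the non-cumulative distribution as well and to check how degree_distribution() indexes its result. The sketch below uses a toy star graph (an assumption for illustration, not the actors' network); on the real graph the same calls would be applied to g.

```r
library(igraph)

# Toy star graph: one hub with degree 4 and four leaves with degree 1
star <- make_star(5, mode = "undirected")

# Non-cumulative distribution; entry i holds the share of nodes with degree i - 1
dd <- degree_distribution(star)
k <- seq_along(dd) - 1

# Cross-check against a direct tabulation of the degree vector
table(degree(star)) / gorder(star)

# Log-log plot of the non-zero entries; a roughly straight line hints at a heavy tail
keep <- k > 0 & dd > 0
plot(k[keep], dd[keep], log = "xy", xlab = "Degree", ylab = "Fraction of nodes")
```

Dropping the zero entries before plotting avoids warnings from taking logarithms of zero.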
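6) What is the diameter of the graph? 7) What is the average path length of the graph?
# Note: the graph has a 'weight' edge attribute, and diameter() uses it as edge length by default
graph_diameter <- diameter(g)
# unconnected = TRUE averages only over pairs of nodes that are actually connected
avg_path_length <- mean_distance(g, unconnected = TRUE)
sprintf("The diameter of the graph is %i", graph_diameter)
[1] "The diameter of the graph is 39"
sprintf("The average distance between nodes is %f", avg_path_length)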
[1] "The average distance between nodes is 4.890546"
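Because the graph carries a weight edge attribute, path-based metrics can differ depending on whether weights are treated as edge lengths. The following is a small sketch on a toy path graph with hypothetical weights; it is not the assignment data. It assumes that computing the average from distances() is an acceptable stand-in for mean_distance(), whose handling of weights may depend on the installed igraph version.

```r
library(igraph)

# Toy path graph 1 - 2 - 3 with hypothetical weights (not taken from the dataset)
path <- make_graph(c(1, 2, 2, 3), directed = FALSE)
E(path)$weight <- c(10, 20)

diameter(path)                 # weighted: sums the 'weight' attribute along the path
diameter(path, weights = NA)   # unweighted: counts hops

# Average path length from the distance matrix, with and without weights
d_weighted <- distances(path)
d_hops <- distances(path, weights = NA)
mean(d_weighted[upper.tri(d_weighted)])
mean(d_hops[upper.tri(d_hops)])
```

When reporting a diameter and an average path length side by side, it is worth checking that both were computed with the same treatment of weights.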
Obtain the distribution of the number of movies made by an actor and the number of genres in which an actor starred. It may be useful for analyzing and discussing the results obtained in the following exercises.
Obtain three vectors with the degree, betweenness and closeness for each vertex of the actors’ graph.
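One possible way to obtain both distributions is sketched below on a hypothetical four-row table. The column names movies_95_04 and genres match the vertex attributes listed in the summary, and the comma separator follows the strsplit() calls used later in this document; the actor names and values are invented for illustration.

```r
# Hypothetical toy table; on the real graph use V(g)$movies_95_04 and V(g)$genres
toy_actors <- data.frame(
  name = c("Actor A", "Actor B", "Actor C", "Actor D"),
  movies_95_04 = c(12, 3, 3, 40),
  genres = c("Drama,Comedy", "Action", "Drama,Action,Thriller", "Comedy"),
  stringsAsFactors = FALSE
)

# Number of genres per actor: split the comma-separated string and count the pieces
n_genres <- lengths(strsplit(toy_actors$genres, ","))

# Frequency tables give the two distributions
table(toy_actors$movies_95_04)
table(n_genres)

# Histograms are a quick visual check
hist(toy_actors$movies_95_04, main = "Movies per actor", xlab = "Number of movies")
hist(n_genres, main = "Genres per actor", xlab = "Number of genres")
```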
degree_vector <- degree(g)
betweenness_vector <- betweenness(g,directed=FALSE)
closeness_vector <- closeness(g)
At centrality.c:2617 :closeness centrality is not well-defined for disconnected graphs
head(degree_vector,5)
Rudder, Michael (I) Morgan, Debbi Bellows, Gil Dray, Albert Daly, Shane (I)
36 23 22 23 46
head(betweenness_vector,5)
Rudder, Michael (I) Morgan, Debbi Bellows, Gil Dray, Albert Daly, Shane (I)
3929.012 21716.283 13283.000 111250.619 39968.780
head(closeness_vector,5)
Rudder, Michael (I) Morgan, Debbi Bellows, Gil Dray, Albert Daly, Shane (I)
4.337794e-07 4.367972e-07 4.344953e-07 4.348333e-07 4.354900e-07
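The warning above says that closeness is not well-defined on a disconnected graph, which is a likely reason why the values are so small and so close to each other. A common workaround is to compute closeness only on the largest connected component. The sketch below shows the idea on a toy graph with two components (an assumption for illustration); on the actors' network, g would take the place of toy, and the weight attribute would be used as edge length unless weights = NA is passed.

```r
library(igraph)

# Toy graph with two components: the path 1 - 2 - 3 - 4 and the pair 5 - 6
toy <- make_graph(c(1, 2, 2, 3, 3, 4, 5, 6), directed = FALSE)

# Identify the largest connected component and keep only its vertices
comp <- components(toy)
giant_ids <- which(comp$membership == which.max(comp$csize))
giant <- induced_subgraph(toy, giant_ids)

# Normalized closeness: (n - 1) divided by the sum of distances to all other vertices
cl <- closeness(giant, normalized = TRUE)
cl
```

Normalizing makes the values comparable across graphs of different sizes and easier to read than raw reciprocals of distance sums.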
Obtain the list of the 20 actors with the largest degree centrality. It can be useful to show a list with the degree, the name of the actor, the number of movies, the main genre, and the number of genres in which the actor has participated.
8) Who is the actor with the highest degree centrality? 9) How do you explain the high degree of the top-20 list?
top20_degree <- head(sort(degree_vector,decreasing = TRUE),20)
top_20_degrees <- data.frame(
  'degree' = top20_degree,
  'name' = names(top20_degree),
  'no_movies' = vertex_attr(g, "movies_95_04", index = names(top20_degree)),
  'main_genre' = vertex_attr(g, "main_genre", index = names(top20_degree)),
  'no_of_genres' = lengths(strsplit(vertex_attr(g, "genres", index = names(top20_degree)), ","))
)
top_20_degrees
Obtain the list of the 20 actors with the largest betweenness centrality. Show a list with the betweenness, the name of the actor, the number of movies, the main genre, and the number of genres in which the actor has participated. 10) Who is the actor with the highest betweenness? 11) How do you explain the high betweenness of the top-20 list?
top20_betweenness <- head(sort(betweenness_vector,decreasing = TRUE),20)
top20_betweenness
Jeremy, Ron             8433928
Shahlavi, Darren        4302671
Del Rosario, Monsour    4267096
Chan, Jackie (I)        4148216
Cruz, Penélope          3730691
Depardieu, Gérard       3297737
Arquette, David         2717568
Bachchan, Amitabh       2512696
Soualem, Zinedine       2367794
Kier, Udo               2331680
Del Rio, Olivia         2300975
Pelé                    2263239
Ice-T                   2246827
Corbett, Denis          2138339
Jaenicke, Hannes        2072783
Knaup, Herbert          2050590
Dogg, Snoop             1983749
Deneuve, Catherine      1979450
Jackson, Samuel L.      1958456
Táborský, Miroslav      1901014
top_20_betweenness <- data.frame(
  'betweenness' = top20_betweenness,
  'name' = names(top20_betweenness),
  'no_movies' = vertex_attr(g, "movies_95_04", index = names(top20_betweenness)),
  'main_genre' = vertex_attr(g, "main_genre", index = names(top20_betweenness)),
  'no_of_genres' = lengths(strsplit(vertex_attr(g, "genres", index = names(top20_betweenness)), ","))
)
top_20_betweenness
Obtain the list of the 20 actors with the largest closeness centrality. Show a list with the closeness, the name of the actor, the number of movies, the main genre, and the number of genres in which the actor has participated.
12) Who is the actor with the highest closeness centrality? 13) How do you explain the high closeness of the top-20 list?
top20_closeness <- head(sort(closeness_vector,decreasing = TRUE),20)
top20_closeness
Diaz, Cameron           4.410807e-07
Goldberg, Whoopi        4.410499e-07
Hanks, Tom              4.409749e-07
Berry, Halle            4.409620e-07
Jackson, Samuel L.      4.409037e-07
Douglas, Michael (I)    4.408397e-07
De Niro, Robert         4.407583e-07
Willis, Bruce (I)       4.407255e-07
Hopper, Dennis          4.407032e-07
Washington, Denzel      4.406898e-07
Hoffman, Dustin         4.406478e-07
Kidman, Nicole          4.406459e-07
Travolta, John          4.405315e-07
Stiller, Ben            4.405156e-07
Cruz, Penélope          4.405149e-07
Myers, Mike (I)         4.404999e-07
Woods, James (I)        4.404951e-07
Ford, Harrison (I)      4.404904e-07
Lopez, Jennifer (I)     4.404780e-07
Smith, Will (I)         4.404371e-07
top_20_closeness <- data.frame(
  'closeness' = top20_closeness,
  'name' = names(top20_closeness),
  'no_movies' = vertex_attr(g, "movies_95_04", index = names(top20_closeness)),
  'main_genre' = vertex_attr(g, "main_genre", index = names(top20_closeness)),
  'no_of_genres' = lengths(strsplit(vertex_attr(g, "genres", index = names(top20_closeness)), ","))
)
top_20_closeness
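The three top-20 tables are built the same way, so the construction could be wrapped in a small helper. The function top_n_table() below is hypothetical (it is not part of igraph), and the toy graph and its attribute values are invented; only the attribute names follow the ones used in this document.

```r
library(igraph)

# Hypothetical helper: top-n vertices by a named score vector, joined with vertex attributes
top_n_table <- function(graph, scores, n = 20) {
  top <- head(sort(scores, decreasing = TRUE), n)
  idx <- names(top)
  data.frame(
    score = unname(top),
    name = idx,
    no_movies = vertex_attr(graph, "movies_95_04", index = idx),
    main_genre = vertex_attr(graph, "main_genre", index = idx),
    no_of_genres = lengths(strsplit(vertex_attr(graph, "genres", index = idx), ",")),
    stringsAsFactors = FALSE
  )
}

# Toy named graph with invented attributes, only to exercise the helper
toy_vertices <- data.frame(
  name = c("A", "B", "C", "D"),
  movies_95_04 = c(10, 4, 6, 2),
  main_genre = c("Drama", "Action", "Drama", "Comedy"),
  genres = c("Drama,Comedy", "Action", "Drama,Action", "Comedy"),
  stringsAsFactors = FALSE
)
toy_edges <- data.frame(from = c("A", "A", "A", "B"),
                        to = c("B", "C", "D", "C"),
                        stringsAsFactors = FALSE)
toy <- graph_from_data_frame(toy_edges, directed = FALSE, vertices = toy_vertices)

top2 <- top_n_table(toy, degree(toy), n = 2)
top2
```

With the real graph, calls such as top_n_table(g, degree_vector) or top_n_table(g, betweenness_vector) would reproduce the tables above with a generic score column.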
Explore the Erdős-Rényi model and compare its structural properties to those of real-world networks (actors):
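One way to set up such a comparison is to generate a random graph with the same number of nodes and edges as the observed network and then compare a few structural metrics. The sketch below uses igraph's built-in Zachary karate club graph as a small stand-in for the actors' network (an assumption to keep the example self-contained); on the assignment data, gorder(g) and gsize(g) would be passed to sample_gnm() instead.

```r
library(igraph)

set.seed(42)  # arbitrary seed so the random graph is reproducible

# Stand-in for a real-world network
real <- make_graph("Zachary")

# Erdős-Rényi G(n, m) graph with the same number of nodes and edges
er <- sample_gnm(gorder(real), gsize(real))

# Collect a few structural properties side by side
compare <- data.frame(
  network = c("real", "erdos_renyi"),
  nodes = c(gorder(real), gorder(er)),
  edges = c(gsize(real), gsize(er)),
  max_degree = c(max(degree(real)), max(degree(er))),
  clustering = c(transitivity(real), transitivity(er)),
  avg_path_length = c(mean_distance(real, unconnected = TRUE),
                      mean_distance(er, unconnected = TRUE))
)
compare
```

Real-world social networks often show a higher clustering coefficient and a more skewed degree distribution than a random graph of the same size, so those two columns are usually the most informative to discuss.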
Use any community detection algorithm for the actors’ network and discuss whether the communities found make sense according to the vertex labels.
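A minimal sketch of this workflow follows, assuming the Louvain method (cluster_louvain) as the algorithm of choice; any other igraph clustering function could be substituted. The toy graph consists of two four-node cliques joined by a single edge, and the main_genre labels are invented so that the cross-tabulation has something to compare against.

```r
library(igraph)

set.seed(1)  # Louvain involves random ordering; fix the seed for reproducibility

# Two 4-node cliques joined by one bridging edge
toy <- disjoint_union(make_full_graph(4), make_full_graph(4))
toy <- add_edges(toy, c(1, 5))

# Invented vertex labels standing in for the main_genre attribute
V(toy)$main_genre <- rep(c("Drama", "Action"), each = 4)

# Detect communities and inspect their sizes and modularity
comm <- cluster_louvain(toy)
sizes(comm)
modularity(comm)

# Cross-tabulate community membership against the labels
tab <- table(community = as.vector(membership(comm)), genre = V(toy)$main_genre)
tab
```

If the communities reflect the labels, each row of the table is dominated by a single genre. On the actors' graph, note that cluster_louvain() uses the weight edge attribute by default when one is present.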