Project Description

Assignment Instruction

In this project you will analyze a bipartite network (select one from SNAP). Read the article “Comparison of methods for the detection of node group membership in bipartite networks” and focus on the concept of node group membership. In class, you will use an R package to detect and analyze nodes. The instructor may either provide you with a dataset to analyze or guide you on how to obtain one from an approved repository.

Among other things, in this project you will:

  • Identify the levels of aggregation.
  • Identify the two sets of nodes.
  • Plot the ties that indicate membership or participation in a set.
  • Interpret the meaning of membership and use the analysis and visuals to substantiate your interpretation.

After you select your networks, prepare a report summarizing and visualizing the useful and scientifically/mathematically interesting information you find.

Data Explanation

What is the Data?

This project uses a Twitter dataset that consists of social circles. The dataset comes from the Stanford Network Analysis Project (SNAP). -https://snap.stanford.edu/index.html

“This dataset consists of ‘circles’ (or ‘lists’) from Twitter. Twitter data was crawled from public sources. The dataset includes node features (profiles), circles, and ego networks.” -https://snap.stanford.edu/data/egonets-Twitter.html

Goals with the Data

My goals for this project are to identify the largest community of nodes and analyze the levels of importance for each node in the network.


Initial Exploration

Two Sets of Nodes

“In the mathematical field of graph theory, a bipartite graph (or bigraph) is a graph whose vertices can be divided into two disjoint and independent sets U and V such that every edge connects a vertex in U to one in V.” -https://en.wikipedia.org/wiki/Bipartite_graph

The two sets of nodes are the source accounts and the target accounts. Every node represents a Twitter user account.
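
Whether an edge list actually satisfies the bipartite definition can be checked directly. The sketch below is only an illustration: the `toy` and `toy2` data frames and their account names are made up and are not taken from the Twitter file. It relies on igraph's `bipartite_mapping()`, which reports whether the vertices can be split into two sets with no edge inside either set.

```r
library(igraph)

# Hypothetical toy edge list: accounts a and b follow accounts x and y.
toy = data.frame(source = c("a", "a", "b"),
                 target = c("x", "y", "y"))
gToy = graph_from_data_frame(toy, directed = FALSE)

# res is TRUE when a two-set split exists; type gives each vertex's set.
bp = bipartite_mapping(gToy)
bp$res
bp$type

# An account that appears in both columns can break the split:
# the extra x - y tie closes the odd cycle a - x - y - a.
toy2 = rbind(toy, data.frame(source = "x", target = "y"))
gToy2 = graph_from_data_frame(toy2, directed = FALSE)
bipartite_mapping(gToy2)$res
```

Running the same check on the graph built from the real edge list would show whether the source/target split holds for this dataset.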

An example of the raw data can be seen below:

# Read the space-separated edge list; each row is one source -> target tie.
tData = read.csv("../twitter/12831.edges", header = FALSE, sep = " ", col.names = c("source", "target"))

tData[1:5,]
##      source   target
## 1 398874773   652193
## 2  18498878 14749606
## 3  14305022  8479062
## 4     22253    12741
## 5  15540222 14809096

The source and target columns contain user IDs that are anonymized for analytical purposes. Each row records one tie: the source account follows the target account.
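
Because every tie points from a follower to the account being followed, counting incoming and outgoing ties gives a first, rough measure of each account's importance. The following is a minimal sketch on a made-up edge list (the `toy` data frame and its account names are hypothetical), using igraph's `degree()` with its `mode` argument.

```r
library(igraph)

# Hypothetical toy edge list: a and b follow c, and c follows a.
toy = data.frame(source = c("a", "b", "c"),
                 target = c("c", "c", "a"))
gToy = graph_from_data_frame(toy, directed = TRUE)

followers = degree(gToy, mode = "in")    # ties pointing at the account
following = degree(gToy, mode = "out")   # ties leaving the account

data.frame(followers, following)
```

In this toy example account c has the most followers, so it would rank highest by in-degree.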

Initial Visualization

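library(igraph)
library(ggnet)    # provides ggnet2()
library(ggplot2)

# Build a directed graph with one edge per source -> target row.
g = graph_from_data_frame(tData)

# Draw the network with small blue nodes and thin grey edges on a dark background.
ggnet2(g, node.size = 3, node.color = "#488AC7", edge.size = 0.5, edge.color = "#B6B6B4") +
  theme(panel.background = element_rect(fill = "grey15"))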

By default, ggnet2 lays out the network with the Fruchterman-Reingold algorithm. This force-directed algorithm pulls connected nodes together and pushes all nodes apart, so nodes with many connections tend to settle near the center of the graph.
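
The relationship between connectivity and position can be inspected numerically. The sketch below uses igraph's `layout_with_fr()` as a stand-in for the layout step (an assumption: it is not necessarily the same implementation that ggnet2 calls internally) on a made-up graph whose `toy` edge list and vertex names are hypothetical. It compares each vertex's degree with its distance from the centroid of the layout.

```r
library(igraph)

set.seed(1)  # the layout starts from random positions; fix the seed for repeatability

# Hypothetical toy graph: hub "h" tied to five accounts, plus one extra tie p - q.
toy = data.frame(source = c("h", "h", "h", "h", "h", "p"),
                 target = c("p", "q", "r", "s", "t", "q"))
gToy = graph_from_data_frame(toy, directed = FALSE)

xy = layout_with_fr(gToy)        # one row of (x, y) coordinates per vertex
center = colMeans(xy)            # centroid of the layout
distFromCenter = sqrt(rowSums(sweep(xy, 2, center)^2))

# Degree next to distance from the centroid, one row per vertex.
data.frame(name = V(gToy)$name,
           degree = degree(gToy),
           dist = round(distFromCenter, 2))
```

The exact coordinates vary with the seed, so the table is best read as a tendency rather than a guarantee.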


Communities in the Network

Group the Largest Community

The plot below highlights the largest clique in the network. As expected, it is located near the center.

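# Color every node blue, then recolor the members of the largest clique(s) gold.
vcol = rep("#488AC7", vcount(g))
vcol[unlist(largest_cliques(g))] = "gold"

# Same plot as before, with the per-node colors applied.
ggnet2(g, node.size = 3, color=vcol, edge.size = 0.5, edge.color = "#B6B6B4") +
  theme(panel.background = element_rect(fill = "grey15"))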

The highlighted clique is a group of Twitter accounts in which every pair of accounts is connected, so it represents a tightly knit social network. It makes sense that the largest clique is located near the center, as it contains the greatest number of connections among its nodes. Its members also have connections to accounts outside the clique, so the largest clique can be thought of as a core of the network as a whole.
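
To make the idea of a clique concrete, the sketch below applies igraph's `clique_num()` and `largest_cliques()` to a small made-up graph (the `toy` edge list and its account names are hypothetical). The toy graph is built as undirected, since a clique is defined by ties between every pair of members regardless of direction.

```r
library(igraph)

# Hypothetical toy graph: a, b, c, d are all mutually connected; e is tied only to d.
toy = data.frame(source = c("a", "a", "a", "b", "b", "c", "d"),
                 target = c("b", "c", "d", "c", "d", "d", "e"))
gToy = graph_from_data_frame(toy, directed = FALSE)

cliqueSize = clique_num(gToy)               # number of members in the largest clique
members = largest_cliques(gToy)[[1]]$name   # names of those members

cliqueSize
members
```

Account e is left out of the clique because it is connected to only one of the other four accounts.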

Eigenvalue Decomposition

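The eigenvalues of a network's adjacency matrix describe the connectivity of the network as a whole. For an undirected network, the largest eigenvalue lies between the average and the maximum number of connections per node, so it grows as the network becomes more densely connected.

The eigenvalues of the adjacency matrix are plotted below.

# Build a symmetric adjacency matrix from the undirected version of the graph.
adjMatr = as.matrix(as_adjacency_matrix(as.undirected(graph_from_data_frame(tData))))
# Plot the eigenvalues, which eigen() returns in decreasing order.
plot(eigen(adjMatr)$values, type="b")
# Dotted reference line at an eigenvalue of 1.
abline(h=1,col="red", lty = 3)

The y-axis represents the eigenvalue, and the x-axis represents its rank, with the largest eigenvalue at position 1. The positions on the x-axis do not correspond to individual nodes.

A high eigenvalue reflects a strongly connected part of the network. The importance of an individual account comes from the eigenvector that belongs to the largest eigenvalue: an account with a large entry in that eigenvector is connected to many accounts that are themselves well connected. This measure is known as eigenvector centrality. An account with a high score is responsible for much of the connectivity within its community in the Twitter network.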
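
A per-node importance score can be read from the eigen-decomposition through eigenvector centrality. The sketch below is an illustration on a made-up graph (the `toy` edge list and its account names are hypothetical). It computes the eigenvector of the largest adjacency eigenvalue by hand and compares it with igraph's `eigen_centrality()`, which by default scales the scores so that the maximum is 1.

```r
library(igraph)

# Hypothetical toy graph: a triangle a - b - c with a tail c - d - e.
toy = data.frame(source = c("a", "a", "b", "c", "d"),
                 target = c("b", "c", "c", "d", "e"))
gToy = graph_from_data_frame(toy, directed = FALSE)
A = as_adjacency_matrix(gToy, sparse = FALSE)

# Eigenvector of the largest eigenvalue, rescaled so that its maximum is 1.
lead = abs(eigen(A)$vectors[, 1])
lead = lead / max(lead)

# igraph's built-in eigenvector centrality, for comparison.
cent = eigen_centrality(gToy)$vector

round(cbind(byHand = lead, igraph = cent), 3)
```

The two columns should agree up to numerical precision, and account c, which sits between the triangle and the tail, receives the highest score.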


References

Bipartite graph. (2018, January 15). Retrieved January 18, 2018, from https://en.wikipedia.org/wiki/Bipartite_graph

Fried, E. (2017, November 28). R tutorial: how to identify communities of items in networks. Retrieved January 18, 2018, from http://psych-networks.com/r-tutorial-identify-communities-items-networks/

Ggnet2: network visualization with ggplot2. (n.d.). Retrieved January 18, 2018, from https://briatte.github.io/ggnet/

Network analysis with R and igraph: NetSci X Tutorial. (n.d.). Retrieved January 18, 2018, from http://kateto.net/networks-r-igraph