MITx: 15.071x The Analytics Edge - Visualizing Network Data

Introduction

The cliché goes that the world is an increasingly interconnected place, and the connections between different entities are often best represented with a graph. Graphs consist of vertices (often also called “nodes”) and edges connecting those nodes. In this assignment, we will learn how to visualize networks using the igraph package in R.

For this assignment, we will visualize social networking data using anonymized data from Facebook; this data was originally curated in a recent paper about computing social circles in social networks. In our visualizations, the vertices in our network will represent Facebook users and the edges will represent these users being Facebook friends with each other.

The first file we will use, edges.csv, contains variables V1 and V2, which label the endpoints of edges in our network. Each row represents a pair of users in our graph who are Facebook friends. For a pair of friends A and B, edges.csv contains only a single row, with the smaller identifier listed first. From this row, we know that A is friends with B and B is friends with A.
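
The single-row convention can be checked or enforced in a few lines of base R. The sketch below uses a small made-up edge list (the identifiers are hypothetical, not taken from edges.csv): it orders each pair so the smaller identifier comes first and then drops repeated rows.

```r
# Toy edge list with hypothetical ids; the pair 3-7 appears in both orders
toy_edges <- data.frame(V1 = c(3, 7, 3, 12), V2 = c(7, 3, 9, 5))

# Put the smaller identifier first in every row
canonical <- data.frame(
  V1 = pmin(toy_edges$V1, toy_edges$V2),
  V2 = pmax(toy_edges$V1, toy_edges$V2)
)

# Keep one row per friendship
canonical <- unique(canonical)
print(canonical)
```

After this step each friendship appears exactly once, which is the form an undirected graph expects.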

The second file, users.csv, contains information about the Facebook users, who are the vertices in our network. This file contains the following variables:

id: A unique identifier for this user; this is the value that appears in the rows of edges.csv
gender: An identifier for the gender of a user taking the values A and B. Because the data is anonymized, we don't know which value refers to males and which value refers to females.
school: An identifier for the school the user attended taking the values A and AB (users with AB attended school A as well as another school B). Because the data is anonymized, we don't know the schools represented by A and B.
locale: An identifier for the locale of the user taking the values A and B. Because the data is anonymized, we don't know which value refers to what locale.

Summarizing the Data

# Load the data sets
edges <- read.csv("edges.csv")
users <- read.csv("users.csv")

# Out of all the students who listed a school, what was the most common
# locale?
table(users$locale, users$school)
##    
##         A AB
##      3  0  0
##   A  6  0  0
##   B 31 17  2
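# Or restrict to users who listed a school. Blank entries are read as empty
# strings, so compare against "" (school != "NA" would keep every row).
userSchool <- subset(users, school != "")
table(userSchool$locale)
# All 19 of these users (17 + 2 in the cross-tabulation above) are from locale B.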
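
The unlabeled row and column in the cross-tabulation come from blank fields, which read.csv keeps as empty strings in text columns instead of converting them to NA. A minimal sketch with made-up values (not the course data) shows why the comparison string matters when filtering:

```r
# Made-up users; "" marks a school that was not listed
toy_users <- data.frame(
  school = c("A", "", "AB", "", "A"),
  locale = c("B", "A", "B", "B", "B"),
  stringsAsFactors = FALSE
)

# Comparing against the string "NA" filters nothing out
kept_all <- subset(toy_users, school != "NA")
nrow(kept_all)

# Comparing against "" keeps only users who listed a school
listed <- subset(toy_users, school != "")
table(listed$locale)
```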

Creating a Network

library(igraph)
g <- graph_from_data_frame(edges, directed = FALSE, vertices = users)
# Plot the network. There are 4 connected components with at least two users
# and 7 users with no friends in the network
plot(g, vertex.size = 5, vertex.label = NA, vertex.shape = "sphere")

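
The counts quoted in the plot comment can also be computed instead of read off the picture. The sketch below assumes igraph's components() function, whose result includes the size of each component in csize. It builds a small made-up graph so that it runs on its own; the same two sums should apply to g.

```r
library(igraph)

# Made-up network: two friend groups plus two isolated users (ids 6 and 7)
toy_edges <- data.frame(V1 = c(1, 1, 4), V2 = c(2, 3, 5))
toy_users <- data.frame(id = 1:7)
toy_g <- graph_from_data_frame(toy_edges, directed = FALSE, vertices = toy_users)

comp <- components(toy_g)
n_groups   <- sum(comp$csize >= 2)   # components with at least two users
n_isolated <- sum(comp$csize == 1)   # users with no friends in the network
```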

# How many users are friends with 10 or more other Facebook users in this
# network?
sum(degree(g) >= 10)
## [1] 9
# What is the average number of friends per user?
mean(degree(g))
## [1] 4.949

Note that in all likelihood these users have many more Facebook friends than this. What we compute here is the average number of friends each user has within this dataset, not the average total number of Facebook friends.
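
Because every edge adds one to the degree of each of its two endpoints, the average degree can be cross-checked from the two data frames alone as 2 * (number of edges) / (number of users). A minimal base R sketch on made-up data:

```r
# Made-up data: 5 users, 4 friendships; user 5 has no friends
toy_users <- data.frame(id = 1:5)
toy_edges <- data.frame(V1 = c(1, 1, 2, 3), V2 = c(2, 3, 3, 4))

# Count how often each id appears as an endpoint; factor() keeps zero counts
deg <- table(factor(c(toy_edges$V1, toy_edges$V2), levels = toy_users$id))

mean_from_counts  <- mean(as.vector(deg))
mean_from_formula <- 2 * nrow(toy_edges) / nrow(toy_users)
```

On the course data the same expression would be 2 * nrow(edges) / nrow(users).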

In a network, it's often visually useful to draw attention to “important” nodes in the network. While this might mean different things in different contexts, in a social network we might consider a user with a large number of friends to be an important user. From the previous problem, we know this is the same as saying that nodes with a high degree are important users.

To visually draw attention to these nodes, we will change the size of the vertices so the vertices with high degrees are larger. To do this, we will change the “size” attribute of the vertices of our graph to be an increasing function of their degrees.

V(g)$size = degree(g)/2 + 2
plot(g, vertex.label = NA)


# Maximum and minimum vertex sizes
max(V(g)$size)
## [1] 11
min(V(g)$size)
## [1] 2
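
Since the size is degree/2 + 2, the reported sizes translate directly back into degrees. The helper names below are hypothetical and only illustrate the arithmetic:

```r
# Same mapping as the one assigned to V(g)$size
size_from_degree <- function(d) d / 2 + 2

# Its inverse: recover the degree from a vertex size
degree_from_size <- function(s) 2 * (s - 2)

# Degrees behind the maximum (11) and minimum (2) sizes
extreme_degrees <- degree_from_size(c(11, 2))
print(extreme_degrees)
```

So the largest vertex belongs to a user with 18 friends in the network, and the smallest vertices are users with none.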

Coloring the Vertices

Thus far, we have changed the “size” attributes of our vertices. However, we can also change the colors of vertices to capture additional information about the Facebook users we are depicting.

When changing the size of nodes, we first obtained the vertices of our graph with V(g) and then accessed the size attribute with V(g)$size. To change the color, we will update the attribute V(g)$color.

To color the vertices based on the gender of the user, we will need access to that variable. When we created our graph g, we provided it with the data frame users, which had variables gender, school, and locale. These are now stored as attributes V(g)$gender, V(g)$school, and V(g)$locale.

We can update the colors by setting the color to black for all vertices, then setting it to red for the vertices with gender A and to gray for the vertices with gender B.

# Color every vertex black, then recolor by gender
V(g)$color = "black"
V(g)$color[V(g)$gender == "A"] = "red"
V(g)$color[V(g)$gender == "B"] = "gray"
# What is the gender of the users with the highest degree in the graph?
plot(g, vertex.label = NA)
# Number of users of each gender
table(V(g)$gender)
## 
##     A  B 
##  2 15 42
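
The table gives overall gender counts. To read the gender of the best-connected users directly, the vertex attributes can be indexed by degree. This is a sketch on a made-up graph that assumes the same attribute name as in users.csv; on the course data the last two lines would use g.

```r
library(igraph)

# Made-up users and friendships; user 2 has the most friends
toy_users <- data.frame(id = 1:5, gender = c("A", "B", "B", "", "A"),
                        stringsAsFactors = FALSE)
toy_edges <- data.frame(V1 = c(1, 2, 2, 2), V2 = c(2, 3, 4, 5))
toy_g <- graph_from_data_frame(toy_edges, directed = FALSE, vertices = toy_users)

# Gender of every vertex whose degree equals the maximum degree
deg <- degree(toy_g)
top_gender <- V(toy_g)$gender[deg == max(deg)]
```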

When the vertices are instead colored by school (red for school A, gray for AB, black for users who listed no school), the two students who attended both schools A and B are colored gray; we can see from the graph that they are Facebook friends (that is, they are connected by an edge). The high-degree users (depicted by the large nodes) are a mixture of red and black, meaning some of these users attended school A and others did not.
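
The code that produced the school coloring is not shown here. A sketch that follows the same pattern as the other coloring steps, with the colors inferred from the description (red for school A, gray for AB, black otherwise), could look like this. It uses a made-up graph so that it runs on its own; on the course data the three assignment lines would be applied to g.

```r
library(igraph)

# Made-up users; "" marks a user who listed no school
toy_users <- data.frame(id = 1:4, school = c("A", "AB", "", "AB"),
                        stringsAsFactors = FALSE)
toy_edges <- data.frame(V1 = c(1, 2), V2 = c(2, 4))
toy_g <- graph_from_data_frame(toy_edges, directed = FALSE, vertices = toy_users)

# Default color first, then overwrite for each school value
V(toy_g)$color <- "black"
V(toy_g)$color[V(toy_g)$school == "A"] <- "red"
V(toy_g)$color[V(toy_g)$school == "AB"] <- "gray"
plot(toy_g, vertex.label = NA)
```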

# Now, color the vertices based on the locale that each user in our network
# is from.
V(g)$color = "black"
V(g)$color[V(g)$locale == "A"] = "red"
V(g)$color[V(g)$locale == "B"] = "gray"
plot(g, vertex.label = NA)


library(rgl)
# Interactive 3-D plot
rglplot(g, vertex.label = NA)
# Static 2-D plot with thicker edges
plot(g, edge.width = 2, vertex.label = NA)
