Import Pajek (.net) files

Phil Murphy & Brendan Knapp

Pajek and file types
- Brendan's Aside:
- Loading .net files into igraph

knitr::opts_chunk$set(fig.width = 10, fig.height = 8)

First, load the igraph package to work with network data.

library(igraph)

Pajek and file types

The dataset that we're going to be using in this example - Davis' Southern Women - is available in Pajek's .net format. One of the perks of working with R is that you can read and write practically any kind of data file. This includes Pajek files, which are simply text files adhering to a format that Pajek knows how to parse.
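To make the "simply text files" point concrete, here is a minimal sketch that writes a tiny, made-up network in the basic .net layout and reads it back with igraph. The vertex labels and the temporary file are assumptions for illustration only; they are not part of the Southern Women data, and real Pajek files often carry extra fields such as coordinates or weights.

```r
library(igraph)

# A hypothetical three-vertex network in Pajek's .net layout:
# a *Vertices section listing ids and labels, followed by an
# *Edges section listing undirected ties as pairs of vertex ids.
pajek_lines <- c(
  "*Vertices 3",
  '1 "ANN"',
  '2 "BEA"',
  '3 "CAT"',
  "*Edges",
  "1 2",
  "2 3"
)

# Write the lines to a temporary .net file, then read it back.
toy_file <- tempfile(fileext = ".net")
writeLines(pajek_lines, toy_file)
toy <- read_graph(toy_file, format = "pajek")

vcount(toy)       # number of vertices read from the file
ecount(toy)       # number of ties read from the file
is_directed(toy)  # ties listed under *Edges are undirected
```

Opening the temporary file in any text editor shows exactly the lines above, which is all a basic Pajek file is.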

Brendan's Aside:

A few quick notes about Pajek (and network analysis software in general) are worth mentioning if you're looking to explore other platforms. The following adapts some of the points that Benjamin blogged about here.

Negatives
- Pajek is closed source, meaning that the source code is not available to the public.
  - One of the consequences of closed source software is that community-driven efforts to expand and refine its capabilities are inherently limited, if not impossible.
    - This is a major reason that open source programming languages like R and Python have exploded in recent years.
  - One of the advantages of learning to do network analysis in R is that you're also learning to work in one of the major platforms for statistical analysis.
  - R is also a fully-featured development environment with capabilities that are growing at an increasing rate.
    - Why? It's open source and community-driven efforts are constantly augmenting it.
- Pajek accepts a limited number of data formats and writes even fewer.
- Pajek only does network analysis.
- While it's easy to argue that Pajek is simpler to learn than R because it's pointing and clicking, Pajek is less user-friendly than some other GUI programs like UCINET, Gephi, Visone, etc., and its interface can be a headache until you really learn it.
- Pajek is limited in statistical modeling of networks, which is what really sets R and its packages apart from the rest of the herd.
Positives
- While not open source, Pajek is free. We like free.
- Pajek can send data directly to R.
- While not a full programming environment, Pajek can be automated with macro scripts.
- Most importantly, Pajek is fast. RIDICULOUSLY fast.
  - "Pajek eats your so-called 'big data' for breakfast."
    - Thousands of nodes and hundreds of thousands of edges? Piece of cake
      - Pajek-3XL can handle networks beyond TWO BILLION vertices.
- Being able to use Pajek AND R means you can do more than someone who can only use one platform.

Ultimately, the software that you choose should be based on your own personal preferences. If you are likely to use network analysis a lot, or are likely to use it in combination with other analyses, then R is likely a good choice. On the other hand, if you are coding averse or just prefer the comfort of the point-and-click interface, then there are many powerful and reliable programs, such as Pajek and UCINET, that you may find more to your taste.

Loading .net files into igraph

With that, let's read our Pajek file, which uses the extension .net...

g <- read_graph(file.choose(), format = "pajek")  # file.choose() opens a file picker

...and take a look.

g

## IGRAPH 0aaa3be UNW- 32 89 -- 
## + attr: id (v/c), name (v/c), x (v/n), y (v/n), z (v/n), weight
## | (e/n)
## + edges from 0aaa3be (vertex names):
##  [1] EVELYN   --1 EVELYN   --2 EVELYN   --3 EVELYN   --4 EVELYN   --5
##  [6] EVELYN   --6 EVELYN   --8 LAURA    --1 LAURA    --2 LAURA    --3
## [11] LAURA    --5 LAURA    --6 LAURA    --7 THERESA  --2 THERESA  --3
## [16] THERESA  --4 THERESA  --5 THERESA  --6 THERESA  --7 THERESA  --8
## [21] BRENDA   --1 BRENDA   --3 BRENDA   --4 BRENDA   --5 BRENDA   --6
## [26] BRENDA   --7 CHARLOTTE--3 CHARLOTTE--4 CHARLOTTE--5 FRANCES  --3
## [31] FRANCES  --5 FRANCES  --6 ELEANOR  --5 ELEANOR  --6 ELEANOR  --7
## + ... omitted several edges

If the network that you are working with is a one-mode network, then you are essentially done with the data loading process.

Inspecting the igraph object gives us the header UNW-, which tells us that g is undirected, the vertices are named, and that the edges are weighted. We didn't need to tell igraph whether or not the edges should have directions as Pajek files already specify whether they are directed or undirected by labeling the ties as being arcs or edges.

You may already know that the "Southern Women" network is actually a two-mode, or bipartite, network. So, ultimately, we will want the - in UNW- to be replaced by a B for bipartite.
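Each character in that header can also be queried directly. The following is a small sketch on a made-up network (the names and weights are assumptions for illustration, not the Southern Women data) showing which igraph function corresponds to each position in the header.

```r
library(igraph)

# Hypothetical edge list: two people, two events, all ties weighted 1.
toy_edges <- data.frame(
  from   = c("ANN", "ANN", "BEA"),
  to     = c("1", "2", "2"),
  weight = c(1, 1, 1)
)
toy <- graph_from_data_frame(toy_edges, directed = FALSE)

is_directed(toy)   # FALSE, shown as "U" (undirected) in the header
is_named(toy)      # TRUE,  shown as "N" (vertices have a name attribute)
is_weighted(toy)   # TRUE,  shown as "W" (edges have a weight attribute)
is_bipartite(toy)  # FALSE, shown as "-" until a type attribute is added
```

Note that is_bipartite() only checks whether a "type" vertex attribute exists; it does not test the structure of the network.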

Because igraph does not automatically recognize two-mode networks, it is necessary to tell igraph that there are two types of vertices. There are multiple methods for doing this. We cover two options here:

- Using igraph's native bipartite_mapping() function
- Manually telling igraph that it has a two-mode (bipartite) network

Igraph's `bipartite_mapping()` function

Igraph can check whether the network that you have loaded meets the criteria of a two-mode network. Those criteria are that (1) the nodes can be divided into two sets, and (2) ties run only between the two sets and never within either one. That is, there are two sets of entities in the network, and the entities from each set are only connected with one another through the other node set. If the network meets the criteria, igraph will identify which nodes belong in each mode.

To see what the function does, try running it:

bipartite_mapping(g)

## $res
## [1] TRUE
## 
## $type
##  [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [12] FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE  TRUE  TRUE  TRUE
## [23]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE

Igraph returns two responses:

- Whether the network meets the criteria of a two-mode network ($res), and
- Which nodes fall into each mode ($type).
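The two criteria are easiest to see on graphs small enough to check by hand. This sketch uses two made-up ring graphs (an assumption for illustration, unrelated to the Southern Women data): a four-node ring can be split into two alternating sets, while a three-node ring cannot.

```r
library(igraph)

square   <- make_ring(4)  # even cycle: nodes can alternate between two sets
triangle <- make_ring(3)  # odd cycle: some tie must fall within a set

square_map   <- bipartite_mapping(square)
triangle_map <- bipartite_mapping(triangle)

square_map$res    # TRUE: the square meets the two-mode criteria
triangle_map$res  # FALSE: the triangle does not
square_map$type   # one logical value per node, marking its mode
```

When $res is FALSE, the $type values do not describe a valid split, so it is worth checking $res before using them.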

The "type" vertex attribute is what igraph uses to identify the two modes. We can add this to the network fairly easily.

# First, don't take chances.
g2M <- g        # Create a new network so you don't accidentally
                #  overwrite your network with a mistake
V(g2M)$type <- bipartite_mapping(g2M)$type  # Add the "type" attribute
                                            #  to the network.
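It can be reassuring to confirm that the copy picked up the attribute while the original was left alone. The sketch below repeats the same two steps on a made-up network (the names toy and toy2M and the edge list are assumptions for illustration) and then checks the result.

```r
library(igraph)

# Hypothetical two-mode network: two people attending two events.
toy <- graph_from_data_frame(
  data.frame(person = c("ANN", "ANN", "BEA"),
             event  = c("1", "2", "2")),
  directed = FALSE
)

toy2M <- toy                                     # work on a copy
V(toy2M)$type <- bipartite_mapping(toy2M)$type   # add the "type" attribute

is_bipartite(toy)     # FALSE: the original has no type attribute
is_bipartite(toy2M)   # TRUE: the copy now carries one
table(V(toy2M)$type)  # how many vertices fall into each mode
```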

Manual Assignment

Igraph's bipartite_mapping() function is certainly handy, and will likely be all you need in most cases. However, there will also be times when you do not want igraph to decide for you which node belongs in which mode. In those cases, you can assign node classes manually.

Keep in mind that igraph denotes different modes by whether they are TRUE or FALSE. Now, look at the igraph object above. The vertices are identified with names:

- the names of the "Southern Women", and
- the events which they attended, annotated as numbers.

The easiest means of assigning nodes into modes is to work with the data as an edgelist, which will be formatted so that the "Southern Women" are in the first column and the events are in the second column. This will make labeling the type of the vertices relatively simple.

First, we use as_edgelist() to create an edgelist object called g_el.
We're then going to use head() to inspect the edgelist.

`as_edgelist()`

g_el <- as_edgelist(g)

head(g_el)

##      [,1]     [,2]
## [1,] "EVELYN" "1" 
## [2,] "EVELYN" "2" 
## [3,] "EVELYN" "3" 
## [4,] "EVELYN" "4" 
## [5,] "EVELYN" "5" 
## [6,] "EVELYN" "6"

Next, let's check to see what kind of object as_edgelist() returns.

class(g_el)

## [1] "matrix"

To make life simple, let's name the column headers of our edgelist so that they reflect which column refers to a person and which refers to an event.

Since we now know that g_el is a matrix, we can use colnames() to assign headers.

Because colnames() expects one name per column, we use c() to combine "person" and "event" into a single character vector.

colnames(g_el) <- c("person", "event")

Let's take a look at g_el with head().

head(g_el)

##      person   event
## [1,] "EVELYN" "1"  
## [2,] "EVELYN" "2"  
## [3,] "EVELYN" "3"  
## [4,] "EVELYN" "4"  
## [5,] "EVELYN" "5"  
## [6,] "EVELYN" "6"
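Once the columns are named, they can be pulled out by name rather than by position. The following base R sketch uses a small made-up matrix (toy_el is a hypothetical stand-in for g_el) to show the naming step and a by-name lookup. It also uses is.matrix() for the type check, since recent versions of R report more than one class for a matrix.

```r
# Hypothetical edge list: people in column 1, events in column 2.
toy_el <- matrix(c("ANN", "ANN", "BEA",
                   "1",   "2",   "2"),
                 ncol = 2)

is.matrix(toy_el)                        # TRUE: safe to use colnames()

colnames(toy_el) <- c("person", "event") # one name per column

unique(toy_el[, "person"])               # the distinct people
unique(toy_el[, "event"])                # the distinct events
```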

In order to label each vertex's type, we're going to use a logical test that returns TRUE or FALSE based on which column of g_el each vertex appears in. There are a few ways to do this. In this case, we're going to use ifelse() to...

- check if the name of a vertex is %in% the first column of g_el, which we named with the header "person"
- since igraph wants types to be TRUE or FALSE, use those values for our second and third arguments

Note: If you are interested in learning more about logical comparisons in R, visit our helper resources on that topic.
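As a quick illustration of how %in% behaves, here is a base R sketch on a handful of made-up values (the two vectors are assumptions for illustration, borrowing two names and two event numbers from the output above). It also shows that, for this particular use, %in% already returns TRUE or FALSE, so wrapping it in ifelse() makes the intent explicit without changing the result.

```r
# Hypothetical subset of vertex names and of the "person" column.
vertex_names <- c("EVELYN", "LAURA", "1", "2")
people       <- c("EVELYN", "LAURA")

# %in% tests each element on the left for membership on the right.
in_people <- vertex_names %in% people
in_people  # TRUE TRUE FALSE FALSE

# ifelse() with TRUE/FALSE as outcomes gives back the same values.
identical(ifelse(in_people, TRUE, FALSE), in_people)  # TRUE
```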

V(g)$type <- ifelse(V(g)$name %in% g_el[,"person"], TRUE, FALSE)

g

## IGRAPH 0aaa3be UNWB 32 89 -- 
## + attr: id (v/c), name (v/c), x (v/n), y (v/n), z (v/n), type
## | (v/l), weight (e/n)
## + edges from 0aaa3be (vertex names):
##  [1] EVELYN   --1 EVELYN   --2 EVELYN   --3 EVELYN   --4 EVELYN   --5
##  [6] EVELYN   --6 EVELYN   --8 LAURA    --1 LAURA    --2 LAURA    --3
## [11] LAURA    --5 LAURA    --6 LAURA    --7 THERESA  --2 THERESA  --3
## [16] THERESA  --4 THERESA  --5 THERESA  --6 THERESA  --7 THERESA  --8
## [21] BRENDA   --1 BRENDA   --3 BRENDA   --4 BRENDA   --5 BRENDA   --6
## [26] BRENDA   --7 CHARLOTTE--3 CHARLOTTE--4 CHARLOTTE--5 FRANCES  --3
## [31] FRANCES  --5 FRANCES  --6 ELEANOR  --5 ELEANOR  --6 ELEANOR  --7
## + ... omitted several edges

Now igraph knows that our network is bipartite, which we can tell from the B in UNWB.
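Because a manually assigned type is not validated by igraph, it is worth confirming that every tie really does connect the two modes. The sketch below repeats the manual steps on a made-up network (toy, toy_el, and type_by_name are hypothetical names for illustration) and then checks each edge. It assumes, as in the edge list above, that every person appears in the first column.

```r
library(igraph)

# Hypothetical two-mode network: two people attending two events.
toy <- graph_from_data_frame(
  data.frame(person = c("ANN", "ANN", "BEA"),
             event  = c("1", "2", "2")),
  directed = FALSE
)

# Repeat the manual assignment: people are TRUE, events are FALSE.
toy_el <- as_edgelist(toy)
colnames(toy_el) <- c("person", "event")
V(toy)$type <- V(toy)$name %in% toy_el[, "person"]

# Look up each endpoint's type by vertex name and compare the two ends.
type_by_name <- setNames(V(toy)$type, V(toy)$name)
crosses_modes <- type_by_name[toy_el[, "person"]] !=
                 type_by_name[toy_el[, "event"]]

all(crosses_modes)  # TRUE when no tie falls within a single mode
is_bipartite(toy)   # TRUE: the type attribute is now present
```

In the bipartite.mapping() output shown earlier, the first 18 vertices were marked FALSE, whereas this manual approach marks the people as TRUE. Which mode receives which label is a matter of convention, as long as it is applied consistently.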